Solution: Find Median from Data Stream

Let's solve the Find Median from Data Stream problem using the Heaps pattern.

Statement

Design a data structure that stores a dynamically changing list of integers and can find the median in constant time, O(1)O(1), as the list grows. To do that, implement a class named MedianOfStream with the following functionality:

  • Constructor(): Initializes an instance of the class.

  • insertNum(int num): Adds a new integer num to the data structure.

  • findMedian(): Returns the median of all integers added so far.

Note: The median is the middle value in a sorted list of integers.

  • For an odd-sized list (e.g., [4,5,6][4, 5, 6]), the median is the middle element: 55.

  • For an even-sized list (e.g., [2,4,6,8][2, 4, 6, 8]), the median is the average of the two middle elements: (4+6)/2=5(4 + 6) / 2 = 5.

Constraints:

  • −105≤-10^5 \leq num ≤105\leq 10^5, where num is an integer received from the data stream.

  • There will be at least one element in the data structure before the median is computed.

  • At most, 500500 calls will be made to the function that calculates the median.

Solution

So far, you have probably brainstormed some approaches and have an idea of how to solve this problem. Let’s explore some of these approaches and figure out which one to follow based on considerations such as time complexity and any implementation constraints.

Naive approach

The naive solution involves sorting the data after each insertion and then finding the median. An algorithm like insertion sort can be used to insert each new number into the correct position in the sorted sequence. This ensures that every time a new number is added, the data before it is already sorted, allowing the insertion to happen at the correct index.

However, one major drawback of this approach is that it requires shifting all elements greater than the inserted number one step forward, which can be inefficient when dealing with large datasets.

While it may seem intuitive to use a linked list for dynamic data, since a linked list does not require shifting elements when adding data between them, we would still prefer an array-based structure in this case. This is because arrays provide faster access to indexed elements, making it easier and more suitable to find the median once the data is sorted.

Each insertion in this approach has a time complexity of O(n)O(n) due to the shifting behavior of the insertion sort, resulting in an overall time complexity of O(n²)O(n²), where n is the number of elements in the data stream. The time complexity of the function that calculates the median would be O(1)O(1), assuming the data is stored in an array. The space complexity of this approach is O(n)O(n) since we need to store the entire data set.

Optimized approach using union find

The most suitable data structure for repeatedly finding the smallest or largest number in a changing list is a heap. To manage the dynamically growing data stream and to calculate the median, we use two heaps: a max-heap to store the smaller half of the numbers and a min-heap to store the larger half.

The process works by maintaining these two heaps:

  • The max-heap holds the smaller half of the numbers stream and allows quick access to the largest of the smaller numbers.

  • The min-heap holds the larger half of the numbers stream and gives quick access to the smallest of the larger numbers.

At the start, the first element of the stream is considered the tentative median. As more elements arrive, they are inserted into either the max-heap or min-heap. If the new number is smaller than or equal to the current median, it is added to the max-heap; otherwise, it is added to the min-heap.

To maintain balance between the two heaps:

  • If the max-heap contains more than one element compared to the min-heap, we move the root of the max-heap to the min-heap.

  • If the min-heap contains more elements than the max-heap, we move the root of the min-heap to the max-heap.

This rebalancing step ensures that the two heaps are either of equal size or the max-heap has one extra element. This allows us to quickly calculate the median:

  • If the total number of elements is odd, the median is the root of the max-heap.

  • If the total number of elements is even, the median is the average of the roots of both heaps.

Now, let’s look at the solution steps below:

  1. Constructor():

    1. Initializes two heaps:

      1. maxHeapForSmallnum (holds the smaller half of numbers).

      2. minHeapForLargenum (holds the larger half of numbers).

  2. insertNum(int num):

    1. If maxHeapForSmallnum is empty or the root of maxHeapForSmallnum is greater than or equal to num, push num into maxHeapForSmallnum.

    2. Else if num is larger than the root of maxHeapForSmallnum, push num into minHeapForLargenum.

    3. If the size of maxHeapForSmallnum is greater than minHeapForLargenum + 1, push the root of maxHeapForSmallnum to minHeapForLargenum.

    4. Else if the size of minHeapForLargenum is greater than maxHeapForSmallnum, push the root of minHeapForLargenum to maxHeapForSmallnum to maintain balance.

  3. findMedian():

    1. If the sizes of maxHeapForSmallnum and minHeapForLargenum are equal, return the average of the roots of both heaps.

    2. If the sizes of the heaps are not equal, return the root of maxHeapForSmallnum.

Level up your interview prep. Join Educative to access 80+ hands-on prep courses.