FindMedianFromDataStream [source code]


public class FindMedianFromDataStream {
static
/******************************************************************************/
class MedianFinder {
    PriorityQueue<Integer> left, right;

    /** initialize your data structure here. */
    public MedianFinder() {
        left = new PriorityQueue<> ((a, b) -> b - a);
        right = new PriorityQueue<> ();
    }

    public void addNum(int num) {
        if (left.isEmpty () || num <= left.peek ()) {
            left.offer (num);
        } else {
            right.offer (num);
        }
        while (left.size () - right.size () > 1) {
            right.offer (left.poll ());
        }
        while (right.size () - left.size () > 1) {
            left.offer (right.poll ());
        }
    }

    public double findMedian() {
        int left_len = left.size (), right_len = right.size ();
        if (((left_len + right_len) & 1) == 1) {
            return (double) (left_len > right_len ? left.peek () : right.peek ());
        } else {
            return (left.peek () + right.peek ()) / 2.0;
        }
    }
}
/******************************************************************************/
/**
 * Your MedianFinder object will be instantiated and called as such:
 * MedianFinder obj = new MedianFinder();
 * obj.addNum(num);
 * double param_2 = obj.findMedian();
 */

    public static void main(String[] args) {
        FindMedianFromDataStream.MedianFinder tester = new FindMedianFromDataStream.MedianFinder();
        tester.addNum(1);
        tester.addNum(2);
        System.out.println(tester.findMedian());    // 1.5
        tester.addNum(3);
        System.out.println(tester.findMedian());    // 2.0
    }
}

A few days ago I did a hard median problem, and it seems I've already forgotten most of it. One trick to remember: a median problem is usually best treated as an equal-partition problem. The left half should have length ((N + 1) / 2); once that partition is established, computing the median itself still depends on the parity of N, but that part is straightforward.

Thinking about it, a split approach seems workable: one half on the left and one half on the right, maintaining their lengths. One question: if several consecutive adds all land on the left, a lot of elements get squeezed out of the left half, so what then? That's why left and right each need to be maintained as a PriorityQueue. But then adding N numbers still feels like O(N log N), so compared with a resizable array that you sort yourself, what's the gain? The total complexity looks about the same at best. Of course, with the array approach, even ignoring resizing, every add does a binary search to find the position and then a shift; that shift, while O(N), is no small cost.

Let me try writing it with the two-halves approach first; if the complexity doesn't pass, I'll rethink.

I ended up writing exactly the code above; after fixing a few small bugs it was accepted directly, at 247 ms (46%), which is decent. The partition pattern really does fit median problems.

While writing this I noticed something about empty PriorityQueues. I had assumed an empty queue simply can't be peeked, but in fact peek() on an empty PriorityQueue returns null rather than throwing (it's element() that throws); the real trap is that a comparison like num <= left.peek() then blows up with a NullPointerException when the null is auto-unboxed. Either way, the isEmpty() guard in addNum is required. (That an empty queue has nothing to poll is obvious enough.)
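A standalone sketch (my own demo class, not part of the solution) confirming that behavior:

```java
import java.util.PriorityQueue;

public class PeekOnEmptyDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<>();
        // peek() on an empty queue returns null rather than throwing
        System.out.println(pq.peek() == null);      // true
        int num = 5;
        boolean threw = false;
        try {
            // the comparison auto-unboxes the null and throws NPE
            if (num <= pq.peek()) pq.offer(num);
        } catch (NullPointerException e) {
            threw = true;
        }
        System.out.println(threw);                  // true
    }
}
```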

Also, I once again wrote something like return (double) ((left.peek () + right.peek ()) / 2). Don't be silly: that's not how floating-point arithmetic works in Java. The integer division runs first, so what needs the cast (or a double literal) is an operand, not the result.
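A tiny standalone demo of the difference (hypothetical class name):

```java
public class CastDemo {
    public static void main(String[] args) {
        int a = 1, b = 2;
        double wrong = (double) ((a + b) / 2);  // integer division runs first: 1.0
        double right = (a + b) / 2.0;           // operand promoted to double: 1.5
        System.out.println(wrong + " " + right); // 1.0 1.5
    }
}
```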

One last implementation thought: there's no need to insist, as in MedianOfTwoSorted, that left's length never be smaller than right's. For this problem it's enough to keep the two lengths within 1 of each other; at query time, whichever side turns out longer, just compute the median accordingly.

Also, a condensed version:

class MedianFinder {  
    PriorityQueue<Integer> left, right;  

    public MedianFinder() {  
        left = new PriorityQueue<> ((a, b) -> b - a);  
        right = new PriorityQueue<> ();  
    }  

    public void addNum(int num) {  
        if (left.isEmpty () || num <= left.peek ()) left.offer (num);  
        else right.offer (num);  
        if (left.size () - right.size () > 1) right.offer (left.poll ());  
        if (right.size () - left.size () > 1) left.offer (right.poll ());  
    }  

    public double findMedian() {  
        if (left.size () != right.size ()) return (double) (left.size () > right.size () ? left.peek () : right.peek ());  
        else return (left.peek () + right.peek ()) / 2.0;  
    }  
}

Note that in addNum, under this problem's invariant, those two while loops can be turned into ifs (each add changes a size by at most one). Still, much like the concepts from OS class, there's no harm in habitually writing while here.
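For convenience, here is the condensed version wrapped into a self-contained, runnable harness (the class is renamed here only to keep the sketch standalone; the logic is unchanged):

```java
import java.util.PriorityQueue;

public class MedianFinderDemo {
    // max-heap for the lower half, min-heap for the upper half
    private final PriorityQueue<Integer> left = new PriorityQueue<>((a, b) -> b - a);
    private final PriorityQueue<Integer> right = new PriorityQueue<>();

    public void addNum(int num) {
        if (left.isEmpty() || num <= left.peek()) left.offer(num);
        else right.offer(num);
        // keep the size difference within 1; at most one element moves per add
        if (left.size() - right.size() > 1) right.offer(left.poll());
        if (right.size() - left.size() > 1) left.offer(right.poll());
    }

    public double findMedian() {
        if (left.size() != right.size())
            return left.size() > right.size() ? left.peek() : right.peek();
        return (left.peek() + right.peek()) / 2.0;
    }

    public static void main(String[] args) {
        MedianFinderDemo mf = new MedianFinderDemo();
        mf.addNum(2); mf.addNum(3); mf.addNum(4);
        System.out.println(mf.findMedian());   // 3.0
        mf.addNum(5);
        System.out.println(mf.findMedian());   // 3.5
    }
}
```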


editorial

Approach #1 Simple Sorting [Time Limit Exceeded]

Intuition

Do what the question says.

Algorithm

Store the numbers in a resize-able container. Every time you need to output the median, sort the container and output the median.

class MedianFinder {  
    vector<double> store;  

public:  
    // Adds a number into the data structure.  
    void addNum(int num)  
    {  
        store.push_back(num);  
    }  

    // Returns the median of current data stream  
    double findMedian()  
    {  
        sort(store.begin(), store.end());  

        int n = store.size();  
        return (n & 1 ? store[n / 2] : (store[n / 2 - 1] + store[n / 2]) * 0.5);  
    }  
};

So if you actually used this method, there'd be no need to implement a resizable array yourself: in Java you can use a List and Collections.sort. add is O(1), but each query is O(N log N).
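A sketch of what that would look like in Java (my own translation of the C++ above, not the editorial's code; hypothetical class name):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SortOnQueryMedianFinder {
    private final List<Integer> store = new ArrayList<>();

    public void addNum(int num) {
        store.add(num);                 // O(1) amortized append
    }

    public double findMedian() {
        Collections.sort(store);        // O(N log N) on every query
        int n = store.size();
        return (n & 1) == 1 ? store.get(n / 2)
                            : (store.get(n / 2 - 1) + store.get(n / 2)) / 2.0;
    }
}
```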

Approach #2 Insertion Sort [Time Limit Exceeded]

Intuition

Keeping our input container always sorted (i.e. maintaining the sorted nature of the container as an invariant).

Algorithm

Which algorithm allows a number to be added to a sorted list of numbers and yet keeps the entire list sorted? Well, for one, insertion sort!

We assume that the current list is already sorted. When a new number comes, we have to add it to the list while maintaining the sorted nature of the list. This is achieved easily by finding the correct place to insert the incoming number, using a binary search (remember, the list is always sorted). Once the position is found, we need to shift all higher elements by one space to make room for the incoming number.

This method works well when the number of insertion queries is smaller than or about the same as the number of median-finding queries.

For design questions like this, do what the editorial does here: explain under which workloads your design has the advantage.

class MedianFinder {  
    vector<int> store; // resize-able container  

public:  
    // Adds a number into the data structure.  
    void addNum(int num)  
    {  
        if (store.empty())  
            store.push_back(num);  
        else  
            store.insert(lower_bound(store.begin(), store.end(), num), num);     // binary search and insertion combined  
    }  

    // Returns the median of current data stream  
    double findMedian()  
    {  
        int n = store.size();  
        return n & 1 ? store[n / 2] : (store[n / 2 - 1] + store[n / 2]) * 0.5;  
    }  
};

add is O(N), because the shift eats up the time; that is this method's drawback. "Insertion sort" is really just a catchy label; the core of the idea is to exploit binary search.
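In Java the same idea can be sketched with Collections.binarySearch, which returns -(insertionPoint) - 1 when the key is absent; the O(N) cost hides inside List.add's element shift (my own sketch, hypothetical class name):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InsertionMedianFinder {
    private final List<Integer> store = new ArrayList<>();  // kept sorted

    public void addNum(int num) {
        int i = Collections.binarySearch(store, num);       // O(log N) search
        if (i < 0) i = -i - 1;                              // decode the insertion point
        store.add(i, num);                                  // O(N) shift to make room
    }

    public double findMedian() {
        int n = store.size();
        return (n & 1) == 1 ? store.get(n / 2)
                            : (store.get(n / 2 - 1) + store.get(n / 2)) / 2.0;
    }
}
```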

Approach #3 Two Heaps! [Accepted]

Intuition

The above two approaches gave us some valuable insights on how to tackle this problem. Concretely, one can infer two things:

  1. If we could maintain direct access to median elements at all times, then finding the median would take a constant amount of time.
  2. If we could find a reasonably fast way of adding numbers to our containers, additional penalties incurred could be lessened.

This summary may look long-winded now, but in #4 below it becomes strikingly clear why the editorial frames things this way.

But perhaps the most important insight, which is not readily observable, is the fact that we only need a consistent way to access the median elements. Keeping the entire input sorted is not a requirement.

If you handle a median problem by keeping everything sorted, the resulting algorithm is usually overkill.

Well, if only there were a data structure which could handle our needs.
As it turns out there are two data structures for the job:

  • Heaps (or Priority Queues)
  • Self-balancing Binary Search Trees (we'll talk more about them in Approach #4)

Heaps are a natural ingredient for this dish! Adding an element to them takes logarithmic time. They also give direct access to the maximal/minimal elements in a group.

If we could maintain two heaps in the following way:

  • A max-heap to store the smaller half of the input numbers
  • A min-heap to store the larger half of the input numbers

This gives access to median values in the input: they comprise the top of the heaps!

Wait, what? How?

If the following conditions are met:

  1. Both the heaps are balanced (or nearly balanced)
  2. The max-heap contains all the smaller numbers while the min-heap contains all the larger numbers

then we can say that:

  1. All the numbers in the max-heap are smaller or equal to the top element of the max-heap (let's call it x)
  2. All the numbers in the min-heap are larger or equal to the top element of the min-heap (let's call it y)

Then x and/or y are smaller than (or equal to) almost half of the elements and larger than (or equal to) the other half. That is the definition of median elements.

This leads us to a huge point of pain in this approach: balancing the two heaps!

...

class MedianFinder {  
    priority_queue<int> lo;                              // max heap  
    priority_queue<int, vector<int>, greater<int>> hi;   // min heap  

public:  
    // Adds a number into the data structure.  
    void addNum(int num)  
    {  
        lo.push(num);                                    // Add to max heap  

        hi.push(lo.top());                               // balancing step  
        lo.pop();  

        if (lo.size() < hi.size()) {                     // maintain size property  
            lo.push(hi.top());  
            hi.pop();  
        }  
    }  

    // Returns the median of current data stream  
    double findMedian()  
    {  
        return lo.size() > hi.size() ? (double) lo.top() : (lo.top() + hi.top()) * 0.5;  
    }  
};

So here you can see that maintaining the invariant "left is longer than right by 0 or 1" has a cost: this implementation's addNum never compares num against left_max and right_min at all; it always pushes into left first and then rebalances. In my view that wastes some time.

At worst, there are three heap insertions and two heap deletions from the top.

So add is 5 log N. My earlier analysis was a bit off: while this method costs O(N log N) to add N numbers, each single add is only O(log N), whereas with the sorting approaches a single operation can reach O(N) or even O(N log N). So this method is still much faster.
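The editorial's push-then-rebalance addNum, translated to Java as a sketch (hypothetical class name; lo is the max-heap, hi the min-heap, as above):

```java
import java.util.Collections;
import java.util.PriorityQueue;

public class EditorialMedianFinder {
    private final PriorityQueue<Integer> lo = new PriorityQueue<>(Collections.reverseOrder()); // max-heap
    private final PriorityQueue<Integer> hi = new PriorityQueue<>();                           // min-heap

    public void addNum(int num) {
        lo.offer(num);             // always push into the max-heap first
        hi.offer(lo.poll());       // move its max over: hi now holds the larger half
        if (lo.size() < hi.size()) // restore the size property: lo may not be shorter
            lo.offer(hi.poll());
    }

    public double findMedian() {
        return lo.size() > hi.size() ? lo.peek()
                                     : (lo.peek() + hi.peek()) / 2.0;
    }
}
```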

Approach #4 Multiset and Two Pointers [Accepted]

Intuition

Self-balancing Binary Search Trees (like an AVL Tree) have some very interesting properties. They maintain the tree's height to a logarithmic bound. Thus inserting a new element has reasonably good time performance. The median always winds up in the root of the tree and/or one of its children. Solving this problem using the same approach as Approach #3 but using a Self-balancing BST seems like a good choice. Except the fact that implementing such a tree is not trivial and prone to errors.

This is also the first time I've seen a practical application scenario for self-balancing trees. Presumably a real interview wouldn't ask you to implement a self-balancing tree? Surely not.

Why reinvent the wheel? Most languages implement a multiset class which emulates such behavior. The only problem remains keeping track of the median elements. That is easily solved with pointers!

We maintain two pointers: one for the lower median element and the other for the higher median element. When the total number of elements is odd, both the pointers point to the same median element (since there is only one median in this case). When the number of elements is even, the pointers point to two consecutive elements, whose mean is the representative median of the input.

...

If num is equal to the current median element, then the action taken is dependent on how num is inserted into data. NOTE: In our given C++ code example, std::multiset::insert inserts an element after all elements of equal value. Hence we increment hi_median.

The algorithm looks trivial, as if it merely borrows the heavy weaponry of a more advanced structure; but the two-pointer bookkeeping is actually quite interesting, and worth reading in the article itself.

class MedianFinder {  
    multiset<int> data;  
    multiset<int>::iterator lo_median, hi_median;  

public:  
    MedianFinder()  
        : lo_median(data.end())  
        , hi_median(data.end())  
    {  
    }  

    void addNum(int num)  
    {  
        const size_t n = data.size();   // store previous size  

        data.insert(num);               // insert into multiset  

        if (!n) {  
            // no elements before, one element now  
            lo_median = hi_median = data.begin();  
        }  
        else if (n & 1) {  
            // odd size before (i.e. lo == hi), even size now (i.e. hi = lo + 1)  

            if (num < *lo_median)       // num < lo  
                lo_median--;  
            else                        // num >= hi  
                hi_median++;            // insertion at end of equal range  
        }  
        else {  
            // even size before (i.e. hi = lo + 1), odd size now (i.e. lo == hi)  

            if (num > *lo_median && num < *hi_median) {  
                lo_median++;                    // num in between lo and hi  
                hi_median--;  
            }  
            else if (num >= *hi_median)         // num inserted after hi  
                lo_median++;  
            else                                // num <= lo < hi  
                lo_median = --hi_median;        // insertion at end of equal range spoils lo  
        }  
    }  

    double findMedian()  
    {  
        return (*lo_median + *hi_median) * 0.5;  
    }  
};

The insight of checking the size and then moving the two iterators (again based on the equal-partition view of the median) is not entirely trivial. In the end add is O(log N), a constant-factor improvement.

A much shorter (but harder to understand), one pointer version of this solution is given below:

class MedianFinder {  
    multiset<int> data;  
    multiset<int>::iterator mid;  

public:  
    MedianFinder()  
        : mid(data.end())  
    {  
    }  

    void addNum(int num)  
    {  
        const int n = data.size();  
        data.insert(num);  

        if (!n)                                 // first element inserted  
            mid = data.begin();  
        else if (num < *mid)                    // median is decreased  
            mid = (n & 1 ? mid : prev(mid));  
        else                                    // median is increased  
            mid = (n & 1 ? next(mid) : mid);  
    }  

    double findMedian()  
    {  
        const int n = data.size();  
        return (*mid + *next(mid, n % 2 - 1)) * 0.5;  
    }  
};

This is still the partition idea at heart: in the partition view, the two pointers are effectively always adjacent (left_max and right_min), so maintaining just one of them suffices; it's only a bit harder to follow. The mid above corresponds to right_min under the convention left_len = N / 2 (i.e., right is allowed to be the longer side). Work through a few examples (with both odd and even n) and the add logic becomes clear.

At query time, the small trick n % 2 - 1 handles odd and even sizes uniformly.

Ever wondered why a balancing tree is needed here? It matches the editorial's summary back in #3: a plain array's strength is very fast access, but maintaining order can't be done in O(log N); with a balancing tree, maintaining order costs only O(log N), but direct access to the median is lost. The trick is to use iterators to make up for that loss.

That said, I still think the two-heap method is the best: by being fully custom-built, it preserves both the maintenance speed and the access speed.

Further Thoughts

There are so many ways around this problem, that frankly, it is scary. Here are a few more that I came across:

  • Buckets! If the numbers in the stream are statistically distributed, then it is easier to keep track of buckets where the median would land, than the entire array. Once you know the correct bucket, simply sort it and find the median. If the bucket size is significantly smaller than the size of input processed, this results in huge time savings. @mitbbs8080 has an interesting implementation here.
  • Reservoir Sampling. Following along the lines of using buckets: if the stream is statistically distributed, you can rely on Reservoir Sampling. Basically, if you could maintain just one good bucket (or reservoir) which could hold a representative sample of the entire stream, you could estimate the median of the entire stream from just this one bucket. This means good time and memory performance. Reservoir Sampling lets you do just that. Determining a "good" size for your reservoir? Now, that's a whole other challenge. A good explanation for this can be found in this StackOverflow answer.
  • Segment Trees are a great data structure if you need to do a lot of insertions or a lot of read queries over a limited range of input values. They allow us to do all such operations fast and in roughly the same amount of time, always. The only problem is that they are far from trivial to implement. Take a look at my introductory article on Segment Trees if you are interested.
  • Order Statistic Trees are data structures which seem to be tailor-made for this problem. They have all the nice features of a BST, but also let you find the k-th order element stored in the tree. They are a pain to implement and no standard interview would require you to code these up. But they are fun to use if they are already implemented in the language of your choice.

@StefanPochmann said in Short simple Java/C++/Python, O(log n) + O(1):

I keep two heaps (or priority queues):

  • Max-heap small has the smaller half of the numbers.
  • Min-heap large has the larger half of the numbers.

This gives me direct access to the one or two middle values (they're the tops of the heaps), so getting the median takes O(1) time. And adding a number takes O(log n) time.

Supporting both min- and max-heap is more or less cumbersome, depending on the language, so I simply negate the numbers in the heap in which I want the reverse of the default order. To prevent this from causing a bug with -2^31 (which negated is itself, when using 32-bit ints), I use integer types larger than 32 bits.

Using larger integer types also prevents an overflow error when taking the mean of the two middle numbers. I think almost all solutions posted previously have that bug.

Update: These are pretty short already, but by now I wrote even shorter ones.


Java

class MedianFinder {  

    private Queue<Long> small = new PriorityQueue(),  
                        large = new PriorityQueue();  

    public void addNum(int num) {  
        large.add((long) num);  
        small.add(-large.poll());  
        if (large.size() < small.size())  
            large.add(-small.poll());  
    }  

    public double findMedian() {  
        return large.size() > small.size()  
               ? large.peek()  
               : (large.peek() - small.peek()) / 2.0;  
    }  
};

Props to larrywang2014's solution for making me aware that I can use Queue in the declaration instead of PriorityQueue (that's all I got from him, though (just saying because I just saw he changed his previously longer addNum and it's now equivalent to mine)).


C++

class MedianFinder {  
    priority_queue<long> small, large;  
public:  

    void addNum(int num) {  
        small.push(num);  
        large.push(-small.top());  
        small.pop();  
        if (small.size() < large.size()) {  
            small.push(-large.top());  
            large.pop();  
        }  
    }  

    double findMedian() {  
        return small.size() > large.size()  
               ? small.top()  
               : (small.top() - large.top()) / 2.0;  
    }  
};

Big thanks to jianchao.li.fighter for telling me that C++'s priority_queue is a max-queue (see comments below).


Python

from heapq import *  

class MedianFinder:  

    def __init__(self):  
        self.heaps = [], []  

    def addNum(self, num):  
        small, large = self.heaps  
        heappush(small, -heappushpop(large, num))  
        if len(large) < len(small):  
            heappush(large, -heappop(small))  

    def findMedian(self):  
        small, large = self.heaps  
        if len(large) > len(small):  
            return float(large[0])  
        return (large[0] - small[0]) / 2.0

This one maintains large always at least as large as small.

Also, using negated values to achieve the reverse order is a rather neat trick.
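A minimal demonstration of the negation trick (note the long element type, per Stefan's overflow warning; names are my own):

```java
import java.util.PriorityQueue;

public class NegationTrickDemo {
    public static void main(String[] args) {
        // Java's PriorityQueue is a min-heap by default; storing negated
        // values makes the head the (negated) maximum.
        PriorityQueue<Long> maxViaMin = new PriorityQueue<>();
        for (int x : new int[]{3, 1, 2}) maxViaMin.offer(-(long) x);
        System.out.println(-maxViaMin.peek());      // 3

        // With int instead of long, negating Integer.MIN_VALUE would be a
        // no-op (-(-2^31) overflows back to -2^31), hence the wider type.
        System.out.println(-Integer.MIN_VALUE == Integer.MIN_VALUE); // true
    }
}
```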

@jianchao.li.fighter said in Short simple Java/C++/Python, O(log n) + O(1):

Hi, Stefan. In C++, the default priority_queue is a max heap. So I guess in order to fit with your Java and Python codes, the C++ code may require some revisions: large (instead of small) needs to use negated numbers. Do you think so? Anyway, the idea of using negated numbers is really cool :-)

I genuinely hadn't registered this: in Java the default PriorityQueue is a min-heap (whereas C++'s priority_queue defaults to a max-heap).


@kennethliaoke said in Share my java solution logn to insert, O(1) to query:

Not sure why it is marked as hard, i think this is one of the easiest questions on leetcode.

class MedianFinder {  
    // max queue is always larger or equal to min queue  
    PriorityQueue<Integer> min = new PriorityQueue();  
    PriorityQueue<Integer> max = new PriorityQueue(1000, Collections.reverseOrder());  
    // Adds a number into the data structure.  
    public void addNum(int num) {  
        max.offer(num);  
        min.offer(max.poll());  
        if (max.size() < min.size()){  
            max.offer(min.poll());  
        }  
    }  

    // Returns the median of current data stream  
    public double findMedian() {  
        if (max.size() == min.size()) return (max.peek() + min.peek()) /  2.0;  
        else return max.peek();  
    }  
};

Rather smug; though this person has probably just drilled the equal-partition approach to the median pattern until it feels easy. People like that lack metacognition.


@hanhanbu said in Easy to understand double-heap solution in Java:

The basic idea is to maintain two heaps: a max-heap and a min-heap. The max heap stores the smaller half of all numbers while the min heap stores the larger half. The sizes of two heaps need to be balanced each time when a new number is inserted so that their size will not be different by more than 1. Therefore each time when findMedian() is called we check if two heaps have the same size. If they do, we should return the average of the two top values of heaps. Otherwise we return the top of the heap which has one more element.

To do that, we first need to add two PriorityQueues to the class as the max-heap and min-heap:

    private PriorityQueue<Integer> minH;  
    private PriorityQueue<Integer> maxH;  

We then define the constructor of the class so that the PriorityQueues get initialized. By default, the sorting order of a PriorityQueue is natural order which means it is a min-heap by default. Hence we need to provide a new Comparator to the constructor of the max heap to specify the reversed order.

    MedianFinder(){  
        minH = new PriorityQueue<Integer>();  
        maxH = new PriorityQueue<Integer>(1, new Comparator<Integer>(){  
            public int compare(Integer o1, Integer o2) {  
                if (o1.intValue()>o2.intValue()) return -1;  
                if (o1.intValue()<o2.intValue()) return 1;  
                return 0;  
            }  
        });  
    }  

Now we have the data structure properly built. Let's write the addNum() function next.

    public void addNum(int num) {  
        if ((minH.size()==0)&&(maxH.size()==0)) minH.add(num);  
        else if ((minH.size())>(maxH.size())) {  
            if (num>minH.peek()) {  
                maxH.add(minH.poll());  
                minH.add(num);  
            } else maxH.add(num);  
        } else if ((minH.size())<(maxH.size())) {  
            if (num<maxH.peek()) {  
                minH.add(maxH.poll());  
                maxH.add(num);  
            } else minH.add(num);              
        } else {  
            if (num<maxH.peek()) maxH.add(num);  
            else minH.add(num);               
        }  
    }  

There are several possible situations when a new number is inserted:

1) If both heaps are empty, meaning that we are inserting the first number, we just arbitrarily insert it into a heap, let's say, the min-heap.

2) If the min-heap has more elements (later we will argue that the sizes won't differ by more than 1), we need to compare the new number with the top of the min-heap. If it is larger than that, then the new number belongs to the larger half and it should be added to the min-heap. But since we have to balance the heaps, we should move the top element of the min-heap to the max-heap. For the min-heap, we inserted a new number but removed the original top, so its size won't change. For the max-heap, we inserted a new element (the top of the min-heap) so its size will increase by 1.

3) If the max-heap has more elements, we do the same thing as in 2).

4) If they have the same size, we just compare the new number with one of the tops to determine which heap it should go into, and simply insert it there.

It can be seen that for each insertion in situations 1) and 4), after insertion the heap size difference will be 1. For 2) and 3), the size of the heap with fewer elements will increase by 1 to catch up with the heap with more elements. Hence their sizes stay well balanced and the difference never exceeds 1.

Obviously, the median will be the top element of the heap which has one more element (if max-heap and min-heap have different sizes), or the average of the two tops (if max-heap and min-heap have equal sizes). So the findMedian() function is very straightforward:

    // Returns the median of current data stream  
    public double findMedian() {  
        if ((minH.size()==0)&&(maxH.size()==0)) return 0.0;  
        if ((minH.size())>(maxH.size())) return (double)(minH.peek());  
        if ((minH.size())<(maxH.size())) return (double)(maxH.peek());  
        return ((double)(maxH.peek()+minH.peek()))/2.0;  
    }  

The entire codes are here:

class MedianFinder {  
    private PriorityQueue<Integer> minH;  
    private PriorityQueue<Integer> maxH;  

    MedianFinder(){  
        minH = new PriorityQueue<Integer>();  
        maxH = new PriorityQueue<Integer>(1, new Comparator<Integer>(){  
            public int compare(Integer o1, Integer o2) {  
                if (o1.intValue()>o2.intValue()) return -1;  
                if (o1.intValue()<o2.intValue()) return 1;  
                return 0;  
            }  
        });  
    }  


    // Adds a number into the data structure.  
    public void addNum(int num) {  
        if ((minH.size()==0)&&(maxH.size()==0)) minH.add(num);  
        else if ((minH.size())>(maxH.size())) {  
            if (num>minH.peek()) {  
                maxH.add(minH.poll());  
                minH.add(num);  
            } else maxH.add(num);  
        } else if ((minH.size())<(maxH.size())) {  
            if (num<maxH.peek()) {  
                minH.add(maxH.poll());  
                maxH.add(num);  
            } else minH.add(num);              
        } else {  
            if (num<maxH.peek()) maxH.add(num);  
            else minH.add(num);               
        }  
    }  

    // Returns the median of current data stream  
    public double findMedian() {  
        if ((minH.size()==0)&&(maxH.size()==0)) return 0.0;  
        if ((minH.size())>(maxH.size())) return (double)(minH.peek());  
        if ((minH.size())<(maxH.size())) return (double)(maxH.peek());  
        return ((double)(maxH.peek()+minH.peek()))/2.0;  
    }  
};

The reasoning and explanation are fine, but the code is quite verbose.


@dietpepsi said in Java/Python two heap solution, O(log n) add, O(1) find:

The invariant of the algorithm is two heaps, small and large, each representing half of the current list. The length of the smaller half is kept at n / 2 at all times, and the length of the larger half is either n / 2 or n / 2 + 1 depending on n's parity.

This way we only need to peek the two heaps' top number to calculate median.

Any time before we add a new number, there are two scenarios, (total n numbers, k = n / 2):

(1) length of (small, large) == (k, k)  
(2) length of (small, large) == (k, k + 1)  

After adding the number, total (n + 1) numbers, they will become:

(1) length of (small, large) == (k, k + 1)  
(2) length of (small, large) == (k + 1, k + 1)  

Here we take the first scenario for example, we know the large will gain one more item and small will remain the same size, but we cannot just push the item into large. What we should do is we push the new number into small and pop the maximum item from small then push it into large (all the pop and push here are heappop and heappush). By doing this kind of operations for the two scenarios we can keep our invariant.

Therefore to add a number, we have three O(log n) heap operations. Luckily heapq provides a function "heappushpop" which saves some time by combining two of them into one. The documentation says:

Push item on the heap, then pop and return the smallest item from the heap. The combined action runs more efficiently than heappush() followed by a separate call to heappop().

Altogether, the add operation is O(log n); the findMedian operation is O(1).

Note that the heapq in python is a min heap, thus we need to invert the values in the smaller half to mimic a "max heap".

A further observation is that the two scenarios take turns when adding numbers, thus it is possible to combine the two into one. For this please see [stefan's post][1]

Java

    private PriorityQueue<Integer> small = new PriorityQueue<>(Collections.reverseOrder());  
    private PriorityQueue<Integer> large = new PriorityQueue<>();  
    private boolean even = true;  

    public double findMedian() {  
        if (even)  
            return (small.peek() + large.peek()) / 2.0;  
        else  
            return small.peek();  
    }  

    public void addNum(int num) {  
        if (even) {  
            large.offer(num);  
            small.offer(large.poll());  
        } else {  
            small.offer(num);  
            large.offer(small.poll());  
        }  
        even = !even;  
    }

Python

from heapq import *  


class MedianFinder:  
    def __init__(self):  
        self.small = []  # the smaller half of the list, max heap (invert min-heap)  
        self.large = []  # the larger half of the list, min heap  

    def addNum(self, num):  
        if len(self.small) == len(self.large):  
            heappush(self.large, -heappushpop(self.small, -num))  
        else:  
            heappush(self.small, -heappushpop(self.large, num))  

    def findMedian(self):  
        if len(self.small) == len(self.large):  
            return float(self.large[0] - self.small[0]) / 2.0  
        else:  
            return float(self.large[0])  

# 18 / 18 test cases passed.  
# Status: Accepted  
# Runtime: 388 ms

[1]: https://leetcode.com/discuss/64910/very-short-o-log-n-o-1

His Java code seems to have the two halves swapped relative to the write-up: it's actually small, the max-heap for the lower half, that is kept as the longer side.

Also, the toggle looks very neat, but it's essentially the same idea: still a length check, except the toggle saves a few size() calls. Code like this also seems to deliberately avoid peek() calls inside add; it's squeezing out speed in the details.
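To sanity-check the toggle version, here it is wrapped into a self-contained class (renamed; the logic follows the quoted Java):

```java
import java.util.Collections;
import java.util.PriorityQueue;

public class ToggleMedianFinder {
    private final PriorityQueue<Integer> small = new PriorityQueue<>(Collections.reverseOrder());
    private final PriorityQueue<Integer> large = new PriorityQueue<>();
    private boolean even = true;    // true when the total count is even

    public void addNum(int num) {
        if (even) {                 // count becomes odd: small gains one element
            large.offer(num);
            small.offer(large.poll());
        } else {                    // count becomes even: large catches up
            small.offer(num);
            large.offer(small.poll());
        }
        even = !even;
    }

    public double findMedian() {
        return even ? (small.peek() + large.peek()) / 2.0 : small.peek();
    }
}
```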

Also, this is the Stefan code he mentioned:

class MedianFinder {  

    Queue<Integer> q = new PriorityQueue(), z = q, t,  
                   Q = new PriorityQueue(Collections.reverseOrder());   

    public void addNum(int num) {  
        (t=Q).add(num);  
        (Q=q).add((q=t).poll());  
    }  

    public double findMedian() {  
        return (Q.peek() + z.peek()) / 2.;  
    }  
};

It seems to use a swap-like maneuver to alternate which queue is filled first; since these are all reference assignments, there's no real extra space cost.


The bucket approach mentioned in the editorial:

@mitbbs8080 said in Tired of TWO HEAP/SET solutions? See this segment dividing solution (c++):

The idea of dividing existing numbers into several ranges:

Say we already have 10k numbers in vector,
each time adding a number requires sorting all 10k numbers, which is slow.

To optimize, we can store 10k numbers in several (say 10) vectors,
and nums in each vector are sorted.

Then each time we add a number, just need to find one vector with correct range,
insert the number and sort this vector only. Since its size is relatively small, it's fast.

When we have a vector's size greater than a threshold, just split it into two halfs.

Looking at this explanation, the bucket idea does indeed seem well-suited to scenarios where order must be maintained under one-at-a-time inserts.

One uncertainty came to mind as I read this: if inserts into a bucket push its upper bound past the next bucket's lower bound, the relative order between buckets is broken; then what? Well, when you spot an uncertainty like that, try to falsify it directly: sketch a few examples and you'll see this situation simply cannot arise, since a number is only ever placed into the bucket whose range (or boundary gap) already covers it.

class MedianFinder {      
public:  
    vector<vector<int>*> raid; // store all ranges  
    int total_size;  

    MedianFinder() {  
        total_size=0;  
        raid.push_back(new vector<int> ());  
    }  

    void addNum(int num) {  
        vector<int>* correctRange=NULL;  
        int targetIndex;  

        // find the correct range to insert given num  
        for (int i=0; i<raid.size(); i++)  
            if ( raid.size()==1 ||  
                 (i==0 && num<=raid[i]->back()) ||   
                 (i==raid.size()-1 && num>=raid[i]->at(0)) ||  
                 (raid[i]->at(0)<=num && num<=raid[i]->back()) ||  
                 (num > raid[i]->back() && num < raid[i+1]->front()) )  
            {  
                correctRange = raid[i];  
                targetIndex = i;  
                break;  
            }  

        // put num at back of correct range, and sort it to keep increasing sequence  
        total_size++;  
        correctRange->push_back(num);  
        sort(correctRange->begin(), correctRange->end());  

        // if current range's size > threshold, split it into two halfs and add them back to this.raid  
        const int max_size = 30;  
        int len = correctRange->size();  
        if (len > max_size) {  
            vector<int> *half1 = new vector<int>(correctRange->begin(), correctRange->begin()+len/2);  
            vector<int> *half2 = new vector<int>(correctRange->begin()+len/2, correctRange->end());  

            delete correctRange;  
            raid[targetIndex]=half2;  
            raid.insert(raid.begin() + targetIndex, half1);  
        }  

    }  

    // iterate thru all ranges in this.raid to find median value  
    double findMedian() {  
        if (total_size==0)  
            return 0;  

        int mid1 = total_size/2;  
        int mid2 = mid1 + 1;  

        int leftCount=0;  
        double first, second;  
        for (auto r : raid) {  
            if (leftCount<mid1 && mid1<=leftCount+r->size())  
                first = r->at(mid1 - leftCount - 1);  

            if (leftCount<mid2 && mid2<=leftCount+r->size()) {  
                second = r->at(mid2 - leftCount - 1);  
                break;  
            }  
            leftCount += r->size();  
        }  

        if (total_size % 2)  
            return second;  
        else  
            return (first + second)/2;  
    }  
};

A decent enough idea, though the query is too slow at O(N); just getting the gist of the approach is enough. Also, I wondered whether raid.insert triggers another shift. Even if it does, it shouldn't be too costly, since it only shifts bucket pointers, so the cost is O(num_of_buckets)... no wait, isn't that still O(N)?

Actually, think about it some more: even without any shift, this algorithm's add is still O(N), because of the linear scan over the buckets!

@StefanPochmann said in Tired of TWO HEAP/SET solutions? See this segment dividing solution (c++):

Nice title :-)
(Your solution was btw mentioned in a new article.)

How about instead of using a fixed bucket limit, limit it by the square root of the total size? I think simply replacing if (len > max_size) with if (len * len > total_size) would achieve that, right? Then your buckets have size O(sqrt(total_size)) and I think the number of buckets is also O(sqrt(total_size)). Then instead of O(total_size), your addNum would be O(sqrt(total_size)) or O(sqrt(total_size) * log(total_size)), depending on how efficient sort is (could be linear, as it just needs to insert the new element into an already sorted vector).

This idea of Stefan's is genuinely clever.


Submissions show the usual spread; the two-heap approach is the most common.


Problem Description

The median is the middle value in an ordered integer list. If the size of the list is even, there is no single middle value, so the median is the mean of the two middle values.

Examples:

[2,3,4] , the median is 3

[2,3], the median is (2 + 3) / 2 = 2.5

Design a data structure that supports the following two operations:

  • void addNum(int num) - Add an integer number from the data stream to the data structure.
  • double findMedian() - Return the median of all elements so far.
    For example:
addNum(1)  
addNum(2)  
findMedian() -> 1.5  
addNum(3)   
findMedian() -> 2

Credits:

  • Special thanks to @Louis1992 for adding this problem and creating all test cases.

Difficulty: Hard
Total Accepted: 55.4K
Total Submissions: 191.7K
Contributor: LeetCode
Companies: Google
Related Topics: Heap, Design