TopKFrequentElements [source code]


public class TopKFrequentElements {
static
/******************************************************************************/
public class Solution {
    public List<Integer> topKFrequent(int[] nums, int k) {
        Map<Integer, Integer> map = new HashMap<>();
        for (int i : nums) {
            Integer count = map.getOrDefault(i, 0);
            map.put(i, count + 1);
        }
        PriorityQueue<Map.Entry<Integer, Integer>> pq = new PriorityQueue<>(
            new Comparator<Map.Entry<Integer, Integer>>() {
                @Override
                public int compare(Map.Entry<Integer, Integer> a, Map.Entry<Integer, Integer> b) {
                    return b.getValue() - a.getValue(); //descending order
                }
            }
        );
        pq.addAll(map.entrySet());
        List<Integer> res = new ArrayList<>();
        for (int i = 0; i < k; i++)
            res.add(pq.poll().getKey());
        return res;
    }
}
/******************************************************************************/

    public static void main(String[] args) {
        TopKFrequentElements.Solution tester = new TopKFrequentElements.Solution();
    }
}

note的第二条其实就是禁止你直接sort完了用分段算法来数. 注意这里规定的n是整个的size. 然后你看tag里面的heap, 意思是可以用PriorityQueue, 用PriorityQueue的复杂度也是linearithmic, 但是它的N并不是array size这么大;

我就直接用PriorityQueue写法写写看吧. 这个题目之前好像见别人提过是有其他的解法的, 但是PriorityQueue我还没有写过, 正好这个题目写写看算了;

另外这一题既然最后会检查nlgn, 所以估计感觉这个n会相当大. 做counts的时候用hashing好像就不太靠谱了;

翻了一下以前代码, 想起来之前sort frequency的另外一种做法是什么了, 是hashing: 直接把counts作为index的第二个hashing; 这个做法感觉也不是很适合这里. 这里counts的上限并没有保障;

各种API写起来还是不太熟练,最后还是稍微瞄了一眼之前sort frequency题目的代码; 整体来说知道做法之后还是很简单的, 速度是31ms (55%).

这个是discussion上面一个用bucket sort做的解法:

public List<Integer> topKFrequent(int[] nums, int k) {  

    List<Integer>[] bucket = new List[nums.length + 1];  
    Map<Integer, Integer> frequencyMap = new HashMap<Integer, Integer>();  

    for (int n : nums) {  
        frequencyMap.put(n, frequencyMap.getOrDefault(n, 0) + 1);  
    }  

    for (int key : frequencyMap.keySet()) {  
        int frequency = frequencyMap.get(key);  
        if (bucket[frequency] == null) {  
            bucket[frequency] = new ArrayList<>();  
        }  
        bucket[frequency].add(key);  
    }  

    List<Integer> res = new ArrayList<>();  

    for (int pos = bucket.length - 1; pos >= 0 && res.size() < k; pos--) {  
        if (bucket[pos] != null) {  
            res.addAll(bucket[pos]);  
        }  
    }  
    return res;  
}

implicit bound of frequency

我自己做的时候放弃bucket sort的一个原因是因为我认为这个bucket array的长度无法预知. 但是这里犯了一个错误, 之前做sort frequency的时候就已经翻过这个错误: frequency value本身是有一个隐含上限的, 也就是input array的size;

另外感觉他这里的做法还是有点奇怪; 他这个做的是一个完整的sort过程, 我们这里需要的好像实际上并不是sort? 只是要那K个entry的value就行了;

哦不对, 这里他bucket array之所以每一个entry用list是因为这里要考虑不同的entry用相同的count的情况; 这个是我自己犯蠢了; 这个算法最后的速度肯定是比PriorityQueue的做法要快的;

另外看这里, 我学到一个关于变量命名的关键. 我觉得以后变量命名, 尤其是数据结构类型的变量, 命名的时候至少要有两个词. 第一个词就是你自己随便起的名字, 比如freq, 第二个词最好能够反映这个变量的类型, 比如Map; 类似的, 这里的这个bucket array其实就可以直接命名成bArr这样. 我现在的一贯做法是在第一个词的选择上纠结太久. 其实我感觉只要第二个类型词表达正确, 读得人一般是比较容易看懂的;

另外他这个代码有人认为有bug:

The idea is great but I found a bug which may not be caught by OJ.

  for (int pos = bucket.length - 1; pos >= 0 && res.size() < k; pos--) {  
        if (bucket[pos] != null) {  
            res.addAll(bucket[pos]);  
        }  
    }

e.g.[1,1,1,1,1,1,2,2,2,3,3,3,4,4,4,5] k=2. In your code the res.addAll() would add 2,3,4 at same time. the result list will be size of 4 instead of 2, because you did not consider the tie cases. In such a case, you should pick 1 plus any one number among 2,3,4, but never all 3 numbers. My suggestion is as follows:

outerloop:  
        for (int i = n; i >= 0; i--) {  
            for (int j = 0; j < buckets[i].size(); j++) {  
                topK.add(buckets[i].get(j));  
                k--;  
                if (k == 0) {  
                    break outerloop;  
                }  
            }  
        }  
        return topK;

算是一个小的疏忽; 他原来那个代码确实是会把所有tie的都算做只占用一个名次;

另外他这个代码还是值得学习一下;

但是这个bug实际上有一个更好的解决办法:

one caveat is that this will add all the results by default for same frequency numbers. For example for input [1,1,2,2,3,3,4], k=2
this will return [1,2,3] where if we were expecting only top 2 elements (k=2) in increasing order. We could avoid that by returning the subList:
return res.subList(0,k);

受到discussion里面的启发, 好像PriorityQueue最好的并不需要把所有的entry都加进去(这个办法最后的复杂度是nlgn, with n as the number of distinct numbers). 是否可以把PriorityQueue的size维持在K, 这样最后的复杂度就可以保证在KlgK? 但是想了一下, 这样一个操作就要求有一个removeLast的操作for PriorityQueue, 但是Google了一下, 好像这样一个操作并没有API. 除非自己手动做, 手动copy过去;

或许可以把PriorityQueue做成ascending的, 这样直接removeFirst就行了; 最后提取的时候也不麻烦! 因为你的PriorityQueue的size只有K, 直接全部扔到res里面就行了;

最后的代码是:

public class Solution {  
    public List<Integer> topKFrequent(int[] nums, int k) {  
        Map<Integer, Integer> map = new HashMap<>();  
        for (int i : nums) {  
            Integer count = map.getOrDefault(i, 0);  
            map.put(i, count + 1);  
        }  
        PriorityQueue<Map.Entry<Integer, Integer>> pq = new PriorityQueue<>(  
            new Comparator<Map.Entry<Integer, Integer>>() {  
                @Override  
                public int compare(Map.Entry<Integer, Integer> a, Map.Entry<Integer, Integer> b) {  
                    return a.getValue() - b.getValue(); //ascending order  
                }  
            }  
        );  
        for (Map.Entry<Integer, Integer> e : map.entrySet()) {  
            pq.offer(e);  
            if (pq.size() > k)  
                pq.poll();  
        }  
        List<Integer> res = new ArrayList<>();  
        for (Map.Entry<Integer, Integer> e : pq)  
            res.add(e.getKey());  
        return res;  
    }  
}

然而这个代码最后的速度居然变慢了: 52(26). 完全不能理解; 按理说这个实际上是做到了Nlgk的.

这个是一个类似的用heap的做法:

// use maxHeap. Put entry into maxHeap so we can always poll a number with largest frequency  
public class Solution {  
    public List<Integer> topKFrequent(int[] nums, int k) {  
        Map<Integer, Integer> map = new HashMap<>();  
        for(int n: nums){  
            map.put(n, map.getOrDefault(n,0)+1);  
        }  

        PriorityQueue<Map.Entry<Integer, Integer>> maxHeap =   
                         new PriorityQueue<>((a,b)->(b.getValue()-a.getValue()));  
        for(Map.Entry<Integer,Integer> entry: map.entrySet()){  
            maxHeap.add(entry);  
        }  

        List<Integer> res = new ArrayList<>();  
        while(res.size()<k){  
            Map.Entry<Integer, Integer> entry = maxHeap.poll();  
            res.add(entry.getKey());  
        }  
        return res;  
    }  
}

注意他这里comparator的写法, 直接用lambda作为参数, 省掉好多打字.

It's a shortcut for Comparator. You can spend some time to check Java 8 Lambda Expression. When you learn this, you do not want to write anonymous Comparator anymore.

这个人还提出另外一种做法:

// use treeMap. Use freqncy as the key so we can get all freqencies in order  
public class Solution {  
    public List<Integer> topKFrequent(int[] nums, int k) {  
        Map<Integer, Integer> map = new HashMap<>();  
        for(int n: nums){  
            map.put(n, map.getOrDefault(n,0)+1);  
        }  

        TreeMap<Integer, List<Integer>> freqMap = new TreeMap<>();  
        for(int num : map.keySet()){  
           int freq = map.get(num);  
           if(!freqMap.containsKey(freq)){  
               freqMap.put(freq, new LinkedList<>());  
           }  
           freqMap.get(freq).add(num);  
        }  

        List<Integer> res = new ArrayList<>();  
        while(res.size()<k){  
            Map.Entry<Integer, List<Integer>> entry = freqMap.pollLastEntry();  
            res.addAll(entry.getValue());  
        }  
        return res;  
    }  
}

这个TreeMap以前也是没有玩过. 可以看到实际上可以使用的API还是很丰富, 尤其是最后一个pass, 还有一个Poll的操作; 注意这里的这个last自动的就是默认的是ascending里面的last, 也就是是最大的一个;

pollFirst也是有API的, 具体可以参考: https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html

这个方法和heap方法做到的功能性实际上是一样的, 就是sort by assoc value. 大概参考一下, 如果没有什么特别的, 感觉掌握PriorityQueue这个做法是比较主要的;

TreeMap provide all implementations in log(n) time: http://docs.oracle.com/javase/7/docs/api/java/util/TreeMap.html

所以就是一个跟heap非常类似的东西.

PriorityQueue还可以这样做的:

Queue<Map.Entry<Integer,Integer>> q = new PriorityQueue<>(Map.Entry.comparingByValue());

这个是默认的ascending, 适合我那个维护pq size as K的做法;

另外很多人争论, 这个PriorityQueue的做法到底应不应该算作是超时, 因为假如所有的element全都是distinct, 那么最后Map的entry的个数就是N = size of input array. 但是实际上我认为这个是不对的. 就算是这样的一个情况, 我们最后实际上的复杂度应该是lg1 + lg2 + ... + lgN, 所以是没有达到NlgN的. 当然具体多少, 感觉就很难算;

一个用quick select的解法:

class Solution {  
public:  
    vector<int> topKFrequent(vector<int>& nums, int k) {  
        vector<int> res;  
        if (!nums.size()) return res;  
        unordered_map<int, int> cnt;  
        for (auto num : nums) cnt[num]++;  
        vector<pair<int, int>> num_with_cnt;  
        for (auto kv : cnt) {  
            num_with_cnt.push_back({kv.first, kv.second});  
        }  
        kselection(num_with_cnt, 0, num_with_cnt.size()-1, k);  
        for (int i = 0; i < k && i < num_with_cnt.size(); ++i) {  
            res.push_back(num_with_cnt[i].first);  
        }  
        return res;  
    }  

    void kselection(vector<pair<int, int>>& data, int start, int end, int k) {  

        if (start >= end) return;  
        auto pv = data[end];  
        int i = start;  
        int j = start;  
        while (i < end) {  
            if (data[i].second > pv.second) {  
                swap(data[i++], data[j++]);  
            } else {  
                ++i;  
            }  
        }  
        swap(data[j], data[end]);  
        int num = j - start + 1;  
        if (num == k) return;  
        else if (num < k) {  
            kselection(data, j + 1, end, k - num);  
        } else {  
            kselection(data, start, j - 1, k);  
        }  
    }  
};

学学可以, 但是这个做法好像最后是NlgN? 这个就他妈是自己写了一个sort啊. 等于这个题目的OJ对于sort的做法并没有判断超时;

基本没有更好的做法了;

Problem Description

Given a non-empty array of integers, return the k most frequent elements.

For example,
Given [1,1,1,2,2,3] and k = 2, return [1,2].

Note:

You may assume k is always valid, 1 ≤ k < number of unique elements.
Your algorithm's time complexity must be better than O(n log n), where n is the array's size.

Difficulty:Medium
Total Accepted:67.5K
Total Submissions:140.9K
Contributor: LeetCode
Companies
pocket gems yelp
Related Topics
hash table heap
Similar Questions
Word Frequency Kth Largest Element in an Array Sort Characters By Frequency

LC347 TopKFrequentElements

TopKFrequentElements [source code]

implicit bound of frequency

Problem Description

results matching ""

No results matching ""