NumberOfMatchingSubsequences [source code]


public class NumberOfMatchingSubsequences {
static
/******************************************************************************/
class Solution {
    public int numMatchingSubseq(String S, String[] words) {
        char[] chs = S.toCharArray ();
        int count = 0;
        for (String word : words) {
            if (S.endsWith (word) || isSubseq (chs, word.toCharArray ()))
                count++;
        }
        return count;
    }

    boolean isSubseq (char[] chs, char[] chw) {
        if (chs.length < chw.length)
            return false;
        int si = 0, wi = 0;
        while (si < chs.length) {
            if (chs[si] == chw[wi] && ++wi == chw.length)
                break;
            si++;
        }
        return wi == chw.length;
    }
}
/******************************************************************************/

    public static void main(String[] args) {
        NumberOfMatchingSubsequences.Solution tester = new NumberOfMatchingSubsequences.Solution();
        String[][] inputs = {
            {"abcde"}, {"a", "bb", "acd", "ace"}, {"3"},
        };
        for (int i = 0; i < inputs.length / 3; i++) {
            System.out.println (Printer.separator ());
            String S = inputs[3 * i][0];
            String[] words = inputs[3 * i + 1];
            int ans = Integer.parseInt (inputs[3 * i + 2][0]), output = tester.numMatchingSubseq (S, words);
            System.out.printf ("%s and %s -> %s, expected: %d\n", 
                S, Arrays.deepToString (words), Printer.wrap (output + "", output == ans ? 92 : 91), ans
            );
        }
    }
}

简单的subseq的方法直接TLE了; 看来这题是有门道的;

class Solution {  
    public int numMatchingSubseq(String S, String[] words) {  
        char[] chs = S.toCharArray ();  
        int count = 0;  
        for (String word : words) {  
            if (isSubseq (chs, word.toCharArray ()))  
                count++;  
        }  
        return count;  
    }  

    boolean isSubseq (char[] chs, char[] chw) {  
        if (chs.length < chw.length)  
            return false;  
        int si = 0, wi = 0;  
        while (si < chs.length) {  
            if (chs[si] == chw[wi] && ++wi == chw.length)  
                break;  
            si++;  
        }  
        return wi == chw.length;  
    }  
}

丢人了, 恶心了一下OJ; 正确的做法这题应该是判断subseq的时候, 用两头的做法; 没时间了所以最后就直接cheat了一下OJ;

那个TLE的大case就是word是类似yj的东西, 然后S就是特别长的y..yj这种, 用这种case来恶心人是想代表什么中心思想呢, 不太理解; 也许subSequence算法本身就有双头这种常规优化要求?

最后的速度是651ms (NA), 这个肯定是有问题的;

看到contest里面有一个人的优化更加暴力, 他是直接加了memo, 因为判断到words里面很多很短的小重复; 反正最后基本都是OJ-oriented的各种优化; // 这个memo map去重好像好几个人都用了; 相比之下, 我的做法还是脏了一点;

editorial

Approach #1: Brute Force [Time Limit Exceeded]

Intuition and Algorithm

Let's try to check separately whether each word words[i] is a subsequence of S.

For every such word, we try to find the first occurrence of word[0] in S, then from that point on try to find the first occurrence of word[1], and so on.

class Solution {  
    char[] ca, cb;  
    public int numMatchingSubseq(String S, String[] words) {  
        int ans = 0;  
        ca = S.toCharArray();  
        for (String word: words)  
            if (subseq(word)) ans++;  
        return ans;  
    }  

    public boolean subseq(String word) {  
        int i = 0;  
        cb = word.toCharArray();  
        for (char c: ca) {  
            if (i < cb.length && c == cb[i]) i++;  
        }  
        return (i == cb.length);  
    }  
}

Approach #2: Next Letter Pointers [Accepted]

Intuition

Since the length of S is large, let's think about ways to iterate through S only once, instead of many times as in the brute force solution.

讲道理这个实际上还是看OJ看出来的;

We can put words into buckets by starting character. If for example we have words = ['dog', 'cat', 'cop'], then we can group them 'c' : ('cat', 'cop'), 'd' : ('dog',). This groups words by what letter they are currently waiting for. Then, while iterating through letters of S, we will move our words through different buckets.

For example, if we have a string like S = 'dcaog':

heads = 'c' : ('cat', 'cop'), 'd' : ('dog',) at beginning;
heads = 'c' : ('cat', 'cop'), 'o' : ('og',) after S[0] = 'd';
heads = 'a' : ('at',), 'o' : ('og', 'op') after S[0] = 'c';
heads = 'o' : ('og', 'op'), 't': ('t',) after S[0] = 'a';
heads = 'g' : ('g',), 'p': ('p',), 't': ('t',) after S[0] = 'o';
heads = 'p': ('p',), 't': ('t',) after S[0] = 'g';
Algorithm

Instead of a dictionary, we'll use an array heads of length 26, one entry for each letter of the alphabet. For each letter in S, we'll take all the words waiting for that letter, and have them wait for the next letter in that word. If we use the last letter of some word, it adds 1 to the answer.

For another description of this algorithm and a more concise implementation, please see @StefanPochmann's excellent forum post here.

大概的思路是理解了, 确实是一个非常优秀的想法; Google的题目, 难怪了, 真的是设计的很用心了;

class Solution {  
    public int numMatchingSubseq(String S, String[] words) {  
        int ans = 0;  
        ArrayList<Node>[] heads = new ArrayList[26];  
        for (int i = 0; i < 26; ++i)  
            heads[i] = new ArrayList<Node>();  

        for (String word: words)  
            heads[word.charAt(0) - 'a'].add(new Node(word, 0));  

        for (char c: S.toCharArray()) {  
            ArrayList<Node> old_bucket = heads[c - 'a'];  
            heads[c - 'a'] = new ArrayList<Node>();  

            for (Node node: old_bucket) {  
                node.index++;  
                if (node.index == node.word.length()) {  
                    ans++;  
                } else {  
                    heads[node.word.charAt(node.index) - 'a'].add(node);  
                }  
            }  
            old_bucket.clear();  
        }  
        return ans;  
    }  

}  

class Node {  
    String word;  
    int index;  
    public Node(String w, int i) {  
        word = w;  
        index = i;  
    }  
}

注意这里heads的初始化: 要手动一个一个的做, 这个亏以前是吃过的;

每一个node是一个的结合, int记录的是next index to be matched.

总体来说代码也是比较直白的, 用index记录一个word需要被match的位置; 当然也可以用类似可能c++那样的, pop head之类的, 不过java里面因为没有这种操作, 所以并不好考虑;

如果S长度是M, 每一个word的长度是L, 总共有N个word, 那么第一种暴力做法是O((N * (M + N))), 而这里这个做法是O(M + LN). 速度的优势还是显而易见的;

另外, 这里的这个实现看起来, 是要一个额外的O(LN)的空间复杂度的, 因为要把每一个word都单独存一个copy在node里面; 当然, 可以用interning来改善, 或者可以改成一个指针的做法, 因为实际上我们从头到尾都没有修改过word, 所以保留一个额外的copy并没有必要;

https://leetcode.com/problems/number-of-matching-subsequences/discuss/117578/Simple-Python-solution

The idea is to use the String S and build a dictionary of indexes for each character in the string.
Then for each word, we can go character by character and see if it is a subsequence, by using the corresponding index dictionary, but just making sure that the index of the current character occurs after the index where the previous character was seen. To speed up the processing, we should use binary search in the index dictionary.

As an example for S = "aacbacbde"
the
dict_idxs =
{a: [0, 1, 4]
b: [3, 6]
c: [2, 5]
d: [7]
e: [8]
}
Now for the word say “abcb”, starting with d_i = 0,

a => get list for a in the dict_idxs which is [0, 1, 4], and in the list, find the index of a >= d_i which is 0. After this update d_i to +1 => 1

b => get list for b in the dict_idxs [3, 6], and in the list, find the index of b >= d_i => >= 1, which is at index 0 => 3, after this update d_i to 4+1 => 4

c => in the list for c, [3, 5], find the index of c >= 4, which is 5. update d_i to 5+1 = 6

b => in the list for b, [3, 6], fund the index of b >= 6, which is 6, update d_i to 7, since this is the end of the word, and we have found all characters in the word in the index dictionary, we return True for this word.

    def numMatchingSubseq(self, S, words):  
        def isMatch(word, w_i, d_i):  
            if w_i == len(word): return True  
            l = dict_idxs[word[w_i]]  
            if len(l) == 0 or d_i > l[-1]: return False  
            i = l[bisect_left(l, d_i)]  
            return isMatch(word, w_i + 1, i + 1)  

        dict_idxs = defaultdict(list)  
        for i in range(len(S)):  
            dict_idxs[S[i]].append(i)  
        return sum(isMatch(word, 0, 0) for word in words)

bisect.bisect_left好像返回的也是一个insertion position, 所以如果我们l[bisect_left(l, d_i)]找到的实际上肯定就是下一个比d_i大的数字(因为没有duplicate, 不可能hit); 因为之前一步已经已经确定了d_i <= l[-1], 所以这个bisect不会返回l.length;

总体来说这个算法就是一个利用hash的适当的加速; 毕竟传统的subSequence算法, 就是一个线性搜索, 然后这个人可能就想到想要利用binary search来加速, 然后就用了这么一个bucket加上binary search的思路来做;

https://leetcode.com/problems/number-of-matching-subsequences/discuss/117634/Efficient-and-simple-go-through-words-in-parallel-with-explanation

Runtime is linear in the total size of the input (S and all of words). Explanation below the code.

Python:

def numMatchingSubseq(self, S, words):  
    waiting = collections.defaultdict(list)  
    for w in words:  
        waiting[w[0]].append(iter(w[1:]))  
    for c in S:  
        for it in waiting.pop(c, ()):  
            waiting[next(it, None)].append(it)  
    return len(waiting[None])

C++:

int numMatchingSubseq(string S, vector<string>& words) {  
    vector<pair<int, int>> waiting[128];  
    for (int i = 0; i < words.size(); i++)  
        waiting[words[i][0]].emplace_back(i, 1);  
    for (char c : S) {  
        auto advance = waiting[c];  
        waiting[c].clear();  
        for (auto it : advance) {  
            int i = it.first, j = it.second;  
            waiting[words[i][j]].emplace_back(i, j + 1);  
        }  
    }  
    return waiting[0].size();  
}

Java:

public int numMatchingSubseq(String S, String[] words) {  
    List<Integer[]>[] waiting = new List[128];  
    for (int c = 0; c <= 'z'; c++)  
        waiting[c] = new ArrayList();  
    for (int i = 0; i < words.length; i++)  
        waiting[words[i].charAt(0)].add(new Integer[]{i, 1});  
    for (char c : S.toCharArray()) {  
        List<Integer[]> advance = new ArrayList(waiting[c]);  
        waiting[c].clear();  
        for (Integer[] a : advance)  
            waiting[a[1] < words[a[0]].length() ? words[a[0]].charAt(a[1]++) : 0].add(a);  
    }  
    return waiting[0].size();  
}

用bucket[0]来专门记录成功被清空的word的数量; 注意他这个实现就是一个指针结构的(只记录了word在words里面的位置而不是word本身);

Explanation:
For each letter I remember the words currently waiting for that letter. For example:

S = "abcde"  
words = ["a", "bb", "acd", "ace"]

I store that "a", "acd" and "ace" are waiting for an 'a' and "bb" is waiting for a 'b' (using parentheses to show how far I am in each word):


'a':  ["(a)", "(a)cd", "(a)ce"]  
'b':  ["(b)b"]

Then I go through S. First I see 'a', so I take the list of words waiting for 'a' and store them as waiting under their next letter:

'b':  ["(b)b"]  
'c':  ["a(c)d", "a(c)e"]  
None: ["a"]

You see "a" is already waiting for nothing anymore, while "acd" and "ace" are now waiting for 'c'. Next I see 'b' and update accordingly:

'b':  ["b(b)"]  
'c':  ["a(c)d", "a(c)e"]  
None: ["a"]

Then 'c':

'b':  ["b(b)"]  
'd':  ["ac(d)"]  
'e':  ["ac(e)"]  
None: ["a"]

Then 'd':

'b':  ["b(b)"]  
'e':  ["ac(e)"]  
None: ["a", "acd"]

Then 'e':

'b':  ["b(b)"]  
None: ["a", "acd", "ace"]

And now I just return how many words aren’t waiting for anything anymore.

Implementation Details:
In Python I use iterators to represent words and their progress. In C++/Java I use pairs (i,j) which tell me that it’s word i and I’m at letter j. In Go (below) I just keep slicing off letters from the front of the strings. For C++ I also tried istringstream but failed, the compiler gave me the typical cryptic compile error wall.

Variations:
Cleaner initialization, using iterators iterating the whole word instead of a copy of the word after its first letter:

def numMatchingSubseq(self, S, words):  
    waiting = collections.defaultdict(list)  
    for it in map(iter, words):  
        waiting[next(it)].append(it)  
    for c in S:  
        for it in waiting.pop(c, ()):  
            waiting[next(it, None)].append(it)  
    return len(waiting[None])

Make all words waiting for a space character and prepend a space character to S:

def numMatchingSubseq(self, S, words):  
    waiting = collections.defaultdict(list, {' ': map(iter, words)})  
    for c in ' ' + S:  
        for it in waiting.pop(c, ()):  
            waiting[next(it, None)].append(it)  
    return len(waiting[None])

This avoids some code duplication and now the solution can also handle empty words.

A Go version:

func numMatchingSubseq(S string, words []string) (num int) {  
    waiting := map[rune][]string{' ': words}  
    for _, c := range " " + S {  
        advance := waiting[c]  
        delete(waiting, c)  
        for _, word := range advance {  
            if len(word) == 0 {  
                num++  
            } else {  
                waiting[rune(word[0])] = append(waiting[rune(word[0])], word[1:])  
            }  
        }  
    }  
    return  
}

https://leetcode.com/problems/number-of-matching-subsequences/discuss/117598/Java-solution-using-HashMap-and-Queue

class Solution {  
    public int numMatchingSubseq(String S, String[] words) {  
        Map<Character, Deque<String>> map = new HashMap<>();  
        for (char c = 'a'; c <= 'z'; c++) {  
            map.putIfAbsent(c, new LinkedList<String>());  
        }  
        for (String word : words) {  
            map.get(word.charAt(0)).addLast(word);  
        }  

        int count = 0;  
        for (char c : S.toCharArray()) {  
            Deque<String> queue = map.get(c);  
            int size = queue.size();  
            for (int i = 0; i < size; i++) {  
                String word = queue.removeFirst();  
                if (word.length() == 1) {  
                    count++;  
                } else {  
                    map.get(word.charAt(1)).addLast(word.substring(1));  
                }  
            }  
        }  
        return count;  
    }  
}

substring() is time consuming and you could also use an array as a map to further improve this solution.

大概就是选择了另外一个接口: 意识到这题对每一个list的操作实际上只有add_last和remove_first两种, 所以用一个LinkedList来做更加合理;

https://leetcode.com/problems/number-of-matching-subsequences/discuss/117576/Iterative-Java-Solution

class Solution {  
    public int numMatchingSubseq(String S, String[] words) {  
        int count = 0;  
        for (String w : words) {  
            int i = 0;   // word index  
            int j = 0;   // S index  
            while (j < S.length() && i < w.length()) {  
                if (S.charAt(j) == w.charAt(i)) {  
                    j ++;  
                    i ++;  
                } else {  
                    j ++;  
                }  
            }  
            if (i == w.length()) count ++;  
        }  
        return count;  
    }  
}

就是最naive的解法, 居然也能AC, 速度是900+ms;

uwi做法:

class Solution {  
    public int numMatchingSubseq(String S, String[] words) {  
        char[] s = S.toCharArray();  
        int[][] next = makeFatNext(s, 'a', 'z');  
        int ret = 0;  
        outer:  
        for(String w : words){  
            int pos = 0;  
            for(char c : w.toCharArray()){  
                pos = next[c-'a'][pos];  
                if(pos > s.length)continue outer;  
            }  
            ret++;  
        }  
        return ret;  
    }  

    public int[][] makeFatNext(char[] s, char inf, char sup)  
    {  
        int n = s.length;  
        int[][] next = new int[sup-inf+1][n+1];  
        for(int i = 0;i < sup-inf+1;i++)next[i][n] = n+1;  
        for(int i = s.length-1;i >= 0;i--){  
            for(int j = 0;j < sup-inf+1;j++)next[j][i] = next[j][i+1];  
            next[s[i]-inf][i] = i+1;  
        }  
        return next;  
    }  
}

他上面tictactoe那题也是用了nextPermutation的函数去做的, 这里感觉下面这个helper好像也是个什么套路的东西? 不然inf和sup这两个参数是没有必要的;

对于

"abcde"  
["a","bb","acd","ace"]

这个例子, 他的next的前五行(对应a..e):

[[1, 6, 6, 6, 6, 6],  
 [2, 2, 6, 6, 6, 6],  
 [3, 3, 3, 6, 6, 6],  
 [4, 4, 4, 4, 6, 6],  
 [5, 5, 5, 5, 5, 6],

感觉这个就是一个可以让你站在每一个(S的)index上面, 然后寻找next occurrence of a letter的查询; 这个例子的S因为没有duplicate所以看不出来:

"abcdde"  
["a","bb","acd","ace"]  

[[1, 7, 7, 7, 7, 7, 7],  
 [2, 2, 7, 7, 7, 7, 7],  
 [3, 3, 3, 7, 7, 7, 7],  
 [4, 4, 4, 4, 5, 7, 7],  
 [6, 6, 6, 6, 6, 6, 7],

这个也是为什么填表的时候, s的长度要从右到左的走, 这样later occurrence可以被左边的occurrence适当的覆盖;

这个思路大概看了一下, 基本就跟discussion第一个解法差不多, 通过对S的preprocess来完成一个加速; next lookup其实以前是见过的, 最好还是熟悉一下;

这个next计算出来之后, 后面的就是不断的查询来完成对word的一个遍历, 因为有这个lookup, 所以跳转很快;

他这里的index全都用的1based, 有点奇怪, 不过自己搞清楚就行了;

Problem Description

Given string S and a dictionary of words words, find the number of words[i] that is a subsequence of S.

Example :  
Input:   
S = "abcde"  
words = ["a", "bb", "acd", "ace"]  
Output: 3  
Explanation: There are three words in words that are a subsequence of S: "a", "acd", "ace".

Note:

All words in words and S will only consists of lowercase letters.
The length of S will be in the range of [1, 50000].
The length of words will be in the range of [1, 5000].
The length of words[i] will be in the range of [1, 50].

Difficulty:Medium
Total Accepted:648
Total Submissions:2.9K
Contributor:awice
Companies
google
Related Topics
array

LC808 NumberOfMatchingSubsequences

NumberOfMatchingSubsequences [source code]

editorial

Approach #1: Brute Force [Time Limit Exceeded]

Approach #2: Next Letter Pointers [Accepted]

Problem Description

results matching ""

No results matching ""