ImplementstrStr2 [source code]


public class ImplementstrStr2 {
static
/******************************************************************************/
public class Solution {
    static boolean DEBUG_LOG = true;

    public int strStr(String haystack, String needle) {
        List<Integer> res = new ArrayList<>();
        char[] text = haystack.toCharArray();
        char[] pat = needle.toCharArray();
        int N = text.length, M = pat.length;
        if (M == 0) return 0;
        int[] lps = new int[M];
        computeLPS(pat, lps);
        if (DEBUG_LOG) System.out.println ("First implemenation LPS: " + Printer.array (lps));
        int i = 0, j = 0;
        while (i < N) {
            if (text[i] == pat[j]) {
                i++;
                j++;
                if (j == M) {
                    res.add(i - M);
                    j = lps[j - 1];     // easy to forget about: need to backup j here as well
                }
            } else {
                // advance i or backup j
                if (j != 0) j = lps[j - 1];
                else i++;
            }
        }
        return res.size() > 0 ? res.get(0) : -1;
    }

    void computeLPS(char[] pat, int[] lps) {
        lps[0] = 0;
        // length of the previous longest prefix suffix
        // i can not be initialized to 0!!! subtle but fatal
        int len = 0, i = 1;
        while (i < pat.length) {
            if (pat[i] == pat[len]) {
                len++;
                lps[i] = len;
                i++;
            } else {
                if (len != 0) {
                    len = lps[len - 1];
                } else {
                    lps[i] = 0; // can be omitted;
                    i++;
                }
            }
        }
    }

    /* Alternative concise implementation 
    Inspired by: https://discuss.leetcode.com/topic/105579/c-4-lines-o-m-n-o-1-and-7-lines-kmp-o-m-n-o-n/18 */
    List<Integer> kmp (String A, String B) {
        List<Integer> res = new ArrayList<> ();
        char[] cha = A.toCharArray (), chb = B.toCharArray ();
        int[] lps = new int[chb.length];
        for (int len = 0, i = 1; i < chb.length; ) {
            if (chb[len] == chb[i])
                lps[i++] = ++len;
            else if (len > 0)
                len = lps[len - 1];
            else i++;
        }
        if (DEBUG_LOG) System.out.println ("Second implementation LPS: " + Printer.array (lps));
        for (int i = 0, j = 0; i < cha.length; i += j - lps[j - 1], j = lps[j - 1]) {
            if (DEBUG_LOG) System.out.printf ("i:(%d), j:(%d)\n%s\n", i, j, debugKmp (A, B, i, j));
            while (j < chb.length && i + j < cha.length && cha[i + j] == chb[j])
                j++;
            if (DEBUG_LOG) System.out.printf ("after advance >>> i:(%d), j:(%d)\n%s\n", i, j, debugKmp (A, B, i, j));
            if (j == chb.length) {
                res.add (i);
                continue;
            }
            if (i + j >= cha.length)
                break;
            if (j == 0)
                j++;
        }
        return res;
    }
}
/******************************************************************************/

    public static void main(String[] args) {
        ImplementstrStr2.Solution tester = new ImplementstrStr2.Solution();
        String[] inputs = {
            "BCBAABACAABABACAA", "ABABAC", "9",
            "ABABDABACDABABCABAB", "ABABCABAB", "10",
            "ABABDABACDABABCABABABABCABABABABCABAB", "ABABCABAB", "10",
            "AABRAACADABRAACAADABRA", "AACAA", "12",
            "", "", "0",
            "", "a", "-1",
        };
        for (int i = 0; i < inputs.length / 3; i++) {
            System.out.println(Printer.separator ());
            String haystack = inputs[3 * i];
            String needle = inputs[1 + 3 * i];
            int expected = Integer.parseInt(inputs[2 + 3 * i]);
            int output = tester.strStr(haystack, needle);
            Printer.openColor("magenta");
            System.out.print(haystack + " AND " + needle);
            Printer.closeColor();
            System.out.print(" -> " + output + ", ");
            Printer.openColor(output == expected ? "green" : "red");
            System.out.println("expected: " + expected);
            Printer.closeColor();
        }
        System.out.println ("Testing alternative implementation...");
        for (int i = 0; i < inputs.length / 3; i++) {
            String A = inputs[3 * i], B = inputs[3 * i + 1];
            System.out.println (Printer.separator ());
            List<Integer> output = tester.kmp (A, B);
            System.out.printf ("%s and %s -> \n%s\n", 
                A, B, output
            );
        }
    }

    static String debugKmp (String A, String B, int i, int j) {
        StringBuilder sb = new StringBuilder ();
        sb.append (A + '\n');
        for (int k = 0; k < i; k++)
            sb.append (' ');
        sb.append (Printer.wrapColor (B, "magenta") + '\n');
        for (int k = 0; k < i + j; k++)
            sb.append (' ');
        sb.append (Printer.wrapColor ("^", "yellow") + '\n');
        return sb.toString ();
    }
}

ver2尝试自己实现一下KMP;

在compute函数里面一个致命错误就是i的初始值: lps[0]一定是自己手动init到0的; 这个时候i就不能从0开始, 因为进入循环第一件事就是[i] and [len]进行比较; 这里的i的位置有点类似sliding window问题里面的right: 实际上总是向前走一步的, 也就是is the index to be determined/decided;

另外在search这一部分, 我相对于G4G上面的版本, 简化了一下, 我的版本就是一个decision structure的逻辑, match or mismatch; G4G那个版本则不是:

        while (i < N)  
        {  
            if (pat.charAt(j) == txt.charAt(i))  
            {  
                j++;  
                i++;  
            }  
            if (j == M)  
            {  
                System.out.println("Found pattern " + "at index " + (i-j));  
                j = lps[j-1];  
            }  

            // mismatch after j matches  
            else if (i < N && pat.charAt(j) != txt.charAt(i))  
            {  
                // Do not match lps[0..lps[j-1]] characters,  
                // they will match anyway  
                if (j != 0)  
                    j = lps[j-1];  
                else  
                    i = i+1;  
            }

可以看到,他这里的做法更像是把match作为一个conditional override来处理;

我认为我的逻辑更加简练和readable;

最后的速度只有11(40), LeetCode的case还是不讲道理;

不过说实话, KMP这种老比算法, 不自己一个字一个字的敲一遍, 是真的没有办法掌握; 就算是我今天其实实现了一遍, 我都不敢说明天让我重新默写一遍我就有把握, 不过这种训练有一次总归是好过没有的; 多来个几次就熟悉了, KMP肯定还会再次出现的; 最好还是把CLRS上面相关的内容看一下;

2018-01-09 14:17:44 更新

关于KMP, geeks4geeks上面提到的还有一个细节要有点印象:

The Naive pattern searching algorithm doesn’t work well in cases where we see many matching characters followed by a mismatching character. Following are some examples.

   txt[] = "AAAAAAAAAAAAAAAAAB"  
   pat[] = "AAAAB"  

   txt[] = "ABABABCABABABCABABABC"  
   pat[] =  "ABABAC" (not a worst case, but a bad case for Naive)

不能光是知道KMP怎么怎么好, naive的做法为什么不好, 有没有什么反例, 也要稍微能举出来一两个; 这个在面试的过程当中是可能被问到的;

对于KMP的具体理解还是建议看geeks4geeks上面的那个文章, 尤其是对于search和preprocess的两个对应的trace; 对于这两个trace烂熟于心之后, 才能够很轻松的把两个算法给默写出来;

我想了一下, 还是把search的trace贴在这里, 因为实在是太重要了; preprocess的算法虽然也有点小技巧, 不过都是编程技巧范围以内的小技巧而已; 关键是理解LPS表格代表的到底是什么, 然后怎样应用到search里面去; 话说到底, preprocess这个算法本身毕竟是很好debug的, 你给一个类似AAACAAAA这样的string, LPS的计算结果你是完全可以打印出来debug的; 这个过程非常的快;

txt[] = "AAAAABAAABA"   
         0123456789  
pat[] = "AAAA"  
lps[] = {0, 1, 2, 3}   

i = 0, j = 0  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] = "AAAA"  
txt[i] and pat[j[ match, do i++, j++  

i = 1, j = 1  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] = "AAAA"  
txt[i] and pat[j[ match, do i++, j++  

i = 2, j = 2  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] = "AAAA"  
pat[i] and pat[j[ match, do i++, j++  

i = 3, j = 3  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] = "AAAA"  
txt[i] and pat[j[ match, do i++, j++  

i = 4, j = 4  
Since j == M, print pattern found and resset j,  
j = lps[j-1] = lps[3] = 3  

Here unlike Naive algorithm, we do not match first three   
characters of this window. Value of lps[j-1] (in above   
step) gave us index of next character to match.  
i = 4, j = 3  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] =  "AAAA"  
txt[i] and pat[j[ match, do i++, j++  

i = 5, j = 4  
Since j == M, print pattern found and reset j,  
j = lps[j-1] = lps[3] = 3  

Again unlike Naive algorithm, we do not match first three   
characters of this window. Value of lps[j-1] (in above   
step) gave us index of next character to match.  
i = 5, j = 3  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] =   "AAAA"  
txt[i] and pat[j] do NOT match and j > 0, change only j  
j = lps[j-1] = lps[2] = 2  

i = 5, j = 2  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] =    "AAAA"  
txt[i] and pat[j] do NOT match and j > 0, change only j  
j = lps[j-1] = lps[1] = 1   

i = 5, j = 1  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] =     "AAAA"  
txt[i] and pat[j] do NOT match and j > 0, change only j  
j = lps[j-1] = lps[0] = 0  

i = 5, j = 0  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] =      "AAAA"  
txt[i] and pat[j] do NOT match and j is 0, we do i++.  

i = 6, j = 0  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] =       "AAAA"  
txt[i] and pat[j] match, do i++ and j++  

i = 7, j = 1  
         0123456789  
txt[] = "AAAAABAAABA"   
pat[] =       "AAAA"  
txt[i] and pat[j] match, do i++ and j++  

We continue this way...

preprocess的那个trace他们讲解的不算是特别优秀, 最好还是直接自己动笔画一下更加直观; 笔算的时候就用他们用的那个例子: AAACAAAA;

代码的话, 建议还是要稍微看一下geeks4geeks他们的代码, 毕竟他们的代码有comment解释, 我自己写的这个代码就是光秃秃的, 如果时间长忘记KMP本身是怎么回事, 回头再来看这个代码很容易完全不知所云;

后面重新思考了一下, 对于preprocess的算法, 好像并没有那么简单; 最好还是硬记下来. 重新理解了一下这个算法, 好像并不是那么trivial的一个计算方式;

preprocess的几个关键点就是:

  • 初始化len = 0, i = 1;
  • len可以理解为一个长度, 也可以理解the index yet to be included in [i]; 在实际写算法的过程中, 将其理解为一个end index (excluded index)其实更利于开发;
  • success的时候的顺序是, len++, then store to [i], then i++, 可以写成[i++] = ++len;
    • 为什么success的时候, 只要进行这个更新就行了? 这个背后的原因其实比你想象的还要复杂一些; 拿AAA这个例子计算一下就比较方便理解: 更进一步, 在判断是否可以len++的时候, 为什么我们只需要比较[len]和[i]就行了?
  • fail的时候len = lps[len - 1], 因为[len]是一个tentative的位置; 注意, 要判断len是否等于0;
    • 这个len = lps[len - 1]的退步方式, 代表的到底是什么含义? geeks4geeks上面说这个退步方式其实跟search的时候是类似的, 但是真的就是类似的吗? 为了理解这个, 一个比较好的例子是比较AAAAAAAC的trace; 尽管这样说, 这个退步操作本身还是比较难理解的, 可以看geeks4geeks给的几个例子, 全都有一个特点, 一旦有退步, 下一个LPS都是直接降低到0; 但是这里这个退步到前一个LPS, 而不是直接到0, 感觉是他们认为肯定是有这样一个情形的; 只不过暂时还没有想到合适的例子;
    • 这里介绍了一个可以称之为comparative trace的技术: 有时候单纯笔算一个case的trace还未必就足够, 还需要稍微给一个变形, 得到另外一个trace, 然后比较分析两者的差别来理解这个算法完成的到底是什么. 不过可惜的是, 在这个题目这里, 这个技术, 也就是AAAA和AAAC`的比较, 并没有给我带来更有价值的观点;
    • 单看AAAC这个trace, 当我们把len从2退步到1的时候, 下一步其实是继续退步的; 但是我们假设, 假设下一步不需要退步, 也就是pat[len = 1] == pat[i = 3], 那么下一步就是我们上面的那个branch的advance操作; 我觉得, 还是类似之前的分析, 只要你理解了为什么只要这里match就可以advance, 那么这个退步操作就很合理;
    • 遗憾的是, 我还是没有找到一个不会彻底退步到0的例子;

根据上面的分析, 可以看到preprocess算法真的是看起来简单, 但是实际上暗藏玄机的, 最好还是稍微记忆一下;


Problem Description

Implement strStr().

Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.

Difficulty:Easy
Total Accepted:189K
Total Submissions:676.9K
Contributor: LeetCode
Companies
pocket gems microsoft apple facebook
Related Topics
two pointers string
Similar Questions
Shortest Palindrome Repeated Substring Pattern

results matching ""

    No results matching ""