ImplementstrStr2 [source code]
public class ImplementstrStr2 {
static
/******************************************************************************/
public class Solution {
static boolean DEBUG_LOG = true;
public int strStr(String haystack, String needle) {
List<Integer> res = new ArrayList<>();
char[] text = haystack.toCharArray();
char[] pat = needle.toCharArray();
int N = text.length, M = pat.length;
if (M == 0) return 0;
int[] lps = new int[M];
computeLPS(pat, lps);
if (DEBUG_LOG) System.out.println ("First implemenation LPS: " + Printer.array (lps));
int i = 0, j = 0;
while (i < N) {
if (text[i] == pat[j]) {
i++;
j++;
if (j == M) {
res.add(i - M);
j = lps[j - 1]; // easy to forget about: need to backup j here as well
}
} else {
// advance i or backup j
if (j != 0) j = lps[j - 1];
else i++;
}
}
return res.size() > 0 ? res.get(0) : -1;
}
void computeLPS(char[] pat, int[] lps) {
lps[0] = 0;
// length of the previous longest prefix suffix
// i can not be initialized to 0!!! subtle but fatal
int len = 0, i = 1;
while (i < pat.length) {
if (pat[i] == pat[len]) {
len++;
lps[i] = len;
i++;
} else {
if (len != 0) {
len = lps[len - 1];
} else {
lps[i] = 0; // can be omitted;
i++;
}
}
}
}
/* Alternative concise implementation
Inspired by: https://discuss.leetcode.com/topic/105579/c-4-lines-o-m-n-o-1-and-7-lines-kmp-o-m-n-o-n/18 */
List<Integer> kmp (String A, String B) {
List<Integer> res = new ArrayList<> ();
char[] cha = A.toCharArray (), chb = B.toCharArray ();
int[] lps = new int[chb.length];
for (int len = 0, i = 1; i < chb.length; ) {
if (chb[len] == chb[i])
lps[i++] = ++len;
else if (len > 0)
len = lps[len - 1];
else i++;
}
if (DEBUG_LOG) System.out.println ("Second implementation LPS: " + Printer.array (lps));
for (int i = 0, j = 0; i < cha.length; i += j - lps[j - 1], j = lps[j - 1]) {
if (DEBUG_LOG) System.out.printf ("i:(%d), j:(%d)\n%s\n", i, j, debugKmp (A, B, i, j));
while (j < chb.length && i + j < cha.length && cha[i + j] == chb[j])
j++;
if (DEBUG_LOG) System.out.printf ("after advance >>> i:(%d), j:(%d)\n%s\n", i, j, debugKmp (A, B, i, j));
if (j == chb.length) {
res.add (i);
continue;
}
if (i + j >= cha.length)
break;
if (j == 0)
j++;
}
return res;
}
}
/******************************************************************************/
public static void main(String[] args) {
ImplementstrStr2.Solution tester = new ImplementstrStr2.Solution();
String[] inputs = {
"BCBAABACAABABACAA", "ABABAC", "9",
"ABABDABACDABABCABAB", "ABABCABAB", "10",
"ABABDABACDABABCABABABABCABABABABCABAB", "ABABCABAB", "10",
"AABRAACADABRAACAADABRA", "AACAA", "12",
"", "", "0",
"", "a", "-1",
};
for (int i = 0; i < inputs.length / 3; i++) {
System.out.println(Printer.separator ());
String haystack = inputs[3 * i];
String needle = inputs[1 + 3 * i];
int expected = Integer.parseInt(inputs[2 + 3 * i]);
int output = tester.strStr(haystack, needle);
Printer.openColor("magenta");
System.out.print(haystack + " AND " + needle);
Printer.closeColor();
System.out.print(" -> " + output + ", ");
Printer.openColor(output == expected ? "green" : "red");
System.out.println("expected: " + expected);
Printer.closeColor();
}
System.out.println ("Testing alternative implementation...");
for (int i = 0; i < inputs.length / 3; i++) {
String A = inputs[3 * i], B = inputs[3 * i + 1];
System.out.println (Printer.separator ());
List<Integer> output = tester.kmp (A, B);
System.out.printf ("%s and %s -> \n%s\n",
A, B, output
);
}
}
static String debugKmp (String A, String B, int i, int j) {
StringBuilder sb = new StringBuilder ();
sb.append (A + '\n');
for (int k = 0; k < i; k++)
sb.append (' ');
sb.append (Printer.wrapColor (B, "magenta") + '\n');
for (int k = 0; k < i + j; k++)
sb.append (' ');
sb.append (Printer.wrapColor ("^", "yellow") + '\n');
return sb.toString ();
}
}
ver2尝试自己实现一下KMP;
在compute函数里面一个致命错误就是i的初始值: lps[0]一定是自己手动init到0的; 这个时候i就不能从0开始, 因为进入循环第一件事就是[i] and [len]进行比较; 这里的i的位置有点类似sliding window问题里面的right: 实际上总是向前走一步的, 也就是is the index to be determined/decided;
另外在search这一部分, 我相对于G4G上面的版本, 简化了一下, 我的版本就是一个decision structure的逻辑, match or mismatch; G4G那个版本则不是:
while (i < N)
{
if (pat.charAt(j) == txt.charAt(i))
{
j++;
i++;
}
if (j == M)
{
System.out.println("Found pattern " + "at index " + (i-j));
j = lps[j-1];
}
// mismatch after j matches
else if (i < N && pat.charAt(j) != txt.charAt(i))
{
// Do not match lps[0..lps[j-1]] characters,
// they will match anyway
if (j != 0)
j = lps[j-1];
else
i = i+1;
}
可以看到,他这里的做法更像是把match作为一个conditional override来处理;
我认为我的逻辑更加简练和readable;
最后的速度只有11(40), LeetCode的case还是不讲道理;
不过说实话, KMP这种老比算法, 不自己一个字一个字的敲一遍, 是真的没有办法掌握; 就算是我今天其实实现了一遍, 我都不敢说明天让我重新默写一遍我就有把握, 不过这种训练有一次总归是好过没有的; 多来个几次就熟悉了, KMP肯定还会再次出现的; 最好还是把CLRS上面相关的内容看一下;
2018-01-09 14:17:44 更新
关于KMP, geeks4geeks上面提到的还有一个细节要有点印象:
The Naive pattern searching algorithm doesn’t work well in cases where we see many matching characters followed by a mismatching character. Following are some examples.
txt[] = "AAAAAAAAAAAAAAAAAB"
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
不能光是知道KMP怎么怎么好, naive的做法为什么不好, 有没有什么反例, 也要稍微能举出来一两个; 这个在面试的过程当中是可能被问到的;
对于KMP的具体理解还是建议看geeks4geeks上面的那个文章, 尤其是对于search和preprocess的两个对应的trace; 对于这两个trace烂熟于心之后, 才能够很轻松的把两个算法给默写出来;
我想了一下, 还是把search的trace贴在这里, 因为实在是太重要了; preprocess的算法虽然也有点小技巧, 不过都是编程技巧范围以内的小技巧而已; 关键是理解LPS表格代表的到底是什么, 然后怎样应用到search里面去; 话说到底, preprocess这个算法本身毕竟是很好debug的, 你给一个类似AAACAAAA
这样的string, LPS的计算结果你是完全可以打印出来debug的; 这个过程非常的快;
txt[] = "AAAAABAAABA"
0123456789
pat[] = "AAAA"
lps[] = {0, 1, 2, 3}
i = 0, j = 0
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j[ match, do i++, j++
i = 1, j = 1
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j[ match, do i++, j++
i = 2, j = 2
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
pat[i] and pat[j[ match, do i++, j++
i = 3, j = 3
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j[ match, do i++, j++
i = 4, j = 4
Since j == M, print pattern found and resset j,
j = lps[j-1] = lps[3] = 3
Here unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 4, j = 3
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j[ match, do i++, j++
i = 5, j = 4
Since j == M, print pattern found and reset j,
j = lps[j-1] = lps[3] = 3
Again unlike Naive algorithm, we do not match first three
characters of this window. Value of lps[j-1] (in above
step) gave us index of next character to match.
i = 5, j = 3
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[2] = 2
i = 5, j = 2
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[1] = 1
i = 5, j = 1
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j > 0, change only j
j = lps[j-1] = lps[0] = 0
i = 5, j = 0
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] do NOT match and j is 0, we do i++.
i = 6, j = 0
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
i = 7, j = 1
0123456789
txt[] = "AAAAABAAABA"
pat[] = "AAAA"
txt[i] and pat[j] match, do i++ and j++
We continue this way...
preprocess的那个trace他们讲解的不算是特别优秀, 最好还是直接自己动笔画一下更加直观; 笔算的时候就用他们用的那个例子: AAACAAAA
;
代码的话, 建议还是要稍微看一下geeks4geeks他们的代码, 毕竟他们的代码有comment解释, 我自己写的这个代码就是光秃秃的, 如果时间长忘记KMP本身是怎么回事, 回头再来看这个代码很容易完全不知所云;
后面重新思考了一下, 对于preprocess的算法, 好像并没有那么简单; 最好还是硬记下来. 重新理解了一下这个算法, 好像并不是那么trivial的一个计算方式;
preprocess的几个关键点就是:
- 初始化
len = 0, i = 1
; - len可以理解为一个长度, 也可以理解the index yet to be included in [i]; 在实际写算法的过程中, 将其理解为一个end index (excluded index)其实更利于开发;
- success的时候的顺序是, len++, then store to [i], then i++, 可以写成
[i++] = ++len
;- 为什么success的时候, 只要进行这个更新就行了? 这个背后的原因其实比你想象的还要复杂一些; 拿
AAA
这个例子计算一下就比较方便理解: 更进一步, 在判断是否可以len++的时候, 为什么我们只需要比较[len]和[i]就行了?
- 为什么success的时候, 只要进行这个更新就行了? 这个背后的原因其实比你想象的还要复杂一些; 拿
- fail的时候
len = lps[len - 1]
, 因为[len]是一个tentative的位置; 注意, 要判断len是否等于0;- 这个
len = lps[len - 1]
的退步方式, 代表的到底是什么含义? geeks4geeks上面说这个退步方式其实跟search的时候是类似的, 但是真的就是类似的吗? 为了理解这个, 一个比较好的例子是比较AAAA
和AAAC
的trace; 尽管这样说, 这个退步操作本身还是比较难理解的, 可以看geeks4geeks给的几个例子, 全都有一个特点, 一旦有退步, 下一个LPS都是直接降低到0; 但是这里这个退步到前一个LPS, 而不是直接到0, 感觉是他们认为肯定是有这样一个情形的; 只不过暂时还没有想到合适的例子; - 这里介绍了一个可以称之为comparative trace的技术: 有时候单纯笔算一个case的trace还未必就足够, 还需要稍微给一个变形, 得到另外一个trace, 然后比较分析两者的差别来理解这个算法完成的到底是什么. 不过可惜的是, 在这个题目这里, 这个技术, 也就是
AAAA和
AAAC`的比较, 并没有给我带来更有价值的观点; - 单看
AAAC
这个trace, 当我们把len从2退步到1的时候, 下一步其实是继续退步的; 但是我们假设, 假设下一步不需要退步, 也就是pat[len = 1] == pat[i = 3]
, 那么下一步就是我们上面的那个branch的advance操作; 我觉得, 还是类似之前的分析, 只要你理解了为什么只要这里match就可以advance, 那么这个退步操作就很合理; - 遗憾的是, 我还是没有找到一个不会彻底退步到0的例子;
- 这个
根据上面的分析, 可以看到preprocess算法真的是看起来简单, 但是实际上暗藏玄机的, 最好还是稍微记忆一下;
Problem Description
Implement strStr().
Returns the index of the first occurrence of needle in haystack, or -1 if needle is not part of haystack.
Difficulty:Easy
Total Accepted:189K
Total Submissions:676.9K
Contributor: LeetCode
Companies
pocket gems microsoft apple facebook
Related Topics
two pointers string
Similar Questions
Shortest Palindrome Repeated Substring Pattern