Introduction
For now, we considered two algorithms for finding a substring in a string. The first one is the simplest algorithm that performs a symbol-by-symbol comparison on each iteration and always runs in for a pattern and a text . The other one is the Rabin-Karp algorithm that uses string hashing. Its running time is ( is the number of times the pattern occurs in the text ), but in some cases, it may be as well.
In this lesson, we will learn another approach to the substring searching problem, namely, the Knuth-Morris-Pratt algorithm. Unlike the previous two algorithms, it has the running time in the worst case.
The intuition behind the KMP algorithm
As it was said before, the simplest algorithm performs a symbol-by-symbol comparison of the pattern with a substring of the text on each iteration. The problem of this approach is that it does not take into account the information gained on the previous steps. The KMP algorithm, on the contrary, doesn’t perform unnecessary comparisons. It uses information about the structure of the pattern in order to make the best possible shift in case of symbol’s mismatch.
Suppose we want to find a pattern in a text . Then, comparing the pattern with the beginning of the text, we will get the following:
We can see that there is a mismatch in the last symbol of the pattern: . The attempt is not successful and we need to shift the pattern to the right. But can we not to process the previous symbols of the text again and shift the pattern such that it would be possible to continue a comparison immediately from the next symbol ? To do this, we need to find the longest prefix of the pattern that matches with all symbols of the text before the next symbol.
We can see that for the pattern such prefix is . So, we can shift the pattern by 3 symbols to the right and continue a comparison of corresponding symbols knowing that the previous ones already match:
Note that it may happen that the length of such prefix is zero. In this case, the pattern is processed starting from the first symbol.
So, on each step of the algorithm, we want to find the described prefix fast. The structure that allows finding such prefixes is called the prefix function. Now let’s consider the structure in more details.
The prefix function
For a string the prefix function is an array of length such that is the length of the longest prefix of that also is a suffix of (prefixes equal to the string itself are not considered). By convention . Given below is an example of the prefix function for a string :
Consider in more details how the array was obtained:
- by convention.
- as well, since contains no prefixes equal to its suffixes.
- since the longest prefix of that also is a suffix is .
- similar to .
- similar to .
- since the longest prefix of that also is a suffix is .
- since the longest prefix of that also is a suffix is .
- similar to .
Further, for convenience, we will denote a prefix of that is also a suffix as a border of .
Our next goal is to implement an efficient algorithm for calculation of the prefix function. Before doing this, let’s learn a few properties of borders.
The properties of borders
Property 1. Let and are borders of a string and . Then is a border of . The following figure illustrates the statement:
Since is a prefix of , it is a prefix of as well. Similarly, is a suffix of and therefore is a suffix of . So, is a border of .
From this property, we can conclude that if is the longest border of , then the next longest border of can be found as the longest border of .
Property 2. Let is a border of a string . Then can be extended by a symbol a, if a is a border of a. The figure below explains the statement:
The algorithm for finding the prefix function
Using the described properties of borders, we can implement an algorithm for fast calculation of the prefix function for a string .
Suppose we already know . Then, to calculate we need to check whether we can extend the previous border of . If it is possible, then . Otherwise, the next longest borders , , are examined until an appropriate extension is found. If there are no borders that can be extended, is assigned to 0.
Given below is pseudocode of the described algorithm:
prefixFunction(s):
prefixFunc = [0,..,0]
for i = 1..|s|:
j = prefixFunc[i-1]
while j > 0 and s[i] != s[j]: // finding the next longest border
j = prefixFunc[j-1]
if s[i] == s[j]: // extend the border if possible
j += 1
prefixFunc[i] = j
return prefixFunc
An example
Let’s consider how we can calculate the prefix function for a string using the described algorithm.
- by convention.
- We see that and . The previous border cannot be extended, .
- and . The previous border can be extended,
- , but . So, we need to examine the previous longest border . We see that , so .
- and , so .
- and , so .
- and , so .
- , but , we need to check the next longest border . We see that , but . The next longest border can not be extended as well, since . So,.
Finally, we get the same prefix function as in the above example.
The Knuth-Morris-Pratt algorithm
Now, knowing how the prefix function works, we can formulate the KMP algorithm. Given a pattern and a text , a procedure for finding all occurrences of in is the following:
- Calculate the prefix function for the pattern .
- Moving along the text from the left to the right, perform a symbol-by-symbol comparison of the pattern with the current substring of the text. In case of a mismatch, find the longest prefix of the pattern that can be extended by the current symbol of the text using the prefix function and move to the next substring. If all symbols of the pattern match with the current substring of the text, add the current position to a list of all occurrences. In this case, a symbol of the pattern to be considered on the next iteration is the symbol after the longest border of the pattern itself, since we know that the previous ones already match with the current substring of the text.
- Repeat step 2 until all the text is processed. Then, return the list of all found occurrences.
Pseudocode for the algorithm can be written as follows:
kmpSearch(text, pattern):
prefixFunc = prefixFunction(pattern)
occurrences = List()
j = 0
for i = 0..|text|:
while j > 0 and text[i] != pattern[j]: // finding the next longest border
j = prefixFunc[j-1]
if text[i] == pattern[j]: // extend the current matching if possible
j += 1
if j == |pattern|:
occurrences.add(i-j+1)
j = prefixFunc[j-1]
return occurrences
An example
Let’s apply the described algorithm for a pattern and a text . At first, we need to calculate the prefix function for :
Then, let’s start comparing the pattern with the beginning of the text:
We see that there is a mismatch in the last symbol of the pattern: . Now, we want to find the longest prefix of the pattern that can be extended by and continue a comparison from . The length of the longest border of the matched prefix of the pattern is and . So, in the next step, we can start a comparison with the symbol of the pattern knowing that the previous ones already match:
Now we can see that . The length of the longest border of the matched prefix but it cannot be extended since . The length of the next longest border and . So, in the next step, we continue a comparison starting from the symbol of the pattern:
Now, all the corresponding symbols match. Since all the text is processed, we return a list consisting of the only found occurrence 7
Complexity analysis
Let’s estimate the running time of the algorithm for a pattern and a text . At first, consider the procedure.
In the outer loop, we increase the value of no more than times. In the inner loop, we decrease the value of by at least one on each iteration, since the length of the next longest border is strictly fewer than the length of the current one. As the number of increases of is limited by , the total number of decreases in the loop is limited by as well. So, the overall running time of the procedure is .
Now let’s analyze the . At first, we calculate the prefix function for the pattern, which requires operations. Then, we perform a search in the loop. As you can note, this loop is similar to the procedure and for the same reasons mentioned in the previous paragraph, it requires operations. Therefore, the overall running time of the is .
Note that we need additional memory to store the prefix function for the pattern.