EssentialsString algorithms

Knuth-Morris-Pratt algorithm

Introduction

For now, we considered two algorithms for finding a substring in a string. The first one is the simplest algorithm that performs a symbol-by-symbol comparison on each iteration and always runs in $O(|t|·|p|)$ for a pattern $p$ and a text $t$ . The other one is the Rabin-Karp algorithm that uses string hashing. Its running time is $O(|t|+occ·|p|)$ ( $occ$ is the number of times the pattern $p$ occurs in the text $t$ ), but in some cases, it may be $O(|t|·|p|)$ as well.

In this lesson, we will learn another approach to the substring searching problem, namely, the Knuth-Morris-Pratt algorithm. Unlike the previous two algorithms, it has the running time $O(|t|+|p|)$ in the worst case.

The intuition behind the KMP algorithm

As it was said before, the simplest algorithm performs a symbol-by-symbol comparison of the pattern with a substring of the text on each iteration. The problem of this approach is that it does not take into account the information gained on the previous steps. The KMP algorithm, on the contrary, doesn’t perform unnecessary comparisons. It uses information about the structure of the pattern in order to make the best possible shift in case of symbol’s mismatch.

Suppose we want to find a pattern $p=ABCABD$ in a text $t=ABCABCABD$ . Then, comparing the pattern with the beginning of the text, we will get the following:

We can see that there is a mismatch in the last symbol of the pattern: $t[5] \ne p[5]$ . The attempt is not successful and we need to shift the pattern to the right. But can we not to process the previous symbols of the text again and shift the pattern such that it would be possible to continue a comparison immediately from the next symbol $t[6]$ ? To do this, we need to find the longest prefix of the pattern that matches with all symbols of the text before the next symbol.

We can see that for the pattern such prefix is $p[0..3] = ABC$ . So, we can shift the pattern by 3 symbols to the right and continue a comparison of corresponding symbols knowing that the previous ones already match:

Note that it may happen that the length of such prefix is zero. In this case, the pattern is processed starting from the first symbol.

So, on each step of the algorithm, we want to find the described prefix fast. The structure that allows finding such prefixes is called the prefix function. Now let’s consider the structure in more details.

The prefix function

For a string $s$ the prefix function is an array $pf$ of length $|s|$ such that $pf[i]$ is the length of the longest prefix of $s[0..i+1]$ that also is a suffix of $s[0..i+ 1]$ (prefixes equal to the string itself are not considered). By convention $p[0] = 0$ . Given below is an example of the prefix function for a string $s=ABACABAD$ :

Consider in more details how the array $pf$ was obtained:

$pf[0] = 0$ by convention.
$pf[1] = 0$ as well, since $s[0..2] =AB$ contains no prefixes equal to its suffixes.
$pf[2] = 1$ since the longest prefix of $s[0..3] =\color{red}A\color{black}B\color{red}A$ that also is a suffix is $A$ .
$pf[3] = 0$ similar to $pf[1]$ .
$pf[4] = 1$ similar to $pf[2]$ .
$pf[5] = 2$ since the longest prefix of $s[0..6] =\color{red}AB\color{black}AC\color{red}AB$ that also is a suffix is $AB$ .
$pf[6] = 3$ since the longest prefix of $s[0..7] =\color{red}ABA\color{black}C\color{red}ABA$ that also is a suffix is $ABA$ .
$pf[7] = 0$ similar to $pf[1]$ .

Further, for convenience, we will denote a prefix of $s$ that is also a suffix as a border of $s$ .

Our next goal is to implement an efficient algorithm for calculation of the prefix function. Before doing this, let’s learn a few properties of borders.

The properties of borders

Property 1. Let $a$ and $b$ are borders of a string $s$ and $|a|\lt |b|$ . Then $a$ is a border of $b$ . The following figure illustrates the statement:

Since $a$ is a prefix of $s$ , it is a prefix of $b$ as well. Similarly, $a$ is a suffix of $s$ and therefore is a suffix of $b$ . So, $a$ is a border of $b$ .

From this property, we can conclude that if $b$ is the longest border of $s$ , then the next longest border of $s$ can be found as the longest border of $b$ .

Property 2. Let $b$ is a border of a string $s$ . Then $b$ can be extended by a symbol a, if $b$ a is a border of $s$ a. The figure below explains the statement:

The algorithm for finding the prefix function

Using the described properties of borders, we can implement an algorithm for fast calculation of the prefix function $pf$ for a string $s$ .

Suppose we already know $pf[0..i−1]$ . Then, to calculate $pf[i]$ we need to check whether we can extend the previous border $pf[i−1]$ of $s$ . If it is possible, then $pf[i] =pf[i−1] + 1$ . Otherwise, the next longest borders $pf[pf[i−1]−1]$ , $pf[pf[pf[i−1]−1]−1]$ , $...$ are examined until an appropriate extension is found. If there are no borders that can be extended, $pf[i]$ is assigned to 0.

Given below is pseudocode of the described algorithm:

prefixFunction(s):
    prefixFunc = [0,..,0]

    for i = 1..|s|:
        j = prefixFunc[i-1]

        while j > 0 and s[i] != s[j]: // finding the next longest border
            j = prefixFunc[j-1]

        if s[i] == s[j]: // extend the border if possible
            j += 1

        prefixFunc[i] = j

    return prefixFunc

An example

Let’s consider how we can calculate the prefix function for a string $s = \overset{0}{\textrm{A}} \overset{1}{\textrm{B}} \overset{2}{\textrm{A}} \overset{3}{\textrm{C}} \overset{4}{\textrm{A}} \overset{5}{\textrm{B}} \overset{6}{\textrm{A}} \overset{7}{\textrm{D}}$ using the described algorithm.

$pf[0] = 0$ by convention.
We see that $pf[0] = 0$ and $s[0] \ne s[1]$ . The previous border cannot be extended, $pf[1] = 0$ .
$pf[1] = 0$ and $s[0] =s[2]$ . The previous border can be extended, $pf[2] = pf[1] + 1 = 1$
$pf[2] = 1$ , but $s[3] \ne s[1]$ . So, we need to examine the previous longest border $pf[pf[2]−1] = pf[0] = 0$ . We see that $s[3] \ne s[0]$ , so $pf[3] = 0$ .
$pf[3] = 0$ and $s[4] =s[0]$ , so $pf[4] = 1$ .
$pf[4] = 1$ and $s[5] =s[1]$ , so $pf[5] =pf[4] + 1 = 2$ .
$pf[5] = 2$ and $s[6] =s[2]$ , so $pf[6] =pf[5] + 1 = 3$ .
$pf[6] = 3$ , but $s[7] \ne s[3]$ , we need to check the next longest border $pf[pf[6]−1] =pf[2]$ . We see that $pf[2] = 1$ , but $s[1] \ne s[7]$ . The next longest border $pf[pf[2]−1] =pf[0] = 0$ can not be extended as well, since $s[0] \ne s[7]$ . So, $s[7] = 0$ .

Finally, we get the same prefix function $pf= [0,0,1,0,1,2,3,0]$ as in the above example.

The Knuth-Morris-Pratt algorithm

Now, knowing how the prefix function works, we can formulate the KMP algorithm. Given a pattern $p$ and a text $t$ , a procedure for finding all occurrences of $p$ in $t$ is the following:

Calculate the prefix function for the pattern $p$ .
Moving along the text from the left to the right, perform a symbol-by-symbol comparison of the pattern with the current substring of the text. In case of a mismatch, find the longest prefix of the pattern that can be extended by the current symbol of the text using the prefix function and move to the next substring. If all symbols of the pattern match with the current substring of the text, add the current position to a list of all occurrences. In this case, a symbol of the pattern to be considered on the next iteration is the symbol after the longest border of the pattern itself, since we know that the previous ones already match with the current substring of the text.
Repeat step 2 until all the text is processed. Then, return the list of all found occurrences.

Pseudocode for the algorithm can be written as follows:

kmpSearch(text, pattern):
    prefixFunc = prefixFunction(pattern)
    occurrences = List()
    j = 0

    for i = 0..|text|:
        while j > 0 and text[i] != pattern[j]: // finding the next longest border
            j = prefixFunc[j-1]

        if text[i] == pattern[j]: // extend the current matching if possible
            j += 1

        if j == |pattern|:
            occurrences.add(i-j+1)
            j = prefixFunc[j-1]

    return occurrences

An example

Let’s apply the described algorithm for a pattern $p=ABCABD$ and a text $t=ABCABCAABCABD$ . At first, we need to calculate the prefix function for $p$ :

Then, let’s start comparing the pattern with the beginning of the text:

We see that there is a mismatch in the last symbol of the pattern: $t[5] \ne p[5]$ . Now, we want to find the longest prefix of the pattern that can be extended by $t[5]$ and continue a comparison from $t[6]$ . The length of the longest border of the matched prefix of the pattern is $pf[4] = 2$ and $t[5] =s[2]$ . So, in the next step, we can start a comparison with the symbol $s[3]$ of the pattern knowing that the previous ones already match:

Now we can see that $t[7] \ne p[4]$ . The length of the longest border of the matched prefix $pf[3] = 1$ but it cannot be extended since $t[7] \ne p[1]$ . The length of the next longest border $pf[pf[3]−1] =pf[0] = 0$ and $t[7] =s[0]$ . So, in the next step, we continue a comparison starting from the symbol $s[1]$ of the pattern:

Now, all the corresponding symbols match. Since all the text is processed, we return a list consisting of the only found occurrence 7

Complexity analysis

Let’s estimate the running time of the algorithm for a pattern $p$ and a text $t$ . At first, consider the $prefixFunction$ procedure.

In the outer $for$ loop, we increase the value of $j$ no more than $|p|$ times. In the inner $while$ loop, we decrease the value of $j$ by at least one on each iteration, since the length of the next longest border is strictly fewer than the length of the current one. As the number of increases of $j$ is limited by $|p|$ , the total number of decreases in the $while$ loop is limited by $|p|$ as well. So, the overall running time of the procedure is $2·|p|=O(|p|)$ .

Now let’s analyze the $kmpSearch$ . At first, we calculate the prefix function for the pattern, which requires $O(|p|)$ operations. Then, we perform a search in the $for$ loop. As you can note, this loop is similar to the $prefixFunction$ procedure and for the same reasons mentioned in the previous paragraph, it requires $O(|t|)$ operations. Therefore, the overall running time of the $kmpSearch$ is $O(|p|+|t|)$ .

Note that we need $O(|p|)$ additional memory to store the prefix function for the pattern.

How did you like the theory?

Report a typo

Cookie Preferences

Knuth-Morris-Pratt algorithm

Introduction

The intuition behind the KMP algorithm

The prefix function

The properties of borders

The algorithm for finding the prefix function

An example

The Knuth-Morris-Pratt algorithm

An example

Complexity analysis

Related topics