Rabin-Karp algorithm

Formulating the problem

In the previous lesson, we considered the algorithm for finding a pattern in a text. We also found out that the running time of the algorithm is $O(|p|·|t|)$ , where $|p|$ is the length of the pattern and $|t|$ is the length of the text. One may wonder: is there an algorithm that can solve the problem more efficiently? The answer is “yes”, and in this lesson, we will consider a more efficient approach to the problem, namely the Rabin-Karp algorithm. The main idea is to use hashing of strings and compare hash values instead of a symbol-by-symbol comparison.

A polynomial hash function

One of the ways to hash a string is to use a so-called polynomial hash function. For a string $s$ it can be defined as follows:

$hash(s) = \left( s[0] \cdot a^{0} + s[1] \cdot a^1 + ... + s[|s| - 1] \cdot a^{|s| - 1} \right) \ mod \ m = \left( \sum_{i=0}^{|s| - 1} s[i] \cdot a^{i} \right) \ mod \ m. \ (1)$

Here $s[i]$ is a value associated with the $i$-th symbol of the string, for example its position in the alphabet: 1 for $A$, 2 for $B$, and so on. The constant $a$ is usually a prime number approximately equal to the number of symbols in the alphabet, and the constant $m$ is usually a big prime number. Let’s calculate the polynomial hash of the string $s=ACDC$. For simplicity, we take $a = 3$ and $m = 11$.

$hash(\textrm{ACDC}) = \left( s[0] \cdot 3^{0} + s[1] \cdot 3^{1} + s[2] \cdot 3^{2} + s[3] \cdot 3^{3} \right) \ mod \ 11 = \left(1 \cdot 1 + 3 \cdot 3 + 4 \cdot 9 + 3 \cdot 27 \right) \ mod \ 11 = 127 \ mod \ 11 = 6.$
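The calculation above can be reproduced with a minimal Python sketch of formula (1). The function names `symbol_value` and `poly_hash` are illustrative, and the encoding (1 for A, 2 for B, and so on) is the one assumed in the example, so it only makes sense for uppercase Latin letters.

```python
def symbol_value(ch: str) -> int:
    """Assumed encoding from the example: 'A' -> 1, 'B' -> 2, ..."""
    return ord(ch) - ord("A") + 1


def poly_hash(s: str, a: int = 3, m: int = 11) -> int:
    """Polynomial hash from formula (1): sum of s[i] * a**i, modulo m."""
    h = 0
    power = 1  # holds a**i mod m for the current position i
    for ch in s:
        h = (h + symbol_value(ch) * power) % m
        power = (power * a) % m
    return h


print(poly_hash("ACDC"))  # 6
```

Reducing modulo $m$ at every step keeps the intermediate numbers small; the result is the same as reducing once at the end.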

Recall that if two strings have different hash values, it follows that they are different. But if two hash values are the same, it does not mean that the corresponding strings are equal. For instance, $BBAB$ and $ABCC$ are different strings, but their hash values are the same (this situation is called a collision):

$hash(BBAB) = (2 \cdot 1 + 2 \cdot 3 + 1 \cdot 9 + 2 \cdot 27) \ mod \ 11 = 71 \ mod \ 11 = 5$,

$hash(ABCC) = (1 \cdot 1 + 2 \cdot 3 + 3 \cdot 9 + 3 \cdot 27) \ mod \ 11 = 115 \ mod \ 11 = 5$.
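With $m = 11$ there are only 11 possible hash values, so collisions are unavoidable as soon as there are more than 11 candidate strings. The following sketch (illustrative names, same assumed encoding A=1, B=2, ...) groups all 256 four-letter strings over the alphabet A, B, C, D by their hash value.

```python
from collections import defaultdict
from itertools import product


def poly_hash(s, a=3, m=11):
    # Formula (1) with the assumed encoding A=1, B=2, ...
    return sum((ord(ch) - ord("A") + 1) * a**i for i, ch in enumerate(s)) % m


# Group every 4-letter string over {A, B, C, D} by its hash value.
buckets = defaultdict(list)
for letters in product("ABCD", repeat=4):
    word = "".join(letters)
    buckets[poly_hash(word)].append(word)

# Both strings from the text land in the same bucket.
print("BBAB" in buckets[5], "ABCC" in buckets[5])  # True True
```

Since 256 strings are spread over at most 11 buckets, at least one bucket must contain 24 or more strings.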

The described hash function has one important property: if we know the hash value of a substring of some string, we can calculate the hash value of its left neighbor (the substring of the same length shifted one position to the left) using a constant number of operations. Consider how it works for a string $s=ABBCD$. First, let’s calculate the hash value of the suffix of $s$ of length 3:

$hash(s[2:5]) = \left(s[2] \cdot 3^{0} + s[3] \cdot 3^1 + s[4] \cdot 3^2 \right) \ mod \ 11 = \left(2 \cdot 1 + 3 \cdot 3 + 4 \cdot 9 \right) \ mod \ 11 = 3.$

Then, to calculate a hash value for the neighbor substring of length 3, we need to compute the following:

$hash(s[1:4]) = \left(s[1] \cdot 3^{0} + s[2] \cdot 3^1 + s[3] \cdot 3^2 \right) \ mod \ 11.$

Note that in order to get $hash(s[1:4])$ we need to subtract the last term of $hash(s[2:5])$, multiply the resulting expression by 3, add the first term of $hash(s[1:4])$ and finally take the result modulo 11. So,

$hash(s[1:4]) = \left( \left( hash(s[2:5]) - s[4] \cdot 3^{2} \right) \cdot 3 + s[1] \right) \ mod \ 11 = \left( (3 - 4 \cdot 9) \cdot 3 + 2 \right) \ mod \ 11 = -97 \ mod \ 11 = 2.$

For an arbitrary substring $s[i:j]$ of $s$ (the symbols $s[i], ..., s[j-1]$):

$hash(s[i:j]) = \left (\left( hash \left(s[i+1:j+1] \right) - s[j] \cdot a^{j-i-1} \right) \cdot a + s[i] \right) \ mod \ m. \ (2)$

A function that has the described property is called a rolling hash. The property is crucial for the Rabin-Karp algorithm.
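One step of the rolling update might look as follows in Python. This is a sketch under the same assumptions as the examples above ($a = 3$, $m = 11$, encoding A=1, B=2, ...); the names `value` and `roll_left` are illustrative.

```python
def value(ch):
    # Assumed encoding from the examples: 'A' -> 1, 'B' -> 2, ...
    return ord(ch) - ord("A") + 1


def roll_left(h_right, s, i, length, a=3, m=11):
    """Given h_right = hash(s[i+1 : i+1+length]), return hash(s[i : i+length])."""
    top = pow(a, length - 1, m)     # weight of the last symbol in a window
    dropped = value(s[i + length])  # last symbol of the right-hand window
    return ((h_right - dropped * top) * a + value(s[i])) % m


# hash(s[2:5]) = 3 for s = "ABBCD", so hash(s[1:4]) should come out as 2.
print(roll_left(3, "ABBCD", i=1, length=3))  # 2
```

The intermediate value can be negative, as in the example with $-97$. In Python, `%` with a positive modulus always returns a non-negative result, so no correction is needed. In languages where the remainder takes the sign of the dividend (C or Java, for example), $m$ has to be added to a negative remainder.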

The Rabin-Karp algorithm

For a pattern $p$, a text $t$ and the described hash function, the Rabin-Karp algorithm can be formulated as follows:

1. Calculate a hash value $h_p$ for the pattern.
2. Moving along the text from right to left, calculate a hash value $h_c$ for the current substring of length $|p|$: use formula (1) for the first (rightmost) substring and formula (2) for every subsequent one.
3. If $h_c \ne h_p$, move to the next substring. If the hash values are equal, perform a symbol-by-symbol comparison. If the strings are indeed equal, add the starting index of the current substring to a list of occurrences.
4. Repeat steps 2 and 3 until all the text is processed. Then, return the list of all occurrences.
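The steps above can be put together into a short Python sketch. It is not a reference implementation: the function name `rabin_karp`, the defaults $a = 3$ and $m = 11$, and the encoding A=1, B=2, ... are assumptions carried over from the examples, and in practice a much larger $m$ would be chosen.

```python
def rabin_karp(pattern, text, a=3, m=11):
    """Return the starting indices of all occurrences of pattern in text."""
    n, k = len(text), len(pattern)
    if k == 0 or k > n:
        return []

    def value(ch):
        # Assumed encoding: 'A' -> 1, 'B' -> 2, ...
        return ord(ch) - ord("A") + 1

    def direct_hash(s):
        # Formula (1), evaluated term by term.
        return sum(value(ch) * pow(a, i, m) for i, ch in enumerate(s)) % m

    h_p = direct_hash(pattern)        # step 1
    h_c = direct_hash(text[n - k:])   # rightmost window, formula (1)
    top = pow(a, k - 1, m)            # weight of the last symbol in a window
    occurrences = []

    for start in range(n - k, -1, -1):  # scan from right to left
        # Step 3: compare symbols only when the hashes agree.
        if h_c == h_p and text[start:start + k] == pattern:
            occurrences.append(start)
        if start > 0:
            # Step 2, formula (2): drop the last symbol, shift, add a new first one.
            dropped = value(text[start + k - 1])
            h_c = ((h_c - dropped * top) * a + value(text[start - 1])) % m

    occurrences.reverse()  # report indices in increasing order
    return occurrences


print(rabin_karp("ACDC", "ACDCCBA"))  # [0]
```

Occurrences are found from right to left, so the list is reversed at the end to return the indices in increasing order; this ordering is a choice of the sketch, not a requirement of the algorithm.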

An example

Consider how the algorithm works for a pattern $p=ACDC$ , a text $t=ACDCCBA$ and the hash function described above.

At first, we calculate a hash value for the pattern:

$h_p = hash(ACDC) = \left(1 \cdot 1 + 3 \cdot 3 + 4 \cdot 9 + 3 \cdot 27 \right) \ mod \ 11 = 6.$

Then, we calculate a hash value for the first substring starting from the right of the text:

$h_c = hash(CCBA) = hash(t[3:7]) = \left(3 \cdot 1 + 3 \cdot 3 + 2 \cdot 9 + 1 \cdot 27 \right) \ mod \ 11 = 2.$

We can see that the hash values are not equal, so we move to the next substring.

$h_c = hash(DCCB) = hash(t[2:6]) = \left( \left( hash(t[3:7]) - t[6] \cdot 3^3 \right) \cdot 3 + t[2] \right) \ mod \ 11 = \left( \left(2 - 1 \cdot 27 \right) \cdot 3 + 4 \right) \ mod \ 11 = -71 \ mod \ 11 = 6.$

We see that $h_p$ is equal to $h_c$, but a symbol-by-symbol comparison shows that the strings $ACDC$ and $DCCB$ are not equal: this is a collision. So, we move to the next step.

$h_c = hash(CDCC) = hash(t[1:5]) = \left( \left(6 - 2 \cdot 27 \right) \cdot 3 + 3 \right) \ mod \ 11 = -141 \ mod \ 11 = 2.$

The hash values are not equal, so we move to the next step.

$h_c = hash(ACDC) = \left( \left(2 - 3 \cdot 27 \right) \cdot 3 + 1 \right) \ mod \ 11 = -236 \ mod \ 11 = 6.$

We see that $h_p = h_c$ and the strings are equal as well, so we have found an occurrence. Since all the text has been processed, we return a list consisting of the single index found, 0.

Note that in the first step we calculate a hash value for a substring using the formula (1) directly. The remaining substrings are calculated using the formula (2).
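To check the walkthrough, the windows and their hash values can be listed in the order the algorithm visits them. The generator below is a sketch with an illustrative name (`trace`) and the same assumptions as the example ($a = 3$, $m = 11$, A=1, B=2, ...).

```python
def trace(pattern, text, a=3, m=11):
    """Yield (start, window, hash) for every window, scanning right to left."""
    def value(ch):
        # Assumed encoding: 'A' -> 1, 'B' -> 2, ...
        return ord(ch) - ord("A") + 1

    n, k = len(text), len(pattern)
    top = pow(a, k - 1, m)
    # The first (rightmost) window is hashed directly with formula (1).
    h_c = sum(value(ch) * pow(a, i, m) for i, ch in enumerate(text[n - k:])) % m
    for start in range(n - k, -1, -1):
        yield start, text[start:start + k], h_c
        if start > 0:
            # Every later window is obtained with formula (2).
            h_c = ((h_c - value(text[start + k - 1]) * top) * a
                   + value(text[start - 1])) % m


for row in trace("ACDC", "ACDCCBA"):
    print(row)
# (3, 'CCBA', 2)
# (2, 'DCCB', 6)
# (1, 'CDCC', 2)
# (0, 'ACDC', 6)
```

The hash 6 appears twice: once for the collision $DCCB$ and once for the real occurrence $ACDC$.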

Complexity Analysis

Let’s estimate the running time of the algorithm for a pattern $p$ and a text $t$ (assuming that $|t|≥|p|$ ).

First, we calculate a hash value for the pattern and for the first substring of the text using formula (1). It takes $2 \cdot |p|$ operations. Then, we calculate a hash value for the remaining $|t| - |p|$ substrings of the text using formula (2). Each of these calculations requires only a constant number of operations. When the hash values of the pattern and the current substring are equal, we perform a symbol-by-symbol comparison, which requires $|p|$ operations in the worst case. So, the overall running time of the algorithm can be estimated as

$2 \cdot |p| + (|t| - |p| + occ \cdot |p|) = O(|t| + occ \cdot |p|),$

where $occ$ is the number of times the pattern occurs in the text.

There is one more detail to be taken into account. It may happen that a hash value of the pattern and the current substring are the same, but the strings are not equal. If this happens often, it may result in $O(|t|·|p|)$ running time.

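The same bound is reached when the pattern occurs very often. For example, for the pattern $p=AAA$ and the text $t=AAAAAAAA$, every substring of length 3 is an occurrence of the pattern, so the Rabin-Karp algorithm performs a symbol-by-symbol comparison for each of them and the overall running time is $O(|t|·|p|)$. Here the cost comes from the $occ \cdot |p|$ term and arises even without any collisions.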
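The number of symbol-by-symbol comparisons triggered on such an input can be counted with a small sketch. For brevity it hashes every window directly with formula (1) instead of rolling; the hash values, and therefore the counts, are the same. The function names and the encoding A=1, B=2, ... are assumptions.

```python
def poly_hash(s, a=3, m=11):
    # Formula (1) with the assumed encoding A=1, B=2, ...
    return sum((ord(ch) - ord("A") + 1) * a**i for i, ch in enumerate(s)) % m


def count_verifications(pattern, text, a=3, m=11):
    """Return (number of windows, number of windows whose hash equals h_p)."""
    k = len(pattern)
    h_p = poly_hash(pattern, a, m)
    windows = [text[i:i + k] for i in range(len(text) - k + 1)]
    hits = [w for w in windows if poly_hash(w, a, m) == h_p]
    return len(windows), len(hits)


print(count_verifications("AAA", "AAAAAAAA"))  # (6, 6)
print(count_verifications("ACDC", "ACDCCBA"))  # (4, 2)
```

In the first call all 6 windows have to be verified, which costs up to $6 \cdot 3 = 18$ symbol comparisons. In the second call only 2 of the 4 windows are verified.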

However, such inputs are rare in real problems. Collisions are possible on “real” data as well, but the probability that two different strings collide is very low ($≈1/m$) if $m$ is a big prime number. So, the Rabin-Karp algorithm is a good choice for finding a substring in a string.
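The effect of the modulus can be seen on the example from this lesson. The sketch below counts the windows whose hash equals the pattern's hash although the strings differ, first for $m = 11$ and then for a large prime. The value $1\,000\,000\,007$ is just one commonly used prime chosen here for illustration, and the function names are assumptions as well.

```python
def poly_hash(s, a, m):
    # Formula (1) with the assumed encoding A=1, B=2, ...
    return sum((ord(ch) - ord("A") + 1) * a**i for i, ch in enumerate(s)) % m


def spurious_hits(pattern, text, a, m):
    """Count windows whose hash matches the pattern's while the strings differ."""
    k = len(pattern)
    h_p = poly_hash(pattern, a, m)
    count = 0
    for start in range(len(text) - k + 1):
        window = text[start:start + k]
        if window != pattern and poly_hash(window, a, m) == h_p:
            count += 1
    return count


print(spurious_hits("ACDC", "ACDCCBA", a=3, m=11))             # 1 (the window DCCB)
print(spurious_hits("ACDC", "ACDCCBA", a=3, m=1_000_000_007))  # 0
```

A large modulus alone is not a guarantee: with $a = 3$ and four different symbol values, two different strings can already produce the same polynomial value before the modulo is taken, which is one reason to choose $a$ no smaller than the alphabet size.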
