Cookie Preferences

We use cookies to enhance your experience. Choose your preference for cookie usage.

Essential cookies are required for basic functionality. Additional cookies help us improve our service and provide analytics.

View third-party services
  • Google Analytics: Website traffic analysis and conversion tracking
  • RudderStack: Analysis of user interactions with the website
  • Microsoft Clarity: User behavior analysis & session recordings
  • Facebook Pixel: Marketing analysis of user activity
EssentialsString algorithms

Rabin-Karp algorithm

Formulating the problem

In the previous lesson, we considered the algorithm for finding a pattern in a text. We also found out that the running time of the algorithm is O(pt) O(|p|·|t|) , where p |p| is the length of the pattern and t |t| is the length of the text. One may wonder: is there an algorithm that can solve the problem more efficiently? The answer is “yes”, and in this lesson, we will consider a more efficient approach to the problem, namely the Rabin-Karp algorithm. The main idea is to use hashing of strings and compare hash values instead of a symbol-by-symbol comparison.

A polynomial hash function

One of the ways to hash a string is to use a so-called polynomial hash function. For a string s s it can be defined as follows:

hash(s)=(s[0]a0+s[1]a1+...+s[s1]as1) mod m=(i=0s1s[i]ai) mod m. (1) hash(s) = \left( s[0] \cdot a^{0} + s[1] \cdot a^1 + ... + s[|s| - 1] \cdot a^{|s| - 1} \right) \ mod \ m = \left( \sum_{i=0}^{|s| - 1} s[i] \cdot a^{i} \right) \ mod \ m. \ (1)

Here s[i] s[i] is a value associated with the i-th symbol of the string. For example, it may be the sequence number of a symbol in the alphabet: 1 for A A , 2 for B B and so on. a a is a constant, usually a prime number approximately equal to the number of symbols in the alphabet, m m is a constant as well, usually chosen as a big prime number. Let’s consider how we can calculate a polynomial hash for a string s=ACDC s=ACDC . For simplicity, we will consider a a and m m to be equal 3 and 11 respectively.

hash(ACDC)=(s[0]30+s[1]31+s[2]32+s[3]33) mod 11=(11+33+49+327) mod 11=127 mod 11=6. hash(\textrm{ACDC}) = \left( s[0] \cdot 3^{0} + s[1] \cdot 3^{1} + s[2] \cdot 3^{2} + s[3] \cdot 3^{3} \right) \ mod \ 11 = \left(1 \cdot 1 + 3 \cdot 3 + 4 \cdot 9 + 3 \cdot 27 \right) \ mod \ 11 = 127 \ mod \ 11 = 6.

Recall that if two strings have different hash values, it follows that they are different. But if two hash values are the same, it does not mean that the corresponding strings are equal. For instance, BBAB BBAB and ABCC ABCC are the different strings, but their hash values are the same (this situation is called a collision):

hash(BBAC)=(21+23+13+327) mod 11=98 mod 11=5 hash(BBAC) = (2 \cdot 1 + 2 \cdot 3 + 1 \cdot 3 + 3 \cdot 27) \ mod \ 11 = 98 \ mod \ 11 = 5 ,

hash(ABCC)=(11+23+13+327) mod 11=115 mod 11=5 hash(ABCC) = (1 \cdot 1 + 2 \cdot 3 + 1 \cdot 3 + 3 \cdot 27) \ mod \ 11 = 115 \ mod \ 11 = 5 .

The described hash function has one important property: If we know a hash value for a substring of some string, we can calculate a hash value for a left neighbor substring using a constant number of operations. Consider how it works for a string s=ABBCD s=ABBCD . At first, let’s calculate a hash value for a suffix of s s having the length 3:

hash(s[2:5])=(s[2]30+s[3]31+s[4]32) mod 11=(21+33+49) mod 11=3. hash(s[2:5]) = \left(s[2] \cdot 3^{0} + s[3] \cdot 3^1 + s[4] \cdot 3^2 \right) \ mod \ 11 = \left(2 \cdot 1 + 3 \cdot 3 + 4 \cdot 9 \right) \ mod \ 11 = 3.

Then, to calculate a hash value for the neighbor substring of length 3, we need to compute the following:

hash(s[1:4])=(s[1]30+s[2]31+s[3]32) mod 11. hash(s[1:4]) = \left(s[1] \cdot 3^{0} + s[2] \cdot 3^1 + s[3] \cdot 3^2 \right) \ mod \ 11.

Now we can note that in order to get hash(s[1:4]) hash(s[1 : 4]) we need to subtract the last term of hash([2:5]) hash([2 : 5]) , multiply the resulting expression by 3, add the first term of hash([1:4]) hash([1 : 4]) and finally perform a modulo division. So,

hash(s[1:4])=((hash([2:5])s[4]32)3+s[1]) mod 11=((349)3+2) mod 11=97 mod 11=2. hash(s[1:4]) = \left( \left( hash([2:5]) - s[4] \cdot 3^{2} \right) \cdot 3 + s[1] \right) \ mod \ 11 = \left( (3 - 4 \cdot 9) \cdot 3 + 2 \right) \ mod \ 11 = -97 \ mod \ 11 = 2.

For an arbitrary substring s[i..j] s[i..j] of s s :

hash(s[i..j])=((hash(s[i+1..j+1])s[j]aji)a+s[i]) mod m. (2) hash(s[i..j]) = \left (\left( hash \left(s[i+1..j+1] \right) - s[j] \cdot a^{j-i} \right) \cdot a + s[i] \right) \ mod \ m. \ (2)

A function that has the described property is called a rolling hash. The property is crucial for the Rabin-Karp algorithm.

The Rabin-Karp algorithm

For a pattern p, a text t and the described hash function the Rabin-Karp algorithm can be formulated as follows:

  1. Calculate a hash value for the pattern hp h_p .
  2. Moving along the text from the right to the left, calculate a hash value hs h_s for the current substring using the formula (2).
  3. If hshp h_s \ne h_p , move to the next substring. If the hash values are equal, perform a symbol-by-symbol comparison. If the strings are indeed equal, add the starting index of the current substring to a list of occurrences.
  4. Repeat steps 2 and 3 until all the text is processed. Then, return the list of all occurrences.

An example

Consider how the algorithm works for a pattern p=ACDC p=ACDC , a text t=ACDCCBA t=ACDCCBA and the hash function described above.

At first, we calculate a hash value for the pattern:

hp=hash(ACDC)=(11+33+49+327) mod 11=6. h_p = hash(ACDC) = \left(1 \cdot 1 + 3 \cdot 3 + 4 \cdot 9 + 3 \cdot 27 \right) \ mod \ 11 = 6.

Then, we calculate a hash value for the first substring starting from the right of the text:

hc=hash(CCBA)=hash(t[3:7])=(31+33+29+127) mod 11=2. h_c = hash(CCBA) = hash(t[3:7]) = \left(3 \cdot 1 + 3 \cdot 3 + 2 \cdot 9 + 1 \cdot 27 \right) \ mod \ 11 = 2.

We can see that the hash values are not equal, so we move to the next substring.

hc=hash(t[2:6])=((hash(t[3:7])t[6]33)34) mod m=((2127)3+4) mod 11=71 mod 11=6. h_c = hash(t[2:6]) = \left( \left( hash(t[3:7]) - t[6] \cdot 3^3 \right) \cdot 3 - 4 \right) \ mod \ m = \left( \left(2 - 1 \cdot 27 \right) \cdot 3 + 4 \right) \ mod \ 11 = - 71 \ mod \ 11 = 6.

We see that h_p is equal to h_c, but having performed a symbol-by-symbol comparison we find out that actually, the strings are not equal. So, we move to the next step.

hc=hash(CDCC)=hash(s[1..5])=((6227)3+3) mod 11=141 mod 11=2. h_c = hash(CDCC) = hash(s[1..5]) = \left( \left(6 - 2 \cdot 27 \right) \cdot 3 + 3 \right) \ mod \ 11 \\ = -141 \ mod \ 11 = 2.

The hash values are not equal, move to the next step.

hc=hash(ACDC)=((2327)3+1) mod 11=236 mod 11=6. h_c = hash(ACDC) = \left( \left(2 - 3 \cdot 27 \right) \cdot 3 + 1 \right) \ mod \ 11 = -236 \ mod \ 11 = 6.

We see that hp=hc h_p = h_c and the strings are equal as well, so we have found an occurrence. Since all the text has been processed, we return a list consisting of the only founded index 0.

Note that in the first step we calculate a hash value for a substring using the formula (1) directly. The remaining substrings are calculated using the formula (2).

Complexity Analysis

Let’s estimate the running time of the algorithm for a pattern p p and a text t t (assuming that tp |t|≥|p| ).

At first, we calculate a hash value for the pattern and the first substring of the text using the formula (1). It takes 2·|p| operations. Then, we calculate a hash value for the remaining |t|−|p| substrings of the text using the formula (2). Each of these calculations requires only a constant number of operations. When the hash values of the pattern and the current substring are equal, we perform a symbol-by-symbol comparison which requires |p| operations in the worst case. So, the overall running time of the algorithm can be estimated as

2p+(tp+occp)=O(t+occp), 2 \cdot |p| + (|t| - |p| + occ \cdot |p|) = O(|t| + occ \cdot |p|),

where occ occ is the number of times the pattern occurs in the text.

There is one more detail to be taken into account. It may happen that a hash value of the pattern and the current substring are the same, but the strings are not equal. If this happens often, it may result in O(tp) O(|t|·|p|) running time.

For example, if a pattern p=AAA p=AAA , a text t=AAAAAAAA t=AAAAAAAA , the Rabin-Karp algorithm will perform a symbol-by-symbol comparison of the pattern with each substring of the text, so the overall running time will be O(tp) O(|t|·|p|) .

However, one may hardly encounter with such examples in real problems. Although even on “real” data collisions are possible, the probability of this event is very low (1/m ≈1/m ) if m m is a big prime number. So, the Rabin-Karp algorithm is a good choice for one who needs to solve the problem of finding a substring in a string.

How did you like the theory?
Report a typo