EssentialsString algorithms

Searching a substring

Formulating the problem

Recall that a substring is a contiguous subsequence of symbols of an original string. The problem of searching a substring in a string can be formulated as follows: given two strings (one of which is usually called a pattern, the other is called a text). We need to identify whether the pattern is contained as a substring in the text.

Consider a simple example: let a pattern is "ACA", a text is "ACBACAD". In this case, we can see that the pattern is a substring of the text because it's contained in it starting from the 3rd and ending to the 5th symbol (assuming 0-based indexing). Let's try to come up with an algorithm that can solve the problem for an arbitrary pattern and text.

The simplest algorithm

One of the simplest and natural algorithms for finding a pattern in a text is the following:

Let's compare the pattern with the beginning of the text.
If every symbol of the pattern matches the corresponding symbol of the text, we have found an occurrence.
If at least one symbol of the pattern doesn't match with the corresponding symbol of the text, the current attempt has been failed. In this case, we need to move the pattern by one symbol to the right and try to perform a comparison again.
Repeat steps 2 and 3 until the pattern is found or while the end of the text is reached. In case if none of the attempts was successful, we should indicate that there is no pattern in the text.

An example

The picture below illustrates how the algorithm works for the pattern "ACA" and the text "ACBACAD". Matching symbols are shown in green, non-matching ones are shown in red. Symbols not used in the current step are shown in blue.

In the first step, we compare the pattern with the beginning of the text. The first two symbols match but the 3rd doesn't. The attempt isn't successful, we shift the pattern to the right. In the 2nd and the 3rd steps, we try to compare the corresponding symbols again but have a mismatch in the first symbol. In the 4th step, all the corresponding symbols match to each other, so we have found an occurrence.

In this example, you can see that sometimes there is no need to process all the symbols of the pattern. If we have a mismatch, say, in the first symbol, we needn't compare the remaining ones. In case of failure, we can immediately shift the pattern and start a new step.

Complexity analysis

What is the running time of the algorithm? Let's denote $|p|$ as the length of a pattern and $|t|$ as the length of a text. In the worst case, we need to process all $|p|$ symbols of the pattern on every step. The maximal number of steps is $|t| - |p| + 1$ (assuming $|t| \ge |p|$ ). So, the overall running time is

(|t| - |p| + 1) \cdot |p| = |t| \cdot |p| - |p|^2 + |p| = O(|t| \cdot |p|)

Note that the algorithm requires $O(1)$ additional memory.

Additional notes

For now, we know the algorithm that can answer "yes" if a pattern is contained in the text and "no" otherwise. But the problem can also be formulated in other ways. We may need to

find the index of the first occurrence of a pattern in a text.
find all occurrences of a pattern in a text.
find all non-overlapping occurrences of a pattern in a text.

These variations will be suggested as a programming exercise in next.

How did you like the theory?

Report a typo

Cookie Preferences