Formulating the problem
Recall that a substring is a contiguous subsequence of symbols of an original string. The problem of searching a substring in a string can be formulated as follows: given two strings (one of which is usually called a pattern, the other is called a text). We need to identify whether the pattern is contained as a substring in the text.
Consider a simple example: let a pattern is "ACA", a text is "ACBACAD". In this case, we can see that the pattern is a substring of the text because it's contained in it starting from the 3rd and ending to the 5th symbol (assuming 0-based indexing). Let's try to come up with an algorithm that can solve the problem for an arbitrary pattern and text.
The simplest algorithm
One of the simplest and natural algorithms for finding a pattern in a text is the following:
- Let's compare the pattern with the beginning of the text.
- If every symbol of the pattern matches the corresponding symbol of the text, we have found an occurrence.
- If at least one symbol of the pattern doesn't match with the corresponding symbol of the text, the current attempt has been failed. In this case, we need to move the pattern by one symbol to the right and try to perform a comparison again.
- Repeat steps 2 and 3 until the pattern is found or while the end of the text is reached. In case if none of the attempts was successful, we should indicate that there is no pattern in the text.
An example
The picture below illustrates how the algorithm works for the pattern "ACA" and the text "ACBACAD". Matching symbols are shown in green, non-matching ones are shown in red. Symbols not used in the current step are shown in blue.
In this example, you can see that sometimes there is no need to process all the symbols of the pattern. If we have a mismatch, say, in the first symbol, we needn't compare the remaining ones. In case of failure, we can immediately shift the pattern and start a new step.
Complexity analysis
What is the running time of the algorithm? Let's denote as the length of a pattern and as the length of a text. In the worst case, we need to process all symbols of the pattern on every step. The maximal number of steps is (assuming ). So, the overall running time is
Note that the algorithm requires additional memory.
Additional notes
- find the index of the first occurrence of a pattern in a text.
- find all occurrences of a pattern in a text.
- find all non-overlapping occurrences of a pattern in a text.