Finding Patterns with Knuth Morris Pratt
Finding patterns (or substrings) in a piece of text is a very common action on computers. Just think about the amount of times you’ve used the shortcut ‘ctrl+f’ to find a word hidden somewhere in a large body of text. Or think about the amount of times you’ve used a search engine to find something on a website or on the internet. Pattern matching is everywhere in computer science and it’s a common subject for job interviews at tech companies. So plenty of reasons to make yourself more familiar with this handy piece of coding magic!
In this article we’ll look at one of the most used algorithms for pattern matching called Knuth Morris Pratt. We’ll also implement this algorithm in python to show what this would look like in a programming environment.
Why use Knuth Morris Pratt?
The goal of Knuth Morris Pratt (frequently called KMP) is to find a pattern/substring in a body of text. The most intuitive way to do this is to use a brute force algorithm that reads everything until it finds the pattern, this is called the naive approach.
In terms of finding the pattern this approach is successful. The problem is that it takes longer than KMP does. In fact, the worst case time complexity (for a string of size n and a pattern of size m) with the naive approach is m*n. This is far higher than KMP which has a worst case time complexity of n (which is linear in this case).
Knuth Morris Pratt explained
Now how exactly does this algorithm accomplish that? Short answer: KMP uses backtracking to avoid doing the same comparison double. It does this by creating failure links in the pattern that you want to look for. These are indices (characters in the pattern) that the algorithm goes back to in case there is a mismatch.
Failure links are created by comparing the prefix of the pattern with the suffix of the pattern. If these two overlap you know in advance that those characters must’ve matched earlier with the text. In that case it would be double work to go back to the beginning of the pattern and check these characters again (like we did in the naive approach). That’s why KMP goes back to the failure link instead of the beginning.
For example in the pattern aab. If there is a mismatch in b then you know that the two characters before that must’ve been a’s. Therefore it would be double work to go back to the beginning of the pattern and check the first a again. Hence there should be a failure link on the second character. If you do this for the rest of the pattern the result should look like this.
Now let’s go back to the example from earlier. It took the naive approach nine attempts to find the pattern. Let’s see if KMP can solve this problem more efficiently.
It took KMP seven attempts to find the pattern! This is (on a string of five characters) a significant improvement. Note how KMP uses failure links to avoid checking the first character twice.
Knuth Morris Pratt can be implemented in code. This example is written in Python.