Stop Stressing Over Regular Expressions

Create readable expressions with this library instead

Timo Kats
Level Up Coding


Regular expressions are patterns, written as strings, that can match input text. They were first described by the mathematician Stephen Kleene in the 1950s, found their way into early Unix text tools at Bell Labs, and are now available in most modern code editors and programming languages.

Regular expressions (also referred to as RegEx) can be used for various purposes. For example, a common task in Information Retrieval (IR) is to write an expression that matches email addresses in a piece of text. This results in the following expression…

^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
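To see what this pattern does in practice, here is a minimal sketch using Python's standard re module. The function names and the sample sentence are made up for illustration. Because the pattern is anchored with ^ and $, it checks one whole candidate string at a time, so the sketch splits the text into words first.

```python
import re

# The email pattern from above, ending at the "$" anchor.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)

def parse_email(candidate):
    """Return (user, host, extension) if the whole string is an email, else None."""
    match = EMAIL_PATTERN.match(candidate)
    return match.groups() if match else None

def extract_emails(text):
    """Split on whitespace, trim surrounding punctuation, keep words that match."""
    words = (word.strip(".,;:!?") for word in text.split())
    return [word for word in words if parse_email(word)]

print(parse_email("jane.doe@example.com"))
print(extract_emails("Contact jane.doe@example.com or bob@mail.example.co.uk today."))
```

The three capture groups give the user name, the host and the extension, which is exactly the kind of detail that is hard to read off the raw pattern.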

Given this example, it should not be surprising that regular expressions don’t have the friendliest learning curve in programming. As a result, the collective verdict on regular expressions is undecided. Some programmers like them for their versatility and their availability in most languages and editors. Others are frustrated that each time they want to write an expression, they need to consult an online regex builder or Stack Overflow, even for simple or common tasks.

Readable expressions

For this second group of programmers, a solution is available: “readable expressions” (Redex). Redex is an alternative method for creating regular expressions. It’s based on two simple elements: boolean operators and built-in functions. Moreover, to improve the learning curve and readability, both elements aim to resemble plain English.

Examples

If we rewrite the previous example (i.e. creating a pattern that recognizes email addresses) in Redex, we get the following expression. Note how, in contrast to the previous example, it’s human readable.

sequence:{*alpha,@,*alpha,.com} or sequence:{*alpha,@,*alpha,.co.uk}

Let’s dissect this example. The (rough) format of an email is something alphabetic, followed by an ‘@’ symbol, and a host/extension. In other words, when searching for email addresses in text, we want a sequence of characters that follows this order.

In our example this is implemented using “sequence”, which is a built-in function of Redex. In total, Redex has 7 of these built-in functions that you can use to create expressions. Besides “sequence”, these are “startswith”, “endswith”, “count”, “contains”, “location” and “proximity”. The details of these functions and their usage are explained in the documentation of Redex, which is available on GitHub.

Next, since we don’t know the exact content of the email, we use wildcard characters (also referred to as ‘wildcards’). For this, Redex has a number of built-in options. Unlike in regular expressions, the formatting of these wildcards is very similar to English.

For example, in the previous expression we used the wildcard for anything alphabetic (which is expressed as ‘*alpha’). If you want to match anything numeric instead, simply write ‘*num’. All wildcard characters adhere to this format. In total, Redex supports 8 types of wildcards, and you can add custom wildcards if you’d like.
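To make the idea of a sequence with wildcards more concrete, here is a rough sketch that translates such a sequence into an ordinary regular expression. This is not how Redex is implemented. The wildcard table, the regex fragments chosen for ‘*alpha’ and ‘*num’, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical wildcard table. Only *alpha and *num are named in the text,
# and the regex fragments chosen for them here are assumptions.
WILDCARDS = {
    "*alpha": "[A-Za-z]+",
    "*num": "[0-9]+",
}

def sequence_to_regex(tokens):
    """Join tokens in order: wildcards become regex fragments, the rest is literal."""
    parts = [WILDCARDS.get(token, re.escape(token)) for token in tokens]
    return re.compile("".join(parts))

def is_email(word):
    """Mimic: sequence:{*alpha,@,*alpha,.com} or sequence:{*alpha,@,*alpha,.co.uk}"""
    com = sequence_to_regex(["*alpha", "@", "*alpha", ".com"])
    co_uk = sequence_to_regex(["*alpha", "@", "*alpha", ".co.uk"])
    return bool(com.fullmatch(word) or co_uk.fullmatch(word))

print(is_email("jane@example.com"))
print(is_email("jane@example.org"))
```

Under these assumptions, an address such as jane.doe@example.com would not match, because the dot in the user name is not alphabetic. That is the trade-off of the simpler, more readable expression.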

Using Redex

Currently, Redex is available in Python through the python-redex library. To start using this library, simply install it with pip, the Python package installer.

pip3 install python-redex

After installing it, you can import it using the following statement.

import redex as rd

Next, you can execute a Redex query on a string using one of the three available actions: has (returns a boolean), find (returns a list) and count (returns an integer). These functions also accept a number of parameters: the granularity of retrieval (e.g. return words, sentences or paragraphs), the splitting character (whitespace by default) and the number of threads.

Note that the number of threads refers to multi-threaded execution. Setting this parameter correctly can make Redex faster and more scalable than regular expressions. However, choosing a good value requires some knowledge about your CPU (cores/threads) and RAM, so it’s typically left at the default value.

For example, say we want to search in the following string: “it is another test for Rdx”. From this string we want to extract all non-stopwords by checking the length and capitalization of the words (which is a common task in natural language processing).

For this, we can combine two built-in functions. First, “count” can check the length of the words (since stopwords are typically no longer than three characters). Second, “startswith” can check the capitalization of the words (since stopwords are typically not capitalized).

Finally, because only one condition needs to be satisfied, we combine them using the “or” operator. This will result in the following Redex expression. Note how the words need to have at least 4 characters or start with an uppercase character to be considered.

string = "it is another test for Rdx"
rd.find("count:{*alpha,4} or startswith:*upper", string)

The output of this query is formatted as a list (because we used the “find” function). However, we can also execute this query using the “has” or “count” functions if we want a boolean or integer return value instead.

['another', 'test', 'Rdx']
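For comparison, the same logic can be sketched in plain Python without the library. This is only a stand-in to show what the query expresses, not Redex’s implementation. The function names are hypothetical, and it assumes that “count:{*alpha,4}” means “at least 4 alphabetic characters”, as described above.

```python
def find_words(text, min_length=4, split_char=None):
    """Return words that are long enough or start with an uppercase letter."""
    matches = []
    # split_char=None splits on any whitespace, mirroring the default described above.
    for word in text.split(split_char):
        long_enough = word.isalpha() and len(word) >= min_length
        capitalized = word[:1].isupper()
        if long_enough or capitalized:  # the "or" operator from the query
            matches.append(word)
    return matches

def has_words(text):
    """Boolean variant: is there at least one matching word?"""
    return bool(find_words(text))

def count_words(text):
    """Integer variant: how many words match?"""
    return len(find_words(text))

sentence = "it is another test for Rdx"
print(find_words(sentence))   # ['another', 'test', 'Rdx']
print(has_words(sentence))    # True
print(count_words(sentence))  # 3
```

The three functions mirror the list, boolean and integer return types of “find”, “has” and “count”.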

In conclusion

This article aimed to be a beginner-friendly introduction to the potential of Redex. For those who want to learn more about Redex, complete documentation and a demo are available on GitHub. Thank you for reading this article.
