This instalment is the start of a series of instalments relating to searching from the command line. Searching is all about patterns, and that means getting to grips with Regular Expressions (also called RegExps, RegExes or REs for short). Regular Expressions are languages for representing patterns, and are used throughout IT, not just on the command line. While this series focuses on the Terminal, an understanding of regular expressions will be helpful in many other places, from programming languages to GUI apps like programming editors, search utilities or file re-namers. It’s going to take us two instalments to properly describe regular expressions, but when we’re done we’ll have gained a very useful skill.
What Are Regular Expressions?
If you want to get scientific about it, regular expressions are languages for describing regular grammars, which are the simplest type of grammar in the Chomsky Hierarchy. You could easily dedicate an entire university level course to explaining the meaning and importance of that last sentence, and in fact, if you take a degree in Computer Science, you will! However, that’s not much use to us for the purpose of this series. In effect, what it means is that regular expressions provide a means for representing patterns that can be described as a series of elements following one after the other. That means regular expressions can do a lot, they can find all currency amounts in a document (a currency symbol followed by either an integer or a decimal number), they can find percentages (an integer or decimal number followed by a percentage symbol), they can find temperatures (an integer or decimal number followed by a
K), and so on. That includes quite complex things like recognising URLs, which could be described something like:
A valid URL consists of a protocol specifier followed by the colon symbol, then two forward slashes, then a domain name, then, optionally, a port number, then optionally a path starting with a /, then, optionally an anchor starting with a pound/hash symbol, and then finally an optional query string starting with a question mark symbol.
That description is actually incomplete, because you would need to describe what some of those parts mean in more detail before you could write a regular expression for them, but that’s no problem because those parts too can be described as a series of elements following each other. For example, you’d have to further break down the domain name part into something like:
A domain name consists of one or more segments separated by a period symbol. Each segment can only contain letters, digits, and dashes, and must start with a letter or a digit.
The key point is that if you can describe a pattern as a series of elements that follow one after the other, then you should be able to write a regular expression to represent that pattern.
So, regular expressions are without doubt powerful, but, they are not all-powerful – there are entire classes of problems regular expressions are powerless to help with. In fact, to get scientific again for a moment, there are three entire grammar classes in the Chomsky Hierarchy that REs are powerless to help with. In practical terms that means that REs can’t help when some kind of memory is needed to know what has gone before, or when the elements in the pattern can be arbitrarily ordered and/or nested. For example, it would be impossible to write a regular expression to test if an arbitrary piece of text contained a matched set of arbitrarily nested brackets, because, to know if a given closing bracket is or it not valid, you need to know how many opening brackets have proceeded it. Also, REs can’t be used to validate something like XML (or HTML for that matter), because tags can come in any order, and be validly nested in all sorts of different ways.
Not understanding the limits of REs leads to a lot of frustration, and a lot of very unreliable code. If you can’t describe it as a series of elements that follow each other in a given order, a regular expression is not the answer!
The fact that many programmers don’t understand the limitations of regular expressions has led to the incorrect maxim that if you have a problem and try to solve it with regular expressions you then have two problems, your original problem and a regular expression.
Don’t Be Intimidated!
Regular expressions can look very intimidating, but, once you know the language they are written in, they are actually very simplistic things. Think of it like a mathematical equation, until you know what all the symbols mean, it’s a big pile of intimidating gobbledegook, but, once you understand the meanings of the symbols, you can work your way through an equation logically.
The following apparent gibberish is a regular expression describing the domain name pattern described above:
For now, that looks horrific, but, when we’ve finished this instalment and the one after, I promise it’ll make sense!
Also, I promise the following is a really funny joke – when you get it, you’ll know you get REs! (I have this one a T-shirt, and it works as a great nerd-test)
Which RE Language?
Just like there is no one programming language, there is no one language for regular expressions. So, that leads to an obvious question, which type of RE should we learn? Because this series is all about the Terminal, the answer is actually very easy, there’s really only one choice that makes sense, but, it happens to be a choice that conveniently gives us a very solid base to build from for other uses of REs.
Lets start with some context. Firstly, when it comes to regular expressions you can’t ignore my favourite scripting language, Perl. Perl was developed for the purpose of processing text, which means pattern matching is at the very core of it’s DNA. The official backronym for Perl is the Practical Extracting and Reporting Language, and the joke backronym is the Pathologically Eclectic Rubbish Lister, either way, Perl is all about extracting information from textual data, so it’s all about pattern matching.
Why POSIX ERE? In fact, more fundamentally, what is POSIX?
POSIX stands for Portable Operating System Interface, and it’s the reason that the things we learn in this series are so surprisingly portable. POSIX is the standard that unites most of the flavours of Unix and Linux, and gives us a common foundation to work off. Not all our *nix operating systems are POSIX certified, but they are all, to a very very high degree, POSIX compliant. OS X is actually POSIX certified, but Linux is not, it just implements pretty much the entire POSIX standard. POSIX covers many things, from how file systems should be presented, to a core set of terminal commands that are the same across all POSIX OSes, to a large set of programming APIs that can be used to create apps that run on all POSIX systems, to a portable regular expression syntax.
Actually, POSIX specifies two regular expression languages, POSIX Basic Regular Expressions (BRE), and POSIX Extended Regular Expressions (ERE). The reason there are two is that POSIX is literally decades old, and regular expressions have come a long way since the BRE syntax was defined. When it comes to the simple stuff, BRE and ERE are the same, but, when it comes to more complex stuff, specifically cardinalities and grouping, they are not compatible. For these advanced features, BRE is not PCRE compatible, but ERE is, making it the best kind of RE for those exploring the terminal.
For all the examples in this series, we are going to use ERE, and we are only going to use command line tools that understand ERE. However, it’s important to know that BRE exists, because you’ll see both BRE and ERE mentioned in many man pages, and, some terminal commands default to BRE for legacy reasons, but can accept ERE if a certain flag is passed.
The only way to really learn regular expressions is through practical examples, so, for this instalment and the next we’ll be using the
egrep command to search the standard Unix words file for words that match a given pattern. We’ll be looking at the
egrep command in more detail later in the series, but for now, all we need to know is that
egrep can be used with two arguments, the first, a regular expression in POSIX ERE format, and the second the path to a file to search.
egrep will print each line that contains text that matches the given pattern, it will not print just the text that matches the pattern, it will print the entire line that contains the match.
The standard Unix words file is a text file containing a list of valid English words, one word per line. On OS X and Ubuntu Linux the file is located at
/usr/share/dict/words, though on some Unix/Linux variants you’ll find it at
Getting Started with POSIX ERE
In this instalment we’re going to start with the simpler parts of the ERE language, and, in fact, everything we learn today will be valid ERE, BRE, and PCRE, so it will apply very very widely indeed.
Ordinary characters represent themselves in a pattern, so the POSIX ERE to represent the letter
a is simply:
Similarly, the RE to represent the character
t followed by the character
h is simply:
Lets start with a simple example – finding all words that contain a double
e in the words file. Remember, the
egrep command prints any line from the input file that matches the specified pattern, so, to find all words with a double
e you could use the following command:
egrep 'ee' /usr/share/dict/words
Lets take things up a notch, and include line boundaries in our pattern. The special character
^ represents start of line when used at the start of a regular expression (it can have other meanings when used elsewhere as we’ll see later). Its opposite number is the special character
$, which represents end of line.
So, the following command will find all words starting with the character
egrep '^b' /usr/share/dict/words
Similarly, the following command will find all words ending in the three letters
egrep 'ing$' /usr/share/dict/words
Note: you may have noticed that I’ve been single-quoting the pattern in all the examples. This is often not necessary, because many patterns don’t contain BASH special characters, but, some do, including the one above, which contains the dollar symbol. If the string had not been single quoted, we would have had to escape the dollar symbol which would be very messy. My advice would be to get into the habit of always single-quoting regular expressions, it’ll save you a lot of frustration over time!
Something else that’s very important is the ability to specify a so-called wild-card character. We can do that using the period character, which you should read in an RE as any one character.
As an example, lets say you’re stuck on a thorny crossword puzzle, and you need a word that fits into something e something something f something. You could use the following terminal command to find a list of possible answers:
egrep '^.e..f.$' /usr/share/dict/words
Something to notice in the above command is that the specific pattern we are looking for is bounded by a
^ and a
$, this is to ensure we don’t get longer words that contain the pattern returned. If you run the command again but leave those symbols out you’ll see that you get a lot of unwanted results (over 900 on OS X).
The last thing we’re going to look at in this instalment is character classes, these are used to match a single character against multiple options. You can think of everything inside a character class as being a big list of ors. Character classes are enclosed inside square brackets, so, you should read the character class below as a or b or c or d or e or f:
As an example, lets search for all four letter words starting with a vowel:
egrep '^[aeiou]...$' /usr/share/dict/words
You can also use the minus sign within character classes to specify ranges of characters. Some commonly used ranges include:
- Any digit
- Any lowercase letter
- Any uppercase letter
You don’t have to stick to those common ranges though, you can use sub-sets of them, and you can use multiple ranges within a single character class.
As an example, the regular expression below matches valid MAC addresses in OS X (and Linux) format. On POSIX OSes like OS X and Linux, MAC addresses are represented as a series of six two-character lower-case hexadecimal numbers separated by colon symbols, so, they could be matched with the following regular expression:
The above RE will work, but it’s quite un-wieldy and full of repetition, you might imagine there’d be a simpler, more efficient way of representing this pattern, and you’d be right! I’ll stop here and leave the following as a teaser for the next instalment, the above un-gainly 102 character RE can be reduced to just 29 characters using two important new concepts, cardinality and grouping.