Demystifying Regular Expressions
The time has finally come to talk about the regular expressions that I’ve been using for the pattern matching part of this pattern matching chatbot. Regular expressions are just a special way to describe patterns of symbols. A regular expression engine, like the one inside Perl, can then compare strings and text to a regular expression and figure out whether they match that pattern or not.
Although regular expressions are really useful, they have a bad reputation for being hard to read and write. And to be honest it can be hard to remember all the dozens of rules involved in regex pattern matching. I personally would never dare write anything but the simplest of regexes without a cheat sheet on hand.
Even more frustrating is the fact that “wrong” regular expressions don’t throw errors. They just don’t match what you expect them to match. And figuring out why they aren’t matching what you want can be difficult. The worst errors happen when your regex does match everything you want… but also matches things you don’t. So it looks like it’s working properly at first but if it ever comes across a false match your program will suddenly break and you won’t know why.
This is one reason I’m using test driven development for DELPHI. The automatic tests should catch most of my regex mistakes, so I don’t have to worry too much about forgetting a single character and breaking my entire program. The tests will also point out the mistakes as soon as I make them, letting me fix them while the code is still fresh in my mind and before forgetting what the broken regex was supposed to do.
So… Talking About Regular Expressions
I’m going to briefly cover the bare minimum of regex knowledge you need to follow along with me as I program DELPHI. This is probably a bad idea on my part. If you don’t understand regular expressions this won’t be nearly enough information to teach you how they work. And if you do know how regular expressions work this will be a boring reminder. I honestly should probably have just included a link to a real regular expression tutorial and left it at that.
Well, whatever. This is my Let’s Program. I can waste space talking about regular expressions if I want.
But before we go anywhere we need to cover how to mark a regular expression in Perl. By default you create a regular expression by putting a ‘/’ symbol before and after the regex pattern you want, very similar to how you mark a string by putting double quotes before and after.
Boring And Simple
The most boring and simple use for regular expressions is to check whether or not a string contains a specific substring. Maybe you’re trying to find every sentence in a book that includes the word “inconceivable”* or are searching through a bunch of code for comments with a “TODO” reminder.
Searching for a specific substring is really easy. You just type that substring up as a regex and you’re good to go. Example: /inconceivable/ and /TODO/.
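Here’s a minimal sketch of what that looks like in actual Perl. The sample lines are made up for this example; the only real machinery is the =~ operator, which asks whether the string on its left matches the regex on its right.

```perl
use strict;
use warnings;

# Hypothetical sample lines, invented for this example.
my @lines = (
    'He keeps saying inconceivable!',
    '# TODO: handle empty input',
    'Nothing interesting here.',
);

# Keep only the lines that contain the substring we care about.
my @inconceivable = grep { $_ =~ /inconceivable/ } @lines;
my @todo          = grep { $_ =~ /TODO/ } @lines;

print "inconceivable: $_\n" for @inconceivable;
print "TODO: $_\n" for @todo;
```

Keep in mind that the match is case sensitive, so /TODO/ will not find “todo”.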
Powerful And Complex
If the only thing regular expressions could do was find specific substrings there would be no reason to use them. We would just use the substring function that almost all languages already have. The real reason to use regular expressions is all the powerful tools they give you to find generic patterns instead of specific strings.
Now get ready for a whirlwind tour of the most useful and common regular expression special pattern techniques.
First up are the + and * symbols, which let you find a single symbol or phrase multiple times. So while /abc/ will only match “abc” you can create phrases like /ab+c/ that will match “abc”, “abbc”, “abbbc” and so on. ‘*’ works almost the same as ‘+’ except that ‘*’ makes the match optional: ‘+’ means “one or more” while ‘*’ means “zero or more”. So /ab+c/ and /ab*c/ will both match “abc” and “abbc” but only /ab*c/ will match “ac”.
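If you want to see the difference for yourself, here’s a small sketch. The matches helper is just something I made up for this example, and qr// is Perl’s way of storing a regex in a variable so it can be passed around.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $regex, 0 otherwise.
sub matches {
    my ($string, $regex) = @_;
    return $string =~ $regex ? 1 : 0;
}

# Try each test string against both patterns.
for my $string ('ac', 'abc', 'abbc', 'abbbc') {
    printf "%-6s /ab+c/: %d  /ab*c/: %d\n",
        $string, matches($string, qr/ab+c/), matches($string, qr/ab*c/);
}
```

Only the “ac” row should differ between the two columns, since it is the one string with zero b’s.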
Next up are anchors, which let you mark when a specific phrase absolutely has to end up at the beginning or end of a string. \A means that the next symbol has to be at the very start while \z means that the previous symbol has to come at the very end.
For example: /\AWhy/ only matches sentences that start with the word “Why”. Having a “Why” in the middle isn’t enough. Similarly “/\?\z/” only matches sentences that end with a question mark. Note that in this case we have to type ‘\?’ instead of just ‘?’ because the plain question mark is actually a special regex symbol**.
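A quick sketch of both anchors in action, using a few sentences I invented for the occasion:

```perl
use strict;
use warnings;

# Hypothetical test sentences.
my @sentences = (
    'Why is the sky blue?',
    'I wonder why.',
    'Tell me Why?',
);

# \A pins the pattern to the start of the string, \z to the very end.
my @starts_with_why = grep { /\AWhy/ } @sentences;
my @ends_with_qmark = grep { /\?\z/ } @sentences;

print "Starts with Why: $_\n" for @starts_with_why;
print "Ends with ?: $_\n"     for @ends_with_qmark;
```

“Tell me Why?” ends with a question mark but has its “Why” in the wrong place, so it only shows up in the second list.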
Next I want to mention symbol groups and the wildcard. These let you search for generic types of symbols instead of specific substrings. You can search for any one digit with [0-9]. You can search for any letter with [a-zA-Z]. Then there is the wildcard ‘.’ that matches just about anything except the newline character. So if you really want a number followed by a letter followed by one more of anything you could write /[0-9][a-zA-Z]./.
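Here’s a rough sketch that runs that digit, letter, anything pattern against a handful of made-up strings:

```perl
use strict;
use warnings;

# Hypothetical candidates to test against the pattern.
my @candidates = ('7up', '42', 'x9y', 'room 3b!');

# One digit, then one letter, then any one character except a newline.
my @hits = grep { /[0-9][a-zA-Z]./ } @candidates;

print "match: $_\n" for @hits;
```

“42” fails because the digit is never followed by a letter, and “x9y” fails because nothing comes after the “9y” for the wildcard to grab.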
The last thing I want to mention is parentheses. Parentheses let you group multiple other symbols together into one pattern. For example, suppose that you wanted to find sentences where the phrase “abc” repeats. Trying /abc+/ won’t work because the ‘+’ only applies to the ‘c’. The pattern can cover all of “abc” or “abcc” but never all of “abcabc”; the most it can find there is one “abc” at a time.
Instead you want to try something like /(abc)+/. Now the regex engine knows to look for a repeat of the entire group, not just the last symbol.
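To see the difference between repeating the ‘c’ and repeating the whole group, the sketch below wraps both patterns in \A and \z so that they have to cover the entire string. The anchors are my addition for the sake of the demonstration.

```perl
use strict;
use warnings;

my (%c_repeats, %group_repeats);
for my $string ('abc', 'abcc', 'abcabc') {
    # Only the final 'c' may repeat here.
    $c_repeats{$string}     = $string =~ /\Aabc+\z/   ? 'yes' : 'no';
    # Here the whole "abc" group may repeat.
    $group_repeats{$string} = $string =~ /\A(abc)+\z/ ? 'yes' : 'no';
    printf "%-7s c+: %-3s (abc)+: %s\n",
        $string, $c_repeats{$string}, $group_repeats{$string};
}
```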
And of course you can mix all these things together. /\A(abc)+[0-9][0-9]/ will match any string that starts with one or more groupings of “abc” followed by two digits. So “abc11” and “abcabc45” match but “aabc11” and “123abc12” do not. Note that “abc123” matches too, because nothing in the pattern says the string has to stop after the two digits. Adding \z to the end would rule it out.
Remember three mini-paragraphs ago*** when I talked about how you can use parentheses to match entire groups of symbols? Well it turns out that parentheses have a second purpose too: they set up capture groups.
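The test strings from the paragraph above can be checked directly. This sketch also tries “abc123” against a variant with \z on the end, since without an end anchor the pattern only cares about how the string starts.

```perl
use strict;
use warnings;

my $pattern = qr/\A(abc)+[0-9][0-9]/;

my %result;
for my $string ('abc11', 'abcabc45', 'aabc11', '123abc12', 'abc123') {
    $result{$string} = $string =~ $pattern ? 'match' : 'no match';
    print "$string: $result{$string}\n";
}

# With \z added, nothing is allowed to follow the two digits.
my $strict = 'abc123' =~ /\A(abc)+[0-9][0-9]\z/ ? 'match' : 'no match';
print "abc123 with an end anchor: $strict\n";
```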
Capture groups are important because they signal for the regex engine to not just find a match but also to save those matches for later. This is very useful for all sorts of reasons. For instance, it lets you build complex regular expressions that match the same phrase in multiple locations.
/([a-zA-Z]+) is \1/
The \1 means “the same as the first capture group in this regex”, so this pattern will match any string where the text on the left of the word ‘is’ exactly matches the text on the right of the word, like “A is A” or “Perl is Perl”.
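Here’s a small sketch of the backreference at work. One thing to watch out for: the pattern has no anchors, so it is happy to find its match hiding inside a longer string. The anchored variant below is my own addition to show the difference.

```perl
use strict;
use warnings;

my $loose  = qr/([a-zA-Z]+) is \1/;        # the pattern from the text
my $strict = qr/\A([a-zA-Z]+) is \1\z/;    # same idea, anchored at both ends

my (%loose_result, %strict_result);
for my $string ('A is A', 'Perl is Perl', 'Perl is fun', 'Java is a language') {
    $loose_result{$string}  = $string =~ $loose  ? 1 : 0;
    $strict_result{$string} = $string =~ $strict ? 1 : 0;
    print "$string: loose=$loose_result{$string} strict=$strict_result{$string}\n";
}
```

“Java is a language” sneaks past the loose pattern because of the “a is a” in the middle, which is exactly the kind of false match that is easy to miss.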
But that’s not all capture groups can do. You can also ask the regex system to give a copy of the capture groups to your non-regex code. This lets you pull information out of text patterns and then use the full power of your programming language to interpret and manipulate it as much as you want.
Imagine you were writing a program that was supposed to read scientific papers, look for metric weights and then convert those weights to pounds for your American boss.
First you would use a regular expression to look for numbers followed by metric weight abbreviations (g, kg, mg, etc…). By wrapping this search in a capture group you could then pull those numbers into your program, convert them into pounds and then insert them back into the document.
The regex for this metric conversion system might include something like /([0-9]+)kg/. This pattern doesn’t just match the phrase “500kg”, it extracts the 500 and passes it back to our program. How does it pass it back? That depends on the system. In Perl a pattern with capture groups will store the captures in numeric variables. The first capture goes in $1, the second in $2, the third in $3 and so on.
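A minimal sketch of that idea, assuming a made-up sentence and the approximate conversion factor of 2.20462 pounds per kilogram. The last step uses Perl’s s///e substitution, which runs the replacement as code, to put the converted number back into the text.

```perl
use strict;
use warnings;

# Approximate conversion factor: 1 kg is about 2.20462 lb.
my $POUNDS_PER_KG = 2.20462;

# Hypothetical sentence from an imaginary paper.
my $text = 'The sample weighed 500kg before drying.';

my $pounds;
if ($text =~ /([0-9]+)kg/) {
    my $kilograms = $1;    # $1 holds whatever the first capture group matched
    $pounds = $kilograms * $POUNDS_PER_KG;
    printf "%dkg is roughly %.1f pounds\n", $kilograms, $pounds;
}

# Copy the text, then replace every "<number>kg" with its value in pounds.
(my $converted = $text) =~ s/([0-9]+)kg/sprintf('%.1f lb', $1 * $POUNDS_PER_KG)/ge;
print "$converted\n";
```

A real version would also need to handle decimals, the other units and spaces between the number and the unit, which this pattern ignores.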
You can also assign capture groups directly to an array like this:
@captureArray = ($string =~ /(regex) (with) (captures)/);
You’ll notice that this is the method I use in DELPHI.
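Here’s a runnable sketch of the array version. The input strings are placeholders I picked so the literal pattern has something to match.

```perl
use strict;
use warnings;

my $string = 'regex with captures';

# In list context a successful match returns the captures in order.
my @captureArray = ($string =~ /(regex) (with) (captures)/);
print join(', ', @captureArray), "\n";

# A failed match returns an empty list, so the array doubles as a success check.
my @none = ('nothing to see here' =~ /(regex) (with) (captures)/);
print "no match\n" unless @none;
```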
Speaking of DELPHI, capture groups are a key component to our chatbot. By using capture groups we can extract useful bits of the user’s original input and use them to customize the chatbot’s output and make it seem more human.
So when the user asks “Is regex awesome?” DELPHI doesn’t just answer “Yes/No”, it uses capture groups to grab “regex” and “awesome” and glues them together to create the intelligent answer: “Regex is awesome.”
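To make that concrete, here’s a rough sketch of how such a rule could work. The pattern, the answer_is_question function and the fallback reply are all hypothetical, written just for this example, and are not necessarily how DELPHI itself does it.

```perl
use strict;
use warnings;

# Hypothetical rule for illustration only; DELPHI's real patterns may differ.
sub answer_is_question {
    my ($input) = @_;

    # Capture the two words between "Is " and the final question mark.
    if ($input =~ /\AIs ([a-zA-Z]+) ([a-zA-Z]+)\?\z/) {
        my ($subject, $description) = ($1, $2);
        return ucfirst($subject) . " is $description.";
    }
    return "I don't know how to answer that.";
}

print answer_is_question('Is regex awesome?'), "\n";
```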
Dissecting A Regular Expression
Feeling a little more comfortable with regular expressions? No? Then let’s spend some time dissecting a regular expression symbol by symbol.
/\AWhy (.+)\?\z/
You might recognize this as one of the three basic rules that I programmed into DELPHI when first testing the response generating system. Now let’s look at its individual parts.
The regex starts with \AWhy . The ‘\A’ means that the pattern has to happen at the beginning of the string. The ‘Why’ just means the exact phrase ‘Why’. Together this means that this regex pattern only matches phrases that start with ‘Why’.
After that there is a blank space. This tells the regex to match an actual blank space. So this means that there has to be a space directly after the ‘Why’.
After the space we get to (.+). The wildcard matches anything and the plus sign means that we want to match at least one thing. So the idea here is that after the ‘Why ‘ we want to see at least one more symbol. After all, a good question should be “Why something?” not just plain “Why?”.**** We also wrap this bit in a capture group in case we want to further analyze the exact sort of why question the user was asking.
Finally we have \?\z, indicating that matching strings absolutely have to end with a question mark.
Add them together and this pattern more or less matches any sentence of the pattern “Why X?”. Things like “Why is the sky blue?” or “Why are we using regex?”. But it will not match things like “Why?” (which doesn’t have enough text after the Why). It also won’t match “Why is this wrong” (no question mark) or “I wonder why this is wrong?” (doesn’t start with Why).
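Gluing the pieces from the walkthrough back together gives /\AWhy (.+)\?\z/, and the sketch below runs it against the example sentences from the last paragraph. The %captured hash is just something I added to hold on to what the capture group grabbed.

```perl
use strict;
use warnings;

# The pattern assembled from the pieces described above.
my $why_rule = qr/\AWhy (.+)\?\z/;

my %captured;
for my $input ('Why is the sky blue?', 'Why are we using regex?',
               'Why?', 'Why is this wrong', 'I wonder why this is wrong?') {
    if ($input =~ $why_rule) {
        $captured{$input} = $1;    # everything between "Why " and the final "?"
        print "match:    $input (captured \"$1\")\n";
    }
    else {
        print "no match: $input\n";
    }
}
```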
Congratulations, you’ve survived an entire article about regular expressions. Hopefully you’re feeling ready for the next post of this Let’s Program where I start writing new regular expressions to help DELPHI react to new communication patterns.
* I do not think it means what you think it means
** ? in regex means “find the previous symbol zero or one times but never more than once”.
*** If you can’t remember something you read three paragraphs ago you may want to consider seeing a doctor about your short term memory problems.
**** On second thought, maybe we should have a response pattern specifically for plain “Why?”. Let’s see if you can figure out the regex for that before I officially add it to the bot.