The other side of the moon: April 2006

Sunday, April 30, 2006

(Grouped or regular) expressions

If you're still here, chances are you're serious about regular expressions. In that case, I'd suggest reading Mastering Regular Expressions by the great Jeffrey Friedl. You'll learn much more there than I'll cover here, and well, that's where I learnt most of what I'm covering here.

Lesson three looks at entities larger than single character regexes, and how to alternate among them. As usual, let's start with an example. Let's try matching the word 'ohlala'. Fairly simple right? /ohlala/ should do it, and it does, but it misses out on all the variations and melodrama. What if someone were a little more excited and said 'ohlalala' instead? What of the really overwhelmed who go 'ohlalalalalala'? These matches follow a pattern of course. One that we discovered yesterday. They seem to be 'oh' followed by two or more 'la's. But we don't know how to match two or more of something. We know how to match one or more, and zero or more, but not two or more. If you think about it though, two or more is just one followed by one or more, and we know how to match both of those.

   /oh/  matches the 'oh' part
   /la/  matches the first 'la' part
   /la+/  matches ?

Remember, you concatenate multiple regexes to let them match a single string in sequence. Concatenating these, we get:

   /ohlala+/

Does this match what we want?

It matches 'ohlala', but does it match 'ohlalala'? Not really. The last /la+/ matches 'la', 'laa', 'laaa', and so on. We therefore have matches like 'ohlalaa' and 'ohlalaaaaaaa'. Not what we want. Only the 'a' repeats and not the entire 'la'. To get the entire 'la' to match, we've got to make sure that 'la' is the entity preceding the + and not just 'a'. This is where grouping parentheses come in. These are the ( and ) characters placed around the portion that we want to group.

Grouping parentheses ( and ) create a group of the partial regex that they enclose. Everything within ( and ) is treated as a single entity for all modifier metacharacters.

In this case, it's the 'la', so let's do this:

   /(la)+/

By our definition of +, it should match one or more of the preceding entity, and by our definition of grouping parentheses, everything within the ( and ) is a single entity. This regex therefore matches 'la', 'lala', 'lalala' and so on. Taking it back to our example, we now have the regex:

   /ohla(la)+/

which matches 'ohlala', 'ohlalala', and 'ohlalalalala'.

Looking at this though, it's always tempting to wonder whether we could have combined the /la/ and the /(la)+/ into a single entity. Looking at our arsenal of metacharacters, it doesn't seem possible. I believe most popular current implementations have a range count match for entities, ie they allow you to specify both the minimum and the maximum number of matches a particular entity can match. The notation used is curly braces with the min and max within, eg: {2, 15} will match a minimum of 2 and a maximum of 15 times.

A modifier of {n,m} tells the pattern matcher to match the preceding entity at least n times and at most m times.

If any one of n or m are omitted, a suitable default is used. The default value of n is 0, and all implementations agree on this. The default value of m is infinite, but not all implementations agree on the value of infinite. You'll figure it out as you work with it. If the braces contain only a single number, with no comma, then it's a match of an exact count, so {5} means match exactly five times.

We can now modify our example to use this:

   /oh(la){2,}/

Which says match two or more times. This is exactly what we want.

It's tempting to think if we could do something at this point to match the patterns from our last lesson. To recollect, we needed a single pattern to match both /a+r+g+h*/ and /a+r+g*h+/. If we think about it, these can be factored.

/g+h*/ can also be written as /g*gh*/ while /g*h+/ can also be written as /g*hh*/.

Looking at these to regexes, it's basically /g*/ and /h*/ with either a 'g' or an 'h' in between. Either 'g' or 'h' is also written as /[gh]/, so our regular expression now becomes /a+r+g*[gh]h*/

Does this match everything? If the character class chooses 'g', then it is identical to /a+r+g+h*/ and if the character class chooses 'h', then it is identical to /a+r+g*h+/. The character class can never choose anything else, and must match exactly one character. It will therefore not match 'ar'. So, yes, it does match everything we want and nothing we do not want.

There are other ways to match it though. When we came up with the original problem, our solution was to have two regular expressions and pick one of them. In other words, we could alternate between two regexes. The regex language provides a meta character for alternation as well.

The | metacharacter alternates a match between the regex on its left and the regex on its right

Note the use of regex rather than entity. Unlike other meta characters, the | metacharacter does not apply to single character or group entities. It applies to everything on its left within the current regex. The current regex though, may be limited by grouping parentheses. A few examples should clear it up:

   /abc|def/         -  match /abc/ or /def/
   /a(bc)|(de)f/     -  match /a(bc)/ or /(de)f/
   /a(bc|de)f/       -  match /a/ followed by /bc/ or /de/ followed by /f/

Take note of all three. The | metacharacter is not affected by complete groups on its left or right. It will take into affect for everything extending up to the next enclosing group, ie the group in which the | is. If the | is not in any group, then it matches up to the start and end of the regex.

What if there are two | characters in there?

   /abc|def|ghi/
   /a(bc|de)(f|gh)i/
   /a(bc|def|gh)i/

The rule still holds. Everything to the left and right. It doesn't really matter whether you think of each | individually matching everything to its left or everything to its right or all the | characters within a group as a single alternation switch. The outcome will always be the same. Think of it in the way that you find simplest. Note that in case 2, we have two separate | characters, each within a different group, and that case 3 is very similar to case 1 except that a group in case 3 limits the extent of the |.

Getting back to our example now, we could have written the initial regular expression as:

   /a+r+(g+h*|g*h+)/

Which would have worked correctly, and possibly have been easier to read.

It's however not as efficient as the character class method we developed earlier, so when given a choice between a character class and an alternation, pick the character class. That said, it's not always possible to convert an alternation to a character class, and it can in many cases be easier to read, so if performance isn't hitting you, stick with what's easier to read and write.

I'll end lesson three here, and wonder about what's left to be covered.

Previous: More (or less) regular expressions?

More (or less) Regular Expressions?

In our last lesson, we learnt about simple regular expressions that matched exact character sequences, and then how to enhance them to match more than just one specific character at a position. Today we'll look at extending this to match more (or less) than one character. I'm also going to drop the =~ notation for now, as we will seldom care about the string that needs to be matched, and always look at the pattern only.

As an example, let's take the simple string 'argh'. I'm sure everyone relates to that, and it's language independent, so it's a good choice. The regular expression to match this, is simple: /argh/. Cool? Everyone happy? Let's all scream now.

Unfortunately, not everyone screams in the same way. While some of us go 'argh', others go, 'aaaaaargh', and still others go 'aaaaaarrrrrgggghhhh'. Man, it's gonna take some doing trying to match all those arghs. That's where counts come into play.

We have three new meta characters to look at today. The first is +, which says:

Match one or more of the preceding entity

Of course, this raises the question of what exactly is an entity. An entity is an atomic regular expression, or, in terms of what we've learnt so far, a single character, dot or character class. With that in mind, we can take some examples:

   /a/      -  matches 'a'
   /ab/     -  matches 'ab'
   /a+/     -  matches 'a', 'aa', 'aaa', 'aaaaa', and so on ad infinitum
   /a+b/    -  matches 'ab', 'aab', 'aaab', 'aaaaab', and so on but not 'abb'
   /ab+/    -  matches 'ab', 'abb', 'abbb', 'abbbbb', but not 'aab'

Note the last one in particular. The + applies only to the 'b' and not to 'ab' as a whole.

So, let's get back to our example. We need to match 'argh', 'aaaaaaargh', 'aaaaaarrrrggggghhhh' and everything in between. The only thing we know is that each of 'a', 'r', 'g' and 'h' occur at least once. The moment we hear 'at least once', it should trigger the image of a '+' in the regex constructing part of our brains, so let's go ahead and use the +:

   /a+r+g+h+/

Simple enough right? We now have a regex that matches all of our requirements and then some. We should be able to catch everyone screaming now, so go ahead and scream.

Hmm, there's still a bunch of folks who scream differently. Do you hear a bunch of 'aaaaarrrrhhhh's in there? The 'g' completely missing. And there's those who go 'aaaaaarrrrrgggg', but no 'h' at the end.

This brings us to our second metacharacter of the day. The * character, which says:

Match zero or more of the preceding entity

Just as with + our examples with * are:

   /a/      -  matches 'a'
   /ab/     -  matches 'ab'
   /a*/     -  matches '', 'a', 'aa', 'aaa', 'aaaaa', ...
   /a*b/    -  matches 'b', 'ab', 'aab', 'aaab', 'aaaaab', ...
   /ab*/    -  matches 'a', 'ab', 'abb', 'abbb', 'abbbbb', ...

Note the case of /a*/, it also matches the empty string, and the case of /a*b/ which also matches 'b', ie, both these matches do not contain 'a', or, in other words, they contain zero occurances of 'a'.

While + says that there MUST be at least ONE of the preceding entity, * says that the preceding entity may be completely absent and is in fact optional. They both agree that there's no maximum. Specific implementations may apply arbitrary maxima, but for most practical cases, you can assume it's infinite.

Back to our example now, we know that in some cases the 'g' is optional, while in the other case, the 'h' is optional.

So, to change the 'g' from matching 'one or more' times to match 'optionally one or more' times or 'zero or more' times, we change the + to a *:

   /a+r+g*h+/

Similarly for the optional 'h', we have:

   /a+r+g+h*/

We don't however have a single regular expression to match both cases, ie, either an optional 'g' or an optional 'h'. If we combine the two changes, we end up with this:

   /a+r+g*h*/

which unfortunately also matches 'ar', and I couldn't hear anyone scream that way. We've got to find a better way of matching this, but I'll do that later.

In the course of writing regular expressions, you'll come across several cases where a particular entity must occur either zero or one times, but not more than one time. An example would be pluralisation of words by adding 's'. For example, if you needed to match 'apple' as well as 'apples', what would you do? An easy solution would be to use /apples*/ which we know will match 'apple' and 'apples' because the 's' at the end can occur zero or more times. Unfortunately, that 'more times' part comes back to kick us in the behind by also matching 'appless' and 'applesssssss', which isn't what we want to match.

Enter our third metacharacter for the day, the ? character, which says something to the effect of:

The presence of the preceding entity is questionable

Or stated less formally:

This here entity's optional and may occur either once or not at all

Well, what a coincidence, that's exactly what we need. We write up our new regex as /apples?/ and are done with it. Ok, all done, let's go home now.

Umm, just one minute. A few more examples:

   /a?/     -  matches '' and 'a' and nothing else
   /a?b/    -  matches 'b' and 'ab' and nothing else
   /ab?/    -  matches 'a' and 'ab' and nothing else

Note that that last one does not match the empty string because the ? applies only to the 'b' and not to the entire 'ab'.

We've learnt a whole bunch of metacharacters now, including . + * ? and []. One question that I hope has arisen in your minds is, what if I want to match one of these characters? What if I actually do want to match a dot or a plus or star? Well, simply escape it using a backslash character: '\'. Thus, the regex to match '.' is /\./ and the regex to match '*' is '\*' and so on. So how do you match '\.' then? Well, escape the '\' as well as the '.' to get:

   /\\\./

Yup, I know, pretty hairy, and depending on the implementation of regular expressions you use, this could set you up for what is known in professional circles as backslashitis. You have been warned.

I'll stop here today. We'll look into larger entities in the next lesson.

Friday, April 28, 2006

Beginning Regular Expressions

A poet, too, was there, whose verse
Was tender, musical, and terse.
--Longfellow.

I often get asked to take a class on Regular Expressions. I've been doing them for about 7 years now, and I like the terseness with which they get the job done. In the years that I've taught regular expressions, my beginner lectures have gotten far more concise and less detailed. I thought of putting some notes up here.

Start Simple

To start with, are thoughts on how one should go about building a regular expression. Any regular expression. Always start simple. Start with the smallest pattern that you can comfortably identify with. Note the emphasis on you. Don't go by someone else's idea of simplicity, always use your own. To a first time regexer, that may mean thinking in terms of english words directly. Let's start...

Note: for this tutorial, I'll use single quotes '' to mark literal strings and slashes // to mark regular expressions. Also, the =~ operator will be used to match the string on the left with the pattern on the right.

The simplest regular expression is one that matches a single specific character. Let's take the character 'a'. The regular expression to match this is /a/

'a' =~ /a/

Let's try that again. This time, a regular expression to match 't'.

't' =~ /t/

Simple so far?

Combining expressions

Let's move on to a little more complex stuff. Combining two simple regular expressions to make one complex regular expression. We'll combine the two regexes we learnt above to build a slightly more complex regex. A regex to match 'at'. This is basically the regex to match 'a' followed by the regex to match 't':

'at' =~ /at/

If you've got that, then you're all set, and can build some really complex stuff. Just remember to always break it down to the smallest manageable piece.

This also means that you don't break it down unnecessarily. Most people I know don't need to break everything down to a single character regex. They're quite comfortable dealing with /at/ or even something as big as /apple/ or /orange/. In fact, any combination of letters that make up a word in a familiar language also makes a familiar regex. From now on, we'll deal with these large thoughts and make even bigger things jumping back to the smaller chunks only when we introduce a new metacharacter.

The . metacharacter

Regular expressions are full of meta characters. Wildcards and modifiers that allow you to match more than just one specific literal character. One of the most used metachars is '.' (that's a dot or period). A dot matches any single character, so a regex like /./ matches 'a', 'b', 'c' and everything else, but not '' ie, the empty string, because that has no characters. The dot doesn't say what the character at its position should be, but it does say that there MUST be ONE and only ONE character there. It also does not say what, if anything, should preceed or succeed that one character. On its own therefore, a dot is not very useful. A string empty check is probably far more efficient. Used in conjunction with other regexes though, it becomes quite powerful.

For example, take a regular expression to match the word 'ate' preceded by any one character. Ie, I want to match words like 'aate', 'bate', 'cate', 'date', eate', and so on, for any possible first letter.

$string =~ /.ate/

Ok, I lied a bit when I said that /./ matches any character. There are two characters that it does not match. These are the NUL character (ie, character with ASCII code 0) and the newline character (except in special circumstances). For most cases, this isn't really a problem.

I'll move on.

Character Classes

The above example would also match character sequences like ' ate' (that's a space) or '9ate', or '+ate', etc. Perhaps that wasn't quite my intention. My intention was to have a letter of the alphabet followed by ate. The solution is to use what is known as a character class.

Now, before we go forward, in one corner of your mind store the fact that what most languages call a character class, and what POSIX calls a character class are minutely different, but both relate to the same thing. We'll address POSIX later, but for now we'll just think about what most implementations call a character class.

A character class is a series of characters, any one of which is a valid match for that position. An example may be easier to understand. Taking my earlier case, if I had to match the following words: 'date', 'fate', 'gate', 'hate', then my regular expression would be /[dfgh]ate/

$string =~ /[dfgh]ate/

Notice the square brackets containing dfgh (which just coincidentally happen to lie right next to each other on a QWERTY keyboard): '[dfgh]'. What that says is that the character at that position must match any one of 'd', 'f', 'g' or 'h'.

Note that it states two things. First, there MUST be ONE and only ONE character at that position, and second, that one character may be either a 'd', an 'f', a 'g' or an 'h' and nothing else. Commit to memory the ONE character thing, because that's where most people go wrong when they do inverted matches. Remember this as a rule:

A character class [] matches ONE and only ONE character
The . metacharacter is like a character class with all characters except \000 and \n

Note also that you can include \n and \000 in your character class.

Now character classes are useful, but if you have to match too many characters, this could get unwieldy. Not as unwieldy as the alternatives, but unwieldy nonetheless, and terseness is what we like anyway. Enter ranges.

Ranges in character classes

You can specify a character range within the square brackets to indicate that you want to match any character that lies within that range within the current character set. So, if we wanted to match all the lower case letters of the english alphabet, we'd write /[a-z]/. Note the order. In the previous case, it made no difference if we wrote /[dfgh]/ or /[dhgf]/, but in the current example, /[a-z]/ is very different from /[z-a]/. The latter is an invalid range.

A single character class needn't be restricted to a range though, and you can combine a range alongwith characters that are not part of the range. Our earlier example could be replaced with this instead:

$string =~ /[df-h]ate/

Not that it buys us much in terms of reducing keystrokes in this particular example, but there are cases where it will.

Inverted character classes

Character classes can also be negated to match any character NOT included in the class. To do this, use a '^' as the first character of the character class: /[^a-z]/. This regex will match any character that does not lie in the range 'a-z', both inclusive. Note, that it still requires ONE character to be at that position. The only difference is that we now state what that character cannot be rather than what that character can be. This regex will still not match the empty string ''.

I'm going to stop here for today. There's much more that needs to be done before we can make really useful regexes, but we should be able to play around a bit with this much. I'll post a follow up tomorrow.

Next: More (or less) regular expressions?

The other side of the moon