[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Sunday, April 30, 2006

More (or less) Regular Expressions?

In our last lesson, we learnt about simple regular expressions that matched exact character sequences, and then how to enhance them to match more than just one specific character at a position. Today we'll look at extending this to match more (or less) than one character. I'm also going to drop the =~ notation for now, as we will seldom care about the string that needs to be matched, and always look at the pattern only.

As an example, let's take the simple string 'argh'. I'm sure everyone relates to that, and it's language independent, so it's a good choice. The regular expression to match this is simple: /argh/. Cool? Everyone happy? Let's all scream now.

Unfortunately, not everyone screams in the same way. While some of us go 'argh', others go, 'aaaaaargh', and still others go 'aaaaaarrrrrgggghhhh'. Man, it's gonna take some doing trying to match all those arghs. That's where counts come into play.

We have three new meta characters to look at today. The first is +, which says:

Match one or more of the preceding entity

Of course, this raises the question of what exactly is an entity. An entity is an atomic regular expression, or, in terms of what we've learnt so far, a single character, dot or character class. With that in mind, we can take some examples:
   /a/    -  matches 'a'
   /ab/   -  matches 'ab'
   /a+/   -  matches 'a', 'aa', 'aaa', 'aaaaa', and so on ad infinitum
   /a+b/  -  matches 'ab', 'aab', 'aaab', 'aaaaab', and so on, but not 'abb'
   /ab+/  -  matches 'ab', 'abb', 'abbb', 'abbbbb', and so on, but not 'aab'
Note the last one in particular. The + applies only to the 'b' and not to 'ab' as a whole.
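If you'd like to try these out, here is a minimal sketch in Python. Python is my pick for illustration only, since the lesson itself is language neutral. It assumes Python's re module and uses re.fullmatch, so the pattern has to account for the whole string, which is how the list above is meant to be read. The helper name matches is made up for this example.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# /a+b/: one or more 'a', followed by exactly one 'b'
print(matches(r"a+b", "aaab"))  # True
print(matches(r"a+b", "abb"))   # False, the + does not cover the 'b'

# /ab+/: exactly one 'a', followed by one or more 'b'
print(matches(r"ab+", "abbb"))  # True
print(matches(r"ab+", "aab"))   # False, the + does not cover the 'a'
```

Had we used re.search instead, 'abb' would be reported as a hit for /a+b/, because the search is happy to find 'ab' somewhere inside the string.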

So, let's get back to our example. We need to match 'argh', 'aaaaaaargh', 'aaaaaarrrrggggghhhh' and everything in between. The only thing we know is that each of 'a', 'r', 'g' and 'h' occur at least once. The moment we hear 'at least once', it should trigger the image of a '+' in the regex constructing part of our brains, so let's go ahead and use the +:
   /a+r+g+h+/
Simple enough right? We now have a regex that matches all of our requirements and then some. We should be able to catch everyone screaming now, so go ahead and scream.
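To check the scream-catcher against a few samples, a rough Python sketch could look like this. The names SCREAM, screams and results are invented for the example, and whole-string matching via fullmatch is an assumption on my part.

```python
import re

# The pattern from the lesson: at least one each of 'a', 'r', 'g' and 'h', in that order
SCREAM = re.compile(r"a+r+g+h+")

screams = ["argh", "aaaaaargh", "aaaaaarrrrrgggghhhh", "arg", "rgh"]

# Map each sample to whether the entire string is a valid scream
results = {s: SCREAM.fullmatch(s) is not None for s in screams}

for scream, ok in results.items():
    print(f"{scream!r}: {ok}")
```

The first three samples come out True. 'arg' and 'rgh' come out False because each is missing one of the four required letters.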

Hmm, there's still a bunch of folks who scream differently. Do you hear a bunch of 'aaaaarrrrhhhh's in there? The 'g' is completely missing. And there are those who go 'aaaaaarrrrrgggg', with no 'h' at the end.

This brings us to our second metacharacter of the day. The * character, which says:

Match zero or more of the preceding entity

Just as with + our examples with * are:
   /a/    -  matches 'a'
   /ab/   -  matches 'ab'
   /a*/   -  matches '', 'a', 'aa', 'aaa', 'aaaaa', ...
   /a*b/  -  matches 'b', 'ab', 'aab', 'aaab', 'aaaaab', ...
   /ab*/  -  matches 'a', 'ab', 'abb', 'abbb', 'abbbbb', ...
Note the case of /a*/, which also matches the empty string, and the case of /a*b/, which also matches 'b'. Neither of these matches contains an 'a', or, in other words, they contain zero occurrences of 'a'.
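The zero-occurrence cases are easy to get wrong, so here is a small Python sketch that puts * next to + on the same inputs. It assumes Python's re module and whole-string matching with re.fullmatch. The helper name full is hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """True when `pattern` matches all of `text` (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# * allows zero occurrences, + does not
print(full(r"a*", ""))    # True
print(full(r"a+", ""))    # False

# The 'a' may be absent before the 'b'
print(full(r"a*b", "b"))  # True
print(full(r"a+b", "b"))  # False

# The * applies only to the 'b', so a lone 'a' is fine
print(full(r"ab*", "a"))  # True
```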

While + says that there MUST be at least ONE of the preceding entity, * says that the preceding entity may be completely absent and is in fact optional. They both agree that there's no maximum. Specific implementations may apply arbitrary maxima, but for most practical cases, you can assume it's infinite.

Back to our example now, we know that in some cases the 'g' is optional, while in the other case, the 'h' is optional.

So, to change the 'g' from matching 'one or more' times to match 'optionally one or more' times or 'zero or more' times, we change the + to a *:
   /a+r+g*h+/
Similarly for the optional 'h', we have:
   /a+r+g+h*/
We don't, however, have a single regular expression to match both cases, i.e., either an optional 'g' or an optional 'h'. If we combine the two changes, we end up with this:
   /a+r+g*h*/
which unfortunately also matches 'ar', and I couldn't hear anyone scream that way. We've got to find a better way of matching this, but I'll do that later.
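Here is a sketch in Python that runs the three candidate patterns over the problem screams, to show where the combined version lets too much through. The labels in patterns and the variable names are made up for illustration, and whole-string matching is assumed.

```python
import re

# The three variants discussed above
patterns = {
    "optional g": r"a+r+g*h+",
    "optional h": r"a+r+g+h*",
    "both optional": r"a+r+g*h*",
}

screams = ["aaaaarrrrhhhh", "aaaaaarrrrrgggg", "ar"]

# For each pattern, collect the screams it accepts in full
accepted = {
    label: [s for s in screams if re.fullmatch(pattern, s)]
    for label, pattern in patterns.items()
}

for label, hits in accepted.items():
    print(f"{label}: {hits}")
```

Each single-change pattern accepts only its own scream, while the combined one accepts all three samples, including the unwanted 'ar'.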

In the course of writing regular expressions, you'll come across several cases where a particular entity must occur either zero or one times, but not more than one time. An example would be pluralisation of words by adding 's'. For example, if you needed to match 'apple' as well as 'apples', what would you do? An easy solution would be to use /apples*/ which we know will match 'apple' and 'apples' because the 's' at the end can occur zero or more times. Unfortunately, that 'more times' part comes back to kick us in the behind by also matching 'appless' and 'applesssssss', which isn't what we want to match.

Enter our third metacharacter for the day, the ? character, which says something to the effect of:

The presence of the preceding entity is questionable

Or stated less formally:

This here entity's optional and may occur either once or not at all

Well, what a coincidence, that's exactly what we need. We write up our new regex as /apples?/ and are done with it. Ok, all done, let's go home now.

Umm, just one minute. A few more examples:
   /a?/   -  matches '' and 'a' and nothing else
   /a?b/  -  matches 'b' and 'ab' and nothing else
   /ab?/  -  matches 'a' and 'ab' and nothing else
Note that that last one does not match the empty string because the ? applies only to the 'b' and not to the entire 'ab'.
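For the apples example, a short Python sketch shows the difference between * and ? on the same word list. The variable names are invented for the example, and re.fullmatch is assumed so that the whole word has to match.

```python
import re

words = ["apple", "apples", "appless", "applesssssss"]

# /apples*/ lets any number of trailing 's' through
with_star = [w for w in words if re.fullmatch(r"apples*", w)]

# /apples?/ allows at most one trailing 's'
with_question = [w for w in words if re.fullmatch(r"apples?", w)]

print(with_star)      # ['apple', 'apples', 'appless', 'applesssssss']
print(with_question)  # ['apple', 'apples']

# The ? in /ab?/ covers only the 'b', so the empty string is rejected
print(re.fullmatch(r"ab?", "") is None)  # True
```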

We've learnt a whole bunch of metacharacters now, including . + * ? and []. One question that I hope has arisen in your minds is, what if I want to match one of these characters? What if I actually do want to match a dot or a plus or a star? Well, simply escape it using a backslash character: '\'. Thus, the regex to match '.' is /\./, the regex to match '*' is /\*/, and so on. So how do you match '\.' then? Well, escape the '\' as well as the '.' to get:
   /\\\./
Yup, I know, pretty hairy, and depending on the implementation of regular expressions you use, this could set you up for what is known in professional circles as backslashitis. You have been warned.
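To see backslashitis up close, here is a sketch in Python, where the host language has its own opinions about backslashes. It assumes Python's re module. Raw strings (the r"..." form) and re.escape are Python features, not something the lesson depends on.

```python
import re

# Raw strings stop Python from treating the backslash as its own escape character
print(re.fullmatch(r"\.", ".") is not None)  # True, a literal dot
print(re.fullmatch(r"\.", "a") is not None)  # False, no longer "any character"
print(re.fullmatch(r"\*", "*") is not None)  # True, a literal star

# The string to match: a backslash followed by a dot (two characters)
target = "\\."
print(len(target))                                # 2
print(re.fullmatch(r"\\\.", target) is not None)  # True

# Without a raw string, every backslash in the pattern has to be doubled again
print("\\\\\\." == r"\\\.")  # True

# re.escape builds the escaped pattern for you
print(re.escape(target) == r"\\\.")  # True
```

The non-raw form needs six backslashes to express a pattern that matches just one, which is the kind of thing the warning above is about.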

I'll stop here today. We'll look into larger entities in the next lesson.

2 comments :

lawgon
April 30, 2006 8:33 AM

kewl - language neutral and you got the point across - aaaaaaarrrrrrrrgggggggggghhhhhhh

Anonymous
May 01, 2006 1:59 AM

Nicely written.
