[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Sunday, April 30, 2006

(Grouped or regular) expressions

If you're still here, chances are you're serious about regular expressions. In that case, I'd suggest reading Mastering Regular Expressions by the great Jeffrey Friedl. You'll learn much more there than I'll cover here, and well, that's where I learnt most of what I'm covering here.

Lesson three looks at entities larger than single character regexes, and how to alternate among them. As usual, let's start with an example. Let's try matching the word 'ohlala'. Fairly simple right? /ohlala/ should do it, and it does, but it misses out on all the variations and melodrama. What if someone were a little more excited and said 'ohlalala' instead? What of the really overwhelmed who go 'ohlalalalalala'? These matches follow a pattern of course. One that we discovered yesterday. They seem to be 'oh' followed by two or more 'la's. But we don't know how to match two or more of something. We know how to match one or more, and zero or more, but not two or more. If you think about it though, two or more is just one followed by one or more, and we know how to match both of those.
   /oh/  matches the 'oh' part
/la/ matches the first 'la' part
/la+/ matches ?
Remember, you concatenate multiple regexes to let them match a single string in sequence. Concatenating these, we get:
   /ohlala+/
Does this match what we want?

It matches 'ohlala', but does it match 'ohlalala'? Not really. The last /la+/ matches 'la', 'laa', 'laaa', and so on. We therefore have matches like 'ohlalaa' and 'ohlalaaaaaaa'. Not what we want. Only the 'a' repeats and not the entire 'la'. To get the entire 'la' to match, we've got to make sure that 'la' is the entity preceding the + and not just 'a'. This is where grouping parentheses come in. These are the ( and ) characters placed around the portion that we want to group.

Grouping parentheses ( and ) create a group of the partial regex that they enclose. Everything within ( and ) is treated as a single entity for all modifier metacharacters.

In this case, it's the 'la', so let's do this:
   /(la)+/
By our definition of +, it should match one or more of the preceding entity, and by our definition of grouping parentheses, everything within the ( and ) is a single entity. This regex therefore matches 'la', 'lala', 'lalala' and so on. Taking it back to our example, we now have the regex:
   /ohla(la)+/
which matches 'ohlala', 'ohlalala', and 'ohlalalalala'.

Looking at this though, it's always tempting to wonder whether we could have combined the /la/ and the /(la)+/ into a single entity. Looking at our arsenal of metacharacters, it doesn't seem possible. I believe most popular current implementations have a range count match for entities, ie they allow you to specify both the minimum and the maximum number of matches a particular entity can match. The notation used is curly braces with the min and max within, eg: {2, 15} will match a minimum of 2 and a maximum of 15 times.

A modifier of {n,m} tells the pattern matcher to match the preceding entity at least n times and at most m times.

If any one of n or m are omitted, a suitable default is used. The default value of n is 0, and all implementations agree on this. The default value of m is infinite, but not all implementations agree on the value of infinite. You'll figure it out as you work with it. If the braces contain only a single number, with no comma, then it's a match of an exact count, so {5} means match exactly five times.

We can now modify our example to use this:
   /oh(la){2,}/
Which says match two or more times. This is exactly what we want.

It's tempting to think if we could do something at this point to match the patterns from our last lesson. To recollect, we needed a single pattern to match both /a+r+g+h*/ and /a+r+g*h+/. If we think about it, these can be factored.

/g+h*/ can also be written as /g*gh*/ while /g*h+/ can also be written as /g*hh*/.

Looking at these to regexes, it's basically /g*/ and /h*/ with either a 'g' or an 'h' in between. Either 'g' or 'h' is also written as /[gh]/, so our regular expression now becomes /a+r+g*[gh]h*/

Does this match everything? If the character class chooses 'g', then it is identical to /a+r+g+h*/ and if the character class chooses 'h', then it is identical to /a+r+g*h+/. The character class can never choose anything else, and must match exactly one character. It will therefore not match 'ar'. So, yes, it does match everything we want and nothing we do not want.

There are other ways to match it though. When we came up with the original problem, our solution was to have two regular expressions and pick one of them. In other words, we could alternate between two regexes. The regex language provides a meta character for alternation as well.

The | metacharacter alternates a match between the regex on its left and the regex on its right

Note the use of regex rather than entity. Unlike other meta characters, the | metacharacter does not apply to single character or group entities. It applies to everything on its left within the current regex. The current regex though, may be limited by grouping parentheses. A few examples should clear it up:
   /abc|def/         -  match /abc/ or /def/
/a(bc)|(de)f/ - match /a(bc)/ or /(de)f/
/a(bc|de)f/ - match /a/ followed by /bc/ or /de/ followed by /f/
Take note of all three. The | metacharacter is not affected by complete groups on its left or right. It will take into affect for everything extending up to the next enclosing group, ie the group in which the | is. If the | is not in any group, then it matches up to the start and end of the regex.

What if there are two | characters in there?
   /abc|def|ghi/
/a(bc|de)(f|gh)i/
/a(bc|def|gh)i/
The rule still holds. Everything to the left and right. It doesn't really matter whether you think of each | individually matching everything to its left or everything to its right or all the | characters within a group as a single alternation switch. The outcome will always be the same. Think of it in the way that you find simplest. Note that in case 2, we have two separate | characters, each within a different group, and that case 3 is very similar to case 1 except that a group in case 3 limits the extent of the |.

Getting back to our example now, we could have written the initial regular expression as:
   /a+r+(g+h*|g*h+)/
Which would have worked correctly, and possibly have been easier to read.

It's however not as efficient as the character class method we developed earlier, so when given a choice between a character class and an alternation, pick the character class. That said, it's not always possible to convert an alternation to a character class, and it can in many cases be easier to read, so if performance isn't hitting you, stick with what's easier to read and write.

I'll end lesson three here, and wonder about what's left to be covered.

3 comments :

Anonymous
May 02, 2006 5:45 AM

are you going to cover things like search and replace and substitute etc, or only on building regexes?

Philip
May 02, 2006 8:54 AM

replace isn't part of regular expressions, search, is what I've been doing.

Anonymous
May 05, 2006 4:08 AM

Are you planning to cover more complex like making non greedy regex, grouping without memory and things like that?

Whatever covered so far has been well written and well explained.

Thanks,
Jemshad

Post a Comment

...===...