[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Friday, April 28, 2006

Beginning Regular Expressions

A poet, too, was there, whose verse
Was tender, musical, and terse.
--Longfellow.
I often get asked to teach a class on Regular Expressions. I've been using them for about 7 years now, and I like the terseness with which they get the job done. In the years that I've taught regular expressions, my beginner lectures have gotten far more concise and less detailed. I thought I'd put some notes up here.

Start Simple

To start with, here are some thoughts on how one should go about building a regular expression. Any regular expression. Always start simple. Start with the smallest pattern that you can comfortably identify with. Note the emphasis on you. Don't go by someone else's idea of simplicity; always use your own. To a first-time regexer, that may mean thinking in terms of English words directly. Let's start...

Note: for this tutorial, I'll use single quotes '' to mark literal strings and slashes // to mark regular expressions. Also, the =~ operator will be used to match the string on the left with the pattern on the right.

The simplest regular expression is one that matches a single specific character. Let's take the character 'a'. The regular expression to match this is /a/
'a' =~ /a/
Let's try that again. This time, a regular expression to match 't'.
't' =~ /t/
Simple so far?
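
The notation above isn't tied to any one language. As a rough sketch, here is how the same two matches could look in Python, assuming its standard re module. The helper name matches is made up for this illustration and simply stands in for the =~ operator.

```python
import re

def matches(string, pattern):
    """Hypothetical stand-in for `string =~ /pattern/`.

    Returns True if the pattern is found anywhere in the string.
    """
    return re.search(pattern, string) is not None

print(matches('a', r'a'))   # 'a' =~ /a/  -> True
print(matches('t', r't'))   # 't' =~ /t/  -> True
print(matches('a', r't'))   # there is no 't' in 'a' -> False
```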

Combining expressions

Let's move on to slightly more complex stuff: combining two simple regular expressions to make one complex regular expression. We'll combine the two regexes we learnt above to build a regex that matches 'at'. This is basically the regex to match 'a' followed by the regex to match 't':
'at' =~ /at/
If you've got that, then you're all set, and can build some really complex stuff. Just remember to always break it down to the smallest manageable piece.
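
To make the "one regex followed by another" idea concrete, here is a small sketch, again assuming Python's re module. The variable names are only for illustration. It glues the two single-character patterns together as plain strings and checks the result.

```python
import re

# The two single-character regexes from the previous section.
a_regex = r'a'
t_regex = r't'

# Writing one pattern directly after the other means "a followed by t".
at_regex = a_regex + t_regex

print(at_regex)                          # at
print(bool(re.search(at_regex, 'at')))   # True
print(bool(re.search(at_regex, 'cat')))  # True: 'at' appears inside 'cat'
print(bool(re.search(at_regex, 'ta')))   # False: the order is wrong
```

Note that the pattern is found anywhere in the string, which is why 'cat' matches too.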

This also means that you don't break it down unnecessarily. Most people I know don't need to break everything down to a single character regex. They're quite comfortable dealing with /at/ or even something as big as /apple/ or /orange/. In fact, any combination of letters that makes up a word in a familiar language also makes a familiar regex. From now on, we'll deal with these large thoughts and make even bigger things, jumping back to the smaller chunks only when we introduce a new metacharacter.

The . metacharacter

Regular expressions are full of metacharacters: wildcards and modifiers that allow you to match more than just one specific literal character. One of the most used metachars is '.' (that's a dot or period). A dot matches any single character, so a regex like /./ matches 'a', 'b', 'c' and everything else, but not '' ie, the empty string, because that has no characters. The dot doesn't say what the character at its position should be, but it does say that there MUST be ONE and only ONE character there. It also does not say what, if anything, should precede or follow that one character. On its own therefore, a dot is not very useful. An empty-string check is probably far more efficient. Used in conjunction with other regexes though, it becomes quite powerful.

For example, take a regular expression to match the word 'ate' preceded by any one character. Ie, I want to match words like 'aate', 'bate', 'cate', 'date', 'eate', and so on, for any possible first letter.
$string =~ /.ate/
Ok, I lied a bit when I said that /./ matches any character. There are two characters that it may not match, depending on the implementation. These are the NUL character (ie, the character with ASCII code 0), which tools built on C strings cannot handle, and the newline character (except in special circumstances). For most cases, this isn't really a problem.
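
Here is a quick sketch of the dot in action, assuming Python's re module. The word list is invented for illustration. Python's dot does match NUL, so only the newline case is shown here. The re.DOTALL flag is Python's version of the "special circumstances" in which the dot matches a newline as well.

```python
import re

dot_ate = re.compile(r'.ate')

# Any single character in front of 'ate' will do, except a newline.
for word in ['date', 'bate', '9ate', ' ate', 'ate', '\nate']:
    print(repr(word), dot_ate.search(word) is not None)

# With re.DOTALL the dot is allowed to match a newline too.
dot_ate_dotall = re.compile(r'.ate', re.DOTALL)
print(dot_ate_dotall.search('\nate') is not None)

# A lone dot still needs exactly one character, so the empty string fails.
print(re.search(r'.', '') is not None)
```

Only 'ate' and '\nate' fail in the loop: the first has no character in front of 'ate', and the second has only a newline there.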

I'll move on.

Character Classes

The above example would also match character sequences like ' ate' (that's a space) or '9ate', or '+ate', etc. Perhaps that wasn't quite my intention. My intention was to have a letter of the alphabet followed by ate. The solution is to use what is known as a character class.

Now, before we go forward, store in one corner of your mind the fact that what most languages call a character class and what POSIX calls a character class are slightly different, though both relate to the same thing. We'll address POSIX later, but for now we'll just think about what most implementations call a character class.

A character class is a series of characters, any one of which is a valid match for that position. An example may be easier to understand. Taking my earlier case, if I had to match the following words: 'date', 'fate', 'gate', 'hate', then my regular expression would be /[dfgh]ate/
$string =~ /[dfgh]ate/
Notice the square brackets containing dfgh (which just coincidentally happen to lie right next to each other on a QWERTY keyboard): '[dfgh]'. What that says is that the character at that position must match any one of 'd', 'f', 'g' or 'h'.

Note that it states two things. First, there MUST be ONE and only ONE character at that position, and second, that one character may be either a 'd', an 'f', a 'g' or an 'h' and nothing else. Commit to memory the ONE character thing, because that's where most people go wrong when they do inverted matches. Remember this as a rule:
  • A character class [] matches ONE and only ONE character
  • The . metacharacter is like a character class with all characters except \n (and, in some implementations, \000)
Note also that you can include \n and \000 in your character class.
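
A short sketch of the ONE character rule, assuming Python's re module. The test strings are made up for illustration. Python spells NUL inside a pattern as \x00.

```python
import re

dfgh_ate = re.compile(r'[dfgh]ate')

print(bool(dfgh_ate.search('gate')))      # True
print(bool(dfgh_ate.search('late')))      # False: 'l' is not in the class
# The class stands for exactly one character, so in 'dfate'
# it matches the 'f' only, and the match found is 'fate'.
print(dfgh_ate.search('dfate').group())   # fate

# Newline and NUL can be listed inside a class explicitly.
odd_class = re.compile(r'[\n\x00]ate')
print(bool(odd_class.search('\nate')))    # True
print(bool(odd_class.search('\x00ate')))  # True
```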

Now character classes are useful, but if you have to match too many characters, this could get unwieldy. Not as unwieldy as the alternatives, but unwieldy nonetheless, and terseness is what we like anyway. Enter ranges.

Ranges in character classes

You can specify a character range within the square brackets to indicate that you want to match any character that lies within that range within the current character set. So, if we wanted to match all the lowercase letters of the English alphabet, we'd write /[a-z]/. Note the order. In the previous case, it made no difference if we wrote /[dfgh]/ or /[dhgf]/, but in the current example, /[a-z]/ is very different from /[z-a]/. The latter is an invalid range.
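
As an illustration of why the order matters, here is a sketch assuming Python's re module. Other engines report the problem in their own way. Python refuses to compile the reversed range at all.

```python
import re

lower = re.compile(r'[a-z]')
print(bool(lower.search('q')))   # True
print(bool(lower.search('Q')))   # False: uppercase lies outside a-z

# The reversed range is rejected when the pattern is compiled.
try:
    re.compile(r'[z-a]')
    reversed_ok = True
except re.error as exc:
    reversed_ok = False
    print('rejected:', exc)
```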

A single character class needn't be restricted to a range though, and you can combine a range along with characters that are not part of the range. Our earlier example could be replaced with this instead:
$string =~ /[df-h]ate/
Not that it buys us much in terms of reducing keystrokes in this particular example, but there are cases where it will.
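
To check that the two spellings really are interchangeable, here is a small sketch, assuming Python's re module and an invented word list. It runs both patterns over the same words and compares the answers.

```python
import re

explicit = re.compile(r'[dfgh]ate')   # every character listed
ranged = re.compile(r'[df-h]ate')     # 'd' plus the range f-h

words = ['date', 'eate', 'fate', 'gate', 'hate', 'late']
for word in words:
    print(word, bool(explicit.search(word)), bool(ranged.search(word)))

# True if both patterns gave the same answer for every word.
agree = all(
    bool(explicit.search(w)) == bool(ranged.search(w)) for w in words
)
print(agree)
```

Both patterns accept 'date', 'fate', 'gate' and 'hate', and both reject 'eate' and 'late'.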

Inverted character classes

Character classes can also be negated to match any character NOT included in the class. To do this, use a '^' as the first character of the character class: /[^a-z]/. This regex will match any character that does not lie in the range 'a-z', both ends inclusive. Note that it still requires ONE character to be at that position. The only difference is that we now state what that character cannot be rather than what that character can be. This regex will still not match the empty string ''.
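
The "still needs ONE character" point is the one that trips people up, so here is a sketch of it, assuming Python's re module. The test strings are invented for illustration.

```python
import re

not_lower = re.compile(r'[^a-z]')

print(bool(not_lower.search('9')))     # True
print(bool(not_lower.search('A')))     # True
print(bool(not_lower.search('ab c')))  # True: the space is not in a-z
print(bool(not_lower.search('abc')))   # False: every character is in a-z
print(bool(not_lower.search('')))      # False: there is no character at all

# The common mistake: expecting "not a letter" to also mean "nothing".
not_lower_ate = re.compile(r'[^a-z]ate')
print(bool(not_lower_ate.search('ate')))   # False: nothing precedes 'ate'
print(bool(not_lower_ate.search('9ate')))  # True
```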

I'm going to stop here for today. There's much more that needs to be done before we can make really useful regexes, but we should be able to play around a bit with this much. I'll post a follow up tomorrow.

10 comments :

Sumeet
April 28, 2006 5:16 PM

Looks good. Onto part II!

Anonymous
April 28, 2006 10:12 PM

Are you going to be illustrating stuff using a particular language? Or you looking at pure regexes, with all the attention on matching? Your focus (style, examples) might change depending on your decision.

Looks good. I've understood all of it, so far!

-Vivek

Natarajan
April 28, 2006 10:51 PM

A very nice intro to regex. It is very newbie friendly and won't turn people away.

Anonymous
April 28, 2006 11:18 PM

cute.

very well written. concise.

I thought you would go a little further on the first post though. That's my only nitpick.

:)
Abhay

Jess
April 29, 2006 4:41 AM

Heyyy.. the class got documented! Cool! I wasn't bored.. I just had work. :/

SameerDS
April 29, 2006 7:53 AM

I thought I knew all about regular expressions, at least enough to get by on the occasional grep or "perl -pi" sort of work ... until I reached the end of the article about what "[^a-z]" does NOT match. Damn.

Prady
April 30, 2006 3:00 AM

Oh cool ! Neat one. Thanks and want to see more.

Kalyan
May 02, 2006 9:20 AM

i always used regular expression,everytime with a beginer's face.This really helped.Thanks!

Rajesh
May 03, 2006 6:15 AM

was good...
nice work...
thanx

Snehal
June 23, 2008 7:05 AM

Good Comments, very good document,

i impressed. you can read more from
you can get more detail from "1Javascript 2.0 -The Complete Reference" Books. and chapter 8.

Thanks,
Snehal_008
Life Never Stops


...===...