[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Showing posts with label programming style. Show all posts
Showing posts with label programming style. Show all posts

Thursday, September 04, 2008

Programming patterns in sed

I write a lot of code in sed whenever I need to do some kind of filtering, and I realised that there are several patterns that emerge. Sed is a Stream EDitor, and its capabilities are somewhat limited, yet it does provide for some of the more important things required in a programming language. It has sequence, selection, iteration, variables and debugging statements. In this post, I'll go over each of these.

1. Sequence

I've started with sequence because it's always the easiest to explain. Unless branching is involved, a sed script flows from top to bottom. All statements are executed in sequence, and that's pretty much all I have to say about it. Let's move on.

2. Selection

Selection is where things start happening. There are a few ways to execute a statement based on a condition. That condition almost always deals with a pattern in the current input line, but we'll see later how that can be changed. For now, here's how you do selection:
   /pattern/ command

s/pattern/replace/
t label

s/pattern/replace/
T label
The first is a simple "execute this command if the current pattern space matches /pattern/". That's akin to saying if(line.match(/pattern/)) { command; } in more common programming languages. Command could even be a block of commands enclosed in braces like this:
   /pattern/ {
command1
command2
command3
}
Let's take a few examples. We'll assume that sed is called without arguments, so each line is printed once by default.
  1. If the line starts with "hello", add "world" after it:
       /^hello/ s/^hello/hello world/
  2. If the current line number is 3, print out the line twice:
       3 p
    Since each line is printed once by default, the p prints it a second time.
  3. If the line starts with "next", swap it with the next line and print both out:
       /^next\>/ {
    N
    s/\(.*\)\n\(.*\)/\2\n\1/
    }

The second and third type of selection are similar, and basically say branch to a label if the previous replace command succeeded (t) or failed (T). These make more sense when looking at iteration, so that's what we'll do now.

3. Iteration

Things always get interesting when you iterate. You can execute the same set of statements over a group of data without knowing in advance what that data is. The b, t and T commands come in play here, along with labels defined with the : command, similar to other programming languages. We'll look at some common loop types from other languages:
  1. While(condition) {...} (loop executed 0 or more times)
       :loopstart
    /condition/ {
    command1
    command2
    command3
    b loopstart
    }

    For example, while the input contains ==, append the contents of the file named equals.txt:
       :loopstart
    /==/ {
    s/==//
    r equals.txt
    b loopstart
    }

    We can also do this with the T command:
       :loopstart
    s/==//
    T loopend
    r equals.txt
    b loopstart
    :loopend
    Though it's a little more clumsy this way because you need two labels. The first method is the code pattern that I use for a while loop.
  2. Do {...} While(condition) (loop executed 1 or more times)
       :loopstart
    command1
    command2
    command3
    /condition/ b loopstart
    This is almost the same as the first loop, except that the condition is tested at the end of the block of statements. Let's take the same example, but this time, we read in the file at least once:
       :loopstart
    r equals.txt
    /==/ {
    s/==//
    b loopstart
    }
    In this case, using the t command makes it less messy since we need to do the replacement anyway:
       :loopstart
    r equals.txt
    s/==//
    t loopstart
  3. The third type of loop is a for loop, which is harder because you can't really do math in sed. Still, if one tries, one can figure out weird ways to count. In this case, we use the hold space:

    # We want to print the current line 10 times:

    # 1. Grab the current line into the hold space
    h
    # 2. Replace the pattern space with = based on what we want to count to
    c \
    ==========
    # 3. Print the line as long as there are = left:
    :loopstart
    s/^=//
    T loopend
    x
    p
    x
    b loopstart

    In this code, we need to constantly swap between the pattern space and the hold space, since all our operations are done in the pattern space. Which brings us to variables.

4. Variables

Well, make that variable, since sed has only one piece of memory that can hold something, and that's called the hold space. The good news though, is that it has no size limit - well, theoretically at least. This means that using your own delimiters, you could store anything in there. JSON anyone? I generally use the newline character as a delimiter, since that's unlikely to show up more than once in a single line of input, but you can use anything that you think is unique to your application. Here's one way to do it:

# 1. Swap the hold and pattern space
x
# 2. Set the pattern space to the value of your variable using the s, c, i, a, g or G commands:
s/$/\nfoo\n/
G
# 3. Swap the hold and pattern space again
x

# The hold space now contains:
# prev value of hold space\n
# foo\n
# the input line\n
And you can use these values later using the s/// command:

# 1. Append the current line to the hold space
H
# 2. Pull the hold space into the pattern space
g
# At this point the pattern space has the newline separated list of
# variables followed by a newline and the current input line
# We can use these variables if we know their position, and replace
# them into the input line:

# 3. Append the previous input line to the first word of the current input line:
s/.*\n.*\n\(.*\)\n\([A-Za-z][A-Za-z]*\)\>/\2\1/

# Now the input line has been modified, and the hold space remains the same

5. Debugging

Finally, we come to debugging, which is extremely useful when writing code in such a strange language. sed has two commands that make debugging possible, though I won't say easy. The = command prints out the current line number of the input file. Note that this is not the line number of the sed script, but of the input that the script reads. The l command prints out the current input line in a visually unambiguous way. It's up to you to scatter your code with these lines to figure out what's happening internally. You can always swap the pattern and hold spaces and use the l command to find out what's in your hold space.

Apart from the above, sed also has methods to read and write files. We've seen reading above with the r command, and similarly, writing is handled by the w command. There are also R and W commands, but you can read the manual to figure those out. I'll leave sed here.

Update: should have mentioned this a long time ago, but someone wrote tetris in sed

Saturday, May 17, 2008

Commenting

A discussion at work got into whether comments are good or not. I have my own opinions about comments, and years of having to dig through other people's code has taught me not to trust them too much. Comments that I've put into my code in the past have generally been stories and anecdotes that I wanted to tell future readers of my code, and seldom about the code itself.

When I have to go through someone else's code, I start off by assuming that their comments were bad and delete all but a few of them. They may not actually be bad, but it saves me the time I'd spend separating bad comments from good ones, and I can always refer to the last version in CVS if I need to check on old comments. In general, I've seen the following types of comments:
  1. Useless comments - they explain what a particular piece of code does
  2. Obsolete comments - the code changed but the comments didn't
  3. Mental notes - they explain why a particular piece of code does what it does
  4. Reminders - FIXME/TODO comments
There are also API docs that look like comments, but that's only because the comment syntax in most languages is the only way to add documentation (perl has pod though). I'll ignore this type, but you still need to be careful of API docs that don't match the API or the implementation - they're probably both wrong.

The first type is largely useless. The code is either very obvious in what it does, in which case the comment is not needed, or it's not, in which case it should be rewritten. One of the rules from the Elements of Programming Style goes thus:
Don't comment bad code, rewrite it
It's #63 if you care, while #69 says:
Don't over-comment
The next type is worse than useless, it slows you down in your understanding of the code. Delete and take note of rule #61:
Make sure comments and code agree
The third type of comment is useful in understanding a block of code and the thought process that went into it, and is also useful in knowing how to rewrite/optimise that code. Some developers may prefer to push this kind of comment out to bugzilla and link to it in the code. Either way works. I generally leave these comments around or rewrite them depending on what I'm doing to the code they affect. Rule #62 says:
Don't just echo the code with comments - make every comment count
This is how you make it count.

Reminders, IMO, are better suited to bugzilla with the bug number noted in the code. This is so mainly because bugzilla sends out reminders while your code doesn't. If I can, I'll fix the relevant code, or delete the comment if no longer pertinent (these kinds of comments tend to hang around for several years without anyone noticing).

Oh, and just in case someone wanted to counter #2 by saying that sometimes you need to write obscure code to be efficient, or because it was a smart way to do something, I present to you rules #1, #2 and #5:
Write clearly - don't be too clever
Say what you mean, simply and directly
Write clearly - don't sacrifice clarity for "efficiency"
So, have fun commenting your code, have a dialogue with your readers, or think of it as a mini blog about the code, but make sure you make your comments useful or fun. If you're interested in the quick summary of the Elements of Programming Style, grab the fortune cookie file.

Wednesday, January 28, 2004

The Elements of Style

Programmers have a strong tendency to underrate the importance of good style. Eternally optimistic, we all like to think that once we throw a piece of code together, however haphazardly, it will work properly the first time and ever after. Why waste time cleaning up something that is almost certain to be correct? Besides, it probably will be used for only a few weeks.

There are really two answers to the question. The first is suggested by the word "almost". A slap-dash piece of code that falls short of perfection can be a difficult creature to deal with. The self-discipline of writing it cleanly the first time increases your chances of getting it right and eased the task of fixing it if it is not. The programmer who leaps to the coding pad or the terminal and throws a first draft at the machine spends far more time redoing and debugging than does his or her more careful colleague.

The second point is that phrase "only a few weeks". Certainly we write code differently depending on the ultimate use we expect to make of it. But computer centres are full of programs that were written for a short-term use, then were pressed into years of service. Not only pressed, but sometimes hammered and twisted. It is often simpler to modify existing code, no matter how badly written, than to reinvent the wheel yet again for a new application. Big programs — operating systems, computers, major applications — are never written to be used once and discarded. They change and evolve. Most professional programmers spend much of their time changing their own and other people's code. We will say it once more — clean code is easier to maintain.

One excuse for writing an unintelligible program is that it is a private matter. Only the original programmer will ever look at it, and surely he need not spell out everything when he has it all in his head. This can be a strong argument, particularly if you don't program professionally. But even if only you personally want to understand the message, if it is to be readable a year from now you must write a complete sentence.

You learn to write as if to someone else because next year you will be "someone else". Schools teach English composition, not how to write grocery lists. The latter is easy once the former is mastered. Yet when it comes to computer programming, many programmers seem to think that a mastery of "grocery list" writing is adequate preparation for composing large programs. This is not so.

The essence of what we are trying to convey is summed up in the elusive word "style". It is not a list of rules so much as an approach and an attitude. "Good programmers" are those who already have learnt a set of rules that ensures good style; many of them will read this book and see no reason to change. If you are still learning to be a "good programmer", however, then perhaps some of what we consider good style will have rubbed off in the reading.

— The Elements of Programming Style (Second Edition)
by Kernighan and Plaugher

The above text is part of the Epilogue of The Elements of Programming Style
by Kernighan and Plaugher.

See also: Elements of Programming Style fortune mod.

...===...