The other side of the moon: September 2008

Tuesday, September 23, 2008

Date formats and browsers

Since my last post on javascript dates, I've been doing a lot more reading and testing. This post is a summary of what I've found.

I read up on a few RFCs and ISO specifications: in particular, RFC2822: Internet Message Format, and RFC3339: Date and Time on the Internet: Timestamps, which defines a profile of ISO8601, the international standard for date and time representation. Below is a summary of what these specifications state.

RFC2822 lists two date formats, in sections 3.3 and 4.3. 3.3 is current while 4.3 is obsolete. RFC3339 lists a third date format based on ISO8601. All these specifications are for date-time representations for use on the Internet.

RFC3339:: 2008-09-23T17:39:44-07:00
RFC2822:: Tue, 23 Sep 2008 17:39:44 -0700; (obsolete) Tue, 23 Sep 2008 17:39:44 PDT (might also include 2 digit year)

Both specifications state that the time zone code SHOULD NOT be used, and only the offset should be used. The colon used in the time zone offset is optional in ISO8601, but not in RFC3339. The time zone may be replaced with Z, indicating Zulu time (UTC+0000).
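
To make the two current formats concrete, here's a rough sketch that builds both strings from an instant and a fixed offset. The function names (`toRfc3339`, `toRfc2822`, `formatOffset`) and the convention of passing the offset as minutes east of UTC are my own assumptions for illustration; they don't come from either specification.

```javascript
// Day and month abbreviations as RFC 2822 spells them.
var DAYS = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];
var MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'];

function pad2(n) {
  return (n < 10 ? '0' : '') + n;
}

// offsetMinutes is minutes east of UTC (assumed convention), e.g. -420 for -07:00.
// separator is ':' for RFC 3339 and '' for RFC 2822.
function formatOffset(offsetMinutes, separator) {
  var sign = offsetMinutes < 0 ? '-' : '+';
  var abs = Math.abs(offsetMinutes);
  return sign + pad2(Math.floor(abs / 60)) + separator + pad2(abs % 60);
}

// Shift the instant by the offset and read the UTC fields, so the result
// does not depend on the time zone of the machine running the code.
function shifted(utcMillis, offsetMinutes) {
  return new Date(utcMillis + offsetMinutes * 60000);
}

function toRfc3339(utcMillis, offsetMinutes) {
  var d = shifted(utcMillis, offsetMinutes);
  return d.getUTCFullYear() + '-' + pad2(d.getUTCMonth() + 1) + '-' +
    pad2(d.getUTCDate()) + 'T' + pad2(d.getUTCHours()) + ':' +
    pad2(d.getUTCMinutes()) + ':' + pad2(d.getUTCSeconds()) +
    formatOffset(offsetMinutes, ':');
}

function toRfc2822(utcMillis, offsetMinutes) {
  var d = shifted(utcMillis, offsetMinutes);
  return DAYS[d.getUTCDay()] + ', ' + pad2(d.getUTCDate()) + ' ' +
    MONTHS[d.getUTCMonth()] + ' ' + d.getUTCFullYear() + ' ' +
    pad2(d.getUTCHours()) + ':' + pad2(d.getUTCMinutes()) + ':' +
    pad2(d.getUTCSeconds()) + ' ' + formatOffset(offsetMinutes, '');
}
```

Calling both with `Date.UTC(2008, 8, 24, 0, 39, 44)` and an offset of `-420` should reproduce the two example strings above.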

Additionally, RFC2822 lists 10 codes in the obsolete syntax: UT, GMT, EST, EDT, CST, CDT, MST, MDT, PST, PDT. Military time zones (A-Z) are also included in the obsolete list. While other time zone codes have been used by various implementations, they are not defined by a common specification, and SHOULD NOT be used.
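
If you do have to accept the obsolete codes, a small lookup table is about all it takes. This is only a sketch: the offsets are the ones RFC2822 section 4.3 assigns to the ten codes, and the function name `obsoleteZoneToOffset` is made up for this example. Anything else, including the military letters, falls back to `-0000`, which is how the RFC says unknown zones should be treated.

```javascript
// The ten named zones from the obsolete syntax, with their offsets.
var OBSOLETE_ZONES = {
  UT: '+0000', GMT: '+0000',
  EDT: '-0400', EST: '-0500',
  CDT: '-0500', CST: '-0600',
  MDT: '-0600', MST: '-0700',
  PDT: '-0700', PST: '-0800'
};

// Hypothetical helper: map an obsolete zone code to a numeric offset.
// Unknown codes and military letters are reported as '-0000'.
function obsoleteZoneToOffset(code) {
  var key = String(code).toUpperCase();
  if (Object.prototype.hasOwnProperty.call(OBSOLETE_ZONES, key)) {
    return OBSOLETE_ZONES[key];
  }
  return '-0000';
}
```

For example, `obsoleteZoneToOffset('PDT')` gives `'-0700'`, while `obsoleteZoneToOffset('IST')` gives `'-0000'` because IST is not defined by the specification.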

My tests with the Date.toString() method have shown the following:

Internet Explorer returns the time zone code for the eight US time zones listed in section 4.3 of RFC2822, and the offset for all other time zones using the UTC+/-HHMM format. The offset is NOT included for the eight US time zones specified above. This is in line with RFC822, but not with RFC2822, which obsoletes it.

Examples:

PST:: Thu Sep 18 16:47:36 PDT 2008
IST:: Fri Sep 19 10:11:38 UTC+0530 2008

Firefox returns the offset for all time zones, and a time zone string in parentheses for some time zones. The string is different on different platforms:

Windows:: full timezone name, e.g., Pacific Standard Time
Linux & Mac OS X:: timezone code, e.g., PST

Examples:

Windows/Taiwan: Tue Sep 23 2008 10:14:10 GMT+0800
Windows/India: Tue Sep 23 2008 09:42:49 GMT+0530 (India Standard Time)
Mac OS X/US West Coast: Tue Sep 23 2008 17:57:04 GMT-0700 (PDT)
Linux/India: Tue Sep 23 2008 09:42:49 GMT+0530 (IST)

Safari matches the results for Firefox on Mac OS X.

Both Safari and Firefox match the specification by including the offset. The time zone code that they return is considered auxiliary information for display purposes only.

Opera consistently uses the UTC+/-HHMM format on all platforms, and never includes a time zone code. This is probably the strictest adherence to the specification.
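
Since the offset is the one thing most of these strings agree on, here's a minimal sketch that pulls it out of a `toString()` result. The function name `offsetFromDateString` is hypothetical, and the pattern only covers the `GMT+/-HHMM` and `UTC+/-HHMM` forms shown in the examples above; it returns `null` for strings like IE's US output, which carry no offset at all.

```javascript
// Hypothetical helper: find a GMT/UTC-prefixed numeric offset in a
// stringified date. Returns e.g. '-0700', or null if none is present.
function offsetFromDateString(s) {
  var m = /\b(?:GMT|UTC)([+-]\d{4})\b/.exec(s);
  return m ? m[1] : null;
}
```

When you have the Date object itself rather than a string, `getTimezoneOffset()` is the more dependable route. It returns the offset in minutes, with the sign flipped relative to these strings (positive means west of UTC).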

So, what does all this mean? Well, to me, it means that I shouldn't spend any more time searching for a browser-independent way to pull out the time zone code. It also means that strftime does not provide an easy way to print out an RFC3339 date with the colon in the time zone offset, but we'll figure something out for that :)
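
One possible workaround for the colon, sketched under the assumption that you already have a timestamp ending in a `+HHMM` or `-HHMM` offset (for instance from a `%z`-style format): post-process the string and insert the colon yourself. `addOffsetColon` is a made-up name, not part of any strftime library.

```javascript
// Hypothetical helper: turn a trailing +HHMM / -HHMM offset into +HH:MM / -HH:MM.
// Strings that end in 'Z' or already contain the colon are returned unchanged.
function addOffsetColon(timestamp) {
  return timestamp.replace(/([+-]\d{2})(\d{2})$/, '$1:$2');
}
```

So `addOffsetColon('2008-09-23T17:39:44-0700')` should give `'2008-09-23T17:39:44-07:00'`.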

Finally, a bit of trivia...

Standard time in the Netherlands was exactly 19 minutes and 32.13 seconds ahead of UTC by law from 1909-05-01 through 1937-06-30. This time zone cannot be represented exactly using the HH:MM format.
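
To see why, here's the arithmetic as a small sketch. The helper name `nearestMinuteOffset` is my own; it rounds an offset given in seconds to the closest whole minute and formats it as +/-HH:MM.

```javascript
// Hypothetical helper: round an offset in seconds to the nearest minute
// and format it as +HH:MM or -HH:MM.
function nearestMinuteOffset(offsetSeconds) {
  var minutes = Math.round(offsetSeconds / 60);
  var sign = minutes < 0 ? '-' : '+';
  var abs = Math.abs(minutes);
  var hh = Math.floor(abs / 60);
  var mm = abs % 60;
  return sign + (hh < 10 ? '0' : '') + hh + ':' + (mm < 10 ? '0' : '') + mm;
}

// 19 minutes and 32.13 seconds, expressed in seconds.
var netherlandsOffset = 19 * 60 + 32.13;
var closest = nearestMinuteOffset(netherlandsOffset);
// How far the rounded offset is from the legal one, in seconds.
var errorSeconds = Math.abs(20 * 60 - netherlandsOffset);
```

The closest representable offset is `+00:20`, which is off by roughly 27.87 seconds.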

Short URL: http://tr.im/dateformats

Thursday, September 18, 2008

Date inconsistencies in Javascript

Ever since I built my javascript strftime implementation, I've been playing around with dates and date formatting in javascript. One of the things that users of the library have been happy about is the ease with which dates can be localised using the library, and it's this particular feature that's been frustrating me a lot when dealing with time zones. Let me get this out of my head.

The javascript engines of various browsers stringify dates in subtly different ways, and the differences are enough to mess up time zone parsing.

Here's what I'm talking about. Try this snippet of code in your favourite browsers and tell me what they say:

var d = new Date();
alert(d.toString());
alert(d.toLocaleString());

I've seen different results on Firefox, IE, Opera and Safari. Firefox is the worst offender with different results across platforms. The following are my results:

Firefox 2 & 3 on Linux and Mac OS X, and Safari on Mac OS X: Thu Sep 18 2008 19:20:23 GMT-0700 (PDT)
Firefox 2 & 3 on Windows: Thu Sep 18 2008 19:20:23 GMT-0700 (Pacific Daylight Time)
Google Chrome: Thu Sep 18 2008 19:20:23 GMT-0700 (Pacific Daylight Time). However, it displays GST instead of GMT if your timezone is set to GMT.
Opera 9.52 on Linux and Mac OS X: Thu Sep 18 2008 19:20:23 GMT-0700
IE 6, 7, 8 on Windows: Thu Sep 18 19:20:23 PDT 2008

I've only tested this in the PDT time zone, but the results are diverse enough to make parsing out the time zone annoying (except for Opera, where it's impossible). This is what I came up with:

d.toString().replace(/^.*:\d\d( GMT[+-]\d+)? \(?([A-Za-z ]+)\)?\d*$/, '$2').replace(/[a-z ]/g, '');

I derived this regex over a couple of iterations. It started out taking care of the Firefox on Mac & Linux and Safari case:

d.toString().replace(/^.* \(([^)]+)\)$/, '$1');

Next, I added support for Firefox on Windows by getting rid of all lowercase letters and spaces:

d.toString().replace(/^.* \(([A-Za-z ]+)\)$/, '$1').replace(/[a-z ]/g, '');

Finally, I added in support for IE, by checking for a string of characters that fell between seconds and year:

d.toString().replace(/^.*:\d\d( GMT[+-]\d+)? \(?([A-Za-z ]+)\)?\d*$/, '$2').replace(/[a-z ]/g, '');

This is by no means foolproof. It could break with time zones and browsers that I haven't encountered yet, so I'm asking everyone for help in figuring this out.
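
One weakness is easy to guard against: when the pattern doesn't match at all (Opera's output, for example), `replace` leaves the whole string in place and the second `replace` then mangles it. A rough sketch that uses the same regex with `exec` and returns `null` on a miss instead; the name `extractTimeZoneCode` is just for illustration.

```javascript
// Same pattern as above, kept in one place.
var TZ_PATTERN = /^.*:\d\d( GMT[+-]\d+)? \(?([A-Za-z ]+)\)?\d*$/;

// Hypothetical wrapper: return the time zone code, or null when the
// string has no recognisable code (rather than a mangled date string).
function extractTimeZoneCode(dateString) {
  var match = TZ_PATTERN.exec(dateString);
  if (!match) {
    return null;
  }
  // Collapse a full name like "Pacific Daylight Time" to its initials.
  return match[2].replace(/[a-z ]/g, '');
}
```

With the strings from my tests, this gives `PDT` for the Firefox, Safari, Chrome and IE results, and `null` for Opera.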

Short of asking all browser authors to be consistent in their Date.toString() implementations, how do I figure out the timezone string for a date? I've even thought about building a table of offsets to timezone strings, but there isn't a one-to-one mapping for that.

I've also noticed that the timezone is always displayed in English, even when I change my system locale. Perhaps this is different for you. Let me know.

And yo, Google, what's up with Greenwich Standard Time?

Short URL: http://tr.im/jsinconsistentdate

Thursday, September 04, 2008

Programming patterns in sed

I write a lot of code in sed whenever I need to do some kind of filtering, and I realised that there are several patterns that emerge. Sed is a Stream EDitor, and its capabilities are somewhat limited, yet it does provide for some of the more important things required in a programming language. It has sequence, selection, iteration, variables and debugging statements. In this post, I'll go over each of these.

1. Sequence

I've started with sequence because it's always the easiest to explain. Unless branching is involved, a sed script flows from top to bottom. All statements are executed in sequence, and that's pretty much all I have to say about it. Let's move on.

2. Selection

Selection is where things start happening. There are a few ways to execute a statement based on a condition. That condition almost always deals with a pattern in the current input line, but we'll see later how that can be changed. For now, here's how you do selection:

  /pattern/ command

  s/pattern/replace/
  t label

  s/pattern/replace/
  T label

The first is a simple "execute this command if the current pattern space matches /pattern/". That's akin to saying if(line.match(/pattern/)) { command; } in more common programming languages. Command could even be a block of commands enclosed in braces like this:

  /pattern/ {
    command1
    command2
    command3
  }

Let's take a few examples. We'll assume that sed is called without the -n option, so each line is printed once by default.

If the line starts with "hello", add "world" after it:
```
   /^hello/ s/^hello/hello world/
```
If the current line number is 3, print out the line twice:
```
   3 p
```
Since each line is printed once by default, the p prints it a second time.
If the line starts with "next", swap it with the next line and print both out:
```
   /^next\>/ {
     N
     s/\(.*\)\n\(.*\)/\2\n\1/
   }
```

The second and third types of selection are similar, and basically say: branch to a label if the previous replace command succeeded (t) or failed (T). Note that T is a GNU sed extension. These make more sense when looking at iteration, so that's what we'll do now.

3. Iteration

Things always get interesting when you iterate. You can execute the same set of statements over a group of data without knowing in advance what that data is. The b, t and T commands come into play here, along with labels defined with the : command, similar to other programming languages. We'll look at some common loop types from other languages:

While(condition) {...} (loop executed 0 or more times)

   :loopstart
   /condition/ {
      command1
      command2
      command3
      b loopstart
   }

For example, while the input contains ==, append the contents of the file named equals.txt:

   :loopstart
   /==/ {
      s/==//
      r equals.txt
      b loopstart
   }

We can also do this with the T command:

   :loopstart
   s/==//
   T loopend
      r equals.txt
      b loopstart
   :loopend

This is a little clumsier, though, because you need two labels. The first method is the code pattern that I use for a while loop.

Do {...} While(condition) (loop executed 1 or more times)
```
   :loopstart
      command1
      command2
      command3
   /condition/ b loopstart
```
This is almost the same as the first loop, except that the condition is tested at the end of the block of statements. Let's take the same example, but this time, we read in the file at least once:
```
   :loopstart
      r equals.txt
   /==/ {
      s/==//
      b loopstart
   }
```
In this case, using the t command makes it less messy since we need to do the replacement anyway:
```
   :loopstart
      r equals.txt
      s/==//
   t loopstart
```

The third type of loop is a for loop, which is harder because you can't really do math in sed. Still, if one tries, one can figure out weird ways to count. In this case, we use the hold space:


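   # We want to print the current line 10 times:

   # 1. Grab the current line into the hold space
   h
   # 2. Replace the pattern space with one = for each time we want to loop
   #    (s is used here because c would print the text and end the cycle)
   s/.*/==========/
   # 3. Print the line as long as there are = left:
   :loopstart
      s/^=//
      T loopend
      x
      p
      x
   b loopstart
   # 4. The counter is used up: delete the now empty pattern space
   #    so that nothing extra is printed
   :loopend
   d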

In this code, we need to constantly swap between the pattern space and the hold space, since all our operations are done in the pattern space. Which brings us to variables.

4. Variables

Well, make that variable, since sed has only one piece of memory that can hold something, and that's called the hold space. The good news though, is that it has no size limit - well, theoretically at least. This means that using your own delimiters, you could store anything in there. JSON anyone? I generally use the newline character as a delimiter, since that's unlikely to show up more than once in a single line of input, but you can use anything that you think is unique to your application. Here's one way to do it:


   # 1. Swap the hold and pattern space
   x
   # 2. Set the pattern space to the value of your variable using the s, g or G commands:
   s/$/\nfoo/
   G
   # 3. Swap the hold and pattern space again
   x

   # The hold space now contains:
   # prev value of hold space\n
   # foo\n
   # the input line

And you can use these values later using the s/// command:


   # 1. Append the current line to the hold space
   H
   # 2. Pull the hold space into the pattern space
   g
   # At this point the pattern space has the newline separated list of
   # variables followed by a newline and the current input line
   # We can use these variables if we know their position, and replace
   # them into the input line:

   # 3. Append the previous input line to the first word of the current input line:
   s/.*\n.*\n\(.*\)\n\([A-Za-z][A-Za-z]*\)\>/\2\1/

   # Now the pattern space holds the modified input line, and the hold space
   # still holds the variables, followed by the line appended in step 1
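
If you're more at home in javascript, it may help to picture the five hold space commands as functions over a pair of strings. This is only a toy model for following the snippets above, not how sed is implemented, and the names `makeState`, `holdCommands` and `run` are invented for it.

```javascript
// A toy model of sed's two buffers: the pattern space and the hold space.
function makeState(patternSpace, holdSpace) {
  return { pattern: patternSpace, hold: holdSpace };
}

// Each command takes a state and returns the state after the command.
var holdCommands = {
  // h: copy the pattern space into the hold space
  h: function (s) { return makeState(s.pattern, s.pattern); },
  // H: append a newline and the pattern space to the hold space
  H: function (s) { return makeState(s.pattern, s.hold + '\n' + s.pattern); },
  // g: copy the hold space into the pattern space
  g: function (s) { return makeState(s.hold, s.hold); },
  // G: append a newline and the hold space to the pattern space
  G: function (s) { return makeState(s.pattern + '\n' + s.hold, s.hold); },
  // x: exchange the two spaces
  x: function (s) { return makeState(s.hold, s.pattern); }
};

// Apply a list of command names in order.
function run(state, names) {
  for (var i = 0; i < names.length; i++) {
    state = holdCommands[names[i]](state);
  }
  return state;
}
```

For example, `run(makeState('line two', 'line one'), ['H', 'g'])` leaves both spaces holding the two lines joined by a newline, which is the state the second snippet is in just before its s/// command.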

5. Debugging

Finally, we come to debugging, which is extremely useful when writing code in such a strange language. sed has two commands that make debugging possible, though I won't say easy. The = command prints out the current line number of the input file. Note that this is not the line number of the sed script, but of the input that the script reads. The l command prints out the current pattern space in a visually unambiguous way. It's up to you to scatter these commands through your code to figure out what's happening internally. You can always swap the pattern and hold spaces and use the l command to find out what's in your hold space.

Apart from the above, sed also has methods to read and write files. We've seen reading above with the r command, and similarly, writing is handled by the w command. There are also R and W commands, but you can read the manual to figure those out. I'll leave sed here.

Update: should have mentioned this a long time ago, but someone wrote tetris in sed
