# The other side of the moon

/bb|[^b]{2}/
Never stop Grokking

## Monday, December 27, 2010

### Using bandwidth to mitigate latency

[This post is mirrored from the Performance Advent Calendar]

The craft of web performance has come a long way from YSlow and the first few performance best practices. Engineers the web over have made the web faster, and we now have newer tools, many more guidelines, much better browser support and a few dirty tricks to make our users' browsing experience as smooth and fast as possible. So how much further can we go?

#### Physics

The speed of light is a fixed upper limit, though depending on the medium a signal passes through, its speed may be lower. In fibre, this is about 200,000 km/s, and it's about the same for electricity through copper. This means that a signal sent over a cable running 7,000 km from New York to the UK takes about 35ms to get through. Channel capacity (the bit rate) is also limited by physics, and it's Shannon's law (the Shannon-Hartley theorem) that comes into play here.
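To put rough numbers on that (a back-of-the-envelope sketch, not a precise model):

```javascript
// Back-of-the-envelope propagation delay, using the ~200,000 km/s
// figure for light in fibre mentioned above.
function propagationDelayMs(distanceKm, speedKmPerSec) {
  return (distanceKm / speedKmPerSec) * 1000;
}

var oneWay = propagationDelayMs(7000, 200000); // New York to the UK: 35ms
var roundTrip = oneWay * 2;                    // 70ms, before any queueing
                                               // or protocol overhead
```

Every round trip a page needs pays that 70ms floor at least once, no matter how much bandwidth is available.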

This, of course, is the physical layer of our network. We have a few layers above that before we get to TCP, which does most of the dirty work to make sure that all the data that your application sends out actually gets to your client in the right order. And this is where it gets interesting.

#### TCP

Éric Daspet's article on latency includes an excellent discussion of how slow start and congestion control affect the throughput of a network connection, which is why Google have been experimenting with an increased TCP initial window size and want to turn it into a standard. Each network roundtrip is limited by how long it takes photons or electrons to get through, and anything we can do to reduce the number of roundtrips should reduce total page download time, right? Well, it may not be that simple. We only really care about roundtrips that run end-to-end; those that run in parallel need to be paid for only once.

When thinking about latency, we should remember that this is not a problem that has shown up in the last 4 or 5 years, or even with the creation of the Internet. Latency has been a problem whenever signals have had to be transmitted over a distance. Whether it is a rider on a horse, a signal fire (which incidentally has lower latency than light through fibre[1]), a carrier pigeon or electrons running through metal, each has had its own problems with latency, and these are solved problems.

#### C-P-P

There are three primary ways to mitigate latency. Cache, parallelise and predict[2]. Caching reduces latency by bringing data as close as possible to where it's needed. We have multiple levels of cache including the browser's cache, ISP cache, a CDN and front-facing reverse proxies, and anyone interested in web performance already makes good use of these. Prediction is something that's gaining popularity, and Stoyan has written a lot about it. By pre-fetching expected content, we mitigate the effect of latency by paying for it in advance. Parallelism is what I'm interested in at the moment.

#### Multi-lane highways

Mike Belshe's research shows that bandwidth doesn't matter much, but what interests me most is that we aren't exploiting all of this unused channel capacity. Newer browsers do a pretty good job of downloading resources in parallel, and with a few exceptions (I'm looking at you Opera), can download all kinds of resources in parallel with each other. This is a huge change from just 4 years ago. However, are we, as web page developers, building pages that can take advantage of this parallelism? Is it possible for us to determine the best combination of resources on our page to reduce the effects of network latency? We've spent a lot of time, and done a good job combining our JavaScript, CSS and decorative images into individual files, but is that really the best solution for all kinds of browsers and network connections? Can we mathematically determine the best page layout for a given browser and network characteristics[3]?

#### Splitting the combinative

HTTP Pipelining could improve throughput, but given that most HTTP proxies have broken support for pipelining, it could also result in broken user experiences. Can we parallelise by using the network the way it works today? For a high capacity network channel with low throughput due to latency, perhaps it makes better sense to open multiple TCP connections and download more resources in parallel. For example, consider these two pages I've created using Cuzillion:
Have a look at the page downloads using Firebug's Net Panel to see what's actually happening. In all modern browsers other than Opera, the second page should load faster, whereas in older browsers and in Opera 10, the first page should load faster.

Instead of combining JavaScript and CSS, split them into multiple files. How many depends on the browser and network characteristics. The number of parallel connections could start off based on the ratio of capacity to throughput, and would reduce as network utilisation improved through larger window sizes over persistent connections. We're still using only one domain name, so no additional DNS lookup needs to be done. The only unknown is the channel capacity, but based on the source IP address and a geo lookup[4] or subnet to ISP map, we could make a good guess. Boomerang already measures latency and throughput of a network connection, and the data gathered can be used to make statistically sound guesses.
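As a sketch of how that starting point might be computed (the function and the clamping bounds here are illustrative assumptions, not measured values):

```javascript
// Illustrative sketch: derive an initial parallel-connection count from
// the ratio of estimated channel capacity to per-connection throughput.
// Clamping to 1..6 mirrors typical per-host connection limits in browsers.
function initialConnections(capacityKbps, throughputKbps) {
  var ratio = capacityKbps / throughputKbps;
  return Math.max(1, Math.min(6, Math.round(ratio)));
}

// A 5000Kbps channel sustaining only 1250Kbps per connection (because
// of latency) suggests splitting resources across about 4 connections:
var n = initialConnections(5000, 1250); // 4
```

A real implementation would refine this over time as boomerang reports back actual throughput per connection.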

I'm not sure if there will be any improvements or if the work required to determine the optimal page organisation will be worth it, but I do think it's worth more study. What do you think?

#### Footnotes

1. Signal fires (or even smoke signals) travel at the speed of light in air vs. light through fibre; however, the switching time for signal fires is far slower, and you're limited to line of sight.
2. David A. Patterson. 2004. Latency lags bandwidth [PDF]. Commun. ACM 47, 10 (October 2004), 71-75.
3. I've previously written about my preliminary thoughts on the mathematical model.
4. CableMap.info has good data on the capacity and latency of various backbone cables.

## Thursday, December 02, 2010

### Bad use of HTML5 form validation is worse than not using it at all

#### ...or why I'm glad we don't use Fidelity any more

Fidelity's online stock trading account uses the HTML5 form pattern attribute to do better form validation, only they get it horribly wrong.

First some background...

The pattern attribute that was added to input elements allows the page developer to tell the browser what kind of pre-validation to do on the form before submitting it to the server. This is akin to JavaScript based form validation that runs through the form's onsubmit event, except that it's done before the form's onsubmit event fires. Pages can choose to use this feature while falling back to JS validation if it isn't available. They'd still need to do server-side validation, but that's a different matter. Unfortunately, when you get these patterns wrong, it's not possible to submit a valid value, and given how new this attribute is, many web devs have probably implemented it incorrectly while never having tested on a browser that actually supports the attribute.
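One detail worth knowing: the browser matches the pattern against the entire value, as though it were wrapped in ^(?:...)$. A quick way to check what a supporting browser will accept is to apply the same anchored regex in JavaScript (the currency pattern below is my own illustration, not Fidelity's):

```javascript
// The pattern attribute must match the WHOLE value, as if wrapped in
// ^(?:...)$. Testing the anchored form in JavaScript shows what a
// supporting browser will accept. This pattern is illustrative only.
function matchesPattern(pattern, value) {
  return new RegExp("^(?:" + pattern + ")$").test(value);
}

matchesPattern("\\d+(\\.\\d{2})?", "100.00");  // true
matchesPattern("\\d+(\\.\\d{2})?", "$100.00"); // false: $ isn't in the pattern
```

Running your pattern through a check like this before shipping would catch exactly the class of bug described below.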

This is the problem I faced with Fidelity. First some screenshots. This is what the wire transfer page looks like if you try to transfer $100; the message shows up when you click on the submit button (on Firefox 4 and on Opera). There are a few things wrong with this. First, to any reasonable human being (and any reasonably well programmed computer too), the amount entered (100.00) looks like a valid format for currency information. I tried alternatives like $100, $100.00, 100, etc., but ended up with the same errors for all of them. Viewing the source told me what the problem was. This is what the relevant portion of the page source looks like:

$<input type="text" name="RED_AMT" id="RED_AMT"
maxlength="11" size="14" value="" tabindex="10"


#### MacOSX

On the Mac, it's stored in
~/Library/Application Support/Firefox/Profiles/<profile>/extensions
/Library/Application Support/Mozilla/Extensions


## Wednesday, August 04, 2010

### Site theme

Ten months down the line,
I still like this site's style,
I've tweaked it a bit,
That bluesmoon did this one fine.

When I work on a site, I always start with the design first. I'm a visual developer, so I really need to see what things will look like when I'm done. I also need to know if I can stand the design enough to stare at it for days on end during development. The result is that I end up hating my design in a couple of days and change things around, most often it's the colours.

This has happened a lot with earlier versions of my site's style, but the version I have now is one I like. I first pushed it out standardised across all my subdomains back in November 2009, and now 10 months down the line, it still pleases me.

I like the colours. They don't shout out at me, but frame my content well. Important content is visible while unimportant meta information is dimmed out. I played around with the margins a lot to get it to work the way I want across devices. It's not ideal, but it gets the job done. I took inspiration for the fonts from several sites until I settled on what I use now.

There are still a few issues. Not everyone likes the tag cloud, but it's the best way for me to navigate my site, especially since Google's site search doesn't seem to work. The home page looks terrible on my E71, and I should get to fixing that some day, but other than that, it's ok.

## Monday, August 02, 2010

### 4.01 Strict — Tennis Elbow

My left elbow's out of commission, so there will probably not be a strip until it gets better. The technical term is lateral epicondylitis, commonly known as Tennis Elbow. What's ironic is that I don't play tennis. It happened due to bad typing posture.

## Thursday, July 29, 2010

### Boomerang and the WebTiming API

I've been doing a lot of development on boomerang over the last few weeks. In particular around the WebTiming API. During Velocity this year, Microsoft announced that their latest release of IE9 supported this API, so I set about adding support to boomerang.

Since then, Google have announced support in Chrome as well, and there are rumours that Mozilla plans on adding it to Firefox. Since the specification is still a draft, none of these browsers use the window.performance object; instead, each uses its own browser-specific version. I've now added support for all these versions into boomerang.

More details in my post on the YDN blog.

You can get boomerang from github, and join the discussion on our forum.

## Monday, July 26, 2010

### 4.01 Strict — Invention of the Web

Tim Berners-Lee invents the WWW

There was a young man named Tim,
Who wanted to share docs on a whim,
He built HTTP,
But the W3C
Made standards so webdevs would sin

## Sunday, July 25, 2010

### Handling Date/Times and timezones

Wrote this as an answer to a question on StackOverflow, but I think it was one of my better ones, so am reproducing it here.

There are four different times you should consider:
1. Event time: eg, the time when an international sporting event happens, or a coronation/death/etc. This is dependent on the timezone of the event and not of the viewer.
2. Television time: eg, a particular TV show is broadcast at 9pm local time all around the world. Important when thinking about publishing the results (of say American Idol) on your website
3. Relative time: eg: This question has an open bounty closing in 21 hours. This is easy to display
4. Recurring time: eg: A TV show is on every Monday at 9pm, even when DST changes.
There is also historic/alternate time. These are annoying because they may not map back to standard time, eg: Julian dates, dates according to a lunar calendar on Saturn, or the Klingon calendar.

Storing start/end timestamps in UTC works well. For 1, you need an event timezone name + offset stored along with the event. For 2, you need a local time identifier stored with each region and a local timezone name + offset stored for every viewer (it's possible to derive this from the IP if you're in a crunch). For 3, store in UTC seconds and no need for timezones. 4 is a special case of 1 or 2 depending on whether it's a global or a local event, but you also need to store a created at timestamp so you can tell if a timezone definition changed before or after this event was created. This is necessary if you need to show historic data.
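A sketch of what a couple of those records might look like (the field names here are illustrative, not a schema recommendation):

```javascript
// Case 1 (event time): store UTC plus the event's zone name, the offset
// observed when the record was created, and a created-at stamp so later
// timezone-rule changes can be detected.
var worldCupFinal = {
  utc: Date.UTC(2010, 6, 11, 19, 0, 0),  // ms since epoch, always UTC
  tzName: "Africa/Johannesburg",         // the event's timezone
  tzOffsetMinutes: 120,                  // +02:00 as observed at creation
  createdAt: Date.UTC(2010, 5, 1, 0, 0, 0)
};

// Case 3 (relative time): UTC seconds alone are enough.
var bountyCloses = Date.UTC(2010, 6, 26, 9, 0, 0);
var now = Date.UTC(2010, 6, 25, 12, 0, 0);
var hoursLeft = (bountyCloses - now) / (3600 * 1000); // 21
```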

#### Storing times

• Always store time in UTC
• Convert to local time on display (local being defined by the user looking at the data)
• When storing a timezone, you need the name, timestamp and the offset. This is required because governments sometimes change the meanings of their timezones (eg: the US govt changed DST dates), and your application needs to handle things gracefully... eg: The exact timestamp when episodes of LOST showed both before and after DST rules changed.
##### Offsets and names
An example of the above would be:

The soccer World Cup final happened in South Africa (UTC+2, SAST) on July 11, 2010 at 19:00 UTC.

With this information, we can historically determine the exact time when the 2010 WCS finals took place even if the South African timezone definition changes, and be able to display that to viewers in their local timezone at the time when they query the database.
#### System Time

You also need to keep your OS, database and application tzdata files in sync, both with each other, and with the rest of the world, and test extensively when you upgrade. It's not unheard of that a third party app that you depend on did not handle a TZ change correctly.

Make sure hardware clocks are set to UTC, and if you're running servers around the world, make sure their OSes are configured to use UTC as well. This becomes apparent when you need to copy hourly rotated apache log files from servers in multiple timezones. Sorting them by filename only works if all files are named with the same timezone. It also means that you don't have to do date math in your head when you ssh from one box to another and need to compare timestamps.

Also, run ntpd on all boxes.

#### Clients

Never trust the timestamp you get from a client machine as valid. For example, the Date: HTTP headers, or a JavaScript Date.getTime() call. These are fine when used as opaque identifiers, or when doing date math during a single session on the same client, but don't try to cross-reference these values with something you have on the server. Your clients don't run NTP, and may not necessarily have a working battery for their BIOS clock.
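If you really must relate a client clock to your server's, measure the skew explicitly instead of trusting the client. A sketch (where the server timestamp comes from, eg: a value embedded in the page, is up to you):

```javascript
// Sketch: estimate the client clock's skew against a server-supplied
// timestamp, then normalise client timestamps before comparing them
// with anything recorded server-side.
function estimateSkewMs(serverNowMs) {
  return new Date().getTime() - serverNowMs;
}

function toServerTime(clientMs, skewMs) {
  return clientMs - skewMs;
}
```

This only corrects for offset, not drift, but it's far better than comparing raw client timestamps against server logs.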

#### Trivia

Finally, governments will sometimes do very weird things:
Standard time in the Netherlands was exactly 19 minutes and 32.13 seconds ahead of UTC by law from 1909-05-01 through 1937-06-30. This time zone cannot be represented exactly using the HH:MM format.

## Monday, July 19, 2010

### 4.01 Strict — Don't mess with Stubbornella

Mixing CSS is pixel princess, but weigh her down, and you'll have to deal with... Stubbornella.

## Monday, July 12, 2010

### 4.01 Strict — SpeedFreak

Mild mannered Stoyan Stefanov makes an appearance in this week's 4.01 Strict.

## Tuesday, July 06, 2010

### 4.01 Strict — The discovery of JSON

In episode 2, Douglas Crockford discovers JSON

## Monday, June 28, 2010

### My velocity talk

So I was supposed to speak at Velocity 2010 this year. My talk was titled Latency: Why you should worry and what you can do about it. It was one of the last few talks at the conference. Unfortunately though, the thing about a conference like Velocity is that almost everything is about latency, and by the time I went on stage, absolutely every point that I'd planned on making had already been covered.

That's not a bad thing. For one, it validates my ideas, telling me that many other smart people have been thinking about the same things. Secondly, each of these speakers had 60-90 minutes to convey their ideas, which meant they could really do it justice. It also shows that Velocity is a pretty awesome conference to be at.

So, by the night of the 23rd, I realised that there was no way I could go ahead with my talk in its current state, and sat down to rewrite it. That's when the drama started.

Now my usual format is to build my slides using LaTeX and generate a PDF from that. I have a bunch of relevant images and charts in there, and it's not too hard to do that with LaTeX. I started going down that path, but realised that I wanted to do something different. I really wanted to reference all the other presentations and go over points that they'd covered, so I switched to Comic Life. I was up until 5am, working out the details and getting the script right.

The thing about a 12 minute talk is that you can't waste a single minute thinking about what you're going to say next. You just need to know it, and there wasn't much time for me to prepare. I'd also planned to announce boomerang in the latter part of my talk, so had to fit that in as well.

At 5 I realised that I was too sleepy to keep going, so I went to bed and figured I could finish things off in a couple of hours once I woke up. I was also too sleepy to backup the presentation to my server.

I woke up at 10, and opened up the mac to start work again, and it wouldn't come on. I could hear the fans whirring inside, and the startup sound would sound, but the screen wouldn't turn on. It was now officially time to panic.

Two things were in my favour though. I work at Yahoo!, and the conference was just down the road from Yahoo! HQ. I headed in to the office and took the mac down to IT. The guys there were really helpful. They opened up a new Mac, and transferred my hard disk to it. The new one started up perfectly, and it's what I'm using right now.

I spent the next two hours putting the finishing touches on, and adding details for the boomerang release, and then rushed off to the conference with a slice of pizza in my hand.

Once I got on stage, I plugged in the mac, and as it has done on numerous occasions in the past, it hung as soon as the VGA adaptor was plugged in. I had to reboot while on stage, and this was all coming out of the 12 minutes that I had.

The talk went off okay. Looking back at the video, there's a bunch of things I'd have done differently, and some more details that I'd have covered, but I'll let you decide for yourselves.

My slides are on slideshare, while the video is on blip.tv.

### 4.01 Strict — I am codepo8

I've been playing around with comic life, and this is what I ended up with:

I'm calling it 4.01 Strict.

## Saturday, June 26, 2010

### boomerang

I've been silent on this blog for a while, and the main reason for that is boomerang. It's taken a few months to get it ready for a public release, and I'm quite pleased with the way it's come out.

As many of you know, I've worked with the Exceptional Performance team at Yahoo! for over two years, and during this time, I've done a lot of work with measuring and analysing the performance of all of Yahoo!'s websites (yes, all). There are several tens of thousands of page types, and probably hundreds of thousands of unique URLs that could show up on a yahoo.com domain. This throws up some very interesting problems that I'm not sure I'd have had the chance to work with anywhere else.

I learnt a lot about scaling cheap hardware to meet the needs of this system, and also a lot about how each of these thousands of pages worked to improve their performance.

The measurement tool that we use has turned into boomerang. It's primarily a beaconing framework, so isn't restricted to performance data, but the core plugins that come packaged with it only do performance measurement.

I'll write more in later posts about what you can do with boomerang, but for now, go get it from github.com/yahoo/boomerang. Fork it, use it, patch it, tell us what you think of it.

There's also documentation along with the source code, or you can have a look at my personal copy of them — frequently, but not automatically updated.

I announced it at Velocity 2010, and they made a video of my talk.

## Thursday, June 10, 2010

### location.href

While working on boomerang, I needed to get at the current document's URL. The way I generally do that is via the location.href property. This worked perfectly until I hit upon a URL with a # in it. The # was correctly URL encoded to %23, and most browsers returned the correct string in location.href.

Safari 4.x was the only exception. It returned the URL after decoding. This wouldn't be a problem on its own, except that only the URL portion was decoded. location.search (everything after the ?) and location.hash (everything after the #) were untouched.

This meant that a URL like this:
http://tech.bluesmoon.info/foobar%231.html#123

Would show up in location.href as this:
http://tech.bluesmoon.info/foobar#1.html#123

This was a problem, because I now had two # signs in the URL, and would have to twist through hoops to correctly encode the URL. The bug is actually in Webkit, and affects Android too. Thanks to Ryan Grove for pointing me to it.

Then I discovered the document.URL property. This is exactly the same as location.href, except that it isn't affected by the bug in Safari 4.x.

I've tested in other browsers, and it seems to work as expected, so that's what I'm using for now.
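The workaround boils down to preferring document.URL when it exists. A sketch (the helper name is mine; it takes the document and location objects as arguments so it's easy to test):

```javascript
// Sketch: prefer document.URL, which Safari 4.x leaves correctly
// encoded, and fall back to location.href everywhere else.
function currentUrl(doc, loc) {
  return (doc && doc.URL) ? doc.URL : loc.href;
}

// In a page: var url = currentUrl(document, location);
```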

Oh, and if you don't know what boomerang is yet, wait until June 24.

## Tuesday, April 13, 2010

### Internet access from Mumbai Airport's Clipper Lounge

If you're a frequent flier, or are flying first or business class out of Mumbai, then you'll get access to the Clipper Lounge. They provide free wifi, but it doesn't really work as advertised.

You first connect to the wifi network, and open your browser, which gets redirected to a login page. You click on Complimentary Internet access, and get to a form that asks for your personal details, and phone number, claiming that they will send you an access code via SMS.

Do NOT enter your real phone number here. For starters, no SMS gets sent. Secondly, you'll start getting SMS spam. Use any 10 digit phone number. Enter 91 for the country code.

Now, once you do this, you'll get an intermediate page that will immediately redirect to a second page that shows you your username and password. The password on that page is wrong. Instead, what you want to do is hit escape as soon as you get the intermediate page. It's blue, and it has the database insert statement that looks like this:
insert into users
u_current_quota,u_persistant_quota,u_interim_interval,
u_idle_timeout,u_expiration_date,u_nbr_concurrent_sessions,
u_max_concurrent_sessions, u_nbr_starts, u_nbr_stops,
fk_group_id,fk_profile_id
) values (
'cc3zo7s', '2497161', '2', 'Morons',
'86400', '86400', '1', '15', '600',
'2010-05-13 22:51:28','0','1',
'0','0','2','2'
);

In the above string, cc3zo7s is the username and 2497161 is the password that you need to use.

Note, that if you enter a phone number that already exists in the database (eg: one of the tech support numbers on the login page), you'll get this instead:
Duplicate entry '1271179237' for key 'PRIMARY'

Just try a different phone number. Oh, and what's with the semi-colon at the end of the SQL statement?

Lastly, don't bother sending them feedback. It doesn't go anywhere. And I don't mean that it goes into some mailbox that no one reads. I mean it goes nowhere. This is the page you get when you try to send feedback:

The interesting text in there is:
Warning: mail() [function.mail]: Failed to connect to mailserver at
"ezcommindia.com" port 25, verify your "SMTP" and "smtp_port" setting in
php.ini or use ini_set() in C:\xampp\htdocs\ezcomm\ifeedback.php
on line 225

Yup, real classy. I think we can just start from the end of that message and stop when we get to C:\. It explains a lot.

This login manager isn't just broken, it goes right through broken and comes out on the other end.

## Saturday, April 10, 2010

### Can a website crash your RHEL5 desktop?

A few months ago I started having trouble with my RHEL5 desktop when running Firefox 3.5. On a few websites, the entire system would crash pretty consistently. It took a long time, but I finally managed to find the problem and get a fix in there.

My solution is documented on the YDN blog. Please leave comments there.

## Friday, April 09, 2010

### Analysis of the bandwidth and latency of YUIblog readers

A few months ago, Eric Miraglia from the YUI team helped me run some analysis on the types of network connections that were coming in to YUIBlog. In this article on the YUI blog, I've published the results of my analysis.

## Thursday, April 01, 2010

### Speaking at IPC Spring in Berlin

I'll be speaking at IPC Spring in Berlin between May 31 and June 2 this year. My talk is on MySQL partitioning. It will have more information than the talk I did at ConFoo in March. The talk will be on Tuesday, June 1st.

If you're in or around Berlin at that time, drop in.

## Monday, March 22, 2010

### InnoDB's tablespace ids and Partitions

There are times when what you have is a partially running database and a bunch of backup InnoDB tablespace files (the .ibd files). If you're using innodb_file_per_table, then you have a separate .ibd file for each InnoDB table.

Now, you have your running database with a bunch of tables, and you want to replace some of them with the backup .ibd files. According to the MySQL docs, you'd do this:
1. ALTER TABLE foo DISCARD TABLESPACE; (this deletes the current .ibd file)
2. copy the old .ibd file into your database directory
3. ALTER TABLE foo IMPORT TABLESPACE;
Assuming your .ibd file was from the same database and you did not drop the table and recreate it sometime between when you made the backup .ibd and now, this should work. Except... if you use partitions. If your table foo uses partitions, ie, its create statement was something like this:
CREATE TABLE foo (
...
) PARTITION BY ... (
PARTITION p0 ...,
);
In this case, you cannot discard the tablespace, and the first alter command throws an error:
mysql> ALTER TABLE foo DISCARD TABLESPACE;

ERROR 1031 (HY000): Table storage engine for 'foo' doesn't have this option

I have not investigated if there are workarounds for this, but I do have a little more information on what's happening. Remember that each .ibd file is a tablespace. For a partitioned table, there are multiple .ibd files, one for each partition. The table's files look like this:
foo.frm
foo.par
foo#P#p0.ibd
foo#P#p1.ibd
...

Where p0, p1, etc. are the partition names that you specified in the create statement. Each partition is a different tablespace and has its own tablespace id. When you create an InnoDB table without partitioning, the internal tablespace id counter is incremented by 1. When you create an InnoDB table with partitions, the internal tablespace id counter is incremented by the number of partitions. The actual tablespace id is stored in each partition's .ibd file somewhere within the first 100 bytes. I have not attempted to find out where exactly though.

## Friday, March 19, 2010

### WebDU

Wow, it's been another couple of weeks that I haven't updated this blog. Most of my time's gone in travelling. I was at ConFoo last week, and have a pending blog post about that, though it will be on the YDN blog. Next month I'll spend some time in India and then in May I head over to Bondi beach in Australia for WebDU.

I'll be speaking about web performance there and will probably do a pub event as well. Stay posted for more details. In any event, if Sydney is where you're at, then go register for WebDU. Geoff and crew put on a great show last year and it only looks to be getting better.

## Sunday, February 28, 2010

### Convert IP to Geo info using YQL

For my missing kids hack, I needed to convert an IP address to a US or Canadian 2 letter state code. This should have been pretty straightforward, but it turned out to require a little more effort than I initially wanted to put in.

First, the easy way. Rasmus Lerdorf has a web service that takes in an IP address and based on the MaxMind data, returns a bunch of information including the country and state/region code. I initially decided to use this. His example page is pretty self-explanatory, so I won't re-document it here. The problem is that this service was really slow and increased page load time a lot, so I scrapped the idea.

I then started looking through YQL. YQL has a whole bunch of geo stuff, but nothing that specifically turns an IP address into a WoEID or a country/state code. I then looked at the community supported tables and found the ip.location table that uses the ipinfodb.com wrapper around the MaxMind database. This returned everything I needed, but the only problem was that the state was returned as a string rather than a two character code. This is the query:
SELECT * From ip.location Where ip=@ip

The output looks like this:
{
"query":{
"count":"1",
"created":"2010-02-28T01:24:30Z",
"lang":"en-US",
"updated":"2010-02-28T01:24:30Z",
"uri":"http://query.yahooapis.com/v1/yql?q=select+*+from+ip.location+where+ip%3D%27209.117.47.253%27",
"results":{
"Response":{
"Ip":"209.117.47.253",
"Status":"OK",
"CountryCode":"US",
"CountryName":"United States",
"RegionCode":null,
"RegionName":null,
"City":null,
"ZipPostalCode":null,
"Latitude":"38",
"Longitude":"-97",
"Timezone":"-6",
"Gmtoffset":"-6",
"Dstoffset":"-5"
}
}
}
}

Now it's pretty trivial to build an array that maps from state name to state code, but I'd have to keep growing that as I added support for more countries, so I decided against that route. Instead I started looking at how I could use the geo APIs to turn this information into what I wanted. Among other things, the data returned also contained the latitude and longitude of the location that the IP was in. I decided to do a reverse geo map from the lat/lon to the geo information. The only problem is that the geo API itself doesn't do this for you.

Tom Croucher then told me that the flickr.places API could turn a latitude and longitude pair into a WoEID, so I decided to explore that. This is the query that does it:
SELECT place.woeid From flickr.places
Where lat=@lat And lon=@lon

Now I could tie the two queries together into a single one that turns an IP address into a WoEID:
SELECT place.woeid From flickr.places
Where (lat, lon) IN
(
SELECT Latitude, Longitude From ip.location
Where ip=@ip
)

This is what the output looks like:
{
"query":{
"count":"1",
"created":"2010-02-28T01:25:34Z",
"lang":"en-US",
"updated":"2010-02-28T01:25:34Z",
"uri":"http://query.yahooapis.com/v1/yql?q=SELECT+place.woeid+From+flickr.places%0A+Where+%28lat%2C+lon%29+IN%0A+++%28%0A++++++SELECT+Latitude%2C+Longitude+From+ip.location%0A+++++++Where+ip%3D%27209.117.47.253%27%0A+++%29",
"results":{
"places":{
"place":{
"woeid":"12588378"
}
}
}
}
}

The last step of the puzzle was to turn this WoEID into a country and state code. This I already knew how to do:
SELECT country.code, admin1.code
From geo.places
Where woeid=@woeid

country.code gets us the two letter ISO3166 country code while admin1.code gets us a code for the local administrative region. For the US and Canada, this is simply the country code followed by a hyphen, followed by the two letter state code. Once I got this information, I could strip out the country code and the hyphen from admin1.code and get the two letter state code.
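That stripping step is trivial (a sketch; the function name is mine):

```javascript
// Sketch of the last step: admin1.code for the US and Canada is the
// country code, a hyphen, then the state/province code, so removing
// the prefix leaves the two letter code.
function stateCode(countryCode, admin1Code) {
  return admin1Code.replace(countryCode + "-", "");
}

stateCode("US", "US-KS"); // "KS"
```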

My final query looks like this:
SELECT country.code, admin1.code From geo.places
Where woeid IN
(
SELECT place.woeid From flickr.places
Where (lat, lon) IN
(
SELECT Latitude, Longitude From ip.location
Where ip=@ip
)
)

And the output is:
{
"query":{
"count":"1",
"created":"2010-02-28T01:26:32Z",
"lang":"en-US",
"updated":"2010-02-28T01:26:32Z",
"results":{
"place":{
"country":{
"code":"US"
},
"code":"US-KS"
}
}
}
}

Paste this code into the YQL console, make sure you've selected "Show community tables" and get the REST API from there. It's a terribly roundabout way to get something that should be a single API call, but at least from my application's point of view, I only need to call a single web service. Now if only we could convince the guys at missingkidsmap.com to use WoEIDs instead of state codes, that would make this all a lot easier.
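If you'd rather build the REST URL by hand instead of copying it from the console, it's just a matter of URL-encoding the query. A sketch in JavaScript (the endpoint is the one visible in the sample output above; the env parameter for community tables is my assumption, so double-check it against what the console gives you):

```javascript
// Build the YQL REST URL for the nested query above.
// The query.yahooapis.com endpoint comes from the sample output
// earlier in this post; the env value is an assumption.
function yqlUrl(ip) {
    var query =
        'SELECT country.code, admin1.code From geo.places ' +
        'Where woeid IN (' +
            'SELECT place.woeid From flickr.places ' +
            'Where (lat, lon) IN (' +
                'SELECT Latitude, Longitude From ip.location ' +
                "Where ip='" + ip + "'" +
            ')' +
        ')';
    return 'http://query.yahooapis.com/v1/yql?format=json' +
           '&env=' + encodeURIComponent('store://datatables.org/alltableswithkeys') +
           '&q=' + encodeURIComponent(query);
}
```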

Have I mentioned how much I like YQL?

## Saturday, February 27, 2010

### Closures and Function Currying

A few weeks ago, someone emailed me with this question:
Assume i have test.js file with code
function createAdder(x) {
return function(y) {
return x + y;
}
}


... My doubt: what's actually happening in the LINE 1-4?
I promised him an answer as a blog post, so here it is.

#### Currying

What we see here is something called function currying. It's common in mathematics and is named after one of its inventors — Haskell Curry. In short, currying converts a single function that takes multiple arguments into a chain of functions, each of which takes a single argument. This is particularly useful when we want to cache the intermediate results for later reuse with different parameters. The classic example is that of operating on two numbers:
function divide(a) {
return function(b) {
return b/a;
}
}

var divide_2 = divide(2);
var divide_3 = divide(3);


In the above code, divide_2 is a function that takes in one argument and divides it by 2, returning the result. This is a fairly trivial and not very useful example because there are easier ways to divide a number by two. It becomes more useful though, when we need to do a bunch of expensive processing to get to each of the inner results. Consider this code instead:
function hash(salt) {
// do some expensive processing on salt
var hash1 = process(salt);
return function(data) {
// cheap processing of data with hash1
return data + hash1;
};
}

var sign1 = hash(salt1);   // sign1 is a function that signs data with salt1

var signature = sign1(some_data);

In the above code, the outer function does a bunch of expensive processing, and its result is stored in the hash1 variable. This variable is available to the inner function whenever it is called because of the closure that's created. When the inner function is called, it simply uses the value of hash1 without having to redo the processing. Now we could have called process() externally and cached its result, but then hash1 would be exposed. This may not be something we want, either because it needs to be abstracted out, or because its value is sensitive.
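To convince yourself that the expensive step really does run only once, here's a self-contained variant of the example, with process() stubbed out by a counter (the stub is mine; a real implementation would do actual work):

```javascript
var processCalls = 0;

// Stand-in for the expensive step; counts how often it runs.
function process(salt) {
    processCalls++;
    return "hashed(" + salt + ")";
}

function hash(salt) {
    var hash1 = process(salt);      // expensive, done exactly once
    return function(data) {
        return data + hash1;        // cheap, reuses the closed-over hash1
    };
}

var sign1 = hash("salt1");
sign1("a");
sign1("b");
// processCalls is still 1: both calls reused the cached hash1
```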

#### Closures

This all works because of closures. In short, a variable will continue to exist as long as code that can see it can still run. In the above cases, the inner functions are returned and their references stored in global variables. This makes the lifetime of these inner functions global, i.e., they will exist as long as their new containing scope exists. These functions do not, however, get the new scope; the variables they can see are exactly those they could see when they were defined.

In the divide example, the inner function sees the variable a, therefore a will exist for as long as the inner function exists. When we create divide_2, the value of a is set to 2, and this is what the inner function (now stored in divide_2) sees. When we create divide_3, a new a is created, this time with the value 3, and a new inner function is created (now stored in divide_3) that sees the new a. This is a completely different execution scope from the one in which divide(2) was called. So, getting back to the example my friend asked about, this is what happens:
1. createAdder(2): At this point, the argument x is set to the value 2, and the inner function is returned and stored in add2. This function remembers the value of x and uses it when it has to be called.
2. createAdder(5): At this point, the argument x is set to the value 5. Note that this is a new invocation of createAdder and does not share the same memory as the first invocation, so these are two completely different variables named x, both living in different scopes.
3. add2(10): At this point, the first inner function is called with the argument 10, which is stored in y. This function remembers the value of x as 2, computes 2 + 10 and returns the result.
4. add5(10): The second instance of the inner function is called with the argument 10, which is stored in y. This function remembers the value of x as 5 from when createAdder(5) was called, computes 5 + 10 and returns the result.
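Putting it all together, the complete example behaves like this (the add2/add5 lines were elided from the question, so I've filled them in the way the steps above describe):

```javascript
function createAdder(x) {
    return function(y) {
        return x + y;
    };
}

var add2 = createAdder(2);  // step 1: x is 2 in this closure
var add5 = createAdder(5);  // step 2: a brand new x, set to 5

add2(10);   // step 3: 2 + 10 is 12
add5(10);   // step 4: 5 + 10 is 15
```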
Now this whole explanation of closures would not be complete without one more subtle note that most people tend to forget. The inner functions see the variables that were defined within their containing scope regardless of where in that containing scope they were defined or when their values were set. This means that if you change the value of a variable after defining an inner function, then the inner function will see the new value. Here's an example:
function foo(x) {
var g=function(y) {
return x+y;
};
x=x*x;
return g;
}


Notice that the value of x was changed after the inner function g was defined, yet g sees the new value of x. This is particularly important when you use a loop control variable inside a closure. The closure will see the last value of the loop control variable and not a different value on each iteration.
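Here's a quick demonstration of that loop gotcha, along with the usual fix, which is just more currying: pass the loop variable through another function so that each closure gets its own copy in a fresh scope.

```javascript
// The gotcha: every function sees the final value of i.
var broken = [];
for (var i = 0; i < 3; i++) {
    broken.push(function() { return i; });
}
// broken[0](), broken[1]() and broken[2]() all return 3

// The fix: curry i through another function so each closure
// closes over its own copy of the value.
function makeGetter(n) {
    return function() { return n; };
}
var fixed = [];
for (var i = 0; i < 3; i++) {
    fixed.push(makeGetter(i));
}
// fixed[0]() returns 0, fixed[1]() returns 1, fixed[2]() returns 2
```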

Update: 2010-03-04 t3rmin4t0r has a much more useful example of currying on his blog.

## Friday, February 19, 2010

### Missing kids on your 404 page

It's been a long time since I last posted, and unfortunately I've been unable to churn out a post every week. The month of February has been filled with travel, so I haven't had much time to write.

My report on FOSDEM is up on the YDN blog, so I haven't been completely dormant. I also did some stuff at our internal hack day last week. This post is about one of my hacks.

The idea is quite simple. People end up on 404 pages all the time. A 404 page is what you see when the requested resource has either gone missing or was never there to begin with; 404 is the HTTP status code for a missing resource. Most 404 pages are quite bland, simply stating that the requested resource was not found, and that's it. Back when I worked at NCST, I changed the default 404 page to use a local site search based on the requested URL. I used the namazu search engine since I was working on it at the time.

This time I decided to do something different. Instead of searching the local site for the missing resource, why not engage the user in trying to find missing kids?

I started with trying to find an API for missingkids.com and ended up finding missingkidsmap.com. This service takes the data from Missing Kids and puts it on a google map. The cool thing about the service was that it could return data as XML.

Looking through the source code, I found the data URL:
http://www.missingkidsmap.com/read.php?state=CA

The state code is a two letter code for states in the US and Canada. To get all kids, just pass in ZZ as the state code.

The data returned looks like this:
<locations>
<maplocation zoom="5"
state_long="-119.838867"
state_lat="37.370157"/>
<location id="1"
firstname="Anastasia"
lastname=" Shearer "
picture="img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
picture2="img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1140669e1.jpg"
medpic = "img width=60 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
smallpic="img width=30 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
policenum="1-661-861-3110"
policenum2=""
st=" CA"
city="BAKERSFIELD"
missing="12/26/2009"
status="Endangered Runaway"
age="16"
url="1140669"
lat="35.3733333333333"
lng="-119.017777777778"/>
...
</locations>


Now I could keep hitting this URL for every 404, but I didn't want to kill their servers, so I decided to pass the URL through YQL and let them cache the data. Of course, now that I was passing it through YQL, I could also do some data transformation and get it out as JSON instead of XML. I ended up with this YQL statement:
SELECT * From xml
Where url='http://www.missingkidsmap.com/read.php?state=ZZ'

Pass that through the YQL console to get the URL you should use. The JSON I got back looked like this:
{
"query":{
"count":"1",
"created":"2010-02-19T07:30:44Z",
"lang":"en-US",
"updated":"2010-02-19T07:30:44Z",
"results":{
"locations":{
"maplocation":{
"state_lat":"40.313043",
"state_long":"-94.130859",
"zoom":"4"
},
"location":[{
"age":"7",
"city":"OMAHA",
"firstname":"Christopher",
"id":"Szczepanik",
"lastname":"Szczepanik",
"lat":"41.2586111111111",
"lng":"-95.9375",
"medpic":"img width=60 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
"missing":"12/14/2009",
"picture":"img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
"picture2":"",
"policenum":"1-402-444-5600",
"policenum2":"",
"smallpic":"img width=30 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
"st":" NE",
"status":"Missing",
"url":"1141175"
},
...
]
}
}
}
}


Step 2 was to figure out whether the visitor was from the US or Canada, and if so, figure out which state they were from and pass that state code to the URL.

This is fairly easy to do at Yahoo!. Not so much on the outside, so I'm going to leave it to you to figure it out (and please let me know when you do).

In any case, my code looked like this:
$json = http_get($missing_kids_url);
$o = json_decode($json, 1);
$children = $o['query']['results']['locations']['location'];

// array_rand returns a random key, so index back into the array
$child = $children[array_rand($children)];

print_404($child);

http_get is a function I wrote that wraps around curl_multi to fetch a URL and cache it locally. print_404 is the function that prints out the HTML for the 404 page using the $child data object. The object's structure is the same as each of the location elements in the JSON above. The important parts of print_404 are:
function print_404($child) {
    $img = preg_replace('/.*src=(.*)/', '$1', $child["medpic"]);
    $name = $child["firstname"] . " " . $child["lastname"];
    $age = $child['age'];
    $since = strtotime(preg_replace('|(\d\d)/(\d\d)/(\d\d\d\d)|', '$3-$1-$2', $child['missing']));
    if($age == 0) {
        $age = ceil((time()-$since)/60/60/24/30);
        $age .= ' month';
    }
    else
        $age .= ' year';
    $city = $child['city'];
    $state = $child['st'];
    $status = $child['status'];
    $police = $child['policeadd'] . " at " . $child['policenum'];
?>
<html>
...
<p>
<strong>Sorry, the page you're trying to find is missing.</strong>
</p>
<p>
We may not be able to find the page, but perhaps you could help find this missing child:
</p>
<div style="text-align:center;">
<img style="width:320px; padding: 1em;" alt="<?php echo $name ?>" src="<?php echo $img ?>"><br>
<div style="text-align: left;">
<?php echo $age ?> old <?php echo $name ?>, from <?php echo "$city, $state" ?> missing since <?php echo strftime("%B %e, %Y", $since); ?>.<br>
<strong>Status:</strong> <?php echo $status ?>.<br>
<strong>If found, please contact</strong> <?php echo $police ?><br>
</div>
</div>
...
</body>
</html>
<?php
}

Add in your own CSS and page header, and you've got missing kids on your 404 page. The last thing to do is to tell apache to use this script as your 404 handler. To do that, put the page (I call it 404.php) into your document root, and put this into your apache config (or in a .htaccess file):
ErrorDocument 404 /404.php

Restart apache and you're done.

Update: 2010-02-24 To see it in action, visit a missing page on my website. eg: http://bluesmoon.info/foobar.

Update 2: The code is now on github: http://github.com/bluesmoon/404kids

Update: 2010-02-25 Scott Hanselman has a Javascript implementation on his blog.

Update: 2010-03-28 There's now a drupal module for this.

## Friday, January 29, 2010

### Pretty print cron emails

We have a lot of scripts that run through cron, and the output gets emailed by cron to whoever owns the cron job, or to a mailing list or something. Unfortunately, the headers for these mails are ugly and could use a lot of work. The primary problem I had with these mails was that the subject line could not be customised. Other smaller problems were that the From: address and Return-Path didn't have good values. A sample header looks like this:
From: Cron Daemon <root@hostname.domain>
To: someone@somewhere.com
Subject: Cron <root@hostname> /usr/local/bin/foobar --whatever

Of the above, only the To: address makes any sense. The first thing I tried to change was the subject line. This is fairly easy using the mail program:
10 15 * * * /usr/local/bin/foobar --whatever | mail -s "Pretty subject" someone@somewhere.com

The mail now looks like this:
From: Cron Daemon <root@hostname.domain>
To: someone@somewhere.com
Subject: Pretty subject

This is a big step forward, because we can now filter emails based on subject. It still doesn't let us change the from address though.
That's when I remembered using sendmail a long time ago to send mail through perl scripts, so I had a look through the sendmail man page, and found a few interesting flags:
-F full_name
       Set the sender full name. This is used only with messages that
       have no From: message header.

-f sender
       Set the envelope sender address. This is the address where
       delivery problems are sent to, unless the message contains an
       Errors-To: message header.

This lets you set the From header, but there's no way to change the Subject. Since we can't use mail and sendmail together, we can either set the From address or the Subject, but not both. I then remembered the -t flag to sendmail that tells it to read the header from the message itself. The only thing left to do was to add the header into the output of every command. Since I couldn't easily do this in cron, I wrote my own script to do it. It's called pmail. The p is for pretty. Get the script now

The meat of the script is this one line:
( echo -e "From: $from_name <$from>\r\nSubject: $subject\r\nTo: $to\r\n\r\n"; cat ) | /usr/sbin/sendmail -t

It writes the From and Subject headers, then leaves a blank line, and then passes its STDIN directly on to sendmail. The script has 3 optional parameters — the sender's address, sender's name and subject, and one mandatory parameter — the recipient's address(es). To use it, set your cron line something like this:
* * * * * /usr/local/bin/foobar --whatever | \
    pmail -F "Philip Tellis" -f "philip@bluesmoon.info" -s "Pretty mail from foobar" someone@somewhere.com

Note: lines wrapped for readability. You can also add the current date to the subject line like this:
... -s "Foobar mail on $( date +\%Y-\%m-\%d )" ...

Remember that you have to escape % signs in crontab lines otherwise cron will translate them to newline characters.

Umm, you also need to have sendmail set up correctly, and doing that is your problem :)

It's not complete yet, but it works for my use case. If you like it, use it. It's under the BSD license. I'll probably add a Reply-To option if it makes sense for me, and at some point add in data validation too, but not today.

Enjoy.

## Tuesday, January 26, 2010

### Speaking at FOSDEM 2010

I'll be speaking at FOSDEM 2010 in Brussels, Belgium on the 6th and 7th of February. I'll be speaking about the YUI Flot library. Registration is free, so if you're in the area, just show up. If anyone's interested in a performance BoF, let's do that as well.

## Monday, January 25, 2010

### Speaking at ConFoo

I'll be speaking at ConFoo in Montreal this March. I have two talks on performance and scaling. If you're in the area, drop in. The conference line-up is great, and if PHP Quebec last year was any indication, it should be very enlightening.

## Tuesday, January 19, 2010

### Bandwidth test v1.2

After testing this out for about a week, I'm ready to release v1.2 of the bandwidth testing code.

#### Changes

The changes in this release are all statistical, and related to the data I've collected while running the test.
1. Switch from the geometric mean to the arithmetic mean, because data for a single user is more or less centrally clustered.

This is not true across users, but for readings for a single user (single connection actually), I've found that the data does not really have outliers.

This allows us to download fewer bytes, and get more consistent results across runs.
3. Add random sampling. In previous versions, the test would be run for every page view. By adding random sampling, we now only run it on a percentage of page views.

You can now set PERFORMANCE.BWTest.sample to a number between 0 and 100, and that's the percentage of page views that will be tested for bandwidth. Note that this does not have to be an integer, so you can set it to 0.1 or 10.25, or anything else as long as it's between 0 and 100 (both inclusive). Setting it to 0 means that the test isn't run for anyone, so you probably don't want that, while 100, which is the default, means that the test is run for everybody.
4. Fire an onload event when the script completes loading. This is mainly so that you can asynchronously load the script so that it does not affect your page load time. Instead of checking for script.onload or onreadystatechange, you can just implement the PERFORMANCE.BWTest.onload method.
You can see the entire history on github.
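For completeness, here's a sketch of how the sampling check and the onload hook might be wired up when loading the script asynchronously (the PERFORMANCE.BWTest properties are the ones described above; the script path and the glue code are my own assumptions):

```javascript
// Sampling: run the test on roughly `sample` percent of page views.
function shouldRunTest(sample) {
    return Math.random() * 100 < sample;
}

// Config and onload hook, set up before the script is injected.
var PERFORMANCE = { BWTest: {} };
PERFORMANCE.BWTest.sample = 10;        // test 10% of page views
PERFORMANCE.BWTest.onload = function() {
    // fires once the bandwidth test script has finished loading
};

// In the page, inject the script asynchronously so it doesn't
// block page load (browser-only; path is hypothetical).
if (typeof document !== 'undefined') {
    var s = document.createElement('script');
    s.src = '/js/bw-test.js';
    document.getElementsByTagName('head')[0].appendChild(s);
}
```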

...===...