[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Monday, December 27, 2010

Using bandwidth to mitigate latency

[This post is mirrored from the Performance Advent Calendar]

The craft of web performance has come a long way from yslow and the first few performance best practices. Engineers the web over have made the web faster and we have newer tools, many more guidelines, much better browser support and a few dirty tricks to make our users' browsing experience as smooth and fast as possible. So how much further can we go?

Physics

The speed of light has a fixed upper limit, though depending on the medium it passes through, it might be lower. In fibre, this is about 200,000 Km/s, and it's about the same for electricity through copper. This means that a signal sent over a cable that runs 7,000Km from New York to the UK would take about 35ms to get through. Channel capacity (the bit rate), is also limited by physics, and it's Shannon's Law that comes into play here.

This, of course, is the physical layer of our network. We have a few layers above that before we get to TCP, which does most of the dirty work to make sure that all the data that your application sends out actually gets to your client in the right order. And this is where it gets interesting.

TCP

Éric Daspet's article on latency includes an excellent discussion of how slow start and congestion control affect the throughput of a network connection, which is why google have been experimenting with an increased TCP initial window size and want to turn it into a standard. Each network roundtrip is limited by how long it takes photons or electrons to get through, and anything we can do to reduce the number of roundtrips should reduce total page download time, right? Well, it may not be that simple. We only really care about roundtrips that run end-to-end. Those that run in parallel need to be paid for only once.

When thinking about latency, we should remember that this is not a problem that has shown up in the last 4 or 5 years, or even with the creation of the Internet. Latency has been a problem whenever signals have had to be transmitted over a distance. Whether it is a rider on a horse, a signal fire (which incidentally has lower latency than light through fibre[1]), a carrier pigeon or electrons running through metal, each has had its own problems with latency, and these are solved problems.

C-P-P

There are three primary ways to mitigate latency. Cache, parallelise and predict[2]. Caching reduces latency by bringing data as close as possible to where it's needed. We have multiple levels of cache including the browser's cache, ISP cache, a CDN and front-facing reverse proxies, and anyone interested in web performance already makes good use of these. Prediction is something that's gaining popularity, and Stoyan has written a lot about it. By pre-fetching expected content, we mitigate the effect of latency by paying for it in advance. Parallelism is what I'm interested in at the moment.

Multi-lane highways

Mike Belshe's research shows that bandwidth doesn't matter much, but what interests me most is that we aren't exploiting all of this unused channel capacity. Newer browsers do a pretty good job of downloading resources in parallel, and with a few exceptions (I'm looking at you Opera), can download all kinds of resources in parallel with each other. This is a huge change from just 4 years ago. However, are we, as web page developers, building pages that can take advantage of this parallelism? Is it possible for us to determine the best combination of resources on our page to reduce the effects of network latency? We've spent a lot of time, and done a good job combining our JavaScript, CSS and decorative images into individual files, but is that really the best solution for all kinds of browsers and network connections? Can we mathematically determine the best page layout for a given browser and network characteristics[3]?

Splitting the combinative

HTTP Pipelining could improve throughput, but given that most HTTP proxies have broken support for pipelining, it could also result in broken user experiences. Can we parallelise by using the network the way it works today? For a high capacity network channel with low throughput due to latency, perhaps it makes better sense to open multiple TCP connections and download more resources in parallel. For example, consider these two pages I've created using Cuzillion:
  1. Single JavaScript that takes 8 seconds to load
  2. 4 JavaScript files that take between 1 and 3 seconds each to load for a combined 8 second load time.
Have a look at the page downloads using FireBug's Net Panel to see what's actually happening. In all modern browsers other than Opera, the second page should load faster whereas in older browsers and in Opera 10, the first page should load faster.

Instead of combining JavaScript and CSS, split them into multiple files. How many depends on the browser and network characteristics. The number of parallel connections could start of based on the ratio of capacity to throughput and would reduce as network utilisation improved through larger window sizes over persistent connections. We're still using only one domain name, so no additional DNS lookup needs to be done. The only unknown is the channel capacity, but based on the source IP address and a geo lookup[4] or subnet to ISP map, we could make a good guess. Boomerang already measures latency and throughput of a network connection, and the data gathered can be used to make statistically sound guesses.

I'm not sure if there will be any improvements or if the work required to determine the optimal page organisation will be worth it, but I do think it's worth more study. What do you think?

Footnotes

  1. Signal fires (or even smoke signals) travel at the speed of light in air v/s light through fibre, however the switching time for signal fires is far slower, and you're limited to line of sight.
  2. David A. Patterson. 2004. Latency lags bandwith[PDF]. Commun. ACM 47, 10 (October 2004), 71-75.
  3. I've previously written about my preliminary thoughts on the mathematical model.
  4. CableMap.info has good data on the capacity and latency of various backbone cables.

Thursday, December 02, 2010

Bad use of HTML5 form validation is worse than not using it at all

...or why I'm glad we don't use fidelity any more

Fidelity's online stock trading account uses HTML5 form's pattern attribute to do better form validation, only they get it horribly wrong.

First some background...

The pattern attribute that was added to input elements allows the page developer to tell the browser what kind of pre-validation to do on the form before submitting it to the server. This is akin to JavaScript based form validation that runs through the form's onsubmit event, except that it's done before the form's onsubmit event fires. Pages can choose to use this feature while falling back to JS validation if it isn't available. They'd still need to do server-side validation, but that's a different matter. Unfortunately, when you get these patterns wrong, it's not possible to submit a valid value, and given how new this attribute is, many web devs have probably implemented it incorrectly while never having tested on a browser that actually supports the attribute.

This is the problem I faced with fidelity. First some screenshots. This is what the wire transfer page looks like if you try to transfer $100. The message shows up when you click on the submit button:

On Firefox 4:
Fidelity's wire transfer page on Firefox 4
On Opera:
Fidelity's wire transfer page on Opera

There are a few things wrong with this. First, to any reasonable human being (and any reasonably well programmed computer too), the amount entered (100.00) looks like a valid format for currency information. I tried alternatives like $100, $100.00, 100, etc., but ended up with the same errors for all of them. Viewing the source told me what the problem was. This is what the relevant portion of the page source looks like:
$<input type="text" name="RED_AMT" id="RED_AMT"
        maxlength="11" size="14" value="" tabindex="10"
        pattern="$#,###.##"
        type="currency" onblur="onBlurRedAmt();"/>
The onblur handler reformatted the value so that it always had two decimal places and a comma to separate thousands, but didn't do anything else. The form's onsubmit event handler was never called. The pattern attribute, looked suspicious. This kind of pattern is what I'd expect for code written in COBOL, or perhaps something using perl forms. The pattern attribute, however, is supposed to be a JavaScript valid regular expression, and the pattern in the code was either not a regular expression, or a very bad one that requires several # characters after the end of the string.

The code also omits the title attribute which is strongly recommended for anything that uses the pattern attribute to make the resulting error message more meaningful, and in fact just usable. The result is that it's impossible to make a wire transfer using any browser that supports HTML5's new form element types and attributes. This is sad because while it looks like Fidelity had good intentions, they messed up horribly on the implementation, and put out an unusable product (unless of course you have firebug or greasemonkey and can mess with the code yourself).

I hacked up a little test page to see if I could reproduce the problem. It's reproducible in Firefox, but not in Opera and I can't figure out why. (Any ideas?). Also notice how using a title attribute makes the error message clearer.

One last point in closing. It looks like both Firefox and Opera's implementations (on Linux at least) have a bad focus grabbing bug. While the error message is visible, the browser grabs complete keyboard focus from the OS (or Windowing environment if you prefer). This means you can't do things like Alt+Tab, or PrtScrn, switching windows, etc. If you click the mouse anywhere, the error disappears. The result is that it's really hard to take a screenshot of the error message. I managed to do it by setting gimp to take a screenshot of the entire Desktop with a 4 second delay. The funny thing is that you can still use the keyboard within the browser to navigate through links, and this does not dismiss the error.

Update: The modality problem with the error message doesn't show up on Firefox 4 on MacOSX

Monday, November 22, 2010

Stream of Collaboration and the Unified InBox

Back in 2003, I'd published a report on the state of computer mediated collaboration at the time. The report did not contain any original research, but was a list of references to other research on the topic. I was also working on ayttm and doing some work on automated project management at the time, which led to my talks on Fallback Messaging and Project Management with Bugzilla, CVS and mailing lists.

The state of technology has changed a lot over the years, and we're getting closer to the Unified Message InBox. The idea has been mulling in my mind for a while, and I've prototyped various implementations as a subset of what I call a stream of collaboration.

Communication

As technologically aware humans, we communicate in a variety of ways. Face-to-face, through the grapevine, handwritten letters and post-it notes, instant messaging, SMS, telephone calls, email, discussion boards, twitter, blogs, smoke signals, morse code using signalling lights, semaphore flags and more. Some of us prefer one form over another, and some of us will completely boycott a particular form of communication purely on principle, some of us won't use a form of communication because simpler methods exist and some of us will use a particular form of communication purely because it isn't the simplest one available. That's what makes each of us uniquely human. It also pushes us into groups and cliques that may or may not intersect. Who among you has a separate group of friends on twitter and facebook even though the two media are not vastly different? Do you also belong to a HAM radio club and a book reading group?

Now some forms of communication are well suited to automated archiving while others might end up being a game of Chinese whispers. Each form of communication also results in a different signal to noise ratio. Depending on the context of the conversation, accuracy and efficiency may or may not be a concern.

Degrees of Communication

Collaboration and Cooperation
Collaboration between a group of people involves communication, often requires archiving of all communications, and a high signal to noise ratio. Discussion lists, IRC logs, wikis, whiteboards, video/audio conferences, project trackers, bug trackers, source control commits, and sometimes email all provide good archiving capabilities. With proper time-stamping of each interaction, the archiving service can interleave events showing the exact chronological order of decisions being made. A Bayesian Filter, similar to the ones used for classifying spam can be used on a per topic basis to hide (but not remove) off-topic sections in an attempt to increase the signal to noise ratio. Once archived, even synchronous communication turns asynchronous [1]. Readers can later comment on sections of a communications log.

While some of the tools mentioned above are aimed at technical collaboration, many of them may also be used effectively for collaboration and support in a non-technical context [2,3] where immediate dissemination of information that can be archived and later used for reference is important.
Social Interaction
A form of communication that is more tolerant of a low signal to noise ratio, and in many cases does not require archival is social interaction. For example, at a party to watch a ball game, conversation may range from the actual events of the game to something completely irrelevant, and they're all socially acceptable in that context. Similarly, the corridor conversations at a conference may only be partially relevant to the subject matter of the conference. How does this translate to a scenario where people are geographically separated and need to communicate electronically rather than face to face?

Social television experiences [4], Co-browsing, LAN parties and digital backchannels [5] are examples where individuals communicate online while simultaneously engaging in a common task at geographically disparate locations. Computer mediated collaboration allows these individuals the ability to come close to the full party experience.
Casual Conversations
With more people moving their lives online [6,10], we're at a point where the volume of casual electronic conversations far exceeds that of technical collaboration. Fallback messaging is a good starting point to tie different forms of communication into a single thread, but it ignores current reality.

For starters, the fallback messaging idea assumes that users would use the same interface to communicate synchronously and asynchronously. In reality people use different methods for each. Asynchronous communication offers the ability, and often creates the necessity of longer messages and a larger amount of time devoted to communicating the message [7]. Should I say "Dear ..." or do I lead in with "Hi"? Should my background be pink or yellow? Do I want hearts, unicorns, birthday cakes or plain white in the background? Do I sign off with "Sincerely", "Regards", "Cheers" or "ttyl"? A different interface for each mode of communication makes it possible to customise the interface for the most common uses of that mode.

Fallback messaging was initially only applied to conversation between two persons, however a lot of casual communication happens between groups with the conversation sometimes forking into side-band conversations between a few (possibly two) members of the group and then merging back in to the parent conversation [7]. Some times the child conversation is made public to the group and some times it isn't. A messaging system must take this into consideration.

Enabling the Stream of Collaboration

Borrowed Ideas
The main idea behind fallback messaging was that service providers, communication protocols and user accounts were an implementation detail and the user should not be concerned with them. Users should only need to think of who they're communicating with, identifying them in any way they find comfortable (for example, using their names, nicknames, an avatar or favourite colour). The service needs to be able to determine based on context and availability, which messaging protocol to use.

Another idea that comes out of the original SMTP specification, is the now obsolete SOML [8] command. The SOML (Send Or MaiL) command would check to see if the recipient of the message was online when the message was sent, and if so, would echo the message to the user's terminal. If the user wasn't online at the time, it would append the message to their mailbox instead. Services like Yahoo! Messenger offer offline messaging capabilities that will hold a message on the server until the user comes online at which point it is delivered to them.

The problem with both these approaches is that they expect the user to create a message appropriate for the underlying service rather than the other way around. An entire email message in the case of SOML or a short text message in the case of Yahoo! Messenger. What we really need is a service that can decide based on what the user does, what kind of message needs to be sent, and how that message should be presented.

Facebook, GMail and Yahoo! Mail all offer a service where instant messaging and mail style messages can be sent from the same page, but with a different interface for each. Additionally, GMail provides the ability to see archived email messages and chat messages in the same context.
Proposed Interface
A messaging system is most useful to the user if it acts as a single point for them to reference and act on all past conversations, and provides an easy gateway to initiating new ones. The read interface must list all past conversations chronologically, with the ability to filter them based on topic and other participants in the conversation. It should be able to show where a conversation forked into separate threads, some of which may have been private, but all of which involved the user, and where these threads merged back into the primary stream.

The interface should include all kinds of communication including email, instant messages, co-browsing sessions, whiteboards, IRC, and anything else. Integrating with a service such as Google Voice, Skype or other VoIP solutions also allows it to tie in telephone conversations. Twitter and facebook notifications would tie in to this timeline as well.

The system should not rely only on message headers and meta-information, but also on message content to determine the topic and participants in a conversation [9]. Some participants, for example, may be involved indirectly, but not be privy to the details of the conversation, however it is useful to the reader to be able to filter messages with these details. Content analysis can also be used to identify messages as common Internet memes and tag them accordingly, possibly providing external links to more detailed information on the topic. Lastly, as has been proposed with fallback messaging, the system needs to aggregate all accounts of a contact across services into a single user defined identifier. It should still be possible for the user to identify which service is being used, but this should be done using colours, icons or similar markers.

Where are we today?

All major electronic communication services already provide some level of API acess to their messaging systems [11,12,13,14,15,16,17]. Services like Tweetdeck provide a single interface to multiple short message services (like twitter, identi.ca and the facebook wall), and Threadsy is supposed to unify your online social experience. Facebook seeks to unify email, IM, texting and Facebook messages through their own messaging service [18] which, like fallback messaging, is supposed to abstract out user accounts so that all you see is your contacts. I haven't seen the new messaging service yet, so I don't know if it also integrates with things like GMail, twitter, Yahoo! and other services that compete with Facebook [19]. If it does, that would be pretty cool. If it doesn't, there's an opportunity to build it. There are still other services that need to be tied in to enable full collaboration, but it doesn't seem too far away.

References

  1. Terry Jones. 2010. Dancing out of time: Thoughts on asynchronous communication. In O'Reilly Radar, October 26 2010. http://radar.oreilly.com/2010/10/dancing-out-of-time-thoughts-o.html
  2. Leysia Palen and Sarah Vieweg. 2008. The emergence of online widescale interaction in unexpected events: assistance, alliance & retreat. In Proceedings of the 2008 ACM conference on Computer supported cooperative work (CSCW '08). ACM, New York, NY, USA, 117-126.
  3. Sutton, J., Palen, L., & Shklovski, I. 2008. Back-Channels on the Front Lines: Emerging Use of Social Media in the 2007 Southern California Wildfires. In Proceedings of the Conference on Information Systems for Crisis Response and Management (ISCRAM).
  4. Crysta Metcalf, Gunnar Harboe, Joe Tullio, Noel Massey, Guy Romano, Elaine M. Huang, and Frank Bentley. 2008. Examining presence and lightweight messaging in a social television experience. ACM Trans. Multimedia Comput. Commun. Appl. 4, 4, Article 27 (November 2008), 16 pages.
  5. Joseph F. McCarthy, danah boyd, Elizabeth F. Churchill, William G. Griswold, Elizabeth Lawley, and Melora Zaner. 2004. Digital backchannels in shared physical spaces: attention, intention and contention. In Proceedings of the 2004 ACM conference on Computer supported cooperative work (CSCW '04). ACM, New York, NY, USA, 550-553.
  6. Donna L. Hoffman, Thomas P. Novak, and Alladi Venkatesh. 2004. Has the Internet become indispensable?. Commun. ACM 47, 7 (July 2004), 37-42. http://cacm.acm.org/magazines/2004/7/6471-has-the-internet-become-indispensable
  7. Rebecca E. Grinter and Leysia Palen. 2002. Instant messaging in teen life. In Proceedings of the 2002 ACM conference on Computer supported cooperative work (CSCW '02). ACM, New York, NY, USA, 21-30.
  8. Jonathan Postel. 1982. RFC 821: Simple Mail Transfer Protocol. IETF Network Working Group RFC Draft. http://www.ietf.org/rfc/rfc0821.txt, (August 1982).
  9. Dong Zhang; Gatica-Perez, D.; Roy, D.; Bengio, S.; "Modeling Interactions from Email Communication," Multimedia and Expo, 2006 IEEE International Conference on , vol., no., pp.2037-2040, 9-12 July 2006
  10. Rana Tassabehji and Maria Vakola. 2005. Business email: the killer impact. Commun. ACM 48, 11 (November 2005), 64-70. http://cacm.acm.org/magazines/2005/11/6081-business-email
  11. Yahoo! Mail API. http://developer.yahoo.com/mail/
  12. Yahoo! Messenger SDK. http://developer.yahoo.com/messenger/
  13. Twitter API. http://dev.twitter.com/doc
  14. Facebook Developer API. http://developers.facebook.com/docs/
  15. GMail API: http://code.google.com/apis/gmail/
  16. Google Voice APIs: http://thatsmith.com/2009/03/google-voice-add-on-for-firefox, http://code.google.com/p/pygooglevoice/
  17. Jabber protocol (Google Chat & Facebook Chat): http://xmpp.org/xmpp-protocols/
  18. MG Siegler. 2010. Facebook's Modern Messaging System: Seamless, History and a Social Inbox. Techcrunch, Nov 15 2010. http://techcrunch.com/2010/11/15/facebook-messaging/
  19. Alexia Tsotsis. 2010. Between Gmail, Twitter and now Facebook There is no Universal Inbox Yet. Techcrunch, Nov 22 2010. http://techcrunch.com/2010/11/21/facebook-messages-is-people/

Sunday, November 21, 2010

Counterfeit coins solution

This is a solution to the Weighing piles of coins problem listed on Microsoft Research. It was fun solving it, and the basic idea of the solution could be applied to many other kinds of elimination problems.

Here's my algorithm. If you don't understand it, well, spend some time figuring it out.
Pile#:    1 2 3 4 5 6 7 8 9 0 1 2 3
Selected: 4 4 4 3 3 3 2 2 2 1 1 1 0 => 30 coins
// This is the first weighing

expected weight = 30X
actual weight = W

X1 = W div 30
D1 = W mod 30    // D1 <= 20 since |D| <= 5 and counterfeit coins in pile <= 4

if D1 == 0 then
        weigh all 52 coins (W)    // this is the second weighing
        X = W div 52
        D = (W mod 52) / 4
        pile = 13
else if D1 <= 5 then
        possible piles = 1-12
else if D1 >= 25 then
        X1 = X1 + 1,
        D1 = 30 - D1
        possible piles = 1-12
else if 5 < D1 < 10 then
        possible piles = 1-9
else if 10 <= D1 <= 20
        X11 = X1 + 1,
        D11 = 30 - D1
        possible piles = 1-6
end if


if 0 < D1 < 10 then

Pile#:    1 2 3 4 5 6 7 8 9 0 1 2 3
Selected: 1 3 4 1 2 4 0 1 3 2 3 4 4 => 32 coins
// This is the second weighing

        expected weight = 32X
        actual weight = W

        X2 = W div 32
        D2 = W mod 32    // D2 <= 20 since |D| <= 5 and counterfeit coins in pile <= 4

        if 12 <= D2 <= 20 and X1 != X2 then
                X2 = X2 + 1
                D2 = 32 - D2
        end if

        X = X2

        if      D2 == D1*0 then
                pile = 7
                D = D1/2
        else if D2 == D1*1 then
                pile = 3
                D = D1/4
        else if D2 == D1*2 then
                pile = 10
                D = D1
        else if D2 == D1*3 then
                pile = 11
                D = D1
        else if D2 == D1*4 then
                pile = 12
                D = D1
        else if D2 == D1*1/2 then
                pile = 8
                D = D1/2
        else if D2 == D1*3/2 then
                pile = 9
                D = D1/2
        else if D2 == D1*1/3 then
                pile = 4
                D = D1/3
        else if D2 == D1*2/3 then
                pile = 5
                D = D1/3
        else if D2 == D1*4/3 then
                pile = 6
                D = D1/3
        else if D2 == D1*1/4 then
                pile = 1
                D = D1/4
        else if D2 == D1*3/4 then
                pile = 2
                D = D1/4
        end if
else

Pile#:    1 2 3 4 5 6 7 8 9 0 1 2 3
Selected: 1 2 3 0 1 2 4 4 4 4 4 4 4 => 37 coins
// This is the second weighing

        expected weight = 37X
        actual weight = W

        X2 = W div 37
        D2 = W mod 37    // D2 <= 15 since |D| <= 5 and counterfeit coins in pile <= 3

        X = X2

        if X11 == X2 then
                D1 = D11
        end if

        if      D2 == D1*1/4 then
                pile = 1
                D = D1/4
        else if D2 == D1*2/4 then
                pile = 2
                D = D1/4
        else if D2 == D1*3/4 then
                pile = 3
                D = D1/4
        else if D2 == D1*0/3 then
                pile = 4
                D = D1/3
        else if D2 == D1*1/3 then
                pile = 5
                D = D1/3
        else if D2 == D1*2/3 then
                pile = 6
                D = D1/3
        end if
end if

Thursday, November 04, 2010

Submitting cross-domain SOAP requests from the browser without XForms

Web service calls from a web page to a back end service can easily be made using XHR as long as the service runs on the same domain that the page is served from. For cross-domain requests, however, we have a problem. Typical methods of doing cross-domain requests require script nodes, a server side proxy, a flash based transport, or submitting a hidden form to an iframe target.

While the server side proxy and flash based transport both add an external dependency the script node can only make a GET request, and the hidden form approach can only send URL encoded key/value pairs... that's until we try a bit of trickery.
<form id="soap" method="POST" action="[SOAP entry point URL]" enctype="text/plain">
<textarea name="<?xml version">
"1.0" encoding="UTF-8"?>
[SOAP message here]
</textarea>
</form>

<script>
document.getElementById("soap").submit();
</script>

And that's it. The key elements are highlighted. In particular, you set the form's enctype attribute to text/plain. This makes sure that none of the data is URL encoded. Then a clever trick that works well with XML documents. Set the text field's name to <?xml version, ie, the starting text of an XML document. Omit the = sign and set the value to everything else.

When the form is submitted, the browser sends form fields as key=value, one on each line (that's how text/plain works). In this case, it sends the following:
<?xml version="1.0" encoding="UTF-8"?>
[SOAP message here]
Which essentially submits the SOAP payload to the web service.

Caveats

Naturally, all browsers don't work alike. For this particular example, all Webkit based browsers are broken. They don't handle an enctype of text/plain correctly. Chrome, Safari and Konqueror all set the Content-type header to text/plain, but the actual submitted data is URL encoded. This is consistent with research done by Gregory Fleischer and Bug #20795 filed on WebKit. Firefox (as far as Netscape 4 IIRC, probably earlier), IE (6 and above) and Opera handle it correctly.

C-Surfing on SOAP

There are security concerns with this approach as well, and in my opinion they are bigger than any benefit this might bring. An attacker can use this method to CSRF your SOAP based web services. Given this, it's a good idea to make sure that all your web service calls also have some kind of nonce or token that can only be generated if the request originated from your site.

Monday, October 18, 2010

Common Security Mistakes in Web Applications

I've just written my first article for Smashing Magazine. It's titled Common Security Mistakes in Web Applications. This is also my first security related post after joining the Yahoo! Paranoid group. The article covers XSS, CSRF, Phishing, SQL injection, Click-Jacking and Shell injection.

Wednesday, October 13, 2010

What's a browser? — the developer edition

Nicholas Zakas has a great writeup explaining a web browser to non-technical people. I thought I'd take this opportunity to explain what a web browser is to web developers.

At the heart of it, a web browser is two things. It is a program that can communicate with a web server using the HTTP protocol, and it is a program that can render HTML and other associated content types... except that it might not care to. As web developers looking out at the world through the window of a TCP socket on port 80, all we see is an agent on the other end issuing GET and POST requests. What it does with our responses, we don't really know. We do our best to cater to what we can identify. We look at user-agent strings to build statistics of the kinds of browsers that visit our site, and we use JavaScript to detect capabilities that we can use to enhance the experience that we deliver, but what if that JavaScript were never executed and the user-agent string were a lie?

No, at the heart of it, a web browser is just one thing — an HTTP client.

Built upon this HTTP client could be various entities. A real web rendering engine, a crawling bot, an audio browser, a web service client, or a script kiddie using curl. While it may be impossible to know of all possible entities on the other end, as web developers, we must build applications that are prepared to deal with anything.

We use progressive enhancement to create an engaging experience for genuine users of our site regardless of the capabilities of the user agent they use. We validate everything that comes in over that HTTP connection to prevent the destruction or degradation of our service either by malice or accident, and we trust nothing that comes in from the other end. Not the POST data, not the query string, not the cookies, not the request time, and certainly not the user agent string.

Do we assume our users are the kind that Nicholas describes or do we assume that they're all out to destroy us, or perhaps somewhere in between? The reality is that we have to build our sites for all of them. Start with HTTP and progressively enhance from there.

Wednesday, August 25, 2010

An equation to predict a page's roundtrip time


Note that I'm using MathJax to render the equations on this post. It can take a while to render everything, so you may need to wait a bit before everything shows up. If you're reading this in a feed reader, then it will all look like gibberish (or LaTeX if you can tell the difference).

So I've been playing with this idea for a while now, and bounced it off YSlow dev Antonia Kwok a few times, and we came up with something that might work. The question was whether we could estimate a page's expected roundtrip time for a particular user just by looking at the page structure. The answer, is much longer than that, but tends towards possibly.

Let's break the problem down first. There are two large unknowns in there:
  • a particular user
  • the page's structure
These can be broken down into their atomic components, each of which is either known or measurable:
  • Network characteristics:
    • Bandwidth to the origin server ( \(B_O\) )
    • Bandwidth to your CDN ( \(B_C\) )
    • Latency to the origin server ( \(L_O\) )
    • Latency to your CDN ( \(L_C\) )
    • DNS latency to their local DNS server ( \(L_D\) )
  • Browser characteristics:
    • Number of parallel connections to a host ( \(N_{Hmax}\) )
    • Number of parallel connections overall ( \(N_{max}\) )
    • Number of DNS lookups it can do in parallel ( \(N_{Dmax}\) )
    • Ability to download scripts in parallel
    • Ability to download css in parallel (with each other and with scripts)
    • Ability to download images in parallel with scripts
  • Page characteristics:
    • Document size (\(S_O\) )
    • Size of each script (\(S_{S_i}\))
    • Size of each non-script resource (images, css, etc.) (\(S_{R_i}\))
    • Number of scripts ( \(N_S\))
    • Number of non-script resources (\(N_R\))
    • Number of hostnames (\(N_H\)), further broken down into:
      • Number of script hostnames (\(N_{SH}\))
      • Number of non-script hostnames (\(N_{RH}\))
All sizes are on the wire, so if a resource is sent across compressed, we consider the compressed size and not the uncompressed size. Additionally, scripts and resources within the page can be combined into groups based on the parallelisation factor of the browser in question. We use the terms \(SG_i\) and \(RG_i\) to identiffy these groups. We treat scripts and non-script resources differently because browsers treat them differently, ie, some browsers will not download scripts in parallel even if they download other resources in parallel.

To simplify the equation a bit, we assume that bandwidth and network latency from the user to the CDN and the origin are the same. Additionally, the latency for the main page includes both network latency and the time it takes the server to generate the page (\(L_S\)). Often this time can be significant, so we redefine the terms slightly:
\begin{align}
B_O & = B_C \\
L_O & = L_S + L_C
\end{align}

Browser characteristics are easy enough to obtain. Simply pull the data from BrowserScope's Network tab. It contains almost all the information we need. The only parameter not listed is the number of parallel DNS lookups that a browser can make. Since it's better to err on the side of caution, we assume that this number is 1, so for all further equations, assume \(N_{Dmax} = 1\).

Before I get to the equation, I should mention a few caveats. It's fairly naïve, assuming that all resources that can be downloaded in parallel will be downloaded in parallel, that there's no blank time between downloads, and that the measured bandwidth \(B_C\) is less than the actual channel capacity, therefore multiple parallel TCP connections will all have access to the full bandwidth. This is not entirely untrue for high bandwidth users, but it does breakdown when we get down to dial-up speeds. Here's the equation:
\[
T_{RT} = T_P + T_D + T_S + T_R
\]
Where:
\begin{align}
T_P \quad & = \quad L_O + \frac{S_O}{B_C}\\
\\
\\
\\
\\
\\
T_D \quad & = \quad \frac{N_H}{N_{Dmax}} \times L_D\\
\\
\\
\\
\\
\\
T_S \quad & = \quad \sum_{i=1}^{N_{SG}} \left( \frac{S_{SG_imax}}{B_C} + L_C \right) \\
\\
\\
\\
\\
\\
N_{SG} \quad & = \quad \left\{
\begin{array}{1 1}
\frac{N_S}{min \left( N_{Hmax} \times N_{SH}, N_{max} \right) } & \quad \text{if browser supports parallel scripts}\\
\\
\\
N_S & \quad \text{if browser does not support parallel scripts}\\
\end{array} \right. \\
\\
\\
\\
\\
\\
S_{SG_imax} \quad & = \quad \text{Size of the largest script in script group } SG_i\\
\\
\\
\\
\\
\\
T_R \quad & = \quad \sum_{i=1}^{N_{RG}} \left( \frac{S_{RG_imax}}{B_C} + L_C \right) \\
\\
\\
\\
\\
\\
N_{RG} \quad & = \quad \frac{N_R}{min \left( N_{Hmax} \times N_{RH}, N_{max} \right) }\\
\\
\\
\\
\\
\\
S_{RG_imax} \quad & = \quad \text{Size of the largest resource in resource group } RG_i
\end{align}

So this is how it works...

We assume that the main page's download time is a linear function of its size, bandwidth, the time it takes for the server to build the page and the network latency between the user and the server. While this is not correct (consider multiple flushes, bursty networks, and other factors), it is close.

We then consider all scripts in groups based on whether the browser can handle parallel script downloads or not. Script groups are populated based on the following algorithm:
for each script:
   if size of group > Nmax:
      process and empty group
   else if number of scripts in group for a given host > NHmax:
      ignore script for the current group, reconsider for next group
   else
      add script to group

process and empty group
If a browser cannot handle parallel scripts, then we just temporarily set \(N_{max}\) to 1.

Similarly, we consider the case for all non-script resources:
for each resource:
   if size of group > Nmax:
      process and empty group
   else if number of resources in group for a given host > NHmax:
      ignore resource for the current group, reconsider for next group
   else
      add resource to group

process and empty group

For DNS, we assume that all DNS lookups are done sequentially. This makes our equation fairly simple, but turns our result into an overestimate.

Overall, this gives us a fairly good guess at what the roundtrip time for the page would be, but it only works well for high bandwidth values.

We go wrong with our assumptions at a few places. For example, we don't consider the fact that resources may download in parallel with the page itself, or that when the smallest script/resource in a group has been downloaded, the browser can start downloading the next script/resource. We ignore the fact that some browsers can download scripts and resources in parallel, and we assume that the browser takes no time to actually execute scripts and render the page. These assumptions introduce an error into our calculations, however, we can overcome them in the lab. Since the primary purpose of this experiment is to determine the roundtrip time of a page without actually pushing it out to users, this isn't a bad thing.

So, where do we get our numbers from?

All browser characteristics come from BrowserScope.

The user's bandwidth is variable, so we leave that as a variable to be filled in by the developer running the test. We could simply select 5 or 6 bandwidth values that best represent our users based on the numbers we get from boomerang. Again, since this equation breaks down at low bandwidth values, we could simply ignore those.

The latency to our CDN is something we can either pull out of data that we've already gathered from boomerang, or something we can calculate with a simple and not terribly incorrect formula:
\[
L_C = 4 \times \frac{distance\left(U \leftrightarrow C\right)}{c_{fiber}}
\]
Where \(c_{fiber}\) is the speed of light in fiber, which is approximately \(2 \times 10^8 m/s\).

DNS latency is a tough number, but since most people are fairly close to their ISPs, we can assume that this number is between 40-80ms. The worst case is much higher than that, but on average, this should be correct.

The last number we need is \(L_S\), the time it takes for the server to generate the page. This is something that we can determine just by hitting our server from a nearby box, which is pretty much what we do during development. This brings us to the tool we use to do all the calculations.

YSlow already analyses a page's structure and looks at the time it takes to download each resource. We just pull the time out from what YSlow already has. YSlow also knows the size of all resources (both compressed and uncompressed), how many domains are in use and more. By sticking these calculations into YSlow, we could get a number that a developer can use during page development.

The number may not be spot on with what real users experience, but a developer should be able to compare two page designs and determine which of these will perform better even if they get the same YSlow score.

Naturally this isn't the end of the story. We've been going back and forth on this some more, and are tending towards more of a CPM approach to the problem. I'll write more about that when we've sorted it out.

For now, leave a comment letting me know what you think. Am I way off? Do I have the right idea? Can this be improved upon? Is this something you'd like to see in YSlow?

Thursday, August 19, 2010

Where does firefox install its extensions?

It took me a while to figure this out for all the OSes I use. The only solution I found for MacOSX was hidden inside a post about installing extensions globally.

Linux

On Linux, extensions are stored inside
~/.mozilla/firefox/<profile>/extensions
$(MOZILLA_DIR)/extensions

MacOSX

On the Mac, it's stored in
~/Library/Application Support/Firefox/Profiles/<profile>/extensions
/Library/Application Support/Mozilla/Extensions

Wednesday, August 04, 2010

Site theme

Ten months down the line,
I still like this site's style,
I've tweaked it a bit,
But I gotta admit,
That bluesmoon did this one fine.


When I work on a site, I always start with the design first. I'm a visual developer, so I really need to see what things will look like when I'm done. I also need to know if I can stand the design enough to stare at it for days on end during development. The result is that I end up hating my design in a couple of days and change things around, most often it's the colours.

This has happened a lot with earlier versions of my site's style, but the version I have now is one I like. I first pushed it out standardised across all my subdomains back in November 2009, and now 10 months down the line, it still pleases me.

I like the colours. They don't shout out at me, but frame my content well. Important content is visible while unimportant meta information is dimmed out. I played around with the margins a lot to get it to work the way I want across devices. It's not ideal, but it gets the job done. I took inspiration for the fonts from several sites until I settled on what I use now.

There's still a few issues. Not everyone likes the tag cloud, but it's the best way for me to navigate my site, especially since google's site search doesn't seem to work. The home page looks terrible on my E71, and I should get to fixing that some day, but other than that, it's ok.

Monday, August 02, 2010

4.01 Strict — Tennis Elbow

Tennis Elbow

My left elbow's out of commission, so there will probably not be a strip until it gets better. The technical term is Lateral Epicondylitis, commonly known as Tennis Elbow. What's ironic is that I don't play Tennis. It happened due to bad typing posture.

Thursday, July 29, 2010

Boomerang and the WebTiming API

I've been doing a lot of development on boomerang over the last few weeks. In particular around the WebTiming API. During Velocity this year, Microsoft announced that their latest release of IE9 supported this API, so I set about adding support to boomerang.

Since then, Google have announced support in Chrome as well, and there are rumours around that Mozilla plans on adding it to firefox. Now since the specification is still a draft, none of these browsers use the window.performance object, but instead use their own browser specific version. I've now added support for all these versions into boomerang.

More details in my post on the YDN blog.

You can get boomerang from github, and join the discussion on our forum.

Monday, July 26, 2010

4.01 Strict — Invention of the Web

Tim Berners-Lee invents the WWW

Tim Berners-Lee invents the Web

There was a young man named Tim,
Who wanted to share docs on a whim,
He built HTTP,
But the W3C
Made standards so webdevs would sin

Sunday, July 25, 2010

Handling Date/Times and timezones

Wrote this as an answer to a question on StackOverflow, but I think it was one of my better ones, so am reproducing it here.

There are four different times you should consider:
  1. Event time: eg, the time when an international sporting event happens, or a coronation/death/etc. This is dependent on the timezone of the event and not of the viewer.
  2. Television time: eg, a particular TV show is broadcast at 9pm local time all around the world. Important when thinking about publishing the results (of say American Idol) on your website
  3. Relative time: eg: This question has an open bounty closing in 21 hours. This is easy to display
  4. Recurring time: eg: A TV show is on every Monday at 9pm, even when DST changes.
There is also Historic/alternate time. These are annoying because they may not map back to standard time. Eg: Julian dates, dates according to a Lunar calendar on Saturn, The Klingon calendar.

Storing start/end timestamps in UTC works well. For 1, you need an event timezone name + offset stored along with the event. For 2, you need a local time identifier stored with each region and a local timezone name + offset stored for every viewer (it's possible to derive this from the IP if you're in a crunch). For 3, store in UTC seconds and no need for timezones. 4 is a special case of 1 or 2 depending on whether it's a global or a local event, but you also need to store a created at timestamp so you can tell if a timezone definition changed before or after this event was created. This is necessary if you need to show historic data.

Storing times

  • Always store time in UTC
  • Convert to local time on display (local being defined by the user looking at the data)
  • When storing a timezone, you need the name, timestamp and the offset. This is required because governments sometimes change the meanings of their timezones (eg: the US govt changed DST dates), and your application needs to handle things gracefully... eg: The exact timestamp when episodes of LOST showed both before and after DST rules changed.
Offsets and names
An example of the above would be:

The soccer world cup finals game happened in South Africa (UTC+2--SAST) on July 11, 2010 at 19:00 UTC.

With this information, we can historically determine the exact time when the 2010 WCS finals took place even if the South African timezone definition changes, and be able to display that to viewers in their local timezone at the time when they query the database.
System Time

You also need to keep your OS, database and application tzdata files in sync, both with each other, and with the rest of the world, and test extensively when you upgrade. It's not unheard of that a third party app that you depend on did not handle a TZ change correctly.

Make sure hardware clocks are set to UTC, and if you're running servers around the world, make sure they're OSes are configured to use UTC as well. This becomes apparent when you need to copy hourly rotated apache log files from servers in multiple timezones. Sorting them by filename only works if all files are named with the same timezone. It also means that you don't have to do date math in your head when you ssh from one box to another and need to compare timestamps.

Also, run ntpd on all boxes.

Clients

Never trust the timestamp you get from a client machine as valid. For example, the Date: HTTP headers, or a javascript Date.getTime() call. These are fine when used as opaque identifiers, or when doing date math during a single session on the same client, but don't try to cross-reference these values with something you have on the server. Your clients don't run NTP, and may not necessarily have a working battery for their BIOS clock.

Trivia

Finally, governments will sometimes do very weird things:
Standard time in the Netherlands was exactly 19 minutes and 32.13 seconds ahead of UTC by law from 1909-05-01 through 1937-06-30. This time zone cannot be represented exactly using the HH:MM format.

Monday, July 19, 2010

4.01 Strict — Don't mess with Stubbornella

Mixing CSS is pixel princess, but weigh her down, and you'll have to deal with... Stubbornella.
Don't mess with stubbornella

Monday, July 12, 2010

4.01 Strict — SpeedFreak

Mild mannered Stoyan Stefanov makes an appearance in this week's 4.01 Strict.

Stoyan (the speedfreak) Stefanov saves the web

Tuesday, July 06, 2010

4.01 Strict — The discovery of JSON

In episode 2, Douglas Crockford discovers JSON

Douglas Crockford discovers JSON

Monday, June 28, 2010

My velocity talk

So I was supposed to speak at Velocity 2010 this year. My talk was titled Latency: Why you should worry and what you can do about it. It was one of the last few talks at the conference. Unfortunately though, the thing about a conference like velocity is that almost everything is about latency, and by the time I went on stage, absolutely every point that I'd planned on making had already been covered.

That's not a bad thing. For one, it validates my ideas, telling me that many other smart people have been thinking about the same things. Secondly, each of these speakers had 60-90 minutes to convey their ideas, which meant they could really do it justice. It also shows that velocity is a pretty awesome conference to be at.

So, by the night of the 23rd, I realised that there was no way I could go ahead with my talk in its current state, and sat down to rewrite it. That's when the drama started.

Now my usual format is to build my slides using LaTeX and generate a PDF from that. I have a bunch of relevant images and charts in there, and it's not too hard to do that with LaTeX. I started going down that path, but realised that I wanted to do something different. I really wanted to reference all the other presentations and go over points that they'd covered, so I switched to Comic Life. I was up until 5am, working out the details and getting the script right.

The thing about a 12 minute talk is that you can't waste a single minute thinking about what you're going to say next. You just need to know it, and there wasn't much time for me to prepare. I'd also planned to announce boomerang in the latter part of my talk, so had to fit that in as well.

At 5 I realised that I was too sleepy to keep going, so I went to bed and figured I could finish things off in a couple of hours once I woke up. I was also too sleepy to backup the presentation to my server.

I woke up at 10, and opened up the mac to start work again, and it wouldn't come on. I could hear the fans whirring inside, and the startup sound would sound, but the screen wouldn't turn on. It was now officially time to panic.

Two things were in my favour though. I work at Yahoo!, and the conference was just down the road from Yahoo! HQ. I headed in to the office and took the mac down to IT. The guys there were really helpful. They opened up a new Mac, and transferred my hard disk to it. The new one started up perfectly, and it's what I'm using right now.

I spent the next two hours putting the finishing touches on, and adding details for the boomerang release, and then rushed off to the conference with a slice of pizza in my hand.

Once I got on stage, I plugged in the mac, and as it has done on numerous occasions in the past, it hung as soon as the VGA adaptor was plugged in. I had to reboot while on stage, and this was all coming out of the 12 minutes that I had.

The talk went off okay. Looking back at the video, there's a bunch of things I'd have done differently, and some more details that I'd have covered, but I'll let you decide for yourselves.

My slides are on slideshare, while the video is on blip.tv.

4.01 Strict — I am codepo8

I've been playing around with comic life, and this is what I ended up with:

I am codepo8

I'm calling it 4.01 Strict.

Saturday, June 26, 2010

boomerang

I've been silent on this blog for a while, and the main reason for that is boomerang. It's taken a few months to get it ready for a public release, and I'm quite pleased with the way it's come out.

As many of you know, I've worked with the Exceptional Performance team at Yahoo! for over two years, and during this time, I've done a lot of work with measuring and analysing the performance of all of Yahoo!'s websites (yes, all). There are several tens of thousands of page types, and probably hundreds of thousands of unique URLs that could show up on a yahoo.com domain. This throws up some very interesting problems that I'm not sure I'd have had the chance to work with anywhere else.

I learnt a lot about scaling cheap hardware to meet the needs of this system, and also a lot about how each of these thousands of pages worked to improve their performance.

The measurement tool that we use has turned into boomerang. It's primarily a beaconing framework, so isn't restricted to performance data, but the core plugins that come packaged with it only do performance measurement.

I'll write more in later posts about what you can do with boomerang, but for now, go get it from github.com/yahoo/boomerang. Fork it, use it, patch it, tell us what you think of it.

There's also documentation along with the source code, or you can have a look at my personal copy of them — frequently, but not automatically updated.

I announced it at Velocity 2010, and they made a video of my talk.

Thursday, June 10, 2010

location.href

While working on boomerang, I needed to get at the current document's URL. The way I generally do that is via the location.href property. This worked perfectly until I hit upon a URL with a # in it. The # was correctly URL encoded to %23, and most browsers returned the correct string in location.href.

Safari 4.x was the only exception. It returned the URL after decoding. This wouldn't be a problem on its own, except that only the URL portion was decoded. location.search (everything after the ?) and location.hash (everything after the #) were untouched.

This meant that a URL like this:
http://tech.bluesmoon.info/foobar%231.html#123
Would show up in location.href as this:
http://tech.bluesmoon.info/foobar#1.html#123
This was a problem, because I now had two # signs in the URL, and would have to twist through hoops to correctly encode the URL. The bug is actually in Webkit, and affects Android too. Thanks to Ryan Grove for pointing me to it.

Then I discovered the document.URL property. This is exactly the same as location.href, except that it isn't affected by the bug in Safari 4.x.

I've tested in other browsers, and it seems to work as expected, so that's what I'm using for now.

Oh, and if you don't know what boomerang is yet, wait until June 24.

Tuesday, April 13, 2010

Internet access from Mumbai Airport's Clipper Lounge

If you're a frequent flier, or are flying first or business class out of Mumbai, then you'll get access to the Clipper Lounge. They provide free wifi, but it doesn't really work as advertised.

You first connect to the wifi network, and open your browser, which gets redirected to a login page. You click on Complimentary Internet access, and get to a form that asks for your personal details, and phone number, claiming that they will send you an access code via SMS.

Do NOT enter your real phone number here. For starters, no SMS gets sent. Secondly, you'll start getting SMS spam. Use any 10 digit phone number. Enter 91 for the country code.

Now, once you do this, you'll get an intermediate page that will immediately redirect to a second page that shows you your username and password. The password on that page is wrong. Instead, what you want to do is hit escape as soon as you get the intermediate page. It's blue, and it has the database insert statement that looks like this:
insert into users 
(u_username, u_password, u_user_type,u_fullname,u_initial_quota,
 u_current_quota,u_persistant_quota,u_interim_interval,
 u_idle_timeout,u_expiration_date,u_nbr_concurrent_sessions,
 u_max_concurrent_sessions, u_nbr_starts, u_nbr_stops,
 fk_group_id,fk_profile_id
) values (
 'cc3zo7s', '2497161', '2', 'Morons',
 '86400', '86400', '1', '15', '600',
 '2010-05-13 22:51:28','0','1',
 '0','0','2','2'
);
In the above string, cc3zo7s is the username and 2497161 is the password that you need to use.

Note, that if you enter a phone number that already exists in the database (eg: one of the tech support numbers on the login page), you'll get this instead:
Duplicate entry '1271179237' for key 'PRIMARY'
Just try a different phone number. Oh, and what's with the semi-colon at the end of the SQL statement?

Lastly, don't bother sending them feedback. It doesn't go anywhere. And I don't mean that it goes into some mailbox that no one reads. I mean it goes nowhere. This is the page you get when you try to send feedback:

Feedback page for Mumbai airport wifi

The interesting text in there is:
Warning: mail() [function.mail]: Failed to connect to mailserver at
 "ezcommindia.com" port 25, verify your "SMTP" and "smtp_port" setting in
 php.ini or use ini_set() in C:\xampp\htdocs\ezcomm\ifeedback.php
 on line 225
Yup, real classy. I think we can just start from the end of that message and stop when we get to C:\. It explains a lot.

This login manager isn't just broken, it goes right through broken and comes out on the other end.

Saturday, April 10, 2010

Can a website crash your RHEL5 desktop?

A few months ago I started having trouble with my RHEL5 desktop when running Firefox 3.5. On a few websites, the entire system would crash pretty consistently. It took a long time, but I finally managed to find the problem and get a fix in there.

My solution is documented on the YDN blog. Please leave comments there.

Friday, April 09, 2010

Analysis of the bandwidth and latency of YUIblog readers

A few months ago, Eric Miraglia from the YUI team helped me run some analysis on the types of network connections that were coming in to YUIBlog. In this article on the YUI blog, I've published the results of my analysis.

Please leave comments on the YUIblog.

Thursday, April 01, 2010

Speaking at IPC Spring in Berlin

I'll be speaking at IPC Spring in Berlin between May 31 and June 2 this year. My talk is on MySQL partitioning. It will have more information than the talk I did at ConFoo in March. The talk will be on Tuesday, June 1st.

If you're in or around Berlin at that time, drop in.

Monday, March 22, 2010

InnoDB's tablespace ids and Partitions

There are times when what you have is a partially running database and a bunch of backup innodb tablespace files (the .ibd files). If you're using innodb_file_per_table, then you have a separate .ibd file for each InnoDB table.

Now, you have your running database with a bunch of tables, and you want to replace some of them with the backup .ibd files. According to the MySQL docs, you'd do this:
  1. ALTER TABLE foo DISCARD TABLESPACE; (this deletes the current .ibd file)
  2. copy the old .ibd file into your database directory
  3. ALTER TABLE foo IMPORT TABLESPACE;
Assuming your .ibd file was from the same database and you did not drop the table and recreate it sometime between when you made the backup .ibd and now, this should work. Except... if you use partitions. If your table foo uses partitions, ie, its create statement was something like this:
CREATE TABLE foo (
   ...
) PARTITION BY ... (
   PARTITION p0 ...,
);
In this case, you cannot discard the tablespace, and the first alter command throws an error:
mysql> ALTER TABLE foo DISCARD TABLESPACE;

ERROR 1031 (HY000): Table storage engine for 'foo' doesn't have this option
I have not investigated if there are workarounds for this, but I do have a little more information on what's happening. Remember that each .ibd file is a tablespace. For a partitioned table, there are multiple .ibd files, one for each partition. The table's files look like this:
foo.frm
foo.par
foo#P#p0.ibd
foo#P#p1.ibd
...
Where p0, p1, etc. are the partition names that you specified in the create statement. Each partition is a different tablespace and has its own tablespace id. When you create an InnoDB table without partitioning, the internal tablespace id counter is incremented by 1. When you create an InnoDB table with paritions, the internal tablespace id counter is incremented by the number of partitions. The actual tablespace id is stored in each partition's .ibd file somewhere within the first 100 bytes. I have not attempted to find out where exactly though.

Friday, March 19, 2010

WebDU

Wow, it's been another couple of weeks that I haven't updated this blog. Most of my time's gone in travelling. I was at ConFoo last week, and have a pending blog post about that, though it will be on the YDN blog. Next month I'll spend some time in India and then in May I head over to Bondi beach in Australia for WebDU.

I'll be speaking about web performance there and will probably do a pub event as well. Stay posted for more details. In any event, if Sydney is where you're at, then go register for WebDU. Geoff and crew put on a great show last year and it only looks to be getting better.

WebDU Speaker badge

Sunday, February 28, 2010

Convert IP to Geo info using YQL

For my missing kids hack, I needed to convert an IP address to a US or Canadian 2 letter state code. This should have been pretty straightforward, but it turned out to require a little more effort than I initially wanted to put in.

First, the easy way. Rasmus Lerdorf has a web service that takes in an IP address and based on the MaxMind data, returns a bunch of information including the country and state/region code. I initially decided to use this. His example page is pretty self-explanatory, so I won't re-document it here. The problem is that this service was really slow and increased page load time a lot, so I scrapped the idea.

I then started looking through YQL. YQL has a whole bunch of geo stuff, but nothing that specifically turns an IP address into a WoEID or a country/state code. I then looked at the community supported tables and found the ip.location table that uses the ipinfodb.com wrapper around the MaxMind database. This returned everything I needed, but the only problem was that the state was returned as a string rather than a two character code. This is the query:
SELECT * From ip.location Where ip=@ip
The output looks like this:
{
 "query":{
  "count":"1",
  "created":"2010-02-28T01:24:30Z",
  "lang":"en-US",
  "updated":"2010-02-28T01:24:30Z",
  "uri":"http://query.yahooapis.com/v1/yql?q=select+*+from+ip.location+where+ip%3D%27209.117.47.253%27",
  "results":{
   "Response":{
    "Ip":"209.117.47.253",
    "Status":"OK",
    "CountryCode":"US",
    "CountryName":"United States",
    "RegionCode":null,
    "RegionName":null,
    "City":null,
    "ZipPostalCode":null,
    "Latitude":"38",
    "Longitude":"-97",
    "Timezone":"-6",
    "Gmtoffset":"-6",
    "Dstoffset":"-5"
   }
  }
 }
}
Now it's pretty trivial to build an array that maps from state name to state code, but I'd have to keep growing that as I added support for more countries, so I decided against that route. Instead I started looking at how I could use the geo APIs to turn this information into what I wanted. Among other things, the data returned also contained the latitude and longitude of the location that the IP was in. I decided to do a reverse geo map from the lat/lon to the geo information. The only problem is that the geo API itself doesn't do this for you.

Tom Croucher then told me that the flickr.places API could turn a latitude and longitude pair into a WoEID, so I decided to explore that. This is the query that does it:
SELECT place.woeid From flickr.places
 Where lat=@lat And lon=@lon
Now I could tied the two queries together and get a single one that turns an IP address to a WoEID:
SELECT place.woeid From flickr.places
 Where (lat, lon) IN
   (
      SELECT Latitude, Longitude From ip.location
       Where ip=@ip
   )
This is what the output looks like:
{
 "query":{
  "count":"1",
  "created":"2010-02-28T01:25:34Z",
  "lang":"en-US",
  "updated":"2010-02-28T01:25:34Z",
  "uri":"http://query.yahooapis.com/v1/yql?q=SELECT+place.woeid+From+flickr.places%0A+Where+%28lat%2C+lon%29+IN%0A+++%28%0A++++++SELECT+Latitude%2C+Longitude+From+ip.location%0A+++++++Where+ip%3D%27209.117.47.253%27%0A+++%29",
  "results":{
   "places":{
    "place":{
     "woeid":"12588378"
    }
   }
  }
 }
}
The last step of the puzzle was to turn this WoEID into a country and state code. This I already knew how to do:
SELECT country.code, admin1.code
  From geo.places
 Where woeid=@woeid
country.code gets us the two letter ISO3166 country code while admin1.code gets us a code for the local administrative region. For the US and Canada, this is simply the country code followed by a hyphen, followed by the two letter state code. Once I got this information, I could strip out the country code and the hyphen from admin1.code and get the two letter state code.

My final query looks like this:
SELECT country.code, admin1.code From geo.places
 Where woeid IN
   (
      SELECT place.woeid From flickr.places
       Where (lat, lon) IN
         (
            SELECT Latitude, Longitude From ip.location
             Where ip=@ip
         )
   )
And the output is:
{
 "query":{
  "count":"1",
  "created":"2010-02-28T01:26:32Z",
  "lang":"en-US",
  "updated":"2010-02-28T01:26:32Z",
  "uri":"http://query.yahooapis.com/v1/yql?q=SELECT+country.code%2C+admin1.code+From+geo.places%0A+Where+woeid+IN%0A%28SELECT+place.woeid+From+flickr.places%0A+Where+%28lat%2C+lon%29+IN%0A+++%28%0A++++++SELECT+Latitude%2C+Longitude+From+ip.location%0A+++++++Where+ip%3D%27209.117.47.253%27%0A+++%29%29",
  "results":{
   "place":{
    "country":{
     "code":"US"
    },
    "admin1":{
     "code":"US-KS"
    }
   }
  }
 }
}
Paste this code into the YQL console, make sure you've selected "Show community tables" and get the REST API from there. It's a terribly roundabout way to get something that should be a single API call, but at least from my application's point of view, I only need to call a single web service. Now if only we could convince the guys at missingkidsmap.com to use WoEIDs instead of state codes, that would make this all a lot easier.

Have I mentioned how much I like YQL?

Saturday, February 27, 2010

Closures and Function Currying

A few weeks ago, someone emailed me with this question:
Assume i have test.js file with code
function createAdder(x) {
   return function(y) {
      return x + y;
   }
}

var add2 = createAdder(2); //<-------LINE 1
var add5 = createAdder(5); //<-------LINE 2
alert(add2(10));           //<------------LINE 3
alert(add5(10));           //<------------LINE 4
... My doubt: what's actually happening in the LINE 1-4?
I promised him an answer as a blog post, so here it is.

Currying

What we see here is something called function currying. It's common in mathematics and is named after one of its inventors — Haskell Curry. In short, currying converts a single function that takes in multiple arguments to multiple functions each of which takes in a single argument. This is particularly useful when caching the intermediate steps for later reuse with different parameters is of use. The classic example is that of operating on two numbers:
function divide(a) {
   return function(b) {
      return b/a;
   }
}

var divide_2 = divide(2);
var divide_3 = divide(3);

alert(divide_2(6)); // alerts 3
alert(divide_3(6));     // alerts 2
In the above code, divide_2 is a function that takes in one argument and divides it by 2, returning the result. This is a fairly trivial and not very useful example because there are easier ways to divide a number by two. It becomes more useful though, when we need to do a bunch of expensive processing to get to each of the inner results. Consider this code instead:
function hash(salt) {
   // do some expensive processing on salt
   var hash1 = process(salt);
   return function(data) {
      // cheap processing of data with hash1
      return data + hash1;
   };
}

var sign1 = hash(salt1);   // sign1 is a function that signs data with salt1

var signature = sign1(some_data);
In the above code, the outer function does a bunch of expensive processing, and its result is stored in the hash1 variable. This variable is available to the inner function whenever it is called because of the closure that's created. When the inner function is called, it simply uses the value of hash1 without having to redo the processing. Now we could have called process() externally and cached its result, but then the hash1 would be exposed. This may not be something we want to do either because it needs to be abstracted out, or because its value is sensitive.

Closures

This all works because of closures. In short, a variable will continue to exist as long as code that can see it can be run. In the above cases, the inner functions are returned and their references stored in global variables. This makes the lifetime of these inner functions global, ie, they will exist as long as their new containing scope exists. These functions do not, however, get the new scope, so the variables they can see are exactly what they could see when they were defined. In the divide example, the inner function sees the variable a, therefore a will exist for as long as the inner function exists. When we create divide_2, the value of a is set to 2, and this is what the inner function (which is now stored in divide_2) sees. When we create divide_3, a new a is created, this time with value 3, and a new inner function is created (which is now stored in divide_3) and this function sees the new value of a. This is a completely new execution scope than when divide(2) was called. So getting back to the example my friend asked about, this is what happens:
  1. createAdder(2): At this point, the argument x is set to the value 2, and the inner function is returned and stored in add2. This function remembers the value of x and uses it when it has to be called.
  2. createAdder(5): At this point, the argument x is set to the value 5. Note that this is a new invocation of createAdder and does not share the same memory as the first invocation, so these are two completely different variables named x, both living in different scopes.
  3. add2(10): At this point, the first inner function is called with the argument 10, which is stored in y. This function remembers the value of x as 2 and computes 2 + 10 and returns its value
  4. add5(10): The second instance of the inner function is called with the argument 10, which is stored in y. This function remembers the value of x as 5 from when it was called, and computes 5 + 10 and returns its value
Now this whole explanation of closures would not be complete without one more subtle note that most people tend to forget about. The inner functions see the variables that were defined within its containing scope regardless of where in that containing scope they were defined or when their values were set. This means that if you change the value of a variable after defining an inner function, then the inner function will see the new value. Here's an example:
function foo(x) {
   var g=function(y) {
      return x+y;
   };
   x=x*x;
   return g;
}

var add_2 = foo(2);
var add_3 = foo(3);

alert(add_2(5));    // alerts 9
alert(add_3(5));    // alerts 14
Notice that the value of x was changed after the inner function g was defined, yet g sees the new value of x. This is particularly important when you use a loop control variable inside a closure. The closure will see the last value of the loop control variable and not a different value on each iteration.

Update: 2010-03-04 t3rmin4t0r has a much more useful example of currying on his blog.

Friday, February 19, 2010

Missing kids on your 404 page

It's been a long time since I last posted, and unfortunately I've been unable to churn out a post every week. The month of February has been filled with travel, so I haven't had much time to write.

My report on FOSDEM is up on the YDN blog, so I haven't been completely dormant. I also did some stuff at our internal hack day last week. This post is about one of my hacks.

The idea is quite simple. People land up on 404 pages all the time. 404 pages are pages that have either gone missing, or were never there to begin with. 404 is the HTTP error code for a missing resource. Most 404 pages are quite bland, simply stating that the requested resource was not found, and that's it. Back when I worked at NCST, I changed the default 404 page to use a local site search based on the requested URL. I used the namazu search engine since I was working on it at the time.

This time I decided to do something different. Instead of searching the local site for a missing resource, why not engage the user in trying to find missing kids.

I started with trying to find an API for missingkids.com and ended up finding missingkidsmap.com. This service takes the data from Missing Kids and puts it on a google map. The cool thing about the service was that it could return data as XML.

Looking through the source code, I found the data URL:
http://www.missingkidsmap.com/read.php?state=CA
The state code is a two letter code for states in the US and Canada. To get all kids, just pass in ZZ as the state code.

The data returned looks like this:
<locations>
   <maplocation zoom="5"
                state_long="-119.838867"
                state_lat="37.370157"/>
   <location id="1"
             firstname="Anastasia"
             lastname=" Shearer "
             picture="img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
             picture2="img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1140669e1.jpg"
             medpic = "img width=60 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
             smallpic="img width=30 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
             policenum="1-661-861-3110"
             policeadd="Kern County Sheriff\'s Office (California)"
             policenum2=""
             policeadd2=""
             st=" CA"
             city="BAKERSFIELD"
             missing="12/26/2009"
             status="Endangered Runaway"
             age="16"
             url="1140669"
             lat="35.3733333333333"
             lng="-119.017777777778"/>
   ...
</locations>

Now I could keep hitting this URL for every 404, but I didn't want to kill their servers, so I decided to pass the URL through YQL and let them cache the data. Of course, now that I was passing it through YQL, I could also do some data transformation and get it out as JSON instead of XML. I ended up with this YQL statement:
SELECT * From xml
 Where url='http://www.missingkidsmap.com/read.php?state=ZZ'
Pass that through the YQL console to get the URL you should use. The JSON I got back looked like this:
{
   "query":{
      "count":"1",
      "created":"2010-02-19T07:30:44Z",
      "lang":"en-US",
      "updated":"2010-02-19T07:30:44Z",
      "uri":"http://query.yahooapis.com/v1/yql?q=SELECT+*+From+xml%0A+Where+url%3D%27http%3A%2F%2Fwww.missingkidsmap.com%2Fread.php%3Fstate%3DZZ%27",
      "results":{
         "locations":{
            "maplocation":{
               "state_lat":"40.313043",
               "state_long":"-94.130859",
               "zoom":"4"
            },
            "location":[{
                  "age":"7",
                  "city":"OMAHA",
                  "firstname":"Christopher",
                  "id":"Szczepanik",
                  "lastname":"Szczepanik",
                  "lat":"41.2586111111111",
                  "lng":"-95.9375",
                  "medpic":"img width=60 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
                  "missing":"12/14/2009",
                  "picture":"img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
                  "picture2":"",
                  "policeadd":"Omaha Police Department (Nebraska)",
                  "policeadd2":"",
                  "policenum":"1-402-444-5600",
                  "policenum2":"",
                  "smallpic":"img width=30 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
                  "st":" NE",
                  "status":"Missing",
                  "url":"1141175"
               },
               ...
            ]
         }
      }
   }
}

Step 2 was to figure out whether the visitor was from the US and Canada, and if so, figure out which state they were from and pass that state code to the URL.

This is fairly easy to do at Yahoo!. Not so much on the outside, so I'm going to leave it to you to figure it out (and please let me know when you do).

In any case, my code looked like this:
$json = http_get($missing_kids_url);
$o = json_decode($json, 1);
$children = $o['query']['results']['locations']['location'];

$child = array_rand($children);

print_404($child);
http_get is a function I wrote that wraps around curl_multi to fetch and cache locally a URL. print_404 is the function that prints out the HTML for the 404 page using the $child data object. The object's structure is the same as each of the location elements in the JSON above. The important parts of print_404 are:
function print_404($child)
{
   $img = preg_replace('/.*src=(.*)/', '$1', $child["medpic"]);
   $name = $child["firstname"] . " " . $child["lastname"];
   $age = $child['age'];
   $since = strtotime(preg_replace('|(\d\d)/(\d\d)/(\d\d\d\d)|', '$3-$1-$2', $child['missing']));
   if($age == 0) {
      $age = ceil((time()-$since)/60/60/24/30);
      $age .= ' month';
   }
   else
      $age .= ' year';

   $city = $child['city'];
   $state = $child['st'];
   $status = $child['status'];
   $police = $child['policeadd'] . " at " . $child['policenum'];

   header('HTTP/1.0 404 Not Found');
?>
<html>
<head>
...
<p>
<strong>Sorry, the page you're trying to find is missing.</strong>
</p>
<p>
We may not be able to find the page, but perhaps you could help find this missing child:
</p>
<div style="text-align:center;">
<img style="width:320px; padding: 1em;" alt="<?php echo $name ?>" src="<?php echo $img ?>"><br>
<div style="text-align: left;">
<?php echo $age ?> old <?php echo $name ?>, from <?php echo "$city, $state" ?> missing since <?php echo strftime("%B %e, %Y", $since); ?>.<br>
<strong>Status:</strong> <?php echo $status ?>.<br>
<strong>If found, please contact</strong> <?php echo $police ?><br>
</div>
</div>
...
</body>
</html>
<?php
}
Add in your own CSS and page header, and you've got missing kids on your 404 page.

The last thing to do is to tell apache to use this script as your 404 handler. To do that, put the page (I call it 404.php) into your document root, and put this into your apache config (or in a .htaccess file):
ErrorDocument 404 /404.php
Restart apache and you're done.

Update: 2010-02-24 To see it in action, visit a missing page on my website. eg: http://bluesmoon.info/foobar.

Update 2: The code is now on github: http://github.com/bluesmoon/404kids

Update: 2010-02-25 Scott Hanselman has a Javascript implementation on his blog.

Update: 2010-03-28 There's now a drupal module for this.

Friday, January 29, 2010

Pretty print cron emails

We have a lot of scripts that run through cron and the output gets emailed by cron to whoever owns the cron job, or to a mailing list or something. Unfortunately, the headers for these mails are ugly and could use a lot of work. The primary problem I had with these mails was that the subject line could not be customised. Other smaller problems were that the From: address and Return-Path didn't have good values.

A sample header looks like this:
From: Cron Daemon <root@hostname.domain>
To: someone@somewhere.com
Subject: Cron <root@hostname> /usr/local/bin/foobar --whatever
Of the above, only the To: address makes any sense. The first thing I tried to change was the subject line. This is fairly easy using the mail program:
10 15 * * * /usr/local/bin/foobar --whatever | mail -s "Pretty subject" someone@somewhere.com
The mail now looks like this:
From: Cron Daemon <root@hostname.domain>
To: someone@somewhere.com
Subject: Pretty subject
This is a big step forward, because we can now filter emails based on subject. It still doesn't let us change the from address though. That's when I remembered using sendmail a long time ago to send mail through perl scripts, so I had a look through the sendmail man page, and found a few interesting flags:
-F full_name
       Set  the  sender full name. This is used only with messages that
       have no From: message header.

-f sender
       Set the envelope sender  address.  This  is  the  address  where
       delivery  problems  are  sent to, unless the message contains an
       Errors-To: message header.
This lets you set the From header, but there's no way to change the Subject. Since we can't use mail and sendmail together, we can either set the From address or the Subject, but not both. I then remembered the -t flag to sendmail that tells it to read the header from the message itself. The only thing left to do was to add the header into the output of every command. Since I couldn't easily do this in cron, I wrote my own script to do it. It's called pmail. The p is for pretty.

Get the script now

The meat of the script is this one line:
( echo -e "From: $from_name <$from>\r\nSubject: $subject\r\nTo: $to\r\n\r\n"; cat ) | /usr/sbin/sendmail -t
It writes the From and Subject headers, then leaves a blank line, and then passes its STDIN directly on to sendmail. The script has 3 optional parameters — the sender's address, sender's name and subject, and one mandatory parameter — the recipient's address(es).

To use it, set your cron line something like this:
* * * * * /usr/local/bin/foobar --whatever | \
   pmail -F "Philip Tellis" -f "philip@bluesmoon.info" -s "Pretty mail from foobar" someone@somewhere.com
Note: lines wrapped for readability.

You can also add the current date to the subject line like this:
... -s "Foobar mail on $( date +\%Y-\%m-\%d )" ...
Remember that you have to escape % signs in crontab lines otherwise cron will translate them to newline characters.

Umm, you also need to have sendmail set up correctly, and doing that is your problem :)

It's not complete yet, but it works for my use case. If you like it, use it. It's under the BSD license. I'll probably add a Reply-To option if it makes sense for me, and at some point add in data validation too, but not today.

Enjoy.

Tuesday, January 26, 2010

Speaking at FOSDEM 2010

I'll be speaking at FOSDEM 2010 in Brussels, Belgium on the 6th and 7th of February. I'll be speaking about the YUI Flot library. Registration is free, so if you're in the area, just show up. If anyone's interested in a performance BoF, let's do that as well.

I'm speaking at FOSDEM, the Free and Open Source Software Developers' European Meeting

Monday, January 25, 2010

Speaking at ConFoo

I'll be speaking at ConFoo in Montreal this March. I have two talks on performance and scaling. If you're in the area, drop in. The conference line-up is great, and if PHP Quebec last year was any indication, it should be very enlightening.

confoo.ca Web Techno Conference

Tuesday, January 19, 2010

Bandwidth test v1.2

After testing this out for about a week, I'm ready to release v1.2 of the bandwidth testing code.

Get the ZIP file or Try it online

Changes

The changes in this release are all statistical, and related to the data I've collected while running the test.
  1. Switch from the geometric mean to the arithmetic mean, because data for a single user is more or less centrally clustered.

    This is not true across users, but for readings for a single user (single connection actually), I've found that the data does not really have outliers.
  2. Instead of downloading all images on every run, use the first run as a pilot test, and based on the results from that run, only download the 3 largest images that can be downloaded for this user.

    This allows us to download fewer bytes, and get more consistent results across runs.
  3. Add random sampling. In previous versions, the test would be run for every page view. By adding random sampling, we now only run it on a percentage of page views.

    You can now set PERFORMANCE.BWTest.sample to a number between 0 and 100, and that's the percentage of page views that will be tested for bandwidth. Note that this is not an integer, so you can set it to 0.1 or 10.25, or anything as long as it's between 0 and 100 (both inclusive). Setting it to 0 means that the test isn't run for anyone, so you probably don't want that, and 100 means that the test is run for everybody. This is also the default.
  4. Fire an onload event when the script completes loading. This is mainly so that you can asynchronously load the script so that it does not affect your page load time. Instead of checking for script.onload or onreadystatechange, you can just implement the PERFORMANCE.BWTest.onload method.
You can see the entire history on github.

...===...