The other side of the moon: November 2009

Tuesday, November 24, 2009

Measuring a user's bandwidth

In my last post about performance, I spoke about measurement. Over the last few days I've been looking at bandwidth measurement. These ideas have been floating around for years and we've tested some before at Yahoo!, but I wanted to try a few new things.

Try it out now.

The concept is actually quite simple.

Try to download multiple images with progressively increasing sizes
Set a reasonable timeout for the images to download
Stop at the first one that times out - that means that we have enough data to make an estimation.
Calculate the bandwidth by dividing each image's size by the time it took to download.

I run this test a few times, and then run some statistical analysis on the data gathered. The analysis is pretty basic. I first pull out the geometric mean of the data, then sort the data, run IQR filtering on it, and then pull out the median point. I use the geometric mean as well as the post IQR filtered median because I'm not sure at this point which is more resilient to temporary changes in network performance. This data is then stored in a database along with the user's IP address and the current timestamp.

I also try to measure latency. This is not network latency, but server latency from the user's point of view, ie, how long does it take between request and first byte of response. I run this test multiple times and do the same kind of stats on this data.

The goal of this test

The few people I've shown this to all had the same question. What's the goal of this test? There are already several free bandwidth testers available that one can use to determine ones bandwidth, so what does this do differently.

The way I see it, as a site owner, I don't really care about the bandwidth that my users have with their ISPs - unless of course, I have my servers in the ISP's data centre. I really care about the bandwidth that user's experience when visiting my website. This test aims to measure that. Ideally, this piece of code can be put into any web page to measure the user's bandwidth in the background while he's interacting with your site. I don't know how it will work in practice though.

Insights from the data

I don't really know. It could be useful to figure out what users from different geographical locations experience. Same with ISPs. It might also just tell me that dreamhost is a really bad hosting provider.

Data consistency

In my repeated tests, I've found that the data isn't really consistent. It's not all over the place, but it fluctuates a fair bit. I've seen different levels of consistency when using the geometric mean and the median, but I don't think I have enough data yet to decide which is more stable. This could mean that my server just responds differently to multiple requests or it could mean many other things. I don't really know, but feel free to leave a comment if you do.

Credits

I don't know who first came up with the idea of downloading multiple images to test bandwdith, but it wasn't my idea. The latency test idea came from Tahir Hashmi and some insights came from Stoyan Stefanov.

Once again, here's the link.

Short URL: http://tr.im/bwmeasure

Monday, November 23, 2009

Storing IP addresses in a MySQL data table

For a lot of log processing, I need to store IP addresses in a database table. The standard process was always to convert it to an unsigned int in perl or php and then insert it. Today I discovered an easier way. MySQL's INET_ATON function. It takes an address in dotted quad format and converts it into an INT. So, all you have to do is this:

INSERT INTO table (ip) VALUES (INET_ATON('$ip_address'));

And done.

Sunday, November 22, 2009

Being a geek

Back in 2004, I did a talk at Linux Bangalore titled Being a geek. It was quite popular at the time. The number of people in the room far exceeded the limits set by fire safety regulations. I then repeated the talk at Freedel in an impromptu session in the corridor with my audience sitting on the floor and on tables around me. It was somehow exactly the way I think some conference tracks should be.

Anyway, a couple of nights ago, I finally converted my slides to PDF by rewriting them in LaTeX and running pdflatex on it. The results are here:

It hit the front page of slideshare the night I posted, so chances are that it's still interesting to someone.

Monday, November 09, 2009

Template update

Apologies if you read my blog in a feed reader and just got swamped by a whole bunch of updates. I just redid this blog's template to match the rest of my website and in the process also went back and cleaned up the markup on some old posts. I can't say that this won't happen again, but any more changes at this time will be to the CSS or the template only and should not affect the feed.

Thanks for reading.

Saturday, November 07, 2009

Favicons on my planet's blogroll

Update: I noticed that some feeds weren't showing favicons even though their sites had them, and it turned out to be because the entire feed was a single line which didn't work with sed. I've changed to use perl instead.

Early last week, Chris Shiflett tweeted about adding favicons to a planet's blogroll for sites that have them. Now I'd considered setting up PlanetPlanet in the past, but had never gotten down to it. Since I was already in the middle of a site redesign, I figured it was a good time to start.

Setting up planet bluesmoon was fairly straighforward. I just followed the instructions in the INSTALL file. I was also very pleased to see that it uses the python implementation of HTML::Template because I'm the author of the Java implementation (Also the last Java project I worked on) and am very familiar with the syntax and tricks of the trade.

Once set up, I went back to Chris' site since he'd also mentioned that he'd be posting his favicon code on github. Unfortunately, at this time, the only thing there is the README file, and well, patience is not one of my virtues, so I decided to write my own.

One advantage that I did have though, was Chris' tweets about the process which made a note of all the problems he ran into.

I ended up with this shell script that does a fairly good job, and can be run through cron (although I don't do that). It's made to specifically work with planet's config.ini file, and edits the file in-place to add the icon code. This is how it works

Translating feed URL to favicon URL

This code pulls out all feed URLs from the file. I'm assuming here that they're all http(s) URLs.

sed -ne '/\[http/{s/[][]//g;p;}' $file

For each URL returned, I run this code which pulls down the feed using curl, and then uses perl to extract the home site's URL. I then check for the link in the feed assuming the feed is in the RSS2.0 or Atom 1.0 formats. I could have looked at the content-type header and figured out which it was, but as Chris pointed out, content-type headers are often wrong. The perl code first splits the feed into multiple lines to make it easier to parse.

curl -m 10 -L $feedurl 2>/dev/null | \
    perl -ne "
        s/></>\\n</g;
        for (split/\\n/) {
            print \"\$1\\n\" and exit
                if /<link/ && 
                    (/<link>(.*?)<\\/link>/ ||
                        (/text\\/html/ && /alternate/ && /href=['\"](.*?)['\"]/)
                    );
        }
    "

I then pull out the domain from the site's URL. I'll need this if the link to the favicon is a relative URL. Again, I'm assuming http(s) and being a little liberal with my regexes to work in all versions of sed.

domain=`echo $url | sed -e 's,\(https*://[^/]*/\).*,\1,'`
base=${url%/*}

Then download the site page and look for a favicon link in there. Favicons are found in link tags with a rel attribute of icon or shortcut icon, so I check for both, again being liberal with my regexes, and when I find it, extract the value of the href attribute. This will break if there are multiple link tags on the same line, but I'll deal with that when I see it.

favicon=$( curl -m 10 -L "$url" 2>/dev/null | \
    perl -ne "
        print \"\$1\\n\" and exit
            if /<link/ && 
               /rel=['\"](?:shortcut )?icon['\"]/ && 
               /href=['\"](.*?)['\"]/;
    " )

If no URL was found, I just appended /favicon.ico to $domain and used that instead. If a relative URL was found, I appended it to either $base or $domain depending on whether the path starts with / or not. This will have trouble if your site URL points to a directory but omits the trailing slash, but shame on you if you do that.

Validating the favicon

Now once I had the URL, I still had to validate if a favicon existed at that location. This was done easily using curl with the -f flag which tells it to fail on error. It returns an error code of 22 for a file not found. The problem I faced here is that some sites don't actually return a 404 for missing resources. That was a WTF moment. So I figured I'd just look for the content-type of the returned resource, and if it did not match image/*, then I'd discard it. However, from Chris's tweets, I already knew that some sites send a favicon with a content type of text/plain or text/html, so I couldn't rely solely on this. Instead, I decided to download the favicons, and if its content-type did not match the image/* pattern, I run the file command on them. This command looks up the file's magic numbers and figures out it's content type. The result was this code:

name=`echo $domain | sed -e 's,/,-,g'`.ico
params=`curl -L -f -w "%{content_type}\t%{size_download}" -o "icons/$name" "$favicon" 2>/dev/null`
[ $? -ne 0 ] && continue                                  # skip if curl was unsuccessful
ctype=${params% *}
clen=${params#* }
[ $clen -eq 0 ]  && continue                              # skip if favicon was 0 bytes
if ! echo $ctype | grep -q "^image/" &>/dev/null; then
    if file -b "icons/$name" | grep '\<text\>' &>/dev/null; then
        continue;                                         # skip if content type is not image/*
    fi
fi

rm "icons/$name"

Write it back

Now that I knew the correct URL for a site's favicon, I could write this information back to the config.ini file. I decided to use perl for this line (though I could have used perl for the whole script). It reads the file in by paragraph, and if a paragraph matches the feed URL, it first strips out the old favicon line, and then adds the new one in. Since this code only runs if we actually find a favicon, it has the side effect of not updating a favicon that was once valid but now isn't.

perl -pi -e "BEGIN {\$/='';} if(m{^\[$feedurl\]}) { s{^icon =.*$}{}m; s{\n\n$}{\nicon = $favicon\n\n}; }" $file

The perl code also assumes a very specific format for the config.ini file. Specifically, everything about a feed must be together with no blank lines in between them, and there needs to be at least one blank line between feed sections. Not hard to maintain this, but it's not a restriction that planet imposes itself.

Adding favicons to the template

Lastly, I needed to add these favicons to the template. Inside the Channels loop, we add this code:

<img src="<TMPL_IF icon><TMPL_VAR icon><TMPL_ELSE>/feed-icon-14x14.png</TMPL_IF>"
     alt="" class="favicon">

The code can go anywhere as long as it's inside the Channels loop. To use it in the Items loop, the variable name should be changed to channel_icon instead.

Et voilà site favicons on a planet. Now I've just got to get a better generic image for the no favicon state since they aren't technically links to feeds.

Update: I'm now using an icon from stdicon.com for the generic favicon.

Performance BoF at FOSS.IN

FOSS.IN runs in Bangalore from the 1st to the 5th of December this year. During the conference, I'll be organising a BoF meet on performance titled Websites on Speed.

In this BoF we'll each bring up a ideas and research that we've been playing with. It's expected to be fairly technical, but how detailed we get depends on what people are interested in. We'll try and cover all layers of the stack that contribute to performance problems, and get into depth on one or two areas chosen on the spot.

This isn't limited to frontend performance.

There's a lot of experimentation, tweaking and understanding of the system involved in web performance, so let's find out what's state of the art today.

Sunday, November 01, 2009

Performance measurement

In my last post, I mentioned the factors that affect web performance. Now that we know what we need to measure, we come to the harder problem of figuring out how to measure each of them. There are different methods depending on how much control you have over the system and the environment it runs in. Additionally, measuring performance in a test setup may not show you what real users experience, however it does give you a good baseline to compare subsequent tests against.

Web, application and database servers

Back end servers are the easiest to measure because we generally have full control over the system and the environment it runs in. The set up is also largely the same in a test and production environment, and by replaying HTTP logs, it's possible to simulate real user interactions with the server.

Some of the tools one can use to measure server performance are:

ab - Apache Benchmark. Despite its name, it can be used to test any kind of HTTP server and not just apache. Nixcraft has a good tutorial on using ab.
httperf from HP labs is also a good tool to generate HTTP load on a server. There's an article on Techrepublic about using it. I prefer httperf because it can be configured to simulate real user load
Log replaying is a good way to simulate real-user load, and a few people have developed scripts to replay an apache log file. The first one uses httperf under the hood.
To measure database performance, we could either put profiling code into our application itself, and measure how long it takes for our queries to return under real load conditions, or run benchmarks with the actual queries that we use. For mysql, the mysql benchmarking suite is useful.
MySQL Tuner is another tool that can tell you how your live production server has been performing though it doesn't give you numbers to quantify perceived performance. I find it useful to tell me if my server needs retuning or not.

The above methods can also be used to measure the performance of remote web service calls, though you may want to talk to your remote web service provider before doing that.

I won't write any more about these because there are a lot of articles about server side performance measurement on the web.

DNS, CDNs and ISP networks

Measuring the performance of DNS, CDNs and your user's ISP network is much harder because you have control over neither the systems nor the environment. Now I mentioned earlier that DNS is something you can control. I was referring to your own DNS set up, ie, the hostnames you have and how they're set up. This is not something we need to measure since no user will use your DNS server. All users use their ISP's DNS server or something like OpenDNS and it's the performance of these servers that we care about.

DNS

DNS is the hardest of the lot since the only way to measure it is to actually put a client application on your users' machines and have that do the measurement. Unless you have really friendly users, this isn't possible. It is an important measurement though. A paper on DNS Performance [Jung et al., 2002] shows that around 20% of all DNS requests fail. This in turn adds to the overall perceived latency of a website. In the absence of an easy way to measure this performance from within a web page, we'll try and figure it out as a side-effect of other measurements.

One possible method is to request the same resource from a host, the first time using the hostname and the second time using its IP address. The difference should give you the DNS lookup time. The problem with this is that it sort of breaks DNS rotations where you may have multiple physical hosts behind a single hostname. It's even worse with a CDN because the hostname may map onto a server that's geographically closer to the user than the IP address you use. In short, you'd better know what you're doing if you try this.

ISP bandwidth

With ISP networks, the number we really care about is the user's effective bandwidth, and it isn't hard to measure this. We use the following procedure:

Place resources of known fixed sizes on a CDN
Make sure these resources are served with no-cache headers
Using javascript, download these resources from the client machine and measure the time it takes
Discard the first resource since it also pays the price of a DNS lookup and TCP slowstart
Use resources of different sizes to handle very slow and very fast connections.

The number we get will be affected by other things the user is using the network for. For example, if they're streaming video at the same time, then bandwidth measured will be lower than it should be, but we take what we can get.

CDNs

Now to measure bandwidth, we need to get that resource relatively close to the user so that the bandwidth of the whole internet doesn't affect it. That's where CDNs come in, and measuring a CDN's performance is somewhat similar.

We could always use a tool like Gomez or Keynote to do this measurement for you, or you can hack up a solution yourself in Javascript. You need to figure out three things:

The IP of the CDN closest to the user
The user's geo-location which you can figure out from their IP address
The time it takes to download a resource of known size from this CDN

It's that first one that's a toughie, but the simplest way to figure it out is to just ask your CDN provider. Most CDNs also provide you with their own performance reports.

Page content and user interaction

YSlow, Show Slow, Page Speed and Wep Page Test are good tools for measuring and analysing the performance of your page content. They can measure and analyse your page from your development environment and suggest improvements. They do not, however, measure real user perceived performance, however this is something we can do with Javascript.

We primarily need to measure the time it takes to download a page and all its components. Additionally we may want to time how long certain user interactions with the page took. All of these can be accomplished by reading the Date() object in javascript at the correct start and end times. What those start and end times are depend on your application, but we'll look at one possible implementation in a later post. Once you have the timing information that you need, it can be sent back to your server using a javascript beacon. We'll go into more detail about this as well in a later post.

This post has already run longer than I'd hoped for, so I'm going to stop here and will continue next time.

About web performance

I currently work with the performance team at Yahoo!. This is the team that did the research behind our performance best practices and built YSlow. Most of our past members write and speak about performance, and while I've done a few talks, I've never actually written a public post about web performance. I'm going to try and change that today.

Note, however, that this blog is about many technical topics that interest me and web performance is just a part of that.

I'm never sure how to start a new series, especially one that's been spoken about by others, but since these blog posts also serve as a script for the talks that I do, I thought I'd start with the last performance talk that I did.

Improving a website's performance starts with measuring its current performance. We need a baseline measurement that will help us determine if the changes we make cause an improvement or a regression in performance. Before we start with measurement, however, we need to know what to measure, and for that we need to look at all the factors that contribute to the time it takes for a website to get to the user.

User perceived web app time is spent in looking up stuff, building stuff, downloading stuff, rendering stuff and interacting with stuff.

It's this perceived time that we need to reduce, and consequently measure. All of the above fall into two basic categories:

Infrastructure
Content structure

Each of these in turn is made up of components that we as developers can control, and those that we cannot. We'd like to be able to measure everything and fix whatever we have control over. I've split the components that I see into this table so we know what can be looked at and who should do the looking.

	Infrastructure	Content
Can control	Web server & App server Database server Web service calls CDNs DNS	HTTP headers HTML Images, Flash CSS Javascript Fonts
Cannot control	ISP's DNS servers ISP's network User's bandwidth User's browser & plugins Other apps using the user's network The internet	Advertisements Third party content included as badges/feeds Third party sites that link to your page

If you have more items to add to this table, leave a comment and I'll add it in.

This is where we can jump to Yahoo!'s performance rules. At the time of this post, there are 34 of them divided into 7 categories. I'll go into more details and refer to these rules in later posts. That's all for this introductory post though.

The other side of the moon