The other side of the moon

/bb|[^b]{2}/
Never stop Grokking

Wednesday, December 23, 2009

Where do your site's visitors come from

This hack, which I demoed during my talk at FOSS.IN, was largely written by the audience. I typed it out on screen, but the audience was busy telling me what I should type. The primary requirement was to find out where the visitors to your website came from. The secondary requirement was that, starting from scratch, we had to come up with a solution in five minutes. Given the medium of discussion (a large auditorium with a few hundred people) and the number of wisecracks, we probably went over the 5 minute limit, but that's okay.

So, we start by looking through the web access log to find out what it looks like. Remember, we're starting from scratch, so we have no idea how to solve the problem yet.
head -1 access.log

65.55.207.47 - - [30/Nov/2009:00:39:50 -0800] "GET /robots.txt HTTP/1.1" 200 297 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

This tells us that the IP address is the first field in the log, and that's probably the best indicator of who a user is. We now use cut to pull out only the first field.
head -1 access.log | cut -f1 -d' '

65.55.207.47

Now I don't really care about the IP itself, but the subnet, so I'll just pull out the first three parts of the IP address (I think Tejas came up with this):
head -1 access.log | cut -f1-3 -d.

65.55.207

And before anyone tells me that the -1 usage of head is deprecated, I know, but it's a hack. Now, I want to do this for my entire log file (or a large enough section of it), and I want to know how many hits I get from each subnet. The audience came up with using sort and uniq to do this:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10

141 216.113.168
106 88.211.24
80 91.8.88
79 78.31.47
69 199.125.14
64 216.145.54
62 173.50.252
58 193.82.19
57 82.69.13
56 198.163.150

Now, I don't know about you, but I can't just look at an IP address and tell where it's from. I need something in English. The audience came up with whois to do this, but before we could use it, we had to figure out how. We ran it on the first IP address up there:
whois 216.113.168.0

OrgName: eBay, Inc
OrgID: EBAY
Address: 2145 Hamilton Ave
City: San Jose
StateProv: CA
PostalCode: 95008
Country: US

NetRange: 216.113.160.0 - 216.113.191.255
CIDR: 216.113.160.0/19
NetName: EBAY-QA-IT-1
NetHandle: NET-216-113-160-0-1
Parent: NET-216-0-0-0-0
NetType: Direct Assignment
NameServer: SJC-DNS1.EBAYDNS.COM
NameServer: SMF-DNS1.EBAYDNS.COM
NameServer: SJC-DNS2.EBAYDNS.COM
Comment:
RegDate: 2003-05-09
Updated: 2003-10-17

OrgTechHandle: EBAYN-ARIN
OrgTechName: eBay Network
OrgTechPhone: +1-408-376-7400
OrgTechEmail: network@ebay.com

# ARIN WHOIS database, last updated 2009-12-22 20:00
# Enter ? for additional hints on searching ARIN's WHOIS database.

We only care about the OrgName parameter, so we could just grep that out. Since I also wanted to strip out "OrgName:", I used sed instead:
whois 216.113.168.0 | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'

eBay, Inc

This gives me what I want, but how do I pass the output of the earlier pipeline to this one? Most people suggested I use xargs, but that would either pass the count as well, or lose the count completely. I wanted both. Gabin suggested that I use read in a loop:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
while read count net; do \
whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'; \
done

eBay, Inc
RIPE Network Coordination Centre


I've only limited it to 2 entries this time so that the test doesn't take too long.

Finally, in order to print out the count before the network owner, I pipe the output to awk. Most people suggested I just use echo, but I prefer numbers formatted right aligned the way printf does it:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
while read count net; do \
whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, \"$net\", \$0);}"; \
done

141    216.113.168.x   eBay, Inc
106    88.211.24.x     RIPE Network Coordination Centre

Note that we use double quotes for awk, and escape a bunch of things inside. This is so that the shell expands $count and $net inside the awk script. $net has to be wrapped in escaped quotes so that awk sees a string; left bare, awk reads 216.113.168 as the two numbers 216.113 and .168 and concatenates them into 216.1130.168. We can also accomplish this using the -v option to awk, but no one came up with it at the time.

Finally, run this on the top 10 IP blocks, and we get:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10 | \
while read count net; do \
whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, \"$net\", \$0);}"; \
done

141    216.113.168.x   eBay, Inc
106    88.211.24.x     RIPE Network Coordination Centre
80    91.8.88.x       RIPE Network Coordination Centre
79    78.31.47.x      RIPE Network Coordination Centre
69    199.125.14.x    InfoUSA
64    216.145.54.x    Yahoo! Inc.
62    173.50.252.x    Verizon Internet Services Inc.
58    193.82.19.x     RIPE Network Coordination Centre
57    82.69.13.x      RIPE Network Coordination Centre
56    198.163.150.x   Red River College of Applied Arts, Science and Technology

That's it. The hack requires network access for whois to work, and may be slow depending on how long whois lookups take for you. It also doesn't care about the class of the IP block, and just assumes that everything is class C, but it works well enough.
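
For anyone who would rather do the counting step outside the shell, here is a rough JavaScript sketch of what cut, sort, uniq -c, sort -nr and head do together. It is not what we typed at the talk. The function name topSubnets is made up for illustration, the sample lines are invented apart from the IP of the first one, and it assumes every line starts with an IPv4 address. The whois lookup is left out.

```javascript
// Count hits per /24 subnet from access-log lines (first field = client IP).
// topSubnets is a hypothetical helper; it assumes IPv4 addresses only.
function topSubnets(lines, limit) {
    var counts = new Map();
    lines.forEach(function (line) {
        // Capture the first three octets, like `cut -f1-3 -d.`
        var match = /^(\d+\.\d+\.\d+)\.\d+\s/.exec(line);
        if (!match) {
            return;                          // skip blank or non-IPv4 lines
        }
        counts.set(match[1], (counts.get(match[1]) || 0) + 1);
    });
    return Array.from(counts.entries())
        .sort(function (a, b) { return b[1] - a[1]; })   // like `sort -nr`
        .slice(0, limit)                                  // like `head`
        .map(function (entry) {
            // The subnet stays a string, so it is never parsed as a number.
            return { net: entry[0], count: entry[1] };
        });
}

// Illustrative sample data, not real log entries.
var sampleLog = [
    '65.55.207.47 - - [30/Nov/2009:00:39:50 -0800] "GET /robots.txt HTTP/1.1" 200 297',
    '65.55.207.12 - - [30/Nov/2009:00:40:02 -0800] "GET / HTTP/1.1" 200 5120',
    '216.113.168.9 - - [30/Nov/2009:00:41:17 -0800] "GET / HTTP/1.1" 200 5120'
];
console.log(topSubnets(sampleLog, 10));
```

In Node you could feed it a real log by reading the file and splitting on newlines.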

I also have no idea why my site is so popular with eBay.

Monday, December 21, 2009

BBC headlines as Flickr Photos (with YQL)

One of the examples I showed during my keynote at FOSS.IN was a 10 minute hack to display the BBC's top news headlines as photos. In the interest of time, I wrote half of it in advance and only demoed the YQL section. In this post, I'll document everything. Apologies for the delay in getting this out, I've fallen behind on my writing with an injured left wrist.

BBC

To start with, we get the URL of the headlines feed. We can get this from the BBC news page at news.bbc.co.uk. The feed URL we need is:
http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml


YQL

Now to use this in YQL, we go to the YQL console at http://developer.yahoo.com/yql/console/ and enter the query into the text box:
SELECT * From rss Where url='http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml'

Click the TEST button, and you get your results in the results view. Not terribly useful yet, since you'd get that from the RSS anyway, but since all we're interested in is the title, we change the * to title:
SELECT title From rss Where url='http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml'

This gives us a more readable list with less extra information.

Flickr

For step 2, we first figure out how to use the flickr API. This is much easier because it's listed on the right in the Data Tables section under the flickr subsection. The table we want is flickr.photos.search. Click on this, and the query shows up in the YQL console:
SELECT * From flickr.photos.search Where has_geo="true" and text="san francisco"

For our particular example, we don't really care if the photos have geo information or not, so we can drop the has_geo="true" part, or leave it in if you like. The predefined search only looks for a single term, but we're looking for multiple terms, so we'll change the query from text="san francisco" to text IN ("san francisco", "new york", "paris", "munich") and see what we get:
SELECT * From flickr.photos.search Where text IN ("san francisco", "new york", "paris", "munich")

Notice the diagnostics section of the XML results. This tells us that YQL made 4 API calls to flickr and asked for 10 results from each. Since we only care about one result per term, we tell YQL to only request a single result from each call to the flickr API:
SELECT * From flickr.photos.search(1) Where text IN ("san francisco", "new york", "paris", "munich")

Note the parenthesised 1 after the table name. We now get only one result for each of the cities we've specified. (if you're copying and pasting this into the console, make sure you remove the <em> tag)

Mashup

Now to put it all together, we use the SQL like sub-select. This allows us to pass the result set of the first query in as an operand in the second query:
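SELECT * From flickr.photos.search(1) Where text IN
(SELECT title From rss Where url='http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml')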

This gets us at most one photo result for each headline in the RSS feed. It's possible that some headlines return 0 results, and that can't be helped.

REST

So how do we use this in our code? YQL gives us a nice REST API to access the data. The URL you need is listed to the right in the box titled The REST query. Now personally, I prefer JSONP, because it's easier to use it from Javascript, so I switch the view to JSON and enter the name of a callback function showphotos. Click TEST again, and then copy the URL out of REST box. It should be something like this (I've added line breaks for readability):
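http://query.yahooapis.com/v1/public/yql?
q=
SELECT%20*%20From%20flickr.photos.search(1)%20
Where%20text%20IN%20
(SELECT%20title%20From%20rss%20
Where%20url%3D
'http%3A%2F%2Fnewsrss.bbc.co.uk%2Frss%2Fnewsonline_world_edition%2Ffront_page%2Frss.xml'
)
&format=json
&callback=showphotos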
We can use this in our HTML file by setting it as the src attribute of a <script> tag:
<script src="http://query.yahooapis.com/v1/public/yql?.....&format=json&callback=showphotos">
</script>

Before we can do that though, we need to define the showphotos function and the HTML that will hold the photos. This is what it looks like:
<script type="text/javascript">
function showphotos(o)
{
    var ul = document.createElement('ul');
    for(var i=0; i<o.query.results.photo.length; i++) {
        var photo = o.query.results.photo[i];
        // URL of the 75x75 square thumbnail for this photo
        var url = 'http://farm' + photo.farm + '.static.flickr.com/'
            + photo.server + '/'
            + photo.id + '_' + photo.secret + '_s.jpg';

        var li = document.createElement('li');
        var img = document.createElement('img');
        img.src=url;
        img.alt=photo.title;
        img.height=img.width=75;

        li.appendChild(img);
        ul.appendChild(li);
    }

    // Replace whatever is on the page with the list of photos
    document.body.innerHTML = '';
    document.body.appendChild(ul);
}
</script>

It creates a LI for each photo, puts all the LIs into a UL, and puts the UL into the document. You can style it any way you like.
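
One thing to watch for is the shape of the response when there are very few photos. The sketch below pulls the URL building out into a function that can be run without a browser and guards against two cases: no results at all, and a single photo arriving as an object instead of an array. The second case is an assumption about how YQL serialises one-element results, so check it against a real response. photoList and the sample response are made up for illustration.

```javascript
// Turn a YQL flickr.photos.search response into a list of {title, url} objects.
// photoList is a hypothetical helper written for this example.
function photoList(o) {
    var results = o && o.query && o.query.results;
    if (!results || !results.photo) {
        return [];                       // no headline produced a photo
    }
    // Assumption: a single match may arrive as an object rather than an array.
    var photos = Array.isArray(results.photo) ? results.photo : [results.photo];
    return photos.map(function (photo) {
        return {
            title: photo.title,
            url: 'http://farm' + photo.farm + '.static.flickr.com/'
                + photo.server + '/'
                + photo.id + '_' + photo.secret + '_s.jpg'
        };
    });
}

// Invented sample response with the fields showphotos relies on.
var sampleResponse = {
    query: { results: { photo: {
        farm: '3', server: '2345', id: '123456', secret: 'abcdef', title: 'Example'
    } } }
};
console.log(photoList(sampleResponse));
```

showphotos could then loop over photoList(o) instead of reaching into o.query.results.photo directly.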

Have a look at the full code in action. You could modify it to draw a badge, to update itself every 10 minutes, to link back to the original photos, or anything else. If the photos have geo information, you could display them on a map as well.

Improvements

It would be really cool if we could print out the News headline along with the photo. Perhaps even pull out the geo location of the story using the placemaker API and plot that on a map. However, this is the limit of this hack. It's taken me more time to write this post than it did to write the hack.

Short URL: http://tr.im/bbcflickr

Thursday, December 10, 2009

The keynote incident

Act I, Scene I - The News

I was sitting at La Luna Cafe in Cambridge sipping on Apple Cider, hacking random stuff and waiting for S to be done with work so we could head to the airport. That's when Atul pinged me and asked if it was okay to change the time slot of my talk. I said, yeah, no problem, just let me know the new time. He says, "6pm, keynote slot". I wasn't sure how to react. I told him that I'd never given a keynote before, and in parallel, my brain was telling me that this was true at some point for every keynote speaker ever.

I wasn't sure what to do apart from panicking. Did I have to approach the talk differently, did I have to put more into it? Did I need to speak for more than the 30 minutes that I'd already planned on? Did I need more than 5 slides? I write this now as a reminder and in the hope that it helps someone else.

Act II, Scene I - Bangalore

I got into Bangalore early on the morning of December 2nd and checked into the hotel. I'd slept well on the plane, so a few hours more in the hotel left me ready for the conference. My first task was to get myself a local phone number and then find the conference venue. About an hour later I was standing in the NIMHANS hospital wondering where the convention centre was. A security guard pointed me in the right direction, where I saw familiar faces at the door. Tejas handed me my speaker badge as I got in.

Act II, Scene II - Interviewing

At this point, I still had the structure of the talk I'd planned to do before Atul gave me the news. Part of it involved speaking to hackers at FOSS.IN about their experiences, so I got started with that as soon as I got in. I spoke to Pradeepto, Tarique, Tejas, Kartik, James Morris, Siddhesh, Vinayak, Anant and many others about their experiences with hacking. I was a little relieved to know that their ideas always matched mine. At this point I was starting to lose some of the doubt I had.

I spent some time attending talks and checking out the workouts in the hacker area. This was the same old conference I knew.

Act II, Scene III - The first Keynote

Harald did the first keynote I attended, and the few things that struck me were that he'd used LaTeX (beamer with the Warsaw theme), he had a lot of content on his slides, and he really knew what he was talking about. I sat up in the balcony to take notes. It only served to scare me.

Act III, Scene I - Panic

I'll jump now to the next keynote by Milosch Meriac. Milosch is a hardware hacker, and like in the last keynote, a few things struck me. He used LaTeX with the Warsaw theme for beamer, he had a lot of content on his slides, and he really really knew his stuff. His talk awed the audience and left them wanting more, which resulted in a follow-up workout on hardware hacking. It was turning out to be impossible to match the quality of the keynotes already delivered, and I had less than 24 hours left to get my act together.

Act IV, Scene I - There is no spoon

I was in early on the 4th. I sat through several talks, all of which were excellent. I thought back on the talks that I'd attended on the previous two days and it hit me that any one of these people could have done the same talk at 6pm and would have been great keynote speakers. Many of them also used LaTeX. I sort of joked on twitter that maybe if I used LaTeX, then I'd have a good talk as well. Then @artagnon replied saying that what I said was more important than my slides. He also helped out with some LaTeX formatting. I'd decided to use LaTeX for my presentation, not because the other presenters did, but because I didn't know LaTeX, and this conference seemed like a good excuse to learn it.

Act IV, Scene II - The photography BoF

At 5pm, Kalyan Varma and James Morris had a photography BoF in the largely unused speaker area. The space was mostly dark for best effect of the photos they were showing off. I sat in on it, and threw out everything I'd planned on talking about. Taking a cue from all the speakers, I decided to talk about what I knew best. My slides were unimportant. I had a few points that I wanted to cover, and made a note of those lest I forget (which I often did) and I had a lot of code that I wanted to show off, some of which I hadn't tried before, but I had a few hundred hackers in the room to help me with. As mentioned in the talk abstract, this talk was meant to be a hack, and that's the attitude I decided to go in with.

Act IV, Scene III - Shining lights

As I stood on the stage and tried to plug the VGA cable into my laptop, my hands were shaking. Not sure if anyone in the audience noticed though. I was nervous. Should I wear my hat and shield my eyes from the light or take off the hat and let my forehead shine? I went without the hat. It didn't matter. The talk had its ups and downs. I missed a few points that I thought I should cover. I lost track of where I was in the little list of points I'd made, and I stopped blank a few times. I also forgot to give away the tshirts that I had. Where it went the smoothest though, was when I was doing what I like best - hacking. Whenever there was code to demo, I was excited and I had real time feedback from everyone in the audience.

Act V, Endgame

I didn't get a chance to read the comments on twitter until late that night. They were mostly good, and the few that were critical were justified. The next day many people came up to me and told me that they enjoyed the talk hack. If they all end up hackers, that's what will be the real win.

Now it may seem from what I've said that the talk was largely spontaneous. The fact is that I wrote down a script a few times and threw it away. I rehearsed by recording myself speak and was aghast with the playback. I was forcing myself to change my talk simply because it had a new label. That was the only time I faltered, and it led me down the wrong path. A lot of people helped me get back on track, so in honesty, I only presented the keynote. It was hacked up by many many people at FOSS.IN.

For those who are interested, my slides are up on slideshare, but if you've been paying attention, you don't really need them.

Short URL: http://tr.im/fossdotinkeynote

Tuesday, November 24, 2009

Measuring a user's bandwidth

In my last post about performance, I spoke about measurement. Over the last few days I've been looking at bandwidth measurement. These ideas have been floating around for years and we've tested some before at Yahoo!, but I wanted to try a few new things.

Try it out now.

The concept is actually quite simple.
3. Stop at the first one that times out - that means that we have enough data to make an estimation.
4. Calculate the bandwidth by dividing each image's size by the time it took to download.
I run this test a few times, and then run some statistical analysis on the data gathered. The analysis is pretty basic. I first pull out the geometric mean of the data, then sort the data, run IQR filtering on it, and then pull out the median point. I use the geometric mean as well as the post IQR filtered median because I'm not sure at this point which is more resilient to temporary changes in network performance. This data is then stored in a database along with the user's IP address and the current timestamp.
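
To make the analysis concrete, here is a rough sketch of those statistics in JavaScript. It isn't the code behind the test page. The function names are made up, the quartiles are taken as the medians of the lower and upper halves of the sorted data, and the 1.5 x IQR fence is the usual convention rather than anything specific to this test.

```javascript
// Bandwidth of one download in bytes per second.
function bytesPerSecond(sizeBytes, elapsedMs) {
    return sizeBytes * 1000 / elapsedMs;
}

// Geometric mean, computed via logs to avoid overflowing on large products.
function geometricMean(values) {
    var logSum = values.reduce(function (sum, v) { return sum + Math.log(v); }, 0);
    return Math.exp(logSum / values.length);
}

// Median of an array that is already sorted in ascending order.
function median(sorted) {
    var mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Sort the data and drop points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
// Assumption: quartiles are the medians of the lower and upper halves.
function iqrFilter(values) {
    var sorted = values.slice().sort(function (a, b) { return a - b; });
    if (sorted.length < 4) {
        return sorted;                   // too few points to filter sensibly
    }
    var half = Math.floor(sorted.length / 2);
    var q1 = median(sorted.slice(0, half));
    var q3 = median(sorted.slice(sorted.length - half));
    var iqr = q3 - q1;
    return sorted.filter(function (v) {
        return v >= q1 - 1.5 * iqr && v <= q3 + 1.5 * iqr;
    });
}

// Made-up bandwidth samples (KB/s) with one obvious outlier.
var samples = [1000, 190, 195, 200, 205, 210, 215, 220];
console.log(geometricMean(samples), median(iqrFilter(samples)));
```

With data like this, the outlier pulls the geometric mean upwards while the filtered median ignores it, which is the kind of difference I'm trying to compare.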

I also try to measure latency. This is not network latency, but server latency from the user's point of view, ie, how long does it take between request and first byte of response. I run this test multiple times and do the same kind of stats on this data.

The goal of this test

The few people I've shown this to all had the same question. What's the goal of this test? There are already several free bandwidth testers available that one can use to determine one's bandwidth, so what does this do differently?

The way I see it, as a site owner, I don't really care about the bandwidth that my users have with their ISPs - unless of course, I have my servers in the ISP's data centre. I really care about the bandwidth that users experience when visiting my website. This test aims to measure that. Ideally, this piece of code can be put into any web page to measure the user's bandwidth in the background while they're interacting with your site. I don't know how it will work in practice though.

Insights from the data

I don't really know. It could be useful to figure out what users from different geographical locations experience. Same with ISPs. It might also just tell me that dreamhost is a really bad hosting provider.

Data consistency

In my repeated tests, I've found that the data isn't really consistent. It's not all over the place, but it fluctuates a fair bit. I've seen different levels of consistency when using the geometric mean and the median, but I don't think I have enough data yet to decide which is more stable. This could mean that my server just responds differently to multiple requests or it could mean many other things. I don't really know, but feel free to leave a comment if you do.

Credits

I don't know who first came up with the idea of downloading multiple images to test bandwidth, but it wasn't my idea. The latency test idea came from Tahir Hashmi and some insights came from Stoyan Stefanov.

Short URL: http://tr.im/bwmeasure

Monday, November 23, 2009

Storing IP addresses in a MySQL data table

For a lot of log processing, I need to store IP addresses in a database table. The standard process was always to convert the address to an unsigned int in perl or php and then insert it. Today I discovered an easier way: MySQL's INET_ATON function. It takes an address in dotted quad format and converts it into an INT. So, all you have to do is this:
INSERT INTO table (ip) VALUES (INET_ATON('$ip_address'));

And done.

Sunday, November 22, 2009

Being a geek

Back in 2004, I did a talk at Linux Bangalore titled Being a geek. It was quite popular at the time. The number of people in the room far exceeded the limits set by fire safety regulations. I then repeated the talk at Freedel in an impromptu session in the corridor with my audience sitting on the floor and on tables around me. It was somehow exactly the way I think some conference tracks should be.

Anyway, a couple of nights ago, I finally converted my slides to PDF by rewriting them in LaTeX and running pdflatex on it. The results are here:

It hit the front page of slideshare the night I posted, so chances are that it's still interesting to someone.

Monday, November 09, 2009

Template update

Apologies if you read my blog in a feed reader and just got swamped by a whole bunch of updates. I just redid this blog's template to match the rest of my website and in the process also went back and cleaned up the markup on some old posts. I can't say that this won't happen again, but any more changes at this time will be to the CSS or the template only and should not affect the feed.

Thanks for reading.

Saturday, November 07, 2009

Favicons on my planet's blogroll

Update: I noticed that some feeds weren't showing favicons even though their sites had them, and it turned out to be because the entire feed was a single line which didn't work with sed. I've changed to use perl instead.

Early last week, Chris Shiflett tweeted about adding favicons to a planet's blogroll for sites that have them. Now I'd considered setting up PlanetPlanet in the past, but had never gotten down to it. Since I was already in the middle of a site redesign, I figured it was a good time to start.

Setting up planet bluesmoon was fairly straightforward. I just followed the instructions in the INSTALL file.
I was also very pleased to see that it uses the python implementation of HTML::Template because I'm the author of the Java implementation (also the last Java project I worked on) and am very familiar with the syntax and tricks of the trade.

Once set up, I went back to Chris' site since he'd also mentioned that he'd be posting his favicon code on github. Unfortunately, at this time, the only thing there is the README file, and well, patience is not one of my virtues, so I decided to write my own. One advantage that I did have though, was Chris' tweets about the process which made a note of all the problems he ran into.

I ended up with this shell script that does a fairly good job, and can be run through cron (although I don't do that). It's made to specifically work with planet's config.ini file, and edits the file in-place to add the icon code. This is how it works.

Translating feed URL to favicon URL

This code pulls out all feed URLs from the file. I'm assuming here that they're all http(s) URLs.
sed -ne '/^\[http/{s/[][]//g;p;}' $file

For each URL returned, I run this code which pulls down the feed using curl, and then uses perl to extract the home site's URL. I then check for the link in the feed assuming the feed is in the RSS2.0 or Atom 1.0 formats. I could have looked at the content-type header and figured out which it was, but as Chris pointed out, content-type headers are often wrong. The perl code first splits the feed into multiple lines to make it easier to parse.
curl -m 10 -L "$feedurl" 2>/dev/null | \
perl -ne "
s/></>\\n</g;
for (split/\\n/) {
print \"\$1\\n\" and exit
if /<link/ &&
(/<link>(.*?)<\\/link>/
|| (/text\\/html/ && /alternate/ && /href=['\"](.*?)['\"]/)
);
}
"

I then pull out the domain from the site's URL. I'll need this if the link to the favicon is a relative URL. Again, I'm assuming http(s) and being a little liberal with my regexes to work in all versions of sed.
domain=$(echo "$url" | sed -e 's,\(https*://[^/]*/\).*,\1,')
base=${url%/*}

Then download the site page and look for a favicon link in there. Favicons are found in link tags with a rel attribute of icon or shortcut icon, so I check for both, again being liberal with my regexes, and when I find it, extract the value of the href attribute. This will break if there are multiple link tags on the same line, but I'll deal with that when I see it.
favicon=$( curl -m 10 -L "$url" 2>/dev/null | \
perl -ne "
print \"\$1\\n\" and exit
if /<link/ &&
/rel=['\"](?:shortcut )?icon['\"]/ &&
/href=['\"](.*?)['\"]/;
" )

If no URL was found, I just appended /favicon.ico to $domain and used that instead. If a relative URL was found, I appended it to either $base or $domain depending on whether the path starts with / or not. This will have trouble if your site URL points to a directory but omits the trailing slash, but shame on you if you do that.

Validating the favicon

Now once I had the URL, I still had to validate if a favicon existed at that location. This was done easily using curl with the -f flag which tells it to fail on error. It returns an error code of 22 for a file not found. The problem I faced here is that some sites don't actually return a 404 for missing resources. That was a WTF moment. So I figured I'd just look for the content-type of the returned resource, and if it did not match image/*, then I'd discard it. However, from Chris's tweets, I already knew that some sites send a favicon with a content type of text/plain or text/html, so I couldn't rely solely on this. Instead, I decided to download the favicon, and if its content-type did not match the image/* pattern, run the file command on it. This command looks up the file's magic numbers and figures out its content type. The result was this code:
name=$(echo "$domain" | sed -e 's,/,-,g').ico
params=$(curl -L -f -w "%{content_type}\t%{size_download}" -o "icons/$name" "$favicon" 2>/dev/null)
[ $? -ne 0 ] && continue    # skip if curl was unsuccessful
# params is "<content type><tab><size>", so split on the tab written by -w
ctype=${params%$'\t'*}
clen=${params#*$'\t'}
[ "$clen" -eq 0 ] && continue    # skip if favicon was 0 bytes
if ! echo "$ctype" | grep -q "^image/" &>/dev/null; then
if file -b "icons/$name" | grep '\<text\>' &>/dev/null; then
continue;    # skip if content type is not image/*
fi
fi
rm "icons/$name"

Write it back

Now that I knew the correct URL for a site's favicon, I could write this information back to the config.ini file. I decided to use perl for this line (though I could have used perl for the whole script). It reads the file in by paragraph, and if a paragraph matches the feed URL, it first strips out the old favicon line, and then adds the new one in. Since this code only runs if we actually find a favicon, it has the side effect of not updating a favicon that was once valid but now isn't.
perl -pi -e "BEGIN {\$/='';} if(m{^\[$feedurl\]}) { s{^icon =.*$}{}m; s{\n\n$}{\nicon =$favicon\n\n}; }" $file

The perl code also assumes a very specific format for the config.ini file. Specifically, everything about a feed must be together with no blank lines in between, and there needs to be at least one blank line between feed sections. It's not hard to maintain this, but it's not a restriction that planet imposes itself.

Adding favicons to the template

Lastly, I needed to add these favicons to the template. Inside the Channels loop, we add this code:
<img src="<TMPL_IF icon><TMPL_VAR icon><TMPL_ELSE>/feed-icon-14x14.png</TMPL_IF>" alt="" class="favicon">

The code can go anywhere as long as it's inside the Channels loop. To use it in the Items loop, the variable name should be changed to channel_icon instead.

Et voilà, site favicons on a planet. Now I've just got to get a better generic image for the no favicon state since they aren't technically links to feeds.

Update: I'm now using an icon from stdicon.com for the generic favicon.

Performance BoF at FOSS.IN

FOSS.IN runs in Bangalore from the 1st to the 5th of December this year.
During the conference, I'll be organising a BoF meet on performance titled Websites on Speed. In this BoF we'll each bring up ideas and research that we've been playing with. It's expected to be fairly technical, but how detailed we get depends on what people are interested in. We'll try and cover all layers of the stack that contribute to performance problems, and get into depth on one or two areas chosen on the spot. This isn't limited to frontend performance.

There's a lot of experimentation, tweaking and understanding of the system involved in web performance, so let's find out what's state of the art today.

Sunday, November 01, 2009

Performance measurement

In my last post, I mentioned the factors that affect web performance. Now that we know what we need to measure, we come to the harder problem of figuring out how to measure each of them. There are different methods depending on how much control you have over the system and the environment it runs in. Additionally, measuring performance in a test setup may not show you what real users experience, but it does give you a good baseline to compare subsequent tests against.

Web, application and database servers

Back end servers are the easiest to measure because we generally have full control over the system and the environment it runs in. The set up is also largely the same in a test and production environment, and by replaying HTTP logs, it's possible to simulate real user interactions with the server.

Some of the tools one can use to measure server performance are:
• ab - Apache Benchmark. Despite its name, it can be used to test any kind of HTTP server and not just apache. Nixcraft has a good tutorial on using ab.
• httperf from HP labs is also a good tool to generate HTTP load on a server. There's an article on Techrepublic about using it. I prefer httperf because it can be configured to simulate real user load.
• Log replaying is a good way to simulate real-user load, and a few people have developed scripts to replay an apache log file. The first one uses httperf under the hood.
• To measure database performance, we could either put profiling code into our application itself, and measure how long it takes for our queries to return under real load conditions, or run benchmarks with the actual queries that we use. For mysql, the mysql benchmarking suite is useful.
• MySQL Tuner is another tool that can tell you how your live production server has been performing, though it doesn't give you numbers to quantify perceived performance. I find it useful to tell me if my server needs retuning or not.

The above methods can also be used to measure the performance of remote web service calls, though you may want to talk to your remote web service provider before doing that.

I won't write any more about these because there are a lot of articles about server side performance measurement on the web.

DNS, CDNs and ISP networks

Measuring the performance of DNS, CDNs and your user's ISP network is much harder because you have control over neither the systems nor the environment.

Now I mentioned earlier that DNS is something you can control. I was referring to your own DNS set up, ie, the hostnames you have and how they're set up. This is not something we need to measure since no user will use your DNS server. All users use their ISP's DNS server or something like OpenDNS and it's the performance of these servers that we care about.

DNS

DNS is the hardest of the lot since the only way to measure it is to actually put a client application on your users' machines and have that do the measurement. Unless you have really friendly users, this isn't possible. It is an important measurement though. A paper on DNS Performance [Jung et al., 2002] shows that around 20% of all DNS requests fail.
This in turn adds to the overall perceived latency of a website. In the absence of an easy way to measure this performance from within a web page, we'll try and figure it out as a side-effect of other measurements.

One possible method is to request the same resource from a host, the first time using the hostname and the second time using its IP address. The difference should give you the DNS lookup time. The problem with this is that it sort of breaks DNS rotations where you may have multiple physical hosts behind a single hostname. It's even worse with a CDN because the hostname may map onto a server that's geographically closer to the user than the IP address you use. In short, you'd better know what you're doing if you try this.

ISP bandwidth

With ISP networks, the number we really care about is the user's effective bandwidth, and it isn't hard to measure this. We use the following procedure:
1. Place resources of known fixed sizes on a CDN
2. Make sure these resources are served with no-cache headers
3. Using javascript, download these resources from the client machine and measure the time it takes
4. Discard the first resource since it also pays the price of a DNS lookup and TCP slowstart
5. Use resources of different sizes to handle very slow and very fast connections.
The number we get will be affected by other things the user is using the network for. For example, if they're streaming video at the same time, then the bandwidth measured will be lower than it should be, but we take what we can get.

CDNs

Now to measure bandwidth, we need to get that resource relatively close to the user so that the bandwidth of the whole internet doesn't affect it. That's where CDNs come in, and measuring a CDN's performance is somewhat similar.

You could always use a tool like Gomez or Keynote to do this measurement for you, or you can hack up a solution yourself in Javascript. You need to figure out three things:
1. The IP of the CDN closest to the user
2. The user's geo-location which you can figure out from their IP address
3. The time it takes to download a resource of known size from this CDN
It's that first one that's a toughie, but the simplest way to figure it out is to just ask your CDN provider. Most CDNs also provide you with their own performance reports.

Page content and user interaction

YSlow, Show Slow, Page Speed and Web Page Test are good tools for measuring and analysing the performance of your page content. They can measure and analyse your page from your development environment and suggest improvements. They do not, however, measure real user perceived performance, but this is something we can do with Javascript.

We primarily need to measure the time it takes to download a page and all its components. Additionally we may want to time how long certain user interactions with the page took. All of these can be accomplished by reading the Date() object in javascript at the correct start and end times. What those start and end times are depends on your application, but we'll look at one possible implementation in a later post. Once you have the timing information that you need, it can be sent back to your server using a javascript beacon. We'll go into more detail about this as well in a later post.

This post has already run longer than I'd hoped for, so I'm going to stop here and will continue next time.

About web performance

I currently work with the performance team at Yahoo!. This is the team that did the research behind our performance best practices and built YSlow. Most of our past members write and speak about performance, and while I've done a few talks, I've never actually written a public post about web performance. I'm going to try and change that today.

Note, however, that this blog is about many technical topics that interest me and web performance is just a part of that.
I'm never sure how to start a new series, especially one that's been spoken about by others, but since these blog posts also serve as a script for the talks that I do, I thought I'd start with the last performance talk that I did.

Improving a website's performance starts with measuring its current performance. We need a baseline measurement that will help us determine if the changes we make cause an improvement or a regression in performance. Before we start with measurement, however, we need to know what to measure, and for that we need to look at all the factors that contribute to the time it takes for a website to get to the user.

User perceived web app time is spent in looking up stuff, building stuff, downloading stuff, rendering stuff and interacting with stuff. It's this perceived time that we need to reduce, and consequently measure. All of the above fall into two basic categories:
1. Infrastructure
2. Content structure
Each of these in turn is made up of components that we as developers can control, and those that we cannot. We'd like to be able to measure everything and fix whatever we have control over. I've split the components that I see into this table so we know what can be looked at and who should do the looking.

Components we can control
   Infrastructure: Web server & App server, Database server, Web service calls, CDNs, DNS
   Content: HTTP headers, HTML, Images, Flash, CSS, Javascript, Fonts
Components we cannot control
   Infrastructure: ISP's DNS servers, ISP's network, User's bandwidth, User's browser & plugins, Other apps using the user's network, The internet
   Content: Advertisements, Third party content included as badges/feeds, Third party sites that link to your page

If you have more items to add to this table, leave a comment and I'll add it in.

This is where we can jump to Yahoo!'s performance rules. At the time of this post, there are 34 of them divided into 7 categories. I'll go into more details and refer to these rules in later posts. That's all for this introductory post though.
Tuesday, October 27, 2009

Getting my twitter updates on this blog

I wanted my twitter updates to show up on the sidebar of my blog. At first I found that blogger already had a gadget for that, so I just included it. Unfortunately, this gadget loaded my timeline in an iframe. The iframe pointed to a script on someone else's domain, and every now and then that domain was unresponsive, resulting in the rest of my page not loading. It also did not work on Opera.

I decided to jump into twitter's API and figure out if I could get this to work on my own. I didn't bother with making it customisable, but if you want to reuse it, you'll need to change my username to your own. Here's what I did.

First, create a div that will hold my timeline:
<div id="twitter">
<ul class="twitter-timeline">
</ul>
<div class="follow-me"><a href="http://twitter.com/bluesmoon">Follow me on twitter</a></div>
</div>

Put this wherever you want your timeline to go. Next, write the Javascript that will draw the timeline. This is fairly simple:
function show_twitter(o)
{
   var div = document.getElementById("twitter");
   var ul = div.getElementsByTagName("ul")[0];
   ul.innerHTML = "";
   for(var i=0; i<o.length; i++) {
      var li = document.createElement("li");
      // the g flag links every @mention in the tweet, not just the first one
      li.innerHTML = o[i].text.replace(/@(\w+)/g, "<a href='http://twitter.com/$1'>@$1</a>");
      ul.appendChild(li);
   }
}

I put this at the start of the document, but you can put it anywhere before you make a call to the twitter API, which is the next step:
<script src='http://twitter.com/statuses/user_timeline.json?id=bluesmoon&count=5&callback=show_twitter'
type='text/javascript'></script>


A few things to note about this script tag:
1. It's got my userid in it. It only works with userids that have public timelines
2. count is limited to 5 items
3. the callback parameter's value is the name of the function we defined in the previous step.
That's it. Put all this together and you have your latest 5 tweets on your blog. If your blogging software requires XML valid templates (like this blogger thing), then you'll need to either put your javascript inside a CDATA section or escape all quotes, &, < and >
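If you go the escaping route rather than CDATA, a small helper can do the replacements for you before you paste the script into the template. This is only a sketch: escapeForXmlTemplate is a name I made up for illustration, it is not part of blogger, and it simply replaces the five characters that XML treats specially.

```javascript
// Escape the characters that would break an XML-validated template.
// escapeForXmlTemplate is a hypothetical helper, not a blogger API.
function escapeForXmlTemplate(source) {
  return source
    .replace(/&/g, "&amp;")   // must run first so later entities are not escaped twice
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}
```

Run your javascript source through it once and paste the result into the template. The order matters: escaping & last would mangle the entities produced by the earlier replacements.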

Monday, October 26, 2009

Referencing blogger's data tags in javascript

Blogger's templating system includes data tags that you can use to reference parameters of your blog, blog posts, labels and much more in your template or in widgets. All of blogger's templates are built with these tags, and that's how you see posts, comments, timestamps and all the other dynamic content on a blogger blog.

Using these tags in your HTML (XHTML actually) is quite easy. For example, to print the blog's title, you'd use this:
   <data:blog.title/>

Notice the / before the >. That's needed because templates are XML so all tags have to be closed. If a data tag has to go inside another tag, the syntax is a little different. For example, to enclose the title in a link to the blog, you'd do this:
   <a expr:href="data:blog.url"><data:blog.title/></a>

Notice the expr: before href. That tells the template engine that the value of this attribute is an expression to be expanded. I don't know the details, but that's what I've understood from it.

So knowing all this (and it took me a while to figure it out because the documentation sucks), my next problem was how to include it in javascript. In particular, I needed it for the delicious tagging at the bottom of each post. I needed the post title and url to be passed to the delicious badge function. This is the code I needed to use:
   Delicious.BlogBadge.writeBadge("delicious-blogbadge-"+Math.random(), "http://url", "title", {});

Now I generally put my javascript inside <![CDATA[ ]]> tags so that I don't have to worry about escaping quotes and relational operators. However, this also means that any data tags I used would be treated as plain text and ignored. I searched for docs on how to do this, and there were none.

I then went to the Edit Template section, and clicked the checkbox that said "Expand Widget Templates" and then searched through the template code for other <script> tags. I found some, and got my answer from there, and this is what my delicious call became:
   <script type='text/javascript'>
Delicious.BlogBadge.writeBadge(&quot;delicious-blogbadge-&quot;+Math.random(),
&quot;<data:post.url/>&quot;,
&quot;<data:post.title/>&quot;,
{}
);
</script>

First, there's no CDATA section. Second, I need to replace the quotes that surround the values passed to the function with &quot; and then just use the data tags as usual.

Pretty simple, but it took me a really long time to find any documentation on the subject. Let this blog post serve as documentation for anyone else who needs to do the same.

Wednesday, September 30, 2009

A couple of days ago I posted about scaling writes in mysql. I didn't say much about read performance in that post because a) it was irrelevant at the time, and b) there are thousands of articles all over the web that already cover read performance.

In this post I'm going to cover some of the things that I did to improve read performance for my own application. It may be relevant to others, however, you'd still need to read a whole bunch of other articles to understand MySQL read performance.

Looking at our access patterns, it turned out that there were two classes of read queries.
1. Reads to build the daily summaries
2. Reads from the summary tables in response to user actions
The former dealt with far more data at one go, but was only run once. Queries for this pattern were slow depending on how many rows were touched. Small resultsets came back in milliseconds while larger resultsets (some over a million rows) took several minutes to return. The latter pattern dealt with small amounts of data, but happened far more frequently. These queries needed to return in around 10ms.

Also note from my last post that inserts handled 40,000 rows per query, which meant that each query took about 4.5 seconds to run.

Now why is all of this relevant? It's relevant because it renders your slow query log mostly useless. My slow query log jumps to about 300GB fairly quickly, so you need log rotation implemented. We can, however, turn slow query logging on and off at run time using the slow_query_log global system variable. Since this variable is global, we need to worry about a few things.
1. Make sure you set it back on when your script finishes, even if the script crashes
2. Any other queries run while your slow script is running will not be logged even if they are slow
There's nothing we can do about the latter (nothing I can think of anyway). For the former, I prefer to use a wrapper script that turns off slow query logging, then calls my slow script, and then turns it back on when the script has terminated (successfully or unsuccessfully). This ensures that my slow query log has mostly queries that I should and can optimise. See the MySQL Slow Query Log documentation for more information on what to do and how to do it.
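The shape of that wrapper can be sketched in a few lines. This is not my actual script, just an illustration of the pattern: runQuery is a hypothetical function standing in for whatever MySQL client you use, and it's kept synchronous here only to keep the sketch short.

```javascript
// Turn slow query logging off around a slow job and always turn it back on.
// runQuery is a hypothetical stand-in for your MySQL client's query call.
function withSlowLogDisabled(runQuery, job) {
  runQuery("SET GLOBAL slow_query_log = 'OFF'");
  try {
    return job();
  } finally {
    // Runs whether job() returned normally or threw an exception.
    runQuery("SET GLOBAL slow_query_log = 'ON'");
  }
}
```

Note that a finally block only covers exceptions. If the process is killed outright, nothing inside it gets a chance to run, which is why a separate wrapper script that outlives the slow script is the sturdier choice.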

Now, I mentioned in my previous post that we pushed our boxes up from 4GB RAM to 16GB RAM. This left a bit free after allocating enough to the innodb_buffer_pool. I figured that we could use this to improve read performance. My first thought was to use this for the query cache. All past tests had shown that the query cache improves read performance quite a lot. Unfortunately, these tests assume that you have a single database server or use some kind of affinity to make sure that multiple queries go to the same host.

This is not such a good idea with a multi-box set up for BCP and load balancing. There's also the way in which the query cache keys queries, which not every developer understands, and this can lead to unexpected results. I wasn't too concerned about this since I was in control of every query that went into the system, but I may not always be maintaining this system. I decided that the best option was to turn off the query cache by default and enable it on a per query basis using the SQL_CACHE directive in my queries. Instead, I use a frontend cache similar to memcached. The guys at the MySQL Performance Blog also have similar recommendations wrt query cache, so go read their blog for a more detailed analysis.

The second thing I did was to create tables with redundant information. I call them cache tables. I store information in there while I'm building the main tables that will eventually speed up creating the summary tables. The data in there is quite simple. For example, I have a table that contains an approximate count of rows of each type that I need to summarise. That way I can schedule summarisation using the Shortest Job First algorithm. The result is that in 50% of the time, 97% of all summaries are done and most users can start using that data. Something else I haven't done yet, but may implement soon, is to let the script that summarises data run two instances in parallel, one on each slave, with one running the shortest jobs first while the other runs the longest jobs first, or some similar scheduling algorithm. The ideal result would be if it took 50% of the time to run.
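Shortest Job First itself is just a sort on the estimated job size. Here's a minimal sketch, assuming the approximate counts have already been read out of the cache table into an object keyed by summary type (the type names below are made up for illustration):

```javascript
// Order summary jobs so that the ones touching the fewest rows run first.
// counts maps a summary type to its approximate row count.
function shortestJobFirst(counts) {
  return Object.entries(counts)
    .sort(function (a, b) { return a[1] - b[1]; })
    .map(function (entry) { return entry[0]; });
}
```

For example, shortestJobFirst({ big: 1000000, small: 10, medium: 5000 }) puts small first and big last. The counts only need to be approximate since all that matters is the relative order.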

The final optimisation for handling summaries was bulk queries. INSERTs, UPDATEs and SELECTs can all be batched and sometimes this can get you much better performance than running single queries. For INSERTs, I developed a method using ON DUPLICATE KEY UPDATE to INSERT and UPDATE multiple rows at once. The query looks something like this:
 INSERT INTO table (        key_field, f1, f2, f3    ) VALUES (        key1, f11, f21, f31    ), (        key2, f12, f22, f32    ), ...    ON DUPLICATE KEY UPDATE        f1 = IF(key_field=key1, f11, IF(key_field=key2, f12, IF(key_field=key3, f13, ...))),        f2 = IF(key_field=key1, f21, IF(key_field=key2, f22, IF(key_field=key3, f23, ...))),        f3 = IF(key_field=key1, f31, IF(key_field=key2, f32, IF(key_field=key3, f33, ...)))
As you can see the query gets quite complicated as the number of rows grows, but you never write this query by hand. It's generated through code in your language of choice. The only thing you have to worry about is making sure the total query size stays below MySQL's max_allowed_packet limit. Also, longer queries take longer to parse. I restrict it to about 100 rows per insert/update.
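To give an idea of what that generating code might look like, here's a rough sketch. It's not the code I run in production. The function names are made up, values are emitted as ? placeholders on the assumption that your client library supports them, and the innermost fallback of each nested IF (which the query above elides) is my own choice: it keeps the column's existing value.

```javascript
// Build one INSERT ... ON DUPLICATE KEY UPDATE statement for rowCount rows.
// Table and column names are interpolated as-is, so they must come from
// trusted code, never from user input.
function buildBulkUpsert(table, keyField, fields, rowCount) {
  const columns = [keyField].concat(fields);
  const oneRow = "(" + columns.map(() => "?").join(", ") + ")";
  const valueList = new Array(rowCount).fill(oneRow).join(", ");

  const updates = fields.map((field) => {
    // Innermost fallback keeps the existing value if no key matches.
    let expr = field;
    for (let i = 0; i < rowCount; i++) {
      expr = "IF(" + keyField + "=?, ?, " + expr + ")";
    }
    return field + " = " + expr;
  });

  return "INSERT INTO " + table + " (" + columns.join(", ") + ")" +
    " VALUES " + valueList +
    " ON DUPLICATE KEY UPDATE " + updates.join(", ");
}

// Flatten rows into the order the placeholders appear in the statement:
// first every row's columns for VALUES, then (key, value) pairs for each
// field's nested IF chain.
function buildParams(keyField, fields, rows) {
  const params = [];
  for (const row of rows) {
    params.push(row[keyField]);
    for (const field of fields) params.push(row[field]);
  }
  for (const field of fields) {
    for (const row of rows) params.push(row[keyField], row[field]);
  }
  return params;
}
```

With two rows and two fields this produces 14 placeholders, 6 for the VALUES list and 4 for each updated column, and buildParams returns 14 values in the matching order.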

Now, it's quite likely that I need to insert/update far more than 100 rows, which means the query parser needs to run for each batch. To get around this, I use a prepared statement with a heck of a lot of question marks in it. I'll leave it as an exercise for you to figure out what to pass to it. The real trick comes on the last batch. It's unlikely that I'll have an exact multiple of 100 records to be inserted, so the last batch may have fewer than 100 records. I have two choices at this point.
1. Create a new prepared statement with the number of records I need
2. Pad the current statement with extra copies of the last row
Neither method has had any advantage over the other, so I prefer the former since it sends less data over the wire at the cost of one extra query parse.
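Both choices are easy to express in code. The sketch below uses made-up function names: toBatches splits the rows into batches of a fixed size, leaving a short final batch for the first choice, and padBatch implements the second choice by repeating the last row until the batch is full.

```javascript
// Split rows into batches of at most batchSize; the last one may be shorter.
function toBatches(rows, batchSize) {
  const batches = [];
  for (let i = 0; i < rows.length; i += batchSize) {
    batches.push(rows.slice(i, i + batchSize));
  }
  return batches;
}

// Pad a short batch with copies of its last row so that it fits a statement
// prepared for exactly batchSize rows. Re-sending the same row is harmless
// because the duplicate key just triggers the same update again.
function padBatch(batch, batchSize) {
  const padded = batch.slice();
  while (padded.length < batchSize) {
    padded.push(batch[batch.length - 1]);
  }
  return padded;
}
```

With 250 rows and a batch size of 100 you get batches of 100, 100 and 50. Either prepare a second statement for the 50, or pad it to 100 and reuse the first one.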

Bulk selects are far simpler. It basically means that if I'm going to have to operate on a bunch of records one at a time, then it's faster to select them all at once, store them in an array and operate on the array rather than selecting them one at a time. This, of course, costs memory, and it is possible to use up all the RAM on the system doing something like this. It's happened several times. With experience you learn where to draw the line for your application.

Now for the user queries, I did not optimise too much. I again went with data partitioning, also by time, but this time by month. Our access patterns showed that most queries were for data in the last one month, so by partitioning the summary tables by month, it meant that we only had to query one or two partitions at any time. The primary key was designed to either return the exact results the user wanted, or narrow the search down to a small set that could be filtered either through a DB scan, or in the application itself. Partitions ensured that in the worst case we'd have to do a full partition scan and not a full table scan.

This is by no means the fastest design. It is optimised to speed up the slowest part of the system, ie, writes, but reads don't quite go out of the window as a result.

Monday, September 28, 2009

Scaling writes in MySQL

We use MySQL on most of our projects. One of these projects has an access pattern unlike any other I've worked on. Several million records a day need to be written to a table. These records are then read out once at the end of the day, summarised and then very rarely touched again. Each record is about 104 bytes long (there's one VARCHAR column, everything else is fixed), and that's after squeezing out every byte possible. The average number of records that we write in a day is 40 million, but this could go up.

A little bit about the set up. We have fairly powerful boxes with large disks using RAID1/0 and 16GB RAM, however at the time they only had 4GB. For BCP, we have a multi-master set up in two colos with statement level replication. We used MySQL 5.1.

My initial tests with various parameters that affect writes showed that while MyISAM performed slightly better than InnoDB while the tables were small, it quickly deteriorated as the table size crossed a certain point. InnoDB performance deteriorated as well, but at a higher table size. The table size turned out to be related to the innodb_buffer_pool_size, and that in turn was capped by the amount of RAM we had on the system.

I decided to go with InnoDB since we also needed transactions for the summary tables and I preferred not to divide my RAM between two different engines. I stripped out all indexes, and retained only the primary key. Since InnoDB stores the table in the primary key, I decided that rather than use an auto_increment column, I'd cover several columns with the primary key to guarantee uniqueness. This had the added advantage that if the same record was inserted more than once, it would not result in duplicates. This small point was crucial for BCP, because it meant that we did not have to keep track of which records had already been inserted. If something crashed, we could just reinsert the last 30 minutes worth of data, possibly into the secondary master, and not have any duplicates at the end of it. I used INSERT IGNORE to get this done automatically.

Now to get back to the table size limit that we were facing. Initial tests showed that we could insert at most 2100 records per second until the table size got to a little over the innodb_buffer_pool_size and at that point it degraded fairly rapidly to around 150 records per second. This was unacceptable because records were coming in to the system at an average rate of 1000 per second. Since we only needed to read these records at the end of the day, it was safe to accumulate them into a text file and periodically insert them in bulk. I decided to insert 40,000 records at one time. The number I chose was arbitrary, but later tests that I ran on batches of 10K, 20K and 80K showed no difference in insert rates. With batch inserts, we managed to get an insert rate of 10,000 records per second, but this also degraded as soon as we hit the limit going down to 150 records per second.

System stats on the database box showed that the disk was almost idle for most of the run and then suddenly shot up to 90-100% activity once we hit this limit, so it was obvious that at this point, the DB was exchanging data between buffers and disk all the time.

At this point, someone suggested that we try partitioning, which was available in MySQL 5.1. My first instinct was to partition based on the primary key so that we could read data out easily. However, reads weren't really our problem since we had no restriction on how fast they needed to be (at least not as much as writes). Instead, I decided to partition my table based on the pattern of incoming data.

The first part was obvious: use a separate table for each day's data. On a table of this size, DROP TABLE is much faster than DELETE FROM <table> WHERE ..., and it also reclaims lost space. I should mention at this point that we used innodb_file_per_table as well to make sure that each table had its own file rather than use a single innodb file.

Secondly, each table was partitioned on time. 12 partitions per day, 2 hours of data per partition. The MySQL docs for Partitioning were quite useful in understanding what to do. The command ended up looking like this:
CREATE TABLE (
...
) PARTITION BY RANGE( ( time DIV 3600 ) MOD 24 ) (
Partition p0 values less than (2),
Partition p1 values less than (4),
Partition p2 values less than (6),
Partition p3 values less than (8),
Partition p4 values less than (10),
Partition p5 values less than (12),
Partition p6 values less than (14),
Partition p7 values less than (16),
Partition p8 values less than (18),
Partition p9 values less than (20),
Partition p10 values less than (22),
Partition p11 values less than (24)
); 
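To check which partition a given record ends up in, the partitioning expression can be mirrored in a few lines of code. This sketch assumes that time holds a Unix timestamp in seconds; partitionFor is a made-up name used only for illustration.

```javascript
// Mirror of PARTITION BY RANGE( ( time DIV 3600 ) MOD 24 ) with 2-hour ranges.
// time is assumed to be a Unix timestamp in seconds.
function partitionFor(time) {
  const hourOfDay = Math.floor(time / 3600) % 24; // ( time DIV 3600 ) MOD 24
  return "p" + Math.floor(hourOfDay / 2);         // VALUES LESS THAN (2), (4), ...
}
```

Hours 0 and 1 map to p0, hours 2 and 3 to p1, and so on up to hours 22 and 23 in p11.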
The time field is the timestamp of incoming records, and since time always moves forward (at least in my universe), this meant that I would never write to more than 2 partitions at any point in time. Now, a few back of the envelope calculations:
44M x 102 bytes = approx 4.2GB
2x for InnoDB overhead = approx 8.4GB
+10% for partitioning overhead = 9.2GB
/12 partitions = approx 760MB per partition 
This turned out to be more or less correct. In most cases total table size ranges between 8-10GB, sometimes it goes up to 13GB. Partition sizes range from less than 700MB to over 1GB depending on the time of day. With 4GB of RAM, we had an innodb_buffer_pool set at 2.7GB, which was good enough to store two partitions, but not good enough to work on any other tables or do anything else on the box. Boosting the RAM to 16GB meant that we could have a 12GB buffer pool, and leave 4GB for the system. This was enough for 2 partitions, even if the total number of records went up, and we could work on other tables as well.
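If you want to redo this estimate for your own record size and volume, the arithmetic is easy to wrap in a function. This is only a sketch of the calculation above: the 2x and 10% overhead factors are my rough allowances from the estimate, not measured constants, and estimatePartitionSize is a made-up name.

```javascript
// Rough table and partition size estimate, all results in bytes.
function estimatePartitionSize(records, bytesPerRecord, partitions) {
  const raw = records * bytesPerRecord;
  const withInnoDB = raw * 2;                 // 2x allowance for InnoDB overhead
  const total = withInnoDB * 1.1;             // +10% for partitioning overhead
  return { raw: raw, total: total, perPartition: total / partitions };
}
```

Calling estimatePartitionSize(44e6, 102, 12) and dividing total by 1024 three times gives a figure close to the 9.2GB above. The per-partition figure comes out in the same ballpark as my rounded number, a little higher because I rounded at each step.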

After partitioning, tests showed that we could sustain an insert rate of 10K rows per second for some time. As the table size grew past 10 million records, the insert rate dropped to about 8500 rows per second, but it stayed at that rate for well over 44 million records. I tested inserts up to 350 million records and we were able to sustain an insert rate of around 8500 rows per second. Coincidentally, during Michael Jackson's memorial service, we actually did hit an incoming rate of a little over 8000 records per second for a few hours.

One more BotE calculation:
8500 rows per second  x  86400 seconds per day = 734.4 Million records per day
Considering that before this system was redesigned it was handling about 7 Million records per day, I'd say that we did pretty well.

Update: If you want to see charts, they're in my ConFoo presentation on slideshare.

Friday, July 24, 2009

Avoid running END blocks in perl

To run perl code without executing any END blocks, put this at the end of your program:
   exec('true');
You'd put that in place of any exit() statement as well. I'll leave it to you to figure out return values. It's not that hard.

So, why did I need this?

I have this large program in perl, and it has a module that prints out a bunch of stats in the END block. This is all fine for the default use case, but today I needed to write another small program that does a bunch of auditing on this module - something like a unit test, but not exactly.

Anyway, this smaller program only needed the module to initialise, but not actually execute, and it doesn't require the stats at the end of the code to be printed either, so I needed to figure out how to run a program without executing the END block.

I looked up the perlmod doc, and it said that the only conditions under which END is not executed are if the process is replaced using exec or you're thrown out of the water by a signal. Voilà, no more END.

Sunday, July 19, 2009

Unscientific network connectivity comparison

For the last few weeks, I've been running a MacBook Pro (Mac OS X 10.4.11) and Ubuntu Linux 8.10 on an IBM Thinkpad T60p. I generally have both on at the same time, and connected to the same wireless router. I also have both boxes reasonably close to each other since I'd rather not have to physically move in order to switch.

This set up limits any differences in network connectivity to the laptops themselves: either the hardware or the software running on it. (There is an insignificant, but non-zero probability that a thin channel of ionized air exists between one laptop and the router and not the other, but we'll neglect this possibility for now).

In both cases, I have the laptops set up to automatically connect to the network, but have also tried manual connects. I've tested this with the network configured with WEP security and also no security. I did not use WPA because there are known problems with NetworkManager on Ubuntu and WPA networks. My results did not differ with the two networks, so for the rest of this post, assume either.

Time to connect

The Mac consistently connects to the network much faster than Ubuntu does, and by a very large margin. I've tried connecting the Mac first and Ubuntu second, and vice-versa with no noticeable difference.

The Mac connects in a few seconds, slightly longer for WEP than for an unsecured network, but it's always a matter of seconds.

Ubuntu on the other hand takes anywhere from 40 seconds to a few minutes to connect, and on several occasions fails to connect at the first attempt. On system boot up, it takes 3-4 minutes to connect to the network. I've tested this with NetworkManager, Wicd and simply using iwconfig and dhclient on the command line. There is negligible difference between the three methods.

Connection speed

Again the Mac wins. While I haven't measured actual connection speeds, I just inspect how long DNS lookups take and how fast a given file downloads over HTTP on the two systems at the same time. This time the difference is not an order of magnitude, but is still significant. Ubuntu can take 10-20% more time to download the same file, and at times it will just fail the DNS lookup. This could, however, be related to the third item I looked at.

Packet loss

The Ubuntu box frequently ranges between 30-50% packet loss between the laptop and the wireless router, sometimes jumping up to 80% packet loss. The Mac on the other hand had no packet loss.

I haven't been very scientific in my measurements, and I don't yet know whether the problems I see with the Ubuntu/Thinkpad box are hardware or software related. However, for a while, I had Windows XP on an Acer, which also had problems connecting to the network, but did not experience the same kinds of packet loss that the Ubuntu box has.

If anyone has pointers on what I should be looking at, please leave a comment, and I'll update this post with my findings.

...===...