[philiptellis] /bb|[^b]{2}/
Never stop Grokking



Wednesday, December 23, 2009

Where do your site's visitors come from

This hack, which I demoed during my talk at FOSS.IN, was largely written by the audience. I typed it out on screen, but the audience told me what to type. The primary requirement was to find out where the visitors to your website come from. The secondary requirement was that, starting from scratch, we had to come up with a solution in five minutes. Given the medium of discussion (a large auditorium with a few hundred people) and the number of wisecracks, we probably went over the five minute limit, but that's okay.

So, we start by looking through the web access log to find out what it looks like. Remember, we're starting from scratch, so we have no idea how to solve the problem yet.
$ head -1 access.log

65.55.207.47 - - [30/Nov/2009:00:39:50 -0800] "GET /robots.txt HTTP/1.1" 200 297 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" 
This tells us that the IP address is the first field in the log, and that's probably the best indicator of who a user is.

We now use cut to pull out only the first field.
head -1 access.log | cut -f1 -d' '

65.55.207.47
Now, I don't really care about the IP itself, but about the subnet, so I'll just pull out the first three octets of the IP address (I think Tejas came up with this):
head -1 access.log | cut -f1-3 -d.

65.55.207
And before anyone tells me that the -1 usage of head is deprecated, I know, but it's a hack.

Now, I want to do this for my entire log file (or a large enough section of it), and I want to know how many hits I get from each subnet. The audience came up with using sort and uniq to do this:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10

 141 216.113.168
 106 88.211.24
  80 91.8.88
  79 78.31.47
  69 199.125.14
  64 216.145.54
  62 173.50.252
  58 193.82.19
  57 82.69.13
  56 198.163.150
Now, I don't know about you, but I can't just look at an IP address and tell where it's from. I need something in English. The audience came up with whois to do this, but before we could use it, we had to figure out how. We ran it on the first IP address up there:
whois 216.113.168.0

OrgName:    eBay, Inc
OrgID:      EBAY
Address:    2145 Hamilton Ave
City:       San Jose
StateProv:  CA
PostalCode: 95008
Country:    US

NetRange:   216.113.160.0 - 216.113.191.255
CIDR:       216.113.160.0/19
NetName:    EBAY-QA-IT-1
NetHandle:  NET-216-113-160-0-1
Parent:     NET-216-0-0-0-0
NetType:    Direct Assignment
NameServer: SJC-DNS1.EBAYDNS.COM
NameServer: SMF-DNS1.EBAYDNS.COM
NameServer: SJC-DNS2.EBAYDNS.COM
Comment:
RegDate:    2003-05-09
Updated:    2003-10-17

OrgTechHandle: EBAYN-ARIN
OrgTechName:   eBay Network
OrgTechPhone:  +1-408-376-7400
OrgTechEmail:  network@ebay.com

# ARIN WHOIS database, last updated 2009-12-22 20:00
# Enter ? for additional hints on searching ARIN's WHOIS database.
We only care about the OrgName field, so we could grep for that. Since I also wanted to strip out the "OrgName:" prefix, I used sed instead:
whois 216.113.168.0 | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'

eBay, Inc
This gives me what I want, but how do I pass the output of the earlier pipeline to this one? Most people suggested I use xargs, but that would either pass the count as well, or lose the count completely. I wanted both. Gabin suggested that I use read in a loop:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
    while read count net; do \
        whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'; \
    done

eBay, Inc
RIPE Network Coordination Centre

I've limited it to only 2 entries this time so that the test doesn't take too long.

Finally, in order to print the count before the network owner, I pipe the output to awk. Most people suggested I just use echo, but I prefer numbers right-aligned the way printf formats them:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
    while read count net; do \
        whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
            awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, $net, \$0);}"; \
    done

 141    216.113.168.x   eBay, Inc
 106    88.211.24.x     RIPE Network Coordination Centre
Note that we use double quotes around the awk script and escape a bunch of things inside. This is so that the shell variables $count and $net expand before awk ever sees them. $net needs a pair of escaped quotes of its own, because a bare 216.113.168 isn't valid awk; awk reads it as two concatenated numbers and mangles it into 216.1130.168. We can also accomplish all this using the -v option to awk, but no one came up with it at the time; a sketch of that version follows.
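For completeness, here's roughly what the -v version of the loop body would look like. It keeps the awk program in single quotes, so nothing needs escaping:

whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
    awk -v count="$count" -v net="$net" \
        '{printf("%4d\t%s.x\t%s\n", count, net, $0);}'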

Now run this on the top 10 IP blocks, and we get:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10 | \
    while read count net; do \
        whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
            awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, $net, \$0);}"; \
    done

 141    216.113.168.x   eBay, Inc
 106    88.211.24.x     RIPE Network Coordination Centre
  80    91.8.88.x       RIPE Network Coordination Centre
  79    78.31.47.x      RIPE Network Coordination Centre
  69    199.125.14.x    InfoUSA
  64    216.145.54.x    Yahoo! Inc.
  62    173.50.252.x    Verizon Internet Services Inc.
  58    193.82.19.x     RIPE Network Coordination Centre
  57    82.69.13.x      RIPE Network Coordination Centre
  56    198.163.150.x   Red River College of Applied Arts, Science and Technology
That's it. The hack requires network access for whois to work, and may be slow depending on how long whois lookups take for you. It also doesn't care about the class of the IP block, and just assumes that everything is class C, but it works well enough.

I also have no idea why my site is so popular with eBay.

Sunday, November 01, 2009

Performance measurement

In my last post, I mentioned the factors that affect web performance. Now that we know what we need to measure, we come to the harder problem of figuring out how to measure each of them. There are different methods depending on how much control you have over the system and the environment it runs in. Additionally, measuring performance in a test setup may not show you what real users experience; however, it does give you a good baseline to compare subsequent tests against.

Web, application and database servers

Back-end servers are the easiest to measure because we generally have full control over the system and the environment it runs in. The setup is also largely the same in test and production environments, and by replaying HTTP logs, it's possible to simulate real user interactions with the server.

Some of the tools one can use to measure server performance are:
  • ab - Apache Benchmark. Despite its name, it can be used to test any kind of HTTP server, not just Apache. Nixcraft has a good tutorial on using ab.
  • httperf from HP Labs is also a good tool for generating HTTP load on a server. There's an article on Techrepublic about using it. I prefer httperf because it can be configured to simulate real user load; sample invocations of both tools follow this list.
  • Log replaying is a good way to simulate real-user load, and a few people have developed scripts to replay an Apache log file. The first one uses httperf under the hood.
  • To measure database performance, we could either put profiling code into the application itself and measure how long our queries take under real load, or run benchmarks with the actual queries that we use. For MySQL, the MySQL benchmarking suite is useful.
  • MySQL Tuner is another tool that can tell you how your live production server has been performing, though it doesn't give you numbers to quantify perceived performance. I find it useful for telling me whether my server needs retuning or not.
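To make that concrete, here's roughly what a simple run of ab and httperf looks like. The URL and the numbers are placeholders that you'd tune for your own server:

ab -n 1000 -c 10 http://www.example.com/

httperf --server www.example.com --port 80 --uri /index.html \
    --num-conns 1000 --rate 10

The ab run fires 1000 requests, 10 at a time; the httperf run opens 1000 connections at a steady rate of 10 per second, which is closer to how real users arrive.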
The above methods can also be used to measure the performance of remote web service calls, though you may want to talk to your remote web service provider before doing that.

I won't write any more about these because there are a lot of articles about server side performance measurement on the web.

DNS, CDNs and ISP networks

Measuring the performance of DNS, CDNs and your users' ISP networks is much harder because you control neither the systems nor the environment. Now, I mentioned earlier that DNS is something you can control. I was referring to your own DNS setup, i.e., the hostnames you have and how they're set up. This is not something we need to measure, since no user will query your DNS server directly. Users use their ISP's DNS server, or something like OpenDNS, and it's the performance of these servers that we care about.

DNS

DNS is the hardest of the lot, since the only way to measure it is to actually put a client application on your users' machines and have that do the measurement. Unless you have really friendly users, this isn't possible. It is an important measurement though. A paper on DNS Performance [Jung et al., 2002] shows that around 20% of all DNS requests fail, which in turn adds to the overall perceived latency of a website. In the absence of an easy way to measure this performance from within a web page, we'll try to figure it out as a side-effect of other measurements.

One possible method is to request the same resource from a host, the first time using the hostname and the second time using its IP address. The difference should give you the DNS lookup time. The problem with this is that it sort of breaks DNS rotations where you may have multiple physical hosts behind a single hostname. It's even worse with a CDN because the hostname may map onto a server that's geographically closer to the user than the IP address you use. In short, you'd better know what you're doing if you try this.
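If you want to play with this idea from your own machine (rather than from a real user's browser), curl can demonstrate it. The hostname and IP below are placeholders, and the Host header is there so a virtual host still serves the right site:

curl -o /dev/null -s -w '%{time_namelookup} %{time_total}\n' http://www.example.com/

curl -o /dev/null -s -w '%{time_namelookup} %{time_total}\n' \
    -H 'Host: www.example.com' http://192.0.2.1/

The difference in time_total between the two runs approximates the DNS lookup cost, with all the caveats above.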

ISP bandwidth

With ISP networks, the number we really care about is the user's effective bandwidth, and it isn't hard to measure this. We use the following procedure:
  1. Place resources of known fixed sizes on a CDN
  2. Make sure these resources are served with no-cache headers
  3. Using javascript, download these resources from the client machine and measure the time each one takes (there's a sketch of this below)
  4. Discard the first resource since it also pays the price of a DNS lookup and TCP slow start
  5. Use resources of different sizes to handle very slow and very fast connections.
The number we get will be affected by whatever else the user is doing on the network. For example, if they're streaming video at the same time, the bandwidth we measure will be lower than it should be, but we take what we can get.
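Here's a minimal javascript sketch of step 3. The image path and its size are invented for the example, and a real implementation would need error handling plus the multiple sizes from step 5:

var start = new Date().getTime();
var img = new Image();
img.onload = function() {
    var secs = (new Date().getTime() - start) / 1000;
    // 32768 bytes * 8 bits over the elapsed seconds gives bits per second
    var bps = (32768 * 8) / secs;
    alert(Math.round(bps / 1024) + ' kbps');  // in reality, beacon this back
};
// a random query string defeats any caching the no-cache headers missed
img.src = 'http://cdn.example.com/images/32k.gif?t=' + Math.random();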

CDNs

Now, to measure bandwidth, we need to get that resource relatively close to the user so that the rest of the internet doesn't affect the measurement. That's where CDNs come in, and measuring a CDN's performance is somewhat similar.

We could always use a tool like Gomez or Keynote to do the measurement for us, or hack up a solution ourselves in Javascript. We need to figure out three things:
  1. The IP of the CDN closest to the user
  2. The user's geo-location which you can figure out from their IP address
  3. The time it takes to download a resource of known size from this CDN
It's that first one that's a toughie, but the simplest way to figure it out is to just ask your CDN provider. Most CDNs also provide you with their own performance reports.

Page content and user interaction

YSlow, Show Slow, Page Speed and Web Page Test are good tools for measuring and analysing the performance of your page content. They can measure and analyse your page from your development environment and suggest improvements. They do not, however, measure real user perceived performance; that is something we can do with Javascript.

We primarily need to measure the time it takes to download a page and all its components. Additionally, we may want to time how long certain user interactions with the page take. All of these can be accomplished by reading the Date() object in javascript at the correct start and end times. What those start and end times are depends on your application, but we'll look at one possible implementation in a later post. Once you have the timing information you need, it can be sent back to your server using a javascript beacon. We'll go into more detail about this in a later post as well.
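As a small taste of what's coming, a bare-bones version looks something like this. The beacon URL and parameter name are invented for the example, and note that it measures from when the script first runs, not from when the user clicked the link; closing that gap is one of the details for the later post:

var start = new Date().getTime();
window.onload = function() {
    var t_done = new Date().getTime() - start;
    // a 1x1 image request is the simplest form of javascript beacon
    new Image().src = '/beacon.gif?t_done=' + t_done;
};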

This post has already run longer than I'd hoped for, so I'm going to stop here and will continue next time.

...===...