# The other side of the moon

/bb|[^b]{2}/
Never stop Grokking

## Wednesday, December 23, 2009

### Where do your site's visitors come from

This hack, which I demoed during my talk at FOSS.IN, was largely written by the audience. I typed it out on screen, but the audience was busy telling me what to type. The primary requirement was to find out where the visitors to your website come from. The secondary requirement was that, starting from scratch, we had to come up with a solution in five minutes. Given the medium of discussion (a large auditorium with a few hundred people) and the number of wisecracks, we probably went over the five minute limit, but that's okay.

So, we start by looking at the web access log to see what it looks like. Remember, we're starting from scratch, so we have no idea how to solve the problem yet.
```
$ head -1 access.log
65.55.207.47 - - [30/Nov/2009:00:39:50 -0800] "GET /robots.txt HTTP/1.1" 200 297 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
```

This tells us that the IP address is the first field in the log, and that's probably the best indicator of who a user is. We now use cut to pull out only the first field:

```
$ head -1 access.log | cut -f1 -d' '
65.55.207.47
```

Now I don't really care about the IP itself, but the subnet, so I'll just pull out the first three parts of the IP address (I think Tejas came up with this):

```
$ head -1 access.log | cut -f1-3 -d.
65.55.207
```

And before anyone tells me that the -1 usage of head is deprecated, I know, but it's a hack. Now, I want to do this for my entire log file (or a large enough section of it), and I want to know how many hits I get from each subnet. The audience came up with using sort and uniq to do this:

```
$ cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10
    141 216.113.168
    106 88.211.24
     80 91.8.88
     79 78.31.47
     69 199.125.14
     64 216.145.54
     62 173.50.252
     58 193.82.19
     57 82.69.13
     56 198.163.150
```

Now, I don't know about you, but I can't just look at an IP address and tell where it's from. I need something in English. The audience came up with whois to do this, but before we could use it, we had to figure out how. We ran it on the first IP address up there:

```
$ whois 216.113.168.0
OrgName:    eBay, Inc
OrgID:      EBAY
Address:    2145 Hamilton Ave
City:       San Jose
StateProv:  CA
PostalCode: 95008
Country:    US

NetRange:   216.113.160.0 - 216.113.191.255
CIDR:       216.113.160.0/19
NetName:    EBAY-QA-IT-1
NetHandle:  NET-216-113-160-0-1
Parent:     NET-216-0-0-0-0
NetType:    Direct Assignment
NameServer: SJC-DNS1.EBAYDNS.COM
NameServer: SMF-DNS1.EBAYDNS.COM
NameServer: SJC-DNS2.EBAYDNS.COM
Comment:
RegDate:    2003-05-09
Updated:    2003-10-17

OrgTechHandle: EBAYN-ARIN
OrgTechName:   eBay Network
OrgTechPhone:  +1-408-376-7400
OrgTechEmail:  network@ebay.com

# ARIN WHOIS database, last updated 2009-12-22 20:00
# Enter ? for additional hints on searching ARIN's WHOIS database.
```

We only care about the OrgName field, so we could grep that out. Since I also wanted to strip off the "OrgName:" label, I used sed instead:

```
$ whois 216.113.168.0 | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'
eBay, Inc
```

This gives me what I want, but how do I pass the output of the earlier pipeline to this one? Most people suggested I use xargs, but that would either pass the count as well, or lose the count completely. I wanted both. Gabin suggested that I use read in a loop:

```
$ cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
  while read count net; do \
      whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'; \
  done
eBay, Inc
RIPE Network Coordination Centre
```


I've only limited it to 2 entries this time so that the test doesn't take too long.
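As an aside, the reason read beats xargs here is that read splits the line on whitespace, drops the leading padding that uniq -c adds, and puts the first word in the first variable and the rest in the second. A quick standalone check:

```shell
# read splits on $IFS: leading spaces are dropped, the first word goes
# into count, and the remainder of the line goes into net
echo "    141 216.113.168" | while read count net; do
    echo "count=$count net=$net"
done
# prints: count=141 net=216.113.168
```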

Finally, in order to print out the count before the network owner, I pipe the output to awk. Most people suggested I just use echo, but I prefer numbers right-aligned the way printf formats them. One gotcha: $net has to land inside quotes in the awk program, otherwise awk tries to evaluate something like 216.113.168 as numbers and mangles it:

```
$ cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
  while read count net; do \
      whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
      awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, \"$net\", \$0);}"; \
  done
 141    216.113.168.x   eBay, Inc
 106    88.211.24.x     RIPE Network Coordination Centre
```

Note that we use double quotes for awk, and escape a bunch of things inside. This is so that we can use the shell variables $count and $net as-is in the awk script. We can also accomplish this using the -v option to awk, but no one came up with it at the time.
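For the record, here's what the -v version would look like. This is a sketch of the alternative, not what we ran on stage; with -v the values arrive in awk as ordinary string variables, so the program can live in single quotes with no escaping at all. I've substituted a canned echo for the whois | sed pipeline so you can try it without network access:

```shell
# Hedged sketch: canned input stands in for the whois | sed lookup.
count=141
net=216.113.168
echo "eBay, Inc" | \
    awk -v count="$count" -v net="$net" \
        '{printf("%4d\t%s.x\t%s\n", count, net, $0);}'
# prints: " 141<TAB>216.113.168.x<TAB>eBay, Inc"
```

Because -v assigns net as a string, awk never tries to parse the dotted quad as a number, which sidesteps the quoting gymnastics entirely.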

Finally, run this on the top 10 IP blocks, and we get:
```
$ cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10 | \
  while read count net; do \
      whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
      awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, \"$net\", \$0);}"; \
  done
 141    216.113.168.x   eBay, Inc
 106    88.211.24.x     RIPE Network Coordination Centre
  80    91.8.88.x       RIPE Network Coordination Centre
  79    78.31.47.x      RIPE Network Coordination Centre
  69    199.125.14.x    InfoUSA
  64    216.145.54.x    Yahoo! Inc.
  62    173.50.252.x    Verizon Internet Services Inc.
  58    193.82.19.x     RIPE Network Coordination Centre
  57    82.69.13.x      RIPE Network Coordination Centre
  56    198.163.150.x   Red River College of Applied Arts, Science and Technology
```

That's it. The hack requires network access for whois to work, and may be slow depending on how long whois lookups take for you. It also doesn't care about the class of the IP block, and just assumes that everything is class C, but it works well enough.
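If the whois round-trips bother you, one easy tweak (my own addition, not something from the talk) is to memoise the lookups in a shell function, so each distinct subnet hits the network at most once. owner_of and the cache path here are names I made up for this sketch:

```shell
# Hypothetical helper: cache the OrgName for each subnet under /tmp so
# repeated subnets don't trigger repeated whois calls.
owner_of() {
    net="$1"
    cachefile="${TMPDIR:-/tmp}/whois-cache.$net"
    if [ ! -f "$cachefile" ]; then
        whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' > "$cachefile"
    fi
    cat "$cachefile"
}
```

Drop owner_of "$net" into the loop in place of the whois | sed pipeline, and a second run over the same log should come back almost instantly.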

I also have no idea why my site is so popular with eBay.