[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Wednesday, December 23, 2009

Where do your site's visitors come from

This hack, which I demoed during my talk at FOSS.IN, was largely written by the audience. I typed it out on screen, but the audience was busy telling me what I should type. The primary requirement was to find out where the visitors to your website came from. The secondary requirement was that, starting from scratch, we had to come up with a solution in five minutes. Given the medium of discussion (a large auditorium with a few hundred people), and the number of wisecracks, we probably went over the five-minute limit, but that's okay.

So, we start by looking through the web access log to find out what it looks like. Remember, we're starting from scratch, so we have no idea how to solve the problem yet.
head -1 access.log

65.55.207.47 - - [30/Nov/2009:00:39:50 -0800] "GET /robots.txt HTTP/1.1" 200 297 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)" 
This tells us that the IP address is the first field in the log, and that's probably the best indicator of who a user is.

We now use cut to pull out only the first field.
head -1 access.log | cut -f1 -d' '

65.55.207.47
Now I don't really care about the IP itself, but the subnet, so I'll just pull out the first three parts of the IP address (I think Tejas came up with this):
head -1 access.log | cut -f1-3 -d.

65.55.207
And before anyone tells me that the -1 usage of head is deprecated, I know, but it's a hack.

Now, I want to do this for my entire log file (or a large enough section of it), and I want to know how many hits I get from each subnet. The audience came up with using sort and uniq to do this:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10

 141 216.113.168
 106 88.211.24
  80 91.8.88
  79 78.31.47
  69 199.125.14
  64 216.145.54
  62 173.50.252
  58 193.82.19
  57 82.69.13
  56 198.163.150
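If you want to try the counting step without a real log, here is a minimal sketch that wraps the same pipeline in a function and runs it on a few made-up log lines. The function name top_subnets and the sample data are assumptions for illustration, not part of the original hack.

```shell
# Hypothetical helper: print the N most frequent three-octet prefixes in an
# access log, most frequent first. Assumes the client IP starts each line.
top_subnets() {
    log=$1
    n=${2:-10}
    cut -f1-3 -d. "$log" | sort | uniq -c | sort -nr | head -n "$n"
}

# Made-up sample data so the sketch runs without a real access.log
sample=$(mktemp)
cat > "$sample" <<'EOF'
216.113.168.10 - - [30/Nov/2009:00:39:50 -0800] "GET / HTTP/1.1" 200 297 "-" "-"
216.113.168.11 - - [30/Nov/2009:00:39:51 -0800] "GET /a HTTP/1.1" 200 297 "-" "-"
88.211.24.5 - - [30/Nov/2009:00:39:52 -0800] "GET /b HTTP/1.1" 200 297 "-" "-"
216.113.168.10 - - [30/Nov/2009:00:39:53 -0800] "GET /c HTTP/1.1" 200 297 "-" "-"
EOF

top_subnets "$sample" 2
```

The sort before uniq -c matters because uniq only collapses adjacent duplicate lines. The width of the count column differs between uniq implementations, so don't rely on the exact padding.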
Now, I don't know about you, but I can't just look at an IP address and tell where it's from. I need something in English. The audience came up with whois to do this, but before we could use it, we had to figure out how. We ran it on the first IP address up there:
whois 216.113.168.0

OrgName:    eBay, Inc
OrgID:      EBAY
Address:    2145 Hamilton Ave
City:       San Jose
StateProv:  CA
PostalCode: 95008
Country:    US

NetRange:   216.113.160.0 - 216.113.191.255
CIDR:       216.113.160.0/19
NetName:    EBAY-QA-IT-1
NetHandle:  NET-216-113-160-0-1
Parent:     NET-216-0-0-0-0
NetType:    Direct Assignment
NameServer: SJC-DNS1.EBAYDNS.COM
NameServer: SMF-DNS1.EBAYDNS.COM
NameServer: SJC-DNS2.EBAYDNS.COM
Comment:
RegDate:    2003-05-09
Updated:    2003-10-17

OrgTechHandle: EBAYN-ARIN
OrgTechName:   eBay Network
OrgTechPhone:  +1-408-376-7400
OrgTechEmail:  network@ebay.com

# ARIN WHOIS database, last updated 2009-12-22 20:00
# Enter ? for additional hints on searching ARIN's WHOIS database.
We only care about the OrgName parameter, so grep that out. Since I also wanted to strip out "OrgName:", I used sed instead:
whois 216.113.168.0 | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'

eBay, Inc
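The sed expression can be checked offline by feeding it a canned excerpt of the whois output shown above. This is only a sketch; the function name org_name is made up for illustration.

```shell
# Hypothetical helper: read whois-style text on stdin and print only the
# value of the OrgName line, with the label and padding stripped.
org_name() {
    sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'
}

# Canned excerpt of the whois output, so no network access is needed
printf 'OrgName:    eBay, Inc\nOrgID:      EBAY\nCity:       San Jose\n' | org_name
```

The -n flag stops sed from printing every line, and the p inside the braces prints only the lines that matched after the substitution has run.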
This gives me what I want, but how do I pass the output of the earlier pipeline to this one? Most people suggested I use xargs, but that would either pass the count as well, or lose the count completely. I wanted both. Gabin suggested that I use read in a loop:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
    while read count net; do \
        whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}'; \
    done

eBay, Inc
RIPE Network Coordination Centre

I've only limited it to 2 entries this time so that the test doesn't take too long.
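To see how read splits each line into the two variables, here is a small sketch that swaps whois for a stub so it runs without a network. The names fake_whois and annotate, and the organisation names the stub prints, are placeholders invented for this example.

```shell
# fake_whois is a made-up stand-in for whois so the loop runs offline;
# the organisation names it prints are placeholders, not real lookups.
fake_whois() {
    case $1 in
        216.113.168.0) echo "OrgName:    Example Org A" ;;
        *)             echo "OrgName:    Example Org B" ;;
    esac
}

# annotate (hypothetical name) reads "count subnet" lines on stdin.
# read splits on whitespace and drops the leading spaces that uniq -c adds.
annotate() {
    while read count net; do
        org=$(fake_whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}')
        echo "$count $net $org"
    done
}

printf ' 141 216.113.168\n 106 88.211.24\n' | annotate
```

Because both the count and the subnet are held in variables inside the loop, neither is lost, which is the problem xargs had.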

Next, in order to print out the count before the network owner, I pipe the output to awk. Most people suggested I just use echo, but I prefer numbers formatted right aligned the way printf does it:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -2 | \
    while read count net; do \
        whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
            awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, \"$net\", \$0);}"; \
    done

 141    216.113.168.x   eBay, Inc
 106    88.211.24.x     RIPE Network Coordination Centre
Note that we use double quotes for awk, and escape a bunch of things inside. This is so that we can use the shell variables $count and $net in the awk script. $net also needs its own escaped quotes so that awk treats it as a string; left bare, awk reads 216.113.168 as the two numbers 216.113 and .168, joins them, and prints 216.1130.168. We can also accomplish this using the -v option to awk, but no one came up with it at the time.
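For comparison, the -v route might look something like the sketch below. The function name format_row is hypothetical, and the organisation name is echoed in rather than looked up, so this runs offline.

```shell
# Hypothetical helper: format one result row. count and net are handed to
# awk with -v, so the awk program can stay in single quotes with no escaping.
format_row() {
    awk -v count="$1" -v net="$2" '{ printf("%4d\t%s.x\t%s\n", count, net, $0); }'
}

# The organisation name arrives on stdin, as it would from the sed step
echo "eBay, Inc" | format_row 141 216.113.168
```

A side benefit is that the subnet reaches awk as a string value rather than as program text, so awk never tries to parse 216.113.168 as numbers.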

Finally, run this on the top 10 IP blocks, and we get:
cat access.log | cut -f1-3 -d. | sort | uniq -c | sort -nr | head -10 | \
    while read count net; do \
        whois "$net.0" | sed -ne '/^OrgName: */{s/^OrgName: *//;p;}' | \
            awk "{printf(\"%4d\\t%s.x\\t%s\\n\", $count, \"$net\", \$0);}"; \
    done

 141    216.113.168.x   eBay, Inc
 106    88.211.24.x     RIPE Network Coordination Centre
  80    91.8.88.x       RIPE Network Coordination Centre
  79    78.31.47.x      RIPE Network Coordination Centre
  69    199.125.14.x    InfoUSA
  64    216.145.54.x    Yahoo! Inc.
  62    173.50.252.x    Verizon Internet Services Inc.
  58    193.82.19.x     RIPE Network Coordination Centre
  57    82.69.13.x      RIPE Network Coordination Centre
  56    198.163.150.x   Red River College of Applied Arts, Science and Technology
That's it. The hack requires network access for whois to work, and may be slow depending on how long whois lookups take for you. It also doesn't care about the class of the IP block, and just assumes that everything is class C, but it works well enough.

I also have no idea why my site is so popular with eBay.

Monday, December 21, 2009

BBC headlines as Flickr Photos (with YQL)

One of the examples I showed during my keynote at FOSS.IN was a 10 minute hack to display the BBC's top news headlines as photos. In the interest of time, I wrote half of it in advance and only demoed the YQL section. In this post, I'll document everything. Apologies for the delay in getting this out, I've fallen behind on my writing with an injured left wrist.

BBC

To start with, we get the URL of the headlines feed. We can get this from the BBC news page at news.bbc.co.uk. The feed URL we need is:
http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml

YQL

Now to use this in YQL, we go to the YQL console at http://developer.yahoo.com/yql/console/ and enter the query into the text box:
SELECT * From rss Where url='http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml'
Click the TEST button, and you get your results in the results view. Not terribly useful yet, since you'd get that from the RSS anyway, but since all we're interested in is the title, we change the * to title:
SELECT title From rss Where url='http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml'
This gives us a more readable list with less extra information.

Flickr

For step 2, we first figure out how to use the flickr API. This is much easier because it's listed on the right in the Data Tables section under the flickr subsection. The table we want is flickr.photos.search. Click on this, and the query shows up in the YQL console:
SELECT * From flickr.photos.search Where has_geo="true" and text="san francisco"
Since we don't really care whether the photos have geo information for this example, we can drop the has_geo="true" part (or leave it in if you like). The predefined search only looks for a single term, but we're looking for multiple terms, so we'll change the query from text="san francisco" to text IN ("san francisco", "new york", "paris", "munich") and see what we get:
SELECT * From flickr.photos.search Where text IN ("san francisco", "new york", "paris", "munich")
Notice the diagnostics section of the XML results. This tells us that YQL made 4 API calls to flickr and asked for 10 results from each. Since we only care about one result per term, we tell YQL to only request a single result from each call to the flickr API:
SELECT * From flickr.photos.search(1) Where text IN ("san francisco", "new york", "paris", "munich")
Note the parenthesised 1 after the table name. We now get only one result for each of the cities we've specified.

Mashup

Now to put it all together, we use the SQL-like sub-select. This allows us to pass the result set of the first query in as an operand in the second query:
SELECT * From flickr.photos.search(1) Where text IN 
    (SELECT title From rss Where url='http://newsrss.bbc.co.uk/rss/newsonline_world_edition/front_page/rss.xml')
This gets us at most one photo result for each headline in the RSS feed. It's possible that some headlines return 0 results, and that can't be helped.

REST

So how do we use this in our code? YQL gives us a nice REST API to access the data. The URL you need is listed to the right in the box titled The REST query. Now personally, I prefer JSONP, because it's easier to use from JavaScript, so I switch the view to JSON and enter the name of a callback function, showphotos. Click TEST again, and then copy the URL out of the REST box. It should be something like this (I've added line breaks for readability):
http://query.yahooapis.com/v1/public/yql?
    q=
        SELECT%20*%20From%20flickr.photos.search(1)%20
            Where%20text%20IN%20
                (SELECT%20title%20From%20rss%20
                    Where%20url%3D
                        'http%3A%2F%2Fnewsrss.bbc.co.uk%2Frss%2Fnewsonline_world_edition%2Ffront_page%2Frss.xml'
                )
    &format=json
    &callback=showphotos
We can use this in our HTML file by setting it as the src attribute of a <script> tag:
<script src="http://query.yahooapis.com/v1/public/yql?.....&format=json&callback=showphotos">
</script>
Before we can do that though, we need to define the showphotos function and the HTML that will hold the photos. This is what it looks like:
<script type="text/javascript">
function showphotos(o)
{
        var ul = document.createElement('ul');
        for(var i=0; i<o.query.results.photo.length; i++) {
                var photo = o.query.results.photo[i];
                var url = 'http://farm' + photo.farm + '.static.flickr.com/'
                        + photo.server + '/'
                        + photo.id + '_' + photo.secret + '_s.jpg';

                var li = document.createElement('li');
                var img = document.createElement('img');
                img.src=url;
                img.alt=photo.title;
                img.height=img.width=75;

                li.appendChild(img);
                ul.appendChild(li);
        }

        document.body.innerHTML = '';
        document.body.appendChild(ul);
}
</script>
It creates a LI for each photo, puts all the LIs into a UL, and puts the UL into the document. You can style it any way you like.

Have a look at the full code in action. You could modify it to draw a badge, to update itself every 10 minutes, to link back to the original photos, or anything else. If the photos have geo information, you could display them on a map as well.

Improvements

It would be really cool if we could print out the news headline along with the photo. Perhaps even pull out the geo location of the story using the Placemaker API and plot that on a map. However, this is the limit of this hack. It's taken me more time to write this post than it did to write the hack.

Short URL: http://tr.im/bbcflickr

Thursday, December 10, 2009

The keynote incident

Act I, Scene I - The News

I was sitting at La Luna Cafe in Cambridge sipping on Apple Cider, hacking random stuff and waiting for S to be done with work so we could head to the airport. That's when Atul pinged me and asked if it was okay to change the time slot of my talk. I said, yeah, no problem, just let me know the new time. He said, "6pm, keynote slot". I wasn't sure how to react. I told him that I'd never given a keynote before, and in parallel, my brain was telling me that this was true at some point for every keynote speaker ever.

I wasn't sure what to do apart from panicking. Did I have to approach the talk differently, did I have to put more into it? Did I need to speak for more than the 30 minutes that I'd already planned on? Did I need more than 5 slides? I write this now as a reminder and in the hope that it helps someone else.

Act II, Scene I - Bangalore

I got into Bangalore early on the morning of December 2nd and checked into the hotel. I'd slept well on the plane, so a few hours more in the hotel left me ready for the conference. My first task was to get myself a local phone number and then find the conference venue. About an hour later I was standing in the NIMHANS hospital wondering where the convention centre was. A security guard pointed me in the right direction where I saw familiar faces at the door. Tejas handed me my speaker badge as I got in.

Act II, Scene II - Interviewing

At this point, I still had the structure of the talk I'd planned to do before Atul gave me the news. Part of it involved speaking to hackers at FOSS.IN about their experiences, so I got started with that as soon as I got in. I spoke to Pradeepto, Tarique, Tejas, Kartik, James Morris, Siddhesh, Vinayak, Anant and many others about their experiences with hacking. I was a little relieved to know that their ideas always matched mine. At this point I was starting to lose some of the doubt I had.

I spent some time attending talks and checking out the workouts in the hacker area. This was the same old conference I knew.

Act II, Scene III - The first Keynote

Harald did the first keynote I attended, and the few things that struck me were that he'd used LaTeX (beamer with the Warsaw theme), he had a lot of content on his slides, and he really knew what he was talking about. I sat up in the balcony to take notes. It only served to scare me.

Act III, Scene I - Panic

I'll jump now to the next keynote by Milosch Meriac. Milosch is a hardware hacker, and like in the last keynote, a few things struck me. He used LaTeX with the Warsaw theme for beamer, he had a lot of content on his slides, and he really really knew his stuff. His talk awed the audience and left them wanting more, which resulted in a follow-up workout on hardware hacking. It was turning out to be impossible to match the quality of the keynotes already delivered, and I had less than 24 hours left to get my act together.

Act IV, Scene I - There is no spoon

I was in early on the 4th. I sat through several talks, all of which were excellent. I thought back on the talks that I'd attended on the previous two days and it hit me that any one of these people could have done the same talk at 6pm and would have been great keynote speakers. Many of them also used LaTeX. I sort of joked on Twitter that maybe if I used LaTeX, then I'd have a good talk as well. Then @artagnon replied saying that what I said was more important than my slides. He also helped out with some LaTeX formatting. I'd decided to use LaTeX for my presentation, not because the other presenters did, but because I didn't know LaTeX, and this conference seemed like a good excuse to learn it.

Act IV, Scene II - The photography BoF

At 5pm, Kalyan Varma and James Morris had a photography BoF in the largely unused speaker area. The space was mostly dark for best effect of the photos they were showing off. I sat in on it, and threw out everything I'd planned on talking about. Taking a cue from all the speakers, I decided to talk about what I knew best. My slides were unimportant. I had a few points that I wanted to cover, and made a note of those lest I forget (which I often did) and I had a lot of code that I wanted to show off, some of which I hadn't tried before, but I had a few hundred hackers in the room to help me with. As mentioned in the talk abstract, this talk was meant to be a hack, and that's the attitude I decided to go in with.

Act IV, Scene III - Shining lights

As I stood on the stage and tried to plug the VGA cable into my laptop, my hands were shaking. Not sure if anyone in the audience noticed though. I was nervous. Should I wear my hat and shield my eyes from the light or take off the hat and let my forehead shine? I went without the hat. It didn't matter. The talk had its ups and downs. I missed a few points that I thought I should cover. I lost track of where I was in the little list of points I'd made, and I stopped blank a few times. I also forgot to give away the T-shirts that I had. Where it went the smoothest though, was when I was doing what I like best - hacking. Whenever there was code to demo, I was excited and I had real time feedback from everyone in the audience.

Act V, Endgame

I didn't get a chance to read the comments on twitter until late that night. They were mostly good, and the few that were critical were justified. The next day many people came up to me and told me that they enjoyed the talk hack. If they all end up hackers, that's what will be the real win.

Now it may seem from what I've said that the talk was largely spontaneous. The fact is that I wrote down a script a few times and threw it away. I rehearsed by recording myself speak and was aghast with the playback. I was forcing myself to change my talk simply because it had a new label. That was the only time I faltered, and it led me down the wrong path. A lot of people helped me get back on track, so in honesty, I only presented the keynote. It was hacked up by many many people at FOSS.IN.

For those who are interested, my slides are up on slideshare, but if you've been paying attention, you don't really need them.

Short URL: http://tr.im/fossdotinkeynote
