[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Sunday, February 28, 2010

Convert IP to Geo info using YQL

For my missing kids hack, I needed to convert an IP address to a US or Canadian 2 letter state code. This should have been pretty straightforward, but it turned out to require a little more effort than I initially wanted to put in.

First, the easy way. Rasmus Lerdorf has a web service that takes in an IP address and based on the MaxMind data, returns a bunch of information including the country and state/region code. I initially decided to use this. His example page is pretty self-explanatory, so I won't re-document it here. The problem is that this service was really slow and increased page load time a lot, so I scrapped the idea.

I then started looking through YQL. YQL has a whole bunch of geo stuff, but nothing that specifically turns an IP address into a WoEID or a country/state code. I then looked at the community supported tables and found the ip.location table that uses the ipinfodb.com wrapper around the MaxMind database. This returned everything I needed, but the only problem was that the state was returned as a string rather than a two character code. This is the query:
SELECT * From ip.location Where ip=@ip
The output looks like this:
{
 "query":{
  "count":"1",
  "created":"2010-02-28T01:24:30Z",
  "lang":"en-US",
  "updated":"2010-02-28T01:24:30Z",
  "uri":"http://query.yahooapis.com/v1/yql?q=select+*+from+ip.location+where+ip%3D%27209.117.47.253%27",
  "results":{
   "Response":{
    "Ip":"209.117.47.253",
    "Status":"OK",
    "CountryCode":"US",
    "CountryName":"United States",
    "RegionCode":null,
    "RegionName":null,
    "City":null,
    "ZipPostalCode":null,
    "Latitude":"38",
    "Longitude":"-97",
    "Timezone":"-6",
    "Gmtoffset":"-6",
    "Dstoffset":"-5"
   }
  }
 }
}
Now it's pretty trivial to build an array that maps from state name to state code, but I'd have to keep growing that as I added support for more countries, so I decided against that route. Instead I started looking at how I could use the geo APIs to turn this information into what I wanted. Among other things, the data returned also contained the latitude and longitude of the location that the IP was in. I decided to do a reverse geo map from the lat/lon to the geo information. The only problem is that the geo API itself doesn't do this for you.

Tom Croucher then told me that the flickr.places API could turn a latitude and longitude pair into a WoEID, so I decided to explore that. This is the query that does it:
SELECT place.woeid From flickr.places
 Where lat=@lat And lon=@lon
Now I could tie the two queries together and get a single one that turns an IP address into a WoEID:
SELECT place.woeid From flickr.places
 Where (lat, lon) IN
   (
      SELECT Latitude, Longitude From ip.location
       Where ip=@ip
   )
This is what the output looks like:
{
 "query":{
  "count":"1",
  "created":"2010-02-28T01:25:34Z",
  "lang":"en-US",
  "updated":"2010-02-28T01:25:34Z",
  "uri":"http://query.yahooapis.com/v1/yql?q=SELECT+place.woeid+From+flickr.places%0A+Where+%28lat%2C+lon%29+IN%0A+++%28%0A++++++SELECT+Latitude%2C+Longitude+From+ip.location%0A+++++++Where+ip%3D%27209.117.47.253%27%0A+++%29",
  "results":{
   "places":{
    "place":{
     "woeid":"12588378"
    }
   }
  }
 }
}
The last step of the puzzle was to turn this WoEID into a country and state code. This I already knew how to do:
SELECT country.code, admin1.code
  From geo.places
 Where woeid=@woeid
country.code gets us the two letter ISO 3166 country code, while admin1.code gets us a code for the local administrative region. For the US and Canada, this is simply the country code followed by a hyphen, followed by the two letter state code. Once I got this information, I could strip out the country code and the hyphen from admin1.code and get the two letter state code.
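The stripping step could be sketched like this in JavaScript. The function name stateCodeFromAdmin1 is made up for illustration, and the sketch assumes admin1.code really has the "country code, hyphen, state code" form described above; anything else yields null.

```javascript
// Hypothetical helper (not from the original hack): derive the two letter
// state code from the country.code and admin1.code values of geo.places.
function stateCodeFromAdmin1(countryCode, admin1Code) {
   if (typeof countryCode !== "string" || typeof admin1Code !== "string") {
      return null;
   }
   var prefix = countryCode + "-";
   // Only strip when admin1.code starts with "<country>-"
   if (admin1Code.indexOf(prefix) !== 0) {
      return null;
   }
   return admin1Code.substring(prefix.length);
}

console.log(stateCodeFromAdmin1("US", "US-KS"));   // KS
```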

My final query looks like this:
SELECT country.code, admin1.code From geo.places
 Where woeid IN
   (
      SELECT place.woeid From flickr.places
       Where (lat, lon) IN
         (
            SELECT Latitude, Longitude From ip.location
             Where ip=@ip
         )
   )
And the output is:
{
 "query":{
  "count":"1",
  "created":"2010-02-28T01:26:32Z",
  "lang":"en-US",
  "updated":"2010-02-28T01:26:32Z",
  "uri":"http://query.yahooapis.com/v1/yql?q=SELECT+country.code%2C+admin1.code+From+geo.places%0A+Where+woeid+IN%0A%28SELECT+place.woeid+From+flickr.places%0A+Where+%28lat%2C+lon%29+IN%0A+++%28%0A++++++SELECT+Latitude%2C+Longitude+From+ip.location%0A+++++++Where+ip%3D%27209.117.47.253%27%0A+++%29%29",
  "results":{
   "place":{
    "country":{
     "code":"US"
    },
    "admin1":{
     "code":"US-KS"
    }
   }
  }
 }
}
Paste this code into the YQL console, make sure you've selected "Show community tables" and get the REST API from there. It's a terribly roundabout way to get something that should be a single API call, but at least from my application's point of view, I only need to call a single web service. Now if only we could convince the guys at missingkidsmap.com to use WoEIDs instead of state codes, that would make this all a lot easier.
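For anyone who prefers to build the request by hand, here is a rough sketch of assembling the URL and reading the result. The endpoint path, the format parameter and the env value for community tables are assumptions on my part (the console gives you the authoritative URL), and buildYqlUrl and extractCodes are hypothetical names. The response shape is the one shown in the output above.

```javascript
// Assumed endpoint and parameters; copy the real URL from the YQL console.
var YQL_BASE = "http://query.yahooapis.com/v1/public/yql";
var COMMUNITY_ENV = "store://datatables.org/alltableswithkeys";

var FINAL_QUERY =
   "SELECT country.code, admin1.code From geo.places Where woeid IN " +
   "(SELECT place.woeid From flickr.places Where (lat, lon) IN " +
   "(SELECT Latitude, Longitude From ip.location Where ip=@ip))";

// Replace the @ip placeholder with a quoted literal and URL-encode the query.
function buildYqlUrl(query, ip) {
   // Accept only dotted-quad input so nothing else can leak into the query
   if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(ip)) {
      throw new Error("not a dotted-quad IPv4 address: " + ip);
   }
   var q = query.replace(/@ip\b/g, "'" + ip + "'");
   return YQL_BASE + "?q=" + encodeURIComponent(q) +
          "&format=json&env=" + encodeURIComponent(COMMUNITY_ENV);
}

// Pull the two codes out of a decoded response shaped like the one above.
function extractCodes(response) {
   var place = response && response.query && response.query.results &&
               response.query.results.place;
   if (!place || !place.country || !place.admin1) {
      return null;
   }
   return { country: place.country.code, admin1: place.admin1.code };
}

console.log(buildYqlUrl(FINAL_QUERY, "209.117.47.253"));
```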

Have I mentioned how much I like YQL?

Saturday, February 27, 2010

Closures and Function Currying

A few weeks ago, someone emailed me with this question:
Assume i have test.js file with code
function createAdder(x) {
   return function(y) {
      return x + y;
   }
}

var add2 = createAdder(2); //<-------LINE 1
var add5 = createAdder(5); //<-------LINE 2
alert(add2(10));           //<------------LINE 3
alert(add5(10));           //<------------LINE 4
... My doubt: what's actually happening in the LINE 1-4?
I promised him an answer as a blog post, so here it is.

Currying

What we see here is something called function currying. It's common in mathematics and is named after one of its inventors, Haskell Curry. In short, currying converts a single function that takes multiple arguments into a chain of functions, each of which takes a single argument. This is particularly useful when you want to cache the intermediate steps for later reuse with different parameters. The classic example is that of operating on two numbers:
function divide(a) {
   return function(b) {
      return b/a;
   }
}

var divide_2 = divide(2);
var divide_3 = divide(3);

alert(divide_2(6));     // alerts 3
alert(divide_3(6));     // alerts 2
In the above code, divide_2 is a function that takes in one argument and divides it by 2, returning the result. This is a fairly trivial and not very useful example because there are easier ways to divide a number by two. It becomes more useful though, when we need to do a bunch of expensive processing to get to each of the inner results. Consider this code instead:
function hash(salt) {
   // do some expensive processing on salt
   var hash1 = process(salt);
   return function(data) {
      // cheap processing of data with hash1
      return data + hash1;
   };
}

var sign1 = hash(salt1);   // sign1 is a function that signs data with salt1

var signature = sign1(some_data);
In the above code, the outer function does a bunch of expensive processing, and its result is stored in the hash1 variable. This variable is available to the inner function whenever it is called because of the closure that's created. When the inner function is called, it simply uses the value of hash1 without having to redo the processing. Now we could have called process() externally and cached its result, but then hash1 would be exposed. We may not want that, either because hash1 needs to be abstracted away or because its value is sensitive.
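To see the caching in action, here is a runnable variant of the same idea. The process() below is only a stand-in (the real expensive step isn't shown in this post), and the processCalls counter is added purely to show that the expensive step runs once, however many times the inner function is called.

```javascript
var processCalls = 0;

// Stand-in for the expensive step; it just wraps the salt in brackets.
function process(salt) {
   processCalls++;
   return "[" + salt + "]";
}

function hash(salt) {
   var hash1 = process(salt);      // runs once per call to hash()
   return function(data) {
      return data + hash1;         // reuses hash1 from the closure
   };
}

var sign1 = hash("salt1");
var first = sign1("foo");
var second = sign1("bar");

console.log(first, second, processCalls);   // foo[salt1] bar[salt1] 1
```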

Closures

This all works because of closures. In short, a variable will continue to exist as long as code that can see it can be run. In the above cases, the inner functions are returned and references to them are stored in global variables. This makes the lifetime of these inner functions global, i.e., they will exist as long as their new containing scope exists. These functions do not, however, get the new scope, so the variables they can see are exactly the ones they could see when they were defined. In the divide example, the inner function sees the variable a, therefore a will exist for as long as the inner function exists. When we create divide_2, the value of a is set to 2, and this is what the inner function (which is now stored in divide_2) sees. When we create divide_3, a new a is created, this time with value 3, and a new inner function is created (which is now stored in divide_3), and this function sees the new value of a. This is a completely different execution scope from the one created when divide(2) was called. So getting back to the example my friend asked about, this is what happens:
  1. createAdder(2): At this point, the argument x is set to the value 2, and the inner function is returned and stored in add2. This function remembers the value of x and uses it when it has to be called.
  2. createAdder(5): At this point, the argument x is set to the value 5. Note that this is a new invocation of createAdder and does not share the same memory as the first invocation, so these are two completely different variables named x, both living in different scopes.
  3. add2(10): At this point, the first inner function is called with the argument 10, which is stored in y. This function remembers the value of x as 2, computes 2 + 10 and returns the result.
  4. add5(10): The second instance of the inner function is called with the argument 10, which is stored in y. This function remembers the value of x as 5 from when it was created, computes 5 + 10 and returns the result.
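The point in step 2, that each invocation gets its own variable, is easier to see when the inner function modifies what it closes over. A small sketch (createCounter is a made-up name, not part of the question):

```javascript
// Each call to createCounter() creates a fresh `count` variable.
function createCounter(start) {
   var count = start;
   return function() {
      count = count + 1;   // changes only this invocation's count
      return count;
   };
}

var fromTwo = createCounter(2);
var fromFive = createCounter(5);

var r1 = fromTwo();    // 3
var r2 = fromTwo();    // 4
var r3 = fromFive();   // 6, unaffected by the calls to fromTwo

console.log(r1, r2, r3);
```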
Now this whole explanation of closures would not be complete without one more subtle note that most people tend to forget about. The inner functions see the variables that were defined within their containing scope regardless of where in that containing scope they were defined or when their values were set. This means that if you change the value of a variable after defining an inner function, then the inner function will see the new value. Here's an example:
function foo(x) {
   var g=function(y) {
      return x+y;
   };
   x=x*x;
   return g;
}

var add_2 = foo(2);
var add_3 = foo(3);

alert(add_2(5));    // alerts 9
alert(add_3(5));    // alerts 14
Notice that the value of x was changed after the inner function g was defined, yet g sees the new value of x. This is particularly important when you use a loop control variable inside a closure. The closure will see the last value of the loop control variable and not a different value on each iteration.
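Here is a minimal sketch of the loop case. The function names are made up for illustration. The first version closes over the single shared variable i; the second passes the current value through an extra function call so that each closure gets its own copy.

```javascript
// All three closures share the same variable i
function makeWithSharedVar() {
   var fns = [];
   for (var i = 0; i < 3; i++) {
      fns.push(function() { return i; });
   }
   return fns;
}

// An extra function call gives each closure its own n
function makeWithOwnScope() {
   var fns = [];
   for (var i = 0; i < 3; i++) {
      fns.push((function(n) {
         return function() { return n; };
      })(i));
   }
   return fns;
}

function callAll(fns) {
   return fns.map(function(f) { return f(); });
}

var shared = callAll(makeWithSharedVar());   // [3, 3, 3]
var own = callAll(makeWithOwnScope());       // [0, 1, 2]

console.log(shared, own);
```

The shared version returns 3 every time because that is the value of i once the loop has finished.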

Update: 2010-03-04 t3rmin4t0r has a much more useful example of currying on his blog.

Friday, February 19, 2010

Missing kids on your 404 page

It's been a long time since I last posted, and unfortunately I've been unable to churn out a post every week. The month of February has been filled with travel, so I haven't had much time to write.

My report on FOSDEM is up on the YDN blog, so I haven't been completely dormant. I also did some stuff at our internal hack day last week. This post is about one of my hacks.

The idea is quite simple. People land up on 404 pages all the time. 404 pages are pages that have either gone missing, or were never there to begin with. 404 is the HTTP error code for a missing resource. Most 404 pages are quite bland, simply stating that the requested resource was not found, and that's it. Back when I worked at NCST, I changed the default 404 page to use a local site search based on the requested URL. I used the namazu search engine since I was working on it at the time.

This time I decided to do something different. Instead of searching the local site for a missing resource, why not engage the user in trying to find missing kids?

I started by trying to find an API for missingkids.com and ended up finding missingkidsmap.com. This service takes the data from Missing Kids and puts it on a Google map. The cool thing about the service was that it could return data as XML.

Looking through the source code, I found the data URL:
http://www.missingkidsmap.com/read.php?state=CA
The state code is a two letter code for states in the US and Canada. To get all kids, just pass in ZZ as the state code.

The data returned looks like this:
<locations>
   <maplocation zoom="5"
                state_long="-119.838867"
                state_lat="37.370157"/>
   <location id="1"
             firstname="Anastasia"
             lastname=" Shearer "
             picture="img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
             picture2="img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1140669e1.jpg"
             medpic = "img width=60 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
             smallpic="img width=30 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1140669c1.jpg"
             policenum="1-661-861-3110"
             policeadd="Kern County Sheriff\'s Office (California)"
             policenum2=""
             policeadd2=""
             st=" CA"
             city="BAKERSFIELD"
             missing="12/26/2009"
             status="Endangered Runaway"
             age="16"
             url="1140669"
             lat="35.3733333333333"
             lng="-119.017777777778"/>
   ...
</locations>

Now I could keep hitting this URL for every 404, but I didn't want to kill their servers, so I decided to pass the URL through YQL and let them cache the data. Of course, now that I was passing it through YQL, I could also do some data transformation and get it out as JSON instead of XML. I ended up with this YQL statement:
SELECT * From xml
 Where url='http://www.missingkidsmap.com/read.php?state=ZZ'
Pass that through the YQL console to get the URL you should use. The JSON I got back looked like this:
{
   "query":{
      "count":"1",
      "created":"2010-02-19T07:30:44Z",
      "lang":"en-US",
      "updated":"2010-02-19T07:30:44Z",
      "uri":"http://query.yahooapis.com/v1/yql?q=SELECT+*+From+xml%0A+Where+url%3D%27http%3A%2F%2Fwww.missingkidsmap.com%2Fread.php%3Fstate%3DZZ%27",
      "results":{
         "locations":{
            "maplocation":{
               "state_lat":"40.313043",
               "state_long":"-94.130859",
               "zoom":"4"
            },
            "location":[{
                  "age":"7",
                  "city":"OMAHA",
                  "firstname":"Christopher",
                  "id":"Szczepanik",
                  "lastname":"Szczepanik",
                  "lat":"41.2586111111111",
                  "lng":"-95.9375",
                  "medpic":"img width=60 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
                  "missing":"12/14/2009",
                  "picture":"img width=160 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
                  "picture2":"",
                  "policeadd":"Omaha Police Department (Nebraska)",
                  "policeadd2":"",
                  "policenum":"1-402-444-5600",
                  "policenum2":"",
                  "smallpic":"img width=30 border=0 target=_new src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg",
                  "st":" NE",
                  "status":"Missing",
                  "url":"1141175"
               },
               ...
            ]
         }
      }
   }
}

Step 2 was to figure out whether the visitor was from the US or Canada, and if so, figure out which state they were from and pass that state code to the URL.

This is fairly easy to do at Yahoo!. Not so much on the outside, so I'm going to leave it to you to figure it out (and please let me know when you do).
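However you get hold of the visitor's country and state, picking the data URL is the simple part. A sketch, assuming you already have a two letter country code and state code from somewhere (missingKidsUrl is a hypothetical name); anything outside the US and Canada, or anything that doesn't look like a state code, falls back to ZZ:

```javascript
var DATA_URL = "http://www.missingkidsmap.com/read.php?state=";

// Hypothetical helper: choose the state parameter for the data URL.
function missingKidsUrl(countryCode, stateCode) {
   var supported = (countryCode === "US" || countryCode === "CA");
   var valid = typeof stateCode === "string" && /^[A-Z]{2}$/.test(stateCode);
   // ZZ returns all kids, so it's the fallback for everyone else
   return DATA_URL + (supported && valid ? stateCode : "ZZ");
}

console.log(missingKidsUrl("US", "CA"));   // ...read.php?state=CA
console.log(missingKidsUrl("IN", "KA"));   // ...read.php?state=ZZ
```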

In any case, my code looked like this:
$json = http_get($missing_kids_url);
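$o = json_decode($json, true);   // decode into associative arrays
$children = $o['query']['results']['locations']['location'];

// array_rand() returns a random key, so use it to index the array
$child = $children[array_rand($children)];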

print_404($child);
http_get is a function I wrote that wraps around curl_multi to fetch a URL and cache it locally. print_404 is the function that prints out the HTML for the 404 page using the $child data object. The object's structure is the same as each of the location elements in the JSON above. The important parts of print_404 are:
function print_404($child)
{
   $img = preg_replace('/.*src=(.*)/', '$1', $child["medpic"]);
   $name = $child["firstname"] . " " . $child["lastname"];
   $age = $child['age'];
   $since = strtotime(preg_replace('|(\d\d)/(\d\d)/(\d\d\d\d)|', '$3-$1-$2', $child['missing']));
   if($age == 0) {
      $age = ceil((time()-$since)/60/60/24/30);
      $age .= ' month';
   }
   else
      $age .= ' year';

   $city = $child['city'];
   $state = $child['st'];
   $status = $child['status'];
   $police = $child['policeadd'] . " at " . $child['policenum'];

   header('HTTP/1.0 404 Not Found');
?>
<html>
<head>
...
<p>
<strong>Sorry, the page you're trying to find is missing.</strong>
</p>
<p>
We may not be able to find the page, but perhaps you could help find this missing child:
</p>
<div style="text-align:center;">
<img style="width:320px; padding: 1em;" alt="<?php echo $name ?>" src="<?php echo $img ?>"><br>
<div style="text-align: left;">
<?php echo $age ?> old <?php echo $name ?>, from <?php echo "$city, $state" ?> missing since <?php echo strftime("%B %e, %Y", $since); ?>.<br>
<strong>Status:</strong> <?php echo $status ?>.<br>
<strong>If found, please contact</strong> <?php echo $police ?><br>
</div>
</div>
...
</body>
</html>
<?php
}
Add in your own CSS and page header, and you've got missing kids on your 404 page.
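If you'd rather do this client side, the two field transformations in print_404 (pulling the image URL out of medpic and turning the missing date into year-month-day order) could look roughly like this in JavaScript. This is a sketch with made-up function names, not a port of the whole page, and it assumes the fields have the shape shown in the JSON above.

```javascript
// Extract the URL from a value like "img width=60 ... src=http://..."
function imageUrl(imgAttr) {
   var m = /src=(\S+)/.exec(imgAttr);
   return m ? m[1] : null;
}

// Turn "12/14/2009" (month/day/year) into "2009-12-14"
function isoDate(mdy) {
   var m = /^(\d\d)\/(\d\d)\/(\d\d\d\d)$/.exec(mdy);
   return m ? m[3] + "-" + m[1] + "-" + m[2] : null;
}

var medpic = "img width=60 border=0 target=_new " +
             "src=http://www.missingkids.com/photographs/NCMC1141175c1.jpg";

console.log(imageUrl(medpic));
console.log(isoDate("12/14/2009"));   // 2009-12-14
```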

The last thing to do is to tell Apache to use this script as your 404 handler. To do that, put the page (I call it 404.php) into your document root, and put this into your Apache config (or in a .htaccess file):
ErrorDocument 404 /404.php
Restart Apache and you're done.

Update: 2010-02-24 To see it in action, visit a missing page on my website. eg: http://bluesmoon.info/foobar.

Update 2: The code is now on github: http://github.com/bluesmoon/404kids

Update: 2010-02-25 Scott Hanselman has a JavaScript implementation on his blog.

Update: 2010-03-28 There's now a Drupal module for this.
