[philiptellis] /bb|[^b]{2}/
Never stop Grokking


Saturday, November 07, 2009

Favicons on my planet's blogroll

Update: I noticed that some feeds weren't showing favicons even though their sites had them, and it turned out to be because the entire feed was a single line which didn't work with sed. I've changed to use perl instead.

Early last week, Chris Shiflett tweeted about adding favicons to a planet's blogroll for sites that have them. Now I'd considered setting up PlanetPlanet in the past, but had never gotten down to it. Since I was already in the middle of a site redesign, I figured it was a good time to start.

Setting up planet bluesmoon was fairly straightforward. I just followed the instructions in the INSTALL file. I was also very pleased to see that it uses the python implementation of HTML::Template because I'm the author of the Java implementation (also the last Java project I worked on) and am very familiar with the syntax and tricks of the trade.

Once set up, I went back to Chris' site since he'd also mentioned that he'd be posting his favicon code on github. Unfortunately, at this time, the only thing there is the README file, and well, patience is not one of my virtues, so I decided to write my own.

One advantage that I did have though, was Chris' tweets about the process which made a note of all the problems he ran into.

I ended up with this shell script that does a fairly good job, and can be run through cron (although I don't do that). It's made to work specifically with planet's config.ini file, and edits the file in-place to add the icon code. This is how it works:

Translating feed URL to favicon URL

This code pulls out all feed URLs from the file. I'm assuming here that they're all http(s) URLs.
sed -ne '/\[http/{s/[][]//g;p;}' "$file"
For each URL returned, I run this code, which pulls down the feed using curl and then uses perl to extract the home site's URL. It looks for the link element, assuming the feed is in RSS 2.0 or Atom 1.0 format. I could have looked at the content-type header and figured out which it was, but as Chris pointed out, content-type headers are often wrong. The perl code first splits the feed into multiple lines to make it easier to parse.
curl -m 10 -L "$feedurl" 2>/dev/null | \
    perl -ne "
        s/></>\\n</g;
        for (split/\\n/) {
            print \"\$1\\n\" and exit
                if /<link/ && 
                    (/<link>(.*?)<\\/link>/ ||
                        (/text\\/html/ && /alternate/ && /href=['\"](.*?)['\"]/)
                    );
        }
    "
I then pull out the domain from the site's URL. I'll need this if the link to the favicon is a relative URL. Again, I'm assuming http(s) and being a little liberal with my regexes to work in all versions of sed.
domain=`echo "$url" | sed -e 's,\(https*://[^/]*/\).*,\1,'`
base=${url%/*}
Then download the site page and look for a favicon link in there. Favicons are found in link tags with a rel attribute of icon or shortcut icon, so I check for both, again being liberal with my regexes, and when I find it, extract the value of the href attribute. This will break if there are multiple link tags on the same line, but I'll deal with that when I see it.
favicon=$( curl -m 10 -L "$url" 2>/dev/null | \
    perl -ne "
        print \"\$1\\n\" and exit
            if /<link/ && 
               /rel=['\"](?:shortcut )?icon['\"]/ && 
               /href=['\"](.*?)['\"]/;
    " )
If no URL was found, I just appended /favicon.ico to $domain and used that instead. If a relative URL was found, I appended it to either $base or $domain depending on whether the path starts with / or not. This will have trouble if your site URL points to a directory but omits the trailing slash, but shame on you if you do that.
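The fallback and relative-URL rules described above can be sketched as a small function. This is only an illustration: `resolve_favicon` is a hypothetical name, the example.com URLs are made up, and the protocol-relative (`//host/...`) case is an extra that the post itself doesn't mention.

```shell
#!/bin/sh
# Hypothetical helper, not part of the original script.
# $1 = site URL, $2 = href found in the page (may be empty).
resolve_favicon() {
    url=$1
    href=$2
    # Same derivation as in the post: scheme + host, and the URL minus its last segment.
    domain=$(printf '%s\n' "$url" | sed -e 's,\(https*://[^/]*/\).*,\1,')
    base=${url%/*}
    case "$href" in
        "")                 printf '%s\n' "${domain%/}/favicon.ico" ;;  # nothing found: default location
        http://*|https://*) printf '%s\n' "$href" ;;                    # already absolute
        //*)                printf '%s\n' "${url%%:*}:$href" ;;         # protocol-relative (assumed extra case)
        /*)                 printf '%s\n' "${domain%/}$href" ;;         # relative to the site root
        *)                  printf '%s\n' "$base/$href" ;;              # relative to the page's directory
    esac
}

resolve_favicon 'http://example.com/blog/index.html' ''              # http://example.com/favicon.ico
resolve_favicon 'http://example.com/blog/index.html' '/img/fav.png'  # http://example.com/img/fav.png
resolve_favicon 'http://example.com/blog/index.html' 'fav.png'       # http://example.com/blog/fav.png
```

Like the original, this still gets the last case wrong when the site URL names a directory without a trailing slash, because `${url%/*}` then strips the directory itself.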

Validating the favicon

Once I had the URL, I still had to check that a favicon actually existed at that location. This is easy with curl's -f flag, which tells it to fail on error. It returns an error code of 22 for a file not found. The problem I faced here is that some sites don't actually return a 404 for missing resources. That was a WTF moment. So I figured I'd just look at the content-type of the returned resource, and discard it if it did not match image/*. However, from Chris's tweets, I already knew that some sites send a favicon with a content type of text/plain or text/html, so I couldn't rely solely on this. Instead, I decided to download each favicon, and if its content-type did not match the image/* pattern, run the file command on it. This command looks up the file's magic numbers and figures out its content type. The result was this code:
name=`echo $domain | sed -e 's,/,-,g'`.ico
params=`curl -L -f -w "%{content_type}\t%{size_download}" -o "icons/$name" "$favicon" 2>/dev/null`
[ $? -ne 0 ] && continue                                  # skip if curl was unsuccessful
ctype=${params%$'\t'*}                                    # curl separates the two fields with a tab
clen=${params##*$'\t'}
[ "$clen" -eq 0 ] && continue                             # skip if favicon was 0 bytes
if ! echo $ctype | grep -q "^image/" &>/dev/null; then
    if file -b "icons/$name" | grep '\<text\>' &>/dev/null; then
        continue;                                         # skip if content type is not image/*
    fi
fi

rm "icons/$name"
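A content-type header can itself contain spaces (`text/html; charset=utf-8`, for instance), so it is safer to split curl's `-w` output on the tab that separates the two fields. The following is a minimal sketch of that split in plain sh. The `params` value is a made-up sample standing in for curl's output, and the `verdict` variable is only there for illustration.

```shell
#!/bin/sh
tab=$(printf '\t')
# Sample value standing in for what curl -w would print; not real output.
params=$(printf 'text/html; charset=utf-8\t318')

ctype=${params%"$tab"*}    # everything before the tab
clen=${params##*"$tab"}    # everything after the tab

# Only the header check is shown here; the file(1) fallback is left out.
case "$ctype" in
    image/*) verdict="image" ;;
    *)       verdict="check with file" ;;
esac
printf '%s | %s | %s\n' "$ctype" "$clen" "$verdict"
```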

Write it back

Now that I knew the correct URL for a site's favicon, I could write this information back to the config.ini file. I decided to use perl for this line (though I could have used perl for the whole script). It reads the file in by paragraph, and if a paragraph matches the feed URL, it first strips out the old favicon line, and then adds the new one in. Since this code only runs if we actually find a favicon, it has the side effect of not updating a favicon that was once valid but now isn't.
perl -pi -e "BEGIN {\$/='';} if(m{^\[\Q$feedurl\E\]}) { s{^icon =.*\n}{}m; s{\n\n$}{\nicon = $favicon\n\n}; }" "$file"
The perl code also assumes a very specific format for the config.ini file. Specifically, everything about a feed must be together with no blank lines in between them, and there needs to be at least one blank line between feed sections. Not hard to maintain this, but it's not a restriction that planet imposes itself.

Adding favicons to the template

Lastly, I needed to add these favicons to the template. Inside the Channels loop, we add this code:
<img src="<TMPL_IF icon><TMPL_VAR icon><TMPL_ELSE>/feed-icon-14x14.png</TMPL_IF>"
     alt="" class="favicon">
The code can go anywhere as long as it's inside the Channels loop. To use it in the Items loop, the variable name should be changed to channel_icon instead.

Et voilà, site favicons on a planet. Now I've just got to get a better generic image for the no-favicon state, since these aren't technically links to feeds.

Update: I'm now using an icon from stdicon.com for the generic favicon.
