Update: I noticed that some feeds weren't showing favicons even though their sites had them, and it turned out to be because the entire feed was a single line which didn't work with sed. I've changed to use perl instead.
Early last week, Chris Shiflett tweeted about adding favicons to a planet's blogroll for sites that have them. Now I'd considered setting up PlanetPlanet in the past, but had never gotten down to it. Since I was already in the middle of a site redesign, I figured it was a good time to start.
Setting up planet bluesmoon was fairly straightforward. I just followed the instructions in the INSTALL file. I was also very pleased to see that it uses the python implementation of HTML::Template because I'm the author of the Java implementation (also the last Java project I worked on) and am very familiar with the syntax and tricks of the trade.
Once set up, I went back to Chris' site since he'd also mentioned that he'd be posting his favicon code on github. Unfortunately, at this time, the only thing there is the README file, and well, patience is not one of my virtues, so I decided to write my own.
One advantage I did have, though, was Chris' tweets about the process, which noted all the problems he ran into.
I ended up with this shell script that does a fairly good job, and can be run through cron (although I don't do that). It's made specifically to work with planet's config.ini file, and edits the file in-place to add the icon code. This is how it works.
Translating feed URL to favicon URL
This code pulls out all feed URLs from the file. I'm assuming here that they're all http(s) URLs.
sed -ne '/\[http/{s/[][]//g;p;}' $file
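To see what that pattern does, here's a minimal sketch run against a made-up config.ini fragment. The feed URLs and the list_feed_urls name are hypothetical; only the sed expression comes from the script.

```shell
# Hypothetical helper: print every feed URL that appears as a [section] header.
# Reads the named file(s), or stdin when no file is given.
list_feed_urls() {
    sed -ne '/\[http/{s/[][]//g;p;}' "$@"
}

# Made-up sample config: one global section and two feed sections.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[Planet]
name = Example Planet

[http://example.com/feed.xml]
name = Example Blog

[https://example.org/atom]
name = Another Blog
EOF

list_feed_urls "$sample"
# prints:
# http://example.com/feed.xml
# https://example.org/atom
```

The [Planet] header is skipped because only lines containing [http are selected.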
For each URL returned, I run this code, which pulls down the feed using curl and then uses perl to extract the home site's URL. It looks for the site link in the feed, assuming the feed is in RSS 2.0 or Atom 1.0 format. I could have looked at the content-type header and figured out which it was, but as Chris pointed out, content-type headers are often wrong. The perl code first splits the feed into multiple lines to make it easier to parse.
# Fetch the feed (10 second timeout, follow redirects) and print the first site link found
curl -m 10 -L "$feedurl" 2>/dev/null | \
    perl -ne "
        s/></>\\n</g;
        for (split/\\n/) {
            print \"\$1\\n\" and exit
                if /<link/ &&
                    (/<link>(.*?)<\\/link>/ ||
                     (/text\\/html/ && /alternate/ && /href=['\"](.*?)['\"]/)
                    );
        }
    "
I then pull out the domain from the site's URL. I'll need this if the link to the favicon is a relative URL. Again, I'm assuming http(s) and being a little liberal with my regexes to work in all versions of sed.
domain=`echo $url | sed -e 's,\(https*://[^/]*/\).*,\1,'`
base=${url%/*}
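As a worked example, here's what those two lines produce for a made-up site URL. The site_domain and site_base wrappers are hypothetical and exist only to make the sketch easy to rerun.

```shell
# Hypothetical wrappers around the two one-liners above.
site_domain() { echo "$1" | sed -e 's,\(https*://[^/]*/\).*,\1,'; }
site_base()   { echo "${1%/*}"; }

url="http://example.com/blog/index.html"    # made-up site URL
site_domain "$url"    # prints http://example.com/
site_base "$url"      # prints http://example.com/blog
```

Note that a bare http://example.com, with no slash after the host, passes through the sed expression unchanged, and ${url%/*} turns it into http:/ since the last slash it finds belongs to the scheme.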
Then download the site page and look for a favicon link in there. Favicons are found in link tags with a rel attribute of icon or shortcut icon, so I check for both, again being liberal with my regexes, and when I find it, extract the value of the href attribute. This will break if there are multiple link tags on the same line, but I'll deal with that when I see it.
favicon=$( curl -m 10 -L "$url" 2>/dev/null | \
    perl -ne "
        print \"\$1\\n\" and exit
            if /<link/ &&
                /rel=['\"](?:shortcut )?icon['\"]/ &&
                /href=['\"](.*?)['\"]/;
    " )
If no URL was found, I just appended /favicon.ico to $domain and used that instead. If a relative URL was found, I appended it to $domain if the path starts with /, and to $base otherwise. This will have trouble if your site URL points to a directory but omits the trailing slash, but shame on you if you do that.
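The script's code for this step isn't shown here, so the following is only a sketch of how those rules might be written. The resolve_favicon name and the sample URLs are made up, and stripping the trailing slash from $domain before appending is an assumption made to avoid a double slash.

```shell
# Hypothetical helper: turn whatever href was found into an absolute favicon URL.
# $1 = href from the page (may be empty), $2 = $domain (ends in /), $3 = $base
resolve_favicon() {
    case "$1" in
        "")                 echo "${2%/}/favicon.ico" ;;   # nothing found: use the default location
        http://*|https://*) echo "$1" ;;                   # already absolute
        /*)                 echo "${2%/}$1" ;;             # starts with /: relative to the domain
        *)                  echo "$3/$1" ;;                # otherwise: relative to the page's directory
    esac
}

domain="http://example.com/"
base="http://example.com/blog"
resolve_favicon ""                 "$domain" "$base"   # prints http://example.com/favicon.ico
resolve_favicon "/img/fav.png"     "$domain" "$base"   # prints http://example.com/img/fav.png
resolve_favicon "fav.png"          "$domain" "$base"   # prints http://example.com/blog/fav.png
```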
Validating the favicon
Now once I had the URL, I still had to validate that a favicon existed at that location. This was done easily using curl with the -f flag, which tells it to fail on error. It exits with code 22 when the server responds with an HTTP error such as 404. The problem I faced here is that some sites don't actually return a 404 for missing resources. That was a WTF moment. So I figured I'd just look for the content-type of the returned resource, and if it did not match image/*, then I'd discard it. However, from Chris's tweets, I already knew that some sites send a favicon with a content type of text/plain or text/html, so I couldn't rely solely on this. Instead, I decided to download each favicon, and if its content-type did not match the image/* pattern, run the file command on it. This command looks up the file's magic numbers and figures out its content type. The result was this code:
name=`echo $domain | sed -e 's,/,-,g'`.ico
# -w prints "content-type<TAB>size" once the download completes
params=`curl -L -f -w "%{content_type}\t%{size_download}" -o "icons/$name" "$favicon" 2>/dev/null`
[ $? -ne 0 ] && continue    # skip if curl was unsuccessful
tab=`printf '\t'`
ctype=${params%%"$tab"*}    # everything before the tab
clen=${params##*"$tab"}     # everything after the tab
[ "$clen" -eq 0 ] && continue    # skip if favicon was 0 bytes
if ! echo "$ctype" | grep -q "^image/"; then
    if file -b "icons/$name" | grep -q '\<text\>'; then
        continue    # skip if content type is not image/* and the file itself looks like text
    fi
fi
rm "icons/$name"
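The decision itself can be separated from the download and tried on canned values. In this sketch keep_favicon is a made-up name, the sample content types and file descriptions are invented, and a plain substring match on "text" stands in for the word-boundary grep used above.

```shell
# Hypothetical helper mirroring the checks described above.
# $1 = Content-Type header, $2 = downloaded size in bytes, $3 = output of `file -b`
keep_favicon() {
    [ "$2" -gt 0 ] || return 1                  # empty download: reject
    case "$1" in image/*) return 0 ;; esac      # server says it's an image: accept
    case "$3" in *text*) return 1 ;; esac       # contents look like text: reject
    return 0                                    # mislabelled, but not text: accept
}

keep_favicon "image/x-icon" 1150 "MS Windows icon resource"  && echo keep || echo skip   # keep
keep_favicon "text/html"    512  "HTML document, ASCII text" && echo keep || echo skip   # skip
keep_favicon "text/plain"   1150 "MS Windows icon resource"  && echo keep || echo skip   # keep
```

The third case is the one the content-type check alone would get wrong: the header claims text/plain, but the file's contents are an icon.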
Write it back
Now that I knew the correct URL for a site's favicon, I could write this information back to the config.ini file. I decided to use perl for this line (though I could have used perl for the whole script). It reads the file in by paragraph, and if a paragraph matches the feed URL, it first strips out the old favicon line, and then adds the new one in. Since this code only runs if we actually find a favicon, it has the side effect of not updating a favicon that was once valid but now isn't.
perl -pi -e "BEGIN {\$/='';} if(m{^\[$feedurl\]}) { s{^icon =.*$}{}m; s{\n\n$}{\nicon = $favicon\n\n}; }" $file
The perl code also assumes a very specific format for the config.ini file. Specifically, everything about a feed must be together with no blank lines in between them, and there needs to be at least one blank line between feed sections. Not hard to maintain this, but it's not a restriction that planet imposes itself.
Adding favicons to the template
Lastly, I needed to add these favicons to the template. Inside the Channels loop, we add this code:
<img src="<TMPL_IF icon><TMPL_VAR icon><TMPL_ELSE>/feed-icon-14x14.png</TMPL_IF>"
alt="" class="favicon">
The code can go anywhere as long as it's inside the Channels loop. To use it in the Items loop, the variable name should be changed to channel_icon instead.
Et voilà, site favicons on a planet. Now I've just got to get a better generic image for the no-favicon state, since these aren't technically links to feeds.
Update: I'm now using an icon from stdicon.com for the generic favicon.