Over at unbiasd, we have a URL canonicalization problem. You see, unbiasd brings together a bunch of news feeds from various websites and publishes selected articles from these feeds on the homepage. For each recently published article, we use Ice Rocket to search for backlinks to the article. This is how we power our “blog reactions” feature.

This works great—except for one problem. The article URLs given in feeds are not always the same URLs that bloggers would be linking to. FeedBurner is an obvious example of this. While some bloggers might mistakenly link to a FeedBurner URL, they probably meant to link to the actual article’s permalink instead.

The solution (in my mind) was simple: do a HEAD request on each URL before doing a search for blog reactions. If the response is a redirect with a Location header, then use the URL it points to rather than the original.
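
Here is a minimal Python sketch of that idea, using only the standard library. The names `head_location` and `canonical_url` are hypothetical and chosen for illustration; this is not the actual unbiasd code. It assumes a single redirect hop is enough, which matches the approach described above.

```python
import http.client
from urllib.parse import urljoin, urlsplit

# Status codes that signal a redirect carrying a Location header.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def head_location(url, timeout=10):
    """Send one HEAD request and return the Location header of a
    redirect response, or None if the response is not a redirect."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # http.client does not follow redirects, so we see the raw response.
        conn.request("HEAD", path)
        response = conn.getresponse()
        if response.status in REDIRECT_STATUSES:
            return response.getheader("Location")
        return None
    finally:
        conn.close()


def canonical_url(url, fetch_location=head_location):
    """Return the redirect target for url if there is one, else url itself.

    fetch_location is injectable so the logic can be exercised without
    touching the network.
    """
    location = fetch_location(url)
    if not location:
        return url
    # Location may be relative; resolve it against the original URL.
    return urljoin(url, location)
```

Calling `canonical_url(feed_item_url)` before the blog search would swap a redirecting feed URL for the URL it points to. Note that the sketch trusts whatever the Location header says, so it has no way of knowing whether the target is really the article.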

This works great for FeedBurner URLs, but there is a problem that I just discovered yesterday. Some sites (including The New York Times) will sometimes redirect you to a login/registration URL before allowing you to view an article. This obviously throws a little hitch into my giddyup: the response to my HEAD request comes back with a Location header pointing to a login page, so I go ahead and do a blog search using the login URL. Sweet.

The more that I deal with real-world web content, the more I realize that it is the wild west of data. Pretty much nothing is normalized. Wild hacks are the norm, because sometimes they are the only things that work. If nothing else, at least this will force me to be seriously pragmatic in my day-to-day coding.