I've just noticed that my SIOCwiki-2-rdf script didn't resolve the SIOC data URL for a few blogs from SIOC-enabled sites. I first thought that the autodiscovery regexp had failed (which was the case for only one blog), but the error actually came from the part of the script that fetches pages.
Indeed, on the wiki page, I mentioned that my blog URL was http://www.apassant.net/blog/. Yet this page now redirects to http://apassant.net/blog using Apache RedirectMatch.
So, when I use HTTP_Request to get the content of the page, it returns not the expected body but this page:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="http://apassant.net/blog/">here</a>.</p>
</body></html>
in which it's difficult to find any reference to a SIOC link...
But HTTP_Request provides a getResponseCode() method, which returns 302 in this case, and a getResponseHeader() method, which gives the following information:
Array
(
    [date] => Fri, 04 Aug 2006 11:10:33 GMT
    [server] => Apache/2.0.55 (Debian) mod_python/3.2.8 Python/2.4.4c0 PHP/5.1.4-0.1
    [location] => http://apassant.net/blog/
    [content-length] => 209
    [connection] => close
    [content-type] => text/html; charset=iso-8859-1
    [x-pad] => avoid browser bug
)
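To make the idea concrete, here is a minimal sketch of how a response code and a header array like the one above could be turned into a redirect target. The function name redirect_location() is hypothetical and not part of HTTP_Request; it only works on plain values, so it can be tried without any network access. Treating 307 and 308 as redirects is my own addition, and the key normalisation is just a precaution, since the array shown above already uses lower-case keys.

```php
<?php
// Hypothetical helper, not part of HTTP_Request: given a response code and
// a header array, return the redirect target, or null if there is none.
function redirect_location($code, array $headers) {
    // 301, 302 and 303 are the codes discussed here; 307 and 308 are an
    // assumption added for completeness.
    $redirects = array(301, 302, 303, 307, 308);
    if (!in_array((int) $code, $redirects, true)) {
        return null;
    }
    // Normalise key case so that 'Location' and 'location' both match.
    $headers = array_change_key_case($headers, CASE_LOWER);
    return isset($headers['location']) ? $headers['location'] : null;
}

// Usage with values taken from the response shown above.
$headers = array(
    'location'       => 'http://apassant.net/blog/',
    'content-length' => '209',
);
var_dump(redirect_location(302, $headers));
```

With a real request, the two arguments would presumably come from $req->getResponseCode() and $req->getResponseHeader().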
So, using the location header, I can get the new location of the page. Yet there are cases where the location contains a relative URL (see http://www.openlinksw.com/blog/~kidehen).
So, finally, here's the code I now use to get the content of a URL, whether it has moved or not:
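As an illustration of the relative case, here is a rough sketch of resolving a location value against the URL that was requested. The name resolve_location() is hypothetical, and the sketch is deliberately simple: it handles absolute, protocol-relative, host-relative and path-relative values, but it does not normalise "../" segments or treat query strings specially, so it is not a complete implementation of URL reference resolution.

```php
<?php
// Hypothetical helper: resolve a Location header value against the URL
// that was requested. A simplified sketch, not full RFC-style resolution.
function resolve_location($base, $location) {
    // Absolute URL: it carries its own scheme, so use it unchanged.
    if (is_string(parse_url($location, PHP_URL_SCHEME))) {
        return $location;
    }
    $parts = parse_url($base);
    // Protocol-relative value ("//host/path"): reuse only the base scheme.
    if (substr($location, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $location;
    }
    // Rebuild scheme://host[:port] from the requested URL.
    $root = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $root .= ':' . $parts['port'];
    }
    // Host-relative value ("/blog/"): append it to the root.
    if (substr($location, 0, 1) === '/') {
        return $root . $location;
    }
    // Path-relative value ("index.rdf"): resolve against the base directory.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $location;
}

// Usage: the example.org URLs are made up for illustration.
var_dump(resolve_location('http://www.apassant.net/blog/', 'http://apassant.net/blog/'));
var_dump(resolve_location('http://example.org:8080/a/b', '/c'));
```

Keeping the resolution in a separate function means it can be checked on its own, without sending any request.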
require_once 'HTTP/Request.php';

function url_get_content($url, $visited = array()) {
    // Stop if this URL was already visited: the redirections form a loop.
    if (in_array($url, $visited)) {
        return "Error: infinite redirection";
    }
    $visited[] = $url;
    $req = new HTTP_Request($url);
    if (PEAR::isError($req->sendRequest())) {
        return "Error: " . $req->getResponseCode();
    }
    if (in_array($req->getResponseCode(), array('301', '302', '303'))) {
        $headers = $req->getResponseHeader();
        $location = $headers['location'];
        // Relative location (assumed to start with "/"): rebuild an
        // absolute URL from the scheme, host and optional port of $url.
        if (!array_key_exists('scheme', parse_url($location))) {
            $parsed = parse_url($url);
            $port = isset($parsed['port']) ? ':' . $parsed['port'] : '';
            $location = $parsed['scheme'] . '://' . $parsed['host'] . $port . $location;
        }
        // Follow the redirection, remembering the URLs seen so far.
        return url_get_content($location, $visited);
    }
    return $req->getResponseBody();
}

I've fixed it in the script, which now returns a complete RDF file with SIOC data URLs for each blog that uses an auto-discovery link.
Edit 09/08/2006 @ 13:30: Fixed a bug about infinite loops, see Richard's comment about this point.