Updating doap:store architecture and rewriting queries
I recently changed the server that runs the various websites I host and maintain, moving from a Core2Duo 2×1.80 GHz with 1 GB of RAM to a Celeron 2.0 GHz with only 256 MB.
The main reason I had such a server was hosting doap:store, as I wanted SPARQL queries to run in a reasonable time. After moving to the new server (as you can guess, for pricing reasons), most of the queries were really slow, even with an optimized MySQL configuration. Some of them even froze the MySQL server, hanging on “Copying to tmp table on disk” steps, which made the website almost unusable.
While I was looking for a more powerful box to host the triplestore, Kingsley Idehen and Openlink kindly offered to host it using Virtuoso Open-Source Edition on EC2. I just made the changes and re-imported the ~4600 RDF files fetched so far (now including doapspace data), and the service is live again, with much better performance for both finding and browsing projects.
So doap:store now runs on:
- an Amazon EC2 server, using Virtuoso to store the RDF data and to provide a SPARQL endpoint for it;
- a Debian GNU/Linux box, using Apache2 and PHP5 to build the interfaces. This one also fetches new project descriptions through Ping The Semantic Web (with a Python cron job) and updates and queries the triple store on the EC2 server.
In the meantime, I optimized some queries by removing unused variables and reordering statements, after reading this technical report about OptARQ. Finally, I took advantage of Virtuoso’s aggregate functions to use count for the project statistics, instead of fetching all graphs / projects and counting them in PHP.
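To illustrate that last point, here is a minimal sketch contrasting the two approaches. It is not doap:store’s actual code: the query text uses SPARQL 1.1 aggregate syntax, which is an assumption (the Virtuoso release of the time may have expected a slightly different form), and `count_in_client` is a hypothetical stand-in for the former PHP-side counting.

```python
# Hypothetical sketch: queries and helper names are illustrative only.

# Old approach: fetch every project, then count the rows in the client.
FETCH_ALL = """
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT DISTINCT ?project
WHERE { ?project a doap:Project }
"""

# New approach: let the store count and return a single row.
COUNT_ONLY = """
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT (COUNT(DISTINCT ?project) AS ?total)
WHERE { ?project a doap:Project }
"""


def count_in_client(rows):
    """Count distinct projects from result rows already sent to the client."""
    return len({row["project"] for row in rows})


if __name__ == "__main__":
    # Example rows as a SPARQL client might return them (made-up URIs).
    rows = [
        {"project": "http://example.org/a"},
        {"project": "http://example.org/b"},
        {"project": "http://example.org/a"},
    ]
    print(count_in_client(rows))
```

With the aggregate form, only one number crosses the network, whereas the first query transfers one row per project before anything can be counted.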
Thanks again to Openlink for hosting the data and for their support!
Tags: doap, doapstore, rdf, scalability, virtuoso
DOAP and the Linked Data Web
Via the LOD mailing list, I discovered and browsed doapspace.org, which aims to build a DOAP repository. Unlike doap:store, which fetches RDF files from the Web thanks to PTSW, one of the nice features of doapspace is that it creates DOAP files for services that do not offer such metadata for their projects, such as SourceForge or FreshMeat.
Rob has already scraped about 45000 DOAP files! In a few days he should be able to ping PTSW to inform it about new files, which can then be crawled into doap:store, browsed… and SPARQLed. Let’s hope my box can support such an amount of data and queries…
As discussed on the mailing list, this service could be a great way to bring DOAP one step further into the Linked Data Web. What’s currently missing is URIs for projects and people. I noticed when creating doap:store that many projects don’t have a URI (they are just blank nodes), which makes it impossible to link to them from external files such as FOAF profiles.
Now, imagine that the doapspace scraper creates URIs for all people and projects, based on the service name and the user / project id on the original service. Using owl:sameAs and rdfs:seeAlso, anyone could add something like this to their FOAF profile:
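<foaf:Person rdf:about="#me">
  <!-- Same person on doapspace (SourceForge account), with a link to its RDF view -->
  <owl:sameAs>
    <rdf:Description rdf:about="http://doapspace.org/user/sf/terraces">
      <rdfs:seeAlso rdf:resource="http://doapspace.org/user/sf/terraces/rdfview"/>
    </rdf:Description>
  </owl:sameAs>
  <!-- Same person on doapspace (FreshMeat account), with a link to its RDF view -->
  <owl:sameAs>
    <rdf:Description rdf:about="http://doapspace.org/user/fm/terraces">
      <rdfs:seeAlso rdf:resource="http://doapspace.org/user/fm/terraces/rdfview"/>
    </rdf:Description>
  </owl:sameAs>
</foaf:Person>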
Then, by reading my FOAF profile with Tabulator or any other Linked Data browser, I can easily browse my profiles on doapspace, which would ideally list all the projects I’m involved in (with their URIs), and I can browse those in turn thanks to seeAlso links.
Another advantage is that I don’t have to list the projects I maintain in my FOAF file; doapspace does the job for me, since owl:sameAs states that these different URIs identify the same person (me). Then, with a single SPARQL query over these files and an engine that supports owl:sameAs, I can easily find all the projects I’m involved in, wherever they come from:
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT ?project
WHERE {
  ?project doap:maintainer <my_foaf_URI>
}
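What “an engine that supports owl:sameAs” has to do can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: triples are kept as in-memory tuples, the project URIs and the `example.org` FOAF URI are made up, and a real SPARQL engine would do this grouping internally rather than through helpers like these.

```python
# Minimal sketch of owl:sameAs-aware matching over in-memory triples.
# All function names and example URIs are hypothetical.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
MAINTAINER = "http://usefulinc.com/ns/doap#maintainer"


def canonical_map(triples):
    """Return a function mapping each URI to one representative of its
    owl:sameAs group (a small union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)  # merge the two groups
    return find


def projects_maintained_by(triples, person):
    """Projects whose doap:maintainer is `person` or any sameAs alias of it."""
    find = canonical_map(triples)
    target = find(person)
    return sorted(s for s, p, o in triples
                  if p == MAINTAINER and find(o) == target)


if __name__ == "__main__":
    me = "http://example.org/foaf.rdf#me"
    sf = "http://doapspace.org/user/sf/terraces"
    fm = "http://doapspace.org/user/fm/terraces"
    triples = [
        (me, SAME_AS, sf),
        (me, SAME_AS, fm),
        ("http://example.org/project/one", MAINTAINER, sf),
        ("http://example.org/project/two", MAINTAINER, fm),
        ("http://example.org/project/other", MAINTAINER, "http://example.org/someone"),
    ]
    for project in projects_maintained_by(triples, me):
        print(project)
```

Asking for the projects of `#me` returns the two projects attached to the doapspace aliases, even though no triple mentions `#me` as a maintainer directly.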
The final step would be to query only my FOAF profile and let the engine discover the doapspace URIs, then retrieve those profiles and add them to the set of files to query, as the Semantic Web Client library does. I don’t know whether it currently supports owl:sameAs, though.
DOAP Dataset
For those who want to play with DOAP files, doap:store now provides a dump of its content, containing all DOAP files fetched from PTSW.
The dump is available in both RDF/XML and N3, and is generated daily[1]. It also contains information about the RDF files themselves, since I made a complete export of the RDF store, including the information 3store adds when dealing with contexts.
Also remember that there’s a SPARQL endpoint for doap:store, which accepts CONSTRUCT queries.
Notes
[1] There seems to be a broken use of rdf:resource in the XML version; the N3 is a cleaned-up version, thanks to rapper.
Finding doap projects with YubNub
I was writing an OpenSearch plug-in for doapstore.org to allow searching projects directly from the Firefox or IE7 search box, when I remembered yubnub.org, which already has such a plug-in. YubNub allows anyone to create command lines for the web, e.g. typing "gim rdf" will search Google Images for "rdf".
So rather than creating a plug-in for doapstore, I created a YubNub command, simply called "doap". It works as follows:
- doap foo will search all projects with a name (doap:name) or description (doap:shortdesc or doap:description) containing foo;
- doap name=foo will search by name only;
- doap desc=foo will search by description (both long and short);
- doap lang=foo will search by programming language (doap:programming-language);
- doap host=foo will search by hostname (i.e. project URI).
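The mapping from such a command to a query can be sketched as follows. This is an assumption about how it could be done, not the code behind the actual YubNub command: `FIELDS`, `parse_command` and `build_query` are hypothetical names, and the search term is only escaped as a SPARQL string, not as a regular expression.

```python
import re

# Hypothetical mapping from command keywords to DOAP properties.
FIELDS = {
    "name": ["doap:name"],
    "desc": ["doap:shortdesc", "doap:description"],
    "lang": ["doap:programming-language"],
}


def parse_command(args):
    """Split 'name=foo' into ('name', 'foo'); a bare term becomes ('any', term)."""
    match = re.match(r"^(name|desc|lang|host)=(.+)$", args.strip())
    if match:
        return match.group(1), match.group(2)
    return "any", args.strip()


def build_query(args):
    """Build a SPARQL SELECT for the arguments of a 'doap ...' command."""
    field, term = parse_command(args)
    # Escape backslashes and quotes so the term is a valid SPARQL string.
    term = term.replace("\\", "\\\\").replace('"', '\\"')
    if field == "host":
        # Hostname search matches against the project URI itself.
        pattern = ('?project a doap:Project . '
                   'FILTER regex(str(?project), "%s", "i")' % term)
    else:
        if field == "any":
            props = FIELDS["name"] + FIELDS["desc"]
        else:
            props = FIELDS[field]
        branches = " UNION ".join("{ ?project %s ?value }" % p for p in props)
        pattern = '%s FILTER regex(str(?value), "%s", "i")' % (branches, term)
    return ("PREFIX doap: <http://usefulinc.com/ns/doap#>\n"
            "SELECT DISTINCT ?project WHERE { %s }" % pattern)


if __name__ == "__main__":
    print(build_query("desc=rdf"))
```

Keeping the keyword-to-property table separate from the query builder means a new search mode only needs one more entry in `FIELDS`.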
The other advantage of YubNub is that it can be used not only as a search engine for your favourite browser; it also has various frontends, such as Tiger widgets or command-line scripts for the shell. Really useful!
NB: I also thought about a generic SPARQL command for YubNub that would query different endpoints, as Danny Ayers suggested, and return a single set of results thanks to a SPARQL dispatcher, but I did not write anything about it.
SPARQL endpoint for doap:store
doap:store now provides a SPARQL endpoint for its data. The RDF store used is 3store, so it supports GRAPH queries, but you may have to use the http://triplestore.aktors.org/direct/# namespace for some properties (for instance direct:type instead of rdf:type in GRAPH queries). To help write queries, the endpoint uses Danny Ayers’ sparql-editor.
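As a rough illustration, a GRAPH query using the direct: namespace could be sent to the endpoint like this. The endpoint URL below is a placeholder (an assumption, replace it with the real one), and the sketch only builds the request URL following the standard SPARQL protocol convention of a `query` parameter, without contacting any server.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption, not the real doap:store address.
ENDPOINT = "http://example.org/sparql"

# List each project together with the graph (source file) that describes it,
# using direct:type in place of rdf:type inside the GRAPH pattern.
GRAPH_QUERY = """
PREFIX direct: <http://triplestore.aktors.org/direct/#>
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT ?graph ?project
WHERE {
  GRAPH ?graph { ?project direct:type doap:Project }
}
"""


def request_url(endpoint, query):
    """Build a SPARQL-protocol GET URL carrying the query text."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    print(request_url(ENDPOINT, GRAPH_QUERY))
```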
I also updated and fixed various points of the service:
- Show RDF source URI for each project;
- Ability to browse files from the same hostname;
- Search projects by hostname;
- Fixed "double-properties" in projects that are defined in 2 graphs;
- Case-insensitive tagcloud;
- Strict REGEX (^xxx$) when searching by language.
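The last two items can be illustrated with a short sketch. It uses Python’s `re` module rather than the SPARQL REGEX filter the service relies on, and the function names are hypothetical; the point is only the difference between an anchored and an unanchored pattern, and the merging of tags that differ by case.

```python
import re
from collections import Counter


def matches_language(language, term, strict=True):
    """Strict mode anchors the pattern (^term$); loose mode matches substrings."""
    pattern = re.escape(term)
    if strict:
        pattern = "^" + pattern + "$"
    return re.search(pattern, language) is not None


def tag_counts(tags):
    """Case-insensitive counts for a tagcloud: 'Python' and 'python' merge."""
    return Counter(tag.lower() for tag in tags)


if __name__ == "__main__":
    # Loose matching would list JavaScript projects when searching for Java.
    print(matches_language("JavaScript", "Java", strict=False))
    print(matches_language("JavaScript", "Java", strict=True))
    print(tag_counts(["Python", "python", "PHP"]))
```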
Thanks to those who sent feedback!
doap:store, a collaborative DOAP projects directory
I’ve just completed a first release of doap:store, a user-driven directory of DOAP projects.
The goal is to build a database of computing projects that have a DOAP description, as the Semantic Web Apps and Demo service does, but not limited to SW-related projects, thanks to the Ping The Semantic Web service. There is thus no need to register: projects are fetched from PTSW, which gets them either through pings or, even better, through the Semantic Radar plugin for Firefox when people browse the Web.
doap:store then provides a common search engine and browsing interface for these decentralized project descriptions, while authors keep control over their data. Data is updated each time PTSW has a new ping for it (in the future, PTSW should store new pings only if the document has changed, so updates will be made only for real document changes).
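The “only if the document has changed” idea can be sketched with a content hash. This is not how PTSW or doap:store are implemented; `needs_update` and the in-memory `known_digests` dictionary are hypothetical, and a real service would persist the digests between runs.

```python
import hashlib


def content_digest(document_bytes):
    """Fingerprint of a fetched document."""
    return hashlib.sha256(document_bytes).hexdigest()


def needs_update(known_digests, url, document_bytes):
    """Return True, and record the new digest, only when the content changed."""
    digest = content_digest(document_bytes)
    if known_digests.get(url) == digest:
        return False
    known_digests[url] = digest
    return True


if __name__ == "__main__":
    seen = {}
    url = "http://example.org/doap.rdf"  # made-up URL
    print(needs_update(seen, url, b"<rdf:RDF/>"))   # first ping: store it
    print(needs_update(seen, url, b"<rdf:RDF/>"))   # same content: skip
    print(needs_update(seen, url, b"<rdf:RDF> </rdf:RDF>"))  # changed: update
```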
On the technical side, it’s using 3store for the storage and of course SPARQL for the queries.
I hope this kind of tool can help promote DOAP among developers, and the SW in general, since thanks to Semantic Radar anyone can contribute to building this directory.
Maintaining a changelog with DOAP
I’ve just pointed out that I used DOAP to represent various information about SIOC Exporter for Dotclear. The DOAP vocabulary is designed to define different properties of a project, such as its homepage, its repository, the different people involved in its development…
There is no direct property for the changelog of a project, yet you can use doap:release to represent information about each release and then add the changelog of a given release using dc:description[1]:
<doap:release>
  <doap:Version>
    <doap:revision>1.4</doap:revision>
    <doap:created>2006-09-09</doap:created>
    <dc:description>
      - API updated to use both foaf:maker / sioc:has_creator
      - Using object type in URL parameters
      - Generating a basic FOAF profile to be used in user export
      - French tranlation of the backend
      - Adding doap.rdf
    </dc:description>
  </doap:Version>
</doap:release>
Then, just use SPARQL to retrieve and order the changes, plus a few lines of code to display the results (e.g. using Python and librdf):
import RDF  # Redland (librdf) Python bindings

# Select every release with its revision, date and changelog, newest first.
query = """
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX doap: <http://usefulinc.com/ns/doap#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?revision ?created ?description
WHERE {
  ?_p rdf:type doap:Project .
  ?_p doap:release ?_r .
  ?_r doap:revision ?revision .
  ?_r doap:created ?created .
  ?_r dc:description ?description
} ORDER BY DESC(?created)
"""

url = "http://apassant.net/home/2006/02/dotclear-sioc/doap.rdf"

# Load the remote DOAP file into an in-memory model.
model = RDF.Model()
RDF.Parser().parse_into_model(model, url)

# Run the query and print one changelog entry per release.
for r in RDF.Query(query, query_language="sparql").execute(model):
    print("# v%s [%s] %s" % (
        str(r['revision']).strip(),
        str(r['created']).strip(),
        str(r['description'])))
And there you have it, a formatted changelog for your project:
# v1.4 [2006-09-09]
- API updated to use both foaf:maker / sioc:has_creator
- Using object type in URL parameters
- Generating a basic FOAF profile to be used in user export
- French tranlation of the backend
- Adding doap.rdf
# v1.3 [2006-08-02]
- Update to match API changes
# v1.2 [2006-05-30]
[...]
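For readers without the Redland bindings, a similar changelog can be produced with the Python standard library alone. This is a swapped-in technique, not the approach used above: it reads the file as plain XML with ElementTree instead of as RDF, so it only works when releases are serialized as nested doap:release / doap:Version elements. The `SAMPLE` document is a shortened, hand-written excerpt used as an assumption for the demonstration.

```python
import xml.etree.ElementTree as ET

NS = {
    "doap": "http://usefulinc.com/ns/doap#",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Shortened sample document (hand-written for this sketch).
SAMPLE = """<doap:Project xmlns:doap="http://usefulinc.com/ns/doap#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <doap:release>
    <doap:Version>
      <doap:revision>1.3</doap:revision>
      <doap:created>2006-08-02</doap:created>
      <dc:description>- Update to match API changes</dc:description>
    </doap:Version>
  </doap:release>
  <doap:release>
    <doap:Version>
      <doap:revision>1.4</doap:revision>
      <doap:created>2006-09-09</doap:created>
      <dc:description>- Adding doap.rdf</dc:description>
    </doap:Version>
  </doap:release>
</doap:Project>"""


def changelog(xml_text):
    """Return one formatted line per release, most recent first."""
    root = ET.fromstring(xml_text)
    versions = root.findall(".//doap:Version", NS)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    versions.sort(key=lambda v: v.findtext("doap:created", "", NS), reverse=True)
    return [
        "# v%s [%s] %s" % (
            v.findtext("doap:revision", "", NS).strip(),
            v.findtext("doap:created", "", NS).strip(),
            v.findtext("dc:description", "", NS).strip(),
        )
        for v in versions
    ]


if __name__ == "__main__":
    for line in changelog(SAMPLE):
        print(line)
```

The SPARQL version remains the more robust choice, since the same RDF data can be serialized in many different XML shapes.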
This example shows how data can be formalized and mashed up using Semantic Web technologies, just as you can do with your FOAF profile to create a homepage from it. So, while I first thought about a FOAF-to-homepage script, I’m now working on a simple RDF-to-templated-HTML converter, which will use Jinja as its templating engine, and librdf + SPARQL to get information from any RDF file.
Notes
[1] Yet, adding a changelog property could be more appropriate from a semantic point of view, or maybe some more complex class; I’ve started the discussion here.
