Updating doap:store architecture and rewriting queries
I recently changed the server that runs the various websites I host and maintain, moving from a Core 2 Duo 2×1.80 GHz with 1 GB of RAM to a Celeron 2.0 GHz with only 256 MB.
The main reason I had the bigger server was to host doap:store, as I wanted SPARQL queries to run in a reasonable time. After the move to the new server (for pricing reasons, as you can guess), most of the queries were really slow, even with an optimized MySQL config. Some of them even froze the MySQL server, hanging in the “Copying to tmp table on disk” state and making the website almost unusable.
While I was looking for a way to host the triple store on a more powerful box, Kingsley Idehen and OpenLink kindly offered to host it on EC2 using Virtuoso Open-Source Edition. I just made the changes and re-imported the ~4600 RDF files fetched so far (now including doapspace data), and the service is live again, with much better performance for both finding and browsing projects.
So doap:store now runs on:
- an Amazon EC2 server, using Virtuoso to store the RDF data and to provide a SPARQL endpoint for it;
- a Debian GNU/Linux box, using Apache2 and PHP5 to build the interfaces. This one also fetches new project descriptions thanks to Ping The Semantic Web (with a Python cron job) and updates / queries the triple store on the EC2 server.
In the meantime, I optimized some queries by removing unused variables and reordering statements, after reading this technical report about OptARQ. Finally, for the project stats, I took advantage of Virtuoso's aggregate functions to use count in the query itself, instead of fetching all graphs / projects and counting them in PHP.
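To give an idea of the count change, here is a minimal sketch in Python rather than the PHP the site actually uses. It is not the real doap:store code: the query text, the `?total` variable, the function names and the endpoint URL passed in are all assumptions made for illustration, and the aggregate is written in standard SPARQL 1.1 syntax, which may differ from what Virtuoso accepted at the time. The point is only that the endpoint returns a single row holding the count, instead of every project URI.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical query: count distinct DOAP projects across all named graphs.
# The store does the counting, so only one row travels over the network.
COUNT_PROJECTS_QUERY = """
PREFIX doap: <http://usefulinc.com/ns/doap#>
SELECT (COUNT(DISTINCT ?project) AS ?total)
WHERE {
  GRAPH ?g { ?project a doap:Project . }
}
"""


def parse_count(results, var="total"):
    """Read the single integer out of a SPARQL JSON result document."""
    binding = results["results"]["bindings"][0]
    return int(binding[var]["value"])


def count_projects(endpoint_url):
    """POST the count query to a SPARQL endpoint (URL supplied by the caller)."""
    data = urlencode({"query": COUNT_PROJECTS_QUERY}).encode("utf-8")
    request = Request(
        endpoint_url,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urlopen(request) as response:
        return parse_count(json.loads(response.read().decode("utf-8")))
```

The previous approach amounted to selecting every `?project` binding and counting the rows on the client side, so the amount of data transferred and parsed grew with the number of projects; with the aggregate it stays constant.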
Thanks again to OpenLink for hosting the data and for their support!
Tags: doap, doapstore, rdf, scalability, virtuoso
