While toying with the public BigQuery datasets, impatiently waiting for Google Cloud Dataflow to be released, I noticed the Wikipedia Revision History dataset, which contains a list of 314M Wikipedia edits, up to 2010. In the spirit of Amazon’s “people who bought this”, I decided to run a small experiment on music recommendations based on Wikipedia edits. The results are not perfect, but they provide some insights that could be used to bootstrap a recommendation platform.
Wikipedia edits as a data source
Wikipedia pages are often an invaluable source of knowledge. Yet the type and frequency of their edits also provide great data to mine knowledge from. See for instance the Wikipedia Live Monitor by Thomas Steiner, which detects breaking news through Wikipedia; “You are what you edit”, an ICWSM09 study of Wikipedia edits to identify contributors’ location; or some of my joint work on data provenance with Fabrizio Orlandi.
Here, my assumption for building a recommendation system is that Wikipedia contributors edit similar pages, because they have expertise and interest in a particular domain and tend to focus on it. This obviously becomes more relevant at the macro level, taking a large number of edits into account.
In the music-discovery context, this means that if 200 of the Wikipedia editors contributing to the Weezer page also edited the Rivers Cuomo one, well, there might be something in common between both.
Let’s have a quick look at the aforementioned Wikipedia Revision History dataset:
This dataset contains a version of that data from April, 2010. This dataset does not contain the full text of the revisions, but rather just the meta information about the revisions, including things like language, timestamp, article and the like.
- Name: publicdata:samples.wikipedia
- Number of rows: 314M
Sounds not too bad, as it contains a large set of (page, title, user) tuples, exactly what we need to experiment.
Querying for similarity
Instead of building a complete user/edits matrix to compute the cosine distance between pages, or using a more advanced algorithm like Slope One (with the number of edits as an equivalent for ratings), I’m simply finding common edits, as explained in the original Amazon paper. And, to make this a bit more fun, I’ve decided to do it with a single query over the 314M rows, testing BigQuery capabilities at the same time.
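For comparison, here is a minimal sketch of the cosine alternative mentioned above, which I did not use for this experiment. It assumes edits are available as plain (page, contributor) pairs; the function names and the toy data are hypothetical, not part of the dataset or of any library.

```python
from collections import defaultdict
from math import sqrt


def page_vectors(edits):
    """Build one sparse vector per page: {contributor_id: number_of_edits}."""
    vectors = defaultdict(lambda: defaultdict(int))
    for page, contributor_id in edits:
        vectors[page][contributor_id] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse edit-count vectors."""
    dot = sum(count * b[user] for user, count in a.items() if user in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy data: one (page, contributor_id) pair per edit.
edits = [
    ("The Clash", 1), ("The Clash", 1), ("The Clash", 2),
    ("Ramones", 1), ("Ramones", 2),
    ("The Beatles", 3),
]

vectors = page_vectors(edits)
sim_ramones = cosine_similarity(vectors["The Clash"], vectors["Ramones"])
sim_beatles = cosine_similarity(vectors["The Clash"], vectors["The Beatles"])
```

The cosine distance is then simply 1 minus this similarity. Unlike a raw count of common edits, it normalises by how heavily each page is edited overall, at the cost of building a vector for every page.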
The following query finds all pages sharing editors with a given seed page, ranked by the number of common edits. Tested with multiple inputs, it took an average of 5 seconds to answer over the full dataset. You can run it yourself by going to your Google BigQuery console and selecting the Wikipedia dataset.
    SELECT title, id, COUNT(id) AS edits
    FROM [publicdata:samples.wikipedia]
    WHERE contributor_id IN (
        SELECT contributor_id
        FROM [publicdata:samples.wikipedia]
        WHERE id = 30423
          AND contributor_id IS NOT NULL
          AND is_bot IS NULL
          AND is_minor IS NULL
          AND wp_namespace = 0
        GROUP BY contributor_id
      )
      AND is_minor IS NULL
      AND wp_namespace = 0
    GROUP EACH BY title, id
    ORDER BY edits DESC
    LIMIT 100
Update 2014-07-14: To clarify a comment on Twitter / reddit – I’m using page ID instead of title to make sure the edits over time apply to the same page, since IDs are immutable but page titles can change upon requests from the community.
This is actually a simple query, finding all pages (wp_namespace=0 restricts to content pages, excluding user pages, talks, etc.) edited (excluding minor edits) by users who also edited (excluding bots and minor contributions) the page with ID 30423, and ranking them by number of edits. You can read it as “Find all pages edited by people who also edited the page about the Clash, ranked by most edited first”.
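To make the two steps of the query explicit, here is a rough in-memory equivalent in Python. It is only a sketch: the field names mirror the columns used in the query, but the `rev` helper, the `pages_sharing_editors` function and every row of toy data (including all page IDs other than 30423) are hypothetical.

```python
from collections import Counter


def pages_sharing_editors(revisions, seed_page_id, limit=100):
    # Step 1 (inner SELECT): contributors of the seed page, excluding
    # anonymous users, bots and minor edits, on content pages only.
    seed_editors = {
        r["contributor_id"]
        for r in revisions
        if r["id"] == seed_page_id
        and r["contributor_id"] is not None
        and not r["is_bot"]
        and not r["is_minor"]
        and r["wp_namespace"] == 0
    }
    # Step 2 (outer SELECT): count the non-minor edits those contributors
    # made on any content page, grouped by (title, id).
    counts = Counter(
        (r["title"], r["id"])
        for r in revisions
        if r["contributor_id"] in seed_editors
        and not r["is_minor"]
        and r["wp_namespace"] == 0
    )
    return counts.most_common(limit)


def rev(title, page_id, contributor_id, is_bot=None, is_minor=None, ns=0):
    """Hypothetical helper building one revision row."""
    return {"title": title, "id": page_id, "contributor_id": contributor_id,
            "is_bot": is_bot, "is_minor": is_minor, "wp_namespace": ns}


# Hypothetical toy revisions; only the ID 30423 comes from the query above.
revisions = [
    rev("The Clash", 30423, 1),
    rev("The Clash", 30423, 2),
    rev("The Clash", 30423, 9, is_bot=True),      # bot edit: ignored as a seed editor
    rev("Ramones", 100, 1),
    rev("Ramones", 100, 2),
    rev("Sex Pistols", 200, 1),
    rev("Sex Pistols", 200, 1, is_minor=True),    # minor edit: not counted
    rev("The Beatles", 300, 3),                   # no shared editor
    rev("Talk:The Clash", 400, 1, ns=1),          # not a content page
]

results = pages_sharing_editors(revisions, 30423)
```

As in the query, the seed page itself shows up in the results, since its editors obviously edited it; in practice you would skip that row.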
And here are some of the results
As you can see, from a music-discovery perspective, that’s a mix of relevant ones (Ramones, Sex Pistols) and WTF-ones (The Beatles, U2). There’s also a need to exclude non-music pages, but that could be done programmatically with some more information in the dataset.
Towards long tail discovery
As we can expect, and as seen before, results are not that good for mainstream / popular artists. Indeed, edits to the Beatles page are unlikely, on average, to say much about the musical preferences of their editors. Yet this becomes more relevant for long-tail artist discovery: if you care enough to edit the pages of indie bands, you most likely care about their music.
Trying with Mr Bungle, the query returns Meshuggah and The Mars Volta as the first two music-related entries, all of them playing some kind of experimental metal – but then digresses again with the Pixies. Looking at band members / solo artists and using Frank Black as a seed leads to The Smashing Pumpkins, Pearl Jam, R.E.M. and obviously the Pixies as the first four recommendations. Not perfect in either case, but not too bad for an algorithm that is completely music-agnostic!
Scaling the approach
There are many ways this could be improved, for instance:
- Removing too-active contributors – who may edit pages to ensure Wikipedia guidelines are followed, rather than for topic-based interest, and consequently introduce some bias;
- Filtering the results using some ML approaches or graph-based heuristics – e.g. exclude results if their genres are more than X nodes away in a genre taxonomy;
- Using time-decay – someone editing Nirvana pages in 1992 might be interested in completely different genres now, so joint edits might not be relevant if they are x days or more apart.
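Two of these ideas, dropping too-active contributors and only counting joint edits made within a given interval, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not something I ran on the dataset: edits are (user, page, day) tuples with the day as a plain integer, and the function names, thresholds and toy data are all hypothetical.

```python
from collections import Counter


def drop_prolific_contributors(edits, max_pages):
    """Discard edits from users who touched more than max_pages distinct pages."""
    pages_per_user = {}
    for user, page, _day in edits:
        pages_per_user.setdefault(user, set()).add(page)
    return [e for e in edits if len(pages_per_user[e[0]]) <= max_pages]


def co_edits_within(edits, seed_page, max_gap_days):
    """Count edits to other pages made within max_gap_days of an edit
    to the seed page by the same user."""
    seed_days = {}
    for user, page, day in edits:
        if page == seed_page:
            seed_days.setdefault(user, []).append(day)
    counts = Counter()
    for user, page, day in edits:
        if page == seed_page or user not in seed_days:
            continue
        if any(abs(day - d) <= max_gap_days for d in seed_days[user]):
            counts[page] += 1
    return counts


# Hypothetical toy data: (user, page, day number).
edits = [
    ("alice", "Nirvana", 0), ("alice", "Pixies", 10),
    ("alice", "Meshuggah", 5000),                 # same user, years later
    ("bob", "Nirvana", 100), ("bob", "Pixies", 120),
    ("janitor", "Nirvana", 50), ("janitor", "Pixies", 51),
    ("janitor", "U2", 52), ("janitor", "The Beatles", 53),
]

filtered = drop_prolific_contributors(edits, max_pages=3)
counts = co_edits_within(filtered, "Nirvana", max_gap_days=365)
```

Here the "janitor" account is dropped for touching too many pages, and the much later Meshuggah edit is ignored, leaving the Pixies as the only recommendation for Nirvana. A hard cut-off is the simplest form of time-decay; a weight that shrinks as the gap grows would be a natural refinement.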
Yet, besides its scientific interest, and showing that BigQuery is very cool to use, this approach also showcases – if needed – that even though algorithms may rule the world of music discovery, they might not be able to do much without user-generated content.