Music recommendations with 300M data points and one SQL query

While toying with the public BigQuery datasets, impatiently waiting for Google Cloud Dataflow to be released, I noticed the Wikipedia Revision History one, which contains a list of 314M Wikipedia edits, up to 2010. In the spirit of Amazon’s “people who bought this”, I decided to run a small experiment about music recommendations based on Wikipedia edits. The results are not perfect, but they provide some insights that could be used to bootstrap a recommendation platform.

Wikipedia edits as a data source

Wikipedia pages are often an invaluable source of knowledge. Yet, the type and frequency of their edits also provide great data to mine knowledge from. See for instance the Wikipedia Live Monitor by Thomas Steiner, detecting breaking news through Wikipedia; “You are what you edit”, an ICWSM09 study of Wikipedia edits to identify contributors’ location; or some of my joint work on data provenance with Fabrizio Orlandi.

Here, my assumption to build a recommendation system is that Wikipedia contributors edit similar pages, because they have an expertise and interest in a particular domain, and tend to focus on those. This obviously becomes more relevant at the macro-level, taking a large number of edits into account.

In the music-discovery context, this means that if 200 of the Wikipedia editors contributing to the Weezer page also edited the Rivers Cuomo one, well, there might be something in common between both.

The dataset

Let’s have a quick look at the aforementioned Wikipedia Revision History dataset:

This dataset contains a version of that data from April, 2010. This dataset does not contain the full text of the revisions, but rather just the meta information about the revisions, including things like language, timestamp, article and the like.

  • Name: publicdata:samples.wikipedia
  • Number of rows: 314M

Sounds not too bad, as it contains a large set of (page, title, user) tuples, exactly what we need to experiment.

Querying for similarity

Instead of building a complete user/edits matrix to compute the cosine distance between pages, or using a more advanced algorithm like Slope One (with the number of edits as an equivalent for ratings), I’m simply finding common edits, as explained in the original Amazon paper. And, to make this a bit more fun, I’ve decided to do it with a single query over the 314M rows, testing BigQuery capabilities at the same time.

The following query is used to find all pages sharing common editors with the current one, ranked by the number of common edits. Tested with multiple inputs, it took an average of 5 seconds to answer over the full dataset. You can run it yourself by going to your Google BigQuery console and selecting the Wikipedia dataset.

SELECT title, id, count(id) as edits
FROM [publicdata:samples.wikipedia]
WHERE contributor_id IN (
  SELECT contributor_id
  FROM [publicdata:samples.wikipedia]
  WHERE id=30423
    AND contributor_id IS NOT NULL
    AND is_bot is NULL
    AND is_minor is NULL
    AND wp_namespace = 0
  GROUP BY contributor_id
  )
  AND is_minor is NULL
  AND wp_namespace = 0
GROUP EACH BY title, id
ORDER BY edits DESC

Update 2014-07-14: To clarify a comment on Twitter / reddit - I’m using page ID instead of title to make sure the edits over time apply to the same page, since IDs are immutable but page titles can change upon requests from the community.

This is actually a simple query, finding all pages (wp_namespace=0 restricts to content pages, excluding user pages, talks, etc.) edited (excluding minor edits) by users who also edited (excluding bots and minor contributions) the page with ID 30423, ranking them by number of edits. You can read it as “Find all pages edited by people who also edited the page about the Clash, ranked by most edited first”.
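For readers who prefer code to SQL, here is a minimal Python sketch of the same logic, run on a tiny in-memory list instead of BigQuery. The sample edits, the contributor names and the related_pages function are all made up for illustration; this is not how the experiment was actually run.

```python
from collections import Counter

# Hypothetical sample of (page_id, title, contributor_id) edits; not real data.
EDITS = [
    (1, "Seed Band", "alice"),
    (1, "Seed Band", "bob"),
    (2, "Related Band", "alice"),
    (2, "Related Band", "alice"),
    (2, "Related Band", "bob"),
    (3, "Other Page", "bob"),
    (4, "Unrelated Page", "carol"),
]


def related_pages(edits, seed_id):
    """Rank pages by the number of edits made by contributors of the seed page."""
    # Inner query: everyone who edited the seed page.
    seed_editors = {user for page_id, _, user in edits if page_id == seed_id}
    # Outer query: count edits by those contributors, grouped by (title, id).
    counts = Counter(
        (title, page_id) for page_id, title, user in edits if user in seed_editors
    )
    # Most edited first.
    return counts.most_common()


ranking = related_pages(EDITS, seed_id=1)
print(ranking)
```

Note that, as in the query, the seed page itself is part of the ranking, so a real application would probably drop it from the results.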

And here are some of the results

Who’s related to the Clash, using Wikipedia edits

As you can see, from a music-discovery perspective, that’s a mix between relevant ones (Ramones, Sex Pistols), and WTF-ones (The Beatles, U2). There’s also a need to exclude non-music pages, but that could be done programmatically with some more information in the dataset.

Towards long tail discovery

As we can expect, and as seen before, results are not that good for mainstream / popular artists. Indeed, edits to the Beatles page are unlikely, on average, to say much about the musical preferences of their editors. Yet, this becomes more relevant for long-tail artist discovery: if you care about editing indie bands’ pages, you most likely care about the bands themselves.

Trying with Mr Bungle, the query returns Meshuggah and The Mars Volta as the first two music-related entries, all of them playing some kind of experimental metal – but then digresses again with the Pixies. Looking at band members / solo artists and using Frank Black as a seed leads to The Smashing Pumpkins, Pearl Jam, R.E.M. and obviously the Pixies as the first four recommendations. Not perfect for both, but not too bad for an algorithm that is completely music-agnostic!

Scaling the approach

There are many ways this could be improved, for instance:

  • Removing too-active contributors – who may edit pages to ensure Wikipedia guidelines are followed, rather than for topic-based interest, and consequently introduce some bias;
  • Filtering the results using some ML approaches or graph-based heuristics – e.g. exclude results if their genres are more than X nodes away in a genre taxonomy.
  • Using time-decay – someone editing Nirvana pages in 1992 might be interested in completely different genres now, so joint edits might not be relevant if done with an x-days interval or more.
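As a rough illustration of the first and last ideas, here is a small Python sketch that drops very active contributors and only keeps joint edits made within a given interval. The sample edits, the thresholds and both function names are hypothetical, and a contributor’s seed-page edits are reduced to a single day to keep the example short.

```python
# Hypothetical (page, contributor, day) edits; "day" is a plain day number.
EDITS = [
    ("Weezer", "fan", 10),
    ("Pixies", "fan", 40),
    ("Nirvana", "fan", 4000),
    ("Weezer", "janitor", 11),
    ("Pixies", "janitor", 12),
    ("Nirvana", "janitor", 13),
    ("Paris", "janitor", 14),
]


def drop_too_active(edits, max_pages):
    """Remove contributors who edited more than max_pages distinct pages."""
    pages = {}
    for page, user, _ in edits:
        pages.setdefault(user, set()).add(page)
    return [edit for edit in edits if len(pages[edit[1]]) <= max_pages]


def joint_edits(edits, seed, max_days):
    """Pages edited by seed-page contributors within max_days of their seed edit."""
    # Simplification: keep one seed-edit day per contributor.
    seed_days = {user: day for page, user, day in edits if page == seed}
    return sorted(
        page
        for page, user, day in edits
        if page != seed
        and user in seed_days
        and abs(day - seed_days[user]) <= max_days
    )


kept = drop_too_active(EDITS, max_pages=3)
print(joint_edits(kept, seed="Weezer", max_days=365))
```

A hard cut-off is the simplest option; a smoother variant could weight each joint edit by its age instead of discarding it.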

Yet, besides its scientific interest, and showing that BigQuery is very cool to use, this approach also showcases – if needed – that even though algorithms may rule the world of music discovery, they might not be able to do much without user-generated content.

The Long Tail, with Spotify and Polymer

The Long Tail. That’s not something new, neither on the Web nor in the music field. I remember when I first read Chris Anderson’s article, and since then, many have talked or written about it, including Paul Lamere or Oscar Celma in the music-tech sphere.

Yet, one must admit that, with millions of tracks available online, it’s always a challenge to find something new, digging in that so-called long-tail of less popular artists or songs.

So, between World Cup games, I’ve built a web component – and a companion web app – to enjoy the less popular tracks of any artist.

A web component to play an artist’s long tail

Built with Polymer and using the new Spotify Web API, <long-tail> is a web component that embeds a Spotify play button with the less popular tracks of one artist.

First, install it with Bower:

bower install long-tail

Then, include it in an HTML page:


  <script src="bower_components/platform/platform.js"></script>
  <link rel="import" href="bower_components/long-tail/long-tail.html">

  <long-tail artist="4tZwfgrHOc3mvqYlEYSvVi" size="25"></long-tail>


And there it is, you’re ready to play. No JS to write, no code to copy and paste, everything is handled internally: the beauty of Web components. Unfortunately, Javascript can’t be used on blogs, but here’s the result of the previous snippet.

Daft Punk less popular tracks

The source is on github (MIT license), and you can see how easy it is to create. It simply calls the Spotify API to find an artist’s albums, then tracks (limiting to 50 results each time – hence parsing a maximum of 2500 tracks per artist), finally sorting them by inverse popularity. It also excludes the ones with popularity=0, as it seems these are not always the least popular ones. Maybe some region-dependent issue?
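The selection step is easy to sketch outside of the component. The following Python snippet is only an illustration of the sorting and filtering described above, not the component’s actual (JavaScript) code: the long_tail function and the sample tracks are made up, and the tracks are assumed to be plain dicts with a name and a popularity field.

```python
def long_tail(tracks, size):
    """Return the `size` least popular tracks, ignoring those with popularity 0."""
    playable = [track for track in tracks if track["popularity"] > 0]
    # Least popular first.
    return sorted(playable, key=lambda track: track["popularity"])[:size]


# Hypothetical tracks; real ones would come from the Spotify Web API.
TRACKS = [
    {"name": "Hit single", "popularity": 80},
    {"name": "Deep cut", "popularity": 12},
    {"name": "B-side", "popularity": 3},
    {"name": "Unavailable", "popularity": 0},
]

tail = long_tail(TRACKS, size=2)
print([track["name"] for track in tail])
```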

I suppose, as with many recent JS toolkits such as AngularJS, that the learning curve will be steeper when building advanced components (probably due to the early-stage documentation), but at first glance, it looks very intuitive, and there are many elements to reuse already.

Try it with your favorite artists

As the component is mostly for coders, I’ve put together a companion Web app – shamelessly reusing Paul’s design from his recent Spotify hacks. For each artist, it uses the previous component and displays their 50 less popular tracks, according to Spotify.

Try it at The Long Tail, and have fun exploring the hidden gems of your favorite artist!

The Long Tail of Rancid tracks


Google I/O 2014 Recap: Android, Knowledge Graph and more

Back in April, I was lucky enough to get a partner invite for Google I/O. Coupled with a stay at the Startup House, a co-working / housing space (ideal when you’re jet-lagged at 4AM and want a proper desk to code a few meters away from your bed) located only one block away from Moscone, I’m very glad I made the trip to my first I/O!

Google I/O after hours party in Yerba Buena Gardens

Here are a few highlights, in a conference which clearly confirmed the role of (1) Android as a global OS, and (2) the Knowledge Graph as a hub for everything AI-related, at Google and beyond.

Most of the videos of the sessions are online on Google Developers’ YouTube channel, and I’ve tried as much as possible to link to the relevant ones below.

Android – One OS to rule them all

While I’m not (yet) a full-time Android user (let alone a developer), it’s now clear that it goes far beyond a phone-only OS. With the introduction of Android Wear, Android Auto, and Android TV during the keynote, the OS is now the core of all hardware-related initiatives at Google.

With common SDKs and APIs to interact with, wherever the OS is used, this makes the life of developers much easier when building cross-device products. Relying on a single ecosystem is also of importance when building an engineering team, and I guess it may also be a decision factor for small start-ups when deciding which market to tackle.

Last but not least, the improvements in the OS itself, including a new runtime (see “What’s new in Android”), make it even faster than before, a plus for embedded systems of all sorts.

Google’s Knowledge Graph – From search to voice controls and app indexing

So far, Google’s Knowledge Graph has been used mostly in search-related projects, including the snippets you can see when searching for entities such as places, people, music and movies on Google. Several sessions showed how it is now used as a central hub for AI-related projects and products.

Search results getting richer with Google’s Knowledge Graph

Using Android TV, you can ask your TV (literally, by talking to your Android watch) to suggest an Oscar-awarded movie from 2000, or who’s in the cast of X or Y – all answers coming from the Knowledge Graph. In the first case, results can be bought from Google Play, another nice piece of integration between the different offerings from the company.

Another interesting case is the use of the Knowledge Graph to connect the dots between previously isolated silos, namely mobile apps. One of the common issues with those apps is their lack of links and outside-world connections, in spite of recent efforts such as Facebook-supported App Links. In the session “The Future of Apps and Search”, a combination of app indexing, JSON-LD and Knowledge Graph was presented to directly link into an app from, e.g., Google’s search results or autocompletion-search in Android, as well as launching actions from search results – e.g. playing a track in Spotify, a use-case announced a few days before I/O – using the new actions I’ve recently blogged about.

As an early JSON-LD enthusiast who has worked on related technologies for almost a decade, I can’t tell you how excited I was to see this in something used by millions of users! Let’s bet that’s only the beginning, and that new verticals will follow.

Spotify, with real bits of JSON-LD inside

Google Cloud and DataFlow – Smarter, faster, easier

I’ve been recently using Google Cloud infrastructure in several projects (from GAE to Google Prediction – watch “Predicting the future with the Google Cloud Platform” for more about their ML infrastructure), and a few announcements made my day here:

  • Cloud Debugger – making DevOps and back-end engineers more efficient when debugging code. You can now add breakpoints, including conditional ones (e.g. user=X), in your live app, without jeopardising its speed and, most importantly, without having to stop/restart/deploy anything. This means that code can be debugged on production servers with live data, without patching / tracing multiple boxes, all in the comfort of your browser. A kind of New Relic on steroids, so big thumbs-up here!
  • Dataflow – aiming to replace MapReduce, with a special focus on stream processing and scalability. A convincing use-case during the keynote was Twitter sentiment analysis, showing not only the simplicity of the interface, but also the orchestration of the services through the API. The service is not open yet, but you can check “Big data, the Cloud Way: Accelerated and simplified” to know more. I’m looking forward to trying it on a few stream-processing tasks for content discovery!

Dataflow – Coming soon to a theater near you

The Web platform – Polymer, WebRTC and HTML5

Whether you’re accessing it from your desktop, phone, or now, your watch or Glass, there’s only one Web. And far from just websites, it can be used as a platform to build powerful apps, as many sessions focused on:

  • Polymer / Web components – or how to build your own HTML tags for quick prototyping and distribution. As an AngularJS user, I was immediately convinced by its two-way data bindings. Polymer (“Polymer and the Web Components revolution”) adds another elegant layer to the Web, letting you define tags that are then rendered as full components. Imagine a <my-recent-tracks> tag that will automatically render the top-tracks you’ve played on all your favorite music platforms. Well, that’s exactly what Polymer can do;
  • HTML5 – the Web as a platform, from different perspectives. In particular, “HTML5 everywhere: How and why YouTube uses the Web platform” was a great intro talk to understand the benefits of HTML5 from different points of view: UX, scalability, cross-platform. Recommended to anyone who still has doubts about it;
  • WebRTC – building real-time systems in your browser. “Making music mobile with the Web” not only showed how to transform your Macbook into a Marshall JCM2000 with Soundtrap, but also how WebRTC was used for real-time collaborative music creation, with very low latency.

Wearables – It’s all about the UX

Then, a big part of the conference: Glass and smart watches. I often thought that most of the effort to build those was put in the hardware and OS side of things (reducing footprint, optimising battery life, gathering sensor data, etc.).

While some talks clearly focused on this (with some nice hacks such as back-camera for biking in “Innovate with the Glass Platform“, and football-related ones), I was impressed by “Designing for wearables“, which focused on the role of UX to make sure wearables are devices that let you connect, and not interfere with the world as a phone does.

Paris Saint-Germain represents at I/O 2014!

Showing some early prototypes and discussing how and why Glass / wear notifications are so minimalistic, this was an inspiring session for anyone interested in UX and products. A must-watch for developers and entrepreneurs aiming to  build appealing user-facing products, whether it’s for wearables or more standard devices.

Google+ – Or how Google missed the spot

I may have missed it from other sessions, but none of the ones I’ve been to mentioned Google+. I was not expecting much about it at I/O since the departure of Vic Gundotra, and Sergey Brin’s statements, as well as a plus-free agenda. Still, that was a big surprise, as it would have been a no-brainer use-case in many talks.

Using Dataflow to process streams from your social circles? Not a word about it. Using Glass to see what your friends are posting? Nope. Alerts on your Google TV to binge watch some TV-show together with your friends home 5000km away? Neither.

G+ could have been an awesome social network – or should I say a social platform. Combined with Freebase / Knowledge Graph, linking people to things they like, possibilities would be endless in terms of profiling, discovery and more. Yet, with a poor API and a lack of portability that could have differentiated it from its main competitors from Day 1 (imagine PubSubHubbub / WebSocket as an easy way to integrate G+ into other platforms), I’m sad they’ve missed the spot.

Up to 2015?

Overall, a great conference, in spite of the queue mismatch that forced me to miss about 30min of the keynote, queueing twice around the Moscone, a real shame when you travel 8000km for such an event.

I particularly enjoyed the focus around the 3D topics (Design, Develop, Distribute), the diversity of talks (watch the awesome “Robotics in a new world – Presented by Women Techmakers“), and the accessibility of the DevRel team between sessions at the Developer sandboxes.

Looking forward to the next one!

Enhancing the Freebase/YouTube API mappings… using Freebase and YouTube

The YouTube V3 API is one of those things you’ll definitely fall in love with if you’re into real-world Semantic Web applications, a.k.a. “Things, not words”. With its integration with Freebase – the core of Google’s Knowledge Graph – it’s a concrete and practical showcase of the Web as a distributed database of things and relations, and not only keywords and links between pages.

YouTube Data API v3 with Freebase mappings: the good, the bad, and the ugly

While relatively simple to use, it provides advanced features to let developers build data-driven applications. On the one hand, it lets you search for videos by Freebase entities, as you can try in a recent demo from YouTube themselves. On the other hand, it returns which entities are used/described in a video.

Yet, identifying topics from videos is a difficult task, and if you’re not convinced (and interested in all things Machine Learning related), check the following Google I/O talk from last year.

Google I/O talk on Semantic Annotations of YouTube videos, featuring our own seevl

While the API generally delivers correct information, it sometimes requires a bit of work to automatically use its results in a music-related context (to be exact, the issues might be in the underlying data, rather than in the API itself):

  • In some cases, it provides multiple artists – which is often correct, e.g. Blondie and Debbie Harry – but this makes it difficult to find who’s the main one, as the API delivers them at the same level (topicIds).
  • In others, it returns empty results, like this (recently deleted, maybe as part of the YouTube music limbo?) Nirvana video.
  • Finally, when an awesome band like Weezer decides to cover Coldplay, both bands are returned by the API.

This is something we’ve improved to build our former seevl for YouTube plug-in, and while it’s not available anymore, as we’ve moved away from consumer-facing products to refocus on a B2B, turn-key, music discovery solution, I’ve decided to open source the underlying library to find who’s playing and what (yes, that’s music only) in any YouTube videos.

Introducing youplay – who’s and what’s playing in a YouTube music video

The result is youplay, available on PyPI and github, an MIT-licensed Python library that works as an enhancement on top of the YouTube Data API v3 to automatically identify who’s and what’s playing in a music video. It uses different heuristics, data look-up, and more to find the correct artists if multiple ones are returned (unless they’re all playing in the video, like this RHCP + Snoop Dogg version of Scar Tissue), to filter ambiguous ones, or to find the correct artist and track if the API doesn’t deliver anything.

Here’s an example

#!/usr/bin/env python
import youplay

# Several artists play in this video, so join all their names.
(artists, tracks) = youplay.extract('0UjsXo9l6I8')
print('%s - %s' % (', '.join([artist.name for artist in artists]), tracks[0].name))

# Only the first artist is displayed for this one.
(artists, tracks) = youplay.extract('c-_vFlDBB8A')
print('%s - %s' % (artists[0].name, tracks[0].name))

will return

(env)marvin-7:youplay alex$ python
Jay-Z, Alicia Keys - Empire State of Mind
Dropkick Murphys - Worker's Song

The tool is also packaged with a command line script returning JSON data for easy integration into non-python apps.

(env)marvin-7:youplay alex$ ./bin/youplay ebBjGp7QOGc
{
  "tracks": [
    {
      "mid": "/m/0dt1kzp",
      "name": "For My Family"
    }
  ],
  "artists": [
    {
      "mid": "/m/022tqm",
      "name": "Agnostic Front"
    }
  ]
}

With a little help from my friends

The fun part? All the look-ups (if any) are using the Freebase and YouTube APIs themselves, such as:

  • Finding the top-tracks of an artist from Freebase and matching them with the video name when the original API call returns only artist names;
  • Identifying if a song has been recorded by multiple artists;
  • Looking-up related YouTube videos to identify what’s the common topic between all of them, and guess the current artist of a video with no API-results.

Isn’t it a nice way to bridge the gap?

Even though I hope the library will be useful to other music-tech developers, I also wish that it soon becomes obsolete, as Google’s Knowledge Graph and other structured-data efforts keep growing on the Web in terms of AI, infrastructures and APIs/toolkits – making it easier every day to build data-driven applications (if only I had this 10 years ago when I started digging into the topic!).

Oh, and I’m attending Google I/O next week, and if you’re working on similar projects, ping me and let’s have a chat!

echoplot – Plot song loudness using the EchoNest API

As I’m working on the second part of my analysis of the Rolling Stone 500 Greatest songs of all time, I needed to draw the loudness representation of various songs extracted from the EchoNest API. I’ve been using matplotlib with pyechonest, and as the process is quite repetitive, I’ve packaged everything as echoplot, so you can easily plot song loudness using the EchoNest API.

pip install echoplot

Once you’ve set up your EchoNest API key as an environment variable ECHO_NEST_API_KEY, just run echoplot

marvin-7:~ alex$ echoplot -h
usage: echoplot [-h] [-s START] [-e END] artist title

Plot loudness of a song using the EchoNest API.

positional arguments:
  artist                the song's artist, e.g. 'The Clash'
  title                 the song's title, e.g. 'London Calling'

optional arguments:
  -h, --help            show this help message and exit
  -s START, --start START
                        start analysis at a given time (seconds)
  -e END, --end END     end analysis at a given time (seconds)

For example

marvin-7:~ alex$ echoplot 'The Clash' 'London Calling'
The Clash – London Calling


marvin-7:~ alex$ echoplot Radiohead 'Paranoid Android'
Radiohead – Paranoid Android

The plot also displays the different segments of the song (chorus, verse, etc.), also provided by the EchoNest API. Echoplot’s source code is on github and the package is on PyPI.

Sex and drugs and Rock’n’roll: Analysing the lyrics of the Rolling Stone 500 greatest songs of all time

I was reading the Wikipedia entry for the Rolling Stone’s 500 Greatest Songs of All Time, and while it contains a lot of interesting statistics (shortest and longest songs, decades, covers, etc.), I’ve decided to do some “API-based data-science” and see what insights we can learn from this top-500.

#15: London Calling – One of the 5 songs from the Clash in the top 500

I’ll split this into multiple posts, in order to showcase how different APIs bring multiple perspectives to the data-set, such as acoustic features with the Echo Nest, mood recognition with Gracenote, or artist and genre data with seevl (disclosure – I’m the main person responsible for this one).

Here’s the first one, investigating lyrics from those top-500 songs. The first part is rather technical, so if you’re interested only in the insights, just skip it. And here’s an accompanying playlist, featuring the songs mentioned in this post – all from the top 500, except the opening one.

Come together

Before going through the insights, here’s the process I used to gather the data:

  • Scrape names, artists and reviews of the 500 songs using Python’s urllib and BeautifulSoup, starting from the 500th one: “Shop Around”;
  • Get the lyrics of each song via the Lyrics’n’Music API (powered by LyricFind) with additional scraping, as unfortunately the API returns only a few sentences (as does the musiXmatch one, both for copyright reasons);
  • Run some NLP tasks on the corpus with nltk: tokenize the lyrics (i.e. split lyrics into words), apply stemming for normalisation (i.e. extract the word roots, e.g. “love”, “loved” and “loving” all map to “love”), and extract n-grams (i.e. sequences of words, here using n from 3 to 5) for some tasks described below.

Regarding that last step, I’ve used the PunktWordTokenizer, which gave better results than the default word_tokenize. As most of the lyrics are in English and the Punkt tokenizer is already trained for it, no additional work was required. Stemming was done with the Snowball algorithm – more about it below. Here’s a quick snippet of how it works

from nltk.tokenize.punkt import PunktWordTokenizer
from nltk.stem.snowball import SnowballStemmer

elvis = """
Here we go again
Asking where I've been
You can't see these tears are real
I'm crying (Yes I'm crying)
"""

sb = SnowballStemmer('english')
pk = PunktWordTokenizer()

print([sb.stem(w) for w in pk.word_tokenize(elvis)])

Leading to:

['here', 'we', 'go', 'again', 'ask', 'where', 'i', "'ve", 'been', \ 
'you', 'can', "'t", 'see', 'these', 'tear', 'are', 'real', 'i', \
"'m", 'cri', '(', 'ye', 'i', "'m", 'cri', ')']

As you can see, there are a few issues: “I’m” is split into “i” and “'m”, and “crying” is stemmed to “cri” and not to “cry”, as one could expect. Yet, “cried”, “cry” and “cries” are all stemmed to this same root with Snowball, which is OK in order to group words together. However, no stemming algorithm is perfect. Snowball identifies different roots for “love” and “lover”, while the Lancaster algorithm maps both to “lov”, but fails on the previous cry example.

>>> from nltk.stem.snowball import SnowballStemmer
>>> from nltk.stem.lancaster import LancasterStemmer
>>> sb = SnowballStemmer('english')
>>> lc = LancasterStemmer()
>>> cry = ['cry', 'crying', 'cries', 'cried']
>>> [lc.stem(w) for w in cry]
['cry', 'cry', 'cri', 'cri']
>>> [sb.stem(w) for w in cry]
[u'cri', u'cri', u'cri', u'cri']
>>> love = ['love', 'loves', 'loving', 'loved', 'lover']
>>> [lc.stem(w) for w in love ]
['lov', 'lov', 'lov', 'lov', 'lov']
>>> [sb.stem(w) for w in love ]
[u'love', u'love', u'love', u'love', u'lover']

That being said, on the full corpus, the top-10 stems were the same whichever algorithm was used (albeit with different counts and different spellings). Hence, I’ll report on the Snowball extraction in the remainder of this post.

Baby love

So, it appears that the most popular word variation in the corpus is “love”. It’s mentioned 1057 times in 219 songs (43.8%), followed by:

  • “I’m”: 1000 times, 242 songs
  • “oh”: 847 times, 180 songs
  • “know”: 779 times, 271 songs
  • “baby”: 746 times, 163 songs
  • “got”: 702 times, 182 songs
  • “yeah”: 656 times, 155 songs

One could probably write lyrics with “Oh yeah baby I got you, yeah I’m in love with you, yeah!” and easily fit in here (well, look at that opening line). Sorting by song ranking also brings “like” in the top list, included in 194 of those top-500 songs.
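The two numbers reported for each word (total occurrences and number of songs) can be obtained with a couple of counters. Here’s a minimal sketch, using a plain whitespace split instead of the nltk tokenizer and stemmer, on a made-up three-song corpus; the word_stats function is hypothetical.

```python
from collections import Counter

# Made-up mini corpus (song title -> lyrics); not taken from the top 500.
LYRICS = {
    "song a": "oh baby i love you",
    "song b": "love love love",
    "song c": "yeah yeah i know",
}


def word_stats(lyrics):
    """Return (total occurrences per word, number of songs per word)."""
    total, songs = Counter(), Counter()
    for text in lyrics.values():
        tokens = text.lower().split()
        total.update(tokens)       # every occurrence counts
        songs.update(set(tokens))  # each song counts once per word
    return total, songs


total, songs = word_stats(LYRICS)
print("love: %d times in %d songs" % (total["love"], songs["love"]))
```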

I wanna be anarchy

Looking at the top-5 3-grams, we still have a sense of a general “you-and-me” feeling that occurs in those songs:

  • “I want to”: 38 songs
  • “I don’t know”: 35 songs
  • “I love you”: 26 songs
  • “You know I”: 22 songs
  • “You want to”: 21 songs

Followed by other want / don’t want combinations. Once again, most of the want-list is love-related. While some want to hold her hand, know if she loved them or simply know what love is, others prefer to be your dog, while some just want to be free.

There was no real pattern on the 4-grams and 5-grams, besides that Blondie, Jimi Hendrix and 7 others “don’t know why”, and that the B-52’s, Bob Dylan and Jay-Z have something to do on “the other side of the road”.
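For reference, counting in how many songs each n-gram appears takes only a few lines. The sketch below uses a plain whitespace split and a made-up corpus rather than the real lyrics; the ngrams and songs_per_ngram names are mine, and nltk ships its own n-gram helper that could be used instead.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def songs_per_ngram(lyrics, n):
    """Count the number of songs in which each n-gram appears at least once."""
    counts = Counter()
    for text in lyrics.values():
        counts.update(set(ngrams(text.lower().split(), n)))
    return counts


# Made-up mini corpus (song title -> lyrics); not taken from the top 500.
LYRICS = {
    "song a": "i want to know",
    "song b": "i want to be free i want to",
    "song c": "you know i want you",
}

top = songs_per_ngram(LYRICS, 3).most_common(1)
print(top)
```

Counting each n-gram once per song, rather than once per occurrence, prevents a single repetitive chorus from dominating the list.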

Hotel California

As a short-list compiled by a rock magazine, you could expect a few tracks falling under the sex, drugs and rock-n-roll stereotype. Well, not really. In the top-500, only 13 songs contain the word sex, 5 drug and 4 rock’n’roll, none of them combining all three.

Looking deeper into the drug-theme, and using a Freebase query to get a list of abused substances and their aliases, we find 7 occurrences for cocaine and 4 for heroin – three times for the first one in the eponymous song, while grass and pot appear a few times, even though it would require more analysis to see in which context they’re used. Of course, a simple token analysis like this one could not capture the full songs’ messages, and we miss classics like the awesome Comfortably Numb or White Rabbit by Jefferson Airplane.
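The token look-up itself is straightforward, and a small sketch also shows why context matters. In the following Python snippet, the alias table is a hand-written stand-in for the Freebase results (grouping grass and pot under cannabis is my assumption), and the substance_mentions function and the sample sentence are made up.

```python
from collections import Counter

# Hand-written stand-in for the Freebase aliases; the grouping is an assumption.
ALIASES = {
    "cocaine": {"cocaine"},
    "heroin": {"heroin"},
    "cannabis": {"grass", "pot"},
}


def substance_mentions(tokens, aliases):
    """Count tokens matching a substance name or one of its aliases."""
    lookup = {alias: name for name, names in aliases.items() for alias in names}
    return Counter(lookup[token] for token in tokens if token in lookup)


# Made-up sentence with no drug reference at all.
tokens = "she sat on the grass with a pot of tea".split()
mentions = substance_mentions(tokens, ALIASES)
print(mentions)
```

Both hits here are false positives, which is exactly the kind of context issue a purely token-based analysis cannot resolve.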

Querying Freebase to find drugs and their aliases

More details about drugs in this top-500 are in the reviews themselves – often including background stories about the songs. Heroin is mentioned 11 times, acid 3, alcohol 3, and cocaine twice.

Good vibrations

Last but not least, I’ve used AlchemyAPI for topic extraction and sentiment analysis. Nothing very relevant came up from the entity extraction phase, but here are the most negative songs from the list according to their sentiment analysis module.

And the most positive ones

For both, it seems there’s a clear bias towards the words used in the song (e.g. “shame” or “love”), rather than extracting sentiments from the song’s actual meaning. It would be more interesting to use a data-set from SongMeanings or Songfacts to run a proper analysis – this might be for another post.

That’s it for today!

NB: Update 13/05 – If you’re looking for more lyrics-data-mining, here’s an interesting study on Rap lyrics by Matt Daniels.

Export and structure your musical activity with

Following my recent post on and personalisation on the Web, I wrote a music actions exporter for various services, including Facebook, Deezer, and Available at, it’s mostly a proof-of-concept, but it showcases the ability to uniformly export and structure your data (in that case music listening actions) whatever service you initially used. Does that ring a bell?

As the previous post focused on why it matters, I’ll cover technical aspects of the exporter here, including the role of JSON-LD for representing content on the Web.

One model to rule them all

The Music Actions exporter is not rocket science. Basically speaking, it translates (application-specific) JSON data into another (open, with shared semantics) JSON representation, using JSON-LD. But that’s also where the power lies: it would take only a few engineering hours to most platforms to expose their actions with if they already have a public API – or user profile pages (think RDFa or microdata) – doing so. And they would probably enjoy the same benefits as when publishing factual data with

Moreover, it will make life easier for developers: understanding a single model / semantics and learning a common set of tools will be enough to get and use data from multiple sources, as opposed to handling multiple APIs as it is currently the case – meaning, eventually, more exposure for the service. This is the grand Semantic Web promise, and I’m glad to see it more alive than ever.

In particular, let’s consider the music vertical: interoperable taste profiles, shared playlists, portable collections, death-to-cold-start… you name it, it could finally be done. The promise has been here for a while, many have tried, and it obviously reminds me of some earlier work I’ve done circa 2008 (during and post-Ph.D.), including this initiative with Yves Raimond from the BBC using FOAF, SIOC, MO and more:

Coming back to the exporter, here’s an excerpt of my recent Facebook music.listens activity (mostly gathered from Spotify here) exported as JSON-LD, with a longer feed here.

{
  "@context": {
    "name": "…",
    "agent_of": {
      "@reverse": "…"
    }
  },
  "@id": "…",
  "url": "…",
  "name": "Alexandre Passant",
  "@type": "Person",
  "agent_of": [{
    "@type": "ListenAction",
    "object": {
      "@id": "…",
      "url": "…",
      "@type": "MusicRecording",
      "name": "Represent (Rocked Out Mix)",
      "audio": "…",
      "byArtist": [{
        "@id": "…",
        "url": "…",
        "@type": "MusicGroup",
        "name": "Weezer"
      }],
      "inAlbum": [{
        "@id": "…",
        "url": "…",
        "@type": "MusicAlbum",
        "name": "Hurley"
      }]
    }
  }]
}

For every service, the exporter returns the most recent tracks listened to (as ListenAction), including – when available – additional data about artists and albums. In the case of Deezer and Lastfm, that information is already in the history feed, while for Facebook, this requires additional calls to the Graph API, querying individual song entities in their data-graph.

Using Google Cloud Endpoints as an API layer

Since the exporter works as a simple API, I’ve implemented it using Google Cloud Endpoints. As part of Google’s Cloud offering, it greatly facilitates the process of building Web-based APIs. No need to build a full – albeit lightweight – application with routes / handlers (webapp2, etc.): document the API patterns (Request and Response messages), define the application logic, and let the infrastructure manage everything.

It also automatically provides a Web-based front-end to test the API, along with other advantages of the Google App Engine infrastructure, such as Web-based log management, so you can trace production errors without logging in to a remote box.

GAE Endpoints API Explorer

The only issue is that it can’t directly return JSON-LD, since it encapsulates everything into the following response.

{
  "kind": "musicactions#resourcesItem",
  "etag": "\"_oj1ynXDYJ3PHpeV8owlekNCPi4/NH17nWS3hMc3GSHWziswWp2pTFk\"",
  "data": "some action data"
}

Thus, if you use the exporter, you’ll need to parse the response, extract the data string value, and parse that string as JSON to get the “real” JSON-LD data. That’s not a big deal, as you probably won’t link to the API URL anyway since it contains your private authentication tokens. But it’s worth keeping in mind for some projects.
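The two-step parsing described above can be sketched in a few lines of Python. This is only an illustration: the response body is built by hand here rather than fetched from the API, the etag value is made up, and the `unwrap` helper name is mine, not part of the exporter.

```python
import json


def unwrap(response_body):
    """Extract the JSON-LD document from the Endpoints response envelope."""
    envelope = json.loads(response_body)   # first pass: the envelope itself
    return json.loads(envelope["data"])    # second pass: the embedded string


# Hand-built response, mimicking the envelope shown above (hypothetical values).
inner = json.dumps({"@type": "ListenAction", "object": {"name": "Buddy Holly"}})
body = json.dumps({
    "kind": "musicactions#resourcesItem",
    "etag": "\"abc\"",
    "data": inner,
})

action = unwrap(body)
print(action["object"]["name"])
```

The key point is that `data` is a string containing JSON, not a nested JSON object, hence the second call to `json.loads`.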

JSON-LD and the beauty of RDF

Last but not least: the use of JSON-LD, augmenting JSON with the concept of “Linked Data”, i.e. “meanings, not strings”.

Let’s look at the representation of 2 ListenAction instances for the same user (using their Facebook IDs in this example). The JSON-LD serialisation will be as follows. I’m using the @graph property to represent two statements about distinct objects (as those are 2 different ListenAction instances) in the same document, but I could have used multiple contexts.

{
  "@context": "…",
  "@graph": [{
    "@type": "ListenAction",
    "agent": {
      "@id": "…",
      "name": "Alexandre Passant",
      "@type": "Person"
    },
    "object": {
      "@id": "…",
      "name": "My Name Is Jonas",
      "@type": "MusicRecording"
    }
  }, {
    "@type": "ListenAction",
    "agent": {
      "@id": "…",
      "name": "Alexandre Passant",
      "@type": "Person"
    },
    "object": {
      "@id": "…",
      "name": "Buddy Holly",
      "@type": "MusicRecording"
    }
  }]
}

Below is the corresponding graph representation, with 2 nodes for the same agent (i.e. the user committing the action).

Representing ListeningActions with JSON-LD

Yet, an interesting aspect of JSON-LD is its relation with RDF – the Resource Description Framework and its graph model especially suited for the Web. As JSON-LD uses @ids as common node identifiers, a.k.a. URIs, those 2 agents are actually the same, and so the graph looks like:

Merging agents with JSON-LD
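The merge pictured here can be approximated in plain Python, without a JSON-LD or RDF library. This is a simplified sketch, not what a real processor does: properties are merged with a plain dictionary update, whereas RDF would accumulate every distinct value, and the example.org identifiers are hypothetical.

```python
def merge_nodes(nodes):
    """Merge node objects that share the same @id; keep blank nodes apart."""
    merged = {}
    blank = []
    for node in nodes:
        node_id = node.get("@id")
        if node_id is None:
            # No identifier: nothing tells us two such nodes are the same.
            blank.append(node)
        else:
            merged.setdefault(node_id, {}).update(node)
    return list(merged.values()) + blank


# Two actions by the same (hypothetical) user identifier.
alex = "http://example.org/users/alex"
graph = [
    {"@type": "ListenAction",
     "agent": {"@id": alex, "name": "Alexandre Passant", "@type": "Person"},
     "object": {"@id": "http://example.org/tracks/1", "name": "My Name Is Jonas"}},
    {"@type": "ListenAction",
     "agent": {"@id": alex, "name": "Alexandre Passant", "@type": "Person"},
     "object": {"@id": "http://example.org/tracks/2", "name": "Buddy Holly"}},
]

agents = merge_nodes([action["agent"] for action in graph])
print(len(agents))
```

The two agent objects collapse into a single node because they carry the same @id, while the two songs, with different identifiers, would stay distinct.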

Finally, an interesting property of RDF / JSON-LD graphs is their directed edges. Thus, instead of writing the previous statement from an Action-centric perspective, with un-identified action instances (a.k.a. blank nodes), we can write it from a User-centric perspective using an inverse property (“reverse” in the JSON-LD world), as follows.

Using inverse properties in JSON-LD
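As a rough illustration of this inversion, the sketch below regroups action-centric nodes under their agent in plain Python. The `agent_of` key matches the reverse property used in this post; the function name and the example.org identifiers are hypothetical, and the resulting document would still need a context that declares `agent_of` as the reverse of the agent property.

```python
def to_user_centric(graph):
    """Regroup action-centric nodes under their agent, via 'agent_of'."""
    users = {}
    for action in graph:
        agent = action["agent"]
        # One entry per agent @id, created on first sight.
        user = users.setdefault(agent["@id"], {**agent, "agent_of": []})
        # The action keeps everything except the now-implicit agent.
        user["agent_of"].append(
            {key: value for key, value in action.items() if key != "agent"}
        )
    return list(users.values())


person = {"@id": "http://example.org/users/alex",
          "name": "Alexandre Passant", "@type": "Person"}
actions = [
    {"@type": "ListenAction", "agent": person,
     "object": {"name": "My Name Is Jonas", "@type": "MusicRecording"}},
    {"@type": "ListenAction", "agent": person,
     "object": {"name": "Buddy Holly", "@type": "MusicRecording"}},
]

users = to_user_centric(actions)
print(users[0]["name"], len(users[0]["agent_of"]))
```

The information is the same in both shapes; only the direction in which the edges are written changes.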

The user-centric view leads to the following JSON-LD document, thanks to the definition of an additional reverse property in the context. This, IMO, makes the document easier to understand: the user / Person is now the core element of the document, with edges from it to the actions it contributes to.

{
  "@context": {
    "name": "…",
    "agent_of": {
      "@reverse": "…"
    }
  },
  "@id": "…",
  "name": "Alexandre Passant",
  "@type": "Person",
  "agent_of": [{
    "@type": "ListenAction",
    "object": {
      "@id": "…",
      "name": "My Name Is Jonas",
      "@type": "MusicRecording"
    }
  }, {
    "@type": "ListenAction",
    "object": {
      "@id": "…",
      "name": "Buddy Holly",
      "@type": "MusicRecording"
    }
  }]
}

From shared actions to shared entities

While being (for now) a proof of concept, the exporter is a first step towards a common integration of musical actions on the Web. Of course, the same pattern / method could be applied to any other vertical. But, more interestingly, we can hope that services will directly publish their actions using, as they’ve been doing for other facts – for instance artist concert data, now enriching Google’s search results through their Knowledge Graph.

In addition, an interesting next step would be to use common object identifiers across services, in order to not only share a common semantics about actions, but also about the objects used in those actions. This could be achieved by referring to open knowledge bases such as Freebase, or using vertical-specific ones such as our new seevl API in the music area. Oh, and there will be more to come about seevl and actions in the near future. Interested? Let’s connect.