echoplot – Plot song loudness using the EchoNest API

As I’m working on the second part of my analysis of the Rolling Stone 500 Greatest Songs of All Time, I needed to draw the loudness profile of various songs extracted from the EchoNest API. I’ve been using matplotlib with pyechonest, and as the process is quite repetitive, I’ve packaged everything as echoplot, so you can easily plot song loudness using the EchoNest API.

pip install echoplot

Once you’ve set up your EchoNest API key as the environment variable ECHO_NEST_API_KEY (e.g. export ECHO_NEST_API_KEY=YOUR_KEY in your shell), just run echoplot:

marvin-7:~ alex$ echoplot -h
usage: echoplot [-h] [-s START] [-e END] artist title

Plot loudness of a song using the EchoNest API.

positional arguments:
  artist                the song's artist, e.g. 'The Clash'
  title                 the song's title, e.g. 'London Calling'

optional arguments:
  -h, --help            show this help message and exit
  -s START, --start START
                        start analysis at a given time (seconds)
  -e END, --end END     end analysis at a given time (seconds)

For example

marvin-7:~ alex$ echoplot 'The Clash' 'London Calling'
The Clash – London Calling

or

marvin-7:~ alex$ echoplot Radiohead 'Paranoid Android'
Radiohead – Paranoid Android

The plot also displays the different segments of the song (chorus, verse, etc.), also provided by the EchoNest API. Echoplot’s source code is on GitHub and the package is on PyPI.
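
Under the hood, the idea is quite simple. Here’s a minimal sketch of the core logic – not the actual implementation, see the source for that – assuming pyechonest and matplotlib are installed and the API key is set:

import json
import urllib2

import matplotlib.pyplot as plt
from pyechonest import song

# Find the song, then fetch its detailed analysis document
results = song.search(artist='The Clash', title='London Calling')
track = results[0]
analysis_url = track.audio_summary['analysis_url']
analysis = json.load(urllib2.urlopen(analysis_url))

# Plot the maximum loudness of each segment over time
starts = [s['start'] for s in analysis['segments']]
loudness = [s['loudness_max'] for s in analysis['segments']]
plt.plot(starts, loudness)
plt.xlabel('Time (s)')
plt.ylabel('Loudness (dB)')
plt.title('The Clash - London Calling')
plt.show()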

Sex and drugs and Rock’n’roll: Analysing the lyrics of the Rolling Stone 500 greatest songs of all time

I was reading the Wikipedia entry for Rolling Stone’s 500 Greatest Songs of All Time, and while it contains a lot of interesting statistics (shortest and longest songs, decades, covers, etc.), I decided to do some “API-based data-science” and see what insights we can learn from this top-500.

#15: London Calling – One of the 5 songs from the Clash in the top 500

I’ll split this into multiple posts, in order to showcase how different APIs bring multiple perspectives to the data-set, such as acoustic features with the Echo Nest, mood recognition with Gracenote, or artist and genre data with seevl (disclosure – I’m the one mainly responsible for that last one).

Here’s the first one, investigating lyrics from those top-500 songs. The first part is rather technical, so if you’re interested only in the insights, just skip it. And here’s an accompanying playlist, featuring the songs mentioned in this post – all from the top 500, except the opening one.

Come together

Before going through the insights, here’s the process I used to gather the data:

  • Scrape names, artists and reviews of the 500 songs using python’s urllib and BeautifulSoup, starting from the 500th one: “Shop Around” (see the sketch after this list);
  • Get the lyrics of each song via the Lyrics’n’Music API (powered by LyricFind) with additional scraping, as unfortunately the API returns only a few sentences (as does the musiXmatch one, both for copyright reasons);
  • Run some NLP tasks on the corpus with nltk: tokenize the lyrics (i.e. split lyrics into words), apply stemming for normalisation (i.e. extract the word roots, e.g. “love”, “loved” and “loving” all map to “love”), and extract n-grams (i.e. sequences of words, here using n from 3 to 5) for some tasks described below.
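
Here’s a rough sketch of that first scraping step – the list URL and page structure below are assumptions for illustration, so the selectors would need adapting to the actual markup:

import urllib2
from bs4 import BeautifulSoup

# Hypothetical list page; the real crawl starts from song #500 and
# follows the "next" links between pages.
URL = ('http://www.rollingstone.com/music/lists/'
       'the-500-greatest-songs-of-all-time-20110407')

soup = BeautifulSoup(urllib2.urlopen(URL).read())

# Hypothetical markup: one block per song, with its title and review.
for entry in soup.find_all('div', class_='list-item'):
    print entry.find('h3').get_text().strip()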

Regarding that last step, I’ve used the PunktWordTokenizer, which gave better results than the default word_tokenize. As most of the lyrics are in English and the Punkt tokenizer is already trained for it, no additional work was required. Stemming was done with the Snowball algorithm – more about it below. Here’s a quick snippet of how it works:

from nltk.tokenize.punkt import PunktWordTokenizer
from nltk.stem.snowball import SnowballStemmer

elvis = """
Here we go again
Asking where I've been
You can't see these tears are real
I'm crying (Yes I'm crying)
"""

# English Snowball stemmer and the (pre-trained) Punkt word tokenizer
sb = SnowballStemmer('english')
pk = PunktWordTokenizer()

# Tokenize the lyrics, then stem each token
print [sb.stem(w) for w in pk.tokenize(elvis)]

Leading to:

['here', 'we', 'go', 'again', 'ask', 'where', 'i', "'ve", 'been', \ 
'you', 'can', "'t", 'see', 'these', 'tear', 'are', 'real', 'i', \
"'m", 'cri', '(', 'ye', 'i', "'m", 'cri', ')']

As you can see, there are a few issues: “I’m” is split into “i” and “’m”, “yes” is stemmed to “ye”, and “crying” to “cri” – not to “cry”, as one could expect. Yet, “cried”, “cry” and “cries” are all stemmed to that same root with Snowball, which is what we need to group words together. Still, no stemming algorithm is perfect: Snowball identifies different roots for “love” and “lover”, while the Lancaster algorithm maps both to “lov” but fails on the previous cry example.

>>> from nltk.stem.snowball import SnowballStemmer
>>> from nltk.stem.lancaster import LancasterStemmer
>>>
>>> sb = SnowballStemmer('english')
>>> lc = LancasterStemmer()
>>>
>>> cry = ['cry', 'crying', 'cries', 'cried']
>>> [lc.stem(w) for w in cry]
['cry', 'cry', 'cri', 'cri']
>>> [sb.stem(w) for w in cry]
[u'cri', u'cri', u'cri', u'cri']
>>>
>>> love = ['love', 'loves', 'loving', 'loved', 'lover']
>>> [lc.stem(w) for w in love ]
['lov', 'lov', 'lov', 'lov', 'lov']
>>> [sb.stem(w) for w in love ]
[u'love', u'love', u'love', u'love', u'lover']

That being said, on the full corpus, the top-10 stems were the same regardless of the algorithm (albeit with different counts and different spellings). Hence, I’ll report on the Snowball extraction in the remainder of this post.

Baby love

So, it appears that the most popular word variation in the corpus is “love”. It’s mentioned 1057 times in 219 songs (43.8%), followed by:

  • “I’m”: 1000 times, 242 songs
  • “oh”: 847 times, 180 songs
  • “know”: 779 times, 271 songs
  • “baby”: 746 times, 163 songs
  • “got”: 702 times, 182 songs
  • “yeah”: 656 times, 155 songs

One could probably write lyrics with “Oh yeah baby I got you, yeah I’m in love with you, yeah!” and fit right in (well, look at that opening line). Sorting by song ranking also brings “like” into the top list, included in 194 of those top-500 songs.

I wanna be anarchy

Looking at the top-5 3-grams, we still get a sense of the general “you-and-me” feeling that runs through those songs:

  • “I want to”: 38 songs
  • “I don’t know”: 35 songs
  • “I love you”: 26 songs
  • “You know I”: 22 songs
  • “You want to”: 21 songs

Followed by other want / don’t want combinations. Once again, most of the want-list is love-related. While some want to hold her hand, know if she loved them or simply know what love is, others prefer to be your dog, and some just want to be free.

There was no real pattern in the 4-grams and 5-grams, besides that Blondie, Jimi Hendrix and 7 others “don’t know why”, and that the B-52’s, Bob Dylan and Jay-Z have something to do on “the other side of the road”.
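
For the record, here’s roughly how such per-song n-gram counts can be computed with nltk – a toy sketch on a two-song corpus, not the exact code used for this post:

from collections import Counter

from nltk.util import ngrams

# Toy corpus: one token list per song (in practice, the tokenised lyrics)
corpus = [
    ['i', 'want', 'to', 'hold', 'your', 'hand'],
    ['i', 'want', 'to', 'know', 'what', 'love', 'is'],
]

# Count each 3-gram once per song to get "appears in X songs" figures
counts = Counter()
for tokens in corpus:
    counts.update(set(ngrams(tokens, 3)))

print counts.most_common(3)  # ('i', 'want', 'to') appears in 2 songs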

Hotel California

As a short-list compiled by a rock magazine, you could expect a few tracks falling under the sex, drugs and rock’n’roll stereotype. Well, not really: in the top-500, only 13 songs contain the word sex, 5 drug and 4 rock’n’roll – and none of them combine all three.

Looking deeper into the drug theme, and using a Freebase query to get a list of abused substances and their aliases, we find 7 occurrences of cocaine and 4 of heroin – three of the former in the eponymous song – while grass and pot appear a few times, even though it would require more analysis to see in which context they’re used. Of course, a simple token analysis like this one cannot capture a song’s full message, and we miss classics like the awesome Comfortably Numb or White Rabbit by Jefferson Airplane.

Querying Freebase to find drugs and their aliases
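
The query itself is plain MQL against the Freebase read API. Here’s a hedged sketch of what it could look like – the /medicine/drug type and the /common/topic/alias property are my assumptions, not necessarily the exact query used for this post:

import json
import urllib
import urllib2

# MQL query: list drugs with their name and aliases (type assumed)
query = [{
    'type': '/medicine/drug',
    'name': None,
    '/common/topic/alias': [],
    'limit': 100,
}]

url = 'https://www.googleapis.com/freebase/v1/mqlread?' + \
    urllib.urlencode({'query': json.dumps(query)})

for drug in json.load(urllib2.urlopen(url))['result']:
    print drug['name'], drug['/common/topic/alias']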

More details about drugs in this top-500 appear in the reviews themselves – often including background stories about the songs. Heroin is mentioned 11 times, acid 3, alcohol 3, and cocaine twice.

Good vibrations

Last but not least, I’ve used AlchemyAPI for topic extraction and sentiment analysis. Nothing very relevant came up from the entity extraction phase, but here are the most negative songs from the list, according to its sentiment analysis module.

And the most positive ones

For both, it seems there’s a clear bias towards the words used in the song (e.g. “shame” or “love”), rather than sentiment extracted from the song’s actual meaning. It would be more interesting to run a proper analysis on a data-set from SongMeanings or Songfacts – that might be for another post.

That’s it for today!

NB: Update 13/05 – If you’re looking for more lyrics-data-mining, here’s an interesting study on Rap lyrics by Matt Daniels.

Export and structure your musical activity with schema.org

Following my recent post on schema.org and personalisation on the Web, I wrote a music actions exporter for various services, including Facebook, Deezer, and Last.fm. Available at http://music-actions.appspot.com, it’s mostly a proof-of-concept, but it showcases the ability to uniformly export and structure your data (in that case, music listening actions) whichever service you initially used. Does that ring a bell?

As the previous post focused on why it matters, I’ll cover technical aspects of the exporter here, including the role of JSON-LD for representing content on the Web.

One model to rule them all

The Music Actions exporter is not rocket science. Basically, it translates (application-specific) JSON data into another (open, with shared semantics) JSON representation, using JSON-LD. But that’s also where the power lies: it would take most platforms only a few engineering hours to expose their actions with schema.org, provided they already have a public API – or user profile pages (think RDFa or microdata). And they would probably enjoy the same benefits as when publishing factual data with schema.org.

Moreover, it would make life easier for developers: understanding a single model / semantics and learning a common set of tools would be enough to get and use data from multiple sources, as opposed to handling multiple APIs as is currently the case – meaning, eventually, more exposure for the service. This is the grand Semantic Web promise, and I’m glad to see it more alive than ever.

In particular, let’s consider the music vertical: inter-operable taste profiles, shared playlists, portable collections, death-to-cold-start… you name it, it could finally be done. The promise has been here for a while, many have tried, and it obviously reminds me of some earlier work I did circa 2008 (during and post-Ph.D.), including this initiative with Yves Raimond from the BBC using FOAF, SIOC, MO and more:

Coming back to the exporter, here’s an excerpt of my recent Facebook music.listens activity (mostly gathered from Spotify here) exported as JSON-LD, with a longer feed here.

{
  "@context": {
    "@vocab": "http://schema.org/",
    "agent_of": {
      "@reverse": "http://schema.org/agent"
    }
  },
  "@id": "http://facebook.com/alexandre.passant",
  "url": "http://facebook.com/alexandre.passant",
  "name": "Alexandre Passant",
  "@type": "Person",
  "agent_of": [{
    "@type": "ListenAction",
    "object": {
      "@id": "http://open.spotify.com/track/1B930FbwpwrJKKEQOhXunI",
      "url": "http://open.spotify.com/track/1B930FbwpwrJKKEQOhXunI",
      "@type": "MusicRecording",
      "name": "Represent (Rocked Out Mix)",
      "audio": "http://open.spotify.com/track/1B930FbwpwrJKKEQOhXunI",
      "byArtist": [{
        "@id": "http://open.spotify.com/artist/3jOstUTkEu2JkjvRdBA5Gu",
        "url": "http://open.spotify.com/artist/3jOstUTkEu2JkjvRdBA5Gu",
        "@type": "MusicGroup",
        "name": "Weezer"
      }],
      "inAlbum": [{
        "@id": "http://open.spotify.com/album/0s56sFx1BJMyE8GGskfYJX",
        "url": "http://open.spotify.com/album/0s56sFx1BJMyE8GGskfYJX",
        "@type": "MusicAlbum",
        "name": "Hurley"
      }]
    }
  }]
}

For every service, it returns the most recent tracks listened to (as ListenAction), including – when available – additional data about artists and albums. In the case of Deezer and Last.fm, that information is already in the history feed, while for Facebook it requires additional calls to the Graph API, querying individual song entities in their data-graph.
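
For the Facebook case, a minimal sketch of those extra calls could look as follows – the endpoints follow the 2014 Graph API, and the token and field names are illustrative:

import json
import urllib2

TOKEN = 'YOUR_ACCESS_TOKEN'

def graph(path):
    """Fetch a Graph API resource as JSON."""
    url = 'https://graph.facebook.com/%s?access_token=%s' % (path, TOKEN)
    return json.load(urllib2.urlopen(url))

# Recent music.listens actions for the current user...
for action in graph('me/music.listens')['data']:
    song_id = action['data']['song']['id']
    # ...then one extra call per song entity for artist / album details
    song = graph(song_id)
    print song.get('title')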

Using Google Cloud Endpoints as an API layer

Since the exporter works as a simple API, I’ve implemented it using Google Cloud Endpoints. As part of Google’s Cloud offering, it greatly facilitates the process of building Web-based APIs. No need to build a full – albeit lightweight – application with routes / handlers (webapp2, etc.): document the API patterns (Request and Response messages), define the application logic, and let the infrastructure manage everything.

It also automatically provides a Web-based front-end to test the API, plus the other advantages of the Google App Engine infrastructure, such as Web-based log management, so you can trace production errors without logging in to a remote box.

GAE Endpoints API Explorer

The only issue is that it can’t directly return JSON-LD, since it encapsulates everything in the following response.

{
  "kind": "musicactions#resourcesItem",
  "etag": "\"_oj1ynXDYJ3PHpeV8owlekNCPi4/NH17nWS3hMc3GSHWziswWp2pTFk\"",
  "data": "... the JSON-LD document, serialised as a single string ..."
}

Thus, if you use the exporter, you’ll need to parse the response and extract the data string value, then transform it into JSON to get the “real” JSON-LD data. That’s not a big deal, as you probably won’t link to the API URL anyway since it contains your private authentication tokens, but it’s worth keeping in mind for some projects.
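
In Python, unwrapping boils down to a second json.loads call – a minimal sketch, with a stubbed wrapper standing in for a live API response:

import json

# Stubbed response, standing in for the actual API output shown above
wrapper = {
    'kind': 'musicactions#resourcesItem',
    'etag': '"..."',
    'data': '{"@context": "http://schema.org", "@type": "Person", '
            '"name": "Alexandre Passant"}',
}

# The JSON-LD document is serialised as a plain string in the data field
jsonld = json.loads(wrapper['data'])
print jsonld['name']  # -> Alexandre Passant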

JSON-LD and the beauty of RDF

Last but not least: the use of JSON-LD, augmenting JSON with the concept of “Linked Data”, i.e. “meanings, not strings”.

Let’s look at the representation of 2 ListenAction instances for the same user (using their Facebook IDs in this example). The JSON-LD serialisation is as follows. I’m using the @graph property to represent statements about distinct objects (as those are 2 different ListenActions) in the same document, but I could have used multiple contexts.

{
  "@context": "http://schema.org",
  "@graph": [{
    "@type": "ListenAction",
    "agent": {
      "@id": "http://graph.facebook.com/607513040",
      "name": "Alexandre Passant",
      "@type": "Person"
    },
    "object": {
      "@id": "http://graph.facebook.com/10150500879645722",
      "name": "My Name Is Jonas",
      "@type": "MusicRecording"
    }
  }, {
    "@type": "ListenAction",
    "agent": {
      "@id": "http://graph.facebook.com/607513040",
      "name": "Alexandre Passant",
      "@type": "Person"
    },
    "object": {
      "@id": "http://graph.facebook.com/10150142973310868",
      "name": "Buddy Holly",
      "@type": "MusicRecording"
    }
  }]
}

Below is the corresponding graph representation, with 2 nodes for the same agent (i.e. the user performing the actions).

Representing ListenActions with JSON-LD

Yet, an interesting aspect of JSON-LD is its relation to RDF – the Resource Description Framework – and its graph model, especially suited for the Web. As JSON-LD uses @ids as common node identifiers, a.k.a. URIs, those 2 agents are actually the same, and so the graph looks like:

Merging agents with JSON-LD
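
You can verify this merging behaviour with any RDF toolkit. Here’s a quick sketch with rdflib and its JSON-LD plugin (rdflib-jsonld) – I’m using an inline @vocab context instead of the remote schema.org one, so that it runs offline:

from rdflib import Graph, RDF, URIRef

doc = """
{
  "@context": {"@vocab": "http://schema.org/"},
  "@graph": [{
    "@type": "ListenAction",
    "agent": {"@id": "http://graph.facebook.com/607513040", "@type": "Person"},
    "object": {"@id": "http://graph.facebook.com/10150500879645722"}
  }, {
    "@type": "ListenAction",
    "agent": {"@id": "http://graph.facebook.com/607513040", "@type": "Person"},
    "object": {"@id": "http://graph.facebook.com/10150142973310868"}
  }]
}
"""

g = Graph().parse(data=doc, format='json-ld')

# Both actions point to one and the same agent node
agents = set(g.subjects(RDF.type, URIRef('http://schema.org/Person')))
print len(agents)  # -> 1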

Finally, an interesting property of RDF / JSON-LD graphs is that their edges are directed. Thus, instead of writing the previous statement from an Action-centric perspective, with unidentified action instances (a.k.a. blank nodes), we can write it from a User-centric perspective using an inverse property (“reverse” in the JSON-LD world), as follows.

Using inverse properties in JSON-LD

This leads to the following JSON-LD document, thanks to the definition of an additional reverse property in the context. IMO, it makes the document easier to understand, since it’s now user-centric: the user / Person is the core element of the document, with edges from it to the actions it contributes to.

{
  "@context": {
    "@vocab": "http://schema.org/",
    "agent_of": {
      "@reverse": "http://schema.org/agent"
    }
  },
  "@id": "http://graph.facebook.com/607513040",
  "name": "Alexandre Passant",
  "@type": "Person",
  "agent_of": [{
    "@type": "ListenAction",
    "object": {
      "@id": "http://graph.facebook.com/10150500879645722",
      "name": "My Name Is Jonas",
      "@type": "MusicRecording"
    }
  }, {
    "@type": "ListenAction",
    "object": {
      "@id": "http://graph.facebook.com/10150142973310868",
      "name": "Buddy Holly",
      "@type": "MusicRecording"
    }
  }]
}

From shared actions to shared entities

While (for now) a proof of concept, the exporter is a first step towards a common integration of musical actions on the Web. Of course, the same pattern / method could be applied to any other vertical. More interestingly, we can hope that services will directly publish their actions using schema.org, as they’ve been doing for other facts – for instance artist concert data, now enriching Google’s search results through the Knowledge Graph.

In addition, an interesting next step would be to use common object identifiers across services, in order to share common semantics not only about the actions, but also about the objects used in those actions. This could be achieved by referring to open knowledge bases such as Freebase, or to vertical-specific ones such as our new seevl API in the music area. Oh, and there will be more to come about seevl and actions in the near future. Interested? Let’s connect.

The new schema.org actions: What they mean for personalisation on the Web

The schema.org initiative just announced the release of a new action vocabulary. As their blog post emphasises:

The Web is not just about static descriptions of entities. It is about taking action on these entities.

Whether they’re online or offline, publishing those actions in a machine-readable format follows TimBL’s “Weaving the Web” vision of the Web as a social machine.

It’s even more relevant as the online and offline worlds become one, whether through apps (4square, Uber, etc.) or via sensors and wearable tech (mobile phones, Glass, etc.). A particular aspect I’m interested in is how those actions can help to personalise the Web.

The rise of dynamic content and structured data on the Web

This is not the first time actions – at least online ones – are used on the Web: think of Activity Streams, Web Intents, as well as the SIOC-Actions model I worked on with Pierre-Antoine Champin a few years ago.

Yet, considering the recent advances in structured Web data (schema.org, Google’s Knowledge Graph, Facebook’s Open Graph, Twitter cards…), this addition is a timely move. Everyone can now publish their actions using a shared vocabulary, meaning that apps and services can consume them openly – pending the correct credentials and privacy settings. And that’s a big move for personalisation.

Personalising content from distributed data

Let’s consider my musical activity. Right now, I can plug my services into Facebook and use the Graph API to retrieve my listening history. Or query APIs such as Deezer’s. Or check my Twitter and Instagram feeds to remember some of the records I’ve put on my turntable. Yet, if all of them published actions using the new ListenAction type, I could use a single query engine to get the data from those different endpoints.

Deezer could describe actions using the following JSON-LD, and Spotify with RDFa, but it doesn’t really matter – as both would agree on shared semantics through a single vocabulary.

<script type="application/ld+json">
{
  "@context":"http://schema.org",
  "@type":"ListenAction",
  "agent":{
    "@type":"Person",
    "name":"Alex"
  },
  "object":{
    "@type":"MusicGroup",
    "name":"The Clash"
  },
  "instrument":{
    "@type":"WebApplication", 
    "name":"Deezer",
    "url":"http://deezer.com"
  } 
}
</script>
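
For comparison, here’s a sketch of what the same action could look like in RDFa – the markup structure is illustrative, as the surrounding HTML is up to the publisher:

<div vocab="http://schema.org/" typeof="ListenAction">
  <div property="agent" typeof="Person">
    <span property="name">Alex</span>
  </div>
  <div property="object" typeof="MusicGroup">
    <span property="name">The Clash</span>
  </div>
  <div property="instrument" typeof="WebApplication">
    <a property="url" href="http://deezer.com">
      <span property="name">Deezer</span>
    </a>
  </div>
</div>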

Ultimately, that means that every service could gather data from different sources to meaningfully extract information about me, and deliver a personalised experience as soon as I log in.

You might think that Facebook enables this already with the Graph API. Indeed, but the data needs to be in Facebook. This is not always the case, either because the seed services haven’t implemented the proper connectors – or have removed them – or because you didn’t allow them to share your actions.

In this new configuration, I could decide, for every service I log in to, which sources it can access. Logging in to a music platform? Let it access my Deezer and Spotify profiles, where some schema.org Actions can be found. Booking a restaurant? Check my OpenTable ones. From there, those services can quickly build my profile and start personalising my online experience.

In addition, websites could decide to use background knowledge to enrich one’s profile, using vertical databases – e.g. Factual for geolocation data, or our recently relaunched seevl API for music meta-data – combined with advanced heuristics such as time decay, action-object granularity and more, to enhance their profiling capabilities (if you’re interested in the topic, check the slides of Fabrizio Orlandi’s Ph.D. viva).

Privacy matters

This way of personalising content could also have important privacy implications. By selecting which sources a service can access, I implicitly block access to data that is non-relevant or too private for that particular service – as opposed to granting access to all my content.

Going further, we can imagine a privacy-control matrix where I select not only the sources, but also the action types to be used, keeping my data safe and avoiding freakomendations. I could provide my 4square eating actions (restaurants I’ve checked into) to a food website, but offer my musical background (concerts I’ve been to) to a music app, keeping both separate.

Of course, websites should be smart enough to know which actions they require, doing a source / action pre-selection for me. This could ultimately solve some of the trust issues often discussed when talking about personalisation, as Facebook’s Sam Lessin addressed in his keynote on the future of travel.

What’s next?

As you can see, I’m particularly interested in what’s going to happen with this new schema.org update, both from the publishers’ and the consumers’ points of view.

It will also be interesting to see how mappings could emerge between it and the Facebook Graph API, adding another level of interoperability in this quest to make the Web a social space.

Browsing the upcoming Record Store Day 2014 UK releases

It’s that time of the year again. In a few weeks, I’ll be queuing in front of my favorite music store on an early Saturday morning. Why? Because Record Store Day is coming!

While the list of upcoming releases is available on the official RSD website, I thought a quick hack would help me to more efficiently find what I’d like to put on my turntable this year without having to browse each page separately. So if, like me, you want to filter releases (by keyword, type, artist, label, …) and pick and print your selection, go to http://rsd.mdg.io!

Record Store Day browser

It uses BeautifulSoup for scraping, a drop of CoffeeScript and a pinch of AngularJS + Bootstrap for the UI.

NB: I’m looking for someone who can get some US releases. Drop me a tweet if you’re interested – I’m happy to exchange for UK ones, or to buy, face-value only.

Streaming the Pink Floyd: where are all the potheads?

While catalogues now tend to be similar across streaming platforms, besides a few notable exceptions, it’s interesting to note how top-tracks can differ between services.

If you’re listening to Pink Floyd on Spotify, Rdio or YouTube, top-tracks tend to come from their “mainstream” era, especially The Wall and Dark Side of the Moon.

Pink Floyd top-tracks on Spotify

Yet, top-tracks on Deezer are much more interesting, including live tracks from the Kralingen festival in 1970.

Pink Floyd top-tracks on Deezer

It might be because those tracks were available on Deezer before the full catalogue (hence gathering more plays), or because potheads prefer Deezer to Spotify, but I thought it was a fun fact to note. Anyway, enjoy the performance below, or check the festival album on Deezer!

And, of course, you can listen to them on seevl as well, with top-tracks gathered from iTunes.

MIDEM Music Hack Day 2014 – seevl hipster


Last week-end, the music industry met in Cannes for MIDEM. And, as for the past 4 years, the music-tech community gathered for a special Music Hack Day, sponsored by the Deezer and Spotify developer platforms and organised by Martyn Davies.

I’ve been lucky enough to participate for the third time in this week-end full of music, tech and energy, and built an obviously not-so-serious hack: seevl hipster.

Do you want to impress your friend who’s into electro-folk, or that other one who only listens to avant-garde metal? Now you can! By logging in to seevl hipster, you can finally find obscure artists that match your friends’ tastes, and show off on their Facebook walls.


This hack uses the Facebook API to identify your friends’ likes, which are sent to our (so far internal) seevl API in order to match their top genres (similarly to what you get when creating a seevl account), then uses the API again to suggest musicians, linking to their seevl pages for a full listening experience. It’s built using my now-favorite combo: AngularJS + Flask.
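
As a rough sketch, the first step (grabbing friends’ music likes) boils down to a couple of Graph API calls – the endpoints and the friends_likes permission follow the 2014 Graph API, the token is yours, and the seevl matching step is omitted since that API is internal:

import json
import urllib2

TOKEN = 'YOUR_ACCESS_TOKEN'  # needs the friends_likes permission

def graph(path):
    url = 'https://graph.facebook.com/%s?access_token=%s' % (path, TOKEN)
    return json.load(urllib2.urlopen(url))

# List friends, then the music pages each of them likes
for friend in graph('me/friends')['data']:
    likes = graph('%s/music' % friend['id'])
    print friend['name'], [page['name'] for page in likes['data']]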

You can read more about the 18 hacks built during the week-end, and see how Leap Motion might solve your Justin Bieber addiction (an MHD without a Bieber hack is not really an MHD), how to DJ with Spotify, or how to recycle your old MIDI keyboard, among others.