But what can we learn from such a dataset? Well, a lot actually: discovering top-tracks, building content-based recommendations, mining new trends, and finding influencers to target during album releases. This can be invaluable for a record label or an artist, and it’s no surprise that companies like Musicmetric or The Next Big Sound tackle it from the analytics perspective, while Gracenote or The Echo Nest focus on data, recommendations or user profiling.
Analyzing playlists is not a new thing, and you can read about various Big Data architectures used for music discovery, such as Spark at Spotify. I used Google’s BigQuery to get insights quickly without setting up my own stack. As I had experimented with it in the past, it was a good time to try it with my own dataset.
With a few Python scripts, here are the steps to set up the experiment [Update 2014-10-29: the scripts, as well as links to the dataset, are now available on GitHub]:
– First, gather about 500,000 playlists from the Deezer API, using a threaded crawler that randomly picks playlists with an ID between 1 and 10M, for a total of 9.7 GB of JSON data;
– Finally, define a schema to map the data to tables, and load it from Cloud Storage to BigQuery. It took only 12 seconds to load the 1 GB of compressed data, for a total of 510,187 playlists containing 12M tracks (900K distinct ones).
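The crawling step can be sketched roughly as follows. This is a minimal illustration and not the published scripts: the function names, the worker count, and the assumption that unknown IDs come back as a JSON body with an `error` key are mine. The `fetch` parameter is injectable so the sampling and threading logic can be exercised without hitting the network.

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

API_URL = "https://api.deezer.com/playlist/{}"  # public Deezer playlist endpoint
MAX_ID = 10_000_000  # IDs are sampled between 1 and 10M, as described above


def fetch_playlist(playlist_id):
    """Fetch one playlist as a dict, or return None if the call fails."""
    try:
        with urlopen(API_URL.format(playlist_id), timeout=10) as response:
            data = json.load(response)
    except (OSError, ValueError):  # network errors, timeouts, invalid JSON
        return None
    # Assumption: unused IDs are answered with a JSON body holding an "error" key.
    return None if "error" in data else data


def crawl(n_ids, fetch=fetch_playlist, workers=16, seed=None):
    """Sample n_ids distinct random IDs and fetch them concurrently."""
    rng = random.Random(seed)
    ids = rng.sample(range(1, MAX_ID + 1), n_ids)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(fetch, ids)
        return [playlist for playlist in results if playlist is not None]


def dump_ndjson(playlists, path):
    """Write one JSON object per line (newline-delimited JSON)."""
    with open(path, "w", encoding="utf-8") as out:
        for playlist in playlists:
            out.write(json.dumps(playlist) + "\n")
```

Newline-delimited JSON is the layout BigQuery accepts for JSON loads, so a file written this way could be compressed and uploaded to Cloud Storage before loading.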
With such an amount of data, and not only in the music domain, it’s relatively easy to build a content recommendation platform based on the “If you like X you’ll like Y” principle. Using this simple SQL query, you can find the top related artists for anyone in the dataset:
    SELECT COUNT(b.tracks.data.artist.id), b.tracks.data.artist.name
    FROM FLATTEN([Playlists.Playlists], tracks.data) a
    LEFT JOIN EACH FLATTEN([Playlists.Playlists], tracks.data) b
    ON a.id == b.id
    WHERE a.tracks.data.artist.id == <artist_id>
      AND b.tracks.data.artist.id != <artist_id>
    GROUP EACH BY 2
    ORDER BY 1 DESC
    LIMIT 5
### Related to Rihanna
* Britney Spears
* Beyoncé
* The Black Eyed Peas
* David Guetta
* Justin Timberlake

### Related to Daft Punk
* Justice
* Muse
* David Guetta
* Moby
* The Chemical Brothers

### Related to Agnostic Front
* Blood for Blood
* Hatebreed
* Dropkick Murphys
* Helga Hahnemann
* Bad Religion
A good way to bootstrap an artist-based radio station!
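For readers without a BigQuery account, the same co-occurrence idea can be sketched in plain Python over playlists already loaded in memory. This is an illustration under assumptions: the function name is hypothetical, the playlists are assumed to follow the `tracks.data.artist` nesting used in the query, and artists are grouped by ID rather than by name.

```python
from collections import Counter


def related_artists(playlists, artist_id, limit=5):
    """Rank other artists by co-occurrence with artist_id in the same playlists."""
    counts = Counter()
    names = {}
    for playlist in playlists:
        artists = [track["artist"] for track in playlist["tracks"]["data"]]
        # Number of tracks by the seed artist in this playlist. Like the
        # self-join, each other track is counted once per seed track.
        seed_hits = sum(1 for artist in artists if artist["id"] == artist_id)
        if seed_hits == 0:
            continue
        for artist in artists:
            if artist["id"] != artist_id:
                counts[artist["id"]] += seed_hits
                names[artist["id"]] = artist["name"]
    return [(names[i], n) for i, n in counts.most_common(limit)]
```

Weighting by `seed_hits` favours playlists that feature the seed artist heavily. Counting each playlist only once would be a reasonable alternative.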
Going further, building a song-to-song recommendation algorithm is not really complicated either. Here are, for instance, the tracks most frequently played together with “Harder Better Faster Stronger” that are not by Daft Punk.
### Related to Harder Better Faster Stronger, non Daft-Punk
* David Guetta: Cozi Baby When The Light
* Laurent Wolf: No Stress (Radio edit)
* David Guetta: Love Don't Let Me Go (Original Edit)
* David Guetta: Love Is Gone (Radio Edit Rmx)
* Mika: Relax, Take It Easy
Top artists and tracks, popularity, and more
Besides recommendations, an obvious use-case is to identify top-tracks or top-artists. For instance, here are the top-tracks for some artists based on their popularity in the full dataset.
### Most popular tracks from Daft Punk
* Around The World
* Harder Better Faster Stronger
* Da Funk
* Technologic
* Around The World / Harder Better Faster Stronger

### Most popular tracks from Weezer
* Island In The Sun
* My Name Is Jonas
* Beverly Hills
* Buddy Holly
* Hash Pipe
Combined with temporal attributes (not available here unfortunately, more on this later), one could also identify how fast a track progresses from its release to a top-X.
Regarding top-artists, the easy way is to rank them by the number of tracks they have in the full dataset (900K distinct ones).
### Top-artists by number of tracks
* Linkin Park (65,415)
* Muse (59,550)
* U2 (54,688)
* Rihanna (53,354)
* Queen (51,717)
But another way is to sort artists by the number of playlists they appear in:
    SELECT COUNT(id) as c, tracks.data.artist.id, tracks.data.artist.name
    FROM (
      SELECT id, tracks.data.artist.id, tracks.data.artist.name
      FROM [Playlists.Playlists]
      GROUP EACH BY 1, 2, 3
    )
    GROUP EACH BY 2, 3
    ORDER BY c DESC
Surprisingly, the most popular artist is then a karaoke cover band, included in 23,993 of the 510,187 playlists, more than Rihanna or U2!
### Top-artists by playlist appearances
* Studio Group (23,993)
* Rihanna (23,398)
* U2 (17,860)
* Queen (17,463)
* Linkin Park (17,232)
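The difference between the two rankings comes down to whether an artist is counted once per track entry or once per playlist. A minimal Python sketch of both counts, assuming in-memory playlists with the same `tracks.data.artist` nesting as in the queries (the function name is hypothetical):

```python
from collections import Counter


def artist_rankings(playlists):
    """Return (track-entry counts, playlist counts) keyed by artist name."""
    by_tracks = Counter()
    by_playlists = Counter()
    for playlist in playlists:
        names = [track["artist"]["name"] for track in playlist["tracks"]["data"]]
        by_tracks.update(names)          # every track entry counts
        by_playlists.update(set(names))  # each artist counts once per playlist
    return by_tracks, by_playlists
```

An artist with many tracks in a few playlists leads the first ranking, while one with a single track in many playlists leads the second.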
Another interesting insight (not surprising if you’re into music discovery and the long tail) concerns the way popular artists outweigh less popular ones in the distribution: 43,346 artists, i.e. about a third of them, appear only once in the dataset, and 37,864 appear between 2 and 10 times.
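This kind of long-tail breakdown is simple to compute once per-artist appearance counts are available. Here is a small sketch, assuming a mapping from artist to a positive appearance count. The function name is hypothetical, and the bucket boundaries are taken from the figures above.

```python
def long_tail_buckets(appearances):
    """Bucket artists by appearance count: exactly 1, 2 to 10, more than 10."""
    buckets = {"1": 0, "2-10": 0, ">10": 0}
    for count in appearances.values():
        if count == 1:
            buckets["1"] += 1
        elif count <= 10:
            buckets["2-10"] += 1
        else:
            buckets[">10"] += 1
    return buckets
```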
Trends, influencers and targeted recommendations
Finally, what about identifying trends and influencers?
One approach would be to identify which artists jump from the top-1000 to the top-100, and even to the top-50, in a given timeframe. Unfortunately, Deezer playlists do not contain any temporal information. Yet, coming back to the starting point of this post, that’s definitely something valuable that WMG could get from Playlists.net.
They could then identify and target influencers, for instance users who are among the top 10% of an artist’s listeners, which could be a goldmine when marketing new artists or releases.
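Since the dataset has no timestamps, the following is purely hypothetical. If two popularity rankings were available for two time windows, spotting the artists that jumped could look like the sketch below, where the function name, the input format (lists of artist names, most popular first) and the default thresholds are all assumptions.

```python
def rank_jumps(earlier, later, from_top=1000, to_top=100):
    """Find artists that moved from ranks (to_top, from_top] into the top to_top.

    Returns (artist, earlier_rank, later_rank) tuples, with ranks starting at 1.
    """
    earlier_rank = {artist: i + 1 for i, artist in enumerate(earlier)}
    jumps = []
    for i, artist in enumerate(later[:to_top]):
        before = earlier_rank.get(artist)
        # Keep artists that were ranked, but outside the target top, earlier on.
        if before is not None and to_top < before <= from_top:
            jumps.append((artist, before, i + 1))
    return jumps
```

Artists that are entirely new in the later window are ignored here. Depending on the goal, they could be treated as jumps as well.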
Definitely, this acquisition makes sense considering the trends in the industry and the recent consolidation around various services (Rhapsody, rd.io, etc.), most of them focusing on the analytics / discovery domain. This domain matters for artists and labels, but also for streaming services and data providers: it provides them with valuable insights and ways to beat competitors, and ensures their users are given the best listening experience they could possibly expect, depending on who they are and how they listen to music.
If you have an interesting dataset and want to run analytics or recommendation experiments, let’s get in touch! And if you’re mostly interested in the discovery / recommendation part, have a look at our turn-key solution at seevl.fm.
I used Deezer and not Spotify, even though Playlists.net is Spotify-based, as there’s no rate limiting on the Deezer API for playlist search and retrieval (whether it’s a bug or a feature is another topic for discussion).