
Insights from 500,000 Deezer playlists using Google’s BigQuery

A few days ago, Warner Music made an acquisition and, as TechCrunch pointed out, one reason may well be the acquired service’s data, and the related insights.

But, what can we learn from such a dataset? Well, a lot actually: discovering top-tracks, building content-based recommendations, mining new trends, and finding influencers to target during album releases. This can be invaluable for a record label or an artist, and it’s no surprise that companies like Musicmetric or The Next Big Sound tackle it from the analytics perspective, while Gracenote or The Echo Nest focus on data, recommendations, or user profiling.

To prove some of those points, I’ve run a small experiment using 500,000 playlists from Deezer, together with Google’s BigQuery infrastructure.

The setup

Analyzing playlists is not a new thing, and you can read about various Big Data architectures, such as Spark at Spotify, from the music discovery standpoint. I used Google’s BigQuery in order to quickly get insights without setting up my own stack. Having experimented with it in the past, it was a good time to try it with my own dataset.

With a few Python scripts, here are the steps to set up the experiment. [Update 2014-10-29: the scripts, as well as links to the dataset, are now available on GitHub]:

– First, gather about 500,000 playlists from the Deezer API [1], using a threaded crawler randomly picking playlists with IDs between 1 and 10M, for a total of 9.7 GB of JSON data;

– Then, prepare the playlists for Google’s BigQuery, concatenating the 500K original files into 9 gzipped JSON files ([1-9].json.gz), and uploading them to Google Cloud Storage, for a total of 1 GB;

– Finally, define a schema to map the data to tables, and load it from Cloud Storage into BigQuery. It took only 12 seconds to load the 1 GB of compressed data, for a total of 510,187 playlists, with 12M tracks (900K distinct ones) in total.
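As a rough sketch of the crawling step above (the playlist endpoint is Deezer’s public API; the thread count, sampling logic and error handling are illustrative, not the original script):

```python
import json
import random
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Public Deezer API endpoint for a single playlist (no authentication needed).
API = "https://api.deezer.com/playlist/{}"

def playlist_url(playlist_id):
    """Build the API URL for a given playlist ID."""
    return API.format(playlist_id)

def fetch(playlist_id):
    """Download one playlist as JSON; return None for missing or private IDs."""
    try:
        with urllib.request.urlopen(playlist_url(playlist_id), timeout=10) as resp:
            data = json.load(resp)
    except Exception:
        return None
    return None if "error" in data else data

def crawl(n_playlists=500000, max_id=10000000, workers=32):
    """Randomly sample playlist IDs in [1, max_id] and fetch them in parallel."""
    ids = random.sample(range(1, max_id + 1), n_playlists)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for playlist in pool.map(fetch, ids):
            if playlist is not None:
                yield playlist
```

Each fetched playlist can then be appended, one JSON object per line, to the files that get gzipped and uploaded to Cloud Storage.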

Defining a schema to load JSON data into BigQuery

Content recommendations

With such an amount of data, and not only in the music domain, it’s relatively easy to build a content recommendation platform based on the “If you like X, you’ll like Y” principle. Using this simple SQL query, you can find the top related artists for any artist in the dataset:

-- column names follow the Deezer playlist JSON schema loaded above
SELECT b.tracks.data.artist.name, COUNT(1) AS c
FROM FLATTEN([Playlists.Playlists], tracks.data) a
JOIN EACH FLATTEN([Playlists.Playlists], tracks.data) b
ON a.id == b.id
WHERE a.tracks.data.artist.id == <artist_id>
  AND b.tracks.data.artist.id != <artist_id>
GROUP BY 1
ORDER BY c DESC
LIMIT 5

For instance:

### Related to Rihanna
* Britney Spears
* Beyoncé
* The Black Eyed Peas
* David Guetta
* Justin Timberlake
### Related to Daft Punk
* Justice
* Muse
* David Guetta
* Moby
* The Chemical Brothers
### Related to Agnostic Front
* Blood for Blood
* Hatebreed
* Dropkick Murphys
* Helga Hahnemann
* Bad Religion

A good way to bootstrap an artist-based radio station!

Going further, building a song-to-song recommendation algorithm is not really complicated either. For instance, here are the most frequent tracks played together with “Harder Better Faster Stronger” that are not by Daft Punk.

### Related to Harder Better Faster Stronger, non Daft-Punk
* David Guetta: Cozi Baby When The Light
* Laurent Wolf: No Stress (Radio edit)
* David Guetta: Love Don't Let Me Go (Original Edit)
* David Guetta: Love Is Gone (Radio Edit Rmx)
* Mika: Relax, Take It Easy
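The same song-to-song logic can be expressed outside BigQuery as a plain co-occurrence count over playlists. A toy sketch, where each playlist is simplified to a list of (artist, title) pairs:

```python
from collections import Counter

def related_tracks(playlists, seed_title, exclude_artist, top_n=5):
    """Rank tracks that co-occur with seed_title across playlists,
    excluding tracks by exclude_artist (e.g. other Daft Punk songs)."""
    counts = Counter()
    for tracks in playlists:  # each playlist: list of (artist, title) pairs
        titles = {title for _, title in tracks}
        if seed_title not in titles:
            continue  # seed track not in this playlist
        for artist, title in tracks:
            if title != seed_title and artist != exclude_artist:
                counts[(artist, title)] += 1
    return [track for track, _ in counts.most_common(top_n)]

# Made-up example data:
playlists = [
    [("Daft Punk", "Harder Better Faster Stronger"),
     ("Mika", "Relax, Take It Easy")],
    [("Daft Punk", "Harder Better Faster Stronger"),
     ("Mika", "Relax, Take It Easy"),
     ("Daft Punk", "Da Funk")],
]
related_tracks(playlists, "Harder Better Faster Stronger", "Daft Punk")
# → [("Mika", "Relax, Take It Easy")]
```

On 12M tracks this naive in-memory approach would be slow, which is precisely the kind of join BigQuery handles in seconds.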

Top artists and tracks, popularity, and more

Besides recommendations, an obvious use-case is to identify top-tracks or top-artists. For instance, here are the top-tracks for some artists based on their popularity in the full dataset.

### Most popular tracks from Daft Punk
* Around The World
* Harder Better Faster Stronger
* Da Funk
* Technologic
* Around The World / Harder Better Faster Stronger
### Most popular tracks from Weezer
* Island In The Sun
* My Name Is Jonas
* Beverly Hills
* Buddy Holly
* Hash Pipe

Combined with temporal attributes (not available here unfortunately, more on this later), one could also identify how fast a track progresses from its release to a top-X.

Regarding top-artists, the easy way is to simply rank them by the number of tracks they have in the full dataset (900K distinct ones).

### Top-artists by number of tracks
* Linkin Park (65,415)
* Muse (59,550)
* U2 (54,688)
* Rihanna (53,354)
* Queen (51,717)

But another way is to sort artists by the number of playlists they appear in:

-- inner query de-duplicates artists within each playlist;
-- column names follow the Deezer playlist JSON schema loaded above
SELECT COUNT(id) AS c, artist_id, artist_name
FROM (
  SELECT id, tracks.data.artist.id AS artist_id, tracks.data.artist.name AS artist_name
  FROM [Playlists.Playlists]
  GROUP EACH BY 1, 2, 3
)
GROUP BY 2, 3
ORDER BY c DESC

Surprisingly, the most popular artist is then a karaoke cover band, included in 23,993 of the 510K playlists, more than Rihanna or U2!

### Top-artists by playlists appearance
* Studio Group (23,993)
* Rihanna (23,398)
* U2 (17,860)
* Queen (17,463)
* Linkin Park (17,232)

Another interesting insight – not surprising if you’re into music discovery and the long tail – concerns how popular artists outweigh less popular ones in the distribution: 43,346 artists, i.e. about a third of them, appear only once in the dataset, and 37,864 appear between 2 and 10 times.
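That long-tail breakdown boils down to bucketing artists by their playlist-appearance count. A minimal sketch on made-up data, using the same ranges as above:

```python
from collections import Counter

def appearance_buckets(playlists):
    """For each artist, count the playlists they appear in, then bucket
    those counts into the ranges used above: 1, 2-10, and 11+."""
    appearances = Counter()
    for artists in playlists:  # each playlist: iterable of artist names
        for artist in set(artists):  # count each artist once per playlist
            appearances[artist] += 1
    buckets = {"1": 0, "2-10": 0, "11+": 0}
    for n in appearances.values():
        if n == 1:
            buckets["1"] += 1
        elif n <= 10:
            buckets["2-10"] += 1
        else:
            buckets["11+"] += 1
    return buckets

# Tiny illustration: A and C appear once, B appears in three playlists.
appearance_buckets([["A", "B"], ["B"], ["B", "C"]])
# → {"1": 2, "2-10": 1, "11+": 0}
```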

Trends, influencers and targeted recommendations

Finally, what about identifying trends and influencers?

One approach would be to identify which artists jump from the top-1000 to the top-100, and even to the top-50, in a given timeframe. Unfortunately, Deezer playlists do not contain any temporal information. Yet, coming back to the starting point of this post, that’s definitely something valuable that WMG could get from its acquisition.

They could then identify and target influencers, for instance users who are among the top 10% of listeners of a given artist, which could be a goldmine when marketing new artists or releases.

Definitely, this acquisition makes sense considering the trends in the industry, and the recent consolidation around various services (Rhapsody, etc.), most of them focusing on the analytics / discovery domain. A domain which matters for artists and labels, but also for streaming services and data providers, giving them valuable insights and ways to beat competitors, ensuring their users get the best listening experience they could possibly expect, depending on who they are and how they listen to music.

If you have an interesting dataset and want to run analytics or recommendation experiments, let’s get in touch! And if you’re mostly interested in the discovery / recommendation part, have a look at our turn-key solution at

[1] I used Deezer and not Spotify, even though the acquired service is Spotify-based, as there’s no rate limiting on Deezer’s API for playlist search and retrieval (whether that’s a bug or a feature is another topic for discussion)

10 thoughts on “Insights from 500,000 Deezer playlists using Google’s BigQuery”

  1. Nice article. It would be interesting to see how the recommendations change if you take into consideration the general popularity of a band when looking for correlations, i.e. which bands often appear in the same playlist as band X but are not common in most playlists. The bands might usually be in the same playlist, but does the existence of X indicate an increased probability of seeing Y? This might avoid a situation where the recommender algorithm seems to fall back on only top-list results.

    1. Yes, that definitely makes sense!

      For now the algorithm is very basic (e.g. doesn’t take into account popularity inside a playlist), but such kind of improvement could be fun to try with.

      I’ve also experimented a bit with star-rating, e.g. which artists tend to appear a lot in badly-rated playlists, and does that influence other recommendations, etc.

      The post has been updated with links to the dataset, if you’re trying something with it, feel free to reply again – would love to see some more experiments with it.

  2. Hi, thank you for sharing your ideas and code.
    I’d like to ask if you could share the Deezer crawling code. I’m new to these APIs and to Python, so I have no idea if I’m doing it right or wrong. Actually, for the moment, what I’m trying to do is also crawl playlists, so yours would be a perfect example if you’re willing to share!

  3. Hi, I get 403 Forbidden errors when I try to download the data using wget or gsutil. Is there a better way to download the data? Thanks.

    1. Hi Tim – I’ve disabled billing on this app, so the files might be inaccessible currently; let me have a look and update the post / GitHub repo ASAP if need be.