
Insights from 500,000 Deezer playlists using Google’s BigQuery

A few days ago, Warner Music made an acquisition, and as TechCrunch pointed out, one reason could be its data, and the related insights.

But what can we learn from such a dataset? Quite a lot, actually: discovering top tracks, building content-based recommendations, mining new trends, and finding influencers to target during album releases. This can be invaluable for a record label or an artist, and it’s no surprise that companies like Musicmetric or The Next Big Sound tackle it from the analytics perspective, while Gracenote or The Echo Nest focus on data, recommendations, or user profiling.

To prove some of those points, I’ve run a small experiment using 500,000 playlists from Deezer, together with Google’s BigQuery infrastructure.

The setup

Analyzing playlists is not a new thing, and you can read about various Big Data architectures, such as Spark at Spotify, from the music-discovery standpoint. I used Google’s BigQuery in order to quickly get insights without setting up my own stack. As I had experimented with it in the past, it was a good time to try it with my own dataset.

With a few Python scripts, here are the steps to set up the experiment. [Update 2014-10-29: The scripts, as well as links to the dataset, are now available on GitHub]:

- First, gather about 500,000 playlists from the Deezer API [1], using a threaded crawler that randomly picks playlist IDs between 1 and 10M, for a total of 9.7GB of JSON data;

- Then, prepare the playlists for Google’s BigQuery, concatenating the 500K original files into 9 gzipped JSON files ([1-9].json.gz), and uploading them to Google Cloud Storage, for a total of 1GB;

- Finally, define a schema to map the data to tables, and load it from Cloud Storage to BigQuery. It took only 12 seconds to load the 1GB of compressed data, for a total of 510,187 playlists, with 12M tracks (900K distinct ones) in total.
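The gathering step can be sketched as follows – a minimal sketch with hypothetical helper names, not the actual crawler (which is on GitHub):

```python
import json
import random
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Public Deezer playlist endpoint (no authentication needed)
API = "https://api.deezer.com/playlist/{}"

def fetch_playlist(pid):
    """Return the playlist JSON for `pid`, or None if it doesn't exist."""
    try:
        with urlopen(API.format(pid), timeout=10) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        return None if "error" in data else data
    except OSError:
        return None

def crawl(n=500_000, max_id=10_000_000, workers=50):
    """Fetch `n` random playlist IDs with a pool of worker threads."""
    ids = random.sample(range(1, max_id + 1), n)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [p for p in pool.map(fetch_playlist, ids) if p is not None]
```

The results can then be concatenated and gzipped into the 9 shards before uploading to Cloud Storage.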

Defining a schema to load JSON data into BigQuery


Content recommendations

With such an amount of data, and not only in the music domain, it’s relatively easy to build a content-recommendation platform based on “If you like X, you’ll like Y”. A simple SQL self-join finds the top related artists for anyone in the dataset (field names are indicative, following the JSON schema defined earlier):

  SELECT b.tracks.artist.name AS artist, COUNT(*) AS c
  FROM FLATTEN([Playlists.Playlists], tracks) a
  JOIN EACH FLATTEN([Playlists.Playlists], tracks) b
  ON a.id = b.id
  WHERE a.tracks.artist.id = <artist_id>
    AND b.tracks.artist.id != <artist_id>
  GROUP BY 1
  ORDER BY c DESC
  LIMIT 5

For instance:

### Related to Rihanna
* Britney Spears
* Beyoncé
* The Black Eyed Peas
* David Guetta
* Justin Timberlake
### Related to Daft Punk
* Justice
* Muse
* David Guetta
* Moby
* The Chemical Brothers
### Related to Agnostic Front
* Blood for Blood
* Hatebreed
* Dropkick Murphys
* Helga Hahnemann
* Bad Religion

A good way to bootstrap an artist-based radio station!

Going further, building a song-to-song recommendation algorithm is not really complicated either. Here are, for instance, the most frequent tracks played together with “Harder Better Faster Stronger” that are not by Daft Punk.

### Related to Harder Better Faster Stronger, non Daft-Punk
* David Guetta: Cozi Baby When The Light
* Laurent Wolf: No Stress (Radio edit)
* David Guetta: Love Don't Let Me Go (Original Edit)
* David Guetta: Love Is Gone (Radio Edit Rmx)
* Mika: Relax, Take It Easy
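The underlying co-occurrence idea can be sketched in plain Python (illustrative only, not the actual BigQuery query):

```python
from collections import Counter

def tracks_played_with(playlists, seed_title, exclude_artist=None):
    """Count tracks co-occurring with `seed_title` across playlists,
    optionally excluding the seed's own artist (e.g. Daft Punk).
    Each playlist is a list of (artist, title) pairs."""
    counts = Counter()
    for tracks in playlists:
        if not any(title == seed_title for _, title in tracks):
            continue  # the seed track is not in this playlist
        for artist, title in tracks:
            if title != seed_title and artist != exclude_artist:
                counts[(artist, title)] += 1
    return counts.most_common()
```

The SQL version does the same thing at scale: a self-join on playlist ID, filtered on the seed track, grouped by the co-occurring track.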

Top artists and tracks, popularity, and more

Besides recommendations, an obvious use-case is to identify top-tracks or top-artists. For instance, here are the top-tracks for some artists based on their popularity in the full dataset.

### Most popular tracks from Daft Punk
* Around The World
* Harder Better Faster Stronger
* Da Funk
* Technologic
* Around The World / Harder Better Faster Stronger
### Most popular tracks from Weezer
* Island In The Sun
* My Name Is Jonas
* Beverly Hills
* Buddy Holly
* Hash Pipe

Combined with temporal attributes (not available here unfortunately; more on this later), one could also identify how fast a track progresses from its release into a top-X.

Regarding top artists, the easy way is to simply rank them by the number of tracks they have in the full dataset (900K distinct ones).

### Top-artists by number of tracks
* Linkin Park (65,415)
* Muse (59,550)
* U2 (54,688)
* Rihanna (53,354)
* Queen (51,717)

But another way is to sort artists by the number of playlists they appear in, first deduplicating (playlist, artist) pairs (field names are indicative):

 SELECT COUNT(id) AS c, artist_id, artist_name
 FROM (
   SELECT id, tracks.artist.id AS artist_id, tracks.artist.name AS artist_name
   FROM [Playlists.Playlists]
   GROUP EACH BY 1, 2, 3
 )
 GROUP EACH BY 2, 3
 ORDER BY c DESC

Surprisingly, the most popular artist is then a karaoke cover band, included in 23,993 of the 510K playlists, more than Rihanna or U2!

### Top-artists by playlists appearance
* Studio Group (23,993)
* Rihanna (23,398)
* U2 (17,860)
* Queen (17,463)
* Linkin Park (17,232)

Another interesting insight – not surprising if you’re into music discovery and the long tail – concerns how heavily popular artists outweigh less popular ones in the distribution: 43,346 artists, i.e. about a third of them, appear only once in the dataset, and 37,864 appear between 2 and 10 times.
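The long-tail buckets above are a one-pass aggregation; a minimal sketch, assuming a per-artist playlist count has already been computed:

```python
from collections import Counter

def long_tail_buckets(appearances):
    """Bucket artists by how many playlists they appear in.
    `appearances` maps artist name -> number of playlists."""
    buckets = Counter()
    for n in appearances.values():
        if n == 1:
            buckets["1"] += 1
        elif n <= 10:
            buckets["2-10"] += 1
        else:
            buckets["11+"] += 1
    return dict(buckets)
```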

Trends, influencers and targeted recommendations

Finally, what about identifying trends and influencers?

One approach would be to identify which artists jump from the top-1000 to the top-100, and even to the top-50, in a given timeframe. Unfortunately, Deezer playlists do not contain any temporal information. Yet, coming back to the starting point of this post, that’s definitely something valuable that WMG could get from this acquisition.

They could then identify and target influencers, for instance users who are among the top 10% of listeners for a given artist, which could be a goldmine when marketing new artists or releases.

This acquisition definitely makes sense considering the trends in the industry, and the recent consolidation around various services (Rhapsody, etc.), most of them focusing on the analytics / discovery domain. A domain which matters for artists and labels, but also for streaming services and data providers, providing them with valuable insights and ways to beat competitors, ensuring their users get the best listening experience they could possibly expect, depending on who they are and how they listen to music.

If you have an interesting dataset and want to run analytics or recommendation experiments, let’s get in touch! And if you’re mostly interested in the discovery / recommendation part, have a look at our turn-key solution at

[1] I used Deezer and not Spotify, even though the acquired service is Spotify-based, as there’s no rate limiting on the Deezer API for playlist search and retrieval (whether that’s a bug or a feature is another topic for discussion)


How can mood and tempo influence artist discovery?

If you log in to Deezer, Spotify, YouTube, etc. to listen to a particular artist, you can simply pick their top tracks. Yet, while those are the most popular, they don’t necessarily provide a good understanding of the artist’s style, or – on the opposite end – might not surprise you enough. Plus, depending on which platform you use, unexpected results can appear!

Using the Gracenote API, here’s an experiment with their mood and tempo detection features to answer questions like “What does a band generally play?”, “How eclectic is an album?” or “How can I listen to something unexpected from my favorite artist?”.

You Can’t, You Won’t And You Don’t Stop

First, let’s try to understand how eclectic an artist is: do they tend to play diverse styles, or do they stick to common patterns? In the first case, is that something we can experience through a single album, or did they simply switch genres over the years?

Take for instance the Beastie Boys, who played hardcore punk in their early years before becoming hip-hop stars in the 90’s. If you look at their old recordings, compiled in “Some Old Bullshit“, you’ll find the following top-3 tempos and moods.

Beastie Boys - Some Old Bullshit
# Tempo
- Medium Tempo: 8 (57.14%)
- Fast Tempo: 6 (42.86%)
- Medium Fast: 5 (35.71%)
# Mood
- Aggressive: 8 (57.14%)
- Cool Confidence: 4 (28.57%)
- Heavy Triumphant: 4 (28.57%)

A more recent record like “Hello Nasty”, on the other hand, seems to leave behind the aggressive side of their early years, even though the defiant mood is definitely here to stay!

Beastie Boys - Hello Nasty
# Tempo
- Medium Tempo: 21 (95.45%)
- Medium Fast: 12 (54.55%)
- 90s: 7 (31.82%)
# Mood
- Attitude / Defiant: 7 (31.82%)
- Defiant: 7 (31.82%)
- Cool Confidence: 6 (27.27%)

Looking at individual albums, there are interesting patterns as well. London Calling from The Clash combines elements of Punk-Rock, Jazz, Ska, R&B and more. Consequently, lots of different moods are covered in the same album:

The Clash - London Calling
# Mood
- Rowdy: 5 (26.32%)
- Excited: 4 (21.05%)
- Ramshackle / Rollicking: 4 (21.05%)
- Cool: 4 (21.05%)
- Loud Celebratory: 3 (15.79%)
- Casual Groove: 3 (15.79%)
- Carefree Pop: 2 (10.53%)
- Upbeat: 2 (10.53%)
- Empowering: 1 (5.26%)
- Cool Confidence: 1 (5.26%)

Fear of the Dark

On the other hand, some bands haven’t significantly evolved over decades. Running the same test on the first and most recent studio albums of Iron Maiden (“Iron Maiden” and “The Final Frontier“) shows that the tempo remains the same, while the Defiant mood is still a strong part of their style, 30 years after their first release.

Iron Maiden - Iron Maiden
# Tempo
- Medium Tempo: 6 (75.00%)
- Medium Fast: 5 (62.50%)
- 100s: 4 (50.00%)
# Mood
- Defiant: 5 (62.50%)
- Hard Positive Excitement: 3 (37.50%)
- Hard Dark Excitement: 2 (25.00%)
Iron Maiden - The Final Frontier
# Tempo
- Medium Tempo: 6 (60.00%)
- Medium Fast: 4 (40.00%)
- Fast: 3 (30.00%)
# Mood
- Defiant: 6 (60.00%)
- Heavy Brooding: 6 (60.00%)
- Brooding: 2 (20.00%)

Finally, for all their studio albums (143 tracks), we have:

Iron Maiden
# Tempo
- Medium Tempo: 88 (61.54%)
- Medium Fast: 55 (38.46%)
- Fast Tempo: 53 (37.06%)
- Fast: 44 (30.77%)
- Medium: 31 (21.68%)
- 100s: 26 (18.18%)
- 80s: 21 (14.69%)
- 90s: 14 (9.79%)
- 130s: 12 (8.39%)
- 140s: 12 (8.39%)
# Mood
- Defiant: 72 (50.35%)
- Heavy Brooding: 33 (23.08%)
- Hard Dark Excitement: 30 (20.98%)
- Brooding: 26 (18.18%)
- Rowdy: 18 (12.59%)
- Confident / Tough: 10 (6.99%)
- Hard Positive Excitement: 9 (6.29%)
- Aggressive: 9 (6.29%)
- Heavy Triumphant: 7 (4.90%)
- Alienated / Brooding: 7 (4.90%)

This becomes interesting in terms of discovery. If you want to listen to typical Maiden, just pick a mid-tempo track with a Defiant mood: “The Trooper” is one of them. On the other hand, if you’re into something more obscure, pick an Alienated / Brooding track in the 90s-BPM range, like “Mother Russia“.

If you want to run similar experiments on your favorite albums, simply set up an account with the Gracenote API, and get the small Python class I’ve built for the analysis.
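The per-album breakdowns above boil down to counting labels over tracks; a minimal sketch, assuming each track’s mood (or tempo) labels have already been retrieved from Gracenote:

```python
from collections import Counter

def top_descriptors(track_labels, k=3):
    """Given one list of labels per track, return the top-k labels
    with their track counts and the percentage of tracks covered."""
    n = len(track_labels)
    # set() so a label repeated on a single track is counted once
    counts = Counter(label for labels in track_labels for label in set(labels))
    return [(label, c, round(100.0 * c / n, 2)) for label, c in counts.most_common(k)]
```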

@seevl’s DJ, Twitter, and the Semantic Web

Like most of my side projects, “seevl DJ” started as a quick hack on a Sunday afternoon. Yet, it was quickly picked up and featured on Fast Company and Hypebot, and also got some attention on Twitter itself.

With a little help from my friends

I’ve spent some time improving it so you can use additional commands, e.g. “a song by” or “play me something like”. In addition, it now uses the Freebase / YouTube mappings combined with the seevl API to find an artist’s videos (when using a genre / label / related query).

Last but not least, you can now use “/cc @user” and “for @user” in your tweet to send a track to any of your friends, the music video being available directly in their feed through Twitter cards (Web and mobile).

Services, actions, and payments on Twitter

Thinking again about Twitter as an intelligent agent on the Web, let’s be bold and imagine this integrated with the Buy button / Stripe integration. While it’s now used to buy stuff, what about paying for services with it? “Hey @uber, bring @myfriend here”. “Hey @trycaviar, sushi for 6 please”. Both answering with an automated tweet embedding a Buy button so you can validate the order, and get your black car or food home within minutes. All through Twitter.

Natural Language Processing is one way to enable this, but another one is to pre-fill such “service-based tweets” so that users would just have to complete a few fields (e.g. the number of people when messaging @opentable). This makes things much easier on the processing side, while also providing a frictionless experience to users. Technically, the intelligence can be brought by actions, as I’ve written in the past, using JSON-LD as the supporting data serialisation.

A similar approach is used in Gmail (see for instance the Github integration). So, Twitter, what’s your next move to also embrace the Semantic Web?

Remove inactive Twitter followees with this tiny Python script

I recently reached the Twitter limit for adding new followees, so I wrote a tiny Python script, Twitter Cleaner, to remove people who haven’t sent anything for a number of days (30 by default) – and consequently be able to add new ones. It’s now available on GitHub.

Twitter Cleaner


Note that it might conflict with the previous Twitter TOS if you unfollow too many people at once. However, that will happen only once if you put it into a daily crontab. It was safe in my case, but I can’t guarantee it will be in yours. You may also hit API rate limiting if you have too many followees.
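The core of the filter is a simple date threshold; a minimal sketch (the actual script wraps python-twitter calls around it):

```python
from datetime import datetime, timedelta

def is_inactive(last_tweet_at, days=30, now=None):
    """True if a followee's last tweet is older than `days`
    (or if they have never tweeted at all)."""
    now = now or datetime.utcnow()
    return last_tweet_at is None or now - last_tweet_at > timedelta(days=days)
```

Followees for which this returns True are then unfollowed, one API call each.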

It’s built using python-twitter, and is available under the MIT license.


Last night a DJ saved my life: What if Twitter could be your own DJ?

While the Twitter music app eventually failed, it’s still clear that people use Twitter’s data stream to share and/or discover new #music. Thanks to Twitter cards, a great thing is that you can directly watch a YouTube video, or listen to a SoundCloud clip, right from your feed, without leaving the platform. But what if Twitter could be your own DJ, playing songs on request?

Since it’s been a few months since I enjoyed my last Music Hack Day – oh, I definitely miss that! – I’ve hacked a proof of concept using the seevl API, combined with the Twitter and YouTube ones, to make Twitter act as your own personal DJ.

Hey @seevl, play something cool

The result is a twitter bot, running under our @seevl handle, which accepts a few (controlled) natural-language queries and replies with an appropriate track, embedded in a Tweet via a YouTube card. Here are a few patterns you can use:

Hey @seevl, play something like A

To play something that is similar to A. For instance, tweet “play something like New Order”, and you might get a reply with a Joy Division track in your feed.

Hey @seevl, play something from L

To play something from an artist signed to label L (or, at least, one that used to be on this label at some stage)

Hey @seevl, play some G

To play something from a given genre G

Hey @seevl, play A

To simply play a track from A.

By the way, you can replace “Hey” with anything you want, as long as you politely ask your DJ what you want him to spin. Here’s an example, with my tweet just posted (top of the timeline), and a reply from the bot (bottom left).
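Under the hood, a handful of regular expressions are enough for this kind of controlled natural language. Here is a sketch of the matching step (hypothetical code, not the bot’s actual implementation):

```python
import re

# Order matters: more specific patterns come first, so that
# "play something like X" is not captured by the generic "play X" rule.
PATTERNS = [
    (re.compile(r"play something like (?P<q>.+)", re.I), "similar"),
    (re.compile(r"play something from (?P<q>.+)", re.I), "label"),
    (re.compile(r"play some (?P<q>.+)", re.I), "genre"),
    (re.compile(r"play (?P<q>.+)", re.I), "artist"),
]

def parse_command(tweet):
    """Return (intent, query), or None if no pattern matches.
    The greeting ("Hey", "Please", ...) is simply ignored."""
    for pattern, intent in PATTERNS:
        m = pattern.search(tweet)
        if m:
            return intent, m.group("q").strip()
    return None
```

The intent is then mapped to the corresponding seevl API call (similar artists, label roster, genre lookup), and the result to a YouTube video via the Freebase mappings.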

Twitter As A DJ


A little less conversation

As it’s all Twitter-based, not only can you send messages, but you can have a conversation with your virtual DJ. Here’s for instance what I sent first

And got this immediate reply – with the embedded YouTube video

Followed by (“coo” meant to be “cool”)

To immediately listen to Bessie Smith in my stream

It’s kind of fun, I have to say, especially due to the instantaneous nature of the conversation – it even reminds me of IRC bots!

Unfortunately, it’s likely that the bot will reach the API rate-limit when posting Tweets (and I’m not handling those errors in the current MVP), so you may not have a reply when you interact with it.

Twitter As A Service?

Besides the music-related hack, I also wanted to showcase the growth of intelligent services on the Web – and how a platform like Twitter can be part of it, using “Twitter As A Service” as a layer for an intelligent Web.

The recently-launched “Buy button” is a simple example of how Twitter can be a Siri-like interface to the world. But why not bring more intelligence into Twitter? What about “Hey @uber, pick me up in 10 minutes”, using the tweet geolocation plus an Uber API integration to directly pick up – and bill – whoever #requested a black car? Or “Please @opentable, I’d love to have sushi tonight”, getting a reply with links to the top-rated places nearby, with in-tweet booking capability (via the previous Buy button)? The data is there, the tools and APIs are there, so…

Yes, this sounds a bit like what’s described in the seminal Semantic Web article by Tim Berners-Lee, James Hendler and Ora Lassila. Maybe it’s because we’re finally there, in an age where computers can be those social machines we’ve been dreaming about!


Love Product Hunt? Here’s a Chrome extension to discover even more products

Product Hunt is the new rising star in the start-up community. Think of it as a mix of Beta List and Hacker News, but with products that are already live, and a wider community, including engineers of course, but also product people, investors, media and more. A few days ago, and by popular demand, they launched early access to their official API.

More Products: A Chrome plug-in for Product Hunt recommendations

With more than 6,000 products already in the Product Hunt database, I decided to use the API to build a product recommendation engine. It seems that every time it comes to hacking and APIs, I can’t get away from discovery, or music. Or both.

The result is a Chrome extension simply named “More Products!”. It directly integrates the top-10 related products on each product page, as you can see below. I might iterate on the algorithm itself, but I want to keep this plug-in very focused, so it’s unlikely to integrate other features. Note that it doesn’t track anything, so your privacy is preserved.

More Products on Product Hunt!


Under the hood

The engine relies on the API to get the list of all products and related posts, and then uses TF-IDF and cosine similarity to find similarities between them, using NLTK and scikit-learn, respectively the standard Python tools for Natural Language Processing and Machine Learning. To put it simply, it builds a giant database of words used in all posts, mapped to products with their frequency, and then finds how close products are based on those frequencies.
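The similarity step can be sketched with scikit-learn alone (the real engine also uses NLTK for tokenization; the product descriptions below are hypothetical):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def related_products(descriptions, top_n=10):
    """Return, for each product description, the indices of the
    `top_n` most similar products by TF-IDF cosine similarity."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(descriptions)
    sims = cosine_similarity(tfidf)
    related = []
    for i, row in enumerate(sims):
        ranked = row.argsort()[::-1]  # most similar first
        related.append([int(j) for j in ranked if j != i][:top_n])
    return related
```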

New products are fetched every 2 hours, and recommendations are updated at the same time. Flask handles requests between the extension and the recommendations database, and Redis is used as a cache layer.