24 - Discover YouTube music trends via Twitter

24 hours, 24 genres, 24 tracks: Discover YouTube music trends via Twitter

The idea behind 24 is simple: identify the top 24 tracks of the top 24 genres shared from YouTube on Twitter during the last 24 hours.

24 - Heavy Metal tracks

It’s based on the Twitter + Freebase + BigQuery pipeline I built last weekend (using only a subset of the full Twitter stream), and uses Bootstrap and AngularJS for the UI. While the data is mined in real time, the page itself is refreshed every two hours.

Feedback? Comments? Say hello or check MDG’s portfolio for more. In the meantime, enjoy 24 at http://24.mdg.io, and check it out regularly for updates on its current MVP.

Sharing YouTube music on Twitter: Analytics using Freebase and BigQuery

Following my journey with Google Cloud, in particular BigQuery, I’m building a pipeline which mines Tweets containing YouTube videos, and maps those videos to Freebase in order to run various discovery and recommendation analytics and product experiments.

If you’re not yet familiar with it, Freebase is the core of Google’s Knowledge Graph and provides machine-readable, structured information about a large number of entities, or “real-world things described on the Web”.


To build this, I’ve been using the streaming APIs from both Twitter and BigQuery to get and save the data. In between, a middleware parses the tweets and calls the YouTube API and Freebase’s Knowledge Graph to extract additional data from each, with a bit of memcached to avoid rate-limiting on those APIs.
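The caching step can be sketched as follows. This is a minimal illustration, not the actual middleware: an in-memory dictionary stands in for memcached, and `TTLCache`, `cached_lookup` and `fake_fetch` are hypothetical names, with `fake_fetch` replacing the real YouTube / Freebase calls.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """In-memory stand-in for memcached: key -> (expiry timestamp, value)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self.store.get(key)
        if entry is None or entry[0] <= self.clock():
            return None  # missing or expired
        return entry[1]

    def set(self, key: str, value: dict) -> None:
        self.store[key] = (self.clock() + self.ttl, value)


def cached_lookup(cache: TTLCache, video_id: str, fetch: Callable[[str], dict]) -> dict:
    """Return metadata for a video, calling the remote API only on a cache miss."""
    key = "video:" + video_id
    hit = cache.get(key)
    if hit is not None:
        return hit
    value = fetch(video_id)
    cache.set(key, value)
    return value


# Hypothetical fetcher that records how often the "API" is really called.
calls = []


def fake_fetch(video_id: str) -> dict:
    calls.append(video_id)
    return {"id": video_id, "category_id": 10}


cache = TTLCache(ttl_seconds=3600)
first = cached_lookup(cache, "abc123", fake_fetch)
second = cached_lookup(cache, "abc123", fake_fetch)  # served from the cache
```

Since popular videos are tweeted many times, most lookups hit the cache, which is what keeps the number of outgoing API calls under the rate limits.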

The infrastructure started to run a few days ago, and I’ve gathered (when starting this write-up) 1.2M Tweets so far, for a total of 516,056 distinct videos, of which 345,410 have been linked to Freebase entities. Using this sample, I’ll now describe a few things that we can learn using this data in the context of music.

But first, why music videos? Not only because I’m big into music-related data science and engineering, but also because Music is the most shared category in the sample, with 28.2% of the videos, followed by People & Blogs (19.5%) and Entertainment (14.4%). Another reason why YouTube Music Key definitely makes sense.

Popular videos: super fans or spammers?

One of the first queries I tried – focusing solely on BigQuery’s SQL capabilities – was to identify the most popular videos on Twitter, with their corresponding YouTube views (using data from the last YouTube API call).

SELECT [...]
  COUNT(tweet_id) as num_tweets,
  MAX(tweet_youtube.youtube_views) as num_views,
  [...]
WHERE tweet_youtube.youtube_category_id = 10

Most popular videos in the dataset

I was surprised by the small difference between the number of tweets and views for some of them, so I decided to measure the number of tweets per user for any video. Using another simple SQL query, we can easily identify videos that are self-promoted, or should I say spammed, on Twitter.

SELECT [...]
  COUNT(DISTINCT(tweet_user_id)) as num_users,
  COUNT(tweet_youtube.youtube_id) as num_tweets,
  MAX(tweet_youtube.youtube_views) as num_views,
  CAST(COUNT(DISTINCT(tweet_user_id)) as float)
    /CAST(COUNT(tweet_youtube.youtube_id) as float) as ratio
  [...]
WHERE tweet_youtube.youtube_category_id = 10

Number of tweets vs views on YouTube

On the other hand, limiting the first SQL query to allow only one tweet per video per user provides an easier way to identify top tracks based on their number of unique fans (whether or not some of them are spam accounts is another topic).

Popular videos (one tweet per user)
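To make the two measures concrete, here is a small Python sketch of the same logic outside BigQuery. The rows are made-up sample data and `video_stats` is a hypothetical helper; only the idea (distinct users divided by tweets, and ranking by distinct users) comes from the queries above.

```python
from collections import defaultdict

# Hypothetical sample rows: (tweet_id, user_id, youtube_id)
tweets = [
    (1, "u1", "vidA"),
    (2, "u1", "vidA"),
    (3, "u1", "vidA"),
    (4, "u2", "vidA"),
    (5, "u3", "vidB"),
    (6, "u4", "vidB"),
    (7, "u5", "vidB"),
]


def video_stats(rows):
    """Per video: distinct users, tweet count, and the users/tweets ratio."""
    users = defaultdict(set)
    counts = defaultdict(int)
    for _tweet_id, user_id, youtube_id in rows:
        users[youtube_id].add(user_id)
        counts[youtube_id] += 1
    return {
        vid: {
            "num_users": len(users[vid]),
            "num_tweets": counts[vid],
            "ratio": len(users[vid]) / counts[vid],
        }
        for vid in counts
    }


stats = video_stats(tweets)

# Ranking by raw tweets rewards repeated posting by the same account...
by_tweets = sorted(stats, key=lambda v: stats[v]["num_tweets"], reverse=True)
# ...while ranking by distinct users counts one tweet per video per user.
by_unique_fans = sorted(stats, key=lambda v: stats[v]["num_users"], reverse=True)
```

A ratio close to 1 means almost every tweet comes from a different user; a low ratio suggests a handful of accounts posting the same video repeatedly.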

Entities: better than tags

By linking videos to entities, rather than doing simple tag or keyword extraction, much more meaning can be derived from Tweets. As every entity is typed, additional filtering can be applied using the type of each entity. For instance, we can adapt the previous query to find not the top tracks, but the top artists (i.e. entities of type music artist).

Popular artists (via Freebase mappings)

Going deeper, you can also find the most popular music genres in the dataset.

SELECT [...]
  COUNT(DISTINCT tweet_youtube.youtube_id) as num_videos,
  [...]
WHERE tweet_youtube.youtube_relevant_topic.topic_type = '/music/genre'

Top-10 genres in the dataset

From there, and using the same entity-filtering approach, we can build genre-specific top-10 lists, as below for heavy metal.

Top Heavy-metal videos
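The entity-filtering idea can be sketched in a few lines of Python. The sample videos, tweet counts and helper names below are hypothetical; `/music/genre` is the type used in the query above, and `/music/artist` is assumed here to be the matching Freebase type for artists.

```python
from collections import Counter

# Hypothetical videos, each linked to typed Freebase topics as (name, type).
videos = [
    {"youtube_id": "v1", "tweets": 12,
     "topics": [("Iron Maiden", "/music/artist"), ("Heavy metal", "/music/genre")]},
    {"youtube_id": "v2", "tweets": 7,
     "topics": [("Metallica", "/music/artist"), ("Heavy metal", "/music/genre")]},
    {"youtube_id": "v3", "tweets": 20,
     "topics": [("Rihanna", "/music/artist"), ("Pop music", "/music/genre")]},
    {"youtube_id": "v4", "tweets": 5,
     "topics": [("Iron Maiden", "/music/artist"), ("Heavy metal", "/music/genre")]},
]


def top_by_type(videos, topic_type, n=10):
    """Sum tweets per entity, keeping only entities of the requested type."""
    counter = Counter()
    for video in videos:
        for name, entity_type in video["topics"]:
            if entity_type == topic_type:
                counter[name] += video["tweets"]
    return counter.most_common(n)


def top_videos_for_genre(videos, genre, n=10):
    """Most tweeted videos among those linked to the given genre entity."""
    matching = [v for v in videos if (genre, "/music/genre") in v["topics"]]
    return sorted(matching, key=lambda v: v["tweets"], reverse=True)[:n]


top_artists = top_by_type(videos, "/music/artist")
top_genres = top_by_type(videos, "/music/genre")
metal_ids = [v["youtube_id"] for v in top_videos_for_genre(videos, "Heavy metal")]
```

The same function answers both the "top artists" and the "top genres" question; only the type filter changes, which is the advantage of typed entities over free-form tags.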


User profiling, semantic advertisement, and more

Besides analytics, an obvious use-case of such an approach is user profiling. When I was at DERI, we learned that a lot of valuable content for user profiling is mined from things that people link to, by extracting structured data from those links (in this current case, via the YouTube / Freebase mappings).

Using a similar process, Twitter users could be categorised more specifically through their music tastes, and get recommendations of artists to follow, videos to watch, or music to buy based on this data mined from external sources. This is definitely relevant in the context of the upcoming Twitter feed update! On the other side of the spectrum, we can imagine that advertisers, or bands that want to promote themselves on Twitter, could use those signals for specific user-targeting – a constant struggle for music industry marketing.

As the pipeline is progressing, I’ll try to come up with some other interesting experiments, while I’m building a small hack / product in the meantime using this data, most likely combined with data from seevl, to be released soon.

YouTube Music Key

YouTube Music Key: My hopes for music discovery and personalisation

YouTube Music Key, YouTube’s music subscription service, is about to go live – and I’m obviously very excited about this. As YouTube is one of the top online music sources, it’s great to see it finally enhanced with music-specific features (expectedly: discovery, playlists, influencers, full albums, etc.).

This is the vision we had when we started seevl about two years ago, and that you can still experience today on play.seevl.fm (including recommendation, discovery, auto play-listing and more, see for instance what we provide for Blur).

Blur's page on play.seevl.fm

In a competitive landscape, personalisation and discovery matter

In particular, I’m looking forward to the personalisation and discovery aspect of the upcoming YouTube platform. Online streaming is a competitive landscape, and I’m a strong believer that those two aspects will differentiate the OK services from the top ones, and eventually decide who will own the market.

Whether it’s through a laid-back experience (radio/playlist) or via active browsing (crate-digging style), personalisation and discovery are two factors that can enlighten listeners, and eventually increase user retention and acquisition for a music streaming service.

But, maybe as important, it can also drive more streams for big acts, and surface unknown artists that are lost in the long tail. In times of complaints and arguments regarding streaming revenues, this is definitely a key aspect, and it’s no surprise that services take this into consideration, either through dedicated websites (e.g. Spotify Artists or Pandora AMP), or by hiring industry veterans to work with artists (e.g. Dave Allen at Beats).

Why Google / YouTube can nail it

Besides the catalogue itself (VEVO, the Merlin deal, as for Pandora), Google’s / YouTube’s data is a key factor that could let them nail the new platform.

While some services have amassed a considerable amount of user data, letting them implement great features in their personalisation algorithms, on-boarding users with a great experience is a challenge for a new streaming platform. Indeed, the first time a user logs in, the platform won’t know much about them – a.k.a. the cold-start problem. Facebook Connect (and the Graph API) can solve this, provided that users (1) allow the platform to use their “like” data, and (2) have enough music-related information, e.g. by manually liking bands or connecting their existing streaming services to Facebook. If that’s not the case, users will either have to use the platform for a while before it reacts correctly (a catch-22, as they might leave early because of the bad experience), or answer a few questions about favorite acts and genres, as done on the pleasant Beats Music on-boarding interface.

Beats' on-boarding process

On the other hand, YouTube has already amassed a large amount of user data – generally linked to already existing accounts. Hence, when you first log in to YouTube Music Key, it’s very likely that it will know you very well – and can suggest relevant music from day one! Not only by knowing your favorite band, but also, using their large user dataset, by identifying what kind of listener you are: Do you like taking risks and listening to indie bands? Are you the one who generally discovers great acts before they’re signed? Or do you just comfortably listen to top-40 tracks?

Moreover, thanks to the power of Freebase’s Knowledge Graph, and its integration with YouTube, the platform could also build your music-DNA (similar to what we’ve started at play.seevl.fm) and know, based solely on the artists that you’ve listened to, that you’re into everything punk-rock, but prefer Epitaph to Sub Pop; or that you love Motown, but only recordings from their Detroit years. Note that this can also be valuable for artists / labels (and eventually for revenue streams, sponsored recommendations and more).

The new YouTube music page

So while I’m waiting for beta access, I tried the new YouTube music landing page. I believe this is just an MVP, but I was a bit disappointed by the personalisation experience.

While the mixes definitely made sense – thumbs up for “My Mix” – some recommendations felt a bit awkward, but I will only blame myself for using my account to play lullabies to the kids (ahem, what about a “mute signal” option for a given genre?).

My personalised YouTube music page...

Most of my disappointment came from the next sections. While the “Trending Music Videos” section probably makes sense from a business perspective, it’s completely un-personalised and I’d rather see trends for genres that I like. Similarly, I would expect the “Top Videos by Genres” section to include mostly genres I’m familiar with (which can be easily derived from my past listening habits, as discussed before).

I’d have hoped as well that the “Hitting the Gym Mixes” and “Music For Every Mood” sections would be more contextual. Depending on the time of the day, my geolocation, etc., what about replacing the first one with “Music for Coding” or “Chillin’ at home”? After all, if you allowed YouTube to sync with your calendar, or to access sensors from your Android device (phone or Wear), we can imagine such user experiences, where music meets the IoT. And with the recent acquisition of Songza, we can expect very nice curated playlists for every circumstance, inferred from those external signals!

... versus my play.seevl.fm dashboard

All taken into consideration, and in spite of the lack of personalisation of the new music page, I’m still sure they’ll do an awesome job at personalisation. I’m looking forward to seeing what’s next for the platform and in particular YouTube Music Key, and in a way, see our vision for YouTube as a music platform finally released at scale!

Insights from 500,000 Deezer playlists using Google’s BigQuery

A few days ago, Warner Music acquired Playlists.net, and as TechCrunch pointed out, one reason can be its data, and the related insights.

But what can we learn from such a dataset? Well, a lot actually: discovering top tracks, building content-based recommendations, mining new trends, and finding influencers to target during album releases. This can be invaluable for a record label or an artist, and it’s no surprise that companies like Musicmetric or The Next Big Sound tackle it from the analytics perspective, while Gracenote or The Echo Nest focus on data, recommendations or user profiling.

To prove some of those points, I’ve run a small experiment using 500,000 playlists from Deezer, together with Google’s BigQuery infrastructure.

The setup

Analyzing playlists is not a new thing, and you could read about various Big Data architectures such as Spark at Spotify, from the music discovery standpoint. I’ve used Google’s BigQuery in order to quickly get insights without setting-up my own stack. As I’ve experimented with it in the past, it was a good time to try with my own dataset.

With a few Python scripts, here are the steps to set up the experiment. [Update 2014-10-29: The scripts, as well as links to the dataset, are now available on GitHub]:

- First, gather about 500,000 playlists from the Deezer API [1], using a threaded crawler, randomly picking playlists with ID between 1 and 10M, for a total of 9.7 GB of JSON data;

- Then, prepare the playlists for Google’s BigQuery, concatenating the 500K original files into 9 gzipped JSON files ([1-9].json.gz), and uploading them to Google Cloud Storage, for a total of 1 GB;

- Finally, define a schema to map the data to tables, and load it from Cloud Storage to BigQuery. It took only 12 seconds to load the 1 GB of compressed data, for a total of 510,187 playlists, with 12M tracks (and 900K distinct ones) in total.
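The preparation step can be sketched as below. This is an illustration under assumptions, not the published script: it writes one JSON object per line (the newline-delimited layout BigQuery expects for JSON loads), spreads the playlists round-robin over nine gzip files, and uses made-up playlist records and hypothetical helper names.

```python
import gzip
import json
import os
import tempfile


def write_shards(playlists, out_dir, num_shards=9):
    """Write playlists as gzipped newline-delimited JSON files 1.json.gz .. N.json.gz."""
    paths = [os.path.join(out_dir, f"{i}.json.gz") for i in range(1, num_shards + 1)]
    handles = [gzip.open(path, "wt", encoding="utf-8") for path in paths]
    try:
        for index, playlist in enumerate(playlists):
            # One JSON document per line, round-robin across the shards.
            handles[index % num_shards].write(json.dumps(playlist) + "\n")
    finally:
        for handle in handles:
            handle.close()
    return paths


def read_shard(path):
    """Read a shard back, one playlist per line."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


# Hypothetical playlists mimicking the nested tracks.data.artist structure.
sample = [
    {"id": i, "tracks": {"data": [{"artist": {"id": 1, "name": "Daft Punk"}}]}}
    for i in range(1, 21)
]

out_dir = tempfile.mkdtemp()
shard_paths = write_shards(sample, out_dir)
total = sum(len(read_shard(path)) for path in shard_paths)
```

Uploading the shards to Cloud Storage and running the load job are left out here, as they depend on the tooling and credentials in use.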

Defining a schema to load JSON data into BigQuery

Content recommendations

With such an amount of data, and not only in the music domain, it’s relatively easy to build a content recommendation platform, based on the “If you like X you’ll like Y” principle. Using this simple SQL query, you can find the top related artists for anyone in the dataset:

SELECT COUNT(b.tracks.data.artist.id), b.tracks.data.artist.name
FROM
  FLATTEN([Playlists.Playlists], tracks.data) a
JOIN EACH
  FLATTEN([Playlists.Playlists], tracks.data) b
ON a.id == b.id
WHERE
  a.tracks.data.artist.id == <artist_id>
  AND b.tracks.data.artist.id != <artist_id>
GROUP BY 2

For instance:

### Related to Rihanna
* Britney Spears
* Beyoncé
* The Black Eyed Peas
* David Guetta
* Justin Timberlake
### Related to Daft Punk
* Justice
* Muse
* David Guetta
* Moby
* The Chemical Brothers
### Related to Agnostic Front
* Blood for Blood
* Hatebreed
* Dropkick Murphys
* Helga Hahnemann
* Bad Religion

A good way to bootstrap an artist-based radio station!
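For readers who prefer to see the co-occurrence logic outside SQL, here is a minimal Python sketch of the same "If you like X you’ll like Y" counting. The playlists and artist IDs are made up, and `related_artists` is a hypothetical helper; it mirrors the self-join by counting one hit per pair of (seed artist track, other artist track) found in the same playlist.

```python
from collections import Counter

# Hypothetical playlists: playlist id -> one (artist_id, artist_name) per track.
playlists = {
    1: [(1, "Daft Punk"), (2, "Justice"), (2, "Justice"), (3, "Muse")],
    2: [(1, "Daft Punk"), (2, "Justice")],
    3: [(3, "Muse"), (4, "Moby")],
    4: [(1, "Daft Punk"), (4, "Moby")],
}


def related_artists(playlists, artist_id, n=5):
    """Rank other artists by how often their tracks share a playlist with the seed artist."""
    counts = Counter()
    for tracks in playlists.values():
        seed_tracks = sum(1 for aid, _name in tracks if aid == artist_id)
        if seed_tracks == 0:
            continue  # the seed artist is not in this playlist
        for aid, name in tracks:
            if aid != artist_id:
                # One hit per (seed track, other track) pair, like the self-join.
                counts[name] += seed_tracks
    return counts.most_common(n)


related_to_daft_punk = related_artists(playlists, artist_id=1)
```

Replacing artist IDs with track IDs gives the song-to-song variant with no other change to the counting.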

Going further, building a song-to-song recommendation algorithm is not really complicated either. Here are, for instance, the most frequent tracks played together with “Harder Better Faster Stronger”, which are not by Daft Punk.

### Related to Harder Better Faster Stronger, non Daft-Punk
* David Guetta: Cozi Baby When The Light
* Laurent Wolf: No Stress (Radio edit)
* David Guetta: Love Don't Let Me Go (Original Edit)
* David Guetta: Love Is Gone (Radio Edit Rmx)
* Mika: Relax, Take It Easy

Top artists and tracks, popularity, and more

Besides recommendations, an obvious use-case is to identify top-tracks or top-artists. For instance, here are the top-tracks for some artists based on their popularity in the full dataset.

### Most popular tracks from Daft Punk
* Around The World
* Harder Better Faster Stronger
* Da Funk
* Technologic
* Around The World / Harder Better Faster Stronger
### Most popular tracks from Weezer
* Island In The Sun
* My Name Is Jonas
* Beverly Hills
* Buddy Holly
* Hash Pipe

Combining with temporal attributes (not available here unfortunately, more on this later), one could also identify how fast a track progresses from its release to a top-X.

Regarding top artists, the easy way is to simply rank them by the number of tracks they have in the full dataset (12M tracks, 900K distinct ones).

### Top-artists by number of tracks
* Linkin Park (65,415)
* Muse (59,550)
* U2 (54,688)
* Rihanna (53,354)
* Queen (51,717)

But another way is to sort artists by the number of playlists they appear in:

SELECT COUNT(id) as c, tracks.data.artist.id, tracks.data.artist.name
FROM (
  SELECT id, tracks.data.artist.id, tracks.data.artist.name
  FROM [Playlists.Playlists]
  GROUP EACH BY 1, 2, 3
)
GROUP EACH BY 2, 3
ORDER BY c DESC

Surprisingly, the most popular artist is then a karaoke cover band, included in 23,993 of the 510,187 playlists, more than Rihanna or U2!

### Top-artists by playlists appearance
* Studio Group (23,993)
* Rihanna (23,398)
* U2 (17,860)
* Queen (17,463)
* Linkin Park (17,232)
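The difference between the two rankings comes down to whether an artist is counted once per track or once per playlist. A small Python sketch, using made-up playlists and hypothetical helper names, shows how the order can flip:

```python
from collections import Counter

# Hypothetical playlists: playlist id -> artist name of each track.
playlists = {
    1: ["Studio Group", "Rihanna", "Rihanna", "Rihanna"],
    2: ["Studio Group", "U2"],
    3: ["Studio Group"],
    4: ["Rihanna", "Rihanna"],
}


def by_tracks(playlists):
    """Count every track: an artist with many tracks in one playlist scores high."""
    return Counter(artist for tracks in playlists.values() for artist in tracks)


def by_playlists(playlists):
    """Count each artist at most once per playlist."""
    return Counter(artist for tracks in playlists.values() for artist in set(tracks))


track_ranking = by_tracks(playlists).most_common()
playlist_ranking = by_playlists(playlists).most_common()

# Artists that appear only once overall, i.e. the head of the long tail.
appear_once = [artist for artist, count in by_tracks(playlists).items() if count == 1]
```

The inner `GROUP EACH BY` of the query plays the role of `set(tracks)` here: it collapses repeated artists within a playlist before the outer count.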

Another interesting insight – that is not surprising if you’re into music discovery and the long tail – concerns the way popular artists outweigh less popular ones in their distribution: 43,346 artists, i.e. about a third of them, appear only once in the dataset, and 37,864 appear between 2 and 10 times.

Trends, influencers and targeted recommendations

Finally, what about identifying trends and influencers?

One approach would be to identify which artists jump from top-1000 to top-100, and even to top-50, in a given timeframe. Unfortunately, Deezer playlists do not contain any temporal information. Yet, coming back to the starting point of this post, that’s definitely something valuable that WMG could get from Playlists.net.

They could then identify and target influencers, for instance users who are among the top 10% to listen to those artists, which could be a goldmine when marketing new artists or releases.
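If timestamped data were available, the rank-jump idea could be sketched as follows. Everything here is hypothetical: the two snapshots are made-up playlist counts per artist, and `top_n` and `risers` are illustrative names.

```python
def top_n(counts, n):
    """Names of the n artists with the highest counts (ties broken by name)."""
    ordered = sorted(counts, key=lambda artist: (-counts[artist], artist))
    return set(ordered[:n])


def risers(before, after, n):
    """Artists inside the top-n in the later snapshot but outside it earlier."""
    return sorted(top_n(after, n) - top_n(before, n))


# Hypothetical playlist-appearance counts at two points in time.
week1 = {"A": 50, "B": 40, "C": 5, "D": 3}
week2 = {"A": 55, "B": 41, "C": 60, "D": 4}

new_in_top2 = risers(week1, week2, n=2)
```

Running the same comparison with n set to 1000, 100 and 50 over successive snapshots would surface the artists climbing through those tiers.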

Definitely, this acquisition makes sense considering the trends in the industry, and the recent consolidation around various services (Rhapsody, Rdio, etc.), most of them focusing on the analytics / discovery domain. A domain which matters for artists and labels, but also for streaming services and data providers, providing them with valuable insights and ways to beat competitors, ensuring their users are given the best listening experience they could possibly expect, depending on who they are, and how they listen to music.

If you have an interesting dataset and want to run analytics or recommendation experiments, let’s get in touch! And if you’re mostly interested in the discovery / recommendation part, have a look at our turn-key solution at seevl.fm.

[1] I used Deezer and not Spotify, even though Playlists.net is Spotify-based, as there’s no rate limiting on the Deezer API for playlist search and retrieval (whether it’s a bug or a feature is another topic for discussion).


How can mood and tempo influence artist discovery?

If you log in to Deezer, Spotify, YouTube, etc. to listen to a particular artist, you can simply pick their top tracks. Yet, while they’re the most popular, they are not necessarily the ones providing a good understanding of their style, or – on the contrary – might not surprise you enough. Plus, depending on which platform you use, unexpected results can appear!

Using the Gracenote API, here’s an experiment using their mood and tempo detection features to answer questions like “What does a band generally play?”, “How eclectic is an album?” or “How can I listen to something unexpected from my favorite artist?”.

You Can’t, You Won’t And You Don’t Stop

First, let’s try to understand how eclectic an artist is: do they tend to play diverse styles, or do they stick to common patterns? In the first case, is that something we can experience through a single album, or did they simply switch genres over the years?

Take for instance the Beastie Boys, who played hardcore punk in their early years, before becoming hip-hop stars in the 90’s. If you look at their old recordings, compiled in “Some Old Bullshit”, you’ll find the following top-3 tempos and moods.

Beastie Boys - Some Old Bullshit
# Tempo
- Medium Tempo: 8 (57.14%)
- Fast Tempo: 6 (42.86%)
- Medium Fast: 5 (35.71%)
# Mood
- Aggressive: 8 (57.14%)
- Cool Confidence: 4 (28.57%)
- Heavy Triumphant: 4 (28.57%)

A more recent record like “Hello Nasty” seems to leave behind the aggressive parts of their early years, even though the defiant mood is definitely here to stay!

Beastie Boys - Hello Nasty
# Tempo
- Medium Tempo: 21 (95.45%)
- Medium Fast: 12 (54.55%)
- 90s: 7 (31.82%)
# Mood
- Attitude / Defiant: 7 (31.82%)
- Defiant: 7 (31.82%)
- Cool Confidence: 6 (27.27%)

Looking at individual albums, there are interesting patterns as well. London Calling from The Clash combines elements of Punk-Rock, Jazz, Ska, R&B and more. Consequently, lots of different moods are covered in the same album:

The Clash - London Calling
# Mood
- Rowdy: 5 (26.32%)
- Excited: 4 (21.05%)
- Ramshackle / Rollicking: 4 (21.05%)
- Cool: 4 (21.05%)
- Loud Celebratory: 3 (15.79%)
- Casual Groove: 3 (15.79%)
- Carefree Pop: 2 (10.53%)
- Upbeat: 2 (10.53%)
- Empowering: 1 (5.26%)
- Cool Confidence: 1 (5.26%)
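The figures above can be reproduced with a simple count of labels per track, divided by the number of tracks. Here is a minimal sketch with made-up tracks and a hypothetical `profile` helper; it assumes each track carries several tempo and mood labels, which is why the percentages of one album can add up to more than 100%.

```python
from collections import Counter

# Hypothetical per-track labels, in the spirit of the mood / tempo data above.
tracks = [
    {"title": "t1", "tempo": ["Medium Tempo", "Medium Fast"], "mood": ["Aggressive"]},
    {"title": "t2", "tempo": ["Medium Tempo"], "mood": ["Aggressive", "Cool Confidence"]},
    {"title": "t3", "tempo": ["Fast Tempo"], "mood": ["Aggressive"]},
    {"title": "t4", "tempo": ["Medium Tempo", "Medium Fast"], "mood": ["Heavy Triumphant"]},
]


def profile(tracks, field, n=3):
    """Top-n labels for a field as (label, track count, percentage of tracks)."""
    counts = Counter(label for track in tracks for label in set(track[field]))
    total = len(tracks)
    return [
        (label, count, round(100.0 * count / total, 2))
        for label, count in counts.most_common(n)
    ]


tempo_profile = profile(tracks, "tempo")
mood_profile = profile(tracks, "mood", n=1)

# Same arithmetic as the first listing: 8 tracks out of 14.
share = round(100.0 * 8 / 14, 2)
```

The number of distinct labels with a meaningful share gives a rough eclecticism signal: one dominant mood suggests a consistent record, ten moods around 10-25% each suggest a varied one.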

Fear of the Dark

On the other hand, some bands haven’t significantly evolved over the decades. Running the same test on the first and most recent studio albums of Iron Maiden (“Iron Maiden” and “The Final Frontier”) shows that tempo remains the same, while the Defiant mood is still a strong part of their style, 30 years after their first release.

Iron Maiden - Iron Maiden
# Tempo
- Medium Tempo: 6 (75.00%)
- Medium Fast: 5 (62.50%)
- 100s: 4 (50.00%)
# Mood
- Defiant: 5 (62.50%)
- Hard Positive Excitement: 3 (37.50%)
- Hard Dark Excitement: 2 (25.00%)
Iron Maiden - The Final Frontier
# Tempo
- Medium Tempo: 6 (60.00%)
- Medium Fast: 4 (40.00%)
- Fast: 3 (30.00%)
# Mood
- Defiant: 6 (60.00%)
- Heavy Brooding: 6 (60.00%)
- Brooding: 2 (20.00%)

Finally, for all their studio albums (143 tracks), we have:

Iron Maiden
# Tempo
- Medium Tempo: 88 (61.54%)
- Medium Fast: 55 (38.46%)
- Fast Tempo: 53 (37.06%)
- Fast: 44 (30.77%)
- Medium: 31 (21.68%)
- 100s: 26 (18.18%)
- 80s: 21 (14.69%)
- 90s: 14 (9.79%)
- 130s: 12 (8.39%)
- 140s: 12 (8.39%)
# Mood
- Defiant: 72 (50.35%)
- Heavy Brooding: 33 (23.08%)
- Hard Dark Excitement: 30 (20.98%)
- Brooding: 26 (18.18%)
- Rowdy: 18 (12.59%)
- Confident / Tough: 10 (6.99%)
- Hard Positive Excitement: 9 (6.29%)
- Aggressive: 9 (6.29%)
- Heavy Triumphant: 7 (4.90%)
- Alienated / Brooding: 7 (4.90%)

This becomes interesting in terms of discovery. If you want to listen to typical Maiden, just pick a mid-tempo track with a defiant mood: “The Trooper” is one of them. On the other hand, if you’re into something more obscure, pick an Alienated 90s-BPM track, like “Mother Russia”.
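Picking such tracks is a plain filter on the labels. The sketch below is illustrative only: the labels attached to “The Trooper” and “Mother Russia” follow the description above, the third track is invented, and `pick` is a hypothetical helper.

```python
# Labels for the first two tracks follow the text; the third entry is made up.
tracks = [
    {"title": "The Trooper", "tempo": ["Medium Tempo"], "mood": ["Defiant"]},
    {"title": "Mother Russia", "tempo": ["90s"], "mood": ["Alienated / Brooding"]},
    {"title": "Hypothetical track", "tempo": ["Fast Tempo"], "mood": ["Rowdy"]},
]


def pick(tracks, tempo, mood):
    """Titles of the tracks carrying both the given tempo and mood labels."""
    return [
        track["title"]
        for track in tracks
        if tempo in track["tempo"] and mood in track["mood"]
    ]


typical = pick(tracks, "Medium Tempo", "Defiant")
unexpected = pick(tracks, "90s", "Alienated / Brooding")
```

Feeding the most frequent labels of an artist into `pick` yields the typical tracks, while the least frequent ones yield the surprising picks.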

If you want to run similar experiments on your favorite albums, simply set up an account with the Gracenote API, and get the small Python class I’ve built for the analysis.

@seevl’s DJ, Twitter, and the Semantic Web

Like most of my side projects, “seevl DJ” started as a quick hack on a Sunday afternoon. Yet, it was quickly picked up and featured on Fast Company and Hypebot, and also got some attention on Twitter itself.

With a little help from my friends

I’ve spent some time improving it so you can use additional commands, e.g. “a song by”, “play me something like”. In addition, it now uses the Freebase / YouTube mappings combined with the seevl API in order to find an artist’s videos (when using a genre / label / related query).

Last but not least, you can now use “/cc @user” and “for @user” in your Tweet to send a track to any of your friends, the music video being available directly on their feed through Twitter cards (Web and mobile).

Services, actions, and payments on Twitter

Thinking again about Twitter as an intelligent agent on the Web, let’s be bold and imagine this integrated with the buy / Stripe integration. While it’s now used to buy stuff, what about paying for services with it? “Hey @uber, bring @myfriend here”. “Hey @trycaviar, sushi for 6 please”. Both would answer with an automated tweet embedding a Buy button so you can validate the order, and get your black car or food home within minutes. All through Twitter.

Natural Language Processing is one way to enable this, but another one is to pre-fill such “service-based tweets” so that users would just have to complete a few fields (e.g. number of people when messaging @opentable). This makes things much easier on the processing side, while also providing a frictionless experience to users. Technically, the intelligence can be brought by schema.org actions, as I’ve written in the past, using JSON-LD as the supporting data serialisation.
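As an illustration of what such a pre-filled action could look like, here is a sketch that builds a JSON-LD description with schema.org’s `ReserveAction`, where the party size is the only field left for the user to complete. The restaurant name, the URL template and the `reservation_action` helper are hypothetical, and nothing here reflects an actual Twitter or OpenTable integration.

```python
import json


def reservation_action(restaurant_name, url_template):
    """Build a JSON-LD document describing a reservation action (illustrative only)."""
    return {
        "@context": "http://schema.org",
        "@type": "Restaurant",
        "name": restaurant_name,
        "potentialAction": {
            "@type": "ReserveAction",
            "target": {
                "@type": "EntryPoint",
                "urlTemplate": url_template,  # hypothetical endpoint
                "httpMethod": "POST",
            },
            "result": {
                "@type": "FoodEstablishmentReservation",
                # The "-input" suffix marks a value the user has to provide.
                "partySize-input": "required",
            },
        },
    }


doc = reservation_action(
    "Example Bistro",
    "https://example.com/reserve?party={partySize}",
)
serialised = json.dumps(doc, indent=2)
```

A client reading this document knows which single field to ask for, which is what removes the need for free-text parsing.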

A similar approach is used in Gmail (see for instance the GitHub integration). So, Twitter, what’s your next move to also embrace the Semantic Web?

Remove inactive Twitter followees with this tiny Python script

I recently reached the Twitter limit to add new followees, so I wrote a tiny Python script, Twitter Cleaner, to remove people who haven’t sent anything for a number of days (30 by default) – and consequently be able to add new ones. It’s now available on GitHub.

Twitter Cleaner
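The selection rule itself is simple, and can be sketched without any API call. This is not the script’s actual code: `inactive_followees` is a hypothetical helper working on made-up data, and treating accounts that never tweeted as inactive is an assumption.

```python
from datetime import datetime, timedelta


def inactive_followees(last_tweet_at, now, max_days=30):
    """Followees whose last tweet is older than max_days (or who never tweeted)."""
    cutoff = now - timedelta(days=max_days)
    return sorted(
        user
        for user, last in last_tweet_at.items()
        if last is None or last < cutoff
    )


# Hypothetical followees with the date of their last tweet.
now = datetime(2014, 11, 1)
last_tweet_at = {
    "alice": datetime(2014, 10, 25),
    "bob": datetime(2014, 8, 1),
    "carol": None,  # never tweeted
}

to_unfollow = inactive_followees(last_tweet_at, now)
```

In a real run, the dates would come from the Twitter API and each selected account would then be unfollowed, subject to the caveats below.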

Note that it might conflict with the previous Twitter TOS if you unfollow too many people at once. However, it will happen only once if you put it into a daily crontab. It was safe in my case, but I can’t guarantee it will be in yours. You may also reach the API rate limit if you have too many followees.

It’s built using python-twitter, and is available under the MIT license.