Testing ski apps in the French Alps

I’ve spent a week in Courchevel (French Alps), and besides seeing our 3-year-old daughter discover snow – and proudly wear her Piou-piou club medal on the slopes at the end of the week – it was also a good opportunity to ski again, enjoy good mountain food, and test various iOS skiing apps.

I skied regularly between the ages of 3 and 12-ish, but this was only my third time on skis in the past 10 years or so. Needless to say, I’m not an expert; but as a data geek, with a recent interest in sport + technology and quantified self, I’ve tried a few apps to monitor my week. Here’s a quick summary of what I’ve tried, mostly about two apps that I’ve particularly enjoyed: Ski Tracks and Trace Snow.

A panoramic view of Mont-Blanc from “La cave des creux”

An iPhone and 5 apps in the pocket

To start with, here are the five apps I installed on the first day: Edge, Runkeeper, Ski+, Ski Tracks, and Trace Snow. To monitor each day, I wanted an experience as frictionless as possible: launch the app in the morning when heading to the cableway, and stop it at the end of the day – hopefully with automatic run / lift detection, metrics for each run, etc. Pro-tip: don’t put your phone in the same pocket as your hands-free ski-pass, or the pass won’t activate when you’re at the gates.

As expected (since it’s not tailored for skiing), Runkeeper did a pretty bad job at it, simply monitoring distance with its GPS, unable to auto-split laps and, most importantly, to differentiate lifts from runs.

I was not really able to evaluate Ski+: I figured out, when looking at it mid-day, that it had recorded only a partial run and then stopped recording. A second try did the same, and I stopped using it after a day.

Then came Edge. The design is pretty solid, and very similar to the Strava iOS app, with a simple start button on the home screen and some analytics at the end of the day. However, the analytics were not only limited, they were also wrongly measured (in terms of runs vs. lifts, and speed).

Keeping 2 of them after day 1

The two other apps I tested, Ski Tracks and Trace Snow – which I’ll review next – kept running on the phone for the following days. Since both are also available on Android, it was also a fun way to compare metrics at the end of the day.

They both did a very good job at automatically splitting runs and lifts without any input, so I didn’t have to think about them during the day – one of my main criteria. Regarding metrics, both apps gather similar analytics (speed, distance, vertical, etc.), with very similar values (at least within an acceptable difference for a skier like me).

The big contrast, besides the community aspect of Trace Snow, is their interface. It is a bit like comparing a 1998 Geocities site with the latest flat-design website from a hip start-up.

Ski Tracks

As old-school as it looks, I kept using Ski Tracks during the week. I enjoyed how it packs all the relevant data into a single day screen, and really liked the altitude profile, which shows different dynamics than the usual map, itself available on a separate screen.

Summary of a day skiing with Ski Tracks
Summary of a day skiing with Ski Tracks (map view)

Regarding the map, it’s actually disturbing to see your ski tracks on a Google map with no snow at all – I wish they’d take their satellite pictures of mountains during the winter! In addition to the full view, the app also displays statistics for a single run.

Analytics of a single run with Ski Tracks

It is basic, and the interface is indeed very old-school, but it does the job very well, with no fuss. Facebook sharing is available in the paid version, but once again it’s not really up to date with the latest technologies, simply posting pictures in a dedicated album, without using Open Graph to display nicer stories in the feed.

Sharing Ski Tracks logs on Facebook

Trace Snow

Last but not least, Trace Snow: a splendid design, reminding me of Strava (like Edge before), and a user interface that lets you quickly swipe from one run to another, with all metrics (and a map) on one page. A nice metric, not available in Ski Tracks, is the sustained speed, a better indicator than the average speed, especially if you regularly stop when riding in groups.

Summary of a day skiing with Trace Snow
Summary of a day skiing with Trace Snow (map view)

A single run provides the same view, also identifying lift names – which is useful at the end of the day. However, I missed the altitude graph (overall and per run) of Ski Tracks, probably one of the reasons I kept using both.

Analytics of a single run with Trace Snow

As for Facebook sharing, it doesn’t use Open Graph either, but uploads a “session sheet” picture that links to the Web view of the session.

And this is a core difference between the two apps. While Ski Tracks is “just an app”, Trace Snow is a full platform, with a social network, a gamification aspect (earn badges, à la Foursquare), and more; together with a Web interface so that anyone can browse your statistics for a run, a week, or a full ski season.

The Trace Snow web view of a session

The comparison with Strava that I made before is hence not limited to the design, but extends to the platform aspect. Even though I haven’t made much use of it, I think it has real potential for ski amateurs and professionals to log their data, compete with each other, and more – as Strava does for cycling and running.

What about next year?

I’m excited to see what’s next for both apps – and for others, including newcomers – as I’m already impatient for my next ski trip, to run more slopes and gather more data!

Actually, it’s likely that I’ll try the Recon Snow2, for its live data, but also to make use of its dashboard with complete analytics, including slope names, colours, and more, as you can see in this ski trip and gadget review from DC Rainmaker. Plus, I’ve just ordered a Polar V800, so I’m looking forward to seeing what their ski profile is about.

The case for Task Queues on Google App Engine: Pinging remote APIs

I’ve spent the past few months building YapMe, and our first MVP was released on the App Store a few days ago! The app aims to bridge the gap between photos and videos, letting users take pictures with ambient sound, up to 25 seconds, in a single click.

To build it, I’ve decided to fully rely on the Google Cloud Platform: App Engine, Datastore, Endpoints, and more. I’ll blog about the overall experience later, but here’s a quick post about a particular topic: Task Queues.

Gathering user metrics

As with every new product, metrics matter. To gather those, we use various APIs and toolkits: Crashlytics, Google Analytics, and Intercom.

While Crashlytics and Google Analytics calls are made directly from the device, Intercom calls are made on the back-end. So, for instance, when adding a new followee, instead of doing

- (iOS) /POST add_followee to YapMe
  -- (YapMe back-end) User.add_followee(other)
  -- (YapMe back-end) 200 OK
- (iOS) /POST add_followee to Intercom
  -- (Intercom back-end) set "add followee" metric
  -- (Intercom back-end) 200 OK

Or

- (Android) /POST add_followee to YapMe
  -- (YapMe back-end) User.add_followee(other)
  -- (YapMe back-end) 200 OK
- (Android) /POST add_followee to Intercom
  -- (Intercom back-end) set "add followee" metric
  -- (Intercom back-end) 200 OK

We simply do

- (Android | iOS) /POST add_followee to YapMe
  -- (YapMe back-end) User.add_followee(other)
  -- (YapMe back-end) /POST add_followee to Intercom
     ---- (Intercom back-end) set "add followee" metric
     ---- (Intercom back-end) 200 OK
  -- (YapMe back-end) 200 OK

Here are a few reasons for this:

  • Unlike Crashlytics or GA, our Intercom metrics are not directly related to the app (e.g. session length) but to actions on database entities (e.g. creating a new yap, or following a user). As those actions are recorded in the back-end, it makes sense to gather the metrics at the same time;
  • Some metrics are conditional, and those conditions are evaluated on the back-end (e.g. “has the media been already shared by the user?”). Pushing metrics from the app would require another layer of back-and-forth between the device and the API;
  • We’ll eventually have multiple clients (iOS, Web, Android), so having the metrics handled on the back-end saves us from implementing them in every client – especially useful when updates are required: they can be done on the back-end without pushing new app releases.

Pinging remote endpoints with Task Queues

I initially implemented those back-end metrics with a simple urlfetch (the GAE API to handle remote requests), but was bugged by some queries being more time-consuming than expected. Using the new Cloud Trace tool, I noticed that the Intercom queries were taking a while on the back-end, as seen in the log trace below, which represents the trace of an API call on our back-end; the two urlfetch.Fetch() calls are the calls to the Intercom API.

Using Cloud Trace to debug remote API calls

There are a few solutions to handle this, and to make sure the main API call continues without waiting for a reply from Intercom (I don’t really care whether the call is a success or not; we’re OK with losing a metric if something happens):

  • Use async urlfetch requests. Yet, this keeps the connection open, while I just want a simple ping and don’t need to handle the query result;
  • Use a Python thread. In this case, the task is threaded (so the main API call can exit), but it runs on the same instance(s) as the one that initiated it, consuming resources there;
  • Use a Task Queue. The Intercom query is pushed into a separate push queue, which is processed immediately and auto-scales, delegating the work to a new module in our case – as sketched below.
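
As a minimal sketch of that last option, on the Python runtime – the queue name, handler URL, ‘metrics’ module, and the Intercom endpoint / payload are all simplified placeholders, not our actual code:

from google.appengine.api import taskqueue

def add_followee(user, other):
    user.add_followee(other)  # the main work, as before
    # Fire-and-forget: enqueue the Intercom ping, so the main API call
    # returns without waiting for the remote endpoint
    taskqueue.add(queue_name='intercom',
                  url='/tasks/intercom',
                  target='metrics',  # route the task to a dedicated module
                  params={'user_id': str(user.key.id()),
                          'event': 'add_followee'})

And the matching handler, running in the ‘metrics’ module:

import webapp2
from google.appengine.api import urlfetch

class IntercomPingHandler(webapp2.RequestHandler):
    def post(self):
        # Best-effort ping: a failure is retried by the queue, and we
        # can live with losing a metric anyway
        urlfetch.fetch('https://api.intercom.io/events',
                       method=urlfetch.POST,
                       payload='user_id=%s&event_name=%s' % (
                           self.request.get('user_id'),
                           self.request.get('event')))

app = webapp2.WSGIApplication([('/tasks/intercom', IntercomPingHandler)])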

This gives the following trace result: it takes less than 10% of the original time, and delegates all the processing and resources to another module, so the main one and its instances are not overloaded by simple pinging tasks.

Pushing remote API calls in a Task Queue

Note that we’re using the same approach to implement push notifications, which are now available in our new release. In both cases, pushing data into the queue and handling it is straightforward, as described in Google’s push queues tutorial. Note that push queues are executed on App Engine, which means you cannot do advanced processing (such as image processing) with them; for those cases, we rely on pull queues. More about this later.

Context: The future of music streaming and personalisation?

With CES starting tomorrow, I thought it would be a good time to reflect on the future of music streaming, and what’s needed to own the space. Not only because the conference holds a session on this very topic, but also because advances in Data Science, wearables, and context-aware computing could bring brand-new experiences regarding how we consume – and discover – music.

The need for discovery and personalisation

Besides a few exclusive artists, such as Thom Yorke on Bandcamp, or Metallica on Spotify, mainstream services (Deezer, Rdio, Rhapsody, iTunes Radio, Pandora, etc.) tend to have very similar catalogues. As music streaming becomes a commodity, those services need to find incentives for users to choose them over competitors.

While one way to do so is to focus on hi-fi content (as done by Pono Music or Qobuz), another is to invest more – both in product and R&D – in personalisation and discovery, in order to be ahead of the pack and own the space. That’s an obvious strategy, and a win-win-win for all parties involved:

  • For consumers, delighted when discovering new artists they’ll love, based on their past behaviours or the streaming habits of their friends; and when figuring out that those platforms really understand what they should listen to next;
  • For artists, escaping the long-tail and hence generating more streams, and a little revenue, but most importantly: having the opportunity to convert casual listeners into super-fans;
  • For streaming services, keeping existing users active and adding new ones; consequently gathering more data and analytics (plays, thumbs-up, social networks, etc.), re-investing this into new product features.

That being said, the music-tech consolidation that happened over the past few months is not surprising: Spotify + The Echo Nest, Rdio + TastemakerX, or Songza + Google. Interestingly, they showcase different ways that music discovery can be done: “Big Data” (or, should I say, smart data) for The Echo Nest, social recommendations for TastemakerX, or mood-based curation for Songza. But one approach doesn’t fit all, and they’re often combined: if you’re not convinced, look at your Spotify homepage and see the different ways you can discover music (“Top lists”, “Discover”, etc.).

Various ways to discover new music through Spotify

How hardware and context could help

Considering all those ways to discover music: what’s next? Well, probably a lot.

  • On the one hand, advances in large-scale infrastructure and AI now make it possible to run algorithms on billions of data points – combining existing techniques such as Collaborative Filtering or Natural Language Processing with new experiments in Deep Learning;
  • On the other hand, social networks such as Twitter or Facebook provide a huge amount of signals to identify user tastes, correlations between artists, trends, predictions and more – which could go further than discovery by creating communities through music-based user profiling.

But I think that the most exciting part resides in context-aware recommendations. Remember Beats’ “The Sentence”? Or Spotify’s activity-based playlists (“Workout”, “Focus”, etc.)? This is all good, but it requires manual input to understand users’ context, expectations, and short-term listening goals.

Generating Music with Beats’ “The Sentence”

We can soon expect this to be generated automatically for us, using the wide range of sensors we have in our cars, houses, or bodies (from smart watches to Nest appliances), and information we already provided to other services we use daily.

Building the future of context-based music personalisation

What about a Spotify integration with Runkeeper that automatically detects when you’re in the last mile of a race, and plays “Harder Better Faster Stronger” to push you through? Or your car’s Rdio automatically playing your friends’ top tracks when you’re joining them at a party recorded in your Google Calendar? And, at that particular party, should Nest send signals to Songza / YouTube to play some funky music when things are calming down and there’s no more energy in the room?

This obviously requires some work to make those services talk intelligently to each other. But we’re already getting there, with the growth of APIs on various fronts (music, devices, fitness, etc.), and standards such as schema.org, especially its Actions module. CES will be a perfect time for wearable manufacturers, streaming services, and data providers to announce some ground-breaking partnerships, putting context as a first-class citizen of music discovery and personalisation. Let’s wait and see!

(Header picture by lgh75 – under CC by-nc-nd)

24 hours, 24 genres, 24 tracks: Discover YouTube music trends via Twitter

24’s idea is simple: Identify the top-24 tracks of the top-24 genres played via YouTube links shared on Twitter during the last 24 hours.

24 – Heavy Metal tracks

It’s based on the Twitter + Freebase + BigQuery pipeline I built last week-end (indeed, using only a subset of the full Twitter stream), and uses Bootstrap and AngularJS for the UI. While the data is mined in real-time, the page itself is refreshed every two hours.

Feedback? Comments? Say hello or check MDG’s portfolio for more. In the meantime, enjoy 24 at http://24.mdg.io, and check it out regularly for updates on its current MVP.

Sharing YouTube music on Twitter: Analytics using Freebase and BigQuery

Following my journey with Google Cloud, in particular BigQuery, I’m building a pipeline which mines tweets containing YouTube videos, and maps those videos to Freebase in order to run various discovery / recommendation analytics and product experiments.

If you’re not yet familiar with it, Freebase is the core of Google’s Knowledge Graph and provides machine-readable, structured information about a large number of entities, or “real-world things described on the Web”.


To build this, I’ve been using the streaming APIs from both Twitter and BigQuery to get and save the data. In between, a middleware parses the tweets and calls the YouTube API and Freebase’s Knowledge Graph to extract additional data from each, with a bit of memcached to avoid rate-limiting on those APIs.
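
As an illustration, here is a minimal sketch of that middleware, using a tweepy (3.x-style) stream listener; the credentials are placeholders, and the YouTube / Freebase lookup and the BigQuery streaming insert are stubbed out as hypothetical helpers (resolve_video and stream_to_bigquery):

import re

import tweepy

YOUTUBE_ID = re.compile(r'(?:youtube\.com/watch\?v=|youtu\.be/)([\w-]{11})')

class YouTubeListener(tweepy.StreamListener):
    def on_status(self, status):
        # YouTube links are wrapped by t.co, so look at the expanded URLs
        for url in status.entities.get('urls', []):
            match = YOUTUBE_ID.search(url.get('expanded_url', ''))
            if not match:
                continue
            # resolve_video: YouTube API + Freebase topics, fronted by
            # memcached to avoid hitting the rate limits
            row = resolve_video(match.group(1))
            row.update(tweet_id=status.id_str,
                       tweet_user_id=status.user.id_str)
            stream_to_bigquery(row)  # wraps tabledata().insertAll()

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
tweepy.Stream(auth, YouTubeListener()).filter(track=['youtube'])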

The infrastructure started running a few days ago, and I had gathered (when starting this write-up) 1.2M tweets so far, for a total of 516,056 distinct videos, of which 345,410 have been linked to Freebase entities. Using this sample, I’ll now describe a few things that we can learn from this data in the context of music.

But first, why music videos? Not only because I’m big into music-related data science and engineering, but also because Music is the most shared category in the sample, with 28.2% of the videos, followed by People & Blogs (19.5%) and Entertainment (14.4%). Another reason why YouTube Music Key definitely makes sense.

Popular videos: super fans or spammers?

One of the first queries I tried – focusing solely on BigQuery’s SQL capabilities – was to identify the most popular videos on Twitter, with their corresponding YouTube views (using data from the last YouTube API call).

SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  COUNT(tweet_id) as num_tweets,
  MAX(tweet_youtube.youtube_views) as num_views
FROM
  [Twitter.TwitterStream]
WHERE
  -- 10 is YouTube's "Music" category
  tweet_youtube.youtube_category_id = 10
GROUP EACH BY 1, 2
ORDER BY 3 DESC

Most popular videos in the dataset

I was surprised by the small difference between the number of tweets and views for some of them, so I decided to measure the number of tweets per user for each video. Using another simple SQL query, we can easily identify videos that are self-promoted – or should I say spammed – on Twitter.

SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  COUNT(DISTINCT(tweet_user_id)) as num_users,
  COUNT(tweet_youtube.youtube_id) as num_tweets,
  MAX(tweet_youtube.youtube_views) as num_views,
  -- a low users-to-tweets ratio means a few accounts tweeting the
  -- same video over and over, i.e. likely spam
  CAST(COUNT(DISTINCT(tweet_user_id)) as float)
    /CAST(COUNT(tweet_youtube.youtube_id) as float) as ratio
FROM
  [Twitter.TwitterStream]
WHERE
  tweet_youtube.youtube_category_id = 10
GROUP EACH BY 1, 2
ORDER BY 6 ASC

Number of tweets vs views on YouTube

On the other hand, limiting the first SQL query to one tweet per video per user provides an easy way to identify top tracks based on their number of unique fans (whether or not some of them are spam accounts is another topic), as in the query below.
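
The adaptation is a one-liner – counting distinct users instead of tweets (keeping in mind that COUNT(DISTINCT) is an approximation in BigQuery’s legacy SQL):

SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  -- one tweet per user per video: count unique fans instead of tweets
  COUNT(DISTINCT tweet_user_id) as num_fans,
  MAX(tweet_youtube.youtube_views) as num_views
FROM
  [Twitter.TwitterStream]
WHERE
  tweet_youtube.youtube_category_id = 10
GROUP EACH BY 1, 2
ORDER BY 3 DESC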

Popular videos (one tweet per user)

Entities: better than tags

By linking videos to entities, rather than doing simple tag or keyword extraction, much more meaning can be derived from tweets. As every entity is typed, additional filtering can be applied using the type of each entity. For instance, we can adapt the previous query to find not the top tracks, but the top artists (i.e. entities typed as music artists), as sketched below.
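
A sketch of the adapted query – assuming artist entities are exposed with a /music/artist topic type, mirroring the genre filter used further below:

SELECT
  tweet_youtube.youtube_relevant_topic.topic_name,
  COUNT(DISTINCT tweet_user_id) as num_fans
FROM
  [Twitter.TwitterStream]
WHERE
  -- keep only entities typed as music artists (assumed Freebase type)
  tweet_youtube.youtube_relevant_topic.topic_type = '/music/artist'
GROUP EACH BY 1
ORDER BY 2 DESC
LIMIT 10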

Popular artists (via Freebase mappings)

Going deeper, you can also find the most popular music genres in the dataset.

SELECT
  tweet_youtube.youtube_relevant_topic.topic_id,
  tweet_youtube.youtube_relevant_topic.topic_name,
  COUNT(DISTINCT tweet_youtube.youtube_id) as num_videos
FROM
  [Twitter.TwitterStream]
WHERE
  tweet_youtube.youtube_relevant_topic.topic_type = '/music/genre'
GROUP EACH BY 1, 2
ORDER BY 3 DESC

Top-10 genres in the dataset

From there, and using the same entity-filtering approach, we can build genre-specific top-10s, as below for Heavy metal.
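
Sketched below, filtering on the genre entity’s name (the exact topic label is an assumption):

SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  COUNT(DISTINCT tweet_user_id) as num_fans
FROM
  [Twitter.TwitterStream]
WHERE
  -- videos linked to the 'Heavy metal' genre entity
  tweet_youtube.youtube_relevant_topic.topic_name = 'Heavy metal'
GROUP EACH BY 1, 2
ORDER BY 3 DESC
LIMIT 10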

Top Heavy-metal videos


User profiling, semantic advertisement, and more

Besides analytics, an obvious use case of such an approach is user profiling. When I was at DERI, we learned that a lot of valuable content for user profiling can be mined from the things people link to, by extracting structured data from those links (in this case, via the YouTube / Freebase mappings).

Using a similar process, Twitter users could be categorised more precisely through their music tastes, and get recommendations of artists to follow, videos to watch, or music to buy, based on this data mined from external sources. This is definitely relevant in the context of the upcoming Twitter feed update! On the other side of the spectrum, we can imagine that advertisers, or bands that want to promote themselves on Twitter, could use those signals for specific user targeting – a constant struggle for music-industry marketing.

As the pipeline progresses, I’ll try to come up with some other interesting experiments, while building a small hack / product using this data in the meantime, most likely combined with data from seevl, to be released soon.

YouTube Music Key: My hopes for music discovery and personalisation

YouTube Music Key, YouTube’s music subscription service, is about to go live – and I’m obviously very excited about this. As YouTube is one of the top online music sources, it’s great to see it finally enhanced with music-specific features (expectedly: discovery, playlists, influencers, full albums, etc.).

This is the vision we had when we started seevl about two years ago, and that you can still experience today on play.seevl.fm (including recommendations, discovery, auto-playlisting and more – see for instance what we provide for Blur).

Blur’s page on play.seevl.fm

In a competitive landscape, personalisation and discovery matter

In particular, I’m looking forward to the personalisation and discovery aspects of the upcoming YouTube platform. Online streaming is a competitive landscape, and I’m a strong believer that those two aspects will differentiate the OK services from the top ones, and eventually determine who will own the market.

Whether it’s through a laid-back experience (radio/playlists) or via active browsing (crate-digging style), personalisation and discovery are two factors that can delight listeners, and eventually increase user retention and acquisition for a music streaming service.

But, maybe as importantly, they can also drive more streams for big acts, and surface unknown artists lost in the long tail. In times of complaints and arguments regarding streaming revenues, this is definitely a key aspect, and it’s no surprise that services take it into consideration, either through dedicated websites (e.g. Spotify Artists or Pandora AMP), or by hiring industry veterans to work with artists (e.g. Dave Allen at Beats).

Why Google / YouTube can nail it

Besides the catalogue itself (VEVO, the Merlin deal for Pandora), Google’s / YouTube’s data is a key factor that could let them nail the new platform.

While some services have amassed a considerable amount of user data, letting them implement great personalisation algorithms, on-boarding users with a great experience is a challenge for a new streaming platform. Indeed, the first time a user logs in, the platform won’t know much about them – a.k.a. the cold-start problem. Facebook Connect (and the Graph API) can solve this, provided that users (1) allow the platform to use their “like” data, and (2) have enough music-related information there, e.g. by manually liking bands or connecting their existing streaming services to Facebook. If that’s not the case, users will either have to use the platform for a while before it reacts correctly (a catch-22, as they might leave early because of the bad experience), or answer a few questions about favorite acts and genres, as done in the pleasant Beats Music on-boarding interface.

Beats Music on-boarding process

On the other hand, YouTube has already amassed a large amount of user data – generally linked to already existing accounts. Hence, when you first log in to YouTube Music Key, it’s very likely that it will know you very well – and can suggest relevant music from day one! Not only by knowing your favorite bands, but also, using its large user dataset, by identifying what kind of listener you are: Do you like taking risks and listening to indie bands? Are you the one who generally discovers great stuff before it’s signed? Or are you just comfortable listening to top-40 tracks?

Moreover, thanks to the power of Freebase’s Knowledge Graph, and its integration with YouTube, the platform could also build your music DNA (similar to what we’ve started at play.seevl.fm) and know, based solely on the artists you’ve listened to, that you’re into everything punk-rock, but prefer Epitaph to Sub Pop; or that you love Motown, but only recordings from its Detroit years. Note that this can also be valuable for artists / labels (and eventually for revenue streams, sponsored recommendations, and more).

The new YouTube music page

So, while I’m waiting for beta access, I gave the new YouTube music landing page a try. I believe this is just an MVP, but I was a bit disappointed by the personalisation experience.

While the mixes definitely made sense – thumbs up for “My Mix” – some recommendations felt a bit awkward, but I will only blame myself for using my account to play lullabies to the kids (ahem, what about a “mute signal” option for a given genre?).

My personalised YouTube music page…

Most of my disappointment came from the next sections. While the “Trending Music Videos” section probably makes sense from a business perspective, it’s completely un-personalised, and I’d rather see trends for genres that I like. Similarly, I would expect the “Top Videos by Genres” section to mostly include genres I’m familiar with (which can easily be derived from my past listening habits, as discussed before).

I’d have hoped as well that the “Hitting the Gym Mixes” and “Music For Every Mood” sections would be more contextual. Depending on the time of day, my geolocation, etc., what about replacing the first one with “Music for Coding” or “Chillin’ at Home”? After all, if you allowed YouTube to sync with your calendar, or to access sensors from your Android (phone or Wear), we could imagine such user experiences, where music meets the IoT. And with the recent acquisition of Songza, we can expect very nice curated playlists for every circumstance, inferred from those external signals!

… versus my play.seevl.fm dashboard

All things considered, and in spite of the current lack of personalisation of the new music page, I’m still sure they’ll do an awesome job with it. I’m looking forward to seeing what’s next for the platform, in particular YouTube Music Key – and, in a way, to seeing our vision for YouTube as a music platform finally released at scale!

Insights from 500,000 Deezer playlists using Google’s BigQuery

A few days ago, Warner Music acquired Playlists.net, and as TechCrunch pointed out, one reason may well be its data, and the related insights.

But what can we learn from such a dataset? Well, a lot actually: discovering top tracks, building content-based recommendations, mining new trends, and finding influencers to target during album releases. This can be invaluable for a record label or an artist, and it’s no surprise that companies like Musicmetric or Next Big Sound tackle it from the analytics perspective, while Gracenote or The Echo Nest focus on data, recommendations, or user profiling.

To prove some of those points, I’ve run a small experiment using 500,000 playlists from Deezer, together with Google’s BigQuery infrastructure.

The setup

Analyzing playlists is not a new thing, and you can read about various Big Data architectures, such as Spark at Spotify, from the music-discovery standpoint. I used Google’s BigQuery in order to quickly get insights without setting up my own stack. As I had experimented with it in the past, it was a good time to try it with my own dataset.

With a few Python scripts, here are the steps to set up the experiment. [Update 2014-10-29: The scripts, as well as links to the dataset, are now available on Github]:

– First, gather about 500,000 playlists from the Deezer API [1], using a threaded crawler randomly picking playlists with IDs between 1 and 10M (see the sketch after this list), for a total of 9.7 GB of JSON data;

– Then, prepare the playlists for Google’s BigQuery, concatenating the 500K original files into 9 gzipped JSON files ([1-9].json.gz), and uploading them to Google Cloud Storage, for a total of 1 GB;

– Finally, define a schema to map the data to tables, and load it from Cloud Storage into BigQuery. It took only 12 seconds to load the 1 GB of compressed data, for a total of 510,187 playlists, with 12M tracks (900K distinct ones) in total.
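
For illustration, the crawling step could look like the sketch below – assuming Deezer’s public https://api.deezer.com/playlist/<id> endpoint, with the file layout and error handling simplified:

import json
import random
from concurrent.futures import ThreadPoolExecutor

import requests

API = 'https://api.deezer.com/playlist/%d'

def fetch(playlist_id):
    # Fetch one playlist and save it as JSON; deleted or private
    # playlists come back with an 'error' field and are skipped
    data = requests.get(API % playlist_id, timeout=10).json()
    if 'error' in data:
        return
    with open('playlists/%d.json' % playlist_id, 'w') as f:
        json.dump(data, f)

# random playlist IDs between 1 and 10M, fetched by a pool of threads
ids = random.sample(range(1, 10000000), 500000)
with ThreadPoolExecutor(max_workers=20) as pool:
    pool.map(fetch, ids)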

Defining a schema to load JSON data into BigQuery

Content recommendations

With such an amount of data – and not only in the music domain – it’s relatively easy to build a content recommendation platform, based on “If you like X, you’ll like Y”. Using this simple SQL query, you can find the top related artists for any artist in the dataset:

SELECT COUNT(b.tracks.data.artist.id), b.tracks.data.artist.name
FROM
  -- flatten playlists into (playlist, track) rows, and self-join on the
  -- playlist id to get tracks appearing in the same playlists
  FLATTEN([Playlists.Playlists], tracks.data) a
LEFT JOIN
  EACH FLATTEN([Playlists.Playlists], tracks.data) b
ON a.id = b.id
WHERE
  a.tracks.data.artist.id = <artist_id>
  AND b.tracks.data.artist.id != <artist_id>
GROUP EACH BY 2
ORDER BY 1 DESC
LIMIT 5

For instance:

### Related to Rihanna
* Britney Spears
* Beyoncé
* The Black Eyed Peas
* David Guetta
* Justin Timberlake
### Related to Daft Punk
* Justice
* Muse
* David Guetta
* Moby
* The Chemical Brothers
### Related to Agnostic Front
* Blood for Blood	 
* Hatebreed	 
* Dropkick Murphys	 
* Helga Hahnemann	 
* Bad Religion

A good way to bootstrap an artist-based radio station!

Going further, building a song-to-song recommendation algorithm is not really complicated either (see the sketch below). Here are, for instance, the most frequent tracks appearing together with “Harder Better Faster Stronger” that are not by Daft Punk.
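
A sketch, adapting the artist query above to tracks (the tracks.data.title field is assumed from the Deezer track schema):

SELECT COUNT(b.tracks.data.id), b.tracks.data.title, b.tracks.data.artist.name
FROM
  FLATTEN([Playlists.Playlists], tracks.data) a
LEFT JOIN
  EACH FLATTEN([Playlists.Playlists], tracks.data) b
ON a.id = b.id
WHERE
  -- <track_id>: "Harder Better Faster Stronger";
  -- <artist_id> excludes Daft Punk's own tracks
  a.tracks.data.id = <track_id>
  AND b.tracks.data.artist.id != <artist_id>
GROUP EACH BY 2, 3
ORDER BY 1 DESC
LIMIT 5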

### Related to Harder Better Faster Stronger, non Daft-Punk
* David Guetta: Cozi Baby When The Light
* Laurent Wolf: No Stress (Radio edit)
* David Guetta: Love Don't Let Me Go (Original Edit)
* David Guetta: Love Is Gone (Radio Edit Rmx)
* Mika: Relax, Take It Easy

Top artists and tracks, popularity, and more

Besides recommendations, an obvious use case is to identify top tracks or top artists. For instance, here are the top tracks for some artists, based on their popularity in the full dataset.

### Most popular tracks from Daft Punk
* Around The World
* Harder Better Faster Stronger
* Da Funk
* Technologic
* Around The World / Harder Better Faster Stronger
### Most popular tracks from Weezer
* Island In The Sun	 
* My Name Is Jonas	 
* Beverly Hills	 
* Buddy Holly	 
* Hash Pipe

Combined with temporal attributes (not available here unfortunately, more on this later), one could also identify how fast a track progresses from its release into a top-X.

Regarding top-artists, the easy way is to simply rank them by the number of tracks they have in the full dataset (900K distinct ones).

### Top-artists by number of tracks
* Linkin Park (65,415)
* Muse (59,550)
* U2 (54,688)
* Rihanna (53,354)
* Queen (51,717)

But another way is to sort artists by the number of playlists they appear in:

SELECT COUNT(id) as c, tracks.data.artist.id, tracks.data.artist.name
FROM (
  -- deduplicate (playlist, artist) pairs first, so each playlist
  -- counts only once per artist
  SELECT id, tracks.data.artist.id, tracks.data.artist.name
  FROM [Playlists.Playlists]
  GROUP EACH BY 1, 2, 3
)
GROUP EACH BY 2, 3
ORDER BY c DESC

Surprisingly, the most popular artist is then a karaoke cover band, included in 23,993 of the 510K playlists – more than Rihanna or U2!

### Top-artists by playlists appearance
* Studio Group (23,993)
* Rihanna (23,398)
* U2 (17,860)
* Queen (17,463)
* Linkin Park (17,232)

Another interesting insight – not surprising if you’re into music discovery and the long tail – concerns the way popular artists outweigh less popular ones in the distribution: 43,346 artists, i.e. about a third of them, appear only once in the dataset, and 37,864 appear between 2 and 10 times.

Trends, influencers and targeted recommendations

Finally, what about identifying trends and influencers?

One approach would be to identify which artists jump from the top-1000 to the top-100, and even to the top-50, in a given timeframe. Unfortunately, Deezer playlists do not contain any temporal information. Yet, coming back to the starting point of this post, that’s definitely something valuable that WMG could get from Playlists.net.

They could then identify and target influencers – for instance, users who are among the top 10% of listeners for a given artist – which could be a goldmine when marketing new artists or releases.

Definitely, this acquisition makes sense considering the trends in the industry, and the recent consolidation around various services (Rhapsody, Rdio, etc.), most of them focusing on the analytics / discovery domain. A domain which matters for artists and labels, but also for streaming services and data providers, giving them valuable insights and ways to beat competitors, ensuring their users get the best listening experience they could possibly expect, depending on who they are and how they listen to music.

If you have an interesting dataset and want to run analytics or recommendation experiments, let’s get in touch! And if you’re mostly interested in the discovery / recommendation part, have a look at our turn-key solution at seevl.fm.

[1] I used Deezer and not Spotify, even though Playlists.net is Spotify-based, as there’s no rate limiting on their API for playlist search and retrieval (whether it’s a bug or a feature is another topic for discussion)