Sharing YouTube music on Twitter: Analytics using Freebase and BigQuery

Continuing my journey with Google Cloud, and BigQuery in particular, I’m building a pipeline that mines Tweets containing YouTube videos and maps those videos to Freebase, in order to run various discovery and recommendation analytics and product experiments.

If you’re not yet familiar with it, Freebase is the core of Google’s Knowledge Graph and provides machine-readable, structured information about a large number of entities, or “real-world things described on the Web”.

 

To build this, I’ve been using the streaming APIs from both Twitter and BigQuery to get and save the data. In between, a middleware parses the tweets and calls the YouTube API and Freebase’s Knowledge Graph to extract additional data from each, with a bit of memcached to avoid rate-limiting on those APIs.
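
As a rough illustration of that middleware step, here is a minimal Python sketch of two pieces: pulling a video ID out of a tweeted link, and caching lookups so that the same video is not fetched twice. It is not the pipeline's actual code. The names `extract_youtube_id` and `CachedLookup` are hypothetical, the in-memory dict merely stands in for memcached, and the `fetch` function is a placeholder for a real API call.

```python
from typing import Callable, Dict, Optional
from urllib.parse import parse_qs, urlparse


def extract_youtube_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if there is none."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the ID as the first path segment.
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            # Standard links carry the ID in the "v" query parameter.
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
    return None


class CachedLookup:
    """Wrap a fetch function so that each key is fetched at most once."""

    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, dict] = {}  # assumption: stand-in for memcached
        self.misses = 0  # number of times the wrapped function was called

    def get(self, key: str) -> dict:
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]
```

In a setup like this, the middleware would call `extract_youtube_id` on each expanded URL of a tweet and pass the result to a `CachedLookup` wrapping the metadata call, so that a heavily shared video costs one API request rather than one per tweet.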

Enjoyed this post? Read about related experiments in my #MusicTech e-book

The infrastructure started to run a few days ago, and I’ve gathered (when starting this write-up) 1.2M Tweets so far, for a total of 516,056 distinct videos, of which 345,410 have been linked to Freebase entities. Using this sample, I’ll now describe a few things that we can learn from this data in the context of music.

But first, why music videos? Not only because I’m big into music-related data science and engineering, but also because Music is the most shared category in the sample, with 28.2% of the videos, followed by People & Blogs (19.5%) and Entertainment (14.4%). That is another reason why YouTube Music Key definitely makes sense.

Popular videos: super fans or spammers?

One of the first queries I tried, focusing solely on BigQuery’s SQL capabilities, was to identify the most popular videos on Twitter, with their corresponding YouTube views (using data from the last YouTube API call).

SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  COUNT(tweet_id) as num_tweets,
  MAX(tweet_youtube.youtube_views) as num_views,
FROM
  [Twitter.TwitterStream]
WHERE
  tweet_youtube.youtube_category_id = 10
GROUP EACH BY 1, 2
ORDER BY 3 DESC
Most popular videos in the dataset

I was surprised by the small difference between the number of tweets and the number of views for some of them, so I decided to measure the number of tweets per user for each video. Using another simple SQL query, we can easily identify videos that are self-promoted, or should I say spammed, on Twitter.

SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  COUNT(DISTINCT(tweet_user_id)) as num_users,
  COUNT(tweet_youtube.youtube_id) as num_tweets,
  MAX(tweet_youtube.youtube_views) as num_views,
  CAST(COUNT(DISTINCT(tweet_user_id)) as float)
    /CAST(COUNT(tweet_youtube.youtube_id) as float) as ratio
FROM
  [Twitter.TwitterStream]
WHERE
  tweet_youtube.youtube_category_id = 10
GROUP EACH BY 1, 2
ORDER BY 6 ASC
Number of tweets vs views on YouTube

On the other hand, limiting the first SQL query to allow only one tweet per video per user provides an easier way to identify top tracks based on their number of unique fans (whether or not some of them are spam accounts is another topic).

Popular videos (one tweet per user)
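
The "one tweet per video per user" idea can be sketched outside BigQuery as well. The Python below is only an illustration on made-up rows, not the query used for the ranking above. The field names mirror the columns of the earlier SQL, while the function name `rank_by_unique_users` and the sample values are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rank_by_unique_users(tweets: Iterable[dict]) -> List[Tuple[str, int]]:
    """Rank videos by the number of distinct users who tweeted them."""
    users_per_video: Dict[str, Set[str]] = defaultdict(set)
    for tweet in tweets:
        # A set keeps one entry per user, so repeat tweets do not add up.
        users_per_video[tweet["youtube_id"]].add(tweet["tweet_user_id"])
    counts = [(vid, len(users)) for vid, users in users_per_video.items()]
    # Sort by descending user count, then by video ID for a stable order.
    return sorted(counts, key=lambda item: (-item[1], item[0]))


# Made-up sample: "v1" is tweeted three times by one account,
# "v2" once each by two different accounts.
sample_tweets = [
    {"tweet_user_id": "u1", "youtube_id": "v1"},
    {"tweet_user_id": "u1", "youtube_id": "v1"},
    {"tweet_user_id": "u1", "youtube_id": "v1"},
    {"tweet_user_id": "u2", "youtube_id": "v2"},
    {"tweet_user_id": "u3", "youtube_id": "v2"},
]

ranking = rank_by_unique_users(sample_tweets)
```

With raw tweet counts "v1" would come first. Counting distinct users instead puts "v2" ahead, which is the effect the deduplicated ranking is after.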

Entities: better than tags

By linking videos to entities, rather than doing simple tag or keyword extraction, much more meaning can be derived from Tweets. As every entity is typed, additional filtering can be applied using the type of each entity. For instance, we can adapt the previous query to find not the top tracks, but the top artists (i.e. entities typed as music artists).

Popular artists (via Freebase mappings)
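
To make the type-based filtering concrete, here is a small Python sketch on invented records. It assumes each tweet carries a list of topics with the same `topic_id`, `topic_name` and `topic_type` fields as the table queried in this post, and that artists are typed `/music/artist` (an assumption, by analogy with the `/music/genre` type used for genres). The function name `top_entities` and all sample IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def top_entities(
    tweets: Iterable[dict], topic_type: str, limit: int = 10
) -> List[Tuple[str, int]]:
    """Rank entities of one type by the number of distinct users sharing them."""
    users_per_entity: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for tweet in tweets:
        for topic in tweet["topics"]:
            # Keep only the entities of the requested type.
            if topic["topic_type"] == topic_type:
                key = (topic["topic_id"], topic["topic_name"])
                users_per_entity[key].add(tweet["tweet_user_id"])
    ranked = sorted(
        users_per_entity.items(),
        key=lambda kv: (-len(kv[1]), kv[0][1]),  # most users first, then name
    )
    return [(name, len(users)) for (_id, name), users in ranked[:limit]]


# Invented topics; the IDs are placeholders, not real Freebase identifiers.
ARTIST_A = {"topic_id": "/m/artist_a", "topic_name": "Artist A",
            "topic_type": "/music/artist"}
ARTIST_B = {"topic_id": "/m/artist_b", "topic_name": "Artist B",
            "topic_type": "/music/artist"}
GENRE_X = {"topic_id": "/m/genre_x", "topic_name": "Genre X",
           "topic_type": "/music/genre"}

sample_tweets = [
    {"tweet_user_id": "u1", "topics": [ARTIST_A, GENRE_X]},
    {"tweet_user_id": "u2", "topics": [ARTIST_A]},
    {"tweet_user_id": "u2", "topics": [ARTIST_A]},  # repeat, counted once
    {"tweet_user_id": "u3", "topics": [ARTIST_B]},
]

top_artists = top_entities(sample_tweets, "/music/artist")
```

The same function ranks any other entity type by changing the `topic_type` argument, which is the practical advantage of typed entities over free-form tags.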

Going deeper, you can also find the most popular music genres in the dataset.

SELECT
  tweet_youtube.youtube_relevant_topic.topic_id,
  tweet_youtube.youtube_relevant_topic.topic_name,
  COUNT(DISTINCT tweet_youtube.youtube_id) as num_videos,
FROM
  [Twitter.TwitterStream]
WHERE
  tweet_youtube.youtube_relevant_topic.topic_type = '/music/genre'
GROUP EACH BY 1, 2
ORDER BY 3 DESC
Top-10 genres in the dataset

From there, and using the same entity-filtering approach, we can build genre-specific top-10 lists, as below for heavy metal.

Top Heavy-metal videos
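
A genre-specific chart can be sketched in the same spirit: keep only the tweets whose video is linked to a given genre entity, then count tweets per video. The Python below runs on made-up data and only shows the shape of the computation. The function name `top_videos_for_genre`, the genre label "Heavy metal" and the sample titles are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_videos_for_genre(
    tweets: Iterable[dict], genre_name: str, limit: int = 10
) -> List[Tuple[str, int]]:
    """Return (title, tweet count) pairs for videos linked to one genre."""
    counts: Counter = Counter()
    titles: Dict[str, str] = {}
    for tweet in tweets:
        # A video matches if any of its topics is the requested genre.
        is_match = any(
            topic["topic_type"] == "/music/genre"
            and topic["topic_name"] == genre_name
            for topic in tweet["topics"]
        )
        if is_match:
            counts[tweet["youtube_id"]] += 1
            titles[tweet["youtube_id"]] = tweet["youtube_title"]
    return [(titles[vid], n) for vid, n in counts.most_common(limit)]


# Made-up genre topics and tweets.
METAL = {"topic_name": "Heavy metal", "topic_type": "/music/genre"}
POP = {"topic_name": "Pop music", "topic_type": "/music/genre"}

sample_tweets = [
    {"youtube_id": "v1", "youtube_title": "Track One", "topics": [METAL]},
    {"youtube_id": "v1", "youtube_title": "Track One", "topics": [METAL]},
    {"youtube_id": "v2", "youtube_title": "Track Two", "topics": [METAL, POP]},
    {"youtube_id": "v3", "youtube_title": "Track Three", "topics": [POP]},
]

metal_chart = top_videos_for_genre(sample_tweets, "Heavy metal")
```

This version counts every tweet. It could be combined with the one-tweet-per-user rule from earlier if spam accounts distort the chart.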

 

User profiling, semantic advertisement, and more

Besides analytics, an obvious use case of such an approach is user profiling. When I was at DERI, we learned that a lot of valuable content for user profiling can be mined from things that people link to, by extracting structured data from those links (in this case, via the YouTube / Freebase mappings).

Using a similar process, Twitter users could be categorised more specifically through their music tastes, and get recommendations of artists to follow, videos to watch, or music to buy based on this data mined from external sources. This is definitely relevant in the context of the upcoming Twitter feed update! On the other side of the spectrum, we can imagine that advertisers, or bands that want to promote themselves on Twitter, could use those signals for specific user-targeting, a constant struggle for music industry marketing.

As the pipeline progresses, I’ll try to come up with some other interesting experiments. In the meantime, I’m building a small hack / product using this data, most likely combined with data from seevl, to be released soon.


6 thoughts on “Sharing YouTube music on Twitter: Analytics using Freebase and BigQuery”

  1. This is some really interesting stuff… is there any chance you’ll make the underlying data public so that other folks can try it out themselves? Or is the data proprietary?