Continuing my journey with Google Cloud, and BigQuery in particular, I’m building a pipeline that mines tweets containing YouTube videos and maps those videos to Freebase, in order to run various discovery and recommendation analytics and product experiments.
If you’re not yet familiar with it, Freebase is the core of Google’s Knowledge Graph and provides machine-readable, structured information about a large number of entities, or “real-world things described on the Web”.
To build this, I’ve been using the streaming APIs from both Twitter and BigQuery to get and save the data. In between, a middleware parses the tweets and calls the YouTube API and Freebase’s Knowledge Graph to extract additional data from each, with a bit of memcached to avoid rate-limiting on those APIs.
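As a rough illustration of that middleware step, here is a minimal Python sketch of pulling a YouTube video ID out of a tweeted link and caching lookups so that the same video does not trigger repeated API calls. It is not the actual pipeline code: `extract_video_id`, `CachedLookup` and the `fetch` callable are hypothetical names, only two link shapes are handled, and a plain dictionary stands in for memcached.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the YouTube video ID found in a URL, or None.

    Only two common link shapes are handled (an assumption, not an
    exhaustive list): youtube.com/watch?v=<id> and youtu.be/<id>.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        return None
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/")
        return video_id or None
    return None


class CachedLookup:
    """Cache-aside wrapper: call `fetch` only for keys not seen before."""

    def __init__(self, fetch):
        self._fetch = fetch    # hypothetical callable hitting a remote API
        self._cache = {}       # in-memory stand-in for memcached
        self.remote_calls = 0  # how many times the remote API was hit

    def get(self, key):
        if key not in self._cache:
            self.remote_calls += 1
            self._cache[key] = self._fetch(key)
        return self._cache[key]
```

With this shape, a video that is tweeted thousands of times only costs one remote call per API, which is the point of putting a cache in front of rate-limited services.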
The infrastructure started running a few days ago, and I’ve gathered (when starting this write-up) 1.2M Tweets so far, for a total of 516,056 distinct videos, of which 345,410 have been linked to Freebase entities. Using this sample, I’ll now describe a few things that we can learn using this data in the context of music.
But first, why music videos? Not only because I’m big into music-related data science and engineering, but also because Music is the most shared category in the sample, with 28.2% of the videos, followed by People & Blogs (19.5%) and Entertainment (14.4%). Another reason why YouTube Music Key definitely makes sense.
Popular videos: super fans or spammers?
One of the first queries I tried, focusing solely on BigQuery’s SQL capabilities, was to identify the most popular videos on Twitter, with their corresponding YouTube views (using data from the last YouTube API call).
SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  COUNT(tweet_id) AS num_tweets,
  MAX(tweet_youtube.youtube_views) AS num_views
FROM [Twitter.TwitterStream]
WHERE tweet_youtube.youtube_category_id = 10
GROUP EACH BY 1, 2
ORDER BY 3 DESC
I was surprised by the small difference between the number of tweets and views for some of them, so I decided to measure the number of tweets per user for each video. Using another simple SQL query, we can easily identify videos that are self-promoted, or should I say spammed, on Twitter.
SELECT
  tweet_youtube.youtube_id,
  tweet_youtube.youtube_title,
  COUNT(DISTINCT(tweet_user_id)) AS num_users,
  COUNT(tweet_youtube.youtube_id) AS num_tweets,
  MAX(tweet_youtube.youtube_views) AS num_views,
  CAST(COUNT(DISTINCT(tweet_user_id)) AS float) / CAST(COUNT(tweet_youtube.youtube_id) AS float) AS ratio
FROM [Twitter.TwitterStream]
WHERE tweet_youtube.youtube_category_id = 10
GROUP EACH BY 1, 2
ORDER BY 6 ASC
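To make the ratio in this query concrete, here is a small Python sketch of the same computation on a handful of made-up rows. The `(tweet_user_id, youtube_id)` pair shape and the `users_per_tweet_ratio` name are simplifications for illustration, not the actual table layout.

```python
from collections import defaultdict


def users_per_tweet_ratio(rows):
    """Map each video ID to distinct_users / tweets.

    `rows` is an iterable of (tweet_user_id, youtube_id) pairs, a
    simplified stand-in for the table used in the query.
    """
    tweets = defaultdict(int)
    users = defaultdict(set)
    for user_id, video_id in rows:
        tweets[video_id] += 1
        users[video_id].add(user_id)
    return {vid: len(users[vid]) / tweets[vid] for vid in tweets}


# Made-up sample: video "v1" is tweeted four times by one account,
# video "v2" once each by three different accounts.
sample = [
    ("u1", "v1"), ("u1", "v1"), ("u1", "v1"), ("u1", "v1"),
    ("u2", "v2"), ("u3", "v2"), ("u4", "v2"),
]
ratios = users_per_tweet_ratio(sample)
# Lowest ratio first, mirroring ORDER BY 6 ASC in the query.
ranked = sorted(ratios.items(), key=lambda item: item[1])
```

A ratio of 1.0 means every tweet came from a different user, while a ratio close to 0 means a few accounts tweeting the same video over and over, which is why sorting in ascending order surfaces the likely spam first.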
On the other hand, limiting the first SQL query to allow only one tweet per video per user provides an easy way to identify top tracks based on their number of unique fans (whether or not some of them are spam accounts is another topic).
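The "one tweet per video per user" idea can be sketched in Python as follows. This is an illustration on simplified `(tweet_user_id, youtube_id)` pairs with a hypothetical function name, not the query actually run; in SQL it would presumably amount to ranking on a distinct-user count such as the `num_users` column of the second query.

```python
from collections import Counter


def top_videos_by_unique_users(rows, limit=10):
    """Rank videos by the number of distinct users who tweeted them.

    `rows` is an iterable of (tweet_user_id, youtube_id) pairs; repeated
    tweets of the same video by the same user count only once.
    """
    unique_pairs = set(rows)  # collapses repeated (user, video) tweets
    counts = Counter(video_id for _user_id, video_id in unique_pairs)
    return counts.most_common(limit)
```

A video tweeted five times by a single account then ranks below one tweeted once by two different accounts.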
Entities: better than tags
By linking videos to entities, rather than doing simple tag or keyword extraction, much more meaning can be derived from Tweets. As every entity is typed, additional filtering can be applied using the type of each entity. For instance, we can adapt the previous query to find not the top tracks, but the top artists (i.e. entities of type music artist).
Going deeper, you can also find the most popular music genres in the dataset.
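As an illustration of type-based filtering, here is a minimal Python sketch that ranks entities of a given type by the number of distinct users tweeting them. The row shape, the `top_entities` name and the use of `/music/artist` as the type value are assumptions made for the example; the real table stores topics in nested fields.

```python
from collections import defaultdict


def top_entities(rows, topic_type, limit=10):
    """Rank entities of one type by the number of distinct users tweeting them.

    Each row is a dict with a "tweet_user_id" and a "topics" list; each
    topic is a dict with "topic_id", "topic_name" and "topic_type".
    This flat shape is an assumption made for the sketch.
    """
    fans = defaultdict(set)
    names = {}
    for row in rows:
        for topic in row["topics"]:
            if topic["topic_type"] == topic_type:
                fans[topic["topic_id"]].add(row["tweet_user_id"])
                names[topic["topic_id"]] = topic["topic_name"]
    # Stable sort: entities with equal counts keep first-seen order.
    ranked = sorted(fans, key=lambda tid: len(fans[tid]), reverse=True)
    return [(names[tid], len(fans[tid])) for tid in ranked[:limit]]
```

Calling `top_entities(rows, "/music/artist")` would give a top-artists list, and swapping the type for `/music/genre` gives genres, without touching the rest of the logic.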
SELECT
  tweet_youtube.youtube_relevant_topic.topic_id,
  tweet_youtube.youtube_relevant_topic.topic_name,
  COUNT(DISTINCT tweet_youtube.youtube_id) AS num_videos
FROM [Twitter.TwitterStream]
WHERE tweet_youtube.youtube_relevant_topic.topic_type = '/music/genre'
GROUP EACH BY 1, 2
ORDER BY 3 DESC
From there, and using the same entity-filtering approach, we can build genre-specific top 10s, as below for heavy metal.
User profiling, semantic advertisement, and more
Besides analytics, an obvious use-case of such an approach is user profiling. When I was at DERI, we learned that a lot of valuable content for user profiling is mined from things that people link to, by extracting structured data from those links (in this current case, via the YouTube / Freebase mappings).
Using a similar process, Twitter users could be categorised more specifically through their music tastes, and get recommendations of artists to follow, videos to watch, or music to buy based on this data mined from external sources. This is definitely relevant in the context of the upcoming Twitter feed update! On the other side of the spectrum, we can imagine that advertisers, or bands that want to promote themselves on Twitter, could use those signals for specific user-targeting – a constant struggle for music industry marketing.
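One simple way to picture such a profile is a set of per-user genre weights. The Python sketch below is only an assumption about how this could be done, not an implemented part of the pipeline: `genre_profiles` is a hypothetical name and the `(tweet_user_id, genre_name)` input is a simplified view of the genre entities attached to each tweeted video.

```python
from collections import Counter, defaultdict


def genre_profiles(rows):
    """Build a per-user genre profile as normalised genre weights.

    `rows` is an iterable of (tweet_user_id, genre_name) pairs, one per
    genre entity attached to a video the user tweeted (simplified shape).
    """
    counts = defaultdict(Counter)
    for user_id, genre in rows:
        counts[user_id][genre] += 1
    profiles = {}
    for user_id, genre_counts in counts.items():
        total = sum(genre_counts.values())
        # Each weight is the share of the user's genre mentions.
        profiles[user_id] = {g: n / total for g, n in genre_counts.items()}
    return profiles
```

Weights like these could then feed either side of the use-cases above: matching users to artists in their dominant genres, or letting a band target users whose profile leans towards its genre.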
As the pipeline progresses, I’ll try to come up with some other interesting experiments. In the meantime, I’m building a small hack / product using this data, most likely combined with data from seevl, to be released soon.