AC/DC's TNT live

Tagging YouTube’s music videos with Clarifai deep learning

Following the posts in which I’ve used Clarifai’s deep learning API to tag, classify and then automatically generate album covers, here’s a quick experiment to analyze YouTube music videos using the same API.

From photos to videos

Besides image recognition, which I’ve covered previously, Clarifai’s deep learning algorithms can also extract content from videos, as described in this recent MIT Technology Review article.

The API is as simple to use for videos as it is for images: you just send video URLs to the API to tag their content. For instance:

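from clarifai.client import ClarifaiApi

# clarifai_app_id and clarifai_app_secret are your own API credentials
clarifai = ClarifaiApi(clarifai_app_id, clarifai_app_secret)
data = clarifai.tag_urls(video_urls)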

yields

data = {
  "status_code": "OK",
  "status_msg": "All images in request have completed successfully. ",
  "meta": {
    "tag": {
      "timestamp": 1435921530.770622,
      "model": "default",
      "config": "0b2b7436987dd912e077ff576731f8b7"
  "results": [
      "docid": 3608671363070960741,
      "url": "https://video_url.mp4",
      "status_code": "OK",
      "status_msg": "OK",
      "local_id": "",
      "result": {
        "tag": {
          "timestamps": [
          "classes": [
            ], [
          "probs": [
            ], [
    }, {

Since the JSON object groups timestamps, classes and probs into separate arrays (for each video), a quick transformation is needed to turn them into a Python dictionary from which you can easily get the tags (and their probabilities) for each second of each video.

videos = {}
for result in data.get('results'):
  tag = result['result']['tag']
  # Map each second to a {tag: probability} dictionary
  videos[result['url']] = dict([(
    int(ts), dict(zip(tag['classes'][int(ts)], tag['probs'][int(ts)]))
  ) for ts in tag['timestamps']])

You can then find the score of any tag, for any second of any video in the results:

>>> # Score for "cat" at second 3 of video_url
... print(videos[video_url][3].get('cat'))
>>> # Score for "guitar" at second 8 of video_url
... print(videos[video_url][8].get('guitar'))
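
To make the transformation concrete, here is a minimal, self-contained sketch that runs it on a hand-written response. The sample URL, tags and probabilities are made up for illustration, and the layout assumes one entry in classes and probs per timestamp, which is what the (truncated) output above suggests.

```python
# Hypothetical response, shaped like the (truncated) output shown earlier.
sample = {
    "results": [{
        "url": "https://example.com/video.mp4",
        "result": {"tag": {
            "timestamps": [0.0, 1.0],
            "classes": [["cat", "pet"], ["guitar", "music"]],
            "probs": [[0.98, 0.91], [0.95, 0.89]],
        }},
    }]
}


def index_by_second(data):
    """Return {url: {second: {tag: probability}}} for every video."""
    videos = {}
    for result in data.get("results", []):
        tag = result["result"]["tag"]
        # Pair each timestamp with its own list of classes and probabilities
        videos[result["url"]] = {
            int(ts): dict(zip(classes, probs))
            for ts, classes, probs in zip(
                tag["timestamps"], tag["classes"], tag["probs"])
        }
    return videos


videos_by_second = index_by_second(sample)
print(videos_by_second["https://example.com/video.mp4"][1].get("guitar"))
```

Zipping the three arrays together, rather than indexing by the timestamp value, keeps the sketch working even if timestamps do not start at zero.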

Parsing YouTube videos

Tagging YouTube videos through Clarifai is a bit more complicated, since you can’t directly give the video URL to the API. Here’s the process I’ve used:

  • Download videos from YouTube via youtube-dl (disclosure: I’m aware this is against YouTube TOS, but I’m using it as a quick-and-dirty experiment here, please don’t send your lawyers!);
  • Use ffmpeg to resize videos to 1024, as this is the maximum resolution supported (so far) by the Clarifai API;
  • Upload the videos to any hosting provider (here, I used Google Storage), and send that URL to the API;
  • Done!
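
The steps above can be sketched as three shell commands, built here as Python argument lists so they can be inspected before being run. This is only a rough outline under a few assumptions that are not in the original process: "1024" is read as the target width, the bucket name and file names are hypothetical, and the bucket is assumed to be publicly readable so that the API can fetch the file.

```python
def build_pipeline(youtube_url, bucket, name="video"):
    """Return the pipeline's shell commands and the resulting public URL.

    Assumptions: "1024" is the target width, and the (hypothetical) bucket
    is publicly readable.
    """
    raw, resized = name + "_raw.mp4", name + ".mp4"
    commands = [
        # 1. Download the video as an MP4 file
        ["youtube-dl", "-f", "mp4", "-o", raw, youtube_url],
        # 2. Resize to a width of 1024, keeping the aspect ratio
        #    (-2 lets ffmpeg pick a matching even height)
        ["ffmpeg", "-i", raw, "-vf", "scale=1024:-2", resized],
        # 3. Upload the resized file to Google Cloud Storage
        ["gsutil", "cp", resized, "gs://%s/%s" % (bucket, resized)],
    ]
    public_url = "https://storage.googleapis.com/%s/%s" % (bucket, resized)
    return commands, public_url


commands, public_url = build_pipeline(
    "https://www.youtube.com/watch?v=VIDEO_ID", "my-bucket")
for command in commands:
    # Pass each list to subprocess.check_call to actually run it
    print(" ".join(command))
print(public_url)
```

The returned URL is the one you would then send to the tagging API.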
Enjoyed this post? Read about related experiments in my #MusicTech e-book

Analysing music videos

Let’s give it a try with a few music videos, using the pipeline described above.

First, here’s AC/DC’s TNT, live. Picking a particular frame (second 10), here’s the list of tags:

[u'recreation', u'musical instrument', u'concert', u'people', u'festival', u'light', u'energy', u'musician', u'singer', u'one', u'band', u'music', u'adult', u'performance', u'clothing', u'stage']

Going further than a single frame, let’s look at the aggregate values to find the top tags for the whole video. Note that, for each frame, I’m ignoring all tags whose probability is below a certain threshold (0.7).

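# Count, for each tag, the number of frames where its probability exceeds 0.7
_tags = {}
for tag in videos[video_url].values():
  for t in [t[0] for t in tag.items() if t[1] > 0.7]:
    _tags.setdefault(t, 0)
    _tags[t] += 1
# Most frequent tags first
print(sorted(_tags.items(), key=lambda x: x[1], reverse=True))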

And the top 10 tags for the video are as follows (the values being the number of times they appear), which definitely makes sense for a rock concert!

  • people, 224
  • music, 223
  • festival, 218
  • performance, 211
  • concert, 202
  • musician, 198
  • stage, 175
  • singer, 172
  • adult, 169
  • band, 144
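
As a side note, the same per-frame counting can be written more compactly with collections.Counter. The following is a small sketch rather than the code used for the numbers above, and the frames it runs on are made up for illustration.

```python
from collections import Counter


def top_tags(frames, threshold=0.7, n=10):
    """Count, per tag, the frames where its probability exceeds threshold."""
    counts = Counter()
    for frame in frames.values():
        counts.update(t for t, prob in frame.items() if prob > threshold)
    return counts.most_common(n)


# Made-up frames for illustration: {second: {tag: probability}}
frames = {
    0: {"concert": 0.99, "stage": 0.65},
    1: {"concert": 0.97, "stage": 0.82},
    2: {"concert": 0.60, "guitar": 0.91},
}
print(top_tags(frames))
```

Raising the threshold keeps only the tags the model is most confident about, at the cost of lower counts overall.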

Now for something different, here are the tags for Eminem’s Guilty Conscience:

  • people, 206
  • adult, 204
  • men, 177
  • one, 164
  • indoors, 161
  • music, 152
  • clothing, 150
  • portrait, 140
  • politics, 140
  • women, 125

And for Beyonce’s Countdown:

  • people, 205
  • women, 203
  • adult, 193
  • clothing, 175
  • portrait, 167
  • fashion, 146
  • one, 126
  • men, 119
  • model, 112
  • stylish, 104

That’s all nice and interesting, but besides the fun of parsing videos, this can be useful in many ways.

One could, for instance, recommend videos based on those tags and a user’s interests; identify when a video is a real performance versus a (bunch of) photo(s) on top of a song [1]; or filter out sensitive content. Another natural business case is advertising: with tags like “clothing”, “fashion” and “stylish” extracted from the previous Beyonce video, fashion advertisers could easily target viewers of this video.
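
As a rough illustration of the recommendation idea, here is a toy sketch that ranks the three videos above against a set of interests. The tag counts are copied from the lists above (top five of each, for brevity); the scoring rule (share of a video's tag counts that match the interests) and the user's interests are assumptions made for the example, not an established method.

```python
# Tag counts copied from the three lists above (top 5 of each, for brevity)
video_tags = {
    "AC/DC - TNT (live)": {"people": 224, "music": 223, "festival": 218,
                           "performance": 211, "concert": 202},
    "Eminem - Guilty Conscience": {"people": 206, "adult": 204, "men": 177,
                                   "one": 164, "indoors": 161},
    "Beyonce - Countdown": {"people": 205, "women": 203, "adult": 193,
                            "clothing": 175, "portrait": 167},
}


def rank_videos(interests, video_tags):
    """Rank videos by the share of their tag counts matching the interests."""
    scores = {}
    for title, tags in video_tags.items():
        total = float(sum(tags.values()))
        matched = sum(count for tag, count in tags.items() if tag in interests)
        scores[title] = matched / total
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical user interested in live music
live_music_fan = {"concert", "festival", "performance"}
for title, score in rank_videos(live_music_fan, video_tags):
    print("%s: %.2f" % (title, score))
```

A real system would likely weight tags by how distinctive they are, since generic tags such as “people” appear in every video here.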

The possibilities are endless, and using an image recognition API like this one is just the first building block to leverage video content on the Web!


[1] Yet, I’ve noticed that for a “static video” (single picture with a music track), the API returns slightly different tags for each frame.
