Product Hunt is the new rising star in the start-up community. Think of it as a mix of Beta List and Hacker News, but with products that are already live, and a wider community, including engineers of course, but also product people, investors, media and more. A few days ago, and by popular demand, they launched their official API.
More Products: A Chrome plug-in for Product Hunt recommendations
With more than 6,000 products already in the Product Hunt database, I’ve decided to use the API to build a product recommendation engine. It seems that every time it comes to hacking and APIs, I can’t get away from discovery, or music. Or both.
The result is a Chrome extension simply named “More Products!”. It directly integrates top-10 related products for each product page, as you can see below. I might iterate on the algorithm itself, but want to keep this plug-in very focused so it’s unlikely that it will integrate other features. Note that it doesn’t track anything, so your privacy is preserved.
Under the hood
The engine relies on the API to get the list of all products and related posts, and then uses TF-IDF and cosine similarity to find similarities between them, using NLTK and scikit-learn – respectively the standard Python tools for Natural Language Processing and Machine Learning. To put it simply, it builds a giant database of words used in all posts, mapped to products with their frequency, and then finds how close products are, based on those frequencies.
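To make this concrete, here’s a minimal, dependency-free sketch of that TF-IDF + cosine-similarity step – the actual engine uses NLTK and scikit-learn, and the product names and blurbs below are made up:

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Map each document to a sparse word -> TF-IDF weight vector."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each word appears
    df = Counter(word for tokens in tokenized for word in set(tokens))
    return [{w: (c / len(tokens)) * math.log(n / df[w])
             for w, c in Counter(tokens).items()}
            for tokens in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse word -> weight vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

products = {
    "TuneBot": "music discovery app with smart playlists",
    "BeatFinder": "discover new music and playlists daily",
    "InvoiceKit": "invoicing tool for freelancers and agencies",
}
vecs = dict(zip(products, tf_idf_vectors(list(products.values()))))
sims = sorted(((cosine(vecs["TuneBot"], vecs[p]), p)
               for p in products if p != "TuneBot"), reverse=True)
# The two music products end up closest to each other; the invoicing
# one shares no weighted words with TuneBot, so its similarity is 0
```

The real engine runs the same idea over the posts of all 6,000+ products, with proper tokenization and stop-word handling from NLTK.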
New products are fetched every 2 hours, and recommendations are updated at the same time. Flask handles requests between the extension and the recommendations database, and Redis is used as a cache layer.
Here’s the second post of my data analysis series on the Rolling Stone top 500 greatest songs of all time. While the first one focused on lyrics, this one is all about the acoustic properties of the data-set – especially their volume and tempo.
To do so, I used the EchoNest, which delivers a good understanding of each track at the section level (e.g. verse, chorus, etc.), but also at a deeper “segment” level, providing loudness details about very short intervals (down to less than a second). This is not perfect, due to some issues discussed below, but gives a few interesting insights.
[Update 22 July] As noted in the comments, there were a few unexpected results. I’ve run the pipeline again and done some more cleaning on the API results, as explained here.
Black leather, knee-hole pants, can’t play no highschool dance
As my goal was to identify relevant tracks from the dataset, in addition to absolute values for the loudness and tempo of each track, I also looked at their standard deviation. If you’re not familiar with it, this helps identify which songs / artists tend to stay close to their average tempo / loudness, versus the ones that are more dynamic.
Before going through individual songs from the top-500, let’s take an example with the top-10 Spotify tracks of a few artists to check their loudness:
And the tempo:
You can see that some bands really deserve their reputation. For instance, while Pink Floyd have a high standard deviation both in volume and tempo (not surprising), Motörhead is not only the loudest (on average) of the list, but also the one with the smallest standard deviation, meaning most of their tracks evolve around that average loudness. In other words, they play everything loud. While the Ramones are just fast, everything fast. And when they’re together on stage, the result is not surprising.
But you don’t really care for music, do you?
Coming back to the top 500, I ran the Echonest analysis on 474 tracks of the list. The 26 missing are due to various errors at different stages of the full pipeline.
On the one hand, I’ve used raw results from the song API to get the average values. [Update 22 July] I had to consolidate the data by aggregating multiple API results together. For a single song, multiple tracks are returned by the API (as expected), but there can be large inconsistencies between them. For instance, if you search for American Idiot, one track (ID=SOHDHEA1391229C0EF) is identified as having a tempo of 93, the other one (SOCVQDB129F08211FC) of 186. Some can also have slighter variations (in volume for instance, between a live and the original version). To simplify things – and I agree it introduces a bias in the results – I averaged the first 3 results from the API.
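A toy version of that consolidation step, with hypothetical values mirroring the American Idiot case (one half-time tempo estimate next to two full-tempo ones):

```python
from statistics import mean

def consolidate(results, fields=("tempo", "loudness"), k=3):
    """Average the first k API results for each audio field."""
    return {f: mean(r[f] for r in results[:k]) for f in fields}

# Hypothetical API results for a single song: a half-time tempo
# estimate (93) next to the full-tempo ones (186), plus slight
# loudness variations between versions
results = [
    {"tempo": 93, "loudness": -4.0},
    {"tempo": 186, "loudness": -4.6},
    {"tempo": 186, "loudness": -4.3},
]
song = consolidate(results)
# song["tempo"] is 155.0 – biased, as said above, but applied
# consistently across the whole dataset
```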
On the other hand, I relied on numpy to compute the standard deviation from the first API result, first removing the fade-in and fade-out of each track. Here, I’ve also skipped every segment or section where the API confidence was too low (< 0.4).
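Here’s what that filtering looks like in a simplified form – the field names mimic EchoNest segments, but the values are made up, and the real pipeline uses numpy rather than the statistics module:

```python
from statistics import pstdev

def loudness_deviation(segments, duration, fade=2.0, min_conf=0.4):
    """Standard deviation of segment loudness, skipping the first and
    last `fade` seconds and any segment below the confidence threshold."""
    kept = [s["loudness"] for s in segments
            if s["confidence"] >= min_conf
            and fade <= s["start"] <= duration - fade]
    return pstdev(kept)

segments = [
    {"start": 0.5, "loudness": -30.0, "confidence": 0.9},  # fade-in: skipped
    {"start": 3.0, "loudness": -6.0, "confidence": 0.9},
    {"start": 5.0, "loudness": -8.0, "confidence": 0.9},
    {"start": 7.0, "loudness": -20.0, "confidence": 0.2},  # low confidence: skipped
    {"start": 9.5, "loudness": -30.0, "confidence": 0.9},  # fade-out: skipped
]
loudness_deviation(segments, duration=10.0)  # 1.0
```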
I’m waiting for that final moment you say the words that I can’t say
Last but not least, I’ve normalized and combined both the tempo deviation and the loudness one to assign a [0:1] score to each track, in order to find the most and least dynamic tracks overall. Here’s the top-5 of the most dynamic ones:
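The scoring step is just a min-max normalization of both deviations, averaged – sketched below with made-up deviation values:

```python
def minmax(values):
    """Rescale a list of values to the [0:1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical (tempo std-dev, loudness std-dev) per track
deviations = {
    "My Generation": (18.0, 6.0),
    "Imagine": (2.0, 1.5),
    "Paranoid Android": (14.0, 5.0),
}
names = list(deviations)
tempo_scores = minmax([deviations[t][0] for t in names])
loudness_scores = minmax([deviations[t][1] for t in names])
# Combined [0:1] dynamism score: average of both normalized deviations
scores = {t: (tempo_scores[i] + loudness_scores[i]) / 2
          for i, t in enumerate(names)}
most_dynamic = max(scores, key=scores.get)  # 'My Generation'
```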
If you listen to My Generation, you can clearly hear the dynamics, both in tempo and volume, with the different bursts of the song. The Radiohead one builds more over the long run, with clearly distinct phases, as shown below for the volume part.
Finally, here are the least dynamic ones. Several ones on that list made it through the charts, showing that even though a song can be pretty flat in both volume and tempo, it can still be a hit – or at least an earworm:
While toying with the public BigQuery datasets, impatiently waiting for Google Cloud Dataflow to be released, I’ve noticed the Wikipedia Revision History one, which contains a list of 314M Wikipedia edits, up to 2010. In the spirit of Amazon’s “people who bought this”, I’ve decided to run a small experiment about music recommendations based on Wikipedia edits. The results are not perfect, but provide some insights that could be used to bootstrap a recommendation platform.
Here, my assumption to build a recommendation system is that Wikipedia contributors edit similar pages, because they have an expertise and interest in a particular domain, and tend to focus on those. This obviously becomes more relevant at the macro-level, taking a large number of edits into account.
In the music-discovery context, this means that if 200 of the Wikipedia editors contributing to the Weezer page also edited the Rivers Cuomo one, well, there might be something in common between both.
This dataset contains a version of that data from April, 2010. This dataset does not contain the full text of the revisions, but rather just the meta information about the revisions, including things like language, timestamp, article and the like.
Number of rows: 314M
Not too bad – it contains a large set of (page, title, user) tuples, exactly what we need for the experiment.
Querying for similarity
Instead of building a complete user/edits matrix to compute the cosine distance between pages, or using a more advanced algorithm like Slope One (with the number of edits as an equivalent for ratings), I’m simply finding common edits, as explained in the original Amazon paper. And, to make this a bit more fun, I’ve decided to do it with a single query over the 314M rows, testing BigQuery capabilities at the same time.
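The common-edits idea, in plain Python over a toy set of (page, contributor) pairs (not the real data):

```python
from collections import Counter

# Toy (page, contributor) pairs, standing in for the 314M-row dataset
edits = [
    ("The Clash", "u1"), ("The Clash", "u2"), ("The Clash", "u3"),
    ("Ramones", "u1"), ("Ramones", "u2"),
    ("Sex Pistols", "u2"),
    ("Knitting", "u9"),
]

def related(seed, edits):
    """Pages sharing editors with `seed`, ranked by number of common editors."""
    editors = {user for page, user in edits if page == seed}
    counts = Counter(page for page, user in edits
                     if user in editors and page != seed)
    return counts.most_common()

related("The Clash", edits)  # [('Ramones', 2), ('Sex Pistols', 1)]
```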
The following query is used to find all pages sharing common editors with the current ones, ranked by the number of common edits. Tested with multiple inputs, it took an average of 5 seconds to answer it over the full dataset. You can run those by yourself by going to your Google BigQuery console and selecting the Wikipedia dataset.
SELECT title, id, COUNT(id) AS edits
FROM [publicdata:samples.wikipedia]
WHERE contributor_id IN (
  SELECT contributor_id
  FROM [publicdata:samples.wikipedia]
  WHERE id = 30423
  AND contributor_id IS NOT NULL
  AND is_bot IS NULL
  AND is_minor IS NULL
  AND wp_namespace = 0
  GROUP BY contributor_id
)
AND is_minor IS NULL
AND wp_namespace = 0
GROUP EACH BY title, id
ORDER BY edits DESC
This is actually a simple query, finding all pages (wp_namespace=0 restricts to content pages, excluding user pages, talks, etc.) edited (excluding minor edits) by users who also edited (excluding bots and minor contributions) the page with ID 30423, ranking them by number of edits. You can read it as “Find all pages edited by people who also edited the page about the Clash, ranked by most edited first”.
And here are some of the results:
As you can see, from a music-discovery perspective, that’s a mix between relevant ones (Ramones, Sex Pistols), and WTF-ones (The Beatles, U2). There’s also a need to exclude non-music pages, but that could be done programmatically with some more information in the dataset.
Towards long tail discovery
As expected, and as seen before, results are not that good for mainstream / popular artists. Indeed, edits to the Beatles page are unlikely, on average, to say much about the musical preferences of their editors. Yet, this becomes more relevant for long-tail artist discovery: if you bother editing indie bands’ pages, it’s most likely because you care about them.
Trying with Mr Bungle, the query returns Meshuggah and The Mars Volta as the first two music-related entries, all of them playing some kind of experimental metal – but then digresses again with the Pixies. Looking at band members / solo artists and using Frank Black as a seed leads to The Smashing Pumpkins, Pearl Jam, R.E.M. and obviously the Pixies as the first four recommendations. Not perfect in either case, but not too bad for an algorithm that is completely music-agnostic!
Scaling the approach
There are many ways this could be improved, for instance:
Removing too-active contributors – who may edit pages to ensure Wikipedia guidelines are followed, rather than for topic-based interest, and consequently introduce some bias;
Filtering the results using some ML approaches or graph-based heuristics – e.g. exclude results if their genres are more than X nodes away in a genre taxonomy.
Using time-decay – someone editing Nirvana pages in 1992 might be interested in completely different genres now, so joint edits might not be relevant if done with an x-days interval or more.
Yet, besides its scientific interest, and showing that BigQuery is very cool to use, this approach also showcases – if needed – that even though algorithms may rule the world of music discovery, they might not be able to do much without user-generated content.
The Long Tail. That’s nothing new, neither on the Web nor in the music field. I remember when I first read Chris Anderson’s article, and since then, many have talked or written about it, including Paul Lamere or Oscar Celma in the music-tech sphere.
Yet, one must admit that, with millions of tracks available online, it’s always a challenge to find something new, digging in that so-called long-tail of less popular artists or songs.
So, between World Cup games, I’ve built a web component – and a companion web app – to enjoy the least popular tracks of any artist.
The source is on github (MIT license), and you can see how easy it is to create. It simply calls the Spotify API to find an artist’s albums, then their tracks (limiting to 50 results each time – hence parsing a maximum of 2,500 tracks per artist), finally sorting them by inverse popularity. It also excludes the ones with popularity=0, as it seems these are not always the least popular ones. Maybe some region-dependent issue?
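The sorting logic itself boils down to a few lines – sketched here on made-up track objects shaped like the (name, popularity) fields returned by the Spotify API:

```python
# Toy tracks, mimicking the fields returned by the Spotify API
tracks = [
    {"name": "Hit Single", "popularity": 78},
    {"name": "Deep Cut", "popularity": 11},
    {"name": "B-Side", "popularity": 3},
    {"name": "Region-Locked", "popularity": 0},  # excluded, as explained above
]

# Inverse-popularity sort, dropping the dubious popularity=0 entries,
# keeping at most 50 tracks
long_tail = sorted(
    (t for t in tracks if t["popularity"] > 0),
    key=lambda t: t["popularity"],
)[:50]
# long_tail[0] is the least popular remaining track, 'B-Side'
```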
I suppose, as with many recent JS toolkits such as AngularJS, that the learning curve will be steeper when building advanced components (probably due to the early-stage documentation), but at first glance, it looks very intuitive, and there are already many elements to reuse.
Try it with your favorite artists
As the component is mostly for coders, I’ve put together a companion Web app – shamelessly reusing Paul’s design from his recent Spotify hacks. For each artist, it uses the previous component and displays their 50 least popular tracks, according to Spotify.
Try it at The Long Tail, and have fun exploring the hidden gems of your favorite artist!
Back in April, I was lucky enough to get a partner invite for Google I/O. Coupled with a stay at the Startup House, a co-working / housing space (ideal when you’re jet-lagged at 4AM and want a proper desk to code a few meters away from your bed) located only one block away from Moscone, I’m very glad I made the trip to my first I/O!
Here are a few highlights, in a conference which clearly confirmed the role of (1) Android as a global OS, and (2) the Knowledge Graph as a hub for everything AI-related, at Google and beyond.
While I’m not (yet) a full-time Android user (let alone a developer), it’s now clear that Android goes far beyond a phone-only OS. With the introduction of Android Wear, Android Auto, and Android TV during the keynote, the OS is now the core of all hardware-related initiatives at Google.
With common SDKs and APIs to interact with wherever the OS is used, this makes developers’ lives much easier when building cross-device products. Relying on a single ecosystem is also important when building an engineering team, and I guess it may also be a decision factor for small start-ups when deciding which market to tackle.
Google’s Knowledge Graph – From search to voice controls and app indexing
So far, Google’s Knowledge Graph has been used mostly in search-related projects, including the snippets you can see when searching for entities such as places, people, music and movies on Google. Several sessions showed how it is now used as a central hub for AI-related projects and products.
Using Android TV, you can ask your TV (literally, by talking to your Android watch) to suggest an Oscar-awarded movie from 2000, or who’s starring in X or Y – all answers coming from the Knowledge Graph. In the first case, results can be bought from Google Play, another nice piece of integration between the company’s different offerings.
Another interesting case is the use of the Knowledge Graph to connect the dots between previously isolated silos, namely mobile apps. One of the common issues with those apps is their lack of links and outside-world connections, in spite of recent efforts such as the Facebook-supported App Links. In the session “The Future of Apps and Search“, a combination of app indexing, JSON-LD and the Knowledge Graph was presented to directly link into an app from, e.g., Google’s search results or autocompletion-search in Android, as well as launching actions from search results – e.g. playing a track in Spotify, a use-case announced a few days before I/O – using the new schema.org actions I’ve recently blogged about.
As an early JSON-LD enthusiast, having worked on related technologies for almost a decade, you can’t imagine how excited I was to see this in something used by millions of users! Let’s bet this is only the beginning, and that new verticals will follow.
Google Cloud and DataFlow – Smarter, faster, easier
Cloud Debugger – making DevOps and back-end engineers more efficient when debugging code. You can now add breakpoints, including conditional ones (e.g. user=X), in your live app, without jeopardising its speed and, most importantly, without having to stop/restart/deploy anything. This means that code can be debugged on production servers with live data, and without patching / tracing multiple boxes, all in the comfort of your browser. A kind of New Relic on steroids, so big thumbs-up here!
Dataflow – aiming to replace MapReduce, with a special focus on stream processing and scalability. A convincing use-case during the keynote was Twitter sentiment analysis, showing not only the simplicity of the interface, but also the orchestration of the services through the API. The service is not open yet, but you can check “Big data, the Cloud Way: Accelerated and simplified” to learn more. I’m looking forward to trying it on a few stream-processing tasks for content discovery!
The Web platform – Polymer, WebRTC and HTML5
Whether you’re accessing it from your desktop, phone, or now your watch or Glass, there’s only one Web. And far from just websites, it can be used as a platform to build powerful apps, as many sessions focused on:
WebRTC – building real-time systems in your browser. “Making music mobile with the Web” not only showed how to transform your Macbook into a Marshall JCM2000 with Soundtrap, but also how WebRTC was used for real-time collaborative music creation, with very low latency.
Wearables – It’s all about the UX
Then, a big part of the conference: Glass and smart watches. I often thought that most of the effort to build those was put in the hardware and OS side of things (reducing footprint, optimising battery life, gathering sensor data, etc.).
While some talks clearly focused on this (with some nice hacks such as back-camera for biking in “Innovate with the Glass Platform“, and football-related ones), I was impressed by “Designing for wearables“, which focused on the role of UX to make sure wearables are devices that let you connect, and not interfere with the world as a phone does.
Showing some early prototypes and discussing how and why Glass / wear notifications are so minimalistic, this was an inspiring session for anyone interested in UX and products. A must-watch for developers and entrepreneurs aiming to build appealing user-facing products, whether it’s for wearables or more standard devices.
Google+ – Or how Google missed the spot
I may have missed it from other sessions, but none of the ones I’ve been to mentioned Google+. I was not expecting much about it at I/O since the departure of Vic Gundotra, and Sergey Brin’s statements, as well as a plus-free agenda. Still, that was a big surprise, as it would have been a no-brainer use-case in many talks.
Using Dataflow to process streams from your social circles? Not a word about it. Using Glass to see what your friends are posting? Nope. Alerts on your Google TV to binge-watch a TV show together with friends at home 5,000 km away? Neither.
G+ could have been an awesome social network – or should I say a social platform. Combined with Freebase / the Knowledge Graph, linking people to the things they like, the possibilities would have been endless in terms of profiling, discovery and more. Yet, with a poor API and a lack of portability that could have differentiated it from its main competitors from Day 1 (imagine PubSubHubbub / WebSockets as an easy way to integrate G+ into other platforms), I’m sad they’ve missed the spot.
Up to 2015?
Overall, a great conference, in spite of the queue mismatch that forced me to miss about 30 minutes of the keynote, queueing twice around Moscone – a real shame when you travel 8,000 km for such an event.
The YouTube V3 API is one of those things you’ll definitely fall in love with if you’re into real-world Semantic Web applications, a.k.a. “Things, not words”. With its integration with Freebase – the core of Google’s Knowledge Graph – it’s a concrete and practical showcase of the Web as a distributed database of things and relations, and not only keywords and links between pages.
YouTube Data API v3 with Freebase mappings: the good, the bad, and the ugly
While relatively simple to use, it provides advanced features to let developers build data-driven applications. On the one hand, it lets you search for videos by Freebase entities, as you can try in a recent demo from YouTube themselves. On the other hand, it returns which entities are used/described in a video.
Yet, identifying topics from videos is a difficult task, and if you’re not convinced (and interested in all things Machine Learning related), check the following Google I/O talk from last year.
While the API generally delivers correct information, it sometimes requires a bit of work to automatically use its results in a music-related context (to be exact, the issues might be in the underlying data, rather than in the API itself):
In some cases, it provides multiple artists – which is often correct, e.g. Blondie and Debbie Harry – but makes it difficult to find who’s the main one, as the API delivers them at the same level (topicIds).
This is something we’ve improved to build our former seevl for YouTube plug-in, and while it’s not available anymore – as we’ve moved away from consumer-facing products to refocus on a B2B, turn-key music discovery solution – I’ve decided to open-source the underlying library to find who’s playing and what (yes, music only) in any YouTube video.
Introducing youplay – who’s and what’s playing in a YouTube music video
The result is youplay, available on PyPI and github, an MIT-licensed Python library that works as an enhancement on top of the YouTube Data API v3 to automatically identify who’s and what’s playing in a music video. It uses different heuristics, data look-ups, and more to find the correct artists if multiple ones are returned (unless they’re all playing in the video, like this RHCP + Snoop Dogg version of Scar Tissue), to filter ambiguous ones, or to find the correct artist and track if the API doesn’t deliver anything.
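As an illustration, one of those heuristics could look like the following – a deliberately simplified, hypothetical version, not youplay’s actual code:

```python
def main_artists(candidates, video_title):
    """Keep the candidate artists whose name appears in the video title;
    fall back to all candidates when none matches (e.g. collaborations)."""
    title = video_title.lower()
    matched = [a for a in candidates if a.lower() in title]
    return matched or candidates

main_artists(["Blondie", "Debbie Harry"], "Blondie - Heart Of Glass")
# ['Blondie']
main_artists(["Jay-Z", "Alicia Keys"], "Empire State of Mind")
# ['Jay-Z', 'Alicia Keys'] – no match, both kept as a collaboration
```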
Here’s an example
import youplay

(artists, tracks) = youplay.extract('0UjsXo9l6I8')
print '%s - %s' % (', '.join([artist.name for artist in artists]), tracks.name)

(artists, tracks) = youplay.extract('c-_vFlDBB8A')
print '%s - %s' % (', '.join([artist.name for artist in artists]), tracks.name)
(env)marvin-7:youplay alex$ python sample.py
Jay-Z, Alicia Keys - Empire State of Mind
Dropkick Murphys - Worker's Song
The tool is also packaged with a command line script returning JSON data for easy integration into non-python apps.
The fun part? All the look-ups (if any) use the Freebase and YouTube APIs themselves, such as:
Finding the top tracks of an artist from Freebase and matching them against the video name when the original API call returns only artist names;
Identifying if a song has been recorded by multiple artists;
Looking-up related YouTube videos to identify what’s the common topic between all of them, and guess the current artist of a video with no API-results.
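That last look-up could be sketched like this (a hypothetical simplification – the real library queries the YouTube API for related videos and their topicIds):

```python
from collections import Counter

def guess_artist(related_topics):
    """Guess the artist of a video with no topics of its own: pick the
    topic shared by the largest number of related videos."""
    counts = Counter(topic for topics in related_topics
                     for topic in set(topics))
    return counts.most_common(1)[0][0]

# Hypothetical topic lists from three related videos
related = [
    ["Nirvana", "Grunge"],
    ["Nirvana", "Kurt Cobain"],
    ["Nirvana", "Grunge", "Seattle"],
]
guess_artist(related)  # 'Nirvana'
```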
Isn’t it a nice way to bridge the gap?
Even though I hope the library will be useful to other music-tech developers, I also wish that it soon becomes obsolete, as Google’s Knowledge Graph, and other structured-data efforts on the Web, keep growing in terms of AI, infrastructure and APIs/toolkits – making it easier every day to build data-driven applications (if only I had had this 10 years ago when I started digging into the topic!).
Oh, and I’m attending Google I/O next week, and if you’re working on similar projects, ping me and let’s have a chat!