Using the wisdom of crowds to tell a story about music genres
Music genres are serious business: the source of debate, speculation, fights, and of course, mockery. What seems like a fairly clear-cut concept in a record store is less clear when debating with your friends whether Brian Eno makes electronica or ambient music, or what kind of hip-hop this Kid Cudi album is (if it’s hip-hop at all). Or even worse, how do genres link together: is hip-hop a descendant of R&B, or are they both sibling children of soul music? Is indie folk closer to 60s and 70s folk music, or to indie rock (or are all three just branches of the same limb stretching back to the blues)?
The goal of this post is to take advantage of some of the available social data (in this case, tags on last.fm) to form a sort of consensus on music genre classification. This isn’t meant to produce an authoritative ground truth on music classification (I doubt such a thing could exist), but rather to try to get at the most widely-held conception in a somewhat objective and perhaps novel way.
Note: I saved the technical details for the end; if you want to read them before seeing the results, skip to the Mining Details section below.
Pop Music Genre Tree
As my source of data, I took the most common genre-related tags on last.fm for songs from the Whitburn project. To work out the relationships between all these tags (and by extension the genres themselves), I used some phylogenetic software to produce a family tree of tags. The logic of using phylogenetic algorithms for this is explained in the Mining Details below. Here’s the tree, with colors and (terrible) labels added by me:
This tree serves two purposes: it works as a map from the varied and whimsical landscape of social tags onto a concise and recognizable group of genres, and it also reveals some surprising insights about how genres (are perceived to) actually relate to one another.
For instance, the R&B tags seem to cluster into two groups – a 70s and 80s R&B closely aligned with soul music, and a later R&B aligned with hip-hop. It’s also surprising that country music seems to cluster very closely to folk rock and southern rock, both genres I expected to see closer to the pure rock camp. Speaking of which, a few other genres I associate with rock (soft rock / ballads, alternative / punk / grunge, and pop rock) defied expectation by branching out on their own rather than falling under the rock umbrella.
Less surprising was the close association of electronica with other dance music including disco, and the very broad nature of the rock genre (which includes classic rock, hard rock, psychedelic rock, glam rock, progressive rock, etc.).
One caveat — I do expect the exact structure of this tree to be somewhat sensitive to things like which songs are included in the dataset. Still, even if slightly rearranged versions of the tree are valid themselves, that really doesn’t make this less valid, as it’s still a representation of genre relationships based on input from perhaps millions of last.fm users.
Pop Genres Through History
Having a sensible map from social tags to song genres also gave me the chance to look at pop history: the growth, the decline, and in some cases the resurgence of genres over time.
Taking a look at the number of songs associated with each derived genre over time reveals a few cool things. The first thing to notice is that the total number of tagged songs each year varies quite a bit, from a few in 1920 to a few hundred by the 1980s. Though the number of songs in the Whitburn project does vary a little from year to year, most of this variation is due to a lot of songs, especially old songs, just not being tagged or even present in last.fm. This means some (real) genres of music are completely absent; after all, users of last.fm are people who live in the 21st century and listen to digital music, which for better or worse does not include old gospel recordings of Homer Rodeheaver or ragtime covers by the US Marine Band (though I’m sure a few people will be saddened by the lack of tagged Broadway showtunes). I prefer to take this as a reminder that history (or maybe I should say culture) is in the eyes of the beholders. When we think of music of the 30s, we think of blues and jazz, and that’s what’s represented here.
Of the songs that are tagged, a few interesting patterns emerge. First, except for the explosion of rock and soul in the late 50s / early 60s (fairly quickly after the respective introductions of the two genres), most genres seem to grow at the expense of others. The growth in hip-hop and alternative music in the late 80s / early 90s coincides with the decline of rock (and to a lesser extent dance and soul music) in the same period. Second, just because a genre of music is down doesn’t mean it’ll stay down: country / americana might have looked like it was on its last legs by the late 80s, but by the 2000s it actually had a bigger market share than ever.
Normalizing the songs per year to produce a genre ratio plot makes a few things a bit more visible. One of these is that out of all these genres, the one with the best longevity seems to be soul music, though I do have to qualify that somewhat, as the tag “soul” is pretty ambiguous, so I might be picking up some songs that are just soulful without being soul.
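For the curious, the normalization step can be sketched in a few lines of Python. This is only a minimal illustration of turning per-year song counts into per-year genre shares; the function name and the toy data are made up for the example and are not taken from the actual analysis.

```python
from collections import Counter, defaultdict


def genre_shares(songs):
    """Return {year: {genre: fraction of that year's tagged songs}}."""
    counts = defaultdict(Counter)
    for year, genre in songs:
        counts[year][genre] += 1

    shares = {}
    for year, genre_counts in counts.items():
        total = sum(genre_counts.values())
        # Dividing by the yearly total removes the effect of some years
        # simply having more tagged songs than others.
        shares[year] = {g: n / total for g, n in genre_counts.items()}
    return shares


# Made-up toy data: (year, derived genre) pairs, not real Whitburn entries.
toy_songs = [
    (1965, "rock"), (1965, "rock"), (1965, "soul"), (1965, "country"),
    (1995, "hip-hop"), (1995, "rock"),
]
shares = genre_shares(toy_songs)
print(shares[1965])
```

Because every year sums to 1, a genre's share can only rise if another genre's share falls, which is worth keeping in mind when reading the ratio plot.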
Finally, I do have to point out that tagging each song as a member of a single genre only gives part of the story: a lot of songs are tagged as members of several genres. For the curious, out of this dataset the artists with the most genre-spanning power were Prince, Phil Collins, Peter Gabriel, and Michael Jackson. Taking a closer look at genre blending and fusion will most likely be the topic of a future post.
Mining Details (for the curious)
My basic strategy for this analysis was to link up two pieces of data. The first was pop music charts, and the second was the social tags associated with these songs on last.fm. The second piece was straightforward to obtain thanks to the well-maintained last.fm API, but the first required some curated and maintained dataset. My original plan was to use the publicly-available data in the Billboard Charts API to gather a list of popular songs over the last century. Sadly, as of right now the service is completely broken and useless. But where Billboard’s effort falls short, the Whitburn project managed to make up for it by releasing a meticulously gathered and annotated list of 37,000 chart-hitting songs since the 1890s.
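As a rough idea of what fetching tags looks like, here is a hedged Python sketch. It assumes the last.fm web service's track.getTopTags method at the ws.audioscrobbler.com endpoint, and it assumes the JSON response nests tags under "toptags" and "tag" with "name" and "count" fields; check both against the current last.fm API documentation before relying on them. The helper names, the sample payload, and the "YOUR_API_KEY" placeholder are hypothetical, and the sketch builds the request URL without actually sending it.

```python
from urllib.parse import urlencode

# Assumed endpoint for the last.fm web service.
API_ROOT = "http://ws.audioscrobbler.com/2.0/"


def top_tags_url(artist, track, api_key):
    """Build the request URL for one song's top tags (no network call)."""
    params = {
        "method": "track.getTopTags",
        "artist": artist,
        "track": track,
        "api_key": api_key,
        "format": "json",
    }
    return API_ROOT + "?" + urlencode(params)


def parse_top_tags(payload, min_count=1):
    """Pull {tag name: count} out of a decoded response (assumed layout)."""
    tags = payload.get("toptags", {}).get("tag", [])
    return {
        t["name"].lower(): int(t["count"])
        for t in tags
        if int(t["count"]) >= min_count
    }


url = top_tags_url("Brian Eno", "An Ending (Ascent)", "YOUR_API_KEY")

# Hand-written sample in the assumed response shape, not real last.fm output.
sample_payload = {"toptags": {"tag": [
    {"name": "Ambient", "count": 100},
    {"name": "electronica", "count": 40},
    {"name": "seen live", "count": 0},
]}}
tags = parse_top_tags(sample_payload)
print(url)
print(tags)
```

Lower-casing the tag names is a small design choice that merges trivially different spellings early; it does nothing for genuinely different tags that mean the same thing.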
Here are the most common tags for the Whitburn project songs represented as a word cloud (I highlighted genre-specific tags in red):
The first thing to notice (which will be nothing new to people who work with this kind of data professionally) is that the tags are, for lack of a better term, “messy”. For instance, there are about eight different tags for R&B, including the alternative spelling “rhythum and blues”. Several tags are ambiguous: does “soul” mean that the song is in the genre of soul music, or that the song is soulful? Since this is social data, we have to contend with people using a single tag for more than one meaning, and using different tags to mean the same thing.
Rather than simply letting it be an annoyance though, the idea here was to treat ambiguity itself as a source of information. Grabbing the 100-odd common tags that have to do with genre, I labelled each by which songs have that tag. I admit this sounds somewhat backwards; to use a metaphor, we can think of each genre as having a sort of genotype, a sequence that defines it. To get that sequence, I look through the set of songs and mark down 1 where that tag is mentioned and 0 where it is not (this means that the songs are basically being treated as alleles).
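In code, building these genotypes amounts to a presence/absence table with one row per tag and one column per song. The Python sketch below is only an illustration under that reading; the function name and the four toy songs are invented for the example.

```python
def build_genotypes(song_tags, genre_tags):
    """Return (ordered song list, {tag: list of 0/1 flags, one per song})."""
    songs = sorted(song_tags)  # fix a column order so every row lines up
    genotypes = {
        tag: [1 if tag in song_tags[song] else 0 for song in songs]
        for tag in genre_tags
    }
    return songs, genotypes


# Invented toy data: each song maps to the set of tags users gave it.
toy_song_tags = {
    "song_a": {"soul", "rnb"},
    "song_b": {"rock", "classic rock"},
    "song_c": {"rock"},
    "song_d": {"soul"},
}
songs, genotypes = build_genotypes(
    toy_song_tags, ["rock", "classic rock", "soul", "rnb"]
)
for tag, row in genotypes.items():
    print(f"{tag:>12}: {row}")
```

In this toy table the "classic rock" row is a subset of the "rock" row, which is exactly the parent and child pattern the tree-building step is meant to pick up.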
To help visualize, here’s a raster image of a section of this “genotype” map. For each genre tag (y-axis) there’s a mark if the song on the x-axis has been tagged with that genre.
The first thought that comes to mind looking at this kind of data is to use a standard clustering algorithm (e.g. hierarchical clustering or PCA followed by k-means) on it to find groups of related tags. The problem with that is coming up with a sensible distance metric — one that puts a large distance say between two rarely-used tags with few overlapping songs, but also puts a small distance between a common tag and a rare tag whose songs overlap with the common tag (i.e. its parent).
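To make the distance problem concrete, here is a small Python comparison on made-up song sets. The Jaccard distance is a common default for set data, and the overlap coefficient is one metric that happens to have both properties described above; neither is what I ended up using, and the tag names and song IDs are purely illustrative.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|: penalizes any difference in size."""
    union = a | b
    if not union:
        return 1.0
    return 1 - len(a & b) / len(union)


def overlap_distance(a, b):
    """1 - |A & B| / min(|A|, |B|): zero when one set contains the other."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 1.0
    return 1 - len(a & b) / smaller


# Made-up sets of song IDs carrying each tag.
rock = set(range(10))   # a common tag
classic_rock = {0, 1}   # a rare tag fully inside "rock"
ragtime = {20, 21}      # a rare tag sharing no songs with the others

print(jaccard_distance(rock, classic_rock))     # large, despite full containment
print(overlap_distance(rock, classic_rock))     # zero: child sits inside parent
print(overlap_distance(classic_rock, ragtime))  # maximal: no shared songs
```

The Jaccard distance treats the rare child tag as far from its parent simply because the parent is much bigger, which is the behavior a sensible metric here needs to avoid.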
This is actually where the genotype metaphor came in handy. I simply took it literally, and used an algorithm developed by evolutionary biologists that does exactly what I want: produces a tree of the relationship between the tags assuming that losing songs from parents to children is common, but gaining songs is very rare (for the even more technically-curious, I produced the maximum parsimony character tree for the genre tags by taking the consensus tree for 100 bootstrap rounds). Once I had the tree, using it to classify songs based on their tags was straightforward.
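To give a feel for what "parsimony" means here, the Python sketch below scores two candidate trees by the minimum number of 0/1 changes needed to explain the tag genotypes. It is not the phylogenetic software I used: it applies standard Fitch counting, which treats gains and losses symmetrically (unlike the loss-is-common, gain-is-rare assumption above), and it only scores given trees rather than searching for the best one or bootstrapping. The trees, tag rows, and function names are invented for the example.

```python
def fitch_changes(tree, states):
    """Return (possible states at this node, minimum changes below it)."""
    if isinstance(tree, str):  # a leaf is just a tag name
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_changes(left, states)
    right_set, right_cost = fitch_changes(right, states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # The children disagree, so at least one change is needed on this split.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, genotypes):
    """Sum the minimum number of changes over every song (character)."""
    n_songs = len(next(iter(genotypes.values())))
    total = 0
    for i in range(n_songs):
        states = {tag: row[i] for tag, row in genotypes.items()}
        total += fitch_changes(tree, states)[1]
    return total


# Invented toy genotypes: one row per tag, one 0/1 column per song.
toy_genotypes = {
    "rock":         [1, 1, 1, 0, 0],
    "classic rock": [1, 1, 0, 0, 0],
    "soul":         [0, 0, 0, 1, 1],
    "rnb":          [0, 0, 0, 1, 0],
}
sensible_tree = (("rock", "classic rock"), ("soul", "rnb"))
scrambled_tree = (("rock", "soul"), ("classic rock", "rnb"))

print(parsimony_score(sensible_tree, toy_genotypes))
print(parsimony_score(scrambled_tree, toy_genotypes))
```

The tree that pairs each rare tag with its overlapping common tag needs fewer changes, so a parsimony search would prefer it; the real analysis repeats that search over bootstrap resamples of the songs and takes the consensus.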