NYC scaled to commute time

What would New York City look like if the distance between places was actually the time it takes to get between them in rush hour?

One of the ideas that comes up semi-regularly in neuroscience is multi-dimensional scaling. In a nutshell, MDS is a set of algorithms whose intent is to take a set of distances, and produce a (usually) 2-D map that matches them. For instance, you might want to make a map of “taste”, so you ask people which is similar to which: chocolate vs steak, milk vs soy, ramen vs eggs and so on. You take these similarity measures, and produce a map of the land of taste.
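To make the idea concrete, here is a minimal sketch of classical (Torgerson) MDS in plain Python. It is only one member of the MDS family and is not necessarily the variant used for the maps in this post. The function name `classical_mds` and the power-iteration eigen-solver are assumptions made for illustration, and the input is assumed to be a symmetric matrix of distances (real commute times would first need symmetrizing, for example by averaging both directions).

```python
import math

def classical_mds(dist, dims=2, iters=500):
    """Embed n points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    # Squared distances, then double centering to get a Gram (inner product) matrix.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration: repeatedly multiply by b and renormalize.
        v = [float(i + 1) for i in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        if lam <= 0:
            break  # no further positive eigenvalue; remaining axes stay at 0
        for i in range(n):
            coords[i][d] = math.sqrt(lam) * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords

# Made-up example: three places whose pairwise "distances" form a 3-4-5 triangle.
example = [[0, 3, 4],
           [3, 0, 5],
           [4, 5, 0]]
points = classical_mds(example)
```

For distances that can be drawn exactly on a flat map, as in this toy triangle, the recovered points reproduce the input distances up to rotation and reflection. Travel times generally cannot be matched exactly, so the map is a best-effort compromise.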

This is a pretty neat idea that I think bears some more exploitation. And of course, being someone who spends a lot of time on the NY subway system, I couldn’t help but think of how you could use it to see the real geography of New York (or at least the geography that chronic MTA riders slowly build in their minds). That is, the map of New York where the Upper East Side isn’t that close to the Upper West Side. The map where Times Square, Union Square, and Grand Central are all close neighbors. The map where getting between Brooklyn and Queens may as well be a trans-Atlantic journey. You know, the real map.

This ended up being a bit more work than I was expecting, so I’ll split this post up. Today: the map. Soon: how I made the thing.

To begin with, here is a standard, to-scale map of the four Boroughs (Staten Island will be considered part of New Jersey for this exercise for reasons that will be explained later, no offense to either).


Here we see (using an overlay of a screenshot from Google Maps) the subway lines by their color, and areas in their proper geographical location. Now I’m going to measure the time it takes to get from one point in the city to another via the MTA (a whole bunch of times), and make a map where the distances match up with this time. For the geography nerds, this is called a Distance Cartogram. We can see it gives us a very different view:


(click on the maps to see full size).

The first thing I notice is how squished together lower Manhattan becomes, almost like a deflated balloon. As far as anything in Manhattan below 14th St. together with Borough Hall goes, you’re more or less living in the transit center. Of course, this is also where the greatest density of trains lives. Upper Manhattan is significantly wider, and is pulled toward Queens presumably with the combined power of the E, F, M, N, R, Q, and 7.

Brooklyn seems to pinch off around Borough Hall, making kind of a corner on the left. Brooklyn itself is also much larger in comparison to the squished Manhattan, and South Brooklyn is now stretched quite far south, with Coney Island stretching off the map. Williamsburg and Greenpoint seem a little more connected by comparison.

Queens bends a little bit toward Brooklyn near Long Island City on the left (probably thanks to the G train) and near Forest Hills / Jamaica on the right (probably due to the J and Z trains meeting the E). Jackson Heights, however, bends away from Brooklyn, and seems comparatively more isolated than the other parts of Queens.

Finally, the winner for most isolated place (no surprises here) is the Rockaways. In fact, the software I used to calculate the trip times (Open Trip Planner) failed to find paths to several places around there. Seriously far.

For the more curious, here is a version of the rescaled map with the Google overlay removed. The individual neighborhoods are labelled here, so you can see exactly where your ‘hood fits in.


Next time I’ll talk about how I went about making this thing (for anyone that wants to do something similar / better maybe in their own cities), and a few possible applications of this outside of picking which neighborhood to move to. Maybe a bonus version of this done for Manhattan alone if I have time.

Update: NYC scaled to commute time, part 2


A glimpse of gene alteration frequencies in cancer

Making use of data on gene alterations (deletions and amplifications) mapped to the chromosomes of patients with brain tumors (Glioblastoma), this visualization is meant to give a glimpse into the complexity of the gene changes that occur as the animation cycles through each patient, and hopefully also to leave an impression of the huge efforts being made to understand this devastating disease for improved treatment.

It is rare that the broad public gets a look into the actual data that are being produced by the biomedical community. This is one attempt to show some of the thousands of gene alterations in one particular brain cancer, Glioblastoma – one of several cancers that are under intense investigation through large cross-institutional efforts, such as The Cancer Genome Atlas (TCGA).  Using publicly available data of thousands of gene alterations in more than 200 Glioblastoma samples (cBio Cancer Genomics Portal), each alteration is color coded (red=amplification, blue=deletion) and mapped to the more than 20,000 annotated genes in the human genome.

As the animation cycles through each patient, it can be seen that the majority of alterations are unique to each patient, since blue and red dots appear and reappear in different places throughout the genome. This shows the complexity of this disease, which is not simply defined by a single alteration, as many inherited diseases are. It is difficult to determine if such seemingly random events are causally implicated in the disease or are secondary effects due to other factors, an area of ongoing research for scientists around the world.

We also see that a few hotspots start to dominate as more patients are covered: alterations that recur grow in size, thereby highlighting some of the few high-frequency alterations that characterize this disease. For example, the EGFR gene on chromosome 7 is amplified in around 40% of patients.
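The recurrence behind those growing hotspots is just a tally across patients. A minimal sketch, assuming each patient is represented as a mapping from gene name to alteration type (the data layout, the function name `alteration_frequencies`, and the toy patients below are all made up for illustration; only EGFR and the roughly 40% figure come from the text):

```python
from collections import Counter

def alteration_frequencies(patients):
    """Fraction of patients carrying each (gene, alteration) pair.

    `patients` maps a patient id to a dict of gene -> "amp" or "del".
    """
    counts = Counter()
    for alterations in patients.values():
        for gene, kind in alterations.items():
            counts[(gene, kind)] += 1
    n = len(patients)
    return {key: count / n for key, count in counts.items()}

# Hypothetical toy cohort of five patients (not real data).
toy_patients = {
    "p1": {"EGFR": "amp", "GENE_A": "del"},
    "p2": {"GENE_B": "amp"},
    "p3": {"EGFR": "amp"},
    "p4": {"GENE_C": "del"},
    "p5": {"GENE_A": "del", "GENE_D": "amp"},
}
freqs = alteration_frequencies(toy_patients)
```

In a visualization like this one, a dot's size could then be scaled by its frequency, so that an alteration seen in two of five patients is drawn larger than one seen only once.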

Glioblastoma is a devastating disease both for the patients and their families, and it often affects young individuals. Hopefully, recent technological developments combined with large-scale efforts such as the TCGA project will shed more light on the mechanisms driving this disease and ultimately improve treatment for patients.

Disclaimer: Although genes and alterations are accurately mapped to their locations on the chromosomes, this visualization is not meant as an objective scientific report, but is rather a freely interpreted representation of the data. For a scientific analysis of the genomic characterization of Glioblastoma, please visit here: link.

Analysis details: DNA copy number variation (CNV) data was obtained from the cancer database cBio Cancer Genomics Portal, hosted by Memorial Sloan-Kettering Cancer Center (link). The data was preprocessed with the RAE algorithm, providing gene-based CNV scores and chromosomal coordinates. For simplicity, only high-level amplifications and homozygous deletions were used.


I still hate the Pats, but…

Of course it would be nice if complex problems boiled down to a single number; recall the days of high school math.  Sadly, the real world is never that simple. That does not, however, stop us amateur sports enthusiasts and even some actual professional sports analysts from using simple decontextualized tally metrics as evidence for assigning superlatives to players or teams.

At a party for Super Bowl 46 featuring the New England Patriots and the New York Giants, someone argued to me that the Patriots defense ranked 31st (out of 32 teams) in total yards allowed (TYA) and thus were a horrible defense and would single-handedly lose the Super Bowl to the reawakened Giants offense. It’s true that the New England Patriots’ defense did indeed essentially tie for worst with the Green Bay Packers for TYA with 6577 and 6585 yards, respectively, in the 2011 regular season. I’m definitely not a Patriots fan, but I felt that this accusation based entirely on TYA was a little unfair. There I was, caught in a very rare moment in defense of Boston sports. I waved my hands and posited that in some cases TYA is biased against high scoring offenses since the defenses are playing conservatively to preserve the lead and eat up time. This isn’t a new thing, mind you, I just haven’t ever actually seen a plot of this to test this hypothesis.

Consequently, I set out to see how many yards each NFL defense allowed as a function of the relative score.  My tentative hypothesis was that some of the high scoring teams would have defenses with high TYA where a large chunk of those yards were forfeited when the team was leading by at least two scores.

Disclaimer: I am not a sports analyst or in any way a football strategist. I do not claim that this graphic is the most informative or accurate way of ranking defenses. My priorities for this project were to (a) practice parsing sports data with Python (this is my first Python project of any kind), (b) make a good friend eat some crow, and (c) further stigmatize the use of TYA as the sole piece of evidence for ranking defenses. TYA for sure has some value, but it should bow down to other metrics that incorporate scoring, probability of allowing points, take-aways and game context.

Errors: I did not credit defenses for yards after take-aways either by fumble or interception. This failure is mostly due to my fear of parsing text from human-constructed sentences.

Data


These distributions are Yards Allowed by Lead (YAbL). The x-axis consists of 3.5 point interval bins centered at 0 and extending from -28 (a deficit of more than 28 points) to +28 (a lead of more than 28 points). On the y-axis is the cumulative number of yards allowed by each defense in those relative score bins.
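The binning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind the plots: the input format (a list of `(lead, yards)` pairs, with `lead` measured from the defense's side at the time of the play), the function name `yards_allowed_by_lead`, and the choice to fold anything beyond 28 points into the outermost bins are all assumed here.

```python
import math

BIN_WIDTH = 3.5   # points per bin, as described in the text
MAX_LEAD = 28     # leads/deficits beyond this fall into the edge bins

def yards_allowed_by_lead(plays):
    """Sum yards allowed per relative-score bin.

    `plays` is an iterable of (lead, yards) pairs; returns a dict keyed
    by bin center (-28.0, -24.5, ..., 0.0, ..., 28.0).
    """
    bins_per_side = int(MAX_LEAD / BIN_WIDTH)  # 8 bins on each side of 0
    totals = {k * BIN_WIDTH: 0 for k in range(-bins_per_side, bins_per_side + 1)}
    for lead, yards in plays:
        clipped = max(-MAX_LEAD, min(MAX_LEAD, lead))
        # Index of the nearest bin center (bins are centered on multiples of 3.5).
        k = math.floor(clipped / BIN_WIDTH + 0.5)
        totals[k * BIN_WIDTH] += yards
    return totals

# Made-up plays: (lead at time of play, yards allowed on the play).
toy_plays = [(0, 5), (1, 10), (2, 7), (14, 20), (35, 3), (-40, 8)]
yabl = yards_allowed_by_lead(toy_plays)
```

A defense that mostly protects big leads would show most of its total in the bins at +7 and above, which is the right-heavy shape discussed below.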

As expected, it is clear that defenses on teams with the highest-scoring offenses give up a significant number of their yards when in the lead by more than one possession. The four highest-scoring teams in the 2011 regular season were, in order, the Green Bay Packers (GB), the New Orleans Saints (NO), the New England Patriots (NE) and the Detroit Lions (Det). All four of these teams finished 23rd or worse in TYA. The YAbL plots confirm that GB and NE especially give up most of their yards when preserving a two-possession lead. My rationalization for this is that teams switch strategies to play with a prevent defense when preserving a significant lead. In a prevent defense, teams are looking to prevent quick scores and take time off of the game clock at the cost of allowing long drives with many plays. So this strategic decision is one contribution to the right-heavy YAbL we see for these teams. The other contribution is the simple fact that these defenses are playing a disproportionate number of drives with a lead. With this data alone, it is impossible to determine how many extra yards are forfeited from switching to a prevent defense. I suspect that if GB and NE were playing on teams with weaker offenses and thus playing in closer contests, their TYA would decrease a little bit. I do not claim nor believe that either team’s defense is all that great; indeed they appear to be fairly average or below average in tie-game situations.

Detroit, on the other hand, performed poorly in single-possession games and games where they were down by more than one score. Despite having a powerful offense at times, scoring an average of almost 30 points per game, the defense struggled to preserve the lead. More information is needed to accurately compare their distribution to other teams, but on the surface they appear to play on par with defenses like Tennessee and Miami, who were without strong offenses and were not in playoff contention.

The more dominant defenses stand out and correlate well with TYA.  Pittsburgh, Miami, Baltimore, San Francisco and Houston were almost always playing with a lead and yielded few yards regardless of their situation. In these cases, TYA paints a reasonable portrait of the effectiveness of these defenses.

Tall and narrow distributions centered at zero don’t provide much information about the defense, but they do tell a story about the kind of games these teams play.  It’s fitting, I suppose, that both drama-riddled New York teams play in noticeably tight games, either up or down by a single score with roughly equal frequency.

In general, defenses allowing the fewest yards are indeed elite and effective defenses. However, some of the teams allowing the most yards are unfairly slandered as being defensive sieves; these defenses give up many of their yards when already leading by two or more possessions. The added value of these YAbL plots is relatively minimal; they are mostly useful in discriminating high-TYA defenses that actually stink from those whose numbers are inflated by context. Finally, the YAbL distribution plots are at least as informative as TYA and can in some cases provide useful contextual information. I’ll continue to work on this kind of analysis to incorporate different normalization options, scoring outcomes, and interactive features for users to more efficiently compare defenses. It’s possible that a similar distribution will have some valuable information and can be incorporated in written analyses with sparklines [here], so users have instant visualization without having to look at a separate figure.

Who won the Super Bowl (halftime show)?

The Super Bowl is the most-watched annual television event in the U.S.; some years, nearly half of all households watch it. And while players earn tens of thousands of dollars for a day’s work, and advertisers pay $100,000 per second for air time, the performers at the halftime show (often huge stars, like Prince or the Black Eyed Peas) are not paid to appear.

One reason the performers agree to do this, of course, is the terrific publicity. In fact, the halftime show can have more viewers (per minute) than the Super Bowl itself. So one might wonder: how useful is this publicity? How many new listeners does it get you? Does it only help if you’re already an established performer, or does it help up-and-coming artists as well? Does it help you even if your music isn’t very, so to speak, mainstream?

Let’s look at some data from the latest Super Bowl (XLVI, if you’re counting (in Roman)). The main performer for the halftime show was Madonna, who’s been releasing music for 30 years and has released plenty of well-known (and well-loved) singles. At the time, she was about to come out with a new, persona-defining album, MDNA, her first in four years. However, Madonna brought several other musicians on stage with her. In approximate order of decreasing seniority, there were Cee Lo, the rapper-turned-pop-crooner; M.I.A., the electro-pop agitator; Nicki Minaj, the energetic, attention-grabbing rapper/provocateur; and LMFAO, the humorous “party rock” duo.

How did the Super Bowl halftime show affect each of these artists’ listening numbers? One way to look at this is to examine Last.fm’s listening charts for the weeks before and after the Super Bowl. (Here, for example, are Madonna’s charts for the week preceding the Super Bowl, showing number of unique listeners for her top songs.) We can plot how her top songs do before and after the Super Bowl:


As we can see, a few things happened. First, two days before the Super Bowl, Madonna premiered a song from her new album, “Give Me All Your Luvin’;” she performed the song with Nicki Minaj and M.I.A. at the halftime show. This song had a major publicity push separately from the halftime show performance, so it’s not too surprising to see it shoot up in listeners. (In fact, it reached over 11,000 listeners even before the Super Bowl.) However, all of Madonna’s other top songs, from 1984’s “Like a Virgin” to 2005’s “Hung Up,” saw a boost in listeners after the Super Bowl, too. The songs with the biggest bumps in listeners (“Like a Prayer” and “Vogue,” each with 50% increases) are the songs that were performed in the halftime show.
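The "50% increase" figures here are plain percent changes in weekly unique listeners. As a worked example (the listener counts below are invented round numbers, not values read off the charts, and `percent_change` is a helper named for this illustration):

```python
def percent_change(before, after):
    """Percent change from `before` to `after` (both weekly listener counts)."""
    if before == 0:
        raise ValueError("percent change is undefined when the baseline is 0")
    return 100.0 * (after - before) / before

# Hypothetical counts: 8,000 listeners the week before, 12,000 the week after.
bump = percent_change(8000, 12000)  # 50.0
```

Note that a percent change says nothing about absolute scale: a small bump on an already heavily played song can represent more new listeners than a large bump on an obscure one.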

What about the other performers?


Cee Lo Green saw runaway success with his 2006 song “Crazy,” as part of the group Gnarls Barkley; more recently, he’s had a lot of success with his 2010 song “F••• You” (played on the radio as “Forget You”). This song picked up slightly after the Super Bowl, but otherwise, his listenership was not largely affected. Why? This warrants further analysis. I think it’s a combination of two things: Cee Lo was not presenting any new music at the Super Bowl, but at the same time, Cee Lo has had fewer major singles, so fewer people have an easily-accessible Cee Lo song already on their computer. Compare that with Madonna: the Super Bowl may have gotten more people to listen to her new song, and in addition, many people already had Madonna songs in their music libraries, and the Super Bowl performance was a prompt to listen to those songs again.


M.I.A. had a more productive Super Bowl than Cee Lo. Like Madonna, she released a new song in the week before the Super Bowl; though this song wasn’t performed, it did see a quick rise in the week before and the week after the Super Bowl. Also, like Madonna, M.I.A.’s other top songs saw rises in the week before and the week after the Super Bowl, with a decline in listeners afterwards. It’s been a few years since M.I.A.’s last album, so maybe people who have that last album were reminded of it — and of their excitement about M.I.A. as an artist.


This idea doesn’t seem to hold for Nicki Minaj. She’s an up-and-coming musician, one who’s built buzz through many singles distributed over the Internet. She did release a new single a little after the Super Bowl, and that release met with success; however, her other songs did not see any bumps in listeners. Was this because people were less excited about her halftime performance? Or did she get less of the spotlight in the show? (You could argue M.I.A. had an unfair advantage, courting controversy with a digital malfunction.)


Finally, there was LMFAO, a band with only a few singles out, but with one that’s built a fair amount of buzz. How much did they capitalize on the halftime show? As a percentage, they only got a small bump for their top song, though it’s important to note that more people are already listening to that song than any of the other artists’ songs. However, their other songs did not see a bump at all. It might be that the appeal of LMFAO is fairly specific; of Super Bowl watchers, maybe only a small fraction of those seeing the LMFAO performance were intrigued to hear more.

There are many factors involved in how an audience reacts to hearing an artist’s song; unraveling the importance of these factors requires more data than just listening figures for a few artists’ songs before and after a single event. However, looking at these numbers can suggest potential targets for larger-scale analysis. Data analysis rarely (if ever) exists in a vacuum; developing a sense of the system being studied is an important part of reaching statistically meaningful conclusions.

Notes on the data

As always, drawing conclusions from data requires a good understanding of your data, a lot of care, and good controls. Doing this is outside the scope of this blog post, but it’s important to at least mention some limitations. First, there’s the geographic issue: Last.fm is a British company, and they promote themselves most heavily there, while the Super Bowl is a primarily North American event. Last.fm certainly has lots of users in the U.S., and to analyze the effects of a primarily-American marketing event, it would be best to limit the analysis to the effects on American listeners. There’s also, as always, the issue of demographics: I would guess that Last.fm’s demographics skew younger than the overall demographics of Super Bowl viewers. (LMFAO’s numbers might be evidence to that effect.) In terms of the source of the data, Last.fm measures plays in a variety of ways, but it is probably dominated by people listening to MP3 files on their computers and by people using on-demand streaming services like Spotify. The halftime show probably has a more complicated effect on what gets played on the radio, which is another factor that can translate to album sales (and which Last.fm largely doesn’t measure). Finally, there’s the question of what data are easily available online, versus the data that Last.fm has internally but doesn’t make easily accessible. The data available online are counts of the number of unique listeners to each song in a week; the data do not tell you how many times each of these people listen to the song. These numbers are likely very different for artists with many songs (Madonna) and artists with few (LMFAO). Frequency of listening may be a useful indicator of interest in or loyalty to an artist, but these data are not available through the Web.

Scientific 3D animation of bump hole example

“Powering the Cell: mitochondria” and the “Inner Life of the Cell”, videos produced and distributed by XVIVO, are two of the most sensational examples of modern scientific animation.  An ensuing story in the New York Times solidified for me that scientific animation was not only a blooming industry producing eye candy for fund raisers but was also an active area of research and the beginnings of a community trying to push the boundaries of scientific communication.  This inspired me to get a copy of Maya (an industry-leading 3D animation software package freely available to the academic community) and get playing.

Even with all the inspiration from personal heroes Gael McGill and Drew Berry, the short animation above was the most style I could muster with my feeble newbie Maya skills. The subject of this video is a particular example of the bump hole method, pioneered by Kevan Shokat [paper], which in general refers to the strategy of introducing a genetic mutation to a native enzyme in such a way that the enzyme can catalyze a specific reaction between the native substrate and another molecule. In this case, the lab at Memorial Sloan Kettering Cancer Center (MSKCC) at whose request I made this video used the bump hole method to enable a transferase reaction that could attach a label to a substrate. For a more detailed narrative of the video, please see the end of the post.

The stated purpose of this animation was to replace a simple 2D schematic illustration of the entire problem and its solution.  The video is intended to accompany a live presenter that will narrate the video.  Its primary function as a schematic allows us flexibility to deviate from scientific accuracy, when necessary.  The scientific accuracy here is limited to the shape of the molecules and the positioning of the substrate and cofactor relative to the enzyme.  All colors (obviously), transparencies and glow effects are for illustration.  Furthermore, the magnitude of the mutation’s effect on the enzyme geometry is greatly exaggerated to clearly show a perceptible change in the enzyme structure.  Walking the line between scientific representation and interpretation is something all scientific animators will have to deal with, and the rules for scientific integrity and responsibility in this arena are still up for discussion.  I hope that here I don’t exemplify any egregious violation.

A few tidbits for the interested. I used the free and brilliant Maya plugin for molecular animators called molecularMaya. As far as I understand, molecularMaya is the brainchild of Digizyme owner Gael McGill (and his super friendly and helpful team). It allows for automatic importing of PDB files from the PDB website or locally on your machine. The plugin provides a set of menu options for viewing the protein as a set of atoms, a mesh, or ribbon. Each viewing mode is coupled to different style options. For example, I used a mesh resolution of 1.714 for the enzyme to show more detail. The mesh resolution does not, as far as I am aware, translate to an Angstrom resolution. molecularMaya is due for a much anticipated new release and I believe it will be significantly more than just a few new features. One feature that I would really like would be the ability to select individual residues. I trust additional representations such as beta-sheet and alpha-helix cartoons will be included.

I hope to extend the utility of this animation with interactive labels and overlaid figures to supplement the content with scientific evidence. In a dream world, scientists will be communicating with each other and with the public through such interactive media. I expect also that 3D animations can be a valuable part of that media experience. New presentation modalities are here and new ways of learning need to be explored. We might as well also have some fun with it.

I apologize in advance for the generics, but the specific names and information about the enzymes, substrates, cofactors and mutations are privileged until publication.  The characters of this animation include a ‘blue’ enzyme, a ‘red’ substrate, and a ball-and-stick model of a cofactor that consists of a base and a clickable moiety, which I’ll refer to as the tag finger.

Scene 1 begins with an introduction to the native enzyme and its substrate.

Scene 2 introduces the cofactor and its constituent parts.

Scene 3 consists of a demonstration of the problem, which is that the full cofactor does not bind to the native enzyme. We tried to use the effect of the cofactor bouncing off the enzyme to clearly illustrate that the cofactor does not fit. Scene 3 continues with the placement of the cofactor in its intended position. Here we use a simple rotation to get a better view and a transparency on the enzyme mesh to give the viewer an idea of where the cofactor sits relative to the native enzyme. As the transparency goes away and the mesh returns to opaque, we see that the cofactor’s tag finger is no longer visible. We hope this clearly suggests that this part of the cofactor doesn’t fit the native enzyme structure. Scene 3 concludes with a slow morph from a representation of the native enzyme into a representation of the mutated enzyme. This part should conceptually explain that the effect of the mutation is to ‘make room’ for the cofactor tag finger. In this part of scene 3 we take artistic license and deviate from scientific accuracy. We exaggerate the size of the hole, since at that mesh resolution the deleted residue would not otherwise be noticed.

Scene 4 shows the consecutive binding of the substrate and cofactor to the enzyme followed by an artistic (non-scientifically accurate) representation of the reaction carried out by the transferase, where the tag finger breaks from the cofactor and attaches to a specific lysine residue on the substrate. I use a glow effect to represent the start of the reaction. Why? Scientists love glow effects, don’t we?

Scene 5 is the money shot. It shows the individual components breaking off after the reaction, with special emphasis on the newly tagged substrate.

A Data-mined History of Pop Music

Using the wisdom of crowds to tell a story about music genres

Music genres are serious business: the source of debate, speculation, fights, and of course, mockery. What seems like a fairly clear-cut concept in a record store is less clear when debating with your friends whether Brian Eno makes electronica or ambient music, or what kind of hip-hop this Kid Cudi album is (if it’s hip-hop at all). Or even worse, how do genres link together: is hip-hop a descendant of R&B, or are they both sibling children of soul music? Is indie folk closer to 60s and 70s folk music, or to indie rock (or are all three just branches of the same limb stretching back to the blues)?

The goal of this post is to take advantage of some of the available social data (in this case, tags on Last.fm) to form a sort of consensus on music genre classification. This isn’t meant to produce an authoritative ground truth on music classification (I doubt such a thing could exist), but rather to try to get at the most widely-held conception in a somewhat objective and perhaps novel way.

note — I saved the technical details for the end; if you want to read them before seeing the results, skip to the Mining Details section below

Pop Music Genre Tree

As my source of data, I took the most common genre-related tags on Last.fm for songs from the Whitburn project. To work out the relationships between all these tags (and by extension the genres themselves), I used some phylogenetic software to produce a family tree of tags. The logic of using phylogenetics algorithms for this is explained in the Mining Details below. Here’s the tree, with colors and (terrible) labels added by me (click for bigger version):
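For readers who want a feel for how a tree can fall out of tag data, here is a rough sketch. It is not the phylogenetic software or distance measure used for the tree in this post. It assumes tags are compared by the Jaccard distance between the sets of songs they are applied to, and it builds the tree with UPGMA-style average-linkage merging, a classic distance-based tree method. The function names and the toy songs are made up.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of songs."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

def tag_tree(song_tags):
    """Build a nested-tuple tree of tags from a song -> set-of-tags mapping."""
    # Invert the mapping: tag -> set of songs carrying that tag.
    songs_by_tag = {}
    for song, tags in song_tags.items():
        for tag in tags:
            songs_by_tag.setdefault(tag, set()).add(song)

    # Each cluster is (subtree, list of member tags); start with one per tag.
    clusters = [(tag, [tag]) for tag in sorted(songs_by_tag)]

    def cluster_distance(m1, m2):
        # Average linkage: mean distance over all cross-cluster tag pairs.
        total = sum(jaccard_distance(songs_by_tag[x], songs_by_tag[y])
                    for x in m1 for y in m2)
        return total / (len(m1) * len(m2))

    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: cluster_distance(clusters[p[0]][1],
                                                  clusters[p[1]][1]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical toy data: four songs and the tags users gave them.
toy_song_tags = {
    "song1": {"soul", "rnb"},
    "song2": {"soul", "rnb"},
    "song3": {"rock", "hard rock"},
    "song4": {"rock", "hard rock"},
}
tree = tag_tree(toy_song_tags)
```

In this toy case the tags that always appear together end up as siblings, and the two pairs only join at the root. With real, messy tags the distances are rarely this clean, which is where the choice of distance and tree-building method starts to matter.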


This tree serves two purposes: it works as a map from the varied and whimsical landscape of social tags onto a concise and recognizable group of genres, and it also reveals some surprising insights about how genres (are perceived to) actually relate to one another.

For instance, the R&B tags seem to cluster into two groups – a 70s and 80s R&B closely aligned with soul music, and a later R&B aligned with hip-hop. It’s also surprising that country music seems to cluster very closely to folk rock and southern rock, both genres I expected to see closer to the pure rock camp. Speaking of which, a few other genres I associate with rock (soft rock / ballads, alternative / punk / grunge, and pop rock) defied expectation by branching out on their own rather than falling under the rock umbrella.

Less surprising was the close association of electronica with other dance music including disco, and the very broad nature of the rock genre (which includes classic rock, hard rock, psychedelic rock, glam rock, progressive rock, etc.).

One caveat — I do expect the exact structure of this tree to be somewhat sensitive to things like which songs are included in the dataset. Still, even if slightly rearranged versions of the tree are valid themselves, that really doesn’t make this less valid, as it’s still a representation of genre relationships based on input from perhaps millions of users.

Pop Genres Through History

Having a sensible map of social tags to song genres also gave me the chance to take a look at pop history — to take a look at the growth of, the decline of, and in some cases the resurgence of genres over time.


Taking a look at the number of songs associated with each derived genre over time reveals a few cool things. The first thing to notice is that the total number of tagged songs each year varies quite a bit, from a few in 1920 to a few hundred by the 1980s. Though the number of songs in the Whitburn project does vary a little from year to year, most of this variation is due to a lot of songs, especially old songs, just not being tagged or even present in Last.fm. This means some (real) genres of music are completely absent; after all, users of Last.fm are people who live in the 21st century and listen to digital music, which for better or worse does not include old gospel recordings of Homer Rodeheaver or ragtime covers by the US Marine Band (though I’m sure a few people will be saddened by the lack of tagged Broadway showtunes). I prefer to take this as a reminder that history (or maybe I should say culture) is in the eyes of the beholders. When we think of music of the 30s, we think of blues and jazz, and that’s what’s represented here.

Of the songs that are tagged, a few interesting patterns emerge. First, except for the explosion of rock and soul in the late 50s / early 60s (fairly quickly after the respective introductions of the two genres), most genres seem to grow at the expense of others. The growth in hip-hop and alternative music in the late 80s / early 90s coincides with the decline of rock (and to a lesser extent dance and soul music) in the same period. Second, just because a genre of music is down doesn’t mean it’ll stay down: country / americana might have looked like it was on its last legs by the late 80s, but by the 2000s it actually had a bigger market share than ever.


Normalizing the songs per year to produce a genre ratio plot makes a few things a bit more visible. One of these is that out of all these genres, the one with the best longevity seems to be soul music, though I do have to qualify that somewhat, as the tag “soul” is pretty ambiguous, so I might be picking up some songs that are just soulful without being soul.
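As a rough illustration of the normalization (a sketch, not the code behind the plot), each year's genre counts can be divided by that year's total so the shares sum to 1. The `songs` list and the `genre_ratios` helper below are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (year, derived genre) pairs; not the real chart data.
songs = [
    (1958, "rock"), (1958, "soul"), (1958, "rock"), (1958, "country"),
    (1992, "hip-hop"), (1992, "rock"), (1992, "alternative"), (1992, "hip-hop"),
]

def genre_ratios(songs):
    """Return {year: {genre: share of that year's tagged songs}}."""
    counts = defaultdict(Counter)
    for year, genre in songs:
        counts[year][genre] += 1
    ratios = {}
    for year, per_genre in counts.items():
        total = sum(per_genre.values())
        # Dividing by the yearly total removes the effect of how many
        # songs happened to be tagged in that year.
        ratios[year] = {g: n / total for g, n in per_genre.items()}
    return ratios

ratios = genre_ratios(songs)
print(ratios[1958]["rock"])  # 0.5
```

Because every year is rescaled to the same total, sparsely tagged early years and heavily tagged later years become directly comparable.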

Finally, I do have to point out that tagging each song as a member of a single genre only gives part of the story: a lot of songs are tagged as members of several genres. For the curious, out of this dataset the artists with the most genre-spanning power were Prince, Phil Collins, Peter Gabriel, and Michael Jackson. Taking a closer look at genre blending and fusion will most likely be the topic of a future post.
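One simple way to measure "genre-spanning power", assuming each song has already been assigned a set of derived genres, is to count the distinct genres that appear across an artist's songs. The rows and the `genre_span` helper here are hypothetical, and the actual ranking may have been computed differently.

```python
from collections import defaultdict

# Hypothetical (artist, genres assigned to one song) rows; illustrative only.
rows = [
    ("Artist A", {"rock", "soul"}),
    ("Artist A", {"dance"}),
    ("Artist B", {"rock"}),
    ("Artist B", {"rock"}),
]

def genre_span(rows):
    """Return (distinct genre count, artist) pairs, widest span first."""
    genres_by_artist = defaultdict(set)
    for artist, genres in rows:
        genres_by_artist[artist] |= genres
    return sorted(
        ((len(genres), artist) for artist, genres in genres_by_artist.items()),
        reverse=True,
    )

spans = genre_span(rows)
print(spans)  # [(3, 'Artist A'), (1, 'Artist B')]
```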

Mining Details (for the curious)
My basic strategy for this analysis was to link up two pieces of data. The first was pop music charts, and the second was the social tags associated with these songs on the tagging site. The second piece was straightforward to obtain thanks to the site’s well-maintained API, but the first required some curated and maintained dataset. My original plan was to use the publicly-available data in the Billboard Charts API to gather a list of popular songs over the last century. Sadly, as of right now the service is completely broken and useless. But where Billboard’s effort falls short, the Whitburn project managed to make up for it by releasing a meticulously gathered and annotated list of 37,000 chart-hitting songs since the 1890s.

Here are the most common tags for the Whitburn project songs represented as a word cloud (I highlighted genre-specific tags in red):


The first thing to notice (which will be nothing new to people who work with this kind of data professionally) is that the tags are, for lack of a better term, “messy”. For instance, there are about eight different tags for R&B, including the alternative spelling “rhythum and blues”. Several tags are ambiguous: does “soul” mean that the song is in the genre of soul music, or that the song is soulful? Since this is social data, we have to contend with people using a single tag for more than one meaning, and using different tags to mean the same thing.

Rather than simply letting it be an annoyance though, the idea here was to treat ambiguity itself as a source of information. Grabbing the 100-odd common tags that have to do with genre, I labelled each by which songs have that tag. I admit this sounds somewhat backwards; to use a metaphor, we can think of each genre as having a sort of genotype: a sequence that defines it. To get that sequence, I look through the set of songs and mark down 1 where that tag is mentioned and 0 where it is not (this means that the songs are basically being treated as alleles).
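A minimal sketch of that encoding, using a made-up `song_tags` dictionary rather than the real data: every tag gets one row of 0s and 1s, with one position per song in a fixed order.

```python
# Hypothetical songs and their social tags; illustrative only.
song_tags = {
    "song_a": {"rock", "classic rock"},
    "song_b": {"rock"},
    "song_c": {"soul", "rnb"},
    "song_d": {"soul"},
}

def tag_profiles(song_tags):
    """Return (songs, {tag: [0/1 per song]}) with songs in sorted order."""
    songs = sorted(song_tags)
    tags = sorted(set().union(*song_tags.values()))
    # One "genotype" per tag: 1 where the song carries the tag, 0 elsewhere.
    profiles = {
        tag: [1 if tag in song_tags[song] else 0 for song in songs]
        for tag in tags
    }
    return songs, profiles

songs, profiles = tag_profiles(song_tags)
for tag, row in profiles.items():
    print(f"{tag:>12} {''.join(map(str, row))}")
```

In this toy example "classic rock" comes out as 1000 and "rock" as 1100, so the narrower tag's sequence is contained in the broader one.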

To help visualize, here’s a raster image of a section of this “genotype” map. For each genre tag (y-axis) there’s a mark if the song on the x-axis has been tagged with that genre.


The first thought that comes to mind looking at this kind of data is to use a standard clustering algorithm (e.g. hierarchical clustering or PCA followed by k-means) on it to find groups of related tags. The problem with that is coming up with a sensible distance metric: one that puts a large distance, say, between two rarely-used tags with few overlapping songs, but also puts a small distance between a common tag and a rare tag whose songs overlap with the common tag (i.e. its parent).
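To make the requirement concrete, here is one candidate with roughly those properties: a distance based on the overlap (Szymkiewicz-Simpson) coefficient, which normalizes by the smaller of the two song sets. This is only an illustration of the problem, not the approach used in the end, and the tag sets are invented.

```python
def overlap_distance(a, b):
    """1 - |A & B| / min(|A|, |B|).

    Gives 0 when one song set is contained in the other and 1 when the
    two sets share no songs.
    """
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / min(len(a), len(b))

# Hypothetical song IDs per tag.
rock = set(range(100))        # common tag: songs 0..99
classic_rock = {3, 17, 42}    # rare tag, entirely inside rock
polka = {3, 200, 201}         # rare tag, barely overlapping classic_rock

print(overlap_distance(rock, classic_rock))   # 0.0 (parent and child)
print(overlap_distance(classic_rock, polka))  # about 0.67 (weakly related)
```

A more familiar choice such as Jaccard distance would put "rock" and "classic rock" far apart simply because their sizes differ so much, which is the opposite of what is wanted for a parent and its child.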

This is actually where the genotype metaphor came in handy. I simply took it literally, and used an algorithm developed by evolutionary biologists that does exactly what I want: produces a tree of the relationship between the tags assuming that losing songs from parents to children is common, but gaining songs is very rare (for the even more technically-curious, I produced the maximum parsimony character tree for the genre tags by taking the consensus tree for 100 bootstrap rounds). Once I had the tree, using it to classify songs based on their tags was straightforward.
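The post doesn't spell out the classification step, so the following is only one plausible reading: roll each of a song's tags up the tree to its top-level genre and take the genre with the most votes. The `parent` mapping is a hypothetical stand-in for the real tree, and `root_genre` and `classify` are names invented for this sketch.

```python
from collections import Counter

# Hypothetical tag tree as child -> parent; None marks a top-level genre.
parent = {
    "rock": None, "classic rock": "rock", "hard rock": "rock",
    "soul": None, "motown": "soul",
}

def root_genre(tag):
    """Walk up the tree until reaching a top-level genre."""
    while parent[tag] is not None:
        tag = parent[tag]
    return tag

def classify(tags):
    """Pick the top-level genre most of a song's known tags roll up to.

    Tags that are not in the tree (e.g. non-genre tags) are ignored;
    returns None if no tag is recognized.
    """
    votes = Counter(root_genre(t) for t in tags if t in parent)
    return votes.most_common(1)[0][0] if votes else None

print(classify({"classic rock", "hard rock", "motown"}))  # rock
```

Counting all votes instead of keeping only the winner would give the multi-genre view of a song rather than a single label.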

No S**t Sherlock

The popularity of infographics, particularly interactive maps, is on the rise in popular culture and daily life.  Recently, I’ve seen glimpses of boring information, in graphical form, popping up in popular entertainment. Today, I’ll talk about the British TV series ‘Sherlock’. Note: this post would be better with a nice action, instrumental soundtrack. I’m just saying.

Who doesn’t love a good chase scene?  Cops jumping over buildings, secret agents commandeering motorcycles in pursuit of thuggish Russian agents, Wile E. Coyote mounting an ACME rocket in pursuit of that dastardly Road Runner. What can be better than well-shot sequences of gunshots, jumps, swinging, running, ducking and breaking through glass?  What’s missing?  I think the answer is ‘context.’

Single-perspective views cannot capture both the chaser and his or her prey.  It’s not always easy to know if either is making a good move to catch or evade their competitor.  Enter the infographic.

In the pilot episode of ‘Sherlock’ (‘A Study in Pink’), Sherlock and his reluctant sidekick Watson are in pursuit of a cab through the streets of London.  The cab is bound to stay on roads.  The heroes are on foot but are free to climb stairs, jump across buildings, cut through gardens and romp through small alleyways.

The scene is populated with cut sequences of cars whizzing and our heroes moving in a totally non-overlapping geographical space.  To help the viewer keep track of this scene, they display a map of the London neighborhood (frame 1), with the baddies in red and the goodies in green (can it be any other way? I maintain that it cannot).  These graphics help provide strategic context by illustrating Sherlock’s thought process and update the audience with the geographical history and future of the chase scene.

Starting at around the intersection of Ingestre Pl and Hopkins St, Sherlock deduces that the cab is unlikely to take the route along Wardour to Broadwick as in frame 1 and rather assumes the cab will take Warwick to Wardour Street (frame 2).  The purple dot in frame 2 presumably indicates Sherlock’s intended point of intersection.

Running and jumping ensues.  I have no idea exactly where they are, but probably on the roof of someone who lives near the intersection of Broadwick and Poland Street (not shown).  Poor bastards.  Sherlock and Watson arrive on D’Arblay St, but oh no, they’ve missed them.  Frame 3 clearly shows that the green line overlaps with the red, and we’re left to assume that they did not intersect at the same time.  Sherlock picks a new intersection point (see purple dot), and instead of following the cab by turning left along Poland Street, he turns right.  The combination of the physical actors prancing off in the wrong direction and the clear graphic helps the audience understand what is going on, and thereby CARE about all the action.

Running and trampling of nicely landscaped gardens ensues.  Frame 4 is the final graphic we are given. It shows only slight progress over frame 3.  This is the least informative frame. More running, and then, BAM, Sherlock has his man, thus demonstrating that logic, deduction, and an unnatural amount of cartographical memory coupled with obscene amounts of mundane municipal construction knowledge will triumph over cab drivers (who don’t even know they’re being chased).

Success: These map frames definitely help the viewer understand the logic and agency in Sherlock’s pursuit.  Without them, it would appear to be gratuitous running and jumping and a fortunate and inexplicable interception of his target.  We see glimpses of why Sherlock assumes the cab will take one route over another. We also see two different instances of where Sherlock wants to intercept his target.  This makes the running and jumping over buildings meaningful.  We also see, in sort-of real time, the progress he’s making whilst running and jumping, thus making it more suspenseful.