NYC scaled to commute time, part 2

What would New York City look like if the distance between places was actually the time it takes to get between them in rush hour?

by Zachary Nichols

This is part 2 of the post on rescaling NYC based on transit time. Part 1 is here.

Last time, I showed some maps of New York where I replaced normal distance with the time it takes to get between places on MTA. Here’s a video of the transformation from normal to time-scaled and back again:

I also mentioned last time that I would take a shot at doing a Manhattan-only version. Just as a note, this won’t look just like a bigger version of the deflated Manhattan balloon in the full map. That’s because the scaling algorithm I used (multi-dimensional scaling) is trying to take all the distances into account that it sees in order to put them into a 2D map that makes sense. Of course, the distances I give it don’t really fit into a 2D world, and the projection chosen can be very different when I use only part of the map.

Here’s (boring old) Manhattan:


And here’s Manhattan rescaled for MTA travel time. It’s not as interesting as the full map, but we can see some of the obvious features: getting across town above 60th St. or so is a huge pain, and getting across town in general is a slower prospect than going North/South.


It is a little bit notable that the Manhattan-only version is not as dramatically scaled as Manhattan is in the full map. It suggests maybe the reason Manhattan was as squished as it was is that the transit time in the outer boroughs is comparatively much longer.

Anyway, in case anyone is interested, here is the nitty-gritty behind how I made these maps. To calculate the MDS rescaling, the first thing I need is a set of points and the distances between them.

To get these points efficiently, I used a bit of a shortcut. I made a hexagonal lattice covering NYC (including some of Long Island, New Jersey, Westchester, and a lot of water). To find out which lattice points were actually in the city, I used Google’s Maps API to probe the geographic information of each point. After eliminating everything not actually in the city, I was left with these 800 or so points (colored by neighborhood):
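For anyone who wants to try the lattice step, here is a minimal sketch of one way to build it. The function names, the spacing parameter, and the `in_city` predicate are all hypothetical; the predicate stands in for whatever geocoding lookup decides whether a point is inside the city, which is not reproduced here.

```python
import math


def hex_lattice(lat_min, lat_max, lon_min, lon_max, spacing_km):
    """Return (lat, lon) points on a hexagonal lattice covering a bounding box."""
    km_per_deg_lat = 111.32  # rough constant, good enough at city scale
    mid_lat = math.radians((lat_min + lat_max) / 2)
    km_per_deg_lon = km_per_deg_lat * math.cos(mid_lat)

    dlon = spacing_km / km_per_deg_lon                     # step along a row
    dlat = spacing_km * math.sqrt(3) / 2 / km_per_deg_lat  # step between rows

    points = []
    row = 0
    lat = lat_min
    while lat <= lat_max:
        # Shift every other row by half a step to get the hexagonal packing.
        lon = lon_min + (dlon / 2 if row % 2 else 0.0)
        while lon <= lon_max:
            points.append((lat, lon))
            lon += dlon
        lat += dlat
        row += 1
    return points


def keep_points(points, in_city):
    """Keep only the points for which the in_city(lat, lon) predicate is true."""
    return [p for p in points if in_city(*p)]
```

Calling `hex_lattice` on a box around the city and then `keep_points` with a real lookup would give the kind of point set shown above; the spacing controls how many points survive.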


The next step, of course, was to find the time it takes to get between them via MTA. Ideally, I would just use Google Maps’ distance matrix to get the whole calculation at once. Alas, the rights restrictions on public transit data prevent Google from offering distance matrix calculations for transit times (they do allow such calculations for roads etc.).

At this point I had a bit of a decision to make. I could be a total jerk and try to get the travel times between all the points from google maps anyway with a screen scraper or something like that, but it turns out there’s a much better solution. Since MTA is kind enough to publish their transit data, I had the opportunity to (with a little bit of work) use Open Trip Planner to set up a local server that allows me to get the transit times via a REST api. So I could still spam and spam to get travel times, but it would be on my local machine rather than someone else’s server.
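A rough sketch of the matrix-filling loop, under loud assumptions: the endpoint path, the query parameter names, and the date and time values are placeholders based on one version of Open Trip Planner's trip-planning API and may not match yours. The actual HTTP call is left to a caller-supplied `travel_time` function (hypothetical), and averaging the two directions is an assumption added here so the result can be treated as a distance matrix.

```python
from urllib.parse import urlencode

# Assumed endpoint of a locally running Open Trip Planner instance; the path
# differs between OTP versions, so treat this as a placeholder.
OTP_PLAN_URL = "http://localhost:8080/otp/routers/default/plan"


def plan_url(origin, destination, date="2012-03-07", time="8:30am"):
    """Build a trip-planning request URL for one origin/destination pair."""
    params = {
        "fromPlace": f"{origin[0]},{origin[1]}",
        "toPlace": f"{destination[0]},{destination[1]}",
        "mode": "TRANSIT,WALK",
        "date": date,   # placeholder rush-hour date and time
        "time": time,
    }
    return OTP_PLAN_URL + "?" + urlencode(params)


def travel_time_matrix(points, travel_time):
    """Fill an n x n matrix of travel times.

    travel_time(origin, destination) is supplied by the caller (for example,
    a function that fetches plan_url(...) and reads the trip duration). It
    should return None when no route is found.
    """
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                matrix[i][j] = travel_time(points[i], points[j])
    return matrix


def symmetrize(matrix):
    """Average the two directions; keep None where neither direction is known."""
    n = len(matrix)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            known = [t for t in (matrix[i][j], matrix[j][i]) if t is not None]
            out[i][j] = out[j][i] = sum(known) / len(known) if known else None
    return out
```

With roughly 800 points this means on the order of 640,000 requests, which is a good reason to run them against a local server.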

After filling in a matrix with travel times, I was forced to fire up Matlab (which has a very easy-to-use implementation of multi-dimensional scaling). Taking the results of this rescaling (and matching them up to the original latitude / longitude with a call to procrustes), I get a point-wise remapping of the lattice.
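For readers without Matlab, the same two steps can be sketched with NumPy. This is not the Matlab code used for the maps: it assumes classical (Torgerson) MDS, which may differ from the variant actually used, and it assumes a complete, symmetric matrix with no missing entries. The function names are made up for illustration.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Classical (Torgerson) MDS: embed points so distances approximate dist."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)    # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dims]     # indices of the largest ones
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def procrustes_align(reference, points):
    """Rotate/reflect, uniformly scale, and translate points onto reference."""
    ref_mean, pts_mean = reference.mean(axis=0), points.mean(axis=0)
    ref_c, pts_c = reference - ref_mean, points - pts_mean
    u, s, vt = np.linalg.svd(pts_c.T @ ref_c)
    rotation = u @ vt                          # best orthogonal transform
    scale = s.sum() / (pts_c ** 2).sum()       # best uniform scale factor
    return scale * pts_c @ rotation + ref_mean
```

The alignment step matters because MDS output has an arbitrary orientation; matching it back to latitude / longitude is what keeps north pointing up in the rescaled map.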


Here is the original lattice, with arrows pointing to where the rescaling puts each point.

Finally, I used Quantum GIS to make some annotated maps of NYC, and skewed them in Python according to the remapping. Since not every point in the map was in the lattice, I used a weighted average to interpolate between the lattice points, which let me make the final maps.
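The weighting scheme is not spelled out above, so the following is only one plausible version: inverse-distance weighting of the lattice displacements. The function name and the `power` parameter are hypothetical.

```python
def warp_point(x, y, lattice, displaced, power=2.0):
    """Move (x, y) by an inverse-distance-weighted average of lattice shifts.

    lattice and displaced are equal-length lists of (x, y) pairs: the original
    lattice points and where the rescaling put them.
    """
    weight_sum = 0.0
    dx_sum = 0.0
    dy_sum = 0.0
    for (lx, ly), (nx, ny) in zip(lattice, displaced):
        dist2 = (x - lx) ** 2 + (y - ly) ** 2
        if dist2 == 0.0:
            return (nx, ny)  # exactly on a lattice point: use its new position
        weight = 1.0 / dist2 ** (power / 2.0)
        weight_sum += weight
        dx_sum += weight * (nx - lx)
        dy_sum += weight * (ny - ly)
    return (x + dx_sum / weight_sum, y + dy_sum / weight_sum)
```

Applying this to every pixel or vertex of the annotated map gives a smooth skew that agrees with the remapping at the lattice points themselves.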

Of course, this is meant only as an interesting exercise, but these types of maps do have applications. A lot of what real GIS people do is planning: planning new roads, planning where to put hospitals and fire stations, etc. Knowing how long it takes to get between places can be critical when placing a fire station, for example: you really want to minimize response and travel time for as many places as you can. Having this type of map can help with visualizing where you should position your emergency services to get even coverage, the same way a map like this might help a New Yorker decide where to look for an apartment that’s convenient for work and play.


NYC scaled to commute time

What would New York City look like if the distance between places was actually the time it takes to get between them in rush hour?

One of the ideas that comes up semi-regularly in neuroscience is multi-dimensional scaling. In a nutshell, MDS is a set of algorithms whose intent is to take a set of distances, and produce a (usually) 2-D map that matches them. For instance, you might want to make a map of “taste”, so you ask people which is similar to which: chocolate vs steak, milk vs soy, ramen vs eggs and so on. You take these similarity measures, and produce a map of the land of taste.

This is a pretty neat idea that I think bears some more exploitation. And of course, being someone who spends a lot of time on the NY subway system, I couldn’t help but think of how you could use it to see the real geography of New York (or at least the geography that chronic MTA riders slowly build in their minds). That is, the map of New York where the Upper East Side isn’t that close to the Upper West Side. The map where Times Square, Union Square, and Grand Central are all close neighbors. The map where getting between Brooklyn and Queens may as well be a trans-Atlantic journey. You know, the real map.

This ended up being a bit more work than I was expecting, so I’ll split this post up. Today: the map. Soon: how I made the thing.

To begin with, here is a standard, to-scale map of the four Boroughs (Staten Island will be considered part of New Jersey for this exercise for reasons that will be explained later, no offense to either).


Here we see (using an overlay of a screenshot from Google Maps) the subway lines by their color, and areas in their proper geographical location. Now I’m going to measure the time it takes to get from one point in the city to another via the MTA (a whole bunch of times), and make a map where the distances match up with this time. For the geography nerds, this is called a Distance Cartogram. We can see it gives us a very different view:



The first thing I notice is how squished together lower Manhattan becomes, almost like a deflated balloon. As far as anything in Manhattan below 14th St. together with Borough Hall goes, you’re more or less living in the transit center. Of course, this is also where the greatest density of trains lives. Upper Manhattan is significantly wider, and is pulled toward Queens presumably with the combined power of the E, F, M, N, R, Q, and 7.

Brooklyn seems to pinch off around Borough Hall, making kind of a corner on the left. Brooklyn itself is also much larger in comparison to the squished Manhattan, and South Brooklyn is now stretched quite far south, with Coney Island stretching off the map. Williamsburg and Greenpoint seem a little more connected by comparison.

Queens bends a little bit toward Brooklyn near Long Island City on the left (probably thanks to the G train) and near Forest Hills / Jamaica on the right (probably due to the J and Z trains meeting the E). Jackson Heights, however, bends away from Brooklyn, and seems comparatively more isolated than the other parts of Queens.

Finally, the winner for most isolated place (no surprises here) is the Rockaways. In fact, the software I used to calculate the trip times (Open Trip Planner) failed to find paths to several places around there. Seriously far.

For the more curious, here is a version of the rescaled map with the Google overlay removed. The individual neighborhoods are labelled here, so you can see exactly where your ‘hood fits in.


Next time I’ll talk about how I went about making this thing (for anyone that wants to do something similar / better maybe in their own cities), and a few possible applications of this outside of picking which neighborhood to move to. Maybe a bonus version of this done for Manhattan alone if I have time.

Update: NYC scaled to commute time, part 2


A glimpse of gene alteration frequencies in cancer

Making use of data on gene alterations (deletions and amplifications) mapped to the chromosomes of patients with brain tumors (Glioblastoma), this visualization is meant to give a glimpse into the complexity of gene changes that occur as the animation cycles through each patient, and hopefully also to leave an impression of the huge efforts that are being made to understand this devastating disease for improved treatment.

It is rare that the broad public gets a look into the actual data that are being produced by the biomedical community. This is one attempt to show some of the thousands of gene alterations in one particular brain cancer, Glioblastoma – one of several cancers that are under intense investigation through large cross-institutional efforts, such as The Cancer Genome Atlas (TCGA).  Using publicly available data of thousands of gene alterations in more than 200 Glioblastoma samples (cBio Cancer Genomics Portal), each alteration is color coded (red=amplification, blue=deletion) and mapped to the more than 20,000 annotated genes in the human genome.

As the animation cycles through each patient, it can be seen that the majority of alterations are unique to each patient, since blue and red dots appear and reappear in different places throughout the genome. This shows the complexity of this disease, which is not simply defined by a single alteration, as many inherited diseases are. It is difficult to determine if such seemingly random events are causally implicated in the disease or are secondary effects due to other factors, an area of ongoing research for scientists around the world.

We also see that there are a few hotspots that start to dominate as more patients are covered: alterations that recur grow in size, highlighting the few high-frequency alterations that characterize this disease. For example, the EGFR gene on chromosome 7 is amplified in around 40% of patients.

Glioblastoma is a devastating disease both for the patients and their families, and it often affects young individuals. Hopefully, recent technological developments combined with large-scale efforts such as the TCGA project will shed more light on the mechanisms driving this disease and ultimately improve treatment for patients.

Disclaimer: Although genes and alterations are accurately mapped to their locations on the chromosomes, this visualization is not meant as an objective scientific report, but is rather a freely interpreted representation of the data. For a scientific analysis of the genomic characterization of Glioblastoma, please visit here: link.

Analysis details: DNA copy number variation (CNV) data were obtained from the cancer database cBio Cancer Genomics Portal, hosted by Memorial Sloan-Kettering Cancer Center (link). The data were preprocessed with the RAE algorithm, providing gene-based CNV scores and chromosomal coordinates. For simplicity, only high-level amplifications and homozygous deletions were used.
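As a rough illustration of how recurrence across patients can be tallied from gene-based scores, here is a minimal sketch. The input layout and the score encoding (+2 for a high-level amplification, -2 for a homozygous deletion) are assumptions made for this example, not a description of the portal's actual file format.

```python
from collections import Counter


def alteration_frequencies(samples):
    """Fraction of samples in which each gene is amplified or deleted.

    samples maps a sample ID to a dict of {gene: score}, where +2 marks a
    high-level amplification and -2 a homozygous deletion (assumed encoding).
    Returns {gene: (amplified_fraction, deleted_fraction)}.
    """
    if not samples:
        return {}
    amplified = Counter()
    deleted = Counter()
    for scores in samples.values():
        for gene, score in scores.items():
            if score >= 2:
                amplified[gene] += 1
            elif score <= -2:
                deleted[gene] += 1
    n = len(samples)
    genes = set(amplified) | set(deleted)
    return {g: (amplified[g] / n, deleted[g] / n) for g in genes}
```

A gene whose amplified fraction comes out near 0.4 across the samples would correspond to the kind of hotspot described above for EGFR.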


I still hate the Pats, but…

Of course it would be nice if complex problems boiled down to a single number; recall the days of high school math.  Sadly, the real world is never that simple. That does not, however, stop us amateur sports enthusiasts and even some actual professional sports analysts from using simple decontextualized tally metrics as evidence for assigning superlatives to players or teams.

At a party for Super Bowl 46 featuring the New England Patriots and the New York Giants, someone argued to me that the Patriots defense ranked 31st (out of 32 teams) in total yards allowed (TYA) and thus was a horrible defense that would single-handedly lose the Super Bowl to the reawakened Giants offense. It’s true that the New England Patriots’ defense did essentially tie for worst with the Green Bay Packers in TYA, with 6577 and 6585 yards, respectively, in the 2011 regular season. I’m definitely not a Patriots fan, but I felt that this accusation based entirely on TYA was a little unfair. There I was, caught in a very rare moment in defense of Boston sports. I waved my hands and posited that in some cases TYA is biased against teams with high-scoring offenses, since their defenses play conservatively to preserve the lead and eat up time. This isn’t a new idea, mind you; I just had never actually seen a plot that tests this hypothesis.

Consequently, I set out to see how many yards each NFL defense allowed as a function of the relative score.  My tentative hypothesis was that some of the high scoring teams would have defenses with high TYA where a large chunk of those yards were forfeited when the team was leading by at least two scores.

Disclaimer: I am not a sports analyst or in any way a football strategist. I do not claim that this graphic is the most informative or accurate way of ranking defenses. My priorities for this project were to (a) practice parsing sports data with Python (this is my first Python project of any kind), (b) make a good friend eat some crow, and (c) further stigmatize the use of TYA as the sole piece of evidence for ranking defenses. TYA for sure has some value, but it should bow down to other metrics that incorporate scoring, probability of allowing points, take-aways and game context.

Errors: I did not credit defenses for yards after take-aways either by fumble or interception. This failure is mostly due to my fear of parsing text from human-constructed sentences.

Data


These distributions are Yards Allowed by Lead (YAbL). The x-axis consists of 3.5 point interval bins centered at 0 and extending from -28 (a deficit of more than 28 points) to +28 (a lead of more than 28 points). On the y-axis is the cumulative number of yards allowed by each defense in those relative score bins.
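The binning can be sketched in a few lines of Python. This is a reconstruction from the description above rather than the script behind the plots: it assumes bin centres at multiples of 3.5 points, folds anything beyond 28 points into the end bins, and takes play-level (lead, yards) pairs as input. All names are made up for illustration.

```python
BIN_WIDTH = 3.5
MAX_LEAD = 28.0


def lead_bin(lead):
    """Return the centre of the 3.5-point bin that a score difference falls in."""
    clamped = max(-MAX_LEAD, min(MAX_LEAD, lead))  # fold blowouts into end bins
    return round(clamped / BIN_WIDTH) * BIN_WIDTH


def yards_allowed_by_lead(plays):
    """Sum yards allowed per lead bin.

    plays is an iterable of (lead, yards) pairs, where lead is the defending
    team's score minus the opponent's score before the play.
    """
    totals = {}
    for lead, yards in plays:
        center = lead_bin(lead)
        totals[center] = totals.get(center, 0) + yards
    return totals
```

With these bins, a tie or a one-point margin lands in the 0 bin, a field-goal lead lands in the 3.5 bin, and a two-touchdown lead lands in the 14 bin.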

As expected, it is clear that defenses on teams with the highest-scoring offenses give up a significant number of their yards when in the lead by more than one possession. The four highest-scoring teams in the 2011 regular season were, in order, the Green Bay Packers (GB), the New Orleans Saints (NO), the New England Patriots (NE) and the Detroit Lions (Det). All four of these teams finished 23rd or worse in TYA. The YAbL plots confirm that GB and NE especially give up most of their yards when preserving a two-possession lead.

My rationalization for this is that teams switch to a prevent defense when preserving a significant lead. In a prevent defense, teams are looking to prevent quick scores and take time off of the game clock at the cost of allowing long drives with many plays. So this strategic decision is one contribution to the right-heavy YAbL we see for these teams. The other contribution is the simple fact that these defenses are playing a disproportionate number of drives with a lead. With this data alone, it is impossible to determine how many extra yards are forfeited from switching to a prevent defense. I suspect that if GB and NE were playing on teams with weaker offenses and thus playing in closer contests, their TYA would decrease a little bit. I do not claim nor believe that either team’s defense is all that great; indeed they appear to be fairly average or below average in tie-game situations.

Detroit, on the other hand, performed poorly in single-possession games and in games where they were down by more than one score. Despite having a powerful offense at times, scoring an average of almost 30 points per game, the defense struggled to preserve the lead. More information is needed to accurately compare their distribution to other teams, but on the surface they appear to play on par with defenses like Tennessee and Miami, who were without strong offenses and were not in playoff contention.

The more dominant defenses stand out and correlate well with TYA.  Pittsburgh, Miami, Baltimore, San Francisco and Houston were almost always playing with a lead and yielded few yards regardless of their situation. In these cases, TYA paints a reasonable portrait of the effectiveness of these defenses.

Tall and narrow distributions centered at zero don’t provide much information about the defense, but they do tell a story about the kind of games these teams play.  It’s fitting, I suppose, that both drama-riddled New York teams play in noticeably tight games, either up or down by a single score with roughly equal frequency.

In general, defenses allowing the fewest yards are indeed elite and effective defenses. However, some of the teams allowing the most yards are unfairly slandered as being defensive sieves; these defenses give up many of their yards when already leading by two or more possessions. The added value of these YAbL plots is relatively minimal: they are mostly useful for discriminating high-TYA defenses that actually stink from those whose numbers are inflated by context. Still, the YAbL distribution plots are at least as informative as TYA and can in some cases provide useful contextual information. I’ll continue to work on this kind of analysis to incorporate different normalization options, scoring outcomes, and interactive features for users to more efficiently compare defenses. It’s possible that a similar distribution will have some valuable information and can be incorporated in written analyses with sparklines [here], so users have instant visualization without having to look at a separate figure.

Who won the Super Bowl (halftime show)?

The Super Bowl is the most-watched annual television event in the U.S.; some years, nearly half of all households watch it. And while players earn tens of thousands of dollars for a day’s work, and advertisers pay $100,000 per second for air time, the performers at the halftime show (often huge stars, like Prince or the Black Eyed Peas) are not paid to appear.

One reason the performers agree to do this, of course, is the terrific publicity. In fact, the halftime show can have more viewers (per minute) than the Super Bowl itself. So one might wonder: how useful is this publicity? How many new listeners does it get you? Does it only help if you’re already an established performer, or does it help up-and-coming artists as well? Does it help you even if your music isn’t very, so to speak, mainstream?

Let’s look at some data from the latest Super Bowl (XLVI, if you’re counting (in Roman)). The main performer for the halftime show was Madonna, who’s been releasing music for 30 years and has released plenty of well-known (and well-loved) singles. At the time, she was about to come out with a new, persona-defining album, MDNA, her first in four years. However, Madonna brought several other musicians on stage with her. In approximate order of decreasing seniority, there was Cee Lo, the rapper-turned-pop-crooner; M.I.A., the electro-pop agitator; Nicki Minaj, the energetic, attention-grabbing rapper/provocateur; and LMFAO, the humorous “party rock” duo.

How did the Super Bowl halftime show affect each of these artists’ listening numbers? One way to look at this is to examine a music-tracking service’s listening charts for the weeks before and after the Super Bowl. (Here, for example, are Madonna’s charts for the week preceding the Super Bowl, showing number of unique listeners for her top songs.) We can plot how her top songs do before and after the Super Bowl:


As we can see, a few things happened. First, two days before the Super Bowl, Madonna premiered a song from her new album, “Give Me All Your Luvin’;” she performed the song with Nicki Minaj and M.I.A. at the halftime show. This song had a major publicity push separately from the halftime show performance, so it’s not too surprising to see it shoot up in listeners. (In fact, it reached over 11,000 listeners even before the Super Bowl.) However, all of Madonna’s other top songs, from 1984’s “Like a Virgin” to 2005’s “Hung Up,” saw a boost in listeners after the Super Bowl, too. The songs with the biggest bumps in listeners (“Like a Prayer” and “Vogue,” each with 50% increases) are the songs that were performed in the halftime show.

What about the other performers?


Cee Lo Green saw runaway success with his 2006 song “Crazy,” as part of the group Gnarls Barkley; more recently, he’s had a lot of success with his 2010 song “F••• You” (played on the radio as “Forget You”). This song picked up slightly after the Super Bowl, but otherwise, his listenership was not largely affected. Why? This warrants further analysis. I think it’s a combination of two things: Cee Lo was not presenting any new music at the Super Bowl, but at the same time, Cee Lo has had fewer major singles, so fewer people have an easily-accessible Cee Lo song already on their computer. Compare that with Madonna: the Super Bowl may have gotten more people to listen to her new song, and in addition, many people already had Madonna songs in their music libraries, and the Super Bowl performance was a prompt to listen to those songs again.


M.I.A. had a more productive Super Bowl than Cee Lo. Like Madonna, she released a new song in the week before the Super Bowl; though this song wasn’t performed, it did see a quick rise in the week before and the week after the Super Bowl. Also, like Madonna, M.I.A.’s other top songs saw rises in the week before and the week after the Super Bowl, with a decline in listeners afterwards. It’s been a few years since M.I.A.’s last album, so maybe people who have that last album were reminded of it — and of their excitement about M.I.A. as an artist.


This idea doesn’t seem to hold for Nicki Minaj. She’s an up-and-coming musician, one who’s built buzz through many singles distributed over the Internet. She did release a new single a little after the Super Bowl, and that release met with success; however, her other songs did not see any bumps in listeners. Was this because people were less excited about her halftime performance? Or did she get less of the spotlight in the show? (You could argue M.I.A. had an unfair advantage, courting controversy with a digital malfunction.)


Finally, there was LMFAO, a band with only a few singles out, but with one that’s built a fair amount of buzz. How much did they capitalize on the halftime show? As a percentage, they only got a small bump for their top song, though it’s important to note that more people are already listening to that song than any of the other artists’ songs. However, their other songs did not see a bump at all. It might be that the appeal of LMFAO is fairly specific; of Super Bowl watchers, maybe only a small fraction of those seeing the LMFAO performance were intrigued to hear more.

There are many factors involved in how an audience reacts to hearing an artist’s song; unraveling the importance of these factors requires more data than just listening figures for a few artists’ songs before and after a single event. However, looking at these numbers can suggest potential targets for larger-scale analysis. Data analysis rarely (if ever) exists in a vacuum; developing a sense of the system being studied is an important part of reaching statistically meaningful conclusions.

Notes on the data

As always, drawing conclusions from data requires a good understanding of your data, a lot of care, and good controls. Doing this is outside the scope of this blog post, but it’s important to at least mention some limitations. First, there’s the geographic issue: the service behind these charts is a British company, and it promotes itself most heavily there, while the Super Bowl is a primarily North American event. The service certainly has lots of users in the U.S., and to analyze the effects of a primarily American marketing event, it would be best to limit the analysis to the effects on American listeners. There’s also, as always, the issue of demographics: I would guess that the service’s demographics skew younger than the overall demographics of Super Bowl viewers. (LMFAO’s numbers might be evidence to that effect.) In terms of the source of the data, the service measures plays in a variety of ways, but it is probably dominated by people listening to MP3 files on their computers and by people using on-demand streaming services like Spotify. The halftime show probably has a more complicated effect on what gets played on the radio, which is another factor that can translate to album sales (and which the service largely doesn’t measure). Finally, there’s the question of what data are easily available online, versus the data that the service has internally but doesn’t make easily accessible. The data available online are counts of the number of unique listeners to each song in a week; the data do not tell you how many times each of these people listen to the song. These numbers are likely very different for artists with many songs (Madonna) and artists with few (LMFAO). Frequency of listening may be a useful indicator of interest in or loyalty to an artist, but these data are not available through the Web.

Scientific 3D animation of bump hole example

“Powering the Cell: mitochondria” and the “Inner Life of the Cell”, videos produced and distributed by XVIVO, are two of the most sensational examples of modern scientific animation.  An ensuing story in the New York Times solidified for me that scientific animation was not only a blooming industry producing eye candy for fund raisers but was also an active area of research and the beginnings of a community trying to push the boundaries of scientific communication.  This inspired me to get a copy of Maya (an industry-leading 3D animation software package freely available to the academic community) and get playing.

Even with all the inspiration from personal heroes Gael McGill and Drew Berry, the short animation above was the most style I could muster with my feeble newbie Maya skills. The subject of this video is a particular example of the bump hole method, pioneered by Kevan Shokat [paper], which in general refers to the strategy of introducing a genetic mutation to a native enzyme in such a way that the enzyme can catalyze a specific reaction between the native substrate and another molecule. In this case, the lab at Memorial Sloan Kettering Cancer Center (MSKCC) at whose request I made this video used the bump hole method to enable a transferase reaction that could attach a label to a substrate. For a more detailed narrative of the video, please see the end of the post.

The stated purpose of this animation was to replace a simple 2D schematic illustration of the entire problem and its solution.  The video is intended to accompany a live presenter that will narrate the video.  Its primary function as a schematic allows us flexibility to deviate from scientific accuracy, when necessary.  The scientific accuracy here is limited to the shape of the molecules and the positioning of the substrate and cofactor relative to the enzyme.  All colors (obviously), transparencies and glow effects are for illustration.  Furthermore, the magnitude of the mutation’s effect on the enzyme geometry is greatly exaggerated to clearly show a perceptible change in the enzyme structure.  Walking the line between scientific representation and interpretation is something all scientific animators will have to deal with, and the rules for scientific integrity and responsibility in this arena are still up for discussion.  I hope that here I don’t exemplify any egregious violation.

A few tidbits for the interested. I used the free and brilliant Maya plugin for molecular animators called molecularMaya. As far as I understand, molecularMaya is the brainchild of Digizyme owner Gael McGill (and his super friendly and helpful team). It allows for automatic importing of pdb files from the pdb website or locally on your machine. The plugin provides a set of menu options for viewing the protein as a set of atoms, a mesh, or ribbon. Each viewing mode is coupled to different style options. For example, I used a mesh resolution of 1.714 for the enzyme to show more detail. The mesh resolution does not, as far as I am aware, translate to an Angstrom resolution. molecularMaya is due for a much anticipated new release and I believe it will be significantly more than just a few new features. One feature that I would really like would be the ability to select individual residues. I trust additional representations such as beta-sheet and alpha-helix cartoons will be included.

I hope to extend the utility of this animation with interactive labels and overlaid figures to supplement the content with scientific evidence. In a dream world, scientists will be communicating with each other and to the public through such interactive media. I expect also that 3D animations can be a valuable part of that media experience. New presentation modalities are here and new ways of learning need to be explored. We might as well also have some fun with it.

I apologize in advance for the generics, but the specific names and information about the enzymes, substrates, cofactors and mutations are privileged until publication.  The characters of this animation include a ‘blue’ enzyme, a ‘red’ substrate, and a ball-and-stick model of a cofactor that consists of a base and a clickable moiety, which I’ll refer to as the tag finger.

Scene 1 begins with an introduction to the native enzyme and its substrate. Scene 2 introduces the cofactor and its constituent parts.

Scene 3 consists of a demonstration of the problem, which is that the full cofactor does not bind to the native enzyme. We tried to use the effect of the cofactor bouncing off the enzyme to clearly illustrate that the cofactor does not fit. Scene 3 continues with the placement of the cofactor in its intended position. Here we use a simple rotation to get a better view and a transparency on the enzyme mesh to give the viewer an idea of where the cofactor sits relative to the native enzyme. As the transparency goes away and the mesh returns to opaque, we see that the cofactor’s tag finger is no longer visible. We hope this clearly suggests that this part of the cofactor doesn’t fit the native enzyme structure. Scene 3 concludes with a slow morph from a representation of the native enzyme into a representation of the mutated enzyme. This part should conceptually explain that the effect of the mutation is to ‘make room’ for the cofactor tag finger. In this part of scene 3 we take an artistic license and deviate from scientific accuracy. We exaggerate the size of the hole, since at that mesh resolution, the deleted residue would not otherwise be noticed.

Scene 4 shows the consecutive binding of the substrate and cofactor to the enzyme followed by an artistic (non-scientifically accurate) representation of the reaction carried out by the transferase, where the tag finger breaks from the cofactor and attaches to a specific lysine residue on the substrate. I use a glow effect to represent the start of the reaction. Why? Scientists love glow effects, don’t we?

Scene 5 is the money shot. It shows the individual components breaking off after the reaction, with special emphasis on the newly tagged substrate.

A Data-mined History of Pop Music

Using the wisdom of crowds to tell a story about music genres

Music genres are serious business: the source of debate, speculation, fights, and of course, mockery. What seems like a fairly clear-cut concept in a record store is less clear when debating with your friends whether Brian Eno makes electronica or ambient music, or what kind of hip-hop this Kid Cudi album is (if it’s hip-hop at all). Or even worse, how do genres link together: is hip-hop a descendant of R&B, or are they both sibling children of soul music? Is indie folk closer to 60s and 70s folk music, or to indie rock (or are all three just branches of the same limb stretching back to the blues)?

The goal of this post is to take advantage of some of the available social data (in this case, user-applied song tags) to form a sort of consensus on music genre classification. This isn’t meant to produce an authoritative ground truth on music classification (I doubt such a thing could exist), but rather to try to get at the most widely-held conception in a somewhat objective and perhaps novel way.

note — I saved the technical details for the end; if you want to read them before seeing the results, skip to the Mining Details section below

Pop Music Genre Tree

As my source of data, I took the most common genre-related tags applied to songs from the Whitburn project. To work out the relationships between all these tags (and by extension the genres themselves), I used some phylogenetic software to produce a family tree of tags. The logic of using phylogenetics algorithms for this is explained in the Mining Details below. Here’s the tree, with colors and (terrible) labels added by me:


This tree serves two purposes: it works as a map from the varied and whimsical landscape of social tags onto a concise and recognizable group of genres, and it also reveals some surprising insights about how genres (are perceived to) actually relate to one another.

For instance, the R&B tags seem to cluster into two groups – a 70s and 80s R&B closely aligned with soul music, and a later R&B aligned with hip-hop. It’s also surprising that country music seems to cluster very closely to folk rock and southern rock, both genres I expected to see closer to the pure rock camp. Speaking of which, a few other genres I associate with rock (soft rock / ballads, alternative / punk / grunge, and pop rock) defied expectation by branching out on their own rather than falling under the rock umbrella.

Less surprising was the close association of electronica with other dance music including disco, and the very broad nature of the rock genre (which includes classic rock, hard rock, psychedelic rock, glam rock, progressive rock, etc.).

One caveat — I do expect the exact structure of this tree to be somewhat sensitive to things like which songs are included in the dataset. Still, even if slightly rearranged versions of the tree are valid themselves, that really doesn’t make this less valid, as it’s still a representation of genre relationships based on input from perhaps millions of users.

Pop Genres Through History

Having a sensible map of social tags to song genres also gave me the chance to take a look at pop history: the growth, the decline, and in some cases the resurgence of genres over time.


Taking a look at the number of songs associated with each derived genre over time reveals a few cool things. The first thing to notice is that the total number of tagged songs each year varies quite a bit, from a few in 1920 to a few hundred by the 1980s. Though the number of songs in the Whitburn project does vary a little from year to year, most of this variation is due to a lot of songs, especially old songs, just not being tagged or even present in the tag data. This means some (real) genres of music are completely absent; after all, the people doing the tagging live in the 20th century and listen to digital music, which for better or worse does not include old gospel recordings of Homer Rodeheaver or ragtime covers by the US Marine Band (though I’m sure a few people will be saddened by the lack of tagged Broadway showtunes). I prefer to take this as a reminder that history (or maybe I should say culture) is in the eyes of the beholders. When we think of music of the 30s, we think of blues and jazz, and that’s what is represented here.

Of the songs that are tagged, a few interesting patterns emerge. First, except for the explosion of rock and soul in the late 50s / early 60s (fairly quickly after the respective introductions of the two genres), most genres seem to grow at the expense of others. The growth in hip-hop and alternative music in the late 80s / early 90s coincides with the decline of rock (and to a lesser extent dance and soul music) in the same period. Second, just because a genre of music is down doesn’t mean it’ll stay down: country / americana might have looked like it was on its last legs by the late 80s, but by the 2000s it actually had a bigger market share than ever.


Normalizing the songs per year to produce a genre ratio plot makes a few things a bit more visible. One of these is that out of all these genres, the one with the best longevity seems to be soul music, though I do have to qualify that somewhat, as the tag “soul” is pretty ambiguous, so I might be picking up some songs that are just soulful without being soul.
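For readers who want to try the normalization themselves, here is a minimal sketch of turning per-year song counts into per-year genre ratios. The counts below are made up purely for illustration (they are not the Whitburn numbers), and the function name `genre_ratios` is my own.

```python
# Made-up counts for illustration only: year -> {genre: number of tagged songs}.
counts = {
    1960: {"rock": 30, "soul": 10},
    1990: {"rock": 20, "hip-hop": 20, "soul": 10},
}

def genre_ratios(counts):
    """Divide each genre's count by that year's total so every year sums to 1."""
    ratios = {}
    for year, by_genre in counts.items():
        total = sum(by_genre.values())
        # A year with no tagged songs gets an empty entry instead of a division by zero.
        ratios[year] = {g: n / total for g, n in by_genre.items()} if total else {}
    return ratios

ratios = genre_ratios(counts)
for year in sorted(ratios):
    print(year, ratios[year])
```

Dividing by the yearly total removes the effect of some years simply having more tagged songs, which is why trends like the longevity of soul are easier to see in the ratio plot.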

Finally, I do have to point out that tagging each song as a member of a single genre only gives part of the story: a lot of songs are tagged as members of several genres. For the curious, out of this dataset the artists with the most genre-spanning power were Prince, Phil Collins, Peter Gabriel, and Michael Jackson. Taking a closer look at genre blending and fusion will most likely be the topic of a future post.

Mining Details (for the curious)
My basic strategy for this analysis was to link up two pieces of data. The first was pop music charts, and the second was the social tags associated with these songs. The second piece was straightforward to obtain thanks to a well-maintained API, but the first required some curated and maintained dataset. My original plan was to use the publicly-available data in the Billboard Charts API to gather a list of popular songs over the last century. Sadly, as of right now the service is completely broken and useless. But where Billboard’s effort falls short, the Whitburn project managed to make up for it by releasing a meticulously gathered and annotated list of 37,000 chart-hitting songs since the 1890s.

Here are the most common tags for the Whitburn project songs represented as a word cloud (I highlighted genre-specific tags in red):


The first thing to notice (which will be nothing new to people who work with this kind of data professionally) is that the tags are, for lack of a better term, “messy”. For instance, there are about eight different tags for R&B, including the alternative spelling “rhythum and blues tag”. Several tags are ambiguous — does “soul” mean that the song is in the genre of soul music, or that the song is soulful? Since this is social data, we have to contend with people using a single tag for more than one meaning, and using different tags to mean the same thing.

Rather than simply letting it be an annoyance though, the idea here was to treat ambiguity itself as a source of information. Grabbing the 100-odd common tags that have to do with genre, I labelled each by which songs have that tag. I admit this sounds somewhat backwards; to use a metaphor, we can think of each genre as having a sort of genotype, a sequence that defines it. To get that sequence, I look through the set of songs and mark down 1 where that tag is mentioned and 0 where it is not (this means that the songs are basically being treated as alleles).
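To make the “genotype” idea concrete, here is a small sketch of how such 0/1 sequences could be built. The song names, tags, and the `build_genotypes` helper are all hypothetical stand-ins, not the real dataset or my original script.

```python
# Hypothetical input: each song mapped to the set of genre tags users gave it.
song_tags = {
    "song_a": {"rock", "classic rock"},
    "song_b": {"soul", "rnb"},
    "song_c": {"rock"},
    "song_d": {"soul"},
}

def build_genotypes(song_tags):
    """Return (songs, genotypes), where genotypes[tag] is a 0/1 list over songs."""
    songs = sorted(song_tags)  # fixed song order, so positions line up across tags
    tags = sorted({t for ts in song_tags.values() for t in ts})
    genotypes = {
        tag: [1 if tag in song_tags[s] else 0 for s in songs]
        for tag in tags
    }
    return songs, genotypes

songs, genotypes = build_genotypes(song_tags)
for tag, row in genotypes.items():
    # One line per tag: the tag name followed by its sequence of 0s and 1s.
    print(f"{tag:>12} {''.join(map(str, row))}")
```

Each row is one tag’s sequence, and each column is one song, which is the layout phylogenetics tools typically expect for binary character data.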

To help visualize, here’s a raster image of a section of this “genotype” map. For each genre tag (y-axis) there’s a mark if the song on the x-axis has been tagged with that genre.


The first thought that comes to mind looking at this kind of data is to use a standard clustering algorithm (e.g. hierarchical clustering or PCA followed by k-means) on it to find groups of related tags. The problem with that is coming up with a sensible distance metric: one that puts a large distance, say, between two rarely-used tags with few overlapping songs, but also puts a small distance between a common tag and a rare tag whose songs overlap with the common tag (i.e. its parent).
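To illustrate why the choice of metric matters, here is a rough sketch comparing two standard set-based distances on toy data. Neither is what I ended up using; the Jaccard distance and the overlap coefficient (Szymkiewicz-Simpson) are just familiar examples, and the song ID sets are invented.

```python
def jaccard_distance(a, b):
    """1 - |A and B| / |A or B|: penalizes any difference in set size."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def overlap_distance(a, b):
    """1 - |A and B| / min(|A|, |B|): zero when one set contains the other."""
    smaller = min(len(a), len(b))
    return 1 - len(a & b) / smaller if smaller else 1.0

# Invented song ID sets: a common tag, a rare tag nested inside it,
# and an unrelated rare tag.
common = set(range(100))
rare_child = {0, 1, 2, 3, 4}
rare_other = {200, 201, 202}

print(jaccard_distance(common, rare_child))      # large, despite full containment
print(overlap_distance(common, rare_child))      # small: the rare tag sits inside the common one
print(overlap_distance(rare_child, rare_other))  # large: no shared songs
```

A size-sensitive metric like Jaccard pushes a rare child tag far away from its common parent, while a containment-style measure keeps them close. The trade-off is that a containment measure is no longer a natural fit for ordinary clustering, which is part of what motivated looking elsewhere.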

This is actually where the genotype metaphor came in handy. I simply took it literally, and used an algorithm developed by evolutionary biologists that does exactly what I want: produces a tree of the relationship between the tags assuming that losing songs from parents to children is common, but gaining songs is very rare (for the even more technically-curious, I produced the maximum parsimony character tree for the genre tags by taking the consensus tree for 100 bootstrap rounds). Once I had the tree, using it to classify songs based on their tags was straightforward.
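As for the classification step, one simple way it could work is a majority vote over the genres that a song’s tags map to. The sketch below assumes that approach; the `TAG_TO_GENRE` table is a hypothetical fragment standing in for the labels read off the tree, and the alphabetical tie-break is an arbitrary choice for determinism.

```python
from collections import Counter

# Hypothetical fragment of a tag -> derived genre table read off the tree.
TAG_TO_GENRE = {
    "classic rock": "rock",
    "hard rock": "rock",
    "soul": "soul",
    "motown": "soul",
    "rap": "hip-hop",
}

def classify(tags, tag_to_genre=TAG_TO_GENRE):
    """Return the genre most of the song's tags point to, or None if no tag is known."""
    votes = Counter(tag_to_genre[t] for t in tags if t in tag_to_genre)
    if not votes:
        return None
    best = max(votes.values())
    # Break ties alphabetically so the result does not depend on iteration order.
    return min(genre for genre, n in votes.items() if n == best)

print(classify({"classic rock", "hard rock", "soul"}))
print(classify({"seen live"}))
```

Tags outside the table (such as non-genre tags) are simply ignored, so a song with only unrecognized tags stays unclassified.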

No S**t Sherlock

The popularity of infographics, particularly interactive maps, is on the rise in popular culture and daily life.  Recently, I’ve seen glimpses of boring information, in graphical form, popping up in popular entertainment. Today, I’ll talk about the British TV series ‘Sherlock’. Note: this post would be better with a nice action, instrumental soundtrack. I’m just saying.

Who doesn’t love a good chase scene?  Cops jumping over buildings, secret agents commandeering motorcycles in pursuit of thuggish Russian agents, Wile E. Coyote mounting an ACME rocket in pursuit of that dastardly Road Runner. What can be better than well-shot sequences of gun shots, jumps, swinging, running, ducking and breaking through glass?  What’s missing?  I think the answer is ‘context.’

Single-perspective views cannot capture both the chaser and his or her prey.  It’s not always easy to know if either is making a good move to catch or evade their competitor.  Enter the infographic.

In the pilot episode of ‘Sherlock’ (‘A Study in Pink’), Sherlock and his reluctant sidekick Watson are in pursuit of a cab through the streets of London.  The cab is bound to stay on roads.  The heroes are on foot but are free to climb stairs, jump across buildings, cut through gardens and romp through small alleyways.

The scene is populated with cut sequences of cars whizzing and our heroes moving in a totally non-overlapping geographical space.  To help the viewer keep track of this scene, the show displays a map of the London neighborhood (Frame 1), with the baddies in red and the goodies in green (can it be any other way? I maintain that it cannot).  These graphics help provide strategic context by illustrating Sherlock’s thought process and update the audience with the geographical history and future of the chase scene.

Starting at around the intersection of Ingestre Pl and Hopkins St, Sherlock deduces that the cab is unlikely to take the route along Wardour to Broadwick as in Frame 1 and rather assumes the cab will take Warwick to Wardour Street (Frame 2).  The purple dot in Frame 2 presumably indicates Sherlock’s intended point of intersection.

Running and jumping ensues.  I have no idea exactly where they are, but probably on the roof of someone who lives near the intersection of Broadwick and Poland Street (not shown).  Poor bastards.  Sherlock and Watson arrive on D’Arblay St, but oh no, they’ve missed them.  Frame 3 clearly shows that the green line overlaps with the red, and we’re left to assume that they did not reach the same spot at the same time.  Sherlock picks a new intersection point (see purple dot), and instead of following the cab by turning left along Poland Street, he turns right.  The combination of the physical actors prancing off in the wrong direction and the clear graphic helps the audience understand what is going on, and thereby CARE about all the action.

Running and trampling of nicely landscaped gardens ensues.  Frame 4 is the final graphic we are given. It shows only slight progress over Frame 3.  This is the least informative frame. More running, and then, BAM, Sherlock has his man, thus demonstrating that logic, deduction, and an unnatural amount of cartographical memory, coupled with obscene amounts of mundane municipal construction knowledge, will triumph over cab drivers (who don’t even know they’re being chased).

Success: These map frames definitely help the viewer understand the logic and agency in Sherlock’s pursuit.  Without them, it would appear to be gratuitous running and jumping and a fortunate and inexplicable interception of his target.  We see glimpses of why Sherlock assumes the cab will take one route over another. We also see two different instances of where Sherlock wants to intercept his target.  This makes the running and jumping over buildings meaningful.  We also see, in sort-of real time, the progress he’s making whilst running and jumping, thus making it more suspenseful.

5 Simple Herb Garden Designs to Get You Started

When you begin to think about growing your own herbs it can be confusing as there are so many different ones that you could begin with. There are also many garden designs that you could choose from. So, with that in mind, here are 5 simple herb garden designs to get you started.


1. Kitchen Garden

A kitchen garden is a very popular choice for most end users. If you intend to use your fresh herbs to flavour your cooking, it makes sense to plant your kitchen garden close to the kitchen door. Then you can just go outside and pick your fresh herbs instead of having to trek to the other side of a large garden. Useful kitchen herbs are parsley, oregano, chives, thyme and sage. But, again, the choice of herbs that you plant will depend on the herbs that you will use the most in your cooking.

2. A Medicinal Garden

If you are interested in herbal remedies and herbal teas then a medicinal garden would be an ideal herb garden design for you. If space is limited choose the plants very carefully. Make sure that you only plant the herbs that you will use. Mint teas are very popular as a digestive aid but, if you plant mint, make sure that the roots are contained in a pot of its own before planting the entire pot as the roots will soon take over the whole area.

3. Themed Garden

You could plant your herbs using a theme such as color, aromatics or even height. Lavender would make a lovely display for an aromatic themed herb garden and there are hundreds of varieties each giving off a lovely aroma.

4. A Formal Garden

If you have enough space you could plant a formal herb garden that could be a talking point for years to come. To begin, sketch your design on paper, then, when you are happy with your design, mark it out in the actual garden before planting to make sure that you have exactly the look that you want. Use privet or box as a hedging around the herb garden. This will need to be kept short by trimming regularly, which will also encourage the individual hedging plants to grow thicker.

5. A Regimented Theme

If you plant your herbs in regimented rows, perhaps along the edge of a path or around the edge of a flower garden, they look decorative as well as being useful. Some of the aromatic herbs will give off a beautiful aroma when they are brushed against as you walk along the paths. Planting along the edge of a path is also a very useful way to plant herbs if the available space is limited.

The only limit to your herb garden design is your imagination. Just give some thought as to the ultimate use of the herbs that you plan to grow and you can design an herb garden to suit your individual needs. I hope that the 5 simple herb garden designs to get you started that I have listed above will help give you some ideas for your herb garden design.

Gardening Supplies

Gardening Supplies include many different items. Some of these items include Lawn Mowers. Lawn Mowers come in many different designs and styles.

Lawn Mowers are a huge part of Gardening Supplies. Gardening Supplies also include items like gardening books, chain saws, live animal traps and ratchet pruners. Alongside mulching lawn mowers, leaf vacuums, leaf blowers and other lawn care equipment, these can help in the upkeep of the garden.

Gardening Supplies also includes items like compost and fertilizer. Fertilizer stimulates growth in plants. Compost keeps the soil fertile and manageable.


Water Garden Supplies

Water Garden Supplies include items like pumps and tubing. Small water garden pumps are needed for the flow of water. This keeps the water fresh and clean.

Water Garden Supplies include pond liners. A small, cheap preformed plastic pond liner will keep everything together. For ponds too large for preformed liners, you have no choice but to purchase a flexible liner instead, and form your own walls.

Water Garden Supplies include sand. The sand will supply adjustable flooring for your preformed water pond liner. This will come in handy when you attempt to get your pond liner to sit level in its hole.

Outdoor Garden Fountains

Outdoor Garden Fountains can add a touch of style and elegance to any yard or garden. They add immense value to your house. Neighbors and Guests will also be instinctively drawn to a fountain.

Outdoor Garden Fountains are beautiful additions to any garden. Water features are wonderful additions to the garden. They give a feeling of lushness and create a habitat for all kinds of creatures that will patrol your garden for pests.

Outdoor Garden Fountains can transform a garden into something amazing. Ornate water features are sufficiently eye-popping to serve as focal points. In landscape design, focal points can make or break a project, so you’re spending your money wisely.

Garden Tool Set

Garden Tool Sets include many items and make a great gift. They usually have a garden fork, a garden trowel, a garden cultivator and a garden cutting tool. There may be more or fewer tools depending on what kind of set you buy.

Garden Tool Sets can save you money because you buy everything at once. A garden trowel is ideal for weeding, digging, and transplanting. This is an essential tool for any gardener.

Garden Tool Sets make great gifts. Many types of people garden, so it can be considered a safe gift. In essence whoever gets it will appreciate it.

Garden Power Tools

Garden Power tools include items like Chain-Saws and tillers. They can do a lot of the heavy duty work regular tools can’t. Always use caution when using any Power tools.

Chain Saws are a great member of the Garden Power Tools lineup. They can cut through most anything. They also let the machine do most of the work for you.

Garden Power Tools also include tillers to break up ground. Tillers can turn a day’s work into mere hours. Again, always practice safety around power tools, because they can be dangerous if used improperly.

Solar Garden Fountain

Solar Garden Fountains are really good for the environment, and are really cool to have at your house. They save money because they do not use electricity. They also do not have a ton of wires running around everywhere.

Solar Garden Fountains and garden statues save you money because they do not use electricity. They are powered for free by the sun. Make sure you have good solar panels.

Solar Fountains also have the benefit of not having wires sticking out everywhere. This helps make your pond look its best. So don’t delay: get a Solar Garden Fountain today.

Garden Pond Supply

Garden Pond Supplies include many items to keep your pond running smoothly. If you stick with a small water feature for this project, you shouldn’t have to sweat the choice of pumps that much. Water pump manufacturers recommend that the water in a small pond be turned over between once every two hours and once per hour.

Garden Pond Supplies will keep your pond in tip top shape. Sand will supply “adjustable flooring” for your preformed water pond liner. This will come in handy when you attempt to get your pond liner to sit level in its hole.

Garden Pond Supplies can make caring for a pond a lot simpler. Optional backyard pond supplies include rocks, plants and additional statuary. The rocks would be an ornamental feature, to be placed in and around the artificial pond.

Garden Supplies Online

Garden Supplies Online can be a great way to find the best prices for Garden Supplies. Most planting will require you get down on your knees with a trowel. Steel blades will last longest.

Garden Supplies Online can really find you some great deals. Garden forks are slightly shorter and thicker than pitch forks. The strongest have square, rather than flat tines.

Garden Supplies Online gives you the luxury of buying items from the comfort of your house. This also gives you the opportunity to compare prices against one another. By doing this you ensure you are always getting the best deal.


Garden Tools, Fountains, and Supplies


Most people who have a house love to garden. Whether it is the general upkeep of pulling weeds or more large-scale designs, we are here to help you in whatever way possible.

The one thing most people can agree on is the nuisance that pests and bugs are. They eat your plants and/or crops and they are an eyesore. Try to stay on top of the problem by looking for warning signs and acting accordingly.

If you have your heart set on growing a specific plant, check to see what growing conditions it requires. Vegetables will need at least 6 hours of sun exposure a day. The same goes for most flowering plants; however, there are still many to choose from for a partially shaded site. If you want to start a garden where there is mostly shade, your choices are going to be more limited, but not prohibitive.

Garden design is the art and process of designing and creating plans for layout and planting of gardens and landscapes. Design may be done by the garden owner themselves, or by professionals of varying levels of experience and expertise. It takes a lot more planning than putting a few garden statues in your yard. Most professional garden designers are trained in principles of design and in horticulture, and have an expert knowledge and experience of using plants.

Some professional garden designers are also landscape architects, a more formal level of training that usually requires an advanced degree and often a state license. Many amateur gardeners also attain a high level of experience from extensive hours working in their own gardens, through casual study, serious study in Master Gardener Programs, or by joining gardening clubs.

Take into consideration when the sun hits your site. Afternoon sun will be hotter and more drying than morning sun. Many plants turn their faces toward the sun, so if your view of the garden is from a west window, your flowers may face away from you in the afternoon. Evaluate other elements of exposure such as high, drying winds or heavy foot traffic. A greenhouse can help with this issue. Also, any outdoor birdhouses should be in the shade.

Once you know where you’d like to try your first garden, use a hose or extension cord to try laying it out on the ground. Garden owners have shown an increasing interest in garden design during the late twentieth century, both as enthusiasts of gardening as a hobby, as well as an expansion in the use of professional garden designers.

Garden Tools

Garden Tools are great for keeping your garden in the best shape possible. You will want a heavy metal rake. These are long and straight with teeth about 3″ long. They are necessary to smooth out newly tilled garden soil and break up clumps.

Yard rakes are a valuable member of Garden Tools and will help you get leaves out of your gardens. A narrow rake can maneuver around plants easier, but a wide rake makes quicker work of leaves.

Without Garden Tools, gardening really can become a chore. Plants, soil and compost all have to get to your garden somehow. That’s why you will need a wheelbarrow.

Garden Supplies

Garden Supplies consist of many different items. Garden shovels have round, pointed blades. They’re absolutely necessary for moving soil, digging holes and planting.

Garden Supplies are needed for transforming your garden. Pruning, deadheading and shaping plants goes on all year in the garden. Good pruners will not only make your job easier, they will also make a cleaner cut on the plants and not tear or rip.

Garden Supplies can make great gifts too for all genders and ages. Hoes can make quick work of weeds. They can also be used to break up soil that isn’t too compacted.

Garden Fountains

Garden Fountains are beautiful additions to any garden. Water features are wonderful additions to the garden. They give a feeling of lushness and create a habitat for all kinds of creatures that will patrol your garden for pests.

Garden Fountains can transform a garden into something amazing. Ornate water features are sufficiently eye-popping to serve as focal points. In landscape design, focal points can make or break a project, so you’re spending your money wisely.

Garden Fountains need to be taken care of often. They can add value to your house. People will also be instinctively drawn to a fountain.