Reconstructing the structure of the world-wide music scene with last.fm

Technical details

I used many open source programs to produce these visualisations, and I am grateful to the authors of this software for saving me a whole lot of time:

Python. Practically all the scripts used for downloading the data, calculating the layout and drawing the visualisations were written in Python at the top level, although I accessed many libraries written in C via their Python interfaces.
Cairo - a 2D graphics library that supports multiple output devices from PNG to Quartz.
Audioscrobbler - the source of the similarity data I used.
SQLite and MySQL - the database backend. I started experimenting with SQLite and then set up a MySQL database later when the data file grew too large and it took ages to insert a new record.
The igraph library - a C library for the analysis of large networks. Shameless advertising on my side, since I am one of the authors :) igraph was used for all graph-theoretic calculations (e.g., centrality scores of the edges).
The DrL graph layout algorithm - this algorithm is responsible for arranging the vertices of the graph in a way that pleases the eye. (Well, at least in a way that pleases my eyes). Theoretically it also has the ability to add new vertices to a previously drawn graph, which would be useful for creating time-lapse movies of the evolution of the music network, but I doubt that I have enough computing power to track that continuously. (Maybe the guys at last.fm will do that sooner or later).

Data collection

All the data I used were obtained via the Audioscrobbler web API. I set up a simple database with four tables. One held the list of artists and another the list of tags encountered during my crawl. The remaining two contained the similarity relations between pairs of artists and the associations between artists and tags.

Data were collected in a simple breadth-first manner starting from Nightwish. I decided to skip artists that did not have a MusicBrainz ID or a profile picture. This was the easiest way to get rid of mistagged artists, although it admittedly also removed legitimate musicians who are not well known enough to have their own MusicBrainz IDs. (This was the fate of Stonehenge, one of my favourite Hungarian bands :( ). Another drawback of this approach is that artists who are not reachable through similarity paths from Nightwish are not included in the database. What I obtained is most probably the largest strongly connected component of the last.fm music network, or something close to it.

Since the network is constantly evolving (similarity strengths change over time, new artists get MusicBrainz IDs and so on), and the data collection took slightly more than a week (in order not to overload the Audioscrobbler servers), the dataset is not an exact snapshot but spans about a week. The crawler keeps scanning the network and updating links and tags after the initial crawl has finished (that is, after every artist in the largest component has been visited at least once), so in theory it could track the dynamics of the network, but I do not have the computing power and storage space yet.
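The four-table layout and the breadth-first crawl can be sketched roughly as follows. This is not the original crawler: the table and column names are made up for illustration, artist names are used as keys for brevity, and fetch_artist is a hypothetical stand-in for the web API call (it returns None for artists that should be skipped, for example those without a MusicBrainz ID).

```python
import sqlite3
from collections import deque

# Illustrative schema: artists, tags, artist-artist similarities, artist-tag links.
SCHEMA = """
CREATE TABLE IF NOT EXISTS artists (name TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS tags (name TEXT PRIMARY KEY);
CREATE TABLE IF NOT EXISTS similarities (
    source TEXT, target TEXT, weight REAL, PRIMARY KEY (source, target));
CREATE TABLE IF NOT EXISTS artist_tags (
    artist TEXT, tag TEXT, PRIMARY KEY (artist, tag));
"""


def crawl(seed, fetch_artist, db):
    """Breadth-first crawl starting from `seed`.

    `fetch_artist(name)` is a hypothetical callable returning either None
    (skip this artist) or a dict with 'similar' ({name: weight}) and
    'tags' (list of tag names, most important first).
    """
    db.executescript(SCHEMA)
    queue = deque([seed])
    seen = {seed}
    while queue:
        name = queue.popleft()
        info = fetch_artist(name)
        if info is None:
            continue  # filtered out, so its neighbours are not followed
        db.execute("INSERT INTO artists (name) VALUES (?)", (name,))
        for tag in info["tags"]:
            db.execute("INSERT OR IGNORE INTO tags (name) VALUES (?)", (tag,))
            db.execute(
                "INSERT OR IGNORE INTO artist_tags (artist, tag) VALUES (?, ?)",
                (name, tag),
            )
        for other, weight in info["similar"].items():
            db.execute(
                "INSERT INTO similarities (source, target, weight) VALUES (?, ?, ?)",
                (name, other, weight),
            )
            if other not in seen:
                seen.add(other)
                queue.append(other)
    # Drop links that point to artists which were skipped.
    db.execute(
        "DELETE FROM similarities WHERE target NOT IN (SELECT name FROM artists)"
    )
    db.commit()
```

Passing the fetch function in as a parameter keeps the traversal logic separate from the network code, so the crawl can be tried out against a small in-memory dictionary before touching the real service.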

Calculations

Edge centralities

As mentioned on the main page, the edges on the visualisation are coloured according to their betweenness centrality scores. These scores were calculated using igraph in less than a day. I don't know the exact running time, since the whole thing ran as a background process, but the complete network had 74 805 vertices and 3 878 449 directed connections. I also tried colouring the edges according to the similarity score reported by the Audioscrobbler database, but since the similarity scores are re-scaled by Audioscrobbler so that the maximal similarity between an artist and any other is always 100, it did not look too appealing.
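For readers unfamiliar with the measure: the betweenness of an edge counts how many shortest paths between pairs of vertices pass through that edge, with each pair's contribution split evenly among its shortest paths. The real scores were calculated with igraph. The pure-Python sketch below only illustrates the idea using Brandes' algorithm on an unweighted directed graph, and it would be far too slow for a network of this size. The function name and the adjacency-dictionary input format are assumptions made for the example.

```python
from collections import deque


def edge_betweenness(adj):
    """Edge betweenness for an unweighted directed graph.

    `adj` maps every vertex to a list of its successors; every vertex
    must appear as a key, even if it has no outgoing edges.
    """
    score = {(u, v): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest vertices, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for v in reversed(order):
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                score[(u, v)] += share
                delta[u] += share
    return score
```

On the chain a -> b -> c, both edges carry two shortest paths each (a to b and a to c for the first edge, b to c and a to c for the second), so both receive a score of 2.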

Vertex colors

Vertex colors encode information about music genres. The genre of a given artist was inferred from the tags attached to that artist using a very simple algorithm. I stuck to the following mapping between tags and categories:

Rock = rock, classic rock, hard rock, indie rock, garage rock, emo, punk, post-punk, alternative, progressive rock
Pop = pop, indie pop, funk, latin, soul, rnb, r&b, jpop, j-pop
Metal = metal, progressive metal, metalcore, power metal, symphonic metal, black metal, doom metal, death metal, heavy metal
Electronic = electronic, electronica, electro, trance, house, techno, noise, drum and bass, dance, psytrance, ambient, chillout
Hip-hop and rap = hip-hop, hip hop, rap. (OK, that's way too simple - any suggestions on additional tags? There are still ~17000 unclassified artists)
Jazz = jazz :) The same comment applies here as well.
Country, folk and world music = country, folk, world, world music. Should I repeat myself again?
Classical music = classic, classical, classical music. Should I repeat myself again?
Reggae and ska = reggae, ska

For every artist, I took the list of top tags and classified the artist into the group in which the first tag in the tag list appeared. If the first tag was nowhere to be found, I moved on to the second tag and so on. There are still ~17000 unclassified artists; these did not possess a single tag from the above list.
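The first-matching-tag rule can be sketched as below. The tag lists are copied from the mapping above; the names GENRES, TAG_TO_GENRE and classify are made up for this example, and matching tags case-insensitively after trimming whitespace is an assumption rather than something the original scripts are known to do.

```python
GENRES = {
    "Rock": ["rock", "classic rock", "hard rock", "indie rock", "garage rock",
             "emo", "punk", "post-punk", "alternative", "progressive rock"],
    "Pop": ["pop", "indie pop", "funk", "latin", "soul", "rnb", "r&b",
            "jpop", "j-pop"],
    "Metal": ["metal", "progressive metal", "metalcore", "power metal",
              "symphonic metal", "black metal", "doom metal", "death metal",
              "heavy metal"],
    "Electronic": ["electronic", "electronica", "electro", "trance", "house",
                   "techno", "noise", "drum and bass", "dance", "psytrance",
                   "ambient", "chillout"],
    "Hip-hop and rap": ["hip-hop", "hip hop", "rap"],
    "Jazz": ["jazz"],
    "Country, folk and world music": ["country", "folk", "world", "world music"],
    "Classical music": ["classic", "classical", "classical music"],
    "Reggae and ska": ["reggae", "ska"],
}

# Reverse lookup: tag -> genre.
TAG_TO_GENRE = {tag: genre for genre, tags in GENRES.items() for tag in tags}


def classify(top_tags):
    """Return the genre of the first recognised tag, or None if unclassified."""
    for tag in top_tags:
        genre = TAG_TO_GENRE.get(tag.strip().lower())
        if genre is not None:
            return genre
    return None
```

For instance, an artist whose top tags are "seen live", "power metal" and "rock" ends up in Metal, because "seen live" is not in any list and "power metal" comes before "rock".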

Vertex sizes

The area (not the diameter) of each vertex is proportional to the number of listeners of the top track of the given artist. I chose this because it was the only direct popularity measure exposed via the Audioscrobbler API, although I know that the total number of listeners or the total number of tracks scrobbled would have been better. I experimented with a different kind of analysis as well: I calculated the sum of incoming edge weights for each vertex using igraph. This latter measure assigns a large size to artists who are "copied" by many others, or to put it the other way, artists who have influence on many others would be larger on the visualisation. Not too surprisingly, artists who are renowned for their role in the emergence of whole new music genres are not among the largest ones. I decided to use the first measure on the visualisations, since it looks better and seems to convey more information.
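Both size measures are easy to state in code. The sketch below is an illustration only: the function names, the scale parameter and the (source, target, weight) edge format are assumptions, and the second function stands in for the calculation that was actually done with igraph.

```python
import math
from collections import defaultdict


def radius_for(listeners, scale=1.0):
    """Radius of a disc whose area is scale * listeners.

    Since area = pi * r**2, the radius grows with the square root of the
    listener count, so four times the listeners doubles the radius.
    """
    return math.sqrt(scale * listeners / math.pi)


def in_strength(edges):
    """Sum of incoming edge weights per vertex.

    `edges` is an iterable of (source, target, weight) triples.
    """
    totals = defaultdict(float)
    for _source, target, weight in edges:
        totals[target] += weight
    return dict(totals)
```

Scaling the area rather than the diameter keeps large artists from visually dominating the picture: a band with four times the listeners gets a disc with four times the area, not sixteen times.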

Top artists according to the number of listeners of their top tracks

These are the largest vertices on the visualisations:

Radiohead (180605) - alternative rock
Oasis (165902) - rock, britpop
Amy Winehouse (163362) - jazz, soul
Muse (146114) - alternative rock
The Killers (144533) - indie rock
Coldplay (139077) - alternative rock
Led Zeppelin (137206) - classic rock
Nirvana (134731) - grunge
The Postal Service (131916) - indie electro-pop
Franz Ferdinand (131451) - indie, alternative rock

(Genres are according to the textual descriptions and tags on Last.fm)

Remember, these data are not "live", they were valid at the time of the data collection.

Top artists according to the sum of incoming edge weights

Using this measure, the top 10 are as follows:

Death Cab for Cutie (25633) - indie rock
The Shins (23049) - indie pop
The Decemberists (19887) - indie rock
Sufjan Stevens (19845) - folk, indie pop
Taking Back Sunday (19559) - emo
Jimmy Eat World (18594) - alternative rock, indie rock, emo
Bloc Party (18373) - indie rock
Modest Mouse (17871) - indie rock
Brand New (17535) - alternative rock, emo
The Killers (16880) - indie rock

(Genres are again according to the textual descriptions and tags on Last.fm)

Honestly, I hadn't heard of any of them before (although this does not mean anything except that I'm shamelessly ignorant of certain genres), and I double-checked my code to convince myself that this really is the result. Well, does it tell us anything about those specific bands or the users of Last.fm?

Graph layout

I tried several graph layout algorithms, starting with classical force-directed layouts such as the Fruchterman-Reingold and Kamada-Kawai algorithms, but they did not scale well, and all I obtained was a seemingly random arrangement of vertices and edges. Unfortunately this was also the case with the LGL algorithm used in the Opte Project for the Internet maps. Finally I settled on the DrL graph layout algorithm, which produced suitable layouts in less than ten minutes.

Any more questions? Mail me on Last.fm.