PAM Logo
 
Background
One of the problems with P2P file-sharing programs is that you need to know what you want. Often you do. However, downloading music (or anything else these days) is so easy that you quickly run out of ideas. You know there is still music out there that you would like; it's just a matter of finding it. PAM is a tool that suggests music you are likely to enjoy, based on music you already know you enjoy.

I started this project back when Napster was still alive. I didn't use Napster only to download music (just to make sure I liked it before buying it, right?), but also to create PAM. Napster allowed you to get a user's list of songs. Assuming that users mostly have music they like, you can build a statistical model from this data that predicts which other songs you might like, given that you like songs X, Y, and Z. It's not rocket science. Sites such as Netflix and Amazon use this technique too. Of course, their data comes from their customers. Mine comes from Napster.
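To make the idea concrete, here is a minimal sketch of one simple way to turn user song lists into suggestions: count how often two songs show up in the same user's list, then score candidates by how often they co-occur with songs you already like. This is only an illustration and not necessarily the model PAM uses; the function names `build_cooccurrence` and `recommend` are made up for this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(user_song_lists):
    """Count how often each pair of songs appears in the same user's list."""
    cooc = defaultdict(Counter)
    for songs in user_song_lists:
        # Deduplicate so a song listed twice by one user counts only once.
        unique = sorted(set(songs))
        for a, b in combinations(unique, 2):
            cooc[a][b] += 1
            cooc[b][a] += 1
    return cooc


def recommend(cooc, liked, top_n=5):
    """Rank songs not in `liked` by total co-occurrence with the liked songs."""
    liked = set(liked)
    scores = Counter()
    for song in liked:
        for other, count in cooc.get(song, {}).items():
            if other not in liked:
                scores[other] += count
    return [song for song, _ in scores.most_common(top_n)]
```

Raw counts favor songs that are popular with everyone, so a real system would probably normalize the scores by each song's overall popularity.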
 
Data collection
The goal of the data collection effort was to collect song lists from as many users as possible. Over a period of a couple of months, I downloaded song lists from approximately 500,000 Napster users. The data collection was a little bit tricky, because Napster (and other OpenNap servers) don't like bots crawling their network. To stay undetected for as long as possible, I designed my bots to be very nice to the networks. For example, I created 11 Napster accounts and had the bots randomly pick one for every session. In each session, a bot would download the song lists of only a small number of users before disconnecting from the server again. It would also wait a random amount of time between queries, and so on. So, lots of randomness and slowness. I used 23 different Napster-compatible networks, and only a few of the small ones banned me at some point.
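The session behavior described above can be sketched roughly as follows. This is not the original bot code: `plan_session`, the batch size, and the wait bounds are hypothetical values chosen for illustration, and no real network calls are made. The function only plans a session: which account to log in with, which users to fetch, and how long to pause before each fetch.

```python
import random


def plan_session(accounts, pending_users, rng,
                 max_users=5, min_wait=10.0, max_wait=120.0):
    """Plan one short, polite session.

    Returns the randomly chosen account and a list of (user, wait_seconds)
    steps. The default limits are assumptions, not the original settings.
    """
    account = rng.choice(accounts)          # random account per session
    batch = pending_users[:max_users]       # only a few users per session
    steps = [(user, rng.uniform(min_wait, max_wait)) for user in batch]
    return account, steps


# Example: 11 accounts, as in the text; the user names are placeholders.
accounts = [f"account{i}" for i in range(11)]
account, steps = plan_session(accounts, ["u1", "u2", "u3"], random.Random())
```

A real bot would log in with `account`, sleep for each `wait_seconds` before requesting that user's song list, and then disconnect.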

The actual crawling part was pretty straightforward. You cannot get the list of users from Napster, but song query results contain user names, so you can build up a list of users from queries. I gave the bots a couple of seed queries, such as "madonna" and "prince", and the bots would go from there. New user names were obtained from queries, and new queries were obtained from user song lists. All the bots used a central user database to make sure that each user's list was downloaded only once. I ran up to 20 bots in parallel.
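The loop described above (queries yield users, users yield song lists, song lists yield new queries) can be sketched like this. It is a single-process illustration under stated assumptions, not the original crawler: `search`, `get_song_list`, and `derive_queries` are hypothetical callables standing in for the network requests and for whatever rule turns a song list into new queries, and a plain dictionary stands in for the central user database.

```python
from collections import deque


def crawl(seed_queries, search, get_song_list, derive_queries, max_users=1000):
    """Alternate between queries and user song lists until nothing new is found.

    search(query)         -> iterable of user names found in the results
    get_song_list(user)   -> list of filenames shared by that user
    derive_queries(songs) -> iterable of new query strings
    """
    query_queue = deque(seed_queries)
    seen_queries = set(seed_queries)
    song_lists = {}  # user name -> song list; plays the role of the user database

    while query_queue and len(song_lists) < max_users:
        query = query_queue.popleft()
        for user in search(query):
            if user in song_lists:
                continue  # each user's list is downloaded only once
            if len(song_lists) >= max_users:
                break
            songs = get_song_list(user)
            song_lists[user] = songs
            for new_query in derive_queries(songs):
                if new_query not in seen_queries:
                    seen_queries.add(new_query)
                    query_queue.append(new_query)
    return song_lists
```

With several bots running in parallel, the dictionary would have to be replaced by a shared store that all bots check before downloading a list.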
 
Database
The problem with the Napster data is that it is very noisy. In the end, the queries return filenames, and from those filenames you need to determine the name of the artist and the song. That is not a trivial task. In fact, I spent a lot of time writing the tool that takes the raw data and constructs a nice, clean database.
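As a rough illustration of the kind of cleanup involved, the sketch below handles only the easiest case: filenames of the form "Artist - Title.mp3", possibly with a directory prefix, underscores instead of spaces, or a leading track number. It is an assumption about what such filenames look like, not the actual tool, and `parse_filename` is a made-up name. Anything that does not fit the pattern is rejected rather than guessed at.

```python
import re


def parse_filename(path):
    """Return (artist, title) in lowercase, or None if the name doesn't fit."""
    name = re.split(r"[\\/]", path)[-1]                       # drop directories
    name = re.sub(r"\.(mp3|wma|ogg|wav)$", "", name, flags=re.IGNORECASE)
    name = name.replace("_", " ")                             # underscores as spaces
    name = re.sub(r"^\s*\(?\d{1,2}\)?\s*[-.]\s*", "", name)   # leading track number
    name = re.sub(r"\s+", " ", name).strip()                  # collapse whitespace

    parts = [p.strip() for p in name.split(" - ", 1)]
    if len(parts) != 2 or not all(parts):
        return None
    artist, title = parts
    return artist.lower(), title.lower()
```

For example, `parse_filename("01 - Prince - Kiss.mp3")` gives `("prince", "kiss")`, while a name such as `track07.mp3` gives `None`. Real data also contains misspellings, swapped artist and title fields, and extra tags, none of which this sketch attempts to handle.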

to be completed....
Last Updated: 13 May 2006, 2:00pm   -   Generated: 8 Feb 2012, 5:38pm