|
| Background |
|
One of the problems with P2P filesharing programs is that you need to know what you want.
Often you do know what you want. However, it is so easy to download music (or anything else)
these days that you very quickly run out of ideas. But you know there is still music out there
that you would like! It's just a matter of finding it. PAM is a tool for suggesting music that you
are likely to enjoy, based on music that you know you enjoy.

I started this project back when Napster was still alive. I didn't only use Napster to
download music (just to make sure I liked it before buying it, right?), but also to create PAM.
Napster allowed you to get a user's list of songs. Assuming that users mostly have music that
they like, you can build a statistical model from this data that predicts which other songs
you might like, given that you like songs X, Y, and Z. It's not rocket science. Sites such as
Netflix and Amazon use this technique too. Of course, their data comes from their customers.
Mine comes from Napster. |
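To make the idea of predicting songs from other users' lists concrete, here is a minimal sketch of one simple approach: counting how often songs appear together in the same user's list. This is not necessarily the model PAM uses; the function names, the scoring rule, and the toy data are all made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(user_lists):
    """Count how many users have each song, and how many have each pair of songs."""
    song_counts = Counter()
    pair_counts = defaultdict(Counter)
    for songs in user_lists:
        unique = set(songs)  # a song counts once per user
        song_counts.update(unique)
        for a, b in combinations(sorted(unique), 2):
            pair_counts[a][b] += 1
            pair_counts[b][a] += 1
    return song_counts, pair_counts


def recommend(liked, song_counts, pair_counts, top_n=3):
    """Score each song you don't have by summing estimates of P(song | liked song)."""
    scores = Counter()
    for song in liked:
        for other, together in pair_counts.get(song, {}).items():
            if other not in liked:
                scores[other] += together / song_counts[song]
    return [song for song, _ in scores.most_common(top_n)]


# Hypothetical toy data: each inner list is one user's song list.
user_lists = [
    ["X", "Y", "Z", "A"],
    ["X", "Y", "A"],
    ["X", "Z", "A", "B"],
    ["Q", "B"],
]
song_counts, pair_counts = build_cooccurrence(user_lists)
suggestions = recommend({"X", "Y", "Z"}, song_counts, pair_counts)
print(suggestions)
```

In this toy data, song A shows up next to X, Y, and Z far more often than B does, so it ranks first, and Q is never suggested because nobody who has X, Y, or Z also has Q.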
| Data collection |
|
The goal of the data collection effort was to collect song lists from
as many users as possible. Over a period of a couple of months, I downloaded song lists
from approximately 500,000 Napster users. The data collection was a little bit tricky, because
Napster (and other opennap servers) didn't like bots crawling their networks. To stay
undetected for as long as possible, I designed my bots to be very nice to the
networks. For example, I created 11 Napster accounts, and had the bots randomly pick one
for every session. In each session, a bot would download the song lists of only a small
number of users before disconnecting from the server again. It would also wait a random amount
of time between queries, and so on. So, lots of randomness and slowness. I used 23
different Napster-compatible networks, and only a few of the small ones banned me at some
point.

The actual crawling part was pretty straightforward. You cannot get a list of users from
Napster, but song query results contain user names, so you can build up a list of users from
queries. I gave the bots a couple of seed queries, such as "madonna" and "prince", and they
would go from there: new user names are obtained from queries, and new queries are obtained
from user song lists. All the bots used a central user database to make sure that each user's
list was downloaded only once. I ran up to 20 bots in parallel. |
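The crawl loop described above can be sketched roughly as follows. This is only an illustration and not the original bot code: `FakeNetwork` is a made-up in-memory stand-in for a server (no real Napster protocol is involved), the account names and libraries are invented, and taking the text before " - " in a filename as the next query is an assumption.

```python
import random
import time
from collections import deque


class FakeNetwork:
    """Hypothetical stand-in for a Napster-like server."""

    def __init__(self, libraries):
        self.libraries = libraries  # user name -> list of shared filenames

    def search(self, query):
        # Search results reveal which users share a matching file.
        return [user for user, files in self.libraries.items()
                if any(query in name.lower() for name in files)]

    def browse(self, user):
        return list(self.libraries[user])


def crawl(network, seed_queries, accounts, users_per_session=2,
          max_delay=0.0, rng=None):
    rng = rng or random.Random()
    song_lists = {}                      # central database: user -> song list
    pending_queries = deque(seed_queries)
    asked = set(seed_queries)
    pending_users = deque()
    queued_users = set()                 # ensures each user is fetched only once
    sessions = []                        # log of (account, users fetched)
    while pending_queries or pending_users:
        account = rng.choice(accounts)   # random account for every session
        fetched = []
        while len(fetched) < users_per_session and (pending_queries or pending_users):
            time.sleep(rng.uniform(0, max_delay))  # random pause between requests
            if not pending_users:
                # New user names are obtained from queries.
                for user in network.search(pending_queries.popleft()):
                    if user not in queued_users:
                        queued_users.add(user)
                        pending_users.append(user)
                continue
            user = pending_users.popleft()
            files = network.browse(user)
            song_lists[user] = files
            fetched.append(user)
            # New queries are obtained from user song lists (assumed rule).
            for name in files:
                query = name.split(" - ")[0].lower()
                if query not in asked:
                    asked.add(query)
                    pending_queries.append(query)
        sessions.append((account, fetched))  # "disconnect" and start over
    return song_lists, sessions


libraries = {
    "alice": ["Madonna - Vogue.mp3", "Prince - Kiss.mp3"],
    "bob": ["Prince - Kiss.mp3", "Bjork - Hyperballad.mp3"],
    "carol": ["Bjork - Joga.mp3"],
    "dave": ["Unknown - Demo.mp3"],
}
song_lists, sessions = crawl(FakeNetwork(libraries), ["madonna", "prince"],
                             accounts=["acct1", "acct2", "acct3"])
print(sorted(song_lists))
```

Note that "dave" is never reached here: no query derived from the other users' lists matches his files, which shows how the coverage of such a crawl depends on the seed queries. Running several bots in parallel would additionally require the shared database to be safe for concurrent access, which this single-process sketch leaves out.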
| Database |
|
The problem with the Napster data is that it is very noisy. In the end the queries return
filenames. From the filenames you need to determine the name of the artist and song. That
is not a trivial task. In fact, I spent a lot of time writing the tool that takes the raw
data and constructs a nice and clean database.
to be completed.... |
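The cleaning tool itself is not described here, but to give a feel for why filenames are hard to work with, here is a naive first-pass sketch. It assumes the common "Artist - Title" naming pattern, which many real filenames do not follow; the function name and the example filenames are made up.

```python
import re


def parse_filename(filename):
    """Best-effort split of a shared filename into (artist, title), or None."""
    name = filename.replace("\\", "/").rsplit("/", 1)[-1]     # drop the directory path
    name = re.sub(r"\.(mp3|wma|ogg)$", "", name, flags=re.IGNORECASE)
    name = name.replace("_", " ")
    name = re.sub(r"^\s*\(?\d{1,2}\)?[\s.\-]+", "", name)      # leading track number
    parts = [p.strip() for p in re.split(r"\s+-\s+", name) if p.strip()]
    if len(parts) < 2:
        return None                                            # no usable separator
    # Anything between the first and last part (often an album name) is ignored.
    artist, title = parts[0], parts[-1]
    return " ".join(artist.lower().split()), " ".join(title.lower().split())


print(parse_filename("C:\\Music\\Madonna - Vogue.mp3"))
print(parse_filename("03_prince_-_kiss.MP3"))
print(parse_filename("track01.mp3"))
```

Even this small example has obvious holes: an artist whose name starts with a number followed by a space would lose it to the track-number rule, "Title - Artist" orderings are silently swapped, and misspellings are not handled at all.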