April 20, 2008

Ruby Tutorial: Scraping Muxtape for mp3 + iTunes playlist creation

Muxtape is a website that lets users share mixtapes online, which is pretty cool, as is its minimalistic design. This Ruby tutorial shows how to use Mechanize (together with Hpricot) and rb-appscript to scrape the site, download the mp3s and add them automatically to an iTunes playlist. The last part only works on Mac OS X (because of rb-appscript).

First, install the gems:

gem install mechanize rb-appscript

For clarity, I will do it in four steps: crawling (i.e. downloading the HTML pages), analysis (i.e. extracting the URLs of the mp3s), download (i.e. fetching the mp3s) and iTunes integration (i.e. creating the playlists and adding the files).

Crawling

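  require 'rubygems'
  require 'fileutils'
  require 'mechanize'

  # Download the front page and up to n featured mixtape pages into crawl/.
  # Returns a hash of mixtape name => saved HTML file, for new pages only.
  def crawl(n = 10)
    FileUtils.mkdir_p("crawl")
    agent = WWW::Mechanize.new
    page = agent.get("http://muxtape.com/")
    page.save_as "crawl/main.html"
    new_mixtapes = {}
    # Featured mixtape links, found by inspecting the front page HTML
    (page/"ul.featured li a").each_with_index do |mixtape, i|
      break if i >= n
      link = mixtape.attributes['href']
      name = mixtape.inner_text
      file_name = "crawl/" + name + "_mixtape.html"
      next if File.exist?(file_name) # don't crawl/analyze if already crawled
      new_mixtapes[name] = file_name
      agent.get(link).save_as(file_name)
    end
    new_mixtapes
  end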

What I did here was instantiate a Mechanize object to download the front page. Once I have that page, I use the Mechanize integration with Hpricot to search for all links that match the CSS expression “ul.featured li a”, which I found by looking at the HTML of the Muxtape front page. These links point to the individual mixtape pages (which in turn contain the links to the mp3s). I then download the pages the links point to and return a mapping between each mixtape name and the location of its downloaded HTML file.
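The skip-if-already-crawled logic can be tried without touching the network. The following is only a sketch: `crawl_links` is a hypothetical helper that is not part of the original script, the links are made up, and the page fetch is passed in as a block instead of going through Mechanize.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper mirroring the control flow of crawl.
# links: array of [name, url] pairs; the block stands in for the page fetch.
def crawl_links(links, dir, n = 10)
  FileUtils.mkdir_p(dir)
  new_mixtapes = {}
  links.first(n).each do |name, url|
    file_name = File.join(dir, "#{name}_mixtape.html")
    next if File.exist?(file_name) # already crawled on a previous run
    File.write(file_name, yield(url))
    new_mixtapes[name] = file_name
  end
  new_mixtapes
end

# Made-up links and a fake fetcher, for illustration only.
links = [["alice", "http://alice.example/"], ["bob", "http://bob.example/"]]
Dir.mktmpdir do |dir|
  first_run  = crawl_links(links, dir, 1) { |url| "<html>#{url}</html>" }
  second_run = crawl_links(links, dir, 1) { |url| "<html>#{url}</html>" }
  puts first_run.keys.inspect
  puts second_run.keys.inspect
end
```

The second call returns an empty hash because the page saved by the first call is still on disk, so a rerun only picks up mixtapes that have not been crawled yet.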

Analysis

Once I have the individual mixtape pages, I can try to find where the mp3s are. One caveat is that the HTML pages do not contain the mp3 URLs in plain text. The “embed” elements that do contain them are created by JavaScript after the page has loaded, which is no good for Mechanize since it does not execute JavaScript. One way around this would be something like Watir, which drives an actual browser instead of only emulating one, but I won’t use it, since the JavaScript to parse is not all that hard. For the parsing, I briefly considered RKelly, which looks very interesting, but there is no gem yet, so I passed. I just use Hpricot to parse the page and give me the list of script elements, whose content I parse myself.

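  require 'rubygems'
  require 'hpricot'
  require 'ostruct'

  # For each crawled page, collect the songs declared in its Kettle script.
  # Returns a hash of mixtape name => list of songs (file name and URL).
  def analyze(mixtapes)
    mixtape_songs = {}
    mixtapes.each do |name, file_name|
      page = Hpricot(open(file_name))
      songs = []
      (page/"script").each do |script|
        src = script.inner_text
        # Capture the two array literals passed to "new Kettle([...],[...])"
        if src =~ /new\s+Kettle\(\[([^\]]+)\],\[([^\]]+)\]/
          ids, codes = [$1, $2].map { |a| a.gsub("'", '').split(",") }
          ids.zip(codes).each do |id, code|
            songs << OpenStruct.new(:sid => "#{id}.mp3",
                :url => "http://muxtape.s3.amazonaws.com/songs/#{id}?#{code}")
          end
        end
      end
      mixtape_songs[name] = songs
    end
    mixtape_songs
  end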

Here again, I looked at a mixtape page to learn how the URLs for the mp3s are formed. It turns out the files are stored on Amazon S3. The JavaScript that creates the “embed” element for the Flash mp3 player is a one-liner that looks like this:

var x480bc8459f377 = new Kettle(['62f2c1f39e0'.......],[...])

After using a regexp to parse this, I have all the info needed to build the URLs of the mp3s for the mixtape: the first array holds the song ids and the second the matching codes that go into the query string. The list of songs is returned for each mixtape considered at the crawling step.
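To see the regexp at work in isolation, here is a small sketch that applies it to a made-up script line. `parse_kettle` is a hypothetical helper, and the ids and codes in the sample are invented; only the regexp and the URL pattern come from the analysis step.

```ruby
# Hypothetical helper: extract song ids and codes from a Kettle call and
# build the S3 URLs the same way the analysis step does.
def parse_kettle(src)
  return [] unless src =~ /new\s+Kettle\(\[([^\]]+)\],\[([^\]]+)\]/
  ids, codes = [$1, $2].map { |a| a.gsub("'", "").split(",") }
  ids.zip(codes).map do |id, code|
    { :sid => "#{id}.mp3",
      :url => "http://muxtape.s3.amazonaws.com/songs/#{id}?#{code}" }
  end
end

# Made-up script content; real ids and codes look different.
sample = "var x1 = new Kettle(['abc123','def456'],['k1','k2'])"
parse_kettle(sample).each { |song| puts "#{song[:sid]} <- #{song[:url]}" }
```

Each capture group grabs the inside of one array literal, the quotes are stripped, and `zip` pairs the nth id with the nth code. A script element without a Kettle call simply yields an empty list.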

Download

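  require 'fileutils'
  require 'open-uri'

  # Download every song into dl/.
  # Returns a hash of mixtape name => list of downloaded file paths.
  def download(mixtapes)
    FileUtils.mkdir_p("dl")
    mixtape_songs = {}
    mixtapes.each do |mixtape, songs|
      dl_songs = []
      songs.each do |song|
        song_file = "dl/#{song.sid}"
        open(song.url) do |f| # open-uri lets open fetch a remote URL
          open(song_file, "wb") { |mp3| mp3.write f.read }
        end
        dl_songs << song_file
      end
      mixtape_songs[mixtape] = dl_songs
    end
    mixtape_songs
  end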

This step is pretty straightforward: just go through all the URLs determined at the previous step and download them. It uses the “open-uri” standard library to open the remote URL. A mapping between each mixtape and the locations of its downloaded mp3s is returned.
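The download step reads each mp3 fully into memory before writing it out. As an alternative, here is a sketch of a chunked copy using `IO.copy_stream` from the standard library. `save_stream` is a hypothetical helper, and a `StringIO` with made-up bytes stands in for the remote file so the example runs offline; the IO-like object that open-uri yields could be passed in the same way.

```ruby
require "stringio"
require "tmpdir"

# Hypothetical helper: copy any readable IO to a file in binary mode,
# chunk by chunk instead of reading everything into memory first.
# Returns the number of bytes copied.
def save_stream(io, path)
  File.open(path, "wb") { |out| IO.copy_stream(io, out) }
end

# A StringIO stands in for the remote mp3 here (made-up bytes).
Dir.mktmpdir do |dir|
  target = File.join(dir, "song.mp3")
  bytes = save_stream(StringIO.new("fake mp3 data"), target)
  puts "#{bytes} bytes written to #{File.basename(target)}"
end
```

For files the size of a song the difference is small, but the helper keeps the binary-mode write in one place.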

iTunes integration

While the previous code could be used on other platforms, this step is restricted to Mac OS X. It uses the rb-appscript library, which is a bridge between AppleScript and Ruby. There are other libraries that do this (Apple’s own Scripting Bridge, or RubyOSA), but this one is well documented, so I use it. This step creates an iTunes playlist for each considered mixtape and adds the mp3s to both the iTunes library and the corresponding playlist.

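  require 'rubygems'
  require 'appscript'
  include Appscript # provides app and its

  # Create one playlist per mixtape and add its downloaded songs to it.
  def itunes(mixtapes)
    i_tunes = app('iTunes')
    mixtapes.each do |mixtape, song_files|
      next if i_tunes.playlists[its.name.eq(mixtape)].exists # skip if exists
      pl = i_tunes.make(:new => :user_playlist, :with_properties => {:name => mixtape})
      song_files.each do |sf|
        # iTunes needs an absolute path, resolved relative to this script
        path = File.expand_path(sf, File.dirname(__FILE__))
        i_tunes.add(MacTypes::FileURL.path(path), :to => pl)
      end
    end
  end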

The first line gets a reference to the iTunes application. If iTunes is not running, it is launched as soon as it is asked for info or an action is called on it. Then, for each mixtape/playlist, I first test whether a playlist of the same name already exists, using a filter expression (the “its” refers to each considered playlist in turn). If there isn’t one, I create it with the “make” method. I then add all the mixtape songs to the newly created playlist. For this, I use “add” on the iTunes reference, to which I pass a URL created with the “MacTypes::FileURL.path” method, which takes as argument an absolute path to the mp3 I want to add (a relative one does not seem to work). And that’s it! New playlists have been created in iTunes, with an iPod synchronization for offline listening just a step away.
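Since the absolute path is the part that is easy to get wrong, here is a tiny sketch of that step on its own, using only the standard library. `absolute_song_path` is a hypothetical helper and the directory is made up; in the script the base directory would be the one containing the Ruby file.

```ruby
# Hypothetical helper: resolve a path such as "dl/song.mp3" against a base
# directory, giving the kind of absolute path that FileURL.path is fed.
def absolute_song_path(song_file, base_dir)
  File.expand_path(song_file, base_dir)
end

# Made-up base directory, for illustration only.
puts absolute_song_path("dl/song.mp3", "/Users/me/muxtape")
```

Passing the base directory as the second argument of `File.expand_path` gives the same result as concatenating the two parts with a slash first, and a path that is already absolute is returned unchanged.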

Code

Here is the file with the complete code. It will only download the first muxtape. Run it as:

ruby crawler.rb