How to scrape MySpace, YouTube, BTjunkie

Bootstrap your career in data hacking! With Ruby and WWW::Mechanize you can get started collecting data on the web with just a few lines of code.

Download Jdubs’ mechanize scrapers 1.0: simple scraping examples for MySpace, YouTube, and the torrent index BTjunkie.

This post covers techniques for exploring a web page, Ruby and gem installation, and explanations of the simple extractors below.

Need to install Ruby or the mechanize gem? See Installing Ruby.

Inspecting the DOM

  1. Install Firefox and the “Firebug” extension.
  2. Restart Firefox and open the Firebug console by clicking the green checkmark in the bottom left.
  3. Get familiar with inspecting a page’s DOM using the ‘inspect’ tool, which lets you find the CSS selector for any element on the page. This is how you’ll identify page elements while scraping.


For example, inspecting a YouTube video page shows that “span.viewCount” is the video’s view count. That was easy.

Example code

Some simple data collectors with file downloading to get you started.

Execute them on the command line with “ruby myspace.rb”, or paste them into irb.

myspace.rb
Find the top 20 friends for a given profile, then download all of those people’s thumbnails.

require 'rubygems'
require 'mechanize'
require 'fileutils'

agent = WWW::Mechanize.new # Mechanize 1.0 and later: Mechanize.new
agent.get("http://myspace.com/graffitiresearchlab")
links = agent.page.search('.friendSpace img') # found w/ firebug
FileUtils.mkdir_p 'myspace-images' # make the images dir
links.each_with_index do |link, index|
  url = link['src'] # the thumbnail's address
  puts "Saving thumbnail #{url}"
  agent.get(url).save_as("myspace-images/top_friend#{index}_#{File.basename url}")
end
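
One thing to watch with File.basename on a URL: if the image address carries a query string, it ends up in the filename. A small sketch of one way around that, using only the standard library. The helper name thumbnail_path and the example URL are made up for illustration and are not part of the downloadable scripts.

```ruby
require 'uri'

# Hypothetical helper: build a local path for the Nth thumbnail.
# Only the path part of the URL is used, so a "?v=2" style query
# string does not leak into the filename.
def thumbnail_path(dir, index, url)
  name = File.basename(URI.parse(url).path)
  File.join(dir, "top_friend#{index}_#{name}")
end

puts thumbnail_path('myspace-images', 0, 'http://example.com/images/01/abc_m.jpg?v=2')
```

You could then call save_as(thumbnail_path('myspace-images', index, url)) inside the loop.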

youtube.rb
Get the most viewed YouTube videos via the GData API, then download all of their thumbnails.

require 'rubygems'
require 'mechanize'
require 'hpricot'
require 'fileutils'

agent = WWW::Mechanize.new
url = "http://gdata.youtube.com/feeds/api/standardfeeds/most_viewed" # all time
page = agent.get(url)
# parse again w/ Hpricot for some XML convenience
doc = Hpricot.parse(page.body)
# pp (doc/:entry) # like "search"; cool division overload
images = (doc/'media:thumbnail') # use strings instead of symbols for namespaces
FileUtils.mkdir_p 'youtube-images' # make the images dir
urls = images.map { |i| i[:url] } # pull the url attribute off each thumbnail element
urls.each_with_index do |file,index|
  puts "Saving image #{file}"
  agent.get(file).save_as("youtube-images/vid#{index}_#{File.basename file}")
end
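
If you don't have Hpricot handy, the same extraction can be sketched with REXML from Ruby's standard distribution. This swaps in a different parser from the one the script uses, and the two-entry feed below is made up; a real feed has many more elements per entry.

```ruby
require 'rexml/document'

# Made-up feed with the same media:thumbnail structure as an Atom/Media RSS feed.
xml = <<-XML
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:media="http://search.yahoo.com/mrss/">
  <entry>
    <media:group>
      <media:thumbnail url="http://example.com/vi/aaa/0.jpg"/>
    </media:group>
  </entry>
  <entry>
    <media:group>
      <media:thumbnail url="http://example.com/vi/bbb/0.jpg"/>
    </media:group>
  </entry>
</feed>
XML

doc = REXML::Document.new(xml)

# Map the "media" prefix to its namespace URI explicitly so the XPath
# does not depend on which prefix the feed happens to declare.
ns = { 'media' => 'http://search.yahoo.com/mrss/' }
thumbs = REXML::XPath.match(doc, '//media:thumbnail', ns)
thumbnail_urls = thumbs.map { |t| t.attributes['url'] }

puts thumbnail_urls
```

The resulting array plays the same role as urls in the script, so the download loop would not need to change.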

btjunkie.rb
Download all the .torrent files on the front page.

require 'rubygems'
require 'mechanize'
require 'fileutils'

agent = WWW::Mechanize.new
agent.get("http://btjunkie.org/")
links = agent.page.search('.tor_details tr a')
# keep just the links ending in .torrent (to_s guards against anchors with no href)
hrefs = links.map { |m| m['href'].to_s }.select { |u| u =~ /\.torrent$/ }
FileUtils.mkdir_p('btjunkie-torrents') # keep it neat
hrefs.each do |torrent|
  # torrent is the href string; name the file after its second-to-last path segment
  filename = "btjunkie-torrents/#{torrent.split('/')[-2]}"
  puts "Saving #{torrent} as #{filename}"
  agent.get(torrent).save_as(filename)
end
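
The filtering and naming steps are easy to try without touching the network. A minimal sketch, assuming a URL layout in which the second-to-last path segment identifies the torrent; the example hrefs are invented, and appending ".torrent" to the saved name is my own choice rather than something the script does.

```ruby
# Hypothetical hrefs, as they might come back from links.map { |m| m['href'] };
# anchors without an href attribute yield nil.
hrefs = [
  'http://example.com/torrent/Some-Release/abc123/download.torrent',
  '/browse/Audio',
  nil,
  'http://example.com/torrent/Other-Release/def456/download.torrent'
]

# Keep only strings ending in .torrent (to_s makes nil harmless).
torrents = hrefs.select { |u| u.to_s =~ /\.torrent\z/ }

# Name each file after the second-to-last path segment, which in this
# assumed layout is unique per torrent (the last segment is not).
filenames = torrents.map { |u| "btjunkie-torrents/#{u.split('/')[-2]}.torrent" }

puts filenames
```

Working this out on a plain array first makes it easier to spot when a site's link format does not match what the split expects.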

Further reading:

The mechanize docs have examples of filling out and submitting forms, e.g. for logging in or searching.

If you write any fun scrapers or bots with these, let me know.
