Hello Nokogiri

I’ve talked about scRUBYt! once before, and I’ve been using it for years as my primary ‘Google crawler’, aka Google web-scraper. So it should be no surprise if I say.. it was part of MyLipas Defacement Crawler as well ;)

If you are also using scRUBYt! as your Google web-scraper, I suggest you take a look at your script, since it might be broken by now. It’s not only the gem itself; even the domain of their website, scrubyt.org, has expired (but yes, the project is still on GitHub). I noticed that my crawler was reporting zero URLs (scraped from Google) every day, which made me think of 2 possibilities: the search strings returned zero matches, OR the scraper was broken. And guess what, my second thought was right.

Yes.. it’s another day in the lab looking back at the crawler/scraper code. I don’t really depend on scRUBYt! anymore due to the lack of support/maintenance and broken gem dependencies. So here comes Nokogiri. With its XPath support I managed to get a working crawler as the replacement.. in just a few minutes. Of course the code is a bit longer, but NVM.. It works like a charm! :D

Installing scRUBYt! on Ubuntu Linux

scRUBYt! is a simple but powerful web scraping toolkit written in Ruby. Its purpose is to free you from the drudgery of web page crawling, looking up HTML tags, attributes, XPaths, form names and other typical low-level web scraping stuff by figuring these out from your examples, copy’n’pasted from the Web page or straight from Firebug.

Here are some tips on how to get scRUBYt! working on Ubuntu Linux:

Update your packages list

sudo apt-get update

Now install build-essential and dependencies

sudo apt-get install build-essential ruby-full rubygems libxml-ruby libxslt1.1 libxslt1-dev libxslt-ruby libxml2 libxml2-dev

Use gem to install scRUBYt!’s dependencies

sudo gem install rack rubyforge rake hoe sinatra nokogiri user-choices xml-simple s4t-utils builder commonwatir activesupport hpricot mechanize firewatir

Finally, install scrubyt

sudo gem install scrubyt
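
If you want to check that the gems actually landed, something like the following sketch should do. The REQUIRED_GEMS list and the missing_gems helper are just names I picked for this example; put whichever gems you care about in the list.

```ruby
require 'rubygems'

# A few gem names taken from the install steps above; adjust as needed.
REQUIRED_GEMS = %w[nokogiri mechanize hpricot scrubyt].freeze

# Returns the names from the list that RubyGems cannot find locally.
def missing_gems(names)
  names.reject { |name| Gem::Specification.find_all_by_name(name).any? }
end

missing = missing_gems(REQUIRED_GEMS)
if missing.empty?
  puts 'All listed gems are installed.'
else
  puts "Missing gems: #{missing.join(', ')}"
end
```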

Enjoy! :D
