Hello Nokogiri

Posted: March 2nd, 2012 | Filed under: IT Related

I’ve talked about scRUBYt! once before, and I’ve been using it for years as my primary ‘Google crawler’, a.k.a. Google web-scraper. So it should come as no surprise when I say.. it was part of the MyLipas Defacement Crawler as well 😉

If you are using scRUBYt! as your Google web-scraper too, I suggest you take a look at your script, since it might be broken by now. It’s not only the gem itself; even the domain of their website, scrubyt.org, has expired (though yes, the project is still on GitHub). I noticed that my crawler reported zero URLs (scraped from Google) every day, which made me think of two possibilities: either the search strings returned zero matches, or the scraper was broken. And guess what, my second guess was right.

Yes.. it’s another day in the lab looking back at the crawler/scraper code. I no longer depend on scRUBYt due to the lack of support/maintenance and its broken gem dependencies. So here comes Nokogiri. With its XPath support I managed to get a working crawler as the replacement.. in just a few minutes. Of course the code is a bit longer, but never mind.. it works like a charm! 😀
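
For anyone in the same boat, here is a minimal sketch of the Nokogiri approach. The search query and the '//h3/a' XPath below are assumptions (Google’s result markup changes over time), not the exact ones from my crawler:

#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Hypothetical example: fetch a Google result page and pull the
# result links out with an XPath query. The query string and the
# '//h3/a' expression are assumptions, adjust them to whatever the
# result markup looks like when you run this.
doc = Nokogiri::HTML(open('http://www.google.com/search?q=defacement'))
doc.xpath('//h3/a').each do |link|
  puts link['href']
end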


Convert Shortened URLs (bit.ly, tinyurl, ow.ly, and many more) to Full URLs in Ruby

Posted: October 13th, 2009 | Filed under: IT Related

You might be wary of visiting a shortened URL directly, because who knows, it may lead to some malicious script/code.

I found a solution, “Python: Convert those TinyURL (bit.ly, tinyurl, ow.ly) to full URLS”, on stackoverflow.com, but the code is in Python.

Here is how you can perform the conversion in Ruby:

#!/usr/bin/ruby
 
require 'net/http'
require 'uri'
 
# Send a GET request to the shortened URL and read the Location
# header from the redirect response, without following it.
def convert_to_full(tinyurl)
   url = URI.parse(tinyurl)
   req = Net::HTTP::Get.new(url.path.empty? ? '/' : url.path)
   res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
   return res.header['location']
end
 
puts convert_to_full('http://bit.ly/rgCbf') #here is how you can call the function. Thank you Captain Obvious!
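
Note that this never actually visits the target page: it only reads the Location header from the shortener’s redirect response, which is exactly what you want when the destination might be malicious.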

**UPDATED on 19/10/2009**

I’ve worked on a more complete version which can determine whether a URL is shortened or full, and return the full URL for a shortened one.. email me for the code 😉
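
That version stays in my inbox, but a minimal sketch of the idea could look like this (a hypothetical helper, not the version mentioned above): treat the URL as shortened if the request comes back as an HTTP redirect, and return it unchanged otherwise.

#!/usr/bin/ruby
require 'net/http'
require 'uri'

# Hypothetical sketch, not the emailed version: a URL is treated as
# shortened if requesting it returns an HTTP redirect; in that case
# return the Location target, otherwise return the URL unchanged.
def resolve(url_string)
   url = URI.parse(url_string)
   res = Net::HTTP.start(url.host, url.port) do |http|
      http.request(Net::HTTP::Get.new(url.path.empty? ? '/' : url.path))
   end
   res.is_a?(Net::HTTPRedirection) ? res['location'] : url_string
end

puts resolve('http://bit.ly/rgCbf')    # shortened -> expanded URL
puts resolve('http://example.com/')    # already full -> unchanged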


Notes to Myself: Ruby Curb Curl::Err::HostResolutionError and Exception Handling

Posted: September 2nd, 2009 | Filed under: IT Related

A Curl::Err::HostResolutionError occurs when executing the following code WITH NO internet connection:

#!/usr/bin/ruby
require 'rubygems'
require 'curb' #yerp, you need to: sudo gem install curb
 
# Fetch a URL with curb and return the response body.
def browse(url)
  c = Curl::Easy.new(url)
  c.connect_timeout = 3
  c.perform
  return c.body_str
end
 
url = gets.chomp # strip the trailing newline from stdin
puts browse(url)
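
With no connection, c.perform raises the Curl::Err::HostResolutionError from the title, and since nothing rescues it, the script simply crashes.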

So to handle the error in Ruby, next time I’ll use “begin” and “rescue”:

#!/usr/bin/ruby
require 'rubygems'
require 'curb' #yerp, you need to: sudo gem install curb
 
# Same as above, but any error raised during the transfer is
# rescued and turned into a plain error string.
def browse(url)
  c = Curl::Easy.new(url)
  begin
    c.connect_timeout = 3
    c.perform
    return c.body_str
  rescue
    return "Error in connection"
  end
end
 
url = gets.chomp # strip the trailing newline from stdin
puts browse(url)

Resulting in “Error in connection” being printed, instead of an unhandled Curl::Err::HostResolutionError.

Just a note to myself, and at the same time a code snippet for others.
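
A small variation, if you only want to catch the DNS failure from the title and let anything else propagate, is to rescue the specific class instead of using a bare rescue:

#!/usr/bin/ruby
require 'rubygems'
require 'curb'

# Variation on the snippet above: rescue only the DNS failure,
# so unrelated errors still surface normally.
def browse(url)
  c = Curl::Easy.new(url)
  c.connect_timeout = 3
  c.perform
  return c.body_str
rescue Curl::Err::HostResolutionError => e
  return "Could not resolve host: #{e.message}"
end

puts browse(gets.chomp)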


Installing scRUBYt! on Ubuntu Linux

Posted: March 10th, 2009 | Filed under: IT Related

scRUBYt! is a simple but powerful web scraping toolkit written in Ruby. Its purpose is to free you from the drudgery of web page crawling: looking up HTML tags, attributes, XPaths, form names and other typical low-level web scraping stuff, by figuring these out from your examples copy’n’pasted from the web page or straight from Firebug.

Here are some tips on how to make scRUBYt! work on Ubuntu Linux:

Update your package list

sudo apt-get update

Now install build-essential and dependencies

sudo apt-get install build-essential ruby-full rubygems libxml-ruby libxslt1.1 libxslt1-dev libxslt-ruby libxml2 libxml2-dev

By using gem, install scRUBYt!’s dependencies

sudo gem install rack rubyforge rake hoe sinatra nokogiri user-choices xml-simple s4t-utils builder commonwatir activesupport hpricot mechanize firewatir

Finally, install scrubyt

sudo gem install scrubyt
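
To verify the install, a tiny extractor along the lines of the official scRUBYt! examples should run. The target URL and the pattern name/XPath below are assumptions, swap them for a page you actually want to scrape:

#!/usr/bin/ruby
require 'rubygems'
require 'scrubyt'

# Quick smoke test, loosely based on scRUBYt!'s own examples.
# The URL and the '//h1' XPath are placeholders.
data = Scrubyt::Extractor.define do
  fetch 'http://www.example.com/'
  headline '//h1' # pattern names are arbitrary in scRUBYt!
end

puts data.to_xml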

Enjoy! 😀