Posted: March 2nd, 2012 | Author: xanda | Filed under: IT Related | Tags: crawler, google, nokogiri, ruby, scrubyt, web | No Comments »
I’ve talked about scRUBYt! once before, and I’ve been using it for years as my primary ‘Google crawler’ a.k.a. Google web scraper. So it shouldn’t be a surprise if I say.. it was part of the MyLipas Defacement Crawler as well 😉
If you are using scRUBYt! as your Google web scraper as well, I would suggest you take a look at your script, since it might be broken by now. It’s not only the gem itself; even the domain of their website, scrubyt.org, has now expired (but yes, the project is still on GitHub). I noticed that my crawler reported zero URLs (scraped from Google) every day, which made me think of two possibilities: either the search strings returned zero matches, OR the scraper was broken. And guess what, my second guess was right.
Yes.. it’s another day in the lab looking back at the crawler/scraper code. I don’t really depend on scRUBYt! anymore due to the lack of support/maintenance and broken gem dependencies. So here comes Nokogiri. With its XPath support I managed to get a working replacement crawler in just a few minutes. Of course the code is a bit longer, but never mind.. it works like a charm! 😀
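For the record, the replacement boils down to fetching the results page and pulling the links out with XPath. Here is a minimal sketch of the idea, not the actual crawler: the query, the search URL, and the XPath expression are placeholders (Google’s result markup changes often, so adjust accordingly).
#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'cgi'

# Placeholder query -- the real crawler uses its own search strings
query = 'example search'

# Fetch the results page and parse it with Nokogiri
doc = Nokogiri::HTML(open("http://www.google.com/search?q=#{CGI.escape(query)}"))

# Assumed XPath for result links; change it to match Google's current markup
doc.xpath('//h3/a/@href').each do |href|
  puts href.value
end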
Posted: October 13th, 2009 | Author: xanda | Filed under: IT Related | Tags: convert, ow.ly bit.ly, ruby, shorten url, tinyurl | 4 Comments »
You might worry about visiting a shortened URL directly, because who knows, it may lead to some malicious script/code.
I found a solution, “Python: Convert those TinyURL (bit.ly, tinyurl, ow.ly) to full URLS”, on stackoverflow.com, but the code is in Python.
Here is how you can perform the conversion in Ruby
#!/usr/bin/ruby
require 'net/http'
require 'uri'

# Issue a GET request to the shortened URL and read the Location header
# of the redirect, which points to the full URL
def ConvertToFull(tinyurl)
  url = URI.parse(tinyurl)
  host, port = url.host, url.port if url.host && url.port
  req = Net::HTTP::Get.new(url.path)
  res = Net::HTTP.start(host, port) { |http| http.request(req) }
  return res.header['location']
end

puts ConvertToFull('http://bit.ly/rgCbf') #here is how you can call the function. Thank you Captain Obvious!
**UPDATED on 19/10/2009**
I’ve worked on a more complete version which can tell whether a URL is shortened or already full, and returns the full URL for a shortened one.. email me for the code 😉
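The emailed version aside, the idea can be sketched roughly like this: follow the redirect only when the response actually carries a Location header, otherwise treat the input as a full URL already. The resolve name and the second test URL are just for illustration.
#!/usr/bin/ruby
require 'net/http'
require 'uri'

def resolve(url)
  uri = URI.parse(url)
  res = Net::HTTP.start(uri.host, uri.port) do |http|
    http.request(Net::HTTP::Get.new(uri.path.empty? ? '/' : uri.path))
  end
  # Shortened URLs answer with a redirect; full URLs usually don't
  res['location'] || url
end

puts resolve('http://bit.ly/rgCbf')
puts resolve('http://www.example.com/')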
Posted: September 2nd, 2009 | Author: xanda | Filed under: IT Related | Tags: begin, curb, curl, error, HostResolutionError, rescue, ruby | No Comments »
The following error (Curl::Err::HostResolutionError) occurs when executing the code below WITH NO internet connection:
#!/usr/bin/ruby
require 'rubygems'
require 'curb' #yerp you need to sudo gem install curb

# Fetch a URL with curb and return the response body
def browse(url)
  c = Curl::Easy.new(url)
  c.connect_timeout = 3
  c.perform
  return c.body_str
end

url = gets.chomp # strip the trailing newline from stdin
puts browse(url)
So to handle the error in Ruby, next time I’ll use “begin” and “rescue”:
#!/usr/bin/ruby
require 'rubygems'
require 'curb' #yerp you need to sudo gem install curb

def browse(url)
  c = Curl::Easy.new(url)
  begin
    c.connect_timeout = 3
    c.perform
    return c.body_str
  rescue
    # Any failure (e.g. host resolution) lands here instead of crashing the script
    return "Error in connection"
  end
end

url = gets.chomp # strip the trailing newline from stdin
puts browse(url)
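A bare rescue swallows every error the same way. If you only care about the DNS failure from the tags above, a narrower version could rescue the specific curb exception instead; this is a sketch assuming curb’s Curl::Err exception classes.
#!/usr/bin/ruby
require 'rubygems'
require 'curb'

def browse(url)
  c = Curl::Easy.new(url)
  c.connect_timeout = 3
  c.perform
  c.body_str
rescue Curl::Err::HostResolutionError
  "Could not resolve host -- no internet connection?"
rescue Curl::Err::CurlError => e
  "Error in connection: #{e.class}"
end

puts browse(gets.chomp)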
Resulting:
Just a note to myself and, at the same time, a “snip” of code for others.
Posted: March 10th, 2009 | Author: xanda | Filed under: IT Related | Tags: howto, install, ruby, scrubyt, setup, ubuntu | 5 Comments »
scRUBYt! is a simple but powerful web scraping toolkit written in Ruby. Its purpose is to free you from the drudgery of web page crawling, looking up HTML tags, attributes, XPaths, form names and other typical low-level web scraping stuff by figuring these out from your examples copy’n’pasted from the Web page or straight from Firebug.
Here are some tips on how to make scRUBYt! work on Ubuntu Linux:
Update your package list:
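(The command itself seems to have been dropped from the post; presumably the usual:)
sudo apt-get update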
Now install build-essential and dependencies
sudo apt-get install build-essential ruby-full rubygems libxml-ruby libxslt1.1 libxslt1-dev libxslt-ruby libxml2 libxml2-dev
By using gem, install scRUBYt!’s dependencies
sudo gem install rack rubyforge rake hoe sinatra nokogiri user-choices xml-simple s4t-utils builder commonwatir activesupport hpricot mechanize firewatir
Finally, install scrubyt:
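(Again the command seems to have gone missing from the post; with gem it would be:)
sudo gem install scrubyt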
Enjoy! 😀