module Robotstxt
Provides a flexible interface to help authors of web crawlers respect the robots.txt exclusion standard.
Constants
- AUTHORS
- GEM
- NAME
- VERSION
Public Class Methods
Obtains and parses a robots.txt file from the host identified by source, which can be a URI, a string representing a URI, or a Net::HTTP connection associated with a host.
The second parameter should be the user-agent header for your robot.
There are currently three options (an example passing them follows the usage examples below):
- :num_redirects (default 5): the maximum number of HTTP 3xx responses the get() method will accept and follow via the Location: header before giving up.
- :http_timeout (default 10): the number of seconds to wait for each request before giving up.
- :url_charset (default "utf8"): the character encoding you will use to encode URLs.
As indicated by robotstxt.org, this library treats HTTPUnauthorized and HTTPForbidden responses as though the robots.txt file denied access to the entire site; all other HTTP responses or errors are treated as though the site allowed all access.
The return value is a Robotstxt::Parser, which you can then interact with by calling .allowed? or .sitemaps, e.g.:

  Robotstxt.get("example.com/", "SuperRobot").allowed? "/index.html"

  Net::HTTP.start("example.com") do |http|
    if Robotstxt.get(http, "SuperRobot").allowed? "/index.html"
      http.get("/index.html")
    end
  end
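A minimal sketch of passing options; the option values here are illustrative, not recommendations:

  Robotstxt.get("example.com/", "SuperRobot",
                :num_redirects => 3,
                :http_timeout => 5).allowed? "/index.html"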
  # File lib/robotstxt.rb, line 61
  def self.get(source, robot_id, options={})
    self.parse(Getter.new.obtain(source, robot_id, options), robot_id)
  end
Gets a robots.txt file from the host identified by uri (which can be a URI object or a string), parses it for the given robot_id (which should be your user-agent), and returns true iff your robot can access said uri.

  Robotstxt.get_allowed?("www.example.com/good", "SuperRobot")
  # File lib/robotstxt.rb, line 86
  def self.get_allowed?(uri, robot_id)
    self.get(uri, robot_id).allowed? uri
  end
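A minimal sketch of a crawl loop that skips disallowed URLs; the URLs and the fetching step are hypothetical:

  require "robotstxt"

  urls = ["http://www.example.com/good", "http://www.example.com/secret"]
  urls.each do |url|
    next unless Robotstxt.get_allowed?(url, "SuperRobot")
    # fetch url here, e.g. with Net::HTTP
  end

Note that, as the source below shows, each call fetches and parses robots.txt again, so when checking many paths on the same host it is cheaper to call get once and reuse the returned Robotstxt::Parser.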
Parses the contents of a robots.txt file for the given robot_id and returns a Robotstxt::Parser object with methods .allowed? and .sitemaps, e.g.:

  Robotstxt.parse("User-agent: *\nDisallow: /a", "SuperRobot").allowed? "/b"
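A minimal sketch of reading Sitemap entries as well, assuming .sitemaps returns the Sitemap: URLs listed in the file; the commented return values are what you would expect, not verified output:

  robots_txt = "User-agent: *\nDisallow: /a\nSitemap: http://example.com/sitemap.xml"
  parser = Robotstxt.parse(robots_txt, "SuperRobot")
  parser.allowed?("/a")   # expected: false
  parser.sitemaps         # expected: ["http://example.com/sitemap.xml"]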
  # File lib/robotstxt.rb, line 72
  def self.parse(robotstxt, robot_id)
    Parser.new(robot_id, robotstxt)
  end
  # File lib/robotstxt.rb, line 90
  def self.ultimate_scrubber(str)
    # Re-encode as UTF-8, replacing byte sequences that are invalid in,
    # or cannot be converted to, UTF-8 with the empty string.
    str.encode("UTF-8", :invalid => :replace, :undef => :replace, :replace => '')
  end
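ultimate_scrubber appears to be a small utility for forcing text into valid UTF-8. A hypothetical sketch of scrubbing a manually fetched robots.txt body before handing it to parse (in normal use, get does the fetching for you):

  require "net/http"
  require "uri"

  body = Net::HTTP.get(URI("http://example.com/robots.txt"))
  Robotstxt.parse(Robotstxt.ultimate_scrubber(body), "SuperRobot")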