class Robotstxt::Getter
Public Instance Methods
obtain(source, robot_id, options)
Get the text of a robots.txt file from the given source; see Robotstxt.get.
  # File lib/robotstxt/getter.rb, line 6
  def obtain(source, robot_id, options)
    options = {
      :num_redirects => 5,
      :http_timeout => 10
    }.merge(options)

    robotstxt = if source.is_a? Net::HTTP
      obtain_via_http(source, "/robots.txt", robot_id, options)
    else
      uri = objectify_uri(source)
      http = Net::HTTP.new(uri.host, uri.port)
      http.read_timeout = options[:http_timeout]
      if uri.scheme == 'https'
        http.use_ssl = true
        http.verify_mode = OpenSSL::SSL::VERIFY_NONE
      end
      obtain_via_http(http, "/robots.txt", robot_id, options)
    end
  end
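A minimal usage sketch (the hosts and robot name are placeholders; most callers reach this method through Robotstxt.get rather than instantiating Getter directly). Note that :http_timeout only takes effect when the source is a URL string, since an existing Net::HTTP connection is used as-is:

  require 'robotstxt'

  getter = Robotstxt::Getter.new

  # Fetch /robots.txt from a host given as a URL string,
  # with a shorter read timeout than the 10-second default.
  body = getter.obtain("https://example.com", "MyBot/1.0", :http_timeout => 5)

  # Or reuse an existing Net::HTTP connection as-is.
  http = Net::HTTP.new("example.com", 80)
  body = getter.obtain(http, "MyBot/1.0", {})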
Protected Instance Methods
all_allowed()
A robots.txt body that allows access everywhere.
  # File lib/robotstxt/getter.rb, line 63
  def all_allowed
    "User-agent: *\nDisallow:\n"
  end
all_forbidden()
A robots.txt body that forbids access everywhere.
  # File lib/robotstxt/getter.rb, line 58
  def all_forbidden
    "User-agent: *\nDisallow: /\n"
  end
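These two bodies are the standard robots.txt idioms: an empty Disallow rule matches nothing, while Disallow: / matches every path. A hedged sketch of the difference, assuming the gem's module-level Robotstxt.parse(body, robot_id) helper (the robot name is a placeholder):

  require 'robotstxt'

  Robotstxt.parse("User-agent: *\nDisallow:\n", "MyBot").allowed?("/any/path")
  # => true   (empty Disallow: nothing is blocked)

  Robotstxt.parse("User-agent: *\nDisallow: /\n", "MyBot").allowed?("/any/path")
  # => false  (Disallow: /: every path is blocked)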
decode_body(response)
Decode the response’s body according to the character encoding in the HTTP headers. If we can’t decode it, Ruby’s laissez-faire attitude to encoding should mean we still have a reasonable chance of it working anyway.
  # File lib/robotstxt/getter.rb, line 71
  def decode_body(response)
    # Nothing to decode if there is no body; check nil before blank?.
    return "" if response.body.nil? || response.body.blank?
    Robotstxt.ultimate_scrubber(response.body)
  end
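Robotstxt.ultimate_scrubber itself is defined elsewhere in the gem; as an illustration only, a stdlib-only stand-in might force the raw bytes into valid UTF-8 like this (the method name here is hypothetical):

  # Hypothetical stand-in for Robotstxt.ultimate_scrubber: transcode from
  # raw bytes to UTF-8, dropping anything that does not decode.
  def scrub_to_utf8(text)
    text.encode('UTF-8', 'binary',
                :invalid => :replace, :undef => :replace, :replace => '')
  end

  scrub_to_utf8("User-agent: *\xFF\nDisallow:\n")
  # => "User-agent: *\nDisallow:\n"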
obtain_via_http(http, uri, robot_id, options)
Recursively try to obtain robots.txt, following redirects and handling the various HTTP response codes as indicated on robotstxt.org.
  # File lib/robotstxt/getter.rb, line 30
  def obtain_via_http(http, uri, robot_id, options)
    begin
      # Perform the request inside the begin block so that a read
      # timeout is caught by the rescue below.
      response = http.get(uri, {'User-Agent' => robot_id})

      case response
      when Net::HTTPSuccess
        decode_body(response)
      when Net::HTTPRedirection
        if options[:num_redirects] > 0 && response['location']
          options[:num_redirects] -= 1
          obtain(response['location'], robot_id, options)
        else
          all_allowed
        end
      when Net::HTTPUnauthorized, Net::HTTPForbidden
        all_forbidden
      else
        all_allowed
      end
    rescue Timeout::Error #, StandardError
      all_allowed
    end
  end
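As a concrete trace of the redirect limit (the hosts here are hypothetical):

  # Suppose http://a.example/robots.txt answers
  #   301  Location: http://b.example/robots.txt
  # and http://b.example/robots.txt answers 200 with a body.
  #
  # obtain_via_http(http_a, "/robots.txt", "MyBot/1.0", :num_redirects => 5, ...)
  #   -> Net::HTTPRedirection: :num_redirects drops to 4, and
  #      obtain("http://b.example/robots.txt", "MyBot/1.0", ...) is called
  #   -> Net::HTTPSuccess: decode_body(response) is returned
  #
  # With :num_redirects => 0 the same 301 falls through to all_allowed,
  # and a 401/403 anywhere along the way returns all_forbidden.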