class Robotstxt::Getter

Public Instance Methods

obtain(source, robot_id, options)

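Get the text of a robots.txt file from the given source (see get). The source may be a Net::HTTP connection, which is used as is, or a URI from which a connection is built. Options default to :num_redirects => 5 and :http_timeout => 10.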

# File lib/robotstxt/getter.rb, line 6
def obtain(source, robot_id, options)
  options = {
    :num_redirects => 5,
    :http_timeout => 10
  }.merge(options)

  robotstxt = if source.is_a? Net::HTTP
    obtain_via_http(source, "/robots.txt", robot_id, options)
  else
    uri = objectify_uri(source)
    http = Net::HTTP.new(uri.host, uri.port)
    http.read_timeout = options[:http_timeout]
    if uri.scheme == 'https'
      http.use_ssl = true
      http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    end
    obtain_via_http(http, "/robots.txt", robot_id, options)
  end
end
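
The connection setup in the else branch above can be tried without touching the network, since Net::HTTP.new does not open a socket. The following is a minimal sketch, not part of the library: build_http is a hypothetical helper name, and URI.parse stands in for objectify_uri, whose source is not shown here.

```ruby
require 'net/http'
require 'openssl'
require 'uri'

# Hypothetical helper mirroring the non-Net::HTTP branch of obtain.
# Assumption: URI.parse replaces objectify_uri, which is not shown above.
def build_http(source, options = {})
  # Caller-supplied options override the defaults, as in obtain.
  options = {
    :num_redirects => 5,
    :http_timeout => 10
  }.merge(options)

  uri = URI.parse(source.to_s)
  http = Net::HTTP.new(uri.host, uri.port)   # no connection is opened yet
  http.read_timeout = options[:http_timeout]
  if uri.scheme == 'https'
    http.use_ssl = true
    # Mirrors the listing above; VERIFY_NONE skips certificate checks.
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  end
  http
end

http = build_http("https://example.com/", :http_timeout => 3)
puts http.address
puts http.read_timeout
```

Note that OpenSSL::SSL::VERIFY_NONE disables certificate verification; OpenSSL::SSL::VERIFY_PEER is the stricter alternative if that matters for your crawler.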

Protected Instance Methods

all_allowed()

A robots.txt body that allows access everywhere.

# File lib/robotstxt/getter.rb, line 63
def all_allowed
  "User-agent: *\nDisallow:\n"
end

all_forbidden()

A robots.txt body that forbids access everywhere.

# File lib/robotstxt/getter.rb, line 58
def all_forbidden
  "User-agent: *\nDisallow: /\n"
end
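
To see why these two bodies mean "everything allowed" and "everything forbidden", here is a deliberately simplified sketch of how a consumer might read them. path_allowed? is a hypothetical helper, not the library's parser: it only looks at the "User-agent: *" group and does plain prefix matching on Disallow lines.

```ruby
# Hypothetical, simplified check (not the library's parser).
# An empty Disallow value blocks nothing; "Disallow: /" blocks every path.
def path_allowed?(robots_body, path)
  applies = false
  robots_body.each_line do |line|
    field, value = line.strip.split(':', 2)
    field = field.to_s.strip.downcase
    value = value.to_s.strip
    case field
    when 'user-agent'
      # Track whether the following rules apply to every robot.
      applies = (value == '*')
    when 'disallow'
      next unless applies
      return false if !value.empty? && path.start_with?(value)
    end
  end
  true
end

puts path_allowed?("User-agent: *\nDisallow:\n", "/private")
puts path_allowed?("User-agent: *\nDisallow: /\n", "/private")
```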

decode_body(response)

Decode the response’s body according to the character encoding in the HTTP headers. If the body cannot be decoded, Ruby’s laissez-faire attitude to encoding means there is still a reasonable chance of working anyway. An empty or missing body is returned as an empty string.

# File lib/robotstxt/getter.rb, line 71
def decode_body(response)
  return "" if response.body.blank? || response.body.nil?
  Robotstxt.ultimate_scrubber(response.body)
end
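
The listing delegates to Robotstxt.ultimate_scrubber, whose source is not shown here. As an illustration of the idea described above (pick the charset from the Content-Type header, fall back gracefully when it is missing or unknown), here is a minimal sketch. decode_with_charset is a hypothetical name, and this is an assumption about one reasonable approach, not the library's implementation.

```ruby
# Hypothetical sketch, not Robotstxt.ultimate_scrubber.
# Takes the raw body and the Content-Type header value.
def decode_with_charset(body, content_type)
  return "" if body.nil? || body.empty?

  # Pull "charset=..." out of the header, if present.
  charset = content_type.to_s[/charset=["']?([^;"'\s]+)/i, 1]
  encoding = begin
    charset ? Encoding.find(charset) : Encoding::UTF_8
  rescue ArgumentError
    Encoding::UTF_8 # unknown charset name: assume UTF-8
  end

  text = body.dup.force_encoding(encoding)
  unless encoding == Encoding::UTF_8
    text = text.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace)
  end
  # Replace any remaining invalid byte sequences so callers get valid UTF-8.
  text.scrub
end

puts decode_with_charset("caf\xE9".b, "text/plain; charset=ISO-8859-1")
```

String#scrub requires Ruby 2.1 or later.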

obtain_via_http(http, uri, robot_id, options)

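Recursively try to obtain robots.txt, following redirects and handling the various HTTP response codes as indicated on robotstxt.org. A successful response returns the decoded body; a redirect with a Location header is followed until :num_redirects is exhausted; 401 and 403 responses yield all_forbidden; any other response yields all_allowed.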

# File lib/robotstxt/getter.rb, line 30
def obtain_via_http(http, uri, robot_id, options)
  response = http.get(uri, {'User-Agent' => robot_id})

  begin
    case response
    when Net::HTTPSuccess
      decode_body(response)
    when Net::HTTPRedirection
      if options[:num_redirects] > 0 && response['location']
        options[:num_redirects] -= 1
        obtain(response['location'], robot_id, options)
      else
        all_allowed
      end
    when Net::HTTPUnauthorized
      all_forbidden
    when Net::HTTPForbidden
      all_forbidden
    else
      all_allowed
    end
  rescue Timeout::Error #, StandardError
    all_allowed
  end

end
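
The response-code mapping above can be exercised without a network by constructing Net::HTTP response objects directly. The following is a minimal sketch: policy_for is a hypothetical helper that returns a symbol naming the branch the listing would take, rather than fetching or decoding anything.

```ruby
require 'net/http'

# Hypothetical helper: names the branch obtain_via_http would take
# for a given response and remaining redirect budget.
def policy_for(response, redirects_left)
  case response
  when Net::HTTPSuccess
    :use_body
  when Net::HTTPRedirection
    # Follow only while redirects remain and a Location header is present.
    (redirects_left > 0 && response['location']) ? :follow_redirect : :all_allowed
  when Net::HTTPUnauthorized, Net::HTTPForbidden
    :all_forbidden
  else
    :all_allowed
  end
end

not_found = Net::HTTPNotFound.new("1.1", "404", "Not Found")
forbidden = Net::HTTPForbidden.new("1.1", "403", "Forbidden")
moved = Net::HTTPMovedPermanently.new("1.1", "301", "Moved Permanently")
moved['location'] = "http://example.com/robots.txt"

p policy_for(not_found, 5)
p policy_for(forbidden, 5)
p policy_for(moved, 5)
p policy_for(moved, 0)
```

A missing robots.txt (404) is treated as permission to crawl, while 401 and 403 are treated as a blanket refusal.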