class Arachnid2

Constants

BASE_CRAWL_TIME
BASE_URLS
DEFAULT_LANGUAGE
DEFAULT_MAXIMUM_LOAD_RATE
DEFAULT_NON_HTML_EXTENSIONS
DEFAULT_TIMEOUT
DEFAULT_USER_AGENT
MAXIMUM_TIMEOUT
MAX_CRAWL_TIME

META

  About the origins of this crawling approach:

  The Crawler borrows heavily from Arachnid. Original: github.com/dchuk/Arachnid. Other iterations I've borrowed liberally from:

  - https://github.com/matstc/Arachnid
  - https://github.com/intrigueio/Arachnid
  - https://github.com/jhulme/Arachnid

  And this was originally written as part of Tellurion's bot: github.com/samnissen/tellurion_bot

MAX_URLS
MEMORY_LIMIT_FILE
MEMORY_USE_FILE
MINIMUM_TIMEOUT
VERSION
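
These constants hold the crawler's defaults and hard limits, and can be read directly off the class. A minimal sketch (the comments describe assumed roles; the actual values live in lib/arachnid2.rb):

require 'arachnid2'

Arachnid2::VERSION             # the gem's version string
Arachnid2::DEFAULT_USER_AGENT  # presumably the User-Agent sent when opts[:headers] omits one
Arachnid2::MAX_CRAWL_TIME      # presumed ceiling on opts[:time_box]
Arachnid2::MAX_URLS            # presumed cap on how many URLs a crawl visits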

Public Class Methods

new(url)

Creates the object to execute the crawl

@example

url = "https://daringfireball.net"
spider = Arachnid2.new(url)

@param [String] url

@return [Arachnid2] self

# File lib/arachnid2.rb, line 65
def initialize(url)
  @url = url
end

Public Instance Methods

crawl(opts = {}, with_watir = false)

Visits a URL, gathering links and visiting them in turn, until it runs out of time, memory, or attempts.

@example

url = "https://daringfireball.net"
spider = Arachnid2.new(url)

opts = {
  :followlocation => true,
  :timeout => 25000,
  :time_box => 30,
  :headers => {
    'Accept-Language' => "en-GB",
    'User-Agent' => "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0",
  },
  :memory_limit => 89.99,
  :proxy => {
    :ip => "1.2.3.4",
    :port => "1234",
    :username => "sam",
    :password => "coolcoolcool",
  },
  :non_html_extensions => {
    3 => [".abc", ".xyz"],
    4 => [".abcd"],
    6 => [".abcdef"],
    11 => [".abcdefghijk"]
  }
}
responses = []
spider.crawl(opts) { |response|
  responses << response
}

@param [Hash] opts

@return nil

# File lib/arachnid2.rb, line 108
def crawl(opts = {}, with_watir = false)
  if with_watir
    crawl_watir(opts, &Proc.new)
  else
    Arachnid2::Typhoeus.new(@url).crawl(opts, &Proc.new)
  end
end
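
By default the crawl is driven by Typhoeus (see Arachnid2::Typhoeus above), so the block in the example receives Typhoeus::Response objects. A minimal sketch of consuming them, assuming that default engine:

spider = Arachnid2.new("https://daringfireball.net")
spider.crawl(:time_box => 30) { |response|
  # Typhoeus::Response exposes #effective_url and #body
  puts response.effective_url
  puts response.body.bytesize
}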
crawl_watir(opts)
# File lib/arachnid2.rb, line 116
def crawl_watir(opts)
  Arachnid2::Watir.new(@url).crawl(opts, &Proc.new)
end
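
crawl_watir drives a real browser through Watir instead of making raw HTTP requests; crawl delegates to it when with_watir is true. A hedged sketch of invoking it, assuming a working Watir/webdriver setup and that the block is yielded a Watir-style browser object (check Arachnid2::Watir for what is actually yielded):

spider = Arachnid2.new("https://daringfireball.net")

spider.crawl({ :time_box => 30 }, true) { |browser|
  puts browser.url    # assumes a Watir::Browser-like object
}

spider.crawl_watir(:time_box => 30) { |browser|
  puts browser.title
}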