class Arachnid2
Constants
- BASE_CRAWL_TIME
- BASE_URLS
- DEFAULT_LANGUAGE
- DEFAULT_MAXIMUM_LOAD_RATE
- DEFAULT_NON_HTML_EXTENSIONS
- DEFAULT_TIMEOUT
- DEFAULT_USER_AGENT
- MAXIMUM_TIMEOUT
- MAX_CRAWL_TIME
- MAX_URLS
- MEMORY_LIMIT_FILE
- MEMORY_USE_FILE
- MINIMUM_TIMEOUT
- VERSION

META: About the origins of this crawling approach

The Crawler is heavily borrowed from Arachnid. Original: github.com/dchuk/Arachnid

Other iterations I've borrowed liberally from:
- https://github.com/matstc/Arachnid
- https://github.com/intrigueio/Arachnid
- https://github.com/jhulme/Arachnid

And this was originally written as part of Tellurion's bot: github.com/samnissen/tellurion_bot
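The defaults above are plain constants on the class, so they can be inspected directly before deciding which opts to override. A minimal sketch (the printed values depend on the installed gem version):

  require 'arachnid2'

  # Print a few of the crawl defaults; the names are taken from the
  # constants list above, the values come from lib/arachnid2.rb.
  puts Arachnid2::VERSION
  puts Arachnid2::DEFAULT_TIMEOUT
  puts Arachnid2::MAX_CRAWL_TIME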
Public Class Methods
new(url)
Creates the object to execute the crawl
@example
url = "https://daringfireball.net" spider = Arachnid2.new(url)
@param [String] url
@return [Arachnid2] self
# File lib/arachnid2.rb, line 65
def initialize(url)
  @url = url
end
Public Instance Methods
crawl(opts = {}, with_watir = false)
Visits a URL, gathering links and visiting them, until running out of time, memory or attempts.
@example
url = "https://daringfireball.net" spider = Arachnid2.new(url) opts = { :followlocation => true, :timeout => 25000, :time_box => 30, :headers => { 'Accept-Language' => "en-UK", 'User-Agent' => "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0", }, :memory_limit => 89.99, :proxy => { :ip => "1.2.3.4", :port => "1234", :username => "sam", :password => "coolcoolcool", } :non_html_extensions => { 3 => [".abc", ".xyz"], 4 => [".abcd"], 6 => [".abcdef"], 11 => [".abcdefghijk"] } } responses = [] spider.crawl(opts) { |response| responses << response }
@param [Hash] opts
@param [Boolean] with_watir
@return nil
# File lib/arachnid2.rb, line 108
def crawl(opts = {}, with_watir = false)
  if with_watir
    crawl_watir(opts, &Proc.new)
  else
    Arachnid2::Typhoeus.new(@url).crawl(opts, &Proc.new)
  end
end
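Passing with_watir = true routes the same call through the Watir-backed crawler (see crawl_watir below). A minimal sketch, assuming the block is yielded once per visited page as in the example above:

  require 'arachnid2'

  spider = Arachnid2.new("https://daringfireball.net")

  # with_watir = true delegates to Arachnid2::Watir.
  # Assumption: the block receives one object per visited page,
  # as with the Typhoeus-backed example above.
  spider.crawl({ :time_box => 30 }, true) do |page|
    puts page
  end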
crawl_watir(opts)
# File lib/arachnid2.rb, line 116
def crawl_watir(opts)
  Arachnid2::Watir.new(@url).crawl(opts, &Proc.new)
end
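Since crawl(opts, true) simply delegates here, calling crawl_watir directly is equivalent. Note that the method captures its block via Proc.new, so it must be called with a block. A minimal sketch under the same assumptions as above:

  spider = Arachnid2.new("https://daringfireball.net")
  spider.crawl_watir(:time_box => 30) { |page| puts page }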