class SynonymScrapper::Scrapper

Base scrapper used to scrape APIs/websites

Constants

USER_AGENTS

List of user agents to select from when scraping.

Attributes

base_url[RW]

Base URL of the API/website to be consulted.

max_retries[RW]

Number. Maximum number of retries attempted for each failed request.

retries_left[RW]

Number. How many retries remain for the current request.

Public Class Methods

new(max_retries, base_url)

Initialize the scrapper with the maximum number of retries, max_retries, and the base_url to scrape.

# File lib/synonym_scrapper/scrapper.rb, line 43
def initialize max_retries, base_url
  @max_retries = max_retries
  @retries_left = max_retries
  @base_url = base_url
end

Public Instance Methods

build_call_url(endpoint)

Method to be overridden by classes that inherit from this one. The endpoint can be anything (Array, Hash, String, etc.) as long as it is used consistently in the child class.

# File lib/synonym_scrapper/scrapper.rb, line 54
def build_call_url endpoint
  raise Error, "This method must be redefined in subclasses"
end
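
As an illustration of the override described above, here is a minimal sketch of a subclass that treats the endpoint as a Hash. The names BaseScrapper and ExampleScrapper, the :path and :query keys, and the example.com URL are assumptions made for this example; BaseScrapper is only a stand-in for SynonymScrapper::Scrapper so that the snippet runs on its own, and it raises NotImplementedError where the original raises Error.

```ruby
require 'uri'

# Stand-in for SynonymScrapper::Scrapper (assumption), reduced to the
# documented constructor and attributes so the sketch is self-contained.
class BaseScrapper
  attr_accessor :base_url, :max_retries, :retries_left

  def initialize(max_retries, base_url)
    @max_retries = max_retries
    @retries_left = max_retries
    @base_url = base_url
  end

  def build_call_url(endpoint)
    raise NotImplementedError, "This method must be redefined in subclasses"
  end
end

# Hypothetical subclass: the endpoint is always a Hash with a :path key
# and an optional :query Hash.
class ExampleScrapper < BaseScrapper
  def build_call_url(endpoint)
    query = URI.encode_www_form(endpoint.fetch(:query, {}))
    url = "#{base_url}/#{endpoint.fetch(:path)}"
    query.empty? ? url : "#{url}?#{query}"
  end
end

scrapper = ExampleScrapper.new(3, "https://example.com/api")
url = scrapper.build_call_url({ path: "synonyms", query: { word: "casa" } })
puts url
```

Whatever shape is chosen for the endpoint, every caller of the subclass has to pass that same shape, since call hands the endpoint to build_call_url unchanged.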

call(endpoint)

Executes a call to the given endpoint and returns its response.

On an HTTP error, the call is retried up to max_retries times. A 404 response is treated as permanent: retrying is assumed to be useless, so an empty array is returned immediately. Other types of errors are not retried.

# File lib/synonym_scrapper/scrapper.rb, line 66
def call endpoint
  uri = build_call_url(endpoint)
  begin
    response = URI.open(uri, 'User-Agent' => USER_AGENTS.sample)
  rescue OpenURI::HTTPError => e
    puts e
    return [] if e.message == '404 Not Found'
    # Keep the result of the retry so it is returned to the caller
    response = retry_call(endpoint) unless @retries_left <= 0
  rescue => e
    puts e
  end
  # Reset the retries_left variable on this instance after each request
  @retries_left = @max_retries
  return response
end
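
The retry rules described for call can be traced offline with a small stand-in. This is only a sketch under stated assumptions: RetryDemo, FakeHTTPError, and the queued responses are hypothetical and replace the real class, OpenURI::HTTPError, and the network. The random sleep is left out, and the sketch assigns the result of the retry to response so that the caller receives it.

```ruby
# Hypothetical stand-ins: FakeHTTPError replaces OpenURI::HTTPError and the
# responses queue replaces real network calls, so the sketch runs offline.
class FakeHTTPError < StandardError; end

class RetryDemo
  attr_reader :retries_left

  def initialize(max_retries, responses)
    @max_retries = max_retries
    @retries_left = max_retries
    @responses = responses # each entry is a body to return or an error to raise
  end

  def call(endpoint)
    begin
      response = fetch(endpoint)
    rescue FakeHTTPError => e
      return [] if e.message == '404 Not Found' # a 404 is never retried
      response = retry_call(endpoint) unless @retries_left <= 0
    end
    @retries_left = @max_retries # reset for the next request
    response
  end

  private

  # Pops the next queued outcome; raises it if it is an exception.
  def fetch(_endpoint)
    result = @responses.shift
    raise result if result.is_a?(Exception)
    result
  end

  def retry_call(endpoint)
    @retries_left -= 1
    call(endpoint)
  end
end

# One transient failure followed by a success: the retry returns the body.
flaky = RetryDemo.new(2, [FakeHTTPError.new('503 Service Unavailable'), 'ok'])
first = flaky.call('/synonyms')

# A 404 short-circuits to an empty array.
missing = RetryDemo.new(2, [FakeHTTPError.new('404 Not Found')])
second = missing.call('/synonyms')

# Failures on the first attempt and on both retries: nil is returned.
down = RetryDemo.new(2, Array.new(3) { FakeHTTPError.new('500 Internal Server Error') })
third = down.call('/synonyms')
```

With max_retries set to 2, a request is attempted at most three times in total: the initial attempt plus two retries.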

retry_call(endpoint)

Retry the call to the specified endpoint after sleeping for a random interval between 50 and 1000 milliseconds.

# File lib/synonym_scrapper/scrapper.rb, line 86
def retry_call endpoint
  @retries_left -= 1

  # Divide by a Float: integer division would always yield 0 seconds
  sleep_time = (50 + rand(950)) / 1000.0
  sleep(sleep_time)

  call(endpoint)
end
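
One detail of the delay computation is worth spelling out: in Ruby, dividing an Integer by the Integer 1000 truncates, so a value below 1000 becomes 0 and sleep would not wait at all. The sketch below uses a hypothetical helper, retry_delay_seconds (not part of the class), to show the computation with a Float divisor.

```ruby
# Hypothetical helper mirroring the delay computation in retry_call.
# rand(950) returns an Integer from 0 to 949, so the result lies
# between 0.05 and 0.999 seconds.
def retry_delay_seconds
  (50 + rand(950)) / 1000.0
end

# Integer division truncates: even the largest possible value becomes 0.
truncated = (50 + 949) / 1000

delays = Array.new(1000) { retry_delay_seconds }
puts "truncated=#{truncated} min=#{delays.min} max=#{delays.max}"
```

Because the upper bound of rand is exclusive, the longest possible wait is 999 milliseconds rather than a full second.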