class SynonymScrapper::Scrapper
Base scrapper used to scrape APIs/websites
Constants
- USER_AGENTS
List of user agents to select from when scraping.
Attributes
- base_url
Base URL of the API/website to be consulted.
- max_retries
Number, the maximum number of retries to attempt for each failed request.
- retries_left
Number, how many retries remain for the current request.
Public Class Methods
Initialize the scrapper with the maximum number of retries, max_retries, and the base_url to scrape. Note that max_retries is the first argument.
# File lib/synonym_scrapper/scrapper.rb, line 43
def initialize max_retries, base_url
  @max_retries = max_retries
  @retries_left = max_retries
  @base_url = base_url
end
Public Instance Methods
Method to be overridden by classes that inherit from this one. endpoint can be anything (Array, Hash, String, etc.) as long as it is used consistently in the child class.
# File lib/synonym_scrapper/scrapper.rb, line 54
def build_call_url endpoint
  raise Error, "This method must be redefined in subclasses"
end
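As an illustration of the override described above, here is a minimal sketch of a subclass. Everything in it is hypothetical: BaseScrapper is a stand-in for SynonymScrapper::Scrapper so the snippet runs on its own (it raises NotImplementedError rather than the library's Error), and ExampleThesaurus, its URL, and the :word/:lang keys are invented for the example.

```ruby
require 'uri'

# Stand-in for SynonymScrapper::Scrapper (assumption), so the sketch is
# self-contained. Real code would inherit from the library class instead.
class BaseScrapper
  def initialize(max_retries, base_url)
    @max_retries = max_retries
    @retries_left = max_retries
    @base_url = base_url
  end

  def build_call_url(endpoint)
    raise NotImplementedError, "This method must be redefined in subclasses"
  end
end

# Hypothetical subclass: here the endpoint is a Hash with a :word key and an
# optional :lang key. Any shape works as long as the subclass is consistent.
class ExampleThesaurus < BaseScrapper
  def initialize(max_retries = 3)
    super(max_retries, "https://thesaurus.example.com/api/")
  end

  def build_call_url(endpoint)
    query = URI.encode_www_form(
      word: endpoint.fetch(:word),
      lang: endpoint.fetch(:lang, "en")
    )
    "#{@base_url}synonyms?#{query}"
  end
end

url = ExampleThesaurus.new.build_call_url({ word: "happy day" })
puts url
```

Encoding the query with URI.encode_www_form keeps spaces and special characters in the endpoint from producing an invalid URL.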
Executes a call to the given endpoint and returns its response.
In case of an HTTP error, the method retries up to +@max_retries+ times. A 404 response is assumed to make retrying useless, so an empty array is returned immediately. No retrying is done for other types of errors. If no attempt succeeds, nil is returned.
# File lib/synonym_scrapper/scrapper.rb, line 66
def call endpoint
  uri = build_call_url(endpoint)
  begin
    response = URI.open(uri, 'User-Agent' => USER_AGENTS.sample)
  rescue OpenURI::HTTPError => e
    puts e
    return [] if e.message == '404 Not Found'
    # Keep the retried call's result; otherwise a successful retry is discarded
    response = retry_call(endpoint) unless @retries_left <= 0
  rescue => e
    puts e
  end
  # Reset the retries_left variable on this instance after each request
  @retries_left = @max_retries
  return response
end
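The retry policy described above can be tried without a network connection. The following is a rough sketch, not the library's implementation: fetch_with_retries is a hypothetical helper that takes the request as a block, and it uses Ruby's retry keyword instead of the recursive call/retry_call pair. The errors are constructed by hand to simulate failed requests.

```ruby
require 'open-uri'

# Hypothetical helper (not part of the library): runs the block and retries
# on OpenURI::HTTPError up to max_retries times. A 404 returns [] at once.
def fetch_with_retries(max_retries, wait: 0.0)
  retries_left = max_retries
  begin
    yield
  rescue OpenURI::HTTPError => e
    return [] if e.message == '404 Not Found'
    return nil if retries_left <= 0
    retries_left -= 1
    sleep(wait)
    retry
  end
end

# Simulated request that fails twice with a 503 and then succeeds.
attempts = 0
result = fetch_with_retries(3) do
  attempts += 1
  raise OpenURI::HTTPError.new('503 Service Unavailable', nil) if attempts < 3
  'payload'
end

# A 404 is never retried.
not_found = fetch_with_retries(3) do
  raise OpenURI::HTTPError.new('404 Not Found', nil)
end

# A request that always fails: one initial attempt plus two retries.
exhausted_calls = 0
exhausted = fetch_with_retries(2) do
  exhausted_calls += 1
  raise OpenURI::HTTPError.new('500 Internal Server Error', nil)
end

puts result.inspect, not_found.inspect, exhausted.inspect
```

Keeping the retry counter in a local variable, as this sketch does, avoids having to reset instance state after each request.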
Retry the call to the specified endpoint after waiting a random time between 50 and 1000 milliseconds.
# File lib/synonym_scrapper/scrapper.rb, line 86
def retry_call endpoint
  @retries_left -= 1
  # Float division: integer division by 1000 would always yield 0 seconds
  sleep_time = rand(50..1000) / 1000.0
  sleep(sleep_time)
  call(endpoint)
end
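The random wait is easy to get wrong in Ruby because dividing two integers truncates. The short sketch below shows the pitfall and one way to compute a delay in the documented 50 to 1000 millisecond range; jitter_seconds is a hypothetical helper name used only for this example.

```ruby
# Integer division truncates, so any delay below 1000 ms becomes 0 seconds.
int_seconds = (50 + 949) / 1000       # Integer / Integer
float_seconds = (50 + 949) / 1000.0   # Integer / Float keeps the fraction

# Hypothetical helper: random delay between 50 and 1000 ms, in seconds.
def jitter_seconds
  rand(50..1000) / 1000.0
end

samples = Array.new(1000) { jitter_seconds }
puts int_seconds, float_seconds
puts samples.min, samples.max
```

Passing a range to rand includes both ends, so the delay can reach the full 1000 milliseconds.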