class NewsScraper::Scraper

Public Class Methods

new(query:)

Initializes a Scraper object

Params

  • query: a keyword argument specifying the query to scrape

# File lib/news_scraper/scraper.rb, line 8
def initialize(query:)
  @query = query
end
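
Because query: is a required keyword argument, leaving it out raises an ArgumentError at construction time, before any scraping starts. The following is a minimal sketch of that behavior, assuming nothing about the gem beyond the constructor shown above. ConstructorDemo is a hypothetical stand-in class so the snippet runs without the gem installed, and the attr_reader and the query string are illustrative only.

```ruby
# Hypothetical stand-in that copies only the constructor shown above.
class ConstructorDemo
  # Reader added for illustration; the real class may not expose it.
  attr_reader :query

  def initialize(query:)
    @query = query
  end
end

# Passing the keyword stores the query on the instance.
demo = ConstructorDemo.new(query: 'shopify')
puts demo.query

# Omitting the required keyword fails immediately with ArgumentError.
begin
  ConstructorDemo.new
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

With the real class, the equivalent call would take the form NewsScraper::Scraper.new(query: ...), as in the signature above.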

Public Instance Methods

scrape() { |transformed_article| ... }

Fetches article URLs for the query from the extraction sources, then extracts and transforms each article

Yields

  • Yields each transformed article individually if a block is given

Raises

  • Raises a Transformers::ScrapePatternNotDefined if an article's root domain is not among the known root domains

    • If a block is given, the error is yielded to the block instead of being raised

    • Root domains are specified in the article_scrape_patterns.yml file

    • The missing root domain needs to be trained; a PR that trains the domain would be helpful

    • To train the domain, run NewsScraper::Trainer::UrlTrainer.new(URL_TO_TRAIN).train

Returns

  • transformed_articles: an array of the successfully transformed articles fetched from the extraction sources

# File lib/news_scraper/scraper.rb, line 27
def scrape
  article_urls = Extractors::GoogleNewsRss.new(query: @query).extract

  transformed_articles = []

  article_urls.each do |article_url|
    payload = Extractors::Article.new(url: article_url).extract
    article_transformer = Transformers::Article.new(url: article_url, payload: payload)

    begin
      transformed_article = article_transformer.transform
      transformed_articles << transformed_article
      yield transformed_article if block_given?
    rescue Transformers::ScrapePatternNotDefined => e
      raise e unless block_given?
      yield e
    end
  end

  transformed_articles
end
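
Since the block receives either a transformed article or a ScrapePatternNotDefined error, a caller that passes a block has to tell the two apart. The sketch below shows one way to do that. It is an illustration under stated assumptions, not the gem's own code: StubScraper and PatternMissing are hypothetical stand-ins that copy only the yield-or-raise control flow of the scrape method above, so the example runs without the gem or network access. collect_results and the sample data are likewise invented for illustration.

```ruby
# Hypothetical stand-in for Transformers::ScrapePatternNotDefined.
class PatternMissing < StandardError; end

# Hypothetical stand-in that replays canned results with the same
# yield-or-raise control flow as the scrape method above.
class StubScraper
  def initialize(results)
    @results = results
  end

  def scrape
    transformed = []
    @results.each do |result|
      begin
        # A canned exception simulates a failed transform.
        raise result if result.is_a?(Exception)
        transformed << result
        yield result if block_given?
      rescue PatternMissing => e
        raise e unless block_given?
        yield e
      end
    end
    transformed
  end
end

# Caller-side pattern: separate articles from untrained-domain errors.
def collect_results(scraper)
  articles = []
  untrained = []
  returned = scraper.scrape do |result|
    if result.is_a?(PatternMissing)
      untrained << result.message
    else
      articles << result
    end
  end
  { articles: articles, untrained: untrained, returned: returned }
end

stub = StubScraper.new([{ title: 'A' }, PatternMissing.new('example.org'), { title: 'B' }])
p collect_results(stub)
```

Without a block, the same stub raises on the second entry, which mirrors the `raise e unless block_given?` line in the source. With the real class, the type check in the block would be against NewsScraper::Transformers::ScrapePatternNotDefined.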