class Arwen

Parses a sitemap URL and returns all links listed in the sitemap or sitemap_index. It uses Typhoeus for network requests, issuing them concurrently when parsing a sitemap_index, and Ox to parse the XML. Sitemaps are assumed to follow the sitemaps.org protocol.

@see github.com/typhoeus/typhoeus
@see github.com/ohler55/ox
@see www.sitemaps.org/protocol.html

Constants

VERSION

Public Class Methods

new(url, opts = {})

Create a new Arwen instance

@param [String] url the full URL to the sitemap or sitemap_index XML file
@param [Hash] opts options passed to Typhoeus::Request instances
@option opts [Integer] :max_concurrency maximum concurrent requests passed to Typhoeus::Hydra (defaults to 200)
@see rubydoc.info/github/typhoeus/typhoeus/Typhoeus/Request

# File lib/arwen.rb, line 22
def initialize(url, opts = {})
  @url = url
  max_concurrency = opts.delete(:max_concurrency) { 200 }
  @opts = { followlocation: true }.merge(opts)
  @hydra = Typhoeus::Hydra.new(max_concurrency: max_concurrency)
end
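
The option handling in the constructor can be looked at in isolation. The following is a minimal sketch, not part of the library: `split_options` is a hypothetical helper that mirrors the two option lines above. It shows that `:max_concurrency` is removed before the remaining options are forwarded to the requests, that 200 is the fallback, and that a caller can override the `followlocation` default. Unlike the source, the sketch duplicates the hash first so the caller's hash is left untouched.

```ruby
# Hypothetical helper (not in Arwen) mirroring the option handling in #initialize.
def split_options(opts = {})
  opts = opts.dup # work on a copy; the original code deletes from the caller's hash
  # Hash#delete with a block returns the block's value when the key is absent.
  max_concurrency = opts.delete(:max_concurrency) { 200 }
  # Caller-supplied options win over the followlocation default.
  request_opts = { followlocation: true }.merge(opts)
  [max_concurrency, request_opts]
end

p split_options
p split_options(max_concurrency: 10, timeout: 5)
p split_options(followlocation: false)
```

Because `merge` is applied to the default hash, any key supplied by the caller replaces the default of the same name.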

Public Instance Methods

sitemap()

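Fetches the sitemap URL and parses the response into an Ox::Document instance. The result is memoized, so the URL is fetched only once per Arwen instance.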

@return [Ox::Document]
@see www.ohler.com/ox/Ox/Document.html

# File lib/arwen.rb, line 47
def sitemap
  @sitemap ||= raw_sitemap
end
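
The `@sitemap ||= raw_sitemap` line is Ruby's memoization idiom: the right-hand side runs only while the instance variable is nil or false. A minimal sketch of the effect, using a hypothetical `MemoDemo` class whose counter stands in for the network request:

```ruby
# Hypothetical class illustrating the ||= memoization idiom used by #sitemap.
class MemoDemo
  attr_reader :fetch_count

  def initialize
    @fetch_count = 0
  end

  def sitemap
    @sitemap ||= raw_sitemap
  end

  private

  # Stand-in for the real network fetch and XML parse.
  def raw_sitemap
    @fetch_count += 1
    "<urlset/>"
  end
end

demo = MemoDemo.new
3.times { demo.sitemap }
puts demo.fetch_count # prints 1
```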
to_a()

Returns an array of URL strings for all URLs in the sitemap.

@return [Array<String>]

# File lib/arwen.rb, line 39
def to_a
  urls.map(&:url)
end
urls()

Fetches and returns all URLs in the sitemap, each with the metadata from its <url> element in the sitemap schema. The result is memoized.

@return [Array<SitemapParser::Url>]

# File lib/arwen.rb, line 32
def urls
  @urls ||= all_urls(sitemap)
end

Private Instance Methods

all_urls(sitemap)
# File lib/arwen.rb, line 53
def all_urls(sitemap)
  return parse_multiple_sitemaps(sitemap) if sitemap.root.respond_to?(:sitemap)

  parse_single_sitemap(sitemap)
end
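
The dispatch above, together with the root-name checks in the two parse methods, reflects the two document types defined by the sitemaps.org protocol: a `<sitemapindex>` holding `<sitemap>` children and a `<urlset>` holding `<url>` children. The sketch below illustrates that distinction with Ruby's bundled REXML parser rather than Ox, purely so it runs without extra gems. `sitemap_kind`, the two constants, and the example.com URLs are hypothetical.

```ruby
require "rexml/document"

INDEX_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  </sitemapindex>
XML

URLSET_XML = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
  </urlset>
XML

# Hypothetical helper: classify a sitemap document by its root element name.
def sitemap_kind(xml)
  case REXML::Document.new(xml).root&.name
  when "sitemapindex" then :index
  when "urlset" then :single
  else raise "invalid sitemap format"
  end
end

p sitemap_kind(INDEX_XML)
p sitemap_kind(URLSET_XML)
```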
fetch_sitemaps(urls)
# File lib/arwen.rb, line 74
def fetch_sitemaps(urls)
  requests = urls.map do |url|
    req = Typhoeus::Request.new(url, @opts)
    @hydra.queue(req)
    req
  end
  @hydra.run

  requests
end
parse_multiple_sitemaps(sitemap)
# File lib/arwen.rb, line 59
def parse_multiple_sitemaps(sitemap)
  raise "invalid sitemap format" unless sitemap&.root&.value == "sitemapindex"

  urls = sitemap.root.locate("sitemap/loc/*")
  site_urls = []

  requests = fetch_sitemaps(urls)
  requests.each do |req|
    sitemap = Ox.load(req.response.body)
    site_urls += parse_single_sitemap(sitemap)
  end

  site_urls
end
parse_single_sitemap(sitemap)
# File lib/arwen.rb, line 85
def parse_single_sitemap(sitemap)
  raise "invalid sitemap format" unless sitemap&.root&.value == "urlset"

  sitemap.root.nodes.map { |node| Url.new(node) }
end
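
To make the mapping from `<url>` elements to URL objects concrete, here is a minimal sketch under stated assumptions. It uses REXML instead of Ox so that it runs with a stock Ruby install, and a hypothetical `Entry` struct in place of the library's own `Url` class, whose definition is not shown on this page. `parse_urlset`, `child_text`, and `SAMPLE_URLSET` are likewise illustrative names.

```ruby
require "rexml/document"

# Hypothetical value object standing in for the library's Url class.
Entry = Struct.new(:url, :lastmod)

SAMPLE_URLSET = <<~XML
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc><lastmod>2024-01-01</lastmod></url>
    <url><loc>https://example.com/about</loc></url>
  </urlset>
XML

# Returns the stripped text of the first child element called `name`, or nil.
def child_text(node, name)
  child = node.elements.to_a.find { |el| el.name == name }
  child&.text&.strip
end

# Validates the root element, then builds one Entry per <url> child.
def parse_urlset(xml)
  root = REXML::Document.new(xml).root
  raise "invalid sitemap format" unless root&.name == "urlset"

  root.elements.to_a
      .select { |node| node.name == "url" }
      .map { |node| Entry.new(child_text(node, "loc"), child_text(node, "lastmod")) }
end

parse_urlset(SAMPLE_URLSET).each { |entry| puts entry.url }
```

Optional schema fields such as `<lastmod>` come back as nil when the element is absent, which keeps the struct shape uniform across entries.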
raw_sitemap()
# File lib/arwen.rb, line 91
def raw_sitemap
  response = Typhoeus.get(@url, @opts)
  raise "invalid sitemap url for #{@url}" unless response.success?

  Ox.load(response.body)
end
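
Every failure path shown on this page calls `raise` with a plain string, which Ruby turns into a `RuntimeError`. A caller that wants to tolerate an unreachable or malformed sitemap can rescue that class. The sketch below is one possible approach, not something the library provides: `urls_or_empty` is a hypothetical helper and `FailingParser` is a stand-in for an `Arwen` instance whose request did not succeed. Exceptions raised by Typhoeus or Ox themselves are not covered by it.

```ruby
# Hypothetical wrapper: returns [] instead of raising when the sitemap is unusable.
def urls_or_empty(parser)
  parser.to_a
rescue RuntimeError => e
  warn "sitemap skipped: #{e.message}"
  []
end

# Stand-in for an Arwen instance whose request did not succeed.
class FailingParser
  def to_a
    raise "invalid sitemap url for https://example.com/missing.xml"
  end
end

p urls_or_empty(FailingParser.new)
```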