class Arwen
Parses a sitemap URL and returns every link listed in the sitemap or sitemap_index. It uses Typhoeus for network requests, including concurrent requests when parsing a sitemap_index. Ox is the XML parser used to parse the sitemap. Sitemaps are assumed to follow the sitemaps.org protocol.
@see github.com/typhoeus/typhoeus
@see github.com/ohler55/ox
@see www.sitemaps.org/protocol.html
Constants
- VERSION
Public Class Methods
Create a new Arwen instance
@param [String] url the full URL to the sitemap or sitemap_index XML file
@param [Hash] opts options passed to Typhoeus::Request instances
@option opts [Integer] :max_concurrency maximum concurrent requests passed to Typhoeus::Hydra
@see rubydoc.info/github/typhoeus/typhoeus/Typhoeus/Request
# File lib/arwen.rb, line 22
def initialize(url, opts = {})
  @url = url
  max_concurrency = opts.delete(:max_concurrency) { 200 }
  @opts = { followlocation: true }.merge(opts)
  @hydra = Typhoeus::Hydra.new(max_concurrency: max_concurrency)
end
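The option handling in the constructor can be tried in isolation. The sketch below is only an illustration: `split_options` is a hypothetical helper, not part of Arwen, and it duplicates the hash first so the caller's copy is not modified (the constructor itself deletes the key in place). It shows that `:max_concurrency` falls back to 200 when absent and that caller-supplied options override the `followlocation: true` default.

```ruby
# Hypothetical helper mirroring the option handling in Arwen#initialize.
# Returns [max_concurrency, request_options].
def split_options(opts = {})
  opts = opts.dup # assumption: copy so the caller's hash stays untouched

  # Hash#delete returns the block's value when the key is missing.
  max_concurrency = opts.delete(:max_concurrency) { 200 }

  # Defaults come first, so anything the caller passes wins on conflict.
  request_opts = { followlocation: true }.merge(opts)

  [max_concurrency, request_opts]
end

concurrency, request_opts = split_options(max_concurrency: 10, timeout: 5)
puts "max_concurrency=#{concurrency} request_opts=#{request_opts.inspect}"
```

Because `:max_concurrency` is removed before the merge, it reaches only the Hydra and is never forwarded to individual requests.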
Public Instance Methods
Parses the sitemap URL into an Ox::Document instance
@return [Ox::Document]
@see www.ohler.com/ox/Ox/Document.html
# File lib/arwen.rb, line 47
def sitemap
  @sitemap ||= raw_sitemap
end
Returns an array of URL strings for every URL in the sitemap
@return [Array<String>]
# File lib/arwen.rb, line 39
def to_a
  urls.map(&:url)
end
Fetches and returns all URLs in the sitemap, each with its corresponding <url> sitemap schema metadata
@return [Array<SitemapParser::Url>]
# File lib/arwen.rb, line 32
def urls
  @urls ||= all_urls(sitemap)
end
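Both `sitemap` and `urls` cache their result with `||=`, so the network fetch and the parse happen at most once per instance. A minimal sketch of that pattern follows; `MemoDemo`, `expensive_fetch`, and `fetch_count` are hypothetical names used only to make the caching visible.

```ruby
# Hypothetical class illustrating the `@ivar ||= compute` memoization idiom.
class MemoDemo
  attr_reader :fetch_count

  def initialize
    @fetch_count = 0
  end

  # The right-hand side runs only while @urls is nil (or false).
  def urls
    @urls ||= expensive_fetch
  end

  private

  # Stand-in for the real fetch-and-parse work.
  def expensive_fetch
    @fetch_count += 1
    ["https://example.com/a", "https://example.com/b"]
  end
end

demo = MemoDemo.new
demo.urls
demo.urls
puts "fetches performed: #{demo.fetch_count}"
```

One consequence of this idiom is that a new instance is needed to pick up changes to a sitemap that has already been read.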
Private Instance Methods
# File lib/arwen.rb, line 53
def all_urls(sitemap)
  return parse_multiple_sitemaps(sitemap) if sitemap.root.respond_to?(:sitemap)

  parse_single_sitemap(sitemap)
end
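The choice between an index and a plain sitemap comes down to the document's root element: the sitemaps.org protocol uses `<sitemapindex>` with `<sitemap>` children for an index and `<urlset>` with `<url>` children for a single sitemap. The sketch below illustrates that distinction with REXML rather than Ox, purely so it runs without extra gems. `sitemap_kind` and `locs` are hypothetical helpers and are not how Arwen is implemented.

```ruby
require "rexml/document"

# Hypothetical helper: classify a sitemap document by its root element name.
def sitemap_kind(xml)
  root = REXML::Document.new(xml).root
  case root&.name
  when "sitemapindex" then :index
  when "urlset" then :urlset
  else raise ArgumentError, "invalid sitemap format"
  end
end

# Hypothetical helper: collect the text of every <loc> one level below each
# entry (<sitemap> or <url>) under the root.
def locs(xml)
  root = REXML::Document.new(xml).root
  root.elements.to_a.flat_map do |entry|
    entry.elements.to_a
         .select { |child| child.name == "loc" }
         .map { |loc| loc.text.to_s.strip }
  end
end

sample = <<~XML
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
    <sitemap><loc>https://example.com/sitemap-2.xml</loc></sitemap>
  </sitemapindex>
XML

puts "#{sitemap_kind(sample)}: #{locs(sample).inspect}"
```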
# File lib/arwen.rb, line 74
def fetch_sitemaps(urls)
  requests = urls.map do |url|
    req = Typhoeus::Request.new(url, @opts)
    @hydra.queue(req)
    req
  end
  @hydra.run
  requests
end
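`fetch_sitemaps` follows a queue-then-run pattern: every request is queued first, the Hydra is run once, and the same request objects are returned so each response can be read afterwards. The sketch below mimics that shape with stand-in classes so it runs without the typhoeus gem or network access. `FakeRequest`, `FakeHydra`, `FakeResponse`, and `fetch_all` are all hypothetical, and the fake deliberately executes requests in reverse to show that the returned order follows the input URLs, not the completion order.

```ruby
# Hypothetical stand-ins for Typhoeus::Request and Typhoeus::Hydra.
FakeResponse = Struct.new(:body)

class FakeRequest
  attr_reader :url, :response

  def initialize(url)
    @url = url
  end

  # Simulates the HTTP round trip by fabricating a response body.
  def run
    @response = FakeResponse.new("<urlset><!-- fetched #{url} --></urlset>")
  end
end

class FakeHydra
  def initialize
    @queue = []
  end

  def queue(request)
    @queue << request
  end

  # The real Hydra runs queued requests concurrently; this stub runs them one
  # by one in reverse order and then empties the queue.
  def run
    @queue.reverse_each(&:run)
    @queue.clear
  end
end

# Same shape as fetch_sitemaps: queue everything, run once, return the requests.
def fetch_all(hydra, urls)
  requests = urls.map do |url|
    request = FakeRequest.new(url)
    hydra.queue(request)
    request
  end
  hydra.run
  requests
end

done = fetch_all(FakeHydra.new, ["https://example.com/a.xml", "https://example.com/b.xml"])
done.each { |request| puts "#{request.url}: #{request.response.body}" }
```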
# File lib/arwen.rb, line 59
def parse_multiple_sitemaps(sitemap)
  raise "invalid sitemap format" unless sitemap&.root&.value == "sitemapindex"

  urls = sitemap.root.locate("sitemap/loc/*")
  site_urls = []
  requests = fetch_sitemaps(urls)
  requests.each do |req|
    sitemap = Ox.load(req.response.body)
    site_urls += parse_single_sitemap(sitemap)
  end
  site_urls
end
# File lib/arwen.rb, line 85
def parse_single_sitemap(sitemap)
  raise "invalid sitemap format" unless sitemap&.root&.value == "urlset"

  sitemap.root.nodes.map { |node| Url.new(node) }
end
# File lib/arwen.rb, line 91
def raw_sitemap
  response = Typhoeus.get(@url, @opts)
  raise "invalid sitemap url for #{@url}" unless response.success?

  Ox.load(response.body)
end