class UrlFinder::SitemapReader
Parses Sitemaps (see www.sitemaps.org).
Public Instance Methods
The XML document
@return [REXML::Document] the XML document

# File lib/url_finder/readers/sitemap_reader.rb, line 17
def document
  @document ||= begin
    REXML::Document.new(content)
  rescue REXML::ParseException => _e
    REXML::Document.new('')
  end
end
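The fallback in `document` can be illustrated in isolation. The following is a minimal sketch, not the gem's own code: `parse_or_empty` is a hypothetical helper name, and it assumes the `rexml` gem is available (it is bundled with current Ruby releases).

```ruby
require 'rexml/document'

# Hypothetical helper mirroring the fallback in `document`:
# return an empty document when the content is not well-formed XML.
def parse_or_empty(content)
  REXML::Document.new(content)
rescue REXML::ParseException
  REXML::Document.new('')
end

valid   = parse_or_empty('<urlset><url><loc>https://example.com/</loc></url></urlset>')
invalid = parse_or_empty('<a></b>') # mismatched end tag

puts valid.root.name      # name of the root element of the valid document
puts invalid.root.inspect # the fallback document has no root element
```

Because the fallback document has no elements, a reader built this way would treat unparseable content the same as a document without XML elements.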
Check whether the sitemap is a plain file, i.e. the parsed document contains no XML elements
@return [Boolean] whether the document is plain

# File lib/url_finder/readers/sitemap_reader.rb, line 36
def plain_document?
  document.elements.empty?
end
Return the name of the document's root element (if there is one)
@return [String, nil] the document root name, or nil if the document has no root

# File lib/url_finder/readers/sitemap_reader.rb, line 42
def root_name
  return unless document.root

  document.root.name
end
Returns true if the Sitemap is a Sitemap index
@return [Boolean] whether the Sitemap is a Sitemap index
@example Check if Sitemap is a sitemap index
  sitemap = Sitemap.new(xml)
  sitemap.sitemap_index?

# File lib/url_finder/readers/sitemap_reader.rb, line 53
def sitemap_index?
  root_name == 'sitemapindex'
end
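The root-name check can be tried without the reader class. This is a rough sketch under stated assumptions: `sitemap_kind` and its return symbols are hypothetical names chosen for illustration, and only the comparison against `'sitemapindex'` and `'urlset'` comes from the source.

```ruby
require 'rexml/document'

# Hypothetical helper: classify a sitemap by the name of its root element.
def sitemap_kind(xml)
  root = REXML::Document.new(xml).root
  case root&.name
  when 'sitemapindex' then :index
  when 'urlset'       then :urlset
  else                     :unknown # no root element, or some other root
  end
end

index_xml = <<~XML
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-1.xml</loc></sitemap>
  </sitemapindex>
XML

puts sitemap_kind(index_xml)
puts sitemap_kind('<urlset></urlset>')
puts sitemap_kind('')
```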
Return all sitemap URLs defined in the Sitemap.
@return [Array<String>] the Sitemap URLs defined in the Sitemap
@example Get Sitemap URLs defined in Sitemap
  sitemap = Sitemap.new(xml)
  sitemap.sitemaps

# File lib/url_finder/readers/sitemap_reader.rb, line 30
def sitemaps
  @sitemaps ||= extract_urls('sitemap')
end
Return all URLs defined in the Sitemap.
@return [Array<String>] the URLs defined in the Sitemap
@example Get URLs defined in Sitemap
  sitemap = Sitemap.new(xml)
  sitemap.urls

# File lib/url_finder/readers/sitemap_reader.rb, line 11
def urls
  @urls ||= extract_urls('url')
end
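Both `urls` and `sitemaps` cache their result with `||=`. The sketch below shows that pattern on its own. `MemoDemo`, `calls`, and `expensive_extract` are hypothetical names that stand in for the reader and its extraction step.

```ruby
# Hypothetical class demonstrating the `@ivar ||= ...` memoization pattern.
class MemoDemo
  attr_reader :calls

  def initialize
    @calls = 0
  end

  # The extraction runs on the first call only; later calls reuse @urls.
  def urls
    @urls ||= expensive_extract
  end

  private

  # Stand-in for the real extraction; counts how often it is invoked.
  def expensive_extract
    @calls += 1
    ['https://example.com/']
  end
end

demo   = MemoDemo.new
first  = demo.urls
second = demo.urls
puts demo.calls           # number of times the extraction actually ran
puts first.equal?(second) # whether both calls returned the same Array object
```

An empty Array is truthy in Ruby, so a sitemap without URLs is cached as well. Only a `nil` or `false` result would trigger a second extraction.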
Returns true if the Sitemap lists regular URLs
@return [Boolean] whether the Sitemap is a regular URL list
@example Check if Sitemap is a regular URL list
  sitemap = Sitemap.new(xml)
  sitemap.urlset?

# File lib/url_finder/readers/sitemap_reader.rb, line 62
def urlset?
  root_name == 'urlset'
end
Private Instance Methods
Extract URLs from Sitemap
# File lib/url_finder/readers/sitemap_reader.rb, line 69
def extract_urls(node_name)
  return document.to_s.each_line.map(&:strip) if plain_document?

  urls = []
  document.root.elements.each("#{node_name}/loc") do |element|
    urls << element.text
  end
  urls
end
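The XML branch of `extract_urls` can be sketched as a standalone function. `loc_texts` is a hypothetical name, the sample document is made up, and the `xmlns` attribute found on real sitemaps is left out to keep the example short.

```ruby
require 'rexml/document'

# Hypothetical standalone version of the XML branch of `extract_urls`:
# collect the text of every <loc> nested directly under <node_name>.
def loc_texts(xml, node_name)
  doc = REXML::Document.new(xml)
  return [] unless doc.root # nothing to extract without a root element

  urls = []
  doc.root.elements.each("#{node_name}/loc") do |element|
    urls << element.text
  end
  urls
end

urlset_xml = <<~XML
  <urlset>
    <url><loc>https://example.com/a</loc></url>
    <url><loc>https://example.com/b</loc></url>
  </urlset>
XML

p loc_texts(urlset_xml, 'url')     # <loc> values under <url>
p loc_texts(urlset_xml, 'sitemap') # no <sitemap> nodes in a urlset
```

The same function with `'sitemap'` as `node_name` would return the nested sitemap locations of a `sitemapindex` document, which is how `sitemaps` and `urls` share one extraction routine.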