class UrlFinder::SitemapReader

Parses Sitemaps as described at www.sitemaps.org.

Public Instance Methods

document()

Returns the parsed XML document, or an empty document if the content is not valid XML.

@return [REXML::Document] the XML document

# File lib/url_finder/readers/sitemap_reader.rb, line 17
def document
  @document ||= begin
    REXML::Document.new(content)
  rescue REXML::ParseException => _e
    REXML::Document.new('')
  end
end
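
The fallback in document can be tried in isolation. The following is a minimal sketch, assuming only REXML itself; parse_or_empty is a hypothetical helper name that is not part of UrlFinder, and the memoization with @document is left out.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of UrlFinder) mirroring the fallback in
# `document`: parse the content, or return an empty document on a parse error.
def parse_or_empty(content)
  REXML::Document.new(content)
rescue REXML::ParseException
  REXML::Document.new('')
end

valid  = parse_or_empty('<urlset><url><loc>https://example.com/</loc></url></urlset>')
broken = parse_or_empty('<urlset><url></urlset>') # mismatched closing tag

puts valid.root.name       # name of the root element of the parsed document
puts broken.root.inspect   # the empty fallback document has no root element
puts broken.elements.empty?
```

Because the fallback document has no elements, a reader built this way would treat unparseable content the same as a plain document.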

plain_document?()

Checks whether the sitemap is a plain file, that is, whether the parsed document contains no XML elements.

@return [Boolean] whether the document is plain

# File lib/url_finder/readers/sitemap_reader.rb, line 36
def plain_document?
  document.elements.empty?
end

root_name()

Returns the name of the document's root element, or nil if there is no root.

@return [String, nil] the document root name

# File lib/url_finder/readers/sitemap_reader.rb, line 42
def root_name
  return unless document.root

  document.root.name
end

sitemap_index?()

Returns true if the Sitemap is a Sitemap index.

@return [Boolean] whether the Sitemap is a Sitemap index

@example Check if Sitemap is a Sitemap index

sitemap = Sitemap.new(xml)
sitemap.sitemap_index?

# File lib/url_finder/readers/sitemap_reader.rb, line 53
def sitemap_index?
  root_name == 'sitemapindex'
end

sitemaps()

Returns all Sitemap URLs defined in the Sitemap.

@return [Array<String>] the Sitemap URLs defined in the Sitemap

@example Get Sitemap URLs defined in Sitemap

sitemap = Sitemap.new(xml)
sitemap.sitemaps

# File lib/url_finder/readers/sitemap_reader.rb, line 30
def sitemaps
  @sitemaps ||= extract_urls('sitemap')
end

urls()

Returns all URLs defined in the Sitemap.

@return [Array<String>] the URLs defined in the Sitemap

@example Get URLs defined in Sitemap

sitemap = Sitemap.new(xml)
sitemap.urls

# File lib/url_finder/readers/sitemap_reader.rb, line 11
def urls
  @urls ||= extract_urls('url')
end

urlset?()

Returns true if the Sitemap lists regular URLs.

@return [Boolean] whether the Sitemap is a regular URL list

@example Check if Sitemap is a regular URL list

sitemap = Sitemap.new(xml)
sitemap.urlset?

# File lib/url_finder/readers/sitemap_reader.rb, line 62
def urlset?
  root_name == 'urlset'
end
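
Both sitemap_index? and urlset? reduce to a comparison against the root element name. The sketch below shows that comparison on its own, assuming only REXML; sitemap_kind and the example.com URLs are illustrative and not part of UrlFinder.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of UrlFinder): classify sitemap content
# by the name of its root element, as sitemap_index? and urlset? do.
def sitemap_kind(xml)
  root = REXML::Document.new(xml).root
  case root&.name
  when 'sitemapindex' then :sitemap_index
  when 'urlset'       then :urlset
  else :unknown
  end
end

index_xml = <<~XML
  <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  </sitemapindex>
XML

urlset_xml = <<~XML
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url><loc>https://example.com/</loc></url>
  </urlset>
XML

puts sitemap_kind(index_xml)
puts sitemap_kind(urlset_xml)
puts sitemap_kind('') # an empty document has no root element
```

A caller would typically branch on this result: follow the nested Sitemap URLs for an index, and collect page URLs for a URL set.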

Private Instance Methods

extract_urls(node_name)

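Extracts URLs from the Sitemap. For a plain document, each line of the document is returned with surrounding whitespace stripped; otherwise the text of every node_name/loc element under the root is returned.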

# File lib/url_finder/readers/sitemap_reader.rb, line 69
def extract_urls(node_name)
  return document.to_s.each_line.map(&:strip) if plain_document?

  urls = []
  document.root.elements.each("#{node_name}/loc") do |element|
    urls << element.text
  end
  urls
end
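
The XML branch of extract_urls can be exercised on its own. This is a rough sketch assuming only REXML; loc_texts is a hypothetical stand-in for the private method, the plain-document branch is left out, and the xmlns declaration that real Sitemaps carry is omitted to keep the sample short.

```ruby
require 'rexml/document'

# Hypothetical stand-in (not part of UrlFinder) for the XML branch of
# extract_urls: collect the text of every <node_name>/loc under the root.
def loc_texts(xml, node_name)
  root = REXML::Document.new(xml).root
  return [] unless root

  locs = []
  root.elements.each("#{node_name}/loc") { |element| locs << element.text }
  locs
end

xml = <<~XML
  <urlset>
    <url><loc>https://example.com/</loc></url>
    <url><loc>https://example.com/about</loc></url>
  </urlset>
XML

p loc_texts(xml, 'url')     # the two page URLs, in document order
p loc_texts(xml, 'sitemap') # no <sitemap> children in a URL set
```

Passing 'url' or 'sitemap' as the node name is what lets urls and sitemaps share one extraction routine.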