module Sitemaps
Discover, fetch and parse XML sitemaps as defined by the `sitemaps.org` spec.
Constants
- Entry
@attr loc [URI] the location referred to by this entry. Will never be `nil`.
@attr lastmod [Time, nil] the last modification time of this entry, or `nil` if unspecified.
@attr changefreq [:always, :hourly, :daily, :weekly, :monthly, :yearly, :never, nil] the change frequency of this entry, or `nil` if unspecified.
@attr priority [Float] the priority of this entry, a float from 0 to 1. 0.5 if unspecified.
- Sitemap
@attr entries [Enumerable<Entry>] A set of entries that were parsed out of one or more sitemaps, recursively.
@attr sitemaps [Enumerable<Sitemap>] A set of sitemaps that were found in a sitemap index.
- Submap
@attr loc [URI] the location referred to by this entry. Will never be `nil`.
@attr lastmod [Time, nil] the last modification time of this entry, or `nil` if unspecified.
- VERSION
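As a rough illustration of the value shapes listed above, the sketch below models an entry and a submap with plain `Struct`s. The names `SketchEntry`, `SketchSubmap`, `CHANGEFREQS` and `build_entry` are hypothetical stand-ins, not the library's own definitions, which may differ; only the attribute names, the allowed change frequencies and the 0.5 default priority come from the descriptions above.

```ruby
require "uri"

# Hypothetical stand-ins for the documented value objects.
SketchEntry  = Struct.new(:loc, :lastmod, :changefreq, :priority)
SketchSubmap = Struct.new(:loc, :lastmod)

# The change frequencies listed for the changefreq attribute.
CHANGEFREQS = %i[always hourly daily weekly monthly yearly never].freeze

# Build an entry, applying the documented default priority of 0.5.
def build_entry(loc, lastmod: nil, changefreq: nil, priority: nil)
  raise ArgumentError, "loc is required" if loc.nil?
  unless changefreq.nil? || CHANGEFREQS.include?(changefreq)
    raise ArgumentError, "unknown changefreq: #{changefreq.inspect}"
  end

  SketchEntry.new(URI.parse(loc.to_s), lastmod, changefreq, (priority || 0.5).to_f)
end

SAMPLE_ENTRY = build_entry("http://example.com/about", changefreq: :weekly)
```

Here `SAMPLE_ENTRY.loc` is a `URI`, `lastmod` stays `nil`, and `priority` falls back to 0.5 because none was given.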
Public Class Methods
@return [Instance]
@private
@api private
# File lib/sitemaps.rb, line 118
def self._instance
  @instance ||= Sitemaps::Instance.new
end
Discover, fetch and parse sitemaps from the given host.
Attempts to find and fetch sitemaps at a given host, by examining the `robots.txt` at that host, or if no sitemaps are found via `robots.txt`, checking a small number of common locations, including `sitemap.xml`, `sitemap_index.xml`, and the gzip versions of those same locations.
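The lookup order described above can be sketched roughly as follows. This is not the library's implementation: `sitemaps_from_robots`, `candidate_roots` and `COMMON_PATHS` are hypothetical names, the `.gz` spellings of the fallback paths are an assumption, and the real code fetches `robots.txt` itself rather than taking its body as an argument.

```ruby
require "uri"

# Fallback paths named in the text; the ".gz" spellings are an assumption.
COMMON_PATHS = %w[
  /sitemap.xml /sitemap_index.xml /sitemap.xml.gz /sitemap_index.xml.gz
].freeze

# Collect the targets of "Sitemap:" lines from a robots.txt body.
def sitemaps_from_robots(robots_txt)
  robots_txt.each_line.filter_map do |line|
    match = line.match(/\A\s*sitemap:\s*(\S+)/i)
    URI.parse(match[1]) if match
  end
end

# Prefer what robots.txt advertises; otherwise fall back to common locations.
def candidate_roots(host, robots_txt)
  from_robots = sitemaps_from_robots(robots_txt.to_s)
  return from_robots unless from_robots.empty?

  COMMON_PATHS.map { |path| URI.join(host.to_s, path) }
end
```

Passing an empty `robots.txt` body yields the four fallback locations; a body with one or more `Sitemap:` lines yields only those URLs.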
@overload discover(host, fetcher: nil, max_entries: nil)
@param host [String, URI] the url of the host to interrogate for sitemaps.
@param fetcher [#call] given a URI, fetch an HTTP document. Defaults to using `Fetcher`.
@param max_entries [Integer] the maximum number of entries to include in the sitemap. Once the sitemap has this many entries, further fetches and parsing will not occur. This is always a good idea to include, as many sites have _very_ large sitemaps.
@return [Sitemap]
@overload discover(host, fetcher: nil, filter_indexes: nil, max_entries: nil)
If a block is given, it's used as a filter for entries before they're added to the sitemap.
@param host [String, URI] the url of the host to interrogate for sitemaps.
@param fetcher [#call] given a URI, fetch an HTTP document. Defaults to using `Fetcher`.
@param max_entries [Integer] the maximum number of entries to include in the sitemap. Once the sitemap has this many entries, further fetches and parsing will not occur. This is always a good idea to include, as many sites have _very_ large sitemaps.
@param filter_indexes [Boolean] if true, Submap instances will be run through the filter block as well as Entry instances.
@return [Sitemap]
@yield [Entry] Filters the entry from the sitemap if the block returns falsey.
@yieldreturn [Boolean] whether or not to include the entry in the sitemap.
# File lib/sitemaps.rb, line 104
def self.discover(url, fetcher: nil, max_entries: nil, filter_indexes: nil, &block)
  fetcher ||= @default_fetcher
  unless url.is_a? URI
    url = "http://#{url}" unless url =~ %r{^https?://}
    url = URI.parse(url)
  end

  roots = _instance.discover_roots(url, fetcher)
  _instance.fetch_recursive(roots, fetcher, max_entries, filter_indexes, &block)
end
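The argument coercion at the top of this method can be isolated as a small helper to see what it does with different inputs. `normalize_url` is a hypothetical name used only for this sketch; the logic is copied from the listing above.

```ruby
require "uri"

# Mirrors the coercion in the listing: URIs pass through untouched, and
# strings without an http(s) scheme get "http://" prepended before parsing.
def normalize_url(url)
  return url if url.is_a?(URI)

  url = "http://#{url}" unless url =~ %r{^https?://}
  URI.parse(url)
end

normalize_url("example.com")          # scheme defaults to http
normalize_url("https://example.com")  # an existing scheme is kept
```

One consequence worth noting is that a bare host name is always tried over plain `http`, so callers who need `https` should pass the scheme explicitly.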
Fetch and parse a sitemap from the given URL.
@overload fetch(url, fetcher: nil, max_entries: nil)
@param url [String, URI] the url of the sitemap in question.
@param fetcher [#call] given a URI, fetch an HTTP document. Defaults to using `Fetcher`.
@param max_entries [Integer] the maximum number of entries to include in the sitemap. Once the sitemap has this many entries, further fetches and parsing will not occur. This is always a good idea to include, as many sites have _very_ large sitemaps.
@return [Sitemap]
@overload fetch(url, fetcher: nil, filter_indexes: nil, max_entries: nil)
If a block is given, it's used as a filter for entries before they're added to the sitemap.
@param url [String, URI] the url of the sitemap in question.
@param fetcher [#call] given a URI, fetch an HTTP document. Defaults to using `Fetcher`.
@param max_entries [Integer] the maximum number of entries to include in the sitemap. Once the sitemap has this many entries, further fetches and parsing will not occur. This is always a good idea to include, as many sites have _very_ large sitemaps.
@param filter_indexes [Boolean] if true, Submap instances will be run through the filter block as well as Entry instances.
@return [Sitemap]
@yield [Entry] Filters the entry from the sitemap if the block returns falsey.
@yieldreturn [Boolean] whether or not to include the entry in the sitemap.
# File lib/sitemaps.rb, line 67
def self.fetch(url, fetcher: nil, max_entries: nil, filter_indexes: nil, &block)
  fetcher ||= @default_fetcher
  unless url.is_a? URI
    url = "http://#{url}" unless url =~ %r{^https?://}
    url = URI.parse(url)
  end

  _instance.fetch_recursive(url, fetcher, max_entries, filter_indexes, &block)
end
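Because `fetcher` only needs to respond to `#call` and the filter is an ordinary block, both can be supplied as lambdas. The sketch below builds a network-free stub fetcher and a date-based filter. `DOCUMENTS`, `STUB_FETCHER`, `CUTOFF` and `RECENT_ONLY` are hypothetical names, and the assumption that a fetcher returns the document body as a `String` is not confirmed by the text above.

```ruby
require "uri"

# Canned documents keyed by URL, so the sketch never touches the network.
DOCUMENTS = {
  "http://example.com/sitemap.xml" => "<urlset></urlset>"
}.freeze

# Anything that responds to #call with a URI can act as a fetcher.
# Returning the body as a String is an assumption of this sketch.
STUB_FETCHER = lambda do |uri|
  DOCUMENTS.fetch(uri.to_s) { raise ArgumentError, "no canned document for #{uri}" }
end

# A filter expressed as a lambda: keep only items modified since a cutoff.
# It reads lastmod only, so it also works for Submap when filter_indexes is set.
CUTOFF = Time.utc(2020, 1, 1)
RECENT_ONLY = ->(item) { !item.lastmod.nil? && item.lastmod >= CUTOFF }

# With the gem loaded, these would be passed along the lines of:
#   Sitemaps.fetch("http://example.com/sitemap.xml",
#                  fetcher: STUB_FETCHER, max_entries: 100, &RECENT_ONLY)
```

A stub like this is mainly useful in tests, where fetched content should be deterministic and `max_entries` can be kept small.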
Parse a sitemap from an XML string. Does not fail on invalid documents, but doesn't include invalid entries in the final set. As such, a non-XML file, or non-sitemap XML file will return an empty sitemap.
@param source [String] an XML string to parse as a sitemap.
@return [Sitemap] the sitemap represented by the given XML string.
# File lib/sitemaps.rb, line 40
def self.parse(source)
  Sitemaps::Parser.parse(source)
end
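The lenient behaviour described for `parse` (invalid input yields an empty result rather than an error) can be approximated with REXML as a rough stand-in. This is not how `Sitemaps::Parser` is implemented, which the listing does not show; `sketch_locs` is a hypothetical helper that only extracts `<loc>` values and skips entries that lack one.

```ruby
require "rexml/document"

# Return the <loc> text of every <url> in a sitemap document, or [] when the
# input is not XML or is XML but not a sitemap.
def sketch_locs(source)
  root = REXML::Document.new(source).root
  return [] unless root && root.name == "urlset"

  locs = []
  root.each_element do |url|
    next unless url.name == "url"

    url.each_element do |child|
      locs << child.text.strip if child.name == "loc" && child.text
    end
  end
  locs
rescue REXML::ParseException
  # Malformed input is treated the same as an empty sitemap.
  []
end
```

The design choice mirrored here is to degrade to an empty result instead of raising, so callers crawling many hosts do not need to wrap every parse in their own error handling.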