class Apes::UrlsParser
Utility class to parse URLs, domains and emails.
Constants
- DOMAIN_MATCHER
Regular expression to match a valid domain.
- EMAIL_MATCHER
Regular expression to match a valid email address.
- TEMPLATE
Template to replace URLs in a text.
- TLDS
The list of valid top level domains for a URL, email and domain. To update the list: jecas.cz/tld-list/
- URL_MATCHER
Regular expression to match a valid URL.
- URL_SEPARATOR
Regular expression to detect a URL in a text.
Public Class Methods
Get the singleton instance of the parser.
@param force [Boolean] Whether to force creation of a new singleton.
@return [Apes::UrlsParser] An instance of the parser.

    # File lib/apes/urls_parser.rb, line 116
    def self.instance(force = false)
      @instance = nil if force
      @instance ||= new
    end
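The memoization in the listing above can be tried in isolation. This is a minimal sketch that copies the method body into a stand-in class (`ParserSketch` is a hypothetical name, not part of Apes) to show that repeated calls return the same object until `force` is true.

```ruby
# Stand-in class (hypothetical); only the singleton accessor is reproduced.
class ParserSketch
  def self.instance(force = false)
    @instance = nil if force # Drop the cached object when forced.
    @instance ||= new        # Create it lazily, then reuse it.
  end
end

first = ParserSketch.instance
same  = ParserSketch.instance
fresh = ParserSketch.instance(true)

puts first.equal?(same)   # true: the cached object is reused
puts first.equal?(fresh)  # false: forcing builds a new object
```

After a forced call, the new object becomes the cached one for later calls.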
Public Instance Methods
Removes all extra characters (such as a trailing comma) from a URL.
@param url [String] The URL to clean.
@return [String] The cleaned URL.

    # File lib/apes/urls_parser.rb, line 209
    def clean(url)
      url.strip.gsub(/#{UrlsParser::URL_SEPARATOR.source}$/, "")
    end
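As a rough illustration of the cleaning step, the sketch below reuses the same `strip` and `gsub` chain with a stand-in separator. `SEPARATOR_SKETCH` and `clean_sketch` are hypothetical names, and the character class is an assumption; the real `URL_SEPARATOR` value is not shown on this page.

```ruby
# Assumed stand-in for UrlsParser::URL_SEPARATOR (the real pattern may differ).
SEPARATOR_SKETCH = /[,.;:!?\s]/

# Same shape as the listing above, using the stand-in separator.
def clean_sketch(url)
  url.strip.gsub(/#{SEPARATOR_SKETCH.source}$/, "")
end

puts clean_sketch("  http://example.com, ")   # http://example.com
puts clean_sketch("http://example.com/path")  # http://example.com/path
```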
Checks if the value is a valid domain.
@return [Boolean] `true` if the value is a valid domain, `false` otherwise.

    # File lib/apes/urls_parser.rb, line 138
    def domain?(domain)
      domain.strip =~ /^(#{UrlsParser::DOMAIN_MATCHER.source})$/ix ? true : false
    end
Checks if the value is a valid email address.
@return [Boolean] `true` if the value is a valid email address, `false` otherwise.

    # File lib/apes/urls_parser.rb, line 131
    def email?(email)
      email.strip =~ /^(#{UrlsParser::EMAIL_MATCHER.source})$/ix ? true : false
    end
Makes sure the string starts with the scheme for the specified protocol.
@param subject [String] The string to analyze.
@param protocol [String] The protocol for the URL.
@param secure [Boolean] If the scheme should be secure or not.
@return [String] The string with a URL scheme at the beginning.

    # File lib/apes/urls_parser.rb, line 156
    def ensure_url_with_scheme(subject, protocol = "http", secure: false)
      schema = protocol + (secure ? "s" : "")
      subject !~ /^(#{protocol}(s?):\/\/)/ ? "#{schema}://#{subject}" : subject
    end
Extract all URLs from a text.
@param text [String] The text that contains URLs.
@param mode [Symbol] Which URLs to extract. It can be `:shortened`, `:unshortened` or `:all` (the default).
@param sort [NilClass|Symbol] If not `nil`, how to sort extracted URLs. It can be `:asc` or `:desc`.
@param shortened_domains [Array] Which domains to consider shortened.
@return [Array] An array of extracted URLs.

    # File lib/apes/urls_parser.rb, line 168
    def extract_urls(text, mode: :all, sort: nil, shortened_domains: [])
      regexp = /((^|\s+)(?<url>#{UrlsParser::URL_MATCHER.source})(#{UrlsParser::URL_SEPARATOR.source}|$))/ix
      matches = text.scan(regexp).flatten.map { |u| clean(u) }.uniq

      if mode == :shortened
        matches.select! { |u| shortened?(u, *shortened_domains) }
      elsif mode == :unshortened
        matches.reject! { |u| shortened?(u, *shortened_domains) }
      end

      matches = sort_urls(matches, sort)
      matches
    end
Generate a hash of a URL.
@param url [String] The URL to hashify.
@return [String] The hash for the URL.

    # File lib/apes/urls_parser.rb, line 217
    def hashify(url)
      Digest::SHA2.hexdigest(ensure_url_with_scheme(url.strip))
    end
Replace all URLs in a text with provided replacements.
@param text [String] The text that contains URLs.
@param replacements [Hash] A map where keys are the URLs to replace and values are their replacements.
@param mode [Symbol] Which URLs to replace. It can be `:shortened`, `:unshortened` or `:all` (the default).
@param shortened_domains [Array] Which domains to consider shortened.
@return [String] The original text with all URLs replaced.

    # File lib/apes/urls_parser.rb, line 189
    def replace_urls(text, replacements: {}, mode: :all, shortened_domains: [])
      text = text.dup

      urls = extract_urls(text, mode: mode, sort: :desc, shortened_domains: shortened_domains).reduce({}) do |accu, url|
        if replacements[url]
          hash = hashify(url)
          accu["url_#{hash}"] = ensure_url_with_scheme(replacements[url])
          text.gsub!(/#{Regexp.quote(url)}/, format(UrlsParser::TEMPLATE, hash))
        end

        accu
      end

      Mustache.render(text, urls: urls)
    end
Checks if the value is a shortened URL according to the provided shortened domains.
@return [Boolean] `true` if the value is a shortened URL, `false` otherwise.

    # File lib/apes/urls_parser.rb, line 145
    def shortened?(url, *shortened_domains)
      domains = ["bit.ly"].concat(shortened_domains).uniq.compact.map(&:strip)
      url?(url) && (ensure_url_with_scheme(url.strip) =~ /^(http(s?):\/\/(#{domains.map { |d| Regexp.quote(d) }.join("|")}))/i ? true : false)
    end
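The domain check can be sketched on its own. The version below is a simplified assumption: `shortened_domain_sketch?` is a hypothetical name, and it skips the `url?` validation because `URL_MATCHER` is not shown on this page. It keeps the default `bit.ly` entry and the escaped, scheme-anchored pattern from the listing above.

```ruby
# Simplified sketch: only the domain comparison, without the url? validation.
def shortened_domain_sketch?(url, *shortened_domains)
  domains = ["bit.ly"].concat(shortened_domains).uniq.compact.map(&:strip)

  # Inline equivalent of ensure_url_with_scheme with its default arguments.
  stripped = url.strip
  with_scheme = stripped =~ /^(http(s?):\/\/)/ ? stripped : "http://#{stripped}"

  # Each domain is escaped so that dots match literally.
  pattern = /^(http(s?):\/\/(#{domains.map { |d| Regexp.quote(d) }.join("|")}))/i
  with_scheme.match?(pattern)
end

puts shortened_domain_sketch?("bit.ly/abc")                 # true
puts shortened_domain_sketch?("https://t.co/xyz")           # false
puts shortened_domain_sketch?("https://t.co/xyz", "t.co")   # true
puts shortened_domain_sketch?("http://example.com/bit.ly")  # false
```

Only the start of the URL is compared, so a shortener name appearing later in the path does not count.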
Checks if the value is a valid URL.
@return [Boolean] `true` if the value is a valid URL, `false` otherwise.

    # File lib/apes/urls_parser.rb, line 124
    def url?(url)
      url.strip =~ /^(#{UrlsParser::URL_MATCHER.source})$/ix ? true : false
    end