class Apes::UrlsParser

Utility class to parse URLs, domains and emails.

Constants

DOMAIN_MATCHER

Regular expression to match a valid domain.

EMAIL_MATCHER

Regular expression to match a valid email address.

TEMPLATE

Template to replace URLs in a text.

TLDS

The list of valid top-level domains for a URL, email and domain. To update the list, see jecas.cz/tld-list/

URL_MATCHER

Regular expression to match a valid URL.

URL_SEPARATOR

Regular expression to detect a URL in a text.

Public Class Methods

instance(force = false)

Get the singleton instance of the parser.

@param force [Boolean] Whether to force creation of a new singleton.
@return [Apes::UrlsParser] An instance of the parser.

# File lib/apes/urls_parser.rb, line 116
def self.instance(force = false)
  @instance = nil if force
  @instance ||= new
end
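
The memoization pattern above can be tried in isolation. The following is a minimal sketch; `ParserSketch` is a hypothetical stand-in class, not part of the library, and only the caching logic mirrors the method shown.

```ruby
# Hypothetical stand-in class; only the memoization mirrors the method above.
class ParserSketch
  def self.instance(force = false)
    @instance = nil if force   # discard the cached object when forced
    @instance ||= new          # build once, then reuse
  end
end

first = ParserSketch.instance
same  = ParserSketch.instance
fresh = ParserSketch.instance(true)

puts first.equal?(same)   # true
puts first.equal?(fresh)  # false
```

Note that `||=` is not atomic, so this pattern on its own makes no thread-safety guarantee.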

Public Instance Methods

clean(url)

Removes all extra characters (like trailing comma) from a URL.

@param url [String] The URL to clean.
@return [String] The cleaned URL.

# File lib/apes/urls_parser.rb, line 209
def clean(url)
  url.strip.gsub(/#{UrlsParser::URL_SEPARATOR.source}$/, "")
end
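
A minimal sketch of the same strip-then-trim idea follows. The value of `URL_SEPARATOR` is not shown on this page, so `SEPARATOR_SKETCH` is an assumed stand-in, and `clean_sketch` is a hypothetical name.

```ruby
# ASSUMPTION: stand-in separator; the library's real URL_SEPARATOR is not shown here.
SEPARATOR_SKETCH = /[\s,;.!?)]/

def clean_sketch(url)
  # Strip surrounding whitespace, then drop a separator character anchored at the end.
  url.strip.gsub(/#{SEPARATOR_SKETCH.source}$/, "")
end

puts clean_sketch(" http://example.com/page, ")  # http://example.com/page
puts clean_sketch("http://example.com/page")     # http://example.com/page
```

Because the stand-in pattern matches a single character, this sketch removes only one trailing separator per call.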

domain?(domain)

Checks if the value is a valid domain.

@return [Boolean] `true` if the value is a valid domain, `false` otherwise.

# File lib/apes/urls_parser.rb, line 138
def domain?(domain)
  domain.strip =~ /^(#{UrlsParser::DOMAIN_MATCHER.source})$/ix ? true : false
end
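
The predicates `domain?`, `email?` and `url?` share one technique: take a matcher's source, wrap it in `^(...)$`, and test the stripped input. A minimal sketch, assuming a simplified matcher (`DOMAIN_SKETCH` and `domain_sketch?` are hypothetical; the real `DOMAIN_MATCHER` is not shown on this page):

```ruby
# ASSUMPTION: simplified matcher; the real DOMAIN_MATCHER is not shown on this page.
DOMAIN_SKETCH = /(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}/

def domain_sketch?(value)
  # Wrap the matcher's source in ^...$ so a whole line must match, ignoring case.
  value.strip =~ /^(#{DOMAIN_SKETCH.source})$/ix ? true : false
end

puts domain_sketch?(" Example.COM ")       # true
puts domain_sketch?("http://example.com")  # false
puts domain_sketch?("example")             # false
```

In Ruby, `^` and `$` anchor at line boundaries rather than string boundaries, so in this sketch a multi-line string passes if any one line matches; `\A` and `\z` would anchor to the whole string.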

email?(email)

Checks if the value is a valid email address.

@return [Boolean] `true` if the value is a valid email address, `false` otherwise.

# File lib/apes/urls_parser.rb, line 131
def email?(email)
  email.strip =~ /^(#{UrlsParser::EMAIL_MATCHER.source})$/ix ? true : false
end

ensure_url_with_scheme(subject, protocol = "http", secure: false)

Makes sure the string starts with the scheme for the specified protocol.

@param subject [String] The string to analyze.
@param protocol [String] The protocol for the URL.
@param secure [Boolean] If the scheme should be secure or not.
@return [String] The string with a URL scheme at the beginning.

# File lib/apes/urls_parser.rb, line 156
def ensure_url_with_scheme(subject, protocol = "http", secure: false)
  schema = protocol + (secure ? "s" : "")
  subject !~ /^(#{protocol}(s?):\/\/)/ ? "#{schema}://#{subject}" : subject
end
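
The behavior is easiest to see with a few inputs. The sketch below copies the logic into a standalone function (`ensure_scheme_sketch` is a hypothetical name) so it runs without the library.

```ruby
def ensure_scheme_sketch(subject, protocol = "http", secure: false)
  scheme = protocol + (secure ? "s" : "")
  # Prepend the scheme only when neither protocol:// nor protocols:// is present.
  subject !~ /^(#{protocol}(s?):\/\/)/ ? "#{scheme}://#{subject}" : subject
end

puts ensure_scheme_sketch("example.com")                # http://example.com
puts ensure_scheme_sketch("example.com", secure: true)  # https://example.com
puts ensure_scheme_sketch("https://example.com")        # https://example.com
puts ensure_scheme_sketch("example.com", "ftp")         # ftp://example.com
```

A string that already carries a scheme is returned unchanged, so `secure: true` does not upgrade an existing `http://` prefix.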

extract_urls(text, mode: :all, sort: nil, shortened_domains: [])

Extracts all URLs from a text.

@param text [String] The text that contains URLs.
@param mode [Symbol] Which URLs to extract. It can be `:shortened`, `:unshortened` or `:all` (the default).
@param sort [NilClass|Symbol] If not `nil`, how to sort extracted URLs. It can be `:asc` or `:desc`.
@param shortened_domains [Array] Which domains to consider shortened.
@return [Array] An array of extracted URLs.

# File lib/apes/urls_parser.rb, line 168
def extract_urls(text, mode: :all, sort: nil, shortened_domains: [])
  regexp = /((^|\s+)(?<url>#{UrlsParser::URL_MATCHER.source})(#{UrlsParser::URL_SEPARATOR.source}|$))/ix
  matches = text.scan(regexp).flatten.map { |u| clean(u) }.uniq

  if mode == :shortened
    matches.select! { |u| shortened?(u, *shortened_domains) }
  elsif mode == :unshortened
    matches.reject! { |u| shortened?(u, *shortened_domains) }
  end

  matches = sort_urls(matches, sort)
  matches
end
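
The scan, de-duplicate and filter-by-mode flow can be sketched without the library. `URL_SKETCH` is a crude assumed stand-in for `URL_MATCHER`, `extract_sketch` is a hypothetical name, and sorting is left out because `sort_urls` is not shown on this page.

```ruby
# ASSUMPTION: crude stand-in for URL_MATCHER; the real expression is not shown here.
URL_SKETCH = /https?:\/\/[^\s,]+/i

def extract_sketch(text, mode: :all, shortened_domains: ["bit.ly"])
  urls = text.scan(URL_SKETCH).uniq
  # A URL counts as shortened when it starts with one of the given domains.
  short = ->(u) { shortened_domains.any? { |d| u =~ /^https?:\/\/#{Regexp.quote(d)}/i } }

  case mode
  when :shortened   then urls.select(&short)
  when :unshortened then urls.reject(&short)
  else urls
  end
end

text = "See http://bit.ly/abc, then https://example.com/page and http://bit.ly/abc again"
p extract_sketch(text)                      # ["http://bit.ly/abc", "https://example.com/page"]
p extract_sketch(text, mode: :shortened)    # ["http://bit.ly/abc"]
p extract_sketch(text, mode: :unshortened)  # ["https://example.com/page"]
```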

hashify(url)

Generate a hash of a URL.

@param url [String] The URL to hashify.
@return [String] The hash for the URL.

# File lib/apes/urls_parser.rb, line 217
def hashify(url)
  Digest::SHA2.hexdigest(ensure_url_with_scheme(url.strip))
end
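
Because the URL is stripped and given a scheme before hashing, inputs that differ only in whitespace or a missing `http://` prefix produce the same digest. A minimal standalone sketch (`hashify_sketch` is a hypothetical name, with the scheme check inlined for the default `http` protocol):

```ruby
require "digest/sha2"

def hashify_sketch(url)
  normalized = url.strip
  # Inlined equivalent of ensure_url_with_scheme with its default arguments.
  normalized = "http://#{normalized}" unless normalized =~ /^https?:\/\//
  Digest::SHA2.hexdigest(normalized)  # SHA-256 by default, 64 hex characters
end

puts hashify_sketch("example.com") == hashify_sketch(" http://example.com ")  # true
puts hashify_sketch("example.com").length                                      # 64
```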

replace_urls(text, replacements: {}, mode: :all, shortened_domains: [])

Replace all URLs in a text with provided replacements.

@param text [String] The text that contains URLs.
@param replacements [Hash] A map where keys are the URLs to replace and values are their replacements.
@param mode [Symbol] Which URLs to replace. It can be `:shortened`, `:unshortened` or `:all` (the default).
@param shortened_domains [Array] Which domains to consider shortened.
@return [String] The original text with all URLs replaced.

# File lib/apes/urls_parser.rb, line 189
def replace_urls(text, replacements: {}, mode: :all, shortened_domains: [])
  text = text.dup

  urls = extract_urls(text, mode: mode, sort: :desc, shortened_domains: shortened_domains).reduce({}) do |accu, url|
    if replacements[url]
      hash = hashify(url)
      accu["url_#{hash}"] = ensure_url_with_scheme(replacements[url])
      text.gsub!(/#{Regexp.quote(url)}/, format(UrlsParser::TEMPLATE, hash))
    end

    accu
  end

  Mustache.render(text, urls: urls)
end
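
The method works in two passes: each URL is first swapped for a hash-keyed placeholder, and the placeholders are then rendered into the final replacements. The sketch below illustrates that idea with plain `gsub` instead of Mustache. `replace_sketch` is a hypothetical name, the `{{url_...}}` placeholder format is an assumption (the value of `TEMPLATE` is not shown on this page), and sorting by length is an assumed reading of `sort: :desc`.

```ruby
require "digest/sha2"

def replace_sketch(text, replacements)
  result = text.dup
  values = {}

  # ASSUMPTION: longest URLs first, so a URL that prefixes another is not replaced inside it.
  replacements.keys.sort_by { |u| -u.length }.each do |url|
    key = "url_#{Digest::SHA2.hexdigest(url)}"
    values[key] = replacements[url]
    result.gsub!(url, "{{#{key}}}")  # pass 1: URL -> placeholder
  end

  # Pass 2: placeholder -> replacement (plain gsub here, not Mustache).
  result.gsub(/\{\{(url_\h+)\}\}/) { values.fetch($1) }
end

map = { "http://a.com" => "http://b.com", "http://b.com" => "http://c.com" }
puts replace_sketch("Visit http://a.com and http://b.com", map)
# Visit http://b.com and http://c.com
```

Going through placeholders keeps a replacement value from being replaced again when it happens to equal another key, as in the example.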

shortened?(url, *shortened_domains)

Checks if the value is a shortened URL according to the provided shortened domains.

@return [Boolean] `true` if the value is a shortened URL, `false` otherwise.

# File lib/apes/urls_parser.rb, line 145
def shortened?(url, *shortened_domains)
  domains = ["bit.ly"].concat(shortened_domains).uniq.compact.map(&:strip)
  url?(url) && (ensure_url_with_scheme(url.strip) =~ /^(http(s?):\/\/(#{domains.map { |d| Regexp.quote(d) }.join("|")}))/i ? true : false)
end
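
The domain check can be sketched as a standalone function. `shortened_sketch?` is a hypothetical name, the scheme handling is inlined, and the `url?` validity check is omitted because `URL_MATCHER` is not shown on this page.

```ruby
def shortened_sketch?(url, *extra_domains)
  # "bit.ly" is always included, as in the method above.
  domains = ["bit.ly"].concat(extra_domains).uniq.compact.map(&:strip)
  pattern = /^(http(s?):\/\/(#{domains.map { |d| Regexp.quote(d) }.join("|")}))/i

  stripped = url.strip
  with_scheme = (stripped =~ /^https?:\/\//) ? stripped : "http://#{stripped}"
  (with_scheme =~ pattern) ? true : false
end

puts shortened_sketch?("bit.ly/abc")          # true
puts shortened_sketch?("https://BIT.LY/abc")  # true
puts shortened_sketch?("example.com/abc")     # false
puts shortened_sketch?("t.co/x", "t.co")      # true
```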

url?(url)

Checks if the value is a valid URL.

@return [Boolean] `true` if the value is a valid URL, `false` otherwise.

# File lib/apes/urls_parser.rb, line 124
def url?(url)
  url.strip =~ /^(#{UrlsParser::URL_MATCHER.source})$/ix ? true : false
end