class PEROBS::FuzzyStringMatcher

The fuzzy string matcher can be used to perform a fuzzy string search against a known set of strings. The dictionary of known strings does not store the actual strings but references to String or PEROBS objects. Once the dictionary has been established, fuzzy matches can be done. Since the actual input strings are not directly stored, you cannot remove or modified already stored strings. To remove strings, you have to clear the matcher and add the strings again that you want to keep.

Public Class Methods

new(p, case_sensitive = false, n = 4) click to toggle source

Create a new FuzzyStringMatcher. @param p [PEROBS::Store] place to store the dictionary @param case_sensitive [Boolean] True if case matters for matching @param n [Integer] Determines what kind of n-gramm is used to store the

references in the dictionary. It also determines the minimum word
length that can be used for fuzzy matches. Values between 2 and
10 are supported. The default is 4.
Calls superclass method PEROBS::Object::new
# File lib/perobs/FuzzyStringMatcher.rb, line 51
def initialize(p, case_sensitive = false, n = 4)
  super(p)
  if n < 2 || n > 10
    raise ArgumentError, 'n must be between 2 and 10'
  end
  self.case_sensitive = case_sensitive
  self.n = n

  clear unless @dict
end

Public Instance Methods

best_matches(string, min_score = 0.5, max_count = 100) click to toggle source

Find the references who's string best matches the given string. @param string [String] string to search for @param min_score [Float] Value 0.01 and 1.0 that specifies how strict

the matching should be done. The larger the value the more closer
the given string needs to be.

@param max_count [Integer] The maximum number of matches that should be

returned.

@return [Array] The result is an Array of Arrays. The nested Arrays only

have 2 entries. The reference and a Float value between 0 and
1.0 that describes how good the match is. The matches are sorted
in descending order by the match score.
# File lib/perobs/FuzzyStringMatcher.rb, line 102
def best_matches(string, min_score = 0.5, max_count = 100)
  unless @case_sensitive
    string = string.downcase
  end
  # Enclose string in 'start of text' and 'end of text' ASCII values.
  string = "\002" + string + "\003"

  matches = {}

  each_n_gramm(string) do |n_gramm|
    if (ng_list = @dict[n_gramm])
      ng_list.each do |reference, dummy|
        if matches.include?(reference)
          matches[reference] += 1
        else
          matches[reference] = 1
        end
      end
    end
  end

  return [] if matches.empty?

  match_list = matches.to_a

  # Set occurance counters to scores relative to the best possible score.
  # This will be the best possible score for a perfect match.
  best_possible_score = string.length - @n + 1
  match_list.map! { |a, b| [ a, b.to_f / best_possible_score ] }

  # Delete all matches that don't have the required minimum match score.
  match_list.delete_if { |a| a[1] < min_score }

  # Sort the list best to worst match
  match_list.sort! do |a, b|
    b[1] <=> a[1]
  end

  # Return the top max_count matches.
  match_list[0..max_count - 1]
end
clear() click to toggle source

Wipe the dictionary.

# File lib/perobs/FuzzyStringMatcher.rb, line 63
def clear
  self.dict = @store.new(BigHash)
end
learn(string, reference = string) click to toggle source

Add a string with its reference to the dictionary. @param string [String] The string to store @param reference [Object] Any object that is associated with the string

# File lib/perobs/FuzzyStringMatcher.rb, line 70
def learn(string, reference = string)
  reference = string if reference.nil?

  unless @case_sensitive
    string = string.downcase
  end
  # Enclose string in 'start of text' and 'end of text' ASCII values.
  string = "\002" + string + "\003"

  each_n_gramm(string) do |n_gramm|
    unless (ng_list = @dict[n_gramm])
      @dict[n_gramm] = ng_list = @store.new(Hash)
    end

    # We use the Hash as a Set. The value doesn't matter.
    ng_list[reference] = true unless ng_list.include?(reference)
  end

  nil
end
stats() click to toggle source

Returns some internal stats about the dictionary.

# File lib/perobs/FuzzyStringMatcher.rb, line 145
def stats
  s = {}
  s['dictionary_size'] = @dict.size
  max = total = 0
  @dict.each do |n_gramm, ng_list|
    size = ng_list.length
    max = size if size > max
    total += size
  end
  s['max_list_size'] = max
  s['avg_list_size'] = total > 0 ? total.to_f / s['dictionary_size'] : 0

  s
end

Private Instance Methods

each_n_gramm(string) { |n_gramm| ... } click to toggle source
# File lib/perobs/FuzzyStringMatcher.rb, line 162
def each_n_gramm(string, &block)
  return if string.length < @n

  0.upto(string.length - @n) do |i|
    n_gramm = string[i, @n]

    yield(n_gramm)
  end
end