class FuzzyStrings

Match words based on the operations needed to get 2 similar words bt insertion, deletion, substitution or transposition operations.

cot => coat (a must be inserted to get the same word) coat => cot (a must be deleted to get the same word) cost => coat (a must be substituted with s to get the same word) foo => floor (l and r must be inserted) floor => foo (l and r must be deleted) cost => cots (t and s must substituted (cost=2) or transpositioned (cost=1))

the cost is the amount of operations needed to make 2 words the same

Usage

fs = FuzzyStrings.new("pattern")
match = fs.compare("pattren")
puts match.match?
# true
puts match.score
# 2
puts match

Unicode?

It is assumed that all strings are utf-8

Public Class Methods

new(string1) click to toggle source
# File lib/fuzzy_strings.rb, line 26
def initialize(string1)
  @string1 = string1.to_s rescue ""
end

Public Instance Methods

compare(string2, no_transpositions = false) click to toggle source

compare a given string to the base pattern, the compared strings is operated upon (soo cot as the pattern and coat in compare leads to deletion)

returns a FuzzyStrings::Match object

# File lib/fuzzy_strings.rb, line 35
def compare(string2, no_transpositions = false)
  @string2 = string2.to_s rescue ""
  @match   = Match.new

  return @match if @string1 == @string2

  rule = 'U*'

  sequence1 = @string1.unpack rule
  sequence2 = @string2.unpack rule

  if (sequence1 + sequence2).include?(0)
    raise ArgumentError.new(
      "Strings cannot contain NULL-bytes due to internal semantics"
    )
  end

  @short, @long = if sequence1.length < sequence2.length
    [sequence1, sequence2]
  else
    [sequence2, sequence1]
  end

  find_insertions
  find_substitutions
  find_transpositions unless no_transpositions == true

  return @match
end

Private Instance Methods

find_insertions() click to toggle source

find insertions (if string2 is shorter we are finding deletions)

place null-bytes on the insert positions

# File lib/fuzzy_strings.rb, line 68
def find_insertions
  # when both are equal in length no insertions can happen
  return if @short.length == @long.length

  mode = @short.pack('U*') == @string2 ? :insertions : :deletions

  ## # don't destroy the object short'
  ## short = @short

  @long.each_with_index do |long_chr, i|
    short_chr = @short[i]
    if long_chr != short_chr
      next if @long[i+1].nil? or @long[i+1] != short_chr

      # there is an insertion
        @short = @short[0,i] + [ 0 ] + @short[i, @short.length-1]
      @match.send(:"#{mode}=", @match.send(mode) + 1)
    end
  end

  # pad the short with 0 until equal in length (these are not insertions)
  while @long.length > @short.length
    @short << 0
    @match.send(:"#{mode}=", @match.send(mode) + 1)
  end
end
find_substitutions() click to toggle source

compare characters, dont compare if 1 character is a null byte

# File lib/fuzzy_strings.rb, line 96
def find_substitutions
  @short.each_with_index do |char1, i|  # .select { |c| c != 0 }
    char2 = @long[i]
    next if [ char1, char2 ].include? 0
    @match.substitutions += 1 if char1 != char2
  end
end
find_transpositions() click to toggle source

compare characters by 2 and find transposed characters (when given cost, cots, ts is transposed)

# File lib/fuzzy_strings.rb, line 107
def find_transpositions
  short = @short.select { |c| c != 0 }
  short.each_index do |i|
    break if i == (short.length - 1)

    one = short[i..i+1]
    two = @long[i..i+1]
    next if one == two

    @match.transpositions += 1 if (one & two).length == 2
  end
end