class FuzzyStrings
Match
words based on the operations needed to get 2 similar words bt insertion, deletion, substitution or transposition operations.
cot => coat (a must be inserted to get the same word) coat => cot (a must be deleted to get the same word) cost => coat (a must be substituted with s to get the same word) foo => floor (l and r must be inserted) floor => foo (l and r must be deleted) cost => cots (t and s must substituted (cost=2) or transpositioned (cost=1))
the cost is the amount of operations needed to make 2 words the same
Usage¶ ↑
fs = FuzzyStrings.new("pattern") match = fs.compare("pattren") puts match.match? # true puts match.score # 2 puts match
Unicode?¶ ↑
It is assumed that all strings are utf-8
Public Class Methods
# File lib/fuzzy_strings.rb, line 26 def initialize(string1) @string1 = string1.to_s rescue "" end
Public Instance Methods
compare a given string to the base pattern, the compared strings is operated upon (soo cot as the pattern and coat in compare leads to deletion)
returns a FuzzyStrings::Match
object
# File lib/fuzzy_strings.rb, line 35 def compare(string2, no_transpositions = false) @string2 = string2.to_s rescue "" @match = Match.new return @match if @string1 == @string2 rule = 'U*' sequence1 = @string1.unpack rule sequence2 = @string2.unpack rule if (sequence1 + sequence2).include?(0) raise ArgumentError.new( "Strings cannot contain NULL-bytes due to internal semantics" ) end @short, @long = if sequence1.length < sequence2.length [sequence1, sequence2] else [sequence2, sequence1] end find_insertions find_substitutions find_transpositions unless no_transpositions == true return @match end
Private Instance Methods
find insertions (if string2 is shorter we are finding deletions)
place null-bytes on the insert positions
# File lib/fuzzy_strings.rb, line 68 def find_insertions # when both are equal in length no insertions can happen return if @short.length == @long.length mode = @short.pack('U*') == @string2 ? :insertions : :deletions ## # don't destroy the object short' ## short = @short @long.each_with_index do |long_chr, i| short_chr = @short[i] if long_chr != short_chr next if @long[i+1].nil? or @long[i+1] != short_chr # there is an insertion @short = @short[0,i] + [ 0 ] + @short[i, @short.length-1] @match.send(:"#{mode}=", @match.send(mode) + 1) end end # pad the short with 0 until equal in length (these are not insertions) while @long.length > @short.length @short << 0 @match.send(:"#{mode}=", @match.send(mode) + 1) end end
compare characters, dont compare if 1 character is a null byte
# File lib/fuzzy_strings.rb, line 96 def find_substitutions @short.each_with_index do |char1, i| # .select { |c| c != 0 } char2 = @long[i] next if [ char1, char2 ].include? 0 @match.substitutions += 1 if char1 != char2 end end
compare characters by 2 and find transposed characters (when given cost, cots, ts is transposed)
# File lib/fuzzy_strings.rb, line 107 def find_transpositions short = @short.select { |c| c != 0 } short.each_index do |i| break if i == (short.length - 1) one = short[i..i+1] two = @long[i..i+1] next if one == two @match.transpositions += 1 if (one & two).length == 2 end end