module Minhash

Public Class Methods

approximate_similarity(a, b) click to toggle source

Returns the approximate jaccard similarity of 2 sets, given their signatures.

# File lib/minhash.rb, line 56
def self.approximate_similarity(a, b)
  a.length.times.select {|i| a[i] == b[i] }.size / a.length.to_f
end
jaccard_similarity(a, b) click to toggle source

Returns the jaccard similarity between two sets of shingles / tokens.

# File lib/minhash.rb, line 50
def self.jaccard_similarity(a, b)
  (a & b).size / (a | b).size.to_f
end
k_shingles(text, k) click to toggle source

Generates the k-shingles for a String.

For example, with the string “12345” and k = 3 the shingles are “123”, “234” and “345”.

k_shingles("ABCDE", 3)
# => ["ABC", "BCD", "CDE"]

Normalizes all whitespace to be single character spaces.

You want to select k to be large enough to be able to distinguish chunks of text adequately - a value of around 9 is sensible for reasonable sized chunks of standard written text.

# File lib/minhash.rb, line 18
def self.k_shingles(text, k)
  if text.size < k
    [text]
  else
    text.tr("\t\n\r", "   ").
      squeeze(" ").
      each_char.
      each_cons(k).
      map(&:join)
  end
end
tokenized_k_shingles(text, k, &hash) click to toggle source

Generates a Set of tokenized 32-bit integer k-shingles.

tokenized_k_shingles("ABCDE", 3)
# =>  #<Set: {1136772405, 3561005568, 944681077}>

MurmurHash3 is used by default; if you want to use a different hash function, pass a block:

tokenized_k_shingles("ABCDE", 3) do |shingle|
  # obviously a terrible hash function, just an example.
  shingle[0].ord
end
# => #<Set: {65, 66, 67}>
# File lib/minhash.rb, line 43
def self.tokenized_k_shingles(text, k, &hash)
  hash ||= lambda {|s| MurmurHash3::Native32.murmur3_32_str_hash(s) }
  k_shingles(text, k).map {|shingle| hash.call(shingle) }.to_set
end