module Minhash
Public Class Methods
approximate_similarity(a, b)
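Returns the approximate Jaccard similarity of two sets, given their signatures: the fraction of positions at which the two signatures hold the same value. Both signatures are expected to have the same length.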
# File lib/minhash.rb, line 56
def self.approximate_similarity(a, b)
  a.length.times.select {|i| a[i] == b[i] }.size / a.length.to_f
end
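A minimal sketch of how the positional comparison behaves. The method body is copied from the listing above into a top-level stand-in so that it runs without the gem, and the two four-value signatures are hypothetical (real signatures are typically much longer):

```ruby
# Stand-in copied from the source listing, so the sketch runs without the gem.
def approximate_similarity(a, b)
  a.length.times.select { |i| a[i] == b[i] }.size / a.length.to_f
end

# Hypothetical signatures of equal length, invented for illustration.
sig_a = [3, 17, 42, 8]
sig_b = [3, 17, 99, 8]

# Positions 0, 1 and 3 agree, so 3 of 4 positions match.
similarity = approximate_similarity(sig_a, sig_b)
puts similarity # prints 0.75
```

Because the result is divided by `a.length`, an empty signature yields `NaN` rather than raising, so callers may want to guard against that case.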
jaccard_similarity(a, b)
Returns the Jaccard similarity between two sets of shingles / tokens: the size of their intersection divided by the size of their union.
# File lib/minhash.rb, line 50
def self.jaccard_similarity(a, b)
  (a & b).size / (a | b).size.to_f
end
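As a rough worked example, the sketch below applies the same expression to two small shingle sets. The method is a top-level stand-in copied from the listing above, and the input sets are made up for illustration:

```ruby
require 'set'

# Stand-in copied from the source listing, so the sketch runs without the gem.
def jaccard_similarity(a, b)
  (a & b).size / (a | b).size.to_f
end

# Hypothetical shingle sets sharing two of four distinct shingles.
left  = Set["ABC", "BCD", "CDE"]
right = Set["BCD", "CDE", "DEF"]

# Intersection has 2 elements, union has 4.
exact = jaccard_similarity(left, right)
puts exact # prints 0.5
```

Plain arrays also work here, since `Array#&` and `Array#|` both discard duplicates.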
k_shingles(text, k)
Generates the k-shingles for a String.
For example, with the string “12345” and k = 3 the shingles are “123”, “234” and “345”.
k_shingles("ABCDE", 3) # => ["ABC", "BCD", "CDE"]
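Whitespace is normalized before shingling: tabs, newlines and carriage returns are converted to spaces, and runs of spaces are collapsed into a single space.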
Choose k large enough to distinguish chunks of text adequately; a value of around 9 is sensible for reasonably sized chunks of standard written text.
# File lib/minhash.rb, line 18
def self.k_shingles(text, k)
  if text.size < k
    [text]
  else
    text.tr("\t\n\r", " ").
      squeeze(" ").
      each_char.
      each_cons(k).
      map(&:join)
  end
end
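The sketch below illustrates the whitespace handling and the short-input branch. It uses a top-level stand-in copied from the listing above, and the input strings are invented for illustration:

```ruby
# Stand-in copied from the source listing, so the sketch runs without the gem.
def k_shingles(text, k)
  if text.size < k
    [text]
  else
    text.tr("\t\n\r", " ").squeeze(" ").each_char.each_cons(k).map(&:join)
  end
end

# Tabs and newlines become spaces, and repeated spaces collapse to one,
# so the text is shingled as "a b c".
normalized = k_shingles("a\t\tb\nc", 3)
p normalized # prints ["a b", " b ", "b c"]

# Text shorter than k is returned unchanged as a single shingle.
short = k_shingles("AB", 3)
p short # prints ["AB"]

# The length check runs before normalization: "a  b" has 4 characters,
# but shrinks to "a b" (3 characters), so no 4-shingle can be formed.
shrunk = k_shingles("a  b", 4)
p shrunk # prints []
```

The last case suggests that callers working with whitespace-heavy input may want to handle an empty result.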
tokenized_k_shingles(text, k, &hash)
Generates a Set of tokenized 32-bit integer k-shingles.
tokenized_k_shingles("ABCDE", 3) # => #<Set: {1136772405, 3561005568, 944681077}>
MurmurHash3 is used by default; if you want to use a different hash function, pass a block:
tokenized_k_shingles("ABCDE", 3) do |shingle|
  # obviously a terrible hash function, just an example.
  shingle[0].ord
end
# => #<Set: {65, 66, 67}>
# File lib/minhash.rb, line 43
def self.tokenized_k_shingles(text, k, &hash)
  hash ||= lambda {|s| MurmurHash3::Native32.murmur3_32_str_hash(s) }
  k_shingles(text, k).map {|shingle| hash.call(shingle) }.to_set
end
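A hedged sketch of the block-based hash override and of the Set semantics. To stay runnable without the MurmurHash3 gem, the stand-in below swaps the default hash for `Zlib.crc32` from Ruby's standard library; that substitution is an assumption made for this example only and is not what the library uses:

```ruby
require 'set'
require 'zlib'

# Stand-ins modelled on the source listings, so the sketch runs without the gem.
def k_shingles(text, k)
  return [text] if text.size < k
  text.tr("\t\n\r", " ").squeeze(" ").each_char.each_cons(k).map(&:join)
end

def tokenized_k_shingles(text, k, &hash)
  # ASSUMPTION: Zlib.crc32 replaces the library's MurmurHash3 default here.
  hash ||= lambda { |s| Zlib.crc32(s) }
  k_shingles(text, k).map { |shingle| hash.call(shingle) }.to_set
end

# "ABABAB" has five 2-shingles but only two distinct ones ("AB" and "BA"),
# so the resulting Set holds two tokens.
tokens = tokenized_k_shingles("ABABAB", 2)
puts tokens.size # prints 2

# A custom hash passed as a block: the sum of each shingle's bytes.
byte_sums = tokenized_k_shingles("ABCDE", 3) { |shingle| shingle.bytes.sum }
p byte_sums.to_a.sort # prints [198, 201, 204]
```

Because the result is a Set, repeated shingles contribute only once, which matches the set-based definition of Jaccard similarity used elsewhere in the module.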