class Minhash::Algorithm
The Minhash signature algorithm.
See section 3.3 of the Mining of Massive Datasets book (www.mmds.org/): infolab.stanford.edu/~ullman/mmds/ch3.pdf
Each hash function is a simple XOR of its input with a random integer bit mask.
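To illustrate how an XOR mask can act as a hash function: XOR with a fixed mask maps distinct integers to distinct integers, so each mask reorders the token values differently, and the smallest value under that ordering is the minhash for that mask. The following is a minimal sketch in plain Ruby; the helper name xor_minhash and the mask and token values are hypothetical and chosen only for illustration.

```ruby
# Hypothetical helper, not part of the library: applies one XOR mask to
# every token and returns the smallest result.
def xor_minhash(tokens, mask)
  tokens.map { |t| t ^ mask }.min
end

tokens = [1, 4, 9, 12]

# With mask 0b1010 the tokens hash to 11, 14, 3, 6.
puts xor_minhash(tokens, 0b1010) # prints 3

# With mask 0b0101 the same tokens hash to 4, 1, 12, 9.
puts xor_minhash(tokens, 0b0101) # prints 1
```

A different mask picks out a different minimum from the same tokens, which is what lets several masks together form a signature.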
Attributes
masks[R]
Returns the bit masks used to implement the hash functions.
Public Class Methods
create(length)
Creates a new instance of the algorithm with length random bit masks.
# File lib/minhash.rb, line 97
def self.create(length)
  new length.times.map { rand(2 ** 32 - 1) }
end
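Because the masks come from Kernel#rand, seeding the generator with Kernel#srand beforehand should make the generated masks repeatable. The sketch below mirrors the mask-generating expression from the listing above in a standalone helper; the name random_masks is hypothetical and not part of the library.

```ruby
# Hypothetical helper mirroring the expression used by create:
# each mask is an integer in the range 0...(2**32 - 1).
def random_masks(length)
  length.times.map { rand(2**32 - 1) }
end

# Seeding the global generator makes the sequence of masks repeatable.
srand(1234)
first = random_masks(4)

srand(1234)
second = random_masks(4)

puts first == second # prints true
```

Signatures are only comparable when they were computed with the same masks, so either a fixed seed or storing the masks and passing them to new keeps results comparable across runs.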
new(masks)
Creates a new instance of the algorithm, with the given bit masks.
# File lib/minhash.rb, line 90
def initialize(masks)
  @masks = masks.freeze
  @hash_functions ||= @masks.map {|mask| lambda {|i| i ^ mask } }
end
Public Instance Methods
signature(tokens)
Returns the minhash signature for a set of tokens: one value per bit mask, each the minimum of that mask's hash function over all tokens.
# File lib/minhash.rb, line 102
def signature(tokens)
  @hash_functions.map {|f| tokens.map(&f).min }
end
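As a usage sketch, two signatures produced with the same masks can be compared position by position; the fraction of matching positions gives a rough estimate of the Jaccard similarity of the two token sets, in the spirit of section 3.3 of the book cited above. Several things here are assumptions: the class is re-declared as a stand-in based on the listings on this page so that the example runs on its own (declaring Minhash as a module is a guess), the helper estimated_similarity and the mask and token values are hypothetical, and tokens are assumed to be integers because the hash functions XOR them with the masks.

```ruby
# Stand-in based on the listings on this page so the sketch runs without
# the library. Declaring Minhash as a module is an assumption.
module Minhash
  class Algorithm
    attr_reader :masks

    def initialize(masks)
      @masks = masks.freeze
      @hash_functions = @masks.map { |mask| lambda { |i| i ^ mask } }
    end

    def signature(tokens)
      @hash_functions.map { |f| tokens.map(&f).min }
    end
  end
end

# Hypothetical helper: fraction of positions where two signatures agree.
def estimated_similarity(sig_a, sig_b)
  matches = sig_a.zip(sig_b).count { |a, b| a == b }
  matches.to_f / sig_a.length
end

# Explicit masks keep the example deterministic; real use would rely on
# many more, randomly drawn masks.
algorithm = Minhash::Algorithm.new([0b0011, 0b0101, 0b1001, 0b1110])

sig_a = algorithm.signature([1, 2, 3, 4])
sig_b = algorithm.signature([3, 4, 5, 6])

puts estimated_similarity(sig_a, sig_b)
```

With only four masks the estimate is coarse; longer signatures trade more computation for a steadier estimate.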