class Minhash::Algorithm

The Minhash signature algorithm.

See section 3.3 of Mining of Massive Datasets (www.mmds.org): infolab.stanford.edu/~ullman/mmds/ch3.pdf

Each hash function simply XORs its integer input with a random integer bit mask.
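
As a rough illustration of the idea, the sketch below builds one XOR-based hash function by hand. The 4-bit mask and the name xor_hash are made up for this example and are not part of the library, which draws its masks at random.

```ruby
# Hypothetical 4-bit mask, chosen only to keep the arithmetic readable.
mask = 0b1010
xor_hash = ->(i) { i ^ mask }

# 0b0110 ^ 0b1010 = 0b1100
hashed = xor_hash.call(0b0110)

# XOR with a fixed mask is its own inverse, so distinct inputs never
# collide: the values 0..15 are reordered, not merged.
permuted = (0...16).map(&xor_hash)
```

Because each mask only reorders the integers, taking the minimum hashed value of a set amounts to picking the element that comes first under that reordering.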

Attributes

masks[R]

Returns the bit masks used to implement the hash functions.

Public Class Methods

create(length)

Creates a new instance of the algorithm with length random bit masks.

# File lib/minhash.rb, line 97
def self.create(length)
  # rand(n) returns an integer in 0...n, so each mask lies in 0..(2**32 - 2).
  new length.times.map { rand(2 ** 32 - 1) }
end

new(masks)

Creates a new instance of the algorithm, with the given bit masks.

# File lib/minhash.rb, line 90
def initialize(masks)
  @masks = masks.freeze
  # One hash function per mask: XOR the token with the mask.
  @hash_functions = @masks.map { |mask| lambda { |i| i ^ mask } }
end

Public Instance Methods

signature(tokens)

Returns the minhash signature for a set of integer tokens: one value per hash function, namely the minimum hashed value over all tokens.

# File lib/minhash.rb, line 102
def signature(tokens)
  @hash_functions.map {|f| tokens.map(&f).min }
end
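
A possible way to use signatures is sketched below. SketchAlgorithm is a stand-in class that mirrors the listings on this page so the example runs without the gem installed, and estimated_similarity is a hypothetical helper, not part of the library. It assumes the usual minhash estimate of Jaccard similarity, the fraction of signature positions on which two sets agree. The fixed masks are illustrative only.

```ruby
# Stand-in mirroring the listings on this page (not the gem itself).
class SketchAlgorithm
  attr_reader :masks

  def initialize(masks)
    @masks = masks.freeze
    @hash_functions = @masks.map { |mask| ->(i) { i ^ mask } }
  end

  def signature(tokens)
    @hash_functions.map { |f| tokens.map(&f).min }
  end
end

# Hypothetical helper: fraction of positions where two signatures agree.
def estimated_similarity(sig_a, sig_b)
  matches = sig_a.zip(sig_b).count { |a, b| a == b }
  matches.to_f / sig_a.length
end

# Fixed masks keep the example reproducible.
algorithm = SketchAlgorithm.new([0b00, 0b01, 0b10, 0b11])
sig_a = algorithm.signature([1, 2, 3])
sig_b = algorithm.signature([2, 3, 4])
similarity = estimated_similarity(sig_a, sig_b)
```

Signatures are only comparable when they were computed with the same masks, so two instances built independently with create would generally not produce comparable output. With only four masks the estimate is coarse; longer signatures should give a steadier approximation.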