class TfIdfSimilarity::Document

Attributes

id[R]

The document's identifier.

size[R]

The number of tokens in the document.

term_counts[R]

The number of times each term appears in the document.

text[R]

The document's text.

Public Class Methods

new(text, opts = {})

@param [String] text the document's text
@param [Hash] opts optional arguments
@option opts [String] :id the document's identifier
@option opts [Array] :tokens the document's tokenized text
@option opts [#tokenize] :tokenizer the tokenizer used when :tokens is not given (defaults to Tokenizer.new)
@option opts [Hash] :term_counts the number of times each term appears
@option opts [Integer] :size the number of tokens in the document

# File lib/tf-idf-similarity/document.rb, line 21
def initialize(text, opts = {})
  @text   = text
  @id     = opts[:id] || object_id
  @tokens = Array(opts[:tokens]).map { |t| Token.new(t) } if opts[:tokens]
  @tokenizer = opts[:tokenizer] || Tokenizer.new

  if opts[:term_counts]
    @term_counts = opts[:term_counts]
    @size = opts[:size] || term_counts.values.reduce(0, :+)
    # The supplied counts are used as-is; the text is not tokenized.
  else
    @term_counts = Hash.new(0)
    @size = 0
    set_term_counts_and_size
  end
end
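The constructor takes one of two paths depending on whether :term_counts is supplied. The following is a minimal sketch of that branching, not the library's class: MiniDocument is a hypothetical stand-in, and whitespace splitting is assumed in place of the real tokenizer.

```ruby
# Simplified stand-in for the constructor logic above; not the library's class.
class MiniDocument
  attr_reader :text, :term_counts, :size

  def initialize(text, opts = {})
    @text = text
    if opts[:term_counts]
      # Precomputed counts: the text is never tokenized.
      @term_counts = opts[:term_counts]
      # An explicit :size wins; otherwise the counts are summed.
      @size = opts[:size] || @term_counts.values.reduce(0, :+)
    else
      # Assumption: whitespace splitting stands in for the real tokenizer.
      @term_counts = Hash.new(0)
      text.split.each { |term| @term_counts[term] += 1 }
      @size = @term_counts.values.reduce(0, :+)
    end
  end
end

counted  = MiniDocument.new("the cat saw the dog")
supplied = MiniDocument.new("ignored", term_counts: { "cat" => 2, "dog" => 1 })
sized    = MiniDocument.new("ignored", term_counts: { "cat" => 2 }, size: 10)

puts counted.size   # 5
puts supplied.size  # 3
puts sized.size     # 10
```

Passing :term_counts together with :size can be useful when the counts have been filtered, so that their sum no longer equals the document's true token count.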

Public Instance Methods

average_term_count()

@return [Float] the average term count of all terms in the document

# File lib/tf-idf-similarity/extras/document.rb, line 9
def average_term_count
  @average_term_count ||= term_counts.values.reduce(0, :+) / term_counts.size.to_f
end

maximum_term_count()

@return [Float] the maximum term count of any term in the document

# File lib/tf-idf-similarity/extras/document.rb, line 4
def maximum_term_count
  @maximum_term_count ||= term_counts.values.max.to_f
end
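The two extras methods above reduce to simple arithmetic over the values of term_counts. The sketch below applies the same expressions to plain hashes with made-up counts, leaving out the memoization, and shows what they yield for an empty hash. The helper names are hypothetical.

```ruby
# Same expressions as the methods above, applied to a plain Hash.
def average_of(term_counts)
  term_counts.values.reduce(0, :+) / term_counts.size.to_f
end

def maximum_of(term_counts)
  term_counts.values.max.to_f
end

counts = { "the" => 3, "cat" => 1, "dog" => 2 }

puts average_of(counts)    # 2.0 (6 occurrences over 3 unique terms)
puts maximum_of(counts)    # 3.0

# With no terms, the average is 0 / 0.0, which is NaN,
# and the maximum is nil.to_f, which is 0.0.
puts average_of({}).nan?   # true
puts maximum_of({})        # 0.0
```

Callers that may see documents with no valid tokens may want to check for NaN before using the average.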

term_count(term)

Returns the number of occurrences of the term in the document.

@param [String] term a term
@return [Integer] the number of times the term appears in the document

# File lib/tf-idf-similarity/document.rb, line 49
def term_count(term)
  term_counts[term].to_i # need #to_i if unmarshalled
end
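The source comment does not say which serialization it has in mind. One plausible reading is that restored counts may no longer be a Hash with a default of 0, so a missing term returns nil. The sketch below assumes a JSON round trip purely for illustration; the helper names are hypothetical.

```ruby
require 'json'

# Assumption: the counts were serialized to JSON and parsed back.
# The parsed result is a plain Hash with no default value.
def round_trip(term_counts)
  JSON.parse(JSON.generate(term_counts))
end

# Same expression as term_count above, applied to an arbitrary Hash.
def count_in(term_counts, term)
  term_counts[term].to_i
end

original = Hash.new(0)
original["cat"] += 2
restored = round_trip(original)

p original["dog"]            # 0, from the Hash default
p restored["dog"]            # nil, because the default was lost
p count_in(restored, "dog")  # 0, because nil.to_i is 0
p count_in(restored, "cat")  # 2
```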

terms()

Returns the set of terms in the document.

@return [Array<String>] the unique terms in the document

# File lib/tf-idf-similarity/document.rb, line 41
def terms
  term_counts.keys
end

Private Instance Methods

set_term_counts_and_size()

Tokenizes the text and counts terms and total tokens.

# File lib/tf-idf-similarity/document.rb, line 56
def set_term_counts_and_size
  tokenize(text).each do |token|
    if token.valid?
      term = token.to_s
      @term_counts[term] += 1
      @size += 1
    end
  end
end
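Because only valid tokens are counted, size can be smaller than the raw number of tokens. The sketch below reproduces the counting loop on plain strings. The validity rule shown (a token must contain a letter or digit) is an assumption made for illustration, since Token#valid? is not shown here.

```ruby
# Hypothetical validity rule; the library's Token#valid? may differ.
def valid_token?(token)
  token.match?(/[[:alnum:]]/)
end

# Counts occurrences per term and the total number of valid tokens.
def count_terms(tokens)
  term_counts = Hash.new(0)
  size = 0
  tokens.each do |token|
    next unless valid_token?(token)
    term_counts[token] += 1
    size += 1
  end
  [term_counts, size]
end

term_counts, size = count_terms(%w[to be or not to be --])

p term_counts  # {"to"=>2, "be"=>2, "or"=>1, "not"=>1}
p size         # 6, since "--" is skipped
```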

tokenize(text)

Tokenizes a text, respecting the word boundary rules from Unicode’s Default Word Boundary Specification.

If a tokenized text was provided at the document's initialization, those tokens will be returned without additional processing.

@param [String] text a text
@return [Enumerator] a token enumerator

@note We should evaluate the tokenizers by {www.sciencemag.org/content/suppl/2010/12/16/science.1199644.DC1/Michel.SOM.revision.2.pdf Google} or {http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory Solr}.

@see unicode.org/reports/tr29/#Default_Word_Boundaries
@see wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory

# File lib/tf-idf-similarity/document.rb, line 80
def tokenize(text)
  @tokens || @tokenizer.tokenize(text)
end
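The precedence in tokenize can be sketched as follows: tokens supplied at initialization win, and otherwise any object that responds to tokenize(text) is consulted. WhitespaceTokenizer and tokens_for are hypothetical names, and plain strings are used here in place of the library's token objects.

```ruby
# Hypothetical tokenizer: any object responding to #tokenize(text) would do.
class WhitespaceTokenizer
  def tokenize(text)
    text.split
  end
end

# Mirrors `@tokens || @tokenizer.tokenize(text)` with explicit arguments.
def tokens_for(text, tokens: nil, tokenizer: WhitespaceTokenizer.new)
  tokens || tokenizer.tokenize(text)
end

p tokens_for("a rose is a rose")
# ["a", "rose", "is", "a", "rose"]

p tokens_for("a rose is a rose", tokens: ["rose"])
# ["rose"], since supplied tokens bypass the tokenizer entirely
```

Supplying pre-tokenized text in this way avoids tokenizing the same text again when the tokens are already available.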