class TfIdfSimilarity::Document
Attributes
id: The document's identifier.
size: The number of tokens in the document.
term_counts: The number of times each term appears in the document.
text: The document's text.
Public Class Methods
@param [String] text the document's text
@param [Hash] opts optional arguments
@option opts [String] :id the document's identifier (defaults to the object's object_id)
@option opts [Array] :tokens the document's tokenized text
@option opts [#tokenize] :tokenizer the tokenizer to use (defaults to Tokenizer.new)
@option opts [Hash] :term_counts the number of times each term appears
@option opts [Integer] :size the number of tokens in the document
# File lib/tf-idf-similarity/document.rb, line 21
def initialize(text, opts = {})
  @text = text
  @id = opts[:id] || object_id
  @tokens = Array(opts[:tokens]).map { |t| Token.new(t) } if opts[:tokens]
  @tokenizer = opts[:tokenizer] || Tokenizer.new

  if opts[:term_counts]
    @term_counts = opts[:term_counts]
    @size = opts[:size] || term_counts.values.reduce(0, :+)
    # Nothing to do.
  else
    @term_counts = Hash.new(0)
    @size = 0
    set_term_counts_and_size
  end
end
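The initializer takes one of two paths, depending on whether :term_counts is supplied. The following is a minimal sketch of that branching only. The helper name resolve_counts and the whitespace split are assumptions made for illustration; neither is part of tf-idf-similarity, which uses its own Tokenizer.

```ruby
# Hypothetical helper that mirrors the branching in Document#initialize.
# The whitespace split is an assumption; the library uses its own Tokenizer.
def resolve_counts(text, opts = {})
  if opts[:term_counts]
    counts = opts[:term_counts]
    # An explicit :size wins; otherwise the size is the sum of all counts.
    size = opts[:size] || counts.values.reduce(0, :+)
  else
    counts = Hash.new(0)
    size = 0
    text.split.each do |term|
      counts[term] += 1
      size += 1
    end
  end
  [counts, size]
end

_, size = resolve_counts("to be or not to be")
puts size  # prints 6

_, size = resolve_counts("", term_counts: { "to" => 2, "be" => 2 })
puts size  # prints 4
```

When precomputed counts are passed in, the text is never tokenized, so the caller is responsible for keeping the counts consistent with the text.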
Public Instance Methods
@return [Float] the average term count of all terms in the document
# File lib/tf-idf-similarity/extras/document.rb, line 9
def average_term_count
  @average_term_count ||= term_counts.values.reduce(0, :+) / term_counts.size.to_f
end
@return [Float] the maximum term count of any term in the document
# File lib/tf-idf-similarity/extras/document.rb, line 4
def maximum_term_count
  @maximum_term_count ||= term_counts.values.max.to_f
end
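As a worked example of the two statistics above, the sketch below applies the same arithmetic to a plain hash standing in for term_counts. The sample terms and counts are invented for illustration.

```ruby
# Plain hash standing in for Document#term_counts (values are assumptions).
term_counts = { "apple" => 3, "pear" => 1, "plum" => 2 }

# Same arithmetic as average_term_count: total occurrences / distinct terms.
average = term_counts.values.reduce(0, :+) / term_counts.size.to_f
# Same arithmetic as maximum_term_count.
maximum = term_counts.values.max.to_f

puts average  # prints 2.0
puts maximum  # prints 3.0

# With no terms the average is 0 / 0.0, which is NaN in Ruby.
empty = {}
empty_average = empty.values.reduce(0, :+) / empty.size.to_f
puts empty_average.nan?  # prints true
```

Callers that may encounter documents with no valid terms would probably want to check for NaN before using the average.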
Returns the number of occurrences of the term in the document.
@param [String] term a term
@return [Integer] the number of times the term appears in the document
# File lib/tf-idf-similarity/document.rb, line 49
def term_count(term)
  term_counts[term].to_i # need #to_i if unmarshalled
end
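The #to_i call matters when term_counts is a hash without a default value of 0, because a lookup for an absent term then returns nil. The source comment only says "unmarshalled"; the JSON round trip below is one assumed way such a hash could arise, chosen for illustration.

```ruby
require "json"

counts = Hash.new(0)
counts["apple"] += 2

# A round trip through JSON yields a plain Hash with no default value.
restored = JSON.parse(JSON.generate(counts))

puts counts["missing"].inspect    # prints 0
puts restored["missing"].inspect  # prints nil
puts restored["missing"].to_i     # prints 0
puts restored["apple"].to_i       # prints 2
```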
Returns the set of terms in the document.
@return [Array<String>] the unique terms in the document
# File lib/tf-idf-similarity/document.rb, line 41
def terms
  term_counts.keys
end
Private Instance Methods
Tokenizes the text and counts terms and total tokens.
# File lib/tf-idf-similarity/document.rb, line 56
def set_term_counts_and_size
  tokenize(text).each do |token|
    if token.valid?
      term = token.to_s
      @term_counts[term] += 1
      @size += 1
    end
  end
end
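The counting loop above can be sketched in isolation as follows. SketchToken is a hypothetical stand-in for the library's Token class, and its validity rule (at least one letter or digit) is an assumption; the real Token may apply different rules.

```ruby
# Hypothetical stand-in for the library's Token class; the validity rule
# (at least one letter or digit) is an assumption made for this sketch.
SketchToken = Struct.new(:text) do
  def valid?
    text.match?(/[[:alnum:]]/)
  end

  def to_s
    text
  end
end

tokens = ["tf", "-", "idf", ",", "tf"].map { |t| SketchToken.new(t) }

term_counts = Hash.new(0)
size = 0
tokens.each do |token|
  next unless token.valid?      # invalid tokens affect neither count
  term_counts[token.to_s] += 1  # per-term count
  size += 1                     # total number of valid tokens
end

puts term_counts["tf"]  # prints 2
puts term_counts["-"]   # prints 0
puts size               # prints 3
```

Note that size counts only valid tokens, so it always equals the sum of the per-term counts.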
Tokenizes a text, respecting the word boundary rules from Unicode’s Default Word Boundary Specification.
If a tokenized text was provided at the document's initialization, those tokens will be returned without additional processing.
@param [String] text a text
@return [Enumerator] a token enumerator
@note We should evaluate the tokenizers by {www.sciencemag.org/content/suppl/2010/12/16/science.1199644.DC1/Michel.SOM.revision.2.pdf Google}
or {http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.UAX29URLEmailTokenizerFactory Solr}.
@see unicode.org/reports/tr29/#Default_Word_Boundaries
@see wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.StandardTokenizerFactory
# File lib/tf-idf-similarity/document.rb, line 80
def tokenize(text)
  @tokens || @tokenizer.tokenize(text)
end
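The fallback in tokenize means that tokens supplied at initialization bypass the tokenizer entirely. A minimal sketch of that behavior, assuming a hypothetical SketchTokenizer that splits on whitespace and a hypothetical helper tokens_for (neither is part of the library):

```ruby
# Hypothetical tokenizer used only in this sketch; it counts how often it runs.
class SketchTokenizer
  attr_reader :calls

  def initialize
    @calls = 0
  end

  def tokenize(text)
    @calls += 1
    text.split
  end
end

# Mirrors the `@tokens || @tokenizer.tokenize(text)` fallback.
def tokens_for(text, tokens, tokenizer)
  tokens || tokenizer.tokenize(text)
end

tokenizer = SketchTokenizer.new
from_text = tokens_for("a b c", nil, tokenizer)       # tokenizer runs
supplied  = tokens_for("a b c", %w[x y], tokenizer)   # supplied tokens win

puts from_text.length  # prints 3
puts supplied.length   # prints 2
puts tokenizer.calls   # prints 1
```

Because the supplied tokens are returned regardless of the text argument, the text and the tokens can disagree if the caller passes mismatched values.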