class Groupie

Groupie is a text grouper and classifier, using naive Bayesian filtering.

The version file (lib/groupie/version.rb) reopens Groupie to add a version number.

Constants

VERSION

Public Class Methods

new()
# File lib/groupie.rb, line 12
def initialize
  @groups = {}
end
tokenize(object)

Turn a String (or anything else that responds to to_s) into an Array of String tokens. The input is lowercased, HTML-like tags are stripped, and most common punctuation marks and types of whitespace are removed.

@param [String, to_s] object
@return [Array<String>]

# File lib/groupie.rb, line 21
def self.tokenize(object)
  object
    .to_s
    .downcase
    .gsub(/\s/, ' ')
    .gsub(/[$']/, '')
    .gsub(/<[^>]+?>|[^\w -.,]/, '')
    .split.map { |str| str.gsub(/\A['"]+|[!,."']+\Z/, '') }
end
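
To make the effect of each step concrete, here is a minimal standalone sketch. `demo_tokenize` is a hypothetical name; its body copies the expressions from the listing above so the example runs without the gem installed.

```ruby
# Hypothetical stand-in for Groupie.tokenize, copied from the listing above
# so that this example is self-contained.
def demo_tokenize(object)
  object
    .to_s
    .downcase                       # tokens are case-insensitive
    .gsub(/\s/, ' ')                # tabs and newlines become plain spaces
    .gsub(/[$']/, '')               # drop dollar signs and apostrophes
    .gsub(/<[^>]+?>|[^\w -.,]/, '') # strip HTML-like tags and stray symbols
    .split.map { |str| str.gsub(/\A['"]+|[!,."']+\Z/, '') } # trim edge punctuation
end

p demo_tokenize('Hello, World!')    # ["hello", "world"]
p demo_tokenize("<b>Buy</b>\tnow")  # ["buy", "now"]
p demo_tokenize(42)                 # ["42"]
```

Non-String input such as `42` goes through `to_s` first, which is why any object can be tokenized.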
version()
# File lib/groupie/version.rb, line 7
def self.version
  VERSION
end

Public Instance Methods

[](group)

Access an existing Group or create a new one.

@param [Object] group The name of the group to access.
@return [Groupie::Group] An existing or new group identified by group.

# File lib/groupie.rb, line 35
def [](group)
  @groups[group] ||= Group.new(group)
end
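
The `||=` idiom means the first access creates and stores a group, and later accesses return that same object. A minimal sketch of the pattern, where `DemoGroup` and `DemoRegistry` are hypothetical stand-ins for `Groupie::Group` and `Groupie`:

```ruby
# Hypothetical stand-in for Groupie::Group: it only remembers its name.
DemoGroup = Struct.new(:name)

# Hypothetical stand-in for Groupie, reduced to the lookup-or-create logic.
class DemoRegistry
  def initialize
    @groups = {}
  end

  # Return the stored group, creating and storing it on first access.
  def [](name)
    @groups[name] ||= DemoGroup.new(name)
  end
end

registry = DemoRegistry.new
p registry[:spam].equal?(registry[:spam]) # true: same object both times
```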
classify(entry, strategy = :sum)

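Classify a single word against all groups. Each group's score is its share of the word's total weight across all groups. Returns an empty Hash when no group gives the word a positive weight.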

@param [String] entry A word to be classified
@param [Symbol] strategy One of :sum, :sqrt, :log or :unique
@return [Hash<Object, Float>] Hash with <group, score> pairings. Scores are always in 0.0..1.0
@raise [Groupie::Error] Raised when an invalid strategy is provided

# File lib/groupie.rb, line 60
def classify(entry, strategy = :sum)
  results = {}
  total_count = @groups.values.inject(0) do |sum, group|
    sum + apply_count_strategy(group.count(entry), strategy)
  end
  return results if total_count.zero?

  @groups.each do |name, group|
    count = apply_count_strategy(group.count(entry), strategy)
    results[name] = count.positive? ? count.to_f / total_count : 0.0
  end

  results
end
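
As a worked example of the normalisation under the :sum strategy, the sketch below operates on a plain Hash of counts instead of real groups. `demo_classify` and its input format are assumptions made for illustration only.

```ruby
# Hypothetical helper: counts maps a group name to how often that group
# has seen the word. Mirrors the :sum strategy, where weight == count.
def demo_classify(counts)
  total = counts.values.sum
  return {} if total.zero? # word is unknown to every group

  counts.transform_values { |count| count.to_f / total }
end

p demo_classify({ spam: 3, ham: 1 }) # spam gets 0.75, ham gets 0.25
p demo_classify({ spam: 0, ham: 0 }) # {}
```

Because every score is a share of the same total, the scores of a known word always add up to 1.0.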
classify_text(words, strategy = :sum)

Classify a text by taking the average of all word classifications. Words that match no group are skipped and do not count towards the average.

@param [Array<String>] words List of words to be classified
@param [Symbol] strategy One of :sum, :sqrt, :log or :unique
@return [Hash<Object, Float>] Hash with <group, score> pairings. Scores are always in 0.0..1.0
@raise [Groupie::Error] Raised when an invalid strategy is provided

# File lib/groupie.rb, line 45
def classify_text(words, strategy = :sum)
  words &= unique_words if strategy == :unique
  group_score_sums, hits = calculate_group_scores(words, strategy)

  group_score_sums.each.with_object({}) do |(group, sum), averages|
    averages[group] = hits.positive? ? sum / hits : 0
  end
end
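
The averaging step can be sketched in isolation. The example below takes one classify-style Hash per word, sums the scores per group, and divides by the number of words that produced a result. `demo_average` and the sample data are hypothetical.

```ruby
# Hypothetical helper: word_results holds one <group, score> Hash per word,
# in the shape returned by classify (empty Hash for an unknown word).
def demo_average(word_results)
  hits = 0
  sums = word_results.each_with_object({}) do |scores, totals|
    next if scores.empty? # unknown word: no hit, no contribution

    hits += 1
    totals.merge!(scores) { |_group, old, new| old + new }
  end
  sums.transform_values { |sum| sum / hits }
end

results = [
  { spam: 1.0, ham: 0.0 }, # word seen only in spam
  { spam: 0.5, ham: 0.5 }, # word seen equally often in both groups
  {}                       # word seen nowhere
]
p demo_average(results) # spam averages 0.75, ham averages 0.25
```

The unknown third word leaves the result unchanged, since only two hits are counted.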
unique_words()

Return the list of known words, excluding the most frequent quartile: any word whose combined count across all groups exceeds the count found at the 75% position of the sorted counts. Why do this? So the most common (and thus least meaningful) words are ignored and less common words gain more predictive power.

This is used by the :unique strategy of the classifier.

@return [Array<String>]

# File lib/groupie.rb, line 82
def unique_words
  # Iterate over all Groups and merge their <word, count> dictionaries into one
  total_count = @groups.inject({}) do |total, (_name, group)|
    total.merge!(group.word_counts) { |_key, o, n| o + n }
  end
  # Extract the word count that's at the top 75%
  top_quartile_index = [total_count.size * 3 / 4 - 1, 1].max
  top_quartile_frequency = total_count.values.sort[top_quartile_index]
  # Throw out all words which have a count that's above this frequency
  total_count.reject! { |_word, count| count > top_quartile_frequency }
  total_count.keys
end
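
A worked example may help here. The sketch below applies the same merge-and-cut logic to plain Hashes, so it runs without any groups. `demo_unique_words` and the sample word counts are made up for illustration.

```ruby
# Hypothetical helper: dictionaries is a list of <word, count> Hashes,
# one per group.
def demo_unique_words(dictionaries)
  # Merge all dictionaries, adding up counts for words that occur in several.
  merged = dictionaries.each_with_object({}) do |dict, total|
    total.merge!(dict) { |_word, old, new| old + new }
  end
  # Find the count sitting at the 75% position of the sorted counts.
  cutoff_index = [merged.size * 3 / 4 - 1, 1].max
  cutoff = merged.values.sort[cutoff_index]
  # Keep only the words that do not exceed that count.
  merged.reject { |_word, count| count > cutoff }.keys
end

spam = { 'the' => 4, 'buy' => 3, 'pills' => 1, 'cheap' => 1 }
ham  = { 'the' => 6, 'a' => 8, 'hello' => 1, 'friend' => 1, 'now' => 2 }
p demo_unique_words([spam, ham])
# ["buy", "pills", "cheap", "hello", "friend", "now"]
```

With eight distinct words, the cutoff index is 5 and the sorted counts are `[1, 1, 1, 1, 2, 3, 8, 10]`, so the cutoff count is 3. Only "the" (10) and "a" (8) are dropped.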

Private Instance Methods

apply_count_strategy(count, strategy)

Helper function to reduce a raw word count to a strategy-modified weight.

@param [Integer] count
@param [Symbol] strategy
@return [Integer, Float]
@raise [Groupie::Error] Raised when an invalid strategy is provided

# File lib/groupie.rb, line 120
def apply_count_strategy(count, strategy)
  case strategy
  when :sum
    # keep count
  when :sqrt, :unique
    count = Math.sqrt(count)
  when :log
    count = Math.log10(count) if count.positive?
  else
    raise Error, "Invalid strategy: #{strategy}"
  end
  count
end
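
To compare how strongly each strategy dampens large counts, the sketch below restates the strategy table as a standalone method. `demo_weight` is a hypothetical name, and it raises `ArgumentError` where the real method raises `Groupie::Error`, so that it runs without the gem.

```ruby
# Hypothetical restatement of the strategy table above.
# Raises ArgumentError as a stand-in for Groupie::Error.
def demo_weight(count, strategy)
  case strategy
  when :sum then count
  when :sqrt, :unique then Math.sqrt(count)
  when :log then count.positive? ? Math.log10(count) : count
  else raise ArgumentError, "Invalid strategy: #{strategy}"
  end
end

# Print the weight of counts 1, 10 and 100 under each strategy.
%i[sum sqrt log].each do |strategy|
  weights = [1, 10, 100].map { |count| demo_weight(count, strategy) }
  puts "#{strategy}: #{weights.inspect}"
end
```

One consequence worth noting: under :log, a count of 1 maps to a weight of 0.0, so a word seen only once contributes nothing to the classification.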
calculate_group_scores(words, strategy)

Sum the per-group scores of every word that matches at least one group, and count how many words matched.

@param [Array<String>] words
@param [Symbol] strategy
@return [Array(Hash<Object, Float>, Integer)] a Hash with <group, score sum> pairs and an Integer with the number of hits

# File lib/groupie.rb, line 102
def calculate_group_scores(words, strategy)
  hits = 0
  group_score_sums = words.each.with_object({}) do |word, results|
    word_results = classify(word, strategy)
    next results if word_results.empty?

    hits += 1
    results.merge!(word_results) { |_key, old, new| old + new }
  end

  [group_score_sums, hits]
end