class Groupie
Groupie is a text grouper and classifier, using naive Bayesian filtering.
This extends Groupie and adds a version number.
Constants
- VERSION
Public Class Methods
# File lib/groupie.rb, line 12
def initialize
  @groups = {}
end
Turn a String (or anything else that responds to to_s) into an Array of String tokens. This attempts to remove most common punctuation marks and types of whitespace.
@param [String, to_s] object
@return [Array<String>]
# File lib/groupie.rb, line 21
def self.tokenize(object)
  object
    .to_s
    .downcase
    .gsub(/\s/, ' ')
    .gsub(/[$']/, '')
    .gsub(/<[^>]+?>|[^\w -.,]/, '')
    .split.map { |str| str.gsub(/\A['"]+|[!,."']+\Z/, '') }
end
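To see what the pipeline above does to a concrete input, here is a minimal sketch. It copies the same chain into a standalone method so that it runs without the gem installed; the name `tokenize_sketch` and the sample sentence are assumptions made for illustration and are not part of the library.

```ruby
# Standalone copy of the tokenizing chain shown above.
# `tokenize_sketch` is a hypothetical name, not a Groupie method.
def tokenize_sketch(object)
  object
    .to_s                            # accept anything that responds to to_s
    .downcase
    .gsub(/\s/, ' ')                 # normalise tabs and newlines to spaces
    .gsub(/[$']/, '')                # drop dollar signs and apostrophes
    .gsub(/<[^>]+?>|[^\w -.,]/, '')  # drop HTML-like tags and other symbols
    .split
    .map { |str| str.gsub(/\A['"]+|[!,."']+\Z/, '') } # trim edge punctuation
end

tokens = tokenize_sketch("Hello, World! It's <b>great</b>.")
puts tokens.join(' | ') # prints: hello | world | its | great
```

Note that the apostrophe is removed rather than split on, so "It's" becomes the single token "its".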
# File lib/groupie/version.rb, line 7
def self.version
  VERSION
end
Public Instance Methods
Access an existing Group or create a new one.
@param [Object] group The name of the group to access.
@return [Groupie::Group] An existing or new group identified by group.
# File lib/groupie.rb, line 35
def [](group)
  @groups[group] ||= Group.new(group)
end
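The `||=` lookup above creates a group on first access and returns the same object afterwards. The following sketch shows that behaviour in isolation. `SketchGroup` and `SketchRegistry` are hypothetical stand-ins; the real Groupie::Group does more than hold a name.

```ruby
# Hypothetical stand-in for Groupie::Group: it only remembers its name.
SketchGroup = Struct.new(:name)

# Hypothetical stand-in for the part of Groupie that stores groups.
class SketchRegistry
  def initialize
    @groups = {}
  end

  # Return the stored group for this name, creating it on first access.
  def [](group)
    @groups[group] ||= SketchGroup.new(group)
  end
end

registry = SketchRegistry.new
spam = registry[:spam]
puts spam.equal?(registry[:spam])  # prints: true (same object is reused)
puts spam.equal?(registry['spam']) # prints: false (Symbol and String keys differ)
```

Because the name is used directly as a Hash key, `:spam` and `'spam'` address two separate groups in this sketch.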
Classify a single word against all groups.
@param [String] entry A word to be classified
@param [Symbol] strategy
@return [Hash<Object, Float>] Hash with <group, score> pairings. Scores are always in 0.0..1.0
@raise [Groupie::Error] Raise when an invalid strategy is provided
# File lib/groupie.rb, line 60
def classify(entry, strategy = :sum)
  results = {}
  total_count = @groups.values.inject(0) do |sum, group|
    sum + apply_count_strategy(group.count(entry), strategy)
  end
  return results if total_count.zero?

  @groups.each do |name, group|
    count = apply_count_strategy(group.count(entry), strategy)
    results[name] = count.positive? ? count.to_f / total_count : 0.0
  end
  results
end
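The scoring rule above divides each group's weighted count by the total across all groups. A minimal sketch of that arithmetic follows. It uses plain Hashes in place of Groupie::Group objects and covers only the :sum and :sqrt weights; the word counts and the name `classify_sketch` are invented for illustration.

```ruby
# Hypothetical word counts per group; the library keeps these inside
# Groupie::Group objects instead of nested Hashes.
counts = {
  spam: { 'viagra' => 9, 'hello' => 1 },
  ham: { 'hello' => 3 }
}

# Score one word: each group's weight divided by the sum of all weights.
def classify_sketch(counts, word, strategy = :sum)
  weights = counts.transform_values do |words|
    count = words.fetch(word, 0)
    strategy == :sqrt ? Math.sqrt(count) : count
  end
  total = weights.values.sum
  return {} if total.zero? # the word is unknown to every group

  weights.transform_values { |weight| weight / total.to_f }
end

p classify_sketch(counts, 'hello')   # spam 0.25, ham 0.75
p classify_sketch(counts, 'viagra')  # spam 1.0, ham 0.0
p classify_sketch(counts, 'unknown') # empty Hash
```

The scores for a known word always add up to 1.0, which is why every score stays within 0.0..1.0.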
Classify a text by taking the average of all word classifications.
@param [Array<String>] words List of words to be classified
@param [Symbol] strategy
@return [Hash<Object, Float>] Hash with <group, score> pairings. Scores are always in 0.0..1.0
@raise [Groupie::Error] Raise when an invalid strategy is provided
# File lib/groupie.rb, line 45
def classify_text(words, strategy = :sum)
  words &= unique_words if strategy == :unique
  group_score_sums, hits = calculate_group_scores(words, strategy)

  group_score_sums.each.with_object({}) do |(group, sum), averages|
    averages[group] = hits.positive? ? sum / hits : 0
  end
end
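The averaging step can be followed by hand with a few per-word results. This sketch assumes three hypothetical word results in the shape returned by classify, where an empty Hash stands for a word that no group has seen. It mirrors the sum-then-divide logic above and does not call the library.

```ruby
# Hypothetical per-word scores; the empty Hash is an unknown word.
word_scores = [
  { spam: 0.25, ham: 0.75 },
  { spam: 1.0, ham: 0.0 },
  {}
]

# Only words with a non-empty result count as hits.
scored = word_scores.reject(&:empty?)
hits = scored.size

# Sum the scores per group across all scored words.
sums = scored.each_with_object({}) do |scores, totals|
  totals.merge!(scores) { |_group, old, new| old + new }
end

# Divide each sum by the number of hits to get the average.
averages = sums.transform_values { |sum| sum / hits }
p averages # spam 0.625, ham 0.375
```

Unknown words are left out of both the sums and the hit count, so they do not pull the averages toward zero.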
Return the known words, excluding the 4th quartile of most popular words. Why do this? So the most common (and thus meaningless) words are ignored and less common words gain more predictive power.
This is used by the :unique strategy of the classifier.
@return [Array<String>]
# File lib/groupie.rb, line 82
def unique_words
  # Iterate over all Groups and merge their <word, count> dictionaries into one
  total_count = @groups.inject({}) do |total, (_name, group)|
    total.merge!(group.word_counts) { |_key, o, n| o + n }
  end
  # Extract the word count that's at the top 75%
  top_quartile_index = [total_count.size * 3 / 4 - 1, 1].max
  top_quartile_frequency = total_count.values.sort[top_quartile_index]
  # Throw out all words which have a count that's above this frequency
  total_count.reject! { |_word, count| count > top_quartile_frequency }
  total_count.keys
end
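A worked example makes the quartile cut-off easier to check. The sketch below applies the same index formula to two invented word-count Hashes (`spam` and `ham` are hypothetical data, not library objects) and prints the threshold and the words that survive.

```ruby
# Hypothetical word counts for two groups.
spam = { 'the' => 12, 'and' => 9, 'offer' => 4, 'free' => 3, 'prize' => 1 }
ham = { 'the' => 18, 'and' => 11, 'meeting' => 2, 'invoice' => 2, 'lunch' => 1 }

# Merge both dictionaries, adding the counts of words that occur in both.
total_count = [spam, ham].each_with_object({}) do |word_counts, total|
  total.merge!(word_counts) { |_word, old, new| old + new }
end

# 8 distinct words: 8 * 3 / 4 - 1 = 5, so the threshold is the count
# at index 5 of the sorted counts [1, 1, 2, 2, 3, 4, 20, 30].
index = [total_count.size * 3 / 4 - 1, 1].max
threshold = total_count.values.sort[index]

# Keep every word whose count does not exceed the threshold.
kept = total_count.reject { |_word, count| count > threshold }.keys

puts threshold      # prints: 4
puts kept.join(' ') # prints: offer free prize meeting invoice lunch
```

With eight distinct words, the two most frequent ones ("the" and "and") are dropped, which is the top quarter of the vocabulary.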
Private Instance Methods
Helper function to reduce a raw word count to a strategy-modified weight.
@param [Integer] count
@param [Symbol] strategy
@return [Integer, Float]
@raise [Groupie::Error] Raise when an invalid strategy is provided
# File lib/groupie.rb, line 120
def apply_count_strategy(count, strategy)
  case strategy
  when :sum
    # keep count
  when :sqrt, :unique
    count = Math.sqrt(count)
  when :log
    count = Math.log10(count) if count.positive?
  else
    raise Error, "Invalid strategy: #{strategy}"
  end
  count
end
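To compare how the strategies dampen large counts, the following sketch restates the weighting rules as a standalone method and prints the weights for a few counts. `weight_sketch` is a hypothetical name, and it raises ArgumentError where the library raises Groupie::Error.

```ruby
# Standalone restatement of the weighting rules, for illustration only.
def weight_sketch(count, strategy)
  case strategy
  when :sum then count
  when :sqrt, :unique then Math.sqrt(count)
  when :log then count.positive? ? Math.log10(count) : count
  else raise ArgumentError, "Invalid strategy: #{strategy}"
  end
end

# Print the weight of a few raw counts under each strategy.
[1, 10, 100].each do |count|
  weights = %i[sum sqrt log].map { |strategy| weight_sketch(count, strategy) }
  puts "count=#{count}: #{weights.join(', ')}"
end
```

One consequence worth noting: under :log a count of 1 maps to a weight of 0.0, so a word seen only once contributes nothing to that group's score.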
Calculate grouped scores
@param [Array<String>] words
@param [Symbol] strategy
@return [Array(Hash<Object, Float>, Integer)] a Hash with <group, score> pairs and an integer with the number of hits
# File lib/groupie.rb, line 102
def calculate_group_scores(words, strategy)
  hits = 0
  group_score_sums = words.each.with_object({}) do |word, results|
    word_results = classify(word, strategy)
    next results if word_results.empty?

    hits += 1
    results.merge!(word_results) { |_key, old, new| old + new }
  end

  [group_score_sums, hits]
end