module Markovfun::Trigram

Public Class Methods

counts_to_probs(counts_hash)

Generates a probability hash from a counts hash. The result has the same structure as counts_hash, but each inner value is the probability that a word follows the two preceding words rather than the number of times it did.

# File lib/markovfun/trigram.rb, line 41
def self.counts_to_probs(counts_hash)
  probs_hash = {}
  counts_hash.each do |prev, cur_freq|
    # Total occurrences of this context, computed once per context
    total = cur_freq.values.reduce(:+).to_f
    probs_hash[prev] = {}
    cur_freq.each do |cur, freq|
      probs_hash[prev][cur] = freq / total
    end
  end
  probs_hash
end
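
As a rough illustration of the normalization step, the sketch below uses a hypothetical standalone helper, normalize_counts, which is not part of Markovfun. It assumes Ruby 2.4 or later (for sum and transform_values) and a made-up counts hash.

```ruby
# Hypothetical helper (not part of Markovfun): divides each follower count
# by the total count for its [prev2, prev1] context.
def normalize_counts(counts_hash)
  counts_hash.each_with_object({}) do |(context, followers), probs|
    total = followers.values.sum.to_f
    probs[context] = followers.transform_values { |freq| freq / total }
  end
end

# Made-up counts: "the" started three sentences, "a" started one.
counts = { [nil, nil] => { "the" => 3, "a" => 1 } }
probs = normalize_counts(counts)

puts probs[[nil, nil]]["the"]  # 3 out of 4 occurrences
puts probs[[nil, nil]]["a"]    # 1 out of 4 occurrences
```

The probabilities for each context sum to 1, which is what sample_word relies on when it walks the hash.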

get_counts(sentences)

Returns a counts hash, given a list of sentences, each an array of words. The keys of the hash are all observed combinations of [prev2, prev1], where prev2 and prev1 are the two previous words. Each value is a hash whose keys are the words (cur) that have followed prev2 and prev1, and whose values are the number of occurrences.

# File lib/markovfun/trigram.rb, line 21
def self.get_counts(sentences)
  counts_hash = {}
  sentences.each do |sent|
    # nil denotes the beginnings and ends of sentences
    sent = [nil, nil] + sent + [nil]
    # zip pads the shorter slices with nil, so the contexts
    # [last_word, nil] and [nil, nil], each followed by nil, are counted too
    sent.zip(sent[1..-1], sent[2..-1]).each do |prev2, prev1, cur|
      followers = (counts_hash[[prev2, prev1]] ||= {})
      followers[cur] = followers.fetch(cur, 0) + 1
    end
  end
  counts_hash
end
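
To see which [prev2, prev1, cur] windows the zip call in get_counts produces, here is a small standalone sketch using a made-up three-token sentence. Because Array#zip pads the shorter slices with nil, it yields two more windows than each_cons(3) would.

```ruby
# Made-up sentence, padded the same way get_counts pads it.
sent = [nil, nil] + %w[hello world .] + [nil]

# The windowing used by get_counts: one triple per element of sent.
triples = sent.zip(sent[1..-1], sent[2..-1])
triples.each { |triple| p triple }

# For comparison: strict sliding windows of three, with no nil padding.
strict = sent.each_cons(3).to_a

puts triples.length  # one window per element of the padded sentence
puts strict.length   # two fewer, since no window runs past the end
```

The two extra windows are [".", nil, nil] and [nil, nil, nil], so the counts hash also records nil following the [nil, nil] context once per sentence.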
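
sample_sentence(probs_hash)

Generates a sample sentence, given a probability hash. Sampling starts from the [nil, nil] context and continues word by word until the end-of-sentence marker (nil) is drawn. Returns the sentence as an array of words.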

# File lib/markovfun/trigram.rb, line 63
def self.sample_sentence(probs_hash)
  prev2 = nil
  prev1 = nil
  out = []

  loop do
    cur = sample_word(probs_hash[[prev2, prev1]])
    # nil is the end-of-sentence marker
    return out if cur.nil?

    out << cur
    prev2 = prev1
    prev1 = cur
  end
end
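
sample_word(probs_hash)

Samples a word, given the probability hash for a single [prev2, prev1] context (a hash from word to probability). Each word is drawn with a likelihood equal to its probability.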

# File lib/markovfun/trigram.rb, line 54
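def self.sample_word(probs_hash)
  score = rand
  probs_hash.each do |word, prob|
    return word if score < prob
    score -= prob
  end
  # Floating-point rounding can leave a tiny remainder after the loop;
  # fall back to the last word rather than returning the hash itself
  probs_hash.keys.last
end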
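
The sampling loop can be traced by hand if the random draw is passed in explicitly. The sketch below is a hypothetical variant, pick_word, which is not part of Markovfun; it takes the score as a parameter instead of calling rand, and its fallback to the last key is an assumption added for this example. The distribution is made up.

```ruby
# Hypothetical variant of sample_word with an explicit score in [0, 1).
def pick_word(dist, score)
  dist.each do |word, prob|
    # The score falls inside this word's slice of the unit interval.
    return word if score < prob
    # Otherwise skip past this word's slice.
    score -= prob
  end
  # Assumption for this sketch: fall back to the last word if rounding
  # leaves a remainder.
  dist.keys.last
end

# Made-up distribution; the nil key stands for "end of sentence".
dist = { "cat" => 0.5, "dog" => 0.25, nil => 0.25 }

p pick_word(dist, 0.1)  # lands in the first slice, [0, 0.5)
p pick_word(dist, 0.6)  # lands in the second slice, [0.5, 0.75)
p pick_word(dist, 0.9)  # lands in the third slice, [0.75, 1.0)
```

Each word owns a slice of the unit interval whose width equals its probability, so a uniform draw selects words in proportion to those probabilities.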
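
score_sentence(sent, probs)

Scores a sentence by its total surprisal in bits under the trigram probabilities: the sum of -log2 of the probability of each word given the two words before it. Lower scores mean the sentence is more likely under the corpus. sent may be an array of words or a String. Every trigram in sent must be present in probs; an unseen trigram raises an error.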

# File lib/markovfun/trigram.rb, line 92
def self.score_sentence(sent, probs)
  total_surprise = 0

  # A String is split on spaces; its final character is dropped and
  # a separate "." token is appended in its place
  sent = sent[0..-2].split(" ").push(".") if sent.is_a?(String)
  sent = [nil, nil] + sent + [nil]

  sent.zip(sent[1..-1], sent[2..-1]).each do |prev2, prev1, cur|
    total_surprise += -Math.log2(probs[[prev2, prev1]][cur])
  end
  total_surprise
end
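
As a worked example of the scoring arithmetic, the sketch below sums -log2 of a few made-up conditional probabilities. It is standalone, does not call Markovfun, and assumes Ruby 2.4 or later for Array#sum.

```ruby
# Made-up conditional probabilities for the successive words of one sentence.
word_probs = [0.5, 0.25, 1.0]

# Surprisal of each word in bits: rarer words contribute more.
surprisals = word_probs.map { |prob| -Math.log2(prob) }

# 1 bit + 2 bits + 0 bits
total = surprisals.sum
puts total
```

A word that always follows its context (probability 1.0) adds nothing to the score, while each halving of probability adds one bit.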
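
sentence_from_file(filename, min_length)

Generates a sentence from the text in filename. Reads the sentences with get_sentences, builds counts and probabilities from them, and returns a generated sentence of at least min_length tokens.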

# File lib/markovfun/trigram.rb, line 9
def self.sentence_from_file(filename, min_length)
  sentences = get_sentences(filename)
  counts = get_counts(sentences)
  probs = counts_to_probs(counts)
  sentence_from_probs_hash(probs, min_length)
end
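
sentence_from_probs_hash(probs, min_length)

Generates a sentence from a probability hash. Samples repeatedly until a sentence has at least min_length tokens and a score (see score_sentence) of at most 30, then returns it as a String ending in a period.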

# File lib/markovfun/trigram.rb, line 81
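def self.sentence_from_probs_hash(probs, min_length)
  sent = []
  # Resample until the sentence is long enough and scores at most 30 bits
  while score_sentence(sent, probs) > 30 || sent.length < min_length
    sent = sample_sentence(probs)
  end
  # Drop the final token and close the sentence with a period
  sent[0..-2].join(" ") + "."
end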