module Ebooks::NLP

Constants

PUNCTUATION

We deliberately limit our punctuation handling to stuff we can do consistently. It'll just be part of another token if we don't split it out, and that's fine.

Public Class Methods

adjectives()

Lazily loads an array of known English adjectives.
@return [Array<String>]

# File lib/bot_twitter_ebooks/nlp.rb, line 31
def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
htmlentities()

Lazily loads the HTML entity decoder.
@return [HTMLEntities]

# File lib/bot_twitter_ebooks/nlp.rb, line 45
def self.htmlentities
  @htmlentities ||= HTMLEntities.new
end
keywords(text)

Uses the highscore gem to find interesting keywords in a corpus.
@param text [String]
@return [Highscore::Keywords]

# File lib/bot_twitter_ebooks/nlp.rb, line 88
def self.keywords(text)
  # Preprocess to remove stopwords and urls (highscore's blacklist is v. slow)
  text = NLP.tokenize(text).reject do |t|
    t.downcase.start_with?('http') || stopword?(t)
  end

  text = Highscore::Content.new(text.join(' '))

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1                     # => default: 0 = not considered
    #set :consonants, 5                 # => default: 0 = not considered
    set :ignore_case, true             # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\p{Word}']+/           # => default: /\w+/
    #set :stemming, true                # => default: false
  end

  text.keywords
end
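
A usage sketch, following the highscore gem's documented API (keyword objects respond to text and weight; the sample corpus is made up):

keywords = Ebooks::NLP.keywords("the quick brown fox jumps over the lazy dog")
keywords.top(3).each do |keyword|
  puts "#{keyword.text} #{keyword.weight}"
end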
normalize(text)

Normalizes some strange Unicode punctuation variants and decodes HTML entities.
@param text [String]
@return [String]

# File lib/bot_twitter_ebooks/nlp.rb, line 54
def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
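
A minimal sketch of the effect, combining curly quotes, a Unicode ellipsis, and an HTML entity:

Ebooks::NLP.normalize("“Tom &amp; Jerry…”")
# => "\"Tom & Jerry...\""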
nouns()

Lazily loads an array of known English nouns.
@return [Array<String>]

# File lib/bot_twitter_ebooks/nlp.rb, line 25
def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end
punctuation?(token)

Is this token composed entirely of punctuation?
@param token [String]
@return [Boolean]

# File lib/bot_twitter_ebooks/nlp.rb, line 149
def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
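
For example, assuming PUNCTUATION includes the usual sentence-ending marks:

Ebooks::NLP.punctuation?("?!")  # => true
Ebooks::NLP.punctuation?("hi.") # => false; 'h' and 'i' are not punctuation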
reconstruct(tikis, tokens)

Builds a proper sentence from a list of tikis (token indices).
@param tikis [Array<Integer>]
@param tokens [Array<String>]
@return [String]

# File lib/bot_twitter_ebooks/nlp.rb, line 115
def self.reconstruct(tikis, tokens)
  text = ""
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
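
A small sketch with a hand-built token list (in the gem, tikis index into a model's token array):

tokens = ["hello", ",", "world", "!"]
Ebooks::NLP.reconstruct([0, 1, 2, 3], tokens)
# => "hello, world!"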
sentences(text)

Splits text into sentences. We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
@param text [String]
@return [Array<String>]

# File lib/bot_twitter_ebooks/nlp.rb, line 64
def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
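
For example, the pattern splits on newlines and on whitespace that follows a sentence-ending mark:

Ebooks::NLP.sentences("Good morning! How are you?\nFine.")
# => ["Good morning!", "How are you?", "Fine."]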
space_between?(token1, token2)

Determines whether we need to insert a space between two tokens.
@param token1 [String]
@param token2 [String]
@return [Boolean]

# File lib/bot_twitter_ebooks/nlp.rb, line 132
def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)
  if p1 && p2 # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else # "foo rah"
    true
  end
end
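
Note that the four branches collapse to a single rule: a space is inserted exactly when the second token is not punctuation.

Ebooks::NLP.space_between?("foo", "rah") # => true
Ebooks::NLP.space_between?("foo", ".")   # => false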
stem(word)

Gets the 'stem' form of a word, e.g. 'cats' -> 'cat'.
@param word [String]
@return [String]

# File lib/bot_twitter_ebooks/nlp.rb, line 81
def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
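
For example (the word is downcased before stemming):

Ebooks::NLP.stem("cats")    # => "cat"
Ebooks::NLP.stem("Jumping") # => "jump"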
stopword?(token)

Is this token a stopword?
@param token [String]
@return [Boolean]

# File lib/bot_twitter_ebooks/nlp.rb, line 156
def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
stopwords()

Lazily loads an array of stopwords. Stopwords are common words that should often be ignored.
@return [Array<String>]

# File lib/bot_twitter_ebooks/nlp.rb, line 19
def self.stopwords
  @stopwords ||= File.exist?('stopwords.txt') ? File.read('stopwords.txt').split : []
end
subseq?(a1, a2)

Determines whether a2 appears as a contiguous subsequence of a1.
@param a1 [Array]
@param a2 [Array]
@return [Boolean]

# File lib/bot_twitter_ebooks/nlp.rb, line 191
def self.subseq?(a1, a2)
  !a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end.nil?
end
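
Note that the match must be contiguous, so this is really a substring-style check on arrays:

Ebooks::NLP.subseq?([1, 2, 3, 4], [2, 3]) # => true
Ebooks::NLP.subseq?([1, 2, 3, 4], [2, 4]) # => false; 2 and 4 are not adjacent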
tagger()

Lazily loads the part-of-speech tagging library. This can determine whether a word is being used as a noun/adjective/verb.
@return [EngTagger]

# File lib/bot_twitter_ebooks/nlp.rb, line 38
def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
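
A usage sketch: EngTagger#add_tags wraps each word in a part-of-speech tag (the exact tags shown here are illustrative):

tagged = Ebooks::NLP.tagger.add_tags("the quick brown fox")
# => "<det>the</det> <jj>quick</jj> <jj>brown</jj> <nn>fox</nn>"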
tokenize(sentence)

Splits a sentence into word-level tokens. As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps.
@param sentence [String]
@return [Array<String>]

# File lib/bot_twitter_ebooks/nlp.rb, line 73
def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
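
For example, punctuation followed by whitespace becomes its own token, while punctuation at the very end of the string stays attached (there is no trailing whitespace for the lookahead to match):

Ebooks::NLP.tokenize("i loved it. you would too!")
# => ["i", "loved", "it", ".", "you", "would", "too!"]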
unmatched_enclosers?(text)

Determines whether a sample of text contains unmatched brackets or quotes. This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry.
@param text [String]
@return [Boolean]

# File lib/bot_twitter_ebooks/nlp.rb, line 166
def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)

      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
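
For example:

Ebooks::NLP.unmatched_enclosers?('he said "hi')        # => true; unclosed quote
Ebooks::NLP.unmatched_enclosers?('he said "hi" to me') # => false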