module Ebooks::NLP
Constants
- PUNCTUATION
We deliberately limit our punctuation handling to characters we can process consistently. Anything else simply remains part of another token if we don't split it out, and that's fine.
Public Class Methods
Lazily loads an array of known English adjectives
@return [Array<String>]
# File lib/bot_twitter_ebooks/nlp.rb, line 31
def self.adjectives
  @adjectives ||= File.read(File.join(DATA_PATH, 'adjectives.txt')).split
end
Lazily load HTML entity decoder
@return [HTMLEntities]
# File lib/bot_twitter_ebooks/nlp.rb, line 45
def self.htmlentities
  @htmlentities ||= HTMLEntities.new
end
Use the highscore gem to find interesting keywords in a corpus
@param text [String]
@return [Highscore::Keywords]
# File lib/bot_twitter_ebooks/nlp.rb, line 88
def self.keywords(text)
  # Preprocess to remove stopwords and urls (highscore's blacklist is v. slow)
  text = NLP.tokenize(text).reject do |t|
    t.downcase.start_with?('http') || stopword?(t)
  end

  text = Highscore::Content.new(text.join(' '))

  text.configure do
    #set :multiplier, 2
    #set :upper_case, 3
    #set :long_words, 2
    #set :long_words_threshold, 15
    #set :vowels, 1                  # => default: 0 = not considered
    #set :consonants, 5              # => default: 0 = not considered
    set :ignore_case, true           # => default: false
    set :word_pattern, /(?<!@)(?<=\s)[\p{Word}']+/ # => default: /\w+/
    #set :stemming, true             # => default: false
  end

  text.keywords
end
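The preprocessing step can be sketched in isolation. Here a hardcoded stopword list and plain whitespace splitting stand in for stopword? and NLP.tokenize (both are simplifications for illustration, not the gem's exact behavior):

```ruby
require 'set'

# Assumed stand-in for the gem's stopword list.
STOPWORDS = %w[the a an is to].to_set

# Drop URLs and stopwords before keyword extraction, as keywords does.
def prefilter(text)
  text.split.reject do |t|
    t.downcase.start_with?('http') || STOPWORDS.include?(t.downcase)
  end
end

prefilter("the cat sat on http://example.com a mat")
# => ["cat", "sat", "on", "mat"]
```

Doing this rejection up front is the point of the comment in the source: filtering tokens in Ruby before handing the joined text to highscore avoids its slow blacklist path.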
Normalize some strange unicode punctuation variants
@param text [String]
@return [String]
# File lib/bot_twitter_ebooks/nlp.rb, line 54
def self.normalize(text)
  htmlentities.decode text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
end
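The substitution chain is plain String#gsub. In a dependency-free sketch, the stdlib's CGI.unescapeHTML can stand in for the HTMLEntities gem (an approximation that covers common entities, not the gem's full entity table):

```ruby
require 'cgi'

# Sketch of normalize: curly quotes and ellipses become their ASCII
# equivalents, then HTML entities are decoded.
def normalize(text)
  CGI.unescapeHTML(
    text.gsub('“', '"').gsub('”', '"').gsub('’', "'").gsub('…', '...')
  )
end

normalize("“Don’t stop…” &amp; more")
# => "\"Don't stop...\" & more"
```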
Lazily loads an array of known English nouns
@return [Array<String>]
# File lib/bot_twitter_ebooks/nlp.rb, line 25
def self.nouns
  @nouns ||= File.read(File.join(DATA_PATH, 'nouns.txt')).split
end
Is this token composed entirely of punctuation?
@param token [String]
@return [Boolean]
# File lib/bot_twitter_ebooks/nlp.rb, line 149
def self.punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end
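The set difference makes this a simple "no characters outside PUNCTUATION" check. A minimal standalone sketch, assuming PUNCTUATION is the string ".?!," (check the gem's constant list for the real value):

```ruby
require 'set'

PUNCTUATION = ".?!,"  # assumed value for illustration

# True when every character of the token appears in PUNCTUATION.
def punctuation?(token)
  (token.chars.to_set - PUNCTUATION.chars.to_set).empty?
end

punctuation?("?!")    # => true
punctuation?("foo.")  # => false
```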
Builds a proper sentence from a list of tikis
@param tikis [Array<Integer>]
@param tokens [Array<String>]
@return [String]
# File lib/bot_twitter_ebooks/nlp.rb, line 115
def self.reconstruct(tikis, tokens)
  text = ""
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end
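Tikis are integer indices into the token array, with INTERIM acting as a skip-over sentinel. A runnable sketch of the loop; the INTERIM value of -1 and the simplified space_between? here are assumptions for illustration:

```ruby
INTERIM = -1  # assumed sentinel value

# Simplified spacing rule: no space before a punctuation-only token.
def space_between?(token1, token2)
  !token2.match?(/\A[.?!,]+\z/)
end

# Walk the tiki indices, skipping the sentinel, and glue tokens together
# with spaces only where the spacing rule asks for one.
def reconstruct(tikis, tokens)
  text = ""
  last_token = nil
  tikis.each do |tiki|
    next if tiki == INTERIM
    token = tokens[tiki]
    text += ' ' if last_token && space_between?(last_token, token)
    text += token
    last_token = token
  end
  text
end

reconstruct([0, INTERIM, 1, 2], ["hello", "world", "!"])
# => "hello world!"
```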
Split text into sentences
We use an ad hoc approach because fancy libraries do not deal especially well with tweet formatting, and we can fake solving the quote problem during generation.
@param text [String]
@return [Array<String>]
# File lib/bot_twitter_ebooks/nlp.rb, line 64
def self.sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end
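The regex splits on runs of newlines, or on whitespace that follows sentence-final punctuation (the lookbehind keeps the punctuation attached to its sentence). It is self-contained and easy to try directly:

```ruby
# Split on newlines, or on whitespace preceded by ., ?, or !.
def sentences(text)
  text.split(/\n+|(?<=[.?!])\s+/)
end

sentences("One. Two!\nThree")
# => ["One.", "Two!", "Three"]
```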
Determine if we need to insert a space between two tokens
@param token1 [String]
@param token2 [String]
@return [Boolean]
# File lib/bot_twitter_ebooks/nlp.rb, line 132
def self.space_between?(token1, token2)
  p1 = self.punctuation?(token1)
  p2 = self.punctuation?(token2)

  if p1 && p2     # "foo?!"
    false
  elsif !p1 && p2 # "foo."
    false
  elsif p1 && !p2 # "foo. rah"
    true
  else            # "foo rah"
    true
  end
end
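The four branches collapse to a single rule: insert a space unless the next token is pure punctuation. A sketch of that reduction, with an assumed PUNCTUATION value:

```ruby
PUNCTUATION = ".?!,"  # assumed value for illustration

def punctuation?(token)
  token.each_char.all? { |c| PUNCTUATION.include?(c) }
end

# Equivalent to the four-branch version: only token2 matters.
def space_between?(token1, token2)
  !punctuation?(token2)
end

space_between?("foo", "rah")  # => true   ("foo rah")
space_between?("foo", ".")    # => false  ("foo.")
space_between?(".", "rah")    # => true   ("foo. rah")
```

This is behavior-equivalent to the original: in both false branches token2 is punctuation, and in both true branches it is not.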
Get the 'stem' form of a word, e.g. 'cats' -> 'cat'
@param word [String]
@return [String]
# File lib/bot_twitter_ebooks/nlp.rb, line 81
def self.stem(word)
  Stemmer::stem_word(word.downcase)
end
Is this token a stopword?
@param token [String]
@return [Boolean]
# File lib/bot_twitter_ebooks/nlp.rb, line 156
def self.stopword?(token)
  @stopword_set ||= stopwords.map(&:downcase).to_set
  @stopword_set.include?(token.downcase)
end
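The memoized, downcased Set makes the lookup case-insensitive and constant-time. A sketch with an inline word list in place of the stopwords.txt file:

```ruby
require 'set'

# Assumed inline stopword list; the gem loads these from stopwords.txt.
STOPWORD_SET = %w[The A An].map(&:downcase).to_set

def stopword?(token)
  STOPWORD_SET.include?(token.downcase)
end

stopword?("THE")  # => true
stopword?("cat")  # => false
```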
Lazily loads an array of stopwords
Stopwords are common words that should often be ignored
@return [Array<String>]
# File lib/bot_twitter_ebooks/nlp.rb, line 19
def self.stopwords
  @stopwords ||= File.exists?('stopwords.txt') ? File.read('stopwords.txt').split : []
end
Determine if a2 occurs as a contiguous subsequence of a1
@param a1 [Array]
@param a2 [Array]
@return [Boolean]
# File lib/bot_twitter_ebooks/nlp.rb, line 191
def self.subseq?(a1, a2)
  !a1.each_index.find do |i|
    a1[i...i+a2.length] == a2
  end.nil?
end
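The method scans every start index of a1 for a slice equal to a2, so it is really a contiguous-slice test rather than a general subsequence test. A standalone copy to try:

```ruby
# True if a2 appears as a contiguous run inside a1.
def subseq?(a1, a2)
  !a1.each_index.find do |i|
    a1[i...i + a2.length] == a2
  end.nil?
end

subseq?([1, 2, 3, 4], [2, 3])  # => true
subseq?([1, 2, 3, 4], [3, 2])  # => false  (elements present, order wrong)
```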
Lazily load part-of-speech tagging library
This can determine whether a word is being used as a noun/adjective/verb
@return [EngTagger]
# File lib/bot_twitter_ebooks/nlp.rb, line 38
def self.tagger
  require 'engtagger'
  @tagger ||= EngTagger.new
end
Split a sentence into word-level tokens
As above, this is ad hoc because tokenization libraries do not behave well with respect to things like emoticons and timestamps
@param sentence [String]
@return [Array<String>]
# File lib/bot_twitter_ebooks/nlp.rb, line 73
def self.tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end
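The regex splits on whitespace and also makes zero-width splits that detach sentence punctuation from the word it trails, while leaving interior punctuation (emoticons, timestamps) alone. The same regex can be exercised standalone, assuming PUNCTUATION is ".?!,":

```ruby
PUNCTUATION = ".?!,"  # assumed value for illustration

# Split on whitespace, or at the zero-width boundary between a letter
# and trailing sentence punctuation (and the reverse, after it).
def tokenize(sentence)
  regex = /\s+|(?<=[#{PUNCTUATION}]\s)(?=[a-zA-Z])|(?<=[a-zA-Z])(?=[#{PUNCTUATION}]+\s)/
  sentence.split(regex)
end

tokenize("so. cool")
# => ["so", ".", "cool"]
```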
Determine if a sample of text contains unmatched brackets or quotes
This is one of the more frequent and noticeable failure modes for the generator; we can just tell it to retry
@param text [String]
@return [Boolean]
# File lib/bot_twitter_ebooks/nlp.rb, line 166
def self.unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0

    tokenize(text).each do |token|
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)
      return true if opened < 0 # Too many ends!
    end

    return true if opened != 0 # Mismatch somewhere.
  end

  false
end
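For each encloser pair, the method counts tokens that look like openers and closers and flags any imbalance. A standalone sketch; plain whitespace splitting stands in for tokenize, which changes tokenization slightly but preserves the counting logic:

```ruby
# Sketch: true if any bracket/quote pair is left unbalanced.
def unmatched_enclosers?(text)
  enclosers = ['**', '""', '()', '[]', '``', "''"]
  enclosers.each do |pair|
    # Opener: pair start at a word boundary, hugging a non-space char.
    starter = Regexp.new('(\W|^)' + Regexp.escape(pair[0]) + '\S')
    # Closer: pair end hugging a non-space char, at a word boundary.
    ender = Regexp.new('\S' + Regexp.escape(pair[1]) + '(\W|$)')

    opened = 0
    text.split.each do |token|  # whitespace split stands in for tokenize
      opened += 1 if token.match(starter)
      opened -= 1 if token.match(ender)
      return true if opened < 0 # a closer with no opener
    end
    return true if opened != 0  # an opener never closed
  end
  false
end

unmatched_enclosers?('he said "stop')   # => true
unmatched_enclosers?('he said "stop"')  # => false
```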