class Glove::Parser

Takes a string of text and tokenizes it for use in {Glove::Corpus}

Constants

DEFAULTS

Default options (see initialize)

Attributes

text[R]

@!attribute [r] text

@return [String] the current value of the text attribute

Public Class Methods

new(text, options={})

Create a new {Glove::Parser}, passing the text and options as arguments

@param [String] text value for the text attribute
@param [Hash] options the options to initialize the instance with.
@option options [Boolean] :stem (true) Whether to stem the tokens
@option options [Boolean] :alphabetic (true) Remove any non-alphabetic chars
@option options [Boolean] :normalize (true) Normalize the text and keep words with length between options[:min_length] and options[:max_length]
@option options [Boolean] :stop_words (true) Filter stop words
@option options [Integer] :min_length (3) the min allowed length of a word
@option options [Integer] :max_length (25) the max allowed length of a word
@return [Glove::Parser] A new parser.

# File lib/glove/parser.rb, line 32
def initialize(text, options={})
  @text, @opt = text, DEFAULTS.dup.merge(options)
end
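
A minimal sketch of how the option merge behaves, assuming DEFAULTS holds the default values listed above. The constant is not reproduced on this page, so `SKETCH_DEFAULTS` below is a hypothetical stand-in rather than the library's actual hash:

```ruby
# Hypothetical stand-in for Glove::Parser::DEFAULTS, reconstructed from the
# documented option defaults (an assumption, not copied from the source).
SKETCH_DEFAULTS = {
  stem: true, alphabetic: true, normalize: true, stop_words: true,
  min_length: 3, max_length: 25
}.freeze

# Same expression shape as in initialize: caller-supplied keys win,
# every other key keeps its default.
opts = SKETCH_DEFAULTS.dup.merge(stem: false, min_length: 2)

puts opts[:stem]        # false
puts opts[:min_length]  # 2
puts opts[:max_length]  # 25
```

Note that initialize stores the string it receives without copying it, so the in-place methods (downcase!, gsub!) also modify the caller's string.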

Public Instance Methods

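alphabetic()

Replaces each run of non-alphabetic characters, and each letter-led word that also contains a digit, with a single space. The text is modified in place and stays a string; splitting into words is done by split.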

# File lib/glove/parser.rb, line 62
def alphabetic
  text.gsub!(/([^[:alpha:]]+)|((?=\w*[a-z])(?=\w*[0-9])\w+)/, ' ')
end
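
To see what the pattern does on its own, here is a small standalone sketch. The helper name `strip_non_alphabetic` is hypothetical, and it uses the non-mutating `gsub` so that the sample strings stay unchanged:

```ruby
# The same pattern as in alphabetic, applied outside the class.
ALPHABETIC_PATTERN = /([^[:alpha:]]+)|((?=\w*[a-z])(?=\w*[0-9])\w+)/

# Hypothetical helper: returns a cleaned copy instead of mutating the input.
def strip_non_alphabetic(str)
  str.gsub(ALPHABETIC_PATTERN, ' ')
end

# Punctuation and whitespace runs each collapse to one space.
p strip_non_alphabetic("hello, world!")  # "hello world "

# A letter-led word containing digits is removed as a whole.
p strip_non_alphabetic("abc123 def")     # "  def"

# A digit-led word only loses its leading digits.
p strip_non_alphabetic("123abc")         # " abc"
```

The leftover spaces are harmless because the later split call discards them.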

downcase()

Downcases the text value

# File lib/glove/parser.rb, line 51
def downcase
  text.downcase!
end

normalize()

Keeps only the words whose length lies between :min_length and :max_length, inclusive. Expects the text to have been split into an array already.

# File lib/glove/parser.rb, line 72
def normalize
  text.keep_if do |word|
    word.length.between?(@opt[:min_length], @opt[:max_length])
  end
end
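
The length filter can be tried in isolation. This sketch uses a plain local hash in place of the parser's @opt and the non-mutating `select` instead of `keep_if`; the sample words are made up:

```ruby
# Stand-in for @opt, using the documented default bounds.
opt = { min_length: 3, max_length: 25 }

words = %w[a an cat elephant]

# between? is inclusive at both ends, so a 3-letter word survives.
kept = words.select do |word|
  word.length.between?(opt[:min_length], opt[:max_length])
end

p kept  # ["cat", "elephant"]
```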

split()

Splits the text string into an array of words

# File lib/glove/parser.rb, line 56
def split
  @text = text.split
end

stem()

Stems every member of the text array

# File lib/glove/parser.rb, line 67
def stem
  text.map!(&:stem)
end

stop_words()

Excludes words that appear in the stop words list (see stop_words_array)

# File lib/glove/parser.rb, line 79
def stop_words
  @text = text.scan(/(\w+)(\W+)/).reject do |(word, other)|
    stop_words_array.include? word
  end.flatten.join
end
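
The scan-and-reject approach can be sketched outside the class as follows. The helper `filter_stop_words` and the three-word stop list are hypothetical; the real list comes from stop_words_array:

```ruby
# Hypothetical, tiny stop list for illustration only.
SAMPLE_STOP_WORDS = %w[the is a].freeze

# Same chain as in stop_words: scan into [word, separator] pairs,
# drop the pairs whose word is a stop word, then glue the rest together.
def filter_stop_words(text, stop_list)
  text.scan(/(\w+)(\W+)/).reject do |(word, _separator)|
    stop_list.include?(word)
  end.flatten.join
end

p filter_stop_words("the cat is here. ", SAMPLE_STOP_WORDS)  # "cat here. "

# The pattern needs at least one non-word character after each word,
# so a final word with nothing after it is not captured at all.
p filter_stop_words("the cat is here", SAMPLE_STOP_WORDS)    # "cat "
```

As the second call shows, input that ends directly on a word loses that word, whether or not it is a stop word. Appending trailing whitespace or punctuation to the input avoids this.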

stop_words_array()

Reads the default stop words file and returns an array of its entries. The result is cached after the first call.

# File lib/glove/parser.rb, line 86
def stop_words_array
  @stop_words ||= File.read(File.join(Glove.root_path, 'resources', 'en.stop')).split
end

tokenize()

Calls the parsing methods enabled in the options, in the order shown below, and returns the final text value as an array of words

@return [Array] The tokens array

# File lib/glove/parser.rb, line 40
def tokenize
  downcase
  stop_words  if @opt[:stop_words]
  alphabetic  if @opt[:alphabetic]
  split
  normalize   if @opt[:normalize]
  stem        if @opt[:stem]
  text
end
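
The order of the steps can be illustrated with a standalone sketch. It covers only the downcase, alphabetic, split and normalize steps, roughly what tokenize does when :stem and :stop_words are both false; stemming and stop word filtering are left out because they depend on the library's own resources. The function name `sketch_tokenize` is hypothetical:

```ruby
ALPHABETIC = /([^[:alpha:]]+)|((?=\w*[a-z])(?=\w*[0-9])\w+)/

# Hypothetical, simplified pipeline: not the library's method, and it
# returns a new array rather than mutating its argument.
def sketch_tokenize(text, min_length: 3, max_length: 25)
  words = text.downcase          # 1. lowercase everything
              .gsub(ALPHABETIC, ' ')  # 2. blank out non-alphabetic content
              .split             # 3. break into words on whitespace
  # 4. keep words inside the inclusive length bounds
  words.select { |w| w.length.between?(min_length, max_length) }
end

tokens = sketch_tokenize("A quick brown fox, aged 7, is over 2 lazy dogs!")
p tokens  # ["quick", "brown", "fox", "aged", "over", "lazy", "dogs"]
```

Downcasing runs first because the second half of the alphabetic pattern only looks for lowercase letters.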