class Glove::Corpus

Class responsible for building the token count hash, the token index hash and the token pairs array from a given text

Attributes

min_count[R]

@!attribute [r] min_count

@return [Integer] Lower limit such that words which occur fewer than
  min_count times are discarded (defaults to 5)
tokens[R]

@!attribute [r] tokens

@return [Array<String>] Returns the parsed tokens array. Holds all the tokens
  in the exact order they appear in the text
window[R]

@!attribute [r] window

@return [Integer] Number of context words to the left and to the right
  (defaults to 2)

Public Class Methods

build(text, options={})

Convenience method for creating an instance and building the token count, index and pairs (see initialize)

# File lib/glove/corpus.rb, line 12
def self.build(text, options={})
  new(text, options).build_tokens
end
new(text, options={})

Create a new {Glove::Corpus} instance

@param [Hash] options the options to initialize the instance with.

@option options [Integer] :window (2) Number of context words to the left and to the right

@option options [Integer] :min_count (5) Lower limit such that words which occur fewer than :min_count times are discarded.
# File lib/glove/corpus.rb, line 23
def initialize(text, options={})
  @tokens = Parser.new(text, options).tokenize
  @window = options[:window] || 2
  @min_count = options[:min_count] || 5
end

Public Instance Methods

build_count()
Alias for: count
build_index()
Alias for: index
build_pairs()
Alias for: pairs
build_tokens()

Builds the token count, token index and token pairs

@return [Glove::Corpus]

# File lib/glove/corpus.rb, line 32
def build_tokens
  build_count
  build_index
  build_pairs
  self
end
count()

Hash that stores the occurrence count of unique tokens

@return [Hash{String=>Integer}] Token-Count pairs where count is the total number of occurrences of the token in the (non-unique) tokens array. Tokens that occur fewer than min_count times are dropped.
# File lib/glove/corpus.rb, line 43
def count
  @count ||= tokens.inject(Hash.new(0)) do |hash,item|
    hash[item] += 1
    hash
  end.to_h.keep_if{ |word,count| count >= min_count }
end
Also aliased as: build_count
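
The counting and filtering steps can be sketched on a plain array, without the library. The token list and the min_count value below are made up for illustration, and each_with_object stands in for the inject call shown above.

```ruby
# Hypothetical sample data, not taken from the library.
tokens    = %w[the quick fox the lazy dog the fox]
min_count = 2

# Tally every token; Hash.new(0) makes unseen tokens start at zero.
counts = tokens.each_with_object(Hash.new(0)) { |token, hash| hash[token] += 1 }

# Keep only the tokens that meet the min_count threshold.
vocabulary = counts.select { |_token, n| n >= min_count }

p vocabulary # {"the"=>3, "fox"=>2}
```

Tokens below the threshold ("quick", "lazy", "dog") never reach the vocabulary, which is why a small corpus combined with the default min_count of 5 can end up with very few entries.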
index()

A hash whose values hold the sequential index of a word as it appears in the count hash

@return [Hash{String=>Integer}] Token-Index pairs where index is the sequential index of the token in the unique vocabulary pool
# File lib/glove/corpus.rb, line 56
def index
  # Call count (not @count) so the index can be built before count has been memoized
  @index ||= count.keys.each_with_index.inject({}) do |hash,(word,idx)|
    hash[word] = idx
    hash
  end
end
Also aliased as: build_index
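
As a rough illustration of the index layout, the mapping can be reproduced from any count hash. The counts literal below is a hypothetical stand-in for the result of count, and each_with_index.to_h is used here as a shorter equivalent of the inject call above.

```ruby
# Hypothetical count hash; in the class this would come from #count.
counts = { "the" => 3, "fox" => 2, "dog" => 2 }

# Ruby hashes keep insertion order, so each key gets the position
# at which it was first added to the count hash.
index = counts.keys.each_with_index.to_h

p index # {"the"=>0, "fox"=>1, "dog"=>2}
```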
marshal_dump()

Data to dump with Marshal.dump

# File lib/glove/corpus.rb, line 94
def marshal_dump
  [@tokens, @count, @index, @pairs]
end
marshal_load(contents)

Reconstruct the instance data via Marshal.load

# File lib/glove/corpus.rb, line 99
def marshal_load(contents)
  @tokens, @count, @index, @pairs = contents
end
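
The marshal_dump / marshal_load pair can be tried out in isolation with a small stand-in class. TinyCorpus below is hypothetical and only mirrors the shape of the two hooks above. It also shows a consequence of dumping a fixed list of instance variables: anything left out of that list (window here) comes back as nil after a round trip.

```ruby
# Hypothetical stand-in class, not part of the library.
class TinyCorpus
  attr_reader :tokens, :count, :window

  def initialize(tokens, window = 2)
    @tokens = tokens
    @window = window
    @count  = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }
  end

  # Only these two values are written by Marshal.dump.
  def marshal_dump
    [@tokens, @count]
  end

  # Marshal.load allocates the object without calling initialize,
  # then hands the dumped array to this method.
  def marshal_load(contents)
    @tokens, @count = contents
  end
end

original = TinyCorpus.new(%w[a b a])
restored = Marshal.load(Marshal.dump(original))

p restored.tokens # ["a", "b", "a"]
p restored.count  # {"a"=>2, "b"=>1}
p restored.window # nil, because @window was not part of the dump
```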
pairs()

Iterates over the tokens array and constructs {Glove::TokenPair}s where neighbors holds the adjacent (context) words. The number of neighbors on each side is controlled by the :window option. Tokens that occur fewer than min_count times get no pair of their own.

@return [Array<(Glove::TokenPair)>] Array of {Glove::TokenPair}s

# File lib/glove/corpus.rb, line 69
def pairs
  @pairs ||= tokens.map.with_index do |word, index|
    next unless count[word] >= min_count

    TokenPair.new(word, token_neighbors(word, index))
  end.compact
end
Also aliased as: build_pairs
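
A minimal sketch of the pairing step, assuming a simple Struct in place of Glove::TokenPair (whose real interface is not shown here) and made-up tokens, window and min_count values:

```ruby
# Hypothetical stand-in for Glove::TokenPair; the real class may differ.
Pair = Struct.new(:token, :neighbors)

tokens    = %w[a b a c a b]
window    = 1
min_count = 2

counts = tokens.each_with_object(Hash.new(0)) { |t, h| h[t] += 1 }

pairs = tokens.each_with_index.map do |word, i|
  # Rare words are skipped as the center of a pair.
  next if counts[word] < min_count

  first = [i - window, 0].max
  last  = [i + window, tokens.size - 1].min
  Pair.new(word, tokens[first..last].reject { |n| n == word })
end.compact

p pairs.map(&:token) # ["a", "b", "a", "a", "b"]
p pairs[2].neighbors # ["b", "c"]
```

In this sketch "c" is below min_count, so it produces no pair, yet it still shows up as a neighbor of the surrounding words. The method source above appears to behave the same way, since the min_count check is applied only to the center word.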
token_neighbors(word, index)

Construct array of neighbors for the given word and its index in the tokens array

@param [String] word The word to get neighbors for

@param [Integer] index Index of the word in the @tokens array

@return [Array<(String)>] List of the neighbors

# File lib/glove/corpus.rb, line 84
def token_neighbors(word, index)
  start_pos = index - window < 0 ? 0 : index - window
  end_pos   = (index + window >= tokens.size) ? tokens.size - 1 : index + window

  tokens[start_pos..end_pos].map do |neighbor|
    neighbor unless word == neighbor
  end.compact
end
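
The window clamping can be checked with a standalone helper. neighbors_of is a hypothetical name that takes the tokens array and window as arguments instead of reading them from the instance, and it uses max/min and reject in place of the ternaries and map/compact above.

```ruby
# Hypothetical standalone version of the windowing logic.
def neighbors_of(tokens, index, window)
  start_pos = [index - window, 0].max               # clamp at the first token
  end_pos   = [index + window, tokens.size - 1].min # clamp at the last token
  word      = tokens[index]

  # Drop every entry equal to the center word, not only the center position.
  tokens[start_pos..end_pos].reject { |neighbor| neighbor == word }
end

tokens = %w[the quick brown fox jumps]

p neighbors_of(tokens, 0, 2) # ["quick", "brown"]
p neighbors_of(tokens, 2, 2) # ["the", "quick", "fox", "jumps"]
p neighbors_of(tokens, 4, 2) # ["brown", "fox"]
```

Because the comparison is by value, a repeated word inside its own window is removed as well: neighbors_of(%w[a b a], 0, 2) yields only ["b"].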