class Tokenizer::WhitespaceTokenizer

A simple whitespace-based tokenizer with configurable punctuation detection.

Constants

FS

Default whitespace separator.

PAIR_POST

Characters that act as splittable suffixes, each with an optional matching prefix.

PAIR_PRE

Characters that act as splittable prefixes, each with an optional matching suffix.

PRE_N_POST

Characters that can act as both splittable prefixes AND suffixes.

SIMPLE_POST

Characters that act only as splittable suffixes.

SIMPLE_PRE

Characters that act only as splittable prefixes.
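
Taken together, these constants make up the default splittable character sets. A quick sketch of how the initializer below assembles its defaults from them (assuming the gem is required as tokenizer):

require 'tokenizer'

klass = Tokenizer::WhitespaceTokenizer
# The default :pre and :post options are the unions of the simple and
# paired sets; :pre_n_post is PRE_N_POST unchanged.
default_pre  = klass::SIMPLE_PRE + klass::PAIR_PRE
default_post = klass::SIMPLE_POST + klass::PAIR_POST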

Public Class Methods

new(lang = :de, options = {})

@param [Symbol] lang Language identifier.
@param [Hash] options Additional options.
@option options [Array] :pre Array of splittable prefix characters.
@option options [Array] :post Array of splittable suffix characters.
@option options [Array] :pre_n_post Array of characters with suffix AND prefix functions.
# File lib/tokenizer/tokenizer.rb, line 34
def initialize(lang = :de, options = {})
  @lang = lang
  @options = {
    pre: SIMPLE_PRE + PAIR_PRE,
    post: SIMPLE_POST + PAIR_POST,
    pre_n_post: PRE_N_POST
  }.merge(options)
end
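
A usage sketch: because the options hash is merged over the defaults, passing :post replaces the entire default suffix set, so rebuild it from the constants if you only want to add a character (the '|' below is purely illustrative):

require 'tokenizer'

klass = Tokenizer::WhitespaceTokenizer

# Default German tokenizer.
de_tokenizer = klass.new

# English tokenizer that also splits off '|' as a suffix; the default
# suffix characters are kept by rebuilding the set from the constants.
en_tokenizer = klass.new(:en, post: klass::SIMPLE_POST + klass::PAIR_POST + ['|'])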

Public Instance Methods

process(str)
Alias for: tokenize
tokenize(str)

@param [String] str String to be tokenized.
@return [Array<String>] Array of tokens.

# File lib/tokenizer/tokenizer.rb, line 45
def tokenize(str)
  tokens = sanitize_input(str).split(FS)
  return [''] if tokens.empty?

  # Every character from these sets may be split off a token's edges.
  splittables = SIMPLE_PRE + SIMPLE_POST + PAIR_PRE + PAIR_POST + PRE_N_POST
  # Matches the first run of characters that are NOT splittable (the stem).
  pattern = Regexp.new("[^#{Regexp.escape(splittables.join)}]+")
  output = []
  tokens.each do |token|
    # Partition each whitespace-delimited token around the stem.
    prefix, stem, suffix = token.partition(pattern)
    # Splittable characters around the stem become one token each.
    output << prefix.split('') unless prefix.empty?
    output << stem unless stem.empty?
    output << suffix.split('') unless suffix.empty?
  end

  output.flatten
end
Also aliased as: process
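
A minimal usage sketch; the exact output assumes '(', ')' and '!' are in the default splittable sets:

require 'tokenizer'

t = Tokenizer::WhitespaceTokenizer.new(:en)
t.tokenize('A (small) test!')
# => ["A", "(", "small", ")", "test", "!"]

# process is an alias and behaves identically.
t.process('A (small) test!')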

Private Instance Methods

sanitize_input(str)

@param [String] str User-defined string to be tokenized.
@return [String] A new string with a trailing newline and surrounding whitespace removed.

# File lib/tokenizer/tokenizer.rb, line 68
def sanitize_input(str)
  # Drop a trailing record separator, then surrounding whitespace.
  str.chomp.strip
end
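
For illustration, only the outer edges of the input are trimmed; interior whitespace is left intact for tokenize to split on:

"  Ein Satz. \n".chomp.strip # => "Ein Satz."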