class Tokenizer::WhitespaceTokenizer
A simple whitespace-based tokenizer with configurable punctuation detection.
Constants
- FS
Default whitespace separator.
- PAIR_POST
Splittable suffix characters, each with an optional matching prefix character (see PAIR_PRE).
- PAIR_PRE
Splittable prefix characters, each with an optional matching suffix character (see PAIR_POST).
- PRE_N_POST
Characters that can act as both prefixes AND suffixes.
- SIMPLE_POST
Characters that act only as splittable suffixes.
- SIMPLE_PRE
Characters that act only as splittable prefixes.
Public Class Methods
new(lang = :de, options = {})

@param [Symbol] lang Language identifier.
@param [Hash] options Additional options.
@option options [Array] :pre Array of splittable prefix characters.
@option options [Array] :post Array of splittable suffix characters.
@option options [Array] :pre_n_post Array of characters with both suffix AND prefix functions.
  # File lib/tokenizer/tokenizer.rb, line 34
  def initialize(lang = :de, options = {})
    @lang = lang
    @options = {
      pre: SIMPLE_PRE + PAIR_PRE,
      post: SIMPLE_POST + PAIR_POST,
      pre_n_post: PRE_N_POST
    }.merge(options)
  end
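For example, a tokenizer can be constructed with the defaults or with custom character sets (a minimal usage sketch; the custom :post value below is illustrative, not a library default):

  require 'tokenizer'

  # Default German tokenizer with the full prefix/suffix sets.
  de_tokenizer = Tokenizer::WhitespaceTokenizer.new

  # Override the suffix set: only '!' and '?' are split off as
  # suffixes (illustrative values).
  custom = Tokenizer::WhitespaceTokenizer.new(:de, post: ['!', '?'])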
Public Instance Methods
tokenize(str)

@param [String] str String to be tokenized.
@return [Array<String>] Array of tokens.
  # File lib/tokenizer/tokenizer.rb, line 45
  def tokenize(str)
    tokens = sanitize_input(str).split(FS)
    return [''] if tokens.empty?

    splittables = SIMPLE_PRE + SIMPLE_POST + PAIR_PRE + PAIR_POST + PRE_N_POST
    # Matches the first run of characters that are NOT splittable,
    # i.e. the core of a token.
    pattern = Regexp.new("[^#{Regexp.escape(splittables.join)}]+")
    output = []
    tokens.each do |token|
      # Partition each token into leading splittables, core, and the
      # remainder; the leading and trailing parts are emitted one
      # character per token.
      prefix, stem, suffix = token.partition(pattern)
      output << prefix.split('') unless prefix.empty?
      output << stem unless stem.empty?
      output << suffix.split('') unless suffix.empty?
    end
    output.flatten
  end
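A quick usage sketch (assuming '!' is among the default splittable suffixes; the exact output depends on the configured character sets):

  require 'tokenizer'

  de_tokenizer = Tokenizer::WhitespaceTokenizer.new(:de)
  de_tokenizer.tokenize('Ich gehe in die Schule!')
  # => ["Ich", "gehe", "in", "die", "Schule", "!"]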
Private Instance Methods
sanitize_input(str)

@param [String] str User-supplied string to be tokenized.
@return [String] A new string with the trailing newline and surrounding whitespace removed.
  # File lib/tokenizer/tokenizer.rb, line 68
  def sanitize_input(str)
    str.chomp.strip
  end
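For illustration (hypothetical input): chomp drops a trailing newline, and strip removes the surrounding whitespace.

  sanitize_input("  Guten Tag!\n")  # => "Guten Tag!"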