class Greeb::Segmentator

Performs simple sentence detection on top of Greeb's tokenization.

Constants

SENTENCE_AINT_START

A sentence cannot start with a separator character, a line break character, a punctuation character, or a space.

Attributes

tokens[R]

Public Class Methods

new(tokens)

Create a new instance of {Greeb::Segmentator}.

@param tokens [Array<Greeb::Span>] tokens from [Greeb::Tokenizer].

# File lib/greeb/segmentator.rb, line 18
def initialize(tokens)
  @tokens = tokens
end

Public Instance Methods

extract(sentences, collection = tokens)

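Group tokens by the sentences that contain them.

@param sentences [Array<Greeb::Span>] a list of sentences.

@param collection [Array<Greeb::Span>] the tokens to select from; defaults to the instance's tokens.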

@return [Array<Greeb::Span, Array<Greeb::Span>>] an array of pairs, each holding a sentence and the array of tokens that lie within it.

# File lib/greeb/segmentator.rb, line 45
def extract(sentences, collection = tokens)
  sentences.map do |s|
    [s, collection.select { |t| t.from >= s.from and t.to <= s.to }]
  end
end
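
The selection rule used by extract can be tried in isolation. The sketch below is only an illustration: `Span` is a stand-in struct assumed to mirror the `from`, `to`, and `type` fields of Greeb::Span, and the token offsets are hypothetical values for the text "Hi. Bye." rather than real tokenizer output.

```ruby
# Stand-in for Greeb::Span (assumption: it exposes from, to, and type).
Span = Struct.new(:from, :to, :type)

# Hypothetical tokens for the text "Hi. Bye." with character offsets.
tokens = [
  Span.new(0, 2, :letter), # "Hi"
  Span.new(2, 3, :punct),  # "."
  Span.new(3, 4, :space),  # " "
  Span.new(4, 7, :letter), # "Bye"
  Span.new(7, 8, :punct)   # "."
]

sentences = [Span.new(0, 3, :sentence), Span.new(4, 8, :sentence)]

# Same rule as extract: keep the tokens lying fully inside each sentence.
pairs = sentences.map do |s|
  [s, tokens.select { |t| t.from >= s.from && t.to <= s.to }]
end

# The result is an array of pairs; Hash[] converts it when a lookup is handier.
by_sentence = Hash[pairs]
```

The space token between the two sentences belongs to neither pair, because it lies outside both sentence ranges.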

sentences()

Return the sentences, detecting them on the first call and memoizing the result.

@return [Array<Greeb::Span>] a set of sentences.

# File lib/greeb/segmentator.rb, line 26
def sentences
  @sentences ||= detect_spans(new_sentence, [:punct])
end

subsentences()

Return the subsentences, detecting them on the first call and memoizing the result.

@return [Array<Greeb::Span>] a set of subsentences.

# File lib/greeb/segmentator.rb, line 34
def subsentences
  @subsentences ||= detect_spans(new_subsentence, [:punct, :spunct])
end

Protected Instance Methods

detect_spans(sample, stop_marks, collection = [])

Implementation of the span detection method.

@param sample [Greeb::Span] a sample span that is duplicated for every new span.

@param stop_marks [Array<Symbol>] the token types that act as stop marks and close a span.

@param collection [Array<Greeb::Span>] an initial set of spans to be populated.

@return [Array<Greeb::Span>] a modified collection.

# File lib/greeb/segmentator.rb, line 63
def detect_spans(sample, stop_marks, collection = [])
  rest = tokens.inject(sample.dup) do |span, token|
    next span if sentence_aint_start? span, token
    span.from = token.from unless span.from
    next span if span.to and span.to > token.to

    if stop_marks.include? token.type
      span.to = find_forward(tokens, token).to
      collection << span
      span = sample.dup
    elsif ![:separ, :space].include? token.type
      span.to = token.to
    end

    span
  end

  rest.from && rest.to ? collection << rest : collection
end
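
The following is a minimal, self-contained sketch of how the stop marks change the result, not the library's own code path. It assumes a stand-in `Span` struct, a guess at the contents of SENTENCE_AINT_START based on the constant's description, and hypothetical tokens for "Hi, you. Bye" in which the comma is typed `:spunct`. The fold is copied from the listing above, with the tokens passed explicitly.

```ruby
# Stand-in for Greeb::Span (assumption: it exposes from, to, and type).
Span = Struct.new(:from, :to, :type)

# Assumed contents, inferred from the constant's description.
SENTENCE_AINT_START = [:separ, :break, :punct, :spunct, :space]

# Last token in the run of same-typed tokens starting at sample.
def find_forward(tokens, sample)
  tokens.select { |t| t.from >= sample.from }.
    inject(sample) { |r, t| t.type == sample.type ? t : (break r) }
end

def detect_spans(tokens, sample, stop_marks)
  collection = []
  rest = tokens.inject(sample.dup) do |span, token|
    # Skip leading separators, breaks, punctuation, and spaces.
    next span if !span.from && SENTENCE_AINT_START.include?(token.type)
    span.from = token.from unless span.from
    next span if span.to && span.to > token.to

    if stop_marks.include?(token.type)
      # Close the span at the end of the stop-mark run.
      span.to = find_forward(tokens, token).to
      collection << span
      span = sample.dup
    elsif ![:separ, :space].include?(token.type)
      span.to = token.to
    end

    span
  end

  # Keep an unterminated trailing span if it has both bounds.
  rest.from && rest.to ? collection << rest : collection
end

# Hypothetical tokens for "Hi, you. Bye".
tokens = [
  Span.new(0, 2, :letter),  # "Hi"
  Span.new(2, 3, :spunct),  # ","
  Span.new(3, 4, :space),
  Span.new(4, 7, :letter),  # "you"
  Span.new(7, 8, :punct),   # "."
  Span.new(8, 9, :space),
  Span.new(9, 12, :letter)  # "Bye"
]

sentences    = detect_spans(tokens, Span.new(nil, nil, :sentence), [:punct])
subsentences = detect_spans(tokens, Span.new(nil, nil, :subsentence), [:punct, :spunct])

sentences.map { |s| [s.from, s.to] }    # => [[0, 8], [9, 12]]
subsentences.map { |s| [s.from, s.to] } # => [[0, 3], [4, 8], [9, 12]]
```

With `:spunct` added to the stop marks, the comma closes a span of its own, which is why the first sentence splits into two subsentences.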

Private Instance Methods

find_forward(collection, sample)

Find the last token in the run of consecutive tokens that starts at the sample and shares the sample's type.

@param collection [Array<Greeb::Span>] array of possible tokens.

@param sample [Greeb::Span] a token that is treated as a sample.

@return [Greeb::Span] a forwarding token.

# File lib/greeb/segmentator.rb, line 103
def find_forward(collection, sample)
  collection.select { |t| t.from >= sample.from }.
    inject(sample) { |r, t| t.type == sample.type ? t : (break r) }
end
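
To make the behaviour concrete, here is a small sketch that runs the method body above on hypothetical data. `Span` is a stand-in struct assumed to mirror Greeb::Span, and the tokens model an ellipsis as three consecutive `:punct` tokens, which may differ from what the real tokenizer emits.

```ruby
# Stand-in for Greeb::Span (assumption: it exposes from, to, and type).
Span = Struct.new(:from, :to, :type)

def find_forward(collection, sample)
  collection.select { |t| t.from >= sample.from }.
    inject(sample) { |r, t| t.type == sample.type ? t : (break r) }
end

# Hypothetical tokens for "Wait... Go".
tokens = [
  Span.new(0, 4, :letter),  # "Wait"
  Span.new(4, 5, :punct),   # "."
  Span.new(5, 6, :punct),   # "."
  Span.new(6, 7, :punct),   # "."
  Span.new(7, 8, :space),
  Span.new(8, 10, :letter)  # "Go"
]

# Starting from the first dot, the scan stops at the space and
# returns the third dot, so a span closed here ends at offset 7.
last = find_forward(tokens, tokens[1])
last.to # => 7
```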

new_sentence()

Create a new instance of {Greeb::Span} with `:sentence` type.

@return [Greeb::Span] a new span instance.

# File lib/greeb/segmentator.rb, line 112
def new_sentence
  Greeb::Span.new(nil, nil, :sentence)
end

new_subsentence()

Create a new instance of {Greeb::Span} with `:subsentence` type.

@return [Greeb::Span] a new span instance.

# File lib/greeb/segmentator.rb, line 120
def new_subsentence
  Greeb::Span.new(nil, nil, :subsentence)
end

sentence_aint_start?(span, token)

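Check whether the given token is unable to start a new sentence: the span has no start yet and the token's type is listed in SENTENCE_AINT_START.

@param span [Greeb::Span] a span to be checked.

@param token [Greeb::Span] a token to be checked.

@return [Boolean] true if the token cannot start the span, false otherwise.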

# File lib/greeb/segmentator.rb, line 92
def sentence_aint_start?(span, token)
  !span.from and SENTENCE_AINT_START.include? token.type
end
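
The predicate is easiest to read through a few cases. This sketch assumes a stand-in `Span` struct and a guessed value for SENTENCE_AINT_START based on the constant's description; neither is taken from the library itself.

```ruby
# Stand-in for Greeb::Span (assumption: it exposes from, to, and type).
Span = Struct.new(:from, :to, :type)

# Assumed contents, inferred from the constant's description.
SENTENCE_AINT_START = [:separ, :break, :punct, :spunct, :space]

def sentence_aint_start?(span, token)
  !span.from && SENTENCE_AINT_START.include?(token.type)
end

empty   = Span.new(nil, nil, :sentence) # span with no start yet
started = Span.new(0, 3, :sentence)     # span that has already begun

space  = Span.new(3, 4, :space)
letter = Span.new(4, 7, :letter)

sentence_aint_start?(empty, space)   # => true, a space cannot open a sentence
sentence_aint_start?(empty, letter)  # => false, a word can open a sentence
sentence_aint_start?(started, space) # => false, the span has already started
```

Note that a `from` of 0 still counts as a started span, since 0 is truthy in Ruby.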