module Emoninja::Stemmable

There are numerous strategies and algorithms for stemming. A widely used algorithm for English is the Porter stemming algorithm, written by Martin Porter in 1980. The Porter stemmer follows a strategy of suffix stripping: it applies a set of rules that strip suffixes from a word.

For example, a word that ends with ‘-ed’ might have the ‘-ed’ removed. The Porter stemmer applies its rules in a fixed sequence of steps.
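The suffix-stripping idea can be sketched in a few lines of Ruby. This is a deliberately naive illustration, not the code of this module: the `NAIVE_RULES` table and the `naive_strip` method are hypothetical names, and the rules carry none of the extra conditions that the Porter steps impose.

```ruby
# Hypothetical, simplified rule table: each pattern maps to its replacement.
# The real Porter stemmer attaches conditions (such as "the stem must
# contain a vowel") to most of its rules; this sketch has none.
NAIVE_RULES = [
  [/sses\z/, 'ss'],
  [/ies\z/, 'i'],
  [/ed\z/, ''],
  [/ing\z/, '']
].freeze

# Apply the first rule whose pattern matches the end of the word.
def naive_strip(word)
  NAIVE_RULES.each do |pattern, replacement|
    return word.sub(pattern, replacement) if word.match?(pattern)
  end
  word
end

naive_strip('jumped')   # => "jump"
naive_strip('caresses') # => "caress"
naive_strip('cat')      # => "cat"
```

Because the rules are unconditional, a short word such as "bed" would be reduced to "b". Avoiding that kind of over-stripping is what the conditions in the real steps are for.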

See tartarus.org/~martin/PorterStemmer/ruby.txt

Constants

C
CC
MEQ1
MGR0
MGR1
STEP_2_LIST
STEP_3_LIST
SUFFIX_1_REGEXP
SUFFIX_2_REGEXP
V
VOWEL_IN_STEM
VV
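This page lists the constants without their values. The sketch below shows one plausible set of definitions, assuming they follow the reference Ruby port of the Porter stemmer. The wrapper module `PorterPatternsSketch` is a hypothetical name used only to keep the example self-contained, and `STEP_2_LIST`, `STEP_3_LIST`, `SUFFIX_1_REGEXP` and `SUFFIX_2_REGEXP` are omitted.

```ruby
# Assumed definitions, modeled on the reference Ruby port of the Porter
# stemmer. The actual values in Emoninja::Stemmable may differ.
module PorterPatternsSketch
  C  = '[^aeiou]'            # a consonant
  V  = '[aeiouy]'            # a vowel
  CC = "#{C}(?>[^aeiouy]*)"  # a consonant sequence
  VV = "#{V}(?>[aeiou]*)"    # a vowel sequence

  # The "measure" m counts vowel-consonant sequences in a stem.
  MGR0 = /^(#{CC})?#{VV}#{CC}/            # m > 0
  MEQ1 = /^(#{CC})?#{VV}#{CC}(#{VV})?$/   # m == 1
  MGR1 = /^(#{CC})?#{VV}#{CC}#{VV}#{CC}/  # m > 1
  VOWEL_IN_STEM = /^(#{CC})?#{V}/         # stem contains a vowel
end

PorterPatternsSketch::MGR0.match?('tree')     # => false (m = 0)
PorterPatternsSketch::MEQ1.match?('trouble')  # => true  (m = 1)
PorterPatternsSketch::MGR1.match?('troubles') # => true  (m = 2)
```

Under these assumptions, a condition such as `stem =~ MGR0` in the method below reads as "the stem has a measure greater than zero".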

Public Instance Methods

stem()

Makes stem_porter the default stem method, in case multiple stemmers become available later.

Alias for: stem_porter
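The aliasing pattern can be sketched as follows. `DemoStemmable` and `DemoWord` are hypothetical stand-ins, and the body of `stem_porter` here is placeholder logic rather than the real algorithm. Only the use of `alias` is the point.

```ruby
# Hypothetical stand-in module, not the real Emoninja::Stemmable.
module DemoStemmable
  def stem_porter
    to_str.sub(/ing\z/, '') # placeholder logic for illustration only
  end

  # Expose the Porter implementation under the generic name `stem`.
  # A different stemmer could later be swapped in by re-pointing this alias.
  alias stem stem_porter
end

# A String subclass that mixes the module in.
class DemoWord < String
  include DemoStemmable
end

DemoWord.new('walking').stem # => "walk"
```

Callers that use `stem` stay unchanged if the default stemmer is replaced, while `stem_porter` remains available by its explicit name.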
stem_porter()

# File lib/emoninja/porter_stemmer.rb, line 97
def stem_porter
  # make a copy of the given object and convert it to a string.
  w = dup.to_str
  return w if w.length < 3

  # now map initial y to Y so that the patterns never treat it as vowel
  w[0] = 'Y' if w[0] == 'y'

  # Step 1a
  case w
  when /(ss|i)es$/ then w = $` + $1
  when /([^s])s$/  then w = $` + $1
  end

  # Step 1b
  if w =~ /eed$/
    w.chop! if $` =~ MGR0
  elsif w =~ /(ed|ing)$/
    stem = $`
    if stem =~ VOWEL_IN_STEM
      w = stem
      case w
      when /(at|bl|iz)$/             then w << 'e'
      when /([^aeiouylsz])\1$/       then w.chop!
      when /^#{CC}#{V}[^aeiouwxy]$/o then w << 'e'
      end
    end
  end

  # Step 1c: turn a final 'y' into 'i' when the stem contains a vowel
  if w =~ /y$/
    stem = $`
    w = stem + 'i' if stem =~ VOWEL_IN_STEM
  end

  # Step 2
  if w =~ SUFFIX_1_REGEXP
    stem = $`
    suffix = $1
    # print "stem= " + stem + "\n" + "suffix=" + suffix + "\n"
    w = stem + STEP_2_LIST[suffix] if stem =~ MGR0
  end

  # Step 3
  if w =~ /(icate|ative|alize|iciti|ical|ful|ness)$/
    stem = $`
    suffix = $1
    w = stem + STEP_3_LIST[suffix] if stem =~ MGR0
  end

  # Step 4
  if w =~ SUFFIX_2_REGEXP
    stem = $`
    w = stem if stem =~ MGR1
  elsif w =~ /(s|t)(ion)$/
    stem = $` + $1
    w = stem if stem =~ MGR1
  end

  # Step 5a: drop a final 'e' when the measure of the stem allows it
  if w =~ /e$/
    stem = $`
    w = stem if (stem =~ MGR1) || (stem =~ MEQ1 && stem !~ /^#{CC}#{V}[^aeiouwxy]$/o)
  end

  # Step 5b: reduce a final 'll' to 'l' when the measure is greater than 1
  w.chop! if w =~ /ll$/ && w =~ MGR1

  # and turn initial Y back to y
  w[0] = 'y' if w[0] == 'Y'

  w
end
Also aliased as: stem