Package org.apache.lucene.analysis.de
Class GermanStemmer
java.lang.Object
org.apache.lucene.analysis.de.GermanStemmer
A stemmer for German words.
The algorithm is based on the report "A Fast and Simple Stemming Algorithm for German Words" by Jörg Caumanns (joerg.caumanns at isst.fhg.de).
Field Summary
Fields
Modifier and Type | Field | Description
private static final Locale | locale |
private StringBuilder | sb | Buffer for the terms while stemming them.
private int | substCount | Number of characters removed by substitute() while stemming.
Constructor Summary
Constructors
GermanStemmer()
Method Summary
Modifier and Type | Method | Description
private boolean | isStemmable(String term) | Checks if a term could be stemmed.
private void | optimize(StringBuilder buffer) | Does some optimizations on the term.
private void | removeParticleDenotion(StringBuilder buffer) | Removes a particle denotation ("ge") from a term.
private void | resubstitute(StringBuilder buffer) | Undoes the changes made by substitute().
protected String | stem(String term) | Stems the given term to a unique discriminator.
private void | strip(StringBuilder buffer) | Suffix stripping (stemming) on the current term.
private void | substitute(StringBuilder buffer) | Does some substitutions on the term to reduce overstemming.
Field Details
sb
private StringBuilder sb
Buffer for the terms while stemming them.
substCount
private int substCount
Number of characters removed by substitute() while stemming.
locale
private static final Locale locale
Constructor Details
GermanStemmer
public GermanStemmer()
Method Details
stem
protected String stem(String term)
Stems the given term to a unique discriminator.
Parameters:
term - The term that should be stemmed.
Returns:
Discriminator for term
isStemmable
private boolean isStemmable(String term)
Checks if a term could be stemmed.
Returns:
true if, and only if, the given term consists of letters.
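The letters-only check described above can be sketched as follows. This is a minimal illustration, not the Lucene source: the class name StemmableCheck is hypothetical, and the use of Character.isLetter is an assumption about how "consists of letters" is decided.

```java
// Hypothetical helper class for illustration; not part of Lucene.
final class StemmableCheck {

    /** Returns true only if every character of the term is a letter. */
    static boolean isStemmable(String term) {
        for (int i = 0; i < term.length(); i++) {
            // Assumption: "letter" means Character.isLetter, so umlauts count as letters.
            if (!Character.isLetter(term.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isStemmable("Haus"));      // letters only
        System.out.println(isStemmable("A4"));        // contains a digit
        System.out.println(isStemmable("Haus-Tier")); // contains a hyphen
    }
}
```

A term that fails this check would presumably be returned unstemmed; the documentation does not say so explicitly.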
strip
private void strip(StringBuilder buffer)
Suffix stripping (stemming) on the current term. The stripping is reduced to the seven "base" suffixes "e", "s", "n", "t", "em", "er" and "nd", from which all regular suffixes are built. The simplification causes some overstemming and many more irregular stems, but still provides unique discriminators in most of those cases. The algorithm is context free, except for the length restrictions.
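To make the idea of repeated base-suffix stripping concrete, here is a minimal sketch. The class name SuffixStripSketch and both length thresholds are assumptions chosen for illustration; the documentation mentions length restrictions but does not state their values, and the real method works on the shared buffer after substitute() has run.

```java
// Hypothetical sketch; thresholds are illustrative assumptions, not Lucene's values.
final class SuffixStripSketch {
    private static final int MIN_STEM = 3;    // assumed: stop once the term is this short
    private static final int MIN_FOR_TWO = 5; // assumed: minimum length before removing a two-letter suffix

    /** Repeatedly removes the seven base suffixes from a lowercase term. */
    static String strip(String term) {
        StringBuilder buffer = new StringBuilder(term);
        boolean changed = true;
        while (changed && buffer.length() > MIN_STEM) {
            changed = false;
            int len = buffer.length();
            String lastTwo = buffer.substring(len - 2);
            char last = buffer.charAt(len - 1);
            if (len >= MIN_FOR_TWO
                    && (lastTwo.equals("em") || lastTwo.equals("er") || lastTwo.equals("nd"))) {
                buffer.setLength(len - 2); // drop a two-letter base suffix
                changed = true;
            } else if (last == 'e' || last == 's' || last == 'n' || last == 't') {
                buffer.setLength(len - 1); // drop a one-letter base suffix
                changed = true;
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("kindern")); // kindern -> kinder -> kind
        System.out.println(strip("laufend")); // laufend -> laufe -> lauf
    }
}
```

Because the loop looks only at the ending and the current length, it can strip too much (for example, a stem that itself ends in "s"), which is the overstemming the description refers to.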
optimize
private void optimize(StringBuilder buffer)
Does some optimizations on the term. These optimizations are contextual.
removeParticleDenotion
private void removeParticleDenotion(StringBuilder buffer)
Removes a particle denotation ("ge") from a term.
substitute
private void substitute(StringBuilder buffer)
Do some substitutions for the term to reduce overstemming:
- Substitute umlauts with their corresponding vowel: äöü -> aou; "ß" is substituted by "ss"
- Substitute the second character of a pair of equal characters with an asterisk: ?? -> ?*
- Substitute some common character combinations with a token: sch/ch/ei/ie/ig/st -> $/§/%/&/#/!
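The three substitution rules above can be sketched as three passes over a lowercase term. This is an illustration under stated assumptions, not the Lucene implementation: the class name SubstituteSketch is hypothetical, the order of the passes is assumed, and the tracking of removed characters (substCount) is left out. Non-ASCII characters are written as Unicode escapes so the example does not depend on the source file encoding.

```java
// Hypothetical sketch of the substitution rules; the pass order is an assumption.
final class SubstituteSketch {

    static String substitute(String term) {
        // Pass 1: umlauts become their base vowel, sharp s becomes "ss".
        String s = term
                .replace('\u00e4', 'a')   // a-umlaut
                .replace('\u00f6', 'o')   // o-umlaut
                .replace('\u00fc', 'u')   // u-umlaut
                .replace("\u00df", "ss"); // sharp s

        // Pass 2: mask the second character of each pair of equal characters.
        StringBuilder buffer = new StringBuilder(s);
        for (int i = 1; i < buffer.length(); i++) {
            if (buffer.charAt(i) == buffer.charAt(i - 1)) {
                buffer.setCharAt(i, '*');
            }
        }

        // Pass 3: replace common combinations with single-character tokens.
        // "sch" is handled before "ch" so the longer combination wins.
        return buffer.toString()
                .replace("sch", "$")
                .replace("ch", "\u00a7") // section sign
                .replace("ei", "%")
                .replace("ie", "&")
                .replace("ig", "#")
                .replace("st", "!");
    }

    public static void main(String[] args) {
        System.out.println(substitute("schiff"));      // $if*
        System.out.println(substitute("stra\u00dfe")); // !ras*e
    }
}
```

Masking a combination such as "st" as a single token keeps the stripping step from treating its last letter as a removable suffix, which is how the substitutions reduce overstemming.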
resubstitute
private void resubstitute(StringBuilder buffer)
Undoes the changes made by substitute(), namely the masking of character pairs and character combinations. Umlauts remain as their corresponding vowel, and "ß" remains as "ss".
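The reverse mapping can be sketched as a single pass that expands each token and each asterisk. The class name ResubstituteSketch is hypothetical and the code is an illustration of the description above, not the Lucene source. The section sign token is written as a Unicode escape to avoid encoding problems.

```java
// Hypothetical sketch of undoing the pair and combination masking.
final class ResubstituteSketch {

    static String resubstitute(String masked) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < masked.length(); i++) {
            char ch = masked.charAt(i);
            switch (ch) {
                case '*':
                    // An asterisk stands for a repeat of the previous character.
                    if (out.length() > 0) {
                        out.append(out.charAt(out.length() - 1));
                    }
                    break;
                case '$':      out.append("sch"); break;
                case '\u00a7': out.append("ch");  break; // section sign
                case '%':      out.append("ei");  break;
                case '&':      out.append("ie");  break;
                case '#':      out.append("ig");  break;
                case '!':      out.append("st");  break;
                default:       out.append(ch);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(resubstitute("$if*"));   // schiff
        System.out.println(resubstitute("!ras*e")); // strasse, not the original spelling with sharp s
    }
}
```

Nothing in this pass restores umlauts or "ß", which matches the statement that those substitutions are permanent.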