Class Analyzer

java.lang.Object
org.apache.lucene.analysis.Analyzer
All Implemented Interfaces:
Closeable, AutoCloseable
Direct Known Subclasses:
AnalyzerWrapper, CollationKeyAnalyzer, CustomAnalyzer, DutchAnalyzer, ICUCollationKeyAnalyzer, JapaneseCompletionAnalyzer, KeywordAnalyzer, KoreanAnalyzer, SimpleAnalyzer, SmartChineseAnalyzer, StopwordAnalyzerBase, UnicodeWhitespaceAnalyzer, WhitespaceAnalyzer

public abstract class Analyzer extends Object implements Closeable
An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

To define what analysis is done, subclasses must define their TokenStreamComponents in createComponents(String). The components are then reused in each call to tokenStream(String, Reader).

Simple example:

 Analyzer analyzer = new Analyzer() {
   @Override
   protected TokenStreamComponents createComponents(String fieldName) {
     // No Reader is in scope here; the Analyzer supplies the input
     // to the Tokenizer when tokenStream(...) is called.
     Tokenizer source = new FooTokenizer();
     TokenStream filter = new FooFilter(source);
     filter = new BarFilter(filter);
     return new TokenStreamComponents(source, filter);
   }
   @Override
   protected TokenStream normalize(String fieldName, TokenStream in) {
     // Assuming FooFilter is about normalization and BarFilter is about
     // stemming, only FooFilter should be applied
     return new FooFilter(in);
   }
 };
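
The placeholder classes in the example above (FooTokenizer, FooFilter, BarFilter) do not exist, so a concrete variant may help. The following is a minimal sketch, not a canonical implementation. It assumes a Lucene release (8 or later) in which StandardTokenizer and LowerCaseFilter ship in lucene-core under the packages imported below. The class name AnalyzerSketch, the helpers lowercasingAnalyzer and tokenize, and the field name "body" are hypothetical and chosen only for illustration. It shows an Analyzer subclass and the usual way to consume the TokenStream it builds: reset, incrementToken in a loop, end, then close.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalyzerSketch {

  /** Hypothetical analyzer: StandardTokenizer followed by LowerCaseFilter. */
  public static Analyzer lowercasingAnalyzer() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        // The tokenizer is created without input; the Analyzer sets the
        // reader on it each time tokenStream(...) is called.
        Tokenizer source = new StandardTokenizer();
        TokenStream result = new LowerCaseFilter(source);
        return new TokenStreamComponents(source, result);
      }

      @Override
      protected TokenStream normalize(String fieldName, TokenStream in) {
        // Lowercasing is a normalization step, so it is applied here too.
        return new LowerCaseFilter(in);
      }
    };
  }

  /** Collects the terms the analyzer produces for the given text. */
  public static List<String> tokenize(Analyzer analyzer, String fieldName, String text)
      throws IOException {
    List<String> terms = new ArrayList<>();
    // try-with-resources closes the stream so its components can be reused.
    try (TokenStream stream = analyzer.tokenStream(fieldName, text)) {
      CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
      stream.reset();                    // required before the first incrementToken()
      while (stream.incrementToken()) {
        terms.add(term.toString());
      }
      stream.end();                      // performs end-of-stream operations
    }
    return terms;
  }

  public static void main(String[] args) throws IOException {
    try (Analyzer analyzer = lowercasingAnalyzer()) {
      // prints [hello, lucene, world]
      System.out.println(tokenize(analyzer, "body", "Hello Lucene World"));
    }
  }
}
```

Because tokenStream reuses the components built by createComponents, each stream should be fully consumed and closed before the same analyzer is asked for another stream on the same thread.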
 
For more examples, see the Analysis package documentation.

For some concrete implementations bundled with Lucene, look in the analysis modules:

  • Common: Analyzers for indexing content in different languages and domains.
  • ICU: Exposes functionality from ICU to Apache Lucene.
  • Kuromoji: Morphological analyzer for Japanese text.
  • Morfologik: Dictionary-driven lemmatization for the Polish language.
  • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
  • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
  • Stempel: Algorithmic stemmer for the Polish language.
Since:
3.1