Class DefaultICUTokenizerConfig
java.lang.Object
org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
org.apache.lucene.analysis.icu.segmentation.DefaultICUTokenizerConfig
Default ICUTokenizerConfig that is generally applicable to many languages.
Generally tokenizes Unicode text according to UAX#29 (BreakIterator.getWordInstance(ULocale.ROOT)), but with the following tailorings:
- Thai, Lao, Myanmar, Khmer, and CJK text is broken into words with a dictionary.
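To make the UAX#29 baseline concrete, here is a minimal sketch of word-boundary iteration. It deliberately uses the JDK's java.text.BreakIterator as a stand-in for com.ibm.icu.text.BreakIterator, so it runs without ICU4J or Lucene on the classpath; the class name WordBreakSketch and the helper words are hypothetical, and the JDK rules are not guaranteed to match ICU's in every case. No dictionary-based tailoring is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordBreakSketch {

    /** Splits text at word boundaries and keeps only segments containing a letter or digit. */
    static List<String> words(String text) {
        // JDK stand-in for com.ibm.icu.text.BreakIterator.getWordInstance(ULocale.ROOT).
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);

        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit whitespace and punctuation; skip those segments.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world 42"));
    }
}
```

A tokenizer built on this idea walks the boundaries once and decides per segment whether to emit a token, which is roughly the role the break iterator plays for the config described here.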
Field Summary
Fields (Modifier and Type, Field, Description):
- private final boolean cjkAsWords
- private static final com.ibm.icu.text.BreakIterator cjkBreakIterator
- private static final com.ibm.icu.text.RuleBasedBreakIterator defaultBreakIterator
- private final boolean myanmarAsWords
- private static final com.ibm.icu.text.RuleBasedBreakIterator myanmarSyllableIterator
- static final String WORD_EMOJI: Token type for words that appear to be emoji sequences
- static final String WORD_HANGUL: Token type for words containing Korean hangul
- static final String WORD_HIRAGANA: Token type for words containing Japanese hiragana
- static final String WORD_IDEO: Token type for words containing ideographic characters
- static final String WORD_KATAKANA: Token type for words containing Japanese katakana
- static final String WORD_LETTER: Token type for words that contain letters
- static final String WORD_NUMBER: Token type for words that appear to be numbers
Fields inherited from class org.apache.lucene.analysis.icu.segmentation.ICUTokenizerConfig
EMOJI_SEQUENCE_STATUS
Constructor Summary
Constructors (Constructor, Description):
- DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords): Creates a new config.
Method Summary
Methods (Modifier and Type, Method, Description):
- boolean combineCJ(): true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
- com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script): Return a breakiterator capable of processing a given script.
- String getType(int script, int ruleStatus): Return a token type value for a given script and BreakIterator rule status.
- private static com.ibm.icu.text.RuleBasedBreakIterator readBreakIterator(String filename)
Field Details
WORD_IDEO
Token type for words containing ideographic characters
WORD_HIRAGANA
Token type for words containing Japanese hiragana
WORD_KATAKANA
Token type for words containing Japanese katakana
WORD_HANGUL
Token type for words containing Korean hangul
WORD_LETTER
Token type for words that contain letters
WORD_NUMBER
Token type for words that appear to be numbers
WORD_EMOJI
Token type for words that appear to be emoji sequences
cjkBreakIterator
private static final com.ibm.icu.text.BreakIterator cjkBreakIterator
defaultBreakIterator
private static final com.ibm.icu.text.RuleBasedBreakIterator defaultBreakIterator
myanmarSyllableIterator
private static final com.ibm.icu.text.RuleBasedBreakIterator myanmarSyllableIterator
cjkAsWords
private final boolean cjkAsWords
myanmarAsWords
private final boolean myanmarAsWords
Constructor Details
DefaultICUTokenizerConfig
public DefaultICUTokenizerConfig(boolean cjkAsWords, boolean myanmarAsWords)
Creates a new config. This object is lightweight, but the break iterators are initialized the first time the class is referenced.
Parameters:
cjkAsWords - true if CJK text should undergo dictionary-based segmentation; otherwise the text is segmented according to UAX#29 defaults. If this is true, all Han+Hiragana+Katakana words will be tagged as IDEOGRAPHIC.
myanmarAsWords - true if Myanmar text should undergo dictionary-based segmentation; otherwise it is tokenized as syllables.
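The "lightweight object, one-time initialization" remark describes a common Java pattern: expensive shared state lives in static fields that the JVM initializes when the class is first used, while each instance carries only its flags. The sketch below illustrates that pattern with JDK classes only. SketchConfig, prototypeLoads, and newWordIterator are hypothetical names, and handing out clones of the shared iterator is an assumption about how such a config would typically be written, not a statement about this class's source.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class LazyConfigSketch {

    /** Hypothetical config: heavy shared state is static, per-instance state is two flags. */
    static final class SketchConfig {
        // Counts how often the static initializer below has run.
        static int prototypeLoads = 0;

        // Built once, when the JVM initializes SketchConfig on first use.
        private static final BreakIterator WORD_PROTOTYPE;

        static {
            WORD_PROTOTYPE = BreakIterator.getWordInstance(Locale.ROOT);
            prototypeLoads++;
        }

        private final boolean cjkAsWords;
        private final boolean myanmarAsWords;

        SketchConfig(boolean cjkAsWords, boolean myanmarAsWords) {
            this.cjkAsWords = cjkAsWords;
            this.myanmarAsWords = myanmarAsWords;
        }

        boolean cjkAsWords() {
            return cjkAsWords;
        }

        boolean myanmarAsWords() {
            return myanmarAsWords;
        }

        /** Returns a private copy, because a BreakIterator holds mutable iteration state. */
        BreakIterator newWordIterator() {
            return (BreakIterator) WORD_PROTOTYPE.clone();
        }
    }

    public static void main(String[] args) {
        SketchConfig a = new SketchConfig(true, true);
        SketchConfig b = new SketchConfig(false, false);
        // Both instances share the prototype that was built when the class was initialized.
        System.out.println("static initializer runs: " + SketchConfig.prototypeLoads);
        System.out.println("distinct iterators: " + (a.newWordIterator() != b.newWordIterator()));
    }
}
```

Creating many such configs costs little beyond the first one, which is the trade-off the constructor description points at.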
Method Details
combineCJ
public boolean combineCJ()
Description copied from class: ICUTokenizerConfig
true if Han, Hiragana, and Katakana scripts should all be returned as Japanese
Specified by: combineCJ in class ICUTokenizerConfig
getBreakIterator
public com.ibm.icu.text.RuleBasedBreakIterator getBreakIterator(int script)
Description copied from class: ICUTokenizerConfig
Return a breakiterator capable of processing a given script.
Specified by: getBreakIterator in class ICUTokenizerConfig
getType
public String getType(int script, int ruleStatus)
Description copied from class: ICUTokenizerConfig
Return a token type value for a given script and BreakIterator rule status.
Specified by: getType in class ICUTokenizerConfig
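As an illustration of what "a token type value for a given script and rule status" can mean, the following sketch maps a (script, status) pair to a label. It is not this class's implementation. The numeric status tags are assumed to mirror the word rule-status constants of ICU's RuleBasedBreakIterator, the JDK enum Character.UnicodeScript stands in for ICU's int script codes, and the label strings are illustrative, since this page does not show the values of the WORD_* constants. Emoji handling is left out.

```java
import java.lang.Character.UnicodeScript;

public class TokenTypeSketch {

    // Assumed to mirror ICU's RuleBasedBreakIterator word status tags.
    static final int STATUS_NUMBER = 100;
    static final int STATUS_LETTER = 200;
    static final int STATUS_KANA = 300;
    static final int STATUS_IDEO = 400;

    /** Maps a script and a break-rule status to an illustrative token type label. */
    static String getType(UnicodeScript script, int ruleStatus) {
        switch (ruleStatus) {
            case STATUS_IDEO:
                return "<IDEOGRAPHIC>";
            case STATUS_KANA:
                // The status only says "kana"; the script tells hiragana from katakana.
                return script == UnicodeScript.HIRAGANA ? "<HIRAGANA>" : "<KATAKANA>";
            case STATUS_LETTER:
                // Hangul words get their own type; other letter words share one.
                return script == UnicodeScript.HANGUL ? "<HANGUL>" : "<ALPHANUM>";
            case STATUS_NUMBER:
                return "<NUM>";
            default:
                return "<OTHER>";
        }
    }

    public static void main(String[] args) {
        System.out.println(getType(UnicodeScript.HIRAGANA, STATUS_KANA));
        System.out.println(getType(UnicodeScript.LATIN, STATUS_LETTER));
    }
}
```

The point of taking both arguments is that the rule status alone cannot separate Hiragana from Katakana, or Hangul from other letters, so the script is needed as a tie-breaker.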
readBreakIterator
private static com.ibm.icu.text.RuleBasedBreakIterator readBreakIterator(String filename)