Class WordDelimiterIterator

java.lang.Object
org.apache.lucene.analysis.miscellaneous.WordDelimiterIterator

public final class WordDelimiterIterator extends Object
A BreakIterator-like API for iterating over subwords in text, according to WordDelimiterGraphFilter rules.
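For readers unfamiliar with the protocol this class imitates, the sketch below shows the JDK's own java.text.BreakIterator loop: set the text, call next() repeatedly, and stop at the DONE sentinel. It does not use WordDelimiterIterator itself, whose constructor and next() method are package-private according to this page. The class name BreakIteratorLoop and the helper words are hypothetical names chosen for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class BreakIteratorLoop {
    /** Collects the word-like segments of the input using the JDK BreakIterator protocol. */
    static List<String> words(String input) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(input);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // next() returns the following boundary, or BreakIterator.DONE when exhausted.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = input.substring(start, end);
            // Keep only segments that begin with a letter or digit (skips spaces, punctuation).
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("Power Shot"));
    }
}
```

One difference worth noting: per the Method Details on this page, WordDelimiterIterator.next() returns the index of the next subword, and the end of that subword is exposed through the separate end field, whereas BreakIterator.next() returns successive boundaries.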
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    static final int
    ALPHA
     
    static final int
    ALPHANUM
     
    private final byte[]
    charTypeTable
     
    (package private) int
    current
    Beginning of subword
    static final byte[]
    DEFAULT_WORD_DELIM_TABLE
     
    (package private) static final int
    DIGIT
     
    static final int
    DONE
    Indicates the end of iteration
    (package private) int
    end
    End of subword
    (package private) int
    endBounds
    End position of the text, excluding trailing delimiters
    private boolean
    hasFinalPossessive
     
    (package private) int
    length
     
    (package private) static final int
    LOWER
     
    private boolean
    skipPossessive
    If true, a possessive found in the last call to next() must be skipped
    (package private) final boolean
    splitOnCaseChange
    If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
    (package private) final boolean
    splitOnNumerics
    If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens).
    (package private) int
    startBounds
    Start position of the text, excluding leading delimiters
    (package private) final boolean
    stemEnglishPossessive
    If true, causes trailing "'s" to be removed for each subword.
    (package private) static final int
    SUBWORD_DELIM
     
    (package private) char[]
    text
     
    (package private) static final int
    UPPER
     
  • Constructor Summary

    Constructors
    Constructor
    Description
    WordDelimiterIterator(byte[] charTypeTable, boolean splitOnCaseChange, boolean splitOnNumerics, boolean stemEnglishPossessive)
    Create a new WordDelimiterIterator operating with the supplied rules.
  • Method Summary

    Modifier and Type
    Method
    Description
    private int
    charType(int ch)
    Determines the type of the given character
    private boolean
    endsWithPossessive(int pos)
    Determines if the text at the given position indicates an English possessive which should be removed
    static byte
    getType(int ch)
    Computes the type of the given character
    (package private) static boolean
    isAlpha(int type)
    Checks if the given word type includes ALPHA
    private boolean
    isBreak(int lastType, int type)
    Determines whether the transition from lastType to type indicates a break
    (package private) static boolean
    isDigit(int type)
    Checks if the given word type includes DIGIT
    (package private) boolean
    isSingleWord()
    Determines if the current word contains only one subword.
    (package private) static boolean
    isSubwordDelim(int type)
    Checks if the given word type includes SUBWORD_DELIM
    (package private) static boolean
    isUpper(int type)
    Checks if the given word type includes UPPER
    (package private) int
    next()
    Advance to the next subword in the string.
    private void
    setBounds()
    Set the internal word bounds (remove leading and trailing delimiters).
    (package private) void
    setText(char[] text, int length)
    Reset the text to a new value, and reset all state
    String
    toString()
     
    (package private) int
    type()
    Return the type of the current subword.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
  • Field Details

    • LOWER

      static final int LOWER
    • UPPER

      static final int UPPER
    • DIGIT

      static final int DIGIT
    • SUBWORD_DELIM

      static final int SUBWORD_DELIM
    • ALPHA

      public static final int ALPHA
    • ALPHANUM

      public static final int ALPHANUM
    • DONE

      public static final int DONE
      Indicates the end of iteration
    • DEFAULT_WORD_DELIM_TABLE

      public static final byte[] DEFAULT_WORD_DELIM_TABLE
    • text

      char[] text
    • length

      int length
    • startBounds

      int startBounds
      Start position of the text, excluding leading delimiters
    • endBounds

      int endBounds
      End position of the text, excluding trailing delimiters
    • current

      int current
      Beginning of subword
    • end

      int end
      End of subword
    • hasFinalPossessive

      private boolean hasFinalPossessive
    • splitOnCaseChange

      final boolean splitOnCaseChange
      If false, causes case changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
    • splitOnNumerics

      final boolean splitOnNumerics
      If false, causes numeric changes to be ignored (subwords will only be generated given SUBWORD_DELIM tokens). (Defaults to true)
    • stemEnglishPossessive

      final boolean stemEnglishPossessive
      If true, causes trailing "'s" to be removed for each subword. (Defaults to true)

      "O'Neil's" => "O", "Neil"

    • charTypeTable

      private final byte[] charTypeTable
    • skipPossessive

      private boolean skipPossessive
      If true, a possessive found in the last call to next() must be skipped
  • Constructor Details

    • WordDelimiterIterator

      WordDelimiterIterator(byte[] charTypeTable, boolean splitOnCaseChange, boolean splitOnNumerics, boolean stemEnglishPossessive)
      Create a new WordDelimiterIterator operating with the supplied rules.
      Parameters:
      charTypeTable - table containing character types
      splitOnCaseChange - if true, causes "PowerShot" to be split into two tokens, "Power" and "Shot" ("Power-Shot" is split into two parts regardless)
      splitOnNumerics - if true, causes "j2se" to be split into three tokens: "j", "2", "se"
      stemEnglishPossessive - if true, causes trailing "'s" to be removed for each subword: "O'Neil's" => "O", "Neil"
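The effect of the three flags can be illustrated with a small standalone splitter. This is a simplified sketch written for this page, not the Lucene implementation: the class SubwordSketch, its methods, the numeric type values, and the ASCII-oriented typeOf helper are all assumptions, and the real iterator classifies characters through charTypeTable instead.

```java
import java.util.ArrayList;
import java.util.List;

final class SubwordSketch {
    // Assumed bit-flag encoding of character types (illustrative values).
    static final int LOWER = 1, UPPER = 2, DIGIT = 4, SUBWORD_DELIM = 8;
    static final int ALPHA = LOWER | UPPER;

    static int typeOf(char c) {
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isDigit(c)) return DIGIT;
        return SUBWORD_DELIM;
    }

    /** Decides whether moving from a char of type last to one of type cur ends a subword. */
    static boolean isBreak(int last, int cur, boolean splitOnCaseChange, boolean splitOnNumerics) {
        if ((last & cur) != 0) return false; // same type: no break
        boolean lastAlpha = (last & ALPHA) != 0, curAlpha = (cur & ALPHA) != 0;
        if (lastAlpha && curAlpha) {
            // "Ab" stays together; "aB" breaks only when case splitting is enabled.
            return splitOnCaseChange && last == LOWER;
        }
        boolean letterDigit = (lastAlpha && cur == DIGIT) || (last == DIGIT && curAlpha);
        if (letterDigit) return splitOnNumerics;
        return true; // a delimiter is involved
    }

    /** True if "'s" starts at index i and is followed by the end of the word or a delimiter. */
    static boolean isPossessiveAt(String w, int i) {
        return i + 1 < w.length() && w.charAt(i) == '\''
            && (w.charAt(i + 1) == 's' || w.charAt(i + 1) == 'S')
            && (i + 2 == w.length() || typeOf(w.charAt(i + 2)) == SUBWORD_DELIM);
    }

    static List<String> split(String word, boolean splitOnCaseChange,
                              boolean splitOnNumerics, boolean stemEnglishPossessive) {
        List<String> out = new ArrayList<>();
        int i = 0, n = word.length();
        while (i < n) {
            // Skip delimiters between subwords.
            if (typeOf(word.charAt(i)) == SUBWORD_DELIM) { i++; continue; }
            int start = i;
            int last = typeOf(word.charAt(i++));
            while (i < n) {
                int cur = typeOf(word.charAt(i));
                if (isBreak(last, cur, splitOnCaseChange, splitOnNumerics)) break;
                last = cur;
                i++;
            }
            out.add(word.substring(start, i));
            // Drop a possessive "'s" that directly follows the subword.
            if (stemEnglishPossessive && isPossessiveAt(word, i)) i += 2;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(split("PowerShot", true, true, true));
        System.out.println(split("j2se", true, true, true));
        System.out.println(split("O'Neil's", true, true, true));
    }
}
```

With all three flags enabled, the sketch reproduces the examples given in the parameter descriptions above. With splitOnCaseChange or splitOnNumerics disabled, "PowerShot" and "j2se" each stay whole, while "Power-Shot" is still split at the delimiter.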
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class Object
    • next

      int next()
      Advance to the next subword in the string.
      Returns:
      index of the next subword, or DONE if all subwords have been returned
    • type

      int type()
      Return the type of the current subword. This currently uses the type of the first character in the subword.
      Returns:
      type of the current word
    • setText

      void setText(char[] text, int length)
      Reset the text to a new value, and reset all state
      Parameters:
      text - New text
      length - length of the text
    • isBreak

      private boolean isBreak(int lastType, int type)
      Determines whether the transition from lastType to type indicates a break
      Parameters:
      lastType - Last subword type
      type - Current subword type
      Returns:
      true if the transition indicates a break, false otherwise
    • isSingleWord

      boolean isSingleWord()
      Determines if the current word contains only one subword. Note that the subword may be surrounded by delimiters.
      Returns:
      true if the current word contains only one subword, false otherwise
    • setBounds

      private void setBounds()
      Set the internal word bounds (remove leading and trailing delimiters). If a possessive is found, it is not removed yet; its presence is only recorded.
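The relationship between the trimmed bounds and the single-word check can be sketched as follows. This is an illustration under stated assumptions, not the library code: BoundsSketch, isDelim, and bounds are hypothetical names, any character that is not a letter or digit is treated as a delimiter, and the possessive bookkeeping described above is left out.

```java
final class BoundsSketch {
    /** Assumed delimiter test: anything that is not a letter or digit. */
    static boolean isDelim(char c) {
        return !Character.isLetterOrDigit(c);
    }

    /** Returns {startBounds, endBounds} for the first length chars of text. */
    static int[] bounds(char[] text, int length) {
        int start = 0;
        while (start < length && isDelim(text[start])) start++;   // trim leading delimiters
        int end = length;
        while (end > start && isDelim(text[end - 1])) end--;      // trim trailing delimiters
        return new int[] {start, end};
    }

    /** A subword [current, end) is the only subword if it spans the trimmed bounds exactly. */
    static boolean isSingleWord(int current, int end, int[] bounds) {
        return current == bounds[0] && end == bounds[1];
    }

    public static void main(String[] args) {
        int[] b = bounds("--foo--".toCharArray(), 7);
        System.out.println(b[0] + ".." + b[1] + " single=" + isSingleWord(2, 5, b));
    }
}
```

For "--foo--" the trimmed bounds cover only "foo", so a subword spanning indices 2 to 5 counts as a single word even though delimiters surround it. For "foo-bar" the bounds cover the whole text, so the subword "foo" alone does not.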
    • endsWithPossessive

      private boolean endsWithPossessive(int pos)
      Determines if the text at the given position indicates an English possessive which should be removed
      Parameters:
      pos - Position in the text to check if it indicates an English possessive
      Returns:
      true if the text at the position indicates an English possessive, false otherwise
    • charType

      private int charType(int ch)
      Determines the type of the given character
      Parameters:
      ch - Character whose type is to be determined
      Returns:
      Type of the character
    • getType

      public static byte getType(int ch)
      Computes the type of the given character
      Parameters:
      ch - Character whose type is to be determined
      Returns:
      Type of the character
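One plausible way to compute a character type, and to combine it with a lookup table such as charTypeTable, is sketched below. The mapping from Unicode general categories to types, the numeric constant values, and the names CharTypeSketch, buildTable, and charType(byte[], int) are assumptions made for illustration; the actual Lucene mapping may cover more categories.

```java
final class CharTypeSketch {
    // Assumed bit-flag values for the type constants listed on this page.
    static final byte LOWER = 1, UPPER = 2, DIGIT = 4, SUBWORD_DELIM = 8;
    static final byte ALPHA = LOWER | UPPER;

    /** Maps a code point to a type based on its Unicode general category. */
    static byte getType(int ch) {
        switch (Character.getType(ch)) {
            case Character.UPPERCASE_LETTER:
                return UPPER;
            case Character.LOWERCASE_LETTER:
                return LOWER;
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return ALPHA; // letters without a simple upper/lower distinction
            case Character.DECIMAL_DIGIT_NUMBER:
            case Character.LETTER_NUMBER:
            case Character.OTHER_NUMBER:
                return DIGIT;
            default:
                return SUBWORD_DELIM;
        }
    }

    /** Precomputes types for code points below size, so they can be overridden per entry. */
    static byte[] buildTable(int size) {
        byte[] table = new byte[size];
        for (int i = 0; i < size; i++) {
            table[i] = getType(i);
        }
        return table;
    }

    /** Looks up the table first and falls back to getType for code points outside it. */
    static int charType(byte[] table, int ch) {
        return ch < table.length ? table[ch] : getType(ch);
    }

    public static void main(String[] args) {
        byte[] table = buildTable(128);
        table['-'] = ALPHA; // treat '-' as a letter instead of a delimiter
        System.out.println(charType(table, '-') + " " + charType(table, 'A'));
    }
}
```

A table of this kind is what makes the rules customizable: overriding a single entry changes how that character is classified without touching the general mapping.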
    • isAlpha

      static boolean isAlpha(int type)
      Checks if the given word type includes ALPHA
      Parameters:
      type - Word type to check
      Returns:
      true if the type contains ALPHA, false otherwise
    • isDigit

      static boolean isDigit(int type)
      Checks if the given word type includes DIGIT
      Parameters:
      type - Word type to check
      Returns:
      true if the type contains DIGIT, false otherwise
    • isSubwordDelim

      static boolean isSubwordDelim(int type)
      Checks if the given word type includes SUBWORD_DELIM
      Parameters:
      type - Word type to check
      Returns:
      true if the type contains SUBWORD_DELIM, false otherwise
    • isUpper

      static boolean isUpper(int type)
      Checks if the given word type includes UPPER
      Parameters:
      type - Word type to check
      Returns:
      true if the type contains UPPER, false otherwise
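The four "includes" checks above read naturally as bit-mask tests. The sketch below assumes that the type constants are bit flags, with ALPHA combining LOWER and UPPER and ALPHANUM combining ALPHA and DIGIT; the numeric values and the class name TypeMaskSketch are assumptions, since this page does not list the constant values.

```java
final class TypeMaskSketch {
    // Assumed bit-flag values; only the relationships between them matter here.
    static final int LOWER = 0x01, UPPER = 0x02, DIGIT = 0x04, SUBWORD_DELIM = 0x08;
    static final int ALPHA = LOWER | UPPER;
    static final int ALPHANUM = ALPHA | DIGIT;

    static boolean isAlpha(int type) {
        return (type & ALPHA) != 0;
    }

    static boolean isDigit(int type) {
        return (type & DIGIT) != 0;
    }

    static boolean isSubwordDelim(int type) {
        return (type & SUBWORD_DELIM) != 0;
    }

    static boolean isUpper(int type) {
        return (type & UPPER) != 0;
    }

    public static void main(String[] args) {
        // A combined type such as ALPHANUM includes both letters and digits.
        System.out.println(isAlpha(ALPHANUM) + " " + isDigit(ALPHANUM) + " " + isUpper(LOWER));
    }
}
```

Under this encoding a word type can describe several character classes at once, which is why the checks test for inclusion and not for equality.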