Class WordStorage

java.lang.Object
org.apache.lucene.analysis.hunspell.WordStorage

class WordStorage extends Object
A data structure for memory-efficient word storage and fast lookup/enumeration. Each dictionary entry is stored as:
  1. the last character
  2. pointer to a similar entry for the prefix (all characters except the last one)
  3. value data: a list of ints representing word flags and morphological data, and a pointer to hash collisions, if any
There's only one entry for each prefix, so it's like a trie/FST, but a reversed one: each node points to a single previous node instead of several following ones. For example, "abc" and "abd" point to the same prefix entry "ab" which points to "a" which points to 0.

The entries are stored in a contiguous byte array, identified by their offsets, using DataOutput.writeVInt(int) ()} VINT} format for compression.
  • Nested Class Summary

    Nested Classes
    Modifier and Type
    Class
    Description
    (package private) static class 
     
  • Field Summary

    Fields
    Modifier and Type
    Field
    Description
    private static final int
     
    private final int[]
    A map from word's hash (modulo array's length) into an int containing: lower OFFSET_BITS: the offset in wordData of the last entry with this hash the remaining highest bits: COLLISION+LENGTH info for that entry, i.e.
    private static final int
     
    private static final int
     
    private static final int
     
    private final byte[]
    An array of word entries: VINT: the word's last character VINT: a delta pointer to the entry for the same word without the last character.
  • Constructor Summary

    Constructors
    Modifier
    Constructor
    Description
    private
    WordStorage(int[] hashTable, byte[] wordData)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    private static boolean
    hasCollision(int mask)
     
    private boolean
    hasLength(int mask, int length)
     
    private static boolean
    hasLengthInRange(int mask, int minLength, int maxLength)
     
    private boolean
    isSameString(char[] word, int offset, int length, int dataPos, ByteArrayDataInput in)
     
    (package private) IntsRef
    lookupWord(char[] word, int offset, int length)
     
    (package private) void
    processAllWords(int minLength, int maxLength, BiConsumer<CharsRef,IntsRef> processor)
    Calls the processor for every dictionary entry with length between minLength and maxLength, both ends inclusive.
    private void
    readForms(IntsRef forms, ByteArrayDataInput in, int length)
     

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

    • OFFSET_BITS

      private static final int OFFSET_BITS
      See Also:
    • OFFSET_MASK

      private static final int OFFSET_MASK
      See Also:
    • COLLISION_MASK

      private static final int COLLISION_MASK
      See Also:
    • MAX_STORED_LENGTH

      private static final int MAX_STORED_LENGTH
      See Also:
    • hashTable

      private final int[] hashTable
      A map from word's hash (modulo array's length) into an int containing:
      • lower OFFSET_BITS: the offset in wordData of the last entry with this hash
      • the remaining highest bits: COLLISION+LENGTH info for that entry, i.e. one bit indicating whether there are other entries with the same hash, and the length of the entry in chars, or MAX_STORED_LENGTH if the length exceeds that limit (next highest bits)
    • wordData

      private final byte[] wordData
      An array of word entries:
      • VINT: the word's last character
      • VINT: a delta pointer to the entry for the same word without the last character. Precisely, it's the difference of this entry's start and the prefix's entry start. 0 for single-character entries
      • (Optional, for hash-colliding entries only)
        • BYTE: COLLISION+LENGTH info (see hashTable) for the previous entry with the same hash
        • VINT: (delta) pointer to the previous entry
      • (Optional, for non-leaf entries only) VINT+: word form data, returned from lookupWord(char[], int, int), preceded by its length
  • Constructor Details

    • WordStorage

      private WordStorage(int[] hashTable, byte[] wordData)
  • Method Details

    • lookupWord

      IntsRef lookupWord(char[] word, int offset, int length)
    • hasCollision

      private static boolean hasCollision(int mask)
    • processAllWords

      void processAllWords(int minLength, int maxLength, BiConsumer<CharsRef,IntsRef> processor)
      Calls the processor for every dictionary entry with length between minLength and maxLength, both ends inclusive. Note that the callback arguments (word and forms) are reused, so they can be modified in any way, but may not be saved for later by the processor
    • hasLength

      private boolean hasLength(int mask, int length)
    • hasLengthInRange

      private static boolean hasLengthInRange(int mask, int minLength, int maxLength)
    • isSameString

      private boolean isSameString(char[] word, int offset, int length, int dataPos, ByteArrayDataInput in)
    • readForms

      private void readForms(IntsRef forms, ByteArrayDataInput in, int length)