Class WordDictionary

java.lang.Object
org.apache.lucene.analysis.cn.smart.hhmm.AbstractDictionary
org.apache.lucene.analysis.cn.smart.hhmm.WordDictionary

class WordDictionary extends AbstractDictionary
SmartChineseAnalyzer Word Dictionary
  • Field Details

    • singleInstance

      private static WordDictionary singleInstance
    • PRIME_INDEX_LENGTH

      public static final int PRIME_INDEX_LENGTH
      Large prime number for hash function
    • wordIndexTable

      private short[] wordIndexTable
      wordIndexTable hashes every Chinese character in Unicode into an array of length PRIME_INDEX_LENGTH. Collisions are possible, although in practice this program only handles the 6768 characters found in GB2312 plus some ASCII characters. To keep lookups precise despite collisions, the original character is retained in charIndexTable.
    • charIndexTable

      private char[] charIndexTable
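
      The way wordIndexTable and charIndexTable work together can be illustrated with a small open-addressing hash table. This is only a sketch under stated assumptions, not the Lucene implementation: the class name CharHashSketch, the table length 12071, the modulo hash, and linear probing are all chosen for illustration.

```java
import java.util.Arrays;

// Hypothetical sketch; names, hash function, and probing strategy are assumptions.
class CharHashSketch {
  static final int TABLE_LENGTH = 12071; // a prime, chosen for illustration

  // slot -> index of the character's word list, or -1 if the slot is empty
  final short[] wordIndexTable = new short[TABLE_LENGTH];
  // slot -> the character stored in that slot, used to tell colliding characters apart
  final char[] charIndexTable = new char[TABLE_LENGTH];

  CharHashSketch() {
    Arrays.fill(wordIndexTable, (short) -1);
  }

  // Registers c with the given word-list index; returns false if the table is full.
  boolean put(char c, short wordListIndex) {
    int slot = c % TABLE_LENGTH;
    for (int probes = 0; probes < TABLE_LENGTH; probes++) {
      if (wordIndexTable[slot] == -1 || charIndexTable[slot] == c) {
        charIndexTable[slot] = c;
        wordIndexTable[slot] = wordListIndex;
        return true;
      }
      slot = (slot + 1) % TABLE_LENGTH; // linear probing (an assumption)
    }
    return false;
  }

  // Returns the word-list index registered for c, or -1 if c is unknown.
  short get(char c) {
    int slot = c % TABLE_LENGTH;
    for (int probes = 0; probes < TABLE_LENGTH; probes++) {
      if (wordIndexTable[slot] == -1) return -1;
      if (charIndexTable[slot] == c) return wordIndexTable[slot];
      slot = (slot + 1) % TABLE_LENGTH;
    }
    return -1;
  }
}
```

      In this sketch, 'A' (code 65) and the character with code 12136 hash to the same initial slot. Because the character itself is stored next to the index, get can still distinguish the two.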
    • wordItem_charArrayTable

      private char[][][] wordItem_charArrayTable
      To save space, the lexicon is stored in two parallel multidimensional arrays, one for words and one for frequencies. Each word is stored as a char[], where each char represents a Chinese character or other symbol, and each frequency is stored as an int. The two arrays correspond one-to-one, so wordItem_charArrayTable[i][j] looks up a word in the lexicon and wordItem_frequencyTable[i][j] looks up the corresponding frequency.
    • wordItem_frequencyTable

      private int[][] wordItem_frequencyTable
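
      The one-to-one correspondence between the two tables can be sketched as follows. The class name, the sample words, and the frequency values are invented for illustration, and storing each word in full is an assumption made for simplicity.

```java
// Hypothetical sketch of the parallel-array layout; all data is made up.
class ParallelLexiconSketch {
  // words[i][j] is the j-th word in the list for the i-th leading character
  static final char[][][] words = {
    {
      "\u4e2d".toCharArray(),       // "zhong"
      "\u4e2d\u56fd".toCharArray(), // "zhongguo" (China)
      "\u4e2d\u6587".toCharArray()  // "zhongwen" (Chinese language)
    },
    {
      "\u6587".toCharArray(),       // "wen"
      "\u6587\u5b57".toCharArray()  // "wenzi" (script, writing)
    }
  };

  // frequencies[i][j] is the frequency of words[i][j]
  static final int[][] frequencies = {
    {500, 300, 120},
    {80, 40}
  };

  // Same (i, j) pair addresses the word in one table and its frequency in the other.
  static String wordAt(int i, int j) {
    return new String(words[i][j]);
  }

  static int frequencyAt(int i, int j) {
    return frequencies[i][j];
  }
}
```

      Keeping frequencies in a separate int[][] avoids allocating one object per lexicon entry, which is consistent with the space-saving goal described above.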
  • Constructor Details

    • WordDictionary

      private WordDictionary()
  • Method Details

    • getInstance

      public static WordDictionary getInstance()
      Get the singleton dictionary instance.
      Returns:
      singleton
    • load

      public void load(String dctFileRoot)
      Attempt to load the dictionary from the provided directory, first trying coredict.mem and falling back to coredict.dct.
      Parameters:
      dctFileRoot - path to dictionary directory
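
      The documented lookup order can be sketched as below. This only illustrates the order in which the two files are tried; the class and method names are hypothetical, and the check for readability stands in for whatever the actual loader does when coredict.mem cannot be used.

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; it selects a file but does not parse it.
class DictionaryLoadSketch {
  // Returns the file a loader following the documented order would read,
  // or null if neither file is present and readable.
  static Path chooseDictionaryFile(Path dctFileRoot) {
    Path serialized = dctFileRoot.resolve("coredict.mem");
    if (Files.isReadable(serialized)) {
      return serialized; // preferred: the serialized form
    }
    Path raw = dctFileRoot.resolve("coredict.dct");
    if (Files.isReadable(raw)) {
      return raw; // fallback: the raw dictionary file
    }
    return null;
  }
}
```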
    • load

      public void load() throws IOException, ClassNotFoundException
      Load coredict.mem internally from the jar file.
      Throws:
      IOException - If there is a low-level I/O error.
      ClassNotFoundException
    • loadFromObj

      private boolean loadFromObj(Path serialObj)
    • loadFromObjectInputStream

      private void loadFromObjectInputStream(InputStream serialObjectInputStream) throws IOException, ClassNotFoundException
      Throws:
      IOException
      ClassNotFoundException
    • saveToObj

      private void saveToObj(Path serialObj)
    • loadMainDataFromFile

      private int loadMainDataFromFile(String dctFilePath) throws IOException
      Load the datafile into this WordDictionary
      Parameters:
      dctFilePath - path to word dictionary (coredict.dct)
      Returns:
      number of words read
      Throws:
      IOException - If there is a low-level I/O error.
    • expandDelimiterData

      private void expandDelimiterData()
      The original lexicon stores all punctuation-related entries in a single table (from 1 to 3755). This method expands that data, placing each entry into the table of its corresponding symbol.
    • mergeSameWords

      private void mergeSameWords()
    • sortEachItems

      private void sortEachItems()
    • setTableIndex

      private boolean setTableIndex(char c, int j)
    • getAvaliableTableIndex

      private short getAvaliableTableIndex(char c)
    • getWordItemTableIndex

      private short getWordItemTableIndex(char c)
    • findInTable

      private int findInTable(short knownHashIndex, char[] charArray)
      Look up the word given as a char array and return its position in the word list.
      Parameters:
      knownHashIndex - the already computed position of the word's first character, charArray[0], in the hash table. If it has not been calculated yet, use int findInTable(char[] charArray) instead.
      charArray - the char array of the word to look up.
      Returns:
      position of the word in the word array, or -1 if not found.
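
      A lookup of this kind can be sketched as a binary search over one word list. The sketch assumes that each word list is kept in sorted order (which the presence of sortEachItems suggests but this page does not state) and that words are stored in full. The class and method names are hypothetical.

```java
// Hypothetical sketch; assumes the word list is sorted lexicographically.
class SortedWordLookupSketch {
  // Lexicographic comparison of two char arrays.
  static int compare(char[] a, char[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      if (a[i] != b[i]) {
        return a[i] - b[i];
      }
    }
    return a.length - b.length;
  }

  // Binary search for word in items; returns its position, or -1 if absent.
  static int find(char[][] items, char[] word) {
    int lo = 0;
    int hi = items.length - 1;
    while (lo <= hi) {
      int mid = (lo + hi) >>> 1;
      int cmp = compare(items[mid], word);
      if (cmp == 0) {
        return mid;
      }
      if (cmp < 0) {
        lo = mid + 1;
      } else {
        hi = mid - 1;
      }
    }
    return -1;
  }
}
```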
    • getPrefixMatch

      public int getPrefixMatch(char[] charArray)
      Find the first word in the dictionary that starts with the supplied prefix
      Parameters:
      charArray - input prefix
      Returns:
      index of word, or -1 if not found
    • getPrefixMatch

      public int getPrefixMatch(char[] charArray, int knownStart)
      Find the nth word in the dictionary that starts with the supplied prefix
      Parameters:
      charArray - input prefix
      knownStart - relative position in the dictionary to start
      Returns:
      index of word, or -1 if not found
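
      The two getPrefixMatch overloads can be pictured with the following sketch, which scans a single word list from a given starting position. It is a simplified illustration: the linear scan, the reading of knownStart as a starting index into the list, and all names are assumptions rather than the actual search strategy.

```java
// Hypothetical sketch; a plain linear scan over one word list.
class PrefixMatchSketch {
  // True if word begins with every character of prefix, in order.
  static boolean startsWith(char[] word, char[] prefix) {
    if (word.length < prefix.length) {
      return false;
    }
    for (int i = 0; i < prefix.length; i++) {
      if (word[i] != prefix[i]) {
        return false;
      }
    }
    return true;
  }

  // Index of the first word at or after position from that starts with prefix, or -1.
  static int prefixMatch(char[][] items, char[] prefix, int from) {
    for (int i = Math.max(from, 0); i < items.length; i++) {
      if (startsWith(items[i], prefix)) {
        return i;
      }
    }
    return -1;
  }

  // Convenience form that searches from the beginning of the list.
  static int prefixMatch(char[][] items, char[] prefix) {
    return prefixMatch(items, prefix, 0);
  }
}
```

      Calling the three-argument form again with the previous result plus one would step through successive matches in this sketch.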
    • getFrequency

      public int getFrequency(char[] charArray)
      Get the frequency of a word from the dictionary
      Parameters:
      charArray - input word
      Returns:
      word frequency, or zero if the word is not found
    • isEqual

      public boolean isEqual(char[] charArray, int itemIndex)
      Return true if the dictionary entry at itemIndex for table charArray[0] is charArray
      Parameters:
      charArray - input word
      itemIndex - item index for table charArray[0]
      Returns:
      true if the entry exists