Class IndicNormalizer

java.lang.Object
org.apache.lucene.analysis.in.IndicNormalizer

public class IndicNormalizer extends Object
Normalizes the Unicode representation of text in Indian languages.

Follows guidelines from Unicode 5.2, chapter 6, South Asian Scripts I and graphical decompositions from http://ldc.upenn.edu/myl/IndianScriptsUnicode.html

  • Field Details

    • scripts

    • decompositions

      private static final int[][] decompositions
      Decompositions according to Unicode 5.2 and http://ldc.upenn.edu/myl/IndianScriptsUnicode.html

      Most of these are not handled by Unicode normalization anyway.

      The numbers here represent offsets into the respective codepages, with -1 representing null and 0xFF representing zero-width joiner.

      The columns are: ch1, ch2, ch3, res, flags. ch1, ch2, and ch3 are the decomposition, res is the composition, and flags are the scripts to which it applies.
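
      The row layout described above can be sketched as follows. This is an illustration only, not the class's actual table: the flag constants, the ROW contents, and the helper names decomposed, composed, and appliesTo are all hypothetical, and the real class derives its flags from Character.UnicodeBlock. The sketch assumes each offset is added to the first code point of the script's block (for example 0x0900 for Devanagari).

```java
final class DecompositionRowSketch {
    // Hypothetical flag bits, one per script (assumption for this sketch).
    static final int DEVANAGARI = 1;
    static final int GUJARATI = 2;

    static final int NONE = -1;   // "null" column: no character in this slot
    static final int ZWJ = 0xFF;  // marker for U+200D ZERO WIDTH JOINER

    // Illustrative row in the order ch1, ch2, ch3, res, flags:
    // letter A (offset 0x05) + vowel sign AA (0x3E) composes to letter AA (0x06).
    static final int[] ROW = { 0x05, 0x3E, NONE, 0x06, DEVANAGARI | GUJARATI };

    /** Expands ch1..ch3 into the decomposed sequence for a script starting at base. */
    static String decomposed(int[] row, int base) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 3; i++) {
            int offset = row[i];
            if (offset == NONE) {
                continue; // empty slot
            }
            sb.append(offset == ZWJ ? '\u200D' : (char) (base + offset));
        }
        return sb.toString();
    }

    /** Returns the composed character (the res column) for a script starting at base. */
    static char composed(int[] row, int base) {
        return (char) (base + row[3]);
    }

    /** True if the row's flags column includes the given script bit. */
    static boolean appliesTo(int[] row, int scriptFlag) {
        return (row[4] & scriptFlag) != 0;
    }
}
```

      Because the columns hold offsets rather than full code points, the same row serves every script named in its flags: with base 0x0900 the sketch yields Devanagari characters, and with base 0x0A80 it yields the Gujarati counterparts.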

  • Constructor Details

    • IndicNormalizer

      public IndicNormalizer()
  • Method Details

    • flag

      private static int flag(Character.UnicodeBlock ub)
    • normalize

      public int normalize(char[] text, int len)
      Normalizes input text, and returns the new length. The length will always be less than or equal to the existing length.
      Parameters:
      text - input text
      len - valid length
      Returns:
      normalized length
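
      The in-place contract of normalize (the buffer is modified, and only the returned number of leading characters remains valid) can be illustrated with a toy stand-in. The class below is not Lucene's implementation: InPlaceNormalizeSketch and normalizeToString are hypothetical names, and the single rule it applies (DEVANAGARI LETTER A followed by VOWEL SIGN AA becomes LETTER AA) is chosen only to show how the length can shrink.

```java
final class InPlaceNormalizeSketch {
    /**
     * Toy stand-in with the same shape as normalize(char[] text, int len):
     * rewrites U+0905 U+093E to U+0906 in place and returns the new valid length.
     */
    static int normalize(char[] text, int len) {
        for (int i = 0; i < len - 1; i++) {
            if (text[i] == '\u0905' && text[i + 1] == '\u093E') {
                text[i] = '\u0906';
                // Shift the remaining characters left by one to close the gap.
                System.arraycopy(text, i + 2, text, i + 1, len - (i + 2));
                len--;
            }
        }
        return len;
    }

    /** Convenience wrapper: copies the input, normalizes, and keeps only the valid prefix. */
    static String normalizeToString(String input) {
        char[] buffer = input.toCharArray();
        int newLen = normalize(buffer, buffer.length);
        // Characters at index >= newLen are stale and must be ignored.
        return new String(buffer, 0, newLen);
    }
}
```

      A caller of the real class would follow the same pattern: pass a char[] and its valid length to normalize on an IndicNormalizer instance, then read only the first returned-length characters of the buffer.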
    • compose

      private int compose(int ch0, Character.UnicodeBlock block0, IndicNormalizer.ScriptData sd, char[] text, int pos, int len)
      Composes any sequences listed in the decompositions table into their standard (composed) form.