class ZawgyiUnicodeMarkovModel

For the purposes of Unicode/Zawgyi detection, all characters are treated as the NULL state except for characters in the Myanmar script or characters in the Unicode whitespace range U+2000 through U+200B.

Constants

AFT_CP0

Standard Myanmar code point range after digits

AFT_CP1
AFT_OFFSET
BINARY_TAG

Magic number used to identify this object in byte streams. (Reads in ASCII as “UZMODEL ”)

EXA_CP0

Extended Myanmar code point range A

EXA_CP1
EXA_OFFSET
EXB_CP0

Extended Myanmar code point range B

EXB_CP1
EXB_OFFSET
NUM_STATES
SPC_CP0

Unicode space characters

SPC_CP1
SPC_OFFSET
STD_CP0

Standard Myanmar code point range before digits

STD_CP1
STD_OFFSET

Indices into Markov nodes
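
As a rough illustration of how the range and offset constants could fit together, the sketch below gives each code point range a contiguous block of state indices and keeps index 0 free for the NULL state. The range bounds are assumptions taken from the Unicode block charts (Myanmar with its digits excluded, Myanmar Extended-A, Myanmar Extended-B) and from the whitespace range mentioned above. The helper names `sketch_ranges` and `sketch_layout` are hypothetical, and the resulting numbers are not claimed to match the library's actual constants.

```ruby
# Hypothetical range bounds, assumed from the Unicode block charts rather
# than read from the library source.
def sketch_ranges
  {
    std: [0x1000, 0x103F], # Myanmar, before the digits
    aft: [0x104A, 0x109F], # Myanmar, after the digits
    exa: [0xAA60, 0xAA7F], # Myanmar Extended-A
    exb: [0xA9E0, 0xA9FF], # Myanmar Extended-B
    spc: [0x2000, 0x200B]  # Unicode space characters
  }
end

# Give each range a contiguous block of state indices. Index 0 is left
# free for the NULL state, so the first block starts at 1.
def sketch_layout(ranges)
  next_offset = 1
  offsets = {}
  ranges.each do |name, (first, last)|
    offsets[name] = next_offset
    next_offset += last - first + 1
  end
  [offsets, next_offset]
end

offsets, num_states = sketch_layout(sketch_ranges)
offsets.each { |name, offset| puts "#{name}: offset #{offset}" }
puts "total states: #{num_states}"
```

Under these assumed bounds, the value returned alongside the offsets is one past the last used index, which is the role a constant like NUM_STATES would play.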

Public Class Methods

get_index_for_code_point(cp)

Returns the index of the state in the Markov chain corresponding to the given code point. Code points in the standard Myanmar range, Myanmar Extended A, Myanmar Extended B, and Unicode Whitespace each have a unique state assigned to them. All other code points are mapped to state 0.

Exposed as a class method so that the builder can use it.

Parameter cp: the code point to convert to a state index.
Returns: the index of the state in the Markov chain, with 0 <= state < getSize().

# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 96
def self.get_index_for_code_point(cp)
  # Default to the NULL state (0) for code points outside every tracked range.
  markov_chain_index = 0
  if STD_CP0 <= cp && cp <= STD_CP1
    markov_chain_index = cp - STD_CP0 + STD_OFFSET
  elsif AFT_CP0 <= cp && cp <= AFT_CP1
    markov_chain_index = cp - AFT_CP0 + AFT_OFFSET
  elsif EXA_CP0 <= cp && cp <= EXA_CP1
    markov_chain_index = cp - EXA_CP0 + EXA_OFFSET
  elsif EXB_CP0 <= cp && cp <= EXB_CP1
    markov_chain_index = cp - EXB_CP0 + EXB_OFFSET
  elsif SPC_CP0 <= cp && cp <= SPC_CP1
    markov_chain_index = cp - SPC_CP0 + SPC_OFFSET
  end
  markov_chain_index
end
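
The same lookup can also be written as a table scan, which may make the mapping easier to experiment with outside the library. The following is a minimal sketch under stated assumptions: the range bounds and offsets in `sketch_table` are illustrative values, and `sketch_table` and `sketch_state_index` are hypothetical names that do not exist in the library.

```ruby
# Rows are [first code point, last code point, offset of the first state].
# All values here are illustrative assumptions, not the library's constants.
def sketch_table
  [
    [0x1000, 0x103F, 1],   # Myanmar before the digits
    [0x104A, 0x109F, 65],  # Myanmar after the digits
    [0xAA60, 0xAA7F, 151], # Myanmar Extended-A
    [0xA9E0, 0xA9FF, 183], # Myanmar Extended-B
    [0x2000, 0x200B, 215]  # Unicode space characters
  ]
end

# Map a code point to a state index; anything outside the table is state 0.
def sketch_state_index(cp)
  first, _last, offset = sketch_table.find { |lo, hi, _| cp.between?(lo, hi) }
  first ? cp - first + offset : 0
end

puts sketch_state_index("\u1000".ord) # first Myanmar letter
puts sketch_state_index("\u1040".ord) # Myanmar digit zero, outside both ranges
puts sketch_state_index('a'.ord)      # Latin letter, NULL state
```

With these assumed ranges, the Myanmar digits fall between the two standard ranges and therefore map to the NULL state, like any non-Myanmar character.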

new(stream)

Creates an instance from a binary data stream.

# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 66
def initialize(stream)
  # Check magic number and serial version number
  binary_tag = stream.read(8).unpack('H*')[0]
  if binary_tag != BINARY_TAG
    raise "Unexpected magic number: expected #{BINARY_TAG} but got #{binary_tag}"
  end
  
  binary_version = stream.read(4).unpack('H*')[0].to_i
  if binary_version == 1
    @ssv = 0
  elsif binary_version == 2
    # TODO: Support nonzero SSV if needed in the future
    @ssv = stream.read(4).unpack('H*')[0].to_i
    if @ssv != 0
      raise "Unsupported ssv: #{@ssv}"
    end
  else          
    raise "Unexpected serial version number: expected 1 or 2 but got #{binary_version}"
  end
  
  @classifier = BinaryMarkov.new(stream)
end
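
To see what the header check operates on, the sketch below builds a header in memory and reads it back with the same `unpack('H*')` calls as the constructor. It assumes the version number is stored as a big-endian 32-bit integer, which is what the hex-then-`to_i` parse needs in order to yield 1 or 2. The helpers `sketch_header` and `sketch_read_header` are hypothetical, and the payload that BinaryMarkov would read afterwards is omitted.

```ruby
require 'stringio'

# Build an in-memory header: the 8-byte tag "UZMODEL " followed by a
# 32-bit version number (assumed big-endian).
def sketch_header(version)
  StringIO.new('UZMODEL '.b + [version].pack('N'))
end

# Read the tag and version the same way the constructor does, via hex strings.
def sketch_read_header(stream)
  tag = stream.read(8).unpack('H*')[0]
  version = stream.read(4).unpack('H*')[0].to_i
  [tag, version]
end

tag, version = sketch_read_header(sketch_header(2))
puts tag     # hex digits of "UZMODEL "
puts version
```

Note that `to_i` reads the hex digits as a decimal number. That is harmless for versions 1 and 2, but a value such as 10 (hex `0000000a`) would parse as 0; decoding with `unpack1('N')` would avoid that if larger version numbers ever appear.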

Public Instance Methods

char_count(code_point)

Determines the number of char values (UTF-16 code units) needed to represent the specified character (Unicode code point). If the code point is equal to or greater than 0x10000, the method returns 2; otherwise, it returns 1.

# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 183
def char_count(code_point)
  code_point >= 0x10000 ? 2 : 1
end
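
The count returned here is a UTF-16 notion, while Ruby's String#length and String#codepoints work in whole code points. The sketch below contrasts the two on a short string; `sketch_char_count` is a hypothetical stand-in that repeats the rule above so the example runs on its own.

```ruby
# Same rule as char_count: code points at or above U+10000 need two
# UTF-16 code units (a surrogate pair), everything else needs one.
def sketch_char_count(code_point)
  code_point >= 0x10000 ? 2 : 1
end

text = "\u1000\u{1F600}" # MYANMAR LETTER KA followed by an emoji
text.each_codepoint do |cp|
  puts format('U+%04X needs %d UTF-16 unit(s)', cp, sketch_char_count(cp))
end

# Ruby counts characters, not UTF-16 units, so the string has length 2
# even though its UTF-16 encoding has three 16-bit units.
puts text.length
puts text.encode('UTF-16BE').bytesize / 2
```

Any loop that indexes into `codepoints` therefore moves one position per character, regardless of how many UTF-16 units that character would occupy.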

predict(input, verbose=false)

Runs the given input string through both internal Markov chains and computes the probability that the string is Zawgyi rather than Unicode.

Parameter input: the string to evaluate.
Parameter verbose: whether to print the log probabilities for debugging.
Returns: the probability that the string is Zawgyi given that it is either Unicode or Zawgyi, or -Infinity if there are no Myanmar range code points in the string.

# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 118
def predict(input, verbose=false)
  if verbose
    puts "Running detector on string: #{input}"
  end
  
  # Start at the base state
  prev_cp = 0
  prev_state = 0
  
  total_delta = 0.0
  seen_transition = false
  
  offset = 0
  while offset <= input.length
    if offset == input.length
      cp = 0
      curr_state = 0
    else
      cp = input.codepoints[offset]
      curr_state = self.class.get_index_for_code_point(cp)
    end
  
    # Ignore 0-to-0 transitions
    if prev_state != 0 || curr_state != 0
      # Gets the difference in log probabilities between chain A and chain B.
      # First param: The index of the source node to transition from.
      # Second param: The index of the destination node to transition to.
      delta = @classifier.log_probability_differences[prev_state][curr_state].to_f
  
      if verbose
        puts "#{prev_cp} -> #{cp}: delta=#{delta}"
        puts "ABS: #{delta.abs}"
        delta_index = 1
        while delta_index < delta.abs
          print "!"
          delta_index +=1
        end
        puts ""
      end
  
      total_delta += delta
      seen_transition = true
    end
  
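    # Ruby indexes strings by code point, not by UTF-16 code unit, so advance
    # one position; adding char_count(cp) would skip the character after any
    # code point at or above U+10000.
    offset += 1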
    prev_cp = cp
    prev_state = curr_state
  end
  
  if verbose
    puts "Final: delta=#{total_delta}"
  end
  
  # Special case: if there is no signal, return -Infinity,
  # which will get interpreted by users as strong Unicode.
  # This happens when the input string contains no Myanmar-range code points.
  return -Float::INFINITY unless seen_transition
  1.0 / (1.0 + Math.exp(total_delta))
end
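
The final line of predict turns the summed log-probability difference into a probability with a logistic function. The sketch below isolates that step. It assumes each delta is the Unicode chain's log probability minus the Zawgyi chain's, which is the sign convention implied by the formula but not stated in the source; `sketch_zawgyi_probability` is a hypothetical name.

```ruby
# Convert a summed log-probability difference into a probability, using the
# same expression as the last line of predict.
def sketch_zawgyi_probability(total_delta)
  1.0 / (1.0 + Math.exp(total_delta))
end

[-5.0, 0.0, 5.0].each do |delta|
  puts format('delta=%+.1f -> P(Zawgyi)=%.4f', delta, sketch_zawgyi_probability(delta))
end
```

A total of zero means the two chains find the string equally likely and yields exactly 0.5. Negative totals push the result toward 1 (Zawgyi) and positive totals push it toward 0 (Unicode).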