class ZawgyiUnicodeMarkovModel
For the purposes of Unicode/Zawgyi detection, all characters are treated as the NULL state except for characters in the Myanmar script or characters in the Unicode whitespace range U+2000 through U+200B.
Constants
- AFT_CP0
Standard Myanmar code point range after digits
- AFT_CP1
- AFT_OFFSET
- BINARY_TAG
Magic number used to identify this object in byte streams. (Reads in ASCII as “UZMODEL ”)
- EXA_CP0
Extended Myanmar code point range A
- EXA_CP1
- EXA_OFFSET
- EXB_CP0
Extended Myanmar code point range B
- EXB_CP1
- EXB_OFFSET
- NUM_STATES
- SPC_CP0
Unicode space characters
- SPC_CP1
- SPC_OFFSET
- STD_CP0
Standard Myanmar code point range before digits
- STD_CP1
- STD_OFFSET
Indices into Markov nodes
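The constants above can be read as five code point ranges, each paired with an offset into the state table. The following is a minimal sketch of how such offsets could be laid out back to back. The numeric values are assumptions inferred from the Unicode block layout (Myanmar U+1000 through U+109F with digits at U+1040 through U+1049, Myanmar Extended-A, Myanmar Extended-B) and from the whitespace range named in the class description; they are not taken from this page and the real constants may differ.

```ruby
# Hypothetical values; the library's actual constants are not listed here.
STD_CP0 = 0x1000; STD_CP1 = 0x103F  # Myanmar block, before the digits
AFT_CP0 = 0x104A; AFT_CP1 = 0x109F  # Myanmar block, after the digits
EXA_CP0 = 0xAA60; EXA_CP1 = 0xAA7F  # Myanmar Extended-A
EXB_CP0 = 0xA9E0; EXB_CP1 = 0xA9FF  # Myanmar Extended-B
SPC_CP0 = 0x2000; SPC_CP1 = 0x200B  # Unicode space characters

# State 0 is reserved for the NULL state, so the first range starts at 1.
# Each following offset begins right after the previous range ends.
STD_OFFSET = 1
AFT_OFFSET = STD_OFFSET + (STD_CP1 - STD_CP0 + 1)
EXA_OFFSET = AFT_OFFSET + (AFT_CP1 - AFT_CP0 + 1)
EXB_OFFSET = EXA_OFFSET + (EXA_CP1 - EXA_CP0 + 1)
SPC_OFFSET = EXB_OFFSET + (EXB_CP1 - EXB_CP0 + 1)

# Assumed meaning of NUM_STATES: one past the last assigned state index.
NUM_STATES = SPC_OFFSET + (SPC_CP1 - SPC_CP0 + 1)

puts "states: #{NUM_STATES}"
```

Under these assumptions the Myanmar digits fall between the two standard ranges and therefore share the NULL state with every other unlisted code point.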
Public Class Methods
get_index_for_code_point(cp)
Returns the index of the state in the Markov chain corresponding to the given code point. Code points in the standard Myanmar range, Myanmar Extended-A, Myanmar Extended-B, and the Unicode whitespace range each map to their own unique state. All other code points map to state 0.
This is a public class method so that the builder can use it.
Parameters: cp - the code point to convert to a state index.
Returns: the index of the state in the Markov chain, with 0 <= state < the number of states (NUM_STATES).
# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 96
def self.get_index_for_code_point(cp)
  marko_chain_index = 0
  if STD_CP0 <= cp && cp <= STD_CP1
    marko_chain_index = cp - STD_CP0 + STD_OFFSET
  elsif AFT_CP0 <= cp && cp <= AFT_CP1
    marko_chain_index = cp - AFT_CP0 + AFT_OFFSET
  elsif EXA_CP0 <= cp && cp <= EXA_CP1
    marko_chain_index = cp - EXA_CP0 + EXA_OFFSET
  elsif EXB_CP0 <= cp && cp <= EXB_CP1
    marko_chain_index = cp - EXB_CP0 + EXB_OFFSET
  elsif SPC_CP0 <= cp && cp <= SPC_CP1
    marko_chain_index = cp - SPC_CP0 + SPC_OFFSET
  end
  marko_chain_index
end
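To illustrate the claim that every in-range code point gets its own state, here is a table-driven sketch of the same lookup. The `RANGES` table and the `state_for` helper are hypothetical names, and the range bounds are assumptions based on the Unicode block layout rather than values confirmed by this page.

```ruby
# Hypothetical [first, last] code point ranges, listed in state order.
RANGES = [
  [0x1000, 0x103F], # standard Myanmar, before the digits
  [0x104A, 0x109F], # standard Myanmar, after the digits
  [0xAA60, 0xAA7F], # Myanmar Extended-A
  [0xA9E0, 0xA9FF], # Myanmar Extended-B
  [0x2000, 0x200B]  # Unicode space characters
].freeze

# Returns the state index for a code point, or 0 (NULL) if it is in no range.
def state_for(cp)
  offset = 1 # state 0 is reserved for NULL
  RANGES.each do |first, last|
    return cp - first + offset if cp.between?(first, last)
    offset += last - first + 1
  end
  0
end

puts state_for("\u1000".ord) # first Myanmar letter
puts state_for('A'.ord)      # Latin letter, NULL state
puts state_for(0x1040)       # Myanmar digit zero, NULL state in this sketch
```

Because each range starts where the previous one ended, no two in-range code points can share an index, which is the property the method description relies on.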
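new(stream)
Creates an instance from a binary data stream. The stream must begin with the BINARY_TAG magic number, followed by a serial version number of 1 or 2.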
# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 66
def initialize(stream)
  # Check magic number and serial version number
  binary_tag = stream.read(8).unpack('H*')[0]
  if binary_tag != BINARY_TAG
    raise "Unexpected magic number: expected #{BINARY_TAG} but got #{binary_tag}"
  end

  binary_version = stream.read(4).unpack('H*')[0].to_i
  if binary_version == 1
    @ssv = 0
  elsif binary_version == 2
    # TODO: Support nonzero SSV if needed in the future
    @ssv = stream.read(4).unpack('H*')[0].to_i
    if @ssv != 0
      raise "Unsupported ssv: #{@ssv}"
    end
  else
    raise "Unexpected serial version number: expected 1 or 2 but got #{binary_version}"
  end

  @classifier = BinaryMarkov.new(stream)
end
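The header check in the constructor can be tried out in isolation. The sketch below is not the library's code: `read_header` is a hypothetical helper, the `BINARY_TAG` value is an assumption derived from the ASCII string "UZMODEL ", and the integers are decoded with the `N` directive (32-bit big-endian) rather than by parsing a hex string as the constructor does. The two approaches agree for the small version numbers involved.

```ruby
require 'stringio'

# Assumption: BINARY_TAG is the lowercase hex encoding of "UZMODEL ".
BINARY_TAG = 'UZMODEL '.unpack1('H*')

# Reads the magic number and version fields from the start of a stream.
# Returns [version, ssv]; raises if the header is not recognized.
def read_header(stream)
  tag = stream.read(8).unpack1('H*')
  raise "Unexpected magic number: #{tag}" unless tag == BINARY_TAG

  # 'N' decodes a 32-bit unsigned integer in big-endian byte order.
  version = stream.read(4).unpack1('N')
  case version
  when 1 then [1, 0]                            # version 1 carries no SSV field
  when 2 then [2, stream.read(4).unpack1('N')]  # version 2 adds a 4-byte SSV
  else raise "Unexpected serial version number: #{version}"
  end
end

# Build a version 2 header with SSV 0 and parse it back.
header = 'UZMODEL '.b + [2].pack('N') + [0].pack('N')
version, ssv = read_header(StringIO.new(header))
puts "version=#{version} ssv=#{ssv}"
```

Any object that responds to `read(n)`, such as a `File` opened in binary mode, could stand in for the `StringIO` here.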
Public Instance Methods
char_count(code_point)
Returns the number of char values (UTF-16 code units) needed to represent the given Unicode code point: 2 if the code point is greater than or equal to 0x10000, otherwise 1.
# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 183
def char_count(code_point)
  code_point >= 0x10000 ? 2 : 1
end
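As a quick check of what the 0x10000 threshold means, the sketch below compares a standalone copy of `char_count` with the number of 16-bit units in the UTF-16 encoding of the same code point. The `utf16_units` helper is a hypothetical name introduced only for this comparison.

```ruby
# Standalone copy of the method above, so the example runs on its own.
def char_count(code_point)
  code_point >= 0x10000 ? 2 : 1
end

# Number of 16-bit code units in the UTF-16 encoding of a code point.
def utf16_units(code_point)
  [code_point].pack('U').encode('UTF-16BE').bytesize / 2
end

[0x41, 0x1000, 0xFFFD, 0x10000, 0x1F600].each do |cp|
  puts format('U+%04X: char_count=%d utf16_units=%d',
              cp, char_count(cp), utf16_units(cp))
end
```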
predict(input, verbose=false)
Runs the given input string through both internal Markov chains and computes the probability that the string is Zawgyi rather than Unicode.
Parameters: input - the string to evaluate. verbose - whether to print the log probabilities for debugging.
Returns: the probability that the string is Zawgyi given that it is either Unicode or Zawgyi, or -Infinity if there are no Myanmar range code points in the string.
# File lib/myanmar-tools/zawgyi_unicode_markov_model.rb, line 118
def predict(input, verbose=false)
  if verbose
    puts "Running detector on string: #{input}"
  end

  # Start at the base state
  prev_cp = 0
  prev_state = 0

  total_delta = 0.0
  seen_transition = false
  offset = 0
  while offset <= input.length
    if offset == input.length
      cp = 0
      curr_state = 0
    else
      cp = input.codepoints[offset]
      curr_state = self.class.get_index_for_code_point(cp)
    end

    # Ignore 0-to-0 transitions
    if prev_state != 0 || curr_state != 0
      # Gets the difference in log probabilities between chain A and chain B.
      # First param: The index of the source node to transition from.
      # Second param: The index of the destination node to transition to.
      delta = @classifier.log_probability_differences[prev_state][curr_state].to_f

      if verbose
        puts "#{prev_cp} -> #{cp}: delta=#{delta}"
        puts "ABS: #{delta.abs}"
        delta_index = 1
        while delta_index < delta.abs
          print "!"
          delta_index += 1
        end
        puts ""
      end

      total_delta += delta
      seen_transition = true
    end

    offset += char_count(cp)
    prev_cp = cp
    prev_state = curr_state
  end

  if verbose
    puts "Final: delta=#{total_delta}"
  end

  # Special case: if there is no signal, return -Infinity,
  # which will get interpreted by users as strong Unicode.
  # This happens when the input string contains no Myanmar-range code points.
  unless seen_transition
    return -1.0/0.0
  end

  1.0 / (1.0 + Math.exp(total_delta))
end
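The last line of the method turns the accumulated log-probability difference into a probability with a logistic function. The sketch below isolates that step. It assumes each delta is log P(transition | Unicode) minus log P(transition | Zawgyi), which is the sign convention that makes the formula yield the Zawgyi probability under equal priors; the page itself only calls the two chains A and B. `zawgyi_probability` is a hypothetical helper name and the deltas are made-up numbers.

```ruby
# Converts per-transition log-probability differences into a probability.
# Assumption: delta = log P(transition | Unicode) - log P(transition | Zawgyi).
def zawgyi_probability(deltas)
  # No transitions seen means no signal, mirroring the special case above.
  return -Float::INFINITY if deltas.empty?

  total_delta = deltas.sum
  # With total_delta = log(P_unicode / P_zawgyi), this equals
  # P_zawgyi / (P_unicode + P_zawgyi).
  1.0 / (1.0 + Math.exp(total_delta))
end

puts zawgyi_probability([0.0])        # chains agree: no preference
puts zawgyi_probability([-2.0, -1.0]) # negative total: leans Zawgyi
puts zawgyi_probability([3.0])        # positive total: leans Unicode
puts zawgyi_probability([])           # no Myanmar-range code points
```

Callers that compare the result against a threshold should handle the -Infinity case, which sorts below every real probability and so reads as strong Unicode.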