module TwitterCldr::Collation::ImplicitCollationElements

ImplicitCollationElements generates implicit collation elements for code points (including some CJK characters) that are not explicitly mentioned in the collation elements table.

This module was ported from the ICU4J library (ImplicitCEGenerator class). See NOTICE file for license information.

Constants

CJK_A_BASE
CJK_A_LIMIT
CJK_BASE
CJK_B_BASE
CJK_B_LIMIT
CJK_COMPAT_USED_BASE
CJK_COMPAT_USED_LIMIT
CJK_C_BASE
CJK_C_LIMIT
CJK_D_BASE
CJK_D_LIMIT
CJK_LIMIT
DEFAULT_SECONDARY_AND_TERTIARY
FINAL_3_COUNT
FINAL_3_MULTIPLIER

number of values we can use in trailing bytes; leave room for empty values between AND above, e.g., if gap = 2:

range 3..7 => +3 -4 -5 -6 -7: so 1 value
range 3..8 => +3 -4 -5 +6 -7 -8: so 2 values
range 3..9 => +3 -4 -5 +6 -7 -8 -9: so 2 values

FINAL_4_COUNT
FINAL_4_MULTIPLIER
GAP_3

gap for tailoring of 3-byte forms

GAP_4
MAX_INPUT

2 * [Unicode range] + 2

MAX_PRIMARY
MAX_TRAIL
MEDIAL_COUNT

medials can use full range

MIN_4_BOUNDARY
MIN_4_PRIMARY
MIN_PRIMARY

primary value

MIN_TRAIL

final byte

NEEDED_PER_FINAL_BYTE
NEEDED_PER_PRIMARY_BYTE
NON_CJK_OFFSET

CJK constants

PRIMARIES_3_COUNT

number of 3-byte primaries that can be used

PRIMARIES_4_COUNT
PRIMARIES_AVAILABLE

now determine where the 3/4 boundary is: we use 3 bytes below the boundary, and 4 above

THREE_BYTE_COUNT

find out how many values fit in each form

TOTAL_NEEDED
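
The gap example given with the trailing-byte constants (range 3..7 with gap = 2 yields 1 value, and so on) can be reproduced with a short sketch. The helper name `usable_trail_values` and its formula are assumptions inferred from the three worked ranges; they are not taken from the module's source.

```ruby
# Hypothetical helper: counts how many trailing-byte values are usable when
# every used value must be followed by `gap` empty values.
# The formula is inferred from the worked ranges in the constant notes.
def usable_trail_values(min, max, gap)
  multiplier = gap + 1          # one used value plus `gap` empty ones
  (max - min + 1) / multiplier  # integer division drops a partial slot
end

[[3, 7], [3, 8], [3, 9]].each do |min, max|
  puts "range #{min}..#{max} => #{usable_trail_values(min, max, 2)}"
end
```

With gap = 2 this prints 1, 2 and 2 for the three ranges, matching the counts listed in the notes.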

Public Class Methods

for_code_point(code_point)
# File lib/twitter_cldr/collation/implicit_collation_elements.rb, line 20
def for_code_point(code_point)
  [[primary_weight(swapCJK(code_point) + 1), DEFAULT_SECONDARY_AND_TERTIARY, DEFAULT_SECONDARY_AND_TERTIARY]]
end

Private Class Methods

primary_weight(code_point)

Generates the primary weight of the implicit CE for a given code point.

# File lib/twitter_cldr/collation/implicit_collation_elements.rb, line 28
def primary_weight(code_point)
  byte0 = code_point - MIN_4_BOUNDARY

  if byte0 < 0
    byte1 = code_point / FINAL_3_COUNT
    byte0 = code_point % FINAL_3_COUNT

    byte2 = byte1 / MEDIAL_COUNT
    byte1 %= MEDIAL_COUNT

    # spread out, leaving gap at start
    byte0 = MIN_TRAIL + byte0 * FINAL_3_MULTIPLIER

    # offset
    byte1 += MIN_TRAIL
    byte2 += MIN_PRIMARY

    (byte2 << 16) + (byte1 << 8) + byte0
  else
    byte1 = byte0 / FINAL_4_COUNT
    byte0 %= FINAL_4_COUNT

    byte2 = byte1 / MEDIAL_COUNT
    byte1 %= MEDIAL_COUNT

    byte3 = byte2 / MEDIAL_COUNT
    byte2 %= MEDIAL_COUNT

    # spread out, leaving gap at start
    byte0 = MIN_TRAIL + byte0 * FINAL_4_MULTIPLIER

    # offset
    byte1 += MIN_TRAIL
    byte2 += MIN_TRAIL
    byte3 += MIN_4_PRIMARY

    (byte3 << 24) + (byte2 << 16) + (byte1 << 8) + byte0
  end
end
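
The 3-byte branch is a mixed-radix split of the input followed by fixed offsets. A minimal standalone sketch of that branch follows. All `SKETCH_` constants are illustrative values chosen for this example, and the way the counts are derived from them is an assumption; the module defines its own constants, whose values are not shown on this page.

```ruby
# Illustrative values only (assumptions, not the module's constants).
SKETCH_MIN_TRAIL   = 0x04
SKETCH_MAX_TRAIL   = 0xFE
SKETCH_MIN_PRIMARY = 0xE0
SKETCH_GAP_3       = 1

# Assumed derivations: one used value plus GAP_3 empty ones per final slot,
# while medial bytes use the full trail range.
SKETCH_FINAL_3_MULTIPLIER = SKETCH_GAP_3 + 1
SKETCH_FINAL_3_COUNT      = (SKETCH_MAX_TRAIL - SKETCH_MIN_TRAIL + 1) / SKETCH_FINAL_3_MULTIPLIER
SKETCH_MEDIAL_COUNT       = SKETCH_MAX_TRAIL - SKETCH_MIN_TRAIL + 1

# Mirrors the 3-byte branch of primary_weight for a value below the 3/4 boundary.
def sketch_three_byte_weight(value)
  rest, final  = value.divmod(SKETCH_FINAL_3_COUNT)  # least significant digit
  lead, medial = rest.divmod(SKETCH_MEDIAL_COUNT)    # middle and leading digits

  byte0 = SKETCH_MIN_TRAIL + final * SKETCH_FINAL_3_MULTIPLIER # spread out, leaving gaps
  byte1 = SKETCH_MIN_TRAIL + medial                            # offset
  byte2 = SKETCH_MIN_PRIMARY + lead                            # offset

  (byte2 << 16) + (byte1 << 8) + byte0
end

[0, 1, SKETCH_FINAL_3_COUNT].each do |value|
  puts format('%d => %06X', value, sketch_three_byte_weight(value))
end
```

Because each digit is offset by a constant and the final digit is scaled by a positive multiplier, larger inputs always produce larger weights, which is what keeps the generated primaries in code point order.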

swapCJK(code_point)

Method used to:

a) collapse two different Han ranges from UCA into one (in the right order)
b) bump any non-CJK characters by NON_CJK_OFFSET.

The relevant blocks are:

A: 4E00..9FFF; CJK Unified Ideographs
   F900..FAFF; CJK Compatibility Ideographs

B: 3400..4DBF; CJK Unified Ideographs Extension A
   20000..XX;  CJK Unified Ideographs Extension B (and others later on)

As long as no new B characters are allocated between 4E00 and FAFF, and no new A characters are outside of this range (very high probability), this simple code will work.

The reordered blocks are:

Block1 is CJK
Block2 is CJK_COMPAT_USED
Block3 is CJK_A
(all contiguous)

Any other CJK gets its normal code point.

When we reorder Block1, we make sure that it is at the very start, so that it will use a 3-byte form.

# File lib/twitter_cldr/collation/implicit_collation_elements.rb, line 93
def swapCJK(code_point)
  if code_point >= CJK_BASE
    return code_point - CJK_BASE                                      if code_point < CJK_LIMIT
    return code_point + NON_CJK_OFFSET                                if code_point < CJK_COMPAT_USED_BASE
    return code_point - CJK_COMPAT_USED_BASE + (CJK_LIMIT - CJK_BASE) if code_point < CJK_COMPAT_USED_LIMIT
    return code_point + NON_CJK_OFFSET                                if code_point < CJK_B_BASE
    return code_point                                                 if code_point < CJK_B_LIMIT # non-BMP-CJK
    return code_point + NON_CJK_OFFSET                                if code_point < CJK_C_BASE
    return code_point                                                 if code_point < CJK_C_LIMIT # non-BMP-CJK
    return code_point + NON_CJK_OFFSET                                if code_point < CJK_D_BASE
    return code_point                                                 if code_point < CJK_D_LIMIT # non-BMP-CJK

    return code_point + NON_CJK_OFFSET # non-CJK
  end

  return code_point + NON_CJK_OFFSET if code_point < CJK_A_BASE
  return code_point - CJK_A_BASE + (CJK_LIMIT - CJK_BASE) + (CJK_COMPAT_USED_LIMIT - CJK_COMPAT_USED_BASE) if code_point < CJK_A_LIMIT

  code_point + NON_CJK_OFFSET # non-CJK
end
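
The reordering of the three BMP blocks can be illustrated with a simplified sketch. It is not the module's implementation: it uses the whole CJK Compatibility Ideographs block F900..FAFF as a stand-in for the CJK_COMPAT_USED range (whose actual bounds are not given on this page), assumes an offset of 0x110000 for NON_CJK_OFFSET, and leaves out the non-BMP CJK ranges, which keep their normal code point. All `SKETCH_` names are hypothetical.

```ruby
# Assumed block bounds (exclusive upper limits), taken from the block list above.
SKETCH_CJK        = 0x4E00...0xA000  # Block1: CJK Unified Ideographs
SKETCH_CJK_COMPAT = 0xF900...0xFB00  # Block2: stand-in for CJK_COMPAT_USED
SKETCH_CJK_A      = 0x3400...0x4DC0  # Block3: CJK Unified Ideographs Extension A
SKETCH_OFFSET     = 0x110000         # assumed value for NON_CJK_OFFSET

# Maps the three blocks onto one contiguous range starting at 0 and
# bumps everything else by the offset.
def sketch_swap_cjk(code_point)
  return code_point - SKETCH_CJK.begin if SKETCH_CJK.cover?(code_point)

  if SKETCH_CJK_COMPAT.cover?(code_point)
    return code_point - SKETCH_CJK_COMPAT.begin + SKETCH_CJK.size
  end

  if SKETCH_CJK_A.cover?(code_point)
    return code_point - SKETCH_CJK_A.begin + SKETCH_CJK.size + SKETCH_CJK_COMPAT.size
  end

  code_point + SKETCH_OFFSET # non-CJK
end

[0x4E00, 0x9FFF, 0xF900, 0x3400, 0x41].each do |cp|
  puts format('%04X => %X', cp, sketch_swap_cjk(cp))
end
```

Under these assumptions Block1 starts at 0, Block2 begins right after Block1 ends, and Block3 begins right after Block2, so the three blocks occupy one contiguous run at the bottom of the value space.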