    # File lib/twitter_cldr/collation/implicit_collation_elements.rb, line 28
    def primary_weight(code_point)
      byte0 = code_point - MIN_4_BOUNDARY

      if byte0 < 0
        byte1 = code_point / FINAL_3_COUNT
        byte0 = code_point % FINAL_3_COUNT

        byte2 = byte1 / MEDIAL_COUNT
        byte1 %= MEDIAL_COUNT

        # spread out, leaving gap at start
        byte0 = MIN_TRAIL + byte0 * FINAL_3_MULTIPLIER

        # offset
        byte1 += MIN_TRAIL
        byte2 += MIN_PRIMARY

        (byte2 << 16) + (byte1 << 8) + byte0
      else
        byte1 = byte0 / FINAL_4_COUNT
        byte0 %= FINAL_4_COUNT

        byte2 = byte1 / MEDIAL_COUNT
        byte1 %= MEDIAL_COUNT

        byte3 = byte2 / MEDIAL_COUNT
        byte2 %= MEDIAL_COUNT

        # spread out, leaving gap at start
        byte0 = MIN_TRAIL + byte0 * FINAL_4_MULTIPLIER

        # offset
        byte1 += MIN_TRAIL
        byte2 += MIN_TRAIL
        byte3 += MIN_4_PRIMARY

        (byte3 << 24) + (byte2 << 16) + (byte1 << 8) + byte0
      end
    end
module TwitterCldr::Collation::ImplicitCollationElements
ImplicitCollationElements generates implicit collation elements for code points (including some CJK characters) that are not explicitly mentioned in the collation elements table.
This module was ported from the ICU4J library (ImplicitCEGenerator class). See NOTICE file for license information.
Constants
- CJK_A_BASE
- CJK_A_LIMIT
- CJK_BASE
- CJK_B_BASE
- CJK_B_LIMIT
- CJK_COMPAT_USED_BASE
- CJK_COMPAT_USED_LIMIT
- CJK_C_BASE
- CJK_C_LIMIT
- CJK_D_BASE
- CJK_D_LIMIT
- CJK_LIMIT
- DEFAULT_SECONDARY_AND_TERTIARY
- FINAL_3_COUNT
- FINAL_3_MULTIPLIER
number of values we can use in trailing bytes; leave room for empty values between AND above, e.g., if gap = 2:
range 3..7 => +3 -4 -5 -6 -7: so 1 value
range 3..8 => +3 -4 -5 +6 -7 -8: so 2 values
range 3..9 => +3 -4 -5 +6 -7 -8 -9: so 2 values
- FINAL_4_COUNT
- FINAL_4_MULTIPLIER
- GAP_3
gap for tailoring of 3-byte forms
- GAP_4
- MAX_INPUT
2 * [Unicode range] + 2
- MAX_PRIMARY
- MAX_TRAIL
- MEDIAL_COUNT
medials can use full range
- MIN_4_BOUNDARY
- MIN_4_PRIMARY
- MIN_PRIMARY
primary value
- MIN_TRAIL
final byte
- NEEDED_PER_FINAL_BYTE
- NEEDED_PER_PRIMARY_BYTE
- NON_CJK_OFFSET
CJK constants
- PRIMARIES_3_COUNT
number of 3-byte primaries that can be used
- PRIMARIES_4_COUNT
- PRIMARIES_AVAILABLE
now determine where the 3/4 boundary is: we use 3 bytes below the boundary, and 4 above
- THREE_BYTE_COUNT
find out how many values fit in each form
- TOTAL_NEEDED
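The gap example given under FINAL_3_MULTIPLIER can be checked with a few lines of Ruby. This is a minimal sketch, assuming the number of usable trailing values is the range size divided (with integer division) by gap + 1; that formula reproduces the three cases listed there but is not taken from this page. The helper names `usable_count` and `usable_values` are hypothetical.

```ruby
# Assumed rule: every used value is followed by `gap` empty values,
# so each used value consumes (gap + 1) slots of the range.
def usable_count(range, gap)
  range.size / (gap + 1)
end

# The values actually used: every (gap + 1)-th value, limited to those
# that still have `gap` empty values after them.
def usable_values(range, gap)
  range.step(gap + 1).first(usable_count(range, gap))
end

[(3..7), (3..8), (3..9)].each do |range|
  puts "#{range}: #{usable_count(range, 2)} value(s), using #{usable_values(range, 2).inspect}"
end
```

With gap = 2, the range 3..9 still yields only two values because 9 would have no room for the two empty values above it.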
Public Class Methods
    # File lib/twitter_cldr/collation/implicit_collation_elements.rb, line 20
    def for_code_point(code_point)
      [[primary_weight(swapCJK(code_point) + 1), DEFAULT_SECONDARY_AND_TERTIARY, DEFAULT_SECONDARY_AND_TERTIARY]]
    end
Private Class Methods
Generates the primary weight of the implicit CE for a given code point.
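The three-byte branch of primary_weight is a mixed-radix decomposition: the value is split into a final, a medial and a lead digit, and each digit is then shifted into its byte range. The sketch below illustrates only that branch. All constant values in it are hypothetical, chosen for illustration, and the relations used to derive FINAL_3_MULTIPLIER, FINAL_3_COUNT and MEDIAL_COUNT are assumptions; the module's real values are not shown on this page.

```ruby
module WeightSketch
  # Hypothetical values, for illustration only.
  MIN_TRAIL   = 0x04
  MAX_TRAIL   = 0xFF
  GAP_3       = 1
  MIN_PRIMARY = 0xE0

  # Assumed relations between the constants.
  FINAL_3_MULTIPLIER = GAP_3 + 1
  FINAL_3_COUNT      = (MAX_TRAIL - MIN_TRAIL + 1) / FINAL_3_MULTIPLIER
  MEDIAL_COUNT       = MAX_TRAIL - MIN_TRAIL + 1 # medials can use full range

  # Three-byte form only: split into digits, then offset each digit.
  def self.three_byte_weight(value)
    rest, final  = value.divmod(FINAL_3_COUNT)
    lead, medial = rest.divmod(MEDIAL_COUNT)

    byte0 = MIN_TRAIL + final * FINAL_3_MULTIPLIER # spread out, leaving gaps
    byte1 = MIN_TRAIL + medial
    byte2 = MIN_PRIMARY + lead

    (byte2 << 16) + (byte1 << 8) + byte0
  end
end

[0, 1, WeightSketch::FINAL_3_COUNT].each do |value|
  puts format('%d -> %06X', value, WeightSketch.three_byte_weight(value))
end
```

Because each digit stays inside its own byte range, larger input values always produce larger weights, which is what keeps the implicit ordering consistent with the reordered code point order.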
Method used to:
a) collapse two different Han ranges from UCA into one (in the right order)
b) bump any non-CJK characters by NON_CJK_OFFSET.
The relevant blocks are:
A: 4E00..9FFF; CJK Unified Ideographs
   F900..FAFF; CJK Compatibility Ideographs
B: 3400..4DBF; CJK Unified Ideographs Extension A
   20000..XX; CJK Unified Ideographs Extension B (and others later on)
As long as no new B characters are allocated between 4E00 and FAFF, and no new A characters are outside of this range (very high probability), this simple code will work.
The reordered blocks are:
Block1 is CJK
Block2 is CJK_COMPAT_USED
Block3 is CJK_A
(all contiguous)
Any other CJK gets its normal code point.
When we reorder Block1, we make sure that it is at the very start, so that it will use a 3-byte form.
    # File lib/twitter_cldr/collation/implicit_collation_elements.rb, line 93
    def swapCJK(code_point)
      if code_point >= CJK_BASE
        return code_point - CJK_BASE if code_point < CJK_LIMIT
        return code_point + NON_CJK_OFFSET if code_point < CJK_COMPAT_USED_BASE
        return code_point - CJK_COMPAT_USED_BASE + (CJK_LIMIT - CJK_BASE) if code_point < CJK_COMPAT_USED_LIMIT
        return code_point + NON_CJK_OFFSET if code_point < CJK_B_BASE
        return code_point if code_point < CJK_B_LIMIT # non-BMP-CJK
        return code_point + NON_CJK_OFFSET if code_point < CJK_C_BASE
        return code_point if code_point < CJK_C_LIMIT # non-BMP-CJK
        return code_point + NON_CJK_OFFSET if code_point < CJK_D_BASE
        return code_point if code_point < CJK_D_LIMIT # non-BMP-CJK
        return code_point + NON_CJK_OFFSET # non-CJK
      end

      return code_point + NON_CJK_OFFSET if code_point < CJK_A_BASE
      return code_point - CJK_A_BASE + (CJK_LIMIT - CJK_BASE) + (CJK_COMPAT_USED_LIMIT - CJK_COMPAT_USED_BASE) if code_point < CJK_A_LIMIT

      code_point + NON_CJK_OFFSET # non-CJK
    end
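The reordering performed by swapCJK can be pictured as laying Block1, Block2 and Block3 end to end starting at zero and pushing everything else up by an offset. The following is a simplified sketch of that idea, not the module's code. The bounds are the full block ranges quoted in the description (the real CJK_COMPAT_USED_* constants may cover a narrower range), `OFFSET` is a hypothetical stand-in for NON_CJK_OFFSET, and the non-BMP CJK ranges that keep their normal code point are left out because their upper bounds are not given here.

```ruby
module ReorderSketch
  # Illustrative bounds (end-exclusive), copied from the block list in
  # the description; not read from the module's constants.
  BLOCKS = [
    (0x4E00...0xA000), # Block1: CJK Unified Ideographs
    (0xF900...0xFB00), # Block2: CJK Compatibility Ideographs
    (0x3400...0x4DC0)  # Block3: CJK Unified Ideographs Extension A
  ].freeze

  # Hypothetical stand-in for NON_CJK_OFFSET.
  OFFSET = 0x110000

  def self.reorder(code_point)
    start = 0
    BLOCKS.each do |block|
      # Inside a reordered block: position within the block plus the
      # combined size of all blocks placed before it.
      return start + (code_point - block.begin) if block.cover?(code_point)
      start += block.size
    end
    code_point + OFFSET # everything else is bumped past the blocks
  end
end

[0x4E00, 0xF900, 0x3400, 0x0041].each do |cp|
  puts format('U+%04X -> %X', cp, ReorderSketch.reorder(cp))
end
```

In this sketch U+4E00 maps to 0, so Block1 sits at the very start of the value space, which is the property the description relies on for Block1 to use the 3-byte form.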