module U
This version of UnicodeUtils implements algorithms as defined by version 6.2.0 of the Unicode standard. Each public method is declared as a module_function of the UnicodeUtils module and defined in a separate file under the unicode_utils directory.
As a convenience, the toplevel unicode_utils file loads all methods (needs lots of memory!). Also as a convenience for irb usage, the file unicode_utils/u assigns the UnicodeUtils module to the toplevel U constant and loads all methods:
  $ irb -r unicode_utils/u
  irb(main):001:0> U.grep /angstrom/
  => [#<U+212B "Å" ANGSTROM SIGN utf8:e2,84,ab>]
If a method takes a character as argument (usually named char), that argument can be an integer or a string (in which case the first code point counts) or any other object that responds to ord by returning an integer.
All methods are non-destructive. String return values are in the same encoding as the strings passed as arguments, which must be in one of the Unicode encodings.
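The char argument convention can be illustrated with plain Ruby, because Integer#ord returns the integer itself and String#ord returns the first code point of the string. The helper name code_point_of below is hypothetical and not part of UnicodeUtils; it is only a minimal sketch of the convention.

```ruby
# Hypothetical helper (not part of UnicodeUtils): normalizes a "char"
# argument to an integer code point by relying on #ord.
def code_point_of(char)
  char.ord
end

# Integer#ord returns the receiver unchanged.
from_integer = code_point_of(0x212B)
# String#ord looks only at the first code point of the string.
from_string = code_point_of("\u{212B}ngstrom")
```

Both calls yield the code point U+212B, which is why the methods on this page can accept either form.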
High-level methods are:
UnicodeUtils.upcase - full conversion to uppercase
UnicodeUtils.downcase - full conversion to lowercase
UnicodeUtils.titlecase - full conversion to titlecase
UnicodeUtils.casefold - case folding (case insensitive string comparison)
UnicodeUtils.nfd - Normalization Form D
UnicodeUtils.nfc - Normalization Form C
UnicodeUtils.nfkd - Normalization Form KD
UnicodeUtils.nfkc - Normalization Form KC
UnicodeUtils.each_grapheme - grapheme boundaries
UnicodeUtils.each_word - word boundaries
UnicodeUtils.char_name - character names
UnicodeUtils.grep - find code points by character name
Constants
- CDATA_DIR
Absolute path to the directory from which UnicodeUtils loads its compiled Unicode data files at runtime.
- UNICODE_VERSION
The version of Unicode implemented by this version of UnicodeUtils.
  require "unicode_utils/version"
  puts "Unicode #{UnicodeUtils::UNICODE_VERSION}"
- VERSION
Corresponds to the unicode_utils gem version.
Conforms to Semantic Versioning as documented at semver.org.
Summary: MAJOR.MINOR.PATCHLEVEL
  - A backwards incompatible change causes a change in MAJOR.
  - New features or non-bugfix improvements cause a change in MINOR.
  - Bugfixes increase only PATCHLEVEL.
  - Pre-release versions append more info after a dash.
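As an illustration of what the MAJOR/MINOR rules mean for a consumer, the following sketch uses RubyGems' Gem::Requirement with a pessimistic constraint. The version numbers are hypothetical and chosen only for the example; they do not refer to actual unicode_utils releases.

```ruby
require "rubygems" # normally already loaded by Ruby

# "~> 1.4" means ">= 1.4, < 2.0": MINOR and PATCHLEVEL bumps are accepted,
# a MAJOR bump (possible backwards incompatible change) is rejected.
requirement = Gem::Requirement.new("~> 1.4")

accepts_minor = requirement.satisfied_by?(Gem::Version.new("1.5.2"))
rejects_major = requirement.satisfied_by?(Gem::Version.new("2.0.0"))
```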
Public Class Methods
Get the canonical decomposition of the given string, also called Normalization Form D or short NFD.
The Unicode standard has multiple representations for some characters. One representation as a single code point and other representation(s) as a combination of multiple code points. This function “decomposes” these characters in str into the latter representation.
Example:
  require "unicode_utils/canonical_decomposition"
  # LATIN SMALL LETTER A WITH ACUTE => LATIN SMALL LETTER A, COMBINING ACUTE ACCENT
  UnicodeUtils.canonical_decomposition("\u{E1}") => "\u{61}\u{301}"
See also: UnicodeUtils.nfd
  # File lib/unicode_utils/canonical_decomposition.rb, line 27
  def canonical_decomposition(str)
    res = String.new.force_encoding(str.encoding)
    str.each_codepoint { |cp|
      if cp >= 0xAC00 && cp <= 0xD7A3 # hangul syllable
        Impl.append_hangul_syllable_decomposition(res, cp)
      else
        mapping = CANONICAL_DECOMPOSITION_MAP[cp]
        if mapping
          Impl.append_recursive_canonical_decomposition_mapping(res, mapping)
        else
          res << cp
        end
      end
    }
    Impl.put_into_canonical_order(res)
  end
The strings a and b are canonical equivalents if their canonical decompositions are equal.
Example:
  require "unicode_utils/canonical_equivalents_q"
  UnicodeUtils.canonical_equivalents?("Äste", "A\u{308}ste") => true
  UnicodeUtils.canonical_equivalents?("Äste", "Aste") => false
  # File lib/unicode_utils/canonical_equivalents_q.rb, line 14
  def canonical_equivalents?(a, b)
    UnicodeUtils.canonical_decomposition(a) ==
      UnicodeUtils.canonical_decomposition(b)
  end
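The same idea can be tried without the gem by using Ruby's built-in String#unicode_normalize as a stand-in for canonical_decomposition. This is only a sketch: the helper name sketch_canonical_equivalents? is hypothetical, and Ruby's bundled Unicode data may correspond to a different Unicode version than UnicodeUtils.

```ruby
# Stand-in using Ruby's built-in normalization, not UnicodeUtils itself.
def sketch_canonical_equivalents?(a, b)
  a.unicode_normalize(:nfd) == b.unicode_normalize(:nfd)
end

# Precomposed U+00C4 versus "A" followed by COMBINING DIAERESIS.
same = sketch_canonical_equivalents?("\u{C4}ste", "A\u{308}ste")
# A plain "A" has no diaeresis at all, so the decompositions differ.
different = sketch_canonical_equivalents?("\u{C4}ste", "Aste")
```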
Returns true if the given character is case-ignorable as defined by Unicode 5.0, section 3.13.
  # File lib/unicode_utils/case_ignorable_char_q.rb, line 10
  def case_ignorable_char?(char)
    CASE_IGNORABLE_SET.include?(char.ord)
  end
A cased char is a character that has the Unicode property Lowercase or Uppercase or the general category Titlecase_Letter.
See also: lowercase_char?, uppercase_char?, titlecase_char?
  # File lib/unicode_utils/cased_char_q.rb, line 12
  def cased_char?(char)
    lowercase_char?(char) || uppercase_char?(char) || titlecase_char?(char)
  end
Perform full case folding. The returned string may be longer than str. The purpose of case folding is case insensitive string comparison.
Examples:
  require "unicode_utils/casefold"
  UnicodeUtils.casefold("Ümit") == UnicodeUtils.casefold("ümit") => true
  UnicodeUtils.casefold("WEISS") == UnicodeUtils.casefold("weiß") => true
  # File lib/unicode_utils/casefold.rb, line 18
  def casefold(str)
    String.new.force_encoding(str.encoding).tap do |res|
      str.each_codepoint { |cp|
        if mapping = CASEFOLD_C_MAP[cp]
          res << mapping
        elsif mapping = CASEFOLD_F_MAP[cp]
          mapping.each { |m| res << m }
        else
          res << cp
        end
      }
    end
  end
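For comparison, Ruby 2.4 and later ship a built-in full case folding via String#downcase(:fold). The sketch below uses that built-in as a stand-in for UnicodeUtils.casefold; results are expected to agree for these inputs, but the two implementations may be based on different Unicode versions.

```ruby
# Full case folding with Ruby's built-in String#downcase(:fold).
# U+00DC / U+00FC are the upper- and lowercase u with diaeresis.
umlaut_match = "\u{DC}mit".downcase(:fold) == "\u{FC}mit".downcase(:fold)

# U+00DF (sharp s) folds to "ss", so the folded string is longer than the input.
sharp_s_match = "WEISS".downcase(:fold) == "wei\u{DF}".downcase(:fold)
folded_length = "wei\u{DF}".downcase(:fold).length
```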
Get the width of char when displayed with a fixed pitch font.
Some code points (especially from East Asian scripts) take the width of two characters, while others have no width.
Examples:
  require "unicode_utils/char_display_width"
  UnicodeUtils.char_display_width("別") # => 2
  UnicodeUtils.char_display_width(0x308) # => 0
  UnicodeUtils.char_display_width("a") # => 1
Performs the same logic as UnicodeUtils.display_width, but for a single code point.
  # File lib/unicode_utils/char_display_width.rb, line 20
  def char_display_width(char)
    cp = char.ord
    # copied from display_width, keep in sync!
    case UnicodeUtils.east_asian_width(cp)
    when :Wide, :Fullwidth then 2
    else GENERAL_CATEGORY_BASIC_WIDTH_MAP[UnicodeUtils.gc(cp)]
    end
  end
Get the normative Unicode name of the given character.
Private Use code points have no name, so this function returns nil for such code points.
All control characters have the special name “<control>”. All other characters have a unique name.
Example:
  require "unicode_utils/char_name"
  UnicodeUtils.char_name "ᾀ" => "GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI"
  UnicodeUtils.char_name "\t" => "<control>"
Note that this method deviates from the Unicode Name property in two points:
- It returns “<control>” for control codes; the Unicode Name property for these code points is an empty string.
- It returns nil for other non-graphic, non-format code points; the Unicode Name property for these code points is an empty string.
See also: UnicodeUtils.sid
  # File lib/unicode_utils/char_name.rb, line 33
  def char_name(char)
    # TODO: improve with code point labels, see section 4.8 in Unicode 6.0.0
    if char.kind_of?(Integer)
      cp = char
      str = nil
    else
      cp = char.ord
      str = char
    end
    NAME_MAP[cp] ||
      case cp
      when 0x3400..0x4DB5, 0x4E00..0x9FCC, 0x20000..0x2A6D6, 0x2A700..0x2B734, 0x2B740..0x2B81D
        "CJK UNIFIED IDEOGRAPH-#{sprintf('%04X', cp)}"
      when 0xAC00..0xD7A3
        str ||= cp.chr(Encoding::UTF_8)
        "HANGUL SYLLABLE ".tap do |n|
          hangul_syllable_decomposition(str).each_char { |c|
            n << (jamo_short_name(c) || '')
          }
        end
      end
  end
Get the long major general category alias of char.
Example:
  require "unicode_utils/char_type"
  UnicodeUtils.char_type("1") # => :Number
Always returns a symbol when char is in the Unicode code point range.
See also: UnicodeUtils.general_category
  # File lib/unicode_utils/char_type.rb, line 26
  def char_type(char)
    GENERAL_CATEGORY_TYPE_MAP[UnicodeUtils.gc(char)]
  end
Get the code point type of the given integer (must be instance of Integer) as defined by the Unicode standard.
If integer is a code point (anything in UnicodeUtils::Codepoint::RANGE), returns one of the following symbols:
  :Graphic
  :Format
  :Control
  :Private_Use
  :Surrogate
  :Noncharacter
  :Reserved
For an exact meaning of these values, read the sections “Conformance/Characters and Encoding” and “General Structure/Types of Codepoints” in the Unicode standard.
Following is a paraphrased excerpt:
Surrogate, Noncharacter and Reserved code points are not assigned to an _abstract character_. All other code points are assigned to an abstract character.
Reserved code points are also called Undesignated code points; all others are Designated code points.
Returns nil if integer is not a code point.
  # File lib/unicode_utils/code_point_type.rb, line 60
  def code_point_type(integer)
    cpt = GENERAL_CATEGORY_CODE_POINT_TYPE[UnicodeUtils.gc(integer)]
    if false == cpt
      cpt = CN_CODE_POINT_TYPE[integer]
    end
    cpt
  end
Get the combining class of the given character as an integer in the range 0..255.
  # File lib/unicode_utils/combining_class.rb, line 11
  def combining_class(char)
    COMBINING_CLASS_MAP[char.ord]
  end
Get the compatibility decomposition of the given string, also called Normalization Form KD or short NFKD.
Compatibility decomposition decomposes more code points than canonical decomposition and, contrary to Normalization Forms D and C, this normalization can alter how a string is displayed.
Example:
  require "unicode_utils/compatibility_decomposition"
  # LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
  UnicodeUtils.compatibility_decomposition("\u{FB01}") => "fi"
See also: UnicodeUtils.nfkd
  # File lib/unicode_utils/compatibility_decomposition.rb, line 25
  def compatibility_decomposition(str)
    res = String.new.force_encoding(str.encoding)
    str.each_codepoint { |cp|
      if cp >= 0xAC00 && cp <= 0xD7A3 # hangul syllable
        Impl.append_hangul_syllable_decomposition(res, cp)
      else
        Impl.append_recursive_compatibility_decomposition_mapping(res, cp)
      end
    }
    Impl.put_into_canonical_order(res)
  end
Print a table with detailed information about each code point in str. opts can have the following keys:
:io - An IO compatible object. Receives the output. Defaults to $stdout.
str may also be an Integer, in which case it is interpreted as a single code point that must be in UnicodeUtils::Codepoint::RANGE.
Examples:
  $ ruby -r unicode_utils/u -e 'U.debug "良い一日"'
   Char | Ordinal | Sid                        | General Category | UTF-8
  ------+---------+----------------------------+------------------+----------
   "良" | 826F    | CJK UNIFIED IDEOGRAPH-826F | Other_Letter     | E8 89 AF
   "い" | 3044    | HIRAGANA LETTER I          | Other_Letter     | E3 81 84
   "一" | 4E00    | CJK UNIFIED IDEOGRAPH-4E00 | Other_Letter     | E4 B8 80
   "日" | 65E5    | CJK UNIFIED IDEOGRAPH-65E5 | Other_Letter     | E6 97 A5
  $ ruby -r unicode_utils/u -e 'U.debug 0xd800'
   Char | Ordinal | Sid              | General Category | UTF-8
  ------+---------+------------------+------------------+-------
   N/A  | D800    | <surrogate-D800> | Surrogate        | N/A
The output is purely informal and may change even in minor releases.
  # File lib/unicode_utils/debug.rb, line 36
  def debug(str, opts = {})
    io = opts[:io] || $stdout
    table = [Impl::DEBUG_COLUMNS.keys]
    if str.kind_of?(Integer)
      table << Impl::DEBUG_COLUMNS.values.map { |f| f.call(str) }
    else
      str.each_codepoint { |cp|
        table << Impl::DEBUG_COLUMNS.values.map { |f| f.call(cp) }
      }
    end
    Impl.print_table(table, io)
    nil
  end
True if the given character has the Unicode property Default_Ignorable_Code_Point (see section 5.3 in Unicode 6.0.0).
When a system (e.g. font) can’t display a default ignorable code point, it is allowed to simply ignore it, i.e. skip it (as opposed to other characters, which must at least be displayed with a replacement character).
  # File lib/unicode_utils/default_ignorable_char_q.rb, line 16
  def default_ignorable_char?(char)
    PROP_DEFAULT_IGNORABLE_SET.include?(char.ord)
  end
Get the width of str when displayed with a fixed pitch font.
Counts code points, where code points with an East Asian width of Wide or Fullwidth count for two, non-graphic code points (e.g. control characters, including newline!) and non-spacing marks count for zero, and all others count for one.
Examples:
  require "unicode_utils/display_width"
  "別れ".length => 2
  UnicodeUtils.display_width("別れ") => 4
  "12".length => 2
  UnicodeUtils.display_width("12") => 2
  "a\u{308}".length => 2
  UnicodeUtils.display_width("a\u{308}") => 1
Unicode assigns some reserved code points an East Asian width of Wide. Some systems correctly display a double width replacement character, others do not.
See also: UnicodeUtils.graphic_char?, UnicodeUtils.east_asian_width
  # File lib/unicode_utils/display_width.rb, line 40
  def display_width(str)
    str.each_codepoint.reduce(0) { |sum, cp|
      sum +
        case UnicodeUtils.east_asian_width(cp)
        when :Wide, :Fullwidth then 2
        else GENERAL_CATEGORY_BASIC_WIDTH_MAP[UnicodeUtils.gc(cp)]
        end
    }
  end
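The counting rule described above can be sketched in a self-contained way. The helper sketch_display_width and its two range lists are hypothetical: they cover only a small illustrative subset of wide and zero-width code points, whereas UnicodeUtils consults the complete East Asian Width and General Category data.

```ruby
# Simplified sketch: the ranges below are an illustrative subset,
# not the full East Asian Width and General Category tables.
def sketch_display_width(str)
  wide = [0x3040..0x309F, 0x4E00..0x9FFF] # Hiragana, CJK Unified Ideographs
  zero = [0x0000..0x001F, 0x0300..0x036F] # C0 controls, combining diacritical marks
  str.each_codepoint.sum do |cp|
    if wide.any? { |r| r.cover?(cp) }
      2 # wide code points occupy two columns
    elsif zero.any? { |r| r.cover?(cp) }
      0 # controls and non-spacing marks occupy none
    else
      1
    end
  end
end

wide_width      = sketch_display_width("\u{5225}\u{308C}") # two wide code points
combining_width = sketch_display_width("a\u{308}")         # base letter plus combining mark
newline_width   = sketch_display_width("a\n")              # the newline counts for zero
```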
Perform a full case-conversion of str to lowercase according to the Unicode standard.
Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt) as a symbol. If a language doesn’t have a two letter code, the three letter code is to be used. If locale independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.
Examples:
  require "unicode_utils/downcase"
  UnicodeUtils.downcase("ᾈ") => "ᾀ"
  UnicodeUtils.downcase("aBI\u{307}", :tr) => "abi"
  # File lib/unicode_utils/downcase.rb, line 27
  def downcase(str, language_id = nil)
    String.new.force_encoding(str.encoding).tap { |res|
      if Impl::LANGS_WITH_RULES.include?(language_id)
        # ensure O(1) lookup by index
        str = str.encode(Encoding::UTF_32LE)
      end
      pos = 0
      str.each_codepoint { |cp|
        special_mapping =
          Impl.conditional_downcase_mapping(cp, str, pos, language_id) ||
          SPECIAL_DOWNCASE_MAP[cp]
        if special_mapping
          special_mapping.each { |m| res << m }
        else
          res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
        end
        pos += 1
      }
    }
  end
Iterate over the grapheme clusters that make up str. A grapheme cluster is a user perceived character (the basic unit of a writing system for a language) and consists of one or more code points.
This method uses the default Unicode algorithm for extended grapheme clusters.
Returns an enumerator if no block is given.
Examples:
  require "unicode_utils/each_grapheme"
  UnicodeUtils.each_grapheme("a\r\nb") { |g| p g }
prints:
  "a"
  "\r\n"
  "b"
and
  UnicodeUtils.each_grapheme("a\r\nb").count => 3
  # File lib/unicode_utils/each_grapheme.rb, line 34
  def each_grapheme(str)
    return enum_for(__method__, str) unless block_given?
    c0 = nil
    c0_prop = nil
    grapheme = String.new.force_encoding(str.encoding)
    str.each_codepoint { |c|
      gbreak = false
      c_prop = GRAPHEME_CLUSTER_BREAK_MAP[c]
      ### rules ###
      if c0_prop == 0x0 && c_prop == 0x1
        # don't break CR LF
      elsif c0_prop == 0x0 || c0_prop == 0x1 || c0_prop == 0x2
        # break after controls
        gbreak = true
      elsif c_prop == 0x0 || c_prop == 0x1 || c_prop == 0x2
        # break before controls
        gbreak = true
      elsif c0_prop == 0x6 && (c_prop == 0x6 || c_prop == 0x7 || c_prop == 0x9 || c_prop == 0xA)
        # don't break hangul syllable
      elsif (c0_prop == 0x9 || c0_prop == 0x7) && (c_prop == 0x7 || c_prop == 0x8)
        # don't break hangul syllable
      elsif (c0_prop == 0xA || c0_prop == 0x8) && c_prop == 0x8
        # don't break hangul syllable
      elsif c0_prop == 0xB && c_prop == 0xB
        # don't break between regional indicator symbols
      elsif c_prop == 0x3
        # don't break before extending characters
      elsif c_prop == 0x5
        # don't break before SpacingMarks
      elsif c0_prop == 0x4
        # don't break after Prepend characters
      else
        # break everywhere
        gbreak = true
      end
      #############
      if gbreak && !grapheme.empty?
        yield grapheme
        grapheme = String.new.force_encoding(str.encoding)
      end
      grapheme << c
      c0 = c
      c0_prop = c_prop
    }
    yield grapheme unless grapheme.empty?
  end
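Ruby 2.5 and later include String#each_grapheme_cluster, which can serve as a stand-in to see the same boundary rules in action without the gem. This is a sketch using the built-in method, not UnicodeUtils.each_grapheme; the two may differ for newer scripts because they track different Unicode versions.

```ruby
# CR LF is kept together as a single grapheme cluster.
clusters = "a\r\nb".each_grapheme_cluster.to_a

# A base letter followed by a combining mark forms one cluster,
# even though the string contains two code points.
combined_clusters = "a\u{308}".each_grapheme_cluster.count
combined_codepoints = "a\u{308}".each_codepoint.count
```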
Split str along word boundaries according to Unicode’s Default Word Boundary Specification, calling the given block with each word. Returns str, or an enumerator if no block is given.
Example:
  require "unicode_utils/each_word"
  UnicodeUtils.each_word("Hello, world!").to_a => ["Hello", ",", " ", "world", "!"]
  # File lib/unicode_utils/each_word.rb, line 19
  def each_word(str)
    return enum_for(__method__, str) unless block_given?
    cs = str.each_codepoint.map { |c| WORD_BREAK_MAP[c] }
    cs << nil << nil # for negative indices
    word = String.new.force_encoding(str.encoding)
    i = 0
    str.each_codepoint { |c|
      word << c
      if Impl.word_break?(cs, i) && !word.empty?
        yield word
        word = String.new.force_encoding(str.encoding)
      end
      i += 1
    }
    yield word unless word.empty?
    str
  end
Returns the default width of the given code point as described in “UAX #11: East Asian Width” (unicode.org/reports/tr11/).
Each code point is mapped to one of the following six symbols: :Neutral, :Ambiguous, :Halfwidth, :Wide, :Fullwidth, :Narrow.
  # File lib/unicode_utils/east_asian_width.rb, line 17
  def east_asian_width(char)
    cp = char.ord
    EAST_ASIAN_WIDTH_RANGES.each { |pair|
      return pair[1] if pair[0].cover?(cp)
    }
    EAST_ASIAN_WIDTH_MAP_PER_CP[cp]
  end
Get the two letter general category alias of the given char. The first letter denotes a major class, the second letter a subclass of the major class.
See section 4.5 in Unicode 6.0.0.
Example:
  require "unicode_utils/gc"
  UnicodeUtils.gc("A") # => :Lu (Letter, uppercase)
Returns nil for ordinals outside the Unicode code point range, a two letter symbol otherwise.
See also: UnicodeUtils.general_category, UnicodeUtils.char_type
  # File lib/unicode_utils/gc.rb, line 27
  def gc(char)
    cp = char.ord
    cat = GENERAL_CATEGORY_PER_CP_MAP[cp] and return cat
    GENERAL_CATEGORY_RANGES.each { |pair|
      return pair[1] if pair[0].cover?(cp)
    }
    if cp >= 0x0 && cp <= 0x10FFFF
      :Cn # Other, not assigned
    else
      nil
    end
  end
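The two letter aliases are the same ones Ruby's regular expression engine accepts in \p{...} property classes, which gives a quick way to experiment with them. The helper sketch_gc below is hypothetical and checks only five categories for illustration; it is not how UnicodeUtils.gc is implemented.

```ruby
# Hypothetical helper: classifies the first character of a string using
# regexp property classes. Only a handful of categories are covered.
def sketch_gc(str)
  patterns = {
    Lu: /\A\p{Lu}/, # Letter, uppercase
    Ll: /\A\p{Ll}/, # Letter, lowercase
    Nd: /\A\p{Nd}/, # Number, decimal digit
    Zs: /\A\p{Zs}/, # Separator, space
    Cc: /\A\p{Cc}/  # Other, control
  }
  patterns.each { |name, pattern| return name if str.match?(pattern) }
  nil # category not covered by this sketch
end

upper   = sketch_gc("A")
digit   = sketch_gc("1")
control = sketch_gc("\n")
```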
Get the long general category alias of char.
Example:
  require "unicode_utils/general_category"
  UnicodeUtils.general_category("A") # => :Uppercase_Letter
Returns a symbol if char is in the Unicode code point range, nil otherwise.
See also: UnicodeUtils.gc, UnicodeUtils.char_type
  # File lib/unicode_utils/general_category.rb, line 21
  def general_category(char)
    GENERAL_CATEGORY_ALIAS_MAP[UnicodeUtils.gc(char)]
  end
Returns true if the given char is a graphic char, false otherwise. See table 2-3 in section 2.4 of Unicode 6.0.0.
Examples:
  require "unicode_utils/graphic_char_q"
  UnicodeUtils.graphic_char?("a") # => true
  UnicodeUtils.graphic_char?("\n") # => false
  UnicodeUtils.graphic_char?(0x0) # => false
  # File lib/unicode_utils/graphic_char_q.rb, line 25
  def graphic_char?(char)
    GENERAL_CATEGORY_IS_GRAPHIC_MAP[UnicodeUtils.gc(char)]
  end
Get an array of all Codepoint instances in Codepoint::RANGE whose name matches regexp. Matching is case insensitive.
  require "unicode_utils/grep"
  UnicodeUtils.grep(/angstrom/) => [#<U+212B "Å" ANGSTROM SIGN utf8:e2,84,ab>]
  # File lib/unicode_utils/grep.rb, line 11
  def grep(regexp)
    # TODO: enhance behaviour by searching aliases in NameAliases.txt
    unless regexp.casefold?
      regexp = Regexp.new(regexp.source, Regexp::IGNORECASE)
    end
    Codepoint::RANGE.select { |cp|
      regexp =~ UnicodeUtils.char_name(cp)
    }.map { |cp| Codepoint.new(cp) }
  end
Derives the canonical decomposition of the given Hangul syllable.
Example:
  require "unicode_utils/hangul_syllable_decomposition"
  UnicodeUtils.hangul_syllable_decomposition("\u{d4db}") => "\u{1111}\u{1171}\u{11b6}"
  # File lib/unicode_utils/hangul_syllable_decomposition.rb, line 10
  def hangul_syllable_decomposition(char)
    String.new.force_encoding(char.encoding).tap do |str|
      Impl.append_hangul_syllable_decomposition(str, char.ord)
    end
  end
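The body of Impl.append_hangul_syllable_decomposition is not shown on this page. The sketch below follows the arithmetic decomposition of Hangul syllables given in the Unicode standard; the helper name and local variable names are hypothetical, and the gem's internal implementation may be structured differently.

```ruby
# Sketch of the arithmetic Hangul syllable decomposition from the Unicode
# standard. Returns a UTF-8 string of two or three conjoining jamo.
def sketch_hangul_decomposition(cp)
  s_base = 0xAC00 # first precomposed syllable
  l_base = 0x1100 # first leading consonant
  v_base = 0x1161 # first vowel
  t_base = 0x11A7 # one before the first trailing consonant
  t_count = 28
  n_count = 21 * t_count # vowels * trailing positions = 588

  s_index = cp - s_base
  l = l_base + s_index / n_count
  v = v_base + (s_index % n_count) / t_count
  t = t_base + s_index % t_count

  parts = [l, v]
  parts << t unless t == t_base # no trailing consonant present
  parts.pack("U*")
end

with_trailing    = sketch_hangul_decomposition(0xD4DB)
without_trailing = sketch_hangul_decomposition(0xAC00)
```

For U+D4DB this produces the same three jamo as the example above; U+AC00 has no trailing consonant and decomposes into two.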
The Jamo Short Name property of the given character (defaults to nil).
Example:
  require "unicode_utils/jamo_short_name"
  UnicodeUtils.jamo_short_name("\u{1101}") => "GG"
  # File lib/unicode_utils/jamo_short_name.rb, line 15
  def jamo_short_name(char)
    JAMO_SHORT_NAME_MAP[char.ord]
  end
True if the given character has the Unicode property Lowercase.
  # File lib/unicode_utils/lowercase_char_q.rb, line 9
  def lowercase_char?(char)
    PROP_LOWERCASE_SET.include?(char.ord)
  end
Get an Enumerable of formal name aliases of the given character. Returns an empty Enumerable if the character doesn’t have an alias.
The aliases are instances of UnicodeUtils::NameAlias. The order of the aliases in the returned Enumerable is preserved from NameAliases.txt in the Unicode Character Database.
Example:
  require "unicode_utils/name_aliases"
  UnicodeUtils.name_aliases("\n").map(&:name) # => ["LINE FEED", "NEW LINE", "END OF LINE", "LF", "NL", "EOL"]
See also: UnicodeUtils.char_name
  # File lib/unicode_utils/name_aliases.rb, line 23
  def name_aliases(char)
    NAME_ALIASES_MAP[char.ord]
  end
Get str in Normalization Form C.
The Unicode standard has multiple representations for some characters. One representation as a single code point and other representation(s) as a combination of multiple code points. This function “composes” these characters into the former representation.
Example:
  require "unicode_utils/nfc"
  UnicodeUtils.nfc("La\u{308}mpchen") => "Lämpchen"
  # File lib/unicode_utils/nfc.rb, line 135
  def nfc(str)
    str = UnicodeUtils.canonical_decomposition(str)
    Impl.composition(str)
  end
Get str in Normalization Form D.
Alias for UnicodeUtils.canonical_decomposition.
  # File lib/unicode_utils/nfd.rb, line 9
  def nfd(str)
    UnicodeUtils.canonical_decomposition(str)
  end
Get str in Normalization Form KC.
Normalization Form KC is compatibility decomposition (NFKD) followed by composition. Like NFKD, this normalization can alter how a string is displayed.
Example:
  require "unicode_utils/nfkc"
  # LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
  UnicodeUtils.nfkc("\u{FB01}") => "fi"
See also: UnicodeUtils.compatibility_decomposition
  # File lib/unicode_utils/nfkc.rb, line 20
  def nfkc(str)
    str = UnicodeUtils.compatibility_decomposition(str)
    Impl.composition(str)
  end
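The difference between the canonical and compatibility forms can be observed with Ruby's built-in String#unicode_normalize, used here as a stand-in for UnicodeUtils.nfc and UnicodeUtils.nfkc. This is a sketch only; the built-in and the gem may be based on different Unicode versions.

```ruby
# NFC composes "a" + COMBINING DIAERESIS into the single code point U+00E4.
composed = "La\u{308}mpchen".unicode_normalize(:nfc)

# The fi ligature (U+FB01) has only a compatibility decomposition:
# NFC leaves it untouched, NFKC replaces it with the two letters "fi".
ligature_nfc  = "\u{FB01}".unicode_normalize(:nfc)
ligature_nfkc = "\u{FB01}".unicode_normalize(:nfkc)
```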
Get str in Normalization Form KD.
Alias for UnicodeUtils.compatibility_decomposition.
  # File lib/unicode_utils/nfkd.rb, line 9
  def nfkd(str)
    UnicodeUtils.compatibility_decomposition(str)
  end
Returns a unique string identifier for every code point. Returns nil if code_point is not in the Unicode codespace. code_point must be an Integer.
The returned string identifier is either the non-empty Name property value of code_point, a non-empty Name_Alias string property value of code_point, or the code point label as described by section “Code Point Labels” in chapter 4.8 “Name” of the Unicode standard.
If the returned identifier starts with “<”, it is a code point label and it ends with “>”. Otherwise it is the normative name or a formal alias string.
The exact name/alias/label selection algorithm may change even in minor UnicodeUtils releases, but overall behaviour will stay the same in spirit.
The selection process in this version of UnicodeUtils is:
- Use an alias of type :correction, :control, :figment or :alternate (with listed precedence) if available.
- Use the Unicode Name property value if it is not empty.
- Construct a code point label in angle brackets.
Examples:
  require "unicode_utils/sid"
  U.sid 0xa    # => "LINE FEED"
  U.sid 0x0    # => "NULL"
  U.sid 0xfeff # => "BYTE ORDER MARK"
  U.sid 0xe000 # => "<private-use-E000>"
  U.sid 0x61   # => "LATIN SMALL LETTER A"
  U.sid -1     # => nil
  # File lib/unicode_utils/sid.rb, line 52
  def sid(code_point)
    s = CP_PREFERRED_ALIAS_STRING_MAP[code_point] and return s
    cn = UnicodeUtils.char_name(code_point)
    return cn if cn && cn !~ /\A(\<|\z)/
    ct = UnicodeUtils.code_point_type(code_point) or return nil
    ts = ct.to_s.downcase.gsub('_', '-')
    "<#{ts}-#{code_point.to_s(16).upcase.rjust(4, '0')}>"
  end
Perform simple case folding. Contrary to full case folding, this uses only one to one mappings, so that the length of the returned string is equal to the length of str.
The purpose of case folding is case insensitive string comparison.
Examples:
  require "unicode_utils/simple_casefold"
  UnicodeUtils.simple_casefold("Ümit") == UnicodeUtils.simple_casefold("ümit") => true
  UnicodeUtils.simple_casefold("WEISS") == UnicodeUtils.simple_casefold("weiß") => false
See also: UnicodeUtils.casefold
  # File lib/unicode_utils/simple_casefold.rb, line 23
  def simple_casefold(str)
    String.new.force_encoding(str.encoding).tap do |res|
      str.each_codepoint { |cp|
        res << (CASEFOLD_C_MAP[cp] || CASEFOLD_S_MAP[cp] || cp)
      }
    end
  end
Map each code point in str that has a single code point lowercase-mapping to that lowercase mapping. The returned string has the same length as the original string.
This function is locale independent.
Examples:
  require "unicode_utils/simple_downcase"
  UnicodeUtils.simple_downcase("ÜMIT: 123") => "ümit: 123"
  UnicodeUtils.simple_downcase("STRASSE") => "strasse"
  # File lib/unicode_utils/simple_downcase.rb, line 19
  def simple_downcase(str)
    String.new.force_encoding(str.encoding).tap { |res|
      str.each_codepoint { |cp|
        res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
      }
    }
  end
Map each code point in str that has a single code point uppercase-mapping to that uppercase mapping. The returned string has the same length as the original string.
This function is locale independent.
Examples:
  require "unicode_utils/simple_upcase"
  UnicodeUtils.simple_upcase("ümit: 123") => "ÜMIT: 123"
  UnicodeUtils.simple_upcase("weiß") => "WEIß"
  # File lib/unicode_utils/simple_upcase.rb, line 19
  def simple_upcase(str)
    String.new.force_encoding(str.encoding).tap { |res|
      str.each_codepoint { |cp|
        res << (SIMPLE_UPCASE_MAP[cp] || cp)
      }
    }
  end
Returns true if the given character has the Unicode property Soft_Dotted.
  # File lib/unicode_utils/soft_dotted_char_q.rb, line 10
  def soft_dotted_char?(char)
    SOFT_DOTTED_SET.include?(char.ord)
  end
Convert the first cased character after each word boundary to titlecase and all other cased characters to lowercase. For many, but not all characters, the titlecase mapping is the same as the uppercase mapping.
Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt) as a symbol. If a language doesn’t have a two letter code, the three letter code is to be used. If locale independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.
Example:
  require "unicode_utils/titlecase"
  UnicodeUtils.titlecase("hello, world!") => "Hello, World!"
  # File lib/unicode_utils/titlecase.rb, line 31
  def titlecase(str, language_id = nil)
    String.new.force_encoding(str.encoding).tap do |res|
      # ensure O(1) lookup by index
      str = str.encode(Encoding::UTF_32LE)
      i = 0
      each_word(str) { |word|
        cased_char_found = false
        word.each_codepoint { |cp|
          cased = cased_char?(cp)
          if !cased_char_found && cased
            cased_char_found = true
            special_mapping =
              Impl.conditional_titlecase_mapping(cp, str, i, language_id) ||
              SPECIAL_TITLECASE_MAP[cp]
            if special_mapping
              special_mapping.each { |m| res << m }
            else
              res << (SIMPLE_TITLECASE_MAP[cp] || cp)
            end
          elsif cased
            special_mapping =
              Impl.conditional_downcase_mapping(cp, str, i, language_id) ||
              SPECIAL_DOWNCASE_MAP[cp]
            if special_mapping
              special_mapping.each { |m| res << m }
            else
              res << (SIMPLE_DOWNCASE_MAP[cp] || cp)
            end
          else
            res << cp
          end
          i += 1
        }
      }
    end
  end
True if the given character has the General_Category Titlecase_Letter (Lt).
  # File lib/unicode_utils/titlecase_char_q.rb, line 10
  def titlecase_char?(char)
    TITLECASE_LETTER_SET.include?(char.ord)
  end
Perform a full case-conversion of str to uppercase according to the Unicode standard.
Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt) as a symbol. If a language doesn’t have a two letter code, the three letter code is to be used. If locale independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.
Examples:
  require "unicode_utils/upcase"
  UnicodeUtils.upcase("weiß") => "WEISS"
  UnicodeUtils.upcase("i", :en) => "I"
  UnicodeUtils.upcase("i", :tr) => "İ"
  # File lib/unicode_utils/upcase.rb, line 28
  def upcase(str, language_id = nil)
    String.new.force_encoding(str.encoding).tap { |res|
      if Impl::LANGS_WITH_RULES.include?(language_id)
        # ensure O(1) lookup by index
        str = str.encode(Encoding::UTF_32LE)
      end
      pos = 0
      str.each_codepoint { |cp|
        special_mapping =
          Impl.conditional_upcase_mapping(cp, str, pos, language_id) ||
          SPECIAL_UPCASE_MAP[cp]
        if special_mapping
          special_mapping.each { |m| res << m }
        else
          res << (SIMPLE_UPCASE_MAP[cp] || cp)
        end
        pos += 1
      }
    }
  end
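Ruby 2.4 and later offer comparable behaviour through String#upcase, which performs full case mapping and accepts a :turkic option for the Turkic dotted and dotless i. The sketch below uses those built-ins as a stand-in; note that the option name differs from the language_id symbols (:tr, :en) that UnicodeUtils.upcase expects.

```ruby
# Full mapping: U+00DF (sharp s) becomes two letters, so the result is longer.
full_mapping = "wei\u{DF}".upcase

# Language independent mapping of "i".
default_i = "i".upcase

# Turkic mapping of "i" yields LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130).
turkic_i = "i".upcase(:turkic)
```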
True if the given character has the Unicode property Uppercase.
  # File lib/unicode_utils/uppercase_char_q.rb, line 9
  def uppercase_char?(char)
    PROP_UPPERCASE_SET.include?(char.ord)
  end
True if the given character has the Unicode property White_Space.
Example:
  require "unicode_utils/general_category"
  require "unicode_utils/white_space_char_q"
  UnicodeUtils.general_category("\n") => :Control
  UnicodeUtils.white_space_char?("\n") => true
  # File lib/unicode_utils/white_space_char_q.rb, line 17
  def white_space_char?(char)
    WHITE_SPACE_SET.include?(char.ord)
  end
Private Instance Methods
Get the canonical decomposition of the given string, also called Normalization Form D or short NFD.
The Unicode standard has multiple representations for some characters. One representation as a single code point and other representation(s) as a combination of multiple code points. This function “decomposes” these characters in str
into the latter representation.
Example:
require "unicode_utils/canonical_decomposition" # LATIN SMALL LETTER A WITH ACUTE => LATIN SMALL LETTER A, COMBINING ACUTE ACCENT UnicodeUtils.canonical_decomposition("\u{E1}") => "\u{61}\u{301}"
See also: UnicodeUtils.nfd
# File lib/unicode_utils/canonical_decomposition.rb, line 27 def canonical_decomposition(str) res = String.new.force_encoding(str.encoding) str.each_codepoint { |cp| if cp >= 0xAC00 && cp <= 0xD7A3 # hangul syllable Impl.append_hangul_syllable_decomposition(res, cp) else mapping = CANONICAL_DECOMPOSITION_MAP[cp] if mapping Impl.append_recursive_canonical_decomposition_mapping(res, mapping) else res << cp end end } Impl.put_into_canonical_order(res) end
The strings a
and b
are canonical equivalents if their canonical decompositions are equal.
Example:
require "unicode_utils/canonical_equivalents_q" UnicodeUtils.canonical_equivalents?("Äste", "A\u{308}ste") => true UnicodeUtils.canonical_equivalents?("Äste", "Aste") => false
# File lib/unicode_utils/canonical_equivalents_q.rb, line 14 def canonical_equivalents?(a, b) UnicodeUtils.canonical_decomposition(a) == UnicodeUtils.canonical_decomposition(b) end
Returns true if the given character is case-ignorable as defined by Unicode 5.0, section 3.13.
# File lib/unicode_utils/case_ignorable_char_q.rb, line 10 def case_ignorable_char?(char) CASE_IGNORABLE_SET.include?(char.ord) end
A cased char is a character that has the Unicode property Lowercase or Uppercase or the general category Titlecase_Letter.
See also: lowercase_char?, uppercase_char?, titlecase_char?
# File lib/unicode_utils/cased_char_q.rb, line 12 def cased_char?(char) lowercase_char?(char) || uppercase_char?(char) || titlecase_char?(char) end
Perform full case folding. The returned string may be longer than str
. The purpose of case folding is case insensitive string comparison.
Examples:
require "unicode_utils/casefold" UnicodeUtils.casefold("Ümit") == UnicodeUtils.casefold("ümit") => true UnicodeUtils.casefold("WEISS") == UnicodeUtils.casefold("weiß") => true
# File lib/unicode_utils/casefold.rb, line 18 def casefold(str) String.new.force_encoding(str.encoding).tap do |res| str.each_codepoint { |cp| if mapping = CASEFOLD_C_MAP[cp] res << mapping elsif mapping = CASEFOLD_F_MAP[cp] mapping.each { |m| res << m } else res << cp end } end end
Get the width of char
when displayed with a fixed pitch font.
Some code points (especially from east asian scripts) take the width of two characters, while others have no width.
Examples:
require "unicode_utils/char_display_width" UnicodeUtils.char_display_width("別") # => 2 UnicodeUtils.char_display_width(0x308) # => 0 UnicodeUtils.char_display_width("a") # => 1
Performs the same logic as UnicodeUtils.display_width
, but for a single code point.
# File lib/unicode_utils/char_display_width.rb, line 20 def char_display_width(char) cp = char.ord # copied from display_width, keep in sync! case UnicodeUtils.east_asian_width(cp) when :Wide, :Fullwidth then 2 else GENERAL_CATEGORY_BASIC_WIDTH_MAP[UnicodeUtils.gc(cp)] end end
Get the normative Unicode name of the given character.
Private Use code points have no name; this function returns nil for such code points.
All control characters have the special name “<control>”. All other characters have a unique name.
Example:
require "unicode_utils/char_name" UnicodeUtils.char_name "ᾀ" => "GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI" UnicodeUtils.char_name "\t" => "<control>"
Note that this method deviates from the Unicode Name property in two respects:
-
It returns “<control>” for control codes, whereas the Unicode Name property for these code points is an empty string
-
It returns nil for other non-graphic, non-format code points, whereas the Unicode Name property for these code points is an empty string
See also: UnicodeUtils.sid
# File lib/unicode_utils/char_name.rb, line 33
def char_name(char)
  # TODO: improve with code point labels, see section 4.8 in Unicode 6.0.0
  if char.kind_of?(Integer)
    cp = char
    str = nil
  else
    cp = char.ord
    str = char
  end
  NAME_MAP[cp] ||
    case cp
    when 0x3400..0x4DB5, 0x4E00..0x9FCC, 0x20000..0x2A6D6, 0x2A700..0x2B734, 0x2B740..0x2B81D
      "CJK UNIFIED IDEOGRAPH-#{sprintf('%04X', cp)}"
    when 0xAC00..0xD7A3
      str ||= cp.chr(Encoding::UTF_8)
      "HANGUL SYLLABLE ".tap do |n|
        hangul_syllable_decomposition(str).each_char { |c|
          n << (jamo_short_name(c) || '')
        }
      end
    end
end
Get the long major general category alias of char.
Example:
require "unicode_utils/char_type" UnicodeUtils.char_type("1") # => :Number
Always returns a symbol when char is in the Unicode code point range.
See also: UnicodeUtils.general_category
# File lib/unicode_utils/char_type.rb, line 26 def char_type(char) GENERAL_CATEGORY_TYPE_MAP[UnicodeUtils.gc(char)] end
Get the code point type of the given integer (must be an instance of Integer) as defined by the Unicode standard.
If integer is a code point (anything in UnicodeUtils::Codepoint::RANGE), returns one of the following symbols:
:Graphic :Format :Control :Private_Use :Surrogate :Noncharacter :Reserved
For an exact meaning of these values, read the sections “Conformance/Characters and Encoding” and “General Structure/Types of Codepoints” in the Unicode standard.
Following is a paraphrased excerpt:
Surrogate, Noncharacter and Reserved code points are not assigned to an _abstract character_. All other code points are assigned to an abstract character.
Reserved code points are also called Undesignated code points; all others are Designated code points.
Returns nil if integer is not a code point.
# File lib/unicode_utils/code_point_type.rb, line 60 def code_point_type(integer) cpt = GENERAL_CATEGORY_CODE_POINT_TYPE[UnicodeUtils.gc(integer)] if false == cpt cpt = CN_CODE_POINT_TYPE[integer] end cpt end
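Three of the types listed above (Surrogate, Noncharacter and Private_Use) are fixed by the layout of the codespace and can be recognised without any data files. The sketch below illustrates only that part; it is not how UnicodeUtils implements code_point_type, the method name structural_code_point_type is hypothetical, and it returns :Unknown wherever the Unicode Character Database would be needed.

```ruby
# Hypothetical helper (not part of UnicodeUtils): classify only the code
# point types that follow from the structure of the codespace.
def structural_code_point_type(cp)
  return nil unless cp.is_a?(Integer) && (0x0..0x10FFFF).cover?(cp)
  return :Surrogate if (0xD800..0xDFFF).cover?(cp)
  # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
  return :Noncharacter if (0xFDD0..0xFDEF).cover?(cp) || (cp & 0xFFFE) == 0xFFFE
  private_use = [0xE000..0xF8FF, 0xF0000..0xFFFFD, 0x100000..0x10FFFD]
  return :Private_Use if private_use.any? { |range| range.cover?(cp) }
  # Graphic, Format, Control and Reserved require per-code-point data.
  :Unknown
end

p structural_code_point_type(0xD800)
p structural_code_point_type(0xFFFE)
p structural_code_point_type(0xE000)
```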
Get the combining class of the given character as an integer in the range 0..255.
# File lib/unicode_utils/combining_class.rb, line 11 def combining_class(char) COMBINING_CLASS_MAP[char.ord] end
Get the compatibility decomposition of the given string, also called Normalization Form KD or short NFKD.
Compatibility decomposition decomposes more code points than canonical decomposition and, unlike Normalization Forms D and C, it can alter how a string is displayed.
Example:
require "unicode_utils/compatibility_decomposition"
# LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
UnicodeUtils.compatibility_decomposition("\u{FB01}") => "fi"
See also: UnicodeUtils.nfkd
# File lib/unicode_utils/compatibility_decomposition.rb, line 25 def compatibility_decomposition(str) res = String.new.force_encoding(str.encoding) str.each_codepoint { |cp| if cp >= 0xAC00 && cp <= 0xD7A3 # hangul syllable Impl.append_hangul_syllable_decomposition(res, cp) else Impl.append_recursive_compatibility_decomposition_mapping(res, cp) end } Impl.put_into_canonical_order(res) end
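The difference between canonical and compatibility decomposition can also be observed with Ruby's built-in String#unicode_normalize (Ruby 2.2 and later). This sketch uses only that standard method, not UnicodeUtils, and the variable names are illustrative.

```ruby
ligature = "\u{FB01}" # LATIN SMALL LIGATURE FI

# NFKD replaces the ligature with the two letters it stands for.
decomposed = ligature.unicode_normalize(:nfkd)
p decomposed.codepoints.map { |cp| format("U+%04X", cp) }

# NFD has no canonical mapping for the ligature and leaves it untouched.
p ligature.unicode_normalize(:nfd) == ligature
```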
Print a table with detailed information about each code point in str. opts can have the following keys:
:io
-
An IO compatible object. Receives the output. Defaults to $stdout.
str may also be an Integer, in which case it is interpreted as a single code point that must be in UnicodeUtils::Codepoint::RANGE.
Examples:
$ ruby -r unicode_utils/u -e 'U.debug "良い一日"'
 Char | Ordinal | Sid                        | General Category | UTF-8
------+---------+----------------------------+------------------+----------
 "良" | 826F    | CJK UNIFIED IDEOGRAPH-826F | Other_Letter     | E8 89 AF
 "い" | 3044    | HIRAGANA LETTER I          | Other_Letter     | E3 81 84
 "一" | 4E00    | CJK UNIFIED IDEOGRAPH-4E00 | Other_Letter     | E4 B8 80
 "日" | 65E5    | CJK UNIFIED IDEOGRAPH-65E5 | Other_Letter     | E6 97 A5

$ ruby -r unicode_utils/u -e 'U.debug 0xd800'
 Char | Ordinal | Sid              | General Category | UTF-8
------+---------+------------------+------------------+-------
 N/A  | D800    | <surrogate-D800> | Surrogate        | N/A
The output is purely informational and may change even in minor releases.
# File lib/unicode_utils/debug.rb, line 36 def debug(str, opts = {}) io = opts[:io] || $stdout table = [Impl::DEBUG_COLUMNS.keys] if str.kind_of?(Integer) table << Impl::DEBUG_COLUMNS.values.map { |f| f.call(str) } else str.each_codepoint { |cp| table << Impl::DEBUG_COLUMNS.values.map { |f| f.call(cp) } } end Impl.print_table(table, io) nil end
True if the given character has the Unicode property Default_Ignorable_Code_Point (see section 5.3 in Unicode 6.0.0).
When a system (e.g. font) can’t display a default ignorable code point, it is allowed to simply ignore, i.e. skip it (as opposed to other characters, which must at least be displayed with a replacement character).
# File lib/unicode_utils/default_ignorable_char_q.rb, line 16 def default_ignorable_char?(char) PROP_DEFAULT_IGNORABLE_SET.include?(char.ord) end
Get the width of str when displayed with a fixed pitch font.
Counts code points, where code points with an East Asian width of Wide or Fullwidth count for two, non-graphic code points (e.g. control characters, including newline!) and non-spacing marks count for zero, and all others count for one.
Examples:
require "unicode_utils/display_width"
"別れ".length => 2
UnicodeUtils.display_width("別れ") => 4
"12".length => 2
UnicodeUtils.display_width("12") => 2
"a\u{308}".length => 2
UnicodeUtils.display_width("a\u{308}") => 1
Unicode assigns some reserved code points an East Asian width of Wide. Some systems correctly display a double width replacement character, others do not.
See also: UnicodeUtils.graphic_char?, UnicodeUtils.east_asian_width
# File lib/unicode_utils/display_width.rb, line 40 def display_width(str) str.each_codepoint.reduce(0) { |sum, cp| sum + case UnicodeUtils.east_asian_width(cp) when :Wide, :Fullwidth then 2 else GENERAL_CATEGORY_BASIC_WIDTH_MAP[UnicodeUtils.gc(cp)] end } end
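The counting rule described above (two columns for Wide or Fullwidth, zero for non-graphic code points and non-spacing marks, one otherwise) can be sketched without the gem. This is a drastically simplified illustration: the method name approx_display_width is hypothetical, the list of wide ranges is a small hand-picked excerpt rather than the real East_Asian_Width data, and the zero-width test covers only the Mn, Cc and Cf categories.

```ruby
# Hypothetical helper (not part of UnicodeUtils): approximate the display
# width of a string with a tiny excerpt of the East Asian Width data.
def approx_display_width(str)
  # Excerpt only: Hiragana, CJK Unified Ideographs, Fullwidth ASCII forms.
  wide_ranges = [0x3041..0x3096, 0x4E00..0x9FFF, 0xFF01..0xFF60]
  # Non-spacing marks, control characters and format characters.
  zero_width = /\A[\p{Mn}\p{Cc}\p{Cf}]\z/

  str.each_char.sum do |ch|
    if wide_ranges.any? { |range| range.cover?(ch.ord) }
      2
    elsif ch.match?(zero_width)
      0
    else
      1
    end
  end
end

p approx_display_width("別れ")
p approx_display_width("a\u{308}")
```

As in the source of display_width, the wide check comes first and the general category only decides between zero and one for everything else.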
Perform a full case-conversion of str to lowercase according to the Unicode standard.
Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt) as a symbol. If a language doesn’t have a two letter code, the three letter code is to be used. If locale independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.
Examples:
require "unicode_utils/downcase" UnicodeUtils.downcase("ᾈ") => "ᾀ" UnicodeUtils.downcase("aBI\u{307}", :tr) => "abi"
# File lib/unicode_utils/downcase.rb, line 27 def downcase(str, language_id = nil) String.new.force_encoding(str.encoding).tap { |res| if Impl::LANGS_WITH_RULES.include?(language_id) # ensure O(1) lookup by index str = str.encode(Encoding::UTF_32LE) end pos = 0 str.each_codepoint { |cp| special_mapping = Impl.conditional_downcase_mapping(cp, str, pos, language_id) || SPECIAL_DOWNCASE_MAP[cp] if special_mapping special_mapping.each { |m| res << m } else res << (SIMPLE_DOWNCASE_MAP[cp] || cp) end pos += 1 } } end
Iterate over the grapheme clusters that make up str. A grapheme cluster is a user perceived character (the basic unit of a writing system for a language) and consists of one or more code points.
This method uses the default Unicode algorithm for extended grapheme clusters.
Returns an enumerator if no block is given.
Examples:
require "unicode_utils/each_grapheme" UnicodeUtils.each_grapheme("a\r\nb") { |g| p g }
prints:
"a" "\r\n" "b"
and
UnicodeUtils.each_grapheme("a\r\nb").count => 3
# File lib/unicode_utils/each_grapheme.rb, line 34
def each_grapheme(str)
  return enum_for(__method__, str) unless block_given?
  c0 = nil
  c0_prop = nil
  grapheme = String.new.force_encoding(str.encoding)
  str.each_codepoint { |c|
    gbreak = false
    c_prop = GRAPHEME_CLUSTER_BREAK_MAP[c]
    ### rules ###
    if c0_prop == 0x0 && c_prop == 0x1
      # don't break CR LF
    elsif c0_prop == 0x0 || c0_prop == 0x1 || c0_prop == 0x2
      # break after controls
      gbreak = true
    elsif c_prop == 0x0 || c_prop == 0x1 || c_prop == 0x2
      # break before controls
      gbreak = true
    elsif c0_prop == 0x6 && (c_prop == 0x6 || c_prop == 0x7 || c_prop == 0x9 || c_prop == 0xA)
      # don't break hangul syllable
    elsif (c0_prop == 0x9 || c0_prop == 0x7) && (c_prop == 0x7 || c_prop == 0x8)
      # don't break hangul syllable
    elsif (c0_prop == 0xA || c0_prop == 0x8) && c_prop == 0x8
      # don't break hangul syllable
    elsif c0_prop == 0xB && c_prop == 0xB
      # don't break between regional indicator symbols
    elsif c_prop == 0x3
      # don't break before extending characters
    elsif c_prop == 0x5
      # don't break before SpacingMarks
    elsif c0_prop == 0x4
      # don't break after Prepend characters
    else
      # break everywhere
      gbreak = true
    end
    #############
    if gbreak && !grapheme.empty?
      yield grapheme
      grapheme = String.new.force_encoding(str.encoding)
    end
    grapheme << c
    c0 = c
    c0_prop = c_prop
  }
  yield grapheme unless grapheme.empty?
end
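Ruby 2.5 and later ship a comparable built-in, String#each_grapheme_cluster, which also follows the extended grapheme cluster rules (against whatever Unicode version the Ruby release supports, which may differ from 6.2.0). The sketch below uses only that built-in, not UnicodeUtils, to show the same CR LF behaviour as the example above.

```ruby
# CR LF is kept together as one cluster.
clusters = "a\r\nb".each_grapheme_cluster.to_a
p clusters

# A base letter followed by a combining mark is a single cluster,
# although it consists of two code points.
combined = "a\u{308}"
p combined.length
p combined.each_grapheme_cluster.count
```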
Split str along word boundaries according to Unicode’s Default Word Boundary Specification, calling the given block with each word. Returns str, or an enumerator if no block is given.
Example:
require "unicode_utils/each_word" UnicodeUtils.each_word("Hello, world!").to_a => ["Hello", ",", " ", "world", "!"]
# File lib/unicode_utils/each_word.rb, line 19 def each_word(str) return enum_for(__method__, str) unless block_given? cs = str.each_codepoint.map { |c| WORD_BREAK_MAP[c] } cs << nil << nil # for negative indices word = String.new.force_encoding(str.encoding) i = 0 str.each_codepoint { |c| word << c if Impl.word_break?(cs, i) && !word.empty? yield word word = String.new.force_encoding(str.encoding) end i += 1 } yield word unless word.empty? str end
Returns the default width of the given code point as described in “UAX #11: East Asian Width” (unicode.org/reports/tr11/).
Each code point is mapped to one of the following six symbols: :Neutral, :Ambiguous, :Halfwidth, :Wide, :Fullwidth, :Narrow.
# File lib/unicode_utils/east_asian_width.rb, line 17 def east_asian_width(char) cp = char.ord EAST_ASIAN_WIDTH_RANGES.each { |pair| return pair[1] if pair[0].cover?(cp) } EAST_ASIAN_WIDTH_MAP_PER_CP[cp] end
Get the two letter general category alias of the given char. The first letter denotes a major class, the second letter a subclass of the major class.
See section 4.5 in Unicode 6.0.0.
Example:
require "unicode_utils/gc" UnicodeUtils.gc("A") # => :Lu (Letter, uppercase)
Returns nil for ordinals outside the Unicode code point range, a two letter symbol otherwise.
See also: UnicodeUtils.general_category, UnicodeUtils.char_type
# File lib/unicode_utils/gc.rb, line 27 def gc(char) cp = char.ord cat = GENERAL_CATEGORY_PER_CP_MAP[cp] and return cat GENERAL_CATEGORY_RANGES.each { |pair| return pair[1] if pair[0].cover?(cp) } if cp >= 0x0 && cp <= 0x10FFFF :Cn # Other, not assigned else nil end end
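The lookup order in the source above (exact per-code-point table first, then range table, then a :Cn fallback inside the codespace) can be tried out in isolation. The following is a toy sketch under stated assumptions: the method name toy_gc and its two tiny tables are made up for illustration and contain only a handful of entries, unlike the gem's compiled data.

```ruby
# Hypothetical miniature of the three-step lookup (not part of UnicodeUtils).
def toy_gc(char)
  per_cp = { 0x31 => :Nd, 0x41 => :Lu, 0x61 => :Ll } # tiny excerpt
  ranges = [[0x4E00..0x9FCC, :Lo]]                   # CJK Unified Ideographs
  cp = char.ord # Integer#ord returns the integer itself

  # 1. exact match for a single code point
  return per_cp[cp] if per_cp.key?(cp)
  # 2. code point ranges that share one category
  ranges.each { |range, category| return category if range.cover?(cp) }
  # 3. everything else inside the codespace counts as unassigned
  (0x0..0x10FFFF).cover?(cp) ? :Cn : nil
end

p toy_gc("A")
p toy_gc(0x4E00)
p toy_gc(0x110000)
```

Storing large uniform blocks as ranges keeps the per-code-point table small, at the cost of a linear scan over the range list.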
Get the long general category alias of char.
Example:
require "unicode_utils/general_category" UnicodeUtils.general_category("A") # => :Uppercase_Letter
Returns a symbol if char is in the Unicode code point range, nil otherwise.
See also: UnicodeUtils.gc, UnicodeUtils.char_type
# File lib/unicode_utils/general_category.rb, line 21 def general_category(char) GENERAL_CATEGORY_ALIAS_MAP[UnicodeUtils.gc(char)] end
Returns true if the given char is a graphic char, false otherwise. See table 2-3 in section 2.4 of Unicode 6.0.0.
Examples:
require "unicode_utils/graphic_char_q" UnicodeUtils.graphic_char?("a") # => true UnicodeUtils.graphic_char?("\n") # => false UnicodeUtils.graphic_char?(0x0) # => false
# File lib/unicode_utils/graphic_char_q.rb, line 25 def graphic_char?(char) GENERAL_CATEGORY_IS_GRAPHIC_MAP[UnicodeUtils.gc(char)] end
Get an array of all Codepoint instances in Codepoint::RANGE whose name matches regexp. Matching is case insensitive.
require "unicode_utils/grep" UnicodeUtils.grep(/angstrom/) => [#<U+212B "Å" ANGSTROM SIGN utf8:e2,84,ab>]
# File lib/unicode_utils/grep.rb, line 11 def grep(regexp) # TODO: enhance behaviour by searching aliases in NameAliases.txt unless regexp.casefold? regexp = Regexp.new(regexp.source, Regexp::IGNORECASE) end Codepoint::RANGE.select { |cp| regexp =~ UnicodeUtils.char_name(cp) }.map { |cp| Codepoint.new(cp) } end
Derives the canonical decomposition of the given Hangul syllable.
Example:
require "unicode_utils/hangul_syllable_decomposition" UnicodeUtils.hangul_syllable_decomposition("\u{d4db}") => "\u{1111}\u{1171}\u{11b6}"
# File lib/unicode_utils/hangul_syllable_decomposition.rb, line 10 def hangul_syllable_decomposition(char) String.new.force_encoding(char.encoding).tap do |str| Impl.append_hangul_syllable_decomposition(str , char.ord) end end
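Hangul syllables are decomposed arithmetically rather than by table lookup. The sketch below follows the arithmetic given in the Unicode standard's chapter on Hangul syllable decomposition; it is not the gem's Impl.append_hangul_syllable_decomposition, and the method name hangul_decompose is hypothetical. It works on integers and returns an array of code points.

```ruby
# Hypothetical helper (not part of UnicodeUtils): decompose a precomposed
# Hangul syllable into its leading consonant, vowel and optional trailing
# consonant, using the constants from the Unicode standard.
def hangul_decompose(cp)
  s_base, l_base, v_base, t_base = 0xAC00, 0x1100, 0x1161, 0x11A7
  l_count, v_count, t_count = 19, 21, 28
  n_count = v_count * t_count # 588 syllables per leading consonant
  s_count = l_count * n_count # 11172 syllables in total

  s_index = cp - s_base
  return [cp] unless (0...s_count).cover?(s_index) # not a Hangul syllable

  l = l_base + s_index / n_count
  v = v_base + (s_index % n_count) / t_count
  t = t_base + s_index % t_count
  t == t_base ? [l, v] : [l, v, t] # t_base itself means "no trailing consonant"
end

p hangul_decompose(0xD4DB).map { |c| format("U+%04X", c) }
```

For U+D4DB this yields the same three code points as the example above.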
The Jamo Short Name property of the given character (defaults to nil).
Example:
require "unicode_utils/jamo_short_name" UnicodeUtils.jamo_short_name("\u{1101}") => "GG"
# File lib/unicode_utils/jamo_short_name.rb, line 15 def jamo_short_name(char) JAMO_SHORT_NAME_MAP[char.ord] end
True if the given character has the Unicode property Lowercase.
# File lib/unicode_utils/lowercase_char_q.rb, line 9 def lowercase_char?(char) PROP_LOWERCASE_SET.include?(char.ord) end
Get an Enumerable of formal name aliases of the given character. Returns an empty Enumerable if the character doesn’t have an alias.
The aliases are instances of UnicodeUtils::NameAlias; the order of the aliases in the returned Enumerable is preserved from NameAliases.txt in the Unicode Character Database.
Example:
require "unicode_utils/name_aliases" UnicodeUtils.name_aliases("\n").map(&:name) # => ["LINE FEED", "NEW LINE", "END OF LINE", "LF", "NL", "EOL"]
See also: UnicodeUtils.char_name
# File lib/unicode_utils/name_aliases.rb, line 23 def name_aliases(char) NAME_ALIASES_MAP[char.ord] end
Get str in Normalization Form C.
The Unicode standard has multiple representations for some characters. One representation as a single code point and other representation(s) as a combination of multiple code points. This function “composes” these characters into the former representation.
Example:
require "unicode_utils/nfc" UnicodeUtils.nfc("La\u{308}mpchen") => "Lämpchen"
# File lib/unicode_utils/nfc.rb, line 135 def nfc(str) str = UnicodeUtils.canonical_decomposition(str) Impl.composition(str) end
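The effect of composition is easy to observe with Ruby's built-in String#unicode_normalize (Ruby 2.2 and later). This sketch uses that standard method, not UnicodeUtils, on the same input as the example above; the variable names are illustrative.

```ruby
decomposed = "La\u{308}mpchen" # "a" followed by COMBINING DIAERESIS
composed = decomposed.unicode_normalize(:nfc)

p decomposed.length # one code point more than the composed form
p composed.length
# Decomposing the composed string gives back the original sequence.
p composed.unicode_normalize(:nfd) == decomposed
```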
Get str in Normalization Form D.
Alias for UnicodeUtils.canonical_decomposition.
# File lib/unicode_utils/nfd.rb, line 9 def nfd(str) UnicodeUtils.canonical_decomposition(str) end
Get str in Normalization Form KC.
Normalization Form KC is compatibility decomposition (NFKD) followed by composition. Like NFKD, this normalization can alter how a string is displayed.
Example:
require "unicode_utils/nfkc"
# LATIN SMALL LIGATURE FI => LATIN SMALL LETTER F, LATIN SMALL LETTER I
UnicodeUtils.nfkc("\u{FB01}") => "fi"
See also: UnicodeUtils.compatibility_decomposition
# File lib/unicode_utils/nfkc.rb, line 20 def nfkc(str) str = UnicodeUtils.compatibility_decomposition(str) Impl.composition(str) end
Get str in Normalization Form KD.
Alias for UnicodeUtils.compatibility_decomposition.
# File lib/unicode_utils/nfkd.rb, line 9 def nfkd(str) UnicodeUtils.compatibility_decomposition(str) end
Returns a unique string identifier for every code point. Returns nil if code_point is not in the Unicode codespace. code_point must be an Integer.
The returned string identifier is either the non-empty Name property value of code_point, a non-empty Name_Alias string property value of code_point, or the code point label as described in the section “Code Point Labels” in chapter 4.8 “Name” of the Unicode standard.
If the returned identifier starts with “<”, it is a code point label and it ends with “>”. Otherwise it is the normative name or a formal alias string.
The exact name/alias/label selection algorithm may change even in minor UnicodeUtils releases, but overall behaviour will stay the same in spirit.
The selection process in this version of UnicodeUtils is:
-
Use an alias of type :correction, :control, :figment or :alternate (in that order of precedence) if available
-
Use the Unicode Name property value if it is not empty
-
Construct a code point label in angle brackets.
Examples:
require "unicode_utils/sid" U.sid 0xa # => "LINE FEED" U.sid 0x0 # => "NULL" U.sid 0xfeff # => "BYTE ORDER MARK" U.sid 0xe000 # => "<private-use-E000>" U.sid 0x61 # => "LATIN SMALL LETTER A" U.sid -1 # => nil
# File lib/unicode_utils/sid.rb, line 52 def sid(code_point) s = CP_PREFERRED_ALIAS_STRING_MAP[code_point] and return s cn = UnicodeUtils.char_name(code_point) return cn if cn && cn !~ /\A(\<|\z)/ ct = UnicodeUtils.code_point_type(code_point) or return nil ts = ct.to_s.downcase.gsub('_', '-') "<#{ts}-#{code_point.to_s(16).upcase.rjust(4, '0')}>" end
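The last step of the selection process, constructing a code point label, is plain string formatting. The sketch below mirrors the final two lines of the source above as a stand-alone method; the name code_point_label is hypothetical, and the caller is assumed to supply the code point type as a symbol such as :Private_Use.

```ruby
# Hypothetical helper (not part of UnicodeUtils): build a label such as
# "<private-use-E000>" from a code point type and an integer code point.
def code_point_label(type, code_point)
  kind = type.to_s.downcase.tr("_", "-")
  # At least four uppercase hex digits, more for supplementary planes.
  format("<%s-%04X>", kind, code_point)
end

puts code_point_label(:Private_Use, 0xE000)
puts code_point_label(:Surrogate, 0xD800)
```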
Perform simple case folding. Contrary to full case folding, this uses only one to one mappings, so that the length of the returned string is equal to the length of str.
The purpose of case folding is case insensitive string comparison.
Examples:
require "unicode_utils/simple_casefold" UnicodeUtils.simple_casefold("Ümit") == UnicodeUtils.simple_casefold("ümit") => true UnicodeUtils.simple_casefold("WEISS") == UnicodeUtils.simple_casefold("weiß") => false
See also: UnicodeUtils.casefold
# File lib/unicode_utils/simple_casefold.rb, line 23 def simple_casefold(str) String.new.force_encoding(str.encoding).tap do |res| str.each_codepoint { |cp| res << (CASEFOLD_C_MAP[cp] || CASEFOLD_S_MAP[cp] || cp) } end end
Map each code point in str that has a single code point lowercase-mapping to that lowercase mapping. The returned string has the same length as the original string.
This function is locale independent.
Examples:
require "unicode_utils/simple_downcase" UnicodeUtils.simple_downcase("ÜMIT: 123") => "ümit: 123" UnicodeUtils.simple_downcase("STRASSE") => "strasse"
# File lib/unicode_utils/simple_downcase.rb, line 19 def simple_downcase(str) String.new.force_encoding(str.encoding).tap { |res| str.each_codepoint { |cp| res << (SIMPLE_DOWNCASE_MAP[cp] || cp) } } end
Map each code point in str that has a single code point uppercase-mapping to that uppercase mapping. The returned string has the same length as the original string.
This function is locale independent.
Examples:
require "unicode_utils/simple_upcase" UnicodeUtils.simple_upcase("ümit: 123") => "ÜMIT: 123" UnicodeUtils.simple_upcase("weiß") => "WEIß"
# File lib/unicode_utils/simple_upcase.rb, line 19 def simple_upcase(str) String.new.force_encoding(str.encoding).tap { |res| str.each_codepoint { |cp| res << (SIMPLE_UPCASE_MAP[cp] || cp) } } end
Returns true if the given character has the Unicode property Soft_Dotted.
# File lib/unicode_utils/soft_dotted_char_q.rb, line 10 def soft_dotted_char?(char) SOFT_DOTTED_SET.include?(char.ord) end
Convert the first cased character after each word boundary to titlecase and all other cased characters to lowercase. For many, but not all characters, the titlecase mapping is the same as the uppercase mapping.
Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt) as a symbol. If a language doesn’t have a two letter code, the three letter code is to be used. If locale independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.
Example:
require "unicode_utils/titlecase" UnicodeUtils.titlecase("hello, world!") => "Hello, World!"
# File lib/unicode_utils/titlecase.rb, line 31 def titlecase(str, language_id = nil) String.new.force_encoding(str.encoding).tap do |res| # ensure O(1) lookup by index str = str.encode(Encoding::UTF_32LE) i = 0 each_word(str) { |word| cased_char_found = false word.each_codepoint { |cp| cased = cased_char?(cp) if !cased_char_found && cased cased_char_found = true special_mapping = Impl.conditional_titlecase_mapping(cp, str, i, language_id) || SPECIAL_TITLECASE_MAP[cp] if special_mapping special_mapping.each { |m| res << m } else res << (SIMPLE_TITLECASE_MAP[cp] || cp) end elsif cased special_mapping = Impl.conditional_downcase_mapping(cp, str, i, language_id) || SPECIAL_DOWNCASE_MAP[cp] if special_mapping special_mapping.each { |m| res << m } else res << (SIMPLE_DOWNCASE_MAP[cp] || cp) end else res << cp end i += 1 } } end end
True if the given character has the General_Category Titlecase_Letter (Lt).
# File lib/unicode_utils/titlecase_char_q.rb, line 10 def titlecase_char?(char) TITLECASE_LETTER_SET.include?(char.ord) end
Perform a full case-conversion of str to uppercase according to the Unicode standard.
Some conversion rules are language dependent; these are in effect when a non-nil language_id is given. If non-nil, the language_id must be a two letter language code as defined in BCP 47 (tools.ietf.org/rfc/bcp/bcp47.txt) as a symbol. If a language doesn’t have a two letter code, the three letter code is to be used. If locale independent behaviour is required, nil should be passed explicitly, because a later version of UnicodeUtils may default to something else.
Examples:
require "unicode_utils/upcase" UnicodeUtils.upcase("weiß") => "WEISS" UnicodeUtils.upcase("i", :en) => "I" UnicodeUtils.upcase("i", :tr) => "İ"
# File lib/unicode_utils/upcase.rb, line 28 def upcase(str, language_id = nil) String.new.force_encoding(str.encoding).tap { |res| if Impl::LANGS_WITH_RULES.include?(language_id) # ensure O(1) lookup by index str = str.encode(Encoding::UTF_32LE) end pos = 0 str.each_codepoint { |cp| special_mapping = Impl.conditional_upcase_mapping(cp, str, pos, language_id) || SPECIAL_UPCASE_MAP[cp] if special_mapping special_mapping.each { |m| res << m } else res << (SIMPLE_UPCASE_MAP[cp] || cp) end pos += 1 } } end
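Since Ruby 2.4, the built-in String#upcase and String#downcase also perform full Unicode case conversion and accept a :turkic option for the Turkish and Azerbaijani dotted and dotless i. The sketch below uses only those built-ins, not UnicodeUtils, and note that the option is a fixed symbol rather than a BCP 47 language code as in the method documented here.

```ruby
# Full conversion: one code point can expand to two.
p "weiß".upcase

# Locale independent versus Turkic conversion of "i".
p "i".upcase
p "i".upcase(:turkic)   # LATIN CAPITAL LETTER I WITH DOT ABOVE
p "I".downcase(:turkic) # LATIN SMALL LETTER DOTLESS I
```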
True if the given character has the Unicode property Uppercase.
# File lib/unicode_utils/uppercase_char_q.rb, line 9 def uppercase_char?(char) PROP_UPPERCASE_SET.include?(char.ord) end
True if the given character has the Unicode property White_Space.
Example:
require "unicode_utils/general_category" require "unicode_utils/white_space_char_q" UnicodeUtils.general_category("\n") => :Control UnicodeUtils.white_space_char?("\n") => true
# File lib/unicode_utils/white_space_char_q.rb, line 17 def white_space_char?(char) WHITE_SPACE_SET.include?(char.ord) end