class Regex::Character

A regular expression that matches a specific character in a given character set.

Constants

DigramSequences

Constant holding all special two-character escape sequences.

MetaChars
MetaCharsInClass

Attributes

codepoint[R]

The integer value that uniquely identifies the character.

lexeme[R]

The initial text representation of the character (if any).

Public Class Methods

char2codepoint(aChar)

Conversion method that returns the codepoint for the given single character.

Example:
  RegAn::Character::char2codepoint('Σ') # Returns: 0x3a3

# File lib/regex/character.rb, line 92
def self.char2codepoint(aChar)
  aChar.ord
end

codepoint2char(aCodepoint)

Conversion method that returns a character given a codepoint (integer) value.

Example:
  RegAn::Character::codepoint2char(0x3a3) # Returns: Σ (the Unicode GREEK CAPITAL LETTER SIGMA)

# File lib/regex/character.rb, line 85
def self.codepoint2char(aCodepoint)
  # Remark: chr() without an encoding argument fails with codepoints > 255
  [aCodepoint].pack('U')
end
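
The two conversions above are inverses of each other. A minimal standalone sketch using only core Ruby illustrates this; it does not load the library, and the variable names are illustrative only:

```ruby
# Standalone sketch: relies only on core String#ord, Array#pack and Integer#chr.
sigma = "\u03a3" # GREEK CAPITAL LETTER SIGMA as a UTF-8 string

codepoint  = sigma.ord              # the same call char2codepoint makes
round_trip = [codepoint].pack('U')  # the same call codepoint2char makes

# Integer#chr can also produce the character when given an explicit encoding.
via_chr = codepoint.chr(Encoding::UTF_8)

puts format('0x%x', codepoint)  # 0x3a3
puts round_trip == sigma        # true
puts via_chr == round_trip      # true
```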

esc2codepoint(esc_seq)

Conversion method that returns the codepoint for the given escape sequence (a String). Recognized escape sequences are:
  \a (alarm, 0x07), \n (newline, 0xA), \r (carriage return, 0xD), \t (tab, 0x9), \e (escape, 0x1B), \f (form feed, 0xC), \v (vertical tab, 0xB)
  \uXXXX where XXXX is a 4-hex-digit integer value
  \u{X…} where X… is one or more hex digits
  \ooo (octal)
  \xXX (hex)
Any other escaped character is treated as a literal character.

Example:
  RegAn::Character::esc2codepoint('\n') # Returns: 0xa

# File lib/regex/character.rb, line 106
def self.esc2codepoint(esc_seq)
  msg = "Escape sequence #{esc_seq} does not begin with a backslash (\\)."
  raise StandardError, msg unless esc_seq[0] == '\\'

  result = (esc_seq.length == 2) ? digram2codepoint(esc_seq) : esc_number2codepoint(esc_seq)

  return result
end
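
Because the method insists on a leading backslash, the way the argument is quoted in Ruby source matters. The following core-Ruby illustration makes no library calls, and its variable names are illustrative only:

```ruby
# In a single-quoted literal, \n stays as two characters: a backslash and 'n'.
escape_sequence = '\n'
# In a double-quoted literal, Ruby has already converted \n to one newline character.
newline_char = "\n"

puts escape_sequence.length      # 2
puts escape_sequence[0] == '\\'  # true: this is the form esc2codepoint expects
puts newline_char.length         # 1
puts newline_char.ord            # 10
```

Given the length check in the constructor documented on this page, a two-character string such as '\n' would be routed to esc2codepoint, while the one-character "\n" would go to char2codepoint.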

new(aValue)

Constructor.

aValue

Initialize the character with either a String literal or a codepoint value.

Examples:

Initializing with a codepoint value:
  RegAn::Character.new(0x3a3) # Represents: Σ (Unicode GREEK CAPITAL LETTER SIGMA)
  RegAn::Character.new(931) # Also represents: Σ (931 dec == 3a3 hex)

Initializing with a single-character string:
  RegAn::Character.new(?\u03a3) # Also represents: Σ
  RegAn::Character.new('Σ') # Obviously, represents a Σ

Initializing with an escape sequence string. Recognized escape sequences are:
  \a (alarm, 0x07), \n (newline, 0xA), \r (carriage return, 0xD), \t (tab, 0x9), \e (escape, 0x1B), \f (form feed, 0xC), \v (vertical tab, 0xB)
  \uXXXX where XXXX is a 4-hex-digit integer value
  \u{X…} where X… is one or more hex digits
  \ooo (octal)
  \xXX (hex)
Any other escaped character is treated as a literal character.
  RegAn::Character.new('\n') # Represents a newline
  RegAn::Character.new('\u03a3') # Represents a Σ

Calls superclass method
# File lib/regex/character.rb, line 61
def initialize(aValue)
  super()
  case aValue
    when String
      if aValue.size == 1
        # Literal single character case...
        @codepoint = self.class.char2codepoint(aValue)
      else
        # Should be an escape sequence...
        @codepoint = self.class.esc2codepoint(aValue)
      end
      @lexeme = aValue

    when Integer
      @codepoint = aValue
    else
      raise StandardError, "Cannot initialize a Character with a '#{aValue}'."
  end
end

Private Class Methods

digram2codepoint(aDigram)

Conversion method that returns a codepoint for the given two-character (digram) escape sequence. Recognized escape sequences are: \a (alarm, 0x07), \n (newline, 0xA), \r (carriage return, 0xD), \t (tab, 0x9), \e (escape, 0x1B), \f (form feed, 0xC), \v (vertical tab, 0xB). Any other escape sequence returns the codepoint of the escaped character.

aDigram

A sequence of two characters that starts with a backslash.

# File lib/regex/character.rb, line 174
def self.digram2codepoint(aDigram)
  # Check that the digram is a special escape sequence
  result = DigramSequences.fetch(aDigram, nil)

  # If it is not a special sequence, then the escaped character is
  # taken literally (the backslash is a 'dummy')
  result = char2codepoint(aDigram[-1]) if result.nil?
  return result
end
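
The lookup-with-fallback behaviour can be sketched in isolation. The contents of DigramSequences are not shown on this page, so the table below is an assumption reconstructed from the documented escape list, and both `SKETCH_DIGRAMS` and `sketch_digram2codepoint` are hypothetical names:

```ruby
# Hypothetical stand-in for DigramSequences (assumed contents, taken from
# the documented list of recognized escapes). Single-quoted keys are each
# two characters long: a backslash followed by a letter.
SKETCH_DIGRAMS = {
  '\a' => 0x07, # alarm
  '\n' => 0x0a, # newline
  '\r' => 0x0d, # carriage return
  '\t' => 0x09, # tab
  '\e' => 0x1b, # escape
  '\f' => 0x0c, # form feed
  '\v' => 0x0b  # vertical tab
}.freeze

# Returns the table entry if there is one; otherwise the backslash is
# ignored and the codepoint of the escaped character itself is returned.
def sketch_digram2codepoint(digram)
  SKETCH_DIGRAMS.fetch(digram) { digram[-1].ord }
end

puts sketch_digram2codepoint('\n') # 10
puts sketch_digram2codepoint('\.') # 46, the codepoint of '.'
```

Passing a block to Hash#fetch computes the fallback only on a miss, which folds the nil check of the original into a single expression.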

esc_number2codepoint(anEscapeSequence)

Conversion method that returns a codepoint for the given complex escape sequence.

anEscapeSequence

A String in one of the following formats:
  \uXXXX where XXXX is a 4-hex-digit integer value
  \u{X…} where X… is one or more hex digits
  \ooo (1..3 octal digits literal)
  \xXX (1..2 hex digits literal)

# File lib/regex/character.rb, line 193
def self.esc_number2codepoint(anEscapeSequence)
  # With the regexp literal on the left-hand side of =~, the named
  # captures (hexa, octal) are assigned to local variables.
  unless /^\\(?:(?:[uxX]\{?(?<hexa>\h+)\}?)|(?<octal>[0-7]{1,3}))$/ =~ anEscapeSequence
    raise StandardError, "Unsupported escape sequence #{anEscapeSequence}."
  end

  # Octal literal case?
  return octal.oct if octal

  # Otherwise the sequence holds a hexadecimal number
  hexa.hex
end
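
The numeric formats can be exercised with a standalone sketch. `NUMERIC_ESCAPE` and `sketch_esc_number2codepoint` are hypothetical names, not part of the library; the sketch also assumes `\A`/`\z` anchors and raises ArgumentError, both of which differ from the method above:

```ruby
# Hypothetical pattern mirroring the documented formats. \A and \z anchor
# the whole string, whereas the library method uses ^ and $.
NUMERIC_ESCAPE = /\A\\(?:[uxX]\{?(?<hexa>\h+)\}?|(?<octal>[0-7]{1,3}))\z/

def sketch_esc_number2codepoint(sequence)
  match = NUMERIC_ESCAPE.match(sequence)
  # Assumption: ArgumentError is used here; the library raises StandardError.
  raise ArgumentError, "Unsupported escape sequence #{sequence}." unless match

  # Only one of the two named groups takes part in a successful match.
  match[:octal] ? match[:octal].oct : match[:hexa].hex
end

puts sketch_esc_number2codepoint('\u03a3')    # 931
puts sketch_esc_number2codepoint('\u{1F600}') # 128512
puts sketch_esc_number2codepoint('\x41')      # 65
puts sketch_esc_number2codepoint('\101')      # 65
```

Every argument is written as a single-quoted literal so that the backslash reaches the method unchanged.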

Public Instance Methods

==(other)

Returns true iff this Character and the parameter 'other' represent the same character.

other

Any Object. The way equality is tested depends on the class of other.

Example:
  newOne = Character.new(?\u03a3)
  newOne == newOne # true. Identity
  newOne == Character.new(?\u03a3) # true. Both have the same codepoint
  newOne == ?\u03a3 # true. The single-character String matches the char attribute exactly.
  newOne == 0x03a3 # true. The Integer is compared to the codepoint value.

Any other Object is converted with to_s and then compared.

# File lib/regex/character.rb, line 129
def ==(other)
  result = case other
    when Character
      to_str == other.to_str

    when Integer
      codepoint == other

    when String
      other.size > 1 ? false : to_str == other

    else
      # Unknown type: try with a conversion
      self == other.to_s # Recursive call
  end

  return result
end
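
The type-dependent comparison can be tried out with a simplified stand-in. `SketchChar` is a hypothetical class and not the library's Character: it stores only a codepoint and compares codepoints directly, whereas the method above compares `to_str` values supplied by a superclass that this page does not show.

```ruby
# Hypothetical, simplified stand-in for Character.
class SketchChar
  attr_reader :codepoint

  def initialize(codepoint)
    @codepoint = codepoint
  end

  # The character as a one-character UTF-8 String.
  def char
    [codepoint].pack('U')
  end

  def ==(other)
    case other
    when SketchChar then codepoint == other.codepoint
    when Integer    then codepoint == other
    when String     then other.size == 1 && char == other
    else self == other.to_s # unknown type: convert, then use the String branch
    end
  end
end

sigma = SketchChar.new(0x3a3)
puts sigma == SketchChar.new(931) # true
puts sigma == 0x3a3               # true
puts sigma == "\u03a3"            # true
puts sigma == "\u03a3\u03a3"      # false: more than one character
puts sigma == "\u03a3".to_sym     # true: the Symbol is converted with to_s
```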

char()

Return the character as a String object

# File lib/regex/character.rb, line 116
def char
  self.class.codepoint2char(@codepoint)
end

explain()

Return a plain English description of the character

# File lib/regex/character.rb, line 149
def explain
  "the character '#{to_str}'"
end

Protected Instance Methods

text_repr()

Redefinition of the conversion method. Returns the String representation of the expression. If the Character was originally built from text (the lexeme), the lexeme is returned; otherwise the character corresponding to the codepoint is returned.

# File lib/regex/character.rb, line 160
def text_repr
  return char if lexeme.nil?

  return lexeme.dup
end