class UnicodeScanner

UnicodeScanner provides for Unicode-aware lexical scanning operations on a `String`. Here is an example of its usage:

““ ruby s = UnicodeScanner.new('This is an example string') s.eos? # -> false

p s.scan(/w+/) # -> “This” p s.scan(/w+/) # -> nil p s.scan(/s+/) # -> “ ” p s.scan(/s+/) # -> nil p s.scan(/w+/) # -> “is” s.eos? # -> false

p s.scan(/s+/) # -> “ ” p s.scan(/w+/) # -> “an” p s.scan(/s+/) # -> “ ” p s.scan(/w+/) # -> “example” p s.scan(/s+/) # -> “ ” p s.scan(/w+/) # -> “string” s.eos? # -> true

p s.scan(/s+/) # -> nil p s.scan(/w+/) # -> nil ““

Scanning a string means remembering the position of a _scan pointer_, which is just an index. The point of scanning is to move forward a bit at a time, so matches are sought after the scan pointer; usually immediately after it.

Given the string “test string”, here are the pertinent scan pointer positions:

““

t e s t   s t r i n g

0 1 2 … 1

0

““

When you {#scan} for a pattern (a regular expression), the match must occur at the character after the scan pointer. If you use {#scan_until}, then the match can occur anywhere after the scan pointer. In both cases, the scan pointer moves _just beyond_ the last character of the match, ready to scan again from the next character onwards. This is demonstrated by the example above.

Method Categories


There are other methods besides the plain scanners. You can look ahead in the string without actually scanning. You can access the most recent match. You can modify the string being scanned, reset or terminate the scanner, find out or change the position of the scan pointer, skip ahead, and so on.

### Advancing the Scan Pointer

### Looking Ahead

### Finding Where we Are

### Setting Where we Are

### Match Data

### Miscellaneous

There are aliases to several of the methods.

Constants

INSPECT_LENGTH

Attributes

string[R]

@return [String] The string being scanned.

Public Class Methods

new(string) click to toggle source

Creates a new UnicodeScanner object to scan over the given `string`.

@param [String] string The string to iterate over.

# File lib/unicode_scanner.rb, line 109
def initialize(string)
  @string   = string
  @matches  = nil
  @matched  = false
  @current  = 0
  @previous = 0
end

Public Instance Methods

<<(str)
Alias for: concat
[](n) click to toggle source

Return the nth subgroup in the most recent match.

@param [Fixnum] n The index of the subgroup to return. @return [String, nil] The subgroup, if it exists.

@example

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.scan(/(\w+) (\w+) (\d+) /)       # -> "Fri Dec 12 "
s[0]                               # -> "Fri Dec 12 "
s[1]                               # -> "Fri"
s[2]                               # -> "Dec"
s[3]                               # -> "12"
s.post_match                       # -> "1975 14:39"
s.pre_match                        # -> ""
# File lib/unicode_scanner.rb, line 150
def [](n)
  @matched ? @matches[n] : nil
end
beginning_of_line?() click to toggle source

@return [true, false] `true` iff the scan pointer is at the beginning of the

line.

@example

s = UnicodeScanner.new("test\ntest\n")
s.bol?           # => true
s.scan(/te/)
s.bol?           # => false
s.scan(/st\n/)
s.bol?           # => true
s.terminate
s.bol?           # => true
# File lib/unicode_scanner.rb, line 167
def beginning_of_line?
  return nil if @current > @string.size
  return true if @current.zero?

  return @string[@current - 1] == "\n"
end
Also aliased as: bol?
bol?()
Alias for: beginning_of_line?
check(pattern) click to toggle source

This returns the value that {#scan} would return, without advancing the scan pointer. The match register is affected, though.

Mnemonic: it “checks” to see whether a {#scan} will return a value.

@param [Regexp] pattern The pattern to scan for. @return [String, nil] The matched segment, if matched.

@example

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.check /Fri/               # -> "Fri"
s.pos                       # -> 0
s.matched                   # -> "Fri"
s.check /12/                # -> nil
s.matched                   # -> nil
# File lib/unicode_scanner.rb, line 192
def check(pattern)
  do_scan pattern, false, true, true
end
check_until(pattern) click to toggle source

This returns the value that {#scan_until} would return, without advancing the scan pointer. The match register is affected, though.

Mnemonic: it “checks” to see whether a {#scan_until} will return a value.

@param [Regexp] pattern The pattern to scan until reaching. @return [String, nil] The matched segment, if matched.

@example

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.check_until /12/          # -> "Fri Dec 12"
s.pos                       # -> 0
s.matched                   # -> 12
# File lib/unicode_scanner.rb, line 210
def check_until(pattern)
  do_scan pattern, false, true, false
end
clear()
Alias for: terminate
concat(str) click to toggle source

Appends `str` to the string being scanned. This method does not affect scan pointer.

@param [String] str The string to append.

@example

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.scan(/Fri /)
s << " +1000 GMT"
s.string            # -> "Fri Dec 12 1975 14:39 +1000 GMT"
s.scan(/Dec/)       # -> "Dec"
# File lib/unicode_scanner.rb, line 129
def concat(str)
  @string.concat str
end
Also aliased as: <<
eos?() click to toggle source

@return [true, false] `true` if the scan pointer is at the end of the string.

@example

s = UnicodeScanner.new('test string')
p s.eos?          # => false
s.scan(/test/)
p s.eos?          # => false
s.terminate
p s.eos?          # => true
# File lib/unicode_scanner.rb, line 224
def eos?
  @current >= @string.length
end
exist?(pattern) click to toggle source

Looks ahead to see if the `pattern` exists anywhere in the string, without advancing the scan pointer. This predicates whether a {#scan_until} will return a value.

@param [Regexp] pattern The pattern to search for. @return [true, false] Whether the pattern exists ahead.

@example

s = UnicodeScanner.new('test string')
s.exist? /s/            # -> 3
s.scan /test/           # -> "test"
s.exist? /s/            # -> 2
s.exist? /e/            # -> nil
# File lib/unicode_scanner.rb, line 242
def exist?(pattern)
  do_scan pattern, false, false, false
end
getch() click to toggle source

Scans one character and returns it.

@return [String] The character.

@example

s = UnicodeScanner.new("ab")
s.getch           # => "a"
s.getch           # => "b"
s.getch           # => nil

$KCODE = 'EUC'
s = UnicodeScanner.new("\2244\2242")
s.getch           # => "\244\242"   # Japanese hira-kana "A" in EUC-JP
s.getch           # => nil
# File lib/unicode_scanner.rb, line 261
def getch
  return nil if eos?

  do_scan(/./u, true, true, true)
end
inspect() click to toggle source

Returns a string that represents the UnicodeScanner object, showing:

  • the current position

  • the size of the string

  • the characters surrounding the scan pointer

@return [String] A description of this object.

@example

s = ::new("Fri Dec 12 1975 14:39")
s.inspect # -> '#<UnicodeScanner 0/21 @ "Fri D...">'
s.scan_until /12/ # -> "Fri Dec 12"
s.inspect # -> '#<UnicodeScanner 10/21 "...ec 12" @ " 1975...">'
# File lib/unicode_scanner.rb, line 281
def inspect
  return "#<#{self.class} (uninitialized)>" if @string.nil?
  return "#<#{self.class} fin>" if eos?

  if @current.zero?
    return format("#<%{class} %<cur>d/%<len>d @ %{after}>",
                  class: self.class.to_s,
                  cur:   @current,
                  len:   @string.length,
                  after: inspect_after.inspect)
  end

  format("#<%{class} %<cur>d/%<len>d %{before} @ %{after}>",
         class:  self.class.to_s,
         cur:    @current,
         len:    @string.length,
         before: inspect_before.inspect,
         after:  inspect_after.inspect)
end
match?(pattern) click to toggle source

Tests whether the given `pattern` is matched from the current scan pointer. Returns the length of the match, or `nil`. The scan pointer is not advanced.

@param [Regexp] pattern The pattern to match with. @return [true, false] Whether the pattern is matched from the scan pointer.

@example

s = UnicodeScanner.new('test string')
p s.match?(/\w+/)   # -> 4
p s.match?(/\w+/)   # -> 4
p s.match?(/\s+/)   # -> nil
# File lib/unicode_scanner.rb, line 313
def match?(pattern)
  do_scan pattern, false, false, true
end
matched() click to toggle source

@return [String, nil] The last matched string. @example

s = UnicodeScanner.new('test string')
s.match?(/\w+/)     # -> 4
s.matched           # -> "test"
# File lib/unicode_scanner.rb, line 323
def matched
  return nil unless @matched

  @matches[0]
end
matched?() click to toggle source

@return [true, false] `true` iff the last match was successful. @example

s = UnicodeScanner.new('test string')
s.match?(/\w+/)     # => 4
s.matched?          # => true
s.match?(/\d+/)     # => nil
s.matched?          # => false
# File lib/unicode_scanner.rb, line 337
def matched?() @matched end
matched_size() click to toggle source

@return [Fixnum, nil] The size of the most recent match (see {#matched}), or

`nil` if there was no recent match.

@example

s = UnicodeScanner.new('test string')
s.check /\w+/           # -> "test"
s.matched_size          # -> 4
s.check /\d+/           # -> nil
s.matched_size          # -> nil
# File lib/unicode_scanner.rb, line 348
def matched_size
  return nil unless @matched

  @matches.end(0) - @matches.begin(0)
end
peek(len) click to toggle source

Extracts a string corresponding to `string`, without advancing the scan pointer.

@param [Fixnum] len The number of characters ahead to peek. @return [String] The string after the current position.

@example

s = UnicodeScanner.new('test string')
s.peek(7)          # => "test st"
s.peek(7)          # => "test st"
# File lib/unicode_scanner.rb, line 365
def peek(len)
  return '' if eos?

  @string[@current, len]
end
pointer()
Alias for: pos
pos() click to toggle source

Returns the byte position of the scan pointer. In the 'reset' position, this value is zero. In the 'terminated' position (i.e. the string is exhausted), this value is the bytesize of the string.

In short, it's a 0-based index into the string.

@return [Fixnum] The current scan position.

@example

s = UnicodeScanner.new('test string')
s.pos               # -> 0
s.scan_until /str/  # -> "test str"
s.pos               # -> 8
s.terminate         # -> #<UnicodeScanner fin>
s.pos               # -> 11
# File lib/unicode_scanner.rb, line 387
def pos() @current end
Also aliased as: pointer
pos=(n) click to toggle source

Set the byte position of the scan pointer.

@param [Fixnum] n The new position.

@example

s = UnicodeScanner.new('test string')
s.pos = 7            # -> 7
s.rest               # -> "ring"
# File lib/unicode_scanner.rb, line 400
def pos=(n)
  n += @string.length if n.negative?
  raise RangeError, "index out of range" if n.negative?
  raise RangeError, "index out of range" if n > @string.length

  @current = n
end
post_match() click to toggle source

@return [String] The _post-match_ (in the regular expression sense) of

the last scan.

@example

s = UnicodeScanner.new('test string')
s.scan(/\w+/)           # -> "test"
s.scan(/\s+/)           # -> " "
s.pre_match             # -> "test"
s.post_match            # -> "string"
# File lib/unicode_scanner.rb, line 417
def post_match
  return nil unless @matched

  @string[@previous + @matches.end(0), @string.length]
end
pre_match() click to toggle source

@return [String] The _pre-match_ (in the regular expression sense) of

the last scan.

@example

s = UnicodeScanner.new('test string')
s.scan(/\w+/)           # -> "test"
s.scan(/\s+/)           # -> " "
s.pre_match             # -> "test"
s.post_match            # -> "string"
# File lib/unicode_scanner.rb, line 432
def pre_match
  return nil unless @matched

  @string[0, @previous + @matches.begin(0)]
end
reset() click to toggle source

Reset the scan pointer (index 0) and clear matching data.

# File lib/unicode_scanner.rb, line 440
def reset
  @current = 0
  @matched = false
end
rest() click to toggle source

@return [String] The “rest” of the string (i.e. everything after the scan

pointer). If there is no more data (`eos? = true`), it returns `""`.
# File lib/unicode_scanner.rb, line 448
def rest
  return '' if eos?

  return @string[@current, @string.length]
end
rest_size() click to toggle source

@return [Fixnum] The value returned by `s.rest.size`.

# File lib/unicode_scanner.rb, line 456
def rest_size
  return 0 if eos?

  @string.length - @current
end
scan(pattern) click to toggle source

Tries to match with `pattern` at the current position. If there's a match, the scanner advances the “scan pointer” and returns the matched string. Otherwise, the scanner returns `nil`.

@param [Regexp] pattern The pattern to match. @return [String, nil] The string that was matched, if a match was found.

@example

s = UnicodeScanner.new('test string')
p s.scan(/\w+/)   # -> "test"
p s.scan(/\w+/)   # -> nil
p s.scan(/\s+/)   # -> " "
p s.scan(/\w+/)   # -> "string"
p s.scan(/./)     # -> nil
# File lib/unicode_scanner.rb, line 477
def scan(pattern)
  do_scan pattern, true, true, true
end
scan_full(pattern, advance_pointer, return_string) click to toggle source

Tests whether the given `pattern` is matched from the current scan pointer. Advances the scan pointer if `advance_pointer` is `true`. Returns the matched string if `return_string` is true. The match register is affected.

“full” means “scan with full parameters”.

@param [Regexp] pattern The pattern to scan. @param [true, false] advance_pointer Whether to advance the scan pointer if

a match is found.

@param [true, false] return_string Whether to return the matched segment. @return [String, Fixnum, nil] The matched segment if `return_string` is

`true`, otherwise the number of characters advanced. `nil` if nothing
matched.
# File lib/unicode_scanner.rb, line 495
def scan_full(pattern, advance_pointer, return_string)
  do_scan pattern, advance_pointer, return_string, true
end
scan_until(pattern) click to toggle source

Scans the string until the `pattern` is matched. Returns the substring up to and including the end of the match, advancing the scan pointer to that location. If there is no match, `nil` is returned.

@param [Regexp] pattern The pattern to match. @return [String, nil] The segment that matched.

@example

s = UnicodeScanner.new("Fri Dec 12 1975 14:39")
s.scan_until(/1/)        # -> "Fri Dec 1"
s.pre_match              # -> "Fri Dec "
s.scan_until(/XYZ/)      # -> nil
# File lib/unicode_scanner.rb, line 512
def scan_until(pattern)
  do_scan pattern, true, true, false
end
search_full(pattern, advance_pointer, return_string) click to toggle source

Scans the string `until` the pattern is matched. Advances the scan pointer if `advance_pointer`, otherwise not. Returns the matched string if `return_string` is `true`, otherwise returns the number of characters advanced. This method does affect the match register.

@param [Regexp] pattern The pattern to scan. @param [true, false] advance_pointer Whether to advance the scan pointer if

a match is found.

@param [true, false] return_string Whether to return the matched segment. @return [String, Fixnum, nil] The matched segment if `return_string` is

`true`, otherwise the number of characters advanced. `nil` if nothing
matched.
# File lib/unicode_scanner.rb, line 529
def search_full(pattern, advance_pointer, return_string)
  do_scan pattern, advance_pointer, return_string, false
end
skip(pattern) click to toggle source

Attempts to skip over the given `pattern` beginning with the scan pointer. If it matches, the scan pointer is advanced to the end of the match, and the length of the match is returned. Otherwise, `nil` is returned.

It's similar to {#scan}, but without returning the matched string.

@param [Regexp] pattern The pattern to match. @return [Fixnum, nil] The number of characters advanced, if matched.

@example

s = UnicodeScanner.new('test string')
p s.skip(/\w+/)   # -> 4
p s.skip(/\w+/)   # -> nil
p s.skip(/\s+/)   # -> 1
p s.skip(/\w+/)   # -> 6
p s.skip(/./)     # -> nil
# File lib/unicode_scanner.rb, line 550
def skip(pattern)
  do_scan pattern, true, false, true
end
skip_until(pattern) click to toggle source

Advances the scan pointer until `pattern` is matched and consumed. Returns the number of characters advanced, or `nil` if no match was found.

Look ahead to match `pattern`, and advance the scan pointer to the end of the match. Return the number of characters advanced, or `nil` if the match was unsuccessful.

It's similar to {#scan_until}, but without returning the intervening string.

@param [Regexp] pattern The pattern to match. @return [Fixnum, nil] The number of characters advanced, if matched.

# File lib/unicode_scanner.rb, line 566
def skip_until(pattern)
  do_scan pattern, true, false, false
end
string=(str) click to toggle source

Changes the string being scanned to `str` and resets the scanner.

@param [String] str The new string to scan. @return [String] `str`

# File lib/unicode_scanner.rb, line 579
def string=(str)
  @string  = str
  @matched = false
  @current = 0
end
terminate() click to toggle source

Set the scan pointer to the end of the string and clear matching data.

# File lib/unicode_scanner.rb, line 587
def terminate
  @current = @string.length
  @matched = false
  self
end
Also aliased as: clear
unscan() click to toggle source

Set the scan pointer to the previous position. Only one previous position is remembered, and it changes with each scanning operation.

@example

s = UnicodeScanner.new('test string')
s.scan(/\w+/)        # => "test"
s.unscan
s.scan(/../)         # => "te"
s.scan(/\d/)         # => nil
s.unscan             # ScanError: unscan failed: previous match record not exist
# File lib/unicode_scanner.rb, line 605
def unscan
  raise ScanError, "unscan failed: previous match record not exist" unless @matched

  @current = @previous
  @matched = false
  self
end

Private Instance Methods

do_scan(regex, advance_pointer, return_string, head_only) click to toggle source
# File lib/unicode_scanner.rb, line 615
def do_scan(regex, advance_pointer, return_string, head_only)
  raise ArgumentError unless regex.kind_of?(Regexp)

  @matched = false
  return nil if eos?

  @matches = regex.match(@string[@current, @string.length])
  return nil unless @matches

  if head_only && @matches.begin(0).positive?
    @matches = nil
    return nil
  end

  @matched = true

  @previous = @current
  @current += @matches.end(0) if advance_pointer
  if return_string
    return @string[@previous, @matches.end(0)]
  else
    return @matches.end(0)
  end
end
inspect_after() click to toggle source
# File lib/unicode_scanner.rb, line 657
def inspect_after
  return '' if eos?

  str = String.new
  len = @string.length - @current
  if len > INSPECT_LENGTH
    len = INSPECT_LENGTH
    str << @string[@current, len]
    str << '...'
  else
    str << @string[@current, len]
  end

  return str
end
inspect_before() click to toggle source
# File lib/unicode_scanner.rb, line 640
def inspect_before
  return '' if @current.zero?

  str = String.new
  len = 0

  if @current > INSPECT_LENGTH
    str << '...'
    len = INSPECT_LENGTH
  else
    len = @current
  end

  str << @string[@current - len, len]
  return str
end