class EDI::E::StreamingParser

Class StreamingParser

Introduction

Turning a whole EDI interchange into an EDI::E::Interchange object with method parse is convenient, but it consumes a lot of memory. Sometimes interchanges are simply too large to be kept in memory completely. The same holds for large XML documents, where a common solution exists: the SAX/SAX2 API, a streaming approach. This class implements the same idea for UN/EDIFACT data.

Use StreamingParser instances to parse UN/EDIFACT data sequentially. Sequential parsing saves main memory and is applicable to arbitrarily large interchanges.

At its core lies method go. It scans the input stream and employs callbacks on_* which implement most of the parser tasks.

Syntax check

Unless you customize the callbacks, this parser just scans through the data. Only callback on_error() contains code: it raises an exception that reports the location and kind of the syntax error encountered.

Example: Syntax check

parser = EDI::E::StreamingParser.new
parser.go( File.open 'damaged_file.edi' )
--> EDI::EDISyntaxError at offset 1234, last chars = UNt+1+0

Callbacks

Most callbacks provided here are just empty shells. They usually receive a string of interest (the segment content, i.e. everything from the segment tag up to, but excluding, the segment terminator). Callbacks that serve more than one segment tag also receive the tag as a separate string.

Override them to adapt the parser to your needs!

Example: Counting segments

class MyParser < EDI::E::StreamingParser
  attr_reader :counters

  def initialize
    @counters = Hash.new(0)
    super
  end

  def on_segment( s, tag )
    @counters[tag] += 1
  end
end

parser = MyParser.new
parser.go( File.open 'myfile.edi' )
puts "Segment tag statistics:"
parser.counters.keys.sort.each do |tag|
  print "%03s: %4d\n" % [ tag, parser.counters[tag] ]
end

Want to save time? Throw :done when already done!

Most callbacks may terminate further parsing by throwing symbol :done. This saves a lot of time, e.g. if you have already found what you were looking for. Otherwise, parsing continues until getc hits EOF or an error occurs.

Example: A simple search

parser = EDI::E::StreamingParser.new
def parser.on_segment( s, tag ) # singleton
  if tag == 'ADJ'
    puts "Interchange contains at least one segment ADJ !"
    puts "Here is its contents: #{s}"
    throw :done   # Skip further parsing
  end
end
parser.go( File.open 'myfile.edi' )
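The early exit relies on Ruby's catch/throw. The following is a minimal stand-alone sketch of that control flow, not the edi4r code itself: TinyScanner, on_item and the log array are hypothetical names used only for illustration. It shows that the end-of-run hook is still reached after a callback throws :done.

```ruby
# Hypothetical stand-in for the control flow of #go; not part of edi4r.
class TinyScanner
  attr_reader :log

  def initialize
    @log = []
  end

  def on_item(item)      # empty shell, meant to be overridden
  end

  def go(items)
    seen = 0
    @log << :start
    catch(:done) do      # a throw :done from any callback lands here
      items.each do |item|
        seen += 1
        on_item(item)
      end
    end
    @log << :end         # still reached after throw :done
    seen
  end
end

scanner = TinyScanner.new
def scanner.on_item(item)   # singleton override, as in the search example
  throw :done if item == 'ADJ'
end

seen = scanner.go(%w[UNH BGM ADJ LIN UNT])
puts "stopped after #{seen} items, log = #{scanner.log.inspect}"
```

Here the scan stops at the third item, yet :end is logged, which mirrors how go calls on_interchange_end after the catch block.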

Public Class Methods

new()
# File lib/edi4r/edifact.rb, line 1569
def initialize
  @path = 'input stream'
end

Public Instance Methods

go( hnd )

The one-pass reader & dispatcher of segments, SAX-style.

It reads sequentially through the given stream of octets and generates calls to the on_* callbacks. Parameter hnd may be any object that supports method getc, such as a File or a StringIO. The method returns the offset reached when parsing stopped.

# File lib/edi4r/edifact.rb, line 1660
    def go( hnd )
      state, offset, iedi, item, tag, una = :outside, 0, false, '', '', ''
      seg_term, esc_char = nil, ?? # @ic.una.seg_term, @ic.una.esc_char
      una_count = uib_unb_count = nil

      @path = hnd.path if hnd.respond_to? :path

      self.on_interchange_start

      catch(:done) do
        loop do
          c = hnd.getc

          case state # State machine

            # Characters outside of a segment or UNA context
          when :outside
            case c

            when nil
              break # Regular exit at EOF

            when (?A..?Z)
              unless item.empty? # Flush
                self.on_other( item )
                item = ''
              end
              item << c; tag << c
              state = :tag1

            else
              item << c
            end

            # Found first tag char, now expecting second
          when :tag1
            case c

            when (?A..?Z)
              item << c; tag << c
              state = :tag2

            else # including 'nil'
              self.on_error(EDISyntaxError, offset, item, c)
            end

            # Found second tag char, now expecting last
          when :tag2
            case c
            when (?A..?Z)
              item << c; tag << c
              if tag=='UNA'
                state = :in_una
                una_count = 0
              elsif tag=~/U[IN]B/
                state = :in_uib_unb
                uib_unb_count = 0
              else
                state = :in_segment
              end
            else # including 'nil'
              self.on_error(EDISyntaxError, offset, item, c)
            end

          when :in_una
            self.on_error(EDISyntaxError, offset, item) if c.nil?
            item << c; una_count += 1
            if una_count == 6 # completed?
              esc_char, seg_term = item[6], item[8]
              self.on_una( item )
              item, tag = '', ''
              state = :outside
            end

            # Set seg_term if version==2 && charset=='UNOB'
          when :in_uib_unb
            self.on_error(EDISyntaxError, offset, item) if c.nil?
            item << c; uib_unb_count += 1
            if uib_unb_count == 7 # Read up to charset?
              # Set seg_term if not previously set by UNA
              if seg_term.nil? && item[4,4]=='UNOB' && item[9]==?2
                seg_term = ?\x14  # Special case
              else
                seg_term = ?'     # Default value
              end
              state = :in_segment # Continue normally
            end

          when :in_segment
            case c
            when nil
              self.on_error(EDISyntaxError, offset, item)
            when esc_char
              state = :esc_mode
            when seg_term
              dispatch_item( item , tag )
              item, tag = '', ''
              state = :outside
            else
              item << c
            end

          when :esc_mode
            case c
            when nil
              self.on_error(EDISyntaxError, offset, item)
            when seg_term      # Treat seg_term as regular character
              item << seg_term
            # when esc_char      # Redundant - skip
            #   item << esc_char << esc_char
            else
              item << esc_char << c
            end
            state = :in_segment
            
          else # Should never occur...
            raise ArgumentError, "unexpected state: #{state}"
          end  
          offset += 1
        end # loop
#        self.on_error(EDISyntaxError, offset, item) unless state==:outside
      end # catch(:done)

      self.on_interchange_end
      offset
    end
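To illustrate two points of the listing above, the getc-only contract and the handling of the release (escape) character, here is a much simplified stand-alone sketch. The helper split_segments and the class CharFeeder are hypothetical and not part of edi4r; unlike go, the sketch ignores tags, UNA/UNB detection and error handling, and it silently drops an unterminated trailing segment.

```ruby
require 'stringio'

# Hypothetical helper, not part of edi4r: splits a character stream into
# segments in the spirit of the :in_segment / :esc_mode states of #go.
def split_segments(hnd, seg_term = "'", esc_char = "?")
  segments = []
  item = String.new
  escaped = false
  while (c = hnd.getc)
    if escaped
      # An escaped terminator becomes a plain character; any other
      # escaped character keeps its release character.
      item << (c == seg_term ? c : esc_char + c)
      escaped = false
    elsif c == esc_char
      escaped = true
    elsif c == seg_term
      segments << item
      item = String.new
    else
      item << c
    end
  end
  segments
end

# Any object with #getc will do; this one feeds characters from an Array.
class CharFeeder
  def initialize(chars)
    @chars = chars.dup
  end

  def getc
    @chars.shift   # returns nil when exhausted, like IO#getc at EOF
  end
end

data = "UNH+1+ORDERS'FTX+AAI+++IT?'S OK'UNT+3+1'"
from_io    = split_segments(StringIO.new(data))
from_array = split_segments(CharFeeder.new(data.chars))
```

Both calls yield the same three segments; the escaped terminator in the FTX segment ends up as a plain apostrophe inside the segment content.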

on_error(err, offset, fragment, c=nil)

Called upon syntax errors. Parsing should be aborted now.

# File lib/edi4r/edifact.rb, line 1648
def on_error(err, offset, fragment, c=nil)
  raise err, "offset = %d, last chars = %s%s" % 
    [offset, fragment, c.nil? ? '<EOF>' : c.chr]
end
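The message layout can be tried in isolation. This small sketch only reproduces the format string from the listing above; the method name format_syntax_error is hypothetical, and the sample arguments are made up for illustration.

```ruby
# Hypothetical helper, not part of edi4r: builds the same message text
# that on_error passes to raise.
def format_syntax_error(offset, fragment, c = nil)
  "offset = %d, last chars = %s%s" %
    [offset, fragment, c.nil? ? '<EOF>' : c.chr]
end

mid_stream = format_syntax_error(1234, 'UN', 't')
at_eof     = format_syntax_error(20, 'UNH+1')
puts mid_stream
puts at_eof
```

When no offending character is passed, as in the end-of-file cases, the message ends in the literal marker <EOF>.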

on_interchange_end()

Called at EOF. Override it for your cleanup purposes. Note: Must not throw :done!

# File lib/edi4r/edifact.rb, line 1588
def on_interchange_end
end

on_interchange_start()

Called at the start of reading. Override it for your init purposes. Note: Must not throw :done!

# File lib/edi4r/edifact.rb, line 1582
def on_interchange_start
end

on_other( s )

This callback is usually kept empty. It is called when the parser finds strings between segments or in front of or trailing an interchange.

Strictly speaking, such strings are not permitted by the UN/EDIFACT syntax rules (ISO 9735). However, it is quite common to put a line break between segments for better readability. The default settings thus ignore such occurrences.

If you need strict conformance checking, feel free to put some code into this callback method, otherwise just ignore it.

# File lib/edi4r/edifact.rb, line 1643
def on_other( s )
end
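A stricter on_other could look like the following sketch. The module StrictOther is hypothetical and not part of edi4r; it tolerates bare line breaks and rejects everything else, and it raises ArgumentError only because the sketch is stand-alone (a real override would more likely raise the library's own syntax error). The class Probe merely stands in for a StreamingParser subclass that would include the module.

```ruby
# Hypothetical mix-in, not part of edi4r.
module StrictOther
  def on_other(s)
    return if s =~ /\A[\r\n]*\z/   # tolerate bare line breaks only
    raise ArgumentError, "unexpected data between segments: #{s.inspect}"
  end
end

# Stands in for a StreamingParser subclass in this stand-alone sketch.
class Probe
  include StrictOther
end

probe = Probe.new
probe.on_other("\r\n")             # accepted silently

begin
  probe.on_other("  garbage ")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Because an included module sits between a class and its superclass in the method lookup chain, including such a module in a subclass would take precedence over the empty default.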

on_segment( s, tag )

Called when any other segment is encountered.

# File lib/edi4r/edifact.rb, line 1628
def on_segment( s, tag )
end

on_una( s )

Called when the UNA pseudo segment is encountered.

# File lib/edi4r/edifact.rb, line 1593
def on_una( s )
end
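The string handed to on_una consists of the tag followed by six service characters; go itself reads the release character from position 6 and the segment terminator from position 8. The following sketch names all six positions according to the UNA layout of the EDIFACT syntax rules. The helper una_service_chars and its hash keys are hypothetical, not part of edi4r.

```ruby
# Hypothetical helper, not part of edi4r: names the six service characters
# that follow the tag in a UNA service string advice.
def una_service_chars(una)
  unless una =~ /\AUNA.{6}\z/m
    raise ArgumentError, 'expected "UNA" plus 6 characters'
  end
  {
    component_sep: una[3],
    element_sep:   una[4],
    decimal_mark:  una[5],
    release_char:  una[6],   # esc_char in #go
    reserved:      una[7],
    segment_term:  una[8]    # seg_term in #go
  }
end

chars = una_service_chars("UNA:+.? '")
puts chars.inspect
```

An on_una override could use such a helper to remember the separators for later splitting of segment contents in on_segment.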

on_unb_uib( s, tag )

Called when UNB or UIB is encountered.

# File lib/edi4r/edifact.rb, line 1598
def on_unb_uib( s, tag )
end

on_une( s )

Called when UNE is encountered.

# File lib/edi4r/edifact.rb, line 1613
def on_une( s )
end

on_ung( s )

Called when UNG is encountered.

# File lib/edi4r/edifact.rb, line 1608
def on_ung( s )
end

on_unh_uih( s, tag )

Called when UNH or UIH is encountered.

# File lib/edi4r/edifact.rb, line 1618
def on_unh_uih( s, tag )
end

on_unt_uit( s, tag )

Called when UNT or UIT is encountered.

# File lib/edi4r/edifact.rb, line 1623
def on_unt_uit( s, tag )
end

on_unz_uiz( s, tag )

Called when UNZ or UIZ is encountered.

# File lib/edi4r/edifact.rb, line 1603
def on_unz_uiz( s, tag )
end

path()

Convenience method. Returns the path of the File object passed to method go, or just the string ‘input stream’ if the object passed has no path.

# File lib/edi4r/edifact.rb, line 1575
def path
  @path
end