class EDI::E::StreamingParser
Class StreamingParser
Introduction
Turning a whole EDI interchange into an EDI::E::Interchange object with method parse is convenient but memory consuming. Sometimes, interchanges become just too large to keep completely in memory. The same reasoning holds for large XML documents, where there is a common solution: the SAX/SAX2 API, a streaming approach. This class implements the same idea for UN/EDIFACT data.
Use StreamingParser instances to parse UN/EDIFACT data sequentially. Sequential parsing saves main memory and is applicable to arbitrarily large interchanges.
At its core lies method go. It scans the input stream and invokes the callbacks on_*, which implement most of the parser tasks.
Syntax check
Unless you customize the callbacks, this parser just scans through the data. Only the callback on_error() contains code: it raises an exception telling you about the location and kind of syntax error encountered.
Example: Syntax check
    parser = EDI::E::StreamingParser.new
    parser.go( File.open 'damaged_file.edi' )
    --> EDI::EDISyntaxError at offset 1234, last chars = UNt+1+0
Callbacks
Most callbacks provided here are just empty shells. They usually receive a string of interest (the segment content, i.e. everything from the segment tag up to but excluding the segment terminator) and, where the tag could differ, also the segment tag as a separate string.
Override them to adapt the parser to your needs!
Example: Counting segments
    class MyParser < EDI::E::StreamingParser
      attr_reader :counters
      def initialize
        @counters = Hash.new(0)
        super
      end
      def on_segment( s, tag )
        @counters[tag] += 1
      end
    end
    parser = MyParser.new
    parser.go( File.open 'myfile.edi' )
    puts "Segment tag statistics:"
    parser.counters.keys.sort.each do |tag|
      print "%03s: %4d\n" % [ tag, parser.counters[tag] ]
    end
Want to save time? Throw :done when already done!
Most callbacks may terminate further parsing by throwing the symbol :done. This saves a lot of time, e.g. if you have already found what you were looking for. Otherwise, parsing continues until getc hits EOF or an error occurs.
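The early exit relies on Ruby's built-in catch/throw: go wraps its read loop in catch(:done), so a throw :done from inside a callback invoked by that loop unwinds straight to the end of the block. The following is a minimal, self-contained sketch of that mechanism only. The method scan_until and its block are hypothetical stand-ins and not part of edi4r, and the sketch assumes Ruby 1.9 or later, where getc returns a one-character String.

```ruby
require 'stringio'

# Hypothetical stand-in for the read loop in go: reads characters one by one
# and hands each to the block, which plays the role of a callback.
# Returns the number of characters consumed.
def scan_until(hnd)
  offset = 0
  catch(:done) do
    loop do
      c = hnd.getc
      break if c.nil?        # regular exit at EOF
      offset += 1
      yield c                # the "callback" may throw :done
    end
  end
  offset
end

input = StringIO.new("UNH+1'ADJ+x'UNT+2+1'")
consumed = scan_until(input) { |c| throw :done if c == 'J' }
puts "stopped after #{consumed} of #{input.string.size} characters"
```

The remaining characters are never read, which is where the time saving comes from.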
Example: A simple search
    parser = EDI::E::StreamingParser.new
    def parser.on_segment( s, tag ) # singleton
      if tag == 'ADJ'
        puts "Interchange contains at least one segment ADJ !"
        puts "Here is its contents: #{s}"
        throw :done # Skip further parsing
      end
    end
    parser.go( File.open 'myfile.edi' )
Public Class Methods
    # File lib/edi4r/edifact.rb, line 1569
    def initialize
      @path = 'input stream'
    end
Public Instance Methods
The one-pass reader & dispatcher of segments, SAX-style.
It reads sequentially through the given stream of octets and generates calls to the callbacks on_...
Parameter hnd may be any object supporting method getc.
    # File lib/edi4r/edifact.rb, line 1660
    def go( hnd )
      state, offset, iedi, item, tag, una = :outside, 0, false, '', '', ''
      seg_term, esc_char = nil, ?? # @ic.una.seg_term, @ic.una.esc_char
      una_count = uib_unb_count = nil
      @path = hnd.path if hnd.respond_to? :path
      self.on_interchange_start
      catch(:done) do
        loop do
          c = hnd.getc
          case state # State machine
          # Characters outside of a segment or UNA context
          when :outside
            case c
            when nil
              break # Regular exit at EOF
            when (?A..?Z)
              unless item.empty? # Flush
                self.on_other( item )
                item = ''
              end
              item << c; tag << c
              state = :tag1
            else
              item << c
            end
          # Found first tag char, now expecting second
          when :tag1
            case c
            when (?A..?Z)
              item << c; tag << c
              state = :tag2
            else # including 'nil'
              self.on_error(EDISyntaxError, offset, item, c)
            end
          # Found second tag char, now expecting last
          when :tag2
            case c
            when (?A..?Z)
              item << c; tag << c
              if tag=='UNA'
                state = :in_una
                una_count = 0
              elsif tag=~/U[IN]B/
                state = :in_uib_unb
                uib_unb_count = 0
              else
                state = :in_segment
              end
            else # including 'nil'
              self.on_error(EDISyntaxError, offset, item, c)
            end
          when :in_una
            self.on_error(EDISyntaxError, offset, item) if c.nil?
            item << c; una_count += 1
            if una_count == 6 # completed?
              esc_char, seg_term = item[6], item[8]
              self.on_una( item )
              item, tag = '', ''
              state = :outside
            end
          # Set seg_term if version==2 && charset=='UNOB'
          when :in_uib_unb
            self.on_error(EDISyntaxError, offset, item) if c.nil?
            item << c; uib_unb_count += 1
            if uib_unb_count == 7 # Read up to charset?
              # Set seg_term if not previously set by UNA
              if seg_term.nil? && item[4,4]=='UNOB' && item[9]==?2
                seg_term = ?\x14 # Special case
              else
                seg_term = ?' # Default value
              end
              state = :in_segment # Continue normally
            end
          when :in_segment
            case c
            when nil
              self.on_error(EDISyntaxError, offset, item)
            when esc_char
              state = :esc_mode
            when seg_term
              dispatch_item( item , tag )
              item, tag = '', ''
              state = :outside
            else
              item << c
            end
          when :esc_mode
            case c
            when nil
              self.on_error(EDISyntaxError, offset, item)
            when seg_term # Treat seg_term as regular character
              item << seg_term
            # when esc_char # Redundant - skip
            #   item << esc_char << esc_char
            else
              item << esc_char << c
            end
            state = :in_segment
          else # Should never occur...
            raise ArgumentError, "unexpected state: #{state}"
          end
          offset += 1
        end # loop
        # self.on_error(EDISyntaxError, offset, item) unless state==:outside
      end # catch(:done)
      self.on_interchange_end
      offset
    end
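The treatment of the release (escape) character in states :in_segment and :esc_mode can be tried in isolation. The sketch below is a much-reduced stand-in, not the library code: each_segment is a hypothetical helper that assumes the default service characters (' as segment terminator, ? as release character) and skips tag validation as well as UNA and UNB/UIB handling. Like go, it needs nothing from its input but getc, so a StringIO works as well as a File. It assumes Ruby 1.9 or later, where getc returns a String.

```ruby
require 'stringio'

# Hypothetical helper: yields each segment's content (terminator excluded).
def each_segment(hnd, seg_term = "'", esc_char = '?')
  item = ''.dup
  escaped = false
  while (c = hnd.getc)
    if escaped
      # As in :esc_mode: an escaped terminator is stored as a plain
      # character, any other escaped character keeps its escape prefix.
      item << (c == seg_term ? c : esc_char + c)
      escaped = false
    elsif c == esc_char
      escaped = true
    elsif c == seg_term
      yield item
      item = ''.dup
    else
      item << c
    end
  end
end

segments = []
each_segment(StringIO.new("FTX+AAI+++IT?'S OK'NAD+BY+A?+B'")) { |s| segments << s }
p segments
```

The escaped apostrophe ends up inside the first segment instead of terminating it, while the escaped plus sign in the second segment keeps its release character for later element-level parsing.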
Called upon syntax errors. Parsing should be aborted now.
    # File lib/edi4r/edifact.rb, line 1648
    def on_error(err, offset, fragment, c=nil)
      raise err, "offset = %d, last chars = %s%s" % [offset, fragment, c.nil? ? '<EOF>' : c.chr]
    end
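Because go invokes on_error from within its catch(:done) block, an override may also record the problem and stop quietly by throwing :done instead of raising. The following is only a sketch of that idea: the module name ErrorCollector and its errors accessor are hypothetical. When included in a subclass of StreamingParser, its on_error takes precedence over the inherited one.

```ruby
# Hypothetical mix-in: records a syntax error instead of raising.
module ErrorCollector
  def errors
    @errors ||= []
  end

  # Same signature as StreamingParser#on_error.
  def on_error(err, offset, fragment, c = nil)
    last = c.nil? ? '<EOF>' : c.chr
    errors << "#{err}: offset = #{offset}, last chars = #{fragment}#{last}"
    throw :done                # abort parsing without an exception
  end
end

# Intended use (requires edi4r):
#   class LenientParser < EDI::E::StreamingParser
#     include ErrorCollector
#   end

# Stand-alone demonstration with a plain object:
probe = Object.new.extend(ErrorCollector)
catch(:done) { probe.on_error(ArgumentError, 1234, 'UN', 't') }
puts probe.errors.first
```

After the throw, go still calls on_interchange_end and returns the offset reached, so the caller can inspect errors afterwards.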
Called at EOF. Override it for your cleanup purposes. Note: It must not throw :done, because go calls it outside of its catch(:done) block!
    # File lib/edi4r/edifact.rb, line 1588
    def on_interchange_end
    end
Called at start of reading. Override it for your init purposes. Note: It must not throw :done, because go calls it outside of its catch(:done) block!
    # File lib/edi4r/edifact.rb, line 1582
    def on_interchange_start
    end
This callback is usually kept empty. It is called when the parser finds strings between segments or in front of an interchange.
Strictly speaking, such strings are not permitted by the UN/EDIFACT syntax rules (ISO 9735). However, it is quite common to put a line break between segments for better readability. The default settings thus ignore such occurrences.
If you need strict conformance checking, feel free to put some code into this callback method; otherwise just ignore it.
    # File lib/edi4r/edifact.rb, line 1643
    def on_other( s )
    end
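For strict checking, one possible policy is to tolerate only line breaks between segments and to reject everything else. The sketch below assumes exactly that policy. The module name StrictSeparators and the choice of ArgumentError are illustrative and not part of edi4r.

```ruby
# Hypothetical mix-in for a StreamingParser subclass.
module StrictSeparators
  # s is the string found between two segments (or before the first one).
  def on_other(s)
    return if s =~ /\A[\r\n]*\z/   # CR/LF only: tolerated
    raise ArgumentError, "unexpected data between segments: #{s.inspect}"
  end
end

# Intended use (requires edi4r):
#   class StrictParser < EDI::E::StreamingParser
#     include StrictSeparators
#   end

# Stand-alone demonstration with a plain object:
checker = Object.new.extend(StrictSeparators)
checker.on_other("\r\n")           # passes silently
begin
  checker.on_other(" junk ")
rescue ArgumentError => e
  puts e.message
end
```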
Called when any other segment is encountered.
    # File lib/edi4r/edifact.rb, line 1628
    def on_segment( s, tag )
    end
Called when the UNA pseudo segment is encountered.
    # File lib/edi4r/edifact.rb, line 1593
    def on_una( s )
    end
Called when UNB or UIB is encountered.
    # File lib/edi4r/edifact.rb, line 1598
    def on_unb_uib( s, tag )
    end
Called when UNE is encountered.
    # File lib/edi4r/edifact.rb, line 1613
    def on_une( s )
    end
Called when UNG is encountered.
    # File lib/edi4r/edifact.rb, line 1608
    def on_ung( s )
    end
Called when UNH or UIH is encountered.
    # File lib/edi4r/edifact.rb, line 1618
    def on_unh_uih( s, tag )
    end
Called when UNT or UIT is encountered.
    # File lib/edi4r/edifact.rb, line 1623
    def on_unt_uit( s, tag )
    end
Called when UNZ or UIZ is encountered.
    # File lib/edi4r/edifact.rb, line 1603
    def on_unz_uiz( s, tag )
    end
Convenience method. Returns the path of the File object passed to method go, or just the string 'input stream' if the object passed does not respond to path.
    # File lib/edi4r/edifact.rb, line 1575
    def path
      @path
    end