module NdrImport::Helpers::File::XmlStreaming

This mixin adds XML streaming functionality so that Nokogiri can handle large files more efficiently. It uses the `XML::Reader` API and maintains a temporary DOM as the XML is streamed, which allows XPath querying from the root node.

If the system has `iconv` available, the mixin attempts to verify the encoding of the file externally, so that the file can be streamed into Ruby. Otherwise, it loads the raw data into memory to check the encoding, but still streams it through Nokogiri's parser.

Public Instance Methods

each_node(stream, encoding, xpath, pattern_match_xpath = nil, &block)

Yields each element matching `xpath` from `stream` as it is found.

If the encoding is suspect, the file may be read into memory in full, but the XML is still stream-parsed.

Pass `pattern_match_xpath` to optionally pattern match the xpath.

# File lib/ndr_import/helpers/file/xml_streaming.rb, line 147
def each_node(stream, encoding, xpath, pattern_match_xpath = nil, &block)
  return enum_for(:each_node, stream, encoding, xpath, pattern_match_xpath) unless block

  require 'nokogiri'

  stream_xml_nodes(stream, xpath, pattern_match_xpath, encoding, &block)
end
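
The `enum_for` guard on the first line of `each_node` lets callers omit the block and receive an Enumerator instead. A minimal sketch of that idiom follows; `RecordSource` and `each_record` are hypothetical names used only for illustration and are not part of NdrImport.

```ruby
# Hypothetical stand-in for a streaming method; not part of NdrImport.
class RecordSource
  def initialize(records)
    @records = records
  end

  # Yields each record, or returns an Enumerator when no block is given,
  # mirroring the guard used by each_node.
  def each_record(prefix, &block)
    return enum_for(:each_record, prefix) unless block

    @records.each { |record| yield "#{prefix}#{record}" }
  end
end

source = RecordSource.new(%w[a b c])

# Block form: records are handled as they are produced.
collected = []
source.each_record('row-') { |record| collected << record }

# Blockless form: Enumerable methods such as first stop iterating
# once they have what they need.
first_two = source.each_record('row-').first(2)
```

The blockless form is what makes it possible to chain calls such as `first`, `take_while` or `lazy` onto a streaming method without reading the whole input.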

Private Instance Methods

external_utf8_check?(safe_path)

Use iconv, if available, to check raw data encoding:

# File lib/ndr_import/helpers/file/xml_streaming.rb, line 174
def external_utf8_check?(safe_path)
  iconv = system('command -v iconv > /dev/null 2>&1')
  return false unless iconv

  path = SafeFile.safepath_to_string(safe_path)
  system("iconv -f UTF-8 #{Shellwords.escape(path)} > /dev/null 2>&1")
end
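
The same check can be tried outside the library. The sketch below is an illustration only: `iconv_confirms_utf8?` and `check_bytes` are hypothetical helpers that take a plain path rather than a `SafePath`, and temporary files stand in for real input.

```ruby
require 'shellwords'
require 'tempfile'

# Hypothetical helper: true only when iconv is on the PATH and accepts the
# file as UTF-8. A false result means "could not confirm", not "invalid".
def iconv_confirms_utf8?(path)
  return false unless system('command -v iconv > /dev/null 2>&1')

  system("iconv -f UTF-8 #{Shellwords.escape(path)} > /dev/null 2>&1")
end

# Hypothetical helper: writes the bytes to a temporary file and checks it.
def check_bytes(bytes)
  Tempfile.create('utf8-check') do |file|
    file.binmode
    file.write(bytes)
    file.flush
    iconv_confirms_utf8?(file.path)
  end
end

valid_result   = check_bytes("caf\u00E9\n")
invalid_result = check_bytes("caf\xE9\n".b) # a lone 0xE9 byte is not valid UTF-8
```

`invalid_result` is false whether or not `iconv` is installed. `valid_result` depends on the environment: it is false when `iconv` is missing, and, because no `-t` target is given, it may also be false if the locale's character set cannot represent the text. In both cases the caller simply takes the slower in-memory path.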

stream_xml_nodes(io, node_xpath, pattern_match_xpath, encoding = nil) { |element| ... }

# File lib/ndr_import/helpers/file/xml_streaming.rb, line 182
def stream_xml_nodes(io, node_xpath, pattern_match_xpath, encoding = nil)
  # Track nesting as the cursor moves through the document:
  cursor = Cursor.new(node_xpath, pattern_match_xpath)

  # If markup isn't well-formed, try to work around it:
  options = Nokogiri::XML::ParseOptions::RECOVER
  reader  = Nokogiri::XML::Reader(io, nil, encoding, options)

  reader.each do |node|
    case node.node_type
    when Nokogiri::XML::Reader::TYPE_ELEMENT # "opening tag"
      raise NestingError, node if cursor.in?(node)

      cursor.enter(node)
      next unless cursor.matches?

      # The xpath matched - construct a DOM fragment to yield back:
      element = Nokogiri::XML(node.outer_xml).at("./#{node.name}")
      yield element
    when Nokogiri::XML::Reader::TYPE_END_ELEMENT # "closing tag"
      cursor.leave(node)
    end
  end
end
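
The definition of `Cursor` is not shown on this page. As a rough mental model only, the enter/leave/matches? bookkeeping can be sketched with a stack of open element names; `PathCursor` below is a hypothetical, much simplified stand-in (exact absolute paths only, no pattern matching), and the event list is hand-written rather than produced by Nokogiri.

```ruby
# Hypothetical, simplified stand-in for the Cursor used above: it tracks
# the stack of open element names and compares it with an absolute path.
class PathCursor
  def initialize(path)
    @target = path.split('/').reject(&:empty?)
    @stack  = []
  end

  def enter(name)
    @stack.push(name)
  end

  def leave(name)
    closed = @stack.pop
    raise ArgumentError, "unbalanced </#{name}>" unless closed == name
  end

  def matches?
    @stack == @target
  end
end

# Hand-written open/close events for:
#   <root><item><name></name></item><other></other><item></item></root>
events = [
  [:open, 'root'], [:open, 'item'], [:open, 'name'], [:close, 'name'],
  [:close, 'item'], [:open, 'other'], [:close, 'other'],
  [:open, 'item'], [:close, 'item'], [:close, 'root']
]

cursor  = PathCursor.new('/root/item')
matched = 0

events.each do |type, name|
  if type == :open
    cursor.enter(name)
    matched += 1 if cursor.matches?
  else
    cursor.leave(name)
  end
end
```

Here `matched` ends up as 2: the two `item` elements directly under `root` match, while `name` and `other` do not. Only the current path is held in memory, which is what keeps the streaming approach cheap for large documents.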

with_encoding_check(safe_path) { |stream, forced_encoding| ... }

We need to ensure the raw data is UTF-8 before we start streaming it with Nokogiri. If we can do an external check, great. Otherwise, we need to slurp and convert the raw data before presenting it.

# File lib/ndr_import/helpers/file/xml_streaming.rb, line 160
def with_encoding_check(safe_path)
  forced_encoding = nil

  stream = ::File.open(SafeFile.safepath_to_string(safe_path))

  unless external_utf8_check?(safe_path)
    stream = StringIO.new ensure_utf8!(stream.read)
    forced_encoding = 'UTF8'
  end

  yield stream, forced_encoding
end
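
The fallback branch relies on `ensure_utf8!`, whose definition is not shown here. The sketch below illustrates the general slurp-and-convert idea with a hypothetical `coerce_to_utf8` helper; the assumption that non-UTF-8 input is Windows-1252 is made for this example only and is not taken from the library.

```ruby
require 'stringio'

# Hypothetical stand-in for ensure_utf8!: returns valid UTF-8, assuming
# (for this sketch only) that non-UTF-8 input is Windows-1252.
def coerce_to_utf8(raw)
  candidate = raw.dup.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  raw.dup.force_encoding('Windows-1252').encode('UTF-8')
end

# Wraps the converted data in a StringIO so it can still be handed to a
# stream parser, as with_encoding_check does in its fallback branch.
def utf8_stream(raw)
  StringIO.new(coerce_to_utf8(raw))
end

legacy_bytes = "<name>caf\xE9</name>".b # 0xE9 is "é" in Windows-1252
stream       = utf8_stream(legacy_bytes)
```

The cost of this path is that the whole file is held in memory once as a string, although the XML itself is still parsed incrementally from the `StringIO`.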