module NdrImport::Helpers::File::XmlStreaming
This mixin adds XML streaming functionality, so that Nokogiri can handle large files more efficiently. It uses the `XML::Reader` API, and maintains a temporary DOM as the XML is streamed to allow XPath querying from the root node.
If the system has `iconv` available, the mixin attempts to verify the encoding of the file externally, so that the file can be streamed into Ruby. Otherwise, it loads the raw data into memory to check the encoding, but still streams it through Nokogiri's parser.
Public Instance Methods
Streams the contents of the given `safe_path`, and yields each element matching `xpath` as it is found.
If the encoding is dodgy, this may fall back to slurping the file, but will still use stream parsing for the XML.
    # File lib/ndr_import/helpers/file/xml_streaming.rb, line 119
    def each_node(safe_path, xpath, &block)
      return enum_for(:each_node, safe_path, xpath) unless block

      require 'nokogiri'

      with_encoding_check(safe_path) do |stream, encoding|
        stream_xml_nodes(stream, xpath, encoding, &block)
      end
    end
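The `enum_for` guard at the top of `each_node` is a common Ruby idiom: when no block is given, the method returns an Enumerator over itself instead of yielding. A minimal sketch of that idiom follows, using a hypothetical `each_record` method over an in-memory array rather than the mixin itself:

```ruby
# Hypothetical stand-in for a streaming reader; not part of ndr_import.
def each_record(records, &block)
  # With no block, hand back an Enumerator over this same method.
  return enum_for(:each_record, records) unless block

  records.each(&block)
end

# Block form: records are yielded one at a time.
each_record(%w[a b c]) { |record| puts record }

# Enumerator form: the caller can chain Enumerable methods instead.
first_two = each_record(%w[a b c]).first(2)
```

Because the Enumerator pulls values on demand, a call such as `first(2)` stops after two records rather than walking the whole source.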
Private Instance Methods
Use iconv, if available, to check raw data encoding:
    # File lib/ndr_import/helpers/file/xml_streaming.rb, line 148
    def external_utf8_check?(safe_path)
      iconv = system('command -v iconv > /dev/null 2>&1')
      return false unless iconv

      path = SafeFile.safepath_to_string(safe_path)
      system("iconv -f UTF-8 #{Shellwords.escape(path)} > /dev/null 2>&1")
    end
    # File lib/ndr_import/helpers/file/xml_streaming.rb, line 156
    def stream_xml_nodes(io, node_xpath, encoding = nil)
      # Track nesting as the cursor moves through the document:
      cursor = Cursor.new(node_xpath)

      # If markup isn't well-formed, try to work around it:
      options = Nokogiri::XML::ParseOptions::RECOVER
      reader = Nokogiri::XML::Reader(io, nil, encoding, options)

      reader.each do |node|
        case node.node_type
        when Nokogiri::XML::Reader::TYPE_ELEMENT # "opening tag"
          raise NestingError, node if cursor.in?(node)

          cursor.enter(node)
          next unless cursor.matches?

          # The xpath matched - construct a DOM fragment to yield back:
          element = Nokogiri::XML(node.outer_xml).at("./#{node.name}")
          yield element
        when Nokogiri::XML::Reader::TYPE_END_ELEMENT # "closing tag"
          cursor.leave(node)
        end
      end
    end
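The nesting-tracking idea in `stream_xml_nodes` can be illustrated without a parser. The sketch below is an assumption-laden simplification: `PathCursor` is a hypothetical class (the library's own `Cursor` is not shown here and may behave differently), it only understands absolute paths of plain element names, and the open/close events are synthetic rather than produced by `Nokogiri::XML::Reader`.

```ruby
# Hypothetical, simplified cursor; the library's own Cursor class is not shown
# in this section and may behave differently.
class PathCursor
  def initialize(xpath)
    # Only absolute paths of plain element names are supported, e.g. "/root/item".
    @target = xpath.split('/').reject(&:empty?)
    @stack = []
  end

  # Record that an opening tag was seen.
  def enter(name)
    @stack.push(name)
  end

  # Record that a closing tag was seen; it must close the innermost element.
  def leave(name)
    raise ArgumentError, "unexpected closing tag: #{name}" unless @stack.last == name

    @stack.pop
  end

  # True when the current nesting is exactly the target path.
  def matches?
    @stack == @target
  end
end

# Synthetic open/close events standing in for what a pull parser would emit.
events = [
  [:open, 'root'], [:open, 'item'], [:close, 'item'],
  [:open, 'other'], [:open, 'item'], [:close, 'item'], [:close, 'other'],
  [:open, 'item'], [:close, 'item'], [:close, 'root']
]

cursor = PathCursor.new('/root/item')
matched_at = []

events.each_with_index do |(type, name), index|
  case type
  when :open
    cursor.enter(name)
    matched_at << index if cursor.matches?
  when :close
    cursor.leave(name)
  end
end
```

Here the `item` nested under `other` is not recorded, because the stack at that point is `root/other/item` rather than `root/item`.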
We need to ensure the raw data is UTF-8 before we start streaming it with Nokogiri. If we can do an external check, great. Otherwise, we need to slurp and convert the raw data before presenting it.
    # File lib/ndr_import/helpers/file/xml_streaming.rb, line 134
    def with_encoding_check(safe_path)
      forced_encoding = nil

      stream = ::File.open(SafeFile.safepath_to_string(safe_path))

      unless external_utf8_check?(safe_path)
        stream = StringIO.new ensure_utf8!(stream.read)
        forced_encoding = 'UTF8'
      end

      yield stream, forced_encoding
    end
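The slurp-and-convert fallback can be sketched in plain Ruby. `ensure_utf8!` is defined outside this module and is not shown here, so `coerce_to_utf8` below is a hypothetical stand-in, and its choice of Windows-1252 as the fallback encoding is an assumption rather than the helper's documented behaviour.

```ruby
require 'stringio'

# Hypothetical stand-in for the ensure_utf8! helper used above; the real helper
# is defined elsewhere and may use different fallback encodings.
def coerce_to_utf8(raw)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Assumption: bytes that are not valid UTF-8 are treated as Windows-1252.
  # Bytes undefined in Windows-1252 would raise Encoding::UndefinedConversionError.
  raw.dup.force_encoding(Encoding::Windows_1252).encode(Encoding::UTF_8)
end

valid = coerce_to_utf8("caf\xC3\xA9".b) # bytes that are already UTF-8
legacy = coerce_to_utf8("caf\xE9".b)    # Windows-1252 byte for "e acute"

# Wrap the converted data so it can be read like a file stream.
stream = StringIO.new(legacy)
```

Wrapping the converted string in a `StringIO` gives the parser the same `read` interface as an open file, at the cost of holding the whole document in memory.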