class NewspaperWorks::TextExtraction::AltoReader
Class to obtain plain text and JSON word-coordinates from ALTO source
Attributes
doc_stream[RW]
source[RW]
Public Class Methods
new(xml, image_width = nil, image_height = nil)
Construct with either a path to an ALTO XML file or the XML source itself, then parse the document.
@param xml [String] path to an ALTO XML file, or XML source
# File lib/newspaper_works/text_extraction/alto_reader.rb, line 93
def initialize(xml, image_width = nil, image_height = nil)
  @source = isxml?(xml) ? xml : File.read(xml)
  @image_width = image_width
  @image_height = image_height
  @doc_stream = AltoDocStream.new(image_width)
  parser = Nokogiri::XML::SAX::Parser.new(doc_stream)
  parser.parse(@source)
end
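The constructor hands the source to a SAX parser, so words are collected as parse events arrive and no full document tree is built. The sketch below illustrates that event-driven idea with REXML's stream parser, swapped in for Nokogiri so the snippet needs no extra gems. `WordCollector` is a hypothetical stand-in for `AltoDocStream`, whose internals are not shown on this page. The hash shape it builds is an assumption, and the sample ALTO fragment is made up for illustration.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical stand-in for AltoDocStream: collects one hash per
# ALTO <String> element as the stream parser reports opening tags.
class WordCollector
  include REXML::StreamListener

  attr_reader :words

  def initialize
    @words = []
  end

  # Called once for every opening tag; attrs is a Hash of attribute
  # names to string values.
  def tag_start(name, attrs)
    return unless name == 'String'

    @words << {
      word: attrs['CONTENT'],
      coordinates: %w[HPOS VPOS WIDTH HEIGHT].map { |key| attrs[key].to_f }
    }
  end
end

# Made-up ALTO fragment with two words on one text line.
alto = <<~XML
  <alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
    <Layout><Page><PrintSpace><TextBlock><TextLine>
      <String CONTENT="Daily" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
      <String CONTENT="News" HPOS="65" VPOS="20" WIDTH="48" HEIGHT="12"/>
    </TextLine></TextBlock></PrintSpace></Page></Layout>
  </alto>
XML

collector = WordCollector.new
REXML::Document.parse_stream(alto, collector)
collector.words.each { |entry| puts entry.inspect }
```

Streaming keeps memory use roughly proportional to the number of words collected, which matters for full newspaper pages with thousands of `String` elements.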
Public Instance Methods
isxml?(xml)
Determine whether the source parameter is a path or XML source
@param xml [String] either path to xml file or xml source
@return [true, false] true if string appears to be XML source, not path
# File lib/newspaper_works/text_extraction/alto_reader.rb, line 106
def isxml?(xml)
  xml.lstrip.start_with?('<')
end
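The check is a heuristic: any string whose first non-whitespace character is `<` is treated as XML source, and everything else is treated as a path. A minimal sketch of how that dispatch behaves, using a hypothetical standalone helper `resolve_source` that mirrors the constructor's logic without depending on the gem:

```ruby
require 'tempfile'

# Hypothetical helper mirroring the constructor's path-or-source dispatch.
def resolve_source(xml)
  xml.lstrip.start_with?('<') ? xml : File.read(xml)
end

# Leading whitespace is ignored, so this is treated as XML source
# and returned unchanged.
inline = "  \n<alto><Layout/></alto>"
from_string = resolve_source(inline)

# A file path does not start with '<', so the file is read from disk.
from_file = Tempfile.create(['sample', '.xml']) do |file|
  file.write('<alto><Layout/></alto>')
  file.flush
  resolve_source(file.path)
end

puts from_string == inline
puts from_file
```

One consequence of the heuristic is that XML source preceded by a byte-order mark would not start with `<` after `lstrip` and would be treated as a path, so callers may want to strip a BOM first.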
json()
Output JSON flattened word coordinates
@return [String] JSON serialization of flattened word coordinates
# File lib/newspaper_works/text_extraction/alto_reader.rb, line 113
def json
  words = @doc_stream.words
  builder = NewspaperWorks::TextExtraction::WordCoordsBuilder.new(words, @image_width, @image_height)
  builder.to_json
end
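The exact schema produced by `WordCoordsBuilder` is not shown on this page. As a rough illustration of what "flattened word coordinates" could look like, the sketch below groups bounding boxes by word text and serializes them alongside the image dimensions. The helper `flatten_words`, the input hash shape, and the output keys (`width`, `height`, `coords`) are all assumptions made for this example, not the library's documented format.

```ruby
require 'json'

# Hypothetical word list, shaped like what a doc stream might collect:
# each entry has the word text and an [x, y, width, height] box.
words = [
  { word: 'Daily', coordinates: [10, 20, 50, 12] },
  { word: 'News',  coordinates: [65, 20, 48, 12] },
  { word: 'Daily', coordinates: [10, 40, 50, 12] }
]

# Hypothetical helper: group coordinate boxes by word text so that
# repeated words share one key, then serialize with image dimensions.
def flatten_words(words, width, height)
  coords = words.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |entry, memo|
    memo[entry[:word]] << entry[:coordinates]
  end
  JSON.generate({ width: width, height: height, coords: coords })
end

payload = flatten_words(words, 800, 1200)
puts payload
```

Keying by word text makes lookups for search-term highlighting cheap, at the cost of losing the original reading order of the words.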