class NewspaperWorks::TextExtraction::AltoReader

Class to obtain plain text and JSON word-coordinates from ALTO source

Attributes

doc_stream[RW]
source[RW]

Public Class Methods

new(xml, image_width = nil, image_height = nil)

Construct with either a path to an ALTO XML file or the XML source itself, then parse the document.

@param xml [String] path to an ALTO XML file, or ALTO XML source

# File lib/newspaper_works/text_extraction/alto_reader.rb, line 93
def initialize(xml, image_width = nil, image_height = nil)
  @source = isxml?(xml) ? xml : File.read(xml)
  @image_width = image_width
  @image_height = image_height
  @doc_stream = AltoDocStream.new(image_width)
  parser = Nokogiri::XML::SAX::Parser.new(doc_stream)
  parser.parse(@source)
end
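
The constructor passes the document to a SAX handler, which receives one callback per element instead of building a full tree. The following is a minimal sketch of that pattern, not the library's implementation: it uses REXML's stream parser in place of Nokogiri so that it runs without extra dependencies. `WordCollector` and `SAMPLE_ALTO` are hypothetical names; the real handler is `AltoDocStream`, whose internals are not shown on this page. The sample fragment also omits the XML namespace that real ALTO files declare.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical stand-in for AltoDocStream: collects one entry per ALTO
# <String> element as the parser streams through the document.
class WordCollector
  include REXML::StreamListener

  attr_reader :words

  def initialize
    @words = []
  end

  # Called by the stream parser once for every opening tag.
  def tag_start(name, attrs)
    return unless name == 'String'

    attrs = attrs.to_h # normalize, whatever pair-like form the parser passes
    @words << {
      word: attrs['CONTENT'],
      x: attrs['HPOS'].to_f,
      y: attrs['VPOS'].to_f,
      width: attrs['WIDTH'].to_f,
      height: attrs['HEIGHT'].to_f
    }
  end
end

# Simplified ALTO fragment (assumption: no namespace, two words on one line).
SAMPLE_ALTO = <<~XML
  <alto>
    <Layout>
      <Page WIDTH="1000" HEIGHT="1500">
        <TextLine>
          <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="50" HEIGHT="12"/>
          <String CONTENT="world" HPOS="70" VPOS="20" WIDTH="55" HEIGHT="12"/>
        </TextLine>
      </Page>
    </Layout>
  </alto>
XML

collector = WordCollector.new
REXML::Document.parse_stream(SAMPLE_ALTO, collector)
puts collector.words.map { |w| w[:word] }.join(' ')
```

With the actual class, the equivalent call would be `AltoReader.new(path_or_xml)`, after which the parsed handler is available through `doc_stream`.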

Public Instance Methods

isxml?(xml)

Determine whether the source parameter is a path or XML source

@param xml [String] either path to XML file or XML source

@return [true, false] true if string appears to be XML source, not path

# File lib/newspaper_works/text_extraction/alto_reader.rb, line 106
def isxml?(xml)
  xml.lstrip.start_with?('<')
end
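
The check is a heuristic: any string whose first non-whitespace character is `<` is treated as markup, and everything else is treated as a file path. A small standalone sketch of that behavior follows. `xml_source?` and `resolve_source` are hypothetical names that mirror `isxml?` and the first line of the constructor; they are not part of the library.

```ruby
require 'tempfile'

# Same heuristic as isxml?: markup starts with '<' once leading
# whitespace is ignored.
def xml_source?(str)
  str.lstrip.start_with?('<')
end

# Hypothetical helper mirroring the constructor's first line:
# return the string itself if it looks like XML, else read it as a path.
def resolve_source(xml)
  xml_source?(xml) ? xml : File.read(xml)
end

xml_source?("  \n<alto/>")      # => true, leading whitespace is ignored
xml_source?('/data/page1.xml')  # => false, treated as a path

# A path is read from disk; inline XML is returned unchanged.
Tempfile.create(['page', '.xml']) do |file|
  file.write('<alto/>')
  file.flush
  puts resolve_source(file.path)
end
puts resolve_source('<alto/>')
```

Because only the first character is inspected, a string that starts with `<` but is not well-formed XML still passes the check; any error would surface later, during parsing.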

json()

Output JSON flattened word coordinates

@return [String] JSON serialization of flattened word coordinates

# File lib/newspaper_works/text_extraction/alto_reader.rb, line 113
def json
  words = @doc_stream.words
  builder = NewspaperWorks::TextExtraction::WordCoordsBuilder.new(words,
                                                                  @image_width,
                                                                  @image_height)
  builder.to_json
end
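
To illustrate what "flattened word coordinates" could look like, here is a hedged sketch that groups bounding boxes by word text and serializes the result. The output layout (`width`, `height`, and a `coords` map of word to `[x, y, width, height]` boxes) is an assumption made for illustration; the actual format is defined by `WordCoordsBuilder`, which is not shown on this page. `flatten_word_coords` is a hypothetical name.

```ruby
require 'json'

# Hypothetical flattening: every occurrence of a word contributes one
# [x, y, width, height] box under that word's key.
def flatten_word_coords(words, image_width = nil, image_height = nil)
  coords = Hash.new { |hash, key| hash[key] = [] }
  words.each do |w|
    coords[w[:word]] << [w[:x], w[:y], w[:width], w[:height]]
  end
  payload = { width: image_width, height: image_height, coords: coords }
  JSON.generate(payload)
end

words = [
  { word: 'Hello', x: 10, y: 20, width: 50, height: 12 },
  { word: 'world', x: 70, y: 20, width: 55, height: 12 },
  { word: 'Hello', x: 10, y: 40, width: 50, height: 12 }
]

puts flatten_word_coords(words, 1000, 1500)
```

With the real class, the call is simply `AltoReader.new(path_or_xml, width, height).json`, which returns a JSON string.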