class NewspaperWorks::TextExtraction::HOCRReader

Class to obtain plain text and JSON word-coordinates from hOCR source

- Coordinates are in pixel units; ALTO coordinates, by contrast, may need to be scaled
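
As a rough illustration of the pixel coordinates mentioned above: hOCR records each word's bounding box in the element's `title` attribute as `bbox x0 y0 x1 y1`, in pixels. The sketch below parses such a string in plain Ruby. The helper name `bbox_from_title` and the returned hash layout are assumptions made for this example; they are not part of NewspaperWorks, and the class itself may extract coordinates differently.

```ruby
# Hypothetical helper (not part of NewspaperWorks): read the pixel bounding
# box out of an hOCR title attribute such as "bbox 36 92 96 116; x_wconf 95".
def bbox_from_title(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  # hOCR gives two corners: top-left (x0, y0) and bottom-right (x1, y1).
  x0, y0, x1, y1 = match.captures.map(&:to_i)
  { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
end

box = bbox_from_title('bbox 36 92 96 116; x_wconf 95')
```

Because the values are already pixels on the source image, no unit conversion is applied in this sketch.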

Attributes

doc_stream[RW]
source[RW]

Public Class Methods

new(html)

Construct from either a file path or hOCR/HTML source [String], then parse the document

@param html [String] path to an hOCR file, or the hOCR/HTML source itself

# File lib/newspaper_works/text_extraction/hocr_reader.rb, line 144
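def initialize(html)
  # Treat the argument as markup if it looks like XML/HTML; otherwise read it as a file path.
  @source = isxml?(html) ? html : File.read(html)
  # SAX handler that collects words and page dimensions while parsing.
  @doc_stream = HOCRDocStream.new
  parser = Nokogiri::HTML::SAX::Parser.new(doc_stream)
  parser.parse(@source)
end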

Public Instance Methods

isxml?(xml)

Determine if source parameter is path or xml/html

@param xml [String] either path to xml file or xml source

@return [true, false] true if value appears to be XML/HTML, not path

# File lib/newspaper_works/text_extraction/hocr_reader.rb, line 155
def isxml?(xml)
  xml.lstrip.start_with?('<')
end
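
The check above is a simple heuristic: after leading whitespace is ignored, markup starts with `<` and a file path normally does not. A standalone sketch of the same test follows; the name `markup_source?` is hypothetical and used only so the example runs without the gem.

```ruby
# Hypothetical standalone copy of the heuristic, for illustration only.
def markup_source?(value)
  value.lstrip.start_with?('<')
end

from_markup = markup_source?("  <html><body></body></html>")
from_path = markup_source?('/tmp/page.hocr')
```

Under this heuristic, `from_markup` is true and `from_path` is false. A path that itself begins with `<` would be misread as markup, which is unlikely in practice.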

json()

Output JSON flattened word coordinates

@return [String] JSON serialization of flattened word coordinates

# File lib/newspaper_works/text_extraction/hocr_reader.rb, line 162
def json
  words = @doc_stream.words
  builder = NewspaperWorks::TextExtraction::WordCoordsBuilder.new(
    words,
    @doc_stream.width,
    @doc_stream.height
  )
  builder.to_json
end
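
To make "flattened word coordinates" more concrete, here is a minimal sketch of one way such a structure could be built and serialized. The input hash keys (`:word`, `:coordinates`), the output layout (`width`, `height`, `coords`), and the helper name `flatten_word_coords` are all assumptions for illustration; the actual format is defined by `WordCoordsBuilder` and may differ.

```ruby
require 'json'

# Hypothetical helper (not the WordCoordsBuilder implementation): group the
# boxes of repeated words under a single key, then serialize with page size.
def flatten_word_coords(words, width, height)
  coords = Hash.new { |hash, key| hash[key] = [] }
  words.each do |entry|
    coords[entry[:word]] << entry[:coordinates]
  end
  JSON.generate({ width: width, height: height, coords: coords })
end

# Assumed input shape: one hash per word, with an [x, y, width, height] box in pixels.
words = [
  { word: 'The',  coordinates: [36, 92, 60, 24] },
  { word: 'news', coordinates: [102, 92, 84, 24] },
  { word: 'The',  coordinates: [36, 130, 60, 24] }
]
payload = flatten_word_coords(words, 2400, 3600)
```

Keying by word text means a consumer can look up every occurrence of a search term with a single hash access.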