class NewspaperWorks::TextExtraction::HOCRReader
Class to obtain plain text and JSON word-coordinates from hOCR source
- Coordinates are in pixel (px) units, so no scaling is needed; ALTO coordinates, by contrast, may require scaling
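For context, hOCR stores each word's bounding box in the element's title attribute as "bbox x0 y0 x1 y1", measured in pixels of the page image. The following is a minimal sketch of reading such a value. The parse_bbox helper is hypothetical and not part of this class; the actual parsing happens in HOCRDocStream, which is not shown here.

```ruby
# Hypothetical helper: extract pixel coordinates from an hOCR title attribute.
# Returns nil when the title carries no bbox property.
def parse_bbox(title)
  match = title.match(/bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)
  return nil unless match

  x0, y0, x1, y1 = match.captures.map(&:to_i)
  # Convert corner coordinates into origin plus size, all in pixels.
  { x: x0, y: y0, width: x1 - x0, height: y1 - y0 }
end

box = parse_bbox('bbox 100 200 180 230; x_wconf 95')
puts box.inspect
```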
Attributes
doc_stream[RW]
source[RW]
Public Class Methods
new(html)
Construct with either a file path or an HTML source string, then parse the document.
@param html [String] path to an hOCR file, or hOCR/HTML source
# File lib/newspaper_works/text_extraction/hocr_reader.rb, line 144
def initialize(html)
  @source = isxml?(html) ? html : File.read(html)
  @doc_stream = HOCRDocStream.new
  parser = Nokogiri::HTML::SAX::Parser.new(doc_stream)
  parser.parse(@source)
end
Public Instance Methods
isxml?(xml)
Determine whether the source parameter is a path or XML/HTML markup.
@param xml [String] either path to xml file or xml source
@return [true, false] true if value appears to be XML/HTML, not path
# File lib/newspaper_works/text_extraction/hocr_reader.rb, line 155
def isxml?(xml)
  xml.lstrip.start_with?('<')
end
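The path-or-markup dispatch performed by the constructor with isxml? can be sketched in isolation as follows. The markup? and load_source helpers are hypothetical stand-ins written for this illustration; they mirror the logic shown above but are not part of the library.

```ruby
require 'tempfile'

# Same heuristic as isxml?: markup starts with '<' after leading whitespace.
def markup?(value)
  value.lstrip.start_with?('<')
end

# Hypothetical helper mirroring the constructor's dispatch:
# return markup unchanged, otherwise treat the value as a file path.
def load_source(value)
  markup?(value) ? value : File.read(value)
end

hocr = "\n  <html><body><span class='ocrx_word'>Hello</span></body></html>"

# Inline markup is returned unchanged, even with leading whitespace.
from_string = load_source(hocr)

# A path is read from disk instead.
from_path = nil
Tempfile.create(['page', '.hocr']) do |file|
  file.write(hocr)
  file.flush
  from_path = load_source(file.path)
end

puts from_string == from_path
```

Because the check only looks at the first non-whitespace character, a file path that happens to begin with '<' would be treated as markup, and non-markup text that is not a valid path would raise when File.read is attempted.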
json()
Output JSON flattened word coordinates
@return [String] JSON serialization of flattened word coordinates
# File lib/newspaper_works/text_extraction/hocr_reader.rb, line 162
def json
  words = @doc_stream.words
  builder = NewspaperWorks::TextExtraction::WordCoordsBuilder.new(
    words,
    @doc_stream.width,
    @doc_stream.height
  )
  builder.to_json
end
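To illustrate what "flattened word coordinates" might look like, here is a minimal sketch using only the Ruby standard library. The shape of the word records, the flatten_words helper, and the output keys (width, height, coords) are all assumptions made for this example; the actual schema is defined by WordCoordsBuilder, which is not shown here.

```ruby
require 'json'

# Hypothetical word records, as a SAX handler might collect them.
# The field names here are assumptions for illustration.
words = [
  { word: 'Hello', x: 100, y: 200, width: 80, height: 30 },
  { word: 'world', x: 190, y: 200, width: 85, height: 30 },
  { word: 'Hello', x: 100, y: 260, width: 80, height: 30 }
]

# Flatten into a word => list-of-boxes mapping, so that repeated words
# accumulate one [x, y, width, height] entry per occurrence.
def flatten_words(words, page_width, page_height)
  coords = Hash.new { |hash, key| hash[key] = [] }
  words.each do |w|
    coords[w[:word]] << [w[:x], w[:y], w[:width], w[:height]]
  end
  { width: page_width, height: page_height, coords: coords }
end

payload = JSON.generate(flatten_words(words, 1200, 1800))
puts payload
```

Keying the mapping by word makes lookups for search-hit highlighting direct, at the cost of losing the original reading order of the words.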