class PDF::Reader::ObjectHash
This monkey-patches pdf-reader so that it can read PDFs that have junk characters in the file before the start of the PDF stream. (The junk is quite commonly an HTML head block; I suspect a bug in the Adobe or other software used to serve the bills.)
The patch has been contributed back to the pdf-reader project (github.com/yob/pdf-reader/pull/54) and has already been merged on master. When it shows up in a release of the pdf-reader gem, we can remove this patch.
Public Instance Methods
extract_io_from(input)
# File lib/pdf/reader/patch/object_hash.rb, line 12
def extract_io_from(input)
  if input.respond_to?(:seek) && input.respond_to?(:read)
    input
  elsif File.file?(input.to_s)
    read_with_quirks(input)
  else
    raise ArgumentError, "input must be an IO-like object or a filename"
  end
end
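The first branch of extract_io_from is a duck-typing test: anything that responds to both seek and read is treated as an already-open stream and returned untouched. A minimal sketch of that check in isolation, assuming a hypothetical helper name io_like? that is not part of pdf-reader:

```ruby
require "stringio"

# Hypothetical predicate (not part of pdf-reader) that mirrors the
# duck-typing test used by extract_io_from.
def io_like?(input)
  input.respond_to?(:seek) && input.respond_to?(:read)
end

stream_result   = io_like?(StringIO.new("%PDF-1.4")) # an in-memory stream
filename_result = io_like?("bill.pdf")               # a plain String path
```

A String path fails the test because String defines neither seek nor read, so it falls through to the File.file? branch and is loaded via read_with_quirks.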
Private Instance Methods
pdf_offset(stream)
Returns the offset of the PDF document in the stream. Checks up to 50 characters into the file and returns nil if no PDF stream is detected.
# File lib/pdf/reader/patch/object_hash.rb, line 37
def pdf_offset(stream)
  stream.rewind
  ofs = stream.pos
  # readchar returns a String on Ruby 1.9+ and an Integer (37 for '%') on Ruby 1.8
  until (c = stream.readchar) == '%' || c == 37 || ofs > 50
    ofs += 1
  end
  ofs < 50 ? ofs : nil
end
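Note that pdf_offset stops at the first '%' character rather than the full %PDF marker, and readchar raises EOFError if the stream ends before a '%' is found. The following is a sketch of a stricter variant, not the code shipped in the patch; the name find_pdf_marker and the limit parameter are assumptions made for illustration:

```ruby
require "stringio"

# Hypothetical helper (not part of pdf-reader): returns the byte offset of
# the "%PDF" marker if it starts within the first `limit` bytes, else nil.
def find_pdf_marker(stream, limit = 50)
  stream.rewind
  # Read the search window plus 4 extra bytes so a marker that starts at
  # the last allowed position can still be matched in full. read returns
  # nil on an empty stream, hence the fallback to an empty string.
  head = stream.read(limit + 4) || ""
  head.index("%PDF")
ensure
  # Leave the stream positioned at the start for the caller.
  stream.rewind
end

junk_offset  = find_pdf_marker(StringIO.new("<head></head>%PDF-1.4\n"))
clean_offset = find_pdf_marker(StringIO.new("%PDF-1.4\n"))
no_marker    = find_pdf_marker(StringIO.new("not a pdf"))
```

Reading a fixed-size window in one call avoids the per-character loop and returns nil for short or empty input instead of raising.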
read_with_quirks(input)
Loads the file as a StringIO stream, accounting for an invalid format in which additional characters appear in the file before the %PDF start-of-file marker.
# File lib/pdf/reader/patch/object_hash.rb, line 24
def read_with_quirks(input)
  stream = File.open(input.to_s, "rb")
  if ofs = pdf_offset(stream)
    stream.seek(ofs)
    StringIO.new(stream.read)
  else
    raise ArgumentError, "invalid file format"
  end
end
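To see the quirk handling end to end without pdf-reader installed, the sketch below builds a sample file with an HTML fragment ahead of the PDF header and strips it. The function name load_pdf_skipping_junk, the limit parameter, and the sample file contents are all hypothetical; this is an illustration of the approach, not the patched method itself:

```ruby
require "stringio"
require "tempfile"

# Hypothetical stand-alone version of the quirk handling: open the file in
# binary mode, skip anything before "%PDF", and return the rest as a StringIO.
def load_pdf_skipping_junk(path, limit = 50)
  # The block form of File.open closes the file handle when the block exits.
  File.open(path, "rb") do |file|
    head = file.read(limit + 4) || ""
    offset = head.index("%PDF")
    raise ArgumentError, "invalid file format" if offset.nil?

    file.seek(offset)
    StringIO.new(file.read)
  end
end

# Build a sample file with junk characters ahead of the PDF header.
sample = Tempfile.new(["bill", ".pdf"])
sample.binmode
sample.write("<head></head>%PDF-1.4\n%%EOF\n")
sample.close

cleaned = load_pdf_skipping_junk(sample.path)
header  = cleaned.read(8) # the first 8 bytes of the cleaned stream
```

Unlike read_with_quirks as documented, this sketch uses the block form of File.open so the underlying file is closed once its contents have been copied into the StringIO.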