module PdfReaderPatch
This is a horrendous patch to work around issue #169 in the pdf-reader repo (https://github.com/yob/pdf-reader/issues/169). The short version is that whatever HSBC uses to generate its PDFs doesn’t seem to terminate inline image data with a null or whitespace character. As a result, when PdfReader looks for the ‘EI’ token while parsing inline media, it never finds it and simply runs off the end of the document, causing a TypeError to be raised. The PDF files I’m getting all seem to end some of their images with \xE0, so I’ve simply monkey-patched that byte into the library for use with my files.
This may not be the case for anyone else; if so, maybe add whatever your problem character is to the regex and open a PR should you feel the need. Or, y’know, look at the PdfReader source and see if you can work out something better, because this is horrendous :/
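The failure described above can be reproduced without pdf-reader at all. The following is a minimal sketch, not the library's code: `inline_image_data`, `UPSTREAM_TERMINATOR` and `PATCHED_TERMINATOR` are hypothetical names, the upstream pattern is assumed to be the patched one minus `\xE0`, and the sample bytes are made up. Unlike the real method, it returns nil at end of input instead of raising.

```ruby
require "stringio"

# Assumed upstream terminator set (whitespace or null) and the patched one.
UPSTREAM_TERMINATOR = /\s|\0/
PATCHED_TERMINATOR  = /\s|\0|\xE0/n

# Hypothetical helper: collects the bytes that precede the first
# "<terminator>EI" sequence, or returns nil if the input runs out first.
def inline_image_data(io, terminator)
  data = "".b
  buffer = []
  until buffer[0] =~ terminator && buffer[1, 2] == ["E", "I"]
    chr = io.read(1)
    return nil if chr.nil? # ran off the end of the document
    buffer << chr
    data << buffer.shift if buffer.length > 3
  end
  data
end

# Made-up image bytes, terminated by a space and by \xE0 respectively.
well_formed = "\x01\x02\x03 EI Q".b
hsbc_style  = "\x01\x02\x03\xE0EI Q".b

ok      = inline_image_data(StringIO.new(well_formed), UPSTREAM_TERMINATOR)
missed  = inline_image_data(StringIO.new(hsbc_style), UPSTREAM_TERMINATOR)
patched = inline_image_data(StringIO.new(hsbc_style), PATCHED_TERMINATOR)
```

Here `missed` is nil because the scan never sees an accepted terminator before ‘EI’, while `patched` recovers the same three bytes as `ok`.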
Public Class Methods
included( base )
# File lib/hsbc_pdf_statement_parser/pdf_reader_patch.rb, line 15
def self.included( base )
  base.class_eval do
    def prepare_inline_token
      str = "".dup
      buffer = []
      to_rewind = -3

      until buffer[0] =~ /\s|\0|\xE0/n && buffer[1, 2] == ['E', 'I']
        chr = @io.read(1)
        buffer << chr

        if buffer.length > 3
          str << buffer.shift
        end

        to_rewind = -2 if buffer.first =~ /\xE0/n
      end

      str << '\0' if buffer.first == '\0'

      @tokens << string_token(str)

      @io.seek(to_rewind, IO::SEEK_CUR) unless chr.nil?
    end
  end
end
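The `included` hook with `class_eval` is presumably used because a method that merely lives in an included module sits behind the class's own method in the lookup chain, so it would never be called. A minimal sketch of the difference follows; `NaivePatch`, `HookPatch`, `NaiveTarget` and `HookTarget` are hypothetical stand-ins, not pdf-reader classes.

```ruby
# A method defined directly in the module does not replace the class's own.
module NaivePatch
  def prepare_inline_token
    :patched
  end
end

# Redefining the method on the including class via the hook does replace it.
module HookPatch
  def self.included(base)
    base.class_eval do
      def prepare_inline_token
        :patched
      end
    end
  end
end

# Stand-ins for the library class that already defines the method.
class NaiveTarget
  def prepare_inline_token
    :original
  end
  include NaivePatch
end

class HookTarget
  def prepare_inline_token
    :original
  end
  include HookPatch
end

naive_result = NaiveTarget.new.prepare_inline_token # class method still wins
hook_result  = HookTarget.new.prepare_inline_token  # overridden by the hook
```

With the real library, the patch would be applied by including PdfReaderPatch into whichever pdf-reader class defines prepare_inline_token (assumed here to be its buffer class).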
Public Instance Methods
prepare_inline_token()
# File lib/hsbc_pdf_statement_parser/pdf_reader_patch.rb, line 17
def prepare_inline_token
  str = "".dup
  buffer = []
  to_rewind = -3

  until buffer[0] =~ /\s|\0|\xE0/n && buffer[1, 2] == ['E', 'I']
    chr = @io.read(1)
    buffer << chr

    if buffer.length > 3
      str << buffer.shift
    end

    to_rewind = -2 if buffer.first =~ /\xE0/n
  end

  str << '\0' if buffer.first == '\0'

  @tokens << string_token(str)

  @io.seek(to_rewind, IO::SEEK_CUR) unless chr.nil?
end
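The two rewind distances are easiest to see on a plain StringIO. This sketch assumes the intent is that a -3 rewind leaves the whitespace terminator in front of ‘EI’ for the tokenizer, while a -2 rewind steps over the stray \xE0 byte; `remaining_after_rewind` and the sample bytes are hypothetical.

```ruby
require "stringio"

# Hypothetical helper: read up to and including "EI", seek back by `offset`
# relative to the current position, and return whatever is left to read.
def remaining_after_rewind(bytes, offset)
  io = StringIO.new(bytes.b)
  io.read(bytes.b.index("EI") + 2) # consume everything through "EI"
  io.seek(offset, IO::SEEK_CUR)
  io.read
end

# Whitespace-terminated data: rewinding 3 keeps the space before "EI".
after_whitespace = remaining_after_rewind("\x01\x02 EI Q", -3)

# \xE0-terminated data: rewinding 2 drops the \xE0 and starts at "EI".
after_e0 = remaining_after_rewind("\x01\x02\xE0EI Q", -2)
```

In the first case the remaining input is " EI Q"; in the second it is "EI Q", so the \xE0 byte never reaches the next token.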