class Iguvium::Page

A single page of the document. Tables are extracted from it with {Iguvium::Page#extract_tables!}.

{Iguvium::Page#text} is handy for checking in advance whether a page is worth processing.

@example

pages = Iguvium.read('nixon.pdf', gspath: '/usr/bin/gs')
pages = pages.select { |page| page.text.match?(/[Tt]able.+\d+/) }
tables = pages.map(&:extract_tables!)

Attributes

lines[R]

@!visibility private
@return (see Iguvium::CV#lines)

Public Class Methods

new(page, path, **opts)

@param page [PDF::Reader::Page]
@param (see Iguvium.read)

Typically you don't need to call this directly; prefer creating a {Page} through {Iguvium.read}.

# File lib/iguvium/page.rb, line 17
def initialize(page, path, **opts)
  @opts = opts
  @reader_page = page
  @path = path
end

Public Instance Methods

characters()

@!visibility private
@return [Array<PDF::Reader::TextRun>] array of characters on the page. Each character has its coordinates, size, and width.

# File lib/iguvium/page.rb, line 63
def characters
  return @characters if @characters

  receiver = PDF::Reader::PageTextReceiver.new
  @reader_page.send(:walk, receiver)
  @characters = receiver.instance_variable_get('@characters')
end
extract_tables!(images: @opts[:images])

This method does all the heavy lifting, including optical recognition of table borders. It returns an array of {Iguvium::Table}, or an empty array if it fails to recognize any. To get structured data from a parsed {Iguvium::Table}, call {Iguvium::Table#to_a}.

@todo Further speed improvements should be done, expecting at least 30% speedup on multicore systems

Because a PDF document is generally a collection of independent pages, {Iguvium::Page#extract_tables!} is well suited to parallel processing. Concurrent processing, on the other hand (think of fork as parallel and thread as concurrent), is not a great idea, because this is a CPU-intensive task.

On some older CPUs it takes up to 2 seconds per page (up to 1 second on more modern ones), so use it wisely.
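The fork-based approach mentioned above could be sketched roughly as follows. `parallel_map` is a hypothetical helper, not part of Iguvium: it runs the block for each item in a forked child process and sends the result back through a pipe with `Marshal`. It assumes a platform that supports `fork` and results that can be marshaled.

```ruby
# Hypothetical helper (not part of Iguvium): maps items through the block,
# running each call in its own forked child process, at most `workers` at a time.
# Results are returned in input order.
def parallel_map(items, workers: 4, &block)
  results = Array.new(items.size)

  items.each_with_index.each_slice(workers) do |batch|
    jobs = batch.map do |item, index|
      reader, writer = IO.pipe
      pid = fork do
        reader.close
        writer.binmode
        # Serialize the block's result and send it to the parent.
        writer.write(Marshal.dump(block.call(item)))
        writer.close
        exit!(0) # skip at_exit handlers inherited from the parent
      end
      writer.close # the parent only reads
      [pid, reader, index]
    end

    jobs.each do |pid, reader, index|
      reader.binmode
      # If the child failed, there is nothing valid to load and this raises.
      results[index] = Marshal.load(reader.read)
      reader.close
      Process.wait(pid)
    end
  end

  results
end

squares = parallel_map([1, 2, 3, 4, 5], workers: 2) { |n| n * n }
```

With pages, the block might be something like `{ |page| page.extract_tables!.map(&:to_a) }`, assuming the page objects remain usable in the forked child and that the returned rows are plain arrays; the table objects themselves may not be marshalable.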

@example extract tables using pictures as possible borders

tables = page.extract_tables! images: true #=> [Array<Iguvium::Table>]

@return [Array<Iguvium::Table>]
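Once the tables are extracted, the rows from {Iguvium::Table#to_a} can be passed to any consumer. Below is a minimal sketch that serializes them as CSV using the standard `csv` library. `rows_to_csv` is a hypothetical helper, and the sample `rows` stand in for a real `to_a` result, which is assumed here to be an array of rows, each an array of cell strings.

```ruby
require 'csv'

# Hypothetical helper (not part of Iguvium): turns an array of rows
# into CSV text, quoting cells where needed.
def rows_to_csv(rows)
  CSV.generate do |csv|
    rows.each { |row| csv << row }
  end
end

# Sample data standing in for the result of Iguvium::Table#to_a (assumed shape).
rows = [
  ['Year', 'Total'],
  ['1972', '47,168,710']
]

csv_text = rows_to_csv(rows)
puts csv_text
```

With a real page this would be called along the lines of `rows_to_csv(page.extract_tables!.first.to_a)`, after checking that the returned array is not empty.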

# File lib/iguvium/page.rb, line 45
def extract_tables!(images: @opts[:images])
  return @tables if @tables

  @opts[:images] = images
  recognize!
  @tables
end
text()

@return [String] rendered page text, the result of the underlying PDF::Reader::Page#text call.

It takes about 150 ms, so it's handy for picking out pages before trying to extract tables, which is an expensive operation.

# File lib/iguvium/page.rb, line 56
def text
  @text ||= @reader_page.text
end

Private Instance Methods

box_empty?(box)
# File lib/iguvium/page.rb, line 85
def box_empty?(box)
  characters.select { |character|
    box.first.cover?(character.x) && box.last.cover?(character.y)
  }.empty?
end
recognize!()
# File lib/iguvium/page.rb, line 73
def recognize!
  image = Image.read(@path, @reader_page.number, @opts)
  recognized = CV.new(image).recognize
  @lines = recognized[:lines]
  @boxes = recognized[:boxes].reject { |box| box_empty?(box) }
  @tables = @boxes
            .map { |box| Table.new(box, self) }
            .reject { |table| table.grid[:rows].empty? || table.grid[:columns].empty? }
            .reverse
  self
end