module Iguvium

PDF tables extractor.

@example Get all the tables in 2D text array format

pages = Iguvium.read('filename.pdf') #=> [Array<Iguvium::Page>]
tables = pages.flat_map { |page| page.extract_tables! } #=> [Array<Iguvium::Table>]
tables.map(&:to_a)

@example Get the first table from page 8

pages = Iguvium.read('filename.pdf')
tables = pages[7].extract_tables!
tables.first.to_a
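
Once a table is in 2D text array form, it can be post-processed with plain Ruby. The sketch below assumes the first row of the array is a header row, which may not hold for every table; the `rows` data is a made-up sample and not output from a real PDF.

```ruby
# Hypothetical sample of what `table.to_a` might return: an array of rows,
# each row an array of cell strings. The values are invented for illustration.
rows = [
  ['Item',  'Qty', 'Price'],
  ['Apple', '3',   '1.20'],
  ['Pear',  '5',   '0.80']
]

# Treat the first row as a header and pair it with every following row.
header, *body = rows
records = body.map { |row| header.zip(row).to_h }

records.each { |record| puts record.inspect }
```

Each element of `records` is a Hash keyed by the header cells, which is often easier to query than positional arrays.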

For more details please see {Iguvium.read} and {Iguvium::Page#extract_tables!}

@author Dima Ermilov <wlaer@wlaer.com>

Constants

FLAT_THRESHOLD
GAUSS
HORIZONTAL
NEIGHBORS
VERSION
VERTICAL

Public Class Methods

logger()

Creates the Ruby Logger and gives access to it. The default [Logger::Level] is Logger::ERROR.

To set another level, call `Iguvium.logger.level = Logger::INFO` or use any other standard logger level.

Iguvium's logger can be replaced, for example with a global one: `Iguvium.logger = Rails.logger`

@return [Logger]

# File lib/iguvium.rb, line 87
def logger
  return @logger if @logger

  @logger = Logger.new(STDOUT)
  @logger.formatter = proc do |severity, _, _, msg|
    "#{severity}: #{msg}\n"
  end
  @logger.level = Logger::ERROR
  @logger
end

logger=(new_logger)
# File lib/iguvium.rb, line 97
def logger=(new_logger)
  @logger = new_logger
end
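
The accessor pair above can be tried out without the gem installed. The following is a minimal sketch: `LoggerDemo` is a hypothetical stand-in module that mirrors the memoized reader and the plain writer, and the in-memory `StringIO` target is only there to make the result easy to inspect.

```ruby
require 'logger'
require 'stringio'

# Hypothetical stand-in that mirrors the logger / logger= pair shown above.
module LoggerDemo
  class << self
    attr_writer :logger

    def logger
      @logger ||= Logger.new($stdout).tap do |log|
        log.formatter = proc { |severity, _, _, msg| "#{severity}: #{msg}\n" }
        log.level = Logger::ERROR
      end
    end
  end
end

# The default level is ERROR; lowering it lets INFO messages through.
LoggerDemo.logger.level = Logger::INFO

# Swap in a custom logger, here one that writes to an in-memory buffer.
buffer = StringIO.new
LoggerDemo.logger = Logger.new(buffer)
LoggerDemo.logger.formatter = proc { |severity, _, _, msg| "#{severity}: #{msg}\n" }
LoggerDemo.logger.info('page 1 scanned')

puts buffer.string
```

Note that a replacement logger keeps its own level and formatter; nothing from the default logger is carried over.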

read(path, **opts)

This is the main method. Usually this is where you start.

It returns an array of {Iguvium::Page}.

Tables on those pages are neither extracted nor detected yet; all the heavy lifting is done in the {Iguvium::Page#extract_tables!} method.

@param path [String] path to the PDF file to be read

@option opts [String] :gspath (nil) explicit path to the GhostScript executable. Use it in case of non-standard gs executable placement. If not specified, the gem tries standard options like `C:\\Program Files\\gs\\gs*\\bin\\gswin??c.exe` on Windows or just `gs` on Mac and Linux

@option opts [Logger::Level] :loglevel level like Logger::INFO, default is Logger::ERROR

@return [Array <Iguvium::Page>]

@example prepare pages, consider images meaningful

pages = Iguvium.read('filename.pdf', images: true)

@example set nonstandard gs path, get pages starting with the one which contains keyword

pages = Iguvium.read('nixon.pdf', gspath: '/usr/bin/gs')
pages = pages.drop_while { |page| !page.text.match?(/Watergate/) }
# {Iguvium::Page#text} does not require optical page scan and thus is relatively cheap.
# It uses an underlying PDF::Reader::Page#text which is fast but not completely free though.
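
The `drop_while` call in the example above keeps every page from the first match onward, not only the matching pages. The sketch below illustrates the difference using `FakePage`, a hypothetical stand-in for {Iguvium::Page} that models only the `text` reader; the page contents are invented.

```ruby
# Hypothetical stand-in for Iguvium::Page: only `text` is modelled here.
FakePage = Struct.new(:number, :text)

pages = [
  FakePage.new(1, 'Table of contents'),
  FakePage.new(2, 'Background'),
  FakePage.new(3, 'The Watergate hearings'),
  FakePage.new(4, 'Appendix')
]

# drop_while skips leading pages until the block first returns false,
# then keeps everything after that point, including later non-matching pages.
from_keyword = pages.drop_while { |page| !page.text.match?(/Watergate/) }

# select, by contrast, keeps only the pages that match.
only_matching = pages.select { |page| page.text.match?(/Watergate/) }

puts from_keyword.map(&:number).inspect
puts only_matching.map(&:number).inspect
```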

@option opts [Boolean] :images (false) consider pictures in the PDF as possible table separators. This typically makes sense only in the rare case when the table grid in your PDF is filled with a rasterized texture or is actually a background picture. Usually you don't want to use it.

# File lib/iguvium.rb, line 60
def read(path, **opts)
  if windows?
    unless opts[:gspath]
      # Dir.glob returns [] when GhostScript is not installed, so guard against nil
      gspath = Dir.glob('C:/Program Files/gs/gs*/bin/gswin??c.exe').first
      opts[:gspath] = gspath ? "\"#{gspath.tr('/', '\\')}\"" : ''
    end

    if opts[:gspath].empty?
      puts "There's no gs utility in your $PATH.
Please install GhostScript: https://www.ghostscript.com/download/gsdnld.html"
      exit
    end
  else
    opts[:gspath] ||= gs_nix?
  end

  PDF::Reader.new(path, opts).pages.map { |page| Page.new(page, path, opts) }
end

Private Class Methods

gs_nix?()
# File lib/iguvium.rb, line 103
def gs_nix?
  if `which gs`.empty?
    puts "There's no gs utility in your $PATH.
Please install GhostScript with `brew install ghostscript` on Mac
or download it here: https://www.ghostscript.com/download/gsdnld.html"
    exit
  end
  'gs'
end
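
The check above shells out to `which`. As a point of comparison only, and not what the gem does, the same lookup can be sketched in pure Ruby by walking the entries of `PATH`. The helper name `find_in_path` is hypothetical, and the demo uses a throwaway directory with a fake `gs` script so that it does not depend on GhostScript being installed.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical helper: look for an executable in a PATH-style string
# without shelling out to `which`. Returns the full path or nil.
def find_in_path(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Demonstrate against a temporary directory containing a fake `gs` script.
found, missing = Dir.mktmpdir do |dir|
  fake_gs = File.join(dir, 'gs')
  File.write(fake_gs, "#!/bin/sh\n")
  FileUtils.chmod(0o755, fake_gs)
  [find_in_path('gs', dir), find_in_path('no-such-tool', dir)]
end

puts File.basename(found)
puts missing.inspect
```

Returning `nil` instead of calling `exit` leaves the decision about how to report a missing executable to the caller.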

windows?()
# File lib/iguvium.rb, line 113
def windows?
  RbConfig::CONFIG['host_os'].match(/mswin|mingw|cygwin/)
end
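
To see how the host check behaves on different platforms, the same regular expression can be applied to sample `host_os` strings. This is a sketch only: `windows_host?` is a hypothetical helper that takes the string as an argument and uses `match?` to return a strict boolean, and the sample values are typical examples rather than an exhaustive list.

```ruby
require 'rbconfig'

# Hypothetical helper: same pattern as above, but the host string is a
# parameter (defaulting to the running interpreter's value).
def windows_host?(host_os = RbConfig::CONFIG['host_os'])
  host_os.match?(/mswin|mingw|cygwin/)
end

puts windows_host?('mingw32')
puts windows_host?('linux-gnu')
# A bare /win/ pattern would also match "darwin"; this one does not.
puts windows_host?('darwin23')
```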