class Docsplit::PdfExtractor
Constants
- CLASSPATH
- HEADLESS
- HOST_OS
Provide a set of helper functions to determine the OS.
- LOGGING
Public Instance Methods
extract(docs, opts)
Convert documents to PDF.
# File lib/docsplit/pdf_extractor.rb, line 121
def extract(docs, opts)
  out = opts[:output] || '.'
  FileUtils.mkdir_p out unless File.exist?(out)
  [docs].flatten.each do |doc|
    ext = File.extname(doc)
    basename = File.basename(doc, ext)
    escaped_doc, escaped_out, escaped_basename = [doc, out, basename].map(&ESCAPE)
    if GM_FORMATS.include?(`file -b --mime #{ESCAPE[doc]}`.strip.split(/[:;]\s+/)[0])
      `gm convert #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf`
    else
      if libre_office?
        # Set the LibreOffice user profile, so that parallel uses of cloudcrowd don't trip over each other.
        ENV['SYSUSERCONFIG'] = "file://#{File.expand_path(escaped_out)}"
        options = "--headless --invisible --norestore --nolockcheck --convert-to pdf --outdir #{escaped_out} #{escaped_doc}"
        cmd = "#{office_executable} #{options} 2>&1"
        result = `#{cmd}`.chomp
        raise ExtractionFailed, result if $?.exitstatus.nonzero?
        true
      else # open office presumably, rely on JODConverter to figure it out.
        options = "-jar #{ESCAPED_ROOT}/vendor/jodconverter/jodconverter-core-3.0-beta-4.jar -r #{ESCAPED_ROOT}/vendor/conf/document-formats.js"
        run_jod "#{options} #{escaped_doc} #{escaped_out}/#{escaped_basename}.pdf", [], {}
      end
    end
  end
end
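The listing derives each output name from the input's basename and the :output option. The sketch below isolates that step only. `pdf_target` is a hypothetical helper introduced for illustration, not part of Docsplit, and it skips the shell escaping and the actual conversion.

```ruby
# Sketch: how the target PDF path follows from the input path and :output.
# `pdf_target` is a hypothetical name, not a Docsplit method.
def pdf_target(doc, output = '.')
  ext = File.extname(doc)             # e.g. ".docx"
  basename = File.basename(doc, ext)  # e.g. "report"
  File.join(output, "#{basename}.pdf")
end

puts pdf_target('/tmp/in/report.docx', 'out') # out/report.pdf
puts pdf_target('notes.txt')                  # ./notes.pdf
```

Because the listing wraps its argument in `[docs].flatten`, a single path and an array of paths are handled the same way.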
libre_office?()
# File lib/docsplit/pdf_extractor.rb, line 35
def libre_office?
  !!version_string.match(/^LibreOffice/)
end
linux?()
# File lib/docsplit/pdf_extractor.rb, line 18
def linux?
  !!HOST_OS.match(/linux/i)
end
office_executable()
Identify the path to a working office executable.
# File lib/docsplit/pdf_extractor.rb, line 78
def office_executable
  paths = office_search_paths

  # If an OFFICE_PATH has been specified on the command line,
  # raise an error if that path isn't valid; otherwise, add
  # it to the front of our search paths.
  if ENV['OFFICE_PATH']
    raise ArgumentError, "No such file or directory #{ENV['OFFICE_PATH']}" unless File.exist? ENV['OFFICE_PATH']
    paths.unshift(ENV['OFFICE_PATH'])
  end

  # The location of the office executable is OS dependent
  path_pieces = ['soffice']
  if windows?
    path_pieces += [['program', 'soffice.bin']]
  elsif osx?
    path_pieces += [%w(MacOS soffice), %w(Contents MacOS soffice)]
  else
    path_pieces += [%w(program soffice)]
  end

  # Search for the first suitable office executable
  # and short circuit once an executable is found.
  paths.each do |path|
    if File.exist? path
      @@executable ||= path unless File.directory? path
      path_pieces.each do |pieces|
        check_path = File.join(path, pieces)
        @@executable ||= check_path if File.exist? check_path
      end
    end
    break if @@executable
  end
  raise OfficeNotFound, 'No office software found' unless @@executable
  @@executable
end
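The search loop can be tried in isolation against a temporary directory. This is a minimal sketch, assuming the Linux-style `program/soffice` layout; `find_office` is a hypothetical stand-alone function, and unlike the listing it does not memoize its result in a class variable or consult OFFICE_PATH.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-alone version of the search loop (not Docsplit's API).
# A path that is itself a file wins immediately; a directory is probed
# with each candidate suffix in order.
def find_office(paths, path_pieces = ['soffice', %w(program soffice)])
  paths.each do |path|
    next unless File.exist?(path)
    return path unless File.directory?(path)
    path_pieces.each do |pieces|
      candidate = File.join(path, pieces) # File.join flattens array arguments
      return candidate if File.exist?(candidate)
    end
  end
  nil
end

Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, 'libreoffice', 'program'))
  FileUtils.touch(File.join(root, 'libreoffice', 'program', 'soffice'))
  found = find_office([File.join(root, 'missing'), File.join(root, 'libreoffice')])
  puts found.end_with?('libreoffice/program/soffice') # true
end
```

Returning `nil` here stands in for the `OfficeNotFound` error raised in the listing.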
office_path()
Used to specify the office location for JODConverter.
# File lib/docsplit/pdf_extractor.rb, line 116
def office_path
  File.dirname(File.dirname(office_executable))
end
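Applying `File.dirname` twice strips the executable name and its parent directory. A short illustration, using an example path and a hypothetical `office_home` helper:

```ruby
# Hypothetical helper mirroring the body of office_path for a given executable.
def office_home(executable)
  File.dirname(File.dirname(executable))
end

puts office_home('/usr/lib/libreoffice/program/soffice') # /usr/lib/libreoffice
```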
office_search_paths()
A set of default locations to search for office software. These have been extracted from JODConverter. Each listed path should contain a directory “program” which in turn contains the “soffice” executable. See: github.com/mirkonasato/jodconverter/blob/master/jodconverter-core/src/main/java/org/artofsolving/jodconverter/office/OfficeUtils.java#L63-L91
# File lib/docsplit/pdf_extractor.rb, line 48
def office_search_paths
  if windows?
    office_names = ['LibreOffice 3', 'LibreOffice 4', 'OpenOffice.org 3']
    program_files_path = ENV['CommonProgramFiles']
    search_paths = office_names.map { |program| File.join(program_files_path, program) }
  elsif osx?
    search_paths = %w(
      /Applications/LibreOffice.app/Contents
      /Applications/OpenOffice.org.app/Contents
    )
  else # probably linux/unix
    # heroku libreoffice buildpack: https://github.com/rishihahs/heroku-buildpack-libreoffice
    search_paths = %w(
      /usr/lib/libreoffice
      /usr/lib64/libreoffice
      /opt/libreoffice
      /usr/lib/openoffice
      /usr/lib64/openoffice
      /opt/openoffice.org3
      /app/vendor/libreoffice
      /usr/bin/libreoffice
      /usr/local/bin
      /usr/lib64/libreoffice
      /usr/lib64/openoffice.org3
    )
  end
  search_paths
end
open_office?()
# File lib/docsplit/pdf_extractor.rb, line 39
def open_office?
  !!version_string.match(/^OpenOffice.org/)
end
osx?()
# File lib/docsplit/pdf_extractor.rb, line 14
def osx?
  !!HOST_OS.match(/darwin/i)
end
version_string()
The first line of the help output holds the name and version number of the office software to be used for extraction.
# File lib/docsplit/pdf_extractor.rb, line 24
def version_string
  unless @@version_string
    null = windows? ? 'NUL' : '/dev/null'
    @@version_string = `#{office_executable} -h 2>#{null}`.split("\n").first
    if !!@@version_string.to_s.match(/[0-9]*/)
      @@version_string = `#{office_executable} --version`.split("\n").first
    end
  end
  @@version_string
end
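One detail of the listing worth checking by hand: the guard uses `/[0-9]*/`, and `*` permits zero digits, so the pattern matches any string, including an empty one. As written, the `--version` call therefore appears to run every time the string is first computed. The snippet below only demonstrates the regex behaviour; the sample strings are illustrative, not captured output, and `VERSION_GUARD` is a name introduced here.

```ruby
# The same pattern as the guard in version_string.
VERSION_GUARD = /[0-9]*/

['LibreOffice 4.2.8.2', 'OpenOffice.org 3.4', ''].each do |line|
  # A zero-length match still returns MatchData, which is truthy.
  puts "#{line.inspect} matches: #{!!line.match(VERSION_GUARD)}"
end
```

Every line prints `matches: true`, including the one for the empty string.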
windows?()
# File lib/docsplit/pdf_extractor.rb, line 10
def windows?
  !!HOST_OS.match(/mswin|windows|cygwin/i)
end
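The three OS predicates (windows?, osx?, linux?) all test HOST_OS against a pattern. A compact sketch of the same checks follows, assuming HOST_OS holds a value like `RbConfig::CONFIG['host_os']` (the page does not show how the constant is defined); `os_family` is a hypothetical helper.

```ruby
require 'rbconfig'

# Assumption: HOST_OS resembles RbConfig::CONFIG['host_os'].
# `os_family` is a hypothetical helper, not part of Docsplit.
def os_family(host_os = RbConfig::CONFIG['host_os'])
  case host_os
  when /mswin|windows|cygwin/i then :windows
  when /darwin/i then :osx
  when /linux/i then :linux
  else :other
  end
end

puts os_family('darwin21')  # osx
puts os_family('linux-gnu') # linux
puts os_family('mswin32')   # windows
```

Any host string outside these patterns falls through to `:other`, which corresponds to the generic Unix branch in office_search_paths.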
Private Instance Methods
run_jod(command, pdfs, _opts, return_output = false)
Runs a Java command, with quieted logging, and the classpath set properly.
# File lib/docsplit/pdf_extractor.rb, line 158
def run_jod(command, pdfs, _opts, return_output = false)
  pdfs = [pdfs].flatten.map { |pdf| "\"#{pdf}\"" }.join(' ')
  office = osx? ? "-Doffice.home=#{office_path}" : office_path
  cmd = "java #{HEADLESS} #{LOGGING} #{office} -cp #{CLASSPATH} #{command} #{pdfs} 2>&1"
  result = `#{cmd}`.chomp
  raise ExtractionFailed, result if $?.exitstatus.nonzero?
  return_output ? (result.empty? ? nil : result) : true
end
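The first line of run_jod wraps each path in double quotes before it is appended to the command. The sketch below reproduces only that step; `quote_paths` is a hypothetical name used for illustration.

```ruby
# Hypothetical helper mirroring the first line of run_jod.
def quote_paths(pdfs)
  [pdfs].flatten.map { |pdf| "\"#{pdf}\"" }.join(' ')
end

puts quote_paths('a b.pdf')              # "a b.pdf"
puts quote_paths(['one.pdf', 'two.pdf']) # "one.pdf" "two.pdf"
puts quote_paths([]).inspect             # ""
```

The empty-array case is the one extract uses: it passes `[]`, so nothing is appended after the command string.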