module Docsplit
The Docsplit module delegates to the Java PDF extractors.
Constants
- DEPENDENCIES
- ESCAPE
- ESCAPED_ROOT
- GM_FORMATS
- METADATA_KEYS
- ROOT
- VERSION
Public Class Methods
clean_text(text)
Utility method to strip garbage characters from OCR'd text.
    # File lib/docsplit.rb, line 83
    def self.clean_text(text)
      TextCleaner.new.clean(text)
    end
extract_images(pdfs, opts = {})
Use the ExtractImages Java class to rasterize each page of a PDF into an image.
    # File lib/docsplit.rb, line 54
    def self.extract_images(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      opts[:pages] = normalize_value(opts[:pages]) if opts[:pages]
      ImageExtractor.new.extract(pdfs, opts)
    end
extract_info(pdfs, opts = {})
    # File lib/docsplit.rb, line 77
    def self.extract_info(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      InfoExtractor.new.extract_all(pdfs, opts)
    end
extract_pages(pdfs, opts = {})
Use the ExtractPages Java class to burst a PDF into single pages.
    # File lib/docsplit.rb, line 42
    def self.extract_pages(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      PageExtractor.new.extract(pdfs, opts)
    end
extract_pdf(docs, opts = {})
Use JODConverter to convert the documents to PDFs. If a document is in an image format, use GraphicsMagick to produce the PDF instead.
    # File lib/docsplit.rb, line 62
    def self.extract_pdf(docs, opts = {})
      PdfExtractor.new.extract(docs, opts)
    end
extract_text(pdfs, opts = {})
Use the ExtractText Java class to write out all embedded text.
    # File lib/docsplit.rb, line 48
    def self.extract_text(pdfs, opts = {})
      pdfs = ensure_pdfs(pdfs)
      TextExtractor.new.extract(pdfs, opts)
    end
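Most of the public class methods listed here share one shape: coerce the input into a list of PDFs, then hand off to a dedicated extractor object. The sketch below imitates that shape in a self-contained way. Everything in it is a hypothetical stand-in: the SplitSketch module, the stub TextExtractor, the behavior of ensure_pdfs (assumed here to wrap its argument in an array) and the :output option are illustrative assumptions, not Docsplit's actual implementation.

```ruby
# Hypothetical stand-ins: the real extractor classes and ensure_pdfs
# live inside Docsplit and are not shown on this page.
module SplitSketch
  # Stub extractor that only reports what it would have processed.
  class TextExtractor
    def extract(pdfs, opts)
      pdfs.map { |pdf| "text from #{pdf} (output: #{opts[:output] || '.'})" }
    end
  end

  # Assumption: accept one path or many and always return a flat array.
  def self.ensure_pdfs(docs)
    [docs].flatten
  end

  # Same two-step shape as the documented method: coerce, then delegate.
  def self.extract_text(pdfs, opts = {})
    pdfs = ensure_pdfs(pdfs)
    TextExtractor.new.extract(pdfs, opts)
  end
end

puts SplitSketch.extract_text('report.pdf', output: 'out').first
# prints: text from report.pdf (output: out)
```

Keeping the module methods this thin means each extractor can be swapped or tested in isolation, which is presumably why the module is described as delegating.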
Private Class Methods
normalize_value(value)
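Normalize a value in an options hash for the command line. A Range is expanded into a comma-separated list of its elements (1..3 becomes 1,2,3). An Array is joined with commas, with any nested Ranges expanded first. Any other value is converted with to_s.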
    # File lib/docsplit.rb, line 91
    def self.normalize_value(value)
      case value
      when Range then value.to_a.join(',')
      when Array then value.map! { |v| v.is_a?(Range) ? normalize_value(v) : v }.join(',')
      else value.to_s
      end
    end
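To make the normalization rules concrete, here is a minimal standalone sketch of the same logic. The name normalize_pages is hypothetical and not part of Docsplit. The sketch also uses map rather than map!, so it does not modify the caller's array; that is a deliberate difference from the listed source, which mutates the array in place.

```ruby
# Standalone sketch; `normalize_pages` is a hypothetical name, not Docsplit API.
def normalize_pages(value)
  case value
  when Range
    # Expand the range into its elements, joined by commas.
    value.to_a.join(',')
  when Array
    # `map` (not `map!`) leaves the caller's array untouched.
    value.map { |v| v.is_a?(Range) ? normalize_pages(v) : v }.join(',')
  else
    value.to_s
  end
end

puts normalize_pages(1..3)          # prints 1,2,3
puts normalize_pages([1, 4..6, 9])  # prints 1,4,5,6,9
puts normalize_pages(7)             # prints 7
```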