API reference

This page summarizes the rest of the public API. It is mainly of interest to plugin developers.

ocrmypdf

ocrmypdf.exceptions

OCRmyPDF’s exceptions.

exception ocrmypdf.exceptions.BadArgsError

Invalid arguments on the command line or API.

exit_code = 1
exception ocrmypdf.exceptions.ColorConversionNeededError

PDF needs color conversion.

message = 'The input PDF has an unusual color space. Use\n--color-conversion-strategy to convert to a common color space\nsuch as RGB, or use --output-type pdf to skip PDF/A conversion\nand retain the original color space.\n'
exception ocrmypdf.exceptions.DigitalSignatureError

PDF has a digital signature.

message = 'Input PDF has a digital signature. OCR would alter the document,\ninvalidating the signature.\n'
exception ocrmypdf.exceptions.DpiError

Missing information about input image DPI.

exit_code = 2
exception ocrmypdf.exceptions.EncryptedPdfError

Input PDF is encrypted.

exit_code = 8
message = "Input PDF is encrypted. The encryption must be removed to\nperform OCR.\n\nFor information about this PDF's security use\n    qpdf --show-encryption infilename\n\nYou can remove the encryption using\n    qpdf --decrypt [--password=[password]] infilename\n"
class ocrmypdf.exceptions.ExitCode(value)

OCRmyPDF’s exit codes.

already_done_ocr = 6
bad_args = 1
child_process_error = 7
ctrl_c = 130
encrypted_pdf = 8
file_access_error = 5
input_file = 2
invalid_config = 9
invalid_output_pdf = 4
missing_dependency = 3
ok = 0
other_error = 15
pdfa_conversion_failed = 10
exception ocrmypdf.exceptions.ExitCodeException

An exception which should return an exit code with sys.exit().

exit_code = 15
message = ''
exception ocrmypdf.exceptions.InputFileError

Something is wrong with the input file.

exit_code = 2
exception ocrmypdf.exceptions.MissingDependencyError

A third-party dependency is missing.

exit_code = 3
exception ocrmypdf.exceptions.OutputFileAccessError

Cannot access the intended output file path.

exit_code = 5
exception ocrmypdf.exceptions.PriorOcrFoundError

This file already has OCR.

exit_code = 6
exception ocrmypdf.exceptions.SubprocessOutputError

A subprocess returned an unexpected error.

exit_code = 7
exception ocrmypdf.exceptions.TaggedPDFError

PDF is tagged.

message = 'This PDF is marked as a Tagged PDF. This often indicates\nthat the PDF was generated from an office document and does\nnot need OCR. Use --force-ocr, --skip-text or --redo-ocr to\noverride this error.\n'
exception ocrmypdf.exceptions.TesseractConfigError

Tesseract config can’t be parsed.

exit_code = 9
message = 'Error occurred while parsing a Tesseract configuration file'
exception ocrmypdf.exceptions.UnsupportedImageFormatError

The image format is not supported.

exit_code = 2
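
The exception classes above each pair a failure with an exit code. The following is a minimal, standalone sketch of that pattern; it deliberately does not import ocrmypdf. The names SketchExitCode, SketchExitCodeException, SketchInputFileError, exit_code_for and failing_task are hypothetical, and the numeric values are copied from the listing above.

```python
from enum import IntEnum


class SketchExitCode(IntEnum):
    """Stand-in mirroring a few of the exit codes listed above."""

    ok = 0
    bad_args = 1
    input_file = 2
    other_error = 15


class SketchExitCodeException(Exception):
    """Stand-in base class: carries the exit code the process should return."""

    exit_code = SketchExitCode.other_error
    message = ""


class SketchInputFileError(SketchExitCodeException):
    """Stand-in for an input-file failure."""

    exit_code = SketchExitCode.input_file


def exit_code_for(task) -> int:
    """Run task() and translate any sketch exception into an exit code."""
    try:
        task()
    except SketchExitCodeException as e:
        return int(e.exit_code)
    return int(SketchExitCode.ok)


def failing_task():
    raise SketchInputFileError("not a PDF")


print(exit_code_for(failing_task))
print(exit_code_for(lambda: None))
```

With the real library, the analogous step would presumably be to catch ocrmypdf.exceptions.ExitCodeException and pass its exit_code to sys.exit().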

ocrmypdf.helpers

Support functions.

@ocrmypdf.helpers.deprecated(deprecated_in=None, removed_in=None, current_version=None, details='')

Decorate a function to signify its deprecation.

This function wraps a method that will soon be removed and does two things:
  • The docstring of the method will be modified to include a notice about deprecation, e.g., “Deprecated since 0.9.11. Use foo instead.”

  • Raises a DeprecatedWarning via the warnings module, which is a subclass of the built-in DeprecationWarning. Note that built-in DeprecationWarnings are ignored by default, so users must enable them to be informed of these warnings. See the warnings module documentation for more details.

Parameters:
  • deprecated_in – The version at which the decorated method is considered deprecated. This will usually be the next version to be released when the decorator is added. The default is None, which effectively means immediate deprecation. If this is not specified, then the removed_in and current_version arguments are ignored.

  • removed_in – The version or datetime.date when the decorated method will be removed. The default is None, specifying that the function is not currently planned to be removed. Note: This parameter cannot be set to a value if deprecated_in=None.

  • current_version – The source of version information for the currently running code. This will usually be a __version__ attribute on your library. The default is None. When current_version=None the automation to determine if the wrapped function is actually in a period of deprecation or time for removal does not work, causing a DeprecatedWarning to be raised in all cases.

  • details – Extra details to be added to the method docstring and warning. For example, the details may point users to a replacement method, such as “Use the foo_bar method instead”. By default there are no details.
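
The two effects described above (a docstring notice and a warning at call time) can be sketched with the standard library alone. This is a simplified illustration, not the decorator's actual implementation: sketch_deprecated, old_add and new_add are hypothetical names, version handling is omitted, and the built-in DeprecationWarning is used in place of DeprecatedWarning.

```python
import functools
import warnings


def sketch_deprecated(details=""):
    """Hypothetical, simplified decorator: docstring notice plus a warning."""

    def decorator(func):
        notice = f"Deprecated. {details}".strip()

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 attributes the warning to the caller of func
            warnings.warn(notice, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        # Append the notice to the wrapped function's docstring
        wrapper.__doc__ = f"{func.__doc__ or ''}\n\n{notice}"
        return wrapper

    return decorator


@sketch_deprecated(details="Use new_add instead.")
def old_add(a, b):
    """Add two numbers."""
    return a + b


with warnings.catch_warnings(record=True) as caught:
    # DeprecationWarning is often hidden by default, so enable it explicitly
    warnings.simplefilter("always")
    result = old_add(1, 2)

print(result, caught[0].category.__name__)
```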

ocrmypdf.helpers.NeverRaise()

An exception that is never raised.

Deprecated since version 15.4.0.

class ocrmypdf.helpers.Resolution(x: T, y: T)

The number of pixels per inch in each 2D direction.

Resolution objects are considered “equal” for == purposes if they are equal to a reasonable tolerance.

flip_axis() Resolution[T]

Return a new Resolution object with x and y swapped.

property is_finite: bool

True if both x and y are finite numbers.

property is_square: bool

True if the resolution is square (x == y).

round(ndigits: int) Resolution

Round to ndigits after the decimal point.

take_max(vals: Iterable[Any], yvals: Iterable[Any] | None = None) Resolution

Return a new Resolution object with the maximum resolution of inputs.

take_min(vals: Iterable[Any], yvals: Iterable[Any] | None = None) Resolution

Return a new Resolution object with the minimum resolution of inputs.

to_int() Resolution[int]

Round to nearest integer.

to_scalar() float

Return the harmonic mean of x and y as a 1D approximation.

In most cases, Resolution is 2D, but typically it is “square” (x == y) and can be approximated as a single number. When not square, the harmonic mean is used to approximate the 2D resolution as a single number.
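
As a worked illustration of the harmonic mean mentioned above, the textbook formula for two positive values is 2xy / (x + y). The helper below is a hypothetical standalone function, and the library's exact computation may differ.

```python
def harmonic_mean(x: float, y: float) -> float:
    """Textbook harmonic mean of two positive numbers: 2xy / (x + y)."""
    if x <= 0 or y <= 0:
        raise ValueError("resolutions must be positive")
    return 2.0 * x * y / (x + y)


# A square resolution collapses to its single value
print(harmonic_mean(300, 300))
# A non-square resolution is pulled toward the smaller axis
print(harmonic_mean(200, 600))
```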

ocrmypdf.helpers.available_cpu_count() int

Returns number of CPUs in the system.

ocrmypdf.helpers.check_pdf(input_file: Path) bool

Check if a PDF complies with the PDF specification.

Checks for proper formatting and proper linearization. Uses pikepdf (which, in turn, uses QPDF) to perform the checks.

ocrmypdf.helpers.clamp(n: T, smallest: T, largest: T) T

Clamps the value of n to between smallest and largest.
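
Clamping is commonly written as a min/max pair. The sketch below uses the hypothetical name clamp_sketch and assumes smallest <= largest; it illustrates the idea rather than the library's code.

```python
def clamp_sketch(n, smallest, largest):
    """Return n limited to the closed range [smallest, largest]."""
    return max(smallest, min(n, largest))


print(clamp_sketch(5, 0, 3), clamp_sketch(-1, 0, 3), clamp_sketch(2, 0, 3))
```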

ocrmypdf.helpers.is_file_writable(test_file: PathLike) bool

Intentionally racy test if target is writable.

We intend to write to the output file if and only if we succeed and can replace it atomically. Before doing the OCR work, make sure the location is writable.

ocrmypdf.helpers.is_iterable_notstr(thing: Any) bool

Is this an iterable type, other than a string?

ocrmypdf.helpers.monotonic(seq: Sequence) bool

Does this sequence increase monotonically?

ocrmypdf.helpers.page_number(input_file: PathLike) int

Get one-based page number implied by filename (000002.pdf -> 2).
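
One way to read a page number out of such a filename is to parse the stem as an integer. This sketch assumes the stem consists only of digits, as in the 000002.pdf example; page_number_sketch is a hypothetical name, and the library may parse the name differently.

```python
from pathlib import Path


def page_number_sketch(input_file) -> int:
    """Parse the filename stem as a one-based page number."""
    stem = Path(input_file).stem  # "000002.pdf" -> "000002"
    return int(stem)  # leading zeros are ignored by int()


print(page_number_sketch("000002.pdf"))
```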

ocrmypdf.helpers.pikepdf_enable_mmap() None

Enable pikepdf memory mapping.

ocrmypdf.helpers.remove_all_log_handlers(logger: Logger) None

Remove all log handlers, usually used in a child process.

The child process inherits the log handlers from the parent process when a fork occurs. Typically we want to remove all log handlers in the child process so that the child process can set up a single queue handler to forward log messages to the parent process.
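
The handler reset described above can be sketched with the standard logging module. reset_handlers_for_child and the logger name are hypothetical, and a plain queue.Queue stands in for the inter-process queue a real child process would use.

```python
import logging
import logging.handlers
import queue


def reset_handlers_for_child(logger: logging.Logger, log_queue) -> None:
    """Drop inherited handlers, then forward records through one queue handler."""
    # Iterate over a copy, because removeHandler mutates logger.handlers
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))


log = logging.getLogger("sketch.child")
log.addHandler(logging.StreamHandler())  # pretend these were inherited
log.addHandler(logging.NullHandler())
log.setLevel(logging.INFO)
log.propagate = False

q = queue.Queue()
reset_handlers_for_child(log, q)
log.info("hello from the child")

record = q.get_nowait()
print(len(log.handlers), record.getMessage())
```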

ocrmypdf.helpers.safe_symlink(input_file: PathLike, soft_link_name: PathLike) None

Create a symbolic link at soft_link_name, which references input_file.

Think of this as copying input_file to soft_link_name with less overhead.

Use symlinks safely. Self-linking loops are prevented. On Windows, file copy is used since symlinks may require administrator privileges. An existing link at the destination is removed.

ocrmypdf.helpers.samefile(file1: PathLike, file2: PathLike) bool

Return True if two files are the same file.

Attempts to account for different relative paths to the same file.

ocrmypdf.hocrtransform

Transform .hocr and page image to text PDF.

class ocrmypdf.hocrtransform.DebugRenderOptions(render_paragraph_bbox: bool = False, render_baseline: bool = False, render_triangle: bool = False, render_line_bbox: bool = False, render_word_bbox: bool = False, render_space_bbox: bool = False)

A class for managing rendering options.

class ocrmypdf.hocrtransform.HocrTransform(*, hocr_filename: str | Path, dpi: float, debug: bool = False, fontname: Name = pikepdf.Name("/f-0-0"), font: EncodableFont = <ocrmypdf.hocrtransform._font.GlyphlessFont object>, debug_render_options: DebugRenderOptions | None = None)

A class for converting documents from the hOCR format.

For details of the hOCR format, see: http://kba.github.io/hocr-spec/1.2/.

classmethod baseline(element: Element) tuple[float, float]

Get baseline’s slope and intercept.

classmethod element_coordinates(element: Element) Rectangle | None

Get coordinates of the bounding box around an element.

classmethod normalize_text(s: str) str

Normalize the given text using the NFKC normalization form.
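
NFKC normalization replaces compatibility characters, such as ligatures and fullwidth letters, with their ordinary equivalents. The sketch below shows the effect using the standard unicodedata module; normalize_text_sketch is a hypothetical name.

```python
import unicodedata


def normalize_text_sketch(s: str) -> str:
    """Apply Unicode NFKC normalization to s."""
    return unicodedata.normalize("NFKC", s)


# U+FB01 is the single-character "fi" ligature
print(normalize_text_sketch("\ufb01le"))
```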

classmethod polyval(poly, x)

Calculate the value of a polynomial at a point.
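
Polynomial evaluation pairs naturally with the slope and intercept returned by baseline(): a line is the polynomial slope * x + intercept. The sketch below uses Horner's method and assumes coefficients are ordered from highest degree to lowest. That ordering and the name polyval_sketch are assumptions, not taken from the library.

```python
def polyval_sketch(poly, x):
    """Evaluate a polynomial at x with Horner's method.

    Coefficients are assumed to run from highest degree to lowest,
    so (slope, intercept) describes slope * x + intercept.
    """
    result = 0.0
    for coeff in poly:
        result = result * x + coeff
    return result


slope, intercept = 0.5, 2.0
print(polyval_sketch((slope, intercept), 4.0))
```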

to_pdf(*, out_filename: Path, image_filename: Path | None = None, invisible_text: bool = True) None

Creates a PDF file with an image superimposed on top of the text.

Text is positioned according to the bounding box of the lines in the hOCR file. The image need not be identical to the image used to create the hOCR file. It can have a lower resolution, different color mode, etc.

Parameters:
  • out_filename – Path of PDF to write.

  • image_filename – Image to use for this file. If omitted, the OCR text is shown.

  • invisible_text – If True, text is rendered invisible so that it is selectable but never drawn. If False, text is visible and may be seen if the image is skipped or deleted in Acrobat.

exception ocrmypdf.hocrtransform.HocrTransformError

Error while applying hOCR transform.

ocrmypdf.pdfa

Utilities for PDF/A production and confirmation with Ghostscript.

ocrmypdf.pdfa.file_claims_pdfa(filename: Path)

Determines if the file claims to be PDF/A compliant.

This only checks if the XMP metadata contains a PDF/A marker. It does not do full PDF/A validation.
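
To illustrate what a PDF/A marker in XMP metadata looks like, the sketch below searches an XMP packet, given as a string, for a pdfaid:part property. It is a standalone approximation using the standard XML parser. xmp_claims_pdfa and the sample packets are hypothetical, and the library's own check may differ.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def xmp_claims_pdfa(xmp_xml: str) -> bool:
    """Return True if any rdf:Description carries a pdfaid:part property."""
    root = ET.fromstring(xmp_xml)
    part_tag = f"{{{PDFAID_NS}}}part"
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        # XMP allows a simple property as an attribute or as a child element
        if desc.get(part_tag) is not None:
            return True
        if desc.find(part_tag) is not None:
            return True
    return False


SAMPLE_PDFA = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:pdfaid="http://www.aiim.org/pdfa/ns/id/"
        pdfaid:part="2" pdfaid:conformance="B"/>
  </rdf:RDF>
</x:xmpmeta>"""

SAMPLE_PLAIN = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""/>
  </rdf:RDF>
</x:xmpmeta>"""

print(xmp_claims_pdfa(SAMPLE_PDFA), xmp_claims_pdfa(SAMPLE_PLAIN))
```

As with the function documented above, a positive result only means the file claims conformance; it says nothing about whether the file would pass validation.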

ocrmypdf.pdfa.generate_pdfa_ps(target_filename: Path, icc: str = 'sRGB')

Create a Postscript PDFMARK file for Ghostscript PDF/A conversion.

pdfmark is an extension to the Postscript language that describes some PDF features like bookmarks and annotations. It was originally specified for Adobe Distiller, for Postscript to PDF conversion.

Ghostscript uses pdfmark for PDF to PDF/A conversion as well. To use Ghostscript to create a PDF/A, we need to create a pdfmark file with the necessary metadata.

This function takes care of the many version-specific bugs and peculiarities in Ghostscript’s handling of pdfmark.

The only information we put in specifies that we want the file to be a PDF/A, and that we want Ghostscript to convert objects to the sRGB colorspace if it runs into any object that it decides must be converted.

Parameters:
  • target_filename – filename to save

  • icc – ICC identifier such as ‘sRGB’

References

Adobe PDFMARK Reference: https://opensource.adobe.com/dc-acrobat-sdk-docs/library/pdfmark/

ocrmypdf.quality

ocrmypdf.subprocess