Class PdfTextLocator

java.lang.Object
org.openpdf.text.pdf.parser.PdfTextLocator

public class PdfTextLocator extends Object
Locates text pattern coordinates inside a PDF file.
Since:
2.1.4
  • Field Details

    • reader

      private final PdfReader reader
      The PdfReader that holds the PDF file.
    • renderListener

      private final TextAssembler renderListener
      The TextAssembler that receives render notifications and provides the resulting text.
  • Constructor Details

    • PdfTextLocator

      public PdfTextLocator(PdfReader reader)
      Creates a new Text Locator object, using a TextAssembler as the render listener.
      Parameters:
      reader - the reader with the PDF
    • PdfTextLocator

      public PdfTextLocator(PdfReader reader, boolean usePdfMarkupElements)
      Creates a new Text Locator object, using a TextAssembler as the render listener.
      Parameters:
      reader - the reader with the PDF
      usePdfMarkupElements - whether to use higher-level tags for PDF markup entities
    • PdfTextLocator

      public PdfTextLocator(PdfReader reader, TextAssembler renderListener)
      Creates a new Text Locator object.
      Parameters:
      reader - the reader with the PDF
      renderListener - the render listener used to analyze renderText operations and provide the resulting text
  • Method Details

    • getContentBytesForPage

      private byte[] getContentBytesForPage(int pageNum) throws IOException
      Gets the content bytes of a page.
      Parameters:
      pageNum - the 1-based number of the page whose content stream you want
      Returns:
      a byte array with the effective content stream of a page
      Throws:
      IOException
    • getContentBytesFromContentObject

      private byte[] getContentBytesFromContentObject(PdfObject contentObject) throws IOException
      Gets the content bytes from a content object, which may be a reference, a stream, or an array.
      Parameters:
      contentObject - the object to read bytes from
      Returns:
      the content bytes
      Throws:
      IOException
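      The three cases named above (reference, stream, array) can be sketched as a small recursive function. This is only an illustration: Obj, StreamObj, RefObj, and ArrayObj are hypothetical stand-ins, not OpenPDF classes, and joining array members with a single space is a choice made for this sketch, not documented behavior of the private method.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class ContentBytesSketch {
    // Hypothetical stand-ins for the three shapes a content object can take.
    // These are NOT OpenPDF classes.
    interface Obj {}

    static final class StreamObj implements Obj {
        final byte[] bytes;
        StreamObj(byte[] bytes) { this.bytes = bytes; }
    }

    static final class RefObj implements Obj {
        final Obj target;
        RefObj(Obj target) { this.target = target; }
    }

    static final class ArrayObj implements Obj {
        final List<Obj> items;
        ArrayObj(List<Obj> items) { this.items = items; }
    }

    static byte[] contentBytes(Obj obj) {
        if (obj instanceof RefObj) {
            // A reference is resolved and processed as whatever it points to.
            return contentBytes(((RefObj) obj).target);
        }
        if (obj instanceof StreamObj) {
            // A stream contributes its bytes directly.
            return ((StreamObj) obj).bytes;
        }
        if (obj instanceof ArrayObj) {
            // An array contributes the bytes of each member, in order.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (Obj item : ((ArrayObj) obj).items) {
                byte[] part = contentBytes(item);
                out.write(part, 0, part.length);
                out.write(' '); // assumed separator so adjacent tokens do not fuse
            }
            return out.toByteArray();
        }
        throw new IllegalArgumentException("Unsupported content object: " + obj);
    }

    public static void main(String[] args) {
        Obj contents = new ArrayObj(Arrays.<Obj>asList(
                new RefObj(new StreamObj("BT".getBytes(StandardCharsets.US_ASCII))),
                new StreamObj("ET".getBytes(StandardCharsets.US_ASCII))));
        System.out.println(new String(contentBytes(contents), StandardCharsets.US_ASCII));
    }
}
```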
    • searchPage

      public List<MatchedPattern> searchPage(int page, String pattern) throws IOException
      Locates a text pattern inside a page.
      Parameters:
      page - the number of the page to search
      pattern - text to match
      Returns:
      a list of matched text patterns with their coordinates
      Throws:
      IOException - on error
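      To make the idea of "matched text with coordinates" concrete, here is a minimal, self-contained sketch. It assumes the pattern is treated as a regular expression, which the description above does not state. Chunk and Hit are hypothetical stand-ins for positioned text and for MatchedPattern; they are not OpenPDF types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternSearchSketch {
    // Hypothetical: a run of text together with the position where it was drawn.
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    // Hypothetical: one match plus the coordinates of the chunk that contained it.
    static final class Hit {
        final String text;
        final float x;
        final float y;
        Hit(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    // Assumption: the pattern is a java.util.regex expression.
    static List<Hit> search(List<Chunk> chunks, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<Hit> hits = new ArrayList<>();
        for (Chunk chunk : chunks) {
            Matcher matcher = pattern.matcher(chunk.text);
            while (matcher.find()) {
                // Report every match, tagged with its chunk's position.
                hits.add(new Hit(matcher.group(), chunk.x, chunk.y));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Chunk> page = Arrays.asList(
                new Chunk("Invoice 1042", 72f, 700f),
                new Chunk("Total: 99", 72f, 650f));
        for (Hit hit : search(page, "\\d+")) {
            System.out.println(hit.text + " at (" + hit.x + ", " + hit.y + ")");
        }
    }
}
```

      A real locator would report the position of the matched substring rather than of the whole chunk; the sketch keeps the chunk position for brevity.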
    • searchFile

      public List<MatchedPattern> searchFile(String pattern) throws IOException
      Locates a text pattern inside a PDF.
      Parameters:
      pattern - text to match
      Returns:
      a list of matched text patterns with their coordinates
      Throws:
      IOException - on error
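      A whole-file search can be thought of as the per-page search applied to every page, with the results concatenated. The sketch below shows that shape under the assumption that pages are numbered from 1 (as documented for getContentBytesForPage). PageSearcher is a hypothetical interface, and plain strings stand in for MatchedPattern; whether the library iterates in exactly this way is not confirmed here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FileSearchSketch {
    // Hypothetical per-page search; strings stand in for MatchedPattern.
    interface PageSearcher {
        List<String> searchPage(int page);
    }

    static List<String> searchFile(int pageCount, PageSearcher searcher) {
        List<String> all = new ArrayList<>();
        // Assumption: page numbers are 1-based, so iterate 1..pageCount inclusive.
        for (int page = 1; page <= pageCount; page++) {
            all.addAll(searcher.searchPage(page));
        }
        return all;
    }

    public static void main(String[] args) {
        // Each page reports a single fake match naming its own page number.
        List<String> hits = searchFile(3, page -> Collections.singletonList("match on page " + page));
        System.out.println(hits);
    }
}
```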
    • searchPage

      public List<MatchedPattern> searchPage(int page, float[] coordinates) throws IOException
      Locates text within a bounding box inside a page.
      Parameters:
      page - the number of the page to search
      coordinates - bounding box to extract text from
      Returns:
      a list of matched text patterns with their coordinates
      Throws:
      IOException - on error
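      The layout of the coordinates array is not specified above. The sketch below assumes four values in the order lower-left x, lower-left y, upper-right x, upper-right y, which is a common PDF rectangle convention but remains an assumption here. Chunk is a hypothetical stand-in for positioned text, not an OpenPDF type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BoundingBoxSketch {
    // Hypothetical: a run of text together with the position where it was drawn.
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        Chunk(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    // Assumed layout: box = {llx, lly, urx, ury}. Edges count as inside.
    static boolean contains(float[] box, float x, float y) {
        return x >= box[0] && x <= box[2] && y >= box[1] && y <= box[3];
    }

    static List<String> textInside(List<Chunk> chunks, float[] box) {
        List<String> inside = new ArrayList<>();
        for (Chunk chunk : chunks) {
            if (contains(box, chunk.x, chunk.y)) {
                inside.add(chunk.text);
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        List<Chunk> page = Arrays.asList(
                new Chunk("Header", 72f, 700f),
                new Chunk("Footer", 72f, 40f));
        float[] upperArea = {50f, 600f, 300f, 750f};
        System.out.println(textInside(page, upperArea));
    }
}
```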
    • searchFile

      public List<MatchedPattern> searchFile(float[] coordinates) throws IOException
      Locates text within a bounding box inside a PDF.
      Parameters:
      coordinates - bounding box to extract text from
      Returns:
      a list of matched text patterns with their coordinates
      Throws:
      IOException - on error
    • processContent

      public void processContent(byte[] contentBytes, PdfDictionary resources, PdfContentTextLocator handler)
      Processes the PDF syntax of a content stream.
      Parameters:
      contentBytes - the bytes of a content stream
      resources - the resources that come with the content stream
      handler - interprets events caused by recognition of operations in a content stream.
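      PDF content streams use postfix notation: operands come first, followed by the operator that consumes them. The sketch below illustrates how a handler might be notified per operator. It is deliberately simplified (whitespace-separated tokens only, so string operands containing spaces are not supported) and uses a hypothetical OperatorHandler interface rather than PdfContentTextLocator.

```java
import java.util.ArrayList;
import java.util.List;

class ContentStreamSketch {
    // Hypothetical callback; not the OpenPDF handler type.
    interface OperatorHandler {
        void handle(String operator, List<String> operands);
    }

    static void process(String content, OperatorHandler handler) {
        List<String> operands = new ArrayList<>();
        for (String token : content.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the content is blank
            }
            if (isOperand(token)) {
                // Operands accumulate until an operator arrives.
                operands.add(token);
            } else {
                // The operator consumes everything collected so far.
                handler.handle(token, new ArrayList<>(operands));
                operands.clear();
            }
        }
    }

    // Simplification: names (/F1) and numbers are operands; anything else is an operator.
    static boolean isOperand(String token) {
        char first = token.charAt(0);
        return first == '/' || first == '-' || first == '+' || first == '.'
                || Character.isDigit(first);
    }

    public static void main(String[] args) {
        process("BT /F1 12 Tf 72 700 Td ET",
                (operator, operands) -> System.out.println(operator + " " + operands));
    }
}
```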