Class PdfContentTextLocator

java.lang.Object
org.openpdf.text.pdf.parser.PdfContentStreamHandler
org.openpdf.text.pdf.parser.PdfContentTextLocator

public class PdfContentTextLocator extends PdfContentStreamHandler
  • Constructor Details

    • PdfContentTextLocator

      public PdfContentTextLocator(TextAssembler renderListener, String pattern, int page)
      Constructs a PdfContentStreamHandler that extracts text by matching it against a regular expression.
      Parameters:
      renderListener - the text assembler
      pattern - the pattern to match text against
      page - number of the page to inspect
    • PdfContentTextLocator

      public PdfContentTextLocator(TextAssembler renderListener, float[] coordinates, int page)
      Constructs a PdfContentStreamHandler that extracts text located within a given bounding box.
      Parameters:
      renderListener - the text assembler
      coordinates - the bounding box to search text within
      page - number of the page to inspect
  • Method Details

    • installDefaultOperators

      protected void installDefaultOperators()
      Loads all the supported graphics and text state operators in a map.
      Overrides:
      installDefaultOperators in class PdfContentStreamHandler
    • popContext

      void popContext()
      Specified by:
      popContext in class PdfContentStreamHandler
    • pushContext

      void pushContext(String newContextName)
      Specified by:
      pushContext in class PdfContentStreamHandler
    • reset

      public void reset()
      Specified by:
      reset in class PdfContentStreamHandler
    • displayPdfString

      void displayPdfString(PdfString string)
      Extracts the content and coordinates of a PdfString according to the handler's extraction mode: the text either matches the given regex or intersects the given bounding box.
      Specified by:
      displayPdfString in class PdfContentStreamHandler
      Parameters:
      string - the text to inspect
    • matchPdfString

      private void matchPdfString(String decoded, float[] widths, float totalWidth, float fontFloor, float fontCeiling)
      Searches for the pattern in a PdfString and, if it is found, collects the bounding box of the match.
      Parameters:
      decoded - the text to inspect
      widths - array of prefix widths, one entry per character
      totalWidth - width of the text
      fontFloor - lowest y-coordinate of the font
      fontCeiling - highest y-coordinate of the font
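
      The idea behind matchPdfString can be illustrated with a minimal, self-contained sketch. It is not the OpenPDF implementation: the class PrefixWidthMatcher and the method matchBoxes are hypothetical, and the sketch assumes that widths[i] holds the width of the first i characters of the decoded text, so that the horizontal extent of a match can be read directly from the prefix widths. The returned boxes stay in the coordinate space of the inputs; any conversion to user space is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of OpenPDF: maps regex matches to bounding boxes. */
final class PrefixWidthMatcher {

    /**
     * Returns one {left, bottom, right, top} box per regex match.
     * Assumption: widths[i] is the width of the first i characters of decoded,
     * and the width of the whole string is totalWidth.
     */
    static List<float[]> matchBoxes(String decoded, float[] widths, float totalWidth,
                                    float fontFloor, float fontCeiling, Pattern pattern) {
        List<float[]> boxes = new ArrayList<>();
        Matcher m = pattern.matcher(decoded);
        while (m.find()) {
            float left = prefixWidth(widths, totalWidth, m.start());
            float right = prefixWidth(widths, totalWidth, m.end());
            boxes.add(new float[] {left, fontFloor, right, fontCeiling});
        }
        return boxes;
    }

    /** Width of the first index characters; past the array, fall back to the total width. */
    private static float prefixWidth(float[] widths, float totalWidth, int index) {
        return index < widths.length ? widths[index] : totalWidth;
    }
}
```

      With every character one unit wide, matching "World" in "Hello World" gives a box spanning x = 6 to x = 11, bounded vertically by fontFloor and fontCeiling.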
    • locatePdfString

      private void locatePdfString(String decoded, float startWidth, float totalWidth, float fontFloor, float fontCeiling)
      Extracts the text if its coordinates intersect the given bounding box.
      Parameters:
      decoded - the text to inspect
      startWidth - left-most x-coordinate of the text
      totalWidth - width of the text
      fontFloor - lowest y-coordinate of the font
      fontCeiling - highest y-coordinate of the font
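
      The intersection check that locatePdfString describes can be sketched as a plain rectangle overlap test. This is an illustration only, not the OpenPDF code: the class BoxIntersection and the method intersects are hypothetical, the search box is assumed to be laid out as {left, bottom, right, top}, and boxes that merely touch along an edge are treated as not intersecting.

```java
/** Hypothetical helper, not part of OpenPDF: axis-aligned rectangle overlap test. */
final class BoxIntersection {

    /**
     * Returns true when the text's box overlaps the search box.
     * Assumption: box is {left, bottom, right, top} in the same coordinate
     * space as the text measurements.
     */
    static boolean intersects(float startWidth, float totalWidth,
                              float fontFloor, float fontCeiling, float[] box) {
        float textLeft = startWidth;
        float textRight = startWidth + totalWidth;
        // Two rectangles overlap only if their extents overlap on both axes.
        boolean overlapX = textLeft < box[2] && textRight > box[0];
        boolean overlapY = fontFloor < box[3] && fontCeiling > box[1];
        return overlapX && overlapY;
    }
}
```

      Using strict comparisons keeps text that only shares a border with the search box out of the result; swapping them for <= and >= would include it.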
    • convertHeightToUser

      private float convertHeightToUser(float height)
    • getResultantText

      public String getResultantText()
      Specified by:
      getResultantText in class PdfContentStreamHandler
      Returns:
      the resulting text
    • getMatchedPatterns

      public List<MatchedPattern> getMatchedPatterns()
      Returns:
      list of the text strips that match