Class PdfContentTextLocator
java.lang.Object
org.openpdf.text.pdf.parser.PdfContentStreamHandler
org.openpdf.text.pdf.parser.PdfContentTextLocator
-
Nested Class Summary
Nested classes/interfaces inherited from class PdfContentStreamHandler
PdfContentStreamHandler.BeginText, PdfContentStreamHandler.Do, PdfContentStreamHandler.EndText, PdfContentStreamHandler.ModifyCurrentTransformationMatrix, PdfContentStreamHandler.MoveNextLineAndShowText, PdfContentStreamHandler.MoveNextLineAndShowTextWithSpacing, PdfContentStreamHandler.PopGraphicsState, PdfContentStreamHandler.ProcessGraphicsStateResource, PdfContentStreamHandler.PushGraphicsState, PdfContentStreamHandler.SetTextCharacterSpacing, PdfContentStreamHandler.SetTextFont, PdfContentStreamHandler.SetTextHorizontalScaling, PdfContentStreamHandler.SetTextLeading, PdfContentStreamHandler.SetTextRenderMode, PdfContentStreamHandler.SetTextRise, PdfContentStreamHandler.SetTextWordSpacing, PdfContentStreamHandler.ShowText, PdfContentStreamHandler.ShowTextArray, PdfContentStreamHandler.TextMoveNextLine, PdfContentStreamHandler.TextMoveStartNextLine, PdfContentStreamHandler.TextMoveStartNextLineWithLeading, PdfContentStreamHandler.TextSetTextMatrix -
Field Summary
Fields
Modifier and Type | Field | Description
private final ArrayList<MatchedPattern>
private float[]
private final ArrayList<ParsedText>
private final int
private Pattern
private final int
Fields inherited from class PdfContentStreamHandler
contextNames, gsStack, operators, renderListener, textFragments, textFragmentStreams, textLineMatrix, textMatrix -
Constructor Summary
Constructors
Constructor | Description
PdfContentTextLocator(TextAssembler renderListener, float[] coordinates, int page) | Constructs a PdfContentStreamHandler for coordinates-based text extraction.
PdfContentTextLocator(TextAssembler renderListener, String pattern, int page) | Constructs a PdfContentStreamHandler for regex-based text extraction.
-
Method Summary
Modifier and Type | Method | Description
private float | convertHeightToUser(float height)
(package private) void | displayPdfString(PdfString string) | Extracts the content and coordinates of a PdfString according to the handler's extraction pattern: the text either matches a given regex or intersects a given bounding box.
protected void | installDefaultOperators() | Loads all the supported graphics and text state operators into a map.
private void | locatePdfString(String decoded, float startWidth, float totalWidth, float fontFloor, float fontCeiling) | Extracts the text if its coordinates intersect the given bounding box.
private void | matchPdfString(String decoded, float[] widths, float totalWidth, float fontFloor, float fontCeiling) | Searches for a pattern in a PdfString and, if found, collects its bounding box.
(package private) void | popContext()
(package private) void | pushContext(String newContextName)
void | reset()
-
Field Details
-
accumulator
-
fragments
-
fragmentsWidths
-
page
private final int page -
p
-
coordinates
private float[] coordinates -
mode
private final int mode
-
-
Constructor Details
-
PdfContentTextLocator
Constructs a PdfContentStreamHandler for regex-based text extraction.
Parameters:
renderListener - the text assembler
pattern - the pattern to match text against
page - the PDF page to inspect
-
PdfContentTextLocator
Constructs a PdfContentStreamHandler for coordinates-based text extraction.
Parameters:
renderListener - the text assembler
coordinates - the bounding box to search for text within
page - the PDF page to inspect
-
-
Method Details
-
installDefaultOperators
protected void installDefaultOperators()
Loads all the supported graphics and text state operators into a map.
Overrides:
installDefaultOperators in class PdfContentStreamHandler
-
popContext
void popContext()
Specified by:
popContext in class PdfContentStreamHandler
-
pushContext
void pushContext(String newContextName)
Specified by:
pushContext in class PdfContentStreamHandler
-
reset
public void reset()
Specified by:
reset in class PdfContentStreamHandler
-
displayPdfString
void displayPdfString(PdfString string)
Extracts the content and coordinates of a PdfString according to the handler's extraction pattern: the text either matches a given regex or intersects a given bounding box.
Specified by:
displayPdfString in class PdfContentStreamHandler
Parameters:
string - the text to inspect
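The two extraction patterns described above can be illustrated with a small, self-contained sketch. It does not reproduce the OpenPDF implementation: the class `TwoModeLocatorSketch`, its `MODE_*` constants, and the `accepts` method are hypothetical, and the bounding box layout `{left, bottom, right, top}` is an assumption rather than something this page documents.

```java
import java.util.regex.Pattern;

// Hypothetical stand-in for a handler with two extraction patterns.
// This is NOT the OpenPDF class; all names here are illustrative.
class TwoModeLocatorSketch {
    private static final int MODE_REGEX = 0; // assumed constant
    private static final int MODE_BOX = 1;   // assumed constant

    private final int mode;
    private final Pattern pattern; // used only in regex mode
    private final float[] box;     // assumed layout: {left, bottom, right, top}

    // Regex-based construction: keep text that matches the pattern.
    TwoModeLocatorSketch(String regex) {
        this.mode = MODE_REGEX;
        this.pattern = Pattern.compile(regex);
        this.box = null;
    }

    // Coordinates-based construction: keep text that overlaps the box.
    TwoModeLocatorSketch(float[] box) {
        this.mode = MODE_BOX;
        this.pattern = null;
        this.box = box.clone();
    }

    // Decides whether a decoded string with the given extent is kept.
    boolean accepts(String decoded, float left, float bottom, float right, float top) {
        if (mode == MODE_REGEX) {
            return pattern.matcher(decoded).find();
        }
        // Edges that merely touch count as intersecting in this sketch.
        return left <= box[2] && right >= box[0] && bottom <= box[3] && top >= box[1];
    }

    public static void main(String[] args) {
        TwoModeLocatorSketch byRegex = new TwoModeLocatorSketch("\\d+");
        System.out.println(byRegex.accepts("Invoice 42", 0f, 0f, 10f, 10f));
        TwoModeLocatorSketch byBox = new TwoModeLocatorSketch(new float[] {0f, 0f, 100f, 50f});
        System.out.println(byBox.accepts("x", 90f, 40f, 120f, 60f));
    }
}
```

Storing the mode at construction time keeps the per-string decision to a single branch, which is one plausible reason for a `mode` field like the one listed in Field Details.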
-
matchPdfString
private void matchPdfString(String decoded, float[] widths, float totalWidth, float fontFloor, float fontCeiling)
Searches for a pattern in a PdfString and, if found, collects its bounding box.
Parameters:
decoded - the text to inspect
widths - array of prefix widths of each char
totalWidth - width of the text
fontFloor - lowest y-coordinate of the font
fontCeiling - highest y-coordinate of the font
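To show how prefix widths can turn a regex match into a bounding box, here is a minimal sketch. It is not the library's code: the class and method names are hypothetical, and it assumes that `prefixWidths[i]` holds the width of the first `i` characters, so the array has one more entry than the string has characters. The returned box layout `{left, bottom, right, top}` is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; illustrates the prefix-width idea only.
class PrefixWidthMatchSketch {

    // Returns one {left, bottom, right, top} box per regex match.
    // Assumption: prefixWidths[i] is the width of decoded.substring(0, i),
    // so prefixWidths.length == decoded.length() + 1.
    static List<float[]> matchBoxes(String decoded, float[] prefixWidths,
                                    float fontFloor, float fontCeiling,
                                    Pattern pattern) {
        List<float[]> boxes = new ArrayList<>();
        Matcher m = pattern.matcher(decoded);
        while (m.find()) {
            float left = prefixWidths[m.start()]; // width before the match
            float right = prefixWidths[m.end()];  // width up to the match end
            boxes.add(new float[] {left, fontFloor, right, fontCeiling});
        }
        return boxes;
    }

    public static void main(String[] args) {
        String text = "Total: 42";
        float[] prefix = new float[text.length() + 1];
        for (int i = 0; i < prefix.length; i++) {
            prefix[i] = i * 5f; // pretend every glyph is 5 units wide
        }
        for (float[] b : matchBoxes(text, prefix, -2f, 10f, Pattern.compile("\\d+"))) {
            System.out.println(b[0] + " " + b[1] + " " + b[2] + " " + b[3]);
        }
    }
}
```

With cumulative widths, the horizontal extent of any substring is a difference of two array entries, so each match costs constant time after the regex search itself.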
-
locatePdfString
private void locatePdfString(String decoded, float startWidth, float totalWidth, float fontFloor, float fontCeiling)
Extracts the text if its coordinates intersect the given bounding box.
Parameters:
decoded - the text to inspect
startWidth - left-most x-coordinate of the text
totalWidth - width of the text
fontFloor - lowest y-coordinate of the font
fontCeiling - highest y-coordinate of the font
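The intersection test implied by these parameters can be sketched as a standard axis-aligned rectangle overlap check. This is an illustration under assumptions, not the library's implementation: the class and method are hypothetical, the search box is assumed to be `{left, bottom, right, top}`, and all values are assumed to be in the same coordinate space.

```java
// Hypothetical helper; shows an axis-aligned overlap test only.
class BoxIntersectionSketch {

    // Assumption: box is {left, bottom, right, top}.
    // The text extent runs from startWidth to startWidth + totalWidth
    // horizontally and from fontFloor to fontCeiling vertically.
    static boolean intersects(float[] box, float startWidth, float totalWidth,
                              float fontFloor, float fontCeiling) {
        float textLeft = startWidth;
        float textRight = startWidth + totalWidth;
        // Two rectangles overlap when neither lies entirely to one side
        // of the other; touching edges count as overlap in this sketch.
        boolean overlapX = textLeft <= box[2] && textRight >= box[0];
        boolean overlapY = fontFloor <= box[3] && fontCeiling >= box[1];
        return overlapX && overlapY;
    }

    public static void main(String[] args) {
        float[] box = {50f, 700f, 200f, 720f};
        System.out.println(intersects(box, 180f, 40f, 705f, 715f));
        System.out.println(intersects(box, 210f, 40f, 705f, 715f));
    }
}
```

Whether touching edges should count as an intersection is a design choice; a strict variant would replace `<=` and `>=` with `<` and `>`.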
-
convertHeightToUser
private float convertHeightToUser(float height) -
getResultantText
Specified by:
getResultantText in class PdfContentStreamHandler
Returns:
result text
-
getMatchedPatterns
Returns:
list of text strips that match
-