Class StreamedSource
- All Implemented Interfaces:
Closeable
,AutoCloseable
,Iterable<Segment>
This class provides a means, via the iterator()
method, of sequentially parsing every tag, character reference
and plain text segment contained within the source document using a minimum amount of memory.
In contrast, the standard Source
class stores the entire source text in memory and caches every tag parsed,
resulting in memory problems when attempting to parse very large files.
The iterator
parses and returns each segment as the source text is streamed in.
Previous segments are discarded for garbage collection.
Source documents up to 2GB in size can be processed, a limit which is imposed by the Java language because of its use of the int
data type to index string operations.
There is however a significant trade-off in functionality when using the StreamedSource
class as opposed to the Source
class.
The Tag.getElement()
method is not supported on tags that are returned by the iterator, nor are any methods that use the Element
class in any way.
The Segment.getSource()
method is also not supported.
Most of the methods and constructors in this class mirror similarly named methods in the Source
class where the same functionality is available.
See the description of the iterator()
method for a typical usage example of this class.
In contrast to a Source
object, the Reader
or InputStream
specified in the constructor or created implicitly by the constructor
remains open for the life of the StreamedSource
object. If the stream is created internally, it is automatically closed
when the end of the stream is reached or the StreamedSource
object is finalized.
However a Reader
or InputStream
that is specified directly in a constructor is never closed automatically, as it can not be assumed
that the application has no further use for it. It is the user's responsibility to ensure it is closed in this case.
Explicitly calling the close()
method on the StreamedSource
object ensures that all resources used by it are closed, regardless of whether
they were created internally or supplied externally.
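The ownership rule above can be sketched with plain JDK classes. This is only an illustration: TrackingReader and drain are hypothetical helpers, standing in for an application-supplied Reader and for a parser that reads to the end of a stream it does not own. Because StreamedSource implements Closeable, it could presumably be managed by the same kind of try-with-resources statement.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class ReaderOwnershipSketch {
    /** Hypothetical helper: wraps a Reader and records whether close() was called. */
    static final class TrackingReader extends FilterReader {
        private boolean closed;
        TrackingReader(Reader in) { super(in); }
        @Override public void close() throws IOException { closed = true; super.close(); }
        boolean isClosed() { return closed; }
    }

    /** Reads to end of stream without closing, as a consumer that does not own the reader would. */
    static int drain(Reader reader) throws IOException {
        int count = 0;
        while (reader.read() != -1) count++;
        return count;
    }

    /** Returns { closed after reaching end of stream, closed after try-with-resources }. */
    static boolean[] demo() {
        TrackingReader reader = new TrackingReader(new StringReader("<p>text</p>"));
        boolean closedAfterDrain;
        try (TrackingReader r = reader) {
            drain(r);
            closedAfterDrain = r.isClosed(); // reaching end of stream does not close it
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new boolean[] { closedAfterDrain, reader.isClosed() };
    }

    public static void main(String[] args) {
        boolean[] result = demo();
        System.out.println("closed after drain: " + result[0] + ", closed after try: " + result[1]);
    }
}
```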
The functionality provided by StreamedSource
is similar to a StAX parser,
but with some important benefits:
- The source document does not have to be valid XML. It can be plain HTML, can contain invalid syntax, undefined entities, incorrectly nested elements, server tags, or anything else that is commonly found in "tag soup".
- Every single syntactical construct in the source document's original text is included in the iterator, including the XML declaration, character references, comments, CDATA sections and server tags, each providing the segment's begin and end position in the source document. This allows an exact copy of the original document to be generated, allowing modifications to be made only where they are explicitly required. This is not possible with either SAX or StAX, which to some extent provide interpretations of the content of the XML instead of the syntactical structures used in the original source document.
The following table summarises the differences between the StreamedSource
, StAX and SAX interfaces.
Note that some of the available features are documented as optional and may not be supported by all implementations of StAX and SAX.
Feature | StreamedSource | StAX | SAX |
---|---|---|---|
Parse XML | ● | ● | ● |
Parse entities without DTD | ● | ||
Automatically validate XML | ● | ● | |
Parse HTML | ● | ||
Tolerant of syntax or nesting errors | ● | ||
Provide begin and end character positions of each event1 | ● | ○ | |
Provide source text of each event | ● | ||
Handle server tag events | ● | ||
Handle XML declaration event | ● | ||
Handle comment events | ● | ● | ● |
Handle CDATA section events | ● | ● | ● |
Handle document type declaration event | ● | ● | ● |
Handle character reference events | ● | ||
Allow chunking of plain text | ● | ● | ● |
Allow chunking of comment text | |||
Allow chunking of CDATA section text | ● | ||
Allow specification of maximum buffer size | ● |
Note that the OutputDocument
class can not be used to create a modified version of a streamed source document.
Instead, the output document must be constructed manually from the segments provided by the iterator
.
StreamedSource
objects are not thread safe.
-
Constructor Summary
Constructor | Description |
---|---|
StreamedSource(InputStream inputStream) | Constructs a new StreamedSource object by loading the content from the specified InputStream. |
StreamedSource(Reader reader) | Constructs a new StreamedSource object by loading the content from the specified Reader. |
StreamedSource(CharSequence text) | Constructs a new StreamedSource object from the specified text. |
StreamedSource(URL url) | Constructs a new StreamedSource object by loading the content from the specified URL. |
StreamedSource(URLConnection urlConnection) | Constructs a new StreamedSource object by loading the content from the specified URLConnection. |
-
Method Summary
Modifier and Type | Method | Description |
---|---|---|
void | close() | Closes the underlying Reader or InputStream and releases any system resources associated with it. |
protected void | finalize() | Called by the garbage collector on an object when garbage collection determines that there are no more references to the object. |
int | getBufferSize() | Returns the current size of the internal character buffer. |
Segment | getCurrentSegment() | Returns the current Segment from the iterator(). |
CharBuffer | getCurrentSegmentCharBuffer() | Returns a CharBuffer containing the source text of the current segment. |
String | getEncoding() | Returns the character encoding scheme of the source byte stream used to create this object. |
String | getEncodingSpecificationInfo() | Returns a concise description of how the encoding of the source document was determined. |
Logger | getLogger() | Returns the Logger that handles log messages. |
String | getPreliminaryEncodingInfo() | Returns the preliminary encoding of the source document together with a concise description of how it was determined. |
boolean | isXML() | Indicates whether the source document is likely to be XML. |
Iterator<Segment> | iterator() | Returns an iterator over every tag, character reference and plain text segment contained within the source document. |
StreamedSource | setBuffer(char[] buffer) | Specifies an existing character array to use for buffering the incoming character stream. |
StreamedSource | setCoalescing(boolean coalescing) | Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator. |
void | setLogger(Logger logger) | Sets the Logger that handles log messages. |
String | toString() | Returns a string representation of the object as generated by the default Object.toString() implementation. |
Methods inherited from class java.lang.Object
clone, equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
-
Constructor Details
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified Reader.
If the specified reader is an instance of InputStreamReader, the getEncoding() method of the created StreamedSource object returns the encoding from InputStreamReader.getEncoding().
- Parameters:
reader - the java.io.Reader from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
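When the encoding is already known, the application can build the InputStreamReader itself and pass it to this constructor. The following JDK-only sketch shows that step; openUtf8 is a hypothetical helper name and UTF-8 is an assumed encoding chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class KnownEncodingReaderSketch {
    /** Decodes the stream with a charset the application already knows, so no detection is needed. */
    static InputStreamReader openUtf8(InputStream in) {
        return new InputStreamReader(in, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = "<p>caf\u00e9</p>".getBytes(StandardCharsets.UTF_8);
        try (InputStreamReader reader = openUtf8(new ByteArrayInputStream(bytes))) {
            // getEncoding() reports the JDK's name for the charset in use; per the text above,
            // a StreamedSource built from this reader would report the same value.
            System.out.println(reader.getEncoding());
        }
    }
}
```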
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified InputStream.
The algorithm for detecting the character encoding of the source document from the raw bytes of the specified input stream is the same as that for the Source(URLConnection) constructor of the Source class, except that the first step is not possible as there is no Content-Type header to check.
If the specified InputStream does not support the mark method, the algorithm that determines the encoding may have to wrap it in a BufferedInputStream in order to look ahead at the encoding metadata. This extra layer of buffering then remains in place for the life of the StreamedSource, possibly increasing memory usage or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.
- Parameters:
inputStream - the java.io.InputStream from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
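The look-ahead described above relies on the standard mark and reset mechanism of java.io.InputStream. The sketch below illustrates that mechanism only; it is not the library's detection code, and ensureMarkSupported, peek and unmarkable are hypothetical helper names.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class LookAheadSketch {
    /** Returns the stream itself if it supports mark/reset, otherwise a buffered wrapper that does. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to limit bytes for inspection, then rewinds so later reads start from the same point. */
    static byte[] peek(InputStream markable, int limit) {
        try {
            markable.mark(limit);
            byte[] head = new byte[limit];
            int total = 0;
            int n;
            while (total < limit && (n = markable.read(head, total, limit - total)) != -1) {
                total += n;
            }
            markable.reset();
            return Arrays.copyOf(head, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Test helper: a stream that inherits the default markSupported() result of false. */
    static InputStream unmarkable(byte[] data) {
        final ByteArrayInputStream source = new ByteArrayInputStream(data);
        return new InputStream() {
            @Override public int read() { return source.read(); }
        };
    }

    public static void main(String[] args) {
        InputStream in = ensureMarkSupported(unmarkable("<?xml version='1.0'?>".getBytes()));
        System.out.println(new String(peek(in, 5)));
    }
}
```

The wrapper created by ensureMarkSupported stays between the caller and the original stream for as long as the stream is used, which is the extra buffering layer the paragraph above warns about.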
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified URL.
This is equivalent to StreamedSource(url.openConnection()).
- Parameters:
url - the URL from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
-
StreamedSource
Constructs a new StreamedSource object by loading the content from the specified URLConnection.
The algorithm for detecting the character encoding of the source document is identical to that described in the Source(URLConnection) constructor of the Source class.
The algorithm that determines the encoding may have to wrap the input stream in a BufferedInputStream in order to look ahead at the encoding metadata if the encoding is not specified in the HTTP headers. This extra layer of buffering then remains in place for the life of the StreamedSource, possibly increasing memory usage or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.
- Parameters:
urlConnection - the URL connection from which to load the source text.
- Throws:
IOException - if an I/O error occurs.
-
StreamedSource
Constructs a new StreamedSource object from the specified text.
Although the CharSequence argument of this constructor apparently contradicts the notion of streaming in the source text, it can still offer benefits over the equivalent use of the standard Source class.
Firstly, using the StreamedSource class to iterate the nodes of an in-memory CharSequence source document still requires much less memory than the equivalent operation using the standard Source class.
Secondly, the specified CharSequence object could implement its own paging mechanism to minimise memory usage.
If the specified CharSequence is mutable, its state must not be modified while the StreamedSource is in use.
- Parameters:
text - the source text.
-
-
Method Details
-
setBuffer
Specifies an existing character array to use for buffering the incoming character stream.
The specified buffer is fixed for the life of the StreamedSource object, in contrast to the default buffer which can be automatically replaced by a larger buffer as needed. This means that if a tag (including a comment or CDATA section) is encountered that is larger than the specified buffer, an unrecoverable BufferOverflowException is thrown. This exception is also thrown if coalescing has been enabled and a plain text segment is encountered that is larger than the specified buffer.
In general this method should only be used if there needs to be an absolute maximum memory limit imposed on the parser, where that requirement is more important than the ability to parse any source document successfully.
This method can only be called before the iterator() method has been called.
- Parameters:
buffer - an existing character array to use for buffering the incoming character stream, must not be null.
- Returns:
this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
- Throws:
IllegalStateException - if the iterator() method has already been called.
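The difference between a fixed buffer and the default replaceable buffer can be pictured with a small JDK-only model. This is a sketch of the trade-off, not the parser's internals: storeFixed and storeGrowable are hypothetical names, the doubling growth policy is an assumption, and the exception shown is java.nio.BufferOverflowException, which is assumed to be the type referred to above.

```java
import java.nio.BufferOverflowException;
import java.nio.CharBuffer;

final class FixedBufferSketch {
    /** Copies a token into a caller-supplied array of fixed capacity; fails if it does not fit. */
    static int storeFixed(char[] buffer, CharSequence token) {
        CharBuffer target = CharBuffer.wrap(buffer);
        target.put(token.toString()); // throws BufferOverflowException if the token is too long
        return target.position();     // number of characters stored
    }

    /** Copies a token into a buffer that is replaced by a larger one when needed. */
    static char[] storeGrowable(char[] buffer, CharSequence token) {
        char[] target = buffer;
        if (token.length() > target.length) {
            // Assumed growth policy for illustration: at least double, or large enough for the token.
            target = new char[Math.max(token.length(), target.length * 2)];
        }
        token.toString().getChars(0, token.length(), target, 0);
        return target;
    }

    public static void main(String[] args) {
        System.out.println(storeFixed(new char[8], "<p>"));
        try {
            storeFixed(new char[4], "<!-- long comment -->");
        } catch (BufferOverflowException e) {
            System.out.println("token larger than fixed buffer");
        }
        System.out.println(storeGrowable(new char[4], "<!-- long comment -->").length);
    }
}
```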
-
setCoalescing
Specifies whether an unbroken section of plain text in the source document should always be coalesced into a single Segment by the iterator.
If this property is set to the default value of false, and a section of plain text is encountered in the document that is larger than the current buffer size, the text is chunked into multiple consecutive plain text segments in order to minimise memory usage.
If this property is set to true then chunking is disabled, ensuring that consecutive plain text segments are never generated, but instead forcing the internal buffer to expand to fit the largest section of plain text.
Note that CharacterReference segments are always handled separately from plain text, regardless of the value of this property. For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments in order to handle character references, so there is usually no advantage in coalescing plain text segments.
- Parameters:
coalescing - the new value of the coalescing property.
- Returns:
this StreamedSource instance, allowing multiple property setting methods to be chained in a single statement.
- Throws:
IllegalStateException - if the iterator() method has already been called.
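The effect of chunking on a consumer can be modelled without the parser. The sketch below is an assumption-laden illustration: chunk and reassemble are hypothetical helpers, and the real iterator is not documented to split text at exact multiples of the buffer size. It shows why code that appends each piece as it arrives works the same whether or not coalescing is enabled.

```java
import java.util.ArrayList;
import java.util.List;

final class TextChunkingSketch {
    /** Splits a run of plain text into consecutive chunks no longer than bufferSize. */
    static List<String> chunk(String text, int bufferSize) {
        if (bufferSize <= 0) throw new IllegalArgumentException("bufferSize must be positive");
        List<String> chunks = new ArrayList<>();
        for (int begin = 0; begin < text.length(); begin += bufferSize) {
            chunks.add(text.substring(begin, Math.min(text.length(), begin + bufferSize)));
        }
        return chunks;
    }

    /** What a tolerant consumer does: append every piece in order as it arrives. */
    static String reassemble(List<String> pieces) {
        StringBuilder sb = new StringBuilder();
        for (String piece : pieces) sb.append(piece);
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> pieces = chunk("abcdefg", 3);
        System.out.println(pieces + " -> " + reassemble(pieces));
    }
}
```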
-
close
Closes the underlying Reader or InputStream and releases any system resources associated with it.
If the stream is already closed then invoking this method has no effect.
- Specified by:
close in interface AutoCloseable
- Specified by:
close in interface Closeable
- Throws:
IOException - if an I/O error occurs.
-
getEncoding
Returns the character encoding scheme of the source byte stream used to create this object.
This method works in essentially the same way as the Source.getEncoding() method.
If the byte stream used to create this object does not support the mark method, the algorithm that determines the encoding may have to wrap it in a BufferedInputStream in order to look ahead at the encoding metadata. This extra layer of buffering then remains in place for the life of the StreamedSource, possibly increasing memory usage or degrading performance. It is always preferable to use the StreamedSource(Reader) constructor if the encoding is known in advance.
The getEncodingSpecificationInfo() method returns a simple description of how the value of this method was determined.
- Returns:
the character encoding scheme of the source byte stream used to create this object, or null if the encoding is not known.
-
getEncodingSpecificationInfo
Returns a concise description of how the encoding of the source document was determined.
The description is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.
- Returns:
a concise description of how the encoding of the source document was determined.
-
getPreliminaryEncodingInfo
Returns the preliminary encoding of the source document together with a concise description of how it was determined.
This method works in essentially the same way as the Source.getPreliminaryEncodingInfo() method.
The description returned by this method is intended for informational purposes only. It is not guaranteed to have any particular format and cannot be reliably parsed.
- Returns:
the preliminary encoding of the source document together with a concise description of how it was determined, or null if no preliminary encoding was required.
-
iterator
Returns an iterator over every tag, character reference and plain text segment contained within the source document.
Plain text is defined as all text that is not part of a Tag or CharacterReference.
This results in a sequential walk-through of the entire source document. The end position of each segment should correspond with the begin position of the subsequent segment, unless any of the tags are enclosed by other tags. This could happen if there are server tags present in the document, or in rare circumstances where the document type declaration contains markup declarations.
Each segment generated by the iterator is parsed as the source text is streamed in. Previous segments are discarded for garbage collection.
If a section of plain text is encountered in the document that is larger than the current buffer size, the text is chunked into multiple consecutive plain text segments in order to minimise memory usage. Setting the Coalescing property to true disables chunking, ensuring that consecutive plain text segments are never generated, but instead forcing the internal buffer to expand to fit the largest section of plain text. Note that CharacterReference segments are always handled separately from plain text, regardless of whether coalescing is enabled. For this reason, algorithms that process element content almost always have to be designed to expect the text in multiple segments in order to handle character references, so there is usually no advantage in coalescing plain text segments.
Character references that are found inside tags, such as those present inside attribute values, do not generate separate segments from the iterator.
This method may only be called once on any particular StreamedSource instance.
- Example:
-
The following code demonstrates the typical (implied) usage of this method through the Iterable interface to make an exact copy of the document from reader to writer (assuming no server tags are present):

    StreamedSource streamedSource=new StreamedSource(reader);
    for (Segment segment : streamedSource) {
      if (segment instanceof Tag) {
        Tag tag=(Tag)segment;
        // HANDLE TAG
        // Uncomment the following line to ensure each tag is valid XML:
        // writer.write(tag.tidy()); continue;
      } else if (segment instanceof CharacterReference) {
        CharacterReference characterReference=(CharacterReference)segment;
        // HANDLE CHARACTER REFERENCE
        // Uncomment the following line to decode all character references instead of copying them verbatim:
        // characterReference.appendCharTo(writer); continue;
      } else {
        // HANDLE PLAIN TEXT
      }
      // unless specific handling has prevented getting to here, simply output the segment as is:
      writer.write(segment.toString());
    }

Note that the last line writer.write(segment.toString()) in the above code can be replaced with the following for improved performance:

    CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
    writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());
-
The following code demonstrates how to process the plain text content of a specific element, in this case to print the content of every paragraph element:

    StreamedSource streamedSource=new StreamedSource(reader);
    StringBuilder sb=new StringBuilder();
    boolean insideParagraphElement=false;
    for (Segment segment : streamedSource) {
      if (segment instanceof Tag) {
        Tag tag=(Tag)segment;
        if (tag.getName().equals("p")) {
          if (tag instanceof StartTag) {
            insideParagraphElement=true;
            sb.setLength(0);
          } else { // tag instanceof EndTag
            insideParagraphElement=false;
            System.out.println(sb.toString());
          }
        }
      } else if (insideParagraphElement) {
        if (segment instanceof CharacterReference) {
          ((CharacterReference)segment).appendCharTo(sb);
        } else {
          sb.append(segment);
        }
      }
    }

- Specified by:
iterator in interface Iterable<Segment>
- Returns:
an iterator over every tag, character reference and plain text segment contained within the source document.
-
getCurrentSegment
Returns the current Segment from the iterator().
This is defined as the last Segment returned from the iterator's next() method.
This method returns null if the iterator's next() method has never been called, or its hasNext() method has returned the value false.
- Returns:
the current Segment from the iterator().
-
getCurrentSegmentCharBuffer
Returns a CharBuffer containing the source text of the current segment.
The returned CharBuffer provides a window into the internal char[] buffer including the position and length that spans the current segment.
For example, the following code writes the source text of the current segment to writer:

    CharBuffer charBuffer=streamedSource.getCurrentSegmentCharBuffer();
    writer.write(charBuffer.array(),charBuffer.position(),charBuffer.length());

This may provide a performance benefit over the standard way of accessing the source text of the current segment, which is to use the CharSequence interface of the segment directly, or to call Segment.toString().
Because this CharBuffer is a direct window into the internal buffer of the StreamedSource, the contents of the CharBuffer.array() must not be modified, and the array is only guaranteed to hold the segment source text until the iterator's hasNext() or next() method is next called.
- Returns:
a CharBuffer containing the source text of the current segment.
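The windowing idiom above can be tried with the JDK alone. In this sketch the shared array, the offsets and the helper names window and writeWindow are all made up for illustration; a buffer created with CharBuffer.wrap(array, offset, length) is used as a stand-in for the one returned by getCurrentSegmentCharBuffer().

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.CharBuffer;

final class SegmentWindowSketch {
    /** Builds a window over part of a shared array; position is begin and length() is the span. */
    static CharBuffer window(char[] shared, int begin, int length) {
        return CharBuffer.wrap(shared, begin, length);
    }

    /** Writes the window's characters straight from the backing array, without creating a String. */
    static void writeWindow(CharBuffer window, Writer writer) {
        try {
            // array() is the whole backing array; position() indexes into it directly here
            // because buffers made by CharBuffer.wrap have an array offset of zero.
            writer.write(window.array(), window.position(), window.length());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String demo() {
        char[] shared = "<p>Hello</p>".toCharArray();
        StringWriter out = new StringWriter();
        writeWindow(window(shared, 3, 5), out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```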
-
isXML
public boolean isXML()
Indicates whether the source document is likely to be XML.
The algorithm used to determine this is designed to be relatively inexpensive and to provide an accurate result in most normal situations. An exact determination of whether the source document is XML would require a much more complex analysis of the text.
The algorithm is as follows:
- If the document begins with an XML declaration, it is an XML document.
- If the document begins with a document type declaration that contains the text "xhtml", it is an XHTML document, and hence also an XML document.
- If none of the above conditions are met, assume the document is normal HTML, and therefore not an XML document.
This method can only be called after the iterator() method has been called.
- Returns:
true if the source document is likely to be XML, otherwise false.
- Throws:
IllegalStateException - if the iterator() method has not yet been called.
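The three steps above can be approximated on a string holding the start of a document. This is a rough sketch rather than the library's implementation: isLikelyXml is a hypothetical name, and skipping leading whitespace, matching case-insensitively, and treating the first '>' as the end of the document type declaration are all simplifying assumptions.

```java
import java.util.Locale;

final class LikelyXmlSketch {
    /** Applies the documented heuristic to the beginning of a document's text. */
    static boolean isLikelyXml(String documentStart) {
        String head = documentStart.trim(); // assumption: leading whitespace is ignored
        // Step 1: an XML declaration at the start means XML.
        if (head.startsWith("<?xml")) return true;
        // Step 2: a document type declaration mentioning "xhtml" means XHTML, hence XML.
        String lower = head.toLowerCase(Locale.ROOT);
        if (lower.startsWith("<!doctype")) {
            int end = lower.indexOf('>'); // simplification: ignores any internal subset
            String declaration = end == -1 ? lower : lower.substring(0, end + 1);
            return declaration.contains("xhtml");
        }
        // Step 3: otherwise assume ordinary HTML.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isLikelyXml("<?xml version=\"1.0\"?><root/>"));
        System.out.println(isLikelyXml("<!DOCTYPE html><html></html>"));
    }
}
```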
-
setLogger
Sets the Logger that handles log messages.
Specifying a null argument disables logging completely for operations performed on this StreamedSource object.
A logger instance is created automatically for each StreamedSource object in the same way as is described in the Source.setLogger(Logger) method.
- Parameters:
logger - the logger that will handle log messages, or null to disable logging.
-
getLogger
Returns the Logger that handles log messages.
A logger instance is created automatically for each StreamedSource object using the LoggerProvider specified by the static Config.LoggerProvider property. This can be overridden by calling the setLogger(Logger) method. The name used for all automatically created logger instances is "net.htmlparser.jericho".
- Returns:
the Logger that handles log messages, or null if logging is disabled.
-
getBufferSize
public int getBufferSize()
Returns the current size of the internal character buffer.
This information is generally useful only for investigating memory and performance issues.
- Returns:
the current size of the internal character buffer.
-
toString
Returns a string representation of the object as generated by the default Object.toString() implementation.
In contrast to the Source.toString() implementation, it is generally not possible for this method to return the entire source text.
-
finalize
protected void finalize()
Called by the garbage collector on an object when garbage collection determines that there are no more references to the object.
This implementation calls the close() method if the underlying Reader or InputStream was created internally.
-