Class TagType
- Direct Known Subclasses:
EndTagType, StartTagType
This class is the root abstract class common to all tag types, and contains methods to register and deregister tag types as well as various methods to aid in their implementation.
Every tag type is represented by a singleton instance of a class that must be a subclass of either StartTagType or EndTagType. These two abstract classes, the only direct descendants of this class, represent the two major classifications under which every tag type exists.
Because all TagType instances must be singletons, the '==' operator can be used to test for a particular tag type instead of the equals(Object) method.
The term predefined tag type refers to any of the tag types defined in this library, including both standard and extended tag types.
The term standard tag type refers to any of the tag types represented by instances in static fields of the StartTagType and EndTagType subclasses.
Standard tag types are registered by default, and define the tags most commonly found in HTML documents.
The term extended tag type refers to any predefined tag type
that is not a standard tag type.
The PHPTagTypes and MasonTagTypes classes contain extended tag types related to their respective server platforms.
The tag types defined within them must be registered by the user before they are recognised by the parser.
The term custom tag type refers to any user-defined tag type, or any tag type that is not a predefined tag type.
The tag recognition process of the parser gives each tag type a precedence level, which is primarily determined by the length of its start delimiter. A tag type with a more specific start delimiter is chosen in preference to one with a less specific start delimiter, assuming they both share the same prefix. If two tag types have exactly the same start delimiter, the one which was registered later has the higher precedence.
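The precedence rule described above can be illustrated with a small, self-contained model. This is only a sketch of the stated rule (longer start delimiter wins, later registration wins a tie), not the library's actual registry; the class PrecedenceSketch and its methods are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the precedence rule; hypothetical, not the library's registry. */
final class PrecedenceSketch {
    /** Hypothetical stand-in for a tag type: a name and a start delimiter. */
    static final class Entry {
        final String name;
        final String startDelimiter;
        Entry(String name, String startDelimiter) {
            this.name = name;
            this.startDelimiter = startDelimiter;
        }
    }

    // Entries ordered from lowest to highest precedence.
    private final List<Entry> registered = new ArrayList<>();

    /** Inserts the entry above every entry whose delimiter is of equal or shorter length. */
    void register(String name, String startDelimiter) {
        int i = 0;
        while (i < registered.size()
                && registered.get(i).startDelimiter.length() <= startDelimiter.length()) {
            i++;
        }
        registered.add(i, new Entry(name, startDelimiter));
    }

    /** Returns the name of the highest-precedence entry whose delimiter matches at pos, or null. */
    String bestMatch(String text, int pos) {
        for (int i = registered.size() - 1; i >= 0; i--) {
            String d = registered.get(i).startDelimiter;
            // Delimiters are lower case, so compare ignoring case.
            if (text.regionMatches(true, pos, d, 0, d.length())) {
                return registered.get(i).name;
            }
        }
        return null;
    }
}
```

In this model, registering a second entry with the delimiter "<!--" after the first places it higher in the list, so it is tried first.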
The two special tag types StartTagType.UNREGISTERED and EndTagType.UNREGISTERED represent tags that do not match the syntax of any other tag type. They have the lowest precedence of all the tag types. The Tag.isUnregistered() method provides a detailed explanation of unregistered tags.
See the documentation of the tag parsing process for more information on how each tag is identified by the parser.
Note that the standard HTML element names do not represent different tag types. All standard HTML tags have a tag type of StartTagType.NORMAL or EndTagType.NORMAL, and are also referred to as normal tags.
Apart from the registration related methods, all of the methods in this class and its subclasses relate to the implementation of custom tag types and are not relevant to the majority of users who just use the predefined tag types.
For performance reasons, this library only allows tag types that start with a '<' character.
The character following this defines the immediate subclass of the tag type.
An EndTagType always has a slash ('/') as the second character, while a StartTagType has any character other than a slash as the second character.
This definition means that tag types which are not intuitively classified as either start tag types or end tag types
(such as an HTML comment) are mostly classified as start tag types.
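As a rough illustration of this classification rule, the following sketch decides the category of a tag type purely from its start delimiter. The class TagTypeKindSketch and the enum Kind are hypothetical names used only for this example.

```java
/** Hypothetical illustration of the start/end classification rule. */
final class TagTypeKindSketch {
    enum Kind { START_TAG_TYPE, END_TAG_TYPE }

    /** Classifies a start delimiter by its second character. */
    static Kind classify(String startDelimiter) {
        if (startDelimiter.isEmpty() || startDelimiter.charAt(0) != '<') {
            throw new IllegalArgumentException("start delimiter must begin with '<'");
        }
        // A slash in second position marks an end tag type; anything else
        // (including no second character at all) marks a start tag type.
        boolean slashSecond = startDelimiter.length() > 1 && startDelimiter.charAt(1) == '/';
        return slashSecond ? Kind.END_TAG_TYPE : Kind.START_TAG_TYPE;
    }
}
```

Under this rule the comment delimiter "<!--" is classified as a start tag type, which matches the note above about tag types that are not intuitively start or end tags.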
Every method in this and the StartTagType and EndTagType abstract classes can be categorised as one of the following:
- Properties:
- Simple properties (marked final) that were either specified as parameters during construction or are derived from those parameters.
- Abstract implementation methods:
- Methods that must be implemented in a subclass.
- Default implementation methods:
- Methods (not marked final) that implement common behaviour, but may be overridden in a subclass.
- Implementation assistance methods:
- Protected methods that provide low-level functionality and are only of use within other implementation methods.
- Registration related methods:
- Utility methods (marked final) relating to the registration of tag type instances.
-
Method Summary
Modifier and Type, Method, Description:
protected abstract Tag constructTagAt(Source source, int pos)
    Constructs a tag of this type at the specified position in the specified source document if it matches all of the required features.
final void deregister()
    Deregisters this tag type.
final String getClosingDelimiter()
    Returns the character sequence that marks the end of the tag.
final String getDescription()
    Returns a description of this tag type useful for debugging purposes.
protected final String getNamePrefix()
    Returns the name prefix required by this tag type.
static final List<TagType> getRegisteredTagTypes()
    Returns a list of all the currently registered tag types in order of lowest to highest precedence.
final String getStartDelimiter()
    Returns the character sequence that marks the start of the tag.
static final TagType[] getTagTypesIgnoringEnclosedMarkup()
    Returns an array of all the tag types inside which the parser ignores all non-server tags in parse on demand mode.
final boolean isServerTag()
    Indicates whether this tag type represents a server tag.
protected boolean isValidPosition(Source source, int pos, int[] fullSequentialParseData)
    Indicates whether a tag of this type is valid in the specified position of the specified source document.
final void register()
    Registers this tag type for recognition by the parser.
static final void setTagTypesIgnoringEnclosedMarkup(TagType[] tagTypes)
    Sets the tag types inside which the parser ignores all non-server tags.
protected final boolean tagEncloses(Source source, int pos)
    Indicates whether a tag of this type encloses the specified position of the specified source document.
String toString()
    Returns a string representation of this object useful for debugging purposes.
-
Method Details
-
register
public final void register()
Registers this tag type for recognition by the parser.
(registration related method)
The order of registration affects the precedence of the tag type when a potential tag is being parsed.
-
deregister
public final void deregister()
Deregisters this tag type.
(registration related method)
-
getRegisteredTagTypes
Returns a list of all the currently registered tag types in order of lowest to highest precedence.
(registration related method)
- Returns:
- a list of all the currently registered tag types in order of lowest to highest precedence.
-
getDescription
Returns a description of this tag type useful for debugging purposes.
(property method)
- Returns:
- a description of this tag type useful for debugging purposes.
-
getStartDelimiter
Returns the character sequence that marks the start of the tag.
(property method)
The character sequence must be all in lower case.
The first character in this property must be '<'. This is a deliberate limitation of the system which is necessary to retain reasonable performance.
The second character in this property must be '/' if the implementing class is an EndTagType. It must not be '/' if the implementing class is a StartTagType.
- Standard Tag Type Values:
-
Tag Type                                   Start Delimiter
StartTagType.UNREGISTERED                  <
StartTagType.NORMAL                        <
StartTagType.COMMENT                       <!--
StartTagType.XML_DECLARATION               <?xml
StartTagType.XML_PROCESSING_INSTRUCTION    <?
StartTagType.DOCTYPE_DECLARATION           <!doctype
StartTagType.MARKUP_DECLARATION            <!
StartTagType.CDATA_SECTION                 <![cdata[
StartTagType.SERVER_COMMON                 <%
EndTagType.UNREGISTERED                    </
EndTagType.NORMAL                          </
- Extended Tag Type Values:
- Returns:
- the character sequence that marks the start of the tag.
-
getClosingDelimiter
Returns the character sequence that marks the end of the tag.
(property method)
The character sequence must be all in lower case.
In a StartTag of a type that has attributes, characters appearing inside a quoted attribute value are ignored when determining the location of the closing delimiter.
Note that the optional '/' character preceding the closing '>' in an empty-element tag is not considered part of the end delimiter. This property must define the closing delimiter common to all instances of the tag type.
- Standard Tag Type Values:
- Extended Tag Type Values:
- Returns:
- the character sequence that marks the end of the tag.
-
isServerTag
public final boolean isServerTag()
Indicates whether this tag type represents a server tag.
(property method)
Server tags are typically parsed by some process on the web server and substituted with other text or markup before delivery to the user agent. This parser therefore handles them differently to non-server tags in that they can occur at any position in the document without regard for the HTML document structure. As a result they can occur anywhere inside any other tag, although a non-server tag cannot theoretically occur inside a server tag.
The documentation of the tag parsing process explains in detail how the value of this property affects the recognition of server tags, as well as how the presence of server tags affects the recognition of non-server tags in and around them.
Most XML-style server tags cannot be represented as a distinct tag type because they are generally indistinguishable from non-server XML tags. See the Segment.ignoreWhenParsing() method for information about how to prevent such server tags from interfering with the proper parsing of the rest of the document.
- Standard Tag Type Values:
-
Tag Type                                   Is Server Tag
StartTagType.UNREGISTERED                  false
StartTagType.NORMAL                        false
StartTagType.COMMENT                       false
StartTagType.XML_DECLARATION               false
StartTagType.XML_PROCESSING_INSTRUCTION    false
StartTagType.DOCTYPE_DECLARATION           false
StartTagType.MARKUP_DECLARATION            false
StartTagType.CDATA_SECTION                 false
StartTagType.SERVER_COMMON                 true
EndTagType.UNREGISTERED                    false
EndTagType.NORMAL                          false
- Extended Tag Type Values:
- Returns:
- true if this tag type represents a server tag, otherwise false.
-
getNamePrefix
Returns the name prefix required by this tag type.
(property method)
This string is identical to the start delimiter, except that it does not include the initial "<" or "</" characters that always prefix the start delimiter of a StartTagType or EndTagType respectively.
The name of a tag of this type may or may not include extra characters after the prefix. This is determined by properties such as StartTagType.isNameAfterPrefixRequired() or EndTagTypeGenericImplementation.isStatic().
- Standard Tag Type Values:
-
Tag Type                                   Name Prefix
StartTagType.UNREGISTERED                  (empty string)
StartTagType.NORMAL                        (empty string)
StartTagType.COMMENT                       !--
StartTagType.XML_DECLARATION               ?xml
StartTagType.XML_PROCESSING_INSTRUCTION    ?
StartTagType.DOCTYPE_DECLARATION           !doctype
StartTagType.MARKUP_DECLARATION            !
StartTagType.CDATA_SECTION                 ![cdata[
StartTagType.SERVER_COMMON                 %
EndTagType.UNREGISTERED                    (empty string)
EndTagType.NORMAL                          (empty string)
- Extended Tag Type Values:
- Returns:
- the name prefix required by this tag type.
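The relationship between the start delimiter and the name prefix can be sketched as a simple string operation. This is an illustration of the rule stated above, assuming only that the delimiter begins with "<" or "</"; the class NamePrefixSketch is a hypothetical name and does not reflect how the library stores these properties.

```java
/** Hypothetical illustration of deriving a name prefix from a start delimiter. */
final class NamePrefixSketch {
    /** Strips the leading "</" or "<" from the start delimiter. */
    static String namePrefix(String startDelimiter) {
        // Check "</" first so that the slash of an end tag type is removed as well.
        if (startDelimiter.startsWith("</")) {
            return startDelimiter.substring(2);
        }
        if (startDelimiter.startsWith("<")) {
            return startDelimiter.substring(1);
        }
        throw new IllegalArgumentException("start delimiter must begin with '<'");
    }
}
```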
-
isValidPosition
Indicates whether a tag of this type is valid in the specified position of the specified source document.
(implementation assistance method)
This method is called immediately before constructTagAt(Source, int pos) to do a preliminary check on the validity of a tag of this type in the specified position.
This check is not performed as part of the constructTagAt(Source, int pos) call because the same validation is used for all the standard tag types, and is likely to be sufficient for all custom tag types. Having this check separated into a different method helps to isolate common code from the code that is unique to each tag type.
A server tag is valid in any position except inside a server-side comment, but a non-server tag is not valid inside any other tag, nor inside elements with implicit CDATA content such as SCRIPT and STYLE elements.
The common implementation of this method behaves differently depending upon whether or not a full sequential parse is being performed.
For server tags it simply checks that the position is not enclosed by a server-side comment if a full sequential parse is not being performed. If a full sequential parse is being performed, it always returns true for server tags as the parser automatically skips over all positions enclosed by server-side comments, so this method is only called in positions where a server tag is always valid.
When this method is called for non-server tags during a full sequential parse, the fullSequentialParseData argument contains information allowing the exact theoretical check to be performed, rejecting a non-server tag if it is inside any other tag. See below for further information about the fullSequentialParseData parameter.
When this method is called in parse on demand mode (not during a full sequential parse, fullSequentialParseData==null), practical constraints prevent the exact theoretical check from being carried out, and non-server tags are only rejected if they are found inside HTML comments or CDATA sections.
This behaviour is configurable by manipulating the static TagTypesIgnoringEnclosedMarkup array to determine which tag types cannot contain non-server tags in parse on demand mode. The documentation of this property contains a more detailed analysis of the subject and explains why only the comment and CDATA section tag types are included by default.
See the documentation of the tag parsing process for more information about how this method fits into the whole tag parsing process.
This method can be overridden in custom tag types if the default implementation is unsuitable.
The fullSequentialParseData parameter:
This parameter is used to discard non-server tags that are found inside other tags or inside SCRIPT elements.
In the current version of this library, the fullSequentialParseData argument is either null (in parse on demand mode) or an integer array containing only a single entry (if a full sequential parse is being performed).
The integer contained in the array is the maximum position in the document at which the end of a tag has been found, indicating that no non-server tags should be recognised before that position. If no tags have yet been encountered, the value of this integer is zero.
If the last tag encountered was the start tag of a SCRIPT element, the value of this integer is Integer.MAX_VALUE, indicating that no other non-server elements should be recognised until the end tag of the SCRIPT element is found.
The HTML 4 DTD defines script element content as a special type of CDATA. The XHTML DTD changed it to PCDATA, meaning that HTML elements should be parsed inside script elements if they are not escaped by comments or an explicit CDATA section. The HTML 5 parsing rules reversed this again, making it closer to the original HTML 4 rules. Because this parser is designed to facilitate parsing HTML rather than XHTML, it treats script element content as implicit CDATA, consistent with HTML 4 and HTML 5.
According to the HTML 4.01 specification section 6.2, the first occurrence of the character sequence "</" terminates the special handling of CDATA within SCRIPT and STYLE elements. This library however only terminates the CDATA handling of SCRIPT element content when the character sequence "</script" is detected, in line with the behaviour of the major browsers and with HTML 5 script element parsing rules.
Note that the implicit treatment of SCRIPT element content as CDATA also prevents the recognition of comments and explicit CDATA sections inside script elements. All major browsers used to recognise comments inside script elements regardless, which is relevant if the script element contains a JavaScript string literal "<script", which would terminate the script element unless it was enclosed in a comment. Versions 3.0 to 3.2 of this parser therefore also recognised comments inside script elements in a full sequential parse to maintain compatibility with the major browsers, but the latest versions of Gecko and WebKit browsers now correctly ignore comments inside script elements, so as of version 3.3 this parser has also reverted to the correct behaviour.
Although STYLE elements should theoretically be treated in the same way as SCRIPT elements, the syntax of Cascading Style Sheets (CSS) does not contain any constructs that could be misinterpreted as HTML tags, so there is virtually no need to perform any special checks in this case.
IMPLEMENTATION NOTE: The rationale behind using an integer array to hold this value, rather than a scalar int value, is to emulate passing the parameter by reference. This value needs to be shared amongst several internal methods during the full sequential parse process, and any one of those methods needs to be able to modify the value and pass it back to the calling method. This would normally be implemented by passing the parameter by reference, but because Java does not support this language construct, a container for a mutable integer must be passed instead. Because the standard Java library does not provide a class for holding a single mutable integer (the java.lang.Integer class is immutable), the easiest container to use, without creating a class especially for this purpose, is an integer array. The use of an array does not imply any intention to use more than a single array entry in subsequent versions.
- Parameters:
source - the Source document.
pos - the character position in the source document to check.
fullSequentialParseData - an integer array containing data allowing this method to implement a better algorithm when a full sequential parse is being performed, or null in parse on demand mode.
- Returns:
- true if a tag of this type is valid in the specified position of the specified source document, otherwise false.
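The single-entry integer array described in the implementation note for the fullSequentialParseData parameter can be modelled as follows. This is a minimal sketch of the documented semantics (maximum tag end position, with Integer.MAX_VALUE while inside a SCRIPT element), not the library's internal code; the class ParseDataSketch and its method names are hypothetical, and the parse on demand case (a null array) is deliberately not modelled.

```java
/** Hypothetical model of a one-element int[] used as a shared mutable integer. */
final class ParseDataSketch {
    /** Records that a tag ended at endPos, keeping the maximum end position seen so far. */
    static void recordTagEnd(int[] parseData, int endPos) {
        // The callee mutates the array entry, so the caller sees the new value.
        parseData[0] = Math.max(parseData[0], endPos);
    }

    /** Marks that a SCRIPT start tag was the last tag encountered. */
    static void enterScriptElement(int[] parseData) {
        parseData[0] = Integer.MAX_VALUE;
    }

    /** Marks that the SCRIPT end tag was found, ending at endPos. */
    static void leaveScriptElement(int[] parseData, int endPos) {
        parseData[0] = endPos;
    }

    /** A non-server tag is rejected if it starts before the recorded position. */
    static boolean nonServerTagAllowedAt(int pos, int[] parseData) {
        return pos >= parseData[0];
    }
}
```

A caller would create the array once, for example with new int[] {0}, and hand the same instance to every method taking part in the parse, which is what gives the by-reference effect.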
-
getTagTypesIgnoringEnclosedMarkup
Returns an array of all the tag types inside which the parser ignores all non-server tags in parse on demand mode.
(implementation assistance method)
The tag types returned by this property (referred to in the following paragraphs as the "listed types") default to StartTagType.COMMENT and StartTagType.CDATA_SECTION.
This property is used by the default implementation of the isValidPosition method in parse on demand mode. It is not used at all during a full sequential parse.
In the default implementation of the isValidPosition method, in parse on demand mode, every new non-server tag found by the parser (referred to as a "new tag") undergoes a check to see whether it is enclosed by a tag of one of the listed types. This includes new tags of the listed types themselves if they are non-server tags. The recursive nature of this check means that all tags of the listed types occurring before the new tag must be found by the parser before it can determine whether the new tag should be ignored. To mitigate any performance issues arising from this process, the listed types are given special treatment in the tag cache. This dramatically decreases the time taken to search on these tag types, so adding a tag type to this array that is easily recognised and occurs infrequently only results in a small degradation in overall performance.
A special exception to the algorithm described above applies to COMMENT tags. The default implementation of the isValidPosition method does not check whether a COMMENT tag is inside another COMMENT tag, as this should never happen in a syntactically correct document (the characters '--' should not occur inside a comment). Skipping this check also avoids the need to recursively check every COMMENT tag back to the start of the document, which has the potential to cause a stack overflow in a large document containing lots of comments.
Theoretically, non-server tags appearing inside any other tag should be ignored, which is how the parser behaves during a full sequential parse.
Server tags in particular very often contain other "tags" that should not be recognised as tags by the parser. If this behaviour is required in parse on demand, the tag type of each server tag that might be found in the source documents can be added to this property using the static setTagTypesIgnoringEnclosedMarkup(TagType[]) method. For example, the following command would prevent non-server tags from being recognised inside standard PHP tags, as well as the default comment and CDATA section tags:
    TagType.setTagTypesIgnoringEnclosedMarkup(new TagType[] {PHPTagTypes.PHP_STANDARD, StartTagType.COMMENT, StartTagType.CDATA_SECTION});
The only situation where a non-server tag can legitimately contain a sequence of characters that resembles a tag is within an attribute value. The HTML 4.01 specification section 5.3.2 specifically allows the presence of '<' and '>' characters within attribute values. A common occurrence of this is in event attributes containing scripts, such as the onclick attribute. There is no way of preventing such "tags" from being recognised in parse on demand mode, as adding StartTagType.NORMAL to this property as a listed type would be far too inefficient. Performing a full sequential parse of the source document prevents these attribute values from being recognised as tags, but can be very expensive if only a few tags in the document need to be parsed. The penalty of not parsing every tag in the document is that the exactness of this check is compromised, but in practical terms the difference is inconsequential. The default listed types of comments and CDATA sections yield sensible results in the vast majority of practical applications with only a minor impact on performance.
In XHTML, '<' and '>' characters must be represented in attribute values as character references (see the XML 1.0 specification section 3.1), so the situation should never arise that a tag is found inside another tag unless one of them is a server tag.
- Returns:
- an array of all the tag types inside which the parser ignores all non-server tags.
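The idea of rejecting a new tag because it is enclosed by a tag of a listed type can be sketched with plain string searches. This is a deliberately simplified model: it works on delimiter pairs instead of tag types, matches case-sensitively for brevity, and ignores the tag cache and server tags entirely. The class EnclosedMarkupSketch and all of its members are hypothetical and are not part of the library.

```java
/** Hypothetical, string-based model of the "enclosed by a listed type" check. */
final class EnclosedMarkupSketch {
    // Assumed delimiter pairs standing in for the default listed types
    // (comment and CDATA section).
    static final String[][] DEFAULT_LISTED = {
        {"<!--", "-->"},
        {"<![CDATA[", "]]>"}
    };

    /** Returns true if pos lies inside a region opened by open and closed by close. */
    static boolean isEnclosed(String text, int pos, String open, String close) {
        // Last opening delimiter that starts before pos.
        int start = text.lastIndexOf(open, pos - 1);
        if (start < 0) return false;
        // First closing delimiter after that opening delimiter.
        int closeAt = text.indexOf(close, start + open.length());
        if (closeAt < 0) return false; // unterminated region is not treated as a tag here
        return closeAt + close.length() > pos;
    }

    /** Returns true if a non-server tag starting at pos would be ignored in this model. */
    static boolean isIgnored(String text, int pos, String[][] listed) {
        for (String[] pair : listed) {
            if (isEnclosed(text, pos, pair[0], pair[1])) return true;
        }
        return false;
    }
}
```

Adding a further pair such as {"<?php", "?>"} to the listed array would model the effect of including a server tag type, at the cost of one more search per candidate position.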
-
setTagTypesIgnoringEnclosedMarkup
Sets the tag types inside which the parser ignores all non-server tags.
(implementation assistance method)
See getTagTypesIgnoringEnclosedMarkup() for the documentation of this property.
- Parameters:
tagTypes - an array of tag types.
-
constructTagAt
Constructs a tag of this type at the specified position in the specified source document if it matches all of the required features.
(abstract implementation method)
The implementation of this method must check that the text at the specified position meets all of the criteria of this tag type, including such checks as the presence of the correct or well formed closing delimiter, name, attributes, end tag, or any other distinguishing features.
It can be assumed that the specified position starts with the start delimiter of this tag type, and that all other tag types with higher precedence (if any) have already been rejected as candidates. Tag types with lower precedence will be considered if this method returns null.
This method is only called after a successful check of the tag's position, i.e. isValidPosition(source,pos,fullSequentialParseData)==true.
The StartTagTypeGenericImplementation and EndTagTypeGenericImplementation subclasses provide default implementations of this method that allow the use of much simpler properties and implementation assistance methods to carry out the required functions.
- Parameters:
source - the Source document.
pos - the position in the source document.
- Returns:
- a tag of this type at the specified position in the specified source document if it meets all of the required features, or null if it does not meet the criteria.
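The contract described above, where a null result hands the position on to the next candidate of lower precedence, can be sketched as a simple loop. Everything in this example is hypothetical: the Candidate class stands in for a tag type, tags are represented as plain strings, and the delimiter-based matcher is far cruder than a real implementation would be.

```java
import java.util.List;
import java.util.function.BiFunction;

/** Hypothetical model of trying candidates in precedence order until one constructs a tag. */
final class ConstructFallthroughSketch {
    /** Stand-in for a tag type: a name plus a function returning the tag text or null. */
    static final class Candidate {
        final String name;
        final BiFunction<String, Integer, String> constructTagAt;
        Candidate(String name, BiFunction<String, Integer, String> constructTagAt) {
            this.name = name;
            this.constructTagAt = constructTagAt;
        }
    }

    /** Builds a candidate that matches text from a start delimiter up to an end delimiter. */
    static Candidate delimited(String name, String start, String end) {
        return new Candidate(name, (text, pos) -> {
            if (!text.startsWith(start, pos)) return null;
            int endAt = text.indexOf(end, pos + start.length());
            // A missing closing delimiter means the criteria are not met.
            return endAt < 0 ? null : text.substring(pos, endAt + end.length());
        });
    }

    /** Tries each candidate from highest to lowest precedence; the first non-null result wins. */
    static String firstMatch(List<Candidate> highestFirst, String text, int pos) {
        for (Candidate c : highestFirst) {
            String tag = c.constructTagAt.apply(text, pos);
            if (tag != null) return c.name + ":" + tag;
        }
        return null;
    }
}
```

In this model, text that begins with "<!--" but never contains "-->" is rejected by the comment candidate and then offered to the candidate with the shorter "<!" delimiter.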
-
tagEncloses
Indicates whether a tag of this type encloses the specified position of the specified source document.
(implementation assistance method)
This is logically equivalent to source.getEnclosingTag(pos,this)!=null, but is safe to use within other implementation methods without the risk of causing an infinite recursion.
This method is called from the default implementation of the isValidPosition(Source, int pos, int[] fullSequentialParseData) method.
- Parameters:
source - the Source document.
pos - the character position in the source document to check.
- Returns:
- true if a tag of this type encloses the specified position of the specified source document, otherwise false.
-
toString
Returns a string representation of this object useful for debugging purposes.
-