class Oga::XML::Lexer
Low level lexer that supports both XML and HTML (using an extra option). To lex HTML input set the `:html` option to `true` when creating an instance of the lexer:
lexer = Oga::XML::Lexer.new(:html => true)
This lexer can process both String and IO instances. IO instances are processed on a line by line basis. This can greatly reduce memory usage in exchange for a slightly slower runtime.
## Thread Safety
Since this class keeps track of internal state, you cannot share the same instance between multiple threads at the same time. For example, the following will not work reliably:
    # Don't do this!
    lexer   = Oga::XML::Lexer.new('....')
    threads = []

    2.times do
      threads << Thread.new do
        lexer.advance do |*args|
          p args
        end
      end
    end

    threads.each(&:join)
However, it is perfectly safe to use a different instance per thread. There is no global state used by this lexer.
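The safe pattern, one instance per thread, can be sketched as follows. `CountingLexer` is a hypothetical stand-in that only mimics the relevant property of the real class (per-instance state such as a line counter); it is not Oga's implementation.

```ruby
# Stand-in for a stateful lexer: like the real class it keeps a line counter
# in an instance variable. This is NOT Oga's implementation.
class CountingLexer
  def initialize(data)
    @data = data
    @line = 1
  end

  # Yields the current line number and the text of each input line.
  def advance
    @data.each_line do |text|
      yield @line, text.chomp
      @line += 1
    end
  end
end

inputs = ["a\nb\n", "x\ny\nz\n"]

# One instance per thread: no lexer state is shared between the threads.
threads = inputs.map do |input|
  Thread.new do
    tokens = []
    CountingLexer.new(input).advance { |line, text| tokens << [line, text] }
    tokens
  end
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)
```

Because every thread builds its own instance, the line counters cannot interfere with each other.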
## Strict Mode
By default the lexer is rather permissive regarding the input. For example, missing closing tags are inserted by default. To disable this behaviour the lexer can be run in “strict mode” by setting `:strict` to `true`:
lexer = Oga::XML::Lexer.new('...', :strict => true)
Strict mode only applies to XML documents.
@private
Constants
- HTML_CLOSE_SELF
Elements that should be closed automatically before a new opening tag is processed.
- HTML_SCRIPT
These are all constant/frozen to remove the need for String allocations every time they are referenced in the lexer.
- HTML_SCRIPT_ELEMENTS
- HTML_STYLE
- HTML_TABLE_ALLOWED
Elements that are allowed directly in a <table> element.
- HTML_TABLE_ROW_ELEMENTS
The elements that may occur in a thead, tbody, or tfoot.
Technically “th” is not allowed per the HTML5 spec, but it's so commonly used in these elements that we allow it anyway.
- LITERAL_HTML_ELEMENTS
Names of HTML tags whose content should be lexed as-is.
Public Class Methods
@param [String|IO] data The data to lex. This can either be a String or an IO instance.
@param [Hash] options
@option options [TrueClass|FalseClass] :html When set to `true` the lexer will treat the input as HTML instead of XML. This makes it possible to lex HTML void elements such as `<link href="">`.
@option options [TrueClass|FalseClass] :strict Enables/disables strict parsing of XML documents, disabled by default.
    # File lib/oga/xml/lexer.rb, line 115
    def initialize(data, options = {})
      @data     = data
      @html     = options[:html]
      @strict   = options[:strict] || false
      @line     = 1
      @elements = []

      reset_native
    end
Public Instance Methods
Advances through the input and generates the corresponding tokens. Each token is yielded to the supplied block.
Each token is yielded as three block arguments:
type, value, line
The type is a Symbol, the value is either nil or a String, and the line is the lexer's current line number.
This method stores the supplied block in `@block` and resets it after the lexer loop has finished.
@yieldparam [Symbol] type
@yieldparam [String] value
@yieldparam [Fixnum] line
    # File lib/oga/xml/lexer.rb, line 172
    def advance(&block)
      @block = block

      read_data do |chunk|
        advance_native(chunk)
      end

      # Add any missing closing tags
      if !strict? and !@elements.empty?
        @elements.length.times { on_element_end }
      end
    ensure
      @block = nil
    end
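The two ideas in `advance` (holding the block only for the duration of the call, and closing leftover elements in non-strict mode) can be illustrated with a small stand-alone sketch. `TinyEmitter` and its list of element names are hypothetical; the real method is driven by the native lexer rather than a plain array.

```ruby
# Hypothetical miniature of the pattern used by `advance`. The "input" is a
# list of element names that are opened but never closed.
class TinyEmitter
  attr_reader :block

  def initialize(open_elements, strict: false)
    @open_elements = open_elements
    @strict        = strict
    @elements      = []
    @block         = nil
  end

  def advance(&block)
    @block = block

    @open_elements.each do |name|
      @elements << name
      @block.call(:T_ELEM_NAME, name)
    end

    # Non-strict mode: emit an end token for every element that is still open.
    unless @strict
      @elements.length.times do
        @elements.pop
        @block.call(:T_ELEM_END, nil)
      end
    end
  ensure
    # Runs even if the block raises, so no stale block is left behind.
    @block = nil
  end
end

emitter = TinyEmitter.new(%w[root child])
tokens  = []
emitter.advance { |type, value| tokens << [type, value] }

strict_tokens = []
TinyEmitter.new(%w[root], strict: true).advance { |type, value| strict_tokens << [type, value] }
```

In this sketch strict mode simply skips the closing step; how the real lexer and parser report unclosed elements in strict mode is not covered here.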
@return [TrueClass|FalseClass]
    # File lib/oga/xml/lexer.rb, line 188
    def html?
      @html == true
    end
@return [TrueClass|FalseClass]
    # File lib/oga/xml/lexer.rb, line 198
    def html_script?
      html? && current_element == HTML_SCRIPT
    end
@return [TrueClass|FalseClass]
    # File lib/oga/xml/lexer.rb, line 203
    def html_style?
      html? && current_element == HTML_STYLE
    end
Gathers all the tokens for the input and returns them as an Array.
@see advance
@return [Array]
    # File lib/oga/xml/lexer.rb, line 147
    def lex
      tokens = []

      advance do |type, value, line|
        tokens << [type, value, line]
      end

      tokens
    end
Yields the data to lex to the supplied block.
@return [String]
@yieldparam [String]
    # File lib/oga/xml/lexer.rb, line 128
    def read_data
      if @data.is_a?(String)
        yield @data

      # IO, StringIO, etc
      # THINK: read(N) would be nice, but currently this screws up the C code
      elsif @data.respond_to?(:each_line)
        @data.each_line { |line| yield line }

      # Enumerator, Array, etc
      elsif @data.respond_to?(:each)
        @data.each { |chunk| yield chunk }
      end
    end
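To see how the three branches of `read_data` chunk different inputs, the same dispatch can be reproduced outside the class. `each_chunk` below is a hypothetical free-standing copy of that logic, written only for illustration.

```ruby
require 'stringio'

# Hypothetical free-standing version of the dispatch in `read_data`.
def each_chunk(data)
  return enum_for(:each_chunk, data) unless block_given?

  if data.is_a?(String)
    # A String is handed over in one piece.
    yield data
  elsif data.respond_to?(:each_line)
    # IO-like objects are fed line by line.
    data.each_line { |line| yield line }
  elsif data.respond_to?(:each)
    # Enumerators, Arrays, etc. are fed element by element.
    data.each { |chunk| yield chunk }
  end
end

from_string = each_chunk("<a>\n</a>").to_a
from_io     = each_chunk(StringIO.new("<a>\n</a>")).to_a
from_array  = each_chunk(['<a>', '</a>']).to_a
```

The String arrives as a single chunk, while the StringIO is split at line boundaries, which is what keeps memory usage low for large IO inputs.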
@return [TrueClass|FalseClass]
    # File lib/oga/xml/lexer.rb, line 193
    def strict?
      @strict
    end
Private Instance Methods
@param [String] name
    # File lib/oga/xml/lexer.rb, line 381
    def add_element(name)
      @elements << name

      add_token(:T_ELEM_NAME, name)
    end
Calls the supplied block with the information of the current token.
@param [Symbol] type The token type.
@param [String] value The token value.
@yieldparam [Symbol] type
@yieldparam [String] value
@yieldparam [Fixnum] line
    # File lib/oga/xml/lexer.rb, line 222
    def add_token(type, value = nil)
      @block.call(type, value, @line)
    end
@param [Fixnum] amount The number of lines to advance.
    # File lib/oga/xml/lexer.rb, line 210
    def advance_line(amount = 1)
      @line += amount
    end
Handles the insertion of any missing closing tags whenever a new HTML tag is opened.
@param [String] name
    # File lib/oga/xml/lexer.rb, line 361
    def before_html_element_name(name)
      close_current = HTML_CLOSE_SELF[current_element]

      if close_current and !close_current.allow?(name)
        on_element_end
      end

      # Close remaining parent elements. This for example ensures that a
      # "<tbody>" not only closes an unclosed "<th>" but also the surrounding,
      # unclosed "<tr>".
      while close_current = HTML_CLOSE_SELF[current_element]
        if close_current.allow?(name)
          break
        else
          on_element_end
        end
      end
    end
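The cascading close described in the comment can be traced with a stand-alone sketch. Everything here is hypothetical except the `allow?` method name, which comes from the source above: `AllowList` stands in for whatever objects `HTML_CLOSE_SELF` holds, and the table contents are illustrative rather than Oga's actual data.

```ruby
require 'set'

# Hypothetical stand-in for the values stored in HTML_CLOSE_SELF.
AllowList = Struct.new(:names) do
  def allow?(name)
    names.include?(name)
  end
end

# Illustrative data only: the elements each open element may directly contain.
CLOSE_SELF = {
  'tr' => AllowList.new(Set['th', 'td']),
  'th' => AllowList.new(Set['b', 'i'])
}

# Pops every open element that has a rule and does not allow `name`.
# Returns the names that were closed, innermost first.
def close_before(stack, name)
  closed = []

  while (rule = CLOSE_SELF[stack.last])
    break if rule.allow?(name)

    closed << stack.pop
  end

  closed
end

stack  = %w[table tr th]
closed = close_before(stack, 'tbody')
```

With this data, opening `tbody` closes the unclosed `th` and then the surrounding `tr`, and stops at `table` because it has no entry in the table.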
Returns the name of the element we're currently in.
@return [String]
    # File lib/oga/xml/lexer.rb, line 229
    def current_element
      @elements.last
    end
Called on tag attributes.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 448
    def on_attribute(value)
      add_token(:T_ATTR, value)
    end
Called on attribute namespaces.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 441
    def on_attribute_ns(value)
      add_token(:T_ATTR_NS, value)
    end
Called for the body of a CDATA tag.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 294
    def on_cdata_body(value)
      add_token(:T_CDATA_BODY, value)
    end
Called on the closing CDATA tag.
    # File lib/oga/xml/lexer.rb, line 287
    def on_cdata_end
      add_token(:T_CDATA_END)
    end
Called on the opening CDATA tag.
    # File lib/oga/xml/lexer.rb, line 282
    def on_cdata_start
      add_token(:T_CDATA_START)
    end
Called on a comment.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 311
    def on_comment_body(value)
      add_token(:T_COMMENT_BODY, value)
    end
Called on the closing comment tag.
    # File lib/oga/xml/lexer.rb, line 304
    def on_comment_end
      add_token(:T_COMMENT_END)
    end
Called on the opening comment tag.
    # File lib/oga/xml/lexer.rb, line 299
    def on_comment_start
      add_token(:T_COMMENT_START)
    end
Called on the end of a doctype.
    # File lib/oga/xml/lexer.rb, line 270
    def on_doctype_end
      add_token(:T_DOCTYPE_END)
    end
Called on an inline doctype block.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 277
    def on_doctype_inline(value)
      add_token(:T_DOCTYPE_INLINE, value)
    end
Called on the identifier specifying the name of the doctype.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 265
    def on_doctype_name(value)
      add_token(:T_DOCTYPE_NAME, value)
    end
Called when a doctype starts.
    # File lib/oga/xml/lexer.rb, line 251
    def on_doctype_start
      add_token(:T_DOCTYPE_START)
    end
Called on the identifier specifying the type of the doctype.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 258
    def on_doctype_type(value)
      add_token(:T_DOCTYPE_TYPE, value)
    end
Called on the closing tag of an element.
@param [String] name The name of the element (minus namespace prefix). This is not set for self closing tags.
    # File lib/oga/xml/lexer.rb, line 411
    def on_element_end(name = nil)
      return if @elements.empty?

      if html? and name and @elements.include?(name)
        while current_element != name
          add_token(:T_ELEM_END)
          @elements.pop
        end
      end

      # Prevents a superfluous end tag of a self-closing HTML tag from
      # closing its parent element prematurely
      return if html? && name && name != current_element

      add_token(:T_ELEM_END)
      @elements.pop
    end
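The HTML-specific behaviour of `on_element_end` is easier to follow on a plain stack. The sketch below is a hypothetical free-standing rewrite (the `close_element` name and its return value are not part of Oga) that counts how many end tokens would be emitted instead of calling `add_token`.

```ruby
# Hypothetical free-standing version of the logic in `on_element_end`.
# Returns the number of end tokens that would be emitted.
def close_element(stack, name = nil, html: true)
  emitted = 0
  return emitted if stack.empty?

  # A named end tag for an element further up the stack first closes
  # everything nested inside it.
  if html && name && stack.include?(name)
    while stack.last != name
      stack.pop
      emitted += 1
    end
  end

  # A stray end tag that matches nothing on top of the stack is ignored, so
  # it cannot close the parent element by accident.
  return emitted if html && name && name != stack.last

  stack.pop
  emitted + 1
end

nested      = %w[div p span]
nested_ends = close_element(nested, 'div')

stray      = %w[div]
stray_ends = close_element(stray, 'br')
```

Closing `div` while `p` and `span` are still open emits three end tokens, whereas a superfluous `</br>` inside `div` emits none and leaves `div` open.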
Called on the name of an element.
@param [String] name The name of the element, including namespace.
    # File lib/oga/xml/lexer.rb, line 352
    def on_element_name(name)
      before_html_element_name(name) if html?

      add_element(name)
    end
Called on the element namespace.
@param [String] namespace
    # File lib/oga/xml/lexer.rb, line 390
    def on_element_ns(namespace)
      add_token(:T_ELEM_NS, namespace)
    end
Called on the closing `>` of the opening tag of an element.
    # File lib/oga/xml/lexer.rb, line 395
    def on_element_open_end
      return unless html?

      # Only downcase the name if we can't find an all lower/upper version of
      # the element name. This can save us a *lot* of String allocations.
      if HTML_VOID_ELEMENTS.allow?(current_element) \
      or HTML_VOID_ELEMENTS.allow?(current_element.downcase)
        add_token(:T_ELEM_END)
        @elements.pop
      end
    end
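The allocation-saving lookup mentioned in the comment can be sketched in isolation. This assumes, as the comment suggests, that the set of void elements contains both all-lowercase and all-uppercase spellings; the `VOID` set and the `void?` helper are hypothetical and list only a few names for illustration.

```ruby
require 'set'

# Assumption: both the lowercase and the uppercase spelling are stored.
# The names are a small illustrative subset, not the full list.
lower = %w[br img link]
VOID  = Set.new(lower + lower.map(&:upcase))

def void?(name)
  # `||` short-circuits: `name.downcase` (which allocates a new String) only
  # runs when the direct lookup fails, for example for mixed case names.
  VOID.include?(name) || VOID.include?(name.downcase)
end

checks = {
  'br'  => void?('br'),
  'BR'  => void?('BR'),
  'Br'  => void?('Br'),
  'div' => void?('div')
}
```

Only the mixed case `Br` and the non-void `div` reach the `downcase` call; the common all-lowercase and all-uppercase names are resolved by the first lookup.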
Called on the body of a processing instruction.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 340
    def on_proc_ins_body(value)
      add_token(:T_PROC_INS_BODY, value)
    end
Called on the end of a processing instruction.
    # File lib/oga/xml/lexer.rb, line 345
    def on_proc_ins_end
      add_token(:T_PROC_INS_END)
    end
Called on a processing instruction name.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 333
    def on_proc_ins_name(value)
      add_token(:T_PROC_INS_NAME, value)
    end
Called on the start of a processing instruction.
    # File lib/oga/xml/lexer.rb, line 326
    def on_proc_ins_start
      add_token(:T_PROC_INS_START)
    end
Called when processing the body of a string.
@param [String] value The data between the quotes.
    # File lib/oga/xml/lexer.rb, line 246
    def on_string_body(value)
      add_token(:T_STRING_BODY, value)
    end
Called when processing a double quote.
    # File lib/oga/xml/lexer.rb, line 239
    def on_string_dquote
      add_token(:T_STRING_DQUOTE)
    end
Called when processing a single quote.
    # File lib/oga/xml/lexer.rb, line 234
    def on_string_squote
      add_token(:T_STRING_SQUOTE)
    end
Called on regular text values.
@param [String] value
    # File lib/oga/xml/lexer.rb, line 432
    def on_text(value)
      return if value.empty?

      add_token(:T_TEXT, value)
    end
Called on the end of an XML declaration tag.
    # File lib/oga/xml/lexer.rb, line 321
    def on_xml_decl_end
      add_token(:T_XML_DECL_END)
    end
Called on the start of an XML declaration tag.
    # File lib/oga/xml/lexer.rb, line 316
    def on_xml_decl_start
      add_token(:T_XML_DECL_START)
    end