class ANTLR3::Recognizer
Recognizer
As the base class of all ANTLR-generated recognizers, Recognizer provides much of the shared functionality and structure used in the recognition process. For all practical purposes, the class and its immediate subclasses Lexer, Parser, and TreeParser are abstract classes. They can be instantiated, but they are of little use on their own. Instead, to make useful code, you write an ANTLR grammar and ANTLR generates classes that inherit from one of the recognizer base classes and supply the implementation of the grammar rules themselves. The generated classes rely on this group of base classes to implement necessary tasks. Recognizer defines methods related to:
- token and character matching
- prediction and recognition strategy
- recovering from errors
- reporting errors
- memoization
- simple rule tracing and debugging
Attributes
Public Class Methods
# File lib/antlr3/recognizers.rb, line 314
def Scope( *declarations, &body )
  Scope.new( *declarations, &body )
end

# File lib/antlr3/recognizers.rb, line 306
def debug?
  return false
end
This method is used to generate return-value structures for rules with multiple return values. To avoid generating a special class for every rule in AST parsers and such (where most rules have the same default set of return values), each recognizer gets a default return value structure assigned to the constant Return. Rules which don’t require additional custom members will have a rule-return name constant that just points to the generic return value.
# File lib/antlr3/recognizers.rb, line 241
def define_return_scope( *members )
  if members.empty? then generic_return_scope
  else
    members += return_scope_members
    Struct.new( *members )
  end
end
sets up and returns the generic rule return scope for a recognizer
# File lib/antlr3/recognizers.rb, line 260
def generic_return_scope
  @generic_return_scope ||= begin
    struct = Struct.new( *return_scope_members )
    const_set( :Return, struct )
  end
end
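The interplay between define_return_scope, generic_return_scope and return_scope_members can be tried out without the library. The following is a minimal stand-alone sketch: SketchRecognizer is a hypothetical stand-in class, not part of ANTLR3, and it only mirrors the three class methods shown above.

```ruby
# Hypothetical stand-in that mirrors the three class methods documented above.
class SketchRecognizer
  class << self
    # default members shared by every rule return structure
    def return_scope_members
      [ :start, :stop ]
    end

    # builds the generic Struct once and publishes it as the constant Return
    def generic_return_scope
      @generic_return_scope ||=
        const_set( :Return, Struct.new( *return_scope_members ) )
    end

    # no custom members: reuse the generic Struct; otherwise build a new one
    def define_return_scope( *members )
      return generic_return_scope if members.empty?
      Struct.new( *( members + return_scope_members ) )
    end
  end
end

generic = SketchRecognizer.define_return_scope
custom  = SketchRecognizer.define_return_scope( :value )

p generic.members   # members of the shared Return struct
p custom.members    # custom members come first, defaults are appended
```

A subclass that overrides return_scope_members (for example to append :tree) would get that extra member in both the generic and the custom structures.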
# File lib/antlr3/recognizers.rb, line 267
def imported_grammars
  @imported_grammars ||= Set.new
end

# File lib/antlr3/recognizers.rb, line 275
def master
  master_grammars.last
end

# File lib/antlr3/recognizers.rb, line 271
def master_grammars
  @master_grammars ||= []
end
Create a new recognizer. The constructor simply ensures that all recognizers are initialized with a shared state object. See the main recognizer subclasses for more specific information about creating recognizer objects like lexers and parsers.
# File lib/antlr3/recognizers.rb, line 360
def initialize( options = {} )
  @state = options[ :state ] || RecognizerSharedState.new
  @error_output = options.fetch( :error_output, $stderr )
  defined?( @input ) or @input = nil
  initialize_dfas
end

# File lib/antlr3/recognizers.rb, line 310
def profile?
  return false
end
Used as a hook to add additional default members to default return value structures. For example, all AST-building parsers override this method to add an extra :tree field to all rule return structures.
# File lib/antlr3/recognizers.rb, line 254
def return_scope_members
  [ :start, :stop ]
end

# File lib/antlr3/recognizers.rb, line 298
def rules
  self::RULE_METHODS.dup rescue []
end

# File lib/antlr3/recognizers.rb, line 318
def token_class
  @token_class ||= begin
    self::Token rescue
    superclass.token_class rescue
    ANTLR3::CommonToken
  end
end
Private Class Methods
generated recognizer code uses this method to stamp the code with the name of the grammar file and the current version of ANTLR being used to generate the code
# File lib/antlr3/recognizers.rb, line 219
def generated_using( grammar_file, antlr_version, library_version = nil )
  @grammar_file_name = grammar_file.freeze
  @antlr_version_string = antlr_version.freeze
  @library_version = Util.parse_version( library_version )
  if @antlr_version_string =~ /^(\d+)\.(\d+)(?:\.(\d+)(?:b(\d+))?)?(.*)$/
    @antlr_version = [ $1, $2, $3, $4 ].map! { |str| str.to_i }
    timestamp = $5.strip
    #@antlr_release_time = $5.empty? ? nil : Time.parse($5)
  else
    raise "bad version string: %p" % antlr_version
  end
end
# File lib/antlr3/recognizers.rb, line 289
def imports( *grammar_names )
  for grammar in grammar_names
    imported_grammars.add?( grammar.to_sym ) and
      attr_reader( Util.snake_case( grammar ) )
  end
  return imported_grammars
end

# File lib/antlr3/recognizers.rb, line 279
def masters( *grammar_names )
  for grammar in grammar_names
    unless master_grammars.include?( grammar )
      master_grammars << grammar
      attr_reader( Util.snake_case( grammar ) )
    end
  end
end
Public Instance Methods
# File lib/antlr3/recognizers.rb, line 866
def already_parsed_rule?( rule )
  stop_index = rule_memoization( rule, @input.index )
  case stop_index
  when MEMO_RULE_UNKNOWN then return false
  when MEMO_RULE_FAILED
    raise BacktrackingFailed
  else
    @input.seek( stop_index + 1 )
  end
  return true
end

# File lib/antlr3/recognizers.rb, line 336
def antlr_version
  self.class.antlr_version
end

# File lib/antlr3/recognizers.rb, line 340
def antlr_version_string
  self.class.antlr_version_string
end

# File lib/antlr3/recognizers.rb, line 839
def backtrack
  @state.backtracking += 1
  start = @input.mark
  success =
    begin yield
    rescue BacktrackingFailed then false
    else true
    end
  return success
ensure
  @input.rewind( start )
  @state.backtracking -= 1
end
Returns true if the recognizer is currently in a decision for which backtracking has been enabled
# File lib/antlr3/recognizers.rb, line 827
def backtracking?
  @state.backtracking > 0
end

# File lib/antlr3/recognizers.rb, line 831
def backtracking_level
  @state.backtracking
end

# File lib/antlr3/recognizers.rb, line 835
def backtracking_level=( n )
  @state.backtracking = n
end
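The backtracking methods work together: backtrack raises the backtracking level, runs a speculative block, and always rewinds the input afterwards. The sketch below imitates that behaviour with hypothetical stand-ins (SketchStream, SketchRecognizer and a one-argument match); it is not the library code and models only the parts needed to show the idea.

```ruby
class BacktrackingFailed < StandardError; end

# Hypothetical minimal stream: only peek, consume, mark and rewind are modelled.
class SketchStream
  attr_reader :index

  def initialize( symbols )
    @symbols = symbols
    @index = 0
  end

  def peek;             @symbols[ @index ]; end
  def consume;          @index += 1;        end
  def mark;             @index;             end
  def rewind( marker ); @index = marker;    end
end

class SketchRecognizer
  attr_reader :input, :backtracking

  def initialize( input )
    @input = input
    @backtracking = 0
  end

  # Run the block speculatively; report success and always restore the input.
  def backtrack
    @backtracking += 1
    start = @input.mark
    begin
      yield
      true
    rescue BacktrackingFailed
      false
    end
  ensure
    @input.rewind( start )
    @backtracking -= 1
  end

  # While backtracking, a mismatch aborts the attempt instead of recovering.
  def match( type )
    return @input.consume if @input.peek == type
    raise BacktrackingFailed if @backtracking > 0
    raise "mismatched input"
  end
end

parser = SketchRecognizer.new( SketchStream.new( [ :ID, :EQ, :INT ] ) )
ok  = parser.backtrack { parser.match( :ID ); parser.match( :EQ ) }
bad = parser.backtrack { parser.match( :ID ); parser.match( :INT ) }
```

Both attempts leave the stream at its starting position, which is what lets a syntactic predicate probe the input without consuming it.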
Overridable hook method that is executed at the start of the resyncing procedure in recover. By default, it does nothing.
# File lib/antlr3/recognizers.rb, line 519
def begin_resync
  # do nothing
end

# File lib/antlr3/recognizers.rb, line 779
def combine_follows( exact )
  follow_set = Set.new
  @state.following.each_with_index.reverse_each do |local_follow_set, index|
    follow_set |= local_follow_set
    if exact
      if local_follow_set.include?( EOR_TOKEN_TYPE )
        follow_set.delete( EOR_TOKEN_TYPE ) if index > 0
      else
        break
      end
    end
  end
  return follow_set
end
Compute the context-sensitive FOLLOW set for the current rule. This is the set of token types that can follow a specific rule reference given a specific call chain. You get the set of viable tokens that can possibly come next (lookahead depth 1) given the current call chain. Contrast this with the definition of plain FOLLOW for rule r:
FOLLOW(r) = {x | S =>* alpha r beta in G and x in FIRST(beta)}
where x in T* and alpha, beta in V*; T is the set of terminals and V is the set of terminals and nonterminals. In other words, FOLLOW(r) is the set of all tokens that can possibly follow references to r in any sentential form (context). At runtime, however, we know precisely which context applies because we have the call chain. We can compute the exact (rather than covering superset) set of following tokens.
For example, consider grammar:
stat : ID '=' expr ';'       // FOLLOW(stat)=={EOF}
     | "return" expr '.'
     ;
expr : atom ('+' atom)* ;    // FOLLOW(expr)=={';','.',')'}
atom : INT                   // FOLLOW(atom)=={'+',')',';','.'}
     | '(' expr ')'
     ;
The FOLLOW sets are all-inclusive, whereas context-sensitive FOLLOW sets are precisely what could follow a rule reference. For input “i=(3);”, here is the derivation:
stat => ID '=' expr ';'
     => ID '=' atom ('+' atom)* ';'
     => ID '=' '(' expr ')' ('+' atom)* ';'
     => ID '=' '(' atom ')' ('+' atom)* ';'
     => ID '=' '(' INT ')' ('+' atom)* ';'
     => ID '=' '(' INT ')' ';'
At the “3” token, you’d have a call chain of
stat -> expr -> atom -> expr -> atom
What can follow that specific nested ref to atom? Exactly ‘)’ as you can see by looking at the derivation of this specific input. Contrast this with the FOLLOW(atom)={‘+’,‘)’,‘;’,‘.’}.
You want the exact viable token set when recovering from a token mismatch. Upon token mismatch, if LA(1) is a member of the viable next token set, then you know there is most likely a missing token in the input stream. “Insert” one by simply not throwing an exception.
# File lib/antlr3/recognizers.rb, line 775
def compute_context_sensitive_rule_follow
  combine_follows true
end
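To see how the exact set falls out of the follow stack, the logic of combine_follows can be replayed on a hand-built stack. This is only a sketch: the stack is passed in as an argument instead of being read from the shared state, :EOR is a stand-in for EOR_TOKEN_TYPE, and the local follow sets for the call chain stat -> expr -> atom -> expr -> atom are written out by hand as an assumption about what generated code would push.

```ruby
require 'set'

EOR = :EOR  # stand-in for EOR_TOKEN_TYPE (assumption for this sketch)

# Walk the follow stack from the innermost call outward. In exact mode, stop
# at the first local set that cannot reach the end of its rule (no EOR marker).
def combine_follows( following, exact )
  follow_set = Set.new
  following.each_with_index.reverse_each do |local_follow_set, index|
    follow_set |= local_follow_set
    next unless exact
    break unless local_follow_set.include?( EOR )
    follow_set.delete( EOR ) if index > 0
  end
  follow_set
end

# Hand-written local follow sets, bottom of the stack first.
following = [
  Set[ :EOF ],       # stat, called from the top level
  Set[ ';' ],        # expr inside stat
  Set[ '+', EOR ],   # atom inside expr
  Set[ ')' ],        # expr inside '(' expr ')'
  Set[ '+', EOR ],   # atom inside the nested expr
]

exact    = combine_follows( following, true )
combined = combine_follows( following, false )
```

With this stack the exact set also contains '+', because expr permits a trailing ('+' atom)* after the atom; for the specific input “i=(3);” only ‘)’ actually occurs. The combined (non-exact) set is simply the union of every set on the stack.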
(The following explanation has been lifted directly from the source code documentation of the ANTLR Java runtime library)
Compute the error recovery set for the current rule. During rule invocation, the parser pushes the set of tokens that can follow that rule reference on the stack; this amounts to computing FIRST of what follows the rule reference in the enclosing rule. This local follow set only includes tokens from within the rule; i.e., the FIRST computation done by ANTLR stops at the end of a rule.
EXAMPLE
When you find a “no viable alt exception”, the input is not consistent with any of the alternatives for rule r. The best thing to do is to consume tokens until you see something that can legally follow a call to r or any rule that called r. You don’t want the exact set of viable next tokens because the input might just be missing a token–you might consume the rest of the input looking for one of the missing tokens.
Consider grammar:
a : '[' b ']'
  | '(' b ')'
  ;
b : c '^' INT ;
c : ID
  | INT
  ;
At each rule invocation, the set of tokens that could follow that rule is pushed on a stack. Here are the various “local” follow sets:
FOLLOW( b1_in_a ) = FIRST( ']' ) = ']'
FOLLOW( b2_in_a ) = FIRST( ')' ) = ')'
FOLLOW( c_in_b )  = FIRST( '^' ) = '^'
Upon erroneous input “[]”, the call chain is
a -> b -> c
and, hence, the follow context stack is:
depth   local follow set   after call to rule
  0         <EOF>          a (from main())
  1          ']'           b
  2          '^'           c
Notice that ')' is not included, because b would have to have been called from a different context in rule a for ‘)’ to be included.
For error recovery, we cannot consider FOLLOW(c) (context-sensitive or otherwise). We need the combined set of all context-sensitive FOLLOW sets–the set of all tokens that could follow any reference in the call chain. We need to resync to one of those tokens. Note that FOLLOW(c)=‘^’ and if we resync’d to that token, we’d consume until EOF. We need to sync to context-sensitive FOLLOWs for a, b, and c: {‘]’,‘^’}. In this case, for input “[]”, LA(1) is in this set so we would not consume anything and after printing an error rule c would return normally. It would not find the required ‘^’ though. At this point, it gets a mismatched token error and throws an exception (since LA(1) is not in the viable following token set). The rule exception handler tries to recover, but finds the same recovery set and doesn’t consume anything. Rule b exits normally returning to rule a. Now it finds the ‘]’ (and with the successful match exits errorRecovery mode).
So, you can see that the parser walks up the call chain looking for the token that was a member of the recovery set.
Errors are not generated in errorRecovery mode.
ANTLR’s error recovery mechanism is based upon original ideas:
“Algorithms + Data Structures = Programs” by Niklaus Wirth
and
“A note on error recovery in recursive descent parsers”: portal.acm.org/citation.cfm?id=947902.947905
Later, Josef Grosch had some good ideas:
“Efficient and Comfortable Error Recovery in Recursive Descent Parsers”: www.cocolab.com/products/cocktail/doca4.ps/ell.ps.zip
Like Grosch I implemented local FOLLOW sets that are combined at run-time upon error to avoid overhead during parsing.
# File lib/antlr3/recognizers.rb, line 623
def compute_error_recovery_set
  combine_follows( false )
end
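The “[]” walkthrough above can be replayed in a few lines. This is a sketch under stated assumptions: the follow stack is written out by hand from the table of local follow sets, and token types are represented by plain strings and a :EOF symbol rather than the integer types a generated recognizer would use.

```ruby
require 'set'

# Follow stack for input "[]" in the a/b/c grammar, bottom of the stack first.
following = [
  Set[ :EOF ],   # a, called from the top level
  Set[ ']' ],    # b inside '[' b ']'
  Set[ '^' ],    # c inside b
]

# The recovery set is the plain union of every local follow set on the stack.
recovery_set = following.reduce( Set.new ) { |all, local| all | local }

# After '[' has been matched, LA(1) is ']'.
lookahead = ']'
resync_needed = !recovery_set.include?( lookahead )
```

Because the stack bottom holds the end-of-file marker, the union here also contains :EOF alongside ‘]’ and ‘^’. Since ‘]’ is already in the set, no input would be consumed during resynchronization, matching the description above.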
Consume input symbols until one matches a type within types. types can be a single symbol type or a set of symbol types.
# File lib/antlr3/recognizers.rb, line 813
def consume_until( types )
  types.is_a?( Set ) or types = Set[ *types ]
  type = @input.peek
  until type == EOF or types.include?( type )
    @input.consume
    type = @input.peek
  end
  return( type )
end
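The same skipping loop can be illustrated on a plain array. In this sketch the input stream is replaced by an array of token types plus a position, and the function returns the position where it stopped; both choices are simplifications made for illustration, not how the recognizer itself is structured.

```ruby
require 'set'

EOF_TYPE = :EOF  # stand-in for the library's EOF token type

# Skip forward from position until a token type in types (or EOF) is current.
# types may be a single type or a Set of types, as in consume_until.
def skip_until( token_types, position, types )
  types = Set[ *types ] unless types.is_a?( Set )
  loop do
    current = token_types.fetch( position, EOF_TYPE )
    break if current == EOF_TYPE || types.include?( current )
    position += 1
  end
  position
end

token_types = [ :ID, :PLUS, :INT, :SEMI, :ID, EOF_TYPE ]

at_semi = skip_until( token_types, 0, :SEMI )           # single type
at_eof  = skip_until( token_types, 4, Set[ :RBRACK ] )  # type never appears
```

The second call shows why the EOF check matters: when none of the requested types occurs, the loop still terminates at the end of the input.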
Match needs to return the current input symbol, which gets put into the label for the associated token ref; e.g., x=ID. Token and tree parsers need to return different objects. Rather than test for input stream type or change the IntStream interface, I use a simple method to ask the recognizer to tell me what the current input symbol is.
This is ignored for lexers.
# File lib/antlr3/recognizers.rb, line 804
def current_symbol
  @input.look
end
Error reporting hook for presenting the information. The default implementation builds appropriate error message text using error_header and error_message, and calls emit_error_message to write the error message out to some source.
# File lib/antlr3/recognizers.rb, line 424
def display_recognition_error( e = $! )
  header = error_header( e )
  message = error_message( e )
  emit_error_message( "#{ header } #{ message }" )
end

# File lib/antlr3/recognizers.rb, line 347
def each_delegate
  block_given? or return enum_for( __method__ )
  for grammar in self.class.imported_grammars
    del = __send__( Util.snake_case( grammar ) ) and
      yield( del )
  end
end
Write the error report data out to some source. By default, the error message is written to $stderr
# File lib/antlr3/recognizers.rb, line 491
def emit_error_message( message )
  @error_output.puts( message ) if @error_output
end
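Two ways of redirecting error text follow from the code above: passing a different :error_output object to the constructor, or overriding emit_error_message. The sketch below demonstrates both on a hypothetical stand-in class (SketchRecognizer); its display_recognition_error takes the header and message directly, which is a simplification for the example and not the library signature.

```ruby
require 'stringio'

# Hypothetical stand-in that copies only the error-output plumbing.
class SketchRecognizer
  def initialize( options = {} )
    @error_output = options.fetch( :error_output, $stderr )
  end

  # simplified: the real hook derives header and message from an error object
  def display_recognition_error( header, message )
    emit_error_message( "#{ header } #{ message }" )
  end

  def emit_error_message( message )
    @error_output.puts( message ) if @error_output
  end
end

# Option 1: hand in any object that responds to puts.
buffer = StringIO.new
SketchRecognizer.new( :error_output => buffer ).
  display_recognition_error( "line 1:0", "missing ID" )

# Option 2: override the hook and collect messages instead of printing them.
class CollectingRecognizer < SketchRecognizer
  attr_reader :errors

  def initialize( options = {} )
    super
    @errors = []
  end

  def emit_error_message( message )
    @errors << message
  end
end

collector = CollectingRecognizer.new
collector.display_recognition_error( "line 2:4", "mismatched input" )
```

Passing :error_output => nil silences output entirely, since emit_error_message only writes when an output object is present.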
Overridable hook method that is executed after the resyncing procedure has completed. By default, it does nothing.
# File lib/antlr3/recognizers.rb, line 526
def end_resync
  # do nothing
end
used to add a tag to the error message that indicates the location of the input stream when the error occurred
# File lib/antlr3/recognizers.rb, line 466
def error_header( e = $! )
  e.location
end
used to construct an appropriate error message based on the specific type of error and the error’s attributes
# File lib/antlr3/recognizers.rb, line 433
def error_message( e = $! )
  case e
  when UnwantedToken
    token_name = token_name( e.expecting )
    "extraneous input #{ token_error_display( e.unexpected_token ) } expecting #{ token_name }"
  when MissingToken
    token_name = token_name( e.expecting )
    "missing #{ token_name } at #{ token_error_display( e.symbol ) }"
  when MismatchedToken
    token_name = token_name( e.expecting )
    "mismatched input #{ token_error_display( e.symbol ) } expecting #{ token_name }"
  when MismatchedTreeNode
    token_name = token_name( e.expecting )
    "mismatched tree node: #{ e.symbol } expecting #{ token_name }"
  when NoViableAlternative
    "no viable alternative at input " << token_error_display( e.symbol )
  when MismatchedSet
    "mismatched input %s expecting set %s" %
      [ token_error_display( e.symbol ), e.expecting.inspect ]
  when MismatchedNotSet
    "mismatched input %s expecting set %s" %
      [ token_error_display( e.symbol ), e.expecting.inspect ]
  when FailedPredicate
    "rule %s failed predicate: { %s }?" % [ e.rule_name, e.predicate_text ]
  else e.message
  end
end

# File lib/antlr3/recognizers.rb, line 332
def grammar_file_name
  self.class.grammar_file_name
end
Attempt to match the current input symbol against the token type specified by type. If the symbol matches the type, consume the current symbol and return its value. If the symbol doesn’t match, attempt to use the follow-set data provided by follow to recover from the mismatched token.
# File lib/antlr3/recognizers.rb, line 385
def match( type, follow )
  matched_symbol = current_symbol
  if @input.peek == type
    @input.consume
    @state.error_recovery = false
    return matched_symbol
  end
  raise( BacktrackingFailed ) if @state.backtracking > 0
  return recover_from_mismatched_token( type, follow )
end
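When match fails outside of backtracking, recovery distinguishes between an extra token in the input and a missing one. The decision can be sketched as a small stand-alone function. Everything here is illustrative: classify_mismatch is a hypothetical helper, token types are plain symbols held in an array, and the follow set is passed in ready-made instead of being combined with the context-sensitive follow.

```ruby
require 'set'

# Decide how a mismatch at token_types[ index ] would be handled:
#   :unwanted_token   - the token after the current one is the expected type,
#                       so the current token can be skipped (single deletion)
#   :missing_token    - the current token may legally follow the expected one,
#                       so the expected token is treated as absent (insertion)
#   :mismatched_token - neither repair applies
def classify_mismatch( token_types, index, expected, follow )
  if token_types[ index + 1 ] == expected
    :unwanted_token
  elsif follow && follow.include?( token_types[ index ] )
    :missing_token
  else
    :mismatched_token
  end
end

# "x , = 1": the stray comma sits right before the expected '='.
extra = classify_mismatch( [ :ID, :COMMA, :EQ, :INT ], 1, :EQ, Set[ :INT ] )

# "x 1 ;": '=' is absent, and INT is what may follow it.
absent = classify_mismatch( [ :ID, :INT, :SEMI ], 1, :EQ, Set[ :INT ] )

# "x ; ;": no single-token repair fits.
neither = classify_mismatch( [ :ID, :SEMI, :SEMI ], 1, :EQ, Set[ :INT ] )
```

The order of the checks matters: deletion is tried first, so an input that admits both repairs is treated as containing an unwanted token.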
match anything – i.e. wildcard match. Simply consume the current symbol from the input stream.
# File lib/antlr3/recognizers.rb, line 398
def match_any
  @state.error_recovery = false
  @input.consume
end

# File lib/antlr3/recognizers.rb, line 878
def memoize( rule, start_index, success )
  stop_index = success ? @input.index - 1 : MEMO_RULE_FAILED
  memo = @state.rule_memory[ rule ] and memo[ start_index ] = stop_index
end

# File lib/antlr3/recognizers.rb, line 692
def mismatch_is_missing_token?( follow )
  follow.nil? and return false
  if follow.include?( EOR_TOKEN_TYPE )
    viable_tokens = compute_context_sensitive_rule_follow
    follow = follow | viable_tokens
    follow.delete( EOR_TOKEN_TYPE ) unless @state.following.empty?
  end
  if follow.include?( @input.peek ) or follow.include?( EOR_TOKEN_TYPE )
    return true
  end
  return false
end

# File lib/antlr3/recognizers.rb, line 688
def mismatch_is_unwanted_token?( type )
  @input.peek( 2 ) == type
end
Conjure up a missing token during error recovery.
The recognizer attempts to recover from single missing symbols. But, actions might refer to that missing symbol. For example, x=ID {f($x);}. The action clearly assumes that there has been an identifier matched previously and that $x points at that token. If that token is missing, but the next token in the stream is what we want, we assume that this token is missing and we keep going. Because we have to return some token to replace the missing token, we have to conjure one up. This method gives the user control over the tokens returned for missing tokens. Mostly, you will want to create something special for identifier tokens. For literals such as ‘{’ and ‘,’, the default action in the parser or tree parser works. It simply creates a CommonToken of the appropriate type. The text will be the token. If you change what tokens must be created by the lexer, override this method to create the appropriate tokens.
# File lib/antlr3/recognizers.rb, line 684
def missing_symbol( error, expected_token_type, follow )
  return nil
end
Factor out what to do upon token mismatch so tree parsers can behave differently. Override this method in your parser to do things like bailing out after the first error: just raise the exception instead of calling the recovery method.
# File lib/antlr3/recognizers.rb, line 718
def number_of_syntax_errors
  @state.syntax_errors
end
Error Recovery
# File lib/antlr3/recognizers.rb, line 499
def recover( error = $! )
  @state.last_error_index == @input.index and @input.consume
  @state.last_error_index = @input.index
  follow_set = compute_error_recovery_set
  resync { consume_until( follow_set ) }
end

# File lib/antlr3/recognizers.rb, line 653
def recover_from_mismatched_element( e, follow )
  follow.nil? and return false
  if follow.include?( EOR_TOKEN_TYPE )
    viable_tokens = compute_context_sensitive_rule_follow
    follow = ( follow | viable_tokens ) - Set[ EOR_TOKEN_TYPE ]
  end
  if follow.include?( @input.peek )
    report_error( e )
    return true
  end
  return false
end

# File lib/antlr3/recognizers.rb, line 645
def recover_from_mismatched_set( e, follow )
  if mismatch_is_missing_token?( follow )
    report_error( e )
    return missing_symbol( e, INVALID_TOKEN_TYPE, follow )
  end
  raise e
end

# File lib/antlr3/recognizers.rb, line 627
def recover_from_mismatched_token( type, follow )
  if mismatch_is_unwanted_token?( type )
    err = UnwantedToken( type )
    resync { @input.consume }
    report_error( err )
    return @input.consume
  end
  if mismatch_is_missing_token?( follow )
    inserted = missing_symbol( nil, type, follow )
    report_error( MissingToken( type, inserted ) )
    return inserted
  end
  raise MismatchedToken( type )
end
When a recognition error occurs, this method is the main hook for carrying out the error reporting process. The default implementation calls display_recognition_error to display the error info on $stderr.
# File lib/antlr3/recognizers.rb, line 412
def report_error( e = $! )
  @state.error_recovery and return
  @state.syntax_errors += 1
  @state.error_recovery = true
  display_recognition_error( e )
end
Resets the recognizer’s state data to initial values. As a result, all error tracking and error recovery data accumulated in the current state will be cleared. It will also attempt to reset the input stream via input.reset, but it ignores any errors raised while doing so. Thus the input stream is not guaranteed to be rewound to its initial position.
# File lib/antlr3/recognizers.rb, line 374
def reset
  @state and @state.reset!
  @input and @input.reset rescue nil
end

# File lib/antlr3/recognizers.rb, line 508
def resync
  begin_resync
  return( yield )
ensure
  end_resync
end

# File lib/antlr3/recognizers.rb, line 860
def rule_memoization( rule, start_index )
  @state.rule_memory.fetch( rule ) do
    @state.rule_memory[ rule ] = Hash.new( MEMO_RULE_UNKNOWN )
  end[ start_index ]
end
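The memoization methods share one table: a hash per rule, keyed by start index, whose default value marks an unknown entry. The sketch below isolates that table in a hypothetical SketchMemo class. The two sentinel values are assumptions chosen for the example (the library defines its own MEMO_RULE_UNKNOWN and MEMO_RULE_FAILED), and memoize here takes the stop index directly instead of deriving it from the input position.

```ruby
# Sentinel values assumed for this sketch only.
MEMO_RULE_FAILED  = -2
MEMO_RULE_UNKNOWN = -1

class SketchMemo
  def initialize
    @rule_memory = {}
  end

  # Look up the recorded stop index for rule at start_index, creating the
  # per-rule table on first use. Unseen positions yield MEMO_RULE_UNKNOWN.
  def rule_memoization( rule, start_index )
    table = @rule_memory.fetch( rule ) do
      @rule_memory[ rule ] = Hash.new( MEMO_RULE_UNKNOWN )
    end
    table[ start_index ]
  end

  # Record an outcome, but only if the rule's table already exists,
  # mirroring the guarded assignment in memoize above.
  def memoize( rule, start_index, stop_index )
    memo = @rule_memory[ rule ] and memo[ start_index ] = stop_index
  end
end

memo = SketchMemo.new
first_lookup = memo.rule_memoization( :expr, 0 )   # nothing recorded yet
memo.memoize( :expr, 0, 4 )                        # expr matched tokens 0..4
memo.memoize( :expr, 7, MEMO_RULE_FAILED )         # expr failed at token 7
```

A later lookup at index 0 returns the stored stop index, which is what allows already_parsed_rule? to seek past the rule instead of parsing it again, while a lookup at index 7 returns the failure sentinel.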
# File lib/antlr3/recognizers.rb, line 853
def syntactic_predicate?( name )
  backtrack { send name }
end

# File lib/antlr3/recognizers.rb, line 706
def syntax_errors?
  ( error_count = @state.syntax_errors ) > 0 and return( error_count )
end
formats a token object appropriately for inspection within an error message
# File lib/antlr3/recognizers.rb, line 474
def token_error_display( token )
  unless text = token.text || ( token.source_text rescue nil )
    text =
      case
      when token.type == EOF then '<EOF>'
      when name = token_name( token.type ) rescue nil then "<#{ name }>"
      when token.respond_to?( :name ) then "<#{ token.name }>"
      else "<#{ token.type }>"
      end
  end
  return text.inspect
end

# File lib/antlr3/recognizers.rb, line 883
def trace_in( rule_name, rule_index, input_symbol )
  @error_output.printf( "--> enter %s on %s", rule_name, input_symbol )
  @state.backtracking > 0 and @error_output.printf(
    " (in backtracking mode: depth = %s)", @state.backtracking
  )
  @error_output.print( "\n" )
end

# File lib/antlr3/recognizers.rb, line 891
def trace_out( rule_name, rule_index, input_symbol )
  @error_output.printf( "<-- exit %s on %s", rule_name, input_symbol )
  @state.backtracking > 0 and @error_output.printf(
    " (in backtracking mode: depth = %s)", @state.backtracking
  )
  @error_output.print( "\n" )
end
Private Instance Methods
# File lib/antlr3/recognizers.rb, line 901
def initialize_dfas
  # do nothing
end