===== The "Up-to-here" Marker ===== The //up-to-here// marker ''=>||'' and the //up-to-here-plus// marker ''=>|+n'' (where n is an integer) were introduced to be able to express [[lookahead]] in a more succinct and less err-prone manner than using the ''LOOKAHEAD'' from legacy JavaCC or even the terser ''SCAN'' statement. Where in legacy JavaCC, you would write: LOOKAHEAD(Foo() Bar()) Foo() Bar() Baz() or in CongoCC (//without up-to-here//) you could write: SCAN Foo Bar => Foo Bar Baz but now, with the //up-to-here// syntax, you can now write the above simply as: Foo Bar =>|| Baz The latter construct can be read as meaning that we enter the expansion ''Foo Bar Baz'' and yes, we scan ahead to check when deciding whether to enter the expansion. However, if we get to the end of the Bar, we can stop scanning. Now, I would point out that, if we knew for a fact that ''Foo Bar'' consumes //exactly// 7 tokens, say, then the above would //effectively// be the same as writing: LOOKAHEAD(7) Foo() Bar() Baz() However, in the general case, the expansions ''Foo'' and ''Bar'' above will consume a variable number of tokens. Also, as the grammar evolves over time, the maximum number of tokens that these productions consume may change and you would then have to change the number specified in the ''LOOKAHEAD''. The //up-to-here// marker dispenses with that. ==== The up-to-here marker in parsing productions ==== The //up-to-here// marker can also be used in grammatical productions to indicate that the production requires scanahead up to a certain point. For example, suppose we have the following two simple productions: Foobar : "foo" bar" "baz" ; Foobaz : "foo" "baz" "bat" ; (I refer to these as //simple// because both of them are just a sequence of three string literal tokens. I use these as examples, but the same concept applies to much more complex expansions, of course.) Suppose further that we have a choice construct: ( Foobar | Foobaz ) Now, if no ''LOOKAHEAD'' (or ''SCAN'') is specified, then the ''Foobaz'' production is going to be unreachable. This is because, by default, CongoCC looks ahead exactly one token to decide which of the two expansions to enter, and will simply go with the first one that matches. So, assuming that the next token is a "foo", then it will //always// enter ''Foobar'' and the second choice, ''Foobaz'', is unreachable. The classic solution would be to specify two tokens of lookahead, so that we need to match "foo" followed by "bar" to enter ''Foobar''. Like so: ( LOOKAHEAD (2) Foobar() | Foobaz() ) In CongoCC, you have the option of putting an //up-to-here// marker in the ''Foobar'' production like so: Foobar : "foo" "bar" =>|| "baz" ; This means that we scan up to (and including) the "bar" token when the expansion at a [[choice points|choice point]] starts with ''Foobar''. Now, the choice construct above can be written simply as: ( Foobar | FooBaz ) Since the //up-to-here// marker is specified inside the ''Foobar'' production, the code generated for the choice construct takes that into account and scans the necessary two tokens, i.e. up to and including the "bar" token. ==== The "up-to-here-plus" marker ==== Not long after implementing the above //up-to-here// syntax, I realized that there was a common problem that it did not address. Frequently, you want to scan up to a given point and then an extra token (or two... or three...) For example, suppose you have a production that represents the declaration of a function in some C-ish sort of language. FunctionDeclaration : [ReturnType] Args Block ; Suppose that you decide that, to decide that something is indeed a function declaration, you need to scan to the end of the Args and //then one more token// which would be the opening curly brace of the ''Block'', "{". Regardless, what we certainly don't want to do is scan all the way to the end of the Block that could be very very long and costly. We reason that it is enough to hit the "{" that starts the ''Block''. So, we could write: FunctionDeclaration : [ReturnType] Args =>|+1 Block ; The ''=>|+1'' is the //up-to-here-plus// marker and means that once we scan past Args, we set a numerical limit of 1 more token. In this case, that would mean that we scan up to and including the initial "{" of the ''Block'' production. Then, for example, if the ''FunctionDeclaration'' was used at some [[choice point]] in your grammar, say: ( FunctionDeclaration | VariableDeclaration | StructDeclaration | EnumDeclaration ) without the ability to put the up-to-here (or in this case, //up-to-here-plus// marker in ''FunctionDeclaration'' we would have to write something like this, as in legacy JavaCC: ( LOOKAHEAD (ReturnType() Args() "{" ) FunctionDeclaration | VariableDeclaration | StructDeclaration | EnumDeclaration ) Of course, most likely, the other productions, such as ''VariableDeclaration'' etcetera would also requires some complicated (complicated looking) syntactic lookahead and the whole thing would end up looking very hairy indeed! ==== Niggles (Annoying Details) ==== The above basically summarizes how the new //up-to-here// construct works. There are, however, a few little annoying details to keep in mind. * If an expansion has an //up-to-here// marker then any nested //up-to-here// is ignored. * If an expansion has an explicit //numerical// or //syntactic// lookahead, then any //up-to-here// marker in a that expansion or in a nested production is ignored. What this means, for example, is that if we have: Foo Bar =>|| Baz then any //up-to-here// inside of the ''Foo'' production is ignored because, at a higher level, we specified that we want to scan past it up to and including the ''Bar'' that follows. By the same token, if we specify a specific numerical limit, then that overrides any nested //up-to-here// as well. So, in the following: [ SCAN 1 Foo Bar Baz ] We are just going to scan ahead one token, regardless of whether ''Foo'' contains an //up-to-here// saying that we should scan more than 1, because the ''SCAN'' statement is assumed to mean that we override that. Note also that the up-to-here marker in a production only applies at a maximum of one level of nesting. Thus, for example, if you have: Foo : Bar Baz ; and: Bar : "baz" "bat" =>|| "bam"; Then using ''Foo'' at a choice point, like : ( Foo )* Though it could conceivably be anti-intuitive (I'm not sure really...) the //up-to-here// marker in the ''Bar'' above is //not// used. Or, in other words, the ''( Foo )*'' expansion above only scans //one// token ahead on each iteration of the loop. This is the current implementation and it is possible that it will be revisited at a later point. Perhaps not, since respecting the //up-to-here// marker when recursing more than one level deep seems like it could be problematic, both in the implementation and in terms of grammars being easily readable. Note also that the only //nested// up-to-here that applies to an expansion is in a NonTerminal that //starts// the expansion. Thus, if we have the expansion: ( Foo Bar Baz )* The //up-to-here// marker inside the ''Bar'' production is not used, only one inside of ''Foo'', assuming it does have one! In the above case, if we did want to scan past Foo and into Bar, stopping after two tokens, we could write: ( Foo =>|+2 Bar Baz )* As of this writing, there is no way to get the lookahead machinery to respect the //up-to-here// marker inside of a production (like ''Bar'' above) if it is not at the very beginning of the expansion. This may be addressed later, but it is probably not at all urgent. In most cases, where you want to scan past a production and then somewhat more into the next one, it is exactly one token. You will typically have something like: MethodDeclaration : [ Modifiers ] ReturnType =>||+1 Args Block ; What the above means is that we scan to the end of the (which would be the method name) and then one more token to check for the initial parenthesis "("in the Args production. We anticipate that the vast majority of //up-to-here-plus// markers will indicate one extra token. (Though that remains to be seen in actual praxis, admittedly...)