The "Up-to-here" Marker

The up-to-here marker ⇒|| and the up-to-here-plus marker ⇒|+n (where n is an integer) were introduced to be able to express lookahead in a more succinct and less err-prone manner than using the LOOKAHEAD from legacy JavaCC or even the terser SCAN statement.

Where in legacy JavaCC, you would write:

    LOOKAHEAD(Foo() Bar()) Foo() Bar() Baz()

or in CongoCC (without up-to-here) you could write:

    SCAN Foo Bar => Foo Bar Baz

but now, with the up-to-here syntax, you can now write the above simply as:

     Foo Bar =>|| Baz

The latter construct can be read as meaning that we enter the expansion Foo Bar Baz and yes, we scan ahead to check when deciding whether to enter the expansion. However, if we get to the end of the Bar, we can stop scanning.

Now, I would point out that, if we knew for a fact that Foo Bar consumes exactly 7 tokens, say, then the above would effectively be the same as writing:

    LOOKAHEAD(7) Foo() Bar() Baz()

However, in the general case, the expansions Foo and Bar above will consume a variable number of tokens. Also, as the grammar evolves over time, the maximum number of tokens that these productions consume may change and you would then have to change the number specified in the LOOKAHEAD. The up-to-here marker dispenses with that.

The up-to-here marker in parsing productions

The up-to-here marker can also be used in grammatical productions to indicate that the production requires scanahead up to a certain point.

For example, suppose we have the following two simple productions:

    Foobar : "foo" bar" "baz" ;
    Foobaz : "foo" "baz" "bat" ;

(I refer to these as simple because both of them are just a sequence of three string literal tokens. I use these as examples, but the same concept applies to much more complex expansions, of course.)

Suppose further that we have a choice construct:

   ( Foobar | Foobaz )

Now, if no LOOKAHEAD (or SCAN) is specified, then the Foobaz production is going to be unreachable. This is because, by default, CongoCC looks ahead exactly one token to decide which of the two expansions to enter, and will simply go with the first one that matches. So, assuming that the next token is a “foo”, then it will always enter Foobar and the second choice, Foobaz, is unreachable.

The classic solution would be to specify two tokens of lookahead, so that we need to match “foo” followed by “bar” to enter Foobar. Like so:

   ( LOOKAHEAD (2) Foobar() | Foobaz() )

In CongoCC, you have the option of putting an up-to-here marker in the Foobar production like so:

    Foobar : "foo" "bar" =>|| "baz" ;

This means that we scan up to (and including) the “bar” token when the expansion at a choice point starts with Foobar.

Now, the choice construct above can be written simply as:

   ( Foobar | FooBaz )

Since the up-to-here marker is specified inside the Foobar production, the code generated for the choice construct takes that into account and scans the necessary two tokens, i.e. up to and including the “bar” token.

The "up-to-here-plus" marker

Not long after implementing the above up-to-here syntax, I realized that there was a common problem that it did not address. Frequently, you want to scan up to a given point and then an extra token (or two… or three…)

For example, suppose you have a production that represents the declaration of a function in some C-ish sort of language.

    FunctionDeclaration : [ReturnType] <IDENTIFIER> Args Block ;

Suppose that you decide that, to decide that something is indeed a function declaration, you need to scan to the end of the Args and then one more token which would be the opening curly brace of the Block, “{”.

Regardless, what we certainly don't want to do is scan all the way to the end of the Block that could be very very long and costly. We reason that it is enough to hit the “{” that starts the Block. So, we could write:

    FunctionDeclaration : [ReturnType] <IDENTIFIER> Args =>|+1 Block ;

The ⇒|+1 is the up-to-here-plus marker and means that once we scan past Args, we set a numerical limit of 1 more token. In this case, that would mean that we scan up to and including the initial “{” of the Block production.

Then, for example, if the FunctionDeclaration was used at some choice point in your grammar, say:

   ( FunctionDeclaration | VariableDeclaration | StructDeclaration | EnumDeclaration )

without the ability to put the up-to-here (or in this case, up-to-here-plus marker in FunctionDeclaration we would have to write something like this, as in legacy JavaCC:

   ( LOOKAHEAD (ReturnType() <IDENTIFIER> Args() "{" ) FunctionDeclaration
     | VariableDeclaration | StructDeclaration | EnumDeclaration )

Of course, most likely, the other productions, such as VariableDeclaration etcetera would also requires some complicated (complicated looking) syntactic lookahead and the whole thing would end up looking very hairy indeed!

Niggles (Annoying Details)

The above basically summarizes how the new up-to-here construct works. There are, however, a few little annoying details to keep in mind.

  • If an expansion has an up-to-here marker then any nested up-to-here is ignored.
  • If an expansion has an explicit numerical or syntactic lookahead, then any up-to-here marker in a that expansion or in a nested production is ignored.

What this means, for example, is that if we have:

   Foo Bar =>|| Baz

then any up-to-here inside of the Foo production is ignored because, at a higher level, we specified that we want to scan past it up to and including the Bar that follows.

By the same token, if we specify a specific numerical limit, then that overrides any nested up-to-here as well. So, in the following:

  [ SCAN 1 Foo Bar Baz ]

We are just going to scan ahead one token, regardless of whether Foo contains an up-to-here saying that we should scan more than 1, because the SCAN statement is assumed to mean that we override that.

Note also that the up-to-here marker in a production only applies at a maximum of one level of nesting. Thus, for example, if you have:

    Foo : Bar Baz ;

and:

    Bar : "baz" "bat" =>|| "bam";

Then using Foo at a choice point, like :

    ( Foo )*

Though it could conceivably be anti-intuitive (I'm not sure really…) the up-to-here marker in the Bar above is not used. Or, in other words, the ( Foo )* expansion above only scans one token ahead on each iteration of the loop. This is the current implementation and it is possible that it will be revisited at a later point. Perhaps not, since respecting the up-to-here marker when recursing more than one level deep seems like it could be problematic, both in the implementation and in terms of grammars being easily readable.

Note also that the only nested up-to-here that applies to an expansion is in a NonTerminal that starts the expansion. Thus, if we have the expansion:

   ( Foo Bar Baz )*

The up-to-here marker inside the Bar production is not used, only one inside of Foo, assuming it does have one! In the above case, if we did want to scan past Foo and into Bar, stopping after two tokens, we could write:

   ( Foo =>|+2 Bar Baz )*

As of this writing, there is no way to get the lookahead machinery to respect the up-to-here marker inside of a production (like Bar above) if it is not at the very beginning of the expansion. This may be addressed later, but it is probably not at all urgent. In most cases, where you want to scan past a production and then somewhat more into the next one, it is exactly one token. You will typically have something like:

   MethodDeclaration : [ Modifiers ] ReturnType <IDENTIFIER> =>||+1 Args Block ;

What the above means is that we scan to the end of the <IDENTIFIER> (which would be the method name) and then one more token to check for the initial parenthesis “(“in the Args production. We anticipate that the vast majority of up-to-here-plus markers will indicate one extra token. (Though that remains to be seen in actual praxis, admittedly…)