meta data for this page
  •  

This is an old revision of the document!


The "Up-to-here" Construct

The up-to-here marker ⇒|| and the up-to-here-plus marker ⇒|+n (where n is an integer) was introduced to be able to express lookahead in a more succinct and less err-prone manner than using the legacy LOOKAHEAD or even the terser SCAN statement.

Where in legacy JavaCC, you would write:

    LOOKAHEAD(Foo() Bar()) Foo() Bar() Baz()

or in JavaCC 21 without (up-to-here) you could write:

    SCAN Foo Bar => Foo Bar Baz

but now, with the up-to-here syntax, you can now write the above simply as:

     Foo Bar =>|| Baz

The latter construct can be read as meaning that we enter the expansion Foo Bar Baz and yes, we scan ahead to check when deciding whether to enter the expansion. However, if we get to the end of the Bar, we can stop scanning.

Now, I would point out that, if we knew for a fact that Foo Bar consumes exactly 7 tokens, say, then the above would effectively be the same as writing:

    LOOKAHEAD(7) Foo() Bar() Baz()

However, in the general case, the expansions Foo and Bar above will consume a variable number of tokens. Also, as the grammar evolves over time, the maximum number of tokens that these productions consume may change and you would then have to change the number specified in the LOOKAHEAD. The up-to-here marker dispenses with that.

The up-to-here marker in parsing productions

The up-to-here marker can also be used in grammatical productions to indicate that the production requires scanahead up to a certain point.

For example, suppose we have the following two simple productions:

    Foobar : "foo" bar" "baz" ;
    Foobaz : "foo" "baz" "bat" ;

(I refer to these as simple because both of them are just a sequence of three string literal tokens. The same concept applies to much more complex expansions, fo course.)

Suppose further that we have a choice construct:

   ( Foobar | Foobaz )

Now, if no LOOKAHEAD (or SCAN) is specified, then the Foobaz production is going to be unreachable. This is because, by default, JavaCC looks ahead exactly one token to decide which of the two expansions to enter, and will simply go with the first one that matches. So, assuming that the next token is a “foo”, then it will always enter Foobar and Foobaz is unreachable.

The classic solution would be to specify two tokens of lookahead, so that we need to match “foo” followed by “bar” to enter Foobar. Like so:

   ( LOOKAHEAD (2) Foobar() | Foobaz() )

In JavaCC 21, you have the option of putting an up-to-here marker in the Foobar production like so:

    Foobar : "foo" "bar" =>|| "baz" ;

This means that we scan up to (and including) the “bar” token when the expansion at a choice point starts with Foobar.

Now, the choice construct above can be written simply as:

   ( Foobar | FooBaz )

Since the up-to-here marker is specified inside the Foobar production, the code generated for the choice construct takes that into account and scans the necessary two tokens, i.e. up to and including the “bar” token.

The "up-to-here-plus" marker

Not long after implementing the above up-to-here syntax, I realized that there was a common problem that it did not address. Frequently, you want to scan up to a given point and then an extra token (or two… or three…)

For example, suppose you have a production that represents the declaration of a function in some C-ish sort of language.

    FunctionDeclaration : [ReturnType] <IDENTIFIER> Args Block ;

Suppose that you decide that, to decide that something is indeed a function declaration, you need to scan to the end of the Args and then one more token which would be the opening curly brace of the Block, “{”.

Regardless, what we certainly don't want to do is scan all the way to the end of the Block that could be very very long and costly. We reason that it is enough to hit the “{” that starts the Block. So, we could write:

    FunctionDeclaration : [ReturnType] <IDENTIFIER> Args =>|+1 Block ;

The ⇒|+1 is the up-to-here-plus marker and means that once we scan past Args, we set a numerical limit of 1 more token. In this case, that would mean that we scan up to and including the initial “{” of the Block production.

Then, for example, if the FunctionDeclaration was used at some choice point in your grammar, say:

   ( FunctionDeclaration | VariableDeclaration | StructDeclaration | EnumDeclaration )

without the ability to put the up-to-here (or in this case, up-to-here-plus marker in FunctionDeclaration we would have to write something like this, as in legacy JavaCC:

   ( LOOKAHEAD (ReturnType() <IDENTIFIER> Args() "{" ) FunctionDeclaration
     | VariableDeclaration | StructDeclaration | EnumDeclaration )

Of course, most likely, the other productions, such as VariableDeclaration etcetera would also requires some complicated (complicated looking) syntactic lookahead and the whole thing would end up looking very hairy indeed!

Niggles (Annoying Details)

The above basically summarizes how the new up-to-here construct works. There are, however, a few little annoying details to keep in mind.

  • If an expansion has an up-to-here marker then any nested up-to-here is ignored.
  • If an expansion has an explicit numerical or syntactic lookahead, then any up-to-here marker in a nested production is ignored.

What this means, for example, is that if we have:

   Foo Bar =>|| Baz

then any up-to-here inside of the Foo production is ignored because, at a higher level, we specified that we want to scan past it up to and including the Barthat follows.

By the same token, if we specify a specific numerical limit, then that overrides any nested up-to-here as well. So, in the following:

  [ SCAN 1 Foo Bar Baz ]

We are just going to scan ahead one token, regardless of whether Foo contains an up-to-here saying that we should scan more than 1, because the SCAN statement is assumed to mean that we override that.

Note also that the up-to-here marker in a production only applies at a maximum of one level of nesting. Thus, for example, if you have:

    Foo : Bar Baz ;

and:

    Bar : "baz" "bat" =>|| "bam";

Then using Foo at a choice point, like :

    ( Foo )*

The up-to-here marker in Baris not used. This is the current implementation and it could be revisited at a later point. Perhaps not, since recursing more than one level deep seems like it could be problematic, both in the implementation and in terms of grammars being easily readable.