meta data for this page
  •  

This is an old revision of the document!


Choice Points and Lookahead

A choice point in a JavaCC grammar is a juncture where the parser must decide between two or more expansions. There are four kinds of choice point:

  • An actual choice, as in Foo | Bar | Baz, i.e. two or more expansions separated by the “|” character.
  • An optional expansion, a.k.a. zero or one which is written ( Foo )? or alternatively as [ Foo ]
  • A repeated expansion that is possibly empty, a.k.a. zero or more which is written as ( Foo )*
  • A repeated expansion that cannot be empty, a.k.a. one or more which is written as ( Foo )+

All of the above constitute choices. The first case is self-explanatory, I guess. We have n choices and we must choose one of them. The choices are mutually exclusive.

The second case is effectively a choice between the enclosed expansion and jumping directly to whatever follows it. So, if we write:

    ( some_expansion )? another_expansion

the parser must choose between the first expansion some_expansion and jumping directly to what follows, i.e. another_expansion.

That is zero or one. A zero or more is basically the same as zero or one except that it is an infinite loop. So the following:

    ( loop_expansion )* following_expansion

is a loop that, on each iteration, chooses between the loop_expansion and the following_expansion. As long as it chooses the first one, it repeats the loop, but once it opts for the following_expansion, it breaks out of the loop.

A one or more is like a zero or more loop except that the first iteration is obligatory. Thus:

   ( loop_expansion)+ following_expansion

is effectively just a more terse way of writing:

   loop_expansion ( loop_expansion )* following_expansion

So, one should bear in mind that a one or more loop is a little bit tricky because, on the first iteration, it is not actually a choice point! Well, the difference between zero or more and one or more is exactly the same as the difference between a while loop and a do-while loop in Java or similar languages. To write:

    do {something();} while (someCondition());

is really just an abbreviated version of:

 
    something(); while (someCondition()) something();

So, again, we see that the first iteration is actually executed unconditionally.

A choice, such as expansion1 | expansion2 |…expansionN | … final_expansion might as well be thought of as an if-else if-else construct. In a kind of pseudo-code:

    if condition1 
      expansion1 
    elseif condition2 
      expansion2 
    ....
    elseif conditionN
      expansionN
    ...
    else 
      final_expansion

We test the conditions in sequence until one is satisfied and then we enter that expansion and that's that.

Well, all the above kind of begs the question of how the various conditions are determined. In the default case, this has a very simple answer:

The Default Choice Resolution Algorithm

The default choice resolution algorithm for whether we enter an expansion at a choice point is just to check ahead exactly one token and if that token potentially starts the expansion, then that is what we choose!

Another, more theoretical way of expressing this is:

The default assumption of a parser generated by JavaCC is that the grammar is LL(1).

This is a fancy way of saying that looking ahead one token is enough to resolve any choices at any juncture in the grammar.

Well, that might or might not be true in any given spot, but if we don't specify any extra information, that is the default assumption. Another fancy, theoretical way of expressing this default would be:

We enter an expansion at a choice point if the next token is in that expansion's first set.

An expansion's first set is simply the set of tokens with which an expansion can begin. Depending how your brain is wired, it might be easier to think about it the other way round. Like so:

If the next token is not in an expansion's first set, the expansion cannot possibly be matched, so we skip ahead to the following choice.

Now, it may be the case that more than one of the expansions at a given choice point matches this condition. Well, in that case, the first one gets it. This is really no different, by the way, from how if-elseif-else in a procedural programming language works.

If you write:

   if (x > 0) doSomething();
   else if (x > 1) doSomethingElse();
   else somethingElseEntirely();

then clearly the second choice, doSomethingElse() is never entered, because if the condition x>1 is satisfied, then x>0 would also have been satisfied, so we would have already entered doSomething() and subsequently jumped out of the if-else construct!

Well, nobody considers that a very big deal, or to be that confusing, but legacy JavaCC was based on the notion that this was something that trips people up and would emit warnings in analogous situations. JavaCC 21 does not bother. At some later point, some checks will probably be put in for emitting warnings when code is unreachable, but it is not a high priority either. This was actually quite annoying in legacy JavaCC, because most of the time, the warnings emitted were complete spurious!

In any case, if we have a choice construct like:

    "foo" "bar" ...
    |
    "foo" "baz" ...
    |
    "bar" "bat" ...

then yes, the second choice is clearly unreachable. If our choice resolution algorithm is to look ahead exactly one token, if the next token is “foo”, we always enter the first choice and never the second one.

So, this leads us naturally to the next question, which is:

If the default resolution algorithm is not sufficient, i.e. the grammar is not LL(1) at that juncture, how do we override the default and get the behavior we actually want?

The answer is that we do this by defining some sort of lookahead. (Note that lookahead is legacy JavaCC terminology but is frequently inaccurate. Really, we're talking about defining some sort of condition or predicate that overrides the default resolution algorithm, and that could well involve looking ahead more than the default 1 token, or it could involve something else.) Probably the term predicate is more accurate, but that is a bit of a fancy word. Probably just condition is better, but I may well use the term lookahead frequently in these spots because, yes, resolving the choice very often does involve looking ahead more than one token. (But not always.)

Definite Numerical Lookahead

In the case given above, where we have the unreachable second expansion, the simplest solution would simply be to specify that we should scan ahead two tokens on the first expansion. So we could write the above choice as:

    SCAN "foo" "bar" ...
    |
    "foo" "baz" ...
    |
    "bar" "bat" ...

(Note that I use the newer SCAN statement here. In legacy JavaCC this would be written as: LOOKAHEAD(2) “foo” “bar” etc

Regardless, once we specify two tokens of lookahead on the first choice, we only enter it if the next two tokens are “foo” followed by “bar”, so if the next two tokens happen to be “foo” followed by something other than “bar” we enter the second choice.

Now, here is an interesting point: if the next two tokens are “foo” and “bat”, then the second choice is incorrect as well, but since we did not specify any lookahead, we enter the second expansion based on it matching the first token. (That's all we check by default.) Of course, we have a problem when we consume the “foo” and the next token is not what we want, a “bat” instead of a “bar”.

BUT… in terms of the topic of the page, this is not a problem! You see, the second token, “baz” in the second expansion is not a choice point! So, yes, the generated parser at this point is expecting a “baz” but encountering a “bat”. So it will complain! (That is an anthropomorphic way of saying that an exception will be thrown.)

Or look at it this way… the above grammar snippet says that we look ahead 2 tokens to decide whether to enter the first expansion. Since we didn't specify anything for the second expansion, it look ahead one token. If an error occurs after that, so be it. The parser is doing what it is supposed to do, right? There is no expansion of the three given choices that can match the “foo” “bat” sequence anyway! So it throws an exception. But the point is that we can have “foo” “bar” here or “foo” “baz” and the generated parser will handle both cases. Our first go at this, without the lookahead specified, it would spuriously reject a “foo” “bar”, which is valid input. If it rejects “foo” “bat” that is not really a problem in terms of the problem we have set ourselves here.

(TBD: talk about syntactic and semantic lookahead, both of which exist in legacy JavaCC. Also point people to the page on lookbehind and up to here.)