Differences

This shows you the differences between two versions of the page.

--- choice_points [2020/09/26 19:54] – revusky
+++ choice_points [2021/02/08 18:09] (current) – ↷ Links adapted because of a move operation revusky
@@ Line 2: / Line 2: @@
 A //choice point// in a JavaCC grammar is a juncture where the parser must decide between two or more expansions. There are four kinds of choice point:
-  * An actual choice, as in ''Foo | Bar | Baz'', i.e. two or more expansions separated by the "|" character.
+  * An optional expansion, i.e. the expansion inside a //zero or one// construct, which is written ''( Foo )?'' or alternatively as ''[ Foo ] ''
-  * An optional expansion, a.k.a. //zero or one// which is written ''( Foo )?'' or alternatively as ''[ Foo ] ''
   * A repeated expansion that is possibly empty, a.k.a. //zero or more// which is written as ''( Foo )*''
   * A repeated expansion that cannot be empty, a.k.a. //one or more// which is written as ''( Foo )+''
+  * The expansions within a //choice construct//, i.e. n (two or more) expansions, separated by the "|" character, i.e. '' Foo | Bar | Baz ''
-All of the above constitute choices. The first case is self-explanatory, I guess. We have n choices and we must choose one of them. The choices are //mutually exclusive//.
+(Perhaps needless to say, all of these constructs arbitrarily combine and nest within one another to an arbitrary level of complexity.)
-The second case is effectively a choice between the enclosed expansion and jumping directly to whatever follows it. So, if we write:
+I think the simplest way to think about the above cases (for most people, anyway) is just in terms of their analogues in a procedural programming language. The first three cases are effectively binary choices, a choice between entering the expansion inside the parentheses and jumping directly to what follows it. In all of these cases, the expansion within the parentheses is a choice point. The last case is a choice between n options and each of those n sub-expansions is a choice point in the grammar.
+==== Zero Or One ====
+A //zero-or-one// is a single (non-looping) choice. If the enclosed expansion matches, we enter it, and if not, we jump directly to whatever follows it. So, if we write:
 <code>
-    ( some_expansion )? another_expansion
+    ( inner_expansion )? following_expansion
 </code>
-the parser must choose between the first expansion ''some_expansion'' and jumping directly to what follows, i.e. ''another_expansion''.
+This is quite analogous to a simple ''if'' statement in a procedural language. In pseudo-code, something like:
+<code>
+    if (inner_expansion_matches()) inner_expansion();
+    following_expansion();
+</code>
+==== Zero Or More ====
+A //zero or more// is basically the same as a //zero or one// **except** that it is not a one-time choice, but a loop. So, the following:
+<code>
+    ( inner_expansion )* following_expansion
+</code>
-That is //zero or one//. A //zero or more// is basically the same as //zero or one// **except** that it is an infinite loop. So the following:
+is a loop that, on each iteration, chooses between the ''inner_expansion'' and the ''following_expansion''. As long as it chooses the first one, it stays in the loop, but once it opts for the ''following_expansion'', it breaks out. In pseudo-code, that looks like:
 <code>
-    ( loop_expansion )* following_expansion
+    while (inner_expansion_matches()) inner_expansion();
+    following_expansion();
 </code>
-is a loop that, on each iteration, chooses between the ''loop_expansion'' and the ''following_expansion''. As long as it chooses the first one, it repeats the loop, but once it opts for the ''following_expansion'', it breaks out of the loop.
+==== One or More ====
 A //one or more// is like a //zero or more// loop except that the first iteration is obligatory. Thus:
@@ Line 31: / Line 49: @@
 </code>
-is effectively just a more terse way of writing:
+can be considered as just a more terse way of writing:
 <code>
@@ Line 37: / Line 55: @@
 </code>
-So, one should bear in mind that a //one or more// loop is a little bit tricky because, on the first iteration, it is **not** actually a //choice point//! Well, the difference between //zero or more// and //one or more// is exactly the same as the difference between a ''while'' loop and a ''do-while'' loop in Java or similar languages. To write:
+So, one should bear in mind that a //one or more// loop is a little bit tricky because, on the first iteration, it is **not** actually a //choice point//! But, then again, the difference between //zero or more// and //one or more// is exactly the same as the difference between a ''while'' loop and a ''do-while'' loop in Java or similar languages. So, in pseudo-code, the //one or more// would be:
 <code>
-    do {something();} while (someCondition());
+    do {inner_expansion();} while (inner_expansion_matches());
+    following_expansion();
 </code>
-is really just an abbreviated version of:
+which, again, is just an abbreviated way of writing:
 <code>
-    something(); while (someCondition()) something();
+     inner_expansion();
+     while (inner_expansion_matches()) inner_expansion();
+     following_expansion();
 </code>
 So, again, we see that the first iteration is actually executed //unconditionally//.
-A choice, such as ''expansion1 | expansion2 |...expansionN | ... final_expansion'' might as well be thought of as an if-else if-else construct. In a kind of pseudo-code:
+==== The Choice Construct ====
+This construct, something like ''expansion1 | expansion2 |...expansionN | ... final_expansion'' might as well be thought of as an if-elseif-else construct. Again, in pseudo-code:
 <code>
-    if condition1
+    if (expansion1matches())
-      expansion1
+      expansion1();
-    elseif condition2
+    else if (expansion2matches())
-      expansion2
+      expansion2();
     ....
-    elseif conditionN
+    else
-      expansionN
+      final_expansion();
-    ...
+</code>
-    else
-      final_expansion
-</code>
 We test the conditions in sequence until one is satisfied and then we enter that expansion and that's that.
-Well, all the above kind of begs the question of how the various conditions are determined. In the default case, this has a very simple answer:
+Well, all the above kind of begs the question of how we decide when an expansion matches the current input.
+In the default case, this has a very simple answer:
-==== The Default Choice Resolution Algorithm ====
+===== The Default Choice Resolution Algorithm =====
 The //default// choice resolution algorithm for whether we enter an expansion at a choice point is just to check ahead //exactly one token// and if that token //potentially// starts the expansion, then that is what we choose!
@@ Line 79: / Line 100: @@
 //The default assumption of a parser generated by JavaCC is that the grammar is **LL(1)**.//
-This is a fancy way of saying that looking ahead one token is enough to resolve any choices at any juncture in the grammar.
+(NB. I use this sort of terminology to demystify things. There is no need to be intimidated by this sort of thing. The above is just a fancy way of saying that looking ahead one token is enough to resolve any choices at any juncture in the grammar.)
-Well, that might or might not be true in any given spot, but if we don't specify any extra information, that is the //default assumption//. Another fancy, theoretical way of expressing this //default// would be:
+Now, to be clear, looking ahead one token might be sufficient or not in any given spot, but if we don't specify any extra information, that is the //default assumption//. Another fancy, theoretical way of expressing this //default// would be:
-//We enter an expansion at a choice point if the next token is in that expansion's **first set**.//
+//We enter an expansion at a choice point if the next token is in that expansion's [[first set]].//
-An expansion's //first set// is simply the set of tokens with which an expansion can begin. Depending how your brain is wired, it might be easier to think about it the other way round. Like so:
+Again, no need to be intimidated by the lingo. An expansion's [[first set]] is simply the set of tokens with which an expansion can begin. Depending how your brain is wired, it might be easier to think about it the other way round. Like so:
-//If the next token is **not** in an expansion's **first set**, the expansion cannot possibly be matched, so we skip ahead to the following choice.//
+//If the next token is **not** in an expansion's [[first set]], the expansion cannot possibly be matched, so we skip ahead to the following choice.//
-Now, it may be the case that more than one of the expansions at a given choice point matches this condition. Well, in that case, the first one gets it. This is really no different, by the way, from how if-elseif-else in a procedural programming language works.
+Now, regardless, it may be the case that more than one of the expansions at a given choice point matches this condition. Well, in that case, we have a secondary rule: //the first one gets it//. And this is really no different, by the way, from how if-elseif-else in a procedural programming language works.
 If you write:
@@ Line 98: / Line 119: @@
 </code>
-then clearly the second choice, ''doSomethingElse()'' is never entered, because if the condition ''x>1'' is satisfied, then ''x>0'' would also have been satisfied, so we would have already entered ''doSomething()'' and subsequently jumped out of the if-else construct!
+then clearly the second choice, ''doSomethingElse()'' is never entered, because if the condition ''x>1'' is satisfied, then ''x>0'' would also have been satisfied, so we would have already entered ''doSomething()'' and subsequently jumped out of the if-else construct! Or, another way of putting this is that if x is 2, say, the first and second conditions match, but the first one gets it!
-Well, nobody considers that a very big deal, or to be that confusing, but legacy JavaCC was based on the notion that this was something that trips people up and would emit warnings in analogous situations. JavaCC 21 does not bother. At some later point, some checks will probably be put in for emitting warnings when code is unreachable, but it is not a high priority either. This was actually quite annoying in legacy JavaCC, because most of the time, the warnings emitted were complete spurious!
+A funny thing about all this is that, in the context of procedural programming languages, nobody considers this to be a big deal, or to be confusing. However, the legacy JavaCC tool does make a big deal out of this, always emitting annoying warnings in analogous situations. JavaCC 21 does not bother to do so, at least, as I write these lines. At some later point, some checks will probably be put in place for emitting warnings when code is clearly unreachable, since that could be helpful in some spots. But it is not a high priority either. This was actually quite annoying in legacy JavaCC, because most of the time, the warnings emitted were complete spurious!
-In any case, if we have a choice construct like:
+In any case, let's move to a concrete example. Suppose we have a choice construct like:
 <code>
@@ Line 112: / Line 133: @@
 </code>
-then yes, the second choice is clearly unreachable. If our choice resolution algorithm is to look ahead exactly one token, if the next token is "foo", we always enter the first choice and never the second one.
+then yes, the second choice is clearly unreachable. Given that our //default// choice resolution algorithm is to look ahead exactly one token, if the next token is "foo", we always enter the first choice and never the second one.
 So, this leads us naturally to the next question, which is:
-//If the **default** resolution algorithm is not sufficient, i.e. the grammar is not **LL(1)** at that juncture, how do we override the default and get the behavior we actually want?//
+//If the **default** resolution algorithm is not sufficient, (i.e. expressing things the egghead academic way, the grammar is not **LL(1)** at that juncture) how do we override that default and get the behavior we actually want?//
-The answer is that we do this by defining some sort of //lookahead//. (Note that //lookahead// is legacy JavaCC terminology but is frequently inaccurate. Really, we're talking about defining some sort of condition or //predicate// that overrides the default resolution algorithm, and that could well involve looking ahead more than the default 1 token, or it could involve something else.) Probably the term //predicate// is more accurate, but that is a bit of a fancy word. Probably just //condition// is better, but I may well use the term //lookahead// frequently in these spots because, yes, resolving the choice very often does involve looking ahead more than one token. (//But not always.//)
+The answer is that we do this by defining some sort of //lookahead//. (Note that //lookahead// is legacy JavaCC terminology but is frequently inaccurate. Really, we're talking about defining some sort of condition or //predicate// that overrides the default resolution algorithm, and that could well involve looking ahead more than the default of 1 token, or it could involve something else.) Probably the term //predicate// is more accurate, but that is a bit of a fancy word. Maybe just //condition// is better, but I may well use the term //lookahead// frequently in these spots because, yes, resolving the choice very often does involve scanning ahead more than one token. (//But not always.//)
-==== Definite Numerical Lookahead ====
+===== Numerical Lookahead =====
-In the case given above, where we have the unreachable second expansion, the simplest solution would simply be to specify that we should scan ahead two tokens on the first expansion. So we could write the above choice as:
+In the case given above, where we have that unreachable second expansion, the simplest solution would simply be to specify that we should scan ahead two tokens on the first expansion. So we could write the above choice as:
 <code>
-    SCAN "foo" "bar" ...
+    SCAN 2 "foo" "bar" ...
     |
     "foo" "baz" ...
@@ Line 132: / Line 153: @@
 </code>
-(Note that I use the newer [[SCAN statement]] here. In //legacy JavaCC// this would be written as: ''LOOKAHEAD(2) "foo" "bar" '' //etc//
+(Note that I use the newer [[SCAN statement]] here. In //legacy JavaCC// this would be the more verbose:  ''LOOKAHEAD(2) "foo" "bar" '' //etc// )
 Regardless, once we specify two tokens of lookahead on the first choice, we only enter it if the next two tokens are "foo" followed by "bar", so if the next two tokens happen to be "foo" followed by something other than "bar" we enter the second choice.
-Now, here is an interesting point: if the next two tokens are "foo" and "bat", then the second choice is incorrect as well, but since we did not specify any lookahead, we enter the second expansion based on it matching the first token. (//That's all we check by **default**.//) Of course, we have a problem when we //consume// the "foo" and the next token is not what we want, a "bat" instead of a "bar".
+Now, here is an interesting point: if the next two tokens are "foo" and "bat", say, then the second choice is incorrect as well, but since we did not specify any lookahead there, we enter the second expansion based on it matching the first token. (Again: //that's all we check **by default**.//) Of course, we have a problem when we //consume// the "foo" and the next token is not what we want -- a "bat" instead of a "bar".
+BUT... here is an important conceptual issue in terms of understanding the logic of a parser generator like this:
+//In terms of the topic of this page, choice points and lookaheads, this is simply **not** an issue!//
+(//Huh?//)
+It's really quite simple, you see. The second token, "baz", in the second expansion choice is //not a choice point//! So, yes, the generated parser at this point is expecting a "baz" but encountering a "bat". So yes, it will complain! (//Okay, when I say "complain", that is an anthropomorphic way of saying that an exception will be thrown.//)
+But that is all right and proper. Look at it this way... the above grammar snippet says that we look ahead 2 tokens to decide whether to enter the first expansion. Since we didn't specify anything for the second expansion, it now just looks ahead one token, and if an error occurs after that, then so be it. The parser is doing what it is supposed to do, right? You see, in this case, none of the three given choices can match the "foo" "bat" sequence anyway! So, end result: it throws an exception, **which is what it should do**. But the point is that we can have '' "foo" "bar" '' here or '' "foo" "baz" '' and the generated parser will handle both cases. Our first go at this, without the lookahead specified, it would spuriously reject a "foo" "bar", **even though it is valid input**! If it rejects "foo" "bat" that is not really an issue //in terms of the problem we have set ourselves here// because **that is not valid input anyway**!
+Well, if you don't currently understand my point there, then all I can say is just think about it. It will eventually become clear...
+===== More than one way to skin a cat... =====
+The //definite numerical lookahead// of two tokens worked okay in the above example, but generally speaking, it is a rather crude disposition. The legacy JavaCC tool provides two other ways to specify how we resolve a choice -- when the default resolution is not good enough. In the original, somewhat inaccurate terminology, they are called //syntactic// and //semantic// lookahead. JavaCC 21, meanwhile, provides two other tools, [[contextual_predicates]] and [[up to here]].
+==== Syntactic Lookahead ====
+//Syntactic// lookahead allows you to specify a separate //lookahead expansion// which is checked as a condition at the choice point. We could have used this in the previous example, instead of writing ''SCAN 2'', we could write:
+<code>
+    SCAN "foo" "bar" => "foo" bar" ....
+    |
+    "foo" "baz"
+    |
+    "bar" "bat"
+</code>
+This means that we scan ahead to see if the input matches the expansion "foo" "bar", and if so, we enter the first choice. This is effectively the same as what we had before, since matching "foo" "bar" is the same as scanning ahead and matching the first two tokens. In the general case, however, syntactic lookahead is much more powerful because the lookahead expansion can be of variable length, and also it can match input of indefinite length. For example, consider:
+<code>
+   ( SCAN ( "foo" )+ "bar" => Foobar )?
+</code>
+The above is a //zero or one// choice and we scan ahead to check whether the coming input is //one or more// "foo"s followed by a "bar". There is no precise amount of numerical lookahead that would handle this case if we really are going to accept any number of "foo"s followed by a bar! The above generates a scan routine that scans forward past all the "foo" tokens and then checks if the very next token is a "bar" and only then, enters the ''Foobar'' production. So, if the input is "foo" "bar", that matches. If the input is "foo" "foo" "foo" "foo" "bar" that matches too. But if the input is "foo" "foo" ... and a thousand more "foos" followed by a "baz", then that is no good.
+In short, the above //syntactic// lookahead generates code that scans as far forward as necessary. That might just be two tokens, but it could be 2000 tokens! (In practice that is probably rare, but at least in theory...)
+(N.B. Sometimes, it is easier to express a condition by what it is not than by what it is. So, in JavaCC, you can put a tilde "~" in front of the //lookahead expansion// to indicate that you only match if it does **not** match. Note, however, that **negative** //syntactic lookahead// does not exist in legacy JavaCC.)
+==== Semantic Lookahead ====
+//Semantic// lookahead is actually a misnomer. It just means that the condition is expressed with arbitrary Java code. So, it might look ahead or it might do something else entirely. It's just Java code that is injected in this spot to resolve the choice at a choice point. It looks like:
+<code>
+    SCAN {someCondition()} => Foo Bar Baz
+</code>
+The code in the braces is just any Java expression that resolves to a boolean (true/false) value, a black box from the point of view of our parser generator.
+==== Newer dispositions in JavaCC 21 ====
+Since the //lookahead expansion// when using syntactic lookahead so frequently is substantially (or exactly) the same as the expansion to be parsed, JavaCC 21 has the [[up to here]] syntax to express these conditions more tersely. For example, in legacy JavaCC, one frequently had constructs like:
+<code>
+   [ LOOKAHEAD (Foo() Bar()) Foo() Bar() Baz() ]
+</code>
+In JavaCC 21 this is expressed more tersely using the [[new syntax summary|new streamlined syntax]]:
+<code>
+   [ SCAN Foo Bar => Foo Bar Baz ]
+</code>
+However, this is still annoyingly repetitious, so the [[up to here]] syntax was introduced for these cases and we can write:
+<code>
+    [ Foo Bar =>|| Baz ]
+</code>
+In very many common usage cases, the [[up to here]] syntax removes the need to write separate //syntactic// or //numerical// lookaheads.
-BUT... in terms of the topic of the page, this is not a problem! You see, the second token, "baz" in the second expansion is //not a choice point//! So, yes, the generated parser at this point is expecting a "baz" but encountering a "bat". So it will complain! (That is an anthropomorphic way of saying that an exception will be thrown.)
+By the same token, //semantic lookahead//, i.e. using arbitrary Java code to express conditions should always be a last resort. The new [[contextual_predicates]] feature in JavaCC 21 was designed to drastically reduce the cases in which it is necessary.)
-Or look at it this way... the above grammar snippet says that we look ahead 2 tokens to decide whether to enter the first expansion. Since we didn't specify anything for the second expansion, it look ahead one token. If an error occurs after that, so be it. The parser is doing what it is supposed to do, right? There is no expansion of the three given choices that can match the "foo" "bat" sequence anyway! So it throws an exception. But the point is that we can have '' "foo" "bar" '' here or '' "foo" "baz" '' and the generated parser will handle both cases. Our first go at this, without the lookahead specified, it would spuriously reject a "foo" "bar", which is valid input. If it rejects "foo" "bat" that is not really a problem in terms of the problem we have set ourselves here.
+A general rule of thumb would be to use [[up to here]] and [[contextual_predicates]] constructs whenever possible instead of //syntactic// and //semantic// lookahead respectively. (Though, granted, if it is not possible, then it's not possible.)
-(TBD: talk about //syntactic// and //semantic// lookahead, both of which exist in legacy JavaCC. Also point people to the page on [[lookbehind]] and [[up to here]].)

Tools

menus and quick search

quick search

site status

Page Tools

meta data for this page

Differences