meta data for this page
Differences
This shows you the differences between two versions of the page.
Both sides previous revisionPrevious revisionNext revision | Previous revision | ||
choice_points [2020/09/26 19:54] – revusky | choice_points [2021/02/08 18:09] (current) – ↷ Links adapted because of a move operation revusky | ||
---|---|---|---|
Line 2: | Line 2: | ||
A //choice point// in a JavaCC grammar is a juncture where the parser must decide between two or more expansions. There are four kinds of choice point: | A //choice point// in a JavaCC grammar is a juncture where the parser must decide between two or more expansions. There are four kinds of choice point: | ||
- | * An actual choice, as in '' | + | * An optional expansion, i.e. the expansion |
- | * An optional | + | |
* A repeated expansion that is possibly empty, a.k.a. //zero or more// which is written as '' | * A repeated expansion that is possibly empty, a.k.a. //zero or more// which is written as '' | ||
* A repeated expansion that cannot be empty, a.k.a. //one or more// which is written as '' | * A repeated expansion that cannot be empty, a.k.a. //one or more// which is written as '' | ||
+ | * The expansions within a //choice construct//, | ||
- | All of the above constitute choices. The first case is self-explanatory, | + | (Perhaps needless to say, all of these constructs arbitrarily combine |
- | The second | + | I think the simplest way to think about the above cases (for most people, anyway) is just in terms of their analogues in a procedural programming language. The first three cases are effectively binary choices, a choice between entering the expansion inside the parentheses and jumping directly to what follows it. In all of these cases, the expansion within the parentheses is a choice point. |
+ | |||
+ | ==== Zero Or One ==== | ||
+ | |||
+ | A // | ||
< | < | ||
- | ( some_expansion | + | ( inner_expansion |
</ | </ | ||
- | the parser must choose between the first expansion | + | This is quite analogous to a simple |
+ | |||
+ | < | ||
+ | if (inner_expansion_matches()) inner_expansion(); | ||
+ | following_expansion(); | ||
+ | </ | ||
+ | |||
+ | ==== Zero Or More ==== | ||
+ | |||
+ | A //zero or more// is basically the same as a //zero or one// **except** that it is not a one-time choice, but a loop. So, the following: | ||
+ | |||
+ | < | ||
+ | ( inner_expansion )* following_expansion | ||
+ | </ | ||
- | That is //zero or one//. A //zero or more// is basically | + | is a loop that, on each iteration, chooses between |
< | < | ||
- | ( loop_expansion | + | |
+ | | ||
</ | </ | ||
- | is a loop that, on each iteration, chooses between the '' | + | ==== One or More ==== |
A //one or more// is like a //zero or more// loop except that the first iteration is obligatory. Thus: | A //one or more// is like a //zero or more// loop except that the first iteration is obligatory. Thus: | ||
Line 31: | Line 49: | ||
</ | </ | ||
- | is effectively | + | can be considered as just a more terse way of writing: |
< | < | ||
Line 37: | Line 55: | ||
</ | </ | ||
- | So, one should bear in mind that a //one or more// loop is a little bit tricky because, on the first iteration, it is **not** actually a //choice point// | + | So, one should bear in mind that a //one or more// loop is a little bit tricky because, on the first iteration, it is **not** actually a //choice point// |
< | < | ||
- | do {something();} while (someCondition()); | + | do {inner_expansion();} while (inner_expansion_matches()); |
+ | following_expansion(); | ||
</ | </ | ||
- | is really | + | which, again, |
- | < | + | < |
- | | + | inner_expansion(); |
+ | while (inner_expansion_matches()) inner_expansion(); | ||
+ | | ||
</ | </ | ||
So, again, we see that the first iteration is actually executed // | So, again, we see that the first iteration is actually executed // | ||
- | A choice, such as '' | + | ==== The Choice Construct ==== |
+ | |||
+ | This construct, something like '' | ||
< | < | ||
- | if condition1 | + | if (expansion1matches()) |
- | expansion1 | + | expansion1(); |
- | | + | |
- | expansion2 | + | expansion2(); |
.... | .... | ||
- | | + | else |
- | expansionN | + | final_expansion(); |
- | ... | + | </ |
- | | + | |
- | final_expansion | + | |
- | </ | + | |
We test the conditions in sequence until one is satisfied and then we enter that expansion and that's that. | We test the conditions in sequence until one is satisfied and then we enter that expansion and that's that. | ||
- | Well, all the above kind of begs the question of how the various conditions are determined. In the default case, this has a very simple answer: | + | Well, all the above kind of begs the question of how we decide when an expansion matches |
+ | In the default case, this has a very simple answer: | ||
- | ==== The Default Choice Resolution Algorithm ==== | + | ===== The Default Choice Resolution Algorithm |
The //default// choice resolution algorithm for whether we enter an expansion at a choice point is just to check ahead //exactly one token// and if that token // | The //default// choice resolution algorithm for whether we enter an expansion at a choice point is just to check ahead //exactly one token// and if that token // | ||
Line 79: | Line 100: | ||
//The default assumption of a parser generated by JavaCC is that the grammar is **LL(1)**.// | //The default assumption of a parser generated by JavaCC is that the grammar is **LL(1)**.// | ||
- | This is a fancy way of saying that looking ahead one token is enough to resolve any choices at any juncture in the grammar. | + | (NB. I use this sort of terminology to demystify things. There is no need to be intimidated by this sort of thing. The above is just a fancy way of saying that looking ahead one token is enough to resolve any choices at any juncture in the grammar.) |
- | Well, that might or might not be true in any given spot, but if we don't specify any extra information, | + | Now, to be clear, looking ahead one token might be sufficient |
- | //We enter an expansion at a choice point if the next token is in that expansion' | + | //We enter an expansion at a choice point if the next token is in that expansion' |
- | An expansion' | + | Again, no need to be intimidated by the lingo. |
- | //If the next token is **not** in an expansion' | + | //If the next token is **not** in an expansion' |
- | Now, it may be the case that more than one of the expansions at a given choice point matches this condition. Well, in that case, the first one gets it. This is really no different, by the way, from how if-elseif-else in a procedural programming language works. | + | Now, regardless, it may be the case that more than one of the expansions at a given choice point matches this condition. Well, in that case, we have a secondary rule: //the first one gets it//. And this is really no different, by the way, from how if-elseif-else in a procedural programming language works. |
If you write: | If you write: | ||
Line 98: | Line 119: | ||
</ | </ | ||
- | then clearly the second choice, '' | + | then clearly the second choice, '' |
- | Well, nobody considers | + | A funny thing about all this is that, in the context of procedural programming languages, nobody considers |
- | In any case, if we have a choice construct like: | + | In any case, let's move to a concrete example. Suppose |
< | < | ||
Line 112: | Line 133: | ||
</ | </ | ||
- | then yes, the second choice is clearly unreachable. | + | then yes, the second choice is clearly unreachable. |
So, this leads us naturally to the next question, which is: | So, this leads us naturally to the next question, which is: | ||
- | //If the **default** resolution algorithm is not sufficient, i.e. the grammar is not **LL(1)** at that juncture, how do we override | + | //If the **default** resolution algorithm is not sufficient, |
- | The answer is that we do this by defining some sort of // | + | The answer is that we do this by defining some sort of // |
- | ==== Definite | + | ===== Numerical Lookahead |
- | In the case given above, where we have the unreachable second expansion, the simplest solution would simply be to specify that we should scan ahead two tokens on the first expansion. So we could write the above choice as: | + | In the case given above, where we have that unreachable second expansion, the simplest solution would simply be to specify that we should scan ahead two tokens on the first expansion. So we could write the above choice as: |
< | < | ||
- | SCAN " | + | SCAN 2 " |
| | | | ||
" | " | ||
Line 132: | Line 153: | ||
</ | </ | ||
- | (Note that I use the newer [[SCAN statement]] here. In //legacy JavaCC// this would be written as: '' | + | (Note that I use the newer [[SCAN statement]] here. In //legacy JavaCC// this would be the more verbose: '' |
Regardless, once we specify two tokens of lookahead on the first choice, we only enter it if the next two tokens are " | Regardless, once we specify two tokens of lookahead on the first choice, we only enter it if the next two tokens are " | ||
- | Now, here is an interesting point: if the next two tokens are " | + | Now, here is an interesting point: if the next two tokens are " |
+ | |||
+ | BUT... here is an important conceptual issue in terms of understanding the logic of a parser generator like this: | ||
+ | |||
+ | //In terms of the topic of this page, choice points and lookaheads, this is simply **not** an issue!// | ||
+ | |||
+ | (// | ||
+ | |||
+ | It's really quite simple, you see. The second token, " | ||
+ | |||
+ | But that is all right and proper. Look at it this way... the above grammar snippet says that we look ahead 2 tokens to decide whether to enter the first expansion. Since we didn't specify anything for the second expansion, it now just looks ahead one token, and if an error occurs after that, then so be it. The parser is doing what it is supposed to do, right? You see, in this case, none of the three given choices can match the " | ||
+ | |||
+ | Well, if you don't currently understand my point there, then all I can say is just think about it. It will eventually become clear... | ||
+ | |||
+ | ===== More than one way to skin a cat... ===== | ||
+ | |||
+ | The //definite numerical lookahead// of two tokens worked okay in the above example, but generally speaking, it is a rather crude disposition. The legacy JavaCC tool provides two other ways to specify how we resolve a choice -- when the default resolution is not good enough. In the original, somewhat inaccurate terminology, | ||
+ | |||
+ | ==== Syntactic Lookahead ==== | ||
+ | |||
+ | // | ||
+ | |||
+ | < | ||
+ | SCAN " | ||
+ | | | ||
+ | " | ||
+ | | | ||
+ | " | ||
+ | </ | ||
+ | |||
+ | This means that we scan ahead to see if the input matches the expansion " | ||
+ | |||
+ | < | ||
+ | ( SCAN ( " | ||
+ | </ | ||
+ | |||
+ | The above is a //zero or one// choice and we scan ahead to check whether the coming input is //one or more// " | ||
+ | |||
+ | In short, the above // | ||
+ | |||
+ | (N.B. Sometimes, it is easier to express a condition by what it is not than by what it is. So, in JavaCC, you can put a tilde " | ||
+ | |||
+ | ==== Semantic Lookahead ==== | ||
+ | |||
+ | // | ||
+ | |||
+ | < | ||
+ | SCAN {someCondition()} => Foo Bar Baz | ||
+ | </ | ||
+ | |||
+ | The code in the braces is just any Java expression that resolves to a boolean (true/ | ||
+ | |||
+ | |||
+ | |||
+ | ==== Newer dispositions in JavaCC 21 ==== | ||
+ | |||
+ | Since the //lookahead expansion// when using syntactic lookahead so frequently is substantially (or exactly) the same as the expansion to be parsed, JavaCC 21 has the [[up to here]] syntax to express these conditions more tersely. For example, in legacy JavaCC, one frequently had constructs like: | ||
+ | |||
+ | < | ||
+ | [ LOOKAHEAD (Foo() Bar()) Foo() Bar() Baz() ] | ||
+ | </ | ||
+ | |||
+ | In JavaCC 21 this is expressed more tersely using the [[new syntax summary|new streamlined syntax]]: | ||
+ | |||
+ | < | ||
+ | [ SCAN Foo Bar => Foo Bar Baz ] | ||
+ | </ | ||
+ | |||
+ | However, this is still annoyingly repetitious, | ||
+ | |||
+ | < | ||
+ | [ Foo Bar =>|| Baz ] | ||
+ | </ | ||
+ | |||
+ | In very many common usage cases, the [[up to here]] syntax removes the need to write separate // | ||
- | BUT... in terms of the topic of the page, this is not a problem! You see, the second | + | By the same token, //semantic lookahead//, i.e. using arbitrary Java code to express conditions should always be a last resort. The new [[contextual_predicates]] feature in JavaCC 21 was designed to drastically reduce the cases in which it is necessary.) |
- | Or look at it this way... the above grammar snippet says that we look ahead 2 tokens | + | A general rule of thumb would be to use [[up to here]] and [[contextual_predicates]] constructs whenever possible instead of // |
- | (TBD: talk about // | ||
- | |||