CMPS 450 - Week 5 - Ch 4.1-4.3 "Syntax Analysis"

resources:
bison parser
gnu bison manual

Ch 4.1 "Introduction"

Parsing Terminology Review (from Dragon Chapter 2)

A parser accepts or rejects a string of terminal symbols in a language by following the rules of the BNF/attribute grammar defining the language. Parsing can occur top-down, where the root of the parse tree is constructed first, or bottom-up, where the leaf nodes of the parse tree are constructed first. We are interested in the Programming languages are context-free languages thus require parsing techniques modeled on context-free grammars. The context sensitive elements in programming languages are taken care of through the use of attribute values that are trickled up or down the parse tree and/or a symbol table..

Top-down parsing starts with the starting symbol of the grammar and applies productions in a left-to-right derivation to construct the parse tree. The leaf nodes of the parse tree are the terminals of the language. Each internal node of the parse tree, with the exception of the root, is a LHS (left hand side) of some production rule. Sibling nodes (children of the same parent node) share the same RHS (right hand side) of some production. The parser stops when all leaf nodes on the tree are terminal symbols.

While parsing you often have a choice of which nonterminal to proceed with next. Proceeding with the leftmost nonterminal is called a leftmost derivation. Proceeding with the rightmost nonterminal is a rightmost derivation. A leftmost derivation is the default method and is called the leftmost canonical form.

Recursive Descent Parsing is a top-down parsing technique in which every production involves a function call. An alternative to recursive descent is to use a parsing function in the form of a parsing table.

Predictive parsing (covered in later sections of ch 4) is a form of recursive descent in which the lookahead symbol unambiguously determines the flow from LHS to RHS of the BNF rules. Without predictive parsing, top-down algorithm may involve backtracking if the incorrect rule is selected.

LL(1) is a class of context-free grammars to which predictive parsing can be applied. The first 'el' denotes a left-to-right scan. The second 'el' denotes a leftmost derivation. The '1' denotes one lookahead symbol. One lookahead symbol means only one input symbol of lookahead is used at each step to make a parsing decision. LL(1) grammars are parsed with a left-to-right scan, a leftmost derivation and one lookahead symbol. LL(1) grammars cannot be left recursive since the leftmost nonterminal is the same as the LHS. This would result in infinite recursion. There are techniques to remove left recursion.

Bottom-up parsers that use a left-to-right scan and a rightmost derivation can parse LR(1) grammars. 'L' denotes left-scan and 'R' denotes rightmost derivation. Since scanning is rightmost an LR(1) parser allows production rules that are left recursive.

Error Recovery Strategies

1. Panic - Mode Recovery is an error recovery strategy in which the parser discards tokens one at a time until a designated set of synchronizing tokens is found. Example:

    { int a = 4.5; // valid code
      9stu += sum; // invalid ident causes parser to discard entire statement
      a = 99.2;    // parser restarts parsing here 
    }              
2. Phrase - Level Recovery is an error recovery strategy that attempts some correction on input after the error; e.g.
 
    { int a = 4.5  // parser inserts the missing ; and continues
      sum += sum;  
    }
This strategy is not implemented in compilers for high-level languages. It is implemented by some scripting language interpreters and HTML parsers.

3. Error Productions is an error recovery strategy that defines common errors in the language in advance; i.e., define the error as a BNF rule and trap for it. You will implement this technique in your parser for certain error conditions.

4. Global Correction is a costly error recovery strategy that is of theoretical interest only. We will not cover this strategy.

You can implement error-recovery in your parser by using any combinations of the above strategies.

Ch.4.2 "Context Free Grammars"

A regular grammar formally describes the tokens (lexical components) of a language. The syntax of the language is formally described by a context-free grammar (CFG). A regular grammar can always be converted into a CFG, but not the reverse is never true. A Context Free Grammar has less restrictive production rules than a Regular Grammar.

A production rule in a regular grammar must have a single nonterminal on the LHS and a single terminal on the RHS or a single terminal followed or prefaced by one nonterminal on the RHS. All regular grammars can be expressed as a regular expression. Example:

 
          REGULAR GRAMMAR
          S -> aS | a | e   L(R) = {a, aa, aaa, aaaa, ...}    
          regex: a* 
Production rules in a CFG are more complex and thus can describe languages with richer syntax. A rule must have a single nonterminal on the LHS but can have one or more nonterminals or terminals on the RHS. The RHS does not have to begin with a terminal.
         CONTEXT FREE GRAMMAR       
         S -> aS | Sb | b           L(G) = {abb, aabbb, ababb, ... }     
Every valid string in a language that can be described by a regular language can also be described by a CFG.
Example.
Let Σ = {a,b}. Design a regex that matches all possible strings over Σ 
including the empty string followed by a single 'a'. 

L = { a, aa, ba, aaa, aba, baa, bba, ... }

The corresponding regex is (a|b)*a

Design a CFG for the same language:

   CFG:
   S -> Ta
   T -> aT | bT | a | b | e
 
   Test Parse:  babaa

             S
           /  \
          T    a
         / \
        b   T 
           / \          
          a   T  
             / \   
            b   T     
                |
                a  
On the other hand, CFGs cannot be defined by regexes. An easy example is a^nb^n, a >= 1. This language can easily be described with a CFG
          S -> aSb | ab  = {ab, aabb, aaabbb, ... }  
but it is not possible with a regex since by the time scanning starts with the list of 'b's, the scanner does not *remember* how many 'a's have already been scanned.

A yet more restrictive classification of grammars is context-sensitive grammars. The classic example is:

   Language L is defined as a^nb^nc^n, for n >= 1.
 
         L = {abc, aabbcc, aaabbbccc, aaaabbbbccc, ...}  
It is possible to define this language with a context-sensitive grammar but NOT possible with a CFG. Since production rules in a context-sensitive grammar are complicated a more intuitive approach is to construct a parser based on a CFG for a language that is complete (accepts all valid strings) but not accurate (accepts some invalid strings). The invalid strings are identified through the use of constraints and attribute values that trickle around the parse tree. Attribute values provide a way of validating context-sensitive predicates during parsing.

In the language a^nb^nc^n the parser needs to keep a running total of the number of 'a's, the number of 'b's and the number of 'c's in the input. During parsing these counts are percolated around the parse tree and used to enforce the semantic (context-sensitive) information in the grammar.

Ch.4.3 "Writing a Grammar"

The underlying CFG for a parser can NEVER be ambiguous. Recall that a left or right recursive grammar is not necessarily ambiguous.

Ambiguity

An ambiguous grammar cannot be implemented by a parser unless you can play fast and loose with the semantic meaning of the program without problems. A grammar is ambiguous if two different parse trees can be generated from the same sentence. Generally, you can uncover ambiguity by generating a left derivation and a right derivation for the same sentence. Sometimes two left- most derivations produce different trees. If left and right derivations generate the same parse tree, the grammar is *most likely* unambiguous, but not always. Another example.
 
         <list> -> <list>, <list> | <item>
         <item> -> i    
Leftmost derivation for
            i, i, i                              
                                                     
          <list> -> <list>, <list>          
                 -> <list>, <list>, <list>   
                 -> <item>, <list>, <list>  
                 -> i, <list>, <list>           
                 -> i, <item>, <list>          
                 -> i, i, <list>
                 -> i, i, <item>
                 -> i, i,  i 

Leftmost Parse Tree:

           list
         /  |  \
      list  ,  list
     / | \       | 
  list , list    i
   |      |
   i      i
Rightmost derivation for
          i, i, i 

          <list> -> <list>, <list>                      
                 -> <list>, <list>, <list>   
                 -> <list>, <list>, <item> 
                 -> <list>, <list>, i     
                 -> <list>, <item>, i  
                 -> <list>, i, i 
                 -> <item>, i, i 
                 -> i, i,  i  

Rightmost Parse Tree:

           list
         /  |  \
      list  ,  list
       |       /  | \ 
       i     list ,  list
              |       |
              i       i

Obviously an ambiguous grammar.

**********
EXAMPLE 2.
**********

Below are two grammars, G and G', each of which generates strings of correctly
balanced parentheses. Both are recursive but only one is ambiguous. The letter 
'e' denotes epsilon, the empty string.

----------
Grammar G 
----------
S -> SS | (S) | e

Derive the string () () ()

Leftmost derivation and tree
-----------------------------
         S -> SS 
           -> SSS 
           -> (S)SS
           -> (e)SS
           -> (e)(S)S
           -> (e)(e)S
           -> (e)(e)(S)
           -> (e)(e)(e)

                     S
                    /  \
                  S     S
                /  \    |
               S    S  (S)
               |    |   |
              (S)  (S)  e 
               |    |         
               e    e       

Rightmost derivation and tree
-----------------------------
         S -> SS 
           -> S(S) 
           -> S(e) 
           -> SS(e) 
           -> S(S)(e)
           -> S(e)(e) 
           -> (S)(e)(e) 
           -> (e)(e)(e)

                    S
                  /   \
                 S     S
                 |    / \ 
                (S)  S  (S)
                 |   |   |
                 e  (S) (S) 
                     |   |        
                     e   e  
You can remove ambiguity from Grammar G by creating a grammar that enforces the order of evaluation; E.g., in Grammar G' below, parentheses associate to the right.
------------  
 Grammar G' 
------------  
S -> (S)S |  e

Regardless of a leftmost or rightmost derivation, you only produce one tree:
 
  Input string:  ()()()
   
   Leftmost parse:                      Rightmost parse:
                       S                                  S 
                     /  \                                / \
                   (S)   S                             (S)  S
                    |   /  \                            |  /  \
                    e  (S)  S                           e (S)  S
                        |  / \                             |  / \ 
                        e (S) S                            e (S) S 
                           |  |                               |  |
                           e  e                               e  e
Grammar G' is no longer ambiguous because (S) and S are different tokens. S goes to either (S) or epsilon but not both at the same time.

Problem 1. Consider the following specification of expressions of this form:


              9
              7 - a 
              9 + 8 - c
              a - b + 9

    <expr> -> <element> | <expr> <weak-op> <expr>
    <element> -> 0..9 | a..z 
    <weak-op> -> + | -
Two derivation trees for the expression a - b - c demonstrate that the grammar is ambiguous.
    Tree 1. Rightmost derivation.

                           <expr> 
                     /        |     \    
                <expr>    <weak-op>   <expr>
              /   |  \        |        |
      <expr> <weak-op> <expr> -      <element>
        |         |      |             
   <element>      -   <element>     
Tree 2. Leftmost derivation (swap left and right subtrees under the root)
                         <expr> 
                     /     |       \    
                 <expr> <weak-op>    <expr>
                   /       |       /   |      \             
             <element>     -  <expr> <weak-op> <expr>     
                                |        |       |          
                             <element>   -    <element>   
To remove ambiguity replace the ambiguous rule
            <expr> -> <expr> <weak-op> <expr>   
with a rule that does not duplicate the nonterminal in the RHS:
             <expr> -> <expr> <weak-op> <term>  

Precedence & Associativity

This BNF grammar defines expressions with operations *, -, + and variables a,b,c,d. The order of operations (precedence and associativity) is controlled by the grammar rules.

    <expr> -> <thing> | <thing> * <expr>
    <object> -> <element> | <element> - <object>
    <thing> -> <object> | <thing> + <object>
    <element> -> a | b | c | d | (<object>)
The order of precedence among the three operations, lowest to highest is - (minus) + (plus) * (multiplication)

The associativity of operations is:

* (multiplication) is executed right-to-left  (right associative)
- (minus) is executed right-to-left  (right associative)
+ (plus) is executed left-to-right  (left associative) 
Since the parentheses operation is defined in the lowest rule, the operation inside the parentheses will be performed first. The parentheses will override default precedence or associativity.

Removing Left Recursion

Left recursion is a problem for recursive-descent parsers even if the grammar is unambiguous. Leftmost derivation results in infinite recursion.
Left recursion (direct or indirect) in a grammar is a problem for top-down 
parsing. Direct means recursion occurs within 1 production rule. Indirect means
recursion occurs within 2 or more production rules.  

       +
     S -> Sa    (plus means in 1 or more production rules) 
 
Ex.  S -> Aa | b
     A -> Ac | Sd | e, where e is epsilon   
A right recursive grammar involves rules of this form:
       +
     S -> aS
                                            Old.            New.
  Example. Parse: a + b + c                  E               E
  Rule 1. E -> E + T                       / | \           /   \ 
  Rule 2. E -> T                          E  +  T         T     E'
  Rule 3. T -> a | b | c                / | \   |        /    / |  \
  Replace Rule 1 & 2 with:             E  +  T  c       a    +  T   E'
  E -> TE'                             |     |                 /   /| \ 
  E'-> +TE' | e                        T     b                b   + T  E'
  T -> a | b | c                       |                            |   \ 
                                       a                            c    e
Observations.
Removing left recursion as shown above produces a parse tree that maintains the semantics of the original grammar (the evaluation occurs in the same order) even though the trees differ.

The parse tree in the old grammar grows from the bottom up; e.g., parsing a + b + c + a adds an E -> E + T production to the top of the existing tree. The new grammar grows from the top down; e.g., parsing a + b + c + a replaces the E' -> ε in the rightmost lowest production with an E' -> +TE' production. This pattern can be applied to remove left left recursion from any grammar. You simply remove the recursive symbol and any terminals that follow it. Replace that chunk (in this case E+) with a new non-terminal symbol call it E' that derives to +TE' (essentially flipping the recursion from left to right).

---------------------------
| General Algorithm 4.19  |
---------------------------
                                   +
Input: Grammar G with no cycles (A -> A) or e-productions.
Output: Equivalent Grammar with no left recursion.

Notation: A is a single nonterminal, a and B are strings of terminals and 
nonterminals and B does not begin with A.

Step 1. Order all nonterminals A_1, A_2, A_3, ... A_n.

Step 2. Execute the following:

  for (each i from 1 to n)  {
        for (each j from 1 to i - 1) {
             replace each production of the form A_i -> A_jB by productions
                  A_i -> a_1B | a_2B | ... | a_kB, where   
                  A_i -> a_1 | a_2 | ... | a_k, are all current A_j-productions
        }
        eliminate immediate left recursion among A_i-productions
   }

  Example 1.
                S -> Aa | b
                A -> Ac | Sd | e

  Order nonterminals S,A. S has no left recursion in outer loop i = 1. 
  For i = 2 substitute for S in A -> Sd :

                S ->Aa | b
                A -> Ac | Aad | bd | e 

  We are done with S. Now take care of left-recursion in A in the new rules:  

                A -> Ac | Aad | bd | e 

  We need a new nonterminal A'.               

               A  -> bdA' | A'
               A' -> cA'  | adA' | e

 Example 2.

   A -> Aa | Ba | c
   B -> Bb | Ab | d     

  Order nonterminals A,B. After the outer loop executes for A: 
   A -> cA' | Ba | c
   A'-> aA' | e  

Left Factoring to Eliminate Ambiguity

There are two methods to remove ambiguity:
#1. Add a rule outside the grammar to specify which parse tree is correct.
    (C does this to solve the dangling else problem)

#2. Change the grammar to force the construction of the correct parse tree.

Left factoring is an example of approach #2.

When given a choice between A-productions, rewrite the productions to defer
the decision until more input comes in.

Ex.
    <stmt> -> if <expr> then <stmt> else <stmt>  | if <expr> then <stmt>

Change the grammar so you know whether to use the first or the second rule.

Method. For each nonterminal A, find the longest prefix alpha that is common
to two or more alternatives. If there is a non-trivial common prefix (alpha 
!= epsilon), replace all A-productions A -> alphaBeta_1 | alphaBeta_2 | ...
alphaBeta_n | gamma, where gamma represents all alternatives that do not begin
with alpha, with

        A -> alphaA' | gamma
        A' -> Beta_1, | Beta_2 | .. | Beta_n.

 Example. The IF ELSE THEN where i is IF t THEN e is ELSE.

        S -> iEtS | iEtSeS | a
        E -> b

 Step 1. factor out iEtS:  iEts(ε | eS)

 Step 2. create a new production S' to handle rules inside parentheses:
                S' -> ε  | eS

 Step 3. Insert S'-production back into the original grammar:

           S  -> iEtSS' | a
           S' -> eS | ε
           E -> b