CMPS 450 Week 6

Ch 4.4 - 4.6 "Parsing"

resources:
llvm compiler infrastructure
ANSI C Language Reference
Parsing Techniques (pdf)

Syntax analysis occurs during parsing. The parser's job is to determine whether an input stream of terminal tokens meets the specifications of the context free grammar defined by the parser. Parsing can occur top-down or bottom-up. In top-down parsing the root of parse tree (the starting symbol of the grammar) is constructed first. In bottom-up parsing the leaf nodes of the parse tree (the terminals) are constructed first. An unambiguous grammar will produce the same parse tree during bottom-up and top-down parsing although the nodes on the tree will be added in a different order. Both forms of parsing use a left-to-right scan of the input stream. The context-sensitive components of a programming language (also called static semantics or attribute grammars) are covered in the next chapter.

Ch.4.4 "Top-Down Parsing"

In top-down parsing the leftmost nonterminal of a production rule in the language is always evaluated first. This technique corresponds to a leftmost derivation. When you construct a parse tree starting from the root as the starting symbol and follow a leftmost derivation you are essentially performing top-down parsing. There are two types of top-down parsing algorithms:

#1. recursive descent parsing (the syntax analyzer is based directly on the recursive rules in the BNF grammar describing the language) The implementation is a series of recursive calls mirroring the BNF rules. Recursive descent can involve backtracking if you pick the wrong rule.

#2. LL(k) parsing algorithms (the first 'el' means left-to-right scan and the second 'el' means leftmost derivation and the 'k' is the number of lookahead symbols). The lookahead symbol is the next token in the input string. LL(k) grammars define languages to which LL(k) parsing can be applied. The letter 'k' denotes the number of lookahead symbols needed to prevent backtracking.

Predictive parsing refers to the behavior of LL(k) parsers in in which the lookahead symbol unambiguously determines the flow from LHS (head) to RHS (body) of the BNF rules. Without predictive parsing, top-down algorithm may involve backtracing if the incorrect rule is selected. An LL(1) parser is a predictive parser that uses one (1) input symbol of lookahead at each step to make parsing decisions.

The big drawback to LL parsing is that left recursive rules must be removed. Left recursive rules result in infinite recursion.

RECURSIVE DESCENT

Grammar G. Σ = {+, -, *,(,)} 
      exp → exp addop term | term
      addop → + | -
      term → term mulop factor | factor
      mulop → *
      factor → (exp) | NUM
As an example, BNF rule term -> factor can be implemented with one symbol (token) of lookahead:
    //  holds next token in input
    TokenType token;     

    procedure factor  {
      case token of
       (: match ( '(' );
          exp();
          match( ')' ); 
       number:
          match (NUM);
       else
          error;
    }

 // match current token with expectedToken
 // advance input if OK else return err 
  match ( TokenType expectedToken ) {
    if token == expectedToken 
       then getToken;
    else 
       err;
  }         

To avoid infinite recursion you need to eliminate left recursion or replace BNF rule

         exp → exp addop term | term  
with an extended BNF (EBNF) rule:
      exp → term { addop term } 
where the curly braces expresses repetition of zero or more times. You can then safely use the EBNF rule in a top-down parser.

Top-Down LL(1) PARSING

Recursive descent works because the runtime stack handles the BNF recursive definitions. An alternative to recursive calls is to use a parsing table and an explicit stack to implement the BNF rules rather than relying on the runtime stack provided through recursive calls. LL(1) parsing is a top-down methodology that uses an explicit stack to handle the recursive BNF rules.
EXAMPLE.
             S → (S)S | e      // language of balanced parentheses 

Overview. Initially, the input buffer holds the string to be parsed, ending with $:

          () $ 
The stack starts with symbol '$' to denote the bottom of the stack. The starting rule is 'S' so push S onto the stack. Read input. On hitting '(' S evaluates to (S)S. Pop 'S' and push (S)S onto the stack in reverse order. Why reverse? Because you want the leftmost terminal symbol at the top of the stack as you read input. When you hit a match you pop and advance input. If the top of the stack is a nonterminal and has an epsilon rule, follow the epsilon rule and pop the nonterminal but don't advance input. Continue process until EOF or error.

 input: ()$

              (         
              S  S                
              )  )  )      
           S  S  S  S  S  
stack:     $  $  $  $  $ $
          -------------------
input:        (  )       $     Accept!
           1  2  3  4  5 6 

The above process is also shown below, where '$' denotes the stack bottom and EOF, and process begins by pushing the starting symbol onto the stack. Terminals are popped off the stack when the terminal matches input. Nonterminals are popped off the stack by the epsilon rule. This rule is triggered when the other rule that matches input.

       Parsing Stack   Input   Action
     1 $S              ()$     S → (S)S   <=pop S and push S)S( 
     2 $S)S(           ()$     match
     3 $S)S             )$     S→e
     4 $S)              )$     match
     5 $S                $     S→e
     6 $                 $     accept 

LL(1) Parsing Table.
Assume nonterminal A is at the top of the parsing stack. Upon receiving input a decision needs to be made about what rule to place on the stack. When a token (i.e., a terminal) is at the top of the stack, no decision is needed. The choices are loaded into a LL(1) parsing table, where M[N,T] is read as the Move on non-terminal N for the token T.

   ----------+-----------+--------+---------
    M[N,T]   |   (       |   )    |     EOF 
    ---------+-----------+--------+---------
     S       | S → (S)S  |  S → e |  S → ε 
    ----------------------------------------

The parsing table describes all actions that can be taken as parsing progresses.

Another example. Terminals '(' and ')' normally surrounding a boolean statement are omitted since there is no action associated with them. Notation: 'e' is epsilon.

        stmtif-stmt  | other
        if-stmtif bool stmt else-stmt
        else-stmtelse stmt | ε
        bool0 | 1 
Table 1. LL(1) parsing table constructed from the above grammar

M[N,T]if else0 1 EOF
stmt stmt → if-stmt        
if-stmt if-stmtif bool stmt else-stmt        
else-stmt   else-smtelse stmt
else-stmt → e
     
bool     bool→0 bool→1  

The two entries in M[else-stmt,else] correspond to the dangling else ambiguity.



Grammars that are not as simple as the ones above require the computation of lookahead sets called FIRST and FOLLOW in order to construct the parsing table.

The elements in FIRST and FOLLOW are terminals in the language.

Notation. For some grammar G:
Σ is the set of terminals in G.
N is the set of nonterminals in G.
X is a single grammar symbol from either Σ or N.
A is a single nonterminal grammar symbol.
a denotes a single terminal from Σ
α is a string of terminal or nonterminal grammar symbols.
β is a string of terminal or nonterminal grammar symbols.
ε is the empty string.
$ is the endmarker symbol (not a symbol in the grammar)

Definitions.
A sentential form is any valid RHS in a derivation of some input string. Sentential forms can contain terminals or nonterminals. The last sentential form is all terminals since it is the input string itself. For example, given grammar
        S -> 0Q 
        Q -> S1 | 1 
over the set of binary strings, a leftmost derivation of 0011 is this:
          S -> 0Q
            -> 0S1   // applying Q -> S1 produces sentential form 0S1
            -> 00Q1  // applying S -> 0Q produces sentential form 00Q1
            -> 0011  // applying Q -> 1  produces sentential form 0011 
************** * FIRST SETS * ************** Notation: *-> means 1 or more derivations. FIRST(X), where X is a single grammar symbol (nonterminal or terminal), is defined as the set of terminals that begin strings derived in one or more derivations from X.

FIRST(X), for terminal or nonterminal X, is computed as follows:

RULE 1. If X is a terminal then FIRST(X) = {X}.

RULE 2. If X is a nonterminal and X->Y_1 Y_2...Y_k is a production for some k>=1, then place a in FIRST(S) if for some i, a is in FIRST(Y_i) and ε is in all of FIRST(Y_1),...,FIRST(Y_i-1). If ε is in FIRST(Y_j) for all j = 1,2,...,k, then add ε to FIRST(X). If Y_1 does not derive ε, then add nothing more to FIRST(X), but if Y_1 *-> ε, then add FIRST(Y_2) and so on.

RULE 3. If X->ε is a production, then add ε to FIRST(X).

FIRST(α), α = X_1 X_2 . . . X_n, can now be defined as:

FIRST(α) contains FIRST(X_1) - {e}. For each i = 2,...,n, if FIRST(X_k) contains ε for all k=1,...,i-1, then FIRST(q) contains FIRST(X_i) - {e}. Finally, if for all i = 1,..,n, FIRST(X_i) contains ε then FIRST(q) contains ε.

The definition for FIRST can be implemented by an iterative algorithm (simplified here by avoiding ε-productions).

     // computing FIRST(A), assuming no ε-productions  
     for all nonterminals A do FIRST(A) = {};
     while there are changes to any FIRST(A) do
          for each production choice A → X_1,X_2,..X_n do
             add FIRST(X_1) to FIRST(A);

An important attribute of a nonterminal is whether it is NULLABLE.

Definition. A nonterminal X is nullable if X *-> ε.

*****************
*  FOLLOW SETS *
*****************
Definition.
FOLLOW(A), for nonterminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form; i.e., the set of terminals a for which there exists a derivation
    
  S *-> αAaβ, for some α and β   
Example.
              S
           / / \ \ 
          α A   d β 
           / \
          c   γ

Terminal c is in FIRST(A) and Terminal d is in FOLLOW(A).

FOLLOW(A), for all nonterminals A, is computed recursively by these rules:

Rule 1. Place $ in FOLLOW(S), where S is the start symbol, and $ is the input right endmarker.

Rule 2. If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).

Rule 3. If there is a production A → αB, or a production A → αBβ, where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).

In general, FIRST is constructed by looking at the LHS of a production and determining which terminal(s) that are derived first. FOLLOW is constructed by looking at nonterminals in the RHS of productions and determining which terminal(s) follow it first.


Example from Dragon. 
E -> TE'
E' -> +TE' | e
T -> FT'
T' -> *FT' | e
F-> (E) | id

FIRST(E) = {(,id}
FIRST(E')= {+,e}
FIRST(T) = {(,id}
FIRST(T')= {*,e}
FIRST(F) = {(,id}

FOLLOW(E) = {$,)}      // EOF symbol is in follow E
FOLLOW(E')= {$,)}      // since E' appears only at the end of E productions
FOLLOW(T) = {+,),$}    // first(E') - e and follow(E)
FOLLOW(T')= {+,),$}    // T' appears only at ends of T production so follow(T)
FOLLOW(F) = {+,*,),$}  // first(T') - e; follow(E); first(E') - e 

EXAMPLE TWO. 
Grammar G = <S,Σ,N,P>      
  Σ = {+, -, *, (, ), NUM}   // terminals
  S = exp    // the starting symbol
  N = {exp, addop, term, mulop, factor} // nonterminals 
  P, the set the set of productions in  G:
   
     exp → exp addop term | term
     addop → + | -
     term → term mulop factor | factor
     mulop → *
     factor → (exp) | NUM 
The FIRST and FOLLOW sets contain terminals from Σ. Both sets are recursively constructed in multiple passes through the grammar rules. When complete, every symbol in G has a FIRST set. Every nonterminal symbol in G has a FOLLOW set. Since |Σ| = 6 and |N| = 5, there are 11 FIRST sets and 5 FOLLOW sets. FIRST and FOLLOW sets for G:

   FIRST(exp)    = {(, NUM}        FIRST(() = {(}    
   FIRST(term)   = {(, NUM}        FIRST()) = {)}
   FIRST(factor) = {(, NUM}        FIRST(+) = {+}
   FIRST(addop)  = {+, -}          FIRST(-) = {-}
   FIRST(mulop)  = {*}             FIRST(*) = {*}
                                   FIRST(NUM) = NUM
   
   FOLLOW(exp)    = {$,+,-,)}
   FOLLOW(addop)  = {(,NUM}
   FOLLOW(term)   = {$,+,-,*,) } 
   FOLLOW(mulop)  = {(,NUM} 
   FOLLOW(factor) = {$,+,-,*,)}
   
A grammar G is LL(1) if and only if whenever A → α | β, FIRST(α) and FIRST(β) are disjoint sets. If ε is in FIRST(β) then FIRST(α) and FOLLOW(A) are disjoint sets.

Is Grammar G from Example 2 LL(1)?
FIRST(exp) ∩ FIRST(term) = {(,NUM}. Not LL(1). The next section discusses bottom-up parsing, which is a technique by which languages that are not LL(1) can be parsed.

Ch 4.5 Bottom-Up Parsing

LR grammars define languages to which LR parsing can be applied. LR meaning a left-to-right scan, a rightmost derivation and a bottom-up recursive ascent. Left recursion is not a problem in LR grammars. The primary drawback is that LR parsers are difficult to code manually, thus are usually generated by a parser generator. Yacc generates an LALR parser (Look-Ahead LR parser) which is a simplified version of a canonical LR parser.

Bottom-up parsing is a process of "reducing" a string back to the starting symbol of the grammar. Decisions involve when to reduce and what production rule to apply. Going "backwards" means you need to know what chunk of symbols is is the body of a production in G. These 'chunks' are called handles. This is the same Grammar G as above, simplified slightly to make parsing easier to read. Σ = {+,*,(,),id}. N = {E,T,F}. S = E. There are six production rules. The handles for G are the RHS of each rule in G.

      --------------
      | GRAMMAR G  |
      --------------
                           Handle 
Rule 1.   E -> E + T       E+T     
Rule 2.   E -> T           T
Rule 3.   T -> T * F       T*F   
Rule 4.   T -> F           F
Rule 5.   F -> (E)         (E)   
Rule 6.   F -> id          id
 
Example 1.
Input string: id + id 

Rightmost derivation (F derives to id in E+F):
------------------------------------------------------------
  E => E + T => E + F => T + id => F + id => id+id


Example 2.
Input string: id * id 

Rightmost derivation (F derives to id in T*F):
------------------------------------------------------------
  E => T => T * F => T * id => F * id => id*id

Bottom-up parsing is a rightmost derivation in reverse. When you have a choice of handles (as in id * id) select the leftmost handle since we are doing a left-to-right scan. Bottom-up parse on input string: id * id:

   --        --           --    -----       --
   id * id   F * id   T * id    T * F        T           E
             |        |         |   |      / | \         |
             id       F         F   id    T  *  F        T
                      |         |         |     |      / | \
                      id        id        F     id    T  *  F
                                          |           |     |
                                          id          F     id
                                                      |
                                                      id

Handle Pruning is the process of grabbing the handle and reducing it (deriving in reverse) to the LHS (head) of its sentential form. Handles are marked by as dashed line in the parse tree above. Handle pruning produces a right-most derivation in reverse.

Shift-Reduce Parsing is an implementation of bottom-up parsing in which a stack holds the grammar symbols and an input buffer holds the string to be parsed. This uses the same data structures (a stack and a buffer) as in the top-down parsing with a stack but the technique differs. The stack begins empty. "Shift" means to move a terminal symbol from input onto the stack. Reduce means to reduce the current sentential form on the stack to another sentential form. "Accept" occurs when the stack holds the start symbol and the buffer is empty($). "Error" occurs when the shift/reduce process gets stuck before the Accept state. Using Grammar G and the input string: x * y, the actions of shift-reduce are shown below. Note that the handle always appears at the top of the stack.

    Grammar G.        Handle 
   E -> E + T          E+T     
   E -> T              T
   T -> T * F          T*F   
   T -> F              F
   F -> (E)           (E)   
   F -> x              x 
   F -> y              y 
 
   Input string: x * y 

   STACK      INPUT       ACTION              HANDLE
   $          x*y$        shift 
   $x          *y$        reduce by F->x      x 
   $F          *y$        reduce by T->F      F
   $T          *y$        shift                 
   $T*          y$        shift                
   $T*y          $        reduce by F->y      y 
   $T*F          $        reduce by T->T*F    T*F  
   $T            $        reduce by E->T      T
   $E            $        accept  


Note: $ marks the bottom of the stack and EOF.
See this wiki article for a good introduction to LALR (Look-Ahead Left Scan Right Derivation) parsing.