CMPS 450 Week 7 Lecture Notes

resources:
llvm compiler infrastructure
ANSI C Language Reference
Parsing Techniques (pdf)

Ch 4.6 LR Parsing: Simple LR

The theoretical model for lexical analysis is regular grammars and finite-state automata. The theoretical model for syntactic analysis is context-free grammars and push-down automata. The theoretical model for static semantics is context-sensitive grammars and turing machines. The most common bottom-up parsing technique in use today is LR(k) (left scan, right derivation and k symbols of lookahead). The goal of this section is to construct an automaton for an LR(0) grammar. Def: A Viable Prefix is a prefix of some right-sentential form that ends no further right than the end of the handle of that right-sentential form. In shift-reduce parsing for an LR(0) grammar, the stack contents are always a viable prefix. A viable prefix can be reduced back to the head of the production rule (the nonterminal on the LHS). Def: An Item in an LR(0) grammar G is a production rule of G with a dot somewhere in the body (RHS) to mark the state of the parser. The dot can be anywhere in the body. E.g., If production A is A → BcD there are four items in A's set of items: { [A→.BcD], [A→B.cD], [A→Bc.D], [A→Bcd.] }. Item A→B.cD means that the parser is expecting terminal c next in the input. A → ε generates one item: A → . Def: An Item is valid for a viable prefix if the production of that item is used to generate the handle, and the viable prefix includes all symbols to the left of the dot, but not those below. Def: LR parsers first construct the sets of valid items (LR states) for all possible viable prefixes, associating each prefix with a state. Reduce if there is a valid item with the dot at the right end of the body, shift the next symbol onto the stack if that symbol appears immediately to the right of the dot in some valid item, otherwise error. Def: The augmented grammar G' for grammar G is G with G' as a new start symbol. Def: CLOSURE(I), where I is a set of items for grammar G, is the set of items constructed from I by Rule 1. Initially, add every item in I to CLOSURE (I). Rule 2. If A → α.Bβ is in CLOSURE(I) and B → γ is a production in G, add item B → .γ to CLOSURE(I). Apply Rule 2 to any new items until there are no more new items to add. Example using Grammar G above. Augment the grammar G to G'. Grammar G. E → E + T | T T → T * F | F F → (E) | id Augmented Grammar G' (0) E' → E (1) E → E + T (2) E → T (3) T → T * F (4) T → F (5) F → (E) (6) F → id Compute the item set for each of the productions in the augmented grammar: G': (0) E' → E = {[E'→.E] , [E'→E.]} (1) E → E+T = {[E→.E+T], [E→E.+T], [E→E+.T], [E→E+T.]} (2) E → T = {[E→.T] , [E→T.]} (3) T → T*F = {[T→.T*F], [T→T.*F], [T→T*.F], [T→T*F.]} (4) T → F = {[T→.F], [T→F.]} (5) F → (E) = {[F→.(E)], [F→(.E)], [F→(E.)], [F→(E).]} (6) F → id = {[F→.id], [F→id.] Next, using Rule 1 and Rule 2, compute the CLOSURE for each item in each item set for each of the productions in the augmented grammar G'. For example, the set of items for E'→E is {[E'→.E] , [E'→E.]}. The CLOSURE of the first item in this set, I = [E'→.E], is constructed by applying the two rules: Put E'→.E in the set by Rule 1. CLOSURE([E'→.E]) = { [E'→.E], The rest of of set is constructed by Rule 2. Since E immediately follows a dot, add all E-productions that begin with a dot in the RHS: CLOSURE([E'→.E]) = { [E'→.E], [E→.E+T], [E→.T], .T in [E→.T] means you add all T-productions that begin with a dot in the RHS: CLOSURE([E'→.E]) = { [E'→.E], [E→.E+T], [E→.T], [T→.T*F], [T→.F], Finally, the F to the right of the dot means you add these F-productions : [F→.(E)] and [F→.id]. CLOSURE([E'→.E]) = {[E'→.E], [E→.E+T], [E→.T], [T→.T*F], [T→.F], [F→.(E)], [F→.id]} Stop since nothing else can be added. The collection of CLOSURES correspond to all possible states in an LR(0) automaton. Def: The function GOTO(I,X) where I is a set of items and X is a grammar symbol (terminal or nonterminal) is defined to be the closure of the set of all items [A→αX.β] such that [A→α.Xβ] is in I. GOTO(I,X) defines the transitions from state I under input X for the LR(0) automaton. Def: The canonical LR(0) collection C of augmented grammar G' is the CLOSURE of all item sets for the productions in G'. Def: A Viable Prefix is a prefix of some right-sentential form that ends no further right than the end of the handle of that right-sentential form. In shift-reduce parsing for an LR(0) grammar, the stack contents are always a viable prefix. Simple LR Parsing works because LR(0) automata recognize viable prefixes. "A central theorem of LR-parsing theory is that the set of valid items for a viable prefix γ is exactly the set of items reached from the initial state along the path labeled γ in the LR(0) automaton for the grammar. " Example. Show that E+T* is a viable prefix of this augmented grammar G': GRAMMAR G' (0) E' -> E (1) E -> E + T (2) E -> T (3) T -> T * F (4) T -> F (5) F -> (E) (6) F -> id After having processed E+T* the automaton will be in the state containing Closure({[T->T*.F]}): [ T -> T*.F ] [ F -> .(E) ] [ F -> .id ] These three rightmost derivations show the validity of each of the items. T -> T*.F F -> .(E) F ->.id E' => E E'=> E E' => E => E + T => E + T => E + T => E + T * F => E + T * F => E+T*F => E + T * (E) => E+T*id

Detailed Step-by-Step Construction of an SLR(1) Parser from an LR(0) Automaton

------------------------------------------------------
* Creating an SLR(1) Parser From an LR(0) Automaton  *
------------------------------------------------------

Starting BNF Grammar G:
   E -> E + T | T
   T -> T * F | F
   F -> (E) | id

********** 
* STEP 1 * Augment Grammar G and call it G'. Label each Production Rule.
********** 

 0. E' -> E
 1. E -> E + T 
 2. E -> T
 3. T -> T * F 
 4. T -> F
 5. F -> (E) 
 6. F -> id

******** 
 STEP 2. Construct the collection of Items for each production in G'. 
******** 
  E' -> E  = {[E'->.E] , [E'->E.]}                             
  E -> T   = {[E->.T] , [E->T.]}                               
  E -> E+T = {[E->.E+T], [E->E.+T], [E->E+.T], [E->E+T.]}      
  T -> T*F = {[T->.T*F], [T->T.*F], [T->T*.F], [T->T*F.]}      
  T -> F   = {[T->.F], [T->F.]}                               
  F -> (E) = {[F->.(E)], [F->(.E)], [F->(E.)], [F->(E).]}      
  F -> id  = {[F->.id], [F->id.]}                              

  Total : 20          

******** 
 STEP 3. 
******** 
Construct the Closure set for each of the 20 Items of G', where each set is an 
accepting state and state numbers past I_0 are assigned after applying the GOTO
function. The collection of the closure sets is the Canonical LR(0) Collection 
for G'. The state numbers are arbitrary -- the ones below match the diagram in
Dragon.

      --------------------------------
      //   1. Production E' -> E    //
      -------------------------------
(I_0) Closure({[E' -> .E]})
    = { [E'->.E], [E->.E+T], [E->.T], [T->.T*F], [T->.F], [F->.(E)], [F->.id] }
(I_1) Closure({[E' -> E.]})
    = { [E'->E.], [E->E.+T]  }

      -------------------------------
      //   2. Production E -> T    //
      -------------------------------
      Closure( {[E->.T]}) = { [E->.T], [T->.T*F] ... }  #  defined in I_0 
(I_2) Closure( [E->T.]}
     = {[E->T.], [T->T.*F] }  

     --------------------------------
     //  3. Production E -> E+T    //
      ------------------------------
      Closure({[E->.E+T]}) = { [E->.E+T], ....  }    # defined in I_0
(I_8) Closure({[E->E.+T]})
    = { [E->E.+T], [F->(E.)]}
(I_6) Closure({[E->E+.T]})
    = { [E->E+.T], [T->.T*F], [T->.F], [F->.(E)], [F->.id] }
(I_9) Closure({[E->E+T.]})
    = { [E->E+T.], [T->T.*F] }

      --------------------------------
      //   4. Production T -> T*F    //
      --------------------------------
      Closure({[T->.T*F]}) = { [T->.T*F], [T->.F] }  # defined in I_0 
      Closure({[T->T.*F]}) = { [T->T.*F], ... }      # defined in I_0
(I_7) Closure({[T->T*.F]})
     = { [T->T*.F], [F->.(E)], [F->.id] }     
(I_10) Closure({[T->T*F.]})
     = { [T->T*F.] }     

      --------------------------------
      //   5.  Production T -> F    //
      -------------------------------
      Closure( {[T->.F]}) = { [T->.F] }        # defined in I_0
(I_3) Closure( [T->F.]} ) 
       = { [T->F.] }
  
      ---------------------------------
      //   6.  Production F -> (E)   //
      ---------------------------------
      Closure( {[F->.(E)]}) = { [F->.(E)] }   # defined in I_0 

(I_4) Closure( {[F->(.E)]})
       = { [F->(.E)], [E->.E+T],[E->.T],[T->.F],[T->.T*F],[F->.(E)],[F->.id]  } 

      Closure( {[F->(E.)]}) =  {[F->(E.)], .. )   # defined in I_8 

(I_11) Closure( {[F->(E).]})
       =  { [F->(E).] }

      ------------------------------
      //  7. Production F -> id    // 
      ------------------------------
       Closure( {[F->.id]}) =  {[F->.id]}  //  defined in I_4
(I_5)  Closure( {[F->id.]})
       =  { [F->(id.)] }

********
 STEP 4. 
********
 Define state transitions for the LR(0) Automaton by the GOTO function, where 
 GOTO(I,X) specifies the transition from State I for symbol X (X is a terminal
 or nonterminal symbol).

 Ex. For I_9 = { [E->E+T.], [T->T.*F] }
 GOTO(I_9,*) transitions to I_7: { [T->T*.F], [F->.(E)], [F->.id] }
 This reflects a shift on input '*'.

 Ex. For I_7 = { [T->T*.F], [F->.(E)], [F->.id] } 
 GOTO(I_7,F) transitions to I_10: { [T->T*F.] }
 This reflects a reduce to F. 

 The parser reduces if no shift transition matches input.

 The GOTO functions for each Item set of the Automaton are shown below 
 where a blank marks an (I,X) pair where no transition exists. 

      E     F     T    (    )     +     *    id   $
      ---   ---   ---  ---  ---   ---   ---  ---  ---
I_0   I_1   I_3   I_2  I_4                   I_5
I_1                                I_6             Accept
I_2                                     I_7 
I_3   
I_4   I_8   I_3    I_2  I_4                  I_5
I_5
I_6         I_3    I_9  I_4                  I_5
I_7         I_10        I_4                  I_5
I_8                          I_11  I_6
I_9                                     I_7
I_10 
I_11

A graphical depiction of the LR(0) Automaton defined by Closure and GOTO functions is shown below. Kernel items are the Production Rule itself and all items with dots NOT on the left end. Non-kernel items are shaded.

LR(0) Automaton for Expression Grammar G':

Automaton in Action as a Shift-Reduce Parser. Input: id*id+id
         STATE       SYMBOLS   INPUT       ACTION
         ----------  --------  --------    ---------------------------------
  1)     0                    id*id+id$    shift 5          (push 5)  
  2)     0 5         id         *id+id$    reduce by F->id  (pop 5 push 3)
  3)     0 3         F          *id+id$    reduce by T->F   (pop 3 push 2)
  4)     0 2         T          *id+id$    shift 7          (push 7)
  5)     0 2 7       T*          id+id$    shift 5          (push 5)
  6)     0 2 7 5     T*id          +id$    reduce by F->id  (pop 5 push 10)
  7)     0 2 7 10    T*F           +id$    reduce by T->T*F (pop 10, pop 7)
  8)     0 2         T             +id$    reduce by E->T   (pop 2 push 1)
  9)     0 1         E             +id$    shift 6          (push 6)
 10)     0 1 6       E+             id$    shift 5          (push 5)
 11)     0 1 6 5     E+id             $    reduce by F->id  (pop 5 push 3) 
 12)     0 1 6 3     E+F              $    reduce by T->F   (pop 3 push 9) 
 13)     0 1 6 9     E+T              $    reduce by E->E+T (pop 9 pop 6) 
 14)     0 1         E                $    accept

In shift-reduce parsing one stack holds states 0 - 11 and one stack holds symbols. A shift operation on input '+' pushes '+' onto the symbol stack and the corresponding state onto the stack stack; a reduce operation pops the stack, finds the symbols that corresponds to a reduction of that symbol and pushes the new symbol onto the stack.

Detailed Explanation of line #5 - line #7:

Line#5. On input 'id' from state 7 shift state 5 onto stack.

Line#6. From state 5 lookahead is '+' so reduce by production rule 6 which is F->id. State 5 is popped off stack exposing state 7. From state 7 on F (F is where to goto in F->id) you push state 10.

Line#7. From state 10 lookahead is '+' so reduce by production rule 3, which is T -> T*F. State 10 is popped off the stack exposing state 7. From state 7 on T there is no transition so pop 7.

The behavior of the LR(0) automaton corresponds to the action of an LR parse:

Input: id*id+id
Leftmost Top Down:           Rightmost Top down (handles marked by __ ):

                                 _                    Stack with reductions
E'->E                          E'->E                  0 1 
                                 ___   
  ->E+T                   8)   ->E+T                  0 1 6 9
                                   _ 
  ->T+T                   7)   ->E+F                  0 1 6 3
                                   __ 
  ->T*F+T                 6)   ->E+id                 0 1 6 5
                                 _
  ->F*F+T                 5)   ->T+id                 0 2
                                 ___   
  ->id*F+T                4)   ->T*F+id               0 2 7 10
                                   __ 
  ->id*id+T               3)   ->T*id+id              0 2 7 5 
                                 _
  ->id*id+F               2)   ->F*id+id              0 2
                                 __ 
  ->id*id+id              1)   ->id*id+id             0 3


EIGHT Reductions of the rightmost derivation in reverse denoted by handles:
id->F   F->T   id->F   T*F->T  T->E  id->F  F->T  E+T->E  Accept!

********
 STEP 5. 
********
The only thing left to do to implement the SLR(1) parser is to construct the 
parsing table, which is the data structure used during parsing.

--------------------------------------------------------------
Constructing an SLR(1) Parsing Table from the LR(0) Automaton
--------------------------------------------------------------

Stack States S_0 ... S_n correspond to Item Collection I_0 ... I_n.

Actions: 's2' means consume input and shift state 2 onto the stack; 'r3' means 
reduce stack by Production Rule 3 

Apply the rules below to make shift, reduce and GOTO decisions. 

Rule 1. To make shift decisions, look for Items with a dot followed by a 
        terminal; i.e., foreach GOTO[I_i,a] = I_j, denote ACTION[i,a] as sj. 

Rule 2. To make reduce decisions, look for Items ending in dot; i.e., if 
        [A->alpha.] isin I_i then set ACTION[i,a] to "reduce A->alpha" for all 
        terminals a in FOLLOW(A). 

Rule 3. If [S` -> S.] isin I_i, then set ACTION[i,$] to accept. 

Rule 4. If GOTO(I_i, A) = I_j, then GOTO[i,A] = j, for nonterminal A. This rule
        extends the GOTO function to map State x Nonterminals -> State; e.g. 
        GOTO[1,N]->2 means from state 1 and nonterminal N, move to state 2. The
        GOTO action is used after a reduction to determine what state to push 
        onto the stack; e.g., pop the current state and push the state number 
        defined by GOTO. 

Rule 5. Anything not filled in by this point is an error state

---------------
Supporting work.
---------------
Label each production in the grammar (E'->E is not reduced so no need to label):
    (1) E -> E + T        (4) T -> F 
    (2) E -> T            (5) F -> (E)
    (3) T -> T * F        (6) F -> id
      
Compute FOLLOW(A) for each nonterminal in G. FOLLOW(A) contains all terminals 
that immediately follow nonterminal A in the RHS of a production.

     FOLLOW(E) = {$,+,)}     FOLLOW(T)={$,+,*,)}     FOLLOW(F)={$,+,*,)} 

4 states have items ending in dot - you will reduce in these states: 
I_1: E'->E.   $ is accept - nothing else matches. 
I_2: E->T.    reduce by Rule 2 for all terminals in FOLLOW(E)
I_3: T->F.    reduce by Rule 4 for all terminals in FOLLOW(T) 
I_10: F->T*F. reduce by Rule 3 for all terminals in FOLLOW(F)  
I_11: F->(E). reduce by Rule 5 for all terminals in FOLLOW(F)

The shift actions are easy. For example, from I_2 on input * you shift state
7 onto the stack since you move from E->T.*F to T->T*.F and I_7 contains that
item.

The GOTO transactions correspond to which state you end up in after you do a
reduce. For example, assume you reduce by rule F -> id and thus pop a state
symbol off the stack. If state 0 is then exposed on the stack you goto state 3
by pushing state 3 onto the stack. The GOTO functions are easily filled in by
following the transitions.   

   Parsing Table for Grammar G'.
   --------,-----------------------------,-----------------
     STATE |           ACTION            |     GOTO
   --------+-----------------------------+-----------------
           |  id   +   *   (    )   $    |   E   T   F
   --------+-----------------------------+-----------------
      0    |  s5           s4            |   1   2   3
   --------+-----------------------------+-----------------
      1    |       s6              acc   |
   --------+-----------------------------+-----------------
      2    |       r2  s7      r2  r2    | 
   --------+-----------------------------+-----------------
      3    |       r4  r4      r4  r4    |
   --------+-----------------------------+-----------------
      4    |  s5           s4            |  8   2    3
   --------+-----------------------------+-----------------
      5    |       r6  r6      r6  r6    |   
   --------+-----------------------------+-----------------
      6    |  s5           s4            |      9    3
   --------+-----------------------------+-----------------
      7    |  s5           s4            |           10
   --------+-----------------------------+-----------------
      8    |       s6          s11       |
   --------+-----------------------------+-----------------
      9    |       r1  s7      r1  r1    |
   --------+-----------------------------+-----------------
     10    |       r3  r3      r3  r3    |
   --------+-----------------------------+-----------------
     11    |       r5  r5      r5  r5    |
   --------+-----------------------------+-----------------

yacc parsing tables for Grammar G.

Ch.4.7 "More Powerful LR Parsers"

not covered

Ch.4.8 "Using Ambiguous Grammars"

not covered

Ch.4.9 "Parser Generators"

covered in laboratory work on yacc