CMPS 450 Compiler Front-End Project

Phase II - Syntactic and Semantic Analysis

Bison Ref
cflat lexicon
cflat BNF
phaseII files
lab05 files
ANSI C ref
lex/yacc interaction
type.h ref
some asides on typing in language design

Before beginning you should backup your scan.l file from Phase I. Keep this version in case you do not complete Phase II. For Phase II you will fuse your scanner generated by lex with a parser generated by yacc. You are going to need to do a bit of cleaning up in scan.l:

 
1. remove main() and any other functions
2. add #include "y.tab.h"
3. comment out the typedef enum for Token (the commented-out copy serves as reference) 
4. put the list of Token types from scan.l in parse.y for reference purposes
5. remove any references to the structures involving token type. 
6. your token actions should simply be this:
         {float}      { return FCONST; }
Yacc will define its own numbers for all the tokens and stuff these as #defines in y.tab.h. Make sure you call yacc to generate y.tab.h before calling lex. For consistency you should always refer to a token by its enumerated type rather than its literal ASCII value; e.g. RPAREN rather than ')'.
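After the cleanup, each rule in scan.l does nothing but return the token number defined in y.tab.h. A sketch (the float definition and the FCONST/RPAREN names are assumed to match your Phase I scanner):

```lex
%{
#include "y.tab.h"    /* token numbers generated by yacc -d */
%}
digit    [0-9]
float    {digit}+\.{digit}+
%%
{float}      { return FCONST; }
")"          { return RPAREN; }
```

parse.y declares the same names with %token FCONST RPAREN, and yacc -d writes the matching #defines into y.tab.h; that is why yacc must run first.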

Lab05 and Lab06 are designed to help you with the concepts you need to complete Phase II, which encompasses the static semantic analysis portion of your compiler front-end. The parser will perform syntax and type-checking analysis for cflat programs. Grab the files in phaseII files before starting. These files include a Makefile and cflat test files. The code in sample.y and sample.l demonstrates how to fuse your lexer with your parser. You may want to start with the parser lab05.y from lab05 for your Phase II parser.

Cflat Language
Cflat is a minimal subset of C and is designed to compile under gcc. Everything in cflat--flow control, expressions (including operator precedence) and function calls--matches C with a few added constraints. You can extend the cflat language in any way you wish as long as this minimal subset still works. The lexicon for Phase II is the lexicon defined in Phase I. For Phase II we now consider language syntax (how the tokens are strung together). You will write a parser (parse.y) that interacts with the lexer (scan.l).

When testing you should compile and link your scanner+parser (parse.y and scan.l) with debugging on and verbose flags as done in lab05. Name your executable 'cf' (cflat compiler). From the command-line your compiler should read the name of the source file to be compiled:

      $ ./cf 80.cf 
When testing your parser against an input file with no syntax errors, output should read "accept" or be in a final state.
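The build amounts to running yacc before lex and linking the two generated C files. A sketch of a Makefile rule (flags follow lab05; file names are assumed, and recipe lines must be indented with tabs):

```make
cf: parse.y scan.l
	yacc -d -v --debug parse.y    # emits y.tab.c, y.tab.h, y.output
	lex scan.l                    # emits lex.yy.c, which includes y.tab.h
	gcc -g -o cf y.tab.c lex.yy.c -lfl
```

The -v flag writes the parsing tables to y.output, which you will want when chasing conflicts.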

When generating your parser from parse.y you will most likely receive shift/reduce conflicts. This means that your grammar is ambiguous or would require more than one lookahead symbol to resolve the conflict. Bison resolves the conflict by doing the shift. A reduce/reduce conflict occurs because two or more rules match the same string of tokens. Bison resolves this conflict by reducing the rule that occurs earlier in the grammar. These warnings are not necessarily bad but you should pay attention to them and check your grammar for possible problems.

Your parser must follow the BNF specifications for the grammar as described in BNF.txt. An example of a syntactically correct cflat program is good.cf. You can extend your cflat language as much as you wish as long as, at a minimum, your compiler supports the constructs in good.cf. This code represents all elements of the language. The specification below, good.cf, and BNF.txt are all you need to write your parser.

One strategy for building your parser is to begin with a bare-bones parser and add the rules following the BNF from the bottom up. First code your starting production as <program>. Then code the production rule for <number>:

 
       <program> ::= <number> 
       <number> ::= FLOAT | INTEGER 
Test your parser with numbers. Next, add this production rule:
      <addop> ::= + | - 
Change your starting non-terminal to this:
      <program> ::= <number>  | <addop>  
Test your parser with numbers and addition ops. And so on.
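Putting the first two steps together, a bare-bones parse.y might look like this (a sketch; the token names FLOAT, INTEGER, PLUS, MINUS are assumed to match your scanner):

```yacc
%token FLOAT INTEGER PLUS MINUS
%%
program : number    {;}
        | addop     {;}
        ;
number  : FLOAT     {;}
        | INTEGER   {;}
        ;
addop   : PLUS      {;}
        | MINUS     {;}
        ;
```

Each new construct from the BNF gets a nonterminal and an alternative under program until the whole grammar is in place.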

Error Handling.
In Phase II don't try to recover from syntax errors--when you catch a syntax error, terminate the program with an appropriate error msg. You will recover from syntax errors in Phase III. You will, however, recover from errors that involve static semantics (such as a type violation); i.e., you should spit out an error msg and continue parsing.
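For Phase II a minimal yyerror can simply report and quit. A sketch for the C section of parse.y (yylineno assumes %option yylineno in scan.l):

```c
void yyerror(const char *msg)
{
    extern int yylineno;                 /* maintained by lex */
    fprintf(stderr, "%s near line %d\n", msg, yylineno);
    exit(1);                             /* no recovery until Phase III */
}
```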

Semantics for Type Checking

After you have defined the syntax of cflat you can add the context-sensitive element of type checking. Type checking will be done for all assignment statements and expressions. Scope semantics -- multiple declarations of the same variable in the same scope, references to variables that have not been declared, what gets bound to what and where -- will be covered in Phase III. In Phase II you need the scanner to perform some name mangling so that two identifiers with the same name but in different scopes are given different names. The easiest way to facilitate this is to append a scope identifier to the name; e.g., num declared in global scope will be renamed num0.

Building on this foundation you will test for fatal scope errors in phase III. Completing the semantic requirements of type checking will require a symbol table lookup of identifiers.

Cflat is a strongly typed language (more like Ada and closer to Java than C). There are good reasons to enforce strong typing (reliability, security). There is one good reason not to (flexibility). A strongly typed language enforces all type constraints and the compiler never coerces anything (coerce means that the compiler silently performs a type conversion).

To be consistent, Cflat does not allow downcasting (float to int) or upcasting (int to float). A downcast is any cast in which information is lost; downcasting also covers memory reduction such as int to char. Examples:

   int i = 5;          // this is OK
   float x = 5;        // this is not OK 
   i = 1.42857;        // silently coercing 1.42857 into 1 is an error
   i = 29/3;           // another error
   x = 3 + 4.5;        // not OK - you cannot upcast the 3 to a float
   i = 3 + 4.5;        // this is not OK - you cannot downcast 4.5 to 4 
   char c = i;         // NOT OK - this is a downcast
Finally, you must enforce strong typing for return statements. The actual return type must match the declared return type in the function definition. No coercions allowed.

To complete the requirements of type checking you will need to use the LookupType(char *) function that will return the type (ITYPE, FTYPE, CTYPE) of the name from the symbol table.
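With LookupType in hand, the type check on an assignment can live in the rule's action. A sketch (the nonterminal names, the value carried by $3, and the log-file handle flog are assumptions; only LookupType and the type codes come from this handout -- $3 is assumed to carry the synthesized type of the expression via %union/%type):

```yacc
assign_stmt : ID ASSIGN expr SEMI
              {
                  /* declared type of the LHS vs. synthesized type of the RHS */
                  if (LookupType($1) != $3)
                      fprintf(flog, "Type err: assign_stmt, line %d\n", yylineno);
              }
            ;
```

The fprintf format matches the log output shown for 100.cf below.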

You also need to worry about parameter typing within the function definition; i.e., this is a type error:

       int foo ( int n ) {
           n = 34.5;   // type error here
           return n;
       } 

Note: the cflat specification for Phase II does not include a function call. Function calls are very challenging during intermediate code generation and will be added to the BNF for Phase III.

Type violations are recoverable since they do not involve syntax. Simply report all instances in which the source code does not pass the semantic constraints of the cflat language with a fprintf statement to your log file.

For this phase your scanner will need to do a bit of name mangling. You need to be able to type check this code:

   /* 90.cf - type checks parameter list - name mangling required 
    * spits out only one error per line 
    */
   float num; 
   int i;
   int foo ( int num, float y ) 
   { 
       num = 7;          /* OK */
       num = y;          /* line 11 type error */
       y = 4.5 + 8 * 2;  /* line 12 type error */
   
       return 0;  
   } 
You can insert the two different identifiers - num0 in global scope and num1 local to foo - by having the parser set the scope and the scanner append the scope to the identifier before inserting it into the symbol table. This is a little tricky, but it is still possible to let the scanner do the insertion. In Phase III you will implement scope checking semantics. Here is my solution.
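The mangling itself can be sketched as below (cur_scope and the function name mangle are assumed names; the idea is that the parser bumps cur_scope when it enters a function and the scanner calls mangle before the symbol-table insert):

```c
#include <stdio.h>

int cur_scope = 0;   /* set by the parser: 0 = global, 1 = inside foo, ... */

/* Append the current scope number to an identifier before symbol-table
 * insertion, so num in global scope becomes num0 and num inside foo
 * becomes num1. */
void mangle(char *out, size_t outlen, const char *name, int scope)
{
    snprintf(out, outlen, "%s%d", name, scope);
}
```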

When you are finished with all type checking, parsing 100.cf will produce the following log and symbol table dump:

/* 100.cf - finds type errors for char; enforces strong typing on return  */
int   x =  5;    /* OK */ 
float y = 5.5;    /* OK */
char c = 'a';    /* OK */
int foo ( int y, float x ) 
{
    x = 5;        /* line 7 type error */
    y = 5.5;      /* line 8 type error */
    c = 7;        /* line 9 type error */
    return 1.2;   /* line 10 type error */
}  
Type err: assign_stmt, line 7
Type err: assign_stmt, line 8
Type err: assign_stmt, line 9
Type err: return value, line 10


------------ ------ -------------
Name         Type   Line Numbers
------------ ------ -------------
foo0          1      5 
c0            3      4    9 
x0            1      2 
x1            2      5    7 
y0            2      3 
y1            1      5    8 

HOW TO TEST YOUR CODE

Read the contents of 60.cf, 65.cf, 70.cf, and 75.cf. These files involve syntax only - you receive 75% if you can parse a syntactically correct cflat source with all constructs of the language without errors.

The files 80.cf, 85.cf, 90.cf and 100.cf involve type checking semantics. These files are syntactically correct but have type errors - just spit out an error message and continue parsing.

One suggestion is to finish the syntax portion completely (70%) before starting semantics. I will run your code against all preceding files. Code that does not compile will receive 0 points.

WHAT TO SUBMIT FOR PHASE II

You do not need to submit anything until Monday June 9. If you are unable to complete Phase III then follow these instructions to submit Phase II. Construct a tarball of everything needed to execute your code including a Makefile. In the body of your submission, let me know how far you completed your parser. For example, if you completed 90% I will do this:

     $ tar xvf yourtarball.tar
     $ make 
     $ ./cf 60.cf
     $ ./cf 65.cf
     $ ./cf 70.cf
     $ ./cf 75.cf
 and so on...

-----------------------------------------------------------------------------
NOTES:
o yacc permits actions to be embedded in the middle of a rule or at the end.
  you may run into problems embedding an action after the head of the RHS -
  not sure why

o the default action is $$ = $1  - if you do not want this action to occur but
  do not have another action you need to put {;} as your action

o this warning
      warning: empty rule for typed nonterminal, and no action
  can be fixed with {;}
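  For example (opt_stmts and stmt_list are made-up names; the point is the
  {;} on the empty alternative):

```yacc
opt_stmts : /* empty */   {;}
          | stmt_list
          ;
```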

Notes on this warning
   conflicts:  12 reduce/reduce

  A reduce/reduce conflict occurs because two or more rules match the same 
  string of tokens. Bison resolves this conflict by reducing the rule that 
  occurs earlier in the grammar. These errors can often be ignored but you
  should check your parsing tables to be sure. Call yacc with debugging:

     yacc -d -v --debug parse.y

  Then look at y.output.

  Shift/reduce conflicts mean your grammar is ambiguous or would require more
  than one lookahead symbol to resolve the conflict. Bison resolves this
  conflict by doing the shift. Shift/reduce can be more problematic than
  reduce/reduce errors. Shift/reduce errors can often be fixed by simplifying
  your grammar.

//////
  You can make your life easier by letting yacc control precedence and 
  associativity of arithmetic operators; precedence is low to high; e.g., * 
  has precedence over + below:
 
  %left PLUS MINUS 
  %left TIMES  OVER 
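
  With those declarations yacc resolves the ambiguity itself, so the
  expression grammar can stay flat. A sketch (token and nonterminal names
  assumed):

```yacc
expr : expr PLUS  expr     {;}
     | expr MINUS expr     {;}
     | expr TIMES expr     {;}
     | expr OVER  expr     {;}
     | LPAREN expr RPAREN  {;}
     | INTEGER             {;}
     ;
```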

////

  The pseudo-token error triggers yacc's error recovery; yyerror is the
  default error-reporting routine and the yyerrok macro resets the error
  state. If you cannot recover from parsing you should immediately exit.
////
  If you try to close a file handle twice you will get a page full of errors
  from glibc