Chapter 3.1 - 3.3 Describing Syntax

TERMINOLOGY.

Language: a finite or infinite set of sentences. A language has a lexicon, syntax rules, and semantics. A grammar is a formal definition of a language.

Lexicon: contains all the lexemes of the language; i.e., predefined names, symbols and user defined identifiers (see C's lexicon)

Syntax: the form or structure of units in the language whether sentences, expressions, statements, or program units.

Semantics: the meaning of the expressions, statements, and program units in a language. For a programming language, semantics most often describe the runtime behavior of a program. A syntactically correct statement may be semantically meaningless.

Lexeme: the lowest level syntactic unit of a language; i.e., lexemes are the terminal symbols in the language. The lexemes in the English lexicon include words and punctuation symbols. An example of a lexeme is any C keyword.

A Sentence in language L is a valid string of lexemes over the terminal set of L. In English, the lexemes are words and a sentence is a string of words (plus punctuation). A word can also have a grammar, defined as the set of arrangements over the terminal set we call the alphabet. If the language is the set of identifiers, then the terminal set is called the character set of L. The C/C++ character set is ASCII (excluding non-printable characters). The Java character set is Unicode.

Token: a category of lexemes (e.g., identifier, keyword, literal, separator, operator in a programming language or noun, verb, adverb, ... in human language). A token is a classification of the terminal symbols of a language. The lexemes are the actual terminals in the language. For example, a Keyword is a token classification in most languages. The actual keyword 'const' is a lexeme belonging to that classification.
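
The lexeme/token distinction can be sketched in a few lines of Python. The token names and patterns below are illustrative assumptions for a tiny C-like fragment; they are not C's actual lexicon.

```python
import re

# Hypothetical token classes for a tiny C-like fragment (not C's full lexicon).
# Order matters: keywords must be tried before identifiers.
TOKEN_PATTERNS = [
    ("keyword",    r"\b(?:const|int|return)\b"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("literal",    r"\d+"),
    ("operator",   r"[=+\-*/]"),
    ("separator",  r"[;(),]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_PATTERNS))

def tokenize(source):
    """Return (token, lexeme) pairs; whitespace and unknown characters are skipped."""
    return [(m.lastgroup, m.group()) for m in MASTER.finditer(source)]

for token, lexeme in tokenize("const int num = 42;"):
    print(f"{token:<10} {lexeme}")
```

Each pair couples a classification (the token) with the actual text found in the source (the lexeme).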

Grammars: A Formal Definition of a Language

Grammar: a formal definition for describing the syntax of a language through a finite nonempty set of rules upon the tokens in the language. A grammar is a metalanguage of a language. Grammars do not describe semantics. ( see Java's Grammar )

Two types of grammars:

Generational models "generate" all possible sentences and no others in the language. Context-free and BNF grammars are generational models. A device that generates sentences of a language (a derivation or a parse tree) can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator.

Recognitional models verify whether a sentence is in the language or not. A recognition device reads input strings of the language and decides whether the input strings belong to the language. Example: The syntax analysis part of a compiler is a recognitional model.

Noam Chomsky (mid-1950s)

Chomsky developed language generators to describe the syntax of natural languages

A generative grammar should generate all possible sentences in the language and only those sentences (completeness and accuracy)

The empty string is denoted by λ or ε. A language may contain the empty string, but it need not.

A generative grammar is a 4-tuple, <Σ, N, P, S>, consisting of

+ A finite set Σ of terminal symbols, i.e., the terminal alphabet of the language, whose symbols combine to form sentences.

+ A finite set N of nonterminal symbols or syntactic categories, each of which represents some collection of subphrases of the sentences. (Nonterminals are denoted in some way.)

+ A finite set P of productions or rules that describe how each nonterminal is defined in terms of terminal symbols and nonterminals. A production has a left-hand side (LHS) and right-hand side (RHS). Productions use ::= or -> , read as "is defined as". The "|" symbol is OR.

     aBb ::= bC

     S ::= 0B | 1A

+ A nonterminal S, the start symbol, that specifies the uppermost category of the language (e.g. a sentence).

The vocabulary of a grammar consists of the terminal and nonterminal symbols.

A sentential form is a string of symbols in the derivation of a sentence in the grammar.
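
As a rough illustration, the 4-tuple can be written down directly as a data structure. The class and field names below are assumptions made for this sketch, and the sample grammar is CFG 3 from later in these notes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # Σ
    nonterminals: frozenset   # N
    productions: tuple        # P, as (lhs, rhs) pairs of symbol tuples
    start: str                # S

    def __post_init__(self):
        if self.terminals & self.nonterminals:
            raise ValueError("Σ and N must be disjoint")
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be a nonterminal")
        for lhs, rhs in self.productions:
            for symbol in lhs + rhs:
                if symbol not in self.vocabulary:
                    raise ValueError(f"unknown symbol: {symbol!r}")

    @property
    def vocabulary(self):
        """All terminal and nonterminal symbols."""
        return self.terminals | self.nonterminals

# CFG 3: S ::= aSb | ε  (ε is written as the empty tuple).
cfg3 = Grammar(
    terminals=frozenset({"a", "b"}),
    nonterminals=frozenset({"S"}),
    productions=((("S",), ("a", "S", "b")), (("S",), ())),
    start="S",
)
print(sorted(cfg3.vocabulary))
```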

Chomsky's 4 classes of languages and corresponding grammars

By order of restrictiveness (less restriction means a more sophisticated language):

UNRESTRICTED, requires only that at least one nonterminal appear on the LHS

     aBb ::= bC

CONTEXT SENSITIVE, requires at least one nonterminal on the LHS and that the RHS contain no fewer symbols than LHS. Rules are of the form:

     vAw ::= vzw

Example: A grammar for the language of the set of strings with equal numbers of as, bs, and cs in that order. E.g. { abc, aabbcc, aaabbbccc, ... } is context-sensitive.

    Σ = {a,b,c}
    N = {S,A,C,Q,X}
    S = S
    P =  {
      S -> abC | aSQ
      bQC -> bbCC 
      CQ -> CX
      CX -> QX
      QX -> QC
      C -> c   }
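
One way to see which sentences this grammar produces is to search its sentential forms exhaustively. The sketch below is an illustration only; the function name and the length bound are assumptions. It relies on the context-sensitive restriction that no rule shortens a form, so any form longer than the bound can be dropped.

```python
from collections import deque

# Productions of the context-sensitive example, as (LHS, RHS) string pairs.
RULES = [
    ("S", "abC"), ("S", "aSQ"),
    ("bQC", "bbCC"),
    ("CQ", "CX"), ("CX", "QX"), ("QX", "QC"),
    ("C", "c"),
]

def sentences(max_len, start="S", terminals="abc"):
    """Breadth-first search over sentential forms no longer than max_len."""
    seen, queue, found = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        if all(ch in terminals for ch in form):
            found.add(form)
            continue
        for lhs, rhs in RULES:
            pos = form.find(lhs)
            while pos != -1:
                new = form[:pos] + rhs + form[pos + len(lhs):]
                # No rule shrinks a form, so longer forms can be discarded.
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                pos = form.find(lhs, pos + 1)
    return sorted(found, key=len)

print(sentences(6))   # ['abc', 'aabbcc']
```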

CONTEXT FREE, requires that all rules have a single nonterminal on the LHS and any string of terminals and nonterminals on the RHS (the empty string, written λ or ε, is allowed, as in CFG 2 and CFG 3). Examples:

 CFG 1.
    Σ = {(,)}
    N = {S}
    S = S
    P = { S ::= () | (S) | SS  }

 CFG 2.
    Σ = {a,b}
    N = {S}
    S = S
    P = { S ::= aSa | bSb | a | b | λ }

 CFG 3.
    Σ = {a,b}
    N = {S}
    S = S
    P = { S ::= aSb | ε }
    L = { ab, aabb, aaabbb, aaaabbbb, ... }

 CFG 4.
    Σ = {a,b}
    N = {S,A,B}
    S = S
    P = {S ::= aABb 
         A ::= Ab | b 
         B ::= aB | a }

 CFG 5.
    Σ = {0,1}
    N = {S,A,B}
    S = S
    P = {S ::= 1A | 0B 
         A ::= 0 | 0S | 1AA
         B ::= 1 | 1S | 0BB }

Are these strings in the language defined by CFG 5?

    0101
    011
    11110011
    00011000110001110

Before deriving the strings let's see if we can uncover some patterns in the language. This language is complicated but there is one pattern that is easy to discern. The strings below are in L(5) following the S->1A; A->0 | 0S rules:

      10,1010,101010,10101010,..

Following the S->0B; B->1 | 1S rules we have

    
       01,0101, 010101, 01010101,...

It looks like all strings of '01' any number of times and all strings of '10' any number of times are in L(5).

The next rules to be considered are complicated! But if a derivation begins S -> 1A; A -> 1AA, we know that the string starts with 11 and that each of the two As must go on to supply one more 0 than 1.


   1100, 110100, 110010, ...

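Based on these patterns it appears that '11110011' is not in the language. But let's try to derive it:

           S -> 1A
             -> 11AA
             -> 111AAA
             -> 1111AAAA
             -> 11110AAA
             -> 111100AA  STUCK

The remaining input is 11, but every string derived from A contains at least one 0. Put another way, the four As in 1111AAAA each need at least one 0, and the rest of the string (0011) has only two.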
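
Membership questions like the ones posed for CFG 5 can also be checked mechanically. The following is a minimal sketch, not an efficient parser: it tries every way of splitting the input among the symbols of a right-hand side and memoizes the results. It assumes a grammar with no empty productions whose symbols are single characters; the function names are made up for this example.

```python
from functools import lru_cache

# CFG 5; each right-hand side is a string of single-character symbols.
CFG5 = {
    "S": ("1A", "0B"),
    "A": ("0", "0S", "1AA"),
    "B": ("1", "1S", "0BB"),
}

@lru_cache(maxsize=None)
def derives(symbols, text):
    """True if the symbol string `symbols` can derive exactly `text`.

    Assumes no empty productions, so every symbol accounts for at
    least one character of `text`.
    """
    if not symbols:
        return text == ""
    first, rest = symbols[0], symbols[1:]
    if first not in CFG5:                       # terminal symbol
        return text[:1] == first and derives(rest, text[1:])
    # Nonterminal: try each prefix length that leaves room for `rest`.
    for cut in range(1, len(text) - len(rest) + 1):
        prefix, suffix = text[:cut], text[cut:]
        if derives(rest, suffix) and any(derives(rhs, prefix) for rhs in CFG5[first]):
            return True
    return False

def in_language(text):
    return derives("S", text)

for s in ["0101", "011", "11110011", "00011000110001110"]:
    print(s, in_language(s))
```

Only 0101 is accepted; the other three strings do not have equal numbers of 0s and 1s, which every sentence of this grammar must have.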

REGULAR is the most restrictive grammar class. Rules must have a single nonterminal on the LHS and, on the RHS, either one terminal or one terminal together with one nonterminal. The position of the nonterminal must be consistent across rules: if the nonterminal precedes the terminal the grammar is left regular, and if it follows the terminal the grammar is right regular. Every regular grammar is also context-free, but not all context-free grammars are regular. The grammar below is right regular.

    <binary number> ::=  0
    <binary number> ::=  1
    <binary number> ::=  1 <binary number>
    <binary number> ::=  0 <binary number>
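
Because the grammar is regular, the same language can be recognized by a single left-to-right scan or described by a regular expression. A small sketch (the function and constant names are assumptions):

```python
import re

# Equivalent regular expression: one or more binary digits.
BINARY_NUMBER = re.compile(r"[01]+")

def is_binary_number(text):
    """Follow the right-regular rules: consume one terminal per step."""
    if text == "":
        return False              # no rule derives the empty string
    for ch in text:
        if ch not in "01":        # every rule begins with the terminal 0 or 1
            return False
    return True

for s in ["0", "1010", "", "102"]:
    print(repr(s), is_binary_number(s), bool(BINARY_NUMBER.fullmatch(s)))
```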

John Backus - 1959

Backus-Naur Form (BNF) invented to describe Algol 58

BNF is a notation for context-free grammars under Chomsky's definition (a single nonterminal on the LHS of every rule)

Most widely known method for describing programming language syntax

BNF terminology

In BNF, abstractions represent classes of syntactic structures and are the nonterminals in the grammar.

Nonterminals are enclosed in angle brackets

Terminals are the lexemes

The tokens in the language are the categories of lexemes; e.g. 'num' is a lexeme in the token classification of identifier.

A production is called a rule in BNF

Examples of BNF rules:

      <ident_list> -> identifier | identifier, <ident_list>
         <if_stmt> -> if <logic_expr> then <stmt>

A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence

A sentence is a sentential form with all terminal symbols

The derivation symbol is usually '=>' or '->' rather than ::= (depends on author)

Every string of symbols in the derivation is a sentential form

A derivation may be either leftmost or rightmost (default is leftmost)

In a leftmost derivation, the leftmost nonterminal in each sentential form is expanded first

Parse tree: A hierarchical representation of a derivation

Infinite lists are described using recursion

 
  <ident_list> -> ident | ident, <ident_list>

An Example Grammar:

    <program> -> <stmts>
    <stmts> -> <stmt> | <stmt> ; <stmts>
     <stmt> ->  <var> = <expr>
     <expr> -> <term> + <term> | <term> - <term>
     <term> -> <var> | const
     <var>  -> a | b | c | d

An Example Derivation and Parse Tree:

  Sentence: a = b + const

  <program> => <stmts> 
            => <stmt> 
            => <var> = <expr> 
            => a = <expr> 
            => a = <term> + <term>
            => a = <var> + <term> 
            => a = b + <term>
            => a = b + const

  Parse Tree:
                  <program>
                     |
                  <stmts>
                     |
                   <stmt>
                  /  |   \ 
              <var>  =  <expr>
                |       /  |  \
                a   <term> +  <term>
                      |         |
                    <var>      const      
                      |
                      b
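
The derivation above can be replayed step by step in code. In this sketch the grammar is stored as a dictionary, and each step names which alternative to apply to the leftmost nonterminal; the function name and the index-based encoding of choices are assumptions made for illustration.

```python
# The example grammar; symbols in angle brackets are nonterminals.
GRAMMAR = {
    "<program>": [["<stmts>"]],
    "<stmts>":   [["<stmt>"], ["<stmt>", ";", "<stmts>"]],
    "<stmt>":    [["<var>", "=", "<expr>"]],
    "<expr>":    [["<term>", "+", "<term>"], ["<term>", "-", "<term>"]],
    "<term>":    [["<var>"], ["const"]],
    "<var>":     [["a"], ["b"], ["c"], ["d"]],
}

def leftmost_derivation(start, choices):
    """Apply one rule per step to the leftmost nonterminal.

    `choices` lists, for each step, the index of the alternative to use.
    Returns every sentential form, starting with the start symbol.
    """
    form = [start]
    forms = [" ".join(form)]
    for choice in choices:
        # Position of the leftmost nonterminal in the current form.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form = form[:i] + GRAMMAR[form[i]][choice] + form[i + 1:]
        forms.append(" ".join(form))
    return forms

steps = leftmost_derivation("<program>", [0, 0, 0, 0, 0, 0, 1, 1])
print("\n=> ".join(steps))
```

The last sentential form printed is the sentence a = b + const.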

Ambiguity in Grammars:

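A grammar is ambiguous if some valid sentence has more than one distinct parse tree. To prove that a grammar is ambiguous, find one sentence in the language with two different parse trees (equivalently, two different leftmost derivations or two different rightmost derivations). A leftmost and a rightmost derivation of the same sentence do not by themselves prove ambiguity, because both can describe the same tree; in the example below they happen to build different trees, and it is the two trees that show the grammar is ambiguous. Example: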

  AMBIGUOUS GRAMMAR
  <expr> ->  <expr> <op> <expr>  |  const
  <op>   -> /  |  -

Sample expression in the grammar: const - const / const

Rightmost Derivation and Rightmost Parse Tree. 
  <expr> =>  <expr> <op> <expr>  
            =>  <expr> <op> const  
            =>  <expr> / const 
            =>  <expr> <op> <expr>  / const
            =>  <expr> <op> const  / const
            =>  <expr> - const  / const
            =>  const - const  / const

                      <expr>
                 /      |      \ 
            <expr>     <op>   <expr> 
          /    |   \      |     |
      <expr> <op> <expr>  /   const 
         |     |   |     
       const   -  const 

 Leftmost Derivation and Leftmost Parse Tree. 
  <expr> =>  <expr> <op> <expr>  
            =>  const <op> <expr>  
            =>  const - <expr>  
            =>  const - <expr> <op> <expr> 
            =>  const - const <op> <expr> 
            =>  const - const / <expr> 
            =>  const - const / const

                      <expr>
                 /      |      \ 
            <expr>     <op>   <expr> 
                |       |     /   |   \    
              const     -  <expr> <op> <expr>  
                             |     |   |     
                           const   /  const
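
The number of distinct parse trees can also be counted directly. This sketch is specific to the ambiguous grammar above: it treats each operator in turn as the <op> of the root rule and multiplies the tree counts of the two sides. The function name is an assumption.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(tokens):
    """Number of parse trees for `tokens` under
    <expr> -> <expr> <op> <expr> | const,  <op> -> / | -
    """
    if tokens == ("const",):
        return 1
    total = 0
    # Try every operator as the <op> of the root rule.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("/", "-"):
            total += count_trees(tokens[:i]) * count_trees(tokens[i + 1:])
    return total

sentence = tuple("const - const / const".split())
print(count_trees(sentence))   # 2
```

A count above 1 for any sentence is enough to show that the grammar is ambiguous.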

Sometimes ambiguity is not a problem semantically (i.e., the meaning of the sentence remains unchanged). It is a problem when the semantic *meaning* of the sentence differs between parse trees. The meaning of "const - const / const" differs because (const - const) / const and const - (const / const) generally evaluate to different values. So ambiguity in this expression grammar is a problem. To fix it, you need to create a grammar that enforces correct precedence for operators (in this case division over subtraction). A grammar that has double recursion (the non-terminal on the LHS appears twice in the RHS, as in <expr> -> <expr> <op> <expr>) is always ambiguous. So you will need to remove the double recursion by replacing one occurrence of the non-terminal with an "intermediate" non-terminal.

An easy method to create the correct unambiguous grammar is to follow the behavior of the parse tree that gives you the correct result, creating production rules and replacing NTs with intermediate NTs as you go in order to force the outcome. In this case, the Leftmost Parse Tree is the one that gives division precedence over subtraction.

  <expr> -> <expr> - <term>  |  <term>
  <term> ->  <term> / const | const

Operator Precedence in Expression Grammars

The parse tree indicates precedence levels of the operators - nodes farthest from the root are evaluated first (since trees are displayed upside down these are the lowest levels). Evaluation is a LVR traversal of the parse tree.

Example Grammar

    <expr> -> <expr> + <term> | <term>
    <term> -> <term> * <factor>  | <factor>
    <factor> -> ( <expr> ) | <id>
    <id> -> A | B | C

Which operator (+ or *) has precedence?

What happens if you swap the + and * operators in the grammar?

<term> is defined as <term> * <factor> before it can reach <id> (through <factor>). What does this mean in terms of a parse tree?

Associativity of Operators in Expression Grammars

The property of associativity for binary operators means that ((a op b) op c) is the same as (a op (b op c)). Addition is associative; subtraction is not.

Associativity in a programming language defines the order of operations for "like" operators that are non-associative

Right associative means that operators of equal precedence are evaluated from right to left. Right recursion enforces right association. In right recursion the nonterminal in the LHS appears at the right end of the RHS.

Left associative means that operators of equal precedence are evaluated from left to right. Left recursion enforces left association. In left recursion the nonterminal in the LHS appears at the left end of the RHS.

Operator associativity can be indicated by a grammar. For example:

  7 - 4 - 2

Left associativity means the 4 *associates* left: (7 - 4) - 2 = 1

Right associativity means the 4 *associates* right: 7 - (4 - 2)= 5

Example of recursion in the rule to control associativity with corresponding parse trees for 7 - 4 - 2.

  <expr> -> <expr> - <term>  |  <term>   (left recursive)
  <term> -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
 
                     <expr>
                   /   |    \
              <expr>   -  <term>   Semantic meaning: (7-4)-2=1 
            /   |   \       |
       <expr>   - <term>    2
          |           |
       <term>         4
          |
          7
 
  <expr> -> <term> - <expr>  |  <term>   (right recursive)
  <term> -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 

                     <expr>
                   /   |    \
              <term>   -     <expr>      Semantic meaning: 7-(4-2)=5 
                 |          /   |   \
                 7      <term>  -  <expr>
                           |           |
                           4        <term>
                                       |
                                       2
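
The two grammars translate into two different evaluation strategies. A minimal sketch, assuming the input is already a well-formed list of single digits separated by '-' (the function names are made up for this example):

```python
def eval_left(tokens):
    """<expr> -> <expr> - <term> | <term>: fold from the left."""
    value = int(tokens[0])
    for i in range(1, len(tokens), 2):   # tokens[i] is '-', tokens[i+1] a digit
        value -= int(tokens[i + 1])
    return value

def eval_right(tokens):
    """<expr> -> <term> - <expr> | <term>: recurse on the right operand."""
    if len(tokens) == 1:
        return int(tokens[0])
    return int(tokens[0]) - eval_right(tokens[2:])

tokens = "7 - 4 - 2".split()
print(eval_left(tokens))    # 1, i.e. (7 - 4) - 2
print(eval_right(tokens))   # 5, i.e. 7 - (4 - 2)
```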

Extended BNF (EBNF)

ISO/IEC 14977, the standard for EBNF, is rarely followed completely

Extended BNF improves readability and writability of BNF

Optional parts are placed in brackets [ ]


  <proc_call> -> ident [(<expr_list>)]

Alternative parts of RHSs are placed inside parentheses and separated via vertical bars (or by spaces)


  <term> -> <term> (+ | -) const

Repetitions (0 or more) are placed inside braces { }


  <ident> -> letter {letter | digit}

  BNF
     <expr> -> <expr> + <term>
             | <expr> - <term>
             | <term>
     <term> -> <term> * <factor>
             | <term> / <factor>
             | <factor>

  EBNF
     <expr> -> <term> {(+ | -) <term>}
     <term> -> <factor> {(* | /) <factor>}

  note: EBNF does not specify associativity--the syntax analyzer must do this
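
EBNF repetition maps naturally onto a loop in a hand-written recursive-descent parser, and the loop is where the syntax analyzer settles associativity. The sketch below follows the two EBNF rules for <expr> and <term>; the rule for <factor> (an integer or a parenthesized expression) and the tuple representation of the tree are assumptions added for this example.

```python
import re

def parse(source):
    """Parse `source` into nested tuples (op, left, right)."""
    tokens = re.findall(r"\d+|[-+*/()]", source)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():                    # <expr> -> <term> {(+ | -) <term>}
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())   # folding leftward makes + and - left associative
        return node

    def term():                    # <term> -> <factor> {(* | /) <factor>}
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                  # assumed: <factor> -> ( <expr> ) | integer
        if peek() == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if peek() is None or not peek().isdigit():
            raise SyntaxError(f"unexpected token: {peek()!r}")
        return int(advance())

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token: {peek()!r}")
    return tree

print(parse("1 - 2 - 3"))   # ('-', ('-', 1, 2), 3)
print(parse("1 + 2 * 3"))   # ('+', 1, ('*', 2, 3))
```

Building the tuple as (op, node, next operand) inside the loop is a design choice that yields left associativity; a parser that recursed on the right operand instead would make the same EBNF rule right associative.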

Some Examples & Problems

Example 3.4 :

    <assign> -> <id> = <expr>
    <id> -> A | B | C 
    <expr> -> <expr> + <term> | <term>
    <term> -> <term> * <factor>  | <factor>
    <factor> -> ( <expr> ) | <id>

Example 3.2:

    <assign> -> <id> = <expr>
    <id> -> A | B | C
    <expr> -> <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>

#1. Consider this Grammar

    <assign> -> <id> = <expr>
    <expr> -> <expr> + <term> | <term>
    <term> -> <term> * <factor>  | <factor>
    <factor> -> ( <expr> ) | <id>
    <id> -> A | B | C

    Is + right or left associative?

    How would you change the associativity of +?

#2. Consider the following two grammars, each of which generates strings of
    correctly balanced parentheses and brackets. Determine if either or both
    is ambiguous. The letter "e" represents the Greek letter epsilon; i.e., 
    the empty string.

 (a) <string> ::= <string> <string> | (<string>) | [<string>] | e
 (b) <string> ::= (<string>) <string> | [<string>]  <string> | e


#3. Consider the following specification of expressions:

    <expr> ::= <element> | <expr> <weak op> <expr>
    <element> ::= <numeral> | <variable>
    <weak op> ::= + | -

    Demonstrate its ambiguity by displaying two derivation trees for the
    expression a - b - c.


#4. Modify the following grammar to add a unary minus operator that has
    higher precedence than either + or *.

    <assign> -> <id> = <expr>
    <id> -> A | B | C
    <expr> -> <expr> + <term> | <term>
    <term> -> <term> * <factor>  | <factor>
    <factor> -> ( <expr> ) | <id>

#5. Consider the following grammar:

   <expr> ::= <term> | <expr> + <term>
   <term> ::= <element> | <term> * <element>
   <element> ::= <num> | <var>

Draw the parse tree for this valid expression:

    5 * Z + 30 + 7

#6. Consider the grammar

  S -> AB | AD
  A -> Aa | b
  B -> Bb | a
  D -> cDc | d 

  Are these valid strings in the language generated by the grammar?

  ba        ?

  bbbbab    ?

  aaaacccc  ?

  bcdcc     ?

  baaab     ?

  baccdcc   ?

#8. Consider this grammar:

     <assign> -> <id> = <expr>
     <id> -> A | B | C
     <expr> -> <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>

   Show a parse tree and a leftmost derivation for the following statement:

     A = A * ( B + ( C * A))

1. <assign> => <id> = <expr>
2.          => A = <expr>
3.          => A = <id> * <expr>
4.          => A = A  * <expr>
5.          => A = A  * (<expr>)
6.          => A = A  * (<id> + <expr>)
7.          => A = A  * (B + <expr>)
8.          => A = A  * (B + (<expr>))
9.          => A = A  * (B + (<id> * <expr>))
10.         => A = A  * (B + ( C * <expr>))
11.         => A = A  * (B + ( C * <id>))
12.         => A = A  * (B + ( C * A))

   Parse tree.

                      <assign>
                     /    |    \
                   <id>   =     <expr>
                    |           /  |  \
                    A         <id> *  <expr>
                               |        |
                               A     ( <expr> )
                                       /  |  \
                                     <id> +  <expr>
                                      |        |
                                      B    ( <expr> )
                                             /  |  \
                                          <id>  *  <expr>
                                            |        |
                                            C       <id>
                                                     |
                                                     A

 To prove that a grammar is ambiguous, provide a sentence in the grammar that
  has two different parse trees. (A single derivation cannot show this; you
  need two distinct trees, or equivalently two distinct leftmost derivations.)

   <S> -> <A>
   <A> -> <A> + <A>  | <id>
   <id> -> a | b | c

  Sentence:   a + b + c

  Tree 1.             <S>
                       |
                      <A>
                    /  |  \
                  <A>  +  <A>
                   |     / | \
                 <id>  <A> + <A>
                   |    |     |
                   a   <id>  <id>
                        |     |
                        b     c


  Tree 2.            <S>
                      |
                     <A>
                    / |  \
                  <A> +   <A>
                 / | \     |
               <A> + <A>  <id>
                |     |    |
               <id>  <id>  c
                |     |
                a     b

Describe the language defined by the following grammar:

  <S> -> <A> a <B> b
  <A> -> <A> b | b
  <B> -> a <B> | a

Start by finding the behavior of <A> and <B> :

 <A> ->  b
         bb
         bbb ...

 <A> is 1 or more "b"s.

 <B> -> a
        aa
        aaa ...
        
  <B> is 1 or more "a"s

  Thus,

  <A>a<B>b defines the language of all strings of 1 or more "b"s followed 
  by 2 or more "a"s ending with a "b".
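
  The description can be cross-checked by generating short sentences from the grammar and comparing them with a regular expression for "1 or more b, 2 or more a, then b". This is only a sketch: nonterminals are written as the single letters S, A, B, and the function name and length bound are assumptions.

```python
import re
from collections import deque

# <S> -> <A> a <B> b ; <A> -> <A> b | b ; <B> -> a <B> | a
RULES = {"S": ["AaBb"], "A": ["Ab", "b"], "B": ["aB", "a"]}

def generate(max_len):
    """All sentences of length <= max_len, by leftmost expansion."""
    seen, queue, found = {"S"}, deque(["S"]), []
    while queue:
        form = queue.popleft()
        # Index of the leftmost nonterminal, or None for a sentence.
        i = next((k for k, s in enumerate(form) if s in RULES), None)
        if i is None:
            found.append(form)
            continue
        for rhs in RULES[form[i]]:
            new = form[:i] + rhs + form[i + 1:]
            # No rule shortens a form, so overlong forms are dead ends.
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found)

PATTERN = re.compile(r"b+a{2,}b")
print(generate(6))
print(all(PATTERN.fullmatch(s) for s in generate(6)))
```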