Chapter 3.1 - 3.3 Describing Syntax
TERMINOLOGY.

Language: a finite or infinite set of sentences. A language has a lexicon, syntax rules, and semantics. A grammar is a formal definition of a language.

Lexicon: contains all the lexemes of the language; i.e., predefined names, symbols and user defined identifiers (see C's lexicon)

Syntax: the form or structure of units in the language whether sentences, expressions, statements, or program units.

Semantics: the meaning of the expressions, statements, and program units in a language. For a programming language, semantics most often describe the runtime behavior of a program. A syntactically correct statement may be semantically meaningless.

Lexeme: the lowest level syntactic unit of a language; i.e., lexemes are the terminal symbols in the language. The lexemes in the English lexicon include words and punctuation symbols. An example of a lexeme is any C keyword.

A Sentence in language L is a valid string of lexemes over the terminal set of L. In English, the lexemes are words and a sentence is a string of words (plus punctuation). A word can also have a grammar, whose terminal set is the alphabet; the language is then the set of valid arrangements of letters. If the language is the set of identifiers, then the terminal set is called the character set of L. The C/C++ character set is ASCII (excluding non-printable characters). The Java character set is Unicode.

Token: a category of lexemes (e.g., identifier, keyword, literal, separator, operator in a programming language or noun, verb, adverb, ... in human language). A token is a classification of the terminal symbols of a language. The lexemes are the actual terminals in the language. For example, keyword is a token classification in most languages. The actual keyword 'const' is a lexeme belonging to that classification.
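The token/lexeme distinction can be sketched with a small classifier. This is a minimal illustration, not a real C lexer: the token classes, the keyword set, and the name classify are assumptions chosen for the example.

```python
import re

# Hypothetical token classes for a tiny C-like fragment (illustrative only).
KEYWORDS = {"const", "int", "return"}
TOKEN_SPEC = [
    ("literal",    r"\d+"),
    ("identifier", r"[A-Za-z_]\w*"),
    ("operator",   r"[=+\-*/]"),
    ("separator",  r"[;(),]"),
    ("skip",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def classify(source):
    """Return (token, lexeme) pairs for every lexeme in source."""
    pairs = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r}")
        pos = match.end()
        token, lexeme = match.lastgroup, match.group()
        if token == "skip":
            continue                      # whitespace separates lexemes
        if token == "identifier" and lexeme in KEYWORDS:
            token = "keyword"             # keywords look like identifiers
        pairs.append((token, lexeme))
    return pairs

print(classify("const int num = 42;"))
```

Each pair names the category first and the actual text second, so 'const' is reported as a lexeme of the keyword token and 'num' as a lexeme of the identifier token.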

Grammars: A Formal Definition of a Language

Grammar: a formal definition for describing the syntax of a language through a finite nonempty set of rules upon the tokens in the language. A grammar is a metalanguage of a language. Grammars do not describe semantics. ( see Java's Grammar )

Two types of grammars:

Generational models "generate" all possible sentences and no others in the language. Context-free and BNF grammars are generational models. A device that generates sentences of a language (a derivation or a parse tree) can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator.

Recognitional models verify whether a sentence is in the language or not. A recognition device reads input strings of the language and decides whether the input strings belong to the language. Example: The syntax analysis part of a compiler is a recognitional model.

Noam Chomsky (mid-1950s)

  • Chomsky developed language generators to describe the syntax of natural languages

  • A generative grammar should generate all possible sentences in the language and only those sentences (completeness and accuracy)

  • The empty string is denoted by λ or ε; a language may or may not contain it

  • A generative grammar is a 4-tuple, <Σ, N, P, S>, consisting of

    + A finite set Σ of terminal symbols, e.g. the terminal alphabet of the language that combine to form sentences.

    + A finite set N of nonterminal symbols or syntactic categories, each of which represents some collection of subphrases of the sentences. (Nonterminals are denoted in some way.)

    + A finite set P of productions or rules that describe how each nonterminal is defined in terms of terminal symbols and nonterminals. A production has a left-hand side (LHS) and right-hand side (RHS). Productions use ::= or -> , read as "is defined as". The "|" symbol is OR.

         aBb ::= bC
    
         S ::= 0B | 1A
    
    + A nonterminal S, the start symbol, that specifies the uppermost category of the language (e.g. a sentence).

  • The vocabulary of a grammar consists of the terminal and nonterminal symbols.

  • A sentential form is a string of symbols in the derivation of a sentence in the grammar.
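The 4-tuple can be written down directly as a data structure. The sketch below is one possible encoding, not a standard one: the names Grammar and is_well_formed are hypothetical, and the rules for A and B are added only so the example is complete.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    terminals: frozenset      # Σ
    nonterminals: frozenset   # N
    productions: tuple        # P, as (lhs, rhs) pairs of symbol tuples
    start: str                # S

def is_well_formed(g):
    """Check the basic constraints on <Σ, N, P, S>."""
    vocabulary = g.terminals | g.nonterminals
    if g.terminals & g.nonterminals or g.start not in g.nonterminals:
        return False
    for lhs, rhs in g.productions:
        if not any(sym in g.nonterminals for sym in lhs):
            return False      # every LHS needs at least one nonterminal
        if not all(sym in vocabulary for sym in lhs + rhs):
            return False      # every symbol must come from the vocabulary
    return True

# S ::= 0B | 1A from the text, plus illustrative rules A ::= 0 and B ::= 1.
g = Grammar(
    terminals=frozenset("01"),
    nonterminals=frozenset("SAB"),
    productions=(
        (("S",), ("0", "B")),
        (("S",), ("1", "A")),
        (("A",), ("0",)),
        (("B",), ("1",)),
    ),
    start="S",
)
print(is_well_formed(g))
```

Storing each side of a production as a tuple of symbols allows a multi-symbol LHS such as aBb, which the more restricted grammar classes then rule out.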

    Chomsky's 4 classes of languages and corresponding grammars

    From least to most restrictive (fewer restrictions allow a more expressive class of languages):

    UNRESTRICTED, requires only that at least one nonterminal appear on the LHS

         aBb ::= bC
    
    CONTEXT SENSITIVE, requires at least one nonterminal on the LHS and that the RHS contain no fewer symbols than LHS. Rules are of the form:
         vAw ::= vzw 
    
    Example: A grammar for the language of strings with equal numbers of a's, b's, and c's in that order, i.e. { abc, aabbcc, aaabbbccc, ... }, is context-sensitive. The four rules in the middle of P swap an adjacent CQ into QC, one symbol at a time, through the helper nonterminals X and Y.
        Σ = {a,b,c}
        N = {S,C,Q,X,Y}
        S = S
        P =  {
          S -> abC | aSQ
          bQC -> bbCC 
          CQ -> CX
          CX -> YX
          YX -> YC
          YC -> QC
          C -> c   }
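Because no rule of a context-sensitive grammar shortens a sentential form, the sentences up to a given length can be enumerated by a bounded search. The sketch below is a hedged illustration: it routes the C/Q swap through two helper nonterminals, X and Y, so that no intermediate form can be matched by another rule, and the names RULES and sentences are assumptions of the sketch.

```python
from collections import deque

# Productions as (lhs, rhs) strings; upper case = nonterminal, lower case = terminal.
RULES = [("S", "abC"), ("S", "aSQ"), ("bQC", "bbCC"),
         ("CQ", "CX"), ("CX", "YX"), ("YX", "YC"), ("YC", "QC"),
         ("C", "c")]

def sentences(max_len):
    """Breadth-first search over sentential forms no longer than max_len."""
    seen = {"S"}
    queue = deque(["S"])
    found = set()
    while queue:
        form = queue.popleft()
        if form.islower():                # only terminals left: a sentence
            found.add(form)
            continue
        for lhs, rhs in RULES:
            start = form.find(lhs)
            while start != -1:            # try every position where lhs occurs
                new = form[:start] + rhs + form[start + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = form.find(lhs, start + 1)
    return sorted(found, key=len)

print(sentences(9))
```

The length bound is what makes the search finite; forms that get stuck (for example, after C -> c is applied too early) are simply never reported as sentences.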

    CONTEXT FREE, requires that every rule have a single nonterminal on the LHS; the RHS is any string of terminals and nonterminals (possibly the empty string, written λ or ε). Examples:

     CFG 1.
        Σ = {(,)}
        N = {S}
        S = S
        P = { S ::= () | (S) | SS  }
    
     CFG 2.
        Σ = {a,b}
        N = {S}
        S = S
        P = { S ::= aSa | bSb | a | b | λ }
    
     CFG 3.
        Σ = {a,b}
        N = {S}
        S = S
        P = { S ::= aSb | ε }
        L = { ε, ab, aabb, aaabbb, aaaabbbb, ... }
    
     CFG 4.
        Σ = {a,b}
        N = {S,A,B}
        S = S
        P = {S ::= aABb 
             A ::= Ab | b 
             B ::= aB | a }
    
     CFG 5.
        Σ = {0,1}
        N = {S,A,B}
        S = S
        P = {S ::= 1A | 0B 
             A ::= 0 | 0S | 1AA
             B ::= 1 | 1S | 0BB }
    
    Are these strings in the language defined by CFG 5?
        0101
        011
        11110011
        00011000110001110 

    Before deriving the strings let's see if we can uncover some patterns in the language. This language is complicated but there is one pattern that is easy to discern. The strings below are in L(5) following the S->1A; A->0 | 0S rules:

          10,1010,101010,10101010,..  
    Following the S->0B; B->1 | 1S rules we have
        
           01,0101, 010101, 01010101,... 

    It looks like every string made of '01' repeated one or more times, and every string made of '10' repeated one or more times, is in L(5).

    The remaining rules are more complicated. If a derivation applies S -> 1A and then A -> 1AA, the sentential form contains 11AA, and each A must still produce at least one 0. The extra 1 is therefore always balanced by a later 0:
    
       1100, 101100,  ... 

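    Based on these patterns it appears that '11110011' is not in the language: it has six 1s but only two 0s, while the rules balance every 1 with a 0. But let's try to derive it:

               S -> 1A
                 -> 11AA
                 -> 111AAA
                 -> 1111AAAA
                 -> 11110AAA
                 -> 111100AA  STUCK (the remaining input is 11, but each A must still produce at least one 0)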
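Membership in L(5) can also be checked mechanically by trying every rule against the input. The following is a minimal sketch of such a recognizer, assuming the rules of CFG 5 with the digit 0 in S ::= 0B; the names RULES and derives are hypothetical. It relies on the fact that every alternative in this grammar starts with a terminal, so each expansion consumes one input character.

```python
from functools import lru_cache

# CFG 5: each alternative is a string of terminals (0, 1) and nonterminals (S, A, B).
RULES = {
    "S": ["1A", "0B"],
    "A": ["0", "0S", "1AA"],
    "B": ["1", "1S", "0BB"],
}

@lru_cache(maxsize=None)
def derives(form, text):
    """Return True if the sentential form can derive exactly text."""
    if not form:
        return not text
    head, rest = form[0], form[1:]
    if head not in RULES:                  # terminal: must match the next character
        return text[:1] == head and derives(rest, text[1:])
    if len(form) > len(text):              # every symbol yields at least one character
        return False
    return any(derives(alt + rest, text) for alt in RULES[head])

for candidate in ["0101", "011", "11110011", "00011000110001110"]:
    print(candidate, derives("S", candidate))
```

Running the loop answers the four membership questions; the memoization keeps repeated sub-derivations from being explored twice.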
    REGULAR is the most restrictive class of grammar. Every rule must have a single nonterminal on the LHS, and the RHS must be either one terminal or one terminal together with one nonterminal. The position of the nonterminal must be the same in every rule: if the nonterminal comes before the terminal the grammar is left regular, and if it comes after the terminal the grammar is right regular. Every regular grammar is also context-free, but not every context-free grammar is regular. The following grammar is right regular.
        <binary number> ::=  0
        <binary number> ::=  1
        <binary number> ::=  1 <binary number>
        <binary number> ::=  0 <binary number>
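The four rules above generate one or more binary digits, which is the language matched by the regular expression [01]+. A small sketch using Python's re module (the function name is_binary_number is an assumption):

```python
import re

# One or more binary digits, mirroring the right regular grammar.
BINARY = re.compile(r"[01]+")

def is_binary_number(text):
    """Return True if the whole string is a <binary number>."""
    return BINARY.fullmatch(text) is not None

for candidate in ["0", "1011", "", "102"]:
    print(repr(candidate), is_binary_number(candidate))
```

The empty string is rejected because every rule of the grammar produces at least one digit.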
    

    John Backus - 1959

  • Backus-Naur Form (BNF) invented to describe Algol 58
  • BNF is a notation for context-free grammars under Chomsky's definition (a single nonterminal on the LHS of every rule)
  • Most widely known method for describing programming language syntax

    BNF terminology

  • In BNF, abstractions represent classes of syntactic structures and are the nonterminals in the grammar.
  • Nonterminals are enclosed in angle brackets
  • Terminals are the lexemes
  • The tokens in the language are the categories of lexemes; e.g. 'num' is a lexeme in the token classification of identifier.
  • A production is called a rule in BNF

    Examples of BNF rules:

          <ident_list> -> identifier | identifier, <ident_list>
             <if_stmt> -> if <logic_expr> then <stmt>
    
    
  • A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence
  • A sentence is a sentential form consisting only of terminal symbols
  • The derivation symbol is usually '=>' or '->' rather than ::= (depends on author)
  • Every string of symbols in the derivation is a sentential form
  • A derivation may be either leftmost or rightmost (default is leftmost)
  • In a leftmost derivation, the leftmost nonterminal in each sentential form is expanded first
  • Parse tree: A hierarchical representation of a derivation
  • Infinite lists are described using recursion
     
      <ident_list> -> ident | ident, <ident_list>
    
    An Example Grammar:
        <program> -> <stmts>
        <stmts> -> <stmt> | <stmt> ; <stmts>
         <stmt> ->  <var> = <expr>
         <expr> -> <term> + <term> | <term> - <term>
         <term> -> <var> | const
         <var>  -> a | b | c | d
    
    An Example Derivation and Parse Tree:
      Sentence: a = b + const
    
      <program> => <stmts> 
                => <stmt> 
                => <var> = <expr> 
                => a = <expr> 
                => a = <term> + <term>
                => a = <var> + <term> 
                => a = b + <term>
                => a = b + const
    
      Parse Tree:
                      <program>
                         |
                      <stmts>
                         |
                       <stmt>
                      /  |   \ 
                  <var>  =  <expr>
                    |       /  |  \
                    a   <term> +  <term>
                          |         |
                        <var>      const      
                          |
                          b
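The derivation above can be replayed step by step: at each step the leftmost nonterminal is replaced by one of its alternatives. The sketch below assumes the example grammar as written; RULES, leftmost_derivation, and the list of choice indices are names and conventions invented for the illustration.

```python
import re

# The example grammar; each nonterminal maps to its alternatives in order.
RULES = {
    "<program>": ["<stmts>"],
    "<stmts>": ["<stmt>", "<stmt> ; <stmts>"],
    "<stmt>": ["<var> = <expr>"],
    "<expr>": ["<term> + <term>", "<term> - <term>"],
    "<term>": ["<var>", "const"],
    "<var>": ["a", "b", "c", "d"],
}
NONTERMINAL = re.compile(r"<\w+>")

def leftmost_derivation(start, choices):
    """Expand the leftmost nonterminal once per entry in choices."""
    form = start
    steps = [form]
    for choice in choices:
        match = NONTERMINAL.search(form)      # leftmost nonterminal
        if match is None:
            raise ValueError("no nonterminal left to expand")
        rhs = RULES[match.group()][choice]
        form = form[:match.start()] + rhs + form[match.end():]
        steps.append(form)
    return steps

# Index of the alternative chosen at each step for "a = b + const".
steps = leftmost_derivation("<program>", [0, 0, 0, 0, 0, 0, 1, 1])
print("\n=> ".join(steps))
```

Every string in steps is a sentential form, and the last one, which contains no nonterminals, is the sentence.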
    
    Ambiguity in Grammars:

    A grammar is ambiguous if some valid sentence has more than one distinct parse tree. To prove that a grammar is ambiguous, take a sentence in the grammar and show two distinct parse trees for it (equivalently, two distinct leftmost derivations). A leftmost and a rightmost derivation can describe the same tree, so what matters is that the trees differ; in that case the grammar is ambiguous. Example:

      AMBIGUOUS GRAMMAR
      <expr> ->  <expr> <op> <expr>  |  const
      <op>   -> /  |  -
    
    Sample expression in the grammar: const - const / const
    
    Rightmost Derivation and Rightmost Parse Tree. 
      <expr> =>  <expr> <op> <expr>  
                =>  <expr> <op> const  
                =>  <expr> / const 
                =>  <expr> <op> <expr>  / const
                =>  <expr> <op> const  / const
                =>  <expr> - const  / const
                =>  const - const  / const
    
                          <expr>
                     /      |      \ 
                <expr>     <op>   <expr> 
              /    |   \      |     |
          <expr> <op> <expr>  /   const 
             |     |   |     
           const   -  const 
    
     Leftmost Derivation and Leftmost Parse Tree. 
      <expr> =>  <expr> <op> <expr>  
                =>  const <op> <expr>  
                =>  const - <expr>  
                =>  const - <expr> <op> <expr> 
                =>  const - const <op> <expr> 
                =>  const - const / <expr> 
                =>  const - const / const
    
                          <expr>
                     /      |      \ 
                <expr>     <op>   <expr> 
                    |       |     /   |   \    
                  const     -  <expr> <op> <expr>  
                                 |     |   |     
                               const   /  const
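The number of distinct parse trees for a sentence of this ambiguous grammar can be counted by trying every operator as the root of the tree. This is a hedged sketch for this one grammar only; the function name count_parse_trees is an assumption.

```python
from functools import lru_cache

OPS = {"/", "-"}

def count_parse_trees(tokens):
    """Count parse trees for <expr> -> <expr> <op> <expr> | const."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):
        # Number of ways tokens[lo:hi] can be derived from <expr>.
        if hi - lo == 1:
            return 1 if tokens[lo] == "const" else 0
        total = 0
        for mid in range(lo + 1, hi - 1):     # candidate position of the root <op>
            if tokens[mid] in OPS:
                total += count(lo, mid) * count(mid + 1, hi)
        return total

    return count(0, len(tokens))

print(count_parse_trees("const - const / const".split()))
```

A result greater than 1 for any sentence is enough to show that the grammar is ambiguous.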
    
    

    Sometimes ambiguity is not a problem semantically (i.e., the meaning of the sentence is the same for every parse tree). It is a problem when the meaning of the sentence differs between parse trees. The meaning of "const - const / const" differs, since (const - const) / const and const - (const / const) generally produce different values. So ambiguity in this expression grammar is a problem. To fix it, you need to create a grammar that enforces correct precedence for operators (in this case division over subtraction). A grammar that has double recursion (the nonterminal on the LHS appears twice in the RHS, as in <expr> -> <expr> <op> <expr>) is always ambiguous. So you will need to remove the double recursion by replacing one occurrence of the nonterminal with an "intermediate" nonterminal.

    An easy method to create the correct unambiguous grammar is to follow the behavior of the parse tree that gives you the correct result, creating production rules and replacing NTs with intermediate NTs as you go in order to force the outcome. In this case, the Leftmost Parse Tree is the one that gives division precedence over subtraction.

      <expr> -> <expr> - <term>  |  <term>
      <term> ->  <term> / const | const
    

    Operator Precedence in Expression Grammars

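    The parse tree indicates precedence levels of the operators: operators farthest from the root are evaluated first (since trees are drawn with the root at the top, these are the lowest levels). Reading the leaves from left to right gives back the sentence; evaluation works bottom-up, applying each operator after both of its operand subtrees have been evaluated (an LRV, or postorder, traversal).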

    Example Grammar

        <expr> -> <expr> + <term> | <term>
        <term> -> <term> * <factor>  | <factor>
        <factor> -> ( <expr> ) | <id>
        <id> -> A | B | C
    
    Which operator (+ or *) has precedence?

    What happens if you swap the + and * operators in the grammar?

    <term> is defined as <term> * <factor> before it is defined as <id>. What does this mean in terms of a parse tree?
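One way to explore these questions is to parse a few expressions and look at how the grammar groups them. The sketch below is a hand-written recursive-descent parser for the example grammar, with the left-recursive rules rewritten as loops; it returns a fully parenthesized string rather than a tree. The names group, peek, and expect are hypothetical.

```python
def group(tokens):
    """Return a fully parenthesized string showing how the grammar groups tokens."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(options):
        nonlocal pos
        token = peek()
        if token not in options:
            raise SyntaxError(f"expected one of {options}, found {token!r}")
        pos += 1
        return token

    def expr():                       # <expr> -> <expr> + <term> | <term>
        node = term()
        while peek() == "+":
            expect(("+",))
            node = f"({node} + {term()})"
        return node

    def term():                       # <term> -> <term> * <factor> | <factor>
        node = factor()
        while peek() == "*":
            expect(("*",))
            node = f"({node} * {factor()})"
        return node

    def factor():                     # <factor> -> ( <expr> ) | <id>
        if peek() == "(":
            expect(("(",))
            node = expr()
            expect((")",))
            return node
        return expect(("A", "B", "C"))

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result

print(group("A + B * C".split()))
print(group("A * B + C".split()))
```

Swapping the "+" and "*" strings between expr and term and rerunning the two calls is a quick way to try the second question.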

    Associativity of Operators in Expression Grammars

    The property of associativity for binary operators means that ((a op b) op c) is the same as (a op (b op c)). Addition is associative; subtraction is non-associative.

    Associativity in a programming language defines the order in which operators of equal precedence are applied when they appear in sequence. The choice matters for non-associative operators such as subtraction.

    Right associative means that operators of equal precedence are evaluated from right to left. Right recursion enforces right association. In right recursion the nonterminal in the LHS appears at the right end of the RHS.

    Left associative means that operators of equal precedence are evaluated from left to right. Left recursion enforces left association. In left recursion the nonterminal in the LHS appears at the left end of the RHS.

    Operator associativity can be indicated by a grammar. For example:

      7 - 4 - 2     
    

    Left associativity means the 4 *associates* left: (7 - 4) - 2 = 1

    Right associativity means the 4 *associates* right: 7 - (4 - 2)= 5

    Example of recursion in the rule to control associativity with corresponding parse trees for 7 - 4 - 2.

      <expr> -> <expr> - <term>  |  <term>   (left recursive)
      <term> -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
     
                         <expr>
                       /   |    \
                  <expr>   -  <term>   Semantic meaning: (7-4)-2=1 
                /   |   \       |
           <expr>   - <term>    2
              |           |
           <term>         4
              |
              7
     
      <expr> -> <term> - <expr>  |  <term>   (right recursive)
      <term> -> 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
    
                         <expr>
                       /   |    \
                  <term>   -     <expr>      Semantic meaning: 7-(4-2)=5 
                     |          /   |   \
                     7      <term>  -  <expr>
                               |           |
                               4        <term>
                                           |
                                           2                                
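The two groupings can be checked with a short calculation. This sketch folds a list of operands with subtraction from the left and from the right; the function names eval_left and eval_right are assumptions.

```python
from functools import reduce
import operator

def eval_left(values):
    """Left associative: ((v0 - v1) - v2) - ..."""
    return reduce(operator.sub, values)

def eval_right(values):
    """Right associative: v0 - (v1 - (v2 - ...))"""
    # Fold from the right end: start with the last value and work backwards.
    return reduce(lambda acc, v: v - acc, reversed(values))

print(eval_left([7, 4, 2]))    # (7 - 4) - 2 = 1
print(eval_right([7, 4, 2]))   # 7 - (4 - 2) = 5
```

The left fold mirrors the left-recursive rule and the right fold mirrors the right-recursive rule.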
    
    Extended BNF (EBNF)
  • ISO/IEC 14977 standardizes EBNF, but grammars in practice rarely follow it completely
  • Extended BNF improves readability and writability of BNF
  • Optional parts are placed in brackets [ ]
    
      <proc_call> -> ident [(<expr_list>)] 
  • Alternative parts of RHSs are placed inside parentheses and separated via vertical bars (or by spaces)
    
      <term> -> <term> (+ | -) const 
  • Repetitions (0 or more) are placed inside braces { }
    
      <ident> -> letter {letter | digit} 

      BNF
         <expr> -> <expr> + <term>
                 | <expr> - <term>
                 | <term>
         <term> -> <term> * <factor>
                 | <term> / <factor>
                 | <factor>
    
      EBNF
         <expr> -> <term> {(+ | -) <term>}
         <term> -> <factor> {(* | /) <factor>}
    
      note: EBNF does not specify associativity--the syntax analyzer must do this
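In a hand-written parser, the EBNF braces typically become a loop, and the way the loop accumulates its result is where the syntax analyzer decides associativity. The sketch below evaluates expressions left to right; treating <factor> as an integer literal is an assumption of the sketch, as are the names evaluate and peek.

```python
def evaluate(tokens):
    """Evaluate <expr> -> <term> {(+ | -) <term>} and
    <term> -> <factor> {(* | /) <factor>}, with <factor> an integer literal."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def factor():
        nonlocal pos
        token = peek()
        if token is None or not token.isdigit():
            raise SyntaxError(f"expected a number, found {token!r}")
        pos += 1
        return int(token)

    def term():
        nonlocal pos
        value = factor()
        while peek() in ("*", "/"):          # the EBNF braces become a loop
            op = tokens[pos]
            pos += 1
            rhs = factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def expr():
        nonlocal pos
        value = term()
        while peek() in ("+", "-"):          # accumulating into value groups to the left
            op = tokens[pos]
            pos += 1
            rhs = term()
            value = value + rhs if op == "+" else value - rhs
        return value

    result = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result

print(evaluate("7 - 4 - 2".split()))
print(evaluate("2 + 3 * 4".split()))
```

Because each loop folds the new operand into the running value, the operators come out left associative even though the EBNF rule itself does not say so.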
    
    Some Examples & Problems
    Example 3.4 :
    
        <assign> -> <id> = <expr>
        <id> -> A | B | C 
        <expr> -> <expr> + <term> | <term>
        <term> -> <term> * <factor>  | <factor>
        <factor> -> ( <expr> ) | <id>
    
    Example 3.2:
    
    <assign> -> <id> = <expr>
    <id> -> A | B | C
    <expr>-><id> + <expr> | <id> * <expr> | ( <expr> ) | <id>
    
    #1. Consider this Grammar
    
        <assign> -> <id> = <expr>
        <expr> -> <expr> + <term> | <term>
        <term> -> <term> * <factor>  | <factor>
        <factor> -> ( <expr> ) | <id>
        <id> -> A | B | C
    
        Is + right or left associative?
    
        How would you change the associativity of +?
    
    #2. Consider the following two grammars, each of which generates strings of
        correctly balanced parentheses and brackets. Determine if either or both
        is ambiguous. The letter "e" represents the Greek letter epsilon; i.e., 
        the empty string.
    
     (a) <string> ::= <string> <string> | (<string>) | [<string>] | e
     (b) <string> ::= (<string>) <string> | [<string>]  <string> | e
    
    
    #3. Consider the following specification of expressions:
    
        <expr> ::= <element> | <expr> <weak op> <expr>
        <element> ::= <numeral> | <variable>
        <weak op> ::= + | -
    
        Demonstrate its ambiguity by displaying two derivation trees for the
        expression a - b - c.
    
    
    #4. Modify the following grammar to add a unary minus operator that has
        higher precedence than either + or *.
    
        <assign> -> <id> = <expr>
        <id> -> A | B | C
        <expr> -> <expr> + <term> | <term>
        <term> -> <term> * <factor>  | <factor>
        <factor> -> ( <expr> ) | <id>
    
    #5. Consider the following grammar:
    
       <expr> ::= <term> | <expr> + <term>
       <term> ::= <element> | <term> * <element>
       <element> ::= <num> | <var>
    
    Draw the parse tree for this valid expression:
    
        5 * Z + 30 + 7
    
    #6. Consider the grammar
    
      S -> AB | AD
      A -> Aa | b
      B -> Bb | a
      D -> cDc | d 
    
      Are these valid strings in the language generated by the grammar?
    
      ba        ?
    
      bbbbab    ?
    
      aaaacccc  ?
    
      bcdcc     ?
    
      baaab     ?
    
      baccdcc   ?
    
    #8. Consider this grammar:
    
         <assign> -> <id> = <expr>
         <id> -> A | B | C
         <expr> -> <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>
    
       Show a parse tree and a leftmost derivation for the following statement:
    
         A = A * ( B + ( C * A))
    
    1. <assign> => <id> = <expr>
    2.          => A = <expr>
    3.          => A = <id> * <expr>
    4.          => A = A  * <expr>
    5.          => A = A  * (<expr>)
    6.          => A = A  * (<id> + <expr>)
    7.          => A = A  * (B + <expr>)
    8.          => A = A  * (B + (<expr>))
    9.          => A = A  * (B + (<id> * <expr>))
    10.         => A = A  * (B + ( C * <expr>))
    11.         => A = A  * (B + ( C * <id>))
    12.         => A = A  * (B + ( C * A))
    
       Parse tree.
    
                          <assign>
                         /    |    \
                       <id>   =     <expr>
                        |           /  |  \
                        A         <id> *  <expr>
                                   |        |
                                   A     ( <expr> )
                                           /  |  \
                                         <id> +  <expr>
                                          |        |
                                          B    ( <expr> )
                                                 /  |  \
                                              <id>  *  <expr>
                                                |        |
                                                C       <id>
                                                         |
                                                         A
    
     To prove that a grammar is ambiguous, provide a sentence in the grammar that
      has two different parse trees. (A single derivation is not enough; you need
      two distinct trees, or equivalently two distinct leftmost derivations.)
    
       <S> -> <A>
       <A> -> <A> + <A>  | <id>
       <id> -> a | b | c
    
      Sentence:   a + b + c
    
      Tree 1.             <S>
                           |
                          <A>
                        /  |  \
                      <A>  +  <A>
                       |     / | \
                     <id>  <A> + <A>
                       |    |     |
                       a   <id>  <id>
                            |     |
                            b     c
    
    
      Tree 2.            <S>
                          |
                         <A>
                        / |  \
                      <A> +   <A>
                     / | \     |
                   <A> + <A>  <id>
                    |     |    |
                   <id>  <id>  c
                    |     |
                    a     b
    
    Describe the language defined by the following grammar:
    
      <S> -> <A> a <B> b
      <A> -> <A> b | b
      <B> -> a <B> | a
    
    Start by finding the behavior of <A> and <B> :
    
     <A> ->  b
             bb
             bbb ...
    
     <A> is 1 or more "b"s.
    
     <B> -> a
            aa
            aaa ...
            
      <B> is 1 or more "a"s
    
      Thus,
    
      <A>a<B>b defines the language of all strings of 1 or more "b"s followed 
      by 2 or more "a"s ending with a "b".
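The description can be written as the regular expression b+a{2,}b and checked against a few strings. A small sketch using Python's re module (the function name in_language is an assumption):

```python
import re

# One or more b's, then two or more a's, then a final b.
LANGUAGE = re.compile(r"b+a{2,}b")

def in_language(text):
    """Return True if text matches <A> a <B> b."""
    return LANGUAGE.fullmatch(text) is not None

for candidate in ["baab", "bbbaaaab", "bab", "baa", "aab"]:
    print(candidate, in_language(candidate))
```

The shortest sentence is baab: one b from <A>, the literal a, one a from <B>, and the final b.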