**********************************************
          *  REGULAR LANGUAGES AND REGULAR EXPRESSIONS * 
          **********************************************
 
 A language is a set that can be informally described through examples, e.g. 
 "The orange cat jumped over the fence." and "The hat is red." are sentences 
 in the set of sentences in the English language. The set of a language 
 (which could be infinite) can be more precisely described by set-builder 
 notation. 
 
 Set builder notation is used here to describe the language of regular 
 expressions (regexes). The purpose is to show that regexes are regular
 languages. An important property of a regular language is that it can be 
 scanned for validity from left to right without backtracking.
 
 Def. Let A be an alphabet. The class R of regular languages over A, and the 
 corresponding regular expressions are defined by:
 
 1.  Φ (phi) or {} (the empty-set) is a regular language expression and the 
     corresponding regular expression is {}. This trivial case says that an 
     empty set of strings is regular.
 
 2.  {λ} is a regular language, and the corresponding regular expression 
     is λ (lambda), where lambda is the empty string. The empty string 
     λ is not the same as the empty language {}. This trivial case says
     that a set containing the empty string is regular.
 
 3.  For each 'a' in A, {a} is a regular language, and 'a' is the corresponding 
     regular expression.       
            
 4.  If L1 and L2 are regular languages and r1 and r2 are the corresponding 
     regular expressions, then
     (a) L1 | L2 (set union - sometimes denoted by +) is a regular language 
         with the regular expression (r1 | r2). Reads L1 or L2.
     (b) L1L2 is a regular language with the regular expression (r1r2). This
         is string concatenation.
     (c) L1* is a regular language with the regular expression (r1*), where
         * means L1 concatenated 0 or more times.
     (d) L1^n is a regular language with L1 concatenated n times.
 
 A new regex can be formed by applying the operators * and | to a regex. A
 new regex can be formed by applying concatenation to two existing regexes.
  
 Examples.
 
 Let A = {a,b}
  
  ((a|b)*|(ab|bb))     # denotes all possible strings over A including lambda
  
         {lambda,b,bb,ab,abb,bab,bbb,abab,...}
 
  (a|b)*(ab)            # denotes all strings over A that end with ab
 
         {ab,bab,aab,abab, ... }
 
  (a|b)^2|(ab)*         # denotes the union of all strings over A of length 
                        #  2 with all strings over A of 0 or more 'ab's
 
         {aa,bb,ab,ba,aaab,bbab,abab,baab,...}
 
  a(a|b)*b|b(a|b)*a     # denotes all strings over A whose first and last
                        # symbols differ.
 
          {ab,ba,aab,abb,baa,bba,...}
 
 
  (ab|b)*(b|ab)^2       # denotes all strings over A that are the concatenation
                        #  of ab and b with strings formed by two concatenations
                        #  of b and ab
 
 
          {bb,abab,bbb,babab,abbb,ababab,abbb,...}
 
 
 There is a precedence hierarchy for operators in a regex that does not 
 contain parentheses.
 
  regular expression operator hierarchy
  -------------------------------------
  parentheses
  Kleene star *
  concatenation
  union | 
 
 Example of applying hierarchy for regular expressions
   b|cb   =  b|(cb)    { concatenation has higher priority }
   b|ab*  =  b|a(b*)   { Kleene star has priority over concatenation } 
 
 EXAMPLE.
   (0|1)*00(0|1)*         # denotes all strings containing substring 00
 
  ((01|1)*)(0|lambda)     # denotes all strings without substring 00
 
 Note that there is no clear relationship between the regular expression
 for a language and a regular expression for its complement.
 
 EXAMPLE
  Identifiers: A string of letters and/or digits beginning with letter.
 
     letter(letter|digit)* where letter = a|b|c...|z|A|B|C...|Z and
     digit = 0|1|2..|9
 
  Integers:
     ('+' + '-' + lambda)(digit)(digit)*
 
  Reals:
   ('+'|'-'|lambda)digit digit* (lambda | '.' digit digit*) ((lambda)|
      ('E' (lambda | '+' | '-') digit digit* ) 
 
 EXAMPLE
   (a|bc)*(c|lambda) = {lambda, a, c, aa, ac, bc, ...}
  L=the set of all strings of a or bc, with an optional c at the end.
 
 
 EXAMPLE: Let L be the language of the following regular expression.  
   c*(b|(ac*))*
 
   Is cba in L?  yes
   Is ccbaac in L?  yes
   Is cbc in L?  no, we can't get a second c without an a.
   Is acabacc in L?  yes
   Is bbaaacc in L?  yes
 
 EX:  Given a finite language, find a regular expression.
   L = {a, bcc, acb}
   reg. expr: a|bcc|acb
 
   Clearly, there is a reg. expr. for any finite language.
   That regex is created by the union of all the elements of L.
   Thus, any finite language is regular.
 
 EX:
   A = {a,b}    L = {w : w = (a^n)(b^n) , 0<=n<=3 }
 
   ab|aabb + aaabbb is a regular expression for L.
   This language is regular because it is finite.
 
 HOWEVER:
   A = {a,b}    L = {w : w = (a^n)(b^n) , 0<=n }
 
   a*b*  does NOT work, because it includes aab (among others)
   (ab)* does NOT work, because it includes abababab...
   There is no regular expression for this language. 
 
 EX: L = the set of all palindromes over {a,b}
 
    For example:
      a
      abba
      bab
      bb
 
   We cannot write a regular expression for this language. Why? Informally,
   because as we scan from left to right, the test for validity of the 
   string depends on something yet to come.
 
   We could write a regular expression for a *finite* set of palindrones, but 
   not for the set of all palindromes.