CMPS 350 Chapter 6 - Data Types

Section 6.1 - 6.5

The richness of the data types in a language largely determines the language's style and usefulness.

Primitive Data Types

A "descriptor" for a data type (int, float, double, etc.) determines the range and kind of values, the size of storage, and the allowable operations (see elements of C).
All languages have a set of primitive data types, i.e., types that are built in and not defined in terms of other data types.
Primitive data types of most languages include integer, float, character, and array.
Some primitive data types are merely reflections of the hardware; others require some non-hardware support.
The size of a data type is platform dependent in C/C++ but not in Java (see C's /usr/include/limits.h).

Integer
Almost always an exact reflection of the hardware, so the mapping is trivial (C example).
There may be as many as 8 different integer types in a language (e.g. GNU C).
Java's signed integer sizes: byte, short, int, long (no unsigned int).
Negative integers are primarily stored in two's complement.
Floating Point
Approximates real numbers as floating point with a fraction and an exponent.
Usually two floating-point types, float and double; sometimes more.
Most common standard: IEEE Floating-Point Standard 754
    sign  exponent     fraction
   -----|------------|----------------  = 32 bits (single-precision)
   1 bit  8 bits       23 bits       
   ------------------------------------ 
   31                                 0
  
    sign  exponent     fraction
   -----|------------|---------------- =  64 bits (double-precision)
   1 bit  11 bits       52 bits     
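Rounding problem: decimal to binary floating point conversion; many decimal fractions (e.g. 0.1) have no exact binary representation, so the stored value is rounded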
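The single-precision layout and the rounding problem can be checked with a short sketch. This is only an illustration in Python using the standard struct module; the helper name decode_single is made up for this example.

```python
import struct

def decode_single(x):
    """Split x, stored as IEEE 754 single precision, into (sign, exponent, fraction)."""
    # Pack as a 32-bit float, then reinterpret the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF         # 23 bits
    return sign, exponent, fraction

print(decode_single(1.0))     # (0, 127, 0)
print(decode_single(-0.5))    # (1, 126, 0)

# 0.1, 0.2 and 0.3 are all rounded when stored in binary, so the sum is not exact.
print(0.1 + 0.2 == 0.3)       # False
```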
Decimal
For business applications (money); essential to COBOL.
Unlike binary floating point, a decimal type stores decimal fractions such as 0.10 exactly, so converting them causes NO ROUND-OFF ERRORS (e.g. the C# decimal data type).
BCD
(binary coded decimal) An encoding that takes more space than a binary encoding, because each decimal digit 0-9 requires 4 bits.
Binary encoding of 255 can be done in 8 bits: 1111 1111
BCD encoding of 255 requires 3*4=12 bits: 0010 0101 0101
Advantage: accuracy (solves the floating point rounding problem)
Disadvantages: limited range, wastes memory
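A minimal sketch of the BCD encoding described above, written in Python for illustration; the function name to_bcd is hypothetical.

```python
def to_bcd(n):
    """Return the BCD encoding of a non-negative integer, one 4-bit group per decimal digit."""
    return " ".join(format(int(digit), "04b") for digit in str(n))

print(to_bcd(255))         # 0010 0101 0101  (12 bits)
print(format(255, "b"))    # 11111111        (8 bits in plain binary)
```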
Boolean
Range of values: two elements, one for true and one for false.
Could be implemented as bits, but often implemented as bytes.
Advantage: readability
Character
Stored as numeric codings. Most commonly used coding: ASCII.
UNICODE: originally a 16-bit coding (see unicode charts) (see HTML unicode).
UNICODE supports characters from most natural languages plus symbols. Example: Ð is the Icelandic Capital Eth.
Originally used in Java; C# and JavaScript also support Unicode.
Character String Types
Values are sequences of characters.
Design issues: Is it a primitive type (Perl) or a special kind of array or class (C/C++)? Should the length of strings be static or dynamic?

o Character String Type Operations: Assignment and copying, comparison (=, >, etc.), catenation, substring reference, pattern matching (see Perl regexes).

Some regex metacharacters:
   ?    0 or 1 time 
   *    0 or more times
   +    1 or more times
   []   marks a character class to match a single character
   ^    negation inside [];  anchor at beginning of line otherwise
   $    marks end of line
   ()   substring delimiters  (must be escaped in vim)
   $1   refers to first matched substring (this is \1 in vim)
   {}   a{n} means match 'a' exactly "n" times

   some of Perl's abbreviations 
   \d is a digit and represents [0-9]
   \s is a whitespace character and represents [\ \t\r\n\f]
   \w is a word character (alphanumeric or _) and represents [0-9a-zA-Z_]
   \D is a negated \d; it represents any character but a digit [^0-9]
   \S is a negated \s; it represents any non-whitespace character [^\s]
   \W is a negated \w; it represents any non-word character [^\w]
   The period '.' matches any character but "\n"

   Ex.
   $x = 'the cat in the hat';
   $x =~ /^(.*)(at)(.*)$/;  # $1 = 'the cat in the h'; $2 = 'at'; $3 = '' (0 matches)
                            # the first (.*) is greedy, so (at) matches the last 'at'
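The same match can be tried outside Perl. The sketch below uses Python's re module, whose syntax for these metacharacters is the same; the non-greedy variant is added only for contrast and is not part of the example above.

```python
import re

x = "the cat in the hat"

# Greedy: the first (.*) takes as much as it can, so (at) is the final 'at'.
greedy = re.match(r"^(.*)(at)(.*)$", x)
print(greedy.groups())    # ('the cat in the h', 'at', '')

# Non-greedy: (.*?) takes as little as it can, so (at) is the first 'at'.
lazy = re.match(r"^(.*?)(at)(.*)$", x)
print(lazy.groups())      # ('the c', 'at', ' in the hat')
```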

   What do these regexes match? 

    /^[A-Za-z][A-Za-z\d]+/   
    /[yY][eE][sS]/;
    /[0-9a-fA-F]/;	          
    /[^0-9]{4}/;           

   Substring replacement (substitution) syntax:

   s/regex/replacement/modifiers
   ex.
   s/^'(.*)'$/$1/;   # strip single quotes that occur at start and end of line
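For comparison, a small sketch of the same substitution using Python's re.sub, where the first captured substring is written \1 (as in vim). The function name strip_outer_quotes is made up for this example.

```python
import re

def strip_outer_quotes(line):
    """Remove single quotes only when one starts the line and one ends it."""
    return re.sub(r"^'(.*)'$", r"\1", line)

print(strip_outer_quotes("'hello world'"))    # hello world
print(strip_outer_quotes("it's fine"))        # it's fine (unchanged)
```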

   Try out a few things with this file in vim.
Character String Type
Not primitive in C: must use char arrays and low-level utilities in string.h. Not primitive in C++, but provided by the standard library string class. See strings.cpp and strings.c

Primitive in SNOBOL4 (a string manipulation language); many operations and elaborate pattern matching. Primitive in Java as a String class (don't need to include anything extra)

String length options (only Ada supports all three):

1. Static: COBOL, Java's String class
2. Limited Dynamic Length: C/C++, where \0 marks the end of the string in a fixed-size array (example)
3. Dynamic (no maximum): SNOBOL4, Perl, JavaScript 

As a primitive type strings aid writability. But C was a systems language and did not see strings as a priority. Inexpensive as a primitive type with static length. Dynamic length is nice, but is it worth the expense?

String implementation options:

1. Static length: need compile-time descriptor for length and address
2. Limited dynamic length: may need a run-time descriptor for maximum length and current length (not in C, where \0 marks the end)
3. Dynamic length: need run-time descriptor for address and current length;
   allocation/de-allocation is the biggest implementation problem
User-Defined Ordinal Types
A data type that denotes position in an ordered sequence - must be mapped to integers
Enumeration Types
All possible values are named constants in the definition. C#: enum days {mon, tue, wed, thu, fri, sat, sun}; (see C)

Design issues:
Can an enumeration constant appear in more than one type definition, and if so, how is the type of an occurrence of that constant checked?
Are enumeration values coerced to integer?
Is any other type coerced to an enumeration type?

Evaluation of Enumerated Type:
Aid to readability, e.g., no need to code a color as a number.
Aid to reliability, e.g., the compiler can check operations (don't allow colors to be added), and no enumeration variable can be assigned a value outside its defined range.
Ada, C#, and Java 5.0 provide better support for enumeration than C++; e.g., enumeration type variables are not coerced into integer types.

Subrange Types
An ordered contiguous subsequence of an ordinal type.
Def: an ordinal type is countable and ordered (real numbers are not countable).
Example: 12..18 is a subrange of integer type.
Ada's design:
      type Days is (mon, tue, wed, thu, fri, sat, sun);
      subtype Weekdays is Days range mon..fri;
      subtype Index is Integer range 1..100;

      Day1: Days;
      Day2: Weekdays;
      Day2 := Day1;   -- legal only if Day1 is in mon..fri (checked at run time)
Subrange Evaluation: Aids readability: makes clear that variables of a subrange can store only a certain range of values. Helps reliability: assigning a value outside the range to a subrange variable is detected as an error (by the compiler when the value is static, otherwise by a run-time check).

Implementation of User-Defined Ordinal Types:

Enumeration types are implemented as integers. Subrange types implemented like parent types with code inserted by compiler to restrict assignments to subrange variables.
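The implementation idea can be sketched as follows: enumeration constants are just integers, and a subrange variable is a parent-type variable guarded by a range check. This is an illustrative Python model, not how any particular compiler does it; the class name Subrange and its methods are hypothetical.

```python
from enum import IntEnum

# Enumeration constants map to the integers 0..6.
Days = IntEnum("Days", "MON TUE WED THU FRI SAT SUN", start=0)

class Subrange:
    """A value of the parent type, with a range check on every assignment."""
    def __init__(self, low, high, value):
        self.low, self.high = low, high
        self.assign(value)

    def assign(self, value):
        # This check stands in for the code a compiler would insert.
        if not (self.low <= value <= self.high):
            raise ValueError(f"{value!r} is outside {self.low!r}..{self.high!r}")
        self.value = value

day2 = Subrange(Days.MON, Days.FRI, Days.TUE)    # plays the role of a Weekdays variable
day2.assign(Days.FRI)                            # accepted
try:
    day2.assign(Days.SAT)                        # rejected by the range check
except ValueError as err:
    print("constraint error:", err)
```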

Array Types
Arrays included in most languages. An array is an aggregate of contiguous, homogeneous data elements where individual elements are identified by position relative to the first element

Array Design Issues:
What types are legal for subscripts?
Are subscripting expressions in element references range checked?
When is the subscript range bound (when is the size of the array determined)?
When does allocation take place?
What is the maximum number of subscripts?
Can array objects be initialized?
Are any kind of slices allowed?

Array Indexing:
Indexing (or subscripting) is a mapping from indices to elements

	array_name (index_value_list) ->  an element  
Index Syntax:
Most languages use brackets. FORTRAN, PL/I, Ada use parentheses. Ada explicitly uses parentheses to show uniformity between arrays and function calls; i.e., both are mappings.

Arrays Index (Subscript) Types:
FORTRAN, C: integer only
Pascal: any ordinal type (integer, Boolean, char, enumeration)
Ada: integer or enumeration (includes Boolean and char)
Java: integer types only

Range Checking:
C, C++, Perl, and Fortran do not specify range checking. Ada, Java, ML, C# specify range checking. Ada makes it MANDATORY

Five Array Allocation Categories:
C Example of all five methods

1. Static: the subscript range is statically bound and storage allocation is also at compile time (not on the runtime stack). Advantage: efficiency (no dynamic allocation)

2. Fixed stack-dynamic: the subscript range is statically bound, but allocation is done at declaration time during the function call. Advantage: space efficiency

3. Stack-dynamic: the subscript range is dynamically bound and the storage allocation is dynamic (on the run-time stack). Advantage: flexibility (the size of an array need not be known until the array is to be used)

4. Fixed heap-dynamic: similar to fixed stack-dynamic: storage binding is heap dynamic but the size of the array is fixed

5. Heap-dynamic: binding of the subscript range and storage allocation are both dynamic and can change during execution. Advantage: flexibility (arrays can grow or shrink during program execution)

C and C++ arrays that include the static modifier are static.
C and C++ arrays without the static modifier are fixed stack-dynamic.
Ada arrays can be stack-dynamic.
C and C++ provide fixed heap-dynamic arrays.
C# includes a second array class, ArrayList, that provides heap-dynamic arrays.
Perl and JavaScript support heap-dynamic arrays.

Array Initialization:
Some languages allow initialization at the time of storage allocation:

     // C, C++, Java, C#
    int list [] = {4, 5, 7, 83};   // an aggregate constant is assigned to list

    char name [] = "freddie";   // C and C++

    char *names [] = {"Bob", "Jake", "Joe"};  // C and C++

    String[] names = {"Bob", "Jake", "Joe"};  // Java String objects
Array Operations:
APL has the most powerful array processing operations for vectors and matrices, as well as unary operators (for example, to reverse column elements).
Ada allows array assignment and catenation (see also Perl array ops).
Fortran provides elemental operations, so called because they operate between pairs of array elements; e.g. a pairwise sum of two arrays:
            
     1,2,3,4 + 2,3,1,5 = 3,5,4,9   
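In a language without built-in elemental operations, the pairwise sum has to be spelled out. A small Python sketch (the function name elemental_add is made up for illustration):

```python
def elemental_add(a, b):
    """Add two equal-length arrays element by element, like A + B in Fortran."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same number of elements")
    return [x + y for x, y in zip(a, b)]

print(elemental_add([1, 2, 3, 4], [2, 3, 1, 5]))    # [3, 5, 4, 9]
```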
Rectangular and Jagged Arrays:
An array in C has only one dimension, but you can create an abstract type that is an array of arrays.
A rectangular array is a multi-dimensioned array in which all rows have the same number of elements and all columns have the same number of elements.
A jagged matrix has rows with varying numbers of elements. This is possible when multi-dimensioned arrays appear as arrays of arrays, or with an array of pointers to a union type (C example).

Slices:
A slice is a method to reference a substructure of an array. It is only a referencing mechanism (only useful in languages that have array operations). Assume the following array declarations in Fortran 95:

   Integer, Dimension (10) :: Vector       ! a 1-dimensional array
   Integer, Dimension (3, 3) :: Mat        ! a 2-dimensional array
   Integer, Dimension (3, 3, 4) :: Cube    ! a 3-dimensional array

   Slice Examples:
   Vector (3:6) is a four element array
   Mat (:,2) is the second column of Mat  (an array)
   Mat (2:3,:) is the 2nd and 3rd rows of Mat  (a matrix)
   Cube (2,:,:) is a matrix
   Cube (:,:,2:3) is another 3-dimensional array
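The first three Fortran slices can be imitated with Python lists. This is only a sketch: Python subscripts start at 0 and the upper bound of a slice is excluded, so Fortran's 3:6 becomes 2:6. The sample values are made up.

```python
vector = list(range(1, 11))      # ten elements: 1..10
mat = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

# Fortran Vector(3:6): elements 3 through 6, a four element array.
print(vector[2:6])                  # [3, 4, 5, 6]

# Fortran Mat(:,2): the second column.
print([row[1] for row in mat])      # [2, 5, 8]

# Fortran Mat(2:3,:): the 2nd and 3rd rows.
print(mat[1:3])                     # [[4, 5, 6], [7, 8, 9]]
```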


Ex. slices in Perl

Implementation of Arrays:
An access function maps subscript expressions to an address in the array. Access function for single-dimensioned arrays:

   address(list[k]) = address (list[lower_bound]) + ((k-lower_bound) * element_size)  
Accessing Elements Multi-dimensioned Arrays:
Two common ways: row major order (by rows), used in most languages, and column major order (by columns), used in Fortran. In row major order the address of the element in row i and column j (both counted from 0) is computed as follows, where n_cols is the number of columns:

   array_starting_address + ((i * n_cols + j) * element_size)  
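The access functions can be written out directly. In this Python sketch the function names, the starting address 1000 and the 4-byte element size are assumptions chosen for illustration; the column major formula is the standard counterpart of the row major one and is not given in the notes.

```python
def address_1d(base, k, lower_bound, element_size):
    """Single-dimensioned array: offset from the element at lower_bound."""
    return base + (k - lower_bound) * element_size

def address_row_major(base, i, j, n_cols, element_size):
    """Row i, column j (0-based) when whole rows are stored one after another."""
    return base + (i * n_cols + j) * element_size

def address_col_major(base, i, j, n_rows, element_size):
    """Row i, column j (0-based) when whole columns are stored one after another."""
    return base + (j * n_rows + i) * element_size

# A hypothetical 3 x 4 array of 4-byte elements starting at address 1000.
print(address_row_major(1000, 1, 2, n_cols=4, element_size=4))    # 1024
print(address_col_major(1000, 1, 2, n_rows=3, element_size=4))    # 1028
print(address_1d(1000, 5, lower_bound=1, element_size=4))         # 1016
```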

Compile-Time Descriptors for arrays:
All arrays: starting address, element type, index type

Single-dimensional: index range (e.g. lower and upper bound)

Multi-dimensional: index range 1 ... index range n, number of dimensions

Associative Arrays (Hash tables, Hash maps)

An unordered collection of data elements indexed by an equal number of values called keys.
User-defined keys must be stored.
Design issue: What is the form of references to elements?
Examples: a hash in Perl and support in Java/C++/JavaScript

Sections 6.7 - 6.9

* RECORD TYPES
   A possibly heterogeneous aggregate of data elements, individually
   identified by names (usually called fields)

   Design issues
   What is the syntactic form of references to the field?
   Are elliptical references allowed? (an ellipsis (...) is any omitted part
   of speech that is understood)

   Ex: structs in C

   COBOL (level numbers show nested records)

      01 EMP-RECORD.
         02 EMP-NAME.
            03 FIRST  PIC IS X(20).
            03 MIDDLE PIC IS X(10).
            03 LAST   PIC IS X(20).
         02 HOURLY-RATE PIC IS 99V99.

   Record Field Reference: MIDDLE OF EMP-NAME OF EMP-RECORD
   Fully qualified references include all record names
   Elliptical references allow leaving out some parts of record names as long
   as the reference is unambiguous; e.g. FIRST, FIRST OF EMP-NAME, and
   FIRST OF EMP-RECORD are elliptical references to the employee's first name

   Ada
   Record structures are indicated in an orthogonal way:

      type Emp_Name_Type is record
         First: String (1..20);
         Mid:   String (1..10);
         Last:  String (1..20);
      end record;

      type Emp_Rec_Type is record
         Employ_Name: Emp_Name_Type;
         Hourly_Rate: Float;
      end record;

      Emp_Rec: Emp_Rec_Type;

   Record Field Reference: Emp_Rec.Employ_Name  <== dot notation - most commonly used

   Operations on Records
   Assignment is very common if the types are identical
   Ada allows record comparison
   Ada records can be initialized with aggregate literals
   COBOL provides MOVE CORRESPONDING: copies a field of the source record to
   the corresponding field in the target record

   Evaluation and Comparison to Arrays
   Straightforward and safe design
   Records are used when the collection of data values is heterogeneous
   Access to array elements is slower than to record fields, because
   subscripts are dynamic (field names are static); e.g. myArray[i]
   Dynamic subscripts could be used with record field access, but would
   disallow type checking and be much slower

   Implementation of Record Type:
   Each field is stored at a fixed offset from the beginning of the record

* UNION TYPES
   A union is a type whose variables can store values of different types at
   different times during execution

   Design issues
   Should type checking be required?
   Should unions be embedded in records?

   Unions:
   o every member begins at offset 0 from the address of the union
   o the size is the size of the largest member
   o only one member value can be stored in a union object at a time

   Discriminated vs. Free Unions
   Fortran, C, and C++ provide union constructs with no language support for
   type checking; the union in these languages is called a free union (Ex. C)
   Type checking of unions requires that each union include a type indicator
   called a discriminant

   Discriminated Union Type Ex. Ada

      type Shape is (Circle, Triangle, Rectangle);
      type Colors is (Red, Green, Blue);
      type Figure (Form: Shape) is record
         Filled: Boolean;
         Color: Colors;
         case Form is
            when Circle => Diameter: Float;
            when Triangle =>
               Leftside, Rightside: Integer;
               Angle: Float;
            when Rectangle => Side1, Side2: Integer;
         end case;
      end record;

   Evaluation of Unions
   Potentially unsafe construct if no type checking
   Java and C# do not support unions, which reflects growing concerns for
   safety in programming languages

   *Note: unions and structures are prevalent in C system programming
   (see semaphore example)
   C was developed to be a low-level system programming language
   C is still the most popular choice for low-level programming
   For applications programming, the power of C is also its downfall
   (see the obfuscated C code contest)
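The difference between a free and a discriminated union can be tried out with Python's ctypes module, which builds C-compatible unions. This is a sketch only; the names Number and TaggedNumber are made up, and the tag check is done by hand here rather than by a compiler.

```python
import ctypes

class Number(ctypes.Union):
    # A free union: all members share the same storage.
    _fields_ = [("as_float", ctypes.c_float),
                ("as_uint", ctypes.c_uint32),
                ("as_double", ctypes.c_double)]

print(Number.as_float.offset, Number.as_uint.offset, Number.as_double.offset)   # 0 0 0
print(ctypes.sizeof(Number) == ctypes.sizeof(ctypes.c_double))                  # True

n = Number()
n.as_float = 1.0
print(hex(n.as_uint))    # 0x3f800000: the float's bits read back as an integer, unchecked

class TaggedNumber:
    """Discriminated union: a tag remembers which member was stored last."""
    def __init__(self):
        self.storage = Number()
        self.tag = None

    def put(self, member, value):
        setattr(self.storage, member, value)
        self.tag = member

    def get(self, member):
        if member != self.tag:
            raise TypeError(f"union currently holds {self.tag}, not {member}")
        return getattr(self.storage, member)

t = TaggedNumber()
t.put("as_float", 1.0)
print(t.get("as_float"))    # 1.0
```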

POINTER AND REFERENCE TYPES


o A pointer type variable holds a memory address or a special value; e.g., nil 
o Provides  indirect addressing
o Provides a way to manage dynamic memory
o A pointer can access memory that is dynamically created (e.g. heap)
o Pointers add addressing flexibility and control dynamic storage management
   
   Design Issues of Pointers
   What are the scope and lifetime of a pointer variable?
   What is the lifetime of a heap-dynamic variable?
   Are pointers restricted as to the type of value to which they can point?
   Are pointers used for dynamic storage management, indirect addressing, or both?
   Should the language support pointer types, reference types, or both?
 
   Pointer Operations 
   Two fundamental operations: assignment and dereferencing
   Assignment is used to set a pointer variable's value to some useful address
   Dereferencing yields the value stored at the location represented by the pointer's value
   Dereferencing can be explicit or implicit
   C++ has explicit operation via *; e.g. 

      j = *ptr

   sets j to the value located at ptr
   
  
   Problems with Pointers
   Dangling pointer: A pointer that points to a heap-dynamic variable that
   has been de-allocated
   Dangling object: An allocated heap-dynamic variable that no pointer
   refers to any longer (memory leak)
   Cross-linked pointers: a heap-dynamic variable with two pointers (sometimes
   desirable)

   Pointers in C/C++ are powerful and dangerous (Ex. C)
   Explicit dereferencing and address-of operators
   (void *) can point to any type and be type checked, but cannot be
   de-referenced (Ex. C)
   In C, pointers can point to pointers that point to pointers...
   (C double pointers)
   
   Limited pointer arithmetic is supported in C/C++

     float stuff[100];
     float *p;
     p = stuff;
   
     *(p+5) is equivalent to stuff[5] and  p[5]
     *(p+i) is equivalent to stuff[i] and  p[i]
  
    p = p + stuff; <= not OK
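The equivalence of *(p+i), stuff[i] and p[i] comes down to address arithmetic: base address plus i times the element size. The sketch below checks this from Python with ctypes, which exposes C arrays and pointers; the helper name deref_offset and the sample values are made up.

```python
import ctypes

# float stuff[100]; filled with values that a 32-bit float stores exactly.
stuff = (ctypes.c_float * 100)(*[i * 0.5 for i in range(100)])

# float *p; p = stuff;
p = ctypes.cast(stuff, ctypes.POINTER(ctypes.c_float))

def deref_offset(array, i):
    """Compute *(p + i) by hand: base address + i * sizeof(float)."""
    address = ctypes.addressof(array) + i * ctypes.sizeof(ctypes.c_float)
    return ctypes.c_float.from_address(address).value

print(deref_offset(stuff, 5), stuff[5], p[5])    # 2.5 2.5 2.5
```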
 
   Pointers in Fortran 95 point to heap and non-heap variables that have the 
   TARGET attribute assigned in the declaration; Implicit dereferencing

   C-style pointers in C# are available only in unsafe code:

       unsafe {  
           int i = 10;
           int* px1 = &i;  
       }   

   Reference Types
   C++ includes a special kind of pointer type called a reference type that is 
   used primarily for formal parameters;
   C++ reference type behaves strictly like an alias (see C++ code )
   In a reference, you never see the address and you have implicit assignment 
   and dereferencing - (see C++ code)
   THERE IS NO EXPLICIT DEREFERENCING 

   Advantages of both pass-by-reference and pass-by-value 
   Java extends C++ reference variables to replace pointers entirely
   References refer to class instances
   C# includes both the references of Java and C pointers (in unsafe code)

   Evaluation of Pointers
   Dangling pointers and dangling objects are problems as is heap management

   Pointers or references are necessary for dynamic data structures--can't 
   design a language without them

   Representations of Pointers
   Depends on register and address size
   Large computers use single values 
   Intel microprocessors use segment and offset (2 16-bit addresses)
   
   Solutions to Dangling Pointer Problem

   o automatically de-allocate dynamic objects at the end of pointer's scope
   (Ada to some extent)

   Tombstone: extra heap cell that is a pointer to the heap-dynamic variable
   The actual pointer variable points only at tombstones (another indirection)
   When heap-dynamic variable de-allocated, tombstone remains but set to nil
   Costly in time and space -- not used in any major language
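A toy model of tombstones, written in Python for illustration. All names here (Tombstone, Pointer, allocate, deallocate) are hypothetical, and a one-element list stands in for a heap cell.

```python
class Tombstone:
    """Extra heap cell that holds the only direct reference to the variable."""
    def __init__(self, variable):
        self.target = variable

class Pointer:
    """A pointer variable refers to a tombstone, never to the variable itself."""
    def __init__(self, tombstone):
        self.tombstone = tombstone

    def deref(self):
        # Every access pays for one extra level of indirection.
        if self.tombstone.target is None:
            raise RuntimeError("dangling pointer: variable was de-allocated")
        return self.tombstone.target[0]

def allocate(value):
    return Pointer(Tombstone([value]))

def deallocate(pointer):
    pointer.tombstone.target = None     # the tombstone remains, set to nil

p1 = allocate(42)
p2 = Pointer(p1.tombstone)              # copying a pointer shares the tombstone
print(p2.deref())                       # 42
deallocate(p1)
try:
    p2.deref()
except RuntimeError as err:
    print(err)                          # the dangling use is detected
```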

   Locks-and-keys: Pointer values are represented as (key, address) pairs
   Heap-dynamic variables are represented as variable plus cell for integer 
   lock value
   When heap-dynamic variable allocated, lock value is created and placed in 
   lock cell and key cell of pointer 
   Allows multiple pointers to point to the same variable (key is copied also)
   When memory is deallocated, the lock value is changed to an illegal value,
   so the keys of any remaining pointers no longer match and access is refused
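A toy model of locks-and-keys in Python. The names (HeapVar, KeyedPointer, new_var, dispose) are hypothetical, and 0 is assumed to be the illegal lock value.

```python
import itertools

_next_lock = itertools.count(1)     # 0 is reserved as the illegal lock value

class HeapVar:
    """A heap-dynamic variable plus a cell for its integer lock value."""
    def __init__(self, value):
        self.lock = next(_next_lock)
        self.value = value

class KeyedPointer:
    """A pointer value is a (key, address) pair."""
    def __init__(self, key, address):
        self.key, self.address = key, address

    def copy(self):
        return KeyedPointer(self.key, self.address)    # the key is copied too

    def deref(self):
        if self.key != self.address.lock:
            raise RuntimeError("key does not match lock: variable was de-allocated")
        return self.address.value

def new_var(value):
    var = HeapVar(value)
    return KeyedPointer(var.lock, var)

def dispose(pointer):
    pointer.address.lock = 0        # no existing key can match any more

a = new_var("data")
b = a.copy()
print(b.deref())                    # data
dispose(a)
try:
    b.deref()
except RuntimeError as err:
    print(err)
```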
  
   Best solution: no explicit deallocation possible
 
   Heap Management
   A very complex run-time process
   Single-size cells vs. variable-size cells

   Two approaches to Garbage Collection

   (1) eager approach (reference counters)
   reclamation is gradual: a cell is reclaimed as soon as nothing refers to it
   (compare explicit free in C)

   Evaluation: less efficient but you have the maximum amount of
   memory at any given time
   
   (2) lazy approach (mark-sweep)
   reclamation occurs when the list of available space becomes empty
   The run-time system allocates storage cells as requested and disconnects 
   pointers from cells as necessary; garbage collection then begins

   Evaluation: more efficient but when you need it most, it works worst 
   (takes most time when program needs most of cells in heap)
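One common lazy scheme is mark-sweep: mark every cell reachable from the program's pointers, then sweep up the rest. The Python sketch below models the heap as a dictionary from a cell name to the cells it points to; that representation and the function name mark_sweep are assumptions made for illustration.

```python
def mark_sweep(heap, roots):
    """Reclaim every cell not reachable from roots; return the set of reclaimed cells."""
    # Mark phase: follow pointers from the roots.
    marked = set()
    stack = list(roots)
    while stack:
        cell = stack.pop()
        if cell in marked or cell not in heap:
            continue
        marked.add(cell)
        stack.extend(heap[cell])

    # Sweep phase: everything left unmarked is garbage.
    garbage = set(heap) - marked
    for cell in garbage:
        del heap[cell]
    return garbage

# 'c' and 'd' point at each other but are unreachable, so they are still reclaimed.
heap = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"], "e": []}
reclaimed = mark_sweep(heap, roots=["a"])
print(sorted(reclaimed))    # ['c', 'd', 'e']
print(sorted(heap))         # ['a', 'b']
```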


*  BIT TYPES 
   (not covered in text)

   Systems programming often requires access to individual bits and data
   structures that take up less storage than primitive data types 

   The use of pointers and bit-level operators can almost replace assembly code 
   (Note: Only about 10% of UNIX is assembly code; the rest is C)
   Example:  C's bit fields and
    displaying bit by bit 
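Bit-level access can be sketched with shift and mask operators, which Python shares with C. The byte layout below (a 3-bit mode, a 1-bit ready flag, a 4-bit count) is invented for this example, as are the function names.

```python
def bits(value, width=8):
    """Display a value bit by bit, most significant bit first."""
    return "".join("1" if value & (1 << i) else "0" for i in range(width - 1, -1, -1))

# Hypothetical layout of one byte: mode in bits 7-5, ready in bit 4, count in bits 3-0.
def pack(mode, ready, count):
    return ((mode & 0b111) << 5) | ((ready & 0b1) << 4) | (count & 0b1111)

def unpack(byte):
    return (byte >> 5) & 0b111, (byte >> 4) & 0b1, byte & 0b1111

b = pack(5, 1, 9)
print(bits(b))      # 10111001
print(unpack(b))    # (5, 1, 9)
```

Three values fit in one byte here, where three separate int variables would take several bytes each.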