CIS 341 Project 4: Type Checking

Due: 21 March 2013 at 11:59:59pm

Download: project4.zip
Oat language specification (Project 4 version)

To Submit: Go Here

This is a group project. Teams of size 1 or 2 are acceptable. If you are working in a team of two, only one student needs to submit the project. Be sure to fill in the team.txt file.

Warning: This project may time consuming (though hopefully not so much as Project 3). Start early and contact the course staff with any questions. If you find yourself stuck or writing tons and tons of code, please ask for help on Piazza.

Note: You should need to modify only tc.ml, phase1.ml, and phase2.ml when completing this project.

Overview

Implement a type checker and the (interesting bits of) a compiler for a simple C-like language that has boolean, int, string, and array types as well as top-level, mutually-recursive functions. Your compiler will accept source files of the form


  int fact(int x) {
    int acc = 1;
    while (x > 0) {
      acc = acc * x;
      x = x - 1;
    }
    return acc;
  }

  int program(int argc, string[] argv) {
    return fact(5);
  }

and will produce an executable (by default named a.out) that, when linked against runtime.c and then executed produces the resulting output:

  ./a.out
  720

The main challenge of this project is implementing a type checker -- unlike C, this language is (intended to be) completely type safe: a well-formed program should never crash, though it may abort due to the failure of a dynamic array-bounds check.

Hint: For examples of Oat code, see the files in /tests, especially those with sensible names.

The Oat Language

The language accepted by your compiler is a simple variant of C-style programs. Unlike previous assignments, this language now supports multiple base-types of data: int, bool, and string, as well as arrays of such data. Unlike C, the language is type safe -- it does not support type casts, and it doesn't support pointers, so there is no possibility of a null-pointer dereference. The only kind of error that can possibly be generated by a well-typed Oat program is an array-index out-of-bounds error.

At this point, the Oat language is large enough that it is simpler to give the specification using inference rules than to use English prose. The Oat language specification contains a definition of the language syntax and a collection of inference rules that define Oat type checking.

See the file ast.mli for the OCaml representation of the abstract syntax -- the type typ of types is defined there, along with representations of expressions, statements, blocks, function declarations, etc. You should familiarize yourself with the correspondence between the OCaml representation and the notation used in the specification. The astlib module defines some helper functions for working with the abstract syntax. (The ast.ml(i) files are automatically generated by the same tool that we use to create the specification inference rules, which is why they aren't well formatted.)

New Features

Functions:
Oat supports mutually-recursive functions. Each function body consisting of a collection of variable declarations, a statement, and a return terminator, exactly as in Project 3 except that the function body may mention function arguments.

A complete Oat program contains a function called program with type (int, string[]) -> int, which takes command-line arguments like the C main function and is the entry-point of the executable.

Oat has no void type. Instead, functions that return no value have return type unit; such functions are syntactically restricted to return no value (see the fdecl production of the grammar).

Global values:
Oat supports top-level global constants of any type. Initialization of global constants happens once before the program entry point is run. Such global constants may not be mutually recursive. The initialization code is generated by the compiler during Phase 1, and runtime.c invokes the initialization code before calling program

New expression forms:
Oat supports several new forms of expressions. Function calls are written f(x,y,z) as in C. Boolean values are true and false. Oat distinguishes boolean logical operations b1 & b2 (and) and b1 | b2 (or) from bit-wise int operations i1 [&] i2 (bitwise or) and i1 [|] i2 (bitwise or). (This change is necessitated by the strong type system.)

Arrays:
Oat supports multi-dimensional arrays. The expression new typ [len](fun i -> init) creates a new array of length len containing objects of type typ and initializes it by iterating the function over the array indices. For example, the following program creates a 3x4 array of integers and initializes them with the numbers 0 through 11:

  int[][] arr = new int[][3](fun i -> new int[4](fun j -> i*4 + j));

Arrays are mutable, and they can be updated using assignment notation: vec[0] = 17. Array indices start at 0 and go through len - 1, where len is the length of the array. Oat arrays (unlike C arrays) also store a length field, and all indexing operations perform array-bounds checks (which you will implement as part of this project).

Constant arrays are written using braces:

  int[] arr = {1, 2, 3, 4};

Strings:
Oat supports C-style immutable strings, written "in quotes". For the moment, the string operations are very limited and mostly provided by built-in functions provided by the runtime system.

Built-in Functions:
We now have enough infrastructure to support interesting built-in operations, including:

string_of_array : (int[]) -> string

int

array_of_string : (string) -> int[]
print_string : (string) -> unit
print_int : (int) -> unit
print_bool : (bool) -> unit
length_of_array : (T[]) -> int

These built-in operations, along with some internal C-functions used by the Oat runtime are implemented in runtime.c.

Note that length_of_array is generic with respect to the array type of its argument. Since Oat types don't include generic types, this operation must be handled specially in the compiler as described below.

This project uses the build configuration of Project 3.

Running the Project

The test suite of this project expects to be able to load test files (named test1.oat, test2.oat, etc.) from a tests directory. By default, the path to the tests directory is set to tests/, under OcaIDE and on Windows, you will need to configure the tests directory by giving main the -test-path path argument. For example:

In Linux main.native -runtime ../runtime.c -test-path ../tests/ --test
In Windows main.native -runtime ..\runtime.c -test-path ..\tests\ --test

As in Project 3, by default the main looks for runtime.c (in the zip file) from the current directory. In OcaIDE the main is in _build, so you need to specify the path of runtime.c by the -runtime flag before the --test in the ``Run Configurations...'' dialog.

In Linux main.native -runtime ../runtime.c --test
In Windows main.native -runtime ..\runtime.c --test

Do main.native --help from the command line for a synopsis.

  ~/Desktop/project4_temp> ./main.native --help
CIS341 main test harness 

  --test run the test suite, ignoring other inputs
  -q quiet mode -- turn off verbose output
  -bin-path set the output path for generated executable files, default c_bin
  -obj-path set the output path for generated .s  and .o files, default c_obj
  -test-path set the path to the test directory
  -lib add a library to the linked files
  -runtime set the .c file to be used as the language runtime implementation
  -o set the output executable's name
  -S compile only, creating .s files from the source
  -linux use linux-style name mangling
  --stdin parse and interpret inputs from the command line, where X = 
  --clean clean the output directories, removing all files
  --show-ast print the abstract syntax tree after parsing
  --show-il print the il representation
  --llvm-backend use llvm to compile IR
  -help  Display this list of options
  --help  Display this list of options

Working with the compiler from the command line.

By default, the main program accepts the name of an Oat program, for example file.oat, compiles it, links it with the specified runtime.c program, and produces an executable called a.out. The default names for runtime.c and a.out can be modified using the command-line switches shown above.

As a side effect, the compiler also generates c_obj/file.ll and c_obj/file.s outputs, which contain the program generated after completing Phases 1 (to LL IR) and 2 (to x86 assembly), respectively. The c_bin directory is used to hold the executables and outputs generated during automatic testing using the --test flag. The -clean option deletes all the files from the c_bin and c_obj directories -- this is useful because you may generate many hundreds of temporary files during testing.

Testing Tip: use the --llvm-backend option to test Phase1 independently of Phase2. This option assumes that you have a working installation of the llc compiler in your PATH. is installed on Eniac, and is available for other platforms from http://llvm.org

OCaml Libraries

On all platforms you need to add both unix and str to the list of OCaml libraries used for this project. You can set this by specifying them on the command line like: ocamlbuild -lib unix -lib str, or in Eclipse under the project's Properties pane:

Open the Project » Properties dialog
Select the Project pane (choose it from the left column)
In the Libraries (-libs) box add unix,str

Windows

This project depends on platform-specific configuration data to assemble and link the X86 code your compiler generates. On Windows we assume that you are using Cygwin and the gcc4-g++ package for the gcc-4 compiler.

Run Cygwin setup.exe and select one more package.
After search 'gcc', click on '+' to extend the categories tree: the root node is "All", and the child node is "Devel". Then on the package ``gcc4-g++:G++ subpackage'', click 'skip' to change it to be 4.3.4-3. This will also install all dependent packages.

If these are installed properly, the compiler should be fine, but if you run into problems on Windows, ask the TA for help.

To turn off the following warning, under system variables, click on New, set variable name to CYGWIN and set value as nodosfilewarning and you're done.

cygwin warning:

MS-DOS style path detected: ...

Perfered POSIX equivalent is: ...

CYGWIN environment variable option "nodosfilewarning" turns off this warning.

Consult the user's guide for more details about POSIX paths;

http://cygwin.com/cygwin-ug-net/using.html#using-pathnames

Directories and Paths

Your compiler will produce X86 .s and .o files as well as native executables and .out files used during testing. By default these files will be generated in directories c_obj and c_bin respectively, and the default executable is called a.out. The main compiler accepts new command-line flags to change these directories from the default. Likewise, the -o option tells the compiler to use the given name for the result executable.

Note: Before running the compiler, you should create the c_bin and c_obj directories as sub-directories of your project. In OcaIDE you need to create the two directories under _build. If running the compiler issues an error like "can't open output file for writing", it probably means you are missing these directories.

Running on Linux

If you are using some flavor of linux (for example on Eniac) but not Mac OS X, you should specify the -linux flag when running the compiler. This turns off name mangling, which has different conventions on linux vs. Mac or Windows. In Eclipse you should add -linux before the --test flag.

Task I: Type checking

We have provided you with a lexer and parser for this language, and the provided source code implements most of the Phase 1 and Phase 2 translations. Your main task in this project is to complete the type checker and a few parts of Phase 1 and Phase 2 as described below.

Implement a type checker for the Oat language in the file tc.ml. Your implementation should enforce the type system as given in the Oat language specification.

To help get you started, we have provided a skeleton implementation of the type checker. You should start by browsing through tc.ml and trying to line up its code with the inference rules of the specification.

We have also provided context data-structure needed to implement the inference rules. This implementation is found in tctxt.ml, and it provides operations for managing the types and scopes of variables. In the inference rules, we write Δ ; Γ for a context consisting of the globally scoped function declarations (in Δ) and locally-declared variables (in Γ). In the implementation of contexts in tctxt.ml the Δ component is represented by the globals field of a Tctxt.ctxt and Γ is represented by the locals field.

Your type checker should take the form of a collection of functions, one for each type of inference rule. The inputs to these functions are the context and the piece of syntax to be type-checked. The output of such a function depends on which part of the program you are checking: expressions evaluate to values, so their type-checking function returns the types of the value; statements do not evaluate to values, so their type-checking functions return (OCaml's) unit; variable declarations extend the context with new bindings for the variables they introduce, so their typechecking functions return contexts. Once you get the hang of it, transliterating the inference rules from "math" into OCaml should be fairly straightforward.

Hints

Read the documentation in tc.ml carefully!
Your implementation should signal a type error in an ill-formed Oat program by raising the TypeError exception. Use Failure exceptions (via failwith) to indicate broken invariants or impossible cases in the compiler implementation.
Our test suite is not exhaustive, but most of Oat's language features are exercised.
The recursive structure of the tc_XYZ functions mirrors the recursive structure of the language syntax. Use it to guide your implementation.
You will need to special case the type checking for the built-in function length_of_array because it has a generic (polymorphic) type that is not expressible using the syntax of types in the Oat language.
Our sample solution is able to complete tc.ml by adding under 150 lines of well structured, commented code. If you find yourself writing lots of code, then it is likely to be wrong.

Task II: Compilation - Phase 1

The provided phase1.ml code can compile most of the Oat language. In this part of the project, you will complete the implementation, focusing on the new operations.

The target language is an enhanced version of the LLVM IR used in Project 3. In particular, this version is more faithful to the full LLVM language semantics, which is documented extensively at the LLVM Language Reference web site.

The files ll.mli and ll.ml provide the datatypes for working with the LL IR abstract syntax. The lllib.mli library provides some useful functions for creating unique identifiers and printing IR objects. Compared with the version of LL used in Project 3, the main new features of the LL IR are:

To handle global functions and top-level identifiers, the LL IR now distinguishes global identifiers (written like @foo) from the local identifiers (written like %foo). Function names are always global identifiers.
This version of the LLVM IR requires many more type annotations: each operand is decorated with type information.
The LL IR types include the full set of types described in lecture.
There are three new instructions Call, Getelementptr (abbreviated Gep in the LL), and Bitcast. Bitcast is used only for typechecking purposes see see oat_alloc_array_dynamic for an example of how it is used.

For this part of the project, you will implement some array functionality. Arrays in Oat are represented at the LL IR level by objects of the LL type {i32, [0 x t]}*, that is, an array is a pointer to a struct whose first element is an i32 value that is the array's length and whose second component is a sequence of elements of type t. Phase 1 implements the translation of Oat types into LL types via the cmp_typ function.

The length information is located before the 0th element of the array. For example, the constant array arr would be represented as a pointer to memory as shown below:

  int[] arr = {0,1,2,3};
  
  arr --+
        |
        v
       [4][0][1][2][3]

LHS and Array Indexing

Oat makes a distinction between left-hand-sides (lhs) and expressions, which include left-hand-sides. Compiling a lhs produces a memory address that can be assigned to by the Assign statement. In Project 3, the only lhs available was a local variable. Now, array locations can be assigned to as well.

Complete the function cmp_lhs, which computes the address of a location to write to. Array Bounds Checking:
Regardless of whether an array index expression occurs as a lhs or as an expression, your Phase 1 translation should emit code to check that the array index is within bounds. Recall that the length of the array is stored before the 0th element. Your code will need to use an appropriate Gep instruction, and invoke oat_array_bounds_check, which is another internal function provided by runtime.c

Hint: You should probably familiarize yourself with the Phase I compilation strategy before trying this part of the project. Pay particular attention to cmp_init, which is how the static arrays like {1, 2, 3} are compiled. That code isn't directly applicable to compiling new, but understanding it can help you figure out array-bounds checking. You will also want to understand how gen_local_op (found in lllib.mli) is used when generating LL code.

Array Creation

There are two ways to create array values in Oat. The first, statically-sized array initializers, is shown above. (It uses the {1,2,3} syntax.). The second way is more general: it lets the program determine the size of the array at runtime. For example, we could rewrite the code above like this:

  int[] arr = new int[f(2)](fun i -> i);

Assuming that the function call expression f(2) returns 4, then the resulting pointer arr would be the same as above. A more complex example involves computing some value involving the indices:

  int[] arr = new int[f(2)](fun i -> f(i) * (i+1));

The new operation dynamically creates a new array. It allocates a new chunk of memory by using the Oat internal function oat_alloc_array found in runtime.c, which takes a size (in words) and returns a pointer to an allocated chunk of memory with the size field initialized. The Phase 1 function oat_alloc_array_dynamic shows how to create a call to the runtime code at the LL level.

Complete the Ast.New case of the cmp_exp function. This translation should generate code that allocates an Oat array and then initializes it by iterating the initialization function for each index to obtain an element to store at that index.

Because iterating through the array indices involves a loop, generating all of the LL code for the new expression directly is somewhat involved. One possible strategy is create LL code that directly implements the loop structure. This code will be very similar to that for a while statement.

Another possible strategy is to invoke the cmp_stmt function on a carefully crafted piece of abstract syntax. The statement should refer to a variable for the loop index and the array itself, but that data must be initialized by LL code. This is a slightly tricky instance of metaprogramming.

Our sample solution to all of the Phase 1 modifications uses fewer than 60 lines of code. If you find yourself writing a lot more code, especially for New, then you might want to seek help.

Task III: Compilation: Phase 2

This part is considerably simpler than the previous parts. Phase 2 provides almost all of the translation of the LL to X86, but it is missing two operations: Call, and Bitcast. Complete them.

The provided code gives a complete implementation of a solution for Project 3. You might want to study how it is implemented to understand how this phase of the compiler works. In particular, take a look at the Gep instruction, which is a type-directed translation: the code generated depends on the LL type of the pointer being indexed.