Due: 21 March 2013 at 11:59:59pm
Download: project4.zip
Oat language specification (Project 4 version)
To Submit: Go Here
This is a group project. Teams of size 1 or 2 are acceptable. If you are working in a team of two, only one student needs to submit the project. Be sure to fill in the team.txt file.
Overview
Implement a type checker and the (interesting bits of) a compiler for a simple C-like language that has boolean, int, string, and array types as well as top-level, mutually-recursive functions. Your compiler will accept source files of the formint fact(int x) { int acc = 1; while (x > 0) { acc = acc * x; x = x - 1; } return acc; } int program(int argc, string[] argv) { return fact(5); }and will produce an executable (by default named a.out) that, when linked against runtime.c and then executed produces the resulting output:
./a.out 720
The main challenge of this project is implementing a type checker -- unlike C, this language is (intended to be) completely type safe: a well-formed program should never crash, though it may abort due to the failure of a dynamic array-bounds check.
The Oat Language
The language accepted by your compiler is a simple variant of C-style programs. Unlike previous assignments, this language now supports multiple base-types of data: int, bool, and string, as well as arrays of such data. Unlike C, the language is type safe -- it does not support type casts, and it doesn't support pointers, so there is no possibility of a null-pointer dereference. The only kind of error that can possibly be generated by a well-typed Oat program is an array-index out-of-bounds error.At this point, the Oat language is large enough that it is simpler to give the specification using inference rules than to use English prose. The Oat language specification contains a definition of the language syntax and a collection of inference rules that define Oat type checking.
See the file ast.mli for the OCaml representation of the abstract syntax -- the type typ of types is defined there, along with representations of expressions, statements, blocks, function declarations, etc. You should familiarize yourself with the correspondence between the OCaml representation and the notation used in the specification. The astlib module defines some helper functions for working with the abstract syntax. (The ast.ml(i) files are automatically generated by the same tool that we use to create the specification inference rules, which is why they aren't well formatted.)
New Features
Functions:Oat supports mutually-recursive functions. Each function body consisting of a collection of variable declarations, a statement, and a
A complete Oat program contains a function called program with type (int, string[]) -> int, which takes command-line arguments like the C main function and is the entry-point of the executable.
Oat has no void type. Instead, functions that return no value have return type unit; such functions are syntactically restricted to return no value (see the fdecl production of the grammar).
Global values:
Oat supports top-level global constants of any type. Initialization
of global constants happens once before the program entry
point is run. Such global constants may not be mutually recursive.
The initialization code is generated by the compiler during Phase 1,
and runtime.c invokes the initialization code before
calling program
New expression forms:
Oat supports several new forms of expressions. Function calls are
written f(x,y,z) as in C. Boolean values are true
and false. Oat distinguishes
boolean logical operations b1 & b2 (and) and b1 | b2 (or) from
bit-wise int operations i1 [&] i2 (bitwise or) and i1
[|] i2 (bitwise or). (This change is necessitated by the strong type system.)
Arrays:
Oat supports multi-dimensional arrays. The expression
new typ [len](fun i -> init) creates a new
array of length
int[][] arr = new int[][3](fun i -> new int[4](fun j -> i*4 + j));
Arrays are mutable, and they can be updated using assignment notation: vec[0] = 17. Array indices start at 0 and go through len - 1, where len is the length of the array. Oat arrays (unlike C arrays) also store a length field, and all indexing operations perform array-bounds checks (which you will implement as part of this project).
Constant arrays are written using braces:
int[] arr = {1, 2, 3, 4};
Strings:
Oat supports C-style immutable strings, written "in quotes". For
the moment, the string operations are very limited and mostly
provided by built-in functions provided by the runtime system.
Built-in Functions:
We now have enough infrastructure to support interesting built-in
operations, including:
- string_of_array : (int[]) -> string Assumes each int of the array is the representation of an ASCII character.
- array_of_string : (string) -> int[]
- print_string : (string) -> unit
- print_int : (int) -> unit
- print_bool : (bool) -> unit
- length_of_array : (T[]) -> int
Running the Project
- In Linux main.native -runtime ../runtime.c -test-path ../tests/ --test
- In Windows main.native -runtime ..\runtime.c -test-path ..\tests\ --test
As in Project 3, by default the main looks for runtime.c (in the zip file) from the current directory. In OcaIDE the main is in _build, so you need to specify the path of runtime.c by the -runtime flag before the --test in the ``Run Configurations...'' dialog.
- In Linux main.native -runtime ../runtime.c --test
- In Windows main.native -runtime ..\runtime.c --test
Do main.native --help from the command line for a synopsis.
~/Desktop/project4_temp> ./main.native --help CIS341 main test harness --test run the test suite, ignoring other inputs -q quiet mode -- turn off verbose output -bin-path set the output path for generated executable files, default c_bin -obj-path set the output path for generated .s and .o files, default c_obj -test-path set the path to the test directory -lib add a library to the linked files -runtime set the .c file to be used as the language runtime implementation -o set the output executable's name -S compile only, creating .s files from the source -linux use linux-style name mangling --stdin parse and interpret inputs from the command line, where X =--clean clean the output directories, removing all files --show-ast print the abstract syntax tree after parsing --show-il print the il representation --llvm-backend use llvm to compile IR -help Display this list of options --help Display this list of options
Working with the compiler from the command line.
By default, the main program accepts the name of an Oat program, for example file.oat, compiles it, links it with the specified runtime.c program, and produces an executable called a.out. The default names for runtime.c and a.out can be modified using the command-line switches shown above.As a side effect, the compiler also generates c_obj/file.ll and c_obj/file.s outputs, which contain the program generated after completing Phases 1 (to LL IR) and 2 (to x86 assembly), respectively. The c_bin directory is used to hold the executables and outputs generated during automatic testing using the --test flag. The -clean option deletes all the files from the c_bin and c_obj directories -- this is useful because you may generate many hundreds of temporary files during testing.
OCaml Libraries
On all platforms you need to add both unix and str to the list of OCaml libraries used for this project. You can set this by specifying them on the command line like: ocamlbuild -lib unix -lib str, or in Eclipse under the project's Properties pane:- Open the Project » Properties dialog
- Select the Project pane (choose it from the left column)
- In the Libraries (-libs) box add unix,str
Windows
This project depends on platform-specific configuration data to assemble and link the X86 code your compiler generates. On Windows we assume that you are using Cygwin and the gcc4-g++ package for the gcc-4 compiler.- Run Cygwin setup.exe and select one more package.
- After search 'gcc', click on '+' to extend the categories tree: the root node is "All", and the child node is "Devel". Then on the package ``gcc4-g++:G++ subpackage'', click 'skip' to change it to be 4.3.4-3. This will also install all dependent packages.
To turn off the following warning, under system variables, click on New, set variable name to CYGWIN and set value as nodosfilewarning and you're done.
cygwin warning:
MS-DOS style path detected: ...
Perfered POSIX equivalent is: ...
CYGWIN environment variable option "nodosfilewarning" turns off this warning.
Consult the user's guide for more details about POSIX paths;
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
Directories and Paths
Your compiler will produce X86 .s and .o files as well as native executables and .out files used during testing. By default these files will be generated in directories c_obj and c_bin respectively, and the default executable is called a.out. The main compiler accepts new command-line flags to change these directories from the default. Likewise, the -o option tells the compiler to use the given name for the result executable.
Running on Linux
If you are using some flavor of linux (for example on Eniac) but not Mac OS X, you should specify the -linux flag when running the compiler. This turns off name mangling, which has different conventions on linux vs. Mac or Windows. In Eclipse you should add -linux before the --test flag.
Task I: Type checking
We have provided you with a lexer and parser for this language, and the provided source code implements most of the Phase 1 and Phase 2 translations. Your main task in this project is to complete the type checker and a few parts of Phase 1 and Phase 2 as described below.
Implement a type checker for the Oat language in the file tc.ml. Your implementation should enforce the type system as given in the Oat language specification.
To help get you started, we have provided a skeleton implementation of the type checker. You should start by browsing through tc.ml and trying to line up its code with the inference rules of the specification.
We have also provided context data-structure needed to implement the inference rules. This implementation is found in tctxt.ml, and it provides operations for managing the types and scopes of variables. In the inference rules, we write Δ ; Γ for a context consisting of the globally scoped function declarations (in Δ) and locally-declared variables (in Γ). In the implementation of contexts in tctxt.ml the Δ component is represented by the globals field of a Tctxt.ctxt and Γ is represented by the locals field.
Your type checker should take the form of a collection of functions, one for each type of inference rule. The inputs to these functions are the context and the piece of syntax to be type-checked. The output of such a function depends on which part of the program you are checking: expressions evaluate to values, so their type-checking function returns the types of the value; statements do not evaluate to values, so their type-checking functions return (OCaml's) unit; variable declarations extend the context with new bindings for the variables they introduce, so their typechecking functions return contexts. Once you get the hang of it, transliterating the inference rules from "math" into OCaml should be fairly straightforward.
- Read the documentation in tc.ml carefully!
- Your implementation should signal a type error in an ill-formed Oat program by raising the TypeError exception. Use Failure exceptions (via failwith) to indicate broken invariants or impossible cases in the compiler implementation.
- Our test suite is not exhaustive, but most of Oat's language features are exercised.
- The recursive structure of the tc_XYZ functions mirrors the recursive structure of the language syntax. Use it to guide your implementation.
- You will need to special case the type checking for the built-in function length_of_array because it has a generic (polymorphic) type that is not expressible using the syntax of types in the Oat language.
- Our sample solution is able to complete tc.ml by adding under 150 lines of well structured, commented code. If you find yourself writing lots of code, then it is likely to be wrong.
Task II: Compilation - Phase 1
The provided phase1.ml code can compile most of the Oat language. In this part of the project, you will complete the implementation, focusing on the new operations.
The target language is an enhanced version of the LLVM IR used in Project 3. In particular, this version is more faithful to the full LLVM language semantics, which is documented extensively at the LLVM Language Reference web site.
The files ll.mli and ll.ml provide the datatypes for working with the LL IR abstract syntax. The lllib.mli library provides some useful functions for creating unique identifiers and printing IR objects. Compared with the version of LL used in Project 3, the main new features of the LL IR are:
- To handle global functions and top-level identifiers, the LL IR now distinguishes global identifiers (written like @foo) from the local identifiers (written like %foo). Function names are always global identifiers.
- This version of the LLVM IR requires many more type annotations: each operand is decorated with type information.
- The LL IR types include the full set of types described in lecture.
- There are three new instructions Call, Getelementptr (abbreviated Gep in the LL), and Bitcast. Bitcast is used only for typechecking purposes see see oat_alloc_array_dynamic for an example of how it is used.
For this part of the project, you will implement some array functionality. Arrays in Oat are represented at the LL IR level by objects of the LL type {i32, [0 x t]}*, that is, an array is a pointer to a struct whose first element is an i32 value that is the array's length and whose second component is a sequence of elements of type t. Phase 1 implements the translation of Oat types into LL types via the cmp_typ function.
The length information is located before the 0th element of the array. For example, the constant array arr would be represented as a pointer to memory as shown below:
int[] arr = {0,1,2,3}; arr --+ | v [4][0][1][2][3]
LHS and Array Indexing
Oat makes a distinction between left-hand-sides (lhs) and expressions, which include left-hand-sides. Compiling a lhs produces a memory address that can be assigned to by the Assign statement. In Project 3, the only lhs available was a local variable. Now, array locations can be assigned to as well.
Complete the function cmp_lhs, which computes the
address of a location to write to.
Array Bounds Checking:
Regardless of whether an array index expression occurs as a lhs or
as an expression, your Phase 1 translation should emit code to check
that the array index is within bounds. Recall that the length of
the array is stored before the 0th element. Your code will
need to use an appropriate Gep instruction, and invoke
oat_array_bounds_check, which is another internal function
provided by runtime.c
Array Creation
There are two ways to create array values in Oat. The first, statically-sized array initializers, is shown above. (It uses the {1,2,3} syntax.). The second way is more general: it lets the program determine the size of the array at runtime. For example, we could rewrite the code above like this:int[] arr = new int[f(2)](fun i -> i);Assuming that the function call expression f(2) returns 4, then the resulting pointer arr would be the same as above. A more complex example involves computing some value involving the indices:
int[] arr = new int[f(2)](fun i -> f(i) * (i+1));The new operation dynamically creates a new array. It allocates a new chunk of memory by using the Oat internal function oat_alloc_array found in runtime.c, which takes a size (in words) and returns a pointer to an allocated chunk of memory with the size field initialized. The Phase 1 function oat_alloc_array_dynamic shows how to create a call to the runtime code at the LL level.
Complete the Ast.New case of the cmp_exp function. This translation should generate code that allocates an Oat array and then initializes it by iterating the initialization function for each index to obtain an element to store at that index.
Because iterating through the array indices involves a loop, generating all of the LL code for the new expression directly is somewhat involved. One possible strategy is create LL code that directly implements the loop structure. This code will be very similar to that for a while statement.
Another possible strategy is to invoke the cmp_stmt function on a carefully crafted piece of abstract syntax. The statement should refer to a variable for the loop index and the array itself, but that data must be initialized by LL code. This is a slightly tricky instance of metaprogramming.
Task III: Compilation: Phase 2
This part is considerably simpler than the previous parts. Phase 2 provides almost all of the translation of the LL to X86, but it is missing two operations: Call, and Bitcast. Complete them.The provided code gives a complete implementation of a solution for Project 3. You might want to study how it is implemented to understand how this phase of the compiler works. In particular, take a look at the Gep instruction, which is a type-directed translation: the code generated depends on the LL type of the pointer being indexed.
- Your implementation of Call should follow cdecl calling conventions
- You might want to look at the given cases to figure out how to use the compile_op function.
- Bitcast is only used for typechecking at the LL level. It doesn't do very much at runtime.
- Our implementation for the missing parts of Phase II takes under 30 lines of code.
Grading
Your project grade will be based on the test cases in gradedtests.ml. These are divided up into categories with point values as follows:- Type Checking Error Tests: 15 pts.
- Type Checking Correctness Tests: 15 pts.
- Easy Compiling Tests: 20 pts.
- Medium Compiling Tests: 10 pts.
- Hard Tests: 10 pts.
- Runtime Error Tests: 20 pts.
- Stress Tests: 10 pts. (not shown in gradedtest.ml)
- Coding Style: 10 pts.