Due: Monday, February 25th at 11:59:59pm
Download: project3.zip
To Submit: Go Here
This is a group project. Teams of size 1 or 2 are acceptable. If you are working in a team of two, only one student needs to submit the project. Be sure to fill in the team.txt file.
Overview
Build a compiler for simple C-style procedure bodies. Your compiler will accept source files of the formint x = 6; int acc = 1; while (x > 0) { acc = acc * x; x = x - 1; } return acc;and will produce an executable (by default named a.out) that, when linked against runtime.c and then executed, produces the resulting output:
./a.out 720The code your compiler generates should follow the cdecl calling conventions (though, unlike the last project, there are no command-line inputs). After compilation, the source file from above should be functionally equivalent to this C program:
int program() { int x = 6; int acc = 1; while (x > 0) { acc = acc * x; x = x - 1; } return acc; }
Building/Running the Project
This project uses the build configuration of Project 2. The main executable also supports more command-line flags than our previous project's did.
- In Linux main.native -runtime ../runtime.c -test-path ../tests/ --test
- In Windows main.native -runtime ..\runtime.c -test-path ..\tests\ --test
As in Project 2, by default the main looks for runtime.c (in the zip file) from the current directory. In OcaIDE the main is in _build, so you need to specify the path of runtime.c by the -runtime flag before the --test in the ``Run Configurations...'' dialog.
- In Linux main.native -runtime ../runtime.c --test
- In Windows main.native -runtime ..\runtime.c --test
Do main.native --help from the command line for a synopsis.
OCaml Libraries
On all platforms you need to add both unix and str to the list of OCaml libraries used for this project. You can set this by specifying them on the command line like: ocamlbuild -lib unix -lib str, or in Eclipse under the project's Properties pane:- Open the Project » Properties dialog
- Select the Project pane (choose it from the left column)
- In the Libraries (-libs) box add unix,str
Windows
This project depends on platform-specific configuration data to assemble and link the X86 code your compiler generates. On Windows we assume that you are using Cygwin and the gcc4-g++ package for the gcc-4 compiler.- Run Cygwin setup.exe and select one more package.
- After search 'gcc', click on '+' to extend the categories tree: the root node is "All", and the child node is "Devel". Then on the package ``gcc4-g++:G++ subpackage'', click 'skip' to change it to be 4.3.4-3. This will also install all dependent packages.
To turn off the following warning, under system variables, click on New, set variable name to CYGWIN and set value as nodosfilewarning and you're done.
MS-DOS sytel path detected: ...
Perfered POSIX equivalent is: ...
CYGWIN enviroment variable option "nodosfilewarning" turns off this warning.
Consult the user's guide for more details about POSIX paths;
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
Directories and Paths
Your compiler will produce X86 .s and .o files as well as native executables and .out files used during testing. By default these files will be generated in directories c_obj and c_bin respectively, and the default executable is called a.out. The main compiler accepts new command-line flags to change these directories from the default. Likewise, the -o option tells the compiler to use the given name for the result executable.
Running on Linux
If you are using some flavor of linux (for example on Eniac) but not Mac OS X, you should specify the -linux flag when running the compiler. This turns off name mangling, which has different conventions on linux vs. Mac or Windows. In Eclipse you should add -linux before the --test flag.
The Procedure-body Language
The language accepted by your compiler is a simple variant of C-style program statements whose expressions are made up of the 32-bit arithmetic operations you implemented in Project 2. See the file ast.mli for the OCaml representation of the abstract syntax -- the type exp of expressions is defined there.Unlike Project 2, which had one "input variable" X, expressions in this project can mention user-defined variables, represented by values of the form Id varname. Otherwise, the meaning of the arithmetic operators is identical with those in Project 2.
The new constructs of this language are local variable declarations and imperative program statements, with a structure similar to that found in C.
Implement an ocamllex lexer/ocamlyacc parser for integer arithmetic expressions. The terminal-symbol tokens processed by your lexer are described next.
Lexing
The file lexer.mll already defines all of the tokens you need to implement the statement grammar. As before, it depends on an implementation of "file ranges" (defined in range.ml(i)) for generating error messages. You do not need to modify lexer.mll for this project. There are several new tokens usable as terminals by your parser (see parser.mly for the complete list). Notable ones include:- IDENT program identifiers consisting of characters, digits, and underscores
- int, if, else, while, for, and return: language keywords
- ; , { } syntax elements
Parsing
Your parser should accept strings according to the following ambiguous grammar, suitably disambiguated to solve the "dangling else" problem as discussed in Lecture. In particular, in the statement "if (e1) if (e2) s1 else s2", the else s2 clause should be associated with the inner if, not the outer one.The output of your parser should be an abstract syntax tree built using the constructors in ast.mli. To get you started, we have implemented most of the syntax, leaving only the definition of the stmt grammar for you.
prog ::= block return exp ; block ::= vdecls stmts vdecls ::= | vdecl; vdecls | ( nothing ) vdecl ::= typ IDENT = exp typ ::= int vdecllist ::= | vdeclplus | ( nothing ) vdeclplus ::= | vdecl | vdecl, vdeclplus lhs ::= IDENT stmts ::= | stmt stmts | ( nothing ) stmtOPT ::= | stmt | ( nothing ) expOPT ::= | exp | ( nothing ) exp ::= ... stmt ::= ( ambiguous! ) | lhs = exp; | if (exp) stmt | if (exp) stmt else stmt | while (exp) stmt | for (vdecllist ; expOPT ; stmtOPT) stmt | { block }Note that there are two kinds of variable declaration lists -- those that are terminated by semicolons and those that are separated by commas. The former are used at the start of program blocks, the latter are used as part of the for statement. For example:
int x = 3; int y = 4; for (int z = 2, int w = 3; x + y + z > z; x = x - 1; ) y = y -1; return y;Note also that the for statement has an optional second expression that acts as the loop guard -- if it is omitted, then the loop guard should be "true". Unlike C, the third component of a for statement is itself a statement (because our language doesn't provide "side-effecting" expressions like x++). According to the grammar above, you must include a terminating semicolon (as shown in the example) for this statement. It also allows the (slightly obscure) use of inner blocks like this:
for (int x = 0; x < 10; { int y = 17; x = x + y; }) x = x; return 3;
Hints
- The switch --show_ast can be used to pretty-print the abstract syntax tree to the terminal when the compiler is run.
Compilation: Phase 1
In this project, there are two phases in the translation from source to X86 code. In Phase 1 you will compile the source abstract syntax to a simple control-flow-graph representation based on LLVM that uses alloca instructions to create temporary local storage space instead of using machine registers or actual stack locations. (Phase 2, described below, compiles from this intermediate representation to X86.)Files ll.ml and ll.mli define the intermediate language. Local values (a.k.a. temporary values) are named by unique identifiers (uids). Instructions are separated into "computational" instructions insn and "control-flow instructions" terminator. These are grouped into labeled basic blocks.
You should complete the implementation of the compile_prog function in phase1.ml. This function takes as input a single Ast.prog and produces a value of type Il.prog, which is defined as:
type prog = {ll_cfg: bblock list; ll_entry: lbl}Here, il_cfg is the control-flow graph (represented as a list of basic blocsk), and il_entry is the label of the starting block for the program.
Variable Scoping
A significant part of the work of this phase of the translation is converting the user-defined program variables and any needed temporary locations into locals. As discussed in class, the mapping from variable identifiers to local uids is best handled by a- The scope of a variable is the rest of the block in which
it is introduced. For example, the scope of the variable
x in the declaration below is e2 and
..., including any nested sub-blocks, but not e1.
int x = e1; int y = e2; ...
- Identifiers declared in inner blocks shadow the
definitions of outer blocks. In the example below, y
will contain the value 3 when the program returns:
int x = 1; int y = 0; { int x = 2; /* shadow the outer x */ y = y + x; } y = y + x; /* the outer x is still in scope here */ return y; /* y and the outer x are in scope for the returned expression */
-
The local variables declared in a for loop are in scope
only within the loop guard expression, the loop "update"
statement, and the loop body. The following program is illegal
because x is not in scope after the for:
for(int x = 0; x < 10; x = x + 1;) x = x + 2; return x; /* x is not in scope here */
In other words, the following two statements are equivalent:for(int x = 0; x < 10; x = x + 1;) x = x + 2;
{ int x = 0; for(; x < 10; x = x + 1;) x = x + 2; } /* x leaves scope here */
- The expression e occuring in the final return
e; of the program is in the scope of the variables defined
in the preceding block. For example this is allowed:
int x = 0; return x; /* x is in scope */
but this is not:{ int x = 0; } return x; /* x left scope at the end of the inner block */
Compiling Control Flow
As with Project 2, this translation can be thought of as emitting a stream of instructions. However, since the results of this translation will, in general, include more than one basic block, it is useful to think of the emitted stream as including labels, as well as both kinds of instructions (arithmetic and control flow).
There are several reasonable strategies for translating the abstract syntax into basic blocks: one simple solution is to do a first pass emitting a stream of instructions and labels (possible in reverse order) and then do a post-processing pass to convert that stream into basic blocks (with code in the right order). This is the strategy used by our sample implementation, you are free to pursue other approaches as you like.
Hints
- Remember that ctxt.ml contains some suggestions about one good way to structure your context. You can use this file to implement contexts.
- The List library offers some useful functions for implementing contexts. See, for example, List.assoc
-
If you do decide to pursue the stream of instructions and labels
approach suggested above, you might want to make the following
definitions (or something similar):
type elt = | I of Ll.insn | J of Ll.terminator | L of Ll.lbl type stream = elt list
- Your translation strategy should be guided by the structure of the abstract syntax. In particular, you should consider implementing a compile_X function for each type found in ast.mli -- you'll end up with compile_exp, compile_block, compile_var_decl, etc. You might also want to consider how the list List.fold_left functions can be applied to process lists of abstract syntax elements. The mutual recursion in the type definitions can suggest useful ways of structuring mutually recursive translation functions.
Compilation: Phase 2
For this part of the project, you must implement the definition of Phase2.compile_prog, which takes a Il.prog and produces a complete Cunit.cunit suitable for assembly on X86.This translation is largely straight forward, however there is one wrinkle: you must allocate storage for the local storage needed for the program and translate references to locals into appropriate memory operations. Allocation is most easily accomplished in the prologue code for the program -- simply allocate enough stack space to accomodate the slots. If you follow standard C-style calling conventions, you can set Ebp appropriately so that your slots are indirect references off of Ebp (as discussed in lecture). Your translation functions should be parameterized by a mapping from uids to stack indices. Note that the when calculating the offset for a temporary variable stored in stack index n, the offset from Ebp will be negative -4*n (assuming you've set up the stack frame as suggested in lecture).
The only other wrinkle about this phase of the compiler is that X86 disallows a single instruction from using two memmory operands, so you will have to use designated registers to hold the intermediate results. One good strategy is to use Eax and Ecx (which is caller save).
Unlike Phase 1, there is no need to use a reversed-stream-of-instructions approach: the compuation has already been flattened such that each Il instruction will correspond to just a couple of X86 instructions. Your translation can simply translate each instruction individually and concatenate all of the results. Also, each block can be translated independently from the others.
Hints
- Remember to wrap each X86 block in a Cunit Code component. As with Project 2, there is no need for the Data components yet.
- The "prologue" code for your function should reside at the (appropriately name mangled) label program.
- Don't forget to emit the appropriate "epilogue" code, which should clean up the stack and restore Ebp (if it has been modified) before your program returns.
Grading
Your project grade will be based on the test cases in gradedtests.ml. These are divided up into categories with point values as follows:- Easy Parsing Tests: 20 pts.
- Medium Parsing Tests: 10 pts.
- Parse Error Tests: 10 pts.
- Scope Error Tests: 5 pts.
- Easy Compiling Tests: 15 pts.
- Medium Compiling Tests: 10 pts.
- Hard Compilng Tests: 10 pts.
- Stress Tests: 10 pts. (not all are shown in gradedtest.ml)
- Coding Style: 10 pts.