Due: Monday, February 25th at 11:59:59pm

Download: project3.zip

To Submit: Go Here

This is a group project. Teams of size 1 or 2 are acceptable. If you are working in a team of two, only one student needs to submit the project. Be sure to fill in the team.txt file.

Warning: This project may be more time consuming than Project 2. Start early and contact the course staff with any questions.
Note: You should need to modify only parser.mly, phase1.ml, phase2.ml, and ctxt.ml when completing this project.

Overview

Build a compiler for simple C-style procedure bodies. Your compiler will accept source files of the form
  int x = 6;
  int acc = 1;
  while (x > 0) {
    acc = acc * x;
    x = x - 1;
  }
  return acc;
  
and will produce an executable (by default named a.out) that, when linked against runtime.c and then executed, produces the resulting output:
  ./a.out
  720
  
The code your compiler generates should follow the cdecl calling conventions (though, unlike the last project, there are no command-line inputs). After compilation, the source file from above should be functionally equivalent to this C program:
  int program() {
    int x = 6;
    int acc = 1;
    while (x > 0) {
      acc = acc * x;
      x = x - 1;
    }
    return acc;
 }
  

Building/Running the Project

This project uses the build configuration of Project 2. The main executable also supports more command-line flags than our previous project's did.

The test suite of this project expects to be able to load test files (named test1.oat, test2.oat, etc.) from a tests directory. By default, the path to the tests directory is set to tests/, under OcaIDE and on Windows, you will need to configure the tests directory by giving main the -test-path path argument. For example:
  • In Linux main.native -runtime ../runtime.c -test-path ../tests/ --test
  • In Windows main.native -runtime ..\runtime.c -test-path ..\tests\ --test

As in Project 2, by default the main looks for runtime.c (in the zip file) from the current directory. In OcaIDE the main is in _build, so you need to specify the path of runtime.c by the -runtime flag before the --test in the ``Run Configurations...'' dialog.

Do main.native --help from the command line for a synopsis.

OCaml Libraries

On all platforms you need to add both unix and str to the list of OCaml libraries used for this project. You can set this by specifying them on the command line like: ocamlbuild -lib unix -lib str, or in Eclipse under the project's Properties pane:
  1. Open the Project » Properties dialog
  2. Select the Project pane (choose it from the left column)
  3. In the Libraries (-libs) box add unix,str

Windows

This project depends on platform-specific configuration data to assemble and link the X86 code your compiler generates. On Windows we assume that you are using Cygwin and the gcc4-g++ package for the gcc-4 compiler.
  1. Run Cygwin setup.exe and select one more package.
  2. After search 'gcc', click on '+' to extend the categories tree: the root node is "All", and the child node is "Devel". Then on the package ``gcc4-g++:G++ subpackage'', click 'skip' to change it to be 4.3.4-3. This will also install all dependent packages.
If these are installed properly, the compiler should be fine, but if you run into problems on Windows, ask the TA for help.

To turn off the following warning, under system variables, click on New, set variable name to CYGWIN and set value as nodosfilewarning and you're done.

cygwin warning:
MS-DOS sytel path detected: ...
Perfered POSIX equivalent is: ...
CYGWIN enviroment variable option "nodosfilewarning" turns off this warning.
Consult the user's guide for more details about POSIX paths;
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames

Directories and Paths

Your compiler will produce X86 .s and .o files as well as native executables and .out files used during testing. By default these files will be generated in directories c_obj and c_bin respectively, and the default executable is called a.out. The main compiler accepts new command-line flags to change these directories from the default. Likewise, the -o option tells the compiler to use the given name for the result executable.

Note: Before running the compiler, you should create the c_bin and c_obj directories as sub-directories of your project. In OcaIDE you need to create the two directories under _build. If running the compiler issues an error like "can't open output file for writing", it probably means you are missing these directories.

Running on Linux

If you are using some flavor of linux (for example on Eniac) but not Mac OS X, you should specify the -linux flag when running the compiler. This turns off name mangling, which has different conventions on linux vs. Mac or Windows. In Eclipse you should add -linux before the --test flag.

The Procedure-body Language

The language accepted by your compiler is a simple variant of C-style program statements whose expressions are made up of the 32-bit arithmetic operations you implemented in Project 2. See the file ast.mli for the OCaml representation of the abstract syntax -- the type exp of expressions is defined there.

Unlike Project 2, which had one "input variable" X, expressions in this project can mention user-defined variables, represented by values of the form Id varname. Otherwise, the meaning of the arithmetic operators is identical with those in Project 2.

The new constructs of this language are local variable declarations and imperative program statements, with a structure similar to that found in C.

Implement an ocamllex lexer/ocamlyacc parser for integer arithmetic expressions. The terminal-symbol tokens processed by your lexer are described next.

Lexing

The file lexer.mll already defines all of the tokens you need to implement the statement grammar. As before, it depends on an implementation of "file ranges" (defined in range.ml(i)) for generating error messages. You do not need to modify lexer.mll for this project. There are several new tokens usable as terminals by your parser (see parser.mly for the complete list). Notable ones include:

Parsing

Your parser should accept strings according to the following ambiguous grammar, suitably disambiguated to solve the "dangling else" problem as discussed in Lecture. In particular, in the statement "if (e1) if (e2) s1 else s2", the else s2 clause should be associated with the inner if, not the outer one.

The output of your parser should be an abstract syntax tree built using the constructors in ast.mli. To get you started, we have implemented most of the syntax, leaving only the definition of the stmt grammar for you.


prog ::= block return exp ;

block ::= vdecls stmts

vdecls ::=
  | vdecl; vdecls
  |                              ( nothing )

vdecl ::= typ IDENT = exp

typ ::= int
 
vdecllist ::=
  | vdeclplus
  |                              ( nothing )

vdeclplus ::=
  | vdecl
  | vdecl, vdeclplus
 
lhs ::= IDENT

stmts ::=
  | stmt stmts
  |                              ( nothing )

stmtOPT ::=
  | stmt
  |                              ( nothing )

expOPT ::=
  | exp
  |                              ( nothing )

exp ::= ...       

stmt ::=                  ( ambiguous! )
  | lhs = exp;
  | if (exp) stmt
  | if (exp) stmt else stmt
  | while (exp) stmt
  | for (vdecllist ; expOPT ; stmtOPT) stmt
  | { block }
 
Note that there are two kinds of variable declaration lists -- those that are terminated by semicolons and those that are separated by commas. The former are used at the start of program blocks, the latter are used as part of the for statement. For example:
int x = 3; int y = 4;
for (int z = 2, int w = 3; x + y + z > z; x = x - 1; ) y = y -1;
return y;
Note also that the for statement has an optional second expression that acts as the loop guard -- if it is omitted, then the loop guard should be "true". Unlike C, the third component of a for statement is itself a statement (because our language doesn't provide "side-effecting" expressions like x++). According to the grammar above, you must include a terminating semicolon (as shown in the example) for this statement. It also allows the (slightly obscure) use of inner blocks like this:
for (int x = 0; x < 10; { int y = 17; x = x + y; }) x = x;
return 3;
Note: Your implementation should not cause ocamlyacc to generate any shift-reduce or reduce-reduce conflicts when processing your parser. (Style points will be deducted if this is the case.)

Hints

Compilation: Phase 1

In this project, there are two phases in the translation from source to X86 code. In Phase 1 you will compile the source abstract syntax to a simple control-flow-graph representation based on LLVM that uses alloca instructions to create temporary local storage space instead of using machine registers or actual stack locations. (Phase 2, described below, compiles from this intermediate representation to X86.)

Files ll.ml and ll.mli define the intermediate language. Local values (a.k.a. temporary values) are named by unique identifiers (uids). Instructions are separated into "computational" instructions insn and "control-flow instructions" terminator. These are grouped into labeled basic blocks.

You should complete the implementation of the compile_prog function in phase1.ml. This function takes as input a single Ast.prog and produces a value of type Il.prog, which is defined as:

  type prog = {ll_cfg: bblock list; ll_entry: lbl}
  
Here, il_cfg is the control-flow graph (represented as a list of basic blocsk), and il_entry is the label of the starting block for the program.

Variable Scoping

A significant part of the work of this phase of the translation is converting the user-defined program variables and any needed temporary locations into locals. As discussed in class, the mapping from variable identifiers to local uids is best handled by a context, which also tracks the scope of variables to resolve shadowing. The scoping rules of this language are defined as follows:
  • The scope of a variable is the rest of the block in which it is introduced. For example, the scope of the variable x in the declaration below is e2 and ..., including any nested sub-blocks, but not e1.
           int x = e1;
           int y = e2;
           ...
          
  • Identifiers declared in inner blocks shadow the definitions of outer blocks. In the example below, y will contain the value 3 when the program returns:
           int x = 1;
           int y = 0;
           {
              int x = 2;   /* shadow the outer x */
              y = y + x;
           }
           y = y + x;  /* the outer x is still in scope here */
           return y;   /* y and the outer x are in scope for the returned expression */
             
          
  • The local variables declared in a for loop are in scope only within the loop guard expression, the loop "update" statement, and the loop body. The following program is illegal because x is not in scope after the for:
          for(int x = 0; x < 10; x = x + 1;) x = x + 2;
          return x;  /* x is not in scope here */
          
    In other words, the following two statements are equivalent:
          for(int x = 0; x < 10; x = x + 1;) x = x + 2;
          
          {
            int x = 0;
            for(; x < 10; x = x + 1;) x = x + 2;
          }  /* x leaves scope here */
          
  • The expression e occuring in the final return e; of the program is in the scope of the variables defined in the preceding block. For example this is allowed:
          int x = 0;
          return x;     /* x is in scope */
          
    but this is not:
          {
            int x = 0;
          }
          return x;  /* x left scope at the end of the inner block */
          
Your compilation functions should take a context as an additional parameter to resolve scoping the mapping from names to local identifiers (a.k.a. uids). You are free to implement the context using whatever datatype you like, though the file ctxt.ml gives you some suggestions about how to approach the problem.

Compiling Control Flow

As with Project 2, this translation can be thought of as emitting a stream of instructions. However, since the results of this translation will, in general, include more than one basic block, it is useful to think of the emitted stream as including labels, as well as both kinds of instructions (arithmetic and control flow).

There are several reasonable strategies for translating the abstract syntax into basic blocks: one simple solution is to do a first pass emitting a stream of instructions and labels (possible in reverse order) and then do a post-processing pass to convert that stream into basic blocks (with code in the right order). This is the strategy used by our sample implementation, you are free to pursue other approaches as you like.

Hints

Compilation: Phase 2

For this part of the project, you must implement the definition of Phase2.compile_prog, which takes a Il.prog and produces a complete Cunit.cunit suitable for assembly on X86.

This translation is largely straight forward, however there is one wrinkle: you must allocate storage for the local storage needed for the program and translate references to locals into appropriate memory operations. Allocation is most easily accomplished in the prologue code for the program -- simply allocate enough stack space to accomodate the slots. If you follow standard C-style calling conventions, you can set Ebp appropriately so that your slots are indirect references off of Ebp (as discussed in lecture). Your translation functions should be parameterized by a mapping from uids to stack indices. Note that the when calculating the offset for a temporary variable stored in stack index n, the offset from Ebp will be negative -4*n (assuming you've set up the stack frame as suggested in lecture).

The only other wrinkle about this phase of the compiler is that X86 disallows a single instruction from using two memmory operands, so you will have to use designated registers to hold the intermediate results. One good strategy is to use Eax and Ecx (which is caller save).

Unlike Phase 1, there is no need to use a reversed-stream-of-instructions approach: the compuation has already been flattened such that each Il instruction will correspond to just a couple of X86 instructions. Your translation can simply translate each instruction individually and concatenate all of the results. Also, each block can be translated independently from the others.

Hints

Grading

Your project grade will be based on the test cases in gradedtests.ml. These are divided up into categories with point values as follows: Unlike the previous projects, we're only hiding the "Stress Tests" for grading -- you have all of the other tests. Therefore your score on the remaining 80 points of the project is what you see when you run the tests locally. You have unlimited submissions and no attempt penalty.
Last modified: Wed Feb 13 20:48:56 EST 2013