Due: February 12th, 2013 at 11:59:59 pm
Download: project2.zip
To Submit: Go Here
This is a group project. Teams of size 1 or 2 are acceptable. If you are working in a team of two, only one student needs to submit the project. Be sure to fill in the team.txt file.
Overview
Build a compiler for simple arithmetic expressions. Your compiler will accept source files of the form1 + 2 + X + 4 + 32and will produce an object file (by default named a.out) that, when linked against runtime.c and then executed, takes a single command-line numeric argument for the value of the variable X and produces the resulting output:
./a.out 3 42The code your compiler generates should follow the cdecl calling conventions. After compilation, the source file from above should be functionally equivalent to this C program:
int program(int X) { return (1 + 2 + X + 4 + 32); }
Building the Project
OCaml Libraries
On all platforms you need to add both unix and str to the list of OCaml libraries used for this project. You can set this by specifying them on the command line like: ocamlbuild -lib unix -lib str, or in Eclipse under the project's Properties pane:- Open the Project » Properties dialog
- Select the Project pane (choose it from the left column)
- In the Libraries (-libs) box add unix,str
Windows
This project depends on platform-specific configuration data to assemble and link the X86 code your compiler generates. On Windows we assume that you are using Cygwin and the gcc4-g++ package for the gcc-4 compiler.- Run Cygwin setup.exe and select one more package.
- After search 'gcc', click on '+' to extend the categories tree: the root node is "All", and the child node is "Devel". Then on the package ``gcc4-g++:G++ subpackage'', click 'skip' to change it to be 4.3.4-3. This will also install all dependent packages.
To turn off the following warning, under system variables, click on New, set variable name to CYGWIN and set value as nodosfilewarning and you're done.
MS-DOS sytel path detected: ...
Perfered POSIX equivalent is: ...
CYGWIN enviroment variable option "nodosfilewarning" turns off this warning.
Consult the user's guide for more details about POSIX paths;
http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
Directories and Paths
Your compiler will produce X86 .s and .o files as well as native executables and .out files used during testing. By default these files will be generated in directories c_obj and c_bin respectively, and the default executable is called a.out. The main compiler accepts new command-line flags to change these directories from the default. Likewise, the -o option tells the compiler to use the given name for the result executable.
Running the Project
This project needs slightly different build configurations compared to the earlier projects. The main executable also supports more command-line flags than our previous project's did. By default main looks for runtime.c (in the zip file) from the current directory.Running from OcaIDE
In OcaIDE the main is in _build, so you need to specify the path of runtime.c by the -runtime flag before the --test in the ``Run Configurations...'' dialog.- In Linux main.native -runtime ../runtime.c --test
- In Windows main.native -runtime ..\runtime.c --test
Running on Linux
If you are using some flavor of linux (for example on Eniac) but not Mac OS X, you should specify the -linux flag when running the compiler. This turns off name mangling, which has different conventions on linux vs. Mac or Windows. In Eclipse you should add -linux before the --test flag.
Running from the command line
On all platforms, you should be able to run the main.native program from the command line. Do main --help from the command line for a synopsis of the allowed arguments.
Here is a sample interaction with a (fully implemented) version of the project:
~/Desktop/project2_temp> ocamlbuild -lib str -lib unix main.native Finished, 47 targets (0 cached) in 00:00:00. ~/Desktop/project2_temp> ./main.native --help CIS341 main test harness --test run the test suite, ignoring other inputs -q quiet mode -- turn off verbose output -bin-path set the output path for generated executable files, default c_bin -obj-path set the output path for generated .s and .o files, default c_obj -test-path set the path to the test directory -lib add a library to the linked files -runtime set the .c file to be used as the language runtime implementation -o set the output executable's name -S compile only, creating .s files from the source -linux use linux-style name mangling --stdin parse and interpret inputs from the command line, where X =--clean clean the output directories, removing all files --show_ast print the abstract syntax tree after parsing -help Display this list of options --help Display this list of options ~/Desktop/project2_temp> echo "X * X * X" > foo.e ~/Desktop/project2_temp> ./main.native -o foo -runtime runtime.c foo.e Processing: foo.e root name: foo * gcc -mstackrealign -c -m32 -o c_obj/foo.o c_obj/foo.s * gcc -mstackrealign -m32 -o foo c_obj/foo.o runtime.c ~/Desktop/project2_temp> ./foo 3 27
The Expressions Language
The language accepted by your compiler is a simple collection of 32-bit arithmetic operations. See the file ast.mli for the OCaml representation of the abstract syntax -- the type exp of expressions is defined there.There is only one "input variable", written X, available for use in your expressions -- the value of X is determined by the command-line argument passed in to the generated executable when your program is run. Otherwise, the meaning of the arithmetic operators is mostly standard (see below).
Implement an ocamllex lexer/ocamlyacc parser for integer arithmetic expressions. The terminal-symbol tokens processed by your lexer are described next.
Lexing
The file lexer.mll has some skeleton lexing code you can use as a starting point. It depends on an implementation of "file ranges" (defined in range.ml(i)) for generating error messages. You will need to add lexing rules to handle the cases below. This will require modifying parser.mly to declare the types of any new tokens you add -- the tokens should each take a Range.t parameter, which is used to print error messages. See the example of how the X token is declared in parser.mly.- Whitespace is permitted anywhere before, between or after tokens. Whitespace is defined as the four characters tap '\t', space ' ', newline '\n', and carriage-return '\r'.
- Numeric literals should be parsed as int32s. Use Int32.of_string to convert lexemes to their int32 representation. Remember that it might throw a Failure exception (which you should translate to a Lexer_error). A numeric literal may contain one or more of the ten decimal digits 0123456789. Negative numeric literals should be handled by the parser as a combination of a positive literal and the unary negation operator (see below).
- The non-numeric tokens usable as terminals by your parser (and
their intended semantics) are given below:
- X the unique input argument (just the literal character 'X')
- + binary signed addition
- - binary signed subtraction OR signed unary negation
- * binary signed multiplication
- == binary equality (returns 1 or 0)
- << binary shift left
- >> binary arithmetic shift right
- >>> binary logical shift right
- != binary not-equal (returns 1 or 0)
- < binary signed less-than (returns 1 or 0)
- <= binary signed less-than or equals (returns 1 or 0)
- > binary signed greater-than (returns 1 or 0)
- >= binary signed greater-than or equals (returns 1 or 0)
- ! unary logical not (returns 1 for a zero input, 0 for nonzero input)
- ~ unary bitwise not
- & binary bitwise and
- | binary bitwise or
- ( left parenthesis
- ) right parenthesis
Parsing
Your parser should accept strings according to the following ambiguous grammar, suitably disambiguated to implement the precedence of operations as defined in ast.ml (see the prec_of_* functions). Recall that higher precedence means tighter binding, that is, X + 3 * 4 should yield the same abstract syntax tree as X + (3 * 4). Explicitly parenthesized expressions should be treated as though they had maximal precedence.The output of your parser should be an abstract syntax tree built using the constructors in ast.mli. To get you started, we have implemented the trivial language E ::= X.
The full (ambiguous) grammar for expressions has a single non-terminal (and starting nonterminal) E and rules given by:
| numeric literal
| (E)
| E + E
| E - E
| E * E
| E == E
| E << E
| E >> E
| E >>> E
| E != E
| E < E
| E <= E
| E > E
| E >= E
| ! E
| ~ E
| - E
| E & E
| E | E
Implement a parser for the disambiguated grammar for arithmetic expressions. All binary operations should be made to be left associative. That is, the input string 1 + 2 + 3 should produce the same parse tree as (1 + 2) + 3.
Hints
- The lexer and parser are quite intertwined, so you'll have to make progress on both of them together before testing your code.
- The main executable can parse inputs from standard input
if run with the --stdin n switch. In this mode, the
program reads lines from standard in, parses them and then uses the
interpreter available in the Ast module to compute the
expected answer your compiler should produce. Here n is an
integer that is used as the value for X. For example, once
you make some progress on the lexer and parser you should be able to
have this kind of interaction:
> ./main.native --stdin 17 X X Executable should yield: 17 (X + X) X + X Executable should yield: 34
- The switch --show_ast can be used to dump the abstract syntax tree to the terminal when the compiler is run.
Compiling Expressions
You should complete the implementation of the compile_exp function in compiler.ml. This function takes as input a single Ast.exp expression and produces a single, globally visible code block whose entry point is given by the label created by calling Platform.decorate_cdecl "program". This code should follow the cdecl calling conventions, and behave as though implementing the C function defined as below (where <exp>) is the expression defined in our language:int program(int X) { return (<exp>); }
It is easiest to generate code when you maintain certain regularity invariants. In this case, you might want to use the invariant "the code resulting from compiling an expression places its answer in Eax", since Eax is the "return register" for cdecl calling conventions.
This strategy works fine for everything except binary operations -- for those you need to store a temporary value. Since we don't have very much by way of compiler infrastructure at this point, there aren't too many options for storing those temporary values. For this project, you should store the results on the stack (and return results in Eax). You should only have to push new values to the stack when compiling an expression containing a binary operator. In that case, first generate code for the left-hand branch, then push the result, then generate code for the right-hand branch. Remember to pop the intermediate value when you're done with it.
You can access the value at the top of the stack with the indirect operand (stack_offset 0l). To pop an item off the stack without being forced to store it somewhere, just manipulate Esp directly. Remember that the shift instructions put special requirements on the location of the shift amount; also note that you will have to generate some extra code for the comparison and logical negation operators, since they return exactly 0 or 1.
If you use registers other than the callee save Eax, Ecx, and Edx for the computation, you must save and restore those values (our solution code is able to manage without using any callee save registers). You might also want to consider storing X (the program argument) in a dedicated register for easy access during the computation.
Hints
- Code generation should be a simple, recursive algorithm that processes the abstract syntax tree. The main code generator in our sample solution is 40 lines of well-laid-out code. If your code generator is much larger than this, you're doing something wrong -- seek help!
-
In OCaml, it's most efficient to add to the head of a list. A good
strategy is therefore to generate the instructions in reverse order and then
reverse the entire list once when the code generation is complete and
we need to construct a code block. In this
case, we can think of the code generator as emitting instructions into
an "instruction stream". The main code generator might look like this:
let rec emit_exp (e:exp) (stream : insn list) : insn list = begin match e with | Cint i -> Mov (eax, Imm i) :: stream | ... end
Here, e is the expression being compiled, stream is the "stream of instructions" that we're emitting (represented as an insn list), and we "emit" an instruction by simply consing it on to the head of the list, as shown in the case for compiling integer literals. Following the invariants given above, we simply move the answer into Eax.You are free to use this strategy, or one of your own choosing.
Grading
Your project grade will be based on the test cases in gradedtests.ml. These are divided up into categories with point values as follows:- Easy Parsing Tests: 15 pts.
- Easy Compiling Tests: 15 pts.
- Medium Parsing Tests: 7 pts.
- Medium Parsing Tests 2: 8 pts. (hidden)
- Medium Compiling Tests: 7 pts.
- Medium Compiling Tests 2: 8 pts. (hidden)
- Error Tests 1: 5 pts.
- Error Tests 2: 5 pts. (hidden)
- Hard Tests: 20 pts. (hidden)
- Coding Style: 10 pts. (manually graded later)