Due: 29 April 2015 at 11:59:59pm
Download: hw7.zip
To Submit: Go Here
This is a group project. Teams of size 1 or 2 are acceptable. If you are working in a team of two, only one student needs to submit the project.
What to turn in:
- Code that implements an optimization of your choice at one of the compiler levels (AST, LL, or X86).
- Any additional .oat test files and the accompanying test code needed to run them.
- At least one public test case (posted to Piazza).
- A description of your optimization and a thorough evaluation of its effectiveness.
Overview
The OAT compiler produces relatively inefficient executables due to its lack of optimizations and overuse of temporary slots when computing intermediate values. In this project you are free to implement OAT optimizations of your choice at one (or more) of the different levels of program representation: at the OAT AST level, at the LL level, or at the X86 assembly level.
Getting Started
We have provided a complete implementation of the OAT compiler, which is largely the same as the one from HW5 (with a few bug fixes). The compiler has been modified slightly to include several new hooks where you can add optimization code and several new flags that let you invoke the clang backend with different optimization levels.
New Compiler Flags
- -Oast : enable optimizations at the OAT AST level
- -Oll : enable optimizations at the LL level
- -Ox86 : enable optimizations at the X86 level
- --clang [O0 | O1 | O2 | O3] : use the clang backend with the given optimization level
What to Do
We have provided a complete, working implementation of the OAT compiler. The new framework has hooks into three places where you can apply optimizations: at the OAT AST level, at the LL level, and at the assembly level. See the code in opt.ml. By default, these optimizations are implemented simply as identity functions. Pick one of these levels and implement an optimization of your choice at that level. See the lecture notes and the suggestions below for inspiration about the kind of optimizations you might choose.
Whatever optimization you choose to implement, it should preserve the correctness of the compiler. We have included the usual suite of test cases -- your compiler should yield the same answers for them with or without optimizations turned on.
The optimization you pick can impact the generated code in various ways: it might improve the runtime performance or it might change the code size (making it smaller or larger, for example). Design some experiments to measure the impact of your optimizations. For example, you might look at the running time of a number of benchmarks with and without the optimization enabled.
You may use the given test suite to help measure the impact of your optimizations, but since most of them are very short-running and uninteresting computationally, you will need to develop additional test cases. For example, you might want to create a long-running loop that does some complex calculations to measure performance improvements. We have provided a few such benchmark programs in oatprograms/optimize*.oat.
Finally, document your project so that the TA can verify your results. Submit a README.txt file containing a brief description of your optimization, your test methodology, and a summary of your results.
Measurements
Code size can easily be measured by looking at (for example) the number of bytes in the resulting .o file.
The running-time of your program can be measured using the time command on Linux/Unix/Mac. This utility reports three times, as illustrated by this terminal session:
~/cis341/hw/hw7/soln> ./main.native --clang O1 runtime.c oatprograms/optimize1.oat
~/cis341/hw/hw7/soln> time ./a.out
664999335000000000
real    0m1.530s
user    0m1.527s
sys     0m0.002s
You should report the sum of the "user" and "sys" times, which are the actual amount of time used by the timed process. The "real" time will usually be close to this sum, but might be longer if the program you are timing happens to be interrupted by another process. (It's also technically possible for the "real" time to be smaller than the sum of the "user" and "sys" times for multithreaded programs on multicore processors, but that shouldn't be possible for OAT programs.)
To get better measurements, you will want to test your programs on a "lightly loaded" machine (e.g. your laptop running only a minimum number of processes). Such an isolated environment might be hard to get on Eniac; we aren't going to hold you to rigorous scientific methodology, but do the best you can.
All data reported in this handout was generated on a Linux desktop machine with an Intel i7-4770 @ 3.4 GHz processor and 16 GB of RAM.
We aren't looking for a deep statistical analysis of the performance measurements you make, but you will want to run your tests multiple times and compute averages. You might also want to create a program that is parameterized in some way (e.g. by a number of loop iterations or by the size of some data structure) and test your optimizations for varying input sizes or run durations. Such a series of tests can help you determine whether there are other low-level effects (such as caching) that contribute to the behaviors you see.
Comparison with Clang
To get a sense of the effectiveness of your optimization, we expect you to compare the performance of the benchmark programs using various configurations of the OAT compiler:
- Baseline: no optimization
- Your optimization only (i.e. with one of -Oast, -Oll, or -Ox86)
- Using clang at each of -O0, -O1, -O2, and -O3
From your experiments, construct a table containing the (average) running times (in seconds) for each configuration, something like this:
| Benchmark     | baseline | -Oll   | clang -O0 | clang -O1 | clang -O2 | clang -O3 |
|---------------|----------|--------|-----------|-----------|-----------|-----------|
| optimize1.oat | 47.791   | 46.232 | segfault* | 1.536     | 0.003     | 0.003     |
Once you have collected the performance data, you can plot all of the benchmarks normalized to the baseline (i.e. divide the running time of the optimized version by the running time of the baseline). Such normalization lets you compare effects across programs. Be sure to mention any external plots, histograms, or other data visualizations in your README.txt file, and briefly describe what they represent.
Optimizations
As illustrated by the data above, some optimizations can have a tremendous impact on some programs (while others might have a negative impact). Doing a really good job of optimization (e.g. register allocation), while likely to have a very significant impact on the quality of the code, is a very difficult project for the given time frame. You are welcome to try for such an ambitious optimization, but that level of sophistication is not required to get a good score on this project. If you are in doubt about whether your optimization is too ambitious, contact one of the course staff and ask us about it.
You are welcome to use the dataflow analysis framework from HW6 as part of this project, but that is not required. (If you don't want to use your own implementation, contact the course staff and we can provide you a working version of HW6 code.)
Optimization ideas
Ast-level optimizations
Though many of the optimizations from project 6 would be difficult to do at the OAT source level, there are some transformations that are easiest to do directly on the AST:
- Loop Unrolling: when we know a loop will execute exactly N times, it can be replaced by N contiguous copies of the body
- Inlining Leaf Functions: functions that don't contain any calls (leaf functions) can always be inlined, without worrying about heuristics for termination of the optimization
- Algebraic expression simplification
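As a concrete illustration of the first idea, here is a minimal sketch of loop unrolling on a hypothetical, drastically simplified statement type. The real Ast types in ast.ml have many more cases; `For (n, body)` below stands in for a loop that is statically known to execute exactly n times.

```ocaml
(* Hypothetical simplified AST -- not the actual ast.ml types. *)
type stmt =
  | Call of string                (* a statement executed for its effect *)
  | For of int * stmt list        (* a loop with a statically known trip count *)

(* Replace a small known-count loop by n contiguous copies of its body,
   unrolling nested loops in the body first.  The bound (here 8) is a
   heuristic to avoid blowing up the code size. *)
let rec unroll (s : stmt) : stmt list =
  match s with
  | For (n, body) when n <= 8 ->
      List.concat (List.init n (fun _ -> List.concat_map unroll body))
  | s -> [s]

let unroll_block (b : stmt list) : stmt list = List.concat_map unroll b
```

A real pass would also need to substitute the loop index into each unrolled copy of the body; the sketch elides that to show only the duplication step.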
Constant propagation on stack slots
You may have noticed that the optimizations you implemented in project 6 do not significantly change the code emitted by the OAT frontend. One problem is that OAT local variables are compiled to stack slots, and our constant propagation pass only analyzes integer-typed LL locals. Using the alias analysis we developed, we can perform constant propagation on stack slots that have a unique name.
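To make the idea concrete, here is a sketch of constant propagation through non-aliased stack slots on a hypothetical, simplified instruction type (the real Ll types differ). It assumes straight-line code and that every slot mentioned is known not to be aliased:

```ocaml
(* Hypothetical simplified LL fragment -- not the actual ll.ml types. *)
type operand = Const of int64 | Id of string

type insn =
  | Store of operand * string          (* store a value into a stack slot *)
  | Load of string * string            (* %dst = load from a slot *)
  | Bin of string * operand * operand  (* %dst = some binop *)

(* One forward pass: remember the constant most recently stored to each
   slot, delete loads from such slots, and rewrite later uses of the
   deleted load's uid to the constant itself. *)
let const_prop (block : insn list) : insn list =
  let consts = Hashtbl.create 8 in   (* slot -> known stored constant *)
  let subst  = Hashtbl.create 8 in   (* eliminated load dest -> constant *)
  let op = function
    | Id x as o ->
        (match Hashtbl.find_opt subst x with Some c -> Const c | None -> o)
    | c -> c
  in
  List.filter_map
    (fun i ->
      match i with
      | Store (v, slot) ->
          let v = op v in
          (match v with
           | Const c -> Hashtbl.replace consts slot c
           | _ -> Hashtbl.remove consts slot);
          Some (Store (v, slot))
      | Load (dst, slot) ->
          (match Hashtbl.find_opt consts slot with
           | Some c -> Hashtbl.replace subst dst c; None
           | None -> Some i)
      | Bin (dst, a, b) -> Some (Bin (dst, op a, op b)))
    block
```

In the real compiler the "non-aliased" side condition is exactly what the alias analysis from project 6 establishes, and the pass must be killed at calls and at stores through possibly-aliasing pointers.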
Store propagation
Another, more general approach is to propagate stores, even when the stored value is not necessarily a constant. If we track the LL uids stored to and loaded from non-aliased stack slots, we can greatly improve the LL code generated for OAT expressions. Whenever we find a load that must read the same value that was written by a preceding store, we can replace all uses of the uid associated with the load with the argument of the store.
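A minimal sketch of this, again on hypothetical simplified types and assuming straight-line code over non-aliased slots, tracks the last operand stored to each slot and turns each matching load into a pending substitution for the caller to apply to later uses:

```ocaml
(* Hypothetical simplified LL fragment -- not the actual ll.ml types. *)
type operand = Const of int64 | Id of string

type insn =
  | Store of operand * string   (* store a value into a stack slot *)
  | Load of string * string     (* %dst = load from a slot *)

(* Returns the block with redundant loads removed, plus a substitution
   (load-dest uid -> stored operand) to apply to the loads' later uses. *)
let store_prop (block : insn list) : insn list * (string * operand) list =
  let last = Hashtbl.create 8 in        (* slot -> last stored operand *)
  let renames = ref [] in
  let kept =
    List.filter_map
      (fun i ->
        match i with
        | Store (v, slot) -> Hashtbl.replace last slot v; Some i
        | Load (dst, slot) ->
            (match Hashtbl.find_opt last slot with
             | Some v -> renames := (dst, v) :: !renames; None
             | None -> Some i))
      block
  in
  (kept, List.rev !renames)
```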
Branch elimination
Whenever a constant reaches a conditional branch, we can rewrite the terminator to an unconditional branch. This may create both unreachable blocks in the CFG and pairs of blocks that can be merged into a single straight-line piece of code. For this optimization, you will need to update the map of predecessor blocks in the CFG to maintain the invariants of the data structure.
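The two steps (folding constant conditions, then pruning unreachable blocks) might be sketched as follows, on a hypothetical CFG represented as a labeled list of terminators (the real Ll.cfg is richer, and the sketch omits the predecessor-map update mentioned above):

```ocaml
(* Hypothetical simplified CFG -- not the actual ll.ml types. *)
type operand = Const of int64 | Id of string
type term = Cbr of operand * string * string | Br of string | Ret

(* A constant condition makes the branch unconditional. *)
let fold_term (t : term) : term =
  match t with
  | Cbr (Const c, l1, l2) -> Br (if c <> 0L then l1 else l2)
  | t -> t

let succs = function Cbr (_, a, b) -> [a; b] | Br l -> [l] | Ret -> []

(* Fold all terminators, then drop blocks unreachable from the entry. *)
let prune (entry : string) (cfg : (string * term) list) =
  let cfg = List.map (fun (l, t) -> (l, fold_term t)) cfg in
  let reach = Hashtbl.create 8 in
  let rec go l =
    if not (Hashtbl.mem reach l) then begin
      Hashtbl.add reach l ();
      (match List.assoc_opt l cfg with
       | Some t -> List.iter go (succs t)
       | None -> ())
    end
  in
  go entry;
  List.filter (fun (l, _) -> Hashtbl.mem reach l) cfg
```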
Improved alias analysis
The alias analysis you implemented in project 6 is a very conservative approximation of stack slots that are associated with exactly one uid. This prevents us from doing any interesting optimizations with arrays, since the frontend emits GEPs for array accesses. We can improve the precision of the analysis by maintaining a set of uids that may alias a particular stack slot. In this case, bitcast and GEP instructions should add the resulting uid to the set of uids that may alias the original name of the alloca'd stack slot.
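One way to maintain such sets, sketched on a hypothetical simplified instruction type, is to track each uid back to the alloca it derives from and grow that slot's may-alias set whenever a bitcast or GEP produces a new name for it:

```ocaml
(* Hypothetical simplified LL fragment -- not the actual ll.ml types. *)
module S = Set.Make (String)

type insn =
  | Alloca of string             (* %u = alloca ... *)
  | Bitcast of string * string   (* %dst = bitcast ... %src ... *)
  | Gep of string * string       (* %dst = getelementptr ... %src, ... *)

(* slot uid -> set of uids that may point into that slot *)
let may_alias (insns : insn list) : (string, S.t) Hashtbl.t =
  let tbl  = Hashtbl.create 8 in
  let root = Hashtbl.create 8 in  (* uid -> originating slot, if known *)
  List.iter
    (fun i ->
      match i with
      | Alloca u ->
          Hashtbl.replace root u u;
          Hashtbl.replace tbl u (S.singleton u)
      | Bitcast (dst, src) | Gep (dst, src) ->
          (match Hashtbl.find_opt root src with
           | Some slot ->
               Hashtbl.replace root dst slot;
               Hashtbl.replace tbl slot (S.add dst (Hashtbl.find tbl slot))
           | None -> ()))
    insns;
  tbl
```

Uids with no known root (e.g. function arguments) are simply ignored here; a sound analysis would have to treat them as potentially aliasing anything that escapes.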
Devirtualization
You might find it interesting to try to create an optimization that takes advantage of your knowledge of the OAT frontend. The clang backend will have a hard time optimizing OO code, due to the presence of many bitcasts and dynamic dispatch. One OAT-specific optimization is dispatch specialization (i.e. eliminating dynamic dispatch when it's possible to determine statically what LL function will be called for a given method invocation). This can occur when, for example, a class has no sub-classes.
X86-level optimizations
There are a number of simple peephole optimizations that you can do at the assembly level, for example removing redundant moves where the source and destination are the same. If you are extremely ambitious, you can try to improve on the naive approach to code generation in the backend and do some form of register allocation. Definitely contact the TA's with a plan before attempting this.
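The redundant-move example might look like the following sketch, on a hypothetical simplified instruction type rather than the actual x86.ml representation:

```ocaml
(* Hypothetical simplified X86 fragment -- not the actual x86.ml types. *)
type operand = Reg of string | Imm of int64

type ins =
  | Movq of operand * operand
  | Addq of operand * operand

(* Drop movq instructions whose source and destination are the same
   register; everything else passes through unchanged. *)
let rec peephole (prog : ins list) : ins list =
  match prog with
  | Movq (Reg s, Reg d) :: rest when s = d -> peephole rest
  | i :: rest -> i :: peephole rest
  | [] -> []
```

Other easy peepholes in the same style: replacing `movq $0, r` with `xorq r, r`, or collapsing a move immediately followed by a move back.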
Grading
Unlike the previous projects, this project does not have as much automated grading support. Your grade will be determined by:
- Implementation / level of difficulty of the optimization: 40 pts.
- Correctness of your optimization pass(es): 20 pts.
- Thoroughness of your evaluation, experiments, and write-up in README.txt: 30 pts.
- Public benchmark (posted to Piazza): 10 pts.