Introduction
We highly suggest you read most, if not all, of the assignment specification before getting started.
In this assignment you are going to write a small compiler that will transform code written in a new stack-oriented language, called J, into RISC-V assembly code much the way that the clang compiler converter converts C code into assembly.
You are NOT supposed to simulate the RISC-V output. The compiler should only generate the corresponding RISC-V assembly into a file. Your compiler will be designed to use a similar calling convention that RISC-V uses so that your code can call functions compiled by RISC-V and vice versa.
The name of the executable you should generate should be called jc
, which is short for J Compiler
.
This program will act much like clang does, when provided with a properly formatted source file in the J language it will produce the corresponding assembly code.
That is, if we were to run
$ ./jc source.j
in the terminal, then this will produce a new file called source.s
if source.j
contains an acceptable j program, otherwise some helpful error messages should be printed instead.
This assignment will be autograded using Gradescope.
Collaboration
For assignments in CIS 2400, you will complete each of them on your own or solo. However, you may discuss high-level ideas with other students, but any viewing, sharing, copying, or dictating of code is forbidden. If you are worried about whether something violates academic integrity, please post on Ed or contact the instructor.
Setup
The only file we are giving you for this assignment is the following .h
file. To download it, you should open the terminal (which should be an application on the right hand side of the desktop) and use the provided command:
curl -o token.h https://www.seas.upenn.edu/~cis2400/current/projects/code/token.h
Note that this assignment only provides a single .h
file and no other files.
You will have to modify this file and create other files (including a Makefile
to finish this assignment).
More details on how to structure your code is in the relevant section below.
There are also separate files that can be used for testing your program. These are mentioned in the testing section below.
The J Language Overview
The J language is a stack-oriented language loosely inspired by Forth. You may want to look at the following Wikipedia article for a quick description of stack oriented languages, but this is optional: http://en.wikipedia.org/wiki/Stack-oriented_programming_language.
In a stack oriented language all operands are passed on the stack and most operations end up manipulating the stack. Here are a few quick examples of J code.
Example Program 1
7 5 4 + *
This program starts by pushing the values 7, 5 and then 4 onto the stack (stack state: 7 5 4 – note that the rightmost element in this list, 4, is viewed as being at the top of the stack). Then the program pops the top two elements (5 and 4) adds them and pushes it onto the top of the stack (stack state: 7 9). The last thing this program does is pop the top two values of the top of the stack, multiplies them and pushes the result onto the top of the stack (stack state: 63). This type of operational syntax is sometimes referred to as Reverse Polish Notation (RPN) and you see it used on many revered HP calculators. You may be familiar with this format from the previous homework where we asked you to implement an rpn calculator.
Example Program 2
This next example introduces comparison operators, if statements, and comments. This example is a bit more complex, so if it doesn’t make sense right now, that is fine. Once you have read the full description of the J language though, you should be sure you understand what is going on in this example.
5 7 gt if 12 else 25 endif ; this is a comment
This program starts by pushing the values 5 and then 7 onto the stack (stack state: 5 7).
The next token in the program gt
will compare the two topmost elements on the stack with the top most element being the first operand and the second top most element being the second operand, resulting in doing 7 > 5
in this example.
As part of doing this operation gt
will pop the two elements being compared off of the stack and then push the result of the comparison onto the top of the stack, with the value 1
being used to represent true and 0
for false, so after gt
in our example program the stack will just contain 1 (stack state: 1).
The next token in the program is an if
token, indicating the start of an if
statement like seen in other programming languages. The if
statement will start by popping the top element of the stack and checking if that value is zero or non-zero. If it is non-zero, the code in-between the if
and its corresponding else
will be executed. In this case, the top value is non-zero so the code pushes 12
onto the top of the stack (stack state: 12).
After pushing 12, we will ignore the else branch and start executing code after the endif
which marks the end of the if statement. The only thing after the endif
though is a ;
which marks the start of a comment. Once a ;
has been read, the rest of the line can be interpreted as a comment and won’t be used for executing the program.
If the top value of the stack had instead been zero when the if
token was reached, the code between the else and the endif would have been executed and 25 would have been pushed instead of 12.
Token Structure
As you can see from the examples, a J source file can be thought of as a sequence of tokens separated by whitespace characters (spaces, tabs, newlines, etc.). Tokens can denote literal values or operations. There are no parentheses or square brackets or curly braces to denote program structure like there are in C, Java or other programming languages. There are just tokens and comments in J. You will want to write your program such that it reads tokens sequentially from a source file and generate the corresponding assembly for each token as you read it. More details on parsing the file in the Code Structure Section below.
Operands
For simplicity, all operands in the J language will be 32-bits. There are no other data types.
Literals
Tokens consisting entirely of digits optionally preceded by a minus sign are interpreted as decimal numbers and your compiler should output RISC-V assembly code to push these values onto the stack in the order they are encountered.
Additionally, you should implement support for hexadecimal literals which will begin with 0x
followed by a sequence of hexadecimal characters that could be either upper case or lower case (e.g. 0xD6DEF00D
or 0xd6def00d
).
You should read these hex values and then push them on the stack just as you would a decimal literal.
You should assume that the decimal number can be any number that can be represented in 32-bit 2C and that the hexadecimal numbers are at most 8 hexadecimal digits (e.g. at most 32-bits).
Hint: You will have to be careful about how you do this with the RISC-V code your compiler generates since there are limitations on the size of integer immediates in RISC-V.
Arithmetic Operations
The J language has tokens that denote the basic arithmetic operations provided by the RISC-V assembly language.
J Operator | Corresponding RISC-V Operation |
+ | `ADD` |
- | `SUB` |
* | `MUL` |
/ | `DIV` |
% | `REM` |
In each of these cases, the top value of the stack is popped off and used as the first operand (stored in rs1 in risc-v) and the next element on the stack is popped off and used as the second operand (stored in rs2 in riscv).
Those two values are then used to perform the specified operation and the result is pushed onto the stack.
Please be mindful of the ordering of operands since this can affect the output of some operators (e.g. /
, -
, and %
).
Comparison Operators
Let a
denote the element at the top of the stack and b
denote the next element.
There are 5 comparison operators in the J language and they are listed below.
J Operator | Meaning |
lt | `a < b` |
le | `a <= b` |
eq | `a == b` |
ge | `a >= b` |
gt | `a > b` |
For these operators, the top two values of the stack are popped off the stack and the result of the comparison is pushed onto the top of the stack.
Note that the result of this comparison should either be a 1
indicating the result is true or a 0
indicating the result is false.
Also note that the values a
and b
are viewed as 2C values for the purposes of these comparison.
Logical Operations
The J language supports the 3 basic binary Boolean operations through the tokens and
, or
and not
.
The and
and the or
operations pop the two top values on the top of the stack, compute their respective operations in a bitwise manner using the corresponding instructions and then place the result back on the top of the stack.
The not
operation does much the same thing, but it only consumes the top element of the stack.
Stack Operations
Stack oriented languages always have operations to manipulate the state of the stack. The J language provides many of the basic stack operations offered by Forth and these are listed below.
We will also provide the result of running each operator on a stack that starts in the state 5 7 -6 3
J Operator | Meaning | Result of Running on Sample Stack |
`drop` | Drops the top element of the stack | `5 7 -6` |
`dup` | duplicates the top element of the stack | `5 7 -6 3 3` |
`swap` | swaps the top two entries of the stack | `5 7 3 -6` |
`rot` | rotates the top 3 elements of the stack | `5 -6 3 7` |
Lastly, there are two stack operation set_argN
and get_argN
associated with getting the arguments to a function. Functions will be setup in a similar way they are done in RISC-V with clang
as covered in lecture. More on functions in the functions section below.
set_argN
pops the top value off of the stack and stores it in the appropriate register to indicate that value is being used as an argument to a function.
For example, set_arg1
pops the top element of the stack and stores that value into the first argument register a0
.
set_arg2
will similarly pop the top element of the stack but it will stores its value in a1
.
This pattern repeats for set_arg3
and so on.
get_argN
copies an entry from the corresponding register and pushes that argument’s value onto the top of the stack.
For example, get_arg1
copies the first argument from a0
and pushes it to the top of the stack.
get_arg2
will copy the argument from a1
and push it onto the stack.
This pattern repeats for get_arg3
and so on.
The value N can range from 1 to 8 ie get_arg1
, get_arg2
, get_arg3
, … get_arg7
, get_arg8
are all legal commands.
If Statements
The J language supports if .. else .. endif
constructs as mentioned earlier.
As in other languages the else clause is optional. The syntax is as follows:
if <block A> else <block B> endif
When the if statement is executed, the top element of the stack should be popped off and checked to see if it is zero or non-zero. If the top element was non-zero, i.e. true, then the statements in block A are executed. If the top element was zero, i.e. false, then the statements in block B are executed.
If you take a look at how the RISC-V assembly is generated for this by clang
when compiling an if
statement in C, you will see that the if statement is compiled using some variant of the BEQ
instructions to do the comparison and conditionally branch to the appropriate unique labels.
In your J compiler, you probably want to follow a similar structure to the code shown in lecture and generated by clang
.
Also with J, you need to keep in mind that there are multiple if statements and your compiler must be sure to generate unique labels in the assembly code for each one.
One way to keep unique labels is to keep a count of the number of if tokens that you have encountered as you parse the file from start to finish.
Your compiler will also need to be able to handle nested if statements. This is a rather complex task to manage, so here are hints for two possible ways to use this:
- Use recursion. If an IF token is identified, generate the ASM necessary for the start of this IF and then recursively call a function to generate the body until the corresponding ELSE and/or ENDIF tokens are found.
- Maintain some data structure that supports LIFO (e.g. a Stack or Deque data structure) to keep track how “nested” the compiler is when handling tokens. When you see an IF token, push data onto the structure and when your compiler sees an ENDIF token, pop data. What exact data you store and data structure you use is up to you. Do keep in mind that the Stack being referred to in this hint is a data structure and separate from the stack in memory used to keep track of variables and used in J.
While Loops
The J language supports while loops. To support the while loop, there are two tokens: WHILE and ENDWHILE. The syntax is as follows:
while <block> endwhile
When the while statement is first executed, the top element of the stack should be popped off and checked to see if it is zero or non-zero. If the top element was non-zero, i.e. true, then the statements in the block between WHILE and ENDWHILE are executed. Once the block has finished executing and ENDWHILE is reached, the code should jump back to the start of the loop and execute the code for the WHILE token again (e.g. pop the top value of the stack again and check to see if it is zero or non-zero). If the top element was zero, i.e. false, then the program should start executing the next token after ENDWHILE.
Similarly to the If statements mentioned in the previous section, your compiler will need to generate unique labels for each while loop encountered in the program and need to handle nested structures. It is possible for if statements to show up inside a while loop and while loops to show up in an if statement. It is also possible for a while loop to have while loops inside of it. The tactics used for handling unique labels and nested control structures mentioned in the If Statments section above will likely also be useful for your code to generate while loops.
Comments
As we mentioned earlier, we will use comments in the J code to document the design. If you encounter a semi-colon in the file all content between that point and the end of the line is regarded as a comment and is ignored by the compiler. For example:
1 2 3 4 + - ; This is a comment - everything up to the end of the line
Functions: TODO DOUBLE CHECK
Like C, J is intended to operate as a procedural language which means that all of the code needs to be packaged in functions which will be turned into blocks of code just the way clang does. Function definitions are a bit simpler in J since we can assume that the calling context has put any required arguments into the appropriate registgers before calling the function, hence there is no need for an argument list.
Function definitions are started with the defun token as illustrated in the following example:
defun square
get_arg1
dup *
return
In the above example, the first line has the defun
token which indicates the start of a function and then an identifier token to name the newly defined function, which in this case, will result in the function being called square
.
The next line of the function just has get_arg1
which will fetch the first (and only) input argument to the function from a0
and then push the value of that argument onto the stack.
The third line consists of two tokens, dup
and *
which will duplicate the argument fetched in the previous line and then multiply that argument with its duplicate, pushing the result onto the stack.
The last line consists of the return
token which will copy the top value of the stack (which is the result of the multiplication operation) and use that as the return value of the function and return from the function.
If it helps to understand what is going on in this function, you can almost think of the code snippet above as being similar to the following C code:
int square(int arg1) {
int n = arg1;
int res = n * n;
return res;
}
Once the function has been defined, we can invoke it by just using a token that is the name of the function. For example:
3 4 set_arg1 square
The code above pushes the value 3 and then the value 4 onto the stack. The 4 value will then get popped and set to be used as the first argument to the square
function.
Then, the function sqaure
is called and returns the value 16, which should be used as the new top of the stack.
Assuming that the stack was empty before the above lines of code were executed, the stack should contain the value 3 at the botton and 16 at the top after the above code is executed.
Function names must start with an alphabetical character and can consist of alphabetical characters, underscores ‘_’ and digits only.
These identifiers are case sensitive.
Function names can also not used reserved strings like gt
or defun
which already serve other purposes in the J language.
The return
token is used to copy the value at the top of the stack into the return value slot of the current stack frame.
It also invokes the function epilogue so that the stack and PC are restored so that the calling function can resume execution.
This epilogue should follow the same conventions as the one used by clang as discussed in class.
Note that every function should end with a return
statement at some point.
Functions are invoked with a token that is the name of the function to invoke as shown above. Once again you are expected to use exactly the same calling convention that the clang-15 compiler does where x8 (fp) acts as the frame pointer and x2 (sp) acts as the stack pointer. The only small difference is that when you generate code to call a subroutine you should ensure that once the function has returned the return value of that function is included at the top of the stack after the J program has called that funnction. This is different from clang which just stored it in a register. This is easily accomplished after the subroutine has returned. Remember that it is the calling functions job to clean up or use the return values as it sees fit once the subroutine has returned. In the J language we will need to handle this explicitly in the code – that is the jc compiler should not automatically generate code to pop the return value.
Note that since we are adopting the clang calling convention we should be able to call functions written in C and compiled with the clang compiler from within our J programs and vice versa. This will be a useful way to add functionality to our system
Instructions
For this assignment, you will be writing a compiler that will compile (i.e. translate) a J source file into an RISC-v ASM file.
This means that the code you write is sort of “writing” RISC-V assembly.
Note that your code should not actually execute any of the RISC-V assembly, simply output the RISC-V to an output file that as
will assembly into an object file and penn-sim
will run.
This project is technically split into two assignments HW10 and HW11, this is because this project can take a lot of time and you can think of HW10 as being a “checkpoint” for HW11. In other words, both HW10 and HW11 use the same autograder, and HW10 and it’s deadline is to give you some guidance on where you should start and how to pace yourself.
For HW10, you will have to implement:
- Token Reading
- ASM generation for literals, arithmetic operations and bitwise operations
For HW11, you will have to finish the code by implementing ASM generation for:
- Function Calls
- Function Definitions & Returns
- While loops
- If statements
- Comparisons
- Stack operations (dup, rot, get_argN, set_argN, etc)
This includes handling nested control structures and unique labels for any jumps/branches
Suggested Ordering
We highly suggest you read most, if not all, of the assignment specification before getting started.
In order to give you a couple of ideas about how to get started in thinking about and implementing this program we have provided you with a header file called token.h
.
This file defines a type called token
which is intended to represent the kinds of tokens you will encounter in a J program.
It also declares a function called next_token
which reads the next available token from the file.
It should return true
if it was successful and false
if it encountered an error.
More details on the token
struct and the next_token
function can be found in token.h
.
You should probably start with implementing the next_token
and any helper functions necessary for it in a new file token.c
.
After you do, you should create a jc.c
file with a main function that can be compiled into an executable jc
and a makefile
that can compile jc.c
into jc
and token.c
into token.o
.
Once you have done this, you should be able to upload all 4 files to the gradescope autograder to test your next_token
implementation.
Optionally, you can implement the optional function specified in token.h
called print_token
to be used for debugging your code locally and have your main function in jc.c
open a file, read it with next_token
and then print out the token with print_token
.
Once you have next_token
working, you will have to start implementing ASM generation based on the tokens that are read.
Note that once you get next_token
working and start working on other aspects of the project you will probably have to modify your makefile
and jc.c
as you add more functionality.
You should be able to compile your code into the executable called jc
and run it like shown below:
$ ./jc source.j
Doing this should produce a new file called source.s
if source.j
contains an acceptable j program, otherwise some helpful error messages should be printed instead.
You should modify jc.c
’s main function to start supporting this.
The way the compiler proceeds is that it repeatedly reads tokens from the file and for each token it generates the requisite assembly. You can look at what the clang compiler does with your C files for inspiration. Here is the pseudo code for body of the compiler – note how short it is, at least conceptually.
while next_token():
generate_asm()
The hard part is that you have to design your own version of the function that generates assembly, including the parameters it takes, and how it handles all of the different tokens.
Note that this pseudo code also left out handling the command line arguments to the jc
program, handling potential errors, and other details your code will need to handle.
When starting to write code that generates assembly based on tokens, we suggest you start with the LITERAL tokens, the arithmetic tokens, and the bitwise logical operation tokens. Once you those implemented try to run your program on ASM files that only have those token types and see if your ASM output is correct. You should be able to see if your code passes a few tests on gradescope.
Once you have LITERAL tokens, arithmetic tokens, and bitwise operator tokens working, you should incrementally add support for more tokens until you have supported everything.
Error Handling
Your program should be able to handle any well-formed J program so that’s what you should focus on. We will not be doing a lot of testing with malformed J programs so we are not expecting extensive error handling to catch every possible problem. That being said you probably want to do some basic error handling like checking for illegal tokens that you can’t process. Similarly, you may want to check for missing return statements and if statements that aren’t properly delimited with endifs. These error checks aren’t required but they have greatly helped in the past with debugging code, so we encourage you to add more error checks if you can do so easily.
Code Structure
Parsing
As part of implementing next_token
, you will have to parse the tokens that are contained in the J file, which is stored as a text file.
It may help to go to the testing section and download the sample J files to better understand the layout of a J source file.
Your next_token
function will want to read the next token available in the file (recall that the FILE*
maintains an internal position to the file, and reading from the file advances that location by the amount read).
For the sake of simplicity, you can assume that no token will contain no more than 250 characters.
Remember that you need to handle comments as specified in the comments section above.
Your next_token
code will have to handle the fact that any number of whitespace character can separate tokens.
Note that whitespace can include more than the space character, it can include new line characters, tab characters, and various others that are used in the sample J programs to format the code to be more readable for humans.
You may fund various functions in <string.h>
and <ctype.h>
useful.
Some of these functions are listed below.
Generating the ASM
As a hint to your ASM generation, recall that J is a stack-based language and involves a lot of manipulating of things on this stack.
The RISC-V asm that your compiler outputs should probably manipulate the stack portion of memory like is done in lecture with the topic of “register spilling” and “temporary data” to simulate what is going on in J.
You will likely want to use fprintf()
to print the RISC-V code to the output file.
File Structure
As part of this assignment, we are requiring you to split your code up into multiple files so that you master the process of building programs and writing Makefiles. At minimum your code should use:
-
token.c
that contains the routines you need for reading the tokens sequentially from the file and parsing them intotoken
structus. You may want to have helper functions for parsing and functions for debugging (likeprint_token
) but how you get thenext_token
function working and any other utilities you provide is up to you. -
jc.c
that contains yourmain()
function which should parse command line arguments and functions to read the J file and write the asm file.
You should at least use one header file token.h
.
You can use more files if you want to, but you cannot use fewer.
We highly suggest that you create another c file called asm_gen.c
, compiler.c
or some other similar name that will contain all the functions to generate RISC-V assembly based on J tokens.
Most students end up creating an additional .c and .h file as it helps a lot with organizing their code.
Your final executable should make use of some of the functions placed in token.c
and any other .c files you create .
Makefile
You must also write and include a Makefile named makefile
that builds your program from the source components.
Failure to include a working Makefile will result in all tests failing.
The executable that your Makefile produces must be named jc
so typing make jc
at the command line should make the final executable.
Your Makefile should build intermediate object files for each .c
file instead of just building the program all at once and rebuild targets accordingly when their source .h
and .c
files are updated.
Your Makefile should also contain the phony target clean
so that when you type in make clean
it removes all object files and the jc
executable (and nothing else, be sure to not accidentally delete your .c
or .h
files).
Your Makefile should also compile using the clang-15
compiler and use the -Wall
flag at each step to enable all warnings.
If you want to use gdb
or valgrind
to test your code, you should also compile with the -g
flag so that debugging information that is used by these programs are stored in the compilation outputs.
The autograder will be testing to make sure that your makefile builds the program as described above.
Extra Credit Challenge
Coming Soon!
Compiling and Testing
To compile your code for this assignment, you will have to create your own Makefile (as described above). We suggest looking at the makefile you did in HW08 or shown in lecture slides for a starting point on how you should create your own.
Testing
This section is still a work in programs for testing your code. Once you are able to generate assembly, you will need to download new versions of pennsim and a program called as. These will be released soon.
The sample j files may also be updated, but we will let you know if that happens.
We provide some .j files and their corresponding script files to compile it and run it in PennSim that you can use to test your program as a whole. To utilizes these test cases, you should download the zip in the terminal with:
$ wget https://www.seas.upenn.edu/~cis2400/current/projects/code/j_tests.zip
and then unzip the download with:
$ unzip j_tests.zip
This wil give you various .j
files (e.g. simple1.j
).
The .j
files are example inputs into your program.
To test your code, you will need to compile the j file using your jc
program, assembling the resuling assembly with as
and then using penn-sim to run your program. We have provided the sample script script.txt
which should work for any j tests.
Here is an example of doing this with simple1.j
$ ./jc simple1.j
$ ./as simple1.o simple1.s cstdlib.s
$ ./penn-sim simple1.o
>> script script.txt
For the testing your token parsing, we highly recommend that you implement the print_token
function described in token.h
and then write a small program that will read a j file and print the tokens to an output file.
After running the program, check that it works for all of the supplied J programs
To test your RISC-V assembly generation, you should check that running the code output by your jc
compiler on PennSim has the expected behaviour based on the initial .j
file.
Compiler Warnings
Sometimes when you compile a C program it will issue warnings. These are the compilers way of telling you that your code is not completely clear, in order to compile your program, the compiler had to make some guesses about what you intended which may or may not have been correct. A lot of people figure that if they don’t see an error everything must be fine but that is not a good way to program.
For this assignment you are required to compile all of your code with the –Wall
option which turns on all warnings.
Furthermore, if we run your makefile and we see any compiler warnings we will be deducting points.
It is your job to make sure that all of the warnings and errors are dealt with before you submit your code.
Coding Environment Differences
Note that there are several subtle and annoying differences between C compilers on different machines.
For this assignment you are expected to use the clang-15
compiler in the VM we provided.
The TAs cannot and will not be responsible for getting code to run on the wide variety of platforms and compilers in use today.
More specifically the TAs will not be responsible for answering questions of the form, “how do I get <fill in the blank> to run on Windows/Mac/etc.”.
Because of the differences between compiler implementations and C libraries on different operating systems getting something to compile and run on one system does not necessarily guarantee that it will work on a different machine.
You should plan on making absolutely sure that your program will compile and run correctly on VM, which is the same environment we will be testing your code on.
The safest way to do that is to develop on that platform.
One annoying thing you may need to deal with is that char
is not garunteed to be signed or unsigned and the special character EOF
is represented by -1
.
If you are trying to compare a character to the special value EOF
or to any negative value, you should declare your character as a signed char
Valgrind
We will also test your submission on whether there are any memory errors or memory leaks.
We will be using valgrind to do this.
To run it yourself, you should try running: valgrind --leak-check=full ./jc simple.j
.
Note that you should try this on other j files as well.
If everything is correct, you should see the following towards the bottom of the output:
==1620== All heap blocks were freed -- no leaks are possible
==1620==
==1620== For counts of detected and suppressed errors, rerun with: -v
==1620== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
If you do not see something similar to the above in your output, valgrind will have printed out details about where the errors and memory leaks occurred.
Standard C libraries
In order to program effectively in C you will want to begin to familiarize yourself with the Standard C Libraries. You can find a useful reference to them many places online, though ones that we have liked include:
These utilities are packaged into collections of functions that you can call from your code. In order to avail yourself of these routines you should #include the relevant header files at the beginning of your program like so:
#include <stdio.h>
#include <ctype.h>
Here are some standard C library routines that you may want to look at for this assignment:
-
<stdio.h>
:printf
,fopen
,fclose
,fprintf
,fgetc
,fgets
,fread
,fwrite
,sscanf
-
<stdlib.h>
:malloc
,free
,exit
,EXIT_SUCCESS
,EXIT_FAILURE
-
<ctype.h>
:isspace
,isalpha
,toupper
,tolower
,isalnum
The list is only suggestive not comprehensive, and feel free to use other functions that you find in the standard libraries.
Submission
Please submit your completed code files to Gradescope