Homework Submission

Your writeup should follow the writeup guidelines and include your answers to the questions below. Even if an item is just a “step”, please include it in your report and leave the bullet blank for the sake of easy grading.

Note

Note that the last part of this assignment could take longer than the previous parts.

Accelerating the Filter horizontal
1. Create a new Vitis HLS project and add the provided source files. Select the part xczu3eg-sbva484-1-i in the device selection. Use a 150 MHz clock, and select the Vitis Kernel Flow Target for the Flow Target.
2. Does Filter_horizontal offer any opportunity for data reuse? What is the smallest buffer that we can use? (3 lines)
3. What is the optimal order for traversing the input data (column-wise or row-wise)? Assume that the input and output are stored in a BRAM. Motivate your answer. (3 lines)
4. Create a function Filter_horizontal_HW that is a version of Filter_horizontal_SW that you modified based on the insights from the previous two questions. You don’t have to use the streams at this point. Include the code in your report.
5. Pipeline the loop body of Filter_horizontal_HW. Write a testbench to verify Filter_horizontal_HW. Similar to the one we used in HW5, the testbench should compare the results of Filter_horizontal_SW and Filter_horizontal_HW and exit with a value of 1 if the output is not correct. If the output is correct, the testbench can simply print out “TEST PASSED”. The inputs to the functions can be arbitrary values. Verify that your test function works. Include the testbench in your report. What is the latency (in cycles) that Vitis HLS predicts? (1 line)
  
  Note
  
  Make sure you’ve selected the correct top function for the synthesis. Also check that you are not forcing another function as the top function in your constraints file like directives.tcl. You can create multiple Solutions in Vitis HLS for convenience.
  
  Note
  
  Remember that malloc() is not synthesizable. You can use a user-defined macro to separate simulation code from synthesis code, as shown in the HLS user guide.
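
The testbench and macro notes above can be sketched as follows. This is a minimal illustration under stated assumptions, not the provided code: Reference_SW and Candidate_HW are stand-ins for Filter_horizontal_SW and Filter_horizontal_HW, the image size and the unweighted 3-tap filter are made up, and SIM_ONLY_HEAP is a hypothetical user-defined macro name.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical sizes; the real ones come from the provided source files.
constexpr int WIDTH = 16, HEIGHT = 8, TAPS = 3;
constexpr int OUT_WIDTH = WIDTH - (TAPS - 1);

// Stand-in for Filter_horizontal_SW: unweighted 3-tap horizontal average.
void Reference_SW(const unsigned char *in, unsigned char *out) {
  for (int y = 0; y < HEIGHT; y++)
    for (int x = 0; x < OUT_WIDTH; x++) {
      unsigned sum = 0;
      for (int i = 0; i < TAPS; i++)
        sum += in[y * WIDTH + x + i];
      out[y * OUT_WIDTH + x] = static_cast<unsigned char>(sum / TAPS);
    }
}

// Stand-in for Filter_horizontal_HW. SIM_ONLY_HEAP is a hypothetical
// user-defined macro: define it only for software simulation so that the
// malloc() path never reaches synthesis.
void Candidate_HW(const unsigned char *in, unsigned char *out) {
#ifdef SIM_ONLY_HEAP
  unsigned char *row = static_cast<unsigned char *>(std::malloc(WIDTH));
  if (row == nullptr) std::exit(1);  // allocation failed
#else
  unsigned char row[WIDTH];  // fixed-size array, the synthesizable path
#endif
  for (int y = 0; y < HEIGHT; y++) {
    for (int x = 0; x < WIDTH; x++)
      row[x] = in[y * WIDTH + x];  // copy one input row
    for (int x = 0; x < OUT_WIDTH; x++) {
      unsigned sum = row[x] + row[x + 1] + row[x + 2];
      out[y * OUT_WIDTH + x] = static_cast<unsigned char>(sum / TAPS);
    }
  }
#ifdef SIM_ONLY_HEAP
  std::free(row);
#endif
}

// Returns 0 when both outputs match and 1 otherwise.
int run_testbench() {
  std::vector<unsigned char> in(WIDTH * HEIGHT);
  std::vector<unsigned char> out_sw(OUT_WIDTH * HEIGHT);
  std::vector<unsigned char> out_hw(OUT_WIDTH * HEIGHT);
  std::srand(0);  // fixed seed so every run uses the same arbitrary input
  for (auto &p : in)
    p = static_cast<unsigned char>(std::rand() & 0xFF);
  Reference_SW(in.data(), out_sw.data());
  Candidate_HW(in.data(), out_hw.data());
  for (std::size_t i = 0; i < out_sw.size(); i++)
    if (out_sw[i] != out_hw[i]) {
      std::printf("Mismatch at %zu: SW=%d HW=%d\n", i, out_sw[i], out_hw[i]);
      return 1;
    }
  std::printf("TEST PASSED\n");
  return 0;
}
```

In an actual testbench, main would simply return the value of run_testbench(). One way to check that the testbench itself works is to temporarily corrupt one output value and confirm that the return value becomes 1.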
Accelerating the Filter vertical
1. Does Filter_vertical offer any opportunity for data reuse? What is the smallest buffer that we can use? (3 lines)
2. What is the optimal order for traversing the input data (column-wise or row-wise) with respect to FPGA on-chip memory usage? Assume that the input and output data are stored in a BRAM. Motivate your answer. (3 lines)
3. Create a function Filter_vertical_HW that is a version of Filter_vertical_SW that you modified based on the insights from the previous two questions. You don’t have to use the streams yet. Include the code in your report.
4. Pipeline the loop body of Filter_vertical_HW. Write a testbench to verify Filter_vertical_HW. What is the latency (in cycles) that Vitis HLS predicts? (1 line)
hls::stream
1. Write a verification function for Filter_HW. Verify that your test function works. Include the test function in your report.
2. Create a function Filter_HW that connects both parts of the filter together. Store the intermediate results in a local array. Include Filter_HW in your report. Use the default data movers. Also include the testbench’s output in your report. What is the expected latency (in cycles) of Filter_HW?
3. We could replace the local array in Filter_HW with a stream. Assume that the stream requires no resources for buffering. What impact do you expect that will have on the resource consumption? Quantify your answer. (3 lines)
4. Replace the local array with an hls::stream object and insert a dataflow pragma into Filter_HW. The hls::stream class is declared in hls_stream.h. Modify the remaining functions as necessary. Include Filter_HW and any other significant changes in your report.
  
  Hint
  
  We are concerned with streaming now, and that could merit a reconsideration of how we traverse the data.
5. What is the latency of Filter_HW that Vitis HLS predicts? Make sure you verify your code. (1 line)
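
The hand-off through a stream can be mimicked in plain C++ as sketched below. Everything here is an assumption made for illustration: FifoStandIn is a stand-in that only mirrors the read(), write(), and empty() calls of hls::stream so the sketch compiles without the Vitis HLS headers, and Stage_a, Stage_b, Top, and N are hypothetical names unrelated to the actual filter.

```cpp
#include <queue>
#include <vector>

// Stand-in for hls::stream<T> (assumption: only read/write/empty are mimicked).
// In Vitis HLS you would include hls_stream.h and use hls::stream<T> instead.
template <typename T>
class FifoStandIn {
 public:
  void write(const T &v) { q_.push(v); }
  T read() {
    T v = q_.front();
    q_.pop();
    return v;
  }
  bool empty() const { return q_.empty(); }

 private:
  std::queue<T> q_;
};

constexpr int N = 8;  // hypothetical element count

// Producer stage: doubles each input and pushes it into the FIFO.
void Stage_a(const int *in, FifoStandIn<int> &mid) {
  for (int i = 0; i < N; i++)
    mid.write(in[i] * 2);
}

// Consumer stage: pops values in exactly the order they were written.
void Stage_b(FifoStandIn<int> &mid, int *out) {
  for (int i = 0; i < N; i++)
    out[i] = mid.read() + 1;
}

// Top level: the FIFO replaces what would otherwise be a local array.
void Top(const int *in, int *out) {
  // In Vitis HLS, "#pragma HLS dataflow" would go here so the stages overlap.
  FifoStandIn<int> mid;
  Stage_a(in, mid);
  Stage_b(mid, out);
}

// Runs the two stages on the inputs 0..N-1 and returns the outputs.
std::vector<int> run_stream_demo() {
  std::vector<int> in(N), out(N);
  for (int i = 0; i < N; i++)
    in[i] = i;
  Top(in.data(), out.data());
  return out;
}
```

Because a FIFO has no random access, the consumer can only see values in the order the producer wrote them, which is why the traversal order of both stages has to agree.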
Moving on to HW
1. Partition Filter_HW into a Load-Compute-Store pattern as we did in HW6 (Partition the Code into a Load-Compute-Store pattern). Verify the code and include the final code in the report.
2. Export your Filter_HW as a .xo file and build the .xclbin file as we did in HW5. Create host code that includes the other functions (scale, differentiate, and compress) so that they run on the ARM core, and run the Filter function on the FPGA. Use the same Input.bin as input data and Golden.bin from HW3 to verify the output. Use -O2 as the optimization level when compiling the host code. Include the host code in the report.
  
  Note
  
  Refer to the Makefile and the host code we used for the previous HWs. Collect the data before transferring it to the Filter kernel, and collect the data back after the kernel computation to feed into the next stage, compress. You will want to enable an out-of-order queue to overlap communication and computation.
  
  Note
  
  Don’t worry too much about performance for now. In this question, we just want you to integrate the HW kernel with the rest of the application running on the CPU.
3. Report the application latency to process 200 frames. Compare it with the baseline application latency from HW3 (1 line).
4. How can you run other stages on the processor concurrently with the Filter kernel on FPGA? What is the speedup you expect to achieve?
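
For the 200-frame application latency asked for above, one possible way to take the measurement with the C++ standard library is sketched below. It is only an illustration: Process_frame is a hypothetical stand-in for the per-frame work of the real application, and the timing utilities from the earlier homeworks may be a better fit if you already have them.

```cpp
#include <chrono>

constexpr int FRAMES = 200;  // frame count from the question above

// Hypothetical stand-in for the per-frame work of the real application.
unsigned Process_frame(unsigned frame) {
  unsigned acc = frame;
  for (int i = 0; i < 1000; i++)
    acc = acc * 1664525u + 1013904223u;  // arbitrary busy work
  return acc;
}

// Times FRAMES calls to Process_frame and returns the elapsed milliseconds.
// The checksum is written out so the compiler cannot discard the work.
double Time_frames_ms(unsigned *checksum) {
  auto start = std::chrono::steady_clock::now();
  unsigned acc = 0;
  for (int f = 0; f < FRAMES; f++)
    acc += Process_frame(static_cast<unsigned>(f));
  auto end = std::chrono::steady_clock::now();
  *checksum = acc;
  return std::chrono::duration<double, std::milli>(end - start).count();
}
```

steady_clock is used because it is monotonic, so the measured interval is not affected by adjustments to the system time.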

Deliverables

In summary, upload the following to their respective links in Canvas:

Writeup in PDF.

ESE5320 Handouts Fall 2022
