3. Homework Submission¶
Your writeup should follow the writeup guidelines. Your writeup should include your answers to the following questions:
Teamwork
As the difficulty of homework is ramping up, we encourage you to spend a moment planning on how to tackle the homework as a team.
Describe which tasks of this homework you will perform, which tasks will be performed by your teammate(s), and which tasks you will perform together (e.g., pair programming, where you both sit together at the same terminal). Motivate your task distribution. (5 lines)
Give an estimate of the duration of each of the tasks. (5 lines)
Record the actual time spent on tasks as you work through the assignment.
Explain how you will make sure that the lessons and knowledge gained from the exercises are shared with everybody in the team. (3 lines)
Compiler Optimizations
Before we dive into the vector optimizations, we will investigate the effects of different levels of compiler optimizations.
¶ Optimization level
Latency (ns)
Code size (bytes)
-O0
-O1
-O2
-O3
-Os
Measure the latency and size of the
baseline
target at the different optimization levels. Put your measurements in a table like Table 3.3. You can change the optimization level by editing theCXXFLAGS
in the hw4 Makefile.Include the assembly code of the innermost loop of
Filter_horizontal
at optimization level-O0
in your report. Use the following command to get the assembly and then look forFilter_horizontal
inFilter_O1.s
:g++ -S -O0 -mcpu=native -fno-tree-vectorize Filter.cpp -o /dev/stdout | c++filt > Filter_O1.s
Note
-fno-tree-vectorize
disables automatic vectorization. We will look at automatic vectorization in the next section.Include the assembly code of the innermost loop of
Filter_horizontal
at optimization level-O2
in your report.Based on the machine code of questions 2b and 2c, explain the most important difference between the
-O0
and-O2
versions. (2 lines)Hint
Leading questions:
for each case (
-O0
,-O2
), how many times does the loop read the variable i?for each case (
-O0
,-O2
), how many times does the loop read and write the variable Sum?why is the
-O2
loop able to avoid recalculatingY*INPUT_WIDTH+X
inside the loop body?what else is the
-O2
loop able to avoid reading from memory or recaculating?how is the
-O2
loop able to perform fewer operations?
Why would you want to use optimization level
-O0
? (3 lines)Hint
Compile the code with
-O3
and track the values of the variablesX
,Y
, andi
as you step throughFilter_horizontal
.Include the assembly code of the innermost loop of
Filter_horizontal
at optimization level-O3
in your report.Based on the machine code of questions 2c and 2f, explain the most important difference between the
-O2
and-O3
versions. (1 line)What are two drawbacks of using a higher optimization level? (5 lines)
Automatic Vectorization
The easiest way to take advantage of vector instructions is by using the automatic vectorization feature of the GCC compiler, which automatically generates NEON instructions from loops. Automatic vectorization in GCC is sparsely documented in the GCC documentation. Although we are not using the ARM compiler, the ARM compiler user guide may give some more insight on how to style your code for auto vectorization. This talk on GCC vectorization may also be useful.
Vectorization Speedup Summary
¶ Baseline
Baseline with SIMD
Baseline with SIMD Modified
Latency (ns)
Suitability (Y/N)
Ideal Vectorization Speedup
Latency (ns)
Speedup
Latency (ns)
Speedup
Scale
Filter_horizontal
Filter_vertical
Differentiate
Compress
Overall
N/A
Report the latency of each stage of the baseline application at
-O3
. (Start a table like Table 3.4; we will continue to fill in this table throughout this problem.)Based on your understanding of the C code, which loops in the streaming stages of the application have sufficient data parallelism for vectorization? Motivate your answer. (Mark suitability by filling in Yes or No in the suitability column of Table 3.4; add explanation in 2–5 lines after table.)
Identify the critical path lower bound for
Filter_vertical
in terms of compute operations. Focus on the data path. Ignore control flow and offset computations. You may assume associativity for integer arithmetic. (5 lines)Hint
Consider only the dependencies in the computation. What happens if you unroll the loops completely?
What is the size of the (non-index) multiplications performed in
Filter_vertical
? (How many bits for each of the input operands? How many bits are necessary to hold the output?) (one line)Report the resource capacity lower bound for
Filter_vertical
. Focus on the computation; you may ignore control flow and addressing computations. There are many resources that may limit the performance.
(5 lines)Hint
As with any resource capacity lower bound analysis, you may have multiple resources and may need to consider them each to identify the one that is most constraining.
You will need to review the NEON architecture (which we discussed in class and in Setup and Walk-through) and reason about what resources it has available to be used on each cycle. Think about how vectorization could exploit the set of computations a NEON unit can do in parallel.
What speedup do you expect your application can achieve if the compiler is able to achieve the resource bound identified in 3e? (5 lines)
Hint
Remember Amdahl’s Law; think about critical path lower bounds and resource capacity lower bounds.
(Fill in the ideal vectorization speedup column in Table 3.4; separately show Amdahl’s Law calculation for overall speedup.)
We will now enable the vectorization in g++. You can enable it by removing the
-fno-tree-vectorize
flag from theCXXFLAGS
in the hw4 Makefile.-O3
optimization automatically turns on the flag-ftree-vectorize
, which vectorizes your code.Report the speedup of the vectorized code with respect to the baseline. (Fill in the “Baseline with SIMD” columns in Table 3.4.)
Explain the discrepancy between your measured and ideal performance based on the optimization of
Filter_horizontal
. (3 lines)Hint
Look at the size of the multiplications in the assembly code.
To read this code, you probably need to understand the relation between Q and V registers. Perhaps useful:
Show how you can resolve the issue that you identified in the previous problem. (1 line) Include the assembly code of
Filter_vertical
after you have resolved the issue.Report the speedup with respect to the baseline after resolving the issue in both
Filter_horizontal
andFilter_vertical
. (Fill in the “Baseline with SIMD Modified” columns in Table 3.4.)
NEON Intrinsics Example
Review the Setup and Walk-through to learn about NEON intrinsics.
Review the code in the
hw4/assignment/neon_example
directory. Note how the Neon version instantiates Neon vector intrinsics to perform the operation. Convince yourself the C version and Neon version perform the same computation. (no turn in)Build and run the code by doing
make example
and./example
.Report the speedup for the Neon version compared to the C version. (1 line)
Review the assembly code produced for both the C and Neon versions. Based on the assembly code, explain how the Neon version is able to achieve the speedup you observed compared to the C version. Include assembly code to support your description. (probably 3–5 lines of description in addition to snippets of assembly)
Using NEON Intrinsics
We will now accelerate the
Filter_vertical
function using intrinsics. We have provided you with a neon intrinsics implementation ofFilter_vertical
inFilter.cpp
right after the#ifdef VECTORIZED
. This implementation doesn’t achieve full performance like the auto vectorization did. We will find out why and fix it.Use
-O3
optimization and build the target by doingmake neon_filter
. Run it with./neon_filter
and report the latency ofFilter_vertical
. Report the speedup with respect to the baseline in 3. Include the assembly code of the neon intrinsic implementation ofFilter_vertical
.The Setup and Walk-through talked about how you get lanes by packing data into vectors. Moreover, in 3j you saw how data type matters in generating performant assembly code that utilizes full lanes of the NEON units. Flowing data through the lanes is key to getting full throughput out of the NEON units.
As we saw,
Filter_vertical
function works on seven 8-bit data elements at a time. Describe how our neon intrinsic implementation deal with a number of data elements that is not divisible by the number of vector lanes without losing significant performance? (3 lines)Explain at which granularity and in which order our implementation processes the input data with vector instructions. (7 lines)
Compare the assembly code from 3j and 5a. What differences do you notice between the instructions generated?
Using NEON intrinsics, modify
Filter_vertical
and try to achieve a speedup with respect to our implementation. Include the accelerated function in your report. Make sure that you verify your optimized code functions properly (i.e. you should seeApplication completed successfully.
).Hint
Can you find an intrinsic that can be used to reduce the number of intrinsics we currently use?
You can also lookup the assembly instruction from 3j at NEON Intrinsics Reference and find a corresponding intrinsic to use in your modification!
Does your modified
Filter_vertical
achieve the speedup you got in 3k? If not, can you figure out why?Hint
Use the flag
-fopt-info-loop-optimized
as follows to find out which loops got vectorized:g++ -S -O3 -mcpu=native -fopt-info-loop-optimized Filter.cpp -o /dev/stdout | c++filt > Filter_O1.s
Use the flag
-fopt-info-vec-missed
as follows to find out what the compiler wasn’t able to vectorize:g++ -S -O3 -mcpu=native -fopt-info-vec-missed Filter.cpp -o /dev/stdout | c++filt > Filter_O1.s
When comparing the assembly code from 3j and the above, how many
bne
instructions do you see? What does it tell you about auto vectorization on code that uses neon intrinsics? What is a common loop optimization technique to reduce the impact of branch instructions?
Try to achieve the same speedup as 3k for your
Filter_vertical
after identifying the problem in 5f. Report the latency of your modifiedFilter_vertical
and the application as a whole. (2 lines)Compare your performance with the lower bounds. (1 lines)
Compare the performance of manual and automatic vectorization. (1 lines)
Reflection
Reflect on the cooperation in your team.
Compare your actual time on tasks with your original estimates. (table with 1-2 line explanation of major discrepancies)
Reflect on your task decomposition (1a). Were you able to complete the task as you originally planned? What aspects of your original task distribution worked well and why? Did you refine the plan during the assignment? How and why? In hindsight, how should you have distributed the tasks? (paragraph)
What was the most useful thing you learned from or working with your teammate? (2–4 lines)
What do you believe was the most useful thing that you were able to contribute to your team? (1–3 lines)
Include a screenshot showing your credit usage in Amazon AWS.
3.1. Deliverables¶
In summary, upload the following in their respective links in canvas:
a tarball containing the hw4 source code with your modified neon intrinsics code.
Quick linux commands for tar files
# Compress tar -cvzf <file_name.tgz> directory_to_compress/ # Decompress tar -xvzf <file_name.tgz>
writeup in pdf.