Homework 4
Due Date : Tuesday November 13
Total Points : 100 pts
You must work on this assignment with one or two other students. You should
work together on all parts of the assignment, but submit only one set
of solutions. If you each work on part, then you each only learn part
of the material. Please be sure to write both names on the submitted
solutions.
Note: Copying material from Wikipedia, other online sources, or
any source will not be tolerated. This form of plagiarism has
occurred in the past, and penalties for violating the Duke Community
Standard will be severe.
- Please submit a single PDF file with your solutions
- Please type your answers
- Explain all your answers to get full credit!
- Keep all your answers short and precise!
The Questions
- (20 pts) A program is running on a computer with a four-entry fully asssociative(micro) TLB:
VP#
|
PP#
|
Entry valid
|
5
|
30
|
1
|
7
|
1
|
0
|
10
|
10
|
1
|
15
|
25
|
1
|
The following is a trace of virtual page numbers accessed by a program. For each access indicate whether it produces a TLB hit/miss and, if it accesses the page table, whether it produces a page hit or fault. Put and X under the page table column if it is not accesses.
Virtual page index
|
Physical page#
|
Present
|
0
|
3
|
Y
|
1
|
7
|
N
|
2
|
6
|
N
|
3
|
5
|
Y
|
4
|
14
|
N
|
5
|
30
|
Y
|
6
|
26
|
Y
|
7
|
11
|
Y
|
8
|
13
|
N
|
9
|
18
|
Y
|
10
|
10
|
Y
|
11
|
56
|
Y
|
12
|
110
|
Y
|
13
|
33
|
Y
|
14
|
12
|
N
|
15
|
25
|
Y
|
Fill the following table.
Virtual page accessed
|
TLB(hit/miss)
|
Page table(hit or fault)
|
1
|
|
|
6
|
|
|
11
|
|
|
5
|
|
|
9
|
|
|
12
|
|
|
2
|
|
|
10
|
|
|
7
|
|
|
14
|
|
|
- (20 pts)
This is a short answer question covering three types of architectures we covered in class: Vector, SIMD, and VLIW.
a. What kind of parallelism does each architecture best exploit?
b. What are the trade-offs between the three methods of filling parallel pipelines?
c. Provide a sequence of 4 instructions that achieves the best case performance for each architecture. Assume a 4-wide version of each architecture. Explain why this is the best case, and any assumptions you make.
d. Provide a sequence of 4 instructions that achieves the worst case performance for each architecture. Assume a 4-wide version of each architecture. Explain why this is the worst case, and any assumptions you make.
e. Give an example system or processor from industry that used each execution model.
- (10 pts) H&P 5.1
- (10 pts) H&P 5.9
- (20 pts) Consider a single-issue in-order multithreading processor. The
processor can fetch and issue (dispatch) one instruction per cycle. If an
instruction cannot be issued due to a data dependency, the processor stalls.
Integer instructions take one cycle to execute and the result can be used in the
next cycle. We also assume that the processor has a perfect branch predictor
with no penalty for both taken and not-taken branches.
Each cycle, the processor can fetch and issue one instruction that performs any
of the following operations:
- load/store, 12-cycle latency (fully pipelined)
- integer add, 1-cycle latency
- floating-point add, 5-cycle latency (fully pipelined)
- branch, no delay slots, 1-cycle latency
The processor does not have a cache. Each memory operation directly accesses
main memory. If an instruction cannot be isseud due to a data dependency, the
processor stalls. We also assume that the processor has a perfect branch
predictor with no penalty for both taken and not-taken branches.
Your job is to analyze the processor utilizations for the following two
thread-switching implementations:
Fixed Switching: the processor switches to a different thread every cycle using
fixed round robin scheduling. Each of the N threads executes an instruction
every N cycles.
Data-dependent Switching: the processor only switches to a different thread when
an instruction cannot execute due to a data dependency.
Each thread executes the following MIPS code:
loop: | L.D F2, O(R1) | ; load data into F2 |
| ADDI R1,R1,4 | ; bump source pointer |
| FADD F3, F3, F2 | ; F3 = F3 + F2 |
| BNE F2, F4, loop | ; continue if F2 != F4 |
a) What is the minimum number of threads that we need to fully utilize the
processor for each implementation? Explain your answer.
b) What is the minimum number of threads that we need to fully utilize the
processor for each implementation if we change the load/store latency to 1-cycle
(but keep the 5-cycle floating-point add)? Explain your answer.
c) Consider a Simultaneous Multithreading (SMT) machine with limited hardware
resources. Which of the following hardware constraints can limit the total
number of threads that the machine can support? For the ones you choose, briefly
describe the minimum requirement to support N threads.
- Number of functional units
- Number of physical registers
- Data cache size
- Data cache associativity
- (20 pts) Write a couple of paragraphs as a progress report on your class project,
and you must include a Figure showing some data that you have collected for your
project (for any simulator you are using, run some workloads and generate a
graph).
Src: Krste Asanovic, Computer Architecture and Enginnering course at UC Berkeley, 2011