## ESE5320: System-on-a-Chip Architecture

Day 25: November 25, 2024 Real Time

Penn ESE5320 Fall 2024 -- DeHon




# Today

**Real Time**

- Connection to Real or Physical World
- Contrast with "virtual" or "variable" time
- "Real" refers to physical time
- Handles events with absolute guarantees on timing

#### Real Time

- Part 1: Demands
- Part 2: Challenges
  - Algorithms
  - Architecture
- Part 3: Disciplines to achieve


#### Message

- Real-Time applications demand different discipline from best-effort tasks
- Look more like synchronous circuits
- Can sequentialize, like processor
  - But must avoid/rethink typical general-purpose processor common-case optimizations


#### **Time Constants**

- Many mechanical sense/response times are ~ 5-10ms
- Human reaction times ~20ms


## Braking

- Car traveling 60 km/h
- Visibility of 30 m
- How long do you have to stop as soon as something comes into view?
  - Simple total
    - See/recognize, decide, apply brakes, brakes respond


#### Real-Time Tasks

- What timing guarantees might you like for the following tasks?
  - Self-driving car detects an object in its path
    - Delay from object appearing to detection
  - Pacemaker stimulates your heart
  - Turn steering wheel on a drive-by-wire car
    - Delay until the turn is recognized and the car turns
  - Video playback (frame to frame delay)


#### Real-Time Guarantees

- Attention/processing within fixed interval
  - Sample new value every XX ms
  - Produce new frame every 30 ms
  - Both: schedule to act and complete action
- Bounded response time
  - Respond to keypress within 20 ms
  - Detect object within 100 ms
  - Return search results within 200 ms


### Computer Response

- What do these things indicate?
  - When will the computer complete the task?





Images:
https://en.wikipedia.org/wiki/File:Windows_8_%2B_10_wait_cursor.gif
https://en.wikipedia.org/wiki/File:WaitCursor-300p.gif


## Real-Time Response

- What if your car gave you a spinning wait wheel for 5 seconds when you
  - Turned the wheel?
  - Stepped on the brakes?




## Synchronous Circuit Model

- A simple synchronous circuit is a good "model" for real-time task
  - Run at fixed clock rate
  - Take input every "cycle" (application cycle)
  - Produce output every "cycle" (application cycle)
  - Complete computation between input and output
  - Designed to run at fixed-frequency
    - Critical path meets frequency requirement


- (Circuit) cycle time at which this could operate?
- Assume clocked at 100 Hz (application cycle)
- Worst-case delay from (L)eft press to change in heading (posx, posy)?



## Historically

- Real-Time concerns grew up in EE
  - Because an analog circuit was the only way to meet frequency demands
  - ...later a dedicated digital circuit...
- Applications
  - Signal processing, video, control, ...


## **Technological** Change



- Area units for spatial design shown (preclass 2c)
- Fraction of processor capacity required (preclass 2d)
- Why might we prefer using a processor to using the spatial circuit?
  - Hint: What does preclass 2c,d suggest?


## HW/SW Co-Design

- Computer Engineers know we can implement anything as hardware or software
- Want freedom to move between hardware and software to meet requirements
  - Performance, costs, energy


## Performance Scaling

- As circuit speeds increased
  - Can meet real-time performance demands with heavy sequentialization
- Circuit and processor clocks
  - from MHz to GHz
- Many real-time task rates unchanged
  - 44 kHz audio, 33 frames/second video
- Even a 100 MHz processor
  - Can implement audio in a small fraction of its computational throughput capacity

## Real-Time Challenge

- Meet real-time demands / guarantees
  - Economically using programmable architectures
- Sequentialize and share resources with deterministic, guaranteed timing
- Spatial (all hardware, HLS synthesized) implementations are good at meeting real-time guarantees, but may be bigger than necessary




#### **Processor Data Caches** (Day 3)

- Traditional Processor Data Caches are a heuristic instance of this
  - Add a small memory local to the processor
    - It is fast, low latency
  - Store anything fetched from large/remote memory in local memory
    - Hoping for reuse in near future
  - On every fetch, check local memory before going to large memory
  - Stall processor while waiting for data

[Figure: processor with a small local memory in front of a large memory]


#### **Processor Data Caches**

- Demands more than a small memory
  - Need to sparsely store address/data mappings from large memory
  - Makes it more expensive in area/delay/energy than a simple memory of the same capacity
- Don't need explicit data movement
- Cannot control when data moved/saved
  - Bad for determinism
- Limited ability to control what stays in small memory simultaneously


## Preclass 3: **Processor Cache Timing**

Assume:

- Cache miss (go to large memory) takes 10 cycles
- Cache hit (small memory) takes 1 cycle
- Start with empty cache
- Due to memory delay, how long to execute each of the following?
  - `b=a[0]+a[1]; c=a[1]+a[2]; d=a[2]+a[0];`
  - `b=a[i]+a[j]; c=a[k]+a[l]; d=a[m]+a[n];`


# Scratchpad

- Recall, scratchpad memory
  - Small
  - Explicitly managed (not dynamic like cache)
- If we move (DMA) data to scratchpad memory, access time would be deterministic
  - `b=a[0]+a[1]; c=a[1]+a[2]; d=a[2]+a[0];`
  - `b=a[i]+a[j]; c=a[k]+a[l]; d=a[m]+a[n];`


#### Observe

- Instructions on "General Purpose" processors take a variable number of cycles


### Preclass 5

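- How many cycles?

```
sum=0;
for (i=0;i<32;i++) {
    sum+=(0-(b%2)) & a; // assuming unsigned b: mask is all ones when low bit of b is 1
    b=b>>1;             // move to next bit of b
    a=a<<1;             // weight a by the next power of 2
}
```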


#### Observe

- Data-dependent branching, looping
  - Means variable time for operations

#### Preclass 4

- How many cycles?
  - sin, cos 100 cycles each
  - Assignments take 1 cycle

```
old_sh=sh; old_ch=ch;
if (!left || !right)
    {sh=old_sh; ch=old_ch;}   // keep previous values: assignments only
else
    {sh=sin(heading);         // recompute: sin and cos calls
     ch=cos(heading);}
```


#### Preclass 5

- How many cycles?

```
sum=0;
for (;b!=0;b=b>>1) {   // trip count depends on the highest set bit of b
    if (b%2==1)        // data-dependent branch
        sum+=a;
    a=a<<1;
}
```


### Two Challenges

1. Architecture: hardware has variable (data-dependent) delay
   - Esp. for General-Purpose processors
     - Instructions take different numbers of cycles
2. Algorithm: computational specification has variable (data-dependent) operations
   - Different number of instructions

$\text{Time} = \sum_{i} \text{Cycles}(i)$ (summed over each instruction $i$ executed)

## Algorithm

- What programming constructs are data-dependent (variable delay)?


# **Programming Constructs**

- Conditionals: if/then/else
- Loops without compile-time determined bounds
  - While with termination expressions
  - For with data-dependent bounds
- Data-dependent recursion
- Interrupts
  - I/O events, time-slice
- Note: the first three were issues for HLS
  - For the same reason. How did we address them?


### Hardware Architecture

- Some typical (4710,5710) processor "optimizations" can cause variable delay
  - Caches
  - Branch prediction
  - Common-case optimizations
  - Pipeline stalls
  - Speculative issue


### **DISCIPLINES TO ACHIEVE REAL-TIME**


## Already Addressed

- Hardware pipelines are deterministic
  - HLS limitations for hardware already drove us to fixed timing
- Explicit scheduling of VLIW issue
  - Can be fixed timing


## Open Issues

- Processor Architecture
- Resource Sharing


### What can we do to make architecture more deterministic?

- Explicitly managed memory
- Eliminate Branching (too severe?)
- Unpipelined processors
- Fixed-delay pipelines
  - Offline-scheduled resource sharing
  - Multi-threaded
- Deadlines


# **Explicitly Managed Memory**

- Make memory hierarchy visible
  - Use Scratchpad memories instead of caches
- Explicitly move data between memories
  - E.g. movement into local memory
- Already do for Register File in Processor
  - Load/store between memory and RF slot
  - ...but don't do for memory hierarchy

[Figure: Explicitly Managed Memory. Labels: small memory, 1 cycle; 20 cycles; Medium Memory; Large Memory]

## Offline Schedule Resource Sharing

- Don't arbitrate
- Decide up-front when each shared resource can be used by each thread or processor
  - Simple fixed schedule
  - Detailed Schedule
- What resources?
  - Memory bank, bus, I/O, network link, ...


## Time-Multiplexed Bus

- Fixed by hardware master
- 4 masters share a bus
- Each master gets to make a request on the bus every 4th cycle
  - If doesn't use it, goes idle
- Slot owners: P1 | P2 | P3 | P4 | P1 | P2 | P3 | P4 | P1 | P2 | P3 | P4 |

## Time-Multiplexed Bus

- Regular schedule
- Fixed bus slot schedule of length N > masters
  - (probably a multiple)
- Assign owner for each slot
  - Can assign more slots to one
- E.g. N=8, for 4 masters
  - Schedule (1 2 1 3 1 2 1 4)



## Simple Deterministic Processor with Multiplier

- No branching
- Unpipelined
- Every operation completes in fixed time

Questions:

- Cycle time as shown?
- Retimed cycle time?
- What's inefficient about this design?




## Simple Deterministic Pipelined Processor

- No branching
- Every operation completes in fixed time

Questions:

- How do the added pipeline registers change behavior?
  - Hint: R1 value

## Simple Deterministic Pipelined Processor

- No branching
- Every operation completes in fixed time
- Retimed cycle time?

## Deterministic Pipelines

- Pipeline data updates
  - Not available on next cycle, but depend on delay through logic
- Delay branches
  - Do not happen immediately, but a fixed number of clock cycles later
- Schedule like VLIW


### **Deterministic Pipelines**

- Not how ARM, Intel (4710, 5710) processors are pipelined
- Those include operations that make timing variable
  - dynamic data hazards, branch speculation
- Here, data becomes available after a predictable time
- Branches take effect at a fixed time
  - Likely delayed
- Schedule around the delays to get correct data

#### **WCET**

- WCET: Worst-Case Execution Time
- Analysis when working with algorithms and architectures with data-dependent delay
  - Need to meet real time
  - Calculate the worst-case runtime of a task
    - Like calculating the critical path (but harder)
    - Worst-case delay of instructions
    - Worst-case path through code
    - Worst-case # loop iterations
  - Rationale for setting Deadlines (like a cycle time)


# SoC Opportunity

- Can choose which resources are shared
- Can dedicate resources to tasks
- Isolate real-time tasks/portions of tasks from best-effort
  - Separate hardware/processors
  - Separate memories, network


#### Deadline Instruction

- Deal with algorithmic (branching) variability
- Set a hardware counter for thread
- Decrement counter on each cycle
- Demand counter reach 0 before thread allowed to continue at deadline instruction
  - Stall if get there early
  - Similar to flip-flop on a logic path
    - Wait for clock edge to change or sample value
- Model: fixed execution time

#### **Different Goals**

#### Real-Time

- Willing to recompile to new hardware
- Want time on hardware predictable
- Willing to schedule for delays in particular hardware

#### General Purpose/Best Effort

- ISA fixed
- Want to run same assembly on different implementations
- Tolerate different delays for different hardware
- Run faster on newer, larger implementations


### UltraScale+ Zynq

- Has 2 "Real-Time Processors"
  - ARM Cortex-R5
    - 32b (vs. 64b for A53 APU processor)
    - ARMv7-R (vs. ARMv8)
    - Single ALU, dual issue
    - Branch prediction
- Explicitly managed scratchpads
  - Tightly-Coupled Memories
  - On-Chip Memory (OCM)


## Big Ideas:

- Real-Time applications demand different discipline from best-effort tasks
- Look more like synchronous circuits and hardware discipline
- Avoid or use care with variable delay programming constructs
- Can sequentialize, like processor
  - But must avoid/rethink typical processor common-case optimizations
  - Offline calculate static schedule for computation and sharing


### Admin

- Feedback
- Penn on Thursday-Friday schedule for tomorrow and Wednesday
  - → no lecture on Wednesday
- Next lecture Monday
  - Wrapup lecture
