# ESE5320: System-on-a-Chip Architecture

Day 1: August 28, 2024 Introduction and Overview (lecture start target 10:20am)

Note: slides and preclass are linked from the course web page: www.seas.upenn.edu/~ese5320/fall2024/fall2024.html

- Work/finish preclass while waiting for lecture start
- Feedback form (turn in end of lecture)

Penn ESE5320 Fall 2024 -- DeHon




## Today

- Part 1: Case for Programmable SoC (motivation)
- Part 2: Course Goals, Outcomes, Tools (philosophy?)
- Part 3: Sample Optimization (fast, flavor)
- Part 4: This course (operational details)
  - (including policies, logistics)


- ? 110+mm<sup>2</sup>, 4nm
- 16 billion transistors
- iPhone 14
- 6 ARM cores
  - 2 fast (3.5GHz)
  - 4 low energy (2GHz)
- 5 custom GPUs (1.4GHz)
- 16 Neural Engines
  - 17 trillion ops/s?




Questions

- Why do today's SoCs look the way they do?
- How do we approach programming modern SoCs?
- How do we design a custom SoC?
- When building a System-on-a-Chip (SoC)
  - How much area should go into:
    - Processor cores, GPUs, FPGA logic, memory, interconnect, custom functions (which?), ...?

# **FPGA** Field-Programmable Gate Array

Compute block w/ optional output Flip-Flop (LUT = LookUp Table)....
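A minimal C sketch of the compute block described above may help make the idea concrete. It assumes a 4-input LUT (the input count varies across real FPGA families), and the names `lut4_t`, `lut4_lookup`, and `lut4_clock` are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one FPGA compute block: a 4-input LUT
 * with an optional output flip-flop. Not any vendor's actual cell. */
typedef struct {
    uint16_t truth_table; /* bit i holds the output for input pattern i */
    bool     use_ff;      /* true: registered output; false: combinational */
    bool     ff_state;    /* value currently held in the flip-flop */
} lut4_t;

/* Combinational lookup: the 4 input bits select one truth-table bit. */
bool lut4_lookup(const lut4_t *lut, unsigned inputs)
{
    assert(inputs < 16u);
    return ((lut->truth_table >> inputs) & 1u) != 0u;
}

/* One clock cycle. Without the flip-flop the LUT result is the output.
 * With it, the output is the previously stored value, and the new LUT
 * result is captured for the next cycle. */
bool lut4_clock(lut4_t *lut, unsigned inputs)
{
    bool comb = lut4_lookup(lut, inputs);
    if (!lut->use_ff)
        return comb;
    bool out = lut->ff_state;
    lut->ff_state = comb;
    return out;
}
```

Programming the block amounts to choosing the 16 truth-table bits: for example, `0x8000` makes it a 4-input AND, since only input pattern 15 (all ones) selects a 1.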



ESE1500, CIS5710

Case for Programmable SoC

## End of Microprocessor Scaling

#### Old

- Moore's Law scaling delivered faster transistors
- Processors rode Moore's Law
  - Turning transistors into performance
- Could wait and ride technology curve


#### Now

- Limits of Dennard scaling kicked in
  - Dennard scaling: how voltage needs to scale with feature size
- Microprocessors were burning more power
- Lost ability to scale down voltage
- Processor performance stalled




## The Way things Were

30 years ago

- Wanted programmability
  - used a processor
- Wanted it a little faster
  - Next year's processor would run faster...
- Wanted high-throughput
  - used a custom Integrated Circuit (IC) -- chip
- Wanted product differentiation
  - Got it at the board level
  - Select which ICs and how wired
- Build a custom IC (chip)
  - It was about gates and logic


## Today

- Microprocessor may not be fast enough
  - (but often it is)
  - Or low enough energy
- Single-core processor scaling has ended
- Time and cost of a custom IC is too high
  - \$100Ms for development, years of design time
- FPGAs promising
  - But build everything from programmable gates?
- Premium for small part count
  - And avoid chip crossings

ICs with 10 to 100 billion transistors


# Non-Recurring Engineering (NRE) Costs

- Costs spent up front on development
  - Engineering Design Time
  - Design Verification
  - Prototypes
  - Mask costs
- Recurring Engineering
  - Costs to produce each chip

$$Cost(N_{chips}) = Cost_{NRE} + N_{chips} \times Cost_{perchip}$$


Amortize NRE with Volume

$$Cost(N_{chips}) = Cost_{NRE} + N_{chips} \times Cost_{perchip}$$

$$Cost = \frac{Cost_{NRE}}{N_{chips}} + Cost_{perchip}$$
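The two cost equations above can be checked numerically with a short C sketch. The function names are hypothetical, and the \$10 per-chip cost used in the note below is an assumed figure for illustration only; the \$100M NRE matches the order of magnitude quoted earlier for custom IC development.

```c
#include <assert.h>

/* Total cost of producing n_chips: one-time NRE plus per-chip cost.
 * Cost(N) = Cost_NRE + N * Cost_perchip */
double total_cost(double cost_nre, double cost_per_chip, double n_chips)
{
    assert(n_chips > 0.0);
    return cost_nre + n_chips * cost_per_chip;
}

/* Effective cost of each chip once NRE is spread over the volume.
 * Cost = Cost_NRE / N + Cost_perchip */
double amortized_cost_per_chip(double cost_nre, double cost_per_chip,
                               double n_chips)
{
    return total_cost(cost_nre, cost_per_chip, n_chips) / n_chips;
}
```

With \$100M NRE and an assumed \$10 per-chip cost, `amortized_cost_per_chip(100e6, 10.0, 1e3)` gives \$100,010 per chip, while `amortized_cost_per_chip(100e6, 10.0, 1e6)` gives \$110. The NRE term dominates until volume is very large.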


Forcing fewer, more customizable chips

- Economics force fewer, more customizable chips
  - Mask costs in the millions of dollars
  - Custom IC design NRE 100s of millions of dollars
    - Need market of billions of dollars to recoup investment
    - With fixed or slowly growing total IC industry revenues
    - → Number of unique chips must decrease

Chips must be programmable


# Large ICs (Chips)

- Now contain significant software
  - Almost all have embedded processors
- Must co-design SW and HW
- Must solve complete computing task
  - Task has components with a variety of needs
  - Some don't need custom circuits
  - 90/10 Rule


# Given Demand for Programmable

- How do we get higher performance than a processor, while retaining programmability?
  - Programmability: don't have to spend 100s of millions of dollars and months on fabrication


#### Then and Now

30 years ago

- Programmability?
  - use a processor
- Faster
  - Processors scaled
- High-throughput
  - used a custom IC
- Wanted product differentiation
  - board level
  - Select & wire ICs
- Build a custom IC (Chip)
  - It was about gates and logic


Today

- Programmability?
  - uP, FPGA, GPU, PSoC
- Faster
  - Can't get with single core
- High-throughput
  - FPGA, GPU, PSoC, custom IC
- Wanted product differentiation
  - Program FPGAs, PSoC
- Build a custom IC (Chip)
  - System and software


Part 2: Course Goals, Outcomes


Goals

- Create Computer Engineers
  - SW/HW divide is wrong, outdated
  - Computer engineers understand computation
    - HW and SW are just tools and design options
  - Parallelism, data movement, resource management, abstractions
  - Cannot build a chip without software
- SoC user: know how to exploit
- SoC designer: architecture space, hw/sw codesign
- Project experience: design and optimization

Roles

- PhD Qualifier
  - One broad Computer Engineering
- CMPE Concurrency Lab
- Hands-on Project course


#### Outcomes

- Design, optimize, and program a modern System-on-a-Chip.
- Analyze, identify bottlenecks, design-space
  - Modeling → write equations to estimate
- Decompose into parallel components
- Characterize and develop real-time solutions
- Implement both hardware and software solutions
- Formulate hardware/software tradeoffs, and perform hardware/software codesign


**Outcomes** 

- Understand the system on a chip from gates to application software, including:
  - on-chip memories and communication networks, I/O interfacing, design of accelerators, processors, firmware and OS/infrastructure software.
- Understand and estimate key design metrics and requirements including:
  - area, latency, throughput, energy, power, predictability, and reliability.


# Course Programming

- Write everything in C
  - including for hardware (FPGA, spatial) operators
- Avoid learning separate language
  - Don't require or teach Verilog/VHDL
- Do focus on how to tailor C for hardware
  - Focus on what's unique about specifying and guiding hardware
- Code → CHIPS


**Tools** 

- Are complex
- Will be challenging, but good for you to build confidence that you can understand and master them
- Tool runtimes can be long
- Learning and sharing experience will be part of assignments


Distinction

#### CIS2400, 4710, 5710

- Best Effort Computing
  - Run as fast as you can
- Binary compatible
- ISA separation
- Shared memory parallelism

#### ESE5320

- Real-Time
  - Guarantee to meet deadline
- Hardware-Software codesign
  - Willing to recompile, maybe rewrite code
  - Define/refine hardware
- Non shared-memory parallelism models


Distinction

#### ESE5390:

Hardware/Software Co-Design for Machine Learning

- Deep on Application (ML)
- More accessible to CS
  - Less previous experience with circuits and architecture
- Won't be as deep on understanding HW and optimization
- Program in PyTorch, OpenCL


ESE5320:

- Deep computer engineering
- Broad application
- Program in C
- Suitable followup if want to dig deeper




Part 3: Approach -- Example


### Abstract Approach

- Identify requirements, bottlenecks
- Decompose Parallel Opportunities
  - At extreme, how parallel could make it?
  - What forms of parallelism exist?
    - Thread-level, data parallel, instruction-level
- Design space of mapping
  - Choices of where to map, area-time tradeoffs
- Map, analyze, refine
  - Write equations to understand, predict


## **Example: SPICE Circuit Simulator**

- SPICE
  - Simulate and validate chip performance at transistor level (ESE3700, 5700)
  - Specialized differential equation simulator for circuits (chips)


# **Example: SPICE Circuit Simulator**



**Abstract Approach** 

- Identify requirements, bottlenecks
- Decompose Parallel Opportunities
  - At extreme, how parallel could make it?
  - What forms of parallelism exist?
    - Thread-level, data parallel, instruction-level
- Design space of mapping
  - Choices of where to map, area-time tradeoffs
- Map, analyze, refine
  - Write equations to understand, predict


\* Kirchhoff (Current, Voltage) Laws





Speedup

[Figure: percent of total runtime (0 to 100) versus circuit size]

- $T = T_{modeleval} + T_{matsolve} + T_{ctrl}$
- What should we speed up first?
- What happens if we only speed up modeleval?
  - $T = T_{matsolve} + T_{modeleval}/S + T_{ctrl}$
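The effect of accelerating only model evaluation can be estimated with a small C sketch of the runtime equation above. The function names are hypothetical, and the 80/15/5 split used in the note below is an assumed breakdown for illustration, not a measurement from the lecture.

```c
#include <assert.h>

/* Runtime when only model evaluation is accelerated by a factor s:
 * T = T_matsolve + T_modeleval / s + T_ctrl */
double runtime_modeleval_only(double t_modeleval, double t_matsolve,
                              double t_ctrl, double s)
{
    assert(s > 0.0);
    return t_matsolve + t_modeleval / s + t_ctrl;
}

/* Overall speedup relative to the unaccelerated runtime (s = 1). */
double overall_speedup_modeleval_only(double t_modeleval, double t_matsolve,
                                      double t_ctrl, double s)
{
    double before = runtime_modeleval_only(t_modeleval, t_matsolve, t_ctrl, 1.0);
    double after  = runtime_modeleval_only(t_modeleval, t_matsolve, t_ctrl, s);
    return before / after;
}
```

Assuming model evaluation takes 80% of the runtime, matrix solve 15%, and control 5%, a 10x model-evaluation speedup yields an overall speedup of 100/28, or roughly 3.6x. Even an arbitrarily large `s` cannot push the overall speedup past 5x, because the untouched 20% remains.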














Parallelism: Model Evaluation

- Spatial: use custom evaluation engines
  - ...or GPUs
- End up bottlenecked by other components











**Dataflow Processing Element (PE)**

[Figure: dataflow PE with incoming messages, graph nodes, dataflow trigger, graph fanout, and outgoing messages]




Parallelism: Matrix Solve

- Settled on constructing dataflow graph
- Graph can be iteration independent
  - Statically scheduled
  - (cheaper)
- This is bottleneck to further acceleration

Parallelism: Controller?

- Could leave sequential
  - For some designs, becomes the bottleneck once others accelerated
- Fully sequential
  - Has internal parallelism in condition evaluation
- $T = T_{modeleval}/S_1 + T_{matsolve}/S_2 + T_{ctrl}$
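A short C sketch of the three-term runtime equation shows how a sequential controller can grow into the bottleneck once the other two components are accelerated. The function names, the 80/15/5 runtime split, and the speedup factors in the note below are assumptions for illustration only.

```c
#include <assert.h>

/* Runtime with model evaluation sped up by s1 and matrix solve by s2,
 * controller left sequential:
 * T = T_modeleval / s1 + T_matsolve / s2 + T_ctrl */
double runtime_accelerated(double t_modeleval, double t_matsolve,
                           double t_ctrl, double s1, double s2)
{
    assert(s1 > 0.0 && s2 > 0.0);
    return t_modeleval / s1 + t_matsolve / s2 + t_ctrl;
}

/* Fraction of the accelerated runtime spent in the sequential controller. */
double ctrl_fraction(double t_modeleval, double t_matsolve,
                     double t_ctrl, double s1, double s2)
{
    return t_ctrl /
           runtime_accelerated(t_modeleval, t_matsolve, t_ctrl, s1, s2);
}
```

Starting from an assumed 80/15/5 split, with `s1 = 16` and `s2 = 4` the runtime drops from 100 to 13.75 units, and the controller's share rises from 5% to about 36% of the total.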















#### Class Components

- Lecture (incl. preclass exercise)
  - In-person (not hybrid, don't expect recordings)
  - Slides, preclass on web before class (print if you want)
  - N.B. I encourage class participation
    - In class; Questions ("warm" calls)
  - Daily Quiz
- Reading [~1 required paper/lecture]
  - online: Canvas, IEEE, ACM, also ZynqBook, Parallel Programming for FPGAs
- Homework (1 per week due 5pm)
- Project open-ended (~6 weeks)
  - Oct. 23 to Dec. 9 (~weekly milestones; details in syllabus)

Note: course admin online


#### Second Half

- Use everything on project
- Going deeper
- Memory
- Verification
- VLIW
- Reduce
- Energy
- Chip Cost
- Real-time
- Reactive


Office & Lab Hours

- Andre: TBD
  - Levine 270, Zoom
- TAs: Ketterer (starting next week)
  - Tuesday 8:30-9:30 pm
  - Wednesday 4:00-5:00 pm (not today)
  - Thursday 8:30-10:30 pm (not tomorrow)


First Half

Quickly cover breadth

- Metrics, bottlenecks
- Memory
- Line up with homeworks
- Parallel models
- SIMD/Data Parallel
- Thread-level parallelism
- Spatial, C-to-gates


## **Teaming**

- Homeworks (HW) in groups of 2 (after 0, 1)
  - HW: we assign
  - Individual assignment writeup
- Project in groups of 3
  - Project: you propose team of 3, we review
  - Most portions group writeup
  - Maybe few components individual writeup


# Diagnostic Assessment

- Course will rely heavily on C
  - Program both hardware and software in C
- If you cannot read/write code in C, this course will be a challenge
- Diagnostic Assessment intended as a quick indication if you aren't ready
  - Should be able to complete quickly
  - Better to find out now than after you're stuck in the course
- Due next Wednesday (9/4)


#### C Review

- Course will rely heavily on C
  - Program both hardware and software in C
- HW1 has some C warmup problems
- TA will hold C review
  - on Sept. 4th, TBD (probably office hours)
  - (before our next class meeting since Monday 9/2 is Labor Day)
  - See Ed Discuss for details


## Daily Quiz

- Count for Engagement Points
- Only available until next lecture
- Incentive to keep up with material


#### Feedback

- Will have anonymous feedback for each lecture
  - Vocabulary?
  - Specificity most helpful
    - X was unclear because of Y
    - Subtopic Z went too fast
    - Need an example for Q

#### **Preclass Exercise**

- Motivate the topic of the day
  - Introduce a problem
  - Introduce a design space, tradeoff, transform
- Available on syllabus before lecture
  - May want to print a copy to bring to class
- Should work on it before lecture starts
  - Won't be available later
- Do bring/use calculator
  - Will be numerical examples


#### Lecture Timeline

- Work on preclass before lecture start
- Start lecture at 10:20am
- Lecture until 11:40am
- (most days) stay for remaining questions
  - Pending course after us
  - Typically take questions in hall after clearing out for next course
  - [not today → CMPE Meet-and-Greet]


- Clarity?
- Speed?
- General comments


#### **Policies**

- Canvas turn-in of assignments
- No handwritten work
- Due on time
  - Individual assignments only
    - 3 free late days total
- Collaboration
  - Tools allowed
  - Designs limited to project teams as specified on assignments
- See web page


Admin

- Your action:
  - Feedback sheet for today
  - Find course web page
    - Read it, including the policies
    - Find Syllabus
      - Find diagnostic assessment, homework 1
      - Find lecture slides
        - Will try to post before lecture
      - Find reading assignments
  - Find reading for lecture 2 on Canvas and web
    - ...for this lecture if you haven't already
  - Find/join Ed Discussion group for course
  - Sign up for Detkin/Ketterer access
  - Complete/submit diagnostic assessment


## Questions?


# Big Ideas

- Programmable Platforms
  - Key delivery vehicle for innovative computing applications
  - Reduce TTM (Time-to-Market), risk
  - More than a microprocessor
  - Heterogeneous, parallel
- Demand hardware-software codesign
  - Soft view of hardware
  - Resource-aware view of parallelism
