Writings on ``Messy'' Computing

Why should we demand that every transistor in a multi-billion transistor chip be perfect? identical to the rest of the transistors on the chip? throughout the entire operational lifetime of the chip?

Given the evidence to date, how reasonable is it to assume that every line of code (LoC) in a multi-million LoC (operating system, application) is correct? free of safety and security bugs?
[Hint: studies repeatedly show the rate of errors in programs is above 1 per 1000 LoC.]
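
To make the hint concrete, here is a minimal back-of-the-envelope sketch. It assumes a uniform rate of exactly 1 error per 1000 LoC, which the hint gives as a floor, so the results are lower bounds. The function name and the code sizes in the loop are illustrative assumptions, not figures from the studies.

```python
def min_expected_defects(lines_of_code: int, defects_per_kloc: float = 1.0) -> float:
    """Lower-bound estimate of latent defects.

    Assumes a uniform defect rate across the code base; the default of
    1 defect per 1000 LoC is the floor quoted in the hint above.
    """
    return lines_of_code / 1000 * defects_per_kloc


# Illustrative (assumed) code sizes for "multi-million LoC" systems.
for loc in (1_000_000, 10_000_000, 50_000_000):
    print(f"{loc:>11,} LoC -> at least {min_expected_defects(loc):,.0f} defects")
```

Even at the floor rate, a multi-million LoC system should be expected to carry thousands of errors.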

How can we build computing systems that operate correctly (or, at least, well enough), despite these inevitable flaws in their basic components?

Fabrication
- Crystals and Snowflakes: Building Computation from Nanowire Crossbars (IEEE Computer, 2011) -- general audience overview [Abstract and IEEE Xplore link]
- Seven Strategies for Tolerating Highly Defective Fabrication (IEEE Design and Test of Computers, 2005) -- how will we tolerate fabrication which leaves 1-10% of the wires broken and the junctions disconnected? [Abstract and IEEE Xplore Link]
- Limit Study of Energy & Delay Benefits of Component-Specific Routing (FPGA 2012) -- how much benefit can we get using post-fabrication mapping to tolerate high variation in the programmable interconnect of FPGAs? [Abstract and Paper Link]
- Quality-Time Tradeoffs in Component-Specific Mapping: How to Train Your Dynamically Reconfigurable Array of Gates with Outrageous Network-delays. (FPGA 2017) -- how to make component-specific routing practical. [Abstract and Paper Link]
- Exploiting Partially Defective LUTs: Why You Don't Need Perfect Fabrication (ICFPT 2013) -- how to tolerate defects and variation in LUTs. [Abstract and Paper Link]
- GROK-LAB: Generating Real On-chip Knowledge for Intra-cluster Delays using Timing Extraction (TRETS 2015) -- how to measure LUT delay using only on-chip resources. [Abstract and Paper Link]
- GROK-INT: Generating Real On-chip Knowledge for Interconnect Delays using Timing Extraction (FCCM 2014) -- how to measure interconnect delay using only on-chip resources. [Abstract and Paper Link]
- Pitfalls and Tradeoffs in Simultaneous, On-Chip FPGA Delay (FPGA 2016) -- care required for making the GROK measurements. [Abstract and Paper Link]
- Choose-Your-Own-Adventure Routing: Lightweight Load-Time Defect Avoidance (TRETS 2011) -- how to use precomputed alternative routes to minimize the circuit complexity and time required to route around defects [Abstract, Paper Link]
- Law of Large Numbers System Design (Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation) -- an overview/tutorial article on coping with and exploiting statistical phenomena at the atomic scale [Abstract and Citation]
- Much of the work on Sublithographic Architectures is about messy computing.
Aging
- Self-Adaptive Timing Repair (IEEE Design and Test 2017) -- general-audience overview of how chips can rapidly self-adapt to variation and aging effects. [Abstract and IEEE Xplore Link]
- Continuous Online Self-Monitoring Introspection Circuitry for Timing Repair by Incremental Partial-reconfiguration (COSMIC TRIP) (TRETS 2018) -- how to identify aged components and replace them during operation in hundreds of milliseconds. [Abstract and Paper Links]
- The Case for Reconfigurable Components with Logic Scrubbing: Regular Hygiene Keeps Logic FIT (low) (NDCS 2008) -- why you need to scrub logic as well as memories. [Abstract and Paper Links]
Transients
- Final Report of the CRA/CCC Visioning study on Cross Layer Reliability (2011) -- vision and research agenda for addressing reliability in highly-scaled technologies. [links to report]
- Energy Reduction through Differential Reliability and Lightweight Checking (FCCM 2014) -- how to run most of the computation at low energy (low voltages) and detect errors reliably and inexpensively. [Abstract and Paper Link]
- Fault-Tolerant Sub-lithographic Design with Rollback Recovery (Nanotechnology, 2008) -- how to tolerate transient upsets during operation. [Abstract and IOP Link]
- Fault Secure Encoder and Decoder for NanoMemory Applications (IEEE Tr. VLSI Systems 2009) -- how to build the error-correction circuitry for memories in nanoscale logic and tolerate both defects and transient faults in the ECC logic. [Abstract, Paper link]
Security
- Architectural Support for Software-Defined Metadata Processing (ASPLOS 2015) -- how to add hardware that enforces post-fabrication programmable safety and security policies to a conventional RISC processor. [Abstract and Paper Link]
- Automated Least-Privilege Analysis (μSCOPE: A Methodology for Analyzing Least-Privilege Compartmentalization in Large Software Artifacts) (RAID 2021) -- how much excess privilege is the operating system running with, and how can we automatically reduce it by orders of magnitude? [Paper Link]
- SCALPEL: Exploring the Limits of Tag-enforced Compartmentalization (ACM JETC 2022) -- how to realize automated privilege restriction with tags [Abstract and Paper Link]
- Protecting the Stack with Metadata Policies and Tagged Hardware (IEEE S&P (Oakland) 2018) -- SDMP policies that enforce stack protection [Abstract and Paper Link]
- DOVER: a Metadata-extended RISC-V (RISC-V Workshop, January 2016) -- how to integrate tagged support into RISC-V [Slides]
- The DOVER Edge (RISC-V Workshop, July 2016) -- how to protect I/O and DMA as well [Slides]
- Hardware Support for Safety Interlocks and Introspection (SASO AHNS Workshop 2012) -- how hardware can guard against gross security and safety errors in programs. [Abstract and Paper Link]
- Low-Fat Pointers: Compact Encoding and Efficient Gate-Level Implementation of Fat Pointers for Spatial Safety and Capability-based Security (CCS 2013) -- how hardware can protect against spatial safety violations. [Abstract and Paper Link]

André DeHon
