Writings on "Messy" Computing
Why should we demand that every transistor in a multi-billion transistor
chip be perfect? identical to the rest of the transistors on the chip?
throughout the entire operational lifetime of the chip?
Given the evidence to date, how reasonable is it to assume that every
line of code (LoC) in a multi-million LoC (operating system,
application) is correct? free of safety and security bugs?
[Hint: studies repeatedly show the rate of errors in programs is
above 1 per 1000 LoC.]
How can we build computing systems that operate correctly (or, at least,
well enough), despite these inevitable
flaws in their basic components?
Fabrication
- Crystals and Snowflakes: Building Computation from Nanowire
Crossbars (IEEE Computer, 2011) -- general audience overview
[Abstract and IEEE Xplore link]
- Seven
Strategies for Tolerating Highly Defective Fabrication (IEEE Design
and Test of Computers, 2005) -- how will we tolerate fabrication which leaves
1-10% of the wires broken and the junctions disconnected?
[Abstract
and IEEE Xplore Link]
- Limit Study of Energy & Delay Benefits of Component-Specific
Routing (FPGA 2012) -- how much benefit can we get using
post-fabrication mapping to tolerate high variation in the
programmable interconnect of FPGAs?
[Abstract and Paper Link]
- Quality-Time Tradeoffs in Component-Specific Mapping: How to
Train Your Dynamically Reconfigurable Array of Gates with Outrageous
Network-delays. (FPGA 2017) -- how to make component-specific
routing practical.
[Abstract and Paper Link]
- Exploiting Partially Defective LUTs: Why You Don't Need Perfect
Fabrication (ICFPT 2013) -- how to tolerate defects and variation
in LUTs.
[Abstract and Paper Link]
- GROK-LAB: Generating Real On-chip Knowledge for Intra-cluster
Delays using Timing Extraction (TRETS 2015) -- how to measure LUT
delay using only on-chip resources.
[Abstract and Paper Link]
- GROK-INT: Generating Real On-chip Knowledge for Interconnect Delays
using Timing Extraction (FCCM 2014) -- how to measure interconnect
delay using only on-chip resources.
[Abstract and Paper Link]
- Pitfalls and Tradeoffs in Simultaneous, On-Chip FPGA Delay
Measurement (FPGA 2016) -- care required for making the GROK measurements.
[Abstract and Paper Link]
- Choose-Your-Own-Adventure Routing: Lightweight Load-Time Defect
Avoidance (TRETS 2011) -- how to use precomputed alternative routes to
minimize the circuit complexity and time required to route around defects
[Abstract, Paper Link]
- Law of Large Numbers System Design (Nano, Quantum and Molecular
Computing: Implications to High Level Design and Validation) -- more of
an overview/tutorial article on coping with and exploiting statistical
phenomena at the atomic scale
[Abstract
and Citation]
- Much of the work on Sublithographic
Architectures is about messy computing.
Aging
- Self-Adaptive Timing Repair (IEEE Design and Test 2017) --
general-audience overview for how chips can rapidly self-adapt to variation and
aging effects.
[Abstract and IEEE Xplore Link]
- Continuous Online Self-Monitoring Introspection Circuitry for
Timing Repair by Incremental Partial-reconfiguration (COSMIC
TRIP) (TRETS 2018) -- How to identify aged components and replace them
during operation in hundreds of milliseconds.
[Abstract
and Paper Links]
- The Case for Reconfigurable Components with Logic Scrubbing:
Regular Hygiene Keeps Logic FIT (low) (NDCS 2008) -- Why you need to scrub logic as
well as memories.
[Abstract and Paper Links]
Transients
- Final Report of the CRA/CCC Visioning study on Cross Layer
Reliability (2011) -- vision and research agenda for addressing
reliability in highly-scaled technologies. [links to report]
- Energy Reduction through Differential Reliability and Lightweight
Checking (FCCM 2014) -- how to run most of the computation at
low energy (low voltages) and detect errors reliably and inexpensively.
[Abstract and Paper Link]
- Fault-Tolerant Sub-lithographic Design with Rollback Recovery
(Nanotechnology, 2008) -- how to tolerate transient upsets during
operation. [Abstract
and IOP Link]
- Fault Secure Encoder and Decoder for NanoMemory Applications
(IEEE Tr. VLSI Systems 2009) -- how to build the error-correction circuitry
for memories in nanoscale logic and tolerate both defects and
transient faults in the ECC logic.
[Abstract,
Paper link]
Security
- Architectural Support for Software-Defined Metadata
Processing (ASPLOS 2015) -- how to add hardware that enforces
post-fabrication programmable safety and security policies to a conventional RISC processor.
[Abstract and Paper Link]
- Automated
Least-Privilege Analysis (μSCOPE: A Methodology for Analyzing
Least-Privilege Compartmentalization in Large Software Artifacts)
(RAID
2021) -- how much excess privilege is the operating system running with, and how
can we automatically reduce it by orders of magnitude?
[Paper Link]
- SCALPEL: Exploring
the Limits of Tag-enforced Compartmentalization (ACM JETC 2022) --
how to realize automated privilege restriction with tags
[Abstract and Paper
Link]
- Protecting
the Stack with Metadata Policies and Tagged Hardware (IEEE
S&P (Oakland) 2018) -- SDMP policies that enforce stack protection [Abstract and Paper Link]
- DOVER: a Metadata-extended RISC-V (RISC-V Workshop,
January 2016) -- how to integrate tagged support into RISC-V
[Slides]
- The DOVER Edge (RISC-V Workshop, July 2016) -- how to
protect I/O and DMA as well [Slides]
- Hardware Support for Safety Interlocks and Introspection
(SASO AHNS Workshop 2012) -- how hardware can guard against gross
security and safety errors in programs.
[Abstract and Paper Link]
- Low-Fat Pointers: Compact Encoding and Efficient Gate-Level Implementation
of Fat Pointers for Spatial Safety and Capability-based Security (CCS
2013) -- how hardware can protect against spatial safety violations.
[Abstract and Paper Link]
André DeHon