Shoestring: probabilistic soft error reliability on the cheap

Authors:
Shuguang Feng, University of Michigan, Ann Arbor, MI, USA
Shantanu Gupta, University of Michigan, Ann Arbor, MI, USA
Amin Ansari, University of Michigan, Ann Arbor, MI, USA
Scott Mahlke, University of Michigan, Ann Arbor, MI, USA

ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems, March 2010, Pages 385–396. https://doi.org/10.1145/1736020.1736063

Published: 13 March 2010

ABSTRACT

Aggressive technology scaling provides designers with an ever increasing budget of cheaper and faster transistors. Unfortunately, this trend is accompanied by a decline in individual device reliability as transistors become increasingly susceptible to soft errors. We are quickly approaching a new era where resilience to soft errors is no longer a luxury that can be reserved for just processors in high-reliability, mission-critical domains. Even processors used in mainstream computing will soon require protection. However, due to tighter profit margins, reliable operation for these devices must come at little or no cost. This paper presents Shoestring, a minimally invasive software solution that provides high soft error coverage with very little overhead, enabling its deployment even in commodity processors with "shoestring" reliability budgets. Leveraging intelligent analysis at compile time, and exploiting low-cost, symptom-based error detection, Shoestring is able to focus its efforts on protecting statistically-vulnerable portions of program code. Shoestring effectively applies instruction duplication to protect only those segments of code that, when subjected to a soft error, are likely to result in user-visible faults without first exhibiting symptomatic behavior. Shoestring is able to recover from an additional 33.9% of soft errors that are undetected by a symptom-only approach, achieving an overall user-visible failure rate of 1.6%. This reliability improvement comes at a modest performance overhead of 15.8%.
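
The idea of selective instruction duplication mentioned in the abstract can be illustrated with a toy model. The sketch below is not the authors' implementation: the three-instruction program, the `execute` function, the `protected` set, and the single-bit `fault` parameter are all hypothetical names introduced for illustration. It assumes that duplicating an instruction simply means recomputing it from the same source registers and comparing the two results, which is a simplification of real compiler-level duplication.

```python
import operator

# Toy instruction set; the opcode names are illustrative assumptions.
OPS = {"add": operator.add, "mul": operator.mul, "sub": operator.sub}


class SoftErrorDetected(Exception):
    """Raised when an instruction and its duplicate disagree."""


def execute(program, inputs, protected=frozenset(), fault=None):
    """Run a straight-line toy program over named integer registers.

    program:   list of (dest, op, src_a, src_b) tuples.
    protected: indices of instructions that are duplicated and checked.
    fault:     optional (index, bit) pair; flips one bit in the primary
               result of that instruction to mimic a soft error.
    """
    regs = dict(inputs)
    for i, (dest, op, a, b) in enumerate(program):
        result = OPS[op](regs[a], regs[b])
        if fault is not None and fault[0] == i:
            result ^= 1 << fault[1]  # inject a single-bit flip
        if i in protected:
            shadow = OPS[op](regs[a], regs[b])  # duplicated computation
            if shadow != result:
                raise SoftErrorDetected(f"mismatch at instruction {i}")
        regs[dest] = result
    return regs


# Hypothetical program computing out = (x + y) * x - y.
PROGRAM = [
    ("t0", "add", "x", "y"),
    ("t1", "mul", "t0", "x"),
    ("out", "sub", "t1", "y"),
]
INPUTS = {"x": 3, "y": 4}
```

Calling `execute(PROGRAM, INPUTS, fault=(1, 0))` returns a silently corrupted `out`, while adding `protected={1}` raises `SoftErrorDetected` for the same fault. Protecting only some indices mirrors the trade-off the abstract describes: duplication cost is paid only where a fault would otherwise go unnoticed.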


Index Terms

1. Hardware
  1. Hardware test
  2. Robustness
2. Software and its engineering
  1. Software notations and tools
    1. Compilers


Published in
ASPLOS XV: Proceedings of the fifteenth International Conference on Architectural support for programming languages and operating systems
March 2010
422 pages
ISBN: 9781605588391
DOI: 10.1145/1736020
General Chair: James C. Hoe, Carnegie Mellon University, USA
Program Chair: Vikram S. Adve, University of Illinois at Urbana-Champaign, USA

ACM SIGARCH Computer Architecture News, Volume 38, Issue 1
ASPLOS '10
March 2010
399 pages
ISSN: 0163-5964
DOI: 10.1145/1735970

ACM SIGPLAN Notices, Volume 45, Issue 3
ASPLOS '10
March 2010
399 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1735971

Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
compiler analysis
error detection
fault injection
Qualifiers
- research-article
Acceptance Rates
ASPLOS XV Paper Acceptance Rate: 32 of 181 submissions, 18%
Overall Acceptance Rate: 535 of 2,713 submissions, 20%
Article Metrics
- Total Citations: 231
- Total Downloads: 1,235
- Downloads (Last 12 months): 76
- Downloads (Last 6 weeks): 9