ABSTRACT
In this paper, we propose a new approach to cross-layer modeling and optimization of electromigration (EM) induced reliability at the physics, system, and datacenter levels. We use a recently proposed physics-based EM reliability model to predict the EM reliability of full-chip power grid networks with respect to long-term failures. We show how this physics-based dynamic EM model can be abstracted from the physics level to the system level, and further to the datacenter level. Our datacenter system-level power model is based on the BigHouse simulator. To speed up online energy optimization in a datacenter, we propose a new combined datacenter power and reliability compact model using a learning-based approach, in which a feed-forward neural network (FNN) is trained to predict the energy and long-term reliability of each processor under given datacenter scheduling policies and workloads. To optimize the energy and reliability of a datacenter, we apply an efficient adaptive Q-learning based reinforcement learning method. Experimental results show that the proposed compact models, trained with different workloads under different cluster power modes and scheduling policies, predict datacenter energy and lifetime accurately. Moreover, the proposed optimization method effectively manages and optimizes datacenter energy subject to reliability, a given power budget, and performance constraints.
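As a rough illustration of the reinforcement learning step named above, the following is a minimal sketch of plain tabular Q-learning, not the adaptive variant used in the paper. The states (load levels), actions (cluster power modes), and reward function are all hypothetical stand-ins; in the paper's setting the reward would presumably come from the trained FNN energy and reliability model, which is not reproduced here.

```python
import random

# Hypothetical toy setup (not from the paper):
N_STATES = 3    # assumed load levels: 0 = low, 1 = medium, 2 = high
N_ACTIONS = 2   # assumed power modes: 0 = low-power, 1 = high-power


def q_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one standard Q-learning update in place; return the new value."""
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]


def toy_reward(state, action):
    """Made-up reward: negative energy cost, plus a penalty when the
    low-power mode is chosen under high load."""
    energy = 1.0 if action == 0 else 2.0
    penalty = 5.0 if (state == 2 and action == 0) else 0.0
    return -(energy + penalty)


def train(steps=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Run epsilon-greedy Q-learning on the toy problem; return the Q-table."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    state = rng.randrange(N_STATES)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(N_ACTIONS)  # explore
        else:
            action = max(range(N_ACTIONS), key=lambda a: q[state][a])  # exploit
        reward = toy_reward(state, action)
        next_state = rng.randrange(N_STATES)  # load changes at random in this toy
        q_update(q, state, action, reward, next_state, alpha, gamma)
        state = next_state
    return q


if __name__ == "__main__":
    for s, row in enumerate(train()):
        print(s, [round(v, 2) for v in row])
```

The table-based form is chosen only for readability; a datacenter-scale state space would likely need function approximation or the adaptive scheme the abstract refers to.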