DOI: 10.1145/2897937.2905010
research-article

Invited - Cross-layer modeling and optimization for electromigration induced reliability

Published: 05 June 2016

ABSTRACT

In this paper, we propose a new approach for cross-layer electromigration (EM) induced reliability modeling and optimization at the physics, system, and datacenter levels. We use a recently proposed physics-based EM reliability model to predict the EM reliability of full-chip power grid networks for long-term failures. We show how this physics-based dynamic EM model at the physics level can be abstracted to the system level and further to the datacenter level. Our datacenter system-level power model is based on the BigHouse simulator. To speed up online energy optimization in a datacenter, we propose a new combined datacenter power and reliability compact model using a learning-based approach, in which a feed-forward neural network (FNN) is trained to predict energy and long-term reliability for each processor under datacenter scheduling and workloads. To optimize the energy and reliability of a datacenter, we apply an efficient adaptive Q-learning based reinforcement learning method. Experimental results show that the proposed compact models for the datacenter system, trained with different workloads under different cluster power modes and scheduling policies, produce accurate energy and lifetime predictions. Moreover, the proposed optimization method effectively manages and optimizes datacenter energy subject to reliability constraints for a given power budget and performance.
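To make the reinforcement learning step concrete, the following is a minimal sketch of plain tabular Q-learning on a toy power-mode selection problem. It is not the adaptive Q-learning method or the datacenter model from the paper: the two load states, the two power modes, the reward values, and the function names `reward`, `train`, and `greedy_policy` are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy problem (an assumption, not the paper's setup):
# states are load levels, actions are processor power modes.
STATES = (0, 1)   # 0 = low load, 1 = high load
ACTIONS = (0, 1)  # 0 = low-power mode, 1 = high-power mode


def reward(state, action):
    """Negative cost: energy use plus illustrative penalty terms."""
    energy = 1.0 if action == 0 else 3.0
    # Assumed penalty for missing performance under high load in low-power mode.
    perf_penalty = 5.0 if (state == 1 and action == 0) else 0.0
    # Assumed stand-in for reliability wear under high load in high-power mode.
    wear_penalty = 1.0 if (state == 1 and action == 1) else 0.0
    return -(energy + perf_penalty + wear_penalty)


def train(steps=20000, alpha=0.05, gamma=0.5, epsilon=0.2, seed=0):
    """Run tabular Q-learning with an epsilon-greedy behavior policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    state = rng.choice(STATES)
    for _ in range(steps):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[state][a])
        r = reward(state, action)
        # In this toy model the next load level is random and
        # does not depend on the chosen action.
        next_state = rng.choice(STATES)
        # Standard Q-learning update toward the bootstrapped target.
        target = r + gamma * max(q[next_state])
        q[state][action] += alpha * (target - q[state][action])
        state = next_state
    return q


def greedy_policy(q):
    """Return the highest-valued action for each state."""
    return [max(ACTIONS, key=lambda a: q[s][a]) for s in STATES]


if __name__ == "__main__":
    q_table = train()
    print(greedy_policy(q_table))
```

In this sketch the learned policy should pick the low-power mode under low load and the high-power mode under high load, because the assumed performance penalty outweighs the extra energy and wear cost. In a setting like the one the abstract describes, the hand-written `reward` function would presumably be replaced by predictions from the trained compact model.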


  • Published in

    DAC '16: Proceedings of the 53rd Annual Design Automation Conference
    June 2016
    1048 pages
    ISBN: 9781450342360
    DOI: 10.1145/2897937

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    Overall Acceptance Rate: 1,770 of 5,499 submissions, 32%
