research-article

An Analysis of Accelerator Coupling in Heterogeneous Architectures

Authors:
Emilio G. Cota (Dept. of Computer Science, Columbia University)
Paolo Mantovani (Dept. of Computer Science, Columbia University)
Giuseppe Di Guglielmo (Dept. of Computer Science, Columbia University)
Luca P. Carloni (Dept. of Computer Science, Columbia University)

DAC '15: Proceedings of the 52nd Annual Design Automation Conference, June 2015, Article No.: 202, Pages 1–6
https://doi.org/10.1145/2744769.2744794

Published: 07 June 2015

ABSTRACT

Existing research on accelerators has emphasized the performance and energy-efficiency improvements they can provide, devoting little attention to practical issues such as accelerator invocation and interaction with other on-chip components (e.g., cores, caches). In this paper, we present a quantitative study that considers these aspects by implementing seven high-throughput accelerators following three design models: tight coupling behind a CPU, loose out-of-core coupling with Direct Memory Access (DMA) to the last-level cache (LLC), and loose out-of-core coupling with DMA to DRAM. A salient conclusion of our study is that working sets of non-trivial size are best served by loosely coupled accelerators that integrate private memory blocks tailored to their needs.


Index Terms

An Analysis of Accelerator Coupling in Heterogeneous Architectures
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Interconnection architectures
2. Hardware

Published in

DAC '15: Proceedings of the 52nd Annual Design Automation Conference
June 2015
1204 pages
ISBN: 9781450335201
DOI: 10.1145/2744769

Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall acceptance rate: 1,770 of 5,499 submissions (32%)