ABSTRACT
Existing research on accelerators has emphasized the performance and energy-efficiency improvements they can provide, devoting little attention to practical issues such as accelerator invocation and interaction with other on-chip components (e.g., cores, caches). In this paper we present a quantitative study that considers these aspects by implementing seven high-throughput accelerators following three design models: tight coupling behind a CPU, loose out-of-core coupling with Direct Memory Access (DMA) to the last-level cache (LLC), and loose out-of-core coupling with DMA to DRAM. A salient conclusion of our study is that working sets of non-trivial size are best served by loosely coupled accelerators that integrate private memory blocks tailored to their needs.
Index Terms
- An Analysis of Accelerator Coupling in Heterogeneous Architectures