ABSTRACT
Shared last-level cache (LLC) management is a critical design issue for heterogeneous multi-cores. In this paper, we make two key observations: the contribution of LLC latency to overall performance varies across applications, cores, and time; and overlooking off-chip memory latency when partitioning the LLC often hurts overall performance. Hence, we propose a Latency Sensitivity-based Cache Partitioning (LSP) framework, comprising a lightweight runtime mechanism that quantifies latency sensitivity and a new cost function that guides LLC partitioning. Results show that LSP improves overall throughput by 8% on average (up to 27%) over the state-of-the-art partitioning mechanism, TAP.
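To make the idea concrete, the following is a minimal sketch (not the paper's actual algorithm) of latency-sensitivity-weighted way partitioning: a greedy allocator, in the style of utility-based cache partitioning, where each core's marginal miss reduction is scaled by a latency-sensitivity weight before ways are handed out. All function names, curves, and weights here are illustrative assumptions.

```python
# Hypothetical sketch: greedy LLC way allocation where each core's marginal
# utility (miss reduction per extra way) is weighted by a latency-sensitivity
# factor, so latency-tolerant cores (e.g. a GPU) win fewer ways.

def partition_ways(miss_curves, sensitivity, total_ways):
    """miss_curves[c][w] = misses of core c when given w ways (w = 0..total_ways).
    sensitivity[c]     = latency-sensitivity weight for core c (higher = more sensitive).
    Greedily grants one way at a time to the core with the highest weighted gain."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        best, best_gain = None, -1.0
        for c, curve in enumerate(miss_curves):
            w = alloc[c]
            # Weighted miss reduction from granting this core one more way.
            gain = sensitivity[c] * (curve[w] - curve[w + 1])
            if gain > best_gain:
                best, best_gain = c, gain
        alloc[best] += 1
    return alloc

# Example: a latency-sensitive CPU core vs. a latency-tolerant GPU core.
cpu_curve = [100, 60, 35, 20, 12, 8, 6, 5, 5]    # misses vs. ways allocated
gpu_curve = [90, 80, 72, 66, 61, 57, 54, 52, 51]
print(partition_ways([cpu_curve, gpu_curve], [1.0, 0.2], 8))  # → [6, 2]
```

With equal sensitivity weights this degenerates to plain utility-based partitioning; the weight is what lets a GPU's high raw miss count stop dominating the allocation.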
REFERENCES
- R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In ISCA, 2012.
- A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, 2009.
- A. R. Brodtkorb, T. R. Hagen, and M. L. Sætra. Graphics processing unit (GPU) programming strategies and trends in GPU computing. J. Parallel Distrib. Comput., 2013.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, 2009.
- M. Garrido and J. Grajal. Continuous-flow variable-length memoryless linear regression architecture. Electronics Letters, 2013.
- L. R. Hsu, S. K. Reinhardt, R. Iyer, and S. Makineni. Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a shared resource. In PACT, 2006.
- A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA, 2010.
- O. Kayiran, N. Nachiappan, A. Jog, R. Ausavarungnirun, M. Kandemir, G. Loh, O. Mutlu, and C. Das. Managing GPU concurrency in heterogeneous architectures. In MICRO, 2014.
- J. Lee and H. Kim. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In HPCA, 2012.
- X. Lin and R. Balasubramonian. Refining the utility metric for utility-based cache partitioning. In WDDD, 2011.
- J. Lotze, P. Sutton, and H. Lahlou. Many-core accelerated LIBOR swaption portfolio pricing. In SCC, 2012.
- V. Mekkat, A. Holey, P.-C. Yew, and A. Zhai. Managing shared last-level cache in a heterogeneous multicore processor. In PACT, 2013.
- A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A full system simulator for multicore x86 CPUs. In DAC, 2011.
- M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO, 2006.
- B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin. Scaling the bandwidth wall: Challenges in and avenues for CMP scaling. SIGARCH Comput. Archit. News, 2009.
- P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. Computer Architecture Letters, 2011.
- G. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The Journal of Supercomputing, 2004.
- P.-H. Wang, G.-H. Liu, J.-C. Yeh, T.-M. Chen, H.-Y. Huang, C.-L. Yang, S.-L. Liu, and J. Greensky. Full system simulation framework for integrated CPU/GPU architecture. In VLSI-DAT, 2014.
- P.-H. Wang, C.-W. Lo, C.-L. Yang, and Y.-J. Cheng. A cycle-level SIMT-GPU simulation framework. In ISPASS, 2012.