ABSTRACT
Shared last-level cache (LLC) management is a critical design issue for heterogeneous multi-cores. In this paper, we make two key observations: the contribution of LLC latency to overall performance varies across applications, cores, and time; and overlooking off-chip memory latency when partitioning the LLC often hurts overall performance. Hence, we propose a Latency Sensitivity-based Cache Partitioning (LSP) framework, comprising a lightweight runtime mechanism that quantifies latency sensitivity and a new cost function that guides LLC partitioning. Results show that LSP improves overall throughput by 8% on average (up to 27%) over the state-of-the-art partitioning mechanism, TAP.
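To make the idea concrete, the following is a minimal sketch (not the paper's actual algorithm) of latency-sensitivity-weighted way partitioning: a greedy allocator, in the style of utility-based cache partitioning, where each core's marginal miss reduction is scaled by a latency-sensitivity weight before ways are handed out. All function names, curves, and weights here are illustrative assumptions.

```python
# Hypothetical sketch: greedy LLC way allocation where each core's marginal
# utility (miss reduction per extra way) is weighted by a latency-sensitivity
# factor, so latency-tolerant cores (e.g. a GPU) win fewer ways.

def partition_ways(miss_curves, sensitivity, total_ways):
    """miss_curves[c][w] = misses of core c when given w ways (w = 0..total_ways).
    sensitivity[c]     = latency-sensitivity weight for core c (higher = more sensitive).
    Greedily grants one way at a time to the core with the highest weighted gain."""
    alloc = [0] * len(miss_curves)
    for _ in range(total_ways):
        best, best_gain = None, -1.0
        for c, curve in enumerate(miss_curves):
            w = alloc[c]
            # Weighted miss reduction from granting this core one more way.
            gain = sensitivity[c] * (curve[w] - curve[w + 1])
            if gain > best_gain:
                best, best_gain = c, gain
        alloc[best] += 1
    return alloc

# Example: a latency-sensitive CPU core vs. a latency-tolerant GPU core.
cpu_curve = [100, 60, 35, 20, 12, 8, 6, 5, 5]    # misses vs. ways allocated
gpu_curve = [90, 80, 72, 66, 61, 57, 54, 52, 51]
print(partition_ways([cpu_curve, gpu_curve], [1.0, 0.2], 8))  # → [6, 2]
```

With equal sensitivity weights this degenerates to plain utility-based partitioning; the weight is what lets a GPU's high raw miss count stop dominating the allocation.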
REFERENCES
- R. Ausavarungnirun, K. K.-W. Chang, L. Subramanian, G. H. Loh, and O. Mutlu. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems. In ISCA, 2012.
- A. Bakhoda, G. Yuan, W. Fung, H. Wong, and T. Aamodt. Analyzing CUDA workloads using a detailed GPU simulator. In ISPASS, 2009.
- A. R. Brodtkorb, T. R. Hagen, and M. L. Sætra. Graphics processing unit (GPU) programming strategies and trends in GPU computing. J. Parallel Distrib. Comput., 2013.
- S. Che, M. Boyer, J. Meng, D. Tarjan, J. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In IISWC, 2009.
- M. Garrido and J. Grajal. Continuous-flow variable-length memoryless linear regression architecture. Electronics Letters, 2013.
- L. R. Hsu, S. K. Reinhardt, R. Iyer, and S. Makineni. Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a shared resource. In PACT, 2006.
- A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer. High performance cache replacement using re-reference interval prediction (RRIP). In ISCA, 2010.
- O. Kayiran, N. Nachiappan, A. Jog, R. Ausavarungnirun, M. Kandemir, G. Loh, O. Mutlu, and C. Das. Managing GPU concurrency in heterogeneous architectures. In MICRO, 2014.
- J. Lee and H. Kim. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture. In HPCA, 2012.
- X. Lin and R. Balasubramonian. Refining the utility metric for utility-based cache partitioning. In WDDD, 2011.
- J. Lotze, P. Sutton, and H. Lahlou. Many-core accelerated LIBOR swaption portfolio pricing. In SCC, 2012.
- V. Mekkat, A. Holey, P.-C. Yew, and A. Zhai. Managing shared last-level cache in a heterogeneous multicore processor. In PACT, 2013.
- A. Patel, F. Afram, S. Chen, and K. Ghose. MARSS: A full system simulator for multicore x86 CPUs. In DAC, 2011.
- M. K. Qureshi and Y. N. Patt. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In MICRO, 2006.
- B. M. Rogers, A. Krishna, G. B. Bell, K. Vu, X. Jiang, and Y. Solihin. Scaling the bandwidth wall: Challenges in and avenues for CMP scaling. SIGARCH Comput. Archit. News, 2009.
- P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. Computer Architecture Letters, 2011.
- G. Suh, L. Rudolph, and S. Devadas. Dynamic partitioning of shared cache memory. The Journal of Supercomputing, 2004.
- P.-H. Wang, G.-H. Liu, J.-C. Yeh, T.-M. Chen, H.-Y. Huang, C.-L. Yang, S.-L. Liu, and J. Greensky. Full system simulation framework for integrated CPU/GPU architecture. In VLSI-DAT, 2014.
- P.-H. Wang, C.-W. Lo, C.-L. Yang, and Y.-J. Cheng. A cycle-level SIMT-GPU simulation framework. In ISPASS, 2012.