APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs

Yunho Oh, Keunsoo Kim, Myung Kuk Yoon, Jong Hyun Park, Yongjun Park, Won Woo Ro, Murali Annavaram

Research output: Chapter in Book/Report/Conference proceedingConference contribution

19 Citations (Scopus)

Abstract

Long memory latency and limited throughput become performance bottlenecks of GPGPU applications. The latency takes hundreds of cycles which is difficult to be hidden by simply interleaving tens of warp execution. While cache hierarchy helps to reduce memory system pressure, massive Thread-Level Parallelism (TLP) often causes excessive cache contention. This paper proposes Adaptive PREfetching and Scheduling (APRES) to improve GPU cache efficiency. APRES relies on the following observations. First, certain static load instructions tend to generate memory addresses having very high locality. Second, although loads have no locality, the access addresses still can show highly strided access pattern. Third, the locality behavior tends to be consistent regardless of warp ID. APRES schedules warps so that as many cache hits generated as possible before any cache misses generated. This is to minimize cache thrashing when many warps are contending for a cache line. However, to realize this operation, it is required to predict which warp will hit the cache in the near future. Without directly predicting future cache hit/miss for each warp, APRES creates a group of warps that will execute the same load instruction in the near future. Based on the third observation, we expect the locality behavior is consistent over all warps in the group. If the first executed warp in the group hits the cache, then the load is considered as a high locality type, and APRES prioritizes all warps in the group. Group prioritization leads to consecutive cache hits, because the grouped warps are likely to access the same cache line. If the first warp missed the cache, then the load is considered as a strided type, and APRES generates prefetch requests for the other warps in the group. After that, APRES prioritizes prefetch targeted warps so that the demand requests are merged to Miss Status Holding Register (MSHR) or prefetched lines can be accessed. On memory-intensive applications, APRES achieves 31.7% performance improvement compared to the baseline GPU and 7.2% additional speedup compared to the best combination of existing warp scheduling and prefetching methods.

Original languageEnglish
Title of host publicationProceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages191-203
Number of pages13
ISBN (Electronic)9781467389471
DOIs
Publication statusPublished - 2016 Aug 24
Event43rd International Symposium on Computer Architecture, ISCA 2016 - Seoul, Korea, Republic of
Duration: 2016 Jun 182016 Jun 22

Publication series

NameProceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016

Other

Other43rd International Symposium on Computer Architecture, ISCA 2016
CountryKorea, Republic of
CitySeoul
Period16/6/1816/6/22

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Fingerprint Dive into the research topics of 'APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs'. Together they form a unique fingerprint.

  • Cite this

    Oh, Y., Kim, K., Yoon, M. K., Park, J. H., Park, Y., Ro, W. W., & Annavaram, M. (2016). APRES: Improving Cache Efficiency by Exploiting Load Characteristics on GPUs. In Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016 (pp. 191-203). [7551393] (Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ISCA.2016.26