Warped-preexecution: A GPU pre-execution approach for improving latency hiding

Sangpil Lee, Won Woo Ro, Keunsoo Kim, Gunjae Koo, Myung Kuk Yoon, Murali Annavaram

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

22 Citations (Scopus)

Abstract

This paper presents a pre-execution approach for improving GPU performance, called P-mode (pre-execution mode). GPUs utilize a large number of concurrent threads to hide the processing delay of operations. However, certain long-latency operations, such as off-chip memory accesses, often take hundreds of cycles and hence lead to stalls even in the presence of thread concurrency and fast thread-switching capability. It is unclear whether adding more threads can improve latency tolerance, due to increased memory contention. Further, adding more threads increases on-chip storage demands. Instead, we propose that when a warp is stalled on a long-latency operation it enters P-mode. In P-mode, a warp continues to fetch and decode successive instructions to identify any independent instruction that is not on the long-latency dependence chain. These independent instructions are then pre-executed. To tackle write-after-write and write-after-read hazards, output values produced during P-mode are written to renamed physical registers. We exploit register file underutilization to re-purpose a few unused registers to store the P-mode results. When a warp is switched from P-mode back to normal execution mode, it reuses the pre-executed results by reading the renamed registers. Any global load operation in P-mode is transformed into a pre-load that fetches data into the L1 cache to reduce future memory access penalties. Our evaluation shows a 23% performance improvement for memory-intensive applications, without negatively impacting other application categories.
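
The abstract's mechanism can be illustrated with a small sketch. The following is a minimal, hypothetical C++ model (not code from the paper) of the P-mode decision logic: once a warp stalls on a long-latency load, the instructions that follow are scanned; anything on the load's dependence chain is skipped, independent instructions are pre-executed into spare (renamed) registers to avoid WAW/WAR hazards, and independent global loads become cache-warming pre-loads. All structure names and register numbers here are assumptions for illustration only.

```cpp
// Hypothetical, simplified model of the P-mode pass described in the abstract.
#include <cstdio>
#include <set>
#include <vector>

struct Instr {
    int dst;                  // destination architectural register
    std::vector<int> srcs;    // source architectural registers
    bool is_global_load;      // candidate for conversion to a pre-load
};

// True if any source register depends (directly or transitively) on the
// stalled long-latency load, i.e. is in the "poisoned" set.
bool depends_on_chain(const Instr& in, const std::set<int>& poisoned) {
    for (int s : in.srcs)
        if (poisoned.count(s)) return true;
    return false;
}

// One P-mode pass over the instructions that follow the stalled load.
// `spare` models the unused physical registers re-purposed for renaming.
void pre_execute(const std::vector<Instr>& window,
                 std::set<int> poisoned,
                 std::vector<int> spare) {
    for (const Instr& in : window) {
        if (depends_on_chain(in, poisoned)) {
            // On the long-latency dependence chain: skip it and poison its
            // destination so later consumers are skipped as well.
            poisoned.insert(in.dst);
        } else if (in.is_global_load) {
            // Independent global load: issue as a pre-load that only warms
            // the L1 cache; no architectural result is produced yet.
            std::printf("pre-load  (dst r%d) -> L1 warm-up only\n", in.dst);
            poisoned.insert(in.dst);
        } else if (!spare.empty()) {
            int phys = spare.back();   // rename dst to dodge WAW/WAR hazards
            spare.pop_back();
            std::printf("pre-exec  r%d -> p%d\n", in.dst, phys);
        } else {
            poisoned.insert(in.dst);   // out of rename registers: skip
        }
    }
}

int main() {
    // r1 is the result of the stalled off-chip load.
    std::vector<Instr> window = {
        {2, {1},    false},   // depends on the load: skipped
        {3, {4, 5}, false},   // independent: pre-executed into a spare register
        {6, {7},    true},    // independent global load: becomes a pre-load
        {8, {2},    false},   // depends on a skipped result: skipped
    };
    pre_execute(window, /*poisoned=*/{1}, /*spare=*/{40, 41, 42});
}
```

On resuming normal mode, the warp would read the renamed registers (p40-p42 in this toy example) instead of re-executing those instructions, which is the reuse step the abstract describes.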

Original language: English
Title of host publication: Proceedings of the 2016 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2016
Publisher: IEEE Computer Society
Pages: 163-175
Number of pages: 13
ISBN (Electronic): 9781467392112
DOIs: 10.1109/HPCA.2016.7446062
Publication status: Published - 2016 Apr 1
Event: 22nd IEEE International Symposium on High Performance Computer Architecture, HPCA 2016 - Barcelona, Spain
Duration: 2016 Mar 12 - 2016 Mar 16

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
Volume: 2016-April
ISSN (Print): 1530-0897

Other

Other: 22nd IEEE International Symposium on High Performance Computer Architecture, HPCA 2016
Country: Spain
City: Barcelona
Period: 16/3/12 - 16/3/16

Fingerprint

  • Data storage equipment
  • Hazards
  • Graphics processing unit
  • Processing

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

Lee, S., Ro, W. W., Kim, K., Koo, G., Yoon, M. K., & Annavaram, M. (2016). Warped-preexecution: A GPU pre-execution approach for improving latency hiding. In Proceedings of the 2016 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2016 (pp. 163-175). [7446062] (Proceedings - International Symposium on High-Performance Computer Architecture; Vol. 2016-April). IEEE Computer Society. https://doi.org/10.1109/HPCA.2016.7446062