FineReg: Fine-grained register file management for augmenting GPU throughput

Yunho Oh, Myung Kuk Yoon, William Jinho Song, Won Woo Ro

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

Graphics processing units (GPUs) include a large amount of hardware resources for parallel thread execution. However, these resources are not fully utilized at runtime, and observed throughput often falls far below peak performance. A major cause is that GPUs cannot deploy a sufficient number of warps at runtime. The limited size of the register file constrains the number of cooperative thread arrays (CTAs), as one CTA takes up a few tens of kilobytes of registers. We observe that the actual working set of a CTA is generally much smaller, and therefore there is room for additional CTAs to run. In this paper, we propose a novel GPU architecture called FineReg that improves overall throughput by increasing the number of concurrent CTAs. In particular, FineReg splits the monolithic register file into two regions, one for active CTAs and another for pending CTAs. With FineReg, the GPU begins normal execution by allocating all registers required by the active CTAs. If all warps of a CTA become stalled, FineReg moves the live registers (i.e., the working set) of that CTA to the pending-CTA region and launches an additional CTA by assigning registers to the newly activated CTA. If the registers of either the active-CTA or the pending-CTA region are used up, FineReg stops introducing additional CTAs and simply performs context switching between active and pending CTAs. Thus, FineReg increases the number of concurrent CTAs by reducing the effective size of per-CTA registers. Experimental results show that FineReg achieves a 32.8% performance improvement over a conventional GPU architecture.
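
The policy described in the abstract can be illustrated with a toy model. The following Python sketch is not the authors' implementation; it is a minimal illustration under simplifying assumptions that do not come from the paper: every CTA needs the same full allocation (`regs_per_cta`) while active and the same number of live registers (`live_per_cta`) while pending, the region sizes are fixed, and a pending CTA is assumed ready to run when it is switched back in. All class, method, and parameter names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FineRegModel:
    """Toy model of a register file split into active and pending regions."""
    active_capacity: int    # registers in the active-CTA region (assumed fixed)
    pending_capacity: int   # registers in the pending-CTA region (assumed fixed)
    regs_per_cta: int       # full register allocation of a running CTA
    live_per_cta: int       # live registers (working set) of a stalled CTA
    active: List[int] = field(default_factory=list)
    pending: List[int] = field(default_factory=list)
    next_id: int = 0

    def _launch(self) -> None:
        # Activate a brand-new CTA with a full register allocation.
        self.active.append(self.next_id)
        self.next_id += 1

    def start(self) -> None:
        # Normal execution: fill the active region with fully allocated CTAs.
        while (len(self.active) + 1) * self.regs_per_cta <= self.active_capacity:
            self._launch()

    def on_cta_stalled(self, cta: int) -> str:
        """Handle the event that all warps of an active CTA are stalled."""
        if cta not in self.active:
            raise ValueError(f"CTA {cta} is not active")
        used = len(self.pending) * self.live_per_cta
        if self.pending_capacity - used >= self.live_per_cta:
            # Room in the pending region: park the live registers there
            # and launch an additional CTA in the freed active slot.
            self.active.remove(cta)
            self.pending.append(cta)
            self._launch()
            return "launched"
        if self.pending:
            # Pending region is used up: stop adding CTAs and context
            # switch the stalled CTA with the oldest pending one.
            resumed = self.pending.pop(0)
            self.active.remove(cta)
            self.pending.append(cta)
            self.active.append(resumed)
            return "switched"
        return "stalled"  # no pending region available at all

    def concurrent_ctas(self) -> int:
        return len(self.active) + len(self.pending)


if __name__ == "__main__":
    m = FineRegModel(active_capacity=64, pending_capacity=16,
                     regs_per_cta=32, live_per_cta=8)
    m.start()
    print(m.on_cta_stalled(0), m.on_cta_stalled(1), m.on_cta_stalled(2))
    print("concurrent CTAs:", m.concurrent_ctas())
```

With these made-up sizes, a conventional allocation of 32 registers per CTA would fit only two CTAs in the same 80 registers, whereas the toy model ends up holding four (two active, two pending). The numbers are illustrative only and say nothing about the 32.8% result reported above.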

Original language: English
Title of host publication: Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
Publisher: IEEE Computer Society
Pages: 364-376
Number of pages: 13
ISBN (Electronic): 9781538662403
DOIs: https://doi.org/10.1109/MICRO.2018.00037
Publication status: Published - 2018 Dec 12
Event: 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018 - Fukuoka, Japan
Duration: 2018 Oct 20 → 2018 Oct 24

Publication series

Name: Proceedings of the Annual International Symposium on Microarchitecture, MICRO
Volume: 2018-October
ISSN (Print): 1072-4451

Other

Other: 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
Country: Japan
City: Fukuoka
Period: 18/10/20 → 18/10/24

Fingerprint

Throughput
Graphics processing unit
Hardware
Experiments

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

Oh, Y., Yoon, M. K., Song, W. J., & Ro, W. W. (2018). FineReg: Fine-grained register file management for augmenting GPU throughput. In Proceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018 (pp. 364-376). [8574554] (Proceedings of the Annual International Symposium on Microarchitecture, MICRO; Vol. 2018-October). IEEE Computer Society. https://doi.org/10.1109/MICRO.2018.00037