Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit

Myung Kuk Yoon, Keunsoo Kim, Sangpil Lee, Won Woo Ro, Murali Annavaram

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

24 Citations (Scopus)

Abstract

Modern GPUs require tens of thousands of concurrent threads to fully utilize their massive amount of processing resources. However, thread concurrency in GPUs can be diminished either by a shortage of thread scheduling structures (the scheduling limit), such as available program counters and single-instruction, multiple-thread (SIMT) stacks, or by a shortage of on-chip memory (the capacity limit), such as the register file and shared memory. Our evaluations show that, in practice, concurrency in many general-purpose applications running on GPUs is curtailed by the scheduling limit rather than the capacity limit. Maximizing the utilization of on-chip memory resources without unduly increasing the scheduling complexity is a key goal of this paper. This paper proposes a Virtual Thread (VT) architecture that assigns Cooperative Thread Arrays (CTAs) up to the capacity limit while ignoring the scheduling limit. To reduce the logic complexity of managing more threads concurrently, we propose placing CTAs into active and inactive states, such that the number of active CTAs still respects the scheduling limit. When all the warps in an active CTA hit a long-latency stall, the active CTA is context-switched out and the next ready CTA takes its place. We exploit the fact that both active and inactive CTAs still fit within the capacity limit, which obviates the need to save and restore large amounts of CTA state. Thus, VT significantly reduces the performance penalties of CTA swapping. By swapping between active and inactive states, VT can exploit a higher degree of thread-level parallelism without increasing logic complexity. Our simulation results show that VT improves performance by 23.9% on average.
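To make the swapping policy in the abstract concrete, the following is a minimal sketch, not the paper's implementation: CTAs are admitted up to the capacity limit, only a scheduling-limit-sized subset is active, and a fully stalled active CTA is swapped for a ready inactive one. All names here (Cta, VtScheduler, the two limit constants) are illustrative assumptions rather than structures from the paper.

```python
# Sketch of the Virtual Thread active/inactive CTA swapping policy.
# Assumptions: limits, class names, and the stall model are hypothetical.

from collections import deque
from dataclasses import dataclass, field

SCHEDULING_LIMIT = 8    # max CTAs with live scheduling state (PCs, SIMT stacks)
CAPACITY_LIMIT = 16     # max CTAs whose registers/shared memory fit on chip

@dataclass
class Cta:
    cta_id: int
    num_warps: int = 4
    stalled_warps: set = field(default_factory=set)

    def fully_stalled(self) -> bool:
        # VT's swap trigger: every warp waits on a long-latency event.
        return len(self.stalled_warps) == self.num_warps

class VtScheduler:
    def __init__(self, ctas):
        # All CTAs are resident on chip, but only the first
        # SCHEDULING_LIMIT of them hold scheduling state.
        assert len(ctas) <= CAPACITY_LIMIT
        self.active = list(ctas[:SCHEDULING_LIMIT])
        self.inactive = deque(ctas[SCHEDULING_LIMIT:])

    def tick(self):
        # Swap any fully stalled active CTA for the next inactive one.
        # Because both already fit within the capacity limit, only the
        # small scheduling state moves; registers and shared memory
        # never leave the chip, which keeps the swap cheap.
        for i, cta in enumerate(self.active):
            if cta.fully_stalled() and self.inactive:
                self.inactive.append(cta)
                self.active[i] = self.inactive.popleft()
```

Under this model, increasing CAPACITY_LIMIT beyond SCHEDULING_LIMIT is what exposes the extra thread-level parallelism: the scheduler always has a ready CTA to hide a long-latency stall without any off-chip state save/restore.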

Original language: English
Title of host publication: Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 609-621
Number of pages: 13
ISBN (Electronic): 9781467389471
DOIs: 10.1109/ISCA.2016.59
Publication status: Published - 2016 Aug 24
Event: 43rd International Symposium on Computer Architecture, ISCA 2016 - Seoul, Korea, Republic of
Duration: 2016 Jun 18 - 2016 Jun 22

Publication series

Name: Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016

Other

Other: 43rd International Symposium on Computer Architecture, ISCA 2016
Country: Korea, Republic of
City: Seoul
Period: 16/6/18 - 16/6/22


All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

Yoon, M. K., Kim, K., Lee, S., Ro, W. W., & Annavaram, M. (2016). Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit. In Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016 (pp. 609-621). [7551426] (Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISCA.2016.59