Warped-Slicer

Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming

Qiumin Xu, Hyeran Jeon, Keunsoo Kim, Won Woo Ro, Murali Annavaram

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

33 Citations (Scopus)

Abstract

As technology scales, GPUs are forecast to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. Even with the best effort, however, exposing massive thread-level parallelism from a single GPU kernel, particularly from general-purpose applications, remains a difficult challenge. In some cases, even if a kernel has sufficient thread-level parallelism, there may not be enough memory bandwidth available to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general-purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve this resource underutilization. There is growing hardware support for multiprogramming on GPUs: Hyper-Q, introduced in the Kepler architecture, enables multiple kernels to be invoked via tens of hardware queue streams, and spatial multitasking has been proposed to partition GPU resources across multiple kernels. But that partitioning is done at the coarse granularity of streaming multiprocessors (SMs), where each kernel is assigned to a subset of SMs. In this paper, we advocate partitioning a single SM across multiple kernels, which we term intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to run multiple kernels concurrently on the SM. Our results show that no single intra-SM slicing strategy yields the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method to calculate the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more of its thread blocks are assigned to an SM. The model takes into account the interference effects of shared resource usage across multiple kernels. It is also computationally efficient and can determine the resource partitioning quickly, enabling dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
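The abstract summarizes the mechanism at a high level: profile how each kernel's performance scales with the number of its thread blocks resident on one SM, then choose the block split that maximizes aggregate performance within the SM's resource limits. The Python sketch below illustrates only that selection step under simplified assumptions; the kernel names, profile numbers, resource costs, and brute-force search are hypothetical placeholders rather than the paper's actual model, which additionally accounts for shared-resource interference and is designed to run online with low overhead.

# Illustrative sketch only: the values and kernel names below are hypothetical,
# not taken from the paper; they stand in for the short online profile runs the
# abstract describes.

from itertools import product

# Hypothetical profiled performance (e.g., normalized IPC) of each kernel as a
# function of how many of its thread blocks are resident on one SM.
profile = {
    "kernelA": {1: 0.40, 2: 0.70, 3: 0.85, 4: 0.90},   # keeps scaling with more blocks
    "kernelB": {1: 0.55, 2: 0.75, 3: 0.78, 4: 0.78},   # saturates early (e.g., memory-bound)
}

# Hypothetical per-thread-block resource demands and per-SM capacities.
cost = {"kernelA": {"regs": 16384, "smem": 8192,  "threads": 256},
        "kernelB": {"regs": 8192,  "smem": 16384, "threads": 128}}
capacity = {"regs": 65536, "smem": 49152, "threads": 2048}

def fits(blocks):
    """Check that the combined per-SM resource usage stays within capacity."""
    for resource, cap in capacity.items():
        if sum(cost[k][resource] * n for k, n in blocks.items()) > cap:
            return False
    return True

def best_partition():
    """Enumerate feasible block splits and keep the one with the highest
    aggregate profiled performance (the maximization objective in the abstract)."""
    best, best_score = None, -1.0
    for n_a, n_b in product(profile["kernelA"], profile["kernelB"]):
        blocks = {"kernelA": n_a, "kernelB": n_b}
        if not fits(blocks):
            continue
        score = profile["kernelA"][n_a] + profile["kernelB"][n_b]
        if score > best_score:
            best, best_score = blocks, score
    return best, best_score

print(best_partition())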

Original language: English
Title of host publication: Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 230-242
Number of pages: 13
ISBN (Electronic): 9781467389471
DOI: 10.1109/ISCA.2016.29
Publication status: Published - 2016 Aug 24
Event: 43rd International Symposium on Computer Architecture, ISCA 2016 - Seoul, Korea, Republic of
Duration: 2016 Jun 18 - 2016 Jun 22

Publication series

Name: Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016

Other

Other: 43rd International Symposium on Computer Architecture, ISCA 2016
Country: Korea, Republic of
City: Seoul
Period: 16/6/18 - 16/6/22

Fingerprint

  • Multiprogramming
  • Hardware
  • Multitasking
  • Graphics processing unit
  • Decision making
  • Bandwidth
  • Data storage equipment

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

Xu, Q., Jeon, H., Kim, K., Ro, W. W., & Annavaram, M. (2016). Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming. In Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016 (pp. 230-242). [7551396] (Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ISCA.2016.29