As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is a growing hardware support for multiprogramming on GPUs. Hyper-Q has been introduced in the Kepler architecture which enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs) where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term as intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that there is not one intra-SM slicing strategy that derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernel's performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly to enable dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.
|Title of host publication||Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016|
|Publisher||Institute of Electrical and Electronics Engineers Inc.|
|Number of pages||13|
|Publication status||Published - 2016 Aug 24|
|Event||43rd International Symposium on Computer Architecture, ISCA 2016 - Seoul, Korea, Republic of|
Duration: 2016 Jun 18 → 2016 Jun 22
|Name||Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016|
|Other||43rd International Symposium on Computer Architecture, ISCA 2016|
|Country||Korea, Republic of|
|Period||16/6/18 → 16/6/22|
Bibliographical notePublisher Copyright:
© 2016 IEEE.
All Science Journal Classification (ASJC) codes
- Hardware and Architecture