Boosting CUDA applications with CPU-GPU hybrid computing

Changmin Lee, Won Woo Ro, Jean Luc Gaudiot

Research output: Contribution to journal › Article

19 Citations (Scopus)

Abstract

This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08× in the best case and 1.42× on average compared to the baseline GPU-only processing.

Original language: English
Pages (from-to): 384-404
Number of pages: 21
Journal: International Journal of Parallel Programming
Volume: 42
Issue number: 2
DOI: 10.1007/s10766-013-0252-y
Publication status: Published - 2014 Jan 1

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Software
  • Information Systems

Cite this

Lee, C., Ro, W. W., & Gaudiot, J. L. (2014). Boosting CUDA applications with CPU-GPU hybrid computing. International Journal of Parallel Programming, 42(2), 384-404.
@article{a6d487afcdd24d45ad35afe30035cf8d,
title = "Boosting CUDA applications with CPU-GPU hybrid computing",
abstract = "This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups of 3.08× in the best case and 1.42× on average compared to the baseline GPU-only processing.",
author = "Changmin Lee and Ro, {Won Woo} and Gaudiot, {Jean Luc}",
year = "2014",
month = "1",
day = "1",
doi = "10.1007/s10766-013-0252-y",
language = "English",
volume = "42",
pages = "384--404",
journal = "International Journal of Parallel Programming",
issn = "0885-7458",
publisher = "Springer New York",
number = "2",
}
