Dynamic partitioning-based JPEG decompression on heterogeneous multicore architectures

Wasuwee Sodsong, Jingun Hong, Seongwook Chung, Yeongkyu Lim, Shin-Dug Kim, Bernd Burgstaller

Research output: Contribution to journal › Article

2 Citations (Scopus)

Abstract

With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets, and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable of joining forces of a system's CPU and graphics processing unit (GPU) for JPEG decoding. In this paper, we introduce a novel JPEG decoding scheme for heterogeneous architectures consisting of a CPU and a general-purpose GPU. We employ an offline profiling step to determine the performance of a system's CPU and GPU with respect to JPEG decoding. For a given JPEG image, our performance model uses: (1) the CPU and GPU performance characteristics, (2) the image entropy, and (3) the width and height of the image to balance the JPEG decoding workload on the underlying hardware. Our run-time partitioning and scheduling scheme exploits task, data, and pipeline parallelism by scheduling the non-parallelizable entropy-decoding task on the CPU, whereas inverse discrete cosine transformations, color conversions, and upsampling are conducted on both the CPU and the GPU. We have implemented the proposed method in the context of the libjpeg-turbo library, which is an industrial-strength JPEG encoding and decoding engine. Libjpeg-turbo's hand-optimized SIMD routines for ARM and x86 architectures constitute a competitive yardstick for the comparison with the proposed approach. We have evaluated our approach for a total of 7194 JPEG images across four high-end and middle-end CPU-GPU combinations including a mobile GPU. We achieve speedups of up to 5.2× over the SIMD version of libjpeg-turbo, and speedups of up to 10.5× over its sequential code. Taking into account the non-parallelizable JPEG entropy-decoding part, our approach achieves up to 97% of the theoretically attainable maximal speedup, with an average of 94%.
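The workload-balancing idea the abstract describes can be sketched with a simple linear cost model. The snippet below is an illustration only, not the authors' actual performance model (which additionally incorporates image entropy and offline-profiled device characteristics); the rate parameters and function names are hypothetical. It shows how a GPU work share follows from relative device throughput, and how the sequential entropy-decoding stage bounds the attainable speedup in Amdahl's-law fashion:

```python
# Illustrative sketch of heterogeneous work partitioning for JPEG decoding.
# NOT the paper's model: a plain linear cost model with hypothetical rates.

def gpu_share(cpu_rate: float, gpu_rate: float) -> float:
    """Fraction of the parallelizable stages (IDCT, color conversion,
    upsampling) to offload to the GPU so that both devices finish at the
    same time, assuming cost is proportional to the amount of image data."""
    return gpu_rate / (cpu_rate + gpu_rate)

def amdahl_bound(seq_fraction: float, parallel_speedup: float) -> float:
    """Upper bound on overall speedup when a fraction of the total work
    (here: entropy decoding) is inherently sequential."""
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / parallel_speedup)

if __name__ == "__main__":
    # If the GPU processes the parallel stages 3x faster than the CPU,
    # balance assigns it 75% of those stages.
    print(gpu_share(cpu_rate=100.0, gpu_rate=300.0))
    # With 20% sequential entropy decoding, even an 8x-faster parallel
    # section caps the overall speedup well below 8x.
    print(amdahl_bound(seq_fraction=0.2, parallel_speedup=8.0))
```

This is why the paper reports speedups relative to the "theoretically attainable maximal speedup": the entropy-decoding stage, pinned to the CPU, limits what any CPU-GPU partitioning can achieve.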

Original language: English
Pages (from-to): 517-536
Number of pages: 20
Journal: Concurrency and Computation: Practice and Experience
Volume: 28
Issue number: 2
DOI: 10.1002/cpe.3620
Publication status: Published - 2016 Feb 1

Fingerprint

  • Graphics processing unit
  • Program processors
  • Decoding
  • Partitioning
  • Entropy
  • Scheduling
  • Hardware
  • Image model
  • Smartphones
  • Photography
  • Performance model
  • Architecture
  • Profiling
  • Social networks
  • Parallelism
  • Workload
  • Encoding
  • Speedup

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Computer Science Applications
  • Computer Networks and Communications
  • Computational Theory and Mathematics

Cite this

Sodsong, Wasuwee; Hong, Jingun; Chung, Seongwook; Lim, Yeongkyu; Kim, Shin-Dug; Burgstaller, Bernd. / Dynamic partitioning-based JPEG decompression on heterogeneous multicore architectures. In: Concurrency and Computation: Practice and Experience. 2016; Vol. 28, No. 2, pp. 517-536.
@article{a394306d088c44dca845f2fcd39624cd,
title = "Dynamic partitioning-based JPEG decompression on heterogeneous multicore architectures",
author = "Wasuwee Sodsong and Jingun Hong and Seongwook Chung and Yeongkyu Lim and Shin-Dug Kim and Bernd Burgstaller",
year = "2016",
month = "2",
day = "1",
doi = "10.1002/cpe.3620",
language = "English",
volume = "28",
pages = "517--536",
journal = "Concurrency and Computation: Practice and Experience",
issn = "1532-0626",
publisher = "John Wiley and Sons Ltd",
number = "2",

}
