Scalable speculative parallelization on commodity clusters

Hanjun Kim, Arun Raman, Feng Liu, Jae W. Lee, David I. August

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

20 Citations (Scopus)

Abstract

While clusters of commodity servers and switches are the most popular form of large-scale parallel computers, many programs are not easily parallelized for execution upon them. In particular, high inter-node communication cost and lack of globally shared memory appear to make clusters suitable only for server applications with abundant task-level parallelism and scientific applications with regular and independent units of work. Clever use of pipeline parallelism (DSWP), thread-level speculation (TLS), and speculative pipeline parallelism (Spec-DSWP) can mitigate the costs of inter-thread communication on shared memory multicore machines. This paper presents Distributed Software Multi-threaded Transactional memory (DSMTX), a runtime system which makes these techniques applicable to non-shared memory clusters, allowing them to efficiently address inter-node communication costs. Initial results suggest that DSMTX enables efficient cluster execution of a wider set of application types. For 11 sequential C programs parallelized for a 4-core 32-node (128 total core) cluster without shared memory, DSMTX achieves a geomean speedup of 49×. This compares favorably to the 15× speedup achieved by our implementation of TLS-only support for clusters.
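
As a rough reading of the figures quoted in the abstract, the sketch below converts the reported geomean speedups into parallel efficiency on the 128-core cluster and shows how a geometric mean is computed. The function names and the two-element input used to illustrate `geomean` are assumptions made for illustration; the per-benchmark speedups are not listed here, so only the aggregate 49× and 15× figures and the cluster size come from the abstract.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers, computed through logarithms."""
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


# Figures taken from the abstract.
NODES = 32
CORES_PER_NODE = 4
CORES = NODES * CORES_PER_NODE  # 128 cores in total
DSMTX_SPEEDUP = 49.0            # geomean speedup reported for DSMTX
TLS_ONLY_SPEEDUP = 15.0         # speedup reported for TLS-only support

if __name__ == "__main__":
    print(f"DSMTX efficiency:    {parallel_efficiency(DSMTX_SPEEDUP, CORES):.1%}")
    print(f"TLS-only efficiency: {parallel_efficiency(TLS_ONLY_SPEEDUP, CORES):.1%}")
    print(f"DSMTX vs. TLS-only:  {DSMTX_SPEEDUP / TLS_ONLY_SPEEDUP:.2f}x")
    # Hypothetical inputs, only to show the geomean calculation itself.
    print(f"geomean([2, 8]) = {geomean([2.0, 8.0]):.1f}")
```

Under these figures, DSMTX reaches roughly 38% of ideal linear speedup on 128 cores, against roughly 12% for the TLS-only implementation, a ratio of about 3.3×.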

Original language: English
Title of host publication: Proceedings - 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010
Pages: 3-14
Number of pages: 12
DOI: https://doi.org/10.1109/MICRO.2010.19
Publication status: Published - 2010 Dec 1
Event: 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010 - Atlanta, GA, United States
Duration: 2010 Dec 4 to 2010 Dec 8

Publication series

Name: Proceedings of the Annual International Symposium on Microarchitecture, MICRO
ISSN (Print): 1072-4451

Other

Other: 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010
Country: United States
City: Atlanta, GA
Period: 2010 Dec 4 to 2010 Dec 8

Fingerprint

Data storage equipment
Communication
Servers
Pipelines
Costs
Switches

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

Kim, H., Raman, A., Liu, F., Lee, J. W., & August, D. I. (2010). Scalable speculative parallelization on commodity clusters. In Proceedings - 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010 (pp. 3-14). [5695521] (Proceedings of the Annual International Symposium on Microarchitecture, MICRO). https://doi.org/10.1109/MICRO.2010.19
@inproceedings{2f06b51d69ba49159d3fee0cd2d6740b,
title = "Scalable speculative parallelization on commodity clusters",
author = "Hanjun Kim and Arun Raman and Feng Liu and Lee, {Jae W.} and August, {David I.}",
year = "2010",
month = "12",
day = "1",
doi = "10.1109/MICRO.2010.19",
language = "English",
isbn = "9780769542997",
series = "Proceedings of the Annual International Symposium on Microarchitecture, MICRO",
pages = "3--14",
booktitle = "Proceedings - 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010",

}

TY - GEN
T1 - Scalable speculative parallelization on commodity clusters
AU - Kim, Hanjun
AU - Raman, Arun
AU - Liu, Feng
AU - Lee, Jae W.
AU - August, David I.
PY - 2010/12/1
Y1 - 2010/12/1
UR - http://www.scopus.com/inward/record.url?scp=79951708803&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951708803&partnerID=8YFLogxK
U2 - 10.1109/MICRO.2010.19
DO - 10.1109/MICRO.2010.19
M3 - Conference contribution
AN - SCOPUS:79951708803
SN - 9780769542997
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 3
EP - 14
BT - Proceedings - 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2010
ER -
