An in-depth performance analysis of many-integrated core for communication efficient heterogeneous computing

Jie Zhang, Myoungsoo Jung

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip to offer high degrees of parallelism. Parallel user applications executed across the many cores of one or more MICs require a series of operations for data sharing and synchronization with the host. In this work, we build a real CPU+MIC heterogeneous cluster and analyze its performance behaviors by examining different communication methods such as message passing and remote direct memory access. Our evaluation results and in-depth studies reveal that (i) aggregating small messages can improve network bandwidth without violating latency restrictions, (ii) while MICs can execute hundreds of hardware cores, the highest network throughput is achieved when only 4 to 6 point-to-point connections are established for data communication, and (iii) data communication over multiple point-to-point connections between host and MICs introduces severe load imbalance, which needs to be optimized for future heterogeneous computing.

Original language: English
Title of host publication: Network and Parallel Computing - 14th IFIP WG 10.3 International Conference, NPC 2017, Proceedings
Editors: Xuanhua Shi, Mahmut Kandemir, Hong An, Chao Wang, Hai Jin
Publisher: Springer Verlag
Pages: 155-159
Number of pages: 5
ISBN (Print): 9783319682099
DOIs: https://doi.org/10.1007/978-3-319-68210-5_19
Publication status: Published - 2017 Jan 1
Event: 14th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2017 - Hefei, China
Duration: 2017 Oct 20 → 2017 Oct 21

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10578 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 14th IFIP WG 10.3 International Conference on Network and Parallel Computing, NPC 2017
Country: China
City: Hefei
Period: 17/10/20 → 17/10/21

Fingerprint

Heterogeneous Computing
Performance Analysis
Data Communication
Communication
Message passing
Program processors
Data Sharing
Many-core
Synchronization
Message Passing
Throughput
Parallelism
Hardware
Bandwidth
Latency
Data storage equipment
Chip
Restriction
Series
Evaluation

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Zhang, J., & Jung, M. (2017). An in-depth performance analysis of many-integrated core for communication efficient heterogeneous computing. In X. Shi, M. Kandemir, H. An, C. Wang, & H. Jin (Eds.), Network and Parallel Computing - 14th IFIP WG 10.3 International Conference, NPC 2017, Proceedings (pp. 155-159). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10578 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-68210-5_19
@inproceedings{e41fb49989d14bd9832d7213d816919e,
title = "An in-depth performance analysis of many-integrated core for communication efficient heterogeneous computing",
abstract = "Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip to offer high degrees of parallelism. The parallel user applications executed across many cores that exist in one or more MICs require a series of work related to data sharing and synchronization with the host. In this work, we build a real CPU+MIC heterogeneous cluster and analyze its performance behaviors by examining different communication methods such as message passing method and remote direct memory accesses. Our evaluation results and in-depth studies reveal that (i) aggregating small messages can improve network bandwidth without violating latency restrictions, (ii) while MICs can execute hundreds of hardware cores, the highest network throughput is achieved when only 4 ~ 6 point-to-point connections are established for data communication, (iii) data communication over multiple point-to-point connections between host and MICs introduce severe load unbalancing, which require to be optimized for future heterogeneous computing.",
author = "Jie Zhang and Myoungsoo Jung",
year = "2017",
month = "1",
day = "1",
doi = "10.1007/978-3-319-68210-5_19",
language = "English",
isbn = "9783319682099",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "155--159",
editor = "Xuanhua Shi and Mahmut Kandemir and Hong An and Chao Wang and Hai Jin",
booktitle = "Network and Parallel Computing - 14th IFIP WG 10.3 International Conference, NPC 2017, Proceedings",
address = "Germany",
}

TY - GEN
T1 - An in-depth performance analysis of many-integrated core for communication efficient heterogeneous computing
AU - Zhang, Jie
AU - Jung, Myoungsoo
PY - 2017/1/1
Y1 - 2017/1/1
N2 - Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip to offer high degrees of parallelism. The parallel user applications executed across many cores that exist in one or more MICs require a series of work related to data sharing and synchronization with the host. In this work, we build a real CPU+MIC heterogeneous cluster and analyze its performance behaviors by examining different communication methods such as message passing method and remote direct memory accesses. Our evaluation results and in-depth studies reveal that (i) aggregating small messages can improve network bandwidth without violating latency restrictions, (ii) while MICs can execute hundreds of hardware cores, the highest network throughput is achieved when only 4 ~ 6 point-to-point connections are established for data communication, (iii) data communication over multiple point-to-point connections between host and MICs introduce severe load unbalancing, which require to be optimized for future heterogeneous computing.
AB - Many-integrated core (MIC) architecture combines dozens of reduced x86 cores onto a single chip to offer high degrees of parallelism. The parallel user applications executed across many cores that exist in one or more MICs require a series of work related to data sharing and synchronization with the host. In this work, we build a real CPU+MIC heterogeneous cluster and analyze its performance behaviors by examining different communication methods such as message passing method and remote direct memory accesses. Our evaluation results and in-depth studies reveal that (i) aggregating small messages can improve network bandwidth without violating latency restrictions, (ii) while MICs can execute hundreds of hardware cores, the highest network throughput is achieved when only 4 ~ 6 point-to-point connections are established for data communication, (iii) data communication over multiple point-to-point connections between host and MICs introduce severe load unbalancing, which require to be optimized for future heterogeneous computing.
UR - http://www.scopus.com/inward/record.url?scp=85032866027&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85032866027&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-68210-5_19
DO - 10.1007/978-3-319-68210-5_19
M3 - Conference contribution
AN - SCOPUS:85032866027
SN - 9783319682099
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 155
EP - 159
BT - Network and Parallel Computing - 14th IFIP WG 10.3 International Conference, NPC 2017, Proceedings
A2 - Shi, Xuanhua
A2 - Kandemir, Mahmut
A2 - An, Hong
A2 - Wang, Chao
A2 - Jin, Hai
PB - Springer Verlag
ER -