Architectural Reliability: Lifetime Reliability Characterization and Management of Many-Core Processors

William Song, Saibal Mukhopadhyay, Sudhakar Yalamanchili

Research output: Contribution to journal › Article

10 Citations (Scopus)

Abstract

This paper presents a lifetime reliability characterization of many-core processors based on a full-system simulation of integrated microarchitecture, power, thermal, and reliability models. Under normal operating conditions, our model and analysis reveal that the mean-time-to-failure of the cores on a die follows a normal distribution. From the processor-level perspective, the key insight is that reducing the variance of this distribution can improve lifetime reliability by avoiding early failures. Based on this understanding, we present two variance reduction techniques for proactive reliability management: i) proportional dynamic voltage-frequency scaling (DVFS) and ii) coordinated thread swapping. A major advantage of variance reduction techniques is that they improve system lifetime reliability without adding design margins or spare components.
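The variance argument in the abstract can be illustrated with a small Monte Carlo sketch. This is not the paper's simulation framework: it assumes, purely for illustration, that per-core mean-time-to-failure values are independent draws from a normal distribution and that the earliest core failure is the event of interest. The function name `expected_first_failure` and all numeric values (a mean of 10 time units, 64 cores, standard deviations of 2.0 and 0.5) are hypothetical.

```python
import random
import statistics


def expected_first_failure(mean_mttf, std_mttf, num_cores, trials=2000, seed=0):
    """Monte Carlo estimate of the expected earliest per-core MTTF on a die.

    Assumption (not from the paper): core MTTFs are independent samples
    from Normal(mean_mttf, std_mttf).
    """
    rng = random.Random(seed)
    earliest = []
    for _ in range(trials):
        # One hypothetical MTTF sample per core, in arbitrary time units.
        samples = [rng.gauss(mean_mttf, std_mttf) for _ in range(num_cores)]
        # The earliest failure on the die is the minimum over all cores.
        earliest.append(min(samples))
    return statistics.mean(earliest)


if __name__ == "__main__":
    # Same mean, different spread: only the variance changes.
    wide = expected_first_failure(10.0, 2.0, 64)
    narrow = expected_first_failure(10.0, 0.5, 64)
    print(f"std=2.0 -> expected first failure: {wide:.2f}")
    print(f"std=0.5 -> expected first failure: {narrow:.2f}")
```

With the mean held fixed, the narrower distribution yields a later expected first failure, which is the sense in which variance reduction avoids early failures. How DVFS or thread swapping would actually narrow the distribution is outside the scope of this sketch.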

Original language: English
Article number: 6860268
Pages (from-to): 103-106
Number of pages: 4
Journal: IEEE Computer Architecture Letters
Volume: 14
Issue number: 2
DOI: 10.1109/LCA.2014.2340873
Publication status: Published - 2015 Jul 1

Fingerprint

Normal distribution
Electric potential
Hot Temperature

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

@article{1f38d6c49f30421684da04a9393ae9de,
title = "Architectural Reliability: Lifetime Reliability Characterization and Management of Many-Core Processors",
abstract = "This paper presents a lifetime reliability characterization of many-core processors based on a full-system simulation of integrated microarchitecture, power, thermal, and reliability models. Under normal operating conditions, our model and analysis reveal that the mean-time-to-failure of the cores on a die follows a normal distribution. From the processor-level perspective, the key insight is that reducing the variance of this distribution can improve lifetime reliability by avoiding early failures. Based on this understanding, we present two variance reduction techniques for proactive reliability management: i) proportional dynamic voltage-frequency scaling (DVFS) and ii) coordinated thread swapping. A major advantage of variance reduction techniques is that they improve system lifetime reliability without adding design margins or spare components.",
author = "William Song and Saibal Mukhopadhyay and Sudhakar Yalamanchili",
year = "2015",
month = "7",
day = "1",
doi = "10.1109/LCA.2014.2340873",
language = "English",
volume = "14",
pages = "103--106",
journal = "IEEE Computer Architecture Letters",
issn = "1556-6056",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "2",

}

Architectural Reliability: Lifetime Reliability Characterization and Management of Many-Core Processors. / Song, William; Mukhopadhyay, Saibal; Yalamanchili, Sudhakar.

In: IEEE Computer Architecture Letters, Vol. 14, No. 2, 6860268, 01.07.2015, p. 103-106.

TY - JOUR

T1 - Architectural Reliability

T2 - Lifetime Reliability Characterization and Management of Many-Core Processors

AU - Song, William

AU - Mukhopadhyay, Saibal

AU - Yalamanchili, Sudhakar

PY - 2015/7/1

Y1 - 2015/7/1

AB - This paper presents a lifetime reliability characterization of many-core processors based on a full-system simulation of integrated microarchitecture, power, thermal, and reliability models. Under normal operating conditions, our model and analysis reveal that the mean-time-to-failure of the cores on a die follows a normal distribution. From the processor-level perspective, the key insight is that reducing the variance of this distribution can improve lifetime reliability by avoiding early failures. Based on this understanding, we present two variance reduction techniques for proactive reliability management: i) proportional dynamic voltage-frequency scaling (DVFS) and ii) coordinated thread swapping. A major advantage of variance reduction techniques is that they improve system lifetime reliability without adding design margins or spare components.

UR - http://www.scopus.com/inward/record.url?scp=84961744127&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84961744127&partnerID=8YFLogxK

U2 - 10.1109/LCA.2014.2340873

DO - 10.1109/LCA.2014.2340873

M3 - Article

AN - SCOPUS:84961744127

VL - 14

SP - 103

EP - 106

JO - IEEE Computer Architecture Letters

JF - IEEE Computer Architecture Letters

SN - 1556-6056

IS - 2

M1 - 6860268

ER -