Design and evaluation of a hierarchical decoupled architecture

Won Woo Ro, Stephen P. Crago, Alvin M. Despain, Jean Luc Gaudiot

Research output: Contribution to journal (Article)

7 Citations (Scopus)

Abstract

The speed gap between the processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles to pipeline stalls. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them rely strongly on the predictability of future accesses and often fail when memory accesses exhibit little locality. To address the long latency of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, HiDISC (Hierarchical Decoupled Instruction Stream Computer). The design is motivated by the traditional decoupled architecture concept and its limits. HiDISC adds a prefetching processor on top of a traditional access/execute architecture. The design aims to provide low memory access latency by separating otherwise sequential code into three decoupled streams and executing each stream on its own dedicated processor. The three streams act in concert to mask long access latencies by delivering the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. The detailed hardware design and the performance evaluation are carried out with an architectural simulator and compiling tools developed for this purpose. Our results show that the proposed HiDISC model eliminates 19.7% of the cache misses and improves overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model that assumes a memory access latency of 200 CPU cycles, HiDISC improves performance by 17.2%.
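The three-stream idea in the abstract can be illustrated with a toy model. The sketch below is not the HiDISC simulator and does not reproduce any of the reported numbers; it only shows, under simplifying assumptions, why access-related work has to run "early enough". The function names, the `slip` parameter (how many iterations the prefetch stream runs ahead), and the `latency` parameter (steps until a requested address arrives) are all hypothetical, and the model counts late loads without actually stalling time.

```python
from collections import deque


def sequential_sum(data, index):
    """Single-stream reference loop: address use, load, and add interleaved."""
    total = 0
    for i in index:
        total += data[i]
    return total


def decoupled_sum(data, index, slip, latency):
    """Toy three-stream version of the same loop (illustrative only).

    slip:    iterations the prefetch stream runs ahead of the access stream
             (hypothetical knob, not a HiDISC parameter).
    latency: steps between requesting an address and its arrival (assumed).
    Returns (total, stalled_loads), where stalled_loads counts loads whose
    data had not yet arrived when the access stream needed it.
    """
    ready_at = {}          # address -> step at which its data arrives
    load_queue = deque()   # access stream -> computation stream
    total = 0
    stalled_loads = 0
    n = len(index)

    for step in range(n + slip):
        # Prefetch stream: request the address for iteration `step`.
        if step < n:
            ready_at.setdefault(index[step], step + latency)

        # Access stream: perform the load for iteration `step - slip`.
        j = step - slip
        if j >= 0:
            addr = index[j]
            if ready_at[addr] > step:
                stalled_loads += 1   # data requested too late to hide latency
            load_queue.append(data[addr])

        # Computation stream: consume operands, never touching memory itself.
        while load_queue:
            total += load_queue.popleft()

    return total, stalled_loads


if __name__ == "__main__":
    data = list(range(100, 132))
    index = [(7 * k) % 32 for k in range(32)]  # irregular, non-repeating order
    for slip in (0, 4, 8):
        print(slip, decoupled_sum(data, index, slip=slip, latency=8))
```

In this model the computed sum is always the same as in the sequential loop; only the number of late loads changes with `slip`. Once `slip` is at least `latency`, every load finds its data already present, which is the latency-masking effect the abstract describes.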

Original language: English
Pages (from-to): 237-259
Number of pages: 23
Journal: Journal of Supercomputing
Volume: 38
Issue number: 3
DOI: 10.1007/s11227-006-8321-2
Publication status: Published - 2006 Dec 1

Fingerprint

Data storage equipment
Evaluation
Latency
Cache
Prefetching
Program processors
Cycle
Hardware Design
Memory Model
Computer Model
Predictability
Microprocessor
Design
Architecture
Microprocessor chips
Masks
Decoupling
Locality
Computer systems

All Science Journal Classification (ASJC) codes

  • Software
  • Theoretical Computer Science
  • Information Systems
  • Hardware and Architecture

Cite this

Ro, Won Woo ; Crago, Stephen P. ; Despain, Alvin M. ; Gaudiot, Jean Luc. / Design and evaluation of a hierarchical decoupled architecture. In: Journal of Supercomputing. 2006 ; Vol. 38, No. 3. pp. 237-259.
@article{1c9b651080934e2cb8163c0300ba45f8,
title = "Design and evaluation of a hierarchical decoupled architecture",
abstract = "The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7{\%} of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8{\%}. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2{\%}.",
author = "Ro, {Won Woo} and Crago, {Stephen P.} and Despain, {Alvin M.} and Gaudiot, {Jean Luc}",
year = "2006",
month = "12",
day = "1",
doi = "10.1007/s11227-006-8321-2",
language = "English",
volume = "38",
pages = "237--259",
journal = "Journal of Supercomputing",
issn = "0920-8542",
publisher = "Springer Netherlands",
number = "3",
}

Design and evaluation of a hierarchical decoupled architecture. / Ro, Won Woo; Crago, Stephen P.; Despain, Alvin M.; Gaudiot, Jean Luc.

In: Journal of Supercomputing, Vol. 38, No. 3, 01.12.2006, p. 237-259.

TY - JOUR

T1 - Design and evaluation of a hierarchical decoupled architecture

AU - Ro, Won Woo

AU - Crago, Stephen P.

AU - Despain, Alvin M.

AU - Gaudiot, Jean Luc

PY - 2006/12/1

Y1 - 2006/12/1

N2 - The speed gap between processor and main memory is the major performance bottleneck of modern computer systems. As a result, today's microprocessors suffer from frequent cache misses and lose many CPU cycles due to pipeline stalling. Although traditional data prefetching methods considerably reduce the number of cache misses, most of them strongly rely on the predictability for future accesses and often fail when memory accesses do not contain much locality. To solve the long latency problem of current memory systems, this paper presents the design and evaluation of our high-performance decoupled architecture, the HiDISC (Hierarchical Decoupled Instruction Stream Computer). The motivation for the design originated from the traditional decoupled architecture concept and its limits. The HiDISC approach implements an additional prefetching processor on top of a traditional access/execute architecture. Our design aims at providing low memory access latency by separating and decoupling otherwise sequential pieces of code into three streams and executing each stream on three dedicated processors. The three streams act in concert to mask the long access latencies by providing the necessary data to the upper level on time. This is achieved by separating the access-related instructions from the main computation and running them early enough on the two dedicated processors. Detailed hardware design and performance evaluation are performed with development of an architectural simulator and compiling tools. Our performance results show that the proposed HiDISC model reduces 19.7% of the cache misses and improves the overall IPC (Instructions Per Cycle) by 15.8%. With a slower memory model assuming 200 CPU cycles as memory access latency, our HiDISC improves the performance by 17.2%.

UR - http://www.scopus.com/inward/record.url?scp=33749484001&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33749484001&partnerID=8YFLogxK

U2 - 10.1007/s11227-006-8321-2

DO - 10.1007/s11227-006-8321-2

M3 - Article

VL - 38

SP - 237

EP - 259

JO - Journal of Supercomputing

JF - Journal of Supercomputing

SN - 0920-8542

IS - 3

ER -