DMazerunner: Executing perfectly nested loops on dataflow accelerators

Shail Dave, Youngbin Kim, Sasikanth Avancha, Kyoungwoo Lee, Aviral Shrivastava

Research output: Contribution to journalArticle

Abstract

Dataflow accelerators feature simplicity, programmability, and energy-efficiency and are visualized as a promising architecture for accelerating perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, how to discover the most efficient way to execute the perfectly nested loop of an application onto computational and memory resources of a given dataflow accelerator (execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner - to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of dMazeRunner framework is in: i) a holistic representation of the loop nests, that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and the solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time, as compared to prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP, as compared to the optimal solution.

Original languageEnglish
Article numbera70
JournalACM Transactions on Embedded Computing Systems
Volume18
Issue number5s
DOIs
Publication statusPublished - 2019 Oct

Fingerprint

Particle accelerators
Convolution
Energy efficiency
Data storage equipment
Communication
Processing
Costs
Experiments
Deep learning

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture

Cite this

Dave, Shail ; Kim, Youngbin ; Avancha, Sasikanth ; Lee, Kyoungwoo ; Shrivastava, Aviral. / DMazerunner : Executing perfectly nested loops on dataflow accelerators. In: ACM Transactions on Embedded Computing Systems. 2019 ; Vol. 18, No. 5s.
@article{7e90abac29b94ac4a1f74bed53c4f60a,
title = "DMazerunner: Executing perfectly nested loops on dataflow accelerators",
abstract = "Dataflow accelerators feature simplicity, programmability, and energy-efficiency and are visualized as a promising architecture for accelerating perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, how to discover the most efficient way to execute the perfectly nested loop of an application onto computational and memory resources of a given dataflow accelerator (execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner - to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of dMazeRunner framework is in: i) a holistic representation of the loop nests, that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and the solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time, as compared to prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56{\%} increase in EDP, as compared to the optimal solution.",
author = "Shail Dave and Youngbin Kim and Sasikanth Avancha and Kyoungwoo Lee and Aviral Shrivastava",
year = "2019",
month = "10",
doi = "10.1145/3358198",
language = "English",
volume = "18",
journal = "Transactions on Embedded Computing Systems",
issn = "1539-9087",
publisher = "Association for Computing Machinery (ACM)",
number = "5s",

}

DMazerunner : Executing perfectly nested loops on dataflow accelerators. / Dave, Shail; Kim, Youngbin; Avancha, Sasikanth; Lee, Kyoungwoo; Shrivastava, Aviral.

In: ACM Transactions on Embedded Computing Systems, Vol. 18, No. 5s, a70, 10.2019.

Research output: Contribution to journalArticle

TY - JOUR

T1 - DMazerunner

T2 - Executing perfectly nested loops on dataflow accelerators

AU - Dave, Shail

AU - Kim, Youngbin

AU - Avancha, Sasikanth

AU - Lee, Kyoungwoo

AU - Shrivastava, Aviral

PY - 2019/10

Y1 - 2019/10

N2 - Dataflow accelerators feature simplicity, programmability, and energy-efficiency and are visualized as a promising architecture for accelerating perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, how to discover the most efficient way to execute the perfectly nested loop of an application onto computational and memory resources of a given dataflow accelerator (execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner - to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of dMazeRunner framework is in: i) a holistic representation of the loop nests, that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and the solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time, as compared to prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP, as compared to the optimal solution.

AB - Dataflow accelerators feature simplicity, programmability, and energy-efficiency and are visualized as a promising architecture for accelerating perfectly nested loops that dominate several important applications, including image and media processing and deep learning. Although numerous accelerator designs are being proposed, how to discover the most efficient way to execute the perfectly nested loop of an application onto computational and memory resources of a given dataflow accelerator (execution method) remains an essential and yet unsolved challenge. In this paper, we propose dMazeRunner - to efficiently and accurately explore the vast space of the different ways to spatiotemporally execute a perfectly nested loop on dataflow accelerators (execution methods). The novelty of dMazeRunner framework is in: i) a holistic representation of the loop nests, that can succinctly capture the various execution methods, ii) accurate energy and performance models that explicitly capture the computation and communication patterns, data movement, and data buffering of the different execution methods, and iii) drastic pruning of the vast search space by discarding invalid solutions and the solutions that lead to the same cost. Our experiments on various convolution layers (perfectly nested loops) of popular deep learning applications demonstrate that the solutions discovered by dMazeRunner are on average 9.16× better in Energy-Delay-Product (EDP) and 5.83× better in execution time, as compared to prior approaches. With additional pruning heuristics, dMazeRunner reduces the search time from days to seconds with a mere 2.56% increase in EDP, as compared to the optimal solution.

UR - http://www.scopus.com/inward/record.url?scp=85073149210&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85073149210&partnerID=8YFLogxK

U2 - 10.1145/3358198

DO - 10.1145/3358198

M3 - Article

AN - SCOPUS:85073149210

VL - 18

JO - Transactions on Embedded Computing Systems

JF - Transactions on Embedded Computing Systems

SN - 1539-9087

IS - 5s

M1 - a70

ER -