Duplo: Lifting redundant memory accesses of deep neural networks for GPU tensor cores

Hyeonjin Kim, Sungwoo Ahn, Yunho Oh, Bogil Kim, Won Woo Ro, William J. Song

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Citations (Scopus)

Abstract

This paper introduces a GPU architecture named Duplo that minimizes redundant memory accesses of convolutions in deep neural networks (DNNs). Convolution is one of the fundamental operations used in various classes of DNNs, and it takes up the majority of execution time. Various approaches have been proposed to accelerate convolutions via general matrix multiplication (GEMM), Winograd convolution, fast Fourier transform (FFT), etc. The recent introduction of tensor cores in NVIDIA GPUs specifically targets the acceleration of neural network computations. A tensor core in a streaming multiprocessor (SM) is a specialized unit dedicated to handling matrix-multiply-and-accumulate (MMA) operations. The underlying operations of tensor cores represent GEMM calculations, and lowering a convolution can effectively exploit the tensor cores by transforming deeply nested convolution loops into matrix multiplication. However, lowering the convolution has a critical drawback: it requires a larger memory space (or workspace) to compute the matrix multiplication, and the expanded workspace inevitably creates multiple duplicates of the same data stored at different memory addresses. The proposed Duplo architecture tackles this challenge by leveraging compile-time information and microarchitectural support to detect and eliminate redundant memory accesses that repeatedly load the duplicates of data in the workspace matrix. Duplo identifies data duplication based on memory addresses and convolution information generated by a compiler. It uses a load history buffer (LHB) to trace the recent load history of workspace data and their presence in the register file. Every load instruction of workspace data consults the LHB to check whether copies of potentially the same data exist in the register file. If data duplicates are found, Duplo simply renames registers and makes them point to the ones containing the same values instead of issuing memory requests to load the same data. Our experimental results show that Duplo improves the performance of DNNs by 29.4% on average and saves 34.1% of energy using tensor cores.
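The duplication described in the abstract can be illustrated with a small sketch. It is a toy software model, not the Duplo hardware: the function names `im2col_addresses` and `simulate_load_history`, the single-channel row-major layout, the absence of padding, and the LRU replacement policy are all assumptions made for illustration. The first function lowers a convolution and records which input address each workspace element was copied from; the second counts how many workspace loads a bounded history of recently loaded addresses could have served without going to memory.

```python
from collections import OrderedDict


def im2col_addresses(height, width, kernel, stride=1):
    """Lower a single-channel convolution and return, for each workspace
    row (one per output pixel), the flat row-major input addresses that
    the kernel window copies from. No padding is assumed."""
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    rows = []
    for oy in range(out_h):
        for ox in range(out_w):
            row = []
            for ky in range(kernel):
                for kx in range(kernel):
                    y = oy * stride + ky
                    x = ox * stride + kx
                    row.append(y * width + x)
            rows.append(row)
    return rows


def simulate_load_history(rows, capacity):
    """Walk the workspace in row order and count loads whose source
    address is still in a bounded LRU history (hits) versus loads that
    would need a memory request (misses). This is a hypothetical
    stand-in for a load history buffer, not the paper's design."""
    history = OrderedDict()
    hits = 0
    misses = 0
    for row in rows:
        for addr in row:
            if addr in history:
                history.move_to_end(addr)  # refresh recency on reuse
                hits += 1
            else:
                misses += 1
                history[addr] = True
                if len(history) > capacity:
                    history.popitem(last=False)  # evict least recent
    return hits, misses


if __name__ == "__main__":
    workspace = im2col_addresses(height=4, width=4, kernel=3)
    total = sum(len(r) for r in workspace)
    unique = len({a for r in workspace for a in r})
    print("workspace elements:", total, "distinct inputs:", unique)
    print("(hits, misses):", simulate_load_history(workspace, capacity=16))
```

For a 4x4 input and a 3x3 kernel with stride 1, the workspace holds 36 elements drawn from only 16 distinct input values, so more than half of the workspace loads in this toy case target data that has already been loaded once. Shrinking `capacity` shows how a smaller history captures fewer of those duplicates.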

Original language: English
Title of host publication: Proceedings - 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020
Publisher: IEEE Computer Society
Pages: 725-737
Number of pages: 13
ISBN (Electronic): 9781728173832
Publication status: Published - 2020 Oct
Event: 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020 - Virtual, Athens, Greece
Duration: 2020 Oct 17 to 2020 Oct 21

Publication series

Name: Proceedings of the Annual International Symposium on Microarchitecture, MICRO
Volume: 2020-October
ISSN (Print): 1072-4451

Conference

Conference: 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020
Country: Greece
City: Virtual, Athens
Period: 20/10/17 to 20/10/21

Bibliographical note

Funding Information:
This research was supported by Samsung Research Funding and Incubation Center of Samsung Electronics under the project #SRFC-IT1801-04. William J. Song is the corresponding author of this paper.

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture
