Architectural investigation of matrix data layout on multicore processors

Minwoo Kim, Won Woo Ro

Research output: Contribution to journal › Article

3 Citations (Scopus)

Abstract

Matrix operations are essential procedures in many practical applications, and recent studies rely on parallel processing to reduce their computation time. Because these operations are highly data intensive, many studies have investigated work distribution techniques and data access latency to accelerate the algorithms. However, previous studies have not adequately considered hardware architectural features, although these features greatly affect the performance of matrix operations. The present study therefore examines the architectural characteristics that affect the performance of matrix operations on real multicore processors. We use matrix multiplication, LU decomposition, and Cholesky factorization as the test applications; all three are well-known data-intensive mathematical algorithms used in various fields. We argue that these applications access matrices only in a particular direction, and we propose that the canonical data layout, rather than the block data layout, is the optimal matrix data layout. In addition, a tiling algorithm is used to increase temporal data locality in multilevel caches and to balance the workload as evenly as possible in multicore environments. Our experimental results show that applications using the canonical data layout with tiling have an 8.23% faster execution time and 3.91% of last-level cache miss rate compared with applications executed with the block data layout.
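The two layouts and the tiling idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of row-major order as the "canonical" layout, and the square-block storage scheme are all assumptions made for illustration. Pure Python also does not expose cache behavior, so the sketch shows only the index arithmetic and loop structure, not the reported speedups.

```python
def canonical_index(i, j, n):
    """Offset of element (i, j) in an n x n matrix stored row by row.

    Assumption: 'canonical' is taken to mean row-major order here.
    """
    return i * n + j


def block_index(i, j, n, b):
    """Offset of element (i, j) when the matrix is stored as contiguous
    b x b blocks (blocks in row-major order, elements row-major inside
    each block). Assumes b divides n; this scheme is hypothetical.
    """
    blocks_per_row = n // b
    block_id = (i // b) * blocks_per_row + (j // b)
    return block_id * b * b + (i % b) * b + (j % b)


def tiled_matmul(a, b, n, tile):
    """Return C = A * B for flat row-major lists a and b of length n * n.

    The three outer loops walk over tiles; the three inner loops work
    inside one tile triple, so the same small sub-blocks are reused
    before the computation moves on (temporal locality).
    """
    c = [0.0] * (n * n)
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a_ik = a[canonical_index(i, k, n)]
                        for j in range(jj, min(jj + tile, n)):
                            c[canonical_index(i, j, n)] += (
                                a_ik * b[canonical_index(k, j, n)]
                            )
    return c
```

In this sketch the matrix keeps the canonical layout and only the loop order is tiled, which mirrors the combination the abstract reports on. In a multicore setting, the outer `ii` tiles could be handed to different workers, since each one writes a disjoint set of rows of C.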

Original language: English
Pages (from-to): 64-75
Number of pages: 12
Journal: Future Generation Computer Systems
Volume: 37
DOI: 10.1016/j.future.2013.10.020
Publication status: Published - 2014 Jul

Fingerprint

Factorization
Decomposition
Hardware
Processing

All Science Journal Classification (ASJC) codes

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

@article{8036a98bfc55449fb6462479caceed79,
title = "Architectural investigation of matrix data layout on multicore processors",
abstract = "Many practical applications include matrix operations as essential procedures. In addition, recent studies of matrix operations rely on parallel processing to reduce any calculation delays. Because these operations are highly data intensive, many studies have investigated work distribution techniques and data access latency to accelerate algorithms. However, previous studies have not considered hardware architectural features adequately, although they greatly affect the performance of matrix operations. Thus, the present study considers the architectural characteristics that affect the performance of matrix operations on real multicore processors. We use matrix multiplication, LU decomposition, and Cholesky factorization as the test applications, which are well-known data-intensive mathematical algorithms in various fields. We argue that applications only access matrices in a particular direction, and we propose that the canonical data layout is the optimal matrix data layout compared with the block data layout. In addition, the tiling algorithm is utilized to increase the temporal data locality in multilevel caches and to balance the workload as evenly as possible in multicore environments. Our experimental results show that applications using the canonical data layout with tiling have an 8.23{\%} faster execution time and 3.91{\%} of last level cache miss rate compared with applications executed with the block data layout.",
author = "Minwoo Kim and Ro, {Won Woo}",
year = "2014",
month = "7",
doi = "10.1016/j.future.2013.10.020",
language = "English",
volume = "37",
pages = "64--75",
journal = "Future Generation Computer Systems",
issn = "0167-739X",
publisher = "Elsevier",

}

Architectural investigation of matrix data layout on multicore processors. / Kim, Minwoo; Ro, Won Woo.

In: Future Generation Computer Systems, Vol. 37, 07.2014, p. 64-75.

TY - JOUR

T1 - Architectural investigation of matrix data layout on multicore processors

AU - Kim, Minwoo

AU - Ro, Won Woo

PY - 2014/7

Y1 - 2014/7

N2 - Many practical applications include matrix operations as essential procedures. In addition, recent studies of matrix operations rely on parallel processing to reduce any calculation delays. Because these operations are highly data intensive, many studies have investigated work distribution techniques and data access latency to accelerate algorithms. However, previous studies have not considered hardware architectural features adequately, although they greatly affect the performance of matrix operations. Thus, the present study considers the architectural characteristics that affect the performance of matrix operations on real multicore processors. We use matrix multiplication, LU decomposition, and Cholesky factorization as the test applications, which are well-known data-intensive mathematical algorithms in various fields. We argue that applications only access matrices in a particular direction, and we propose that the canonical data layout is the optimal matrix data layout compared with the block data layout. In addition, the tiling algorithm is utilized to increase the temporal data locality in multilevel caches and to balance the workload as evenly as possible in multicore environments. Our experimental results show that applications using the canonical data layout with tiling have an 8.23% faster execution time and 3.91% of last level cache miss rate compared with applications executed with the block data layout.

UR - http://www.scopus.com/inward/record.url?scp=84901587832&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84901587832&partnerID=8YFLogxK

U2 - 10.1016/j.future.2013.10.020

DO - 10.1016/j.future.2013.10.020

M3 - Article

AN - SCOPUS:84901587832

VL - 37

SP - 64

EP - 75

JO - Future Generation Computer Systems

JF - Future Generation Computer Systems

SN - 0167-739X

ER -