Threaded Multi-Core GEMM with MoA and Cache-Blocking: Preprint

Stephen Thomas, Lenore Mullin, Kasia Swirydowicz, Rishi Khan

Research output: Contribution to conference (Paper)

Abstract

A threaded multi-core implementation of the high-performance dense linear algebra matrix-matrix multiply (GEMM) kernel is described. This kernel is widely implemented by vendors in their Basic Linear Algebra Subprograms (BLAS) libraries. The mathematics of arrays (MoA) paradigm due to Mullin (1988) results in contiguous memory accesses by employing outer-product forms. Our performance studies demonstrate that the MoA implementation of double-precision DGEMM, combined with optimal cache-blocking strategies, yields at least a 25% performance gain on the Intel Xeon Skylake processor over the vendor-supplied Intel MKL basic linear algebra libraries. Results are presented for the NREL Eagle supercomputer. The multi-core DGEMM achieves over 100 GFLOP/s with eight OpenMP threads.
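
The abstract's two main ideas, outer-product (rank-1 update) forms and cache-blocking, can be illustrated with a minimal sketch. This is not the authors' MoA implementation: the function names `gemm_blocked` and `gemm_naive`, the tile size `bs`, and the flat row-major list layout are all assumptions made for illustration. Pure Python shows only the index pattern, not the performance effect, and threading is omitted.

```python
def gemm_blocked(A, B, m, k, n, bs=32):
    """Return C = A * B as a flat row-major list.

    A is m x k and B is k x n, both stored as flat row-major lists.
    bs is a hypothetical tile size; a real kernel would tune it to the cache.
    """
    C = [0.0] * (m * n)
    # Outer three loops walk over tiles so each working set stays small.
    for i0 in range(0, m, bs):
        i_end = min(i0 + bs, m)
        for p0 in range(0, k, bs):
            p_end = min(p0 + bs, k)
            for j0 in range(0, n, bs):
                j_end = min(j0 + bs, n)
                # Inside a tile: for each p, add the outer product of
                # column p of A with row p of B (a rank-1 update).
                for p in range(p0, p_end):
                    b_row = p * n
                    for i in range(i0, i_end):
                        a_ip = A[i * k + p]
                        c_row = i * n
                        # Innermost loop reads B and updates C at
                        # consecutive (contiguous) row-major addresses.
                        for j in range(j0, j_end):
                            C[c_row + j] += a_ip * B[b_row + j]
    return C


def gemm_naive(A, B, m, k, n):
    """Reference inner-product form; reads B with stride n in the inner loop."""
    C = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            s = 0.0
            for p in range(k):
                s += A[i * k + p] * B[p * n + j]
            C[i * n + j] = s
    return C
```

For example, `gemm_blocked(A, B, 2, 3, 4, bs=2)` on small integer-valued inputs should agree with `gemm_naive(A, B, 2, 3, 4)`. In a compiled language, the loop over row tiles (`i0`) is a natural candidate for distribution across threads, since different row tiles write disjoint parts of C.
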
Original language: American English
Number of pages: 13
State: Published - 2022
Event: 2021 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE'21) - Las Vegas, Nevada
Duration: 26 Jul 2021 to 29 Jul 2021

Conference

Conference: 2021 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE'21)
City: Las Vegas, Nevada
Period: 26/07/21 to 29/07/21

NREL Publication Number

  • NREL/CP-2C00-80530

Keywords

  • cache-blocking
  • contiguous memory
  • mathematics of arrays
  • shared-memory multi-threading
