Abstract
A threaded, multi-core implementation of the high-performance dense linear algebra matrix-matrix multiply (GEMM) kernel is described. This kernel is widely implemented by vendors in the Basic Linear Algebra Subprograms (BLAS) library. The mathematics of arrays (MoA) paradigm due to Mullin (1988) yields contiguous memory accesses by employing outer-product forms. Our performance studies demonstrate that the MoA implementation of double-precision GEMM (DGEMM), combined with optimal cache-blocking strategies, results in at least a 25% performance gain on the Intel Xeon Skylake processor over the vendor-supplied Intel MKL basic linear algebra libraries. Results are presented for the NREL Eagle supercomputer. The multi-core DGEMM achieves over 100 GFLOP/s with eight OpenMP threads.
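To illustrate the idea of contiguous access through an outer-product form, the following is a minimal sketch in C. It is not the paper's MoA-derived kernel. It assumes row-major storage, and the function name `dgemm_outer_blocked`, the helper `min_size`, and the single block-size parameter `bs` are hypothetical names chosen for illustration. For each index `p` of the inner dimension, the update adds a scaled row of B to a row of C, so the innermost loop walks both arrays with unit stride. The three outer loops tile the iteration space so that the working set can stay in cache, and the row blocks of C are split across OpenMP threads.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for this sketch. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/*
 * Computes C += A * B for dense row-major matrices:
 *   A is m x k, B is k x n, C is m x n.
 * bs is an illustrative cache-block size (an assumption, not a tuned value).
 */
void dgemm_outer_blocked(size_t m, size_t n, size_t k,
                         const double *A, const double *B, double *C,
                         size_t bs)
{
    assert(bs > 0);

    /* Each thread owns disjoint row blocks of C, so writes never conflict. */
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (size_t ii = 0; ii < m; ii += bs) {
        size_t i_end = min_size(ii + bs, m);
        for (size_t pp = 0; pp < k; pp += bs) {
            size_t p_end = min_size(pp + bs, k);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_size(jj + bs, n);
                for (size_t i = ii; i < i_end; ++i) {
                    double *c_row = C + i * n;
                    for (size_t p = pp; p < p_end; ++p) {
                        /* Rank-1 (outer-product) contribution of column p
                           of A and row p of B, restricted to row i. */
                        double a = A[i * k + p];
                        const double *b_row = B + p * n;
                        /* Unit-stride access to both b_row and c_row. */
                        for (size_t j = jj; j < j_end; ++j) {
                            c_row[j] += a * b_row[j];
                        }
                    }
                }
            }
        }
    }
}
```

The routine accumulates into C, so the caller is expected to zero C first when a plain product is wanted. Compiling with an OpenMP flag such as `-fopenmp` enables the threaded loop; without it the code runs serially. A production kernel would typically use separate block sizes per cache level and register-level tiling, which this sketch omits.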
Original language | American English |
---|---|
Number of pages | 13 |
State | Published - 2022 |
Event | 2021 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE'21) - Las Vegas, Nevada; Duration: 26 Jul 2021 → 29 Jul 2021 |
Conference
Conference | 2021 World Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE'21) |
---|---|
City | Las Vegas, Nevada |
Period | 26/07/21 → 29/07/21 |
NREL Publication Number
- NREL/CP-2C00-80530
Keywords
- cache-blocking
- contiguous memory
- mathematics of arrays
- shared-memory multi-threading