Abstract
Intel AMX (Advanced Matrix Extensions) is a built-in component of recent Intel CPU architectures, first supported by Intel Sapphire Rapids processors in 2023, that enables efficient dense matrix multiplication in mixed precision using low-precision data types. Mixed-precision algorithms have grown popular recently, primarily through their use on GPUs to improve the efficiency of HPC applications, particularly the training of large language models. The availability of mixed precision on CPUs offers a cost-effective alternative for applications where raw speed is not critical. This report shows how to use the Intel AMX accelerator through examples in C++ and Python. The examples focus on mixed-precision floating-point operations that use bfloat16 (BF16) to accelerate single-precision code. We follow a bottom-up methodology, starting from register-level instructions (the TMUL operations) and moving up to libraries such as Intel MKL, PyTorch, and TensorFlow, to give a comprehensive picture of the accelerator's potential. Additionally, we provide insight into the expected performance gains from leveraging the accelerator on the Kestrel HPC machine at the National Renewable Energy Laboratory.
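To illustrate the mixed-precision scheme the abstract refers to, the following is a minimal NumPy sketch (not part of the report's examples) of the numeric contract behind the TMUL BF16 operation: operands rounded to bfloat16, products accumulated in float32. bfloat16 keeps float32's 8 exponent bits but only 7 mantissa bits, so it can be emulated by zeroing the low 16 bits of each 32-bit word; for simplicity this sketch truncates rather than using the round-to-nearest-even conversion the hardware performs, and it does not touch the AMX unit itself.

```python
import numpy as np

def to_bfloat16(x):
    """Emulate bfloat16 by zeroing the low 16 bits of each float32 word.

    bfloat16 shares float32's 8-bit exponent and keeps only the top 7
    mantissa bits, so the truncated value is exactly representable in BF16.
    (Hardware converters round to nearest even; truncation is a
    simplification for illustration.)
    """
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & 0xFFFF0000).view(np.float32)

def amx_style_matmul(a, b):
    """Emulate the TMUL mixed-precision contract in plain NumPy.

    Inputs are rounded to BF16 precision, and the matrix product is
    accumulated in float32 -- mirroring how AMX multiplies BF16 tiles
    into an FP32 accumulator tile.
    """
    return to_bfloat16(a) @ to_bfloat16(b)
```

This is only a model of the arithmetic; the report's actual C++ examples drive the accelerator through tile-configuration and TMUL intrinsics, and the Python examples rely on libraries (MKL, PyTorch, TensorFlow) that dispatch to AMX internally.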
| Original language | American English |
|---|---|
| Number of pages | 20 |
| State | Published - 2025 |
NREL Publication Number
- NREL/TP-2C00-93622
Keywords
- AI
- AMX
- bfloat16
- HPC
- Kestrel