
[D] MXFP8 GEMM: Up to 99% of cuBLAS performance using CUDA + PTX

New blog post by Daniel Vega-Myhre (Meta/PyTorch) walking through GEMM kernel design for FP8, with deep dives into the constraints and design challenges introduced by MXFP8.

Link: https://danielvegamyhre.github.io/2026/03/29/mxfp8-gemm.html
Original Tweet: https://x.com/vega_myhre/status/2038293614204445039
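For context (background on the format, not code from the linked post): MXFP8 is the OCP microscaling FP8 format, in which each block of 32 elements shares a single power-of-two (E8M0) scale and the elements themselves are stored in FP8, commonly E4M3. A minimal NumPy sketch of that block quantization, assuming E4M3 elements and ignoring subnormal/underflow handling:

```python
import numpy as np

BLOCK = 32        # MX block size: 32 elements share one scale
E4M3_MAX = 448.0  # largest finite FP8 E4M3 magnitude

def round_to_e4m3(v: np.ndarray) -> np.ndarray:
    """Crude E4M3 simulation: keep 4 significand bits (1 implicit + 3
    mantissa bits) and clamp to the finite range; subnormals and the
    E4M3 exponent limits are ignored in this sketch."""
    m, e = np.frexp(v)             # v = m * 2**e, |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0  # round significand to 4 bits
    return np.clip(np.ldexp(m, e), -E4M3_MAX, E4M3_MAX)

def quantize_mxfp8(x: np.ndarray):
    """Quantize a 1-D float32 array (length a multiple of 32) into
    simulated MXFP8: per-block power-of-two (E8M0) scales plus
    FP8-rounded elements."""
    blocks = x.reshape(-1, BLOCK)
    amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Pick the power-of-two scale so the block max fits in E4M3 range.
    exp = np.ceil(np.log2(np.maximum(amax, 1e-30) / E4M3_MAX))
    scale = np.exp2(exp)
    q = round_to_e4m3(blocks / scale)
    return q, scale

def dequantize_mxfp8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct the original tensor from elements and block scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.standard_normal(128).astype(np.float32)
q, s = quantize_mxfp8(x)
x_hat = dequantize_mxfp8(q, s)
```

Because the scale is a pure power of two, dequantization is an exact exponent shift; the per-block scale is one of the main sources of the extra kernel complexity (scale-factor layout and loading alongside the FP8 tiles) that the blog post digs into.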

Additional resources:
MXFP8 and DeepEP for DeepSeek-V3 on B200 w/ TorchTitan: https://pytorch.org/blog/enabling-up-to-41-faster-pre-training-mxfp8-and-deepep-for-deepseek-v3-on-b200-with-torchtitan/

submitted by /u/Benlus


Tagged with

#MXFP8
#GEMM
#FP8
#cuBLAS
#CUDA
#PTX
#performance
#TorchTitan
#pre-training
#DeepEP
#DeepSeek-V3