August 28, 2025

Matrix Multiplication on Blackwell: Part 1 - Introduction

Ali Taha

Jiexiang Liu

Hengjie Wang

Metric	A100 (Baseline)	H100	H200	B100	B200
Peak Memory Bandwidth	1.0x	1.6x	2.4x	3.9x	3.9x
NVLink Bandwidth	1.0x	1.5x	1.5x	3.0x	3.0x
Peak BF16 TFLOPS (dense)	1.0x	3.2x	3.2x	5.6x	7.2x
Peak FP8 TFLOPS (dense, relative to H100)	N/A	1.0x	1.0x	1.8x	2.3x


Ali Taha, AI Performance Engineer

Jiexiang Liu

Hengjie Wang, AI Performance Engineer


What is matmul?

Why does matmul matter today?

Why do we care about GPUs?

GPUs from the hardware architect's perspective

What are Tensor Cores?

What is GPU programming?

Thread block clustering

NVIDIA’s GPUs - A look over the past 3 generations

NVIDIA's Ampere architecture

NVIDIA's Hopper architecture

NVIDIA's Blackwell architecture

GPU comparison at a glance

Pre-Ampere optimization

Ampere optimization

Hopper optimization

Blackwell optimization

Matmul in four lines

Data types and casting
