January 12, 2026

How I Beat Unsloth's CUDA Kernel Using Mojo—With Zero GPU Experience

David Robertson

TL;DR: I took a quantization challenge designed for CUDA experts, solved it in Mojo with AI assistance, and ended up 1.07x to 1.84x faster than the state-of-the-art C++/CUDA implementation. If you've ever wanted to write GPU code but bounced off the inherent complexity, this is your sign to try Mojo.


Why This Matters

GPU programming has a steep learning curve. The performance gains are massive, but the path to get there (CUDA, PTX, memory hierarchies, occupancy tuning) stops most developers before they start. Mojo claims to flatten that curve: Python-like syntax, systems-level performance, no interop gymnastics.

I wanted to test that claim with a real benchmark. So I took Unsloth's NF4 dequantization puzzle, a practical workload with a published baseline, and tried to beat it using only Mojo.

Here's what happened.


The Challenge: NF4 Dequantization

NF4 is a 4-bit quantization format from the QLoRA paper. Standard 4-bit quantization divides the number line into equal-width buckets, which wastes precision: neural network weights are roughly normally distributed, so most values cluster near zero. NF4 instead uses quantile quantization, placing the bucket boundaries so that each bucket captures an equal share of the probability mass.
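
To make this concrete, here is a minimal sketch of the per-byte dequantization step. The function name and signature are illustrative, not the puzzle's actual API: each packed byte holds two 4-bit indices into a 16-entry quantile table, and each looked-up value is rescaled by its block's absmax.

```mojo
# Illustrative sketch only: the name and arguments are hypothetical, not the puzzle's API.
fn dequant_byte(
    packed: UInt8,                                     # one byte = two 4-bit NF4 indices
    nf4_table: UnsafePointer[Float32, MutAnyOrigin],   # 16 quantile values in [-1, 1]
    absmax: Float32,                                   # per-block scale factor
    out_ptr: UnsafePointer[Float32, MutAnyOrigin],     # destination for the two weights
):
    var hi = Int((packed >> 4) & 0x0F)  # first weight: high nibble
    var lo = Int(packed & 0x0F)         # second weight: low nibble
    out_ptr[0] = nf4_table[hi] * absmax
    out_ptr[1] = nf4_table[lo] * absmax
```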

The tradeoff: NF4 dequantization is computationally heavier than standard formats, making it a good optimization target.

The puzzle rules:

  • Convert NF4 weights to FP16/BF16 in a single kernel
  • No large intermediate buffers
  • No torch.compile
  • Must run on a Tesla T4
  • Target: beat Unsloth's reference time of 5.33 seconds by at least 1.15×

My Starting Point

I'm not a CUDA expert. I don't dream in PTX. But I can code in Python and TypeScript, and I'm comfortable working through problems systematically.

My workflow:

  • My role: research, logic, constraints, system design
  • AI tools: ChatGPT Pro + a custom Modular docs agent
  • Testing: native Mojo benchmark harness (three model configs, mixed precision, 1,000 launches per matrix, strict sync for timing)

First result: 25 seconds. Nearly five times slower than the baseline.


The Optimization Path

Iteration 1: Two Kernel Designs

Within a few hours of brainstorming with AI, I had two approaches running at around 8 seconds each. Progress, but not enough.

Breakthrough: Packed Stores

The T4's bottleneck is memory bandwidth. Writing a single 16-bit float is inefficient; it's like sending a half-empty delivery truck. I modified both kernels to calculate two weights, pack them into a 32-bit integer, and write once.
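
Here is the packing idea in isolation, sketched with plain integer types; the kernel's own pack2_bf16_to_u32 helper does the same thing with the bit patterns of two bf16 weights.

```mojo
# Two 16-bit payloads go into one 32-bit word, so a single store moves both values.
fn pack2_u16_to_u32(lo: UInt16, hi: UInt16) -> UInt32:
    return UInt32(lo) | (UInt32(hi) << 16)
```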

Result: 4.25s and 4.51s. I'd passed the original 5.33s target.

Plot Twist: The Baseline Moved

When I verified against the current Unsloth implementation, it had improved to 3.70s. Ten months of optimization since the puzzle was published. Back to work.

The L4 Mystery

I ran the same kernels on an L4 GPU to understand cross-hardware behavior. The results were counterintuitive:

| Kernel | T4 | L4 |
| --- | --- | --- |
| Unsloth Reference | 3.70s | 3.02s |
| Mojo (2D Tiled) | 4.25s | 2.59s |
| Mojo (Warp-Per-Block) | 4.51s | 2.45s |


My "slower" kernel won on L4. Why?

L2 cache. The T4 has 4 MB, which is unforgiving if your memory access patterns aren't well aligned. The L4 has 48 MB, which absorbed my simpler kernel's inefficiencies and let raw compute shine.

Final Push: Occupancy Tuning

Each SM can hold 64 warps. A 1024-thread block consumes 32 slots at once. With high register pressure, you fit one block per SM, leaving you with 32 warps. When those warps stall on memory, nothing else runs.

I restructured to 512-thread blocks (16 warp slots each), allowing 3-4 blocks per SM. More resident warps = more work available when others stall.
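
The back-of-the-envelope arithmetic behind that change looks roughly like this (a sketch; how many blocks actually stay resident depends on register and shared-memory pressure and the GPU's per-SM limits):

```mojo
# Rough occupancy arithmetic: 32 threads per warp on NVIDIA GPUs.
fn resident_warps(threads_per_block: Int, blocks_per_sm: Int) -> Int:
    return (threads_per_block // 32) * blocks_per_sm

# 1024-thread blocks, 1 resident block  -> 32 warps available to hide latency
# 512-thread blocks,  3 resident blocks -> 48 warps available to hide latency
```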

Final T4 result: 3.46 seconds.


Final Results

| GPU | Unsloth (CUDA) | Unsloth GB/s | Mojo | Mojo GB/s | Speedup |
| --- | --- | --- | --- | --- | --- |
| T4 | 3.70s | 162.6 | 3.46s | 173.7 | 1.07× |
| L4 | 3.00s | 200.3 | 2.40s | 250.2 | 1.25× |
| A100-40GB | 1.21s | 498.2 | 0.66s | 916.2 | 1.84× |
| H100-PCIe | 0.62s | 973.9 | 0.41s | 1474.1 | 1.51× |

Layout Constants

```mojo
# constants (bnb NF4)
comptime NF4_WEIGHTS_PER_BLOCK = 64   # 64 weights per NF4 block
comptime NF4_BYTES_PER_BLOCK   = 32   # 64 / 2: two 4-bit weights per byte
comptime NF4_BLOCK_SHIFT       = 5    # log2(32): block_id = global_byte >> 5
comptime STATE2_BLOCKS         = 256  # 256 NF4 blocks per state2 group
comptime STATE2_SHIFT          = 8    # log2(256): group_id = block_id >> 8
comptime CODE2_SIZE            = 256  # state2.code lookup table size

# block config
comptime TILE_BYTES_X = 256  # bytes processed per row tile (x)
comptime TILE_ROWS    = 4    # rows processed per block     (y)
comptime THREADS_X    = 128  # threads along (x)
comptime THREADS_Y    = 4    # threads along (y)

# Each thread handles 1 byte at   `byte = block_y * TILE_BYTES_X + tx`
# plus an unrolled second byte at `byte + THREADS_X` -> 2 bytes / thread.

# So per row:
# 128 threads * 2 bytes/thread = 256 bytes
```
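
For reference, the launch grid follows from these constants; this is a sketch of the mapping (the actual launch code isn't shown in this post):

```mojo
# grid.x covers rows in tiles of TILE_ROWS; grid.y covers bytes in tiles of TILE_BYTES_X.
fn ceil_div(a: Int, b: Int) -> Int:
    return (a + b - 1) // b

# grid_x = ceil_div(n_rows, TILE_ROWS)
# grid_y = ceil_div(n_bytes_per_row, TILE_BYTES_X)
# block dims = (THREADS_X, THREADS_Y) = (128, 4) -> 512 threads per block
```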

The Final Kernel

Three optimizations:

  1. 512-thread blocks for better occupancy
  2. Packed 32-bit stores to reduce memory transactions
  3. Manual unrolling to process two bytes per thread

```mojo
fn nf4_dequant_bnb_unroll2_packed_u32_bf16(
    packed_ptr:    UnsafePointer[U8,  MutAnyOrigin],
    absmax_q_ptr:  UnsafePointer[U8,  MutAnyOrigin],
    code2_ptr:     UnsafePointer[F32, MutAnyOrigin],
    absmax2_ptr:   UnsafePointer[F32, MutAnyOrigin],
    offset:        F32,
    out_ptr:       UnsafePointer[U32, MutAnyOrigin],
    n_rows:        Int,
    n_bytes_per_row: Int,
):
    """Dequantizes NF4 packed weights to packed u32 (containing two bf16 values) on GPU with unrolling.

    This kernel reads NF4-quantized weights, dequantizes them using the provided quantization state,
    and stores the results as packed u32 values (each holding two bf16 floats).
    Uses shared memory for the NF4 table and processes tiles of data with unrolling for efficiency.

    Args:
        packed_ptr: Pointer to packed NF4 weights (uint8).
        absmax_q_ptr: Pointer to quantized absmax values (uint8).
        code2_ptr: Pointer to state2 code table (float32).
        absmax2_ptr: Pointer to state2 absmax values (float32).
        offset: Quantization offset (float32).
        out_ptr: Pointer to output packed u32 (each u32 holds two bf16).
        n_rows: Number of rows in the weight matrix.
        n_bytes_per_row: Number of packed bytes per row.

    Note: Uses unrolling to process two bytes per thread for better performance.
    """
    var sh_nf4 = stack_allocation[16, F32, address_space = AddressSpace.SHARED]()
    var tx = Int(thread_idx.x)
    var ty = Int(thread_idx.y)

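    # Stage the 16-entry NF4 lookup table in shared memory.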
    if ty == 0 and tx < 16:
        sh_nf4[tx] = NF4_TABLE[tx]
    barrier()

    var row  = Int(block_idx.x) * TILE_ROWS + ty
    var byte = Int(block_idx.y) * TILE_BYTES_X + tx
    if row >= n_rows or byte >= n_bytes_per_row:
        return

    var row_base = row * n_bytes_per_row
    var gbyte0   = row_base + byte

    var p0_u8: U8 = packed_ptr[gbyte0]
    var p0: Int = Int(p0_u8)

    var idx0_first:  Int = (p0 >> 4) & 0x0F
    var idx0_second: Int =  p0       & 0x0F

    var w0n: F32 = sh_nf4[idx0_first]
    var w1n: F32 = sh_nf4[idx0_second]

    var block0: Int = gbyte0 >> NF4_BLOCK_SHIFT
    var q0: U8 = absmax_q_ptr[block0]
    var code0: F32 = code2_ptr[Int(q0)]
    var absmax20: F32 = absmax2_ptr[block0 >> STATE2_SHIFT]
    var scale0: F32 = fma(code0, absmax20, offset)

    var w0: F32 = w0n * scale0
    var w1: F32 = w1n * scale0

    var b0: BF16 = BF16(w0)
    var b1: BF16 = BF16(w1)
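    # A single packed 32-bit store writes both bf16 weights at once.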
    out_ptr[gbyte0] = pack2_bf16_to_u32(b0, b1)

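    # Unrolled second byte: THREADS_X bytes further along the same row.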
    var byte1 = byte + THREADS_X
    if byte1 < n_bytes_per_row:
        var gbyte1 = gbyte0 + THREADS_X

        var p1_u8: U8 = packed_ptr[gbyte1]
        var p1: Int = Int(p1_u8)

        var idx1_first:  Int = (p1 >> 4) & 0x0F
        var idx1_second: Int =  p1       & 0x0F

        var w2n: F32 = sh_nf4[idx1_first]
        var w3n: F32 = sh_nf4[idx1_second]

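        # THREADS_X = 128 bytes = four 32-byte NF4 blocks, so the block index advances by 4.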
        var block1: Int = block0 + 4
        var q1: U8 = absmax_q_ptr[block1]
        var code1: F32 = code2_ptr[Int(q1)]
        var absmax21: F32 = absmax2_ptr[block1 >> STATE2_SHIFT]
        var scale1: F32 = fma(code1, absmax21, offset)

        var w2: F32 = w2n * scale1
        var w3: F32 = w3n * scale1

        var b2: BF16 = BF16(w2)
        var b3: BF16 = BF16(w3)
        out_ptr[gbyte1] = pack2_bf16_to_u32(b2, b3)
```

What I Learned

Mojo's advantage is that it stays out of your way. I tried a similar kernel in Triton recently, but there was no obvious path to packed stores or manual unrolling; I spent my time fighting the abstraction instead of experimenting. With Mojo, I could quickly test different layouts and juggle two kernel architectures simultaneously.

AI-assisted development works for GPU code. The combination of a custom docs agent and systematic experimentation let me move fast despite my lack of CUDA background.

Hardware differences matter more than I expected. The same kernel can win on one GPU and lose on another. Understanding why (in this case, L2 cache size) is where the real learning happens.


Try It Yourself

If you've been "GPU curious" but bounced off the tooling complexity, Mojo's GPU Puzzles are a good entry point. The fundamentals transfer, and you might find it more approachable than you expected.

Credit to the Unsloth team for the puzzle notebook; this work builds on theirs.

If you have questions, reach out on X @davidrobertson or follow my work on GitHub @drobertson-dev.

View the final notebook →

