Mojo vs. Rust: is Mojo 🔥 faster than Rust 🦀 ?

February 12, 2024

Jack Clayton

AI Developer Advocate

Mojo is built on the latest compiler technology, MLIR, an evolution of LLVM (which Rust lowers to), and so it can be faster. But it depends largely on the skill of the programmer and how far they're willing to go with optimizations. Mojo's goal as a language is to meet Python developers where they are, and allow them to learn some new tricks to optimize their code to the performance limits of any hardware.

Blogs and Benchmarks

Over the weekend, Netflix engineer and Rust advocate @ThePrimeagen released a reaction video to a community guest blog we published: Outperforming Rust ⚙️ DNA sequence parsing benchmarks by 50% with Mojo 🔥. The blog post stirred up some controversy, as Rust is being positioned as a potential successor to the dominant languages in AI, which are currently Python and C++. This was @ThePrimeagen's take on Mojo vs. Rust for the future of AI programming:

If Mojo is [legitimate], I think Mojo will win, just hands down. And the reason Mojo will win, is you don't change the paradigm of any acclimated or proficient individual. You just have to learn a bit more, and you get amazing performance. If Mojo compiles fast, and it looks like a language you're already familiar with, and it's really close to being the same speed, I just don't see how you're going to make that sell [for Rust].

Following his comment, Luca Palmieri, respected Rustacean and author of Zero to Production in Rust, replied on X.

Mojo: our goal

Mojo aims to be intuitive for Python developers to learn. As Mohamed showed, he was able to pick up Mojo and optimize an algorithm using SIMD in a matter of weeks, as a side project. While this article is focused on performance differences, the points that @ThePrimeagen and Luca Palmieri made are important to us. We are heavily focused on AI, where a three-language problem exists, and where CPU+GPU programmability is so important across hardware. But let's not forget: the real goal of Mojo is lifting the world's most popular AI language, Python, and empowering developers everywhere with incredible performance, hardware portability, and programmability.

Is Mojo faster than x language?

@ThePrimeagen raised an important question: Rust is known for low-level performance, so how can Mojo provide better performance out of the box than Rust (and C++)?

A common question when users first join the Discord is "How much faster is Mojo than x language?". There are a lot of considerations surrounding any benchmark implementation, so you can't use any one benchmark to say x language is faster than y language. A better question is "How much overhead does Mojo introduce, compared to x?". A major goal for Mojo is to allow you to push hardware to the limits of physics, while remaining ergonomic and familiar to Python developers.

Compared to a dynamic language like Python, compiled languages let you eliminate unnecessary runtime work, such as allocating objects on the heap, reference counting, and periodic garbage collection. Mojo takes lessons learned and best practices from C++, Rust, and Swift to provide direct access to the machine without these kinds of overheads.

Mojo vs. Rust

Mojo and Rust both allow you to optimize at a lower level, but in Rust, for example, you can still wrap everything in Arc, Mutex, Box, etc. to avoid fights with the borrow checker, at the cost of performance. If you're writing application code this might not have any significant impact, but if you're writing a library or performance-sensitive code, that overhead can add up quickly. It's up to the programmer how much they care about reducing overhead and optimizing performance.
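
To make this concrete, here is a quick sketch of ours (not from the benchmarks discussed here) of what that wrapping costs: the shared counter behind Arc<Mutex<...>> pays for atomic reference counting and lock traffic that a directly owned value never does.

Rust
use std::sync::{Arc, Mutex};

fn main() {
    // Directly owned: the compiler proves exclusive access, no runtime cost.
    let mut count = 0u64;
    count += 1;

    // Wrapped to appease the borrow checker across threads: every clone bumps
    // an atomic refcount, and every access acquires and releases a lock.
    let shared = Arc::new(Mutex::new(0u64));
    let handle = Arc::clone(&shared);
    *handle.lock().unwrap() += 1;

    println!("{} {}", count, *shared.lock().unwrap());
}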

Both languages can use LLVM for optimized codegen, and both allow the use of inline assembly (but of course, no one can afford to do that), so they both have the same potential on traditional hardware, right?

Well sure, but the real question is: how does the performance of idiomatic/normal Mojo code compare to normal Rust code written by someone who isn’t a world expert writing in assembly for every chip, and doesn’t know all the details of how the compiler works?

Reduced memcpy with borrow by default

When a new user is learning Rust, one of the first pitfalls they run into is that function arguments default to taking an object by moving it. This means that when you pass something into a function and try to reuse it, you get a compiler error:

Rust
fn bar(foo: String) {}

fn main() {
    let foo = String::from("bar");
    bar(foo);
    dbg!(foo);
}
Output
5 |     let foo = String::from("bar");
  |         --- move occurs because `foo` has type `String`, which does not implement the `Copy` trait
6 |     bar(foo);
  |         --- value moved here
7 |     dbg!(foo);
  |     ^^^^^^^^^ value used here after move

The line with dbg! throws a compiler error, because you've moved foo into the bar function. In Rust, the move can also mean a memcpy of the String's pointer, size, and capacity. The memcpy can be optimized away by LLVM in some cases, but this doesn't always happen, and it's hard to predict unless you know how the Rust/LLVM compiler works.
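
To see why the move is cheap but not free, here is a small sketch of ours: the String value itself is just a (pointer, length, capacity) triple, and that triple is what a move copies, not the heap contents.

Rust
fn main() {
    // Moving a String copies only its stack part: (pointer, length, capacity).
    // On a 64-bit target that's three words, i.e. 24 bytes.
    println!("{}", std::mem::size_of::<String>());
}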

Mojo simplifies this concept for the standard use case:

Mojo
# foo is an immutable reference by default
fn bar(foo: String):
    pass

fn main():
    var foo = String("foo")
    bar(foo)
    print(foo)
Output
foo

Mojo arguments are borrowed by default: not only is this much gentler to learn than Rust's default, it's also more efficient because there are no implicit memcpys. If you want to get closer to Rust's behavior, you can change the argument to owned:

Mojo
fn bar(owned foo: String):
    foo += "bar"

fn main():
    var foo = String("foo")
    bar(foo)
    print(foo)
Output
foo

This still works! Because String implements a copy constructor, it's able to be moved into bar while leaving behind a copy. Under the hood this is still passing by reference for maximum efficiency; it'll only create a copy if foo is mutated.

To fully opt into the Rust default of moving an object and losing ownership, you need to use the ^ transfer operator:

Mojo
fn bar(owned foo: String):
    foo += "bar"  # Ok to mutate a uniquely owned value

fn main():
    var foo = String("foo")
    bar(foo^)
    print(foo)  # error: foo is uninit because it was transferred above

Now you finally get a compiler error for trying to use foo after the move: you have to work much harder to fight the borrow checker in Mojo! This is the better default behavior; not only is it more efficient, it doesn't roadblock engineers from dynamic programming backgrounds. They still get the behavior they expect by default, with the best performance possible.

No Pin requirement

In Rust, there is no concept of value identity. For a self-referential struct pointing to its own member, that data can become invalid if the object moves, as it'll be pointing to the old location in memory. This creates a complexity spike, particularly in parts of async Rust where futures need to be self-referential and store state, so you must wrap Self with Pin to guarantee it's not going to move. In Mojo, objects have an identity so referring to self.foo will always return the correct location in memory, without any additional complexity required for the programmer.
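
Here is a minimal sketch of ours of the underlying problem (the struct and field names are hypothetical): a struct holding a pointer into its own inline storage silently dangles the moment the struct is moved, which is exactly the hazard Pin exists to rule out.

Rust
struct SelfRef {
    buf: [u8; 8],
    // Points into `buf`, which lives inline in the struct itself.
    cursor: *const u8,
}

fn main() {
    let mut a = SelfRef { buf: *b"selfref!", cursor: std::ptr::null() };
    a.cursor = a.buf.as_ptr();

    // Moving the struct relocates `buf` to a new address, but `cursor`
    // still holds the old address, so it now (potentially) dangles.
    let b = a;
    println!("{:?} vs {:?}", b.cursor, b.buf.as_ptr());
}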

There is a nice blog titled pin and suffering that takes you on a journey of a Rustacean 🦀 working through the implications of Pin. These are complexities that a Mojician 🪄 will never encounter.

Built on state-of-the-art compiler technology

Rust was started in 2006 and Swift in 2010, and both are primarily built on top of LLVM IR. Mojo started in 2022 and builds on MLIR, a more modern, "next generation" compiler stack than the LLVM IR approach that Rust uses. There is a history here: our CEO Chris Lattner started LLVM in college in December 2000 and learned a lot from its evolution and development over the years. He then led the development of MLIR at Google to support their TPU and other AI accelerator projects, taking those lessons from LLVM IR to build the next step forward, described in this talk from 2019.

Mojo is the first programming language to take advantage of all the advances in MLIR: to produce more optimized CPU code generation, to support GPUs and other accelerators, and to achieve much faster compile times than Rust. This is an advantage that no other language currently provides, and it's why a lot of AI and compiler nerds are excited about Mojo 🔥. They can build their fancy abstractions for exotic hardware, while us mere mortals can take advantage of them with Pythonic syntax.

Great SIMD ergonomics

CPUs have special registers and instructions to process multiple pieces of data at the same time, known as SIMD (Single Instruction, Multiple Data). But the ergonomics of writing this code has historically been very ugly and difficult. These special instructions have been around for many years, yet most code is still not optimized for them. When someone works through the complexities and writes a portable SIMD-optimized algorithm, it blows the competition out of the water: for example, simd_json.

Mojo's primitives are natively designed to be SIMD-first: UInt8 is actually a SIMD[DType.uint8, 1], i.e. a SIMD vector of one element. There is no performance overhead to representing it this way, but it allows the programmer to easily use it for SIMD optimizations. For example, you can split text into 64-byte blocks, represent each block as a SIMD[DType.uint8, 64], and compare it against a single newline character to find the index of every newline. Because the SIMD registers on your machine can operate on 512 bits of data at a time, this can improve the performance of those operations by up to 64x!

Or, for a simpler example: if you have a SIMD[DType.float64, 8](2, 4, 6, 8, 16, 32, 64, 128), you can simply multiply it by a Float64(2), improving performance by 8x on most machines compared to multiplying each element individually.
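
For comparison, here is a sketch of ours of that same 8-wide multiply using Rust's portable SIMD API, which at the time of writing is nightly-only behind the portable_simd feature:

Rust
#![feature(portable_simd)]
use std::simd::Simd;

fn main() {
    // Eight f64 lanes processed together where the hardware allows it.
    let v = Simd::<f64, 8>::from_array([2.0, 4.0, 6.0, 8.0, 16.0, 32.0, 64.0, 128.0]);
    let doubled = v * Simd::<f64, 8>::splat(2.0);
    println!("{:?}", doubled.to_array());
}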

LLVM (and therefore Rust) has automatic vectorization optimization passes, but they’ll never be able to reach the same level of performance as the programmer expressing exactly what they intended, because LLVM cannot change memory layout or other important details for SIMD. Mojo has been built from the ground up to take advantage of SIMD, and writing SIMD optimizations feels very close to writing normal code.

Eager Destruction

Rust was inspired by RAII (Resource Acquisition Is Initialization) from C++, which means that once an object goes out of scope, the application developer doesn't have to worry about freeing the memory; the programming language takes care of it. This is a really nice paradigm: you get the ergonomics of a dynamic language without the performance drawback of a garbage collector.

Mojo takes this one step further: instead of waiting until the end of the scope, it frees the memory on the last use of the object. This is advantageous in the field of AI, where freeing an object early can mean deallocating a GPU tensor sooner, and therefore fitting a larger model in GPU RAM. It's a unique advantage for Mojo, where the programmer gets the best possible outcome without having to think about it. The Rust borrow checker originally extended the lifetime of everything to the end of its scope to match the destructor behavior, which had some confusing consequences for users; Rust later added Non-Lexical Lifetimes to simplify this for developers. Due to Mojo's eager destruction, we get these simplifications for free, and because they align with how objects are actually destroyed, we avoid the confusing edge cases.
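
To illustrate, here is a sketch of ours in Rust (the names are hypothetical, and the Vec stands in for a large GPU buffer): Rust only frees a value at scope end unless you drop it explicitly, whereas Mojo inserts the equivalent of that drop automatically right after the last use.

Rust
fn main() {
    let tensor = vec![0u8; 64 * 1024 * 1024]; // stand-in for a large GPU tensor
    let checksum: u64 = tensor.iter().map(|&b| b as u64).sum(); // last use

    drop(tensor); // Rust frees here only because we asked; otherwise at scope end.
                  // Mojo frees automatically right after the last use above.

    println!("checksum: {checksum}"); // plenty of further work, minus the buffer
}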

Another piece of overhead is the way Drop works in Rust: it tracks at runtime whether an object should be dropped, using drop flags. Rust can optimize these away in some cases, but Mojo defines them away categorically, eliminating the overhead in all cases.
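
Here is a sketch of ours of the kind of code that forces a runtime drop flag in Rust: because s is only conditionally moved, the compiler has to record at runtime whether the destructor still needs to run at the end of the scope.

Rust
fn maybe_consume(condition: bool) {
    let s = String::from("hello");
    if condition {
        // `s` is moved, and therefore dropped, on this branch only.
        drop(s);
    }
    // At scope end, a hidden runtime flag decides whether `s`
    // still needs its destructor to run.
}

fn main() {
    maybe_consume(true);
    maybe_consume(false);
}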

Tail Call Optimization (TCO)

Update: Community discussion pointed out that, in the original examples below, Mojo was optimizing everything away, while a potential bug was making the Rust implementation much slower. The generated assembly also shows Rust doing some form of TCO, even with heap-allocated objects. I've updated the examples below and reworded this section to take these points into consideration.

Because Mojo has eager destruction, MLIR and LLVM are able to perform tail call optimizations more effectively. This example compares a recursive function with a heap-allocated dynamic vector in both languages. Note that this is just a simple example, with as few lines of code as possible, to demonstrate the difference.

First run cargo new rust and edit ./rust/src/main.rs to look like this:

./rust/src/main.rs
fn recursive(x: usize) {
    if x == 0 {
        return;
    }
    let mut stuff = Vec::with_capacity(x);
    for i in 0..x {
        stuff.push(i);
    }
    recursive(x - 1)
}

fn main() {
    recursive(50_000);
}

Then run:

Bash
cd rust
cargo build --release
cd target/release
hyperfine ./rust

These results are on an M2 Mac:

Output
Benchmark 1: ./rust
  Time (mean ± σ):      2.119 s ±  0.031 s    [User: 1.183 s, System: 0.785 s]
  Range (min … max):    2.081 s …  2.172 s    10 runs

And you can run the Mojo version from a single file in the same folder; call it mojo.mojo:

./mojo.mojo
fn recursive(x: Int):
    if x == 0:
        return
    var stuff = List[Int](x)
    for i in range(x):
        stuff.append(i)
    recursive(x - 1)

fn main():
    recursive(50_000)

Then run:

Bash
mojo build mojo.mojo
hyperfine ./mojo
Output
Benchmark 1: ./mojo
  Time (mean ± σ):     620.6 ms ±   5.6 ms    [User: 605.2 ms, System: 2.1 ms]
  Range (min … max):   613.9 ms … 632.4 ms    10 runs

The compiler must ensure that destructors are called at the appropriate time, which for Rust is when a value goes out of scope. In the recursive function, the Vec has a destructor that needs to run after each recursive call. This means the function's stack frame can't simply be discarded or overwritten, as tail call optimization requires. Because Mojo destructs eagerly, it doesn't have this limitation, and it can apply TCO more effectively to heap-allocated objects.
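
One way to see this, in a sketch of ours rather than a benchmark from above, is to approximate Mojo's eager destruction in Rust by dropping the Vec explicitly before the recursive call, leaving nothing for the stack frame to clean up afterwards (whether LLVM then emits an actual tail call still depends on the optimizer):

Rust
fn recursive(x: usize) {
    if x == 0 {
        return;
    }
    let mut stuff = Vec::with_capacity(x);
    for i in 0..x {
        stuff.push(i);
    }
    drop(stuff); // run the destructor before the call, as Mojo effectively does,
                 // so the recursive call sits in true tail position
    recursive(x - 1)
}

fn main() {
    recursive(50_000);
}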

You can get more insight into this behaviour by profiling the two programs with valgrind --tool=massif. I switched to a Linux cloud instance to run this experiment, which put the Rust mean time at 9.067 s with 10 GB of peak allocated memory, and Mojo at 1.189 s with 1.5 MB of peak allocated memory! As previously noted, memory is an important resource in AI applications, and eager destruction ensures the programmer gets optimal behaviour without having to think about it.
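
If you want to reproduce the memory profile, the invocation looks like this; ms_print is the standard viewer that ships with valgrind, and the output file name includes the process ID:

Bash
valgrind --tool=massif ./rust
ms_print massif.out.<pid>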

You can try running the above benchmarks yourself. If you don't have Mojo 🔥 yet, you can install it here!

Conclusion

We all love Rust at Modular and are inspired by it; the tooling is great, and it currently has some of the best high-level ergonomics of any systems programming language. But it has two major problems in the field of AI, which @ThePrimeagen pointed out:

  1. It compiles slowly, and AI is all about experimentation and rapid iteration
  2. Most AI researchers experienced with Python won't take time to learn a new language from scratch

Members of our team tried to solve this problem with "Swift for TensorFlow" at Google, which didn't catch on for exactly these reasons: AI researchers weren't willing to learn a brand-new, slower-to-compile language. We love Python/C++/Rust/Swift/Julia etc., but after over a decade of the industry hill-climbing these technologies, we believe that the fresh start Mojo embodies is the only way to make a dent in these age-old problems.

Mojo already offers optimal performance for systems engineers, but it still has a long way to go on all the dynamic features Python programmers expect. Rust is an excellent choice if you need to put something into production right now. If you're curious, looking towards the future, and want to be early with a language that could be instrumental to the next 50 years of AI, give Mojo a try! We'll soon be adding AI-specific libraries to the package that ships with Mojo, which we're building as the killer app to show the world what Mojo can do. Keep an eye out for MAX in the coming weeks!

We'd love to see you in the Mojo community!
