August 18, 2023

How Mojo🔥 gets a 35,000x speedup over Python – Part 1

When we announced Mojo, we claimed that Mojo can be 35,000 times faster than Python and we demonstrated this using a specific compute-bound problem: generating a Mandelbrot set. This is an impressive number that has garnered some scrutiny (as it should). In this blog post series, we will show you how we arrived at this figure and the optimizations we implemented to achieve the performance.

In this first part of the series, we'll start by motivating the Mandelbrot problem and why it’s a good candidate to measure performance. We will then describe some basic optimizations you can do in Mojo to get faster performance over Python code. Finally, we will compare our optimized implementation in Mojo against both pure-Python code and Python code using the NumPy library, and discuss why high-performance libraries like NumPy and other programming frameworks are not best suited to handle this problem. 

While these basic optimization techniques show that Mojo is much faster than Python, they alone won’t net you 35,000x speedup. As we continue through this blog series, we will introduce other Mojo features that you'll need to achieve the 35,000x speedup.

What is the Mandelbrot set?

The Mandelbrot set is the set of complex numbers $c$ for which the function $z_{n+1}={z_n}^2+c$  does not diverge when evaluated iteratively from $z=0$ at each point $c$ in the complex plane. In simpler terms, the algorithm repeatedly squares $z_n$ and adds it to $c$ to get the next iteration value  $z_{n+1}$. For some points this sequence will diverge and for some it will converge - this will determine whether the point belongs to the set or not.

There are multiple ways to compute the Mandelbrot set. The simplest is the “escape time algorithm” which looks like the following in both Python and Mojo:


MAX_ITERS = 1000
def mandelbrot_kernel(c): 
  z = c
  nv = 0
  for i in range(MAX_ITERS):
    if abs(z) > 2:
      break
    z = z*z + c
    nv += 1
  return nv
    

The algorithm iterates up to the “escape” condition (when the norm is greater than 2) or it “bailouts out” after MAX_ITERS iterations (set to 1000 throughout this blog post). The “escape time algorithm” is commonly used as a benchmark for demonstrating performance. If we plot which points are in the set and which ones are not, we get the following pictorial representation.

Figure 1: A pictorial representation of the Mandelbrot set. White pixels reside within the set while black pixels are not within the set.

There are other algorithms that take advantage of the periodicity, symmetry, and other features of the Mandelbrot set. For our benchmarking purposes, we are interested in the brute force approach to see how much we can push Mojo to best utilize the hardware.

Why Mandelbrot?

If you’re wondering why we should even care about the Mandelbrot set – the simple answer is that we do not. But from a compute perspective, the Mandelbrot set algorithm has a few interesting properties:

  • Simple to express: Mandelbrot is simple to understand and write. It acts as a proxy to many compute heavy iterative operations.
  • Pure compute: Mandelbrot has no memory overhead (i.e. you perform quite a bit of computation per output pixel). Furthermore, the amount of work to be done per output pixel is adjustable via the max iteration factor or the domain of the points.
  • Irregular computation: adjacent pixels do not necessarily belong in or out of the set, and efficient computation requires early escape. Hence, the computation cannot be efficiently expressed in systems that only operate on regularly structured loop nests (e.g. array-based programming or polyhedral programming paradigms).
  • Embarrassingly parallel: The Mandelbrot computation is easy to parallelize. This means that each pixel computation is independent of the adjacent one. 
  • Vectorizable: The Mandelbrot computation can be written to utilize the vector operations available on modern CPUs.

How to speed up the Mandelbrot algorithm

At a high-level, the recipe to achieve maximal performance is to utilize the available hardware to its theoretical limits, so you are only bound by the “physics” of the hardware – but reaching those limits is easier said than done 🙂.

The simpler answer to speed up your code is: just use Mojo. The code below is Mojo code. We welcome folks to try the Mojo Mandelbrot notebook that is in the Mojo Playground to experiment with this code.


def mandelbrot_0(c: ComplexFloat64) -> Int:
    z = c
    nv = 0
    for i in range(1, MAX_ITERS):
      if abs(z) > 2:
        break
      z = z*z + c
      nv += 1
    return nv
    

In fact, we just took the same Python code we shared earlier and added type annotations (see the introductory blog post on Mojo for Python users and the Mojo programming manual). Mojo uses these types to optimize the generated code. Note, that here we only annotated the function input types and everything else remains the same.

A further optimization we can do is to opt into the Mojo “strict” mode. This will enable Mojo to perform more aggressive optimizations since the compiler can discern certain properties about the program. You opt into the strict mode by using fn instead of def and declaring the variables:


fn mandelbrot_1(c: ComplexFloat64) -> Int:
    var z = c
    var nv = 0
    for i in range(1, MAX_ITERS):
      if abs(z) > 2:
        break
      z = z*z + c
      nv += 1
    return nv
    

Simplifying the math to reduce computation

Once we switch to Mojo, we can look at what further optimizations can be done. The first thing to try is to make the sequential implementation better by removing redundant computations. We can do two things here. One is to avoid the square root, by changing the check of abs(z) > 2 to be squared_norm(z) > 4. A square root typically takes around 6 flops (floating point operation).

The other is by simplifying the computation of $z^2+c$ which would be naively computed as $x*y +c$ and expanded to:

$$ (x_{re} y_{re}-x_{im} y_{im}+c_{re} , x_{re} y_{im}+x_{im} y_{re}+c_{im}) $$

Since $x=y$ in our example, we can simplify the equation to:

$$ (x_{re} x_{re}-x_{im} x_{im}+c_{re} , 2x_{im}x_{re} + c_{im}) $$

This saves a further single flop.

Note that it is fair to do the above problem simplifications because the Mojo language allows us to write high-performance primitives. In Python, on the other hand, it’s impossible to natively implement high-performance primitives, and as a result one gets stuck only with what’s already available in the language or its libraries.

In total, we save around 7 FLOPS. This might not sound like much, but this operation is in the innermost loop and these FLOPS add up.

Squared addition is a common operation, and the Mojo standard library provides a special function for it called squared_add, which is implemented using FMA instructions for maximum performance.


struct Complex[type: DType]:
    ...
    fn squared_add(self, c: Self) -> Self:
        return Self(
            fma(self.re, self.re, fma(-self.im, self.im, c.re)),
            fma(self.re, 2*self.im, c.im)))
            

We can rewrite our code to utilize this function as follows:


fn mandelbrot_2(c: ComplexFloat64) -> Int:
    var z = c
    var nv = 0
    for i in range(1, MAX_ITERS):
        if z.squared_norm() > 4:
            break
        z = z.squared_add(c)
        nv += 1
    return nv
    

Putting it all together

Up until now, we have implemented 3 variants of the Mandelbrot kernel: mandelbrot_0, mandelbrot_1 and mandelbrot_2. This is the core compute operation, but we still need to iterate through the pixels in the image and compute the membership in the set. Again, this is straightforward in Mojo:


alias height = 4096
alias width = 4096
alias min_x = -2.0
alias max_x = 0.47
alias min_y = -1.12
alias max_y = 1.12
alias scalex = (max_x - min_x) / width
alias scaley = (max_y - min_y) / height

for h in range(height):
	let cy = min_y + h * scale_y
	for w in range(width):
		let cx = min_x + w * scale_x
		output[h,w] = mandelbrot_N(ComplexFloat64(cx,cy))
    

With the above, you can replace mandelbrot_N with mandelbrot_0, mandelbrot_1 and mandelbrot_2 to evaluate the Mojo implementation against Python over the same domain, iteration count, and datatype (Python uses double-precision and so do we).

We use an Intel Xeon Platinum 8481C CPU for all evaluations, and the results are shown below:

Figure 2: The speedups of the different Mojo implementations presented above compared to Python. Even moving the code from Python to Mojo with no code changes provides a 46x speedup.

There are a few things to note here. First, by just switching to Mojo we achieved a 46x speedup. Second, there is no difference in performance between mandelbrot_0 and  mandelbrot_1. The reason is that Mojo’s type inference and optimizations remove the dynamism from the types – allowing one to work with concrete types rather than variants.

And by introducing squared_norm and squared_add we achieved a 89x speedup over Python

What about NumPy?

NumPy is a popular library in Python that brings an array programming language paradigm to Python. The idea of NumPy is that we can provide C implementations of common Python operations and by operating large arrays (instead of scalars) we can amortize the Python interpreter overhead.

There are multiple ways to implement the Mandelbrot algorithm using NumPy. One option is to use numpy.vectorize, and the other way is operate on the entire region at once relying on Numpy itself to perform that operation efficiently. The code snippet below shows the second approach:


zs = np.zeros((n, m), dtype=complex)
mask = np.full((n, m), True, dtype=bool)
for i in range(MAX_ITER):
    zs[mask] = zs[mask] * zs[mask] + cs[mask]
    mask[np.abs(Z) > 2] = False
    

Pictorially, what’s happening above is that we are updating all the values in the Mandelbrot set in lockstep based on the mask each time and selectively updating the mask values.

Figure 3: The NumPy implementation of Mandelbrot updates the entire image in lockstep.

So instead of our vectorization approach where it is unlikely that we need 1000 iterations of work (which could happen at the boundary of the fractal), the NumPy implementation above is guaranteed to waste work. As you see above, it performs MAX_ITERS operations on every pixel in the image unconditionally. With NumPy you only has limited control on the algorithm and how it works and is implemented, and thus it cannot perform the optimizations we mentioned earlier.

As a result, a NumPy implementation is only 5x faster than Python:

Figure 4: The NumPy implementation is only able to achieve 5x speedup over Python. The Mojo implementation is 15x faster than the NumPy version.

Conclusion

In this blog post, we motivated the Mandelbrot example. We also presented more detailed performance numbers comparing Mojo’s implementation of Mandelbrot vs existing solutions. We also described why existing solutions are not well-equipped to solve this problem.

Clearly this is not the 35,000x speedup that we showcased in the launch. In the next blog post we will shed some light on the next set of Mojo features that enable the 35,000x speedup. We will also share some insights as to why Mojo is ideally positioned to solve this and other performance programming problems.

Mandelbrot is a nice problem that demonstrates the capabilities of Mojo. We use similar methodologies to author all of our Library and Kernel functions that power our AI engine to enable speed and efficiency across our stack — without sacrifice to generality and good programming practices. If you are interested in making things fast or in general interested in building and driving state-of-the-art AI infrastructure from the ground up, then consider joining Modular’s exceptional team by checking out our careers page. And consider joining our awesome community on Discord and share your Mojo journey with us on social media

Until next time 🔥!

PRODUCT

Abdul Dakkak
,
AI Compiler Engineer