When we announced Mojo, we claimed that Mojo can be 35,000 times faster than Python, and we demonstrated this using a specific compute-bound problem: generating a Mandelbrot set. This is an impressive number that has garnered some scrutiny (as it should). In this blog post series, we will show you how we arrived at this figure and the optimizations we implemented to achieve that performance.

In this first part of the series, we'll start by motivating the Mandelbrot problem and why it's a good candidate for measuring performance. We will then describe some basic optimizations you can apply in Mojo to get faster performance than Python code. Finally, we will compare our optimized Mojo implementation against both pure-Python code and Python code using the NumPy library, and discuss why high-performance libraries like NumPy and other programming frameworks are not well suited to this problem.

While these basic optimization techniques show that Mojo is much faster than Python, they alone won't net you a 35,000x speedup. As we continue through this blog series, we will introduce the other Mojo features you'll need to achieve the full 35,000x speedup.

### What is the Mandelbrot set?

The Mandelbrot set is the set of complex numbers $c$ for which the function $z_{n+1}={z_n}^2+c$ does not diverge when iterated from $z_0=0$ at each point $c$ in the complex plane. In simpler terms, the algorithm repeatedly squares $z_n$ and adds $c$ to get the next iteration value $z_{n+1}$. For some points this sequence diverges and for others it converges, and this determines whether the point belongs to the set.

There are multiple ways to compute the Mandelbrot set. The simplest is the "escape time algorithm", which looks like the following in both Python and Mojo:

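A minimal Python sketch of the escape-time kernel (the Mojo version is structurally identical; the function name and exact loop shape here are ours):

```python
MAX_ITERS = 1000  # bailout threshold used throughout this post

def mandelbrot_kernel(c: complex) -> int:
    """Return the iteration at which z escapes (|z| > 2), or MAX_ITERS."""
    z = 0j
    for i in range(MAX_ITERS):
        z = z * z + c  # z_{n+1} = z_n^2 + c
        if abs(z) > 2:
            return i
    return MAX_ITERS
```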

The algorithm iterates until the "escape" condition is met (when the norm is greater than 2), or it "bails out" after *MAX_ITERS* iterations (set to 1000 throughout this blog post). The escape time algorithm is commonly used as a benchmark for demonstrating performance. If we plot which points are in the set and which are not, we get the following pictorial representation.

[Figure: the Mandelbrot set plotted over the complex plane]

There are other algorithms that take advantage of the periodicity, symmetry, and other features of the Mandelbrot set. For our benchmarking purposes, we are interested in the brute force approach to see how much we can push Mojo to best utilize the hardware.

### Why Mandelbrot?

If you're wondering why we should even care about the Mandelbrot set, the simple answer is that we do not. But from a compute perspective, the Mandelbrot set algorithm has a few interesting properties:

- **Simple to express:** Mandelbrot is simple to understand and write. It acts as a proxy for many compute-heavy iterative operations.
- **Pure compute:** Mandelbrot has no memory overhead (i.e. you perform quite a bit of computation per output pixel). Furthermore, the amount of work per output pixel is adjustable via the max iteration count or the domain of the points.
- **Irregular computation:** Adjacent pixels do not necessarily both belong in or out of the set, and efficient computation requires early escape. Hence, the computation cannot be efficiently expressed in systems that only operate on regularly structured loop nests (e.g. array-based or polyhedral programming paradigms).
- **Embarrassingly parallel:** Each pixel computation is independent of the adjacent ones, so the Mandelbrot computation is easy to parallelize.
- **Vectorizable:** The Mandelbrot computation can be written to utilize the vector operations available on modern CPUs.

### How to speed up the Mandelbrot algorithm

At a high level, the recipe for maximal performance is to utilize the available hardware to its theoretical limits, so you are only bound by the "physics" of the hardware, but reaching those limits is easier said than done 🙂.

The simplest answer to speed up your code is: just use Mojo. The code below is Mojo code. We welcome you to try the Mojo Mandelbrot notebook in the Mojo Playground and experiment with this code.

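The change amounts to annotating the function's input type, sketched here in Python syntax for illustration (in Mojo the annotation actually drives code generation; in Python it is only a hint, and the names are ours):

```python
MAX_ITERS = 1000

def mandelbrot_1(c: complex) -> int:
    # Identical to the untyped kernel except for the annotated input
    # type, which in Mojo unlocks compiler optimizations.
    z = 0j
    for i in range(MAX_ITERS):
        z = z * z + c
        if abs(z) > 2:
            return i
    return MAX_ITERS
```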

In fact, we just took the same Python code we shared earlier and added type annotations (see the introductory blog post on Mojo for Python users and the Mojo programming manual). Mojo uses these types to optimize the generated code. Note that here we only annotated the function input types; everything else remains the same.

A further optimization is to opt into Mojo's "strict" mode. This enables Mojo to perform more aggressive optimizations, since the compiler can discern certain properties about the program. You opt into strict mode by using *fn* instead of *def* and explicitly declaring the variables.

### Simplifying the math to reduce computation

Once we switch to Mojo, we can look at what further optimizations can be done. The first thing to try is to improve the sequential implementation by removing redundant computation. We can do two things here. One is to avoid the square root by changing the check *abs(z) > 2* to *squared_norm(z) > 4*; a square root typically takes around 6 FLOPs (floating-point operations).

The other is to simplify the computation of $z^2+c$, which would naively be computed as $x*y+c$ and expanded to:

$$ (x_{re} y_{re}-x_{im} y_{im}+c_{re} , x_{re} y_{im}+x_{im} y_{re}+c_{im}) $$

Since $x=y$ in our example, we can simplify the equation to:

$$ (x_{re} x_{re}-x_{im} x_{im}+c_{re} , 2x_{im}x_{re} + c_{im}) $$

This saves one further FLOP.
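A quick numeric check, in Python, that the simplified form agrees with the naive complex multiply-add (the helper name is ours):

```python
def square_plus_c(z: complex, c: complex) -> complex:
    # Simplified z^2 + c: (z_re^2 - z_im^2 + c_re, 2*z_re*z_im + c_im)
    return complex(z.real * z.real - z.imag * z.imag + c.real,
                   2.0 * z.real * z.imag + c.imag)

# Spot-check against the naive computation z*z + c:
z, c = 1.5 - 0.5j, 0.25 + 0.75j
assert abs(square_plus_c(z, c) - (z * z + c)) < 1e-12
```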

Note that it is fair to make these simplifications because the Mojo language allows us to write high-performance primitives. In Python, on the other hand, it's impossible to natively implement high-performance primitives, and as a result one is stuck with only what's already available in the language or its libraries.

In total, we save around 7 FLOPs. This might not sound like much, but this operation is in the innermost loop, and those FLOPs add up.

Squared addition is a common operation, and the Mojo standard library provides a special function for it called *squared_add*, which is implemented using FMA instructions for maximum performance.


We can rewrite our code to utilize this function as follows:
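In Python terms, the rewritten kernel looks roughly like the following (this *squared_add* is our stand-in mirroring the semantics of Mojo's library function; it will not emit FMA instructions from Python):

```python
MAX_ITERS = 1000

def squared_add(z: complex, c: complex) -> complex:
    # Stand-in for Mojo's squared_add: computes z*z + c
    return complex(z.real * z.real - z.imag * z.imag + c.real,
                   2.0 * z.real * z.imag + c.imag)

def squared_norm(z: complex) -> float:
    # |z|^2, avoiding the square root inside abs()
    return z.real * z.real + z.imag * z.imag

def mandelbrot_2(c: complex) -> int:
    z = 0j
    for i in range(MAX_ITERS):
        z = squared_add(z, c)
        if squared_norm(z) > 4:
            return i
    return MAX_ITERS
```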

### Putting it all together

Up until now, we have implemented 3 variants of the Mandelbrot kernel: *mandelbrot_0*, *mandelbrot_1*, and *mandelbrot_2*. This is the core compute operation, but we still need to iterate through the pixels in the image and compute each one's membership in the set. Again, this is straightforward in Mojo:

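In Python terms, the driver loop looks roughly like this (the bounds, image size, and helper names are illustrative; any of the three kernels can stand in for *mandelbrot_N*):

```python
MAX_ITERS = 1000

def mandelbrot_N(c: complex) -> int:
    # Any of the kernel variants above; repeated here for self-containment.
    z = 0j
    for i in range(MAX_ITERS):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag > 4:
            return i
    return MAX_ITERS

def compute_mandelbrot(xmin: float, xmax: float, ymin: float, ymax: float,
                       width: int, height: int) -> list:
    # Map each pixel (row, col) to a point c in the complex plane
    # and record its escape iteration count.
    dx = (xmax - xmin) / width
    dy = (ymax - ymin) / height
    out = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            c = complex(xmin + col * dx, ymin + row * dy)
            out[row][col] = mandelbrot_N(c)
    return out
```

For example, `compute_mandelbrot(-2.0, 0.7, -1.12, 1.12, 960, 960)` would produce a full-resolution grid of iteration counts.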

With the above, you can substitute *mandelbrot_0*, *mandelbrot_1*, or *mandelbrot_2* for *mandelbrot_N* to evaluate the Mojo implementation against Python over the same domain, iteration count, and datatype (Python uses double precision, and so do we).

We use an Intel Xeon Platinum 8481C CPU for all evaluations, and the results are shown below:

[Figure: benchmark results for the Mojo Mandelbrot variants versus Python]

There are a few things to note here. First, just by switching to Mojo we achieved a **46x** speedup. Second, there is no difference in performance between *mandelbrot_0* and *mandelbrot_1*. The reason is that Mojo's type inference and optimizations remove the dynamism from the types, allowing the compiler to work with concrete types rather than variants.

And by introducing *squared_norm* and *squared_add*, we achieved an **89x** speedup over Python.

### What about NumPy?

NumPy is a popular Python library that brings the array programming paradigm to Python. The idea behind NumPy is to provide C implementations of common Python operations; by operating on large arrays (instead of scalars), it amortizes the Python interpreter overhead.

There are multiple ways to implement the Mandelbrot algorithm using NumPy. One option is to use numpy.vectorize; another is to operate on the entire region at once, relying on NumPy itself to perform the operation efficiently. The code snippet below shows the second approach:

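A hedged NumPy sketch of this whole-region approach (our reconstruction; the variable names are illustrative):

```python
import numpy as np

MAX_ITERS = 1000

def mandelbrot_numpy(c: np.ndarray) -> np.ndarray:
    # Update every point in the region in lockstep. `not_escaped`
    # masks the points still inside the |z| <= 2 bound, but the loop
    # itself always runs MAX_ITERS times over the whole array.
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=np.int64)
    for it in range(MAX_ITERS):
        not_escaped = z.real * z.real + z.imag * z.imag <= 4.0
        counts[not_escaped] = it
        z[not_escaped] = z[not_escaped] ** 2 + c[not_escaped]
    return counts
```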

Pictorially, what's happening above is that on each step we update all the values in the region in lockstep according to the mask, and selectively update the mask values.


So unlike our approach, where a pixel rarely needs the full 1000 iterations of work (which happens only near the boundary of the fractal), the NumPy implementation above is guaranteed to waste work: as you can see, it performs *MAX_ITERS* operations on every pixel in the image unconditionally. With NumPy you have only limited control over the algorithm and how it is implemented, so it cannot perform the optimizations we mentioned earlier.

As a result, the NumPy implementation is only about 5x faster than pure Python.

### Conclusion

In this blog post, we motivated the Mandelbrot example, presented detailed performance numbers comparing Mojo's implementation of Mandelbrot against existing solutions, and described why those solutions are not well equipped to solve this problem.

Clearly, this is not yet the 35,000x speedup that we showcased at launch. In the next blog post, we will shed some light on the next set of Mojo features that enable the 35,000x speedup. We will also share some insights into why Mojo is ideally positioned to solve this and other performance programming problems.

Mandelbrot is a nice problem for demonstrating the capabilities of Mojo. We use similar methodologies to author all of the library and kernel functions that power our AI engine, enabling speed and efficiency across our stack without sacrificing generality or good programming practices. If you are interested in making things fast, or in building and driving state-of-the-art AI infrastructure from the ground up, consider joining Modular's exceptional team by checking out our careers page. And consider joining our awesome community on Discord and sharing your Mojo journey with us on social media.

Until next time 🔥!
