In the previous blog post, we motivated the Mandelbrot set problem and described basic optimization techniques to get around 90x speedup over Python. In this blog post (part 2 of our 3 part blog post series), we continue our optimization journey and describe how to go from 90x to 26,000x speedup over Python. We will share insights into the techniques we use and discuss why Mojo is well positioned to address them. In the upcoming and final part 3, we'll wrap this series up by showing you how you can get well over 35,000 speedup over Python.
How do I run this example: Last week we announced that Mojo SDK will become available for local download to everyone in early September. Sign up to be notified if you haven't already! This example will become available in the Mojo examples repository on GitHub along with Mojo SDK availability. And don't forget to join us live on our first ever Modular Community Livestream where we'll share recent developments in Mojo programming language and answer your questions about Mojo.
First, let’s recap what we've discussed so far. In the previous blog post, we showed the how to express the Mandelbrot set in Mojo:
We then used the following loop to check whether each point is within the set:
The above code produced a 90x speedup over Python and a 15x speedup over NumPy as shown in the figure below:
In this blog post, we are going to continue on this journey and describe other optimizations that can get us closer to the 35,000x speedup we reported during launch. We will still use the 88-Core Intel Xeon Platinum 8481C CPU as we did in the first blog post in this series.
Vectorizing the code
Continuing with the sequential optimizations from the first blog post, the next thing we can do is to switch from using scalars to using vector operations. Vectorization, also known as Single Instruction Multiple Data (SIMD) has been around since the 1960s and allows one to operate on multiple data elements via a single instruction. Vector operations were added to commodity PCs in the mid 1990s (e.g. in 1997 in the Pentium P5 processor). But, despite being pervasive in hardware, many programming languages still don’t have first class support for it. For example, C++ doesn’t have standard support for SIMD (though there is an experimental proposal and many third-party libraries), and SIMD support in more modern languages is still experimental or subject to caveats.
In Mojo, the SIMD type is a first-class type. To take advantage of SIMD to operate on multiple pixels at a time, we have to write our code slightly differently from the scalar code shared earlier. This is especially true because each pixel may escape at a different iteration count. The resulting code is still straightforward, though:
Code snippet above uses a vector mask to keep track of which pixels are still in the set and which pixels have escaped. Let’s dissect what each line in the loop means:
- If none of the elements in the vector lane are in the set, then we exit the loop.
- Compute the mask. i.e. if the SIMD width is 4, then a mask of [True, True, False, False] means that the first 2 elements in the 4 element patch did not diverge while the latter 2 did.
- We increment the values of the pixels that did not escape by 1.
- We update the z value unconditionally.
We can compute the above for each pixel in the image by using our vectorize method:
We again explain each step in this process:
- The compute_vector function is parameterized on a simd_width as a parameter and a runtime argument of the width of the image.
- We compute the x positions in the complex plane using the iota method. The iota function gives the sequence [0, 1, 2, …, simd_width-1] and that enables us to construct a SIMD vector representation of the x positions in the complex plane.
- We call our Mandelbrot kernel with the ComplexSIMD value. Note that Mojo supports overloading so there is no need to rename the function.
- We use the vectorize high-order generator to apply compute_vector on all elements of width in a vectorized manner, while handling the edge cases.
Pictorially what’s happening is that we are computing the Mandelbrot in simd_width chunks where each chunk is computed using vector operations. Note that while the image shows multiple rows are computed at the same time, this is purely for pictorial purposes since it would be hard to see.
Recall that in the first blog post we only achieved 90x speedup over Python. The theoretical expectation from the vectorization is around 8x speedup (the system has a 512-bit vector register, therefore we have 512/64=8 double-precision elements per SIMD vector). We can plot the speedup against Python and the best results from the first blog post (marked as blog1):
By vectorizing the implementation we are able to achieve a 527x speedup over Python and a 6x speedup over the first blog post results. That’s not an 8x speedup, however. One of the reasons why we do not get the 8x speedup is because our code performs a reduction across the vector lanes. The other reason is that escape condition is not the same across all elements in the vector, so there is extra work that’s done on the fractal boundary.
Increasing work per loop iteration
Still, that’s not good enough and much less than our 35,000x speedup goal. One thing to explore is the ratio of useful work (actually compute) to loop logic. The current ratio of the loop overhead to the amount of compute is not great if we have our SIMD width equal to the architectural SIMD width for tiny compute operations. Mojo allows us to operate on wider vectors and thus amortizing the overhead. Note that using the longer vector width is akin to, but not precisely the same as, loop unrolling.
Modern x86 systems have multiple fused multiply-add (FMA) units which enables them to execute multiple FMAs every clock cycle. On the system we are using for benchmarking there are 2 FMA ports, so we can set the SIMD width to be a multiple of the hardware SIMD width. By using twice the hardware SIMD width, you increase the amount of work per loop iteration and create two FMA paths to increase utilization of the FMA pipelines and thus increasing instruction level parallelism (ILP). Of course, this is machine dependent, and we could have alternatively used Mojo’s autotuning feature here to find the right constant.
In terms of performance, we can evaluate the speedup as we vary the num_ports and select the best one (i.e. autotune). If we plot the results, we can see that the value num_ports=4 is the ideal sweet spot (with a 1.6x speedup) and that as we increase the number of ports we degrade the performance since we increase contention.
Parallelizing the code
So far we have only looked at a single-threaded implementation (with vectorization), but modern hardware has multiple cores. A natural way to parallelize is to subdivide the complex plane into a sequence of rows, where each thread gets a sequence of rows to operate on.
In Mojo, this can be implemented using the parallelize method:
Pictorially, what’s happening is that we compute chunks of rows in parallel, with each thread assigned to a sequence of rows.
In terms of speedup, we are able to get 26,194x speedup over the Python implementation by using all the 88 cores available on the system:
However, there is a problem with the above approach. With 88 cores we should expect around an 88x speedup, but we were only getting a 30x speedup. In the next blog post we will describe what this problem is and how we can address it to achieve well over 35,000x speedup.
Conclusion
Let's summarize our optimization journey so far in these 2 blog posts:
- In part 1 we started by just taking our Python code and running it in Mojo, we then type annotated the Mojo code, and performed some algebraic simplifications see a 90x speedup up over Python.
- In this blog post, we started with the code from the first part and vectorized it. We then updated the code to utilize multiple FMA ports, and finally we changed the sequential implementation into a parallel one to take advantage of all the cores on the system to get over 26,000x speedup over Python.
Below is the same chart in log scale, where it's easier to see the exponential speedup going from Python -> Mojo:
As you can see above, our implementation achieves over 26,000x speedup over Python. In the next blog post, we are going to discuss the next set of optimizations to get over the 35,000x speedup. Stay tuned!
Throughout these two blog posts, we showed how using Mojo programming language with the knowledge of the problem and hardware insights, enables us to get massive performance improvements. We employ the same methodology in our internal Modular Kernel development cycle. If you are interested in making algorithms fast or in general interested in building and driving state-of-the-art AI infrastructure from the ground up, then consider joining Modular’s exceptional team by checking out our Kernel engineer role or other roles available on our careers page.
Don't forget to join us live on our first ever Modular Community Livestream where we'll share recent developments in Mojo programming language and answer your questions about Mojo. We will also be releasing Mojo SDK generally to everyone in early September. Sign up to download.
Until next time 🔥!