How Mojo🔥 gets a 35,000x speedup over Python – Part 2

August 28, 2023

Abdul Dakkak

AI Compiler Engineer

In the previous blog post, we motivated the Mandelbrot set problem and described basic optimization techniques to get around a 90x speedup over Python. In this blog post (part 2 of our three-part series), we continue our optimization journey and describe how to go from a 90x to a 26,000x speedup over Python. We will share insights into the techniques we use and discuss why Mojo is well positioned for them. In the upcoming and final part 3, we'll wrap this series up by showing you how you can get well over a 35,000x speedup over Python.

How do I run this example: Last week we announced that the Mojo SDK will become available for local download to everyone in early September. Sign up to be notified if you haven't already! This example will become available in the Mojo examples repository on GitHub alongside the Mojo SDK release. And don't forget to join us live on our first ever Modular Community Livestream, where we'll share recent developments in the Mojo programming language and answer your questions about Mojo.

First, let’s recap what we've discussed so far. In the previous blog post, we showed how to express the Mandelbrot set in Mojo:


fn mandelbrot_blog_post_2(c: ComplexFloat64) -> Int:
    var z = c
    var nv = 0
    for i in range(1, MAX_ITERS):
        if z.squared_norm() > 4:
            break
        z = z.squared_add(c)
        nv += 1
    return nv
    

We then used the following loop to check whether each point is within the set:


alias height = 4096
alias width = 4096
alias min_x = -2.0
alias max_x = 0.47
alias min_y = -1.12
alias max_y = 1.12
alias scale_x = (max_x - min_x) / width
alias scale_y = (max_y - min_y) / height

for h in range(height):
	let cy = min_y + h * scale_y
	for w in range(width):
		let cx = min_x + w * scale_x
		output[h,w] = mandelbrot_blog_post_2(ComplexFloat64(cx,cy))
    

The above code produced a 90x speedup over Python and a 15x speedup over NumPy as shown in the figure below:

Figure 1: The results of the first blog post demonstrated Mojo achieving an 89x speedup over Python and 15x speedup over NumPy.

In this blog post, we are going to continue on this journey and describe other optimizations that can get us closer to the 35,000x speedup we reported during launch. We will still use the same 88-core Intel Xeon Platinum 8481C CPU as in the first blog post in this series.

Vectorizing the code

Continuing with the sequential optimizations from the first blog post, the next thing we can do is switch from scalar to vector operations. Vectorization, also known as Single Instruction Multiple Data (SIMD), has been around since the 1960s and allows one to operate on multiple data elements via a single instruction. Vector operations were added to commodity PCs in the mid-1990s (e.g. MMX on the Pentium, the P5 processor, in 1997). But despite being pervasive in hardware, many programming languages still don’t have first-class support for it. For example, C++ doesn’t have standard support for SIMD (though there is an experimental proposal and many third-party libraries), and SIMD support in more modern languages is still experimental or subject to caveats.
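
For contrast, here is a minimal sketch of what operating on a SIMD value looks like in Mojo (the lane values are purely illustrative):


fn simd_demo():
    # Operate on four float64 lanes at once.
    let x = SIMD[DType.float64, 4](1.0, 2.0, 3.0, 4.0)
    let y = x * x + 1.0   # elementwise multiply and add applied to all 4 lanes
    print(y)              # [2.0, 5.0, 10.0, 17.0]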

In Mojo, the SIMD type is a first-class type. To take advantage of SIMD to operate on multiple pixels at a time, we have to write our code slightly differently from the scalar code shared earlier. This is especially true because each pixel may escape at a different iteration count. The resulting code is still straightforward, though: 


fn mandelbrot_kernel(c: ComplexSIMD[DType.float64, simd_width]) -> \
                        SIMD[DType.index, simd_width]:
    var z = c
    var iters = SIMD[DType.index, simd_width](0)
    var in_set_mask = SIMD[DType.bool, simd_width](True)

    for i in range(MAX_ITERS):
        if not in_set_mask.reduce_or():                 # (1)
          break                   
        in_set_mask = z.squared_norm() <= 4             # (2)
        iters = in_set_mask.select(iters + 1, iters)    # (3)
        z = z.squared_add(c)                            # (4) 
    return iters
    

The code snippet above uses a vector mask to keep track of which pixels are still in the set and which pixels have escaped. Let’s dissect what each line in the loop means:

  1. If none of the lanes in the vector are still in the set, then we exit the loop.
  2. Compute the mask: for example, if the SIMD width is 4, then a mask of [True, True, False, False] means that the first 2 elements in the 4-element chunk have not yet diverged while the latter 2 have.
  3. We increment the iteration counts of the pixels that did not escape by 1 (see the short sketch after this list).
  4. We update the z value unconditionally.
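
To make steps (2) and (3) concrete, here is a minimal, standalone sketch of the mask-based update; the lane values are made up, and the only API it relies on is the same select method used in the kernel above:


fn mask_select_demo():
    # Pretend 2 of the 4 lanes are still inside the set after step (2).
    let in_set_mask = SIMD[DType.bool, 4](True, True, False, False)
    var iters = SIMD[DType.index, 4](10, 10, 10, 10)
    # Step (3): only the lanes that have not escaped get incremented.
    iters = in_set_mask.select(iters + 1, iters)
    print(iters)  # [11, 11, 10, 10]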

We can compute the above for each pixel in the image by using our vectorize method: 


for h in range(height):
  let cy = min_y + h * scale_y

  @parameter
  fn compute_vector[simd_width: Int](w: Int):   # (1)
    # (2) build the SIMD vector of x coordinates for this chunk
    let cx = min_x + (w + iota[DType.float64, simd_width]()) * scale_x
    # (3) compute and store simd_width pixels at once
    output.simd_store[simd_width](
        Index(h, w),
        mandelbrot_kernel(ComplexSIMD[DType.float64, simd_width](cx, cy)))

  # (4) apply compute_vector across the row, handling the edge cases
  vectorize[simd_width, compute_vector](width)
    

We again explain each step in this process:

  1. The compute_vector function takes simd_width as a compile-time parameter and the starting column index w as a runtime argument; the width of the image is passed to vectorize at the call site.
  2. We compute the x positions in the complex plane using the iota method. The iota function gives the sequence [0, 1, 2, …, simd_width-1], which enables us to construct a SIMD vector of the x positions in the complex plane (a small standalone sketch follows this list).
  3. We call our Mandelbrot kernel with the ComplexSIMD value. Note that Mojo supports overloading, so there is no need to rename the function.
  4. We use the vectorize higher-order generator to apply compute_vector to all elements of the width in a vectorized manner, while handling the edge cases.
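
Here is a small standalone sketch of step (2). It reuses the min_x and scale_x aliases from earlier and assumes iota is imported as in the snippet above (it lived in Mojo's math module at the time of writing):


fn iota_demo():
    # iota yields the lane offsets [0.0, 1.0, 2.0, 3.0] for a width-4 vector.
    let offsets = iota[DType.float64, 4]()
    let w = 8
    # Same expression as step (2) above: for w = 8 this packs the x indices
    # [8, 9, 10, 11] into one SIMD value and scales them into complex-plane
    # coordinates.
    let cx = min_x + (w + offsets) * scale_x
    print(cx)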

Pictorially, what’s happening is that we are computing the Mandelbrot set in chunks of simd_width pixels, where each chunk is computed using vector operations. Note that while the image shows multiple rows being computed at the same time, this is purely for illustration, since a single chunk would be too small to see.

Figure 2: Vectorization enables us to operate on the Mandelbrot set in 1D patches of length SIMD width. 

Recall that in the first blog post we only achieved a 90x speedup over Python. The theoretical expectation from vectorization is around an 8x speedup (the system has 512-bit vector registers, therefore we have 512/64 = 8 double-precision elements per SIMD vector). We can plot the speedup against Python and against the best results from the first blog post (marked as blog1):

Figure 3: Vectorizing the code enables us to achieve a 527x speedup over Python and a 6x speedup over the first blog post results.

By vectorizing the implementation we are able to achieve a 527x speedup over Python and a 6x speedup over the first blog post results. That’s not the full 8x speedup, however. One of the reasons we do not get the full 8x is that our code performs a reduction across the vector lanes. The other reason is that the escape condition is not the same across all elements in the vector, so extra work is done on the fractal boundary.

Increasing work per loop iteration

Still, that’s not good enough and is far short of our 35,000x speedup goal. One thing to explore is the ratio of useful work (actual compute) to loop logic. When the SIMD width equals the architectural SIMD width and the per-iteration compute is tiny, the ratio of loop overhead to compute is not great. Mojo allows us to operate on wider vectors and thus amortize the overhead. Note that using a longer vector width is akin to, but not precisely the same as, loop unrolling.

Modern x86 systems have multiple fused multiply-add (FMA) units, which enable them to execute multiple FMAs every clock cycle. The system we are using for benchmarking has 2 FMA ports, so we can set the SIMD width to a multiple of the hardware SIMD width. By using twice the hardware SIMD width, we increase the amount of work per loop iteration and create two independent FMA chains, which increases utilization of the FMA pipelines and thus instruction-level parallelism (ILP). Of course, this is machine dependent, and we could alternatively have used Mojo’s autotuning feature here to find the right constant.
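
For reference, here is a sketch of how the simd_width used in these kernels might be obtained from the hardware. The simdwidthof helper from sys.info is an assumption on our part (the exact name and module have shifted across Mojo versions), not code from the original post:


from sys.info import simdwidthof

# Number of float64 lanes in one hardware vector register
# (8 on this 512-bit AVX-512 machine). The software width used by the
# kernel below is then num_ports * simd_width.
alias simd_width = simdwidthof[DType.float64]()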


alias num_ports = 4 # Is autotuned

fn mandelbrot_kernel(c: ComplexSIMD[DType.float64, 
                                    num_ports * simd_width]) -> \
                                    SIMD[DType.index, 
                                         num_ports * simd_width]:
    var z = c
    var iters = SIMD[DType.index, num_ports * simd_width](0)
    var in_set_mask = SIMD[DType.bool, num_ports * simd_width](True)

    for i in range(MAX_ITERS):
        if not in_set_mask.reduce_or():
            break
        in_set_mask = z.squared_norm() <= 4    
        iters = in_set_mask.select(iters + 1, iters)    
        z = z.squared_add(c)                   
    return iters
    

In terms of performance, we can evaluate the speedup as we vary num_ports and select the best value (i.e. autotune). If we plot the results, we can see that num_ports=4 is the sweet spot (a 1.6x speedup) and that increasing the number of ports further degrades performance, since it increases contention.

Figure 4: Increasing the software SIMD width improves our speedup over Python from 527x to 875x and is a 1.6x speedup improvement over vectorizing using the architectural SIMD width.
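
As a rough sketch of how such a sweep could be done by hand, one might time each specialization with Mojo's benchmark module; the Benchmark().run pattern and the time_kernel/sweep_num_ports wrappers below are our assumptions, not code from the original post:


from benchmark import Benchmark

fn time_kernel[num_ports: Int]() -> Int:
    @parameter
    fn test_fn():
        pass  # ... render the full image with the num_ports * simd_width kernel ...
    # Returns the best observed time in nanoseconds.
    return Benchmark().run[test_fn]()

fn sweep_num_ports():
    # Each call compiles and times a separate compile-time specialization;
    # pick the num_ports value with the smallest time.
    print(time_kernel[1]())
    print(time_kernel[2]())
    print(time_kernel[4]())
    print(time_kernel[8]())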

Parallelizing the code

So far we have only looked at a single-threaded implementation (with vectorization), but modern hardware has multiple cores. A natural way to parallelize is to subdivide the complex plane into chunks of rows, where each thread gets a chunk of rows to operate on.

In Mojo, this can be implemented using the parallelize method:


fn compute_row(chunk_idx: Int):
  # chunk_size (the number of rows per chunk) is assumed to be defined earlier.
  let y = chunk_size * chunk_idx
  let cy = min_y + y * scale_y

  @parameter
  fn compute_vector[simd_width: Int](w: Int):
    let cx = min_x + (w + iota[DType.float64, simd_width]()) * scale_x
    output.simd_store[simd_width](
        Index(y, w),
        mandelbrot_kernel(ComplexSIMD[DType.float64, simd_width](cx, cy)))
  vectorize[num_ports * simd_width, compute_vector](width)

with Runtime(num_cores()) as rt:
  parallelize[compute_row](rt, height)
    

Pictorially, what’s happening is that we compute chunks of rows in parallel, with each thread assigned to a sequence of rows.

Figure 5: Parallelization allows us to split the work across threads and exercise all the cores on the system.

In terms of speedup, we are able to get a 26,194x speedup over the Python implementation by using all 88 cores available on the system:

Figure 6: By parallelizing the work across all cores we get a 26,000x speedup over Python and a 30x speedup over the vectorized version.

However, there is a problem with the above approach. With 88 cores we should expect around an 88x speedup, but we are only getting a 30x speedup. In the next blog post we will describe what this problem is and how we can address it to achieve well over a 35,000x speedup.

Conclusion

Let's summarize our optimization journey so far in these 2 blog posts:

  • In part 1 we started by simply taking our Python code and running it in Mojo; we then added type annotations to the Mojo code and performed some algebraic simplifications to get a 90x speedup over Python.
  • In this blog post, we started with the code from part 1 and vectorized it. We then updated the code to utilize multiple FMA ports, and finally we changed the sequential implementation into a parallel one that takes advantage of all the cores on the system, reaching over a 26,000x speedup over Python.

Figure 7: Recapping our journey. We started with an 89x speedup over Python, applied a few optimizations using insights about the problem, and achieved a 26,000x speedup over Python.

Below is the same chart in log scale, where it's easier to see the exponential speedup going from Python -> Mojo:

Figure 8: This is the same data as Figure 7 but plotted in log scale to better showcase the evolution of our implementation from Python.

As you can see above, our implementation achieves over a 26,000x speedup over Python. In the next blog post, we are going to discuss the next set of optimizations that take us past the 35,000x mark. Stay tuned!

Throughout these two blog posts, we showed how using the Mojo programming language, combined with knowledge of the problem and insights about the hardware, enables us to get massive performance improvements. We employ the same methodology in our internal Modular kernel development cycle. If you are interested in making algorithms fast, or more generally in building and driving state-of-the-art AI infrastructure from the ground up, then consider joining Modular’s exceptional team by checking out our Kernel Engineer role or the other roles available on our careers page.

Don't forget to join us live on our first ever Modular Community Livestream, where we'll share recent developments in the Mojo programming language and answer your questions about Mojo. We will also be releasing the Mojo SDK to everyone in early September. Sign up to download.

Until next time 🔥!

Abdul Dakkak, AI Compiler Engineer