Row-major vs. column-major matrices: a performance analysis in Mojo and NumPy

April 10, 2024

Shashank Prasanna

AI Developer Advocate

A matrix is a rectangular collection of row vectors and column vectors that defines linear transformation. A matrix however, is not implemented as a rectangular grid of numbers in computer memory, we store them as a large array of elements in contiguous memory.

In this blog post, we’ll use Mojo🔥 to take a closer look at how matrices are stored in memory and we’ll answer the questions: What is row-major and column-major ordering? What makes them different? Why do some languages and libraries prefer column-major order and others row-major order? What implication ordering in memory have on performance?

Row-major order vs. column major order

Row-major and column major are different approaches for storing matrices (and higher order tensors) in memory. In row-major order, row-vectors or elements in a row are stored in contiguous memory locations. And in column-major order, column-vectors or elements in a column are stored in contiguous memory locations. Since it’s faster to read and write data that’s stored in contiguous memory, algorithms that need to frequently access elements in rows run faster on row-major order matrices, and algorithms that need to frequently access elements in columns run faster on column-major order matrices. 

Programming languages like MATLAB, Fortran, Julia and R store matrices in column-major order by default since they are commonly used for processing column-vectors in tabular data sets. The popular Python library Pandas also uses column-major memory layout for storing DataFrames. On the other hand, general purpose languages like C, C++, Java and Python libraries like NumPy prefer row-major memory layout by default.

Let's take an example of a common operation in data science, that involves performing reduction operations (sum, average, variance) on a smaller subset of columns in large tabular datasets stored in databases. Reduction operations on columns using column-major order matrices significantly outperform row-major order matrices since reading column values from contiguous memory is faster than reading them with strides (hops across memory). In the red box below, you can see the performance difference on my M2 MacBook Pro using Mojo. In the purple box, you can see that Mojo also outperforms NumPy in both NumPy’s row-major order and column-major order matrices.

Implementing row-major order and column-major order matrices in Mojo

To better understand how matrices are stored in memory, and how it affects performance let’s:

  1. Implement a custom matrix data structure in Mojo called MojoMatrix that lets you instantiate matrices with data in either row-major or column-major order
  2. Inspect row-major and column-major matrices and their memory layouts, and how they are related to each other and to their transposes
  3. Implement fast vectorized column-wise reduction operations for both row-major and column-major matrices and compare their performance. 
  4. Compare their performance with NumPy’s row-major ndarrays (default, C order) and NumPy’s column-major ndarrays (F order)

It’s important to note that the order in which elements of a matrix are stored in memory doesn’t affect what the matrix looks and behaves like, mathematically. The row-/column- majorness is purely an implementation detail, but it does affect the performance of algorithms that operate on matrices. In the accompanying row_col_matrix.mojo file on GitHub, I’ve implemented a matrix data structure called MojoMatrix. MojoMatrix is a barebones matrix data structure that implements the following functions, which we’ll use to inspect and learn about the data layouts and its effect on performance.

  • to_colmajor(): A function that accepts a MojoMatrix with elements in row-major order and returns a MojoMatrix with elements in column-major order
  • mean(): A function that computes column-wise mean, in a fast, vectorized way that is 4x faster than NumPy’s ndarray.mean() when using NumPy’s column-major order or 25x faster than NumPy’s default row-major ordering.
  • transpose(): A function that computes a vectorized transpose of a matrix, that we’ll use to inspect the elements in memory
  • print() and print_linear(): A print function to print the matrix in matrix format, and a print_linear function that prints the elements of the matrix as they’re laid out in memory
  • from_numpy(): A convenience static function that copies a NumPy array to MojoMatrix so we can measure performance and compare performance on the same data
  • to_numpy(): A convenience function that copies MojoMatrix data to NumPy array to verify accuracy of operations performed in Mojo and NumPy

Let’s take a closer look at using MojoMatrix before we run benchmarks.

Mojo
from row_col_matrix import MojoMatrix vals = List[Float64](1,2,3,4,5,6,7,8,9,10,11,12) M = MojoMatrix(4,3,vals) print(M) print("Memory layout:") M.print_linear()

In this example we create a simple matrix that resembles the one displayed in illustration earlier in the blog post. 

Output
[[1.0 2.0 3.0] [4.0 5.0 6.0] [7.0 8.0 9.0] [10.0 11.0 12.0]] Matrix: 4x3 | DType:float64 | Row Major Memory layout: [1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0]

If you take a closer look, you’ll see that the matrix output says Row Major and you can verify from the output of print_linear() resembles the row-major order layout in the illustration earlier. 

Now let’s convert our row-major order matrix to column-major order matrix using to_colmajor()

Mojo
from row_col_matrix import MojoMatrix vals = List[Float64](1,2,3,4,5,6,7,8,9,10,11,12) M = MojoMatrix(4,3,vals) mm_col_major = M.to_colmajor() print(mm_col_major) print("Memory layout:") mm_col_major.print_linear()
Output
[[1.0 2.0 3.0] [4.0 5.0 6.0] [7.0 8.0 9.0] [10.0 11.0 12.0]] Matrix: 4x3 | DType:float64 | Column Major Memory layout: [1.0 4.0 7.0 10.0 2.0 5.0 8.0 11.0 3.0 6.0 9.0 12.0]

Notice that the output of print(M) is exactly the same as with the row-major order matrix. This is because how the matrix is stored in memory is an implementation detail, and does not and should not affect how you use the matrix.

You’ll also notice that the output of print_linear() is actually different. The elements have been re-ordered such that the columns of the 4x3 matrix are next to each other in memory.

How is matrix transpose related to row-major order and column-major order?

A transpose operation switches the rows and columns of a matrix without changing its contents.

If you have a matrix $M_{mxn}$ and you transpose it, then you get $M^{T}_{nxm}$ i.e. the rows and columns switched.

In the above figure, you can see that the order of elements of the transposed matrix $M^T$  in row-major order is exactly the same as the order of elements of the matrix is stored in column-major order. The key difference is that a transposed matrix $M^{T}_{nxm}$ has different properties mathematically, with the fundamental row and column spaces swapped, whereas storing the same matrix $M_{mxn}$ in column-major order is merely an implementation detail. We can implement this and compare the results using MojoMatrix and its transpose() and to_colmajor() functions

Mojo
from row_col_matrix import MojoMatrix vals = List[Float64](1,2,3,4,5,6,7,8,9,10,11,12) M = MojoMatrix(4,3,vals) mm_col_major = M.to_colmajor() print("Matrix in column-major order:") print(mm_col_major) print("Memory layout:") mm_col_major.print_linear() print("Transposed matrix:") print(M.transpose()) print("Memory layout:") M.transpose().print_linear()
Output
Matrix in column-major order: [[1.0 2.0 3.0] [4.0 5.0 6.0] [7.0 8.0 9.0] [10.0 11.0 12.0]] Matrix: 4x3 | DType:float64 | Column Major Memory layout: [1.0 4.0 7.0 10.0 2.0 5.0 8.0 11.0 3.0 6.0 9.0 12.0] Transposed matrix: [[1.0 4.0 7.0 10.0] [2.0 5.0 8.0 11.0] [3.0 6.0 9.0 12.0]] Matrix: 3x4 | DType:float64 | Row Major Memory layout: [1.0 4.0 7.0 10.0 2.0 5.0 8.0 11.0 3.0 6.0 9.0 12.0]

If you inspect the transpose() and to_colmajor() function implementation you will see that the only difference between these two is the dimensions $mxn$ vs $nxm$ and the actual reordering of elements is tucked away in an internal function called _transpose_order() that is shared by both.

Mojo
@always_inline fn to_colmajor(self) raises -> MojoMatrix[dtype,False]: assert_true(self.is_row_major == True, "Matrix must be in row-major format, to convert to column-major.") return MojoMatrix[dtype, False](self.rows, self.cols, self._transpose_order()) @always_inline fn transpose(self) -> MojoMatrix[dtype, is_row_major]: return MojoMatrix[dtype, is_row_major](self.cols, self.rows, self._transpose_order()) fn _transpose_order(self) -> DTypePointer[dtype]: var newPtr = DTypePointer[dtype].alloc(self.rows*self.cols) alias simd_width: Int = self.simd_width for idx_col in range(self.cols): var tmpPtr = self._matPtr.offset(idx_col) @parameter fn convert[simd_width: Int](idx: Int) -> None: newPtr.store[width=simd_width](idx+idx_col*self.rows, tmpPtr.simd_strided_load[width=simd_width](self.cols)) tmpPtr = tmpPtr.offset(simd_width*self.cols) vectorize[convert, simd_width](self.rows) return newPtr

How does row-major order vs. column-major order affect performance?

Now that we’ve built some intuition around row-major and column-major ordering, let’s discuss why it matters for performance. In data science, data scientists routinely load tabular data sets from databases or other sources and load them into Pandas DataFrame or NumPy’s ndarray and run operations such as averages, sums, medians or fit histograms on the columns of the datasets to better understand the data distribution. These data sets have certain properties:

  • They are matrices that are “narrow and tall”, i.e. their number of rows far exceeds the number of columns
  • Not every column or feature or attributes are useful, and many are discarded or ignored in computation

This means most of the data access pattern will be accessing all rows of specific columns that are not next to each other. Libraries like Pandas and programming languages like MATLAB, Fortran and R designed for statistical and linear algebra computations that have to deal with this type of data and data access pattern, are column-major order by default. Therefore it makes sense to have elements in a column be in contiguous memory ala column-major order.

To simulate this problem, let’s define a problem statement: On a 10000x10000 element matrix of double precision values, perform a mean() reduction operation on a subset of 1000 randomly selected columns. This ensures that most columns are not neighbors of each other.

NumPy benchmarking: row-major order vs. column-major order matrices

First, let’s benchmark a NumPy implementation to analyze the effect of row-major vs. column-major ordering. You can find the the accompanying file python_mean_bench.py on GitHub. In this benchmark we show two things:

  1. Find the average of all the columns in the 10000x10000 matrix
  2. Find the average of a subset of 1000 randomly selected columns

NumPy allows you to switch from the default C-style ordering (row-major order) to Fortran F-style ordering (column-major order) using the function np.asfortranarray() and you can see the big performance difference.

Python
from timeit import timeit import numpy as np def row_major_mean(np_mat: np.ndarray, mean_idx=[]): if len(mean_idx) == 0: mean_idx = slice(None) return 1000*timeit(lambda: np_mat[:,mean_idx].mean(axis=0), number=10) / 10 def col_major_mean(np_mat: np.ndarray, mean_idx=[]): if len(mean_idx) == 0: mean_idx = slice(None) np_mat = np.asfortranarray(np_mat) return 1000*timeit(lambda: np_mat[:,mean_idx].mean(axis=0), number=10) / 10 if __name__ == "__main__": np_mat = np.random.rand(10000,10000) print() print("Mean of all columns, time (ms)") print("------------------------------") print("Row-major:", row_major_mean(np_mat)) print("Col-major:",col_major_mean(np_mat)) row_major_mean_vals = np_mat.mean(axis=0) col_major_mean_vals = np.asfortranarray(np_mat).mean(axis=0) np.testing.assert_almost_equal(row_major_mean_vals, col_major_mean_vals) print("Accuracy comparision (2-Norm of difference):", np.linalg.norm(row_major_mean_vals - col_major_mean_vals)) print() print("Mean of 1000 random columns, time (ms)") print("--------------------------------------") mean_idx = np.random.randint(0,10000,1000) print("Row-major:", row_major_mean(np_mat,mean_idx)) print("Col-major:",col_major_mean(np_mat,mean_idx)) row_major_mean_vals = np_mat[:,mean_idx].mean(axis=0) col_major_mean_vals = np.asfortranarray(np_mat[:,mean_idx]).mean(axis=0) np.testing.assert_almost_equal(row_major_mean_vals, col_major_mean_vals) print("Accuracy comparision (2-Norm of difference):", np.linalg.norm(row_major_mean_vals - col_major_mean_vals)) print()
Output
Mean of all columns, time (ms) ------------------------------ Row-major: 24.860700010322034 Col-major: 20.25612499564886 Accuracy comparision (2-Norm of difference): 1.4372499570202904e-13 Mean of 1000 random columns, time (ms) -------------------------------------- Row-major: 47.35492919571698 Col-major: 7.3962084017694 Accuracy comparision (2-Norm of difference): 0.0

The first result is for calculating the average of every column, and you’ll see that the performance for row-major order and column-major order is small. Why isn’t there a much larger difference in performance? After all, accessing columns in a row-major order matrix requires strided memory access which know is slower. 

NumPy implementation exploits the fact that the row values are still contiguous in memory and is able to reduce multiple columns at the same time using SIMD instructions. But this optimization won’t help us when the columns are randomly selected. Which is why you see the big performance difference in the second result: Mean of 1000 random columns, time (ms) where NumPy’s column-major order matrix is about 6x faster than NumPy’s row-major order matrix.

Mojo benchmarking: row-major order vs. column-major order matrices

Now let’s solve this task using MojoMatrix in Mojo. I won't get into the implementation details of MojoMatrix in this blog post, you can find the full implementation on GitHub. Here is the skeleton of the struct:

Mojo
struct MojoMatrix[dtype: DType = DType.float64, is_row_major: Bool=True](): @always_inline fn __init__(...): ... @always_inline fn __len__(self) -> Int: ... fn _transpose_order(self) -> DTypePointer[dtype]: ... @always_inline fn to_colmajor(self) raises -> MojoMatrix[dtype,False]: ... @always_inline fn transpose(self) -> MojoMatrix[dtype, is_row_major]: ... @always_inline fn mean(self, owned cols_list:List[Int]) -> Self: ... @staticmethod fn from_numpy(np_array: PythonObject) raises -> Self: ... fn to_numpy(self) raises -> PythonObject: ... fn print_linear(self): ... fn __str__(self) -> String: ...

In the mean() function in MojoMatrix I’ve implemented a fast, vectorized reduction operation that works on both row-major order and column-major order matrices. 

Mojo
@always_inline fn mean(self, owned cols_list:List[Int]) -> Self: var new_mat = Self(1,len(cols_list)) alias simd_width: Int = self.simd_width var simd_sum = SIMD[dtype, simd_width](0) var simd_multiple = align_down(self.rows, simd_width) for idx_col in range(len(cols_list)): simd_sum = simd_sum.splat(0) if is_row_major: var tmpPtr = self._matPtr.offset(cols_list[idx_col]) for idx_row in range(0, simd_multiple, simd_width): simd_sum+=tmpPtr.simd_strided_load[width=simd_width](self.cols) tmpPtr += simd_width*self.cols for idx_row in range(simd_multiple, self.rows): simd_sum[0] += tmpPtr.simd_strided_load[width=1](self.cols) tmpPtr += self.cols new_mat._matPtr[idx_col] = simd_sum.reduce_add() else: for idx_row in range(0, simd_multiple, simd_width): simd_sum+=self._matPtr.load[width=simd_width](self.rows*cols_list[idx_col]+idx_row) for idx_row in range(simd_multiple, self.rows): simd_sum[0] += self._matPtr.load[width=1](self.rows*cols_list[idx_col]+idx_row) new_mat._matPtr[idx_col] = simd_sum.reduce_add() new_mat._matPtr[idx_col] /= self.rows return new_mat

In the code above, I have two separate approaches for row-major and column-major order matrices. 

Row-major order algorithm: In row-major order the column-vector values are not stored in contiguous memory, but are separated by a stride length = number of columns. We can still perform vectorized reduction by loading the system's SIMD width values using the function simd_strided_load[simd_width](offset=number_of_column). If the number of rows in the matrix is not a multiple of the system’s SIMD width, the spillover values are reduced in the second loop for idx_row in range(simd_multiple, self.rows) in a non-vectorized way.

Column-major order algorithm: In column-major order, the column-vectors are stored in contiguous memory locations, so loading values and reducing them using SIMD vectorization is much faster. Like before, we if the number of rows in the matrix is not a multiple of the system’s SIMD width, the spillover values are reduced in the second loop for idx_row in range(simd_multiple, self.rows) in a non-vectorized way.

In both cases, we verify that the results match the results in NumPy within a very small tolerance value. Here are the 🥁drum roll 🥁results:

Mojo
from row_col_matrix import MojoMatrix import benchmark from random import random_si64 from benchmark import Unit, keep from python import Python from time import sleep from testing import assert_almost_equal from algorithm import vectorize fn mojo_to_py_list(list: List[Int]) raises -> PythonObject: var py_list = PythonObject([]) for idx in list: py_list.append(idx[]) return py_list fn print_numpy_compare_norm(np_mat: PythonObject, mat: MojoMatrix, mean_idx: List[Int]) raises -> Float64: var np = Python.import_module("numpy") var np_mean: PythonObject var mat_mean: PythonObject if len(mean_idx)==0: np_mean = np_mat.mean(axis=0) mat_mean = mat.mean(mean_idx).to_numpy() else: np_mean = np_mat.take(mojo_to_py_list(mean_idx),axis=1).mean(axis=0) mat_mean = mat.mean(mean_idx).to_numpy() return np.linalg.norm(np_mean - mat_mean).to_float64() fn bench_row_major(np_mat: PythonObject, mean_idx: List[Int]) raises -> Float64: var mat = MojoMatrix.from_numpy(np_mat) @parameter fn benchmark_fn(): var m = mat.mean(mean_idx) keep(m._matPtr) var report_row_major = benchmark.run[benchmark_fn]() print("Row-major accuracy check vs. NumPy. (2-Norm of difference):",str(print_numpy_compare_norm(np_mat, mat, mean_idx))) return report_row_major.mean(Unit.ms) fn bench_col_major(np_mat: PythonObject, mean_idx: List[Int]) raises -> Float64: var mat = MojoMatrix.from_numpy(np_mat).to_colmajor() @parameter fn benchmark_fn(): var m = mat.mean(mean_idx) keep(m._matPtr) var report_col_major = benchmark.run[benchmark_fn]() print("Column-major accuracy check vs. NumPy. (2-Norm of difference):",str(print_numpy_compare_norm(np_mat, mat, mean_idx))) return report_col_major.mean(Unit.ms) fn bench_numpy_row_major(np_mat: PythonObject, mean_idx: PythonObject) raises -> Float64: Python.add_to_path(".") var numpy_mean = Python.import_module("python_mean_bench") return numpy_mean.row_major_mean(np_mat, mean_idx).to_float64() fn bench_numpy_col_major(np_mat: PythonObject, mean_idx: PythonObject) raises -> Float64: Python.add_to_path(".") var numpy_mean = Python.import_module("python_mean_bench") return numpy_mean.col_major_mean(np_mat, mean_idx).to_float64() np = Python.import_module("numpy") np_mat = np.random.rand(10000,10000) mean_idx = List[Int]() for i in range(1000): mean_idx.append(int(random_si64(0,10000))) row_major_mean_time = bench_row_major(np_mat, mean_idx) col_major_mean_time = bench_col_major(np_mat, mean_idx) numpy_row_major_mean_time = bench_numpy_row_major(np_mat, mojo_to_py_list(mean_idx)) numpy_col_major_mean_time = bench_numpy_col_major(np_mat, mojo_to_py_list(mean_idx)) print() print("========== Execution time ==========") print() print("Mojo row-major mean time (ms):",row_major_mean_time) print("Mojo column-major mean time (ms):",col_major_mean_time) print() print("NumPy row-major mean time (ms):",numpy_row_major_mean_time) print("NumPy column-major mean time (ms):",numpy_col_major_mean_time) print() print("========== Speedups ==========") print() print("Col-major vs. Row-Major") print("Mojo col-major vs. Mojo row-major:",str(row_major_mean_time/col_major_mean_time)[:6],end='x\n') print("NumPy col-major vs. NumPy row-major (default):",str(numpy_row_major_mean_time/numpy_col_major_mean_time)[:6],end='x\n') print() print("Mojo vs. NumPy") print("Mojo col-major vs. NumPy row-major (default):",str(numpy_row_major_mean_time/col_major_mean_time)[:6],end='x\n') print("Mojo col-major vs. NumPy col-major:",str(numpy_col_major_mean_time/col_major_mean_time)[:6],end='x\n')
Output
Row-major accuracy check vs. NumPy. (2-Norm of difference): 4.6304020458622057e-14 Column-major accuracy check vs. NumPy. (2-Norm of difference): 4.6304020458622057e-14 ========== Execution time ========== Mojo row-major mean time (ms): 39.216590163934427 Mojo column-major mean time (ms): 1.7989985130111523 NumPy row-major mean time (ms): 47.462849994190037 NumPy column-major mean time (ms): 8.4675250109285116 ========== Speedups ========== Col-major vs. Row-Major Mojo col-major vs. Mojo row-major: 21.799x NumPy col-major vs. NumPy row-major (default): 5.6052x Mojo vs. NumPy Mojo col-major vs. NumPy row-major (default): 26.382x Mojo col-major vs. NumPy col-major: 4.7067x

In Mojo, the performance difference between row-major order and column-major order is significant. Column-major order delivers a 21x speedup over row-major order, for this problem. The entire implementation is 25 lines of Mojo code and it is 26x times faster than NumPy’s default row-major order implementation and 4x faster than NumPy’s column-major order implementation, both of which we saw in the previous section. All performance was measured on my MacBook Pro with M2 processor.

Conclusion

I hope you enjoyed this primer on row-major order and column-major order matrices, and why it matters for performance. If you’d like to continue exploring, here’s your homework:

  • Try implementing matrix-vector product M@v using the implementation I shared as a starting point and benchmark the performance between row-major order M and column-major order M. (Hint: row-major order M should be faster, but why?)
  • Try scaling transformations of columns on row-major order and column-major order matrices
  • Implement matrix factorizations operations on row-major order and column-major order matrices.

There are a lot of benefits to using both, and it’s nice to have the option to switch between the two. Mojo makes it very convenient to express scientific ideas, without sacrificing performance. All the code shared in this blog post is on GitHub. Here are some additional resources to get started. 

Until next time🔥!

Shashank Prasanna
,
AI Developer Advocate