# GPU functions written in Mojo Mojo is a Python-family language built for high-performance computing. It allows you to write custom algorithms for GPUs without the use of CUDA or other vendor-specific libraries. The examples in this directory show a few different ways you can compile and dispatch Mojo functions on a GPU. These examples are complementary with the other [examples to write custom graph ops](../custom_ops/) that run on both CPUs and GPUs. > [!IMPORTANT] > These examples require a [compatible > GPU](https://docs.modular.com/max/faq/#gpu-requirements). The examples include the following: - **vector_addition.mojo**: A common "hello world" example for GPU programming, this adds two vectors together in the same way as seen in Chapter 2 of [Programming Massively Parallel Processors](https://www.sciencedirect.com/book/9780323912310/programming-massively-parallel-processors). - **grayscale.mojo**: The parallelized conversion of an RGB image to grayscale, as seen in Chapter 3 of "Programming Massively Parallel Processors". - **naive_matrix_multiplication.mojo**: An implementation of naive matrix multiplication, again inspired by Chapter 3 of "Programming Massively Parallel Processors". - **mandelbrot.mojo**: A parallel calculation of the number of iterations to escape in the Mandelbrot set. An example of the same computation performed as a custom graph operation can be found [here](../custom_ops/). - **reduction.mojo**: A highly performant reduction kernel. For a detailed explanation see [this blogpost](https://veitner.bearblog.dev/very-fast-vector-sum-without-cuda/). ## Setup 1. Make sure your system includes a [compatible GPU](https://docs.modular.com/max/faq/#gpu-requirements). 2. If you don't have [`pixi`](https://pixi.sh/latest/), install it: ```bash curl -fsSL https://pixi.sh/install.sh | sh ``` 3. Clone this repo: ```bash git clone https://github.com/modular/modular.git ``` 4. Navigate to these examples: ```bash cd modular/examples/mojo/gpu-functions ``` ## Quickstart You can run each of the Mojo files using `mojo` via the [`pixi run`](https://pixi.sh/latest/reference/cli/pixi/run/) command. For example: ```bash pixi run mojo vector_addition.mojo ``` Or enter the Pixi shell and run each directly: ```bash pixi shell ``` ```bash mojo vector_addition.mojo mojo grayscale.mojo mojo naive_matrix_multiplication.mojo mojo mandelbrot.mojo mojo reduction.mojo ``` ## Example walkthroughs Writing individual thread-based functions in Mojo is is powered by the [`gpu` module](https://docs.modular.com/mojo/std/gpu/), which handles all the hardware-specific details of allocating and transferring memory between host and accelerator, as well as compilation and execution of accelerator-targeted functions. The first three examples that follow are common starting points for thread-based GPU programming. They follow the first three examples in the popular GPU programming textbook [*Programming Massively Parallel Processors*](https://www.sciencedirect.com/book/9780323912310/programming-massively-parallel-processors): - Parallel addition of two vectors - Conversion of a red-green-blue image to grayscale - Naive matrix multiplication, with no hardware-specific optimization The following sections explain some of these examples a bit further. > [!CAUTION] > The following code in this README could be out of date. Read the corresponding > source files for the latest example code. ### Basic vector addition The common "hello world" example used for data-parallel programming is the addition of each element in two vectors. Here's how it works in our `vector_addition.mojo` example: 1. Define the vector addition function. The function itself is very simple, running once per thread, adding each element in the two input vectors that correspond to that thread ID, and storing the result in the output vector at the matching location. ```mojo fn vector_addition( lhs_tensor: LayoutTensor[mut=True, float_dtype, layout], rhs_tensor: LayoutTensor[mut=True, float_dtype, layout], out_tensor: LayoutTensor[mut=True, float_dtype, layout], ): tid = thread_idx.x out_tensor[tid] = lhs_tensor[tid] + rhs_tensor[tid] ``` 1. Obtain a reference to the accelerator (GPU) context. ```mojo ctx = DeviceContext() ``` 1. Allocate input and output vectors. Buffers for the left-hand-side and right-hand-side vectors need to be allocated on the GPU and initialized with values. ```mojo alias float_dtype = DType.float32 alias VECTOR_WIDTH = 10 lhs_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH) rhs_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH) lhs_buffer.enqueue_fill(1.25) rhs_buffer.enqueue_fill(2.5) lhs_tensor = lhs_tensor.move_to(gpu_device) rhs_tensor = rhs_tensor.move_to(gpu_device) ``` A buffer to hold the result of the calculation is allocated on the GPU: ```mojo out_buffer = ctx.enqueue_create_buffer[float_dtype](VECTOR_WIDTH) ``` 1. Compile and dispatch the function. The actual `vector_addition()` function we want to run on the GPU is compiled and dispatched across a grid, divided into blocks of threads. All arguments to this GPU function are provided here, in an order that corresponds to their location in the function signature. Note that in Mojo, the GPU function is compiled for the GPU at the time of compilation of the Mojo file containing it. ```mojo ctx.enqueue_function[vector_addition]( lhs_tensor, rhs_tensor, out_tensor, grid_dim=1, block_dim=VECTOR_WIDTH, ) ``` 1. Return the results. Finally, the results of the calculation are moved from the GPU back to the host to be examined: ```mojo with out_buffer.map_to_host() as host_buffer: host_tensor = LayoutTensor[float_dtype, layout](host_buffer) print("Resulting vector:", host_tensor) ``` To try out this example yourself, run it using the following command: ```sh pixi run mojo vector_addition.mojo ``` For this initial example, the output you see should be a vector where all the elements are `3.75`. Experiment with changing the vector length, the block size, and other parameters to see how the calculation scales. ### Conversion of a color image to grayscale The `grayscale.mojo` example shows how to convert a red-green-blue (RGB) color image into grayscale. This uses a rank-3 tensor to host the 2-D image and the color channels at each pixel. The inputs start with three color channels, and the output has only a single grayscale channel. The calculation performed is a common reduction to luminance using weighted values for the three channels: ```mojo gray = 0.21 * red + 0.71 * green + 0.07 * blue ``` And here is the per-thread function to perform this on the GPU: ```mojo fn color_to_grayscale( rgb_tensor: LayoutTensor[mut=True, int_dtype, rgb_layout], gray_tensor: LayoutTensor[mut=True, int_dtype, gray_layout], ): row = global_idx.y col = global_idx.x if col < WIDTH and row < HEIGHT: red = rgb_tensor[row, col, 0].cast[float_dtype]() green = rgb_tensor[row, col, 1].cast[float_dtype]() blue = rgb_tensor[row, col, 2].cast[float_dtype]() gray = 0.21 * red + 0.71 * green + 0.07 * blue gray_tensor[row, col, 0] = gray.cast[int_dtype]() ``` The setup, compilation, and execution of this function is much the same as in the previous example, but in this case we're using rank-3 instead of rank-1 buffers to hold the values. Also, we dispatch the function over a 2-D grid of block, which looks like the following: ```mojo alias BLOCK_SIZE = 16 num_col_blocks = ceildiv(WIDTH, BLOCK_SIZE) num_row_blocks = ceildiv(HEIGHT, BLOCK_SIZE) ctx.enqueue_function[color_to_grayscale]( rgb_tensor, gray_tensor, grid_dim=(num_col_blocks, num_row_blocks), block_dim=(BLOCK_SIZE, BLOCK_SIZE), ) ``` To run this example, run this command: ```sh pixi run mojo grayscale.mojo ``` This will show a grid of numbers representing the grayscale values for a single color broadcast across a simple input image. Try changing the image and block sizes to see how this scales on the GPU. ### Naive matrix multiplication The `naive_matrix_multiplication.mojo` example performs a very basic matrix multiplication, with no optimizations to take advantage of hardware resources. The GPU function for this looks like the following: ```mojo fn naive_matrix_multiplication( m: LayoutTensor[mut=True, float_dtype, m_layout], n: LayoutTensor[mut=True, float_dtype, n_layout], p: LayoutTensor[mut=True, float_dtype, p_layout], ): row = global_idx.y col = global_idx.x m_dim = p.dim(0) n_dim = p.dim(1) k_dim = m.dim(1) if row < m_dim and col < n_dim: for j_index in range(k_dim): p[row, col] = p[row, col] + m[row, j_index] * n[j_index, col] ``` The overall setup and execution of this function are extremely similar to the previous example, with the primary change being the function that is run on the GPU. To try out this example, run this command: ```sh pixi run mojo naive_matrix_multiplication.mojo ``` You will see the two input matrices printed to the console, as well as the result of their multiplication. As with the previous examples, try changing the sizes of the matrices and how they are dispatched on the GPU. ### Calculating the Mandelbrot set fractal The `mandelbrot.mojo` example shows a slightly more complex calculation: [the Mandelbrot set fractal](https://en.wikipedia.org/wiki/Mandelbrot_set). This custom operation takes no input tensors, only a set of scalar arguments, and returns a 2-D matrix of integer values representing the number of iterations it took to escape at that location in complex number space. The per-thread GPU function for this is as follows: ```mojo fn mandelbrot( tensor: LayoutTensor[mut=True, int_dtype, layout], ): row = global_idx.y col = global_idx.x alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT cx = MIN_X + Float32(col) * SCALE_X cy = MIN_Y + Float32(row) * SCALE_Y c = ComplexSIMD[float_dtype, 1](cx, cy) z = ComplexSIMD[float_dtype, 1](0, 0) iters = Scalar[int_dtype](0) var in_set_mask = Scalar[DType.bool](True) for _ in range(MAX_ITERATIONS): if not any(in_set_mask): break in_set_mask = z.squared_norm() <= 4 iters = in_set_mask.select(iters + 1, iters) z = z.squared_add(c) tensor[row, col] = iters ``` This begins by calculating the complex number which represents a given location in the output grid (C). Then, starting from `Z=0`, the calculation `Z=Z^2 + C` is iteratively calculated until Z exceeds 4, the threshold we're using for when Z will escape the set. This occurs up until a maximum number of iterations, and the number of iterations to escape (or not, if the maximum is hit) is then returned for each location in the grid. The area to examine in complex space, the resolution of the grid, and the maximum number of iterations are all provided as constants: ```mojo alias MIN_X: Scalar[float_dtype] = -2.0 alias MAX_X: Scalar[float_dtype] = 0.7 alias MIN_Y: Scalar[float_dtype] = -1.12 alias MAX_Y: Scalar[float_dtype] = 1.12 alias SCALE_X = (MAX_X - MIN_X) / GRID_WIDTH alias SCALE_Y = (MAX_Y - MIN_Y) / GRID_HEIGHT alias MAX_ITERATIONS = 100 ``` Try it with this command: ```sh pixi run mojo mandelbrot.mojo ``` The result should be an ASCII art depiction of the region covered by the calculation: ```output ...................................,,,,c@8cc,,,............. ...............................,,,,,,cc8M @Mjc,,,,.......... ............................,,,,,,,ccccM@aQaM8c,,,,,........ ..........................,,,,,,,ccc88g.o. Owg8ccc,,,,...... .......................,,,,,,,,c8888M@j, ,wMM8cccc,,..... .....................,,,,,,cccMQOPjjPrgg, OrwrwMMMjjc,.... ..................,,,,cccccc88MaP @ ,pGa.g8c,... ...............,,cccccccc888MjQp. o@8cc,.. ..........,,,,c8jjMMMMMMMMM@@w. aj8c,,. .....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,. ..,,,,,,,cccccMMjwQ EpQ .8c,,, .,,,,,,cc888MrajwJ MMcc,,, .cc88jMMM@@jaG. oM8cc,,, .cc88jMMM@@jaG. oM8cc,,, .,,,,,,cc888MrajwJ MMcc,,, ..,,,,,,,cccccMMjwQ EpQ .8c,,, .....,,,,,,ccc88@QEJwr.wPjjjwG w8c,,. ..........,,,,c8jjMMMMMMMMM@@w. aj8c,,. ...............,,cccccccc888MjQp. o@8cc,.. ..................,,,,cccccc88MaP @ ,pGa.g8c,... .....................,,,,,,cccMQOEjjPrgg, OrwrwMMMjjc,.... .......................,,,,,,,,c8888M@j, ,wMM8cccc,,..... ..........................,,,,,,,ccc88g.o. Owg8ccc,,,,...... ............................,,,,,,,ccccM@aQaM8c,,,,,........ ...............................,,,,,,cc8M @Mjc,,,,.......... ``` Try changing the various parameters above to produce different resolution grids, or look into different areas in the complex number space. ## Next Steps - See our [tutorial to get started with GPU programming](https://docs.modular.com/mojo/manual/gpu/intro-tutorial). - Learn GPU programming by solving increasingly challenging [GPU puzzles](https://builds.modular.com/puzzles/introduction.html).