With MAX 24.6, we introduced MAX Serve: our cutting-edge LLM serving solution, delivering state-of-the-art performance on NVIDIA A100 GPUs. While still in its early days, MAX Serve already offers a combination of capabilities that unlock value for AI Engineers—especially those looking to build using Retrieval-Augmented Generation (RAG), tool use, and AI safety.
Out of the box, MAX Serve:
- Runs the same code on your laptop or an NVIDIA-equipped server, with zero configuration required
- Downloads and serves any PyTorch LLM from Hugging Face, with special acceleration for LlamaForCausalLM-compatible models
- Provides an OpenAI-compatible chat completion endpoint, making it a drop-in replacement for other solutions
Enabled by Modular’s groundbreaking work in AI compilers and infrastructure, MAX provides this feature set from a single command at your terminal. In this post, you’ll experience how quickly MAX and Open WebUI get you up-and-running with RAG, web search, and Llama 3.1 on GPU.
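To make that last point concrete, here is a minimal sketch of a chat completion request against a running MAX Serve instance. It follows the standard OpenAI request shape; the host and port shown (localhost:8000) are assumptions for a default local deployment.

```bash
# A minimal OpenAI-style chat completion request against MAX Serve.
# localhost:8000 is an assumed default; adjust to your deployment.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'
```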

About Open WebUI
Building on the solid foundation MAX provides, adding a robust user interface is a natural next step in creating a full-stack web app. At Modular, we often use Open WebUI in our own work, as it seamlessly integrates with our technology stack. This powerful platform offers a familiar chat-driven interface for interacting with open-source AI models.
Like MAX, Open WebUI empowers users to maintain complete ownership of their AI infrastructure, avoiding vendor lock-in and enhancing privacy. By combining MAX with Open WebUI, you gain instant access to a streamlined development environment: you'll spend less time troubleshooting your CUDA configuration and more time building.
About RAG and web search
Training a large language model with new knowledge is not feasible for most people—it’s time-consuming and prohibitively expensive. To overcome these limits, we can provide new knowledge to the model by retrieving specific, relevant information from external sources.
With RAG, the content source is often a vector database containing proprietary documents. Meanwhile, with web search the source is an API capable of searching the entire web, like those from DuckDuckGo and Google. Regardless of the tool, the system searches for documents, then essentially copy-pastes the results into the LLM’s context window. This grounds the model’s response in the information the documents contain.
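As a rough sketch of that idea, here is what a grounded request might look like against an OpenAI-compatible endpoint: the retrieved passage is simply placed in the prompt ahead of the question. The endpoint, model placement, and document text are illustrative assumptions; tools like Open WebUI do this assembly for you.

```bash
# Illustrative only: ground the model by pasting retrieved text into the prompt.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "modularai/llama-3.1",
    "messages": [
      {"role": "system", "content": "Answer using only the provided context."},
      {"role": "user", "content": "Context: <retrieved document text>\n\nQuestion: What does the document say about the launch date?"}
    ]
  }'
```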
Before you begin
Install Magic
Magic is Modular's CLI, and you'll need it to follow along with this post. To install it, run this command at your terminal and follow the instructions:
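(As of this writing, the installer looks like the sketch below; check the Magic documentation for the current command.)

```bash
# Download and run the Magic installer, then follow its on-screen instructions.
curl -ssL https://magic.modular.com/ | bash
```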
Install Docker
In this post, we’ll use Docker to run the Open WebUI container. Follow the instructions in the Docker documentation if you need to install it.
Set up Hugging Face access
For our work here, we’ll leverage MAX’s ability to run any PyTorch LLM from Hugging Face. Before we can begin, you must obtain an access token from Hugging Face to download models hosted there. Follow the instructions in the Hugging Face documentation to obtain one.
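If you run locally in the next section, one simple way to make the token available is to export it in your shell before starting MAX (HF_TOKEN is the variable Hugging Face's tooling reads; the cloud option stores the token in a .env file instead):

```bash
# Make the Hugging Face token available to the current shell session.
# Replace the placeholder value with your own token.
export HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```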
Start MAX and Open WebUI
At this point, you have a choice: run locally on your laptop, or in the cloud on a GPU-equipped instance.
Option A: Run locally
The local-to-cloud developer experience is something we care deeply about here at Modular. Simply follow our getting started guide to run MAX on your laptop.
To start MAX Serve, customize the guide's serve command with a sequence length long enough to support the RAG workload:
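(The sketch below shows the general shape; the flag names and the 8192 value are assumptions, so copy the exact serve command from the getting started guide and add the larger max-length to it.)

```bash
# Illustrative: serve Llama 3.1 with a context window large enough for RAG.
# Flag names and the 8192 value are assumptions; see the getting started guide.
magic run serve --huggingface-repo-id modularai/llama-3.1 --max-length 8192
```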
MAX Serve is ready once you see a log message containing:
Leave the terminal window open and open a second terminal window, then run the following Docker command:
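(The command below is a sketch: the Open WebUI image and its environment variables are real, but the MAX Serve port of 8000 and the host.docker.internal mapping are assumptions about your local setup.)

```bash
# Run Open WebUI locally, pointing it at the MAX Serve endpoint on the host.
docker run -d -p 8080:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=anything \
  -e WEBUI_AUTH=false \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main
```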
Open WebUI is ready once you see a log message containing:
Leave both terminal windows open and visit http://localhost:8080/ in your web browser to access Open WebUI.
Note: performance will be much slower locally than on a GPU-equipped cloud instance.
Option B: Run in the cloud
This option requires SSH access to an NVIDIA GPU-equipped cloud instance running Linux.
Note: If you don’t have a cloud GPU instance, follow our tutorial to create one on AWS, Azure, or GCP. Once your instance is up, SSH into it and stop the MAX container; we’ll start another one in the following steps.
To get going, SSH into your cloud instance, then run the following command to start a new Magic project and change into its directory:
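(A sketch; the project name max-open-webui matches the directory used in the rest of this post.)

```bash
# Create a new Magic project named max-open-webui and enter its directory.
magic init max-open-webui && cd max-open-webui
```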
First, we will store our Hugging Face access token in a .env file. Such files are a best practice for storing environment-specific configuration variables and sensitive data, such as API keys and other application settings. In your favorite code editor, create a new file named .env in the max-open-webui directory and add your token like so:
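(The variable name HUGGING_FACE_HUB_TOKEN is the one assumed by the compose sketch in the next step; substitute your real token for the placeholder.)

```
# .env -- keep this file out of version control.
HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx
```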
Next, create a new file in your code editor, placing it in the max-open-webui directory with the name docker-compose.yml, and paste in the following contents:
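(The file below is a sketch reconstructed from the walkthrough that follows: the Open WebUI image and environment variables are real, but the MAX image name and tag, the internal port 8000, and the command-line flags are assumptions to verify against Modular's documentation.)

```yaml
services:
  max-openai-api:
    image: modular/max-openai-api:24.6   # assumed image name and tag
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1              # we require one GPU
              capabilities: [gpu]
    environment:
      # Token is read from the .env file in this directory
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    # Model weights plus a context window large enough for RAG and web search
    command: >
      --huggingface-repo-id modularai/llama-3.1
      --max-length 8192
    volumes:
      # Sync Hugging Face downloads between the instance and the container
      - ~/.cache/huggingface:/root/.cache/huggingface

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "8080:8080"
    environment:
      - WEBUI_AUTH=false                                   # single-user mode
      - OPENAI_API_BASE_URL=http://max-openai-api:8000/v1  # MAX Serve endpoint
      - OPENAI_API_KEY=anything                            # any value works
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - max-openai-api

volumes:
  open-webui:
```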
This is a typical docker-compose configuration for an app with a backend service (MAX) and web UI (Open WebUI). Let’s briefly explain what’s going on here:
- We start by defining two containers: max-openai-api and open-webui
- For the max-openai-api container:
  - We set deploy.resources.reservations.devices.count to 1. This indicates that we require a GPU.
  - We pass the necessary settings for Hugging Face into the environment, including getting the token from our .env file.
  - The command specifies the model we want to use (weights for Llama 3.1 that Modular makes available for convenience) and a max-length that ensures a large enough context window for RAG and web search.
  - The line containing ~/.cache/huggingface:/root/.cache/huggingface syncs downloads from Hugging Face between the instance's persistent storage and the container.
- For the open-webui container:
  - We set several environment variables:
    - WEBUI_AUTH: Open WebUI supports multiple user accounts. Setting this value to false causes Open WebUI to enter single-user mode, which is convenient for getting up and running quickly during development.
    - OPENAI_API_BASE_URL: This is actually your MAX Serve endpoint, which matches the chat completion API that OpenAI uses.
    - OPENAI_API_KEY: Use any value here. MAX Serve does not need an API key, but the OpenAI library requires one.
  - The depends_on: max-openai-api setting means the open-webui container will start after the max-openai-api container.
- The last section, volumes.open-webui, is simply a stub that tells Docker to create persistent on-disk storage for the open-webui container. (The colon at the end might look odd if you're new to docker-compose; rest assured it's intentional.)
Next, we'll add some tasks to our Magic project as shortcuts for a few Docker commands. Open the pyproject.toml file and replace the [tool.pixi.tasks] section with the following:
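(A sketch of the three tasks; the depends-on syntax should be checked against the version of Magic/pixi you have installed.)

```toml
[tool.pixi.tasks]
# stop: stop the containers and remove them from the Docker daemon
stop = "docker compose down"
# start: run stop first, then start the containers in the background
start = { cmd = "docker compose up -d", depends-on = ["stop"] }
# logs: stream log output from both containers
logs = "docker compose logs -f"
```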
Let’s dig into each task:
- stop: stops the containers and removes their entries from the Docker daemon
- start: first runs the stop task, and then starts the containers
- logs: streams the log output of the containers
Finally, we’re ready to start our app! Simply run these commands:
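(These invoke the tasks defined above.)

```bash
# Start the containers, then stream their logs.
magic run start
magic run logs
```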
The first command above will download and run the containers. The second command will stream logs for each container. The app is ready once you see the following message from both containers:
It’s safe to press CTRL+C here to exit the logs task; the containers will not stop until you run: magic run stop
Visit http://<YOUR_CLOUD_IP>:8080/ in your web browser to access Open WebUI.
Configure Open WebUI
Before we can begin chatting, there are just a few features to set up.
Configure Connection to MAX
First, we need to manually provide our model name to Open WebUI. (Automatic model discovery is coming to MAX; manually specifying the model name will soon be unnecessary.) Access the Open WebUI admin panel by clicking the 👤User button in the bottom left corner of the page, then navigate to Admin Panel > Settings > Connections.
Turn off the Ollama API, then add modularai/llama-3.1 to the MAX Serve endpoint under OpenAI API. Your settings should look like this (your URL may differ from what is shown):

Enable web search
To enable web search, navigate to Admin Panel > Settings > Web Search, turn on the Enable Web Search switch, then choose DuckDuckGo as the search engine. Using DuckDuckGo is free and does not require an API key.
Configure RAG
To perform RAG, we’ll use the knowledge and custom models features of Open WebUI. Knowledge is built upon Chroma, a popular open-source vector database. Custom models are how we can augment our MAX model with tools and knowledge.
First, we’ll add some knowledge. Navigate to Workspace > Knowledge and click the + button to add a knowledge base. Provide a name and optionally a description. After creating the knowledge base, click the + button within it and either upload some files, or use the built-in editor to write / paste-in some text documents. Here’s an example of what you should have:

Next, we'll add a custom model. Navigate to Workspace > Models and click the + button to add a new model. Provide a name and choose modularai/llama-3.1 as the base model. Under Knowledge, choose your knowledge base. Optionally, give your model an avatar image. Finally, scroll to the bottom and choose Save & Create. Your custom model should look something like this:

Use Open WebUI to chat with Llama 3.1
Now we’re ready to use Open WebUI with MAX!
RAG
Choose New Chat from the Open WebUI sidebar, then select your custom model from the model selector at the top of the chat surface. Ask it a question, and observe how it can augment the LLM with the knowledge you provide, like so:

As you can see above, the LLM is able to correctly answer a question about something not widely known. It even provides appropriate citations. Out of the box, MAX supports working with RAG pipelines like this.
Web search
To use the web search feature of Open WebUI, start a new chat, select modularai/llama-3.1 as the model, click the + button in the message composer, and turn on Web Search. Then try asking a question about a current event, such as: What are some highlights from CES 2025?

Web search can be an incredibly powerful tool for LLMs, and as you can see, MAX works out of the box with Open WebUI to support it.
Next Steps
In this post, we've only scratched the surface of what's possible by combining MAX with Open WebUI. MAX Serve enables you to access an OpenAI-compatible endpoint for any PyTorch LLM from Hugging Face in minutes. Open WebUI provides a slick, feature-rich user interface with all the tools you expect, and more.
We encourage you to dig in and try more of Open WebUI with MAX, like its Tools feature (Workspace > Tools) for executing arbitrary Python functions. If you want to go further, have a look at the Pipelines project from Open WebUI, which lets you "extend functionalities, integrate unique logic, and create dynamic workflows with just a few lines of code."
We at Modular are proud to be part of the open source AI community—expect to hear more on this topic throughout the year. Join our new Modular Forum and Subreddit to share your experiences and connect with the Modular community!