Introduction
Remember the last time you tried to set up a GPU server for AI inference? If you're like most ML engineers, it probably involved hours of wrestling with drivers, CUDA configurations, and compatibility issues - all before you could even think about running your models. That's been the status quo for years: powerful hardware wrapped in layers of complexity that steal your time and attention away from what really matters.
And absolutely nobody likes getting trapped in tasks that are essentially just undifferentiated heavy lifting!
This is where TensorWave's MI300X servers come in, and they're rewriting the rules in two important ways. First, with a staggering 192GB of memory per card (that's 2.4x what you'll find in NVIDIA's H100), they're making it possible to run full-weight 70B parameter models on a single accelerator - something that fundamentally changes the game for inference deployments and allows for options and creativity. Second, and perhaps just as importantly, they've eliminated the traditional setup headaches that have long been the tax we pay for high-performance computing.
Getting Started: Time to Value
One of the more difficult things when getting started with a giant GPU server is drivers. If you try to spin up a GPU-powered cloud server from AWS or Azure, for example, you'll get a decent GPU (assuming you're lucky enough; GPU servers in the cloud get gobbled up fast), but the time to value sucks: for their most common GPU-enabled instances, you'll be doing the driver and CUDA installations yourself.
That can take a while - especially if you're an LLM tinkerer or developer rather than an expert in gnarly Linux compute-driver issues, and you'd really rather not be forced to become one.
I totally get it. I have to do that task every now and then, and even with a couple of decades of Linux experience, I still completely hate it. Not because it's an insurmountable problem, but because it can be tricky and ends up wasting too much time on what is ultimately not the fun, valuable work.
The TensorWave MI300X server was the complete opposite experience. After my first ssh into the server (running Ubuntu 22.04 LTS), I was pleasantly surprised to find that TensorWave had already done all the heavy (driver) lifting.

The screenshot above shows a re-enactment of my first ssh session. I used rocm-smi (AMD's counterpart to the nvidia-smi tool) and it just worked - all cards detected, drivers already loaded.
Without having to wrestle Linux drivers, I could immediately get to work on what I actually wanted to do on this box - run LLMs! This is a fantastic user experience, and incredible time to value.
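If you want to reproduce that sanity check yourself, a couple of quick rocm-smi invocations cover it (these are standard rocm-smi flags; the exact output layout varies by ROCm release):
# Quick post-ssh sanity checks (output layout varies by ROCm release)
rocm-smi                       # per-card temperature, power, and utilization summary
rocm-smi --showproductname     # confirm all eight MI300X cards are detected
rocm-smi --showmeminfo vram    # total and used VRAM per card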
Time to inference: TGI 3.0
While TensorWave takes care of all the GPU drivers and ROCm (AMD's equivalent of CUDA) configuration, the choice of inference engine is still up to me. For a quick start, I tried Text Generation Inference (TGI) by Hugging Face:
model=teknium/OpenHermes-2.5-Mistral-7B
volume=$PWD/data
docker run --rm -it --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
  --device=/dev/kfd --device=/dev/dri --group-add video \
  --ipc=host --shm-size 256g \
  -p 8086:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:3.0.1-rocm \
  --model-id $model
And it just worked! That’s just a slightly tweaked version of the CLI command from the Hugging Face docs (https://huggingface.co/docs/text-generation-inference/en/installation_amd), and resulted in the TGI server being online and ready:

Alright, the server is ready, but does it actually work? We can try a quick curl CLI command to send an inference request to our server:
curl localhost:8086/generate \
  -X POST \
  -d '{"inputs":"Who were the first five presidents of the USA?","parameters":{"max_new_tokens":1000}}' \
  -H 'Content-Type: application/json'
Which resulted in:

It works!
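As an aside (I didn't show this in the run above), TGI also exposes an OpenAI-style chat endpoint on the same port, so something like the following should work against the very same container - "tgi" being the placeholder model name TGI's Messages API expects:
curl localhost:8086/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model":"tgi","messages":[{"role":"user","content":"Who were the first five presidents of the USA?"}],"max_tokens":200}'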
Man, that is some amazing time to value and time to inference right there. With basically zero effort, we created an inference endpoint using community darling TGI. We just ssh’d into the server, executed a docker command, and boom! Achievement unlocked: 100% working inference endpoint.
Now, what about even more performance using vLLM?
vLLM for Maximum Performance
In my tinkering, there are a few ways you can run vLLM on TensorWave's MI300X servers:
- There’s a prebuilt AMD image you can use, just like the TGI experience in the previous section (https://www.amd.com/en/developer/resources/technical-articles/how-to-use-prebuilt-amd-rocm-vllm-docker-image-with-amd-instinct-mi300x-accelerators.html)
- You can also just build your own docker image from AMD’s vllm fork (https://github.com/ROCm/vllm)
I tried both ways, and while both work fine, I found that the best performance comes from building your own docker image from AMD's vLLM fork (mainly because it has newer ROCm and vLLM components). The instructions are the same as in the official vLLM docs (https://docs.vllm.ai/en/v0.6.5/getting_started/amd-installation.html), except that instead of cloning the upstream vLLM repo, you clone AMD's fork:
# Using AMD's vLLM fork
git clone https://github.com/ROCm/vllm.git
cd vllm
git checkout tags/v0.6.6+rocm   # or whatever the latest release is by then
DOCKER_BUILDKIT=1 docker build -f Dockerfile.rocm -t amd-vllm-rocm .
That’s it, easy-peasy. The build process can take a while though as it’s a huge image. Once that’s built, we can use the resulting amd-vllm-rocm image to run a vLLM container that serves our chosen model:
export volume=$PWD/data
IMAGE_NAME=amd-vllm-rocm   # image built from AMD's vLLM fork
export HF_TOKEN=my-secret-huggingface-token
export MODEL=Qwen/Qwen2.5-72B-Instruct
docker run -it \
  --network=host \
  --env HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  --group-add=video \
  --ipc=host \
  --cap-add=SYS_PTRACE \
  --security-opt seccomp=unconfined \
  --device /dev/kfd \
  --device /dev/dri \
  -v $volume:/root/.cache/huggingface \
  $IMAGE_NAME \
  vllm serve $MODEL --tensor-parallel-size=8
And that’s all she wrote. This one took a bit more time than the TGI server to get up and running because we opted to build a docker image based on the ROCm team’s latest vLLM fork release, but that’s just waiting time. Effort-wise, still pretty easy!
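Once the engine reports it's ready, a quick smoke test is to hit the OpenAI-compatible endpoint that vllm serve exposes - by default on port 8000, reachable directly here since we ran the container with --network=host:
# Quick smoke test against vLLM's OpenAI-compatible API (default port 8000)
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"model":"Qwen/Qwen2.5-72B-Instruct","messages":[{"role":"user","content":"Say hello in one sentence."}],"max_tokens":50}'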
MI300X and its insane memory capacity
What can a ridiculously large memory pool give you?
One of the flagship features of AMD’s MI300X accelerator is memory - it has absolutely insane amounts of it.
The competing H100 from Nvidia has a decent 80GB of accelerator memory - that's way above consumer cards. The 4090, for example, has only 24GB, and the recently released 5090 a mere 32GB.
The MI300X, on the other hand, has 192GB - 2.4x the H100's capacity.
And if you compare 8xH100 vs 8xMI300X, that's 640GB of total accelerator memory on the H100 side vs a staggering 1,536GB (1.5 terabytes!) on the MI300X side.
Aside from letting you fit larger models with more context into your inference server (an LLM's memory usage skyrockets as context and prompt sizes grow - think very large documents or whole code bases you ask your LLM to analyze), the MI300X box lets you do something you simply can't on an H100: run full-weight, 70B-class models on a single card.
Yes, with the absurdly huge memory pool in a single MI300X card (more memory than 2 H100 cards), we can comfortably run a nice powerful model like Llama 3.1 70B or, my favorite in this weight class, Qwen2.5 72B!
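The back-of-the-envelope math is simple - bf16 weights cost roughly 2 bytes per parameter:
# Rough sizing: bf16 weights ≈ 2 bytes per parameter
WEIGHTS_GB=$((72 * 2))   # ~144 GB of weights for a 72B-class model
echo "weights:         ~${WEIGHTS_GB} GB"
echo "MI300X (192 GB): ~$((192 - WEIGHTS_GB)) GB left over for KV cache"
echo "H100 (80 GB):    weights alone overflow by ~$((WEIGHTS_GB - 80)) GB"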
8 single-card servers vs 1 eight-way server
Our TensorWave box came loaded with 8x MI300X, so with 1.5TB of total memory we could of course easily fit the biggest models (we even ran current darling DeepSeek V3 - a staggering 600-plus billion parameters - on it using SGLang).
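(We won't go into the DeepSeek V3 setup here, but for reference the SGLang launch looks roughly like the sketch below - run inside SGLang's ROCm environment, and the exact flags depend on the SGLang release.)
# Rough sketch of an eight-card DeepSeek V3 launch with SGLang (flags vary by release)
python3 -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-V3 \
  --tp 8 \
  --trust-remote-code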
But one thing you can also do is trade a little of your max throughput for drastically better latency by serving your model in a single-card configuration.
We ran the same workloads in our TensorWave box using two configurations:
- A single vLLM docker container, with all 8 MI300X cards, with tensor parallelism = 8.
- 8 independent vLLM docker containers (on different ports), each pinned to a single MI300X card (a launch sketch follows this list).
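Here's a rough sketch of how the single-card fleet can be launched - pinning each container to one GPU via HIP_VISIBLE_DEVICES and numbering the ports 8001-8008 are illustrative choices, not the only way to do it:
# Sketch: eight independent single-card vLLM servers, one per MI300X
for i in $(seq 0 7); do
  docker run -d \
    --network=host --ipc=host --group-add=video \
    --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    --device /dev/kfd --device /dev/dri \
    -e HIP_VISIBLE_DEVICES=$i \
    -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
    -v $PWD/data:/root/.cache/huggingface \
    --name vllm-card-$i \
    amd-vllm-rocm \
    vllm serve Qwen/Qwen2.5-72B-Instruct --port $((8001 + i))
done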
Having some tensor parallelism helps increase max throughput, but at the expense of latency due to the overhead of inter-card communication:

The above figures were taken using Qwen2.5-72B-Instruct. You can see that serving this big 72B-parameter model on a single card can cut latency (time to first token) by almost 50% in the best case - and in most cases it runs at roughly 75% of the eight-way configuration's latency (i.e., a significant ~25% reduction).
Now, latency is not necessarily the be-all and end-all of LLM inference. It depends a lot on your use case for it. Sometimes latency will matter (user experience, time to first token), but sometimes you just want raw, maximum throughput. That’s fine.
What matters here is that TensorWave's MI300X servers give you that choice. You can configure your 8xMI300X monster of a machine as a single LLM server that spans all the cards with tensor parallelism and maximizes raw throughput. Or you can split it into single-card servers for latency-sensitive use cases. You can even get creative with the split - say, five LLM servers: one docker container running 4xMI300X for more throughput, plus four smaller containers each with 1xMI300X for the best latency.
You have this choice and you get to be creative because of the sheer amount of memory capacity in each accelerator. This would just be flat out impossible on an H100 system, because a single H100 card has nowhere near enough memory to even just load a 70B-parameter-class model.
Kamiwaza AI
You might be wondering how all this power could be channeled in a way that can effectively boost productivity in an enterprise setting.
You aren't alone. Here at Kamiwaza AI, our mission is to help customers on their genAI journey and unlock efficiencies and methodologies that were unimaginable before the rise of LLMs.
One of the many ways we enable customers in the age of generative AI is with AI agents - LLMs equipped with tools, and the know-how to use those tools, to accomplish tasks given to them by humans either ad hoc (through a chat interface) or on a fixed schedule (triggered by a cron job, say).
Here’s a demo of one such AI agent in action, running a full-weight Qwen2.5 72B Instruct model:
In the video above, you can see a custom AI agent at work. (Notice how fast the full-weight 72B model runs! That's thanks to the MI300X - we recorded that particular demo on TensorWave's beastly server.)
Receiving a single user instruction from us sends the agent on its merry way:
- [Git] Clone a private demo repo from Kamiwaza AI
- [Filesystem] Copy a file from the calc folder into the cloned repo
- [Python] Execute arbitrary Python code to get the current date and time (because we gave it Python capability, it isn't limited to just the tools we hand it!)
- [Coding / Editing] Analyze calculator.py, find the bug, and fix it
- [Git] Commit and push the changes
- [Coding + Python] Create a set of tests for the calculator, then run them to confirm everything works as expected
That is agentic AI in action - it didn’t just spit out a response to a human command (that’d just be a vanilla LLM chatbot). Instead, it received instructions, and then autonomously and sequentially used the tools it had in order to achieve all of its goals.
The best part? That entire stack used in the demo will be released as open source software very soon. Yes, as amazing as that demo was, that’s not even our secret sauce!
You see, having a working AI agent is just the first step in a very difficult enterprise AI journey. That demo (and resulting open source release) is just a simple web frontend and a few python files that enable inference and tool calling.
What it lacks are key features that enterprises ABSOLUTELY require:
- Authentication and authorization, plus SAML integration for federated access
- Connections to enterprise data - what good is an agent if it can’t reach out to your vast enterprise data? The Kamiwaza platform simplifies this through our built-in embedding and vector database solutions, enabling near-instant RAG functionality.
- Secure, fine-grained control over AI agent data access - it's not enough for agents to connect to enterprise data; their access permissions must also be manageable in a sane way so that enterprise users can't suddenly reach data they normally couldn't. (For example, you wouldn't want employees learning their bosses' salaries just because they asked an SAP-connected agent, "Hey, how much does my boss make?")
These are just some of the key features that the Kamiwaza platform offers. We simplify the genAI journey so that our customers can get started on reaping the benefits ASAP, instead of getting stuck tweaking different portions of complex LLM infrastructure.
We’ve been having fun with blazing fast genAI - and you can, too!
We at Kamiwaza AI are having a lot of fun AND doing real work with our TensorWave box. It's been great working with the hardware - and with the awesome support from the folks at TensorWave.
We've been using their MI300X monster machine in the genAI demos we've been giving throughout the US. In one of the more recent ones, we even started showing off a sort of tachometer that displays the vLLM engine's tokens-per-second figures in real time while an agentic demo is underway:

(The figures above are not from any of the live demos; they're indicative numbers from our internal maximum-load tests on the MI300X.)
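If you want to build a similar readout yourself, one simple approach (a sketch, not how our tachometer is actually wired) is to poll the Prometheus metrics endpoint that vLLM's OpenAI-compatible server exposes and diff the token counters over time:
# Sample vLLM's Prometheus metrics; exact metric names vary between releases
curl -s http://localhost:8000/metrics | grep -i token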
Suffice it to say, we were incredibly happy when a totally-not-planted audience member shouted out, “What hardware are you even on that you were getting all those tokens?”
It’s absolutely amazing when you see an AI agent being autonomous and outputting text, tool calls, and analysis at blazing speeds.
[If you haven’t experienced this yourself, go hit up our friends at TensorWave (https://tensorwave.com/).]
Are you looking for longer contexts, lower latency, and lower costs? Get early access to TensorWave's managed inference service on AMD MI300X and unlock up to $100k. You'll be glad you did! And if you need help unlocking genAI - especially for agentic use cases and automation, and for transforming your org into an AI-powered enterprise overall - hit us up at Kamiwaza AI (https://kamiwaza.ai)!
