The above figures were taken using Qwen2.5-72B-Instruct. You can see that serving this big 72B-parameter model on a single card can cut latency (time to first token) by almost 50% in the best case - and in most cases it sits at roughly 75% of the latency of the eight-way configuration, i.e., a solid 25% reduction.
Now, latency is not necessarily the be-all and end-all of LLM inference; it depends a lot on your use case. Sometimes latency matters (user experience, time to first token), and sometimes you just want raw, maximum throughput. That’s fine.
What matters here is that TensorWave’s MI300X servers give you that choice. You can configure your 8xMI300X monster of a machine as a single LLM server that uses all eight cards via tensor parallelism and maximizes raw throughput. Or you can split it up into single cards for latency-sensitive use cases. You can even get creative with how you split them up - for example, five LLM servers: one Docker container running 4xMI300X for more throughput, plus four smaller Docker containers, each with 1xMI300X, for the best latency.
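As a rough illustration of those two extremes, here’s what the split might look like using vLLM’s offline Python API. This is a minimal sketch under assumptions: the model name matches the demo above, but the exact flags, GPU pinning, and container layout depend on your actual deployment.

```python
from vllm import LLM, SamplingParams

# Option A - raw throughput: one engine spanning all eight MI300X cards
# via tensor parallelism. (Run this in a container that can see all 8 GPUs.)
llm = LLM(model="Qwen/Qwen2.5-72B-Instruct", tensor_parallel_size=8)

# Option B - lowest latency: use tensor_parallel_size=1 instead, and launch one
# such process per card (e.g. eight containers, each pinned to a single GPU
# via HIP_VISIBLE_DEVICES). A 72B model in 16-bit still fits on one MI300X.
# llm = LLM(model="Qwen/Qwen2.5-72B-Instruct", tensor_parallel_size=1)

params = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```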
You have this choice, and you get to be creative, because of the sheer memory capacity of each accelerator - 192 GB of HBM3 per MI300X. This would be flat-out impossible on an H100 system: the weights of a 72B model in 16-bit precision already take roughly 144 GB, so a single 80 GB H100 has nowhere near enough memory to even load a 70B-parameter-class model, let alone hold its KV cache.
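The back-of-the-envelope math fits in a few lines of Python - a sketch with round numbers, since real memory use also depends on quantization, activation overhead, and how much KV cache you reserve:

```python
# Rough memory estimate for serving a dense 72B-parameter model in 16-bit.
params_billion = 72
bytes_per_param = 2                             # fp16 / bf16
weights_gb = params_billion * bytes_per_param   # ~144 GB of weights alone

mi300x_gb = 192                                 # HBM3 per MI300X
h100_gb = 80                                    # HBM3 per H100 (SXM)

print(f"weights: ~{weights_gb} GB")
print(f"fits on one MI300X ({mi300x_gb} GB)? {weights_gb < mi300x_gb}")
print(f"fits on one H100 ({h100_gb} GB)?   {weights_gb < h100_gb}")
```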
Kamiwaza AI

You might be wondering how all this power could be channeled in a way that can effectively boost productivity in an enterprise setting.
You aren’t alone. Here at Kamiwaza AI, our mission is to help customers on their genAI journey - to unlock efficiencies and ways of working that were unimaginable before the rise of LLMs.
One of the many ways we enable customers in the age of generative AI is with AI agents - LLMs equipped with tools, and the know-how to use those tools, to accomplish tasks given to them by humans either ad hoc (through a chat interface) or on a fixed schedule (triggered by a cron job, say).
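To make “tools plus know-how” concrete, here’s a minimal sketch of the underlying tool-calling loop - not our stack, just the generic pattern against an OpenAI-compatible endpoint (which vLLM can serve). The endpoint URL, model name, and the run_shell helper are illustrative assumptions:

```python
import json
import subprocess
from openai import OpenAI

# Illustrative: point the standard OpenAI client at a local OpenAI-compatible
# server (vLLM can expose one); URL, key, and model name are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "run_shell",
        "description": "Run a shell command and return its output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]

def run_shell(command: str) -> str:
    # Hypothetical tool implementation - a real agent would sandbox this.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

messages = [{"role": "user", "content": "Clone the repo and run its tests."}]

while True:
    reply = client.chat.completions.create(
        model="Qwen/Qwen2.5-72B-Instruct", messages=messages, tools=tools
    )
    msg = reply.choices[0].message
    messages.append(msg)
    if not msg.tool_calls:          # the model is done using tools
        print(msg.content)
        break
    for call in msg.tool_calls:     # execute each requested tool call
        args = json.loads(call.function.arguments)
        output = run_shell(**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": output}
        )
```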
Here’s a demo of one such AI agent in action, running a full-weight Qwen2.5 72B Instruct model:
In the video above, you can see a custom AI agent in action. (Notice how fast a full-weight 72B model was running! That’s thanks to MI300X - we recorded that particular demo running on TensorWave’s beastly server.)
Receiving a single user instruction from us sends the agent on its merry way:
- [Git] Cloning a private demo repo from Kamiwaza AI
- [Filesystem] Copying a file from the calc folder into the cloned repo
- [Python] Executing arbitrary Python code to get the current date and time (because we gave it Python capability, it isn’t limited to just the tools we handed it!)
- [Coding / Editing] Analyzing calculator.py, finding the bug, and fixing it
- [Git] Committing and pushing the changes
- [Coding + Python] Creating a set of tests for the calculator, then running them to confirm that everything works as expected
That is agentic AI in action - it didn’t just spit out a response to a human command (that’d just be a vanilla LLM chatbot). Instead, it received instructions, and then autonomously and sequentially used the tools it had in order to achieve all of its goals.
The best part? That entire stack used in the demo will be released as open source software very soon. Yes, as amazing as that demo was, that’s not even our secret sauce!
You see, having a working AI agent is just the first step in a very difficult enterprise AI journey. That demo (and the resulting open source release) is just a simple web frontend and a few Python files that handle inference and tool calling.
What it lacks are key features that enterprises ABSOLUTELY require:
- Authentication and authorization, plus SAML integration for federated access
- Connections to enterprise data - what good is an agent if it can’t reach out to your vast stores of enterprise data? The Kamiwaza platform simplifies this through our built-in embedding and vector database solutions, enabling near-instant RAG functionality (see the sketch after this list).
- Secure and fine-grained control over AI agent data access - it’s not enough that agents can connect to enterprise data; their permissions must also be managed in a sane way, so that users can’t suddenly access data they normally couldn’t. (For example, you wouldn’t want employees learning their bosses’ salaries just because they asked an agent with an SAP connection, “Hey, how much does my boss make?”)
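The retrieval piece, stripped to its bones, looks something like the sketch below. This is a generic illustration, not the Kamiwaza API - the embed, vector_db, and llm objects are stand-ins for whatever embedding model, vector store, and inference endpoint you plug in - but it shows why access control belongs inside the retrieval step:

```python
def answer_with_rag(question: str, user, embed, vector_db, llm) -> str:
    # 1. Embed the user's question (embed() is a placeholder embedding call).
    query_vector = embed(question)

    # 2. Retrieve only chunks this user is allowed to see - permission
    #    filtering happens here, not just in the UI.
    chunks = vector_db.search(
        query_vector, top_k=5, filter={"allowed_roles": user.roles}
    )

    # 3. Stuff the retrieved context into the prompt and let the LLM answer.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)
```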
These are just some of the key features that the Kamiwaza platform offers. We simplify the genAI journey so that our customers can get started on reaping the benefits ASAP, instead of getting stuck tweaking different portions of complex LLM infrastructure.
We’ve been having fun with blazing fast genAI - and you can, too!
We at Kamiwaza AI are having a ton of fun AND doing real work with our TensorWave box. It’s been a pleasure working with the hardware - and with the awesome support from the folks at TensorWave.
We’ve been using their MI300X monster machine in the genAI demos we’ve been giving throughout the US. In one of the more recent ones, we even started showing off a sort of tachometer that displays, in real time, the vLLM engine’s tokens-per-second figure while an agentic demo is underway: