Compiling LLMs into a MegaKernel: A path to low-latency inference (zhihaojia.medium.com)
226 points by matt_d 14 hours ago | 67 comments
refibrillator 8 hours ago [-]
Hi author(s), the on-GPU interpreter approach looks like a promising path forward, have you seen this strikingly similar concurrent work?

https://news.ycombinator.com/item?id=44111673

I find it curious that fundamentals of the CUDA programming model (eg kernel launches) are being subverted in favor of fine grained task based parallelism that ends up using the hardware more effectively. Makes me wonder if CUDA has been holding us back in some ways.

What are the chances we see your work land in PyTorch as an experimental backend?

Awesome stuff thanks for sharing.

P.S. minor typo, your first two paragraphs under part 1 are nearly identical.

zhihaojia 7 hours ago [-]
Thanks for the great feedback! Stanford's MegaKernel project tackles a similar challenge but focuses on manual CUDA implementation, while MPK takes a compiler-driven approach: users express their LLMs at the PyTorch level, and MPK automatically compiles them into optimized megakernels. Our goal is to make programming megakernels much more accessible.

I completely agree that CUDA can be a limiting factor, especially for latency-sensitive workloads. As GPUs are becoming larger and faster, it's increasingly difficult to write standalone kernels that fully utilize hardware resources—particularly when optimizing for low latency with small batch sizes.

> What are the chances we see your work land in PyTorch as an experimental backend?

We're definitely excited about that direction. We believe MPK can help PyTorch support megakernel generation, and we’re actively exploring how to make that happen. Stay tuned!

> P.S. minor typo, your first two paragraphs under part 1 are nearly identical.

Thanks for pointing it out--I meant to remove the duplicate paragraph when finalizing the post.

bytepoet 13 hours ago [-]
This is very cool. I enjoyed going through the writeup and GitHub README.

I was wondering if these same optimizations can be brought to bear on training as well, rather than only inference. I guess the challenge here is fusing backward computations with gradient communication.

I also saw that this currently does not handle dynamic workloads such as MoE. I recently came across this paper that does exactly this:

FlashDMoE: Fast Distributed MoE in a Single Kernel - https://arxiv.org/pdf/2506.04667

zhihaojia 12 hours ago [-]
Thanks for reading the post and GitHub README. Supporting training is definitely feasible, but the benefit may not be as large as for low-latency inference, since training generally involves much larger kernels, making kernel launch overhead less significant.

Thanks for sharing the FlashDMoE work. Our next step is to support MoE models. Stay tuned!

bytepoet 9 hours ago [-]
Thanks for the inputs. It's very helpful to know.

I look forward to following mirage development.

ActorNightly 10 hours ago [-]
Personally I think it's a bit of a waste to invest time into gradient training optimizations. A lot of training tasks IRL involve discrete values, which can't be trained with gradients.
andy12_ 27 minutes ago [-]
How is this possible? I mean, I thought that sometimes you had no choice but to separate computation into several kernels. But here they literally allow CUDA threads to dynamically perform tasks assigned by scheduler threads? I only have a little experience writing CUDA kernels, so my mind is blown.
fho 1 hours ago [-]
Somewhat relevant anecdote: we had a small CUDA competition (10-ish years ago). Some embarrassingly parallel CV algorithm.

I tried to be smart and cache intermediate results that were shared by multiple kernels.

When the results were published I was stumped to see that others were orders of magnitude faster than me.

Turns out they didn't bother with caching at all. The overhead of recalculating everything a thousand times was tiny compared to the overhead of doing roundtrips through RAM.

I assume it's the same thing here. By compiling into MegaKernels, layer boundaries are squashed. There will likely be _more_ calculations and fewer shared intermediate results. But overall it's still a win due to fewer memory roundtrips.

There has to be a sweet spot, especially for convolution networks. No idea if the MegaKernel takes this into account.

baq 13 hours ago [-]
Next step - compile straight to verilog so I can buy some LLMs on aliexpress
bigcat12345678 13 hours ago [-]
https://riscv.org/blog/2021/02/hardware-description-language... That was one of the promising ideas before AI & GPUs came onto the scene. As CPUs stagnated, people naturally wanted to further optimize the middle layers of software and hardware.

But I suspect parallel computing in the GPU style is going to dominate accelerated computing.

General purpose CPUs are going to stick around as the little brain that orchestrates GPUs.

Ideas of compiling software directly to hardware might never become mainstream.

baq 12 hours ago [-]
I'm thinking more like pseudointellect over serial to attach a $3 esp32 to. Since it's basically tokens in, tokens out, let's just cut the unnecessary parts out. It's like querying the cloud models, except it's your silicon you personally soldered to the esp so nobody will break your home assistant with a system prompt update or a fine tuning run.
mycall 10 hours ago [-]
> General purpose CPUs are going to stick around as the little brain that orchestrates GPUs

Brings deterministic compute to the nondeterministic.

nialse 3 hours ago [-]
In five to ten years, when LLMs have stabilized, mapping them straight onto hardware will probably make sense. With today’s processes a hundred billion parameters might fit onto a single silicon wafer using ~1.5 bit precision implemented directly in logic gates. Using higher precision raises the gate count exponentially, so it makes more sense to keep the weights in memory and reuse shared compute blocks for the math for now. We need to get the ultra low precision LLMs working for the future though.
fc417fc802 9 hours ago [-]
Because training costs weren't high enough already, let's add mask costs on top.

More seriously, isn't that pretty much what all those AI hardware startups have already been doing for a while now?

adgjlsfhk1 7 hours ago [-]
most of them are much more general purpose. they might be specializing somewhat on the architecture, but not on the weights
fc417fc802 7 hours ago [-]
Realistically, specializing on the data flow is all you can do. Assuming a modern CPU contains on the order of 10 billion transistors, that only amounts to ~1.2 GiB of storage before you account for any actual logic (i.e. 1 bit per transistor). DRAM hardware is quite different from that of processing elements, and it takes quite a lot of DRAM chips to hold the weights of a single model.
anitil 9 hours ago [-]
I mean.... LLM-in-a-box would actually be pretty neat! I'm looking at some air-gapped work coming up and having something like that would be quite handy
fc417fc802 9 hours ago [-]
Isn't that easily accomplished by setting up a local deployment and then yanking the network cable? Anything that can quickly run a capable LLM is going to be a pretty beefy box though. More like LLM in an expensive space heater.
stirfish 8 hours ago [-]
I was thinking more like those Bitcoin mining usb Asics that used to be a thing, but instead of becoming ewaste, you can still use them to talk with chatgpt 2 or whatever. I'm picturing an llm appliance.
fc417fc802 7 hours ago [-]
There is no magic ASIC that can get around needing to do hundreds of watts worth of computations and having on the order of hundreds of gigabytes of very fast memory. Otherwise the major players would be doing that instead of (quite literally) investing in nuclear reactors to power their future data center expansions.
rhdunn 2 hours ago [-]
Google have their own ASIC via their TPU. The other major players have leveraged NVIDIA and -- to a lesser extent -- AMD. This is partly due to investment in TPUs/ASICs being complex (need specialist knowledge and fabrication units) and GPU performance being hard to compete with.

Training is the thing that costs the most in terms of power/memory/energy, often requiring months of running multiple (likely 4-8) A100/H100 GPUs on the training data.

Performing inference is cheaper as you can 1) keep the model loaded in VRAM, and 2) run it on a single H100. With the 80GB capacity you would need two to run a 70B model at F16, or one at F8. For 32B models and lower you could run them on a single H100. Then you only need 1 or 2 GPUs to handle the request.
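A back-of-the-envelope version of that sizing in Python (weights only; this ignores KV cache, activations, and runtime overhead, and the bytes-per-parameter figures are the usual rough values rather than measurements):

    # Rough weights-only sizing against 80 GB of HBM per H100.
    params = 70e9
    for name, bytes_per_param in [("FP16", 2), ("FP8", 1), ("Q4", 0.5)]:
        gb = params * bytes_per_param / 1e9
        h100s = -(-gb // 80)  # ceiling division
        print(f"{name}: {gb:.0f} GB of weights -> {int(h100s)} x H100")
    # FP16: 140 GB -> 2x H100; FP8: 70 GB -> 1x; Q4: 35 GB -> 1x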

ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.

I think the sweet spot will be when CPUs have support for high-throughput matrix operations similar to the SIMD operations. That way the system will benefit from being able to use system memory [1] and not have another chip/board consuming power. -- IIUC, things are already moving in that direction for consumer devices.

[1] This will allow access to large amounts of memory without having to chain multiple GPUs. That will make it possible to run the larger models at higher precisions more efficiently and process the large amount of training data efficiently.

fc417fc802 9 minutes ago [-]
> ASICs could optimize things like the ReLU operations, but modern GPUs already have logic and instructions for matrix multiplication and other operations.

Right but at that point you're describing an H100 plus an additional ASIC plus presumably a CPU and some RAM. Or a variant of an H100 with some specialized ML functions baked in. Both of those just sound like a regular workstation to me.

Inference is certainly cheaper but getting it running quickly requires raw horsepower (thus wattage, thus heat dissipation).

Regarding CPUs there's a severe memory bandwidth issue. I haven't kept track of the extreme high end hardware but it's difficult to compete with GPUs on raw throughput.

baq 2 hours ago [-]
If you focus on just the matmuls, no CUDA, no architectures, no infinibands, everything-on-a-chip - put input tokens in input registers, get output tokens from output registers from a model that's baked into gates - you should be able to save some power. Not sure if 10x or 2x or 100x, but certainly there are gains to be had.
stirfish 7 hours ago [-]
That's a really good point. I wasn't thinking further than ollama on my MacBook, but I'm not deploying my laptop into production.
otabdeveloper4 4 hours ago [-]
24 gigabytes is more than enough to run a local LLM for a small household or business.

This is "gaming PC" territory, not "space heater". I mean people already have PS5's and whatnot in their homes.

The hundreds of gigabytes thing exists because the big cloud LLM providers went down the increasing parameter count path. That way is a dead end and we've reached negative returns already.

Prompt engineering + finetunes is the future, but you need developer brains for that, not TFLOPs.

fc417fc802 4 minutes ago [-]
Only if you'll settle for less than state of the art. The best models still tend to be some of the largest ones.

Anything that overflows VRAM is going to slow down the response time drastically.

"Space heater" is determined by computational horsepower rather than available RAM.

How big a context window do you want? Last I checked that was very expensive in terms of RAM and having a large one was highly desirable.

rhdunn 2 hours ago [-]
It depends on 1) what model you are running; and 2) how many models you are running.

You can just about run a 32B (at Q4/Q5 quantization) on 24GB. Running anything higher (such as the increasingly common 70B models, or higher if you want to run something like Llama 4 or DeepSeek) means splitting the model between VRAM and RAM. -- But yes, anything 24B or lower you can run comfortably, including enough capacity for the context.

If you have other models -- such as text-to-speech, speech recognition, etc. -- then those are going to take up VRAM for both the model and during processing/generation. That affects the size of LLM you can run.

kp1197 12 hours ago [-]
After working pretty closely with vLLM and SGLang over the past few months, this is EXACTLY what I had envisioned a successor project would look like - analyzing an operation dependency graph and then fusing (or, at a minimum, scheduling tasks smarter). Congrats to the team.
zhihaojia 12 hours ago [-]
Thanks a lot for your positive feedback! We believe that MPK can enhance existing LLM serving systems, especially for low-latency LLM serving. We are very excited about the opportunity to collaborate with others on this direction.
gongy 6 hours ago [-]
The improvement is real!

And unlike a lot of research, the code actually runs well. I can reproduce the results using Modal GPUs, leaving the code here: https://github.com/mirage-project/mirage/pull/327/files

Triton + FlashInfer: Prompt length 39, generate length 264, per-token latency 19.189573345762312 ms

MPK: Prompt length 39, generate length 334, per-token latency 7.71875 ms

zhihaojia 6 hours ago [-]
Thanks for reproducing our results!
flakiness 12 hours ago [-]
This project is from CMU. Hazy Research at Stanford talked about the megakernel too: https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles

Good to see the competition in this area.

(Edited): Related paper covering the larger "mirage" project, but this doesn't cover the "megakernel" approach: https://arxiv.org/abs/2405.05751

zhihaojia 12 hours ago [-]
This is the writer of the blog post. You are right that Stanford's work is a parallel effort. The main difference is that our focus is on compilation: making it easier to generate megakernels automatically.
zhihaojia 7 hours ago [-]
Oops, missed one sentence in my previous response. Stanford's MegaKernel project tackles a similar challenge but focuses on manual CUDA implementation, while MPK takes a compiler-driven approach: users express their LLMs at the PyTorch level, and MPK automatically compiles them into optimized megakernels. Our goal is to make programming megakernels much more accessible.
zekrioca 9 hours ago [-]
And their focus is..?
sigbottle 10 hours ago [-]
Hazy Research also has ThunderKittens, a pretty cool library. There's a lot of effort to really formalize, pipeline, and divide-and-conquer within the current NVIDIA GPU model to maximize GPU efficiency, and to write compilers/DSLs for these things, it seems.
skavi 12 hours ago [-]
Does anyone have an intuition on why this offers significant gains over CUDA Graphs? The CPU launch cost of a graph is tiny, which implies most of the work has been offloaded to the GPU's own scheduler. I'd expect that some I/O marshalling at kernel boundaries could be avoided with megakernels. Maybe some loop fusion? Are there any more interesting optimizations they enable?
saagarjha 11 hours ago [-]
> The CPU launch cost of a graph is tiny

Absolutely not; it’s comparable to the launch overhead of a kernel.

skavi 8 hours ago [-]
Fair enough. I should have clarified that “approximately the cost of a single kernel launch” is pretty much what I meant by “tiny”.

What I was getting at was that a “megakernel” and a captured graph should have similar launch costs.

touisteur 4 hours ago [-]
It's not so much kernel overhead as memory traffic between global memory and L2 cache / shared memory that you (or at least I) target with a fused-kernel approach. Kernel launch overhead can be drastically reduced with CUDA graphs indeed.

I'm not sure it applies so well in LLMs though (should read the paper...).

zhihaojia 7 hours ago [-]
You are right that CUDA graphs can help reduce launch overhead, but they do not support overlapping computation/communication across layers, since data dependencies are described at the kernel level.
refulgentis 12 hours ago [-]
You've hit the nail on the head. The CPU launch cost of a pre-compiled CUDA graph is tiny.

CUDA Graphs are a huge step up from manually launching kernels, but they still treat kernels as monolithic, black-box operations. A megakernel erases the boundaries between those operations.

With CUDA Graphs, as in the example in the article, if you have Matmul -> AllReduce, the AllReduce kernel cannot start until the entire Matmul kernel has finished. The dependency is at the kernel level. With a megakernel, they break these ops into fine-grained "tasks" scheduled across SMs. An AllReduce task that needs data from the first slice of the Matmul can begin as soon as that slice is computed by a few SMs, while other SMs are still working on the rest of the Matmul. This fine-grained software pipelining and compute/communication overlap is simply not possible when the dependency unit is the entire kernel.
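
A toy Python sketch of that dependency-counter idea (purely illustrative; this is not MPK's actual runtime or API): each task waits only on the producer tasks for the tiles it reads, so an all-reduce tile becomes ready as soon as its matmul tile finishes, while the remaining matmul tiles are still outstanding.

    from collections import deque

    # Toy event-driven scheduler: deps[t] counts unfinished producers of task t.
    tasks = {
        "matmul_tile0": [], "matmul_tile1": [], "matmul_tile2": [],
        # Each all-reduce tile depends only on the matmul tile it consumes,
        # not on the whole matmul "kernel".
        "allreduce_tile0": ["matmul_tile0"],
        "allreduce_tile1": ["matmul_tile1"],
        "allreduce_tile2": ["matmul_tile2"],
    }
    deps = {t: len(p) for t, p in tasks.items()}
    consumers = {t: [] for t in tasks}
    for t, producers in tasks.items():
        for p in producers:
            consumers[p].append(t)

    ready = deque(t for t, n in deps.items() if n == 0)
    while ready:
        t = ready.popleft()      # a free worker (SM) grabs a ready task
        print("run", t)
        for c in consumers[t]:   # signal the event; enqueue newly ready tasks
            deps[c] -= 1
            if deps[c] == 0:
                # allreduce_tile0 is enqueued right after matmul_tile0,
                # before matmul tiles 1 and 2 have finished
                ready.append(c)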

skavi 8 hours ago [-]
Ah, that makes a lot of sense. Is this fine-grained task scheduling related to CUDA Dynamic Parallelism at all? If not, would you have a pointer on where to look?

I suppose I could look through the code of this project, but I'd hate to have to detangle that from the compiler infrastructure.

touisteur 4 hours ago [-]
Think of it more as 'tasking by hand', where you have one kernel driving the 18k+ cores and you manually (or using device libraries) synchronize them in a fine-grained way, handle memory traffic asynchronously, and pipeline as much as you can.

You might have a look at cooperative groups, things like cuda::pipeline in libcudacxx to handle asynchronous and pipelined memory traffic, and most of the block/warp CUB primitives, then move on up to cuFFTDx, cuBLASDx, and now cuSolverDx as the starting toolbox for your fused-kernel journey.

bronxbomber92 6 hours ago [-]
A question for the author(s) since they seem to be very responsive to this thread :).

1. How fine grain is each task? In a traditional matrix multiplication kernel, for example, each thread block is responsible for a small output tile of the resulting matrix. In Mirage's mega kernel, would there correspondingly be a task for each small output tile?

2. How does the Mirage compiler form the task graph? Does it have domain knowledge of every operator's data flow at the granularity of individual elements? Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself an output of a prior matmul (+ nonlinearity), the dependencies would be all of the output tile tasks of the operator that produced A that cover those M_BLOCK rows?

zhihaojia 5 hours ago [-]
1. In MPK, each task is mapped to an individual SM. The amount of work handled by a task is similar to that of a thread block in the traditional kernel-per-operator approach.

2. TL;DR: MPK automatically analyzes inter-task dependencies by tracking the input and output tensors associated with each task. Longer version: MPK uses imap, omap, and fmap (see Section 2 of the Mirage paper) to determine each task's input and output tensors. A dependency is introduced between task A and task B if A produces any tensor elements that B consumes—that is, if A's outputs overlap with B's inputs.

> Again taking matmul as an example: a given output tile requires the corresponding M_BLOCK rows of the A matrix. If the A matrix was itself an output of a prior matmul (+ nonlinearity), the dependencies would be all of the output tile tasks of the operator that produced A that cover those M_BLOCK rows?

Exactly. In this case, all output tile tasks that consume those M_BLOCK rows of A will depend on all tasks responsible for producing the corresponding parts of A in the previous operator.
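
For readers following along, here is a rough Python sketch of that overlap check (hypothetical names, not the Mirage/MPK code): each task records which tensor regions it writes and reads, and an edge A -> B is added whenever a region A writes intersects a region B reads.

    # Hypothetical sketch of inter-task dependency inference.
    def rows(lo, hi):
        return set(range(lo, hi))

    tasks = [
        # producer matmul: each task writes one M_BLOCK slice of tensor "A"
        {"name": "mm1_t0", "writes": {"A": rows(0, 64)},   "reads": {}},
        {"name": "mm1_t1", "writes": {"A": rows(64, 128)}, "reads": {}},
        # consumer matmul: each output tile reads one M_BLOCK slice of "A"
        {"name": "mm2_t0", "writes": {"B": rows(0, 64)},   "reads": {"A": rows(0, 64)}},
        {"name": "mm2_t1", "writes": {"B": rows(64, 128)}, "reads": {"A": rows(64, 128)}},
    ]

    edges = []
    for a in tasks:
        for b in tasks:
            for tensor, written in a["writes"].items():
                if written & b["reads"].get(tensor, set()):
                    edges.append((a["name"], b["name"]))

    print(edges)  # [('mm1_t0', 'mm2_t0'), ('mm1_t1', 'mm2_t1')]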

liuliu 13 hours ago [-]
The Qwen 8B number, if verified, is very impressive. Much more practical than the previous megakernel one.

That being said, this one-persistent-kernel-per-SM design reminds me of Larrabee, and now I'm wondering what the world would look like if we had just gone down the traditional process-thread-SIMD path rather than the CUDA path.

bdbenton5255 11 hours ago [-]
Certainly an important discovery for utilizing these models on scaled hardware. This approach could also be applied beyond LLMs to other types of neural networks. That would be an interesting space to explore.
zhihaojia 9 hours ago [-]
Thanks for the feedback! Yes, we believe the approach is general and applicable to other ML workloads.
tuananh 10 hours ago [-]
If you want to try it on a 5090, it's not supported yet.

> Support for modern GPU architectures. One of our next milestones is extending MPK to support next-generation architectures such as NVIDIA Blackwell. A major challenge lies in integrating warp specialization — a key optimization for newer GPUs — with MPK’s megakernel execution model.

zhihaojia 9 hours ago [-]
The task implementations used by MPK are currently optimized for A100. The Mirage compiler can generate task implementations for other architectures such as Hopper and Blackwell, but we haven't integrated everything together yet. This is at the very top of our todo list. Stay tuned!
fxtentacle 11 hours ago [-]
Isn’t fusing ops at a fine-grained level also the core benefit of JAX over TensorFlow? How does this work compare to JAX?
zhihaojia 9 hours ago [-]
JAX's operator fusion (https://apxml.com/courses/advanced-jax/chapter-2-optimizing-...) can fuse a few local operators (e.g., matmul and elementwise computation) into a single kernel. But JAX's approach cannot fuse an entire LLM with hundreds of operators into a single kernel because many operators involve loop transformations.

MPK takes a different approach where instead of incrementally fusing local operators, it decomposes operators into a task graph and builds a runtime system within a single kernel to execute all tasks specified in the task graph.
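
A small illustration of the contrast (the exact fusion behavior depends on the XLA backend and version, so treat the comments below as a rough expectation rather than a guarantee):

    import jax
    import jax.numpy as jnp

    # Under jax.jit, XLA will typically fuse the bias add and GELU into one
    # elementwise kernel (sometimes into the matmul's epilogue), but each
    # layer's matmul is still its own kernel launch: fusion stays local.
    @jax.jit
    def layer(x, w, b):
        return jax.nn.gelu(x @ w + b)

    x = jnp.ones((1, 4096))
    w = jnp.ones((4096, 4096))
    b = jnp.ones((4096,))
    y = layer(x, w, b)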

qihqi 10 hours ago [-]
Probably should make this into a backend of torch.compile
zhihaojia 9 hours ago [-]
Yes, it would be a lot of fun if MPK can enable torch.compile to generate megakernels. Torch-generated kernels are currently too slow for latency-sensitive workloads.
olivia111 12 hours ago [-]
really cool. would love to try it for our 3b model.
olivia111 12 hours ago [-]
any detailed tutorial about how to use it?
zhihaojia 12 hours ago [-]
The github repo includes a tutorial for using MPK: https://github.com/mirage-project/mirage/tree/mpk
babuloseo 8 hours ago [-]
ELI5
scotty79 13 hours ago [-]
> Traditional LLM systems often rely on sequences of GPU kernel launches and external communication calls, resulting in underutilized hardware.

What? Why? This seems like an obvious optimization if it's possible.

catlifeonmars 13 hours ago [-]
From the article

> Despite these advantages, compiling an LLM into a megakernel is highly challenging. Existing high-level ML frameworks — such as PyTorch, Triton, and TVM — do not natively support end-to-end megakernel generation. Additionally, modern LLM systems are built from a diverse collection of specialized kernel libraries: NCCL or NVSHMEM for communication, FlashInfer or FlashAttention for efficient attention, and CUDA or Triton for custom computation. This fragmentation makes it difficult to consolidate the entire inference pipeline into a single, unified kernel.

So my naive assumption is that yes it is obvious, but nontrivial.

saagarjha 11 hours ago [-]
Your naive assumption is the right one. It’s quite hard to do this. Even doing it automatically like it’s done here runs into problems with trying to figure out data dependencies and synchronization across nontrivial computation.
liuliu 13 hours ago [-]
It really is not obvious. These launches are asynchronous, and data movement / computation is overlapped properly through CUDA APIs. Even per-kernel launch cost is reduced with the introduction of CUDA graphs.

The CUDA programming model relies on each kernel being computationally expensive enough to make sense, and that's not true for LLM token generation. And we are talking about network evaluation at more than 1000 times per second, whereas previously, outside of recommendation systems, network evaluation was at ~100 per second at most.
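
Rough numbers to make that concrete (both figures below are assumptions for illustration, not measurements from the post):

    # Illustrative launch-overhead budget for one decoding step.
    ops_per_forward = 300       # assumed kernel count per forward pass
    launch_overhead_us = 3.0    # assumed microseconds of overhead per launch
    overhead_ms = ops_per_forward * launch_overhead_us / 1000
    print(f"~{overhead_ms:.1f} ms per token spent on launches alone")
    # ~0.9 ms, a big slice of a single-digit-millisecond per-token budget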

Also, nobody remembers Alex's "One Weird Trick" paper, which slices the matmul into pieces to overlap device-to-device transfer vs. computation. That was 10 years ago.

gdiamos 2 hours ago [-]
It's surprising to me that the field is willing to invest this much in mega-kernels, but not models that generate multiple tokens in parallel...
shawntan 13 hours ago [-]
Systems might want to anticipate changes in LLM architectures (even small changes can make a big difference kernel wise), so it's good to not "bake" too much in ahead of time.

That said, at some point it just depends on where the costs lie, and it might make sense to hire some GPU engineers to do what they did here for whatever architecture you're optimising for.

Not as low-hanging as you might imagine.

delusional 12 hours ago [-]
In the common case where the processor dispatching those kernel calls is much faster than the kernel calls themselves, you're not likely to see a meaningful increase in throughput.

What you need to do first is get really optimized kernels (since that makes the dispatching relatively more expensive) and THEN this becomes worth doing. People who are really good at writing optimized GPU kernels are just not that easy to get a hold of right now.

NitroPython 13 hours ago [-]
Ollama integration?