I patched llama.cpp to gain 20% prompt processing TPS. Help me make a PR
I've been running Qwen3.6-35B-A3B locally on llama.cpp and noticed that prompt processing throughput gets too low with MTP. I got nerd-sniped. What started as curiosity turned into a two-week rabbit hole of experiments and ended with a PoC that fully recovers the MTP PP overhead on GPU, above any expectation I had. TL;DR: instead of processing the last layer MoE FFN for the entire ubatch tokens (usually 512-2048 tokens), this PoC processes only the output row (usually 1 token during prefill). The result is PP TPS is back to the same as with MTP disabled, keeping most of MTP's benefits to TG TPS, even with a slight drop in draft acceptance rate in one of the benchs. I'm not opening a PR to llama.cpp because this is AI-generated code, which goes against their contribution policy, which I support. If you know C++ and llama.cpp internals, I invite to work together with me to open a PR with a more mature implementation.
Comments
2 preview comments · loading full threadLog in to h4cker, then connect Hacker News to publish comments.
you have better chances opening an issue and explaining your findings. lots of people watch the repo so probably someone will chime in.
[deleted]