autotunellm.com

Show HN: Makes local LLMs faster and more reliable by optimizing for your device

tanavc · 5 points · 0 comments · 19 hours ago

Time to first token is 39% faster. Agent wall times decrease by 46%. No swaps. Tracks your resource usage in real time and adjusts how the model runs so that it works perfectly on your device. Implements KV cache sizing, prefix caching, live RAM pressure management, context trimming, KV quantization, and more. Built a ton of features.
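
To make "KV cache sizing" and "KV quantization" concrete, here is a minimal sketch of the standard arithmetic behind them. It is not taken from the tool itself. The function names are hypothetical, and the model shape (32 layers, 8 KV heads, head dimension 128) is an illustrative assumption. The KV cache stores one key and one value vector per layer, per KV head, per token, so its size grows linearly with context length and with the bytes used per element.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """Approximate KV cache size for a single sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2.0 corresponds to fp16; 1.0 approximates 8-bit
    quantization (real quantized formats add a small overhead for scales).
    """
    return int(2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem)


def max_context_for_budget(budget_bytes, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Largest context length whose KV cache fits inside budget_bytes."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return int(budget_bytes // per_token)


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

    print(kv_cache_bytes(context_len=8192, **shape) / GIB, "GiB at fp16")
    print(max_context_for_budget(2 * GIB, **shape), "tokens fit in 2 GiB at fp16")
    print(max_context_for_budget(2 * GIB, bytes_per_elem=1.0, **shape),
          "tokens fit in 2 GiB at 8-bit")
```

Under these assumptions, halving the bytes per element doubles the context that fits in the same memory budget, which is the trade-off a tool that sizes the cache against live RAM pressure would be balancing.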
