quesma.com

Qwen 3.6 27B is the sweet spot for local development

stared · 885 points · 600 comments · 15 小时前
打开原文HN 讨论

评论

20 条顶层评论
iagooar13 小时前

I love my MacBook Pro M5 128GB RAM and I love qwen3.6. BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise. Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents. If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro. Thank me later.

astrostl9 小时前

> MacBook Pro M5 128GB RAM 614 GB/s of memory bandwidth > MacMini M4 with 64GB of RAM 273 GB/s of memory bandwidth (also only currently available with 48GB) When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models. And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.

iagooar1 小时前

On paper the M4 should be roughly 1/3 of the M5, in practice it is only 1/2. With the right, optimized model like qwen3.6 35B MoE MLX you can get over 40 tok / sec on it. I run dozens of background jobs that are not time-critical on it.

bigyabai7 小时前

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.

jasonjmcghee4 小时前

I'm surprised no one has else has mentioned - low power mode. With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop. If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience. I almost always keep my laptop on low power mode.

c1615 分钟前

Will give this a try later. Enjoy working with A3B Coder, but the heat coming out my 32gb M5 is a lot. This might be the trick - Thanks!

html5cat1 小时前

Awesome idea! Will try it out. Wish there was a way to enable low power on a per-app basis. Scrolling and reading on low power mode is really annoying.

mycall1 小时前

It is less efficient use of the GPU and uses more electricity overall, no?

anon3738393 小时前

Can you mention what inference stack you're using? I've tried MTP several times with that model and it always seems to significantly cut my token generation speed from ~60 tokens/sec to ~40 (M3 Max).

SwellJoe11 小时前

I opted to buy a normal 32GB laptop for this very reason. I know how loud and hot the GPUs in my desktop run when running even smallish models like Qwen 27B or Gemma 4 31B (which is a better model for most than Qwen 3.6, despite the benchmarks). I also have a Strix Halo which doesn't get loud, because it has a single huge fan, but it does get hot. So, there's no way a laptop could work as hard as models make them work, and not be unbearable. Tiny fans trying to remove all that heat? They gotta be screaming. No reason to spend all that money on a laptop that I couldn't realistically make use of. I do run a lot of VMs on my desktop, but I can get to those on a VPN. It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy. All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.

girvo11 小时前

Gemma is better than Qwen at everything except coding, in all my evaluations. Which is a shame because that is what I use them for!

UncleOxidant9 小时前

It would be great if the Gemma folks would release a code-focused model. Probably won't happen, but it's fun to dream.

ekianjo8 小时前

gemma is also worse for tool calling. not just coding

unknown8 小时前

[deleted]

mycall1 小时前

You can limit TDP on Strix Halo so it runs between 32 and 45W which seems to be the sweet spot for heat vs speed.

andai11 小时前

> The reason is simple: your fingers will burn and your head will explode from the noise. So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :) I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine. There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks). There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...) But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

iagooar11 小时前

Just buy a Mac Mini really is good advice if you want to get into real, always-on convenient agentic work. Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.

marcuskaz11 小时前

Except they're not available, 3-4 month wait time.

unknown11 小时前

[deleted]

unknown10 小时前

[deleted]

roadside_picnic9 小时前

In general if you're setting up a local LLM you should assume it's going to be primarily working as a server and talking to various clients. I use my MBP, but that's because I don't travel much anymore so it can happily work as a server at all times. With the right agent setup you can probably manage most things from your phone even if you don't have a seperate machine to use as a client. I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp). Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).

HSO52 分钟前

running potentially sota open-weight models locally only became a thing in fall 2023. if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture. more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end. that`s also how i would interpret the recent rumors on m6 and m7. naturally, the cooling and all that will be optimized around that. so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year. you are basically paying the price now to be on the bleeding (sweating) edge

jtbaker8 小时前

Nope, have both these machines, can confirm the M5 max blows the M4 mini away. It does get hot, but I use it mostly with an external monitor and keyboard. Conceptually I like the headless model better with a workstation, but work was buying the M5 and can't get it in any other form factor at the monute.

827a7 小时前

Apple does not sell a 64GB variant of the M4 Mac Mini. IIRC they never have; its always capped out at 48GB. If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.

angoragoats7 小时前

The Mac mini was available with 64GB of RAM literally 4 days ago; the option was discontinued on June 25th.

dd8601fn5 小时前

I'm using a 64GB M4 Mac Mini. They pulled them a month or two ago, right after I bought it.

ozim2 小时前

DGX Spark everyone is saying performance for the money is not there

Foobar85682 小时前

I have an access to a DGX spark, and while it performs better than my MacBook Pro (M3 Max), the performance on Qwen and Gemma dense models is dog shit, and not worth it.

dgacmu7 小时前

That's incorrect, I have one on my desk right now. They've stopped selling it now, but I got one a year and a half ago: > Apple M4 Pro chip with 14‑core CPU, 20‑core GPU, 16-core Neural Engine 64GB unified memory 2TB SSD storage 10 Gigabit Ethernet Three Thunderbolt 5 ports, HDMI port, two USB‑C ports, headphone jack Accessory Kit $2,649.00

PeterStuer59 分钟前

No laptop is thermally designed to handle sustained high workloads. The whole point of a laptop is to keep it thin, quiet and light, the exact opposite of what cooling needs.

swang12 小时前

I have an M4 Max and when I was trying out local LLM work with pi it has probably felt like the hottest I've ever felt any kind of Macbook be. I could feel the radiated heat off it even a few inches away. Honestly felt hotter than any Intel Macbook I've used. Because of that I stopped as I didn't want to harm my laptop in case I need to hold it for 10 years due to all the supply issues/price increases.

dimitrios112 小时前

I tried to run it on a M4 Air for shits and giggles. After about 1 minute the entire machine basically bricked and I had to hard reset :D

somewhatrandom910 小时前

Try using DwarfStar 4 and use the --power flag: https://github.com/antirez/ds4#reducing-heat-power-usage-and...

boomskats10 小时前

Can you run Qwen 3.6 27B on antirez/ds4 now? I thought it was all about the DeepSeek models.

somewhatrandom910 小时前

No, I don't think Qwen, but I believe he may try and put some version of GLM in it.

acters12 小时前

Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark? You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

girvo11 小时前

My GB10 Spark-alike is absolutely amazingly fun… but it is not cost effective. Step 3.7 Flash is shockingly capable (IQ4_XS and used for web dev mainly), but it cost me $6800 AUD. They’re even more expensive now. The numbers just don’t make sense: with proper triple head MTP I can get it up to ~40tk/s decode and it runs at around 1000+ tk/s prefill. $6800 is a lot of API credits for GLM, for example, on any provider you want to use. Now being able to run models uncensored and with privacy has value! But the cost for these is rough today. I still am going to buy a second one haha

c7b11 小时前

My 2c: you don't need the Strix Halo desktop, the chip comes in many rigs, most of them cheaper, the performance difference isn't worth it. It used to be half the price of a DGX Spark or a Mac with 128GB RAM. If you can still find it at that price I'd say it's the best bang for your buck. Otherwise, Macs have 2-3x the memory bandwidth of the DGX Spark, depending on the chip, so I'd prefer them. Unless you're planning on building a cluster. The DGX Spark has two 100GB/s connectors, ideal for clustering. But I haven't checked what else you could get for the price of two DGX Sparks.

brandensilva6 小时前

Thoughts on a M5 Ultra 768GB if it drops? What's the price to make it worth it for you over a spark cluster? I'm wanting to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but trying to decide if a spark cluster makes more sense.

lee_ars11 小时前

I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs. But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me. Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.

coder5439 小时前

Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated. I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models. As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again. Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.

cpburns20099 小时前

Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.

rnxrx11 小时前

There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).

anon3738399 小时前

I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.

gnerd008 小时前

`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant

pkroll12 小时前

Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.

c7b11 小时前

This. Do consider local LLMs, but set aside a dedicated machine for it. Connect via VPN or reverse proxy. If it's not a Mac them I'd also put a server distro on it. No need for a desktop environment, save your RAM.

tedivm11 小时前

I have a Linux box with two 3090s and it's been great for running Qwen3.6 27b. I lowered the power on each card down to 250w, and then built a small ducting/fan system to vent the waste heat outside. The machine is pretty much silent, and I'm still getting 110 tokens per second out of it for coding tasks. https://github.com/tedivm/qwen36-27b-docker

amatecha6 小时前

I wonder if that's why there is such a good selection of 128gb M5 MBP's on the Apple Certified Refurbished store lol https://www.apple.com/ca/shop/refurbished/mac/macbook-pro-12...

sixothree3 小时前

Wait. Did they raise their prices a second time?

nirvdrum2 小时前

Probably USD vs CAD. The parent posted a /ca/ link, which will look really similar to /us/, but the prices will all appear to be higher.

Arch-TK8 小时前

It's okay, completely wrong thread for this statement, but I wouldn't voluntarily use current MacOS (no idea if the older variants weren't terrible) over anything but ssh. Worse than Windows 11.

amatecha6 小时前

"macOS" (or however they spell it now) is pretty bad, but I'm not sure it's possible Apple could ever possibly produce an OS as bad as Windows 11 lol, it's really surprising to me to see someone suggest it's somehow actually worse?! How many times has an Apple OS wiped your hard drive or otherwise been completely borked from a forced update? I know multiple people personally who have experienced this with Windows 10/11, not once with a Mac. Just that alone is like the end of the argument for me, ignoring all the shockingly brutal UI problems.

braebo7 小时前

I could not disagree more.

trollbridge5 小时前

Or just buy an R9700 and put it in the basement?

overgard11 小时前

I'm running an M5 Max 128GB with Qwen 3.6 and unreal engine in the background and it seems to be ok for me. Quite a power drain if it's not plugged in but I haven't seen any thermal issues.

geophile11 小时前

That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6. My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.

trollbridge5 小时前

I'm still kicking myself for buying a 32GB M1 Max Studio two years ago when it wouldn't have been that difficult to get a 64GB instead.

oceanplexian12 小时前

If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.

chorizo12 小时前

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

nsbk11 小时前

I beg to differ. Have a look at this repo with single/double 3090 optimized configs for Qwen and Gema models: https://github.com/noonghunna/club-3090

unknown11 小时前

[deleted]

sanderjd12 小时前

Yeah seems to me like the mac studios with the unified memory architecture are genuinely good bang for the buck at the moment, because of this memory size consideration?

SkitterKherpi12 小时前

You can run 8bit 27B models at 24GB, it's definitely enough for the model size.

angoragoats7 小时前

So buy two.

iagooar11 小时前

My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed. But man, I have never purchased a computer which is more expensive than a decent family car.

d0gsg0w00f6 小时前

I had this dream too. My 2xDGX Sparks arrive in my reality on Monday.

jnovek12 小时前

An M1 Ultra has 800gbps unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).

murderfs10 小时前

A 5090 gets you 32GB with 1.8 TB/s of memory bandwidth for ~$4k, RTX A6000 gets you 48GB at 768 GB/s for ~$3.5k, 2x 3090 gets you 48GB for $2000 or so, and if you're willing to go into the wilderness, there are much cheaper options like the AMD MI50.

angoragoats7 小时前

Yeah this is just not the case at all; a 5090 or any of the recent nvidia workstation cards all fit this criteria. Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.

dheera10 小时前

32GB V100

t0mpr1c34 小时前

Meh. I'd rather have 2x RTX 5060 Ti.

pistoriusp1 小时前

Mac Mini in the rack and a Neo in the lap.

bensyverson14 小时前

The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0] Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs. [0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...

dofm14 小时前

The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it. I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that. I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus. Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task. The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.

pizza23413 小时前

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it. Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem. Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy. When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.

dofm13 小时前

Again, I would not argue against any of this. And I can't say that I won't switch to openrouter (even just for the same models) at some point. But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.

Abishek_Muthian4 小时前

I agree completely. I think local AI is best limited to purpose built SLMs; all this craze around running quantized coding LLMs has taken the attention off SLMs.

sanderjd12 小时前

> currently The interesting question is whether that gap will narrow, and if so, how much, and on what timescale. The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.

AlpacaJones13 小时前

The key word there is 'currently'.

bogeholm12 小时前

> Cloud models […] don't consume so much power/generate heat I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place

psychoslave13 小时前

Anything done local will likely come at higher cost and at scale with less energy efficiency and commodity, with less possibility to fine tune engineer deeply on wider horizon of issues. That's never the point of keeping local alternatives though.

unknown13 小时前

[deleted]

unknown13 小时前

[deleted]

VerifiedReports12 小时前

Exactly. The distinction between the various layers in "AI" systems is pretty vague to the newcomer. What is the "model" vs. the engine "running" it vs. weights? I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight. For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools? And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.

Fr0styMatt8811 小时前

The most unexpected thing for me was kind of philosophical in a ‘holy shit’ way. Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath. Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special. The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.

ricardobayes12 小时前

For the most part you can just download LM Studio and go from there. It provides a chat interface and an easy-to-use interface to browse, load and use LLM models. The engine: it is abstracted away by LM Studio, if you want to dig deep it's llama.cpp as the runtime. Weights are the files what you download, they are the models for practical purposes.

bpye2 小时前

> The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it. Seems like a GPU with 12GB+ VRAM is going to be a much more affordable way to achieve that? Even a B580 should get reasonable perf there.

dofm2 小时前

No idea. I am a Mac guy, have been for a very long time. I buy them secondhand as a rule. I guess I would build a powerful home LLM server if I was convinced I really needed one for my purposes for some agentic application or other. At the moment I'd prefer to ride this out with a machine that is also an excellent Mac.

codazoda13 小时前

I agree with the learning aspect, but I have another motivation. I suspect that closed models might become too expensive to run for personal hobbyist use. I’ve been planning to buy a 64GB machine just to allow the limited local models this enables.

ehnto6 小时前

It's also great to have capability to run local models for more brute force tasks. Because you can change the system prompt, you can get local LLMs to do all kinds of high volume tasks without burning through tokens on a hosted model. Just one example, I needed a bunch of images tagged and organised, with a local vision capable model I could pretty easily set that up and leave it running overnight. I already had the GPU and memory for gaming, so it was at no cost for me to start running local models. But I feel the long term writing is on the wall, local models will only make more and more sense as they get better and more efficient.

ricardobayes12 小时前

I'd say give it some time for the dust to settle. This field badly needs standardized benchmarks even before the conversation around model goodness can start.

ddalex14 小时前

I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you

dofm13 小时前

Right, but I am a middle-aged bloke who is experiencing existential angst about whether I can carry on in this industry. I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning. So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off. I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button. For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded. I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer. (Also who am I kidding about the existence of a printer with no problems)

swiftcoder13 小时前

I don't necessarily think your answer is wrong for all people, but if you work in software... how do you plan to differentiate yourself from everyone else out there, if the depth of your understanding is "Claude can do it for me"?

coldtea13 小时前

>no need to learn, just ask it to do it for you And that's how skills die.

sorokod13 小时前

Then what is the point of ddalex?

kdkdjduxnd14 小时前

[deleted]

rusk14 小时前

> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)

dofm13 小时前

LM Studio is also nice because of the way the interface explains things; parameters have explanations and hints. It has been designed by people who really care about making it understandable. I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice. A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.

not_kurt_godel12 小时前

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it. Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.

dofm12 小时前

I'm not that bothered about my coding skills, which are fine, and pretty up-to-date considering I'm now an old bloke. I am bothered about building an instinctive understanding that helps me deal with my anxieties and decide whether I want to carry on with this working life or quit. I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it. YMMV.

sanderjd11 小时前

Continuing to learn new ones, like what? To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment. What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns? I dunno, all of this seems really boring and "been there done that" to me at this moment in time!

oceanplexian12 小时前

Honestly your best bet is to buy a $20 Claude subscription, ask Claude to set it all up with Pi and llama.cpp and come back in 20 minutes after a cup of coffee. This is also a good idea because it will help set expectations of what a local model can do vs. a frontier model.

mullen12 小时前

This is what I did after struggling to get llama.cpp working at a decent speed on my M1 Macbook. The secret is to very specific with your needs and targeted in what you are using llama.cpp for. Mine setup is just about strictly for qwen3-coder and now, I get a fairly decent speed out of it. I also installed Cursor to check Claude and it all worked out well.

cyanydeez14 小时前

I've setup to local paradigms for local coding: - opencode with it's webui - deer-flow with it's research/powered front end They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context). It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed. It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.

dofm13 小时前

> - opencode with it's webui Have you tried Paseo? I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice. (You can also use the Opencode GUI to frame a remote opencode web interface)

bsder10 小时前

> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus. Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT. However, my gut feel as someone who got to experience the TeleBomb after the DotBomb is that the obfuscation is INTENTIONAL--it's neither you nor your age. I remember asking people to explain to me what the OC-768 startup endgame was when roughly 10 OC-768 links could carry the world's traffic at the time--and everybody giving me blank looks. The AI Bubble has the EXACT same feel as the Telecom Bubble--just bigger. What I really wish is that I could find a VPS-type provider where I could toss things into their NVIDIA/AMD machines for an hour or two. Alas, all of the providers seem to want massive paperwork and huge minimum purchases. I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.

porphyra14 小时前

You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers). In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10. [1] https://x.com/MiaAI_lab/status/2070859135399182444 [2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM

Zetaphor5 小时前

Alternatively you could run it on Strix Halo for $1,000 less, and while it may be slightly slower you won't have to deal with NVIDIA's shit on Linux and worrying about having to use their custom kernels or Ubuntu.

esperent14 小时前

> 48GB of VRAM with, say, two 3090s So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.

fluoridation14 小时前

>Plus I assume it's considerably more effort to get it working. Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.

lee_ars11 小时前

The tweet you link shows "Qwen 3.6 35b NVFP4 - 256k ctx, 110 tok/s", but I'm getting only half that, around 50 tok/sec, on a DGX Spark with Qwen3.6-35B-A3B-NVFP4 (via vLLM) plus speculative decode w/EAGLE3. I'd be ecstatic to see 110 tok/sec and I wish they had some more sourcing for the exact config, because it's double what I'm getting. edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!

porphyra8 小时前

I think Atlas might also be slightly faster than vLLM: https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...

Catloafdev14 小时前

The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.

bitexploder13 小时前

For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4, you could probably optimize it further. RAM is not a limitation but overall memory bandwidth. Q8 is slower. 35B A3B Qwen is quite speedy, but a little less accurate. With Qwen 3.6 27B dense I can squeeze a 9B parameter model and use that for fast analysis or code scanning while 27B is churning on a task in the background. It is tight, but totally reasonable. The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive. Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful. Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.

aunty_helen12 小时前

I was doing some benchmarking last night on 2 3090s. The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE. The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job. It does seem to be doing useful work but it’s not API call level quality

coder5439 小时前

> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4 Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.

CMay12 小时前

At 24GB, Gemma 4 31B QAT will be better and give more concise answers. This post is mostly about unquantized results, so it's less relevant and I can't say much about as I haven't tested Qwen or Gemma via cloud API or unquantized locally. All I can say is locally, quantized in a 24GB scenario, Gemma 4 31B is better in my tests which are mostly reasoning or C programming related. Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect. When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par. Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for. Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.

thewebguyd14 小时前

I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization. If you want to run unquantized, you definitely need 128GB.

Catloafdev14 小时前

Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.

bitexploder13 小时前

It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.

gchamonlive13 小时前

[deleted]

Numerlor13 小时前

And if you go for actual GPUs it'll run much faster, I'd say 24gb may be pushing it for context, but my 5090 with 32GB VRAM is usually somewhere between 60 to 100 tok/s with mtp and 2-3k tok/s for prompt processing. I'm not sure what they cost now but it's definitely still quite far from the macbook, and there's also some other 32GB GPUs that are considerably more affordable

nok22kon13 小时前

a computer with 24 GB VRAM is at least $3000

daemonologist12 小时前

A 7900 XTX is about $850, and the rest of the computer basically just needs to boot Linux. You could easily build such a machine for $1500. Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.

sleepyeldrazi13 小时前

I can't speak for the US, but in Germany (where hardware is usually more expensive, not less), I got my 3090 3 months ago for 750 euro and have been running the iq4_nl 27B using q4 kv (which after recent patches in llama.cpp is in my xp indistinguishably accurate from q8 of f16) at full ctx, with MTP at 2, peaking around 70 t/s on small ctx, around 50 t/s when im around 64k and ends around 40 t/s near the cap. The rest of the PC is a 50 euro ddr3 16gb i5 4th gen box, absolutely nothing special. And this setup is often more useful than dsv4pro (and sometimes kimi, but not glm) for research and ML work.

throw123456789113 小时前

But the tokens or credits are gone. MacBook stays. You can run other models on the same MacBook. What I read people burn every month on saas… for that money you break even on that MacBook in 5 months. Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.

wilsonnb313 小时前

Companies are already shipping everything to Microsoft or Google and 17 other companies, just the cost of doing business.

throw123456789113 小时前

Sure, but no one gets everything. Just that one.

DANmode13 小时前

That’s at today-prices. If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?

wahnfrieden13 小时前

It's much slower, and often quantized

acchow12 小时前

That $6700 is a $5000 upgrade over a base model Macbook Pro. $5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)

neonstatic11 小时前

I think the argument isn't that local is cheaper - it's that local is doable and delivers unparalleled privacy.

iosjunkie6 小时前

And your government can’t take it away on a Friday afternoon.

stymaar14 小时前

> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0] Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument. Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.

boutell13 小时前

That 3090 is going to burn 750W and it will still cap you at a 4 bit quant and ~48K context. Here's someone who worked through it: https://github.com/noonghunna/qwen36-27b-single-3090 Flies though (50-70tps is impressive for a model this smart) I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.

stymaar13 小时前

> That 3090 is going to burn 750W The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.

hughw12 小时前

My eyes glaze over reading all the AI produced verbiage. I did find a few useful parameter settings I've already discovered using my single 3090 and ollama. I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps. [edited to mention ollama as a nice alt]

nozzlegear14 小时前

Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.

dmayle14 小时前

For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45

montebicyclelo13 小时前

Isn't the directionality important. I.e. it is currently possible to run useful / great models locally, but on high end machines; and in a few years we will likely be able to run even better models on standard machines.

organsnyder14 小时前

I run Qwen 3.6 on my Framework Desktop 128GB, and it's very performant. I know Framework has had to raise the price since I preordered mine, but they're still well under half the cost of that Macbook.

andy9914 小时前

I get ~55 Tok/s on my framework desktop with the 35B A3B q8 model, and so far am also very happy with the coding performance.

cyanydeez13 小时前

did you upgrade to MTP?

bityard10 小时前

There are several variants of Qwen 3.6, the MoE models are performant on Strix Halo, but the 27B dense model (the one spoken about in TFA, and generally regarded as the best of the group in terms of quality) is not so performant: https://kyuz0.github.io/amd-strix-halo-toolboxes/

shockembopper7 小时前

I’ve got qwen3.6 27b running on my media server atm. Given that I built on top of what I already had, it didn’t cost me nearly that amount. I’ve been running 2x 5060 ti 16gbs, and when using text only and nvfp4, I can run the model with 200k context length and roughly 50-60 toks. It’s very good, and costed me about $800 after buying the gpus from microcenter.

elorant13 小时前

You can get an AMD Strix Halo with half that price even after hardware price adjustments. Besides you don't need 128GB of RAM to run a 27B model.

dannyw14 小时前

I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.

DrammBA4 小时前

> I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. Context size?

dom9613 小时前

I've been running it on my 48GB MBP too and it's not particularly great. Super slow and not near enough to the quality provided by even Claude Sonnet.

doodlesdev14 小时前

How much does one of those cost in the US? Here in Brazil, your notebook is worth as much as a used Honda Fit, which seems absolutely insane. For comparison, the ThinkPad I'm currently running cost me 1/20 of how much this MBP costs here, leaving me with over $8.000 to spend with LLM inference (if I actually spent money with that).

georgeven14 小时前

I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)

Dig1t13 小时前

How did you buy 3 V100's for $1500??

sixdimensional3 小时前

Not OP and just guessing, but probably SXM2 GPU modules for the V100. Those can be acquired fairly inexpensively, but there is work to do to get them working together and the V100 has some limitations on the types of models you can run.

jeffybefffy5199 小时前

I still dont trust the Anthopic and OpenAI are not training on my code. I even just thinking keeping track of what code you have received in prompts and to train/not train on it seems like an impossibly difficult task.

andrekandre8 小时前

am i right in assuming your code is closed-source? i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?

redox9912 小时前

I bought 2 used 3090s some years ago for $500 each. They're probably a bit more expensive now, but I guess for something like $2000 you can build a barebones 2x3090 PC which will be way faster than a Macbook. (you're fine with very basic hardware outside the GPUs)

stared10 小时前

All experiments with Qwen 3.6 required no more than 48GB Apple Silicon. I believe you can go even further with more aggressive quantizations - one can go down even further. In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale. At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.

trentor13 小时前

Runs fine on 2x4080s or on two 5060/5070s with 16GBVRAM... and faster than on the mac.

dvduval14 小时前

Absolutely for the average developer the token speed is just going to be too slow for it to be workable. I think we’re looking at 2028 when memory becomes cheaper again and they’ll be a lot more people using local models.

cyanydeez14 小时前

AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k. I think you might be a little to into the stew here.

zdragnar13 小时前

I got mine at the same price point, and I've been pretty pleased with it. Tailscale lets me use it from my ultrabook / lightweight laptop, no burning lap or crazy fan noises. Desktops with the amd ai+ 395 are still fairly affordable for what they can do. I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.

onion2k14 小时前

None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage. The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.

janalsncm14 小时前

> Being able to nail a zero-shot greenfield project is relatively easy even for a small model Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.

hollowturtle11 小时前

In what era spinning up a PoC required a week of work? Especially on the web. I've been a developer for roughly 20 years and that has never been the case, to the point that I believe people impressed by LLMs are the same who had a very low productivity. Today we have game jams as short as 3 days and talented people are able to produce very good PoC, with some almost complete!

janalsncm6 小时前

1) It depends entirely on the concept you are trying to prove and how experienced you are in that domain. 2) Not every team will have someone with 20 years of experience in a particular domain eager to spin up a PoC.

spiralcoaster10 小时前

So what you're saying is that all PoC's are guaranteed to take less than a week of work. What are you even saying? Are you aware that there is a massive range in the scope of projects? You must work on some incredibly simple CRUD apps if this is your take.

cyanydeez13 小时前

I love the ability to spin up any repo on github by pointing a local model at it with zero cost beyond the heat & electricity.

onion2k12 小时前

[deleted]

ai_fry_ur_brain12 小时前

Yeah, and we still do take a week for people that actually care. If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I could care less about it. In fact it just makes me lazy, like a fat person who drives everywhere. I love typing code and thinking for myself. Im going to continue to do that. I still dont know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments. Spending 6k on hardware to run the worlds most mediocre model truly does make you an incredibly stupid person, so Im not really suprised by these comments of people saying these tiny models are helping them so much. Its like a special needs kid all of sudden got the ability to code, of course they'd be impressed by basically all the code it produces.

j_bum12 小时前

I mean, have you looked for examples of things that people using local models to build and ship? Or are you just assuming it doesn’t happen? I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able use it for reasonably scoped tasks. I’m not saying these models are perfect. But you are complaining about people on the extreme, while at the same shouting from the opposite extreme.

Aurornis12 小时前

> and it can fall back to similar examples in the training data easily. This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show. My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad. Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable. If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent. The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read it about online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.

CMay12 小时前

This is my experience too. Qwen optimizes for a lot of scenarios which masks their weaker generalization compared to US frontier models. Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory. For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.

Zambyte13 小时前

I have been using pi (and previously the codex cli) with Qwen 3.6 27b with 100k context for my development at work, and I have been very blown away by how well it works. It's not perfect, but it's enough to accelerate my normal development flow. I mostly use it for writing Go and C#.

sosodev14 小时前

In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There's just too many decisions to be made and they're not good at that. Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.

fluoridation13 小时前

>Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.

tenuousemphasis11 小时前

Claude Opus with xhigh thinking is surprisingly good at figuring our details. Granted I'm only using it for little hobby projects, nothing overly complicated.

verdverm13 小时前

I had good results doing an open box reimplementation. Gave qwen access to my old projects and it rebuilt it on JAX. https://github.com/verdverm/pge-jax

mark_l_watson10 小时前

There are several general types of tasks that a Gemma 4 12B class model works for me, including: 1) design a large project composed of small libraries that can be coded and tested in isolation. 2) clean up old coding projects: add README files, comment code, show an example of using a new API and have it update API use, etc. All small-scale stuff. For large integrated projects I am finding DeepSeek v4 Pro commercial API to be very inexpensive and helps me produce good results.

internet1010103 小时前

Exactly. If the repo has all of the knowledge living inside of it that window fills up fast, even when using something like codegraph.

esafak13 小时前

I don't use local models but have you tried augmenting the model with code intelligence MCPs like https://github.com/DeusData/codebase-memory-mcp ?

h4ny14 小时前

> In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) 1. Maybe you should tell us what those limited experiments are. 2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope. 3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.

snapcaster13 小时前

Nobody owes you a scientifically rigorous write up

imrehg1 小时前

I'm having a decently good time time with `qwen3.6-35b-a3b-mtp` (unsloth's multi-token prediction version) and and `qwen-agentworld-35b-a3b`. On a 2021 M1 Pro (32GB RAM) I can get either of them as `IQ4_NL` quantized models (the first with reduced context, around 160k; the second can do the whole 264k with RAM left over), running something like 30tokens/s. On a Framework 13 AMD AI HX370 it can use the same, but both on Q8_0 quantization, full context window, parallelism. Speed is just ~15tokens/s so slower, but definitely smarter than the lower quantized siblings. Both of them are good developer partners for an engineer who wants more of a second pair of eyes and a rubber duck, rather than a model to just do everything for them. Pretty good for my brain dumping, some commit reviews, sanity checks, just always assume that every claim has to be checked and re-checked. The only problem is really the context loading, that's pretty slow (starts off around 300token/s on empty context, by the time we get to something like 70-80k which is just a bit of repo discovery, it can run around 80 prompt token/s or less, so there's always a lot more waiting around. Local tools need to bump all of their timeouts, and have to be mindful that there's unlikely to be really meaningful parallelism on these machines with local models. I'm still figuring out how to approach these things, though. Definitely better than glorified autocomplete or search tool (and too slow for the former, pretty decent for the latter). Their limited skill and performance make it more in line with other tools like my IDE or editors, that they are still in the "tools" compartment of my thinking, rather than "independent, cognitively active entities". Which feels like a good thing.

nunodonato55 分钟前

what are you using agentworld for?

doodlesdev14 小时前

I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds? (I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)

JeremyNT13 小时前

I also don't understand why people in this price bracket are buying Mac laptops instead of desktop computers with GPUs? Just to flex that it's portable?

mft_11 小时前

(I'm not one of the people you're speaking of with a 128gb M5 but) if you want to run one of the medium-sized open-weights models (Qwen 27b, 35b, Gemma 4 26b, 31b) or larger, you get into an interesting optimisation space. * yes, you can run it on an older/smaller GPU plus system RAM but performance will suffer * if you want optimal GPU performance you need the model in VRAM plus context, so 24GB (3090, 4090) or 32GB (5090) cards, plus a system that's reasonable powerful to plug them in to. Ideally you'd have a multiple cards working together but for optimal performance this means either 2x 3090 or nvidia's workstation cards. * you can go for a 128gb Strix Halo system, but the memory bandwidth isn't great and they're becoming increasingly more expensive (5.5k EUR for HP laptop, 3.9k EUR for GMKtec EVO-X2 mini PC) * you can go for a 128gb DGX Spark (5k EUR+) which also has unspectacular memory bandwidth or RTX Spark (price unclear but probably not cheaper) * or go for a Mac with a decent CPU and a good amount of RAM (bandwidth varies by model, but typically a bit better than Strix Halo/DGX Spark and worse than bespoke GPUs. As usual with such questions, there are of course cheaper paths (if you want to accept the tradeoffs) but Macs are reasonable vs. competition for these workloads.

pletnes2 小时前

And with a mac, there are no cuda drivers to fiddle with.

jeroenhd13 小时前

A mac with a boatload of RAM can run models that will exceed the limits of any GPU not worth at least twice the Apple hardware itself. You get fewer tokens per second, but at some point the balance between quality and quantity makes the large model size worth the spend. When you're spending this kind of money, you may as well treat yourself to a pretty screen and some decent speakers. Nothing the competition doesn't offer these days, but you get them for free with the car-priced RAM upgrade so why go for less.

ctkhn11 小时前

I don't even travel a ton but portability is huge. It's not a flex, it's a functional thing that lets me move around within my house or work while I'm at my parents or traveling or anywhere else. Other than my media collection that lives on my home server, I want most of my files to come with me on my laptop.

FuckButtons7 小时前

The fact that I can take it with me? That I don’t need internet to still have access to deepseek? The fact that electricity is expensive and an mbp uses ~10% of the power that an equivalent vram set up would using gpu’s. Also, in order to get the same vram I would need to spend a similar amount, but wouldn’t also have a machine that was useful for other workloads that need a huge amount of ram.

indemnity7 小时前

Potentially going to sound privileged here, but why not both? Personally when going on the road I like portability (14" MBP or MBA), but at home I want raw non-thermally throttled power.

LeBit13 小时前

I think it is because desktop computers with GPUs with enough VRAM to run interesting models are insanely expensive, hard to source and consume a lot of electricity and dissipate a lot of heat.

ilogik13 小时前

What GPU can I buy with >100GB of memory?

verdverm13 小时前

DGX Spark is one, but really depends on how much you want to spend

redox9912 小时前

Yeah, it's a much better idea to buy many used 3090s. 4090s or 5090s if you can afford it. Way faster.

aurareturn11 小时前

Probably depends on what you're trying to do. You need an expensive motherboard, cooling, PSU(s) to use multiple high end GPUs together. Then there is the noise and the fact that you can't bring it on an airplane.

bastardoperator12 小时前

I have a bunch of computers and gadgets, why settle on one?

satvikpendem7 小时前

Unified memory.

btbuildem11 小时前

I think it's silly to go for a laptop form factor. Last fall I put together a workstation with two second-hand 3090s in it (paid $850CDN each, now the best I can find is $1200). With 48GB VRAM it's reasonable - and I've been using Qwen 3.6 27B for various tasks around building KGs from text corpora / reasoning about them. I've ran comparisons against everything that's available on OpenRouter (well, as of few weeks ago), and for $0/tok, the local 27B Qwen can't be beat. Sure, it's slower, and yeah, the office is a few degrees warmer than it ought to be -- but nobody can pull the plug, nobody is watching over my shoulder, and the results are on par with SOTA. Can't wait for a similarly sized Qwen 3.7 - from what I've seen so far, it's a leap ahead of the previous version.

Gigachad9 小时前

I think it still makes sense to wait. Hardware is currently hyper expensive and cloud models are subsidized. Waiting 2 years or so once memory prices have dropped and datacenters start wanting a profit would get you a usable setup that's more economical.

whichquestion9 小时前

How much electricity does running your local models take?

alemanek5 小时前

If your workflow benefits from the speed it quickly pays for itself when factoring in developer salaries here in the US. I recently switched companies and they bought me an M5 Max 128GB as my dev machine. Builds and local test runs are 3 times faster than the Windows laptop option. The machine will pay for itself just based on that within 3 months. I can spin up a local kubernetes cluster and do full integration tests while I am working on other things as well. It isn’t a strictly Mac vs Windows thing though. It looks like the culprit is the MDM software on the Windows machines is just crazy slow and constantly getting in the way. If I was paid less it would definitely make less sense for the company to pay for this machine.

v1ne1 小时前

Don't worry. Once IT Security discovers that they miss their trusty endpoint security products on your Mac, they'll add it and you'll be in the same ballpark as the Windows machine. Been there, received that, and learnt that Microsoft Defender exists for macOS, too.

bellowsgulch13 小时前

> Are developers in other countries living in such different worlds? Yes. Your people earn an order of magnitude less income than Americans.

adamors14 小时前

Yes they are, 6k is peanuts to a lot of people.

verdverm12 小时前

It's not always about the price or being the cheapest. For me, it's about freedom, both to play and from the govt/corp censorship.

reilly300011 小时前

It’s an asset on my balance sheet that’s already appreciating nicely and will likely be resale-able for what I paid for it for the next 7-10 years. I am on an Apple monthly installment plan so $5k is $416/month for 1 year, no interest. I’m able to run DS4 scale models and other open models without quantization, often multiple at once. Imagine its value if war broke out over Taiwan / Greater China, or really any of the dark scenarios with global connectivity or the truthiness of commercially available models. It is a very, very difficult piece of equipment to make at any other moment in history. I wish I could have purchased more. I saw the signs and price trends and out of stocks as they unfolded. No doubt others with the means are stockpiling.

simplyluke11 小时前

> will likely be resale-able for what I paid for it for the next 7-10 years There is not a period in the history of computing where this is true of consumer hardware over a decade for anything other than hardware already at the very bottom of its depreciation curve. It is surprising to me that you state that as an obvious assumption. I suppose if your base case is Taiwan war that may be true, but there's a lot of folks who seem to be assuming the current hardware crunch will go on indefinitely when the natural state of hardware is getting cheaper over time.

znpy13 小时前

> Are developers in other countries living in such different worlds? Yes. Back in the my days at $faang in europe it was not uncommon to hear people getting 120-160 k€/year in compensation and we were “poor” compared to us engineers at the same faang (4-500 k$/year total compensation) with a bit of seniority…

doodlesdev12 小时前

That makes a lot of sense! I have no idea how I'd use that much money, so maybe the 128gb MBP for messing around with local LLMs wouldn't sound so absurd :)

mashygpig10 小时前

It's fun to run a model locally, but I don't think the economics make sense for anyone just trying to use models atm. It's absurdly cheap to use the same model via openrouter in comparison. Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. Even further on the model you would be able to run locally. Then think of how many long it would take to match the cost of spend + power on doing it locally...

Saris8 小时前

Even with deepseek v4 flash I burned though $5 in credits in a day just playing around with Hermes, and qwen 3.6 35B is significantly more expensive. I can run qwen 3.6 35B on my gaming PC at around 50 tok/s and other than power cost of a tiny bit extra per month, it's hardware I already owned from years ago. I'm not really sure why qwen 3.6 35B is so expensive on openrouter, it seems abnormally high for what hardware it takes to run it.

alentred1 小时前

There is one side effect of running your LLM locally: you stop thinking about the token budget. I often run `/goal` with no limits, or script an endless loop in bash to run opencode, etc. Sometimes I just brute force the task by throwing a /goal at it. Maybe it's not the most efficient use either, but it's nice to have the option.

Perenti6 小时前

If you're not good at prompting yet, that $10 doesn't go very far. The local model allows me to learn what works and what doesn't without paying for tokens. Then when I know how not to waste them, I'll try a paid model.

an0malous4 小时前

Those are all pre-rugpull prices though. Give it a year.

SchemaLoad8 小时前

Agreed, I'm waiting for the time when 48GB+ ram is just the standard that computers come with rather than being the absolute top tier option. It just doesn't make sense to spend extra on a local AI computer right now when the same money would last for a decade of API pricing.

zx7612 小时前

I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo). I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off. I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably. I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.

bblb4 小时前

I got B70 few days ago. Running on CachyOS. 9070XT on PCIe x16 and B70 on the x4. ROCm nightly was pretty easy to setup and get up running. The 9070XT has been a decent card for my use cases. But the SYCL ecosystem versions. Absolutely horrendous and everything is hundred commits behind. Vulkan is probably the only way forward with this card.

kristianp5 小时前

Interesting that Intels latest consumer GPUs only have 10 and 12GB respectively for the B570 and B580.

cpburns200912 小时前

Before you run and go purchase a unified memory computer (e.g., DGX Spark, Mac, Ryzen AI Max 395 / Strix Halo), be aware dense models generally run slow on these machines. Dedicated GPUs run dense models significantly better. Look for benchmarks for your prospective machine. If you really want one of these, you'll be better off running Qwen 3.6 35B or another sparse MoE model.

beastman8214 小时前

FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well. QAT, MTP, 128k context. I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.

kofu14 小时前

My experience also aligns with this. I'm running gemma4 31B on a 4090 through llm.cpp with unsloth models. I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality in the first try (Rust, C++ and C#). so it needs less revisions to be at a level I'm comfortable for merging.

accrual14 小时前

Nice. I flip flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plan or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.

nozzlegear13 小时前

I can't Gemma4 to actually finish a turn properly, it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or Opencode.

0x000000014 小时前

> ... on my Macbook Max M5 128 GB Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?

kllrnohj14 小时前

You don't need nearly that much RAM to run Qwen 3.6 27B, though. qwen3.6:27b-q4_K_M is only 17GB, for example.

rhdunn14 小时前

A 27B model can fit easily on a 32GB VRAM card (e.g. 5090) or a 32GB computer in RAM at FP8/Q8 (unsloth have 28.6GB Q8 files). For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.

__s14 小时前

I'm on 128GB ram strix halo, bought framework desktop for a few thousand CAD back when everyone was calling framework desktop overpriced

wpm14 小时前

It wasn't $10k a month ago

bahmboo10 小时前

I work with a lot of 3D graphics and geo stuff so I can hit the ceiling with my 48 GB mac. It's not all LLM work. I prioritized more storage than RAM with my budget. Being able to run local llms has greatly helped me understand how they work. For day to day dev I pay for Gemini or Claude.

mr_mitm14 小时前

Think commercial. My company invested in a local rig since privacy is important to our customers and sometimes I want to use these models on private data.

scotty7911 小时前

Qwen3.6 runs great on GPU with 24GB VRAM. You could get used 3090 for it.

spike02114 小时前

Certainly won't work on my M4 Pro with 24GB lol

XCSme10 小时前

Considering the cloud version, all three models compared in the article (Qwen 3.6 35BA3b, 3.6 27B and DeepSeek V4 Flash), have very similar performance[0], BUT on cloud, for some reason DeepSeek V4 Flash is 10-20x cheaper than the Qwen models. If Qwen models are so much easier to run, why are the providers charging more than V4 Flash? [0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol

Gigachad54 分钟前

Also confused by this. Deepseek V4 flash is so much better than Qwen 3.6 yet cheaper to use.

schmuhblaster1 小时前

I've worked extensively with the slightly less able cousin, the 35B A3B model and tuned my own harness around making it work well with local or non-sota models. The results are quite promising [0], if one sticks to a plan-execute approach. After a bit of fiddling with llama.cpp I was able to get it to work through a small change on a real codebase from work on a 32GB M5 (typical python FastAPI backend, so nothing out of the ordinary). While that's somewhat encouraging, the whole local experience was still far from pleasant with all the noise and heat. [0] https://deepclause.substack.com/p/how-to-make-small-models-p...

zbendefy1 小时前

What harness are you using?

ctkhn12 小时前

I have been running qwen 3.6 35b a3b with opencode on my macbook pro 16" with m3 max and 64gb ram, and it's been great for local planning and coding. To be honest I have been on and off wishing I had future proofed with the 128gb after seeing how powerful 64gb is. On the other hand, I also haven't run up against a wall with a model that is just slightly larger than qwen.

Xeoncross11 小时前

What is the speed on responses? (t/s) The full 128GB is surely helpful in keeping browsers, editors and other things running since even 20-35GB models + k/v caches can eat up a lot of the core 64GB in my experience.

LeifCarrotson10 小时前

I've also been running Qwen 3.6 35B A3b on my Windows laptop (64 GB RAM, a 4GB GPU) and it's at least tolerable. It's not fast - a few tokens per second, slower than reading speed - but I can give it a task and come back later. That was a $600 laptop off eBay a few years ago, not a $6,000 machine. Are these unified memory Macs and giant 24GB desktop GPUs achieving dozens or hundreds of tokens per second commensurate with their 10x-20x cost?

grokkedit56 分钟前

I've been using it with a couple of tools (like context7) as a documentation/helper, without giving it direct access to writing code, in marimo. it works great, albeit a little slow on my server (m1 max 64gb ram), at 8bit with omlx

jimmaswell6 小时前

My partner has been trying various models on our server but we haven't gotten anything to run at a usable speed. Q30H engineering sample (Xeon 8570) with two cpus, 56 cores per CPU, 768GB DDR5 RAM running at 5600MHz, two old 3090s in it at the moment with an NVLink and we could put our third in there. We built this server before the prices skyrocketed because we happened across some Tyan boards on Woot that were absurdly cheap for what they are (the motherboards should be $1000+ but we got them for a few hundred). This thing sounds like it should be a monster but we keep running into issues of the old GPU architecture, lack of support for AMX or AMX not being as big of a help as you'd hope when it does work, etc. Apparently we only got 5 tokens per second trying to set up Qwen 3.6 27B, and a similarly bad result trying to run GLM 5.2 which fits in memory but the custom kernels we had to try to contrive were too slow. I feel like this system should have tons of potential, especially if something was designed to let the AMX and huge system memory shine. Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far. We have a similar system on a single-cpu Tyan board with 256GB RAM that I'm hoping we might be able to use in conjunction with the first one if EXO ever gets good Linux support for GPU/RDMA over InfiniBand.

danielrmay2 小时前

Yes, this should be a monster machine. Ampere is an older generation, so I expect that's where some of your issues have been

christina975 小时前

Start with a quant, you can run the Qwen 27B model at 4-bit on one 3090, presumably 6/8-bit on 2x3090.

starefossen12 小时前

We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace

mips_avatar11 小时前

I think the sweet spot right now is 2x 3090s and a pcie 4 motherboard with 64-128 gb of ddr4 ram, you can build this right now for $3k and it runs qwen 27b/35b stupid fast at int4.

tasoeur3 小时前

I know how to build PCs but suck at picking parts, would you happen to have a recommended build or links to people who've done similar ones? Heck I'll click on an affiliate link to support the author of the build :-)

PeterStuer1 小时前

Been running it on a 9950x3D with 96GB and a 4090. Speedwise it is fine. Bit while not completely useless, for software development it is unsurprisingly a dramatic downgrade from the Opus I use as my daily driver.

ljosifov12 小时前

Running 27B dense model on M5 128GB is ok, but one can do better. On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen. 27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.

drnick112 小时前

Works beautifully on a 3090, very usable speed. Don't expect Opus 4.8-level performance, but there are some things you just need to keep local.

brandall1010 小时前

This is discussed in the article: "My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge."

kroaton9 小时前

"DeepSeek-V4-Flash will fit" At Q2, 2bit? Lobotomized to death.

fabijanbajo2 小时前

We need machines designed around wide memory + sustained inference thermals, not gaming/creator chassis we're borrowing. Until then "local dev" means clamshell + external fans.