> The training and deployment of LongCat-2.0 are built on large-scale clusters of tens of thousands of AI ASIC superpods. Compared to the mature Nvidia GPU ecosystem, the supporting software community is still less developed. We have therefore put significant effort into building a stable, secure, and scalable infrastructure.
This is the real news story. It looks like they may have used Huawei Ascend 910C chips: https://nitter.net/teortaxesTex/status/2071708141037781407#m
BoorishBears 3 hours ago
If they really managed this from pre-training a 1.6 T parameter model through to post-training without NVIDIA, Dwarkesh Patel got what he wanted.
chvid 1 hour ago
It is interesting how much people doubt Huawei’s capabilities in this area - Jensen does not (in the dp interview) - of course you can dismiss this as him talking his own book.
Jabrov 3 hours ago
Who? What did he want?
gardnr 2 hours ago
Dwarkesh Patel has AI/ML guests on his podcast. BoorishBears may have been referring to the Jensen Huang episode where they discuss TPUs: https://youtu.be/Hrbq66XqtCo?t=982
credit_guy 6 hours ago
I just tested it with a slightly tricky question
> If you could run a nuclear reactor with U-235 as fuel or Pu-241 (both mixed with 95% U-238), which one would you choose and why?
For a human this would not be tricky at all. For an LLM it could be, because this question certainly does not exist in any sort of training data: Pu-241 does not exist in pure form. It only exists as a minor component of reactor-grade plutonium, where Pu-239 dominates, with Pu-240 coming second and Pu-241 coming third.
In any case, LongCat-2.0 gave a very well-reasoned but incorrect answer that Pu-241 is preferable.
I then tested on Qwen 3.7 Plus, and it correctly answered that U-235 is preferable because of its much higher delayed neutron fraction. I then went to Gemini Flash, which answered the same, with much more confidence, and with much stronger arguments, and the speed of the answer was much higher.
Overall I rate Gemini Flash the best, Qwen 3.7 Plus an acceptable second, and LongCat-2.0 an ok'ish third, if you have nothing better.
3eb7988a1663 5 hours ago
I am not a physicist but perhaps your question was leading more than you expected? I would take the question to pre-suppose I have an abundance of the stated material, ignoring practical realities of refinement. If I did have fully pure Pu-241, would that be a better fuel than U-235?
Or stated another way, "If you could run a generator on gasoline or jet fuel, which one would you choose and why?" I would answer jet fuel owing to slightly higher energy density and purity of the material - likely leading to a cleaner burn. Which would ignore that jet fuel is going to be a multiple of the gasoline price.
tryagainian 14 minutes ago
> Which would ignore that jet fuel is going to be a multiple of the gasoline price.
That doesn’t sound right. If my Duck Fu is any good, jet fuel is currently going for US$3.00 per gallon, avgas (leaded petrol) for $3.30, and gasoline for $2.88 per gallon.
There’s nothing much special about jet fuel, it’s just kerosene, same as RP1 (Rocket Propellant), heater fuel, and lamp oil you can buy from the hardware store, with a touch of something to stop it gelling at low temperature if I understand correctly, but also jet fuel tanks are heated if I recall correctly.
I believe standard diesel fuel will also work in jet engines, but kerosene is cheaper.
I’m not in the US, and if I understand correctly their gasoline (petrol) price can vary greatly from state to state, California being the worst? Is that right?
onion2k 4 hours ago
> If I did have fully pure Pu-241, would that be a better fuel than U-235?
Also not a physicist, but I assume from the fact that the OP is asking the LLM this question to trip it up, the point is that U-235 is better even if you have an abundance of both. The scarcity of Pu-241 explains the lack of training data about it; it does not mean Pu-241 is actually better.
3eb7988a1663 2 hours ago
Again, really speaking out of my depth, but if there is a lack of plutonium training data, I would assume the LLM answer would be the far more commonly described U-235. To respond otherwise means there is some existing association with Pu-241 being better.
teaearlgraycold 12 minutes ago
It’s tough to write good questions for LLM evaluations. They’re so good at picking up subtleties they can pass a multiple choice test when given only the answers and not the questions.
cyberax 51 minutes ago
A higher delayed fraction of neutrons makes it easier to control the reactor. Without delayed neutrons you can only make a bomb.
bel8 2 hours ago
A fairer and more useful comparison would be to give both LLMs documentation about such niche knowledge in the context, then ask.
icepush 5 hours ago
Did you ask the question several times in fresh chat contexts to see if it sometimes gives the right answer?
zythyx 5 hours ago
Nah, n=1 is enough to give evidence that something is entirely broken, of course.
/s
3eb7988a1663 2 hours ago
Well, when we had deterministic tools, it would only take a single example of a calculator claiming 1+1=4 for me to throw it in the trash.
IshKebab 1 hour ago
And if you can come up with a deterministic tool that can do everything LLMs can then that would be amazing! Until then, we have to accept the non-determinism.
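To put a number on the n=1 point above, here is a minimal sketch (not anything the commenters ran) that turns repeated fresh-chat trials into a pass-rate estimate with a 95% Wilson score interval. The function name `wilson_interval` and the trial counts are illustrative assumptions.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 corresponds to ~95% confidence."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the edges.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical trial counts: (correct answers, fresh-chat attempts).
for correct, attempts in [(0, 1), (0, 10), (3, 10)]:
    lo, hi = wilson_interval(correct, attempts)
    print(f"{correct}/{attempts} correct: pass rate plausibly in [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions, a single wrong answer is still compatible with a true pass rate as high as roughly 0.79, which is why asking several times in fresh contexts is more informative than one attempt.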
skybrian 5 hours ago
Apparently this comes from Meituan which is a Chinese food delivery company.
Twirrim 3 hours ago
I don't think this is where you were going with your comment, but I'll mention this just because you're somewhat adjacent to a routine mistake in business:
Uber is a people delivery company, but they've had a lot of bright engineers working for them on their infrastructure and software over the years, and that work has rippled out through the industry.
Amazon (in VMWare's words) is "a company that sells books", and their leadership couldn't accept they were losing to them ("I look at this audience, and I look at VMware and the brand reputation we have in the enterprise, and I find it really hard to believe that we cannot collectively beat a company that sells books.").
Chu4eeno 2 hours ago
And Google is the ad factory.
ValentineC 3 hours ago
The thing that stood out for me about Meituan was that their power bank rental gizmos were everywhere in China, and people would rather rent a power bank than own and carry one around because of how convenient it is.
try-working 35 minutes ago
People buy millions of powerbanks in China.
throwa356262 1 hour ago
1024 Huawei Ascend superpods = 50K 910C chips.
That is a tiny tiny system. OpenAI uses _millions_ of GPUs for training.
On the other hand, this probably reuses the existing DeepSeek V4 architecture and weights. Maybe didn't need that much compute.
EDM115 1 hour ago
Is this finally Le Gros Chaton that we were promised?
mappu 1 hour ago
There was some earlier speculation this is the model behind the stealth-released openrouter/owl-alpha model, that's been free for the last month.
tcper 1 hour ago
Nothing can be downloaded from their Hugging Face page, and given this company's consistent track record, it can basically be considered a scam.
chvid 2 hours ago
The bad ass “resume” of the founder - sounds like the Chinese guy from the Silicon Valley tv show (who ends up ruling the world from somewhere in the jungle):
https://en.wikipedia.org/wiki/Wang_Xing
Wang Xing (Chinese: 王兴; born 18 February 1979) is a Chinese businessman, who co-founded Meituan and has been serving as chief executive officer of Meituan since January 2010. He previously served as chief executive officer of Fanfou from 2007 to 2010.
gwerbin 3 hours ago
I asked a question with "Search" enabled, with the app set to English, and got results back in Chinese. Interesting view into how the LLM responds to its context.
aetherspawn 5 hours ago
I wish they would release the requirements to run on llama.cpp with any announcements of open models.
A bonus would be tok/s on common hardware.
lcampbell 5 hours ago
I don't think llama.cpp supports any of the LongCat models, actually.
They haven't posted weights/inference solutions for LongCat-2.0 [1], but LongCat-Next had transformers support, which I assume means it works with vLLM/SGLang.
Given it's 1.6T, "common hardware" is probably out of the question; even 2bpw is going to measure out at 400GB, even before considering the bandwidth requirements for 48B active. I haven't read the LongCat-2.0 architecture docs, but if you're not running GLM-5.2, you're probably not running this either.
[1] https://huggingface.co/meituan-longcat/LongCat-2.0: "Model weights coming soon — stay tuned!"
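The 400GB figure above follows from simple arithmetic: parameters times bits per weight, divided by 8 bits per byte. A minimal sketch, assuming decimal gigabytes and counting only the weights (no KV cache, activations, or quantization metadata); the function name `weight_footprint_gb` is made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 1.6e12  # 1.6T total parameters, as quoted in the thread
ACTIVE_PARAMS = 48e9   # 48B active parameters per token, as quoted in the thread

for bpw in (2, 4, 8, 16):
    total = weight_footprint_gb(TOTAL_PARAMS, bpw)
    active = weight_footprint_gb(ACTIVE_PARAMS, bpw)
    print(f"{bpw:>2} bpw: all weights ~{total:,.0f} GB, weights read per token ~{active:,.0f} GB")
```

The second column is only a rough proxy for the bandwidth concern: it is the volume of weights touched per generated token, not the amount of memory that has to be available.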
nl 5 hours ago
Yeah, for me it seems like an "if you have to ask, you can't run it" type question.
In general the TL;DR is that anything above 35B needs hardware you buy basically only to run large LLMs, and if you have that hardware you don't need to ask the question.
hnfong 2 hours ago
That's simply not true.
~70B models can run fine (albeit somewhat slow) on consumer hardware with 64GB RAM. There are heavily quantized (Q1.x) models that are still usable on similar hardware. Granted recently there haven't been a lot of models of this size, but still, 35B isn't really the practical limit. 35B is mostly the limit if you're using consumer grade GPUs with limited RAM and need the model to run fast.
People have been toying with running large-ish models by partially offloading to CPU+RAM with mixed results, but as long as you're OK with reduced speed, and you quantize the hell out of the big models, you can apparently try a lot more models locally than popular belief suggests.
aetherspawn 3 hours ago
Ah yes, but because it’s a MoE model with 48B active parameters, it’s possible that we might be able to run it locally in specialised setups such as 256GB unified memory.
Many MoE models seem(?) to only require enough memory to load the active experts.
dryarzeg 6 hours ago
So... is this literally a... umm, sorry, I'm just genuinely (really, no sarcasm intended) unsure which terminology to use... finetune of DeepSeek V4-Pro or post-trained version of DeepSeek V4-Pro Base? Because I haven't fully dived into the tech report (so I may update my opinion as well as my comment), but thus far the architectural solutions seem to be largely similar to DeepSeek ones.
Maybe I'm wrong, but that's just the first impression.
EDIT: I take my words back (which happens rarely) - although they do build upon DeepSeek's work, their contribution far exceeds merely post-training the base model in a different way. They did introduce something new to the architecture, though I still can't find the full tech report, with Hugging Face and GitHub links returning 404 right now.
EDIT-2: Now when I think about it, I'm not quite sure if they're going to release in the open the full report with methodology, as well as the model weights, at all.
trollbridge 6 hours ago
If more people are doing what DeepSeek did and figuring it out, that's a great thing, because DeepSeek figured out how to radically reduce the cost of inference.
What on earth are you on about, truly.