Previously: https://news.ycombinator.com/item?id=48709744
https://swelljoe.com/post/will-it-mythos/: "Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size. […] It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive."
NitpickLawyer 12 hours ago
> It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive.
How is that a serious phrase in '26? I mean I have no idea if this fine-tune is good, haven't tried it, but testing a (clearly) agentic model without tool access and expecting it to work is crazy, no? What was he even testing?!
nodja 11 hours ago
Last thing you want a model to do is hallucinate a tool call and its outputs...
vikingcat 12 hours ago
Maybe expecting it to recognize its limitations without tools instead of hallucinating. But yeah, not wholly useful. Its performance (and proclivity to hallucinate) with tools is what really matters.
reactordev 11 hours ago
Visual Inspection Before Execution… it’s all vibe…
juliangoldsmith 9 hours ago
That benchmark ranks Kimi K2.6 and K2.7 Code near the bottom. Both are below Ornith 35B. It ranks Gemma 4 26B much higher than GLM-5.2. The results don't make much sense.
unknown 13 hours ago
[deleted]
ricardobayes 12 hours ago
This is the first Qwen fine-tune that is not immediately rejected by the local LLM community, and in some cases is even being recommended. Based on my limited usage, it is good and gives creative solutions to coding problems. I don't expect 9-35B models to one-click create full apps. Most people who were complaining did so .
woadwarrior01 9 hours ago
The local LLM community is now teeming with erstwhile crypto and NFT hucksters who've brought the culture of hype from their former communities with them. There still are a few deeply technical people left, but their voices are being crowded out by the vapid marketers'.
S0y 7 hours ago
I've also noticed this. I wonder what causes the overlap. It can't be as simple as crypto and LLMs requiring the same hardware.
aardvarkr 7 hours ago
They both feed off of hype. The people posting about crypto are not the same people GPU-mining crypto, so I wouldn’t chalk it up to the same hardware.
adastra22 1 hour ago
Money.
TheMagicHorsey 3 hours ago
There are certain influential people on Twitter, who if you see them start tweeting on a subject, you know the influx of hype and hucksters is coming.
monkmartinez 11 hours ago
> Most people who were complaining did so .
It has been this way since the beginning, unfortunately. There is certainly no harm in trying out local models on local workloads with modest guardrails.
Like most of these models (Qwen, Gemma, Llama, gpt-oss), finding all the little gotchas, like special tokens, prompt structure, and model preferences, is a PITA right now. The reward is really nice models that run exceptionally well in agentic harnesses tuned with the prompts and parameters you fought so hard to learn.
v3ss0n 11 hours ago
It's not any better. Most of us in the LocalLLama community don't like it, except a few new people popping up and making posts.
gslepak 10 hours ago
Indeed, it performed worse than Qwen3.6-27b in my basic test.
It gave a fancier looking answer, but did a worse job following the prompt.
dofm 10 hours ago
Roughly my experience so far; it trips up on itself a bit.
However, it's much more inclined to do web search unprompted, which is fascinating in its own way.
NitpickLawyer 10 hours ago
> LocalLLama community
Ah, the place that shit on gpt-oss because it wasn't good at porn. That place is not what it used to be, hasn't been since that karpathy tweet, tbh. It's mostly slop and vibes nowadays.
boredatoms 3 hours ago
Where’s a good place to go instead nowadays?
v3ss0n 10 hours ago
And a lot of bots advertising renamed models like this one.
arcanemachiner 11 hours ago
We must be in different communities... Qwen models are the most recommended ones that will actually run on local hardware that is accessible to the masses!
montroser 11 hours ago
Yeah, but they're talking about fine-tunes.
Narew 3 hours ago
From what I personally tested, Ornith-1.0 35B is slightly better than Qwen-3.6 35B.
My tests are tasks that consist of adding or modifying features in a big C++ codebase.
The part that I find interesting is that the model is way faster than Qwen3.6 35B. It seems Ornith produces a shorter chain of thought.
On my tests it can be 3 times faster to produce the answer.
I use it via llama.cpp and codex-cli.
kennywinker 13 hours ago
Can anyone explain what’s the story here? Is this just a re-skinned Qwen? Who is deepreinforce-ai and why isn’t this model listed on their website?
How does it self-improve? Does the model change on disk, or does it just get better during a single context run?
simonw 13 hours ago
It doesn't self-improve, that's a misleading headline.
As far as I can tell they trained it by running their own reinforcement learning on top of Qwen and Gemma 4 (not sure how they combined weights from both, or if they used Qwen as the basis and Gemma 4 to help train?) - so the "self-improving" is about their training process, not how you use the weights.
kamranjon 13 hours ago
I think the 9B and 31B dense are Gemma models and the 35B-MoE and 397B-MoE are Qwen models, since these are the model sizes covered by each of them respectively.
sisve 11 hours ago
Do you think we will get a self-improving model in '26 or '27? Maybe not a native one, but some kind of hack so a model will learn something without losing part of the context window?
unknown 13 hours ago
[deleted]
kennywinker 13 hours ago
Gotcha. That makes more sense. We ran the model to train the model -> “self-improving”.
v3ss0n 11 hours ago
Clickbait title.
unknown 13 hours ago
[deleted]
S0y 12 hours ago
These are simply benchmaxxed versions of either Qwen or Gemma 4.
2001zhaozhao 11 hours ago
If so, it's impressive they managed to benchmaxx Qwen even further than it's already benchmaxxed.
v3ss0n 11 hours ago
Nah, they just put up graphs with different colors prioritizing themselves.
jorisw 12 hours ago
Citation needed
S0y 10 hours ago
Sure.
https://deep-reinforce.com/ornith_1_0.html
>Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.
>Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts.
giancarlostoro 9 hours ago
> the dense 9B fits on a single 80GB GPU
Us mere mortals cannot use this.
armarr 3 hours ago
There are already quantizations available
agenticup 1 hour ago
Could Ornith's self-scaffolding learn to scaffold the RLM loop?
RandyOrion 3 hours ago
Glad to see more open models. However, where are the 31b models?
unknown 13 hours ago
[deleted]
v3ss0n 11 hours ago
Self-improving bullshit. It is just a benchmaxxed Qwen 3.5 finetune. Nothing spectacular. It even fails at benchmarks.
Long-session tool calls suck, and it hallucinates a lot with those too. Just use Qwen 3.6 and 3.5 122B.
anana_ 12 hours ago
They keep mentioning a 31B dense model, but there are no benchmarks or weights for it anywhere?