anthropic.com

Claude Sonnet 5

marinesebastian · 1.2K points · 697 comments · 18 hours ago

Comments

nijave · 20 minutes ago

> Me: What was the sushi place near latitude 41 in Columbus? Did it go out of business I don't see it on Google maps anymore
>
> Sonnet 5 (medium): None of these past chats mention a sushi place — I don't have anything on record about that. Do you remember the name, or roughly which part of Columbus (neighborhood/street) it was near? That'll help me search and check its current status.

Not impressed. It got the name right on high effort in one shot but hallucinated the relative date (Jan 2026 is not last month...). Worked okay on extra. Sonnet 4.6 worked fine on medium, high, and extra in one shot.

Jcampuzano2 · 17 hours ago

I'm struggling to understand why I'd ever use this instead of just using a lower effort level for Opus, given that on many of the benchmarks listed the cost per task rises above Opus at anything higher than medium effort. The only thing I can think of is when someone is out of Opus credits. Of course there are API billing use cases, but I'd probably still just use Opus on low.

conradkay · 18 hours ago

Wow, seems worse even on price/performance than GLM 5.2, which is only 744B parameters.

From the system card:

> On CyberGym vulnerability discovery, Claude Sonnet 5 is less capable than Sonnet 4.6, and far less capable than Opus 4.8 and Mythos 5. As with the other evaluations in this section, these results were achieved with all safeguards turned off. When run with our default mitigations, Sonnet 5 scored a 0 on CyberGym

microtonal · 18 hours ago

> Claude Sonnet 5 is built to be the most agentic Sonnet model yet. It can make plans, use tools like browsers and terminals, and run autonomously at a level that, just a few months ago, required larger and more expensive models.

I have been using Sonnet 4.6 more than Opus, because I'm mostly doing agent-assisted development and not fully agent-driven development. This announcement does not make me optimistic. I have found that the more models are optimized for fully agentic development, the worse they get at assisted development, and they often start doing too much despite very strict/specific instructions.

I have been moving more and more to K2.7 Code and GLM-5.2 over the last few weeks. They are often good enough for assistance, very fast, and cheap.

mdrzn · 1 hour ago

> Edit June 30, 2026: In the original version of this post, we included a cost-performance chart for the BrowseComp evaluation that was based on data from a simpler methodology that did not reflect the standard methodology we use for agentic search evaluations. This had the result of underestimating Sonnet 5's performance on the evaluation.

They changed the Sonnet 5 'Agentic search' benchmark graph overnight.