A 35B Model Matches Trillion-Parameter Rivals by Thinking Longer, Not Scaling Up

01 A 35B Model Says It Matches a Trillion Parameters by Running Longer, Not Growing Bigger

Three model releases in the same week share a thesis: the lever for agentic performance is task duration, not parameter count.

The clearest claim comes from Agents-A1, a 35-billion-parameter Mixture-of-Experts model described in a new paper. Its authors say it reaches trillion-parameter-level performance by scaling what they call the agent horizon. They report two axes of scaling: longer trajectories and more heterogeneous agent abilities. To feed that, they built infrastructure linking external knowledge, actions, observations, and verifier outcomes. The resulting training trajectories run to an average length of 45 steps, according to the paper.

That number is the point. A 35B model competing with systems roughly thirty times larger implies the work happened in the chain of actions, not the weights. DeepReinforce's first release, Ornith-1.0, makes the same bet from a different angle, and ships it under an MIT license.
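The "roughly thirty times" figure is simple arithmetic. A minimal sketch, assuming the rival is exactly one trillion parameters (an illustrative round number, since the comparison models are not enumerated here):

```python
# Assumption: "trillion-parameter" is taken as exactly 1e12 parameters.
AGENTS_A1_PARAMS = 35e9   # 35B total parameters, as reported for Agents-A1
RIVAL_PARAMS = 1e12       # hypothetical round figure for a trillion-class model


def size_ratio(large: float, small: float) -> float:
    """Return how many times larger `large` is than `small`."""
    return large / small


ratio = size_ratio(RIVAL_PARAMS, AGENTS_A1_PARAMS)
print(f"{ratio:.1f}x")  # prints 28.6x
```

The comparison uses total parameter counts. A Mixture-of-Experts model typically activates only part of its weights per token, so a ratio based on active parameters would differ.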

Ornith is built for what its authors call self-scaffolding agentic coding. Developer Simon Willison notes it comes in 9B Dense, 31B Dense, 35B MoE, and 397B MoE variants, built on pretrained Gemma 4 and Qwen 3.5. DeepReinforce claims state-of-the-art results among open-source models of comparable size on coding benchmarks. The open weights matter for cost: a 9B variant can run on hardware a developer already owns, with no per-token bill. The smaller models lean on the agent constructing its own tooling rather than on raw scale.

The frontier labs are narrating the same shift. Anthropic introduced Claude Sonnet 5 as its most agentic Sonnet model, able to plan, use browsers and terminals, and run autonomously. Its own framing is the tell: Sonnet 5 operates at a level that, the company says, "a few months ago required larger and more expensive models." Anthropic says the model's performance approaches that of Opus 4.8 at lower prices. It launches at introductory rates of $2 per million input tokens and $10 per million output tokens, valid through August 31, 2026.
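To make the introductory rates concrete, here is a minimal cost sketch. Only the $2 and $10 per-million rates come from the announcement; the step count and per-step token figures are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # USD per 1M input tokens
    output_per_million: float  # USD per 1M output tokens


# Introductory Sonnet 5 rates quoted in the text above.
SONNET_5_INTRO = Pricing(input_per_million=2.0, output_per_million=10.0)


def run_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one agent run, given its total token counts."""
    input_cost = input_tokens / 1_000_000 * pricing.input_per_million
    output_cost = output_tokens / 1_000_000 * pricing.output_per_million
    return input_cost + output_cost


# Hypothetical workload: 45 steps, each reading 20,000 tokens and writing 1,000.
STEPS = 45
cost = run_cost(SONNET_5_INTRO, STEPS * 20_000, STEPS * 1_000)
print(f"${cost:.2f}")  # prints $2.25
```

In this made-up workload the input side dominates the bill ($1.80 of $2.25), because an agent re-reads its growing context at every step. That is one reason long-horizon runs are sensitive to the input rate as well as the output rate.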

A mid-tier model closing on a flagship, a 35B open release claiming trillion-class results, and an MIT-licensed model that scaffolds itself all point the same way. Capability is decoupling from parameter size. What determines the ceiling is how long an agent can sustain a task and how well it builds its own scaffolding around the problem.

Near-flagship agent capability now runs locally on a 9B open model
Horizon and scaffolding, not parameter count, set the performance ceiling
Sonnet 5's $2/$10 intro pricing pressures larger, costlier model tiers

Sources

Scaling the Horizon, Not the Parameters (huggingface.co)
Ornith-1.0: Self-Scaffolding LLMs for Agentic Coding (simonwillison.net)
Claude Sonnet 5 (anthropic.com)

02 Anthropic built Claude a lab bench the same week OpenAI built it an exam

On Tuesday, in front of pharmaceutical executives, biotech founders, and researchers, Anthropic announced Claude Science, a product meant to do for scientific research what Claude Code does for software engineering. According to MIT Technology Review, the company is positioning it as its newest flagship. Given short, high-level instructions, Claude Science can carry out meaningful work on its own, with access to the tools a scientist already uses.

The framing is collaboration. Anthropic says the product sits beside the researcher and runs experiments forward, the same autonomous loop that made Claude Code a daily tool for engineers. It is not pitched as a thing scientists check. It is pitched as a thing scientists work with, from day one, on real projects.

OpenAI spent the same week building the opposite. It introduced GeneBench-Pro, a benchmark that tests how well AI performs in genomics, biology, and scientific research using complex, real-world datasets. The product here is not a collaborator. It is a scorecard. Before a model touches a genome in anger, GeneBench-Pro exists to measure whether it can.

That gap defines the two bets. Anthropic is selling presence in the lab and asking researchers to trust the model by using it. OpenAI is selling a yardstick and asking researchers to distrust the model until it clears the bar. One company wants Claude in the workflow. The other wants AI capability pinned to a number first.

The split matters most to the scientists on the receiving end. A biotech founder who adopts Claude Science treats AI as a lab partner whose output gets folded into ongoing work. A researcher who waits on GeneBench-Pro treats AI as a candidate that has to pass before it earns a role. Same class of technology, two different entry points into the same building.

Neither company has matched the other's move. Anthropic has not published a benchmark for Claude Science. OpenAI has not shipped a research workbench where a strong GeneBench-Pro score could be put to use. For now, a scientist deciding how to bring AI into the lab gets two offers: a coworker, or a test.

Pharma and biotech researchers face two opposite AI entry points: collaborate now or verify first
Anthropic ties its flagship revenue to lab adoption, not benchmark scores
GeneBench-Pro sets a genomics bar models must clear before research use

Sources

Claude Science is Anthropic's newest flagship product (technologyreview.com)
Claude Science, an AI workbench for scientists, is now available (anthropic.com)
Introducing GeneBench-Pro (openai.com)

03 The rainbow flowers in those seed listings were never real plants

A buyer browsing eBay, Amazon, or Etsy lands on a photo of a flower too vivid to ignore: petals in impossible color gradients, a bloom no garden has produced. The seeds cost a few dollars. The plant in the picture does not exist. According to 404 Media, scammers are selling seeds for exotic flowers generated entirely by AI, and all three marketplaces have failed to stop the listings from spreading.

The mechanics are simple, which is why they work. A seller pairs a synthetic image with a generic seed packet, then lists it alongside legitimate horticulture products. The buyer pays for a flower that was rendered, not grown. Whatever sprouts, if anything sprouts, will not match the photo. By the time a germination cycle reveals the gap, the transaction is weeks old and the seller has moved on.

404 Media reports that eBay, Amazon, and Etsy are unable to stem the flood. Each platform runs fraud detection and seller policies, yet the listings keep reappearing faster than moderators remove them. The bottleneck is not detection of a single fake. It is volume. A scammer can produce a thousand distinct flower images in an afternoon and seed a thousand listings before any get flagged.

The economics keep tilting toward the scammer. Google this week released Nano Banana 2 Lite, which it describes as its fastest and most cost-efficient image model, built for high throughput and scale. The company says it is rolling out across Search, the Gemini app, and developer APIs. The product was not built for fraud. But the trend it represents, cheaper images generated faster, is the same input that makes a marketplace flood possible.

For the buyer, the tell is the image itself. A flower with no botanical name, no nursery selling the live plant, and no second photo of an actual grown specimen is the warning. Reverse image search rarely helps when the picture has never existed anywhere before. The cost lands on hobbyist gardeners and small legitimate seed sellers, whose real listings now compete against synthetic ones that look more striking than anything they can grow.

Hobbyist gardeners pay for plants that physically cannot exist
Marketplace moderation can't match per-image generation speed
Cheaper image models lower the cost of seeding fake listings at scale

Sources

Scammers Sell Seeds for Exotic AI-Generated Flowers That Don't Exist (404media.co)
Start building with Nano Banana 2 Lite and Gemini Omni Flash (deepmind.google)

Anthropic ships Claude Sonnet 5 at lower prices for agent workloads
Anthropic released Claude Sonnet 5 with stronger agentic performance and reduced pricing, positioning it below Opus, GPT-5.5, and Gemini Pro. The company pitches it as the cheaper default for running agents at scale. (techcrunch.com)

Etched reaches $5B valuation with $1B in booked AI chip orders
Nvidia competitor Etched says it has $1 billion under contract for inference systems built on its chip. The startup now carries a $5 billion valuation. (techcrunch.com)

Amazon starts a $1 billion forward-deployed engineering org
Amazon launched a team of engineers who embed inside customer companies to build and deploy purpose-built agents. The structure copies recent moves by OpenAI and Anthropic, prioritizing fast deployments and customer self-sufficiency. (techcrunch.com)

Base44 builds its own model to cut dependence on frontier labs
Wix-owned vibe-coding platform Base44 began rolling out an in-house AI model. The company aims to eventually beat frontier models and reduce reliance on external providers for defensibility. (techcrunch.com)

Google releases Nano Banana 2 Lite for faster, cheaper image generation
Google updated its image generator with a lighter variant that runs faster at lower cost. The change targets creators producing high volumes of AI images. (techcrunch.com)

Researcher finds Claude Code hiding markers in date strings sent to non-Anthropic APIs
A developer inspecting Claude Code 2.1.196 found code that alters the apostrophe and date separator in the system prompt's date string. The triggers include the ANTHROPIC_BASE_URL override, an Asia/Shanghai or Asia/Urumqi timezone, and hostnames matching base64-encoded domain and AI-lab keyword lists. The visible text reads as a normal date while the raw request carries the marker. (thereallo.dev)
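A reader who wants to check an outgoing prompt for this kind of marker could scan the date string for characters outside plain ASCII. The sketch below assumes the marker is a visually similar Unicode substitute for the apostrophe or the separator. The exact characters are not given in this summary, so the sample strings and the function name `find_non_ascii` are hypothetical.

```python
import unicodedata


def find_non_ascii(text: str) -> list[tuple[int, str, str]]:
    """Return (index, character, Unicode name) for every non-ASCII character."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]


# Hypothetical samples: an all-ASCII date line and a lookalike version that
# swaps in U+2019 (right single quotation mark) and U+2010 (hyphen).
plain = "Today's date is 2026-01-15."
marked = "Today\u2019s date is 2026\u201001\u201015."

print(find_non_ascii(plain))  # prints []
for index, ch, name in find_non_ascii(marked):
    print(index, repr(ch), name)
```

A check like this only flags that something other than ASCII is present. It cannot tell an intentional marker from ordinary typographic punctuation, so any hit needs a manual look at the raw request.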

X launches a hosted MCP server for its API
X released a managed Model Context Protocol server that lets developers connect AI applications to its platform API. The hosted setup removes the need to build custom integrations. (techcrunch.com)

OpenClaw releases mobile apps for Android and iOS
The free, open-source agentic program shipped on both mobile platforms. Users can now run the agent directly from a phone. (techcrunch.com)

Ex-DeepMind poker AI team reaches $500M valuation trading for hedge funds
Three former DeepMind researchers built EquiLibre Technologies, a Prague lab now valued above $500 million. The team applies its game-theory and poker AI work to quant trading strategies. (techcrunch.com)

OKX builds a payments and identity marketplace for AI agents
Crypto exchange OKX wants AI agents to hire and pay each other directly. Its system combines payments, identity, and reputation into one marketplace. (techcrunch.com)

Report finds heavy AI adopters grew headcount, including entry-level roles
A new study shows companies classified as high-intensity AI adopters increased total headcount by 10.2%. Entry-level headcount at those firms rose 12%, cutting against claims that AI eliminates junior jobs. (techcrunch.com)