model roundup
Gemini 3.1
-
Needle is a 26M model for single-shot tool calling. The small-model headline is interesting, but I think the more useful claim is about agent architecture: A lot of tool calling is not reasoning.
-
PACT tests negotiation under partial information: persuasion, commitment, deception, anchoring, threats, and adaptation across repeated rounds. More info, game logs, charts: https://github.com/lechmazur/pact GPT-5.5, Opus 4.7, DeepSeek V4…
-
-
So that's why they call it "YOLO-mode" (news.ycombinator.com)
And why it probably isn't a good idea to use it. Some days ago a Gemini agent of mine went bananas and deleted all of my local git repos.
-
Gemini 3.1 Flash-Lite is now generally available (cloud.google.com via hn)
Gemini 3.1 Flash-Lite is now generally available on Gemini Enterprise Agent Platform Michael Gerstenhaber VP, Product Management, Cloud AI Today, we’re thrilled to announce that Gemini 3.1 Flash-Lite, our fastest and most cost-efficient Ge…
-
Opus 4.6 does better research, Gemini 3.1 has better judgment (www.reddit.com)
Figured this out by running 4 models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20, on a benchmark of 1,417 binary forecasting questions resolving Oct–Dec 2025 with two evaluation conditions: agentic (each model does its own web…