model roundup

Gemini 3.1

6 items · started 2026-05-08 · ongoing (last activity 2026-05-13)

  1. Needle is a 26M model for single-shot tool calling. The small-model headline is interesting, but I think the more useful claim is about agent architecture: a lot of tool calling is not reasoning.

  2. PACT tests negotiation under partial information: persuasion, commitment, deception, anchoring, threats, and adaptation across repeated rounds. More info, game logs, charts: https://github.com/lechmazur/pact · GPT-5.5, Opus 4.7, DeepSeek V4…

  3. And why it probably isn't a good idea to use it. Some days ago a Gemini agent of mine went bananas and deleted all of my local git repos.

  4. Gemini 3.1 Flash-Lite is now generally available on Gemini Enterprise Agent Platform · Michael Gerstenhaber, VP, Product Management, Cloud AI · Today, we’re thrilled to announce that Gemini 3.1 Flash-Lite, our fastest and most cost-efficient Ge…

  5. Figured this out by running 4 models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20, on a benchmark of 1,417 binary forecasting questions resolving Oct–Dec 2025 with two evaluation conditions: agentic (each model does its own web…
