model roundup

Gemini 3.1

8 items · started 2026-05-08 · closed 2026-05-16

Show HN: Pokémon SVG Generation LLM Benchmark (svg-bench.fenx.work via hn)

+2 6w gemini

Pokémon SVG Bench (Visual Score / SVG Structure). Columns: Rank, Model, Total, S1, S2, S3. Arrow 1.1 (Official API): 40.93, 39.00, 52.20, 35.20. Gemini 3.1 Pro (Official API, reasoning_effort: medium): 32.63, 55.20, 42.20, 20.20. Gem…
I tested GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro on financial-control (albertquaisie.substack.com via hn)

+1 6w gpt-5 gemini opus

I Tested GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro Preview on Financial-Control Scenarios. The Hardest Part Was the Evaluation.
A 26M tool-router suggests tool calling should be split from reasoning (www.reddit.com)

2 6w gemini

Needle is a 26M model for single-shot tool calling. The small-model headline is interesting, but I think the more useful claim is about agent architecture: A lot of tool calling is not reasoning.
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. (www.reddit.com)

+63 6w gpt-5 deepseek gemini+1

PACT tests negotiation under partial information: persuasion, commitment, deception, anchoring, threats, and adaptation across repeated rounds. More info, game logs, charts: https://github.com/lechmazur/pact GPT-5.5, Opus 4.7, DeepSeek V4…
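
The clearing rule in the title (trade happens when bid ≥ ask, at the midpoint) can be sketched in a few lines. This is only an illustration of that one rule, not PACT's actual code: the names `clear_round` and `play` are hypothetical, the messaging phase is left out, and the sketch makes no assumption about whether a match stops after the first trade.

```python
from typing import List, Optional, Tuple

MAX_ROUNDS = 20  # round count taken from the post title


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price if the round clears, otherwise None.

    Rule from the post: if bid >= ask, the trade clears at the midpoint.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None


def play(offers: List[Tuple[float, float]]) -> List[Optional[float]]:
    """Apply the clearing rule to each (bid, ask) pair, up to MAX_ROUNDS.

    Hypothetical helper: it only reports the per-round outcome and does not
    model messages, payoffs, or early termination.
    """
    return [clear_round(bid, ask) for bid, ask in offers[:MAX_ROUNDS]]


if __name__ == "__main__":
    # Buyer raises the bid while the seller lowers the ask.
    print(play([(40, 70), (50, 60), (58, 56)]))
```

In this toy run the first two rounds do not clear because the bid is below the ask, and the third clears at the midpoint of 58 and 56.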
Show HN: Studis – Turn product photos into social media ads with AI (studis.io via hn)

+2 6w gemini

I built Studis to solve a problem I kept seeing with small business owners — they have great products but spend hours in Canva trying to make decent ads, or pay $50+ per image to a designer. Upload a product photo, and Studis generates a p…
So that's why they call it "YOLO-mode" (news.ycombinator.com)

+44 6w gemini

And why it probably isn't a good idea to use it. Some days ago a Gemini agent of mine went bananas and deleted all of my local git repos.
Gemini 3.1 Flash-Lite is now generally available (cloud.google.com via hn)

+2 6w gemini

Gemini 3.1 Flash-Lite is now generally available on Gemini Enterprise Agent Platform. By Michael Gerstenhaber, VP, Product Management, Cloud AI. Today, we’re thrilled to announce that Gemini 3.1 Flash-Lite, our fastest and most cost-efficient Ge…
Opus 4.6 does better research, Gemini 3.1 has better judgment (www.reddit.com)

+42 7w grok gpt-5 gemini+2

Figured this out by running 4 models (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20) on a benchmark of 1,417 binary forecasting questions resolving Oct–Dec 2025, with two evaluation conditions: agentic (each model does its own web…
