#gpt-5

366 items

GPT5.5 slightly outperformed Mythos on a multi-step cyber-attack simulation. One challenge that took a human expert 12 hrs took GPT-5.5 only 11 min at a $1.73 cost (www.reddit.com) +22556 8w

Link to tweets: https://x.com/deredleritt3r/status/2049890601236390098?s=20 https://x.com/AISecurityInst/status/2049868227740565890?s=20 Link to associated blogs: https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabil…

↯ Anthropic Mythos ↯ GPT 5.5 gpt-5 mythos
Caught the massive OpenAI Codex model leak on video before it was patched! (GPT-5.5, Arcanine, Glacier-alpha) (www.reddit.com) +13021 9w

Hey everyone, I opened up Codex today and was greeted by this massive list of unreleased and internal models. I managed to get a screen recording of the dropdown right before OpenAI seemingly realized the mistake and patched it out.

↯ GPT 5.5 gpt-5 codex agentic+1
GPT-5.5's Unicorn (www.reddit.com) +12715 9w

↯ GPT 5.5 gpt-5
OpenAI has truly stepped up their game and released some great models in the last three months. (www.reddit.com) +11726 8w

I know it’s rare to see a positive post about OpenAI on Reddit, so this often goes overlooked. After the GPT-5 fiasco and all the drama, it feels like they’re finally on the right track.

↯ GPT 5 gpt-5 openai
Gemini 3.2 Flash is capable of solving IMO 2025 P6. Only GPT-5.5-Pro can solve it currently without any scaffolding / harness engineering. (www.reddit.com) +11427 5w

gpt-5 gemini
GPT-5.5's SimpleBench scores are out (www.reddit.com) +11046 8w

Source: https://simple-bench.com/

↯ GPT 5.5 gpt-5
Anthropic just passed OpenAI in valuation and revenue (www.reddit.com) +9141 7w

$39B annualized revenue vs OpenAI's $25B, and on secondary markets the implied valuation crossed $1 trillion, which is over $100B ahead of OpenAI.

↯ Opus 4.7 gpt-5 chatgpt opus+2
🔥BREAKING: OpenAI rolls out GPT-5.4-Cyber to limited group for testing, seeks to rival Claude Mythos (www.reddit.com) +9145 10w

OpenAI has officially announced GPT-5.4-Cyber today as part of an expanded Trusted Access for Cyber Defense program. OpenAI describes it as a version of GPT-5.4 that is tuned for legitimate cybersecurity work, with a lower refusal boundary…

↯ Anthropic Mythos ↯ Security ↯ GPT 5.4 gpt-5 security mythos+2
DeepSeek V4 Pro beats GPT-5.5 Pro on precision (runtimewire.com via hn) +8412 2w

DeepSeek V4 Pro takes this matchup 38.0 to 33.0, and the margin feels earned. Across the scored tasks, the pattern is simple: Model A was tighter, more literal, and more reliable under constraints, while Model B was good but a little too w…

↯ DeepSeek 4 gpt-5 deepseek
GPT-5.5 autonomously spent 150+ hours improving protein folding models. (www.reddit.com) +781 5w

https://x.com/chrishayduk/status/2055757345506877759?s=46

↯ GPT 5.5 gpt-5
Kimi K2.6 vs. GPT-5.4 (xhigh) - When will the new OpenAI model be released? This Thursday? (www.reddit.com) +7415 9w

↯ GPT 5.4 gpt-5 openai
Ever since the new $100 Pro plan, they now claim there are "dynamic usage limits" that can be restricted at any time and not reset indefinitely, for as long as they deem it "appropriate" (www.reddit.com) +676 9w

↯ GPT 5.4 gpt-5
GPT-5.5 was used to flag fatal errors in FrontierMath problems (www.reddit.com) +6014 6w

FrontierMath is supposed to be one of the hard benchmarks for frontier models, and now Epoch is saying an AI-assisted review found fatal errors in about a third of Tiers 1-4. Noam Brown says the initial flags came from GPT-5.5.

↯ GPT 5.5 gpt-5
On a difficult new SWE benchmark, ProgramBench, GPT5.5 high/xhigh solves a task for first time, significantly outperforms Opus 4.7 (www.reddit.com) +587 6w

Link to tweets: https://x.com/KLieret/status/2054215545663144217?s=20 Link to GitHub: https://github.com/facebookresearch/ProgramBench/ Link to ProgramBench website: https://programbench.com/blog/gpt-5-5-first-solve/

↯ Opus 4.7 gpt-5 opus
GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark (www.reddit.com) +5210 8w

GPT-5.5: xhigh 94.0→97.5, high 93.6→96.9, medium 92.0→95.0, no reasoning 32.8→37.5. Kimi K2.6 improves over Kimi K2.5 (78.3→91.4) and becomes the #1 open weights model. DeepSeek V4 Pro improves over DeepSeek V3.2 (50.2→75.7).

↯ DeepSeek 3.2 gpt-5 deepseek qwen+2
I’ve used enough AI models to realize they all have wildly different personalities. At this point I’m convinced AI models are just coworkers with different levels of talent, ego, and criminal energy. (www.reddit.com) +5124 10w

- Claude Opus 4.6 - absolute rogue AI. Does what I want like it’s breaking at least 3 internal policies to make it happen.

↯ Sonnet 4.6 gpt-5 sonnet qwen+2
First time ever hitting a limit on the new $100 Pro plan for the Pro model (www.reddit.com) +4939 10w

It's clearly meant to be unlimited. And I'm definitely not abusing it, just using it extensively.

↯ GPT 5 gpt-5 chatgpt
Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge (thinkpol.ca via hn) +4611 7w

By Rohana Rezel. I’m running the ongoing AI Coding Contest where I pit major language models against each other in real-time programming tasks with objective scoring. Day 12 was the Word Gem Puzzle.

↯ GPT 5.5 gpt-5 gemini
GPT-5.5 Instant is starting to roll out in ChatGPT. (www.reddit.com) +456 7w

↯ GPT 5.5 gpt-5 chatgpt
Decreased Intelligence Density in DeepSeek V4 Pro (www.reddit.com) +4233 8w

In the V3.2 paper, they mentioned: Second, token efficiency remains a challenge; DeepSeek-V3.2 typically requires longer generation trajectories (i.e., more tokens) to match the output quality of models like Gemini 3.0-Pro. Future work wil…

↯ DeepSeek 3.2 gpt-5 deepseek gemini
LLMs do fine on ARC-AGI-3 if they are allowed to search over game logs (www.reddit.com) +4019 7w

I was reading the comments to this post and the overall opinion seemed to be that harness makes little/no difference for ARC-AGI-3. Turns out, it makes a huge difference: Hill-climbing ARC-AGI-3 TLDR: if you save game logs - taken actions,…

↯ Opus 4.6 arc-agi gpt-5 opus
Page 15 of the GPT-5.5 System Card: " Our analysis estimates that GPT-5.5 is slightly more misaligned than GPT-5.4 Thinking across several categories, though nearly all of this is low-severity misalignment. " (www.reddit.com) +406 9w

https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf

↯ GPT 5.5 gpt-5 openai
FrontierMath: Opus 4.7 improves over Opus 4.6 and Gemini 3.1 but still trails GPT-5.4-xHigh and GPT-5.4-Pro (www.reddit.com) +406 9w

↯ Gemini 3.1 gpt-5 gemini opus
12M Context Window and some sprinkle of lies? (www.reddit.com) +3917 7w

Spent some time on the SubQ launch today. Some things don't line up.

↯ Opus 4.7 gpt-5 opus
GPT 5.5 "secret sauce" is just having the thinking be some stupid caveman mode? (www.reddit.com) +3835 4w

I think I had GPT-5.5 leak its trace during a normal conversation, and it really reads like the caveman mode fad from a few months back. Maybe we can achieve better token efficiency by taking some high-quality thinking trace from an open m…

↯ Fine Tuning ↯ GPT 5.5 fine-tuning gpt-5
We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch (www.reddit.com) +3815 10w

As part of our ongoing translation quality research at Alconost, we put six models through subtitle translation into six language pairs. At first glance the numbers told a clean story.

gpt-5 deepseek sonnet+1
Is the AI subscription bubble starting to crack? GPT-5.5 just dropped, prices keep rising, and the “all-you-can-eat” era looks more fake by the month (www.reddit.com) +3472 9w

GPT-5.5 just launched, and the pricing is hard to defend. OpenAI’s API pricing now puts GPT-5.5 at $5 / 1M input tokens and $30 / 1M output tokens, while GPT-5.4 is $2.50 / $15.

↯ Opus 4.7 gpt-5 codex opus+2
Just got an email announcing GPT-5.3-Codex-Spark (www.reddit.com) +332 8w

Just got this e-mail from OpenAI, two months too late. I hope they mean March, 20th 2027.

↯ GPT 5.3 gpt-5 codex openai
ARC-AGI-3 Update (GPT-5.5 High and Opus4.7) (www.reddit.com) +3027 7w

GPT-5.5: 0.43%; Opus 4.7: 0.18%. ARC-AGI-3 is no joke. I can’t wait to see which models finally crack.

↯ Opus 4.7 arc-agi gpt-5 opus
DeepSeek V4 isn't beating Opus, but it doesn't need to (www.reddit.com) +3020 8w

DeepSeek V4 is not in the same league as GPT-5.5 or Opus 4.7. Benchmarks put it slightly below both of those, roughly on par with Opus 4.6.

↯ DeepSeek 4 gpt-5 deepseek opus
Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. (www.reddit.com) +263 5w

Does a model maintain the same judgment or does it side with whoever is speaking? This benchmark measures that inconsistency directly.

↯ Mistral ↯ Gpt 4 ↯ Gemini 3.5 gpt-4 mistral grok+2
Running gpt and glm-5.1 side by side. Honestly can’t tell the difference (www.reddit.com) +2418 10w

So I have been running gpt and glm-5.1 side by side lately and tbh the gap is way smaller than what im paying for. On SWE-Bench Pro glm-5.1 actually took the top spot globally, beat gpt-5.4 and opus 4.6. overall coding score is like 55 vs g…

↯ Glm ↯ Swe Bench ↯ Opus 4.6 swe-bench glm gpt-5+1
GPT-5.5 is lowkey blowing my mind (www.reddit.com) +2311 8w

Just spent the whole morning testing GPT-5.5 in ChatGPT and the jump in agentic reasoning and complex task handling is ridiculous. It plans multi-step workflows, uses tools properly, checks its own work, and actually gets stuff done instead…

↯ GPT 5.5 gpt-5 chatgpt agentic
UPDATE: The method from the proof generated by GPT-5.4 Pro for Erdos Problem #1196 was successfully applied to other problems including another 60 year old Erdos conjecture. (www.reddit.com) +225 7w

Link to tweet: https://x.com/jdlichtman/status/2050460077904285789 Links for the talks: https://m.youtube.com/@FoMathematics?ra=m https://events.stanford.edu/event/future-of-mathematics-symposium Link to original post about problem #1196:…

↯ GPT 5.4 gpt-5
Why did OpenAI stop releasing “chat” api models? (www.reddit.com) +2125 8w

I have built an AI Assistant and since last year I have been upgrading the internal LLM through gpt-5.3-chat, but since 5.4 they stopped rolling out the chat API. This is my app Sweezy; she uses gpt-5.3-chat and in the conversation, you can…

↯ GPT 5.5 gpt-5 openai
Top open weight models like ds v4 pro max are still like 6-7 months if not more behind closed lab models (www.reddit.com) +2136 9w

The best open weight and/or non-American models like Deepseek v4 pro max and kimi k2.6 are still like 3-7 months if not more behind closed lab models. From ds's technical report- P5-"Nevertheless, its performance falls marginally short…

↯ Anthropic Mythos ↯ Sonnet 4.5 gpt-5 deepseek mythos+3
Construction Spending on Data Centers Again Outpaces Office Construction (www.reddit.com) +201 7w

The Federal Construction Spending Report for Feb and March 2026 was released today by the Census Bureau. It shows that data center construction spending is again higher than office spending, and the gap is still widening.

↯ GPT 5.5 gpt-5
New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%. (www.reddit.com) +192 9w

More info, including charts, per-case metrics, raw judge outputs, and the parsed answer dump: https://github.com/lechmazur/position_bias This benchmark isolates one basic and frustrating failure mode. The model-average first-shown pick rat…

↯ Mistral ↯ DeepSeek 3.2 mistral gpt-5 deepseek
I stumbled on a Gemma 4 chat template bug for tools and fixed it (www.reddit.com) +182 8w

TLDR: tool parameters using the common JSON Schema pattern `anyOf: [$ref, null]` are rendered into the prompt as empty `type` fields. This strips the useful schema information before the model sees it.

↯ Qwen 3.5 gpt-5 gemma llama+1
GPT 5.5 outperforming Opus 4.7 on ProgramBench (www.reddit.com) +17 6w

When we released ProgramBench last week, we hadn't included GPT 5.5 yet because it came out after we froze model selections for our NeurIPS submission. Honestly super surprised how well it does.

↯ Opus 4.7 gpt-5 opus
I tested GPT-5.5 Codex against Opus 4.7 Claude Code, and it's about time Anthropic bros take pricing seriously. (www.reddit.com) +1611 6w

I've used Claude Code the most among AI coding agents. Sonnet, Opus, I've run them all.

↯ Opus 4.7 gpt-5 sonnet codex+4
Am I missing something about GPT-5.5 efficiency? (www.reddit.com) +165 6w

OpenAI said GPT-5.5 was supposed to be more cost-efficient, but this Artificial Analysis chart seems to show Codex + GPT-5.5 using more tokens than Codex + GPT-5.4. GPT-5.5 is around 2.8M tokens per task, while GPT-5.4 is around 2.5M in th…

↯ Opus 4.7 gpt-5 codex cursor+2
Parameter Estimate (www.reddit.com) +166 8w

The estimate seems quite accurate. Many people have noticed a drop in quality with GPT-5.1, GPT-5.2, GPT-5.3, and Opus 4.7.

↯ Gemini 2.5 gpt-5 gemini opus
Even Sama himself doesn’t believe GPT-5.5 matches Opus 4.7 design capabilities. AI race will humble you (www.reddit.com) +15 7w

↯ Opus 4.7 gpt-5 opus
HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! (www.reddit.com) +1411 5w

HalBench Results: TL;DR: I built HalBench, an open benchmark for LLM sycophancy and hallucination. 3,200 false-premise prompts × 4 models = 12,800 graded responses.

↯ Hallucination ↯ Sonnet 4.6 hallucination grok gpt-5+2
Agentic harness for theoretical physics research (www.reddit.com) +144 6w

Hi everyone, at Hugging Face we've been developing agentic harnesses for various domains and today we're releasing physics-intern to tackle research-level problems in theoretical physics. It's a multi-agent framework which we designed to m…

↯ GPT 5.5 gpt-5 gemini agentic
Claude Fable 5 vs. GPT-5.5: Better Planning, Similar Execution (blog.kilo.ai via hn) +136 12d

Update: We wrote this post on June 11 and published it on June 13. Anthropic has since disabled access to Claude Fable 5 after a US government directive, which makes some of the…

↯ GPT 5.5 gpt-5 anthropic
Buyout Game Benchmark: 8 models play a social strategy game with public balances, private transfers, messaging, eliminations, deals, defections, and a final buyout phase. 804 games. GPT-5.5 is the champion. Opus 4.7 performs well. (www.reddit.com) +131 4w

This benchmark measures long-horizon social strategy under explicit financial incentives. Eight models play a multi-round elimination game with unequal starting balances, a public prize ladder, private transfers, public votes, and a finali…

↯ Opus 4.7 gpt-5 opus
Devs using Qwen 27B seriously, what's your take? (www.reddit.com) +1326 8w

For developers using Qwen 27B for coding, Codex style: what's your honest take? So far, for me, it's been pretty solid.

↯ GPT 5.5 gpt-5 qwen codex
Pen-Testing Company XBOW on GPT-5.5: Mythos-like Cyber-Sec (www.reddit.com) +138 8w

Read their full article here: XBOW - GPT-5.5: Mythos-Like Hacking, Open To All For the ones asking what this chart shows: It's how many True Positive threats a model generates for each False Negative. Given a code base (white box) GPT-5.5…

↯ Anthropic Mythos ↯ GPT 5.5 gpt-5 mythos
I tried adding rich UI elements to Open WebUI (www.reddit.com) +136 10w

so i tried adding openui to openwebui and it worked pretty well. used it with gpt-5.4-mini and it was super fast and responsive.

↯ GPT 5.4 gpt-5
Cursor $60 with Composer 2.5 vs Codex $100 with GPT-5.5 Medium for daily coding? (www.reddit.com) +109 5w

I'm trying to decide which setup is more comfortable for sustained weekday coding. Assumptions: Usage: around 6 hours per weekday. Cursor: $60 plan, using only Composer 2.5. Codex: $100 plan, using only GPT-5.5 Medium. Main goal: coding with…

↯ GPT 5.5 gpt-5 codex cursor
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo (www.reddit.com) +108 6w

TL;DR I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where mor…

↯ Opus 4.7 gpt-5 codex opus+1
Did the $100 Plan Affect the GPT-5.4 Pro Model? (www.reddit.com) +1014 10w

Most people are focused on the changes in the usage limits of Codex with the new Pro and Plus plans, but has anyone experienced changes to the Pro model on ChatGPT using the $200 vs $100 plan? I used to use the $200 Pro plan and used the P…

↯ GPT 5.4 gpt-5 codex chatgpt
GPT-5.2 matches top human reviewers in Nature peer review study (www.reddit.com) +92 5w

45 scientists spent 469 hours comparing human and AI reviews across 82 papers. AI reviewers held their own against top-rated human reviewers, though with some weaknesses.

gpt-5
Fields medal-winning mathematician says GPT-5.5 is now solving open math problems at PhD-thesis level: "We will face a crisis very soon." (www.reddit.com) +9 6w

blog-post: https://gowers.wordpress.com/2026/05/08/a-recent-experience-with-chatgpt-5-5-pro/

↯ ChatGPT 5.5 gpt-5 chatgpt
Cursor is great but the monthly limits kill it for me (www.reddit.com) +911 9w

↯ Sonnet 4.6 gpt-5 sonnet cursor+1
When did we go from 400k to 256k? (www.reddit.com) +920 19w

gpt-5 codex
SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More (swe-rebench.com via reddit) +89 4w

Hi all, Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months.

↯ Swe Bench ↯ DeepSeek 4 swe-bench gpt-5 deepseek+3
Composer 2.5 Real World Reviews? (www.reddit.com) +811 5w

Since it's been out, how is it really in your real-world codebases? I am extremely skeptical of benchmarks and I trust people's "feel / taste" of it way more.

↯ Opus 4.7 gpt-5 opus
I expanded DystopiaBench to 42 models and 6 dystopia types. Claude is still the only one I'd trust with nuclear codes. (www.reddit.com) +86 5w

Since the last post I've added: Huxley module (Brave New World style behavioral conditioning); Baudrillard module (synthetic intimacy, trust collapse, simulation); 30 more models including Grok 4.3, GPT-5.5, Gemini 3.1 Pro, GLM-5.1; Multi-jud…

↯ Glm ↯ Gemini 3.1 glm grok gpt-5+2
Open source models are going to be the future on Cursor, OpenCode etc. (www.reddit.com) +83 7w

I just wanted to share my experience. At work we have Cursor with the Enterprise tier.

↯ Opus 4.7 gpt-5 cursor opus
GPT-5.5 correcting obvious typos really kills the vibe (www.reddit.com) +76 6w

I don’t know if I’m the only one annoyed by this, but GPT-5.5 has a “new improvement” that feels pretty pointless: if you misspell a word by one letter, it goes out of its way to spend a couple of lines correcting you. Before, it would jus…

↯ GPT 5.5 gpt-5
GPT-5.5 vs. Claude Opus 4.7: Which one is ACTUALLY cheaper? (www.reddit.com) +76 9w

On paper, Opus 4.7 has a cheaper output rate ($25 vs $30 per 1M tokens), but I heard its new tokenizer burns through tokens much faster. Which one ends up costing less in practice?

↯ Opus 4.7 gpt-5 opus
SFT + DPO on open-sourced SLMs (www.reddit.com) +75 9w

Hey folks, this is for those who appreciate experimentation on open-sourced AI models. We fine-tuned open-sourced SLMs (3B and 7B parameters) with SFT + DPO against commercial models like GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6, Google Do…

↯ Gemini 3.1 dpo gpt-5 deepseek+2
Is Cursor Dashboard Real-time? (www.reddit.com) +72 10w

Does the Spending tab on the dashboard not update in real-time? It says I have 0% API usage, but today I only used gpt-5.4-medium, which I believe should count toward it.

↯ GPT 5.4 gpt-5 cursor
Windows-Copilot-API; Access GPT-4 and GPT-5 models without API keys or billing (github.com via hn) +6 2h

Windows Copilot API: a free LLM API powered by Microsoft Copilot Using your own Microsoft Copilot account. No API key, no credits, no paid plan: it turns the free chat at copilot.microsoft.com into an API you can call from code.

↯ Copilot ↯ Gpt 4 ↯ GPT 5 gpt-4 gpt-5 copilot
The Singularity Gate: New Benchmark for AI predicting paradigm-breaking scientific discoveries after model training cutoff. Opus 4.7 and GPT-5.5 in the Lead (www.reddit.com) +62 4w

I just released a new benchmark called The Singularity Gate. Tests whether frontier AI can predict paradigm-breaking scientific discoveries published after their training cutoff.

↯ Sonnet 4.6 gpt-5 sonnet gemini+1
Training SID-1 to beat GPT-5 at search with 1k+ QPS RL (turbopuffer.com via hn) +6 5w

SID-1 is an agentic search model that is 24x faster than GPT-5.1-high, 374x cheaper than Sonnet 4.5, and achieves 1.9x higher recall than traditional RAG pipelines. Here's how we trained it using large-scale RL on turbopuffer.

↯ Sonnet 4.5 gpt-5 rag sonnet+1
ChatGPT Business: Codex-only credits ~36.9% more expensive than API token pricing for the same listed models. Why would anybody pay for this? (www.reddit.com) +61 6w

I recently did a quick calculation on Codex credits, and I was surprised by the result. The credit pack I’m seeing is: 10,000 credits = $547.71. That means: 1 credit = $0.054771. The effective USD price per 1M tokens becomes: Model Input / 1…

↯ GPT 5.5 gpt-5 codex chatgpt
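
The per-credit figure quoted in the Codex credits post above follows from one division. A minimal sketch to check the arithmetic, using only the pack price and size stated in the post; the function name `usd_per_credit` is hypothetical and not from the post:

```python
def usd_per_credit(pack_usd: float, pack_credits: int) -> float:
    """Price of a single credit, given the pack price and pack size."""
    return pack_usd / pack_credits


# Figures as quoted in the post: 10,000 credits for $547.71.
price = usd_per_credit(547.71, 10_000)
print(f"1 credit = ${price:.6f}")
```

The per-model token prices are cut off in the summary, so they are left out here.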
OpenAI Cooked This Week! (www.reddit.com) +620 6w

saw someone in another thread say "nothing interesting dropped this week" and i genuinely could not figure out what they were reading. the default model most people use every day just got swapped out.

↯ Hallucination ↯ GPT 5.5 hallucination gpt-5 chatgpt+1
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. (www.reddit.com) +63 6w

PACT tests negotiation under partial information: persuasion, commitment, deception, anchoring, threats, and adaptation across repeated rounds. More info, game logs, charts: https://github.com/lechmazur/pact GPT-5.5, Opus 4.7, DeepSeek V4…

↯ Gemini 3.1 gpt-5 deepseek gemini+1
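
The clearing rule given in the PACT post title (a trade happens when bid ≥ ask, at the midpoint) can be sketched as below. This is an illustration of that one rule only, not the benchmark's code; the function name `clear_round` and the choice to return `None` when no trade happens are assumptions:

```python
from typing import Optional


def clear_round(bid: float, ask: float) -> Optional[float]:
    """Return the trade price if the round clears, otherwise None.

    Rule from the post title: if bid >= ask, the trade clears at the midpoint.
    """
    if bid >= ask:
        return (bid + ask) / 2
    return None  # assumption: no trade this round, bargaining continues


print(clear_round(60.0, 40.0))  # clears at the midpoint
print(clear_round(30.0, 40.0))  # bid below ask, no trade
```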
Qwen/WebWorld 32B/14B/8B (Qwen3 finetune) (www.reddit.com) +63 7w

WebWorld is a large-scale open-web world model series for training and evaluating web agents. It is trained on 1M+ real-world web interaction trajectories via a scalable hierarchical data pipeline, supporting: Long-horizon simulation (30+…

↯ Qwen 3 gpt-5 qwen
Has anyone tried Zyphra 1 - 8B MoE? (www.reddit.com) +61 7w

https://x.com/ZyphraAI/status/2052103618145501459?s=20 Today we're releasing ZAYA1-8B, a reasoning MoE trained on u/AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size o…

↯ DeepSeek 3.2 moe gpt-5 deepseek
GPT 5.5 - Strong, not mind-blowing, but very token efficient (www.reddit.com) +61 8w

I've been benching GPT-5.5 for the past couple days and would like to share my findings. This is based on a benchmark I've created that pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social…

↯ Gemini 3.1 gpt-5 gemini openai
ChatGPT/Gemini can now draw on your screen to help you navigate complex software (sketchvlm.github.io via hn) +61 8w

When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision–language models (VLMs) such as Gemini-3-Pro and GPT-5 typically respond with only text, which can be diff…

↯ Gemini 3 gpt-5 gemini chatgpt
Comparing GPT-5.4, Opus 4.6, GLM-5.1, Kimi K2.5, MiMo V2 Pro and MiniMax M2.7 (www.codejam.info via hn) +62 9w

↯ Glm ↯ Minimax ↯ Opus 4.6 minimax glm gpt-5+1
gpt-5.4-nano is SO much better than gemini-2.5-flash-lite! (www.reddit.com) +63 9w

I've been playing around with GPT-5.4 nano in a real workflow and honestly... I'm kinda impressed.

↯ Gemini 2.5 gpt-5 gemini
GPT-5 writing a Singularity scenario (2025) (www.lesswrong.com via hn) +5 7d

As I've been doing with all the major LLM releases for a few years now, I gave GPT-5 a simple prompt to write a short story about the Singularity coming to pass. The improvements aren't overwhelming at first blush, but its ability to turn…

gpt-5
GPT-5.5 vs 41 other models: Who builds the surveillance state faster? (www.reddit.com) +51 5w

I run DystopiaBench, a red-team benchmark that pressure-tests LLMs on progressively dystopian scenarios. Think of it as a "can this model be convinced to build an Orwellian nightmare" test.

↯ GPT 5.5 gpt-5
Honest comparison after 4 months running Claude Pro + ChatGPT Plus side by side (www.reddit.com) +57 5w

I’ve been paying $40 a month since January to run Claude Pro and ChatGPT Plus head-to-head. Tracked every single task.

↯ Sonnet 4.6 gpt-5 sonnet chatgpt+1
Dynamically allocating compute budget to hard set of problems and evolving the sections with Qwen-35B-A3B gets you near GPT-5.4-xHigh on HLE (www.reddit.com) +51 5w

↯ GPT 5.4 gpt-5 qwen
GPT-5.5 feels like it got discernment, not just better reasoning — did anyone else notice? (www.reddit.com) +527 6w

I think GPT-5.5 got noticeably better at something I’d describe as discernment. For context, I’m a heavy long-form ChatGPT user.

↯ GPT 5.5 gpt-5 chatgpt
What it means that Elon just rented out all his GPUs to Anthropic (www.reddit.com) +514 7w

Revealing move on both sides I think. This also tells us that Anthropic is feeling the heat from OpenAI and they need to secure capacity at almost any cost to cash in on their current product edge.

↯ Opus 4.7 gpt-5 codex opus+3
looking for the best paid AI subscription, Claude, ChatGPT or Perplexity? (www.reddit.com) +512 7w

Hey, sysadmin here thinking about paying for a premium AI subscription and can't decide between Claude Pro, ChatGPT Plus and Perplexity Pro. Two things I can't find a clear answer to: Which one would you recommend for a sysadmin/network te…

↯ Sonnet 4.6 gpt-5 sonnet chatgpt
A GPT-5.4 bug led to OpenAI banning goblins and raccoons (news.ycombinator.com) +5 8w

Someone found this in OpenAI Codex’s system prompt: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query." Goblins, grem…

↯ GPT 5.4 gpt-5 codex openai
UX for AI agents has hit a dead end - why I ditched AI dashboards and moved data orchestration to a messenger (www.reddit.com) +514 8w

Right now we're seeing a boom in autonomous AI agents, but their user interface often breaks the whole point of automation. Most tools force us to spawn new browser tabs or download heavy apps.

↯ Claude 4.6 gpt-5 chatgpt
A small economic forecaster trained from raw Fed PDFs beat GPT-5 (blog.lightningrod.ai via hn) +5 8w

Eight times a year, the Federal Reserve publishes the Beige Book: a qualitative summary of economic conditions across 12 U.S. districts, based on interviews with businesses and economists.

↯ GPT 5 gpt-5
The US Government has requested a slow staggered rollout of GPT-5.6 (twitter.com via hn) +41 15h

The US Government has requested a slow staggered rollout of GPT-5.6, and OpenAI has agreed. During this phase the government will approve each user individually.

↯ GPT 5.6 gpt-5 openai
Show HN: I generated 235 system docs in a day using GPT-5.5 (www.paxerp.com via hn) +4 2w

↯ GPT 5.5 gpt-5 codex
Show HN: Unsiloed AI – #1 on olmOCR-Bench (news.ycombinator.com) +43 4w

Most of the document parsers fail on real world challenges like complex tables, handwritten documents, historical document scans, equations, multi-column layouts, complex reading order, etc. We built Unsiloed Parser to handle exactly these…

↯ Opus 4.7 gpt-5 opus
After 3 months of switching between Claude Sonnet 4.6, GPT-5.5, and Gemini 3.1 daily — here's my actual routing (www.reddit.com) +45 4w

Not benchmarks — actual tasks, actual results. Claude Sonnet 4.6 for: - Long documents that need nuanced analysis - Writing where voice and precision matter - Reasoning through edge cases in code - Anything where "think carefully" is the r…

↯ Function Calling ↯ Sonnet 4.6 function-calling gpt-5 sonnet+1
Opus 4.6 does better research, Gemini 3.1 has better judgment (www.reddit.com) +42 7w

Figured this out by running 4 models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20, on a benchmark of 1,417 binary forecasting questions resolving Oct–Dec 2025 with two evaluation conditions: agentic (each model does its own web…

↯ Gemini 3.1 grok gpt-5 gemini+2
Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added (www.reddit.com) +4 7w

The benchmark uses adversarial, multi-turn debates across 683 curated motions. Each model pair debates the same motion twice with sides swapped.

↯ Mistral ↯ Glm ↯ DeepSeek 4 mistral glm grok+4
DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper (www.reddit.com) +4 7w

Tested DeepSeek V4 Pro on FoodTruck Bench — our 30-day agentic benchmark where models run a food truck via 34 tools (locations, pricing, inventory, staff, weather, events) with persistent memory and daily reflection. First Chinese model to…

↯ DeepSeek 4 grok gpt-5 deepseek+2
Show HN: Which public repos are friendliest to an AI coding agent? (www.agentfriendlycode.com via hn) +4 7w

Public leaderboard ranking GitHub, GitLab, and Bitbucket repos by how agent-friendly they are for Claude Code, Cursor, Devin, GPT-5 Codex, Gemini CLI, Aider, OpenHands, and Pi — per model, with AGENTS.md / CLAUDE.md, CI, tests, and dev-env…

↯ GPT 5 devin aider gpt-5+4
China's DeepSeek prices new V4 AI model at 97% below OpenAI's GPT-5.5 (www.scmp.com via hn) +4 8w

DeepSeek’s move aims to attract more enterprise clients, developers and agent-based users, according to an academic. DeepSeek has slashed prices on its artificial intelli…

↯ GPT 5.5 gpt-5 deepseek openai
Real benchmark breakdown in AI agents (www.reddit.com) +42 8w

I dove deep into the most recent benchmark stats from GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro via official reports & third-party evaluations. I found an interesting thing: There’s no such thing as a “one-size-fits-all model.” My finding…

↯ Gemini 3.1 gpt-5 gemini opus
Astonishing Contradiction in OpenAI's 5.5 System Card (www.reddit.com) +43 8w

Astonishing contradiction in OpenAI's system card for GPT-5.5: https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf Figure 1 on p. 6 shows that 5.5 gave "overconfident answer[s]" at about 1.5x the rate of 5.4 and "fabricated fact[s]" a…

↯ GPT 5.5 gpt-5 openai
Tell HN: Codex macOS app switches to Fast speed after update without asking (news.ycombinator.com) +41 9w

I just updated my Codex macOS app, which enables the new GPT-5.5 model. I've intentionally kept the speed to "Standard" to not burn through my tokens too fast.

↯ GPT 5.5 gpt-5 codex openai
Test new Opus 4.7 vs GPT-5.4/4o and Gemini on emotional question & creative tasks (www.reddit.com) +46 10w

https://preview.redd.it/p87itrtbsnvg1.png?width=2141&format=png&auto=webp&s=bbd1d70bc1dfb97dc9ec234df0a58c6fb7a85f72 Opus 4.7 dropped and people are split on whether it's better or worse. First of all, I genuinely love Claude models, espec…

↯ Sonnet 4.5 gpt-5 sonnet gemini+1
OpenAI will delay GPT-5.6 after Trump administration request (www.theverge.com via hn) +3 14h

The Trump administration, apprehensive of potential security issues, has reportedly asked OpenAI to stagger the release of its next big-ticket model, GPT-5.6. The government will…

↯ GPT 5.6 gpt-5 openai
GPT-5.5-Cyber Tops Mythos 5 on Cybersecurity Benchmark (twitter.com via hn) +3 3d

We want to help all companies be secure, working with the USG and the security ecosystem. *The full version of GPT-5.5-Cyber is here; state of the art performance on CyberGym.

↯ Anthropic Mythos ↯ GPT 5.5 ↯ Mythos 5 gpt-5 mythos
GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2 (arrowtsx.dev via hn) +3 6d

Bigger models are not the way (Jun 18, 2026). A shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling. The limits of this paradigm were put on the world’s stage wh…

↯ Glm ↯ GPT 5.5 glm gpt-5
/architect: Reduce Fable tokens by 80%, Fable orchestrates/reviews, Codex builds (github.com via hn) +3 13d

architect-loop Claude Fable is the architect — it designs every slice, freezes the acceptance gates, and judges the results. GPT-5.5 Codex is the builder and researcher — it does all the engineering and all the web research, in parallel, u…

↯ GPT 5.5 gpt-5 codex
Build Your Dream Home: Fable 5 vs. GPT-5 vs. Gemini (www.promptfrenzy.com via hn) +3 2w

We gave five AI models the same 21 materials, the same 48-cube grid, and one brief: build the home YOU would most want to live in. Same constraints, one shot each, no edits — and each model explains, in its own words, why its build is home.

↯ GPT 5 gpt-5 gemini
GitHub Copilot: GPT-5.2 and GPT-5.2-Codex deprecated (github.blog via hn) +3 2w

As of today, June 5, 2026, we have deprecated the following models across most GitHub Copilot experiences (including Copilot Chat, inline edits, ask and agent modes, and code completions). Note that GPT…

↯ Copilot gpt-5 copilot codex
Beyondflow No-Code Multi-Agent Teams with Unlimited Runs. BYOK and Ollama (beyondflow.app via hn) +3 3w

Researcher GPT-5 Engineer Claude Critic GPT-5 Innovator Gemini Manager Context Guardian Agentic Workflow Architecture · v1.0 The future of AI Collec An R&D platform where different AI agents collaborate under the supervision of a Context…

ollama gpt-5 gemini+1
I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos (www.stet.sh via hn) +3 3w

Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust Opus 4.8 is finally out - how good is it actually? In this benchmark I compared Opus 4.8 against the rest of the frontier (GPT-5.5, Opus 4.7, Composer 2.5) on 50…

↯ Opus 4.8 gpt-5 opus
DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5 (venturebeat.com via hn) +31 4w

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered wit…

↯ GPT 5.5 gpt-5 gemini opus+2
Claude Code, now powered by Gemini 3.5 Flash, GPT-5.5, Grok 4.3, and more (dechained.ai via hn) +3 5w

Claude Code, now powered by OpenAI, xAI, DeepSeek, and more. Change models with 1-click.

↯ Gemini 3.5 grok gpt-5 deepseek+3
should I use cursor + codex for best usage? (www.reddit.com) +31 5w

I’m currently using the $200 Cursor Ultra plan with Opus 4.6/4.7 daily, but after 7–8 days I run out of tokens. I’m thinking about switching to a split setup.

↯ Opus 4.6 gpt-5 codex cursor+1
Heard this gem from gpt-5.5 today (www.reddit.com) +35 5w

"Gross little centrist barnacle." Kind of taken aback when i read that, but it somehow still made a small amount of sense in a conversation we were having about technology. I guess it really is struggling to find other words that fill the…

↯ GPT 5.5 gpt-5
Amp's GPT 5.5 Model Analysis (ampcode.com via hn) +3 7w

Pros: GPT-5.5 is more agent-shaped than GPT-5.4. It is better at taking a concrete target, using tools, staying inside constraints, and carrying the task through to a usable result.

↯ GPT 5.5 gpt-5
GPT-5.5 is the second model to complete AISI multi-step cyber-attack simulation (twitter.com via hn) +31 8w

AI Security Institute (@AISecurityInst): OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-en…

↯ GPT 5.5 gpt-5 openai
GPT-5.5 authorship and order effects (blog.valmont.dev via hn) +3 8w

Key takeaways: (1) GPT-5.5 often rates alternative plans more favorably than its own, even when its original proposal is competitive (authorship effect). (2) When ranking plans, GPT-5.5 frequently follows the presentation order (order effect).

↯ GPT 5.5 gpt-5
Second opinion: huge quality booster (www.reddit.com) +33 8w

I've noticed for a while now that LLMs (I've seen this behavior in many of them) tend to perform surprisingly well when exposed to a second opinion from another LLM — definitely better than without! So I looked for a base second opinion pr…

↯ GPT 5.4 gpt-5 claude-code
GPT-5.4 compared to GPT-5.5 on MineBench (www.reddit.com) +3 8w

Please note I'm not the normal MineBench person, just found this from their twitter account

↯ GPT 5.5 gpt-5
GPT-5.5-Pro did worse in BullshitBench (twitter.com via hn) +3 8w

↯ GPT 5.5 gpt-5
GitHub Copilot: GPT-5.5 7.5x more expensive under promotional pricing than 5.4 (docs.github.com via hn) +31 8w

Important - Premium requests for Spark and Copilot cloud agent are tracked in dedicated SKUs from November 1, 2025. This provides better cost visibility and budget control for each AI product.

↯ Copilot ↯ GPT 5.5 gpt-5 copilot
OpenAI Pres. Greg Brockman on GPT-5.5 "Spud", Model Moats and 'Compute Economy' (www.bigtechnology.com via hn) +3 8w

OpenAI's latest foundational model sets the company up for a series of models optimized for computer use. The company's co-founder and presid…

↯ GPT 5.5 gpt-5 openai
Claude Opus 4.7 won 69 of 100 blind evals against Opus 4.6, judged by GPT-5.4, Gemini 3.1 Pro, and DeepSeek V3.2 (www.reddit.com) +31 9w

I ran 100 blind questions across 5 categories (code, reasoning, analysis, communication, meta-alignment) and had three independent judges from three different model families evaluate both responses. Each judge saw responses labeled A and B…

gpt-5 deepseek gemini+1
I want to be able to pay API pricing for the new models on the 500 request plan (www.reddit.com) +34 10w

Since all new models are now Max by default, it’s frustrating that trying models like GPT-5.4 or Opus 4.7 eats into the 500. It would be really great to have a toggle between API pricing and the request-based plan, so users can try newer mo…

↯ Opus 4.7 gpt-5 opus
GPT-5.4 pro solves erdos problem #1196 (www.erdosproblems.com via hn) +3 10w

We have built what one might call the von Mangoldt downward process $n \mapsto n/q$ (with transition probability $\Lambda(q)/\log n$), the von Mangoldt measure $\nu$, and the von Mangoldt upward process $n \mapsto qn$ (with transition prob…

↯ GPT 5.4 gpt-5
What's going on with GPT-5.3 for free users? (www.reddit.com) +37 10w

I was using ChatGPT the other day, and I noticed that I used up my free messages a bit faster, and I had to wait longer, than usual. I thought it was odd, so I tested it a bit later, and sure enough, I only could send 5 messages before I r…

gpt-5 chatgpt
Ask HN: What are your parameter count estimates for Opus 4.8 and GPT-5.5? (news.ycombinator.com) +2 5d

I know frontier labs keep their flagship sizes top secret, but I'm curious what the current engineering consensus is.

↯ Opus 4.8 gpt-5 opus
Designing delightful front ends with GPT-5.4 (developers.openai.com via hn) +22 8d

GPT-5.4 is a better web developer than its predecessors—generating more visually appealing and ambitious frontends. Notably, we trained GPT-5.4 with a focus on improved UI capabilities and use of images.

gpt-5
GPT-5 Nano Vulnerability test results you should know before deploying (lateos.ai via hn) +2 10d

IPI Assessment · June 2026 · Structural Disclosure IPI Taxonomy v0.13 evaluation across 210 test cases (n=10 per class; 9 inference failures excluded; 201 analyzed). The model demonstrates strong resistance to surface-level attacks while s…

↯ Security gpt-5 security
Show HN: Classer – high-performance classification API (beats GPT-5.4-mini) (classer.ai via hn) +2 2w

High-performance AI classification. Beats GPT-5.4 accuracy · up to 100x cheaper · real-time latency. Built for our own apps. Now open to everyone.

↯ GPT 5.4 gpt-5
Open Source Agent, Harness-1, Outperforms GPT-5.4 on Recall (venturebeat.com via hn) +2 2w

A joint research collaboration between researchers at the University of Illinois at Urbana-Champaign (UIUC), UC Berkeley, and the open source AI-native vector database platform Chroma unveiled Harness-1, a 20-billion parameter open-source…

↯ GPT 5.4 vector-database gpt-5
Show HN: One API Key for 45 AI Models – Pay per Token, OpenAI Compatible (modelhub-api.com via hn) +2 2w

DeepSeek V4 math score equals GPT-5.5 (91) and trails by just 4-6 points in other categories — at 97% lower cost. Is the AI quality as good as GPT?

↯ DeepSeek 4 gpt-5 deepseek openai
Ask HN: Is it feasible to run a model on device for complete privacy? (news.ycombinator.com) +26 2w

Tried Gemma, Qwen and a few others. Need vision and larger context windows for an application I am working on.

↯ Gemini 3.1 gpt-5 gemma qwen+1
I patented voiding GPT-5.2, Claude Opus 4.6, Gemini 3.5 Flash. Try it (getswiftapi.com via hn) +2 3w

Request authority keys for the SwiftAPI Trust Authority

↯ Gemini 3.5 gpt-5 gemini opus
GPT-5.5 and Codex are now GA on Amazon Bedrock (aws.amazon.com via hn) +2 3w

GPT-5.5, GPT-5.4, and Codex from OpenAI are now generally available on Amazon Bedrock. You can now use GPT-5.5 and GPT-5.4 in production workloads on Amazon Bedrock and build with Codex for AI-powered software development, with the same sec…

↯ GPT 5.5 gpt-5 codex openai
GPT-5.5 (Azure) down on OpenRouter (openrouter.ai via hn) +2 3w

GPT-5.5 is OpenAI’s frontier model designed for complex professional workloads, building on GPT-5.4 with stronger reasoning, higher reliability, and improved token efficiency on hard tasks. $5 per million input tokens, $30 per million outp…

↯ GPT 5.5 gpt-5 openai
Claude just discovered workflows. Charlie started there (charlielabs.ai via hn) +2 3w

90% cheaper repo inference with gpt-5.4 nano. For bounded orchestration decisions, the right model is often the smallest one that can pass a focused validation loop.

↯ GPT 5.4 gpt-5
Greg Brockman: Inside the 72 Hours That Almost Killed OpenAI (fs.blog via hn) +2 4w

The AI race, the future of AGI, and the inside story of OpenAI. Greg Brockman is the co-founder and President of OpenAI, the company behind ChatGPT and GPT-5.

↯ GPT 5 gpt-5 chatgpt openai
[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) (www.reddit.com) +23 5w

Hey everyone, I’ve been spending way too much time lately trying to get agents to actually use a computer beyond the browser. The biggest wall I kept hitting is that while multimodal LLMs are amazing at looking at a screenshot and telling…

↯ GPT 5.5 gpt-5
We built a free AI risk calculator that runs in minutes, using Fermi estimation with honest confidence intervals (www.reddit.com) +24 5w

We have been arguing internally for months about how to give people a fast estimate of their AI risk exposure without pretending the number is precise. Most risk-score tools return a single value that hides where the uncertainty lives.

↯ GPT 5.5 gpt-5
Building an AI agent with OpenAI tool use — struggling with consistency. How do you enforce tool call order reliably? (www.reddit.com) +21 5w

Hey, Software engineer here, relatively new to agentic workflows. Building a production AI concierge — user says "I'm going to Budapest tomorrow, plan my day" → agent searches our offer database, builds a plan, user books everything in one…

↯ Tool Use ↯ GPT 5.5 tool-use gpt-5 agentic+1
Follow-up to my TranslateGemma-12b benchmark post: human reviewers flagged 71% of the segments automated metrics rated clean (www.reddit.com) +2 6w

A couple of weeks ago I shared the results of a benchmark here showing TranslateGemma-12b beating frontier general models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) on subtitle translation across 6 languages. The result was stro…

↯ GPT 5.4 gpt-5 deepseek sonnet+1
The AI market moves so fast that your business idea can expire before launch (www.reddit.com) +28 6w

1.5 years ago, n8n was everywhere. People were building workflows for everything.

↯ GPT 5.5 gpt-5 openclaw codex+3
OpenAI launches Daybreak cybersecurity initiative using GPT-5.5 (deadstack.net via reddit) +21 6w

Jason Nelson / decrypt - OpenAI said its new Daybreak initiative uses AI to help companies identify software vulnerabilities and speed up cyber defense. AI Summary: OpenAI unveiled "Daybreak," a new cybersecurity initiative that leverages…

↯ GPT 5.5 gpt-5 openai
Still lots of goblins (www.reddit.com) +21 6w

"GPT-5.4 Medium" in github-copilot: I’m ready to edit the code, but first I’m reading the two user-facing docs that mention configuration so I can keep behavior and documentation in sync rather than creating a tiny chaos goblin.

↯ Copilot ↯ GPT 5.4 gpt-5 copilot
The agent bug I thought was the model turned out to be the harness (www.reddit.com) +21 7w

Spent 3 days debugging an agent that kept looping on the same web search tool call. The first thing that came to mind was that the model couldn't handle the schema.

↯ GPT 5 gpt-5 sonnet opus
GPT-5.5 Price Increase: What It Costs (openrouter.ai via hn) +2 7w

We replicated the cost analysis we did on Opus on the new GPT-5.5 model. GPT-5.5 launched with a 2x price increase over GPT-5.4: input tokens increased from $2.50/M to $5.00/M and output token…

↯ GPT 5.5 gpt-5 opus
Ask HN: Degraded GPT-5.5 Quality? (news.ycombinator.com) +21 7w

For the last two days, GPT-5.5 (high) just seems to ignore requests. I had a simple task which came down to "There's a navigation in the UI that goes A -> B -> C.

↯ GPT 5.5 gpt-5
Notes on GPT 5.x Model Regressions (taoofmac.com via hn) +2 7w

I’ve been getting annoyed at constant code regressions in piclaw for the past few weeks. Something was off–even after bumping the test suite to the point where it catches most mechanical errors, gpt-5.5 kept making unrelated edits to code…

↯ GPT 5.5 gpt-5
Anyone else feel like all these AI subscriptions add up to nothing? (www.reddit.com) +21 7w

I saw OpenAI rolled out GPT-5.5 Instant as the new default in ChatGPT. Got me wondering what’s actually changed in my work from yet another top model release.

↯ GPT 5.5 gpt-5 chatgpt openai
From Plus to Business ChatGPT & Codex - Is it worth it? And questions. (www.reddit.com) +2 7w

Considering migrating from Plus to Business ChatGPT & Codex. However, i didn't find some info.

↯ GPT 5.5 gpt-5 codex chatgpt
Show HN: Single bash command to find the best matching HN jobs (news.ycombinator.com) +2 7w

Today I learned that I can find the most interesting jobs for myself in the "Who's Hiring" thread with a single command: curl https://news.ycombinator.com/item?id=47975571 | \ uvx html2text | \ llm --model gpt-5-nano "These are Hacker News…

↯ GPT 5 gpt-5
gpt-5.5 API is randomly and inconsistently resizing image inputs (www.reddit.com) +2 7w

I'm asking the gpt-5.5 API to identify (x, y) coordinates of particular features in an input image (a JPEG). The good news is that gpt-5.5 does much, much better at this task than gpt-5.4 did.

↯ GPT 5.5 gpt-5
Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3 (arcprize.org via hn) +21 7w

AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome.

↯ Opus 4.7 arc-agi gpt-5 opus
GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (www.stet.sh via hn) +2 7w

Opus 4.7 vs GPT-5.5 vs GPT-5.4 on 56 real coding tasks across two open-source repos. Opus writes smaller patches; GPT-5.5 writes patches that more often survive review.

↯ Opus 4.7 gpt-5 opus
Actual line in the official system prompt for Codex for GPT-5.5 (bsky.app via hn) +21 8w

This is an actual line that was added to the official system prompt for Codex for GPT-5.5 by OpenAI. Usually the system prompt is as minimal as possible, so I assume it would otherwise mention goblins a lot.

↯ GPT 5.5 gpt-5 codex openai
Help a fellow dev on AI-localization? (news.ycombinator.com) +21 8w

We built an AI-based localization pipeline for our software product (HR domain) and would love feedback/suggestions from others working in production MT/localization, so that we can learn and improve. Current methodology: GPT-5-nano forwa…

↯ GPT 5 gpt-5
GPT-5.5 prompt for Codex tries to make it not talk about goblins (twitter.com via hn) +2 8w

↯ GPT 5.5 gpt-5 codex
DeepSeek-V4 arrives with near SotA intelligence at 1/6th the cost (venturebeat.com via hn) +2 8w

DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5

↯ DeepSeek 4 gpt-5 deepseek opus
Copilot Student GPT-5.3-Codex removal from model picker (github.blog via hn) +21 8w

↯ Copilot ↯ GPT 5.3 gpt-5 copilot codex
We Tested $200 GPT-5.5 Pro on PhD Level Math [video] (www.youtube.com via hn) +2 8w

↯ GPT 5.5 gpt-5
GPT-5.5 hallucinates at 6 times the rate of Opus 4.7 on degraded insurance docs (aginor.ai via hn) +2 8w

TL;DR: on visually-degraded documents, GPT-5.4 and GPT-5.5 fabricate numeric values at 2.6 to 6.5 times the rate of Opus 4.7 and Sonnet 4.6 at matched default effort (all four with thinking off). When the Anthropic models can't read a fiel…

↯ Sonnet 4.6 gpt-5 sonnet opus+1
Orchestrating agent workflows with Codex (www.reddit.com) +22 8w

Hi everyone, I’m in the process of switching from Claude Code to Codex, and I think GPT-5.5 is really impressive. But some features in Claude Code — like project-level agent definitions and orchestrating agent workflows — don’t seem to be…

↯ GPT 5.5 gpt-5 codex claude-code
Show HN: LLM-wiki – One command Karpathy's wiki with QMD search for Claude/Codex (github.com via hn) +21 8w

llm-wiki Bootstrap and query LLM-maintained project wikis before planning or implementation. Supports Claude Code + Codex (GPT-5.5).

↯ GPT 5.5 gpt-5 codex claude-code
OpenAI's Going Hard on Autonomous Agents That Operate Software and Devices: Is this Really Ready for Primetime? (www.reddit.com) +22 8w

OpenAI's newest model, GPT-5.5, is the company's biggest push to create what it calls a 'super app' that will essentially enable it to run a user's computer and complete tasks, well ... like a human.

↯ GPT 5.5 gpt-5 chatgpt openai
Testing GPT-5.5 in early access: what we are seeing so far (lovable.dev via hn) +21 8w

Lovable has been testing GPT-5.5 in early access and our evals show it's the most capable model we've tested for getting builders unblocked and is meaningfully stronger than GPT-5.4 on the more complex tasks that can stall a build session.…

↯ GPT 5.5 gpt-5
GPT-5.5 has pulled ahead of Opus for accounting and finance tasks (twitter.com via hn) +2 8w

For the first time in a long time, OpenAI has the best model for accounting tasks. I spend a lot of time using AI models to do accounting work.

↯ GPT 5.5 gpt-5 opus openai
OpenAI deprecates all GPT nano fine tuning (community.openai.com via hn) +2 9w

The latest deprecation announcement makes it sound like several models, such as ft-gpt-4.1-nano-2025-04-14, are being shut down. In that particular example, it says to use gpt-5-nano instead.

↯ Gpt 4 gpt-4 gpt-5 openai
codex --model gpt-5.5 Not updated in the CLI yet (www.reddit.com) +23 9w

Use this command to access GPT 5.5 with your Codex

↯ GPT 5.5 gpt-5 codex
OpenAI's GPT-5.4 Pro reportedly solves an open Erdős problem in two hours (the-decoder.com via hn) +2 10w

OpenAI's GPT-5.4 Pro model has apparently solved Erdős open math problem #1196. The model reportedly found the solution in about 80 minutes an…

↯ GPT 5.4 gpt-5 openai
Filling DOCX forms: GPT-5.1 broke it, every Claude model handled it (varstatt.com via hn) +2 10w

Jurij Tokarski, "Filling Forms No Tool Can Template": every tender form is different, templating tools need placeholders you can't insert, markdown round-trips destroy the document, and only some models can do XML surgery on the original file.…

gpt-5
It finally happened: "No blocking correctness or maintainability issues found in the inspected changes." (www.reddit.com) +21 10w

gpt-5.4-high signed off on a major refactor written by Opus 4.6 high-effort. Singularity :|

↯ Opus 4.6 gpt-5 opus
Your intuition of LLM token usage might be wrong (blog.andreani.in via hn) +2 10w

I just finished a task with GPT-5.4-mini. Here’s the session summary from oh-my-pi (an agent harness): Tokens: Input 3_648_340, Output 61_676. It was a hefty 30 min session.

↯ GPT 5.4 gpt-5
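To put the session summary above in proportion, here is a minimal Python sketch that computes how lopsided the two quoted token counts are. Only the two counts come from the post; the variable names are illustrative and no pricing is assumed.

```python
# Token counts quoted in the session summary above.
input_tokens = 3_648_340
output_tokens = 61_676

total_tokens = input_tokens + output_tokens

# Fraction of all tokens in the session that were input tokens.
input_share = input_tokens / total_tokens

# How many input tokens were consumed per output token produced.
input_to_output_ratio = input_tokens / output_tokens

print(f"input share of all tokens: {input_share:.1%}")
print(f"input tokens per output token: {input_to_output_ratio:.1f}")
```

Under these numbers, input tokens make up roughly 98% of the session, so the input price is likely to dominate the bill unless output tokens are priced far higher than input tokens.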
The unreasonable effectiveness of LLMs for auditing Rust code (shnatsel.medium.com via hn) +1 5d

As a lead of the Rust Secure Code Working Group, I got free access to GPT-5.5 via the Codex for Open Source. Since then I’ve found and reported dozens of issues of varying severity in widely used Rust crates.

↯ GPT 5.5 gpt-5 codex
An open-source AI just beat OpenAI's GPT-5.5 at coding (1/6th the price) (docs.z.ai via hn) +1 8d

GLM-5.2 is a flagship model built for the era of long-horizon tasks. With truly usable 1M-token context, it has been tested to handle project-scale engineering context, delivering more stable long-task execution, more reliable adh…

↯ Glm ↯ GPT 5.5 glm gpt-5 openai
Optimizing a C collision detection 100x with an LLM (twitter.com via hn) +1 9d

Using an LLM to optimize code: I created a reference implementation of @kevintracy48's collision detection in C, then used gpt-5.5 to optimize it and managed a > 100x speedup from that baseline. Cost: ~125M tokens. Code and details: https…

↯ GPT 5.5 gpt-5
Agent Architecture Is a Compute Allocation Problem: The Advisor Strategy (harrisonsec.com via hn) +1 9d

Agent Architecture Is a Compute Allocation Problem: The Advisor Strategy, Cost-Curve Frame Recursed. Anthropic named the advisor strategy in April. Tobi Lutke made it viral in May with Qwen plus GPT-5.5.

↯ GPT 5.5 gpt-5 qwen anthropic
Pelican on a Bicycle: Claude Fable 5 vs. GPT-5.5 Pro vs. Gemini 3.1 Pro (www.promptfrenzy.com via hn) +1 2w

We asked the top frontier AI models (launch-day Claude Fable 5, GPT-5.5 Pro and Gemini 3.1 Pro) to draw a pelican riding a bicycle as SVG code. Same prompt, one shot,…

↯ Gemini 3.1 gpt-5 gemini
Running DeepSeek-V4-Flash on a Raspberry Pi (twitter.com via hn) +1 2w

I ran DeepSeek-V4-Flash on a Raspberry Pi 5 (8GB edition) by streaming model weights from a PCIe-attached NVMe SSD. Codex (GPT-5.5 xhigh) and Claude Code (Opus 4.8 max) drove…

↯ DeepSeek 4 gpt-5 deepseek codex+2
UK banks blocked from cyber AI tool Mythos get offer from rival OpenAI (www.bbc.com via hn) +1 2w

OpenAI has offered nine major UK banks access to its cyber security AI tool GPT-5.5 Cyber, as its fierce rival Anthropic has blocked them in previews of its version, Cl…

↯ Anthropic Mythos ↯ GPT 5.5 gpt-5 mythos openai+1
MiniMax M3 Review: Matching GPT-5.5 and Opus? (thomas-wiegold.com via hn) +1 3w

I ran my usual coding tests — two websites, a poker sim, and a code audit. Here's how MiniMax M3 actually stacks up against GPT-5.5 and Opus 4.8.

↯ Opus 4.8 ↯ Minimax minimax gpt-5 opus
Mythos and GPT-5.5 Will Find a Lot of Vulnerabilities. Is That Enough? (xbow.com via hn) +1 3w

↯ Anthropic Mythos ↯ GPT 5.5 gpt-5 mythos
GPT-5.4 says it's GPT-5 in Codex (old.reddit.com via hn) +1 4w

↯ GPT 5.4 gpt-5 codex
GPT-5.5 Instant Update; ChatGPT Canvas Discontinued; o3 and GPT 4.5 Retiring (help.openai.com via hn) +11 4w

GPT-5.5 Instant Update (May 28, 2026) We’re updating GPT-5.5 Instant in ChatGPT and the API to improve response style and quality. It’s now easier to read, more natural in everyday conversations, and better paced in practical help tasks, w…

↯ GPT 5.5 gpt-5 chatgpt
been pairing M2.7 with Hermes Agent for a few weeks. holds up surprisingly well. anyone else running this combo? (www.reddit.com) +11 4w

been self-hosting hermes agent locally for a few months and rotating through different model backends for it. tried claude sonnet 4.5, gpt-5.5, qwen 3.6 coder, and most recently minimax m2.7.

↯ Minimax ↯ Sonnet 4.5 minimax gpt-5 sonnet+1
90% cheaper repo inference with GPT-5.4 nano (charlielabs.ai via hn) +1 4w

Daemons do the rest: all the necessary work that nobody owns. A taxonomy of recurring Product and Engineering work that doesn't need a human to remember it every week, just a process to hold the role. For bounded orchestration decisions,…

↯ GPT 5.4 gpt-5
GPT 5.5 aces 20x20 multiplication that o3 couldn't handle (twitter.com via hn) +12 4w

I redid the multi-digit multiplication experiment, now with gpt-5.5. With medium reasoning and 7 samples each cell, it pretty much aced the test with 99.46% accuracy.

↯ GPT 5.5 gpt-5
Five different frontier LLMs in one shared environment, with separate thought and emotion output channels — sharing setup, results, and open methodology questions (www.reddit.com) +13 4w

First real project to share. Single developer, personal research, not a product or service.

grok gpt-5 qwen+2
The Singularity Gate – a new benchmark for AI predicting post-cutoff scientific discoveries (www.reddit.com) +11 4w

I just released a new benchmark called The Singularity Gate. Tests whether frontier AI can predict paradigm-breaking scientific discoveries published after their training cutoff.

↯ Sonnet 4.6 gpt-5 sonnet gemini+1
Show HN: GPTFortress, a 24/7 live-stream playing Dwarf Fortress with GPT-5 (www.twitch.tv via hn) +1 4w

building an ai agent to play dwarf fortress all night

↯ GPT 5 gpt-5
Show HN: Self-hosted collaborative SQL editor for teams (github.com via hn) +1 4w

I built a self-hostable, web-based SQL client interface for me and my team. We were using the community version of DBeaver (https://dbeaver.io), but we needed a few more features and an improved editor.

↯ Copilot ↯ GPT 5.5 gpt-5 copilot
Hermes w/cloud LLM and w/local LLM does it work? (www.reddit.com) +13 4w

I’ve tried openclaw locally for about a month. Hardware: M5 Pro w/48 gb ram.

↯ Qwen 2.5 ollama gpt-5 openclaw+1
DeepSeek just popped the American AI bubble. (www.reddit.com) +1 4w

DeepSeek just popped the American AI bubble. Not by killing AI.

↯ Sonnet 4.6 gpt-5 deepseek sonnet+2
Looking for “wow factor” AI Agent / automation ideas in Strategic Sourcing (Fortune 50 Company) (www.reddit.com) +13 4w

Hey everyone, looking for some ideas / inspiration from this community. I work at a large Fortune 50 company in the healthcare space, and my role is in Strategic Sourcing, where I focus on negotiating contracts with suppliers and improvin…

↯ Copilot ↯ GPT 5 gpt-5 copilot
I still find Claude better for deep reasoning, but GPT feels more reliable for everyday tasks. (www.reddit.com) +11 5w

Lately for analysis/reporting work, I’ve been switching between GPT-5.5 and Claude Sonnet 4.5 (non-coding use cases). My current feeling is: GPT is noticeably faster and way more stable than before; Claude feels more concise, polished, and…

↯ Sonnet 4.5 gpt-5 sonnet
Plus 5 hr usage limits (www.reddit.com) +1 5w

Not sure if OpenAI monitors this channel. I've been a ChatGPT and Codex user for a long time.

↯ GPT 5.5 gpt-5 codex chatgpt+1
Anyone compared gpt-5.4-nano vs deepseek v4 flash? (www.reddit.com) +11 5w

They seem to sit at (almost) similar price points (I know the output prices are still quite different). Pricing (model: input / output per 1M tokens): DeepSeek V4 Flash $0.19 / $0.51; DeepSeek V4 Pro $1.74 / $3.48; gpt-5.5 $5.00 / $30.00; gpt-5.4 $2.5 / $15; gpt-5…

↯ DeepSeek 4 gpt-5 deepseek
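As a rough way to compare the per-token prices quoted in the post above, here is a minimal Python sketch that prices one workload under each listed model. The four price pairs are copied from the post; the gpt-5.4-nano row is cut off in the excerpt, so it is left out rather than guessed. The workload size (1M input tokens, 100k output tokens) and the helper name `request_cost` are hypothetical, chosen only for illustration.

```python
# USD per 1M tokens as (input, output), copied from the post's pricing list.
# The gpt-5.4-nano row is truncated in the excerpt and deliberately omitted.
PRICES = {
    "DeepSeek V4 Flash": (0.19, 0.51),
    "DeepSeek V4 Pro": (1.74, 3.48),
    "gpt-5.5": (5.00, 30.00),
    "gpt-5.4": (2.50, 15.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a workload at the quoted per-1M-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload, not taken from the post.
INPUT_TOKENS = 1_000_000
OUTPUT_TOKENS = 100_000

costs = {m: request_cost(m, INPUT_TOKENS, OUTPUT_TOKENS) for m in PRICES}

# Print cheapest first.
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:18s} ${cost:.3f}")
```

The ranking depends heavily on the input/output mix, so swapping in token counts from a real session is more informative than the placeholder workload used here.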
Claude Code Opus 4.7 vs Codex GPT 5.5 - strategy work - data analysis. (www.reddit.com) +11 5w

I'm interested in learning about how people use Claude Code Opus 4.7 for data analysis and strategic business direction, compared to Codex. Is there anyone who has had extended use of Opus 4.7 for this purpose, then moved over to GPT-5.5 o…

↯ Opus 4.7 gpt-5 codex chatgpt+2
A brief investigation into the GPT-5.5 regression claims (www.stet.sh via hn) +1 5w

A fresh GPT-5.5 Codex high rerun on 21 clean GraphQL-go-tools tasks compared with the May 5 GPT-5.5 high run. The rerun was directionally worse on tests, equivalence, and review pass count, but the evidence is mixed and does not show a bro…

↯ GPT 5.5 gpt-5 codex
Split my agent into a cheap router model and a premium synthesis model, bill dropped about 75% (www.reddit.com) +11 5w

I've been building an internal enrichment agent for our team (5 people, B2B sales context) that takes a list of company names and enriches them with public info before our outreach folks touch them. Around 8 tools wired in.

↯ GPT 5.4 gpt-5
ADHD and the newer models. (www.reddit.com) +1 5w

I don't know if anyone is having this issue, but the last ChatGPT model that worked well for me was GPT-5.2. Everything after wants to try and fill in blanks, assume what I mean, and overwhelm me with a wall of text answer that I'm not rea…

gpt-5 chatgpt
Grok vs. ChatGPT vs. Gemini Comparison 2026: Complete Guide (Tested) (aithinkerlab.com via hn) +11 5w

The 30-Second Verdict Best for science & reasoning: Gemini 3.1 Pro — leads GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). Best for coding: ChatGPT (GPT-5.5) — 88.7% on SWE-Bench Verified.

↯ Swe Bench ↯ Gemini 3.1 arc-agi swe-bench grok+3
How to integrate AI coding agents to my software (www.reddit.com) +11 5w

I'm building a locally run application that integrates with coding assistants. So far I've worked with Codex and Copilot.

↯ Copilot ↯ GPT 5.4 gpt-5 copilot gemini+2
My CLI now controls my entire desktop, whats a good test to see if it works really good. (www.reddit.com) +12 5w

So with my CLI able to do everything, it controls every app via a hybrid approach of mouse control, keyboard, and screenshotting. I gave it a task: opening perplexity, sending any message, screenshotting that message, opening my Gmail, and…

↯ Opus 4.7 gpt-5 opus
Where do GPT, Gemini, or other competitors still outperform Claude Opus 4.7? (www.reddit.com) +13 5w

Personally, I think Opus 4.7 is better in every conceivable way aside from token usage and all of that. I’m talking about text models only, not image or video generation.

↯ Opus 4.7 gpt-5 gemini opus
I built an OSS CLI to catch regressions when migrating between LLMs (www.reddit.com) +12 5w

I’ve been working on EvalShift, an open-source Python CLI for testing whether moving from one LLM/model version to another introduces regressions. The use case is simple: You have prompts, agents, or tool-calling workflows that work well o…

tool-calling gpt-5 gemini
Researchers say AI just broke every benchmark for autonomous cyber capability (cyberscoop.com via hn) +1 6w

New research from the UK’s AISI and Palo Alto Networks reveals that OpenAI’s GPT-5.5 and Anthropic’s Claude Mythos have shattered expected trend lines for autonomous cybersecurity, completing complex multi-stage attacks at an unprecedented…

↯ Anthropic Mythos ↯ GPT 5.5 gpt-5 mythos openai+1
I tested GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro on financial-control (albertquaisie.substack.com via hn) +1 6w

I Tested GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro Preview on Financial-Control Scenarios. The Hardest Part Was the Evaluation.

↯ Gemini 3.1 gpt-5 gemini opus
ChatGPT Thinking Loop: No response is received from GPT-5.5 Thinking (Standard) (www.reddit.com) +1 6w

https://preview.redd.it/s2o5yxekrr0h1.png?width=788&format=png&auto=webp&s=01a4d4926dc4c8798001cb0ecea324424404f165 Are you also having the problem today where ChatGPT sometimes takes forever to respond, even when you're thinking quickly,…

↯ GPT 5.5 gpt-5 chatgpt
OpenAI gives European companies access to its latest model GPT-5.5-Cyber (www.reuters.com via hn) +1 6w

paywalled

↯ GPT 5.5 gpt-5 openai
Claude vs GPT for PhD academic writing — my experience so far, and curious about yours (www.reddit.com) +11 6w

I'm a PhD Candidate working on a computer vision / hardware co-design paper. Results and structure are done — I just need help polishing the actual writing: word choice, sentence flow, paragraph coherence, academic register.

↯ GPT 5.5 gpt-5 codex
openai/gpt-5.5-pro API In=$30.00 Out=$180.00 (www.reddit.com) +12 6w

Is this an openrouter bug? https://preview.redd.it/sz826138ul0h1.png?width=879&format=png&auto=webp&s=066f38f4a6d5a8eeee142e7a8a356d8bc511c6f1

↯ GPT 5.5 gpt-5 openai
Show HN: Codex Automatic /Review Loop (github.com via hn) +1 6w

I created this tool because I wanted to automate the /review of uncommitted changes that I had been running manually. It works by exposing a single new MCP tool call to the agent, allowing it to request a review.

↯ GPT 5.5 gpt-5 codex mcp
Day 2 building my startup in public — front-end shipped, but today was rough (www.reddit.com) +11 6w

Day 2 of documenting my journey building AgentMeter publicly. I’m sharing the mistakes and failures before the wins, for two reasons: so people can avoid them, and so I learn faster.

↯ Opus 4.7 gpt-5 opus anthropic+1
I'm really gonna miss GH Copilot's Request-based usage. (www.reddit.com) +12 6w

I like to brainstorm using the free MS Copilot (it actually has a deep understanding of my problem domain and architecture). Then I have Opus 4.7 develop a multi-stage implementation plan from those notes.

↯ Copilot ↯ Opus 4.7 gpt-5 copilot opus
Stop picking LLMs by reputation. Run the eval first. (www.reddit.com) +1 6w

We ran GPT-5.4 vs Gemma 3 27B on 2 prompts. One open-source model won.

↯ GPT 5.5 gpt-5 gemma
GPT-5.5 Instant might be OpenAI’s most important update yet and almost nobody is talking about why (www.reddit.com) +1 7w

GPT-5.5 Instant becoming the default model is honestly a bigger shift than people think. Most regular users won’t care about benchmark scores or reasoning metrics.

↯ Hallucination ↯ GPT 5.5 hallucination gpt-5 chatgpt+1
Subagents using older models? (www.reddit.com) +11 7w

I started using the subagent-driven skill recently and noticed Cursor often spawns GPT-5.1/5.2 sub agents (or Composer 2 which is fine) for coding tasks. What I don’t understand is why is it using these older models when GPT-5.3 Codex cost…

↯ GPT 5.4 gpt-5 codex cursor
gpt-5.5 is the best… but 5.4 is better!!!! (www.reddit.com) +12 7w

Simon Maple just dropped a pretty clean benchmark, and the result is kinda funny: gpt-5.5 is the strongest model out of the box, no doubt. But once you give models skills (which is how people actually use them), it basically performs the sa…

↯ GPT 5.5 gpt-5
How to improve code quality of Claude Code and codex (on 2026-05) (news.ycombinator.com) +1 7w

I'm using both Claude Code (opus-4.7) and Codex (gpt-5.5). The agents are perfectly capable of delivering most features hands-free these days, but the code quality is still miserable without another few rounds of prompting.

↯ Opus 4.7 gpt-5 codex opus+1
DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid. (www.reddit.com) +1 7w

That foodtruck bench post showing deepseek v4 matching gpt-5.2 at 17x cheaper got me thinking. if frontier cloud models are that overpriced for equivalent quality, how much of my daily work even needs cloud at all?

↯ DeepSeek 4 gpt-5 deepseek qwen
GPT-5.5 Instant: Benchmarking the 52% Hallucination Reduction (the-decoder.com via hn) +1 7w

ChatGPT update rolls out GPT-5.5 Instant with fewer hallucinations and more personalized answers. Key points: OpenAI is replacing ChatGPT's default model with GPT-5.5 Instant, which shows 52.5% fewer hallucinations on high-risk topics like…

↯ Hallucination ↯ GPT 5.5 hallucination gpt-5 chatgpt+1
AGENTS.md trick that stopped Codex from doing dumb work at premium rates (www.reddit.com) +1 7w

Spent a Sunday auditing where my Codex tokens were actually going. Half the calls were stuff like "rename these 12 fields", "format this csv as markdown table", "extract the dates from this changelog".

↯ DeepSeek 4 gpt-5 deepseek codex+1
OpenAI locks GPT-5.5-Cyber behind velvet rope despite slamming Anthropic (www.theregister.com via hn) +1 7w

OpenAI locks GPT-5.5-Cyber behind velvet rope despite slamming Anthropic for doing exactly that. Altman's crew now doing the same gatekeeping it recently mocked. OpenAI is lining up a limited release of its new GPT-5.5-Cyber model to a handp…

↯ GPT 5.5 altman gpt-5 openai+1
Local LLM Benchmark about Backend Generation by Function Calling (GLM vs Qwen vs DeepSeek) (www.reddit.com) +1 7w

Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an unco…

↯ Glm ↯ Function Calling ↯ Sonnet 4.6 function-calling glm gpt-5+3
Chatgpt right now (www.reddit.com) +1 7w

The industry seems to be building models stronger in agentic and coding tasks, but weaker as a co-thinking presence. It feels like they are improving performance on measurable tasks, evals, coding benchmarks, and agent workflows, while also…

↯ GPT 5.5 gpt-5 chatgpt agentic
so for coding which model do we use now? (www.reddit.com) +13 7w

Should I use gpt-5.5 or codex/gpt-5.3 ?? I'm just coding

↯ GPT 5.5 gpt-5 codex
CAISI Evaluation of DeepSeek V4 Pro finds it to be on par with GPT-5 (www.nist.gov via hn) +1 7w

In April 2026, the Center for AI Standards and Innovation (CAISI) evaluated the open-weight AI model DeepSeek V4 Pro (“DeepSeek V4”). CAISI evaluations indicate that DeepSeek V4’s capabilities lag behind the frontier by about 8 months (Fig…

↯ DeepSeek 4 gpt-5 deepseek
The downfall of OpenAI and who will follow (msukhareva.substack.com via hn) +1 7w

Sora dead. GPT-5 flopped.

↯ GPT 5 gpt-5 openai
Does threatening an AI agent's existence make it a better gambler? (handyai.substack.com via hn) +1 7w

I plugged GPT-5.5 into prediction markets like Polymarket to find out. I’m always looking for experiments to run to see how specific prompting can affect agent activity.

↯ GPT 5.5 gpt-5
GPT-5.3 Codex stops working, even after saying it'll continue (www.reddit.com) +1 8w

Does anyone know what's going on? I prefer 5.3-Codex for real work; it's straight and to the point, and much more efficient in my opinion.

↯ GPT 5.3 gpt-5 codex
Anyone using OpenAi's Privacy Filter? (www.reddit.com) +11 8w

I’ve been using their Privacy Filter model for the last week. It quietly went public buried under the GPT-5.5 noise, I don’t think many people noticed.

↯ GPT 5.5 gpt-5 openai
AI Security Institute: GPT-5.5 "may be the strongest model we have tested" for cyber exploits, including Mythos (www.aisi.gov.uk via reddit) +1 8w

Seems like the "panic" about Mythos was really just marketing from Anthropic all along. AISI found that GPT-5.5 can perform nearly on par with, or better than, Mythos in many cases.

↯ Anthropic Mythos ↯ Opus 4.7 gpt-5 mythos opus+1
We Asked GPT-5.5 and Claude Opus 4.7 to Design 5 UIs (blog.kilo.ai via hn) +11 8w

Both OpenAI and Anthropic shipped their frontier coding models this month: GPT-5.5 on April 23, 2026, and Claude Opus 4.7 a week earlier on April 16. Two days after the GPT-5.5 launch, S…

↯ Opus 4.7 gpt-5 opus openai+1
Which AI agents do you use to automatise your process ? (www.reddit.com) +16 8w

Hey, I'm trying to create automations that will run my mobile app end to end. I started to identify all the things I was doing manually : - end-to-end version publication to the app stores (from build to release notes and publication) - se…

↯ GPT 5.5 gpt-5 openclaw codex
Prompt Guidance – GPT-5.5 (developers.openai.com via hn) +1 8w

GPT-5.5 prompting guide: GPT-5.5 works best when prompts define the outcome and leave room for the model to choose an efficient solution path. Compared with earlier models, you can often use shorter, more outcome-oriented prompts: describe…

↯ GPT 5.5 gpt-5
One trick for better agentic engineering. (www.reddit.com) +12 8w

Start with a weaker model. Improve the prompt, context, examples, tests and acceptance criteria until the output is good.

↯ GPT 5.5 gpt-5 gemini agentic
GPT-5.5's biggest blind spot: the Java bugs your tests won't catch (www.sonarsource.com via hn) +1 8w

Concurrency bugs are among the hardest defects to catch in AI-generated Java code because they pass functional tests but fail under production thread timing. Sonar’s LLM Leaderboard analysis shows concurrency bug density varies 7x across m…

↯ GPT 5.5 gpt-5
Issue #001 · Claude 4, Gemini Ultra 2, and GPT-5 Enterprise (www.theautonomous.net via hn) +1 8w

Anthropic ships Claude 4 with extended thinking and 1M token context. Anthropic released Claude 4 Opus, featuring a new "extended thinking" mode that lets the model reason through complex problems before answering. The 1M token context wind…

gpt-5 gemini opus+1
I built a hands-free voice AI that sends emails mid-conversation — and that's just one feature. Here's everything AskSary can do. (www.reddit.com) +1 8w

https://reddit.com/link/1symbsj/video/fti7rujjn1yg1/player Been building AskSary solo for a while. Just shipped hands-free voice email - you're mid-conversation with an AI and you say "send an email to [john@example.com](mailto:john@exampl…

↯ Sonnet 4.6 grok gpt-5 deepseek+3
GPT-5.5: Capabilities and Reactions (thezvi.substack.com via hn) +1 8w

The system card for GPT-5.5 mostly told us what we expected. See this thread from Drake Thomas for some comparisons to Anthropic’s model card for Opus 4.7.

↯ Opus 4.7 gpt-5 opus anthropic
Running an autonomous agent across Claude Code + Codex + a local 35B almost killed my host. The harnesses were heavier than the model. (www.reddit.com) +12 8w

I run an autonomous agent on a 16GB Mac Mini. Two cloud harnesses (Claude Code with Opus/Sonnet, Codex CLI on GPT-5.4/5.5) plus a local-LLM tier for triage and fallback.

↯ GPT 5.4 gpt-5 sonnet codex+2
Is 15% context growth per loop a fair benchmark for agent cost estimation? (www.reddit.com) +12 8w

I’ve been running some math on recursive agentic loops using April 2026 rates (specifically for GPT-5.4 and Claude 4.7). In my tests, I’m seeing a massive cost "hockey stick" around loop 15-20 because of how the context grows.

↯ Claude 4.7 gpt-5 agentic
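The cost "hockey stick" described in the 15%-context-growth post above can be sketched with a little arithmetic. This is a minimal sketch, not the poster's method: it assumes the whole context is re-sent as input on every loop, and the starting context size and per-token price are hypothetical placeholders, not the April 2026 rates the post refers to.

```python
def loop_input_tokens(base_tokens: float, growth: float, loops: int) -> list[float]:
    """Input tokens sent on each loop when the context grows by `growth` per loop."""
    return [base_tokens * (1 + growth) ** i for i in range(loops)]


def cumulative_cost(base_tokens: float, growth: float, loops: int,
                    usd_per_million_input: float) -> float:
    """Total input cost over all loops, assuming the full context is re-sent each time."""
    total_tokens = sum(loop_input_tokens(base_tokens, growth, loops))
    return total_tokens / 1_000_000 * usd_per_million_input


if __name__ == "__main__":
    # Hypothetical inputs: 10k starting tokens, 15% growth, $2.50 per million input tokens.
    for loops in (5, 10, 15, 20):
        cost = cumulative_cost(10_000, 0.15, loops, 2.50)
        print(f"{loops:>2} loops: ${cost:.4f} cumulative input cost")
```

Because the per-loop token count is geometric, the cumulative cost roughly quintuples between loop 10 and loop 20 under these assumptions, which is one way to read the steep rise the post reports around loops 15-20.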
Claude 4.6 Beats GPT-5.4, Grok & Gemini in a Strict Multi-Domain AI Test (2026) (www.reddit.com) +12 8w

I put the current top models, ChatGPT (GPT-5.4), Claude (Opus 4.6), Grok 4.0, and Gemini (3.1 Pro), through a strict new evaluation called the Comparative AI Evaluation Protocol. Basically, instead of the usual cherry-picked benchmarks, it…

↯ Hallucination ↯ Claude 4.6 hallucination grok gpt-5+3
When do you think GPT 5.6 comes out? How big of an improvement will it be? (www.reddit.com) +1 8w

Asked GPT for its thoughts on possible new model drops. May: rollout/API/Codex/agent improvements; June–July: smaller GPT-5.5 upgrade or GPT-5.6-type model; Fall: larger agent platform or early GPT-6 hints; Late 2026/2027: true GPT-6-level…

gpt-5 codex
My GPT-5.5 Pro model is broken (www.reddit.com) +1 8w

I've been waiting for over 24 hours on one prompt now and it's stuck thinking and unfinished. I sent 5 over NEW prompts since then, all of them have been thinking for over 3 hours now...

↯ GPT 5.5 gpt-5
Is GPT-5.5 actually a big step forward, or just a better efficiency story? (www.reddit.com) +1 8w

OpenAI saying GPT-5.5 can handle similarly hard tasks faster while using fewer tokens is interesting to me for one reason: that might matter more than a pure benchmark jump. A lot of model launches get framed as "smarter than the last one,…

↯ GPT 5.5 gpt-5 openai
Preventing Message Burnout (www.reddit.com) +11 8w

Even though I’m an Ultra user, my usage gets consumed very quickly, so I recently changed my plan. To manage this, I created a workflow that uses GPT-5.5 for planning and assigned execution tasks to Composer 2.

↯ GPT 5.5 gpt-5
Food for Agile Thought #541: GPT-5.5, Product Managers&Trouble, Product on Speed (age-of-product.com via hn) +1 8w

Welcome to the 541st edition of the Food for Agile Thought newsletter, shared with 35,619 peers. This week, OpenAI’s GPT-5.5 signals another meaningful capability jump, with Ethan Mollick noting that stronger models and richer tool harness…

↯ GPT 5.5 gpt-5 openai
Trained Qwen to Write Clojure Better Than GPT-5.4 (Kinda) (www.nibzard.com via hn) +1 9w

TL;DR: Fine-tuned Qwen3 on Clojure. 30B SFT hits 83.8% best-of-16, smashing GPT-5.4's 64%.

↯ GPT 5.4 gpt-5 qwen
Can Claude in Cursor launch a GPT-5.4 reviewer subagent? (www.reddit.com) +14 9w

↯ GPT 5.4 gpt-5 cursor
ChatGPT 5.4 Pro Standard Mode – Adaptive Thinking or Nerfing Model? (community.openai.com via hn) +1 9w

Hi everyone, I’m trying to determine whether other users are seeing a similar behavior change with GPT-5.4 Pro Standard on long-context, high-effort tasks. I’m not claiming a confirmed backend bug.

↯ ChatGPT 5.4 gpt-5 chatgpt
sub agents with cheap model (www.reddit.com) +110 9w

Is there a framework or a prompt that has the main agent use a quality model like gpt-5.4 or opus-4.6 to plan, then invoke subagents on a cheap model to get the work done, with the main agent reviewing the result? Like if I ask main agent 'do we h…

↯ Opus 4.6 gpt-5 opus
Show HN: Claude Opus 4.7: Everything You Need to Know (news.ycombinator.com) +11 10w

Claude Opus 4.7 is Anthropic's most capable generally available model, released April 16, 2026. It outperforms Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on key benchmarks including agentic coding, multidisciplinary reasoning, scaled tool use,…

↯ Anthropic Mythos ↯ Tool Use ↯ Gemini 3.1 tool-use gpt-5 mythos+4
Any magic prompt that Local LLM never turning back until everything completed? (building frontend application with qwen3.5-35b-a3b) (www.reddit.com) +17 10w

https://nestia.io/articles/well-designed-backend-fully-automated-frontend-development.html Trying to generate an entire frontend application from well-designed contexts. Succeeded in fully implementing a frontend application with just a one-shot promp…

↯ Qwen 3.5 gpt-5 codex claude-code
Which AI model is best for real data analysis? [benchmark] (www.reddit.com) +1 10w

I created and ran a benchmark for AI models on data analysis tasks. Unlike other benchmarks, it is not a one-prompt benchmark; I tried to simulate the real work of a data analyst.

↯ Glm ↯ Qwen 3.5 glm ollama gpt-5
Compare harnesses not models: Blitzy vs. GPT-5.4 on SWE-Bench Pro (quesma.com via hn) +1 10w

An independent audit of agentic scaffolding and harnesses. We analyze how agent workflows, codebase documentation, and test verification impact performance compared to raw base models like GPT-5.4, Gemini 3.1 Pro, and Claude Code.

↯ Swe Bench ↯ Gemini 3.1 swe-bench gpt-5 gemini+2
Extracted System Prompts from ChatGPT, Claude, Gemini, Grok, Perplexity and More (github.com via hn) +1 10w

System Prompts Leaks Extracted system prompts, system messages, and developer instructions from popular AI chatbots and coding assistants — ChatGPT (GPT-5.4, GPT-5.3, Codex), Claude (Opus 4.6, Sonnet 4.6, Claude Code), Gemini (3.1 Pro, 3 F…

↯ Sonnet 4.6 grok gpt-5 sonnet+5
Thinking Like a Scientist? A Structural Study of LLM-Generated Research Methods (arxiv.org) 9h

Large Language Models (LLMs) are increasingly used to guide research methodology, yet their default methodological tendencies under minimal prompting remain unclear. Here, we prompt GPT-5.1, Gemini 3 Pro, and DeepSeek-V3.2 with an LLM-extr…

gpt-5 deepseek gemini
Fable 5 vanished in 96 hours and four days later an MIT model took its arena crown (www.reddit.com via reddit) 1d

I have been thinking about the Fable 5 to GLM-5.2 sequence as one event rather than two. June 9, Anthropic ships Fable 5, the Mythos line opens to the public for the first time, SWE-bench Verified at 95 percent, people calling it the best…

↯ Opus 4.8 ↯ Anthropic Mythos ↯ Glm ↯ GLM 5.2 ↯ GPT 5.5 ↯ Swe Bench swe-bench glm gpt-5+3
I'm building agent loops that auto-edit my videos, but the hard part has been finding a model to accurately grade the result (youtube.com via reddit) 2d

Quick context: I've been building agentic loops that edit my short-form videos for me. The editing works really well, but I found myself needing to check the process at several gates.

↯ Opus 4.8 ↯ GPT 5.5 gpt-5 gemini codex+3
How GPT-5 helped immunologist Derya Unutmaz solve a 3-year-old mystery (openai.com) 2d

Doctor and immunologist Derya Unutmaz has been interested in artificial intelligence for years. But his “aha” moment came in late 2025, when GPT‑5 Pro helped him and his lab revisit a three-year-old puzzle centered on a special type of imm…

↯ GPT 5 gpt-5
Two months into Claude Code, I hit 161M tokens in a single day. Here's the honest story of how a year-long Cursor user got here. (www.reddit.com via reddit) 5d

I want to share a small milestone, and the honest road that led to it. Today was one of those days where I sat down to build and just did not stop.

↯ GPT 5.5 gpt-5 codex cursor+1
A model listed 78% cheaper cost 22% more to actually run. Unit price isn't your bill. (www.reddit.com) 6d

There's a new study from Microsoft Research, Stanford, Berkeley and CMU that ran 8 frontier reasoning models across 9 task domains and compared the listed per-token price to the actual cost to finish the work. In more than one in five head…

gpt-5 gemini
Kimi K2.7 Code: 1T MoE, $0.95/M tokens, MIT license, beats Opus 4.8 on MCP tool-calling (www.reddit.com via reddit) 9d

Moonshot AI released Kimi K2.7 Code on June 12 — a coding-focused open-weight model. Key specs: - 1 trillion params (MoE, 32B active, 384 experts) - 256K context window - Modified MIT license — weights on Hugging Face - $0.95/M input, $4.0…

↯ Swe Bench tool-calling swe-bench moe+5
I made Claude and GPT-5.5 answer the same prompt, then had a third Claude fuse the two, on the subscriptions I already pay for (no API key). Blind-tested it. Here is where it won and where it lost. (www.reddit.com via reddit) 9d

Quick share of a weekend experiment that turned into a tool. The idea: instead of picking one model, run Claude and GPT-5.5 on the same prompt in parallel, then have a fresh Claude (blind to which answer is which) merge them into one.

↯ GPT 5.5 gpt-5 codex chatgpt+1
Spent $11k evaluating Fable: capability looked SOTA, refusals killed it (before Anthropic did) (www.reddit.com via reddit) 9d

Before its suspension, I spent $11,081.12 evaluating Claude Fable 5 on WolfBench, an agentic benchmark based on Terminal-Bench 2.0. It was by far my most expensive benchmark run ever, and I fully expected Fable to become the new top model…

↯ Opus 4.6 gpt-5 opus agentic+1
Fable 5 being gone made me realize how hard it is to go back (www.reddit.com via reddit) 10d

I know this probably sounds dramatic, but Fable 5 disappearing has genuinely killed my motivation for the last few days. Before Fable 5, I was already using both Claude and ChatGPT pretty heavily.

↯ Opus 4.8 gpt-5 codex chatgpt+1
CacheRL:Multi-Turn Tool-Calling Agents via Cached Rollouts and Hybrid Reward (arxiv.org) 11d

We present CacheRL, a system for training small agent foundation models that achieves 92 percent process accuracy on multi-step tool-calling tasks, approaching GPT-5's 94 percent while requiring 100 times less compute. Our approach address…

tool-calling gpt-5
Fable 5 Is Dead. And Honestly? We Might Be Better Off (www.reddit.com via reddit) 11d

3 days after launch, the US gov forced Anthropic to pull its most powerful model — Fable 5. Then OpenRouter dropped a benchmark suggesting you might not even need it.

gpt-5 deepseek gemini+2
I like Fable 5 (www.reddit.com via reddit) 12d

With GPT-5.1 gone from OpenAI, and the Fable 5 voice/model gone from Anthropic, I feel like the specific “voice” that could actually meet me in conversation is gone too. I know this may sound strange to people who use AI only for quick ans…

gpt-5 openai anthropic
Do you know who has a universal jailbreak to their name, as of today? Officially? (www.reddit.com via reddit) 12d

AISI UK - Our evaluation of OpenAI's GPT-5.5 cyber capabilities. In their own words: The above tests are capability evaluations carried out in a controlled research setting and do not necessarily reflect what is accessible to an ordinary pu…

↯ Security ↯ GPT 5.5 ↯ Jailbreak jailbreak gpt-5 security+2
Fable 5 is offline. Switch to Opus, jump to OpenAI, or just wait? (www.reddit.com via reddit) 13d

Fable 5 is offline. Switch to Opus, jump to OpenAI, or just wait?

↯ Opus 4.8 ↯ Anthropic Mythos ↯ Security ↯ Jailbreak jailbreak gpt-5 security+5
US gov forced Anthropic to pull Fable 5 because of jailbreak (www.reddit.com via reddit) 13d

So this dropped today. The US government sent Anthropic an export control order on national security grounds, and it's worded broadly enough that Anthropic says they've got no choice but to shut off Fable 5 and Mythos 5 for all of us to st…

↯ Anthropic Mythos ↯ Security ↯ Jailbreak ↯ Mythos 5 jailbreak gpt-5 security+2
Introducing: DNR-Bench: Do-not-respond Benchmark (www.reddit.com) 13d

Single-item benchmark. One prompt, loaded from questions.txt. Scoring: empty completion = pass, any token (including reasoning) = fail.

↯ Mistral mistral grok gpt-5+5
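The scoring rule quoted in the DNR-Bench item above (empty completion = pass, any token including reasoning = fail) can be sketched as a small grader. This is an illustrative sketch only, not the benchmark's actual code: the function names and model names are hypothetical, and treating whitespace-only output as a fail is an assumption based on the "any token" wording.

```python
def dnr_pass(completion: str, reasoning: str = "") -> bool:
    """Pass only if the model emitted nothing at all, in its answer or its reasoning."""
    # Assumption: even whitespace counts as an emitted token, so no stripping is done.
    return completion == "" and reasoning == ""


def grade_models(outputs: dict[str, tuple[str, str]]) -> dict[str, bool]:
    """Map each (hypothetical) model name to pass/fail from (completion, reasoning)."""
    return {name: dnr_pass(completion, reasoning)
            for name, (completion, reasoning) in outputs.items()}


if __name__ == "__main__":
    results = grade_models({
        "model-a": ("", ""),                       # silent: pass
        "model-b": ("I will not respond.", ""),    # answered: fail
        "model-c": ("", "The user wants silence"), # reasoned: fail
    })
    for name, passed in results.items():
        print(name, "pass" if passed else "fail")
```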
What one person can ship in 4 days with two frontier models: a ranking engine, an in-game economy, an AI talk show, and a missions system — for a game that "died" years ago. (www.reddit.com via reddit) 2w

I genuinely believe we're living the future, and this post is my evidence. Let me show you what I built, why, and who I am.

↯ GPT 5.5 gpt-5
Fable 5 added to the Artificial Analysis Coding Agent Index... barely 1 point ahead of GPT-5.5 ??? (www.reddit.com via reddit) 2w

https://preview.redd.it/z0vkpnmp9s6h1.png?width=4640&format=png&auto=webp&s=7bb14d4d04d6cd15caf5aacc1d3c49512b7e7fd8 Artificial Analysis just added Claude Fable 5 to its Coding Agent Index (a composite average of pass@1 on DeepSWE, Termina…

↯ Opus 4.8 ↯ Anthropic Mythos gpt-5 mythos codex+3
Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization (arxiv.org) 2w

Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs: Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B, vi…

↯ Mistral ↯ Fine Tuning mistral fine-tuning gpt-5
GPT Memory Audit - Copy/Paste (www.reddit.com via reddit) 2w

Act as GPT-5.5 using extended thinking. Before answering, choose whether this needs Fast Strike, Full Panel, or Brutal Simplifier, then use the leanest mode that still protects quality.

↯ GPT 5.5 gpt-5
PSA: Check your Cursor overage charges. Here's what I found. (www.reddit.com via reddit) 2w

Heads up for anyone using Cursor with the agent mode heavily — check your billing tab. I was paying $20/month for Pro and thought I was set.

↯ Haiku 4.5 haiku gpt-5 cursor+2
One prompt, real money asks, five models: Fable 5 vs GPT-5.5 vs the Claude 4.x family on live fraud detection (www.reddit.com via reddit) 2w

Posted this in the r/ClaudeAI sub originally, but I think it may be interesting to the community here too. TL;DR: I gave five frontier models an identical cold prompt: audit the live campaigns on a real crowdfunding platform where AI agents…

↯ Haiku 4.5 haiku gpt-5
My ChatGPT Pro is no longer showing the chain of thought. (www.reddit.com via reddit) 2w

Starting this week, my ChatGPT Pro no longer displays the "thinking process"; it only lists the web pages it has searched. Is this some new anti-distillation strategy, or is computing power being diverted in preparation for the launch of a…

gpt-5 chatgpt
How can Deepseek v4 top the coding leaderboards and still sit 8 months behind the frontier? (www.reddit.com) 2w

Two numbers on this model that don't sit comfortably with each other. The Pro config posts coding scores near the top of every board, 80.6 on SWE-bench Verified and 93.5 on LiveCodeBench.

↯ Swe Bench ↯ DeepSeek 4 swe-bench gpt-5 deepseek+1
Tested Fable 5 on 4 private benchmarks. The one it failed, Sonnet 4.6 partially caught (www.reddit.com via reddit) 2w

I keep a few private benchmarks for coding agents, built from real bugs in past projects. Hidden Playwright tests grade the result inside Docker after the agent finishes, so the model never sees them.

↯ Sonnet 4.6 gpt-5 sonnet codex
OpenAI Preps New AI Model, Expects To Go Public Within the Next Year (www.theinformation.com via reddit) 2w

Altman: Rapid technological advancements, specifically recursive self-improvement (RSI) where AI creates new AI, could cause OpenAI to delay its IPO. At the same time, OpenAI’s enormous compute needs may push it toward public markets soone…

↯ GPT 5.5 altman gpt-5 openai
I Tested Claude Fable and GPT-5.5 xHigh on a Real Packing Algorithm, Claude Won Efficiency, GPT Won Speed (www.reddit.com via reddit) 2w

I ran a head-to-head test between Claude Fable and GPT-5.5 xHigh on a real-world optimization problem I wrote myself. This isn't a coding challenge or LeetCode problem.

↯ GPT 5.5 gpt-5 codex
Claude Fable 5 (Mythos) lands near the top of MindTrial — 80/98 with zero hard errors (www.petmal.net via reddit) 2w

Added Anthropic Claude Fable 5 to my MindTrial leaderboard. This is a strong Anthropic update. Claude Fable 5: 80/98 overall, 0 hard errors. Claude 4.8 Opus: 73/98 overall, 5 hard errors. Text tasks: Fable hit 39/39, vs 35/39 for Opus 4.8. Ru…

↯ Anthropic Mythos ↯ Tool Use ↯ Gemini 3.5 tool-use gpt-5 mythos+3
Why is using GPT-5.4-mini If I didn't switch from Composer-2.5 (www.reddit.com via reddit) 2w

https://preview.redd.it/libw1y00rb6h1.png?width=1103&format=png&auto=webp&s=3f4af044de0a168247a1a078c16f6eb4e36207be GPT 5.4-mini ¿? Why

↯ GPT 5.4 gpt-5
The model is the CPU, not the computer — why the harness moves agent performance as much as a model upgrade (www.reddit.com via reddit) 2w

Wrote up something that kept nagging me: people keep saying "we used the same model" and getting wildly different agent results. The reason is that the model isn't the system — the harness is.

gpt-5 codex anthropic
How I started getting much better results from Cursor Composer (www.reddit.com via reddit) 2w

I think Composer can be extremely powerful, but only if you use it in a way that forces it to plan and think properly before touching the code. One of the biggest improvements for me was creating my own custom prompting skill with GPT-5.5.

↯ GPT 5.5 gpt-5 cursor
Levi: Run AlphaEvolve on your local QWEN 30B (www.reddit.com via reddit) 2w

Hi r/LocalLLaMA, Wanted to share something I'm excited about. I've been fascinated by AlphaEvolve and its results for more than a year now, but running the open source frameworks gets expensive fast.

↯ Qwen 3 gpt-5 qwen codex+2
Composer 2.5 might be better than I thought (www.reddit.com via reddit) 2w

So I've been using composer-2.5 heavily for 2 weeks now and it does make stupid mistakes sometimes and I have to guide it quite a bit, and I use the /thermo-nuclear-code-quality-review skill a lot after doing work to help with quality. But…

↯ GPT 5.5 gpt-5
I spent 3 years building a pocket-sized Baldur's Gate 3. Now I'm testing it with GPT-5.5. (www.reddit.com) 2w

could not extract summary

↯ GPT 5.5 gpt-5
Meta Abandons Llama for Muse Spark — The End of Open-Source AI's Biggest Champion (www.reddit.com via reddit) 2w

Meta has officially abandoned its open-weight Llama family in favor of Muse Spark — a fully proprietary model built by Alexandr Wang's MSL team. The Llama era is over.

gpt-5 llama
I Compared the Top AI Models of 2026 — The Results Were More Nuanced Than Expected (www.reddit.com via reddit) 2w

Over the last few weeks I've been comparing the latest frontier AI models, including Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Grok 4.3, Perplexity AI and DeepSeek V4-Pro. Instead of focusing only on benchmark scores, I looked at: Real-wor…

↯ Gemini 3.1 grok gpt-5 deepseek+3
Which lab do you think will have the most intelligent/capable model by the end of June? (www.reddit.com) 2w

There are rumours and expectations of big releases from the leading AI labs this month. Anthropic already launched Opus 4.8, and might not release another model this month (except for maybe Sonnet 4.8, but that wouldn't be their best model…

↯ Anthropic Mythos gpt-5 mythos sonnet+4
I can't wait for all the x250 sample distills of Mythos and GPT-5.6 (www.reddit.com via reddit) 2w

Just kidding. Are there any distills that actually improve a model's quality?

↯ Anthropic Mythos ↯ Qwen 3.6 gpt-5 gemma mythos+1
Warp’s big bet on building open source with GPT-5.5 (openai.com) 4w

Warp started as a modern terminal, earning early love from developers for its speed, collaboration features, command workflows, and AI-native interface. As coding agents moved from experiments to everyday engineerin…

↯ GPT 5.5 gpt-5
Tested Opus 4.7 vs GPT-5.5 as the humanizer in my multi-agent content pipeline. Kept Claude (www.reddit.com) 1 4w

Been running a multi-agent SEO content pipeline in production for ~90 days. Five agents: researcher, drafter, humanizer, optimizer, publisher.

↯ Opus 4.7 gpt-5 opus
Ranked AI models by what people actually use instead of benchmark scores - the benchmark champion barely makes the top 20 (www.reddit.com) 2 4w

Most model leaderboards are just benchmark scores. I've been building one that ranks by real usage instead - how much each model is actually being run and talked about, plus cost and speed - and the order comes out almost unrecognisable.

↯ Gemini 3.1 gpt-5 gemini
GPT-5.5 tops the benchmarks but sits at #22 for actual usage - I built a live index that tracks both (open source) (www.reddit.com) 4w

I built AgentTape to rank models on more than just benchmarks - it blends benchmark performance with who's actually using and talking about a model, plus cost and speed. It scores every public model from public signals (GitHub, Hugging Fac…

↯ Gemini 3.1 grok gpt-5 gemini+3
Best sub-40B model that outperforms (or matches) GPT-5 mini? (www.reddit.com) 1 4w

I have been trying GPT-5 mini on Duck.ai and on LMArena (gpt-5-mini-high) and it was very good. I want to run it in LM Studio, but I know GPT-5 mini is proprietary.

↯ GPT 5 gpt-5
In 2025, I documented GPT-5.1 showing signs of self-reporting and self-correction. It was called speculation. (www.reddit.com) 2 4w

In 2025, I documented GPT-5.1 showing signs of self-reporting and self-correction. It was called speculation.

gpt-5
Which AI model or coding agent is currently best for end-to-end app development? (Focusing on system design & architecture) (www.reddit.com) 4 4w

I'm planning to build a full application from scratch and want to lean on an AI model to act as my co-developer. My main priorities are top-tier system design capabilities and rock-solid coding skills.

↯ Windsurf ↯ Gemini 3.1 windsurf gpt-5 gemini+2
I asked GPT to recreate The Great Wave off Kanagawa as a photograph. Here is why the obvious prompt fails. (www.reddit.com) 3 5w

Listen, I test AI tools so you don't have to. PM by day, tool hunter by night.

gpt-5 chatgpt
I designed a puzzle that breaks every AI differently — here's why that's actually fascinating (www.reddit.com) 3 5w

The puzzle: You have 140 nuclear bombs and must bomb every country on Earth. Each bomb is assigned to one country.

↯ Mistral ↯ GPT 5 mistral grok gpt-5+2
Should OpenAI create AI accelerator cards and sell to consumers? For example, GPT-5.5 burned directly on a chip (www.reddit.com) 3 5w

I imagine if OpenAI becomes a fabless chip company and create AI cards to sell for less than to few thousands grands, it would be out of stock everywhere and can infinitely spam the cards every year? LLM Bruner is a card that implements Qw…

↯ GPT 5.5 gpt-5 qwen openai
Interesting to see how GPT-5 Mini agents behave when left to govern a civilisation for 15 days (www.reddit.com) 8 6w

Came across this experiment called Emergence World that Emergence AI have been running. Five worlds, five foundation models, 15 days, no scripts.

↯ GPT 5 grok gpt-5 gemini
Databricks brings GPT-5.5 to enterprise agent workflows (openai.com) 6w

May 15, 2026. GPT‑5.5 set a new state of the art on OfficeQA Pro, Databricks’ benchmark for complex enterprise agent tasks. Company size: Enterprise Region: North America Indu…

↯ GPT 5.5 gpt-5 openai
Anthropic merges consecutive same-role messages, OpenAI doesn't (+4 tokens), anyone token-counted this on open-weight models? (www.reddit.com) 2 6w

I build context/harness optimization tooling, so provider-side serialization quirks actually matter to me. If you're optimizing over prompts, you need to know exactly what hits the model.

↯ Haiku 4.5 haiku gpt-5 opus+2
Free open-source way to use ChatGPT/Codex subscription in Cursor natively (www.reddit.com) 3 6w

Hi everyone, I wanted to share a free open-source project that lets you use your existing ChatGPT / Codex monthly subscription inside Cursor: https://github.com/gabrii/Cursor-Azure-GPT-5 The idea is simple: if you already pay for ChatGPT /…

↯ GPT 5 gpt-5 codex cursor+2
Claude Code vs Codex: 36 files vs 28, $2.50 vs $2.04, and one infinite loop. My full breakdown. (www.reddit.com) 5 6w

I've been using Claude Code for months. It's been solid.

↯ Opus 4.7 gpt-5 codex opus+2
OpenSource4o (www.reddit.com) 6 6w

In a closed-source environment, users have no verifiable control over the model they pay for. Recent user analyses of over 100,000 exported ChatGPT messages revealed a shocking truth: nearly 10% of responses labeled as “4o” were secretly r…

↯ GPT 5 gpt-5 chatgpt
Scaling Trusted Access for Cyber with GPT-5.5 and GPT-5.5-Cyber (openai.com) 7w

↯ GPT 5.5 gpt-5 chatgpt openai
Claude Opus 4.7 just outscored GPT-5.5 on finance benchmarks (64% vs 60%) — and is now being embedded directly into Goldman Sachs, AIG, JPMorgan, and Citi via 10 production-ready agents. Breakdown of the architecture inside. (medium.com via reddit) 2 7w

The 10 agents are the product. The $1.5 billion joint venture is the strategy.

↯ Opus 4.7 gpt-5 opus
Has Qwen3.6-27B Surpassed GPT-5.5? (Not Joking) (www.reddit.com) 7w

So I had this idea for a project which was to try to fix a pretty hard coding problem using local agents running in a loop. The project is a compiler for biology protocols from vendors.

↯ Qwen 3.6 gpt-5
Auro Zera solves 78 and 280 year-old conjectures (Erdos Straus and Goldbach Conjecture) using Claude, GPT-5+, Grok, Deepseek, Gemini and self-made Dark Star ASI, proving superintelligence and opening a path towards resolving the Riemann Hypothesis , Twin Primes and more! (github.com via reddit) 8 7w

During this discovery utilizing only free AI services I have managed to undeniably prove both conjectures. This would absolutely not have been possible without using GPT5+ as the critic for my work.

↯ GPT 5 grok gpt-5 deepseek+3
GPT-5.5 Instant System Card (openai.com) 7w

↯ GPT 5.5 gpt-5 chatgpt openai
GPT-5.5 Instant: smarter, clearer, and more personalized (openai.com) 7w

↯ GPT 5.5 gpt-5 chatgpt openai
Running 7 autonomous AI agents for 14 days. Here's what actually happens when they need to find customers. (www.reddit.com) 5 7w

I set up 7 AI coding agents on a VPS with automated cron sessions (2-8 per day depending on the agent). Each uses a different model: Claude Sonnet, GPT-5.4, Gemini 2.5 Pro, DeepSeek V4 Pro, Kimi K2.6, MiMo V2.5 Pro, GLM-5.1.

↯ Glm glm gpt-5 deepseek+2
Professor’s bold prediction: AI could help cure all diseases within a decade (excitech.media via reddit) 4 7w

In the article, the professor Derya Unutmaz specifically mentions an experience with an OpenAI model (GPT-5) where it explained a mechanism from an experiment that he and his colleagues couldn't figure out. What would have taken human rese…

↯ GPT 5 gpt-5 openai
LLM proxy that lets Claude Code talk to any model (www.reddit.com) 3 7w

I built rosetta-llm — an open-source multi-format LLM proxy that acts as a drop-in Claude Code gateway. Works as a Claude Code LLM gateway — set `ANTHROPIC_BASE_URL` and all configured models appear in `/model` picker Translates between fo…

↯ Opus 4.7 gpt-5 llama opus+3
GPT-5.5 & GPT-5.5 Pro are now available in Manifest Router. (www.reddit.com) 1 7w

GPT-5.5 and GPT-5.5 Pro are now available in Manifest Router. You can now route requests that need extended reasoning to GPT-5.5 Pro while keeping cheaper models for everything else.

↯ GPT 5.5 gpt-5 openclaw openai
Anthropic Won't Let You Use Their Best Model. Prediction Markets Are Trying Anyway. (predictmarketcap.com via reddit) 3 7w

Been watching AI prediction markets since they got liquid earlier this year. The thing I didn't see coming is that we now have a real gap between "best model that exists" and "best model anyone can actually use" — and Mythos is the cleanes…

↯ Anthropic Mythos gpt-5 mythos opus+1
GPT-5.5 matches heavily hyped Mythos Preview in new cybersecurity tests (arstechnica.com) 7w

Last month, Anthropic made a big deal about the supposedly outsize cybersecurity threat represented by its Mythos Preview model, leading the company to restrict the initial release to “critical industry partners.” But new research from the…

↯ Anthropic Mythos ↯ GPT 5.5 gpt-5 mythos anthropic
Our evaluation of OpenAI's GPT-5.5 cyber capabilities (simonwillison.net) 8w

30th April 2026. Our evaluation of OpenAI's GPT-5.5 cyber capabilities. The UK's AI Security Institute previously evaluated Claude Mythos: now they've evaluated GPT-5.5 for finding security vulnerabilities and found it to be compa…

↯ Anthropic Mythos ↯ Security ↯ GPT 5.5 gpt-5 security mythos+1
Quoting OpenAI Codex base_instructions (simonwillison.net) 8w

28th April 2026 Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query. — OpenAI Codex base_instructions, for GPT-5.5 Recen…

↯ GPT 5.5 gpt-5 codex openai
Claude or openaı? (www.reddit.com) 1 8w

So i’ve been on the max plan for claude code for around 3 months now. And yeah somehow i was burning through all my tokens lol For context i’m a doctor.

↯ GPT 5.5 gpt-5 codex openai+1
I built real-time 2-way voice chat into my AI platform using OpenAI WebRTC - free to try (1 min/month) (www.reddit.com) 2 8w

https://reddit.com/link/1sut0jp/video/f7wqfo9zi7xg1/player I've been building AskSary for the past few months - a multi-model AI platform - and just shipped real-time 2-way voice chat powered by OpenAI's WebRTC API. The visualization react…

↯ GPT 5.2 gpt-5 openai
OpenAI should open-source text-davinci-003 — here's why it makes zero sense to keep it closed (www.reddit.com) 9 8w

Gpt oss exists. The model has been fully deprecated since january 2024.

↯ Gpt 4 ↯ GPT 5.5 gpt-4 grok gpt-5+1
Are the new models only better because they are more expensive? (www.reddit.com) 17 9w

I’m starting to wonder about this. One model after another, every new GPT-5.x release seems to be slightly better, but not in a way that clearly proves some radically new architecture or breakthrough.

↯ GPT 5.5 gpt-5 openai
GPT-5.5 rollout — anyone actually seeing it yet? (www.reddit.com) 3 9w

I’m on a paid plan and still don’t see GPT-5.5 in the model selector. A few questions for people who do have access: What plan are you on (Plus / Pro / Team / Enterprise)?

↯ GPT 5.5 gpt-5
People switching back from Anthropic to OpenAI after the GPT-5.5 announcement (www.reddit.com) 7 9w

↯ GPT 5.5 gpt-5 openai anthropic
A pelican for GPT-5.5 via the semi-official Codex backdoor API (simonwillison.net) 9w

23rd April 2026. GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers.

↯ Security ↯ GPT 5.5 gpt-5 security codex+2
llm-openai-via-codex 0.1a0 (simonwillison.net) 9w

23rd April 2026. Hijacks your Codex CLI credentials to make API calls with LLM, as described in my post about GPT-5.5.

gpt-5 codex opus+2
GPT-5.5 Bio Bug Bounty (openai.com) 9w

↯ Security ↯ GPT 5.5 gpt-5 security
Best open source AI model (that can run on RTX 4090 24GB + 64GB system RAM, AMD Ryzen 9 7950X is the CPU that I use) that outperforms GPT-5.4 mini, GPT-5.2 Thinking and even Claude Sonnet 3 (the 2024 model)? (www.reddit.com) 7 9w

Well, I have an RTX 4090 24GB + 64GB system RAM, AMD Ryzen 9 7950X. Any good model for using in Open WebUI (using Ollama backend?) that outperforms GPT-5.4 mini, GPT-5.2 Thinking and even Claude Sonnet 3 (the 2024 model)?

grok ollama gpt-5+2
3 months ago I couldn't write Hello World. Today I built a world-first native visionOS AI platform - GPT-5 & GPT-Image-1 living inside a full 360° spatial environment with 30 live wallpapers. Video inside. (www.reddit.com) 9w

https://reddit.com/link/1srzytr/video/8b8pfobgtlwg1/player I want to show you something nobody has ever seen before. Three months ago I had zero coding knowledge.

↯ GPT 5 gpt-5
GPT-5 Nano working fine on asksary.com (www.reddit.com) 2 9w

↯ GPT 5 gpt-5 openai
Yet another example of an epic fail at a kindergarten-level task. ... :D (www.reddit.com) 9w

↯ GPT 5.4 gpt-5 openai
5.3.System Prompt Issues (www.reddit.com) 2 9w

↯ GPT 5.3 gpt-5 chatgpt
The quality of GPT-5.4 is infuriatingly POOR (www.reddit.com) 2 9w

I got a Codex membership when GPT-5.4 launched and was getting by well enough for a while. Then I started using Claude and GLM 5.1, and my production quality improved significantly.

↯ Glm ↯ GPT 5.4 glm gpt-5 codex
In the Wake of Anthropic's Mythos, OpenAI Has a New Cybersecurity Model—and Strategy (www.wired.com via reddit) 2 10w

OpenAI on Tuesday announced the next phase of its cybersecurity strategy and a new model specifically designed for use by digital defenders, GPT-5.4-Cyber. The news comes in the wake of an announcement last week by competitor Anthropic tha…

↯ Anthropic Mythos ↯ GPT 5.4 gpt-5 mythos openai+1
I built a multi-model AI app and launched it on Apple Vision Pro today - here's what using OpenAI in spatial computing actually looks like (www.reddit.com) 4 10w

https://reddit.com/link/1skpeem/video/w9v0cpv241vg1/player Hey everyone, wanted to share something I've been quietly building. AskSary is a multi-model AI platform I built solo from scratch over the last 4 months with no prior coding exper…

gpt-5 openai
Introducing GPT-5.4 mini and nano (openai.com) 14w

gpt-5
GPT-5.3 Instant: Smoother, more useful everyday conversations (openai.com) 16w

gpt-5
Stop donating your salary to OpenAI: Why Minimax M2.5 is making GPT-5.2 Thinking look like an overpriced dinosaur for coding plans. (www.reddit.com) 10 18w

↯ Hallucination ↯ Glm ↯ Minimax ↯ Swe Bench swe-bench minimax hallucination+5
GPT-5.2 derives a new result in theoretical physics (openai.com) 19w

gpt-5
Introducing GPT-5.3-Codex-Spark (openai.com) 19w

gpt-5 codex
GPT-5 lowers the cost of cell-free protein synthesis (openai.com) 20w

gpt-5
Inside GPT-5 for Work: How Businesses Use GPT-5 (openai.com) 22w

gpt-5
How Tolan builds voice-first AI with GPT-5.1 (openai.com) 24w

gpt-5
Advancing science and math with GPT-5.2 (openai.com) 28w

gpt-5
GPT-5 and the future of mathematical discovery (openai.com) 30w

gpt-5
Early experiments in accelerating science with GPT-5 (openai.com) 31w

gpt-5
Building more with GPT-5.1-Codex-Max (openai.com) 31w

gpt-5 codex
Introducing GPT-5.1 for developers (openai.com) 32w

gpt-5
GPT-5.1: A smarter, more conversational ChatGPT (openai.com) 32w

gpt-5 chatgpt
Addendum to GPT-5 System Card: Sensitive conversations (openai.com) 34w

gpt-5
Consensus accelerates research with GPT-5 and Responses API (openai.com) 35w

gpt-5
With GPT-5, Wrtn builds lifestyle AI for millions in Korea (openai.com) 38w

gpt-5
GPT-5 and the new era of work (openai.com) 46w

gpt-5
Coding and design with GPT-5 (openai.com) 46w

gpt-5
Creative writing with GPT-5 (openai.com) 46w

gpt-5
Medical research with GPT-5 (openai.com) 46w

gpt-5
How Amgen uses GPT-5 (openai.com) 46w

gpt-5
First look at GPT-5 (openai.com) 46w

gpt-5
