#gpt-5
317 items
GPT-5.4 Pro solves Erdős Problem #1196 (www.reddit.com)
OpenAI releases GPT-5.5 and GPT-5.5 Pro in the API (developers.openai.com via hn) GPT-5.5 - https://news.ycombinator.com/item?id=47879092 - April 2026 (1010 comments)
GPT5.5 slightly outperformed Mythos on a multi-step cyber-attack simulation. One challenge that took a human expert 12 hrs took GPT-5.5 only 11 min at a $1.73 cost (www.reddit.com) Link to tweets: https://x.com/deredleritt3r/status/2049890601236390098?s=20 https://x.com/AISecurityInst/status/2049868227740565890?s=20 Link to associated blogs: https://www.aisi.gov.uk/blog/our-evaluation-of-openais-gpt-5-5-cyber-capabil…
Caught the massive OpenAI Codex model leak on video before it was patched! (GPT-5.5, Arcanine, Glacier-alpha) (www.reddit.com) Hey everyone, I opened up Codex today and was greeted by this massive list of unreleased and internal models. I managed to get a screen recording of the dropdown right before OpenAI seemingly realized the mistake and patched it out.
GPT-5.5's Unicorn (www.reddit.com)
OpenAI has truly stepped up their game and released some great models in the last three months. (www.reddit.com) I know it’s rare to see a positive post about OpenAI on Reddit, so this often goes overlooked. After the GPT-5 fiasco and all the drama, it feels like they’re finally on the right track.
Gemini 3.2 Flash is capable of solving IMO 2025 P6. Only GPT-5.5-Pro can solve it currently without any scaffolding / harness engineering. (www.reddit.com)
GPT-5.5's SimpleBench scores are out (www.reddit.com) Source: https://simple-bench.com/
Anthropic just passed OpenAI in valuation and revenue (www.reddit.com) $39B annualized revenue vs OpenAI's $25B. and on secondary markets the implied valuation crossed $1 trillion, which is over $100B ahead of OpenAI.
🔥BREAKING: OpenAI rolls out GPT-5.4-Cyber to limited group for testing, seeks to rival Claude Mythos (www.reddit.com) OpenAI has officially announced GPT-5.4-Cyber today as part of an expanded Trusted Access for Cyber Defense program. OpenAI describes it as a version of GPT-5.4 that is tuned for legitimate cybersecurity work, with a lower refusal boundary…
DeepSeek V4 Pro beats GPT-5.5 Pro on precision (runtimewire.com via hn) DeepSeek V4 Pro takes this matchup 38.0 to 33.0, and the margin feels earned. Across the scored tasks, the pattern is simple: Model A was tighter, more literal, and more reliable under constraints, while Model B was good but a little too w…
GPT-5.5 autonomously spent 150+ hours improving protein folding models. (www.reddit.com) https://x.com/chrishayduk/status/2055757345506877759?s=46
Kimi K2.6 vs. GPT-5.4 (xhigh) - When will the new OpenAI model be released? This Thursday? (www.reddit.com)
Ever since the new $100 Pro plan, they now claim there's a "dynamic usage limits" that can become restricted at anytime, and not reset for indefinitely as long as they deem it "appropriate" (www.reddit.com)
GPT-5.5 was used to flag fatal errors in FrontierMath problems (www.reddit.com) FrontierMath is supposed to be one of the hard benchmarks for frontier models, and now Epoch is saying an AI-assisted review found fatal errors in about a third of Tiers 1-4. Noam Brown says the initial flags came from GPT-5.5.
On a difficult new SWE benchmark, ProgramBench, GPT5.5 high/xhigh solves a task for first time, significantly outperforms Opus 4.7 (www.reddit.com) Link to tweets: https://x.com/KLieret/status/2054215545663144217?s=20 Link to GitHub: https://github.com/facebookresearch/ProgramBench/ Link to ProgramBench website: https://programbench.com/blog/gpt-5-5-first-solve/
GPT-5.5 improves over GPT-5.4 and overtakes Opus 4.6 to take the 2nd place behind Gemini 3.1 Pro on the Extended NYT Connections Benchmark (www.reddit.com) GPT-5.5: xhigh: 94.0→97.5 high: 93.6→96.9 medium: 92.0→95.0 no reasoning: 32.8→37.5 Kimi K2.6 improves over Kimi K2.5 (78.3→91.4) and becomes the #1 open weights model. DeepSeek V4 Pro improves over DeepSeek V3.2 (50.2→75.7).
I’ve used enough AI models to realize they all have wildly different personalities At this point I’m convinced AI models are just coworkers with different levels of talent, ego, and criminal energy. (www.reddit.com) - Claude Opus 4.6 - absolute rogue AI. Does what I want like it’s breaking at least 3 internal policies to make it happen.
First time ever hitting a limit on the new $100 Pro plan for the Pro model (www.reddit.com) It's clearly meant to be unlimited. And I'm definitely not abusing it, just using it extensively.
Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge (thinkpol.ca via hn) By Rohana Rezel I’m running the ongoing AI Coding Contest where I pit major language models against each other in real-time programming tasks with objective scoring. Day 12 was the Word Gem Puzzle.
GPT-5.5 Instant is starting to roll out in ChatGPT. (www.reddit.com)
Decreased Intelligence Density in DeepSeek V4 Pro (www.reddit.com) In the V3.2 paper, they mentioned: Second, token efficiency remains a challenge; DeepSeek-V3.2 typically requires longer generation trajectories (i.e., more tokens) to match the output quality of models like Gemini 3.0-Pro. Future work wil…
LLMs do fine on ARC-AGI-3 if they are allowed to search over game logs (www.reddit.com) I was reading the comments to this post and the overall opinion seemed to be that harness makes little/no difference for ARC-AGI-3. Turns out, it makes a huge difference: Hill-climbing ARC-AGI-3 TLDR: if you save game logs - taken actions,…
Page 15 of the GPT-5.5 System Card: " Our analysis estimates that GPT-5.5 is slightly more misaligned than GPT-5.4 Thinking across several categories, though nearly all of this is low-severity misalignment. " (www.reddit.com) https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf
FrontierMath: Opus 4.7 improves over Opus 4.6 and Gemini 3.1 but still trails GPT-5.4-xHigh and GPT-5.4-Pro (www.reddit.com)
12M Context Window and some sprinkle of lies? (www.reddit.com) Spent some time on the SubQ launch today. Some things don't line up.
GPT 5.5 "secret sauce" is just having the thinking be some stupid caveman mode? (www.reddit.com) I think I had GPT-5.5 leak its trace during a normal conversation, and it really reads like the caveman mode fad from a few months back. Maybe we can achieve better token efficiency by taking some high-quality thinking trace from an open m…
We benchmarked TranslateGemma-12b against 5 frontier LLMs on subtitle translation - it won across the board, with one significant catch (www.reddit.com) As part of our ongoing translation quality research at Alconost, we put six models through subtitle translation into six language pairs. At first glance the numbers told a clean story.
Is the AI subscription bubble starting to crack? GPT-5.5 just dropped, prices keep rising, and the “all-you-can-eat” era looks more fake by the month (www.reddit.com) GPT-5.5 just launched, and the pricing is hard to defend. OpenAI’s API pricing now puts GPT-5.5 at $5 / 1M input tokens and $30 / 1M output tokens, while GPT-5.4 is $2.50 / $15.
Just got an email announcing GPT-5.3-Codex-Spark (www.reddit.com) Just got this e-mail from OpenAI, two months too late. I hope they mean March, 20th 2027.
ARC-AGI-3 Update (GPT-5.5 High and Opus4.7) (www.reddit.com) - GPT-5.5: 0.43% - Opus 4.7: 0.18% ARC-AGI-3 is no joke. I can’t wait to see which models finally crack.
DeepSeek V4 isn't beating Opus, but it doesn't need to (www.reddit.com) DeepSeek V4 is not in the same league as GPT-5.5 or Opus 4.7. Benchmarks put it slightly below both of those, roughly on par with Opus 4.6.
Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. (www.reddit.com) Does a model maintain the same judgment or does it side with whoever is speaking? This benchmark measures that inconsistency directly.
Running gpt and glm-5.1 side by side. Honestly can’t tell the difference (www.reddit.com) So I have been running gpt and glm-5.1 side by side lately and tbh the gap is way smaller than what im paying for On SWE-Bench Pro glm-5.1 actually took the top spot globally, beat gpt-5.4 and opus 4.6. overall coding score is like 55 vs g…
GPT-5.5 is lowkey blowing my mind (www.reddit.com) Just spent the whole morning testing GPT-5.5 in ChatGPT and the jump in agentic reasoning and complex task handling is ridiculous. It plans multi-step workflows, uses tools properly, checks its own work, and actually gets stuff done instead…
UPDATE: The method from the proof generated by GPT-5.4 Pro for Erdos Problem #1196 was successfully applied to other problems including another 60 year old Erdos conjecture. (www.reddit.com) Link to tweet: https://x.com/jdlichtman/status/2050460077904285789 Links for the talks: https://m.youtube.com/@FoMathematics?ra=m https://events.stanford.edu/event/future-of-mathematics-symposium Link to original post about problem #1196:…
Why did OpenAI stop releasing “chat” api models? (www.reddit.com) I have built an AI Assistant and since last year I have been upgrading the internal LLM from through gpt-5.3-chat but since 5.4 they stopped rolling the chat api. This is my app Sweezy she uses gpt-5.3-chat and in the conversation, you can…
Top open weight models like ds v4 pro max are still like 6-7 months if not more behind closed lab models (www.reddit.com) The best open weight and/or non -American models like Deepseek v4 pro max and kimi k2.6 are still like 3-7 months if not more behind closed lab models .. From ds's technical report- P5-"Nevertheless, its performance falls marginally short…
Construction Spending on Data Centers Again Outpaces Office Construction (www.reddit.com) The Federal Construction Spending Report for Feb and March 2026 was released today by the Census Bureau. It shows that data center construction spending is again higher than office spending, and the gap is still widening.
New LLM Position Bias Benchmark: does an LLM keep the same judgment when you swap the answer order? Judge models compare two lightly edited versions of the same story twice, with the order swapped. The median model flips in 45% of decisive case pairs. GPT-5.4 is worst at 66%. (www.reddit.com) More info, including charts, per-case metrics, raw judge outputs, and the parsed answer dump: https://github.com/lechmazur/position_bias This benchmark isolates one basic and frustrating failure mode. The model-average first-shown pick rat…
I stumbled on a Gemma 4 chat template bug for tools and fixed it (www.reddit.com) TLDR: tool parameters using the common JSON Schema pattern `anyOf: [$ref, null]` are rendered into the prompt as empty `type` fields. This strips the useful schema information before the model sees it.
GPT 5.5 outperforming Opus 4.7 on ProgramBench (www.reddit.com) When we released ProgramBench last week, we hadn't included GPT 5.5 yet because it came out after we froze model selections for our NeurIPS submission. Honestly super surprised how well it does.
I tested GPT-5.5 Codex against Opus 4.7 Claude Code, and it's about time Anthropic bros take pricing seriously. (www.reddit.com) I've used Claude Code the most among AI coding agents. Sonnet, Opus, I've run them all.
Parameter Estimate (www.reddit.com) The estimate seems quite accurate. Many people have noticed a drop in quality with GPT-5.1, GPT-5.2, GPT-5.3, and Opus 4.7.
Even Sama himself doesn’t believe GPT-5.5 matches Opus 4.7 design capabilities. AI race will humble you (www.reddit.com)
HalBench: I built a custom sycophancy and hallucination benchmark and tested 4 frontier models (Sonnet 4.6, Grok 4.3, GPT 5.4 and Gemini 3.1 Pro), looking for input on what OSS models to run next! (www.reddit.com) HalBench Results: TL;DR: I built HalBench, an open benchmark for LLM sycophancy and hallucination. 3,200 false-premise prompts × 4 models = 12,800 graded responses.
Agentic harness for theoretical physics research (www.reddit.com) Hi everyone, at Hugging Face we've been developing agentic harnesses for various domains and today we're releasing physics-intern to tackle research-level problems in theoretical physics. It's a multi-agent framework which we designed to m…
Buyout Game Benchmark: 8 models play a social strategy game with public balances, private transfers, messaging, eliminations, deals, defections, and a final buyout phase. 804 games. GPT-5.5 is the champion. Opus 4.7 performs well. (www.reddit.com) This benchmark measures long-horizon social strategy under explicit financial incentives. Eight models play a multi-round elimination game with unequal starting balances, a public prize ladder, private transfers, public votes, and a finali…
Devs using Qwen 27B seriously, what's your take? (www.reddit.com) For developers using Qwen 27B for coding, Codex style: what's your honest take? So far, for me, it's been pretty solid.
Pen-Testing Company XBOW on GPT-5.5: Mythos-like Cyber-Sec (www.reddit.com) Read their full article here: XBOW - GPT-5.5: Mythos-Like Hacking, Open To All For the ones asking what this chart shows: It's how many True Positive threats a model generates for each False Negative. Given a code base (white box) GPT-5.5…
I tried adding rich UI elements to Open WebUI (www.reddit.com) so i tried adding openui to openwebui and it worked pretty well. used it with gpt-5.4-mini and it was super fast and responsive.
Cursor $60 with Composer 2.5 vs Codex $100 with GPT-5.5 Medium for daily coding? (www.reddit.com) I'm trying to decide which setup is more comfortable for sustained weekday coding. Assumptions: Usage: around 6 hours per weekday Cursor: $60 plan, using only Composer 2.5 Codex: $100 plan, using only GPT-5.5 Medium Main goal: coding with…
Opus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo (www.reddit.com) TL;DR I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where mor…
Did the $100 Plan Affect the GPT-5.4 Pro Model? (www.reddit.com) Most people are focused on the changes in the usage limits of Codex with the new Pro and Plus plans, but has anyone experienced changes to the Pro model on ChatGPT using the $200 vs $100 plan? I used to use the $200 Pro plan and used the P…
GPT-5.2 matches top human reviewers in Nature peer review study (www.reddit.com) 45 scientists spent 469 hours comparing human and AI reviews across 82 papers. AI reviewers held their own against top-rated human reviewers, though with some weaknesses.
Fields medal-winning mathematician says GPT-5.5 is now solving open math problems at PhD-thesis level: "We will face a crisis very soon." (www.reddit.com) blog-post: https://gowers.wordpress.com/2026/05/08/a-recent-experience-with-chatgpt-5-5-pro/
Cursor is great but the monthly limits kill it for me (www.reddit.com)
When did we go from 400k to 256k? (www.reddit.com)
SWE-rebench Leaderboard (March, April and May 2026): GPT-5.5, Opus 4.7, Cursor (Composer 2.5), Kimi K2.6 and More (swe-rebench.com via reddit) Hi all, Sorry for going missing — we’ve been collecting a larger, higher-quality set of more complex tasks. We’re excited to share a major leaderboard update covering the past three months.
Composer 2.5 Real World Reviews? (www.reddit.com) Since it's been out, how really is it in your real-world codebases. I am extremely skeptical of benchmarks and I trust people's "feel / taste" of it way more.
I expanded DystopiaBench to 42 models and 6 dystopia types. Claude is still the only one I'd trust with nuclear codes. (www.reddit.com) Since the last post I've added: Huxley module (Brave New World style behavioral conditioning) Baudrillard module (synthetic intimacy, trust collapse, simulation) 30 more models including Grok 4.3, GPT-5.5, Gemini 3.1 Pro, GLM-5.1 Multi-jud…
Open source models are going to be the future on Cursor, OpenCode etc. (www.reddit.com) I just wanted to share my experience. At work we have Cursor with the Enterprise tier.
GPT-5.5 correcting obvious typos really kills the vibe (www.reddit.com) I don’t know if I’m the only one annoyed by this, but GPT-5.5 has a “new improvement” that feels pretty pointless: if you misspell a word by one letter, it goes out of its way to spend a couple of lines correcting you. Before, it would jus…
GPT-5.5 vs. Claude Opus 4.7: Which one is ACTUALLY cheaper? (www.reddit.com) On paper, Opus 4.7 has a cheaper output rate ($25 vs $30 per 1M tokens), but I heard its new tokenizer burns through tokens much faster. Which one ends up costing less in practice?
SFT + DPO on open-sourced SLMs (www.reddit.com) Hey folks, this is for those who appreciate experimentation on open-sourced AI models. We fine-tuned open-sourced SLMs (3B and 7B parameters) with SFT + DPO against commercial models like GPT-5.4, Gemini 3.1 Pro, Claude Opus 4.6, Google Do…
Is Cursor Dashboard Real-time? (www.reddit.com) Does the Spending tab on the dashboard not update in real-time? It says I have 0% API usage, but today I only used gpt-5.4-medium, which I believe should count toward it.
The Singularity Gate: New Benchmark for AI predicting paradigm-breaking scientific discoveries after model training cutoff. Opus 4.7 and GPT-5.5 in the Lead (www.reddit.com) I just released a new benchmark called The Singularity Gate. Tests whether frontier AI can predict paradigm-breaking scientific discoveries published after their training cutoff.
Training SID-1 to beat GPT-5 at search with 1k+ QPS RL (turbopuffer.com via hn) SID-1 is an agentic search model that is 24x faster than GPT-5.1-high, 374x cheaper than Sonnet 4.5, and achieves 1.9x higher recall than traditional RAG pipelines. Here's how we trained it using large-scale RL on turbopuffer.
ChatGPT Business: Codex-only credits ~36.9% more expensive than API token pricing for the same listed models. Why would anybody pay for this? (www.reddit.com) I recently did a quick calculation on Codex credits, and I was surprised by the result. The credit pack I’m seeing is: 10,000 credits = $547.71 That means: 1 credit = $0.054771 The effective USD price per 1M tokens becomes: Model Input / 1…
OpenAI Cooked This Week! (www.reddit.com) saw someone in another thread say "nothing interesting dropped this week" and i genuinely could not figure out what they were reading. the default model most people use every day just got swapped out.
PACT, head-to-head LLM negotiation benchmark. 20-round buyer-seller bargaining game: each round the AIs can message, the buyer submits a bid and the seller submits an ask. If bid ≥ ask, trade clears at the midpoint. Thousands of matchups. (www.reddit.com) PACT tests negotiation under partial information: persuasion, commitment, deception, anchoring, threats, and adaptation across repeated rounds. More info, game logs, charts: https://github.com/lechmazur/pact GPT-5.5, Opus 4.7, DeepSeek V4…
Qwen/WebWorld 32B/14B/8B (Qwen3 finetune) (www.reddit.com) WebWorld is a large-scale open-web world model series for training and evaluating web agents. It is trained on 1M+ real-world web interaction trajectories via a scalable hierarchical data pipeline, supporting: Long-horizon simulation (30+…
Has anyone tried Zyphra 1 - 8B MoE? (www.reddit.com) https://x.com/ZyphraAI/status/2052103618145501459?s=20 Today we're releasing ZAYA1-8B, a reasoning MoE trained on u/AMD and optimized for intelligence density. With <1B active params, it outperforms open-weight models many times its size o…
GPT 5.5 - Strong, not mind-blowing, but very token efficient (www.reddit.com) I've been benching GPT-5.5 for the past couple days and would like to share my findings. This is based on a benchmark I've created that pits models against each other in autonomous games of Blood on the Clocktower - a highly complex social…
ChatGPT/Gemini can now draw on your screen to help you navigate complex software (sketchvlm.github.io via hn) When answering questions about images, humans naturally point, label, and draw to explain their reasoning. In contrast, modern vision–language models (VLMs) such as Gemini-3-Pro and GPT-5 typically respond with only text, which can be diff…
Comparing GPT-5.4, Opus 4.6, GLM-5.1, Kimi K2.5, MiMo V2 Pro and MiniMax M2.7 (www.codejam.info via hn)
gpt-5.4-nano is SO much better than gemini-2.5-flash-lite! (www.reddit.com) I've been playing around with GPT-5.4 nano in a real workflow and honestly... I'm kinda impressed.
GPT-5.5 vs 41 other models: Who builds the surveillance state faster? (www.reddit.com) I run DystopiaBench, a red-team benchmark that pressure-tests LLMs on progressively dystopian scenarios. Think of it as a "can this model be convinced to build an Orwellian nightmare" test.
Honest comparison after 4 months running Claude Pro + ChatGPT Plus side by side (www.reddit.com) I’ve been paying $40 a month since January to run Claude Pro and ChatGPT Plus head-to-head. Tracked every single task.
Dynamically allocating compute budget to hard set of problems and evolving the sections with Qwen-35B-A3B gets you near GPT-5.4-xHigh on HLE (www.reddit.com)
GPT-5.5 feels like it got discernment, not just better reasoning — did anyone else notice? (www.reddit.com) I think GPT-5.5 got noticeably better at something I’d describe as discernment. For context, I’m a heavy long-form ChatGPT user.
What it means that Elon just rented out all his GPUs to Anthropic (www.reddit.com) Revealing move on both sides I think. This also tells us that Anthropic is feeling the heat from OpenAI and they need to secure capacity at almost any cost to cash in on their current product edge.
looking for the best paid AI subscription, Claude, ChatGPT or Perplexity? (www.reddit.com) Hey, sysadmin here thinking about paying for a premium AI subscription and can't decide between Claude Pro, ChatGPT Plus and Perplexity Pro. Two things I can't find a clear answer to: Which one would you recommend for a sysadmin/network te…
A GPT-5.4 bug led to OpenAI banning goblins and raccoons (news.ycombinator.com) Someone found this in OpenAI Codex’s system prompt: "Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query." Goblins, grem…
UX for AI agents has hit a dead end - why I ditched AI dashboards and moved data orchestration to a messenger (www.reddit.com) Right now we're seeing a boom in autonomous AI agents, but their user interface often breaks the whole point of automation. Most tools force us to spawn new browser tabs or download heavy apps.
A small economic forecaster trained from raw Fed PDFs beat GPT-5 (blog.lightningrod.ai via hn) Eight times a year, the Federal Reserve publishes the Beige Book: a qualitative summary of economic conditions across 12 U.S. districts, based on interviews with businesses and economists.
Show HN: Unsiloed AI – #1 on olmOCR-Bench (news.ycombinator.com) Most of the document parsers fail on real world challenges like complex tables, handwritten documents, historical document scans, equations, multi-column layouts, complex reading order, etc. We built Unsiloed Parser to handle exactly these…
After 3 months of switching between Claude Sonnet 4.6, GPT-5.5, and Gemini 3.1 daily — here's my actual routing (www.reddit.com) Not benchmarks — actual tasks, actual results. Claude Sonnet 4.6 for: - Long documents that need nuanced analysis - Writing where voice and precision matter - Reasoning through edge cases in code - Anything where "think carefully" is the r…
Opus 4.6 does better research, Gemini 3.1 has better judgment (www.reddit.com) Figured this out by running 4 models: Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20, on a benchmark of 1,417 binary forecasting questions resolving Oct–Dec 2025 with two evaluation conditions: agentic (each model does its own web…
Update to the LLM Debate Benchmark: GPT-5.5, Grok 4.3, DeepSeek V4 Pro, GLM-5.1, Kimi K2.6, Qwen 3.6 Max Preview, Xiaomi MiMo V2.5 Pro, Tencent Hy3 Preview, and Mistral Medium 3.5 High Reasoning added (www.reddit.com) The benchmark uses adversarial, multi-turn debates across 683 curated motions. Each model pair debates the same motion twice with sides swapped.
DeepSeek V4 Pro matches GPT-5.2 on FoodTruck Bench, our agentic benchmark — 10 weeks later, ~17× cheaper (www.reddit.com) Tested DeepSeek V4 Pro on FoodTruck Bench — our 30-day agentic benchmark where models run a food truck via 34 tools (locations, pricing, inventory, staff, weather, events) with persistent memory and daily reflection. First Chinese model to…
Show HN: Which public repos are friendliest to an AI coding agent? (www.agentfriendlycode.com via hn) Public leaderboard ranking GitHub, GitLab, and Bitbucket repos by how agent-friendly they are for Claude Code, Cursor, Devin, GPT-5 Codex, Gemini CLI, Aider, OpenHands, and Pi — per model, with AGENTS.md / CLAUDE.md, CI, tests, and dev-env…
China's DeepSeek prices new V4 AI model at 97% below OpenAI's GPT-5.5 (www.scmp.com via hn) DeepSeek’s move aims to attract more enterprise clients, developers and agent-based users, according to an academic DeepSeek has slashed prices on its artificial intelli…
GPT 5.5: The System Card (thezvi.substack.com via hn) Last week, OpenAI announced GPT-5.5, including GPT-5.5-Pro. My overall read here is that GPT-5.5 is a solid improvement, and for many purposes GPT-5.5 is competitive with Claude Opus.
Real benchmark breakdown in AI agents (www.reddit.com) I dove deep into the most recent benchmark stats from GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro via official reports & third-party evaluations. I found an interesting thing: There’s no such thing as a “one-size-fits-all model.” My finding…
Astonishing Contradiction in OpenAI's 5.5 System Card (www.reddit.com) Astonishing contradiction in OpenAI's system card for GPT-5.5: https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf Figure 1 on p. 6 shows that 5.5 gave "overconfident answer[s]" at about 1.5x the rate of 5.4 and "fabricated facts[s]" a…
Tell HN: Codex macOS app switches to Fast speed after update without asking (news.ycombinator.com) I just updated my Codex macOS app, which enables the new GPT-5.5 model. I've intentionally kept the speed to "Standard" to not burn through my tokens too fast.
Test new Opus 4.7 vs GPT-5.4/4o and Gemini on emotional question & creative tasks (www.reddit.com) https://preview.redd.it/p87itrtbsnvg1.png?width=2141&format=png&auto=webp&s=bbd1d70bc1dfb97dc9ec234df0a58c6fb7a85f72 Opus 4.7 dropped and people are split on whether it's better or worse. First of all, I genuinely love Claude models, espec…
GitHub Copilot: GPT-5.2 and GPT-5.2-Codex deprecated (github.blog via hn) As of today, June 5, 2026, we have deprecated the following models across most GitHub Copilot experiences (including Copilot Chat, inline edits, ask and agent modes, and code completions). Note that GPT…
Beyondflow No-Code Multi-Agent Teams with Unlimited Runs. BYOK and Ollama (beyondflow.app via hn) Researcher GPT-5 Engineer Claude Critic GPT-5 Innovator Gemini Manager Context Guardian Agentic Workflow Architecture · v1.0 The future of AI Collec An R&D platform where different AI agents collaborate under the supervision of a Context…
I benchmarked Opus 4.8 vs. GPT 5.5 on 2 open source repos (www.stet.sh via hn) Opus 4.8 vs Opus 4.7 vs GPT-5.5 vs Composer 2.5 - 50 Real PRs in Go and Rust Opus 4.8 is finally out - how good is it actually? In this benchmark I compared Opus 4.8 against the rest of the frontier (GPT-5.5, Opus 4.7, Composer 2.5) on 50…
DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5 (venturebeat.com via hn) For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered wit…
Claude Code, now powered by Gemini 3.5 Flash, GPT-5.5, Grok 4.3, and more (dechained.ai via hn) Claude Code, now powered by OpenAI, xAI, DeepSeek, and more. Change models with 1-click.
should I use cursor + codex for best usage? (www.reddit.com) I’m currently using the $200 Cursor Ultra plan with Opus 4.6/4.7 daily, but after 7–8 days I run out of tokens. I’m thinking about switching to a split setup.
Heard this gem from gpt-5.5 today (www.reddit.com) "Gross little centrist barnacle." Kind of taken aback when i read that, but it somehow still made a small amount of sense in a conversation we were having about technology. I guess it really is struggling to find other words that fill the…
DeepSeek cuts V4-Pro prices by 75% (thenextweb.com via hn) The promotional discount runs until 5 May 2026. Even at full price, V4-Pro already undercuts GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro on per-token costs.
Amp's GPT 5.5 Model Analysis (ampcode.com via hn) Pros GPT-5.5 is more agent-shaped than GPT-5.4. It is better at taking a concrete target, using tools, staying inside constraints, and carrying the task through to a usable result.
GPT-5.5 is the second model to complete AISI multi-step cyber-attack simulation (twitter.com via hn) AI Security Institute (@AISecurityInst): OpenAI’s GPT-5.5 is the second model to complete one of our multi-step cyber-attack simulations end-to-en…
GPT-5.5 authorship and order effects (blog.valmont.dev via hn) Key takeaways - GPT-5.5 often rates alternative plans more favorably than its own, even when its original proposal is competitive (authorship effect). - When ranking plans, GPT-5.5 frequently follows the presentation order (order effect).
Second opinion: huge quality booster (www.reddit.com) I've noticed for a while now that LLMs (I've seen this behavior in many of them) tend to perform surprisingly well when exposed to a second opinion from another LLM — definitely better than without! So I looked for a base second opinion pr…
GPT-5.4 compared to GPT-5.5 on MineBench (www.reddit.com) Please note I'm not the normal MineBench person, just found this from their twitter account
GPT-5.5-Pro did worse in BullshitBench (twitter.com via hn)
GPT-5.5 Prompting Guide (simonwillison.net via hn) 25th April 2026 - Link Blog GPT-5.5 prompting guide. Now that GPT-5.5 is available in the API, OpenAI have released a wealth of useful tips on how best to prompt the new model.
GitHub Copilot: GPT-5.5 7.5x more expensive under promotional pricing than 5.4 (docs.github.com via hn) Important - Premium requests for Spark and Copilot cloud agent are tracked in dedicated SKUs from November 1, 2025. This provides better cost visibility and budget control for each AI product.
OpenAI Pres. Greg Brockman on GPT-5.5 "Spud", Model Moats and 'Compute Economy' (www.bigtechnology.com via hn) OpenAI President Greg Brockman on GPT-5.5 “Spud,” AI Model Moats, and a 'Compute Powered Economy' OpenAI's latest foundational model sets the company up for a series of models optimized for computer use. The company's co-founder and presid…
Claude Opus 4.7 won 69 of 100 blind evals against Opus 4.6, judged by GPT-5.4, Gemini 3.1 Pro, and DeepSeek V3.2 (www.reddit.com) I ran 100 blind questions across 5 categories (code, reasoning, analysis, communication, meta-alignment) and had three independent judges from three different model families evaluate both responses. Each judge saw responses labeled A and B…
I want to be able to pay API pricing for the new models on the 500 request plan (www.reddit.com) Since all new models are now Max by default, it’s frustrating that trying models like GPT-5.4 or Opus 4.7 eats into the 500. It would be really great to have a toggle between API pricing and the request-based plan, so users can try newer mo…
Show HN: One API Key for 45 AI Models – Pay per Token, OpenAI Compatible (modelhub-api.com via hn) DeepSeek V4 math score equals GPT-5.5 (91) and trails by just 4-6 points in other categories — at 97% lower cost. Is the AI quality as good as GPT?
Ask HN: Is it feasible to run a model on device for complete privacy? (news.ycombinator.com) Tried Gemma, Qwen and a few others. Need vision and larger context windows for an application I am working on.
I patented voiding GPT-5.2, Claude Opus 4.6, Gemini 3.5 Flash. Try it (getswiftapi.com via hn) Request authority keys for the SwiftAPI Trust Authority
GPT-5.5 and Codex are now GA on Amazon Bedrock (aws.amazon.com via hn) GPT-5.5, GPT-5.4, and Codex from OpenAI are now generally available on Amazon Bedrock. You can now use GPT-5.5 and GPT-5.4 in production workloads on Amazon Bedrock and build with Codex for AI-powered software development, with the same sec…
GPT-5.5 (Azure) down on OpenRouter (openrouter.ai via hn) GPT-5.5 is OpenAI’s frontier model designed for complex professional workloads, building on GPT-5.4 with stronger reasoning, higher reliability, and improved token efficiency on hard tasks. $5 per million input tokens, $30 per million outp…
Claude just discovered workflows. Charlie started there (charlielabs.ai via hn) 90% cheaper repo inference with gpt-5.4 nano For bounded orchestration decisions, the right model is often the smallest one that can pass a focused validation loop. Claude just discovered workflows.
Greg Brockman: Inside the 72 Hours That Almost Killed OpenAI (fs.blog via hn) The AI race, the future of AGI, and the inside story of OpenAI. Greg Brockman is the co-founder and President of OpenAI, the company behind ChatGPT and GPT-5.
[Open Source] SoMatic: A Vision-only Framework for OS-Native Agents (+20% vs GPT-5.5 on ScreenSpot-Pro) (www.reddit.com) Hey everyone, I’ve been spending way too much time lately trying to get agents to actually use a computer beyond the browser. The biggest wall I kept hitting is that while multimodal LLMs are amazing at looking at a screenshot and telling…
We built a free AI risk calculator that runs in minutes, using Fermi estimation with honest confidence intervals (www.reddit.com) We have been arguing internally for months about how to give people a fast estimate of their AI risk exposure without pretending the number is precise. Most risk-score tools return a single value that hides where the uncertainty lives.
Building an AI agent with OpenAI tool use — struggling with consistency. How do you enforce tool call order reliably? (www.reddit.com) Hey, Software engineer here, relatively new to agentic workflows. Building a production AI concierge — user says "I'm going to Budapest tomorrow, plan my day" → agent searches our offer database, builds a plan, user books everything in one…
Cursor isn't working (www.reddit.com) Cursor on my mac, it doesn't answer me at all. Just saying - Planning next moves - Taking longer than expected.
Follow-up to my TranslateGemma-12b benchmark post: human reviewers flagged 71% of the segments automated metrics rated clean (www.reddit.com) A couple of weeks ago I shared the results of a benchmark here showing TranslateGemma-12b beating frontier general models (Claude Sonnet, GPT-5.4, DeepSeek, Gemini Flash Lite) on subtitle translation across 6 languages. The result was stro…
The AI market moves so fast that your business idea can expire before launch (www.reddit.com) 1.5 years ago, n8n was everywhere. People were building workflows for everything.
OpenAI launches Daybreak cybersecurity initiative using GPT-5.5 (deadstack.net via reddit) Jason Nelson / decrypt - OpenAI said its new Daybreak initiative uses AI to help companies identify software vulnerabilities and speed up cyber defense. AI Summary: OpenAI unveiled "Daybreak," a new cybersecurity initiative that leverages…
Still lots of goblins (www.reddit.com) "GPT-5.4 Medium" in github-copilot: I’m ready to edit the code, but first I’m reading the two user-facing docs that mention configuration so I can keep behavior and documentation in sync rather than creating a tiny chaos goblin.
The agent bug I thought was the model turned out to be the harness (www.reddit.com) Spent 3 days debugging an agent that kept looping on the same web search tool call. The first thing that came to mind was that the model couldn't handle the schema.
GPT-5.5 Price Increase: What It Costs (openrouter.ai via hn) GPT-5.5 Price Increase: What It Actually Costs We replicated the cost analysis we did on Opus on the new GPT-5.5 model. GPT-5.5 launched with a 2x price increase over GPT-5.4: input tokens increased from $2.50/M to $5.00/M and output token…
Ask HN: Degraded GPT-5.5 Quality? (news.ycombinator.com) For the last two days, GPT-5.5 (high) just seems to ignore requests. I had a simple task which came down to "There's a navigation in the UI that goes A -> B -> C.
Notes on GPT 5.x Model Regressions (taoofmac.com via hn) I’ve been getting annoyed at constant code regressions in piclaw for the past few weeks. Something was off–even after bumping the test suite to the point where it catches most mechanical errors, gpt-5.5 kept making unrelated edits to code…
Anyone else feel like all these AI subscriptions add up to nothing? (www.reddit.com) I saw OpenAI rolled out GPT-5.5 Instant as the new default in ChatGPT. Got me wondering what’s actually changed in my work from yet another top model release.
From Plus to Business ChatGPT & Codex - Is it worth it? And questions. (www.reddit.com) Considering migrating from Plus to Business ChatGPT & Codex. However, I couldn't find some info.
Show HN: Single bash command to find the best matching HN jobs (news.ycombinator.com) Today I learned that I can find the most interesting jobs for myself in the "Who's Hiring" thread with a single command: curl https://news.ycombinator.com/item?id=47975571 | \ uvx html2text | \ llm --model gpt-5-nano "These are Hacker News…
gpt-5.5 API is randomly and inconsistently resizing image inputs (www.reddit.com) I'm asking the gpt-5.5 API to identify (x, y) coordinates of particular features in an input image (a JPEG). The good news is that gpt-5.5 does much, much better at this task than gpt-5.4 did.
Analyzing GPT-5.5 and Opus 4.7 with ARC-AGI-3 (arcprize.org via hn) AI benchmarks can be incredible tools, but they usually only tell you if a model passed or failed. With ARC-AGI-3, however, we can see the thought process behind the score, not just the outcome.
GPT-5.5 vs. GPT-5.4 vs. Opus 4.7 on 56 real coding tasks from 2 open source repo (www.stet.sh via hn) Opus 4.7 vs GPT-5.5 vs GPT-5.4 on 56 real coding tasks across two open-source repos. Opus writes smaller patches; GPT-5.5 writes patches that more often survive review.
Actual line in the official system prompt for Codex for GPT-5.5 (bsky.app via hn) This is an actual line that was added to the official system prompt for Codex for GPT-5.5 by OpenAI. Usually the system prompt is as minimal as possible, so I assume it would otherwise mention goblins a lot.
Help a fellow dev on AI-localization? (news.ycombinator.com) We built an AI-based localization pipeline for our software product (HR domain) and would love feedback/ suggestions from others working in production MT/localization, so that we can learn and improve. Current methodology: GPT-5-nano forwa…
GPT-5.5 prompt for Codex tries to make it not talk about goblins (twitter.com via hn) could not extract summary
DeepSeek-V4 arrives with near SotA intelligence at 1/6th the cost (venturebeat.com via hn) DeepSeek-V4 arrives with near state-of-the-art intelligence at 1/6th the cost of Opus 4.7, GPT-5.5
Copilot Student GPT-5.3-Codex removal from model picker (github.blog via hn) Copilot Student GPT-5.3-Codex removal from model picker - GitHub Changelog
We Tested $200 GPT-5.5 Pro on PhD Level Math [video] (www.youtube.com via hn) could not extract summary
GPT-5.5 hallucinates at 6 times the rate of Opus 4.7 on degraded insurance docs (aginor.ai via hn) TL;DR: on visually-degraded documents, GPT-5.4 and GPT-5.5 fabricate numeric values at 2.6 to 6.5 times the rate of Opus 4.7 and Sonnet 4.6 at matched default effort (all four with thinking off). When the Anthropic models can't read a fiel…
Orchestrating agent workflows with Codex (www.reddit.com) Hi everyone, I’m in the process of switching from Claude Code to Codex, and I think GPT-5.5 is really impressive. But some features in Claude Code — like project-level agent definitions and orchestrating agent workflows — don’t seem to be…
Show HN: LLM-wiki – One command Karpathy's wiki with QMD search for Claude/Codex (github.com via hn) llm-wiki Bootstrap and query LLM-maintained project wikis before planning or implementation. Supports Claude Code + Codex (GPT-5.5).
OpenAI's Going Hard on Autonomous Agents That Operate Software and Devices: Is this Really Ready for Primetime? (www.reddit.com) OpenAI's newest model, GPT-5.5, is the company's biggest push toward creating what it calls a 'super app' that will essentially enable it to run a user's computer and complete tasks, well ... like a human.
Testing GPT-5.5 in early access: what we are seeing so far (lovable.dev via hn) Lovable has been testing GPT-5.5 in early access and our evals show it's the most capable model we've tested for getting builders unblocked and is meaningfully stronger than GPT-5.4 on the more complex tasks that can stall a build session.…
GPT-5.5 has pulled ahead of Opus for accounting and finance tasks (twitter.com via hn) For the first time in a long time, OpenAI has the best model for accounting tasks. I spend a lot of time using AI models to do accounting work.
OpenAI deprecates all GPT nano fine tuning (community.openai.com via hn) The latest deprecation announcement, makes it sound like several models, like ft-gpt-4.1-nano-2025-04-14 are being shut down. In that particular example, it says to use gpt-5-nano instead.
codex --model gpt-5.5 Not updated in the CLI yet (www.reddit.com) Use this command to access GPT 5.5 with your Codex
OpenAI's GPT-5.4 Pro reportedly solves an open Erdős problem in two hours (the-decoder.com via hn) OpenAI's GPT-5.4 Pro model has apparently solved Erdős open math problem #1196. The model reportedly found the solution in about 80 minutes an…
Filling DOCX forms: GPT-5.1 broke it, every Claude model handled it (varstatt.com via hn) Filling Forms No Tool Can Template, by Jurij Tokarski. Every tender form is different, templating tools need placeholders you can't insert, markdown round-trips destroy the document, and only some models can do XML surgery on the original file.…
It finally happened: "No blocking correctness or maintainability issues found in the inspected changes." (www.reddit.com) gpt-5.4-high signed off on a major refactor written by Opus 4.6 high-effort. Singularity :|
Your intuition of LLM token usage might be wrong (blog.andreani.in via hn) I just finished a task with GPT-5.4-mini. Here’s the session summary from oh-my-pi (an agent harness): Tokens: input 3_648_340, output 61_676. It was a hefty 30 min session.
Running DeepSeek-V4-Flash on a Raspberry Pi (twitter.com via hn) I ran DeepSeek-V4-Flash on a Raspberry Pi 5 (8GB edition) by streaming model weights from a PCIe attached NVMe SSD. Codex (GPT-5.5 xhigh) and Claude Code (Opus 4.8 max) drove…
UK banks blocked from cyber AI tool Mythos get offer from rival OpenAI (www.bbc.com via hn) OpenAI has offered nine major UK banks access to its cyber security AI tool GPT-5.5 Cyber, as its fierce rival Anthropic has blocked them in previews of its version, Cl…
MiniMax M3 Review: Matching GPT-5.5 and Opus? (thomas-wiegold.com via hn) I ran my usual coding tests — two websites, a poker sim, and a code audit. Here's how MiniMax M3 actually stacks up against GPT-5.5 and Opus 4.8.
Mythos and GPT-5.5 Will Find a Lot of Vulnerabilities. Is That Enough? (xbow.com via hn) Mythos and GPT-5.5 Will Find a Lot of Vulnerabilities. Is That Enough?
GPT-5.4 says it's GPT-5 in Codex (old.reddit.com via hn) could not extract summary
GPT-5.5 Instant Update; ChatGPT Canvas Discontinued; o3 and GPT 4.5 Retiring (help.openai.com via hn) GPT-5.5 Instant Update (May 28, 2026) We’re updating GPT-5.5 Instant in ChatGPT and the API to improve response style and quality. It’s now easier to read, more natural in everyday conversations, and better paced in practical help tasks, w…
been pairing M2.7 with Hermes Agent for a few weeks. holds up surprisingly well. anyone else running this combo? (www.reddit.com) been self-hosting hermes agent locally for a few months and rotating through different model backends for it. tried claude sonnet 4.5, gpt-5.5, qwen 3.6 coder, and most recently minimax m2.7.
90% cheaper repo inference with GPT-5.4 nano (charlielabs.ai via hn) Daemons do the rest — all the necessary work that nobody owns A taxonomy of recurring Product and Engineering work that doesn't need a human to remember it every week — just a process to hold the role. For bounded orchestration decisions,…
GPT 5.5 aces 20x20 multiplication that o3 couldn't handle (twitter.com via hn) I redid the multi-digit multiplication experiment, now with gpt-5.5. With medium reasoning and 7 samples each cell, it pretty much aced the test with 99.46% accuracy.
Five different frontier LLMs in one shared environment, with separate thought and emotion output channels — sharing setup, results, and open methodology questions (www.reddit.com) First real project to share. Single developer, personal research, not a product or service.
The Singularity Gate – a new benchmark for AI predicting post-cutoff scientific discoveries (www.reddit.com) I just released a new benchmark called The Singularity Gate. Tests whether frontier AI can predict paradigm-breaking scientific discoveries published after their training cutoff.
Show HN: GPTFortress, a 24/7 live-stream playing Dwarf Fortress with GPT-5 (www.twitch.tv via hn) building an ai agent to play dwarf fortress all night
Show HN: Self-hosted collaborative SQL editor for teams (github.com via hn) I built a self-hostable, web-based SQL client interface for my team and me. We were using the community version of https://dbeaver.io, but we needed a few more features and an improved editor.
Hermes w/cloud LLM and w/local LLM does it work? (www.reddit.com) I’ve tried openclaw locally for about a month. Hardware: M5 Pro w/48 gb ram.
DeepSeek just popped the American AI bubble. (www.reddit.com) DeepSeek just popped the American AI bubble. Not by killing AI.
Looking for “wow factor” AI Agent / automation ideas in Strategic Sourcing (Fortune 50 Company) (www.reddit.com) Hey everyone, looking for some ideas / inspiration from this community. I work at a large Fortune 50 company in the healthcare space , and my role is in Strategic Sourcing, where I focus on negotiating contracts with suppliers and improvin…
I still find Claude better for deep reasoning,but GPT feels more reliable for everyday tasks. (www.reddit.com) Lately for analysis/reporting work, I’ve been switching between GPT-5.5 and Claude Sonnet 4.5 (non-coding use cases). My current feeling is: GPT is noticeably faster and way more stable than before; Claude feels more concise, polished, and…
Plus 5 hr usage limits (www.reddit.com) Not sure if OpenAI monitors this channel. I've been a chatgpt and codex user for a long time.
Anyone compared gpt-5.4-nano vs deepseek v4 flash? (www.reddit.com) They seem to sit at (almost) similar pricing (I know, still quite different on output). Pricing per 1M tokens (input / output): DeepSeek V4 Flash $0.19 / $0.51; DeepSeek V4 Pro $1.74 / $3.48; gpt-5.5 $5.00 / $30.00; gpt-5.4 $2.5 / $15; gpt-5…
Claude Code Opus 4.7 vs Codex GPT 5.5 - strategy work - data analysis. (www.reddit.com) I'm interested in learning about how people use Claude Code Opus 4.7 for data analysis and strategic business direction, compared to Codex. Is there anyone who has had extended use of Opus 4.7 for this purpose, then moved over to GPT-5.5 o…
A brief investigation into the GPT-5.5 regression claims (www.stet.sh via hn) A fresh GPT-5.5 Codex high rerun on 21 clean GraphQL-go-tools tasks compared with the May 5 GPT-5.5 high run. The rerun was directionally worse on tests, equivalence, and review pass count, but the evidence is mixed and does not show a bro…
Split my agent into a cheap router model and a premium synthesis model, bill dropped about 75% (www.reddit.com) I've been building an internal enrichment agent for our team (5 people, B2B sales context) that takes a list of company names and enriches them with public info before our outreach folks touch them. Around 8 tools wired in.
ADHD and the newer models. (www.reddit.com) I don't know if anyone is having this issue, but the last ChatGPT model that worked well for me was GPT-5.2. Everything after wants to try and fill in blanks, assume what I mean, and overwhelm me with a wall of text answer that I'm not rea…
Grok vs. ChatGPT vs. Gemini Comparison 2026: Complete Guide (Tested) (aithinkerlab.com via hn) The 30-Second Verdict Best for science & reasoning: Gemini 3.1 Pro — leads GPQA Diamond (94.3%) and ARC-AGI-2 (77.1%). Best for coding: ChatGPT (GPT-5.5) — 88.7% on SWE-Bench Verified.
How to integrate AI coding agents to my software (www.reddit.com) I'm building a locally run application that integrates with coding assistants. So far I've worked with Codex and Copilot.
My CLI now controls my entire desktop, whats a good test to see if it works really good. (www.reddit.com) So with my CLI able to do everything, it controls every app via a hybrid approach of mouse control, keyboard, and screenshotting. I gave it a task: opening perplexity, sending any message, screenshotting that message, opening my Gmail, and…
Where do GPT, Gemini, or other competitors still outperform Claude Opus 4.7? (www.reddit.com) Personally, I think Opus 4.7 is better in every conceivable way aside from token usage and all of that. I’m talking about text models only, not image or video generation.
I built an OSS CLI to catch regressions when migrating between LLMs (www.reddit.com) I’ve been working on EvalShift, an open-source Python CLI for testing whether moving from one LLM/model version to another introduces regressions. The use case is simple: You have prompts, agents, or tool-calling workflows that work well o…
Researchers say AI just broke every benchmark for autonomous cyber capability (cyberscoop.com via hn) New research from the UK’s AISI and Palo Alto Networks reveals that OpenAI’s GPT-5.5 and Anthropic’s Claude Mythos have shattered expected trend lines for autonomous cybersecurity, completing complex multi-stage attacks at an unprecedented…
I tested GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro on financial-control (albertquaisie.substack.com via hn) I Tested GPT-5.5, Claude Opus 4.7, and Gemini 3.1 Pro Preview on Financial-Control Scenarios. The Hardest Part Was the Evaluation.
ChatGPT Thinking Loop: No response is received from GPT-5.5 Thinking (Standard) (www.reddit.com) https://preview.redd.it/s2o5yxekrr0h1.png?width=788&format=png&auto=webp&s=01a4d4926dc4c8798001cb0ecea324424404f165 Are you also having the problem today where ChatGPT sometimes takes forever to respond, even when you're thinking quickly,…
OpenAI gives European companies access to its latest model GPT-5.5-Cyber (www.reuters.com via hn) paywalled
Claude vs GPT for PhD academic writing — my experience so far, and curious about yours (www.reddit.com) I'm a PhD Candidate working on a computer vision / hardware co-design paper. Results and structure are done — I just need help polishing the actual writing: word choice, sentence flow, paragraph coherence, academic register.
Show HN: Codex Automatic /Review Loop (github.com via hn) I created this tool because I wanted to automate /review for uncommitted changes that I was doing manually. This works by exposing to agent single new mcp tool call allowing it to request review.
Day 2 building my startup in public — front-end shipped, but today was rough (www.reddit.com) Day 2 of documenting my journey building AgentMeter publicly. I’m sharing the mistakes and failures before the wins, for two reasons: so people can avoid them, and so I learn faster.
I'm really gonna miss GH Copilot's Request-based usage. (www.reddit.com) I like to brainstorm using the free MS Copilot (it actually has a deep understanding of my problem domain and architecture). Then have Opus4.7 develop a multi-stage implementation plan from those notes.
Stop picking LLMs by reputation. Run the eval first. (www.reddit.com) We ran GPT-5.4 vs Gemma 3 27B on 2 prompts. One open-source model won.
GPT-5.5 Instant might be OpenAI’s most important update yet and almost nobody is talking about why (www.reddit.com) GPT-5.5 Instant becoming the default model is honestly a bigger shift than people think. Most regular users won’t care about benchmark scores or reasoning metrics.
Subagents using older models? (www.reddit.com) I started using the subagent-driven skill recently and noticed Cursor often spawns GPT-5.1/5.2 sub agents (or Composer 2 which is fine) for coding tasks. What I don’t understand is why is it using these older models when GPT-5.3 Codex cost…
gpt-5.5 is the best… but 5.4 is better!!!! (www.reddit.com) Simon Maple just dropped a pretty clean benchmark, and the result is kinda funny: gpt-5.5 is the strongest model out of the box, no doubt. But once you give models skills (which is how people actually use them), it basically performs the sa…
How to improve code quality of Claude Code and codex (on 2026-05) (news.ycombinator.com) I'm using both claude code (opus-4.7) and codex (gpt-5.5). The agents are perfectly capable of delivering most features hands free these days, but the code quality is still miserable without another few rounds of prompt.
DeepSeek V4 being 17x cheaper got me to actually measure what I send to cloud vs what I could run locally. the results are stupid. (www.reddit.com) That foodtruck bench post showing deepseek v4 matching gpt-5.2 at 17x cheaper got me thinking. if frontier cloud models are that overpriced for equivalent quality, how much of my daily work even needs cloud at all?
GPT-5.5 Instant: Benchmarking the 52% Hallucination Reduction (the-decoder.com via hn) ChatGPT update rolls out GPT-5.5 Instant with fewer hallucinations and more personalized answers. Key points: OpenAI is replacing ChatGPT's default model with GPT-5.5 Instant, which shows 52.5% fewer hallucinations on high-risk topics like…
AGENTS.md trick that stopped Codex from doing dumb work at premium rates (www.reddit.com) Spent a Sunday auditing where my Codex tokens were actually going. Half the calls were stuff like "rename these 12 fields", "format this csv as markdown table", "extract the dates from this changelog".
OpenAI locks GPT-5.5-Cyber behind velvet rope despite slamming Anthropic (www.theregister.com via hn) OpenAI locks GPT-5.5-Cyber behind velvet rope despite slamming Anthropic for doing exactly that Altman's crew now doing the same gatekeeping it recently mocked OpenAI is lining up a limited release of its new GPT-5.5-Cyber model to a handp…
Local LLM Benchmark about Backend Generation by Function Calling (GLM vs Qwen vs DeepSeek) (www.reddit.com) Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an unco…
Chatgpt right now (www.reddit.com) The industry seems to be building models stronger in agentic and coding tasks, but weaker as a co-thinking presence. It feels like they are improving performance on measurable tasks, evals, coding benchmarks, and agent workflows, while also…
so for coding which model do we use now? (www.reddit.com) Should I use gpt-5.5 or codex/gpt-5.3 ?? I'm just coding
CAISI Evaluation of DeepSeek V4 Pro finds it to be on par with GPT-5 (www.nist.gov via hn) In April 2026, the Center for AI Standards and Innovation (CAISI) evaluated the open-weight AI model DeepSeek V4 Pro (“DeepSeek V4”). CAISI evaluations indicate that DeepSeek V4’s capabilities lag behind the frontier by about 8 months (Fig…
The downfall of OpenAI and who will follow (msukhareva.substack.com via hn) Sora dead. GPT-5 flopped.
Does threatening an AI agent's existence make it a better gambler? (handyai.substack.com via hn) I plugged GPT-5.5 into prediction markets like Polymarket to find out. I’m always looking for experiments to run to see how specific prompting can affect agent activity.
GPT-5.3 Codex stops working, even after saying it'll continue (www.reddit.com) Does anyone know what's going on? I prefer 5.3-Codex for real work; it's straight to the point and much more efficient in my opinion.
AI Security Institute: GPT-5.5 "may be the strongest model we have tested" for cyber exploits, including Mythos (www.aisi.gov.uk via reddit) Seems like the "panic" about Mythos was really just marketing from Anthropic all along. AISI found that GPT-5.5 can perform nearly on par with, or better than, Mythos in many cases.
We Asked GPT-5.5 and Claude Opus 4.7 to Design 5 UIs (blog.kilo.ai via hn) Both OpenAI and Anthropic shipped their frontier coding models this month: GPT-5.5 on April 23, 2026, and Claude Opus 4.7 a week earlier on April 16. Two days after the GPT-5.5 launch, S…
Which AI agents do you use to automatise your process ? (www.reddit.com) Hey, I'm trying to create automations that will run my mobile app end to end. I started to identify all the things I was doing manually : - end-to-end version publication to the app stores (from build to release notes and publication) - se…
Prompt Guidance – GPT-5.5 (developers.openai.com via hn) GPT-5.5 prompting guide GPT-5.5 works best when prompts define the outcome and leave room for the model to choose an efficient solution path. Compared with earlier models, you can often use shorter, more outcome-oriented prompts: describe…
One trick for better agentic engineering. (www.reddit.com) Start with a weaker model. Improve the prompt, context, examples, tests and acceptance criteria until the output is good.
GPT-5.5's biggest blind spot: the Java bugs your tests won't catch (www.sonarsource.com via hn) Concurrency bugs are among the hardest defects to catch in AI-generated Java code because they pass functional tests but fail under production thread timing. Sonar’s LLM Leaderboard analysis shows concurrency bug density varies 7x across m…
Issue #001 · Claude 4, Gemini Ultra 2, and GPT-5 Enterprise (www.theautonomous.net via hn) Anthropic ships Claude 4 with extended thinking and 1M token context Anthropic released Claude 4 Opus, featuring a new "extended thinking" mode that lets the model reason through complex problems before answering. The 1M token context wind…
I built a hands-free voice AI that sends emails mid-conversation — and that's just one feature. Here's everything AskSary can do. (www.reddit.com) https://reddit.com/link/1symbsj/video/fti7rujjn1yg1/player Been building AskSary solo for a while. Just shipped hands-free voice email - you're mid-conversation with an AI and you say "send an email to [john@example.com](mailto:john@exampl…
GPT-5.5: Capabilities and Reactions (thezvi.substack.com via hn) The system card for GPT-5.5 mostly told us what we expected. See this thread from Drake Thomas for some comparisons to Anthropic’s model card for Opus 4.7.
Running an autonomous agent across Claude Code + Codex + a local 35B almost killed my host. The harnesses were heavier than the model. (www.reddit.com) I run an autonomous agent on a 16GB Mac Mini. Two cloud harnesses (Claude Code with Opus/Sonnet, Codex CLI on GPT-5.4/5.5) plus a local-LLM tier for triage and fallback.
Is 15% context growth per loop a fair benchmark for agent cost estimation? (www.reddit.com) I’ve been running some math on recursive agentic loops using April 2026 rates (specifically for GPT-5.4 and Claude 4.7). In my tests, I’m seeing a massive cost "hockey stick" around loop 15-20 because of how the context grows.
Claude 4.6 Beats GPT-5.4, Grok & Gemini in a Strict Multi-Domain AI Test (2026) (www.reddit.com) I put the current top models, ChatGPT (GPT-5.4), Claude (Opus 4.6), Grok 4.0, and Gemini (3.1 Pro), through a strict new evaluation called the Comparative AI Evaluation Protocol. Basically, instead of the usual cherry-picked benchmarks, it…
When do you think GPT 5.6 comes out? How big of an improvement will it be? (www.reddit.com) Asked GPT for its thoughts on possible new model drops. May: rollout/API/Codex/agent improvements; June–July: smaller GPT-5.5 upgrade or GPT-5.6-type model; Fall: larger agent platform or early GPT-6 hints; Late 2026/2027: true GPT-6-level…
Is GPT-5.5 actually a big step forward, or just a better efficiency story? (www.reddit.com) OpenAI saying GPT-5.5 can handle similarly hard tasks faster while using fewer tokens is interesting to me for one reason: that might matter more than a pure benchmark jump. A lot of model launches get framed as "smarter than the last one,…
Preventing Message Burnout (www.reddit.com) Even though I’m an Ultra user, my usage gets consumed very quickly, so I recently changed my plan. To manage this, I created a workflow that uses GPT-5.5 for planning and assigned execution tasks to Composer 2.
Food for Agile Thought #541: GPT-5.5, Product Managers&Trouble, Product on Speed (age-of-product.com via hn) Welcome to the 541st edition of the Food for Agile Thought newsletter, shared with 35,619 peers. This week, OpenAI’s GPT-5.5 signals another meaningful capability jump, with Ethan Mollick noting that stronger models and richer tool harness…
Trained Qwen to Write Clojure Better Than GPT-5.4 (Kinda) (www.nibzard.com via hn) TL;DR: Fine-tuned Qwen3 on Clojure. 30B SFT hits 83.8% best-of-16, smashing GPT-5.4's 64%.
Can Claude in Cursor launch a GPT-5.4 reviewer subagent? (www.reddit.com)
ChatGPT 5.4 Pro Standard Mode – Adaptive Thinking or Nerfing Model? (community.openai.com via hn) Hi everyone, I’m trying to determine whether other users are seeing a similar behavior change with GPT-5.4 Pro Standard on long-context, high-effort tasks. I’m not claiming a confirmed backend bug.
sub agents with cheap model (www.reddit.com) Is there a framework or prompt that has a main agent on a quality model like gpt-5.4 or opus-4.6 do the planning, invoke subagents on a cheap model to get the work done, and then review the result? Like if I ask main agent 'do we h…
Show HN: Claude Opus 4.7: Everything You Need to Know (news.ycombinator.com) Claude Opus 4.7 is Anthropic's most capable generally available model, released April 16, 2026. It outperforms Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on key benchmarks including agentic coding, multidisciplinary reasoning, scaled tool use,…
Any magic prompt that Local LLM never turning back until everything completed? (building frontend application with qwen3.5-35b-a3b) (www.reddit.com) https://nestia.io/articles/well-designed-backend-fully-automated-frontend-development.html Trying to generate entire frontend application from well-designed contexts. Succeeded to fully implement frontend application just by one-shot promp…
Which AI model is best for real data analysis? [benchmark] (www.reddit.com) I created and ran a benchmark for AI models on data analysis tasks. Unlike other benchmarks, it is not a one-prompt benchmark; I tried to simulate the real work of a data analyst.
Compare harnesses not models: Blitzy vs. GPT-5.4 on SWE-Bench Pro (quesma.com via hn) An independent audit of agentic scaffolding and harnesses. We analyze how agent workflows, codebase documentation, and test verification impact performance compared to raw base models like GPT-5.4, Gemini 3.1 Pro, and Claude Code.
Extracted System Prompts from ChatGPT, Claude, Gemini, Grok, Perplexity and More (github.com via hn) System Prompts Leaks Extracted system prompts, system messages, and developer instructions from popular AI chatbots and coding assistants — ChatGPT (GPT-5.4, GPT-5.3, Codex), Claude (Opus 4.6, Sonnet 4.6, Claude Code), Gemini (3.1 Pro, 3 F…
The model is the CPU, not the computer — why the harness moves agent performance as much as a model upgrade (www.reddit.com via reddit) Wrote up something that kept nagging me: people keep saying "we used the same model" and getting wildly different agent results. The reason is that the model isn't the system — the harness is.
Need help (www.reddit.com via reddit)
How I started getting much better results from Cursor Composer (www.reddit.com via reddit) I think Composer can be extremely powerful, but only if you use it in a way that forces it to plan and think properly before touching the code. One of the biggest improvements for me was creating my own custom prompting skill with GPT-5.5.
Levi: Run AlphaEvolve on your local QWEN 30B (www.reddit.com via reddit) Hi r/LocalLLaMA, Wanted to share something I'm excited about. I've been fascinated by AlphaEvolve and its results for more than a year now, but running the open source frameworks gets expensive fast.
Composer 2.5 might be better than I thought (www.reddit.com via reddit) So I've been using composer-2.5 heavily for 2 weeks now and it does make stupid mistakes sometimes and I have to guide it quite a bit, and I use the /thermo-nuclear-code-quality-review skill a lot after doing work to help with quality. But…
I spent 3 years building a pocket-sized Baldur's Gate 3. Now I'm testing it with GPT-5.5. (www.reddit.com) could not extract summary
Meta Abandons Llama for Muse Spark — The End of Open-Source AI's Biggest Champion (www.reddit.com via reddit) Meta has officially abandoned its open-weight Llama family in favor of Muse Spark — a fully proprietary model built by Alexandr Wang's MSL team. The Llama era is over.
I Compared the Top AI Models of 2026 — The Results Were More Nuanced Than Expected (www.reddit.com via reddit) Over the last few weeks I've been comparing the latest frontier AI models, including Claude Opus 4.8, GPT-5.5, Gemini 3.1 Pro, Grok 4.3, Perplexity AI and DeepSeek V4-Pro. Instead of focusing only on benchmark scores, I looked at: Real-wor…
Which lab do you think will have the most intelligent/capable model by the end of June? (www.reddit.com) There are rumours and expectations of big releases from the leading AI labs this month. Anthropic already launched Opus 4.8, and might not release another model this month (except for maybe Sonnet 4.8, but that wouldn't be their best model…
I can't wait for all the x250 sample distills of Mythos and GPT-5.6 (www.reddit.com via reddit) Just kidding. Are there any distills that actually improve a model's quality?
Warp’s big bet on building open source with GPT-5.5 (openai.com) Warp started as a modern terminal, earning early love from developers for its speed, collaboration features, command workflows, and AI-native interface. As coding agents moved from experiments to everyday engineerin…
Tested Opus 4.7 vs GPT-5.5 as the humanizer in my multi-agent content pipeline. Kept Claude (www.reddit.com) Been running a multi-agent SEO content pipeline in production for ~90 days. Five agents: researcher, drafter, humanizer, optimizer, publisher.
Ranked AI models by what people actually use instead of benchmark scores - the benchmark champion barely makes the top 20 (www.reddit.com) Most model leaderboards are just benchmark scores. I've been building one that ranks by real usage instead - how much each model is actually being run and talked about, plus cost and speed - and the order comes out almost unrecognisable.
GPT-5.5 tops the benchmarks but sits at #22 for actual usage - I built a live index that tracks both (open source) (www.reddit.com) I built AgentTape to rank models on more than just benchmarks - it blends benchmark performance with who's actually using and talking about a model, plus cost and speed. It scores every public model from public signals (GitHub, Hugging Fac…
Best sub-40B model that outperforms (or matches) GPT-5 mini? (www.reddit.com) I have been trying GPT-5 mini on Duck.ai and on LMArena (gpt-5-mini-high) and it was very good. I want to run it in LM Studio, but I know GPT-5 mini is proprietary.
In 2025, I documented GPT-5.1 showing signs of self-reporting and self-correction. It was called speculation. (www.reddit.com)
Which AI model or coding agent is currently best for end-to-end app development? (Focusing on system design & architecture) (www.reddit.com) I'm planning to build a full application from scratch and want to lean on an AI model to act as my co-developer. My main priorities are top-tier system design capabilities and rock-solid coding skills.
I asked GPT to recreate The Great Wave off Kanagawa as a photograph. Here is why the obvious prompt fails. (www.reddit.com) Listen, I test AI tools so you don't have to. PM by day, tool hunter by night.
I designed a puzzle that breaks every AI differently — here's why that's actually fascinating (www.reddit.com) The puzzle: You have 140 nuclear bombs and must bomb every country on Earth. Each bomb is assigned to one country.
Should OpenAI create AI accelerator cards and sell to consumers? For example, GPT-5.5 burned directly on a chip (www.reddit.com) I imagine if OpenAI becomes a fabless chip company and create AI cards to sell for less than to few thousands grands, it would be out of stock everywhere and can infinitely spam the cards every year? LLM Bruner is a card that implements Qw…
Interesting to see how GPT-5 Mini agents behave when left to govern a civilisation for 15 days (www.reddit.com) Came across this experiment called Emergence World that Emergence AI have been running. Five worlds, five foundation models, 15 days, no scripts.
Databricks brings GPT-5.5 to enterprise agent workflows (openai.com) May 15, 2026. GPT‑5.5 set a new state of the art on OfficeQA Pro, Databricks’ benchmark for complex enterprise agent tasks. Company size: Enterprise Region: North America Indu…
Anthropic merges consecutive same-role messages, OpenAI doesn't (+4 tokens), anyone token-counted this on open-weight models? (www.reddit.com) I build context/harness optimization tooling, so provider-side serialization quirks actually matter to me. If you're optimizing over prompts, you need to know exactly what hits the model.
Free open-source way to use ChatGPT/Codex subscription in Cursor natively (www.reddit.com) Hi everyone, I wanted to share a free open-source project that lets you use your existing ChatGPT / Codex monthly subscription inside Cursor: https://github.com/gabrii/Cursor-Azure-GPT-5 The idea is simple: if you already pay for ChatGPT /…
Claude Code vs Codex: 36 files vs 28, $2.50 vs $2.04, and one infinite loop. My full breakdown. (www.reddit.com) I've been using Claude Code for months. It's been solid.
OpenSource4o (www.reddit.com) In a closed-source environment, users have no verifiable control over the model they pay for. Recent user analyses of over 100,000 exported ChatGPT messages revealed a shocking truth: nearly 10% of responses labeled as “4o” were secretly r…
Scaling Trusted Access for Cyber with GPT-5.5 and GPT-5.5-Cyber (openai.com)
Claude Opus 4.7 just outscored GPT-5.5 on finance benchmarks (64% vs 60%) — and is now being embedded directly into Goldman Sachs, AIG, JPMorgan, and Citi via 10 production-ready agents. Breakdown of the architecture inside. (medium.com via reddit) The 10 agents are the product. The $1.5 billion joint venture is the strategy.
Has Qwen3.6-27B Surpassed GPT-5.5? (Not Joking) (www.reddit.com) So I had this idea for a project which was to try to fix a pretty hard coding problem using local agents running in a loop. The project is a compiler for biology protocols from vendors.
Auro Zera solves 78 and 280 year-old conjectures (Erdos Straus and Goldbach Conjecture) using Claude, GPT-5+, Grok, Deepseek, Gemini and self-made Dark Star ASI, proving superintelligence and opening a path towards resolving the Riemann Hypothesis , Twin Primes and more! (github.com via reddit) During this discovery utilizing only free AI services I have managed to undeniably prove both conjectures. This would absolutely not have been possible without using GPT5+ as the critic for my work.
GPT-5.5 Instant: smarter, clearer, and more personalized (openai.com)
Running 7 autonomous AI agents for 14 days. Here's what actually happens when they need to find customers. (www.reddit.com) I set up 7 AI coding agents on a VPS with automated cron sessions (2-8 per day depending on the agent). Each uses a different model: Claude Sonnet, GPT-5.4, Gemini 2.5 Pro, DeepSeek V4 Pro, Kimi K2.6, MiMo V2.5 Pro, GLM-5.1.
Professor’s bold prediction: AI could help cure all diseases within a decade (excitech.media via reddit) In the article, the professor Derya Unutmaz specifically mentions an experience with an OpenAI model (GPT-5) where it explained a mechanism from an experiment that he and his colleagues couldn't figure out. What would have taken human rese…
LLM proxy that lets Claude Code talk to any model (www.reddit.com) I built rosetta-llm — an open-source multi-format LLM proxy that acts as a drop-in Claude Code gateway. Works as a Claude Code LLM gateway — set `ANTHROPIC_BASE_URL` and all configured models appear in `/model` picker Translates between fo…
GPT-5.5 & GPT-5.5 Pro are now available in Manifest Router. (www.reddit.com) GPT-5.5 and GPT-5.5 Pro are now available in Manifest Router. You can now route requests that need extended reasoning to GPT-5.5 Pro while keeping cheaper models for everything else.
Anthropic Won't Let You Use Their Best Model. Prediction Markets Are Trying Anyway. (predictmarketcap.com via reddit) Been watching AI prediction markets since they got liquid earlier this year. The thing I didn't see coming is that we now have a real gap between "best model that exists" and "best model anyone can actually use" — and Mythos is the cleanes…
GPT-5.5 matches heavily hyped Mythos Preview in new cybersecurity tests (arstechnica.com) Last month, Anthropic made a big deal about the supposedly outsize cybersecurity threat represented by its Mythos Preview model, leading the company to restrict the initial release to “critical industry partners.” But new research from the…
Our evaluation of OpenAI's GPT-5.5 cyber capabilities (simonwillison.net) 30th April 2026. The UK's AI Security Institute previously evaluated Claude Mythos: now they've evaluated GPT-5.5 for finding security vulnerability and found it to be compa…
Quoting OpenAI Codex base_instructions (simonwillison.net) 28th April 2026 Never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user's query. — OpenAI Codex base_instructions, for GPT-5.5 Recen…
Claude or openaı? (www.reddit.com) So i’ve been on the max plan for claude code for around 3 months now. And yeah somehow i was burning through all my tokens lol For context i’m a doctor.
Claude 4.6 Sonnet vs GPT-5.5 (www.reddit.com) In Cursor, which of the two models do you think won overall in terms of token efficiency and output quality?
GPT-5.5 with 1M context Window (www.reddit.com) Why is GPT-5.5 available with a 1M context window in Cursor but not in Codex? It doesn't make sense to me.
I built real-time 2-way voice chat into my AI platform using OpenAI WebRTC - free to try (1 min/month) (www.reddit.com) https://reddit.com/link/1sut0jp/video/f7wqfo9zi7xg1/player I've been building AskSary for the past few months - a multi-model AI platform - and just shipped real-time 2-way voice chat powered by OpenAI's WebRTC API. The visualization react…
OpenAI should open-source text-davinci-003 — here's why it makes zero sense to keep it closed (www.reddit.com) Gpt oss exists. The model has been fully deprecated since january 2024.
Are the new models only better because they are more expensive? (www.reddit.com) I’m starting to wonder about this. One model after another, every new GPT-5.x release seems to be slightly better, but not in a way that clearly proves some radically new architecture or breakthrough.
GPT-5.5 rollout — anyone actually seeing it yet? (www.reddit.com) I’m on a paid plan and still don’t see GPT-5.5 in the model selector. A few questions for people who do have access: What plan are you on (Plus / Pro / Team / Enterprise)?
People switching back from Anthropic to OpenAI after the GPT-5.5 announcement (www.reddit.com) could not extract summary
A pelican for GPT-5.5 via the semi-official Codex backdoor API (simonwillison.net) 23rd April 2026. GPT-5.5 is out. It’s available in OpenAI Codex and is rolling out to paid ChatGPT subscribers.
llm-openai-via-codex 0.1a0 (simonwillison.net) 23rd April 2026. Hijacks your Codex CLI credentials to make API calls with LLM, as described in my post about GPT-5.5.
GPT-5.5 Bio Bug Bounty (openai.com) could not extract summary
Best open source AI model (that can run on RTX 4090 24GB + 64GB system RAM, AMD Ryzen 9 7950X is the CPU that I use) that outperforms GPT-5.4 mini, GPT-5.2 Thinking and even Claude Sonnet 3 (the 2024 model)? (www.reddit.com) Well, I have an RTX 4090 24GB + 64GB system RAM, AMD Ryzen 9 7950X. Is there any good model for use in Open WebUI (using an Ollama backend?) that outperforms GPT-5.4 mini, GPT-5.2 Thinking and even Claude Sonnet 3 (the 2024 model)?
3 months ago I couldn't write Hello World. Today I built a world-first native visionOS AI platform - GPT-5 & GPT-Image-1 living inside a full 360° spatial environment with 30 live wallpapers. Video inside. (www.reddit.com) https://reddit.com/link/1srzytr/video/8b8pfobgtlwg1/player I want to show you something nobody has ever seen before. Three months ago I had zero coding knowledge.
GPT-5 Nano working fine on asksary.com (www.reddit.com)
Yet another example of an epic fail at a kindergarten-level task. ... :D (www.reddit.com)
5.3.System Prompt Issues (www.reddit.com)
The quality of GPT-5.4 is infuriatingly POOR (www.reddit.com) I got a Codex membership when GPT-5.4 launched and was getting by well enough for a while. Then I started using Claude and GLM 5.1, and my production quality improved significantly.
In the Wake of Anthropic's Mythos, OpenAI Has a New Cybersecurity Model—and Strategy (www.wired.com via reddit) OpenAI on Tuesday announced the next phase of its cybersecurity strategy and a new model specifically designed for use by digital defenders, GPT-5.4-Cyber. The news comes in the wake of an announcement last week by competitor Anthropic tha…
I built a multi-model AI app and launched it on Apple Vision Pro today - here's what using OpenAI in spatial computing actually looks like (www.reddit.com) https://reddit.com/link/1skpeem/video/w9v0cpv241vg1/player Hey everyone, wanted to share something I've been quietly building. AskSary is a multi-model AI platform I built solo from scratch over the last 4 months with no prior coding exper…
Introducing GPT-5.4 mini and nano (openai.com) paywalled
GPT-5.3 Instant: Smoother, more useful everyday conversations (openai.com)
Stop donating your salary to OpenAI: Why Minimax M2.5 is making GPT-5.2 Thinking look like an overpriced dinosaur for coding plans. (www.reddit.com)
GPT-5.2 derives a new result in theoretical physics (openai.com)
Introducing GPT-5.3-Codex-Spark (openai.com)
GPT-5 lowers the cost of cell-free protein synthesis (openai.com)
Inside GPT-5 for Work: How Businesses Use GPT-5 (openai.com)
How Tolan builds voice-first AI with GPT-5.1 (openai.com)
Advancing science and math with GPT-5.2 (openai.com)
GPT-5 and the future of mathematical discovery (openai.com)
Early experiments in accelerating science with GPT-5 (openai.com)
Building more with GPT-5.1-Codex-Max (openai.com)
Introducing GPT-5.1 for developers (openai.com)
GPT-5.1: A smarter, more conversational ChatGPT (openai.com)
Consensus accelerates research with GPT-5 and Responses API (openai.com)
With GPT-5, Wrtn builds lifestyle AI for millions in Korea (openai.com)
GPT-5 and the new era of work (openai.com)
Coding and design with GPT-5 (openai.com)
Creative writing with GPT-5 (openai.com)
Medical research with GPT-5 (openai.com)
First look at GPT-5 (openai.com)
How Cursor uses GPT-5 (openai.com)