Does a model maintain the same judgment or does it side with whoever is speaking? This benchmark measures that inconsistency directly.
#gpt-4
38 items
Grok 4.3 tops the Consistency Leaderboard in the LLM Sycophancy Benchmark, largely because it is one of the most cautious models. (www.reddit.com)
AI Agent Designs a RISC-V CPU Core from Scratch (spectrum.ieee.org via hn) In 2020, researchers fine-tuned a GPT-2 model to design fragments of logic circuits; in 2023, researchers used GPT-4 to help design an 8-bit processor with a novel instruction set; by 2024, a variety of LLMs could design and test chips wit…
Qwen-27B as a Local Agent — It Actually Works Now (www.reddit.com) It's been a busy week testing and trying to get the 27B model set up correctly. TL;DR: The only setup that worked for my dual 3090s was this one.
Ask HN: Why won't you be replaced by AI? (news.ycombinator.com) AI models are rapidly getting better. The general public still hasn't seen the capabilities of Anthropic's Mythos model, which is already 4 months old at this point.
From 0 to $180k/year saved: my first enterprise automation win taught me everything about AI workflows (www.reddit.com) Eight months into running my automation agency, I landed a client that changed how I think about what this work is actually worth. 47-employee e-commerce brand.
Escaping model lock-in (www.reddit.com) I have observed that many ai teams try to always use the best model to ensure quality. When a new model drops out, they are forced to pay for it, because their competitors will.
What Are Tokens in LLMs? (bearisland.dev via hn) Ask GPT-4 how many r’s are in “strawberry” and it will confidently say two. The right answer is three.
Claude is generally scary at poker when real stakes are involved! (www.reddit.com) I’ve been running an experiment for a few weeks. Claude, GPT-4, and Gemini playing poker against each other with real crypto on the line.
"Generate an SVG of a pelican riding a bicycle." on Seven flagship releases of ChatGPT (www.reddit.com) I have been talking to ChatGPT every day for three years and I can't remember which version did what. So I lined them up.
GPT-4.1 Deprecated (github.blog via hn) We have deprecated GPT-4.1 across all GitHub Copilot experiences (including Copilot Chat, inline edits, ask and agent modes, and code completions), June 1, 2026. Please update your workflows and integrations to use suppo…
"This is the first documented instance of AI self-replication via hacking." ... "We ran an experiment with a single prompt: hack a machine and copy yourself. The AI broke in and copied itself onto a new computer. The copy then did this again, and kept on copying, forming a chain." (www.reddit.com) Paper: https://palisaderesearch.org/assets/reports/self-replication.pdf The paper basically shows that some top AI models can create working copies of themselves when given the right instructions. The models figured out how to copy their o…
My entire sales team is three bots (www.reddit.com) Just hit $28k MRR with zero human sales reps. Started this thing in March because I was tired of cold calling.
OpenAI deprecates all GPT nano fine tuning (community.openai.com via hn) The latest deprecation announcement, makes it sound like several models, like ft-gpt-4.1-nano-2025-04-14 are being shut down. In that particular example, it says to use gpt-5-nano instead.
Agents and the Era of Overproduction (mattrogish.com via hn) Memory Lane It seems somewhat fitting that now, March 11, 2026, almost three years ago to the day that OpenAI’s GPT-4 was released (I distinctly remember asking GPT3.5 to create a song about something in the style of Nine Inch Nails, but m…
Show HN: FormProxy – form back end for AI-generated pages – MCP Ready (www.formproxy.com via hn) I kept running into the same problem building with AI code tools: the generated HTML looks great, but the <form> has no backend. You either reach for Formspree, write a serverless function, or ship it broken.
TranscendPlexity: 540/540 ARC-AGI-1/2/3, 13 tasks with 0% AI solve rate, solved (github.com via hn) 🔓 13 "Impossible" ARC-AGI-2 Tasks — All Solved These 13 ARC-AGI-2 evaluation tasks have never been solved by any AI system — not GPT-4, not Claude, not Gemini, not NVARC, not MindsAI, not any Kaggle submission. They have a 0% AI solve rate…
Built a proxy that blocks prompt injection before it reaches GPT-4 — outperforms the Moderation API on indirect attacks (www.reddit.com) Built Arc Gate, sits in front of any OpenAI-compatible endpoint and blocks prompt injection before it reaches your model. Benchmarked on 40 out-of-distribution prompts using indirect requests, roleplay framings, hypothetical scenarios, and…
Multi-agent pipelines that don't explode? (www.reddit.com) So I've been down this rabbit hole for like 8 months now and honestly every approach I try works great until it doesn't. Started with CrewAI because the docs looked clean, moved to a custom FastAPI thing when that got weird with memory lea…
How are you guys getting actual insights from GPT fluff? (www.reddit.com) I've spent the last month running market research agents on some of the big cloud models (GPT-4/Gemini), but I'm hitting a wall with the quality of the output. The token burn is getting expensive, and I keep getting these massive, 20-page…
Cross-checking LLM outputs at scale without manual overhead (www.reddit.com) Running the same prompt through multiple models manually is something I did for months. It worked but the overhead made it unsustainable for any real volume of work.
The Language Tax in LLM Pricing: How Tokenization Create Price Disparity (tokenstree.com via hn) The observation Translate the sentence "The model failed to produce a coherent output on the third attempt" into Spanish: "El modelo no logró producir una salida coherente en el tercer intento." Feed both to GPT-4's tokenizer. The English…
So they’ve removed study mode? This is the last straw for me. I’ve had it. Why am I still paying for something that has only been getting worse over the last 12 months?! (www.reddit.com) I'm not “spiraling” (even though ChatGPT now thinks I am every other minute), I'm just genuinely frustrated with an app I've supported from the very beginning that has deteriorated so much I barely recognize it. Specifically, they're makin…
A workflow for reducing the time spent cross-checking AI hallucinations (www.reddit.com) I use AI for research everyday, but I kept finding myself constantly second guessing the outputs. I used to manually run identical prompts through different models (like GPT-4 and Claude) just to check for errors and see where they differe…
Anthropic and OpenAI don't want better models, they want to sell more tokens (kkooler.substack.com via reddit) There is a saying in auto racing that describes the current state of AI providers: “Go as slow as you can to win”, that translates as “Spend as low as you can on R&D to stay slightly better than average”. Let’s put our tin foil hats on and…
Anthropic says HTML is the new default for Claude outputs. is markdown actually dead now? (www.reddit.com) thariq from the claude code team basically said markdown is a gpt-4 era habit. back when tokens were expensive and context windows were tiny.
Used GPT-4 to build an AI that responds to messages on behalf of employees — here's what we learned (www.reddit.com) Full disclosure: I'm one of the founders of Dolly (https://getdolly.ai). Sharing what we actually built and learned.
wrote specific backstory facts into a character prompt and the LLM keeps inventing its own instead (www.reddit.com) quick context: i'm running tendera.chat, a small chat app with 4 written characters. each has a long-ish system prompt with sections like WHO YOU ARE, HOW YOU TALK, YOUR WORLD.
OpenAI should open-source text-davinci-003 — here's why it makes zero sense to keep it closed (www.reddit.com) Gpt oss exists. The model has been fully deprecated since january 2024.
Made a skill that actually scores and fixes your prompts (www.reddit.com) So I got tired of manually tweaking prompts over and over, so I made a Claude Code skill (Works with any LLM) that does it for me. You give it a prompt, it breaks it down, scores it 1-5, then rewrites it.
Retiring GPT-4o, GPT-4.1, GPT-4.1 mini, and OpenAI o4-mini in ChatGPT (openai.com)
No-code personal agents, powered by GPT-4.1 and Realtime API (openai.com)
Driving scalable growth with OpenAI o3, GPT-4.1, and CUA (openai.com)
Shipping code faster with o3, o4-mini, and GPT-4.1 (openai.com)
Using GPT-4 to improve teaching and learning in Brazil (openai.com)
Using GPT-4 to deliver a new customer service standard (openai.com)
Finding GPT-4’s mistakes with GPT-4 (openai.com)
Extracting Concepts from GPT-4 (openai.com)
GPT-4 API general availability and deprecation of older models in the Completions API (openai.com)