Hallucination

32 items · started 2024-01-12 · ongoing (last activity 2026-04-30)

  1. Have been seeing this in our agents for a while, and finally there's a paper that explains it. I swapped one of our planning agents from a non-reasoning model to a reasoning one, and tool-call quality got worse in a very specific way.

  2. Grok is supposedly the lowest-hallucination model according to the AA-Omniscience benchmark. Today I've had INSANE hallucinations from Grok 4.2 fast.

  3. Posting in case this helps anyone who has noticed a reduction in quality from Claude over the last couple of weeks. In my case, I noticed that it often spins up an agent to audit parts of the code and analyse the results to make decisions.

  4. Uhh, I guess Gemma 4 is so much shittier that it hallucinated this event that happened in China in 1989? According to Qwen, nothing of significance happened at Tiananmen Square in 1989 - and based on all of Qwen's benchmarks, I believe…

  5. I've been running into problems with “agent memory” while using Claude: when it was a pile of markdown files, it started out great but became unreliable as the number of files grew. So I built Omnigraph, an open-source graph runtime for agent…

  6. Hey everyone — built Ombre, an open source AI infrastructure layer that works with any AI model. Eight agents run automatically: security, caching, memory, hallucination detection, tamper-proof audit trail.

  7. MIT license and fully open source. MiMo-V2.5-Pro was just 3 points behind Opus 4.7 max, and the normal V2.5 is only a step behind SOTA.

  8. I put the current top models, ChatGPT (GPT-5.4), Claude (Opus 4.6), Grok 4.0, and Gemini (3.1 Pro), through a strict new evaluation called the Comparative AI Evaluation Protocol. Basically, instead of the usual cherry-picked benchmarks, it…

  9. could not extract summary

  10. A hallucination engine: typed pseudorandom data generated via an LLM.

  11. Biologist Colin Domnauer is reopening an old case that Chinese health officials seem to have stopped caring about. Every summer, residents of Yunnan province check into hospitals with complaints that they’re hallucinating tiny elflike…

  12. Abstract (English) This study presents an exploratory quantitative analysis of hallucinations arising when large language models (LLMs) count items in large volumes of unstructured text data, and examines the suppression effects of the Kno…

  13. Pulse · genji970/hallucination-mitigation-via-contrastive-sampling-method

  14. So, what the above graph means is that an LLM is really good at solving average problems and great at recombining existing knowledge. If I ask something outside my domain of expertise, I get really good answers, but as you approach to t…

  15. Been switching between ChatGPT, Claude, Gemini and Perplexity across different tabs for new projects, research, and discussions; everything had to be done manually, and context was always getting lost. So I built Proxima, a local server that conne…

  16. Has anyone here built an AI agent that extracts, normalizes, and checks unstructured documents for a specific AI workflow? I want to know how opinionated you are about the output JSON schema.

  17. Spoke to a team recently running agents in production. Their problem wasn’t “did something fail?”; it was “why exactly did it fail?” The top-level buckets were easy: infra issue, tool/API issue, bad reasoning, hallucination, externa…

  18. Anthropic's flagship model just took a pretty significant accuracy hit on one of the most important AI benchmarks out there. So here's the deal: Claude Opus 4.6 was recently tested on BridgeBench, which specifically measures how often AI m…

  19. Hey everyone 👋, I absolutely love using Cursor and Claude Desktop for debugging and writing queries, but the idea of hooking them up directly to my database via standard MCP (Model Context Protocol) servers has always given me anxiety. One…

  20. I use AI for research every day, but I kept finding myself constantly second-guessing the outputs. I used to manually run identical prompts through different models (like GPT-4 and Claude) just to check for errors and see where they differe…

  21. I love AI agents, but at the moment they have proved too unreliable for serious work. 80% of the time, an agent will make a serious or a seemingly inconsequential mistake that cascades down the pipeline and compounds the issue.

  22. Most agent observability tools just show you what happened after the bill arrives. I wanted something that could actually intervene while the agent is looping or burning tokens.

  23. Some people think agentic AI can do everything and is getting more and more powerful, and even feel fear about it. Another group, non-technical people, is still trapped in the “LLM chat is weak and full of hallucinations” world.
