Hallucination

32 items · started 2024-01-12 · ongoing (last activity 2026-04-30)

  1. Have been seeing this in our agents for a while, and finally there's a paper that explains it. I swapped one of our planning agents from a non-reasoning model to a reasoning one, and tool-call quality got worse in a very specific way.

  2. Grok is supposedly the lowest-hallucination model according to the AA-Omniscience benchmark. Today I've had INSANE hallucinations from Grok 4.2 fast.

  3. Posting in case this helps anyone who has noticed a reduction in quality from Claude over the last couple of weeks. In my case, I noticed that it often spins up an agent to audit parts of the code and analyse the results to make decisions.

  4. Uhh, I guess Gemma 4 is so much shittier that it hallucinated this event that happened in China in 1989? According to Qwen, nothing of significance happened at Tiananmen Square in 1989 - and based on all of Qwen's benchmarks, I believe…

  5. I've been running into problems with “agent memory” while using Claude: when it was a pile of markdown files, it started out great but became unreliable as the number of files grew. So I built Omnigraph, an open-source graph runtime for agent…

  6. Hey everyone — built Ombre, an open source AI infrastructure layer that works with any AI model. Eight agents run automatically: security, caching, memory, hallucination detection, tamper-proof audit trail.

  7. MIT license and fully open source. MiMo-V2.5-Pro was just 3 points behind Opus 4.7 max, and the normal V2.5 is only a step behind SOTA.

  8. I put the current top models, ChatGPT (GPT-5.4), Claude (Opus 4.6), Grok 4.0, and Gemini (3.1 Pro), through a strict new evaluation called the Comparative AI Evaluation Protocol. Basically, instead of the usual cherry-picked benchmarks, it…

  9. could not extract summary

  10. A hallucination engine: typed pseudorandom data generated via an LLM.

  11. Biologist Colin Domnauer is reopening an old case that Chinese health officials seem to have stopped caring about. Every summer, residents of Yunnan province check into hospitals with complaints that they’re hallucinating tiny elflike…

  12. Abstract (English) This study presents an exploratory quantitative analysis of hallucinations arising when large language models (LLMs) count items in large volumes of unstructured text data, and examines the suppression effects of the Kno…

  13. Pulse · genji970/hallucination-mitigation-via-contrastive-sampling-method

  14. So, what the above graph means is that an LLM is really good at solving average problems and great at recombining existing knowledge. If I ask something outside my domain of expertise, I get really good answers, but as you approach to t…

  15. Been switching between ChatGPT, Claude, Gemini and Perplexity across different tabs for new projects, research, and discussions; everything had to be done manually, and context was always getting lost. So I built Proxima, a local server that conne…

  16. Has anyone here built an AI agent that extracts, normalizes, and checks unstructured documents for a specific AI workflow? I want to know how opinionated you are about the output JSON schema.

  17. Spoke to a team recently running agents in production. Their problem wasn’t “did something fail?”; it was “why exactly did it fail?” The top-level buckets were easy: infra issue, tool/API issue, bad reasoning, hallucination, externa…

  18. Anthropic's flagship model just took a pretty significant accuracy hit on one of the most important AI benchmarks out there. So here's the deal: Claude Opus 4.6 was recently tested on BridgeBench, which specifically measures how often AI m…

  19. Hey everyone 👋, I absolutely love using Cursor and Claude Desktop for debugging and writing queries, but the idea of hooking them up directly to my database via standard MCP (Model Context Protocol) servers has always given me anxiety. One…

  20. I use AI for research every day, but I kept finding myself constantly second-guessing the outputs. I used to manually run identical prompts through different models (like GPT-4 and Claude) just to check for errors and see where they differe…

  21. I love AI agents, but at the moment they have proved too unreliable for serious work. 80% of the time, an agent will make a serious or a seemingly inconsequential mistake that cascades down the pipeline and compounds the issue.

  22. Most agent observability tools just show you what happened after the bill arrives. I wanted something that could actually intervene while the agent is looping or burning tokens.

  23. Some people think agentic AI can do everything and is getting more and more powerful, and even feel fear about it. Another group, non-technical people, is still trapped in the “LLM chat is weak and full of hallucinations” world.
