#tool-calling
34 items
Hot take: the biggest bottleneck in AI agents right now isn't models, frameworks, or even cost. It's that nobody knows how to properly evaluate if their agent is actually working (www.reddit.com)
We are finally there: Qwen3.6-27B + agentic search; 95.7% SimpleQA on a single 3090, fully local (www.reddit.com) LDR maintainer here. Thanks to the strong support of r/LocalLLaMA community LDR got very far.
AI agent roadmap for developers who can code but have never built an agent (www.reddit.com)
Show HN: Local Coding Agent with LLMs to Delegate Tool Calls to Small AI Models (github.com via hn) Open Agent Tools Coder: Open Agent Tools (oats) enables small-to-large self-hosted ai models to use local source code when running tool-calling agentic workloads. We actively data mine 20,970+ (2+ TB) popular github repos using large and sm…
Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster. (www.reddit.com) Ran a head-to-head on two open-weight models for tool-calling on a 4-core CPU, no GPU, no cherry-picking. Wanted to see if the small specialist (Needle, 26M, distilled from Gemini 3.1 for function calls) actually holds up against a small g…
Computer use is 45x more expensive than a structured API call (www.reddit.com) Hi r/AI_Agents, I recently did a benchmark on computer use agents vs api calls as part of a feature launch for my company. I wanted to share the benchmark here since it seems relevant to this sub: See, most teams default to computer use ag…
how do you design an ai agent to handle heavy data processing and large files? (www.reddit.com) looking for architectural patterns on handling data gravity in production agent pipelines. every tutorial I've found assumes light text payloads or short tool-calling loops, but once your agents have to actually interact with massive sourc…
AI Support Agents & Workflows Worth Exploring in 2026 (www.reddit.com) Been exploring how AI agents are slowly changing customer support workflows, especially for smaller teams trying to scale without adding headcount. Some interesting tools/workflows worth checking out: • SparrowDesk’s Zoona: AI support agen…
What we learned trying to fine-tune a small tool-calling model from production traces (and what not to do) (www.reddit.com) TL;DR: We wanted a small, fast model for multi-turn tool-calling. Training on clean, curated data worked brilliantly (1.7B student beating a 744B teacher).
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM (deemwar-products.github.io via hn) mochallama: a local, tool-calling LLM inside your JVM. The only in-process, tool-calling local LLM for the JVM: Spring-first, OpenAI-compatible, llama.cpp-backed via Project Panama FFM. No JNI, no daemon, no native-install dance.
Show HN: Prism Coder – Qwen3.5-14B fine-tuned for MCP tool-routing decisions (github.com via hn) 🧠 Prism Coder: Persistent memory + tool-calling intelligence for AI a…
The Oats Protocol – Open Agent Tools for Local Coding Agents (news.ycombinator.com) Recently I was using functiongemma and watched it load and run local source code as a tool call without any training/tuning. A couple days later I got Qwen35 in Open-WebUI to use the "native" tool-calling.
Is Haiku good for building a chatbot with MCP tools ? (www.reddit.com) Hi, We’re experimenting with building a chatbot that handles consumer interactions. The agent currently has access to about 5–8 tools, and we’re exploring different models to find the right balance of speed, cost, and tool-calling reliabil…
I tried to get my AI agent to schedule a meeting over email. The failure mode revealed a problem almost nobody in the agent space is talking about. (www.reddit.com) I've been building an AI agent that operates across SMS, email, WhatsApp, and Slack — and the hardest problem I've run into isn't tool-calling or reasoning. It's what happens when the agent interacts with multiple people who have different…
What we learned building a data agent that talks to 4 database types simultaneously (DAB benchmark) (www.reddit.com) UC Berkeley published DataAgentBench (DAB) in March — 54 queries across PostgreSQL, MongoDB, SQLite, and DuckDB. Best score so far is 54.3% (PromptQL + Gemini).
How does Google Antigravity IDE actually work internally? (www.reddit.com) Hey everyone, I’ve been exploring Google Antigravity recently, and I’m really curious about its internal architecture and engineering design. From the demos, it seems much more advanced than a normal AI coding assistant — almost like an au…
ReAct tool-calling issue: Orchestration model computes internally instead of using tools (www.reddit.com) Built a local ReAct-style calculator agent with 6 tools: add, subtract, multiply, divide, modulo, etc. The setup is: orchestrator agent, dynamic tool selection, ReAct loop, tools exposed as functions. Problem: Even when the user asks multi-step ari…
Training a 22MB prompt injection classifier (www.stackone.com via hn) When we started building Defender (our prompt injection guard for MCP tool-calling agents), the constraint was simple and unforgiving: ship inline inside a TypeScript Lambda, st…
Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks (github.com via hn) Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments. I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.
Needle-rs – AI Function calling in the browser, 258 KB WASM (needle-rs.pages.dev via hn) AI TOOL CALLING · WASM · NO_STD Below is a 26M-parameter tool-calling transformer running entirely in this tab — no server, no API key, no data leaving your device. The model is Needle by Cactus Compute; needle-rs is the pure-Rust runtime…
sAI2.m6s (www.reddit.com) Hey everyone, I'm designing a powerful, autonomous AI chatbot (agent), fully private, using a Python backend (for the core intelligence and tool-calling loops) and a Flutter frontend for a cross-platform UI. Since this moves past a basic…
I built an OSS CLI to catch regressions when migrating between LLMs (www.reddit.com) I’ve been working on EvalShift, an open-source Python CLI for testing whether moving from one LLM/model version to another introduces regressions. The use case is simple: You have prompts, agents, or tool-calling workflows that work well o…
Built a practical voice-first AI tool for ADHD/executive dysfunction — one-tap brain dump → structured reminders & tasks (not a full autonomous agent) (www.reddit.com) Not a full autonomous agent in the Auto-GPT / LangChain sense, but I built something that uses AI in a very practical, daily way for executive dysfunction / ADHD brains. SAVI is a one-tap voice capture tool.
Show HN: Mlx-code – I built a "backyard shed" AI coding agent for Mac (github.com via hn) mlx-code A lightweight coding agent for Mac, built on Apple's MLX framework. Fast local inference, built-in prompt caching, robust tool-calling.
Help setting up Chrome MCP for Hermes Agent (www.reddit.com) Hi everyone, I'm trying to set up Chrome MCP (Model Context Protocol) for Hermes Agent and need some guidance. **Background:** - Hermes Agent (by NousResearch) has self-learning features - I want to integrate Chrome browser automation via…
Title: Is it just me, or is the "Multi-Agent Swarm" the new "Over-Engineered Spreadsheet"? (www.reddit.com) We’re four months into 2026 and every demo I see features "15 agents working together to write a blog post." In my experience, the more agents you add, the higher the "Cognitive Tax." You get more hallucinations, more token cost, and more…
Defender – Local prompt injection detection for AI agents (no API calls) (www.npmjs.com via hn) Prompt injection defense framework for AI tool-calling. Indirect prompt injection defense and protection for AI agents using tool calls (via MCP, CLI or direct function calling). Detects and neutralizes prompt injection attacks hidden in t…
Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning (arxiv.org) Large language model (LLM)-based agents often make suboptimal tool-use decisions, including unsupported tool invocation and hallucinated direct responses, which may accumulate errors throughout multi-step interactions. Existing approaches…
we really all are going to make it, aren't we? 2x3090 setup. (www.reddit.com) i'm blown away. i saw someone made a post the other day about "club-3090" and after having sonnet patch some fixes into it, specifically a sse-session drop bug and a bug with tool-calling, it's fair to say that even "budget" setups like my…
Should I buy Claude Pro as a BTech student, especially for the agentic/coding side? Honest takes wanted (www.reddit.com) Hey everyone, I'm a BTech (AI/ML) student considering Claude Pro ($20/month) but want to separate the real value from the…
I built AI agents that play Pokemon Showdown autonomously using free LLM APIs via tool-calling (www.reddit.com) I've built a system where models like Llama 3, Qwen, and Gemma play Pokémon Showdown battles autonomously. Instead of simple prompt-response, they analyze the full battle state every turn (type matchups, HP, weather, field conditions, reve…
llm 0.32a1 (simonwillison.net) 29th April 2026 - Fixed a bug in 0.32a0 where tool-calling conversations were not correctly reinflated from SQLite. #1426
TPS wasn't enough, tool-calling pass rate decided the winner in my Qwen 7B runs (www.reddit.com) I kept running into the same problem: TPS and TTFT tell you which config is fast, and perplexity is helpful only as a rough quality signal. None of them reliably tell you how the model will behave after changing quant, ctx size, kv_cache,…
Small models fail at tool selection - but it's not what I expected (www.reddit.com) Been running small models (1.5B-4B) with tool-calling agents. They consistently failed at selecting the right tool from 80+ options.