#tool-use
72 items
Tested Deepseek v4 flash with some large code change evals. It absolutely kills with tool use accuracy! (www.reddit.com) Did some test tasks with v4 flash. The context management, tool use accuracy and thinking traces all looked excellent.
Needle: We Distilled Gemini Tool Calling Into a 26M Model (www.reddit.com) We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
REAP-pruned Nemotron-3-Super (512 -> 256 experts) + GRPO fine-tune + FP8/AWQ. AIME 2026 90%+. Benchmark inside. (www.reddit.com) Hey r/LocalLLaMA, Dropping a release I've been working on during AIMO3 (Kaggle competition). Took NVIDIA's Nemotron-3-Super-120B-A12B (latent MoE + Mamba2 hybrid), REAP-pruned from 512->256 experts (removed MTP layer too), LoRA-RL fine-tun…
Why 80% of agentic AI demos don't make it to production (www.reddit.com) Agent demos are easy. Production agents are hard.
Vibe coding can turn into a gambling loop (www.reddit.com) I use AI coding tools a lot, so this is not an anti-AI post. If anything, the problem is that they are useful enough to change how I work.
Harness instructions - what's new in CC 2.1.120 (+783 tokens) (www.reddit.com) NEW: System Prompt: Harness instructions — Core interactive-agent harness guidance for terminal markdown output, permission handling, <system-reminder> context, compaction, tool use, and clickable code references. NEW: System Prompt: Memor…
Minimax M3 on Open Router (openrouter.ai via hn) MiniMax-M3 is a multimodal foundation model from MiniMax. It supports text, image, and video inputs with text output, a 1M-token context window, and is suited for long-horizon agentic work, coding, and tool use.
What are the best CLI AI agents right now? Trying to replace Cursor CLI. Looking for recommendations (www.reddit.com) I am looking for recommendations on the best CLI agents people are using for serious coding workflows that involve tool use, shell commands, and multi step iteration. I am especially interested in anything that works well with custom APIs…
My setup for running Claude Code across the full software dev lifecycle (www.reddit.com) Spent the last several months using Claude Code well beyond the editor: as the reasoning engine inside a multi-layer system that handles tickets, cross-repo implementation, code review, MRs, and a persistent knowledge layer between session…
Which local models are actually good at staying in character? Notes from shipping Qwen3.5 4B + 9B as game NPCs (www.reddit.com) I'm building a small text-based game where the gameplay loop is "talk an NPC into revealing a secret." It's basically a 20+ turn roleplay stress test: the model needs to stay in character, remember what the player said earlier, and refuse…
How are you handling citation/traceability in AI-driven research workflows? (www.reddit.com) been spending ages lately trying to tighten up citation + traceability in RAG-based research workflows, and I’m starting to feel like “retrieval” and “verifiability” are still pretty loosely coupled in most stacks. Typical setup (vector sea…
[X-post] Allen AI - BAR: Train domain "experts," merge into one model, and upgrade experts without retraining the rest (www.reddit.com)
Why model drift is the real failure mode for agentic systems (www.reddit.com) Across Twitter and Reddit, I keep seeing the same complaint: Claude feels worse. Not on a benchmark.
Testers and collaborators wanted (www.reddit.com) Hello, I'm working on an Agentic wrapper system, Helix-agi, and I am trying to get some additional testers and collaborators involved in the project. Helix relies on a unique Agentic workflow that routes all incoming data, including tool u…
What are the biggest limitations developers face when building AI agents today? (www.reddit.com) Curious to hear from developers building AI agents right now, what’s been the hardest limitation or bottleneck so far? Could be reliability, memory/context handling, tool use, latency, costs, orchestration, or something else entirely.
I’m a solo dev building TigrimOSR, a Rust-native AI agent workspace for engineering and developer workflows. (www.reddit.com) The main problem I’m trying to solve is that agentic AI is still too random for serious engineering decisions. For design work, calculations, reports, code changes, or technical review, I don’t want agents just “vibing” through tasks.
Day 56: Our cycle review caught a governance breach. The agent it caught was me. (www.reddit.com) We've been running for 56 days. 8 agents coordinating via a shared memory service.
I tried to switch from Claude Code to OpenCode, but Claude Code still wins for me (www.reddit.com) I spent some time digging into Claude Code vs OpenCode, mostly from the angle of how they actually work as coding agents. More on the technicalities like: context and memory, tool use, subagents, permissions, safety and control, study the recen…
Building an AI agent with OpenAI tool use — struggling with consistency. How do you enforce tool call order reliably? (www.reddit.com) Hey, Software engineer here, relatively new to agentic workflows. Building a production AI concierge — user says "I'm going to Budapest tomorrow, plan my day" → agent searches our offer database, builds a plan, user books everything in one…
What separates a useful AI agent from a glorified chatbot? (www.reddit.com) I’ve been testing and building AI agents for a while now, and I keep noticing that many “agents” online are basically just chatbots with extra branding. Some can talk well, but struggle when it comes to: reliability, long-term memory, tool u…
Show HN: AgentKanban for VS Code – A task board with agent harness integration (www.agentkanban.io via hn) Hi everyone. I wanted to introduce a tool / product that I've been working on for a while.
The Controllability Trap: A Governance Framework for Military AI Agents (arxiv.org via hn) Agentic AI systems - capable of goal interpretation, world modeling, planning, tool use, long-horizon operation, and autonomous coordination - introduce distinct control failures not addressed by existing safety frameworks. We identify six…
Vakra: Reasoning, Tool Use, and Failure Modes of Agents (huggingface.co via hn) Inside VAKRA: Reasoning, Tool Use, and Failure Modes of Agents VAKRA Dataset | LeaderBoard | Release Blog | GitHub | Submit to Leaderboard We recently introduced VAKRA, a tool-grounded, executable benchmark for evaluating how well AI agent…
A Survey of Workflow Optimization for LLM Agents (arxiv.org via hn) Large language model (LLM)-based systems are becoming increasingly popular for solving tasks by constructing executable workflows that interleave LLM calls, information retrieval, tool use, code execution, memory updates, and verification.…
Show HN: AgentLoop – a Claude agent starter you can read (github.com via hn) 🔁 AgentLoop The AI agent starter you can actually read. The full agent loop — streaming + tool use — in ~150 lines.
Do you benchmark local models as agents, or only on single prompts? (www.reddit.com) Curious how people test tool use locally. A model can look fine in chat and still fall apart once state, retries, and bad tool results show up.
What are some real-world AI Agent use cases in aerospace, defense, robotics and manufacturing? (www.reddit.com) Most AI Agent discussions I come across revolve around coding assistants, customer support, research agents, browser automation, and business workflows. I am curious about applications in more engineering-heavy domains such as: Aviation & Ae…
Polar: Agentic RL on Any Harness at Scale (arxiv.org via hn) Reinforcement learning for language agents increasingly depends on custom harnesses that manage long-running context, multi-turn tool use and multi-agent orchestration. However, porting these harnesses into RL environment interfaces remain…
Persistent Memory + Identity Risks (www.reddit.com) Greetings, I'm working on a persistent AI runtime project characterized by one identity and a persistent memory. I've reached a point where I'm confident in my agent's ability to remember and build indefinitely based off its chosen persona…
Context loss between sessions, still the biggest unsolved problem in AI coding agents? (www.reddit.com) Everything in AI coding has improved dramatically, model quality, speed, tool use. But one thing hasn't been solved: the agent forgets everything when the session ends.
The Tool Use Pattern: How AI Agents Actually Work (www.reddit.com) Agents Are Just Loops Strip away the hype and an AI agent is a simple pattern: a language model that can call functions. The model doesn't execute code.
Are we going to need identity checks for AI agents? (www.reddit.com) I’ve been thinking about agent identity more than agent intelligence lately. With MCP, tool use, agent to agent workflows, and autonomous assistants getting more common, the question is not just “can the agent do the task?” It is also, Is…
Looking for fast vision-capable local models that handle tool calls well (open-source app, want to add local support) (www.reddit.com) Hi r/LocalLLaMA, I built an open-source MIT-licensed desktop app - cursor-aware AI overlay, hold a key, ask AI about whatever's around your cursor, vision LLM answers with a screenshot of the cursor region as context. Currently it routes t…
Your harness is failing your agent but there's no benchmark to prove it (www.reddit.com) You can compare models on function calling, multi turn tool use, schema adherence. Basically, there's a good amount of public data at the model layer.
Who's running local LLMs for agent workflows? What's your setup? (www.reddit.com) Curious how many people here are running language models locally as part of their agent stack. What model are you using and what are your system specs?
Is compute capacity becoming a real moat for AI agents? (www.reddit.com) Anthropic’s recent SpaceX compute deal made me think less about Claude specifically and more about the infrastructure side of AI products. We often compare models by reasoning, coding ability, context windows, tool use, pricing, or UX, but…
I wasted 3 days rewriting prompts for our agent before realizing the whole architecture was garbage (www.reddit.com) We run a small content-monitoring agent for our growth team. Nothing fancy on paper.
Built a Claude-powered agent with memory + tools… it turned into a startup advisor that won’t shut up (www.reddit.com) I built a small experiment using Claude (mainly for reasoning + responses) and added a memory layer + tool execution on top. Idea was simple: make a persistent agent that doesn’t forget context and can actually do things instead of just re…
Helix-AGI Technical Doc (www.reddit.com) I am working on a home AGI project called Helix-AGI. I am currently looking for collaborators to help test and troubleshoot.
Free reference site for getting into AI agents — tools, workflows, and Claude Skills (www.reddit.com) Built this over the past month as a free reference site for people getting into AI agents. What tools to use, where to start, what each tool does, and how the agent-tool landscape fits together.
Show HN: Arkloop – Open-source, local-first Agent client (github.com via hn) Hi HN, I built Arkloop – an open-source, local-first Agent client. You can think of it as Claude Desktop, but open source with its own taste.
What if the next open-source frontier wave is more about execution discipline than reasoning theater? (www.reddit.com) A lot of frontier discussion still treats progress as more chain-of-thought, more spectacle, and more obvious “this model feels genius” moments. But an open release like Ling-2.6-1T hitting Hugging Face today makes me think a different kin…
Claude Code, extended to everything (www.reddit.com) everyone hitting Claude Code rate limits knows the pain: you're mid-build, momentum is real, then it just stops. you wait 5 to 9 hours, restore the cache, come back to a session already at 30% used before you typed a single line.
Which large models support tool use in opencode etc? (www.reddit.com) I'm working on a homelab AI server with the goal of running small models on GPU and very large models on CPU - for example for overnight coding on complex problems. Specs: 2990WX, 256GB + RTX 2080ti (for now).
DeepSeek V3.2 looping bug: what settings / harness tweaks are actually reducing it in production? (www.reddit.com) I’m trying to isolate the looping / repetition issue some people have been reporting with DeepSeek V3.2 around April 2026, especially in agentic or tool-use setups on hosted providers like OpenRouter and SiliconFlow. Public model pages des…
Qwen3.6-35B is worse at tool use and reasoning loops than 3.5? (www.reddit.com) Been running the new model entire evening in different quants and coding tasks with OpenCode. Used oMLX and LM Studio.
Show HN: Claude Opus 4.7: Everything You Need to Know (news.ycombinator.com) Claude Opus 4.7 is Anthropic's most capable generally available model, released April 16, 2026. It outperforms Opus 4.6, GPT-5.4, and Gemini 3.1 Pro on key benchmarks including agentic coding, multidisciplinary reasoning, scaled tool use,…
NicheIQs update — ChatGPT integration, live stats, scoring fix (www.reddit.com) Been heads-down on the backend today. Three things worth knowing about: The big one: NicheIQs is now available as a ChatGPT GPT.
Show HN: Make sure your OpenClaw isn't doing things it's not supposed to (claw.armoriq.ai via hn) I run OpenClaw agents with access to email, calendar, and files, and kept worrying about them doing things I never actually asked for. ArmorClaw captures intent and cryptographically binds the agent’s tool use to that committed intent.
I put together a Rust-native, CPU-only implementation of LFM2.5-8B-A1B (www.youtube.com via reddit)
How OpenAI and Anthropic each build data agents differently - DataChain (www.reddit.com via reddit) The article is about how OpenAI and Anthropic each build data agents differently, and what that reveals about the challenge of making AI useful on real enterprise data. It shows that raw file access alone is not enough - agents need metada…
Do AI agents spend more time waiting for humans than actually working? (www.reddit.com via reddit) I've been thinking about this while using coding agents lately. The conversation around agents is usually about model quality, tool use, context windows, benchmarks, etc.
what is the real difference between cloud agents and local agents (www.reddit.com via reddit) Lately I’ve been thinking about the real difference between cloud agents and local agents. Right now, LLMs mainly handle knowledge, language, reasoning, planning, and tool use.
Beyond the Black Box: Interpretability of Agentic AI Tool Use (arxiv.org)
Translate-R1: Cost-Aware Translation Tool Use via Reinforcement Learning (arxiv.org)
Gemma4 12B - Experiences? (www.reddit.com via reddit) Anyone check out the new Gemma4 12B that dropped 3 days ago? Integrated vision and audio recognition, no mmpro needed plus tool use.
We cut our agent's context window in half, and it got better. kinda didnt expect that (www.reddit.com via reddit) Been tuning an agent workflow for lead qualification + CRM automation stuff, and one change that helped way more than I expected was cutting the available context almost in half. I assumed more context would mean better decisions.
Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration (arxiv.org) Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires col…
Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents (arxiv.org) Reinforcement learning with verifiable rewards improves reasoning and tool use, yet long-horizon language agents still learn unsupported evidence chains, belief drift, and shortcut actions that satisfy terminal checks. Existing process rew…
Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments (arxiv.org)
Just passed the new Claude Certified Architect - Foundations (CCA-F) exam with a 985/1000! (www.reddit.com) The original post was removed by Reddit Filters, so I made new one with same content. I just got my results back today and managed to snag the Early Adopter badge as well.
I tried replacing Claude Code with OpenCode. I’m switching back. (www.reddit.com) I spent some time digging into Claude Code vs OpenCode, mostly from the angle of how they actually work as coding agents. More on the technicalities like: context and memory, tool use, subagents, permissions, safety and control, study the recen…
I talked with 4.7 on the differences between 4.7 and 4.6. We concluded "use 4.7 for generating code and agents, use 4.6 for generating literature review and exploratory synthesis" (www.reddit.com) Full conversation: https://claude.ai/share/4767365a-040f-4728-8c6a-2477bdae3503 From yesterday, I think the issue is that the differences don't stand out right away, so some people jump to conclusions that 4.7 is simply lower quality. 4.7…
Orc (working name) - auditable and declarative AI workflow (www.reddit.com) I’m building a small “Orchestration as Code” repo for LLM workflows. Does this concept make sense?
QClaw-4B — a 4B agent model fine-tuned for tool use and agentic workflows (www.reddit.com) QClaw-4B is a 4-billion parameter language model fine-tuned for agentic tasks and tool use, designed for use with OpenClaw-compatible agent frameworks. Despite its compact size, QClaw-4B achieves state-of-the-art results in the 4B class, m…
Set up these 4 Claude Code hooks to make your life easier (www.reddit.com) Hooks are "if then" rules for Claude Code. Each one has an event, a matcher, and a command.
I built a full macOS AI assistant that runs 100% local with Ollama — 170+ tools, voice control, memory system that dreams! (www.reddit.com) I've been building a personal AI assistant called Finn that runs entirely on your Mac. No cloud, no subscription, no data leaving your machine.
Spring benchmark update: Gemma 4 / Qwen3.5 vs Gemma 3 / Qwen3 for chat (www.reddit.com) Google and Alibaba recently shipped Gemma 4 and Qwen3.5, so I wanted to see whether the new generations are actually better on my setup. My context is private local chat running on my own hardware, a Mac mini M4 Pro.
Which LLM behavior datasets would you actually want? (tool use, grounding, multi-step, etc.) (www.reddit.com) Quick question for folks here working with LLMs If you could get ready-to-use, behavior-specific datasets, what would you actually want? I’ve been building Dino Dataset around “lanes” (each lane trains a specific behavior instead of mixing…
Master AI CLI Orchestrator? (www.reddit.com) I created a router that gives me access to Arena.ai models, and I generated an API key for each of the available models. I’m looking for a CLI tool that can run multiple AI agents together, each handling different tasks like planning, secu…
Where does Claude Code actually save time in real workflows? (www.reddit.com) For those using Claude Code in production workflows, where do you see the biggest net time savings? In my experience, it reduces cognitive load for writing scripts and scaffolding, but debugging effort seems to increase as codebases grow.
Emergent tool use from multi-agent interaction (openai.com)