#function-calling
29 items
Qwen 3.6 27B BF16 vs Q4_K_M vs Q8_0 GGUF evaluation (www.reddit.com) Evaluated Qwen 3.6 27B across BF16, Q4_K_M, and Q8_0 GGUF quant variants with llama-cpp-python using Neo AI Engineer. Benchmarks used: HumanEval (code generation), HellaSwag (commonsense reasoning), BFCL (function calling). Total samples: HumanE…
Needle: We Distilled Gemini Tool Calling Into a 26M Model (www.reddit.com) We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok/s prefill and 1200 tok/s decode on consumer devices.
↯ Tool Use ↯ Function Calling function-calling tool-use gemini +1
Benchmarked Gemma 4 E2B: The 2B model beat every larger sibling on multi-turn (70%) (aiexplr.com via reddit) Tested Gemma 4 E2B across 10 enterprise task suites against Gemma 2 2B, Gemma 3 4B, Gemma 4 E4B, and Gemma 3 12B. Run locally on Apple Silicon.
↯ Security ↯ Gemma 4 ↯ Function Calling function-calling prompt-injection rag +2
Benchmarked Needle 26M vs Qwen3-0.6B on CPU function calling, 50 queries across 5 difficulty tiers. The 23x smaller model wins on accuracy and is 4.4x faster. (www.reddit.com) Ran a head-to-head on two open-weight models for tool-calling on a 4-core CPU, no GPU, no cherry-picking. Wanted to see if the small specialist (Needle, 26M, distilled from Gemini 3.1 for function calls) actually holds up against a small g…
↯ Gemini 3.1 ↯ Function Calling function-calling tool-calling gemini
After 3 months of switching between Claude Sonnet 4.6, GPT-5.5, and Gemini 3.1 daily — here's my actual routing (www.reddit.com) Not benchmarks — actual tasks, actual results. Claude Sonnet 4.6 for: - Long documents that need nuanced analysis - Writing where voice and precision matter - Reasoning through edge cases in code - Anything where "think carefully" is the r…
Learn, run and test Agentic AI on your browser for free! (Built with Claude Opus 4.7 in 2 days) (www.reddit.com) Hey Everyone, Over the last few months, I noticed a massive gap in how we learn about Agentic AI. There are a million theoretical blog posts and dense whitepapers on RAG, tool calling, and swarms, but almost nowhere to just sit down, run a…
↯ Fine Tuning ↯ Opus 4.7 ↯ Function Calling function-calling fine-tuning rag +4
How to build production Agents (by a staff software engineer) - Part 1 (www.reddit.com) I'm a software engineer with 10+ years of experience, from Meta AI and startups. I've been building AI Agents for the past 3 years, as a founding engineer and as a founder building custom AI Agents for businesses.
Context checkpoint erasure in llama.cpp ? (www.reddit.com) Has anyone been able to solve or mitigate context checkpoints being erased during single user inference, specifically when function calling is part of the chat history? I've been using Qwen 3.5 35B A3B for some time (now using 3.6), tested…
What’s the cheapest way to give a local Llama 3 internet access? (SearXNG isn’t cutting it) (www.reddit.com) Finally got Llama 3 70B running locally and wired up function calling so it can search the web. First tried self-hosting SearXNG, but the results are pretty messy.
Show HN: I built a search engine for llms.txt sites (statespace.com via hn) More and more developer tools are adopting the llms.txt standard to build AI-friendly versions of their docs. The problem is that it's very hard to search across them.
↯ Mistral ↯ Function Calling function-calling vector-database mistral +1
I built an open source protocol that gives every AI tool a signed contract — so your agent verifies before executing, saves tokens by choosing card depth, and leaves an auditable receipt on every call. No blind function calling. (www.reddit.com) ▎ What problem does this solve? Right now most agents call tools based on a name and a JSON Schema.
Tool calling vs prompt routing for search decisions? (www.reddit.com) Hi, would appreciate your help. I have a summary of a given topic plus past conversation history.
opensource router slm with 50-100ms latency and 99% accuracy that runs locally (www.reddit.com) i am working on a router slm that helps in multiple agent orchestration , excels in tool calling but every option comes with a tradeoff of its own , you are invited to give your approaches to refine the architecture 1 - if we use multiple…
How are you actually predicting AI costs before they hit your invoice? (www.reddit.com) Switched from prototype to production last month and our AI bill was 3x what we estimated. Not because we picked the wrong model - we just didn't know what we didn't know.
Needle-rs – AI Function calling in the browser, 258 KB WASM (needle-rs.pages.dev via hn) AI TOOL CALLING · WASM · NO_STD Below is a 26M-parameter tool-calling transformer running entirely in this tab — no server, no API key, no data leaving your device. The model is Needle by Cactus Compute; needle-rs is the pure-Rust runtime…
Looking for fast vision-capable local models that handle tool calls well (open-source app, want to add local support) (www.reddit.com) Hi r/LocalLLaMA, I built an open-source MIT-licensed desktop app - cursor-aware AI overlay, hold a key, ask AI about whatever's around your cursor, vision LLM answers with a screenshot of the cursor region as context. Currently it routes t…
↯ Tool Use ↯ Function Calling function-calling tool-use gemini +3
Your harness is failing your agent but there's no benchmark to prove it (www.reddit.com) You can compare models on function calling, multi turn tool use, schema adherence. Basically, there's a good amount of public data at the model layer.
Function calling works great in demos. In production, it’s a different story. (www.reddit.com) I’ve been working on adding function calling to an LLM-based support system over the past few weeks. Thought I’d share a few things that didn’t behave the way the demos suggest.
Qwen Meetup Presentation: Function Calling Harness, from 6.75% to 100% (typia.io via hn) TL;DR - AutoBe: AI backend auto-generation agent - Production-grade backend from natural language conversation - 4 AST types + 4-tier compiler validation + self-healing loops - Schema specs are the new prompts - Typia: The infrastructure…
Local LLM Benchmark about Backend Generation by Function Calling (GLM vs Qwen vs DeepSeek) (www.reddit.com) Detailed Article: https://autobe.dev/articles/local-llm-benchmark-about-backend-generation.html Five months ago I posted the "Hardcore function calling benchmark in backend coding agent" thread here. As I wrote in that post, it was an unco…
Qwen Meetup Draft Review Required (Function Calling Harness 2 - CoT Compliance from 9.91% to 100%) (autobe.dev via reddit) Talk at Qwen Meetup Korea end of May. Looking for review on this draft before I build PPT slides off it.
Multi-agent in production: real win or just hype? (www.reddit.com) Trying to get an honest read on this from people actually shipping. Every other AI announcement lately is "agentic" or "multi-agent," and I can't always tell if it's a real architectural shift or rebranded function calling with extra steps.
Run, Learn and test Agentic AI for free, on your browser! (Open AI Models are included) (www.reddit.com) Hey Everyone, Over the last few months, I noticed a massive gap in how we learn about Agentic AI. There are a million theoretical blog posts and dense whitepapers on RAG, tool calling, and swarms, but almost nowhere to just sit down, run a…
↯ Fine Tuning ↯ Function Calling function-calling fine-tuning rag +3
The model alone is not the agent. The harness plus the model is the agent (www.reddit.com) An agentic harness is the orchestration and control layer wrapped around a base language model that transforms it from a stateless text predictor into an agent capable of taking actions, calling tools, maintaining state across steps, and e…
Defender – Local prompt injection detection for AI agents (no API calls) (www.npmjs.com via hn) Prompt injection defense framework for AI tool-calling Indirect prompt injection defense and protection for AI agents using tool calls (via MCP, CLI or direct function calling). Detects and neutralizes prompt injection attacks hidden in t…
↯ Security ↯ Function Calling function-calling tool-calling prompt-injection +2
Building Expertise in Claude - Seeking Quality Learning Resources (www.reddit.com) Hi everyone, I'm on a mission to become a serious expert in Claude and AI, and I'm building a structured learning path. I want to create content that's actually valuable - with real practical applications, not surface-level tutorials.
ReAct or CodeAct, that is the question (www.reddit.com) Hi guys, Idk what you think, but for me, one of the biggest discussions in the AI engineering field is this issue: ReAct vs. CodeAct.
Qwen3.6:27b vs qwen3-coder:30b vs deepseek-coder:33b on code gen, tool calling, and agent tasks (www.reddit.com) Ran a full eval against four local models last weekend and the spread between them is wider than I expected. All running through Ollama on CPU, no cloud, same prompts, same hardware.
↯ Qwen 3.6 ↯ Function Calling humaneval function-calling ollama +1
Function calling and other API updates (openai.com)