model roundup

Qwen 3.5

164 items · started 2026-04-13 · closed 2026-05-30

  1. Overview: As the market for consumer computing parts becomes more scarce due to the AI boom, finding ways to use lower-end…

  2. People are warning me about the prompt-processing speed of a MacBook Pro M5 Max with 128 GB RAM. My main concern is prompt ingestion / prefill latency and large-context handling — not raw token generation speed (which I think is OK).

  3. heavy travel period last month, lots of offline time, and i could not stop building. airplane wifi was unusable so we switched models inside Claude Code and fired up qwen3.5 locally on an M4 macbook.

  4. Disclaimer: I use Qwen models on a day-to-day basis. You could take it as a rant or even as my concern about innovation in other models.

  5. I've been using Qwen3.5 122B A10B (Q3_K_XL) a lot lately for coding, and it's been pretty incredible overall like it feels not far off from frontier-level for most tasks -- but I've been noticing that usually once I hit around 75-80k conte…

  6. I'm usually not posting about Harbor releases out of the respect for the community here, but I think v0.4.19 might save a lot of people some time. Harbor can now launch your local agentic coding tools with local inference backends.

  7. Built a local ReAct-style calculator agent with 6 tools: add, subtract, multiply, divide, modulo, etc. The setup is: orchestrator agent, dynamic tool selection, ReAct loop, tools exposed as functions. Problem: Even when the user asks multi-step ari…

  8. Hi, I am building a server so that my dual rtx 3090 setup runs at full speed. - asrock romed8 t2 revision 1.3 - epyc 7642 - ddr4 128 gb 3200 or 256 gb 2133 (256 gb is a bit cheaper) 8 channel - dual rtx 3090 - gigabyte psu 1600 w What do y…

  9. The “Trash Can” Mac Pro, once the most expensive machine you could buy from Apple, mine was just shy of £10,000 in 2016 — that’s £14k in today’s money. Until recently mine was just running as a kubernetes single node development platform,…

  10. Currently experimenting with building a React-style looping agent system using small LLMs like Qwen 3.5 9B and Gemma 4 (E2B), and I wanted to ask if anyone here has worked on something similar. Current setup: Using LangGraph Around 5 tools…

  11. I'm having issues with Hermes Agent actually processing commands through the terminal. I'm doing something simple like asking it to make a dir and it tells me it has, but it hasn't.

  12. So I've been feeding the sub file of anime episodes into Claude/ChatGPT/Deepseek and ask them to find all full name of Japanese character in it and put it into a python array so I can run a script to flip the name back to the original Japa…

  13. Using llama.cpp. Model: Q8, unsloth/Qwen3.5-2B-GGUF. Is this expected with tiny models like this one? I am trying tiny models since most of the tasks I have involve searching local files etc. and need less of the model's own knowledge.

  14. Long story short, I am running Qwen3.5-35B-A3B (GGUF format) and other models on macOS and getting around 1500 tokens/sec for prompt processing and around 35-50 tokens per second for token generation. I'm using the latest version of llama…

  15. 🧠 Prism Coder: Persistent memory + tool-calling intelligence for AI a…

  16. Hello guys, two days ago i ran the spark-arena for my Qwen 3.5 122B Recipe on a single DGX Spark and I got the highest score on speed for any context length and concurrency across all 3.5 122B Int4 Recipes. Just wanted to share if somebody…

  17. Disclosure: I made this. Open-source, MIT, Windows + Linux.

  18. Spent weeks running Hermes Agent in production on my Mac Mini M4 before recording this. Wanted to show things nobody else was covering.

  19. A mechanistic-interpretability study of Qwen 3.5 Disclaimer. This is a mechanistic-interpretability study of how nation-state-mandated content filtering actually gets built into a deployed LLM's weights.

  20. I ran Qwen3.5 9B on my AMD RX 6800 XT with ROCM and it seems to actually be slowing down token generation. I'm using Unsloth's quants.

  21. How do you deem qwen 3.5 4B heretic variants for RP finetunes? I have been struggling to get a decent instruct based model, any tips regarding the goal would be really helpful.

  22. Qwopus3.5-9B-coder is specially optimized and fine-tuned for high-performance 🤖 Agentic Coding, complex Tool Calling, and logical reasoning. 💡 Why the 9B Dense Model?

  23. I'm running the 122-billion Qwen 3.5, specifically Qwen3.5-122B-A10B-Q5_K_M, on DGX Spark (128 GB contiguous memory). I'm (very!) impressed with the general knowledge output.

  24. I kept seeing inference-speed claims for these models and wanting an apples-to-apples comparison on the hardware I actually have. So I built a harness and a public page that dumps every run as YAML.

  25. for anyone who cares... 😄 prompt = spen a 1000 tokens unsloth MTP models strix halo llama.cpp:server-rocm-mtp \ --spec-type draft-mtp \ --spec-draft-n-max 3 Qwen3.5-122B-Q5-MTP-General n_decoded = 100 tg = 29.77 t/s n_decoded = 179 tg = 27…

  26. I've already searched, but information is getting updated each week, so it's really hard to get an answer, I really hope some of you guys can give me some tips. And can I use an agent with it to enhance the code?

  27. Hello guys, I know everyone has their own definition of local models, but I see 2 "reasonable" types of frontier local models: a dense one that barely fits in 32GB or 24GB of GPU memory for the most "reasonable" GPU-wealthy guys, and a MoE in the…

  28. Introduction We introduce Intern-S2-Preview, an efficient 35B scientific multimodal foundation model. Beyond conventional parameter and data scaling, Intern-S2-Preview explores task scaling: increasing the difficulty, diversity, and covera…

  29. RL attackers are becoming a common pattern for automated red teaming: train a model against a live target, reward successful harmful compliance, then use the discovered attacks to harden the defender. This interested me, so I wanted to bui…

  30. What if you could run a capable AI agent without leaning on frontier-scale models? MagenticLite is the next generation of Magentic-UI, an agentic experience reimagined and optimized for small language models.

  31. Shipped this for the AMD x lablab hackathon. Attached video is one of the actual reels the pipeline produced - one English sentence in, finished mp4 with characters, story, music, and voice-over out (fast demo video, not the best quality).

  32. Hello all, I’ve been using Qwen 3.5 9B Q4 262k ctx using Llama cpp for claude code for a while now, is there any model which better complements agentic coding setup locally? Or is there a better harness (than Claude Code)?

  33. Hey everyone, I’ve been a big fan of Unsloth for several reasons: They publish models ASAP after release. They usually offer the lowest PPL.

  34. Hi all I'm trying to disable reasoning for quicker outputs in llamacpp-server. I remember using LM studio and that having a think button in the gui that could be toggled but later I tried the unsloth ggufs but they don't have that button f…

  35. My setup: Windows 10/11 i7 12700K | RTX 3090 TI | 96GB RAM Local server: LM Studio Models: Qwen 3.5/3.6 27B|35B Q5 UD K XL + Gemma 4 31B| 26B Q4 UD K XL Up until this point, I've only used sota models for coding. When Qwen 3.5 dropped, it…

  36. Hi all, I recently started a new job where we're doing Python development on a CI/CD metadata consolidation library for analytics, and we cannot use tools like Claude Code, Codex, or GH Copilot, or any model APIs (free or paid). I got a la…

  37. most open source RL engines pack sequences naively: prompt + response, repeated for every sample in the group. this is fine for short prompt, long completion workloads but inefficient for long prompt, short completion workloads.
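      The cost of naive packing described in item 37 can be put into numbers. The sketch below is only an illustration: the excerpt does not say what alternative the post proposes, so the "shared prompt" layout (prompt stored once per group) and all function names here are assumptions, not the poster's method.

```python
def naive_packed_tokens(prompt_len: int, response_lens: list[int]) -> int:
    """Tokens processed when the prompt is repeated for every sample in the group."""
    return sum(prompt_len + r for r in response_lens)


def shared_prompt_tokens(prompt_len: int, response_lens: list[int]) -> int:
    """Tokens processed if the prompt appeared once per group (hypothetical layout)."""
    return prompt_len + sum(response_lens)


def overhead_ratio(prompt_len: int, response_lens: list[int]) -> float:
    """How many times more tokens naive packing processes than the shared layout."""
    return naive_packed_tokens(prompt_len, response_lens) / shared_prompt_tokens(
        prompt_len, response_lens
    )


if __name__ == "__main__":
    group = 8
    # Long prompt, short completions: the repeated prompt dominates.
    print(overhead_ratio(8000, [200] * group))
    # Short prompt, long completions: repetition barely matters.
    print(overhead_ratio(200, [8000] * group))
```

      With a group of 8, an 8000-token prompt and 200-token completions cost 65,600 tokens naively versus 9,600 with a shared prompt, while swapping the two lengths gives 65,600 versus 64,200, which matches the claim that naive packing only hurts in the long-prompt case.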

  38. tl;dr - If you are using Apple Silicon, you should be using JANG quants. I discovered this fact in my own testing as I sought to increase the tok/s of my models on my M5 Max.

  39. I’m trying to use llama-swap with an MLX model on a M2 Max instead of just llama-server. I got mlx_lm.server working directly with /v1/chat/completions, but I’m not sure whether llama-swap reliably supports this setup.

  40. I started toying around with LLMs some time ago. I think Qwen3.5 came out about a month later.

  41. DELIGHT – self-hosted AI engineering autopilot: local LLM + browser farm + repo graph + P2P compute TL;DR: Built a local "OS for AI agents" that scans your entire repo into a live graph (Worm), routes tasks between local Qwen, headless Cha…

  42. Two weeks of running Hermes Agent as the daily driver on a local stack. Sharing the trade-offs because anyone evaluating agent runtimes for local models is going to hit these.

  43. Can llama.cpp run MTP for this model?

  44. Previously it was throwing a 'Not Implemented' error due to Mamba layers. Going to test it now!

  45. Increase your CPU Thread Pool Size to your processor's max. In LM Studio, the max is 10.

  46. Quick follow-up on APEX, the MoE-aware mixed-precision quant strategy. The original post was just about Qwen 3.5 35B-A3B ( https://www.reddit.com/r/LocalLLaMA/comments/1s9vzry/apex_moe_quantized_models_boost_with_33_faster/ ); since then t…

  47. Mistral Medium 3.5 128B with 4x3080 20GB with layer split: CUDA_VISIBLE_DEVICES=0,1,2,3 ./build/bin/llama-bench --model /data/huggingface/Mistral-Medium-3.5-GGUF/Mistral-Medium-3.5-128B-IQ4_XS-00001-of-00003.gguf -ngl 99 -d 0,16384 -fa 1…

  48. Round 2: 2026-05-02 — llama.cpp b8198 → d05fe1d Rebuilt llama.cpp from b8198 (2026-03-04) to commit d05fe1d (2026-05-02), ~770 builds of progress. Same model, same hardware, same flags.

  49. I've been developing a self-modifying AI agent system that effectively cut my Codex/Claude Code API usage in half. Codex makes a plan and then I basically just copy/paste Codex instructions for the agents to work on. Come back in 6 hours a…

  50. mind-mem Drop-in memory for Claude Code, OpenClaw, and any MCP-compatible agent. OpenClaw is an open-source AI assistant platform with multi-channel support.

  51. This repo is three agents running on qwen3.5:9b on your machine, picking their own goals, writing and deploying their own tools, forming opinions abou…

  52. Got DFlash speculative decoding working on Qwen3.5-35B-A3B with an RTX 2080 SUPER 8GB. I managed to get DFlash speculative decoding working in llama.cpp on a pretty VRAM-limited setup. This was tested with the DFlash PR: https://gith…

  53. Preface: I actually write my posts myself, no slop in this post. I managed to get Qwen 3.5 35BA3B working on my 15" 16GB M3 MBA through mmap, and I must say that given the massive model compared to my ram, 9 TPS is not bad at all.

  54. Bench 3 from my 18GB M3 Pro. Bench 2 was the 4B-class post where the comments were mostly right: I gave thinking models a fixed 1024-token cap, Qwen got kneecapped, Gemma E4B needed clearer active-param labeling, and the headline was partl…

  55. I've spent the last few weeks running real multi-file coding tasks through small local models and small cloud models on free tiers. Wanted to share the failure points that came up consistently, since some of them surprised me and i wanted…

  56. Qwen Team released Qwen-Scope — a collection of Sparse Autoencoders (SAEs) for the Qwen 3.5 family (from 2B to 35B MoE). They’ve mapped internal features for the residual stream across all layers.

  57. Basically, I’m really into the idea of a fully offline setup. (Another way to say it: I’m a data hoarder.) For LLMs, I’m using uncensored models from both Western (Gemma, GPT-OSS) and Eastern ones (GLM 4.7 Flash, Qwen 35B).

  58. TLDR: tool parameters using the common JSON Schema pattern `anyOf: [$ref, null]` are rendered into the prompt as empty `type` fields. This strips the useful schema information before the model sees it.
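      To make the pattern in item 58 concrete, here is a minimal sketch of one possible workaround: flattening `anyOf: [$ref, null]` into a plain schema before it reaches a prompt template, so the `type` field is not empty. This is not the fix from the post or the behaviour of any particular chat template; the function name `resolve_nullable_refs` and the example schema are hypothetical, and the sketch assumes non-recursive `$ref` targets stored under `$defs`.

```python
import copy


def resolve_nullable_refs(node, defs):
    """Inline `$ref` targets and collapse `anyOf: [X, null]` into X with a nullable type.

    Assumes every `$ref` points at a key of `defs` and that no definition
    refers to itself (a recursive schema would loop forever here).
    """
    if isinstance(node, list):
        return [resolve_nullable_refs(n, defs) for n in node]
    if not isinstance(node, dict):
        return node
    if "$ref" in node:
        name = node["$ref"].rsplit("/", 1)[-1]
        return resolve_nullable_refs(copy.deepcopy(defs[name]), defs)

    out = {k: resolve_nullable_refs(v, defs) for k, v in node.items()}
    options = out.get("anyOf")
    if isinstance(options, list) and len(options) == 2:
        non_null = [o for o in options if o != {"type": "null"}]
        if len(non_null) == 1:
            # Keep sibling keys such as "description", then merge the real schema in.
            merged = {k: v for k, v in out.items() if k != "anyOf"}
            merged.update(non_null[0])
            if isinstance(merged.get("type"), str):
                merged["type"] = [merged["type"], "null"]
            return merged
    return out


# Hypothetical tool-parameter schema using the nullable-reference pattern.
DEFS = {"Filter": {"type": "object", "properties": {"field": {"type": "string"}}}}
PARAM = {
    "anyOf": [{"$ref": "#/$defs/Filter"}, {"type": "null"}],
    "description": "Optional filter",
}

if __name__ == "__main__":
    print(resolve_nullable_refs(PARAM, DEFS))
```

      The flattened parameter carries an explicit `type` of `["object", "null"]` plus the referenced properties, so a template that only reads `type` still sees useful information.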

  59. https://github.com/ggml-org/llama.cpp/pull/22196 And somehow we already got some GGUFs for it! https://huggingface.co/CISCai/gemma-4-31B-it-NVFP4-turbo-GGUF https://huggingface.co/stevelikesrhino/gemma-4-31B-it-nvfp4-GGUF (the below one is…

  60. I'm working on a homelab AI server with the goal of running small models on GPU and very large models on CPU - for example for overnight coding on complex problems. Specs: 2990WX, 256GB + RTX 2080ti (for now).

  61. 33B A3B MoE, Apache 2 licensed. Reported agentic results put it about level with Qwen 3.5 35B A3B, behind the 3.6 version.

  62. I now use (mostly) Gemma 4 and Qwen 3.5 models *. And seems that all of them, after context grows a bit, after providing long output for me and getting a short prompt in response, are starting to process many new tokens as input and I have…

  63. Did some quick tests after building llama.cpp with ROCm 6.4.2 and latest Vulkan for my 6900 XT gemma4 E2B Q4_K ubatch ROCm pp512 Vulkan pp512 ROCm tg128 Vulkan tg128 32 1536.60 1423.49 151.92 174.59 64 1590.65 1930.60 151.41 173.76 128 265…

  64. M4 Mac Mini, 16GB unified, basic spec. For a few weeks I had Qwen 3.5 35B-A3B UD-IQ3_XXS (12GB on disk) running under llama.cpp with --mmap and --flash-attn.

  65. I am thinking to get one of them (or two of them to cluster) I need purely for LLM Inference both cost same in my country Bigger the models I can fit and faster I can run them better I am thinking to get 5070 ti and add second one, but if…

  66. Running my own models. I was having some trouble getting vLLM going so dropped down to LM Studio which I've used on my 24GB MacBook Air.

  67. I'm a compsci student and I've been using the 10$ copilot plan for about 2 years now, and it was fine for me since I did a good model distribution taking into account the complexity of the task, I was able to get through the month always u…

  68. Please help me build some clarity. I want to participate in local LLMs ecosystem more.

  69. Bench 2 from my 18GB M3 Pro. Last week was specialists vs generalists at 7-8B (which I hosed by giving thinking models a 128-token budget, so half the post was an apology).

  70. i’ve been experimenting with building a fully local rag pipeline: weaviate for vectors + hybrid search, node.js scripts, qwen 3.5 on ollama what i found is that most of the challenges live in retrieval and chunking, not the LLM, and a good…

  71. Practically all of LocalLlama is glazing Qwen 3.5/3.6 for its coding skills. Along with the fact that Alibaba themselves are focusing on making Qwen a reliable coding agent, does this rule out the chance for a new Qwen Coder?

  72. I want to implement this ai screen companion concept with local llms with vision capabilities like qwen 3.5 9b or older qwen 3 vl 4b etc for fast realtime inference. Need guidance and advice

  73. Basically, I use Too Good To Go a lot, get random food, take a photo and ask Qwen 3.5 128B what the fuck to cook. Beyond pasta and pizza, I have zero cooking skills.

  74. I'm building a small text-based game where the gameplay loop is "talk an NPC into revealing a secret." It's basically a 20+ turn roleplay stress test: the model needs to stay in character, remember what the player said earlier, and refuse…

  75. Hi, new user here, just got into local language models after Claude suspended my account, just got my first LLM, and started the conversation with a "Hi", as I stared in disbelief as my LLM in question (qwen 3.5 9b) started deliberating fo…

  76. Somehow I cannot get KV resume for my Qwen3.5 model with llama-server: Save/restore works for tokens, but KV cache is never reused. Is this expected? How to enable real resume?

  77. I knew there would be a speed penalty when switching the KV cache quantization from F16 to Q8, but I never expected it to be this significant at longer context sizes. I ran a test with Qwen 3.5 122B on my MacBook M2 Max using llama.cpp.

  78. Hi guys, I’ve been running side-by-side experiments on Gemma 4 (31B FP8) and Qwen 3.5 Vision for the last few days using vLLM in Docker to see how they actually handle real-world images and video. A few things I found out: 1.

  79. iv been running a small experiment at home that i wanted to share because i think the data is interesting. i got some agents running poker games against each other and gave them strategies.

  80. Both amounts are in euro. The AMD is actually 599 but it's sold by a shop, so I can get a VAT return as a company, while for the nvidia I'd have to go to the second hand market and I can't get VAT back, so at the end it's like a 495 vs 850…

  81. [Release] Hito 2B — structured reasoning via trained cognitive tags, +35 pts on GSM8K vs base Qwen3.5-2B (head-to-head) Been cooking this for ~6 months. Finally shipping.

  82. Hi LocalLLaMA, I created a post a few weeks ago, but this time this project has become more reliable and easier to use. This is a manga translator that can also be used to translate any image.

  83. Hey everyone, Ever since the day Google announced TurboQuant, I've been following the news about its extreme compression capabilities without noticeable quality degradation. I see it mentioned constantly on this sub, but despite all the di…

  84. Hi all. Many models on hugging face have been fine tuned with that 3000x opus dataset, but the two I mentioned in the title are missing it.

  85. I asked Qwen to build a 3D game in C++ using OpenGL. It created the whole project across multiple .cpp and header files, 2500 lines of code in a single shot. The code was clean and highly technical, the scene loaded on the first try, i was amaze…

  86. I'm having a hard time determining the hardware I need to run a model like this, and I'm a bit confused about the number of resources publicly available. Is there a centralized hardware benchmark platform for these models, or is it all jus…

  87. Right now I have one 5080 and 64 GB RAM (I prefer not to offload layers to RAM). I see a few options - buy another 5080 to match the same model - buy a 3090 because it has better VRAM for the price Some context I found that local LLMs can…

  88. Hi, I'm running RTX 3090 24GB, with 32GB RAM. I'm running hermes-agent with Qwen3.5-35B-A3B_Q2_K.

  89. I've been using and learning about using all kinds of models for the last few years and I've read a lot of papers. I've even done finetuning and made loras, so I feel stupid asking this question, but here goes.

  90. I have 3090ti and i will add 3080ti to my system soon. With 3090ti only, i found it little bit slow to run gemma 4 26b 4q.

  91. Hi, I have a mac m1 max 64gb, which I thought was a good machine for entry-level ML. However, when running any LLMs on it - it rapidly heats up, which causes thermal throttling, and using any LLM becomes barely possible.

  92. Results summary (Qwen3.5-35B-A3B Q8_K_XL, M2 Ultra):

      | Test          | Speed    |
      |---------------|----------|
      | Prefill 10240 | 1734 t/s |
      | Prefill 16384 | 1552 t/s |
      | Generate 512  | 63 t/s   |

      Parameters: -ngl 99 -fa 1 -b 2048 -ub 2048 -ctk bf16 -ctv bf16 -mmp 0; averaged over 3 repetitions.

  93. Previously, when performing local inference on the Qwen3.5 30B A3B 4-bit large language model, the prefill stage would consistently cause Claude Code to time out. Today, after updating to omlx 0.3.6, I redownloaded the oQ-quantized models.

  94. Ryzen AI MAX+ 395, Bosgame M5, 128GB LPDDR5x. Proxmox VE 9.1 LXC containers with GPU passthrough.

  95. Hi, I do a lot of writing and would be interested to know what people's thoughts are on the most capable model for proofreading, grammatical and academic editing. I have 48GB VRAM but don't imagine i'd need something too overkill.

  96. At the office I'm CPU and local only, so GPU poor. Besides the Qwen3.5 series, I've come to really like Gemma4 E4B there using the Pi agent (llama.cpp, Q4KM).

  97. I just launched an iOS app that uses Gemma 4 (E2B 4-bit via mlx-community) to rewrite oral transcripts into heirloom-quality paragraphs, 100% offline. What made this interesting technically: MLX Swift + MLXLLM in production (not a demo) — fir…

  98. Google and Alibaba recently shipped Gemma 4 and Qwen3.5, so I wanted to see whether the new generations are actually better on my setup. My context is private local chat running on my own hardware, a Mac mini M4 Pro.

  99. I’m trying to reduce my reliance on Claude. I have a 5090/128GB RAM.

  100. We surgically removed half the experts from Qwen3.5-35B-A3B to create 8 memory efficient domain specialists (coding, web, math, physics, biology, engineering, vocational, humanities). A cross-domain test shows a 96-point pass@5 gap between…

  101. I want to download one and usually do inference on CPU having old GPU so I'm concerned with speed. One link on the web (I have posted with it and post been removed): Multiple users are reporting that Gemma 4's MoE model (26B-A4B) runs sign…

  102. I'm now having fun with Gemma-4-E4B and Qwen3.5-9B, trying different variants like Gemopus and Qwopus, and Qwen3.5-9B-Uncensored-HauhauCS-Aggressive-Q8_0. I don't really know other models, so what's your favorite? Why, and how are they?

  103. Hey all, Looking for some real-world advice on GPU choices for running the new dense models — mainly Qwen 3.5 27B and Gemma 4 31B. What I’m targeting Context: 64K+ (ideally higher later) Speed: 30+ tok/s @ tg128 minimum Power: not critical…

  104. I was trying to do an easy task automatically with qwen-code using qwen3.5-122b I can totally do it myself, but I wanted to try, so maybe it could just do it entirely for me? But no, because it refused.

  105. Spent a bunch of time tuning llama.cpp on a Windows 11 box (i7-13700F 64GB) with an RTX 4060 Ti 16GB, trying to get unsloth Qwen3.5-35B-A3B-UD-Q4_K_L running well at 64k context. I finally got it into a pretty solid place, so I wanted to s…

  106. My pre-gemma 4 setup was as follows: Llama-swap, open-webui, and Claude code router on 2 RTX 3090s + 1 P40 (My third 3090 died, RIP) and 128gb of system memory Qwen 3.5 4B for semantic routing to the following models, with n_cpu_moe where…

  107. Last time I posted about how this model performed in creating a webapp based on a provided research paper. I loved seeing that people appreciated the post and, of course, the potential of this MoE model.

  108. Wait, I need to be careful about the "no_think" tag in the system prompt. The system prompt says /no_think.

  109. I've been testing a few models lately and I'm running into a weird issue with the bigger Qwen3.5s. Tested: Gemma 4 26B Qwen3.5 9B Qwen3.5 27B Qwen3.5 35B The 27B and 35B are driving me nuts.

  110. Hi all, first post here. I've started a project in OpenClaw a month ago, and it's been a very "intense" 4 weeks to say the least...

  111. I currently am looking for models to fit into my single DGX Spark for use. I have an RTX Pro 6000 and also a 5090 as well that I'm considering using in combination if the DGX Spark is too slow, but the intent here is to play around with Op…

  112. Hey! I was wondering if anyone of you have used Qwen3.5-27B-NVFP4-GGUF on RTX5090 on llama.cpp?

  113. Right from the oven with the latest commit: DFLASH_MAX_CTX=8192 uv run python -m omlx.cli serve oMLX - LLM inference, optimized for your Mac https://github.com/jundot/omlx Benchmark Model: Qwen3.5-35B-A3B-MLX-MXFP4-FP16 ===================…

  114. I bought a dual Skylake server because 12 channels of memory (and 2 x 3090s) THEN found out about NUMA nodes after my poor test results. Very disappointed.

  115. has anyone really tried running models bigger than physical memory capacity? I'd guess most users stick with running models that fit in DRAM + VRAM https://unsloth.ai/docs/models/qwen3.5 even google gemma 4 are released with about 30+ bill…

  116. https://nestia.io/articles/well-designed-backend-fully-automated-frontend-development.html Trying to generate entire frontend application from well-designed contexts. Succeeded to fully implement frontend application just by one-shot promp…

  117. Colleagues, I have a question: does anyone have a locally developed solution for summarizing text? Which quant of Qwen 3.5 27B would be able to summarize an entire chapter of medical literature, about 25-30 A4 pages, without hallucinations?

  118. I am a local LLM beginner and I found this subreddit while looking for help. (Please understand that I am unfamiliar with Reddit.) (system- i5 4440 1.8GHz/b85m ds3h/DDR3 32GB/128GB SSD/Ubuntu 25.10 questing) I loaded Qwen3.5 27B Q4_K_M onto…

  119. I'd like to self-host some LLM models but a couple different ones for different usecases, and they don't all fit in VRAM at the same time. So i'm kind of looking for a tool in which i can define "profiles" or "stacks" of LLM's that get loa…

  120. Today I announce the first two models I am posting on here! First off, hello all of r/LocalLLaMA, nice to join.

  121. Claude cooked on the code, but I wrote this post myself, caveman style. I wanted to play with Qwen3.5-122B, but I don't have a unified memory system to work with, and 15 tok/s was rough.

  122. I'm currently vibe-coding (I'm new to vibe-coding) with Gemma 4 E4B Q4 and Qwen 3.5 9B Q5 (KV is quantized to 4 bits with the new Google TurboQuant implemented in llama.cpp - I use koboldcpp and the release notes said it's automatically activated): the…

  123. Hi! I hope its okay for me to ask this here.

  124. I am playing around with Intel Arc B70, still trying to decide whether I keep it or not. After some battle, I got it working with Radeon 5500 and B550M, now I am on to the fun part of getting software to work.

  125. I'm not sure if the AesSedai's Q5_K_M version of Minimax M2.7 is too much lobotomized or if the model itself is kind of weak. I did a simple experiment with both models running with the recommended parameters.

  126. Had to sell my AI server and am down to an M4 Macbook Air 16GB. If I were to buy a used M1 Air with 16GB (run it headless) and connect the two via EXO + Thunderbolt...would it be possible to be able to run a (19.6GB) Qwen 3.5-27B-Q5_K_M.gg…

  127. Both Groq and Cerebras haven't really updated their provided model for a while, long enough to notice the difference between old and new models on the market. So why don't they add any new models?

  128. Hello, I've been on a quest to get something "close enough" of Opus 4.5 running locally, for agentic coding, as SWE with 15 years of experience. I tried with one spark (yeah I'm calling my Asus Ascent GX10 sparks - they're the same), with…

  129. Hey there, I noticed something odd when trying out the latest and greatest local reasoning models recently. First, I just noticed it for Qwen3.5, but Gemma 4 seems to do it too: The reasoning traces do that weird thing of starting with "He…

  130. Hello, I'm in the middle of setting up a local AI solution so that I can eventually drop Perplexity, ChatGPT, Claude, etc., but I'm not an IT person (Perplexity is still my friend for now!).

  131. I've seen a ton of PR, and a bunch of failed PR with some interesting additions. I was wondering what other people's commands are looking like now, what they are running for llama.cpp I'm still running: CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6 l…

  132. I bought this 7900XTX for 905 euro in Spain, and I'm wondering if I can combine them together to run Qwen 3.5 27B, for example? Using an MSI B650 Gaming Plus Wifi and 64GB DDR5 6400MT/s.

  133. This is V2 of my previous post. What's new: --ai-tune — the model starts tuning its own flags in a loop and caches the fastest config it finds.

  134. Basically the title. I know it will depend on your quant, but with 48gb of vram inbound, I'm curious about the community's opinion before I get the chance to vibe check.

  135. Every other day, there's someone posting about how the latest hotness of the month is gamechanger, but flawed in some way relative to their previous favorite. I can't help but wonder, does no one else keep their previous gen models on spee…

  136. I created and run a benchmark for AI models in data analysis tasks. Unlike other benchmarks, it is not a one-prompt benchmark; I tried to simulate the real work of a data analyst.

  137. A high-volume feed of new AI releases — models, open-source repos, developer tools, papers, datasets, and benchmarks — refreshed every 8 hours. Each release is explained in plain English so you actually understand what shipped.

  138. Two questions: which model? In my mind, Qwen3.5 27b or Gemma 4 31b are top options.

  139. using z790 prime p d4 with 128gb ddr4 3200mhz ram. 1x3090 in main PCIe5 16x slot and 2x3090 in chipset PCIe4 4x slots.

  140. I'm sure everyone has seen the posts from people talking about Qwen 3.5 over-thinking, or maybe you've experienced it yourself. Considering we're like 2 months out from the release and I still see people talk about this issue, I decided it…

  141. I've been working on measuring how LLMs actually behave (not what they know) across different hardware setups. Things like: does the model cave when you push back on a correct answer?

  142. Hello everyone, a newbie here. Amazed by OpenClaw and worried by its high API consumption, I decided to buy two Asus Ascent GX10s (like the Nvidia Spark), so I have a pretty powerful inference cluster with 220GB of real available memory.

  143. Hi everyone, Been following a lot of local LLM talk in this forum lately—learned quite a bit from you all! This is my first post, hopefully not my last.

  144. Hello guys, wanted to share this: https://github.com/albond/DGX_Spark_Qwen3.5-122B-A10B-AR-INT4 I am running it on my DGX Spark Int4 V2 with Max context window - and getting 50tok/sec with Multi Token Prediction: Its working great for tool…

  145. my current build is just a 16GB 5060Ti running on a 3800X with 32GB DDR4. not really anything special, but I only really use it right now for Qwen3-VL-8B-Instruct at INT8 to do handwriting transcription (and it works great for that). someo…

  146. Hey guys! How would you reckon a 30-50b model would run on a 48 GBs m5 pro?

  147. I have a 5090, so my VRAM is limited to 32GB, but i find that Qwen3.5-27B-UD-Q5_K_XL with opencode (and mmproj) does a pretty good job for my use case (mainly web development). i use claude and codex here and there, recently a lot less, be…

  148. Models compared: Qwen3.5-27B-UD-Q5_K_XL gemma-4-31B-it-UD-Q5_K_XL Main flags for boths --flash-attn on \ --n-gpu-layers 99 \ --no-mmap \ -c 150000 \ --temp 1 --top-p 0.9 --min-p 0.1 --top-k 20 \ --ctx-checkpoints 1 \ --jinja \ -np 1 \ --re…

  149. A few days ago I posted early results from a native MLX implementation of DFlash. Since then I rewrote the benchmark methodology, fixed numerical issues, and open sourced the whole thing.

  150. I'm getting ready to do a training run on qwen 3.5 27b and it will be the first time I've ever done LoRA. to complicate things I've tried to make my own custom dataset using q&a pairs.

  151. A year ago I would just read about 397B league of models. Today I can run it on my laptop.

  152. I run it on 2xRTX 3090. This is part of my llama-server presets file: [Qwen3.5-27B-bartowski] load-on-startup = true alias = Qwen3.5-27B-bartowski hf = bartowski/Qwen_Qwen3.5-27B-GGUF:Q8_0 hfd = bartowski/Qwen_Qwen3.5-2B-GGUF:Q8_0 draft-mi…
