This is a follow-up to the previous benchmark and tensor analysis of abliteration techniques across the Qwen model family. Same approach, same toolkit, new model family.
#mixture-of-experts
17 items
Abliterlitics: Benchmarks and Tensor Comparison for Heretic, Abliterlix, Huiui, HauhauCS for GLM 4.7 Flash (www.reddit.com)
XiaomiMiMo MiMo-V2.5 (not pro) - Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters (www.reddit.com) https://huggingface.co/XiaomiMiMo/MiMo-V2.5 Interesting because, unlike its bigger brother, it can be run on "more human" configurations
EMO: Pretraining mixture of experts for emergent modularity (allenai.org via hn) Today we're releasing EMO, a new mixture-of-experts (MoE) model pretrained end-to-end so that modular structure emerges directly from the data without relying on human-defined priors. EMO lets you use a small subset of its experts – just 1…
The cut in the Mixture of Experts compute graph (idlemachines.co.uk via hn) Mixture of Experts looks like it's one of those few changes you can make to the architecture of a model that comes almost for free: many more parameters, barely any more compute. The forward pass is just a router, a softmax and a top-k.
Stratum: System-Hardware Co-Design with 3D-Stackable DRAM for Efficient MoE (dl.acm.org via hn) As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE) architecture has emerged as a prevailing design for achieving state-of-the-art performance across a wide range of tasks. MoE models use sparse g…
Zyphra releases the ZAYA1-8B MoE model optimized for intelligence density (huggingface.co via hn) ZAYA1-8B is a small mixture of experts language model with 760M active parameters and 8.4B total parameters trained end-to-end by Zyphra. ZAYA1-8B sets a new standard of intelligence efficiency for its parameter count through a co…
Multi-Modal Spatio-Temporal Graph Neural Network with Mixture of Experts for Soil Organic Carbon Prediction (arxiv.org)
Generalizing GNNs with Tokenized Mixture of Experts (arxiv.org) Deployed graph neural networks (GNNs) are frozen at deployment yet must fit clean data, generalize under distribution shifts, and remain stable to perturbations. We show that static inference induces a fundamental tradeoff: improving stabi…
Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training (arxiv.org) Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural…
Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling (arxiv.org)
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation (arxiv.org)
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps (arxiv.org)
FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting (arxiv.org) Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model…
I built a PyTorch MoE/MoD training framework with custom CUDA kernels [Apache 2.0] (www.reddit.com via reddit) PyTorch framework for training transformer LLMs with MoE and MoD architecture support, custom CUDA kernels, and DeepSpeed integration. Key things it does: - Custom CUDA kernels for RMSNorm, RoPE, SwiGLU, MoE routing.
Mixture of Experts (MoEs) in Transformers (huggingface.co)
Mixture of Experts Explained (huggingface.co)
Welcome Mixtral - a SOTA Mixture of Experts on Hugging Face (huggingface.co)