#mixture-of-experts

17 items

Abliterlitics: Benchmarks and Tensor Comparison for Heretic, Abliterlix, Huiui, HauhauCS for GLM 4.7 Flash (www.reddit.com) +193 7w

This is a follow-up to the previous benchmark and tensor analysis of abliteration techniques across the Qwen model family. Same approach, same toolkit, new model family.

↯ Glm ↯ GLM 4.7 mixture-of-experts glm qwen
XiaomiMiMo MiMo-V2.5 (not pro) - Architecture: Sparse MoE (Mixture of Experts), 310B total / 15B activated parameters (www.reddit.com) +31 7w

https://huggingface.co/XiaomiMiMo/MiMo-V2.5 Interesting because, unlike its bigger brother, it can be run on "more human" configurations.

mixture-of-experts moe
EMO: Pretraining mixture of experts for emergent modularity (allenai.org via hn) +2 5w

Today we're releasing EMO, a new mixture-of-experts (MoE) model pretrained end-to-end so that modular structure emerges directly from the data without relying on human-defined priors. EMO lets you use a small subset of its experts – just 1…

mixture-of-experts moe
The cut in the Mixture of Experts compute graph (idlemachines.co.uk via hn) +1 4w

Mixture of Experts looks like it's one of those few changes you can make to the architecture of a model that comes almost for free: many more parameters, barely any more compute. The forward pass is just a router, a softmax and a top-k.

mixture-of-experts
Stratum: System-Hardware Co-Design with 3D-Stackable DRAM for Efficient MoE (dl.acm.org via hn) +1 4w

As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE) architecture has emerged as a prevailing design for achieving state-of-the-art performance across a wide range of tasks. MoE models use sparse g…

mixture-of-experts moe
Zyphra releases the ZAYA1-8B MoE model optimized for intelligence density (huggingface.co via hn) +11 5w

ZAYA1-8B is a small mixture of experts language model with 760M active parameters and 8.4B total parameters trained end-to-end by Zyphra. ZAYA1-8B sets a new standard of intelligence efficiency for its parameter count through a co…

mixture-of-experts moe
Multi-Modal Spatio-Temporal Graph Neural Network with Mixture of Experts for Soil Organic Carbon Prediction (arxiv.org) 1d

mixture-of-experts
Generalizing GNNs with Tokenized Mixture of Experts (arxiv.org) 2d

Deployed graph neural networks (GNNs) are frozen at deployment yet must fit clean data, generalize under distribution shifts, and remain stable to perturbations. We show that static inference induces a fundamental tradeoff: improving stabi…

mixture-of-experts
Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training (arxiv.org) 6d

Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural…

mixture-of-experts
Enhancing Multilingual LLM-based ASR with Mixture of Experts and Dynamic Downsampling (arxiv.org) 7d

mixture-of-experts
CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation (arxiv.org) 7d

mixture-of-experts moe
Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps (arxiv.org) 8d

mixture-of-experts
FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting (arxiv.org) 8d

Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model…

mixture-of-experts
I built a PyTorch MoE/MoD training framework with custom CUDA kernels [Apache 2.0] (www.reddit.com via reddit) 9d

PyTorch framework for training transformer LLMs with MoE and MoD architecture support, custom CUDA kernels, and DeepSpeed integration. Key things it does: - Custom CUDA kernels for RMSNorm, RoPE, SwiGLU, MoE routing.

mixture-of-experts moe
Mixture of Experts (MoEs) in Transformers (huggingface.co) 15w

mixture-of-experts
Mixture of Experts Explained (huggingface.co) 131w

mixture-of-experts
Welcome Mixtral - a SOTA Mixture of Experts on Hugging Face (huggingface.co) 131w

mixture-of-experts
