Hallucinated — page 3

G-IdiomAlign: A Gloss-Pivoted Benchmark for Cross-Lingual Idiom Alignment (arxiv.org)

19h

Idioms are difficult to transfer across languages due to their non-compositionality and weak surface-form grounding, making literal mappings unreliable. We present G-IdiomAlign, a gloss-pivoted benchmark where each idiom is anchored by an…
GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents (arxiv.org)

19h

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and q…
Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing (arxiv.org)

19h

The dense token-to-token interaction pattern of standard dot-product attention remains a central bottleneck in scaling Transformer architectures to long contexts. We introduce Gaussian Mixture Attention (GMA), a probabilistic atte…
Graph Grounded Cross Attention Transformer Neural Network for Structurally Constrained Full Event Sequence Generation in Predictive Process Monitoring (arxiv.org)

19h

Structurally constrained event sequence generation remains challenging because generated paths must preserve transition feasibility, temporal order, termination, and attribute consistency. In predictive process monitoring (PPM), this chall…
GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents (arxiv.org)

19h

With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment,…
Hierarchical Attention via Domain Decomposition (arxiv.org)

19h

We propose a hierarchical attention mechanism based on two-level overlapping Schwarz domain decomposition. The method is motivated by the observation that two-level Schwarz domain decomposition methods combine local subdomain corrections w…
Improve Large Language Model Systems with User Logs (arxiv.org)

19h

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs…
IndicContextEval: A Benchmark for Evaluating Context Utilisation in Audio Large Language Models Across 8 Indic Languages (arxiv.org)

19h

AudioLLMs enable speech recognition conditioned on textual prompts such as domain descriptions or entity lists. However, it remains unclear whether these models genuinely utilise such context or rely on parametric knowledge learned during…
LLM Compression by Block Removal with Constrained Binary Optimization (arxiv.org)

19h

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks ("block removal") as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glas…
LLM Parameters for Math Across Languages: Shared or Separate? (arxiv.org)

19h

Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests diff…
LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents (arxiv.org)

19h

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting trainin…
LLMs Struggle to Measure What Distinguishes Students of Different Proficiency Levels: A Study of Item Discrimination in Reading Comprehension Assessment (arxiv.org)

19h

Item discrimination is a fundamental psychometric property of educational assessment, which measures whether an item meaningfully distinguishes students with higher proficiency from students with lower proficiency. While various existing w…
Language Models as Interfaces, Not Oracles: A Hybrid LLM-ML System for Pediatric Appendicitis (arxiv.org)

19h

Large language models (LLMs) can make clinical decision support more accessible by interpreting free-text documentation, but their direct use as diagnostic engines is limited by sensitivity to prompts, information order, and plausible but…
Leadership as Coordination Control: Behavioral Signatures and the Recovery-Advantage Boundary in Multi-Agent LLM Teams (arxiv.org)

19h

Team science holds that leadership is contingent: it helps only under specific conditions, and capable, autonomous teams may need none at all. We ask the analogous question for multi-agent LLM teams: under what measurable conditions does p…
LegalWorld: A Life-Cycle Interactive Environment for Legal Agents (arxiv.org)

19h

Civil litigation is inherently a life-cycle process: what a lawyer drafts on day one constrains what unfolds at trial months later. Yet existing legal benchmarks evaluate isolated subtasks, and prior legal-agent simulators reinitialize eac…
MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems (arxiv.org)

19h

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing…
MetaboNet-Bench: A Multi-modal Benchmark for Glucose Forecasting in Type 1 Diabetes (arxiv.org)

19h

Glucose forecasting algorithms are an important aspect of glycemic control management in type 1 diabetes. So far, the research community has developed numerous algorithms and models for forecasting.
Montreal Forced Aligner and the state of speech-to-text alignment in 2026 (arxiv.org)

19h

The Montreal Forced Aligner (MFA) was released in 2016 and has since become the most widely used tool for forced alignment in research and industry. In the decade since, MFA has undergone substantial development, including expanded coverag…
Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer? (arxiv.org)

19h

The effectiveness of multi-agent LLM deliberation depends not only on the agents' individual predictions, but also on how they communicate and collaborate. We study this mechanism through the lens of Friedkin-Johnsen (FJ) opinion dynamics,…
Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey (arxiv.org)

19h

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to e…
Possible or Definite? A Benchmark for Evaluating Diagnostic Uncertainty Preservation in Clinical Text (arxiv.org)

19h

Large language models (LLMs) are increasingly used for clinical text tasks such as summarization and revision. While most studies evaluate the fluency and coherence of LLM-generated text, whether LLMs correctly preserve diagnostic uncertai…
PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning (arxiv.org)

19h

Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since…
Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing (arxiv.org)

19h

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines t…
Quantifying and Auditing LLM Evaluation via Positive-Unlabeled Learning (arxiv.org)

19h

Large Language Models (LLMs) are increasingly used as judges for scalable evaluation, yet such LLM-as-a-Judge systems exhibit systematic biases that are decoupled from semantic quality, most notably verbosity bias. Meanwhile, human supe…
RouteJudge: An Open Platform for Reproducible and Preference-Aware LLM Routing (arxiv.org)

19h

We present RouteJudge, an online pairwise preference evaluation framework for LLM routing systems, with a public platform available at this https URL. Different from model-level response evaluation, RouteJudge focuses on router-level decis…
SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration (arxiv.org)

19h

Context engineering has emerged as a primary lever for improving AI systems without parameter updates. Recent work showing that textual gradients do not function as real gradients motivates treating automatic prompt optimization (APO) as b…
SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding (arxiv.org)

19h

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core…
Self-Evolving Multi-Agent Systems via Textual Backpropagation (arxiv.org)

19h

Leveraging multiple Large Language Models (LLMs) has proven effective for addressing complex, high-dimensional tasks, but current approaches often rely on static, manually engineered multi-agent configurations. To overcome these constraint…
Self-attention-based non-linear basis transformations for compact latent space modelling of dynamic optical fibre transmission matrices (arxiv.org)

19h

Multimode optical fibres are hair-thin strands of glass that efficiently transport light. They promise next-generation medical endoscopes that provide unprecedented sub-cellular image resolution deep inside the body.
Simulating Hate Speech Cascades with Multi-LLM Agents: Empirical Grounding, Modeling Fidelity, and Intervention Strategies (arxiv.org)

19h

Faithful modeling of hateful content propagation on online platforms remains an open problem for moderation research. Classical cascade models that do not explicitly represent the profile, community, and content factors associated with hat…