Hallucinated — page 2

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI (arxiv.org)

19h

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such clai…
A Hybrid LSTM–Vision Transformer Architecture for Predicting HRRR Forecast Errors (arxiv.org)

19h

Forecast errors in high-resolution numerical weather prediction (NWP) systems are often linked to unresolved planetary boundary layer (PBL) processes, convection, terrain-induced circulations, and other vertically structured atmospheric ph…
ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents (arxiv.org)

19h

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications.
Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance (arxiv.org)

19h

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assist…
Attention as Frustrated Synchronization (arxiv.org)

19h

A network of oscillators that synchronizes perfectly computes nothing further, so an attention architecture built from synchronization must locate its computation in structured departures from agreement. We introduce the Frustrated Synchro…
BLADE: Scalable Bi-level Adaptive Data Selection for LLM Training (arxiv.org)

19h

As Large Language Model (LLM) datasets scale to trillions of tokens, data selection has emerged as a critical frontier to filter out uninformative noise and construct adaptive learning trajectories. Beyond static heuristic filtering, advan…
Balanced Twins: Causal Inference on Time Series with Hidden Confounding (arxiv.org)

19h

Accurately estimating treatment effects in time series is essential for evaluating interventions in real-world applications, especially when treatment assignment is biased by unobserved factors. In many practical settings, interventions ar…
Beyond Prediction: Tail-Aware Scheduling for LLM Inference (arxiv.org)

19h

LLM serving exhibits extreme length variability, making size-based scheduling difficult in practice. Recent LLM schedulers approximate SJF/SRPT using predicted decode lengths or ranks and primarily report mean-centric metrics such as TTFT…
Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection (arxiv.org)

19h

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We a…
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports (arxiv.org)

19h

Reliable evaluation of generated radiology reports requires strict clinical accuracy, as omitted critical findings or mischaracterized radiographic observations can directly affect patient care. Existing metrics obscure this requirement by…
Beyond Tokenization: Direct Timestep Embedding and Contrastive Alignment for Time-Series Question Answering (arxiv.org)

19h

Recent advances in large language models (LLMs) have given rise to time-series question answering (TSQA), which formulates time-series analysis as natural-language question answering. However, directly feeding raw numerical series into LLM…
CAOA – Completion-Assisted Object-CAD Alignment (arxiv.org)

19h

Accurately aligning CAD models to their corresponding objects in indoor RGB-D scans is a central challenge in 3D semantic reconstruction. The task requires estimating a 9-Degree-of-Freedom (DoF) pose: position, rotation, and scale along thr…
CEO-Bench: Can Agents Play the Long Game? (arxiv.org)

19h

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely unteste…
Complementary Attention Head Pruning for Efficient Transformers (arxiv.org)

19h

The remarkable success of Transformer-based models in natural language processing stems from architectural scaling, which leads to a large number of parameters and hinders deployment in resource-constrained environments. While structured p…
Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment (arxiv.org)

19h

Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet these dis…
CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents (arxiv.org)

19h

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe…
DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs (arxiv.org)

19h

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in gl…
Dango: A Strictly L1-Only Large Language Model for Studying Second Language Acquisition (arxiv.org)

19h

We introduce Dango, a 1.8B-parameter large language model designed for controlled studies of L1-to-L2 (Japanese-to-English) transfer in second language acquisition (SLA). While previous studies have explored SLA in language models, they ha…
DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models (arxiv.org)

19h

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four…
Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents (arxiv.org)

19h

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This…
Deep Neural Network Driven Simulation Based Inference Method for Pole Position Estimation under Model Misspecification (arxiv.org)

19h

Simulation Based Inference (SBI) is shown to yield more accurate resonance parameter estimates than traditional chi-squared minimization in certain cases of model misspecification, demonstrated through a case study of pi-pi scattering and…
Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents (arxiv.org)

19h

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it d…
Dual Dimensionality for Local and Global Attention (arxiv.org)

19h

Decoder-only Transformers compute attention over the KV cache of preceding tokens. Keys (and Values) are typically represented with the same dimensionality, regardless of their distance from the prediction target.
EARS: Explanatory Abstention for Reliable Sub-Agent Modeling in Large-scale Multi-Agent Systems (arxiv.org)

19h

In large-scale enterprise settings, centralized multi-agent systems (MAS) are increasingly adopted, in which a coordinator delegates user requests to lightweight, domain-specialized sub-agents. While this architecture improves modularity,…
Enhancing Decision-Making with Large Language Models through Multi-Agent Fictitious Play (arxiv.org)

19h

Large language model (LLM)-based multi-agent systems (MAS) have demonstrated great potential in solving tasks with execution complexity, by distributing subtasks across cooperative agents. However, this divide-and-conquer paradigm falls sh…
Everywhere Valid Bounds on False Discovery Proportions in Conformal Inference (arxiv.org)

19h

Modern applications of conformal inference to multiple testing problems, such as outlier detection and candidate selection, often involve selecting test samples whose conformal p-values fall below a threshold. The quality of such methods i…
FORGE: Foundational Optimization Representations from Graph Embeddings (arxiv.org)

19h

Combinatorial optimization problems are ubiquitous in science and engineering. Still, learning-based approaches to accelerate combinatorial optimization often require solving a large number of difficult instances to collect training data,…
ForecastBench-Sim: A Simulated-World Forecasting Benchmark (arxiv.org)

19h

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a…
From Values to Tokens: An LLM-Driven Framework for Context-aware Time Series Forecasting via Symbolic Discretization (arxiv.org)

19h

Time series forecasting plays a vital role in supporting decision-making across a wide range of critical applications, including energy, healthcare, and finance. Despite recent advances, forecasting accuracy remains limited due to the chal…
FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs (arxiv.org)

19h

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective und…