Hallucinated — page 9

RepSelect: Robust LLM Unlearning via Representation Selectivity (arxiv.org)

1d
Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors (arxiv.org)

1d
Riemann-Bench: A Benchmark for Moonshot Mathematics (arxiv.org)

1d
RubricsTree: Scalable and Evolving Open-Ended Evaluation of Personal Health Agents across Health Memory and Medical Skills (arxiv.org)

1d
SEAGym: An Evaluation Environment for Self-Evolving LLM Agents (arxiv.org)

1d

Self-evolving LLM-based agents improve mainly by changing their agent harness: the structured execution layer around a base model, including prompts, memory, tools, middleware, runtime state, and the model-tool interaction loop. Existing e…
Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery (arxiv.org)

1d

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise…
Securing Multi-Agent GIS Systems: Risk Evaluation and Prompt Hardening Optimization (arxiv.org)

1d
Security and Privacy Prompts in the Wild: What Users Ask LLMs and How LLMs Respond (arxiv.org)

1d
See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL (arxiv.org)

1d

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The pr…
SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology (arxiv.org)

1d

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SegTME-UNI2, a unified framework addre…
Shattering the Autoregressive Curse: Dynamic Epistemic Entropy Orchestrated Erasable Reinforcement Learning for LLMs (arxiv.org)

1d

Although reinforcement learning (RL) has expanded the cognitive boundaries of large language models (LLMs), it often remains vulnerable to the autoregressive curse in long-horizon logical reasoning: small epistemic perturbations introduced…
SkillChain-Gym: A Benchmark for Reskilling-Aware Production-Inventory Control under Disruptions (arxiv.org)

1d

Production planning increasingly has to treat workforce capability as a decision variable: certifications lapse when skills are not maintained, new products require skills the current workforce does not hold, and reskilling competes for th…
SoftMoE: Soft Differentiable Routing for Mixture-of-Experts in LLMs (arxiv.org)

1d
Softmax as Linear Attention in the Large-Prompt Regime: a Measure-based Perspective (arxiv.org)

1d
Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work (arxiv.org)

1d

AI coding agents increasingly accept assigned software tasks, modify repositories under bounded authority, and return work packages for review. Prior work proposed the software delegation contract, covering the task, authority, returned wo…
Speaking in Self-Assessing Tongues: On the Verbalized Confidence of LLMs in Machine Translation (arxiv.org)

1d
SpeechDx: A Multi-Task Benchmark for Clinical Speech AI (arxiv.org)

1d

Speech offers a uniquely informative window into health by simultaneously engaging neurological, motor, respiratory, and vocal systems. Current clinical speech AI methods have largely progressed through isolated condition-specific studies,…
Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference (arxiv.org)

1d

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on…
Structural Role Injection in Handlebars-Templated LLM Prompts: Triple-Brace Interpolation, Delimiter Family, and the Limits of HTML Auto-Escaping (arxiv.org)

1d
TACOMORE: Exploring a replicable prompting protocol for LLM-assisted corpus analysis (arxiv.org)

1d
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs (arxiv.org)

1d
The Critical Role of Model Selection in Causal Inference: A Comparative Analysis of Classification Models within the InferBERT Framework for Pharmacovigilance (arxiv.org)

1d
The Price of Anarchy in Disaggregated Inference (arxiv.org)

1d

Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic anal…
Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding (arxiv.org)

1d
Towards Distributed Inference of LLMs on a P2P Network (arxiv.org)

1d

Prefix caching can reduce LLM inference latency by reusing KV caches across requests with shared prompts, but cluster-scale reuse is challenging because caches are partitioned across nodes. We propose a decentralized, prefix-cache-aware ro…
Towards Understanding and Measuring COGNITIVE ATROPHY in LLM Behaviour (arxiv.org)

1d
Transformer-Based Warm-Starting for Feasible and Optimal Terminal Approach to Tumbling Objects with Space Manipulators (arxiv.org)

1d

Real-time trajectory generation for on-orbit robotic servicing is challenging due to the nonlinear coupling between spacecraft bus motion, manipulator dynamics, visibility cone, and trajectory-level safety constraints. This paper studies l…
Trust-Aware Multi-Agent Traceability: Confidence-Calibrated Knowledge Graphs for Consistent Software Artifact Management (arxiv.org)

1d

Multi-agent AI systems are increasingly used to automate software engineering tasks including requirements analysis, architecture design, test generation, and traceability linking. When these agents operate as a sequential pipeline over sh…
Trustworthy Self-Composable Big-Data-as-a-Service: An LLM-Orchestrated Multi-Agent Framework for Automated Data Engineering, AutoML, MLOps Deployment, and Drift-Aware Lifecycle Optimization (arxiv.org)

1d
Understanding LLMs in Title-Abstract Screening: From Disagreements to Recommendations (arxiv.org)

1d

Several studies have examined the use of large language models (LLMs) for title-abstract screening in systematic reviews (SRs), reporting mixed accuracy. However, questions of reliability remain largely unaddressed.