Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Bridging Explainability and Embeddings: Letting BEE Identify Spurious Correlations
ICLR 2026 Poster accepted paper at ICLR 2026. Current methods for detecting spurious correlations rely on data splits or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space and embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Code/project link: https://github.com/bit-ml/bee
Paper · ICLR 2026 Poster · 2026 · clinical NLP · VLM-SubtleBench: How Far Are VLMs from Human-Level Subtle Comparative Reasoning?
ICLR 2026 Poster accepted paper at ICLR 2026. The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce **VLM-SubtleBench**, a benchmark designed to evaluate VLMs on *subtle comparative reasoning*. Our benchmark covers ten difference types (Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action) and curates paired question–image sets reflecting these fine-grained variations.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Dyslexify: A Mechanistic Defense Against Typographic Attacks in CLIP
ICLR 2026 Poster accepted paper at ICLR 2026. Typographic attacks exploit multi-modal systems by injecting text into images, leading to targeted misclassifications, malicious content generation, and even Vision-Language Model jailbreaks. In this work, we analyze how CLIP vision encoders behave under typographic attacks, locating specialized attention heads in the latter half of the model's layers that causally extract and transmit typographic information to the CLS token. Building on these insights, we introduce Dyslexify, a method to defend CLIP models against typographic attacks by selectively ablating a typographic circuit consisting of attention heads. Without requiring fine-tuning, Dyslexify improves performance by up to 22.06% on a typographic variant of ImageNet-100 while reducing standard ImageNet-100 accuracy by less than 1%, and we demonstrate its utility in a medical foundation model for skin lesion diagnosis.
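As a rough illustration of what ablating a set of attention heads means, the sketch below sums per-head contributions while skipping the selected heads. Zero-ablation, the name `combine_heads`, and the toy vectors are assumptions made for illustration; the abstract does not say how Dyslexify carries out the ablation.

```python
from typing import List, Set

Vector = List[float]


def combine_heads(head_outputs: List[Vector], ablated: Set[int]) -> Vector:
    """Sum per-head contributions, leaving out (zero-ablating) the listed heads."""
    dim = len(head_outputs[0])
    combined = [0.0] * dim
    for index, output in enumerate(head_outputs):
        if index in ablated:
            continue  # an ablated head contributes nothing downstream
        for d in range(dim):
            combined[d] += output[d]
    return combined


if __name__ == "__main__":
    # Hypothetical outputs of three heads; head 2 plays the "typographic" head.
    heads = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
    print(combine_heads(heads, ablated=set()))  # all heads active
    print(combine_heads(heads, ablated={2}))    # head 2 removed
```

No weights change in this picture, which is consistent with the abstract's statement that the defense needs no fine-tuning.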
Paper · ICLR 2026 Poster · 2026 · clinical NLP · LaVCa: LLM-Assisted Visual Cortex Captioning
ICLR 2026 Poster accepted paper at ICLR 2026. Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that leverages large language models (LLMs) to generate natural-language captions for images to which voxels are selective.
Paper · ICLR 2026 Poster · 2026 · clinical NLP · A Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning
ICLR 2026 Poster accepted paper at ICLR 2026. We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open‑source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR‑predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model’s incorrect answers are ETR‑predicted fallacies ($\rho=0.360, p=0.0265$), while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects.
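The reported $\rho$ is a correlation between a capability proxy and the share of errors that are predicted fallacies. Assuming $\rho$ denotes Spearman's rank correlation (the abstract does not say), a minimal sketch of that statistic is shown below. The Elo scores and fallacy shares are made-up numbers, not data from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    elo = [1100, 1180, 1250, 1300, 1340]            # hypothetical capability proxy
    fallacy_share = [0.20, 0.18, 0.31, 0.29, 0.40]  # hypothetical shares
    print(round(spearman_rho(elo, fallacy_share), 3))
```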
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · SE-Diff: A Simulator- and Experience-Enhanced Diffusion Model for Comprehensive ECG Generation
ICLR 2026 Poster accepted paper at ICLR 2026. Cardiovascular disease (CVD) is a leading cause of mortality worldwide. Electrocardiograms (ECGs) are the most widely used non-invasive tool for cardiac assessment, yet large, well-annotated ECG corpora are scarce due to cost, privacy, and workflow constraints. Generating ECGs can aid mechanistic understanding of cardiac electrical activity, enable the construction of large, heterogeneous, and unbiased datasets, and facilitate privacy-preserving data sharing. Generating realistic ECG signals from clinical context is important yet underexplored. Recent work has leveraged diffusion models for text-to-ECG generation, but two challenges remain: (i) existing methods often overlook physiological simulator knowledge of cardiac activity; and (ii) they ignore broader, experience-based clinical knowledge grounded in real-world practice.
Paper · ICLR 2026 Poster · 2026 · clinical prediction · MedAraBench: A Large-Scale Arabic Medical Question Answering Dataset and Benchmark
ICLR 2026 Poster accepted paper at ICLR 2026. Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region.
Paper · ICLR 2026 Poster · 2026 · medical LLM agent · Can Large Language Models Match the Conclusions of Systematic Reviews?
ICLR 2026 Poster accepted paper at ICLR 2026. Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: **Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies?** To explore this question, we present MedEvidence, a benchmark pairing findings from 100 medical SRs with the studies they are based on.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue
ICLR 2026 Poster accepted paper at ICLR 2026. Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance.
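To make the idea of uncertainty-driven budget allocation concrete, the sketch below scores each state with a weighted sum of absolute Bellman error and action-value variance, then splits a fixed rollout budget in proportion to those scores. The weighted-sum form, the `weight` parameter, the proportional split, and all function names are assumptions; the abstract names the two ingredients but not how ATPO combines them.

```python
import statistics
from typing import List, Sequence


def composite_uncertainty(
    bellman_error: float, q_samples: Sequence[float], weight: float = 0.5
) -> float:
    """Hypothetical composite score: weighted |Bellman error| plus Q-value variance."""
    q_variance = statistics.pvariance(q_samples)
    return weight * abs(bellman_error) + (1.0 - weight) * q_variance


def allocate_rollouts(scores: Sequence[float], budget: int) -> List[int]:
    """Split `budget` rollouts across states in proportion to non-negative scores."""
    n = len(scores)
    total = sum(scores)
    if total <= 0:  # no signal: fall back to an even split
        scores, total = [1.0] * n, float(n)
    raw = [budget * s / total for s in scores]
    alloc = [int(r) for r in raw]  # floor of each share
    leftover = budget - sum(alloc)
    # Hand the remaining rollouts to the largest fractional parts.
    by_fraction = sorted(range(n), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_fraction[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical (Bellman error, sampled action values) for two dialogue states.
    states = [(0.1, [1.0, 1.1, 0.9]), (0.8, [0.2, 1.5, -0.4])]
    scores = [composite_uncertainty(err, qs) for err, qs in states]
    print(allocate_rollouts(scores, budget=16))
```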
Paper · ICLR 2026 Poster · 2026 · clinical prediction · How Do Medical MLLMs Fail? A Study of Visual Grounding in Medical Images
ICLR 2026 Poster accepted paper at ICLR 2026. Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks—particularly in zero-shot settings where generalization is critical—remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. **In this work**, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle *visual grounding* from *semantic grounding*, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. Code/project link: https://guimeng-leo-liu.github.io/Medical-MLLMs-Fail/
Paper · ICLR 2026 Poster · 2026 · medical LLM agent · GALAX: A Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine
ICLR 2026 Poster accepted paper at ICLR 2026. In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets, guiding the discovery of novel therapeutics and effective treatment strategies. Existing pipelines capture only one or two of these: numerical omics ignore topological context, text-centric LLMs lack quantitatively grounded reasoning, and graph-only models underuse rich node semantics and the generalization power of LLMs, which limits mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by coarse step definitions, unreliable intermediate evaluation, and vulnerability to reward hacking with added computational cost. These gaps motivate jointly integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principal bridge linking numeric evidence, topological knowledge, and language context.
Paper · ICLR 2026 Poster · 2026 · medical LLM agent · Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning
ICLR 2026 Poster accepted paper at ICLR 2026. The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making.
Paper · ICLR 2026 Poster · 2026 · clinical NLP · Toward Text-Mask Consistency in Medical Image Segmentation
ICLR 2026 Poster accepted paper at ICLR 2026. Vision-language models for medical image segmentation often produce masks that conflict with the accompanying text, especially under multi-site/multi-lesion descriptions. We trace this failure to two factors: (i) highly templated and repetitive clinical language causes one-to-one hard contrastive learning to yield numerous false negatives, weakening cross-modal alignment; and (ii) predominantly vision-driven, one-way cross-attention lacks a language-dominant, spatially aware pathway, hindering effective injection of textual semantics into the spatial visual domain. To this end, we propose Consistency-enhanced Two-stage Segmentation (C2Seg). In the pretraining stage, Cluster-aware Contrastive Learning uses a frozen strong baseline to construct an intra-batch text similarity matrix as soft labels, thereby alleviating false negative conflicts and producing more discriminative visual representations.
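The soft-label idea can be sketched as follows: each row of an intra-batch text similarity matrix is turned into a target distribution, and the image-to-text logits are trained against that distribution instead of a one-hot label. The softmax-with-temperature construction, the cross-entropy form, and every name here are assumptions for illustration; the abstract does not give C2Seg's actual loss.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax of row / temperature."""
    scaled = [v / temperature for v in row]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    norm = sum(exps)
    return [e / norm for e in exps]


def log_softmax(row: List[float]) -> List[float]:
    """Log-probabilities computed without exponentiating large values."""
    peak = max(row)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
    return [v - log_norm for v in row]


def soft_contrastive_loss(
    image_text_logits: Matrix, text_similarity: Matrix, temperature: float = 0.1
) -> float:
    """Mean cross-entropy between soft targets and image-to-text predictions."""
    total = 0.0
    for logits, sims in zip(image_text_logits, text_similarity):
        targets = softmax(sims, temperature)  # soft labels from text similarity
        log_probs = log_softmax(logits)
        total -= sum(t * lp for t, lp in zip(targets, log_probs))
    return total / len(image_text_logits)


if __name__ == "__main__":
    # Hypothetical batch of three reports; reports 0 and 1 are near-duplicates.
    sims = [[1.0, 0.95, 0.1], [0.95, 1.0, 0.1], [0.1, 0.1, 1.0]]
    logits = [[4.0, 3.8, 0.0], [3.9, 4.0, 0.0], [0.0, 0.0, 4.0]]
    print(round(soft_contrastive_loss(logits, sims), 4))
```

With one-hot targets, the near-duplicate report in the example would be penalized as a negative; the soft targets give it a share of the probability mass instead.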
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · PathChat-SegR1: Reasoning Segmentation in Pathology via SO-GRPO
ICLR 2026 Poster accepted paper at ICLR 2026. Segmentation in pathology images requires handling out-of-domain tissue morphologies and new pathologies beyond training distributions, where traditional closed-set segmentation approaches fail to generalize. Reasoning segmentation enables zero-shot generalization via prompting with text queries. However, existing reasoning segmentation models face three barriers when applied to pathology: (1) the vision encoder lacks pathology-specific knowledge and robustness to staining variations, (2) the large language model (LLM) backbone for reasoning fails to identify whether it has gathered sufficient semantic context to trigger the segmentation output, and (3) no reasoning segmentation benchmarks and datasets exist for pathology analysis. Consequently, we introduce PathChat-SegR1, a reasoning segmentation model built upon pathology-specific vision encoders trained with a novel stain-invariant self-distillation for robust pathology image representations.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Calibrating Missingness Bias in Feature Attribution Explanations
ICLR 2026 Poster accepted paper at ICLR 2026. Popular explanation methods often produce unreliable feature importance scores due to missingness bias, a systematic distortion that arises when models are probed with ablated, out-of-distribution inputs. Existing solutions treat this as a deep representational flaw that requires expensive retraining or architectural modifications. In this work, we challenge this assumption and show that missingness bias can be effectively treated as a superficial artifact of the model's output space. We introduce MCal, a lightweight post-hoc method that corrects this bias by fine-tuning a simple linear head on the outputs of a frozen base model.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · A Language Agent for Hypothesis-Driven Clinical Decision-Making with Reinforcement Learning
ICLR 2026 Poster accepted paper at ICLR 2026. Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process. However, most applications of LLMs in clinical decision support suffer from one of two limitations: either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited "out-of-the-box" capabilities of large pre-trained models without performing task-specific training. In contrast, we propose to model clinical decision-making for diagnosis with a hypothesis-driven, uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis by repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. Code/project link: https://github.com/dharouni/LA-CDM
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · From Conversation to Query Execution: A Benchmark of User and Tool Interactions for EHR Database Agents
ICLR 2026 Poster accepted paper at ICLR 2026. Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Code/project link: https://github.com/glee4810/EHR-ChatQA
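The value-mismatch problem can be illustrated with a small sketch: a user's term is mapped to the closest stored value before a parameterized SQL query runs. The `prescriptions` table, its columns, and the helper names are hypothetical, and fuzzy matching with the standard-library `difflib` is a stand-in, not the tooling EHR-ChatQA provides to agents.

```python
import difflib
import sqlite3
from typing import Optional, Sequence


def resolve_value(
    user_term: str, db_values: Sequence[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the stored value closest to the user's wording, or None."""
    by_lower = {value.lower(): value for value in db_values}
    matches = difflib.get_close_matches(
        user_term.lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[matches[0]] if matches else None


def build_demo_db() -> sqlite3.Connection:
    """In-memory database with a hypothetical prescriptions table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE prescriptions (patient_id INTEGER, drug TEXT)")
    conn.executemany(
        "INSERT INTO prescriptions VALUES (?, ?)",
        [(1, "Acetaminophen"), (2, "Acetaminophen"), (3, "Metoprolol")],
    )
    return conn


def count_prescriptions(conn: sqlite3.Connection, drug: str) -> int:
    """Count rows for an exact stored drug name (parameterized query)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM prescriptions WHERE drug = ?", (drug,)
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = build_demo_db()
    stored = [r[0] for r in conn.execute("SELECT DISTINCT drug FROM prescriptions")]
    drug = resolve_value("acetaminophn", stored)  # user typo
    if drug is None:
        print("No close match; the agent should ask the user to clarify.")
    else:
        print(drug, count_prescriptions(conn, drug))
```

Returning `None` instead of guessing leaves room for the clarification step the benchmark also evaluates.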
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Critic-Adviser-Reviser Cyclic Refinement: Toward High-Quality EMR Corpus Generation
ICLR 2026 Poster accepted paper at ICLR 2026. Electronic medical records (EMRs) are vital for healthcare research, but their use is limited by privacy concerns. Synthetic EMR generation offers a promising alternative, yet most existing methods merely imitate real records without adhering to rigorous clinical quality principles. To address this, we introduce LLM-CARe, a stage-wise cyclic refinement framework that progressively improves EMR quality through three stages, each targeting a specific granularity: corpus, section and document. At each stage, a Critic, an Adviser, and a Reviser collaborate iteratively to evaluate, provide feedback, and refine the drafts.
Paper · ICLR 2026 Poster · 2026 · clinical NLP · Enhancing Medical Visual Understanding via Multi-Granular Language Learning
ICLR 2026 Poster accepted paper at ICLR 2026. Recent advances in image-text pretraining have significantly enhanced visual understanding by aligning visual and textual representations. Contrastive Language-Image Pretraining (CLIP) has played a pivotal role in multimodal learning. However, its focus on single-label, single-granularity alignment limits its effectiveness in complex domains such as medical imaging, where images often correspond to multiple labels across different levels of granularity. To address this, we propose Multi-Granular Language Learning (MGLL), a contrastive learning framework designed to improve both multi-label and cross-granularity alignment. Code/project link: https://github.com/HUANGLIZI/MGLL
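Paper · ICLR 2026 Oral · 2026 · clinical prediction · CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark of Large Language Models in Mental Health Question Answering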
ICLR 2026 Oral accepted paper at ICLR 2026. Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat.
Paper · ICLR 2026 Poster · 2026 · clinical prediction · CerebraGloss: Instruction Tuning a Large Vision-Language Model for Fine-Grained Clinical EEG Interpretation
ICLR 2026 Poster accepted paper at ICLR 2026. Interpreting clinical electroencephalography (EEG) is a laborious, subjective process, and existing computational models are limited to narrow classification tasks rather than holistic interpretation. A key bottleneck for applying powerful Large Vision-Language Models (LVLMs) to this domain is the scarcity of datasets pairing EEG visualizations with fine-grained, expert-level annotations. We address this by introducing CerebraGloss, an instruction-tuned LVLM for nuanced EEG interpretation. We first introduce a novel, automated data generation pipeline, featuring a bespoke YOLO-based waveform detector, to programmatically create a large-scale corpus of EEG-text instruction data. Code/project link: https://github.com/iewug/CerebraGloss
Paper · ICLR 2026 Poster · 2026 · medical LLM agent · AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs
ICLR 2026 Poster accepted paper at ICLR 2026. The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Code/project link: https://github.com/MiliLab/AnesSuite
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Resp-Agent: An Agent System for Multimodal Respiratory Sound Generation and Disease Diagnosis
ICLR 2026 Poster accepted paper at ICLR 2026. Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present **_Resp-Agent_**, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA). Unlike static pipelines, Thinker-A²CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. Code/project link: https://github.com/zpforlove/Resp-Agent
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · MedVR: Annotation-Free Medical Visual Reasoning via Agentic Reinforcement Learning
ICLR 2026 Poster accepted paper at ICLR 2026. Medical Vision-Language Models (VLMs) hold immense promise for complex clinical tasks, but their reasoning capabilities are often constrained by text-only paradigms that fail to ground inferences in visual evidence. This limitation not only curtails performance on tasks requiring fine-grained visual analysis but also introduces risks of visual hallucination in safety-critical applications. Thus, we introduce MedVR, a novel reinforcement learning framework that enables annotation-free visual reasoning for medical VLMs. Its core innovation lies in two synergistic mechanisms: Entropy-guided Visual Regrounding (EVR) uses model uncertainty to direct exploration, while Consensus-based Credit Assignment (CCA) distills pseudo-supervision from rollout agreement.
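The two mechanisms can be pictured with a toy sketch: Shannon entropy of the next-token distribution flags the reasoning step where the model is least certain, and a majority vote over rollout answers supplies a pseudo-label without annotations. Both readings, along with all function names and the binary reward, are assumptions for illustration; the abstract does not define EVR or CCA at this level of detail.

```python
import math
from collections import Counter
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def most_uncertain_step(step_distributions: Sequence[Sequence[float]]) -> int:
    """Index of the step with the highest entropy, a candidate for re-grounding."""
    entropies = [token_entropy(p) for p in step_distributions]
    return max(range(len(entropies)), key=entropies.__getitem__)


def consensus_rewards(answers: Sequence[str]) -> List[float]:
    """Reward 1.0 for rollouts that agree with the majority answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if answer == majority else 0.0 for answer in answers]


if __name__ == "__main__":
    # Hypothetical per-step distributions over a two-token vocabulary.
    steps = [[0.9, 0.1], [0.5, 0.5], [1.0, 0.0]]
    print(most_uncertain_step(steps))
    # Hypothetical final answers from four rollouts of the same question.
    print(consensus_rewards(["A", "B", "A", "A"]))
```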
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · LiveClin: A Leakage-Free Live Clinical Benchmark
ICLR 2026 Poster accepted paper at ICLR 2026. The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed to approximate real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI–human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. Code/project link: https://github.com/AQ-MedAI/LiveClin
Paper · ICLR 2026 Poster · 2026 · clinical prediction · Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine
ICLR 2026 Poster accepted paper at ICLR 2026. The goal of personalized medicine is to discover a treatment regimen that optimizes a patient's clinical outcome based on their personal genetic and environmental factors. However, candidate treatments cannot be arbitrarily administered to the patient to assess their efficacy; we often instead have access to an *in silico* surrogate model that approximates the true fitness of a proposed treatment. Unfortunately, such surrogate models have been shown to fail to generalize to previously unseen patient-treatment combinations. We hypothesize that domain-specific prior knowledge—such as medical textbooks and biomedical knowledge graphs—can provide a meaningful alternative signal of the fitness of proposed treatments.
Paper · ICLR 2026 Poster · 2026 · clinical NLP · A Structured, Tagged, and Localized VQA Dataset for Chest X-ray Images, with Full-Sentence Answers and Scene Graphs
ICLR 2026 Poster accepted paper at ICLR 2026. Visual Question Answering (VQA) enables targeted and context-dependent analysis of medical images, such as chest X-rays (CXRs). However, existing VQA datasets for CXRs are typically constrained by simplistic and brief answer formats, lacking localization annotations (e.g., bounding boxes) and structured tags (e.g., region or radiological finding/disease tags). To address these limitations, we introduce MIMIC-Ext-CXR-QBA (abbr. CXR-QBA), a large-scale CXR VQA dataset derived from MIMIC-CXR, comprising 42 million QA-pairs with multi-granular, multi-part answers, detailed bounding boxes, and structured tags. Code/project link: https://github.com/philip-mueller/mimic-ext-cxr-qba/
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Synthesizing High-Quality Visual Question Answering from Medical Documents with Generator-Verifier LMMs
ICLR 2026 Poster accepted paper at ICLR 2026. Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions.
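A gate-then-score verifier of the kind described can be sketched as below: an item must pass a structural check and every essential gate before bonus points and penalties decide acceptance. The field names (`stem`, `options`, `answer`), the gate dictionary, and the threshold rule are hypothetical; the actual MedVLSynther schema and rubric are not given in the abstract.

```python
import json
from typing import Dict

REQUIRED_KEYS = {"stem", "options", "answer"}  # hypothetical schema fields


def passes_schema(item: object) -> bool:
    """Structural check: non-empty stem, distinct options, answer among them."""
    if not isinstance(item, dict) or not REQUIRED_KEYS <= item.keys():
        return False
    stem, options, answer = item["stem"], item["options"], item["answer"]
    if not isinstance(stem, str) or not stem.strip():
        return False
    if not isinstance(options, dict) or len(options) < 2:
        return False
    texts = list(options.values())
    if not all(isinstance(t, str) and t.strip() for t in texts):
        return False
    if len({t.strip().lower() for t in texts}) != len(texts):
        return False  # duplicated options cannot be mutually exclusive
    return isinstance(answer, str) and answer in options


def accept(
    item: object, gates: Dict[str, bool], bonus: int, penalty: int, threshold: int = 0
) -> bool:
    """Reject on any failed gate; otherwise compare net points to a threshold."""
    if not passes_schema(item) or not all(gates.values()):
        return False
    return bonus - penalty >= threshold


if __name__ == "__main__":
    raw = '{"stem": "Which modality is shown?", "options": {"A": "CT", "B": "MRI"}, "answer": "A"}'
    item = json.loads(raw)
    # Gate outcomes would come from a verifier model; these are placeholders.
    gates = {"self_contained": True, "single_answer": True,
             "clinically_valid": True, "image_text_consistent": True}
    print(accept(item, gates, bonus=3, penalty=1, threshold=2))
```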
Paper · ICLR 2026 Poster · 2026 · clinical NLP · Rethinking Radiology Report Generation: From Narrative Flow to Topic-Guided Findings
ICLR 2026 Poster accepted paper at ICLR 2026. Vision-Language Models (VLMs) for radiology report generation are typically trained to mimic the narrative flow of human experts. However, we identify a potential limitation in this conventional paradigm. We hypothesize that optimizing for narrative coherence encourages models to rely on linguistic priors and inter-sentence correlations, which can weaken their grounding in direct visual evidence and lead to factual inaccuracies. To investigate this, we design a controlled experiment demonstrating that as textual context increases, a model's reliance on the input image systematically decays. We propose LLaVA-TA (Topic-guided and Anatomy-aware), a new fine-tuning framework that directly addresses this challenge by re-engineering the generation process.
Paper · ICLR 2026 Poster · 2026 · clinical prediction · M3CoTBench: A Chain-of-Thought Benchmark for MLLMs in Medical Image Understanding
ICLR 2026 Poster accepted paper at ICLR 2026. Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis.
Paper · ICLR 2026 Poster · 2026 · medical LLM agent · KnowGuard: Knowledge-Driven Abstention for Multi-Turn Clinical Reasoning
ICLR 2026 Poster accepted paper at ICLR 2026. In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidence.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Can SAEs Reveal and Mitigate Racial Bias in Healthcare LLMs?
ICLR 2026 Poster accepted paper at ICLR 2026. LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts. We first identify SAE latents in gemma-2 models which appear to correlate with Black individuals.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · CARE: An Evidence-Grounded Agentic Framework for Clinical Accountability in Multimodal Medical Reasoning
ICLR 2026 Poster accepted paper at ICLR 2026. Large visual language models (VLMs) have shown strong multi-modal medical reasoning ability, but most operate as end-to-end black boxes, diverging from clinicians’ evidence-based, staged workflows and hindering clinical accountability. Complementarily, expert visual grounding models can accurately localize regions of interest (ROIs), providing explicit, reliable evidence that improves both reasoning accuracy and trust. In this paper, we introduce **CARE**, advancing **C**linical **A**ccountability in multi-modal medical **R**easoning with an **E**vidence-grounded agentic framework. Unlike existing approaches that couple grounding and reasoning within a single generalist model, CARE decomposes the task into coordinated sub-modules to reduce shortcut learning and hallucination: a compact VLM proposes relevant medical entities; an expert entity-referring segmentation model produces pixel-level ROI evidence; and a grounded VLM reasons over the full image augmented by ROI hints.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Adaptive Test-Time Training for Predicting the Need for Invasive Mechanical Ventilation in Multi-Center Cohorts
ICLR 2026 Poster accepted paper at ICLR 2026. Accurate prediction of the need for invasive mechanical ventilation (IMV) in intensive care unit (ICU) patients is crucial for timely interventions and resource allocation. However, variability in patient populations, clinical practices, and electronic health record (EHR) systems across institutions introduces domain shifts that degrade the generalization performance of predictive models during deployment. Test-Time Training (TTT) has emerged as a promising approach to mitigate such shifts by adapting models dynamically during inference without requiring labeled target-domain data. In this work, we introduce Adaptive Test-Time Training (AdaTTT), an enhanced TTT framework tailored for EHR-based IMV prediction in ICU settings.
Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI · Medical Interpretability and Knowledge Maps of Large Language Models
ICLR 2026 Poster accepted paper at ICLR 2026. We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how LLMs both represent and process medical knowledge through four different interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal, and (4) activation patching. We present knowledge maps of five LLMs which show, at a coarse resolution, where knowledge about patients' ages, medical symptoms, diseases, and drugs is stored in the models. In particular, for Llama3.3-70B, we find that most medical knowledge is processed in the first half of the model's layers.
Paper | ICLR 2026 Poster | 2026 | trustworthy medical AI | Photon: Speeding Up Volumetric Data Understanding with Efficient Multimodal Large Language Models
ICLR 2026 Poster accepted paper at ICLR 2026. Multimodal large language models are promising for clinical visual question answering tasks, but scaling to 3D imaging is hindered by high computational costs. Prior methods often rely on 2D slices or fixed-length token compression, disrupting volumetric continuity and obscuring subtle findings. We present Photon, a framework that represents 3D medical volumes with token sequences of variable length. Photon introduces instruction-conditioned token scheduling and surrogate gradient propagation to adaptively reduce tokens during both training and inference, which lowers computational cost while mitigating the attention dilution caused by redundant tokens.
Paper | ICLR 2026 Poster | 2026 | trustworthy medical AI | NurValues: Evaluating Real-World Nursing Values of Large Language Models in Clinical Contexts
ICLR 2026 Poster accepted paper at ICLR 2026. While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse–patient conflicts. Such risks highlight the urgent need to evaluate whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: _Altruism_, _Human Dignity_, _Integrity_, _Justice_, and _Professionalism_. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse–patient conflicts.
Paper | ICLR 2026 Poster | 2026 | clinical prediction | FETAL-GAUGE: A Benchmark for Evaluating Vision-Language Models on Fetal Ultrasound
ICLR 2026 Poster accepted paper at ICLR 2026. The growing demand for prenatal ultrasound imaging has intensified a global shortage of trained sonographers, creating barriers to essential fetal health monitoring. Deep learning has the potential to enhance sonographers' efficiency and support the training of new practitioners. Vision-Language Models (VLMs) are particularly promising for ultrasound interpretation, as they can jointly process images and text to perform multiple clinical tasks within a single framework. However, despite the expansion of VLMs, no standardized benchmark exists to evaluate their performance in fetal ultrasound imaging. Code/project link: https://github.com/BioMedIA-MBZUAI/FETAL-GAUGE
Paper | ICLR 2026 Poster | 2026 | clinical NLP | Medical Thinking with Multiple Images
ICLR 2026 Poster accepted paper at ICLR 2026. Large language models perform well on many medical QA benchmarks, but real clinical reasoning is harder because diagnosis often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, in which models must interpret each image, combine cross-view evidence, and solve diagnostic questions under intermediate supervision and step-level evaluation. The dataset contains 10,067 cases, including 720 test cases, with an average of 6.68 images per case, substantially denser than prior work (earlier maxima ≤ 1.43). On the test set, the best closed-source models, Claude-4.6-opus, Gemini-3-pro, and GPT-5.2-xhigh, achieve only 54.9% to 57.2% accuracy, while smaller proprietary variants, GPT-5-mini/nano, drop to 39.7% and 30.8%.
Paper | ICLR 2026 Poster | 2026 | trustworthy medical AI | AttTok: Combining Attribute Tokens with Generative Pre-trained Vision-Language Models for Medical Image Understanding
ICLR 2026 Poster accepted paper at ICLR 2026. Recent generative pre-trained vision–language (GPTv) models have achieved remarkable success in multi-modal understanding, inspiring their adaptation to medical imaging tasks such as disease diagnosis and visual question answering (VQA). However, current instruction-tuned GPTv models suffer from two key challenges: (1) medical attributes (e.g., disease names, severity grades) are encoded as plain text tokens, collapsing semantically distinct concepts into nearly identical textual sequences; and (2) inadequate textual supervision weakens visual representation learning, leading to severe inter-attribute confusion and misaligned vision–language embeddings. To address these limitations, we introduce attribute tokens (AttTok), a set of pre‑defined special tokens that uniquely encode clinical attributes (e.g., imaging modality, diagnosis, severity) within a structured token space. Complemented by attribute‑centric embedding books, AttTok serves as anchor points for aligning both visual and textual modalities into a shared, discriminative representation space.
Paper | ICLR 2026 Poster | 2026 | trustworthy medical AI | Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
ICLR 2026 Poster accepted paper at ICLR 2026. Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making.
Paper | ICLR 2026 Poster | 2026 | trustworthy medical AI | Can LLMs Generate Transferable Representations for Clinical Time Series Data?
ICLR 2026 Poster accepted paper at ICLR 2026. Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question: can large language models (LLMs) create portable patient embeddings, i.e., representations of patients that enable a downstream predictor built at one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning? To do so, we map irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed-length vector capable of serving as input to a variety of downstream predictors.
Paper | ICLR 2026 Poster | 2026 | clinical prediction | Neural-MedBench: A Reasoning-Intensive Benchmark for Multimodal Clinical Reasoning in Neurology
ICLR 2026 Poster accepted paper at ICLR 2026. Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. Code/project link: https://neuromedbench.github.io/
Paper | ICLR 2026 Poster | 2026 | trustworthy medical AI | MedAgent-Pro: Towards Evidence-Based Multi-Modal Medical Diagnosis via a Reasoning Agentic Workflow
ICLR 2026 Poster accepted paper at ICLR 2026. Modern clinical diagnosis relies on the comprehensive analysis of multi-modal patient data, drawing on medical expertise to ensure systematic and rigorous reasoning. Recent advances in Vision–Language Models (VLMs) and agent-based methods are reshaping medical diagnosis by effectively integrating multi-modal information. However, they often output direct answers and empirically driven conclusions without clinical evidence supported by quantitative analysis, which compromises their reliability and hinders clinical usability. Here we propose MedAgent-Pro, an agentic reasoning paradigm that mirrors modern diagnosis principles via a hierarchical diagnostic workflow, consisting of disease-level standardized plan generation and patient-level personalized step-by-step reasoning.
Paper | ICLR 2026 Poster | 2026 | trustworthy medical AI | Beyond Medical Exams: A Clinician-Annotated Fairness Dataset of Real-World Tasks and Ambiguity in Mental Healthcare
ICLR 2026 Poster accepted paper at ICLR 2026. Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. In psychiatry especially, these challenges are worsened by fairness and bias issues, since models can be swayed by patient demographics even when those factors should not influence clinical decisions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This U.S.-centric dataset, created without any LM assistance, is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets.
Paper | ICLR 2026 Poster | 2026 | clinical prediction | Learning Self-Critiquing Mechanisms for Region-Guided Chest X-Ray Report Generation
ICLR 2026 Poster accepted paper at ICLR 2026. Automatic radiology reporting assists radiologists in diagnosing abnormalities in radiology images, where grounding the automatic diagnosis in abnormality locations is important for report interpretability. However, existing supervised-learning methods can end up learning superficial statistical correlations between images and reports, lacking the multi-faceted reasoning needed to critique the relevant regions on which radiologists would focus. Recently, self-critical reasoning has been investigated in test-time scaling approaches to alleviate hallucinations of LLMs, at the cost of increased time complexity. In this work, we focus on chest X-ray report generation with particular attention to clinical accuracy, where self-critical reasoning is instead introduced into the model architecture and its training objective, which is preferable for real-time automatic reporting systems.
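Paper | ICLR 2026 Poster | 2026 | clinical prediction | From Medical Records to Diagnostic Dialogues: A Clinically Grounded Approach and Dataset for Psychiatric Comorbidity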
ICLR 2026 Poster accepted paper at ICLR 2026. Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards.
Paper | ICLR 2026 Poster | 2026 | Medical multimodal AI | AttTok: Combining Attribute Tokens with Generative Pre-trained Vision-Language Models for Medical Image Understanding
ICLR 2026 poster introducing AttTok, a medical vision-language method that uses predefined attribute tokens and attribute-centric mechanisms to improve medical image understanding, including classification and visual question answering.
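Paper | npj Digital Medicine | 2026 | Clinical language intelligence | Human-Large Language Model Collaboration in Clinical Medicine: A Systematic Review and Meta-Analysis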
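A systematic review and meta-analysis evaluating how human-LLM collaboration in clinical medicine performs relative to human-only workflows, covering tasks such as clinical reasoning, documentation, and interpretation. The study notes that current evidence remains preliminary and context-dependent, and recommends follow-up clinical studies that are preregistered, pragmatic, multicenter, and embedded in real workflows.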
Paper | Nature Medicine | 2025 | Clinical LLM | Large Language Models for Expert-Level Medical Question Answering
Nature Medicine paper on Med-PaLM 2 and expert-level medical question answering with large language models.
Data resource | Chinese community medical questions and answers | Chinese medical QA dataset | Updated cMedQA dataset; see official repository | Open access | cMedQA2: Chinese community medical question answering dataset
cMedQA2 is an updated Chinese community medical question answering dataset for question-answer matching and medical QA research. It is useful for training and evaluating Chinese medical retrieval, ranking, and answer selection models.
Data resource | chest radiographs with radiology reports | chest X-ray image-report dataset | Large-scale CXR image-report dataset; version 2.1.0 | Access by application | MIMIC-CXR v2.1.0 chest X-ray dataset
MIMIC-CXR is a large deidentified chest radiograph dataset with associated free-text radiology reports. It is widely used for chest X-ray classification, report generation, image-text representation learning, radiology retrieval, and medical multimodal foundation model evaluation.
Data resource | deidentified clinical free text | clinical notes dataset | Clinical note extension for MIMIC-IV; version 2.2 | Access by application | MIMIC-IV-Note v2.2 clinical notes dataset
MIMIC-IV-Note provides deidentified clinical notes linked to MIMIC-IV hospital data. It supports clinical NLP tasks such as note representation learning, discharge summary modeling, information extraction, summarization, and multimodal EHR-text modeling.
Data resource | Chinese conversational medical QA text | Chinese medical conversational QA dataset | Large-scale Chinese medical CQA dataset; see official repository | Open access | CMCQA: Chinese medical conversational question answering dataset
CMCQA is a large Chinese medical conversational question-answering dataset released with knowledge-grounded medical dialogue research. It supports medical conversation QA, knowledge-grounded response generation, and evaluation of Chinese medical dialogue systems.
Data resource | Chinese medical instruction and dialogue text | Chinese medical instruction-tuning dataset | About 140K medical SFT examples; see Hugging Face card | Open access | HuatuoGPT2-SFT-GPT4-140K medical instruction dataset
HuatuoGPT2-SFT-GPT4-140K is a Chinese medical supervised fine-tuning dataset containing medical instruction-style conversations and GPT-4-assisted responses. It is useful for Chinese medical assistant alignment and medical LLM instruction tuning.
Data resource | Chinese medical question-answer text | Chinese medical QA corpus | About 26 million medical QA pairs | Open access | Huatuo-26M: large-scale Chinese medical question answering dataset
Huatuo-26M is a large-scale Chinese medical question-answering dataset with about 26 million QA pairs collected for medical language modeling and medical dialogue research. It is suitable for Chinese medical LLM pretraining, fine-tuning, and QA system development.
Data resource | medical exam question-answer text | medical exam QA benchmark | USMLE, Mainland China, and Taiwan exam-style QA splits; see repository | Open access | MedQA: medical exam question answering dataset with US, Mainland China, and Taiwan splits
MedQA is a medical examination question answering benchmark with English and Chinese medical licensing-style question sets, including mainland China and Taiwan variants. It is widely used for medical QA and medical reasoning evaluation.
Data resource | Chinese consultation dialogue text with medical entity annotations | Chinese medical dialogue generation dataset | Entity-annotated dialogue dataset; see official repository | Open access | MedDG: entity-centric Chinese medical dialogue generation dataset
MedDG is an entity-centric Chinese medical consultation dataset with domain entity annotations for medical dialogue generation. It supports entity-aware response generation, medical consultation modeling, and dialogue systems that ground responses in clinical concepts.
Data resource | Chinese medical exam and QA text | Chinese medical LLM evaluation benchmark | Multiple Chinese medical exam and benchmark splits; see Hugging Face card | Open access | CMB: Chinese Medical Benchmark
CMB is a comprehensive Chinese medical benchmark for evaluating medical large language models on medical exams, reasoning, and clinical knowledge questions. It is suited for Chinese medical QA, LLM evaluation, and instruction-following assessment.
Data resource | Chinese biomedical and clinical text | Chinese biomedical NLP benchmark | 8 biomedical NLU tasks; see official repository | Open access | CBLUE: Chinese Biomedical Language Understanding Evaluation benchmark
CBLUE is a Chinese biomedical language understanding benchmark covering real-world biomedical NLP tasks such as named entity recognition, relation extraction, term normalization, clinical trial classification, sentence similarity, and medical question answering. It is useful for evaluating Chinese clinical NLP models and medical language models.
Data resource | Text | LLM benchmark | Benchmark and leaderboard | Open access | MedHELM medical LLM evaluation benchmark
Medical LLM benchmark and leaderboard intended to broaden coverage beyond single medical QA datasets.
Data resource | Text | LLM evaluation benchmark | Health AI evaluation benchmark | Open access | HealthBench health AI evaluation benchmark
Benchmark for evaluating health AI model safety, helpfulness, and clinical-relevance judgments with physician-reviewed rubrics.
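Competition | Registration open to the public; upcoming stages of the schedule remain open (verified 2026-05-03) | sleep apnea detection and medical large-model applications | sleep monitoring signals and medical LLM applications | Deadline: 2026-08-07 Beijing time | JD Health Global Medical AI Innovation Competition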
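The public page for the JD Health Global Medical AI Innovation Competition shows that the contest focuses on two tracks: intelligent algorithms for sleep monitoring and innovative applications of medical large models. Registration is open to universities, research institutions, companies, and individuals worldwide. The schedule includes a preliminary round from June 17 to August 7, a subsequent semifinal round, and a final on September 21.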
Competition | Training release scheduled 2026-05-11 17:00 Beijing | Report generation | Pathology images and text | Deadline: 2026-07-20 17:00 Beijing time | REG 2026: Pathologist Reasoning-Guided Report Generation Challenge
MICCAI 2026 challenge for pathologist reasoning-guided pathology report generation, hosted on Grand Challenge.
Calls and collaboration | Scientific Reports | Deadline: 2026-06-23 Beijing time | Journal special issue | Scientific Reports Collection: AI for Clinical Decision-Making
This Nature Portfolio / Scientific Reports collection is open for submissions until 2026-06-23. It focuses on AI for clinical decision-making, including diagnostic, prognostic, and therapeutic decision support, EHRs, medical imaging, genomics, real-time patient data, clinical notes, multimodal learning, privacy-preserving AI, interpretability, and validation.
Calls and collaboration | NLPCC 2026 | Deadline: 2026-05-26 Beijing time | Conference call for papers | NLPCC 2026 Call for Papers
CCF-Deadlines lists NLPCC 2026 with papers due 2026-05-26 UTC+8 and conference dates 2026-11-03 to 2026-11-05 in Macau. NLPCC is relevant to Chinese clinical NLP, Chinese biomedical language resources, medical text mining, and healthcare question answering.
Calls and collaboration | EMNLP 2026 | Deadline: 2026-05-25 Beijing time | Conference call for papers | EMNLP 2026 Call for Papers
CCF-Deadlines lists EMNLP 2026 with papers due 2026-05-25 UTC-12 and conference dates 2026-10-24 to 2026-10-29 in Budapest. EMNLP is relevant to clinical NLP, biomedical language models, medical text mining, EHR note understanding, and safe medical LLM evaluation.
Calls and collaboration | IEEE BIBM 2026 | Deadline: 2026-07-05 Beijing time | Conference call for papers | IEEE BIBM 2026 Call for Papers
IEEE BIBM 2026 covers bioinformatics, biomedicine, and health informatics, including machine learning and AI, biomedical image analysis, biomedical signal analysis, clinical decision support, EHR standards, healthcare knowledge representation, NLP and text mining, and precision medicine. The official CFP lists electronic submission of full papers due 2026-07-05, notification on 2026-09-25, camera-ready on 2026-10-25, and the conference on 2026-12-01 to 2026-12-04 in Dallas.
China University MOOC / Wuhan University: Medical Artificial Intelligence
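Wuhan University's "Medical Artificial Intelligence" course on China University MOOC centers on the technology and clinical applications of medical AI, covering computer vision, natural language processing, medical ethics and regulation, clinical research standards, and AI-assisted medical teaching. It suits Chinese-language learners looking for a systematic introduction to medical AI.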