论文ICLR 2026 Poster2026 年trustworthy medical AI 可解释性与嵌入的桥接:让 BEE 识别伪相关
ICLR 2026 Poster accepted paper at ICLR 2026. Current methods for detecting spurious correlations rely on data splits or error patterns, leaving many harmful shortcuts invisible when counterexamples are absent. We introduce BEE (Bridging Explainability and Embeddings), a framework that shifts the focus from model predictions to the weight space and embedding geometry underlying decisions. By analyzing how fine-tuning perturbs pretrained representations, BEE uncovers spurious correlations that remain hidden from conventional evaluation pipelines. We use linear probing as a transparent diagnostic lens, revealing spurious features that not only persist after full fine-tuning but also transfer across diverse state-of-the-art models. Code/project link: https://github.com/bit-ml/bee
论文ICLR 2026 Poster2026 年clinical NLP VLM-SubtleBench:VLM 距离人类级细微比较推理还有多远?
ICLR 2026 Poster accepted paper at ICLR 2026. The ability to distinguish subtle differences between visually similar images is essential for diverse domains such as industrial anomaly detection, medical imaging, and aerial surveillance. While comparative reasoning benchmarks for vision-language models (VLMs) have recently emerged, they primarily focus on images with large, salient differences and fail to capture the nuanced reasoning required for real-world applications. In this work, we introduce **VLM-SubtleBench**, a benchmark designed to evaluate VLMs on *subtle comparative reasoning*. Our benchmark covers ten difference types—Attribute, State, Emotion, Temporal, Spatial, Existence, Quantity, Quality, Viewpoint, and Action—and curate paired question–image sets reflecting these fine-grained variations.
论文ICLR 2026 Poster2026 年医学影像 无需甲基化输入的全基因组 DNA 甲基化预测新范式
ICLR 2026 Poster accepted paper at ICLR 2026. DNA methylation (DNAm) is a key epigenetic modification that regulates gene expression and is pivotal in development and disease. However, profiling DNAm at genome scale is challenging: of $\textasciitilde$28 million CpG sites in the human genome, only about 1–3\% are typically assayed in common datasets due to technological limitations and cost. Recent deep learning approaches, including masking-based generative Transformer models, have shown promise in capturing DNAm–gene expression relationships, but they rely on partially observed DNAm values for unmeasured CpGs and cannot be applied to completely unmeasured samples. To overcome this barrier, we introduce MethylProphet, a gene-guided, context-aware Transformer model for whole-genome DNAm inference without any measured DNAm input.
论文ICLR 2026 Poster2026 年trustworthy medical AI 面向一般右删失数据的保形化生存反事实预测
ICLR 2026 Poster accepted paper at ICLR 2026. This paper aims to develop a lower prediction bound (LPB) for survival time across different treatments in the general right-censored setting. Although previous methods have utilized conformal prediction to construct the LPB, their resulting prediction sets provide only probably approximately correct (PAC)–type miscoverage guarantees rather than exact ones. To address this problem, we propose a new calibration procedure under the potential outcome framework. Under the strong ignorability assumption, we propose a reweighting scheme that can transform the problem into a weighted conformal inference problem, allowing an LPB to be obtained via quantile regression with an exact miscoverage guarantee.
论文ICLR 2026 Poster2026 年clinical NLP LLM 推理中类人谬误模式的理论扎根评测
ICLR 2026 Poster accepted paper at ICLR 2026. We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open‑source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR‑predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model’s incorrect answers are ETR‑predicted fallacies ($\rho=0.360, p=0.0265$), while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects.
论文ICLR 2026 Poster2026 年clinical prediction SurvHTE-Bench:生存分析中异质治疗效应估计基准
ICLR 2026 Poster accepted paper at ICLR 2026. Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from causal survival forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE‐Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial.
论文ICLR 2026 Poster2026 年clinical prediction MRI 运动校正的可靠评测:数据集与洞见
ICLR 2026 Poster accepted paper at ICLR 2026. Correcting motion artifacts in scientific and medical imaging is important, as they significantly impact image quality. However, evaluating deep learning-based and classical motion correction methods remains fundamentally difficult due to the lack of accessible ground-truth target data. To address this challenge, we study three evaluation approaches: real-world evaluation based on reference scans, simulated motion, and reference-free evaluation, each with its merits and shortcomings. To enable evaluation with real-world motion artifacts, we release PMoC3D, a dataset consisting of unprocessed $\textbf{P}$aired $\textbf{Mo}$tion-$\textbf{C}$orrupted $\textbf{3D}$ brain MRI data.
论文ICLR 2026 Poster2026 年trustworthy medical AI ProstaTD:将手术 triplet 从分类桥接到全监督检测
ICLR 2026 Poster accepted paper at ICLR 2026. Surgical triplet detection is a critical task in surgical video analysis, with significant implications for performance assessment and training novice surgeons. However, existing datasets like CholecT50 lack precise spatial bounding box annotations, rendering triplet classification at the image level insufficient for practical applications. The inclusion of bounding box annotations is essential to make this task meaningful, as they provide the spatial context necessary for accurate analysis and improved model generalizability. To address these shortcomings, we introduce ProstaTD, a large-scale, multi-institutional dataset for surgical triplet detection, developed from the technically demanding domain of robot-assisted prostatectomy.
论文ICLR 2026 Poster2026 年clinical prediction MedAraBench:大规模阿拉伯语医学问答数据集与基准
ICLR 2026 Poster accepted paper at ICLR 2026. Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region.
论文ICLR 2026 Poster2026 年医学影像 HistoPrism:通过基因表达预测从泛癌组织学解锁功能通路分析
ICLR 2026 Poster accepted paper at ICLR 2026. Predicting spatial gene expression from H\&E histology offers a scalable and clinically accessible alternative to sequencing, but realizing clinical impact requires models that generalize across cancer types and capture biologically coherent signals. Prior work is often limited to per-cancer settings and variance-based evaluation, leaving functional relevance underexplored. We introduce HistoPrism, an efficient transformer-based architecture for pan-cancer prediction of gene expression from histology. To evaluate biological meaning, we introduce a pathway-level benchmark, shifting assessment from isolated gene-level variance to coherent functional pathways.
论文ICLR 2026 Poster2026 年medical LLM agent 大语言模型能否匹配系统综述的结论?
ICLR 2026 Poster accepted paper at ICLR 2026. Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: **Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies?** To explore this question, we present MedEvidence, a benchmark pairing findings from 100 medical SRs with the studies they are based on.
论文ICLR 2026 Poster2026 年clinical prediction 医学 MLLM 如何失效?医学图像视觉定位研究
ICLR 2026 Poster accepted paper at ICLR 2026. Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks—particularly in zero-shot settings where generalization is critical—remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. **In this work**, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle *visual grounding* from *semantic grounding*, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. Code/project link: https://guimeng-leo-liu.github.io/Medical-MLLMs-Fail/
论文ICLR 2026 Poster2026 年medical LLM agent GALAX:面向精准医疗中可解释强化引导子图推理的图增强语言模型
ICLR 2026 Poster accepted paper at ICLR 2026. In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets, guiding the discovery of novel therapeutics and effective treatment strategies. Existing pipelines capture only one or two of these—numerical omics ignore topological context, text-centric LLMs lack quantitative grounded reasoning, and graph-only models underuse rich node semantics and the generalization power of LLMs—thereby limiting mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by coarse step definitions, unreliable intermediate evaluation, and vulnerability to reward hacking with added computational cost. These gaps motivate jointly integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principle bridge linking numeric evidence, topological knowledge and language context.
论文ICLR 2026 Poster2026 年medical LLM agent Doctor-R1:通过体验式 Agent 强化学习掌握临床问诊
ICLR 2026 Poster accepted paper at ICLR 2026. The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct the strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both of the capabilities by ask high-yield questions and conduct strategic multi-turn inquiry to guide decision-making.
论文ICLR 2026 Poster2026 年医学影像 CardioComposer:利用可微几何实现解剖扩散模型的组合式控制
ICLR 2026 Poster accepted paper at ICLR 2026. Generative models of 3D cardiovascular anatomy can synthesize informative structures for clinical research and medical device evaluation, but face a trade-off between geometric controllability and realism. We propose CardioComposer: a programmable, inference time framework for generating multi-class anatomical label maps from interpretable ellipsoidal primitives. These primitives represent geometric attributes such as the size, shape, and position of discrete substructures. We specifically develop differentiable measurement functions based on voxel-wise geometric moments, enabling loss-based gradient guidance during diffusion model sampling. Code/project link: https://github.com/kkadry/CardioComposer
论文ICLR 2026 Poster2026 年trustworthy medical AI PathChat-SegR1:通过 SO-GRPO 实现病理推理分割
ICLR 2026 Poster accepted paper at ICLR 2026. Segmentation in pathology image requires handling out-of-domain tissue morphologies and new pathologies beyond training distributions, where traditional closed-set segmentation approaches fail to generalize. Reasoning segmentation enables zero-shot generalization via prompting with text queries. However, existing reasoning segmentation models face three barriers when applied to pathology: (1) the vision encoder lack pathology-specific knowledge and robustness to staining variations, (2) the large language model (LLM) backbone for reasoning fails to identify whether it has gathered sufficient semantic context to trigger the segmentation output, and (3) no reasoning segmentation benchmarks and datasets exist for pathology analysis. Consequently, we introduce PathChat-SegR1, a reasoning segmentation model built upon pathology-specific vision encoders trained with a novel stain-invariant self-distillation for robust pathology image representations.
论文ICLR 2026 Oral2026 年clinical prediction CounselBench:心理健康问答中大语言模型的大规模专家评测与对抗基准
ICLR 2026 Oral accepted paper at ICLR 2026. Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat.
论文ICLR 2026 Poster2026 年clinical prediction CRONOS:4D 医学纵向序列的连续时间重建
ICLR 2026 Poster accepted paper at ICLR 2026. Forecasting how 3D medical scans evolve along time is important for disease progression, treatment planning, and developmental assessment. Yet existing models either rely on a single prior scan, fixed grid times, or target global labels, which limits voxel-level forecasting under irregular sampling. We present CRONOS, a unified framework for many-to-one prediction from multiple past scans that supports both discrete (grid-based) and continuous (real-valued) timestamps in one model, to the best of our knowledge the first to achieve continuous sequence-to-image forecasting for 3D medical data. CRONOS learns a spatio-temporal velocity field that transports context volumes toward a target volume at an arbitrary time, while operating directly in 3D voxel space.
论文ICLR 2026 Poster2026 年medical LLM agent AnesSuite:面向 LLM 麻醉学推理的综合基准与数据集套件
ICLR 2026 Poster accepted paper at ICLR 2026. The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Code/project link: https://github.com/MiliLab/AnesSuite
论文ICLR 2026 Poster2026 年trustworthy medical AI Resp-Agent:面向多模态呼吸音生成与疾病诊断的 Agent 系统
ICLR 2026 Poster accepted paper at ICLR 2026. Deep learning-based respiratory auscultation is currently hindered by two fundamental challenges: (i) inherent information loss, as converting signals into spectrograms discards transient acoustic events and clinical context; (ii) limited data availability, exacerbated by severe class imbalance. To bridge these gaps, we present **_Resp-Agent_**, an autonomous multimodal system orchestrated by a novel Active Adversarial Curriculum Agent (Thinker-A²CA). Unlike static pipelines, Thinker-A²CA serves as a central controller that actively identifies diagnostic weaknesses and schedules targeted synthesis in a closed loop. To address the representation gap, we introduce a modality-weaving Diagnoser that weaves clinical text with audio tokens via strategic global attention and sparse audio anchors, capturing both long-range clinical context and millisecond-level transients. Code/project link: https://github.com/zpforlove/Resp-Agent
论文ICLR 2026 Poster2026 年trustworthy medical AI LiveClin:无泄漏的实时临床基准
ICLR 2026 Poster accepted paper at ICLR 2026. The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for the approximating real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI–human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. Code/project link: https://github.com/AQ-MedAI/LiveClin
论文ICLR 2026 Poster2026 年trustworthy medical AI 基于互信息正则的频率均衡视网膜表征学习
ICLR 2026 Poster accepted paper at ICLR 2026. We propose a frequency-oriented perspective on retinal representation learning by analyzing masked autoencoders (MAE) through the lens of spatial frequency. Our analysis shows that MAE favors low-frequency content while under-encoding diagnostically critical high-frequency structures in retinal images. Because retinal pathology often manifests in high-frequency detail, this bias limits diagnostic performance and motivates frequency-balanced representations. Within a mutual-information (MI) formulation of MAE, we introduce the Frequency-Balanced Retinal Masked Autoencoder (RetMAE), which augments the reconstruction objective with a MI regularizer that suppresses low-frequency redundancy and accentuates clinically salient high-frequency information.
论文ICLR 2026 Poster2026 年trustworthy medical AI 用生成器-验证器 LMM 从医学文档合成高质量视觉问答
ICLR 2026 Poster accepted paper at ICLR 2026. Large Multimodal Models (LMMs) are increasingly capable of answering medical questions that require joint reasoning over images and text, yet training general medical VQA systems is impeded by the lack of large, openly usable, high-quality corpora. We present MedVLSynther, a rubric-guided generator-verifier framework that synthesizes high-quality multiple-choice VQA items directly from open biomedical literature by conditioning on figures, captions, and in-text references. The generator produces self-contained stems and parallel, mutually exclusive options under a machine-checkable JSON schema; a multi-stage verifier enforces essential gates (self-containment, single correct answer, clinical validity, image-text consistency), awards fine-grained positive points, and penalizes common failure modes before acceptance. Applying this pipeline to PubMed Central yields MedSynVQA: 13,087 audited questions over 14,803 images spanning 13 imaging modalities and 28 anatomical regions.
论文ICLR 2026 Poster2026 年clinical prediction M3CoTBench:医学图像理解中 MLLM 思维链基准
ICLR 2026 Poster accepted paper at ICLR 2026. Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis.
论文ICLR 2026 Poster2026 年medical LLM agent KnowGuard:面向多轮临床推理的知识驱动拒答
ICLR 2026 Poster accepted paper at ICLR 2026. In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with the abstentions, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidences.
论文ICLR 2026 Poster2026 年trustworthy medical AI NurValues:临床情境中大语言模型的真实护理价值观评测
ICLR 2026 Poster accepted paper at ICLR 2026. While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse–patient conflicts. Such risks highlight the urgent need of evaluating whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: _Altruism_, _Human Dignity_, _Integrity_, _Justice_, and _Professionalism_. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse–patient conflicts.
论文ICLR 2026 Poster2026 年clinical NLP 多图像医学思维
ICLR 2026 Poster accepted paper at ICLR 2026. Large language models perform well on many medical QA benchmarks, but real clinical reasoning is harder because diagnosis often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, in which models must interpret each image, combine cross-view evidence, and solve diagnostic questions under intermediate supervision and step-level evaluation. The dataset contains 10,067 cases, including 720 test cases, with an average of 6.68 images per case, substantially denser than prior work (earlier maxima $\leq$ 1.43). On the test set, the best closed-source models, Claude-4.6-opus, Gemini-3-pro, and GPT-5.2-xhigh, achieve only 54.9%--57.2% accuracy, while smaller proprietary variants, GPT-5-mini/nano, drop to 39.7% and 30.8%.
论文ICLR 2026 Poster2026 年trustworthy medical AI Cancer-Myth:评估大语言模型回答含错误预设的患者问题
ICLR 2026 Poster accepted paper at ICLR 2026. Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions} in the questions, posing risks to safe medical decision-making.
论文ICLR 2026 Poster2026 年clinical prediction 能否用 LLM 为临床时间序列数据生成可迁移表征?
ICLR 2026 Poster accepted paper at ICLR 2026. Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. Code/project link: https://neuromedbench.github.io/
论文ICLR 2026 Poster2026 年trustworthy medical AI 超越分类准确率:Neural-MedBench 与深层推理基准的必要性
ICLR 2026 Poster accepted paper at ICLR 2026. Epilepsy affects over 50 million people worldwide, and one-third of patients suffer drug-resistant seizures where surgery offers the best chance of seizure freedom. Accurate localization of the epileptogenic zone (EZ) relies on intracranial EEG (iEEG). Clinical workflows, however, remain constrained by labor-intensive manual review. At the same time, existing data-driven approaches are typically developed on single-center datasets that are inconsistent in format and metadata, lack standardized benchmarks, and rarely release pathological event annotations, creating barriers to reproducibility, cross-center validation, and clinical relevance. Code/project link: https://omni-ieeg.github.io/omni-ieeg/; https://github.com/Omni-iEEG/Omni-iEEG
论文ICLR 2026 Poster2026 年trustworthy medical AI MedAgent-Pro:通过推理型 Agent 工作流迈向证据型多模态医学诊断
ICLR 2026 Poster accepted paper at ICLR 2026. Modern clinical diagnosis relies on the comprehensive analysis of multi-modal patient data, drawing on medical expertise to ensure systematic and rigorous reasoning. Recent advances in Vision–Language Models (VLMs) and agent-based methods are reshaping medical diagnosis by effectively integrating multi-modal information. However, they often output direct answers and empirical-driven conclusions without clinical evidence supported by quantitative analysis, which compromises their reliability and hinders clinical usability. Here we propose MedAgent-Pro, an agentic reasoning paradigm that mirrors modern diagnosis principles via a hierarchical diagnostic workflow, consisting of disease-level standardized plan generation and patient-level personalized step-by-step reasoning.
论文ICLR 2026 Poster2026 年trustworthy medical AI 超越医学考试:面向心理健康真实任务与模糊性的临床医生标注公平性数据集
ICLR 2026 Poster accepted paper at ICLR 2026. Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. In psychiatry especially, these challenges are worsened by fairness and bias issues, since models can be swayed by patient demographics even when those factors should not influence clinical decisions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This U.S. centric dataset — created without any LM assistance — is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets.
论文ICLR 2026 Poster2026 年clinical prediction 从病历到诊断对话:面向精神共病的临床扎根方法与数据集
ICLR 2026 Poster accepted paper at ICLR 2026. Psychiatric comorbidity is clinically significant yet challenging due to the complexity of multiple co-occurring disorders. To address this, we develop a novel approach integrating synthetic patient electronic medical record (EMR) construction and multi-agent diagnostic dialogue generation. We create 502 synthetic EMRs for common comorbid conditions using a pipeline that ensures clinical relevance and diversity. Our multi-agent framework transfers the clinical interview protocol into a hierarchical state machine and context tree, supporting over 130 diagnostic states while maintaining clinical standards.
论文ICLR 2026 Poster2026 年trustworthy medical AI AbdCTBench:从腹部表面几何学习临床生物标志物表征
ICLR 2026 Poster accepted paper at ICLR 2026. Body composition analysis through CT and MRI imaging provides critical insights for cardio-metabolic health assessment but remains limited by accessibility barriers including radiation exposure, high costs, and infrastructure requirements. We present AbdCTBench, a large-scale dataset containing 23,506 CT-derived abdominal surface meshes from 18,719 patients, paired with 87 comorbidity labels, 31 specific diagnosis codes, and 16 CT-derived biomarkers. Our key insight is that external surface geometry is predictive of internal tissue composition, enabling accessible health screening through consumer devices. We establish comprehensive benchmarks across seven computer vision architectures (ResNet-18/34/50, DenseNet-121, EfficientNet-B0, ViT-Small, Swin Transformer-Base), demonstrating that models can learn robust surface-to-biomarker representations directly from 2D mesh projections. Code/project link: https://abdctbenchrepo.github.io/AbdCTBench/
论文Nature Communications2024 年医学影像 医学图像中的 Segment Anything
MedSAM adapts the Segment Anything paradigm to medical image segmentation and reports broad evaluation across imaging modalities.
数据资源chest radiographs with pneumonia/lung opacity annotationschest X-ray pneumonia detection challenge datasetRSNA 2018 AI image challenge dataset开放访问 RSNA 肺炎检测挑战数据集
The RSNA Pneumonia Detection Challenge dataset is a chest radiograph benchmark for detecting pneumonia-related lung opacities. It supports object detection, chest X-ray classification, localization, and radiology AI evaluation under a competition framework.
数据资源cine cardiac MRI with segmentation labelscardiac MRI segmentation datasetACDC challenge dataset; see official database page申请访问 ACDC 自动心脏诊断挑战数据集
ACDC is a cardiac MRI dataset for automated cardiac diagnosis and segmentation. It supports left and right ventricular segmentation, myocardium segmentation, cardiac function quantification, and evaluation of robust cardiac image analysis methods.
数据资源raw MRI k-space and reconstructed MRI dataMRI reconstruction datasetLarge raw MRI reconstruction dataset; see official site申请访问 fastMRI 原始 MRI 重建数据集
fastMRI is a raw MRI dataset for accelerated magnetic resonance image reconstruction, originally released by NYU Langone Health and Meta AI. It is used for MRI reconstruction, compressed sensing replacement, generative reconstruction, and robustness evaluation.
数据资源2D and 3D biomedical imagesstandardized biomedical image benchmark12 2D datasets and 6 3D datasets in MedMNIST v2开放访问 MedMNIST v2 生物医学图像基准
MedMNIST v2 is a standardized collection of lightweight biomedical image classification datasets, including 2D and 3D tasks. It is useful for quick benchmarking, AutoML, foundation model sanity checks, and reproducible evaluation across multiple medical imaging domains.
数据资源abdominal CT with kidney and tumor annotationskidney tumor CT segmentation datasetTCIA C4KC-KiTS collection; see collection page开放访问 C4KC-KiTS 肾肿瘤分割集合
C4KC-KiTS is a TCIA imaging collection associated with kidney and kidney tumor segmentation benchmarks. It supports kidney segmentation, renal tumor segmentation, surgical planning research, and evaluation of abdominal CT segmentation models.
数据资源12-lead ECG waveforms with diagnostic labelsECG waveform benchmarkLarge public ECG dataset; version 1.0.3开放访问 PTB-XL:大型开放 12 导联 ECG 数据集
PTB-XL is a large public 12-lead electrocardiography dataset with diagnostic statements and waveform records. It is a standard benchmark for ECG classification, cardiac abnormality detection, clinical signal representation learning, and robust evaluation of biosignal models.
数据资源chest radiographs with radiology reportschest X-ray image-report datasetLarge-scale CXR image-report dataset; version 2.1.0申请访问 MIMIC-CXR v2.1.0 胸部 X 光数据集
MIMIC-CXR is a large deidentified chest radiograph dataset with associated free-text radiology reports. It is widely used for chest X-ray classification, report generation, image-text representation learning, radiology retrieval, and medical multimodal foundation model evaluation.
数据资源medical images with bilingual visual questions and answersmedical visual question answering datasetBilingual medical VQA dataset; see official project page开放访问 SLAKE:语义标注、知识增强医学 VQA 数据集
SLAKE is a semantically labeled medical visual question answering dataset with bilingual English-Chinese questions, medical images, and knowledge-enhanced annotations. It is useful for medical multimodal learning, image-grounded QA, and radiology VQA evaluation.
数据资源Chinese conversational medical QA textChinese medical conversational QA datasetLarge-scale Chinese medical CQA dataset; see official repository开放访问 CMCQA:中文医学会话问答数据集
CMCQA is a large Chinese medical conversational question-answering dataset released with knowledge-grounded medical dialogue research. It supports medical conversation QA, knowledge-grounded response generation, and evaluation of Chinese medical dialogue systems.
数据资源medical exam question-answer textmedical exam QA benchmarkUSMLE, Mainland China, and Taiwan exam-style QA splits; see repository开放访问 MedQA:含美国、中国大陆与台湾拆分的医学考试问答数据集
MedQA is a medical examination question answering benchmark with English and Chinese medical licensing-style question sets, including mainland China and Taiwan variants. It is widely used for medical QA and medical reasoning evaluation.
数据资源Chinese medical exam and QA textChinese medical LLM evaluation benchmarkMultiple Chinese medical exam and benchmark splits; see Hugging Face card开放访问 CMB:中文医学基准
CMB is a comprehensive Chinese medical benchmark for evaluating medical large language models on medical exams, reasoning, and clinical knowledge questions. It is suited for Chinese medical QA, LLM evaluation, and instruction-following assessment.
数据资源Chinese biomedical and clinical textChinese biomedical NLP benchmark8 biomedical NLU tasks; see official repository开放访问 CBLUE:中文生物医学语言理解评测基准
CBLUE is a Chinese biomedical language understanding benchmark covering real-world biomedical NLP tasks such as named entity recognition, relation extraction, term normalization, clinical trial classification, sentence similarity, and medical question answering. It is useful for evaluating Chinese clinical NLP models and medical language models.
数据资源胸部 X 光放射影像224,316 chest radiographs申请访问 CheXpert
Stanford chest radiograph dataset for automated chest X-ray interpretation and uncertainty-aware label evaluation.
数据资源TextLLM evaluation benchmarkHealth AI evaluation benchmark开放访问 HealthBench 健康 AI 评测基准
Benchmark for evaluating health AI model safety, helpfulness, and clinical-relevance judgments with physician-reviewed rubrics.
数据资源Multimodal clinical dataBenchmarkICML 2025 benchmark开放访问 CLIMB 临床基础模型基准
Multimodal clinical data foundation and benchmark introduced at ICML 2025 for clinical foundation model research.
技术竞赛Openbreast cancer histopathology semantic segmentationH&E whole-slide histopathology images开始 北京时间 2026-05-03 BEETLE 乳腺癌组织病理分割挑战
Grand Challenge official API lists this medical AI challenge with status OPEN. BEETLE is a multicenter, multiscanner benchmark for breast cancer histopathology segmentation. It focuses on multiclass semantic segmentation of hematoxylin and eosin (H&E)-stained whole-slide images (WSIs) into four tissue categories: invasive epithelium, non-invasive epithelium, necrosis, and other. The evaluation set comprises 170 densely annotated regions from 54 WSIs, covering all molecular subtypes and histological grades, thereby capturing much of the morphological heterogeneity seen in clinical practice. BEETLE provides a standardized resource for benchmarking breast cancer segmentation models, supporting the development of robust, generalizable algorithms for large-scale biomarke...
征稿与合作npj Digital Medicine截止 北京时间 2027-03-24期刊专刊 npj Digital Medicine 专辑:医学 AI 研究中的前瞻性与干预性临床证据
This Nature Portfolio / npj Digital Medicine collection is open for submissions until 2027-03-24. It seeks clinical AI studies centered on pre-specified actions, including pragmatic trials, cluster-randomized and stepped-wedge designs, generative AI in workflow, causal analyses, target trial emulation, post-market surveillance, drift monitoring, safety, economic evaluation, and lifecycle governance.
征稿与合作npj Cardiovascular Health截止 北京时间 2026-06-12期刊专刊 npj Cardiovascular Health 专辑:使用 AI 与 ML 的临床照护和以患者为中心的互动
This Nature Portfolio / npj Cardiovascular Health collection is open for submissions until 2026-06-12. It calls for AI and machine learning work in cardiovascular care, including diagnostic evaluation, CVD risk prediction, cardiac imaging and ECG deep learning, EHR tooling, multi-omics, wearables, remote monitoring, deployment, transparency, ethics, and regulation.
征稿与合作EMNLP 2026截止 北京时间 2026-05-25会议征稿 EMNLP 2026 征稿
CCF-Deadlines lists EMNLP 2026 with papers due 2026-05-25 UTC-12 and conference dates 2026-10-24 to 2026-10-29 in Budapest. EMNLP is relevant to clinical NLP, biomedical language models, medical text mining, EHR note understanding, and safe medical LLM evaluation.