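AI4Meder Site Search

Search medical AI papers and resources

Search community content by paper, data resource, technical competition, submission deadline, and course resource, and jump straight to the corresponding detail page.

27 results

Enter a keyword or click a tag to narrow results by paper, data resource, competition deadline, call for papers, or course. Tag: LLMs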

Paper · ICLR 2026 Poster · 2026 · clinical NLP

LaVCa: LLM-Assisted Visual Cortex Captioning

ICLR 2026 Poster accepted paper at ICLR 2026. Understanding the properties of neural populations (or voxels) in the human brain can advance our comprehension of human perceptual and cognitive processing capabilities and contribute to developing brain-inspired computer models. Recent encoding models using deep neural networks (DNNs) have successfully predicted voxel-wise activity. However, interpreting the properties that explain voxel responses remains challenging because of the black-box nature of DNNs. As a solution, we propose LLM-assisted Visual Cortex Captioning (LaVCa), a data-driven approach that leverages large language models (LLMs) to generate natural-language captions for images to which voxels are selective.

Medical image computing · Medical multimodal · Clinical language intelligence · Paper · Neuroscience Computer vision · View paper details

Paper · ICLR 2026 Poster · 2026 · clinical NLP

A Theory-Grounded Evaluation of Human-Like Fallacy Patterns in LLM Reasoning

ICLR 2026 Poster accepted paper at ICLR 2026. We study logical reasoning in language models by asking whether their errors follow established human fallacy patterns. Using the Erotetic Theory of Reasoning (ETR) and its open‑source implementation, PyETR, we programmatically generate 383 formally specified reasoning problems and evaluate 38 models. For each response, we judge logical correctness and, when incorrect, whether it matches an ETR‑predicted fallacy. Two results stand out: (i) as a capability proxy (Chatbot Arena Elo) increases, a larger share of a model’s incorrect answers are ETR‑predicted fallacies ($\rho=0.360, p=0.0265$), while overall correctness on this dataset shows no correlation with capability; (ii) reversing premise order significantly reduces fallacy production for many models, mirroring human order effects.

Clinical language intelligence · Paper · LLMs language models reasoning synthetic data · View paper details
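The abstract above reports a correlation of ρ = 0.360 between a capability proxy and the share of errors that are ETR-predicted fallacies. The abstract does not say which coefficient ρ denotes. The sketch below assumes it is Spearman's rank correlation and shows how that statistic is computed. The Elo values and fallacy shares are made-up illustrative numbers, not data from the paper, and the function names are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(xs, ys):
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values for five models (NOT from the paper).
elo = [1100, 1180, 1250, 1300, 1360]
fallacy_share = [0.30, 0.28, 0.41, 0.39, 0.52]
print(round(spearman_rho(elo, fallacy_share), 3))
```

Because the statistic depends only on ranks, it captures a monotonic relationship without assuming the relationship between Elo and fallacy share is linear.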

Paper · ICLR 2026 Poster · 2026 · medical imaging

Brain Graph Foundation Model: Pre-Training and Prompt Tuning Across Multiple Atlases and Disorders

ICLR 2026 Poster accepted paper at ICLR 2026. As large language models (LLMs) continue to revolutionize AI research, there is a growing interest in building large-scale brain foundation models to advance neuroscience. While most existing brain foundation models are pre-trained on time-series signals or connectome features, we propose a novel graph-based pre-training paradigm for constructing a brain graph foundation model. In this paper, we introduce the Brain Graph Foundation Model, termed BrainGFM, a unified framework that leverages graph contrastive learning and graph masked autoencoders for large-scale fMRI-based pre-training. BrainGFM is pre-trained on a diverse mixture of brain atlases with varying parcellations, significantly expanding the pre-training corpus and enhancing the model’s ability to generalize across heterogeneous fMRI-derived brain representations. Code/project link: https://github.com/weixinxu666/BrainGFM

Medical image computing · Paper · Brain Graph Foundation Model Functional Magnetic Resonance Imaging (fMRI) Neuroscience Graph Pre-Training · View paper details

Paper · ICLR 2026 Poster · 2026 · clinical prediction

MedAraBench: A Large-Scale Arabic Medical Question Answering Dataset and Benchmark

ICLR 2026 Poster accepted paper at ICLR 2026. Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · Dataset Benchmark Large Language Models · View paper details

Paper · ICLR 2026 Poster · 2026 · medical LLM agent

Can Large Language Models Match the Conclusions of Systematic Reviews?

ICLR 2026 Poster accepted paper at ICLR 2026. Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: **Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies?** To explore this question, we present MedEvidence, a benchmark pairing findings from 100 medical SRs with the studies they are based on.

Medical image computing · Clinical language intelligence · Paper · Benchmarks Multi-document Reasoning Medical AI · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

ATPO: Adaptive Tree Policy Optimization for Multi-Turn Medical Dialogue

ICLR 2026 Poster accepted paper at ICLR 2026. Effective information seeking in multi-turn medical dialogues is critical for accurate diagnosis, especially when dealing with incomplete information. Aligning Large Language Models (LLMs) for these interactive scenarios is challenging due to the uncertainty inherent in user-agent interactions, which we formulate as a Hierarchical Markov Decision Process (H-MDP). While conventional Reinforcement Learning (RL) methods like Group Relative Policy Optimization (GRPO) struggle with long-horizon credit assignment and Proximal Policy Optimization (PPO) suffers from unstable value estimation in this context, we propose a novel uncertainty-aware Adaptive Tree Policy Optimization (ATPO) algorithm. Our method adaptively allocates the rollout budget to states with high uncertainty, quantified by a composite metric of Bellman error and action-value variance.

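Medical image computing · Clinical language intelligence · Trustworthiness, safety, fairness, and privacy · Paper · Reinforcement Learning (RL) Large Language Models (LLMs) · View paper details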
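The ATPO abstract above says the rollout budget is allocated adaptively to states with high uncertainty, measured by a composite of Bellman error and action-value variance. The abstract gives neither the composite formula nor the allocation rule. The sketch below is therefore only an illustration under two stated assumptions: the composite is a weighted sum, and the budget is split in proportion to the score. All names (`uncertainty`, `allocate_rollouts`, `weight`) are hypothetical and are not taken from the paper.

```python
from statistics import pvariance


def uncertainty(bellman_error, q_values, weight=0.5):
    """Assumed composite score: weighted sum of |Bellman error| and Q-value variance."""
    return weight * abs(bellman_error) + (1 - weight) * pvariance(q_values)


def allocate_rollouts(scores, budget):
    """Split an integer rollout budget across states in proportion to their scores.

    Uses largest-remainder rounding so the counts always sum to the budget.
    """
    total = sum(scores)
    if total == 0:
        # No state is more uncertain than another: spread the budget evenly.
        shares = [budget / len(scores)] * len(scores)
    else:
        shares = [budget * s / total for s in scores]
    counts = [int(share) for share in shares]
    leftover = budget - sum(counts)
    # Hand the remaining rollouts to the states with the largest fractional parts.
    by_remainder = sorted(
        range(len(scores)), key=lambda i: shares[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


# Three hypothetical dialogue states: (Bellman error, sampled action values).
states = [(0.1, [0.5, 0.5]), (0.8, [0.2, 0.9]), (0.3, [0.4, 0.6])]
scores = [uncertainty(err, qs) for err, qs in states]
print(allocate_rollouts(scores, budget=16))
```

Under these assumptions, the second state has both the largest Bellman error and the largest action-value spread, so it receives the largest share of rollouts.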

Paper · ICLR 2026 Poster · 2026 · clinical prediction

How Do Medical MLLMs Fail? A Study of Visual Grounding in Medical Images

ICLR 2026 Poster accepted paper at ICLR 2026. Generalist multimodal large language models (MLLMs) have achieved impressive performance across a wide range of vision-language tasks. However, their performance on medical tasks—particularly in zero-shot settings where generalization is critical—remains suboptimal. A key research gap is the limited understanding of why medical MLLMs underperform in medical image interpretation. **In this work**, we present a pioneering systematic investigation into the visual grounding capabilities of state-of-the-art medical MLLMs. To disentangle *visual grounding* from *semantic grounding*, we design VGMED, a novel evaluation dataset developed with expert clinical guidance, explicitly assessing the visual grounding capability of medical MLLMs. Code/project link: https://guimeng-leo-liu.github.io/Medical-MLLMs-Fail/

Medical image computing · Medical multimodal · Clinical language intelligence · Paper · Medical MLLM Visual Grounding · View paper details

Paper · ICLR 2026 Poster · 2026 · medical LLM agent

GALAX: A Graph-Augmented Language Model for Explainable Reinforcement-Guided Subgraph Reasoning in Precision Medicine

ICLR 2026 Poster accepted paper at ICLR 2026. In precision medicine, quantitative multi-omic features, topological context, and textual biological knowledge play vital roles in identifying disease-critical signaling pathways and targets, guiding the discovery of novel therapeutics and effective treatment strategies. Existing pipelines capture only one or two of these: numerical omics ignore topological context, text-centric LLMs lack quantitative grounded reasoning, and graph-only models underuse rich node semantics and the generalization power of LLMs, thereby limiting mechanistic interpretability. Although Process Reward Models (PRMs) aim to guide reasoning in LLMs, they remain limited by coarse step definitions, unreliable intermediate evaluation, and vulnerability to reward hacking with added computational cost. These gaps motivate jointly integrating quantitative multi-omic signals, topological structure with node annotations, and literature-scale text via LLMs, using subgraph reasoning as the principal bridge linking numeric evidence, topological knowledge and language context.

Medical image computing · Clinical language intelligence · Paper · Reinforcement Learning Large Language Model (LLM) Text-Numeric Graph (TNG) · View paper details

Paper · ICLR 2026 Poster · 2026 · medical LLM agent

Doctor-R1: Mastering Clinical Inquiry with Experiential Agentic Reinforcement Learning

ICLR 2026 Poster accepted paper at ICLR 2026. The professionalism of a human doctor in outpatient service depends on two core abilities: the ability to make accurate medical decisions and the medical consultation skill to conduct strategic, empathetic patient inquiry. Existing Large Language Models (LLMs) have achieved remarkable accuracy on medical decision-making benchmarks. However, they often lack the ability to conduct strategic and empathetic consultation, which is essential for real-world clinical scenarios. To address this gap, we propose Doctor-R1, an AI doctor agent trained to master both capabilities by asking high-yield questions and conducting strategic multi-turn inquiry to guide decision-making.

Medical image computing · Clinical language intelligence · Paper · Doctor Agent Clinical Inquiry Agentic Reinforcement Learning · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

A Hypothesis-Driven Language Agent for Clinical Decision-Making Based on Reinforcement Learning

ICLR 2026 Poster accepted paper at ICLR 2026. Clinical decision-making is a dynamic, interactive, and cyclic process where doctors have to repeatedly decide on which clinical action to perform and consider newly uncovered information for diagnosis and treatment. Large Language Models (LLMs) have the potential to support clinicians in this process; however, most applications of LLMs in clinical decision support suffer from one of two limitations: either they assume the unrealistic scenario of immediate availability of all patient information and do not model the interactive and iterative investigation process, or they restrict themselves to the limited "out-of-the-box" capabilities of large pre-trained models without performing task-specific training. In contrast, we propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM, that converges towards a diagnosis by repeatedly requesting and interpreting relevant tests. Using a hybrid training paradigm combining supervised and reinforcement learning, we train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making. Code/project link: https://github.com/dharouni/LA-CDM

Medical image computing · Clinical language intelligence · Trustworthiness, safety, fairness, and privacy · Paper · Clinical Decision Making Large Language Models · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

From Conversation to Query Execution: A User and Tool Interaction Benchmark for EHR Database Agents

ICLR 2026 Poster accepted paper at ICLR 2026. Despite the impressive performance of LLM-powered agents, their adoption for Electronic Health Record (EHR) data access remains limited by the absence of benchmarks that adequately capture real-world clinical data access flows. In practice, two core challenges hinder deployment: query ambiguity from vague user questions and value mismatch between user terminology and database entries. To address this, we introduce EHR-ChatQA, an interactive database question answering benchmark that evaluates the end-to-end workflow of database agents: clarifying user questions, using tools to resolve value mismatches, and generating correct SQL to deliver accurate answers. To cover diverse patterns of query ambiguity and value mismatch, EHR-ChatQA assesses agents in a simulated environment with an LLM-based user across two interaction flows: Incremental Query Refinement (IncreQA), where users add constraints to existing queries, and Adaptive Query Refinement (AdaptQA), where users adjust their search goals mid-conversation. Code/project link: https://github.com/glee4810/EHR-ChatQA

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · Database Agents LLM Agents · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

Critic-Adviser-Reviser Cyclic Refinement: Towards High-Quality EMR Corpus Generation

ICLR 2026 Poster accepted paper at ICLR 2026. Electronic medical records (EMRs) are vital for healthcare research, but their use is limited by privacy concerns. Synthetic EMR generation offers a promising alternative, yet most existing methods merely imitate real records without adhering to rigorous clinical quality principles. To address this, we introduce LLM-CARe, a stage-wise cyclic refinement framework that progressively improves EMR quality through three stages, each targeting a specific granularity: corpus, section and document. At each stage, a Critic, an Adviser, and a Reviser collaborate iteratively to evaluate, provide feedback, and refine the drafts.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · Large Language Model Synthetic Data Generation · View paper details

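Paper · ICLR 2026 Oral · 2026 · clinical prediction

CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmark for Large Language Models in Mental Health Question Answering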

ICLR 2026 Oral accepted paper at ICLR 2026. Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios. The first component, CounselBench-EVAL, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and online human therapists on patient questions from the public forum CounselChat.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · large language models mental health · View paper details

Paper · ICLR 2026 Poster · 2026 · medical LLM agent

AnesSuite: A Comprehensive Benchmark and Dataset Suite for Anesthesiology Reasoning in LLMs

ICLR 2026 Poster accepted paper at ICLR 2026. The application of large language models (LLMs) in the medical field has garnered significant attention, yet their reasoning capabilities in more specialized domains like anesthesiology remain underexplored. To bridge this gap, we introduce AnesSuite, the first comprehensive dataset suite specifically designed for anesthesiology reasoning in LLMs. The suite features AnesBench, an evaluation benchmark tailored to assess anesthesiology-related reasoning across three levels: factual retrieval (System 1), hybrid reasoning (System 1.x), and complex decision-making (System 2). Alongside this benchmark, the suite includes three training datasets that provide an infrastructure for continued pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). Code/project link: https://github.com/MiliLab/AnesSuite

Medical image computing · Clinical language intelligence · Paper · Large language model Reasoning Anesthesiology · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

LiveClin: A Leakage-Free Live Clinical Benchmark

ICLR 2026 Poster accepted paper at ICLR 2026. The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed to approximate real-world clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI–human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. Code/project link: https://github.com/AQ-MedAI/LiveClin

Medical image computing · Medical multimodal · Clinical language intelligence · Paper · MultiModal Medical Benchmark ICLR 2026 · View paper details

Paper · ICLR 2026 Poster · 2026 · clinical prediction

Knowledgeable Language Models as Black-Box Optimizers for Personalized Medicine

ICLR 2026 Poster accepted paper at ICLR 2026. The goal of personalized medicine is to discover a treatment regimen that optimizes a patient's clinical outcome based on their personal genetic and environmental factors. However, candidate treatments cannot be arbitrarily administered to the patient to assess their efficacy; we often instead have access to an *in silico* surrogate model that approximates the true fitness of a proposed treatment. Unfortunately, such surrogate models have been shown to fail to generalize to previously unseen patient-treatment combinations. We hypothesize that domain-specific prior knowledge—such as medical textbooks and biomedical knowledge graphs—can provide a meaningful alternative signal of the fitness of proposed treatments.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · Large language models Personalized medicine · View paper details

Paper · ICLR 2026 Poster · 2026 · clinical prediction

M3CoTBench: A Chain-of-Thought Benchmark for MLLMs in Medical Image Understanding

ICLR 2026 Poster accepted paper at ICLR 2026. Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis.

Medical image computing · Medical multimodal · Clinical language intelligence · Paper · Chain-of-Thought Multimodal Large Language Models · View paper details

Paper · ICLR 2026 Poster · 2026 · medical LLM agent

KnowGuard: Knowledge-Driven Abstention for Multi-Turn Clinical Reasoning

ICLR 2026 Poster accepted paper at ICLR 2026. In clinical practice, physicians refrain from making decisions when patient information is insufficient. This behavior, known as abstention, is a critical safety mechanism preventing potentially harmful misdiagnoses. Recent investigations have reported the application of large language models (LLMs) in medical scenarios. However, existing LLMs struggle with abstention, frequently providing overconfident responses despite incomplete information. This limitation stems from conventional abstention methods relying solely on model self-assessments, which lack systematic strategies to identify knowledge boundaries with external medical evidence.

Medical image computing · Clinical language intelligence · Paper · multi-agent system · Clinical reasoning · Medical QA · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

Can SAEs Reveal and Mitigate Racial Bias in Healthcare LLMs?

ICLR 2026 Poster accepted paper at ICLR 2026. LLMs are increasingly being used in healthcare. This promises to free physicians from drudgery, enabling better care to be delivered at scale. But the use of LLMs in this space also brings risks; for example, such models may worsen existing biases. How can we spot when LLMs are (spuriously) relying on patient race to inform predictions? In this work we assess the degree to which Sparse Autoencoders (SAEs) can reveal (and control) associations the model has made between race and stigmatizing concepts. We first identify SAE latents in gemma-2 models which appear to correlate with Black individuals.

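Medical image computing · Clinical language intelligence · Trustworthiness, safety, fairness, and privacy · Paper · clinical natural language processing mechanistic interpretability · View paper details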

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

Medical Interpretability and Knowledge Maps of Large Language Models

ICLR 2026 Poster accepted paper at ICLR 2026. We present a systematic study of medical-domain interpretability in Large Language Models (LLMs). We study how the LLMs both represent and process medical knowledge through four different interpretability techniques: (1) UMAP projections of intermediate activations, (2) gradient-based saliency with respect to the model weights, (3) layer lesioning/removal and (4) activation patching. We present knowledge maps of five LLMs which show, at a coarse resolution, where knowledge about patients' ages, medical symptoms, diseases and drugs is stored in the models. In particular for Llama3.3-70B, we find that most medical knowledge is processed in the first half of the model's layers.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · Large Language Models Interpretability · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

NurValues: Evaluating Real-World Nursing Values of Large Language Models in Clinical Contexts

ICLR 2026 Poster accepted paper at ICLR 2026. While LLMs have demonstrated medical knowledge and conversational ability, their deployment in clinical practice raises new risks: patients may place greater trust in LLM-generated responses than in nurses' professional judgments, potentially intensifying nurse–patient conflicts. Such risks highlight the urgent need to evaluate whether LLMs align with the core nursing values upheld by human nurses. This work introduces the first benchmark for nursing value alignment, consisting of five core value dimensions distilled from international nursing codes: _Altruism_, _Human Dignity_, _Integrity_, _Justice_, and _Professionalism_. We define two-level tasks on the benchmark, considering the two characteristics of emerging nurse–patient conflicts.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · Large language models value alignment · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions

ICLR 2026 Poster accepted paper at ICLR 2026. Cancer patients are increasingly turning to large language models (LLMs) for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with patient details. In this paper, we first have three hematology-oncology physicians evaluate cancer-related questions drawn from real patients. While LLM responses are generally accurate, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making.

Medical image computing · Clinical language intelligence · Trustworthiness, safety, fairness, and privacy · Paper · Medical benchmark LLM evaluation · View paper details

Paper · ICLR 2026 Poster · 2026 · trustworthy medical AI

Can LLMs Generate Transferable Representations for Clinical Time-Series Data?

ICLR 2026 Poster accepted paper at ICLR 2026. Deploying clinical ML is slow and brittle: models that work at one hospital often degrade under distribution shifts at the next. In this work, we study a simple question: can large language models (LLMs) create portable patient embeddings, i.e., representations of patients that enable a downstream predictor built on one hospital to be used elsewhere with minimal-to-no retraining and fine-tuning? To do so, we map irregular ICU time series onto concise natural language summaries using a frozen LLM, then embed each summary with a frozen text embedding model to obtain a fixed-length vector capable of serving as input to a variety of downstream predictors.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · Machine Learning for Healthcare ICU Time-series · View paper details

Paper · ICLR 2026 Poster · 2026 · clinical prediction

Learning a Self-Critical Mechanism for Region-Guided Chest X-Ray Report Generation

ICLR 2026 Poster accepted paper at ICLR 2026. Automatic radiology reporting assists radiologists in diagnosing abnormalities in radiology images, where grounding the automatic diagnosis with abnormality locations is important for report interpretability. However, existing supervised-learning methods could lead to learning the superficial statistical correlations between images and reports, lacking multi-faceted reasoning to critique the relevant regions on which radiologists would focus. Recently, self-critical reasoning has been investigated in test-time scaling approaches to alleviate hallucinations of LLMs with increased time complexity. In this work, we focus on chest X-ray report generation with particular focus on clinical accuracy, where self-critical reasoning is alternatively introduced into the model architecture and its training objective, preferred by the real-time automatic reporting system.

Medical image computing · Clinical language intelligence · EHR and clinical prediction · Paper · radiology report generation x-ray report generation · View paper details

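Paper · ICLR 2026 Poster · 2026 · medical multimodal

How Do Medical MLLMs Fail? A Study of Visual Grounding in Medical Images

A systematic study of how medical MLLMs fail at visual grounding in medical images. It proposes the VGMED evaluation dataset and the VGRefine inference-time method, targeting medical visual question answering and medical image interpretation.

Medical multimodal · Medical AI paper · Conference paper · View paper details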

Paper · Nature Medicine · 2025 · clinical LLM

Toward Expert-Level Medical Question Answering with Large Language Models

Nature Medicine paper on Med-PaLM 2 and expert-level medical question answering with large language models.

LLM · Medical QA · Med-PaLM · View paper details

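Calls and collaborations · npj Digital Medicine · Deadline (Beijing time): 2027-04-30 · Journal special issue

npj Digital Medicine Collection: Computational Drug Repurposing in the Era of Multimodal Data and AI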

This Nature Portfolio / npj Digital Medicine collection is open for submissions until 2027-04-30. It invites work at the intersection of computational drug repurposing, multimodal biomedical data, and AI, including omics, EHRs, real-world evidence, imaging, digital phenotyping, LLMs, graph neural networks, multimodal transformers, knowledge graphs, generative AI, causal inference, explainability, and clinical translation.

Medical multimodal · Call for papers · Nature Portfolio npj Digital Medicine drug repurposing multimodal data · View call details