Paper | ICLR 2026 Poster | 2026 | clinical NLP | Multi-Image Medical Thinking
Accepted as a poster at ICLR 2026. Large language models perform well on many medical QA benchmarks, but real clinical reasoning is harder because diagnosis often requires integrating evidence across multiple images rather than interpreting a single view. We introduce MedThinkVQA, an expert-annotated benchmark for thinking with multiple images, in which models must interpret each image, combine cross-view evidence, and solve diagnostic questions under intermediate supervision and step-level evaluation. The dataset contains 10,067 cases, including 720 test cases, with an average of 6.68 images per case, substantially denser than prior work (earlier maxima $\leq$ 1.43). On the test set, the best closed-source models (Claude-4.6-opus, Gemini-3-pro, and GPT-5.2-xhigh) achieve only 54.9% to 57.2% accuracy, while the smaller proprietary variants GPT-5-mini and GPT-5-nano drop to 39.7% and 30.8%, respectively.
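Paper | ICLR 2025 Poster | 2025 | Medical LLMs and Agents | KGARevion: An AI Agent for Knowledge-Intensive Biomedical QA
ICLR 2025 Poster. This paper proposes KGARevion, a knowledge-graph-enhanced AI agent for knowledge-intensive biomedical question answering. It supports medical reasoning and medical QA by generating, verifying, and filtering knowledge triples relevant to the question.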
Data resource | Chinese community medical questions and answers | Chinese medical QA dataset | Updated cMedQA dataset; see official repository | Open access | cMedQA2: Chinese Community Medical QA Dataset
cMedQA2 is an updated Chinese community medical question answering dataset for question-answer matching and medical QA research. It is useful for training and evaluating Chinese medical retrieval, ranking, and answer selection models.
Data resource | Chinese conversational medical QA text | Chinese medical conversational QA dataset | Large-scale Chinese medical CQA dataset; see official repository | Open access | CMCQA: Chinese Medical Conversational QA Dataset
CMCQA is a large Chinese medical conversational question-answering dataset released with knowledge-grounded medical dialogue research. It supports medical conversation QA, knowledge-grounded response generation, and evaluation of Chinese medical dialogue systems.
Data resource | Chinese medical question-answer text | Chinese medical QA corpus | About 26 million medical QA pairs | Open access | Huatuo-26M: Large-Scale Chinese Medical QA Dataset
Huatuo-26M is a large-scale Chinese medical question-answering dataset with about 26 million QA pairs collected for medical language modeling and medical dialogue research. It is suitable for Chinese medical LLM pretraining, fine-tuning, and QA system development.
Data resource | Medical exam question-answer text | Medical exam QA benchmark | USMLE, Mainland China, and Taiwan exam-style QA splits; see repository | Open access | MedQA: Medical Exam QA Dataset with US, Mainland China, and Taiwan Splits
MedQA is a medical examination question answering benchmark with English and Chinese medical licensing-style question sets, including mainland China and Taiwan variants. It is widely used for medical QA and medical reasoning evaluation.
Data resource | Chinese medical exam and QA text | Chinese medical LLM evaluation benchmark | Multiple Chinese medical exam and benchmark splits; see Hugging Face card | Open access | CMB: Chinese Medical Benchmark
CMB is a comprehensive Chinese medical benchmark for evaluating medical large language models on medical exams, reasoning, and clinical knowledge questions. It is suited for Chinese medical QA, LLM evaluation, and instruction-following assessment.
Data resource | Text | LLM benchmark | Benchmark and leaderboard | Open access | MedHELM: Medical LLM Evaluation Benchmark
MedHELM is a medical LLM benchmark and leaderboard intended to broaden evaluation coverage beyond single medical QA datasets.
Calls and collaboration | NLPCC 2026 | Deadline: 2026-05-26 (Beijing time) | Conference call for papers | NLPCC 2026 Call for Papers
CCF-Deadlines lists NLPCC 2026 with papers due 2026-05-26 UTC+8 and conference dates 2026-11-03 to 2026-11-05 in Macau. NLPCC is relevant to Chinese clinical NLP, Chinese biomedical language resources, medical text mining, and healthcare question answering.