Evaluating large language models on medical evidence summarization.

Title: Evaluating large language models on medical evidence summarization.
Publication Type: Journal Article
Year of Publication: 2023
Authors: Tang L, Sun Z, Idnay B, Nestor JG, Soroush A, Elias PA, Xu Z, Ding Y, Durrett G, Rousseau JF, Weng C, Peng Y
Journal: NPJ Digit Med
Volume: 6
Issue: 1
Pagination: 158
Date Published: 2023 Aug 24
ISSN: 2398-6352
Abstract

Recent advances in large language models (LLMs) have demonstrated remarkable zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs can be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify salient information and are more error-prone when summarizing over longer textual contexts.

DOI: 10.1038/s41746-023-00896-7
Alternate Journal: NPJ Digit Med
PubMed ID: 37620423
PubMed Central ID: PMC10449915
Grant List: R01 LM014306 / LM / NLM NIH HHS / United States
R00 LM013001 / LM / NLM NIH HHS / United States
R01 LM009886 / LM / NLM NIH HHS / United States
KL2 TR001874 / TR / NCATS NIH HHS / United States
P30 CA013696 / CA / NCI NIH HHS / United States