A critical evaluation of generative query expansion on biomedical literature retrieval.

Submitted by yip4002 on June 5, 2026 - 4:46pm

Title	A critical evaluation of generative query expansion on biomedical literature retrieval.
Publication Type	Journal Article
Year of Publication	2026
Authors	Fang Y, Zhang G, Chen F, Peng Y, Weng C
Journal	J Am Med Inform Assoc
Volume	33
Issue	6
Pagination	1121-1133
Date Published	2026 Jun 01
ISSN	1527-974X
Keywords	Algorithms, Humans, Information Storage and Retrieval, Natural Language Processing
Abstract	OBJECTIVE: To evaluate the effectiveness of generative query expansion for biomedical literature retrieval. MATERIALS AND METHODS: We thoroughly examined eight generative query expansion methods using three large language models across five datasets for biomedical literature retrieval. We further performed a quantitative analysis, including performance comparisons, rank transition analysis, and article-type effect analysis. We also conducted a qualitative examination of representative cases, from which we derived an error taxonomy. RESULTS: On BioASQ-Y/N, GPT-4o-based query expansion shifts Recall@10 to 0.417-0.512 and nDCG@10 to 0.358-0.479, relative to a baseline of 0.491 and 0.456. For PubMedQA, Precision@1 ranges from 0.764 to 0.876 and nDCG@10 from 0.847 to 0.931, compared with baseline values of 0.893 and 0.935. For 2019-TREC-PM, query expansion yields Recall@100 of 0.217-0.256 and nDCG@100 of 0.272-0.312, versus a baseline of 0.227 and 0.274. Similarly, for 2018-TREC-PM, Recall@100 spans 0.169-0.227 and nDCG@100 spans 0.195-0.250, relative to baseline scores of 0.164 and 0.191. For 2017-TREC-PM, Recall@100 and nDCG@100 fall within 0.111-0.139 and 0.154-0.191 under query expansion, compared with baseline metrics of 0.102 and 0.147. Both general-purpose and domain-specific Llama-based models demonstrate similar performance to GPT-4o. DISCUSSION AND CONCLUSION: The impact of query expansion varies significantly with the expansion method and the type of evidence, but is relatively agnostic to the choice of backbone model. Notably, query expansion primarily affects article ranking but has a limited impact on the screening stage. Our findings underscore the unique challenges of biomedical literature retrieval and highlight the need to develop domain-specific information retrieval techniques.
DOI	10.1093/jamia/ocag037
Alternate Journal	J Am Med Inform Assoc
PubMed ID	41921511
PubMed Central ID	PMC13197180
Grant List	UL1TR001873 and UL1TR002384 / / National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) / R01LM014344, R01LM014573, and T15LM007079 / LM / NLM NIH HHS / United States / TR / NCATS NIH HHS / United States; UL1TR001873 / NH / NIH HHS / United States; UL1TR002384 / NH / NIH HHS / United States; R01LM014344 / LM / NLM NIH HHS / United States; R01LM014573 / LM / NLM NIH HHS / United States; T15LM007079 / LM / NLM NIH HHS / United States
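
Note on the metrics in the abstract: Precision@k, Recall@k, and nDCG@k are standard ranked-retrieval measures. The sketch below shows one common way to compute them, assuming binary relevance judgments (a document is either relevant or not). The function names and the toy document IDs are illustrative only; the record does not say which evaluation code or relevance grading the authors used, and some benchmarks use graded relevance instead.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranked[:k] if doc in relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG@k with binary gains (assumption: gain 1 if relevant, else 0)."""
    # Discounted gain: a hit at 0-based position i contributes 1 / log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    # Ideal ranking places every relevant document at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example with hypothetical document IDs.
ranked_docs = ["d3", "d1", "d7", "d2"]
relevant_docs = {"d1", "d2", "d9"}

p1 = precision_at_k(ranked_docs, relevant_docs, 1)
r3 = recall_at_k(ranked_docs, relevant_docs, 3)
n4 = ndcg_at_k(ranked_docs, relevant_docs, 4)
```

Under these definitions, a query expansion method can leave Recall@k unchanged while still moving nDCG@k, because nDCG is sensitive to where relevant articles sit within the top k rather than only whether they are present.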