A critical evaluation of generative query expansion on biomedical literature retrieval.

Title: A critical evaluation of generative query expansion on biomedical literature retrieval.
Publication Type: Journal Article
Year of Publication: 2026
Authors: Fang Y, Zhang G, Chen F, Peng Y, Weng C
Journal: J Am Med Inform Assoc
Volume: 33
Issue: 6
Pagination: 1121-1133
Date Published: 2026 Jun 01
ISSN: 1527-974X
Keywords: Algorithms, Humans, Information Storage and Retrieval, Natural Language Processing
Abstract

OBJECTIVE: To evaluate the effectiveness of generative query expansion for biomedical literature retrieval.

MATERIALS AND METHODS: We thoroughly examined eight generative query expansion methods using three large language models across five datasets for biomedical literature retrieval. We further performed a quantitative analysis, including performance comparisons, rank transition analysis, and article-type effect analysis. We also conducted a qualitative examination of representative cases, from which we derived an error taxonomy.

RESULTS: On BioASQ-Y/N, GPT-4o-based query expansion shifts Recall@10 to 0.417-0.512 and nDCG@10 to 0.358-0.479, relative to a baseline of 0.491 and 0.456. For PubMedQA, Precision@1 ranges from 0.764 to 0.876 and nDCG@10 from 0.847 to 0.931, compared with baseline values of 0.893 and 0.935. For 2019-TREC-PM, query expansion yields Recall@100 of 0.217-0.256 and nDCG@100 of 0.272-0.312, versus a baseline of 0.227 and 0.274. Similarly, for 2018-TREC-PM, Recall@100 spans 0.169-0.227 and nDCG@100 spans 0.195-0.250, relative to baseline scores of 0.164 and 0.191. For 2017-TREC-PM, Recall@100 and nDCG@100 fall within 0.111-0.139 and 0.154-0.191 under query expansion, compared with baseline metrics of 0.102 and 0.147. Both general-purpose and domain-specific Llama-based models demonstrate similar performance to GPT-4o.
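For readers unfamiliar with the metrics named above, the following is a minimal sketch of Precision@k, Recall@k, and nDCG@k. It assumes binary relevance judgments and a log2 position discount; the abstract does not state which nDCG variant or evaluation tool the authors used, so this may differ from their setup. The function names and the toy ranking are hypothetical.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    top_k = ranked_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance nDCG@k (assumed variant: gain 1, log2 discount)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    # Ideal ranking places every relevant document first.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy example: a ranking of four documents, three relevant overall.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

print(precision_at_k(ranked, relevant, 1))
print(recall_at_k(ranked, relevant, 4))
print(ndcg_at_k(ranked, relevant, 4))
```

Recall@k only asks whether relevant articles reach the top k, while nDCG@k also rewards placing them higher, which is why the two can move differently under query expansion.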

DISCUSSION AND CONCLUSION: The impact of query expansion varies significantly by the expansion methods and type of evidence, but is relatively agnostic to backbone model choice. Notably, query expansion primarily affects article ranking but has a limited impact on the screening stage. Our findings underscore the unique challenges of biomedical literature retrieval and highlight the need to develop domain-specific information retrieval techniques.
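As a rough illustration of how uneven the effect is, the GPT-4o ranges and baselines quoted in the RESULTS paragraph can be compared directly. The sketch below only restates those reported numbers; the `classify` helper and the three labels are hypothetical, and reading each range as the spread across expansion methods is an assumption, since the abstract does not say so explicitly.

```python
# (dataset, metric) -> (lowest with expansion, highest with expansion, baseline),
# copied from the RESULTS paragraph of the abstract.
REPORTED = {
    ("BioASQ-Y/N", "Recall@10"): (0.417, 0.512, 0.491),
    ("BioASQ-Y/N", "nDCG@10"): (0.358, 0.479, 0.456),
    ("PubMedQA", "Precision@1"): (0.764, 0.876, 0.893),
    ("PubMedQA", "nDCG@10"): (0.847, 0.931, 0.935),
    ("2019-TREC-PM", "Recall@100"): (0.217, 0.256, 0.227),
    ("2019-TREC-PM", "nDCG@100"): (0.272, 0.312, 0.274),
    ("2018-TREC-PM", "Recall@100"): (0.169, 0.227, 0.164),
    ("2018-TREC-PM", "nDCG@100"): (0.195, 0.250, 0.191),
    ("2017-TREC-PM", "Recall@100"): (0.111, 0.139, 0.102),
    ("2017-TREC-PM", "nDCG@100"): (0.154, 0.191, 0.147),
}


def classify(low, high, baseline):
    """Label a reported range by where it sits relative to the baseline."""
    if low > baseline:
        return "above baseline"
    if high < baseline:
        return "below baseline"
    return "straddles baseline"


summary = {key: classify(*values) for key, values in REPORTED.items()}

for (dataset, metric), verdict in summary.items():
    print(f"{dataset:13s} {metric:11s} {verdict}")
```

Read this way, the numbers give a mixed picture: some dataset and metric pairs sit entirely on one side of the baseline, while others include results on both sides.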

DOI: 10.1093/jamia/ocag037
Alternate Journal: J Am Med Inform Assoc
PubMed ID: 41921511
PubMed Central ID: PMC13197180
Grant List:
UL1TR001873 and UL1TR002384 / / National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH) /
R01LM014344, R01LM014573, and T15LM007079 / LM / NLM NIH HHS / United States
/ TR / NCATS NIH HHS / United States
UL1TR001873 / NH / NIH HHS / United States
UL1TR002384 / NH / NIH HHS / United States
R01LM014344 / LM / NLM NIH HHS / United States
R01LM014573 / LM / NLM NIH HHS / United States
T15LM007079 / LM / NLM NIH HHS / United States