A multi-agent large language model framework to automatically assess performance of a clinical AI Triage tool.

Title	A multi-agent large language model framework to automatically assess performance of a clinical AI Triage tool.
Publication Type	Journal Article
Year of Publication	2026
Authors	Flanders AE, Peng Y, Prevedello L, Ball R, Colak E, Menon P, Shih G, Lin H-M, Lakhani P
Journal	Npj Health Syst
Volume	3
Date Published	2026
ISSN	3005-1959
Abstract	Radiology reports can be used as a surrogate for the performance of clinical AI tools. Radiology reports were analyzed by an ensemble of eight open-source LLMs and an internal version of GPT-4o using a single multi-shot prompt that assessed for the presence of ICH. The performance of the open-source models, the consensus of models, and GPT-4o was compared to human report review. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. The capability of each LLM varied. The highest AUC was achieved with llama3.3:70b and GPT-4o. By Matthews correlation coefficient (MCC), the ideal combinations of LLMs were the Full-9 Ensemble, the Top-3 Ensemble, and consensus. No statistically significant differences were observed between Top-3, Full-9, and consensus. An ensemble of open-source LLMs provides a more consistent and reliable method than a single LLM alone for deriving ground truth for the retrospective evaluation of a clinical AI triage tool.
DOI	10.1038/s44401-026-00100-4
Alternate Journal	Npj Health Syst
PubMed ID	42245913
PubMed Central ID	PMC13233039
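
Illustrative note: the abstract describes combining per-report labels from several LLMs into a consensus and scoring that consensus against human review with MCC. The record does not say how the consensus was formed, so the following is only a minimal sketch that assumes a simple majority vote over binary labels (1 = ICH present, 0 = absent). The model names, labels, and helper functions are hypothetical and are not taken from the paper.

```python
from math import sqrt


def majority_vote(votes):
    """Return 1 if strictly more than half of the binary votes are 1, else 0.

    Assumption: ties (possible only with an even number of voters) resolve to 0.
    """
    return 1 if sum(votes) * 2 > len(votes) else 0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0/1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal total is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data: six reports, human reference labels, three models.
human = [1, 0, 1, 0, 1, 0]
model_votes = {
    "model_a": [1, 0, 1, 0, 0, 0],
    "model_b": [1, 0, 1, 1, 1, 0],
    "model_c": [1, 1, 0, 0, 1, 0],
}

# Consensus label per report: majority across the models' votes for that report.
consensus = [majority_vote(v) for v in zip(*model_votes.values())]

if __name__ == "__main__":
    for name, labels in model_votes.items():
        print(f"{name}: MCC = {mcc(human, labels):.3f}")
    print(f"consensus: MCC = {mcc(human, consensus):.3f}")
```

In this toy data each individual model disagrees with the human label on at least one report, while the majority vote matches it on all six, which is the kind of effect an ensemble is meant to exploit. Real results depend on how correlated the models' errors are.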