Diagnostic Performance of Publicly Available Large Language Models in Corneal Diseases: A Comparison with Human Specialists.

Journal: Diagnostics (Basel, Switzerland)
Abstract

Background/Objectives: This study evaluated the diagnostic accuracy of seven publicly available large language models (LLMs), namely GPT-3.5, GPT-4o Mini, GPT-4o, Gemini 1.5 Flash, Claude 3.5 Sonnet, Grok3, and DeepSeek R1, in diagnosing corneal diseases and compared their performance with that of human specialists.

Methods: Twenty corneal disease cases from the University of Iowa's EyeRounds were presented to each LLM. Diagnostic accuracy was determined by comparing LLM-generated diagnoses to the confirmed case diagnoses. Four human cornea specialists evaluated the same cases to establish a benchmark and assess interobserver agreement.

Results: Diagnostic accuracy varied significantly among the LLMs (p = 0.001). GPT-4o achieved the highest accuracy (80.0%), followed by Claude 3.5 Sonnet and Grok3 (70.0% each), DeepSeek R1 (65.0%), GPT-3.5 (60.0%), GPT-4o Mini (55.0%), and Gemini 1.5 Flash (30.0%). Human experts averaged 92.5% accuracy, outperforming all LLMs (p < 0.001, Cohen's d = -1.314). GPT-4o showed no significant difference from the human consensus (p = 0.250, κ = 0.348), while Claude 3.5 Sonnet and Grok3 showed fair agreement (κ = 0.219). DeepSeek R1 also performed reasonably well (κ = 0.178), although its agreement with the human consensus was not statistically significant.
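
For reference, the agreement and effect-size statistics reported above correspond to the standard definitions of Cohen's kappa and Cohen's d; this is only a brief sketch, as the abstract does not state the exact estimators or software used:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad d = \frac{\bar{x}_1 - \bar{x}_2}{s_{\mathrm{pooled}}}$$

where $p_o$ is the observed proportion of agreement between a model and the reference diagnoses, $p_e$ the agreement expected by chance, $\bar{x}_1$ and $\bar{x}_2$ the mean accuracies of the two groups being compared, and $s_{\mathrm{pooled}}$ their pooled standard deviation.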

Conclusions: Among the evaluated LLMs, GPT-4o, Claude 3.5 Sonnet, Grok3, and DeepSeek R1 demonstrated promising diagnostic accuracy, with GPT-4o most closely matching human performance. However, performance remained inconsistent, especially in complex cases. LLMs may offer value as diagnostic support tools, but human expertise remains indispensable for clinical decision-making.

Authors
Cheng Jiao, Erik Rosas, Hassan Asadigandomani, Mohammad Delsoz, Yeganeh Madadi, Hina Raja, Wuqaas Munir, Brendan Tamm, Shiva Mehravaran, Ali Djalilian, Siamak Yousefi, Mohammad Soleimani