Using large language models as decision support tools in emergency ophthalmology.
Background: Large language models (LLMs) have shown promise in various medical applications, but their potential as decision support tools in emergency ophthalmology has not yet been evaluated on real-world cases.
Objective: We assessed the performance of state-of-the-art LLMs (GPT-4, GPT-4o, and Llama-3-70b) as decision support tools in emergency ophthalmology, compared with that of human experts.
Methods: In this prospective comparative study, LLM-generated diagnoses and treatment plans for 73 anonymized emergency cases from the University Hospital of Split were evaluated against those determined by certified ophthalmologists. Two independent expert ophthalmologists graded both the LLM-generated and the human-generated reports on a 4-point Likert scale.
Results: Human experts achieved a mean score of 3.72 (SD = 0.50), while GPT-4 scored 3.52 (SD = 0.64) and Llama-3-70b scored 3.48 (SD = 0.48). GPT-4o performed lower, at 3.20 (SD = 0.81). A significant difference was found between human and LLM reports (P < 0.001), driven specifically by the gap between human and GPT-4o scores; GPT-4 and Llama-3-70b showed performance comparable to the ophthalmologists, with no statistically significant differences.
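The abstract does not name the statistical tests used; as a hedged illustration of how ordinal Likert-scale ratings like these are commonly compared, the sketch below runs a Kruskal-Wallis omnibus test followed by Bonferroni-corrected pairwise Mann-Whitney U comparisons. The group labels and scores are synthetic placeholders for illustration, not the study's data or its actual analysis.

```python
# Illustrative sketch only: the abstract does not specify the tests used.
# Assumes a Kruskal-Wallis omnibus test with pairwise Mann-Whitney U
# post-hoc comparisons, a common choice for ordinal Likert-scale ratings.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic placeholder ratings on a 4-point Likert scale (NOT the study data).
groups = {
    "human":       rng.integers(3, 5, size=73),
    "gpt-4":       rng.integers(2, 5, size=73),
    "gpt-4o":      rng.integers(2, 5, size=73),
    "llama-3-70b": rng.integers(2, 5, size=73),
}

# Omnibus test across all report sources.
h_stat, p_omnibus = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, P = {p_omnibus:.4f}")

# Each model vs. the human reports, Bonferroni-corrected for three comparisons.
n_comparisons = len(groups) - 1
for name, scores in groups.items():
    if name == "human":
        continue
    u_stat, p = stats.mannwhitneyu(groups["human"], scores, alternative="two-sided")
    print(f"human vs {name}: U = {u_stat:.1f}, "
          f"corrected P = {min(p * n_comparisons, 1.0):.4f}")
```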
Conclusions: The best-performing large language models demonstrated accuracy comparable to that of human experts as decision support tools in emergency ophthalmology, suggesting potential for integration into clinical practice.