Comparative performance of artificial intelligence-based large language models on the Orthopedic In-Training Examination.

Journal: Journal of Orthopaedic Surgery (Hong Kong)
Abstract

Background: Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown.

Methods: Three LLMs, OpenAI's GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions.

Results: GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, p = .498, and p = .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated (p = .003 and p < .001, respectively) and higher-order questions (p < .001). Across the 11 subject categories, all models performed similarly regardless of subject matter. When individual LLM performance on higher-order questions was assessed, no significant differences were found compared to performance on first-order questions (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when individual model performance was assessed on image-associated questions, only GPT-3.5 performed significantly worse compared to its performance on non-image-associated questions (p = .045).

Conclusions: The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum passing score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard.

Authors
Andrew Xu, Manjot Singh, Mariah Balmaceno-Criss, Allison Oh, David Leigh, Mohammad Daher, Daniel Alsoof, Christopher McDonald, Bassel Diebo, Alan Daniels