Advancing medical AI: GPT-4 and GPT-4o surpass GPT-3.5 in Taiwanese medical licensing exams.

Journal: PLOS ONE
Published:
Abstract

Background: Chat Generative Pre-Trained Transformer (ChatGPT), launched by OpenAI in November 2022, is built on advanced large language models optimized for dialogue. However, the performance differences among GPT-3.5, GPT-4, and GPT-4o in medical contexts remain unclear.

Objective: This study evaluates the accuracy of GPT-3.5, GPT-4, and GPT-4o across various medical subjects. GPT-4o's performance in Chinese and English was also analyzed.

Methods: We retrospectively compared the performance of GPT-3.5, GPT-4, and GPT-4o on Stage 1 of the Taiwanese Senior Professional and Technical Examinations for Medical Doctors (SPTEMD) administered from July 2021 to February 2024, excluding image-based questions.
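As a rough illustration of this evaluation setup, the sketch below submits each text-only multiple-choice item to the three models via the OpenAI Python SDK and scores the replies against the official answer key. The paper does not publish its querying code, so the model identifiers, prompt wording, and answer-parsing logic here are assumptions, not the authors' method.

```python
# Illustrative sketch only: prompt wording, model names, and parsing are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

MODELS = ["gpt-3.5-turbo", "gpt-4", "gpt-4o"]  # assumed API names for the three models


def ask(model: str, question: str, options: str) -> str:
    """Send one multiple-choice exam item and return the model's raw reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with the single letter of the best option."},
            {"role": "user", "content": f"{question}\n{options}"},
        ],
        temperature=0,  # keep the output as deterministic as possible for grading
    )
    return response.choices[0].message.content.strip()


def accuracy(model: str, items: list[dict]) -> float:
    """Fraction of items where the model's first letter matches the official key."""
    correct = sum(ask(model, it["question"], it["options"])[:1] == it["answer"] for it in items)
    return correct / len(items)
```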

Results: The overall accuracy rates of GPT-3.5, GPT-4, and GPT-4o were 65.74% (781/1188), 95.71% (1137/1188), and 96.72% (1149/1188), respectively. GPT-4 and GPT-4o outperformed GPT-3.5 across all subjects. Statistical analysis revealed a significant difference between GPT-3.5 and the other models (p < 0.05) but no significant difference between GPT-4 and GPT-4o. Among subjects, physiology had a significantly higher error rate (p < 0.05) than the overall average across all three models. GPT-4o's accuracy rates in Chinese (98.14%) and English (98.48%) did not differ significantly.
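The abstract reports significance (p < 0.05) without naming the test, so the following is only an illustrative re-check of the pairwise comparisons using a chi-square test of independence on the published correct/incorrect counts; the paper's own analysis may have used a different (e.g., paired) procedure.

```python
# Illustrative only: reproduces the direction of the reported comparisons from the
# published counts; not necessarily the statistical test used in the paper.
from scipy.stats import chi2_contingency

TOTAL = 1188
counts = {"gpt-3.5": 781, "gpt-4": 1137, "gpt-4o": 1149}  # correct answers per model


def compare(model_a: str, model_b: str) -> float:
    """Chi-square test of independence on the 2x2 correct/incorrect table."""
    table = [
        [counts[model_a], TOTAL - counts[model_a]],
        [counts[model_b], TOTAL - counts[model_b]],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


print(compare("gpt-3.5", "gpt-4"))  # far below 0.05: GPT-4 significantly more accurate
print(compare("gpt-4", "gpt-4o"))   # well above 0.05: no significant difference
```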

Conclusions: GPT-4 and GPT-4o exceed the accuracy threshold of the Taiwanese SPTEMD, demonstrating advances in contextual comprehension and reasoning. Future research should focus on the responsible integration of these models into medical training and assessment.
