Textual Proficiency and Visual Deficiency: A Comparative Study of Large Language Models and Radiologists in MRI Artifact Detection and Correction.
Objective: To compare the performance of Large Language Models (LLMs) and radiologists in detecting and correcting MRI artifacts, using both text-based and visual questions.
Methods: This cross-sectional observational study comprised three phases. In Phase 1, six LLMs (ChatGPT o1-preview, ChatGPT-4o, ChatGPT-4V, Google Gemini 1.5 Pro, Claude 3.5 Sonnet, Claude 3 Opus) and five radiologists (two residents, two junior radiologists, one senior radiologist) answered 42 text-based questions on MRI artifacts. In Phase 2, the same radiologists and five multimodal LLMs evaluated 100 MRI images, each containing a single artifact. In Phase 3, the identical tasks were repeated 1.5 months later (Round 2) to assess temporal consistency. Responses were graded on 4-point Likert scales: a "Management Score" for text-based answers and a "Correction Score" for visual answers. McNemar's test was used to compare response accuracy, and the Wilcoxon test to assess score differences.
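As a rough illustration of the paired comparisons described above (not the authors' actual analysis code), the sketch below shows how accuracy could be compared with McNemar's test and Likert scores with a Wilcoxon test. It assumes the paired signed-rank variant of the Wilcoxon test, the statsmodels and SciPy implementations, and uses synthetic placeholder data in place of study results.

```python
# Minimal sketch of the paired statistical comparisons, assuming per-question
# correctness (0/1) and 4-point Likert scores are available for one radiologist
# and one LLM on the same 42 text-based questions. All data are illustrative
# placeholders, not study results.
import numpy as np
from scipy.stats import wilcoxon
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)

# Paired correctness on the same questions (hypothetical values).
radiologist_correct = rng.integers(0, 2, size=42)
llm_correct = rng.integers(0, 2, size=42)

# McNemar's test operates on a 2x2 table of paired (radiologist, LLM) outcomes.
table = np.zeros((2, 2), dtype=int)
for r, l in zip(radiologist_correct, llm_correct):
    table[r, l] += 1
print(mcnemar(table, exact=True))  # compares paired accuracy

# Wilcoxon signed-rank test on paired 4-point Likert "Management Scores"
# (assumed analysis; the abstract specifies only "the Wilcoxon test").
radiologist_scores = rng.integers(1, 5, size=42)
llm_scores = rng.integers(1, 5, size=42)
print(wilcoxon(radiologist_scores, llm_scores))
```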
Results: LLMs outperformed radiologists on text-based tasks, with ChatGPT o1-preview achieving the highest Management Scores (3.71±0.60 in Round 1; 3.76±0.84 in Round 2; p<0.05). On visual tasks, radiologists performed significantly better: the senior radiologist achieved 92% and 94% accuracy in Rounds 1 and 2, respectively, whereas the top-performing LLM (ChatGPT-4o) achieved only 20% and 18% (p<0.05). Correction Scores mirrored this gap, with radiologists consistently scoring higher than LLMs (p<0.05).
Conclusions: LLMs excel at text-based tasks but show marked limitations in visual artifact interpretation, making them unsuitable for independent diagnostic use. They hold promise as educational tools or as adjuncts in "human-in-the-loop" systems, but improvements in multimodal AI are needed to close this gap.