Diagnostic performance of multimodal large language models in radiological quiz cases: the effects of prompt engineering and input conditions.

Journal: Ultrasonography (Seoul, Korea)
Abstract

Objective: This study aimed to evaluate the diagnostic accuracy of three multimodal large language models (LLMs) in radiological image interpretation and to assess the impact of prompt engineering strategies and input conditions.

Methods: This study analyzed 67 radiological quiz cases from the Korean Society of Ultrasound in Medicine. Three multimodal LLMs (Claude 3.5 Sonnet, GPT-4o, and Gemini-1.5-Pro-002) were evaluated using six types of prompts (basic [without system prompt], original [specific instructions], chain-of-thought, reflection, multiagent, and artificial intelligence [AI]-generated). Performance was assessed across various factors, including tumor versus non-tumor status, case rarity, difficulty, and knowledge cutoff dates. A subgroup analysis compared diagnostic accuracy between imaging-only inputs and combined imaging-descriptive text inputs.
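As a rough illustration of the study design, the sketch below enumerates the evaluation grid implied by the Methods. It assumes every model was run with every prompt type on every case (a full factorial design), which is consistent with the 402 per-model denominators reported in the Results but is not stated explicitly. The function and variable names are hypothetical.

```python
from itertools import product

# Models and prompt types as named in the Methods.
MODELS = ["Claude 3.5 Sonnet", "GPT-4o", "Gemini-1.5-Pro-002"]
PROMPTS = [
    "basic",
    "original",
    "chain-of-thought",
    "reflection",
    "multiagent",
    "AI-generated",
]
N_CASES = 67  # radiological quiz cases


def evaluation_grid(models, prompts, n_cases):
    """Return every (model, prompt, case_id) combination.

    Assumption: a full factorial design, i.e. each model sees
    each case once under each prompt type.
    """
    return list(product(models, prompts, range(1, n_cases + 1)))


grid = evaluation_grid(MODELS, PROMPTS, N_CASES)

# Responses per model (6 prompts x 67 cases).
per_model = len(PROMPTS) * N_CASES

# Responses per prompt type when the three models are pooled.
per_prompt_all_models = len(MODELS) * N_CASES

print(len(grid), per_model, per_prompt_all_models)
```

Under this assumption, each model contributes 402 responses, and each prompt type contributes 201 responses when the three models are pooled.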

Results: With imaging-only inputs, Claude 3.5 Sonnet achieved the highest overall accuracy (46.3%, 186/402), followed by GPT-4o (43.5%, 175/402) and Gemini-1.5-Pro-002 (39.8%, 160/402). AI-generated prompts yielded superior combined accuracy across all three models, with gains of 5.5 percentage points over the basic prompt (P=0.035), 4.0 percentage points over the chain-of-thought prompt (P=0.169), and 3.5 percentage points over the multiagent prompt (P=0.248); only the gain over the basic prompt reached P<0.05. The integration of descriptive text significantly enhanced diagnostic accuracy for Claude 3.5 Sonnet (46.3% to 66.2%, P<0.001), GPT-4o (43.5% to 57.5%, P<0.001), and Gemini-1.5-Pro-002 (39.8% to 60.4%, P<0.001). Model performance was significantly influenced by case rarity for GPT-4o (rare: 6.7% vs. non-rare: 53.9%, P=0.001) and by knowledge cutoff dates for Claude 3.5 Sonnet (post-cutoff: 23.5% vs. pre-cutoff: 64.0%, P=0.005).
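The reported percentages can be checked against the raw counts with simple arithmetic. The sketch below recomputes the imaging-only accuracies from the counts given above and derives the absolute gain, in percentage points, from adding descriptive text. It uses only figures from the Results; the helper names are hypothetical, and it does not reproduce the P values, since the abstract does not name the statistical test used.

```python
# Correct responses out of 402 per model, imaging-only inputs (from the Results).
CORRECT_IMAGE_ONLY = {
    "Claude 3.5 Sonnet": 186,
    "GPT-4o": 175,
    "Gemini-1.5-Pro-002": 160,
}
TOTAL = 402

# Reported accuracy (%) with imaging plus descriptive text (from the Results).
ACC_WITH_TEXT = {
    "Claude 3.5 Sonnet": 66.2,
    "GPT-4o": 57.5,
    "Gemini-1.5-Pro-002": 60.4,
}


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)


def gain_points(before_pct, after_pct):
    """Absolute change in percentage points, rounded to one decimal place."""
    return round(after_pct - before_pct, 1)


image_only = {m: accuracy_pct(c, TOTAL) for m, c in CORRECT_IMAGE_ONLY.items()}
gains = {m: gain_points(image_only[m], ACC_WITH_TEXT[m]) for m in image_only}

for model in image_only:
    print(f"{model}: {image_only[model]}% imaging-only, +{gains[model]} points with text")
```

The recomputed accuracies match the reported 46.3%, 43.5%, and 39.8%, and the gains from descriptive text work out to roughly 19.9, 14.0, and 20.6 percentage points, respectively.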

Conclusions: Claude 3.5 Sonnet achieved the highest diagnostic accuracy in radiological quiz cases, followed by GPT-4o and Gemini-1.5-Pro-002. The use of AI-generated prompts and the integration of descriptive text inputs enhanced model performance.

Authors