Assessing Large Language Models for Medical Question Answering in Portuguese: Open-Source Versus Closed-Source Approaches.

Journal: Cureus
Abstract

Large language models (LLMs) show promise in medical knowledge assessment. This study benchmarked a closed-source LLM (GPT-4o, OpenAI, San Francisco, CA) and an open-source LLM (LLaMA 3.1 405B, Meta AI, Menlo Park, CA) on 148 multiple-choice questions from the 2023 Portuguese National Residency Access Examination, spanning five clinical domains. Using five distinct prompting strategies, each model provided single-best-answer predictions. GPT-4o consistently outperformed LLaMA 3.1 by 7-11% in accuracy across all prompts. Chain-of-thought prompting yielded the highest numerical accuracy for GPT-4o, though the improvement over simpler prompts was not statistically significant in post-hoc analyses, and it offered minimal benefit when applied to LLaMA 3.1. Both models performed best on pediatrics questions and least accurately on surgery and psychiatry questions. Bias assessment indicated that GPT-4o's answer choices aligned well with the distribution of correct answers, unlike those of LLaMA 3.1, which showed prompt-dependent skew. Closed-source models currently demonstrate higher accuracy on Portuguese medical questions, likely owing to more extensive training. However, open-source models remain valuable where data control matters, though domain-focused fine-tuning may be needed for optimal performance in high-stakes applications.