Artificial intelligence chatbots in transfusion medicine: A cross-sectional study.

Journal: Vox Sanguinis
Published:
Abstract

Objective: The recent rise of artificial intelligence (AI) chatbots has attracted many users worldwide. However, expert evaluation is essential before relying on them for transfusion medicine (TM)-related information. This study aims to evaluate the performance of AI chatbots in terms of accuracy, correctness, completeness and safety.

Methods: Six AI chatbots (ChatGPT 4, ChatGPT 4-o, Google Gemini Advanced, Microsoft Copilot, Anthropic Claude 3.5 Sonnet and Meta AI) were tested with TM-related prompts at two time points, 30 days apart. Their responses were assessed by four TM experts, and the evaluators' scores underwent inter-rater reliability testing. Responses from Day 30 were compared with those from Day 1 to evaluate consistency and potential evolution over time.
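
The abstract does not state which agreement statistic was applied to the four evaluators' scores. As a rough illustration only, the sketch below (in Python, assuming ordinal ratings and Fleiss' kappa via statsmodels, neither of which is confirmed by the study) shows how such an inter-rater reliability check could be run:

    # Hypothetical illustration: agreement among four expert raters.
    # The ratings, scale and choice of Fleiss' kappa are assumptions, not the study's method.
    import numpy as np
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Rows = chatbot responses (subjects), columns = the four TM expert raters;
    # values are illustrative ordinal scores (e.g., 1 = unsafe/incorrect ... 4 = correct and complete).
    ratings = np.array([
        [4, 4, 3, 4],
        [2, 2, 3, 2],
        [4, 3, 4, 4],
        [1, 2, 1, 1],
    ])

    # Collapse rater-level scores into a subjects-by-categories count table,
    # then compute Fleiss' kappa across the four raters.
    counts, _ = aggregate_raters(ratings)
    print(f"Fleiss' kappa: {fleiss_kappa(counts, method='fleiss'):.2f}")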

Results: All six chatbots exhibited some level of inconsistency and varying degrees of evolution in their responses over 30 days. None provided entirely correct, complete or safe answers to all questions. Among the chatbots tested, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet demonstrated the highest accuracy and consistency, while Microsoft Copilot and Google Gemini Advanced showed the greatest evolution in their responses. As a limitation, the 30-day period may be too short for a precise assessment of chatbot evolution.

Conclusions: At the time this study was conducted, none of the AI chatbots provided fully reliable, complete or safe responses to all TM-related prompts. However, ChatGPT 4-o and Anthropic Claude 3.5 Sonnet showed the most promise for future integration into TM practice. Given their variability and ongoing development, AI chatbots should not yet be relied upon as authoritative sources in TM without expert validation.

Authors
Prateek Srivastava, Ashish Tewari, Arwa Al Riyami