Assessment of a zero-shot large language model in measuring documented goals-of-care discussions.

Journal: medRxiv: The Preprint Server for Health Sciences
Abstract

Goals-of-care (GOC) discussions and their documentation are an important process measure in palliative care. However, existing natural language processing (NLP) models for identifying GOC documentation require costly, task-specific training data that do not transfer to other constructs of interest. Newer large language models (LLMs) hold promise for measuring linguistically complex constructs with little or no task-specific training. This study evaluated the performance of a publicly available LLM with no task-specific training data (zero-shot prompting) in identifying GOC discussions documented in the electronic health record (EHR).

This diagnostic study compared two NLP models for identifying EHR-documented GOC discussions: Llama 3.3 using zero-shot prompting, and a task-specific model based on BERT (Bidirectional Encoder Representations from Transformers) trained on a corpus of 4,642 manually annotated notes. Models were evaluated on text corpora drawn from clinical trials enrolling adult patients with chronic life-limiting illness hospitalized at a US health system during 2018-2023. Performance was measured by the area under the receiver operating characteristic curve (AUC), the area under the precision-recall curve (AUPRC), and the maximal F1 score, for both note-level and patient-level classification over a 30-day period.

Across three text corpora, GOC documentation represented <1% of EHR text and was found in 7.3-9.9% of notes for 23-37% of patients. In a 617-patient held-out test set, the two models performed comparably. At the note level, Llama 3.3 (zero-shot) identified GOC documentation with AUC 0.979, AUPRC 0.873, and F1 0.83; BERT (task-specific, trained) with AUC 0.981, AUPRC 0.874, and F1 0.83. For the cumulative incidence of GOC documentation at the patient level over the 30-day period, Llama 3.3 achieved AUC 0.977, AUPRC 0.955, and F1 0.89; BERT achieved AUC 0.981, AUPRC 0.952, and F1 0.89.

A zero-shot LLM with no task-specific training performs similarly to a task-specific, supervised BERT model trained on thousands of manually labeled EHR notes in identifying documented goals-of-care discussions. These findings demonstrate promise for the rigorous use of LLMs in measuring novel clinical trial outcomes.
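The abstract does not include the authors' prompt or inference pipeline. As an illustration only, a zero-shot note classifier of the kind described might look like the sketch below, assuming a locally hosted Llama 3.3 served through an OpenAI-compatible API; the model name, endpoint, and prompt wording are hypothetical, and real clinical text should only ever be processed within an appropriately secured environment.

```python
# Illustrative zero-shot note classification via an OpenAI-compatible client.
# All deployment details (base_url, model name, prompt) are assumptions, not
# taken from the paper.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

PROMPT = (
    "You will read a clinical note. Answer YES if the note documents a "
    "goals-of-care discussion (e.g., treatment preferences, code status, "
    "or values about end-of-life care); otherwise answer NO."
)

def classify_note(note_text: str) -> bool:
    """Return True if the model labels the note as containing GOC documentation."""
    response = client.chat.completions.create(
        model="llama-3.3-70b-instruct",  # hypothetical deployment name
        temperature=0.0,                  # deterministic output for evaluation
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": note_text},
        ],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```

Because the abstract reports AUC and AUPRC, the authors' pipeline presumably yields a continuous score (for example, derived from token probabilities) rather than the hard YES/NO label shown here.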
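The reported metrics are standard. As a concrete illustration (not the authors' code), the "maximal F1" can be computed by sweeping decision thresholds along the precision-recall curve; a minimal sketch with synthetic scores:

```python
# Computing AUC, AUPRC, and maximal F1 from note-level scores and gold labels
# with scikit-learn; the data below are synthetic.
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)  # gold labels (1 = GOC documented)
y_score = np.clip(y_true * 0.6 + rng.normal(0.2, 0.25, size=1000), 0, 1)  # model scores

auc = roc_auc_score(y_true, y_score)              # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # area under the PR curve

# Maximal F1: evaluate the harmonic mean of precision and recall at every
# threshold on the precision-recall curve and take the best value.
precision, recall, _ = precision_recall_curve(y_true, y_score)
f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
max_f1 = f1.max()

print(f"AUC={auc:.3f}  AUPRC={auprc:.3f}  maximal F1={max_f1:.3f}")
```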

Authors
Robert Lee, Kevin Li, James Sibley, Trevor Cohen, William Lober, Danae Dotolo, Erin Kross