Prediction of acute and chronic kidney diseases during the post-COVID-19 pandemic with machine learning models: utilizing national electronic health records in the US.
Background: COVID-19 has been linked to acute kidney injury (AKI) and chronic kidney disease (CKD), but machine learning (ML) models predicting these risks in the post-pandemic period have been lacking. We aimed to use large electronic health records (EHR) and ML algorithms to predict the incidence of AKI and CKD during the post-pandemic period, assess whether COVID-19 infection history needs to be included as a predictor, and develop a practical web application for clinical use.
Methods: National EHR data from TriNetX were used to emulate a prospective cohort of 104,565 patients from 07/01/2022 to 03/31/2024. A total of 69 baseline variables were included, covering demographics, comorbidities, lab test results, vital signs, medication histories, hospitalization visits, and COVID-19-related variables. Prediction windows of 1 month and 1 year were defined to assess AKI and CKD incidence. Eight ML models, primarily including extreme gradient boosting (XGBoost), neural network, and random forest (RF), were applied. Cross-validation and model tuning were conducted during the training process. Model performance was evaluated using six metrics, including the area under the receiver-operating-characteristic curve (AUROC). A combination of model-driven, data-driven, and clinical-driven methods was employed to identify the final models. An application with the final models was built using the R Shiny framework.
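For readers less familiar with the headline metric, AUROC can be read as the probability that a randomly chosen patient who develops the outcome receives a higher predicted risk than a randomly chosen patient who does not. The following is a minimal sketch of that pairwise definition in plain Python. It is not the study's evaluation code; the function name `auroc` and the toy labels and scores are made up for illustration.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC via the pairwise (Mann-Whitney) definition; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # tie contributes half a win
    return wins / (len(pos) * len(neg))


# Toy, made-up example: 1 = outcome occurred, 0 = it did not.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auroc(toy_labels, toy_scores)
print(f"AUROC = {toy_auc:.3f}")  # 3 of 4 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in cohort size, so for a cohort on the scale reported here a rank-based or library implementation would be the practical choice; the sketch is meant only to make the definition concrete.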
Results: The final models incorporated 9 variables, primarily eGFR, inpatient visit number, and number of COVID-19 infections. XGBoost demonstrated the best performance for predicting the incidence of AKI in 1 month (AUROC = 0.803), AKI in 1 year (AUROC = 0.799), and CKD in 1 year (AUROC = 0.894). RF was selected for predicting the incidence of CKD in 1 month (AUROC = 0.896). Comparing AUROC with and without the COVID-19 infection variable confirmed its importance as a critical predictor in the model. The final models were translated into a convenient tool to facilitate their use in clinical settings.
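The with-versus-without comparison can be illustrated as the difference between the AUROC of a model that includes a predictor and one that omits it, evaluated on the same patients. The abstract does not say how the difference was assessed, so the paired bootstrap percentile interval below is an assumption chosen for illustration, not the authors' procedure. The helper names (`auroc`, `auroc_gain`) and all numbers are hypothetical toy values.

```python
import random
from typing import Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_gain(labels: Sequence[int],
               scores_full: Sequence[float],
               scores_reduced: Sequence[float],
               n_boot: int = 200,
               seed: int = 0) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for AUROC(full) - AUROC(reduced).

    The interval is a paired bootstrap percentile interval: the same resampled
    patients are scored by both models in every replicate.
    """
    rng = random.Random(seed)
    n = len(labels)
    point = auroc(labels, scores_full) - auroc(labels, scores_reduced)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if len(set(y)) < 2:
            continue  # resample lacks one class, so AUROC is undefined; redraw
        full = [scores_full[i] for i in idx]
        reduced = [scores_reduced[i] for i in idx]
        diffs.append(auroc(y, full) - auroc(y, reduced))
    diffs.sort()
    lower = diffs[int(0.025 * (n_boot - 1))]
    upper = diffs[int(0.975 * (n_boot - 1))]
    return point, lower, upper


# Toy, made-up predictions from a "full" model and one missing a predictor.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
with_predictor = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
without_predictor = [0.1, 0.6, 0.3, 0.7, 0.2, 0.5, 0.8, 0.9]

point, lower, upper = auroc_gain(labels, with_predictor, without_predictor)
print(f"AUROC gain = {point:.4f} (95% bootstrap interval {lower:.4f} to {upper:.4f})")
```

Resampling patients jointly for both models keeps the comparison paired, which matters because the two sets of predictions are strongly correlated.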
Conclusions: Our study demonstrates the feasibility of using large national EHR data to develop high-performance ML models that predict AKI and CKD risks in the post-COVID-19 period. Incorporating the number of COVID-19 infections in the past year improved prediction performance and should be considered in future models for kidney disease prediction. A user-friendly application was created to support clinicians in risk assessment and surveillance.
Funding: Artificial Intelligence and Biomedical Informatics Pilot Funding, Penn State College of Medicine.