Background
Preterm birth, occurring before 37 completed weeks of gestation, affects approximately 10% of pregnancies globally [1–3] and is the leading cause of infant mortality worldwide [4, 5]. The causes of preterm birth are multifactorial; different biological pathways and environmental exposures can trigger premature labor [6]. Large epidemiological studies have identified many risk factors, including multiple gestations [1], cervical anatomic abnormalities [7], and maternal age [8]. Notably, even though a history of preterm birth [9] is one of the strongest risk factors, the recurrence rate remains low at < 30% [10, 11]. Additionally, maternal race is associated with preterm birth risk; Black women have twice the prevalence of White women [1, 12]. Preterm births have a heterogeneous clinical presentation and cluster based on maternal, fetal, or placental conditions [3]. Obstetric and systemic comorbidities (e.g., pre-existing diabetes, cardiovascular disease) can also increase the risk of preterm birth [13, 14].
Despite our understanding of numerous risk factors, there are no accurate methods to predict preterm birth. Some biomarkers are associated with preterm birth, but at best they perform well only in a subset of cases [15, 16]. Recently, analysis of maternal cell-free RNA and integrated -omic models have emerged as promising approaches [17–19], but initial results were based on a small pregnancy cohort and require further validation. In silico classifiers based on demographic and clinical risk factors have the advantage of not requiring serology or invasive testing. However, even in large cohorts (> 1 million individuals), demographic- and risk factor-based models report limited discrimination (AUC = 0.63–0.74) [20–24]. To date, we lack effective screening tools and preventative strategies for prematurity [25].
Electronic health records (EHRs) are scalable, readily available, and cost-efficient for disease-risk modeling [26]. EHRs capture longitudinal data across a broad set of phenotypes with detailed temporal resolution. EHR data can be combined with socio-demographic factors and family medical history to comprehensively model disease risk [27–29]. EHRs are also increasingly being augmented by linking patient records to molecular data, such as DNA and laboratory test results [30]. Since preterm birth has a substantial heritable risk [31], combining rich phenotypes with genetic risk may lead to better prediction.
Machine learning models have shown promise for accurate risk stratification across a variety of clinical domains [32–34]. However, despite the rapid adoption of machine learning in translational research, a review of 107 risk prediction studies reported that most models used only a few variables, did not consider longitudinal data, and rarely evaluated performance across multiple sites [35]. Studies using machine learning to predict preterm birth have relied on small cohorts and subsets of preterm birth and are rarely replicated in external datasets [22, 36–38]. Pregnancy research is especially well poised to benefit from machine learning approaches [27]. Per standard of care during pregnancy, women are carefully monitored with frequent prenatal visits, medical imaging, and clinical laboratory tests. Compared to other clinical contexts, pregnancy and the corresponding clinical surveillance occur in a defined time frame based on gestational length. Thus, EHRs are well-suited for modeling pregnancy complications, especially when combined with the well-documented outcomes at the end of pregnancy.
In this study, we combine multiple sources of data from EHRs to predict preterm birth using machine learning. From Vanderbilt’s EHR database (> 3.2 million records) and linked genetic biobank (> 100,000 individuals), we identified a large cohort of women (n = 35,282) with documented deliveries. We trained models (gradient-boosted decision trees) that combine demographic factors, clinical history, laboratory tests, and genetic risk with billing codes to predict preterm birth. We find that models trained only on billing codes show potential for predicting preterm birth and outperform a similar model using only known clinical risk factors. By investigating the patterns learned by our models, we identify clusters with distinct preterm birth risk and comorbidity profiles. Finally, we demonstrate the generalizability of billing code-based models trained at Vanderbilt on an external, independent cohort from the University of California, San Francisco (UCSF, n = 5978). Our findings provide a proof of concept that machine learning on rich phenotypes in EHRs shows promise for portable, accurate, and non-invasive prediction of preterm birth. The strong predictive performance across clinical contexts and preterm birth subtypes argues that machine learning models have the potential to add value during the management of pregnancy; however, further work is needed before these models can be applied in clinical settings.
Discussion
Preterm birth is a major health challenge that affects 5–20% of pregnancies [1, 2, 12] and leads to significant morbidity and mortality [56, 57]. Predicting preterm birth risk could inform clinical management, but no accurate classification strategies are routinely implemented [25]. Here, we take a step toward addressing this need by demonstrating the potential for machine learning on dense phenotyping from EHRs to predict preterm birth in challenging clinical contexts (e.g., spontaneous and recurrent preterm births). However, we emphasize that more work is needed before these approaches are ready for the clinic. Compared to other data types in the EHR, models using billing codes alone had the highest prediction accuracy and outperformed those using clinical risk factors. Demonstrating the potential broad applicability of our approach, the model accuracy was similar in an external independent cohort. Combinations of many known risk factors and patterns of care drove prediction; this suggests that the algorithm builds on existing knowledge. Thus, we conclude that machine learning based on EHR data has the potential to predict preterm birth accurately across multiple healthcare systems.
Decision tree-based models are robust to correlated features, can identify complex non-linear feature combinations, and remain transparent for interpretation after training. In addition to these advantages, decision tree-based models have demonstrated strong performance in various clinical prediction tasks [58–60]. Pregnancy is a clinical context with close monitoring and well-defined endpoints that may similarly benefit from machine learning approaches, yet few studies have applied decision tree-based machine learning models to large pregnancy cohorts with rich clinical data [61].
Our approach has several distinct advantages compared to published preterm birth prediction models. First, our models have robust performance. Previous models using risk factors (diabetes, hypertension, sickle cell disease, history of preterm birth) to predict preterm birth, despite having cohorts of up to two million women [23], have reported ROC-AUCs between 0.69 and 0.74 [20–22, 24]. Our models obtain a ROC-AUC of 0.75 and a PR-AUC of 0.40 using data available at 28 weeks of gestation, even after excluding multiple gestations. Furthermore, given the unbalanced classification problem (preterm births are less common than non-preterm), we report high PR-AUCs in addition to high ROC-AUCs. The improvement in our models is likely driven by the richer longitudinal phenotypes accessible from EHRs and by complex models capable of identifying non-linear patterns. These factors also likely contributed to the decision tree-based models outperforming the logistic regression model (Additional file 1: Fig. S11). A recent deep learning model trained using word embeddings from EHRs achieved high performance (ROC-AUC = 0.83) [61]. This model was evaluated on a stratified high-risk cohort consisting of births before 28 weeks of gestation. We did not stratify preterm births by severity since more than 85% of preterm births occur after 32 weeks of gestation [62]; however, this is an important topic for future work. Our models achieve comparable performance with the benefit of easier interpretability, which is an advantage over deep learning approaches; we discuss this further below.
Second, our models use readily available data throughout pregnancy that do not require invasive sampling. While some studies have also obtained high ROC-AUCs (e.g., 0.81–0.88), they used serum biomarkers in small cohorts [17] or acute obstetric changes within days of delivery [16]. The potential to enable cost-effective and broad application is illustrated by our evaluation of the classifiers on EHR data from UCSF; however, substantial further work is needed to move from this proof-of-concept analysis to clinic-ready models. Furthermore, the rich characterization of the phenome that EHRs provide, and that our approach leverages, could also complement more invasive biochemical assays.
Third, the gradient-boosted decision trees we implement are easier to interpret than “black box” deep learning models, which cannot easily identify the features driving predictions. Transparency is an important, if not necessary, characteristic of machine learning models deployed in clinical practice [63, 64], and it can facilitate the discovery of insights and hypotheses to motivate future work. We reveal the patterns learned by our model by clustering deliveries using feature importance profiles. The enrichment for known risk factors (e.g., gestational hypertension, fetal abnormalities, and pre-pregnancy BMI) in clusters with high preterm birth prevalence establishes confidence in our machine learning-based prediction models. In addition, we can quantify the strength of enrichment and the combination of risk factors across clusters with distinct comorbidity patterns. Since preterm birth is a heterogeneous phenotype [6], and stratifying pregnancies based on clinical features may be critical to uncovering the biological basis of labor [3, 65, 66], the learned rules from our model offer a possible method for subphenotyping.
Finally, our approach generalizes across hospital systems. We demonstrate that billing code-based models trained at Vanderbilt achieve similar accuracy in an independent cohort from UCSF. The generalizability of machine learning models can be constrained by the sampling of the training data. Thus, accurate prediction in an independent dataset from an external institution points to several inherent strengths of the approach. First, successful replication indicates the models’ ability to learn predictive signals despite regional variation in how billing codes are assigned to an EHR. Even with different demographic distributions between the two cohorts (e.g., a greater proportion of African American and Asian women in the Vanderbilt and UCSF cohorts, respectively; Table 1), the overall model performance is very similar. Second, the large cohorts used to train and evaluate models at Vanderbilt and UCSF guard against potential weaknesses of EHRs, such as miscoding or omission of key data points. These errors are unavoidable in EHRs [67], but the large cohort used to train our models mitigates them and enables the high accuracy in the UCSF dataset, even with its different demographics. Additionally, idiosyncratic patterns of patient care at the institution used to develop the algorithm, which would be present in the Vanderbilt training and held-out sets, are unlikely to be present in the external UCSF cohort and thus cannot inflate the out-of-sample accuracy. Third, the top features driving model performance are shared across institutions and reflect combinations of known risk factors and patterns of care. This aids interpretability of the underlying algorithm and likely reflects pathophysiology innate to preterm birth.
We see several avenues for further improving our algorithm. First, some of the top features reflected routine obstetric care for high-risk pregnancies. Thus, factors already known to the physician, or arising from a different clinical pathway initiated by a clinician who judged a pregnancy to be high risk, contribute to our prediction. This is not unique to preterm birth prediction and is a concern in any study based on longitudinal EHRs. To mitigate this effect, the learning problem could be engineered to force the algorithm to discover new, unappreciated risk factors. However, we also note that prediction based on a combination of known and novel risk factors is still valuable. Second, we were surprised that adding features beyond billing codes, such as lab values, concepts extracted from clinical notes, and genetic information, did not significantly improve performance. In some cases, the added source is largely redundant with information already captured by the billing codes and therefore cannot improve the model’s accuracy; this is likely true for clinical notes. However, other sources, such as currently available genetic data and polygenic risk scores, may not effectively capture the underlying etiologies of preterm birth and thus may not add discriminatory power given limitations in the current data. Indeed, the largest published genome-wide study of preterm birth explains only a very small fraction of the heritability [31], and a polygenic risk score derived from it was not predictive in our cohort. The relatively small sample size of individuals with genetic data may also limit its predictive utility in a broadly defined delivery cohort. For example, genetic risk prediction may have greater utility in certain subtypes of preterm birth (e.g., individuals with a strong family history of preterm birth). We also note that many lifestyle factors, such as smoking, alcohol consumption, diet, and physical activity, have been implicated in increasing preterm birth risk [1, 68]. Many of these data are recorded in unstructured fields in EHRs, and there are active efforts to develop accurate algorithms to extract them [69, 70]. As these approaches become robust, including lifestyle factors may further improve preterm birth prediction. Further subphenotyping of preterm birth will aid not only prediction but also the understanding of its multifactorial etiology and the development of personalized treatment strategies. Subphenotyping by gestational age and predicting preterm birth earlier during gestation, especially before 22 weeks, would give physicians more time for therapeutic interventions. Finally, while we evaluated the ability of our classifiers to discriminate preterm births, further studies evaluating the calibration of these models are necessary to better risk stratify pregnancies.
The strong predictive performance of our models suggests that they have the potential to be clinically useful. Compared to a machine learning model trained using only known risk factors, the billing code-based classifier incorporated a broad set of clinical features and predicted preterm birth with higher accuracy. Furthermore, the superior performance was not driven by the number of risk factors or the total burden of billing codes. These results indicate the algorithm is not simply identifying less healthy individuals or those with greater healthcare usage. The models also accurately predicted many preterm births in challenging and important clinical contexts such as spontaneous and recurrent preterm birth. Spontaneous preterm births are common [1, 12, 71], and unlike iatrogenic deliveries, they are more difficult to predict because they are driven by unknown multifactorial etiologies [12, 25]. Similarly, since a prior history of preterm birth is one of the strongest risk factors [72], distinguishing pregnancies most at risk for recurrent preterm birth has the potential to provide clinical value.
However, we emphasize that additional work is needed before this approach is ready for clinical application. Though it has strong performance, a more comprehensive evaluation of the algorithm against current clinical practice is needed to determine how early, and by how much, this approach could improve the standard of care [73]. Furthermore, while our model performed similarly on White and Black women, the two most represented groups in the training set, the lower performance on Hispanic and Asian women highlights that future approaches must be evaluated to ensure that they do not introduce or amplify biases against specific groups or types of preterm birth [74]. In addition, as noted above, we anticipate further gains in the clinical value of this approach as more modalities of data are incorporated into the EHR [75] and more data from diverse populations become available. Addressing these questions and taking other necessary steps toward clinical utility will require close collaboration among experts from basic, clinical, social, and implementation sciences.
Methods
Ascertaining delivery type and date for the Vanderbilt cohort
We identified women with at least one delivery (n = 35,282, “delivery-cohort”) at Vanderbilt Hospital based on the presence of delivery-specific billing codes, drawn from the International Classification of Diseases, Ninth and Tenth Revisions (ICD-9, ICD-10) and Current Procedural Terminology (CPT), or an estimated gestational age (EGA) documented in the EHR. Combining delivery-specific ICD-9/10 codes (“delivery-ICDs”), CPT codes (“delivery-CPTs”), and EGA data, we developed an algorithm to label each delivery as preterm or not preterm. Women with multiple gestations (e.g., twins, triplets) were identified using ICD and CPT codes and excluded from singleton-based analyses. See Additional file 1: Supplementary Materials and Methods for the exact codes considered.
We demarcated separate deliveries by grouping delivery-ICDs into 37-week intervals, starting with the most recent delivery-ICD and repeating this step until all delivery-ICDs in a patient’s EHR were assigned to a pregnancy. We chose 37-week intervals to maximally discriminate between pregnancies.
For each delivery, we assigned a label (preterm, term, or postterm) ascertained from the delivery-ICDs. EGA values, extracted from structured fields across clinical notes, were mapped to multiple pregnancies using the same procedure. For women with multiple EGA values recorded in their EHR, the most recent EGA value determined the time interval used to group preceding EGA values. Based on the most recent EGA value for each pregnancy, we assigned a label to each delivery (EGA < 37 weeks: preterm; ≥ 37 and < 42 weeks: term; ≥ 42 weeks: postterm). After pooling the delivery labels based on delivery-ICDs and EGA, we assigned a consensus delivery label by selecting the oldest gestational age-based classification (i.e., postterm > term > preterm). We expect that incorporating both billing code- and EGA-based delivery labels and selecting the oldest gestational classification increases the accuracy of this algorithm, which we evaluated by chart review (described in detail below).
Since CPT codes do not encode delivery type, we combined the delivery-CPTs with timestamps of delivery-ICDs and EGAs to approximate the date of delivery. Delivery-CPTs were grouped into multiple pregnancies as described above. The most recent timestamp from delivery-CPTs, delivery-ICDs, and EGA values was used as the approximate delivery date for a given pregnancy.
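As a minimal sketch of this grouping and date-approximation rule (assuming each patient’s delivery-code timestamps are available as Python datetime objects; the function and variable names are illustrative, not the study’s code):

```python
from datetime import datetime, timedelta

# Illustrative sketch of the 37-week grouping rule: starting from the most
# recent delivery code, all codes within the preceding 37 weeks are assigned
# to the same pregnancy, and the rule repeats on the remaining codes.
PREGNANCY_WINDOW = timedelta(weeks=37)

def group_into_pregnancies(code_dates):
    """Group delivery-code timestamps (datetime objects) into pregnancies."""
    remaining = sorted(code_dates, reverse=True)  # most recent first
    pregnancies = []
    while remaining:
        anchor = remaining[0]
        current = [d for d in remaining if anchor - d <= PREGNANCY_WINDOW]
        pregnancies.append({
            "codes": current,
            # the most recent timestamp approximates the delivery date
            "approx_delivery_date": max(current),
        })
        remaining = [d for d in remaining if anchor - d > PREGNANCY_WINDOW]
    return pregnancies

pregnancies = group_into_pregnancies(
    [datetime(2015, 3, 1), datetime(2015, 3, 3), datetime(2016, 8, 20)]
)
# -> two pregnancies, with approximate delivery dates 2016-08-20 and 2015-03-03
```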
Validating delivery type based on chart review
To validate the delivery type ascertained from billing codes and EGA, we used chart-reviewed labels as the gold standard. For 104 randomly selected EHRs from the delivery cohort, we extracted the date and gestational age at delivery from clinical notes. For the earliest delivery recorded in the EHR, we assigned a chart review-based label according to the gestational age at delivery (< 37 weeks: preterm; ≥ 37 and < 42 weeks: term; ≥ 42 weeks: postterm). The precision/positive predictive value (PPV) of the ascertained delivery type as a binary variable (“preterm” or “not preterm”) was calculated using the chart-reviewed label as the gold standard. To benchmark the ascertainment strategy against a simpler phenotyping algorithm, we compared the concordance of the label derived from delivery-ICDs with one based on the gestational age recorded within 3 days of delivery. This simpler phenotyping approach resulted in a lower positive predictive value (85%) and recall (93%; Additional file 1: Fig. S1B) than the billing code-based ascertainment strategy.
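For illustration, the PPV and recall against the chart-reviewed gold standard can be computed as follows (a minimal sketch with made-up labels; 1 = preterm, 0 = not preterm):

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative labels only; the study used 104 chart-reviewed EHRs.
chart_review_labels = [1, 0, 0, 1, 0, 1]   # gold standard from clinical notes
ascertained_labels  = [1, 0, 0, 1, 1, 1]   # labels from billing codes + EGA

ppv = precision_score(chart_review_labels, ascertained_labels)   # positive predictive value
recall = recall_score(chart_review_labels, ascertained_labels)   # sensitivity
print(f"PPV = {ppv:.2f}, recall = {recall:.2f}")
```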
Training and evaluating gradient-boosted decision trees to predict preterm birth
All models for predicting preterm birth used boosted decision trees as implemented in XGBoost v0.82 [39]. Unless stated otherwise, we trained models to classify the earliest delivery in a woman’s EHR as preterm or not preterm. The delivery cohort was randomly split into training (80%) and held-out (20%) sets with an equal proportion of preterm cases. For prediction tasks, we used only ICD-9 codes and excluded ICD-10 codes to avoid potential confounding effects. The total count of each billing code within a specified time frame was used as a feature to train our models; if a woman never had a given billing code in her EHR, it was encoded as “0.” For all models, we excluded the ICD-9 codes, CPT codes, and EGA used to ascertain delivery type and date. On the training set, we used the tree-structured Parzen estimator as implemented in hyperopt v0.1.1 [76] to optimize hyperparameters by maximizing the mean average precision. The best set of hyperparameters was selected after 1000 trials using 3-fold cross-validation over the training set (80:20 split with an equal proportion of preterm cases). We evaluated the performance of all models on the held-out set using Scikit-learn v0.20.2 [77]. All performance metrics reported are on the held-out set. For precision-recall curves, we define the baseline chance performance for each model as the prevalence of preterm cases. To ensure no data leaks were present in our training protocol, we trained and evaluated a model using a randomly generated dataset (n = 1000 samples) with a 22% preterm prevalence. As expected, this model did no better than chance (ROC-AUC = 0.50, PR-AUC = 0.22, data not shown). All trained models with their optimized hyperparameters are provided at https://github.com/abraham-abin13/ptb_predict_ml.
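A minimal sketch of this training protocol (XGBoost tuned by hyperopt’s tree-structured Parzen estimator, maximizing mean average precision under 3-fold cross-validation) is shown below. The search space, trial count, and synthetic stand-in data are illustrative assumptions, not the study’s exact configuration:

```python
import numpy as np
import xgboost as xgb
from hyperopt import fmin, tpe, hp, Trials
from sklearn.model_selection import cross_val_score

# synthetic stand-ins for the billing-code count matrix and preterm labels
rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(200, 30))
y_train = rng.integers(0, 2, size=200)

# illustrative hyperparameter search space
space = {
    "max_depth": hp.choice("max_depth", [3, 4, 5, 6]),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
    "n_estimators": hp.choice("n_estimators", [100, 300, 500]),
    "subsample": hp.uniform("subsample", 0.5, 1.0),
}

def objective(params):
    model = xgb.XGBClassifier(**params)
    # hyperopt minimizes, so return the negative mean average precision
    ap = cross_val_score(model, X_train, y_train, cv=3,
                         scoring="average_precision").mean()
    return -ap

# the study ran 1000 trials; 50 keeps this sketch quick
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
```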
Predicting preterm birth at different weeks of gestation
As a first step, we evaluated whether billing codes could discriminate between delivery types. Models were trained to predict preterm birth using the total counts of each ICD-9, CPT, or ICD-9 and CPT code across a woman’s EHR. We excluded any codes used to ascertain the delivery type or date. All three models were trained and evaluated on the same cohort of women, who each had at least one ICD-9 and one CPT code (Additional file 1: Fig. S2).
Next, we evaluated the machine learning models at 0, 13, 28, and 35 weeks of gestation by training on only the features present before each time point. For the subset of women in our delivery cohort with EGA, we calculated the date of conception by subtracting the EGA (recorded within 3 days of delivery) from the date of delivery. We then trained the models using ICD-9 and CPT codes time-stamped before each gestational time point, either restricted to singleton pregnancies (Fig. 2B) or including multiple gestations (Additional file 1: Fig. S3). The same cohort of women was used to train and evaluate all models. The sample size varied slightly (n = 11,843 to 10,799) because women who had already delivered were excluded at each time point.
In addition to evaluating the models based on the date of conception, we trained the models at different time points before the date of delivery (Additional file 1: Fig. S4) using the same cohort of women, requiring every individual in this cohort to have at least one ICD-9 or CPT code before each time point. Evaluating the models before the date of delivery increased the sample size (n = 15,481) compared to a prospective conception-based design (n = 12,410) and yielded similar results.
Evaluating the predictive potential of demographic, clinical, and genetic features from EHRs
In addition to billing codes, we extracted structured and unstructured features from the EHRs (Fig. 3A). We evaluated the models using features present before 28 weeks of gestation (Fig. 3) and features present before or after delivery (Additional file 1: Fig. S6). Structured data included self- or third-party-reported race (Fig. 1E), age at delivery, past medical and family history (92 features, see Additional file 1: Supplementary Materials and Methods), and clinical labs. For training models, we included only clinical labs obtained during the first pregnancy and excluded values greater than four standard deviations from the mean. To capture the trajectory of each clinical lab’s values across pregnancy (307 clinical labs, see Additional file 1: Supplementary Materials and Methods), we trained the models using the mean, median, minimum, and maximum lab measurements. For unstructured clinical text in obstetric and nursing notes, we applied CLAMP [78] to extract Unified Medical Language System (UMLS) concept unique identifiers (CUIs) and included those with positive assertions occurring at > 0.5% frequency across all EHRs. When training preterm birth prediction models, we one-hot encoded the categorical features. No transformations were applied to the continuous features.
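A minimal sketch of the lab-value summarization (outlier filtering followed by mean/median/min/max per lab) is shown below; the toy table and column names are assumptions, not the study’s schema:

```python
import pandas as pd

# toy long-format lab table: one row per measurement
labs = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "lab_name": ["hemoglobin", "hemoglobin", "glucose", "hemoglobin", "glucose"],
    "value": [11.9, 12.4, 88.0, 10.8, 95.0],
})

# drop values more than 4 standard deviations from each lab's mean
zscores = labs.groupby("lab_name")["value"].transform(lambda v: (v - v.mean()) / v.std())
labs = labs[zscores.abs() <= 4]

# summarize each lab per woman by mean, median, min, and max
lab_features = labs.pivot_table(index="patient_id", columns="lab_name",
                                values="value",
                                aggfunc=["mean", "median", "min", "max"])
lab_features.columns = ["_".join(col) for col in lab_features.columns]
```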
A subset of women (n = 905) was genotyped on the Illumina MEGAEX platform. We applied standard genome-wide association study (GWAS) quality control steps [79] using PLINK v1.90b4s [80]. We calculated a polygenic risk score for each White woman with genotype data based on the largest available preterm birth GWAS [31] using PRSice-2 [81, 82]. We assumed an additive model and summed the number of risk alleles at single nucleotide polymorphisms (SNPs) weighted by their strength of association with preterm birth (effect size). PRSice determined the optimum number of SNPs by testing the polygenic risk score for association with preterm birth in our delivery cohort at different GWAS p-value thresholds. We included the date of birth and five genetic principal components as covariates, the latter to control for ancestry. Our final polygenic risk score used 356 preterm birth-associated SNPs (GWAS p-value < 0.00025).
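Conceptually, the additive score is a weighted sum of risk-allele dosages; PRSice-2 performs this scoring together with the p-value thresholding described above. A toy sketch (the arrays are illustrative stand-ins for real genotypes and GWAS summary statistics):

```python
import numpy as np

# individuals x SNPs: risk-allele counts (0, 1, or 2) per SNP
dosages = np.array([[0, 1, 2],
                    [1, 1, 0],
                    [2, 0, 1]])

# per-SNP effect sizes (log odds ratios) from the preterm birth GWAS
effect_sizes = np.array([0.05, -0.02, 0.08])

# additive polygenic risk score: one value per individual
prs = dosages @ effect_sizes
```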
Using the structured and unstructured data derived from the EHR, we evaluated whether adding EHR features to billing codes could improve preterm birth prediction. Since the number of women varied across EHR features, we created a subset of the delivery cohort for each EHR feature. Each subset included women with at least one recorded value for the EHR feature and billing codes. We then trained three models as described above for each subset: (1) using only the EHR feature being evaluated, (2) using ICD-9 and CPT codes, and (3) using the EHR feature together with ICD-9 and CPT codes. Thus, all three models for a given EHR feature were trained and evaluated on the same cohort of deliveries (Fig. 3A).
Predicting preterm birth using billing codes and clinical risk factors at 28 weeks of gestation
We compared the performance of a model trained using billing codes (ICD-9 and CPT) present before 28 weeks of gestation with a model trained using clinical risk factors to predict preterm delivery (Fig. 4). Both models were trained and evaluated on the same cohort of women (n = 21,099). We selected well-established obstetric risk factors that included maternal and fetal factors across organ systems, occurred before and during pregnancy, and conferred moderate to high risk for preterm birth [3, 13, 23, 44]. For each individual, risk factors were encoded as high-risk or low-risk binary values. Risk factors such as non-gestational diabetes [48], gestational diabetes [48], gestational hypertension, pre-eclampsia or eclampsia [1, 50], fetal abnormalities [13], cervical abnormalities [51], and sickle cell disease [49] were defined based on at least one corresponding ICD-9 code occurring before the date of delivery (Additional file 1: Supplementary Materials and Methods). The remaining factors, such as race (Black, Asian, or Hispanic encoded as higher risk) [20], age at delivery (> 34 or < 18 years old) [45–47], pre-pregnancy BMI ≥ 35, and pre-pregnancy hypertension (> 120/80) [1, 50], were extracted from structured fields in the EHR. Pre-pregnancy values were defined as the most recent measurement recorded more than 9 months before the delivery date.
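A minimal sketch of this binary encoding is shown below; the ICD-9 prefixes and thresholds are placeholders, not the study’s exact definitions (those are listed in Additional file 1: Supplementary Materials and Methods):

```python
# Illustrative high-risk/low-risk encoding from ICD-9 codes and structured fields.
def encode_risk_factors(icd9_codes, age_at_delivery, pre_pregnancy_bmi):
    return {
        # placeholder ICD-9 prefix standing in for the gestational diabetes definition
        "gestational_diabetes": int(any(c.startswith("648.8") for c in icd9_codes)),
        "high_risk_age": int(age_at_delivery > 34 or age_at_delivery < 18),
        "high_risk_bmi": int(pre_pregnancy_bmi is not None and pre_pregnancy_bmi >= 35),
    }

features = encode_risk_factors(["648.83", "V22.1"], age_at_delivery=36, pre_pregnancy_bmi=29.5)
# -> {"gestational_diabetes": 1, "high_risk_age": 1, "high_risk_bmi": 0}
```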
Density-based clustering on feature importance values
To better understand the decision-making process of our machine learning models, we calculated feature importance values for the model predicting preterm birth at 28 weeks of gestation. We used SHapley Additive exPlanations (SHAP) values [52, 53, 55] to determine the marginal additive contribution of each feature for each individual. First, we calculated a matrix of SHAP values (individuals by features) for the held-out cohort. Since this matrix was too large for density-based clustering, we created an embedding using 30 Uniform Manifold Approximation and Projection (UMAP) components with default parameters as implemented in UMAP v0.3.8 [83]. Next, we performed density-based hierarchical clustering using HDBSCAN v0.8.26 [84]. We used default parameters (metric = Euclidean) and tried a range of values for two hyperparameters: the minimum number of individuals in each cluster (“min_cluster_size”) and the threshold for determining outlier individuals who do not belong to a cluster (“min_samples”). After tuning these two hyperparameters, we selected the clustering model with the highest density-based cluster validity (DBCV) score [84], which measures within- and between-cluster density connectedness. We found that min_cluster_size = 110 and min_samples = 10 gave the highest DBCV score, yielding 6 distinct clusters plus one cluster of outliers (Additional file 1: Fig. S13). A minority of women (n = 16) were not assigned to a cluster (“outliers”). To visualize the cluster assignments, we performed UMAP on the feature importance matrix with default settings and two UMAP components and colored each individual by cluster membership. Finally, we calculated the preterm birth prevalence and prediction accuracy within each cluster.
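A minimal sketch of this pipeline is shown below, with a toy model and random data standing in for the trained 28-week model and held-out feature matrix; the hyperparameter values follow the text, and using HDBSCAN’s relative validity as the DBCV score is our assumption:

```python
import numpy as np
import xgboost as xgb
import shap
import umap
import hdbscan

# toy stand-ins for the trained model and held-out billing-code matrix
rng = np.random.default_rng(0)
X_held_out = rng.integers(0, 5, size=(500, 20)).astype(float)
y_held_out = rng.integers(0, 2, size=500)
model = xgb.XGBClassifier(n_estimators=50).fit(X_held_out, y_held_out)

# SHAP feature-importance profiles (individuals x features)
shap_values = shap.TreeExplainer(model).shap_values(X_held_out)

# embed the SHAP matrix into 30 UMAP components before clustering
embedding = umap.UMAP(n_components=30).fit_transform(shap_values)

# density-based hierarchical clustering; label -1 marks outliers
clusterer = hdbscan.HDBSCAN(min_cluster_size=110, min_samples=10,
                            metric="euclidean", gen_min_span_tree=True)
labels = clusterer.fit_predict(embedding)

# relative validity approximates the density-based cluster validity (DBCV) score
dbcv = clusterer.relative_validity_
```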
Comorbidity enrichment within clusters
We tested for enrichment of clinical risk factors within each cluster using Fisher’s exact test as implemented in SciPy [85]. For each risk factor, we constructed a contingency table of cluster membership versus high-risk status for that factor. We report enrichment as the odds ratio, with the color bar showing the odds ratio on a log10 scale. One cluster did not have any cases of sickle cell disease.
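For example, a single cluster-by-risk-factor table can be tested as follows (the counts are invented for illustration):

```python
from scipy.stats import fisher_exact

# rows: inside cluster / outside cluster; columns: high risk / low risk
table = [[40, 160],     # women inside the cluster
         [300, 4500]]   # women outside the cluster
odds_ratio, p_value = fisher_exact(table)
```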
We compared how models trained using billing codes (ICD-9 and CPT) performed in different clinical contexts. First, we evaluated the accuracy of predicting spontaneous preterm birth using models trained to predict all types of preterm birth. From all preterm cases in the held-out set, we excluded women who met any of the following criteria to create a cohort of spontaneous preterm births: medically induced labor, delivery by cesarean section, or preterm premature rupture of membranes. The ICD-9 and CPT codes used to identify the exclusion criteria are provided in Additional file 1: Supplementary Materials and Methods. We calculated recall/sensitivity as the number of correctly predicted spontaneous preterm births out of all spontaneous preterm births in the held-out set. We used the same approach to quantify the performance of models trained using clinical risk factors (Fig. 4E).
We trained models to predict preterm birth among cesarean sections and vaginal deliveries separately, using billing codes (ICD-9 and CPT) as features. Deliveries were labeled as cesarean sections or vaginal deliveries if they had at least one relevant billing code (ICD-9 or CPT) occurring within 10 days of the date of the first delivery in the EHR. The billing codes used to determine delivery type are provided in Additional file 1: Supplementary Materials and Methods. Deliveries with billing codes for both cesarean and vaginal deliveries were excluded. Separate models were trained for cesarean and vaginal deliveries (Fig. 6A and Additional file 1: Fig. S8).
We also evaluated how well models using billing codes could predict recurrent preterm birth. From our delivery cohort, we retained women whose first delivery in the EHR was preterm and who had a second delivery whose type (preterm vs. not preterm) we ascertained as described above for the first delivery. We trained models using billing codes (ICD-9 and CPT) at time points before the date of delivery because the majority of this cohort did not have a reliable EGA for the second delivery. As described earlier, separate models were trained using billing codes timestamped before each time point being evaluated (Fig. 6B, Additional file 1: Fig. S9).
Preterm birth prediction in independent UCSF cohort
We evaluated how well models trained at Vanderbilt using billing codes performed in an external cohort assembled at UCSF. Only the first delivery in the EHR was used for prediction. Women with twins or multiple gestations, identified using billing codes (Additional file 1: Supplementary Materials and Methods), were excluded. Delivery type (preterm vs. not preterm) was assigned based on the presence of ICD-10 codes. Term (not preterm) deliveries were determined by the presence of an ICD-10 code beginning with “O80,” which specifies an encounter for full-term delivery. Preterm deliveries were determined by both the absence of ICD-10 codes beginning with “O80” and the presence of codes beginning with “O60.1,” the family of codes for preterm labor with preterm delivery. We trained models on the Vanderbilt cohort using ICD-9 codes present before 28 weeks of gestation to predict preterm birth; we refer to this model as the “Vanderbilt-28wk model” throughout the manuscript. CPT codes were not used since they were not available from the UCSF EHR system. The Vanderbilt-28wk model was evaluated on the Vanderbilt held-out set and on the independent UCSF cohort.
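A minimal sketch of this ICD-10 labeling rule (the helper function name and the example codes are illustrative):

```python
# Label a UCSF delivery from a woman's list of ICD-10 codes.
def ucsf_delivery_label(codes):
    has_term = any(c.startswith("O80") for c in codes)       # full-term delivery encounter
    has_preterm = any(c.startswith("O60.1") for c in codes)  # preterm labor with preterm delivery
    if has_preterm and not has_term:
        return "preterm"
    if has_term:
        return "not preterm"
    return None  # delivery type not ascertainable from these codes

print(ucsf_delivery_label(["O60.14X0", "Z37.0"]))  # -> "preterm"
```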
Feature interpretation from boosted decision tree models
To determine feature importance for the Vanderbilt-28wk model, we used SHAP values [52, 53, 55] to quantify the marginal additive contribution of each feature. For the held-out Vanderbilt cohort and the UCSF cohort, a SHAP value was calculated for each feature per individual. Feature importance was summarized by taking the mean of the absolute SHAP values across individuals, and the top fifteen features based on the mean absolute SHAP value in either the Vanderbilt or UCSF cohort are reported. To compare how feature importance differed between Vanderbilt and UCSF, we computed the Pearson correlation of the mean absolute SHAP values.
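A minimal sketch of this summary, with small random matrices standing in for the SHAP values (individuals × features) computed in each cohort:

```python
import numpy as np
from scipy.stats import pearsonr

# random stand-ins for the per-cohort SHAP matrices (individuals x features)
rng = np.random.default_rng(0)
shap_values_vanderbilt = rng.normal(size=(1000, 50))
shap_values_ucsf = rng.normal(size=(800, 50))

# feature importance = mean absolute SHAP value across individuals
mean_abs_vu = np.abs(shap_values_vanderbilt).mean(axis=0)
mean_abs_ucsf = np.abs(shap_values_ucsf).mean(axis=0)

top15 = np.argsort(mean_abs_vu)[::-1][:15]             # top features at Vanderbilt
correlation, _ = pearsonr(mean_abs_vu, mean_abs_ucsf)  # cross-site agreement
```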
Training and evaluating a logistic regression model using only top features
Using only the top 15 features from our Vanderbilt-28wk model as predictors, we trained a logistic regression using Scikit-learn v0.20.2 [77] with the following parameters: random_state=0, max_iter=10000, solver=‘liblinear’, class_weight=‘balanced’. The model was trained using the same training set from the Vanderbilt cohort (i.e., ICD-9 codes present before 28 weeks) used for the comparison with the UCSF dataset. Performance was also evaluated on the same held-out set from the Vanderbilt cohort using ROC-AUC and PR-AUC.
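A minimal sketch of this model with the parameters listed above; the toy matrices stand in for the top-15 ICD-9 count features and preterm labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy stand-ins for the top-15 ICD-9 count features and preterm labels
rng = np.random.default_rng(0)
X_train_top15 = rng.integers(0, 5, size=(1000, 15))
y_train = rng.integers(0, 2, size=1000)
X_heldout_top15 = rng.integers(0, 5, size=(200, 15))

logreg = LogisticRegression(random_state=0, max_iter=10000,
                            solver="liblinear", class_weight="balanced")
logreg.fit(X_train_top15, y_train)
pred_probs = logreg.predict_proba(X_heldout_top15)[:, 1]  # scored with ROC-AUC and PR-AUC
```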
For the Vanderbilt-28wk model, we also evaluated performance on the Vanderbilt held-out set stratified by race. We excluded individuals (n = 284) whose race was annotated as “other” or spanned multiple categories because these subsets were small (n < 143) and therefore more susceptible to sampling variability. Stratifying the held-out set by race resulted in four categories (White, Black, Hispanic, Asian). We then evaluated the model on each subset and report the ROC-AUC and PR-AUC together with the preterm birth prevalence within each subset.