Analysis of simulated data
We considered a simulation design that enabled us to assess the performance of the proposed summary-level modified Poisson method in the presence of multiple data partners, multiple covariates (including but not limited to data source indicators), and differences in exposure prevalence and outcome incidence across data partners. Although modified Poisson regression is broadly applicable with rare and common outcomes, here we considered a scenario with common outcomes, where logistic regression would provide biased estimates of adjusted risk ratios.
Specifically, we simulated a distributed network with three (i.e., K = 3) data partners and n = 10000 individuals with 5000, 2000, and 3000 individuals contributing data from the first, second, and third data partners, respectively. We considered five covariates X1, X2, X3, X4 and X5. We generated X1 as a Bernoulli variable with a mean (i.e., P(X1 = 1)) of 0.6, X2 as a continuous variable following the standard uniform distribution, X3 as a continuous variable following the unit exponential distribution, X4 as an indicator that an individual contributed data from the first data partner, and X5 as an indicator that an individual contributed data from the second data partner.
The exposure E was generated from a Bernoulli distribution with the probability of being exposed (E = 1) defined as 1/{1 + exp(0.73 − X1 − X2 + X3 − 0.2X4 + 0.2X5)}, mimicking a non-randomized study. This setting led to different exposure prevalences across data partners. The resulting exposure prevalence was approximately 40% overall, 43% for the first data partner, 34% for the second data partner, and 38% for the third data partner. The outcome Y was generated from a Bernoulli distribution with the probability of having the outcome (Y = 1) defined as \( \exp(Z^{T}\beta) \) = exp(−0.1 − 0.5E − 0.4X1 − 0.6X2 − 0.5X3 − 0.1X4 + 0.1X5), such that the true adjusted risk ratio comparing the exposed to the unexposed was exp(−0.5) = 0.61. The resulting outcome incidence (i.e., risk) varied across the three data partners: it was about 30% in the pooled data, 27% for the first data partner, 35% for the second data partner, and 31% for the third data partner.
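As an illustration, the data-generating process described above can be sketched in Python with NumPy. This is a minimal sketch rather than the code used for the reported analysis: the function name `simulate_network`, the seed, and the returned dictionary layout are assumptions introduced here.

```python
import numpy as np

def simulate_network(seed=2024):
    """Simulate one data set following the design described in the text.

    The seed and the dictionary layout are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    sizes = [5000, 2000, 3000]            # individuals per data partner
    site = np.repeat([1, 2, 3], sizes)    # data partner label for each individual
    n = site.size

    x1 = rng.binomial(1, 0.6, n)          # Bernoulli with P(X1 = 1) = 0.6
    x2 = rng.uniform(0.0, 1.0, n)         # standard uniform
    x3 = rng.exponential(1.0, n)          # unit exponential
    x4 = (site == 1).astype(float)        # indicator of the first data partner
    x5 = (site == 2).astype(float)        # indicator of the second data partner

    # Exposure model: logistic in the covariates (non-randomized exposure).
    p_e = 1.0 / (1.0 + np.exp(0.73 - x1 - x2 + x3 - 0.2 * x4 + 0.2 * x5))
    e = rng.binomial(1, p_e)

    # Outcome model: log-linear risk, so exp(-0.5) is the true adjusted risk ratio.
    p_y = np.exp(-0.1 - 0.5 * e - 0.4 * x1 - 0.6 * x2 - 0.5 * x3
                 - 0.1 * x4 + 0.1 * x5)
    y = rng.binomial(1, p_y)

    return {"site": site, "e": e, "y": y, "p_y": p_y,
            "x1": x1, "x2": x2, "x3": x3, "x4": x4, "x5": x5}

data = simulate_network()
print("exposure prevalence:", data["e"].mean())
print("outcome incidence:  ", data["y"].mean())
for k in (1, 2, 3):
    in_site = data["site"] == k
    print(f"partner {k}: exposed {data['e'][in_site].mean():.2f}, "
          f"outcome {data['y'][in_site].mean():.2f}")
```

Because every coefficient other than that of X5 is negative and X4 and X5 cannot both equal 1, the outcome probability never exceeds 1, so the log-linear risk model is well defined for every simulated individual.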
As the reference, we first fit a modified Poisson regression model using pooled individual-level data (Table 1). We then implemented our proposed distributed algorithm, which did not require sharing of individual-level data, to estimate \( \beta \). Starting from \( \beta^{(0)} = 0 \), the analysis took seven iterations to converge. The individual-level and summary-level methods produced identical point estimates and sandwich variance-based standard errors (Table 1). Additional file 1 provides the summary-level information shared between the data partners and the analysis center during each iteration.
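The idea of iterating on summary-level information can be sketched as follows. This sketch assumes a Newton-Raphson update on the Poisson estimating equation with a robust (sandwich) variance, in which each data partner returns only sums over its own individuals. The function names `local_summaries` and `fit_from_summaries`, the stopping rule, and the seed are assumptions made for illustration, and the exact quantities exchanged in the reported analysis (Additional file 1) may differ.

```python
import numpy as np

def local_summaries(Z, y, beta):
    """Run inside one data partner; returns aggregates only, no individual rows."""
    mu = np.exp(Z @ beta)                          # fitted risks under the log link
    score = Z.T @ (y - mu)                         # sum_i (y_i - mu_i) Z_i
    bread = (Z * mu[:, None]).T @ Z                # sum_i mu_i Z_i Z_i^T
    meat = (Z * ((y - mu) ** 2)[:, None]).T @ Z    # sum_i (y_i - mu_i)^2 Z_i Z_i^T
    return score, bread, meat

def fit_from_summaries(partners, tol=1e-10, max_iter=50):
    """Analysis-center loop: Newton-Raphson on the summed site-level aggregates."""
    beta = np.zeros(partners[0][0].shape[1])       # starting value beta^(0) = 0
    for n_iter in range(1, max_iter + 1):
        parts = [local_summaries(Z, y, beta) for Z, y in partners]
        score = sum(p[0] for p in parts)
        bread = sum(p[1] for p in parts)
        step = np.linalg.solve(bread, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:             # assumed stopping rule
            break
    # Sandwich variance from one further round of summaries at the final estimate.
    parts = [local_summaries(Z, y, beta) for Z, y in partners]
    bread_inv = np.linalg.inv(sum(p[1] for p in parts))
    cov = bread_inv @ sum(p[2] for p in parts) @ bread_inv
    return beta, np.sqrt(np.diag(cov)), n_iter

# Simulated network following the design in the text (seed is arbitrary).
rng = np.random.default_rng(1)
site = np.repeat([1, 2, 3], [5000, 2000, 3000])
n = site.size
x1 = rng.binomial(1, 0.6, n)
x2 = rng.uniform(0.0, 1.0, n)
x3 = rng.exponential(1.0, n)
x4 = (site == 1).astype(float)
x5 = (site == 2).astype(float)
e = rng.binomial(1, 1.0 / (1.0 + np.exp(0.73 - x1 - x2 + x3 - 0.2 * x4 + 0.2 * x5)))
y = rng.binomial(1, np.exp(-0.1 - 0.5 * e - 0.4 * x1 - 0.6 * x2 - 0.5 * x3
                           - 0.1 * x4 + 0.1 * x5))
Z = np.column_stack([np.ones(n), e, x1, x2, x3, x4, x5])

partners = [(Z[site == k], y[site == k]) for k in (1, 2, 3)]
beta_dist, se_dist, n_iter = fit_from_summaries(partners)   # summary-level fit
beta_pool, se_pool, _ = fit_from_summaries([(Z, y)])        # pooled reference fit

print("iterations:", n_iter)
print("max |beta difference|:", np.max(np.abs(beta_dist - beta_pool)))
print("max |SE difference|:  ", np.max(np.abs(se_dist - se_pool)))
```

The site-level sums add up exactly to the pooled sums, which is why the two fits agree up to floating-point error. A single partner's matrix can be singular on its own (X4 and X5 are constant within a partner), so only the summed matrix is inverted.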
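Table 1. Point Estimates and Standard Errors Using the Summary-Level Modified Poisson Method and Pooled Individual-Level Data Analysis: Analysis of Simulated Data
Parameter | Summary-level method: estimate | Summary-level method: standard error | Pooled individual-level analysis: estimate | Pooled individual-level analysis: standard error |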
Intercept | −0.09702882 | 0.03751991 | −0.09702882 | 0.03751991 |
Exposure | −0.48776703 | 0.03462485 | −0.48776703 | 0.03462485 |
X1 | −0.38249121 | 0.02968448 | −0.38249121 | 0.02968448 |
X2 | −0.62968463 | 0.05161073 | −0.62968463 | 0.05161073 |
X3 | −0.50382664 | 0.02389781 | −0.50382664 | 0.02389781 |
X4 | −0.09079213 | 0.03389126 | −0.09079213 | 0.03389126 |
X5 | 0.13274741 | 0.03814272 | 0.13274741 | 0.03814272 |
Analysis of real-world data
To further illustrate our method, we analyzed a dataset created from the IBM® Health MarketScan® Research Databases, which contain de-identified individual-level healthcare claims information from employers, health plans, hospitals, and Medicare and Medicaid programs fully compliant with U.S. privacy laws and regulations (e.g., Health Insurance Portability and Accountability Act). The study dataset included 9736 patients aged 18–79 years who received sleeve gastrectomy or Roux-en-Y gastric bypass between 1/1/2010 and 9/30/2015. The outcome of interest was any hospitalization during the 2-year follow-up period after surgery. The exposure variable was set to 1 if the patient received sleeve gastrectomy and 0 if the patient received Roux-en-Y gastric bypass. We estimated the risk ratio of hospitalization comparing sleeve gastrectomy with Roux-en-Y gastric bypass using the pooled individual-level data analysis and the summary-level information approach, adjusting for the following covariates identified during the 365-day period prior to the surgery: age; sex; Charlson/Elixhauser combined comorbidity score; diagnosis of asthma, atrial fibrillation, atrial flutter, coronary artery disease, deep vein thrombosis, gastroesophageal reflux disease, hypertension, ischemic stroke, myocardial infarction, pulmonary embolism, and sleep apnea; use of anticoagulants, assistive walking device, and home oxygen; unique drug classes dispensed and unique generic medications dispensed.
Of the 9736 patients in the study dataset, 7877 (81%) underwent sleeve gastrectomy and 1859 (19%) underwent Roux-en-Y gastric bypass. The outcome event was not rare in the study, with 1485 (19%) sleeve gastrectomy patients and 608 (33%) Roux-en-Y gastric bypass patients having at least one hospitalization during the 2-year follow-up period. We randomly partitioned the dataset into three smaller datasets with 2000, 3000, and 4736 patients to create a “simulated” distributed data network. As the reference, the pooled individual-level data analysis produced \( {\hat{\beta}}_E=-0.4632219 \) with a standard error of 0.0422368 and a 95% confidence interval of (−0.5460061, −0.3804377). These results corresponded to an adjusted risk ratio of \( \exp \left({\hat{\beta}}_E\right)=0.63 \) with a 95% confidence interval of (0.58, 0.68). Starting from \( \beta^{(0)} = 0 \), the proposed summary-level modified Poisson method took seven iterations to converge and produced point estimates and sandwich variance-based standard errors identical to those observed in the corresponding pooled individual-level data analysis (Table 2). The adjusted odds ratio from logistic regression was 0.53. As expected, interpreting the estimated adjusted odds ratio as an estimate of the adjusted risk ratio exaggerated the protective effect of sleeve gastrectomy compared to Roux-en-Y gastric bypass, resulting in an effect estimate that was further from the null than the modified Poisson regression estimate (a 47% versus a 37% relative reduction, a difference of 10 percentage points).
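The reported interval and the comparison with the odds ratio can be reproduced from the point estimate and standard error alone. The short sketch below assumes a Wald interval with the normal quantile 1.96. The variable names are illustrative.

```python
import math

beta_e = -0.4632219      # estimated log risk ratio for the exposure (Table 2)
se = 0.0422368           # sandwich variance-based standard error (Table 2)
z = 1.96                 # assumed normal quantile for a 95% Wald interval

# Confidence limits on the log scale, then exponentiated to the risk-ratio scale.
lo, hi = beta_e - z * se, beta_e + z * se
rr, rr_lo, rr_hi = math.exp(beta_e), math.exp(lo), math.exp(hi)

# Relative reduction implied by each estimate when read as a risk ratio.
odds_ratio = 0.53        # adjusted odds ratio reported from logistic regression
reduction_rr = 1.0 - rr
reduction_or = 1.0 - odds_ratio

print(f"log-scale CI: ({lo:.7f}, {hi:.7f})")
print(f"risk ratio {rr:.2f}, 95% CI ({rr_lo:.2f}, {rr_hi:.2f})")
print(f"relative reduction: {reduction_rr:.0%} (risk ratio) "
      f"vs {reduction_or:.0%} (odds ratio read as a risk ratio)")
```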
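Table 2. Point Estimates and Standard Errors Using the Summary-Level Modified Poisson Method and Pooled Individual-Level Data Analysis: Analysis of Real-World Data
Parameter | Summary-level method: estimate | Summary-level method: standard error | Pooled individual-level analysis: estimate | Pooled individual-level analysis: standard error |
Intercept | −1.5653147 | 0.1040146 | −1.5653147 | 0.1040146 |
Exposure (sleeve gastrectomy vs. Roux-en-Y gastric bypass) | −0.4632219 | 0.0422368 | −0.4632219 | 0.0422368 |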
Demographics |
Age (years) | 0.0029253 | 0.0018694 | 0.0029253 | 0.0018694 |
Female sex | 0.0860355 | 0.0478638 | 0.0860355 | 0.0478638 |
Combined comorbidity score | 0.0543412 | 0.0125169 | 0.0543412 | 0.0125169 |
Diagnosis of |
Asthma | −0.0212109 | 0.0538624 | −0.0212109 | 0.0538624 |
Atrial fibrillation | 0.0835820 | 0.1249359 | 0.0835820 | 0.1249359 |
Atrial flutter | 0.1638268 | 0.2598721 | 0.1638268 | 0.2598721 |
Coronary artery disease | 0.1597157 | 0.0746721 | 0.1597157 | 0.0746721 |
Deep vein thrombosis | 0.1738365 | 0.1371837 | 0.1738365 | 0.1371837 |
Gastroesophageal reflux disease | −0.0586971 | 0.0393428 | −0.0586971 | 0.0393428 |
Hypertension | 0.0046885 | 0.0448018 | 0.0046885 | 0.0448018 |
Ischemic stroke | −0.1592596 | 0.1986055 | −0.1592596 | 0.1986055 |
Myocardial infarction | 0.2069423 | 0.1509918 | 0.2069423 | 0.1509918 |
Pulmonary embolism | 0.2846648 | 0.1436404 | 0.2846648 | 0.1436404 |
Sleep apnea | −0.0261550 | 0.0396516 | −0.0261550 | 0.0396516 |
Use of |
Anticoagulants | 0.1064905 | 0.1205252 | 0.1064905 | 0.1205252 |
Assistive walking device | 0.0908909 | 0.1394227 | 0.0908909 | 0.1394227 |
Home oxygen | 0.0732620 | 0.1162295 | 0.0732620 | 0.1162295 |
Number of drug dispensing |
Unique drug classes | −0.0579458 | 0.0166577 | −0.0579458 | 0.0166577 |
Unique generic medications | 0.0673783 | 0.0136406 | 0.0673783 | 0.0136406 |
As expected, we had difficulty fitting a log-binomial regression model to this bariatric surgery dataset. With the starting value \( \beta^{(0)} = 0 \), the first iterated estimate \( \beta^{(1)} \) could not be calculated because the formula for \( \beta^{(1)} \) under the log-binomial regression model includes \( 1 - \exp(Z_i^{T}\beta^{(0)}) \) in the denominator, which takes the value 0 when \( \beta^{(0)} = 0 \). We also considered three non-zero starting values. We first set the starting values for all parameters to 0.05, but the analysis stopped at the second iteration due to matrix singularity. We then let the starting values be the estimates obtained from a logistic regression model fit using the entire bariatric surgery dataset, and the analysis converged with \( {\hat{\beta}}_E=-0.7369683 \). Finally, we specified the starting values as the estimates obtained from the modified Poisson regression fit using the entire bariatric surgery dataset, and the analysis converged with \( {\hat{\beta}}_E=-0.4544135 \). These results illustrated the convergence problems of log-binomial regression and the sensitivity of this method to starting values. In comparison, the modified Poisson analysis had no convergence problems, and its estimates remained the same as those presented in Table 2 when using these alternative starting values.
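The failure at a zero starting value can be illustrated numerically. The sketch below assumes the standard Fisher scoring form for the log-binomial model, in which each individual contributes \( (y_i - \mu_i)/(1 - \mu_i) \) to the score and \( \mu_i/(1 - \mu_i) \) to the information, with \( \mu_i = \exp(Z_i^{T}\beta) \). The formula used in the reported analysis is not reproduced here, and the toy design matrix and function names are hypothetical.

```python
import numpy as np

def poisson_terms(Z, y, beta):
    """Score and information for the modified Poisson estimating equation."""
    mu = np.exp(Z @ beta)
    score = Z.T @ (y - mu)
    info = (Z * mu[:, None]).T @ Z
    return score, info

def log_binomial_terms(Z, y, beta):
    """Score and Fisher information for log-binomial regression (assumed form)."""
    mu = np.exp(Z @ beta)
    # Silence NumPy's floating-point warnings so the non-finite values can be shown.
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        score = Z.T @ ((y - mu) / (1.0 - mu))         # divides by 1 - exp(Z_i^T beta)
        info = (Z * (mu / (1.0 - mu))[:, None]).T @ Z
    return score, info

# Toy data: intercept, binary exposure, one continuous covariate (hypothetical).
rng = np.random.default_rng(0)
n = 8
Z = np.column_stack([np.ones(n), rng.binomial(1, 0.5, n), rng.uniform(0.0, 1.0, n)])
y = rng.binomial(1, 0.3, n).astype(float)

beta0 = np.zeros(Z.shape[1])                          # starting value beta^(0) = 0
_, info_poisson = poisson_terms(Z, y, beta0)
_, info_logbin = log_binomial_terms(Z, y, beta0)

print("Poisson information finite at 0:     ", np.all(np.isfinite(info_poisson)))
print("log-binomial information finite at 0:", np.all(np.isfinite(info_logbin)))

# With fitted risks strictly below 1, the log-binomial terms are finite again.
beta_neg = np.array([-1.5, 0.0, 0.0])
_, info_logbin_neg = log_binomial_terms(Z, y, beta_neg)
print("log-binomial information finite at a negative intercept:",
      np.all(np.isfinite(info_logbin_neg)))
```

At \( \beta = 0 \) every fitted risk equals 1, so every log-binomial weight has a zero denominator and the first update is undefined. The Poisson weights are simply \( \mu_i = 1 \), so the first update is well defined.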