We consider a similar situation as Debray et al. [
6] in that we can use reported summary statistics from previous logistic regression models with different sets of covariates and at least one IPD from the publications or the authors themselves. Suppose that each published prediction model has a subset of covariates in the IPD, and is constructed for same prediction task. The number of published prediction models is
N (
\(i=1,\dots,N\)) and the
ith article reports the estimated coefficients,
\(\hat {\boldsymbol {\theta }}_{i}\), and the covariance matrix
\(\boldsymbol {\Sigma }_{i}=\text {Cov}(\hat {\boldsymbol {\theta }}_{i})\) (at least its diagonal elements). Each
\(\hat {\boldsymbol {\theta }}_{i}\) is a column vector of possibly different length,
\(\hat {\boldsymbol {\theta }}_{i}=(\hat {\theta }_{i1},\dots,\hat {\theta }_{i p_{i}})^{T}\), and
p
i
represents the differences of covariate sets among studies. To synthesize these regression coefficients, Debray et al. [
6] utilize the mean or zero imputation methods for omitted coefficients and apply the technique of multivariate meta-analysis. Another simple approach is to apply multivariate meta-analysis using studies with common covariate sets [
6]. However, the former approach leads to biased results and the latter is not biased but leads to loss of efficiency by ignoring indirect information from omitted studies. In order to improve this situation, we propose a new method for synthesis of logistic regression coefficients under different sets of covariates.
For simplicity, we assume the case where the true model has the full set of covariates in the IPD, which means the prior models have subsets of covariates and are considered as under-specified models. Note that since the omitted variables from the true models (full models) may be correlated with the included variables, the subset models are confounded and biased compared with true models. Our method can be generalized to more complex cases where the previous models for meta-analysis are a mixture of under- and over-specified models. Our method can be also applied to more general models such as generalized linear models, including linear regression and other non-linear regression models with some modification. This study was approved by the Institutional Review Board at the National Cancer Center in Tokyo and The University of Tokyo, Japan.
The omitted variable bias in the logistic regression model
Firstly, we introduce the omitted variable bias under one original logistic regression setting, which can afterward be extended to the meta-analysis setting with the assumption that the covariate sets differ among studies.
Let
\(\boldsymbol {X}=\left (1,X_{2}, \dots, X_{j}\right)^{T}\) and
\(\boldsymbol {Z}=(Z_{1}, \dots, Z_{k})^{T}\) be vectors of covariates and
Y∈{0,1} be a binary response variable. Suppose the data-generating process (DGP) can be formulated by the true model:
$$\begin{array}{@{}rcl@{}} \text{logit} P(Y=1|\boldsymbol{X},\boldsymbol{Z})=\boldsymbol{X}^{T}\boldsymbol{\alpha}+\boldsymbol{Z}^{T}\boldsymbol{\beta}, \end{array} $$
(1)
where
α,
β are the true parameters of interest and “logit” means the logistic function,
\(\text {logit}(p)=\text {log} \left (p/(1-p)\right)\). The misspecified model is assumed to be fitted, which omits relevant covariates
Z from the true model (
1). Specifically,
$$\begin{array}{@{}rcl@{}} \text{logit}\,\, P(Y=1|\boldsymbol{X})=\boldsymbol{X}^{T}\boldsymbol{\gamma}. \end{array} $$
(2)
We investigate the degree to which the regression coefficient γ, estimated under the misspecified model, differs from the true parameters α,β, and define the differences as the omitted variable bias.
To derive the omitted variable bias, the unbiasedness condition of the estimating function can be employed [
15]; this approach is a generalized result of the landmark paper of White [
16]. In the meta-analysis framework, the idea of the omitted variable bias can function as an analogy of different covariate sets and as a representation of the incorporation of indirect information from prior models.
In general, score functions from misspecified models cannot satisfy the unbiasedness condition of estimating functions. Therefore, the first step is to find the solution of the unbiasedness condition of estimating function (
3), i.e., find
γ∗=
f(
α,
β,
p
XZ
), which is the function of the true parameters
α,
β and the joint distribution of covariates,
p
XZ
;
$$\begin{array}{@{}rcl@{}} E\left[\left\{Y-\frac{1}{1+\exp\left(-\boldsymbol{X}^{T}\boldsymbol{\gamma}^{*}\right)}\right\} \boldsymbol{X}\right] = 0. \end{array} $$
(3)
Here, the expectation is taken by the true joint distribution of
Y,
X and
Z defined from (
1) and
p
XZ
. Under some regularity conditions, the maximum likelihood estimate of
γ from the misspecified model (
2) is a consistent estimate of
γ∗.
Secondly, for assessing biases caused by dropping the important predictors, we assume to have (at least) one IPD with the outcome and the full covariates
X,
Z. This assumption is considered reasonable for researchers who want to develop a new prediction model on their own IPD, incorporating prior summary statistics from regression results. Using the IPD, we can empirically solve (
3) and derive the omitted variable bias.
Note that in the general case, the function f cannot be written in closed form due to its nonlinearity, but in the following case where every omitted covariate is a continuous variable it can be explicitly written.
Special case : omitted covariates, Z, are continuous variables
In general, the maximum likelihood estimate of
γ in (
2) (consistently) estimates
γ∗ as the solution of (
3). In particular, for the cases of normal continuous variables, the following analytical evaluation can be adapted. Now we suppose
Z|
X follows the multivariate normal distribution,
N(
μ
Z|X
,
Ω
Z|X
). Based on the normality assumption of
Z|X, we have
$$\begin{array}{@{}rcl@{}} \boldsymbol{Z}=\boldsymbol{\Delta}\boldsymbol{X}+\boldsymbol{\tau} \end{array} $$
where \(\boldsymbol {\Delta }=(\boldsymbol {\delta }_{1}, \dots, \boldsymbol {\delta }_{k})^{T}\) is k×j matrix and τ∼N
τ
(0,Ω
Z|X
).
Applying the technique of Chao et al. [
17] to our covariate structure and using the probit approximation of logistic distribution, the expectation of
Y conditional on
X can be expressed as follow:
$${\fontsize{7.3}{6}\begin{aligned} E[\!Y|\boldsymbol{X}]=P(Y=1|\boldsymbol{X})&=\int \frac{1}{1+\exp\left(-\boldsymbol{X}^{T}\boldsymbol{\alpha}-(\boldsymbol{\Delta}\boldsymbol{X}+\boldsymbol{\tau}\right)^{T} \boldsymbol{\beta})}N_{\boldsymbol{\tau}}(0,\boldsymbol{\Omega}_{\boldsymbol{Z|X}}) d\boldsymbol{\tau}\\ &\approx \Phi \left[c \left\{ \frac{\boldsymbol{X}^{T}(\boldsymbol{\alpha}+\boldsymbol{\Delta}^{T}\boldsymbol{\beta})}{\sqrt{1+c^{2}\boldsymbol{\beta}^{T}\boldsymbol{\Omega}_{\boldsymbol{Z|X}}\boldsymbol{\beta}} }\right\}\right], \end{aligned}} $$
where
Φ is the cumulative distribution function of standard normal distribution and
c=16(3)
1/2/15
π is the adjustment factor for probit approximation of the logistic distribution proposed by Johnson et al. [
18].
In order to satisfy the unbiasedness condition of the estimating function, (
3), we have
$${ \fontsize{9.1}{6}\begin{aligned} E \left[\Phi \left[c \left\{ \frac{\boldsymbol{X}^{T}\left(\boldsymbol{\alpha}+\boldsymbol{\Delta}^{T}\boldsymbol{\beta}\right)}{\sqrt{1+c^{2}\boldsymbol{\beta}^{T}\boldsymbol{\Omega}_{\boldsymbol{Z|X}}\boldsymbol{\beta}} }\right\}\right] \boldsymbol{X}- \Phi \left\{c\left(\boldsymbol{X}^{T}\boldsymbol{\gamma}^{*}\right)\right\}\boldsymbol{X}\right]=0 \end{aligned}} $$
Therefore, the function
f should be denoted as
$$\begin{array}{@{}rcl@{}} \boldsymbol{\gamma^{*}}=\boldsymbol{f}(\boldsymbol{\alpha},\boldsymbol{\beta}, p_{\boldsymbol{XZ}}) \approx \frac{\boldsymbol{\alpha}+\boldsymbol{\Delta}^{T}\boldsymbol{\beta}}{\sqrt{1+c^{2}\boldsymbol{\beta}^{T}\boldsymbol{\Omega}_{\boldsymbol{Z|X}}\boldsymbol{\beta}}} \end{array} $$
(4)
which is the generalization of the results of Chao et al. [
17] and Cramer et al. [
19].
Suppose there exist
N reported models (
\(i=1,\dots,N\)) with their estimated coefficients of
α,
β and
γ and their covariance matrices, and when
\(i=1,\dots,N\), studies fit the true model (
1) with a full set of covariates,
X and
Z, and when
\(i=n+1,\dots, N\), studies mistakenly omit covariates
Z. We assume the homogeneity of studies (i.e., the distribution of covariates and outcomes are common across the studies in the meta-analysis). Here we show only the case where
Z is omitted, but the case where
X is omitted can be considered in the same manner, and further, it is easy to generalize to various other omittance patterns. To synthesize the estimated coefficients vectors from the logistic regression models, we apply a GNLS method to incorporate the unequal variances of studies into meta-analysis.
Based on this setting, the nonlinear model for meta-analysis can be formulated as follows;
$$\begin{array}{@{}rcl@{}} \hat{\boldsymbol{\theta}}_{i}=\boldsymbol{g}_{i}(\boldsymbol{\alpha}, \boldsymbol{\beta}, p_{\boldsymbol{XZ}})+\boldsymbol{\varepsilon}_{i} \ \ \ \ \ (i=1,\dots,n), \end{array} $$
(5)
where
$$\begin{array}{@{}rcl@{}} \boldsymbol{g}_{i}(\boldsymbol{\alpha}, \boldsymbol{\beta}, p_{\boldsymbol{XZ}}) =\left\{\begin{array}{ll} (\boldsymbol{\alpha}^{T},\boldsymbol{\beta}^{T})^{T} &(i=1,\dots,n) \\ \boldsymbol{f}(\boldsymbol{\alpha}, \boldsymbol{\beta}, p_{\boldsymbol{XZ}}) & (i=n+1,\dots,N), \end{array}\right., \\ \left(\begin{array}{c} \boldsymbol{\varepsilon}_{1} \\ \vdots \\ \boldsymbol{\varepsilon}_{N} \end{array} \right) \sim N(\boldsymbol{0},\boldsymbol{\Sigma}), \ \ \boldsymbol{\Sigma}=\left[\begin{array}{ccc} \text{Cov}(\hat{\boldsymbol{\theta}}_{1}) & \ldots & \boldsymbol{0}\\ \vdots & \ddots & \vdots\\ \boldsymbol{0} &\ldots & \text{Cov}(\hat{\boldsymbol{\theta}}_{N}) \end{array}\right], \end{array} $$
and \(\hat {\boldsymbol {\theta }}_{i}\) is the column vector of reported coefficients in the ith study. The function f() comes from the omitted variable bias formula introduced in the previous section, whose formulation is reasonable if an assumption of homogeneity of studies in meta-analysis is acceptable.
In a large sample, the estimated coefficients \(\hat {\boldsymbol {\theta }}_{i}\) are (approximately) normally distributed with mean θ
i
=g
i
(α,β,p
XZ
) and covariance \(\text {Cov}(\hat {\boldsymbol {\theta }}_{i})\). This asymptotic normality of estimated coefficients leads to the justification of the GNLS approach.
Under the model (
5), overall estimates of the regression coefficients
\(\hat {\boldsymbol {\alpha }}^{*}\) and
\(\hat {\boldsymbol {\beta }}^{*}\) can be obtained by GNLS as follows:
$$\begin{array}{@{}rcl@{}} \left(\hat{\boldsymbol{\alpha}}^{*T},\hat{\boldsymbol{\beta}}^{*T}\right)^{T} &=& \mathop{\arg\min}_{\boldsymbol{\alpha},\boldsymbol{\beta}} \sum_{i=1}^{N}\left\{\hat{\boldsymbol{\theta}}_{i} -\boldsymbol{g}_{i}(\boldsymbol{\alpha}, \boldsymbol{\beta}, \hat{p}_{\boldsymbol{XZ}})\right\}^{T}\\&&\times\boldsymbol{\Sigma}^{-1} \left\{\hat{\boldsymbol{\theta}}_{i} - \boldsymbol{g}_{i}(\boldsymbol{\alpha}, \boldsymbol{\beta}, \hat{p}_{\boldsymbol{XZ}})\right\}, \end{array} $$
where \(\hat {p}_{\boldsymbol {XZ}}\) is an estimate of p
XZ
from the IPD.
The diagonal of the covariance matrix
Σ is typically reported in the literature but the off-diagonals are unknown, thus off-diagonal elements can be imputed by using the IPD. We employ the same imputation method as Debray et al. [
6] based on the IPD as follows;
$$\begin{array}{@{}rcl@{}} \text{Cov} (\hat{\boldsymbol{\theta}}_{i,W})=\hat{V}_{i}^{\frac{1}{2}}R_{IPD}\hat{V}_{i}^{\frac{1}{2}}, \end{array} $$
where
\(\text {Cov} (\hat {\boldsymbol {\theta }}_{i,W})\) is a working covariance matrix of the
ith study which is applied to one of the block diagonal elements of
Σ,
\(\hat {V}_{i}=\text {diag}(\text {Cov} (\hat {\boldsymbol {\theta }}_{i}))\) is a diagonal matrix whose diagonal elements are the estimated standard errors (SE) reported from each study and
R
IPD
is a working correlation matrix of coefficients calculated from the IPD. The covariance matrix can be calculated with a sandwich estimator under the model misspecification assumption instead of the imputation based on the IPD [
15], but there computational complexity remains a problem and little improvement is gained in simulations studies. Furthermore, even if the covariance matrix is misspecified, the proposed estimator is still consistent and asymptotically normally distributed with a sandwich covariance matrix. This robustness follows the asymptotic theory of the generalized estimating equations. In this situation, let
\(\hat {\boldsymbol {\alpha }}_{W}\) and
\(\hat {\boldsymbol {\beta }}_{W}\) denote our estimators with the working covariance matrix. The covariance matrix of these estimators can be estimated by
$$\begin{array}{@{}rcl@{}} \left(\hat{\boldsymbol{D}}^{T}\boldsymbol{\Sigma}_{W}^{-1}\hat{\boldsymbol{D}}\right)^{-1} \hat{\boldsymbol{D}}^{T} \boldsymbol{\Sigma}_{W}^{-1} \text{Cov}\left(\hat{\boldsymbol{\theta}}_{I}\right) \boldsymbol{\Sigma}_{W}^{-1} \hat{\boldsymbol{D}} \left(\hat{\boldsymbol{D}}^{T} \boldsymbol{\Sigma}_{W}^{-1} \hat{\boldsymbol{D}}\right)^{-1}, \end{array} $$
where
\(\hat {\boldsymbol {D}}=\left (\hat {\boldsymbol {D}}_{1}^{T},\dots,\hat {\boldsymbol {D}}_{N}^{T}\right)^{T}, \hat {\boldsymbol {D}}_{i}=\partial \boldsymbol {g}_{i}\left (\boldsymbol {\alpha }, \boldsymbol {\beta }, \hat {p}_{\boldsymbol {XZ}}\right)/\partial \left (\boldsymbol {\alpha }^{T}, \boldsymbol {\beta }^{T}\right)|_{\left (\boldsymbol {\alpha }^{T}, \boldsymbol {\beta }^{T}\right)=\left (\hat {\boldsymbol {\alpha }}^{T}_{W}, \hat {\boldsymbol {\beta }}^{T}_{W}\right)}\),
Σ
W
is a working covariance matrix, and
\(\text {Cov}(\hat {\boldsymbol {\theta }}_{I})=\left (\left \{\hat {\boldsymbol {\theta }}_{i} - \boldsymbol {g}_{i}\left (\hat {\boldsymbol {\alpha }}_{W}, \hat {\boldsymbol {\beta }}_{W}, \hat {p}_{\boldsymbol {XZ}}\right)\right \}\left \{\hat {\boldsymbol {\theta }}_{i} - \boldsymbol {g}_{i}\left (\hat {\boldsymbol {\alpha }}_{W}, \hat {\boldsymbol {\beta }}_{W}, \hat {p}_{\boldsymbol {XZ}}\right)^{T}\right \}\right)\) [
20,
21]. This idea essentially comes from Liu et al. [
21] and can be regarded as analogy of the result proposed by Chen et al. [
22].
In addition to the above, if the working covariance matrix is a good approximation of the true covariance structure, the following relationship holds:
$$\begin{array}{@{}rcl@{}} \text{Avar} \left(\hat{\boldsymbol{\theta}}^{*}\right)\le \text{Avar} \left(\hat{\boldsymbol{\theta}}_{M}\right) \le \text{Avar} \left(\hat{\boldsymbol{\theta}}_{S}\right), \end{array} $$
where Avar denotes an asymptotic covariance matrix, and \(\hat {\boldsymbol {\theta }}^{*}\), \(\hat {\boldsymbol {\theta }}_{M}\) and \(\hat {\boldsymbol {\theta }}_{S}\) are the estimates of \(\boldsymbol {\theta }=\left (\boldsymbol {\alpha }^{T},\boldsymbol {\beta }^{T}\right)^{T}\) obtained from our proposed method, from multivariate meta-analysis using only the studies with full covariates and from a single study with full covariates, respectively, when the number of studies N goes to infinity.
Here, we assume a fixed effect model which presumes that there is no heterogeneity in the distribution of covariates and in the values of the parameters of interest. This assumption may sometimes be unrealistic. Therefore, we recommend considering whether this assumption is reasonable based on background knowledge or reported information. In addition, we can propose how to modify this to a random effects model to incorporate the heterogeneity by assuming that the parameters underlying studies and the parameters of distribution of covariates follow some distribution. For example, considering the case that all omitted variables are continuous (i.e., Section ‘
The omitted variable bias in the logistic regression model)’, we can incorporate random effects by assuming that
α,
β,
Δ and
Ω
Z|X
in (
4) follow distributions. Random effects in
α,
β accommodate the heterogeneity of parameters and random effects in
Δ and
Ω
Z|X
accommodates that of distribution of covariates.