Our goal is to incorporate information from multiple family informants’ family health history observations into a common, integrated FHH. That is, we wish to predict family members’ disease statuses from MIFHH observations and use those estimates to calculate disease risk for unaffected individuals in a family. In the simplest case, we can use arithmetical methods that ignore sources of variation and error in MIFHH data [26]. Alternatively, a statistical model accounting for the process that gives rise to such variation in disease status reports may be used to estimate the integrated FHH. In this section, we introduce a Bayesian hierarchical logistic regression model for improving the precision of such estimation based on MIFHH data.
We begin by defining a common notation. Let each realization of a pedigree containing family health history from m informants on n family members be represented by Y, an m × n matrix. Each cell Yij holds the ith informant’s report of the jth family member’s disease status. Ideally, our integration solution reduces the FHH to a simple n-dimensional vector y = (y1, …, yn).
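As an illustration of this notation, the following minimal Python sketch stores a toy Y and extracts the reports available for each family member. The data values, the use of `None` for missing reports, and the helper name `reports_on` are assumptions made for illustration. The unweighted mean at the end is only a naive stand-in for an arithmetical reduction to y and is not claimed to be the method of [26].

```python
from statistics import fmean
from typing import List, Optional

# Hypothetical toy pedigree: m = 3 informants (rows), n = 4 family members (columns).
# 1 = disease reported, 0 = no disease reported, None = informant gave no report.
Y: List[List[Optional[int]]] = [
    [1, 0, None, 1],
    [1, 1, 0, None],
    [None, 0, 0, 1],
]

def reports_on(Y: List[List[Optional[int]]], j: int) -> List[int]:
    """Return the observed reports on family member j (column j of Y)."""
    return [row[j] for row in Y if row[j] is not None]

m, n = len(Y), len(Y[0])

# k_j: number of informants who reported on each family member j.
k_per_member = [len(reports_on(Y, j)) for j in range(n)]

# Naive reduction of Y to an n-dimensional vector: share of positive reports.
naive_y = [fmean(reports_on(Y, j)) for j in range(n)]
```

Here `k_per_member` plays the role of k for each family member, and `naive_y` ignores every source of reporting error that the model in the next section is meant to capture.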
Statistical model
We can treat the case of MIFHH integration as a classification problem. Classification models allow the researcher to infer the state of a variable from model parameters and data. We infer one of two states from a set of possibly discrepant observations on a particular individual: does individual j truly have a particular disease state (y = 1 if yes, and y = 0 if no)? Because we do not observe the true disease state in typical FHH data per se, we treat it as a latent variable. Here, we assume informants’ accounts of disease statuses of their family members represent evidence of the underlying true disease state of the individual. While several candidate models for such classification tasks in clinical contexts exist (e.g., Item Response Theory, Naive Bayes, Random Forests), the hierarchically structured and dependent nature of MIFHH data makes it particularly challenging to model. Moreover, as disease contexts within families are likely informed by population parameters, better models would incorporate informative priors reflecting this information. As such, we propose using a Bayesian hierarchical logistic regression model that accounts for variability in outcome arising from both informants and the family members they are reporting on, together with informative priors.
Following the Bayesian hierarchical logistic regression models of [31, 32], we assume that individual reports of disease statuses are distributed Bernoulli with probability θ. As we have multiple observations from informants on different family members (but not all m informants report on all n family members), the response vector for each jth family member is of length k × 1, where k is the number of informants reporting on j, and thus 1 ≤ k ≤ m ≤ n. When all family members are informants, m = n. Specifically,
$$ {y}_{ij}\sim Bern\left({\theta}_{ij}\right),\forall i\in \left(1,2,\dots, k\right) $$
(1)
and model θij as a latent variable via the logit link function (\( \phi =\ln \frac{\mu }{1-\mu } \), where μ is the predicted mean vector of the Bernoulli parameter θ):
$$ \phi \left({\theta}_{ij}\right)={\beta}_0+{X}_{ij}\beta +{W}_i{b}_i+{\epsilon}_{ij}. $$
(2)
The first term on the right-hand side of Eq. 2 is the level-1 intercept (β0), and the next two terms contain the matrix of level-1 covariates X and the matrix of level-2 covariates W, respectively. These matrices have dimensionality k × p (p being the number of level-1 covariates) and k × q (q being the number of level-2 covariates), respectively. The final term (ϵ) captures the errors, which are optionally assumed to be over-dispersed following a normal distribution:
$$ {\epsilon}_j\sim N\left(0,{\sigma}^2{I}_{ij}\right), $$
(3)
where Iij denotes the corresponding element of the identity matrix I. In practice, however, the over-dispersion parameter \( {\sigma}^2 \) can be fixed at 1.
The level-2 effects are also assumed to be distributed normally with mean 0 and covariance D:
$$ {b}_i\sim {N}_q\left(0,D\right). $$
(4)
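To make the generative reading of Eqs. 1–4 concrete, the following is a minimal simulation sketch in Python. It is not the estimation code used in this study. It assumes a single random intercept per informant (q = 1, with Wi a column of ones), and the function names, covariate values, and parameter values are all hypothetical.

```python
import math
import random

def inv_logit(x: float) -> float:
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def simulate_reports(beta0, beta, x_rows, informant_ids, d_sd, sigma, rng):
    """Simulate one report per informant-family member dyad.

    Assumption: W_i is a single column of ones, so b_i is a scalar
    informant-level intercept with standard deviation d_sd.
    """
    # Eq. 4: one level-2 effect per informant, drawn from N(0, d_sd^2).
    b = {i: rng.gauss(0.0, d_sd) for i in sorted(set(informant_ids))}
    thetas, reports = [], []
    for x, i in zip(x_rows, informant_ids):
        eps = rng.gauss(0.0, sigma)                      # Eq. 3: dyad-level error
        fixed = sum(xk * bk for xk, bk in zip(x, beta))  # X_ij * beta
        eta = beta0 + fixed + b[i] + eps                 # Eq. 2: linear predictor
        theta = inv_logit(eta)
        thetas.append(theta)
        reports.append(1 if rng.random() < theta else 0)  # Eq. 1: Bernoulli draw
    return thetas, reports

rng = random.Random(0)
x_rows = [[0.5], [1.2], [-0.3], [0.0]]   # hypothetical level-1 covariate (p = 1)
informant_ids = [0, 0, 1, 1]             # two informants, two dyads each
thetas, reports = simulate_reports(-1.99, [0.8], x_rows, informant_ids, 1.0, 1.0, rng)
```

Dyads that share an informant share the same draw of b, which is how the hierarchy induces dependence among reports from the same person.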
The conjugate priors for this model as derived by [32] assume that each level-1 effect β is distributed normally,
$$ \beta \sim {N}_p\left(b,{B}^{-1}\right), $$
(5)
where b is the mean vector and \( {B}^{-1} \) is the variance of β, which can optionally be modeled as Inverse-Wishart if level-1 effects are assumed to be correlated but is here set to be non-informative. Next, the residual error variance follows an Inverse-Gamma distribution,
$$ {\sigma}^2\sim IG\left(\nu, 1/\delta \right), $$
(6)
where ν and δ are the shape and scale hyperparameters of the Inverse-Gamma distribution. Finally, we assume an Inverse-Wishart prior for the covariance matrix D of the level-2 effects:
$$ D\sim IW\left(\psi, \rho \right), $$
(7)
where the scale and shape hyperparameters of the Inverse-Wishart are defined such that ψ is a q × q positive definite matrix and ρ is a scalar such that ρ ≥ q, respectively. A Kruschke-style diagram of this hierarchical model [33] is depicted in Additional file 1.
Information can be incorporated into these priors by specifying appropriate hyperparameter values. For instance, one may incorporate prior information about the population prevalence of a disease by setting the prior mean hyperparameter for the intercept equal to the logit-transformed prevalence; the posterior estimate of the model intercept then blends the average reported disease rate observed in the data with this prior. We sample parameters directly from the posterior of this model using Markov chain Monte Carlo (MCMC) with the MCMCpack package for R as detailed in [32, 34], which implements Algorithm 2 from [35]. For each model we draw a sample of size 20,000 with a 5,000-iteration burn-in, and retain every other draw, with an adaptive mean acceptance rate of about 45%. Thus, our final sample represents 10,000 draws from the posterior of each set of model parameters.
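The sampling bookkeeping can be checked with a short Python sketch. It assumes that the 5,000 burn-in iterations are run in addition to the 20,000 sampling iterations, which is the reading consistent with 10,000 retained draws. The function name is hypothetical and the sketch does not call MCMCpack.

```python
def retained_iterations(burnin: int, mcmc: int, thin: int) -> list:
    """Return the 0-based iteration indices kept after burn-in and thinning."""
    return list(range(burnin, burnin + mcmc, thin))

kept = retained_iterations(burnin=5000, mcmc=20000, thin=2)
n_kept = len(kept)   # 20,000 post-burn-in iterations, every other one kept
```

If the burn-in were instead counted inside the 20,000 iterations, only 7,500 draws would remain after thinning.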
The model described above draws from the posterior of the parameters associated with the informant-informee dyad reports of disease statuses (i.e., at the dyad level). To approximate the equivalent of the individual level reports, we simply average over each individual family member’s vector of posterior predictives (θ) as described below.
Empirical example
The data we use to illustrate our model include MIFHH information collected in 2011–2013 from 128 informants from 45 families residing in the greater Cincinnati area. The number of informants per family ranges from 2 to 5, with an average of 2.8. Each informant independently provided family history of type 2 diabetes for their first- and second-degree biological relatives, and we also recorded self-reports of disease status from the informants. Additionally, each informant provided demographic and lifestyle information, such as tobacco and alcohol use and weight status, about each biological relative and themselves. Details about design and data features of this study can be found elsewhere [26]. The final analytic dataset consists of 2159 FHH records contributed by informants from all 45 families, almost two-thirds of which (n = 1337) are multiple accounts from informants of the same family with respect to common relatives.
The analysis proceeds in two stages. First, for each family member enumerated we estimate diabetes status as a latent variable with multiple observations provided by different informants using the procedure detailed below. The number of informant-based observations per individual family member ranges from 1 to 5. In this model we are able to systematically account for (a) population-level prior prevalence of type 2 diabetes, (b) family member characteristics (at level-1), and (c) informant or family-level variability (at level-2). We assume that the hyperparameters for the mean and variance of β0 (the level-1 intercept) are ≈ −1.99 and 100, respectively. This specification models the population-level prior prevalence of type 2 diabetes by a normal distribution with a mean equivalent to just above the observed background probability in the United States (which is roughly 12%, thus −1.99 ≈ ln \( \frac{0.12}{1-0.12} \)) and a wide, but finite, variance. We additionally assume vague level-2 covariance priors (with hyperparameters set to ρ = q and ψ = I × q), which assume within-informant covariance in reports on family members for the informant level-2 models and within/between informant covariance in the family level-2 models. Finally, for the residual error variances (σ2), we assume hyperparameters that result in non-informative priors.
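The arithmetic behind the intercept prior can be verified with a short Python sketch. The helper names `logit` and `inv_logit` are introduced here for illustration only.

```python
import math

def logit(p: float) -> float:
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def inv_logit(x: float) -> float:
    """Probability corresponding to a log-odds value x."""
    return 1.0 / (1.0 + math.exp(-x))

prior_mean = logit(0.12)             # log-odds of a 12% background prevalence
implied_prevalence = inv_logit(-1.99)  # prevalence implied by the rounded prior mean
prior_sd = math.sqrt(100.0)          # a prior variance of 100 is a standard deviation of 10
```

The exact log-odds of 12% is slightly below −1.99, so the rounded value corresponds to a prevalence marginally above 12%, in line with the "just above" wording in the text. A standard deviation of 10 on the log-odds scale leaves the prior very diffuse.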
Second, we make use of the posterior predictions (θij, above) of the final model. These represent the distribution of marginalized model-adjusted probabilities that diabetes status is indicated on the informant-family member dyad. Following [26], we average these predictions over the dyads in which each family member was reported on by an informant to obtain a weighted estimate of diabetes status.
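The averaging step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: posterior draws of θij are stored in a dictionary keyed by (informant, family member), each dyad is summarized by its posterior mean, and dyads are weighted equally. The function name and toy draws are hypothetical.

```python
from collections import defaultdict
from statistics import fmean

def member_level_estimates(theta_draws: dict) -> dict:
    """Average dyad-level posterior means over the dyads reporting on each member.

    theta_draws maps (informant i, family member j) -> list of posterior draws
    of theta_ij. Returns a mapping j -> averaged probability of disease.
    """
    dyad_means = defaultdict(list)
    for (_informant, member), draws in theta_draws.items():
        dyad_means[member].append(fmean(draws))   # posterior mean for this dyad
    return {member: fmean(means) for member, means in dyad_means.items()}

# Hypothetical draws: member 0 has two informants, member 1 has one.
toy_draws = {
    (0, 0): [0.8, 0.6],
    (1, 0): [0.4, 0.6],
    (0, 1): [0.1, 0.3],
}
estimates = member_level_estimates(toy_draws)
```

A family member reported on by a single informant simply keeps that dyad's posterior mean.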
Model selection
The primary measure used to compare and select competing parameterizations of our proposed model is the Deviance Information Criterion (DIC). This measure is appropriate as it incorporates a first approximation to the predictive accuracy of the model via the posterior deviance while simultaneously penalizing model complexity. We follow the DIC specification of [31] (pp. 180–3), which defines the DIC as the sum of the average deviance of the posterior sample and one half its variance. The latter term estimates the effective number of parameters in the model and thus serves as a measure of Bayesian model complexity. As with other deviance- and likelihood-based model selection measures (AIC, BIC, AICc, etc.), models with comparatively lower values of DIC are preferred.
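The DIC definition stated above can be written out as a short Python sketch. The function names are hypothetical, the deviance is the usual Bernoulli deviance (minus twice the log-likelihood), and the use of the sample variance with an n − 1 denominator is an assumption.

```python
import math
from statistics import fmean, variance

def bernoulli_deviance(y: list, theta: list) -> float:
    """Deviance (-2 * log-likelihood) of binary reports y given probabilities theta."""
    loglik = sum(
        math.log(t) if yi == 1 else math.log(1.0 - t)
        for yi, t in zip(y, theta)
    )
    return -2.0 * loglik

def dic(deviance_draws: list) -> float:
    """DIC as mean posterior deviance plus one half of its variance."""
    mean_dev = fmean(deviance_draws)
    p_eff = 0.5 * variance(deviance_draws)   # estimate of the effective number of parameters
    return mean_dev + p_eff

# Hypothetical deviances evaluated at three posterior draws.
example_dic = dic([100.0, 102.0, 104.0])
```

In practice one deviance value would be computed per retained posterior draw, using that draw's θij for every dyad.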
We also evaluate classification accuracy for each of our candidate models using the area under the receiver operating characteristic curve (AUC) for both dyadic and individual-level predictions. Larger values of AUC represent better classification, with clinically relevant values exceeding 70% [36]. Additional model robustness checks are reported in Additional file 2.
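For reference, the AUC can be computed directly from its rank interpretation: the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The Python sketch below uses that pairwise definition. The function name and toy data are hypothetical, and it is not claimed to be the routine used for the reported results.

```python
def auc(labels: list, scores: list) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC requires at least one positive and one negative case")
    # Each positive-negative pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical disease statuses and predicted probabilities.
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

The pairwise form is quadratic in the number of cases, which is acceptable for a dataset of a few thousand records.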