Analysis was based on cross-sectional data from a total of 62,111 respondents aged 52–85, participating in the 2012 (wave 11) U.S. Health and Retirement Study (HRS) (n = 10,858) [17]; the 2012–2013 (wave 6) English Longitudinal Study of Ageing (ELSA) (n = 7938) [18]; the 2012 (wave 2) Irish Longitudinal Study on Ageing (TILDA) (n = 6668) [19]; and the 2010–2015 baseline of the Canadian Longitudinal Study on Aging (CLSA) (n = 36,647) [20]. The designs of these studies have been comprehensively described elsewhere [17–21] but, for completeness, are summarised in Additional file 2, Section 1.
Self-reported diagnoses and risk factors
Nine self-reported medical conditions were identified as common across all four studies: hypertension, diabetes, stroke (including transient ischemic attack), angina, myocardial infarction (MI), arthritis, cancer (not including minor skin cancers), lung disease (at least one of emphysema, chronic bronchitis or chronic obstructive pulmonary disease) and osteoporosis. A tenth condition covered psychological disorders: anxiety/mood disorders (Psych 1; CLSA, HRS) and/or psychiatric problems (Psych 2; TILDA, HRS, ELSA).
Statistical analysis
Cross-sectional survey weights were used to report population-representative disease prevalence, with all analyses performed in Stata 15. Crude population prevalence of each disease was calculated using the tab command. Odds ratios relating risk factors to disease presence were estimated using a survey-weighted logistic regression for each disease, implemented with the svy: logit command. When making comparisons directly to the U.S., we pooled data across countries, ensuring that the cluster and strata variables of each survey were accounted for; country-level weights were scaled to a common mean and a standard deviation of 1, to prevent countries with weights on a larger scale from dominating the analysis. Fully adjusted income, education and BMI gradients for each disease were estimated with the addition of an interaction term by country; the marginal effect of each variable was then extracted with all other confounding variables held equal, using the margins command.
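As an illustration, the weight rescaling and survey-weighted logistic regression described above can be sketched in Python. This is a minimal sketch under our own assumptions, not the authors' code: it rescales weights to mean 1 within each country (a simplification of the scaling described above) and returns point estimates only, without the design-based variance estimation that svy: logit performs.

```python
import numpy as np

def rescale_weights(w, country):
    # Rescale survey weights within each country to mean 1, so that no
    # country dominates the pooled analysis purely through weight scale.
    # (Illustrative simplification of the rescaling described in the text.)
    w = np.asarray(w, dtype=float)
    country = np.asarray(country)
    out = np.empty_like(w)
    for c in np.unique(country):
        mask = country == c
        out[mask] = w[mask] / w[mask].mean()
    return out

def weighted_logit(X, y, w, n_iter=25):
    """Weighted logistic regression fit by Newton-Raphson (IRLS).
    A point-estimate analogue of a survey-weighted logit; the survey
    design adjustments to standard errors are omitted for brevity."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # fitted probabilities
        W = w * p * (1 - p)                     # IRLS working weights
        grad = X.T @ (w * (y - p))              # weighted score
        hess = X.T @ (X * W[:, None])           # weighted information
        beta += np.linalg.solve(hess, grad)
    return beta
```

On simulated data with a known slope, the estimates recover the generating coefficients, which is the behaviour the pooled cross-country models rely on.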
Disease patterns were identified using latent class analysis (LCA), which was population weighted in all cases and took into account the stratification and clustering inherent in the cohort sampling designs. LCA is a model-based clustering method for multivariate categorical data and has previously been applied in the analysis of multimorbidity [22, 23]. In the case of multimorbidity, clustering using LCA is more appropriate than standard distance-based methods, such as k-means or hierarchical clustering, since an appropriate probability distribution for the data is readily available. Furthermore, LCA allows the extra flexibility of partial (probabilistic) membership across multiple clusters, unlike more restrictive distance-based clustering methods that assign each observation to exactly one cluster.
Two sets of parameters underlie the model: the group probabilities \( \tau \) and the item probabilities \( \theta \). The group probability parameter represents the a priori probability that an observation belongs to a particular group, so that \( P(\text{Group } g) = \tau_g \). The item response probability represents the probability of a success for a given item, conditional on group membership, so that \( P(\text{Item } m = 1 \mid \text{Group } g) = \theta_{gm} \).
More formally, let \( X = (X_1, \dots, X_n) \) denote \( n \) observations of an \( M \)-dimensional binary random vector, arising from \( G \) groups. The observed-data likelihood for \( X \) can then be written: \( p(X \mid \theta, \tau) = \prod_{i=1}^{n} \sum_{g=1}^{G} \tau_g \prod_{m=1}^{M} \theta_{gm}^{x_{im}} \left(1 - \theta_{gm}\right)^{1 - x_{im}} \).
This model makes the naïve Bayes assumption that, conditional on group membership, the items are independent. Direct inference using the observed-data likelihood is typically difficult and is facilitated by the introduction of latent variables \( Z = (Z_1, \dots, Z_n) \). Each \( Z_i = (Z_{i1}, \dots, Z_{iG}) \) is a \( G \)-dimensional vector representing the true cluster membership of \( X_i \) as a multinomial random variable: \( Z_{ig} = 1 \) if observation \( i \) belongs to group \( g \), and \( Z_{ig} = 0 \) otherwise. The complete-data density for the observations \( (X_i, Z_i) \) is then \( p(X, Z \mid \theta, \tau) = \prod_{i=1}^{n} \prod_{g=1}^{G} \left\{ \tau_g \prod_{m=1}^{M} \theta_{gm}^{x_{im}} \left(1 - \theta_{gm}\right)^{1 - x_{im}} \right\}^{z_{ig}} \). LCA thus allows the data to be summarised at a global and a local level: the parameters \( \theta \) and \( \tau \) summarise the overall behaviour of the clusters in the data, while each variable \( Z_i \) informs us of the cluster membership, and thus behaviour, of an individual observation \( i \).
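The observed-data likelihood above can be evaluated directly. A minimal sketch (our own, not from the lcca package), computed on the log scale for numerical stability:

```python
import numpy as np

def lca_loglik(X, theta, tau):
    """log p(X | theta, tau) for the latent class model:
    sum_i log sum_g tau_g prod_m theta_gm^x_im (1 - theta_gm)^(1 - x_im).
    X: (n, M) binary matrix; theta: (G, M); tau: (G,)."""
    X = np.asarray(X, dtype=float)
    # log p(x_i | group g) as an (n, G) matrix, from the Bernoulli terms
    log_px = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    joint = np.log(tau) + log_px              # log tau_g + log p(x_i | g)
    m = joint.max(axis=1, keepdims=True)      # log-sum-exp over groups
    return float(np.sum(m[:, 0] + np.log(np.exp(joint - m).sum(axis=1))))
```

With a single group (\( G = 1 \)) the mixture collapses to an ordinary product of Bernoulli likelihoods, which provides a simple sanity check.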
Inference for our LCA models was performed using an expectation-maximisation (EM) algorithm. This works in two steps: the E-step, where \( Z \) is estimated based on the current values of \( \theta \) and \( \tau \), and the M-step, where the complete-data likelihood is maximised with respect to \( \theta \) and \( \tau \) based on the current estimate of \( Z \). The algorithm proceeds iteratively until it is deemed to have converged, that is, once the parameter estimates remain essentially unchanged across successive iterations.
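A compact sketch of this EM scheme for the weighted latent class model follows. It is illustrative only: the published analysis used the R package lcca, and the function and variable names here are our own. For simplicity it runs a fixed number of iterations rather than testing a convergence criterion.

```python
import numpy as np

def lca_em(X, G, w=None, n_iter=200, seed=0):
    """Fit a G-group latent class model by EM, with optional survey
    weights w entering the M-step (a weighted pseudo-likelihood).
    Returns item probabilities theta (G, M), group probabilities
    tau (G,) and posterior memberships z (n, G)."""
    X = np.asarray(X, dtype=float)
    n, M = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.25, 0.75, size=(G, M))   # random starting values
    tau = np.full(G, 1.0 / G)
    for _ in range(n_iter):
        # E-step: posterior membership probabilities E[Z_ig | x_i]
        logj = np.log(tau) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        logj -= logj.max(axis=1, keepdims=True)    # stabilise before exp
        z = np.exp(logj)
        z /= z.sum(axis=1, keepdims=True)
        # M-step: weighted maximisation of the complete-data likelihood
        wz = w[:, None] * z
        tau = wz.sum(axis=0) / w.sum()
        theta = np.clip((wz.T @ X) / wz.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    return theta, tau, z
```

On simulated data with two well-separated classes, the posterior memberships recover the generating class labels up to the usual label switching.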
As the true number of groups G is not known in advance, each LCA model was run over a range of 1–10 groups. The number of clusters was then chosen using the Bayesian information criterion (BIC), where \( \mathrm{BIC} = -2 \log p(X \mid \theta, \tau) + (GM + G - 1) \log\left( \sum_{i=1}^{n} w_i \right) \);
\( w_i \) is the survey weight attached to observation \( i \), and \( \log p(X \mid \theta, \tau) \) is the survey-weighted pseudo-log-likelihood. Here a lower value of BIC indicates a more suitable choice of model. In practice, as in the present work, a balance has to be struck between model parsimony and model fit, so an “elbow” is usually identified beyond which the addition of clusters yields diminishing improvements in fit. We applied LCA using the software package lcca in R [24]. Code to implement this analysis, together with BIC values for all models assessed, is provided in Additional file 4.
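The model-selection criterion can be sketched as follows. This is our own illustration of the weighted BIC formula given above, not the lcca implementation; `lca_bic` and the commented `fit_lca` routine are hypothetical names.

```python
import numpy as np

def lca_bic(X, theta, tau, w):
    """Survey-weighted BIC for a fitted G-group latent class model:
    BIC = -2 * log p(X | theta, tau) + (G*M + G - 1) * log(sum_i w_i),
    where log p is the weighted pseudo-log-likelihood."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n, M = X.shape
    G = len(tau)
    logj = np.log(tau) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    m = logj.max(axis=1, keepdims=True)
    ll = float(np.sum(w * (m[:, 0] + np.log(np.exp(logj - m).sum(axis=1)))))
    k = G * M + G - 1          # free parameters: G*M thetas plus G-1 taus
    return -2.0 * ll + k * np.log(w.sum())

# In practice one fits models for G = 1..10 and looks for the G at which
# BIC stops improving appreciably (the "elbow"), e.g.:
#   bics = [lca_bic(X, *fit_lca(X, G, w)[:2], w) for G in range(1, 11)]
```

The penalty term counts the \( GM \) item probabilities and the \( G - 1 \) free group probabilities, and uses the sum of the survey weights in place of the raw sample size.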