Introduction
R packages, namely GJRM (v. 0.2-3), Generalised Joint Regression Modelling [8], and gamlss (v. 5.1-7), Generalised Additive Models for Location, Scale and Shape [9]. The GJRM package allows us to deal simultaneously with two response variables, whose marginal distributions are bound together into a joint distribution by a copula function. In this way, we can define a joint distribution for the process that governs the probability that a woman has not yet reached menopause and for the age at menopause itself. A bivariate copula regression model will be adopted [10]. To allow for sufficient flexibility in the model estimation, we will use spline functions to model some of the covariate effects. The gamlss package allows virtually any distributional parameter to be expressed as a function of covariates in a generalized additive model (GAM, [11]) fashion, and adopts an imputation method that is more flexible than those provided by other R packages [12]. This usage naturally led to a secondary objective of this work: to compare, within our context of age at menopause, the imputations obtained by these two different methods.
Breast cancer screening data from Portugal
pregnancy = 0 if the woman has never been pregnant; 1 otherwise), breastfeeding (breastf = 0 if the woman has never breastfed; 1 otherwise) and the use of oral contraceptives (anov = 0 if the woman has never used oral contraceptives; 1 otherwise); (ii) quantitative information carried by the continuous variables age at menopause (menopause) (Figs. 1 and 2), age at menarche (menarche), year of birth (birth) and age at the last attended screening (sage); (iii) demographic information given by the municipality purchasing power index (ipccap); and (iv) spatial information embodied in the neighbourhood structure of the municipality of residence (muni). The central region of Portugal is divided into 78 municipalities (Figs. 3 and 4) and represents roughly 25% of the Portuguese population. More details about the screening program and the inclusion criteria are given in [4].
Variable | Mean | Range |
---|---|---|
Age at menopause | 48.2 | 20–59 |
Age at menarche | 13.2 | 8–18 |
Age at last attending screening | 58.3 | 45–69 |
Year of birth | 1948.9 | 1920–1965 |
Municipality purchasing power index | 81 | 24–145 |
Variable | % No | % Yes |
---|---|---|
Any pregnancy | 7.4 | 92.6 |
Oral contraceptives | 52.6 | 47.4 |
Breastfeeding | 44.6 | 55.4 |
Bivariate conditional copula regression
Bivariate joint distributions through copulas
Mixed binary-continuous copulas
The likelihood
Imputation methodology
R packages that allow for multiple imputation under chosen working models. Both are very flexible, and the user is offered a variety of options for building the imputation model. This contrasts with most available packages, which are often limited to simple models such as the homoscedastic normal linear regression model [12].
Imputing with a copula approach
imputeSS(), which takes a fitted gjrm object and imputes the missing values. However, the mixing of the “posterior imputed values” alluded to above must be carried out by the user. Additionally, it does not provide an option to draw imputations from a truncated distribution, which in our case would be extremely useful.
Imputing with GAMLSS models
gamlss package in R [9] may be used for MI. As with GJRM, this package is not Bayesian-based, so we cannot rely on posterior predictive distributions. However, the package uses the bootstrap predictive distribution as an approximation to the posterior predictive distribution [27, 28]. This is achieved by approximating the Bayesian posterior distribution \(f(\boldsymbol{\Phi} | \mathcal{Y}_{\text{obs}}, \boldsymbol{v}_{i})\) in (9) by \(f\left(\tilde{\boldsymbol{\Phi}} | \hat{\boldsymbol{\Phi}}(\mathcal{Y}_{\text{obs}}, \boldsymbol{v}_{i})\right)\), which is the sampling distribution of the imputation parameters evaluated at the estimated values. Here \(\tilde{\boldsymbol{\Phi}}\) denotes the possible values of the imputation model parameters and \(\hat{\boldsymbol{\Phi}}(\mathcal{Y}_{\text{obs}})\) is an estimator of those parameters. If some variables enter the model as non-linear functions, a penalized likelihood is used. The sampling distribution \(f\left(\tilde{\boldsymbol{\Phi}} | \hat{\boldsymbol{\Phi}}(\mathcal{Y}_{\text{obs}}, \boldsymbol{v}_{i})\right)\) is obtained by fitting the model to several bootstrap samples; the set of all parameter estimates obtained in this way constitutes the sampling distribution.
gamlss.tr package, which allows users to define truncated distributions in GAMLSS models. Unfortunately, the GJRM package offers no such option.
Modelling the age at menopause in central Portugal
Semiparametric predictors
R packages GJRM and gamlss facilitates the choice of the functional form specifications for the missing and observed response models. In the case of the GJRM package, we want to simultaneously model the underlying missing indicator, \(Y_{1}^{*}\), and the response, \(Y_{2}\), since we are under an MNAR assumption. The two models are linked through a bivariate copula [8], conditional on some covariates. In the MAR scenario, we will use the gamlss package to model only \(Y_{2}\), before and after the imputations.
pregnancy, anov and breastf for entering the model with linear effects. The effects of continuous covariates such as birth, ipccap and menarche may be non-linear. The spatial information contained in muni, viewed as a Markov random field, will be taken into account in order to see how the age at menopause differs between regions.
muni_i). More details are given in the “Flexible effects” section below. The covariates used are those considered to potentially influence the age at menopause according to previous research and expert opinion, provided they were available in the data set.
Copula model
Marginal models
Flexible effects
Spatial effects
muni_i) = ξ_{km}, m = 1, …, 78, where every municipality is assigned a specific regression coefficient giving the level of some random quantity within the mth region. For a spatial variable such as muni, a simple Markov random field smoother [32] is sometimes appropriate. Indeed, the map displayed in Figs. 3 and 4 may be viewed as an irregular lattice.
Selected marginal distributions
gamlss package, are:
Marginal | AIC | BIC |
---|---|---|
Gamma | 1318295 | 1317297 |
Gumbel | 1263542 | 1264326 |
LogNormal | 1328666 | 1329256 |
Normal | 1298449 | 1299285 |
Student’s t | 1289426 | 1290270 |
Weibull | 1265325 | 1266310 |
Results
R packages (GJRM and gamlss) in order to analyse the robustness of our findings.
Model selection
Copula | AIC | BIC |
---|---|---|
N | 1394988 | 1397140 |
PL | 1393386 | 1395308 |
C90 | 1391835 | 1393702 |
C270 | 1397586 | 1399744 |
J0 | 1401004 | 1403095 |
J90 | NA | NA |
J270 | 1391834 | 1393701 |
G90 | NA | NA |
G270 | 1393482 | 1395444 |
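The copula comparison above can also be read off programmatically. The following is a minimal Python sketch, not part of the original analysis (which was carried out in R): it copies the AIC and BIC values from the table, treats the NA rows as unavailable fits, and returns the label with the smallest criterion value. The names `copula_fits` and `best_copula` are illustrative only.

```python
# AIC/BIC values copied from the copula comparison table; None marks the NA rows.
copula_fits = {
    "N": (1394988, 1397140),
    "PL": (1393386, 1395308),
    "C90": (1391835, 1393702),
    "C270": (1397586, 1399744),
    "J0": (1401004, 1403095),
    "J90": None,
    "J270": (1391834, 1393701),
    "G90": None,
    "G270": (1393482, 1395444),
}


def best_copula(fits, criterion):
    """Return the label with the smallest AIC (criterion=0) or BIC (criterion=1)."""
    usable = {name: vals for name, vals in fits.items() if vals is not None}
    return min(usable, key=lambda name: usable[name][criterion])


best_aic = best_copula(copula_fits, 0)
best_bic = best_copula(copula_fits, 1)
print(best_aic, best_bic)
```

Both criteria point to J270, although its margin over C90 is a single unit on either scale, so the two rotated copulas are nearly indistinguishable by these criteria alone.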
Parameter | CCA; gamlss; Gumbel margin | gamlss after imputations; truncated Weibull | GJRM; no imputations; logit; Gumbel; copula = J270 | gamlss after imputations produced with GJRM; Gumbel margin; copula = J270 |
---|---|---|---|---|
intercept | 50.09 (0.03) ∗∗∗ | 51.09 (0.03) ∗∗∗ | 50.97 (0.03) ∗∗∗ | 51.59 (0.03) ∗∗∗ |
pregnancy | 0.28 (0.04) ∗∗∗ | 0.19 (0.03) ∗∗∗ | 0.27 (0.04) ∗∗∗ | 0.27 (0.03) ∗∗∗ |
breastf | 0.13 (0.02) ∗∗∗ | 0.15 (0.02) ∗∗∗ | 0.24 (0.02) ∗∗∗ | 0.20 (0.02) ∗∗∗ |
anov | 0.25 (0.02) ∗∗∗ | 0.30 (0.02) ∗∗∗ | 0.40 (0.02) ∗∗∗ | 0.34 (0.02) ∗∗∗ |
\(\sigma _{Y_{2}}\) | 4.04 | 4.01 | 4.25 | 4.23 |
τ | – | – | -0.91 | – |
θ | – | – | -20.8 | – |
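The reported values τ = -0.91 and θ = -20.8 for the J270 copula can be cross-checked against each other. The sketch below is an illustration in Python rather than the authors' computation. It uses the standard series expression for Kendall's τ of the Joe copula, and it assumes the convention that a 270° rotation is reported with a negative θ and flips the sign of τ. The function names are hypothetical.

```python
import math


def joe_kendall_tau(theta, terms=100_000):
    """Kendall's tau of the (unrotated) Joe copula via its series expansion.

    tau = 1 - 4 * sum_{k>=1} 1 / (k * (theta*k + 2) * (theta*(k-1) + 2)), theta >= 1.
    """
    if theta < 1.0:
        raise ValueError("the Joe copula requires theta >= 1")
    series = math.fsum(
        1.0 / (k * (theta * k + 2.0) * (theta * (k - 1) + 2.0))
        for k in range(1, terms + 1)
    )
    return 1.0 - 4.0 * series


def rotated_joe_tau(theta_reported):
    """Assumed convention: a negative reported theta marks a 90/270 degree rotation,
    for which tau is minus the tau of the unrotated copula with |theta|."""
    return -joe_kendall_tau(abs(theta_reported))


tau_j270 = rotated_joe_tau(-20.8)
print(round(tau_j270, 2))
```

Under these assumptions the series gives a value that rounds to -0.91, in line with the table, which indicates a very strong negative dependence between the two margins.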
Estimated effects
gamlss.tr; (ii) an MNAR scenario using the imputeSS function within the GJRM package. Although the shapes of the obtained distributions are similar, the distribution corresponding to the imputations via the imputeSS function is shifted towards larger values and has a heavier lower tail. Based on current knowledge of the biological menopause process, the imputations produced with the gamlss.tr package, which allows the user to impute from a truncated distribution (in this case a Weibull), seem to agree better with the ages at which a woman can reasonably be expected to reach menopause. Nevertheless, none of the imputation procedures produced values above 67 years. The occurrence of menopause at the ages of 69 and 70 is considered to be unrealistic [35].
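To make the role of truncation concrete, the following Python sketch draws imputations from a bootstrap predictive distribution under a truncated Weibull. It is a simplified illustration under stated assumptions, not the gamlss or gamlss.tr implementation: the model is intercept-only (no covariates or smooth terms), the "observed" ages are synthetic, and the bounds of 52 and 60 years are hypothetical values standing in for a woman's age at her last screening and an upper plausibility limit.

```python
import math
import random


def fit_weibull(sample):
    """Maximum-likelihood Weibull fit; returns (shape, scale)."""
    logs = [math.log(x) for x in sample]
    mean_log = sum(logs) / len(logs)
    max_log = max(logs)

    def weights(k):
        # x**k rescaled by max(x)**k, so the largest weight is 1 (avoids overflow).
        return [math.exp(k * (l - max_log)) for l in logs]

    def score(k):
        w = weights(k)
        return sum(wi * li for wi, li in zip(w, logs)) / sum(w) - 1.0 / k - mean_log

    lo, hi = 1e-3, 1e3  # score(lo) < 0 < score(hi) for non-degenerate data
    for _ in range(60):  # bisection on the monotone shape equation
        mid = 0.5 * (lo + hi)
        if score(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    w = weights(shape)
    scale = math.exp(max_log) * (sum(w) / len(w)) ** (1.0 / shape)
    return shape, scale


def draw_truncated_weibull(shape, scale, lower, upper, rng):
    """Inverse-CDF draw from a Weibull restricted to [lower, upper]."""
    surv_lower = math.exp(-((lower / scale) ** shape))  # P(X > lower)
    surv_upper = math.exp(-((upper / scale) ** shape))  # P(X > upper)
    s = rng.uniform(surv_upper, surv_lower)
    s = min(max(s, 1e-300), 1.0)  # keep the logarithm finite and non-positive
    x = scale * (-math.log(s)) ** (1.0 / shape)
    return min(max(x, lower), upper)  # absorb floating-point rounding at the bounds


def bootstrap_impute(observed, lower, upper, m, rng):
    """One draw per bootstrap refit: parameter uncertainty enters via resampling."""
    imputations = []
    for _ in range(m):
        resample = [rng.choice(observed) for _ in observed]
        shape, scale = fit_weibull(resample)
        imputations.append(draw_truncated_weibull(shape, scale, lower, upper, rng))
    return imputations


rng = random.Random(2010)
# Synthetic ages for illustration only; NOT the screening data.
observed = [rng.weibullvariate(50.0, 12.0) for _ in range(1000)]
shape_hat, scale_hat = fit_weibull(observed)
# Hypothetical bounds: still pre-menopausal at 52, imputed age capped at 60.
imputed = bootstrap_impute(observed, lower=52.0, upper=60.0, m=5, rng=rng)
print(shape_hat, scale_hat, imputed)
```

Every imputed value lies within the chosen bounds by construction, which is exactly what an untruncated imputation model cannot guarantee.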
gamlss package (before imputations); a Gumbel distribution for the age at menopause and the location parameter expressed according to (12) with an identity link function. Subsequently, we continue to use the gamlss package, but only after obtaining the imputations via the same package using a truncated Weibull distribution. The third scenario considers the copula approach according to the “Modelling the age at menopause in central Portugal” section, with logit and Gumbel marginal models and a Joe copula rotated by 270°. The location parameters for the logit and Gumbel marginal models are expressed as in (11) and (12). The last scenario applies a GAMLSS approach to the completed data set obtained with the imputeSS function in GJRM. From this table we can state that the different scenarios do not differ significantly in their estimates of the regression parameters for the binary variables. They are all significant and positive.
gamlss package before the imputations (corresponding to a CCA). The downward trend of the age at menopause as a function of birth year is clear, in accordance with what had already been observed by [4]. This means that younger women tend to have an earlier menopause. The variables ipccap and menarche generally have a positive relation with the age at menopause: women living in municipalities with higher purchasing power tend to have a later menopause, as do women with late menarche. From the spatial clustering plot we might conclude that areas on the (western) coast of Portugal tend to show early menopause.
ipccap almost disappears, and menarche has a negative impact on the menopause age only for women who had their menarche by the age of 12. The spatial clustering remains more or less unchanged.
gamlss package fitted to the data set after filling in the missing values with one imputation using the copula approach. Compared to Fig. 9, the variable whose behaviour changes most is ipccap. Municipalities with a purchasing power slightly above the national average tend to show an increase in their menopause ages. The municipalities with higher ipccap are located on the coast of Portugal, and from the spatial plot (bottom right panel) those municipalities seem to have a negative spatial effect. Although these estimates may seem to point to different conclusions, in our view this is due to the spatial random effects, whose confidence intervals do not contain zero, indicating a need to incorporate new spatial information in the data.
gamlss package produces the best results, i.e., it produces completed data sets that agree better with reality than those from the GJRM package, which does not allow for truncation. Given that, and given the information provided by Figs. 8 and 10, we can state that the age at menopause is increasing in the centre of Portugal. Younger women will, on average, experience menopause a little later than women of previous generations.
Discussion
R packages mice [36] or mi [37]. These procedures generally take as many variables as possible that might affect the probability of missingness and impute the missing values by specifying regression models, without specifying a model for the probability of missingness itself. We tried both approaches, but the results obtained were similar to those of a complete case analysis.
gamlss package. The other line of research that we pursued was to fit a joint model for the age at menopause and the probability of missingness. This was achieved using copulas, which allowed us to model the situation with a non-ignorable missingness mechanism.
Conclusion
GJRM and gamlss, respectively) inside the popular R software.
GJRM has the virtue of allowing the construction of a bivariate distribution in an easy and natural way by specifying a copula with a specific correlation parameter. After fitting the model, the imputations are obtained via the imputeSS function. On the other hand, the imputation tools available within gamlss are more useful because they allow truncated distributions, a feature that is not available in GJRM. This detail turns out to be decisive in the results of the validation analysis presented in Supplementary Material I: the differences between the imputed menopause values in 2010 and the true observed ages in 2017 are always smaller for the gamlss case.