Background
A statistical analysis plan (SAP) is a critical link between how a clinical trial is conducted and the clinical study report. To secure objective study results, regulatory bodies expect the SAP to pre-specify the inferential analyses and other important statistical techniques. Writing a good SAP for a model-based sensitivity or ancillary analysis [1, 2] involves non-trivial decisions on, and justification of, many aspects of the chosen model setting. In particular, trials with longitudinal count data as the primary endpoint pose challenges for model choice and model validation. This paper explores tools for this decision process when sensitivity analyses are performed using generalized linear mixed models (GLMMs) for the analysis of longitudinal count data. These tools can be used to build transparent strategies for shaping the final models reported in the SAP.
The documentation of longitudinal profiles for the primary endpoint offers many advantages. Longitudinal profiles are more informative than a single-timepoint analysis and give information about the 'speed of efficacy' [3]. Treatment effects evaluated by comparing change over time in quantitative outcome variables between the treatment groups are of great interest [4, 5]. The analysis of longitudinal profiles also offers an effective way to handle composite endpoints such as: (1) the long-term effect of the experimental treatment (E) is better than that of the standard treatment (S), and (2) patients under E reach a pre-specified effect faster than those under S.
We are interested in parametric modeling approaches for quantifying absolute effects, adjusting for baseline covariates, and handling stratification. There is a rich literature on nonparametric methods for longitudinal data, for example, Brunner et al. [6]; these methods do, in general, allow estimation of relative effects. Omar et al. [7] provide an overview of several alternative parametric approaches to dealing with individual longitudinal profiles: (i) the 'summary statistic method' [8], in which a suitable summary measure (e.g. rate of change, post-treatment mean, last value of the outcome measure, or area under a curve) is calculated for each subject and subsequently analyzed with rather simple statistical techniques; (ii) repeated measures analysis of variance; (iii) marginal models based on generalized estimating equations (GEE) [9]; (iv) the mixed effects modeling approach involving fixed and random effects components [10, 11].
Mixed effects (or random effects) models allow us to investigate the profile of individual patients, estimate patient effects and describe the heterogeneity of treatment effects over individual patients. They account for different sources of variation (patient effects, center effects, measurement errors) and provide direct estimates of the variance components, which might be of interest in their own right. Furthermore, they allow us to address various covariance structures and are useful for accommodating the overdispersion often observed among count response data [10-12].
The 2010 EMA Guideline on Missing Data in Confirmatory Clinical Trials [13] explicitly considers random effects approaches (i.e. GLMMs in the case of a non-Gaussian response) for handling trials with a primary endpoint measured repeatedly over time. Mixed models are also helpful for handling missing values. They are applicable under missing completely at random (MCAR) as well as missing at random (MAR) mechanisms [14], while simple repeated univariate analyses for each time point using test procedures such as the t-test, ANOVA, or the Wilcoxon rank sum test rely on the more restrictive MCAR assumption. Also, for non-ignorable missing data mechanisms, newer model-based strategies for longitudinal analyses are increasingly available and offer the opportunity to account for dropout patterns (e.g. pattern mixture models [15]). To be fully compatible with the intention-to-treat (ITT) principle, one has to explicitly consider incomplete individual profiles in order to correctly incorporate the information available for all randomized patients.
In summary, these points explain why our interest focuses on GLMMs as a powerful tool for the sensitivity analysis of longitudinal count data. What we need is to pre-specify in detail a robust, valid, and parsimonious strategy for the data to come (see the ICH E9, EMA, or PSI guidelines [13, 16, 17]). Writing the SAP prospectively for a randomized clinical trial with longitudinal counts as the primary endpoint requires a series of decisions when specifying a GLMM for the analysis. Consideration should mainly be given to the following issues:
Distributional assumptions: Poisson, negative binomial, or more sophisticated extensions, e.g., accounting for zero-inflation.
Transformation of the outcome variable: e.g. log-transformation for skewed continuous positive variables [18], or variance-stabilizing transformations (e.g. the inverse hyperbolic sine transformation for non-negative count variables). An FDA guideline [19] postulates that a rationale for the choice of data transformation, along with the interpretation of the treatment effect estimates on the transformed scale, should be provided. In some situations, transformation of endpoint data is indicated and preferred to untransformed analyses on the original scale. However, careful consideration should be given to the choice of transformation, which should be pre-specified at the protocol design stage.
Variance-covariance structure: specifying whether random effects (e.g. patient-specific intercepts, patient-specific slopes) are appropriate; specification of the within-error structure. Altogether, random effects selection can be challenging, particularly when the outcome distribution is not normal (see [20-22] for more details).
Methods for handling dropouts: e.g. dealing with informative dropouts, applying a last-observation-carried-forward analysis, or accounting for non-ignorable missing data mechanisms (pattern-mixture models). The chosen approach must be fully compatible with the intention-to-treat principle.
Use of covariate- or baseline-adjusted analyses, handling multi-center data: specifying the mean structure by identifying the fixed effects terms.
The last issue follows Pocock et al. [23], who propose it for avoiding misuse and data-driven selection of covariates within the clinical trial setting. The typical strategy for settling this complex issue is to decide on a simple model on which the primary analysis is based, and to use sensitivity analyses to assess the robustness of the derived result under realistic model deviations; a minimal sketch of how such candidate models can be specified follows below.
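To make these decisions concrete, the following minimal R-INLA sketch shows how two of the candidate GLMMs from the list above could be written down. All variable names (y, time, treat, id) and the simulated pilot data are hypothetical placeholders, not the models pre-specified for any particular trial.

library(INLA)  # R-INLA package, available from http://www.r-inla.org

## Hypothetical pilot data: 50 patients, 6 visits, two treatment arms.
set.seed(1)
n.pat <- 50; n.vis <- 6
df <- data.frame(
  id    = rep(1:n.pat, each = n.vis),
  time  = rep(0:(n.vis - 1), times = n.pat),
  treat = rep(rbinom(n.pat, 1, 0.5), each = n.vis)
)
df$y   <- rpois(nrow(df), exp(1 + 0.1 * df$time - 0.2 * df$treat * df$time))
df$id2 <- df$id  # duplicate index: each f() term needs its own variable

## Candidate 1: Poisson GLMM with a patient-specific random intercept.
m1 <- inla(y ~ treat * time + f(id, model = "iid"),
           family = "poisson", data = df,
           control.compute = list(dic = TRUE, cpo = TRUE))

## Candidate 2: negative binomial GLMM with independent patient-specific
## random intercept and random slope for time.
m2 <- inla(y ~ treat * time + f(id, model = "iid") +
             f(id2, time, model = "iid"),
           family = "nbinomial", data = df,
           control.compute = list(dic = TRUE, cpo = TRUE))

A correlated intercept-slope pair could be specified with model = "iid2d" instead; the point here is only that each candidate setting in the list maps to one explicit model formula.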
In this paper, we propose using pilot or pre-study data to make an informed choice about the sensitivity analysis stated in the SAP. Pilot or pre-study data (commonly from a so-called “feasibility” or “vanguard” study) come from a trial in an earlier phase, from a registry, or from a proof-of-concept study. For phase III trials, data from phase II trials generally exist [24]. In this respect we could also use data from comparable treatment arms of related studies. Using these data helps to shape and justify in advance the modeling strategy for analyzing the main trial data, and to check the validity and appropriateness of several model assumptions. It is imperative to minimize misspecification of the assumed GLMM, and such a pilot analysis enables the trial statistician to define a broad and robust setting for the final choice of the model.
Having determined the main focus of this paper, we need to motivate our choice of Bayesian tools for achieving our goal. Within the GLMM framework, analytical methods for model assessment and goodness-of-fit criteria are not straightforward, and frequentist approaches remain limited. The inclusion of random effects makes theoretical derivations rather complex, imposing computational challenges. Some proposed model evaluation procedures focus on checking the deterministic components (i.e. the mean and variance-covariance structure) of a GLMM based on cumulative sums of residuals, or assess the overall adequacy by means of a goodness-of-fit statistic which can be used in a manner similar to the well-known R² criterion [25, 26]. Furthermore, for small sample sizes, likelihood-based inference via penalized quasi-likelihood for a longitudinal discrete outcome can be unreliable, with variance components being difficult to estimate. In contrast, many easy-to-implement tools are available within the Bayesian framework. We will briefly review recently developed Bayesian tools and demonstrate their usefulness: for assessing goodness-of-fit and performing model validation, we apply the probability integral transform (PIT) [27-29] as a graphical posterior model check. We implement formal statistical criteria such as the deviance information criterion (DIC) [30], the conditional predictive ordinate (CPO) [31, 32], and proper scoring rules [28, 29, 33-36] to compare the forecasting capability of competing GLMMs. A further objective is to explore whether these different Bayesian methods for model validation agree in their preference for a certain candidate model.
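As a preview of how these criteria are obtained in practice, the following sketch (continuing the hypothetical fits m1 and m2 above) extracts the DIC, CPO, and PIT values computed by INLA; since CPO_i is the cross-validated predictive density pi(y_i | y_-i), the logarithmic score of observation i is simply -log(CPO_i).

## Model-evaluation quantities from the hypothetical fits m1 and m2
## (control.compute = list(dic = TRUE, cpo = TRUE) was set when fitting).
m1$dic$dic                 # deviance information criterion (DIC)
m2$dic$dic

## Conditional predictive ordinates and the derived mean logarithmic
## score: smaller mean scores indicate better predictive performance.
-mean(log(m1$cpo$cpo))     # mean logarithmic score, model 1
-mean(log(m2$cpo$cpo))     # mean logarithmic score, model 2

## Probability integral transform as a graphical calibration check:
## a roughly uniform histogram suggests a well-calibrated model.
hist(m1$cpo$pit, breaks = 10, main = "PIT, model 1", xlab = "PIT value")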
The article is organized as follows: the Methods section reviews Bayesian strategies for GLMMs in the count response situation. The main idea of the integrated nested Laplace approximation (INLA) proposed by Rue et al. [37] is described briefly. We also introduce tools for model ranking and for evaluating the performance of the proposed model alternatives based on a prediction-oriented approach. Additionally, a case study is presented which is used in the subsequent section to motivate the methodology. The Results section applies the proposed Bayesian methodology to clinical trial data on vertigo attacks and presents the findings of our simulation study. The Discussion section contains the limitations of the methods proposed. More technical material is provided in the Appendix. Selected R-INLA code with further details concerning the INLA approach is included in the Web Supplementary Material of this paper [see Additional files 1 and 2].
For the data analysis, the inla program [38], based on the open-source software R version 2.12.1 [39], was used to demonstrate the applicability of the Bayesian toolbox.
Discussion
We have discussed Bayesian strategies for model evaluation of GLMMs for longitudinal count data and used integrated nested Laplace approximations for the calculations. We especially looked at tools such as the DIC, the logarithmic score, and the PIT. These techniques for model assessment are implemented in the package R-INLA, which can easily be used from within R, and aim to score the models with respect to how appropriately they explain the observed data. A very practical toolbox is therefore at hand for statisticians. It must be noted that other instruments, such as pivotal quantities [71] or other proper scoring rules [28], can be used if the calculations are done with MCMC methods (e.g. using WinBUGS [55, 72]).
We applied this toolbox to a typical task of the clinical trial statistician: making decisions about the pre-specified sensitivity analyses or the efficacy analysis in a statistical analysis plan. Data from a former trial were used as pilot data for an ongoing phase III trial. Our interest was to give some insight and guidance on the most important aspects of deciding on a final SAP. The main task consisted of deciding which GLMM should be used for the longitudinal count data. To this end, we performed a Bayesian analysis of the pilot data with different models and employed a prediction-based approach to derive statements on model fit.
We next discuss four important aspects of this process: prior distributions, normality assumption for random effects, Bayesian model evaluation, and modeling of clinical trial data.
Prior distributions
A Bayesian analysis needs a specification of prior distributions. However, when fitting a GLMM in a Bayesian setting, specifying prior distributions is not straightforward; this is particularly true for the variance components. Fong et al. [40] pointed out that the priors for variance components should be chosen carefully. To quantify the sensitivity of the posterior distributions with respect to changes in the priors for the random effects precision parameters, Roos & Held [73] discuss a measure based on the so-called Hellinger distance for GLMMs with a binary outcome, but not for count data; adapting their approach to count data is a topic for future research. In this study, we followed advice from the literature: in the case of negative binomial models, estimation of the posterior mean of the dispersion parameter can be affected when a vague prior specification is used to characterize the gamma hyper-parameter. To circumvent the problem of distorted posterior inferences, Lord et al. [74], for example, recommend a non-vague prior distribution for the dispersion parameter to minimize the risk of a mis-estimated posterior mean and to obtain stable and valid results. This issue is particularly relevant for data characterized by a small sample size in combination with low sample mean values. The situation is quite complex, and the only practical way to handle this issue is a careful simulation study investigating whether changing the priors would influence the decision on the relevant model; a minimal starting point is sketched below. The material provided in the Web Supplement may help a statistician set up such simulation studies.
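As an illustration of such a prior-sensitivity study, the sketch below refits the hypothetical negative binomial model from above with explicit (non-vague) priors on the random-intercept log-precision and on the dispersion hyperparameter; the param values are placeholders to be varied, not recommendations.

## Sketch: explicit hyperpriors so that they can be varied systematically.
## The Gamma parameters below are placeholders, not recommendations.
m2.prior <- inla(y ~ treat * time +
                   f(id, model = "iid",
                     hyper = list(prec = list(prior = "loggamma",
                                              param = c(1, 0.01)))),
                 family = "nbinomial", data = df,
                 control.family = list(hyper = list(
                   theta = list(prior = "loggamma", param = c(1, 0.1)))),
                 control.compute = list(dic = TRUE, cpo = TRUE))

## Comparing m2.prior$summary.hyperpar across several prior settings
## shows how strongly the posterior inference depends on the prior choice.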
Gaussian random effects
Throughout our article the distribution of the random effects was assumed to be Gaussian. One reason was that Bayesian inference was based on the INLA approach. Within the INLA methodology, an extension to non-Gaussian random effects is not straightforward because of the central role of the latent Gaussian field: the approach depends heavily on the latent Gaussian prior assumption to work properly. For further details on this issue see [75]. Recently, Martins & Rue [75] proposed an extension that allows INLA to be applied to models where some independent components of the latent field have a so-called “near-Gaussian” prior distribution. All in all, the assumption of Gaussian distributed random effects that is usually taken for granted may be subject to criticism, and there are a number of situations in which it might not be realistic. From a theoretical point of view, this normality assumption may be dropped in favour of other symmetric but heavier-tailed densities, such as the Student t-distribution, which allows one to identify and accommodate outliers both at the level of the within-group errors and at the level of the random effects [76]. Further research is needed to investigate the impact of inappropriate distributional assumptions, i.e. to understand their influence not only on posterior inference but also on the several Bayesian instruments applied for model evaluation.
Bayesian model evaluation
The INLA approach to approximate fully Bayesian inference for the class of latent Gaussian models provides an attractive and convenient alternative to inference schemes based on sampling-based methods such as MCMC, and avoids their computational burden. By taking advantage of the properties of latent Gaussian models, INLA outperforms MCMC schemes in terms of both accuracy and speed. Bayesian approaches naturally lead to posterior predictive distributions, from which any desired functional can readily be computed. We discussed earlier Bayesian methods for assessing probabilistic forecasts via proper scoring rules serving as loss functions. These scores can be used for an omnibus evaluation of both the sharpness and the calibration of predictive distributions, and provide a usable instrument for assessing the validity of different competing, non-nested modeling strategies. It should also be noted that there is a variety of proper scoring rules, each with a unique and well-defined underlying decision problem, which could be applied in a given situation as well. According to Gneiting [62], there are many options and considerations in choosing a specific scoring function, and there is a need for theoretically principled guidance. In this article, we have focused on the logarithmic score, which is easily calculated and available from INLA. The mean logarithmic score was competitive for the simulated negative binomial data and, most importantly, it was able to identify the correct model as the one best suited for prediction: the true model generating the data emerged with the smallest mean logarithmic score. In contrast to proper scoring rules, PIT histograms allow evaluation of the predictive quality of a model with respect to calibration only, neglecting sharpness. Furthermore, the DIC was applied as a common model selection criterion that takes goodness of fit into account while penalizing models for overfitting. Despite its computational simplicity, the DIC has several drawbacks, in particular a tendency to under-penalize complex hierarchical models. Likewise, the DIC is not suitable for comparing a model for a transformed outcome with competing models for data on the original scale. Accordingly, predictive checks should be preferred for ranking different non-nested GLMM alternatives. Nevertheless, the question of what constitutes a noteworthy difference in DIC or in mean scores to distinguish between competing model types has not yet received a satisfactory answer. For Bayes factors, calibration scales have been proposed, but no credible scale has been proposed for the difference in DIC or the difference in mean scores [54]. In conclusion, we recommend using several instruments for model evaluation to gain further insight into different aspects of a statistical model, such as forecasting ability, combined assessment of calibration and sharpness, and comparison of features of the model-based posterior predictive distribution with equivalent features of the observed data.
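Continuing the hypothetical fits above, this recommendation can be applied by tabulating the instruments side by side; what matters is less the absolute differences than whether the instruments agree on the model ranking.

## Side-by-side comparison of the hypothetical fits m1 and m2.
data.frame(
  model   = c("Poisson, random intercept", "NB, random intercept + slope"),
  DIC     = c(m1$dic$dic, m2$dic$dic),
  mean.LS = c(-mean(log(m1$cpo$cpo)), -mean(log(m2$cpo$cpo)))
)

## PIT histograms complement the scores: they assess calibration only,
## so a model may win on the mean logarithmic score and still show a
## clearly non-uniform PIT histogram.
par(mfrow = c(1, 2))
hist(m1$cpo$pit, breaks = 10, main = "PIT: Poisson", xlab = "PIT")
hist(m2$cpo$pit, breaks = 10, main = "PIT: NB", xlab = "PIT")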
Modeling of clinical trial data
In the clinical trial setting, Bayesian instruments based on the INLA approach can be applied as decision support to pre-specify a suitable final model for sensitivity analyses. Provided that adequate pilot data exist, an appropriate modeling strategy is developed using prior information obtained from a trial in an earlier phase. Sensitivity analyses are important to investigate the effects of deviations from a statistical model and its underlying assumptions. Furthermore, it is necessary to assess in what way the (posterior) inference depends on extreme observations and on unverifiable model assumptions. Altogether, situation-specific robustness of the proposed analyses must be checked carefully.
Conclusions
The statistical model must be specified in the SAP before acquiring the real trial data. Similar independent data (for example, from patients treated with the standard treatment) may serve as a basis for decision-making. The analyses proposed in the SAP have to be appropriate and should rely on a minimum of assumptions [77, 78]. Sensitivity analyses help to assess whether, for example, results from simple testing procedures applied to the primary efficacy analysis agree with the results obtained by additional, more complex analyses. These analyses may consider more complex settings for individual-level parameters, or different distributional forms of individual inhomogeneity. By studying agreement between different strategies via sensitivity analyses in the statistical report, such as simple tests for the primary analysis together with modeling approaches for sensitivity analyses, it is possible to explore the robustness and accuracy of the treatment effect estimates.
We look at the situation in which there are various possible analyses of a given hypothesis (in our case: no treatment×time interaction), each relying on different distributional assumptions (specified through the random effects structure and the corresponding response distribution). In this case, robustness would come from different analyses, with different assumptions, showing substantial agreement. At first glance, there is no ordering that would allow us to declare one analysis better than another by virtue of its specific distributional assumptions. On a second look, the logarithmic score, the DIC, and the PIT provide scores that establish such an ordering.
We concentrated on justifying the distributional assumptions in the count response situation, that is, checking deviations from the assumptions regarding the stochastic part of the hierarchical model, since there was no evidence for specific prognostic factors or baseline adjustment factors that would improve the precision of the results when included in the model. We also explored the effect of random effects structures that include subject-specific intercepts and/or slopes. The simulation could also demonstrate how much power would be lost if we chose a more general model (NB distribution, random intercept and random slope) rather than a simple model (Poisson distribution, random intercept, no random slope) when the simple model is true; the sketch below outlines this comparison.
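A minimal version of such a power comparison under the simple (true) model might look as follows; the data-generating parameter values and the decision rule (95% credible interval for the treatment-by-time interaction excluding zero) are illustrative assumptions, reusing the hypothetical setup from above.

## Sketch: simulate under the simple (true) Poisson random-intercept
## model, fit a candidate model, and check whether the 95% credible
## interval for the treatment-by-time interaction excludes zero.
one.run <- function(fam, fml) {
  sim <- df
  b.i <- rep(rnorm(n.pat, 0, 0.3), each = n.vis)  # true random intercepts
  sim$y <- rpois(nrow(sim),
                 exp(1 + 0.1 * sim$time - 0.2 * sim$treat * sim$time + b.i))
  fit <- inla(fml, family = fam, data = sim)
  lo <- fit$summary.fixed["treat:time", "0.025quant"]
  hi <- fit$summary.fixed["treat:time", "0.975quant"]
  lo > 0 || hi < 0                                # interaction 'detected'?
}

## Empirical power of the simple vs. the more general model, e.g.:
## mean(replicate(200, one.run("poisson",
##   y ~ treat * time + f(id, model = "iid"))))
## mean(replicate(200, one.run("nbinomial",
##   y ~ treat * time + f(id, model = "iid") + f(id2, time, model = "iid"))))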
However, the Bayesian toolbox used here is general-purpose and can also be applied to detect other, more complex forms of misspecification, such as non-Gaussian distributed random effects [75], alternative functional relationships in the population-mean structure, random effects precisions depending on a binary covariate (e.g. treatment group), alternative prior distributions, or different hyperprior parameter values. Finally, checking whether the modeling assumptions for the primary efficacy analysis are reasonable may also form part of the sensitivity analyses of the trial data.
Competing interests
The authors declare that they have no competing interests.