Background
Multilevel or hierarchical designs and clustered data are now commonplace in many fields including medical and health sciences research. These designs involve units within clusters, for example patients or subjects clustered at sites or centers, or multiple measurements of an outcome clustered on individual subjects. Models for multilevel data may contain fixed effects describing measured characteristics at any level of the design, as well as random effects describing unexplained cluster variation. Non-normal outcomes, such as binary, skewed, count, or time to event data, are also common and present additional challenges due to non-linearity, different scales (e.g. probability, rate, or hazard), and different effect measures (e.g. risk ratio, odds ratio, rate ratio, and hazard ratio).
One very common situation in health services research involves patients clustered within sites. In these situations, fixed effects are constructed for measured patient risks and site characteristics, and random effects are used to model unexplained variation in outcomes or procedure use across sites. This unexplained variation, sometimes referred to as general contextual effects (GCE) [1, 2], induces dependence among subjects within a cluster, and so inference for fixed effects requires analytic methods that can accommodate this dependence, such as generalized linear mixed models (GLMM), multilevel regression models (MLRM), or generalized estimating equations (GEE) [3–7]. Methods for estimation and inference for fixed effects in these models are well established and described in many references, e.g. [3–6]. These analyses and results are standard for clinical studies and answer one set of important questions: estimation and inference for associations between individual fixed effects and outcomes.
Several other objectives in the analysis of multilevel designs are less often recognized, and methods to study them are less well developed. First, GCE is often important in its own right and can be the main focus of a study because it represents unexplained differences across treatment sites in clinical outcomes, cost, use of procedures, or compliance with guidelines. Second, quantifying variation due to sets of covariates, such as all patient risks or all hospital characteristics, in comparison to GCE is also important for understanding the relative contribution of each. Some work has been done comparing sets of risks or characteristics to GCE to help place both in context, e.g. [1, 8], but more attention is needed for this important problem. In this paper we propose methods for studying these two objectives.
While most previous studies have considered GCE when assessing clinical outcomes [9–13], the same analytic methods can be used to study GCE in processes of care [14], as in [1], for example, which studied use of medications and specialty physicians. In these studies, GCE is particularly important for several reasons. First, GCE in health care processes represents variation that is not driven by patient characteristics or treatment guidelines, since results are virtually always adjusted for patient risk. Second, GCE in processes of care is potentially modifiable through site-specific interventions, which can be tested using trial designs such as cluster-randomized [15–18] or stepped-wedge [19] designs. Finally, GCE, particularly in processes of care, can be large, even dominating measured patient and hospital effects, as we illustrate in our application below.
Several categories of methods have been used to study and quantify GCE, each addressing different questions. One set of methods, exemplified by the intra-class correlation (ICC) [20] and percent change in variance [1], seeks to determine the percentage of total variation in outcomes due to cluster variation. These methods are useful for answering questions about relative sources of variation, but are not directly comparable to effect measures such as odds ratios. A second set of methods has the goal of quantifying GCE in comparison to fixed effects. One simple approach, which we call Individual Outcome Measures (IOM), involves calculating intervals for the mean outcome (e.g. probability) as the cluster random effect ranges across its (usually normal) distribution, e.g. [3, 4]; however, interval widths differ for each covariate pattern, making comparisons difficult and impractical with more than a few covariates. Additional approaches, such as Median and Interval Odds Ratios (MOR, IOR) [21–23] and, more recently, Median Hazard Ratios (MHR) [2], are based on odds or hazard ratios comparing subjects in two randomly selected clusters. A third set of methods studies individual clusters by ranking them or identifying outlying clusters, but is limited in that differential cluster size may lead to small clusters being more likely to have extreme averages. More sophisticated statistical methods, termed institutional profiling, are based on GLMM or MLRM and are used extensively in health services research [24]. These methods answer the question of cluster variation only indirectly, since variability in cluster-specific estimates is affected by sample size. A recent paper has reviewed and combined some of the methods discussed above into a stepwise procedure [1].
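For intuition, two of the summaries above have well-known closed forms when the model is logistic with a normal random intercept of variance σu²: the latent-scale ICC is σu²/(σu² + π²/3), and the MOR is exp(√2·σu·Φ⁻¹(0.75)). A minimal sketch of both (the variance value is illustrative, not drawn from any study discussed here):

```python
from math import pi, sqrt, exp
from statistics import NormalDist

sigma2_u = 0.5  # illustrative cluster-intercept variance (an assumption)

# Latent-scale ICC for a logistic model: share of total latent variation
# due to clusters, where the logistic residual variance is pi^2 / 3.
icc = sigma2_u / (sigma2_u + pi**2 / 3)

# Median odds ratio: median OR comparing the higher- to the lower-odds
# subject across two randomly chosen clusters.
mor = exp(sqrt(2 * sigma2_u) * NormalDist().inv_cdf(0.75))

print(round(icc, 3), round(mor, 3))
```

Note that the ICC lives on a proportion-of-variance scale while the MOR lives on the odds ratio scale, which is why the two are not directly comparable.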
We focus on methods that quantify and compare fixed and random effects on the effect (e.g. odds ratio) scale and so address questions similar to those of MOR/MHR. The methods we propose provide some advantages in terms of interpretation and visualization, as we discuss below. The approach we describe is based on comparing subjects in clusters at specified percentiles of the random effect distributions to subjects in a “reference”, e.g. median, cluster. We refer to these methods as Reference Effect Measures (REM) [25]. Subjects are compared based on the relevant effect measure, for example odds ratios for logistic regression. By comparing two subjects with identical covariate patterns, the resulting odds ratio between subjects in two clusters does not depend on the particular covariate pattern. This approach is based on percentiles and so generalizes easily to non-normal or empirical distributions, and is amenable to graphical presentation.
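Concretely, when the random intercept u is normal with standard deviation σu, the median cluster has u = 0, so the REM at percentile p reduces to exp(σu·Φ⁻¹(p)) on the odds ratio scale. A minimal sketch, with an illustrative value of σu (an assumption, not an estimate from our data):

```python
from statistics import NormalDist
from math import exp

def rem(p, sigma_u):
    """Odds ratio comparing a subject in a cluster at percentile p of a
    N(0, sigma_u^2) random-intercept distribution to the same covariate
    pattern in the median (u = 0) cluster."""
    return exp(sigma_u * NormalDist().inv_cdf(p))

sigma_u = 0.7  # illustrative random-intercept SD (an assumption)
for p in (0.75, 0.975):
    print(p, round(rem(p, sigma_u), 3))
```

Because the comparison holds the covariate pattern fixed, the resulting odds ratio depends only on the percentile and on σu, not on which subject is chosen.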
To illustrate these methods, we use data from the Department of Veterans Affairs (VA) for a population of patients with an index inpatient admission for atrial fibrillation (AF), the most common form of cardiac arrhythmia, between 2001 and 2012 at one of the 124 VA hospitals that treat such patients. Typical studies within this setting adjust for and/or assess associations of 10–30 patient and hospital characteristics with binary, continuous, or time to event patient outcomes. Variation in these outcomes across hospitals is an inescapable feature of the data and is often of primary clinical or health services research interest. The example described in detail below involves mixed logistic regression models for probability of mortality and of initiation of a cardiac rhythm control strategy within 90 days after discharge from the index admission for AF.
In this paper we consider two objectives in the analysis of cluster variation for non-normal outcomes in multilevel designs: 1) Quantify GCE and compare it to individual subject and cluster covariate effects, and 2) Quantify relative magnitudes of GCE and variation from sets of measured factors. In the Methods section we first summarize concepts and notation for multilevel logistic models. We then describe REM methods in more detail and discuss how these methods address our two objectives. We also describe use of REM with other common situations such as GLMM and Cox proportional hazards models. In the Results section we apply these methods to objectives 1 and 2 for hospital variation in selection of treatments for atrial fibrillation patients at 124 VA hospitals. We also illustrate and compare several alternative methods. In a second example we contrast these results with hospital variation in mortality, and illustrate methods for visualizing and presenting results in easily interpretable ways.
Discussion
In this paper we have proposed methods for quantifying and displaying variation in outcomes and treatments due to unexplained cluster sources as well as to sets of measured patient and hospital variables. These methods allow sources of variation to be studied on the same scale as effects of individual variables and, unlike other methods, can be easily incorporated into standard visual displays such as forest plots. Additionally, REM offers the flexibility of having standard values to report, such as REM(0.75) for moderate comparisons or REM(0.975) for extreme comparisons, while also allowing calculation of all other percentiles and a complete description of the distributions of interest. This is useful for directly relating the impact of a fixed effect to the exact percentile in the risk distribution of unmeasured site variation. Finally, the methods are widely applicable to multiple random effects or random slopes, empirical distributions, and random effects with non-normal distributions in multilevel studies. We used these methods to show that treatment for a common and serious cardiac condition (AF) is highly variable across VA hospitals, and this GCE is at least as great a source of variation in treatment use as all patient factors combined. These results suggest opportunities for study and improvement of patient care for AF patients, and illustrate the usefulness of the proposed methods.
As methods to model hierarchical data become more commonplace, it also becomes essential to develop meaningful and interpretable ways of presenting results. This is particularly true for random cluster variation (GCE), which has received much less attention than individual fixed effects, especially in the context of non-normal outcomes. With growing interest in studying processes of care and health care system-level questions, further fueled by the growth of electronic health records (EHR), cluster variation will often be a factor of primary interest, particularly for processes of care, since these processes are often driven by unmeasured provider characteristics (e.g. preferences, training) or local culture that are difficult to capture in a model. Thus, understanding and explaining GCE in the context of other sources of variation will remain an area of focus moving forward.
Previously proposed methods for studying GCE are summarized in the Introduction and at the end of Example 1. The lack of a single summary measure for GCE is related to the fact that several questions can be asked about GCE. If the goal is to understand what proportion of total variation is attributed to cluster variation, ICC and related methods are available [1, 20]. If the goal is to rank or identify outlying sites, profiling methods can be used [24]. For quantifying GCE on the same scale as fixed effects, MOR methods have been used [21–23]. The recently proposed stepwise approach to analysis of variability for multilevel data considers several of these in its step 2 [1]. REM methods focus on quantifying GCE in the context of standard analyses through direct comparisons to individual fixed effects and sets of fixed effects on the same scale, similar to the questions addressed using MOR.
Interpretation of cluster level fixed effects has generated some controversy in the literature. Many authors interpret cluster level covariate effects in the same way as patient level covariate effects, comparing patients with the same measured characteristics at hospitals with the same random and fixed effect values but differing by one unit in the cluster level covariate being considered. Others have argued that this interpretation is invalid since the design does not allow the same subjects to be observed at clusters with different cluster level covariate values. The latter interpretation motivated the development of the Interval Odds Ratio [21, 22]. We acknowledge the merit of the latter argument, but consider the former conditional interpretation valid in the context of the models used and the assumptions made. REM for cluster level covariates is consistent with the former interpretation, and with describing model (1) through components of the linear predictor L = β0 + x′sβs + x′cβc + u. We also note that this issue involves cluster level fixed effects but does not involve interpretation of subject level covariate effects or random cluster effects.
REM methods provide several advantages. The methods are general and apply to most types of outcomes and models commonly encountered in health research, including binary, continuous, count, and time to event. The methods are based on percentiles, which allow a more complete description of non-normal distributions that may arise from empirical distributions of sets of variables as in Eq. (4), or from use of non-normal random effects [34]. Percentiles also provide easy interpretations in terms of comparison with individual fixed effects and ranges, or the ‘95% rule’ for normal distributions. These interpretations are more familiar than the interpretation of MOR, which summarizes a distribution of two patients at randomly selected clusters, comparing the higher risk patient to the lower risk patient. A further advantage of REM is easy visualization, for example in forest plots like Fig. 1. When assessing factors driving variation in treatment use or mortality, we presented ranges with shading showing several percentiles for different groupings of fixed effects and GCE. Widths of these ranges show visually which sets of measured factors and GCE are the largest drivers of variation in initiation of treatment. The distribution of differences between patients used to construct MOR is less interpretable graphically.
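The empirical-distribution case can be sketched as follows: for a set of covariates, take each subject's combined linear-predictor contribution and compare percentiles of that empirical distribution to its median on the odds ratio scale. The simulated contributions below are an assumption for illustration only; the paper's Eq. (4) gives the authoritative definition.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: linear-predictor contributions x'beta of one set of
# covariates (e.g. all patient factors) across a sample of subjects.
# The distribution and scale here are assumptions, not estimates.
contrib = rng.normal(0.0, 0.8, size=10_000)

def rem_set(p):
    """Empirical reference effect measure for a set of covariates: odds
    ratio comparing a subject at percentile p of the contribution
    distribution to a subject at its median."""
    return np.exp(np.quantile(contrib, p) - np.quantile(contrib, 0.5))

for p in (0.75, 0.975):
    print(p, round(float(rem_set(p)), 2))
```

Because only quantiles of the empirical distribution are used, nothing here relies on normality of the contributions; the same computation applies to skewed or multimodal sets of covariate effects.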
We have further explored the extent to which GCE drives variation in processes and outcomes through comparison of GCE to sets of fixed patient and hospital effects. Quantifying the discriminatory ability of sets of fixed effects is also considered by [1] in their step 1, using the area under the receiver operating characteristic curve (AUC). Our proposed REM approach for sets of fixed effects is complementary to AUC, and will be useful when direct comparisons with individual fixed effects or with GCE on the same scales are of most interest.
One issue we have not included in our analyses, but that could be easily handled with REM, involves imbalance of subject factors across clusters. This can occur when different hospitals tend to treat different types of patients. Methods are available for partitioning patient factors into between-cluster components (e.g. hospital averages) and within-cluster components (subject values or deviations of subject values from hospital averages) [41, 42]. Both between-cluster and within-cluster components can be included as covariates as usual, and REM methods would allow this component of variation to be quantified separately from patient factors and hospital system factors.
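As a sketch of this partitioning (the column names and values are assumptions for illustration), each subject-level covariate is split into the hospital mean and the subject's deviation from that mean:

```python
import pandas as pd

# Illustrative data: a subject-level covariate measured at two hospitals.
df = pd.DataFrame({
    "hospital": ["A", "A", "B", "B", "B"],
    "age": [60.0, 70.0, 50.0, 55.0, 75.0],
})

# Between-cluster component: hospital average of the covariate.
df["age_between"] = df.groupby("hospital")["age"].transform("mean")
# Within-cluster component: subject's deviation from the hospital mean.
df["age_within"] = df["age"] - df["age_between"]

print(df)
```

Both derived columns can then enter the model as separate covariates, so that between-hospital and within-hospital contributions of the patient factor are estimated, and quantified with REM, separately.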