Background
The multiplicative model quantifies the joint effects of exposures on the relative risk of disease and is the mainstay of case–control analysis [1]. The contribution of the multiplicative model to studies of disease etiology is undeniable. However, there are several epidemiological questions that are more easily addressed with an additive risk model, where exposure effects are modeled on the absolute risk (probability) scale. In particular, additive risk models can clarify the public health significance of exposure effects [2, 3] and the interpretation of statistical interactions [4–6]. Despite these advantages, the technical difficulties of properly constraining risk estimates to the 0–1 range and a lack of software for constrained additive risk regression have hindered the use of additive risk models in case–control studies [7–9].
We recently encountered the challenge of additive risk modeling with case–control data in an investigation of gender differences in smoking-associated lung cancer in the Environment and Genetics in Lung cancer Etiology (EAGLE) Study, a population-based case–control study conducted in Northern Italy between 2002 and 2005 [10]. In a logistic regression analysis of never and ever smokers in the EAGLE Study, De Matteis and colleagues found evidence of an interaction between gender and pack-years smoked that suggested a higher susceptibility to lung cancer in men [11, 12]. The authors sought to quantify the public health implications of these gender differences by estimating absolute risk differences of lung cancer in men and women, adjusted for other confounders. Such risk difference estimates could in principle be obtained with an additive risk model; however, unlike methods for multiplicative modeling, reliable methods for additive risk regression with case–control data were not available.
To address the challenge of absolute risk estimation in case–control studies, we present a novel regression approach to quantify risk difference associations with population-based case–control data using linear-expit (lexpit) regression. Lexpit regression is an additive-multiplicative risk model for a dichotomous outcome that can incorporate additive and multiplicative effects of risk factors and properly constrains risk estimates to a feasible range. We previously showed that lexpit regression addresses the main technical challenges to additive risk analysis of binary data in cohort studies [13]. Building on this earlier work, we extend lexpit regression to population-based case–control studies by incorporating sampling information into the estimation procedure. After describing the interpretation of lexpit regression and its methodology, we return to the question that motivated the development of these new methods and use the lexpit model to quantify confounder-adjusted risk difference effects of gender for smoking- and non-smoking-associated lung cancer in the EAGLE Study.
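The additive-multiplicative structure described above can be illustrated with a minimal sketch. It assumes the model form suggested by the name linear-expit, namely risk = x'β + expit(z'γ), where x holds covariates with additive effects and z holds covariates with multiplicative effects. The function names and coefficient values are hypothetical and are not taken from the blm package or from the EAGLE estimates.

```python
import math

def expit(t):
    """Inverse logit: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def lexpit_risk(x, beta, z, gamma):
    """Assumed lexpit form: linear (additive) part plus expit part.

    x, beta  : additive covariates and their risk-difference coefficients
    z, gamma : multiplicative covariates (first entry 1 for the intercept)
               and their log odds ratio coefficients
    """
    linear = sum(b * xi for b, xi in zip(beta, x))
    baseline = expit(sum(g * zi for g, zi in zip(gamma, z)))
    risk = linear + baseline
    # A feasible fit must keep every risk inside the probability range.
    if not 0.0 <= risk <= 1.0:
        raise ValueError("coefficients give an infeasible risk")
    return risk

# Hypothetical coefficients, for illustration only
beta = [0.002]        # risk difference per unit of the additive exposure
gamma = [-6.0, 0.1]   # intercept and log odds ratio for z

r_unexposed = lexpit_risk([0], beta, [1, 2.0], gamma)
r_exposed = lexpit_risk([1], beta, [1, 2.0], gamma)
print(r_exposed - r_unexposed)
```

Under this form, the coefficient of an additive covariate is itself the adjusted risk difference, whatever the values of the multiplicative covariates.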
Results
Lexpit regression was performed to assess the absolute risk differences associated with gender and smoking in the Northern Italian population represented by EAGLE participants. Our main interest was in a model that could estimate additive effects for gender, pack-years, and their interaction, with multiplicative effects for all remaining covariates. The included variables and their codings are described in Table 3. The lexpit analysis was conducted in the R language, version 2.15 [19], using our open-source package blm [20] (for usage examples see Additional file 2 and Additional file 3).
Table 3
Representation of variables included in regression analyses of the EAGLE study
Variable | Type | Coding |
Pack-years^a | Continuous | |
Female | Categorical | Male = 0 |
| | Female = 1 |
Age | Continuous | Years |
Education | Trend | None = 0 |
| | Elementary = 1 |
| | Middle school = 2 |
| | High school or more = 3 |
Smoked cigars, pipes, cigarillos | Categorical | Never Smoked = 0 |
| | Smoked = 1 |
ETS in the workplace | Categorical | No ETS = 0 |
| | ETS = 1 |
High-risk occupation^b | Categorical | No occupation = 0 |
| | Occupation = 1 |
Average percent inhaled | Trend | Never smoker = 0 |
| | <25% = 1 |
| | 25-49% = 2 |
| | 50-74% = 3 |
| | 75-100% = 4 |
Years since quitting | Continuous | Years |
Estimates for the additive effects of gender showed a 3-year lung cancer risk that was 4.6 per 100,000 persons higher for women than for men among never smokers, adjusting for other demographic variables (Table 4). The risk difference can also be expressed as a rate by dividing by the duration of risk; e.g., a 34-month risk of 4.6 per 100,000 corresponds to an average rate of 1.6 per 100,000 person-years. We estimate that every 10 additional pack-years smoked increases the 3-year lung cancer risk in male smokers by 52.9 per 100,000 persons but by only 13.6 per 100,000 persons among women, showing a strong female-by-pack-years interaction (RD=−39.3 per 100,000 persons per 10 pack-years smoked, 95% CI=−70.1 to −8.6). After accounting for the risk effects of gender and pack-years, the residual odds ratio effects of the lexpit model indicated that greater age, occupational ETS exposure, and inhalation depth further increased lung cancer risk estimates, while higher education and more years since quitting decreased risk estimates.
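The two conversions in the preceding paragraph are simple arithmetic on the reported risk differences. The short sketch below reproduces them from the Table 4 estimates; the variable names are illustrative only.

```python
# Risk differences per 100,000 persons, as reported in Table 4
rd_female = 4.6          # women vs. men among never smokers
rd_pack_years = 52.9     # per 10 pack-years, men
rd_interaction = -39.3   # female x pack-years, per 10 pack-years

# Express the cumulative risk difference as an average annual rate
months_at_risk = 34
rate_per_year = rd_female / (months_at_risk / 12)

# The pack-year slope in women is the male slope plus the interaction
slope_women = rd_pack_years + rd_interaction

print(round(rate_per_year, 1))  # 1.6 per 100,000 person-years
print(round(slope_women, 1))    # 13.6 per 100,000 per 10 pack-years
```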
Table 4
Lexpit regression analysis of the EAGLE Study
Variable | RD per 100,000 | 95% CI | OR | 95% CI |
Female | 4.6 | (−1.8, 11.0) | | |
Pack-years (per 10 yrs) | 52.9 | (31.9, 73.8) | | |
Female x Pack-years | −39.3 | (−70.1, −8.6) | | |
Age – 60^a | | | 1.12 | (1.10, 1.13) |
Education – 1^b | | | 0.69 | (0.60, 0.80) |
High-risk occupation^c | | | 1.01 | (0.72, 1.41) |
Occupational ETS | | | 1.54 | (0.72, 1.41) |
Cigars, pipes, cigarillos | | | 1.15 | (0.86, 1.53) |
Average percent inhaled | | | 2.19 | (1.99, 2.41) |
Years since quitting | | | 0.94 | (0.93, 0.95) |
As one assessment of whether the multiplicative effects improved the fit of the model, we compared the weighted Hosmer-Lemeshow goodness-of-fit statistic (Additional file 1: Section S3) among the lexpit model, a strictly additive blm model, and a strictly multiplicative logistic model of the same variables. The chi-squared statistic was 20.8 for the blm model, 18.2 for the weighted logistic model, and 15.9 for the lexpit model, indicating an improvement in fit with the additive-multiplicative form we used.
Discussion
We have presented lexpit regression methods to estimate adjusted absolute risk differences with population-based case–control data. By shifting the focus from estimates of relative risk to absolute risk, lexpit regression gives epidemiologists a direct and reliable way to assess the public health significance of an exposure’s effect. Moreover, lexpit regression provides a flexible framework for handling potential confounders, as variables with additive or multiplicative effects can be accommodated. When there is uncertainty about a variable’s mode of effect, we outlined approaches to assess the reasonableness of each effect type. Our open-source R package blm allows the new methods to be implemented with the ease of standard logistic regression.
Lexpit regression is the absolute risk analog to additive-multiplicative models for hazard rates, such as the Cox-Aalen model [21], which have become increasingly popular in the survival literature [22]. Each class of models shares the strength of greater flexibility in the study and representation of the joint effects of risk factors: on the hazard rate, in the case of the Cox-Aalen model, and on the absolute risk of disease, in the case of the lexpit model. The extension of additive-multiplicative models to absolute risk estimation from a variety of study designs is significant because of the importance of individualized risk assessment to public health. To our knowledge, the lexpit model is the first additive-multiplicative regression model of risk that appropriately ensures risk estimates are within the probability scale. Although alternative additive-multiplicative models of risk could be developed by considering other functions for the multiplicative component (e.g. exp), we have focused on the expit function because of its mathematical advantages. Because the expit function is itself bounded between 0 and 1, the lexpit model will require fewer constraints than alternative additive-multiplicative models to produce feasible estimates in the 0–1 probability range.
None of the more than 20 published observational studies that have examined male–female differences in lung cancer etiology has quantified the independent effect of gender on the absolute risk of smoking- and non-smoking-associated lung cancer [23–26]. Using lexpit regression, we were able to address this important public health question. Our findings add to the De Matteis et al. logistic regression of the EAGLE case–control study [11] in two important ways. First, we confirmed that gender differences in the confounder-adjusted effect of pack-years are found on the additive risk scale. Second, we found suggestive evidence that women's risk of lung cancer is higher than men's in never smokers but lower than men's in smokers. Conventional unconditional logistic regression, which does not provide estimates of absolute risk, would not identify these findings, especially given that gender was used as a matching variable in selecting controls. Thus, our novel methods provide further insight into male–female differences in lung cancer risk from previously analyzed data, with direct public health implications.
In their commentary on the De Matteis et al. study, Alberg and colleagues pointed to a need to further delineate the clinical significance of gender differences in lung cancer etiology [12]. Our re-analysis of the EAGLE Study clarifies the clinical relevance of gender effects for lung cancer risk in an Italian population by providing estimates of the excess lung cancer risk associated with gender. The small excess risk in women never smokers suggests that some gender-related etiological factor(s) for non-smoking-related lung cancer remain to be identified. A public health implication of the gender differences we found among smokers concerns selection criteria for computed tomographic lung cancer screening. Current guidelines recommend screening for individuals between ages 55 and 75 years with a minimum of 30 pack-years smoked [27]. However, in an Italian population, we estimate that the excess lung cancer risk for a male 30 pack-year smoker is more than 100 per 100,000 greater than that for an otherwise similar female 30 pack-year smoker. Thus, in keeping with the “equal management for equal risk” principle [28], gender-based risk criteria for lung cancer screening selection may be warranted in some populations.
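The comparison of male and female 30 pack-year smokers follows from the additive part of the model. The sketch below combines the three Table 4 risk differences; the function name is illustrative, and the multiplicative covariates are assumed equal in the two groups so that their contribution cancels in the difference.

```python
def additive_excess(female, pack_years):
    """Additive contribution to 3-year risk, per 100,000 persons.

    Uses the Table 4 risk differences; pack-years enter per 10 units.
    """
    units = pack_years / 10
    return 4.6 * female + 52.9 * units - 39.3 * female * units

male_30 = additive_excess(female=0, pack_years=30)
female_30 = additive_excess(female=1, pack_years=30)
difference = male_30 - female_30

print(round(male_30, 1), round(female_30, 1), round(difference, 1))
# 158.7 45.4 113.3
```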
The implications of the EAGLE lexpit analysis for computed tomographic screening guidelines exemplify how the choice of measure of association in an etiological analysis shapes our understanding of the public health significance of a risk factor's effect. Risk differences measure a risk factor's effect in terms of the number of excess attributable cases in a well-defined population, an explicit measure of the public health significance of an effect, which can be compared across exposures and across diseases. Our study provides an important example of this comparative use of risk differences with respect to gender effects in smoking- and non-smoking-associated lung cancer. Some research has suggested a higher risk of lung cancer among women never smokers [29]. We further elucidated this difference through lexpit analysis by showing that the excess risk in women never smokers was approximately equal to the excess risk with 1 additional pack-year smoked in men as compared to women. As the development of public health interventions and clinical recommendations becomes increasingly guided by individual risk assessment, there will be a growing need for methods like lexpit regression that can facilitate the estimation of absolute risk differences from observational data.
Lexpit regression resolves several limitations of alternative strategies for estimating risk differences from case–control studies. Using non-additive models of risk, such as the logistic model, to estimate a marginal risk difference [30, 31] gives an average over the study population, which is not equivalent to the risk difference effect estimated here. The application of the lexpit model to case–control data extends previously proposed methods for absolute risk estimation that require prospective cohorts or disease registries [32]. Further, lexpit regression advances current methods for assessing additive interactions in case–control studies. It is well known that multiplicative interactions sometimes disappear when modeled on the additive scale [4–6, 33] and vice versa, highlighting the dependence of statistical interactions on the choice of a model's scale. The removal of interactions leads to more parsimonious models whose risk associations have a clearer interpretation. The flexible additive-multiplicative form of the lexpit model can help epidemiologists reduce multiplicative and additive statistical interactions, making it easier to interpret risk effects. While departure from additivity can be detected on the relative risk scale using the relative excess risk due to interaction, this metric is limited because it can only detect the direction of departure from additivity but not the magnitude of the effect [34, 35].
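The limitation of the relative excess risk due to interaction (RERI) can be seen in a small numerical sketch. It uses the standard definition RERI = RR11 − RR10 − RR01 + 1. The relative risks and baseline risks below are hypothetical values chosen only to show that the same RERI is compatible with very different absolute departures from additivity.

```python
def reri(rr10, rr01, rr11):
    """Relative excess risk due to interaction from three relative risks."""
    return rr11 - rr10 - rr01 + 1

def interaction_contrast(baseline_risk, rr10, rr01, rr11):
    """Departure from additivity on the absolute risk scale.

    Equals baseline_risk * RERI, so it cannot be recovered from the
    relative risks alone.
    """
    r00 = baseline_risk
    return r00 * rr11 - r00 * rr10 - r00 * rr01 + r00

# Hypothetical relative risks for the two single exposures and both together
rr10, rr01, rr11 = 2.0, 3.0, 7.0

value = reri(rr10, rr01, rr11)
rare = interaction_contrast(0.001, rr10, rr01, rr11)
common = interaction_contrast(0.01, rr10, rr01, rr11)
print(value, rare, common)
```

A positive RERI indicates a super-additive joint effect in both scenarios, but the number of excess cases per person differs tenfold because it depends on the baseline risk.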
While lexpit regression makes the important advance of allowing case–control studies to make inferences about absolute risk and risk differences of exposures, there are several challenges to its application to case–control data. First, the period of risk for the cumulative risk estimates of the lexpit model is determined by the period of case ascertainment, which may generally prohibit long-term risk estimates. As with other common probability models of case–control data, the lexpit model assumes the population risk of disease is fixed during the period cases and controls are sampled. The population validity of lexpit regression also requires accurate sampling weights, which may be difficult to obtain for studies using a so-called “secondary base” [36], as with hospital or registry controls, for the selection of controls. Further investigation of the availability and accuracy of sampling information in case–control studies is needed to clarify the practical limitations of using sampling data for absolute risk estimation.
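The dependence on accurate sampling weights can be illustrated with a simple inverse-probability-weighted estimate of population risk. This is a generic sketch, not the estimation procedure implemented in blm, and the sample sizes and sampling fractions are hypothetical.

```python
def weighted_risk(n_cases, n_controls, f_cases, f_controls):
    """Inverse-probability-weighted estimate of population risk.

    Each sampled subject is weighted by the inverse of the sampling
    fraction for their group, so the weighted counts estimate the
    numbers of cases and non-cases in the source population.
    """
    cases = n_cases / f_cases
    non_cases = n_controls / f_controls
    return cases / (cases + non_cases)

# Hypothetical study: 400 cases sampled at 80%, 500 controls at 0.1%
risk = weighted_risk(400, 500, f_cases=0.8, f_controls=0.001)

# Same data with the control sampling fraction misstated as 0.2%
risk_misstated = weighted_risk(400, 500, f_cases=0.8, f_controls=0.002)

print(risk * 100_000, risk_misstated * 100_000)
```

In this example, doubling the assumed control sampling fraction roughly doubles the estimated risk, which is why absolute risk estimates are only as reliable as the sampling information behind them.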
Competing interests
The authors have no competing interests to declare.
Authors’ contributions
SAK designed and conducted the study analyses and wrote the first draft of the manuscript. SW, NEC, and MTL conceived of the design of the study sample and the coordination of data collection. SW, HAK, SDM, DC, AWB, and RV provided input on the statistical analyses and the interpretation of the results. All authors contributed to the writing of the manuscript and read and approved the final manuscript.