2.2 The general concept of bias: biased versus unbiased estimator
In order to define the terms used to describe the different types of bias, we will assume that the following linear regression model describes the underlying mechanism of the association of interest within the entire
population of interest/
source population. In economics, this equation is known as the
population equation, whereas in epidemiology there is no equivalent term. For expositional purposes we use a linear model that describes the underlying mechanism of the
population of interest/
source population,
$$\begin{aligned} y_i=\beta _0+\beta _1 x_{1i}+...+\beta _n x_{ni}+\varepsilon _i \end{aligned}$$
(1)
where
\(y_i\) represents the outcome for the individual
i,
\(x_1 .. x_n\) represent the explanatory variables, and
\(\varepsilon\) stands for the unobservable random error term.
\(\beta _1 .. \beta _n\) are the coefficients for the explanatory variables
\(x_1 .. x_n\).
When the
sample/
study population is representative of the
population of interest/
source population, estimation of the linear regression model described in Eq. (
1) in the
sample/
study population results in parameter estimates that on average coincide with the parameters of the
population equation.
In epidemiology,
bias can result from the way in which individuals are selected (
selection bias), the way in which the variables are measured (
information bias/
measurement error), or a failure to control for the impact of explanatory variables (
confounding) (Bouter et al.
2017; Grimes and Schulz
2002; Rothman
2012). Similarly, economists refer to bias when the value of the parameter being estimated (a property of the population) and the expectation of the estimator differ (
\(E[{\hat{\beta }} ] \ne \beta\)) (Wooldridge
2009). For an estimator
\({\hat{\beta }}\) to be unbiased, it is required that
\(E[{\hat{\beta }} ]= \beta\), meaning that the expected value of the estimator is equal to the parameter value
\(\beta\). In economic terms, the latter holds if the conditional expectation of
\(\varepsilon _i\) given
\(x_{1i}...x_{ni}\) is zero:
$$\begin{aligned} E(\varepsilon _i \mid x_{1i}...x_{ni}) = E(\varepsilon _i \mid X_{i}) = 0. \end{aligned}$$
(2)
Economists also refer to this as the
zero conditional mean assumption (Wooldridge
2010). Simply put, this means that if the
sample/
study population is randomly selected from the
population of interest/
source population, the error term has a mean of zero and is uncorrelated with each of the explanatory variables in the model. If this assumption holds, then each explanatory variable is necessarily
exogenous (Wooldridge
2010). That is, the variable is not influenced by other variables in the association.
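The unbiasedness property \(E[{\hat{\beta }}]=\beta\) can be illustrated with a minimal simulation sketch. The code below is not part of the original analysis; the sample size, the number of replications, and the coefficient values (`beta0 = 1`, `beta1 = 2`) are assumptions chosen purely for illustration. It repeatedly draws random samples in which the error term is generated independently of the explanatory variable, so that Eq. (2) holds by construction, and then averages the OLS slope estimates.

```python
import numpy as np


def ols(X, y):
    """Return OLS coefficients; X must already contain an intercept column."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def mean_slope_estimate(n_reps=2000, n=200, beta0=1.0, beta1=2.0, seed=0):
    """Average the OLS slope over many random samples (all values are illustrative)."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(n_reps)
    for r in range(n_reps):
        x = rng.normal(size=n)
        eps = rng.normal(size=n)  # drawn independently of x, so E(eps | x) = 0
        y = beta0 + beta1 * x + eps
        X = np.column_stack([np.ones(n), x])
        slopes[r] = ols(X, y)[1]
    return slopes.mean()


print("average slope estimate:", mean_slope_estimate())
```

Under these assumptions the average of the slope estimates should lie close to the assumed value of 2, even though any single estimate deviates from it. This is what "on average" means in the definition of an unbiased estimator.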
The
zero conditional mean assumption is violated when an included regressor is
endogenous, meaning that it is dependent on the error term (
\(\varepsilon _i\)). This phenomenon is referred to as
endogeneity by economists (see also Table
2).
Endogeneity often occurs as a result of self-selection of individuals (Wooldridge
2010). To complicate matters even further, in economics the term
endogeneity is often also used as an umbrella term for various different problems that cause a violation of the
zero conditional mean assumption:
omitted variable bias,
simultaneity or
measurement error (Wooldridge
2009).
2.3 Confounding versus selection bias
In epidemiology,
confounding (Table
1) implies that the effect of
\(x_i\) on
\(y_i\) is mixed with the effect of a third factor, also known as the
confounder (Ahlbom
2021). When this
confounder is not included in the regression model, this leads to bias (Grimes and Schulz
2002; Rothman
2012). Thus, in the terminology used in economics, the
zero conditional mean assumption is violated due to an
omitted variable. It is important to keep in mind that there can be more than one
confounder that can be either observed or unobserved. In order for a
confounder to distort the effect of
\(x_i\) on
\(y_i\), it should be associated with both
\(y_i\) (as a cause or proxy of the cause) and
\(x_{i}\). However, the
confounder should be correlated with
\(x_i\), but should not be an effect of
\(x_{i}\) (Rothman
2012). When
confounding is present, the validity
of a study is compromised, because the estimated association does not reflect the true relationship between
\(x_{i}\) and
\(y_i\) (Bouter et al.
2017; Grimes and Schulz
2002; Rothman
2012; Miettinen and Cook
1981). Within the concept of
confounding, a distinction is often made between
measured confounding and
unmeasured confounding (Ahlbom
2021).
Measured confounding is defined as confounding resulting from variables that are observed and measured. However, when
confounding is not successfully removed or corrected for, or results from a variable that is not observed, this leads to uncontrolled or
residual confounding. When
residual confounding is caused by a failure to observe the confounder, we will refer to the resulting bias as
unmeasured confounding. Economics does not have a term to describe the overall concept of
confounding, but it does have equivalent terms for
measured confounding and
unmeasured confounding, which will be explained in further detail in the next sections.
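The correspondence between confounding and omitted variable bias can be made concrete with a small simulation. This is a hedged sketch under assumed values, not an analysis from the paper: the confounder `z`, its effect on the exposure (0.8), its effect on the outcome (1.5), and the true exposure effect (2.0) are all hypothetical. The function fits one regression that omits the confounder and one that includes it.

```python
import numpy as np


def confounding_demo(n=100_000, seed=1):
    """Compare the slope on x with and without adjustment for a confounder z."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)                      # confounder (hypothetical)
    x = 0.8 * z + rng.normal(size=n)            # exposure depends on z
    y = 1.0 + 2.0 * x + 1.5 * z + rng.normal(size=n)  # outcome depends on x and z
    ones = np.ones(n)
    # Short regression: z is omitted and therefore ends up in the error term.
    short = np.linalg.lstsq(np.column_stack([ones, x]), y, rcond=None)[0]
    # Long regression: z is observed and included (measured confounding).
    long_ = np.linalg.lstsq(np.column_stack([ones, x, z]), y, rcond=None)[0]
    return short[1], long_[1]


slope_omitted, slope_adjusted = confounding_demo()
print("slope without z:", slope_omitted)
print("slope with z:   ", slope_adjusted)
```

With these assumed values the omitted variable bias formula gives a probability limit of \(2 + 1.5 \cdot 0.8/1.64 \approx 2.73\) for the unadjusted slope, whereas the adjusted slope should be close to 2. If `z` were not observed, the second regression would not be available, which is the situation described as unmeasured confounding.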
The epidemiological concept of
measured confounding is present when an unbiased estimator of the treatment effect cannot be obtained by directly comparing outcomes between treatment groups, due to the presence of an observed
confounder, and a correct specification of the outcome model can resolve this problem (Austin
2011a). Economists would then say that the model coincides with the
population equation, and hence the estimator is unbiased. However, researchers can never be sure that the specified model is equal to the
data generating process of the population, because the population equation is unknown. Let us assume that the data generating process of the population is described by the following
population equation:
$$\begin{aligned} y_i=\beta _0+\beta _1 x_{1i}+\beta _2 x_{2i}+\beta _1 x_{1i}*\beta _2 x_{2i}+\beta _2 x_{2i}^2+\beta _3 x_{3i}+\varepsilon _i. \end{aligned}$$
(3)
If one wrongly models the association with ordinary least squares (OLS) according to Eq. (
1) instead of the
true population equation (
3), the
interaction term (
\(\beta _1 x_{1i}*\beta _2 x_{2i}\))
and the quadratic term (
\(\beta _2 x_{2i}^2\)) become part of the error term and thus induce
endogeneity. On the other hand, if the omitted variable is independent from a specific
x of interest, the estimator of the effect of that specific
x on
y remains unbiased and only the standard errors will be compromised. In the example above, however, not taking the interaction and the quadratic terms into account leads to bias, because omitting a relevant term from the regression model results in a correlation between the error term (
\(\varepsilon _i\)) and the explanatory variables
\(\beta _1 x_{1i}\) and
\(\beta _2 x_{2i}\). As a result, the assumption of conditional independence of the error term no longer holds and the estimator is
biased. This conditional independence assumption is what distinguishes
measured from
unmeasured confounding. The assumption of no
unmeasured confounding is sometimes also referred to as
selection on observables,
exogeneity,
conditional independence assumption or
ignorability in economics. When the assumption of
selection on observables holds, correct inferences for causal parameters can be obtained by using methods such as regression-adjustment, matching, re-weighting, and the doubly-robust estimator (Cerulli
2015).
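The consequence of fitting a purely linear model when the data generating process contains an interaction and a quadratic term can be sketched numerically. The example below is an illustration under assumptions that are not taken from the paper: both regressors are drawn from a uniform distribution on [0, 2], the main effects are set to 1.0, and the interaction and quadratic coefficients are set to 0.5. Unlike Eq. (3), the sketch gives each term its own coefficient.

```python
import numpy as np


def misspecification_demo(n=200_000, seed=2):
    """Fit a linear-only and a correctly specified model to the same data."""
    rng = np.random.default_rng(seed)
    x1 = rng.uniform(0.0, 2.0, size=n)
    x2 = rng.uniform(0.0, 2.0, size=n)
    eps = rng.normal(size=n)
    # Hypothetical data generating process with interaction and quadratic terms.
    y = 1.0 + 1.0 * x1 + 1.0 * x2 + 0.5 * x1 * x2 + 0.5 * x2**2 + eps
    ones = np.ones(n)
    X_linear = np.column_stack([ones, x1, x2])
    X_full = np.column_stack([ones, x1, x2, x1 * x2, x2**2])
    b_linear = np.linalg.lstsq(X_linear, y, rcond=None)[0]
    b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
    return b_linear, b_full


b_linear, b_full = misspecification_demo()
print("linear-only model, coefficients on x1 and x2:", b_linear[1:3])
print("full model, coefficients on x1 and x2:       ", b_full[1:3])
```

In this setup the omitted terms are correlated with the included regressors, so the linear-only coefficients on `x1` and `x2` converge to roughly 1.5 and 2.5 rather than to the assumed value of 1.0. The correctly specified model recovers values close to 1.0 for both.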
As indicated above, the epidemiological term
unmeasured confounding refers to
confounding that is the result of a
confounder that was not or poorly measured, and therefore, not taken into account in the data analysis. In some situations, this type of
confounding can be considered to be equivalent to the economics term
selection bias, which arises due to incomplete observation of the population. Economists often use instrumental variable analysis (Angrist and Pischke
2009) and epidemiologists use its analogue, Mendelian randomization in genetic research, to adjust for this type of bias (Streeter et al.
2017).
In economics,
selection bias can take various forms. For instance,
self-selection bias might occur because participation is not randomly determined, thus the selection occurs based on an explanatory variable (
x). The term
self-selection bias is generally used when an indicator of participation might be systematically related to unobserved factors (Wooldridge
2009). In the following example, we represent the situation where the treatment indicator is related to the wage variable (i.e., a person’s monthly salary). Let us assume that we observe only individuals with a wage below 2500 (
\(x_{1i}<2500\)). Wage
\(x_{1i}\) is an explanatory variable of the outcome
\(y_i\). This implies that the error term becomes the following:
$$\begin{aligned} \eta _i=\varepsilon _i+\beta _{1}(x_{1i} \ge 2500), \end{aligned}$$
(4)
which consists of a random error as well as the unobserved part of the population (i.e., individuals with a wage of 2500 or more do not occur in our sample). The conditional expectation of interest is the following:
$$\begin{aligned} \begin{aligned}&E(y_i \mid x_{1i}<2500,... ,x_{ki}, ... ,x_{ni}) \\&\quad =\beta _0+\beta _1 (x_{1i}<2500)+...+\beta _k x_{ki} +...+\beta _n x_{ni}\\&\qquad + E(\eta _i \mid x_{1i}<2500 ,... , x_{ki}, ... ,x_{ni}). \end{aligned} \end{aligned}$$
(5)
If
$$\begin{aligned} E(\eta _i \mid x_{1i} < 2500,... ,x_{ki}, ... ,x_{ni}) \ne 0, \end{aligned}$$
(6)
we are facing
selection bias. If the non-observed characteristics are correlated with any other observed term, then
$$\begin{aligned} corr(x_{1i}\ge 2500,x_{1i}< 2500,... ,x_{ki}, ... ,x_{ni}) \ne 0. \end{aligned}$$
Ergo, if selection is not random, that is, either influenced by the individual or by the sampling researcher, conditional independence does not hold. If the entire variable is unobserved,
selection bias is said to be equal to
omitted variable bias. Hence, in economics the distinction between
self-selection bias and
omitted variable bias is based on the degree of observation of the confounder variable. However, in epidemiology this type of bias is referred to as
confounding (Ahlbom
2021). Instrumental variable analysis can be used to adjust for both
self-selection bias and
omitted variable bias, when a valid instrument is available (Angrist and Pischke
2009). Since
unmeasured confounding cannot be easily resolved with any available data, economists frequently use quasi-experimental designs, such as difference-in-differences (Wing et al.
2018) or regression discontinuity analyses (Robin et al.
2012), to deal with this.
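How instrumental variable analysis can adjust for an unobserved confounder may be sketched as follows. The simulation is a simplified illustration and rests on assumptions that are not in the paper: a single hypothetical instrument `z` that affects the exposure and is independent of both the unobserved confounder `u` and the error term, a true exposure effect of 2.0, and the simple ratio-of-covariances form of the IV estimator, which applies to one instrument and one endogenous regressor.

```python
import numpy as np


def iv_demo(n=100_000, seed=3):
    """Compare the naive OLS slope with a simple IV (ratio of covariances) estimate."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)  # unobserved confounder (hypothetical)
    z = rng.normal(size=n)  # instrument: affects x, independent of u and the error
    x = 1.0 * z + 1.0 * u + rng.normal(size=n)
    y = 1.0 + 2.0 * x + 1.5 * u + rng.normal(size=n)
    # Naive OLS slope of y on x; u is left in the error term.
    ols_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    # IV estimate: cov(z, y) / cov(z, x).
    iv_slope = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
    return ols_slope, iv_slope


ols_slope, iv_slope = iv_demo()
print("OLS slope:", ols_slope)
print("IV slope: ", iv_slope)
```

Under these assumptions the OLS slope is pulled away from 2.0 (its probability limit is 2.5), whereas the IV estimate should be close to 2.0. The result depends entirely on the validity of the instrument, which is assumed here by construction and cannot be taken for granted in applied work.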
To add to the confusion, epidemiologists also use the term
selection bias, but their definition is not necessarily equivalent to the aforementioned economic definition of
selection bias. In epidemiology,
selection bias occurs due to the procedures or processes used to select the
sample/
study population in observational studies (Table
1) (Bouter et al.
2017; Rothman
2012; Hernan and Robins
2019).
Selection bias is present when the association between
\(x_{1}\) and
\(y_i\) differs between the
sample/
study population and
population of interest/
source population (Grimes and Schulz
2002; Rothman
2012). In these cases, the magnitude and direction of the bias are difficult to determine (Ertefaie et al.
2015). As a result, the study’s external validity is compromised because the identified association cannot be generalized to the
population of interest/
source population. This is in line with another form of
selection bias in economics, which can arise due to
endogenous sample selection, and is referred to by some researchers as
sample selection bias. The latter implies that the non-random sample selection from the population occurs based on the outcome variable
y. For instance, suppose we intend to estimate the relationship between frailty (
y) and several other factors in the population of adults:
$$\begin{aligned} y_i=\beta _0+\beta _1 x_{1i}+\beta _2 x_{2i}+\varepsilon _i. \end{aligned}$$
(7)
Sample selection bias occurs in this case if there is selective attrition from the
panel survey/
cohort study, meaning that those who remain in the
sample/
study population have on average better (or worse) outcomes. In our case, those who are frail may not continue to participate in the study, and therefore, the resulting
sample/
study population is no longer random but rather a selective subset of the population. This will also result in a biased and inconsistent estimator, because the
population equation is not in line with the expected value conditional on the outcome being less than a cutoff value (e.g., given a frailty scale of 0-5,
\(E(y_i \mid x_{1i},x_{2i}, y_i<3) \ne E(y_i \mid x_{1i},x_{2i})\)).
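Selection on the outcome can be illustrated with a minimal simulation of selective attrition. The sketch below is hypothetical: it uses a single explanatory variable, an assumed slope of 2.0, and an assumed cutoff of 2.0 above which individuals drop out of the study. It does not reproduce a real frailty scale.

```python
import numpy as np


def slope(x, y):
    """OLS slope of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def attrition_demo(n=100_000, cutoff=2.0, seed=4):
    """Estimate the slope in the full population and among those who remain."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(size=n)  # hypothetical population equation
    stays = y <= cutoff                     # individuals with high outcome values drop out
    return slope(x, y), slope(x[stays], y[stays])


slope_population, slope_remaining = attrition_demo()
print("slope in full population:     ", slope_population)
print("slope among those who remain: ", slope_remaining)
```

Because retention depends on the outcome itself, the slope estimated among those who remain is attenuated relative to the assumed population value of 2.0. The error term is no longer mean-independent of the explanatory variable within the selected sample.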
Thus, in economics the term selection bias incorporates different forms of selection bias (e.g., sample selection bias, treatment selection bias, and self-selection bias), but these specific terms are not used as often as their overarching term selection bias. This implies that the definition of selection bias in economics is broader than in epidemiology. The epidemiological concept selection bias is equivalent to the economic concept sample-selection bias, which occurs due to endogenous sample selection. Treatment selection bias and self-selection bias on the other hand occur due to endogenous treatment allocation. The economic terms treatment selection bias and self-selection bias, which refer to non-random treatment uptake and individual self-selection to treatment, respectively, encompass the epidemiological term confounding by indication. Confounding by indication does not have one equivalent term in economics. While selection bias defined by an epidemiologist will be understood by an economist, the reverse might not be true and can lead to confusion.
The epidemiological concept
confounding by indication (Table
1) is a special form of
confounding that occurs when
\(y_i\) is causally related to the indication for
\(x_{1}\) (Catalogue of bias collaboration, Aronson JK, Bankhead C, Mahtani KR, Nunan D.
2018; Miettinen and Cook
1981). In other words, individuals in the intervention group are different from those in the control group, based on one or more underlying factors that influenced their choice for the intervention (Rothman
2012). Randomization is the best way to prevent
confounding by indication (Bhide et al.
2018). However, when using non-experimental data, the decision to allocate or start an intervention (i.e.,
\(x_{1}\)), may be influenced by a wide variety of underlying factors (e.g., therapist or patient preferences, the severity of the disease, prognosis, availability) (Grobbee and Hoes
2014). If these underlying factors are positively or negatively associated with the outcome,
confounding by indication is present. As a consequence, the validity of the study is compromised. It is important to note that
confounding by indication is in fact a form of
selection bias that cannot be fully adjusted for, because factors that drive the choice for an intervention are often not completely known or difficult to measure. This implies that there will be a substantial amount of
residual confounding.
In epidemiology,
time-varying confounding is said to occur when confounders have values that change over time. Examples of time-varying confounders can be labor market status, body mass index, and depression severity.
Time-varying confounding can also occur with changes in a time-varying intervention (i.e., an intervention that is not fixed in time), such as a treatment dose (Platt et al.
2009). This type of
confounding can be resolved by using marginal structural models, g-computation, targeted maximum likelihood estimation, or G-estimation of structural nested models (Clare et al.
2019). In economics, this phenomenon is another example of the violation of the zero conditional mean assumption in a longitudinal context and it is categorized as an
endogeneity problem. Most frequently, it is dealt with by applying instrumental variable techniques or fixed effects.
Finally, the term
collider bias can create confusion, especially when compared to the term
confounding. In
collider bias, similar to
confounding, the effect of
\(x_i\) on
\(y_i\) is distorted. The difference lies in the fact that in the case of
collider bias both
\(x_i\) and
\(y_i\) independently cause a third factor, also known as the
collider (i.e., a collider is a variable that is caused by two other variables: one that is (or is associated with) the treatment, and another one that is (or is associated with) the outcome) (Catalogue of bias collaboration, Lee H, Aronson JK, Nunan D.
2019; Griffith et al.
2020; Elwert and Winship
2014). When
confounding is present, the
confounder variable is associated with both
\(x_i\) and
\(y_i\), but
\(x_i\) and
\(y_i\) do not independently cause the confounder (like in the case of a
collider). When the
collider influences the likelihood of being selected into a study and this is not accounted for (either in the study design or in the statistical analysis), bias can result (Catalogue of bias collaboration, Lee H, Aronson JK, Nunan D.
2019; Griffith et al.
2020). Thus, in some cases
selection bias can be considered a type of
collider bias (Catalogue of bias collaboration, Lee H, Aronson JK, Nunan D.
2019). This is because just like selection bias,
collider bias stems from conditioning (e.g., controlling, stratifying, or selecting) on the collider variable. In economics,
endogenous selection bias is equivalent to
collider bias (Elwert and Winship
2014).
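The mechanism behind collider bias can be sketched with a simulation in which the exposure and the outcome are unrelated by construction. All quantities are assumptions made for illustration: `x` and `y` are independent, a hypothetical collider `c` is caused by both, and individuals enter the study only if `c` exceeds a threshold of 1.0.

```python
import numpy as np


def slope(x, y):
    """OLS slope of y on x with an intercept."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def collider_demo(n=100_000, threshold=1.0, seed=5):
    """Show that selecting on a collider induces an association between x and y."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rng.normal(size=n)               # independent of x: the true effect is zero
    c = x + y + rng.normal(size=n)       # collider caused by both x and y
    selected = c > threshold             # conditioning on the collider via selection
    return slope(x, y), slope(x[selected], y[selected])


slope_everyone, slope_selected = collider_demo()
print("slope in full population: ", slope_everyone)
print("slope in selected sample: ", slope_selected)
```

In the full population the estimated slope should be close to zero, while in the selected sample a negative association appears although `x` has no effect on `y`. This is the sense in which conditioning on a collider, here through selection into the study, creates bias rather than removing it.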
2.4 Terms in a nutshell
To provide a clear overview, we summarize the equivalent concepts of bias in epidemiology and economics. In Table
3, we list the field-specific terms with their proposed equivalents. Furthermore, Fig.
2 maps the economic terms for bias onto epidemiological terms.
The economic term
endogeneity implies a correlation between the error term and the explanatory variables, and essentially indicates the violation of the zero conditional mean assumption. The term has no exact equivalent in epidemiology. According to epidemiologists,
confounding occurs at the intervention uptake level, and can occur when intervention allocation is not random. Economists use the term
selection on observables when referring to the assumption of no
unmeasured confounding, which is the key assumption for inference in presence of
measured confounding in epidemiology.
Unmeasured confounding and
omitted variable bias represent the same phenomenon
which occurs when intervention uptake is associated with unobserved characteristics.
Omitted variable bias (if the variable is fully unobserved) and
self-selection bias (if the variable is partially unobserved) /
unmeasured confounding can be considered subcategories of
endogeneity.
Unmeasured confounding and
omitted variable bias both arise in the phase when the
sample/
study population is selected by the researcher or when the data collection is carried out (Fig.
2). Sampling from the population using a pre-defined set of inclusion and exclusion criteria will result in bias if the sample selection is related to characteristics that are associated with the outcome and the intervention allocation, i.e.,
sample selection bias or
endogenous sample selection according to economists or
selection bias according to epidemiologists (Fig.
1). Thus, in this case the terms are equivalent between fields. Economists, however, in general do not distinguish whether the bias occurs at sample selection or intervention uptake level, and use the overarching term of
selection bias for all scenarios where an explanatory variable is related to both intervention and outcome (Fig.
2). The economics definition of
selection bias can, thus, be equated with the term
confounding used in epidemiology, when it refers to
treatment-selection bias or
self-selection bias. However, the terms
treatment selection bias and
self-selection bias are less often used than the overarching term
selection bias.
Table 3
Field-specific terms with proposed equivalents. The equivalent terms that have exact equivalents are displayed in parallel lines. Terms that have no exact equivalents are not part of the table
Epidemiology | Economics |
Bias | Bias |
– | Endogeneity |
Unmeasured confounding | Self- or treatment-selection bias (if the variable is partially unobserved) / Omitted variable bias (if the variable is fully unobserved) |
Measured confounding | Selection on observables |
Information bias | Measurement error |
Selection bias | Sample selection bias or endogenous sample selection |
Confounding by indication | Treatment selection bias / Self-selection bias |