Undercoverage and nonresponse in web surveys
Currently, due to the absence of a sampling frame of the general population, random sampling for single-mode web surveys is impossible in almost all jurisdictions [13]. Although other sampling procedures are possible, in practice, web surveys are often based on online-recruited access panels, or participants are recruited on the websites they visit. Therefore, elements are missing from the sample due to coverage errors, and inclusion probabilities are unknown for those who respond. Accordingly, design-based estimates from web surveys are not justified by mathematical statistics [14]. To remedy this problem, some survey agencies recruit offline by drawing random samples, for example from a population register, and invite the sampled persons to participate in a web survey. Furthermore, if persons without internet access are provided with online access, offline recruitment may reduce undercoverage bias [15-17].
Although the level of internet access at the household level in the European Union has increased steadily, differences still exist: from 85% in Greece to 99% in Norway (in 2022) [18]. For Germany in 2022, official statistics reported internet access at the household level at 91% and individual internet usage at about 93% [18, 19]. Based on the American Community Survey 2021, the internet penetration rate in the USA is estimated at 90% [20]. Depending on the proportion of people excluded because they lack internet access and on the difference in the target variable between people with and without internet access [21], excluding this subpopulation might bias survey estimates.
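This dependence can be made explicit. If a population of size \(N\) consists of an internet subpopulation (size \(N_{I}\), mean \(\bar{Y}_{I}\)) and a non-internet subpopulation (size \(N_{NI}\), mean \(\bar{Y}_{NI}\)), decomposing the population mean gives the undercoverage bias of a web-only estimate (the notation here is chosen for illustration, not taken from [21]):
$$\begin{aligned} \bar{Y}_{I} - \bar{Y} = \frac{N_{NI}}{N} \left( \bar{Y}_{I} - \bar{Y}_{NI} \right) , \end{aligned}$$
so the bias grows with both the excluded proportion \(N_{NI}/N\) and the difference between the subpopulation means.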
While undercoverage concerns the requirement of internet access, nonresponse refers to the ability and motivation to participate in the survey. Both selection mechanisms, undercoverage and nonresponse, may bias the survey estimates. Accordingly, a survey data set is the result of both selection mechanisms. Disentangling coverage and nonresponse errors is impossible due to the absence of a sampling frame [22]. The larger the proportion of non-respondents and the stronger the correlation between the target variable and the missing data mechanism, the larger the nonresponse bias.
An equation by [21] allows the estimation of the difference between a population mean \(\bar{Y}\) and the mean of a sample with nonresponse \(\bar{Y}_{nr}\):
$$\begin{aligned} \bar{Y} - \bar{Y}_{nr} \approx \frac{R_{\rho Y} S_{\rho } S_{Y}}{\bar{\rho }}. \end{aligned}$$
(1)
Equation (1) assumes that every person has a response propensity \(\rho\), with overall mean \(\bar{\rho }\) and standard deviation \(S_{\rho }\) in the population. \(R_{\rho Y}\) is the correlation between Y and \(\rho\), and \(S_{Y}\) is the standard deviation of Y. The bias (\(\bar{Y} - \bar{Y}_{nr}\)) depends on three quantities: the correlation between the response propensity and the variable to be estimated, the variance of the response propensity, and the variance of the variable of interest. Accordingly, the bias will be small if either the participation rate in the non-probability sample is high or at least one of the other factors (\(R_{\rho Y}\), \(S_{\rho }\), \(S_{Y}\)) is small.
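As a minimal numerical sketch of equation (1), the following uses purely hypothetical values for the four quantities:

```python
def nonresponse_bias(r_rho_y: float, s_rho: float, s_y: float, mean_rho: float) -> float:
    """Approximate nonresponse bias of the respondent mean, equation (1)."""
    return (r_rho_y * s_rho * s_y) / mean_rho

# Hypothetical scenario: moderate correlation between response propensity
# and the target variable, moderate spread of the propensities.
bias = nonresponse_bias(r_rho_y=0.2, s_rho=0.1, s_y=5.0, mean_rho=0.5)
print(f"approximate bias: {bias:.2f}")  # 0.2 * 0.1 * 5.0 / 0.5 = 0.20
```

Doubling the mean response propensity \(\bar{\rho }\) halves the approximate bias, while setting \(R_{\rho Y}\) to zero removes it entirely, mirroring the verbal interpretation above.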
Regardless of response mode or mandatory participation, surveys show a downward trend in response rates [23-26]. Web surveys yield even lower response rates than other response modes [27]. In general, the risk of biased estimates increases with an increasing proportion of nonresponse and an increasing correlation between the response variable and the mechanisms causing nonresponse [28]. However, although probability-based surveys suffer from decreasing response rates, their estimates still seem, empirically, to be more accurate than estimates obtained from non-probability samples [11]. Given equation (1), this empirical result is mathematically plausible only if, in probability-based surveys, the correlation between response propensity and the variables of interest is low or the differences in response propensities are small.
Internet use and health
The mechanisms causing differences in internet use depending on health conditions are rarely discussed. However, the selection process from the target population to the population covered by web surveys can be characterized by six requirements. First, the technical requirement of a working internet connection by line, WiFi, digital cellular networks or satellites must be fulfilled. Second, respondents need sufficient financial resources if internet access is not provided free of charge. Third, using a smartphone or a computer requires the physical ability to see (or hear) and the ability to type or speak. Fourth, answering survey questions requires cognitive abilities such as understanding abstract concepts, word finding and judgment. Fifth, recruitment for a survey needs a mode of contacting the respondent, which usually requires a sampling frame. Such a frame is rare for web-based population surveys. Therefore, offline sampling must rely on other frames, such as phone numbers or address lists, which can only be used indirectly for web surveys. For online sampling, river sampling or similar non-probability sampling techniques are necessary. Sixth, the designated respondent needs sufficient motivation to answer a survey request.
Physical or cognitive capabilities might be impaired depending on the symptoms of a medical condition. Due to hospital stays or the involvement of caregivers, the probability of contacting the designated respondent may vary between contact modes and medical conditions. Finally, motivation to respond might decrease (or increase) depending on the type and severity of the medical condition.
The effects might not necessarily be linear or additive. For example, physical disabilities might impact survey participation only for severely disabled persons. Increasing physical or cognitive problems might reduce motivation as well. Therefore, neither a diagnosis (a reported code from the International Classification of Diseases, ICD) nor a specific symptom alone will be a sufficient or necessary condition for survey response. Hence, no simple pattern of health conditions and survey participation can be expected.
However, some studies examine whether the potential bias can be reduced by weighting. [29] reported that weighting by age and gender did not eliminate differences between a web and a CAPI survey in BMI, eating habits, physical activity, alcohol consumption, and smoking. [2] reported that web respondents smoked less often, had fewer children, and less often had a chronic disease. Health differences observed between internet users and non-users (based on the Michigan Behavioral Risk Factor Surveillance System (BRFSS) survey 2003 and the Health and Retirement Study 2002) are described by [30]. Based on reported internet usage in European (European Social Survey) and US data (BRFSS) obtained by conventional surveys (F2F and CATI), [31] note that ‘(...) calibration on age, gender, ethnic background, urban residence, education and household income does not eliminate the observed health differences’. In a probability-based survey with the web as a response mode, [32] showed significant differences between web and face-to-face respondents after controlling for gender, age, region, marital status, household size, educational attainment, and country of birth. Recently, in a comprehensive analysis of American data, [33] showed persisting differences between internet users and non-users regarding, for example, age, employment, cultural activities, and education. Using British and Swedish data, [34] provided evidence for similar differences regarding, for example, age, low level of education, and living alone. Both publications identified health issues (such as disability) as predictors of internet use. [33] summarized their findings: ‘Without some reasonable adjustment, a variable like health status has a high risk of being significantly biased in studies that do not cover the non-internet population’. Hence, there is growing evidence of correlations between health indicators and internet use that cannot be corrected by weighting procedures.
Selection mechanisms and weighting
The data missing due to coverage and nonresponse errors can be described by three different data generating mechanisms: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) [35, 36].
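As a minimal simulation sketch (all data and coefficients are hypothetical), the three mechanisms differ in what drives the probability of missingness in a target variable y given an observed covariate x:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000
x = rng.normal(size=n)               # observed auxiliary variable
y = 0.5 * x + rng.normal(size=n)     # target variable, correlated with x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# MCAR: missingness is independent of both x and y.
miss_mcar = rng.random(n) < 0.3
# MAR: missingness depends only on the observed x.
miss_mar = rng.random(n) < sigmoid(-1.0 + 1.5 * x)
# MNAR: missingness depends on y itself, even given x.
miss_mnar = rng.random(n) < sigmoid(-1.0 + 1.5 * y)

for label, miss in (("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)):
    print(f"{label}: mean of observed y = {y[~miss].mean():+.3f} "
          f"(population mean = {y.mean():+.3f})")
```

Under MCAR the observed mean is approximately unbiased; under MAR it is biased but can be corrected using x, for example by calibration as described next; under MNAR the bias remains even after conditioning on x.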
If the data generating mechanism is MAR, the generalized regression estimator (GREG) can be used to correct the effect of such missing data by calibration [37]. The calibration estimator for respondents (r) of a target variable Y is defined as
$$\begin{aligned} \hat{Y}_w = \sum _r w_i y_i, \end{aligned}$$
(2)
with \(w_i\) the calibrated correction weights and \(y_i\) the response. The \(w_i\) are the product of the initial weights \(wi_i\) and the correction factor \(v_i\), where \(w_i = wi_i \cdot v_i\) (for details on the calibration estimator, see [38]). When the MAR assumption does not hold, GREG estimates remain biased. In such cases, the probability of being included in the survey and of responding depends on \(y_i\), and the observed \(x_i\) cannot fully explain the selection probability. Therefore, the missing data generating mechanism is considered MNAR.
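As a minimal sketch of equation (2), the following illustrates calibration in its simplest form, post-stratification to known population margins of a single categorical auxiliary variable; all data, shares, and group effects are hypothetical, and the full GREG estimator of [37, 38] is more general:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Hypothetical respondents: three age groups, with younger groups
# over-represented relative to the (assumed known) population shares.
group = rng.choice(3, size=n, p=[0.5, 0.3, 0.2])
y = np.array([1.0, 2.0, 3.0])[group] + rng.normal(0.0, 0.5, n)  # target variable
pop_shares = np.array([0.3, 0.3, 0.4])                          # known margins

wi = np.ones(n)  # initial weights wi_i (here: equal)
sample_shares = np.array([wi[group == g].sum() for g in range(3)]) / wi.sum()
v = pop_shares / sample_shares   # correction factor v_i per group
w = wi * v[group]                # calibrated weights w_i = wi_i * v_i

# Calibration estimator of the population mean (equation (2), normalized by
# the sum of weights to estimate a mean rather than a total).
y_hat_w = np.sum(w * y) / np.sum(w)
print(f"unweighted mean: {y.mean():.3f}, calibrated mean: {y_hat_w:.3f}")
```

Consistent with the MAR discussion above, this correction works because group membership is observed for all respondents and its population distribution is known; if missingness depended on y within groups (MNAR), the calibrated estimate would remain biased.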