Background
Healthcare-associated infections (HAIs) and antimicrobial resistance (AMR) are widely recognized as significant threats to public health, with an estimated prevalence in acute care hospitals of 5.9% and approximately 3,760,000 new cases per year [1]. Therefore, considerable resources have been invested in monitoring HAI prevalence and investigating the associated risk factors, with the objective of developing targeted intervention strategies. Several epidemiologic and surveillance studies have been conducted, ranging from the hospital level [2], to the country level [3, 4], to worldwide initiatives [5].
The European Union is on the frontline in monitoring and controlling HAI risk in its Member States, under the coordination of the European Centre for Disease Prevention and Control (ECDC) healthcare-associated infections surveillance network (HAI-Net). This network has carried out two European-wide Point Prevalence Surveys (PPSs), applying a standardized protocol developed by the ECDC, in 2011–2012 and 2016–2017 [6]. The aims of these studies were mainly to estimate HAI and antibiotic use prevalence in acute care hospitals, and to describe patients, infections, invasive procedures, antibacterial agent use and antimicrobial resistance, while also gathering information on hospitals’ characteristics and infection control practices.
In order to obtain robust estimates and to ensure comparability among participating countries, it is important to apply a consistent hospital sampling strategy. Therefore, the ECDC required every country participating in the PPS to provide a specific number of hospitals, calculated to yield a similar prevalence estimation error while taking into account the per-country hospital size distribution. Further, the ECDC required that hospitals be selected preferentially through systematic sampling among all hospitals in the country [6], taking into account hospital size. Countries were then categorized according to their ability to strictly follow the protocol in terms of sampling strategy and number of participating hospitals. Nevertheless, in the 2011–2012 PPS, sixteen out of thirty-three countries were not able to select hospitals through systematic sampling or provided data from fewer hospitals than requested by the ECDC [7]. When systematic sampling was not feasible, countries resorted to convenience sampling [8].
For the 2016 Italian PPS [9], convenience sampling was employed. Altogether, 135 hospitals participated in the survey, whereas the ECDC required a sample of only 55 hospitals for Italy. Generating a representative sub-sample from the 135 participating hospitals proved to be challenging for several reasons. First of all, regional participation in the survey was extremely heterogeneous, as the majority of hospitals were provided by two regions, Piedmont and Emilia-Romagna. Secondly, there was an excess of large hospitals in the sample compared to the actual size distribution of hospitals in Italy.
To address these issues, we developed three alternative, non-conventional sampling procedures for prevalence studies that allow the selection of a subset of units from a convenience sample. Furthermore, we developed a “Quality Score” which evaluates participating units according to data quality and completeness and can be used as an additional selection criterion.
The objective of these methodologies is to improve the representativeness and quality of epidemiologic studies when the selection of participating units is made using non-probabilistic strategies.
Data collected through the 2016 Italian PPS on HAIs was used to test and evaluate these methodologies, but their implementation is general and can be applied in a number of contexts.
Discussion
Using the 2016 Italian PPS on HAIs as a case study, we developed a set of subsampling methodologies to improve the representativeness of epidemiologic studies when the selection of participating units is made using non-probabilistic strategies, such as convenience sampling.
Convenience sampling [8] is one of the most common non-probabilistic sampling methodologies [18, 20]. It implies that the statistical units in the sample are selected based on their availability to the researcher, in terms of physical reachability and/or willingness to be included. It is a strategy often used when it is not feasible, economically or logistically, to perform random sampling.
The major drawback of this methodology is that the selected units may not be representative of the target population, introducing distortions in the distribution of sample characteristics and, in the worst case scenario, also in the outcomes of interest [21, 22], therefore biasing estimates and decreasing the generalizability of the studies [8].
For country-level, clustered prevalence studies, convenience sampling may be simpler and more cost-effective than more formal sampling strategies, such as systematic sampling (as suggested by the ECDC PPS protocol), since it does not assume that all (or most) of a country’s hospitals will agree to participate. Employing systematic sampling, for example, may be difficult when a central selection of participating hospitals is not possible, due to the specific organization of the healthcare system (decentralized healthcare systems) or to unwillingness to enforce compliance. In Italy, regional authorities are virtually exclusively responsible for healthcare organization and delivery, within a framework provided by the central Government.
It is customary in these cases to select participating units, for example, among hospitals which are part of some established surveillance network [23], or to let them be chosen by regional health surveillance authorities [24].
As previously mentioned, the Italian PPS sample was generated by convenience sampling driven by regional health authorities, which created issues in terms of geographical and risk factor distribution. Therefore, this sample represented a perfect case for testing our methodologies, which try to measure the impact of these issues and reduce them by algorithmic subsampling.
First, we developed a score to quantify the quality of the collected data (QS) in terms of missing data and errors. In statistical practice, there is a significant corpus of literature on the negative impact of missing data on inference [25–29], especially when data is missing not-at-random [30]. The bias caused by missing data can range from a simple increase in error and uncertainty estimates to biases in effect magnitude and/or direction in risk factor analyses [30]. A common way to mitigate these issues is to employ imputation techniques [25]: these methods work either by “guessing” the missing information based on available data on the same variable (distributional imputation, e.g., taking the mean, median or mode value), or at observation level in other variables (prediction-based methods, e.g., model-based or non-parametric imputation models), or by a combination of both. These techniques, though, are often technically and computationally demanding and may artificially reduce uncertainty or introduce bias themselves [31].
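To make the distinction concrete, the simplest, distributional flavor of imputation can be sketched as follows (a minimal Python illustration with invented data; it is not code from the study, whose supplementary implementation is in R):

```python
from statistics import mean, median, mode

def impute_distributional(values, strategy="mean"):
    """Fill missing entries (None) with a summary of the observed values.

    This is the simplest "distributional" flavor described above: the
    guess depends only on the same variable, not on other covariates.
    """
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical example: bed counts with two missing entries
beds = [120, None, 80, 100, None]
print(impute_distributional(beds, "median"))  # both gaps filled with 100
```

Prediction-based imputation would instead fit a model of the missing variable on the other covariates, which is more powerful but, as noted above, more demanding and more prone to artificially reducing uncertainty.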
The issue of errors in the collected variables is equally relevant: they can introduce distortions in the resulting estimates and statistical associations, which are very hard to identify and account for at the analysis stage [32–34].
The score we propose considers both the amount of missing information and possible errors in data collection, weighted by the statistical relationship of the considered variables with the primary outcome of the study, i.e., the risk of HAI. This score can be used to rank and select statistical units based on the reliability (information value) of their data, thereby reducing the reliance on analytic solutions like imputation and favoring the use of the original data. The consequence is a more transparent analytic pipeline.
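The idea can be illustrated with a small sketch (the variable names, weights, and the linear weighting scheme are illustrative assumptions, not the exact formula used in the study):

```python
def quality_score(records, outcome_weights):
    """Toy Quality Score: for each variable, take the fraction of records
    that are missing or flagged as erroneous, weighted by the variable's
    assumed association with the outcome. Lower = better data quality."""
    n = len(records)
    score = 0.0
    for var, weight in outcome_weights.items():
        bad = sum(
            1 for rec in records
            if rec.get(var) is None or rec.get(f"{var}_error", False)
        )
        score += weight * bad / n
    return score

# Hypothetical patient records from one hospital
records = [
    {"age": 70, "ward_type": "ICU"},
    {"age": None, "ward_type": "ICU"},   # missing age
    {"age": 55, "ward_type": None},      # missing ward type
]
# Illustrative weights: ward type assumed more predictive of HAI risk
weights = {"age": 1.0, "ward_type": 2.0}
print(quality_score(records, weights))
```

Hospitals can then be ranked by this score, with ties or near-ties resolved in favor of the unit with more reliable data.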
When applying the score to the 2016 Italian PPS data, we observed a large gradient of values, with the hospital with the worst data scoring 78.7 times higher (indicating worse quality) than the one with the best (lowest) score. Analyzing the components of the score can help identify which variables have more issues: in our case, the information at the patient level was more complete than ward- and hospital-level characteristics. These findings could be useful for the conduct of future surveys, by highlighting problematic variables that should be addressed when training local operators in data acquisition, and by revealing issues in the definitions of variables that could make their collection problematic.
When we compared the QS with HAI prevalence, we observed a slight negative association, that is, worse data quality predicted lower HAI prevalence. The relationship, albeit weak, was conserved after stratifying hospitals by their number of beds, a known predictor of HAI risk [35]. These results could hint at an association between accuracy in data collection and quality of the HAI case finding process. It is hard to draw definitive conclusions in this regard without knowing the real hospital-level prevalence and, given the small sample size, a spurious correlation cannot be ruled out. A solution could be to compute the QS of the whole “European ECDC PPS validation sample”, a group of patients in which the presence of HAI was verified by experts [6]. If the QS proves to be an accurate predictor of the rate of identification error without influencing the estimates, it could then be used as a tool for selecting hospitals with less biased estimates.
The second problem we tried to solve was the possible bias in the representativeness of hospitals chosen by convenience sampling (or other non-probabilistic sampling strategies). A common approach to the problem of under/overrepresentation is based on reweighting observational units: for example, weighting observations by the inverse of their probability of being selected to reduce disparities between subgroups [36], or based on some reference data to increase generalizability [37]. By contrast, to our knowledge little research has been devoted to a subsampling approach to these representativeness issues, with a notable example in the work of Pérez Salamero González et al. [38]. We argue that a subsampling-based approach may have practical advantages over a weighting-based one, for example not having to manage the weights throughout the entire analytical pipeline, or the possibility to clearly evaluate the impact of individual units on the final estimates. Furthermore, the common inverse-weighting methods use only the in-sample data to estimate the weights, without relying on external reference data; this promotes more uniformity in the final sub-group representation but does not ensure generalizability of the results.
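As a contrast with our subsampling approach, the reweighting idea can be sketched as follows (a toy Python illustration with invented numbers, not code from the study):

```python
def ipw_estimate(values, groups, selection_prob):
    """Weighted mean in which each unit counts 1 / P(selection) for its
    group, so over-represented groups are down-weighted."""
    weights = [1.0 / selection_prob[g] for g in groups]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Toy numbers: large hospitals over-sampled (selected with prob 0.8 vs 0.2)
prevalence = [0.10, 0.12, 0.04]          # per-hospital HAI prevalence
size_class = ["large", "large", "small"]
p_select = {"large": 0.8, "small": 0.2}
print(round(ipw_estimate(prevalence, size_class, p_select), 3))  # 0.063
```

Note how the weights must accompany every downstream analysis, which is precisely the bookkeeping burden a subsampling-based approach avoids.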
Our approach is based on two subsampling techniques (the Probability and Distance procedures) which, informed by country-level information, subsample hospitals to generate a distribution more similar to what would have been achieved by random sampling, given a set of stratification variables. As reported in the methods, we selected geographical location and hospital size as characteristics of interest, both because these are known risk factors for HAI and increased AMU [1, 12] and because their distribution in the PPS sample was highly distorted compared to the target population. Furthermore, information about hospital location and size is easily available at the country level, compared to other important predictors (such as hospital case-mix). Nevertheless, these methods are general and can be adapted to any combination of variables for which reference data is available.
The Probability procedure uses country data to build a probability distribution over one or more variables and chooses hospitals according to it. The Distance method defines strata using the same distribution and then fills them with hospitals in order of similarity to each stratum, according to a hierarchy of hospital variables: if units with the right characteristics are not available, the model selects hospitals which are as similar as possible. Finally, it further refines the sample by randomly switching hospitals in and out, updating the sample only if the switch improves the distributional fit. This can be likened to a greedy gradient search followed by a random search phase to escape possible local maxima.
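The final switching step can be sketched as follows (a simplified Python re-implementation of the idea, using a single stratification variable and total absolute deviation as the fit measure; the actual implementation is the R code in the supplementary material):

```python
import random
from collections import Counter

def distribution_distance(sample, target_freq):
    """Total absolute deviation between the sample's stratum frequencies and
    the target (country-level) frequencies; assumes target covers all strata."""
    counts = Counter(h["stratum"] for h in sample)
    n = len(sample)
    return sum(abs(counts.get(s, 0) / n - f) for s, f in target_freq.items())

def refine_by_random_switches(selected, excluded, target_freq, iters=500, seed=0):
    """Randomly swap a selected hospital with an excluded one, keeping the
    swap only if it improves the distributional fit (otherwise revert)."""
    rng = random.Random(seed)
    selected, excluded = list(selected), list(excluded)
    best = distribution_distance(selected, target_freq)
    for _ in range(iters):
        i, j = rng.randrange(len(selected)), rng.randrange(len(excluded))
        selected[i], excluded[j] = excluded[j], selected[i]
        d = distribution_distance(selected, target_freq)
        if d < best:
            best = d                                             # keep the swap
        else:
            selected[i], excluded[j] = excluded[j], selected[i]  # revert
    return selected

# Hypothetical case: convenience sample is all-"north", target is 50/50
target = {"north": 0.5, "south": 0.5}
selected = [{"stratum": "north"}] * 4
excluded = [{"stratum": "south"}] * 4
refined = refine_by_random_switches(selected, excluded, target)
print(distribution_distance(refined, target))  # never worse than the start
```

Because a swap is only kept when the fit improves, the distance to the target distribution is monotonically non-increasing over the iterations.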
We further proposed a Uniformity sampling method providing a balanced sub-sample for the considered characteristics, which may be useful for risk factor analyses and prediction models [19].
The QS is considered in all methodologies, selecting the best hospital among proposals that are equivalent in terms of distributional fit.
Based on our dataset, both the Probability and the Distance methods substantially decreased the distributional bias of the generated subsamples, compared to the convenience sample. This improvement highlights the possible drawbacks of non-probabilistic sampling methods and supports the necessity of adjustment before analysis. Nevertheless, it should be noted that the specific subsampling algorithm had only a relatively small effect on the final estimates of HAI prevalence, which ranged from 6.98% for the Distance method prioritizing hospital size (S) to 7.87% for the Probability method, compared to 7.44% for the convenience sample estimate. As a reference, the HAI prevalence at the European level, as reported by the ECDC PPS study, was 5.9% [1]. The Distance (S) method provided the estimate closest to the European result, perhaps because it is driven by a highly predictive risk factor (hospital size), but it is impossible, from these results alone, to affirm that the Distance (S) procedure provides the most realistic estimate. An indirect way to test our methods could be to use ECDC PPS data from participating countries that recruited all, or a large random sample of, their hospitals. From these hospitals, a simulated, biased convenience sample could be drawn, and the three procedures could then be tested to determine which subsampling method best retrieves the original prevalence.
The variability of the bootstrapped estimates was large: all methods showed a 90% BI of more than two percentage points. This difference is quite significant in terms of burden of disease: it is greater than the variation in HAI prevalence observed among most of the countries in the European PPS study, or within the same country across surveys conducted in different years [1, 12, 35]. If we consider the estimates’ uncertainty between sampling methods, the variability is even larger, with the range of possible prevalence estimates going from 5.75% (lower bootstrap interval for the Distance (S) method) to 8.88% (upper bootstrap interval for the Probability method). These results may indicate that differences in sampling strategies among countries and studies could explain a large portion of the observed variability in HAI prevalence. Therefore, the crude risk estimates extracted from prevalence studies should be built on larger samples and enriched by more sensitive analyses, such as individual risk analyses based on patient characteristics [39, 40] or multilevel models [41–43], to factor out the hospitals’ specific contributions to risk.
Both the Probability and the Distance methods have limitations, and many improvements could be proposed.
The Probability method is the most effective in decreasing the global distributional bias, since it considers many variables at once, but it may increase bias in specific characteristics. This happens when the amount of distortion in one characteristic is much higher than in another: the model accepts a trade-off (more distortion) on the less biased variable in order to optimize the global fit. In our case, the algorithm slightly increased the bias regarding hospital size to compensate for the highly distorted geographical distribution (i.e., two regions providing more than 50% of the total number of hospitals). This phenomenon may influence the outcome of the study itself if the bias is increased for strongly predictive variables. We indeed observed a higher HAI prevalence in the Probability sample, due to the greater quota of larger hospitals included by the algorithm to improve regional representativeness (many regions provided a relative excess of large hospitals). A possible solution could be to reweight the contribution of the variables to hospital selection according to their statistical relevance to the outcome. The QS has a strong impact on the Probability method, but we showed that hospital selection is still largely driven by the country distribution of hospitals; alternatively, different weights could be attributed to the QS by exponentiating it after the rescaling (Supplementary Material S2).
The Distance method, on the other hand, allows specifying a hierarchical order for the variables used for adjustment, so that the researcher may give priority to those more related to the outcome of interest. The drawback of this method is that it aggressively optimizes the first variable of the hierarchy, switching to the second only in case of ties, and likewise for the third variable and so on. Therefore, its flexibility is spent mostly on a few variables (primarily the first), especially if these have many possible values. This bias is compensated for by a random search for solutions that improve the global fit (e.g., samples improving the fit for hospital size at the cost of a greater bias in geographical distribution are discarded), but the candidate hospitals are still evaluated using the same hierarchy.
We provide the R code for both methods (Supplementary Material S2) and we encourage researchers to experiment with, improve, and adapt it for their purposes.
A general limitation of subsampling for bias correction is that it strongly reduces the sample size and, therefore, the power of studies. Only if the initial convenience sample is large does enough margin exist to compensate for the bias by subsampling. An alternative could be first to oversample the original data, artificially increasing its size, and then to produce a subsample of the original size that is sub/over-sampled in order to reduce distortions.
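One way this alternative could look is sketched below (a hypothetical Python illustration; the resampling scheme, names, and data are our assumptions, not a procedure validated in the study):

```python
import random
from collections import Counter

def oversample_then_subsample(hospitals, target_freq, factor=10, seed=0):
    """Inflate the convenience sample by resampling with replacement, then
    draw a subsample of the original size whose strata approximately follow
    the reference (country-level) distribution."""
    rng = random.Random(seed)
    n = len(hospitals)
    inflated = rng.choices(hospitals, k=factor * n)    # artificial oversample
    subsample = []
    for stratum, freq in target_freq.items():
        pool = [h for h in inflated if h["stratum"] == stratum]
        quota = min(len(pool), round(freq * n))        # stratum quota
        subsample.extend(rng.sample(pool, quota))
    return subsample

# Hypothetical convenience sample with an excess of large hospitals
hospitals = [{"id": i, "stratum": "large" if i < 3 else "small"} for i in range(4)]
target = {"large": 0.25, "small": 0.75}                # reference distribution
sub = oversample_then_subsample(hospitals, target)
print(Counter(h["stratum"] for h in sub))
```

Note that resampling with replacement duplicates units rather than adding information, so the gain is in distributional balance, not in effective sample size.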
Further validation of the subsampling methodologies may be pursued by taking advantage of countries with sample coverage near 100% in the European PPS sample: various distortive convenience sampling strategies may be simulated to test how efficient the subsampling methods are in retrieving the real population prevalence.
The presented methods have been demonstrated in the context of healthcare-associated infection prevalence estimation in the presence of a convenience sample of acute care hospitals, but they are generalizable to a number of contexts and problems wherever bias in sample representativeness and data quality issues may be present. For example, sample representativeness issues are common in designs that require the opt-in of the participating units, such as survey-based [44], cohort [45] and surveillance [46] studies; the same is true for errors in data collection, especially with participant self-reported data [47–49].