Background
Respondent-driven sampling (RDS) was developed by Heckathorn [1] as an improvement on snowball-type sampling for measuring disease prevalence in ‘hidden’ populations, that is, those that are difficult to reach because they lack a sampling frame. Groups commonly studied with RDS include men who have sex with men, sex workers and drug users [2‐4]. The intricacies of RDS are described elsewhere [1, 5‐7], so we provide only a brief outline here. Researchers recruit an initial group from the target population, called ‘seeds’. Each seed is tasked with recruiting members from their personal network who are also members of the target population; these recruited participants then become recruiters themselves, and sampling continues until a pre-specified condition is met, typically when the target sample size is reached. Usually, participants are incentivized to participate in the recruitment chains by receiving payment both for participating and for recruiting others into the study. Recruitment is tracked using coupons so that participants can be traced along the recruitment chains. Participants are also asked about the size of their personal networks with respect to the population of interest. For example, in a study of HIV prevalence among injection drug users in a city, participants may be asked: “How many other people who inject drugs in [city] do you spend time with?” The resulting RDS data differ in two important respects from data obtained through simple random samples. First, sampling is not random: some participants are more likely to be selected than others, and this likelihood is a function of how well-connected they are. Second, the observations are not independent, as the data may be clustered within recruiters or seeds.
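The recruitment process outlined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular study: the function name `rds_sample`, the toy ring-shaped network, the limit of three coupons per participant and the uniform random choice among eligible peers are all hypothetical choices made for the example.

```python
import random
from collections import deque

def rds_sample(network, seeds, target_size, coupons=3, rng=None):
    """Simulate coupon-based recruitment on a known network (illustrative only).

    network: dict mapping each person to a list of their contacts.
    Returns a list of (participant, recruiter, wave) tuples; seeds have
    recruiter None and wave 0.
    """
    rng = rng or random.Random(0)
    sampled = set(seeds)
    records = [(s, None, 0) for s in seeds]
    queue = deque((s, 0) for s in seeds)
    while queue and len(records) < target_size:
        recruiter, wave = queue.popleft()
        # Only contacts who are not yet in the study can redeem a coupon.
        eligible = [p for p in network[recruiter] if p not in sampled]
        rng.shuffle(eligible)
        for peer in eligible[:coupons]:
            if len(records) >= target_size:
                break
            sampled.add(peer)
            # The coupon links each recruit to their recruiter and wave.
            records.append((peer, recruiter, wave + 1))
            queue.append((peer, wave + 1))
    return records

# Hypothetical toy population: 30 people on a ring, each tied to 4 neighbours.
toy_network = {i: [(i + d) % 30 for d in (-2, -1, 1, 2)] for i in range(30)}
chains = rds_sample(toy_network, seeds=[0, 15], target_size=12)
```

Because every non-seed record stores its recruiter, the output retains the chain structure that makes RDS observations dependent; each participant's reported degree here would simply be `len(toy_network[person])`.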
Clustering occurs if there is homophily in the population, that is, if people are more likely to be connected to others with a shared trait, although it can also refer to network communities as outlined by Rocha et al. [8]. In this paper, we consider clustering within a single community and therefore driven by homophily. Heckathorn showed that, if the recruitment chains are long enough, under certain (reasonable) assumptions the RDS-derived data can be analysed in such a way as to produce asymptotically unbiased population estimates of disease prevalence [7]. The utility of RDS-specific prevalence estimates has been studied using simulation by Spiller et al. [9] and Baraff, McCormick and Raftery [10], who examined the variability of RDS prevalence estimates and recommended RDS-specific techniques instead of naive sample prevalence estimates. However, McCreesh et al. [11] cautioned that in estimates of prevalence, RDS-adjusted techniques often produced confidence intervals that excluded the population value. Until recently, the focus of most studies using RDS has been to quantify disease prevalence, but as RDS becomes more popular, regression analyses of these data are also becoming common.
Although regression analysis of RDS data is frequently undertaken, the best method for accommodating correlation between participants (clustering) and the non-random sampling of recruits remains unknown. Carballo-Diéguez et al. [12] noted in 2011 that “the pace of development of statistical analysis methods for RDS-collected data has been slower than the explosion of implementation of RDS as a recruitment tool”. Several authors have recently observed that regression techniques in particular for RDS samples are not well established [4, 13, 14]. Yet their use continues to increase; a search of PubMed for the terms ‘respondent driven sampling’ and ‘regression’ over the years 1997 to 2017 indicated that the first RDS paper to use regression techniques was published in 2004, and that by 2017 there were 59 papers. While many authors do not specifically address the difficulties in performing regression on RDS data, some acknowledge the limitations and perform unadjusted analysis [4, 13]. Several authors used weighted regression [14‐18], which assumes that network size is accurately reported and, without further adjustment, still assumes independence between participants; or included weights as covariates [17, 18]. At least one study mitigated the influence of extreme responders to the network question with the ‘pull-in’ feature of the RDSAT software [19], which re-assigns extreme values to ones more aligned with the sample [20]. Fewer authors have attempted to control for clustering; Lima et al. attempted to control for homophily (related to clustering) by incorporating the outcome value of the recruiter as an independent variable [21], and Schwartz et al. used robust Poisson regression ‘accounting for clustering’ of participants within the same seed [13]. We found only one study that both used weighted regression and controlled for clustering; those authors modelled dependence among observations with two methods and found similar results with both [22]. Treatment of clustering is the thornier of the two statistical issues with RDS regression, because clusters, if they exist, may be difficult to identify. The main clustering unit may be at the level of the seed, which would produce a few large clusters, or it may be approximated by an auto-regressive structure in which participants are dependent on their immediate recruiter, but largely independent of those further up the recruitment chain. The covariance structure proposed by Wilhelm [23], in which correlation decreases with successive waves, may provide a useful middle ground. Added to these conceptual questions are statistical concerns with clustered data. Hubbard et al. [24] note that when generalised estimating equations (GEE) are used, estimates can be inaccurate if the number of clusters is small, so treating initial seeds as clustering units can be problematic. Another study with mixed cluster sizes found that failure to adjust for clustering would have led to incorrect conclusions [25]. There are a multitude of methods available to account for both unequal sampling probabilities and clustering, but little work has been undertaken to determine the most appropriate regression methods for use with RDS data.
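To make the two dependence structures discussed above concrete, the following sketch builds generic working correlation matrices: an exchangeable (compound-symmetric) matrix, as might be assumed for all participants descended from one seed, and a first-order auto-regressive matrix, as might be assumed along a single recruitment chain. It assumes NumPy is available; the function names and the value of `rho` are hypothetical, and the sketch does not reproduce the structure proposed by Wilhelm [23].

```python
import numpy as np

def exchangeable(n, rho):
    """Compound-symmetric correlation: every pair in the cluster shares rho."""
    return (1 - rho) * np.eye(n) + rho * np.ones((n, n))

def ar1(n, rho):
    """AR(1) correlation: rho ** (number of recruitment steps apart)."""
    steps = np.arange(n)
    return rho ** np.abs(steps[:, None] - steps[None, :])

# Four participants in one chain, with an assumed correlation of 0.5.
print(exchangeable(4, 0.5))
print(ar1(4, 0.5))
```

Under the exchangeable form the first and fourth participants remain correlated at 0.5, whereas under AR(1) their correlation decays to 0.5 cubed, which mirrors the contrast between seed-level clustering and dependence on the immediate recruiter only.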
Motivating example
The Our Health Counts (OHC) Hamilton study was a community-based participatory research project with the aim of establishing a baseline health database for an urban Indigenous population living in Ontario. Respondent-driven sampling was appropriate for this population because of the inter-connectedness of the population and the lack of a suitable sampling frame. Based on census estimates, the population comprises approximately 10,000 individuals, 500 of whom were sampled in the OHC study. Commonly reported network sizes were 10, 20, 50 and 100; the median network size was 20 and the mean was 46.5. The top decile of participants reported network sizes in excess of 100 people. The distribution of reported network size for the OHC Hamilton study is illustrated in Additional file 1: Figure S1.
The objective of this simulation study was to evaluate the validity and accuracy of several regression models for estimating the risk of a binary outcome from a continuous predictor in an RDS sample and, specifically, to assess performance with varying levels of outcome prevalence and homophily.
Discussion
Using simulated data, with network degree modelled after RDS data collected from an urban Indigenous population, a dichotomous outcome variable analogous to disease state, and normally distributed continuous predictors, we explored the error rate, coverage rate, bias and accuracy of various regression estimates. Our results indicate that weighted regression using RDS-II weights can lead to inflated type-I error, poor parameter coverage and biased results. When the goal of research is to estimate risk associated with exposure, we prefer Poisson regression to standard logistic regression because it directly estimates relative risk and at higher levels of outcome prevalence the odds ratio is a poor estimate of relative risk. Furthermore, our results show that at low prevalence Poisson regression performs well in terms of observed error rate, coverage and accuracy.
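The point that the odds ratio drifts away from the relative risk as prevalence rises can be checked with simple arithmetic. The sketch below is illustrative only: the function name `odds_ratio` and the fixed relative risk of 1.5 are hypothetical choices, and the baseline risks are example values rather than quantities from this study.

```python
def odds_ratio(baseline_risk, relative_risk):
    """Odds ratio implied by a given baseline risk and relative risk."""
    exposed_risk = baseline_risk * relative_risk
    if not (0 < baseline_risk < 1 and 0 < exposed_risk < 1):
        raise ValueError("both risks must lie strictly between 0 and 1")
    odds_exposed = exposed_risk / (1 - exposed_risk)
    odds_baseline = baseline_risk / (1 - baseline_risk)
    return odds_exposed / odds_baseline

# Same relative risk (1.5) at increasing baseline risk.
for risk in (0.01, 0.10, 0.50):
    print(risk, round(odds_ratio(risk, 1.5), 2))
```

With a relative risk held at 1.5, the implied odds ratio is close to 1.5 when the baseline risk is 1%, about 1.59 at 10%, and 3 at 50%, so interpreting an odds ratio as a relative risk becomes progressively more misleading.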
Several studies have reported using weighted regression (WR) techniques, with RDS-II weights, to account for the non-random nature of RDS samples [15, 36‐40]. Results of this study indicate that weighted regression to account for non-random sampling probability should not be undertaken for RDS data without careful consideration of the distribution of the weights used. The poor performance of weighted regression in this study can be attributed to the increased variability of the weighted regression estimates, as illustrated in Additional file 3: Figure S3. The weighted regression estimates are dependent on the reported network degree, and a participant reporting very few connections in the community weighs heavily in the analysis and can act as a leverage point. The two most extreme simulated data sets from the population with prevalence of 10% and homophily of 1 are shown in Additional file 4: Figure S4. In this study, because population data were simulated and therefore completely known, reported network degree was equal to the actual network degree and participants were sampled based on their true degree of connectedness in the population. Despite perfect knowledge of network size, the presence of participants within the samples who reported very low degree (and hence had large weights) nevertheless unduly influenced the weighted regression estimates. That weighted regression performed poorly in these controlled circumstances should serve as a caution to future researchers. At the very least, unweighted estimates should always be reported. If weighted regression is performed, care must be taken to investigate the influence of those assigned large weights and to perform sensitivity analysis on the degree information.
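One simple way to inspect the distribution of the weights before fitting a weighted model is sketched below. It assumes that RDS-II weights are proportional to the inverse of reported degree (the Volz-Heckathorn form) and uses Kish's effective sample size as a summary of how unevenly the weight is spread; that diagnostic is our illustrative choice and was not part of the analyses reported here. The function names and the ten degrees are hypothetical.

```python
def rds2_weights(degrees):
    """Inverse-degree weights, normalised to sum to 1 (assumed RDS-II form)."""
    inverse = [1.0 / d for d in degrees]
    total = sum(inverse)
    return [w / total for w in inverse]

def kish_effective_n(weights):
    """Kish effective sample size: (sum of weights)^2 / sum of squared weights."""
    return sum(weights) ** 2 / sum(w * w for w in weights)

# Hypothetical sample: one participant reports degree 2, the rest 10 to 100.
degrees = [2, 10, 10, 20, 20, 20, 50, 50, 100, 100]
weights = rds2_weights(degrees)
print(max(weights), kish_effective_n(weights))
```

In this toy sample the single participant with degree 2 carries roughly 55% of the total weight and the effective sample size is about 3 out of 10, which illustrates how one low-degree respondent can dominate a weighted analysis.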
Our secondary analysis investigated populations where the outcome and network degree were correlated and largely replicated the findings of the primary investigation. When the outcome and degree were correlated, weighted regression resulted in inflated type-I error, except when those with the highest degree were in G1 (“diseased” group, outcome = 1). In this situation the error rate was virtually zero because those in G1 have the lowest RDS-II weights, so the leverage points that drive the high error rate in the other populations are absent. This, too, is undesirable because those in G2 (“healthy” group, outcome = 0) will tend to be leverage points and may nullify true relationships when they form a large majority of the population. Again, these findings suggest extreme caution in using weighted regression with RDS samples.
We examined several techniques for dealing with clustering: GLM and GEE with data correlated within recruiter, seed, or both and with different covariance structures, as well as modelling the outcome value of the immediate recruiter as a model covariate. These results do not provide clear guidance on the best method of handling dependence in the data. None of the methods was consistently poor across models and populations. Including the outcome of a participant’s recruiter as a covariate may be a viable option; our results indicate that the extra parameter did not reduce the coverage rate, and accuracy was actually minimally improved. We also note that the impact of clustering on the variance of regression models is generally less than its impact on the variance of estimated means or prevalence itself. For example, in the context of cluster randomized trials, Donner and Klar [41] discuss the decrease in variance in a regression model relative to a single mean or proportion. Nonetheless, more work is necessary to determine the utility of this approach in populations where the relative activity depends on outcome group.
The performance of the unweighted GEE models was related to the working covariance structure and standard error adjustment used. Models fit with a compound-symmetric working covariance structure and any of the Classical, FIRORES, FIROEEQ or MBN adjustments to the standard error had acceptable overall error and coverage rates (models 19–23). However, slightly inflated error rates were observed for the population with prevalence of 50% and homophily of 1.5 and the population with prevalence of 10% and no homophily. Coverage rates were generally close to 95% for these models. When an auto-regressive term was used within seeds (models 27, 28), overall coverage dropped below 94%; this was also the case with a compound-symmetric structure and no adjustment to the standard error (models 29, 30). The independent correlation structure (with no covariance between observations) performed poorly, with inflated type-I errors.
The glimmix procedure in SAS was used to model GEE with compound symmetric working covariance structures and various sandwich estimates (models 19–23). There were no appreciable differences in error rates, coverage rates or relative bias among the various standard error adjustments for these models. As shown in Additional file 6: Table S2, the glimmix models have slightly lower coverage rates, and inflated error rates for some populations, so we recommend simpler generalized linear models.
The accuracy of the models in terms of case prediction is higher for logistic regression than Poisson regression, although as can be seen in Fig. 3 the disparity is proportional to outcome prevalence. At lower prevalence levels, the Poisson model variance approaches the variance of the Binomial distribution and so model mis-specification decreases and accuracy increases.
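The convergence of the two variance functions at low prevalence follows directly from their forms: for a binary outcome with mean p, the Bernoulli variance is p(1 - p) while a Poisson model assumes variance p. A minimal sketch (the function name is hypothetical, and the prevalence values are examples):

```python
def variance_ratio(prevalence):
    """Poisson variance divided by Bernoulli variance at the same mean."""
    poisson_var = prevalence
    bernoulli_var = prevalence * (1 - prevalence)
    return poisson_var / bernoulli_var  # algebraically 1 / (1 - prevalence)

for p in (0.01, 0.10, 0.50):
    print(p, round(variance_ratio(p), 3))
```

The ratio is close to 1 at 1% prevalence, about 1.11 at 10%, and 2 at 50%, so the Poisson working variance overstates the true binary variance more severely as the outcome becomes common.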
Another method of simulating RDS data is through the use of exponential random graph models (ERGM). Spiller et al. [9], in their recent simulation study investigating the variability of RDS prevalence estimators, used ERGM to simulate multiple populations from distributions with specified homophily, prevalence, mean degree and relative activity. This approach creates networks that, when averaged over many simulations, have the desired network parameters, though in practice individual populations will vary. In contrast, our approach randomly selected network degree from a specified distribution, and then randomly allocated group membership and ties in such a way as to achieve precise levels of prevalence and homophily. For each combination of desired network traits, a single population was created and multiple RDS samples were drawn, thereby allowing only a single source of variability, the RDS sampling process. Given that our research question of interest was how best to model data sampled using respondent-driven sampling from a networked population, we feel that holding the population constant is the appropriate strategy, but examining the impact of the population simulation method is an area of future interest.
Prevalence
Our findings are in line with other studies [9, 10, 42] that have found coverage rates substantially less than 95% in the estimation of prevalence from RDS samples. Our results also support using RDS-II over RDS-I. We found that the robust variance estimators of the surveylogistic procedure in SAS, using the RDS-II weights, performed well (Table 3). One interesting finding is that, similar to the regression results, the weighted prevalence estimates are also susceptible to leverage points, but only at low prevalence (10%). When we more closely examined samples with large disparities in the outcome prevalence estimates, we found that the disparity among estimators is caused entirely by individuals with low degree. The smallest reported network size in these samples was 2, in line with degree reported in the OHC study, and in this simulation study a reported degree of two is an accurate reflection of connectedness. The weights assigned to each participant are related not only to the participant’s reported degree but also to the distribution of degrees across the sample. If a sample contains a few reports of very large degree (as occurred in the OHC sample) then the weights allocated to those with lower reported degree will have greater impact. We found that prevalence estimators that incorporate weights are generally superior at moderate to high prevalence, but should be used with caution in samples with low outcome prevalence.
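The sensitivity of weighted prevalence estimates to low-degree cases at low prevalence can be illustrated with a toy calculation. The sketch assumes the inverse-degree (Volz-Heckathorn) form of the RDS-II estimator; the function names, the ten hypothetical participants and their degrees are invented for illustration and are not data from this study.

```python
def naive_prevalence(outcomes):
    """Unweighted sample proportion."""
    return sum(outcomes) / len(outcomes)

def rds2_prevalence(outcomes, degrees):
    """Inverse-degree weighted proportion (assumed RDS-II form)."""
    inverse = [1.0 / d for d in degrees]
    weighted_cases = sum(y * w for y, w in zip(outcomes, inverse))
    return weighted_cases / sum(inverse)

degrees = [2, 10, 10, 20, 20, 20, 50, 50, 100, 100]
case_is_low_degree = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]   # the one case reports degree 2
case_is_high_degree = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # the one case reports degree 100

print(naive_prevalence(case_is_low_degree))
print(rds2_prevalence(case_is_low_degree, degrees))
print(rds2_prevalence(case_is_high_degree, degrees))
```

Both samples have a naive prevalence of 10%, yet the weighted estimate is roughly 55% when the single case has degree 2 and roughly 1% when the case has degree 100. With few cases in the sample, which participant happens to hold the large weight largely determines the estimate.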
The appropriate use of weights in regression analysis is an area of active discussion. Our findings suggest that the use of weights is appropriate for determining population outcome prevalence, but not in the application of regression models for RDS samples. These results are in line with Lohr and Liu’s paper examining weighting in the context of the National Crime Victimization Survey [43]. In their survey of the literature they reported little debate surrounding the use of weights in the calculation of average population characteristics, but several competing views on the incorporation of weights into more complex analyses such as regression. More recent work by Miratrix et al. [44] further suggests that initial, exploratory analyses, as are typically performed with RDS data, should be performed without weights to increase power and that generalization to the entire population should be a secondary focus of subsequent samples.
As with any simulation study, the limitations stem from our own design. As an initial investigation into regression techniques and RDS data we chose to use complete data sets, so the effects of missing data are unknown. We also used a correctly reported network degree, whereas in the OHC study we observed a tendency for people to report degree in clusters (such as 5, 10, 20, 100). Future work may focus more on log-link models, which seem promising. It would also be interesting to investigate what happens if the outcome responses are correlated with degree size, and if better-connected people are better (or worse) off, a concern flagged by Reed et al. [45].
Conclusion
Our results indicate that weighted regression should be used cautiously with RDS data. Unweighted estimates should always be reported, because weighted estimates may be biased and may not be valid in samples with a broad range of reported degree, as is the case with our motivating example of connectedness in an urban Indigenous population. Researchers are likely to have prior knowledge regarding the prevalence of the outcome in their target population (HIV prevalence, for instance), but much less likely to have knowledge regarding the homophily of the population. The greater the outcome prevalence, the greater the discrepancy between the odds ratio estimated from logistic regression and the relative risk. In light of this, we suggest that a simple, unweighted Poisson regression model is the most reliable method for modelling the likelihood of group membership from an RDS sample.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.