Problem description
We consider a finite population D containing n individuals, where each individual i is described by a vector x_i of auxiliary variables. The auxiliary variables x_i are known for each individual before the recruitment period starts. Usually x_i is available from municipal or national person registries. Examples of these variables are gender, age, place of residence, and socio-economic status. In addition to x_i, each individual i has an unobserved outcome of interest y_i. The goal of this paper is to obtain a sample of size m (m < n) from D, in which we can observe y_i.
A sample is described by the vector s = (s_1,…,s_n), where s_i takes the value 1 if individual i is in the sample and 0 otherwise [13, 14]. With this representation there are 2^n possible samples. Before the recruitment period starts we need to determine π_i, which is the probability that individual i is included in s (i.e. p(s_i = 1) = π_i). We want to recruit a sample of m individuals and therefore ∑_{i=1}^n π_i = m, where m is a positive integer.
Different choices can be made for the inclusion probabilities π_i. For instance, we can assign equal inclusion probabilities to all individuals, i.e. π_i = m/n. In this case, the sample s is expected to be a 'miniature' version of the population D, because we expect s to have approximately the same composition of auxiliary characteristics as D. Such a sample is referred to as a representative sample [11]. However, π_i is frequently chosen to be proportional to x_i. For example, by oversampling a rare subgroup we could increase the precision of the result for that particular subgroup [15].
List sequential sampling method
To obtain the sample we use the list sequential method based on sampling without replacement developed by Bondesson and Thorburn [
12]. To illustrate the list sequential method, we first consider the situation in which all invited individuals will participate in the study.
During the recruitment period, we sequentially decide for each individual i from D whether we include this individual in the sample (s_i = 1) or not (s_i = 0). After this decision, the probability of being included in the sample is updated for the remaining non-invited individuals from D. Let π^{(0)} = (π_1^{(0)},…,π_n^{(0)}) be the vector of initial inclusion probabilities, which is determined before the sampling procedure starts, i.e. π^{(0)} = π. We sequentially evaluate each individual i from the population and update the inclusion probabilities of all non-evaluated individuals after each evaluation. For the first individual, we have p(s_1 = 1) = π_1^{(0)}. Depending on whether individual 1 is included in the sample or not, the inclusion probabilities of all other, non-evaluated, individuals are updated. This gives us the vector π^{(1)}, from which we use π_2^{(1)} to determine s_2, i.e. to decide whether to include the second individual in the sample or not. The updating scheme can be represented as

π^{(0)} → s_1 → π^{(1)} → s_2 → π^{(2)} → s_3 → ⋯
Generally, when we evaluate individual i, we use the inclusion probability π_i^{(i−1)} to determine s_i. After the evaluation of individual i, we update all probabilities π_j^{(i)}, for j > i, with

π_j^{(i)} = π_j^{(i−1)} − (s_i − π_i^{(i−1)}) w_j^{(i)},    (1)

where the w_j^{(i)} are weights that may depend on s_1, s_2,…,s_{i−1}. Note that w_j^{(i)} determines how π_j^{(i)} is affected by the sampling outcome of individual i, since w_j^{(i)} influences the second-order inclusion probability p(s_i = 1, s_j = 1). The sampling scheme gives a sample of size m when the weights are restricted to sum up to one, i.e. ∑_{j>i} w_j^{(i)} = 1. To guarantee that 0 ≤ π_j^{(i)} ≤ 1, all weights should satisfy

−min( π_j^{(i−1)} / π_i^{(i−1)}, (1 − π_j^{(i−1)}) / (1 − π_i^{(i−1)}) ) ≤ w_j^{(i)} ≤ min( π_j^{(i−1)} / (1 − π_i^{(i−1)}), (1 − π_j^{(i−1)}) / π_i^{(i−1)} ).    (2)
Within these bounds, we can impose different restrictions on w_j^{(i)}, resulting in samples with certain characteristics. Generally, when w_j^{(i)} > 0 we have corr(s_i, s_j) < 0 (i.e. a negative correlation between the sampling indicators of individuals i and j), whereas with w_j^{(i)} < 0 we have corr(s_i, s_j) > 0. For more detail about the list sequential method, we refer the reader to, respectively, theorem 1 and remark 1 from Bondesson and Thorburn [12].
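As an illustrative sketch (ours, not the paper's algorithm), the update rule (1) can be coded directly. With equal inclusion probabilities and equal weights that sum to one over the remaining units, the bounds in (2) are never violated and the scheme reduces to sequential simple random sampling without replacement with a fixed sample size:

```python
import numpy as np

def list_sequential_sample(pi, rng):
    """List sequential sampling with update rule (1).

    pi: initial inclusion probabilities, summing to the target size m.
    The weights are chosen equal over the remaining units (so they sum
    to one), which keeps the realized sample size fixed at m.
    """
    pi = np.asarray(pi, dtype=float).copy()
    n = len(pi)
    s = np.zeros(n, dtype=int)
    for i in range(n):
        s[i] = int(rng.random() < pi[i])        # draw s_i with probability pi_i
        if i < n - 1:
            w = 1.0 / (n - i - 1)               # equal weights w_j^(i), summing to one
            pi[i + 1:] -= (s[i] - pi[i]) * w    # update rule (1)
    return s

rng = np.random.default_rng(1)
samples = np.array([list_sequential_sample([0.5] * 10, rng) for _ in range(2000)])
sizes = samples.sum(axis=1)
print(sizes.min(), sizes.max())  # → 5 5: every draw contains exactly m = 5 units
```

With these equal weights, the remaining probabilities stay at (m − drawn)/(units left), which is exactly sequential simple random sampling; other weight choices within the bounds (2) induce the correlations discussed above.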
Well spread samples
We are interested in recruiting a well spread sample with the list sequential sampling method. Usually, a well spread sample leads to parameter estimates with low variances. Before we can introduce the definition of a well spread sample, we require the concept of coherent subsets. Let d(i,k) be the distance between individuals i and k. A subset D′ of the population D is coherent if the following holds: for some individual i ∈ D′, individual k is included in D′ if and only if d(i,k) ≤ r, where r ≥ 0. Consequently, D′ can be constructed by including all individuals within a ball of radius r around individual i.
Grafström and Schelin considered a sample to be well spread with respect to the inclusion probabilities π when, for every coherent subset D′ ⊂ D,

∑_{i∈D′} s_i ≈ ∑_{i∈D′} π_i.    (3)

A smaller distance to individual i increases the probability of being included in the coherent subset D′. To satisfy (3), it is clear that the inclusion probability of individual i should be more influenced by the sampling indicators s of individuals at a smaller distance. We propose to measure the distance between individuals with the auxiliary variables x, where d(x_i, x_k) is the distance between individuals i and k. Based on the types of auxiliary variables, we can choose, for instance, the Mahalanobis or the Manhattan distance.
To obtain a well spread sample with the list sequential sampling method, we will use preliminary weights w_i^{*(k)}, which are specified before the recruitment period starts. The preliminary weight w_i^{*(k)} reflects the effect of the sampling indicator s_k of individual k on the inclusion probability of individual i. The weights are referred to as preliminary because the upper bound from (2) has an effect on the conditional inclusion probabilities.
The preliminary weights are constructed in the following way. Let c_k^{(i)} be the rank of the distance of the k-th individual to individual i, where k ≠ i. We rank the distances in ascending order, where we assign c_k^{(i)} = 1 to the closest individual, c_k^{(i)} = 2 to the second closest individual, and so on. To construct the preliminary weights, we could use the linear function

w_i^{*(k)} = μ + λ c_k^{(i)},    (4)

where μ and λ ≤ 0 are arbitrarily chosen constants. The sampling indicator s_k of individual k thus has a larger effect on individuals at a smaller distance and a smaller effect on individuals at a larger distance. To recruit a set of approximately m individuals, we restrict the weights to satisfy ∑_{i≠k} w_i^{*(k)} = 1.
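A small sketch of the rank-based construction in (4); the Euclidean distance, the values of μ and λ, and the renormalization that enforces the sum-to-one restriction are all illustrative choices of ours:

```python
import numpy as np

def preliminary_weights(x, mu=0.1, lam=-0.01):
    """Preliminary weights from the linear rank function (4).

    x: (n, p) matrix of auxiliary variables. W[i, k] is the weight of
    individual k's sampling indicator on the inclusion probability of
    individual i; each row is renormalized so the weights sum to one.
    """
    n = len(x)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # pairwise distances
    W = np.zeros((n, n))
    for i in range(n):
        others = np.flatnonzero(np.arange(n) != i)
        ranks = np.empty(n - 1, dtype=int)
        ranks[np.argsort(d[i, others])] = np.arange(1, n)   # rank 1 = closest
        w = np.clip(mu + lam * ranks, 0.0, None)            # linear in rank, floored at 0
        W[i, others] = w / w.sum()                          # renormalize to sum to one
    return W

x = np.random.default_rng(0).normal(size=(30, 2))
W = preliminary_weights(x)
```

With λ < 0 the weight decreases linearly with the distance rank, so nearby individuals influence each other most; choosing μ and λ so that the weight hits zero at some rank K restricts the influence to the K nearest neighbors.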
Heterogeneous participation probabilities
A problem of sampling from the population D is that individuals who are invited to participate in the study can decline the invitation. Let b = (b_1,…,b_n) be the vector that indicates whether an individual i is invited to participate (b_i = 1) or not (b_i = 0). When individual i refuses to participate in the study, we have s_i = 0 and we do not observe y_i. Let ϕ = (ϕ_1,…,ϕ_n) be the vector that contains the participation probability of each person in the population, where ϕ_i = p(s_i = 1 | b_i = 1). Note that when every invitee participates (i.e. ϕ_i = 1, for i = 1,…,n), we have s = b.
Let π_i/ϕ_i be the inclusion probability corrected for non-participation, i.e. the probability of being invited to participate in the study for individual i from D. When ϕ_i is known before the recruitment period starts, non-participation can be dealt with by using π_i/ϕ_i as the probability to invite individual i. Moreover, we can use the updating rule from (1) to update the inclusion probabilities of the non-evaluated individuals π_j^{(i)}, j > i, after individual i has responded to the invitation. This will give us a sample that approximately satisfies the inclusion probabilities π.
The following small sampling problem illustrates this modification. Consider that, for the first individual, we have π_1^{(0)} = 0.25 and ϕ_1 = 0.5. The probability to invite this individual is therefore 0.25/0.5 = 0.5. Using this strategy there might be some individuals i with π_i/ϕ_i > 1. This means that the participation probability of individual i is too low with respect to π_i; the desired probability of being included in s for individual i cannot be reached. For instance, this would happen in the example above for individual 1 when ϕ_1 = 0.1 and consequently π_1/ϕ_1 = 2.5. This means that we would have to invite individual 1 two and a half times to satisfy π_1^{(0)} = 0.25. Because we can only invite an individual once, we restrict all values π_i/ϕ_i to be one or lower.
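The capping of the corrected invitation probability can be sketched in a few lines; the numbers are illustrative choices of ours, with the third individual showing the case where the desired inclusion probability cannot be reached:

```python
import numpy as np

# desired inclusion probabilities pi and (known) participation probabilities phi
pi = np.array([0.25, 0.25, 0.25])
phi = np.array([0.5, 1.0, 0.1])

# invitation probability pi/phi, capped at one because an individual
# can be invited at most once
p_invite = np.minimum(pi / phi, 1.0)
print(p_invite)        # 0.5, 0.25, 1.0
print(p_invite * phi)  # realized inclusion probabilities: 0.25, 0.25, 0.1
```

For the third individual the cap binds, so the realized inclusion probability is ϕ_3 = 0.1 rather than the desired 0.25.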
Adaptive list sequential sampling method
Usually, ϕ_i is not known before the recruitment period starts. In this section we suggest how ϕ_i can be estimated adaptively during the recruitment period. In addition, we consider delayed response to the invitation.
For each individual, we have some knowledge about the willingness to participate before the recruitment period starts. For example, we might have participation estimates from a small pilot study or from previously performed studies. In addition, information from the invited individuals becomes available during the recruitment period. Therefore, we propose to use a Bayesian method to estimate the participation probability of individual i during the recruitment period, in which we use both the available prior knowledge and the information that becomes available during the recruitment period.
Let z_i be the vector of all observed characteristics of individual i that are related to the participation probability. We assume a missing-at-random type of mechanism for the participation probabilities, where the participation probability of individual i only depends on the observed characteristics z_i, i.e. ϕ_i = p(s_i = 1 | b_i = 1, z_i). The participation probability can be written as

ϕ_i = invlogit[ α + f(z_i, β) ],    (5)

where α is the intercept term, invlogit denotes the inverse logit transformation, and f() is a function of the observed characteristics z_i and the regression weights β. Because more information becomes available during the recruitment period, the participation probability estimates become more accurate. The vector of estimated participation probabilities of all n individuals after the evaluation of individual i is denoted as ϕ̂^{(i)} = (ϕ̂_1^{(i)},…,ϕ̂_n^{(i)}). We then adapt the inclusion probabilities as π_j/ϕ̂_j^{(i)}.
After an invitation has been sent to an individual, it might take some time to get a response. Let u_j^{(i)} be the indicator of whether individual j has responded to the invitation before individual i is evaluated, where u_j^{(i)} = 1 when we observe s_j and u_j^{(i)} = 0 when we do not observe the participation indicator s_j during the evaluation of individual i. Note that when individual j has not been invited (i.e. b_j = 0), s_j = 0, since individual j is not included in the set of participants. A problem of delayed response is that we cannot use the update rule from (1) to determine π_i^{(i−1)} when the participation indicator of a previous individual is not observed. Consequently, we cannot update π_i, which means that our sampling method is less successful in recruiting a well spread sample. As a solution, we propose to use the data from all previously invited individuals, and replace the non-observed participation indicators with their estimated expected values. We use this approach in step 1 of the adaptive list sequential sampling method listed below.
Before we start the adaptive list sequential sampling method, we specify the vector π^{(0)} = π, which contains the initial probability of being included in s for every individual i in D. The desired number of individuals in s is ∑_{i=1}^n π_i^{(0)} = m, where m is a positive integer. The first individual from D is invited with the probability π_1^{(0)}/ϕ̂_1^{(0)}, where ϕ̂_1^{(0)} is an initial guess of the participation probability of the first individual. All other individuals from D are invited in a sequential way, where the steps of the adaptive list sequential sampling method for individual i = 2,…,n are:
1. Calculate π_i^{(i−1)}

To deal with delayed response to the invitation, we propose to use a modified version of the column-wise updating rule proposed by Bondesson and Thorburn [12]. We calculate π_i^{(i−1)} by iterating over k = 1, 2,…, i−1, where

π_i^{(k)} = π_i^{(k−1)} − (s_k − π_k^{(k−1)}) w_i^{(k)},    (6)

and w_i^{(k)} is calculated as

w_i^{(k)} = min( w_i^{*(k)}, π_i^{(k−1)} / (1 − π_k^{(k−1)}), (1 − π_i^{(k−1)}) / π_k^{(k−1)} ).

The weight w_i^{(k)} determines the effect of s_k on π_i^{(k)} and therefore also on π_i^{(i−1)}. The choice of the preliminary weights w_i^{*(k)} is discussed in the previous section. Because (6) still requires the observed indicators s_1, s_2,…,s_{i−1}, we modify (6) to deal with delayed response to the invitation. When u_k^{(i)} = 0, we replace s_k with its estimated expectation b_k ϕ̂_k^{(i−1)}, where ϕ̂_k^{(i−1)} is the participation probability estimate of individual k from the previous evaluation i−1. The delayed response adjusted column-wise updating rule from (6) is

π_i^{(k)} = π_i^{(k−1)} − ( u_k^{(i)} s_k + (1 − u_k^{(i)}) b_k ϕ̂_k^{(i−1)} − π_k^{(k−1)} ) w_i^{(k)}.
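A minimal sketch of step 1, under our reconstructed notation: `W[i, k]` holds the (truncated) weight of individual k's outcome on individual i, `pi_eval[k]` stores the probability recorded when individual k was evaluated, and unobserved indicators are replaced by their expectation b_k ϕ̂_k:

```python
import numpy as np

def pi_current(i, pi0, pi_eval, W, s, b, u, phi_hat):
    """Column-wise computation of pi_i^{(i-1)} with delayed response.

    pi0[i]    : initial inclusion probability of individual i
    pi_eval[k]: pi_k^{(k-1)}, stored when individual k was evaluated
    W[i, k]   : (truncated) weight of individual k's outcome on individual i
    s, b, u   : participation, invitation, and response-observed indicators
    phi_hat[k]: current participation probability estimate of individual k
    """
    pi_i = pi0[i]
    for k in range(i):
        # observed indicator, or its expectation b_k * phi_hat_k when the
        # response has not arrived yet (s_k = 0 whenever b_k = 0)
        s_k = s[k] if u[k] else b[k] * phi_hat[k]
        pi_i -= (s_k - pi_eval[k]) * W[i, k]    # column-wise update rule (6)
    return min(max(pi_i, 0.0), 1.0)             # keep a valid probability

pi0 = [0.4, 0.4, 0.4, 0.4]
pi_eval = [0.4, 0.4, 0.4, 0.4]
W = np.zeros((4, 4)); W[3, 0] = 0.5
s, b = [1, 0, 0, 0], [1, 1, 1, 1]
phi_hat = [0.5, 0.5, 0.5, 0.5]
p_obs = pi_current(3, pi0, pi_eval, W, s, b, [1, 1, 1, 1], phi_hat)      # s_0 observed
p_delayed = pi_current(3, pi0, pi_eval, W, s, b, [0, 1, 1, 1], phi_hat)  # s_0 still pending
```

With the response observed, s_0 = 1 pulls π_3 down to 0.4 − (1 − 0.4)·0.5 = 0.1; while it is pending, the expectation 0.5 gives the milder update 0.4 − (0.5 − 0.4)·0.5 = 0.35.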
2. Calculate b_i

Decide whether individual i should be invited to participate in the study, where b_i = 1 if the individual is invited and b_i = 0 if not. This decision is based on the probability of being invited,

p(b_i = 1) = min( π_i^{(i−1)} / ϕ̂_i^{(i−1)}, 1 ),    (7)

where ϕ̂_i^{(i−1)} is the participation probability estimated at the previous evaluation i−1. We draw the decision to invite individual i from a Bernoulli distribution with probability p(b_i = 1).
3. Update the vector ϕ̂^{(i)}

Let R^{(i)} = {r; b = 1, u^{(i)} = 1, r ∈ D} be the set of all m_i individuals that have responded to the invitation to participate. Each individual from R^{(i)} is described by r = (s, z), where s = 1 when invitee r participates and s = 0 otherwise, and z is a vector of known characteristics. The participation probability of individual k is defined as in (5). Because we might have some a priori knowledge about the intercept α and the regression weights β, we use Bayesian inference to estimate the posterior distribution g(α, β | R^{(i)}), i.e.

g(α, β | R^{(i)}) ∝ p(R^{(i)} | α, β) f(α, β | θ),    (8)

where θ is a vector of parameters, and f() is the prior distribution of (α, β). The likelihood of R^{(i)} given (α, β) is

p(R^{(i)} | α, β) = ∏_{ℓ=1}^{m_i} p(s_ℓ = 1 | z_ℓ, α, β)^{s_ℓ} [ 1 − p(s_ℓ = 1 | z_ℓ, α, β) ]^{1−s_ℓ},

where p(s_ℓ = 1 | z_ℓ, α, β) is given by (5). Following (8), we update the vector of estimated participation probabilities ϕ̂^{(i)}, where for individual k = 1,…,n

ϕ̂_k^{(i)} = ∫ p(s_k = 1 | z_k, α, β) g(α, β | R^{(i)}) dα dβ.

To estimate ϕ̂_k^{(i)}, we can use quadrature or MCMC methods. The values of θ depend on the amount of prior knowledge that is available before the recruitment period starts. For instance, we can assume that (α, β) is sampled from some flat distribution with large variance when no prior knowledge is available.
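To make the quadrature route concrete, here is a deliberately simplified sketch of ours: an intercept-only model ϕ = invlogit(α) with a normal prior, where the posterior mean of ϕ is computed on a grid (the function name and grid choice are assumptions, not the paper's implementation):

```python
import numpy as np

def invlogit(a):
    return 1.0 / (1.0 + np.exp(-a))

def posterior_mean_phi(s_responded, prior_sd=10.0):
    """Posterior-mean participation probability via grid quadrature,
    illustrating (8) for an intercept-only model phi = invlogit(alpha).

    s_responded: participation indicators of the invitees who have
    responded so far.
    """
    alpha = np.linspace(-10.0, 10.0, 2001)            # quadrature grid
    p = invlogit(alpha)
    s = np.asarray(s_responded, dtype=float)
    log_post = (-0.5 * (alpha / prior_sd) ** 2        # normal log prior (up to a constant)
                + s.sum() * np.log(p)                 # Bernoulli log likelihood
                + (len(s) - s.sum()) * np.log1p(-p))
    w = np.exp(log_post - log_post.max())
    w /= w.sum()                                      # normalized posterior weights
    return float((w * p).sum())                       # E[phi | responses so far]

phi0 = posterior_mean_phi([])                  # no data yet: prior mean, 0.5 by symmetry
phi1 = posterior_mean_phi([1] * 8 + [0] * 2)   # mostly participants: estimate rises
```

With covariates, the same construction applies with a multivariate grid or an MCMC sampler over (α, β); the posterior weights then multiply the per-individual probabilities p(s_k = 1 | z_k, α, β).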
Simulations
We illustrated the performance of the adaptive list sequential sampling method with two simulations. In these two simulations, we created populations with unknown heterogeneous willingness to participate and delayed response to the invitation. The first simulation was focused on recruiting a well spread, representative set of participants. In the second simulation, we investigated stratified sampling from a population in which some subgroups were over-represented.
Simulation 1
Consider a population D of size n = 4000 from which we drew a random sample without replacement of size m = 400 with the adaptive list sequential sampling method. To recruit a representative sample from the population, we assigned equal inclusion probabilities to all individuals from the population, i.e. π_i^{(0)} = m/n = 0.1 for i = 1,…,n. When the sample is well spread, the distribution of the auxiliary characteristics x should be approximately similar in the population and the sample.
The data was generated as follows. The vector z_i was drawn from a multivariate normal distribution with means zero, variances one, and covariances zero. The probability of positively responding to the invitation was p(s_i = 1 | b_i = 1, z_i) = invlogit[ α + z_i β ], where invlogit denotes the inverse logit transformation, α = 1, and β = (0.3, −0.7, 0.1, 0.4). The response was drawn from a Bernoulli distribution with probability p(s_i = 1 | b_i = 1, z_i). In addition, for individual i, delayed response to the invitation was simulated by drawing a time t_i from a Poisson distribution with expectation 15. Individual i responded to the invitation after the evaluation of individual i + t_i. Thus, if t_i = 0, individual i responded immediately to the invitation.
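The data-generating process described above can be sketched as follows (our sketch; the seed and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 4000
alpha = 1.0
beta = np.array([0.3, -0.7, 0.1, 0.4])

# z_i: participation covariates, standard normal and uncorrelated
z = rng.normal(size=(n, 4))

# true participation probability: inverse logit of the linear predictor
phi = 1.0 / (1.0 + np.exp(-(alpha + z @ beta)))

# delayed response: individual i responds after the evaluation of i + t_i
t = rng.poisson(15, size=n)
```

At invitation time, the response s_i would then be drawn from a Bernoulli distribution with probability phi[i], and revealed to the sampler only t[i] evaluations later.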
For individual i, the characteristics x_i were drawn from a multivariate normal distribution with means zero, variances one, and covariance matrix
To obtain a well spread and representative sample, we used the adaptive list sequential method. To satisfy (3), we used the Mahalanobis distance to quantify the distance between individuals. We ranked the distances in ascending order and used the ranks to determine the preliminary weights w_i^{*(k)}, for i = 1,…,n and k ≠ i. Using (4), we specified the following adaptive list sequential sampling methods with different characteristics.

Simple random sampling: Assign zero to all preliminary weights w_i^{*(k)}. Consequently, w_i^{(k)} = 0 and therefore π_i^{(i−1)} = π_i^{(0)}. With these weights, we used the initial inclusion probability π_i^{(0)} to determine whether we should invite individual i.

Adjusted sampling 1: The inclusion probability of individual i was equally influenced by all n − 1 = 3999 other individuals by using the preliminary weights w_i^{*(k)} = 1/3999.

Adjusted sampling 2: Only the 50 nearest neighbors of individual i influenced the inclusion probability π_i^{(i−1)}, by using the preliminary weights w_i^{*(k)} = 1/50 when c_k^{(i)} ≤ 50 and w_i^{*(k)} = 0 otherwise.
We used an estimated participation probability to deal with non-participation. Two different approaches to estimate the participation probability were evaluated. The first approach was to use all available data to estimate the participation probability, i.e. ϕ̂_i = p(s_i = 1 | b_i = 1, z_i). With the second approach, we assumed that z_i had no impact on the participation probability, i.e. ϕ̂_i = p(s_i = 1 | b_i = 1). The second approach was used to investigate whether misspecifying the participation model had a large impact on how well the sample was spread.

We assumed that we had no prior knowledge about the participation probability before the recruitment period started. Therefore flat, non-informative priors were used for α and all regression weights β, by assuming they followed normal distributions with means zero and variance 100. Because we assumed zero means, the initial estimated participation probabilities were 50%, i.e. ϕ̂_i^{(0)} = 0.5 for i = 1,…,n.
We quantified how well a sample was spread with the following measure based on Voronoi polytopes, suggested by Grafström and Lundström [10]. Let individual i ∈ s, i.e. individual i is included in the set of participants s. The Voronoi polytope v_i consists of all individuals j from the population D for which d(x_i, x_j) ≤ d(x_k, x_j), for all other individuals k ∈ s. Note that when d(x_i, x_j) = d(x_k, x_j), individual j is included in both polytopes v_i and v_k, but weighted with 1/2.
Let q_i be the sum of the initial inclusion probabilities of the individuals in v_i,

q_i = ∑_{j∈v_i} π_j^{(0)}.

Grafström and Lundström showed that a sample can be considered to be well spread if q_i is one or close to one for all polytopes v_i. Therefore, a measure of how well spread a sample is, is

R = (1/m) ∑_{i∈s} (q_i − 1)²,

where a low R corresponds to a well spread sample. To investigate how well the adaptive list sequential sampling methods performed in recruiting a well spread sample, the simulation was performed 1000 times. We calculated the mean and variance of R, and the average number of recruited participants. Note that the best adaptive list sequential sampling method should give us a set of approximately 400 participants with a low R in every simulation.
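The Voronoi-based measure can be sketched as follows (our sketch, assuming R averages the squared deviation of q_i from one, as in Grafström and Lundström's spatial balance measure; ties are split evenly):

```python
import numpy as np

def spread_measure(x, s, pi):
    """Voronoi-based spread measure: q[i] sums the initial inclusion
    probabilities of the population units falling in participant i's
    polytope, splitting ties evenly; R averages (q - 1)^2."""
    part = np.flatnonzero(s)                                  # sampled individuals
    d = np.linalg.norm(x[:, None, :] - x[part][None, :, :], axis=-1)
    q = np.zeros(len(part))
    for j in range(len(x)):
        nearest = np.flatnonzero(np.isclose(d[j], d[j].min()))
        q[nearest] += pi[j] / len(nearest)                    # split tied units
    return q, float(((q - 1.0) ** 2).mean())

x = np.array([[0.0], [1.0], [10.0], [11.0]])
s = np.array([1, 0, 1, 0])        # participants at 0 and 10
pi = np.full(4, 0.5)              # pi = m/n = 2/4
q, R = spread_measure(x, s, pi)
print(q, R)                       # → [1. 1.] 0.0
```

In this toy example each participant's polytope collects exactly 1.0 of initial inclusion probability, so R = 0: the sample is perfectly spread.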
Simulation 2
In simulation 2, we considered a population D of size n = 5000, in which each individual was described by a categorical auxiliary variable x_i and an unobserved binary outcome of interest y_i. The auxiliary variable x_i had five possible values g. The main goal of this simulation was to estimate the sum of the outcome y in the population, denoted as Y = ∑_{i=1}^n y_i, with a set of participants in which we can measure y. Moreover, we had resources to measure y in a set of participants of size m = 500. The set of participants was obtained with an adaptive list sequential sampling method in which we dealt with non-participation during the recruitment period.
Individuals in different subgroups had different participation probabilities and different frequencies of the outcome y. The characteristics of the populations were

where p(s_i = 1 | b_i = 1, x_i = g) was the participation probability of individual i given x_i = g, i.e. for individual i the probability of participating depended on x_i. The response to an invitation was drawn from a Bernoulli distribution with probability p(s_i = 1 | b_i = 1, x_i = g).
The individuals in the set of participants s were used to estimate Y by Ŷ, where we used the Horvitz-Thompson estimator and its variance [14–16]. The estimate Ŷ was calculated as

Ŷ = ∑_{i∈s} y_i / π_i,    (9)

where π_i was the desired probability of being included in the set of participants s, specified before the recruitment period started. The variance of Ŷ was approximated with

V̂(Ŷ) = ∑_{i∈s} ∑_{j∈s} ( (π_ij − π_i π_j) / π_ij ) (y_i / π_i)(y_j / π_j),

where π_ij is the second-order joint inclusion probability of the i-th and j-th individuals in s, i.e. π_ij = p(s_i = 1, s_j = 1). To determine π_ij, we used the sample-based approximation technique proposed by Hájek [17, 18].
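The Horvitz-Thompson point estimate in (9) is a one-liner; the function name and the illustrative numbers below are ours:

```python
import numpy as np

def horvitz_thompson(y_sample, pi_sample):
    # y_sample: outcomes observed on the participants; pi_sample: their
    # desired inclusion probabilities, fixed before the recruitment period
    y = np.asarray(y_sample, dtype=float)
    pi = np.asarray(pi_sample, dtype=float)
    return float(np.sum(y / pi))

# A participant from an oversampled group (pi = 0.2) represents 5
# population members; one from an undersampled group (pi = 0.05)
# represents 20 of them
print(horvitz_thompson([1, 0, 1], [0.2, 0.2, 0.05]))  # → 25.0
```

Weighting each participant by the inverse of the inclusion probability is what makes the estimator unbiased for Y even when the subgroups are deliberately over- or undersampled.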
The set of participants s was obtained with the adaptive list sequential sampling method. Before the recruitment period started, we specified the vector π^{(0)}. We considered a vector π^{(0)} in which the probability of being included in s was inversely proportional to the size of group g in the population. Because not all groups were observed with the same frequency in D, we oversampled the smaller subgroups in such a way that each group g was observed with similar frequency in s. For each invited individual with x = 1, we have to invite 2, 2, 4, and 4 individuals with respectively x = 2, 3, 4, 5 to obtain an equal number of individuals from each group in s. Therefore, depending on the value of x_i, we used the following probabilities for individual i

Note that we could also use stratified sampling to get our desired set of participants, because we only have five disjoint groups. However, when we have a large number of groups, stratification becomes impracticable. A large number of groups is no problem for the (adaptive) list sequential sampling design, if it is possible to specify a distance measure between individuals (see (3)). With π^{(0)}, we expected to have an equal number of individuals from each subgroup g in the set of participants.
We considered two adaptive list sequential methods to recruit the sample.

Simple random sampling: Assign zero to all preliminary weights w_i^{*(k)}. Therefore π_i^{(i−1)} = π_i^{(0)}.

Adjusted sampling: To recruit a well spread sample, the inclusion probability of individual i should only be influenced by individuals located in the same group. Therefore, we used the preliminary weights w_i^{*(k)} = 1/(n_g − 1) when individuals i and k both belong to group g, and w_i^{*(k)} = 0 otherwise, where n_g is the number of individuals in group g.
For both adaptive list sequential sampling methods, we used the following model to describe the participation probability:

p(s_i = 1 | b_i = 1, x_i = g) = invlogit[ β_g ],

where β_g is the regression weight for group g. Because we assumed we had no a priori information about the participation probabilities, we used non-informative priors for β by sampling all five parameters β_g from a normal distribution with mean zero and variance 100. For individual i, delayed response to the invitation was simulated by drawing a time t_i from a Poisson distribution with expectation 15. Individual i responded to the invitation after the evaluation of individual i + t_i.
The simulations were performed 1000 times and we calculated the bias, MSE, and coverage of Ŷ for both adaptive list sequential methods.