Kristof Bostoen, Zaid Chalabi, Optimization of household survey sampling without sample frames, International Journal of Epidemiology, Volume 35, Issue 3, June 2006, Pages 751–755, https://doi.org/10.1093/ije/dyl019
Abstract
There are a few sampling methods available to survey households in situations where sample frames are either unavailable or are unreliable. The most popular of these methods is the expanded programme of immunization (EPI) sampling method, which has been used in low-income countries. The purpose of this paper is to explain how mathematical programming can be used to optimize EPI and other household survey sampling methods in these situations.
Representative samples in household surveys are often difficult to obtain in low-income countries. Traditional sampling methods based on simple random sampling (SRS) give each basic sampling unit (BSU) an equal probability of inclusion in the sample. Although SRS is conceptually simple, applying it to household surveys can be expensive and unfeasible because it requires all the households to be identified prior to the sampling. The cluster sampling methods commonly used in household surveys reduce the need for detailed lists of households to the selected clusters. However, creating these lists (known as sample frames) still requires considerable effort, skill, and resources, which are not always available in low-income countries. The sample frames may not be reliable in situations where (i) maintaining the household lists proves difficult (often due to a lack of administrative structure for reporting changes), (ii) minorities, disadvantaged communities, or migrants tend to be excluded, and (iii) there is a high rate of migration, as in peri-urban areas or among populations displaced because of events such as natural disasters. Alternative household sampling methods, which do not use detailed sample frames, have been developed to cater for such situations.
EPI sampling method
To date one of the most popular spatial sampling methods adopted by WHO for use in low-income countries is the EPI method, named after the Expanded Programme of Immunization. This makes use of a modification of PPS (Probability Proportional to Size) sampling developed originally in the USA1 and modified for use in the smallpox eradication programmes in West Africa.2
The EPI method can be described simply as follows. A number of clusters (e.g. communities, villages) are chosen with a probability proportionate to their size, and then an equal number of selected households is surveyed in each of the selected clusters. In each chosen cluster the EPI method selects (i) a location near the centre of the community, (ii) a random direction (which is often defined in the field by spinning a bottle or pen), and (iii) a random household along the chosen direction pointing outwards from the centre of the community to its boundary. In subsequent steps, which are carried out iteratively, the closest household (door to door) to that determined in the previous step is chosen and checked for compliance with the inclusion criteria. The iterations are repeated until the required number of households is surveyed.
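The procedure just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the field protocol: households are given as (x, y) coordinates, clusters are drawn with replacement rather than by the systematic PPS scheme usually applied in practice, "along the chosen direction" is approximated by a hypothetical angular tolerance of 15 degrees, "closest household" is measured as straight-line distance, and the check against the inclusion criteria is omitted. The function names `pps_select` and `epi_walk` are introduced here for illustration only.

```python
import math
import random


def pps_select(cluster_sizes, k, rng):
    """Draw k cluster indices with probability proportional to size.

    Simplification: sampling is with replacement.
    """
    return rng.choices(range(len(cluster_sizes)), weights=cluster_sizes, k=k)


def epi_walk(households, centre, m, rng, tolerance=math.radians(15)):
    """Pick m households: random direction, random start, nearest-neighbour steps."""
    theta = rng.uniform(0.0, 2.0 * math.pi)  # the "spun bottle"

    def angular_gap(h):
        # Absolute angle between the chosen direction and the centre-to-household ray.
        angle = math.atan2(h[1] - centre[1], h[0] - centre[0])
        return abs((angle - theta + math.pi) % (2.0 * math.pi) - math.pi)

    # Households lying roughly along the chosen direction (assumed tolerance).
    along = [i for i, h in enumerate(households) if angular_gap(h) <= tolerance]
    if not along:
        # Fall back to the household closest in angle to the direction.
        along = [min(range(len(households)), key=lambda i: angular_gap(households[i]))]

    current = rng.choice(along)
    chosen = [current]
    remaining = set(range(len(households))) - {current}

    # Iteratively move to the nearest household not yet surveyed.
    while len(chosen) < m and remaining:
        current = min(
            remaining,
            key=lambda i: math.dist(households[i], households[current]),
        )
        chosen.append(current)
        remaining.remove(current)
    return chosen
```

Because each step depends on the previous household, the selected households tend to be close neighbours, which is the source of the geographical clustering discussed later in the paper.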
There is no doubt that EPI-sampling (as this method is generally known) has been instrumental in evaluating immunization coverage worldwide. However, statisticians had concerns about the bias and precision of the estimates obtained using the EPI method until computer simulations provided some indication of its validity.3,4 EPI-sampling has enabled WHO and UNICEF to measure the coverage of their childhood immunization programmes and has also been adapted to measure nutritional status.5
The simplicity and ease of applying EPI-sampling have made this method very popular. Unfortunately, it has often been used inappropriately owing to a lack of understanding of its statistical and analytical limitations and a lack of appropriate alternative sampling methods.6,7 The use of EPI-sampling has therefore, on occasion, produced non-representative data on which possibly erroneous decisions and conclusions were based. Suggestions have been made to improve and adapt the EPI method.4,8–10 However, most of these improvements undermined the simplicity of the original method. Furthermore, it is difficult to assess whether they resulted in more representative data.
The difficulties hindering further development of the EPI and other sampling methods are mainly attributable to the fact that (i) the performance measure used to quantify improvements in the sampling method is ill-defined, (ii) there is a multitude of scenarios of household distribution on which the sampling method requires verification and validation, and (iii) it is not apparent how to analyse the properties of a sampling method that is not strictly random unless an exhaustive set of simulations is carried out. Furthermore, a sampling method that is optimal in one scenario of household distribution could be sub-optimal or even inefficient in another scenario.
Note that the EPI design of 30 × 7 (30 clusters × 7 samples) was originally intended for measuring vaccination coverage in children aged between 12 and 23 months. This narrow age-specified inclusion criterion determines the parameters of the EPI design in terms of the density of the sampling and its geographical spread as on average it is expected that one in every seven households has a child aged between 1 and 2 years. Nutritional surveys, however, focus on children <5 years old. Because children in this wider age range are expected to be present in most households, this inclusion criterion reduces significantly the geographical spread of the sample. To compensate (albeit partially) for the clustering effect, the sample size in each cluster is increased for nutritional surveys from 7 to 30 (30 clusters × 30 samples), which in turn increases the design effect (deff) of the sample.11
Problems also arise with EPI sampling in mortality surveys. Owing to the lack of an unbiased inclusion criterion, the sample becomes geographically highly clustered. Any outcome measure that is highly clustered can lead to high design effects (deff > 2).12 A design effect of 2 is often assumed when no estimates of deff are available and is the value assumed in the original EPI design.6
Larger design effects can be found in household surveys measuring the provision of health services and access to water and sanitation. Clusters that include a health centre or a water point will have substantially higher access figures than those that have no health or water provision services. In these studies, there is the additional problem that methods such as EPI sampling, which rely on PPS for the selection of clusters, can introduce selection bias. Smaller clusters have less chance of being selected when using PPS, yet clusters can be small precisely because of a lack of services, so the selection may be biased against them.
The EPI method limits the design decisions to the number of clusters and the number of households within a cluster, and to defining the sequential choice of households for surveying within a cluster. Indeed, neither the sample size nor the strategy for selecting households is optimized in any sense. To validate EPI, past work used either hypothetical scenarios in which clusters and households are generated artificially through computer simulations3,13 or real scenarios generated from data-rich surveys.4
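To make the effect of cluster size on the design effect concrete, the sketch below uses the standard approximation deff = 1 + (m − 1)ρ for clusters of equal size m, where ρ is the intracluster correlation. Neither this formula nor any value of ρ is given in the paper; the value ρ = 1/6 is a hypothetical choice made only so that deff equals 2 for the 30 × 7 design.

```python
def design_effect(m, rho):
    """Approximate design effect for equal clusters of size m: 1 + (m - 1) * rho."""
    return 1.0 + (m - 1) * rho


def effective_sample_size(n_clusters, m, rho):
    """Size of the simple random sample that would give the same precision."""
    return n_clusters * m / design_effect(m, rho)


# Hypothetical intracluster correlation, chosen so that deff = 2 when m = 7.
rho = 1.0 / 6.0

for m in (7, 30):
    deff = design_effect(m, rho)
    ess = effective_sample_size(30, m, rho)
    print(f"30 x {m}: deff = {deff:.2f}, effective sample size = {ess:.1f}")
```

Under this assumed ρ, moving from 7 to 30 households per cluster more than quadruples the number of interviews while the effective sample size grows far less, because the design effect rises with m.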
Mathematical programming
Mathematical programming methods could be used to optimize household survey sampling methods in settings where sample frames are unavailable or impractical, as is common in developing countries. Several sampling methods, such as those of the EPI, have been used in such situations. Simulations are often used to improve the statistical robustness of these methods by evaluating their sampling properties under different computer-generated spatial distributions that simulate realistic scenarios. We propose that mathematical programming would be more efficient in improving sampling methods, both by circumventing the need for computing-intensive methods such as Monte Carlo simulations and by optimizing the sampling methods rigorously and explicitly, minimizing a sampling error term while constraining the survey cost or data collection time.
We propose the use of mathematical programming as a more meticulous approach to assessing various sampling methods. Mathematical programming is a branch of mathematics that deals with the formulation of optimization problems and the development and use of procedures (algorithms) to solve these problems. In its most basic form, an optimization problem is a mathematical description of a system (or a scheme) that is characterized by (i) a set of variables (known as control or optimization variables) whose values can be changed to achieve preset objectives, (ii) a performance measure (which depends on the control variables), and (iii) a set of constraints (which also depend on the control variables). By definition, the solution of an optimization problem is the set of values of the control variables that optimize (i.e. maximize or minimize) the performance measure while ensuring that the constraints are satisfied.
To illustrate the use of mathematical programming in this application, the household sampling problem is formulated as an optimization problem. The description below is not unique as several formulations are possible depending on the objectives of the sampling method. One of the advantages of the mathematical programming approach is that it allows alternative formulations to be compared in a straightforward manner.
The optimization problem is constructed in three main steps. The first step defines the control variables. These variables could be (i) the number of clusters, (ii) the number of households within a cluster, (iii) the spatial location of the starting point of the survey, and (iv) the ‘survey pathway’ or the spatial sampling strategy. The survey pathway is defined as the line constructed from joining together directed straight line segments (i.e. vectors) connecting consecutive households in the order they are surveyed.
The second step defines the performance measure to be optimized, for example the total error of the sample mean estimate. The square of the total error is the sum of two terms: the first term (bias) is the squared deviation of the sample mean estimate from that obtained by the ‘gold standard’ SRS sampling and the second term is the sample variance.14 Other sampling error terms could be used depending on the context of the problem.
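In symbols, the decomposition just described might be written as below. Ψ is the symbol the paper uses for the total error; the notation \(\bar{y}\) for the sample mean estimate, \(\bar{y}_{\mathrm{SRS}}\) for the value obtained under SRS, and \(\operatorname{Var}(\bar{y})\) for the variance term is introduced here for illustration and is not taken from the original.

```latex
\Psi^{2}
  = \underbrace{\left(\bar{y} - \bar{y}_{\mathrm{SRS}}\right)^{2}}_{\text{squared bias}}
  + \underbrace{\operatorname{Var}\left(\bar{y}\right)}_{\text{variance}}
```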
The third step defines the set of constraints to be satisfied. These could be divided into two sets. The first set of constraints is associated with the cost of the survey or the time taken for data collection. These constraints are often represented as inequalities. For example, it could be required that the cost of the survey does not exceed a fixed budget. The second set of constraints is associated with the geographical setting. Although they are referred to as constraints, these relationships model the spatial distribution of the households and the presence of any geographical barriers. They are called constraints because the geographical distribution of the households confines the survey pathway to some routes (i.e. it restricts the feasible set of control variables).
The optimization problem can be represented in a compact form. Table 1 defines the optimization problem in words and represents it mathematically.
The mathematical representation captures in a snapshot the whole optimization problem by specifying the set of control variables, the performance measure to be optimized and the set of constraints to be satisfied. In this representation,
- Ψ is the total error of the sample mean estimate, the measure to be optimized.
- \(n,\ m,\ \left(\vec{\varphi}_{1},\ \ldots,\ \vec{\varphi}_{m}\right)\) are the control variables whose optimal values are to be determined.
- n is the number of clusters.
- m is the number of households within a cluster (assumed in this problem formulation to be the same for all clusters).
- \(\left(\vec{\varphi}_{1},\ \ldots,\ \vec{\varphi}_{m}\right)\) is the survey pathway, where the arrow sign means that each line segment is a vector.
- \(\vec{\varphi}_{i}\) is the vector (in two-dimensional Euclidean space \(\mathbb{R}^{2}\)) connecting household i − 1 to household i (note that \(\vec{\varphi}_{1}\) is the vector connecting the starting point of the survey to the first household).
- \(\delta\left(\vec{\varphi}_{i}\right)\) is the magnitude of the vector \(\vec{\varphi}_{i}\); it represents the distance covered between a pair of consecutive households.
- c is the cost of the survey.
- α is the maximum budget allocated for the survey. The inequality c ≤ α ensures that the cost does not exceed the budget.
- Z is the feasible set of the survey pathway. Z models the spatial distribution of the households (the spatial pattern describing the scatter of the households in two dimensions) and possibly any geographical barriers.
The total error of the sample mean (Ψ) is a function of the sample size (the number of clusters n and the number of households within a cluster m) and the spatial sampling strategy \(\left(\vec{\varphi}_{1},\ \ldots,\ \vec{\varphi}_{m}\right)\).
Figure 1 shows the spatial sampling strategy schematically. In this figure, x and y are the coordinates in space, the symbol ‘×’ denotes the location of a household, and 𝛉 denotes the starting point of the survey.
The first constraint is an inequality constraint. It sets an upper limit α to the cost of the survey c. The cost of the survey is a function of n, m and the total distance covered by the surveyor (additional terms such as the number of vehicles used can also be included). The total distance covered along the survey pathway is given by \(\sum_{i=1}^{m}\delta\left(\vec{\varphi}_{i}\right)\).
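The distance term can be computed directly from household coordinates. The sketch below assumes each location is an (x, y) pair and that distance is measured in a straight line, ignoring the geographical barriers that the set Z is meant to capture; the function names are illustrative.

```python
import math


def pathway_vectors(start, households):
    """Return the vectors joining the start point and the households
    in the order in which they are surveyed."""
    points = [start] + list(households)
    return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(points, points[1:])]


def delta(phi):
    """Magnitude of a two-dimensional vector."""
    return math.hypot(phi[0], phi[1])


def total_distance(start, households):
    """Sum of the segment lengths along the survey pathway."""
    return sum(delta(phi) for phi in pathway_vectors(start, households))
```

For example, `total_distance((0, 0), [(3, 4), (3, 0)])` adds a first segment of length 5 to a second segment of length 4.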
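The second constraint is expressed as a set constraint: the survey pathway must belong to the feasible set, \(\left(\vec{\varphi}_{1},\ \ldots,\ \vec{\varphi}_{m}\right) \in Z\).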
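Table 1 is not reproduced in this text. Collecting the definitions above, the compact form of the problem might be written as follows. This is a reconstruction from the surrounding description rather than the authors' table, and the way the cost c is shown as depending on n, m and the total distance is an assumption based on the sentence describing the first constraint.

```latex
\begin{aligned}
\min_{n,\ m,\ \left(\vec{\varphi}_{1},\ldots,\vec{\varphi}_{m}\right)} \quad
  & \Psi\left(n,\ m,\ \vec{\varphi}_{1},\ldots,\vec{\varphi}_{m}\right) \\
\text{subject to} \quad
  & c\left(n,\ m,\ \sum_{i=1}^{m}\delta\left(\vec{\varphi}_{i}\right)\right) \le \alpha, \\
  & \left(\vec{\varphi}_{1},\ldots,\vec{\varphi}_{m}\right) \in Z .
\end{aligned}
```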
There are many mathematical methods available to solve complex optimization problems; the choice of the method depends on the formulation of the optimization problem. In the case of the above household survey sampling problem, the appropriate solution method will depend primarily on the models of the survey pathway and the spatial distribution of the households. The mathematical programming methods include function space methods,18,19 integer-based methods,20 and combinatorial methods.21,22 These methods have been widely and successfully used in applications such as operations research, economics, management science, control engineering, and network design amongst many others.
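As a toy illustration of what solving such a problem involves, the sketch below optimizes only the two integer control variables n and m by exhaustive enumeration, leaving the survey pathway out entirely. It is not one of the methods cited above, and every modelling choice is hypothetical: the error term is the variance of an estimated proportion inflated by deff = 1 + (m − 1)ρ with bias ignored, the cost is linear in the number of clusters and households, and all numerical defaults are invented for the example.

```python
def variance_of_mean(n, m, p=0.5, rho=0.1):
    """Hypothetical error model: variance of a proportion p under cluster
    sampling with deff = 1 + (m - 1) * rho. Bias is ignored."""
    deff = 1.0 + (m - 1) * rho
    return deff * p * (1.0 - p) / (n * m)


def survey_cost(n, m, cost_per_cluster=50.0, cost_per_household=2.0):
    """Hypothetical linear cost: fixed cost per cluster plus cost per household."""
    return n * cost_per_cluster + n * m * cost_per_household


def optimise_design(budget, max_clusters=100, max_households=50):
    """Enumerate integer designs (n, m) and return (error, n, m) for the
    feasible design with the smallest modelled variance."""
    best = None
    for n in range(1, max_clusters + 1):
        for m in range(1, max_households + 1):
            if survey_cost(n, m) > budget:
                continue  # violates the budget constraint c <= alpha
            err = variance_of_mean(n, m)
            if best is None or err < best[0]:
                best = (err, n, m)
    return best  # None when no design fits the budget


if __name__ == "__main__":
    print(optimise_design(budget=2000.0))
```

Enumeration is feasible here only because the search space is small. Once the survey pathway is included as a control variable, the number of candidate solutions grows combinatorially, which is where the specialised methods mentioned above become relevant.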
As with any approach, the use of mathematical programming has its advantages and disadvantages. On the one hand, it could be argued that this approach presents a conceptually simple (albeit mathematically complex) formulation of the household sampling problem and paves the way for a robust and explicit formulation of the household survey sampling problem.
On the other hand, one of the disadvantages of the mathematical programming approach is that it requires analytical expressions of the performance measure and the constraints as functions of the control variables. These requirements could be viewed as a disadvantage, compared with the Monte Carlo approach,3,4 because they entail an understanding of the way in which the control variables influence the performance measure and the constraints. We would argue, however, that the advantages gained in using the mathematical programming approach far outweigh any disadvantages. The mathematical programming approach provides a rigorous means to optimize the sampling methods under different scenarios without the need for exhaustive Monte Carlo simulations to cater for all permutations of the setting.
Conclusion
Household sampling methods such as EPI have been widely and successfully used. These methods, however, suffer from a number of disadvantages. There is a need to develop alternative sampling methods in situations where traditional data collection methods prove challenging or unfeasible. We believe that although mathematical programming methods are not yet widely used in epidemiology, they have an important role to play in this area. One application is to optimize household survey sampling methods so that they become more reliable in circumstances where sampling frames are not available.
We have described the first step in a rigorous approach towards optimizing household survey sampling methods in settings where sample frames are not feasible. The sampling methods obtained through the optimization approach require rigorous validation. Initial validation could be done using existing geo-referenced data, but formal validation will require practical field testing.
Obtaining representative samples in household surveys is difficult to achieve in situations where detailed sampling frames are unavailable or are unreliable.
Mathematical programming could be used to optimize EPI sampling as well as radically different methods appropriate for household survey sampling without sample frames.
The authors are grateful to Sandy Cairncross, Ben Armstrong, Chris Grundy, and Lucy Smith for their comments and support.
References
Serfling RE, Sherman IL. Attribute Sampling Methods, Publication No. 1230. US Department of Health and Human Services, Public Health Service: Washington, D.C.,
Henderson RH, Davis H, Eddins DL, Foege WH. Assessment of vaccination coverage, vaccination scar rates, and smallpox scarring in five areas of West Africa.
Lemeshow S, Tserkovnyi AG, Tulloch JL, Dowd JE, Lwanga SK, Keja J. A computer simulation of the EPI survey strategy.
Bennett S, Radalowicz A, Vella A, Tomkins A. A computer simulation of household sampling schemes for health surveys in developing countries.
Bennett S. The EPI cluster sampling method: a critical appraisal. Bull Inst Statist Inst
Stoeckel J. Evaluation of multiple indicator cluster surveys. UNICEF Division of Evaluation, Policy and Planning,
Henderson H, Sundaresan T. Cluster sampling to assess immunization coverage: a review of experience with a simplified sampling method.
Turner AG, Magnani RJ, Shuaib M. A not quite as quick but much cleaner alternative to the expanded programme on immunization (EPI) cluster survey design.
Milligan P, Njie A, Bennett S. Comparison of two cluster sampling methods for health surveys in developing countries.
Depoortere E, Checchi F, Broillet F, Gerstl S, Minetti A, Gayraud O et al. Violence and mortality in West Darfur, Sudan (2003–04): epidemiological evidence from four surveys.
Mann G. Cluster Sampling simulator. MSc Thesis. London: Department of Geomatic Engineering, University College London,
Bennett S, Radalowicz A, Vella V, Tomkins A. A computer simulation of household sampling schemes for health surveys in developing countries.
Thomsen I, Tesfu D, Binder A. Estimation of design effects and interclass correlations when using outdated measures of size.
Levy PS, Lemeshow S. Sampling of Populations. Methods and Applications. 3rd edn. New York: John Wiley & Sons,
Pytlak R. Numerical Methods for Optimal Control with State Constraints. Berlin: Springer-Verlag,