We construct an immunization rate model at the census block group resolution. The steps involve integrating activity-based populations for Minnesota (MN) and Washington (WA) with school immunization data collected by the health departments in these states.
Activity-based model and demographics
Following the approach of [9, 10], we use a population model that represents the entire population of MN and WA, with complete demographics for each person together with their activities, activity times, and locations. This representation integrates over one dozen public and commercial datasets. We briefly summarize the steps of the process here and refer to [9, 10] for complete details. As a first step, a synthetic representation of each person in the population is constructed such that the synthetic individuals, when aggregated to the block group level, are statistically equivalent to the individuals in the corresponding U.S. census block group.
Next, daily activities are assigned to individuals within a household using activity and time-use surveys, and methods from transportation science. Finally, activity locations are determined using methods from transportation studies and detailed land use data. In particular, this model has school locations for all school-aged children. The resulting population, referred to as an “activity-based population”, is statistically equivalent to the census; we refer to [9, 11–14] for details on validation. The population models developed using this approach have been used in a number of studies on epidemic spread and public health policy planning [9, 10, 12, 15–18].
Here, we use activity-based models constructed using the 2010 census data. These models have rich demographic characteristics; in our study, we focus on age and income. We divide the population by age into four categories: pre-school (ages 0–5), school (ages 6–17), adults (ages 18–65), and seniors (ages 65+). These groups represent 7, 19, 67, and 7% of the population, respectively. We also group the population by annual household income into low ($0–$25,000), medium ($25,000–$75,000), and high ($75,000+), which represent 20, 45, and 35% of the population, respectively. We use the same categories for the state of Washington.
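The grouping above can be sketched as two small lookup functions. The function names and the treatment of boundary values (age 65 counted as adult, income bounds treated as half-open intervals) are assumptions made for illustration; the text does not specify them.

```python
def age_group(age: int) -> str:
    """Map an age in years to one of the four age categories."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 5:
        return "pre-school"
    if age <= 17:
        return "school"
    if age <= 65:  # assumption: age 65 is counted as adult
        return "adult"
    return "senior"


def income_group(annual_household_income: float) -> str:
    """Map an annual household income in dollars to low, medium, or high."""
    if annual_household_income < 0:
        raise ValueError("income must be non-negative")
    if annual_household_income < 25_000:  # assumption: upper bounds are exclusive
        return "low"
    if annual_household_income < 75_000:
        return "medium"
    return "high"
```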
School immunization data
We describe here the publicly available school immunization data that we use in our analysis for MN and WA. In this paper, we focus only on the immunization rates for the MMR vaccine among middle school children, namely, 7th grade in the case of Minnesota and 6th grade in the case of Washington, since children who get delayed immunizations (i.e., those who do not meet the school-entry vaccine requirements by elementary school, but do so in later years) are likely to be covered in this data.
Minnesota
We use data collected by the Minnesota Department of Health (MDH) for the school year 2015–2016, as reported in their Annual Immunization Status Report (AISR).¹ The data contains immunization statistics for 7th-grade students in all public and private schools across the state, except for schools with fewer than 5 children, whose data are withheld for privacy reasons, and schools that did not respond to MDH. Initially, there are 872 schools in the MDH report, and after removing schools with fewer than 5 children (114) and schools that did not report (12), we are left with 712 schools. The relevant fields for this study are 1) total enrollment and 2) percentage of students who have received two doses of the MMR vaccine. From this data, we compute the number of unvaccinated and under-vaccinated² children in a school as
$$ \left(\mathrm{total\ enrollment}\right) \times \left(1-\frac{\%\,\mathrm{vaccinated}}{100}\right) $$
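As a minimal sketch, the per-school computation can be written as follows. The function and parameter names are hypothetical, and the enrollment and percentage in the usage note are made-up values, not entries from the AISR.

```python
def under_vaccinated_count(total_enrollment: int, pct_vaccinated: float) -> float:
    """Estimated number of unvaccinated and under-vaccinated children in a school."""
    if total_enrollment < 0:
        raise ValueError("total_enrollment must be non-negative")
    if not 0.0 <= pct_vaccinated <= 100.0:
        raise ValueError("pct_vaccinated must lie in [0, 100]")
    return total_enrollment * (1.0 - pct_vaccinated / 100.0)
```

For instance, a hypothetical school with 80 enrolled students and a reported 75% two-dose coverage would be assigned `under_vaccinated_count(80, 75.0)`, i.e., 20 children.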
We use the vaccination rates from the AISR to assign an immunization status to each child in the activity-based model. To perform this assignment (Section 2.3), we need a mapping between schools in the MDH data and synthetic schools in our population model. However, the data reported by MDH lists only school names and their corresponding school districts. We need a full address for each school, in addition to the name, to find the corresponding school in the activity-based model. Addresses were obtained from the Minnesota Department of Education website or, when not available there, by manual search using Google Maps.
Washington
Immunization data for Washington was obtained from the Washington State Department of Health (WDH) website.³ The report contains entries for 6th-grade students in 2615 public and private schools across the state, for the school year 2016–2017. However, a majority of the schools did not have enrollment data (137) or had a reported enrollment of zero students (1441). We discarded these entries as well as all the schools with fewer than 5 students (138) to obtain a total of 899 schools. Unlike the MDH data, the WDH data includes a full address for each school, which lets us skip the address collection and manual labeling step. The immunization data is then used to model immunization rates, as described in the next subsection.
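The filtering rule described above can be sketched as follows. The record layout (a `name` and an optional `enrollment` field) is a hypothetical stand-in for the WDH file format, and the toy records are invented for illustration only.

```python
from typing import List, Optional, TypedDict


class SchoolRecord(TypedDict):
    name: str
    enrollment: Optional[int]  # None when enrollment was not reported


MIN_ENROLLMENT = 5  # schools with fewer students are discarded


def keep_school(record: SchoolRecord) -> bool:
    """Keep a school only if enrollment is reported and at least MIN_ENROLLMENT."""
    enrollment = record["enrollment"]
    return enrollment is not None and enrollment >= MIN_ENROLLMENT


def filter_schools(records: List[SchoolRecord]) -> List[SchoolRecord]:
    return [r for r in records if keep_school(r)]


# Hypothetical toy records, one per discard reason plus one retained school.
toy_records: List[SchoolRecord] = [
    {"name": "School A", "enrollment": None},  # missing enrollment
    {"name": "School B", "enrollment": 0},     # reported enrollment of zero
    {"name": "School C", "enrollment": 3},     # fewer than 5 students
    {"name": "School D", "enrollment": 42},    # retained
]
kept = filter_schools(toy_records)
```

The counts reported for Washington are consistent with this rule: 2615 − 137 − 1441 − 138 = 899.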
Statistical analysis using network scan statistics
We use the methods of scan statistics [5, 20, 21] to identify statistically significant geographical clusters with a high proportion of under-immunized children. This approach formalizes anomaly detection as a hypothesis testing problem, and has been used for detecting anomalies or “hotspots” in spatial data [22–24]. We use an extension of this approach to networks, which has been used for anomaly detection in network data [25–27]. Specifically, we consider a network G = (B, E) defined on the set of block groups, i.e., each block group b is a node in the set B.
Two block groups b, b′ ∈ B are connected by an edge, (b, b′) ∈ E, if they share a boundary. Thus, G denotes the adjacency graph of the block groups. We say that a subset C ⊂ B is a connected subgraph if the graph H = (C, E′), formed by considering only the edges (b, b′) ∈ E with b, b′ ∈ C, is connected. We consider such subgraphs because this allows clusters of arbitrary shape, whereas most applications of scan statistics in spatial data restrict the clusters to some fixed regular shape. We note that a cluster of block groups that is topologically shaped as a circle is also a connected subgraph with respect to the above definition. Therefore, this notion strictly generalizes the clusters considered in SaTScan.
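To make the definition concrete, the following sketch builds an adjacency graph from a list of shared-boundary pairs and checks whether a subset of block groups induces a connected subgraph. The helper names and the five-node toy graph are illustrative assumptions; the actual graph is derived from block group geometries.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def build_adjacency(nodes: Iterable[str],
                    edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency lists for G = (B, E)."""
    adj: Dict[str, Set[str]] = {b: set() for b in nodes}
    for b1, b2 in edges:
        adj[b1].add(b2)
        adj[b2].add(b1)
    return adj


def is_connected_subgraph(adj: Dict[str, Set[str]], cluster: Iterable[str]) -> bool:
    """True if the subgraph induced by `cluster` is connected (breadth-first search)."""
    members = set(cluster)
    if not members:
        return False
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        b = queue.popleft()
        for neighbor in adj[b]:
            # Only follow edges whose endpoints both lie inside the cluster.
            if neighbor in members and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == members


# Toy graph: a path a - b - c - d, with an extra block group e adjacent to b.
adj = build_adjacency("abcde", [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")])
```

In this toy graph, {a, b, e} is a connected subgraph with an irregular shape, while {a, c} is not connected because the two block groups do not share a boundary.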
For each block group b ∈ B, we have two counts: (i) pop(b), which is the baseline count of 7th-grade children in MN (6th-grade children in WA), and (ii) unimm(b), which is the estimated number of under-immunized 7th-grade children in MN (6th-grade children in WA), also referred to as the “event count”. Following our notation defined earlier, we have
$$ \mathrm{unimm}(b) = \sum_{v \in C(b)} \left(1 - p(v)\right) $$
We use the Poisson version of the Kulldorff scan statistic [5], where the null hypothesis H0 is that the event counts for all nodes b are generated proportionally to their baseline counts, i.e., (1 − μ) ∙ pop(b), where μ is the state-wide immunization rate.
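The two counts and the expected count under the null hypothesis can be illustrated on toy data. In this sketch we read C(b) as the set of children assigned to block group b and p(v) as the probability that child v is immunized (both are defined earlier in the paper and not restated here), and we assume that μ is estimated from the same data; the probabilities are made up.

```python
from typing import Dict, List

# Hypothetical toy data: p(v) for every child v assigned to block group b.
p_by_block: Dict[str, List[float]] = {
    "b1": [1.0, 1.0, 0.5, 0.5],
    "b2": [1.0, 1.0, 1.0, 1.0],
    "b3": [0.0, 0.5],
}


def unimm(p_values: List[float]) -> float:
    """Event count of a block group: the sum of (1 - p(v)) over its children."""
    return sum(1.0 - p for p in p_values)


pop = {b: len(ps) for b, ps in p_by_block.items()}       # baseline counts
events = {b: unimm(ps) for b, ps in p_by_block.items()}  # event counts

# Assumption: mu is the overall immunization rate implied by the same data.
mu = 1.0 - sum(events.values()) / sum(pop.values())

# Expected event count of each block group under the null hypothesis.
expected = {b: (1.0 - mu) * pop[b] for b in pop}
```

With μ estimated this way, the expected counts sum to the observed total, so a block group such as b3 (1.5 events against 0.5 expected) stands out only relative to its baseline count.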
Under the alternative hypothesis H1(C) for a cluster C, the event counts for nodes outside C, i.e., in B − C, are (again) generated with rate proportional to the baseline counts, but for the nodes within C, the counts are generated at a higher rate than expected. The scan statistic F(C) is defined as a generalized log-likelihood ratio, and the objective is to find clusters that maximize this statistic. In the classical spatial scan approach, which is implemented in the popular SaTScan software [28], the maximization is done over circular and elliptical regions.
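The text does not spell out F(C). A minimal sketch, assuming the standard form of the Kulldorff Poisson log-likelihood ratio, is given below: c is the event count inside C, e is the expected count inside C under H0, and T is the total event count (equal to the total expected count). The function name is hypothetical.

```python
import math


def poisson_llr(c: float, e: float, total: float) -> float:
    """Kulldorff Poisson log-likelihood ratio for a candidate cluster.

    c:     observed event count inside the cluster
    e:     expected event count inside the cluster under the null hypothesis
    total: total event count over all nodes
    Returns 0 unless the cluster has more events than expected.
    """
    if not 0.0 < e < total:
        raise ValueError("expected count must lie strictly between 0 and total")
    if not 0.0 <= c <= total:
        raise ValueError("observed count must lie in [0, total]")
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    # Convention: the outside term vanishes when every event lies in the cluster.
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return inside + outside
```

The score is zero for clusters with no excess of events and grows as the observed count inside the cluster moves further above its expectation.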
Optimization over clusters of arbitrary shape is computationally much more challenging. We use the approach of [29], which efficiently searches over all connected clusters of a certain size and provably finds one with the maximum log-likelihood score. The restriction on the cluster size serves as a regularization constraint, while also making the problem computationally more tractable. We note that there are many other potential approaches for finding such clusters, e.g., greedily picking significant block groups and subsequently connecting them. However, it is shown in [29] that such approaches do not perform well in general. Monte Carlo sampling is used to determine the p-value for each cluster, accounting for multiple hypothesis testing, and we consider the top few significant clusters. We compare our results with the clusters discovered using SaTScan.
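The Monte Carlo step can be illustrated on a toy graph. This sketch is not the algorithm of [29]: it enumerates connected clusters by brute force, which is feasible only for a handful of nodes. It also assumes the standard Kulldorff Poisson log-likelihood ratio and generates null replicates by redistributing the total event count across nodes in proportion to their baseline counts. All names and data are hypothetical.

```python
import itertools
import math
import random


def llr(c, e, total):
    """Kulldorff Poisson log-likelihood ratio; 0 unless events exceed expectation."""
    if e <= 0 or e >= total or c <= e:
        return 0.0
    outside = (total - c) * math.log((total - c) / (total - e)) if c < total else 0.0
    return c * math.log(c / e) + outside


def is_connected(adj, nodes):
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for nb in adj[stack.pop()]:
            if nb in nodes and nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == nodes


def max_llr(adj, pop, events, max_size):
    """Best score over all connected clusters with at most max_size nodes."""
    total, total_pop = sum(events.values()), sum(pop.values())
    best = 0.0
    for k in range(1, max_size + 1):
        for cluster in itertools.combinations(sorted(adj), k):
            if is_connected(adj, cluster):
                c = sum(events[b] for b in cluster)
                e = total * sum(pop[b] for b in cluster) / total_pop
                best = max(best, llr(c, e, total))
    return best


def monte_carlo_p_value(adj, pop, events, max_size, n_rep=199, seed=0):
    rng = random.Random(seed)
    nodes = sorted(adj)
    weights = [pop[b] for b in nodes]
    total = int(round(sum(events.values())))
    observed = max_llr(adj, pop, events, max_size)
    exceed = 0
    for _ in range(n_rep):
        sim = dict.fromkeys(nodes, 0)
        for b in rng.choices(nodes, weights=weights, k=total):
            sim[b] += 1
        # Comparing against the maximum over all clusters in each replicate
        # is what accounts for the multiple hypotheses being tested.
        if max_llr(adj, pop, sim, max_size) >= observed:
            exceed += 1
    return (exceed + 1) / (n_rep + 1)


# Toy path graph b1 - b2 - ... - b6 with an excess of events in b5 and b6.
names = [f"b{i}" for i in range(1, 7)]
adj = {b: set() for b in names}
for u, v in zip(names, names[1:]):
    adj[u].add(v)
    adj[v].add(u)
pop = dict.fromkeys(names, 100)
events = {"b1": 2, "b2": 3, "b3": 2, "b4": 3, "b5": 20, "b6": 18}
p_value = monte_carlo_p_value(adj, pop, events, max_size=3)
```

Because each replicate is scored by its best cluster, the resulting p-value refers to the most anomalous cluster found anywhere in the graph rather than to a single pre-specified region.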
Many extensions of scan statistics allowing arbitrary shapes have been proposed, e.g., [30, 31]. We also note that there is some risk of finding spurious or very “patchy” clusters when constraints on the allowed shapes are relaxed. In particular, an “octopus” effect has been reported [30], where clusters with high event counts are connected by narrow paths on a network. We explore this possibility through simulations in the Appendix. We find that if the target cluster is very different from the background population, the maximum log-likelihood cluster has a high overlap with the target. On the other hand, if the true cluster is not very significant, the maximum log-likelihood solutions might differ quite a bit from it. This is related to, but not as extreme as, the octopus effect. We hypothesize that the constraint on the solution size and the optimality guarantee in the method of [29] prevent reaching the degenerate cases reported in [30]. For a more detailed discussion of the advantages and limitations of scan statistics, we refer to [5, 27, 31].
Characterization of under-immunized block groups
To characterize the block groups that are part of the anomalous clusters of under-immunized children, we perform separate logistic regression analyses for MN and WA using all the block groups in each state. The response variable in the regression is binary: it takes value 1 if the block group is part of one of the under-immunized clusters and 0 otherwise. The independent variables we considered are the average age in the block group, number of workers per capita, average household income in the block group, average household size, number of children in the age group 0–5 years, and the total income of the block group. These variables were selected because data for them are available as part of the census data.
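In symbols, the model described above takes the standard logistic form shown here. The notation is introduced only for illustration: y_b is the cluster-membership indicator of block group b, x_{b,1}, …, x_{b,6} are the six independent variables listed above, and β_0, …, β_6 are the coefficients to be estimated.

$$ \Pr\left(y_b = 1 \mid x_b\right) = \frac{1}{1 + \exp\left(-\left(\beta_0 + \sum_{j=1}^{6} \beta_j x_{b,j}\right)\right)} $$

Under this form, a positive β_j means that larger values of the j-th variable are associated with higher odds that a block group belongs to an under-immunized cluster.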
The raw feature list also included other demographics, such as the number of school-aged children (6–18 years), the number of adults between 19 and 65 years, the number of people older than 65 years, and the total population size of the block group. However, these variables were correlated with the number of children aged 0–5 years, and hence were removed to avoid multicollinearity.
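A simple way to screen for such correlations is sketched below using the Pearson correlation coefficient. The threshold of 0.8, the feature names, and the toy values are assumptions for illustration; the text does not state which correlation measure or cutoff was used.

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("samples must have the same length, at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def drop_correlated(features: Dict[str, List[float]], reference: str,
                    threshold: float = 0.8) -> List[str]:
    """Keep the reference feature and every feature whose |r| with it is below threshold."""
    ref = features[reference]
    return [name for name, values in features.items()
            if name == reference or abs(pearson(values, ref)) < threshold]


# Hypothetical values for five block groups.
toy_features = {
    "children_0_5": [10, 20, 30, 40, 50],
    "total_population": [105, 198, 310, 395, 502],     # tracks children_0_5 closely
    "avg_household_size": [2.5, 3.1, 2.2, 3.0, 2.4],   # only weakly related
}
kept_features = drop_correlated(toy_features, reference="children_0_5")
```

In this toy example the total population is dropped because it is almost perfectly correlated with the number of children aged 0–5, while the average household size is retained.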
The regression analysis identifies the list of statistically significant demographics that contribute to the probability of a block group being a part of the under-immunized cluster. If we can identify these features in a robust manner, public health officials can utilize this information to design actionable and targeted policies.