Data analysis
The spatial scan statistic was developed to test for geographical clusters and to identify their approximate location [20]. The number of events, i.e., incident cases, may be assumed to have either a Poisson or Bernoulli distribution. Depending on the availability of data, the spatial scan statistic can be used for either aggregated data, such as census areas, or with precise geographical coordinates, where each 'census area' contains only one person at risk.
The spatial scan statistic imposes a circular window on the map and lets the center of the circle move over the area, so that the window includes different sets of neighboring census areas at different positions. If the window contains the centroid of a census area, that entire area is included in the window. For each circle center, the radius of the window varies from 0 to a maximum radius, chosen so that the window never includes more than 50% of the total population at risk. In this way, the circular window is flexible in both location and size. The method creates a very large number of distinct circular windows, each containing a different set of neighboring areas, and each possibly containing a cluster of events [21].
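The window construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the SaTScan implementation: the function name `circular_windows`, the `(x, y, population)` input layout, and the `max_pop_fraction` parameter are assumptions made for the example, and ties in distance are not treated specially.

```python
import math

def circular_windows(areas, max_pop_fraction=0.5):
    """Yield (center_index, radius, member_indices) for each candidate window.

    areas: list of (x, y, population) tuples, one per census-area centroid
    (hypothetical input layout chosen for this sketch).
    """
    total_pop = sum(pop for _, _, pop in areas)
    for i, (cx, cy, _) in enumerate(areas):
        # Order all centroids by distance from the current window center.
        by_distance = sorted(
            range(len(areas)),
            key=lambda j: math.hypot(areas[j][0] - cx, areas[j][1] - cy),
        )
        members, pop_inside = [], 0
        for j in by_distance:
            pop_inside += areas[j][2]
            if pop_inside > max_pop_fraction * total_pop:
                break  # a larger radius would exceed the population cap
            members.append(j)
            # The radius just reaches the centroid of the newest member area.
            radius = math.hypot(areas[j][0] - cx, areas[j][1] - cy)
            yield i, radius, tuple(members)
```

Each centroid serves as a window center, and the radius grows one centroid at a time, so an area enters the window exactly when its centroid does. Iterating over the generator gives the full set of candidate zones to be scored.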
Using the total number of cases in the region, the scan statistic tests the null hypothesis of complete spatial randomness against the alternative hypothesis that the risk of being a case is greater inside the zone than outside it. Since the scan statistic is likelihood-based, the most likely zones can be selected and tested for statistical significance. Because we are evaluating a huge number of outbreak locations, sizes, and durations, we need to adjust for multiple testing. Since we do not have population-at-risk data, this cannot be achieved via standard methods of scan statistics. Instead, we must create a large number of random permutations of the spatial and temporal attributes of each case in the data set. That is, we shuffle the times and assign them to the original set of case locations, ensuring that both the spatial and temporal marginals remain unchanged. Then, the most likely cluster is calculated for each simulated data set in exactly the same way as for the real data. To estimate the distribution of the test statistic, the spatial scan statistic uses the Monte Carlo method: it assigns cases to individuals in the population at random and recalculates the test statistic. Only a small number of all possible zones (potential clusters) are tested, to minimize multiple testing and false positives. For each of the likely clusters, the output includes a listing of geographic subdivisions, numbers of observed and expected cases, population, relative risk (RR), and p-value. The scan test imposes a circular zone and calculates a rate within that zone. Areas of below-average risk within that zone will be declared cluster constituents, provided their low rate is offset by high-rate areas within the circular zone. Thus, the scan statistic is prone to false positives.
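The text does not state the likelihood formula, so the following is only a sketch under assumptions: it uses the commonly cited Poisson-model log-likelihood ratio for a high-rate zone, a relative risk defined as the ratio of observed-to-expected cases inside the zone to the same ratio outside it, and a simple time shuffle that keeps the case locations fixed. The function names `poisson_llr`, `relative_risk`, and `shuffle_times` are hypothetical.

```python
import math
import random

def poisson_llr(c, e, total_cases):
    """Log-likelihood ratio for a zone with c observed and e expected cases.

    Assumes 0 < e < total_cases; zones without an excess of cases score 0.
    """
    if c <= e:
        return 0.0
    rest = total_cases - c
    inside = c * math.log(c / e)
    outside = rest * math.log(rest / (total_cases - e)) if rest > 0 else 0.0
    return inside + outside

def relative_risk(c, e, total_cases):
    """Observed/expected inside the zone divided by observed/expected outside.

    Assumes 0 < e < total_cases and c < total_cases.
    """
    return (c / e) / ((total_cases - c) / (total_cases - e))

def shuffle_times(cases, rng):
    """Permute case times across the original case locations.

    cases: list of (location, time) pairs. Both the set of locations and
    the set of times (the marginals) are left unchanged.
    """
    times = [t for _, t in cases]
    rng.shuffle(times)
    return [(loc, t) for (loc, _), t in zip(cases, times)]
```

For example, a zone with 20 observed and 10 expected cases out of 100 in total gives `relative_risk(20, 10, 100)` of 2.25. The zone with the largest log-likelihood ratio is the most likely cluster, and applying the same maximization to each shuffled data set yields the reference distribution for the test.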
If a purely spatial analysis is performed for an extensive time period, it is more difficult to detect recently emerging clusters. To resolve this problem, a purely spatial analysis based on data from only the last few years can be performed. One problem, however, is determining the appropriate number of years to include. If too few years are analyzed, low-to-moderate excess risk that is present for a considerable length of time could be missed. If too many years are analyzed, the analysis might have insufficient power to detect a very recent high-excess-risk cluster. One solution might be the use of a space-time scan statistic.
Instead of a circular window in two dimensions, the space-time scan statistic uses a cylindrical window in three dimensions. The base of the cylinder represents space (as in the purely spatial scan statistic), and the height represents time. The cylinder is flexible in both its circular geographical base and its starting date, which are independent of one another. As in the purely spatial scan statistic, the likelihood ratio test statistic is constructed using a computational algorithm that calculates the likelihood for each window, now in three dimensions rather than in two.
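As a rough illustration of the cylindrical window, the sketch below pairs every spatial zone with every candidate time interval. It assumes time is divided into discrete periods indexed from 0 and that both the start and the end of the interval may vary; the names `time_intervals` and `cylinders` and the optional `max_length` cap are hypothetical and not taken from the text.

```python
from itertools import product

def time_intervals(n_periods, max_length=None):
    """Yield inclusive (start, end) period indices for each cylinder height."""
    max_length = max_length or n_periods
    for start in range(n_periods):
        for end in range(start, min(start + max_length, n_periods)):
            yield start, end

def cylinders(spatial_zones, n_periods, max_length=None):
    """Yield (zone, start, end): a circular base combined with a time span."""
    intervals = time_intervals(n_periods, max_length)
    for zone, (start, end) in product(spatial_zones, intervals):
        yield zone, start, end
```

Because the base and the time span are chosen independently, the number of cylinders is the number of spatial zones multiplied by the number of intervals, which is why the space-time scan evaluates far more windows than the purely spatial one.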
The spatial scan statistic (SaTScan) was used to identify clusters of census tracts in Fukuoka Prefecture. SaTScan identifies a cluster at any location of any size up to a maximum size, and minimizes the problem of multiple statistical tests. Scanning was also set to search only for areas with high proportions of TB. The default setting of no geographic overlap was used, so secondary clusters would not overlap the most significant cluster. In order to scan for small to large clusters, the maximum cluster size was set to 50% of the total population at risk. To ensure sufficient statistical power, the number of Monte Carlo replications was set to 999, and clusters with statistical significance of p < 0.05 were reported.
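The role of the 999 replications can be made concrete with the usual rank-based Monte Carlo p-value, in which the observed statistic is ranked among the simulated ones. This is a minimal sketch; the function name `monte_carlo_p_value` is hypothetical.

```python
def monte_carlo_p_value(observed_stat, simulated_stats):
    """Rank-based p-value of the observed statistic among R simulated ones."""
    n_as_extreme = sum(1 for s in simulated_stats if s >= observed_stat)
    return (n_as_extreme + 1) / (len(simulated_stats) + 1)
```

With 999 replications the denominator is 1000, so the smallest attainable p-value is 0.001, and a cluster meets the p < 0.05 criterion only if fewer than 49 simulated statistics equal or exceed the observed one.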
Since the spatial scan statistic (SaTScan) has been widely applied to various diseases, we applied the method, utilizing a circle as the base for the scanning cylinder, to the data for Fukuoka Prefecture. A circular window has less power to detect clusters that are less geographically compact. All the clusters detected were approximately circular. In reality, however, infectious diseases might frequently form clusters of other shapes, and more complicated shapes, such as ovals, are more realistic. We were aware of this methodological limitation and sought to exercise caution when interpreting the results.