Background
Zika virus (ZIKV) is an emerging mosquito-borne arbovirus with serious public health consequences. It is transmitted to people primarily through the bite of an infected Aedes species mosquito (Ae. aegypti and Ae. albopictus) [1]. Although most infections are asymptomatic or cause only mild symptoms, the association of ZIKV with microcephaly and Guillain-Barré syndrome made it a global medical emergency during the 2016 epidemic [2–6]. There is currently no medicine or vaccine to cure or prevent ZIKV infection. Infection containment, vector control, and personal protection are therefore the most important measures for preventing infections and containing viral spread [7].
According to the U.S. Centers for Disease Control and Prevention (CDC), ZIKV was reported in over 60 countries and territories worldwide during the 2015–2016 epidemic, with South America the most severely affected continent [8]. In the United States, locally acquired ZIKV cases have been reported in Florida and Texas, as well as in the U.S. territories of Puerto Rico, the U.S. Virgin Islands, and American Samoa [8, 9]. Travel-associated ZIKV infections have been reported in all 50 states [9]. ZIKV can also be sexually transmitted, which raises concern about potential local outbreaks [9]. By November 2016, there were 4115 travel-associated ZIKV cases in the U.S., along with 139 locally acquired mosquito-borne cases and 35 sexually transmitted cases; cases in U.S. territories amounted to 39,951 [9]. Although there were no reports of microcephaly cases, 13 cases of Guillain-Barré syndrome were reported in the continental U.S. and 50 in U.S. territories [9].
Regarding ZIKV surveillance and vector control, challenges exist that have significantly limited the effectiveness of current methods. In the United States, disease surveillance is supported by the CDC Division of Health Informatics and Surveillance and is carried out through a variety of networks that involve the collaboration of thousands of agencies at the federal, state, territorial, and tribal levels across health departments, laboratories, and hospitals [
10‐
12]. Importantly, while ZIKV cases reported by official sources such as the CDC are of high quality, such reporting is not timely because these agencies collect and verify data internally before formal publication. In addition, the cases reported by any single source do not always reflect all the cases that truly exist. More importantly, many countries and regions with poor infrastructure and healthcare systems have no established system for such case reporting. To collect as much ZIKV case information as possible with minimal delay, other services are available that can publish such information in a more timely manner.
Alternative data sources such as social media and other digital services provide an opportunity to overcome existing surveillance obstacles by providing relevant information that is temporally and geographically tagged. To date, the digital data streams used to help track diseases over time and space have included internet search engines [13–18], electronic health records [19], news reports [20, 21], Twitter posts [22–26], satellite imagery [27], clinicians’ search engines [28], and crowd-sourced participatory disease surveillance systems [29–31].
In terms of social media data streams, Twitter is a free social networking service that enables millions of users to send and read one another’s brief messages, or “tweets,” each day. Tweets can be posted either publicly or internally within groups of “followers.” Currently, this service includes approximately 326 million registered users, with 67 million in the United States [
32]. In spite of a fair amount of noise due to general chatter and the sheer number of tweets, Twitter contains useful information that can be utilized for disease surveillance and forecasting.
Previously, Twitter has been utilized to measure public anxiety related to stock market prices, national sentiment, and the impacts of earthquakes [
33‐
35]. More recently, Twitter was used in epidemic tracking and forecasting for the H1N1 pandemic, general influenza, and the recent Ebola outbreak [
25,
36‐
39]. In terms of ZIKV, studies have made use of Twitter data and developed predictive models for a variety of applications. Mandal et al. (2018) developed Twitter-based models to track zika prevention techniques and help inform health care officials [
40]. Other studies have performed content analysis of Twitter data to explore and predict what types of zika-related discussions people were having during the recent ZIKV epidemic [
41‐
44].
Since ZIKV outbreaks are influenced by many environmental and social factors, such as local mosquito species and density distributions, season, climate, land use, land cover, human demographics, and mitigation efforts, successful surveillance and forecasting of the disease can be difficult [
45‐
51]. Use of live streaming ZIKV-related information via nationwide tweets could represent a practical, timely, and effective surveillance tool, in turn improving ZIKV case detection and outbreak forecasting [
14,
52]. To date, however, studies making use of Twitter data to monitor the spread of ZIKV in real time and space have been limited.
In one study, Teng et al. (2017) developed models to forecast cumulative ZIKV cases [
13]. However, these models predicted ZIKV cases cumulatively and on a global basis. Further, the study did not make use of the Twitter data stream, but rather Google Trends. Similarly, Majumder et al. (2016) developed models to forecast ZIKV case counts in Colombia during the recent ZIKV epidemic [
14]. Again, however, the analysis utilized Google Trends data and considered cumulative, rather than weekly, case counts. In our research, we identified only a single study that attempted to forecast ZIKV on a weekly basis using the Twitter data stream. In this study, McGough et al. (2017) demonstrated the utility of Twitter-based models for forecasting ZIKV in countries of South America [
26]. However, given the lack of robust diagnostic capabilities in the region, the study was limited to using “suspected,” rather than “confirmed,” ZIKV cases. Additionally, the study did not examine spatial patterns of ZIKV cases and tweets, nor did it compare local- versus national-level modeling. To the best of our knowledge, there has been only a single study to date to harness digital data streams for near-real time weekly forecasting of ZIKV cases, and no such study to date that has utilized Twitter data for ZIKV forecasting in the United States and offered a comparison of national- and state-level models [
26].
In this study, we demonstrate the value of utilizing time- and geo-tagged information embedded in the Twitter data stream to 1) examine the relationship between weekly ZIKV cases and ZIKV-related tweets temporally and spatially, 2) assess whether Twitter data can be used to predict weekly ZIKV cases and, if so, 3) develop weekly ZIKV predictive models that can be used for early warning purposes at the state and national levels. This study thereby contributes to the literature on using digital data streams for disease surveillance and forecasting.
Discussion
Weekly ZIKV case reports and zika tweets in the U.S. and in Florida exhibited very similar temporal patterns, peaking during summer and declining in fall. A multivariate auto-regression analysis using Florida and U.S. data demonstrated zika tweets to be an important predictor of weekly ZIKV case counts during the 2016 study period. By combining tweet counts with previous ZIKV case counts, we calibrated two models, one for Florida and one for the U.S., that were able to estimate weekly ZIKV cases 1 week in advance with reasonable accuracy. Both models performed best when both prior ZIKV case count data and Twitter data were included. Following calibration of the models and subsequent internal cross-validation, a comparison of predicted versus observed weekly ZIKV case counts demonstrated reasonable performance for the Florida model and reduced, but still moderate, performance for the national model. A time-series plot of predicted and observed case counts similarly showed the Florida model to predict reasonably well and the national model to predict moderately well. While a comparison of observed and model-predicted ZIKV case counts produced R² values ≥ 0.70 for the Florida model, we must be careful not to overstate model performance, given that disease forecasting models can sometimes yield R² > 0.9. Nonetheless, results for both models in this study suggest that Twitter data can be used to help track ZIKV prevalence during outbreak periods. Because Twitter data is available immediately, whereas case counts reported by the CDC are often delayed, Twitter represents a particularly useful tool for epidemiologists and public health officials involved in disease surveillance.
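To make the structure of such a model concrete, the following is a minimal sketch of a one-week-ahead regression on the previous week's case count and tweet count. It is an illustration under stated assumptions, not the calibrated models reported here: the linear form, the single 1-week lag for both predictors, the ordinary least squares fit, and the function names (fit_lag1_model, predict_next_week, r_squared) are all assumptions, and the example numbers are made up.

```python
import numpy as np


def fit_lag1_model(cases, tweets):
    """Fit cases[t] = b0 + b1*cases[t-1] + b2*tweets[t-1] by ordinary least squares.

    Assumed model form for illustration; returns the array [b0, b1, b2].
    """
    cases = np.asarray(cases, dtype=float)
    tweets = np.asarray(tweets, dtype=float)
    # Design matrix: intercept, last week's cases, last week's tweets.
    X = np.column_stack([np.ones(len(cases) - 1), cases[:-1], tweets[:-1]])
    y = cases[1:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict_next_week(coef, last_cases, last_tweets):
    """One-week-ahead estimate from the most recent case and tweet counts."""
    return float(coef[0] + coef[1] * last_cases + coef[2] * last_tweets)


def r_squared(observed, predicted):
    """Coefficient of determination between observed and predicted counts."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical weekly series, not data from this study.
    weekly_cases = [3, 5, 9, 20, 34, 41, 38, 30, 22, 15]
    weekly_tweets = [120, 300, 800, 1500, 1700, 1400, 1100, 900, 600, 400]
    coef = fit_lag1_model(weekly_cases, weekly_tweets)
    print(predict_next_week(coef, weekly_cases[-1], weekly_tweets[-1]))
```

Dropping the tweet column from the design matrix gives a cases-only baseline, which is one way to check whether the Twitter term improves the fit.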
During model development, 1 week lagged zika tweets were best correlated with weekly ZIKV cases. This is visually apparent during the major outbreak period in Florida, where a sharp rise in zika tweets appeared to precede ZIKV cases by 1 week. A possible explanation is that an inherent temporal difference exists between Twitter chatter and ZIKV diagnosis. For instance, it is plausible that discussion of ZIKV (potentially due to the presence of symptomatic or hospitalized family members or friends) predates actual diagnosis. In this case, a rise in zika tweets would predict a rise in ZIKV cases. Whether or not this temporal difference in zika tweets is truly reflecting chatter related to the impending rise in ZIKV cases, however, cannot be confirmed here.
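The lag comparison described above can be sketched as a simple scan over candidate lags. This is an illustrative assumption about how such a comparison might be done (Pearson correlation, non-negative integer lags in weeks); the function names lagged_correlation and best_lag are hypothetical and do not come from the study.

```python
import numpy as np


def lagged_correlation(cases, tweets, lag):
    """Pearson correlation between cases[t] and tweets[t - lag], with lag >= 0 weeks."""
    cases = np.asarray(cases, dtype=float)
    tweets = np.asarray(tweets, dtype=float)
    if len(cases) != len(tweets):
        raise ValueError("series must have the same length")
    if lag < 0 or lag >= len(cases) - 1:
        raise ValueError("lag must leave at least two overlapping weeks")
    if lag == 0:
        return float(np.corrcoef(cases, tweets)[0, 1])
    # Align each week's cases with the tweets posted `lag` weeks earlier.
    return float(np.corrcoef(cases[lag:], tweets[:-lag])[0, 1])


def best_lag(cases, tweets, max_lag=4):
    """Return the lag with the highest correlation, plus all lag scores."""
    scores = {lag: lagged_correlation(cases, tweets, lag) for lag in range(max_lag + 1)}
    return max(scores, key=scores.get), scores
```

A high correlation at a given lag indicates only that tweets lead cases in time; as noted above, it does not establish why.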
It is worth noting that reports of the first locally acquired ZIKV in Florida corresponded with the sharp rise in zika tweets occurring in August. Therefore, an alternative explanation is that the initial sharp rise in zika tweets occurring in summer could reflect chatter related to the first few cases of locally acquired ZIKV, rather than the impending increase in total ZIKV cases that occurred the following week. This explanation, however, fails to explain the overall higher correlation between ZIKV case counts and 1 week lagged zika tweets over the entire study period.
The primary strength of these models is the use of readily available, real-time Twitter data to estimate ZIKV cases. Additionally, because a good estimate can be generated from 1-week-old ZIKV case reports, tracking ZIKV and predicting outbreak trends depends less on the timely publication of case reports by government agencies. Where states report case counts on a daily basis during an outbreak (e.g. Florida), estimates of the following week’s ZIKV case counts can similarly be updated on a real-time (daily) basis. This enables better epidemic preparedness by local and state public health agencies in charge of disease response.
A primary limitation of these models is the need for historical ZIKV case count information. This requires the government to continue monitoring and publishing case reports. Though such surveillance takes place in the U.S. and other industrialized nations, it does not take place in many developing countries. Furthermore, government data may not always be released in time to enable ZIKV case predictions. In such regions where quantitative case count data is not accurately and/or consistently reported, or potentially delayed, univariate analyses using only prior zika tweets demonstrated that Twitter data may still be useful for disease surveillance. This assumes that Twitter is used among the local population and that sufficient knowledge of the disease and disease activity exists among the population.
Also noteworthy, since these are statistical models that depend on previous case reports, they cannot be used to predict a ZIKV outbreak where no prior case reports exist. Additionally, given their dependence on historical trends, these models are limited in their ability to predict historically anomalous events that could give rise to dramatic changes in disease prevalence. To this end, mechanistic models that take into account meteorology, vector distribution, population distribution and movement would provide more insight. Also, a diagnosis issue related to the cross-reactivity of diagnostic assays with other arboviruses presents a unique challenge for ZIKV surveillance. This challenge exists with traditional surveillance methods and is still an issue using our modeling approach.
Residual plots for our models exhibited a departure from normality, with the models tending to over-predict at low case counts and under-predict at high case counts. This compromises the models’ ability to predict the full range of case counts. Although such poor prediction at extreme values is quite common in statistical predictive models trained on a limited number of observations, it nonetheless represents a limitation of this type of statistical modeling that needs to be acknowledged.
Importantly, this work presents predictive models designed with the goal of using covariates to forecast an outcome variable; namely, ZIKV cases. This is distinct from explanatory modeling, which seeks to understand the causal relationships between covariates and outcome variables [
63]. In this study, we do not pursue such causal inference. Therefore, while zika tweets prove useful in predicting ZIKV cases, we make no claims about a causal relationship between tweets and ZIKV cases.
Understanding why zika tweets correlate well with ZIKV case counts and therefore offer utility as a surveillance tool is an interesting question. It is possible that zika tweets are capturing tweets related to first-hand illness, or that such tweets are merely capturing ZIKV awareness, or a combination of both. While this is an area of active research, the lack of a complete understanding of this relationship does not prevent zika tweets from serving as a useful predictor variable in the development of ZIKV forecasting models.
In discussing this study, it is important to avoid ‘big data hubris’ [
64]. That is, while our models demonstrate the ability of Twitter data to serve as an indicator of disease activity, such data should be viewed not as a substitute for traditional data collection and analysis but as a supplement to them. In future work, it would be useful to combine Twitter data with traditionally collected data on vector population density, vaccine injection, transmissibility, and basic reproductive number in modeling efforts.
A prominent, temporary spike in tweets that did not coincide with ZIKV cases occurred in early February. This was visible in both the total U.S. data and the Florida data. The spike came months ahead of actual major ZIKV activity in the U.S. and can be explained by several important media-related occurrences. This period included the first ZIKV cases reported in the United States by the CDC (week of Jan. 30). Also relevant were the WHO’s declaration of a ZIKV public health emergency of international concern (Feb. 1) and President Obama’s announcement of a request for $1.8 billion in ZIKV-related emergency funds the following week (Feb. 8). This was a very high-profile period for ZIKV in terms of media attention. The inflation of tweet counts by these events was reflected in actual Twitter content. A qualitative content analysis of trending ZIKV-related topics during this time period supported the existence of particular concern among the population over the arrival of ZIKV in the U.S., showing an overwhelming prevalence of such tweets as “Zika Health Emergency,” “Zika Virus is in the US!,” and “Great, Zika cases here.” Additionally, tweets that included “#CDC” were 2–4 times more frequent during the period when these events took place than during any week over the following 3-month period. Since these instances of media-related tweet inflation were infrequent, they did not appear to impact our predictive models. In using Twitter data for disease surveillance in the future, it is nonetheless important for researchers to be mindful of the influence such major media headlines can have on tweet counts, so as not to infer disease activity from media-driven spikes.
Two other points of deviation between tweet counts and ZIKV cases occurred in November and December. In both, ZIKV cases increased sharply without corresponding increases in zika tweets. A possible explanation is the WHO’s announcement on November 18th that the ZIKV public health emergency had ended [65]. This announcement potentially eased public concern about ZIKV, which may in turn have depressed zika tweets in the following weeks.
When comparing national and Florida ZIKV cases and tweets, time-series analyses showed that national tweets increased more dramatically during the major outbreak period and responded less to weekly fluctuations in case counts (Figs. 1 and 2). Although this could suggest a potential for over-prediction of ZIKV cases by a national model, application of the U.S. model showed that this was not an issue. However, false prediction and mistiming of high ZIKV activity periods were apparent issues in the national model. That U.S. tweet counts responded less sensitively to ZIKV case counts, and that the U.S. model did not perform as well as the Florida model, makes sense given the much larger spatial coverage of the entire U.S. relative to ZIKV hotspot regions (e.g. FL, CA, NY, and TX).
The keyword mosquito was also examined for its potential to serve as an early signal of locally acquired ZIKV, given that a rise in mosquitoes (the primary ZIKV vector) would be expected to lead to a rise in ZIKV. This keyword, however, provided no added benefit over the keyword zika. Rather, zika and mosquito tweets were tightly correlated throughout the entire year.
In general, the increase of ZIKV in summer and the subsequent decrease in fall can be explained by higher temperatures and humidity during the summer months, which provide ideal conditions for mosquito breeding, as well as by increased travel. Additionally, pesticide spraying campaigns during the height of the outbreak, particularly in late summer, may have helped to control mosquito populations and prevent the spread of ZIKV. For instance, aerial spraying of the organophosphate pesticide Naled was conducted in Miami-Dade County, Florida, multiple times in September in order to combat ZIKV [
66]. In addition to dropping temperatures in the fall, this was another likely contributor to the sharp decrease in ZIKV cases and related tweets during this season.
In terms of spatial distribution across the U.S., ZIKV case reports were highly correlated with population-adjusted zika tweets. States with the most ZIKV cases also had the highest zika tweet prevalence while states with the fewest cases had the lowest tweet prevalence. This suggests that in addition to temporal accuracy, Twitter data may be a useful tool for predicting disease prevalence spatially. Additionally, this reinforces the potential utility of using Twitter data for ZIKV disease surveillance at the national level.
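As an illustration of the state-level comparison, the sketch below normalizes tweet counts by population and correlates the result with case counts. The per-100,000 scaling and the use of Pearson correlation are assumptions made for the example, and the function names and any inputs are hypothetical rather than taken from the study.

```python
import numpy as np


def tweets_per_100k(tweet_counts, populations):
    """Population-adjusted tweet prevalence: tweets per 100,000 residents."""
    tweet_counts = np.asarray(tweet_counts, dtype=float)
    populations = np.asarray(populations, dtype=float)
    return tweet_counts / populations * 100_000


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    return float(np.corrcoef(x, y)[0, 1])


def state_level_correlation(case_counts, tweet_counts, populations):
    """Correlate state case counts with population-adjusted tweet prevalence."""
    return pearson_r(case_counts, tweets_per_100k(tweet_counts, populations))
```

A rank-based measure could be substituted for pearson_r if a few high-count states dominate the result.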
More research is necessary to identify an appropriate national-level predictive model. Additionally, future modeling efforts should attempt to separate tweets indicating awareness from tweets indicating infection. This could be accomplished by conducting a detailed content analysis of zika tweets. For instance, researchers could assemble a list of keywords or phrases in order to filter out non-infection related zika tweets. Once validated, this approach would produce a new time-series dataset of zika tweet counts that could be used to calibrate a new predictive model of ZIKV case counts. This approach would enable us to understand the underlying relationships between tweets and case counts. Lastly, calibration of other state-wide models for comparison with our Florida model is a worthwhile area of future research in order to understand how the relationship between Twitter data and disease incidence might vary from state-to-state, and to better utilize such data for predictive purposes in other regions.
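The keyword-based filtering proposed above might look like the following minimal sketch. The phrase list is purely hypothetical and unvalidated, and the sketch keeps tweets that match infection-related phrases, which is the complement of filtering out non-infection tweets. The function names and the (date, text) input format are likewise assumptions.

```python
from collections import Counter
from datetime import date

# Hypothetical, unvalidated phrases; a real list would come from content analysis.
INFECTION_PHRASES = (
    "i have zika",
    "tested positive",
    "diagnosed with zika",
    "my zika symptoms",
)


def is_infection_related(text, phrases=INFECTION_PHRASES):
    """True if the tweet mentions zika and contains any infection-related phrase."""
    lowered = text.lower()
    return "zika" in lowered and any(phrase in lowered for phrase in phrases)


def weekly_counts(tweets, phrases=INFECTION_PHRASES):
    """Count infection-related tweets per ISO week.

    `tweets` is an iterable of (datetime.date, text) pairs; the result is a
    Counter keyed by (ISO year, ISO week number).
    """
    counts = Counter()
    for day, text in tweets:
        if is_infection_related(text, phrases):
            iso = day.isocalendar()
            counts[(iso[0], iso[1])] += 1
    return counts
```

The resulting weekly series could then replace raw zika tweet counts when calibrating a new predictive model.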