Background
The threat of pandemic influenza has led to extensive efforts to strengthen global influenza surveillance [1, 2], including the development of novel syndromic surveillance systems intended to identify potential outbreaks and track influenza in the population. Some focus on identifying influenza-like-illness (ILI) in clinical and other settings, while others search the Internet to identify disease outbreaks that might not have been recognized by the authorities [3, 4]. Because studies have found high correlations with traditional surveillance systems, and because of their timeliness and low cost, Internet-based surveillance systems have been widely recognized as important supplementary data sources for influenza surveillance [5].
Infectious disease surveillance is a process; the data available for analysis reflect not only disease status in the population (the signal) but also other non-random factors (the noise). Our research has shown that decisions made by patients, healthcare providers, and public health professionals about seeking and providing healthcare and about reporting cases to health authorities are all influenced by the information environment, which we define as the information the population is exposed to through media, the Internet, social networks, and so forth. Because the information environment changes constantly, surveillance data systems that depend on decisions by patients and health professionals are likely to be biased, possibly in different ways [6, 7]. Epidemiologists and public health practitioners typically recognize these potential biases qualitatively, and present their analysis of the available data with appropriate caveats. Public health practitioners and clinicians are also aware of the surge in medical resource utilization caused by the “worried well” in response to media coverage of disease outbreaks [8]. Awareness of these biases is often lost, however, at higher levels [9].
Some researchers assume that data on the information environment, such as Google searches and Twitter feeds, can be used as proxies that directly estimate disease transmission in the population as long as the “signal” can be separated from the “noise” (trends in the data reflecting public awareness rather than disease transmission per se) [3, 10, 11]. Others simply view information environment data as a direct proxy for disease transmission. Surveillance systems that are fast, inexpensive, decentralized, automated, and utilize the power of information technology seem to satisfy the need for a magic bullet in the digital era. Whether such systems work as expected, or are another example of “big data hubris” [12], remains an open question.
To address this issue in a rigorous way, the objectives of this study are to (1) develop a method to characterize the relationship between surveillance data and the information environment, (2) identify surveillance systems that more closely reflect actual disease trends rather than the information environment and are therefore useful for tracking, and (3) understand the implications of the fact that some surveillance systems are more correlated with the information environment. In particular, we developed a Bayesian hierarchical statistical model that allows us to examine the relationship between surveillance data and the information environment more formally than in our previous analyses [6, 7]. This methodological paper is intended to help public health surveillance specialists better understand and improve the performance of data systems.
Our analysis uses influenza surveillance data and information environment proxy data (e.g. Google search and HealthMap) from Hong Kong during the pre-pandemic (2007-2008) and pandemic (June – November 2009) periods. Rather than treating influenza-related web queries and news as direct indicators of disease transmission in the population, we view them as indicators of the information environment. We built a Bayesian hierarchical statistical model to estimate the correspondence between individual surveillance data and the information environment proxy data. Although not employed in this analysis, the model has the potential to incorporate epidemiological expertise through informed prior distributions. The findings have enabled us to understand how each surveillance system is related to the information environment and disease status, which should eventually help public health practitioners interpret influenza surveillance data for situational awareness purposes and prioritize resources among surveillance systems given specific decision-making needs.
Discussion
In their efforts to develop new methods for influenza surveillance, researchers have considered many different data sources, most of which already exist in electronic form. Some derive from traditional surveillance approaches while using influenza-like-illness (ILI) and other data that do not require laboratory diagnosis. Others, such as the Global Public Health Intelligence Network (GPHIN) and HealthMap, search the Internet and other media sources via automated algorithms to identify disease outbreaks that might not have been recognized by the authorities [3, 4]. Google Flu Trends uses influenza-related search queries to model flu activity [31], while other studies try to capture ILI through micro-blogging platforms such as Twitter [32]. New terms such as “Internet-based surveillance”, “digital disease detection”, and “infoveillance” have been introduced to describe such public health surveillance practices [20, 10].
Recent studies have found a high correlation between syndromic surveillance and traditional influenza surveillance data [33-38]. Internet-based surveillance systems such as HealthMap and Google Flu Trends have claimed success in capturing pandemic flu outbreaks [3, 4] and tracking flu activity [11] days to weeks ahead of standard Centers for Disease Control and Prevention (CDC) systems. With their advantages of timeliness and low cost, Internet-based surveillance systems have been widely recognized as important supplementary data sources and widely used as the baseline standard for evaluating new influenza surveillance systems [5, 39-41].
Internet-based surveillance data, however, contain both a “signal” reflecting actual disease trends and “noise” caused by changes in public awareness. Separating the “signal” from the “noise” accurately and effectively is one of the biggest challenges in analysing Internet-based surveillance data. Some researchers have developed natural language processing algorithms to classify this information automatically [34, 42], some use crowd-sourcing platforms to engage Internet users in tagging data manually [43], and some use both [44]. The curated data, which are thought to reflect actual disease status, are then compared to the traditional surveillance data and tested for correlations [31, 32, 34, 35]. This approach, however, does not prove the validity of the new surveillance method, since both data streams may reflect the same information environment and therefore be biased in the same way. For instance, Google Flu Trends, which had been performing well in tracking CDC surveillance data, dramatically over-estimated flu activity in the United States in the 2012-13 flu season [37]. The over-estimation might be due to the extensive media coverage of flu during the winter holiday season [45], but it raises the question of how well Internet-based surveillance systems reflect flu activity per se, rather than other factors such as public awareness.
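The concern that two data streams can agree simply because they share an information environment can be illustrated with a small simulation. The sketch below is not the model used in this study: the series names, weights, and number of weeks are hypothetical values chosen only for illustration. A simulated "traditional" series mixes disease activity with public awareness, while a simulated "Internet" series contains awareness and noise only.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_weeks=500, seed=1):
    """Simulate weekly series; all weights are illustrative assumptions."""
    rng = random.Random(seed)
    disease = [rng.gauss(0, 1) for _ in range(n_weeks)]    # true activity
    awareness = [rng.gauss(0, 1) for _ in range(n_weeks)]  # independent of disease
    # Traditional stream: partly disease, partly awareness, plus noise.
    traditional = [0.5 * d + a + 0.3 * rng.gauss(0, 1)
                   for d, a in zip(disease, awareness)]
    # Internet stream: awareness plus noise, with no disease component.
    internet = [a + 0.3 * rng.gauss(0, 1) for a in awareness]
    return disease, traditional, internet


disease, traditional, internet = simulate()
r_streams = pearson(traditional, internet)
r_truth = pearson(internet, disease)
print(f"traditional vs Internet: r = {r_streams:.2f}")
print(f"Internet vs true disease: r = {r_truth:.2f}")
```

Under these assumptions the two streams are strongly correlated even though the simulated Internet series carries no disease signal, which is why agreement between streams does not by itself validate either one.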
Our analysis uses a Bayesian hierarchical statistical model to estimate the correspondence between individual surveillance data and the information environment proxy data. The model structure is based on an understanding of disease surveillance as a process rather than a direct reflection of disease status per se. The statistical model does not directly describe disease transmission dynamics, and the goal is not to estimate parameters or the level of flu activity. Rather, as a characterization tool, this analysis reveals how surveillance systems “behave” differently under changing information environments. Similarly, the purpose of model fitting is not to identify the perfect model with the best fit for the data. Rather, the goal is to find a model that captures the relationship between the surveillance systems and the information environment that is consistent with epidemiological expertise and practitioners’ understanding of the actual disease process, and thus one that is likely to be applicable in the future.
Among all the influenza surveillance data that we studied, some surveillance systems corresponded to the information environment proxy data more consistently than others. The level of correspondence with the information environment is associated with certain characteristics of the surveillance data. General practitioner (%ILI-visit) and laboratory (%positive) data seem to reflect the actual disease trends proportionally and are less representative of the information environment. Surveillance systems that use influenza-specific codes for reporting tend to reflect biases of both healthcare seekers and providers. This pattern is what we would expect to see if the information environment were influencing the observable data.
Characterization of surveillance systems using the pandemic model
When looking at “completeness” only, three types of surveillance systems show a certain level of stability in the changing information environment. Surveillance data expressed as percentages, such as the percentage of specimens tested positive, the percentage of ILI visits at general practitioners, and the percentage of fever at residential homes for the elderly, tend to have non-significant CIs for “completeness.” Surveillance systems that use less specific diagnostic and reporting codes, such as “pneumonia and influenza hospitalization” and “fever at residential homes for the elderly”, are also less likely to be influenced by the search index and the media coverage. Surveillance systems monitoring the elderly tend to be less susceptible to the information environment than those monitoring children, as can be seen by comparing P&I-HA(0-15 yr) and P&I-HA(65+ yr) (Figures 4 and 5). For surveillance systems that meet more than one criterion, the correlation with the information environment is weaker than for those that meet only one.
Surveillance data represented as percentages seem to be less correlated with the information environment, perhaps because the numerator and denominator change in the same direction in response to the information environment. Reflecting general practitioners’ role as the gatekeepers of the healthcare system, general practitioner visits are predominantly influenced by only one layer of decision-making: patients seeking medical attention. Since patients usually cannot distinguish influenza from other viral respiratory infections themselves, flu and non-flu infections may be equally likely to be presented to general practitioners. This pattern might not hold during the early stage of a pandemic, when the spread of a novel influenza virus may not keep up with the spread of awareness, possibly leading to a negative correlation between the percentage of ILI-related general practitioner visits and the information environment. However, due to insufficient data volume, the model failed to run when the pandemic period was segmented into early (summer) and late (fall) stages.
The percentage of specimens tested positive, on the other hand, is often used as the “gold standard” for influenza surveillance. As a surveillance system with a specific case definition based on confirmed virological testing, and one expressed as a percentage, it is likely to provide the most reliable estimates of flu activity. However, an individual case has to go through at least two layers of decision-making: the patient’s decision to seek healthcare and the physician’s decision on sampling and diagnosis. Thus, it is possible that when physicians are “sensitized” by the media and official guidelines, they may actively look for cases that fit the clinical definition of influenza and sample them for laboratory testing. This effect is more obvious in the count data for flu-HA, but may also influence the percentage of specimens tested positive, as shown in Figures 2, 4 and 5.
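The suggestion that percentage data are more stable because the numerator and denominator move together can be made concrete with a toy calculation. The function name, the visit counts, and the assumption that awareness scales care-seeking for ILI and non-ILI illness by the same multiplier are all hypothetical simplifications introduced here for illustration; they are not estimates from the Hong Kong data.

```python
def observed_ili(true_ili, true_other, seeking_multiplier):
    """Return (ILI visit count, ILI visits as a percentage of all visits).

    Assumes awareness scales care-seeking for ILI and non-ILI illness
    by the same multiplier, which is a deliberate simplification.
    """
    ili_visits = true_ili * seeking_multiplier
    all_visits = (true_ili + true_other) * seeking_multiplier
    return ili_visits, 100.0 * ili_visits / all_visits


# Same underlying illness, two levels of public awareness.
baseline = observed_ili(true_ili=100, true_other=400, seeking_multiplier=1.0)
sensitized = observed_ili(true_ili=100, true_other=400, seeking_multiplier=2.0)
print(baseline)    # count and percentage with no awareness effect
print(sensitized)  # the count doubles while the percentage is unchanged
```

If the multiplier differed between ILI and non-ILI visits, the percentage would shift as well, so the stability shown here holds only under the equal-scaling assumption.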
Another pattern is the difference between influenza-specific and non-specific surveillance systems. Flu hospitalization data seem to be more correlated with the information environment than data on pneumonia and influenza together. As discussed above, once sensitized, physicians are more likely to take samples from patients and to use diagnostic codes that are specific for influenza, especially in the subpopulations considered to be more vulnerable to the pH1N1 virus during the flu pandemic. During the early stage of the pandemic, children and young adults were considered to be more susceptible to the novel influenza virus, which might contribute to the observation in our study that pediatric P&I-HA tends to be more correlated with the information environment than P&I-HA for the elderly.
We also observed a difference in the level of correspondence to the information environment between surveillance systems that monitor the elderly and those that monitor other age groups. One possible explanation is disparities in information literacy and access to computers and the Internet among age groups. Google searches are likely to be driven by subpopulations with specific demographic characteristics and socio-economic status, such as young and middle-aged people who have easy access to digital devices as an information portal, in contrast to the elderly who live in residential homes. Although children may have limited information literacy and access, their parents are likely to take immediate action in response to information related to children’s health.
As for the “excess” parameter, φ_{j,t}, ILI visits at general practitioners, percentage of specimens tested positive, and percentage of fever at the residential homes for the elderly show the least significant correlation with the information environment proxy data, including the number of total HealthMap alerts, unique alerts, healthcare facilities related alerts, lab(%RSV), and the search index for authorities and pandemic flu terms (Figure 6). The lack of significant correlation might be due to the percentage format of these data streams, since data in counts usually have significant CIs. The coefficients for the search terms for pandemic influenza are all positive for the surveillance systems represented as counts, which suggests a positive correlation between the biases in those surveillance systems and public awareness of pH1N1.
In general, the fewer layers of decision-making, the less correlated the surveillance system is with the information environment. The traditional “gold standard” surveillance systems, such as hospitalization and virologic surveillance, are subject to the biases introduced by healthcare professionals. The more specific and ad hoc the diagnostic and reporting codes are, the more likely the system is to be influenced by the information environment. Surveillance data in percentage format tend to capture actual disease trends in constant ratio and are less influenced by the information environment than data in counts.
Characterization of surveillance systems using the non-pandemic model
For the non-pandemic period we developed three indices to describe the relations between surveillance data and information environment proxy data: an actual disease status indicator (illness index) plus public awareness of both influenza and other viral respiratory diseases (public awareness index and non-flu index). Since we are most interested in the correspondence between the surveillance data and the public awareness index, the posterior distribution of the public awareness index coefficient is compared among different surveillance data streams, segmented by flu/non-flu season and year.
The public awareness index used in the non-pandemic (NP) model is a collection of search keywords and categories of HealthMap alerts that are most likely to be associated with public awareness of influenza outbreaks, such as search volume for influenza outbreaks and the number of alerts of school-based outbreaks. Surveillance systems are in general more correlated with public awareness during the flu season than during the non-flu season. When observing increasing flu activity in the community or in the news media, people may become sensitized and tend to seek medical attention when feeling sick. The exceptions are two surveillance systems that predominantly monitor the elderly: flu surveillance at residential homes for the elderly and P&I-HA (65+ yr) (Figure 7B). These two surveillance systems are relatively less correlated with public awareness in most cases, and show more stability from year to year (Figure 8B). During the non-flu season, the majority of surveillance systems seem not to be influenced by public awareness, except for flu-associated hospitalization, number of specimens tested positive, percentage of specimens tested positive, and P&I-HA(0-15 yr) (Figure 7B), which can also be observed in the year-to-year comparison graph (Figure 8B).
Beyond comparing the correlation between surveillance and information environment proxy data individually, we also made an exploratory effort to assess the similarity among different surveillance systems by using the characterization tool and the identified evidence of potential biases in clinical practice. When comparing the four hospitalization data streams side by side, we observed that flu-HA looks more similar to P&I-HA(0-15 yr) (Figures 9A and C), while P&I-HA is more similar to P&I-HA(65+ yr) (Figures 9B and D). The patterns are consistent in flu and non-flu seasons and in different years (Figures 8 and 9), and correspond to the findings from the pandemic model.
In the pandemic model, we observed that flu-HA is more correlated with the information environment than P&I-HA, while pediatric P&I-HA also seems to be more correlated with the information environment than P&I-HA for the elderly. Also, when we replaced the all-age incidence rate with the rate for ages 5-14 yr, the model fit the data better for flu-HA, general practitioner ILI-visits, notifiable infectious disease reporting and P&I-HA(0-15 yr) (Additional file 1: Table S9). During the pandemic flu outbreak, since children and young adults were considered to be at higher risk than the elderly, physicians may have tended to order more laboratory testing for pediatric patients [7]. During the non-pandemic period, the same clinical practice might still prevail. Since pediatric mortality is a reportable condition, clinicians are more likely to order laboratory tests and use a specific diagnostic code if the test results are positive. Elderly patients, however, who usually have non-specific clinical manifestations of respiratory diseases and lower viral loads [46], are less likely to be sampled, less likely to have a positive result if tested, and are usually given a less specific diagnostic code such as “pneumonia and influenza”.
It is worth noting that the data volumes for flu-HA and pediatric P&I-HA are both relatively low during the non-pandemic period, which might also contribute to the similarity between the two data streams. Also, Google search volume was much lower and HealthMap alerts were fewer in 2007 than in 2008, possibly due to the introduction of smartphones and the rapid growth of the Internet itself from 2007 onwards [47].
Before the 2009 pandemic flu outbreak, age-stratified flu-HA data were not collected in Hong Kong; it is therefore difficult to test our hypothesis about biases in clinical practice. The implications of this finding are that (1) flu hospitalization might not be representative of all age groups, and (2) it is important to collect age-stratified flu hospitalization data, not only for monitoring the susceptibility of subpopulations, but also for assessing potential biases in practice.
Re-evaluating the usage of information environment data
This study also has implications for the use of information environment data for disease surveillance. Advances in information technology have made a wide range of data available to public health researchers and practitioners, offering the promise of improving current surveillance systems, generating more sensitive and timely warnings of disease outbreaks, or providing more accurate estimates of disease transmission. For this potential to be realized, however, the characteristics of these new data sources must be understood before they are used in sophisticated statistical models. Olson and colleagues have suggested, for instance, that Google Flu Trends’ impressive retrospective correlation with ILI surveillance data may be a product of over-fitting by “fishing” through numerous search term combinations as part of data mining. Moreover, Google Flu Trends’ tendency to miss the beginning of an outbreak and its poor accuracy at the local level also limit its application in providing early warning and situational awareness [37]. The importance of understanding the nature of the data and the environment in which the data are generated may be overshadowed by researchers’ and practitioners’ enthusiasm for data availability (i.e. “big data”) and purely statistical patterns (often ignoring confounding variables or underlying processes).
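One way to read the point about confounding is that an information environment proxy can be treated as a covariate that absorbs awareness-driven variation, rather than as a direct estimate of disease. The sketch below illustrates that idea on simulated data. It uses a simple least-squares regression as a stand-in and is not the Bayesian hierarchical model of this study; the weights, seed, and variable names are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(y, x):
    """Residuals from a simple least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


rng = random.Random(7)
n_weeks = 500
disease = [rng.gauss(0, 1) for _ in range(n_weeks)]    # true activity
awareness = [rng.gauss(0, 1) for _ in range(n_weeks)]  # confounder
# Surveillance counts respond to both disease and awareness (assumed weights).
surveillance = [0.5 * d + a + 0.3 * rng.gauss(0, 1)
                for d, a in zip(disease, awareness)]
# Search-volume-like proxy that tracks awareness only (assumption).
proxy = [a + 0.3 * rng.gauss(0, 1) for a in awareness]

r_raw = pearson(surveillance, disease)
r_adjusted = pearson(residuals(surveillance, proxy), disease)
print(f"raw surveillance vs disease: r = {r_raw:.2f}")
print(f"awareness-adjusted vs disease: r = {r_adjusted:.2f}")
```

In this simulated setting, removing the part of the surveillance series explained by the awareness proxy leaves a series that tracks true disease activity more closely. Whether the same holds for real data depends on how well the proxy captures awareness and on how little disease signal it contains.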
Limitations
The approach we used in this study is limited by the availability of data, which influenced how we evaluated the model and interpreted the results. Since reliable incidence rate estimates are not available for the period before the 2009 pH1N1 outbreak, we developed a different model for the non-pandemic period, the results of which are, to some extent, consistent with what was found in the pandemic model. Google search volume for some keywords in Hong Kong is not large enough to generate a search index, or is sometimes not of the same time resolution as the weekly surveillance data. More than half of the search keywords, for instance, are only available on a monthly basis. Also, we have not exhausted all the possible combinations of keywords, HealthMap alert counts, and time lags.
Given the noisy data and the lack of a disease transmission mechanism, our search for the best fitting model might have led to over-fitting. The selection and aggregation of predictors, therefore, was guided by both practical knowledge and model performance comparison, in order to achieve a balanced model version that fits relatively well and is meaningful for practitioners to interpret. For instance, the predictors are grouped in a relatively arbitrary manner, but the selection process for the pandemic model was blinded to the posteriors for each parameter until the final model version was selected.
Conclusions
In this study, we estimated the correspondence of multiple influenza surveillance data streams with indicators of the information environment, and the results suggest that most influenza surveillance data, to some degree, reflect public awareness as well as actual disease status. For instance, individuals who are aware of the on-going transmission of influenza are likely to search for information for prevention and self-diagnosis purposes, and may tend to seek medical attention once feeling sick. Thus, although it has not been recognized and studied systematically, many influenza surveillance systems may reflect changes in the information environment as well as actual disease trends. And although the data we analysed are all from Hong Kong, the underlying mechanisms are not specific to that region, so the problem may be widespread. Indeed, Zhang and colleagues and Stoto found similar patterns using less formal methods in the United States [6, 7].
Some surveillance systems seem to represent public awareness more than actual disease status. In particular, ad hoc surveillance systems set up during the early ascertainment of the pH1N1 outbreak, such as the walk-in clinics for ILI and the designation of pH1N1 as a new notifiable infectious disease, are more correlated with the information environment than the other surveillance systems that we identified in Hong Kong. Such results help us better understand and characterize influenza surveillance systems, and this understanding can be used in data interpretation, resource allocation, and the design and implementation of new surveillance systems in order to capture a more accurate picture of disease transmission.
The study findings discussed above are consistent with our practical knowledge that traditional and syndromic surveillance systems can be influenced by public awareness of the disease. Beyond Google searches, a correlation between social media (e.g. Twitter) or similar data and traditional flu surveillance data may only indicate that both reflect the same information environment, rather than that the social media data reflect actual disease status. Often the two are not clearly distinguished, and people readily jump to the conclusion that information environment data can be used as a proxy for disease status. As shown in our study, such an approach has its limitations. Information environment data such as web queries and tweets in fact have a dual usage: (1) as a proxy for directly estimating disease activity, and (2) as covariates to control the model for public awareness biases. When researchers promote the idea of using Internet data for disease surveillance among practitioners, this distinction is often not made clearly, and it is sometimes lost when the idea is communicated to the general public. Our study provides a framework for understanding how information environment data are related to traditional surveillance data, which will help to fine-tune the usage of information environment data.
Acknowledgements
This article was developed in collaboration with a number of partnering organizations, and with funding support awarded to the Harvard School of Public Health under cooperative agreements with the U.S. Centers for Disease Control and Prevention (CDC) grant number(s) 5P01TP000307-01 (Preparedness and Emergency Response Research Center). The content of these publications as well as the views and discussions expressed in these papers are solely those of the authors and do not necessarily represent the views of any partner organizations, the CDC or the US Department of Health and Human Services nor does mention of trade names, commercial practices, or organizations imply endorsement by the U.S. Government.
The data were made available by the Centre for Health Protection, Department of Health of Hong Kong, Hong Kong Hospital Authority, Dr. Benjamin Cowling and colleagues from the University of Hong Kong, and HealthMap.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
YZ carried out the study, performed the statistical analysis, and drafted the manuscript. AA participated in the study design, coding and statistical analysis. MS and BC conceived of the study, and participated in its design and coordination. All authors participated in designing the study, analysing and interpreting the results and drafting the manuscript. All authors read and approved the final manuscript.