Background
The prevention of nosocomial infections [1] must take into account the nosocomial risk of managing patients admitted to hospital with a community-acquired infection that poses an epidemic hazard. Identifying these patients upon admission would allow early implementation of precautionary measures in the admitting departments. Most frequently, patients admitted to hospital with a community-acquired infection first go to the emergency department (ED). At this stage, they present with one or more symptoms expressed as a chief complaint. The diagnoses made at the end of these patients' clinical, biological and therapeutic management in EDs are often based solely on the physicians' best judgement, and are rarely confirmed by microbiological tests, which provide definitive results only 24-48 hours after the samples are received in the laboratory. This is why the early identification of patients admitted for a community-acquired infection that poses an epidemic risk should be based on syndromic surveillance.
Syndromic surveillance "focuses on the early symptom (prodrome) period before clinical or laboratory confirmation of a particular disease and uses both clinical and alternative data sources. Strictly defined, syndromic surveillance gathers information about patients' symptoms (e.g., cough, fever, or shortness of breath) during the early phases of illness" [2].
Few studies have investigated the surveillance of patients admitted to hospital with a community-acquired infection that poses an intra-hospital epidemic risk [3]. Most syndromic surveillance systems based on ED data are designed to identify anomalous phenomena (e.g., bioterrorism, emerging infectious disease) occurring within the community at a regional or even national level [4-13], but these methods have not been applied in intra-hospital settings to identify patients who represent an epidemic risk. Most of the systems described in the literature are based on the chief complaint [4-7, 14] and sometimes on the syndromic discharge diagnosis [12, 13]. In France, EDs are gradually computerizing their clinical records to meet the legislative framework for cooperation with the French National Institute for Public Health Surveillance (Institut de Veille Sanitaire, InVS) and regional health agencies for the transmission of health information [15]. The French Society of Emergency Medicine (Société Française de Médecine d'Urgence, SFMU) recommends encoding ED discharge diagnoses with the International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), and chief complaints with a thesaurus developed by the SFMU from a relevant selection of ICD-10 codes [16, 17].
An automatic clinical decision support system for detecting patients admitted to EDs who carry infections with an epidemic risk is being developed at Hôpital de la Croix-Rousse in Lyon. This detection tool will rely on computerized ED medical records (Dossier Médical des Urgences, DMU). These records contain early clinical data (chief complaint, clinical examination data, etc.) entered in real time, before any diagnostic confirmation. The data entered in the DMU are heterogeneous and appear partly as structured variables and partly as textual variables corresponding to sections of narrative reports in medical language. An important part of the information needed for the syndromic identification of patients is described in the narrative reports, which are divided into different sections: doctors' clinical observations, specialists' notes, and prescribed diagnostic and therapeutic procedures. Each narrative report section is defined as a textual variable in the DMU database. Processing these narrative reports is a prerequisite for using DMU data.
The purpose of this paper was to describe and evaluate a natural language processing system to extract and encode information found in the narrative reports of computerized ED medical records.
Discussion
In the early stage of patient management, syndromic surveillance is instrumental in preventing and controlling nosocomial epidemic phenomena related to the admission of patients who could be an epidemic risk. Identifying these patients helps infection control practitioners, in interaction with the clinical teams, to implement preventive measures that limit the risk of transmission of infections that pose an epidemic risk, including additional precautions (contact, droplet, airborne). It is therefore important to implement tools that identify patients who represent a risk in EDs, before they are even admitted. Knirsch et al. tested an automated clinical decision support system for identifying additional, potential tuberculosis patients whom clinicians had failed to place in respiratory isolation [3]. This tool was based on a natural language processing system called MedLEE (Medical Language Extraction and Encoding System), used to encode narrative chest radiograph reports, and on algorithms that checked laboratory and pharmacy data to evaluate the immunocompromised status of patients. In a retrospective cohort study conducted for evaluation in 1992-1993, the combination of clinical and automated clinical decision support systems improved the isolation rate from 62.6% to 78.4%, demonstrating the relevance of automated methodologies for detecting patients at risk.
A similar experiment is underway to develop an automated clinical decision support system at Hôpital de la Croix-Rousse. Natural language processing is a prerequisite for this process, and UrgIndex was designed to process natural language data automatically.
Evaluation of UrgIndex, which was part of its development, indicated that processing quality was satisfactory. At the end of the learning phase, recall was 85.8%, ranging from 81.3% to 90.0% depending on the type of syndrome. Evaluation on a new set of 100 medical records confirmed its good performance in terms of recall (85.8% overall) and precision (79.1% overall).
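To make the two measures concrete, the following minimal sketch computes recall and precision from a set of concepts expected by a human reviewer and a set of concepts returned by an extractor. It assumes the usual definitions (recall = correctly extracted / expected, precision = correctly extracted / extracted); the function name and the concept labels are hypothetical and are not taken from UrgIndex.

```python
def recall_precision(expected, extracted):
    """Return (recall, precision) for two collections of concept codes.

    expected:  codes a human reviewer judged to be present (reference standard)
    extracted: codes returned by the automatic extractor
    """
    expected, extracted = set(expected), set(extracted)
    true_positives = len(expected & extracted)
    recall = true_positives / len(expected) if expected else 0.0
    precision = true_positives / len(extracted) if extracted else 0.0
    return recall, precision


# Hypothetical labels, for illustration only.
gold = {"FEVER", "COUGH", "DYSPNEA", "DIARRHEA"}
found = {"FEVER", "COUGH", "DYSPNEA", "TUBERCULOSIS"}

recall, precision = recall_precision(gold, found)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy case one expected concept is missed (a false negative, which lowers recall) and one spurious concept is returned (a false positive, which lowers precision).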
The small number of available concepts for "specialists' notes" and "discharge prescriptions" shows that these variables are seldom used by clinicians. For the variables "reasons," "observations" and "procedures", the lack of processing was linked mostly to the presence of an abbreviation, acronym, synonym or spelling error that was not recognized by the ECMT and not present in the UrgIndex correspondence table. This language variation table is an important UrgIndex asset for processing natural language data that are sometimes approximate (common words used instead of medical terms, regional words, abbreviations, unconventional acronyms, and spelling or typing errors). Language variations are responsible for false negatives (imperfect recall) and, to a lesser extent, for false positives (imperfect precision). We should emphasize the particular difficulty of obtaining an exhaustive correspondence table, given the very telegraphic style of emergency physicians' notes and the typing errors made when recording the patient's clinical description in the emergency context. UrgIndex was designed so that this correspondence table can be enriched as the application is used.
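The role of such a correspondence table can be sketched as a simple lookup applied to each token before concept matching. This is only an illustration under stated assumptions: the table entries, the English examples and the function name are hypothetical, and the actual UrgIndex table and its matching rules are not reproduced here.

```python
import re

# Hypothetical entries mapping non-standard forms to a standard term.
CORRESPONDENCE = {
    "sob": "shortness of breath",   # abbreviation
    "temp": "fever",                # common word instead of medical term
    "diarhea": "diarrhoea",         # spelling error
}


def normalize(text, table=CORRESPONDENCE):
    """Lower-case the text, split it into tokens and replace known variants."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(table.get(token, token) for token in tokens)


print(normalize("SOB and diarhea since 2 days"))
```

A variant that is absent from the table passes through unchanged and may then go unrecognized downstream, which is how an incomplete table produces false negatives.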
Another limitation is related to the ECMT. Some clinical concepts and their synonyms have no codes in the ECMT, as illustrated by "bronchial congestion" and "air bronchogram." Also, the same acronym is sometimes applied to 2 different concepts; a clinician can easily tell them apart from the context, but the ECMT may not interpret them correctly. For example, the acronym "ARF" can mean both "acute respiratory failure" and "acute renal failure."
Finally, the application does not contextualize concepts found in textual variables based on their occurrence timeline and does not perform in-depth semantic analysis. It is based only on a search for medical concepts. This contributes to the imperfect precision (79.1%): false positives due to antecedents (past medical history) represented 35% of all false positives in the test set. For example, the application does not distinguish whether a symptom is an antecedent, belongs to the history of the current illness or corresponds to the current clinical examination. Such a limitation can lead to background noise, i.e. codes of suspected infection concepts for patients who do not have any. For example, for "the patient had pulmonary tuberculosis in 1982", the application will return the "pulmonary tuberculosis" code.
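The limitation can be reproduced with a deliberately naive concept search. The sketch below is not the UrgIndex or ECMT implementation; the term list, the placeholder codes and the function name are hypothetical. It only shows why a search that ignores temporal context codes a past illness as if it were current.

```python
# Hypothetical term-to-code table; the codes are placeholders, not real ones.
CONCEPT_CODES = {
    "pulmonary tuberculosis": "CODE_PULM_TB",
    "fever": "CODE_FEVER",
}


def find_concepts(text, concepts=CONCEPT_CODES):
    """Return the codes of every known term found in the text.

    No attempt is made to tell past history from the current episode.
    """
    lowered = text.lower()
    return sorted(code for term, code in concepts.items() if term in lowered)


print(find_concepts("The patient had pulmonary tuberculosis in 1982"))
# ['CODE_PULM_TB']
```

The tuberculosis code is returned although the sentence describes an antecedent, which is the kind of false positive discussed above.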
Background noise may be the source of false positives, which will require infection control practitioners to validate the cases detected by the automated clinical decision support system before health care providers are alerted. A study is also being planned to determine the sensitivity and specificity of case identification by the clinical decision support system prior to its implementation in the hospital. This evaluation will be carried out once the tool is fully developed (i.e. once the detection algorithms are completely developed with the DMU's structured data and textual data and fully integrated into the clinical decision support system).
Many authors have already expressed interest in syndromic surveillance in hospital EDs. Such surveillance is possible if medical records are computerized and permit regular computerized transmission of the necessary data to the epidemiological services in charge of this surveillance [4-7, 14]. Most syndromic surveillance systems described in the literature are based on the surveillance of chief complaints [4-7, 14] or discharge diagnoses [12, 13] in EDs to detect potential outbreaks of target diseases as soon as possible, to provide early warning to the community if necessary, to prompt epidemiological field investigations that confirm the diseases and their origin, and to take appropriate measures. For example, a syndromic surveillance system was implemented in Virginia in 7 EDs for 10 months [7]. The chief complaints were faxed daily to the health department, classified manually according to 7 syndromes (fever, respiratory distress, vomiting, diarrhoea, rash, disorientation and sepsis), and analyzed by the cumulative sum algorithm. This system was able to prospectively reveal the onset of the flu epidemic earlier than the Sentinel Influenza Network, a routine surveillance system [7].
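For readers unfamiliar with the cumulative sum (CUSUM) approach, a minimal one-sided version is sketched below. It is a generic textbook form, not the configuration used in the Virginia system, whose parameters are not given here; the baseline mean, the allowance k, the threshold h, the reset rule and the daily counts are all hypothetical.

```python
def cusum_alarms(counts, baseline_mean, k=1.0, h=4.0):
    """Return the indices of days on which the upper CUSUM exceeds h.

    counts:        daily number of visits for one syndrome
    baseline_mean: expected daily count in the absence of an outbreak
    k:             allowance subtracted each day (tolerated excess)
    h:             decision threshold
    """
    cumulative = 0.0
    alarms = []
    for day, count in enumerate(counts):
        # Accumulate only the excess above baseline_mean + k, never below 0.
        cumulative = max(0.0, cumulative + (count - baseline_mean - k))
        if cumulative > h:
            alarms.append(day)
            cumulative = 0.0  # restart accumulation after signalling
    return alarms


# Invented daily counts for one syndrome, for illustration only.
daily_fever_visits = [3, 2, 4, 3, 2, 3, 7, 8, 9, 4]
print(cusum_alarms(daily_fever_visits, baseline_mean=3.0))
# [7, 8]
```

Because the statistic accumulates small sustained excesses, it can signal a gradual rise that no single day's count would reveal on its own.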
Studies have already been undertaken on the use of natural language processing in syndromic surveillance systems. Among them, a trial called Real-time Outbreak and Disease Surveillance (RODS) was conducted in 200 emergency facilities in Pennsylvania, Utah, Ohio and New Jersey [6]. A free text extractor named CoCo (Complaint Coder) analyzed the chief complaints and automatically classified each of them, with a naive Bayesian classification algorithm, into 1 of the following 8 syndromes: respiratory, botulism, gastrointestinal, neurological, cutaneous, constitutional, haemorrhagic and other. A detection algorithm then analyzed the classified data to search for clusters. This system allowed the prospective detection of exposure to carbon monoxide [22]. A retrospective study at the University of Pittsburgh Medical Center ED evaluated the performance of the CoCo free text extractor [23]. The authors measured the extractor's ability to classify 527,228 patients admitted between 1990 and 2003 into 1 of 7 syndromes: respiratory, botulism, gastrointestinal, neurological, cutaneous, constitutional and haemorrhagic. Each primary discharge diagnosis, already coded in ICD-9, was also retrieved and served as the "gold standard" to evaluate the extractor's performance. According to the results, the tool's sensitivity ranged from 30% for botulism syndrome to 75% for haemorrhagic syndrome. Its specificity was between 93% and 99%.
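The principle of naive Bayesian classification of chief complaints can be illustrated with a small word-based classifier. This sketch is not CoCo: the training examples, the whitespace tokenization and the add-one smoothing are assumptions chosen for brevity, and all names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count syndromes and words from (chief_complaint, syndrome) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, syndrome in examples:
        class_counts[syndrome] += 1
        for word in text.lower().split():
            word_counts[syndrome][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the syndrome with the highest posterior log-probability."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_syndrome, best_score = None, -math.inf
    for syndrome, count in class_counts.items():
        score = math.log(count / total)  # prior
        denominator = sum(word_counts[syndrome].values()) + len(vocabulary)
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log((word_counts[syndrome][word] + 1) / denominator)
        if score > best_score:
            best_syndrome, best_score = syndrome, score
    return best_syndrome


# Invented training examples, for illustration only.
model = train([
    ("cough fever", "respiratory"),
    ("shortness of breath", "respiratory"),
    ("vomiting diarrhea", "gastrointestinal"),
    ("abdominal pain vomiting", "gastrointestinal"),
    ("rash itching", "cutaneous"),
])
print(classify("cough and shortness of breath", model))
```

The "naive" assumption is that words occur independently given the syndrome, which is why the score is a simple sum of per-word log-probabilities.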
Another example of a syndromic surveillance system with textual processing of chief complaints is that of the New York City Department of Health and Mental Hygiene, which uses another type of classification tool for chief complaints: its classification algorithm is based on a keyword search [4]. The studied syndromes are common colds, infectious conditions or death upon arrival, respiratory syndromes, diarrhoea, fever, rash, asthma and vomiting. Abnormal events are detected by temporal and spatial clustering methods.
South et al. reported the value of employing multiple textual sources from computerized ED records, rather than the chief complaint alone, to improve the ability to identify flu-like syndromes [24]. Indeed, the sensitivity of a free text extractor in identifying patients admitted to EDs with a flu-like syndrome was 27% when the extractor was applied to data on the chief complaint, 51% when applied to ED observation data, and 4% when applied to the triage nurse's observation data. By combining these various natural language data, sensitivity was increased to 75%.
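Why combining sources raises sensitivity can be seen from a toy calculation in which a patient counts as detected if any source flags them. The patient identifiers, the detections per source and the "any source" combination rule below are invented for illustration; they are not the data or the method of the cited study.

```python
def sensitivity(detected, true_cases):
    """Proportion of true cases that appear among the detected patients."""
    true_cases = set(true_cases)
    return len(set(detected) & true_cases) / len(true_cases)


# Ten hypothetical patients who truly have the syndrome.
true_cases = set(range(1, 11))

# Hypothetical detections obtained from each textual source.
detected_by_source = {
    "chief_complaint": {1, 2, 3},
    "ed_observation": {3, 4, 5, 6},
    "triage_note": {7},
}

# A patient is flagged when at least one source detects the syndrome.
combined = set().union(*detected_by_source.values())

for source, detected in detected_by_source.items():
    print(source, sensitivity(detected, true_cases))
print("combined", sensitivity(combined, true_cases))
```

Each source misses different patients, so their union covers more true cases than any single source; the trade-off, not shown here, is that false positives from each source also accumulate.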
Some authors have begun to focus on syndromic surveillance for nosocomial infection monitoring and alerts [25, 26]. These trials exploit the computerized medical records of hospitalized patients to detect the beginning of intra-hospital outbreaks (e.g., gastroenteritis due to Norovirus). However, we have not found any articles on the use of syndromic surveillance data from EDs to implement an intra-hospital alert system for patients who could be an epidemic risk. The information provided by InVS surveillance systems, both nonspecific and specific to certain syndromes (the Sentinel Network for influenza and acute gastroenteritis, etc.) [27], is intended for regional and national surveillance. The information circuit for these systems is, therefore, not designed for intra-hospital purposes. The objective of syndromic surveillance within a hospital, as in a community, is to raise an appropriate alert so that preventive measures can be taken very rapidly in the facility upon patient admission.
UrgIndex will be integrated into a clinical decision support system aimed at identifying cases of community-acquired infections by applying various filters on symptoms and procedures. By customizing the filters, this application could also serve other types of clinical decision support systems: to assist in triage by directly processing the chief complaint for consultation; to help in diagnosis and management decisions; to participate in surveillance based on EDs and mortality (Surveillance Sanitaire des urgences et des décès, SurSaUD) in the InVS surveillance system by sending coded data (e.g., during summer heat wave periods, the InVS assesses the impact of heat waves on the population by analyzing the chief complaints for hyperthermia, dehydration, hyponatraemia and discomfort) [28]; to search for case clusters during bioterrorism events; and to identify patients for rapid inclusion in study protocols.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SG built the correspondence tables for non-standard terms and filters, and undertook the analysis. The application itself was designed and constructed by OY and QG. SD performed ECMT algorithms. ALM and VS participated in data collection with the DMU's data warehouse. SG, VP and MHM evaluated and determined which pertinent infectious disease to detect. SG drafted the manuscript and MHM revised it. All authors have read and approved the final manuscript version.