
01.12.2012 | Research article | Issue 1/2012 | Open Access

BMC Medical Informatics and Decision Making 1/2012

The freetext matching algorithm: a computer program to extract diagnoses and causes of death from unstructured text in electronic health records

Anoop D Shah, Carlos Martinez, Harry Hemingway
Important notes

Electronic supplementary material

The online version of this article (doi:10.1186/1472-6947-12-88) contains supplementary material, which is available to authorized users.

Competing interests

None of the authors have any competing interests to declare.

Authors’ contributions

ADS analysed the GPRD data for cause of death, designed the Freetext Matching Algorithm, and manually annotated the results of analysis. All authors discussed and reviewed the manuscript.



Electronic health records are invaluable for medical research, but much information is stored as free text rather than in a coded form. For example, in the UK General Practice Research Database (GPRD), causes of death and test results are sometimes recorded only in free text. Free text can be difficult to use for research if it requires time-consuming manual review. Our aim was to develop an automated method for extracting coded information from free text in electronic patient records.


We reviewed the electronic patient records in GPRD of a random sample of 3310 patients who died in 2001, to identify the cause of death. We developed a computer program called the Freetext Matching Algorithm (FMA) to map diagnoses in text to the Read Clinical Terminology. The program uses lookup tables of synonyms and phrase patterns to identify diagnoses, dates and selected test results. We tested it on two random samples of free text from GPRD (1000 texts associated with death in 2001, and 1000 general texts from cases and controls in a coronary artery disease study), comparing the output to the U.S. National Library of Medicine’s MetaMap program and the gold standard of manual review.
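The lookup-table approach described above can be illustrated with a minimal sketch. This is not the FMA code itself (which is distributed as a Microsoft Access database); the synonym table, the placeholder concept labels (which are not real Read codes), the negation cues and the function name `extract_diagnoses` are all assumptions made for illustration.

```python
import re

# Hypothetical synonym table: phrase -> placeholder concept label.
# The labels are illustrative only and are NOT real Read codes.
SYNONYMS = {
    "myocardial infarction": "CONCEPT_MI",
    "heart attack": "CONCEPT_MI",
    "bronchopneumonia": "CONCEPT_BRONCHOPNEUMONIA",
    "stroke": "CONCEPT_STROKE",
    "cva": "CONCEPT_STROKE",
}

# Hypothetical phrase pattern: a negation cue immediately before a diagnosis.
NEGATION_CUE = re.compile(r"\b(?:no evidence of|without|not|no)\s+$")


def extract_diagnoses(text):
    """Return (concept, negated) pairs in order of appearance in the text."""
    lowered = text.lower()
    claimed = []  # character spans already matched by a longer phrase
    found = []
    # Try longer phrases first so they take priority over shorter ones.
    for phrase in sorted(SYNONYMS, key=len, reverse=True):
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, lowered):
            start, end = match.span()
            if any(start < e and s < end for s, e in claimed):
                continue  # overlaps an earlier, longer match
            claimed.append((start, end))
            negated = bool(NEGATION_CUE.search(lowered[:start]))
            found.append((start, SYNONYMS[phrase], negated))
    return [(concept, negated) for _, concept, negated in sorted(found)]


if __name__ == "__main__":
    print(extract_diagnoses("Died of heart attack. No evidence of stroke."))
```

A real system would need far larger tables and richer patterns (dates, test results, family history, uncertainty); the sketch only shows how synonym lookup and a phrase pattern can combine to separate positive from negated mentions.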


Among 3310 patients registered in the GPRD who died in 2001, the cause of death was recorded in coded form in 38.1% of patients, and in the free text alone in 19.4%. On the 1000 texts associated with death, FMA coded 683 of the 735 positive diagnoses, with precision (positive predictive value) 98.4% (95% confidence interval (CI) 97.2, 99.2) and recall (sensitivity) 92.9% (95% CI 90.8, 94.7). On the general sample, FMA detected 346 of the 447 positive diagnoses, with precision 91.5% (95% CI 88.3, 94.1) and recall 77.4% (95% CI 73.2, 81.2), which was similar to MetaMap.
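The recall figures above follow directly from the reported counts, as the short sketch below shows. The abstract does not state how the confidence intervals were calculated, so the Wilson score interval used here is an assumption, and its bounds may differ slightly from the published ones. The function names are illustrative.

```python
import math


def recall(detected, gold_positives):
    """Recall (sensitivity): share of gold-standard positives that were detected."""
    return detected / gold_positives


def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    Assumption: the paper's interval method is not stated in the abstract,
    so this is one common choice, not necessarily the authors' method.
    """
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Counts taken from the results paragraph above.
    for label, detected, positives in [
        ("death texts", 683, 735),
        ("general texts", 346, 447),
    ]:
        low, high = wilson_interval(detected, positives)
        print(f"{label}: recall {100 * recall(detected, positives):.1f}% "
              f"(approx. 95% CI {100 * low:.1f}, {100 * high:.1f})")
```

Precision cannot be recomputed the same way from the abstract alone, because the number of false positives is not reported there.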


We have developed an algorithm to extract coded information from free text in GP records with good precision. It may facilitate research using free text in electronic patient records, particularly for extracting the cause of death.
Additional file 1: Documentation for Freetext Matching Algorithm. PDF document containing general description, user guide and technical documentation. (PDF 922 KB)
Additional file 3: Freetext Matching Algorithm in an Access Database. ZIP archive containing Microsoft Access 2000 database and instructions for use. Program code is licensed under the GNU General Public License Version 3. (ZIP 11 MB)
Additional file 4: Comparison of Freetext Matching Algorithm with MetaMap and Negex. PDF document containing details of the MetaMap configuration, performance of different MetaMap options and comparisons of FMA and MetaMap output for selected texts. (PDF 143 KB)
Authors’ original file for figure 1
Authors’ original file for figure 2
Authors’ original file for figure 3