Background
Principal methods used for automatic de-identification
Machine learning-based systems
Methods
Main characteristics of selected rule-based de-identification systems
-
HMS Scrubber[6]. HMS Scrubber is a freely available, open-source tool developed within the Shared Pathology Informatics Network (SPIN). It was designed to obscure HIPAA identifiers and is tailored to pathology reports. As part of the SPIN project, an XML schema was defined to accommodate the different types of information contained in a pathology report; HMS Scrubber can operate on documents that follow this XML schema as well as on plain-text documents. The algorithm involves the following steps: (1) if the reports follow the SPIN XML schema, HMS Scrubber takes advantage of the predefined information and searches for any occurrences of these identifiers in the textual portion of the document; (2) pattern matching with regular expressions: 50 regular expressions are used to detect potential PHI such as dates, phone numbers, and social security numbers; and (3) word-based dictionary lookups, comparing each word in the document to a database of over 101,000 unique personal and geographic place names built from the U.S. 1990 Census [17].
-
The Medical De-identification System (MeDS)[7]. MeDS was developed and tested on different types of free-text clinical records such as discharge summaries, clinical notes, and laboratory and pathology reports. This system was especially tuned to process Health Level Seven (HL7) messages [18], but it can easily be modified to accept other document formats. MeDS de-identifies documents in four steps: (1) pre-processing of the well-labeled patient identifiers in report headers (when applicable), and using this information to find the same identifiers in the narrative parts of the report; (2) pattern matching using approximately 40 regular expressions to detect numerical identifiers, dates, addresses, state names and abbreviations, times, ages, e-mail addresses, etc.; (3) name matching with two lists: a list of proper names extracted from different sources (see Table 1 for details) and a list of common usage words; in this step, MeDS also searches for predictive markers such as ‘Mr.’ or ‘Dr.’ and uses part-of-speech information to assist with disambiguation of proper names; and (4) name nearness matching, based on a text string nearness algorithm, to deal with typographical errors and variants of the patient’s known first, middle and last names.
-
The MIT deid software package[9]. This software, freely available at the PhysioNet website, was designed to be applicable to a variety of free-text medical records, removing both HIPAA identifiers and an extended PHI set that includes years of dates. The de-identification procedure performs lexical matching with lookup tables, regular expressions, and simple heuristics for context checks. Four types of dictionaries are used by the MIT deid system: (1) lists of known names of patients and hospital staff; (2) lists of generic female and male first names, last names, hospital names, locations and states, all classified as ambiguous or unambiguous items depending on whether they are also common words or not; (3) lists of keywords or phrases that often precede or follow PHI terms; and (4) lists of non-PHI terms such as common words and UMLS terms useful for determining ambiguous PHI terms. Then, PHI instances that involve numerical patterns are identified by regular expressions and context checks. For non-numeric PHI terms, the algorithm first performs dictionary lookups in order to locate known and potential PHI, and then processes several regular expressions that look for patterns with context keywords indicating PHI terms.
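The three systems above share a common skeleton: regular expressions for numeric identifiers, dictionary lookups for names, and context cues to resolve ambiguous words. The following is a minimal Python sketch of that skeleton. The patterns, the word lists, and the `find_phi` function are hypothetical and far smaller than the resources described above; they are not taken from HMS Scrubber, MeDS, or MIT deid.

```python
import re

# Hypothetical patterns and word lists, for illustration only.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}
NAME_DICTIONARY = {"smith", "jamison", "brown"}  # stand-in for a census-derived list
COMMON_WORDS = {"brown"}                         # names that are also ordinary words
TITLE_MARKERS = {"mr.", "mrs.", "ms.", "dr."}    # markers that often precede names


def find_phi(text):
    """Return a sorted list of (start, end, label) spans flagged as potential PHI."""
    spans = []
    # Step 1: regular expressions for numeric identifiers.
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), label))
    # Step 2: word-by-word dictionary lookup. An ambiguous name (one that is
    # also a common word) is kept only when a title marker precedes it.
    previous = ""
    for m in re.finditer(r"[A-Za-z]+\.?", text):
        token = m.group()
        word = token.rstrip(".").lower()
        if word in NAME_DICTIONARY:
            if word not in COMMON_WORDS or previous in TITLE_MARKERS:
                end = m.end() - 1 if token.endswith(".") else m.end()
                spans.append((m.start(), end, "NAME"))
        previous = token.lower()
    return sorted(spans)


text = "Dr. Smith saw the patient on 12/11/2009; call 555-123-4567. Skin was brown."
for start, end, label in find_phi(text):
    print(label, text[start:end])
```

In this sketch, ‘brown’ in “Skin was brown” is left untouched because no title marker precedes it, whereas the same word in “Mr. Brown was admitted” would be flagged as a name.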
Main characteristics of selected machine learning-based de-identification systems
-
The MITRE Identification Scrubber Toolkit (MIST)[10]. MIST provides an environment to support rapid tailoring of automated de-identification to different types of documents. It allows end users to annotate training data and run subsequent experiments. To detect PHI identifiers, MIST uses a machine learning classifier (Conditional Random Fields, CRF) and tackles de-identification as a sequence-labeling problem, assigning a label to each word that indicates whether the word is part of a specific type of PHI phrase or does not belong to any PHI phrase. A well-known encoding for this labeling is the BIO schema; Figure 1 depicts an example of BIO annotations. In the figure, B-T indicates that the word is the beginning of a PHI phrase of type T, I-T indicates a word inside or at the end of a PHI phrase of type T, and O indicates a word outside a PHI phrase. MIST uses the BIO schema and is structured to allow the addition of new learning features for the CRF predictions. The default feature specification distributed with MIST includes only a few features, such as the target word, prefixes and suffixes of length 1 to 3, capitalization of the target word and the following word, digits inside the target word, context words denoting a company in the next four words (e.g., “Ltd.”, “Corp.”, “Co.”, “Inc.”), a context window of 3 words, and 2- and 3-grams of words surrounding the target word.
-
The Health Information DE-identification (HIDE) system[12]. HIDE also provides an environment for tagging, classifying and retagging, which allows the construction of large training datasets without intensive human intervention. For de-identifying documents, HIDE also uses a CRF model for predicting PHI and deals with the problem by tagging each token using the aforementioned BIO schema. The main difference from MIST is that HIDE provides a larger set of features by default. Approximately 34 features are derived from the morphology of the token (e.g., capitalization, special characters, affixes of length 1 to 3, single-, double-, triple- and quadruple-digit words, and digits inside the token, to name but a few); moreover, the context window processed by HIDE comprises the four previous and four following tokens.
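To make the sequence-labeling formulation concrete, the following Python sketch shows how BIO labels can be assigned to tokens and how a few of the token-level features mentioned above (affixes of length 1 to 3, capitalization, digits, surrounding words) might be computed. The function names and feature names are illustrative assumptions; this is not the feature specification of MIST or HIDE, and the CRF training step itself is omitted.

```python
def bio_labels(tokens, phi_spans):
    """Assign BIO labels to tokens.

    phi_spans is a list of (start, end, phi_type) over token indices,
    with end exclusive.
    """
    labels = ["O"] * len(tokens)
    for start, end, phi_type in phi_spans:
        labels[start] = f"B-{phi_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{phi_type}"
    return labels


def token_features(tokens, i, window=3):
    """Build a feature dict for tokens[i]; feature names are illustrative."""
    word = tokens[i]
    features = {
        "word": word.lower(),
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(c.isdigit() for c in word),
    }
    # Prefixes and suffixes of length 1 to 3.
    for n in range(1, 4):
        if len(word) >= n:
            features[f"prefix{n}"] = word[:n].lower()
            features[f"suffix{n}"] = word[-n:].lower()
    # Surrounding words inside the context window.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            features[f"word[{offset:+d}]"] = tokens[j].lower()
    return features


tokens = ["Patient", "met", "Dr.", "JAMISON", "JAMES", "today"]
print(bio_labels(tokens, [(3, 5, "NAME")]))
print(token_features(tokens, 3))
```

Setting `window=4` would mimic the wider context window described for HIDE; a CRF toolkit would then be trained on one such feature dict per token, paired with its BIO label.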
| Characteristic | Sub-type | HMS Scrubber | MeDS | MIT deid | MIST | HIDE |
|---|---|---|---|---|---|---|
| Main technique | Rule-based | X | X | X | n/a | n/a |
| | ML-based | n/a | n/a | n/a | X | X |
| Programming language | | Java | Java | Perl | Python | Python |
| ML algorithm | | n/a | n/a | n/a | CRF (Carafe) | CRF (CRFsuite) |
| Input documents | | XML/txt | HL7/txt | txt | txt/XML-inline/json | XML/txt/HL7 |
| HIPAA compliant | | X | X | X | ¹ | ¹ |
| Regular Expressions (#) | | ~50 | ~40 | ~90 | ² | ² |
| PHI markers (e.g., Mr.) | | X | X | X | ³ | -- |
| Part-of-speech information | | -- | X | -- | -- | -- |
| String similarity techniques (e.g., edit distance, fuzzy matching) | | -- | X | -- | -- | -- |
| Dictionaries* (size) | Person names | ~101K | ~280K | ~96K⁴ | -- | -- |
| | Geographic places | (with person names) | ~167K | ~4K | -- | -- |
| | US area code | -- | -- | ~380 | -- | -- |
| | Medical phrases | -- | ~50 | ~28 | -- | -- |
| | Medical terms | -- | ~80K | ~175K | -- | -- |
| | Companies | -- | ~200 | ~500 | -- | -- |
| | Ethnicities | -- | ~120 | ~195 | -- | -- |
| | Common words | -- | ~220K | ~50K | -- | -- |
| Machine Learning features | Contextual window | n/a | n/a | n/a | 3-words | 4-words |
| | Morphological (#) | n/a | n/a | n/a | 22 | 34 |
| | Syntactic | n/a | n/a | n/a | -- | -- |
| | Semantic | n/a | n/a | n/a | -- | -- |
| | From dictionaries | n/a | n/a | n/a | ⁵ | ⁵ |
Evaluation methodology
Reference standard corpora
-
Patients: includes the first and last name of patients, their health proxies, and family members, excluding titles (e.g., Mrs. Smith was admitted).
-
Doctors: refers to medical doctors and other practitioners mentioned in the records, excluding titles.
-
Hospitals: names of medical organizations and of nursing homes, including room numbers, buildings and floors (e.g., Patient was transferred to room 900).
-
Locations: includes geographic locations such as cities, states, street names, zip codes, building names, and numbers.
-
Dates: includes all elements of a date. Originally, years were not annotated in this corpus; we modified these annotations to include years so that they are consistent with our VHA date annotations.
-
Phone numbers: includes telephone, pager, and fax numbers.
-
Ages: ages above 90 years old.
-
IDs: refers to any combination of numbers, letters, and special characters identifying medical records, patients, doctors, or hospitals (e.g., medical record number).

Table 2. PHI category distribution and mapping for the VHA, i2b2 and Swedish Stockholm EPR corpora

| VHA corpus | Instances | i2b2 corpus | Instances | Stockholm EPR De-identified Corpus | Instances |
|---|---|---|---|---|---|
| Patient Name | 206 (3.88%) | Patients | 929 (4.76%) | Person Name: First Name | 923 (20.87%) |
| Relative Name | 30 (0.55%) | | | | |
| Other Person Name | 20 (0.37%) | | | Person Name: Last Name | 929 (21%) |
| Healthcare Provider Name | 492 (9.08%) | Doctors | 3751 (19.24%) | | |
| Street City | 137 (2.53%) | Locations | 263 (1.35%) | Location | 148 (3.35%) |
| State Country | 161 (2.97%) | | | | |
| Zip code | 4 (0.07%) | | | | |
| Deployment | 43 (0.79%) | -- | -- | -- | -- |
| Healthcare Unit Name | 1453 (26.83%) | Hospitals | 2400 (12.31%) | Health_Care_Unit | 1021 (23.08%) |
| Other Organization | 86 (1.59%) | -- | -- | -- | -- |
| Date | 2547 (47.03%) | Dates | 7098 (36.40%) | Date_Part | 710 (16.05%) |
| | | | | Full_Date | 500 (11.30%) |
| Age > 89 | 4 (0.07%) | Ages | 16 (0.08%) | Age | 56 (1.27%) |
| Phone Number | 90 (1.66%) | Phone Numbers | 232 (1.19%) | Phone Number | 136 (3.07%) |
| Electronic Address | 4 (0.07%) | -- | -- | -- | -- |
| SSN | 16 (0.30%) | IDs | 4809 (24.66%) | -- | -- |
| Other ID Number | 123 (2.27%) | | | -- | -- |
-
Names: all occurrences of person names, distributed in four sub-categories (i.e., patients, relatives, healthcare providers, and other persons) and including first names, last names, middle names and initials (not titles), e.g. “Patient met Dr. JAMISON JAMES”.
-
Street City: addresses including the city, street number and name, apartment number, etc. (e.g., “lived on 5 Main Street, Suite 200, Albany NY 0000”).
-
State Country: all mentions of states and countries. It also includes mentions of countries associated with military service, service awards, or place of residence at the time of deployment (e.g., “He was awarded the Korean service medal”).
-
ZIP code: zip code information.
-
Deployment: armed forces-specific identifiers that describe a deployment location, or mention of units, battalion, regiment, brigade, etc. (e.g., “had worked as a cook at Air Base 42 for 3 yrs”).
-
Healthcare Unit Name: any facility performing health care services, including smaller units (e.g., detox clinics, HIV clinics), and generic locations such as MICU, SICU, ICU, ER. This also includes all explicit mentions of healthcare facilities, clinical laboratories, assisted living, nursing homes, and generic mentions such as “the hospital”, “the clinic”, or “medical service” (e.g., “patient was referred to the blue clinic”, “transferred to 4 west”).
-
Other Organization Name: company or organization names not related with healthcare that are attributed to a patient or provider (e.g., “patient is an active member of the Elk’s club”).
-
Date: all elements of a date, including year and time, days of the week, and day abbreviations (e.g., “on December, 11, 2009@11:45am”, “administered every Monday, TU, and Thurs”).
-
Age > 89: all instances of age greater than 89 years old.
-
Phone Number: all numeric or alphanumeric combinations of phone, fax, or pager numbers, including phone number extensions (e.g., “call 000-LEAD”, “dial x8900”).
-
Electronic Address: references to electronic mail addresses, web pages and IP addresses.
-
SSN: combinations of numbers and characters representing a social security number, including first initial of last name and last four digits of the SSN (e.g., “L0000 was seen in clinic”).
-
Other ID Number: all other combinations of numbers and letters that could represent a medical record number, lab test number, or other patient or provider identifier such as driver’s license number (e.g., “prescription number: 0234569”, “Job 13579/JSS”).
Systems implementation and testing
Systems output analysis
Results
RULE-BASED SYSTEMS

| Overall results | Metric | EXACT MATCHES: HMS Scrubber | EXACT MATCHES: MeDS | EXACT MATCHES: MIT deid | PARTIAL MATCHES: HMS Scrubber | PARTIAL MATCHES: MeDS | PARTIAL MATCHES: MIT deid | FULLY-CONTAINED MATCHES: HMS Scrubber | FULLY-CONTAINED MATCHES: MeDS | FULLY-CONTAINED MATCHES: MIT deid |
|---|---|---|---|---|---|---|---|---|---|---|
| One PHI | P (CI) | 0.01 (0.005-0.015) | 0.10 (0.085-0.115) | 0 | 0.32 (0.31-0.33) | 0.45 (0.435-0.465) | 0.81 (0.795-0.825) | 0.16 (0.15-0.17) | 0.14 (0.125-0.155) | 0.42 (0.40-0.44) |
| | R (CI) | 0.02 (0.015-0.025) | 0.21 (0.20-0.22) | 0 | 0.65 (0.64-0.66) | 0.78 (0.765-0.795) | 0.64 (0.625-0.655) | 0.34 (0.325-0.355) | 0.32 (0.305-0.335) | 0.36 (0.345-0.375) |
| | F2 (CI) | 0.02 (0.012-0.025) | 0.17 (0.16-0.18) | 0 | 0.54 (0.53-0.55) | 0.68 (0.665-0.695) | 0.67 (0.655-0.685) | 0.28 (0.27-0.29) | 0.25 (0.24-0.26) | 0.37 (0.355-0.385) |
| All PHI types | P (CI) | 0.01 (0.005-0.015) | 0.05 (0.045-0.055) | 0 | 0.23 (0.22-0.24) | 0.34 (0.325-0.365) | 0.76 (0.745-0.775) | 0.12 (0.115-0.125) | 0.10 (0.09-0.11) | 0.40 (0.335-0.465) |
| | R (CI) | 0.02 (0.0195-0.0215) | 0.14 (0.13-0.15) | 0 | 0.47 (0.455-0.485) | 0.60 (0.585-0.615) | 0.60 (0.585-0.615) | 0.26 (0.225-0.295) | 0.22 (0.205-0.235) | 0.34 (0.325-0.355) |
| | F2 (CI) | 0.02 (0.018-0.022) | 0.10 (0.09-0.11) | 0 | 0.39 (0.38-0.40) | 0.52 (0.505-0.535) | 0.63 (0.615-0.645) | 0.21 (0.195-0.225) | 0.18 (0.17-0.19) | 0.35 (0.315-0.385) |
MACHINE LEARNING-BASED SYSTEMS

| Overall results | Metric | EXACT MATCHES: MIST | EXACT MATCHES: HIDE | PARTIAL MATCHES: MIST | PARTIAL MATCHES: HIDE | FULLY-CONTAINED MATCHES: MIST | FULLY-CONTAINED MATCHES: HIDE |
|---|---|---|---|---|---|---|---|
| One PHI | P (CI) | 0.54 (0.52-0.56) | 0.50 (0.48-0.52) | 0.95 (0.935-0.965) | 0.89 (0.875-0.905) | 0.58 (0.56-0.60) | 0.56 (0.54-0.58) |
| | R (CI) | 0.25 (0.24-0.26) | 0.27 (0.26-0.28) | 0.46 (0.445-0.475) | 0.49 (0.475-0.505) | 0.28 (0.27-0.29) | 0.30 (0.29-0.31) |
| | F2 (CI) | 0.28 (0.265-0.295) | 0.30 (0.285-0.315) | 0.51 (0.495-0.525) | 0.54 (0.525-0.555) | 0.31 (0.295-0.325) | 0.33 (0.315-0.345) |
| All PHI types | P (CI) | 0.52 (0.495-0.545) | 0.48 (0.46-0.50) | 0.90 (0.885-0.915) | 0.84 (0.825-0.855) | 0.55 (0.525-0.575) | 0.52 (0.50-0.54) |
| | R (CI) | 0.24 (0.225-0.255) | 0.25 (0.24-0.26) | 0.44 (0.425-0.455) | 0.46 (0.445-0.475) | 0.27 (0.255-0.285) | 0.28 (0.265-0.295) |
| | F2 (CI) | 0.27 (0.255-0.285) | 0.28 (0.265-0.295) | 0.49 (0.475-0.505) | 0.50 (0.485-0.515) | 0.30 (0.285-0.315) | 0.31 (0.295-0.325) |
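The F2 column in these results can be related to the P and R columns with a short calculation. The sketch below assumes that F2 denotes the standard F-measure with β = 2, which weights recall more heavily than precision; this definition is an assumption here, although it is consistent with the rounded values reported above. The function name `f_beta` is illustrative.

```python
def f_beta(precision, recall, beta=2.0):
    """F-measure with weight beta; beta=2 favors recall over precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Partial matches, "One PHI" rows: MeDS (P=0.45, R=0.78) and MIT deid (P=0.81, R=0.64).
print(round(f_beta(0.45, 0.78), 2))
print(round(f_beta(0.81, 0.64), 2))
```

With β = 2, a system with moderate precision and high recall (MeDS) can score as well as or better than a system with high precision and lower recall (MIT deid), which matches the emphasis on not missing PHI.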
RULE-BASED SYSTEMS

| PHI type | #Inst. | PARTIAL MATCHES: HMS Scrubber | PARTIAL MATCHES: MeDS | PARTIAL MATCHES: MIT deid | FULLY-CONTAINED MATCHES: HMS Scrubber | FULLY-CONTAINED MATCHES: MeDS | FULLY-CONTAINED MATCHES: MIT deid |
|---|---|---|---|---|---|---|---|
| Patient Name | 206 | 0.83 | 0.99 | 0.98 | 0.57 | 0.69 | 0.69 |
| Relative Name | 30 | 0.76 | 0.95 | 1 | 0.57 | 0.67 | 0.77 |
| Healthcare Provider Name | 492 | 0.74 | 0.96 | 0.94 | 0.43 | 0.47 | 0.38 |
| Other Person Name | 20 | 0.66 | 0.81 | 0.74 | 0.30 | 0.25 | 0.35 |
| Street City | 137 | 0.90 | 0.96 | 0.81 | 0.70 | 0.78 | 0.78 |
| State Country | 161 | 0.45 | 0.49 | 0.85 | 0.43 | 0.45 | 0.84 |
| Deployment | 43 | 0.34 | 0.49 | 0.27 | 0.07 | 0.02 | 0.05 |
| ZIP code | 4 | 1 | 1 | 1 | 1 | 1 | 1 |
| Healthcare Unit Name | 1453 | 0.45 | 0.51 | 0.12 | 0.24 | 0.23 | 0.03 |
| Other Org Name | 86 | 0.33 | 0.50 | 0.27 | 0.03 | 0.20 | 0.03 |
| Date | 2547 | 0.74 | 0.87 | 0.80 | 0.34 | 0.27 | 0.46 |
| Age > 89 | 4 | 0 | 0 | 1 | 0 | 0 | 1 |
| Phone Number | 90 | 0.73 | 0.79 | 0.80 | 0.42 | 0.5 | 0.48 |
| Electronic Address | 4 | 0 | 0.86 | 0.75 | 0 | 0 | 0.75 |
| SSN | 16 | 1 | 1 | 1 | 1 | 1 | 1 |
| Other ID Number | 123 | 0.66 | 0.82 | 0.41 | 0.43 | 0.61 | 0.27 |
MACHINE LEARNING-BASED SYSTEMS

| PHI type | #Inst. | PARTIAL MATCHES: MIST | PARTIAL MATCHES: HIDE | FULLY-CONTAINED MATCHES: MIST | FULLY-CONTAINED MATCHES: HIDE |
|---|---|---|---|---|---|
| Patient Name | 206 | 0.51 | 0.54 | 0.42 | 0.50 |
| Relative Name | 30 | 0.13 | 0.13 | 0.13 | 0.13 |
| Healthcare Provider Name | 492 | 0.53 | 0.59 | 0.44 | 0.53 |
| Other Person Name | 20 | 0 | 0.20 | 0 | 0.15 |
| Street City | 137 | 0.26 | 0.29 | 0.26 | 0.27 |
| State Country | 161 | 0.14 | 0.22 | 0.14 | 0.21 |
| Deployment | 43 | 0.07 | 0.05 | 0.07 | 0.02 |
| ZIP code | 4 | 0 | 0.75 | 0 | 0.75 |
| Healthcare Unit Name | 1453 | 0.09 | 0.09 | 0.06 | 0.05 |
| Other Org Name | 86 | 0.09 | 0.07 | 0.06 | 0.06 |
| Date | 2547 | 0.72 | 0.73 | 0.39 | 0.38 |
| Age > 89 | 4 | 0 | 0 | 0 | 0 |
| Phone Number | 90 | 0.34 | 0.61 | 0.24 | 0.51 |
| Electronic Address | 4 | 0 | 0 | 0 | 0 |
| SSN | 16 | 0.56 | 0.87 | 0.56 | 0.87 |
| Other ID Number | 123 | 0.32 | 0.69 | 0.20 | 0.63 |
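The results distinguish exact, partial, and fully-contained matches. As a rough illustration only, the Python sketch below classifies a system annotation against a reference annotation using character offsets. The definitions used here are assumptions for the purpose of the example: an exact match has identical offsets, a fully-contained match is a system span that covers the whole reference span, and a partial match is any overlap. The function name `match_type` is hypothetical.

```python
def match_type(system_span, reference_span):
    """Classify a system span against a reference span.

    Spans are (start, end) character offsets with end exclusive. The
    strictest applicable category is returned; the category definitions
    are assumptions made for this sketch.
    """
    s_start, s_end = system_span
    r_start, r_end = reference_span
    if (s_start, s_end) == (r_start, r_end):
        return "exact"
    if s_start <= r_start and s_end >= r_end:
        return "fully-contained"
    if s_start < r_end and r_start < s_end:
        return "partial"
    return "none"


print(match_type((0, 5), (0, 5)))
print(match_type((0, 10), (2, 8)))
print(match_type((3, 10), (0, 5)))
print(match_type((5, 10), (0, 5)))
```

Under these assumed definitions every exact match is also fully contained, and every fully-contained match is also a partial match, so scores can only stay equal or rise when moving from exact to fully-contained to partial matching.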
10-fold cross-validation experiment

| Overall results | Metric | EXACT MATCHES: MIST | EXACT MATCHES: HIDE | PARTIAL MATCHES: MIST | PARTIAL MATCHES: HIDE | FULLY-CONTAINED MATCHES: MIST | FULLY-CONTAINED MATCHES: HIDE |
|---|---|---|---|---|---|---|---|
| One PHI | P (CI) | 0.89 (0.88-0.90) | 0.88 (0.87-0.89) | 0.96 (0.95-0.97) | 0.95 (0.94-0.96) | 0.91 (0.90-0.92) | 0.91 (0.90-0.92) |
| | R (CI) | 0.64 (0.625-0.655) | 0.70 (0.685-0.715) | 0.70 (0.685-0.715) | 0.76 (0.75-0.77) | 0.67 (0.655-0.685) | 0.73 (0.72-0.74) |
| | F2 (CI) | 0.68 (0.665-0.695) | 0.73 (0.72-0.74) | 0.74 (0.725-0.755) | 0.79 (0.775-0.805) | 0.71 (0.70-0.72) | 0.76 (0.75-0.77) |
| All PHI types | P (CI) | 0.87 (0.855-0.885) | 0.87 (0.86-0.88) | 0.95 (0.94-0.96) | 0.92 (0.905-0.935) | 0.90 (0.885-0.915) | 0.89 (0.88-0.90) |
| | R (CI) | 0.63 (0.615-0.655) | 0.69 (0.675-0.705) | 0.69 (0.675-0.705) | 0.74 (0.725-0.755) | 0.66 (0.645-0.675) | 0.71 (0.695-0.725) |
| | F2 (CI) | 0.67 (0.655-0.685) | 0.72 (0.71-0.73) | 0.73 (0.713-0.745) | 0.77 (0.76-0.78) | 0.70 (0.685-0.715) | 0.74 (0.725-0.755) |
10-fold cross-validation experiment

| PHI type | #Inst. | PARTIAL MATCHES: MIST | PARTIAL MATCHES: HIDE | FULLY-CONTAINED MATCHES: MIST | FULLY-CONTAINED MATCHES: HIDE |
|---|---|---|---|---|---|
| Patient Name | 206 | 0.51 | 0.51 | 0.49 | 0.51 |
| Relative Name | 30 | 0 | 0.13 | 0 | 0.10 |
| Healthcare Provider Name | 492 | 0.58 | 0.61 | 0.54 | 0.59 |
| Other Person Name | 20 | 0 | 0.20 | 0 | 0.20 |
| Street City | 137 | 0.28 | 0.48 | 0.28 | 0.43 |
| State Country | 161 | 0.58 | 0.71 | 0.58 | 0.70 |
| Deployment | 43 | 0.19 | 0.28 | 0.16 | 0.21 |
| ZIP code | 4 | 0 | 0 | 0 | 0 |
| Healthcare Unit Name | 1453 | 0.55 | 0.61 | 0.52 | 0.58 |
| Other Org Name | 86 | 0.10 | 0.29 | 0.09 | 0.25 |
| Date | 2547 | 0.93 | 0.94 | 0.89 | 0.92 |
| Age > 89 | 4 | 0 | 0 | 0 | 0 |
| Phone Number | 90 | 0.27 | 0.88 | 0.23 | 0.78 |
| Electronic Address | 4 | 0.75 | 0.75 | 0 | 0.75 |
| SSN | 16 | 0.37 | 0.62 | 0.37 | 0.56 |
| Other ID Number | 123 | 0.37 | 0.72 | 0.34 | 0.65 |
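For readers unfamiliar with the procedure behind the 10-fold cross-validation tables, the following Python sketch shows one common way to partition a document collection into ten folds, training on nine and testing on the remaining one in turn. How the documents were actually partitioned for these experiments is not specified here, so the shuffling, the seed, and the function name `k_fold_splits` are assumptions.

```python
import random


def k_fold_splits(documents, k=10, seed=0):
    """Yield (train, test) document lists for k-fold cross-validation."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)          # fixed seed for reproducibility
    folds = [docs[i::k] for i in range(k)]     # k roughly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


documents = [f"doc{i}" for i in range(25)]
for train, test in k_fold_splits(documents):
    print(len(train), len(test))
```

Each document appears in exactly one test fold, so every PHI instance in the corpus is scored once by a model that never saw its document during training.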
Discussion
“Out-of-the-box” evaluation
Ten-fold cross-validation experiment
Partial matches analysis
Systems errors analysis
-
Unusual PHI formats: Some PHI types appear in our documents in unusual formats that the de-identification systems do not always detect. Rule-based systems missed these annotations because none of their regular expressions captures such formats, whereas machine learning-based systems missed them because of differences between the training and testing feature vectors or because no training feature captured these atypical PHI formats. However, the de-identification systems sometimes partially captured these unusual formats, as reflected by an increase in partial matches. For example, the Date instance ‘Jan 09, 2000@07:54:32’ was partially detected by all three rule-based de-identification systems, and missed by the machine learning-based systems when trained with the i2b2 de-identification corpus. Other examples of missed annotations were phone numbers that included country codes, and day intervals such as ‘WFS’ or ‘M-F’.
-
Acronyms and abbreviations: The Healthcare Unit Name PHI type includes many acronyms and abbreviations referring to healthcare facilities that are frequently missed by all five de-identification systems. For instance, ‘MH’ (Mental Health), ‘ECF’ (Extended Care Facility), or ‘ENT’ (Ear, Nose, Throat) were always missed by all systems. A possible reason is that these systems were developed with a different definition of what constitutes PHI. Deciding whether a term is PHI is at times a judgment call. Admittedly, in our evaluation we defined PHI in a relatively broad sense, and others may consider that these terms do not meet the strict definition of PHI.
-
Lack of examples: This mainly affects machine learning-based systems, and it causes PHI patterns with few instances in the corpus to be missed. For instance, fewer than 10 instances of Date formats like ‘020309’ or ‘70’s’ were found in our corpus, and they were often missed. This highlights one of the key weaknesses of all machine learning-based systems: their dependence on sufficient training data.
-
Main causes of spurious PHI annotations (i.e., false positives):
-
Measurements: A common error made by all systems was confusing medical measurements with Dates or other numerical identifiers. For instance, in the phrases “rating: 1/10” and “maintain CVD 8-10”, ‘1/10’ and ‘8-10’ were annotated as Dates.
-
Ambiguous words: Common words that could also represent PHI were often wrongly annotated by the systems, especially by rule-based de-identification systems. For instance, ‘BROWN’, ‘GRAY’, ‘WALKER’ or ‘CUTTER’ are ambiguous words that were sometimes wrongly recognized as PHI by the systems.
-
De-identification systems’ specific PHI types: A few PHI types annotated by de-identification systems were not included in our PHI specification. For example, HMS Scrubber annotates expressions like ‘30 days’ as ages; however, these annotations are not considered PHI in our reference standard.
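The measurement errors described above are easy to reproduce with a permissive date pattern. The Python sketch below first shows how a naive pattern flags ‘1/10’ and ‘8-10’ as dates, and then how a simple check of the preceding words could suppress such hits. The pattern, the cue-word list (chosen to fit the two examples in the text), and the function `date_candidates` are hypothetical and are not taken from any of the evaluated systems.

```python
import re

# Deliberately permissive pattern: one or two digits, a slash or hyphen,
# one or two digits.
DATE_LIKE = re.compile(r"\b\d{1,2}[/-]\d{1,2}\b")
# Hypothetical cue words suggesting a score or range rather than a date.
MEASUREMENT_CUES = {"rating", "scale", "score", "cvd"}


def date_candidates(text, use_context=True, lookback=2):
    """Return date-like strings, optionally skipping those preceded by a cue word."""
    hits = []
    for m in DATE_LIKE.finditer(text):
        preceding = re.findall(r"[A-Za-z]+", text[:m.start()])[-lookback:]
        if use_context and any(w.lower() in MEASUREMENT_CUES for w in preceding):
            continue  # likely a measurement, not a date
        hits.append(m.group())
    return hits


for phrase in ["rating: 1/10", "maintain CVD 8-10", "seen on 3/14"]:
    print(phrase, date_candidates(phrase, use_context=False), date_candidates(phrase))
```

A context check of this kind trades some recall for precision: any genuine date that happens to follow a cue word would be missed, which is why such lists need tuning on the target documents.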