Background
Methods
Definitions
Quasi-identifiers
Equivalence classes
Uniqueness
Threat model and risk measurement
Context
Notation
Measuring uniqueness
Estimating uniqueness
Empirical evaluation
Simulation
Data sets
Data set | Description | Quasi-identifiers | No. Records |
---|---|---|---|
Adult | The adult dataset from the UC Irvine machine learning data repository. This is an extract from the US census and has common demographics and socio-economic status variables: ftp://ftp.ics.uci.edu/pub/machine-learning-databases/adult | Age; Profession; Education; Marital status; Race; Sex; Country | 32,561 |
FARS | Department of Transportation fatal crash information: http://www-fars.nhtsa.dot.gov/main.cfm | Age; Race; Month of Death; Day of Death | 43,330 |
CUP | Data from the Paralyzed Veterans Association on veterans with spinal cord injuries or disease: http://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html | ZIP code; Age; Gender; Income | 95,412 |
Pharm | Prescription records from the Children’s Hospital of Eastern Ontario pharmacy from July 2006 to March 2009. This is for inpatients only and excludes acute cases. A de-identified version of this data was disclosed to commercial data aggregators [67]. | Age; Postal code (FSA); Admission date; Discharge date; Sex | 16,424 |
ED | Emergency department records from Children’s Hospital of Eastern Ontario from 1st June 2007 to 1st June 2009. This data is disclosed for the purpose of disease outbreak surveillance. | Admission date; Postal Code; Date of Birth; Sex | 108,344 |
Niday | A registry of all newborns in Ontario from 1st April 2004 to 31st March 2009. This data set is used frequently for research purposes: http://www.bornontario.ca | Maternal postal code; Baby DoB; Mother DoB; Baby sex | 637,964 |
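To make concrete how the quasi-identifiers listed for each data set relate to uniqueness, the sketch below groups records into equivalence classes on a chosen set of quasi-identifiers and reports the fraction of records that sit alone in their class. It is only an illustration: the function name `sample_uniqueness`, the field names, and the four toy records are assumptions made up here, and the result is uniqueness within the sample, not an estimate of population uniqueness.

```python
from collections import Counter


def sample_uniqueness(records, quasi_identifiers):
    """Fraction of records that are unique on the given quasi-identifiers.

    `records` is a list of dicts; `quasi_identifiers` is a list of keys.
    Records sharing the same quasi-identifier values form one equivalence class.
    """
    # Build the equivalence-class key for every record.
    keys = [tuple(record[q] for q in quasi_identifiers) for record in records]
    if not keys:
        return 0.0
    # Size of each equivalence class.
    class_sizes = Counter(keys)
    # A record is sample unique when its class has exactly one member.
    unique_count = sum(1 for key in keys if class_sizes[key] == 1)
    return unique_count / len(keys)


# Toy records (hypothetical values, not taken from any of the data sets).
records = [
    {"age": 34, "sex": "F", "race": "White"},
    {"age": 34, "sex": "F", "race": "White"},
    {"age": 51, "sex": "M", "race": "Black"},
    {"age": 29, "sex": "F", "race": "Asian"},
]

print(sample_uniqueness(records, ["age", "sex", "race"]))  # 0.5
```

Adding quasi-identifiers can only split equivalence classes, never merge them, so the fraction returned never decreases as more fields are included.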
Measurement
Model combination
Ethics
Results
Discussion
Summary and implications
Applications in practice
- Financial constraints: how much money will the adversary spend on a re-identification attack? Costs will be incurred to acquire databases. For example, the construction of a single profession-specific database using semi-public registries that can be used for re-identification attacks in Canada costs between $150,000 and $188,000 [49]. In the US, the cost for the voter registration list from Alabama is more than $28,000, $5,000 for Louisiana, more than $8,000 for New Hampshire, $12,000 for Wisconsin and $17,000 for West Virginia [39].
- Time constraints: how much time will the adversary spend to acquire registries useful for a re-identification attack? For example, suppose that one of the registries that the adversary would use is the discharge abstract database from hospitals. Forty-eight states collect data on inpatients [74], and 26 states make their state inpatient databases (SIDs) available through the Agency for Healthcare Research and Quality (AHRQ) [75]. The SIDs for the remaining states would also be available directly from each individual state, but the process may be more complicated and time-consuming in this example. Would an adversary be satisfied with only the AHRQ states, or would they put in the time to get the data from other states as well?
- Willingness to misrepresent themselves: to what extent will the adversary be willing to misrepresent themselves to get access to public or semi-public registries? For example, some states only make their voter registration lists available to political parties or candidates (e.g., California) [39]. Would an adversary be willing to misrepresent themselves to get these lists? Also, some registries are available at a lower cost for academic use versus commercial use. Would a non-academic adversary misrepresent themselves as an academic to reduce their registry acquisition costs?
- Willingness to violate agreements: to what extent would the adversary be willing to violate data sharing agreements or other contracts that they need to sign to get access to registries? For example, acquiring the SIDs through the AHRQ requires that the recipient sign a data sharing agreement which prohibits re-identification attempts. Would the adversary still attempt a re-identification even after signing such an agreement?
- Willingness to commit illegal acts: to what extent would an adversary break the law to obtain access to registries that can be used for re-identification? For example, privacy legislation and the Elections Act in Canada restrict the use of voter lists to running and supporting election activities [49]. There is at least one known case where a charity allegedly supporting a terrorist group was able to obtain Canadian voter lists through deception for fundraising purposes [76-78].
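The financial constraint can be made tangible with a little arithmetic on the voter list prices quoted in the first item. The sketch below totals those prices and shows which lists a cheapest-first purchase could cover under a given budget. The "more than" figures are treated as floors, so every total is a lower bound; the budget value and the names `voter_list_floor_usd` and `affordable_lists` are hypothetical and chosen only for illustration.

```python
# Voter registration list prices quoted above, in US dollars [39].
# "More than" figures (Alabama, New Hampshire) are treated as floors,
# so every total computed here is a lower bound on the real cost.
voter_list_floor_usd = {
    "Alabama": 28_000,
    "Louisiana": 5_000,
    "New Hampshire": 8_000,
    "Wisconsin": 12_000,
    "West Virginia": 17_000,
}


def affordable_lists(prices, budget):
    """Return the lists bought cheapest-first without exceeding `budget`.

    Buying cheapest-first maximises the number of lists acquired,
    which is one simple way to model a cost-constrained adversary.
    """
    bought, spent = [], 0
    for state, price in sorted(prices.items(), key=lambda item: item[1]):
        if spent + price > budget:
            break
        bought.append(state)
        spent += price
    return bought, spent


total_floor = sum(voter_list_floor_usd.values())
print(total_floor)  # 70000

# Hypothetical budget of $30,000.
print(affordable_lists(voter_list_floor_usd, 30_000))
# (['Louisiana', 'New Hampshire', 'Wisconsin'], 25000)
```

Under these assumptions, acquiring all five lists costs at least $70,000, and a $30,000 budget covers only three of them.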
- Voter registration lists, court records, obituaries published in newspapers or on-line, telephone directories, private property security registries, land registries, and registries of donations to political parties (which often include at least full address).
- Professional and sports associations often post information about their members and teams (e.g., lists of lawyers, doctors, engineers, and teachers with their basic demographics, and information about sports teams with their demographics, height, weight and other physical and performance characteristics).
- Certain employers, for example educational and research establishments and law firms, often post information about their staff on-line.