Three-party protocols
Some protocols rely on exact matching of encrypted keys based on phonetically transformed identifiers by a third party. Such protocols are used for cancer registries [
19,
20] and information exchange between hospitals. In the proposal of [
15,
16] identifiers are transformed according to phonetic rules and subsequently encrypted with a one-way hash function. To prevent some cryptographic attacks on this protocol, the identifiers are combined with a common pad before hashing. The hash values are transferred to a third party who hashes them again using another pad. Then the third party performs exact matching on the resulting hash values. Despite exact matching, the linkage allows for some errors in identifiers, because hash values of phonetic encodings are matched. Providing database owners do not collude with the third party the protocol is secure. However, string comparison using phonetic encodings usually yields more false positive links than string similarity functions [
21‐
23].
[
11] suggested a protocol based on hashed values of sets of consecutive letters (
q-grams, see below). For each string, the database holders
A and
B create for each record the power set of the
q-grams of their identifiers. Each subset of the power set is hashed by an HMAC algorithm using a common secret key of the database owners.
A and
B form tuples containing the hash values, the number of
q-grams in the hashed subset and the total number of
q-grams and an encryption of the identifiers to a third party
C. The number of tuples is much larger than the number of records. To calculate the string similarity between two strings
a and
b,
C computes a similarity measure based on the information in the tuples. As [
11] shows,
C is able to determine a similarity measure of
a and
b by selecting the highest similarity coefficient of the tuples associated with
a and
b. To prevent frequency attacks, [
11] propose to use an additional trusted party, thereby extending the number of parties involved to four. Furthermore, they recommend hiding the tuples among tuples created from dummy strings using Rivest's "chaffing and winnowing" technique [
24]. Apart from an increase of computational and communication costs [
25,
26], the protocol is prone to frequency attacks on the hashes of the
q-gram subsets with just one
q-gram [
17,
18].
[
27] used the value of a second identifier for padding every single character of a string before encryption. Subsequently, a third party is able to compare strings on the character level and to compute a string similarity. This elegant protocol requires a total flawless second identifier. However, a second identifier with few different values is open to a frequency attack.
In the protocol of [
28] two data holders, holding lists of names, build an embedding space from random strings and embed their respective strings therein using the SparseMap method [
29,
30]. Then, each data holder sends the embedded strings to a third party which determines their similarity. To create the embedding space, data holder
A generates
n random strings and builds
z reference sets from them. Next,
A reduces the number of reference sets by the greedy resampling heuristic of SparseMap to the best
k <
z reference sets. These
k reference sets are used to embed the names in a
k-dimensional space. The coordinates for a given name are approximations of the distances between the name to the closest random string in each of the
k reference sets in terms of the edit distance. As a result, for each name
A receives a
k-dimensional vector. After receiving the
k reference sets from
A,
B embeds his names in the same way. Finally, both data holders send their vectors to a third party,
C, who compares them using the standard
Euclidean distance between them. Using SparseMap allows the mapping of strings into the vector space avoiding prohibitive computational costs. This is accomplished by the reduction of dimensions using the greedy resampling method and by the distance approximations. However, the experiments in [
28] indicate that the linkage quality is significantly affected by applying the greedy resampling heuristic.
Pang and Hansen [
31] suggested a protocol based on a set of reference strings common to
A and
B. For a given identifier, both database holders compute the distances,
d, between each identifier string and all reference strings in the set. If
d is less than a threshold
δ, the respective reference string is encrypted using a key previously agreed on by
A and
B. For each identifier string, the resulting set of encrypted reference strings along with their distances,
d, and an ID number form a tuple. Both database holders send their tuples to a third party
C. For every pair of ID numbers where the encrypted reference strings agree,
C sums the distances,
d, and finds the minimum of this sum. If this minimum lies below a second threshold
δ
sim
, the two original identifier strings are classified as a match. The performance of the protocol depends crucially on the set of reference strings. Unless this is a superset of the original strings the performance is rather discouraging.
A different approach to solve the privacy-preserving record linkage problem for numerical keys is taken by [
32]. They suggest using anonymized versions of the data sets for a first linkage step that is capable of classifying a large portion of record pairs correctly as matches or mismatches. Only those pairs which cannot be classified as matches or mismatches will be used in a costly secure multi-party protocol for computing similarities.
Two-party protocols
[
33] suggested a protocol that allows two parties to compute the distance between two strings without exchanging them. Due to the large amount of necessary communication to compare two strings, such a protocol is unsuited for tasks with large lists of strings as required by privacy-preserving record linkage [
28]. The protocol suggested by [
34] uses a secure set intersection protocol described in [
35]. However, this protocol requires extensive computations and is therefore also regarded as impractical for linking large databases [
4,
28].
The protocol of Yakout et al. [
36] assumes that the data holders have already transformed their names into vectors as described by Scannapieco et al. [
28] and is designed to compare them without resorting to a third party. In the first phase, the two data holders reduce the number of candidate string pairs by omitting pairs which are unlikely to be similar. In the second phase of the protocol, the standard
Euclidean distance between the remaining candidate vector pairs is computed using a secure scalar product protocol. Yakout et al. demonstrate that neither party must reveal their vectors in the computations. Although more parsimoneous, this protocol cannot outperform the protocol of Scannapieco et al. [
28].