Background
Record linkage is the process of identifying records that appear in different databases and refer to the same entity (e.g. an individual) [1] but do not share a common unique identifier. In record linkage, the status of a pair of records is either matching (same individual) or non-matching (distinct individuals). The process consists of three successive steps: data preprocessing (curation of the data), record pair comparison, and linkage. Data preprocessing includes harmonizing data formats and dealing with missing values. Record pair comparison can be computationally expensive, as the number of possible record pairs is the product of the numbers of records in the two datasets. To reduce the number of comparisons, it is common to perform blocking, which consists of splitting the datasets into smaller sets that agree on one or more variables, called blocking variables. Only records within the same block are then compared. When no unique person identifier is shared between the two datasets, linkage has to be performed by comparing shared matching variables. Linkage performance is assessed against a gold standard (or ground truth) using a confusion matrix [2]. Record linkage can produce two types of errors: False Positives (FP), i.e. true non-matches classified as matches, and False Negatives (FN), i.e. true matches classified as non-matches.
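The effect of blocking on the number of comparisons can be illustrated with a minimal Python sketch. The toy records, the choice of birth year as blocking variable, and the helper name `block_pairs` are assumptions made for illustration only; they are not taken from the GEMO or GENEPSO data.

```python
from collections import defaultdict
from itertools import product


def block_pairs(records_a, records_b, key):
    """Yield only the record pairs that agree on the blocking key."""
    blocks = defaultdict(list)
    for rec in records_b:
        blocks[key(rec)].append(rec)
    for rec_a in records_a:
        # .get avoids creating empty blocks for keys absent from dataset B
        for rec_b in blocks.get(key(rec_a), []):
            yield rec_a, rec_b


# Hypothetical toy records: (identifier, surname, birth year)
dataset_a = [("a1", "martin", 1960), ("a2", "durand", 1972), ("a3", "petit", 1960)]
dataset_b = [("b1", "martin", 1960), ("b2", "durant", 1972), ("b3", "moreau", 1985)]

# Without blocking: every record of A is compared with every record of B
all_pairs = list(product(dataset_a, dataset_b))

# With blocking on birth year: only pairs sharing the same birth year remain
candidates = list(block_pairs(dataset_a, dataset_b, key=lambda r: r[2]))

print(len(all_pairs), len(candidates))  # 9 3
```

In this toy example blocking reduces nine comparisons to three. A recording error in the birth year of a true match would place the two records in different blocks, so the pair would never be compared.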
Linkage methods are usually classified as either deterministic or probabilistic [1, 3, 4]. Deterministic record linkage methods assess matching status based on the exact agreement or disagreement of all or a fraction of the matching variables. If the data are of very good quality (i.e. no more than 5% of missing data or errors in any matching variable), deterministic linkage can achieve satisfactory quality. Otherwise, it will produce a large number of FNs [5]. By contrast, Probabilistic Record Linkage (PRL) aims to determine the probability that two records refer to the same individual. Rather than requiring exact agreement of the matching variables, PRL can use similarity scores between the values taken by the matching variables. PRL also takes into account the difference in discriminatory power between matching variables: the more frequent a value of a matching variable is, the less discriminative it is for linkage. In practice, PRL can give better results than deterministic linkage when the data are not of good quality [6]. To allow for typing errors or spelling changes, the values of a matching variable in two records are compared using a similarity function, which returns a similarity score. These scores are used as input by linkage methods to classify record pairs into matches and non-matches. In PRL, the choice of the threshold on the likelihood scores that separates matches from non-matches is critical and has a direct impact on the relative numbers of FP and FN [7]. Although the threshold could be estimated by controlling the theoretical FP and FN rates [3], the most common practice is to examine the empirical distribution of scores and choose the threshold according to a predefined FP or FN rate. From a machine learning point of view, record linkage can be considered a classification task. Each record pair is represented by a comparison vector containing, for each matching variable, the similarity score between the two records. A supervised machine learning (ML) algorithm learns a model that takes such a comparison vector as input and returns the matching status as output, based on a training set in which the matching status of the record pairs is known. Various ML algorithms have been applied to record linkage, such as Classification Tree (CT), Support Vector Machines (SVM), Neural Networks (NNET), and Random Forest (RF) [8–13]. However, their application is usually limited by the need for a training set.
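The notions of similarity function, comparison vector and score threshold can be sketched as follows. This is a simplified illustration under stated assumptions: the similarity function is Python's `difflib.SequenceMatcher` ratio rather than the function used in the study, the pair score is an unweighted mean of the similarities rather than a PRL likelihood score, and the field names, records and threshold of 0.8 are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score in [0, 1]; missing values score 0."""
    if a is None or b is None:
        return 0.0
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def comparison_vector(rec_a, rec_b, fields):
    """One similarity score per matching variable."""
    return [similarity(rec_a[f], rec_b[f]) for f in fields]


def classify(vector, threshold):
    """Toy decision rule: mean similarity compared with a threshold.

    This is NOT a probabilistic linkage score; it only shows how a
    threshold turns a comparison vector into a match / non-match status.
    """
    score = sum(vector) / len(vector)
    return score, score >= threshold


fields = ["surname", "first_name", "birth_year"]  # hypothetical matching variables
rec_a = {"surname": "durand", "first_name": "marie", "birth_year": 1972}
rec_b = {"surname": "durant", "first_name": "marie", "birth_year": 1972}
rec_c = {"surname": "moreau", "first_name": "claire", "birth_year": 1985}

score_1, match_1 = classify(comparison_vector(rec_a, rec_b, fields), threshold=0.8)
score_2, match_2 = classify(comparison_vector(rec_a, rec_c, fields), threshold=0.8)

print(round(score_1, 3), match_1)  # 0.944 True
print(match_2)                     # False
```

A supervised ML model would replace `classify`: instead of averaging, it would learn from labeled comparison vectors how to map each vector to a matching status.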
We therefore conducted a study in which we learned an ML model from a training set whose ground truth was established by PRL followed by manual review. Because PRL and ML methods may make distinct errors [14], we also propose combining PRL with the trained ML model. We applied this hybrid linkage process to match individuals between the GEMO (Genetic Modifiers of BRCA1 and BRCA2) [15] and the GENEPSO (prospective cohort of BRCAx mutation carriers) [16] studies, building our PRL + ML approach on a first version of those studies and applying it to their updated versions.
GEMO and GENEPSO are two independent, ongoing nationwide studies involving BRCA1 and BRCA2 carriers, with unconnected databases and no shared participant identifier. BRCA1 and BRCA2 gene testing has become part of routine clinical practice in European countries and North America since the identification of the two genes in the 1990s, which greatly improved recommendations on breast and ovarian cancer risk management. Nonetheless, both retrospective and prospective studies on large datasets of BRCA1 and BRCA2 (BRCA1/2) mutation carrier families are still needed to refine individual cancer risk estimates using different cancer risk factors such as genetic factors, lifestyle/environmental factors, family history and breast pathology.
GEMO and GENEPSO provide an overview of a well-characterized sample of counseled Hereditary Breast and Ovarian Cancer (HBOC) families in France. Through the GEMO study, blood DNA from BRCA1/2 mutation carriers is available for genetic epidemiological projects aiming to identify and characterize genetic factors modifying breast and ovarian cancer risk. In the prospective cohort GENEPSO, which aims to assess environmental and lifestyle risk factors, BRCA1/2 mutation carriers are followed over time to observe the characteristics of subjects who develop either primary or secondary cancer.
GEMO and GENEPSO were set up at different times by two different coordinating centers and investigators involved in the Genetics and Cancer Group (GCG, UNICANCER) [17], a French multicenter group composed of clinicians, molecular geneticists and scientists. Participants in both studies undergo genetic counseling and are invited to participate in GEMO and/or GENEPSO through the family cancer clinics if they test positive for a mutation in BRCA1 or BRCA2. About 26% of index cases carrying such a mutation (i.e. the first individual tested in the family) are included in GEMO, and 21% in GENEPSO [18]. Therefore, it is essential to identify the overlap between the participants of both studies by linking the two data sources, which will allow setting up studies that simultaneously evaluate genetic and non-genetic factors modifying cancer risk in carriers of a BRCA1 or BRCA2 mutation. Studies conducted in subjects enrolled in both studies will also allow, for instance, assessment of whether response to treatment can be predicted from BRCA1/2 mutation status and other genetic variant profiles.
Discussion
PRL has a lower computational cost, but its linkage quality depends on the choice of the threshold on the likelihood score. Lower thresholds lead to more FP whereas higher thresholds lead to more FN. The ML approach reaches higher precision and therefore requires fewer manual reviews. However, the blocking step can lead to FN if the data contain errors in blocking variables. We found that the combined PRL + ML method, which has a higher recall than either of the two methods alone, improves linkage by identifying more true matches, but at the cost of additional manual reviews.
In a context where the cost of manual review is to be capped and missing some true matches is tolerated, the ML approach, which has a much higher precision at the expense of a lower recall, is an interesting option. Another possibility, which we expect from our results on dataset 1 (Fig. 3) to reach higher recall and higher precision, would be to use PRL + ML with a higher threshold for PRL (such as 0.68 in our study). Here, our goal was to identify as many common participants as possible between the two studies, so as to facilitate research projects requiring both genetic and follow-up data. We therefore chose the PRL + ML approach with a relatively low threshold of 0.6 for PRL, so as to maximize recall.
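The trade-off between precision, recall and manual review burden can be made concrete with a small sketch. It assumes, for illustration only, that the combined approach sends to manual review every pair flagged by either PRL or ML (a set union), and it uses invented pair identifiers rather than results from the study.

```python
def precision_recall(predicted, truth):
    """Compute precision and recall of a set of predicted matches."""
    tp = len(predicted & truth)   # true matches correctly identified
    fp = len(predicted - truth)   # non-matches classified as matches
    fn = len(truth - predicted)   # true matches that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall


# Hypothetical record-pair identifiers, invented for this example
truth = {1, 2, 3, 4, 5}        # true matches (gold standard)
prl = {1, 2, 3, 6, 7}          # pairs scoring above the PRL threshold
ml = {1, 2, 4}                 # pairs classified as matches by the ML model
combined = prl | ml            # assumed combination rule: union of both sets

for name, predicted in [("PRL", prl), ("ML", ml), ("PRL + ML", combined)]:
    p, r = precision_recall(predicted, truth)
    print(f"{name}: precision={p:.2f} recall={r:.2f} to_review={len(predicted)}")
```

In this toy setting the union recovers a true match that each method misses on its own, so recall rises from 0.60 to 0.80, while the number of pairs to review grows from five (PRL) or three (ML) to six.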
We expect linkage performance to be related to the number of matching variables. Had more matching variables been shared between GEMO and GENEPSO, the most discriminating matching variables could have been identified using feature selection algorithms, resulting in a lower computational cost. Here, with 10 matching variables, such a strategy was not necessary. On the other hand, if too few matching variables had been available, one could expect ML models to have lower performance, giving the advantage to PRL.
Previously, Elfeky et al. [30] described a hybrid technique for record linkage that combines supervised and unsupervised machine learning methods. Record pairs were assigned a matching or non-matching status through unsupervised clustering, and the resulting labeled data were then used as a training dataset for a supervised model [2]. However, this technique was not suitable here, since two unsupervised machine learning methods (K-means and bagged K-means) each showed poor performance (Supplementary Data, Table S5), probably because of our imbalanced data.
In this study, the two databases are limited in size, and larger databases may be more challenging for record linkage. In that case, the traditional blocking technique that we employed here is a first step towards reducing computational complexity. In addition, partitioning the data into a larger number of smaller blocks and processing them in parallel with our hybrid record linkage process could keep the computational time reasonable. Finally, to decrease the burden of manual review, one could aim for high precision rather than high recall by choosing a higher PRL score threshold. The manual review could then serve to tune the linkage method.
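Processing blocks in parallel could look like the following sketch. It is not the pipeline used in this study: the per-block rule in `link_block` is a placeholder (exact agreement on surname) standing in for the hybrid PRL + ML step, the records and function names are hypothetical, and a thread pool is used only to keep the example self-contained (a process pool would be the more usual choice for CPU-bound comparisons).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_block(records, key):
    """Partition records into blocks sharing the same blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


def link_block(task):
    """Placeholder linkage within one block: exact agreement on surname."""
    block_a, block_b = task
    return [(a["id"], b["id"])
            for a in block_a for b in block_b
            if a["surname"] == b["surname"]]


def link_in_parallel(records_a, records_b, key, max_workers=4):
    """Link each shared block independently and concatenate the results."""
    blocks_a = group_by_block(records_a, key)
    blocks_b = group_by_block(records_b, key)
    shared_keys = sorted(blocks_a.keys() & blocks_b.keys())
    tasks = [(blocks_a[k], blocks_b[k]) for k in shared_keys]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(link_block, tasks))
    return [pair for block_result in results for pair in block_result]


# Hypothetical toy records
records_a = [{"id": "a1", "surname": "martin", "birth_year": 1960},
             {"id": "a2", "surname": "durand", "birth_year": 1972}]
records_b = [{"id": "b1", "surname": "martin", "birth_year": 1960},
             {"id": "b2", "surname": "durand", "birth_year": 1985}]

links = link_in_parallel(records_a, records_b, key=lambda r: r["birth_year"])
print(links)  # [('a1', 'b1')]
```

Because blocks are independent, each one can be linked without reference to the others. The pair (a2, b2) is never compared because the two records fall into different blocks.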
Acknowledgements
We thank the patients and the participants in the contributing studies.
The Genetic Modifiers of Cancer Risk in BRCA1/2 Mutation Carriers (GEMO) study is a study from the National Cancer Genetics Network «UNICANCER Genetic Group», France. We wish to pay a tribute to Olga M. Sinilnikova, who with Dominique Stoppa-Lyonnet initiated and coordinated GEMO until she sadly passed away on the 30th June 2014. The team in Lyon (Olga Sinilnikova, Mélanie Léone, Laure Barjhoux, Carole Verny-Pierre, Sylvie Mazoyer, Francesca Damiola, Valérie Sornin) managed the GEMO samples until the biological resource center was transferred to Paris in December 2015 (Noura Mebirouk, Fabienne Lesueur, Dominique Stoppa-Lyonnet). We want to thank all the GEMO collaborating groups for their contribution to this study. Coordinating Center: Service de Génétique, Institut Curie, Paris: Muriel Belotti, Ophélie Bertrand, Anne-Marie Birot, Bruno Buecher, Sandrine M. Caputo, Chrystelle Colas, Anaïs Dupré, Emmanuelle Fourme, Marion Gauthier-Villars, Lisa Golmard, Claude Houdayer, Marine Le Mentec, Virginie Moncoutier, Antoine de Pauw, Claire Saule, Dominique Stoppa-Lyonnet, and Inserm U900, Institut Curie, Paris: Fabienne Lesueur, Noura Mebirouk. Contributing Centers: Unité Mixte de Génétique Constitutionnelle des Cancers Fréquents, Hospices Civils de Lyon - Centre Léon Bérard, Lyon: Nadia Boutry-Kryza, Alain Calender, Sophie Giraud, Mélanie Léone. Institut Gustave Roussy, Villejuif: Brigitte Bressac-de-Paillerets, Olivier Caron, Marine Guillaud-Bataille. Centre Jean Perrin, Clermont–Ferrand: Yves-Jean Bignon, Nancy Uhrhammer. Centre Léon Bérard, Lyon: Valérie Bonadona, Christine Lasset. Centre François Baclesse, Caen: Pascaline Berthet, Laurent Castera, Dominique Vaur. Institut Paoli Calmettes, Marseille: Violaine Bourdon, Catherine Noguès, Tetsuro Noguchi, Cornel Popovici, Audrey Remenieras, Hagay Sobol. CHU Arnaud-de-Villeneuve, Montpellier: Isabelle Coupier, Pierre-Olivier Harmand, Pascal Pujol, Paul Vilquin. 
Centre Oscar Lambret, Lille: Aurélie Dumont, Françoise Révillion. Centre Paul Strauss, Strasbourg: Danièle Muller. Institut Bergonié, Bordeaux: Emmanuelle Barouk-Simonet, Françoise Bonnet, Virginie Bubien, Michel Longy, Nicolas Sévenet. Institut Claudius Regaud, Toulouse: Laurence Gladieff, Rosine Guimbaud, Viviane Feillel, Christine Toulas. CHU Grenoble: Hélène Dreyfus, Dominique Leroux, Magalie Peysselon, Christine Rebischung. CHU Dijon: Amandine Baurand, Geoffrey Bertolone, Fanny Coron, Laurence Faivre, Vincent Goussot, Caroline Jacquot, Caroline Sawka. CHU St-Etienne: Caroline Kientz, Marine Lebrun, Fabienne Prieur. Hôtel Dieu Centre Hospitalier, Chambéry: Sandra Fert-Ferrer. Centre Antoine Lacassagne, Nice: Véronique Mari. CHU Limoges: Laurence Vénat-Bouvet. CHU Nantes: Stéphane Bézieau, Capucine Delnatte. CHU Bretonneau, Tours and Centre Hospitalier de Bourges: Isabelle Mortemousque. Groupe Hospitalier Pitié-Salpétrière, Paris: Florence Coulet, Florent Soubrier, Mathilde Warcoin. CHU Vandoeuvre-les-Nancy: Myriam Bronner, Sarab Lizard, Johanna Sokolowska. CHU Besançon: Marie-Agnès Collonge-Rame, Alexandre Damette. CHU Poitiers, Centre Hospitalier d’Angoulême and Centre Hospitalier de Niort: Paul Gesta. Centre Hospitalier de La Rochelle: Hakima Lallaoui. CHU Nîmes Carémeau: Jean Chiesa. CHI Poissy: Denise Molina-Gomes. CHU Angers: Olivier Ingster. CHRU de Lille: Sylvie Manouvrier-Hanu, Sophie Lejeune.
GENEPSO Centers: the Coordinating Center: Institut Paoli-Calmettes, Marseille, France: Catherine Noguès, Lilian Laborde, Pauline Pontois and the Collaborating Centers: Institut Curie, Paris: Dominique Stoppa-Lyonnet, Marion Gauthier-Villars; Bruno Buecher, Institut Gustave Roussy, Villejuif: Olivier Caron; Hôpital René Huguenin/Institut Curie, Saint Cloud: Catherine Noguès, Emmanuelle Mouret-Fourme; Centre Paul Strauss, Strasbourg: Jean-Pierre Fricker; Centre Léon Bérard, Lyon: Christine Lasset, Valérie Bonadona; Centre François Baclesse, Caen: Pascaline Berthet; Hôpital d’Enfants CHU Dijon – Centre Georges François Leclerc, Dijon: Laurence Faivre; Centre Alexis Vautrin, Vandoeuvre-les-Nancy: Elisabeth Luporsi; Centre Antoine Lacassagne, Nice: Marc Frénay; Institut Claudius Regaud, Toulouse: Laurence Gladieff; Réseau Oncogénétique Poitou Charente, Niort: Paul Gesta; Institut Paoli-Calmettes, Marseille: Catherine Noguès, Hagay Sobol, François Eisinger, Jessica Moretta; Institut Bergonié, Bordeaux: Michel Longy, Centre Eugène Marquis, Rennes: Catherine Dugast; GH Pitié Salpétrière, Paris: Chrystelle Colas, Florent Soubrier; CHU Arnaud de Villeneuve, Montpellier: Isabelle Coupier, Pascal Pujol; Centres Paul Papin, and Catherine de Sienne, Angers, Nantes: Alain Lortholary; Centre Oscar Lambret, Lille: Philippe Vennin, Claude Adenis; Institut Jean Godinot, Reims: Tan Dat Nguyen; Centre René Gauducheau, Nantes: Capucine Delnatte; Centre Henri Becquerel, Rouen: Annick Rossi, Julie Tinat, Isabelle Tennevet; Hôpital Civil, Strasbourg: Jean-Marc Limacher; Christine Maugard; Hôpital Centre Jean Perrin, Clermont-Ferrand: Yves-Jean Bignon; Polyclinique Courlancy, Reims: Liliane Demange; Clinique Sainte Catherine, Avignon: Hélène Dreyfus; Hôpital Saint-Louis, Paris: Odile Cohen-Haguenauer; CHRU Dupuytren, Limoges: Brigitte Gilbert; Couple-Enfant-CHU de Grenoble: Dominique Leroux; Hôpital de la Timone, Marseille: Hélène Zattara-Cannoni.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.