
01.12.2012 | Research article | Issue 1/2012 | Open Access

BMC Medical Informatics and Decision Making 1/2012

Efficient algorithms for fast integration on large data sets from multiple sources

Journal:
BMC Medical Informatics and Decision Making > Issue 1/2012
Authors:
Tian Mi, Sanguthevar Rajasekaran, Robert Aseltine
Important notes

Electronic supplementary material

The online version of this article (doi:10.1186/1472-6947-12-59) contains supplementary material, which is available to authorized users.

Competing interests

All authors declare that they have no competing interests.

Authors’ contributions

TM contributed to the implementation of the algorithms, testing and analysis of the synthetic and real data, manuscript preparation, algorithm development, and performance analysis. SR contributed to algorithm development, analysis of the results, performance analysis, and manuscript preparation. RA contributed to data preparation, results analysis, and performance analysis. All authors read and approved the final manuscript.

Abstract

Background

Recent large-scale deployments of health information technology have created opportunities for the integration of patient medical records with disparate public health, human service, and educational databases to provide comprehensive information related to health and development. Data integration techniques, which identify records belonging to the same individual across multiple data sets, are essential to these efforts. Several algorithms have been proposed in the literature that are adept at integrating records from two different datasets. Our algorithms are aimed at integrating multiple (in particular, more than two) datasets efficiently.

Methods

Hierarchical-clustering-based solutions are used to integrate multiple (in particular, more than two) datasets. Edit distance is used as the basic distance measure, and distance calculations that account for common input errors are also studied. Several techniques have been applied to improve the algorithms in terms of both time and space: 1) Partial Construction of the Dendrogram (PCD), which ignores the levels above the threshold; 2) Ignoring the Dendrogram Structure (IDS); 3) Faster Computation of the Edit Distance (FCED), which uses upper bounds on the edit distance to predict whether the distance is within the threshold; and 4) a pre-processing blocking phase that limits the dynamic-programming distance computations to records within the same block.
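
As an illustration only (not the authors' implementation), the following Python sketch combines two of the ideas above under simplifying assumptions: a pre-processing blocking phase keyed on a hypothetical last-name prefix, and a threshold-aware edit distance that abandons the dynamic-programming table once the distance provably exceeds the threshold. This is a generic stand-in for the paper's FCED (which relies on upper bounds on the edit distance); the record fields and the blocking key are invented for the example.

# Sketch: blocking + threshold-aware edit distance (illustrative, not the paper's code).
from collections import defaultdict


def bounded_edit_distance(a: str, b: str, threshold: int) -> int:
    """Levenshtein distance that returns threshold + 1 as soon as the
    distance is guaranteed to exceed `threshold`. The length difference is
    a cheap lower bound; row-minimum pruning stops the DP early. This is a
    generic stand-in for the paper's FCED upper-bound test."""
    if abs(len(a) - len(b)) > threshold:
        return threshold + 1
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            curr[j] = min(prev[j] + 1,               # deletion
                          curr[j - 1] + 1,           # insertion
                          prev[j - 1] + (ca != cb))  # substitution
        if min(curr) > threshold:                    # whole row above threshold: give up
            return threshold + 1
        prev = curr
    return prev[-1]


def block_records(records, key=lambda r: r["last_name"][:2].lower()):
    """Group records by a cheap blocking key (hypothetical: first two letters
    of the last name) so pairwise comparisons run only within each block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks


if __name__ == "__main__":
    people = [
        {"id": 1, "last_name": "Johnson", "first_name": "Ann"},
        {"id": 2, "last_name": "Jonson",  "first_name": "Ann"},
        {"id": 3, "last_name": "Smith",   "first_name": "Bob"},
    ]
    for key, block in block_records(people).items():
        for i in range(len(block)):
            for j in range(i + 1, len(block)):
                d = bounded_edit_distance(block[i]["last_name"],
                                          block[j]["last_name"], threshold=2)
                print(key, block[i]["id"], block[j]["id"], d)

Because comparisons are confined to records that share a blocking key and each comparison can terminate early, the quadratic all-pairs cost is paid only within small blocks rather than over the entire dataset.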

Results

We have experimentally validated our algorithms on large simulated as well as real data. Accuracy and completeness are defined stringently to demonstrate the performance of our algorithms. In addition, we employ a four-category analysis. A comparison with FEBRL shows the robustness of our approach.

Conclusions

In the experiments we conducted, the observed accuracy exceeded 90% for the simulated data in most cases. Accuracies of 97.7% and 98.1% were achieved for the constant and proportional thresholds, respectively, on a real dataset of 1,083,878 records.
Supplementary material

Additional file 2: Introduction to the real data. (PDF 51 KB) (12911_2011_513_MOESM2_ESM.pdf)
Authors' original file for Figure 1 (12911_2011_513_MOESM3_ESM.pdf)