Quality and Complexity Measures for Data Linkage and Deduplication

Quality Measures in Data Mining

Part of the book series: Studies in Computational Intelligence (SCI, volume 43)

Copyright information

© 2007 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Christen, P., Goiser, K. (2007). Quality and Complexity Measures for Data Linkage and Deduplication. In: Guillet, F.J., Hamilton, H.J. (eds) Quality Measures in Data Mining. Studies in Computational Intelligence, vol 43. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-44918-8_6

  • DOI: https://doi.org/10.1007/978-3-540-44918-8_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44911-9

  • Online ISBN: 978-3-540-44918-8

  • eBook Packages: Engineering, Engineering (R0)
