Probabilistic characterization of nearest neighbor classifier

Dhurandhar, Amit; Dobra, Alin

doi:10.1007/s13042-012-0091-y

Probabilistic characterization of nearest neighbor classifier

Original Article
Published: 27 April 2012

Volume 4, pages 259–272, (2013)
Cite this article

International Journal of Machine Learning and Cybernetics Aims and scope Submit manuscript

Amit Dhurandhar¹ &
Alin Dobra²

290 Accesses
19 Citations
Explore all metrics

Abstract

The k-nearest neighbor classification algorithm (kNN) is one of the most simple yet effective classification algorithms in use. It finds major applications in text categorization, outlier detection, handwritten character recognition, fraud detection and in other related areas. Though sound theoretical results exist regarding convergence of the generalization error (GE) of this algorithm to Bayes error, these results are asymptotic in nature. The understanding of the behavior of the kNN algorithm in real world scenarios is limited. In this paper, assuming categorical attributes, we provide a principled way of studying the non-asymptotic behavior of the kNN algorithm. In particular, we derive exact closed form expressions for the moments of the GE for this algorithm. The expressions are functions of the sample, and hence can be computed given any joint probability distribution defined over the input–output space. These expressions can be used as a tool that aids in unveiling the statistical behavior of the algorithm in settings of interest viz. an acceptable value of k for a given sample size and distribution. Moreover, Monte Carlo approximations of such closed form expressions have been shown in Dhurandhar and Dobra (J Mach Learn Res 9, 2008; ACM Trans Knowl Discov Data 3, 2009) to be a superior alternative in terms of speed and accuracy when compared with computing the moments directly using Monte Carlo. This work employs the semi-analytical methodology that was proposed recently to better understand the non-asymptotic behavior of learning algorithms.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Article Open access 15 November 2019

Learning from imbalanced data: open challenges and future directions

Article Open access 22 April 2016

References

Abello J, Pardalos P, Resende M (eds) (2002) Handbook of massive data sets. Kluwer, Norwell
Blum A, Kalai A, Langford J (1999) Beating the hold-out: bounds for k-fold and progressive cross-validation. In: Computational learing theory
Bremner D, Demaine E, Erickson J, Iacono J, Langerman S, Morin P, Toussaint G (2005) Output-sensitive algorithms for computing nearest-neighbor decision boundaries. Discrete Comput Geom 33:593–604
Article MathSciNet MATH Google Scholar
Connor-Linton J (2003) Chi square tutorial. http://www.georgetown.edu/faculty/ballc/webtools/web_chi_tut.html
Dhurandhar A, Dobra A (2008) Probabilistic characterization of random decision trees. J Mach Learn Res 9:2287–2314
MathSciNet Google Scholar
Dhurandhar A, Dobra A (2009) Semi-analytical method for analyzing models and model selection measures based on moment analysis. ACM Trans Knowl Discov Data 3:1–51
Article Google Scholar
Duda R, Hart P, Stork D (2001) Pattern classification, 2 edn. Wiley, New York
Hall M, Holmes G (2003) Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans KDE
Hu Q, Pan W, An S, Ma P, Wei J (2010) An efficient gene selection technique for cancer recognition based on neighborhood mutual information. Int J Mach Learn Cybern 1:63–74
Article Google Scholar
Kohavi R (1995) A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the fourteenth IJCAI
Krause E (1987) Taxicab geometry: an adventure in non-Euclidean geometry. Dover, New York
Liu W, White A (1997) Metrics for nearest neighbour discrimination with categorical attributes. In: Research and development in expert systems XIV: Proceedings of the 17th annual technicial conference of the BCES Specialist Group, pp 51–59
Maggini M, Giles C, Horne B (1997) Financial time series forecasting using k-nearest neighbors classification. In: Nonlinear financial forecasting, pp 169–181
Moore A, Lee M (1994) Efficient algorithms for minimizing cross validation error. In: International conference on machine learning, pp 190–198
Nigsch F, Bender A, Buuren B, Tissen J, Nigsch E, Mitchell J (2006) Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization. J Chem Inf Model 46:2412–2422
Article Google Scholar
Park B, Samworth R (2008) Choice of neighbor order in nearest-neighbor classification. Ann Stat 36:2135–2152
Article MathSciNet MATH Google Scholar
Qin Y, Zheng D, Zhao T (2012) Research on search results optimization technology with category features integration. Int J Mach Learn Cybern 3:71–76
Article Google Scholar
Rajagopalan B, Lall U (1999) A k-nearest neighbor simulator for daily precipitation and other weather variables. Water Resour Res 35:3089–3101
Article Google Scholar
Shao J (2003) Mathematical statistics. Springer, Berlin
Sidorov G, Koeppen M, Cruz-Corts N (2011) Recent advances in machine learning techniques and applications. Int J Mach Learn Cybern 2:123–124
Google Scholar
Sim J, Kim S, Lee J (2005) Prediction of protein solvent accessibility using fuzzy k-nearest neighbor method. Bioinform Comput Appl Biosci 21:2844–2849
Google Scholar
Stanfill C, Waltz D (1986) Toward memory-based reasoning. Commun ACM 29(12):1213–1228
Article Google Scholar
Stone C (1977) Consistent nonparametric regression. Ann Stat 5(4):595–645
Article MATH Google Scholar
Vapnik V (1998) Statistical learning theory. Wiley, New York
Yang K, Shahabi C (2007) An efficient k nearest neighbor search for multivariate time series. Inf Comput 205:65–98
Article MathSciNet MATH Google Scholar

Download references

Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant No. 0448264.

Author information

Authors and Affiliations

IBM TJ Watson, Yorktown Heights, USA
Amit Dhurandhar
University of Florida, Gainesville, USA
Alin Dobra

Authors

Amit Dhurandhar
View author publications
You can also search for this author in PubMed Google Scholar
Alin Dobra
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Amit Dhurandhar.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Dhurandhar, A., Dobra, A. Probabilistic characterization of nearest neighbor classifier. Int. J. Mach. Learn. & Cyber. 4, 259–272 (2013). https://doi.org/10.1007/s13042-012-0091-y

Download citation

Received: 18 April 2011
Accepted: 24 March 2012
Published: 27 April 2012
Issue Date: August 2013
DOI: https://doi.org/10.1007/s13042-012-0091-y

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Probabilistic characterization of nearest neighbor classifier

Abstract

Access this article

Similar content being viewed by others

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Learning from imbalanced data: open challenges and future directions

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Probabilistic characterization of nearest neighbor classifier

Abstract

Access this article

Similar content being viewed by others

A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science

A survey on semi-supervised learning

Learning from imbalanced data: open challenges and future directions

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation