On Sample Size and Classification Accuracy: A Performance Comparison

Sordo, Margarita; Zeng, Qing

doi:10.1007/11573067_20

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3745)

Included in the following conference series:

International Symposium on Biological and Medical Data Analysis

Abstract

We investigate the dependency between sample size and classification accuracy for three classification techniques: Naïve Bayes, Support Vector Machines (SVM) and Decision Trees, over a set of 8500 text excerpts extracted automatically from narrative reports from the Brigham & Women’s Hospital, Boston, USA. Each excerpt refers to the smoking status of a patient as one of: current, past, never a smoker, or denies smoking. Our empirical results, consistent with [1], confirm that the size of the training set and the classification rate are indeed correlated. Even though these algorithms perform reasonably well with small datasets, as the number of cases increases both SVM and Decision Trees show a substantial improvement in performance, suggesting a more consistent learning process. Unlike the majority of evaluations, ours were carried out specifically in a medical domain, where a limited amount of data is a common occurrence [13][14]. This study is part of the I2B2 project, Core 2.
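The kind of experiment the abstract describes, measuring accuracy as the training set grows, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' method, data or code: the excerpts are generated from a handful of made-up templates (`TEMPLATES`), the classifier is a small hand-written multinomial Naïve Bayes with Laplace smoothing, and the helper names (`make_dataset`, `train_nb`, `predict_nb`, `learning_curve`) are hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical templates standing in for real clinical excerpts (assumption).
TEMPLATES = {
    "current": ["patient smokes one pack per day",
                "currently smoking cigarettes daily",
                "continues to smoke tobacco"],
    "past": ["quit smoking ten years ago",
             "former smoker stopped in 1990",
             "ex smoker quit last year"],
    "never": ["never smoked", "lifelong nonsmoker", "has never used tobacco"],
    "denies": ["denies smoking", "denies tobacco use",
               "patient denies cigarettes"],
}


def make_dataset(n, seed):
    """Draw n (tokens, label) pairs from the toy templates, reproducibly."""
    rng = random.Random(seed)
    labels = sorted(TEMPLATES)
    data = []
    for _ in range(n):
        label = rng.choice(labels)
        data.append((rng.choice(TEMPLATES[label]).split(), label))
    return data


def train_nb(examples):
    """Collect the counts needed by a multinomial Naive Bayes classifier."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the label with the highest Laplace-smoothed log posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def learning_curve(train, test, sizes):
    """Train on the first n examples for each n and score on a fixed test set."""
    curve = []
    for n in sizes:
        model = train_nb(train[:n])
        correct = sum(predict_nb(model, toks) == lab for toks, lab in test)
        curve.append((n, correct / len(test)))
    return curve


train = make_dataset(400, seed=0)
test = make_dataset(100, seed=1)
sizes = [5, 20, 80, 400]
curve = learning_curve(train, test, sizes)
for n, acc in curve:
    print(f"n={n:4d}  accuracy={acc:.2f}")
```

Holding the test set fixed while only the training prefix grows keeps the accuracies comparable across sizes. The toy templates are far easier than real narrative text, so the numbers printed here say nothing about the results reported in the paper.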

References

1. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
2. McKay, M., Fitzgerald, M.A., Beckman, R.J.: Sample Size Effects When Using R2 to Measure Model Input Importance. Technical Report LA-UR-99-1357. Los Alamos National Laboratory, Los Alamos, NM, USA
3. Yang, Y.: Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 13–22 (1994)
4. Lewis, D.D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In: Third Annual Symposium on Document Analysis and Information Retrieval, pp. 81–93 (1994)
5. Wiener, E., Pedersen, J.O., Weigend, A.S.: A neural network approach to topic spotting. In: Proceedings of the 4th Annual Symposium of Document Analysis and Information Retrieval (1995)
6. Cohen, W.W., Singer, Y.: Context-sensitive learning methods for text categorization. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 307–315 (1996)
7. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning, Springer, Heidelberg (1998)
8. Dumais, S., Platt, J., Heckerman, D.: Inductive Learning Algorithms and Representations for Text Categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management (1998)
9. McCallum, A., Nigam, K.: A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI 1998 Workshop on Learning for Text Categorization, p. 1286 (1998)
10. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
11. Ghani, R.: Using Error-Correcting Codes for Text Classification. In: Workshop on Text Mining at the First IEEE Conference on Data Mining (2001)
12. Raudys, S.J., Jain, A.K.: Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(3) (March 1991)
13. Wilcox, A., Hripcsak, G.: Classification algorithms applied to narrative reports. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, pp. 455–459 (1999)
14. Webber Chapman, W., Haug, P.J.: Comparing Expert Systems for Identifying Chest X-ray Reports that Support Pneumonia. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, pp. 216–220 (1999)
15. McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization (1998)
16. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. NIPS 14 (2002)
17. Gilles, G.: Semi-supervised learning. Tech. Rep. Dept. Informatique et Recherche Operationnelle, Universite de Montreal, Montreal, QC, Canada H3C 3J7

Author information

Authors and Affiliations

Decision Systems Group, Harvard Medical School, Boston, MA, USA
Margarita Sordo & Qing Zeng

Editor information

Editors and Affiliations

University of Aveiro, DETI/IEETA, 3810-193, Aveiro, Portugal
José Luís Oliveira
Biomedical Informatics Group, Dep. Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Spain
Víctor Maojo
Medical Bioinformatics Department, Institute of Health ‘Carlos III’, Ctra. Majadahonda-Pozuelo, km 2. 28220 Majadahonda, Madrid
Fernando Martín-Sánchez
Department of Electronics and Telecommunications (DET/IEETA), University of Aveiro, 3810 193, Aveiro, Portugal
António Sousa Pereira

About this paper

Cite this paper

Sordo, M., Zeng, Q. (2005). On Sample Size and Classification Accuracy: A Performance Comparison. In: Oliveira, J.L., Maojo, V., Martín-Sánchez, F., Pereira, A.S. (eds) Biological and Medical Data Analysis. ISBMDA 2005. Lecture Notes in Computer Science, vol 3745. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573067_20

DOI: https://doi.org/10.1007/11573067_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29674-4
Online ISBN: 978-3-540-31658-9
eBook Packages: Computer Science, Computer Science (R0)
