Abstract
We investigate the dependency between sample size and classification accuracy of three classification techniques: Naïve Bayes, Support Vector Machines and Decision Trees over a set of 8500 text excerpts extracted automatically from narrative reports from the Brigham & Women’s Hospital, Boston, USA. Each excerpt refers to the smoking status of a patient as: current, past, never a smoker or, denies smoking. Our empirical results, consistent with [1], confirm that size of the training set and the classification rate are indeed correlated. Even though these algorithms perform reasonably well with small datasets, as the number of cases increases, both SMV and Decision Trees show a substantial improvement in performance, suggesting a more consistent learning process. Unlike the majority of evaluations, ours were carried out specifically in a medical domain where the limited amount of data is a common occurrence [13][14]. This study is part of the I2B2 project, Core 2.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Manning, C.D., Schutze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
McKay, M., Fitzgerald, M.A., Beckman, R.J.: Sample Size Effects When Using R2 to Measure Model Input Importance. Technical Report LA-UR-99-1357. Los Alamos National Laboratory, Los Alamos, NM, USA
Yang, Y.: Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 13–22 (1994)
Lewis, D.D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In: Third Annual Symposium on Document Analysis and Information Retrieval, pp. 81–93 (1994)
Wiener, E., Pedersen, J.O., Weigend, A.S.: A neural network approach to topic spotting. In: Proceedings of the 4th Annual Symposium of Document Analysis and Information Retrieval (1995)
Cohen, W.W., Singer, Y.: Context-sensitive learning methods for text categorization. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 307–315 (1996)
Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning, Springer, Heidelberg (1998)
Dumais, S., Platt, J., Heckerman, D.: Inductive Learning Algorithms and Representations for Text Categorization. In: Proceedings of the 7th International. Conference on Information and Knowledge Management (1998)
McCallum, A., Nigam, K.: A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI 1998 Workshop on Learning for Text Categorization, p. 1286 (1998)
Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
Ghani, R.: Using Error-Correcting Codes for Text Classification. In: Workshop on Text Mining at the First IEEE Conference on Data Mining (2001)
Raudys, S.J., Jain, A.K.: Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(3) (March 1991)
Wilcox, A., Hripcsak, G.: Classification algorithms applied to narrative reports. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, pp. 455–459 (1999)
Webber Chapman, W., Haug, P.J.: Comparing Expert Systems for Identifying Chest X-ray Reports that Support Pneumonia. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, pp. 216–220 (1999)
McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization (1998)
Ng, A., Jordan, M.: On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. NIPS 14 (2002)
Gilles, G.: Semi-supervised learning. Tech. Rep. Dept. Informatique et Recherche Operationnelle, Universite de Montreal, Montreal, QC, Canada H3C 3J7
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sordo, M., Zeng, Q. (2005). On Sample Size and Classification Accuracy: A Performance Comparison. In: Oliveira, J.L., Maojo, V., Martín-Sánchez, F., Pereira, A.S. (eds) Biological and Medical Data Analysis. ISBMDA 2005. Lecture Notes in Computer Science(), vol 3745. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573067_20
Download citation
DOI: https://doi.org/10.1007/11573067_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29674-4
Online ISBN: 978-3-540-31658-9
eBook Packages: Computer ScienceComputer Science (R0)