On Sample Size and Classification Accuracy: A Performance Comparison

  • Conference paper
Biological and Medical Data Analysis (ISBMDA 2005)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3745)

Abstract

We investigate the dependency between sample size and classification accuracy for three classification techniques (Naïve Bayes, Support Vector Machines and Decision Trees) over a set of 8500 text excerpts extracted automatically from narrative reports from the Brigham & Women’s Hospital, Boston, USA. Each excerpt describes the smoking status of a patient as one of: current smoker, past smoker, never a smoker, or denies smoking. Our empirical results, consistent with [1], confirm that the size of the training set and the classification rate are indeed correlated. Although all three algorithms perform reasonably well with small datasets, both SVM and Decision Trees show a substantial improvement in performance as the number of cases increases, suggesting a more consistent learning process. Unlike the majority of evaluations, ours were carried out specifically in a medical domain, where limited data is a common occurrence [13][14]. This study is part of the I2B2 project, Core 2.
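
The kind of experiment the abstract describes, measuring accuracy as the training set grows, can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method or data: the excerpts are synthetic, the vocabulary in `CLASS_WORDS` and `NOISE_WORDS` is invented, only a minimal multinomial Naive Bayes with Laplace smoothing is shown (no SVM or Decision Tree), and every function name is hypothetical.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical vocabulary per smoking-status class; NOT taken from the paper's data.
CLASS_WORDS = {
    "current": ["smokes", "currently", "pack", "daily"],
    "past": ["quit", "former", "stopped", "ago"],
    "never": ["never", "nonsmoker", "lifelong", "no"],
    "denies": ["denies", "tobacco", "use", "reports"],
}
NOISE_WORDS = ["patient", "history", "the", "of", "years", "smoking"]


def make_excerpt(label, rng, length=6, signal=0.5):
    """Build one synthetic excerpt: each token is a class word with
    probability `signal`, otherwise an uninformative noise word."""
    return [
        rng.choice(CLASS_WORDS[label]) if rng.random() < signal
        else rng.choice(NOISE_WORDS)
        for _ in range(length)
    ]


def make_dataset(n, rng):
    """Return n (tokens, label) pairs with labels drawn uniformly."""
    labels = list(CLASS_WORDS)
    data = []
    for _ in range(n):
        label = rng.choice(labels)
        data.append((make_excerpt(label, rng), label))
    return data


def train_nb(data):
    """Collect class counts, per-class word counts and the vocabulary."""
    class_counts = Counter(label for _, label in data)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in data:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Pick the class with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(count / total)
        for tok in tokens:
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, test):
    """Fraction of test excerpts classified correctly."""
    return sum(predict_nb(model, toks) == lab for toks, lab in test) / len(test)


def learning_curve(sizes, n_test=500, seed=0):
    """Train on growing prefixes of one pool; evaluate on a fixed test set."""
    rng = random.Random(seed)
    pool = make_dataset(max(sizes), rng)
    test = make_dataset(n_test, rng)
    return [(n, accuracy(train_nb(pool[:n]), test)) for n in sizes]


if __name__ == "__main__":
    for n, acc in learning_curve([10, 50, 200, 1000]):
        print(f"train size {n:5d}  accuracy {acc:.3f}")
```

Training on nested prefixes of a single pool and scoring against one held-out test set keeps the comparison across sizes fair: only the amount of training data changes between points on the curve. Repeating the run with several seeds would give a variance estimate for each size.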

References

  1. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)

  2. McKay, M., Fitzgerald, M.A., Beckman, R.J.: Sample Size Effects When Using R2 to Measure Model Input Importance. Technical Report LA-UR-99-1357. Los Alamos National Laboratory, Los Alamos, NM, USA

  3. Yang, Y.: Expert network: Effective and efficient learning from human decisions in text categorization and retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 13–22 (1994)

  4. Lewis, D.D., Ringuette, M.: A comparison of two learning algorithms for text categorization. In: Third Annual Symposium on Document Analysis and Information Retrieval, pp. 81–93 (1994)

  5. Wiener, E., Pedersen, J.O., Weigend, A.S.: A neural network approach to topic spotting. In: Proceedings of the 4th Annual Symposium of Document Analysis and Information Retrieval (1995)

  6. Cohen, W.W., Singer, Y.: Context-sensitive learning methods for text categorization. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 307–315 (1996)

  7. Joachims, T.: Text categorization with support vector machines: Learning with many relevant features. In: Proceedings of the 10th European Conference on Machine Learning, Springer, Heidelberg (1998)

  8. Dumais, S., Platt, J., Heckerman, D.: Inductive Learning Algorithms and Representations for Text Categorization. In: Proceedings of the 7th International Conference on Information and Knowledge Management (1998)

  9. McCallum, A., Nigam, K.: A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI 1998 Workshop on Learning for Text Categorization, p. 1286 (1998)

  10. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)

  11. Ghani, R.: Using Error-Correcting Codes for Text Classification. In: Workshop on Text Mining at the First IEEE Conference on Data Mining (2001)

  12. Raudys, S.J., Jain, A.K.: Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence 13(3) (March 1991)

  13. Wilcox, A., Hripcsak, G.: Classification algorithms applied to narrative reports. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, pp. 455–459 (1999)

  14. Webber Chapman, W., Haug, P.J.: Comparing Expert Systems for Identifying Chest X-ray Reports that Support Pneumonia. In: Proceedings of the American Medical Informatics Association (AMIA) Symposium, pp. 216–220 (1999)

  15. McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: AAAI 1998 Workshop on Learning for Text Categorization (1998)

  16. Ng, A., Jordan, M.: On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. NIPS 14 (2002)

  17. Gilles, G.: Semi-supervised learning. Tech. Rep. Dept. Informatique et Recherche Operationnelle, Universite de Montreal, Montreal, QC, Canada H3C 3J7

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sordo, M., Zeng, Q. (2005). On Sample Size and Classification Accuracy: A Performance Comparison. In: Oliveira, J.L., Maojo, V., Martín-Sánchez, F., Pereira, A.S. (eds) Biological and Medical Data Analysis. ISBMDA 2005. Lecture Notes in Computer Science, vol 3745. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11573067_20

  • DOI: https://doi.org/10.1007/11573067_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29674-4

  • Online ISBN: 978-3-540-31658-9

  • eBook Packages: Computer Science, Computer Science (R0)
