01.12.2012 | Research article | Issue 1/2012 | Open Access

BMC Medical Informatics and Decision Making 1/2012

Predicting sample size required for classification performance

Journal:
BMC Medical Informatics and Decision Making > Issue 1/2012
Authors:
Rosa L Figueroa, Qing Zeng-Treitler, Sasikiran Kandula, Long H Ngo
Important notes

Electronic supplementary material

The online version of this article (doi:10.1186/1472-6947-12-8) contains supplementary material, which is available to authorized users.
Rosa L Figueroa, Qing Zeng-Treitler, Sasikiran Kandula and Long H Ngo contributed equally to this work.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

QZ and RLF conceived the study. SK and RLF designed and implemented experiments. SK and RLF analyzed data and performed statistical analysis. QZ and LN participated in study design and supervised experiments and data analysis. RLF drafted the manuscript. Both SK and QZ had full access to all of the data and made critical revisions to the manuscript. All authors read and approved the final manuscript.

Abstract

Background

Supervised learning methods need annotated data in order to generate efficient models. Annotated data, however, is a relatively scarce resource and can be expensive to obtain. For both passive and active learning methods, there is a need to estimate the size of the annotated sample required to reach a performance target.

Methods

We designed and implemented a method that fits an inverse power law model to points of a given learning curve created using a small annotated training set. Fitting is carried out using nonlinear weighted least squares optimization. The fitted model is then used to predict the classifier's performance and confidence interval for larger sample sizes. For evaluation, the nonlinear weighted curve fitting method was applied to a set of learning curves generated using clinical text and waveform classification tasks with active and passive sampling methods, and predictions were validated using standard goodness of fit measures. As a control, we used an un-weighted fitting method.
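
The fitting step can be illustrated with a minimal sketch. It is not the authors' R implementation (see the supplementary material), and several parts are assumptions made for illustration: the parametrization `a - b * x**c` with `c < 0`, the weights that grow with sample size, the synthetic learning-curve points, and the function names. Instead of a general nonlinear optimizer, the sketch uses the fact that the model is linear in `a` and `b` once the exponent `c` is fixed, so it scans `c` over a grid and solves a weighted linear least squares problem at each value. Confidence intervals are omitted.

```python
def inverse_power_law(x, a, b, c):
    """Predicted performance at sample size x (assumed form: a - b * x**c, c < 0)."""
    return a - b * x ** c


def fit_weighted(sizes, scores, weights, c_grid=None):
    """Weighted least squares fit of a - b * x**c to learning-curve points.

    For a fixed exponent c the model is linear in (a, b), so the 2x2 weighted
    normal equations are solved in closed form while c is scanned over a grid.
    Returns the (a, b, c) with the smallest weighted sum of squared errors.
    """
    if c_grid is None:
        c_grid = [-k / 100 for k in range(1, 301)]  # -0.01, -0.02, ..., -3.00
    best = None
    sw = sum(weights)
    swy = sum(w * y for w, y in zip(weights, scores))
    for c in c_grid:
        z = [-(x ** c) for x in sizes]  # model becomes y = a + b * z
        swz = sum(w * zi for w, zi in zip(weights, z))
        swzz = sum(w * zi * zi for w, zi in zip(weights, z))
        swzy = sum(w * zi * y for w, zi, y in zip(weights, z, scores))
        det = sw * swzz - swz * swz
        if abs(det) < 1e-15:
            continue  # z is nearly constant, so a and b are not identifiable
        b = (sw * swzy - swz * swy) / det
        a = (swy - b * swz) / sw
        sse = sum(w * (y - (a + b * zi)) ** 2
                  for w, zi, y in zip(weights, z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("no exponent in c_grid produced a solvable fit")
    return best[1], best[2], best[3]


# Synthetic, noise-free learning curve (hypothetical values, not study data).
sizes = list(range(10, 101, 10))
scores = [inverse_power_law(x, 0.9, 1.5, -0.5) for x in sizes]
# Hypothetical weighting: later (larger-sample) points count more.
weights = [(j + 1) / len(sizes) for j in range(len(sizes))]

a, b, c = fit_weighted(sizes, scores, weights)
predicted_at_1000 = inverse_power_law(1000, a, b, c)
```

Passing equal weights to `fit_weighted` gives an un-weighted fit that can serve as the comparison baseline.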

Results

A total of 568 models were fitted and the model predictions were compared with the observed performances. Depending on the data set and sampling method, it took between 80 and 560 annotated samples to achieve mean absolute error and root mean squared error below 0.01. Results also show that our weighted fitting method outperformed the baseline un-weighted method (p < 0.05).
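
As a rough illustration of the two error measures, the sketch below compares predicted and observed performance values. It assumes the first measure is the mean absolute error (MAE) and the second the root mean squared error (RMSE). The function names and the three data points are hypothetical and are not results from the study.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted| over all points."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared difference."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


# Hypothetical observed vs. predicted classifier performance.
observed = [0.80, 0.85, 0.88]
predicted = [0.81, 0.84, 0.88]

mae = mean_absolute_error(observed, predicted)
rmse = root_mean_squared_error(observed, predicted)
both_below_threshold = mae < 0.01 and rmse < 0.01
```

RMSE penalizes large individual errors more heavily than MAE, so reporting both gives a fuller picture of how far predictions stray from the observed curve.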

Conclusions

This paper describes a simple and effective sample size prediction algorithm that conducts weighted fitting of learning curves. The algorithm outperformed an un-weighted algorithm described in previous literature. It can help researchers determine annotation sample size for supervised machine learning.
Supplementary material
Additional file 1: Appendix 1 is a PDF file with the main lines of R code that implement curve fitting using inverse power models. (PDF 18 KB)
12911_2011_460_MOESM1_ESM.PDF
Additional file 2: Appendix 2 is a PDF file that contains more details about the active learning methods used to generate the learning curves. (PDF 34 KB)
12911_2011_460_MOESM2_ESM.PDF
Authors’ original file for figure 1
12911_2011_460_MOESM3_ESM.png
Authors’ original file for figure 2
12911_2011_460_MOESM4_ESM.png
Authors’ original file for figure 3
12911_2011_460_MOESM5_ESM.png
Authors’ original file for figure 4
12911_2011_460_MOESM6_ESM.png
Authors’ original file for figure 5
12911_2011_460_MOESM7_ESM.png