Published in: European Journal of Nuclear Medicine and Molecular Imaging 12/2020

6 April 2020 | Original Article

Effect of machine learning re-sampling techniques for imbalanced datasets in 18F-FDG PET-based radiomics model on prognostication performance in cohorts of head and neck cancer patients

Authors: Chenyi Xie, Richard Du, Joshua WK Ho, Herbert H Pang, Keith WH Chiu, Elaine YP Lee, Varut Vardhanabhuti


Abstract

Purpose

Biomedical data are frequently imbalanced, which makes it challenging to achieve good predictive performance with data-driven machine learning approaches. In this study, we investigated the impact of re-sampling techniques for imbalanced datasets on a PET radiomics-based prognostication model in head and neck cancer (HNC) patients.

Methods

Radiomics analysis was performed in two cohorts of patients: 166 patients newly diagnosed with nasopharyngeal carcinoma (NPC) at our centre and 182 HNC patients from an open database. Conventional PET parameters and robust radiomics features were extracted for correlation analysis with overall survival (OS) and disease progression-free survival (DFS). We investigated a cross-combination of 10 re-sampling methods (oversampling, undersampling, and hybrid sampling) with 4 machine learning classifiers for survival prediction. Diagnostic performance was assessed in hold-out test sets. Statistical differences were analysed using Monte Carlo cross-validation followed by post hoc Nemenyi analysis.
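As a rough illustration of how oversampling can be combined with Monte Carlo cross-validation, the sketch below draws repeated stratified random splits and applies a SMOTE-style interpolation to the training portion only, so the hold-out portion keeps its original class ratio. This is not the authors' pipeline: the function names (`smote`, `monte_carlo_split`, `resampled_training_set`), the neighbour count `k=3`, the 30% test fraction, and the toy data are all assumptions made for the example.

```python
import math
import random


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def monte_carlo_split(labels, test_fraction, rng):
    """One stratified random train/test split, returned as index lists."""
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        n_test = max(1, round(test_fraction * len(idx)))
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return train, test


def resampled_training_set(X, y, train_idx, rng, minority_label=1):
    """Oversample the minority class of the training split to parity."""
    X_tr = [X[i] for i in train_idx]
    y_tr = [y[i] for i in train_idx]
    minority = [x for x, lab in zip(X_tr, y_tr) if lab == minority_label]
    n_new = max(0, (len(y_tr) - len(minority)) - len(minority))
    new_points = smote(minority, n_new, rng=rng)
    return X_tr + new_points, y_tr + [minority_label] * len(new_points)


# Toy, randomly generated data (hypothetical): 40 majority vs 10 minority cases.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(40)]
X += [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(10)]
y = [0] * 40 + [1] * 10

for repeat in range(3):  # each repeat is one Monte Carlo split
    train_idx, test_idx = monte_carlo_split(y, 0.3, rng)
    X_bal, y_bal = resampled_training_set(X, y, train_idx, rng)
    print(repeat, len(test_idx), y_bal.count(0), y_bal.count(1))
```

Re-sampling inside each split, rather than before splitting, keeps synthetic points derived from test cases out of the training data. A classifier would then be fitted on `X_bal, y_bal` and scored on the untouched test indices.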

Results

Oversampling techniques such as ADASYN and SMOTE could improve prediction performance in terms of G-mean and F-measure in the minority class, without significant loss of F-measure in the majority class. We identified an optimal PET radiomics-based prediction model of OS (AUC of 0.82, G-mean of 0.77) for our NPC cohort. Oversampling techniques similarly improved prediction performance when tested on an external dataset, indicating generalisability.
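The metrics named above can be computed directly from confusion counts. The following sketch assumes the usual binary definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the harmonic mean of precision and recall for a given class); the two sets of predictions are invented toy numbers, not results from the study.

```python
import math


def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fn, fp, tn), treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fn, fp, tn


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(y_true, y_pred, label):
    """F1 score computed with `label` as the positive class."""
    tp, fn, fp, _ = confusion_counts(y_true, y_pred, label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example (hypothetical): 90 majority cases (0) and 10 minority cases (1).
y_true = [0] * 90 + [1] * 10
always_majority = [0] * 100                              # never predicts class 1
rebalanced = [0] * 81 + [1] * 9 + [1] * 7 + [0] * 3      # finds 7 of 10 minority

for name, y_pred in [("always majority", always_majority),
                     ("rebalanced", rebalanced)]:
    print(name,
          round(accuracy(y_true, y_pred), 3),
          round(g_mean(y_true, y_pred), 3),
          round(f_measure(y_true, y_pred, 1), 3),
          round(f_measure(y_true, y_pred, 0), 3))
```

In this toy case the classifier that always predicts the majority class has the higher accuracy (0.90 versus 0.88) but a G-mean of 0, which is why accuracy alone is a poor guide on imbalanced data.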

Conclusion

Our study showed that applying re-sampling techniques had a significant positive impact on prediction performance in imbalanced datasets. We have created an open-source solution for automated calculations and comparisons of multiple re-sampling techniques and machine learning classifiers for easy replication in future studies.
Metadata
Publisher
Springer Berlin Heidelberg
Published in
European Journal of Nuclear Medicine and Molecular Imaging / Issue 12/2020
Print ISSN: 1619-7070
Electronic ISSN: 1619-7089
DOI
https://doi.org/10.1007/s00259-020-04756-4
