To the Editor,
I read with interest the article by Yao et al. [1], entitled “Computer algorithm can match physicians’ decisions about blood transfusions”. In this study, the authors designed a multilayer perceptron neural network to predict the appropriateness of intra-operative blood transfusions. Of the 4946 patients who received an intra-operative blood transfusion, 3604 cases were classified as appropriate and 1342 as inappropriate by expert anesthesiologists on the basis of World Health Organization guidelines. The authors claimed that the neural network trained on these data matched human judgement with a promising accuracy of 96.8%: the computer’s decisions matched the experts’ in 99% of the appropriate cases and in 90.9% of the inappropriate ones.

The concept of the study is interesting, and I believe that machine learning models could be useful tools for screening for blood transfusion overconsumption on a larger scale. Nevertheless, the reported accuracy should be interpreted with caution. A key principle in building reliable machine learning models is to split the dataset into separate training, validation and test sets. The training set is the subset of data used to fit the model’s parameters, whereas the validation set is used to tune its hyperparameters, for example to select the number of hidden layers in a neural network; k-fold cross-validation is a valuable alternative to a fixed validation set. After training and validation, it is common practice to evaluate the model on unseen (i.e. independent) test samples in order to obtain a reliable estimate of the generalization error, that is, the error rate on predictions for new data [2].

In this study, no splitting procedure is described in the materials and methods section, and it is stated that all 4946 data entries were entered into the model. Moreover, the supplementary Python script suggests that the model was trained, validated and tested on exactly the same data entries. The reported accuracies are therefore likely to be optimistically biased, should be interpreted as the results of a potentially overfitted model, and cannot be assumed to hold for independent data samples.
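The splitting procedure described above can be illustrated with a minimal sketch. It is not the authors' script and does not use their data: the helper names (`split_indices`, `k_fold_indices`, `Memorizer`), the 70/15/15 split and the synthetic labels are all assumptions made for illustration. The toy model simply memorizes its training examples, which stands in for an overfitted network. The labels are random and carry no signal, so any accuracy above the majority-class rate reflects memorization rather than learning.

```python
import random


def split_indices(n, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle indices 0..n-1 and cut them into disjoint train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


def k_fold_indices(n, k=5, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all indices
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


class Memorizer:
    """Hypothetical toy model: looks up training examples, else predicts the majority class."""

    def fit(self, X, y):
        self.table = dict(zip(X, y))
        self.default = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.table.get(x, self.default) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic data (assumption): one unique feature value per case, with labels
# drawn at random at roughly the 3604:1342 class ratio mentioned in the letter.
rng = random.Random(42)
n = 4946
X = list(range(n))
y = [1 if rng.random() < 3604 / 4946 else 0 for _ in range(n)]

train_idx, val_idx, test_idx = split_indices(n)
model = Memorizer().fit([X[i] for i in train_idx], [y[i] for i in train_idx])

# Evaluating on the data used for fitting versus on held-out data.
train_acc = accuracy([y[i] for i in train_idx], model.predict([X[i] for i in train_idx]))
test_acc = accuracy([y[i] for i in test_idx], model.predict([X[i] for i in test_idx]))
print(f"accuracy on training data: {train_acc:.3f}")
print(f"accuracy on held-out test data: {test_acc:.3f}")
```

Under these assumptions the memorizing model is perfect on the examples it was fitted to, while its accuracy on the held-out cases falls back to about the share of the majority class. Only the second figure says anything about performance on new patients.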
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.