Background
For routine clinical microbiology diagnostic laboratories, the highest workload is generated by urine samples from patients with suspected urinary tract infection (UTI) [1]. According to the UK Standards for Microbiology Investigations, a UTI is defined as the 'presence and multiplication of microorganisms, in one or more structures of the urinary tract, with associated tissue invasion'. The most common causative pathogen is E. coli, followed by other members of the Enterobacteriaceae family. The incidence of UTIs varies with age, gender, and comorbidities. Women experience a higher incidence than men, with 10–20% suffering at least one symptomatic UTI during their lifetime. Most UTIs that occur in men are associated with physiological abnormalities of the urinary tract. In children, UTIs are common but often difficult to diagnose because symptoms are non-specific. Where a UTI is suspected, a urine sample is collected for processing by a centralised diagnostic laboratory. Upon arrival, the sample undergoes microscopic analysis, microbiological culture and, where necessary, antimicrobial sensitivity testing [2]. However, many urine samples yield a negative culture, no significant bacterial isolate, or a mixed culture suggesting sample contamination. Such ambiguous and diagnostically unhelpful outcomes occur in approximately 70–80% of urine samples cultured [3–8], creating opportunities for significant cost savings. At the same time, diagnostic microbiology laboratories in the UK and elsewhere are transitioning to full laboratory automation [9–11]. To assist with the consolidation of services [12] and changes in laboratory practice, appropriate pre-processing and classification of urine samples prior to culture may be required to reduce the number of unnecessary cultures performed.
In many hospitals, automated urine microscopy is performed prior to culture using automated urine sediment analysers. This common precursor to culture reports the cellular content of the urine sample; where there is evidence of pyuria, antimicrobial sensitivity testing is set up alongside culture: in addition to culture on chromogenic agar, urine is applied directly to nutrient agar for sensitivity testing by the Kirby–Bauer method. The use of microscopic analysis, biochemical dip-stick testing, and flow cytometry for predicting urinary tract infection is well documented in the literature. The current consensus is that WBC count and bacterial count correlate with culture outcome [3, 4, 13], but not well enough to replace culture entirely. Here we explored the potential of a machine learning solution to reduce the burden of culturing the large number of culture-negative samples without reducing detection of culture-positive samples, with concessions made for particularly vulnerable patient groups.
We speculated that a statistical machine learning model that accounts not only for current diagnostic results but also for historical culture outcomes, clinical details, and demographic data could reduce laboratory workload without compromising the detection of UTIs. We contrast the classification performance of heuristic microscopy thresholds with three machine learning algorithms: a Random Forest classifier, a Neural Network with a single hidden layer, and the Extreme Gradient Boosting algorithm XGBoost. Random Forest classifiers are one of many ensemble methods, in which the predictions of multiple base estimators are combined to improve classification. In a Random Forest, multiple 'trees' are constructed, each from a bootstrap sample of the training data and a random subset of features; the final classification is obtained by aggregating the votes of all the 'trees', hence the name 'Random Forest' [14]. Neural Networks are supervised learning algorithms composed of multiple layers of 'perceptrons' with assigned weights; each unit sums its weighted inputs and passes the result through an activation function to produce an output. By optimising a loss function and adjusting the weights through a process called 'backpropagation', Neural Networks can learn non-linear relationships [14]. Boosting algorithms, such as the XGBoost algorithm used in this study, generate a decision tree from a sample of the training data. The performance of the trained classifier, when evaluated on all the training data, is used to generate sample weights that influence the next classifier. An iterative process follows, each round producing a new classifier informed by the misclassifications of the previous one [15].
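The ensemble idea behind Random Forests can be illustrated with a minimal pure-Python sketch that bags one-feature 'decision stumps' over bootstrap resamples and takes a majority vote. This is a simplification for illustration only: a real Random Forest grows full trees and also randomises the features considered at each split, and all function names here are our own.

```python
import random
from collections import Counter

def train_stump(X, y):
    """Fit a one-feature threshold rule ('decision stump') by exhaustive search."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [1 if sign * (row[f] - t) >= 0 else 0 for row in X]
                acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
                if best is None or acc > best[0]:
                    best = (acc, f, t, sign)
    return best[1:]  # (feature, threshold, sign)

def stump_predict(stump, row):
    f, t, sign = stump
    return 1 if sign * (row[f] - t) >= 0 else 0

def train_forest(X, y, n_trees=25, seed=0):
    """Bagging: each stump is trained on a bootstrap resample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    """Majority vote across the ensemble."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]
```

Because each stump sees a different resample, individual errors tend to cancel in the vote, which is the property the full algorithm exploits at scale.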
Discussion
To our knowledge, no other observational study of this magnitude has examined urine analysis for the diagnosis of UTIs. Most previous studies aiming to predict urine culture outcome from variables generated by sediment analysis, flow cytometry, and/or dip-stick testing have been controlled studies of a few hundred patients, with little consistency in the inclusion criteria [3, 4, 6, 13, 27–29]. Prior efforts to establish a heuristic model based on microscopy thresholds have generated conflicting results. Falbo et al. [4] and Inigo et al. [3] reported sensitivities of 96–98% and specificities of 59–63% using microscopy thresholds on sample populations of fewer than 1000. Both studies reported an optimum WBC count of 18 cells/μl but differing bacterial counts (44/μl and 97/μl, respectively). The variation between the two studies is likely due to small sample size. It should also be noted that neither study adjusted for pregnant patients or children under the age of 11, and the sensitivity of classification for vulnerable demographics was not reported. Additionally, more than 50% of samples in the study by Inigo et al. originated from inpatients, and both studies included specimens from catheterised patients [3, 4]. In contrast to those findings, Sterry-Blunt et al. [6] reported, from a study of 1411 samples, that the highest achievable negative predictive value when using white blood cell and bacterial count thresholds was 89·1%, and concluded that the SediMAX should not be used as a screening method prior to culture.
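The trade-offs reported in these threshold studies can be made concrete with a short sketch. The function below is illustrative only, not the published studies' code: it flags a sample as 'likely positive' when either count meets its threshold, then reports sensitivity, specificity, and negative predictive value against the culture result.

```python
def screen_performance(wbc, bact, culture_pos, wbc_t, bact_t):
    """Evaluate a simple 'either count over threshold' screening rule.

    wbc, bact      : per-sample WBC and bacterial counts (cells/ul)
    culture_pos    : 1 if the sample's culture was positive, else 0
    wbc_t, bact_t  : screening thresholds
    Returns (sensitivity, specificity, negative predictive value).
    """
    tp = fp = tn = fn = 0
    for w, b, pos in zip(wbc, bact, culture_pos):
        pred = (w >= wbc_t) or (b >= bact_t)
        if pred and pos:
            tp += 1
        elif pred:
            fp += 1
        elif pos:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)
    return sensitivity, specificity, npv
```

Sweeping `wbc_t` and `bact_t` over a labelled dataset is exactly the optimisation the cited studies performed, which is why small samples yield unstable 'optimal' thresholds.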
The use of flow cytometry for urine analysis prior to culture has been gaining popularity as a replacement for automated urine microscopy and shows good performance in the literature. Multiple studies have now shown that flow cytometry with optimised cell count thresholds provides greater specificity without compromising sensitivity when classifying urine samples [3, 27, 30–32]. Future work should investigate the benefit of machine learning algorithms that use cellular counts generated by flow cytometry rather than automated microscopy.
Taking advantage of recent developments in 'big data' technologies, our observational study analysed data representing an entire year of urine analysis at a large pathology service that covers sample processing for multiple hospitals as well as the community in the Bristol/Bath region in the Southwest of the UK. To our knowledge, there have been no previous attempts to apply machine learning techniques to predict urine culture outcome in a laboratory setting. Taylor et al. [5] applied supervised machine learning to predict UTIs in symptomatic emergency department patients. Their observational study of 80,387 adult patients, using 211 clinical and laboratory variables, developed six machine learning algorithms that were then compared with documented UTI diagnoses and antibiotic administration. The study concluded that the XGBoost algorithm outperformed all other classifiers; compared with the documented diagnosis, applying the algorithm would re-categorise approximately 1 in 4 patients from false positive to true negative, and 1 in 11 patients from false negative to true positive. Their XGBoost algorithm performed similarly to the one trained on our dataset, with an AUC score of 0·904. Its sensitivity, however, was poor at 61·7%, with a corresponding specificity of 94·9%. We suspect that the difference in sensitivity between the models results from our application of class weights. Taylor et al. [5] did not disclose any parameter tuning of this sort, and the sensitivity they reported was likely a consequence of class imbalance (only 23% of their training data consisted of positive samples). Here, we applied class weights to steer the classification algorithm towards the high sensitivity expected of a screening test.
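One way to see the effect of class weighting is as a shift of the decision cutoff applied to a model's predicted probability. The sketch below shows the cost-sensitive principle only; it is not our training pipeline, in which weighting was applied inside the learning algorithms themselves (e.g. via XGBoost's `scale_pos_weight` parameter).

```python
def weighted_cutoff(w_pos, w_neg):
    """Cost-sensitive decision cutoff on a predicted probability p.

    A missed positive costs w_pos and a false alarm costs w_neg, so we
    predict positive when p * w_pos >= (1 - p) * w_neg; solving for p
    gives the cutoff below. Up-weighting positives lowers the cutoff,
    trading specificity for the sensitivity a screening test requires.
    """
    return w_neg / (w_pos + w_neg)

def classify(probs, w_pos, w_neg):
    """Apply the cost-sensitive cutoff to a list of predicted probabilities."""
    cutoff = weighted_cutoff(w_pos, w_neg)
    return [int(p >= cutoff) for p in probs]
```

With equal weights the cutoff is the familiar 0·5; weighting positives four-to-one moves it down to 0·2, so borderline samples are sent to culture rather than screened out.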
Our study made considerations for the high-risk groups of pregnant patients and children under the age of 11, with the objective of generating a predictive algorithm that conforms to the UK Standards for Microbiology Investigations. We also classified patients into groups based on the identification of key words in the clinical details provided by the requesting clinician. Although methods were put in place to increase the accuracy of these classifications (use of a Levenshtein distance algorithm and consolidation of clinical details from patients with multiple samples), the free-form nature of the notes means that key words were not always included even when applicable. This has likely led to an underestimation of some groups, but it may be addressed in future by more advanced text mining of clinical notes, such as deep learning techniques that classify patients into medical subdomains, as demonstrated by Weng et al. [33].
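For context, the Levenshtein distance computation and the kind of misspelling-tolerant keyword flagging it enables can be sketched as follows. This is a minimal illustration: the keyword set, tokenisation, and edit tolerance shown are hypothetical, not our production configuration.

```python
def levenshtein(a, b):
    """Edit distance via the classic dynamic-programming recurrence,
    keeping only the previous row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def flag_keywords(note, keywords, max_dist=1):
    """Return the keywords present in a free-text clinical note,
    tolerating up to max_dist character edits per token."""
    tokens = [t.strip(".,;:()") for t in note.lower().split()]
    return {kw for kw in keywords
            if any(levenshtein(tok, kw) <= max_dist for tok in tokens)}
```

With a tolerance of one edit, a note containing the misspelling 'pregnent' still triggers the 'pregnant' flag, which is the behaviour that motivated using edit distance over exact matching.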
In our dataset, among samples that generated a positive bacterial culture, there is a clear difference in the distribution of white cell counts in pregnant patients compared with all other patients. The changes in the immune response during pregnancy are not fully understood, but it is agreed that the immune system is significantly modulated [26]. This could explain the differences observed in our dataset, but we must also consider the contribution of screening for asymptomatic bacteriuria in pregnant patients during the middle trimester. Although asymptomatic bacteriuria is cited as being associated with adverse outcomes [26, 27], a randomised controlled study of 5132 pregnant patients in the Netherlands reported a low risk of pyelonephritis in untreated asymptomatic bacteriuria, questioning the use of such screening [7].
Our study demonstrates the power of machine learning algorithms in defining critical variables for the clinical diagnosis of suspected UTIs. Given increasing demand due to ageing populations in most developed and developing countries, radical change is needed to improve cost efficiency and optimise capacity in diagnostic laboratories. At a time when antimicrobial resistance is rising dramatically amongst Gram-negative bacteria, including the two most common urinary pathogens, E. coli and Klebsiella pneumoniae, any significant reduction in inappropriate sample processing will have a positive impact on turn-around times for clinically relevant infections and improve time to appropriate therapy and antimicrobial stewardship.
Extrapolating our estimated workload reduction to a national scale, and considering the reduced purchasing of culture agar alone (without the time cost and additional expenses involved in performing bacterial culture), implementation of the three XGBoost algorithms described in Fig. 6 would result in savings of £800,000–5 million per year across the UK (estimates based on local purchasing data and online sources [34]).
There are several limitations to this study. Firstly, its retrospective nature makes it difficult to clarify details such as potential mis-labelling of samples; however, the use of over 200,000 samples archived in a state-of-the-art LIMS should make the data relatively robust to random individual labelling errors. Secondly, the clinical details provided by the requesting clinicians were relatively sparse. This is true for most diagnostic requests in a busy, publicly funded hospital, where doctors must prioritise their limited time; hence, the dataset represents a 'real life' scenario. Thirdly, it should be remembered that the outcome we have studied is culture result rather than clinical/therapeutic outcome.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.