Introduction

Fig. 1 Three-step framework for predictive modeling

Cardiovascular diseases (CVD) are disorders of the heart and blood vessels [1] and are among the leading causes of death and disability. Early diagnosis and treatment can reduce the risk of the disease becoming more severe. It is therefore necessary to gain a clear understanding of risk and prevention factors and to improve the accuracy of diagnosis [2]. Coronary artery disease (CAD) is a cardiovascular disease in which atherosclerotic plaques restrict blood flow to the heart muscle by physically clogging the arteries, which can lead to cardiac death or myocardial infarction [3]. CAD can be diagnosed using noninvasive and invasive methods. These tests help in evaluating the severity of the disease, its effect on the function of the heart and the possible form of treatment to be given to a patient. Noninvasive diagnostic methods include echocardiography, exercise stress testing, magnetic resonance imaging and single photon emission computed tomography, but their results are often inconclusive and not as reliable as angiography [4,5,6,7,8]. Angiography, however, is an invasive, costly and highly technical procedure, and it cannot be used for screening large populations or for close follow-up of treatment [9]. Moreover, these methods consume considerable time and require an expensive laboratory setup and specialized tools and techniques. These limitations encourage researchers to seek less expensive, noninvasive alternatives for the diagnosis of CAD, such as data mining, that can lead to easy detection of CAD without angiography. Various epidemiological studies, including the Framingham Heart Study [10, 11], the Nippon–Honolulu–San Francisco study [12, 13], Monitoring Trends and Determinants in Cardiovascular Disease [14, 15] and the INTERHEART study [16, 17], have been carried out to understand the patterns, causes and risk factors of the disease.
Data mining methods have been used to find patterns and models in clinical data [18, 19]. During the past few decades, statistical and machine learning techniques have been increasingly applied to assist medical diagnosis. These include both predictive and descriptive data mining techniques. Predictive data mining is widely used to generate models for prediction and classification. Descriptive data mining uses associations, clustering and subgrouping to find interesting patterns in data [20]. If mined properly, the information hidden in clinical records is a huge resource bank for medical research. These data often contain hidden patterns and relationships that can lead to improved diagnosis and treatment and provide a platform for better understanding the mechanisms governing almost all aspects of the medical domain [21]. Various data mining techniques, namely decision trees [22,23,24,25,26,27,28], support vector machines (SVM) [24, 25, 27], artificial neural networks (ANN) [24, 25, 27, 28], Naïve Bayes [28] and Bayesian networks [25], have been used for CVD diagnosis, but often as black boxes whose models were not clinically interpretable. The rules generated by decision trees, on the other hand, are clinically interpretable, which is highly desirable in clinical applications [29]. Decision trees can also be constructed relatively fast and do not require complex parameter adjustment from the user's point of view [30]. In most studies, instances with missing values were either eliminated before learning or left to the machine learning technique to handle. The presence of missing values in a data set can affect the performance of the resulting model. Instance deletion is practical only when few cases contain missing values and when analysis of the remaining cases will not introduce serious bias into clinical decisions.
A missing rate of up to 1% is usually considered trivial and 1–5% manageable. However, 5–15% requires sophisticated handling methods, and more than 15% may severely affect any kind of interpretation [31]. Missing value imputation may be used to increase the accuracy of predictive models. K-means is an unsupervised learning algorithm that can be used for missing value imputation [32]. In this paper, we propose an intelligent machine learning framework for CAD prediction (Fig. 1). The framework also handles missing values through data imputation.
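The missing-rate bands quoted above can be checked per feature with a few lines of code. The following is a minimal sketch, not part of the study: the function names, the use of `None` to mark a missing entry and the treatment of the band boundaries (1%, 5% and 15% taken as inclusive upper limits) are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Return the percentage of entries in a feature column that are missing (None)."""
    if not values:
        raise ValueError("empty column")
    return 100.0 * sum(v is None for v in values) / len(values)


def missing_category(rate_percent: float) -> str:
    """Map a missing rate (in percent) to the bands quoted in the text.

    Boundary handling is an assumption: each limit is treated as inclusive.
    """
    if rate_percent <= 1.0:
        return "trivial"
    if rate_percent <= 5.0:
        return "manageable"
    if rate_percent <= 15.0:
        return "needs sophisticated handling"
    return "may severely impact interpretation"


# Hypothetical feature column: 20 recorded values, 2 of them missing.
column = [1.0] * 18 + [None, None]
rate = missing_rate(column)
category = missing_category(rate)
```

A feature falling in the last two bands would be a candidate for the cluster-based imputation described later, rather than for instance deletion.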

Data set description

Clinical data of 335 consecutive patients were collected from the Department of Cardiology, Indira Gandhi Medical College, Shimla, India. All subjects were suspected of having CAD and were enrolled for angiography. Twenty-seven features were recorded for each patient, comprising demographic, historic and laboratory features, namely age, sex, smoking history, hypertension, diabetes mellitus, dyslipidemia, chest pain type, random blood sugar, cholesterol, low density lipoprotein, high density lipoprotein, triglycerides, systolic blood pressure, diastolic blood pressure, height, weight, body mass index, waist circumference, central obesity, ankle–brachial index, exercise duration, METS achieved, rate pressure product, duration of recovery with persistent ST changes, Duke treadmill test and result of angiography (significant CAD and severity of the disease) (Table 1).

Table 1 Data set description with descriptive statistical values

Machine learning framework

Data were preprocessed by encoding the levels of qualitative attributes (indicated in Table 1) before cluster formation, and missing values were then imputed with the centroid value of the corresponding feature in the assigned cluster. To predict CAD cases, we prepared a CAD data set (which we call CDS) using the CAD class as the predictand and a severity data set (which we call SDS) using the severity class as the predictand. Models were then constructed using the supervised learning algorithms C4.5, NB Tree and MLP for diagnosis of CAD and its severity. The models were trained and validated using the k-fold cross-validation method, in which all samples are eventually used for both training and testing. In this method, the data set is divided into k equal-sized subsets (here \(k=10\)); \(k-1\) subsets are used to train the model and the remaining subset is used to test it. This procedure is repeated k times so that every sample is used for both training and testing. The average result across all k trials is computed to produce the final estimate.
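The k-fold procedure described above can be sketched as follows. This is an illustrative sketch only, not the evaluation code used in the study: the function names, the shuffling seed and the trivial majority-class learner standing in for C4.5, NB Tree or MLP are assumptions. Note that 335 samples cannot be split into 10 exactly equal folds, so the fold sizes here differ by at most one.

```python
import random
from typing import Callable, List, Sequence


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle the sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(X: Sequence, y: Sequence, fit: Callable, k: int = 10, seed: int = 0) -> float:
    """Average test accuracy over k folds.

    `fit(X_train, y_train)` must return a function that predicts one sample.
    """
    scores = []
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / k


def fit_majority(X_train: Sequence, y_train: Sequence) -> Callable:
    """Placeholder learner (assumption): always predicts the most frequent training class."""
    labels = list(y_train)
    majority = max(set(labels), key=labels.count)
    return lambda x: majority
```

Any learner that follows the same `fit` convention can be substituted for `fit_majority` without changing the validation loop.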

K-means clustering

Researchers have employed various techniques for handling missing values, such as case-wise deletion, mean value imputation, maximum likelihood and machine learning algorithms including decision trees and MLP. Statistical methods have also been explored in the medical domain [33]. Many researchers have used the K-means clustering algorithm (KMCA) to impute missing values in medical data [34] and financial data [35]. The K-means clustering algorithm takes an input parameter k (the number of clusters) and partitions the data into k clusters with high intra-cluster similarity based on a distance function. It first selects k objects as initial cluster centers; each remaining object is then assigned to the cluster whose center is nearest to it. The centroid of each cluster is then computed as the new cluster center. This process iterates until the criterion function converges.
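A minimal sketch of cluster-based imputation along these lines is given below. It is not the authors' implementation. Several choices are assumptions made for illustration: missing entries are marked with `None`, distances are computed over the observed features only, centroids are averaged over observed values only, and initial centers are either passed in through the hypothetical `init` argument or sampled from the complete rows.

```python
import math
import random
from typing import List, Optional

Row = List[Optional[float]]


def _distance(row: Row, center: List[float]) -> float:
    """Euclidean distance using only the features observed in `row` (assumption)."""
    return math.sqrt(sum((v - c) ** 2 for v, c in zip(row, center) if v is not None))


def _centroid(rows: List[Row], fallback: List[float]) -> List[float]:
    """Per-feature mean of the observed values; keep the old center if none are observed."""
    center = []
    for j, old in enumerate(fallback):
        observed = [r[j] for r in rows if r[j] is not None]
        center.append(sum(observed) / len(observed) if observed else old)
    return center


def kmeans_impute(data: List[Row], k: int, init: Optional[List[List[float]]] = None,
                  n_iter: int = 100, seed: int = 0) -> List[List[float]]:
    """Cluster the rows with K-means, then fill each missing entry with the
    corresponding feature value of the centroid of the row's cluster."""
    if init is None:
        complete = [r for r in data if all(v is not None for v in r)]
        init = random.Random(seed).sample(complete, k)  # needs at least k complete rows
    centers = [list(c) for c in init]
    labels = [-1] * len(data)
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: _distance(row, centers[c])) for row in data]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        centers = [_centroid([r for r, l in zip(data, labels) if l == c], centers[c])
                   for c in range(k)]
    return [[centers[l][j] if v is None else v for j, v in enumerate(row)]
            for row, l in zip(data, labels)]
```

As with any K-means variant, the result depends on the initial centers, so in practice several initializations may need to be compared.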

Table 2 Performance of models on CDS data set

Model construction using learning schemes

In this study, we explored two classification techniques, namely, decision tree and neural network, for diagnosis of CAD and its severity.

Decision tree (C4.5, NB Tree)

A decision tree is a tree in which each non-leaf node denotes a test on an attribute of the cases, each branch corresponds to an outcome of the test, and each leaf node denotes a class prediction. It selects the most discriminant set of attributes based on the outcome of statistical measures [36]. Tree construction is an iterative process that splits the data set into partitions.

C4.5 is an extension of the ID3 algorithm developed by Ross Quinlan [37]. It uses a divide-and-conquer approach to build the decision tree and uses information gain, normalized as the gain ratio, as the splitting criterion. It works top-down, selecting at each stage the attribute that best distinguishes the classes and then recursively processing the subproblems that result from the split [11].
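The splitting criterion can be illustrated with a short calculation. The sketch below computes entropy, information gain and the gain ratio (information gain divided by the split information, which is the normalization C4.5 applies to the ID3 criterion) for a single categorical attribute. The function names and the toy attribute values are hypothetical and are not taken from the study data.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a sequence of discrete labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Reduction in class entropy obtained by splitting on a categorical attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0


# Hypothetical toy data: an attribute that separates the classes perfectly,
# and one that carries no information about the class.
labels = ["CAD", "CAD", "normal", "normal"]
informative = ["angina", "angina", "other", "other"]
uninformative = ["a", "b", "a", "b"]
```

At each node, the attribute with the largest criterion value would be chosen for the split, and the procedure would then recurse on each resulting partition.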

NB Tree generates a decision tree with Naive Bayes classifiers at the leaves. The Naive Bayes classifier is based on Bayes' theorem and works under the assumption that the features in a data set are mutually independent given the class. Being relatively robust, easy to implement, fast and accurate, it is widely used. Key application areas include the diagnosis of diseases and decision support systems for medical diagnosis [38], taxonomic studies for the classification of RNA sequences [39] and spam filtering in e-mail clients [40].
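Under the independence assumption, the decision rule of a Naive Bayes classifier can be written compactly. The notation here is introduced for illustration and is not taken from the study: \(c\) ranges over the classes, \(x_1,\ldots ,x_n\) are the feature values of a case, and \(\hat{c}\) is the predicted class:

$$\begin{aligned} \hat{c}=\mathop {\mathrm{arg\,max}}\limits _{c}\; P(c)\prod _{i=1}^{n}P(x_i\mid c). \end{aligned}$$

In an NB Tree, a rule of this form would be applied only to the cases that reach a given leaf.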

Multilayer perceptron

An artificial neural network is a mathematical model, inspired by nature, consisting of a number of highly interconnected elements organized into layers. It is suitable for training on large amounts of data with very few inputs. It requires less formal statistical training and has the ability to implicitly detect complex nonlinear relationships between dependent and independent variables. The multilayer perceptron (MLP) is a popular ANN architecture trained with back propagation; it is a class of supervised neural network and can be used to model complex relationships between inputs and outputs [36, 41].

Table 3 Performance of models for SDS data set

Performance measures

The performance of a classification model is measured in terms of accuracy, sensitivity, specificity and error rate [11].

Accuracy is the percentage of objects correctly classified by the classification method:

$$\begin{aligned} \hbox {Accuracy} =\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}. \end{aligned}$$

Error rate is the percentage of objects incorrectly classified by the classification method:

$$\begin{aligned} \hbox {Error Rate}=\frac{\mathrm{FP}+\mathrm{FN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}}. \end{aligned}$$

Sensitivity (true positive rate) is the percentage of positive examples predicted correctly:

$$\begin{aligned} \hbox {Sensitivity}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}. \end{aligned}$$

Specificity (true negative rate) is the percentage of negative examples predicted correctly:

$$\begin{aligned} \hbox {Specificity}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}, \end{aligned}$$

where TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives.
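The four measures follow directly from the confusion-matrix counts, as the short sketch below shows. The function name and the example counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not results from the study.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, error rate, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


# Hypothetical counts: 50 diseased and 50 healthy subjects.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

With these counts, accuracy is 85/100, error rate 15/100, sensitivity 40/50 and specificity 45/50; accuracy and error rate always sum to one.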

Results

The models were constructed on the CDS data set with the algorithms C4.5, NB Tree and MLP. The performance of the models was evaluated in terms of accuracy, misclassification error rate, sensitivity and specificity. Other statistical measures, namely the kappa statistic, mean absolute error and root mean square error, were also calculated. The results are presented in Table 2. C4.5 achieves the highest prediction accuracy for detection of CAD (97.6%), the lowest misclassification error rate (2.38%) and the highest sensitivity and specificity (97.5% and 97.6%, respectively). It also achieves the highest value of Cohen's kappa (0.952) and the lowest RMSE (0.154).
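Cohen's kappa corrects the observed agreement between predicted and actual classes for the agreement expected by chance. For a two-class problem it can be computed from the same confusion-matrix counts as the other measures. The sketch below uses a hypothetical function name and hypothetical counts, not the study's results.

```python
def cohens_kappa(tp: int, fp: int, tn: int, fn: int) -> float:
    """Cohen's kappa for a two-class confusion matrix.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from the row and column totals.
    """
    n = tp + fp + tn + fn
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical counts: observed agreement 0.85, chance agreement 0.50.
kappa = cohens_kappa(tp=40, fp=5, tn=45, fn=10)
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 indicates that the classifier does no better than chance given the class proportions.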

For prediction of the severity of the disease, the SDS data set was used. The results (Table 3) show that C4.5 has the highest prediction accuracy and kappa statistic (KS) among the three methods, as well as the lowest misclassification error rate, mean absolute error (MAE) and root mean square error (RMSE).

We, therefore, consider C4.5 for rule extraction. Some of the rules extracted are shown in Fig. 2.

Fig. 2 Rules extracted from decision tree

We also compared C4.5, NB Tree and MLP with missing data tolerance techniques [33, 42] for the presence or absence of stenosis in the arteries (CAD) and for the severity of the disease. The results are shown in Tables 4 and 5 (Figs. 3, 4, 5, 6, 7, 8).

Table 4 Performance of the C4.5, NB Tree and MLP classifiers with missing data tolerance techniques for CAD
Table 5 Performance of the C4.5, NB Tree and MLP classifiers with missing data tolerance techniques for severity

The optimized model constructed using C4.5 for disease severity has the highest prediction accuracy (80.8%) and the lowest misclassification error rate (19.1%). For the CAD diagnostic model, cluster-based missing value imputation does not improve the performance of the C4.5 classifier because the features used to construct the CAD model do not contain missing values, but for NB Tree and MLP there is a significant improvement in accuracy.

Discussion and conclusion

The literature review suggested that the models with the best classification performance may differ from one problem to another; performance depends on data preprocessing techniques, feature selection methods, and the selection of algorithms for model construction and validation. This study examines two predictive data mining approaches, decision tree and MLP, combined with a cluster-based missing value imputation method, in search of an optimal model capable of more accurate and sensitive disease diagnosis. The experiments were conducted using the Waikato Environment for Knowledge Analysis (WEKA) toolkit. MLP ignores missing values, whereas C4.5 uses distribution-based imputation.

Fig. 3 Accuracy of models with missing value tolerance and cluster-based imputation for prediction of CAD

Fig. 4 Error rate of models with missing value tolerance and cluster-based imputation for diagnosis of CAD

Fig. 5 Sensitivity of models with missing value tolerance and cluster-based imputation for CAD

Fig. 6 Specificity of models with missing value tolerance and cluster-based imputation for CAD

Fig. 7 Accuracy of models with missing value tolerance and cluster-based imputation for severity

Fig. 8 Misclassification error rate of models with missing value tolerance and cluster-based imputation for severity

The model constructed with C4.5 for prediction of coronary artery disease, using 25 features and the presence or absence of coronary artery disease as the class, reaches the highest accuracy (97.6%) compared with the other predictive models. For NB Tree, the proposed method improves the prediction accuracy by 0.9% and the sensitivity and specificity by 0.4% and 1.7%, respectively; among the other statistical measures (Cohen's kappa, mean absolute error and root mean square error), KS increases by 0.08, MAE decreases by 0.015 and RMSE decreases by 0.023. For MLP, accuracy improves by 1.2%, sensitivity by 0.6% and specificity by 1.7%, KS increases by 0.024, and MAE and RMSE decrease by 0.018 and 0.047, respectively. Further, to predict the severity of the disease, models were constructed with 25 features and severity as the class using the decision tree and MLP algorithms. The proposed method improves the prediction accuracy by 3.2% for C4.5 and by 4.77% and 3.56% for NB Tree and MLP, respectively. The misclassification error rate is significantly reduced, by 3.28% for C4.5 and by 4.77% and 3.65% for NB Tree and MLP, respectively, and there is significant improvement in the statistical measures KS, MAE and RMSE. The significant rules (Fig. 2) extracted from the optimized C4.5 model show that chest pain type [43, 44] is the major predictor of CAD, with angina chest pain carrying the highest probability of CAD. High density lipoprotein >48 is characteristic of healthy subjects [45, 46] (rules 3–6). In rules 1 and 2, angina chest pain, Duke score and METS are the same, but weight and waist circumference (WC) affect the probability of single-vessel versus multi-vessel disease: higher values of weight and WC can lead to multi-vessel disease. The extracted rules are clinically interpretable and aid the decision-making process. The study showed that an intelligent diagnostic model based on a decision tree, using noninvasive and clinical features, is capable of diagnosing the disease and its severity with high accuracy and at low cost.
However, the rules extracted from the decision tree are crisp, and their performance could be improved by a fuzzy rule-based approach. The results are reproducible. The parameters used to construct the models were recorded during routine (noninvasive) clinical examination of symptomatic patients. The proposed model gives a high pretest probability of CAD and its severity without using an invasive diagnostic technique.