Method Article

Automated verbal autopsy classification: using one-against-all ensemble method and Naïve Bayes classifier

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 28 Nov 2018

Abstract

Verbal autopsy (VA) refers to post-mortem surveys about deaths, conducted mostly in low- and middle-income countries where the majority of deaths occur at home rather than in a hospital, to retrospectively assign causes of death (COD) and thereby support evidence-based health system strengthening. Automated algorithms for VA COD assignment have been developed and their performance has been assessed against physician and clinical diagnoses. Since the performance of automated classification methods remains low, we aimed to enhance the Naïve Bayes Classifier (NBC) algorithm to produce better-ranked COD classifications on 26,766 deaths from four globally diverse VA datasets, compared with some of the leading VA classification methods, namely Tariff, InterVA-4, InSilicoVA and NBC. We used a different strategy: training multiple NBC algorithms using the one-against-all approach (OAA-NBC). To compare performance, we computed cumulative cause-specific mortality fraction (CSMF) accuracies to measure population-level agreement from rank one to five COD classifications. To assess individual-level COD assignments, cumulative partially chance-corrected concordance (PCCC) and sensitivity were measured for up to five ranked classifications. Overall, OAA-NBC consistently assigned CODs most similar to physician and clinical COD assignments compared with the other leading algorithms, based on cumulative CSMF accuracy, PCCC and sensitivity scores. Our approach improved classification sensitivity by 6% to 8% over the current leading VA classifiers. Population-level agreement for OAA-NBC and NBC was similar to or higher than that of the other algorithms used in the experiments. Although OAA-NBC still requires improvement for individual-level COD assignment, the one-against-all approach increased its ability to assign CODs that more closely resemble physician or clinical COD classifications compared with some of the other leading VA classifiers.

Keywords

COD classification, VA algorithms, CSMF Accuracy, sensitivity, performance assessment

Introduction

Verbal autopsy (VA) is increasingly being used in developing countries where most deaths occur at home rather than in hospitals, and causes of death (COD) information remains unknown1. This gap in information prevents evidence-based healthcare programming and policy reform needed to reduce the global burden of diseases2. VA consists of a structured questionnaire to gather information on symptoms and risk factors leading up to death from family members of the deceased. Each completed survey is then typically reviewed independently by two physicians, and COD diagnosis is assigned using World Health Organization (WHO) International Classification of Disease (ICD) codes3. If there is disagreement in diagnosis, then the VA undergoes further review by a senior physician4,5.

In recent years, efforts have been made to automate VA COD diagnosis using various computational algorithms, in an attempt to further standardize VA COD diagnosis and reduce physician time and costs6–13. The current leading computational VA techniques include InterVA-47, Tariff6, InSilicoVA8, King-Lu10, and the Naïve Bayes Classifier (NBC)11. InterVA-4 employs medical-expert-defined static weights for symptoms and risk factors given a particular COD, and calculates the sum of these weights to determine the most likely COD7. Conversely, Tariff was pre-trained on the Population Health Metrics Research Consortium (PHMRC) VA data to compute tariffs, which express the strength of association between symptoms and CODs and are then summed and ranked to determine a COD; the same procedure is applied to the test dataset, with the resulting summed and ranked tariff scores compared against the pre-trained COD rankings14. InSilicoVA assigns CODs by employing a hierarchical Bayesian framework with a naïve Bayes calculation component; it also computes the uncertainty for individual CODs and population-level COD distributions8. The King-Lu method estimates the distribution of CODs and symptoms in the VA training dataset and uses these to predict CODs in the VA test dataset10. Lastly, NBC predicts the COD by computing the conditional probabilities of observing a symptom for a given COD from the VA training dataset and then applying Bayes' rule to these probabilities11. These existing automated classification algorithms, however, generate low predictive accuracy when compared against physician VA or hospital-based COD diagnoses8,11,15,16. There is therefore a need to improve automated classification techniques to enable wider and more reliable use in the field.

The aim of this research is to develop a classification method for predicting CODs using responses to structured questions in a VA survey. We used a different strategy: training multiple NBC algorithms17 with the one-against-all approach (OAA-NBC)18,19 to generate ranked assignments of CODs for 26,766 deaths from four globally diverse VA datasets (one VA dataset was divided into four, so a total of seven datasets were used for analysis). We then compared our technique against the current leading algorithms, Tariff6, InterVA-47, NBC11 and InSilicoVA8, on the same deaths used for OAA-NBC.

Methods

Datasets

In order to test the performance of the algorithms, we used four main datasets containing information on a total of 26,766 deaths: three physician-diagnosed VA datasets, namely the Indian Million Death Study (MDS)20, the South African Agincourt Demographic and Health Survey (DHS) dataset21 and the Bangladeshi Matlab DHS dataset22, and one health facility-diagnosed COD dataset, namely the PHMRC VA data collected from six sites in four countries (India, Mexico, the Philippines and Tanzania)23,24. We used four combinations of the PHMRC data, by age group (adult and child) and by site (all sites versus India only); this filtering was done to determine the effect on results when deaths are collected from the same geographical setting. A total of seven datasets were therefore used; they are summarized in Table 1. These datasets are publicly available, except for the MDS, and have been used in other studies11,15,24.

Table 1. Verbal autopsy (VA) datasets used in the study.

| | MDS | Agincourt | Matlab | PHMRC-Adult (All Sites) | PHMRC-Child (All Sites) | PHMRC-Adult (India) | PHMRC-Child (India) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Region | India | South Africa | Bangladesh | Multiple1 | Multiple1 | Andhra Pradesh and Uttar Pradesh | Andhra Pradesh and Uttar Pradesh |
| # of deaths | 12,225 | 5,823 | 2,000 | 4,654 | 2,064 | 1,233 | 948 |
| Ages | 1–59 months | 15–64 years | 20–64 years | 12–69 years | 28 days–11 years | 12–69 years | 28 days–11 years |
| # of grouped CODs | 15 | 16 | 15 | 13 | 9 | 13 | 9 |
| # of symptoms | 90 | 88 | 214 | 224 | 133 | 224 | 133 |
| Physician classification | Dual physician agreement | Dual physician agreement | Two-level physician classification | Hospital-certified cause of death, including clinical and diagnostic tests | Hospital-certified cause of death, including clinical and diagnostic tests | Hospital-certified cause of death, including clinical and diagnostic tests | Hospital-certified cause of death, including clinical and diagnostic tests |

1Six sites in total: Andhra Pradesh and Uttar Pradesh (India), Distrito Federal (Mexico), Bohol (Philippines) and Dar es Salaam and Pemba (Tanzania); applicable to both adult and child age group specific datasets.

The MDS VA dataset used in this study contains information on 12,225 child deaths at ages one to 59 months. For each death, two trained physicians independently and anonymously assigned a WHO ICD version 10 code25. In cases where the two physicians did not initially agree or reconcile on a COD, a third senior physician adjudicated20. Similarly, the Agincourt dataset21 underwent dual physician COD assignment of its 5,823 deaths at ages 15 to 64 years. COD assignment was slightly different for the Matlab dataset, which has 2,000 deaths at ages 20 to 64 years: a single physician assigned a COD, followed by review and verification by a second physician or an experienced paramedic22. In contrast, the PHMRC dataset comprises 6,718 hospital deaths that were assigned a COD based on defined clinical diagnostic criteria, including laboratory, pathology and medical imaging findings23,24. For each VA dataset, we grouped the physician-assigned CODs into 17 broad categories (Table 2); the table also shows the distribution of records for each COD across the seven datasets used in our study.

Table 2. Cause list with absolute death counts by VA dataset.

| Group | Causes | Agincourt | Matlab | MDS | PHMRC All Sites Adult | PHMRC Indian Adults | PHMRC All Sites Children | PHMRC Indian Children |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Acute respiratory | 110 | 11 | 3392 | 304 | 81 | 532 | 141 |
| 2 | HIV/AIDS | 2012 | NA | 5 | NA | NA | NA | NA |
| 3 | Diarrhoeal | 66 | 29 | 2711 | 101 | 41 | 256 | 112 |
| 4 | Pulmonary TB | 690 | 43 | 78 | 177 | 21 | NA | NA |
| 5 | Other and unspecified infections | 432 | 79 | 2514 | 622 | 174 | 376 | 187 |
| 6 | Neoplasms (cancer) | 244 | 352 | 96 | 497 | 19 | 28 | 15 |
| 7 | Nutrition and endocrine | 70 | 90 | 372 | NA | NA | NA | NA |
| 8 | Cardiovascular diseases | 381 | 714 | 18 | 928 | 242 | 76 | 25 |
| 9 | Chronic respiratory | 27 | 129 | 21 | 84 | 52 | NA | NA |
| 10 | Liver cirrhosis | 89 | 100 | 112 | 234 | 59 | NA | NA |
| 11 | Other non-communicable diseases | 221 | 244 | 1345 | 697 | 125 | 186 | 80 |
| 12 | Neonatal conditions | NA | NA | 410 | NA | NA | NA | NA |
| 13 | Road and transport injuries | 219 | 49 | 95 | 124 | 32 | 92 | 64 |
| 14 | Other injuries | 366 | 68 | 659 | 471 | 218 | 324 | 259 |
| 15 | Ill-defined | 711 | 35 | 397 | NA | NA | 194 | 65 |
| 16 | Suicide | 125 | 34 | NA | 70 | 33 | NA | NA |
| 17 | Maternal | 60 | 23 | NA | 345 | 136 | NA | NA |

One-against-all Naïve Bayes (OAA-NBC) approach

An overview of our approach is shown in Figure 1. We transformed each VA dataset into binary format, with the VA survey questions as the attributes (columns), the answers as the cell values (re-coded into binary, with 'Yes' as 1 and 'No' as 0), and the COD (the group number identifier listed in Table 2) as the last (or first) column. In all VA datasets, each death is represented as a row (record).
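As a concrete illustration of this encoding, here is a minimal Java sketch; the class and field names are ours, not from the authors' code:

```java
import java.util.List;

// Hypothetical container for one transformed VA record: a row of 0/1 symptom
// indicators (one per survey question) plus the COD group label from Table 2.
public class VaRecord {
    final int[] symptoms;   // 1 = 'Yes', 0 = 'No'
    final int codGroup;     // COD group identifier (1-17)

    VaRecord(List<String> answers, int codGroup) {
        this.symptoms = answers.stream()
                .mapToInt(a -> "Yes".equalsIgnoreCase(a) ? 1 : 0)
                .toArray();
        this.codGroup = codGroup;
    }
}
```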


Figure 1. Overview of one-against-all approach.

We divided each VA dataset into training and testing datasets. We trained multiple NBC models17 on the transformed training datasets using the one-against-all approach18,19. We chose NBC because it has shown better results on VA surveys in the past11. The one-against-all approach was used because it improves an algorithm's classification accuracy on datasets with several categories of the dependent variable, as demonstrated in past literature18,19; this is explained in detail in the next section. During testing, the trained NBC models assign CODs to each death in the testing dataset. The assigned causes are ordered by their probabilities, on the assumption that the top-ranked cause is the most likely true cause.

Training Naïve Bayes using one-against-all approach. NBC uses a training dataset to learn the probabilities of symptoms and their CODs11,17. NBC first measures the probability of each COD, P(COD), in the training dataset. Secondly, it determines the conditional probabilities of each symptom given a particular COD, P(Sym|COD). Thirdly, NBC determines the probability of every COD given a VA record in the test set, i.e., P(COD|VA).

$$P(COD \mid VA) = P(COD) \prod_{Sym \in VA} P(Sym \mid COD)$$

Equation 1. Conditional probability of a COD given a VA record.

P(COD|VA) is determined by taking the product of all P(Sym|COD) (i.e., over all symptoms in the VA record) and P(COD). The COD with the highest P(COD|VA) value is selected as the predicted COD. In particular, we chose the Naïve Bayes Multinomial classification algorithm, which estimates probabilities using maximum likelihood estimates and is readily available in data mining software such as Weka17,18.

$$COD_{NBC} = \underset{COD \,\in\, CODs}{\operatorname{argmax}}\; P(COD \mid VA)$$

Equation 2. Select the class with the maximum probability.
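Since the implementation used Weka's Naïve Bayes Multinomial classifier, a minimal sketch of Equations 1 and 2 with the Weka API might look as follows; the ARFF file name and attribute layout are illustrative assumptions:

```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NbcExample {
    public static void main(String[] args) throws Exception {
        // Load a transformed VA training set; the COD label is the last column
        Instances train = new DataSource("va_train.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        NaiveBayesMultinomial nbc = new NaiveBayesMultinomial();
        nbc.buildClassifier(train);   // learns P(COD) and P(Sym|COD)

        // Equation 1: P(COD|VA) for every cause, shown here for the first record
        double[] posterior = nbc.distributionForInstance(train.instance(0));
        for (int c = 0; c < posterior.length; c++)
            System.out.printf("P(%s|VA) = %.3f%n",
                    train.classAttribute().value(c), posterior[c]);

        // Equation 2: argmax over causes
        int best = (int) nbc.classifyInstance(train.instance(0));
        System.out.println("Predicted COD: " + train.classAttribute().value(best));
    }
}
```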

In the one-against-all approach, we built an NBC model for each COD instead of one model for all CODs. A dataset with M categories of CODs (dependent variable values) is decomposed into M datasets with binary categories. Each binary dataset Di has one COD Ci (where i = 1 to M) labelled as positive and all other CODs labelled as negative, with no two datasets having the same COD labelled as positive. NBC is then trained on each dataset Di, resulting in M Naïve Bayes models, as shown in Figure 2. Each model is used to classify the records in the test dataset, producing a probability of classification; the cause Ci with the highest probability is taken as the correct classification.


Figure 2. One-against-all approach for ensemble learning.
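A sketch of this decomposition using Weka's MakeIndicator filter, which is one way to produce the binary relabelling (the authors' exact implementation may differ):

```java
import java.util.ArrayList;
import java.util.List;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.MakeIndicator;

public class OaaTrainer {
    // Builds M binary NBC models, one per COD, as in Figure 2
    static List<NaiveBayesMultinomial> trainOneAgainstAll(Instances train) throws Exception {
        int m = train.classAttribute().numValues();   // M distinct CODs
        List<NaiveBayesMultinomial> models = new ArrayList<>();
        for (int i = 0; i < m; i++) {
            Instances copy = new Instances(train);
            copy.setClassIndex(-1);                   // let the filter rewrite the label column

            MakeIndicator toBinary = new MakeIndicator();
            toBinary.setAttributeIndex("last");       // the COD column
            toBinary.setValueIndex(i);                // COD C_i -> positive, all others -> negative
            toBinary.setNumeric(false);
            toBinary.setInputFormat(copy);
            Instances binary = Filter.useFilter(copy, toBinary);
            binary.setClassIndex(binary.numAttributes() - 1);

            NaiveBayesMultinomial nbc = new NaiveBayesMultinomial();
            nbc.buildClassifier(binary);              // model i recognises only C_i
            models.add(nbc);
        }
        return models;
    }
}
```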

Testing OAA-NBC on new surveys. During testing, each Naïve Bayes model predicts a COD for every VA record in the test dataset, so each record receives a list of candidate CODs sorted by their probabilities. We made a minor modification to the one-against-all approach: instead of selecting only the COD with the highest probability, we ranked the CODs in descending order of their probabilities for each VA record. We kept the ranked probabilities to generate cumulative performance measures, described in detail in the next section.
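A sketch of this ranking step, assuming each record has been filtered into the binary header expected by each model and that index 1 of a model's class distribution is its 'positive' class:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instance;

public class OaaRanker {
    // Returns COD indices for one record, ordered from most to least probable
    static List<Integer> rankCods(List<NaiveBayesMultinomial> models,
                                  List<Instance> binaryViews) throws Exception {
        List<double[]> scores = new ArrayList<>();   // pairs of {codIndex, P(positive)}
        for (int i = 0; i < models.size(); i++) {
            double p = models.get(i).distributionForInstance(binaryViews.get(i))[1];
            scores.add(new double[]{i, p});
        }
        scores.sort(Comparator.comparingDouble((double[] s) -> s[1]).reversed());
        List<Integer> ranked = new ArrayList<>();
        for (double[] s : scores) ranked.add((int) s[0]);
        return ranked;   // ranked.get(0) is the rank-one COD
    }
}
```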

Assessment methods

A VA algorithm's performance is measured by quantifying the similarity between the algorithm's COD assignments and physician review (or, for PHMRC, clinical diagnosis) assignments. Since the community VA datasets included in this study come from countries with weak civil and death registration systems, physician review is the most practical, relatively accurate (and only) option for assessing algorithm performance. Moreover, given that these deaths were unattended, there is no 'gold standard' for such community VA datasets. Nevertheless, we are confident in the robustness of dual physician review, as initial physician agreement (i.e., where two physicians agreed at the onset of COD coding) was relatively high; e.g., 79% for MDS and 77% for Agincourt.

We measured and compared the individual- and population-level performance of all of the algorithms using the following metrics: sensitivity, partially chance-corrected concordance (PCCC) and cause-specific mortality fraction (CSMF) accuracy. These measures are commonly used in VA studies11,15,26 and are shown in Equation 3 to Equation 5. They allow objective assessment of VA algorithms because they provide a robust strategy for assessing an algorithm's classification ability on test datasets with widely varying COD distributions12,26.

$$\text{Sensitivity} = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$

Equation 3. Sensitivity of classification.

$$PCCC(k) = \frac{S - \frac{k}{n}}{1 - \frac{k}{n}}, \quad \text{where } S = \frac{\text{True positive}}{\text{True positive} + \text{False negative}}$$

Equation 4. Partially chance-corrected concordance (PCCC) of classification: S is the fraction of positively (correctly) assigned causes when the correct cause is within the top k assigned causes out of n total causes.

Sensitivity and PCCC assess the performance of an algorithm in correctly classifying CODs at the individual level. Sensitivity measures the proportion of death records that are correctly assigned for each COD12. Similarly, PCCC computes how well a VA classification algorithm classifies CODs at the individual level while also taking chance (the likelihood of a COD being assigned at random) into consideration8,11,12,15.
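Both metrics can be computed directly from the ranked outputs; a small self-contained sketch of Equation 3 and Equation 4 for top-k classifications, with a toy worked example (our own numbers):

```java
public class RankMetrics {
    // trueCod[r] = true cause of record r; ranked[r] = causes ordered by probability
    static double topKSensitivity(int[] trueCod, int[][] ranked, int k) {
        int hits = 0;
        for (int r = 0; r < trueCod.length; r++)
            for (int j = 0; j < k; j++)
                if (ranked[r][j] == trueCod[r]) { hits++; break; }
        return hits / (double) trueCod.length;
    }

    // Equation 4: corrects the top-k fraction S for chance, given n causes
    static double pccc(double s, int k, int n) {
        return (s - (double) k / n) / (1.0 - (double) k / n);
    }

    public static void main(String[] args) {
        // Toy example: 3 records, 4 causes, top-2 ranking
        int[] truth = {0, 2, 3};
        int[][] ranked = {{0, 1, 2, 3}, {1, 2, 0, 3}, {2, 1, 0, 3}};
        double s = topKSensitivity(truth, ranked, 2);      // 2 of 3 hits = 0.667
        System.out.println("PCCC(2) = " + pccc(s, 2, 4));  // (0.667-0.5)/(1-0.5) = 0.333
    }
}
```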

$$\text{CSMF Accuracy} = 1 - \frac{\sum_{j=1}^{n} \left|CSMF_j^{True} - CSMF_j^{Pred}\right|}{2\left(1 - \min_j\left(CSMF_j^{True}\right)\right)}, \quad \text{where } CSMF^{Pred} = \frac{TP + FP}{N} \text{ and } CSMF^{True} = \frac{TP + FN}{N}$$

Equation 5. Cause-specific mortality fraction (CSMF) accuracy of classification: n is the total number of CODs and N is the total number of records.

In contrast, CSMF accuracy (hereafter referred to as 'agreement') assesses how closely the algorithms reproduce the overall COD distribution at the population level12. As Equation 5 shows, CSMF accuracy computes the absolute error between the COD distribution predicted by an algorithm (Pred) and the observed (True) COD distribution.
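A direct transcription of Equation 5, with toy cause fractions for illustration:

```java
public class CsmfAccuracy {
    // Equation 5: 1 - (sum of absolute errors) / (2 * (1 - smallest true fraction))
    static double csmfAccuracy(double[] csmfTrue, double[] csmfPred) {
        double absError = 0.0, min = Double.MAX_VALUE;
        for (int j = 0; j < csmfTrue.length; j++) {
            absError += Math.abs(csmfTrue[j] - csmfPred[j]);
            min = Math.min(min, csmfTrue[j]);
        }
        return 1.0 - absError / (2.0 * (1.0 - min));
    }

    public static void main(String[] args) {
        double[] t = {0.5, 0.3, 0.2};   // true cause fractions
        double[] p = {0.4, 0.4, 0.2};   // predicted cause fractions
        System.out.println(csmfAccuracy(t, p));  // 1 - 0.2/(2*0.8) = 0.875
    }
}
```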

We measured the cumulative values of sensitivity, PCCC and agreement at each rank and for each algorithm; e.g., sensitivity at rank two represents the sensitivity of both rank-one and rank-two classifications, which facilitates measuring the overall performance of the algorithms for classifications within the top two or more ranks. If 60% of the individual classifications are correct at rank one and a further 15% are correct at rank two, then the overall accuracy across both ranks is 75%. Finally, we also performed a statistical test of significance on the results for all datasets to ascertain that the differences in results are not due to chance. The choice of statistical test depends on the data distribution and the association between experiments; we used the Wilcoxon signed rank test because we cannot assume that our results are normally distributed. Our null hypothesis is that there is no significant difference between OAA-NBC and each other algorithm. This is discussed further in the results section.
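The paper does not state which implementation of the test was used; one readily available option is Apache Commons Math, sketched here with the rank-one sensitivities from Table 3 standing in for the full set of 35 paired observations:

```java
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class SignificanceTest {
    public static void main(String[] args) {
        // Rank-one sensitivities per dataset (Table 3); the reported test pairs
        // all 35 observations (7 datasets x 5 ranks)
        double[] oaaNbc = {61.1, 57.9, 55.5, 53.1, 60.1, 54.6, 63.0};
        double[] nbc    = {56.0, 50.7, 48.2, 47.7, 54.8, 51.5, 58.6};

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        // 'false' selects the normal approximation for the p-value
        double p = test.wilcoxonSignedRankTest(oaaNbc, nbc, false);
        System.out.println("two-tailed p = " + p);
    }
}
```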

Experimental setup

In order to compare the performance of OAA-NBC, InterVA-47, Tariff6, InSilicoVA8 and NBC11, we followed a seven-step procedure. In Step one, we partitioned each VA dataset using a commonly used evaluation approach in data mining: 10-fold cross validation18. In 10-fold cross validation, a dataset is divided into 10 parts, each created by stratified sampling—i.e., each part contains the same proportion of standardized CODs as the original dataset. In Step two, we selected one part for testing and nine parts for training from each VA dataset. In Step three, we trained OAA-NBC, InterVA-4, Tariff, InSilicoVA and NBC on the designated training subsets from each partitioned VA dataset. In Step four, we generated ranked classifications for each algorithm on the test part of each VA dataset. In Step five, we calculated the cumulative sensitivity, PCCC and agreement at each rank for each VA dataset. In Step six, we repeated Steps two to five for 10 iterations, using a different part for testing in each iteration, for each VA dataset. In Step seven, we computed the average sensitivity, PCCC and agreement at each rank per VA dataset and algorithm.
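A minimal sketch of Steps one and two with Weka's stratified partitioning (the ARFF file name is an illustrative assumption):

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSetup {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("va_dataset.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));
        data.stratify(10);                 // keep COD proportions in every fold

        for (int fold = 0; fold < 10; fold++) {
            Instances train = data.trainCV(10, fold);  // nine parts for training
            Instances test  = data.testCV(10, fold);   // one part for testing
            // Steps three to five would go here: train each algorithm on 'train',
            // rank CODs on 'test', and compute the cumulative metrics.
        }
    }
}
```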

We implemented OAA-NBC in Java with the Weka API18; Weka provides APIs for the one-against-all approach and the Naïve Bayes Multinomial classifier18. We used the OpenVA package version 1.0.2 in R to run the InterVA-4, Tariff, InSilicoVA and NBC algorithms. The data were also transformed into the InterVA-4 input format ('Y' for 1 and empty for 0 values). It is important to note that the Tariff version provided in the OpenVA package is computationally different from IHME's SmartVA-Analyze application. We used the custom training option for InterVA-4 and InSilicoVA as provided in the OpenVA package in R. With custom training, the names of symptoms do not need to be in the WHO standardised format, and the rankings of the conditional probabilities P(symptom|cause) are determined by matching the same quantile distributions as in the default InterVA P(symptom|cause). We chose customized training instead of pre-trained global models because different datasets have different proportions of symptoms and causes of death, and custom training allows each algorithm to generate a model tailored to the dataset. It also allows a fair evaluation across algorithms, especially for those that only work by training on the datasets themselves and thereby acquire more knowledge of the data.
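For reference, a minimal sketch of wiring these two Weka APIs together; Weka's MultiClassClassifier uses 1-against-all as its method 0 (per Weka's option documentation), which is functionally the approach described above:

```java
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.classifiers.meta.MultiClassClassifier;
import weka.core.Instances;

public class WekaOaa {
    static MultiClassClassifier buildOaaNbc(Instances train) throws Exception {
        MultiClassClassifier oaa = new MultiClassClassifier();
        oaa.setOptions(new String[]{"-M", "0"});         // method 0 = 1-against-all
        oaa.setClassifier(new NaiveBayesMultinomial());  // base learner for each binary problem
        oaa.buildClassifier(train);
        return oaa;  // distributionForInstance() then yields P(COD|VA) per cause
    }
}
```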

We performed the data partitioning discussed in Step one using Java and Weka's18 stratified sampling API, and each algorithm was executed on the same partitioned data. We used a separate Java program to compute the cumulative measures of sensitivity, PCCC and agreement (CSMF accuracy) on the COD assignments of each algorithm for each VA dataset; this ensured that our evaluation measures were calculated in exactly the same manner for every algorithm. Our source code for all the experiments is available on GitHub and is archived at Zenodo27.

Results

Ranked CSMF accuracy comparison

Figure 3 shows the averaged agreement (CSMF accuracy) of each algorithm across all VA datasets for rank-one (most likely) COD assignments and cumulatively up to the fifth most likely cause assignments (rank five). OAA-NBC produces the highest agreement for most of the VA datasets, ranging from 86% to 90% at rank one; it comes second or ties with NBC on the PHMRC Child datasets (global and India). Furthermore, OAA-NBC agreement was relatively consistent across the VA datasets, whereas some of the other algorithms, such as Tariff, InterVA-4 and InSilicoVA, varied considerably. As expected, including the top five ranked classifications increased the overall agreement for every algorithm on every VA dataset.


Figure 3. Ranks 1 and 5 cause-specific mortality fraction (CSMF) accuracies (agreement) across VA datasets and algorithms.

Ranked sensitivity comparison

Individual-level cumulative sensitivity results for classification ranks one and five are shown in Table 3; cumulative PCCC values are not shown, as they were very close to the cumulative sensitivity values. Table 3 shows that OAA-NBC has the highest sensitivity for the first-ranked (most likely) COD assignments of all the algorithms, ranging between 53% and 63%. When all top five ranked classifications are considered, OAA-NBC's sensitivity improves by 31–38 percentage points, with cumulative values ranging from 91% to 95%. The sensitivities of Tariff, InterVA-4 and InSilicoVA are considerably lower than OAA-NBC's (by 10–40 percentage points); NBC does not differ substantially from OAA-NBC, with differences of only 3–7 percentage points. These results show that OAA-NBC consistently yields closer agreement with physician review or clinical diagnoses at the individual level than the other algorithms on most of the VA datasets.

Table 3. Cumulative sensitivity of rank 1 and 5 COD classifications by VA dataset and algorithm.

Cumulative sensitivity (%) by VA dataset and rank:

| Algorithm | MDS Rank 1 | MDS Rank 5 | Matlab Rank 1 | Matlab Rank 5 | Agincourt Rank 1 | Agincourt Rank 5 | PHMRC Adult-Global Rank 1 | PHMRC Adult-Global Rank 5 | PHMRC Adult-India Rank 1 | PHMRC Adult-India Rank 5 | PHMRC Child-Global Rank 1 | PHMRC Child-Global Rank 5 | PHMRC Child-India Rank 1 | PHMRC Child-India Rank 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tariff | 31.5 | 71.4 | 40.7 | 75.3 | 27.5 | 72.1 | 35.9 | 74.7 | 44.0 | 79.4 | 37.0 | 83.7 | 39.5 | 86.3 |
| InterVA-4 | 48.8 | 82.7 | 34.8 | 79.3 | 46.3 | 78.8 | 36.3 | 82.2 | 41.1 | 84.6 | 45.1 | 91.8 | 51.2 | 93.0 |
| InSilicoVA | 45.6 | 85.9 | 35.6 | 80.8 | 35.8 | 80.3 | 35.0 | 79.5 | 50.3 | 87.3 | 43.3 | 89.6 | 49.4 | 92.4 |
| NBC | 56.0 | 90.1 | 50.7 | 87.2 | 48.2 | 87.4 | 47.7 | 88.1 | 54.8 | 86.1 | 51.5 | 93.1 | 58.6 | 92.4 |
| OAA-NBC | 61.1 | 94.3 | 57.9 | 91.2 | 55.5 | 93.1 | 53.1 | 91.0 | 60.1 | 93.1 | 54.6 | 93.4 | 63.0 | 94.7 |

We also performed a Wilcoxon signed rank test on the sensitivities reported in Table 3 for the five algorithms (we also included the rank-two to rank-four values, which are omitted from the table to save space). For 35 observations, the Wilcoxon signed rank test yielded a Z-score of 5.194 and a two-tailed p-value of 2.47 × 10-7 between OAA-NBC and NBC; it yielded the same Z-scores and two-tailed p-values against InSilicoVA, InterVA-4 and Tariff. Thus, there is a statistically significant difference between the sensitivity values generated by OAA-NBC and the four other algorithms (p < 0.05). Similarly, we conducted the Wilcoxon signed rank test on 35 observations of agreement for the five algorithms, finding a statistically significant difference between OAA-NBC and the other algorithms (Z = 4.248, p < 0.05).

Thus, the use of the one-against-all approach with NBC (OAA-NBC) improves the performance of COD classification for VA records and yields better COD assignments at the population and individual levels, with differences that are statistically significant rather than attributable to chance compared to the four other algorithms. This conforms with the machine learning literature, in which the one-against-all approach improves the performance of classification algorithms when there are more than two classes (CODs)19. However, OAA-NBC still has room for improvement, as the overall sensitivity for the top-ranked COD per VA record remains below 80%. We also made an additional assessment of COD-level sensitivity; Table 4 shows the sensitivity per cause for first-ranked predictions for each VA dataset and algorithm (the PHMRC Indian datasets are excluded because their results are similar to the PHMRC global datasets, which also saves space). The sensitivity values vary by VA dataset and cause for all of the algorithms; road and transport injuries and other injuries were the only causes that OAA-NBC predicted consistently well in four out of the five VA datasets. However, there are several causes for which the sensitivity of OAA-NBC's classifications was below 50%, and in some cases 0% (four causes in the MDS and two causes in the PHMRC Child global dataset). Sensitivity values are 0% for COD groups whose proportion of records is near 1% of a VA dataset (the number of records for each COD in each VA dataset is shown in Table 2). In general, the algorithms' performance varies across CODs under certain conditions in the VA datasets. For example, classifications were at or under 10% across all algorithms for HIV/AIDS, cancers, cardiovascular diseases and chronic respiratory diseases in the MDS dataset. Algorithms like OAA-NBC and NBC mostly achieve better sensitivity for COD groups with a higher proportion of records in the training dataset. However, this is not always the case, and better sensitivity also depends on how distinguishable the VA records of a COD group are from those of all other COD groups. In the next section, we discuss the problem and the effects of imbalance within datasets on the algorithms' classification accuracy.

Table 4. Top ranked (most likely) sensitivity scores per COD by VA dataset and algorithm with physician assigned COD distributions.

Cause, sensitivity (%):

| VA dataset | Algorithm | Acute respiratory | HIV/AIDS | Diarrhoeal | Tuberculosis | Other & unspecified infections | Cancers | Nutrition & endocrine | Cardiovascular diseases | Chronic respiratory | Liver cirrhosis | Other NCDs | Neonatal conditions | Road & transport injuries | Other injuries | Ill-defined | Suicide | Maternal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MDS | Physician* | 27.7 | 0.04 | 22.2 | 0.6 | 20.6 | 0.8 | 3.0 | 0.1 | 0.2 | 0.9 | 11.0 | 3.3 | 0.8 | 5.4 | 3.2 | - | - |
| | Tariff | 36.1 | 10.0 | 47.5 | 42.5 | 19.7 | 16.7 | 31.7 | 5.0 | 0.0 | 23.9 | 3.0 | 11.2 | 84.3 | 57.3 | 25.7 | - | - |
| | InterVA-4 | 78.0 | 0.0 | 55.3 | 51.0 | 43.6 | 9.5 | 0.9 | 8.3 | 3.3 | 29.2 | 1.0 | 15.4 | 70.7 | 71.5 | 10.4 | - | - |
| | InSilicoVA | 61.5 | 0.0 | 55.7 | 50.0 | 32.4 | 0.6 | 42.3 | 0.0 | 0.0 | 21.0 | 6.8 | 13.9 | 82.1 | 69.6 | 63.3 | - | - |
| | NBC | 74.9 | 0.0 | 70.4 | 31.6 | 46.5 | 4.1 | 41.3 | 0.0 | 1.7 | 18.0 | 22.6 | 15.2 | 73.0 | 80.1 | 49.0 | - | - |
| | OAA-NBC | 85.2 | 0.0 | 78.5 | 17.9 | 51.5 | 0.0 | 25.3 | 0.0 | 0.0 | 4.5 | 23.0 | 11.0 | 79.8 | 80.6 | 25.7 | - | - |
| Matlab | Physician* | 0.5 | - | 1.4 | 2.1 | 3.9 | 17.6 | 4.5 | 35.7 | 6.4 | 5.0 | 12.2 | - | 2.4 | 3.4 | 1.7 | 1.7 | 1.1 |
| | Tariff | 15.0 | - | 53.3 | 55.0 | 15.0 | 41.0 | 61.1 | 38.1 | 79.8 | 50.0 | 9.9 | - | 57.0 | 51.2 | 13.3 | 70.8 | 16.7 |
| | InterVA-4 | 0.0 | - | 26.7 | 51.0 | 29.8 | 48.6 | 21.1 | 32.1 | 42.6 | 61.0 | 7.4 | - | 81.5 | 37.1 | 0.0 | 70.8 | 15.0 |
| | InSilicoVA | 20.0 | - | 50.0 | 34.5 | 11.4 | 17.1 | 34.4 | 47.9 | 71.3 | 53.0 | 8.2 | - | 91.5 | 19.0 | 13.3 | 86.7 | 8.3 |
| | NBC | 10.0 | - | 21.7 | 42.5 | 15.4 | 55.4 | 43.3 | 64.1 | 66.5 | 57.0 | 20.0 | - | 83.5 | 15.0 | 21.7 | 76.7 | 5.0 |
| | OAA-NBC | 20.0 | - | 51.7 | 30.5 | 7.5 | 67.6 | 38.9 | 75.3 | 75.8 | 53.0 | 23.8 | - | 96.0 | 39.5 | 2.5 | 75.8 | 5.0 |
| Agincourt | Physician* | 1.9 | 34.5 | 1.1 | 11.8 | 7.4 | 4.2 | 1.2 | 6.5 | 0.5 | 1.5 | 3.8 | - | 3.8 | 6.3 | 12.2 | 2.1 | 1.0 |
| | Tariff | 44.3 | 21.4 | 39.8 | 53.3 | 7.2 | 24.6 | 69.3 | 24.7 | 30.8 | 50.0 | 19.6 | - | 80.8 | 41.0 | 3.0 | 14.0 | 60.3 |
| | InterVA-4 | 36.1 | 74.5 | 34.7 | 59.9 | 12.5 | 28.1 | 25.8 | 13.7 | 43.3 | 50.7 | 9.9 | - | 78.4 | 64.7 | 0.0 | 21.9 | 29.2 |
| | InSilicoVA | 53.1 | 29.3 | 31.2 | 60.9 | 11.4 | 26.2 | 32.8 | 14.8 | 35.8 | 41.4 | 18.5 | - | 81.5 | 52.7 | 29.7 | 79.8 | 52.1 |
| | NBC | 41.2 | 59.4 | 27.9 | 60.8 | 26.6 | 35.3 | 33.2 | 28.3 | 33.3 | 39.1 | 16.6 | - | 79.3 | 63.3 | 27.1 | 69.2 | 53.3 |
| | OAA-NBC | 39.1 | 77.9 | 24.3 | 48.0 | 52.3 | 28.7 | 42.9 | 44.1 | 3.3 | 35.8 | 19.0 | - | 82.6 | 82.0 | 26.7 | 6.4 | 48.3 |
| PHMRC - Adult Global | Physician* | 6.5 | - | 2.2 | 3.8 | 13.4 | 10.7 | - | 19.9 | 1.8 | 5.0 | 15.0 | - | 2.7 | 10.1 | - | 1.5 | 7.4 |
| | Tariff | 26.0 | - | 28.6 | 47.4 | 26.8 | 48.7 | - | 30.3 | 19.3 | 64.0 | 5.8 | - | 64.0 | 37.8 | - | 22.9 | 89.9 |
| | InterVA-4 | 14.5 | - | 5.9 | 14.6 | 45.8 | 47.7 | - | 32.6 | 45.4 | 87.2 | 13.0 | - | 29.2 | 40.8 | - | 25.7 | 61.8 |
| | InSilicoVA | 16.1 | - | 36.7 | 22.6 | 27.6 | 39.4 | - | 25.2 | 32.1 | 46.9 | 13.1 | - | 76.1 | 59.0 | - | 35.7 | 80.3 |
| | NBC | 26.7 | - | 31.7 | 30.0 | 40.7 | 60.0 | - | 49.4 | 41.7 | 60.6 | 21.3 | - | 61.4 | 69.6 | - | 35.7 | 84.1 |
| | OAA-NBC | 22.7 | - | 22.8 | 20.3 | 52.1 | 64.2 | - | 64.6 | 27.4 | 62.4 | 26.3 | - | 59.7 | 74.3 | - | 18.6 | 90.1 |
| PHMRC - Child Global | Physician* | 25.8 | - | 12.4 | - | 18.2 | 1.4 | - | 3.7 | - | - | 9.0 | - | 4.5 | 15.7 | 9.4 | - | - |
| | Tariff | 28.9 | - | 56.2 | - | 20.5 | 6.7 | - | 14.5 | - | - | 22.7 | - | 67.8 | 62.2 | 36.4 | - | - |
| | InterVA-4 | 69.9 | - | 45.3 | - | 25.8 | 43.3 | - | 5.0 | - | - | 8.6 | - | 78.4 | 63.5 | 18.4 | - | - |
| | InSilicoVA | 39.3 | - | 45.6 | - | 26.9 | 35.0 | - | 10.4 | - | - | 17.2 | - | 87.1 | 86.4 | 29.2 | - | - |
| | NBC | 60.5 | - | 48.4 | - | 45.5 | 10.0 | - | 15.7 | - | - | 12.9 | - | 83.9 | 85.5 | 27.2 | - | - |
| | OAA-NBC | 71.0 | - | 53.4 | - | 46.6 | 0.0 | - | 0.0 | - | - | 8.5 | - | 90.4 | 91.0 | 23.0 | - | - |

* Proportion of deaths assigned for each COD by physician(s) or clinical diagnoses (PHMRC)

Discussion

Our approach (OAA-NBC) produces better population- and individual-level agreement (sensitivity) across different VA surveys than the other algorithms. However, the overall sensitivity values are still in the range of 55–61% and do not exceed 80% for the top-ranked COD assignments. There are several reasons for the low sensitivity values. First, each VA dataset is unique, with varying numbers of overlapping or distinct symptoms. In this respect, the symptom-cause information (SCI) was unique to each VA dataset, so some of the algorithms could have had more trouble generating adequate SCIs due to the logic employed by the algorithm itself and the VA data. This could help explain the low sensitivity scores by cause and per algorithm for the MDS data, which was one of the VA datasets with the fewest symptoms, and which could have limited the SCI used by the algorithms for COD assignment. Conversely, some algorithms like InterVA-4 (when the input format is specified as following the WHO 2012 or 2014 VA instrument) require a set of predefined symptoms, or else prefer independent symptoms (e.g., had a fever) over dependent symptoms (e.g., fever lasted for a week) or interdependent symptoms (e.g., did s/he have diarrhoea and dysentery); the absence of such symptoms would also impair the algorithms' ability to classify VA records correctly. A solution to this problem is to have better-differentiating symptoms for each COD.

One may argue that algorithms such as InterVA-4 and InSilicoVA (in its non-training mode), which use a different input—namely a symptom list based on WHO's forms for assigning CODs—and do not need training on data, would be unfairly evaluated under customized training. We therefore converted the symptoms in our datasets to WHO standardised names and evaluated InterVA-4 and InSilicoVA on them. We used the same 10-fold cross validation procedure as in our earlier experiments, but provided the algorithms with only the test set of each fold for assigning causes of death based on the standardised symptom names. The output of these algorithms was one of 63 standardised CODs; we mapped these 63 causes to our 17 CODs for a fair evaluation (see Table 6 for the complete mapping to the 17 COD categories). We observed that rank-one sensitivity for InterVA-4 remained between approximately 25% and 42%, and for InSilicoVA between 20% and 43%, across all datasets. The use of pre-trained models on standardized VA inputs thus did not yield better results than customized training on the datasets.

One may also argue for using more recent algorithm versions, such as InterVA-5, in the assessments. Because the VA data used here were captured before the release of the WHO 2016 forms, the resulting binary files would have many missing symptoms. Furthermore, InterVA-5 was only recently released for public use, specifically in September 2018. Although an enhanced algorithm may perform more effectively due to the logic it employs, the VA data are also very relevant for performance. Since the VA data used in this study conformed better to the 2014 forms, we ran experiments using algorithms that were designed around the WHO 2014 VA forms or that do not require a specific input, for a fair comparison.

VA datasets also differ in COD composition: some CODs in the VA datasets have a large number of records, while others have few. The COD composition is highly imbalanced, which can bias any algorithm towards the CODs with a higher proportion of records in the training set. This implies that the overall agreement would most likely remain low for the algorithms in such cases. COD balancing can be performed by duplicating records for the minority CODs (those with the fewest records) or removing records for the majority CODs (those with the most records)18. However, such artificial balancing approaches do not always improve results.

A further point for discussion relates to the distribution of CODs in the training and test datasets. In machine learning experiments, the composition of records across classes (e.g., CODs) is kept in the same proportion in the training and test sets as in the original dataset18. This allows a fair evaluation of the algorithm; otherwise, too many VA records of a COD in the test set and too few in the training set would simply result in poor performance for that COD. In real-life situations, when a machine learning application is in production, the training (historical) data may not contain all the variations, and newly collected data may contain more variations of a particular COD. The common solution to this problem is to update the training data and re-train the algorithm as newer SCI variations are observed18.

Nonetheless, to understand the effect of different COD compositions in the training and test sets, we performed another experiment using the Dirichlet distribution, which allowed us to vary the composition of records in the test set28. Dirichlet distribution-based sampling models variability in the occurrence of classes (CODs) through resampling with replacement. We divided each dataset into 10 parts using the 10-fold cross validation method18, as in the experiments above. On each fold, we resampled the test set with replacement using the Dirichlet distribution28, resulting in a different number of records for each COD. OAA-NBC, InterVA-4, Tariff, InSilicoVA and NBC were then evaluated on the resampled test sets with their altered COD distributions. The results for the MDS and Matlab datasets are shown in Table 5. As expected, the overall classification performance decreased, because CODs with very few VA records in the actual training set can be duplicated many times by the Dirichlet distribution in the new test set: if a record of a given COD is misclassified by an algorithm and is repeated many times in the test set, then sensitivity for that COD drops. OAA-NBC and NBC still performed better than all the other algorithms. We show results for these two datasets only, as the other VA datasets showed a similar dip in performance. An ideal training dataset would be a large, clinically verified repository of community VA deaths with enough variation in symptom patterns for each COD; however, no such repository exists. The whole purpose of training on VA datasets is to help classify CODs in situations where deaths occur unattended.

Table 5. Comparison of cumulative sensitivity and cause-specific mortality fraction (CSMF) accuracy of rank 1 and 5 classifications using Dirichlet distribution on MDS and Matlab data.

| Algorithm | MDS sensitivity Rank 1 | MDS sensitivity Rank 5 | MDS CSMF accuracy Rank 1 | MDS CSMF accuracy Rank 5 | Matlab sensitivity Rank 1 | Matlab sensitivity Rank 5 | Matlab CSMF accuracy Rank 1 | Matlab CSMF accuracy Rank 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Tariff | 29.0 | 64.7 | 53.7 | 74.6 | 45.2 | 79.0 | 54.6 | 80.8 |
| InterVA-4 | 33.6 | 63.9 | 49.4 | 70.7 | 33.4 | 71.5 | 51.6 | 75.1 |
| InSilicoVA | 38.1 | 75.9 | 57.2 | 80.5 | 37.7 | 81.4 | 59.4 | 85.8 |
| NBC | 41.7 | 74.7 | 60.4 | 79.6 | 38.7 | 73.7 | 57.6 | 76.7 |
| OAA-NBC | 41.0 | 75.0 | 59.8 | 79.2 | 45.6 | 86.2 | 60.4 | 88.3 |

All values are cumulative percentages.
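A sketch of the resampling step described above, under our reading of the procedure (draw a COD distribution from a flat Dirichlet via normalized Gamma variates, then resample test records with replacement), using Apache Commons Math:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.commons.math3.distribution.GammaDistribution;

public class DirichletResample {
    // One draw from Dirichlet(1,...,1): normalize independent Gamma(1,1) variates
    static double[] sampleDirichlet(int m, long seed) {
        GammaDistribution gamma = new GammaDistribution(1.0, 1.0);
        gamma.reseedRandomGenerator(seed);
        double[] w = new double[m];
        double sum = 0.0;
        for (int i = 0; i < m; i++) { w[i] = gamma.sample(); sum += w[i]; }
        for (int i = 0; i < m; i++) w[i] /= sum;
        return w;
    }

    // Resample test-record indices with replacement to match the drawn COD mix
    static List<Integer> resample(List<List<Integer>> recordsByCod, int size, Random rng) {
        double[] w = sampleDirichlet(recordsByCod.size(), rng.nextLong());
        List<Integer> sampled = new ArrayList<>();
        while (sampled.size() < size) {
            double u = rng.nextDouble();
            int cod = 0;
            double acc = w[0];
            while (u > acc && cod < w.length - 1) { cod++; acc += w[cod]; }
            List<Integer> pool = recordsByCod.get(cod);   // test records with this COD
            if (!pool.isEmpty())
                sampled.add(pool.get(rng.nextInt(pool.size())));
        }
        return sampled;
    }
}
```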

Table 6. Complete mapping of ICD-10 and WHO cause labels to the cause list used for performance assessments.

| No. | Cause of death | WHO list of causes | ICD-10 range |
| --- | --- | --- | --- |
| 1 | Acute respiratory | Acute resp infect incl pneumonia, Neonatal pneumonia | H65-H68, H70-H71, J00-J22, J32, J36, J85-J86, P23 |
| 2 | HIV/AIDS | NA | |
| 3 | Diarrhoeal | Diarrhoeal diseases | A00-A09 |
| 4 | Pulmonary TB | Pulmonary tuberculosis | A15-A16, B90, J65 |
| 5 | Other and unspecified infections | Sepsis (non-obstetric), HIV/AIDS related death, Malaria, Measles, Meningitis and encephalitis, Tetanus, Pertussis, Haemorrhagic fever, Other and unspecified infect dis, Neonatal sepsis | A17-A33, A35-A99, B00-B17, B19-B89, B91-B99, C46, D64, D84, G00-G09, H10, H60, I30, I32-I33, K02, K04-K05, K61, K65, K67, K81, L00-L04, L08, M00-M01, M60, M86, N10, N30, N34, N41, N49, N61, N70-N74, P35-P39, R50, R75, ZZ21 |
| 6 | Neoplasms (cancer) | Oral neoplasms, Digestive neoplasms, Respiratory neoplasms, Breast neoplasms, Reproductive neoplasms MF, Other and unspecified neoplasms | C00-C26, C30-C45, C47-C58, C60-C97, D00-D48, D91, N60, N62-N64, N87, R59 |
| 7 | Nutrition and endocrine | Severe anaemia, Severe malnutrition | D50-D53, E00-E02, E40-E46, E50-E64, X53-X54 |
| 8 | Cardiovascular diseases (CVD) | Diabetes mellitus, Acute cardiac disease, Stroke, Other and unspecified cardiac dis | E10-E14, G45-G46, G81-G83, I60-I69, I00-I03, I05-I15, I26-I28, I31, I34-I52, I70-I99, R00-R01, R03, ZZ23 |
| 9 | Chronic respiratory | Chronic obstructive pulmonary dis, Asthma | J30-J31, J33-J35, J37-J64, J66-J84, J90-J99, R04-R06, R84, R91 |
| 10 | Liver cirrhosis | Liver cirrhosis | B18, F10, K70-K77, R16-R18, X45, Y15, Y90-Y91 |
| 11 | Other non-communicable diseases | Sickle cell with crisis, Acute abdomen, Renal failure, Epilepsy, Congenital malformation, Other and unspecified, Other and unspecified NCD | D55-D63, D65-D83, D86, D89, E03-E07, E15-E35, E65-E68, E70-E90, F00-F09, F11-F52, F54-F99, G10-G37, G40-G41, G50-G80, G84-G99, H00-H06, H11-H59, H61-H62, H69, H72-H95, K00-K01, K03, K06-K14, K20-K31, K35-K38, K40-K60, K62-K64, K66, K78-K80, K82-K93, L05, L10-L99, M02-M54, M61-M85, M87-M99, N00-N08, N11-N29, N31-N33, N35-N40, N42-N48, N50-N59, N75-N86, N88-N99, Q00-Q99, R10-R15, R19-R23, R26-R27, R29-R49, R56, R63, R70-R74, R76-R77, R80-R82, R85-R87, R90, ZZ25 |
| 12 | Neonatal conditions | Cause of death unknown, Prematurity, Birth asphyxia, Other and unspecified neonatal CoD | C76, D64, G40, O60, P00, P01, P02-P03, P05, P07, P10-P15, P21, P22, P24-P29, P50-P52, P61, P77, P80, P90-P92, R04, R06, Q00-Q99, W79, Z37 |
| 13 | Road and transport injuries (RTI) | Road traffic accident, Other transport accident | V01-V99, Y85 |
| 14 | Other injuries | Accid fall, Accid drowning and submersion, Accid expos to smoke fire & flame, Contact with venomous plant/animal, Accid poisoning & noxious subs, Assault, Exposure to force of nature, Other and unspecified external CoD | S00-S99, T00-T99, W00-W99, X00-X44, X46-X52, X55-X59, X85-X99, Y00-Y14, Y16-Y84, Y86-Y89, Y92-Y98, ZZ27 |
| 15 | Ill-defined | NA | P96, R02, R07-R09, R25, R51-R54, R57-R58, R60-R62, R64-R69, R78-R79, R83, R89, R92-R94, R96, R98-R99 |
| 16 | Suicide | Intentional self-harm | X60-X84 |
| 17 | Maternal | Ectopic pregnancy, Abortion-related death, Pregnancy-induced hypertension, Obstetric haemorrhage, Obstructed labour, Pregnancy-related sepsis, Anaemia of pregnancy, Ruptured uterus, Other and unspecified maternal CoD, Not pregnant or recently delivered, Pregnancy ended within 6 weeks of death, Pregnant at death, Birth asphyxia, Fresh stillbirth, Macerated stillbirth | A34, F53, O00-O08, O10-O16, O20-O99 |

Finally, the performance of a machine learning algorithm depends on the logic it employs and on the VA data, in terms of generating an SCI adequate to discriminate between the different classes (CODs). To mitigate the effects of using one training set for all VA data, we trained the algorithms on data derived from each dataset's own source using the 10-fold cross validation method. By doing so, only the SCI generated from each separate VA dataset was considered when the algorithms classified deaths for that dataset. For the most part, the algorithms performed consistently, with OAA-NBC performing better the majority of the time. Our results are reproducible; all of the scripts used and sample datasets are publicly available (see the Experimental setup section).

Conclusion

In this study, we enhanced the NBC algorithm using the one-against-all approach to assign CODs to records in multiple VA datasets from different settings. The results show that our approach has 6–8% better sensitivity and PCCC for individual-level COD classification than some of the current best performing computer-coded VA algorithms (i.e., Tariff, InterVA-4, NBC and InSilicoVA). Population-level agreement for OAA-NBC and NBC was similar to or higher than that of the other algorithms used in the experiments. Overall, OAA-NBC's classifications are the most similar to dual physician and clinical diagnostic COD assignments when compared against some of the leading algorithms, as measured by cumulative sensitivity, PCCC and CSMF accuracy scores. The performance results are not due to chance, as indicated by the Wilcoxon signed rank test.

Thus, we conclude that using the one-against-all approach with NBC helped improve the accuracy of COD classification. The one-against-all approach (and other ensemble methods of machine learning) can also be used with other VA algorithms, not just Naïve Bayes. Although OAA-NBC generates the highest cumulative CSMF accuracy values, it still requires improvement to produce the most accurate COD classifications, especially at the individual level, where top-rank sensitivity is still below 80%. In the future, we plan to extend this work to include the narratives present in VA surveys for automated classification. Another endeavour would be to apply the one-against-all approach to the other algorithms to determine whether they can be improved to classify community VA deaths more similarly to dual physician review.

Data availability

Some of the data used in the analysis has already been made available, specifically the PHMRC data which can be found at: http://ghdx.healthdata.org/record/population-health-metrics-research-consortium-gold-standard-verbal-autopsy-data-2005-2011.

The other datasets are included with the source code: https://github.com/sshahriyar/va (archived at https://doi.org/10.5281/zenodo.1489267)27.

Software availability

Source code available from: https://github.com/sshahriyar/va

Archived source code at time of publication: https://doi.org/10.5281/zenodo.1489267 (reference 27).

License: MIT License.

How to cite this article: Murtaza SS, Kolpak P, Bener A and Jha P. Automated verbal autopsy classification: using one-against-all ensemble method and Naïve Bayes classifier [version 1; peer review: 1 approved, 1 approved with reservations]. Gates Open Res 2018, 2:63 (https://doi.org/10.12688/gatesopenres.12891.1)
Open Peer Review

Version 1 (published 28 Nov 2018)
Reviewer Report, 03 Jan 2019 — Ying Lu, Department of Applied Statistics, Social Sciences and Humanities, Steinhardt School of Education, Culture and Human Development, New York University, New York, NY, USA
Status: Approved

"First I would like to congratulate the authors for developing an effective solution to the verbal autopsy classification problem. The results look very convincing, and the rationale of the methods seems to be reasonable. The source code is open-access. …"
Author Response, 23 Jan 2019 — Syed Shariyar Murtaza, Data Science Lab, Ryerson University, Toronto, M5B 2K3, Canada

"Thank you for reviewing this article. Please find below replies to your questions.

Q1. How well does the one-against-all method perform when the number of disease categories increases? Will the uncertainty …"
Reviewer Report, 20 Dec 2018 — Aaron S. Karat, Department of Clinical Research, London School of Hygiene & Tropical Medicine, London, UK; Clara Calvert, Department of Population Health, London School of Hygiene & Tropical Medicine, London, UK
Status: Approved with Reservations

"Thank you for the opportunity to review this article: it describes the development of a new method in an important area of global health and for the most part is well written and organised. Overall, the authors make a coherent …"
Author Response, 23 Jan 2019 — Syed Shariyar Murtaza, Data Science Lab, Ryerson University, Toronto, M5B 2K3, Canada

"Thank you for reviewing this article. Please find below replies to your questions. We have also submitted a modified version with your recommendations.

Introduction

Q1. This provides a good overview of the …"
