Background
Information that comes from accurate and complete records of deaths – who died of what – is a vital resource for evidence-based health planning and development [1]. However, most deaths in low-income countries remain unregistered because civil registration systems are absent [1–3]. In countries where deaths are not routinely recorded and classified by cause, verbal autopsy (VA) has become an alternative technique [4–6]. Verbal autopsy covers the entire process of interviewing close caregivers about the circumstances leading up to death [4, 7, 8]. Verbal autopsy data are collected by trained lay interviewers and then interpreted into a probable cause of death [2, 4, 7, 8]. Any of the following methods can be applied to derive the cause of death from VA data: physician review, physician review using an algorithm, computer algorithms and the InterVA model [2, 9].
Physician review (PR) is the most commonly used method of interpreting VA data [4, 9]. In this method, two independent physicians review VA questionnaires to assign a probable cause of death and the corresponding International Classification of Diseases (ICD) code [9, 10]. However, the diagnosis reached by physicians can vary depending on their training, experience and knowledge of local epidemiology, which in turn limits the internal consistency and comparability of findings [9, 10]. Physician review is also costly and time-consuming. A single verbal autopsy review may take up to half an hour of a physician’s time, competing with the time required for patient care [3, 9, 10]. Despite its shortcomings, numerous validation studies have demonstrated that PR is capable of producing reasonably valid cause of death information, and it is used in many surveillance systems [2, 5, 11].
Recently, an automated method of interpreting verbal autopsy, called InterVA, has been developed and used [1, 4, 7, 10]. It calculates the probability of each of a set of causes of death, given the presence of indicators (circumstances, signs, and symptoms) reported in VA interviews [10, 12]. Although statistical modeling of this sort may not reflect the subjective subtleties of physicians’ review, InterVA is faster, cheaper and internally consistent [4, 9, 10]. Therefore, there is increasing interest in shifting from PR, which is widely used in several research centers and surveillance systems, to the automated InterVA approach [2, 4, 5, 9, 12]. However, the reliability of the diagnoses reached by these methods has not been sufficiently studied. According to prior studies that compared PR and InterVA, the level of agreement varies from low (κ = 0.27) to moderate (κ = 0.42–0.48) [10, 13–15].
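The probabilistic idea behind a model of this kind can be illustrated with a minimal sketch of a Bayesian update over candidate causes. This is not the InterVA-4 implementation: the function name, the three causes, the two indicators and every probability below are hypothetical values chosen for illustration, and the sketch assumes that indicators are treated as conditionally independent given the cause.

```python
from typing import Dict, Iterable


def update_cause_probabilities(
    priors: Dict[str, float],
    likelihoods: Dict[str, Dict[str, float]],
    reported_indicators: Iterable[str],
) -> Dict[str, float]:
    """Apply Bayes' rule once per reported indicator, renormalising each time.

    priors[cause] is P(cause); likelihoods[cause][indicator] is
    P(indicator | cause). Both are hypothetical inputs in this sketch.
    """
    posterior = dict(priors)
    for indicator in reported_indicators:
        for cause in posterior:
            posterior[cause] *= likelihoods[cause][indicator]
        total = sum(posterior.values())
        if total == 0:
            raise ValueError("reported indicators are impossible under every cause")
        posterior = {cause: p / total for cause, p in posterior.items()}
    return posterior


# Hypothetical numbers, for illustration only.
priors = {"TB": 0.2, "malaria": 0.3, "injury": 0.5}
likelihoods = {
    "TB": {"chronic_cough": 0.8, "fever": 0.6},
    "malaria": {"chronic_cough": 0.1, "fever": 0.9},
    "injury": {"chronic_cough": 0.05, "fever": 0.05},
}

posterior = update_cause_probabilities(
    priors, likelihoods, ["chronic_cough", "fever"]
)
most_likely = max(posterior, key=posterior.get)
```

In this toy case the reported cough and fever shift most of the probability mass to TB even though injury had the largest prior. Because the same inputs always yield the same posterior, an approach of this kind is internally consistent in the sense described above.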
The Kilite Awlaelo Health and Demographic Surveillance System (KA-HDSS), located in northern Ethiopia, has been using the PR method to determine cause of death since September 2009. We used data from the KA-HDSS to measure the agreement in diagnosis between physician review and the computer-based InterVA-4 model.
Discussion
This study compared the level of case-by-case agreement in diagnosis between the InterVA model and the PR method. In general, the CSMFs for the major causes of death were comparable between the two methods and also consistent with previous findings [2, 11, 13]. However, the overall case-by-case agreement in diagnosis lies within the fair range of agreement [28]. The level of agreement varied by cause of death and by age and sex of the deceased, ranging from fair (κ = 0.23 for cardiovascular diseases) to substantial (κ = 0.75 for accidents/injuries).
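The difference between population-level comparability and case-by-case agreement can be made concrete with a short sketch. It computes cause-specific mortality fractions (CSMFs) and Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), for two lists of assigned causes. The function names and the six toy cases are hypothetical and are not data from this study.

```python
from collections import Counter
from typing import Dict, Sequence


def csmf(causes: Sequence[str]) -> Dict[str, float]:
    """Cause-specific mortality fractions: share of deaths assigned to each cause."""
    counts = Counter(causes)
    n = len(causes)
    return {cause: k / n for cause, k in counts.items()}


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected case-by-case agreement between two sets of diagnoses."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both methods must diagnose the same non-empty set of cases")
    n = len(a)
    # Observed agreement: share of cases given the same cause by both methods.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement by chance, from each method's marginal distribution.
    frac_a, frac_b = csmf(a), csmf(b)
    expected = sum(frac_a[c] * frac_b.get(c, 0.0) for c in frac_a)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical diagnoses for six deaths (illustration only).
pr_causes = ["TB", "TB", "malaria", "malaria", "injury", "injury"]
interva_causes = ["TB", "malaria", "malaria", "TB", "injury", "injury"]

same_csmf = csmf(pr_causes) == csmf(interva_causes)
kappa = cohen_kappa(pr_causes, interva_causes)
```

In this toy example both methods yield identical CSMFs (one third of deaths for each cause), yet they agree on only four of the six individual cases, giving κ = 0.5. Identical population-level fractions therefore do not imply high case-by-case agreement.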
The proportions of deaths attributed to communicable causes by the two methods were similar and consistent with existing knowledge of the burden of communicable diseases in Ethiopia [2, 11]. The two methods attributed similar proportions of deaths to TB, pneumonia/sepsis, acute infections, malaria and HIV/AIDS. In a similar study in Kenya, the two methods comparably attributed pneumonia/sepsis, TB, malaria and meningitis [10]. However, according to other studies, InterVA overestimated TB relative to physician review [10, 13, 15]. Across studies, the comparability of the two methods in how often they diagnosed HIV/AIDS was inconsistent. In some studies, InterVA diagnosed HIV/AIDS more frequently than physician review [10], while in another study it did so less frequently than PR [13]. This discrepancy may be related to misclassification between HIV/AIDS and TB, which has been reported in several studies [10, 22, 29, 30].
In our study, both methods attributed NCDs comparably, and the magnitude of the estimate accords with previous findings [31–33]. Similarly, cardiovascular causes of death were comparably estimated by both methods, as was also reported in another study [15]. The consistency of both methods in estimating deaths attributed to accidents/injuries shown in our study concurs with other studies [10, 15]. This may be because the indicators (signs and symptoms) reported for accidents and injuries are clearer than those for other causes.
In our study, the overall chance-corrected agreement, at both broad and specific cause of death categories, falls between 0.21 and 0.40, which is considered fair agreement [28]. The case-by-case agreement at the specific cause of death level was higher than in a similar study in Kenya (κ = 0.27) and lower than in other findings from Ethiopia (κ = 0.49) and Kenya (κ = 0.42) [10, 13, 15]. As reported in a similar study, the level of agreement was better in younger than in older age groups [34]. This could be explained by differences in disease epidemiology across age groups. Older age groups more often experience multiple illnesses with overlapping symptoms than younger groups [34].
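For readers comparing the κ values quoted across studies, the qualitative labels can be expressed as a small lookup. The sketch assumes the conventional Landis and Koch cut-offs, which match the 0.21–0.40 "fair" band cited above; whether reference [28] uses exactly these bands is an assumption, and the function name is hypothetical.

```python
def agreement_band(kappa: float) -> str:
    """Map a kappa value to a qualitative label.

    Assumes the conventional Landis and Koch bands; the text above only
    confirms that 0.21-0.40 is labelled "fair".
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values quoted in the text, labelled under the assumed bands.
labels = {k: agreement_band(k) for k in (0.23, 0.27, 0.42, 0.49, 0.75)}
```

Under these assumed bands, 0.23 and 0.27 fall in the fair range, 0.42 and 0.49 in the moderate range, and 0.75 in the substantial range, which is consistent with the labels used in this section.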
Findings from several [10, 13–15], but not all [26], studies show that the concordance between PR and InterVA is insufficient. A comparative study in Northwest Ethiopia, which included 408 adult deaths, measured a concordance level of 0.49 at the broad CoD level [13]. Even lower levels of agreement were reported from the African Population and Health Research Center (κ = 0.27) [10] and the Kilifi Health Demographic Surveillance System (κ = 0.32) [15], both in Kenya, which made similar comparisons. On the other hand, findings from a recent multi-center study, which used data from Health and Demographic Surveillance systems and Health and Demographic Surveys, showed an almost perfect level of agreement, reporting an overall concordance correlation coefficient of 0.83 [26].
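The concordance correlation coefficient measures something different from κ: it compares paired numeric estimates rather than case-by-case labels. The following sketch assumes the coefficient in question is Lin's concordance correlation coefficient, ρ_c = 2·s_xy / (s_x² + s_y² + (x̄ − ȳ)²), applied to paired estimates such as CSMFs. The source does not specify the computation, so this is an illustration only, and the input values are hypothetical.

```python
from typing import Sequence


def concordance_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient for paired estimates.

    Uses population (divide-by-n) variances and covariance, as in Lin's
    original definition.
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be paired and non-empty")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((v - mean_x) ** 2 for v in x) / n
    var_y = sum((v - mean_y) ** 2 for v in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0:
        raise ValueError("coefficient is undefined for two identical constant series")
    return 2.0 * cov_xy / denominator


# Hypothetical paired fractions from two methods (illustration only).
method_a = [0.30, 0.25, 0.20, 0.15, 0.10]
method_b = [0.28, 0.27, 0.18, 0.17, 0.10]
ccc = concordance_correlation(method_a, method_b)
```

Unlike Pearson correlation, this coefficient penalises systematic shifts between the two series, so it reaches 1 only when the paired estimates coincide. It is computed on aggregate estimates, which is one reason it is not directly comparable with case-by-case κ values.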
In the present study, no inference about the validity of either method can be made in the absence of a gold-standard diagnosis. However, validation studies that simultaneously evaluated the PR and InterVA methods against hospital-certified deaths showed that PR performs better than the InterVA model [14, 15]. One validation study that compared both methods against hospital CoD found that the agreement between InterVA and hospital CoD (κ = 0.32) was lower than the agreement between physician review and hospital CoD (κ = 0.52). In addition, in another study that evaluated PR and InterVA using clinical diagnostic gold standards in a sample of 12,542 verbal autopsy cases, PR performed better than InterVA across all age groups [14].
Discrepancy in diagnosis between these two methods may not be unexpected, though further investigation is needed to explain the variation. According to previous studies, the discordance in diagnosis is related to differences in how the two methods process and use the verbal autopsy data. InterVA uses the data from the closed-ended questions only, while PR makes extensive use of the open-ended narrative part of the VA data [3, 9, 10, 14]. In addition, InterVA uses a probability matrix to process the indicators in the verbal autopsy data, while PR is based on expert judgment [7, 9].
In addition to the minimal effort it requires, InterVA has the comparative advantage of being completely internally consistent, which enables it to produce comparable outputs. In contrast, PR is labor-intensive and prone to inter-observer variation. However, PR also has some benefits. As part of their routine clinical practice, reviewing physicians treat patients who come from the same population as the VA cases. This gives them a chance to correlate the signs and symptoms used to describe illness in the specific community with the actual illness confirmed through clinical investigations. Although such prior knowledge can affect the likelihood of coding less prevalent causes [9], it may help make the PR process robust for CoDs that are common in the community.
The present study has the following limitations. The two methods were compared in the absence of a gold-standard diagnosis. As a result, it was not possible to draw conclusions about the validity of the methods. Although the study included more cases than the minimum sample size required, the sample was not sufficient for comparing sub-groups or rare causes of death.
Conclusion
In summary, this study found an overall low chance-corrected agreement in probable cause of death between PR and InterVA. The level of agreement varies across categories of causes of death and the age of the deceased. The agreement ranged from moderate to substantial for diseases of public health importance such as TB, perinatal causes, pneumonia/sepsis, and accidents and injuries, while the agreement for NCDs, especially cardiovascular causes and neoplasms, was low. The two methods showed relatively better agreement for under-five children and adults aged 15-45, and agreed least for cases aged 45 years and above. Therefore, if the InterVA were used in place of the PR process, the overall diagnosis would be fairly similar.
Acknowledgements
The authors are thankful to the funding organizations, field workers and data management staff. We are grateful to Prof. Peter Byass from the WHO Collaborating Centre for Verbal Autopsy, Umeå Centre for Global Health Research, Umeå University, Sweden, for providing training on the application of InterVA to the KA-HDSS research team.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
BW and YAM were involved in the design of the surveillance, data collection and supervision. BW was involved in study conception, data processing and analysis. BW wrote the manuscript and interpreted the results. BW, YAM, GJD, and MS made substantial revisions to the manuscript. MS and GJD mentored the paper-writing process from preparation to manuscript write-up. All authors read and approved the final manuscript.