Introduction
According to the European Resuscitation Council (ERC) and the European Society of Intensive Care Medicine (ESICM), “diffuse and extensive anoxic injury” on neuroimaging is predictive of poor functional outcome after cardiac arrest [1]. Head computed tomography (CT) is widely available and is frequently used for neuroprognostication [2–4]. Recent meta-analyses conclude that the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) level of evidence for CT to predict outcome after cardiac arrest is very low [5–7]. Signs of “diffuse and extensive anoxic injury”, seen as a reduced differentiation between grey and white matter and/or effacement of the cerebral sulci on CT, correlate well with elevated levels of neuronal injury markers [8] and histopathological severity of hypoxic–ischaemic encephalopathy (HIE) [9].
In clinical practice, CTs are usually assessed qualitatively using a non-standardised approach [10]. Some specialised centres use non-standardised quantitative methods such as the grey–white-matter ratio (GWR), placing regions of interest at the basal ganglia and/or in (sub)cortical regions to quantify oedema [7]. Manual assessments carry the risk of interrater variability, and standardisation and/or automatisation may be necessary to ensure a safe translation from research to clinical routine [11–14]. In stroke imaging, automated quantification of non-contrast CTs is routinely used [15]. This has not yet been achieved for neuroprognostication, and only a few single-centre retrospective studies with automated quantification of GWR after cardiac arrest have been published [16, 17].
Based on our retrospective studies with adult out-of-hospital cardiac arrest patients, we established Standard Operating Procedures (SOPs) for qualitative (visual interpretation) and quantitative (GWR) CT assessments for clinical use [13, 18–20]. In line with previous studies [18, 19, 21–24], our recent pilot study from the Target Temperature Management after Out-of-hospital Cardiac Arrest (TTM)-trial confirmed that GWR was most accurate at the basal ganglia level, where a cutoff < 1.10 predicted poor functional outcome with 100% specificity [18].
We here present the results of a prospective observational substudy of an international multicentre trial in which we applied our previously published criteria for standardised qualitative and quantitative CT assessment as well as an atlas-based automated GWR method for neuroprognostication after cardiac arrest [14]. Our main hypotheses were that “Definite signs of severe HIE”, by standardised assessment and manually or automatically obtained GWR < 1.10, would predict poor functional outcome without false positives in CTs performed 48 h–7 days after cardiac arrest [13].
Method
Study design
This prospective international multicentre observational study (Clinicaltrials.gov NCT03913065) was a substudy of the Targeted Hypothermia versus Targeted Normothermia after out-of-hospital cardiac arrest (TTM2) trial [25] (Clinicaltrials.gov NCT02908308). The design and statistical analysis plan of this substudy have previously been published [13].
Patient selection and ethics
Between November 2017 and January 2020, the TTM2-trial consecutively screened unconscious patients ≥ 18 years admitted to hospital after out-of-hospital cardiac arrest of a presumed cardiac or unknown cause [25]. Approval was waived/obtained from the appropriate ethics committees. The trial was performed in accordance with the ethical standards laid down in the Declaration of Helsinki and its later amendments [26]. Consent was obtained from legal representatives and/or patients according to local legislation.
Thirteen sites from Sweden, Germany, France, and the United Kingdom that routinely use CT for neuroprognostication in patients unconscious > 48 h post-arrest participated (electronic supplementary material [ESM] Table E1). Unconsciousness was defined as not obeying verbal commands. Included patients were managed according to the TTM2-trial protocol regarding randomisation, clinical management, neurological prognostication, decisions on withdrawal of life-sustaining therapy, and follow-up [25, 27–29].
Data collection and technical requirements
All types of scanners and software were permitted. Technical prerequisites were availability of axial slices of 4–5 mm thickness obtained with a tube voltage of 120 kV.
CT assessments
CTs with artefacts or structural lesions interfering with reliable evaluation were excluded. Five radiologists and two neurologists from four countries, with 3–15 years of experience of CTs after cardiac arrest, evaluated images individually using a virtual private network (VPN) secured platform (Human Observer Net) [30] (ESM Table E2). Raters were blinded to all information except the patient's age: since brain volume may decrease with age, this information was considered necessary for assessing the extent of cerebral oedema. The raters received approximately 30 min of training in the software used for evaluations; this training did not cover the actual rating of images. Raters were encouraged to have the SOP accessible during ratings.
Standardised operating procedures for qualitative assessments
Axial images were evaluated at four levels: brain stem and cerebellum, basal ganglia, frontoparietal cortex at corona radiata level, and high convexity cortex (ESM Fig. E1A) [13]. The raters confirmed or declined the following question: “Are there definite signs of severe HIE defined as complete or near complete loss of grey–white-matter differentiation at the basal ganglia level and in the frontoparietal cortex with additional evidence of brain swelling/sulcal effacement?”
Standardised operating procedures for quantitative assessments
Circular 0.1 cm² regions of interest were manually placed at the basal ganglia level in the putamen, the caudate nucleus (caput), the posterior limb of the internal capsule, and the genu corpus callosum bilaterally (ESM Fig. E1B) [13].
Automated measurements
The software pipeline for automated GWR determinations has been published [17]. Images were co-registered to a freely available standard brain atlas and mean Hounsfield Units were quantified in each individual CT space using inversely transferred probabilistic tissue maps [31, 32] (ESM Figs. E2–E3).
GWR calculations
GWR was calculated as the sum of the radiodensity of the grey matter regions of interest divided by the sum of the radiodensity of the white matter regions of interest (ESM Fig. E1B). The GWR-8 model included all eight regions of interest. The GWR-4 model and the automated GWR only included the measurements in the putamen and in the posterior limb of the internal capsule.
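The calculation described above can be sketched as follows. This is a minimal illustration, not the software used in the study: the region names, the dictionary layout, and the Hounsfield Unit values are hypothetical, and the sketch assumes that GWR-4 uses the bilateral putamen and posterior limb of the internal capsule measurements.

```python
from typing import Mapping, Sequence, Tuple

# Hypothetical region labels; the source names the anatomical regions
# but does not prescribe any data layout.
GREY_REGIONS = ("putamen", "caudate_nucleus")
WHITE_REGIONS = ("internal_capsule_posterior_limb", "corpus_callosum_genu")
SIDES = ("left", "right")

RoiValues = Mapping[Tuple[str, str], float]  # (region, side) -> mean HU


def gwr(hu: RoiValues, grey: Sequence[str], white: Sequence[str]) -> float:
    """Sum of grey matter radiodensities divided by sum of white matter radiodensities."""
    grey_sum = sum(hu[(region, side)] for region in grey for side in SIDES)
    white_sum = sum(hu[(region, side)] for region in white for side in SIDES)
    if white_sum == 0:
        raise ValueError("white matter radiodensity sum is zero")
    return grey_sum / white_sum


def gwr8(hu: RoiValues) -> float:
    """GWR-8: all eight regions of interest (four regions, both sides)."""
    return gwr(hu, GREY_REGIONS, WHITE_REGIONS)


def gwr4(hu: RoiValues) -> float:
    """GWR-4: putamen and posterior limb of the internal capsule only (assumed bilateral)."""
    return gwr(hu, ("putamen",), ("internal_capsule_posterior_limb",))


# Made-up HU values, for illustration only (not study data).
example = {
    ("putamen", "left"): 36.0, ("putamen", "right"): 36.0,
    ("caudate_nucleus", "left"): 38.0, ("caudate_nucleus", "right"): 38.0,
    ("internal_capsule_posterior_limb", "left"): 30.0,
    ("internal_capsule_posterior_limb", "right"): 30.0,
    ("corpus_callosum_genu", "left"): 30.0,
    ("corpus_callosum_genu", "right"): 30.0,
}
print(round(gwr8(example), 3), round(gwr4(example), 3))
```

Because the sums are formed before the division, each region of interest contributes in proportion to its radiodensity rather than as a separate per-pair ratio.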
Outcome assessment
Functional outcome by the modified Rankin Scale (mRS) was assessed by a trained outcome assessor at a structured face-to-face or telephone follow-up, at six months after randomisation. Functional outcome was dichotomised into good (mRS 0–3) and poor (mRS 4–6) [25, 27, 33].
Statistical analysis
The results are reported according to the Standards of Reporting Diagnostic Studies [34] and the Standards for Studies of Neurological Prognostication in Comatose Survivors of Cardiac Arrest [35]. Continuous variables are reported as median (interquartile range, IQR) or mean (± standard deviation) and categorical variables as numbers (percentages). Sensitivities and specificities for prediction of poor functional outcome, and negative and positive predictive values, are presented with 95% confidence intervals (CI) calculated with Wilson's method. Results from the manual standardised assessments are presented separately for each rater and as median (min–max) of all raters. For GWR, we decided to apply the cutoff < 1.10, since this yielded a 100% specificity for poor outcome prediction in our pilot study [18]. In addition, we analysed the pre-specified GWR cutoff < 1.15. The overall prognostic performance of GWR for good versus poor functional outcome was assessed by the area under the receiver-operating characteristic curve (AUC) with 95% CI. AUC was classified as: < 0.60 = failure, 0.60–0.70 = poor, 0.70–0.80 = fair, 0.80–0.90 = good, and 0.90–1.00 = excellent [36]. The mean AUC for manual GWR was compared to that of the automated GWR using DeLong's test.
The interrater agreement between the blinded raters was calculated with Fleiss’ kappa. Intra-rater agreements for 20% of the images re-evaluated by each rater (identical for all raters) were analysed with Cohen’s kappa. The strength of agreement by kappa (κ) was classified as: < 0.20 = poor, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = good, and 0.81–1.00 = very good, for both Fleiss’ and Cohen’s kappa [37–39].
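The two agreement statistics can be sketched with their textbook definitions as follows. This is a minimal sketch under stated assumptions, not the study's code: the input layouts and example ratings are hypothetical, and category boundaries are treated as upper-inclusive because the published ranges leave small gaps (for example between 0.20 and 0.21).

```python
from typing import Sequence


def cohen_kappa(first: Sequence[int], second: Sequence[int]) -> float:
    """Cohen's kappa for two rating passes over the same images."""
    a, b = list(first), list(second)
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters assigning image i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every image must be rated by the same number of raters")
    share = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    observed = sum(per_item) / n_items
    expected = sum(p * p for p in share)
    if expected == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1.0 - expected)


def classify_kappa(kappa: float) -> str:
    """Agreement categories from the text; upper bounds inclusive (assumption)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"


# Hypothetical binary ratings (1 = severe HIE), for illustration only.
k = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
print(k, classify_kappa(k))
```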
CTs performed 48 h–7 days post-arrest were included in our prospective cohort. To assess the accuracy of automated GWR, we included all patients with CTs performed ≤ 7 days post-arrest in a post hoc cohort. We assessed the impact of timing on prognostic accuracies for automated GWR < 1.10 in time-windows < 2 h, 2–6 h, > 6–48 h, > 48–96 h, and > 96–168 h after cardiac arrest.
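The assignment of examinations to these time-windows can be sketched as below. The function name and the ASCII window labels are hypothetical; the boundary handling follows the inequality signs given in the text, with examinations later than 168 h falling outside every window.

```python
from typing import Optional


def time_window(hours_after_arrest: float) -> Optional[str]:
    """Map time from cardiac arrest to CT (in hours) onto the analysis windows."""
    h = hours_after_arrest
    if h < 0 or h > 168:
        return None  # outside the 7-day post hoc cohort
    if h < 2:
        return "<2h"
    if h <= 6:
        return "2-6h"
    if h <= 48:
        return ">6-48h"
    if h <= 96:
        return ">48-96h"
    return ">96-168h"


# Example timings in hours, for illustration only.
for t in (1.5, 2, 48, 72, 120, 200):
    print(t, time_window(t))
```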
Sensitivities and specificities with 95% CI are also presented separately for patients randomised to hypothermia and normothermia within the prospective cohort for each rater for the standardised qualitative assessment, GWR-8 and GWR-4 at cutoff < 1.10 and for automated GWR < 1.10.
We further examined whether "severe HIE" by qualitative rating and GWR-8 cutoff < 1.10 evaluated by four or more raters corresponded with pathological findings of routine prognostic methods.
Statistical analyses were performed with IBM SPSS Statistics (SPSS Statistics for Windows, Version 29.0.0.0 Armonk, NY: IBM Corp) and R, version 4.0.4 (The R Foundation for Statistical Computing).
Discussion
In this prospective multicentre study, evaluating three different methods of diagnosing severe hypoxic–ischaemic injury on CT for prediction of poor functional outcome after cardiac arrest, we validated previously published standardised criteria and evaluated the GWR cutoff < 1.10 for manual and automated assessments [13]. We conclude that CT is a highly specific tool for neuroprognostication, regardless of assessment method, with the highest sensitivities for poor outcome prediction when performed 48–96 h post-arrest. GWR determination at the basal ganglia level with a cutoff < 1.10, performed either manually or automatically, offers a more objective measure of HIE with reduced interrater variability.
CT is a guideline-recommended predictor of outcome after cardiac arrest with very low quality of evidence [1, 5, 7, 40]. The main concerns raised by ERC/ESICM include the lack of multicentre validation and standardised assessments of both qualitative and quantitative methods [1]. Our study provides a framework that is easy to use in clinical practice and addresses several concerns raised in recent publications [5–7, 10, 41].
Our standardised qualitative criteria define signs of severe HIE as a “complete or nearly complete loss of grey-white matter differentiation in the basal ganglia and in the frontoparietal cortex with additional evidence of brain swelling/sulcal effacement” [13]. A visual interpretation according to a checklist with mandatory evaluation at several levels of the brain had to be completed before reaching a conclusion. This qualitative assessment predicted poor outcome with 0% false-positive rate in 980 blinded ratings overall. In line with previous qualitative studies, sensitivities for individual raters ranged between 11 and 61% for imaging performed 48 h–7 days post-arrest [18, 20, 41].
Both the ERC/ESICM and the American Neurocritical Care Society recommendations use similar, undefined terminology to describe signs of severe HIE on CT: "diffuse", "extensive" or "bilaterally across vascular territories", with a "loss of grey-white-matter differentiation" [1, 5]. While our standardised qualitative criteria may offer a more precise definition of severe HIE than those given in the current guidelines, they achieved only moderate interrater reliability. The CT evaluation in our study was mostly performed by experienced raters (3–15 years with CTs of cardiac arrest patients). Rater experience may impact both sensitivity and specificity of CT evaluation using our SOP. This should be kept in mind when implementing our CT analysis in clinical routine. Further refinements are necessary to improve interrater reliability and may include a better standardisation of windowing during visual analysis, standards for decisions in cases of residual grey–white differentiation, and awareness of the effects of residual contrast agents. In contrast to clinical practice, our raters only had one CT available for analysis and did not have access to pre-cardiac arrest CTs. We plan a subsequent study using serial CTs to evaluate whether an analysis of changes in grey–white-matter differentiation and brain volume over time improves prognostic accuracy.
GWR is the only guideline-recommended method to quantify the extent of HIE on CT and can be applied with routine radiological software, but there is no consensus on the number, size, and exact location of regions of interest [1, 10, 22, 42–44]. Based on previous investigations and our retrospective pilot study, we chose to validate manually placed 0.1 cm² regions of interest at the basal ganglia level [18]. Importantly, we included the instruction to place the regions of interest in a subregion with a radiodensity representative of the entire anatomical target region and to avoid potential confounders (artefacts, calcifications, lacunar infarcts, etc.). We confirmed that both manual GWR models had a maximal specificity at GWR < 1.10, which is in accordance with previous studies [6, 7, 18]. As expected, sensitivities increased at cutoff < 1.15 at the cost of a slightly decreased specificity. None of the false positives from quantitative measurements fulfilled the criteria for "severe HIE" with the qualitative assessment, underlining the potential value of combining both approaches. As in our pilot study, GWR-8 was consistently superior to GWR-4 concerning prognostic accuracies and intra- and interrater agreements [18]. A possible explanation for the higher accuracy of GWR-8 is the reduction of noise due to the larger number of regions of interest [18, 45]. GWR-8 was superior to the qualitative assessment for some raters, and the interrater reliability for GWR-8 was superior to that of qualitative assessments, highlighting a potential advantage of quantification. We presume that the interrater reliability of manual GWR could be further improved by applying stricter instructions for measurements within anatomical regions or using non-circular and/or larger regions of interest.
Automated atlas-based GWR measurements offer an alternative to manual measurements that is unaffected by interrater variability and could increase the availability of GWR for hospitals without on-site neuroradiologic expertise [17]. The few previous studies evaluating automated GWR quantification are limited by single-centre, retrospective designs and early assessment of functional outcome [16, 17]. The prognostic accuracy of automated GWR < 1.10 in our prospective cohort was as good as that of the manually assessed GWR, with 40% sensitivity at 100% specificity. This performance is also in the range of manual and automated GWR from CTs performed > 24 h post-arrest in previous studies [18, 20, 24, 46], routine EEG, and SSEP [7]. Except for one study on early CTs, 1.10 was the lowest reported GWR cutoff with 100% specificity thus far [47]. Overall, automated GWR < 1.10 performed within 7 days post-arrest had one false-positive prediction of poor outcome in a patient with a subcortical low attenuating area close to the putamen, most likely an old lacunar infarction or perivascular space, but with intact overall grey–white-matter differentiation. Automated GWR relies on anatomical landmarks, and its use must include a quality check for co-registration and exclusion of artefacts or acute brain pathologies potentially interfering with measurements [10]. Future studies on larger cohorts should investigate whether machine learning can predict outcome from CTs after cardiac arrest with superior accuracy compared to our human rater-based approach [48].
Data from our current and previous studies do not suggest that CT can predict good outcome/absence of severe HIE. Future studies, for example using analysis of serial CTs, should re-investigate this issue.
Our results on optimal timing support guideline recommendations [1, 5] that CTs performed 48–96 h post-arrest have higher sensitivity for predicting poor outcome than examinations performed within the first hours post-arrest [5, 17, 18, 20, 23, 24, 49]. Examinations performed on hospital admission are routinely used to exclude cerebral causes of unconsciousness and may be too early to detect HIE for most patients. The increase in sensitivity within the first days corresponds with developing HIE. The higher sensitivity of later examinations is in line with previous observations and supports the notion that an optimal time window of a few days exists for neuroprognostic CTs [17]. We found no clinically relevant effect of temperature allocation on prognostic accuracies for prediction of poor functional outcome. When performed at an optimal timepoint and analysed using standardised interpretation, combining CT with other prognostic methods with higher sensitivities such as EEG or NSE could increase the number of correctly identified poor outcome patients.
Strengths and limitations
Strengths of our study include the prospective, multicentre design with standardised criteria for neuroprognostication and withdrawal of life-sustaining therapy and a structured assessment of functional outcome at 6 months. CTs were prospectively performed in unconscious patients at a timepoint clinically most relevant for neuroprognostication. Radiological assessments were blinded and performed by multiple raters from different countries according to a pre-published protocol using standardised radiological criteria and pre-defined cutoffs. A comparison with automated GWR within the same cohort further strengthens our results.
Our main limitation is imprecision due to sample size [5]. A substantial proportion of patients were examined before the pre-specified time point, presumably as part of clinical practice, and are thus reported as part of a post hoc cohort examined ≤ 7 days post-arrest. Additional patients did not receive CT > 48 h post-arrest because they underwent magnetic resonance imaging rather than CT, were assessed with other prognostic methods, or because CT could not be performed for logistical reasons.
In contrast to clinical practice, and to standardise the protocol within a clinical trial, our raters only had axial CT images available, performed qualitative and quantitative assessments separately, could not revise their ratings, and did not have the possibility to discuss their results with colleagues.
Patients included in the TTM2-trial had a presumed cardiac or unknown cause of cardiac arrest, and results may differ for other causes of arrest [25]. The conservative approach to prognostication within the TTM2-trial was designed to limit the risk of self-fulfilling prophecies, reflected in the longer times to withdrawal of life-sustaining therapy in our prospective cohort [25, 28, 29]. Nonetheless, despite the blinded CT evaluations in this study, the risk of self-fulfilling prophecies cannot be entirely excluded, since local radiologists' CT reports were available to treating physicians. Our results should be validated in a cohort where withdrawal of treatment was not performed.