Introduction
Oropharyngeal squamous cell carcinoma (OPSCC) is one of the most frequent head and neck cancers, closely related to human papillomavirus (HPV) infection in the majority of cases [
1]. Despite sharing the same anatomical location, HPV-positive and HPV-negative OPSCCs present crucial differences that oncologists must take into account: 1) clinical presentation, as HPV-positive OPSCC symptoms are typically related to a neck mass due to nodal spread of disease, whereas patients with HPV-negative lesions present symptoms related to local growth of the primary tumour, such as odynophagia and dysphagia; 2) HPV-negative OPSCCs have lower survival and response rates to radio-chemotherapy than HPV-positive ones; 3) patients affected by HPV-positive OPSCC are often younger than those with HPV-negative OPSCC [
2]. Therefore, HPV status determines the appropriate therapy and follow-up plan. In patients with OPSCC, HPV status is routinely assessed on biopsied tissue by p16 immunostaining. However, surgical biopsy exposes patients to surgery-related complications, such as bleeding [
3], and the presence of co-existing inflammatory changes in the specimen might decrease the sensitivity of the immunostaining [
4].
Although several studies have described imaging features useful for predicting HPV status [
5‐
7], this approach is not sufficiently reliable due to the presence of overlapping radiological characteristics [
8]. To overcome the limitations of subjective medical image interpretation, several authors have investigated the potential utility of texture analysis in HPV status assessment [
9,
10], since one of the aims of radiomics and machine learning (ML) is the conversion of medical images to quantitative, reader independent data for predictive modelling [
11,
12].
Radiomics refers to the analysis of large amounts of quantitative features extracted from medical images. These features include pixel grey-level distribution parameters and texture-analysis-derived data, which evaluate grey-level value patterns in images. ML is a subfield of artificial intelligence that can be used to build classification or regression models from radiomics data through automated recognition of patterns in the data space, implementing predictive algorithms [
11,
13].
Given this potential, the number of radiomic studies has recently grown dramatically, especially in oncological imaging [
14,
15]. However, despite these efforts, routine use of these tools in the clinical setting has not yet occurred, due, for example, to the lack of technique standardization and external validation [
14]. The increasing attention given to ML and radiomics has also resulted in a growing availability of quality assessment checklists, such as the Radiomic Quality Score (RQS) [
12,
16,
17]. The RQS’s strength lies in its evaluation of different aspects of radiomic studies, ranging from image acquisition protocol to data sharing, grouped into six domains (protocol quality and reproducibility, feature selection and validation, biologic/clinical validation and utility, model performance index, high level of evidence, and open science and data). Each item contributes to a final percentage score for the paper, allowing a quantitative assessment of methodological quality. The value of the RQS is also confirmed by its use across various topics in the recent literature [
18‐
20]. An additional advantage of the RQS is the possibility to use its final score to perform statistical analyses with other variables. As also described in a previous report [
19], radiomic studies are published in peer-reviewed journals specialized not only in radiology but also in a variety of other fields, demonstrating widespread interest among the research community.
With the present systematic review, we aimed to perform a literature review with RQS-based quality assessment, as well as an evaluation of the relationship between study quality and journal characteristics. In particular, we focused on current applications of radiomics in OPSCC imaging for the prediction of HPV status and on the association between study quality and indices commonly accepted as proxies for research quality [
21].
Discussion
In the present systematic review, 19 radiomics and ML investigations on OPSCC published in the recent literature were evaluated. In this setting, one of the most crucial issues in clinical practice is HPV status evaluation [
26], and conventional imaging is not currently able to reliably replace the current gold standard (expression of the p16 protein assessed by immunohistochemistry on tissue specimens [
27]), despite various attempts [
28‐
30]. The quality of the included studies was very low (median raw score 12, corresponding to a median percentage score of 33%), with the highest RQS equal to 15 (42%). A significantly higher RQS was found in clinical journals than in radiological ones, while no correlations were observed between the RQS and other journal characteristics (JIF, JIF/JCI quartile, or year of publication).
Radiomics and texture analysis have tried to fill the gaps in oral oncology and improve the performance of medical imaging. In recent years, several authors have attempted radiomic modelling for HPV status prediction based on different imaging techniques: CT, MRI, or PET/CT. Interestingly, only a minority of papers employed deep learning (DL), despite the increasing attention this approach has received [
31]. Although MRI has demonstrated its usefulness in HPV status assessment [
32,
33], most of the studies in our review focused on CT images. This could be due to some of CT's advantages: 1) wider availability in most hospitals [
34]; 2) the greater variability of MRI, which depends on acquisition parameters as well as on the scanner used [
35]. The relevant radiomic features extracted from CT images have shown potential for HPV status prediction in either internal [
8,
35‐
38] or external validation [
39]. Other authors employed T1-weighted post-contrast [
25,
40,
41] and ADC [
25,
42] images on MRI, and a combination of PET-based and CT-based radiomics on PET/CT [
23]. In some cases, specific steps within the radiomic pipeline were also explored, such as the comparison between 2D and 3D segmentation [
43] and variations due to different CT scanners [
44]. These investigations are valuable, as limitations in the reproducibility of radiomics due to different CT reconstruction algorithms and image noise have been reported [
45].
The RQS is one of the best-known quality assessment checklists in the field of radiomics and was used to evaluate each paper’s strengths and weaknesses. Proposed by Lambin et al. [
12], this score is composed of various items designed to reflect commonly employed steps in radiomic analysis pipelines, allowing quantitative and reproducible evaluation by peers. Although this score may be excessively strict when considering the practical issues of medical imaging research, it still represents a valuable and well-known tool [
19]. As in systematic reviews in other oncological imaging fields [
20,
46,
47], the quality of the included studies was very low, with the RQS ranging from -2 to 15 (0 to 42% when expressed as a percentage). In line with previous investigations [
20,
46], some RQS items were satisfied to a greater extent than others. More than half of the articles performed feature reduction, decreasing the risk of overfitting, and included non-radiomic features in a multivariable analysis. To demonstrate the utility of radiomics, all studies included a comparison to the current gold standard method, which is not always the case in other RQS-based systematic reviews. Some common missing steps were also recognised: less than 15% of radiomics pipelines included a cut-off analysis based on previously published reports, and no cost-effectiveness analyses, phantom studies, or multiple time-point imaging were available. Open science implementation was also limited, with publicly available datasets used in very few cases, despite the advantages they might provide for testing the reproducibility of proposed radiomics-based predictive models. Furthermore, in over a quarter of the included articles final model validation was entirely missing. Actual clinical implementation of these results will require more robust validation and possibly studies dedicated to this task.
Additional proxy quality indicators were included in our study. As journal quality indices, the JIF and JIF quartiles were selected. The JIF is an index of a journal's relevance in its field of research, calculated as the average yearly citations received by research published in the journal during the previous two years [
48]. Other journal performance indicators, quartile rankings by JIF and JCI, were also used. These provide additional insight; the JCI in particular reflects a 3-year citation window and, unlike the JIF, is a field-normalized citation metric [
49]. Since Lambin et al. proposed expressing the total score as a percentage for quality assessment, the association between this percentage and journal characteristics was evaluated. Interestingly, a significantly higher RQS was found in clinical journals compared to radiological ones. This may be because some RQS items, such as multivariable analysis with non-radiomic features and comparison to a gold standard, benefit from the inclusion of a clinical researcher among the authors.
On the other hand, no significant correlation was found between the RQS and either the JIF or the JIF quartile. We also did not find a significant increase of the RQS in relation to the year of publication. Similarly, the highest scores were not associated with the better-ranked journals in terms of JIF or JIF quartile. These results can be interpreted in different ways: i) a lack of uniformity in the evaluation of radiomics/ML quality by reviewers, supported by the absence of an association between the excellence of the investigation and the performance of the journal; ii) as hypothesized by another systematic review [
19], RQS items may not reflect journal or reviewer points of focus, such as patient selection criteria and the topic of the analysis; iii) the items proposed by Lambin et al. [
12] in the RQS may be too technical for general peer review. The results of our analysis confirmed findings from previous reviews: not only was the median RQS score in OPSCC articles in line with that reported by other authors [
20,
50], but the JIF was also not related to the quality of the radiomic analysis [
19]. However, our findings suggest that a higher RQS was found in clinical journals, contrary to what was reported in our previous work [
19] and Park et al. [
50], although the latter described a non-significant trend towards higher RQS in clinical journals [
50].
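Associations such as RQS versus JIF are typically assessed with rank-based statistics, since neither variable can be assumed to be normally distributed. As a minimal illustration (not the review's actual analysis code; the per-study values below are hypothetical), Spearman's rho can be computed in pure Python as the Pearson correlation of the rank vectors:

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of rank positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical example: RQS percentage scores vs. journal impact factors
rqs = [33, 42, 25, 33, 0, 28]
jif = [3.2, 5.1, 2.8, 4.0, 3.5, 2.1]
print(round(spearman_rho(rqs, jif), 2))  # prints 0.55
```

In practice a library routine (e.g. SciPy's `spearmanr`) would also supply a p-value; the sketch above only shows why the metric is robust to the skewed distributions of citation indices.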
The present review has some limitations. First, the sample of included studies was small and heterogeneous in terms of design and imaging modalities (MRI, CT, PET/CT). Second, the journal quality indicators included, such as the JIF, JIF quartile, and JCI, are themselves influenced by potential biases [
51], despite the JIF being widely recognised as a valuable indicator [
52]. Finally, as proposed by Lambin et al. [
12], the RQS should be expressed as a percentage score; however, all scores below zero are converted to 0%, losing the differences between scores ranging from -8 to 0.
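The percentage conversion, and the information loss it entails, can be made concrete with a short sketch. Assuming the standard RQS maximum of 36 points from Lambin et al., all negative raw totals collapse to 0%:

```python
def rqs_percentage(raw_score: int, max_score: int = 36) -> float:
    """Convert a raw RQS total to the percentage score proposed by
    Lambin et al.; negative totals (possible range -8 to 36) floor at 0%."""
    return max(raw_score, 0) / max_score * 100

# Consistent with the summary statistics reported in this review:
print(round(rqs_percentage(12)))  # median raw score 12 -> 33
print(round(rqs_percentage(15)))  # highest raw score 15 -> 42
print(rqs_percentage(-2))         # lowest raw score -2 -> 0.0
```

Note that every raw score from -8 to 0 maps to the same 0%, which is why the percentage form cannot distinguish among the weakest studies.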
In conclusion, radiomics and ML studies for the prediction of HPV status in OPSCC have demonstrated low overall quality according to the RQS. While study quality was not related to journal quality indices, articles with the best RQS scores were found in clinical journals. Future investigations in this field should take into account the issues highlighted in this review in order to improve upon previous experiences and facilitate the translation of promising research results into real-world clinical practice.