Introduction
The human epidermal growth factor receptor 2 (HER2) gene is located on chromosome 17 and encodes a transmembrane tyrosine kinase receptor protein [1]. HER2 gene amplification and receptor overexpression, which occur in 15% to 20% of human breast cancers, are markers of poor prognosis, including more aggressive disease and shorter survival [2]. Moreover, HER2-positive status is a predictive marker of response to trastuzumab therapy in both the metastatic and adjuvant settings [3,4]. Accurate evaluation of HER2 status is therefore crucial for identifying the patients most likely to benefit from targeted anti-HER2 therapies. Currently, there are several Food and Drug Administration (FDA)-approved methods to evaluate HER2 status, such as immunohistochemical (IHC) assessment of HER2 protein expression or evaluation of HER2 gene amplification using in situ hybridization (ISH), most commonly fluorescence ISH (FISH) [5,6]. The FISH assay is considered one of the reference methods for HER2 evaluation in breast cancer, as it accurately predicts response to trastuzumab therapy [7]. Patients are eligible for trastuzumab therapy when their breast cancer specimens are positive at IHC (i.e. 3+) and/or amplified at FISH (ratio > 2.2). However, patients whose tumor specimen is equivocal at FISH (ratio between 1.8 and 2.2) but whose ratio is ≥ 2.0 also represent potential candidates for targeted treatment.
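The thresholds above amount to a simple decision rule. As an illustrative sketch only (not part of any FDA-approved assay or of the software discussed below; function names are our own), the classification and eligibility logic can be written as:

```python
def classify_her2_fish(ratio: float) -> str:
    """Classify a HER2/CEP17 FISH ratio using the thresholds cited in the text."""
    if ratio > 2.2:
        return "amplified"
    if ratio >= 1.8:
        return "equivocal"
    return "nonamplified"


def eligible_for_trastuzumab(ratio: float, ihc_score: int = 0) -> bool:
    """A specimen qualifies if IHC is 3+ and/or FISH is amplified; equivocal
    cases with ratio >= 2.0 are also treated as potential candidates."""
    if ihc_score == 3:
        return True
    status = classify_her2_fish(ratio)
    return status == "amplified" or (status == "equivocal" and ratio >= 2.0)
```

Under this rule, a specimen with a ratio of 2.1 is equivocal at FISH but still a potential candidate, whereas one with a ratio of 1.9 is not.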
The classical evaluation method for gene amplification, manual signal enumeration by visual estimation, is rather time-consuming. Therefore, several companies have developed automated signal quantification systems that operate through a computer with scanning and image analysis software, such as Metafer 4 produced by MetaSystems [8,9]. The programming algorithm (the so-called “classifier”) currently used by the latter determines the ratio between the average HER2 copy number and the average chromosome 17 centromere (CEP17) copy number on the basis of equi-sized square tiles [10]. This programming algorithm is referred to as the “tile-sampling classifier”. However, as the size of a tile does not always correspond to the size of a single tumor cell nucleus, some postulate that the results obtained might not completely reflect the biology of single cells. To address this concern, MetaSystems has recently developed a new programming algorithm, the “nuclei-sampling classifier”, which automatically quantifies fluorescent signals in nuclei within tissue sections. In this study, we compared results obtained with the reference method, manual scoring, with those obtained with the new nuclei-sampling classifier from MetaSystems in 64 clearly nonamplified (n = 32) and clearly amplified (n = 32) breast cancer specimens.
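Both classifiers reduce to the same arithmetic: the average HER2 copy number divided by the average CEP17 copy number over the sampled units (tiles for the tile-sampling classifier, nuclei for the nuclei-sampling classifier). A minimal sketch, assuming per-unit signal counts have already been extracted (this is illustrative only, not the Metafer 4 implementation):

```python
from statistics import mean


def fish_ratio(her2_counts: list, cep17_counts: list) -> float:
    """HER2/CEP17 ratio: average HER2 copy number over the sampled units
    divided by the average CEP17 copy number over the same units."""
    return mean(her2_counts) / mean(cep17_counts)


# Example: three nuclei with 4, 6 and 5 HER2 signals and 2 CEP17 signals each
# give a ratio of 5 / 2 = 2.5, i.e. amplified under the > 2.2 threshold.
print(fish_ratio([4, 6, 5], [2, 2, 2]))
```

The difference between the two classifiers therefore lies entirely in how the sampled units are defined, not in how the ratio itself is computed.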
Discussion
Our results showed excellent concordance between manual scoring, our reference method, and nuclei-sampling analysis for clearly nonamplified and clearly amplified cases. Indeed, the concordance for nonamplified cases was 100%, both for the automated and the human-corrected nuclei-sampling analyses. For amplified cases, the concordance between the two methods was 96.9% for the automated nuclei-sampling analysis and rose to 100% following human correction. These concordance rates with manual scoring fulfill the ASCO/CAP requirement of greater than 95% concordance for clearly amplified and nonamplified cases [6].
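The concordance rates above follow the usual definition, the percentage of cases on which the two methods agree. As a sketch (illustrative only, not the statistical tooling used in the study):

```python
def concordance_pct(reference: list, test: list) -> float:
    """Percentage of cases where the test call matches the reference call."""
    matches = sum(r == t for r, t in zip(reference, test))
    return 100.0 * matches / len(reference)


def meets_asco_cap(reference: list, test: list, threshold: float = 95.0) -> bool:
    """ASCO/CAP requires > 95% concordance for unequivocal cases; we treat
    the threshold as inclusive here for simplicity."""
    return concordance_pct(reference, test) >= threshold


# 31 of 32 amplified cases called correctly -> 96.875%, i.e. the 96.9% reported.
manual = ["amplified"] * 32
automated = ["amplified"] * 31 + ["nonamplified"]
print(round(concordance_pct(manual, automated), 1))  # prints 96.9
```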
Our results are consistent with those obtained by Theodosiou and collaborators [11]. In their study, they examined the utility of image analysis software (EIKONA3D, Alpha Tec Ltd) for the evaluation of HER2 amplification in nuclei in 100 breast cancer cases from two institutions. Similar to the analysis software presented here, the user had the option to manually correct the results obtained through automated nuclei segmentation. They found very good overall concordance (92.8%) between the results obtained by manual scoring by an expert and those obtained with the image analysis software. Similar to our results, the concordance for nonamplified cases was 100%, whereas the concordance for amplified cases was lower, at 74.1% [11].
In this work, we validated the new Metafer 4 classifier in 64 breast cancer specimens (32 nonamplified and 32 amplified cases), chosen randomly among eligible clearly nonamplified and amplified cases of our cohort, as required by the ASCO/CAP and Canadian guidelines for HER2 testing in breast cancer for the validation of a new test. According to these recommendations, a new test has to be compared with the reference test in at least 25 samples, ideally using 50% unequivocally positive and 50% unequivocally negative cases [5,6]. The new classifier evaluated here was able to recognize cell nuclei on the image and therefore to calculate the HER2 FISH ratio on a per-nucleus basis. Moreover, this new classifier allowed the user to interact with the software during an optional interactive phase in order to refine the selection of cells made automatically by the software. In this study, we refer to this method as “nuclei-sampling analysis” to differentiate it from “tile-sampling analysis”, which was performed with the currently used Metafer 4 classifier. The latter calculates the HER2 FISH ratio on the basis of equi-sized tiles placed by the software on the images. To validate the new classifier, we compared results obtained through manual scoring of the slides, considered the reference method, with those obtained through nuclei-sampling analysis. Moreover, we analyzed the accuracy of both the automated and the human-corrected nuclei-sampling analyses.
As all randomly selected amplified cases analyzed in this study were amplified with HSR, we decided to evaluate the accuracy of the new classifier on the less common amplified cases without HSR. These cases represented about 4% of all amplified cases in our cohort of breast cancer patients. For the 28 cases without HSR that we analyzed, concordance between manual scoring and the automated nuclei-based method was 75%. After human correction, concordance between the two methods rose to 86%. Considering that patients whose specimen is equivocal at FISH (ratio between 1.8 and 2.2) but whose ratio is ≥ 2.0 also represent potential candidates for trastuzumab treatment, 4 of the 7 cases discordant at the automated nuclei-sampling analysis, and 3 of the 4 cases discordant at the nuclei-sampling analysis after human correction, would therefore be eligible to receive targeted treatment. We noticed that some discordant cases were polysomic or monosomic (4 of 7 discordant cases), and we postulate that this aneuploidy could explain the discordance. It has been reported that biological variance reduces sampling efficiency [12]. Indeed, the higher biological variance associated with aneuploidy could have affected the spot counting by the software, which could explain the discrepancy with the results obtained by manual scoring. Moreover, the image quality of some discordant cases (2 of 7 discordant cases) was poor (cell nuclei were blurred in the image), which could be an additional explanation for this discrepancy.
To further analyze the accuracy of the new classifier, we also examined equivocal cases, which represent about 5% of our cohort. Overall concordance between manual scoring and the tile-sampling method was 31%, whereas concordance between manual scoring and the nuclei-sampling method after human correction was 59%. When equivocal cases were split into those with ratio ≥ 2.0 and those with ratio < 2.0, we noticed that twice as many cases were correctly classified with a ratio ≥ 2.0 using the nuclei-sampling method after human correction as compared with the tile-sampling method. So even though concordance between manual scoring and the nuclei-sampling method was not optimal, these results suggest that the nuclei-sampling method is more reliable than the tile-sampling method for identifying patients who could potentially benefit from targeted anti-HER2 therapies. Similar to the amplified cases without HSR, we noticed that some discordant cases were aneuploid (4 of 12 discordant cases). In addition, in 2 of 12 discordant cases the image quality was poor (cell nuclei were blurred in the images).
The tile-sampling method was developed by MetaSystems and other companies to overcome the difficulties frequently encountered when fluorescent signals are enumerated by automated image analysis software. Firstly, reliable separation of overlapping nuclei in tissue sections is very difficult, especially in densely packed tissues such as breast cancer. Secondly, it is difficult for image analysis software to automatically distinguish the distinct cell populations (normal and tumor cells) present in the analyzed fields. To overcome these difficulties, the Metafer 4 software places non-overlapping tiles of equal size on the images in order to cover the majority of the nuclear material and thereby quantify fluorescent signals. Moreover, a ratio estimation algorithm was introduced to improve the accuracy of the automated analysis in samples in which distinct cell populations are present [10,13].
Although the tile-sampling analysis generally performs well [8,9], the nuclei-based analysis offers some advantages over the tile-based analysis. Firstly, the way in which the new classifier selects nuclei for analysis corresponds more closely to what the user does when analyzing a sample. Whereas the nuclei-sampling analysis recognizes cell nuclei, the tile-sampling analysis places equi-sized tiles on the image. In addition, as the size of a tile does not always correspond to the size of a single nucleus, nuclei are often truncated during tile-sampling. As a consequence, a single tile may contain signals from multiple nuclei or only part of a nucleus. This can be disadvantageous especially in cases of chromosome 17 monosomy or polysomy, where the exact number of CEP17 signals per cell is relevant. Secondly, the nuclei-based analysis offers the advantage that the user can improve the selection of cell nuclei automatically made by the software through active interaction with it. During the interactive phase, the user can add nuclei that were not considered during the automated selection, delete unsuitable nuclei, divide overlapping nuclei, or connect separated pieces of the same nucleus. This optional interactive phase requires additional time, on average 7 minutes for equivocal cases and amplified cases without HSR and 4 minutes for nonamplified cases and amplified cases with HSR, but it is very helpful and effective, especially in difficult cases, for instance samples with abundant stroma or intermixed normal cells. In fact, we observed better concordance between results obtained by the reference method and those obtained with the nuclei-based analysis after the interactive phase, compared with results obtained with the fully automated analysis. Theodosiou and collaborators observed similar results using a similar method. In their hands, manual correction required up to 5 minutes per case (nonamplified and amplified cases) and was particularly useful in cases with low image quality [11]. We noticed that among all the functions available to the user during the interactive phase, the delete function was the most effective. In fact, when discordant cases were evaluated blindly by a second independent observer who used exclusively the delete function, the results obtained by the two observers were similar (data not shown). We may therefore conclude that the delete function is very effective in improving the results of the automated nuclei-based analysis. Moreover, the time necessary for human correction can be further reduced if only the delete function is used during the interactive phase (on average 6 minutes for equivocal cases and amplified cases without HSR). The automated nuclei-sampling analysis required between 3 and 5 minutes per case, depending on the cellularity of the images. Our image analysis software is slower than others, for example a Matlab-based approach that required 3.5 seconds for the analysis of a single image on a local server [14]. As we analyzed between 5 and 10 images per case, that approach would have taken between 17.5 and 35 seconds to evaluate a case.
In this study, the reference method was manual scoring of the specimens. Closer examination of the nuclei automatically selected by the software during nuclei-based analysis allowed us to observe that the user can also be biased when analyzing a case. In particular, the human eye tends to pay more attention to cell nuclei in which more fluorescent signals are present. One could therefore argue that the human brain considers those nuclei more attractive and preferentially chooses them during manual signal enumeration. Nuclei-based analysis, on the contrary, selects nuclei on the basis of nuclear shape and the quality of the fluorescent signals, and is therefore more “neutral” in its choice of nuclei. As a result, eligible nuclei with fewer fluorescent signals (which may be judged less attractive by the human brain) are also taken into account by the software. Opinions on this topic diverge. Whereas some point out that software does not always select the most appropriate nuclei for analysis [11], others claim that results obtained with automated analysis are more accurate, especially in amplified and borderline cases, as the number of HER2 signals can only be estimated manually when probe signals cluster closely together [15]. Another advantage of an image analysis system over manual scoring is that storing the captured images allows archiving of cases for future study.
Some limitations are associated with the new classifier. Its accuracy in recognizing nuclei is markedly reduced in images with densely packed cells or in which the DAPI counterstain is blurred. As mentioned above, results obtained with any quantitative image analysis software depend heavily on the fields chosen by the observer for analysis. If the chosen fields are not representative of the sample, results obtained by quantitative image analysis can differ considerably from those obtained through manual scoring. This issue is common to all diagnostic algorithms: a reliable sampling procedure is a prerequisite for diagnostic accuracy in virtual microscopy [12,16].
Standardization of image capture is a central point in the development of a diagnostic algorithm in virtual microscopy [17]. In our study, the optimal specifications for capturing images from HER2 FISH slides hybridized with the PathVysion™ HER2 DNA Probe kit (image size, tile size, identification criteria for HER2 and CEP17 spots, segmentation criteria for nuclei, filtering) had been previously established using over 400 slides (personal communication, Ulrich Klingbeil, MetaSystems). The quality of the captured images is in general excellent, since the quality and intensity of the fluorescence signals are reproducible and the background is very low. Unlike other algorithms used in object-related diagnosis [16], fluorescent spot identification is less problematic. In contrast to other structures within tissue that are difficult to recognize, fluorescent spots are easily identified by the classifier, as they are mostly of the same size and intensity, except in HSR cases. However, spot identification is also reliable in amplified cases with HSR (where spot dimensions can be more variable), since the software adopts a different spot-counting analysis (evaluation of the signal area in the HER2 channel instead of individual spot counts) [10].
Tissue-based diagnosis has undergone remarkable changes following the introduction of new technologies. For instance, technological advances in tissue-based diagnosis allow the implementation of digitized images in routine clinical pathology. Virtual pathology has several advantages over conventional microscopy: for example, it allows archiving of virtual images and promotes continuing education as well as interactive remote consultation between pathologists [18]. Moreover, it has been reported that analysis of digitized slides gives results as accurate as those obtained through conventional microscopy [19,20]. However, one critical point is whether the diagnostic information contained in the virtual slides reliably reflects the real whole slide. In this context, the adopted sampling procedure plays a central role [12]. This is an important point to consider when the efficacy of virtual diagnostic algorithms is compared [12]. Both the tile-sampling classifier and the nuclei-sampling classifier are based on a stratified and passive sampling method as defined by Kayser et al. [12]. However, whereas the tile-sampling classifier recognizes nuclear material through the DAPI filter (and places square tiles on the image where the DAPI staining is strongest), the nuclei-sampling classifier recognizes single nuclei within the tissue on the basis of nuclear characteristics, such as size and roundness. Spot recognition and spot counting are performed in the same way for both methods.
In our clinical context, pathologists share virtual images and results via a LAN platform. This form of information sharing represents one of the first steps towards so-called “Grid technology”. A Grid is an open and dynamic communication system consisting of connected nodes (i.e. servers) that are linked via Internet connections and share certain communication rules based on open standards [21]. Grid technology will also have an impact on the quality of tissue-based diagnosis, as its implementation will require appropriate standardization of the legal, medical and technological aspects associated with virtual pathology [17].
Competing interests
The new Metafer 4 software version and the interactive touch screen were kindly provided by MetaSystems.
Authors’ contributions
DF participated in the conception of the study, data collection, analysis and interpretation of results and wrote the manuscript. SJ, CC, FS and CD participated in the conception of the study, data collection, analysis and interpretation of results and reviewed the manuscript. LP participated in the analysis and interpretation of the results and editing of the manuscript. All authors read and approved the final manuscript.