Introduction

The interest in minimal residual disease (MRD) assessment as a prognostic tool in acute myeloid leukemia (AML) is growing. Molecular markers, although allowing a greater sensitivity, enable MRD monitoring only in a minority of cases.1 Immunophenotypic MRD detection using flow cytometry turned out to offer an attractive alternative. Using immunophenotyping, several early studies have shown the prospective value of the frequency of MRD cells for survival of AML patients both in adults2, 3, 4, 5, 6 and in children.7, 8, 9 Meanwhile, many other studies have confirmed these data.1, 10 Immunophenotypic detection of MRD using flow cytometry can be performed by defining aberrant expression of markers or marker combinations on the leukemic cells at diagnosis. These so-called leukemia-associated (immuno)phenotypes (LAPs) are absent, or are present at very low frequencies, in normal blood or bone marrow (BM) cells. This approach can be successfully applied in 80–100% of AML patients.2, 3, 4, 5, 6, 7, 8, 9, 11

Apart from serving as a prognostic marker, MRD may ultimately turn out to offer a tool to guide patient-tailored therapy, an approach already adopted once for childhood AML.12 Even when categorized as low risk by cytogenetics/molecular markers and/or other conventional prognostic parameters, AML patients may be identified to be at high risk based on MRD cell frequency after induction therapy and stratified to treatment intensification such as allogeneic transplantation. Inversely, intensive treatment in AML patients defined as ‘poor risk’ by conventional factors, but as ‘low risk’ based on MRD assessment, might ultimately be avoided. Furthermore, at follow-up, increments in MRD cell frequency by sequential MRD assessments may identify forthcoming relapses, in turn urging therapy to be administered before overt relapse occurs.

For these reasons, it is important to establish the prognostic value of MRD cell frequency assessment in clinical studies in prospective, multicenter settings. A first requirement to enable adequate performance of such studies is the standardization of all procedures necessary for definition of adequate LAPs in newly diagnosed AML, as well as those necessary for identification and quantification of MRD in BM follow-up material. Standardizing LAP technology for MRD detection is complicated because aberrant immunophenotypes may differ not only from patient to patient but also within a patient, often resulting in many different LAPs (illustrated in Supplementary Figure 1). LAPs differ greatly in their suitability, which is defined by at least four factors. First, the percentage of blasts that bear a particular aberrancy at diagnosis determines the sensitivity of subsequent detection and quantification of AML cells. Second, the extent to which equivalent cells in control normal BM bear the aberrancy determines LAP specificity: the lower the frequencies of such ‘background’ expression, the higher the LAP specificity.5, 13, 14 Third is the stability of LAPs at follow-up, that is, the possibility that expression of the markers that constitute the LAP may be higher or lower at follow-up than at diagnosis,5, 14, 15, 16 thereby resulting in overestimation or false negativity, respectively, of MRD. Finally, the quality of monoclonal antibody conjugates, as well as the stability and resolution of the flow cytometer, is important. Together, these factors often lead to large differences in the applicability of specific LAPs for different leukemia cases.

To standardize all the procedures that are needed for LAP definition as well as for quantification of MRD, five centers in Belgium and the Netherlands joined forces in 2004. These centers had ample experience in (at least) four-color flow cytometry. One of these centers had extensive experience with MRD detection in adult AML and served as the reference and coordinating center.

The primary goal of this study was to standardize the identification of LAPs at diagnosis. First, we defined a standardized antibody panel and standard operating procedures based on the MRD experience in the group, the published data and extensive knowledge of the performance of antibody conjugates. Second, we evaluated whether this standardized antibody panel could identify robust LAPs in the vast majority of AML patients. The quality of LAPs was determined on the basis of their presumed specificity, sensitivity and stability. The coordinating center served as reference for the definition of LAPs. Finally, the performance of individual laboratories in identifying LAPs, and its improvement from an initial learning phase to a test phase, was evaluated.

Materials and methods

Patients and cells

Sixty-four patients with AML, consecutively presenting during a period of 18 months (July 2004–December 2005) in the participating institutes, were included. The median age was 59 years (range: 9–85; two children of 9 and 15 years were included). FAB (French-American-British) classification distribution was 6 M0, 11 M1, 16 M2, 2 M3, 6 M4, 5 M5, 3 M6, 1 M7, 6 RAEB (3 RAEB and 3 RAEB-t), 1 secondary AML (out of CML), 1 AML with non-Hodgkin’s lymphoma and 6 cases with unknown classification. BM of patients with AML, and normal BM from patients with cardiac disease, was obtained after informed consent and according to the institutional protocols. LAPs in normal BM controls were defined as percentages of white blood cells (WBCs).

Participating centers and working plan

The participating centers were required to meet the following criteria: (i) longstanding experience in immunophenotyping of leukemia using at least four-color flow cytometry; (ii) access to clinical samples; (iii) experience in MRD assessment and/or in quantification of low-frequency cell subpopulations; (iv) participation for at least 5 years in external quality control assessment programs for CD34 counting and leukemia/lymphoma immunophenotyping;17, 18 and (v) extensive experience with various antibody–fluorochrome combinations. Five centers from the Netherlands and Belgium participated (for details, see Supplementary Files). Amsterdam (indicated as no. 1) served as the coordinating and reference center, and the other centers were referred to anonymously, in arbitrary sequence, as participants no. 2–5.

Four centers were equipped with a FACSCalibur (Becton Dickinson, San Jose, CA, USA) with an argon and a red diode laser. CellQuest software was used for analyses of the list mode data (LMD). One participant was equipped with a Beckman Coulter five-color FC500 flow cytometer (Miami, FL, USA) with an argon laser. LMD were analyzed with CXP software (Miami, FL, USA).

The study period was subdivided in two phases: a learning phase with 28 patients and a test phase with 35 patients (for details, see Supplementary Files).

Immunophenotypic labeling

The first goal of the working group was to develop a standard panel of monoclonal antibody combinations to define putative LAPs at diagnosis. This panel was composed of monoclonal antibodies and fluorochromes in four-color combinations based on previous experience with LAP definitions and MRD assessments in AML,5, 19 as well as extensive knowledge of antibody–fluorochrome performance. MoAbs were conjugated with fluorescein isothiocyanate, phycoerythrin (PE), peridinyl chlorophyllin or allophycocyanin for FACSCalibur users, and with fluorescein isothiocyanate, PE, phycoerythrin-Texas Red conjugate (energy coupled dye, ECD) and PE-Cy5 for the Coulter FC500 user (Table 1). The backbone of a LAP consists of CD45, a primitive marker (CD34, CD133 or CD117), a myeloid marker (usually CD13 or CD33) and one of the aberrant markers listed in Table 1. The antibody–fluorochrome combinations were centrally purchased, titrated and diluted to the user concentration, and subsequently distributed to all centers.
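The backbone described above can be pictured as a small record with one slot per marker role. The following is a minimal sketch only: the class name `LapTube`, its field names and the `problems` helper are hypothetical, and the aberrant-marker set is a small illustrative subset of markers named elsewhere in the text, not the content of Table 1.

```python
from dataclasses import dataclass
from typing import List

# Marker roles as named in the text. EXAMPLE_ABERRANT_MARKERS is an
# illustrative subset only (assumption); the full list is in Table 1.
PRIMITIVE_MARKERS = {"CD34", "CD133", "CD117"}
MYELOID_MARKERS = {"CD13", "CD33"}
EXAMPLE_ABERRANT_MARKERS = {"CD7", "CD15", "HLA-DR"}


@dataclass(frozen=True)
class LapTube:
    """Hypothetical record for one four-color LAP combination."""
    primitive: str
    myeloid: str
    aberrant: str
    pan_leukocyte: str = "CD45"

    def problems(self) -> List[str]:
        """List deviations from the backbone described in the text."""
        issues = []
        if self.pan_leukocyte != "CD45":
            issues.append("backbone should include CD45")
        if self.primitive not in PRIMITIVE_MARKERS:
            issues.append(f"{self.primitive} is not a listed primitive marker")
        if self.myeloid not in MYELOID_MARKERS:
            issues.append(f"{self.myeloid} is not a usual myeloid marker")
        if self.aberrant not in EXAMPLE_ABERRANT_MARKERS:
            issues.append(f"{self.aberrant} is not in the example aberrant set")
        return issues


tube = LapTube(primitive="CD34", myeloid="CD13", aberrant="CD7")
print(tube.problems())  # an empty list means the backbone is complete
```

Because the text says the myeloid marker is "usually" CD13 or CD33, a real check would probably warn rather than reject on that slot; the strict check here is a simplification.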

Table 1 Standard immunophenotypic panel

LAPs were established in newly diagnosed AML in a two-step approach. In the first step, the panel was used to define the immunophenotype of the blasts in whole BM or peripheral blood at diagnosis. In this so-called first run, aberrant expression patterns of individual markers were defined, and, from these, relevant markers for the putative LAPs for MRD assessment (one LAP per tube) were defined. As these markers may originate from different tubes, the finally chosen combinations had to be tested in a so-called second run. Because of the large number of LAPs, central distribution of conjugate combinations for the second run was not feasible.

For all relevant monoclonal antibodies, the experience of all centers (advantages, disadvantages and pitfalls) was evaluated in regular meetings of the study group. The final list of antibody–fluorochrome combinations for use in the second run is shown on the website http://www.vumc.nl/afdelingen/hematologie/behandelaars/onderzoek/ under ‘clones of monoclonal antibodies’.

Data exchange and data interpretation

When a patient was included in the study in one of the five institutes, the procedure of data exchange was as follows: the LMD of the first run were uploaded by that institute to a website (termed ‘X-drive’), with accessibility restricted to the working group members (Figure 1). Using this site, each of the five institutes analyzed these first run LMD files of that particular patient and, independent of each other, defined putative LAPs. These were communicated to the coordinator. From all putative LAPs, the coordinator, according to its experience, defined so-called ‘proposed LAPs’, fulfilling as many essential criteria as possible. In the quarterly group meetings, these ‘proposed LAPs’, upon consensus, were termed ‘consensus LAPs’. On some occasions, the consensus LAPs deviated from the proposed LAPs. Also, all other putative LAPs were discussed, with performance evaluated for all individual centers.

Figure 1

Organization of the study. When any of the five centers entered an AML patient (top of the figure), it ran the monoclonal antibody panel shown in Table 1 and sent the LMD to the other centers. All centers then formulated their putative LAP(s) and sent these to the coordinator, who subsequently designed ‘proposed LAP(s)’. Finally, at the quarterly group meetings, ‘consensus LAP(s)’ were agreed upon, and ‘final LAPs’ were formulated for subsequent MRD studies.

Statistics

Student’s t-test was used to evaluate the significance of improvements in performance between the learning and test phases of the study.
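As an illustration of this comparison, the sketch below applies a paired form of Student's t-test to the per-institute improvements listed in the legend of Figure 3 (0%, 10%, 17%, 14% and 26% for institutes 1–5). The choice of the paired form is an assumption, since the text does not state which variant was used. The function names are hypothetical, and the P-value is obtained by numerical integration of the t density rather than from a statistics package.

```python
import math
from statistics import mean, stdev


def paired_t(differences):
    """Return (t, degrees of freedom) for a test of the mean paired
    difference against zero (paired Student's t-test)."""
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    return mean(differences) / standard_error, n - 1


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2)
    )
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t, df, steps=20000):
    """Two-sided P-value: 1 minus the probability mass in [-|t|, |t|],
    integrated with the composite Simpson rule (steps must be even)."""
    a, b = -abs(t), abs(t)
    h = (b - a) / steps
    total = t_pdf(a, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(a + i * h, df)
    return 1 - total * h / 3


# Improvements in percentage points for institutes 1-5 (Figure 3 legend).
improvement = [0, 10, 17, 14, 26]

t_all, df_all = paired_t(improvement)          # all five institutes
t_2to5, df_2to5 = paired_t(improvement[1:])    # institutes 2-5 only
p_all = two_sided_p(t_all, df_all)
p_2to5 = two_sided_p(t_2to5, df_2to5)
print(round(t_all, 2), round(p_all, 3))
print(round(t_2to5, 2), round(p_2to5, 3))
```

Under these assumptions the two P-values come out close to those given in the Figure 3 legend (P=0.035 for all institutes and P=0.016 for institutes 2–5), which suggests, but does not prove, that a paired comparison of this kind was used.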

Results

Standard immunophenotypic panel (first run)

After the learning phase, tube 6 was modified: TdT was replaced by CD36, as TdT had not contributed to suitable independent LAPs, did not function appropriately as an exclusion marker for M4 and M5 AML types, and, as an intracellular marker, hampered rapid application of the panel.

In the next five sections, the performance of the panel (number and nature of LAPs detected and the sensitivity, specificity and stability of consensus LAPs) is described. This part covers the final results after the consensus meetings. Finally, performance of individual centers in both phases is described.

Number and nature of consensus LAPs

In the learning phase, 55 consensus LAPs were found in 28 patients (Tables 2a and b, third column). No LAPs were found in one patient (3.5%). In the test phase, 88 consensus LAPs were found in 35 patients (Tables 2a and b, fourth column). LAPs were found in all patients. For details on numbers of LAPs per patient per study phase, see legends of Table 2. Taken together, one or more LAPs could be identified in almost all AML patients (63/64: 98.5%).

Table 2 Consensus LAPs in terms of sensitivity and specificity

Table 3 shows a detailed overview of consensus LAPs, divided into the well-known four major groups. Important are LAPs that were chosen most frequently as the best (that is, most sensitive and most specific, and stable, as outlined in the next paragraphs). Table 3 shows the leading aberrancies in the first column, with, in the fourth column (printed in boldface), the percentage of patients for whom these aberrancies were chosen as best consensus LAPs. These included cross-lineage expression with CD7 as the leading marker (36% of the patients), asynchronous expression of CD34 and CD133 (20%), asynchronous expression of CD15 combined with a progenitor marker (28%) and absence of HLA-DR (20%) or CD33 (11%) in combination with a progenitor marker. All other aberrancies were considered suitable in <10% of the patients.

Table 3 Summary of consensus LAPs

In 60 of the 64 patients, at least one primitive marker (CD34, CD117 or CD133; more details about distribution in legends of Table 3) was present in the consensus LAPs. The remaining four cases had LAPs without a primitive marker (three patients), or no LAP at all (one patient).

Sensitivity of the consensus LAPs

LAPs were categorized into the following three groups based on the percentage of LAP-covered leukemic cells (Table 2): (1) >50% (sensitivity category ‘high’); (2) 20–50% (‘intermediate’); and (3) 10–20% (‘low’). LAPs present on <10% of blast cells were not considered suitable for MRD monitoring.

Table 2a shows for all consensus LAPs that the best sensitivity category (‘high’) contained 64% (51+13%) and 50% (42+8%) of the LAPs in the learning phase and test phase, respectively. Taking the high and intermediate categories together, these figures were 92% and 90%, respectively.

On performing the same analysis for the best consensus LAP (total number is 63, similar to the number of patients with LAP), it was seen that the high category contained in the learning phase 71% of the LAPs and in the test phase 60% of the LAPs. Taking the high and intermediate categories together, these figures were 93% and 97%, respectively (Table 2b).

Specificity of the consensus LAPs

The specificity of a LAP depends on the expression of the aberrant immunophenotype on normal myeloid progenitor and/or mature cells. All participants evaluated the background of the consensus LAPs in normal BM. The median expression on normal myeloid cells usually was <0.05% (% of WBC), with maximum values generally <0.1% (Figure 2). This background resulted mainly from aberrant expression on the immature population: LAPs with an immature marker, which represent the vast majority of all consensus LAPs (87%) and of all best consensus LAPs (95%), had up to 10% background aberrancies on the immature population (Figure 2b). During the study, CD36 was found to have higher background percentages on WBC and on primitive cells (Figures 2a and b) and was thereafter excluded from further analysis. In LAPs without primitive markers (only 13% of all consensus LAPs and 5% of best consensus LAPs), median LAP expression on normal BM WBC was usually <0.1% (data not shown).

Figure 2

Aberrant expression of marker/marker combinations on WBC and immature cell fractions in normal BM. Different LAP immunophenotypes tested in normal BM. All LAPs contained CD45 and, if not indicated in the figure, a myeloid marker (CD13 or CD33) or CD117. (a) LAP expression as % of WBC. Note the relatively high expression of CD34+CD36+. (b) Aberrant marker expression as a percentage of primitive marker compartment (CD34 or CD117). Note logarithmic scale in a and b.

The specificity of a consensus LAP was then categorized as (1) ‘high’ (<0.1% background) or (2) ‘low’ (≥0.1% background). The two specificity categories are presented in Table 2a and distributed over the three sensitivity categories. The high-specificity category contained 82% (51+24+7%) of all consensus LAPs in the learning phase and 82% (42+33+7%) in the test phase.

On performing the same analysis for the best consensus LAP (total number 63), it was seen that the highest specificity category represented 86% of the patients in the learning phase and 88% in the test phase (Table 2b).

Combined analysis of sensitivity and specificity

Ultimately, the quality of a consensus LAP has to be judged on the combined evaluation of sensitivity and specificity. The best LAP possesses the highest sensitivity (>50% expression) and the best specificity (<0.1% background). The order of quality of other LAP categories is presented in the legends of Table 2. Assuming that the high- and intermediate-sensitivity categories, together with high specificity, will offer the best consensus LAPs (63 in total), Table 2b shows that in 82% (64+18%) of cases in the learning phase and in 85% (51+34%) in the test phase, LAPs met this demand (further details are in Supplementary Files). It can thus be concluded that, in the majority of the 63 evaluable patients, relatively high-quality (>20% expression and <0.1% background reactivity) LAPs can be designed.
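The category definitions used in this and the two preceding sections can be summarized in a few lines. This is a minimal sketch with hypothetical function names. The thresholds are taken from the text, but the handling of values that fall exactly on a boundary (20% or 50% blast coverage) is an assumption, because the published ranges overlap at those points.

```python
def sensitivity_category(blast_coverage_pct):
    """Map the % of blasts carrying the LAP at diagnosis to a category.

    Assumption: exactly 50% counts as 'intermediate' and exactly 20%
    as 'intermediate', since the text gives >50%, 20-50% and 10-20%.
    """
    if blast_coverage_pct > 50:
        return "high"
    if blast_coverage_pct >= 20:
        return "intermediate"
    if blast_coverage_pct >= 10:
        return "low"
    return None  # <10% of blasts: not considered suitable for MRD monitoring


def specificity_category(background_pct_of_wbc):
    """'high' if background in normal BM is <0.1% of WBC, else 'low'."""
    return "high" if background_pct_of_wbc < 0.1 else "low"


def is_relatively_high_quality(blast_coverage_pct, background_pct_of_wbc):
    """High or intermediate sensitivity combined with high specificity."""
    return (
        sensitivity_category(blast_coverage_pct) in ("high", "intermediate")
        and specificity_category(background_pct_of_wbc) == "high"
    )


# Hypothetical LAPs: (blast coverage %, background % of WBC in normal BM).
examples = {"LAP A": (75, 0.02), "LAP B": (30, 0.05), "LAP C": (15, 0.02),
            "LAP D": (60, 0.3), "LAP E": (5, 0.01)}
for name, (coverage, background) in examples.items():
    print(name, sensitivity_category(coverage),
          specificity_category(background),
          is_relatively_high_quality(coverage, background))
```

The example values are invented for illustration and do not correspond to any patient in the study.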

Stability of LAPs

Scoring for a good LAP based on stability did not include a learning and test phase, but it was performed using a list of expression of markers in paired diagnosis–relapse samples. We have performed an extensive comparison between diagnosis and relapse in 33 AML patients (18 different antigens in 125 diagnosis–relapse comparisons; Supplementary Table 1). In general, LAPs are relatively stable in most cases with a notable exception for CD19. For details, see Supplementary Results and Supplementary Table 1.

Performance of individual centers

The data discussed above constitute the final outcome of all consensus meetings. Next, we describe the performance of individual centers.

Missed LAPs

In the learning phase, centers 2–5 missed 39–63% of the consensus LAPs, whereas the coordinator (no. 1) missed 11% (Table 4). In the test phase, these percentages dropped to 27–40% and 3%, respectively (Table 4). Thus, the improvement was 12–33% for centers 2–5 and 8% for the coordinator. Supplementary Figure 2 shows a detailed overview of the numbers of missed LAPs per institute per phase.

Table 4 Percentages of missed LAPs by the different centers in learning and test phase

Number of patients with no second LAP defined; comparison between the institutes

We then studied how many patients would not have been eligible for follow-up evaluation because all LAPs present had been missed. For a patient to be eligible for MRD evaluation, at least one consensus LAP should have been defined. The answer, based on the data in Supplementary Figure 2, is summarized in Figure 3a: in the learning phase, the percentages of patients apparently ineligible for MRD detection ranged from 21 to 33% for centers 2–5, whereas the coordinator identified at least one LAP in all. In the test phase, results clearly improved: the percentages of patients without any identified LAP ranged from 7 to 18%, whereas the coordinator had no misses.

Figure 3

Missed consensus LAPs. For 63 of the 64 patients reported here, at least one consensus LAP could be agreed upon. Cases with no LAP defined for a particular institute and a particular patient were scored as ‘missed patient’. Institutes 1–5 analyzed 29, 24, 29, 28 and 12 samples in the learning phase (gray bars) and 35, 35, 35, 34 and 27 samples in the test phase (black bars), respectively. Panel a shows the proportions of patients in whom no LAPs were defined and panel b shows the proportions of patients in whom extra LAPs were missed. (a) For institutes 1–5, percentages of patients who would not have been eligible for MRD assessment are shown (no LAPs were defined, whereas at least one consensus LAP was present). Improvement was 0%, 10%, 17%, 14% and 26% for institutes 1–5, respectively. Differences between the two phases were significant: P=0.035 (all institutes: n=5) and P=0.016 (institutes 2–5). (b) For institutes 1–5, the percentages of patients in whom only one LAP was defined, whereas at least two consensus LAPs were present, are shown. Only institute 2 had made considerable improvement (20%). Note that this figure does not include learning effects for all defined LAPs: these have been shown earlier in Table 4. In that table, all institutes made progress, which apparently is merely due to increased performance in defining LAPs additional to the two defined for Figure 3b.

The best policy to avoid missing MRD due to immunophenotypic shifts may be to define, whenever possible, at least two different LAPs. The performance of all centers to define a second LAP in patients where such LAP had indeed been formulated at the consensus meetings is shown in Figure 3b. In the learning phase, such a second LAP was missed in 8–29% of patients by centers 2–5 and in 10% of patients by the coordinator. These percentages were slightly lower in the test phase: a second LAP was missed in 9–20% of patients by centers 2–5 and in 6% by the coordinator. Thus, in addition to patients in which all LAPs had been missed (Figure 3a), a second LAP may be missed in some instances, where a first LAP had been successfully identified (Figure 3b).
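The three per-center measures discussed so far (missed consensus LAPs, patients with no LAP identified and patients with only one of at least two consensus LAPs identified) can be computed as sketched below. This is not the study's scoring procedure: the function name, the data layout, the matching of LAPs by exact label and the choice of denominators are all assumptions, and the example data are invented.

```python
def score_center(consensus, center):
    """Score one center against the consensus.

    consensus, center: dicts mapping patient id -> set of LAP labels.
    Only patients analyzed by the center (keys of `center`) are scored.
    Returns percentages; a denominator of zero yields None.
    """
    total_laps = missed_laps = 0
    patients_with_lap = patients_all_missed = 0
    patients_with_two = patients_second_missed = 0

    for patient, found in center.items():
        agreed = consensus.get(patient, set())
        hits = found & agreed
        total_laps += len(agreed)
        missed_laps += len(agreed - found)
        if agreed:
            patients_with_lap += 1
            if not hits:
                patients_all_missed += 1   # 'missed patient' (cf. Figure 3a)
        if len(agreed) >= 2:
            patients_with_two += 1
            if len(hits) == 1:
                patients_second_missed += 1  # second LAP missed (cf. Figure 3b)

    def pct(part, whole):
        return 100.0 * part / whole if whole else None

    return {
        "missed_lap_pct": pct(missed_laps, total_laps),
        "no_lap_patient_pct": pct(patients_all_missed, patients_with_lap),
        "second_lap_missed_pct": pct(patients_second_missed, patients_with_two),
    }


# Invented example: four patients, one center.
consensus = {
    "P1": {"CD34+CD7+", "CD133+CD34-"},
    "P2": {"CD34+CD15+"},
    "P3": {"CD34+HLA-DR-", "CD117+CD7+"},
    "P4": set(),
}
center = {
    "P1": {"CD34+CD7+"},
    "P2": set(),
    "P3": {"CD34+HLA-DR-", "CD117+CD7+"},
    "P4": set(),
}
result = score_center(consensus, center)
print(result)
```

In this invented example the center misses two of five consensus LAPs, identifies no LAP in one of three patients with a consensus LAP, and misses the second LAP in one of two patients with at least two consensus LAPs. The published percentages may use different denominators, for example all samples analyzed by an institute.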

Types of missed LAPs

In the next analysis, we investigated which types of LAPs were missed. LAPs missed by more than one center in relatively high percentages were those with asynchronous antigen expression (31–48%; with the exception of LAPs defined by absence of marker(s), for example, CD133+CD34−, CD15+HLA-DR−, CD33− and HLA-DR−; see Table 3), those with marker overexpression (25–88%) and those without immature markers (14–50%) (Supplementary Table 2). The identification of LAPs with cross-lineage marker expression was generally most successful (7–32% misses). Performance of the individual LAPs is shown in Supplementary Table 3. Although, upon this further subdivision, numbers become too low to evaluate reliably, institutes were observed to perform well for the cross-lineage LAP with CD7, and also for LAPs with aberrant absence of marker expression (see above).

Irrespective of the type of LAP, it may be argued that the sensitivity and specificity category may also affect the success of identification of a LAP. Figure 4 shows the results for sensitivity only, because the specificity categories harbored too low numbers for reliable conclusions. The main conclusion (details in Supplementary Results) is that the centers performed much better for high-sensitivity category compared with intermediate and low-sensitivity category. However, substantial improvement was seen from the learning phase to the test phase, ending with almost similar performance in the high and intermediate category in the test phase.

Figure 4

Performance of individual institutes in different sensitivity categories in the learning and test phase. The figure shows the performance of the four centers to define LAPs, now subdivided into the three sensitivity categories (i.e., high (>50%), black; intermediate (20–50%), dark gray; and low (10–20%), light gray). Missed LAPs in a sensitivity category are expressed as a percentage of the maximal number of consensus LAPs defined in that specific sensitivity category. In the learning phase, the percentage of missed LAPs was significantly higher in the intermediate-sensitivity (marked *P=0.01) and in the low-sensitivity category (marked #P=0.001) than in the high-sensitivity category (all centers included). These differences were also significant when the coordinating center (1) was excluded from analysis (P-values ranging between 0.01 and 0.05). When comparing the learning and test phases, for centers 2–5 there was an improvement in the intermediate-sensitivity category (marked P=0.016). There was also improvement in the lowest sensitivity category (marked §P=0.024).

Incorrectly assigned LAPs

Although missed LAPs may prevent evaluation of a patient’s MRD, incorrect assignment of a LAP may result in unreliable assessments of MRD cell frequency. Incorrectly designed LAPs mainly resulted from underestimation of background LAP expression in normal BM. In the learning phase, centers performed experiments on background staining in normal BM and evaluated specificity. Specificity of LAPs was an important issue at the consensus meetings. Incorrectly designed LAPs were only scored as such during the test phase and are shown (per sample and per institute) in Supplementary Figure 2 (green symbols). The percentages of incorrectly designed LAPs were low in institutes 1 and 2, but considerably higher in institutes 3 and 4. The number of LAPs was too low for reliable conclusions for institute 5. Such incorrect designs included the omission of exclusion markers, incorrect interpretation of levels of expression and definition of incorrect LAPs (for detailed information, see Supplementary Results). No specific type of incorrectly designed LAPs, common to all participants, could be defined, except for the fact that, for LAPs defined by the absence of expression of markers such as HLA-DR, inclusion of an exclusion marker is recommended.

Discussion

The current study was undertaken to establish criteria necessary to enable assessment of MRD in clinical studies in a multicenter setting. Such studies should establish whether or not MRD has the prognostic value reported by several single-institute retrospective studies, and at what time points and for which specific patient groups MRD assessment can be a part of clinical decision making. Such an approach has meanwhile been applied in childhood AML, although in a single-center setting.12

When defining criteria to be fulfilled for a multicenter approach, we defined two distinct studies: first, definition of LAPs at diagnosis and, second, the use of such LAPs to detect and quantify MRD during follow-up. The present study deals with the diagnostic part. Altogether, the final success rate of defining at least one consensus LAP was high (only 1 out of 64 samples had no LAPs at all), whereas in two-thirds of the samples two or more independent consensus LAPs could be defined. However, this result was achieved after multiple analyses and extensive group discussions and was based for a large part on the experience of the coordinating center. The difficulty that relatively inexperienced centers have in defining LAPs is evidenced by the high percentages of LAPs missed by centers 2–5, even in the second, test phase. Some cross-lineage aberrancies, as well as those asynchronous aberrancies that are characterized by complete lack of a marker, were apparently the easiest to define. Other asynchronous aberrancies, antigen overexpression and aberrancies on mature cells were more difficult to establish. In addition, in the initial learning phase, the success rate decreased with decreasing sensitivity (that is, with decreasing blast coverage percentage of LAPs).

With time, a learning effect, evidenced by the success rate of defining LAPs, was seen. This result was partly because of the thorough discussions at the group meetings. Ultimately, in the test phase this resulted in a success rate (that is, at least one LAP defined) of 82–93% for the relatively inexperienced institutes 2–5 in the 35 evaluable samples. Improvement was especially evident in the intermediate-sensitivity category; in the test phase, LAPs were identified with equal success rate in the high-sensitivity and the intermediate-sensitivity category. However, the overall results imply that, because of missed LAPs, 7–18% of the patients would not have been evaluable for MRD according to these centers.

Furthermore, in another 9–20% of cases, a center defined only a single LAP, whereas at least two LAPs were finally defined in the consensus meetings. These patients would be more likely to become false negative for MRD if immunophenotypic shifts occurred. It has been reported by others and by us5, 11, 16 that such shifts occur frequently, although they do not always lead to the disappearance of markers. In case only one aberrant marker is present, it is advised to design a second staining in which the nonaberrant markers that are included in the LAP are replaced by other, nonaberrant, markers. In this way, the risk of loss of these normal markers, which could also result in false negativity, is minimized. For example, a ‘primitive’ marker, CD133, may occasionally become negative; replacement by, for example, CD34 would then be adequate. This condition was met in the current study because in 18% of cases at least two primitive markers among CD34, CD133 and CD117 were present. In most cases, it is possible to replace a myeloid marker by another. As the expression of CD33, CD13 and CD117 is quite stable, the design of the same LAPs with alternative myeloid markers is not urgent.

Our group had agreed that LAPs should cover at least 10% of the blast cells at diagnosis: first, because ‘background’ levels in normal BM may amount to this percentage and, second, because lower frequencies will significantly hamper the sensitivity necessary for MRD detection.

Another important factor for the quality of a LAP is specificity, which is determined by LAP expression on control normal or regenerating BM. Cell populations with the immunophenotype of LAPs are present in frequencies varying from 0.001 to 0.1% in control BM. In practice, the contribution of these ‘LAP-like’ events to specific AML-derived, LAP+ cells will have to be defined in the setting of MRD. This interaction will strongly depend on the level and intensity of expression of the aberrant markers on the normal versus AML cell populations. AML cells at diagnosis may occupy the so-called ‘empty spaces’—that is, positions of antigen expression absent in normal cells; however, because of even moderate immunophenotypic changes during or after therapy, AML-specific LAP+ events may interfere with normal LAP+ events. Even if stringent precautions have been taken to standardize antibody–fluorochrome combinations, as well as instrumental setup and calibration over time, there is no guarantee that the AML LAP+ population will not be ‘contaminated’ by normal events with ‘LAP-like’ immunophenotypes. The impact of ‘background LAP’ on the reliable detection of specific (AML) LAP+ events is considerable: up to 0.1% of WBC and up to 10% of primitive marker compartment is possible; as a result, specific detection of MRD <0.1% is compromised (Figure 2). On the basis of its very high background levels, at the end of the learning phase, CD36 was concluded not to represent a valuable marker for MRD.

The type of LAP and their distribution over patient groups differ within the literature. In the present study, the frequency of LAPs with asynchronous expression was twice as high as for cross-lineage aberrancies, which is in agreement with some other adult studies performed with these two types of major aberrancies.14 However, in two previous studies on adult AML5 and childhood AML11 in which we were involved, the frequencies of both LAP types were more or less equal.

Our study also indicated that MRD studies may be performed using different instruments: in our case, four Becton Dickinson and one Beckman Coulter (Miami, FL, USA) flow cytometer were used. This is of importance in view of international clinical studies in which MRD assessment may be dealt with by (core) centers that may use different brands of flow cytometers.

The large differences in specificity, as well as LAP coverage on blasts, represent a problem when cutoff levels have to be defined for clinical decision making: although in the case with high specificity and high sensitivity, MRD levels of 0.01% (that is, 1:10 000) can be accurately measured, in other cases levels of only 0.1% or even lower may be the maximally reachable sensitivity. Improvements can be expected from the use of more colors, as shown by Voskova et al.13, in the transition from four to five colors. The use of at least eight-color flow cytometry, now feasible in many institutes, will further improve flow-cytometric MRD analysis. The use of novel software tools (for example, Infinicyt (Cytognos, Salamanca, Spain)) for defining LAPs in an objective manner may further improve flow-cytometric MRD analysis.

In conclusion, our study has shown that immunophenotypic MRD assessment is a complex process that requires specific experience that is not covered by a large experience in only standard diagnostic immunophenotyping. The quality of MRD assessment can only be guaranteed after a substantial period of training under the guidance of highly experienced center(s). Therefore, we recommend that, currently, the number of centers that participate in large-scale multicenter MRD studies be restricted to a few core centers with MRD experience.