INTRODUCTION

In the USA, a generic drug is approved based on its pharmaceutical equivalence and bioequivalence (BE) to the reference listed drug (RLD). BE studies are generally conducted by comparing the in vivo rate and extent of drug absorption of a test and an RLD product in a 2-sequence, 2-treatment, 2-period crossover study in healthy subjects, where the test drug refers to the generic drug under investigation (or under development). A test product is considered to be bioequivalent to a reference product if the 90% confidence interval (CI) of the geometric mean ratio (GMR) of AUC (area under the concentration vs. time curve) and Cmax (maximum concentration) between the test and reference falls within the limits of 80.00–125.00%. This approach is based on the assumption that a 20% difference between the test and reference products is not clinically significant. A crossover BE study outcome can be affected by study sample size and the within-subject variability (WSV) (1,2). WSV refers to variability in a response (e.g., plasma drug concentration) within the same subject, when the subject is administered two doses of the same drug on two different occasions (3). This variability may be intrinsic to the drug substance and/or the formulation, but may also include analytical variability, drug product quality variability, physiological or pathological variability of the subject, and unexplained random variation. WSV is usually measured by the within-subject variance (\( {\sigma}_{\mathrm{WR}}^2 \)) or the within-subject standard deviation (\( {\sigma}_{\mathrm{WR}} \)). Given the same sample size and mean test/reference ratio, drugs with smaller WSV would, in most cases, more easily pass the conventional average BE limits of 80.00–125.00%.

For narrow therapeutic index (NTI) drugs—where small differences in dose or blood concentration may lead to serious therapeutic failures and/or adverse drug reactions (4)—a 20% difference in blood concentration or drug exposure may be unacceptable. Although the US Food and Drug Administration (FDA) does not currently provide a list of NTI drugs, digoxin, lithium carbonate, phenytoin, tacrolimus, theophylline, and warfarin are usually considered NTI drugs by many experts (5,6). At the April 2010 Advisory Committee for Pharmaceutical Science and Clinical Pharmacology (ACPSCP) meeting on NTI drugs, the committee voted 11-2 that the average BE limits of 80.00–125.00% are not sufficient for critical dose or NTI drugs (7). They commented that “the requirements for confidence intervals should perhaps be narrower (90–111%) and should include 100% (or 1.0)” and “Replicate studies are important”. Currently, Health Canada has tightened the average BE limits of AUC for critical dose drugs to 90.0–112.0% (5), while the European Medicines Agency (EMA) also has expressed that in specific cases of products with NTIs, the acceptance interval for AUC should be tightened to average BE limits of 90.00–111.11% (8). Where Cmax is of particular importance for safety, efficacy, or drug level monitoring, the 90.00–111.11% acceptance interval should also be applied for this parameter (8).

This paper’s authors considered the ACPSCP’s recommendation (April 2010 and July 2011) and evaluated various approaches to demonstrate BE of NTI drugs. These approaches included (1) direct tightening of average BE limits (as Health Canada and EMA recommended) and (2) a scaled average BE approach where BE limits are tightened based on the RLD’s WSV. In addition, if an NTI test product’s WSV is greater than the reference product’s WSV, the larger variation in blood concentration may increase the likelihood of therapeutic failures and/or adverse reactions. As such, the authors also evaluated methods and study designs to determine if the generic product has equal or less WSV than its reference product to ensure interchangeability within a subject. This approach has been discussed at a high level in recent publications (9–11), as well as in revised BE guidance for warfarin sodium tablets (4) and tacrolimus capsules (12).

In this article, we present modeling and simulation work conducted to support the above approach. The study’s specific objective was to evaluate different bioequivalence approaches for NTI drugs by estimating study power under various conditions, with different values of the regulatory constants and different variability comparison criteria.

METHODS

Theory

In a fully replicated, 2-sequence, 2-treatment, 4-period, crossover BE study (sequence 1: TRTR, and sequence 2: RTRT) without missing observations, all subjects provide two observations on T and R, respectively. The number of subjects in each sequence is n1 and n2 for sequences 1 and 2, respectively. An observation, in this context, is a natural log-transformed pharmacokinetic (PK) parameter, ln(AUCt), ln(AUCinf), or ln(Cmax), where AUCt is the area under the concentration vs. time curve from time zero to time t, the last time point with a measurable concentration, and AUCinf is the area under the concentration vs. time curve from time zero to time infinity.

The following quantities are defined to be used in further equations:

\( {T}_{ijk} \) = kth observation (k = 1 or 2) on T for subject j within sequence i

\( {R}_{ijk} \) = kth observation (k = 1 or 2) on R for subject j within sequence i

$$ {I}_{ij}=\frac{{\displaystyle {T}_{ij1}}+{\displaystyle {T}_{ij2}}}{2}\kern0.5em -\kern0.5em \frac{{\displaystyle {R}_{ij1}}+{\displaystyle {R}_{ij2}}}{2} $$
(1)
$$ {D}_{ij}={R}_{ij1}-{R}_{ij2} $$
(2)

\( {I}_{ij} \) is the difference between the mean of a subject’s (specifically subject j within sequence i) two observations on T and the mean of the subject’s two observations on R, while \( {D}_{ij} \) is the difference between a subject’s two observations on R. The \( {I}_{ij} \)’s and the \( {D}_{ij} \)’s are statistically independent under an assumption of normality for the distribution of subject-specific means. The \( {I}_{ij} \)’s and \( {D}_{ij} \)’s are uncorrelated in any event.

Under the assumed model described in the guidance for industry, “Statistical Approaches to Establishing Bioequivalence” (13), the variance of the \( {I}_{ij} \)’s is

$$ {\sigma}_D^2+\frac{\sigma_{\mathrm{WT}}^2+{\sigma}_{\mathrm{WR}}^2}{2} $$
(3)

and the variance of the \( {D}_{ij} \)’s is \( 2{\sigma}_{\mathrm{WR}}^2 \), where \( {\sigma}_{\mathrm{WT}}^2 \) and \( {\sigma}_{\mathrm{WR}}^2 \) are within-subject variances of T and R, respectively, and \( {\sigma}_D^2 \) is the subject-by-formulation variance component.

Define

$$ E = \frac{{\displaystyle \sum_{i=1}^2{\displaystyle \sum_{j=1}^{{\displaystyle {n}_i}}\frac{{\displaystyle {I}_{ij}}}{n_i}}}}{2} $$
(4)
$$ {\displaystyle {s}_I^2} = \frac{{\displaystyle \sum_{i=1}^2{\displaystyle \sum_{j=1}^{{\displaystyle {n}_i}}{\displaystyle {\left({I}_{ij}-{\overline{I}}_{i.}\right)}^2}}}}{n-2} $$
(5)

and

$$ {\displaystyle {s}_{\mathrm{WR}}^2}=\frac{{\displaystyle \sum_{i=1}^2{\displaystyle \sum_{j=1}^{{\displaystyle {n}_i}}{\displaystyle {\left({D}_{ij}-{\overline{D}}_{i.}\right)}^2}}}}{2\left(n-2\right)} $$
(6)

where \( {\overline{I}}_{i\cdot }=\frac{\sum_{j=1}^{n_i}{I}_{ij}}{n_i} \), \( {\overline{D}}_{i\cdot }=\frac{\sum_{j=1}^{n_i}{D}_{ij}}{n_i} \), and \( n={\sum}_{i=1}^2{n}_i \). \( {s}_{\mathrm{WR}}^2 \) is the estimated within-subject variance of R. \( {s}_{\mathrm{WT}}^2 \), the estimated within-subject variance of T, can be calculated in the same way using the difference between a subject’s two observations on T.

Under normality assumptions, we have the following distributional results:

$$ E\sim N\left({\displaystyle {\mu}_T}-{\displaystyle {\mu}_R},\left({\displaystyle {\sigma}_D^2}+\frac{\sigma_{\mathrm{WT}}^2+{\sigma}_{\mathrm{WR}}^2}{2}\right)\left[\frac{1}{4}\left(\frac{1}{{\displaystyle {n}_1}}+\frac{1}{{\displaystyle {n}_2}}\right)\right]\right) $$
(7)
$$ \frac{\left(n-2\right){s}_I^2}{{\displaystyle {\sigma}_D^2}+\frac{{\displaystyle {\sigma}_{\mathrm{WT}}^2}+{\displaystyle {\sigma}_{\mathrm{WR}}^2}}{2}}\sim {\chi}_{n-2}^2 $$
(8)
$$ \frac{\left(n-2\right)\kern0.28em {s}_{\mathrm{WR}}^2}{{\displaystyle {\sigma}_{\mathrm{WR}}^2}}\sim {\chi}_{n-2}^2 $$
(9)
$$ \frac{\left(n-2\right){\displaystyle {s}_{\mathrm{WT}}^2}}{{\displaystyle {\sigma}_{\mathrm{WT}}^2}}\sim {\displaystyle {\chi}_{n-2}^2} $$
(10)

All three quantities are statistically independent.
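As an illustration of Eqs. 1, 2, and 4–6, the following minimal sketch computes the summary statistics from a fully replicated data set. The function name `replicate_summary` and the input layout (one dict per subject holding the two ln-transformed observations on T and on R) are assumptions made for this example and are not part of the original analysis.

```python
from statistics import mean


def replicate_summary(seq1, seq2):
    """Summary statistics for a TRTR/RTRT design without missing data.

    seq1, seq2: lists of subjects (hypothetical layout); each subject is a
    dict such as {"T": (t1, t2), "R": (r1, r2)} of ln-transformed values.
    """
    def per_sequence(subjects):
        i_vals = [mean(s["T"]) - mean(s["R"]) for s in subjects]  # Eq. 1
        d_ref = [s["R"][0] - s["R"][1] for s in subjects]         # Eq. 2
        d_test = [s["T"][0] - s["T"][1] for s in subjects]        # same, for T
        return i_vals, d_ref, d_test

    def centered_ss(values):
        # Sum of squares around the sequence-specific mean
        m = mean(values)
        return sum((v - m) ** 2 for v in values)

    parts = [per_sequence(seq1), per_sequence(seq2)]
    n = len(seq1) + len(seq2)
    e = mean(mean(p[0]) for p in parts)                            # Eq. 4
    s2_i = sum(centered_ss(p[0]) for p in parts) / (n - 2)         # Eq. 5
    s2_wr = sum(centered_ss(p[1]) for p in parts) / (2 * (n - 2))  # Eq. 6
    s2_wt = sum(centered_ss(p[2]) for p in parts) / (2 * (n - 2))
    return {"E": e, "s2_I": s2_i, "s2_WR": s2_wr, "s2_WT": s2_wt}
```

The divisor 2(n − 2) in the within-subject estimates reflects the variance \( 2{\sigma}_{\mathrm{WR}}^2 \) of each within-subject difference.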

For the reference-scaled bioequivalence testing, the null and alternative hypotheses are described by Eqs. (11) and (12), respectively:

$$ {H}_0:\ \frac{{\left({\mu}_T-{\mu}_R\right)}^2}{\sigma_{WR}^2}>\theta $$
(11)
$$ {H}_1:\ \frac{{\left({\mu}_T-{\mu}_R\right)}^2}{\sigma_{\mathrm{WR}}^2}\le \theta $$
(12)

where \( {\mu}_T \) and \( {\mu}_R \) are the averages of the natural log-transformed PK measure (such as AUC and Cmax) for the test and reference products, respectively; \( {\sigma}_{\mathrm{WR}} \) is the within-subject standard deviation for the reference product; and θ is the scaled average BE limit (θ > 0).

The relationship between GMR and \( {\mu}_T-{\mu}_R \) can be expressed by

$$ \mathrm{lnGMR}={\mu}_T-{\mu}_R $$
(13)

The alternative hypothesis can be re-written as

$$ {\displaystyle {H}_1}:{\displaystyle {\left({\displaystyle {\mu}_T}-{\displaystyle {\mu}_R}\right)}^2}-\theta \times {\displaystyle {\sigma}_{\mathrm{WR}}^2}\le 0 $$
(14)

Furthermore,

$$ \theta =\frac{{\left[ \ln \left(\varDelta \right)\right]}^2}{\sigma_{\mathrm{W}0}^2} $$
(15)

where Δ is the upper BE limit for the test/reference ratio of geometric means, and \( {\sigma}_{\mathrm{W}0} \) is a regulatory constant.

The strategy for testing this hypothesis is to obtain a 1 − α (i.e., 95%) upper confidence bound for the quantity \( {\left({\mu}_T-{\mu}_R\right)}^2-\theta \times {\sigma}_{\mathrm{WR}}^2 \) and to reject H0 in favor of H1 if this confidence bound is less than or equal to zero. The method of obtaining the upper confidence bound is Howe’s approximation I (14).
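The text does not spell out the confidence-bound formula. The sketch below uses one commonly cited form of Howe’s approximation, in which the bound combines the point estimate and the individual confidence limit of each of the two terms. This form, the function names, and the hard-coded quantiles for 22 degrees of freedom (n = 24) are assumptions made for illustration.

```python
import math

# Tabulated quantiles for df = n - 2 = 22 (assumed constants for n = 24)
T_95_DF22 = 1.717       # 95th percentile of Student's t
CHI2_95_DF22 = 33.924   # 95th percentile of chi-square


def theta(delta=1 / 0.9, sigma_w0=0.10):
    """Scaled average BE limit, Eq. 15."""
    return (math.log(delta) / sigma_w0) ** 2


def howe_upper_bound(diff, se, s2_wr, df, theta_val,
                     t_q=T_95_DF22, chi2_q=CHI2_95_DF22):
    """Approximate 95% upper bound for (mu_T - mu_R)^2 - theta * sigma_WR^2.

    diff: estimate of mu_T - mu_R; se: its standard error;
    s2_wr: estimated within-subject variance of R with df degrees of freedom.
    """
    e_m = diff ** 2                            # point estimate, first term
    c_m = (abs(diff) + t_q * se) ** 2          # upper limit, first term
    e_s = -theta_val * s2_wr                   # point estimate, second term
    c_s = -theta_val * df * s2_wr / chi2_q     # upper limit, second term
    return e_m + e_s + math.sqrt((c_m - e_m) ** 2 + (c_s - e_s) ** 2)
```

Under this sketch, a study meets the scaled criterion when `howe_upper_bound(...)` is less than or equal to zero. For a different sample size, the two quantiles would need to be replaced by the values for the corresponding degrees of freedom.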

WSV comparison of the test and reference products is carried out by a one-sided F test. The null hypothesis for this test is

$$ {H}_0:{\sigma}_{\mathrm{WT}}/{\sigma}_{\mathrm{WR}}>\delta $$
(16)

And the alternative hypothesis is

$$ {H}_1:{\sigma}_{\mathrm{WT}}/{\sigma}_{\mathrm{WR}}\le \delta, $$
(17)

where \( {\sigma}_{\mathrm{WT}} \) is the within-subject standard deviation for the test product and δ is the regulatory limit to declare the WSV of the test product not greater than that of the reference product. The (1 − α) 100% CI for \( {\sigma}_{\mathrm{WT}}/{\sigma}_{\mathrm{WR}} \) is given by

$$ \left(\frac{s_{\mathrm{WT}}/{s}_{\mathrm{WR}}}{\sqrt{F_{\frac{\alpha }{2}}\left({v}_1,{v}_2\right)}},\frac{s_{\mathrm{WT}}/{s}_{\mathrm{WR}}}{\sqrt{F_{1-\frac{\alpha }{2}}\left({v}_1,{v}_2\right)}}\right) $$
(18)

Here, α = 0.1, \( {F}_{\frac{\alpha }{2}}\left({v}_1,{v}_2\right) \) and \( {F}_{1-\frac{\alpha }{2}}\left({v}_1,{v}_2\right) \) are the values of the F distribution with v 1 (numerator) and v 2 (denominator) degrees of freedom that has a probability of α/2 and 1 − α/2 to its right, respectively.
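A minimal sketch of Eq. 18 follows. The function name `sd_ratio_ci` is hypothetical, and the F quantiles are passed in by the caller rather than computed, since the Python standard library has no F distribution. With equal degrees of freedom, the two quantiles are reciprocals of each other.

```python
import math


def sd_ratio_ci(s_wt, s_wr, f_upper, f_lower):
    """Confidence interval for sigma_WT / sigma_WR per Eq. 18.

    f_upper: F value with probability alpha/2 to its right
    f_lower: F value with probability 1 - alpha/2 to its right
    Both use v1 (numerator) and v2 (denominator) degrees of freedom.
    """
    ratio = s_wt / s_wr
    return ratio / math.sqrt(f_upper), ratio / math.sqrt(f_lower)


# Example for n = 24 (v1 = v2 = 22). The value 2.05 is an approximate
# tabulated quantile used only for illustration.
F_UPPER = 2.05
lower, upper = sd_ratio_ci(0.12, 0.10, F_UPPER, 1 / F_UPPER)
```

Because the larger F quantile appears in the denominator of the lower limit, the interval always brackets the observed ratio of standard deviations.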

Simulations

Fully replicated, 2-sequence, 2-treatment, 4-period, crossover BE studies were simulated using R (The R Project for Statistical Computing). Since NTI drugs generally have small to moderate WSV and replicate study designs are recommended to demonstrate BE, n = 24 is considered a reasonable and practical sample size to estimate the study power. Generally, the number of subjects required to demonstrate BE can be reduced by up to about 50% in a fully replicated study design compared to a conventional two-way crossover study design. The simulations discussed in this paper are based on n = 24 except when indicated otherwise.

To evaluate the power of each testing condition, 1 million studies were simulated and the percentage of passing studies was calculated for the different criteria described in detail in the following sections. For each simulation, the estimate of \( {\mu}_T-{\mu}_R \) is sampled from a normal distribution with a mean of ln(GMR) and a variance of \( \left({\sigma}_D^2+\frac{\sigma_{\mathrm{WT}}^2+{\sigma}_{\mathrm{WR}}^2}{2}\right)\left(\frac{1}{4}\left(\frac{1}{n_1}+\frac{1}{n_2}\right)\right) \) (Eq. 7). \( {s}_{\mathrm{WR}}^2 \) and \( {s}_{\mathrm{WT}}^2 \) are sampled from the scaled chi-squared distributions with (n − 2) degrees of freedom described in Eqs. 9 and 10, respectively. \( {s}_I^2 \) is calculated from \( {s}_{\mathrm{WR}}^2 \), \( {s}_{\mathrm{WT}}^2 \), and \( {s}_D^2 \) as \( {s}_I^2={s}_D^2+0.5\times \left({s}_{\mathrm{WR}}^2+{s}_{\mathrm{WT}}^2\right) \). The values of GMR, \( {\sigma}_{\mathrm{WR}} \), and \( {\sigma}_{\mathrm{WT}}/{\sigma}_{\mathrm{WR}} \) are predefined fixed values. The 95% upper confidence bound for \( {\left({\mu}_T-{\mu}_R\right)}^2-\theta \times {\sigma}_{\mathrm{WR}}^2 \) is calculated by Howe’s approximation I (14). The 90% CI for \( {\sigma}_{\mathrm{WT}}/{\sigma}_{\mathrm{WR}} \) is calculated by Eq. 18. No subject-by-formulation interaction variation is assumed (i.e., \( {\sigma}_D^2=0 \)). The tested values for each parameter are summarized in Table I.

Table I Values of Parameters that Are Tested in the Simulations
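The sampling scheme described above can be sketched as follows. This is an illustrative reimplementation in Python rather than the authors’ R code. The function name, the reduced number of simulated studies, the hard-coded quantiles for 22 degrees of freedom, and the particular form of Howe’s approximation are all assumptions.

```python
import math
import random


def scaled_be_power(gmr, sigma_wr, sigma_wt, n1=12, n2=12, n_sim=20000,
                    delta=1 / 0.9, sigma_w0=0.10, seed=1,
                    t_q=1.717, chi2_q=33.924):
    """Fraction of simulated studies passing the scaled BE criterion alone.

    t_q and chi2_q are tabulated 95th percentiles for df = 22 and must be
    changed if n1 + n2 differs from 24. sigma_D^2 is taken as zero.
    """
    rng = random.Random(seed)
    df = n1 + n2 - 2
    theta = (math.log(delta) / sigma_w0) ** 2          # Eq. 15
    design = 0.25 * (1 / n1 + 1 / n2)
    var_i = 0.5 * (sigma_wt ** 2 + sigma_wr ** 2)      # Eq. 3, sigma_D^2 = 0
    passed = 0
    for _ in range(n_sim):
        # Eq. 7: sampled estimate of mu_T - mu_R
        diff = rng.gauss(math.log(gmr), math.sqrt(var_i * design))
        # Eqs. 9 and 10: chi-square(df) drawn as Gamma(shape=df/2, scale=2)
        s2_wr = sigma_wr ** 2 * rng.gammavariate(df / 2, 2) / df
        s2_wt = sigma_wt ** 2 * rng.gammavariate(df / 2, 2) / df
        s2_i = 0.5 * (s2_wr + s2_wt)
        se = math.sqrt(s2_i * design)
        # Assumed form of Howe's approximation for the 95% upper bound
        e_m, e_s = diff ** 2, -theta * s2_wr
        c_m = (abs(diff) + t_q * se) ** 2
        c_s = -theta * df * s2_wr / chi2_q
        bound = e_m + e_s + math.sqrt((c_m - e_m) ** 2 + (c_s - e_s) ** 2)
        passed += bound <= 0
    return passed / n_sim
```

For example, `scaled_be_power(1.0, 0.15, 0.15)` estimates the passing rate when a reference product with σ WR = 0.15 is compared to itself. The capping, point estimate, and variability comparison criteria are not included in this sketch.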

Investigation on the Variability of Within-subject Variability

In the furosemide AUC data set provided by Professor Leslie Benet, all 10 subjects received furosemide with and without orange juice. Since it was concluded that no effect of orange juice was observed on the PK of furosemide, each subject was considered to have received the same three treatments. Simulations were conducted to assess (1) the variability of \( {s}_{\mathrm{WR}}^2 \), (2) the passing rate using the criterion that the upper limit of the 90% CI for σ WT/σ WR is ≤ 2.5, and (3) a comparison of the observed distribution of \( {s}_{\mathrm{WT}}^2/{s}_{\mathrm{WR}}^2 \) with the theoretical F distribution.

To assess the variability of WSV, the following steps were conducted: (1) randomly select two data points from the three data points from each subject as references 1 and 2 in a four-way crossover study, (2) fix the sequence as [1-2-1-2-1-2-1-2-1-2], and (3) calculate s WR. The distribution of s WR is plotted based on 5000 repetitions of these steps.

To assess the passing rate using the upper limit of the 90% CI for σ WT/σ WR ≤ 2.5, the following steps were conducted: (1) randomly select two data points from the three data points from each subject as references 1 and 2 in a four-way crossover study, (2) randomly select two data points from the three data points from each subject as test 1 and test 2 in a four-way crossover study, (3) fix the sequence as [1-2-1-2-1-2-1-2-1-2], (4) calculate s WR, s WT, and the upper bound of the 90% CI for s WT/s WR, and (5) repeat 5000 times and calculate the passing ratio based on the variability comparison criterion. Finally, the distribution of \( {s}_{\mathrm{WT}}^2/{s}_{\mathrm{WR}}^2 \) is compared with the theoretical \( {F}_{8,8} \) distribution.
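The resampling steps above can be sketched as follows. The furosemide data are not reproduced here, so the caller supplies `data` as a list of per-subject triples of ln-transformed AUC values. The function names are hypothetical, and the sketch interprets the fixed sequence [1-2-1-2-…] as subjects alternating between sequence 1 and sequence 2, which is an assumption.

```python
import random


def s2_within(pairs):
    """Within-subject variance estimate (form of Eq. 6).

    pairs: list of (obs1, obs2), one per subject, with subjects assumed to
    alternate between sequence 1 and sequence 2.
    """
    n = len(pairs)
    total = 0.0
    for seq in (0, 1):
        d = [a - b for a, b in pairs[seq::2]]   # subjects in this sequence
        d_bar = sum(d) / len(d)
        total += sum((x - d_bar) ** 2 for x in d)
    return total / (2 * (n - 2))


def resample_variance_ratio(data, n_rep=5000, seed=1):
    """Return n_rep resampled values of s2_WT / s2_WR.

    Two of each subject's three observations are drawn (without replacement)
    as the reference pair, and two are drawn independently as the test pair.
    """
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_rep):
        ref = [tuple(rng.sample(obs, 2)) for obs in data]
        test = [tuple(rng.sample(obs, 2)) for obs in data]
        ratios.append(s2_within(test) / s2_within(ref))
    return ratios
```

With 10 subjects, each variance estimate has n − 2 = 8 degrees of freedom, which is why the resampled ratios are compared with an F distribution with 8 and 8 degrees of freedom.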

RESULTS

Effect of σ W0 and Δ on Implied BE Limits

The scaling model has two regulatory constants: σ W0 and Δ. They affect the 90% CI limits and the power of a BE study at a given WSV level. Two values of σ W0 (0.10 and 0.25) and two values of Δ (1.11 and 1.25) were examined (Fig. 1). At a given σ W0, Δ = 1.11 gives a narrower BE limit than Δ = 1.25. Specifically, when σ W0 = 0.10 and Δ = 1.11 (Note: 1.11 = 1/0.9), 90% CI limits become 80–125% when the coefficient of variation (CV; calculated from σ WR, the within-subject standard deviation of the reference product on the log scale, using the equation \( {\sigma}_{\mathrm{WR}}=\sqrt{ \ln \left(1+{\mathrm{CV}}^2\right)} \)) is around 21%. When σ W0 = 0.10 and Δ = 1.25, 90% CI limits become 80–125% when the CV is around 10%.

Fig. 1

Effect of σ W0 and Δ on implied BE limits. Red Δ = 1.11, σ W0 = 0.10; blue Δ = 1.11, σ W0 = 0.25; magenta Δ = 1.25, σ W0 = 0.10; and black Δ = 1.25, σ W0 = 0.25. Note: 1.11 = 1/0.9

At a given Δ, the implied BE limits at σ W0 = 0.25 are narrower than those at σ W0 = 0.10. Specifically, when Δ = 1.11 and the CV is within 10%, the implied BE limits are within 90–111% at σ W0 = 0.10 and are within 95–105% at σ W0 = 0.25. As such, Δ = 1.11 and σ W0 = 0.10 were selected for further analysis because at σ W0 = 0.10 (i.e., a common value to define small WSV), the implied BE limits coincide with other major health regulatory standards for NTI drugs.
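The implied BE limits follow directly from Eqs. 12 and 15: the scaled criterion is equivalent to requiring |ln GMR| ≤ (ln Δ/σ W0) × σ WR. The sketch below evaluates these limits and the CV conversion given above. The function names are hypothetical.

```python
import math


def implied_upper_limit(sigma_wr, delta=1 / 0.9, sigma_w0=0.10):
    """Upper implied BE limit on the GMR scale; the lower limit is 1/upper."""
    return math.exp(math.log(delta) / sigma_w0 * sigma_wr)


def cv_from_sigma(sigma):
    """CV corresponding to a log-scale standard deviation."""
    return math.sqrt(math.exp(sigma ** 2) - 1)


def sigma_from_cv(cv):
    """Log-scale standard deviation corresponding to a CV."""
    return math.sqrt(math.log(1 + cv ** 2))


def sigma_at_cap(cap=1.25, delta=1 / 0.9, sigma_w0=0.10):
    """sigma_WR at which the implied upper limit reaches the cap."""
    return math.log(cap) * sigma_w0 / math.log(delta)
```

With Δ = 1/0.9 and σ W0 = 0.10, `implied_upper_limit(0.10)` returns 1/0.9 (about 111.11%), and `sigma_at_cap()` is about 0.21, corresponding to a CV of about 21%.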

Comparison of Narrower Average BE Limits and Scaled BE Limits on Study Power and the Effect of Point Estimate Constraints (PECs)

Figure 2 compares passing rates under the average BE criteria and under the scaled BE criterion in combination with different PECs, using 24 subjects (n = 24). The capping criterion, which is discussed later, has not been applied here. The PEC criterion is that the point estimate of GMR falls within a given range (e.g., 90.00–111.11, 95.00–105.263, or 80.00–125.00%).

Fig. 2

Effect of point estimate on study power under different σ WR values when σ WT = σ WR, n = 24

When the average BE criteria are applied, the narrower BE limits of 90.00–111.11% have a lower passing rate. When the RLD is compared to itself or an identical generic product (i.e., GMR = 1, σ WT = σ WR), the passing rate with BE limits of 90.00–111.11% is 91.18% when σ WR = 0.15 and 27.69% when σ WR = 0.25. When the GMR is 0.95 and σ WT = σ WR, the passing rate with BE limits of 90.00–111.11% is 52% when σ WR = 0.15 and 17% when σ WR = 0.25. In contrast, the scaled average BE criterion ensures a close to 100% passing rate when the RLD is compared to itself or an identical generic product.

The impact of PECs on study power was evaluated with scaled BE limits. When σ WT = σ WR = 0.05, PECs do not significantly affect study power based on scaled BE criteria. However, when σ WT = σ WR = 0.15 or 0.25, PECs may decrease the study power. When the RLD is compared to itself, at σ WR = 0.15, the passing rate dropped from 99 to 92% with a PEC of 95–105%, while at σ WR = 0.25, the passing rate dropped from 99 to 96% with a PEC of 90–111% and dropped to 70% with a PEC of 95–105%. When the GMR = 0.95, at σ WR = 0.15, the passing rate dropped from 89 to 50% with a PEC of 95–105%, while at σ WR = 0.25, the passing rate dropped from 97 to 85% with a PEC of 90–111% and dropped to 48% with a PEC of 95–105%.

The authors also investigated the effect on study power of an additional constraint: that the BE CI limits should contain 100%. With this added constraint, when the RLD is compared to an identical generic product or itself, the chance of passing never exceeds 90% because there is always a 10% chance that the 90% CI will not contain 100% (15). Simulation also showed that, under this constraint, study power decreases with increasing sample size.

Approaches to Ensure BE Limits Are Never Wider than 80.00–125.00%

The reference-scaled BE limits can expand beyond 80.00–125.00% as the reference WSV increases. This raises the concern that an inappropriately conducted study resulting in high WSV could pass the scaled BE criteria when it would not be accepted using the current limits. The authors evaluated two approaches to ensure that the BE limits are never wider than 80.00–125.00%: (1) stop scaling at a certain estimated within-subject standard deviation of the reference product (s WR) and (2) require that the study pass both the scaled BE limits and the average BE limits of 80.00–125.00%. When σ W0 = 0.10 and Δ = 1.11, the cutoff value of s WR is around 0.21. Both of these approaches preserve the actual level of significance at no more than 5% (15). The second approach has the advantage of not requiring a specified threshold at which to stop scaling; thus, it was selected for further evaluation.

The Power of the Average BE Criterion and the Scaled + Capping BE Criterion Under Different σ WT/σ WR Ratios

When the scaled BE limits are wider than the standard average BE limits, both the standard average and scaled BE criteria should be applied (the combination of standard average and scaled BE criteria will be termed scaled + capping BE criterion hereafter). Figure 3 describes the impact of the scaled + capping BE criterion on study power in comparison to the average BE criterion under different σ WT/σ WR ratios at σ WR = 0.1, 0.2, or 0.25.

Fig. 3

The effect of within-subject variability difference on the study power when evaluated by the average bioequivalence (left panel) and the scaled average bioequivalence + capping criteria (right panel), n = 24

When σ WR = 0.1 (Fig. 3, upper panel), using the average BE limit, the study power was always close to 100% when the GMR varied between 0.95 and 1.05 and σ WT/σ WR ranged from 1 to 2. In contrast, when the σ WT/σ WR increased from 1 to 2, using the scaled + capping BE criterion, at GMR = 1, the study power dropped from 99 to 80%, while at GMR = 1.05, the study power dropped from 74 to 46%.

When σ WR = 0.2 (Fig. 3, middle panel), GMR = 1, and n = 24, when the σ WT/σ WR ratio increased from 1 to 2, the study power dropped from 100% to about 91% when evaluated by the average BE limits of 80.00–125.00%. When evaluated by the scaled + capping BE criterion, the study power decreased from 99 to 79% under the same conditions. When the GMR = 1.05 and the σ WT/σ WR ratio increased from 1 to 2, the study power decreased from 99 to 83% and 95 to 68% as evaluated by the average BE and the scaled + capping BE criterion, respectively.

When σ WR = 0.25 (Fig. 3, lower panel), similar study power was observed for each σ WT/σ WR ratio when evaluated by the average BE and the scaled + capping BE criterion. This is because when σ WR = 0.25, the capping criterion takes over the scaling criterion.

Effect of Various Variability Comparison Schemes and Standard Values on the Study Power for Different σ WT/σ WR Ratios

An F test evaluates whether the WSV of test and reference products are similar by calculating the 90% CI for the ratio of the within-subject standard deviation of test product to reference product σ WT/σ WR and comparing the CI to a standard (δ). Figure 4 demonstrates the study power in three equivalence evaluation schemes when n = 24, GMR = 1, σ WR = 0.1, and the standard value is set at 1.25, 1.5, 2, or 2.5. The three schemes were as follows:

Fig. 4

Effect of variability comparison evaluation schemes (I, II, and III) and the regulatory standard (δ) on the study power (σ WR = 0.1), n = 24. Scheme I pass if the UPPER limit of the 90% CI for σ WT/σ WR ≤ δ, Scheme II pass if the LOWER limit of the 90% CI for σ WT/σ WR ≤ δ, and Scheme III pass if the ESTIMATE, sWT/sWR, is ≤δ

  I) Pass if the UPPER limit of the 90% CI for σ WT/σ WR ≤ δ;

  II) Pass if the LOWER limit of the 90% CI for σ WT/σ WR ≤ δ; and

  III) Pass if the ESTIMATE, sWT/sWR, is ≤ δ.

When δ = 1.25 and σ WT/σ WR = 1, there was about 25, 98, and 85% study power with schemes I, II, and III, respectively. When the standard value was increased to 1.5, 2, and 2.5, the study power increased significantly in all three schemes. When σ WT/σ WR is greater than 1, the study power follows this order: scheme I < scheme III < scheme II. For example, when δ = 2.5, at σ WT/σ WR = 2, the study powers for schemes I, II, and III are 25, 99, and 85%, respectively.
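The three schemes can be compared with a small simulation. The sketch below draws the two variance estimates from scaled chi-square distributions and applies each pass rule. The function name, the reduced number of simulated studies, and the approximate F quantile of 2.05 for 22 and 22 degrees of freedom are assumptions made for illustration.

```python
import math
import random


def scheme_power(true_ratio, delta, df=22, f_upper=2.05,
                 n_sim=20000, seed=1):
    """Estimated passing rates of schemes I, II, and III.

    true_ratio: true sigma_WT / sigma_WR
    f_upper: F value with probability 0.05 to its right for (df, df)
             degrees of freedom (approximate tabulated value, assumed)
    """
    rng = random.Random(seed)
    passes = {"I": 0, "II": 0, "III": 0}
    for _ in range(n_sim):
        # chi-square(df) drawn as Gamma(shape=df/2, scale=2); sigma_WR is
        # set to 1 because only the ratio matters here
        s2_wt = true_ratio ** 2 * rng.gammavariate(df / 2, 2) / df
        s2_wr = rng.gammavariate(df / 2, 2) / df
        est = math.sqrt(s2_wt / s2_wr)
        lower = est / math.sqrt(f_upper)
        # With equal df, the lower-tail F quantile is 1 / f_upper
        upper = est * math.sqrt(f_upper)
        passes["I"] += upper <= delta
        passes["II"] += lower <= delta
        passes["III"] += est <= delta
    return {k: v / n_sim for k, v in passes.items()}
```

Because the passing rate depends only on the ratio of δ to the true σ WT/σ WR, this sketch gives the same rates for δ = 1.25 at a true ratio of 1 as for δ = 2.5 at a true ratio of 2, which agrees with the matching figures quoted above for schemes I and III.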

Effect of Combined Scaled + Capping BE Criterion and Variability Comparison Criterion on the Study Power Under Different σ WT/σ WR

In Fig. 5, when σ WR = 0.1, GMR = 1, n = 24, and σ WT/σ WR = 1.2, the study power was about 98, 96, and 76%, respectively, when examined separately by the scaled limits, variability comparison criterion scheme I with δ = 2.5, and variability comparison criterion scheme I with δ = 2.0. When GMR = 1 and it was evaluated with both the scaled limits and variability comparison criterion, the overall study power was slightly lower (less than 2%) compared with being evaluated with variability comparison criterion scheme I alone. The study power decreased further when the GMR moved away from 1. When σ WT/σ WR increased to 2, the study power dropped below 30% (δ = 2.5) or 10% (δ = 2) when evaluated by the variability comparison criterion or the scaled plus variability comparison criterion. Figure 5 shows that the variability comparison criterion will reject the test product with much higher WSV than the reference product.

Fig. 5

Effect of scaled average bioequivalence criteria, variability comparison criterion, and their combination on the study power at different σ WT/σ WR ratios when σ WR = 0.1, n = 24

Figure 6 shows the study power at different σ WT/σ WR values (n = 24) when evaluated by the combination of the scaled + capping BE criterion and the variability comparison criterion using scheme I with δ = 2 or 2.5. With δ = 2 (Fig. 6, left panel), when GMR = 1 and σ WT/σ WR = 1.2, less than 80% of studies can pass in all cases, suggesting this criterion is too strict. When δ = 2.5 (Fig. 6, right panel), at σ WR = 0.1 and GMR = 1.025, more than 80% study power can be obtained when σ WT/σ WR is within 1.2. At σ WR = 0.2 and GMR = 1.05, more than 80% study power can be obtained when σ WT/σ WR is within 1.2. Finally, when σ WT/σ WR > 2, there is less than a 20% chance of passing the study with δ = 2.5 as the limit, regardless of the value of σ WR.

Fig. 6

Effect of within-subject variability difference on the study power when evaluated by combination of scaled + capping BE and variability comparison criterion, n = 24. Variability comparison criterion used was scheme I with δ = 2 (left panel) or δ = 2.5 (right panel)

Investigation on the Variability of Within-subject Variability

A concern is that if the variability of the WSV is too large, the chance is high that a reference-to-reference comparison may fail the variability comparison criterion. To investigate the effect of the variability of WSV on the variability comparison criterion, and to assess whether the selected regulatory constant for the upper limit of the 90% equal-tails confidence interval for σ WT/σ WR is appropriate, each subject needs to receive the same treatment at least three times. Such a study design is rare. A furosemide AUC data set provided by Professor Leslie Benet (University of California, San Francisco) was used to investigate the variability of WSV (see “METHODS” section for more detail). Although furosemide is not an NTI drug, three sets of in vivo data are available for the same furosemide formulation and are thus useful for calculating the variability of WSV. Based on the simulation model (see “METHODS” section), s WR of the furosemide data set varied from 0.06 to 0.33 (Fig. 7a) with a mean value of 0.21 and a standard deviation of 0.047. If we sample two observations from each subject as T 1 and T 2 and two observations from each subject as R 1 and R 2 (i.e., at least one observation will be used twice) and calculate s WR, s WT, and the upper bound of the 90% CI for s WT/s WR, the passing rate is 81% using variability comparison scheme I with δ = 2.5. The simulated distribution of \( {s}_{\mathrm{WT}}^2/{s}_{\mathrm{WR}}^2 \) is very close to the theoretical \( {F}_{8,8} \) distribution (Fig. 7b), suggesting that the underlying assumption for the variability comparison criterion is reasonable. The authors further performed a simulation similar to Fig. 4 with n = 10 and s WR = 0.21. The passing rate was about 80% when σ WT/σ WR = 1 and evaluated with scheme I with δ = 2.5 (data not shown), suggesting that the variability criterion is reasonable.

Fig. 7

a Distribution of s WR and b comparison of the observed distribution of \( {s}_{\mathrm{WT}}^2/{s}_{\mathrm{WR}}^2 \) and the theoretical \( {F}_{8,8} \) distribution

DISCUSSION

For NTI drugs, small differences in dose and/or blood concentration may lead to serious therapeutic failures or adverse drug reactions. Generally, NTI drugs have small to medium WSV. In typical nonreplicate BE studies comparing generic and reference formulations of a sample of six NTI drugs, the mean CV ranges from 5.7 to 21.7% (16). Because the residual variability in a typical BE study includes both the true WSV and variations due to differences between the two formulations, the actual WSV in a replicate design study would be even smaller. Health Canada and EMA recommend tighter BE limits (90.0–112.0% and 90.00–111.11%, respectively) for NTI drugs. When the RLD is compared to itself or an identical generic product, the passing rate under the tightened average BE criterion decreases markedly as σ WR increases (Fig. 2). When σ WR = 0.25, the study power is less than 30% when applying BE limits of 90.00–111.11%. In contrast, the scaled average BE criterion alone (with no 80.00–125.00% capping applied) ensures a close to 100% passing rate when the RLD is compared to an identical generic product or itself at σ WR = 0.25 (Fig. 2), and the BE limits narrow as σ WR decreases (Fig. 1). Given the range of σ WR in NTI drugs (16), the fixed average BE limits of 90.00–111.11% can be too strict for truly equivalent generic drugs (i.e., GMR = 0.95–1.05) with small to medium WSV. Thus, FDA has recommended a scaled average BE criterion for BE demonstration of NTI drugs. NTI drugs with moderate s WR (e.g., >0.21) also need to pass the conventional 80.00–125.00% BE limits, as discussed later in this section.

A scaled bioequivalence approach and point estimate constraint have previously been reported to demonstrate BE of highly variable drugs (1,2). In the simulations performed in this study, the additional PECs demonstrated a σ WR-dependent effect on the study power (Fig. 2). The smaller the σ WR, the smaller the influence of PECs on the study power, since the reference-scaled limits are already tight. The higher the σ WR, the larger the decrease in power observed with tighter PECs. In the case of moderate σ WR (e.g., between 0.2 and 0.3), an additional PEC effectively tightens the BE limits, forcing the test and reference products to be closer to each other. Furthermore, simulation suggested that an additional BE criterion requiring the 90% CI to include 100% causes a constant failure rate of around 10% even when the RLD is compared to itself or an identical generic product. These simulation results suggested that requiring the CI to include 100% can be too strict for equivalent generic NTI drugs. In addition, simulation showed that study power decreases with increasing sample size when the 90% CI is required to include 100%, so generic applicants would have a disincentive to study more subjects (15). Therefore, a requirement that the confidence interval include 100% is not recommended for BE demonstration of NTI drugs.

To prevent the scenario where the estimated σ WR from a particular study is high and the inappropriately conducted study may pass the scaled BE criteria, passing both scaled BE limits and average BE limits of 80.00–125.00% is recommended. When σ W0 = 0.10, Δ = 1.11, and s WR > 0.21, the average BE limits of 80.00–125.00% essentially control whether a study passes.

Because larger WSV of the test product than the RLD is of concern for NTI drugs, the WSV comparison becomes particularly important for generic NTI approval. The scaled + capping BE criterion penalizes products with differences in WSV or GMR more than the average BE standards, especially at low σ WR (Fig. 3). However, even with σ WT/σ WR = 2, there is still a 70 to 80% chance that a study will pass the scaled + capping limits if the geometric mean ratio is close to 1 (Fig. 3). The simulation suggested that the scaled + capping BE criterion alone is insufficient to fail BE studies with large differences (e.g., σ WT/σ WR = 2) between reference and test WSV when the GMR is close to 1. Thus, a criterion to compare test and reference variability was developed and is discussed below.

To estimate both test and reference WSV, a fully replicated, 2-sequence, 2-treatment, 4-period crossover study is needed. With this BE study design, each subject receives each formulation (reference, R, and test, T) twice. Because the pharmacokinetics and the analytical variability are the same for both the test and the reference products in a fully replicated study, a significant difference in the estimated WSV between test and reference products is an indicator of a product quality problem. Three equivalence evaluation schemes were evaluated as described in the “RESULTS” section. Scheme II was not considered because it can give applicants the incentive to increase the chance of passing by underpowering a study. Scheme III is considered too relaxed because it allows products with relatively large difference in WSV to pass with greater than 80% power. As such, scheme I became the focus of further investigation.

In further evaluation of the scaled + capping BE limits and variability comparison criterion, simulations indicated that the study power was mostly determined by the power of the variability comparison for BE studies with large difference in variability when the GMR is close to 1 (Fig. 5). Scheme I with δ = 2.5 was selected as the recommended criterion for variability comparison because it could produce more than 80% power for similar products (0.95 < GMR < 1.05 and σ WT/σ WR < 1.2) and less than 20% power for products with larger than twofold differences in within-subject standard deviation using 24 subjects (Fig. 6).

There were concerns that it may be difficult to pass the variability comparison criterion since the variability of WSV is high in actual BE studies. To investigate this concern, the authors analyzed a furosemide AUC data set (provided by Professor Leslie Benet, University of California, San Francisco). Although the distribution of s WR covered a larger than twofold range (Fig. 7a), when the RLD is compared to itself, the study power is above 80% with the variability comparison criterion, which is reasonable for a study population much smaller than normal. Overall, the furosemide data set supports the conclusion that the variability comparison criterion is reasonable.

The new BE standards for NTI drugs tighten the BE limits based on the RLD’s WSV, penalize differences between the test product WSV and the reference product WSV, and ensure a consistent study power higher than 80% when a product is compared to itself or an identical generic product. Adoption of this scaled BE and variability comparison approach will enhance the ability to approve quality generic NTI drugs.

As of July 2014, FDA has published two product-specific BE recommendations where the scaled BE and variability comparison criterion are recommended to be applied to demonstrate bioequivalence. The products are a warfarin sodium tablet (17) and tacrolimus capsule (12). The bioequivalence limits of NTIs are scaled based on the estimated within-subject standard deviation of the reference product (s WR) in the study. The smaller the s WR, the narrower the BE limits are for the test product. The higher the s WR, the wider the BE limits are for the test product; however, wider product variation is prevented by the demonstration that average BE is within 80.00–125.00%. In addition, variability comparison is recommended for NTI drugs. The extension of this approach to other drugs depends on a consistent method for NTI drug classification. Improper application of this approach to non-NTI drugs will result in an unnecessarily low passing rate of generic products for which normal fluctuation in plasma concentration would be well tolerated. Further work in NTI drug classification and use of scaled BE and variability comparison to other NTI drugs is ongoing.