Sample Sizes for Designing Bioequivalence Studies for Highly Variable Drugs

Purpose. To provide tables of the sample sizes required, under the regulations of the European Medicines Agency (EMA) and the U.S. Food and Drug Administration (FDA), for the design of bioequivalence (BE) studies involving highly variable drugs, and to elucidate the complicated relationship between sample size and within-subject variation. Methods. Three- and four-period studies were simulated with various sample sizes. They were evaluated, at various variations and various true ratios of the two geometric means (GMR), by the approaches of scaled average BE and of average BE with expanding limits. The sample sizes required to yield 80% and 90% statistical power were determined. Results. Because of the complicated regulatory expectations, the behaviour of the required sample sizes is also complicated. When the true GMR = 1.0 then, without additional constraints, the sample size is independent of the within-subject variation. When the true GMR is increased or decreased from 1.0, the required sample sizes rise at variations above, but close to, 30%. An additional regulatory constraint on the point estimate of GMR and a cap on the use of expanding limits further increase the required sample size at high variations. Fewer subjects are required by the FDA than by the EMA procedures. The methods proposed by EMA and FDA lower the required sample sizes in comparison with unscaled average BE. However, each additional regulatory requirement (applying the mixed procedure, imposing a constraint on the point estimate of GMR, and using a cap on the application of expanding limits) raises the required number of subjects.


INTRODUCTION
The evaluation of bioequivalence (BE) for highly variable (HV) drugs and drug products frustrated the pharmaceutical industry for many years. It was difficult to demonstrate BE unless distressingly many subjects were included in the investigations (1).
Major regulatory authorities have sought remedies in recent years to the problem. The Committee for Medicinal Products for Human Use (CHMP) of the European Medicines Agency (EMA) issued in 2010 a substantially revised BE guideline (2) which included new approaches for the determination of BE for HV drugs. For the same purpose, members of a working group of the Food and Drug Administration (FDA) in the United States published a procedure (3).
The approaches of both EMA and FDA are based on the method of scaled average bioequivalence (SABE) (4) or its close variant. Nevertheless, there are meaningful differences between the implementations of the two procedures, which will be considered below.
A practical issue arising from the new approaches involves the design of BE studies. Their main goal is to lower the required sample sizes in comparison with the expectations of the customarily applied unscaled average bioequivalence (ABE). Sample sizes required by the use of ABE have been published (5-9). However, the sample sizes needed for the new procedures have not been available.
The main purpose of the present communication is to provide sample sizes which will be useful for designing BE studies for highly variable drugs. These will be presented in four tables in the Appendix.

BACKGROUND REGULATORY REQUIREMENTS
The usually applied criterion for the determination of bioequivalence implements the two one-sided tests (TOST) procedure (10). Accordingly, the average logarithmic kinetic responses of the test and reference formulations (μT and μR, respectively) are contrasted. Bioequivalence is accepted if the 90% confidence interval for the difference between the estimated logarithmic means lies between preset regulatory limits. The limits (θA) are generally symmetrical on the logarithmic scale and usually equal ln(1.25). Consequently, the criterion for the determination of average bioequivalence (ABE) is, schematically:

-θA ≤ μT - μR ≤ θA [1]

In a bioequivalence study, the individual kinetic responses are evaluated from the measured concentrations. The means of the logarithmic responses of the two formulations (mT and mR) are calculated; these sample averages estimate the true population means (μT and μR). The true values of the means are not known and therefore their estimates have to be used:

-θA ≤ mT - mR ≤ θA [2]

The within-subject variance (s²W) is also estimated, in replicate-design studies with 3 or 4 periods, for each bioequivalence metric.

Expression [2] implies that, for the declaration of bioequivalence, the difference between the estimated logarithmic means, together with its 90% confidence interval, should lie within the regulatory limits of ±θA.
The succeeding expressions have the corresponding interpretation.
The method of EMA replaces the fixed θA regulatory constant with limits which depend on sW:

-k·sW ≤ mT - mR ≤ k·sW [3]

This is still average bioequivalence, but with expanding limits (ABEL) (11). Therefore, the TOST procedure of Schuirmann can be directly applied. Based on statistical recommendations (12), EMA proposed 0.76 as the value of the regulatory constant k (2).
An alternative, related definition of the regulatory constant, preferred by some, expresses it in terms of a regulatory standard deviation σ0 and states the criterion for the scaled difference of the means:

-ln(1.25)/σ0 ≤ (mT - mR)/sW ≤ ln(1.25)/σ0 [4]

so that k = ln(1.25)/σ0. The expectation of EMA for the value of σ0 is 0.294.
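The paired constants quoted in this paper (k = 0.76 with σ0 = 0.294 for EMA, and k = 0.893 with σ0 = 0.25 for FDA) are connected by k = ln(1.25)/σ0. The short Python sketch below reproduces them and, assuming the usual log-normal relation sW = sqrt(ln(CV² + 1)), computes the expanded ABEL limits at an example variation of CV = 40%:

```python
from math import exp, log, sqrt

def k_from_sigma0(sigma0):
    # Alternative definition of the regulatory constant: k = ln(1.25)/sigma0
    return log(1.25) / sigma0

def expanded_limits(cv, k=0.76):
    # Within-subject SD on the log scale for a log-normally distributed metric
    s_w = sqrt(log(cv**2 + 1.0))
    return exp(-k * s_w), exp(k * s_w)

print(round(k_from_sigma0(0.294), 2))   # 0.76  (EMA)
print(round(k_from_sigma0(0.25), 3))    # 0.893 (FDA)
lo, hi = expanded_limits(0.40)
print(round(lo, 4), round(hi, 4))       # ABEL limits at CV = 40%
```

At CV = 40% the expanded limits evaluate to roughly 0.746-1.340, noticeably wider than the fixed 0.80-1.25 goalposts.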
The European guideline prescribes additional conditions and requirements:

1- sW must be evaluated using the data only of the reference drug. Accordingly, the within-subject variation of the reference product (sWR) should be used as sW in Eq. 3 to calculate the limits. The determination of sWR requires three- or four-period crossover designs.

2- Apply the so-called mixed procedure for average BE (12): when the estimated sWR corresponds to a coefficient of variation (CV) of at most 30%, unscaled ABE with the fixed 0.80-1.25 limits is used; at higher estimated variations, ABEL is applied.

3- The point estimate of the ratio of geometric means (GMR) must fall within the limits of 0.80-1.25.

4- The expansion of the limits is capped at CV = 50%.

The FDA procedure is based on scaled average bioequivalence (SABE) (4). The SABE criterion can be written for the true parameters in the squared form:

(μT - μR)² - [ln(1.25)/σ0]²·σ²W ≤ 0 [5]

with the regulatory standard deviation set to σ0 = 0.25. The FDA prescribes additional conditions and requirements:

1- Substitute sW in Eq. 4 by the estimated within-subject variation of the reference product (sWR). Evaluate sWR either in 3-period crossover studies in which the reference formulation is administered twice (RRT-RTR-TRR) or in 4-period investigations.

2- Apply the mixed procedure for average BE.

3- The confidence interval for SABE cannot be calculated analytically. The FDA recommends a numerical approach based on the approximate linearization of Eq. 5 (12,13).

FDA does not impose further restrictions such as a cap on the application of SABE. Overall, the statistical properties of the methods proposed by EMA and FDA are rather complex as a result of the additional conditions and requirements (the mixed procedure, the GMR constraint, and, for EMA, the cap on the limits). Furthermore, the component tests required by both the EMA and the FDA procedures are mutually dependent, which makes the theoretical treatment very complicated. Therefore, the required sample sizes were obtained by simulations.
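The linearization of the SABE criterion mentioned above is commonly implemented with a Howe-type (Hyslop) approximation for the 95% upper confidence bound of (mT - mR)² - θS²·s²WR. The sketch below is a simplified illustration of that idea, not the FDA reference implementation; the summary statistics and degrees of freedom are assumed inputs.

```python
from math import exp, log, sqrt
from scipy.stats import chi2, t

def sabe_decision(d, se_d, df_d, s2wr, df_s, sigma0=0.25):
    """Howe/Hyslop-style 95% upper bound for (mT-mR)^2 - theta^2*s2WR <= 0.

    d, se_d, df_d : difference of log means, its standard error and df
    s2wr, df_s    : estimated within-subject variance of the reference, its df
    """
    theta2 = (log(1.25) / sigma0) ** 2
    em = d ** 2
    cm = (abs(d) + t.ppf(0.95, df_d) * se_d) ** 2      # upper bound of the squared difference
    es = -theta2 * s2wr
    cs = -theta2 * s2wr * df_s / chi2.ppf(0.95, df_s)  # upper bound of the negative scaled variance
    upper = em + es + sqrt((cm - em) ** 2 + (cs - es) ** 2)
    gmr_ok = 0.8 <= exp(d) <= 1.25                     # point-estimate (GMR) constraint
    return upper <= 0 and gmr_ok

# A highly variable reference (s2WR = 0.25, i.e. CV of about 53%), GMR estimate 1.0:
print(sabe_decision(d=0.0, se_d=0.02, df_d=30, s2wr=0.25, df_s=30))   # -> True
```

Bioequivalence is declared only when the approximate 95% upper bound is non-positive and the GMR point estimate stays within 0.80-1.25.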

METHODS
Matlab (Release 2011a, MathWorks, Natick, MA) was used for simulations on a PC with an Intel Core i5-2500K processor and 8 GB RAM. Under each condition, ten thousand simulated trials were performed. The simulated random variables followed a log-normal distribution, in accordance with the multiplicative model of BE analysis (14). Three-period partial replicate (TRR-RTR-RRT) and four-period replicate (TRTR-RTRT) trials were simulated assuming zero period and sequence effects.
The same within-subject variances were assumed for the test and reference products, but sWR was estimated only from the data of the reference product. Power was calculated as the proportion of simulated trials passing the BE criteria as the sample size was increased from 12 to 200 subjects.
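The simulation approach can be sketched as follows. This simplified Python version (the study itself used Matlab and simulated full replicate designs) draws the summary statistics directly: the estimated difference of log means from a normal distribution and s²WR from a scaled chi-square distribution, then applies an EMA-style mixed ABEL decision with the GMR constraint and the CV = 50% cap. The degrees of freedom and standard-error formulas are illustrative assumptions.

```python
import numpy as np
from math import log, sqrt
from scipy.stats import t

def abel_power(gmr, cv, n, nsim=10000, k=0.76, seed=1):
    """Approximate power of a mixed ABEL procedure by simulating summary statistics."""
    rng = np.random.default_rng(seed)
    s_w = sqrt(log(cv**2 + 1.0))                # true within-subject SD (log scale)
    df = n - 2                                  # illustrative degrees of freedom
    se = s_w * sqrt(2.0 / n)                    # SE of the difference of log means
    d = rng.normal(log(gmr), se, nsim)          # estimated difference of log means
    s2 = s_w**2 * rng.chisquare(df, nsim) / df  # estimated within-subject variance s2_WR
    swr = np.sqrt(s2)
    tcrit = t.ppf(0.95, df)
    lo = d - tcrit * swr * sqrt(2.0 / n)
    hi = d + tcrit * swr * sqrt(2.0 / n)
    cv_hat = np.sqrt(np.exp(s2) - 1.0)
    # Mixed procedure: fixed limits below estimated CV = 30%; expansion capped at CV = 50%
    cap = sqrt(log(0.50**2 + 1.0))
    lim = np.where(cv_hat <= 0.30, log(1.25), k * np.minimum(swr, cap))
    passed = (lo >= -lim) & (hi <= lim) & (np.abs(d) <= log(1.25))  # incl. GMR constraint
    return passed.mean()

print(abel_power(gmr=1.05, cv=0.40, n=48))
```

Running the function over a grid of sample sizes and reporting the smallest n whose estimated power exceeds 80% or 90% mirrors how the tables in the Appendix were constructed.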
The Matlab program was validated by comparing its results with those of our previous Fortran program (4,12,16) and with those of the FDA group (15). Several properties of the results could also be predicted by mathematical reasoning, and this reasoning served as a further check. Finally, sample sizes simulated for 2-period studies at 80% power were compared with those calculated by Hauschke et al. (7) (Table 1). The agreement between the simulated and calculated numbers is satisfactory.
The reported numbers represent the minimum number of volunteers at which the power exceeds the desired level. That is, the results were rounded upward so that, at the given sample size, the power was at least as high as the stated level. The precision of the estimation was evaluated by running the simulations twenty times at each of twenty-two different conditions (different CVs, different GMRs, and different designs). The expected power was either 80% or 90%. The standard deviation of the simulated powers was calculated under each condition; the mean of these standard deviations was 0.460%. Thus, the precision of the power estimation is about ±0.5%. In the worst case, this precision corresponds to a simulation error of ±1 subject in the sample size tables. Such instability of the estimation can occur when the true power at the given sample size is very close to 80% or 90%; if the simulated power is just below 80% or 90%, the number of subjects has to be increased by one in order to reach the required power.
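The quoted precision is consistent with simple binomial reasoning: with 10,000 simulated trials, the standard error of an estimated power p is sqrt(p(1 - p)/10000), about 0.4% at p = 0.8. A one-line check (an illustrative calculation, not part of the validation described above):

```python
from math import sqrt

def power_se(p, nsim=10000):
    # Binomial standard error of a power estimated from nsim simulated trials
    return sqrt(p * (1.0 - p) / nsim)

print(power_se(0.8))   # 0.004, i.e. 0.4%
print(power_se(0.9))   # 0.003, i.e. 0.3%
```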

Sample Sizes for Designing BE Studies
The tables in the Appendix present the sample sizes required for designing BE studies of highly variable drugs and drug products. Tables A1 and A2 provide sample sizes with the EMA regulatory criteria whereas Tables A3 and A4 show sample sizes with the FDA criteria. Tables A1 and A3 consider the 3-period TRR-RTR-RRT design whereas Tables A2 and A4 show results for the 4-period TRTR-RTRT design.
Sample sizes required for 80% and 90% statistical power are given within each table. The ratios of geometric means (GMR) are considered from 0.85 to 1.20. More than 200 subjects are always required when the GMR is outside this range. Within-subject variations of the reference product are shown between coefficients of variation (CV) of 30% and 80%.
All entries in these tables refer to total sample sizes in an investigation. Consequently, the number of subjects within a study sequence is either one-third or one-half of these figures.

Change of Sample Size with Within-Subject Variation: Effects of Regulatory Requirements
As noted earlier, the regulatory conditions and requirements of both EMA and FDA are complicated and contain various stipulations. As a result, the change of sample size with increasing variation is also complicated. For instance, when the true GMR deviates from 1.00, the required sample size typically first decreases with rising variation and later increases. Therefore, additional simulations were performed in order to elucidate the features of this behaviour. The ranges of GMR and CV were limited in these simulations.
For these comparisons, 3-period studies were chosen with the EMA conditions of ABEL analysis and a regulatory constant of k = 0.76 (σ0 = 0.294), but the additional constraints were varied. Results obtained in these simulations can be readily extrapolated to the FDA method because the essential mathematical features of the EMA and FDA procedures are the same. Table 2 shows the required sample sizes without the additional regulatory conditions, that is, without the mixed procedure, the GMR constraint, and the cap on the regulatory limits. It demonstrates that the required sample size does not change with increasing CV when GMR = 1.00 and decreases with rising CV when GMR deviates from unity.
The explanation lies in the properties of the scaled difference of the means shown in Eq. 4, which follows the noncentral t-distribution (12). When GMR = 1.00, the noncentrality parameter is zero and the scaled difference has a central t-distribution. The width of the confidence interval is then independent of CV and depends only on the sample size. Consequently, with GMR = 1.00, the number of volunteers is also independent of CV. This is shown in Table 2 (apart from small, random fluctuations due to the simulation error).
When GMR deviates from 1.00, the noncentrality parameter raises the upper limit of the confidence interval. The noncentrality parameter is proportional to log(GMR) and inversely proportional to sW. Thus, at a given sample size and log(GMR), the rise in the confidence limit becomes smaller as sW (i.e., CV) increases. Correspondingly, the required sample size declines as sW (i.e., CV) increases.
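To make this concrete, the noncentrality parameter of the scaled difference can be written, for a simplified two-mean comparison (an illustrative assumption about the design), as δ = log(GMR)/(σW·sqrt(2/n)). The following sketch shows how δ shrinks as the CV rises at a fixed sample size and GMR:

```python
from math import log, sqrt

def noncentrality(gmr, cv, n):
    # delta = log(GMR) / (sigma_W * sqrt(2/n)); sigma_W from the log-normal CV
    s_w = sqrt(log(cv**2 + 1.0))
    return log(gmr) / (s_w * sqrt(2.0 / n))

# At GMR = 1.10 and n = 24, the noncentrality parameter declines with CV:
for cv in (0.35, 0.50, 0.65):
    print(cv, round(noncentrality(1.10, cv, 24), 2))
```

The declining δ mirrors the declining sample-size requirement at moderate variations.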
This relationship is shown in Table 2. Similar considerations apply when the approach of ABEL is used since the power of average BE can be characterized by a bivariate noncentral t-distribution (6,7).

Table 3 shows the effect of including the mixed strategy. With the mixed strategy, the method of evaluation depends on the estimated CV of the reference product. When the true CV is 30%, half of the simulated trials are evaluated by the unscaled and the other half by the scaled approach. When the true CV is higher than 30%, say 34%, the estimated CV is still below 30% in many studies, and these are then evaluated by unscaled average BE. Thus, instead of using ABEL (or SABE), which would have narrower limits ("goalposts") in this region, unscaled average BE is applied with its wider, more relaxed 0.80-1.25 limits. Consequently, on applying the mixed strategy, the required sample size becomes lower at and near CV = 30% (compare Tables 2 and 3).

In Table 4, the effect of including the GMR constraint can be observed. The constraint raises the required sample size when the CV is high. This is understandable since, at large CVs, the chance of observing a very high (or very low) estimated GMR is substantial. The effect of the GMR constraint is particularly conspicuous when the true GMR deviates from 1.00.

Table 5 shows sample sizes under conditions similar to those of Table 4 except that the value of the regulatory constant is k = 0.893 (σ0 = 0.25) and thereby corresponds to the requirements of FDA. The number of subjects is lower with the FDA than with the EMA requirements. The difference is meaningful at moderately high variations but diminishes at still higher CVs. The results in Table 5 can also be compared with those given in Table A3. The sample sizes in Table 5 were obtained by using ABEL (in order to permit direct comparison with Table 4) whereas those in Table A3 were computed with SABE as required by FDA. Therefore, small differences between entries in the two tables are due to the difference between the two algorithms.

Comparison of Sample Sizes Required by EMA and FDA
Comparison of Tables A1 and A2 with Tables A3 and A4 in the Appendix, and of Table 4 with Table 5, showed that the regulatory requirements of FDA call for fewer subjects than those of EMA.
At first sight, the requirements of FDA are more favourable to sponsors than those of EMA. However, statistical reasoning supports the recommendations of EMA. Figure 1 illustrates the sample sizes required by the two agencies at and just above CV = 30%. The figure also shows the sample sizes needed with unscaled average BE just below and up to this variation. For illustrative purposes, the results were calculated with the true CV values, without the mixing effect and without the GMR constraint.
As expected, the required sample size increases with rising variation when the results are evaluated by unscaled average BE. Above CV=30% and with the application of ABEL, the sample size is independent of the variation when the true GMR = 1.0. However, when the true GMR deviates from 1.0 (e.g., when GMR =1.1), the required sample size initially decreases with rising CV. This behaviour was discussed in connection with the results given in Table 2.
Importantly, the required sample size changes continuously around CV = 30% when the requirements of EMA are followed (i.e., with the regulatory constant of k = 0.76 or σ0 = 0.294). In other words, at CV = 30% the same sample size is obtained regardless of whether ABE or ABEL is applied.
In contrast, there is a discontinuity of sample sizes when the regulatory conditions of FDA are used (i.e., with the regulatory constant of k = 0.893 or σ0 = 0.25). In other words, at CV = 30% a larger sample size is required with ABE than with ABEL. In fact, using ABEL in a range above CV = 30% requires a smaller sample than applying ABE in a range below CV = 30% (Figure 1). By requiring smaller samples above than below CV = 30%, the FDA regulatory condition could tempt some sponsors to prefer higher variations.
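The continuity and the discontinuity can be verified arithmetically. At CV = 30% the within-subject SD is sqrt(ln(1.09)) ≈ 0.294, so the EMA constant k = 0.76 expands the upper limit to exp(0.76 × 0.294) ≈ 1.25, exactly matching the unscaled ABE limit, whereas the FDA constant k = 0.893 yields exp(0.893 × 0.294) ≈ 1.30. A short check:

```python
from math import exp, log, sqrt

s_30 = sqrt(log(0.30**2 + 1.0))        # within-subject SD at CV = 30%
print(round(s_30, 3))                   # 0.294
print(round(exp(0.76 * s_30), 3))       # ~1.25: EMA scaling joins ABE continuously
print(round(exp(0.893 * s_30), 3))      # ~1.30: FDA scaling jumps above 1.25
```

This is why EMA's σ0 = 0.294 (the log-scale SD at CV = 30%) removes the jump in the acceptance limits at the switching point, while σ0 = 0.25 does not.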
Other disadvantageous consequences of the proposed FDA regulation include the higher consumer risk for non-highly variable drugs. This aspect was already noted earlier (16).
Further features of the sample sizes arising from the EMA and FDA requirements can be noted. The difference between the two sample sizes decreases as the variation rises towards 40-50% (see the tables in the Appendix and the comparison of Tables 4 and 5). The reason is that the GMR constraint influences the outcome of the BE decision sooner with the FDA than with the EMA regulatory constant (15,16). At still higher variations, at CV > 50%, the required sample size increases more rapidly with rising CV under the EMA than under the FDA requirements. The reason is the cap which EMA imposes on the use of ABEL at CV = 50%; FDA does not apply a similar cap. Consequently, with the EMA requirements, the limits are not expanded further at CVs exceeding 50%, thereby leading to stricter study requirements.

Designing BE Studies for Highly Variable Drugs
Sample sizes for designing BE studies which involve non-highly variable drugs are typically estimated by assuming a within-subject (or a residual) variation and using a sample-size table such as that of Hauschke et al. (7). The sample size is usually selected at a 5% deviation between the means, i.e. at a true GMR = 1.05.
Larger absolute differences between the two logarithmic means can be noted in the various BE studies when the within-subject variation is higher. Therefore, it is recommended that a 10% deviation between the means, i.e. a true GMR = 1.10, be considered when the sample size tables in the Appendix are used.
With the approach of FDA, the minimum number of subjects with the 3-period partial replicate design is 28 and 40 for 80% and 90% power, respectively. With the 4-period replicate design, these numbers are 21 and 30. Haidar et al. (3) suggested that the inclusion of at least 24 subjects would be needed.
The suggestion may be considered as an absolute minimum.
With the procedure of EMA, the minimum number of subjects with the 3-period partial replicate design is 37 and 51 for 80% and 90% power, respectively. With the 4-period replicate design, these numbers are 27 and 36.
The estimated sample sizes depend also on the within-subject variation of the test product. If it is lower than that of the reference formulation, then fewer volunteers are needed to achieve the stated power. Conversely, if the variability of the test formulation is higher than that of the reference product, then more subjects are needed than shown in the tables in the Appendix. In practice, however, the samples are too small to permit judgments with adequate power about the relative variances of the two products and, consequently, the assumption of identical variabilities is generally reasonable.
In view of the consequences of the mixed approach, it could be judicious to consider larger numbers of subjects at variations fairly close to 30%.
Both EMA and FDA developed the approaches for highly variable drugs in order to reduce the regulatory burden, i.e. to lower the required number of subjects in BE studies. The sample size tables in the Appendix demonstrate that both authorities achieve this goal.

CONCLUSIONS
Tables of sample sizes are provided for BE studies involving highly variable drugs.
These investigations are evaluated either by the approach of scaled average BE or by its close variant, average BE with expanding limits. Sample sizes are shown for the differing regulatory requirements of EMA and FDA.
When the two drug products have truly the same kinetic metrics (GMR = 1.00) and without the additional regulatory conditions, the required sample size is independent of the within-subject variation of the reference formulation. In other words, the producer risk is independent of the variation.
Each of the additional regulatory conditions and requirements yields complications in the relationship between sample size and variation when the true GMR deviates from 1.00. Use of the mixed strategy lowers the sample size near CV = 30%. A constraint on the point estimate of GMR increases the required sample size at higher variations. A cap on applying ABEL at a variation of 50% raises the sample size.