Quantitative Structure – Pharmacokinetics Relationships Analysis of Basic Drugs : Volume of Distribution

Purpose. The early prediction of pharmacokinetic behavior is of paramount importance for saving time and resources and for increasing the success of new drug candidates. The steady-state volume of distribution (VDss) is one of the key pharmacokinetic parameters required for the design of a suitable dosage regimen. The aim of the study is to propose a quantitative structure – pharmacokinetics relationships (QSPkR) for VDss of basic drugs. Methods: The data set consists of 216 basic drugs, divided to a modeling (n = 180) and external validation set (n = 36). 179 structural and physicochemical descriptors are calculated using validated commercial software. Genetic algorithm, stepwise regression and multiple linear regression are applied for variable selection and model development. The models are validated by internal and external test sets. Results: A number of significant QSPkRs are developed. The most frequently emerged descriptors are used to derive the final consensus model for VDss with good explanatory (r2 0.663) and predictive ability (qLOO-CV 0.606 and rpred 0.593). The model reveals clear structural features determining VDss of basic drugs which are summarized in a short list of criteria for rapid discrimination between drugs with a large and small VDss. Conclusions: Descriptors like lipophilicity, fraction ionized as a base at pH 7.4, number of cycles and fused aromatic rings, presence of Cl and F atoms contribute positively to VDss, while polarity and presence of strong electrophiles have a negative effect. This article is open to POST-PUBLICATION REVIEW. Registered readers (see “For Readers”) may comment by clicking on ABSTRACT on the issue’s contents page.


INTRODUCTION
The progress of computer-aided drug design techniques has led to an extensively increasing number of structures with drug-like properties and activities.Unfortunately, only few of them pass successfully though all stages of drug development and become drugs.In the past, one of the main reasons for drug failure was the unfavorable pharmacokinetic behavior (absorption, distribution, metabolism or excretion -ADME) (1).The understanding for the importance of pharmacokinetics inspired an intense research focused on the early prediction of the ADME properties of drug candidates before the expensive preclinical and clinical studies.As a result, the drug failure due to pharmacokinetics and bioavailability problems has fallen markedly from 40% in 1991 to 10% in 2000 (2).
One of the most reliable and widely used approaches for ADME prediction is the computational (in silico) modeling.It enables construction of quantitative structure -pharmacokinetics relationships (QSPkRs) based on molecular descriptors.The QSPkR models allow prediction of ADME properties even of virtual compounds, accelerate the identification of new drug candidates and reduce the cost of drug development process.
The volume of distribution VD is important pharmacokinetic parameter relating the amount of the drug in the body A to its plasma concentration, C: It has been defined as a hypothetical volume of body fluid that would be required to dissolve the total amount of drug at the same concentration as Correspondence Author: Dr. Zvetanka Zhivkova, Faculty of Pharmacy, Medical University of Sofia, 2 Dunav St., 1000 Sofia, Bulgaria.Tel.: +359 2 9236514, Email: zzhivkova@pharmfac.acad.bgthat found in plasma (3).Three types of volume of distribution are classically reported in the literature: VD of the central compartment (VD c ), VD during the terminal phase (VD  or VD area ) and VD at steady state (VD ss ).They differ in the times of sampling: just after iv administration, during the terminal phase of drug disposition, or after reaching of steady state, respectively (4).VD ss is considered as the most reliable indicator for drug distribution in the body (5).It determines the half-life of the drug and serves as a key parameter for setting up a suitable dosage regimen (4,6).
VD ss is the most frequently predicted ADME parameter and a good number of QSPkR models have been published in the last two decades.They differ in the size and in the content of the datasets, the descriptors used, and the methods for model derivation and validation.A few studies concern congeneric series of drugs (7 -10).In general, the QSPkRs proposed on congeneric series have a higher predictive power as a similar distribution behavior is expected.However, these models are local models, valid only within the studied series, while construction of a global model requires a large dataset encompassing diverse chemical spaces.The earlier models on diverse datasets are based on inconsistent data collected from literature, including different types of distribution volume (VD c , VD  or VD ss ), following different routes of administration (11 -17).In 2008 Obach et al. (5) published the largest and best curated database so far containing the major pharmacokinetic parameters of 670 drugs, including VD ss after iv administration.This database was used for the development of several successful models for VD ss prediction (18 -20).
Despite of the huge number of descriptors used in the QSPkR models for VD prediction, most of them contain mainly parameters, characterizing drugs lipophilicity (logP, logD and water solubility at different pH values, etc.) and the ionization state of the molecules (pK a of the base, fraction ionized or non-ionized as base or as acid, etc.).A good agreement exists on the fact that more lipophilic drugs have larger VD ss s (7,11,14,15,17,18).The fraction ionized as a base at pH 7.4 also increases VD ss , while the fraction ionized as an acid has a negative impact (14,15,17).All considered descriptors discriminate between acids and bases however there is no information about the structural features affecting VD ss .Аccording to the models, acids are expected to have small VD ss and baseslarge ones, which is consistent with the observations.VD ss reflects the drug ability to cross membranes and to bind in tissues.The bases have high affinity to membrane phospholipids due to interactions between the drug cationic centers and the phospholipid acidic groups.The basic drugs bind to plasma alpha-1-acid glycoprotein and albumin with moderate to strong affinity depending on lipophilicity and also are accumulated by iontrapping into lysosomes.Therefore, bases indeed have extensive VD ss s (21).Acids have high affinity for albumin and the high albumin concentration in plasma results in a high plasma protein binding.The ionization at physiological pH 7.4 prevents their distribution in tissues and in general, acids have small VD ss s.
Obviously, acids and bases follow different distribution patterns and it is reasonably to construct separate QSPkR models for acids and bases in order to identify the main structural features governing the value of VD ss .There are only few reports on separate QSPkR modeling of VD ss of bases and acids (14,15,19).In the studies of Ghafourian et al. (14,15) the separate models show lower predictive ability as compared to the model on the whole dataset -mainly due to the limited number of drugs involved in the study.Recently, we developed robust, predictive and easy interpretable models for VD ss of 132 acidic drugs from Obach's database (5), which revealed the main structural features affecting the distribution of acidic drugs in the body (19).The present study is focused on the relationship between the structure of basic drugs and their VD ss s.

Datasets
The whole dataset used in the present study comprised of 216 basic drugs belonging to different chemical and therapeutic classes.It was collected from Obach's database presenting data for the main pharmacokinetic parameters of 670 drugs following iv administration in human (5).In our previous study (19) we classified the molecules as acids, bases, neutral and zwitterions on the basis of their extent of ionization at the physiological pH 7.4.The fractions of a drug ionized as an acid (f A ) and as a base (f B ) were calculated according to the equations: The mol-files of the drugs were derived and verified from several public databases (DrugBank (22), Chemical Book (23), Japan Chemical Substance Dictionary (24) and ChEBI (25).The pK a values were calculated by ACD/LogD version 9.08 software (Advanced Chemistry Development Inc., Ontario, Canada).In case of multiple acidic/basic centers, the pK a of the strongest one was considered.A drug was defined as a base, if f B > 0.02 and f A = 0.
The whole dataset was divided randomly into modeling and external validation set.To this end the molecules were arranged in an ascending order according to their VD ss and one of every six drugs was allocated to a different subset.Thus, six subsets, each comprising 36 drugs, were generated.One of the subsets (randomly) was excluded as an external validation set and later was used for assessment of the predictive ability of the final model.The remaining five subsets composed the modeling set.In turn, each subset in the modeling set was used once as a test set for the model, developed on the training set, consisting of the remaining 4 subsets (leave-group-out validation).In summary, five training sets, five test sets and one external validation set were used in the study (Table 1).The experimental VD ss values were logarithmically transformed in order to get close to a normal distribution.A three-step variable selection procedure was applied to identify the most significant predictors.Initially, for every training set, descriptors with non-zero values for less than three molecules were eliminated.Next, descriptors were selected by genetic algorithm (GA) (28).Finally, the selected descriptors (usually less than 15) entered a forward stepwise linear regression with F-to-enter 4.00 and F-to-remove 3.99.Both genetic and stepwise regression algorithms were used as implemented in the MDL QSAR package.

Generation of QSPkR Models
The QSPkR models were developed by multiple linear regression (MLR) technique.Using different combinations of descriptors, a number of QSPkR models were constructed for each training set.Drugs which logVD ss values were predicted with residuals not obeying the normal distribution law were considered as outliers.They were removed from the training sets and the models were rebuilt.The models were primarily assessed by explained variance r 2 given by the equation:

Validation of the Models
The generated QSPkR models were validated by randomization test, leave-one-out cross-validation (LOO-CV) and leave-group-out validation.The model performance was assessed by cross-validated coefficient q 2 LOO-CV , prediction coefficient 2 pred r for the test set, mean fold error of prediction MFEP and accuracy: Accuracy of prediction was assessed as a percent of drugs with VD ss predicted with less than two-or three-fold error.

Analysis of the Datasets
The dataset used in the present study consisted of 216 basic drugs with diverse chemical structure and therapeutic usage.The structures covered a broad chemical space: the molecular weight was in the range 129 -1431 g/mol (mean 360, median 324), logP varied between -5.8 and 8.9 (mean 2.53, median 2.83), and logD 7.4 -between -8.7 and 6.9 (mean 1.0, median 1.28).The fraction ionized as a base at the physiological pH 7.4 f B ranged between 0.015 and 1.00 with 60% of the drugs almost completely ionized (with f B > 0.95).The VD ss s covered a wide interval from 0.073 to 140 L/kg (mean 6.08, median 2.5) and logVD ss s showed a normal distribution (mean 0.41, median 0.40).The unbound fractions f u were available for 182 drugs and suggested moderate to high plasma protein binding with f u  0.1 (plasma protein binding exceeding 90%) for 32% of the drugs, and f u  0.9 (negligible plasma protein binding) for 8% of the drugs.
In order to develop robust and predictive QSPkR models, the whole dataset was divided into two subsets -an external validation set (36 drugs) and a modeling set (180 drugs).In turn, the modeling set was divided into five training and five test sets as described in Methods.The logVD ss s had normal distribution for all subsets (Figure 1).

QSPkR Models for LogVD ss
Numerous significant models were generated on the five training sets using different initial combinations of descriptors.The models were validated as described in Methods.The best models for every training set in terms of explained variance r 2 , cross-validation coefficient on the training set q 2 LOO-CV , prediction coefficient for the test set r 2 pred , mean fold error of prediction MFEP, and accuracy are given in Table 2.
Although the training sets differed of each other by 20% of the included drugs, the generated QSPkR models were very similar in terms of selected variables, statistics and outliers.The explained variance of the best models r 2 varied between 0.635 and 0.700 (mean 0.666).The values of q 2 LOO-CV and r 2 pred ranging from 0.578 to 0.664 (mean 0.604) and from 0.444 to 0.602 (mean 0.538), respectively, were indicative for the good predictive ability of the models.The values of r

Evaluation of the Predictive Ability of the QSPkR Models
The predictive ability of the best proposed QSPkR models for logVD ss of basic drugs was assessed using the external validation set.The predictive statistics of the models is summarized in Table 4.The plot of predicted by the consensus model versus the experimental values of logVD ss for the external validation set is presented in Figure 2. The values of r 2 pred ranged from 0.529 to 0.593 (mean 0.555) and MFEP was between 2.24 and 2.38 (mean 2.31).The predicted values for VD ss were within the two fold error for 53% of the drugs (on average).As expected, the consensus model showed the best performance.

Criteria for VD ss Prediction of Basic Drugs
The descriptors involved in the consensus model were used to propose a number of criteria for prediction of VD ss of basic drugs.To this end the drugs were classified into three groups: with small (< 0.7L/kg), moderate (between 0.7 and 2L/kg), and large VD ss (> 2L/kg).A cutoff value for each descriptor could be defined in order to distinguish between drugs with small and large VD ss .The cutoffs for large VD ss are listed in Table 5.
For most descriptors, the number of molecules meeting these criteria in each group increased as VD ss increased (Figure 3).However, this was not true for f B and SdssC.Drugs with small and large VD ss had high values for f B .According to the consensus model, the descriptor SdssC had a negative contribution in VD ss , i.e. it was expected that drugs with negative values for SdssC would have large VD ss .However, the distribution of the molecules with SdssC < 0 followed the opposite trend.
The remaining seven criteria were applied to the studied dataset.The small VD ss group comprised 27 drugs with low lipophilicity (56% with logP < 0), small number of cycles, negative It is evident that the criteria for large VD ss defined in the present study are good enough to distinguish between drugs with small and large VD ss .Sixty three percent of the drugs with small VD ss meet neither criterion for large VD ss .At the other extreme, 52% of the drugs with large VD ss fulfill at least three criteria.Therefore, basic drugs meeting at least three of the following criteria: logP > 3, ncirc > 4, G min > 0, Dipole < 4, presence of Cl, F or/and aaaC atom are expected to have VD ss > 2 L/kg.Oppositely, molecules which meet neither criterion should have VD ss < 0.7 L/kg.

DISCUSSION
In general, the QSPkR models differ from the classical QSAR models in many aspects (29 -32).First, the QSPkR models operate with in vivo data, collected from different labs, often using a wide variety of assay conditions.Next, the sets of compounds used in ADME prediction consist of structurally diverse molecules, having diverse pharmacokinetic and pharmacodynamic behavior.Finally, there is a trade-off between simple, ease to interpret models with lower predictivity and more complex "black box" models with better predictive ability.Despite these challenges, there has been no respite in the development of newer and better in silico models for ADME prediction.The present study is focused on the development of QSPkR models for VD ss of basic drugs.A dataset consisted of 216 molecules covering a wide chemical and biological space.The chemical structures were described with 179 descriptors.The VD ss s were transformed to logVD ss in order to approach a normal distribution.A threestep variable selection was applied and a number of QSPkR models were proposed using MLR.In order to obtain robust models with high predictive ability, a rigorous validation procedure was applied, as recommended by Tropsha et al. (33).To this end the dataset was separated into six subsets of 36 drugs each.One of the subset was defined as an external validation set, the remaining five -as a modeling set.In turn, the modeling set was divided into training and test sets in a ratio 4:1 in five different combinations.The QSPkR models were developed on the training sets and validated by the test sets.The most frequently emerged descriptors entered a step-wise selection and a consensus QSPkR model was developed.All models were evaluated by the external validation set and showed very good predictive ability (r 2 pred in the range 0.529 -0.579; MFEP between 2.24 and 2.38).As expected, the consensus model performed best: r 2 0.663, q 2 LOO-CV 0.606, r 2 pred 0.593, MFEP 2.25.The values of both q 2 LOO-CV and r 2 pred exceeded the value of 0.5 accepted as a threshold for predictive models in QSAR (34).
The final consensus model contains descriptors with clear physical sense.It reveals the most important structural features determining the value of VD ss for basic drugs.The descriptor logP, encoding the lipophilicity of the molecule, appears to be the most significant determinant of VD ss .It is responsible for about 50% of the explained variance.There are 20 drugs with negative logP in the dataset.Seventeen of them have small VD ss (< 0.7L/kg).Oppositely, 10 of the 14 most lipophilic drugs (with logP > 5) have large VD ss (> 2L/kg).The positive effect of logP on VD ss has long been recognized (35).This is not surprising as a good lipophilicity is required for many processes involved in drug distribution: membrane permeability, binding to tissue components, accumulation in mast cells, etc.
The descriptor f B indicates the fraction ionized as a base at pH 7.4.Drugs with high f B values have large VD ss .The presence of a strong basic center enables the ion-pair interactions with the charged acidic head groups of membrane phospholipids, the binding to phosphatidylserine in the cell membranes in several tissues and the ion trapping in lysosomes (21).The descriptor Dipole represents the dipole moment of the molecule -a measure of polarity.According to the consensus model, it has a negative contribution in VD ss .This means that polar drugs should have small VD ss s, as it is observed.
The molecular descriptor ncirc is equal to the total number of cycles in the molecular graph.Оne cycle can be counted several times if it is fused with another cycles.For example, for biphenyl ncirc = 2, while for naphthalene ncirc = 3 (two cycles with 6 edges and one common cycle with 10 edges).At equal composition, molecules with higher value of ncirc have lower volume and surface.Obviously, ncirc reflects the compactness of the moleculehigher compactness is favorable for both membrane permeation and tissue binding.The positive contribution to VD ss , according to the consensus model, means that the more compact drugs should have larger VD ss .
G min coincides with the lowest E-state value in the molecule.The E-state value provides information about the electron accessibility to the atom.Terminal electronegative atoms are easily accessible and have higher E-state values, while atoms connected with electronegative ones (strong electrophiles) have lower values (26).The positive correlation between G min and VD ss means that drugs containing electrophile groups (CF 3 , SO 2 , CO, etc.) have small VD ss .
The descriptor SdssC encodes the presence and electronic state of a carbon atom type -C= (represented in the dataset as >C=O, >C=C< or >C=N-).Depending on the substituent, it can take positive as well as negative values.Molecules with many electronegative atoms and groups have negative values, while the prevalence of aromatic and aliphatic moieties results in positive ones.According to the consensus model, drugs with negative SdssC have large VD ss .
The presence of Cl and F atoms contributes to the lipophilicity of the molecule, and also they might serve as hydrogen-bond acceptors.The atom type aaaC represents an aromatic C-atom connected with three aromatic C-atoms, i.e. belonging to two fused aromatic rings.The presence of such atoms increases the VD ss .
The clear physical sense of the descriptors involved in the consensus model allowed us to define a list of criteria for discrimination between drugs with small and large VD ss s.Values for logP higher than 3, more than 4 circles in the molecule, positive G min values, dipole moments up to 4 D and the presence of Cl, F or/and fused aromatic rings are prerequisites for large VD ss .Drugs which meet neither criterion are expected to have small VD ss (< 0.7L/kg), while those meeting three or more criteria have VD ss > 2L/kg.Applying these criteria to drugs with small (27 drugs) and large VD ss (120 drugs) from the tested dataset, only six of them were mispredicted: two were overestimated and four were underestimated.
Eight drugs were identified as outliers of the consensus model from the modeling set and another four -from the external evaluation set (Table 6).Their incompatibility with the model could be due to several reasons: unique structural features, unusual distribution patterns, errors in molecule presentation or in the VD ss values.The search in the literature showed that netilmicin and pyrimethamine are not real outliers.The VD ss value of netilmicin in the Obach's database (5) is 0.073 L/kg.However, the value in the original reference is 0.2 -0.3 L/kg (36) which is close to our prediction of 0.42 L/kg.An even higher value of 0.68 L/kg was announced by Wenk et al. (37).Similarly, the VD ss value of pyrimethamine in the Obach's database is 0.43 L/kg, while others report for VD ss ranging between 2.12 and 3.06 L/kg (38).Our predicted value of 2.67 L/kg falls in this range.Cetrorelix is highly overpredicted by the model, due to the presence of three positive criteria (a large number of cycles, presence of Cl and aaaC atoms).This drug has an extremely high molecular weight of 1431 g/mol which embarrasses the membrane permeability and localizes the drug in plasma where it is 86% bound to plasma proteins (5).The experimental VD ss value of procyclidine is 0.74 L/kg (39) and seems unlikely small considering drug's high lipophilicity (logP 3.93).Ziprasidone meets all 6 criteria for large VD ss , but its observed value is only 1 L/kg.It is extensively metabolized in liver (40) and is almost completely bound to plasma proteins (f u 0.0012) (5), which locate it mainly in the central compartment.The presence of a sulfonyl group in the molecule of sildenafil results in a high negative value of G min , which in combination with the small ionized fraction (f B = 0.04) predicts 0.2 L/kg VD ss instead of the observed 1.4 L/kg.Mibefradil is correctly predicted as a drug with large VD ss , and the large difference between the observed and predicted values could be explained by the extremely high lipophilicity (logP 6.29).The last five outliers have extremely large VD ss (> 10 L/kg) implying considerable tissue accumulation and unique distribution patterns not captured by the models.Triamterene meets three criteria for large VD ss but its low lipophilicity (logP 0.18) results in a small predicted value.An extensive binding to tissues in the central compartment was observed in rat leading to very slow elimination (41).This is consistent with the extensive hepatic metabolism and biliary excretion of the drug (42).Azythromycin meets only two criteria for large VD ss and is underpredicted by the model.The presence of two basic centers in the molecule was considered as the main factor for its extremely large VD ss (21).Additionally, there is a plenty of hydrogen-bond donors and acceptors in the molecule involved in tissue binding.Extensive uptake and slow release from tissues was suggested as the main reasons for the long half-life of the drug (43).High concentrations were observed in prostate, tonsils and other tissues (44).Pentamidine meets only one criterion; however it is moderately lipophilic and also has two equivalent basic centers.A high accumulation of the in rat liver lysosomes was reported for this drug (45).As a diamine substance, pentamidine is a substrate of the organic cation transporters facilitating the high distribution in kidneys, liver and bile (46).Topixantrone meets three criteria but is also underpredicted -probably due to the unfavorable low logP.The presence of two basic centers presupposes specific interactions in the cell membranes or tissues.A prominent affinity of the drug for DNA has been also suggested (47).Chloroquine is the drug with the largest VD ss in the dataset and it is predicted as a large VD ss drug.The great difference between the observed and predicted values makes it outlier.Ion trapping was suggested as the main factor for chloroquine accumulation in tissues (48).Very high concentrations were observed in rat kidneys, liver, spleen and lung with a tissue to blood ratio close to 300 (49).A remarkable affinity for melanin in skin and eye (mediated through a charge transfer process) and slow release from the pigmented tissues was suggested (50).

CONCLUSIONS
The present study presents a set of statistically significant, predictive and interpretable QSPkR models for VD ss of basic drugs.The best of them, the consensus model, allows the prediction of 50% of the drugs in an external test set with less than 2fold error, and 88% -with less than 3-fold error.The descriptors involved in the model reveal clear structural features determining the distribution of basic drugs.The lipophilicity, the ionization at pH 7.4, the presence of fused rings, Cl and F atoms contribute positively to VD ss , while the polarity of the molecule and the presence of strong electrophiles have a negative effect.A list of criteria is proposed for discrimination between drugs with small and large VD ss .
observed and the calculated by the model values of VD ss for the i th drug and observed logVD ss value for the set.Only models with r 2 > 0.6 were subjected to validation.The most significant descriptors involved in the best models were further used for development of a consensus QSPkR model for logVD ss prediction.

Figure 2 .
Figure 2. Predicted by the consensus model vs. experimental logVD ss values for the external validation set.The four outliers are shown as blank circles.The lines and the dotted lines represent the twofold and three fold error limits.

Figure 4 .
Figure 4. Distribution of the drugs (in %) according to the number of met criteria for high VD ss

Table 1 .
Training, test and external validation sets used in the study.
the model values of VD ss for the i th drug in the training set or in the test set, and Histogram of logVD ss values: a. for the training sets; b. for the test and external validation sets

Table 3 .
The most frequently emerging descriptors in the QSPkR models for logVD ss

Table 4
Predictive statistics of the best QSPkR models for logVD ss of basic drugs evaluated by the external validation set (n = 36) Model

Table 5 .
Criteria for a large VD ss of basic drugs

Table 6
Outliers from the consensus model.