Predicting High-Impact Pharmacological Targets by Integrating Transcriptome and Text-Mining Features

Purpose: Novel, “outside of the box” approaches are needed for evaluating candidate molecules, especially in oncology. From 2000 to 2010, the efficiency of drug development fell to barely acceptable levels, and in the second decade of this century it has improved only marginally. This dismal condition persists despite unprecedented progress in high-throughput tools, computational methods, aggregated databases, drug repurposing programs and innovative chemistries. Here we tested the hypothesis that the economic impact of targeting a particular gene product is predictable a priori by combining transcriptome profiles with quantitative metrics reflecting the existing literature.
Methods: To extract classification features, the gene expression patterns of a posteriori high-impact and low-impact anti-cancer target sets were compared. To minimize the possible bias of text-mining, the number of manuscripts published prior to the first clinical trial or relevant review paper, as well as its first derivative in this interval, were collected and used as quantitative metrics of public interest.
Results: By combining the gene expression and literature mining features, a 4-fold enrichment in high-impact targets was produced, resulting in a favourable ROC curve analysis for the top impact targets. The dataset was enriched in the highest-impact anticancer targets and demonstrated drastic differences in economic value between high- and low-impact targets. The products of the EGFR, ERBB2, CYP19A1/aromatase, MTOR, PTGS2, tubulin, VEGFA, BRAF, PGR, PDGFRA, SRC, REN, CSF1R, CTLA4 and HSP90AA1 genes, all known anti-cancer targets, received the highest scores for predicted impact, while microsomal steroid sulfatase, anticoagulant protein C, p53, CDKN2A, c-Jun, and TNFSF11 were highlighted as the most promising research-stage targets.
Conclusions: A significant cost reduction may be achieved by a priori impact assessment of targets and ligands before their development or repurposing. Expanding a suite of combinational treatments could also decrease the costs, while achieving a higher impact per developed ligand.


INTRODUCTION
Despite tremendous historical progress in anticancer research and the decrease in mortality rates associated with many forms of cancer (1), including tumors of the prostate, breast, testis, and colon, as well as many forms of leukemia, the outcomes for pancreatic, lung and brain tumors remain dismal. It is difficult to pinpoint individual factors contributing to the organ-specific treatment success rates. For some forms of cancer, including breast, ovary, prostate, uterus, leukemia, thyroid, and testis, there is a tremendous gap in outcomes for patients with treatment-sensitive and treatment-resistant tumors. For other cancers, the role of targeted/chemotherapy remains secondary, while long-term remission is being achieved through a combination of radiotherapy and early radical surgery. However, on average, a ten-year survival rate for all cancer forms increased from 22% in 1971 to 45% in 2007, with the most significant contribution to this increment being novel therapeutics.
Health improvements in breast cancer that could be attributed to chemotherapy were estimated at between 14% and 27% in 1998 (2). It is likely that the more recently developed regimens have an even greater impact. The broadly cited assessment of DeVita et al. attributes 50% of the increase in survival rates to drug-based therapies (3). In this manuscript, we discuss a novel cost-effective strategy for developing anti-cancer therapeutics and their combinations. This strategy is based on the premise that the spectrum of efficient physiological anticancer mechanisms is relatively limited, and that improvement in survival rates is primarily due to the targeting of so-called "super-targets". Examples of well-known "super-targets" that have long been drugged include DNA (4), folate reductase (5), and microtubules (6); these "super-targets" are still commonly tackled by either combinational therapy (7) or adjuvant therapy (8).
Discovery of a novel high-impact anti-cancer "super-target" is a significant event, further increasing cancer survival rates by approximately 1%, in our estimate. However, the existing statistics pertaining to the introduction of novel anti-cancer therapies point to stagnation (9)(10)(11)(12). This trend occurs against the backdrop of the Food and Drug Administration's (FDA) attempts to expedite the review process (12). Even more concerning is that the downturn in the number of New Drug Applications (NDA) takes place at a time when the enabling technologies are expanding explosively (13)(14)(15), including the newest investments in personalized genome projects, gene knock-out techniques, toolboxes for biological network modeling, RNAseq and genome-wide association studies, along with the centralization and dissemination of biomedical information by NCBI and other portals. Another alarming trend is the cost of drug design, which reached $5bn per approved drug in 2013 (16). The combination of the NDA contraction and the exponentially rising costs of drug development points to a wall of resistance that has to be penetrated.
In this report we demonstrate that it is possible to predict a priori whether an anti-cancer molecule will tackle a super-target. In other words, we describe a methodology of "success mining" that attempts to identify the features of known "winner" molecules at the preclinical stage and then to prioritize current candidate molecules according to their relative resemblance to a "winner" profile. This approach would aid in reallocating available funding to the most promising candidates and minimize costly attrition at later development stages.
Some attempts to evaluate the pharmacological promise of a given target or its ligand have been made before. Ma'ayan et al. introduced graph-theory methods to analyze the FDA-approved drugs and their known molecular targets (17). Zhu et al. (18) explored multiple factors that collectively contribute to the druggability of a target, including its protein sequence, structural, physicochemical, and systems profiles. Importantly, techniques to explore each of these profiles for target identification have been developed, but they have not been used collectively. Chen et al. (19) proposed that a disease-independent property of proteins, "drug-target likeness", can be explored to facilitate genome-scale target screening. Sakharkar et al. (20) described quantitative characteristics of the currently explored (those that are not yet associated with any marketed drug) and successful (targeted by at least one marketed drug) biomolecules; these characteristics were translated into simple rules for selecting a target with a higher probability of success. These rules highlight target proteins with five or fewer homologs outside of their own family, proteins encoded by genes with single-exon architecture, and proteins interacting with more than three partners as more likely to be druggable. Bender et al. (21) reported a success mining approach applied to ligands in the context of the in vitro interaction profiles of their targets. According to Bender et al., the Preclinical Safety Pharmacology (PSP) approach may anticipate adverse drug reactions (ADRs) during the early phases of drug discovery by testing compounds in relatively simple in vitro binding assays.
All of these previously implemented methodologies attempt to discriminate the targets that have already acquired a ligand from the targets that are either in the process of ligand acquisition or would never acquire an approved ligand. We find that this approach needs supplementing due to a number of considerations. First, some targets may eventually acquire a somewhat beneficial and a relatively harmless ligand that would pass safety and efficiency criteria, if a sufficient investment is made. Secondly, the targets that have acquired the approved ligand may lose the association with the approval if the ligand is pulled from the market later. Finally, the practical impact of ligands is in proportion to the significance of the target for the pathophysiological mechanism that drives a given pathology. A still developing candidate with expected huge clinical (and market) niche, but without approved associated ligands, may be more valuable than a comparable target with an approved ligand of a more modest impact.
In this report we attempt to predict the impact of a pharmacological target candidate using its future market share as a proxy. The more clinical trials, and especially advanced clinical trials, are conducted around a certain ligand, the more likely it is that the target will eventually be tackled by a high-impact therapeutic that is inherently successful due to a combination of favorable biology and pharmacology. Aiming at the largest possible target impact differs significantly from simply using FDA approval as the criterion; such studies have not been conducted yet. Incorporating the forecasted future impacts into the decision criteria put forth by funding and regulatory agencies may aid in creating a policy instrumental in rolling back the escalating costs of drug development.

Overview of methodology
The proposed impact forecasting technique relies on the target impact predictors available at the preclinical stage. The independent predictors were combined into an index that gauges the commercial potential of a target candidate before the bulk of the investment is committed.
The GEO DataSets repository was searched for similarly normalized gene expression data collected from normal and cancerous specimens of various tissue origins. Dataset GSE7307 includes 677 normal and diseased human tissues profiled for gene expression using the Affymetrix U133 Plus 2.0 array. The target status information was extracted from the Therapeutic Target Database (TTD), where the ligand and clinical trial status of each target candidate is specified as either "successful" (an approved ligand is associated with the gene's products) or "research" (no approved ligand is associated with the gene's products). The text-mining was performed using PubMed with the gene names and their synonyms extracted from TTD.

Definition of target impacts
Target impacts were approximated by the number of clinical trials associated with the gene's name. The number of clinical trials associated with the gene was extracted by applying the PubMed filters. Weights of 1, 2, 5 and 10 were assigned to the ligands in Phase I, II and III clinical trials and to marketed ligands, respectively. The values of these weights were selected in proportion to the attrition rate of the ligands at each trial level. In terms of resources committed, a ligand at the Phase I stage is cheaper than a ligand that has reached the Phase III stage, and the approximate differences in the costs are reflected in the weights; see Supplemental Table 1 for the data. The number of ligands in each category was multiplied by the corresponding weight, producing proxy target impacts that reflect the prospective revenue of the target in case of successful development. These values were defined as "real-life impacts", while the predicted impacts were derived from both microarray and text-mining data. The proxies for real-life impacts were designated Y for the purpose of deriving a prediction rule as a linear classifier, see below.
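The weighting scheme described above can be illustrated with a minimal sketch. The weights 1, 2, 5 and 10 come from the text; the function name, the dictionary keys and the ligand counts are hypothetical and serve only to show the arithmetic.

```python
# Weights from the text: Phase I = 1, Phase II = 2, Phase III = 5, marketed = 10.
# The key names are illustrative assumptions, not identifiers used by the authors.
STAGE_WEIGHTS = {"phase1": 1, "phase2": 2, "phase3": 5, "marketed": 10}


def proxy_impact(ligand_counts):
    """Return the weighted proxy impact Y for one target.

    ligand_counts maps a stage name to the number of ligands of the target
    at that stage; stages that are absent count as zero.
    """
    return sum(weight * ligand_counts.get(stage, 0)
               for stage, weight in STAGE_WEIGHTS.items())


# Invented counts for a single hypothetical target.
example_counts = {"phase1": 3, "phase2": 2, "phase3": 1, "marketed": 1}
print(proxy_impact(example_counts))  # 3*1 + 2*2 + 1*5 + 1*10 = 22
```

A target with no ligands at any stage receives Y = 0 under this sketch.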
A differential expression consistency metric was derived from these primary data points. The direct data D1 and D2 were transformed into bonus-penalty data (the specific values and the corresponding bonus-penalty points are listed in Supplemental Table 2). The transformed indirect values D'1 and D'2 were multiplied for each environment, and the products were summed to produce the Differential Expression Consistency Score (DEXCON).
In this system, the leading scores were assigned to the genes that show high expression levels in cancer, but low expression levels in the normal environment related to the tumor as well as low expression levels in all unrelated normal environments. It is apparent that potential targets with such parameters would possess wider therapeutic windows, all other circumstances being equal.
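The final aggregation step of DEXCON, a sum over environments of the products D'1 x D'2, can be sketched as follows. The sketch assumes that the bonus-penalty values have already been assigned; the rule that maps D1 and D2 to D'1 and D'2 is not reproduced here, and the numbers in the example are invented.

```python
def dexcon(d1_prime, d2_prime):
    """Sum the per-environment products of the transformed values D'1 and D'2.

    Both arguments are sequences with one bonus-penalty value per tissue
    environment, listed in the same order.
    """
    if len(d1_prime) != len(d2_prime):
        raise ValueError("D'1 and D'2 must cover the same environments")
    return sum(a * b for a, b in zip(d1_prime, d2_prime))


# Invented bonus-penalty values for three environments.
print(dexcon([2, 1, 0], [1, 2, 2]))  # 2*1 + 1*2 + 0*2 = 4
```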
The absolute levels of gene expression were measured across the panel of 90 tissue environments and this feature was termed INTENSITY, to reflect intensity of absolute mRNA transcript expression.
The parameters DEXCON and INTENSITY became the features embedded in an integrated classifier and were designated X1 and X2 for future reference. See the specific values of the parameters and the corresponding bonus-penalty points in Supplemental Table 2.

Text-mining predictive features
The future target impacts were anticipated by extracting the levels of early scientific interest, as measured by the number of non-review research publications available prior to the first published review and by the first derivative of the research interest. This extraction was accomplished by querying PubMed with the gene name and all its synonyms, followed by manual review of the results to ensure that all selected articles are relevant to the biology of the target and the resultant therapeutic avenues.
The derivative was measured as the ratio
Derivative = ABS(NT - NR) / Spacing (5)
where NT is the number of non-review publications addressing the role of the gene in the disease of interest prior to the date of the first clinical trial inception; NR is the number of non-review publications addressing the role of the gene in the disease of interest prior to the date of the first review published; and Spacing is the number of years between the first clinical trial and the first review. The function ABS is the absolute value operator, and it accounts for the fact that the first review and the first clinical trial may follow in any order.
The average was computed as
N = (NT + NR)/2 (6)
The average measures the absolute number of peer-reviewed research publications related to the gene. All numbers were normalized for the natural growth of the PubMed population over time. The features (5) and (6) were designated X3 and X4 for combining them within the integrated classifier.
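The two text-mining features can be sketched as below. The sketch assumes that the derivative is the absolute difference between NT and NR divided by the spacing in years, which is how the definitions of NT, NR, Spacing and ABS read together; the normalization for PubMed growth is omitted, the rejection of a zero spacing is an added assumption, and the publication counts in the example are invented.

```python
def early_interest_features(n_trial, n_review, spacing_years):
    """Return (derivative, average) of early research interest for one gene.

    n_trial       -- non-review publications before the first clinical trial (NT)
    n_review      -- non-review publications before the first review (NR)
    spacing_years -- years between the first clinical trial and the first review
    """
    if spacing_years <= 0:
        # Assumption of this sketch: a zero spacing is rejected rather than guessed at.
        raise ValueError("spacing_years must be positive")
    derivative = abs(n_trial - n_review) / spacing_years  # feature X3
    average = (n_trial + n_review) / 2                    # feature X4
    return derivative, average


# Invented counts: 40 papers before the first trial, 10 before the first review,
# with the two events five years apart.
print(early_interest_features(40, 10, 5))  # (6.0, 25.0)
```

Because of the absolute value, swapping the order of the first review and the first trial leaves both features unchanged.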

Classifier design
To design a linear classifier locating high-impact targets, the values of Y were transformed as
Y' = log(Y + 1) (8)
The purpose of the transform was to smooth the dataset by diminishing the relative effect of a few very high Y values that dominate its numerical structure. The smoothing effectively increases the diversity of the training set and is equivalent to an increase in training set size, facilitating a more objective training process. The correction term + 1 in (8) accounts for zero Y values, which are not amenable to the log transform. The distortion introduced by the + 1 correction is minimal and does not outweigh the benefit of the smoothing procedure.
The features X1-X4 were ranked, and Y' was related to the ranked X1-X4 by a linear regression:
YP' = W1*X1 + W2*X2 + W3*X3 + W4*X4 + A (9)
where W1-W4 are the corresponding weight coefficients and A is the constant term, all determined by an error minimization procedure:
E = SUM (YP' - Y')^2 (10)
where YP' are the predicted impacts for the entire population of the training set and Y' are the above-defined "true impacts" for the entire population of the training set. The set of training weights [W1-W4, A] was determined by minimizing (10) using the least squares method.
Further steps were taken to improve the resolution of the method. The ranked values of X1-X4 were each scored in the following manner: the top bin [0.9-1.0] received a score of 2, the next bin [0.75-0.89] received a score of 1, and all bins below 0.75 received a score of 0.
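The log transform and the least-squares fit of the weights [W1-W4, A] can be sketched with the standard library alone. This is an illustration under stated assumptions, not the authors' implementation: the natural logarithm is assumed because the base is not specified, the normal equations are solved by plain Gaussian elimination, and the training data are invented so that the fitted weights are known in advance.

```python
import math


def log_transform(impacts):
    """Y' = log(Y + 1); the +1 keeps zero impacts finite (natural log assumed)."""
    return [math.log(y + 1) for y in impacts]


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_least_squares(features, targets):
    """Return [W1, ..., Wk, A] minimizing the sum of squared residuals."""
    rows = [list(f) + [1.0] for f in features]  # trailing 1.0 carries the constant A
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, feature_row):
    """YP' = W1*X1 + ... + Wk*Xk + A, with A stored as the last weight."""
    return sum(w * x for w, x in zip(weights, feature_row)) + weights[-1]


# Invented training set generated from W = (0.5, 0.2, 0.1, 0.3) and A = 1.0.
toy_features = [[0, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0],
                [0, 0, 1, 0], [0, 0, 0, 1], [1, 1, 1, 1]]
toy_targets = [1.0, 1.5, 1.2, 1.1, 1.3, 2.1]
fitted = fit_least_squares(toy_features, toy_targets)
print([round(w, 6) for w in fitted])
```

In practice a numerical library would replace the hand-written solver; it is spelled out here only to keep the sketch self-contained.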
The direct column vectors of features were replaced with the bonus-penalty values as defined here. The error minimization procedure was repeated and the proportions between the weight coefficients W remained practically unchanged.
The meaning of the bonus-penalty transform is to emphasize the role of informative outliers on the side of the model factors X (as opposed to the outputs Y) and to smooth the effects of the random noise, distributed equally among the members of the training set. To re-phrase, the bonus-penalty system selects only the most informative genes for contributions in the prediction rule and maximizes the signal-to-noise ratio at the given size of the training set.
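The bonus-penalty scoring of ranked features can be sketched as follows. The bin edges and scores (2, 1, 0) follow the text; two details are assumptions of this sketch: the fractional rank is taken as the share of values less than or equal to the given value, and ranks between 0.89 and 0.9 are assigned to the middle bin.

```python
def fractional_ranks(values):
    """Map each value to a rank in (0, 1]; the largest value receives 1.0.

    The rank is the share of values that are <= the given value, so tied
    values share the same rank (an assumption of this sketch).
    """
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]


def bonus_score(rank):
    """Score a fractional rank: top bin -> 2, next bin -> 1, everything else -> 0."""
    if rank >= 0.9:
        return 2
    if rank >= 0.75:
        return 1
    return 0


# Twenty invented feature values, already in increasing order.
raw_values = list(range(1, 21))
scores = [bonus_score(r) for r in fractional_ranks(raw_values)]
print(scores)
```

Only the top quarter of the ranked values contributes a non-zero score, which is how the transform emphasizes informative outliers among the features.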

Validation of selected approach
The list of targets was randomly divided into training and testing sets of equal size. The testing set was set apart until the prediction rule was derived in the training set. Based on the derived prediction rule, the residual errors were computed in the training and testing sets by comparing the predicted and real impacts. The populations of the residual errors (10) were compared for the training and testing sets to assess how well the prediction rule generalizes. In the absence of over-fitting the two error populations should be statistically indistinguishable, and we tested for this. To select the proper form of the two-tailed T-test (equal or unequal variance), a preliminary F-test was run to compare variances. The F-test reported equal variances between the error populations, and on this basis the equal-variance T-test was applied for the population comparison. The populations of errors were statistically indistinguishable, with no over-fitting detected. Based on this conclusion, the training and testing populations were merged for plotting the Receiver Operating Characteristic (ROC) curve. The ranking of real impacts was used for defining the high and low real impact categories and the respective labeling of the targets. At the next step, the predicted target scores were ranked, and the distribution of real score labels was traced as a function of the predicted score. The bin with high predicted scores on the ROC curve provided significant enrichment in the real high-impact labels, thus validating our approach.
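The validation steps above (random split into halves, variance comparison, equal-variance comparison of means) can be sketched with the standard library. The sketch computes only the F ratio and the pooled-variance t statistic; converting them to p-values requires the F and t distributions from a statistics package and is left out. The function names, the seed and the residuals in the example are hypothetical.

```python
import math
import random
import statistics


def split_in_half(items, seed=0):
    """Randomly divide items into two halves (training, testing)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def variance_ratio(errors_a, errors_b):
    """F statistic: the larger sample variance divided by the smaller one."""
    var_a = statistics.variance(errors_a)
    var_b = statistics.variance(errors_b)
    return max(var_a, var_b) / min(var_a, var_b)


def pooled_t_statistic(errors_a, errors_b):
    """Two-sample t statistic under the equal-variance assumption."""
    n_a, n_b = len(errors_a), len(errors_b)
    pooled = ((n_a - 1) * statistics.variance(errors_a)
              + (n_b - 1) * statistics.variance(errors_b)) / (n_a + n_b - 2)
    difference = statistics.mean(errors_a) - statistics.mean(errors_b)
    return difference / math.sqrt(pooled * (1 / n_a + 1 / n_b))


# Invented residual errors for the training and testing halves.
train_errors = [0.10, -0.20, 0.05, 0.00, -0.10]
test_errors = [0.15, -0.10, 0.00, -0.05, 0.10]
print(variance_ratio(train_errors, test_errors))
print(pooled_t_statistic(train_errors, test_errors))
```

An F ratio close to 1 and a t statistic close to 0 are what one would expect when the training and testing errors come from the same population.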

ROC curve plotting and its use for computing relative enrichment
The true impacts Y were subdivided by rank into a "high-impact" bin with the ranks [0.75-1.0] and a "low-impact" bin with the ranks [0-0.75]. The members of these groups acquired the positive and negative labels, respectively. The predicted scores YP' were ranked as well, and the population of true impacts followed the rank of YP', producing a non-ideal but generally correlating pattern. The true "high-impact" labels were predominantly concentrated in the higher regions of the YP' rank. The predicted score ranks were explored from the top (1.0) to the bottom (0.0) values.
The fraction of high-impacts f1 was computed by summing the positive labels as defined above. The fraction of low-impacts f2 was computed in a similar fashion.
The fraction f1 of "high-impact" Y and the fraction f2 of "low-impact" Y formed the Y-axis and the X-axis of the plot, respectively. For each 0.1 (10%) increment of the "low-impact" count, the fractional increment of "high-impact" targets was also computed. Every point on the ROC curve can be represented in the coordinates [summary fraction of "low-impact" values; summary fraction of "high-impact" values], where the summary fraction is the sum of all increments over the previous intervals. For example, after three intervals the "low-impact" summary fraction is 0.1 + 0.1 + 0.1 = 0.3, and the matching value of the "high-impact" summary fraction is 0.5 + 0.1 + 0.05 = 0.65.
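The construction of the ROC points from ranked predictions can be sketched as below: targets are visited from the highest predicted score downwards, and the running fractions of "low-impact" and "high-impact" labels give the X and Y coordinates. The function names and the eight-target example are invented; the trapezoidal area is a standard summary added for illustration, with 0.5 corresponding to the diagonal baseline.

```python
def roc_points(predicted, is_high):
    """Return cumulative (low-impact fraction, high-impact fraction) points.

    predicted -- predicted impact score per target
    is_high   -- True for targets labeled "high-impact", False otherwise
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    n_high = sum(1 for flag in is_high if flag)
    n_low = len(is_high) - n_high
    if n_high == 0 or n_low == 0:
        raise ValueError("both impact classes must be present")
    points = [(0.0, 0.0)]
    high = low = 0
    for i in order:
        if is_high[i]:
            high += 1
        else:
            low += 1
        points.append((low / n_low, high / n_high))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented scores and labels for eight targets.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [True, True, False, True, False, False, False, False]
curve = roc_points(scores, labels)
print(curve[-1])                        # the curve always ends at (1.0, 1.0)
print(round(area_under_curve(curve), 4))
```

The area between the curve and the diagonal, which the figures refer to, is the value returned by `area_under_curve` minus 0.5.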
The ratio of the summary fractions characterizes the relative enrichment in the true "high-impact" values as a function of the predicted impact rank. For example, in the highest 0.1 fraction of the predicted impact rank, corresponding to the left-most part of the ROC curve and the [1.0-0.9] bin of the YP' rank, the summary fraction of the "high-impact" values is 0.5; therefore the relative enrichment is 0.5:0.1 = 5. For the predicted impact bin of rank [1.0-0.8], the summary fraction of the "high-impact" values is 0.5 + 0.1 = 0.6, while the summary fraction of the "low-impact" values is 0.1 + 0.1 = 0.2; therefore the relative enrichment is 0.6:0.2 = 3. On a comparative basis, using the top bin of the YP' rank produces a 5-fold higher chance of encountering a true "high-impact" label than using the population before the computational filter was applied.
Figure 1 illustrates the attempt to predict target impacts based on the combination of the DEXCON (X1) and INTENSITY (X2) gene expression features, extracted from the microarray data by the above-described methodology. Figure 2 illustrates the attempt to predict target impacts based on the text-mining features. Figure 3 illustrates the attempt to predict target impacts based on the combination of the DEXCON and INTENSITY gene expression features as well as the text-mining features. The ROC curves demonstrate a non-zero area between the diagonal baseline, which reflects the ratio of false positive to false positive summary functions in each bin, and the thicker upper line, which reflects the ratio of true positive to false positive summary functions in each bin. The ratio of true positives to false positives was significantly higher than the baseline, as is especially evident in the left corner of the ROC plot, which describes the highest range of the predicted scores.
The regions with higher predicted scores embed the majority of real-life high-impact targets, while the regions with lower predicted scores are depleted in real-life high-impact targets. These Figures point to the possibility of predicting the high-impact target category a priori, already at the stage of preclinical development and before the onset of the most expensive clinical trial phase. It is very unlikely that the inherent biological mechanism determining the target's future impact remains obscure at the preclinical development stages. This mechanism leaves its signature in a variety of large-scale high-throughput studies as well as in collective research activity patterns. The more diverse the sources of information incorporated, and the further the prediction point is shifted away from the onset of the active clinical trial stage, the smaller the role of the "me-too" bias factor in the emergence of the detected patterns.
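The relative enrichment used in this section, the ratio of the "high-impact" summary fraction to the "low-impact" summary fraction, can be reproduced with a short sketch. The increments are the ones quoted in the worked example of the text; the function name is hypothetical.

```python
from itertools import accumulate


def enrichment_profile(low_increments, high_increments):
    """Return the relative enrichment after each successive bin.

    Each entry is the cumulative "high-impact" fraction divided by the
    cumulative "low-impact" fraction up to and including that bin.
    """
    low_cumulative = list(accumulate(low_increments))
    high_cumulative = list(accumulate(high_increments))
    if any(low <= 0 for low in low_cumulative):
        raise ValueError("cumulative low-impact fractions must be positive")
    return [high / low for low, high in zip(low_cumulative, high_cumulative)]


# Increments from the worked example: three successive bins.
profile = enrichment_profile([0.1, 0.1, 0.1], [0.5, 0.1, 0.05])
print([round(value, 2) for value in profile])  # [5.0, 3.0, 2.17]
```

The enrichment falls as lower-ranked bins are included, which matches the 5-fold and 3-fold values quoted for the top one and top two bins.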

A priori evaluation of the future economic impact of putative anti-cancer targets
Tables 1 and 2, respectively, show the sets of FDA-approved and "still-in-the-pipeline" anticancer targets with the predicted impacts identified on the above-described basis. The impact leaders among the targets with approved ligands are EGFR, ERBB2, CYP19A1, MTOR, PTGS2, tubulin, VEGFA, BRAF, PGR and PDGFRA. The functions and cancer-related status of the genes were explored using the NCBI Gene database. The predicted impact leaders demonstrate favorable biological anti-cancer features due to their central roles in more universal pathophysiological mechanisms. Thus, many forms of cancer depend on overexpression of EGFR and ERBB2 for their survival. Blockade of these kinases synergizes with cytotoxic therapeutics. In normal cells, such dependence is rare or absent; therefore, the combinational regimens based on EGFR and ERBB2 have access to a broad therapeutic window. The aromatase CYP19A1 is a key enzyme in the estrogen synthetic pathway and is selectively important for the cancer subtypes that rely on estrogen stimulation for growth and survival. Mammalian target of rapamycin (mTOR) regulates cell survival, motility, proliferation, protein synthesis and transcription, making this target extremely important for rapidly propagating cells characterized by destabilized metabolism. PTGS2 (cyclooxygenase-2) is among the most important mediators of inflammation. Consequently, this target is indispensable for the growth stimulation produced by stromal immune cells and for metastasis. Many tumor types directly depend on prostaglandin stimulation. Tubulins are selectively more important for rapidly dividing cells undergoing mitosis. The VEGFA pathway is necessary for neovascularization of tumor foci. VEGFA also produces autocrine stimulation of multiple pro-cancer survival pathways; thus, its inhibition selectively affects almost all cancers.
BRAF (serine-threonine kinase B-Raf) is a proto-oncogene with a broad transforming potential and, therefore, the blockade of its product is selectively more important for the cells where BRAF is constitutively activated. Progesterone receptor (PGR) is selectively more important for ovarian, mammary and endometrial cell populations that depend on progesterone for their growth and survival. Finally, PDGFRA (platelet-derived growth factor receptor, alpha polypeptide) is a powerful mitogen indispensable in certain tissues. To summarize, all genes demonstrating high predicted impacts also demonstrate highly selective involvement in specific cancer types and a lack of such involvement in the majority of normal tissues. These differential roles and the ability to bind relatively non-toxic ligands explain their observed ranks presented in Table 1.
The targets presented in Table 2 are not yet drugged by suitable ligands. The target STS (steroid sulfatase, EC 3.1.6.2) is a member of a steroid synthesis pathway that is capable of producing, in selected cancer cell populations, the same dependence as other steroid pathway mediators, which explains its relatively high predicted impact. PROC (protein C) is actively involved in blood anti-coagulation pathways, stimulates cell migration and influences the secretory behavior of tumor cells, while suppressing NK cells and T helper cells of the TH2, TH17 and TH21 subtypes. Therapeutic activation of TP53 is intended to restore the powerful tumor-suppressor phenotype mediated by its protein, explaining its high predicted impact. Attempts to reactivate CDKN2A (cyclin-dependent kinase inhibitor 2A, multiple tumor suppressor 1) are performed within the same therapeutic paradigm as for TP53. The proto-oncogene c-JUN encodes a transcription factor that mediates apoptosis resistance; its pharmacological inhibition has good potential for a significant impact. TNFSF11 (tumor necrosis factor (ligand) superfamily, member 11) is involved in metastasis and bone remodeling. Remarkably, the high predicted impact was assigned to TNFSF11 based on the Therapeutic Target Database definition of the candidate as not yet matching an FDA-approved ligand. However, we found that the TNFSF11 ligand denosumab was approved in 2010 under the names Xgeva and Prolia, thus validating our approach. CD40 (TNF receptor superfamily member 5) is a co-stimulatory protein found on antigen-presenting cells and as such is a key molecule in establishing an immune response. Alterations of CD40 function determine the probability of cancer emergence and metastasis. The proto-oncogene c-MET is involved in de novo angiogenesis and metastasis; its activation in tumors is correlated with poor prognosis. Being a receptor component adds to its potential for higher impact upon drugging. Hepatocyte growth factor/scatter factor HGF activates the tyrosine kinase signaling cascade of c-Met, thus contributing to the same metastasis-related pathway. In leukemia patients, JAK2 (Janus kinase 2) forms fusions with the TEL(ETV6) (TEL-JAK2) and PCM1 genes, providing targets that do not exist in normal cells. These targets are druggable in the same manner as the target of the well-known drug Gleevec. To summarize, the genes pinpointed as promising tend to participate in the most central mechanisms of tumor cell survival and propagation.
Figure 1. ROC curve of high-impact targets vs. low-impact targets. The ROC curve was built using the DEXCON and INTENSITY parameters by the following procedure: 1) divide the combined list of targets (successful and research) by impact categories, the top 25% forming the "high-impact" class and the rest forming the "low-impact" class; 2) apply the bonus-penalty scoring approach to the DEXCON and INTENSITY values for the list of targets and combine the indirect bonus-penalty scores with the optimized weights in a 2-feature classifier; 3) rank the target list by the values of the 2-feature classifier; 4) determine the fractions of the "high-impact" and "low-impact" categories in each 0.2 bin of rank by the 2-feature classifier; 5) sum the differential fractions for each category accrued over the range from 0 (highest ranked 2-feature scores) to the given point of rank for the 2-feature score; 6) the sum of differential fractions for the "low-impact" category forms the X-axis coordinate, and the sum of differential fractions for the "high-impact" category forms the Y-axis coordinate. The thin diagonal baseline at 45 degrees across the plot reflects the ratio of false positives to false positives for each new point in the form of summary functions, while the thicker line reflects the ratio of the true positive summary function to the false positive summary function in ROC analysis. At the far right corner, the summary functions for the true and false positives are both equal to 1, and the lines cross. The area between the lines is proportional to the resolution quality at multiple possible cut-offs.
Figure 2. ROC curve of high-impact targets vs. low-impact targets. The ROC curve was built using the text-mining parameters by the following procedure: 1) divide the combined list of targets (successful and research) by impact categories, the top 25% forming the "high-impact" class and the rest forming the "low-impact" class; 2) apply the bonus-penalty scoring approach to N (the number of publications between the first review and the first clinical trial) and to the time-derivative of N for the list of targets and combine the indirect bonus-penalty scores with the optimized weights in a 2-feature classifier; 3) rank the target list by the values of the 2-feature classifier; 4) determine the fractions of the "high-impact" and "low-impact" categories in each 0.2 bin of rank by the 2-feature classifier; 5) sum the differential fractions for each category accrued over the range from 0 (highest ranked 2-feature scores) to the given point of rank for the 2-feature score; 6) the sum of differential fractions for the "low-impact" category forms the X-axis coordinate, and the sum of differential fractions for the "high-impact" category forms the Y-axis coordinate. The thin diagonal baseline at 45 degrees across the plot reflects the ratio of false positives to false positives for each new point in the form of summary functions, while the thicker line reflects the ratio of the true positive summary function to the false positive summary function in ROC analysis. At the far right corner, the summary functions for the true and false positives are both equal to 1, and the lines cross. The area between the lines is proportional to the resolution quality at multiple possible cut-offs.
Figure 3. ROC curve of high-impact targets vs. low-impact targets. The ROC curve was built using the DEXCON, INTENSITY and text-mining parameters by the following procedure: 1) divide the combined list of targets (successful and research) by impact categories, the top 25% forming the "high-impact" class and the rest forming the "low-impact" class; 2) apply the bonus-penalty scoring approach to the DEXCON and INTENSITY values for the list of targets; 3) apply the bonus-penalty scoring approach to the ranked derivative of early research interest and to the volume N of early research interest; 4) combine the secondary bonus-penalty values for all features with the optimized training weights in a 4-feature classifier; 5) rank the target list by the values of the 4-feature classifier; 6) determine the fractions of the "high-impact" and "low-impact" categories in each 0.2 bin of rank by the 4-feature classifier; 7) sum the differential fractions for each category accrued over the range from 0 (highest ranked 4-feature scores) to the given point of rank for the 4-feature score; 8) the sum of differential fractions for the "low-impact" category forms the X-axis coordinate, and the sum of differential fractions for the "high-impact" category forms the Y-axis coordinate. The thin diagonal baseline at 45 degrees across the plot reflects the ratio of false positives to false positives for each new point in the form of summary functions, while the thicker line reflects the ratio of the true positive summary function to the false positive summary function in ROC analysis. At the far right corner, the summary functions for the true and false positives are both equal to 1, and the lines cross. The area between the lines is proportional to the resolution quality at multiple possible cut-offs.
The genes with the lowest rank in Table 1 were analyzed in a similar fashion. IMPDH1 (IMP (inosine 5'-monophosphate) dehydrogenase 1) is mostly involved in transplant rejection and retinitis pigmentosa; its link to cancer is tenuous. IMPDH2 (inosine 5'-monophosphate dehydrogenase 2) is also involved in autograft rejection, and its connections to cancer are not apparent. ITGA2B (integrin, alpha 2b (platelet glycoprotein IIb of IIb/IIIa complex, antigen CD41)) is mostly involved in fibrinogen activation and coagulopathies; however, its connections to metastasis are proven. LHCGR (luteinizing hormone/choriogonadotropin receptor) is involved in a broad diversity of pathways, and its levels correlate with survival in ovarian epithelial cancer patients. MME (membrane metalloendopeptidase) shows a strong link to cancer, to both prognosis and metastasis; however, targeting of metallo-endopeptidases (MMPs) has historically not been successful, despite substantial effort. OXTR (oxytocin receptor) is mostly involved in behavior and social adaptation, and its involvement in tumorigenesis is a stretch. PARP1 (poly (ADP-ribose) polymerase 1) is strongly related to malignancy; however, its expression is ubiquitous, and only a few clinical trials have been published for the targeting of this gene. While PDE4A (phosphodiesterase 4A, cAMP-specific) is involved in cardiac muscle activity and fibroblast proliferation, its links to malignancy are indirect. The functions of PTH1R (parathyroid hormone 1 receptor) are pleiotropic, with known roles in transplant rejection, organ development and bone maturation, and with some evidence of its contribution to certain breast cancers. RARA (retinoic acid receptor, alpha) is vital for differentiation of hematopoietic lineages and for the respective malignancies. However, the utility of RARA agonists is confined to the leukemia field, and the number of clinical trials in this area is limited.
TSPO (translocator protein) is involved in microglial and retinal inflammation and in HIV-1 virus maturation, while overexpression of TSPO correlates with the progress of breast cancer. However, the number of clinical trials for drugs that target this molecule is small, and its expression is not highly specific to cancer samples. TXNRD1 (thioredoxin reductase 1) participates in redox processes, apoptosis and membrane raft formation, while its overexpression correlates with glioblastoma multiforme progression. Speaking generally, comparison of the high and low-impact score bins indicates that the higher ranking gene set consistently attracts substantially more interest from the research community. Every member of the high score bin is associated with prominent cancer-related findings and produces relevant DEXCON signals exceeding the threshold of interest. The drugging of candidates with the highest impact ranks would be the most influential on cancer outcomes and deserves prioritization. By contrast, the lower scoring members do not always show a well-established association with cancer phenotypes, as is evident from the functional description entries that passed manual curation before being linked to the respective gene names in the "Genes" subdivision of NCBI. For this group of genes, the majority of evidence is based on correlation of mRNA or protein expression levels with cancer phenotypes or outcomes; however, an independent evaluation of the consistency of the overexpression findings does not confirm the uniformity of this observation. Retrospective post-prediction analysis also revealed an important caveat, which is likely to result in future improvements of the impact score concept.
Some genes, like RARA or MME, demonstrate strong links to cancer but produce low scores due to relatively narrow utility, i.e. the applicability of the developed ligands is restricted to a limited spectrum of tumors. As these diseases are generally considered "orphan", it is important that the development of ligands aimed at the treatment of these pathologies continues unimpeded despite the low scores of the respective targets. Hence, the model that we propose should be further optimized by introducing an "orphan" disease coefficient that would preclude the attrition of targets that are highly specific to certain malignancies afflicting a relatively low number of patients.

An impact of targeting "super-targets" as a component of a combinatorial treatment
Typically, anti-cancer therapies are combinatorial, as they include at least 2 components. Assuming a typical three-component therapy approach, an n-fold increase in the number of available high-impact targets would result in an n^3-fold increase in the number of available drug combinations, prompting further progress in their evaluation and testing. Based on the data in Table 1 and different evaluations (2)(3), the effect of the current pool of therapies can be substantially magnified by this combinatorial expansion. Above we discussed that a conservative estimate of the contribution of therapeutics to the observed doubling of cancer survival is 40%. Designating this increment as SIDT (Survival Increment Due To Therapies), equal to 40%, one can draw a model: (12) where SIDT(1) is the survival increment due to therapies at the current level, SIDT(2) is the survival increment at the projected level, N is the number of available therapies at the current level, and m is the increment in the number of high-impact targets applied in the cancer field. The exponential coefficient 2-3 assumes the formation of two- or three-component drug cocktails; however, this number may be greater or lesser in the future. Based on the model (12), at n = 2 and m = 2, the SIDT(2) would increase 4-fold, which would come very close to curing at least some types of cancer (see Table 1) and would improve ten-year survival rates for lung, brain and pancreatic cancers by at least 30%. At n = 3 and m = 2, the SIDT(2) would increase 9-fold, eradicating many currently lethal types of cancer and producing a >60% improvement of the survival rates after 10 years of observation for many other malignant disorders.
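The combinatorial argument behind the n^3 figure can be checked with a short Python sketch. This illustrates only the counting of unordered k-component combinations and is not a reconstruction of model (12); the pool size of 30 targets and the function name are hypothetical.

```python
from math import comb


def combination_gain(n_targets, fold, k):
    """Ratio of k-component combinations after a fold-increase in the target pool.

    n_targets: size of the current target pool (hypothetical value in the example)
    fold:      multiplicative increase in the number of targets (n in the text)
    k:         number of components per therapy
    """
    return comb(fold * n_targets, k) / comb(n_targets, k)


# Hypothetical pool of 30 targets, doubled to 60.
gain_two_component = combination_gain(30, 2, 2)    # compare with 2**2 = 4
gain_three_component = combination_gain(30, 2, 3)  # compare with 2**3 = 8
```

For a finite pool the exact ratio slightly exceeds fold**k and approaches it as the pool grows, so the n^2 and n^3 figures are best read as approximations for two- and three-component cocktails.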

DISCUSSION
Significant progress has been achieved in the treatment of the majority of cancer forms, with the average survival rate practically doubling over the last 45 years (Table 3, compiled according to the UK data presented in (1)). The rate of progress appears to be constant for most of the individual sub-ranges of the plot; however, a certain recent acceleration is observed. The increase in the number of available treatments likely contributes to this improvement, although not as the single factor. Assuming that the level of investment and, therefore, the projected impacts correctly reflect the level of future revenues/sales both for the "successful" and for the "research" sets of targets, and also assuming that the level of sales correctly reflects the benefit to society, one can argue that the top 10% of the successful targets, or just seven of them, produce 75% of all anti-cancer effects. The top impact target alone, EGFR, mediates 23% of the total anti-cancer effect, while the second best target, ERBB2, mediates 22% of the total effect. While from a purely clinical point of view these numbers seem disproportionate, many novel therapies rely on a combinatorial synergy with EGFR and ERBB2 ligands (22)(23). Hence, the introduction of novel high-impact targets could alter the survival dynamics even further for at least some of the cancer forms.
The technique presented in this paper allowed us to evaluate the potential impact of the targets which are currently at the research stage. Our analysis highlighted microsomal steroid sulfatase (estrone sulfatase) at the top of the list which, after some score gap, was followed by anticoagulant protein C, p53, CDKN2A, c-Jun, TNSFS11, CD40, c-MET and JAK2, all of which were highlighted as the most promising research-stage targets. According to our calculations, the relative importance of microsomal steroid sulfatase is at approximately the same level as that of the well-known anticancer targets BRAF kinase and progesterone receptor. A PubMed search using the term "estrone sulfatase" or "estrone sulfatase" AND "cancer" returns 193 and 125 manuscripts, respectively. Recent years have seen the development of a number of potent estrone sulfatase inhibitors aimed at suppressing the formation of both E1 and the breast carcinoma-promoting steroid dehydroepiandrosterone (DHEA) from DHEA-sulfate (DHEAS) (24)(25)(26). As approximately 40% of breast tumors are estrogen-dependent, successful advancement of estrone sulfatase inhibitors into clinical practice could potentially lead to sizable global effects. On the other hand, many potential targets were predicted to have minimal impact; deprioritization of these targets may lead to substantial savings and a subsequent shift of clinical development efforts toward the most promising drug candidates.
Our impact scoring technique is, in a nutshell, a 4-feature bonus-penalty classifier which comprises two components, the microarray and the literature mining. The INTENSITY feature is the level of absolute transcript expression demonstrated by the target candidate. With all other factors being equal, the candidates with more intensive expression would influence biological signal transduction events more robustly as they produce higher amounts of mRNA, and, therefore, the protein.
While the detected correlation of the impacts and the transcript levels is relatively weak (r = 0.2), it is sufficient to boost the performance of a classifier of the bonus-penalty type. The DEXCON feature reflects the stability of the differential expression signal in tumors of various tissue origins, and in tumors of the same origin. As shown earlier, the DEXCON score is superior to typical t-test based evaluations of the significance of observed differential expression patterns, as it takes into consideration the consistency of evidence (27). It is important to note that microarray-derived features are capable of serving as predictors even when completely novel target candidates come into the scope of study; hence, their value is higher than that of text-mining features. The text-mining features rely on pre-existing information regarding a potential mechanism of action and the perceived value of a target candidate. When the majority of scientific data is collected prior to the major information-disseminating event, i.e. the publishing of an influential review and/or the result of a clinical trial, and the "me-too" bias is minimized by a pre-dissemination choice of cut-off, the literature mining features become a less-than-obvious predictor, although still inferior to experimental measurements such as microarrays. The bibliometric aspect of the study may clearly be improved by taking into account the impact factors of the journals, the number of patents, the relative sizes of each study, the total amount of grant support, etc. In this initial report, the bibliometric aspects were limited to the number of publications and to the rate of their accumulation prior to the critical bias-producing events. The rationale behind our approach is that both the total number of publications and their accelerating deposition into PubMed are proxies for research interest, which, in turn, correlates with the objective value of the target.
The underlying research interest also drives publishing in higher impact journals and determines the awarded grant support, which is instrumental in performing studies in larger groups of animals or patient cohorts. Hence, the introduction of additional bibliometric parameters will also introduce co-correlating variables.
Relative weights for each of these parameters should be evaluated by experimenting in silico. On the other hand, the features related to target gene expression are a priori independent of the bibliometrics and, therefore, more likely to add an input to the model. The linear classifier for YP' was selected due to its robustness, which aids in the prevention of model overfitting. Since the training set was as small as ~30 successful anticancer targets, this precaution appears to be warranted, especially if the number of features were to increase due to the inclusion of other information sources, for example, the networks of biomolecules. The least-squares method was selected for regression modeling, with ranking-based cut-offs for high and low real impact. High rank was defined as the highest quartile, and low rank was defined as the two lowest quartiles. These cut-offs are somewhat arbitrary and, therefore, the results of the study are qualitative rather than quantitative. However, the proposed technique for estimating the commercial promise of targets that are still only potential, and that are costly to develop, is inexpensive and, therefore, of value. In this study, the selected cut-offs clearly separated the high and low impact groups of targets while preserving a sufficient number of targets in each group and, by that, allowing for ROC plotting.
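The ranking-based cut-offs described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the target names and impact values are hypothetical, and targets falling into the second-highest quartile are simply left unlabeled.

```python
def label_by_quartile(impacts):
    """Label targets by rank of real impact.

    impacts: dict mapping a target name to its impact value.
    Returns a dict mapping each name to "high" (highest quartile),
    "low" (two lowest quartiles) or None (second-highest quartile).
    """
    ranked = sorted(impacts, key=impacts.get, reverse=True)
    total = len(ranked)
    labels = {}
    for position, name in enumerate(ranked):
        fraction = position / total  # 0.0 for the top-ranked target
        if fraction < 0.25:
            labels[name] = "high"
        elif fraction >= 0.5:
            labels[name] = "low"
        else:
            labels[name] = None
    return labels


# Hypothetical impact values for eight placeholder targets.
example_impacts = {
    "T1": 23.0, "T2": 22.0, "T3": 9.0, "T4": 6.0,
    "T5": 3.0, "T6": 2.0, "T7": 1.0, "T8": 0.5,
}
example_labels = label_by_quartile(example_impacts)
```

Leaving the intermediate quartile out of both classes is one way to keep the two groups clearly separated, at the cost of a smaller labeled set.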
To generate a high proxy impact Y, a target should demonstrate consistent promise in multiple clinical trials. There is certainly an informational gap between early research interest and clinical trial results. An intriguing discovery of a novel pharmacological mechanism and its experimental confirmation at the pre-clinical level may not even acknowledge the possible inability of a target to acquire a non-toxic ligand, unfavorable patterns of expression, unfavorable pharmacodynamics, etc. From the point of view of information theory, a predictor is a function that contributes a quantity of information sufficient to measure the pattern of the future event or to approach it:

$$I_2 = I_1 + \Delta I \tag{11}$$

where $I_2$ is the final state, in which the completeness of information allows a reliable description of the target pattern of the future; $I_1$ is the initial state, in which the fragmentary or zero initial information concerning the target pattern is insufficient; and $\Delta I$ is the increment of information produced by the predictor, which renders knowledge of the future pattern.
Based on formula (11), the microarray setting corresponds to $I_1 \approx 0$ (little is known about any aspect of the target and its behavior prior to the experiment), while text-mining corresponds to $I_1 > 0$ (substantial knowledge about the target and its expected behavior exists prior to the computation of metrics). From these considerations it is apparent that a perfect microarray classifier that allows complete prediction of a future pattern produces a greater informational increment than an equally perfect text-mining classifier. At the same time, the contribution of the text-mining classifier is non-zero, unless the information gap between the present state and the future state is negligible. Thus, we argue that the proposed text-mining approach is at least partially objective and, therefore, provides an added value when coupled with the microarray data. Speaking generally, the fusion of orthogonal sets of features produces a greater summary I and elucidates the final state more reliably than either component set of features in isolation. In that sense, the imperfect (biased) contribution of the text-mining features is still useful, as its informative component helps to bridge the information gap in equation (11) sooner. The method of extracting text-mining features employed in this report is analogous to the consensus forecasting used in economic modeling. In most cases the expert consensus is correct, but the historical record attests that it should never be applied in isolation (28)(29). Thus, the text-mining derived features and the microarray features act synergistically, supporting each other, and provide an integrated predictor.
The current rates of attrition for ligands and targets are discussed extensively (30,31). As an example, Hutchinson et al. report an average attrition rate for anti-cancer therapies of 95-96% (30). As attrition leads to the loss of all the costs accrued by the rejected ligands, the attrition of more advanced candidates is the more damaging event. Implementation of target evaluation by potential for eventual success will lead to earlier elimination of some ligands from the development pipelines. A preference for anti-cancer targets with the widest possible therapeutic window would contract the overall volume of clinical trials. If we treat a clinical trial as a test with a certain signal-to-noise ratio, we can apply the known statistical observation that the required size of the test is smaller if the signal-to-noise ratio is inherently higher. In other words, if the targets are selected for a favorable gene expression pattern, with preferential expression in tumors rather than in normal cells, lesser toxicities are expected, and enrollment of lower numbers of patients into dose escalation trials would be necessary.
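The relation between signal-to-noise ratio and test size can be illustrated with the textbook normal-approximation formula for comparing the means of two equal groups, n = 2(z_{1-alpha/2} + z_{power})^2 / d^2 per group, where d is the difference in means divided by the common standard deviation. The Python sketch below is a generic illustration under that assumption; it does not model a dose escalation design, and the function name and default parameters are hypothetical choices.

```python
from math import ceil
from statistics import NormalDist


def per_group_size(signal_to_noise, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided, two-group comparison of means.

    signal_to_noise: expected difference in means divided by the common
                     standard deviation (standardized effect size).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance quantile
    z_power = NormalDist().inv_cdf(power)              # quantile for the desired power
    return ceil(2.0 * (z_alpha + z_power) ** 2 / signal_to_noise ** 2)


size_at_half = per_group_size(0.5)  # weaker signal relative to noise
size_at_one = per_group_size(1.0)   # signal-to-noise ratio doubled
```

Under these assumptions, doubling the signal-to-noise ratio from 0.5 to 1.0 reduces the required size from 63 to 16 subjects per group, reflecting the inverse-square dependence in the formula.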

CONCLUSION
The main result of the report is the demonstration that the promising behavior of a pharmacological target is predictable early, based on its expression signature and on text-mining of the pattern of early research interest. Considering the very divergent levels of promise displayed by the currently approved targets, we conclude that most of the survival increment attributed to anticancer therapies in general is achieved via the high-impact targets. Improvement in predicting the targets with an inherently wide therapeutic window may result in savings at the clinical trial stage and, eventually, in an explosion of therapeutic opportunities that would benefit the entire society.

CONFLICT OF INTEREST
There was no conflict of interest at any stage of the manuscript creation.