Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Tumors are highly heterogeneous such that expression of specific oncogenes may only be seen in a subset of patients. Such bimodal genes can be defined as those having two modes of expression within the same population. Given the large-scale regulatory role of microRNA (miRNA), we hypothesize that bimodal miRNA may be important biomarkers of tumorigenesis. Gaussian mixture modeling was applied to miRNA-seq data from 9 types of cancer. The underlying distribution of each miRNA was compared to non-cancerous tissue samples and a bimodality index was calculated. In head and neck, liver, lung, and stomach cancers, we identified bimodal co-expression of miR-105-1, miR-105-2, and miR-767. Patients with high expression of this miRNA module had poor survival prognoses. The computationally predicted targets of the 3 miRNAs were then analyzed for bimodality, and functional annotation clustering revealed enrichment for transcriptional regulation. miR-105 and miR-767 were overexpressed in vitro, and putative target genes, including ATF3, MINOS1, PER2, SERTAD2, and SUMO1, were downregulated. Next, we will test miR-105 and miR-767 inhibitors as a viable cancer therapy by measuring proliferation in the cell model. Collectively, we show that bimodally expressed miRNA can be utilized to predict cancer prognosis and design personalized treatments.
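The abstract above does not say which bimodality index was used. As an illustrative sketch only, the following Python assumes a two-component Gaussian mixture with a shared standard deviation, fitted by a small hand-written EM loop, and the commonly used index BI = sqrt(pi * (1 - pi)) * |mu1 - mu2| / sigma. The function names and the quartile-based starting values are hypothetical choices made for this example, not the authors' implementation.

```python
import math


def fit_two_gaussians(values, iterations=200):
    """Fit a two-component 1D Gaussian mixture with a shared standard
    deviation using EM. Returns (pi1, mu1, mu2, sigma)."""
    xs = sorted(values)
    n = len(xs)
    # Assumed starting values: lower and upper quartiles as the two means.
    mu1, mu2 = xs[n // 4], xs[(3 * n) // 4]
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / n) or 1e-6
    pi1 = 0.5
    for _ in range(iterations):
        # E-step: responsibility of component 1 for each value.
        resp = []
        for x in xs:
            a = pi1 * math.exp(-0.5 * ((x - mu1) / sigma) ** 2)
            b = (1.0 - pi1) * math.exp(-0.5 * ((x - mu2) / sigma) ** 2)
            total = a + b
            resp.append(a / total if total > 0 else 0.5)
        # M-step: update mixing proportion, means, and shared sigma.
        n1 = sum(resp)
        n2 = n - n1
        if n1 < 1e-9 or n2 < 1e-9:
            break  # one component has collapsed; keep the last estimates
        mu1 = sum(r * x for r, x in zip(resp, xs)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / n2
        var = sum(
            r * (x - mu1) ** 2 + (1.0 - r) * (x - mu2) ** 2
            for r, x in zip(resp, xs)
        ) / n
        sigma = math.sqrt(max(var, 1e-12))
        pi1 = n1 / n
    return pi1, mu1, mu2, sigma


def bimodality_index(values):
    """Assumed definition: sqrt(pi * (1 - pi)) * |mu1 - mu2| / sigma."""
    pi1, mu1, mu2, sigma = fit_two_gaussians(values)
    return math.sqrt(pi1 * (1.0 - pi1)) * abs(mu1 - mu2) / sigma
```

Under this definition, a miRNA whose expression splits into two well-separated, similarly sized groups scores high, while a unimodal miRNA scores low. Any cutoff for calling a miRNA bimodal would have to be chosen separately.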
Short Abstract: The avalanche of variants residing in genomic non-coding “dark matter”, available via whole genome sequencing (WGS), provides a challenging opportunity for disease interpretation. The popular GeneCards Suite KnowledgeBase, encompassing ~150K annotated coding and non-coding genes in GeneCards(PMID:27322403) and ~20,000 annotated diseases in MalaCards(PMID:27899610), provides qualitatively and quantitatively rich entities and relationships for the Suite’s NGS tools: VarElect(PMID:27357693), the phenotype interpreter, accepts lists of genes and phenotypes as input, and computes prioritized direct (keyword-based) and indirect (inferred from gene-to-gene associations) gene/disease connections; TGex, our end-to-end NGS solution, is a VCF-to-report clinical analyzer which incorporates VarElect’s algorithms. WGS contributes three classes of functional genomic elements to variant analyses: promoters, enhancers, and ncRNAs, all central to tissue-related gene expression, with many underlying diseases. Together they amount to >20% of the new DNA territories. We’ve augmented GeneHancer(PMID:28605766), a novel regulatory element database with ~250,000 enhancers and promoters. Information is amalgamated from: ENCODE, Ensembl, FANTOM5, VISTA, dbSUPER, and now, UCNEbase, and EPDnew. In parallel, ~100,000 unified ncRNAs are consolidated from 21 general and specialized databases(PMID:23172862 and work in progress). The Suite’s WGS disease interpretation platform provides a comprehensive route to clinical significance of coding and non-coding single nucleotide and structural genomic variations, often elucidating unsolved cases.
Short Abstract: To improve the quality and transition of healthcare, robust big data management platforms are necessary to analyze heterogeneous genomics and healthcare data of high volume, velocity, variety and veracity. Healthcare data includes information about patient lifestyle, medical history, visits to the practice, wet lab and imaging tests, diagnoses, medications, surgical procedures, consulted providers and genomics profile. Adequate and analytic access to healthcare and genomics data has the potential to revolutionize the field of medicine by developing a better understanding of biological mechanisms and modelling complex biological interactions through integrating and analyzing knowledge in a holistic manner. To effectively meet the goals of implementing a system for precision medicine, significant efforts are required from experts in various disciplines, located within one or multiple organizational units. One of the major challenges is to establish an efficient and secure workflow that can connect all units to streamline transparent data flow, quality inspection, processing, analysis and sharing. Here we present PROMIS-Med, a new, user-friendly, HIPAA-compliant precision medicine platform for the management, analysis and visualization of complex, large-scale healthcare and genomics data. PROMIS-Med currently manages healthcare data for over 800,000 patients and supports integrative processing and analysis of genomics data of various kinds.
Short Abstract: Cell-free DNA (cfDNA) offers the potential for minimally invasive genome-wide profiling of tumor alterations without tumor biopsy and may be associated with patient prognosis. Triple-negative breast cancer (TNBC) is characterized by few mutations but extensive somatic copy number alterations (SCNAs), yet little is known regarding SCNAs in metastatic TNBC. We sought to evaluate SCNAs in metastatic TNBC exclusively via cfDNA and determine if cfDNA tumor fraction is associated with overall survival in metastatic TNBC. We identified 164 patients with biopsy-proven metastatic TNBC and performed low-coverage genome-wide sequencing of cfDNA from plasma. Without prior knowledge of tumor mutations, we determined tumor fraction of cfDNA for 96.3% of patients and SCNAs for 63.9% of patients. Copy number profiles and percent genome altered were remarkably similar between metastatic and primary TNBCs. Certain SCNAs were more frequent in metastatic TNBCs relative to paired primary tumors and primary TNBCs in publicly available data sets The Cancer Genome Atlas and METABRIC. Prespecified cfDNA tumor fraction threshold of ≥ 10% was associated with significantly worse metastatic survival (median, 6.4 v 15.9 months) and remained significant independent of clinicopathologic factors.
Short Abstract: The genome of an individual person contains up to 3.5 million variants, which can potentially contribute to disease. Very often, little attention is paid to SNVs within splicing regions, which might alter splicing patterns and thus produce aberrant proteins. Although a variety of publications have shown that splice site SNVs are associated with human disease, no efficient tool has yet been developed that is able to identify splice site disruptors in large variant datasets. Existing splice-site prediction tools are mainly designed to process a small number of sites. Hence, we aim to implement an effective and accurate annotation pipeline, which identifies deleterious synonymous splice site variants in very large datasets. We integrate the MaxEntScan tool in our pipeline, as well as dbscSNV_ADA and dbscSNV_RF scores, to predict the deleteriousness of SNVs within splicing regions. The pipeline is trained using variant sets from the HGMD, ClinVar and gnomAD databases. Furthermore, we applied machine learning using a combination of MaxEntScan and other genomic scores, such as CADD13, DANN, FATHMM, DPSI and Eigen. Finally, we applied our pipeline to a large dataset of Parkinson’s disease cohort studies to identify novel links between splice-site mutations and disease.
Short Abstract: In cancer, liquid biopsies are an extremely attractive resource for the discovery of novel biomarkers. Numerous studies have demonstrated the presence of RNA-species in liquid biopsies. The expression of RNA molecules can be highly cell-type specific, suggesting they may serve as circulating biomarkers. To identify cancer-type-specific RNAs, we reprocessed smallRNA and polyA+ RNA-sequencing data of TCGA and TARGET database, resulting in the annotation and quantification of miRNAs, isomiRs, sn(o)RNAs, tRNAs, mRNAs and lncRNAs in almost 12,000 patient samples covering 40 cancer types. A novel expression specificity score was calculated, resulting in more than 6000 genes with a cancer-type-specific expression pattern. Whereas cancer-type-specific lncRNAs and mRNAs were identified in almost all cancer types, cancer-type-specific miRNAs were only found in a subset. To evaluate the potential of these cancer-type-specific RNAs as non-invasive biomarkers, we collected plasma samples from metastatic patients representing 34 cancer types. We prepared RNA from plasma and extra-cellular vesicles and analysed those using smallRNA sequencing to quantify circulating miRNA expression levels. Quantification of mRNAs and lncRNAs in plasma derived RNA is currently ongoing. Our preliminary findings indicate that a subset of cancer-type specific RNAs are detectable in plasma and may be used as biomarkers for diagnosis or disease monitoring.
Short Abstract: Tumor neoantigens are drivers of cancer immunotherapy response; however, neoantigen prediction tools produce many candidates that require further prioritization for research/clinical applications. We investigated four peptide novelty metrics that help refine predicted neoantigenicity: paired MHC binding affinity difference, paired peptide sequence similarity, homologous peptide sequence similarity, and microbial peptide sequence similarity. We applied these metrics to neoepitopes predicted from somatic mutations in The Cancer Genome Atlas (TCGA), as well as to a group of peptides with neoepitope-specific immune response data. Only 19.9% of predicted neoepitopes across TCGA displayed novel MHC binding based on our criteria. Peptide sequence similarity was high between paired tumor-normal epitopes, but some neoepitopes were more similar to other human peptides, or to bacterial or viral peptides, than their paired normal counterparts. Applied to peptides with neoepitope-specific immune response data, a linear model incorporating a neoepitope’s binding affinity, paired MHC binding affinity difference, and sequence similarity to its closest viral peptide was able to predict immunogenicity with an AUROC of 0.66. These novelty criteria emphasize biologically meaningful neoepitopes, demonstrating that neoepitopes should be considered within the context of putatively co-occurring peptides, with potential implications for the development of personalized vaccines for cancer treatment.
Short Abstract: The opioid crisis in the United States currently accounts for ~100 deaths per day due to overdose. The effectiveness of opioids in reducing pain, and the seeking behavior of opioid addicts, lead physicians in the United States to write over 200 million opioid prescriptions every year. To better understand the profile of opioid-seeking patients, advanced computational methods that mine electronic health records (EHR) can be employed. EHR systems contain information on medical procedures, lab tests, vital signs, prescriptions, and other data for millions of patients. For this project we trained a machine learning model to classify patients by likelihood of substance dependence using EHR from ~3,200 patients diagnosed with substance dependence (ICD-9 code 304.*), along with control samples of the same size composed of patients with no history of substance dependence (ICD-9 codes 304.* or 305.*), but with matched age, race, and gender. The model achieves a prediction accuracy of ~86%, and the analysis of the model uncovers associations between basic clinical factors and substance dependence. The predictive model may hold utility for identifying opioid-seeking patients who report other symptoms in the emergency room (ER), so that those patients can be more properly treated.
Short Abstract: Cancer cells contain thousands of mutated genes, differential copy numbers and differential expression of genes. The progression of cancer differs from patient to patient. Identification of key proteins and pathways in an individual patient’s molecular profile has become important for personalized medicine. In the first step of our proposed pipeline, gene mutations, gene expression profiles, copy number variations and clinical data of lung cancer patients (LUAD) are downloaded from TCGA. Significant genomic variations are determined using the R MADGiC and GAIA packages. Using the R DESeq2 package, the most active differentially expressed genes are determined for the patients (number of patients=55) for whom adjacent normal tissue RNA-seq expression levels are available. The most active pathways are determined with the Cytoscape jActiveModules program based on expression levels. For significant genomic variations and gene expression levels, an MDS plot and Kaplan-Meier survival analysis of the patients are produced. The goals of our project are to 1) computationally identify the most significant genes whose mutation and expression profiles correlate with patient survival time, 2) verify the significance of the results against those of a recent study conducted on the TCGA LUAD dataset (Deng Z. min et al, 2017), and 3) provide an open-source automated pipeline.
Short Abstract: We present a machine learning framework that aims to predict drug side effects arising from a variety of biological mechanisms, including the interactions of the drug molecule with targets expressed in unintended tissues. The approach integrates data from several public resources into a large knowledge graph summarising molecular and phenotypic information on 1,035 approved drugs. It incorporates information on tissue expression of drug targets as well as links between tissues and side effects based on normalised literature cooccurrence. The graph is used as input for node2vec - a neural network embedding method that automatically derives feature vectors for each node based on the topology of its neighbourhood. The learnt feature vectors are then used to train a classifier predicting the existence of a relationship for each drug-side effect pair. The integrated model can predict infrequent side effects with AUC of 0.95 and is validated on two external datasets including adverse events detected in clinical trials of yet unapproved drugs. Our approach demonstrates the utility of automated unsupervised network representation learning. In addition, considering tissue information enhances prediction of drug side effects and may lead to new strategies for identifying novel drug repurposing opportunities.
Short Abstract: Hepatocellular carcinoma (HCC) has poor prognosis and is the second leading cause of cancer mortality worldwide. The heterogeneous nature of HCC has resulted in the development of multiple murine models to investigate the underlying tumor biology and potential therapies. However, it is unclear to what extent different mouse models of HCC recapitulate human disease at the molecular level. We compared the genomic and transcriptomic profiles for 56 tumors from 4 mouse models to 987 HCC patients with diverse etiologies. Analyzing somatic single nucleotide variants from mouse tumors, we identified known mutational signatures and one novel mouse-specific signature. Among diverse non-synonymous mutations affecting established oncogenes and tumor suppressors, we observed orthologous amino acid changes in CTNNB1, a known HCC driver, in mice exposed to streptozotocin (STAM), and near universal (~90%) BRAF V600E in mice exposed to N-nitrosodiethylamine (DEN). Transcriptomic analysis revealed high correlations between STAM samples and human tumors characterized by high proliferation, high tumor grade, and poor prognosis. We also found evidence that the mouse immune system shapes the somatic mutational landscape of murine tumors. Overall, we identified two mouse models that demonstrated similar molecular characteristics to human tumors and may therefore provide more representative models for studying HCC oncogenesis.
Short Abstract: Here we assess coverage and characteristics of a natural-product (NP) target network to explore the potential of NPs in cancer therapeutic space. NPs, or compounds from living sources, may help address challenges of cancer drug resistance, and may also be synergistic with some cancer drugs. In recent work, our group developed an evidence-based framework for approved cancer drug-target interactions, which we termed the Cancer Targetome (Blucher et al., 2017). Using publicly available databases, we developed a similar framework for known biological targets of compounds isolated from plant-sourced NPs. Critical cancer pathways were then identified from the Reactome knowledgebase using over-representation analysis with a set of pan-cancer driver genes. Both cancer drug and NP targets were mapped to protein targets within these pathways to assess independent and combined coverage. Considering target interactions with the strongest target binding values, the addition of NPs yielded a 60% increase in coverage of targets in these pathways. In addition, mapping of these two classes of targets into a biological network revealed statistically similar characteristics. This work indicates that natural products may substantially increase the therapeutic target space when considered jointly with cancer drugs and assist in identifying novel therapeutic combination strategies.
Short Abstract: Broadly neutralizing antibodies (bNAbs) targeting the HIV-1 envelope glycoprotein (Env) have promising utility in prevention and treatment of HIV-1 infection, with several undergoing clinical trials. Due to the high diversity and mutation rate of HIV-1, viruses may be resistant to a particular bNAb, or administration of a bNAb may lead to viral escape. Until now, resistant viral strains have been identified by a three-step procedure: sequencing Env, generating pseudoviruses, and performing in vitro neutralization assays, of which the latter two steps are expensive and time-consuming. Here we developed sequence-based machine learning (ML) classifiers that predict resistance of HIV-1 to 32 bNAbs with an average accuracy of 87%. Using the tree-based ML method gradient boosting machine, we were able to interpret learnt biological features that distinguished between resistance and sensitivity for all 32 antibodies. Notably, we correctly identified 100% of the resistant and 88% of the sensitive strains from patients enrolled in the VRC601 phase 1 clinical trial. Further, we predicted resistance to bNAbs VRC01, 3BNC117, 10-1074, and PGT121 with a mean accuracy of 91% for 262 clinical viral strains. The availability of in silico antibody resistance predictors will facilitate informed decisions of antibody usage in clinical settings.
Short Abstract: Expression quantitative trait loci (eQTLs) have been mapped in most tumor types. These studies measured gene expression in tumors and identified associations between these gene expression levels and common inherited genetic variants (e.g. SNPs) profiled in the same patients. These results have been widely applied: For example, the majority of inherited cancer risk variants implicated by GWAS are in non-coding likely-regulatory regions of the genome. Thus, to identify genes regulated by these variants, eQTLs identified from tumors are typically interrogated—facilitating rational functional follow-up studies. However, bulk tumor gene expression data reflect cancer and tumor-infiltrating normal cells; thus, tumor eQTLs could arise from cancer cells, normal cells, or both. We have developed an approach, which by modeling tumor purity, can identify high-confidence cancer eQTLs from mixture tumor gene expression. We investigated the eQTL profiles of cancer risk variants identified by breast cancer GWAS. Only about one-third of breast cancer risk variants identified as eQTLs from an uncorrected analysis of bulk tumor expression could be confidently attributed to cancer cells, with the remaining variants showing evidence of an effect in cells of the tumor microenvironment. Our approach will be critical for understanding how inherited polymorphisms influence cancer risk, development, and treatment.
Short Abstract: Genomic data become more frequently part of clinical practice. Novel tools and methods are required to transform information from increasingly voluminous genomic databases into actionable data for health care. In the era of precision medicine, the development of high-throughput technologies and electronic health records resulted in a paradigm shift in healthcare. However, the treatment of temporal data still remains a challenge. Temporal models have been proposed for electronic health records, but not genomic data. Frequently temporal genomic data are based on stimulus response studies. A typical query on those data includes searching for temporal effects/ time patterns in genesets. One frequently employed model for temporal data in healthcare is temporal abstraction, a model based on conversion of expression values into an interval-based qualitative representation expressing the amount of change over time. The challenge is to find domain specific mappings to create those representations. We explore the feasibility of modeling change by statistical significance. We propose to use empirical Bayes for DNA microarray data to determine differences in consecutive time points and comparisons across platforms by comparing p-values. For count data we use voom transformations allowing for RNA-seq data analysis. We demonstrate this approach in the framework of our SPOT software.
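The idea in the abstract above of modeling change by statistical significance can be sketched as follows. This is a hedged illustration, not the SPOT software: it assumes that a log fold change and a p-value are already available for each pair of consecutive time points (for example from an empirical Bayes test), and the function name, the labels, and the 0.05 threshold are hypothetical choices.

```python
def abstract_trend(log_fold_changes, p_values, alpha=0.05):
    """Convert consecutive-time-point comparisons into qualitative intervals.

    Comparison i is between time point i and time point i + 1. Each
    comparison is labeled "increasing", "decreasing", or "steady", and
    adjacent comparisons with the same label are merged into one
    interval (start_time_index, end_time_index, label).
    """
    if len(log_fold_changes) != len(p_values):
        raise ValueError("one p-value is required per log fold change")

    labels = []
    for lfc, p in zip(log_fold_changes, p_values):
        if p >= alpha or lfc == 0:
            labels.append("steady")  # no significant change detected
        elif lfc > 0:
            labels.append("increasing")
        else:
            labels.append("decreasing")

    intervals = []
    for step, label in enumerate(labels):
        if intervals and intervals[-1][2] == label:
            # Extend the current interval by one time point.
            start, _, _ = intervals[-1]
            intervals[-1] = (start, step + 1, label)
        else:
            intervals.append((step, step + 1, label))
    return intervals
```

Because the labels depend only on p-values and the sign of the change, the same abstraction can in principle be applied to microarray and RNA-seq results, which is the cross-platform comparison the abstract describes.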
Short Abstract: In recent years, proteomic profiling of cancer cell lines combined with quantification of their response to drugs has proven to be useful for the identification of protein biomarkers of drug sensitivity and resistance. However, given that phosphorylation-based signaling is known to play a major role in determining drug response, phosphoproteomic profiling can provide a different angle on predicting drug sensitivity of cancer cell lines by focusing on their activity landscapes. Here, we profiled the proteomes and phosphoproteomes of 125 cancer cell lines using label-free mass spectrometry to a depth of >10,000 proteins and >55,000 phosphorylation sites (p-sites). We applied a wide range of computational approaches, including elastic net, concordance analysis, etc, to integrate these data with publicly available drug sensitivity measurements, identify proteomic and phosphoproteomic markers of drug response and suggest novel kinase-substrate relationships. The results not only recapitulated known drug-gene/protein interactions, but also suggested novel biomarkers predicting drug responses, which were subsequently validated in vitro, in vivo and on the patient level. These results suggest that, in combination with advanced computation methods, the activity profiling of cell lines has important value in translational research.
Short Abstract: Background: RNA-Sequencing opens new opportunities in personalized medicine and offers unprecedented information about the human transcriptome, but harnessing this information with bioinformatics tools is typically a bottleneck. Little is known about the genetic associations between mRNAs, miRNAs, and other noncoding RNAs in driving cancer pathogenesis, and their potential suitability as drug targets. Results: Here we combined transcriptomic assays of mRNA responses from different time points under continuous or discontinuous drug pressure from three different drug classes on leukemia-specific cell lines in order to identify pathways underlying drug resistance and to understand the mechanisms of action of mono and combined therapeutic applications. An integrative approach was developed to analyze the mRNA sequencing data and link the differentially expressed genes to various biological ontologies and regulatory databases. Next, machine learning approaches, as well as network approaches, will be applied to the drug transcriptomic profiles to build a sequential drug prediction model. Conclusion: Our analysis showed that the discontinuous long drug application in both cell lines has a similar effect to the continuous short drug application. On the other hand, the continuous long drug application led to a remarkable difference in the B-cell receptor signaling pathway rankings for the two BTK inhibitors.
Short Abstract: We tested for associations between polygenic scores derived from three sources (the Genetics of Obesity-related Liver Disease "GOLD" consortium, the UK Biobank "UKB", and the cytokine level GWAS of Aholi-Olli et al. "AO") and obesity- or liver-related outcomes and corresponding laboratory measures in the Michigan Genomics Initiative, by which patient genotypes have been linked to the Michigan EMR. Numerous significant associations are found, especially between polygenic scores and laboratory measures. There is evidence for overfitting, especially in AO, but even in the well-powered UKB study there is not strong evidence for signals below the genome-wide significant cutoff. However, the likelihood ratio tests are not conclusive on either. These association tests include adjustments for prescribed drugs, and we will describe technical challenges related to biased missingness of drugs and treatments as longitudinal covariates. Best practices in these studies should interpret prescription records in a site-specific context, which becomes a challenge for "one size fits all" approaches to such studies. Therefore, we propose to create and then curate instances of two classes of Knowledge Object payloads: one class of objects holding encodings of readily updateable polygenic scores, and another class holding encodings of site-specific data filters for use in incorporating longitudinal prescribing data.
Short Abstract: Genome-wide association studies (GWAS) have made substantial progress in identifying susceptibility loci associated with complex traits and there is emerging evidence that genetics-based targets lead to 28% more launched drugs. However, translating results of GWAS for drug discovery remains challenging. We implemented a pathway-centric approach to analyzing GWAS loci, and demonstrated that global large scale analysis of 1,456 protein interaction pathways on nearly 1,600 GWAS (473 traits) leads to drug targets and translates genetic findings into therapeutic hypotheses for 182 diseases. We validated our genetic pathway-based targets by testing if current drug targets for 97 diseases are enriched in the pathway space for the same indication. Remarkably, 30% of these diseases have significantly more targets in these pathways than expected by chance; the comparable number for GWAS alone (without using pathway analysis) is zero. Overall, this study provides a large-scale pathway analysis of GWAS data and demonstrates how pathway analysis can aid in translation of GWAS data into therapeutic hypotheses for new drug discovery targets and repositioning opportunities for current drugs.
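The abstract above reports that drug targets are enriched in pathway space beyond chance, but does not name the test used. One common way to quantify such enrichment is a one-sided hypergeometric test, sketched here as an assumption rather than as the authors' method. The function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """Return P(X >= k) for a hypergeometric variable X.

    population: number of candidate genes in the background
    successes:  number of background genes that are known drug targets
    draws:      number of genes in the GWAS-derived pathway space
    k:          observed number of drug targets in that pathway space
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total
```

With a background of 10 genes of which 5 are drug targets, observing all 5 targets among 5 pathway genes gives `hypergeom_sf(5, 10, 5, 5)`, which equals 1/252. A real analysis across 97 diseases would also need a multiple-testing correction, which this sketch leaves out.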
Short Abstract: Public multi-omics repositories allow researchers to extract data for integration and analysis in order to discover the molecular bases of diseases and develop effective treatments. However, the diversity of biological systems, the technological limits, the large number of biological variables and the relatively low number of biological samples make the integration and analysis of multi-omics datasets a challenging task. In this work we integrate the data from the TARGET Acute Lymphoblastic Leukemia (ALL) Phase 2 project. In particular, we combine clinical data, gene expression, DNA methylation and copy number variation to find new markers correlated with poor clinical outcome and early bone marrow relapse. From the cohort of 792 patients we identified 80 patients for whom all required multi-omics experimental data are available in the TARGET database. In this group, 36 patients suffered from ALL relapse. For the analysis we designed a rule-based framework. The main advantage of the rule-based approach is that the resulting rule set is not only useful as a classifier but also provides a natural means of understanding which features (and their combinations) influence the outcome. The presented rule-based workflow allows selection of the most important features discriminating between relapse and relapse-free patients, providing prognostic multi-markers for relapse in ALL.
Short Abstract: Cancer cell line panels are widely used for evaluating drug response across diverse tissue types. A growing set of molecular profiling data complements measurements of chemosensitivity, providing novel avenues for response determinant discovery and clinical translation. Accessing and inter-relating data from different sources is essential for evaluating such determinants, but remains challenging. To enable wider access to cell line pharmacogenomic data, we have developed CellMinerCDB (CellMiner Cross-Database, discover.nci.nih.gov/cellminercdb), a web application integrating data from several widely studied cancer cell line panels, including the NCI-60 (NIH), GDSC (Sanger/MGH), and CCLE/CTRP (Broad). Altogether, our database spans over 1300 distinct cell lines, 400 clinically relevant cancer drugs, 20,000 experimental compounds, and molecular profiling data, such as gene/protein expression, DNA copy number, methylation, and mutational status. Cell line and tested drug overlaps allow cross-database validation of genomic and drug data, and CellMinerCDB simplifies this by transparently matching differently named entities between sources. Data exploration is enhanced via annotations that restrict analysis to particular tissue types, as well as pathway and process-based gene annotations that allow biological interpretation of response-predictive features. Descriptions of the data availability, retrieval, and web-application functionalities will be presented, including informative examples of data integration and translational results.
Short Abstract: Hepatocellular carcinoma (HCC) is the fifth most common and the second deadliest cancer in the world. HCC is highly resistant to conventional chemotherapies, and targeted agents only extend the patient's life by a few months. In this context, it is important to identify drugs that have not yet been used in HCC treatment. Off-target effects of drugs targeting liver cancer-specific protein networks can be investigated using systems biology approaches. In this work, a network of the pathways reported to be active in HCC was constructed, comprising 801 nodes and 3,896 edges. DrugBank small-molecule inhibitors having at least one target protein in the network were integrated. Shortest cycles were extracted. Using in silico perturbation (attack) strategies on the target proteins of the drugs and their interactions, drug effectiveness was calculated from changes in the efficiency of the signaling network, changes in the number of feedback cycles, and changes in network functionality, and the drugs were ranked accordingly. The same attack strategy was applied to identify and rank drug combinations. Brigatinib, Regorafenib, Sunitinib, Thalidomide, Pranlukast, Lenvatinib, Chloroquine, Pseudoephedrine, and Amrinone were identified to be validated in vivo at the transcription level of 770 cancer genes in chemosensitive Huh7 and chemoresistant Mahlavu cells.
Short Abstract: Adverse drug reactions (ADRs) are common, unwanted side effects during drug treatment that cause over 2 million injuries, hospitalizations, and deaths across the United States each year. The detection, assessment, understanding, and prevention of ADRs in patients is a priority in pharmacovigilance research. However, most approaches are limited in that they consider the identified adverse reactions as independent, which misses important but sparse side effects and provides minimal evaluation of related or predominant side effects. This leads to low ADR detection for less well represented populations, such as pediatric patients (<18 years old), ultimately widening gaps in our knowledge of ADRs disproportionately affecting these populations. Here, we present a personalized pharmacovigilance approach for detecting ADRs disproportionately affecting pediatric patients. Using the Adverse Event Open Learning Universal Standardization (AEOLUS) resource, containing more than 8 million ADR reports, and the SNOMED-CT clinical concept hierarchy, we provide an approach to detect ADR relationships that is more interpretable, generalizable, and statistically powerful. We show that building the SNOMED-CT hierarchy, using the ‘is_a’ relationship, increases the statistical power for detecting ADRs for pediatric subpopulations. We present a case study in which we find adverse reactions from anti-epileptic drug use disproportionately affecting pediatric patients.
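One way to see why the ‘is_a’ hierarchy described above can add statistical power is that sparse counts for specific reaction terms accumulate at their broader ancestor concepts. The sketch below is an assumption-laden illustration, not the authors' pipeline: the concept names are made-up placeholders rather than real SNOMED-CT identifiers, and each report is assumed to carry a single reaction term so that no report is counted twice at an ancestor.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Collect every concept reachable from term via 'is_a' edges.

    parents maps a concept to the list of its direct 'is_a' parents.
    """
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def roll_up_counts(report_counts, parents):
    """Add each term's report count to the term and all of its ancestors."""
    totals = defaultdict(int)
    for term, count in report_counts.items():
        # Using a set means a concept reachable by two paths is counted once.
        for node in {term} | ancestors(term, parents):
            totals[node] += count
    return dict(totals)
```

For example, with placeholder parents `{"tonic-clonic seizure": ["seizure"], "absence seizure": ["seizure"]}` and counts of 3 and 2 reports, the broader concept "seizure" receives 5 reports, which gives a disproportionality test at that level more data than either child term alone.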
Short Abstract: Joint analysis of drug-induced adverse events (AEs) and disease-phenotypes can potentially identify novel drug-drug, disease-disease, and drug-disease relationships. The latter can potentially be used for identifying drug-disease contraindications or drug repositioning candidates. To test this hypothesis, we extracted the AEs of drugs from the ADReCS database and disease-phenotype associations from the Monarch Initiative. We mapped the AEs and phenotypes to UMLS (Unified Medical Language System) concepts to enable direct comparison of diseases and drugs based on their phenotypic similarities. We ranked drug-disease associations using cosine similarity with term frequency–inverse document frequency. We evaluated the performance of our approach with two sets of data: (a) known drug mechanism of action (MoA) and (b) known drug–disease contraindications. Our phenotype-based computed drug-drug relationships suggest that drugs of the same MoA or chemical class tend to have high phenotypic similarity. In the case of phenotype-based drug-disease associations, we noticed, surprisingly, that several known indications showed high phenotypic similarity. This might suggest that there could be potential drug-induced aggravation of the primary disease condition for which the implicated drug is prescribed. Our methods are freely available as a web application (https://phenorx.research.cchmc.org/).
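The ranking step in this abstract, cosine similarity over TF-IDF weighted concept profiles, can be sketched as follows. This is a minimal illustration, assuming each drug or disease is represented as a list of concept identifiers. The identifiers here are placeholders rather than real UMLS CUIs, the weighting is the plain tf × log(N/df) variant (the abstract does not say which variant was used), and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(profiles):
    """Map each entity name to a sparse {concept: tf-idf weight} vector."""
    n_profiles = len(profiles)
    # Document frequency: in how many profiles does each concept appear?
    doc_freq = Counter()
    for concepts in profiles.values():
        doc_freq.update(set(concepts))
    vectors = {}
    for name, concepts in profiles.items():
        term_freq = Counter(concepts)
        total = len(concepts)
        vectors[name] = {
            c: (term_freq[c] / total) * math.log(n_profiles / doc_freq[c])
            for c in term_freq
        }
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_diseases(drug, diseases, vectors):
    """Order diseases by decreasing phenotypic similarity to the drug."""
    return sorted(diseases,
                  key=lambda d: cosine(vectors[drug], vectors[d]),
                  reverse=True)

# Placeholder profiles: adverse events for drugs, phenotypes for diseases.
profiles = {
    "drug_A": ["C1", "C2", "C3"],
    "drug_B": ["C1", "C2", "C4"],
    "disease_X": ["C2", "C3", "C5"],
    "disease_Y": ["C6", "C7"],
}
vectors = tfidf_vectors(profiles)
print(rank_diseases("drug_A", ["disease_X", "disease_Y"], vectors))
```

Because both AEs and phenotypes are mapped to the same concept space, the same `cosine` call serves for drug-drug, disease-disease, and drug-disease comparisons.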
Short Abstract: RGD (https://rgd.mcw.edu) is a multi-species platform ideally suited for translational research. RGD was designed to allow researchers to easily access a large corpus of data for human and to move from data for human to that for disease models and back again. As such, RGD has established a rich core set of human data and integrated it with data for six other species used as models for human disease. The complete human gene set in RGD has been enhanced with associated annotations for disease, phenotype, pathways, Gene Ontology (GO) and gene-drug interactions. Additionally, RGD has imported the ClinVar variant set from NCBI and associated these variants and their associated phenotype/disease annotations with the corresponding genes. Both validated and predicted miRNA targets have been incorporated from miRGate. To facilitate use of these data, RGD has also developed a suite of innovative tools for data discovery and analysis. These include the OLGA Object List Generator and Analysis tool, the Gene Annotator tool for exploring functional annotations for a list of genes, Interactive pathway diagrams, InterViewer for visualizing protein-protein interactions, Variant Visualizer to investigate ClinVar variants, JBrowse genome browsers, and RGD's disease portals which present consolidated data for twelve disease categories.
Short Abstract: The data commons paradigm aims to accelerate scientific discoveries by facilitating cross-project analyses through harmonization of ingested data curated from a variety of sources. The Gen3 software stack is a suite of open source software for hosting data commons in a secure, scalable platform for applications. Gen3 includes five main services for authentication and authorization, GraphQL based searching, curating submissions against a metadata dictionary, mapping data GUIDs to locations, and an interactive website. The process of building a Gen3 Data Commons and using it requires harmonizing datasets by creating a standardized data dictionary of variable names and using this dictionary for data ingestion and co-analyses. Since all Gen3 Data Commons share a common infrastructure, open-source tools and apps can be developed for analyses that span different datasets, even across different commons. We describe the Gen3 software components in detail and discuss steps required for creating data dictionaries. We then demonstrate our cloud-based workspace for data analysis and visualization, which supports Python and R Jupyter notebooks, ShinyR applications, and Docker/CWL analysis pipelines. These tools represent real use cases as they interoperate with existing Gen3 Data Commons, including the Brain Commons, BloodPAC Commons, Environmental Data Commons, and the NIAID Data Hub.
Short Abstract: Widespread use of antibiotics in clinics and livestock increases the probability that drug resistance develops in bacteria. Bacterial resistance not only renders applied antimicrobial drugs and therapies inefficient, it can also alter the effect of other therapeutic compounds. The latter is referred to as evolutionary drug-drug interactions, cross-resistance or collateral sensitivity. Several data sets describing this phenomenon were recently and independently reported based on laboratory evolution experiments where Escherichia coli was placed under antibiotic stress. Comparative and integrative analyses of these data sets appear to be lacking so far. From cross-treatment studies on evolutionary drug-drug interactions, we compiled a comprehensive collection of how 1014 pairs of antibiotics affect E. coli and discuss implications for the protocol design of rational drug treatment. Further, we numerically extended the integrated compendium by 38% to 1406 drug pairs by critically evaluating the performance of 6 different data imputation methods, including baseline prediction, Hidden-Markov-like, latent factor model, and group factor analysis (GFA). With the GFA method and using background information on the structural similarity of drug pairs in the form of functional connectivity fingerprints of radius 4, we reached an RMSE below 0.5 and an accuracy of over 85% in imputing the effect of unseen drug combinations.
Short Abstract: Drug toxicity is a leading cause of hospital adverse events and injuries to patients, affecting over two million hospital stays annually. Severe toxicity can cause 100,000 deaths per year. A majority of toxic events were caused by the interaction between target proteins and human tissues. While some events were well studied, there are no systematic methods to investigate the cellular mechanism and predict toxicity in general. To solve this issue, we introduced an integrative model that combines multi-omic features for predicting the toxicity of target proteins in 10 human body systems and 45 tissues. By incorporating novel features such as pharmacological pathways and interactions with the cellular regulatory network, we were able to improve the overall performance by 23% and achieve a median AUROC of 0.70. We then applied our models to predict the toxicity of 4,968 proteins in the human druggable genome, and will further validate the results using clinical trials data from AACT. By showing that our toxicity score can clearly differentiate clinical trials that failed for toxicity reasons from those that succeeded, we aim to demonstrate the promising application of our results in drug development and pharmacoepidemiology, as well as in deciphering the mechanism of toxicity in chemical biology.
Short Abstract: Pediatric cancers are often driven by master transcription factors that are difficult to drug, such as MYCN, which is frequently amplified in neuroblastoma. Genome-wide CRISPR-Cas9 dependency screening in large compendia of tumor lineages offers a powerful therapeutic avenue. A significant computational difficulty in the identification of druggable lineage dependencies is that dependencies are usually shared among tumors from different lineages, and they are described by a combinatorial essentiality effect of genes involved in several functional mechanisms. Here we show that these challenges can be addressed through data deconvolution and integrative genomic strategies. We applied Independent Component Analysis to a genome-scale CRISPR-Cas9 screen of MYCN-amplified neuroblastoma cell lines and found a preferential dependency on the polycomb repressive complex 2 genes EZH2, EED and SUZ12. Transcriptomic and epigenetic analysis of neuroblastoma cell line and tumor data revealed that MYCN upregulates EZH2, leading to inactivation of a tumor suppressor program, and that EZH2 represses neuronal differentiation. Further experiments focused on EZH2 as a druggable target because several small-molecule EZH2 inhibitors are already in clinical trials. In summary, our study highlights data deconvolution and integrative genomic strategies as effective exploratory methods for the identification of novel druggable dependencies and supports testing EZH2 inhibition in patients with MYCN-amplified neuroblastoma.
Short Abstract: Targeted genomic profiling assays allow assessment of one to hundreds of cancer-related genes simultaneously. Effective targeted assays focus on somatic mutations in genes that have established relevance in cancer. The patterns of somatic mutations in cancer genes are highly characteristic and non-random. When designing oncology research assays, cancer genomic databases provide evidence for characterizing somatic mutation patterns, and help define significantly mutated regions. However, a robust process to collect, catalogue, and analyze the data is required to maintain the value of derived knowledge with growing datasets. We present here a streamlined process for standardizing and analyzing publicly available genomic cancer data. The supervised process used the recent data from COSMIC, and a comprehensive quality control process validated each build, ensuring data integrity. The most recent build sourced v83, which contained over 16,000 whole-exome and 181,000 targeted sequencing samples across 22 standardized cancer types. Our process defined ~5,000 recurrent hotspot mutations across and within specific cancer types and identified 709 candidate cancer genes significantly enriched (q < 0.001; frequency > 10%) in either hotspot mutations or deleterious mutations. The implemented genomic data standardization process supports a sustainable approach to the development of oncology-focused genetic analysis solutions, including cancer research gene panels.
Short Abstract: Combination therapies offer the potential of targeting multiple genetic aberrations at once to tackle tumor subclonal populations or overcome resistance mechanisms. We provide a computational framework for assessing single agents individually, as well as in combination, and their interactions with patient genetic alterations. Investigating the association between aberrational pathways and drug response in de novo acute myeloid leukemia patient samples from the Beat AML Consortium reveals many significant pathway-level associations with drug sensitivity or resistance. We note these are driven by mutations in a spectrum of genes within the pathway, and therefore are potentially missed when considering only single gene interactions with drug response. To further understand how these intrinsic mutational perturbations result in drug sensitivity or resistance, we used a probabilistic graphical modeling framework to model pathway impact. Complementary to this, we model extrinsic drug perturbations on these pathways using quantitative drug-target binding information from the Cancer Targetome to model impact downstream. We will discuss our progress toward a unified framework of intrinsic and extrinsic perturbation modeling for rigorous in silico hypothesis generation and testing to facilitate future drug combination screening recommendations.
Short Abstract: Motivation: Clustering analysis has long been used to find underlying structures in different omics data such as gene expression profiles. Such data typically have a high number of dimensions and have been used successfully to find co-expressed genes in samples that share similar molecular and clinical characteristics. Nevertheless, the clustering results are highly dependent on the features used and the number of clusters considered, while the partition obtained does not guarantee clinically relevant findings. Methods: We propose a multi-objective optimization algorithm for disease subtype discovery based on a non-dominated sorting genetic algorithm. Our proposed framework combines the advantages of clustering algorithms for grouping heterogeneous omics data and the searching properties of genetic algorithms for feature selection and determination of the optimal number of clusters, to find features that maximize the survival difference between subtypes while keeping cluster consistency high. Results: Two breast cancer datasets were each divided into a training set and a testing set to test our model. In both cases our method identified clinically relevant sub-groups in the training sets (log-rank test p = 0 and 0.0004). The features obtained were used to create nearest-centroid classifiers, which were tested on the test sets and showed significant survival differences between groups (log-rank test p = 1.22E-15 and 0.028).
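The core of a non-dominated sorting genetic algorithm is the step that ranks candidate solutions into Pareto fronts. The sketch below shows only that step, assuming each candidate (a feature subset plus a cluster number) has already been scored on two objectives to be maximized, survival difference and cluster consistency. The scores are invented for illustration, and `dominates` and `non_dominated_sort` are hypothetical names, not the authors' code.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))

def non_dominated_sort(scores):
    """Group candidate indices into successive Pareto fronts (best front first)."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        # A candidate is in the current front if nothing left dominates it.
        front = sorted(
            i for i in remaining
            if not any(dominates(scores[j], scores[i])
                       for j in remaining if j != i)
        )
        fronts.append(front)
        remaining -= set(front)
    return fronts

# Invented (survival difference, cluster consistency) scores for five candidates.
scores = [(0.9, 0.2), (0.5, 0.5), (0.2, 0.9), (0.4, 0.4), (0.1, 0.1)]
print(non_dominated_sort(scores))
```

Candidates in the first front represent different trade-offs between the two objectives, none strictly better than another; a genetic algorithm of this family would favor them when selecting parents for the next generation.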
Short Abstract: Pan-cancer modeling approaches lead to an integrated picture of commonalities among various tumor types, whereas tissue-specific studies lead to insights solely based on a single tumor type. In this study, we aim to systematically model these two extremes of the comparison spectrum for drug sensitivity prediction in cancer cell lines. We used publicly available human cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE) and Genomics of Drug Sensitivity in Cancer (GDSC) projects in pan-cancer and cancer-type specific settings to understand which of the two extremes yields better predictive power. We compared several linear regression methods, and ridge regression performed well overall. Since in the pan-cancer setting there is an additional effect of drug for various tissue types along with the confounding effect of sample size, pan-cancer predictions outperformed cancer-type specific results for CCLE. GDSC results, however, showed improved performance for targeted drugs in the cancer-type specific setting for certain tissue types. Predictive accuracy for EGFR, ERK, MEK and AKT inhibitors was significantly boosted for GDSC breast cancer cell lines in the cancer-type specific setting. Results from this study show the pan-cancer approach overshadowing tissue-specific signals, and thus we advocate an integration of both modeling approaches for improved drug response prediction accuracy.
Short Abstract: Next-generation sequencing studies can be used for the detection of various DNA modifications, such as single nucleotide variations, on a large scale. Formalin-fixed paraffin-embedded (FFPE) tissues are one of the most abundant sources of clinical specimens, and this method of preparation is known to degrade DNA. Here, we generated 42 whole-exome sequencing data sets from matched fresh-frozen (FF)/FFPE pairs. The samples contain human normal and tumor tissues from two different organs (liver and colon). Coverage analysis shows that FFPE samples have poorer quality indicators than FF samples, but the coverage quality remains globally above the usual thresholds. We detect limited but significant variations in coverage between the three FFPE DNA extraction kits tested. Variant analysis shows a high rate of concordant calls between matched FF/FFPE pairs. We detect a limited but significant variation in the number of variants between FF and FFPE samples for the three different FFPE DNA extraction kits. Taken together, our results confirm the potential of FFPE samples for clinical genomic studies, but also indicate that the choice of an FFPE DNA extraction kit should be made with careful testing and analysis beforehand.
Short Abstract: Inflammatory dacryoadenitis (ID) is an important cause of morbidity and may be associated with inflammatory disorders [e.g. sarcoidosis, IgG4-related disease (IgG4RD)]. Advances in the understanding of non-coding RNA [microRNA (miRNA), small nuclear RNA (snRNA), and small nucleolar RNA (snoRNA)] have revealed their importance in transcriptional regulation and pathogenesis of disease. We investigated the role of non-coding RNA levels in a pilot study using samples of ID [sarcoidosis, idiopathic chronic dacryoadenitis (ICD), IgG4RD and sclerosing dacryoadenitis (SD)].
Short Abstract: Pancreatic cancer is an aggressive cancer with a very poor prognosis, for which chemotherapy remains the mainstream treatment option. However, only a small subset of pancreatic cancers respond well to chemotherapy. Better knowledge of the molecular mechanisms contributing to drug resistance is imperative to improve patient prognosis. To identify genes modulating the impact of platinum, we performed a genome-wide CRISPR screen in the pancreatic cell line MIA PaCa-2, as well as in a peritoneal carcinomatosis model in SCID mice. We ranked genes according to their sensitizing impact and identified genes whose deletion modified the impact of oxaliplatin or cisplatin, as well as genes shared across both treatments, with an FDR < 0.2 cutoff using the edgeR methodology. Genes involved in DNA damage response, cell cycle regulation, and DNA repair are significantly enriched among the hits identified. In addition, some hits are known platinum sensitizers, which validates our systematic functional screening approach. We are currently in the process of validating these hits in additional pancreatic cell lines and in the in vivo system. Our study yields important mechanistic insights into platinum resistance in pancreatic cancer cells and allows us to nominate new treatment targets for pancreatic carcinomas.
Short Abstract: A computational approach for drug target identification in the fields of oncology and immunology by mining gene expression data was developed. The idea is to focus on certain cell types that play a pathogenic role in cancer as well as in autoimmune diseases. First, differential gene expression analysis is applied to transcriptomics datasets that are relevant for the diseases and cell types of interest. Then, the differentially expressed genes are used to derive a list of candidate targets for that cell type. These candidate genes are tested for association with disease-driving pathways in reference expression data from various cancer types and immune-related diseases. A strong association increases the confidence that a candidate gene plays a pathogenic role in the disease. Several other criteria are taken into account as well, including the expression of candidate genes across relevant normal tissues and gene ontology annotations. As proof of concept, we applied the described method to identify targets in tumor-associated fibroblasts and in immune-activated fibroblasts in rheumatoid arthritis (RA). We focus on this cell type due to its common pathogenic role in both cancer and RA, but the concept is generally applicable to further tissues, cell types, and diseases.
Short Abstract: Understanding the relationship between individuals' social networks and health could help reduce the incidence of unhealthy behaviors, and by complementing biological -omics data analyses, it could help improve patient-specific precision medicine. We therefore analyze individuals' mobile sensor-based longitudinal social network (SMS) data, Fitbit-based longitudinal health-related behavioral (physical activity) data, and trait (personality, depression, and anxiety) data, all collected by the NetHealth study. We examine trait differences between individuals whose social network positions (centralities) or Fitbit physical activities change over time versus time-stable individuals, as well as individuals whose centralities and physical activities co-evolve (correlate over time) versus those with no co-evolution relationship. We find that individuals whose centralities change with time do not show any trait difference compared to time-stable individuals. However, if out of the centrality-changing individuals we focus on those whose physical activities also change with time, then these are more introverted than the time-stable individuals. Moreover, individuals whose centralities and physical activities both change with time and whose centralities co-evolve with physical activities are more anxious compared to individuals who are time-stable and do not have a co-evolution relationship. Hence, our study reveals several links between individuals' social network structure, health-related behaviors, and other traits.
Short Abstract: This study presents the findings from a population-level computational investigation of disease-environment relationships. We examined the health insurance claims database of over 150 million unique US individuals to obtain the frequency of the clinical appearances of six mental conditions: bipolar disorder, schizophrenia, personality disorder, Parkinson’s disease, epilepsy, and major depressive disorder. The dataset included time-stamped patient treatment episodes during the period of 2003-2013, with individual patient diagnoses defined by the corresponding ICD-9 code. We used mixed-effects regression, modeling disease counts per age and sex group with the environmental exposures measured at the U.S. county level. The environmental factors included U.S. county-level qualities of the air, land, water, built, and weather environments, along with sociodemographic characteristics. The computational investigation suggests that (a) the spatial prevalence of the six mental conditions differs remarkably across the U.S., (b) air and land pollution are important environmental predictors of the frequency of the clinical appearance of bipolar disorder and major depression, and (c) psychiatric disorders should be an important component of efforts to quantify and understand the effects of chronic pollutant exposures. These links point to the potential systemic effects of air and land pollution that might affect the brain.
Short Abstract: The NCATS Pharmaceutical Collection (NPC) consists of approximately 3000 approved and investigational small molecule drugs for clinical use in humans or animals. To date, we have concentration-response data available from high throughput screening of NPC libraries against a large panel of cell-based assays. Computational models were built using these in vitro activity data along with structural features of the compounds to predict potential targets and/or therapeutic indications of drugs across a broad array of human diseases. The machine learning classification algorithms Naïve Bayes, Random Forest, Support Vector Machines, and Extreme Gradient Boosting were used to generate predictive models from these datasets for three different targets, namely Cytochrome P450 3A4 (CYP3A4), Estrogen Receptor 1 (ESR1), and Adrenoceptor Alpha 1A (ADRA1A). The established models were validated using an internal test dataset. The results showed improved predictive performance for the models using the combined datasets of in vitro activity and structural features. Thus, the proposed studies will help to discover novel therapeutic uses of approved drugs for repurposing.
Short Abstract: Background: After years of research, the cause of Crohn’s disease (CD) remains unknown. Its accurate diagnosis, however, can help in managing the disease and preventing its onset. Whole exome sequencing (WES) provides a new way of evaluating CD predisposition and can help identify new disease genes and pathways. Method and Results: We developed AVA,Dx (Analysis of Variation for Association with Disease), a machine learning-based method that uses WES data alone to highlight CD genes and predict individual CD status. AVA,Dx first predicts changes in the function of genes due to individual-specific genetic variation. Then, it maps the resulting gene-function vectors to individual CD status. In testing, AVA,Dx differentiated three quarters of the CD patients from healthy controls with 71% precision. Importantly, we were able to account for batch effects, enabling accurate prediction of CD status for individuals from a separately sequenced cohort. Furthermore, some of the genes selected by our method as relevant to CD were not previously identified, but were significantly enriched in some known CD pathways. Conclusions: AVA,Dx highlights new CD genes and pathways and accurately predicts CD status. Note that using AVA,Dx techniques may help improve our understanding of other complex diseases in the future.