View Posters By Category
Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Interaction between peptides and major histocompatibility complex class I (MHC-I) molecules plays an important role in human adaptive immune responses. Identification of peptides binding to a specific MHC-I molecule is an essential step for cancer vaccine design and immunotherapy. Several machine learning methods exist to predict binding affinity between peptides and MHC-I molecules. In this study, we propose to use a Generative Adversarial Network (GAN) in conjunction with a convolutional neural network (CNN) to predict peptide binding affinity to MHC-I molecules. GANs are a class of neural networks used to generate synthetic samples that follow the distribution of the original data; they have been used in image generation and small-molecule drug discovery. In our framework, we trained a GAN that simulates peptides based on their binding affinities to specific MHC-I molecules. Both real and simulated peptides are then fed into a CNN that trains a classifier to distinguish positive from negative peptides. We applied our method to several categories of the weekly automated benchmark datasets from the Immune Epitope Database (IEDB) and demonstrated its advantage over several methods such as sNebula and HLA-CNN.
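Peptide sequences must be encoded numerically before a CNN can consume them. One common choice (assumed here for illustration; the abstract does not specify the encoding) is a one-hot grid over the 20 standard amino acids, zero-padded to a fixed length:

```python
# Hypothetical sketch: one-hot encode a peptide as a (max_len x 20) grid,
# a typical input format for a CNN over MHC-I binding peptides.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(peptide, max_len=11):
    """Encode a peptide as a max_len x 20 binary matrix, zero-padded at the end."""
    grid = [[0.0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(peptide[:max_len]):
        grid[pos][AA_INDEX[aa]] = 1.0
    return grid

enc = one_hot("SIINFEKL")  # an 8-mer peptide, padded to 11 rows
```

Real pipelines often use richer encodings (e.g., BLOSUM rows), but the shape of the input is the same.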
Short Abstract: With the development of next-generation sequencing techniques, it is fast and cheap to determine protein sequences, but relatively slow and expensive to extract useful information from them because of the limitations of traditional biological experimental techniques. Protein function prediction has been a long-standing challenge to fill the gap between the huge number of protein sequences and their known functions. In this paper, we propose a novel method that converts the protein function prediction problem into a language translation problem, from a newly proposed protein sequence language, "ProLan", to a protein function language, "GOLan", and we build a neural machine translation model based on recurrent neural networks to translate "ProLan" into "GOLan". We blindly tested our method by participating in the third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluated its performance on selected proteins whose functions were released after the CAFA competition. The good performance on the training and testing datasets demonstrates that our proposed method is a promising direction for protein function prediction. In summary, we are the first to cast protein function prediction as a language translation problem and to apply a neural machine translation model to it.
Short Abstract: Plant specialized metabolites are highly diverse, important for plant interactions with environmental factors, and of significant agronomic and pharmaceutical value. Despite their importance, the genes responsible for their production remain largely unidentified. Because the ancestors of some specialized metabolism (SM) genes originally played roles in general metabolism (GM), it is challenging to classify metabolic genes as SM or GM. In addition, many SM pathways are incomplete due to the high variation of specialized metabolites and their restriction to specific taxa. Using tomato, Solanum lycopersicum, as a model, we use machine learning to distinguish tomato SM and GM genes globally as well as within different pathways. Our approach uses benchmark SM and GM genes from TomatoCyc to assess which sequence, transcriptomic, or evolutionary features distinguish SM from GM genes, and SM/GM genes in the same pathway from genes in other pathways. The features are integrated using machine learning methods to generate two prediction models. The first model classifies unannotated enzymatic genes as SM or GM genes. The second model predicts the specific pathway(s) an SM/GM gene belongs to. Our study provides a comprehensive evaluation of features that can distinguish SM/GM genes and an effective model for predicting SM genes for further experimental studies.
Short Abstract: Interpretation of electroencephalograms (EEGs) is a process that depends on the subjective analysis of the examiner and has low interrater agreement. As a first step of EEG analysis, neurologists categorize the signals as normal or abnormal. In this investigation, we explore the hypothesis that high-performance automatic classification of an EEG signal as abnormal can approach human performance by examining only the first few minutes of an EEG recording. This study establishes a baseline for automated classification of abnormal adult EEGs using machine learning and a big data resource, the TUH EEG Corpus. A demographically balanced subset of the corpus was used to evaluate the performance of the systems. The data were partitioned into training (1,387 normal and 1,398 abnormal files) and evaluation (150 normal and 130 abnormal files) sets. We compared the performance of several well-established technologies: hidden Markov models (HMMs) (26.1% error rate), an HMM with a stacked denoising autoencoder (HMM-SdA) (24.6%), and a deep learning system that combined a convolutional neural network with a multilayer perceptron (CNN-MLP) (21.2%). We have established an experimental paradigm that can be used to explore this application and have demonstrated a promising baseline using state-of-the-art deep learning technology.
Short Abstract: Zika virus (ZIKV) was first isolated in the 1950s, but health authorities and scientists neglected it for several decades. The recent global epidemic declared by the World Health Organization in 2016, and the link between infection and cases of microcephaly and Guillain-Barré syndrome, put the virus in evidence. To develop treatments and preventive measures, it is necessary to understand the viral molecular mechanisms. Experimental approaches present technical difficulties and high costs when applied at large scale, making in silico methods an important tool to aid traditional approaches. This project uses machine learning techniques to identify host-virus protein interactions in order to increase the molecular understanding of ZIKV infection in hosts. Based on data from protein interactions between several viruses and hosts, sets of positive and negative interactions were created. These sets were used to train the machine learning algorithms. The physicochemical characteristics of the proteins were extracted and normalized to create feature vectors for pattern comparison. The trained algorithms are being used to predict protein interactions between ZIKV and its hosts. These protein-protein interactions will be used to study the mechanisms of viral infection and pathology. The topology of the protein interaction network will then be generated for each analyzed host.
Short Abstract: Automated seizure detection using clinical electroencephalograms (EEGs) is a challenging machine learning problem due to several factors such as low signal-to-noise ratios, signal artifacts, and benign variants. Commercially available seizure detection systems suffer from unacceptably high false alarm rates. Deep learning algorithms, like convolutional neural networks (CNNs), have not previously been effective due to the lack of big data resources. A significant big data resource, known as the TUH EEG Corpus, has recently become available for EEG interpretation, creating a unique opportunity to advance the technology using CNNs. The depth of a CNN is of crucial importance: state-of-the-art results can be achieved by exploiting very deep models, but very deep models are prone to degradation in generalization performance and suffer from convergence problems. In this study, a deep residual learning framework is introduced that mitigates these problems by reformulating the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. This architecture delivers 30% sensitivity at 16 false alarms per 24 hours, and it enables the design of deeper networks that are easier to optimize and can achieve better performance than the prior state of the art.
Short Abstract: Integration of heterogeneous genomic data can provide crucial information regarding the regulation of condition-specific gene expression. DNA affinity purification sequencing (DAP-seq) is a novel approach to detect protein-DNA interactions at genome scale in species where ChIP-seq is infeasible. We collected an Arabidopsis DAP-seq dataset and 89 Arabidopsis expression datasets. More than 1.8 million protein-DNA interactions were identified from the DAP-seq data. A machine learning-based approach was developed to infer condition-specific regulatory interactions by integrating the DAP-seq and expression data. We formulated inference as a problem of binary classification followed by feature selection. We tested multiple machine learning classifiers, including logistic regression with different regularization methods, an L1-penalized linear support vector machine, and a guided regularized random forest. Most of the classifiers showed similarly high performance, with area under the curve (AUC) values ranging from 0.75 to 0.85. Logistic regression with L1 regularization produced the sparsest model and the most stable set of features. We found that the choice of negative training set can change classifier performance by more than 0.2 in AUC. Compared to existing tools, our approach does not require time-series expression data, which is a great convenience when expression data are limited.
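The AUC values reported above have a simple rank-based interpretation: the probability that a randomly chosen positive example is scored above a randomly chosen negative one (the Mann-Whitney formulation). A minimal sketch, not the authors' code:

```python
def auc(pos_scores, neg_scores):
    """Mann-Whitney estimate of the area under the ROC curve:
    the fraction of positive/negative pairs ranked correctly (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# A classifier scoring all positives above all negatives reaches AUC = 1.0.
perfect = auc([0.9, 0.8], [0.2, 0.1])
```

This O(n^2) pairwise form is for clarity; a sort-based version is used in practice on large datasets.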
Short Abstract: Polycystic ovary syndrome (PCOS) is a common endocrine disorder that affects up to 20% of women; however, diagnosis is commonly unreliable and non-quantitative. Here we use supervised machine learning and measurements of 51 cytokines from a large cohort of patients to identify a low-dimensional set of potential biomarkers for diagnosis of PCOS. Both whole blood and individual follicular fluid (FF) aspirates were collected from women during oocyte retrieval for intracytoplasmic sperm injection with in vitro fertilization (ICSI/IVF) and linked with each patient's PCOS status as diagnosed by the Rotterdam criteria (n = 69 PCOS, n = 222 non-PCOS). We trained a binary support vector machine (SVM) using a random subset of the patient data to determine a cytokine profile associated with PCOS. Our resulting model includes 3 variables and is 76% accurate. This provides insight into the immunological basis of PCOS and may define a potential non-invasive, quantitative strategy for diagnosis.
Short Abstract: Transcription factors bind combinatorially to the regulatory elements that control gene expression and cell identity. However, only a fraction of these factors has been assayed genome-wide, which has also made high-throughput analysis of transcription factor co-binding extremely difficult. Furthermore, many analyses across cell lines have been limited to histone modification assays. Here we describe an analysis of 208 ChIP-seq data sets produced by the ENCODE Consortium in a single cell type, HepG2, representing 22% of all known transcription factors expressed in HepG2, using a self-organizing map (SOM) with the SOMatic package. This analysis generated 196 distinct transcription factor co-binding profiles through metaclustering the SOM units, which we then analyzed using decision trees to determine the unique features of each profile. We are now extending this analysis to incorporate all ChIP-seq data generated by ENCODE. These results highlight a framework for future analyses, building toward a more complete picture of gene regulation.
Short Abstract: Training of deep learning algorithms is highly dependent on the order of training samples, and various forms of curriculum learning have been proposed to reduce the sensitivity of the training process. The general concept behind curriculum learning is to use easy samples first and gradually introduce more complex samples; identifying the difficulty level of samples is a major challenge. We propose a new data selection strategy: use a less sensitive algorithm that excels at automatic segmentation to triage samples, rank the data based on the posteriors generated in this first pass, and then train a more complex deep learning system using this derived ordering of the data. We use a hybrid hidden Markov model / stacked denoising autoencoder system for the first pass, and a more powerful system based on a convolutional neural network and a long short-term memory network for the second pass. We demonstrate this strategy on a seizure detection task based on the TUH EEG Seizure Corpus. Our system produced a sensitivity of 32.13% at 10 false alarms per 24 hours, which is very close to our overall best performance, yet is a robust process that can be easily applied to new tasks.
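The core of the data-selection step, ranking samples by their first-pass posteriors and presenting the most confidently handled ones first, can be sketched in a few lines (a simplification; the actual pipeline derives the posteriors from the HMM/SdA first pass):

```python
def curriculum_order(samples, posteriors):
    """Order training samples easy-first: a higher first-pass posterior is
    treated as an easier, more confidently classified sample."""
    ranked = sorted(zip(posteriors, samples), reverse=True)
    return [sample for _, sample in ranked]

# Hypothetical example: three EEG segments with invented first-pass posteriors.
order = curriculum_order(["seg_a", "seg_b", "seg_c"], [0.55, 0.97, 0.80])
```

The second-pass network is then trained on the samples in this derived order, optionally introducing the hardest ones only in later epochs.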
Short Abstract: O-glycosylation is a highly complex process involving numerous proteins and biological functions in humans. Changes in the mRNA expression levels of proteins involved in O-glycosylation have been associated with various cancers, diabetes, and developmental processes. There are at least 20 isoforms of polypeptide GalNAc-transferases (ppGalNAc-Ts) that catalyze the addition of GalNAc to proteins. There is a large amount of redundancy among the site preferences of the various ppGalNAc-Ts, but each also has sites specific to it. ISOGlyP (Isoform-Specific O-Glycosylation Prediction) is a user-friendly, web-based O-glycosylation prediction resource (ISOGlyP.utep.edu) that accounts for the differences in amino acid patterns recognized by 11 different ppGalNAc-Ts, using small random peptides to determine specific amino acid preferences. This work explains how additional protein features not captured in the small-peptide studies, such as structural disorder and surface accessibility, increase the accuracy of various predictive models by incorporating data mining techniques into the core ISOGlyP program. ISOGlyP now allows a comparative analysis of glycosylation between two different transferases to identify sites unique to a given ppGalNAc-T. This can lead to a deeper understanding of the effects of ppGalNAc-Ts in disease.
Short Abstract: Horizontal gene transfer (HGT) has been much less studied in eukaryotes than in prokaryotes, partly because HGT is relatively less frequent in eukaryotes, but mainly because means and tools for systematic and easy identification of horizontally transferred (HT) genes in eukaryotes were not available until recently. In this study, we developed a machine learning-based method to detect inter-kingdom HT genes. We used quantitative measures derived from the output of protein BLAST (BLASTP) searches against the non-redundant (nr) protein database. Our model shows a 91% true positive rate and a 0% false positive rate on test data derived from phylogenetically validated HT genes. As a pilot project, we focus on the fungal kingdom, in which many species share natural habitats with other species, providing opportunities for HGT to occur. In total, we have identified 838 HT genes from six fungal genomes so far, and the data are deposited in a database. Our program will be extended to other eukaryotes, and the results will serve as a platform for cross-species HGT studies and for tracing evolutionary marks in multiple domains of life.
Short Abstract: The human genome mutation rate is estimated to be 1.1*10^-8 per site per generation [1], leading to a variant approximately every 300 base pairs [2]. Learning about a particular variant's origin and distribution is often a goal of genomics research, particularly for variants identified to be under natural selection or for variants that are representative of different populations [3-5]. Generally, research in this area relies heavily upon carefully annotated collections of genotypes and biological observations in multiple populations, along with time-intensive historical analysis. In our study, we employ Ancestry's millions of consented-to-research genotypes and historical family records to broadly assess the populations in which a variant of interest may have arisen and propagated. Our method utilizes an identity-by-descent network, along with a statistical analysis of birth locations in historical family records [6]. Here, we demonstrate this method and assess its performance on multiple loci of varying frequency across diverse populations. This type of data-driven approach can provide insights into the origins, migration patterns, and historical and contemporary geographic locations of populations carrying any variant of interest. Citations: [1] PMID:20220176, [2] PMID:18197193, [3] PMID:26432245, [4] PMID:18252222, [5] PMID:24630847, [6] PMID:28169989
Short Abstract: Clustering nucleotide sequences is a fundamental step in analyzing biological data. Most widely used software tools for sequence clustering employ approaches that may not guarantee optimal clusters. Further, these tools rely on a single parameter that determines the similarity among sequences in a cluster. Often, the similarity between sequences within a cluster is unknown, so the clusters produced by these tools may not match the actual clusters. To overcome this limitation, we adapted the mean shift algorithm to cluster nucleotide sequences. Mean shift is an established unsupervised machine learning algorithm that has been applied thousands of times in image processing and computer vision. Unlike previous approaches, the mean shift algorithm is guaranteed to converge to the modes, i.e., the cluster centers can be found. MeShClust is the first application of the mean shift algorithm to clustering DNA sequences, and it is one of a handful of applications of mean shift in bioinformatics. Another novel contribution of MeShClust is its use of supervised machine learning to predict sequence similarity using a combination of alignment-free and alignment-based methods. Overall, MeShClust produces high-quality clusters even when the clustering similarity parameter is only an estimate of the actual similarity.
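In one dimension the mean shift iteration is simple enough to sketch in a few lines: each point is repeatedly moved to the mean of its neighbors until it settles on a mode, and points sharing a mode form one cluster. This toy version uses a flat kernel; MeShClust itself operates on sequence-similarity estimates and is considerably more involved:

```python
def mean_shift_1d(points, bandwidth=2.0, iters=50):
    """Shift each point to the mean of its neighbors within the bandwidth
    (flat kernel) until it converges to a mode of the data density."""
    modes = []
    for x in points:
        for _ in range(iters):
            neighbors = [p for p in points if abs(p - x) <= bandwidth]
            new_x = sum(neighbors) / len(neighbors)
            if abs(new_x - x) < 1e-6:
                break
            x = new_x
        modes.append(round(x, 3))
    return modes

# Two well-separated groups converge to two distinct modes.
modes = mean_shift_1d([1.0, 1.5, 2.0, 10.0, 10.5, 11.0])
```

Points whose modes coincide (up to a tolerance) are assigned to the same cluster, so the number of clusters falls out of the data rather than being fixed in advance.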
Short Abstract: Polypharmacy, the use of drug combinations, is common in treating patients with complex or co-existing conditions. However, a major consequence of polypharmacy is a high risk of adverse side effects, which emerge because of drug-drug interactions, in which the activity of one drug changes if it is taken with another drug. We present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein target interactions, and polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon develops a graph convolutional neural network for multirelational link prediction designed to handle such multimodal graphs with a large number of edge types. Decagon predicts the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. Decagon models side effects with a strong molecular basis particularly well, while on non-molecular side effects it achieves good performance through effective sharing of model parameters across edge types. Decagon creates opportunities to use large pharmacogenomic and patient data to flag polypharmacy side effects for follow-up pharmacological analysis. Project website: http://snap.stanford.edu/decagon.
Short Abstract: MicroRNAs have been frequently reported to play critical roles in cancers and have great potential as diagnostic biomarkers of cancer. During microRNA biogenesis, one strand of the miRNA hairpin precursor is preferentially selected for entry into an RNA-induced silencing complex, while the other strand is typically thought to be degraded. When the strand preference changes, the phenomenon is called arm switching. This study aimed to determine and characterize the extent of arm switching in the pan-cancer datasets available from The Cancer Genome Atlas and to discuss its potential application as a biomarker. We investigated arm switching in two comparisons: tumor tissues versus normal adjacent tissues, and tumor tissues from different organs. We detected miRNA arm switching events, and our results showed that pre-miRNA arm switching in ovarian cancer differs significantly from that in other cancers. Finally, we used a support vector machine (SVM) combined with leave-one-out cross-validation (LOOCV) to select the best combination of arm-switched miRNA tumor markers. Survival analysis identified some arm-switching miRNAs associated with patient survival. Our results suggest that miRNA arm switching could serve as a potential biomarker for cancer discovery.
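Arm switching can be called from the relative read counts of the 5p and 3p arms of a pre-miRNA. One simple formulation, given here only as an illustration and not as the study's exact criterion, flags a pre-miRNA when the dominant arm flips between two conditions:

```python
import math

def arm_preference(counts_5p, counts_3p):
    """Log2 ratio of 5p to 3p arm read counts for one pre-miRNA
    (+1 pseudocount avoids division by zero); positive means 5p-dominant."""
    return math.log2((counts_5p + 1) / (counts_3p + 1))

def arm_switched(normal_5p, normal_3p, tumor_5p, tumor_3p):
    """Call arm switching when the dominant arm differs between conditions,
    i.e., the two log-ratios have opposite signs."""
    return arm_preference(normal_5p, normal_3p) * arm_preference(tumor_5p, tumor_3p) < 0

# Invented counts: 5p dominates in normal tissue, 3p dominates in tumor.
switched = arm_switched(900, 100, 50, 800)
```

A real analysis would additionally require the flip to be statistically significant across samples rather than relying on single count pairs.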
Short Abstract: The identification of interactions between drug candidate compounds and target biomolecules is an important step in drug discovery. Within the last decade, computational approaches (i.e., "virtual screening") have been developed with the objective of aiding experimental work by predicting novel drug-target interactions (DTIs) via the construction and application of statistical models. Deep learning algorithms have drawn attention in virtual screening since a deep learning algorithm won Merck's drug discovery challenge. One of the current issues in DTI prediction is feature engineering, where obtaining the most representative compound feature vectors is challenging even with intensive manual work. In this study, our aim was to create a novel virtual screening system using convolutional neural networks (CNNs), widely used in image analysis, that extracts features of compounds from simple images of their skeletal formulas (i.e., 2D drawings). The main advantage of this system is that it reduces the time spent on generating complex compound features, letting the CNN learn complex features inherently from the ready-to-use drawings. With extensive tests of different architectures and hyperparameters, our initial optimal models reached an average accuracy of 0.82, displaying the potential of CNNs for DTI prediction without any feature engineering.
Short Abstract: Since the beginning of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution times. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations of the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding sequences similar to a query sequence. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Additionally, paired statistics including length difference or Earth Mover's distance are among the best performers at finding the K closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, reducing the memory requirement and increasing the speed remarkably. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists identify efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.
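As an illustration of how such statistics operate, here is one of the simplest: cosine similarity between two k-mer histograms. The evaluated statistics include many more elaborate measures, but all share this histogram-based setup:

```python
from collections import Counter
import math

def kmer_histogram(seq, k=3):
    """Count every overlapping k-mer in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def cosine_similarity(h1, h2):
    """Cosine of the angle between two k-mer count vectors:
    1.0 for identical histograms, 0.0 when no k-mer is shared."""
    keys = set(h1) | set(h2)
    dot = sum(h1[w] * h2[w] for w in keys)
    n1 = math.sqrt(sum(v * v for v in h1.values()))
    n2 = math.sqrt(sum(v * v for v in h2.values()))
    return dot / (n1 * n2)

h = kmer_histogram("ACGTACGTACGT")
```

Because the histograms are computed in linear time, such statistics avoid the quadratic cost of alignment entirely.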
Short Abstract: Direct-to-consumer (DTC) genomics tests provide a unique experience for individuals interested in their family history and genetic matches. AncestryDNA estimates customers' distant and recent ancestry. The recent ancestry evaluation is powered by the Genetic Communities™ algorithm, which constructs a large identity-by-descent (IBD) network and clusters individuals if they share a significant amount of genome-wide IBD. With the remarkable popularity of DTC genomics and the continuous expansion of our genotype database, the IBD connections among individuals grow quadratically, leading to a "hairball" effect on the IBD network. To address the computational challenges inherent in a genome-wide IBD network, we propose a novel approach that investigates the significance of the IBD segments within the communities. In this study, we evaluate the impact of this method on the feasibility and accuracy of community assignments. References: https://www.ncbi.nlm.nih.gov/pubmed/28169989; https://www.ancestry.com/cs/dna-help/communities/whitepaper
Short Abstract: We consider the problem of predicting bacterial antibiotic resistance phenotypes from whole-genome sequencing (WGS) data using supervised machine learning. We adopt a systematic approach in which any possible k-mer is a candidate variable. From the statistical learning perspective, the challenge is to identify a small yet predictive set of variables. For this purpose we rely on lasso-penalized logistic regression models, coupled with extensive resampling procedures (stability selection). Starting from a database of 1,307 Mycobacterium tuberculosis genomes publicly available on PATRIC, with resistant/susceptible phenotypes available for 6 antibiotics, we built prediction rules involving 1 to 8 k-mers per antibiotic. Annotating these k-mers revealed that the models recovered mutations within previously known genes. The prediction process amounts solely to detecting the presence of these 23 k-mers, and hence is remarkably easy and fast to carry out (a few minutes starting from sequencing reads, seconds from assembled genomes). The resulting classifiers achieved validation performance comparable to the Mykrobe software. This preliminary study therefore validates the feasibility of this statistical, data-driven approach and paves the way for several extensions aiming to include prior information about population structure, associated resistance phenomena, or quantitative measures of resistance, all of which can naturally be leveraged within this statistical learning framework.
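Since prediction reduces to detecting a handful of k-mers, the deployed rule is essentially a sparse logistic model over k-mer presence. A toy sketch follows; the k-mer and its weight are invented for illustration, whereas the real model's 23 k-mers and coefficients come from the lasso fit described above:

```python
import math

def resistance_probability(genome, kmer_weights, intercept=0.0):
    """Sparse logistic rule: sum the lasso coefficients of the diagnostic
    k-mers present in the genome, then squash through the logistic function."""
    score = intercept + sum(w for kmer, w in kmer_weights.items() if kmer in genome)
    return 1.0 / (1.0 + math.exp(-score))

# Invented example: one resistance-associated k-mer with coefficient 2.0.
weights = {"ACGTTGCA": 2.0}
p_resistant = resistance_probability("TTACGTTGCATT", weights, intercept=-1.0)
```

With so few k-mers, the substring checks dominate the cost, which is why classification takes only seconds on an assembled genome.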
Short Abstract: Target selection is the first and pivotal step in drug discovery. We collected a set of 332 targets that succeeded or failed in phase III clinical trials, and explored whether Omic features from Harmonizome could predict clinical success. We used stringent validation schemes with bootstrapping and modified permutation tests to assess feature robustness and generalizability while accounting for target class selection bias. We also used classifiers to perform multivariate feature selection and found that classifiers with a single feature performed as well in cross-validation as classifiers with more features. Successful targets tended to have lower mean expression and higher expression variance than failed targets. Finding this modest predictive signal led us to ask whether a Deep Learning model could automatically derive transcriptomic features with greater predictive signal. We opted for a transfer learning approach, where we first trained a stacked denoising auto-encoder to learn a low-dimensional encoding of targets’ tissue expression patterns, and then used the learned features to annotate targets. Preliminary results suggest the learned features encode both tissue-specific and non-tissue-specific functional information about targets and are promising predictors of target outcomes in clinical trials, as well as other target properties.
Short Abstract: By integrating knowledge from a variety of high-content omics resources with low-content literature-based knowledge, it is possible to predict gene and protein function with machine learning. Harmonizome-ML is an interactive web-based system that empowers non-programmer users to perform gene and protein knowledge imputation. Users can select from a variety of processed omics datasets and predict almost any biological function, human disease association, pathway membership, or mouse phenotype. Harmonizome-ML first presents the user with a web-based interface, where the user can pick omics datasets to use as the attributes for learning, and a class of knowledge to predict. Users can then select from a variety of machine learning algorithms, their various parameter settings, and the predictor’s performance evaluation methods. Once those options are all entered, Harmonizome-ML generates a Jupyter Notebook, rendered on demand, to provide a complete analysis report of the predictions that were made through the pipeline. The Jupyter Notebook provides transparency, reproducibility, and opportunity to modify the code for customized analyses. Using Harmonizome-ML, investigators can quickly explore machine learning-backed predictions for novel gene function hypotheses to guide their research. Harmonizome-ML is openly available at http://harmonizome.cloud/Harmonizome-ML.
Short Abstract: Epigenomic markers, such as histone modifications, play important roles in development and cell differentiation. Characterizing the determinants and regulatory mechanisms of histone modifications during development and cell differentiation is critical to understanding cell fate determination and identity maintenance. Here, we have developed deep convolutional neural networks (CNNs) for histone modifications in different cell types at different developmental stages, as well as for each histone modification across different cell types. These CNNs outperform previous methods in predicting the histone modifications in each cell type using DNA sequence alone, suggesting that the models have learned the sequence determinants of the specific histone modifications in different cell types. Indeed, we found that the motif models learned in the first convolutional layer explain the unique histone modification patterns in different cells well, and their high similarity to known transcription factor motifs points to the transcription factors possibly involved. By profiling the influence of the learned motif models on the CNNs, we found that comparing the sequence motif models learned by different cell types' models can reveal their lineage relationships.
Short Abstract: Cell protrusion is morphodynamically heterogeneous at the subcellular level. However, the mechanism of cell protrusion has been understood based on the ensemble average of actin regulator dynamics. Here, we establish a computational framework to deconvolve the subcellular heterogeneity of lamellipodial protrusion from live cell imaging. HACKS (deconvolution of Heterogeneous Activity in Coordination of cytosKeleton at a Subcellular level) identifies distinct subcellular phenotypes based on machine learning algorithms and reveals their underlying actin regulator dynamics at the leading edge. Using our method, we discover 'accelerating protrusion', which is driven by previously unknown temporal coordination of Arp2/3 and VASP activities. We validate our finding by drug treatment assays and further identify fine regulation of Arp2/3 and VASP recruitment associated with accelerating protrusion. Our study suggests that HACKS, combined with pharmacological perturbations, can be a powerful tool for discovering drug effects by revealing the susceptible morphodynamic phenotypes and the associated changes in molecular dynamics.
Short Abstract: Precisely coordinated tissue-specific gene expression changes are required during development and differentiation. Developing tissues are dynamic ensembles of cells of different types, and cellular lineage trees capture the underlying cell-type relations via progenitor (more pluripotent) to descendant (more differentiated) relationships. Typically, correlations between cells of different types are disregarded in differential gene expression analysis. Our work uses statistical modeling to take these correlations into account in a quantitative way. To that end, we developed a statistical framework in which gene expression is explicitly modeled by Ornstein-Uhlenbeck processes on a tree topology that connects progenitor cell types to their descendants. Maximum (profile) likelihood-based inference allows us to estimate correlations between different cell types. We study this approach using simulated data before applying it to RNA expression data across hematopoietic differentiation. Overall, on simulated data, our approach outperforms methods that do not take into account correlations between cells of different types. We find that specific cell-type relationships substantially affect differential gene expression analysis, and our method could improve detection of differentially expressed genes. This approach can be generalized to other datasets or data types.
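For reference, the Ornstein-Uhlenbeck process underlying such tree-based models has the following standard textbook form (the notation here is conventional and not necessarily the authors' own parameterization):

```latex
% OU dynamics for the expression level X_t of one gene evolving along
% a branch of the lineage tree: mean-reversion toward optimum \mu with
% strength \theta, driven by Brownian noise W_t of intensity \sigma.
dX_t = -\theta \, (X_t - \mu)\, dt + \sigma \, dW_t
% Under stationarity, the covariance between two cell types i and j
% decays exponentially with the tree distance d_{ij} separating them:
\operatorname{Cov}(X_i, X_j) = \frac{\sigma^2}{2\theta} \, e^{-\theta \, d_{ij}}
```

This covariance structure is what lets the likelihood couple progenitor and descendant cell types instead of treating them as independent samples.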
Short Abstract: Pharmacogenomics is a rapidly developing field that aims to deliver on the promise of precision medicine by guiding pharmacological intervention based on an individual’s unique genotype for drug metabolism. We have developed a sequencing-based pharmacogenomics screening panel that uses targeted capture to perform deep sequencing of 200+ genes critical to drug metabolism. Integral to analysis of the genetic data that are generated is the assessment of copy number variation (CNV). CNVs occur when the number of copies of a particular gene varies from one individual to the next. Identifying this variation can be exceedingly difficult but is key to interpreting genotypes and predicting variant effect in certain genes, such as CYP2D6. Historically, the identification of CNVs in our pipeline has been labor-intensive and difficult. In this study, we present a novel application of an artificial neural network machine learning algorithm to learn the complex patterns associated with CNV data. The result is a trained network that quickly and accurately classifies CNVs according to known categories in the CYP2D6 gene. We show that a simple, one hidden-layer network is sufficient to achieve the extremely high accuracy and low false-positive rate required in a high-throughput clinical setting.
Short Abstract: DNA methylation influences gene expression and cellular identity via inter-dependent epigenetic mechanisms which regulate chromatin accessibility. A common method of sample preservation, termed formalin-fixed paraffin embedding (FFPE), can make high-quality ChIP-seq library preparation very difficult. Recent advances in both whole genome bisulfite sequencing (WGBS) and Illumina Methylation BeadChips can accommodate ultra-low DNA input, including FFPE samples. Uncovering the relationships between epigenetic regulatory mechanisms using WGBS or Illumina MethylationEPIC BeadChips would be useful, in terms of determining transcriptomic activity and clinical outcome. Here we present Methyl2Acetyl, a machine learning framework designed to predict H3K27Ac enrichment using WGBS or Illumina EPIC Array data as input. Using random forest learners to model CpG Island/DNA methylation relationships, we first trained Methyl2Acetyl with WGBS and H3K27Ac ChIP-seq data from pediatric Neuroblastoma tumor samples (n=6), and evaluated its performance using a leave-one-out approach (r^2: 0.83 +/- 0.01). We then applied Methyl2Acetyl to an independent solid-tumor pediatric Rhabdomyosarcoma dataset (n=14). Again, Methyl2Acetyl accurately inferred H3K27Ac enrichment (r^2: 0.74 +/- 0.05). Our results indicate our model is capable of imputing H3K27Ac enrichment between multiple types of solid-tumor cancers, providing a useful tool for cases where sample type or the amount of sample available limits the ability to perform high-quality ChIP-seq analyses.
Short Abstract: Data driven approaches for extracting knowledge from biomedical data are numerous and diverse. This creates a daunting challenge for the biomedical community: determining which method is best to apply to specific experimental results. Even when benchmarks of method performances are readily found for simulation data and curated gold standard data sets, these datasets may not always be applicable for problems on the forefront of translational biomedicine. To limit the scope of this problem, we developed the Method for Optimal Classification by Aggregation, or MOCA, an unsupervised heterogeneous ensemble method for binary classification. MOCA performs a weighted sum over the rank predictions of an ensemble of base classifiers. In such a setting, we prove that the optimal weights are proportional to the ratio of the area under the receiver operating characteristic and the class conditioned variance of predictions. In addition, we show that under the mild criterion of conditionally independent base classifiers these weights can be inferred from unlabeled data. We show that MOCA performs robustly when applied to simulation data and to DREAM collaborative challenge data for network inference and drug synergy prediction. MOCA is a solution for solving a central challenge in computational systems biology.
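A minimal sketch of the weighted rank aggregation described above: each base classifier's predictions are rank-transformed and summed with a weight proportional to its AUC divided by the variance of its predictions. For clarity the toy below computes AUC and variance from known labels, whereas MOCA infers the weights from unlabeled data; all names and data are illustrative, not the published implementation.

```python
def auc(scores, labels):
    """Mann-Whitney AUC: probability a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def moca_ensemble(pred_columns, labels):
    """Weighted sum of per-classifier rank predictions; weight ~ AUC / variance."""
    n = len(labels)
    combined = [0.0] * n
    weights = []
    for scores in pred_columns:
        ranks = [sorted(scores).index(s) / (n - 1) for s in scores]  # ranks in [0, 1]
        w = auc(scores, labels) / max(variance(ranks), 1e-9)
        weights.append(w)
        combined = [c + w * r for c, r in zip(combined, ranks)]
    return combined, weights

labels = [1, 1, 1, 0, 0, 0]
strong = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # ranks the positives perfectly
weak   = [0.6, 0.1, 0.8, 0.7, 0.2, 0.5]   # near-random
combined, weights = moca_ensemble([strong, weak], labels)
```

The stronger base classifier receives the larger weight, so the aggregate ranking is dominated by the more reliable predictor.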
Short Abstract: An increasing number of biological data sets are high dimensional. Examples include SNPs across the entire human genome and extensive text documents from the biomedical literature. In other scenarios, computational tasks require methods to handle not only high-dimensional data, but also those having complex structured representations. Such datasets include annotations represented in concept hierarchies and structured profiles from medical records. Many machine learning methods depend on proper proximity measures between objects. However, it is well-known that popular functions (e.g., the Euclidean distance) are unsuccessful in high-dimensional situations, whereas others lack metric properties (e.g., cosine distance) and thus do not exploit theoretical benefits reserved for distance metrics. We propose a new class of metrics that unify and generalize some existing metrics. We empirically show its mitigation of the hubness effect in high-dimensional datasets. We demonstrate the effectiveness of our metrics on high-dimensional data by an extensive comparison against other functions. We also show that our metrics have intuitive interpretation in measuring functional similarity of proteins using their ontological annotations. Finally, we introduce the concept of functional phylogeny and show that our distance measures are capable of reconstructing a species tree on a reduced set of organisms using only associated Gene Ontology annotations.
Short Abstract: Drug target identification is important for drug discovery and repurposing. However, accurate identification of drug targets remains challenging due to the complexity of interactions associated with drug perturbations. The Targuess application, currently under development, aims to combine 'omics resources in useful ways, to predict drug targets. As part of this work, we have assembled lists of drug-target associations from DrugBank, DrugCentral, TargetCentral, Drug Repurposing Hub, the Drug Gene Interaction Database, Drug Target Commons, and LINCS resources. Surprisingly, drug-target interactions are different across these resources. Benchmarking drug-target interactions with additional data can be used to rank and score the most credible associations and evaluate the quality of each drug-target resource. An interactive web application that will permit users to explore drug-target associations is currently under development. Novel drug-target associations are also predicted using a machine learning framework. The drug-target associations are used as the target class, and resources such as differential gene expression data from the LINCS L1000, ARCHS4, and CREEDS resources are used as the features/attributes. Hence, Targuess provides a web-based platform for the evaluation and prediction of drug targets for orphan small molecules and existing drugs.
Short Abstract: Most regulatory network inference approaches in cancer use expression data to analyze transcription factor (TF) motifs in annotated promoter regions. However, distal enhancers are important for fine-tuning of gene expression. We developed an experimental and computational strategy to incorporate the effect of enhancers on gene regulatory programs across multiple cancers. Our framework, PSIONIC (Patient Specific Inference of Networks Incorporating Chromatin), enables us to selectively share information across tumors and explore similarities and differences in patient-specific inferred TF activities. The impact of TFs on gene regulation has not been well characterized in gynecological and basal breast cancers. Hence, we applied our approach to 723 RNA-seq experiments from gynecological and basal breast cancer tumors as well as 96 cell lines. To integrate regulatory sequence information in our models, we also generated an ATAC-seq data set profiling chromatin accessibility in cell line models of these cancers. Our analysis identified tumor type-specific and common TF regulators of gene expression, as well as predicted dysregulated transcriptional regulators. In vitro assays confirmed that PSIONIC-inferred TF activities were predictive of sensitivity to targeted TF inhibitors. Moreover, many of the identified TF regulators were significantly associated with survival outcome within the tumor type.
Short Abstract: Motivation: Protein solubility plays a vital role in pharmaceutical research and production yield. For a given protein, the extent of its solubility can represent the quality of its function, and is ultimately defined by its sequence. Thus, it is imperative to develop novel, highly accurate in silico sequence-based protein solubility predictors. In this work we propose DeepSol, a novel Deep Learning-based protein solubility predictor. The backbone of our framework is a convolutional neural network that exploits k-mer structure and additional sequence and structural features extracted from the protein sequence. Results: DeepSol outperformed all known sequence-based state-of-the-art solubility prediction methods and attained an accuracy of 0.77 and a Matthews correlation coefficient of 0.55. The superior prediction accuracy of DeepSol allows screening for sequences with enhanced production capacity and more reliable prediction of the solubility of novel proteins. Availability and implementation: DeepSol’s best performing models and results are publicly deposited at https://doi.org/10.5281/zenodo.1162886 (Khurana and Mall, 2018).
Short Abstract: Like many natural phenomena, genetic mutations that cause cancer have a very heavy tail: the frequency of a few driver mutations is very high compared to that of the majority. As a result, infrequent driver mutations make it difficult to distinguish mutated genetic pathways that cause cancer progression. Despite the success of computational methods to understand the underlying structure of somatic mutations based on the Dirichlet Process (DP), frequency counts, or factor analysis, they have limitations in capturing datasets with power law behavior. Generalizations such as the Pitman-Yor process or placing sparse priors on factor analysis have been used to model power law behavior, but they are limited by scale and computational tractability. We use a non-parametric BFRY process that has found success in modeling power law behavior, while having a simple tractable density. To scale to large genetic datasets, we use variational autoencoders (VAEs) to perform posterior inference. The model’s utility for modeling power law behavior is investigated on randomly generated datasets and the somatic mutation profile of bladder cancer patients from TCGA.
Short Abstract: Plants produce diverse metabolites essential for plant growth, development, and/or adaptation. Some of these metabolites are also important for human nutrition and medicine. Although we have amassed substantial knowledge in plant metabolic pathways, a large number of genes responsible for biosynthesis of these plant metabolites in known pathways remain to be identified. Gene identification can be difficult due to laborious biochemical approaches or high error rates from common computational predictions based on co-expression. Here, we aim to develop a machine learning approach to maximize the utility of gene expression data for predicting metabolic pathway genes using tomato as a model. For machine learning input, we extract ~7,500 expression features by combining 47 expression datasets of multiple tissues/environments with two different ways of representing expression values, 10 expression similarity measures, and 93 clustering algorithm/parameter combinations. For each pathway, we build a machine learning model integrating the expression features, and identify the best feature combination to predict pathway genes. Our study highlights the need to extensively explore expression features to maximize the performance of pathway predictions, and reveals the divergent expression features informative for predicting genes from different pathways.
Short Abstract: There is now a strong focus in science to address the newly appreciated issue of reproducibility of landmark studies and clinical trials. DREAM Challenges, partnering with Sage Bionetworks, has created a unique data-hosting infrastructure to facilitate reproducible models via unbiased benchmarking. This framework enables challenge participants to examine questions in biology and medicine, such as clinical prediction and prognosis benchmarking, in a collaborative manner. The traditional DREAM challenge employed a standard “data to model” framework where training and blinded test data are made fully available to participants, who then submit their prediction files. In the past year, DREAM has flipped this traditional “data to model” paradigm and began using a cloud-based “model to data” framework. Here, models are containerized via Docker and are submitted and run on sequestered validation data. This system was created mainly for the Digital Mammography challenge where the training and test data were so large that they could not be made publicly available. This framework also enabled all submitted models to be reproduced and run on future mammograms. DREAM has employed similar approaches in other challenges and is now standardizing this approach to enable continual benchmarking in future challenges.
Short Abstract: Tumor heterogeneity and admixtures with normal tissue make somatic variant calling a challenging task. Combining multiple callers using rule-based filtering has previously been a standard protocol. However, although these filtering approaches may help to increase specificity, this may come at the expense of sensitivity. NeoMutate is a machine-learning (ML) framework that compensates for these and other drawbacks. We demonstrate here how NeoMutate leverages an ensemble of seven variant callers to improve performance overall. As in the ICGC-DREAM challenge, BAM files were randomly sampled into two non-overlapping subsets of equal size. More than 3000 true cancer variants were randomly selected from COSMIC and added to one of the BAMs. The variant allele frequencies of the resulting synthetic mutations ranged from 0.01 to 1, allowing for simulation of multiple subclones. All possible combinations of the seven variant callers were trained using eight machine learning algorithms, and evaluated using five-fold cross validation. NeoMutate significantly improved variant detection. In particular, decision-tree type models had higher performance than any of their constituent individual variant callers. Given the unique characteristics of the many available callers, we demonstrate here that integrating them comprehensively in an ensemble machine-learning layer optimizes somatic variant detection rates.
Short Abstract: The ability of the B-cell receptor to recognize virtually any pathogenic epitope relies on Ig V(D)J rearrangement to generate a vast diverse repertoire. Usage of individual V genes in combinatorial V(D)J rearrangement is not random, but rather strong biases exist that favor specific V genes. To elucidate these rearrangement mechanisms, we assembled a comprehensive dataset of the murine pre-immune Igk repertoire, where for each of the 162 Vk genes we measured gDNA and RNA rearrangement frequencies in pro-B and pre-B cells, and ChIP-seq/RNA-seq signal from 34 datasets including epigenetic marks, TFs and transcription. To assess feature importance in predicting Vk gene activity, recombination frequencies and RNA/gDNA rearrangement ratios, we built and validated classification and regression models, then used feature selection to identify minimal subsets of features that together would explain most variability in the observations. We show that important differences do occur in Vk gene usage between gDNA and RNA. These differences are likely tied to promoter strengths and appear consistent throughout B cell development. Importantly, the enhancer mark H3K4me1 is the most predictive of Vk utilization in pre-B cells, together with Pax5 and Ikaros, while PU.1 is the top predictor of early Vk gene rearrangement in pro-B cells.
Short Abstract: Machine-learning models for protein function enable the prediction and discovery of sequences with optimal properties. However, the amount of labeled data available for training is often small. Furthermore, machine-learning models operate on vectors, and the best way to vectorize proteins is not obvious. Amino acids are analogous to letters, subsequences of amino acids to words, and proteins to documents encoding structure and function. Given a large collection of unlabeled texts, embedding models convert words and documents to vectors that capture their meaning. We apply these techniques to learn the language of proteins. Our model learns to vectorize proteins such that similar vectors encode similar proteins. The predictive power of Gaussian process regression models trained using embeddings is comparable to those trained on existing representations despite the embeddings having orders of magnitude fewer dimensions. Moreover, embeddings do not require alignments, structural data, or amino-acid properties. While the number of known protein sequences is rapidly increasing, it remains time-consuming and difficult to measure many protein properties of interest. By first training an embedding model on unlabeled protein sequences, we are able to transfer information encoded in these sequences to a specific task.
Short Abstract: Genes have been one of the most effective windows into the biology of autism, and it has been estimated that perhaps a thousand or more genes may confer risk. However, only 75 genes are currently viewed as having robust enough evidence to be considered true "autism genes". While massive genetic studies are underway to produce data to implicate additional genes, alternative approaches have aimed to predict autism risk genes using other forms of genome-scale data, such as gene expression and network interactions. Here we present forecASD, a machine learning approach that integrates spatiotemporal gene expression, heterogeneous network data, and previous gene-level predictors of autism association to yield a single score that represents each gene's likelihood of being involved in the etiology of autism. We demonstrate that forecASD has substantially higher performance than previous gene-level predictors of autism association, including genetic-based measures such as TADA. On an independent test set, consisting of newly-released data from the SPARK study, we show that forecASD best predicts which genes will have an excess of LGD mutations. Furthermore, we define a set of functional pathways that are currently underrepresented in the autism literature.
Short Abstract: Background: Clinical parameters do not accurately predict clinical outcomes in rare renal diseases like nephrotic syndrome due to underlying biologic heterogeneity. Machine learning techniques can identify predictors from many potential parameters across the genotype-phenotype continuum. Methods: Clinical data, pathology features and kidney tissue genome wide mRNA expression levels were collected for 197 patients in the Nephrotic Syndrome Study Network (NEPTUNE) cohort. Weighted gene co-expression network analysis was used to cluster genes into modules based on expression level across samples. Elastic net regularization was used to build Cox proportional hazards models for time to 40% renal function decline and proteinuria remission. Predictive expression modules selected by the algorithm were analyzed for their functional significance and biological processes using PANTHER and KEGG pathway enrichment analysis. Results: In the full elastic net model, tubular and glomerular modules were selected predictors for remission (tAUC 0.72) and renal function loss (tAUC 0.77). Targetable signaling pathways, such as integrin, interleukin, FGF, EGF receptor, JAK/STAT and VEGF signaling, were associated with specific modules. Conclusions: Pathway enrichment analysis demonstrates that the top glomerular tissue mRNA expression modules predictive of remission include specific molecular signaling pathways which can be perturbed by known pharmacological agents and thus serve as potential therapeutic targets.
Short Abstract: Many biological concepts fall under an open world assumption, which, in the context of binary classification, limits the ground-truth determination of the true class label (for positive and/or negative class) of the training and test examples. This limitation is addressed in positive-unlabeled learning by training a classifier between positive and unlabeled (treating unlabeled as negative) examples. Surprisingly, this approach leads to an optimal (in a certain sense) classifier; however, as shown in this work, using the same approach (treating unlabeled as negatives) for testing leads to a biased estimate of classifier performance. We also show that typically used performance measures including the receiver operating characteristic curve, or the precision-recall curve, among others, can be corrected with the knowledge of the class priors; i.e., the proportions of the positive and negative examples in the unlabeled data. We extend the results to the noisy-positive setting where some of the examples labeled positive are in fact negative. Here, the correction also requires the knowledge of the proportion of noise in the noisy-positives. Estimating the positive class prior and the proportion of noise using state-of-the-art algorithms, we demonstrate the efficacy of the correction on real-life data.
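For AUC specifically, the correction described above has a simple closed form when the class prior α (fraction of hidden positives among the unlabeled) is known: since AUC_pu = α/2 + (1−α)·AUC_true, the corrected value is (AUC_pu − α/2)/(1−α). The simulation below is an illustrative sketch of this one correction (not the authors' code, and not the noisy-positive case); the score distributions are arbitrary toy choices.

```python
import random

def auc(scores_pos, scores_neg):
    """Mann-Whitney AUC between two score samples."""
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

random.seed(0)
n, alpha = 1000, 0.3                                   # alpha: positive class prior
pos = [random.gauss(1.0, 1.0) for _ in range(n)]       # labeled positives
hidden_pos = [random.gauss(1.0, 1.0) for _ in range(int(alpha * n))]
neg = [random.gauss(0.0, 1.0) for _ in range(n - int(alpha * n))]
unlabeled = hidden_pos + neg                           # mixture: alpha positives

auc_pu = auc(pos, unlabeled)        # biased: unlabeled treated as negative
auc_true = auc(pos, neg)            # oracle value, for comparison only
auc_corrected = (auc_pu - alpha / 2) / (1 - alpha)     # prior-based correction
```

Treating the unlabeled mixture as negative shrinks the AUC toward 0.5; dividing out the prior recovers the oracle value up to sampling noise.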
Short Abstract: The computational tools used for genomic analyses are becoming increasingly sophisticated. While these applications provide more accurate results, a new problem is emerging in that these pieces of software have a large number of tunable parameters. The default parameter choices are designed to work well on average, but the most interesting experiments are often not "average". Choosing the wrong parameter values can lead to significant results being overlooked, or false results being reported. We take the first steps towards generating an automated genomic analysis pipeline by developing a method for automatically choosing input-specific parameter values for reference-based transcript assembly. Using the parameter advising framework, first developed for multiple sequence alignment, we can optimize parameter choices for each input. In doing so, we provide the first method for finding advisor sets for applications with large numbers of tunable parameters. On over 1500 RNA-Seq samples in the Sequence Read Archive, area under the curve (AUC) for the Scallop transcript assembler shows a median increase of 28.9% over using only the default parameter choices. This approach is general, and when applied to StringTie it increases AUC by 13.1% on experiments from ENCODE. A parameter advisor for Scallop is available on Github (https://github.com/Kingsford-Group/scallopadvising).
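The core advising loop is simple to sketch: run the tool once per parameter setting in a precomputed advisor set, score each output with an accuracy estimator, and keep the best setting. Everything below (`run_assembler`, `estimate_accuracy`, the `coverage_threshold` parameter) is a hypothetical stand-in, not the Scallop advisor itself.

```python
def advise(advisor_set, run_assembler, estimate_accuracy):
    """Return the parameter vector whose output scores highest under the estimator."""
    return max(advisor_set, key=lambda params: estimate_accuracy(run_assembler(params)))

# Toy stand-ins: pretend output quality peaks at coverage_threshold = 5.
advisor_set = [{"coverage_threshold": t} for t in (1, 3, 5, 7, 9)]
run_assembler = lambda params: params                    # output carries its parameters
estimate_accuracy = lambda out: -(out["coverage_threshold"] - 5) ** 2

best = advise(advisor_set, run_assembler, estimate_accuracy)
```

The hard parts the abstract addresses, finding a small advisor set that covers many inputs and building a reliable accuracy estimator, sit outside this loop.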
Short Abstract: GenePattern Notebooks (https://www.genepattern-notebook.org) allow researchers to perform hundreds of bioinformatics analyses and construct scientific narratives within a Jupyter notebook, without the need to write code. A notebook created in this environment is a series of “cells” which contain the executable analytical workflow and the associated descriptive formatted text and graphics. GenePattern Notebooks can be used for early data exploration and for presenting polished results. During the early stages of notebook development, a user can add, delete, and modify cells. When an analysis is ready for publication, the same document that was used in the design and analysis phases can serve as the complete, reproducible, in silico methods section for a publication. Programming users can freely mix GenePattern analysis cells with code cells, using Python variables directly as GenePattern analyses’ inputs and outputs. In addition to creating and editing their own notebooks, researchers can find notebooks contributed by others, adapt them for their own use, and share notebooks with collaborators and the general public. Shared notebooks can be viewed by accessing their URL, and if the author of a notebook makes changes, they are reflected immediately. Several notebooks can already be found at https://notebook.genepattern.org/
Short Abstract: Stratification of cancer patients by risk is one of the key tasks in realizing the promise of personalized cancer therapy. Driven by the hypothesis that the aggressiveness of cancer (and disease outcome) is associated with distinct genomic and transcriptional features, we modified our recently developed molecular signature method (MSM) for optimal classification of a given set of tumors into poor and better survival classes, given tumor profiles of genomic alterations or gene expression levels. The algorithm determines expression biomarkers that are associated with survival, and then computes an outcome-based signature score representing a weighted sum of the expression biomarkers used to classify tumors; the weights in the signature function are computed analytically using the MSM. We applied this approach to RNAseq profiles of TCGA cancers and obtained very distinct separations of tumors into poor and better survival classes across all cancer types. The P-values of the survival difference obtained for the combined signatures are substantially lower than any of the P-values obtained for individual genes. The survival classes identified are drastically different from randomly formed tumor classes of the same sizes, based on the total number of expression biomarkers as well as by a combined P-value computed using all genes.
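The scoring step described above reduces to a weighted sum over biomarker expression values followed by a threshold. The sketch below illustrates only that step with made-up gene names and weights; in the actual method the weights come analytically from the MSM, not from hand-chosen values.

```python
def signature_score(expression, weights):
    """Weighted sum of expression biomarkers (genes absent from the profile count as 0)."""
    return sum(w * expression.get(gene, 0.0) for gene, w in weights.items())

def classify(expression, weights, threshold=0.0):
    """Assign a tumor to the poor- or better-survival class by its signature score."""
    return "poor" if signature_score(expression, weights) > threshold else "better"

# Hypothetical weights: geneA up / geneB down associated with poor survival.
weights = {"geneA": 1.5, "geneB": -2.0}
tumor_poor = {"geneA": 2.0, "geneB": 0.1}    # score = 3.0 - 0.2 = 2.8
tumor_better = {"geneA": 0.2, "geneB": 1.5}  # score = 0.3 - 3.0 = -2.7
```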
Short Abstract: The human epigenome has been experimentally characterized by dozens of assays across hundreds of cell types. Unfortunately, most potential experiments—combinations of cell types and assay types—have not been performed and likely will never be run due to their cost. A natural desire is to impute these missing data sets by extrapolating from currently available data. Previous imputation techniques include ChromImpute, an ensemble of regression trees, and PREDICTD, a tensor factorization approach. We adopt a deep tensor factorization model, called Avocado, that outperforms both prior approaches in terms of the mean-squared error on a pre-defined 1% of the human genome. In addition, we show that Avocado learns a latent representation of the genome that can be used to predict aspects of chromatin architecture, gene expression, promoter-enhancer interactions, and replication timing more accurately than similar predictions made from real or imputed data. We then use feature attribution methods to better understand how Avocado works. Finally, we use submodular selection to identify a representative subset of genomic regions, and we demonstrate that these regions can be used to train genome-wide estimators more efficiently than using all positions and to better effect than a randomly selected set of positions.
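To make the factorization idea concrete: a plain (non-deep) CP-style factorization learns a latent vector per cell type, per assay, and per genomic position, and approximates each observed signal value by the sum over ranks of the three factors' products; missing experiments are then imputed from the learned factors. The SGD sketch below is a minimal toy under these assumptions; Avocado itself replaces the multiplicative combination with a deep neural network, which this does not reproduce.

```python
import random

def cp_factorize(observed, shape, rank=2, lr=0.01, epochs=5000, seed=0):
    """observed: dict {(cell, assay, pos): value}. Learn rank-r factors U, V, W
    so that sum_r U[c][r]*V[a][r]*W[p][r] fits the observed entries via SGD."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.5) for _ in range(rank)] for _ in range(shape[0])]
    V = [[rng.gauss(0, 0.5) for _ in range(rank)] for _ in range(shape[1])]
    W = [[rng.gauss(0, 0.5) for _ in range(rank)] for _ in range(shape[2])]
    for _ in range(epochs):
        for (c, a, p), y in observed.items():
            err = sum(U[c][r] * V[a][r] * W[p][r] for r in range(rank)) - y
            for r in range(rank):  # gradient step on each factor
                gu, gv, gw = (err * V[a][r] * W[p][r],
                              err * U[c][r] * W[p][r],
                              err * U[c][r] * V[a][r])
                U[c][r] -= lr * gu; V[a][r] -= lr * gv; W[p][r] -= lr * gw
    return lambda c, a, p: sum(U[c][r] * V[a][r] * W[p][r] for r in range(rank))

# Rank-1 ground-truth toy tensor (3 cell types x 3 assays x 4 positions),
# with one "experiment" held out to be imputed.
truth = lambda c, a, p: (c + 1) * (a + 1) * 0.1 * (p + 1)
observed = {(c, a, p): truth(c, a, p)
            for c in range(3) for a in range(3) for p in range(4)
            if (c, a, p) != (2, 2, 3)}
predict = cp_factorize(observed, (3, 3, 4))
```

After training, `predict(2, 2, 3)` gives an imputed value for the held-out entry from the latent factors alone.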
Short Abstract: Bioassay protocols are conspicuously absent from the informatics of drug discovery: current best practices have not progressed beyond using scientific English text, which is intractable to software. We will present our solution, which draws from the rich semantic web vocabularies of the BioAssay Ontology, Drug Target Ontology, Gene Ontology, and others. On their own these ontologies are not friendly to experimental scientists, and so we have created the Common Assay Template, which turns the massive hierarchies of the underlying ontologies into useful guidelines. This has been supplemented by machine learning infrastructure to help translate existing text into suggestions, with the help of natural language analysis. Using our new web based interface, a small team of biologists were able to annotate 3500 MLPCN screening assays that were extracted from the PubChem database, which consumed approximately three weeks of FTE effort. These semantically annotated protocols are fully machine readable, imparting many new capabilities at all scales. Searching can be done using precise specific terms, which is far more effective than keyword searching. In conjunction with electronic lab notebooks, the annotations serve as a facile way to classify and organize experiments, and keep tabs on the activities of colleagues.
Short Abstract: Introduction: Gene edits following DNA double-strand break (DSB) repair by non-homologous end joining (NHEJ) are highly unpredictable and error-prone, and thus undesirable for precision genome engineering and translational sciences. In contrast, edits resulting from microhomology-mediated end joining (MMEJ) can robustly generate mutant deletion alleles with reduced sequence heterogeneity. Here, we present MEDJED (Microhomology Evoked Deletion Judication Elucidation), a machine learning tool for predicting the extent of MMEJ-based deletions following a DSB generated by commonly used genome editing tools, including CRISPR/Cas systems. Methods: A random forest (RF) regression model was trained and tested on 46 and 44 loci, respectively, using MiSeq data generated in HeLa cell gene editing experiments using CRISPR/Cas9 reagents. Features including deletion length, microhomology length, thermal stability, GC content, and distance between microhomology arms and DSB site were examined. Results: The RF model comprised 5,000 trees (with maximum node depth of 6 and random feature pool of 3), and achieved a Pearson correlation coefficient of 0.68, with a mean absolute error of 0.14 and a root mean square error of 0.16. Conclusion: MEDJED succeeds as proof-of-concept for multifactorial approaches to developing efficient bi-allelic and MMEJ-activating reagents for precision genome engineering. Experiments to test its applicability to other model systems are underway.
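Several of the features listed in Methods (microhomology length, GC content, deletion length implied by a microhomology pair) can be computed directly from the sequence flanking the DSB. The sketch below illustrates that feature extraction with a deliberately simplified scan (longest substring with one copy in each flank); function names and the search strategy are illustrative, not MEDJED's implementation.

```python
def gc_content(seq):
    """Fraction of G/C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)

def longest_microhomology(left, right, max_len=10, min_len=2):
    """Find the longest substring with one copy in the left flank and one in the
    right flank of the cut site. MMEJ repair anneals the two copies, deleting
    the intervening sequence plus one copy of the microhomology arm."""
    for k in range(max_len, min_len - 1, -1):      # longest arms first
        for i in range(len(left) - k + 1):
            arm = left[i:i + k]
            j = right.find(arm)
            if j != -1:
                deletion_len = (len(left) - (i + k)) + j + k
                return arm, deletion_len
    return "", 0

# Cut site sits between the two flanks; "GATC" is the only shared 4-mer here.
left, right = "TTACGATCGGAA", "CCAGATCTTGG"
arm, deletion_len = longest_microhomology(left, right)
features = {"mh_len": len(arm), "mh_gc": gc_content(arm), "del_len": deletion_len}
```

Such per-locus feature vectors are what an RF regressor of the kind described would consume; thermal stability would need a nearest-neighbor model not shown here.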
Short Abstract: Targeted genome editing using the CRISPR-Cas (clustered, regularly interspaced, short palindromic repeats and CRISPR-associated proteins) system has rapidly become a mainstream method in molecular biology. Cpf1 (from Prevotella and Francisella 1), a recently reported effector endonuclease protein of the class 2 CRISPR-Cas system, has several characteristics that differ from the predominant Cas9 nuclease. Although Cpf1 has broadened our options to efficiently modify genes in various species and cell types, we still have limited knowledge on Cpf1, especially regarding its target sequence dependent activity profiles. Determination of CRISPR nuclease activities is one of the key initial steps for genome editing. Several computational approaches have been proposed for the in silico prediction of CRISPR nuclease activities. However, they rely on manual feature extraction, which inevitably limits the efficiency, robustness, and generalization performance. To address the limitations of existing approaches, this paper presents an end-to-end deep learning framework for CRISPR-Cpf1 guide RNA activity prediction, dubbed DeepCpf1. Leveraging (1) a convolutional neural network for feature learning from target sequence composition and (2) a multi-modal architecture for seamless integration of an epigenetic factor (i.e., chromatin accessibility), the proposed method significantly outperforms the conventional approaches with an unprecedented level of high accuracy. (Published in Nature Biotechnology 2018)
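The usual first step for a sequence-composition CNN of this kind is one-hot encoding the target sequence into an L x 4 matrix over which convolutional filters slide. The encoder below is a generic sketch of that step, not the published DeepCpf1 code; the handling of ambiguous bases is an assumption.

```python
def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA sequence as a list of 4-dim indicator vectors (L x 4)."""
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = []
    for base in seq.upper():
        row = [0.0] * len(alphabet)
        if base in index:          # ambiguous bases (e.g. N) stay all-zero
            row[index[base]] = 1.0
        matrix.append(row)
    return matrix

# An 8 x 4 input; a conv layer would slide position-weight filters over it.
x = one_hot("TTTACGTN")
```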
Short Abstract: Sampling through sequencing continues to provide a large number of genomes for taxonomic classification and for subsequent characterization of metagenomic samples. Many taxonomy prediction algorithms use exact matching or alignment based methods in order to identify organism type. This typically involves building an indexed sequence data structure for discriminating between the many taxonomic divisions and representative sequences available. Given Earth’s large unsampled genetic reservoir, programmatic taxonomy identification faces the “open world” problem of having a varying and ever-increasing number of genomes to potentially classify. Exact match based algorithms, such as Kraken, are especially susceptible to this problem since they need to store variations of a sequence in order to reproduce its taxonomic assignment. Here we present Tree-CNN, a hybrid decision-tree/CNN approach for taxonomy prediction of DNA strings, compare its performance to popular tools, explore the “learnability” of taxonomic clades at various levels, and characterize the difficulty of correct classification given the distribution of taxonomy labels relative to genome similarity. Tree-CNN models take advantage of the taxonomy structure, making predictions at branch nodes. Tree-CNNs achieve 93% sensitivity at the species level and occupy 118 MB, orders of magnitude less than Kraken (14 GB) and Opal (21 GB).
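The core idea, classifying at each branch node of the taxonomy and descending until a leaf is reached, can be sketched with stand-in per-node scorers. In the toy below the "classifier" at each node is a clade-specific k-mer counter; a real Tree-CNN would use a trained CNN at each branch node, and all names here are illustrative.

```python
def predict_taxonomy(tree, node, read, score):
    """Descend from the root, picking the highest-scoring child at each branch node."""
    path = [node]
    while node in tree:                  # node is internal while it has children
        node = max(tree[node], key=lambda child: score(child, read))
        path.append(node)
    return path

# Toy two-level taxonomy.
tree = {
    "root": ["Bacteria", "Archaea"],
    "Bacteria": ["E. coli", "B. subtilis"],
}
# Stand-in scorer: count occurrences of a (made-up) clade-specific k-mer.
signatures = {"Bacteria": "ACGT", "Archaea": "TTTT",
              "E. coli": "ACGTA", "B. subtilis": "GGGG"}
score = lambda clade, read: read.count(signatures[clade])

path = predict_taxonomy(tree, "root", "ACGTACGTA", score)
```

Making decisions per branch node is what lets the model stay small: each node only discriminates among its own children rather than among all leaves at once.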
Short Abstract: The application of systems biology and machine learning approaches to large amounts and varieties of biomedical data often yields predictive models that can potentially transform data into knowledge. However, it is not always obvious which techniques and/or datasets are most appropriate for specific problems, calling for alternatives such as building heterogeneous ensembles capable of incorporating the inherent variety and complementarity of the many possible models. Unfortunately, the problem of systematically constructing these ensembles from a large number and variety of base models/predictors is computationally and mathematically challenging. We developed novel algorithms for this problem that operate within a Reinforcement Learning (RL) framework to search the large space of all possible ensembles that can be generated from an initial set of base predictors. RL offers a more systematic alternative to the conventional ad hoc methods of selecting base predictors for the final ensemble, and has the potential to derive optimal solutions to the problem. For the sample problem of splice site identification, our algorithms yielded effective ensembles that perform competitively with ensembles consisting of all the base predictors. Furthermore, our ensembles utilized a substantially smaller subset of the base predictors, potentially aiding the ensembles' reverse engineering and eventual interpretation.
Short Abstract: We developed a new tool for predicting DNA-binding sites, JET2DNA, adapted from the approach implemented in JET2 for predicting protein-protein interfaces. JET2DNA significantly outperforms other established DNA-binding site prediction methods when tested on both bound and unbound forms (average values for JET2DNA predictions on bound and unbound forms, respectively: acc=0.86, F1=0.63 and acc=0.88, F1=0.60). We stress that, contrary to machine learning approaches, our method combines three sequence- and structure-based descriptors in a straightforward way: evolutionary conservation of the sequence, physico-chemical properties of residues, and the geometry of the protein. JET2DNA provides four scoring schemes, obtained from different combinations of these three descriptors. These multiple scoring schemes permit accurate prediction of DNA-binding sites for a wide range of protein classes and detection of multiple DNA interaction sites on the same protein surface. Such sites may be used by different DNA partners, or by the same partner at different moments in the accomplishment of the protein's function. Multiple interaction sites present on the same protein surface may not all be detected in the same crystal structure, and they may be erroneously classified as "false positives" in the crystal structure of the system studied.
Short Abstract: Small-data systems biology allows for new insights not usually revealed by systematic integration of big data. Feature engineering involves filtering and selecting the most relevant features in machine learning and can be applied in any discipline. Genomic data usually contain hundreds of gene measures per sample, which makes applying machine learning challenging. We highlight our results showing that integrating systems approaches into the analysis of small-scale datasets can help with feature selection in genomic data. Two systems approaches were used independently to identify important features that might predict Plasmodium falciparum drug sensitivity from transcriptome measures. In the first approach, drug perturbation signatures from the Library of Network-Based Cellular Signatures database were used to find genes differentially expressed with Artemisinin in human cell lines. Phylogenetic profiling was then used to infer a list of orthologous genes from human to malaria. In the second approach, a pharmacophore search was used to find proteins that might bind Artemisinin. In total, 237 genes were identified and used as features for constructing a deep neural network to predict Artemisinin sensitivity from microarray gene expression data (0.605 cross-validation R-squared, 0.458 test R-squared).
Short Abstract: Background: MicroRNAs (miRNAs) are small, non-coding RNAs that regulate gene expression through post-transcriptional silencing. Differential expression observed in miRNAs, combined with advancements in deep learning (DL), has the potential to improve cancer classification by modelling non-linear miRNA-phenotype associations. We propose a novel approach to miRNA-based cancer classification incorporating hierarchical tissue annotation, class-label balancing, and DL. Methods: miRNA expression profiles were analyzed for 1746 neoplastic and 3871 normal samples, across 26 types of cancer involving six organ sub-structures and 68 cell types. miRNAs were ranked and filtered using a specificity score representing their information content in relation to neoplasticity, incorporating 3 levels of hierarchical biological annotation. A DL architecture composed of autoencoders (AE) and a multi-layer perceptron (MLP) was trained to predict neoplasticity using 497 abundant and informative miRNAs. SMOTE was applied to correct sample class imbalance. Results: Nested four-fold cross-validation was used to assess the performance of the DL model. The model achieved an accuracy, sensitivity and specificity of 81%, 83.9% and 78.1%, respectively. Conclusion: Deep learning provides a powerful tool well-suited for modelling complex miRNA-phenotype associations, particularly in cancer. Incorporating biological context in the form of hierarchical tissue annotations with miRNA expression data results in more accurate cancer predictions.
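The class-balancing step uses SMOTE, whose core idea is to synthesize new minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. A minimal sketch of that interpolation step (not the full algorithm, and not the specific implementation used in the study):

```python
import numpy as np

def smote_sample(minority, k=2, rng=None):
    """Generate one synthetic minority-class sample: pick a random
    minority point, pick one of its k nearest minority neighbours,
    and interpolate a random fraction of the way between them."""
    rng = rng or np.random.default_rng(0)
    i = rng.integers(len(minority))
    d = np.linalg.norm(minority - minority[i], axis=1)
    neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
    j = rng.choice(neighbours)
    gap = rng.random()                    # position along the segment
    return minority[i] + gap * (minority[j] - minority[i])

# Toy 2-D minority class: three points at the corners of a triangle
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_sample(X_min)
print(synthetic)  # lies on a segment between two minority points
```

Because the synthetic point is a convex combination of two real minority samples, SMOTE densifies the minority region rather than merely duplicating samples, which is why it tends to help classifiers like the AE+MLP stack described above.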
Short Abstract: Alternative splicing produces multiple mRNA isoforms of a gene, which play important and diverse roles in regulation of gene expression, human heritable diseases, and responses to environmental stresses. However, very little has been done to assign functions at the mRNA isoform level. Functional networks, where interactions are quantified by their probability of being involved in the same biological process, have also usually been generated at the gene level. We have developed 17 tissue-specific mRNA isoform functional networks, in addition to an organism-level reference functional network, for mouse. Using the leave-one-out strategy with a diverse array of tissue-specific RNA-Seq datasets and sequence information, we trained a random forest model to predict the functional networks. Because there is no mRNA isoform-level gold standard, we used single-isoform genes co-annotated to Gene Ontology (GO) biological processes, KEGG/BioCyc pathways, and protein-protein interactions as functionally related (positive pairs). The non-functional pairs (negative pairs) were generated using GO annotations tagged with the "NOT" qualifier. We validated our networks by comparing their performance with previous methods, with randomized positive and negative class labels, and against literature evidence. These networks will be made available to the mouse genetics community.
Short Abstract: A fundamental goal of precision medicine is to use multiple types of data to generate a high-quality understanding of a patient's disease. For datatypes that have well-understood effects on the outcome of interest, it is straightforward to incorporate the data directly in a probabilistic graphical model. However, complex datatypes can be difficult to incorporate. To overcome this challenge, we present a deep learning framework which incorporates complex covariate data (e.g., histology images) into an interpretable graphical model framework by using Contextual Explanation Networks to encode the covariate data into sample-specific “contexts” which determine interpretable parameters for a graphical model. We apply the framework to a dataset of Kidney Renal Clear Cell Carcinoma patients and find that the use of imaging contexts improves prediction of case/control status by logistic regression from a baseline of 95% to over 99% accuracy. Finally, we investigate the learned contexts to uncover molecular subtypes of the disease.
Short Abstract: The role of miRNAs can be deeply influenced by SNPs, in particular in their seed regions, since these variations may modify their affinity for particular transcripts, generate novel binding capabilities for specific miRNA binding sites, or destroy them. Several computational tools for miRNA-target site prediction have been developed, but their results are often discordant, making the study of binding sites hard, and the analysis of SNP effects even harder. HappyMirna is a Java library (freely available at https://bitbucket.org/bereste/happymirna) that computes, annotates, and integrates miRNA-target predictions as reported by different tools. The core of HappyMirna is a fusion data structure, relying on SQLite, which integrates miRNA-target predictions obtained with different state-of-the-art tools in order to reduce false positive results. Many machine learning approaches are implemented for data integration, such as Random Forest, Support Vector Machine, Neural Networks, and Deep Learning. HappyMirna allows users to test custom sequences as input, such as those generated in novel sequencing experiments. Moreover, predictions from modified microRNAs or modified mRNAs can be compared to verify how SNPs affect miRNA-target pairing. Since this combinatorial approach can be very time-consuming, HappyMirna is implemented in a parallel fashion, allowing the application to scale linearly.
Short Abstract: High-throughput phenotyping is a powerful approach to compare large numbers of genetic lines for virtually any phenotype. For complex traits, however, manual visual inspection does not always provide a clear-cut interpretation and is a time-consuming affair. It is therefore of interest to develop and implement efficient, high-quality automated image analysis methods. Deep learning has seen massive adoption in the past few years, due to recent advances in hardware for training deep neural networks, the availability of software libraries for building them, new and powerful architectural designs, and the availability of big datasets to train on. Here, we present a deep learning approach using (among others) the Faster R-CNN network architecture for high-throughput analysis of complex biological images (wheat laser dissection microscopy, wheat stomata, and in-field wheat canopy) that could previously not be analyzed using classical image analysis. In future work, we want to learn the relationship between the genetic variation and the extracted phenotypic variation to detect underlying causal genetic variation. The deep learning approach has enabled us to perform high-throughput comparison of genetic lines on any complex trait that can be imaged, provided that we generate the accompanying annotated training data.
Short Abstract: Flavin mononucleotide (FMN) is a cofactor involved in the electron transport chain, carrying and transferring electrons in cellular respiration. Without interactions at FMN binding sites, electrons cannot be transferred and cells cannot harvest energy from consumed food. Creating a precise model to identify FMN binding sites in the electron transport chain is therefore a crucial problem for understanding the electron transport process and for designing drug targets. Computational methods have achieved high performance on general FMN binding site prediction, but there has been little prior work specific to the electron transport chain, and no study to date has examined deep learning algorithms on this dataset. We therefore provide a deep learning approach, combining two-dimensional convolutional neural networks with position-specific scoring matrices, to address this problem. Our method achieved an independent-dataset accuracy of 96% and an MCC of 0.52 for predicting FMN binding sites in the electron transport chain. In comparison with other published works, this method shows significant improvement in all measured metrics. In summary, we provide an accurate tool for predicting FMN binding sites in the electron transport chain and useful information for biologists in their research.
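A position-specific scoring matrix, as used above for the CNN input, records per-position log-odds of residue frequencies against a background. A minimal sketch from toy aligned sequences (uniform background and +1 pseudocounts are simplifying assumptions; real PSSMs are usually built from PSI-BLAST profiles):

```python
import numpy as np

def pssm(sequences):
    """Build a position-specific scoring matrix from equal-length
    aligned sequences: log2 of per-position residue frequency over a
    uniform background, with +1 pseudocounts."""
    alphabet = sorted(set("".join(sequences)))
    L = len(sequences[0])
    counts = np.ones((len(alphabet), L))        # pseudocounts
    for seq in sequences:
        for i, ch in enumerate(seq):
            counts[alphabet.index(ch), i] += 1
    freqs = counts / counts.sum(axis=0, keepdims=True)
    return alphabet, np.log2(freqs * len(alphabet))

alphabet, M = pssm(["ACD", "ACD", "AED"])
print(alphabet)   # ['A', 'C', 'D', 'E']
print(M.shape)    # (4, 3); conserved residues score > 0
```

Stacking such a matrix with the raw sequence encoding gives the two-dimensional input a CNN can convolve over.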
Short Abstract: Disease risk gene identification is a fundamental challenge in studying complex diseases. Case-control whole exome sequencing studies have great potential to uncover risk genes in complex diseases. Unfortunately, current variant association tests are underpowered to identify disease risk genes in sequencing studies. From a systems biology point of view, properly integrating orthogonal biological information is a promising avenue to improve the power of identifying disease risk genes. In this study, we propose Integrated Gene Signal Processing (IGSP), an approach to prioritize disease risk genes by integrating network information with genetic associations across multiple phenotypes in sequencing studies. By following a 'discovery-driven' integration strategy that does not rely on prior knowledge about the diseases, IGSP not only maintains the unbiased character of genome-wide association studies but can also be applied to studies of novel diseases. Simulations show that IGSP can effectively uncover risk genes with marginal association signals, outperforming a widely used genetic association test by a factor of 2 to 3. Indeed, in a real-world dataset from a small disease cohort, we demonstrate that IGSP discloses important risk genes for congenital heart disease in 22q11.2 deletion syndrome that were missed by the previous study because of their weak association signals.
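Integrating marginal association signals with network information is commonly done by propagating gene scores over the network, e.g. via random walk with restart. The sketch below shows that generic propagation scheme; IGSP's actual formulation may differ, and the toy network and parameters are illustrative:

```python
import numpy as np

def network_smooth(W, p0, alpha=0.5, n_iter=50):
    """Propagate gene scores over a network by iterating
    p <- alpha * W_norm @ p + (1 - alpha) * p0 (random walk with
    restart), so weak signals diffuse to network neighbours."""
    W = np.asarray(W, dtype=float)
    W_norm = W / W.sum(axis=0, keepdims=True)   # column-normalize
    p = p0.copy()
    for _ in range(n_iter):
        p = alpha * W_norm @ p + (1 - alpha) * p0
    return p

# Toy 3-gene path network: gene0 - gene1 - gene2
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
p0 = np.array([1.0, 0.0, 0.0])   # marginal association at gene 0 only
print(network_smooth(W, p0))      # signal spreads along the path
```

After smoothing, a gene with no direct signal but well-connected to signal-bearing genes receives a nonzero score, which is the intuition behind rescuing risk genes with only marginal associations.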
Short Abstract: Gene regulatory sequences play critical roles in implementing tightly controlled RNA expression patterns that are essential in a variety of biological processes. Specifically, enhancer sequences drive expression of their target genes, and the availability of genome-wide maps of enhancer-promoter interactions (EPIs) across different cell lines has opened up the possibility to use machine learning approaches to extract interpretable features that define these interactions in different contexts. Inspired by machine translation models, we develop an attention-based neural network model, EPIANN, to predict EPIs exclusively based on DNA sequences. Our approach accurately predicts EPIs and generates pairwise attention scores at the sequence level, which specify how short regions in the enhancer and promoter pair up to drive the interaction prediction. The attention representations can be visualized and correlated with other genomic and functional genomic features in order to better characterize EPIs.
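The pairwise attention scores described above can be pictured as softmax-normalized similarities between learned representations of enhancer positions and promoter positions. A generic attention sketch, assuming nothing about EPIANN's exact architecture (the random vectors stand in for learned CNN features):

```python
import numpy as np

def pairwise_attention(E, P):
    """Pairwise attention between enhancer positions (rows of E) and
    promoter positions (rows of P): softmax over dot-product scores,
    normalized across promoter positions for each enhancer position."""
    scores = E @ P.T                              # (len_E, len_P)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

E = np.random.default_rng(0).normal(size=(5, 8))  # 5 enhancer positions
P = np.random.default_rng(1).normal(size=(3, 8))  # 3 promoter positions
A = pairwise_attention(E, P)
print(A.shape)   # (5, 3); each row sums to 1
```

Visualizing `A` as a heatmap is what lets such a model show which short enhancer and promoter regions pair up to drive the interaction prediction.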
Short Abstract: Single cell RNA-sequencing technology (scRNA-seq) provides a new avenue to discover and characterize cell types, but the question remains: how well do novel transcriptomic cell subtypes replicate across studies? Here we describe our recent efforts to quantify the degree of cell type replicability across datasets, and provide approaches to rapidly identify and evaluate novel types with high similarity. 1) We provide an in-depth assessment of the replicability of neuronal identity, comparing results across eight technically diverse datasets to define best practices. 2) We then evaluate novel cortical interneuron subtypes, finding that approximately half have evidence of replication in at least one other study. Identifying these putative replicates allows us to re-analyze the data to find candidate marker genes that are robust to technical differences across studies. 3) We demonstrate the wide applicability of our findings across tissues, technologies and species with additional targeted use cases. By defining practices that enable high accuracy identification of replicable cell types at multiple levels of specificity, our work suggests a general route forward for large-scale evaluation of scRNA-seq data.
Short Abstract: It is known that utilizing global graph connectivities in protein similarity networks can significantly improve the detection of remote evolutionary relationships or structural similarities between proteins. However, the key limitation is the intensive computation needed to construct a massively large protein network. We present Label Propagation on Low-rank Kernel Approximation (LP-LOKA), a low-rank semi-supervised learning formulation for this task on massive protein networks with millions of nodes. Our low-rank approximation by the Nyström method utilizes information from millions of sequences without computing all pairwise similarities. With a scalable parallel algorithm based on distributed-memory computing using the message-passing interface, LP-LOKA can search protein networks with two million protein domains from the ADDA database or three hundred thousand proteins from Swiss-Prot using high-performance computation. Another implementation using Apache Hadoop/Spark was developed to provide fault-tolerant distributed computing. Compared with widely used profile-search methods such as HMMer/JackHMMer, HHSearch/HHblits, and PSI-BLAST, and the sparse-network-search method RankProp, we show that LP-LOKA effectively utilizes the information in massive protein networks to provide both better remote homology detection and better fold recognition. We also observed that the larger the protein similarity network, the better the performance, especially on particular protein superfamilies and folds.
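The Nyström method avoids computing all pairwise similarities by sampling a subset of "landmark" columns of the kernel matrix and reconstructing the rest from them. A minimal dense sketch of the reconstruction identity (in the real setting one would only ever compute the sampled columns, never the full matrix):

```python
import numpy as np

def nystrom(K_full, landmarks):
    """Nyström low-rank approximation of a PSD kernel matrix from a
    subset of landmark columns: K ~ C @ pinv(W) @ C.T, where C holds
    the sampled columns and W their intersection block."""
    C = K_full[:, landmarks]                  # n x m sampled columns
    W = K_full[np.ix_(landmarks, landmarks)]  # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
K = X @ X.T                                 # rank-5 linear kernel
K_hat = nystrom(K, landmarks=list(range(10)))
print(np.allclose(K, K_hat, atol=1e-6))     # True: exact when rank <= m
```

When the landmark block captures the kernel's effective rank, the approximation is (near-)exact, which is why label propagation on the low-rank factors can match propagation on the full network at a fraction of the cost.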
Short Abstract: Growing resistance to antimicrobial drugs in bacterial pathogens is a serious and increasing public health concern. In silico identification of resistance phenotypes from whole genome sequence (WGS) data would improve the ability to track resistance and inform treatment responses. WGS-based methods provide a faster approach with wider potential utility than culture-based methods; for example, they can be used to predict multiple relevant phenotypes from a single input. Our aim in this study is to assess the ability of different machine learning algorithms to predict minimum inhibitory concentrations (MICs) from 2553 Salmonella enterica subspecies enterica whole genome sequences spanning 57 serovars. We have tested, or are in the process of testing, support vector machine, artificial neural network, and ensemble (RandomForest, XGBoost) models for their ability to predict the MIC values of 14 different antimicrobial drugs. Additionally, we are testing three different methods of encoding the whole genome sequence data: the presence or absence of genes (using Roary), the presence or absence of pangenome segments (using Panseq), and defined-length subsequences (k-mers, using Jellyfish). Preliminary results suggest that machine learning approaches can accurately predict the MICs of multiple antimicrobial drugs using Salmonella whole genome sequences as inputs.
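The k-mer encoding mentioned above reduces a genome to counts of its overlapping fixed-length subsequences (tools like Jellyfish do this at scale; this toy sketch shows only the idea):

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count overlapping k-mers in a sequence; the resulting count
    vector is a simple fixed-vocabulary feature encoding of a genome."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = kmer_counts("ATGATGCC", k=3)
print(counts["ATG"])   # 2
print(len(counts))     # 5 distinct 3-mers
```

Because the k-mer vocabulary is fixed (4^k possibilities for DNA), every genome maps to a feature vector of the same length, which is what lets genomes of different sizes feed the same SVM, neural network, or tree ensemble.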
Short Abstract: Antibodies provide a key mode of defense employed by the immune system to fight disease, so eliciting potent antibodies is one of the main goals in vaccine development. Antibodies are one of the most effective therapeutic agents, and engineering potent antibodies is one of the main goals in biologic drug development. The power of antibodies lies in their affinity and specificity in recognizing their cognate antigens. Unfortunately, experimental techniques to determine antibody-antigen binding affinities are difficult to scale up to large sets of new antibodies and antigens. Though computational methods are suitable for large-scale prediction, current methods lack sufficient accuracy. Here, we address the problem of predicting the binding affinity of an antibody against an antigen variant based on the amino acid sequence of that antigen variant. We develop a mixture of experts approach that learns models for individual antibodies against some antigen variants, and then combines information across the antibodies in order to make accurate predictions for a wide range of new variants. In evaluation on a dataset consisting of 52 antibodies and 608 strains of HIV, the predictive accuracy of our approach is demonstrated to be significantly better than that of an existing approach.
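The combination step in a mixture-of-experts model weights each per-antibody expert's prediction by a gating score. A generic sketch of that step only (softmax gating over mock scores; not the authors' exact model, whose gates would be learned from the antigen sequence):

```python
import numpy as np

def mixture_predict(expert_preds, gate_weights):
    """Combine expert predictions with softmax-normalized gating
    weights: the output is a convex combination of the experts."""
    w = np.exp(gate_weights - np.max(gate_weights))  # stable softmax
    w /= w.sum()
    return float(np.dot(w, expert_preds))

# Three per-antibody experts predict affinity for a new antigen
# variant; the gate strongly favours expert 0.
preds = np.array([0.9, 0.2, 0.4])
gates = np.array([2.0, 0.1, 0.1])
print(mixture_predict(preds, gates))   # ~0.76, pulled toward expert 0
```

Sharing the gate across antibodies is what lets information learned from well-characterized antibody-antigen pairs transfer to predictions for new variants.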
Short Abstract: Bladder cancer is the 5th most common cancer in Canada. Most patients are initially diagnosed with non-muscle invasive bladder cancer (NMIBC), but 10-30% progress to muscle-invasive bladder cancer (MIBC) even with treatment. To understand the molecular progression of bladder cancer, we propose a machine learning method to identify specific copy number alterations (CNAs) associated with muscle invasion. A public dataset from MSKCC was used, in which 288 genes had been sequenced from 109 bladder cancer biopsies (Kim et al., 2015). We first performed feature grouping to classify each sample as “muscle-invasive” or “non-muscle invasive” according to its pathological stage. Resampling was used to avoid class imbalance. A wrapper-based feature selection method was applied to the new set of features to identify the most relevant CNAs, and ten-fold cross-validation was used to avoid over-fitting. Our results show that CNA values of TP53, DDR2 and MLL2 can predict whether an individual’s bladder cancer is currently muscle-invasive, with 91% accuracy and 95% precision. NMIBC is associated with loss of TP53, while MIBC is associated with gain of DDR2 or MLL2. In particular, amplification of DDR2 is specifically associated with muscle invasion. These findings provide insight into the pathogenesis of muscle invasion in bladder cancer.
Short Abstract: In recent decades, the use of computational tools has become prevalent in drug development. Molecular mechanics force fields (MMFFs) provide an atomistic representation of drug-target protein binding interactions and elucidate pertinent structural information needed to evolve lead compounds into viable drug candidates through simulation. While most protein force fields are well optimized, developing force field parameters for potential drug candidates featuring new chemical scaffolds can be challenging. The research presented herein addresses this problem by employing machine learning models in conjunction with the CHARMM General Force Field (CGenFF) to develop a pipeline that enables parameterization across MMFFs. Random forest (RF) classification algorithms were used for the prediction of CGenFF atom types (labels that correspond to atomic environments) and atomic partial charges. The inputs to each model were “atomic fingerprints” (AFps), a custom definition of feature vectors that describe an atom’s geometric and chemical environment. The current model was trained against 464 organic molecules (193,152 AFps), with a 10-fold cross-validation classification accuracy of 99.35% (average F1-score = 0.993) and an RMSE of 0.015 (R2 = 0.977) for the prediction of partial charges. Case studies for ongoing validation are offered to clarify the algorithm’s function and the significance of its output data.
Short Abstract: The Atom Mapping of a chemical reaction is a bijection between the atoms on the two sides of the reaction. It encodes the changes that take place during the reaction and therefore constitutes essential information in studies that involve computational processing of chemical reactions. In Metabolic Engineering, for example, Atom Mappings are critical for the computation of biologically feasible metabolic pathways for the production of therapeutic compounds. Computational methods for automatically deriving the Atom Mapping have been developed. The Atom Mapping problem has been approached as an optimisation problem, assuming that the reaction proceeds with the minimum number of bond changes. In an effort to alleviate that assumption, previous approaches take into account the stability of the bonds. In these approaches, the bond stabilities were manually chosen by experts, but appear to be insufficient for global application. In this project, I am developing a computational tool for the calculation of the Atom Mapping which is based on Machine Learning techniques for estimating the bond stabilities. Preliminary results show that the accuracy of the Machine Learning based method is comparable to that of previous approaches, while being more robust and scalable across a large and diverse dataset of reactions.
Short Abstract: Complex diseases such as autism spectrum disorder or coronary artery disease are caused by several perturbed cellular mechanisms associated with hundreds of interacting genes. However, identifying specific disease-gene linkages is a significant challenge. For example, patients diagnosed with the same disease might have distinct genetic causes, and most mutations have small-to-moderate effects on the disease, making them hard to detect. These complications, among others, make the comprehensive identification of disease-associated genes impossible through traditional genetic screenings. One approach is to use machine learning (ML) to build upon known disease-associated genes to identify others, based on the hundreds of thousands of publicly available human genetic and molecular datasets. However, this approach is not straightforward, because genes are linked to diseases with varying levels of confidence, ranging from strong experimental evidence to weak circumstantial associations. Though all of these gene-disease linkages can hold valuable information, traditional ML algorithms do not incorporate examples with varying levels of evidence. We have developed a suite of ML methods that can incorporate this "weighted" information. Furthermore, we have extended these disease-gene prediction methods for use with noisy, low-quality datasets wherein some data are known to be less certain, though it is not known which.
Short Abstract: Biological experiments such as single cell RNA-Seq often produce large, highly multidimensional datasets with complex structures, potentially able to shed light on different biological processes. Characterizing such structures requires approaches able to deal with datasets containing thousands of cells. To tackle this issue, we developed ElPiGraph, a fast, robust and scalable approach to reconstruct Elastic Principal Graphs from single cell RNA-Seq and other biological data. ElPiGraph reconstructs a graph structure while simultaneously embedding it into a data space, and is able to identify subsets of the data points possessing a prescribed topology, such as branching or circular paths. It is also possible to perform bootstrapped graph reconstruction, which can assess the robustness of the reconstructed graph and identify more complex structures, such as branching interconnected circles. ElPiGraph has been implemented in several programming languages, including Java, R, Matlab, Python and Scala, with a graphical interface available for the R implementation. By applying ElPiGraph to single cell RNA-Seq datasets, we will show how it can be used to explore the genetic events connected with differentiation, embryonic development, and the cell cycle. References: Gorban A.N., Zinovyev A. 2010. Int J Neural Syst 20(3):219-32. https://github.com/sysbio-curie/ElPiGraph.R
Short Abstract: Predicting traits from genetic information is a grand challenge in biology and has major implications for accelerating plant and animal breeding and for our understanding of the genetic basis of complex traits. In Genomic Prediction (GP) a statistical model is built that uses genotype information to predict quantitative traits such as disease resistance, stress tolerance, or yield. While multiple approaches have been applied to GP and compared, machine learning (ML) approaches, especially deep learning, have been underrepresented in these studies. To assess the utility of ML approaches for GP, we used multiple ML algorithms, with an emphasis on deep learning approaches, to predict agronomic traits in six plant species and compared the results to traditional GP approaches. While we found that no one method performed best for all traits or all species, ML models tended to perform no better or worse than traditional methods when all available markers were used as predictors. However, ML approaches benefited significantly from feature selection, which in some cases allowed them to surpass the predictive performance of the traditional approaches. Our findings suggest that ML approaches could benefit both breeders using GP in the field and researchers using GP to better understand complex traits.
Short Abstract: Copy number variants (CNVs) have been implicated in several genetic disorders. Multiple callers exist today that use exome-sequencing data to call CNVs, but they collectively suffer from high false positive and low concordance rates. Venn diagrams measuring concordance (based on percent overlap) among the callers are often used to alleviate these problems, and this approach has long been assumed to be valid. We hypothesized that this approach to selecting high-quality CNVs is suboptimal. We used CNVs predicted by four CNV callers from 503 exomes to explore the validity of the Venn diagram approach and subsequently built a machine-learning method called CN-Learn. For each call, we captured the level and extent of concordance among the callers and supplemented them with CNV size, GC content, mappability, and local read depth fluctuations. Using an exhaustive set of twelve predictors, CN-Learn identified high-confidence CNVs with high precision (~90%) in an objective fashion. CN-Learn identified true CNVs even when they lacked concordance, and improved the CNV yield without inflating false positive rates. CN-Learn addresses several limitations of the existing Venn diagram approach, and recovers additional clinically relevant CNVs that would otherwise have been buried underneath the false positives or discarded for lack of concordance.
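The percent-overlap concordance that the Venn diagram approach relies on is typically computed as reciprocal overlap between two callers' CNV intervals. A minimal sketch of that measure (the 50% threshold commonly used is an assumption; thresholds vary by study):

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap between two CNV intervals (start, end):
    the length of their intersection divided by the length of the
    longer interval, so both intervals must cover the overlap well."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    if end <= start:
        return 0.0
    return (end - start) / max(a[1] - a[0], b[1] - b[0])

print(reciprocal_overlap((100, 200), (150, 300)))  # 50 / 150 ~ 0.333
print(reciprocal_overlap((0, 10), (20, 30)))       # 0.0 (disjoint)
```

Quantities like this, rather than a binary in/out-of-Venn-set decision, are the kind of continuous concordance features a learned classifier such as CN-Learn can weigh alongside GC content, mappability, and read depth.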