Session A: July 22 and July 23
Session B: July 24 and July 25
Presentation Schedule for July 22, 6:00 pm – 8:00 pm
Presentation Schedule for July 23, 6:00 pm – 8:00 pm
Presentation Schedule for July 24, 6:00 pm – 8:00 pm
Session A Poster Set-up and Dismantle
Session B Poster Set-up and Dismantle
Short Abstract: Motivation: Prediction of the interaction affinity between proteins and compounds is a major challenge in the drug discovery process. WideDTA is a deep-learning based prediction model that employs chemical and biological textual sequence information to predict binding affinity. Results: WideDTA uses four text-based information sources, namely the protein sequence, ligand SMILES, protein domains and motifs, and maximum common substructure words to predict binding affinity. WideDTA outperformed DeepDTA, one of the state-of-the-art deep learning methods for drug-target binding affinity prediction, on the KIBA dataset with statistical significance. This indicates that the word-based sequence representation adopted by WideDTA is a promising alternative to the character-based sequence representation approach used in deep learning models for binding affinity prediction, such as DeepDTA. In addition, the results showed that, given the protein sequence and ligand SMILES, the inclusion of protein domain and motif information as well as ligand maximum common substructure words does not provide additional useful information for the deep learning model. Interestingly, however, using only domain and motif information to represent proteins achieved performance similar to using the full protein sequence, suggesting that important binding-relevant information is contained within the protein motifs and domains.
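The word-based representation contrasted with character-based input above can be illustrated by splitting a sequence into overlapping k-mer "words". This is a minimal sketch of the general idea only; the word length and overlap are assumptions, not WideDTA's exact tokenization.

```python
def sequence_words(seq, k=3):
    """Split a biological sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Example: a short protein fragment tokenized into 3-mers.
print(sequence_words("MKVLT"))  # → ['MKV', 'KVL', 'VLT']
```

The resulting word list can then be fed to an embedding layer in place of per-character input.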
Short Abstract: Motivation: Identification of high-affinity drug-target interactions (DTI) is a major research question in drug discovery. In this study, we propose a novel methodology to predict drug-target binding affinity using only ligand binding information for the proteins, without any protein sequence or structure information. The SMILES representations of the interacting ligands of a protein are converted to a set of chemical words, and each protein is described by this set of chemical words. Results: Representing proteins using the word embeddings of the SMILES representations of their strong-binding ligands yielded performance comparable to that of methods where proteins are represented using sequence-based approaches. We utilize the eXtreme Gradient Boosting (XGBoost) algorithm to predict protein–drug binding affinities in the Davis and KIBA kinase datasets. Using only SMILESVec, a strictly string-based representation of proteins derived from their interacting ligands, we were able to predict drug-target binding affinity as well as or better than the KronRLS or SimBoost models that utilize protein sequence on the KIBA dataset.
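The core idea of describing a protein by the chemical words of its ligands can be sketched as averaging word embeddings. The embedding table and word strings below are toy assumptions for illustration, not the published SMILESVec vectors.

```python
import numpy as np

# Hypothetical embedding table: chemical word -> vector (toy values).
embeddings = {
    "c1cc": np.array([1.0, 0.0]),
    "C(=O)": np.array([0.0, 1.0]),
    "ccc1": np.array([1.0, 1.0]),
}

def protein_vector(chemical_words):
    """Represent a protein as the mean embedding of the chemical
    words extracted from its strong-binding ligands' SMILES."""
    vecs = [embeddings[w] for w in chemical_words if w in embeddings]
    return np.mean(vecs, axis=0)

v = protein_vector(["c1cc", "C(=O)"])  # → array([0.5, 0.5])
```

A regressor such as XGBoost would then take these protein vectors, concatenated with ligand features, as input.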
Short Abstract: A central challenge of systems biology is to computationally infer global regulatory networks from genome-wide data. Methods for inferring and modeling regulatory networks must allow for complexity at the systems level, while also adjusting for limitations in available data. We describe adapting the Inferelator, a widely used meta-method for global network inference, to single-cell genomics experimental designs. Here we describe efforts to learn regulatory networks from single-cell RNA-seq expression data combined with complementary data sources that provide priors and a gold standard on the network structure, to reveal the relative strength, importance, and dynamic properties of each regulatory edge in the D. melanogaster ovary. We learn and verify a set of gene regulatory networks underlying the specification and function of each ovarian cell type, and thus understand how this organ is built. We also describe our refactoring of our network inference code to allow for integrated and reproducible workflows for a wide variety of single-cell (mixed data-type) genomics experimental designs and datasets. Code is freely available and is also integrated with a Jupyter-embedded system for workflow setup and network visualization.
Short Abstract: Mitochondria are characterized by their own internal compartmentalization. The outer membrane separates the interior of the organelle from the cytoplasm, while the inner membrane encloses the matrix. The two membranes are separated by the intermembrane space. Proteins residing in the different compartments are specialized to fulfill different functions. Hence, knowing the precise location of a protein inside mitochondria is crucial for its accurate functional characterization. We present DeepMito, a method based on a convolutional neural network architecture that extracts patterns from primary sequence and discriminates among four different sub-mitochondrial compartments: outer and inner membranes, intermembrane space and matrix. DeepMito was trained and tested on a dataset comprising 424 protein sequences endowed with experimental evidence for sub-mitochondrial localization. Our approach achieved very good performance in cross-validation, reaching Matthews Correlation Coefficients (MCCs) of 0.46, 0.47, 0.53 and 0.65 in discriminating outer-membrane, inner-membrane, intermembrane-space and matrix proteins, respectively. Moreover, it is very robust to class imbalance. DeepMito significantly outperforms other similar approaches, with scores that are much more stable across the four classes. Finally, we demonstrated the utility of DeepMito for proteome-scale analysis, including human data, with predictions matching available experimental annotations.
Short Abstract: Data-driven medical research relies on patient data provided by hospitals. Multicentric studies are therefore difficult to conduct, because hospitals are rightfully reluctant to release patient data to the outside. The FeatureCloud project aims to overcome this challenge by giving hospitals the possibility to contribute their data to research without releasing it and thereby endangering the privacy of their patients. The federated design of the FeatureCloud platform protects patient privacy in an obvious manner: the data stays within the hospital servers. However, simply federating the process is not sufficient to ensure that the data remains private, because the parameters being exchanged can leak information to adversaries. Therefore, we propose to implement the data processing mechanisms in a differentially private way. Methods such as quantile normalization or PCA are routinely used to make sure that data coming from different sources can be analyzed in a meaningful way. Here, we provide first examples of federated, privacy-aware data normalization methods in the context of the FeatureCloud platform. We show that, using differential privacy, data can be processed without leaking either raw data or parameters that would allow (partial) reconstruction of the information.
Short Abstract: Systematic cell type composition differences in RNA-sequencing can lead to bias in differential expression analyses because of cell-type-specific gene expression. Cell type deconvolution can help recognize and potentially correct these effects. Common deconvolution methods approximate cell type compositions by finding the best linear combination of cells from a reference gene expression profile (RGEP) matrix using a list of signature genes. Selection of the RGEP matrix and the signature genes is the most important factor for the performance of these methods, making them vulnerable to biases in the RGEP matrix or signature genes. We developed a method that takes advantage of the growing amount of available scRNA-seq data and the representation learning capabilities of deep neural networks to overcome the need for RGEP matrix and signature gene curation. Our method, called Cell Deconvolutional Network (CDN), learns cell type deconvolution from artificial bulk RNA-seq samples generated from scRNA-seq datasets. CDN can take advantage of the growing number of scRNA-seq datasets for varying tissues and organisms and outperforms state-of-the-art algorithms on both artificial and real bulk RNA-seq data, while not relying on curated RGEP matrices or signature genes for deconvolution.
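The linear-combination baseline that these deconvolution methods build on can be sketched with ordinary least squares over a toy reference matrix. Real methods add constraints and signature-gene selection; the numbers here are illustrative assumptions.

```python
import numpy as np

def deconvolve(bulk, reference):
    """Estimate cell-type proportions for a bulk expression vector
    as the least-squares mixture of reference cell-type profiles."""
    props, *_ = np.linalg.lstsq(reference, bulk, rcond=None)
    props = np.clip(props, 0, None)   # no negative proportions
    return props / props.sum()        # proportions sum to 1

# Toy reference: genes x cell types (three genes, two cell types).
ref = np.array([[10.0, 1.0],
                [1.0, 10.0],
                [5.0, 5.0]])
bulk = ref @ np.array([0.3, 0.7])     # known 30/70 mixture
print(deconvolve(bulk, ref))          # ≈ [0.3, 0.7]
```

CDN, as described above, replaces this fixed linear model with a network trained on simulated bulk mixtures.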
Short Abstract: Biomedical samples often consist of a mixture of different cell types. Knowing their relative proportions is useful for research and clinical diagnostics. Thus, computational deconvolution methods have been developed to estimate these quantities based on the gene expression profiles of the mixed samples. These methods rely on gene expression data from pure cell types as the training set. Typically, they use linear regressions to decompose the mixed-sample expression in terms of the mean expression of the pure cell types. In doing so, they fail to exploit a large part of the information contained in the training set. In this work, we went beyond mean-based regressions by using rich curated gene expression data to train SVM or LDA linear models to fit the proportions of each cell type. We prevented overfitting by using either ridge regularization or sparsity constraints. In the absence of gold standards for validation, we evaluated the resulting models with an in silico “leave-one-study-out” cross-validation strategy, which enabled a more reliable assessment of the generalization properties of the deconvolution models. Overall, we developed and tested a novel computational deconvolution method that properly leverages a rich training set to provide more reliable predictions in concrete applications.
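Ridge regularization, used here to prevent overfitting, has a simple closed form that can be sketched on toy data. This is a generic illustration of the regularizer, not the study's SVM/LDA models.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression:
    w = (X^T X + alpha * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Toy example: y = 2*x0 + 3*x1, recovered almost exactly for tiny alpha.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])
w = ridge_fit(X, y, alpha=1e-8)  # ≈ [2.0, 3.0]
```

Increasing `alpha` shrinks the coefficients toward zero, trading a little bias for lower variance across studies.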
Short Abstract: Machine learning is an important technique of artificial intelligence that is widely applied in cancer diagnosis and detection. More recently, with the rise of personalised and precision medicine, there is a growing trend towards machine learning applications for prognosis prediction. However, to date, building reliable prediction models of cancer outcomes for everyday clinical practice remains a hurdle. In this work, we integrate genomic, clinical and demographic data of lung adenocarcinoma (LUAD) and squamous-cell carcinoma (LUSC) patients from The Cancer Genome Atlas. For genomic information, we include both mutation and copy number variation of 15 selected genes to generate predictive models for recurrence and survivability. We compare the accuracy and benefits of three advanced machine learning algorithms: decision trees, neural networks and support vector machines. Although the accuracy of the decision tree model is lower than that of the neural network, the tree models reveal the most important predictors among genomic information, clinical status and demographics, and their impact on the prediction of outcomes for both LUAD and LUSC. The machine learning models have the potential to help clinicians make decisions on aspects such as follow-up timelines and to assist with personalised planning of future social and care needs.
Short Abstract: It has been widely shown that gene expression is highly correlated, meaning that the expression level of a subset of genes can be used to infer the expression level of other genes. As such, 978 landmark genes are used to predict the expression of 11,350 target genes in the L1000 assay developed by the Connectivity Map group (https://clue.io/cmap) for cost-effective gene expression profiling. Here, accurate inference of target-gene expression is a prerequisite. To this end, ordinary least squares and regularization methods for linear regression (LR) have mainly been adopted. Recently, deep neural networks (DNNs) were shown to achieve higher accuracy than LR, suggesting a non-linear relationship between the landmark genes and their targets. To improve the accuracy of DNNs for target-gene expression prediction, we extracted non-linear features of landmark gene expression using various types of autoencoders. When applied to a dataset containing 111,009 expression profiles of 943 landmark and 9,520 target genes, feature extraction by denoising autoencoders improved the prediction accuracy of a DNN for ~94.4% of the target genes. On average, the prediction error for each target gene decreased by ~2.3%. Our results suggest that autoencoders can extract useful features of landmark gene expression for target-gene expression prediction.
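The ordinary-least-squares baseline for inferring target-gene expression from landmark genes can be sketched as follows. The dimensions are toy assumptions; in the study, autoencoder-derived features would replace or augment the raw landmark matrix.

```python
import numpy as np

def fit_targets(landmark, target):
    """Least-squares weight matrix mapping landmark-gene expression
    (samples x landmarks) to target-gene expression (samples x targets)."""
    W, *_ = np.linalg.lstsq(landmark, target, rcond=None)
    return W

# Toy data: 2 landmark genes linearly determine 1 target gene.
L = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
T = L @ np.array([[0.5], [1.5]])
W = fit_targets(L, T)
pred = L @ W  # reproduces T on this noiseless toy example
```

A DNN replaces the single weight matrix `W` with stacked non-linear layers, which is what allows it to beat this linear baseline.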
Short Abstract: Cancer cells acquire substantial gene expression changes as they progress, due not only to intrinsic factors such as mutations in the cancer cells, but also to extrinsic factors including their microenvironments. Therefore, identifying the changes in their gene expression programs during cancer development is crucial to discovering cancer development mechanisms and predicting patients’ prognosis. Several studies have identified core genes for cancer progression in particular cancer types. We obtained gene expression profiles and clinical information on pathological cancer stage, age, race and gender for 20 major cancer types from about 7,000 individual patients in TCGA. Based on a linear mixed-effects model, we discovered genes whose expression levels are significantly associated with pathological cancer stage, treating other clinical features as fixed effects and cancer types as random effects. Further, we found that these genes formed three gene expression modules. From the functional analysis, we found that they were associated with lipid oxidation, cell cycle and extracellular matrix organization. We further investigated how the three gene expression modules were associated with each other in terms of molecular networks. Finally, we found that their gene expression levels provided prognostic information irrespective of cancer stage, suggesting that they play crucial roles.
Short Abstract: Correctly identifying drug–target interactions is a key step in drug repositioning. Experimental drug-target interaction identification can be challenging and expensive; therefore, computational techniques for drug-target interaction prediction have gained a lot of attention. Computational approaches have generally been classified into ligand-based approaches (based on the structure-activity relationships of different ligands), target-based approaches (based on 3D structural information about protein-ligand binding), and machine learning-based methods (based on similarities among both compounds and targets). The performance of ligand-based approaches, such as QSAR and pharmacophore modeling, depends on the number of active ligands available for the protein target. Target-based approaches, such as docking and binding-site similarity, are powerful tools for the identification of protein-ligand interactions based on the 3D structures of the target; their limitation is the scarce availability of 3D structures for many drug targets. Here we apply a machine learning approach to fuse ligand-based information about a set of marketed drugs (Morgan circular fingerprints) with 3D structural information about their interaction profiles with their targets (PLIP), to improve on the drug-target interaction prediction accuracy of the ligand-based and target-based approaches taken singly.
Short Abstract: A major cause of failed drug discovery programs is suboptimal target selection, resulting in the development of drug candidates that are potent inhibitors, but ineffective at treating the disease. In the genomics era, the availability of large biomedical datasets with genome-wide readouts has the potential to transform target selection and validation. Here, we investigate how computational intelligence methods can be applied to improve therapeutic target selection in oncology. We compared different machine learning classifiers applied to the task of drug target classification for specific cancer types, starting with lung and pancreatic cancers. For each cancer type, a set of “known” target genes was obtained and an equally-sized set of “non-targets” sampled from the human protein-coding genes. Models were trained on mutation, expression (TCGA), and gene essentiality (DepMap) data. In addition, we generated a numerical embedding of the interaction network of protein-coding genes using deep network representation learning. We assessed feature importance using the Random Forest classifier and performed feature selection. Our best model achieved good generalization performance (AUROC on the test set: 0.83 for lung cancer, 0.84 for pancreatic cancer). Our results indicate that this approach may be useful to inform early stages of the drug discovery pipeline.
Short Abstract: Plant stress responses are often investigated by pooling data from multiple individuals grown under tightly controlled conditions. However, it has been shown that 1) individual plants show considerable phenotype and expression variation even under tightly controlled conditions and 2) this variation can be used to extract biologically meaningful information about gene functions. Here we test the principle of using uncontrolled single-plant variation to extract biological information about gene functions. To do so, 61 B. napus plants of the same genotype were grown in a field setting without applying any treatment or perturbation. Instead, we assume that slight variations in the micro-environment of each plant occur naturally. For each plant, the leaf transcriptome was measured early in the growth season. In addition, an extensive set of phenotypes was measured during the entire time span of the experiment. Using machine learning methods, we can accurately predict single-plant phenotypes from transcriptome data. This applies to phenotypes both close (in time and space) to the transcriptome data, such as leaf weight, and very distant, such as seed yield. As such, we show that studying natural perturbations to single plants is a viable alternative to more expensive traditional studies using tightly controlled growth conditions.
Short Abstract: Successful prediction and a molecular-level understanding of drug-induced organ toxicity are key to reducing both animal use and the attrition rate in drug discovery. We leveraged gene-expression data in the Open TG-GATEs database and systematically compared the predictive power of a plethora of machine-learning models of increasing complexity, including logistic regression, support vector machines, and deep neural networks (DNNs). DNNs consistently and substantially outperformed other models for almost all types of liver histopathology. The findings were corroborated by benchmarking against published models and by a crowd-sourcing approach. We applied DNNs to independent datasets and confirmed their superior performance over other models across technological platforms of gene expression profiling and across rodent species. Finally, we observed that, by using transformed or additional features as input, it is possible to further boost either the interpretability or the performance of DNNs. The present study demonstrates the feasibility and advantage of applying deep-learning techniques to predict drug-induced liver histopathology based on gene expression. Joint application of physiology-emulating cellular systems, omics technologies and machine-learning models such as DNNs holds the promise to replace and reduce animal use in drug discovery and to ensure the safety profiles of drug candidates tested in clinical trials.
Short Abstract: Drug-target interaction prediction is vital for drug discovery. Despite modern technological advances in drug screening, experimental identification of drug-target interactions remains a very demanding task, as it is time-consuming and extremely expensive. Predicting drug-target interactions in silico can therefore accelerate drug development or repositioning. Multiple machine learning models have been developed over the years to predict such interactions. In particular, multi-output learning models have drawn the attention of the scientific community due to their high predictive performance and computational efficiency. However, these models adopt the overly optimistic assumption that all the labels are correlated with each other. Here, we present a new drug-target interaction (DTI) prediction strategy that addresses DTI prediction as a label partitioning-driven multi-label classification task. We show that building multi-output learning models over groups (clusters) of labels often leads to superior results. For evaluation purposes, we employed five benchmark drug-target interaction networks and compared our strategy against other DTI prediction methods. The obtained results affirm the effectiveness of the proposed framework.
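The label-partitioning idea — grouping targets before building per-group multi-output models — can be sketched with a toy greedy grouping of label columns by Jaccard similarity. The threshold and the greedy scheme are illustrative assumptions, not the authors' clustering algorithm.

```python
import numpy as np

def jaccard(a, b):
    """Jaccard similarity between two binary label columns."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def partition_labels(Y, threshold=0.5):
    """Greedily group label columns whose Jaccard similarity to a
    cluster's first member reaches the threshold."""
    clusters = []
    for j in range(Y.shape[1]):
        for c in clusters:
            if jaccard(Y[:, c[0]], Y[:, j]) >= threshold:
                c.append(j)
                break
        else:
            clusters.append([j])
    return clusters

# Toy interaction matrix: drugs x targets; targets 0 and 1 behave alike.
Y = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
print(partition_labels(Y))  # → [[0, 1], [2]]
```

One multi-output classifier would then be trained per cluster, so only genuinely correlated labels share a model.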
Short Abstract: Immunization by direct venous inoculation of Plasmodium falciparum sporozoites (PfSPZ) under chloroquine treatment (PfSPZ-CVac) has recently been shown to provide high-level protection against controlled human malaria infection (CHMI). To identify Pf-specific antibody profiles of 40 immunized and non-immunized malaria-naïve individuals, a whole-Pf-proteome microarray with 7,455 protein fragments representing about 91% of the Pf proteome was used. Understanding the PfSPZ-CVac-induced antibody response against Pf is essential for predicting and improving vaccine-induced protective immunity. For this purpose, we adapted supervised machine learning methods to identify predictive antibody profiles of immunized and non-immunized individuals at three different time points during and after vaccination. Because of the large number of features relative to the small number of samples, we tested regularized logistic regression models, a random forest approach with feature selection, and a multitask-SVM approach. The multitask-SVM approach improves the classification performance for immunized and non-immunized individuals based on the underlying antibody profiles while combining time-dependent data in one prediction model. Furthermore, we developed new techniques to interpret the impact of Pf-specific antigens on the non-linear multitask-SVM model. In our analysis, we identified informative Pf-specific antigens for protected and unprotected individuals that may provide a better understanding for vaccine development.
Short Abstract: To elucidate pathological mechanisms and pinpoint target genes for developing therapeutics, it is important to assess abnormalities in the genome-wide interactions of proteins and genomic elements that cause gene mis-regulation. In this regard, ChIP-seq data is one of the core experimental resources for elucidating genome-wide epigenetic interactions and identifying the functional elements associated with diseases. Accordingly, the analysis of ChIP-seq data poses important and difficult computational challenges, due to the presence of irregular noise and bias at various levels. Although many peak-calling methods have been developed, current computational tools still require, in some cases, manual human inspection using data visualization. However, the huge volumes of ChIP-seq data make it almost impossible for human researchers to manually determine all the peaks. We designed a novel supervised learning approach with convolutional neural networks and integrated it into a software pipeline (called CNN-Peaks) for identifying ChIP-seq peaks, using labeled data from human researchers who annotate the presence or absence of peaks in genomic segments. The labeled data were used to train a model for identifying peak signals, and the model was applied to unknown genomic segments. We validated our pipeline using ChIP-seq data available from the ENCODE data portal.
Short Abstract: Inspired by recent methodological advances in precision medicine involving the integrative analysis of multi-omics data, we sought to investigate the potential this paradigm might hold in the context of skin aging. For this, we generated transcriptome and methylome profiling data from suction blister lesions of subjects between 21 and 76 years of age, which were integrated using a network fusion approach. Unsupervised cluster analysis on the combined network identified four subgroups exhibiting a significant age association. As indicated by enrichment analyses of the Hallmarks of Aging, the clusters capture the biological aging state more clearly than a mere grouping by chronological age. To test the stability of the identified subtypes, 31 subjects from the original cohort were re-invited three years later for a longitudinal second measurement. The clusters demonstrated high stability, with only 6 of the 31 subjects changing in cluster assignment, and changes occurring only along the natural age gradient. To characterize the biological pathways driving the clustering, a novel method based on pathway-level trained classification models was devised. The models identified intercellular signaling and regulation of stem cell proliferation as the most important pathways and allow a data-driven ranking of the Hallmarks of Aging.
Short Abstract: Several databases provide neuron images to neuroscientists in the public domain. They hold huge numbers of neuron images and help neuroscientists enormously in understanding the connectome. One of the biggest databases for the Drosophila brain is FlyCircuit, which contains 28,573 single-neuron images. However, when an image is taken from raw data, a neuron contains on average 10 breakpoints, for the following reasons. First, the experimental settings are not optimized for the confocal microscope. Second, the manual segmentation causes discontinuities in the neuron fibers. These breakpoints make analysis difficult. Traditionally, breakpoints are fixed manually with commercial software. However, this procedure is time-consuming, labor-intensive and usually lacks objectivity. Here we propose a machine learning method to solve this problem. We train a Convolutional Neural Network (CNN) model to identify the positions of breakpoints from their local environment. We use our tracing and segmentation algorithms to prepare the training set automatically. Thanks to this labeling method, training data can be prepared massively and quickly, and a more accurate model can be trained. Once the breakpoints are located, we can fix them with the missing-segment retrieval method we proposed previously. Furthermore, neuron reconstruction becomes easier with our proposed method.
Short Abstract: Cancer is a heterogeneous disease that may differ significantly across the population. Patients with similar histopathology may show different clinical outcomes, posing a great challenge for both diagnosis and therapy decisions. Studying transcriptomic data is expected to bring novel insights into the molecular mechanisms that underlie cancer formation and progression, hence improving the accuracy with which patients are stratified and treated. Gene expression data from two distinct datasets were analysed: colorectal cancer data (TCGA) and colon normal tissue data (GTEx). Preliminary analyses were performed on both datasets, namely Principal Component Analysis (PCA), survival analysis and differential gene expression analysis. Results show differences in gene expression and survival outcome between colon cancer patients at different stages of the disease. Also, when comparing normal with tumor tissue, the most significantly down-regulated and up-regulated genes were identified and interpreted from a biological and clinical point of view. Sparse Cox regression models further allow the identification of genes associated with high/low survival risk. Transcriptomics analysis now plays a key role in understanding the biological processes involved in cancer and can support biomarker identification towards a more personalized medicine.
Short Abstract: The Vienna Metabolomics Center has established open-source and cross-platform workflows for computational mass spectrometry, integrative multi-omics analysis and predictive modelling for clinical, biochemical, agricultural and ecological studies. All algorithms are implemented in the toolbox COVAIN with a graphical user interface. High-resolution mass spectral raw data are processed with an algorithm called mzFun, and initial annotations of biochemical pathways are assigned to unknown m/z features with an algorithm called mzGroupAnalyzer. Processed metabolomics, proteomics, transcriptomics, phenotypical and physiological data are imported into COVAIN, which provides rigorous statistical tools for data mining, from data cleaning, imputation, and uni- and multivariate statistics including ANOVA, PCA, ICA, PLS, correlation, clustering, Granger causality and multiple regression, up to advanced machine learning procedures. These include multivariate best-subset selection by genetic algorithm, classifiers such as SVM, DA and KNN, ensemble methods, and ROC/AUC diagnostics. Further, statistical network inference, visualization, modularity analysis and KEGG pathway enrichment analysis are implemented. COVAIN also features an experimentally validated inverse Jacobian calculation that infers the biochemical regulation Jacobian matrix directly from genome-scale metabolomics covariance data. Throughout the process, high-quality editable figures are provided. We will use a recently published Gestational Diabetes Mellitus metabolomics dataset to illustrate the whole workflow.
Short Abstract: Recent technological advances allow unbiased investigation of cellular dynamic processes by generating -omics profiles of thousands of single cells and computationally ordering these cells along a trajectory. The resulting trajectories help to gain insight into developmental processes such as the differentiation of immune cells in response to a pathogen. Since 2014, more than 70 so-called trajectory inference tools have been developed. Our recent review highlights the difficulty of the computational task at hand, since none of the existing methods is able to produce accurate models on all the biological datasets. Additionally, the vast majority of the methods scale very poorly, an issue that is becoming more relevant as the dimensionality of the datasets keeps increasing. In this work, we present the results of a growing neural gas-based approach, modified to best suit trajectory inference for single-cell omics applications. This algorithm iteratively grows an adaptive graph to best fit the data, and can fit any trajectory topology, from simple linear trajectories to complex disconnected graphs. Initial results show that this approach is able to produce accurate models on a wide variety of datasets and can easily scale to processing millions of cells.
Short Abstract: Background: The occurrence of complex genetic disorders often results from multiple gene mutations. The effect of each mutation is not equal for the development of a disease. Inspired by the field of information retrieval, we propose using term frequency (tf) and BM25 term weighting with the inverse document frequency (idf) and relevance frequency (rf) measures to weight genes based on their mutations. The underlying assumption is that the more mutations a gene has in patients with a certain disease, and the fewer mutations it has in other patients, the more discriminative that gene is. Results: We applied various machine learning techniques to the task of cancer type classification. The proposed representations are also compared with the sum-of-C-scores approach of Vural et al. (2016). According to our test results, the highest f-score (76.95%) is achieved with the BM25-tf-rf based data representation and the proposed neural network model. Conclusions: The best-performing classification system is further utilized for causal gene analysis. Examples of the most influential genes used for decision making are also found in the literature as target or causal genes. The proposed technique can also be applied to all genetic diseases.
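The underlying weighting scheme can be illustrated with plain tf-idf, treating each patient as a "document" and each mutated gene as a "term". This sketch omits the BM25 saturation and relevance-frequency refinements the abstract describes; the cohort data are toy values.

```python
import math

def tf_idf(mutation_counts):
    """Weight genes by mutation term frequency x inverse document
    frequency, with patients as documents and mutated genes as terms."""
    n_patients = len(mutation_counts)
    genes = {g for patient in mutation_counts for g in patient}
    # Document frequency: in how many patients is each gene mutated?
    df = {g: sum(1 for p in mutation_counts if g in p) for g in genes}
    weights = []
    for patient in mutation_counts:
        total = sum(patient.values())
        weights.append({
            g: (c / total) * math.log(n_patients / df[g])
            for g, c in patient.items()
        })
    return weights

# Toy cohort: TP53 mutated in both patients, KRAS in only one,
# so KRAS receives a positive weight while TP53's idf is zero.
cohort = [{"TP53": 2, "KRAS": 2}, {"TP53": 1}]
w = tf_idf(cohort)
```

Genes mutated in every patient get an idf of log(1) = 0, matching the intuition that ubiquitous mutations are not discriminative.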
Short Abstract: To fully appreciate the brain as a complex system, its component parts should be studied across multiple modalities of data. One such source of information from the brain is spatial gene expression. We generated a whole-brain, adult mouse spatial gene expression dataset using Spatial Transcriptomics (ST), an array-based transcriptome-wide mRNA assay that maintains the spatial origin of transcripts. After benchmarking a variety of analytical approaches, we found that most well-represented brain areas, as annotated by mapping the ST spots to the Allen Reference Atlas, are learnable with respect to all other brain areas using LASSO regression on gene expression alone. We further extend these analyses through meta-analysis of the Allen Institute’s transcriptome-wide adult mouse in situ hybridization data (ABA ISH). Specific brain areas that were learnable in the ST data are similarly learnable in the ABA ISH data. Encouragingly, preliminary LASSO models trained on the ST dataset have performed similarly in learning brain areas on held-out ST data and on the ABA ISH data, and vice versa. Through this meta-analysis, I will determine the replicability of learning canonical anatomically derived brain area assignments using gene expression from various types of spatial transcriptomic data and ultimately propose new transcriptionally defined groupings.
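LASSO's sparsity, which makes it attractive for picking a few informative genes per brain area, comes from the soft-thresholding operator at the heart of its coordinate-descent solvers. A minimal sketch of that operator (not the models trained in the study):

```python
def soft_threshold(z, gamma):
    """Soft-thresholding: the closed-form coordinate update used in
    LASSO coordinate descent; shrinks z toward zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

# Coefficients smaller in magnitude than the penalty are zeroed out.
print([soft_threshold(z, 1.0) for z in (3.0, 0.5, -2.0)])  # → [2.0, 0.0, -1.0]
```

Exact zeros are what let a fitted model name the handful of genes that distinguish one brain area from the rest.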
Short Abstract: Alcoholic liver disease (ALD) is one of the most prevalent chronic liver diseases worldwide, causing pathological changes in liver due to excessive consumption of alcohol. It progresses from fatty liver through alcoholic liver fibrosis (ALF) to cirrhosis (ALC). Unfortunately, the clear majority of ALD patients are only diagnosed by the time ALD has reached the irreversible ALC stage. Here, we use data from Danish health registries to examine if it is possible to identify patients likely to develop ALF or ALC based on their past medical history. To this end, we use statistical and machine-learning techniques to analyze data from the Danish National Patient Registry. Consistent with the late diagnoses of ALD, we show that ALC is the most common form of ALD in the registry data and that ALC patients have a strong over-representation of diagnoses associated with liver dysfunction, such as ascites and hepatic failure. We also find a small number of ALF patients who appear to be much less sick than those with ALC. Our findings highlight the potential of this approach to uncover hidden knowledge in registry data related to ALD.
Short Abstract: Forecasting crop yields is a formidable challenge. To date, there is no satisfactory model with universal validity. Here we show that a machine learning model to predict yield impact on agricultural crops, based on remote sensing and metabolite data measured in field trials of crop plants under drought stress across years, is indeed possible. We further show that our machine learning model can be applied at the vegetative and reproductive life cycle stages of crop development in the field to predict whether there will be a yield penalty. We further identified that the metabolites measured using polar-based gas chromatography (GCP) alone can predict the yield impact with an accuracy of up to 85%. This concept was established using four different corn varieties in two countries (Spain and USA), with field trials spanning 5 different years (2014-2018).
Short Abstract: Bone ingrowth and tissue differentiation are very important issues, especially after dental implant surgery. Several theoretical prediction models have been proposed, such as the mechano-regulatory tissue differentiation models of Huiskes et al. or Chou et al. However, these simulations consume huge resources in both computing power and processing time. In this study, we developed a Deep Convolutional Neural Network (DCNN) model to predict tissue regeneration around implants. Theoretical simulation results were accumulated as the training dataset for our DCNN model. Standard operating procedures for extracting features and labeling data from standard FEM simulation results were built. With these procedures, both the short- and long-term evolution of bone are highly consistent with the numerical results. The proposed simulation data-driven model is efficient and accurate, and it can provide reliable inference in the design of bone implants.
Short Abstract: The risk of developing a radiation-associated breast cancer in the contralateral breast (RCBC) is a concern among breast cancer survivors who were treated with radiotherapy for a first primary breast cancer, especially among young survivors. Machine learning is a novel approach to identify agnostic genomic associations across molecular pathways. Among participants in the Women’s Environmental Cancer and Radiation Epidemiology (WECARE) Study who were ≤40 years of age at diagnosis and who received a scatter radiation dose >1Gy to the contralateral breast, we included 52 women with RCBC and 153 women with unilateral breast cancer who did not develop a second cancer. Using a novel preconditioned random forest approach, we built a predictive model with 767,207 unselected GWAS-identified single nucleotide polymorphisms (SNPs). Using 2/3 and 1/3 of the samples for modeling and testing respectively, an area under the curve (AUC) of 0.62 (p=0.04) was obtained. Furthermore, this approach identified key biological correlates previously identified as relevant to breast cancer, radiation response, or both, including cyclic AMP-mediated signaling and Ephrin-A. In summary, machine learning methods applied to genome-wide genotyping data have great potential to build novel predictive models and, in this case, revealed plausible biological correlates associated with the risk of RCBC.
Short Abstract: Due to increasingly affordable high-throughput techniques for measuring molecular features of biomedical samples, there has been a huge increase in different types of multi-omics datasets, containing, for example, genetic or histone modification data. Due to the multi-view characteristic of the data, established approaches for exploratory analysis are not directly applicable. Here we present web-rMKL, a web server that provides integrative dimensionality reduction with subsequent clustering of samples based on data from multiple inputs. The machine learning method was introduced in a multi-omic cancer subtype discovery setting, recently performing best for clinical enrichment in a comparison of state-of-the-art multi-omic and multi-view clustering algorithms. However, the method is not limited to clustering patients for cancer subtype discovery, as exemplified by a second use case on stem cell differentiation. web-rMKL offers an intuitive interface for uploading data and setting the parameters. rMKL-LPP runs on the back end, and the user may receive notifications once the results are available. We also introduce a preprocessing tool for generating kernel matrices from tables containing numerical feature values. This program can be used to generate admissible input if no precomputed kernel matrices are available. The web server is freely available at https://web-rmkl.org.
Short Abstract: Prediction of gene networks is useful for comprehending gene regulation mechanisms and identifying disease-associated genes. During cancer progression, both the expression levels of gene clusters with specific biological functions and their gene networks change. In addition, cancer progression induces fluctuations in the expression levels not only of the gene cluster involved in the cell cycle but also of that involved in cilium organization. In this study, we investigated the correlation between the expression patterns of gene clusters related to cilium organization and other biological functions, using expression data from non-small cell lung cancer patients. First, k-means++ clustering and enrichment analysis were applied to 4,600 differentially expressed genes. Next, the correlation coefficient between the centroids of each cluster was calculated for normal and cancer cells, respectively. As a result, a cluster with cilium organization-related genes as its major genes was observed. In normal cells, the correlation coefficient between the centroids of the clusters dominated by cilium organization- and vasculature development-related genes was calculated to be -0.83. By contrast, the coefficient was 0.75 in cancer cells, suggesting a cooperative expression pattern between the two gene clusters.
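The centroid-correlation step described above amounts to a Pearson correlation between two cluster centroids; this minimal sketch uses made-up centroid profiles (the values and sample count are illustrative assumptions, not the study's data):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical centroids of two gene clusters across four samples:
cilium_centroid = [1.0, 2.0, 3.0, 4.0]
vasc_centroid = [4.1, 3.0, 2.2, 0.9]  # roughly anti-correlated profile

r = pearson(cilium_centroid, vasc_centroid)  # negative, as in normal cells
```

A strongly negative r (as with the -0.83 reported for normal cells) indicates opposite expression trends between the two clusters, while a strongly positive r (the 0.75 in cancer cells) indicates a cooperative pattern.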
Short Abstract: Transcription profiling, with the use of sophisticated machine learning algorithms, is providing new insights into the complex process of acquired drug resistance (ADR) in cancer. We have applied a novel feature selection method based on personalized machine learning approaches to analyze diverse acquired drug resistance models, including first- to third-generation epidermal growth factor receptor (EGFR) tyrosine kinase inhibitors (TKIs), taxane, platinum and anthracycline. Using meta-analysis-derived regularization regression through leave-one-study-out cross-validation (LOSOCV) and effective metaheuristics for hyperparameter optimization, we identified a highly accurate yet compact gene set classifier that efficiently discriminates ADR cells from sensitive cells, with excellent prediction performance on external cohorts. We also found that the model showed high transferability both across anticancer drugs not used to build the model and to intrinsic anticancer drug resistance in two large-scale pharmacogenomic resources, the Cancer Genome Project (CGP) and the Cancer Cell Line Encyclopedia (CCLE). We also discovered common pathway dysregulation signatures for ADR to various anticancer drugs, which can provide new insights for follow-up studies.
Short Abstract: A major problem for disease treatment is that while antibiotics are a powerful counter to bacteria, they are ineffective against viruses. With overlapping symptom sets, it is inadequate to use symptoms alone for diagnosis. Without proper identification equipment, medical practitioners can unnecessarily prescribe antibiotics for patients without bacterial infections, fostering anti-microbial resistance (AMR) and enabling implementation of ineffective treatment methods. For soldiers in field operations, where injuries facilitating infections are frequent, there exists a need for deployable diagnosis tools. Dstl (Defence Science and Technology Laboratory), who work on behalf of the Ministry of Defence (MOD), identified this need and set out to create a tool enabling differential diagnosis based on gene expression. We at the CBF were tasked with the biomarker discovery, to create a panel of genes enabling differential diagnosis. There is evidence for many different machine learning methods being utilised in the meta-analysis solution space. As such, we have been using random forests for classification over a large-scale meta-analysis, a popular method of combining studies that harnesses the statistical power of many data points to provide more precise and robust estimates.
Short Abstract: Background: Unsupervised neural network models have shown their usefulness with noisy single cell mRNA-sequencing data (scRNA-seq), where the models generalize well despite the zero-inflation of the data. Variants of autoencoders have been useful for denoising of single cell data, imputation of missing values and dimensionality reduction. Results: We present a feature with the potential to greatly increase the usability of autoencoders. By applying saliency maps to the representation layer, we can identify the genes that are associated with each hidden unit. We apply a soft orthogonality constraint on the representation layer to aid the deconvolution of the input signal. Our model can delineate biologically meaningful modules that govern a dataset, as well as give information as to which modules are active in each single cell. Importantly, most of these modules can be explained by known biological functions, as provided by the Hallmark gene sets. Conclusions: We discover that tailored training of an autoencoder makes it possible to deconvolute biological modules inherent in the data, without any assumptions. In perspective, our model in combination with clustering methods is able to provide information about which subtype a given single cell belongs to, as well as which biological functions determine that membership.
Short Abstract: Due to the complexity of cancer, patient clustering has been used to disentangle the observed heterogeneity and identify cancer subtypes that can be treated specifically. Kernel based clustering approaches enable the integration of several data types, which is highly relevant when considering complex diseases. However, clusterings remain hard to evaluate, and the influence of individual features on the final result is often unclear. Our proposed extension of multiple kernel based clustering enables the characterization of each identified patient cluster using the features with the highest impact on the result. To this end, we combine feature clustering with multiple kernel dimensionality reduction and introduce a score for the feature cluster impact on a patient cluster. We applied the approach to different cancer types described by four different data types, aiming at identifying integrative patient subtypes and understanding which features were important for their identification. Our results show that not only does our method achieve state-of-the-art performance according to survival analysis, but it also produces meaningful explanations for the molecular bases of the subtypes based on the high-impact features. These support the validation of potential cancer subtypes and enable the formulation of new hypotheses concerning individual patient groups.
Short Abstract: The possibility to sequence DNA in cancer samples has triggered much effort recently to identify the forces at the genomic level that shape tumour emergence and evolution. Two main approaches have been followed for that purpose: (i) deciphering the clonal composition of each tumour by using the observed prevalences of somatic mutations, and (ii) elucidating the mutational processes involved in the generation of those same somatic mutations. Currently, subclonal and mutational signature deconvolutions are performed separately, although they are both manifestations of the same underlying process. We present CloneSig, the first method that jointly infers the subclonal composition and mutational signature evolution of a tumour sample from bulk sequencing. CloneSig is based on a probabilistic graphical model that models somatic mutations as derived from a mixture of subclones in which different mutational signatures are active. Parameters of the model are estimated using an EM algorithm. We have conducted extensive simulations of various tumour evolution scenarios that illustrate that CloneSig's joint inference allows an accurate reconstruction of both processes. Application to real data shows that results obtained with whole exome sequencing recapitulate characteristics observed with whole genome sequencing, illustrating CloneSig's ability to recover relevant biological signal in this noisier setting.
Short Abstract: Defining cell types in high-dimensional single cell data, such as mass cytometry (CyTOF), single cell RNA-sequencing or images, has been studied extensively. Here we introduce CellGen, a novel autoencoder-based clustering approach, to learn and interpret the multimodal distribution of single cell data. CellGen is built on an autoencoder (AE) setup in which the decoder consists of a mixture-of-experts architecture. This specific architecture allows the various modes of the data to be automatically learned by the individual experts. Additionally, we make use of a multivariate normal (MVN) distributed latent space. As proof of concept, we tested CellGen on reconstructing synthetic data sampled from MVN distributions. We conducted preliminary tests with CellGen to specifically define rare subpopulations in CyTOF measurements of samples of peripheral blood mononuclear cells, where we achieved state-of-the-art results when comparing F-measures. In comparison to competitors, we see the consistent superiority of CellGen, especially in detecting both abundant and rare cell populations, where baseline approaches performed well in detecting either abundant or rare cell populations, but not both. Approaches like CellGen have the ability to perform unsupervised detection of abundant and rare subpopulations. This could enhance exploratory single-cell initiatives such as the Human Cell Atlas or be applied in personalized medicine.
Short Abstract: Protein aggregation is a major unsolved problem in biochemistry with implications for several human diseases as well as for biotechnology and bio-material sciences. The aggregation kinetics of proteins is very sensitive to even small changes in amino acid sequence or in experimental conditions such as pH, temperature, and ionic/peptide concentration. Therefore, we have collected experimentally validated information on the mechanistic and kinetic aspects of protein aggregation from the literature and compiled it in a database called CPAD-2.0 (curated protein aggregation database). The role of point mutations in aggregation kinetics was further analyzed for various proteins using physicochemical, energetic, conformational and contact-dependent properties. We developed a machine learning model to predict aggregation rate-enhancing or -mitigating point mutations using sequence information, and quantitatively predicted the change in aggregation rate upon point mutation through regression analysis using structural information. The analyses suggest that the kinetics of protein aggregation is impacted by local factors, such as polypeptide length and the location and conformation of the mutation site, rather than global factors, such as the overall three-dimensional protein fold, or mechanistic factors, such as the presence of aggregation-prone regions.
Short Abstract: Chemical reactions are a foundation for understanding organic processes and discovering new bioactive compounds. To build in silico models for reactions we need a precise description of their mechanism. To this end, all atoms in the reactants and products need to be mapped, i.e., aligned one-to-one. For many reactions the atom mapping requires chemical insight and cannot simply be derived from the principle of the smallest edit distance. In this work we tested two widely used software packages for atom mapping, Indigo and ChemAxon, against mappings derived by human experts. We found significant differences in the mappings and identified the typical error patterns. The most common of them are bond reorganization in conjugated systems, incorrect stereochemistry, and choosing energetically unfavorable assignments. We observed additional issues when the molecules have a high number of symmetries and in dimerization reactions. Finally, we analyze the effect of these errors on building reaction prediction models. We conclude that the current state of atom mappers is not always sufficient for building reliable machine learning models for reactions, and extra preprocessing steps need to be taken.
Short Abstract: Protein structural dynamics at equilibrium can be represented as a set of vibrational modes ordered by their amplitude using Normal Mode Analysis (NMA). However, translating the vast range of vibrational modes to biological function remains a challenging problem. Here we present a modelling pipeline that couples an enzyme’s physicochemical properties to NMA-derived modes. Normal modes of single-residue mutants of an enzyme are encoded in a vector. Dynamic profiles are generated for physicochemical parameters, P(x,t), by coarse-grained methods. Each dimension of the encoded vector represents an integrated value of the differential between the profiles of the mutant and the wild type over time (t) and space (x). A multi-layer perceptron classification model is trained to predict the effect of a mutation on enzymatic activity. Experimental activity data for Hepatitis C Virus NS3 helicase mutants was used to test our framework. Our results identify steric volume at the ATP binding site as a discriminatory feature for its ATPase activity, while solvent accessible surface area changes are indicative of RNA binding affinity. Predictive models thus derived can be used to link biological function to structural characteristics from NMA and can aid protein engineering solutions.
Short Abstract: Understanding real-world datasets is often challenging due to size, complexity and/or poor knowledge about the problem to be tackled (e.g. electronic health records, OMICS data, ...). To achieve high accuracy in important tasks, equally complex machine learning models are usually used. In many situations (e.g. diagnosis prediction) the decisions achieved by such automated systems can have significant, and potentially deleterious, consequences. It is therefore necessary for a model to not only provide a correct prediction, but to also provide an accompanying explanation regarding why a certain decision was reached. However, one flaw in current interpretable machine learning research is the lack of an agreed-upon definition of interpretability, which hinders the fair comparison of existing interpretability methods. In our work we aim to properly define interpretability for problems and models in computational biology. This will enable us to systematically benchmark existing interpretability methods, and potentially lead to the development of novel methods. Our work will focus on important tasks in computational biology, such as transcription factor binding prediction or 3D DNA structure reconstruction, leveraging publicly available data (e.g. TCGA) or data produced by close collaborators.
Short Abstract: Alternative splicing is one of the most important pre-mRNA processing steps for normal development and human disease. Its regulation has long been thought to involve genetic elements, but was recently extended with multi-level epigenetic properties, such as histone modifications, yielding an increasingly complex regulation model. It is highly desirable to develop computational tools to decipher an extended splicing code incorporating both the genetic and epigenetic mechanisms across diverse human cell/tissue types and developmental stages. First, we are developing a deep learning approach, DeepCode, to integrate multiple large-scale genomic and epigenomic datasets to decipher the splicing code. Our studies demonstrated its superior performance and highlighted the importance of epigenetic features in predicting splicing patterns in multi-lineage differentiation from human ESCs. Second, we take further steps toward functional interpretation of DeepCode. Taking two well-established cell lineages, we modeled the chromatin state transitions across the lineage trees to identify the most dynamic splicing modules and crucial AS genes for cell fate decisions. We also developed a probabilistic model within a deep learning framework. It enables identifying the most dynamically switching splicing modules and AS genes, which are functionally linked to cell-cycle machinery and related pathways based on preliminary studies.
Short Abstract: Deep learning protein-based prediction models have gained great popularity in recent years. For these models, protein sequences are usually encoded into feature vectors. However, these encodings are generally aggregative and not bijective, or require sequences to be alignable, thus decreasing the generalisation capability of the models. The use of raw amino acid sequences as model input is now gaining popularity. Padding is usually applied to bring proteins of different lengths to the same dimension, but little is known about how this addition affects model performance. On the other hand, state-of-the-art deep learning models are not yet taking advantage of big data frameworks and distributed computation. Although there have been some approaches towards this integration, there are still no stable solutions. Overcoming this gap is crucial for getting the maximal potential out of the growing public biological datasets. In this work, we build a scalable deep learning model by integrating big data and deep learning frameworks. We then analyse different bijective protein encodings in a protein function prediction problem and study the impact that padding has on the performance of the model. Our results provide good practices for distributed computing of protein-based deep learning models.
Short Abstract: Viruses, especially bacteriophages, may encode 50% to >90% unknown genes, depending on their environment. This diversity is due to the high natural selection pressures imposed by the parasite-host/predator-prey relationship between viruses and their obligatory hosts. Viruses are also agents of horizontal gene transfer and carry toxins and other virulence factors. We are using machine learning approaches to identify viral-associated ORFs and predict their function within metagenomes, viral genomes, and within host genomes (i.e., prophages or proviruses). This proof-of-principle approach will be expanded to all genetic ‘dark matter’. Artificial neural networks are useful as universal approximators for mapping protein sequence to gene annotation. Our current feed-forward/back-propagation networks are trained with ~410 features to classify viral ORFs into one of 11 classes (10 viral structural protein categories + “other”). ANNs were trained using data from 50,000 ORFs and have an average f1-score of 0.9 on the test set of >7K genes. We are increasing the predictive accuracy by using information theory to identify features that maximize the mutual information between feature and class, and by including enough features to increase the total mutual information to more than one bit per class. This also brings some interpretability to how the ML algorithm classifies.
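The mutual-information criterion mentioned above can be sketched for a single discrete feature; the toy ORF samples, class labels, and the plug-in (empirical) estimator are illustrative assumptions, not the study's data or exact estimator:

```python
import math

def mutual_information(pairs):
    """I(feature; class) in bits, estimated from (feature_value, class_label) samples."""
    n = len(pairs)
    joint, px, py = {}, {}, {}
    for x, y in pairs:
        joint[(x, y)] = joint.get((x, y), 0) + 1
        px[x] = px.get(x, 0) + 1
        py[y] = py.get(y, 0) + 1
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

# Hypothetical binary feature that perfectly separates two ORF classes;
# such a feature carries exactly one bit about the class.
samples = [(1, "capsid")] * 5 + [(0, "tail")] * 5
one_bit = mutual_information(samples)
```

Ranking features by this quantity and accumulating them until the total exceeds one bit per class is the selection strategy the abstract describes.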
Short Abstract: Mastering the relationship between the sequence, the structure and the function of a protein is a fundamental challenge of bioinformatics. Indeed, on the one hand, two proteins may have as little as 20% sequence identity, yet have almost identical structure and function. On the other hand, a single mutation may completely impair the ability of a protein to fold and function properly, resulting in a potentially deadly genetic disorder for the carrier. What features of a protein sequence are required for its function? Here we show that Restricted Boltzmann Machines (RBM), a machine learning algorithm that jointly learns a representation and a probability distribution over complex high-dimensional data, can efficiently learn from the sequences of protein families. The features inferred by the RBM are biologically interpretable: they are related to structure (such as residue-residue tertiary contacts, α-helices, β-sheets, intrinsically disordered regions), to function (such as activity and ligand specificity), or to phylogenetic identity. In addition, RBM can be used to design new protein sequences with putative properties by composing and turning the different features up or down at will. Overall, RBM are a versatile and practical tool to unveil and exploit the genotype-phenotype relationship for protein families.
Short Abstract: While emerging high-throughput technologies enable researchers to use more data in their machine learning analyses, methods lag behind in terms of efficiency and interpretability. For integrative cancer analysis, regularized multiple kernel learning with dimensionality reduction (rMKL-DR) stands out in comparison to other methods. For unsupervised analysis of samples to detect interesting subgroups, it is not straightforward how to weight different measurements and their impact on the clustering result. rMKL-DR not only gives insights into this weighting, it also offers the possibility of using semi-supervised methods to further improve results. In this work we present an advanced rMKL-DR approach based on preprocessing the kernels: kernel k-means clustering of patients according to methylation, gene expression and miRNA data, followed by slicing the kernels according to the emerging patient clustering, rearranging values in the kernel matrices, and using the newly generated matrices as input for rMKL-DR. We benchmark our solution by comparing survival rates of patients from our model to those from rMKL-DR, using p-values from the Cox regression model. Our results show better performance on some cancer types; however, no slicing method outperforms all others.
Short Abstract: Over the past decades, it has become increasingly evident that genetic background is a key determinant of many types of human diseases. Identifying the genes and mutations that underlie human disease phenotypes is important for multiple purposes, including: (1) understanding disease mechanisms via the functions of the genes and the broader background of their biological pathways; (2) pre- and post-natal risk assessment; (3) translational advances in precision medicine. To this end, we annotate the set of human protein-coding genes using a large array of biological annotations. We then apply state-of-the-art unsupervised deep learning techniques (graph embeddings) and couple these with a neural network classifier in order to model the gene-disease associations cataloged by the Human Gene Mutation Database (HGMD). We both model the known associations between a subset of human genes and rare Mendelian phenotypes and predict novel associations not yet in HGMD. We note that our classifier is hierarchical, able to handle the parent-child relationships that exist between disease phenotypes as encoded by the Human Phenotype Ontology (HPO). Both the graph embeddings and the classifier network are trained together, making this an end-to-end approach. We show good prediction performance, with training accuracy up to 85% for certain phenotypes.
Short Abstract: Machine learning models trained on protein data tend to underperform due to the low amount of annotated data. Current research has shown that Language Models (LM) trained on unlabeled protein sequences can be used to improve performance on protein prediction tasks. However, protein LMs have not been fully studied, and their full capabilities are yet to be explored. A protein LM can be defined as a model that predicts the next amino acid given the context preceding that amino acid. In this research, we focus on assembling a high-quality protein dataset suitable for protein language modeling and training a recurrent neural language model on this dataset. We show that the protein LM learns to predict the next amino acid in a sequence and creates amino acid representations that are context dependent. In addition, our protein LM is able to predict the probability of a protein sequence and can thus discriminate between real and fake proteins. Finally, we show that our model can also generate new protein sequences with features similar to those of real proteins.
Short Abstract: The emergence of single-cell RNA-seq (scRNA-seq) has revolutionized the ability to study complex biological systems. The analysis of scRNA-seq data typically involves quality control and filtering, dimensionality reduction of features, and identification of all cell clusters via unsupervised clustering methods. The next major step is the identification of marker genes that uniquely define each cell cluster. This task is often attempted by examining the top up-regulated genes in a one-versus-all differential expression analysis for each cell cluster. However, this approach may not work well when cell subpopulations are small and/or are mostly similar in expression to one another. Here we propose a multi-class decision tree approach utilizing multivariate Gaussian density and probability entropy-based information gain that can identify combinations of genes that distinguish various cell clusters. This method was applied to datasets from mouse lung and human peripheral blood mononuclear cells (PBMCs) that were clustered with Celda, a method developed by our lab for scRNA-seq clustering. In each dataset, core groups of gene modules were identified that uniquely distinguished cell subpopulations. Overall, this approach will help facilitate the biological interpretation of clustered single-cell data.
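The entropy-based information gain at the core of the decision tree approach above can be sketched for a single gene split; the expression values, cluster labels, and threshold are toy assumptions, and the multivariate Gaussian density component of the actual method is omitted for brevity:

```python
import math

def entropy(labels):
    """Shannon entropy (bits) of a list of cell-cluster labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for l in labels:
        counts[l] = counts.get(l, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(expr, labels, threshold):
    """Gain from splitting cells on one gene's expression at a threshold."""
    left = [l for e, l in zip(expr, labels) if e <= threshold]
    right = [l for e, l in zip(expr, labels) if e > threshold]
    n = len(labels)
    return entropy(labels) - (len(left) / n * entropy(left)
                              + len(right) / n * entropy(right))

# Hypothetical marker gene: high in cluster "B", low in cluster "A".
expr = [0.1, 0.2, 0.3, 2.5, 2.8, 3.1]
labels = ["A", "A", "A", "B", "B", "B"]
gain = information_gain(expr, labels, 1.0)
```

A gene that perfectly separates two equally sized clusters recovers the full entropy of the labels (1 bit here), which is why such genes rise to the top as split candidates.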
Short Abstract: In cancer, the primary tumour's organ of origin and histopathology are the strongest determinants of its clinical behaviour, but in 3% of new cancer diagnoses, a cancer patient presents with a metastatic tumour and no obvious primary. Challenges also arise when distinguishing a metastatic recurrence of a previously treated cancer from the emergence of a new one. Here we train a deep learning classifier to predict cancer type based on patterns of somatic passenger mutations detected in whole genome sequencing (WGS) of 2606 tumours representing 24 common cancer types. Our classifier achieves an accuracy of 91% on held-out tumour samples from this set. On primary and metastatic samples from an independent cohort, it achieves accuracies of 87% and 85%, respectively. This is double the accuracy of pathologists who were presented with a metastatic tumour without knowledge of the primary. Surprisingly, adding information about driver mutations reduced classifier accuracy. Our results have immediate clinical applicability, underscoring how patterns of somatic passenger mutations encode the state of the cell of origin, and can inform future strategies to detect the source of cell-free circulating tumour DNA.
Short Abstract: Cell-type identification is one of the most substantial tasks in single-cell RNA sequencing (scRNA-seq). To date, quite a few methods have been proposed to classify single-cell identities from transcriptomic data. However, current methods treat genes as independent features when assigning cell types, disregarding their interactions. Although classical approaches perform accurately in some tasks, they fail to classify cells in many settings because of the noisy nature of scRNA-seq data, especially in high-dimensional settings. Here, we take gene interactions into account by constructing an interaction network on a subset of genes. We then employ a deep learning approach designed for graphs, the Graph Convolutional Network (GCN), to carry out the classification. We validate our method on several scRNA-seq datasets with predefined cell types, including a mouse cell-type atlas and a zebrafish embryo cell-type atlas. Besides the high accuracy of cell-type classification (> 95% on test data), our method accurately removes unwanted variation such as batch effects. In conclusion, interactions among genes in a cell are highly informative about its developmental state and can be exploited by appropriate methods such as ours for precise cell-type identification.
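The abstract above names the standard graph-convolution propagation rule; a minimal numpy sketch of one such layer, assuming a dense adjacency matrix `A` over genes and a node-feature matrix `H` (the toy inputs are illustrative, not the authors' data):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: features are averaged over each
    node's neighbourhood (including itself, via self-loops) with
    symmetric degree normalization, then passed through a dense
    transform and ReLU."""
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # symmetric normalization
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt
    return np.maximum(0.0, A_norm @ H @ W)
```

With an empty graph and an identity weight matrix, the layer reduces to an elementwise ReLU of the input features, which makes the neighbourhood-averaging role of `A` easy to see.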
Short Abstract: The success of deep learning models has led to their rapid adoption for genomics tasks such as predicting the DNA binding sites of proteins and RNA splicing outcomes. One major limitation of such models, however, especially in biomedical applications, is their black-box nature, which hinders interpretability. A recent promising method to address this limitation is Integrated Gradients (IG), which identifies the features associated with a deep model's prediction for a sample. IG works by aggregating the gradients along the inputs that fall on the straight line between a baseline point and the sample of interest. In this work we address several limitations of IG. First, we define a procedure to identify features significantly associated with a specific prediction task, such as differentially included exons in the brain. Then, we assess the effect of using different reference-point definitions, and of replacing IG's original single linear path with nonlinear variants. These variants include neighbor paths in the original space (O-N-IG) and the hidden space (H-N-IG), and a linear path in the hidden space (H-L-IG). Together, our proposed methods for selecting significant features, reference points, and paths for integrated gradients establish a framework for interpreting deep learning models on genomic tasks.
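The straight-line IG computation the abstract describes is compact; a minimal numpy sketch using a midpoint Riemann sum, with a toy quadratic standing in for a trained network (the function and its gradient here are illustrative, not from the poster):

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    """Approximate IG attributions by aggregating gradients along
    the straight line from `baseline` to the sample `x`
    (midpoint Riemann sum over `steps` points)."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.array([grad_f(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# toy "model": f(x) = sum(x_i^2), with analytic gradient 2x
f = lambda z: float(np.sum(z ** 2))
grad_f = lambda z: 2 * z

x = np.array([1.0, 2.0, 3.0])
attr = integrated_gradients(grad_f, x, np.zeros_like(x))
# completeness axiom: attributions sum to f(x) - f(baseline)
```

The nonlinear variants in the abstract replace the `baseline + a * (x - baseline)` path with neighbor-based or hidden-space paths while keeping the same gradient aggregation.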
Short Abstract: Assessing similarity is highly important for bioinformatics algorithms that determine correlations of biological information. Similarity can appear by chance, particularly for lowly expressed entities. This is especially relevant in single-cell RNA-seq (scRNA-seq) data because read counts are much lower than in bulk RNA-seq. Recently, a Bayesian correlation scheme that assigns low correlation values to correlations arising from lowly expressed genes has been proposed to assess similarity in bulk RNA-seq and miRNA data. Our goal is to extend the Bayesian correlation to scRNA-seq data by considering three ways of computing similarity. First, we compute the similarity of pairs of genes over all cells. Second, we identify specific cell populations and compute the correlation within those populations. Third, we compute the similarity of pairs of genes over all clusters, considering the total mRNA expression. We show that Bayesian correlations are more reproducible and depend less on the number of cells than Pearson correlations in all three scenarios. We conclude that Bayesian correlation is a robust similarity measure for scRNA-seq data. The Bayesian method allows researchers to study similarity between pairs of genes without biasing the results with spurious correlations, thereby increasing experimental reproducibility.
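One simplified way such count-aware shrinkage can work, sketched here with flat Beta posteriors over per-cell expression proportions, is to let posterior uncertainty inflate the denominator of the correlation; this is an illustration of the idea, not necessarily the published scheme's exact formulation:

```python
import numpy as np

def bayesian_correlation(x, y, alpha=1.0):
    """Correlation of posterior mean proportions for two genes,
    with mean posterior variance added to each denominator, so
    correlations driven by low counts are shrunk toward zero."""
    x = np.asarray(x, float); y = np.asarray(y, float)
    n = len(x)
    def posterior(c):
        # Beta posterior for the share of gene reads in each cell
        a = c + alpha
        b = c.sum() - c + alpha * (n - 1)
        mean = a / (a + b)
        var = a * b / ((a + b) ** 2 * (a + b + 1))
        return mean, var
    mx, vx = posterior(x)
    my, vy = posterior(y)
    cov = np.mean((mx - mx.mean()) * (my - my.mean()))
    sx = np.sqrt(np.var(mx) + vx.mean())
    sy = np.sqrt(np.var(my) + vy.mean())
    return float(cov / (sx * sy))
```

Two genes with identical low-count profiles (Pearson correlation exactly 1) receive a much lower Bayesian correlation than the same pattern at high counts.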
Short Abstract: The high-dimensional gene expression profiles of single cells, inferred by deep RNA sequencing (RNA-Seq), provide information for modelling cellular interaction networks, which is important for better understanding, for example, the pathogenesis of diseases. We employ generative models, specifically deep Boltzmann machines (DBMs), to investigate connections between cell identity and the underlying functional characteristics in single-cell RNA-Seq data. In particular, we train DBMs to learn the joint distribution of neurotransmitter receptor gene expression and marker gene expression in different neuron types in the mouse brain. By sampling from the DBMs conditional on the expression levels of marker genes, which indicate cell identity, we study the connection of cell identity with differentiated neuronal function, such as the type of neurotransmitter employed by the cell. Using the expression data of a few marker genes allows us to infer the expression levels of, for example, receptor genes for GABA and glutamate signaling in the corresponding cells, indicating that the DBM learned the connection of marker genes with genes related to differentiated cellular function. These findings demonstrate the potential of DBMs to extract biologically meaningful connections from single-cell RNA-Seq data.
Short Abstract: De novo drug development is hampered by tedious in vitro and animal-screening experiments exploring the enormous space of candidate drugs. In addition, the primary goal of personalized cancer medicine is to tailor a treatment to a patient's tumor molecular profile. To tackle these challenges, it is imperative to devise techniques that enable effective screening of anticancer compound efficacy. We analyze a family of models that makes use of three data modalities: the compound's structure in SMILES format, gene expression profiles, and prior knowledge from PPI networks. We propose a novel architecture for interpretable anticancer compound sensitivity prediction using a multimodal attention-based convolutional encoder. Our encoder outperforms a baseline model based on molecular fingerprints as well as a selection of SMILES encoders. In addition, we demonstrate that our model outperforms previously reported state-of-the-art results. Finally, we validate the attended genes and molecular substructures with domain knowledge and verify that the learned attention weights are meaningful. The generalization power and interpretability of our model enable in silico evaluation of anticancer compound efficacy on unseen cancer cells, positioning it as a valid solution for the development of personalized therapies as well as for the evaluation of candidate compounds in de novo drug design.
Short Abstract: We expect novel pathogens to arise due to their fast-paced evolution, and new species to be discovered thanks to advances in DNA sequencing and metagenomics. Moreover, recent developments in synthetic biology raise concerns that some strains of bacteria could be modified for malicious purposes. Traditional approaches to open-view pathogen detection depend on databases of known organisms, which limits their performance on unknown, unrecognized, and unmapped sequences. In contrast, machine learning methods can infer pathogenic phenotypes from single NGS reads, even when the biological context is unavailable. We present DeePaC, a Deep Learning Approach to Pathogenicity Classification. It includes a flexible framework allowing easy evaluation of neural architectures with reverse-complement parameter sharing. We show that CNNs and LSTMs outperform the state of the art based on both sequence homology and machine learning, cutting the error rate almost in half when predictions for both mates in a read pair are integrated. To provide a degree of interpretability for the learned models, we identify and visualize sequence motifs contributing to the final classification decision. Investigating the relation between particular subsequences and class membership leads to the identification of prospective markers of pathogenicity in a human host.
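Strand-invariance is the property behind reverse-complement parameter sharing; a hedged simplification of that idea (averaging the outputs of an arbitrary scoring function over a read and its reverse complement, rather than sharing weights inside the network as DeePaC does) looks like:

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[b] for b in reversed(seq))

def strand_invariant_score(model, read):
    """Average a model's score over a read and its reverse
    complement, so the prediction does not depend on which
    strand happened to be sequenced."""
    return 0.5 * (model(read) + model(reverse_complement(read)))
```

Any `model` wrapped this way gives identical scores for a read and its reverse complement by construction; the toy scorer in the test below is illustrative.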
Short Abstract: Single-cell RNA sequencing (scRNA-seq) has become a ubiquitous method for studying gene expression dynamics. However, the high dimensionality of the datasets and the inherent technical noise make the analysis of scRNA-seq data challenging. An important computational strategy for analysing scRNA-seq data is dimensionality reduction, which allows one to learn a latent representation of the cells that can be used for data interpretation, visualisation, and feature extraction. Existing approaches typically ignore the rich structure of single-cell experiments, which may include multiple groups of cells or multiple profiled omics layers. We propose a statistical framework for learning the latent sources of cell-to-cell variability in structured data sets. The model builds upon group factor analysis, a Bayesian framework that includes hierarchical sparsity priors on factor loadings to efficiently integrate multi-view data. Here, we redefine the sparsity priors to include a group-specific regularisation in order to disentangle the activity of factors across multiple groups of cells. Effectively, this allows quantifying, for every latent factor, how much variability is shared between the different groups of cells, e.g. different cell types, tissues, or donor cohorts. Importantly, we employ stochastic variational inference and GPU-accelerated computations in order to accommodate large volumes of single-cell sequencing data.
Short Abstract: Formalin fixation and paraffin embedding (FFPE) is the routine clinical practice for preserving biopsy tissue samples. However, this fixation process damages DNA in ways that manifest as C:G>T:A mutations in sequencing reads, thereby hindering mutation detection. We developed a supervised machine-learning variant-refinement method that classifies C:G>T:A variants as fixation-induced deaminations or non-deaminations. We used variant calls from exome sequencing datasets of 27 FFPE breast cancer specimens and annotated C:G>T:A changes as deaminations or non-deaminations, resulting in 2,134,171 and 69,052 variants, respectively. We defined a set of descriptors with the potential to predict deamination artifacts, related to allele frequency, sequencing quality, fragment length, and read-orientation bias, among others. After testing different algorithms, tree-based approaches (XGBoost, random forest) outperformed the rest in leave-one-sample-out cross-validation, with average AUC values ranging from 0.90 to 0.95. Feature-importance analysis ranked read-orientation bias and strand bias as the features contributing most to the models. This study provides a promising way of removing DNA sequence noise from FFPE samples and facilitates the use of these samples in molecular testing.
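Leave-one-sample-out cross-validation groups variants by specimen, so no specimen contributes to both training and testing; a minimal sketch (the `sample` key on each variant record is illustrative):

```python
def leave_one_sample_out(variants):
    """Yield (held_out, train, test) splits where each test set is
    all variants from one specimen. Grouping by specimen avoids
    within-sample leakage between training and evaluation."""
    samples = sorted({v["sample"] for v in variants})
    for held_out in samples:
        train = [v for v in variants if v["sample"] != held_out]
        test = [v for v in variants if v["sample"] == held_out]
        yield held_out, train, test
```

With 27 specimens as in the abstract, this produces 27 folds, each evaluated on one fully held-out specimen.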
Short Abstract: Cis-regulatory elements (cis-REs) regulate gene expression in a cell-specific manner and can harbor disease-associated genetic variants leading to dysregulated gene expression. Although chromatin immunoprecipitation (ChIP) assays can detect cis-REs and their functional annotations (e.g., enhancers, insulators), they require millions of cells, making comparisons between individuals and conditions infeasible with small sample sizes. Alternatively, the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) identifies open chromatin regions (OCRs) corresponding to cis-REs with as few as 500-50,000 cells; however, determining the functional annotations of these loci remains a challenge. We trained neural network models to functionally annotate OCRs in clinical samples solely using ATAC-seq data in CD14+ monocytes, CD4+ T cells, GM12878, HSMM, islet, K562, and peripheral blood mononuclear cells. Shallow neural network models (based on 24 features) recapitulated enhancers in EndoC-βH1, naïve CD8+ T cells, and MCF7 with high accuracy and inferred individual-specific enhancer activity in 19 human islet samples. We developed novel encoders of ATAC-seq data for training deep learning models, which outperformed shallow models in predicting insulators and enhancer subtypes across cell types. Applying these models to clinical ATAC-seq samples will uncover functional loss or gain of cis-REs across individuals in health and disease and will provide insights into the underlying genetic variants.
Short Abstract: Gene-expression profiles of patient tissues have the potential to predict clinical outcomes, such as prognoses and treatment responses. Classification algorithms promise to increase the accuracy of such predictions. However, due to the vast array of algorithms and hyperparameters, it is difficult for researchers to know which algorithm(s) and parameter combination(s) will lead to the highest accuracy for a given dataset. In practice, researchers often choose algorithms and parameters arbitrarily, which may result in suboptimal performance and/or bias. A better alternative is to make such decisions in a data-driven manner. To address this need, we evaluated the performance of 52 “shallow” and “deep” classification algorithms on 50 gene-expression datasets. Using default parameters, we found that kernel-based, ensemble-based, and neural-network-based algorithms outperformed other types of algorithms, but performance varied considerably across the datasets. We repeated the analysis but optimized the predictions in a nested manner, based on 1008 total parameter combinations; this improved algorithm performance, sometimes dramatically. However, even the algorithms that performed best on average did quite poorly in some cases, confirming the need to choose algorithms and parameters empirically. Because we held potential confounding factors constant, these analyses provide insights that may be obfuscated in other benchmark studies.
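Optimizing predictions "in a nested manner" means the parameter choice is itself made inside each cross-validation split; a minimal sketch, where `fit_score(train_folds, val_fold, params)` is a stand-in for fitting and scoring any of the benchmarked algorithms:

```python
def nested_cv(folds, param_grid, fit_score):
    """Nested CV: for each outer test fold, pick the parameter
    setting with the best mean score over the inner folds, then
    report that setting's score on the held-out outer fold."""
    outer_scores = []
    for i, test in enumerate(folds):
        inner = [f for j, f in enumerate(folds) if j != i]
        def mean_inner(params):
            # inner loop: each inner fold takes a turn as validation set
            return sum(
                fit_score([f for f in inner if f is not v], v, params)
                for v in inner) / len(inner)
        best = max(param_grid, key=mean_inner)
        outer_scores.append(fit_score(inner, test, best))
    return sum(outer_scores) / len(outer_scores)
```

The outer score is an unbiased estimate of the tuned pipeline because the outer test fold never influences the parameter choice.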
Short Abstract: The use of antibiotics carries a high risk of creating antibiotic resistance, which is why researchers are turning to antimicrobial peptides (AMPs), short defense proteins produced by all organisms that have been shown to induce less resistance while fighting disease. The objective of this work is to build a model that can accurately classify AMPs. First, an LSTM autoencoder (AE) was trained on a set of AMPs. The loss function used for this AE is a weighted mean squared error, with weights derived from normalized BLOSUM62 matrix entries for the input and reconstructed residues of each peptide during training. This penalizes wrong reconstructions more heavily, making the AE more accurate. The encoder is then used to build embeddings for a set of AMPs and non-AMPs. These latent features are length-invariant and, along with their respective labels, are input to a neural network classifier with 3 hidden layers. On our test set, this model outperforms current models in the literature for this binary classification task, reaching an accuracy of about 90% on our curated dataset. This system can be used to guide further research and experimentation with AMPs for AMP-based drug design.
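One plausible reading of the weighting scheme, illustrative only since the abstract does not give the exact mapping from BLOSUM62 entries to weights, is that a position's reconstruction error is up-weighted when its (input, reconstructed) residue pair has a low normalized substitution score:

```python
import numpy as np

def blosum_weighted_mse(x_true, x_recon, pair_scores, s_min=-4.0, s_max=11.0):
    """Weighted MSE where a position's weight grows as the
    normalized substitution score of its (true, reconstructed)
    residue pair shrinks: dissimilar substitutions cost more.
    `pair_scores` holds raw BLOSUM62-style scores per position;
    s_min/s_max are the matrix's score range."""
    norm = (np.asarray(pair_scores) - s_min) / (s_max - s_min)  # -> [0, 1]
    w = 1.0 - norm  # similar residue pair -> small weight
    err = (np.asarray(x_true) - np.asarray(x_recon)) ** 2
    return float(np.mean(w * err))
```

The same numeric error thus incurs a larger loss when the reconstructed residue is biochemically dissimilar to the input residue than when it is a conservative substitution.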
Short Abstract: Background: In droplet-based protocols, doublets, where two cells are captured in the same droplet, are a common issue that results in spurious signals. To address this, we present, for the first time, a doublet-identification tool based on deep neural networks. Methods: We processed 60 samples, cell-line mixtures of Raji and K562 titrated at 50% each, on the Tapestri platform. We used 4 known loci that are genotypically distinct between the two cell lines to assign each cell as K562, Raji, or a doublet. These data serve as the ground truth for our classifier. We set up a densely connected neural network classifier using TensorFlow. The number of hidden layers in the classifier equals the number of amplicons in the targeted panel. We split the data from the 60 samples into training and test datasets. The hyperparameters were further optimized separately for low- and high-performing amplicons, since the distribution of reads was noisier for low-performing amplicons. Results: Preliminary results from this method look promising. We were able to detect doublets confirmed by the ground truth and remove them from downstream processing, thereby reducing false-positive clones.
Short Abstract: Machine learning techniques are frequently used in drug discovery and repurposing to predict interactions between drug candidate compounds and target proteins, since experimental approaches are not time- and cost-efficient enough to be applied to the massive compound-target interaction space. Recently, the chemogenomic modelling approach has become popular, in which both compound and target protein features are used as inputs to the predictive models. Such models can thus incorporate targets with little (or no) training data and yield accurate predictions even for targets and compounds not involved in the training set at all. The chemogenomic approach is significant because it can be used to assess the druggability potential of human proteins that have never been targeted before. In this study, we developed two chemogenomics-based computational methods, using deep (pairwise-input deep neural networks) and shallow (random forests) supervised learning techniques, to predict the binding affinities of a large set of kinases against several drug candidate compounds. We participated in the IDG-DREAM Drug-Kinase Binding Prediction Challenge with our methods and performed very well in Round 1. The findings of this study are expected to aid researchers in constructing high-performance compound-target interaction predictors, especially for proteins with limited (or no) training data.
Short Abstract: Antibodies are the major line of defense of the adaptive immune system and one of the most important biologics in contemporary drug discovery. They bind specifically to a certain target, an ability mediated by the pairing of heavy and light chains into a functional antibody. Together, the two chains form a specific antigen-binding interface, which is predominantly comprised of specific and highly variable loop regions, the so-called complementarity-determining regions (CDRs). While the mechanisms governing binding affinity are well studied, the molecular determinants of chain pairing are unknown. Progress in sequencing technology and in methods to detect paired sequences now allows this mechanism to be investigated. It has been suggested in the literature that the formation of functional antibodies is essentially combinatorial and that pairing is therefore random. In this work we present an analysis based on the CDR3 sequences of both the heavy and light chains, which structurally account for a significant part of the inter-chain interface, and find evidence of a bias in chain pairing across individuals and cell types. In contrast to earlier work, we do not use pre-calculated features or gene annotations but the determined sequences directly, taking into account both the amino acids’ identity and their positioning.
Short Abstract: Background: Single-cell DNA sequencing platforms like Tapestri are susceptible to errors from polymerase misincorporation, PCR-mediated recombination in the Tapestri workflow, and DNA damage. Together, these errors make variant calling and minimal residual disease detection challenging. To address these challenges, we developed a novel consensus-sequence-based method to correct the errors and reduce false-positive rates. Methods: The error correction method involves 2 steps. To train the transition probabilities, we parse the reads from cell BAM files. A pileup is generated per amplicon, and we calculate the probability of each base occurring along the amplicon. After generating the multinomial transition probabilities, we apply error correction to each read per cell, using the transition probability matrix to correct to the reference based on a significance threshold. Additionally, to filter out noisy reads before passing the data to our variant caller, we suppress the quality scores of reads with very low coverage. Results: We performed titration experiments on 4 cell-line mixtures with 98.4%, 1%, 0.5% and 0.1% dilutions. Running the Tapestri analytical workflow with the error correction pipeline, we were able to reduce our false-positive rates by ~60-80% while maintaining sensitivity.
Short Abstract: Epigenetic reprogramming accompanies tumor progression. Promoters and enhancers are usually identified by ChIP-seq for H3K27ac and other histone modifications, but fresh tissue is often limiting for patient tumors. As a stable epigenetic mark, DNA methylation (DNAm) can be robustly profiled in both fresh and FFPE samples. However, our ability to understand the biological impact of DNAm is hampered by its limited interpretability. To tackle this barrier, we developed Methyl2Activity, a promoter-based deep-learning framework that accurately infers epigenetic activity and gene expression from the DNA methylome. We designed a hybrid CNN/RNN architecture on static genomic features and dynamic DNA methylation features surrounding TSSs. Methyl2Activity was evaluated in a sample leave-one-out approach using a small neuroblastoma cohort (n=6). The model accurately predicted measured gene expression and histone modifications in the promoter region (expression r2=0.71, H3K27ac r2=0.78, H3K4me3 r2=0.88). To test the generalizability of our model, we trained Methyl2Activity on all neuroblastoma samples and applied it to an independent pediatric rhabdomyosarcoma dataset (n=8). Methyl2Activity again accurately inferred expression and histone modifications (expression r2=0.70, H3K27ac r2=0.72, H3K4me3 r2=0.84). Our results show that the Methyl2Activity machine learning framework achieves accurate inference and generalizes to different tumor types. Methyl2Activity will enable the functional interpretation of DNAm patterns and substantially advance our understanding of epigenetic mechanisms in cancers.
Short Abstract: Phosphorylation is a key regulator of protein function in signal transduction pathways. Protein kinases are the enzymes that catalyze the phosphorylation of other proteins in a target-specific manner, and dysregulation of phosphorylation is associated with many diseases. High-throughput phosphoproteomics methods now enable identification of phosphorylation events and their sites at a global scale; yet, determining which kinase is responsible for catalyzing those phosphosites remains a challenge. Existing computational methods require several examples of known targets of a kinase to make accurate kinase-specific predictions, yet for a large body of kinases, few or no target sites are reported. We present DeepKinZero, the first zero-shot learning approach to predict the kinase acting on a phosphosite for kinases with no known phosphosite information. DeepKinZero transfers knowledge from kinases with many known target phosphosites to kinases with no known sites through a zero-shot learning model. The kinase-specific positional preferences are learned using a bidirectional recurrent neural network. Our computational experiments show that, compared to baseline models, DeepKinZero achieves a significant improvement in accuracy for kinases with no known phosphosites, and we confirm these results on an independent dataset. By expanding our knowledge of understudied kinases, DeepKinZero can help to chart the phosphoproteome atlas.
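The zero-shot idea of scoring any kinase through a shared compatibility function between a phosphosite embedding and a kinase class embedding can be sketched with a bilinear score; the embeddings and map here are illustrative placeholders, not DeepKinZero's learned representations:

```python
import numpy as np

def zero_shot_predict(site_emb, W, kinase_embs):
    """Score a phosphosite embedding against every kinase class
    embedding via a shared bilinear map W, and return the index
    of the best-scoring kinase. Kinases unseen during training
    are scored exactly like seen ones, provided they have a
    class embedding."""
    scores = site_emb @ W @ kinase_embs.T
    return int(np.argmax(scores)), scores
```

Because `W` is shared across kinases, adding a new kinase only requires adding a row to `kinase_embs`; no retraining on target sites of that kinase is needed.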
Short Abstract: High-throughput single-cell DNA sequencing allows the detection of rare mutations in cells and the identification of sub-clones defined by the co-occurrence of mutations. The big challenge with multiplex sequencing at the single-cell level is non-uniform amplification, which results in inadequate coverage of mutations of interest. To address this challenge, we developed a machine learning engine that optimizes amplicon design for uniform amplification by making reliable performance predictions. 10 different panels were designed with amplicons spanning a wide range of design properties. The tested amplicons were classified as low, OK, or high performers based on their normalized reads-per-cell value, with the amplicons' design properties serving as features. Highly correlated features were identified and pruned. We used a random forest classifier to calculate feature importance, and top features were identified using two different feature-selection methods. We then analyzed the range of the top features for each class and the significance of their variance between classes. These ranges were then used as parameters in the assay design pipeline. We designed three different panels using the new pipeline and achieved high panel performance of 97%, 92% and 88% across the three panels. The new parameters resulted in ~10-20% improvement in panel uniformity.
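Pruning highly correlated features can be done greedily from the feature correlation matrix; a minimal numpy sketch (the 0.9 threshold and feature names are illustrative):

```python
import numpy as np

def prune_correlated(X, names, threshold=0.9):
    """Greedily keep features in order, dropping any feature whose
    absolute Pearson correlation with an already-kept feature
    exceeds `threshold`. X has samples in rows, features in
    columns."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]
```

A redundant feature (e.g. one that is a scaled copy of another) is removed while uncorrelated features survive.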
Short Abstract: Next-generation sequencing (NGS) is a powerful technology highly relevant to biomedical research and the pharmaceutical industry. Applied in combination with molecular assays, it provides detailed insights into cellular processes such as gene expression, DNA accessibility, and protein interactions. However, the biological relevance of results from these assays is extremely sensitive to the quality of the sequencing data. So far, quality tools require extensive computational resources or manual inspection. This is critical considering the increasing amount of sequencing data driven by decreasing costs. In this study, we investigated the possibility of predicting the quality of sequencing data from standard fastq files containing the raw sequencing reads. Quality metrics were derived for a large set of fastq files from the ENCODE data portal, already labeled as being of either low or higher quality. Based on these metrics, classification models were trained with state-of-the-art machine learning algorithms to accurately discriminate between low- and high-quality files. These models also reveal differences between assays and species with respect to data quality. In addition, they achieve high predictive performance and are applied within a novel machine-learning-based tool that assesses the quality of sequencing data automatically and quickly.
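The per-file metrics such a classifier is trained on can be simple summary statistics over reads; a toy sketch over (sequence, quality) pairs, assuming standard Phred+33 quality encoding (the two metrics chosen here are illustrative, not the study's actual feature set):

```python
def fastq_metrics(records):
    """Toy per-file metrics from (sequence, quality-string) pairs:
    mean Phred score and GC fraction, the kind of summary
    features a quality classifier could use."""
    phreds, gc, total = [], 0, 0
    for seq, qual in records:
        phreds += [ord(c) - 33 for c in qual]  # Phred+33 encoding
        gc += sum(b in "GC" for b in seq)
        total += len(seq)
    return {"mean_phred": sum(phreds) / len(phreds), "gc_frac": gc / total}
```

A real pipeline would compute many more metrics (per-position quality, duplication rate, adapter content) and feed the resulting feature vector to the trained classifier.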
Short Abstract: Antibiotic resistance is an increasingly serious threat to human health, prompting scientists to develop new drugs as effective substitutes for conventional antibiotics. As a result, antimicrobial peptides (AMPs), a class of host defense peptides produced by all multicellular organisms, have received increasing attention. However, discovering novel antimicrobial peptides directly in a wet lab is time-consuming and labor-intensive. Hence, in silico prediction of antimicrobial peptides is a good way to filter promising sequences in and unlikely sequences out, reducing wet-lab costs. Here we propose a deep learning model with recurrent layers and an attention mechanism to address this problem. The model can further be used to optimize perturbed versions of several putative AMPs from our previous work, yielding novel AMPs. In wet-lab experiments, many of the identified sequences have been shown to have antimicrobial activity, with higher-scoring perturbed sequences demonstrating improved activity.
Short Abstract: Motivation: One of the major contributing factors to the notably low clinical success rate of most cancer therapies is a lack of understanding of how cancer cells respond to a particular drug. Recent advances in artificial intelligence (AI) could aid the development of drugs with better efficacy by predicting the therapeutic responses of cells. However, black-box AI is limited because the reasoning behind its predictions is not interpretable. Results: Here we developed a “visible” neural-network-based AI, DrugCell, that predicts anti-cancer drug response utilizing the structural features of drugs and the genomic features of cells while providing insights into the pathways that sensitize a cell to therapy. DrugCell outperformed baseline models trained using ElasticNet and random forests. The structural restriction imposed for interpretability did not affect performance, as DrugCell performed comparably to fully connected neural networks. The compound DrugCell predicted most accurately was vincristine, which has well-defined mechanisms of action. The cell model in DrugCell highlighted well-known pathways altered through microtubule inhibition by vincristine, such as cell adhesion and cell division. Armed with interpretability and generalizability, DrugCell serves as a first step towards the next generation of intelligent systems in drug discovery and precision medicine.
Short Abstract: Several mechanisms are key to ensuring the correct expression of genes. Among these, the binding of transcription factors to specific patterns in the DNA is especially relevant. However, in eukaryotic cells the DNA is compacted into chromatin and needs to be reorganized to allow or to impede the binding of the transcriptional machinery. Chromatin is not compacted uniformly; its structure depends on the action of several modulatory proteins that, for example, introduce chemical modifications into histone tails or the DNA itself. Additionally, the 3D structure of chromatin is dynamically regulated by loop-forming proteins, so regulatory elements can be brought closer to other DNA regions. How the combination of all these factors relates to chromatin compaction and transcription factor binding remains poorly understood. To better understand the relationship between chromatin structure, epigenetic marks, and the binding of transcription factors, we collected experimental information from the ENCODE project and used it to train several machine learning algorithms to determine the relationships between the different factors. Our results demonstrate which specific combinations of epigenetic modifications best predict the local structure of chromatin and the activity of transcription factors.
Short Abstract: Diabetic Retinopathy (DR) is among the major global causes of vision loss. With the rising prevalence of diabetes in industrialized countries, an associated increase in DR is expected. The severity of DR varies, ranging from various stages of mild non-proliferative DR to advanced proliferative DR. In addition to disease progression, the occurrence of Diabetic Macular Edema at any stage of DR further threatens the visual proficiency of DR patients. The availability of both markers and treatments, especially for early stages of DR, is limited. In this study we leverage machine learning methods to analyze a dataset of human retina samples associated with various stages of disease severity. The presented dataset includes RNA-Seq measurements of mRNA as well as microRNA. We apply a number of regression models to capture the disease progression of DR, thereby identifying disease-relevant markers, in particular factors involved in early onset of the disease. The analysis of human bulk RNA-Seq is further combined with single-cell RNA-Seq measurements of retina samples originating from various model organisms. As a result, we characterize disease progression at the transcriptome as well as the cell-specific level, fostering the development of effective treatments against DR.
Short Abstract: Machine learning models to predict phenomena such as gene expression, enhancer activity, transcription factor binding, or chromatin conformation are most useful when they can generalize to make accurate predictions across cell types. In this situation, a natural strategy is to train the model on experimental data from some cell types and evaluate performance on one or more held-out cell types. In this work, we show that, when the training set contains examples derived from the same genomic loci across multiple cell types, then the resulting model can be susceptible to a particular form of bias related to memorizing the average activity associated with each genomic locus. Consequently, the trained model may appear to perform well when evaluated on the genomic loci that it was trained on but tends to perform poorly on loci that it was not trained on. We demonstrate this phenomenon by using epigenomic measurements and nucleotide sequence to predict gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data and computing resources become available, future projects will increasingly risk suffering from this issue.
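The diagnostic the abstract above suggests amounts to holding out whole genomic loci, not just cell types, so the model cannot score well by memorizing each locus's average activity across training cell types; a minimal sketch, assuming each training example carries an illustrative `locus` field:

```python
def split_by_locus(examples, test_loci):
    """Hold out entire genomic loci: every example from a test
    locus goes to the test set regardless of its cell type, so
    locus-average memorization cannot inflate test scores."""
    train = [e for e in examples if e["locus"] not in test_loci]
    test = [e for e in examples if e["locus"] in test_loci]
    return train, test
```

Comparing performance on held-out loci against performance on held-out cell types at training loci exposes the bias the abstract describes.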
Short Abstract: Toxicology studies designed to establish the carcinogenicity of chemical compounds are costly, time-consuming, and necessitate extensive use of animal models. These difficulties motivate using machine learning models to predict unknown carcinogenicity from data collected about known carcinogens. Two publicly available toxicology studies useful for this purpose are the U.S. National Institute of Environmental Health Sciences’ DrugMatrix study and Japan’s Open TG-GATEs (Toxicogenomics Project-Genomics Assisted Toxicity Evaluation System) study. However, the carcinogenicity is not known for every compound in these studies, which suggests using semi-supervised machine learning to create carcinogenicity prediction classifiers from gene expression data. Semi-supervised machine learning has previously demonstrated success in image processing and optical character recognition, but only a limited number of applications beyond this space, and particularly few in biomedicine, have been explored. We trained semi-supervised machine learning models on DrugMatrix and TG-GATEs gene expression measurements to predict carcinogenicity in silico. Random forest, Gaussian random field, and support vector machine algorithms were tested with varying proportions of unlabeled data. However, the addition of unlabeled data did not significantly improve prediction accuracy. These results suggest that further exploration of semi-supervised learning in biomedicine, with optimization of feature selection and hyperparameter tuning, is needed.
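As an illustration of the semi-supervised setup described above (a sketch, not the authors' pipeline), scikit-learn's self-training wrapper can be trained with varying proportions of unlabeled examples; the data here are synthetic stand-ins for gene expression profiles:

```python
# Sketch: semi-supervised classification with a varying fraction of unlabeled
# samples, using scikit-learn's SelfTrainingClassifier around an SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in for gene expression features and carcinogenicity labels.
X, y = make_classification(n_samples=600, n_features=50, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for unlabeled_frac in (0.3, 0.6, 0.9):
    y_semi = y_train.copy()
    rng = np.random.RandomState(0)
    mask = rng.rand(len(y_semi)) < unlabeled_frac
    y_semi[mask] = -1  # -1 marks unlabeled samples for SelfTrainingClassifier
    model = SelfTrainingClassifier(SVC(probability=True, random_state=0))
    model.fit(X_train, y_semi)
    print(unlabeled_frac, round(model.score(X_test, y_test), 3))
```

Comparing the held-out accuracy across the unlabeled fractions mirrors the experiment reported in the abstract, where adding unlabeled data did not significantly change performance.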
Short Abstract: Glycans are complex sugar chains that are crucial components of many biological processes. Many proteins bind to glycans, including lectins and antibodies, with binding specificity to glycans often mediated by small motifs within the larger glycan. Identification of glycan binding motifs has previously been approached as a frequent subtree mining problem. In this work, we extend a standard frequent subtree mining approach by altering the glycan representation to include information on terminal connections. This allows specific identification of terminal residues as potential motifs. We achieve this by including additional nodes in a graph representation of the glycan structure to indicate the presence or absence of a connection at particular backbone carbon positions. Combining this frequent subtree mining approach with a state-of-the-art feature selection algorithm termed minimum-redundancy-maximum-relevance (mRMR), we have generated a classification pipeline that is trained on data from a glycan microarray. This classification pipeline enables prediction of binding to unknown/novel glycans, as well as identification of likely binding motifs based on existing array data. This new subtree mining approach will assist in the interpretation of the large number of publicly available glycan microarray datasets and will aid in the discovery of novel binding motifs for further experimental characterisation.
Short Abstract: In the past decade, 3D genome organization has emerged as a key tool for understanding the complexity of eukaryotic gene regulation. Topologically associated domains (TADs) have been proposed to be the basic unit of chromosome folding, and many computational tools have been developed to facilitate their discovery. Though some tools allow for hierarchical TAD identification and TAD overlap, they all start with the assumption of a TAD as a continuous region with clearly delineated boundaries. We present a new constrained matrix decomposition framework which, instead of looking for TADs explicitly, provides a concise description of the Hi-C data structure in terms of sparse and spatially coherent eigenvectors. Using these minimal assumptions, our method recovers the TAD structure as the non-zero entries of individual eigenvectors, while seamlessly allowing for TADs of different sizes, hierarchical relationships, and overlap. We show that our method outperforms dedicated TAD identification methods when validated against CTCF sites.
Short Abstract: Targeted gene editing, or optimizing existing DNA sequences, could enable synthetic biology applications ranging from designing promoters for higher gene expression to modifying DNA to treat genetic diseases. Current approaches to targeted gene editing rely on the intuition of biologists or costly massively parallel experiments; no established neural architectures exist that leverage state-of-the-art machine learning methods to perform targeted DNA sequence editing that explicitly optimizes sequences for useful properties, such as transcription factor binding or gene expression. Here, we propose a custom neural architecture for targeted sequence editing - the EDA architecture - consisting of an encoder, decoder, and analyzer. We evaluate the architecture by editing genomic sequences to bind to the transcription factor SPI1. We compare the novel architecture to baseline approaches, including a textual variational autoencoder and a rule-based editing model, which are current state-of-the-art approaches for editing DNA sequences. The model outputs are evaluated on plausibility, similarity to the original sequences, and predicted binding to transcription factor SPI1. Compared to state-of-the-art approaches, the EDA architecture significantly improves the binding scores of genomic sequences, to 84.4% predicted positive versus 17%, while maintaining a high gapped k-mer similarity between original and generated sequences (52.4% vs. 5.2%).
Short Abstract: We present PIET (Probabilistic Inference and Explanation of Transcriptomic data), a generative classifier that can train on bulk RNA-seq data and classify cell types in single cell RNA sequencing (scRNA-seq) data. ScRNA-seq enables single-cell-resolution insights into complex tissues. However, it also renders cellular identity and subpopulation identification challenging. PIET implements a generative probabilistic model of scRNA-seq data that can effectively address this challenge. We demonstrate the performance of our approach by classifying tumor-infiltrating lymphocytes from the melanoma microenvironment.
Short Abstract: Most existing Gene Regulatory Network (GRN) inference algorithms are designed for bulk sequencing data, where each sample represents the average of all sequenced cells. Hence, the samples are generally treated as a collection of independent random vectors. However, this is generally not the case for single-cell RNA-seq data, since each sample represents a single sequenced cell: cells of the same cell type will exhibit similarities, and there may exist an underlying inter-cell-type heterogeneity. In this work, we propose a novel single-cell GRN inference algorithm that accounts for both the similarities across cells of the same type (by performing clustering) and across different cell types (by imposing a similarity constraint across the GRNs of different cell types). Our proposed algorithm first applies clustering to previously imputed single-cell expression data to gather information on the inter-cluster (cell-type) variance. Then, it performs LASSO with a similarity constraint among all clusters to capture the intra-sample homogeneity. Experimental results show significant improvements over several datasets of varying size when imputation, clustering, and similarity constraints are all included. Moreover, our final inferred GRN provides more robust results than previously proposed methods in finding driver-target gene pairs, providing stronger biological insight.
Short Abstract: Missing data, a prevalent problem in biological data, can compromise statistical inference, thus hindering downstream network analysis. Most existing methods to handle missing data assume that missing values are drawn from an unbiased distribution. However, in biological data, such as single cell RNA sequencing data, the missing values are often drawn from a biased distribution which is itself conditioned on the missing values. This paradigm results in Missing Not at Random (MNAR) data. We introduce all terrain graph-learner-CPC (ATG-CPC), a novel method to learn causal directed acyclic graphs from MNAR data. ATG-CPC uses all of the observed data, and only the observed data, to directly impute causal edges, instead of the missing values of the corresponding variables. Specifically, ATG-CPC learns and applies a conditional independence p-value transformation to calculate sample-size-adjusted conditional independence p-values for every potential edge. Using simulated data, we show ATG-CPC to be an effective tool for recovering causal graphs masked by MNAR values. ATG-CPC generates more accurate causal graphs than causal learning methods even after pre-processing the data with traditional missing-data methods, including list-wise deletion, test-wise deletion, and imputation.
Short Abstract: Natural products (NP) are bioactive compounds produced by plants, fungi and bacteria, and represent a vital source for many pharmaceutical drugs. Genes involved in NP biosynthesis are usually clustered in the genome, in regions known as Biosynthetic Gene Clusters (BGC). Previous work on identifying BGCs showed supervised learning to be effective for bacteria. The genomic diversity of fungal BGCs, however, makes their identification challenging. We propose an alignment-free approach based on supervised learning to predict BGCs in fungi, capable of generalizing predictions to various BGC types, focusing on precision while requiring less data curation. Our approach has 3 main steps: collect a curated dataset; design a classification framework; and evaluate predictions against annotated BGCs. Positive instances were curated from MIBiG, and synthetic negative instances from OrthoDB. We used protein k-mers and annotated protein domains as features. Classification models are based on Random Forest, C-Support Vector (SVC), LinearSVC, Nu-SVC and Logit classifiers, and an LSTM network. We tested our models using annotated Aspergillus niger BGCs. Despite the task complexity, preliminary results for the positive class across all classifiers show F-measures around 0.6, further improved by handling class imbalance. State-of-the-art methods seem to have limited performance, possibly due to the genomic diversity of these clusters.
Short Abstract: Pan-cancer modeling approaches reveal an integrated picture of the similarities among various tumor tissue types, whereas tissue-specific studies lead to insights about a specific tissue type. Pan-cancer studies have strong statistical power due to the large number of cell-line or patient tumor samples. On the other hand, tissue-specific studies often suffer from the “small n, large p” problem due to small sample sizes. In this study, we aim to systematically model these two extremes of the comparison spectrum for drug-sensitivity prediction in cancer cell lines. We present a joint modeling approach, Bayesian multi-source regression (BMSR), to implicitly account for tissue types in pan-cancer analysis. We used publicly available human cancer cell lines from the GDSC and CCLE projects in pan-cancer and tissue-specific settings to develop an understanding of which of the two extremes yields better predictive power. Our results from BMSR exhibit improved accuracy as compared to state-of-the-art linear regression methods. Our analysis demonstrates that pan-cancer predictions outperform cancer-type-specific modeling, since in the pan-cancer setting there is an additional effect of the drug across various tissue types along with the confounding effect of sample size. Furthermore, our model suggests meaningful and novel features for a few of the well-known targeted drugs by jointly modeling tissue-specific and pan-cancer signals.
Short Abstract: Motivation: The sequence-structure-function relationship is a fundamental question about proteins. With the quickly accumulating data on protein sequence and structure, a central question this study addresses is: to what extent could current data alone reveal deep insights about sequence-structure relationships, such that new sequences can be designed accordingly for novel structure folds? Results: We have developed novel deep generative models, constructed a low-dimensional and generalizable fold space representation, exploited sequence data with and without paired structures, and used fast fold prediction as an oracle and feedback. The resulting semi-supervised guided conditional WGAN was assessed over 100 novel folds not seen in the training set using sequence and structure analysis. The gcWGAN model achieved over a 2.5% yield ratio for easy folds, a 1% yield ratio for α/β folds, and a 12% yield ratio for some folds. These results reveal the value of current protein data toward unraveling the sequence-structure relationship and utilizing the resulting knowledge for inverse protein design.
Short Abstract: Host-pathogen protein-protein interactions (HPIs) play vital roles in several biological processes and are directly involved in infectious diseases. It is crucial to understand their mechanisms and unravel potential targets to develop therapeutics. Beyond single-species Protein-Protein Interaction (PPI) prediction, no comprehensive analysis has been attempted to model HPIs on a genome scale. In this study, a comparison between different machine learning methods, such as support vector machines (SVM), artificial neural networks (ANN) and deep-learning-based Convolutional Neural Networks (CNN), was performed to predict HPIs with high accuracy. Several sequence-based features were tested, including autocorrelation, dipeptide composition, conjoint triad, quasi-order and one-hot encoding. The training datasets were obtained from HPIDB and Negatome to create positive and negative datasets as well as independent test datasets. Most of the models performed well, with independent test sensitivity values ranging from 84.6% to 99.5%, specificity from 56.1% to 98.8%, and MCC from 0.61 to 0.91. We found that Negatome is not an ideal dataset for inter-species predictions. A novel approach to generate a more suitable non-interaction dataset is proposed. Among the methods tested, deep learning looks promising, and further architectures are being explored. The best prediction models will be implemented on a web server called DeepHPI.
Short Abstract: Tumor necrosis factor alpha (TNFA) signaling is a well-known cancer hallmark. However, only a limited number of genes in the TNFA pathway have been linked to carcinogenesis, which raises an imbalanced classification issue. To address this issue, we applied a bagging strategy to discover genes in the TNFA-activated pathway, using cancer patient transcriptomic profiles as features. After comparing several machine learning algorithms, e.g., SVM, random forest, neural network, and one-class logistic regression, linear SVM was chosen to construct the final ensemble learning model because of its superior performance, high portability, and flexible operation. Our approach performs outstandingly (averaged area under the curve > 0.9) in 16 cancer types, showing its potential to predict promising genes that respond to TNFA. Notably, the predicted genes can also accurately classify tumor and normal samples. Moreover, the predicted genes are significantly enriched in differentially expressed genes, oncogenic signatures, and biological processes responding to TNFA. Additionally, the top two most frequently predicted genes in pan-cancer, i.e., TCIM and JUND, have been reported to be involved in the response to TNFA and carcinogenesis. Briefly, these results suggest that the genes predicted by our model might play critical roles in cell survival and carcinogenesis through the TNFA pathway and be potential targets for cancer therapy.
Short Abstract: Random Forests (RF) have been successfully applied to biological classification and regression problems due to their high accuracy and their potential for interpretability. Existing approaches to interpretability are typically limited to evaluating the importance of each feature independently, which is inadequate for investigating the synergistic effects of multiple biological features. Also, these approaches can often only provide an overall interpretation of the model without explaining the predictions of individual cases. Hence, we propose the Network Explainer of Random Forests (NERF) for identifying network-based mechanistic insights underlying RF predictions. It allows the extraction and synthesis of the extremely rich information encoded in an RF model and provides an individualised interpretation for each prediction. Here we present results of NERF in a drug sensitivity classification setting. Using data from the Cancer Cell Line Encyclopedia, the RF models yield a mean AUROC of 0.8 across multiple drugs. NERF interpretations of these predictions show that our approach is able to reveal the corresponding drug target genes, as well as genes involved in relevant drug-resistance-related pathways. Overall, NERF generates interpretations of RF models that are more informative for biologists than the alternative, most widely used feature importance ranking method.
Short Abstract: Biological classification problems often involve substantial missing data in the training dataset. The missingness may arise from faulty methods of data collection or storage. This problem can be addressed by imputing missing values based on the k-nearest neighbors (kNN) of the missing observation. However, most existing imputation methods do not take into account the class label in a classification problem with missing data. Also, the existing kNN imputation methods use Minkowski distance or its variants as the measure of distance, which does not work well with heterogeneous data. We propose a novel iterative kNN imputation technique based on a class-weighted gray distance between the missing datum and all the training data. Gray distance works well with heterogeneous data with missing instances. The distance is weighted by Mutual Information (MI), a measure of feature relevance between the discrete or continuous features and the class label. This ensures that the imputed dataset is better directed towards improving classification performance. This class-weighted gray-distance-based kNN imputation algorithm (CGKNN) is tested on various biological classification datasets and shown to outperform traditional kNN imputation techniques as well as missForest and MICE.
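A minimal sketch of the imputation flow described above, with a mutual-information-weighted Euclidean distance standing in for the paper's class-weighted gray distance (an illustration, not the CGKNN implementation):

```python
# Simplified sketch: impute missing values from the k nearest complete rows,
# with per-feature weights from mutual information against the class label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_weighted_knn_impute(X, y, k=3):
    X = X.astype(float).copy()
    # Estimate feature relevance from rows with no missing values.
    complete = ~np.isnan(X).any(axis=1)
    w = mutual_info_classif(X[complete], y[complete], random_state=0) + 1e-9
    for i in np.argwhere(np.isnan(X).any(axis=1)).ravel():
        miss = np.isnan(X[i])
        # MI-weighted squared distance over the observed features only.
        d = np.sqrt((((X[complete, :] - X[i]) ** 2) * w)[:, ~miss].sum(axis=1))
        nn = X[complete][np.argsort(d)[:k]]
        X[i, miss] = nn[:, miss].mean(axis=0)  # impute with neighbour mean
    return X
```

The weighting makes features that are informative about the class label dominate the neighbour search, which is the intuition behind directing imputation toward classification performance.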
Short Abstract: The soluble carrier hormone binding protein (HBP) plays an important role in the growth of humans and other animals. HBP can also selectively and non-covalently interact with hormones. Therefore, accurate identification of HBPs is an important prerequisite for understanding their biological functions and molecular mechanisms. Since experimental methods to identify HBPs remain labor-intensive and costly, it is necessary to develop computational methods to identify HBPs accurately and efficiently. In this paper, a machine learning-based method is proposed to identify HBPs, in which the samples are encoded using the optimal tripeptide composition obtained with the binomial distribution method. In the 5-fold cross-validation test, the proposed method yielded an overall accuracy of 97.15%. For the convenience of the scientific community, a user-friendly web server called HBPred2.0 was built, which can be freely accessed at http://lin-group.cn/server/HBPred2.0/.
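The tripeptide-composition encoding mentioned above can be sketched as follows (the binomial-distribution feature selection step is omitted; this is an illustration, not the HBPred2.0 code):

```python
# Sketch: encode a protein sequence as its normalized tripeptide composition,
# a vector of 8000 frequencies (20^3 possible tripeptides).
from itertools import product
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
TRIPEPTIDES = ["".join(p) for p in product(AA, repeat=3)]  # 8000 features
INDEX = {t: i for i, t in enumerate(TRIPEPTIDES)}

def tripeptide_composition(seq):
    v = np.zeros(len(TRIPEPTIDES))
    for i in range(len(seq) - 2):
        j = INDEX.get(seq[i:i + 3])  # skip non-standard residues
        if j is not None:
            v[j] += 1
    return v / max(1, len(seq) - 2)  # normalize by tripeptide count
```

A feature selection step (in the paper, a binomial-distribution test) would then pick the subset of tripeptides most discriminative for HBPs before training the classifier.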
Short Abstract: The rise of metagenomics has led to exponential growth in viral discovery, but the majority of these new sequences have no assigned host. Current machine learning approaches to predicting virus-host infection focus on low-dimensional nucleotide features. There is scope for developing a wider range of features encapsulating deeper levels of biological information. This study is based on the premise that a host-specific signature is embedded in viral genomes by the process of virus-host coevolution. Our goal was to investigate the predictive potential of features generated from different levels of viral genome representation. We compiled over a hundred binary datasets of infecting/non-infecting viruses at all taxonomic ranks of the host. We generated twenty feature sets from these viral genomes by extracting k-mer compositions at different levels of sequence representation. We trained and tested SVM classifiers to compare the predictive capacity of each of these feature sets on each dataset. We found that all levels of features were predictive of host taxonomy and that prediction improves with increasing k-mer length. Using a holdout method, we were also able to hypothesize which signals were virus- and host-specific.
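A toy sketch of the k-mer composition plus SVM workflow described above (hypothetical sequences and labels, not the study's data):

```python
# Sketch: nucleotide k-mer composition features feeding a linear SVM classifier.
from itertools import product
import numpy as np
from sklearn.svm import LinearSVC

def kmer_composition(seq, k):
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        if seq[i:i + k] in counts:
            counts[seq[i:i + k]] += 1
    total = max(1, len(seq) - k + 1)
    return np.array([counts[km] / total for km in kmers])

# Toy genomes: AT-rich vs GC-rich stand-ins for two host classes.
seqs = ["ATATTAAT" * 5, "ATTAATAT" * 5, "GCGGCCGC" * 5, "GGCCGCGC" * 5]
labels = [0, 0, 1, 1]
X = np.vstack([kmer_composition(s, k=2) for s in seqs])
clf = LinearSVC().fit(X, labels)
```

Increasing `k` grows the feature space as 4^k, which is the knob behind the observation that prediction improves with longer k-mers.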
Short Abstract: Protein kinases form the largest group of molecular targets currently investigated for anticancer drug treatment, but the intrinsic polypharmacological nature of most small-molecule kinase inhibitors limits the lead screening efforts. Systematic exploration of the chemogenomic landscape underlying the druggable kinome is critical to enable more efficient kinome profiling strategies. We compiled and curated a comprehensive compound-kinase interaction dataset that spans 48% of the human kinome, and developed VirtualKinomeProfiler, an efficient computational model that captures the chemical similarity space of the druggable kinome and enables its systematic applications to drug discovery endeavors. The computational model integrates distinct representations of chemogenomic associations among kinases using an ensemble machine learning algorithm to facilitate high-throughput virtual profiling of compound-kinase interactions. By employing the computational platform, we profiled approximately 37 million compound-kinase pairs and made predictions for 151,708 compounds in terms of their repositioning and lead molecule potential against 248 kinases simultaneously. Experimental testing with biochemical assays validated 51 of the predicted interactions, identifying 19 small-molecule inhibitors of EGFR, HCK, FLT1, and MSK1 protein kinases. The prediction model led to a 1.5-fold increase in precision when compared to traditional biochemical screening, which demonstrates its potential to drastically expedite the kinome-specific drug discovery process.
Short Abstract: Transcription factors control gene expression by binding to specific DNA sequences, which have been measured using methods such as HT-SELEX. But exactly how these specific binding sequences and their different combinations in different human enhancers and promoters orchestrate cell-type-specific expression patterns is not well understood. Here we study the constituents of human enhancers using the massively parallel STARR-seq reporter gene assay, which allows measuring the enhancer activity of any DNA sequence. In a nutshell, STARR-seq is a selection assay in which the ability of only some enhancer sequences to drive expression of the reporter gene enriches only part of the input library DNA. These expression-driving sequences are detected as RNA and sequenced. This data is well suited to analysis using machine-learning-based classifier models that find the sequence features separating the expression-driving sequences from those that do not enhance expression of the reporter gene. We apply classifiers of different complexity, from logistic regression to convolutional neural networks, in order to study what features and interactions are required to explain the sequences that are enriched in the STARR-seq experiments.
Short Abstract: Determination of cell-of-origin (COO) of diffuse large B-cell lymphomas (DLBCL) is used as a prognostic marker and may potentially be used as a stratification marker. Two major lymphoma subtypes, germinal center B cell-like (GCB) and activated B cell-like (ABC) lymphomas have been defined. Patients with ABC subtype have inferior outcomes to standard therapy. Unfortunately, current methods to assess COO remain limited by the need for tissue samples. We developed a machine learning-based COO classifier that predicts the subtype of DLBCL from targeted next generation sequencing of cell-free DNA isolated from plasma. This model uses single nucleotide variants and fusions called from sequencing data after filtering out non-tumor specific variants found in publicly available data (COSMIC, dbSNP, ExAC). We evaluated our model by comparing the COO predictions for 230 DLBCL subjects against the predictions determined by NanoString’s gene expression-based COO classifier assay (Lymph2Cx). Thirty-three subjects had no relevant variants detected or were unclassified by Lymph2Cx; COO results for 151 of the remaining subjects agreed with Lymph2Cx calls, leading to 77% concordance of calls between the two methods. These data suggest that using variant calls from plasma could be effective as a tissue-free method for COO calling in DLBCL subjects.
Short Abstract: Single cell RNA sequencing (scRNA-Seq) has enabled parallel cell-specific transcriptional profiling, revealing new insights about cell types within tissues. Clustering of scRNA-Seq data enables identification of cell types in heterogeneous cell populations, which can be used in downstream analyses of scRNA-Seq data. Dimension reduction and clustering of scRNA-Seq data are followed by cell type annotation, which involves examination of known cell markers and is a labour-intensive task. Recently, a manually curated database of cell markers for human and mouse was published (Zhang et al., 2018), representing a valuable resource for cell type classification using single cell transcriptomic data. In the current study, we used this cell-marker database and scRNA-Seq data from published studies on rat (pineal gland) (Mays et al., 2018), bovine (embryos) (Lavagi et al., 2018), and chimpanzee and bonobo (neural cells) (Marchetto et al., 2019). Unsupervised clustering and semi-supervised learning approaches were applied to classify cell types based on conserved orthologous gene expression across species. We classified cell types in different species and developed an approach for the automated annotation of cell types in non-model organisms. The authors would like to acknowledge the contribution of the COST Action CA15112 - Functional Annotation of Animal Genomes European network (FAANG-Europe).
Short Abstract: Genome-wide association studies (GWAS) use Random Forest for its capability to consider genetic interactions. Random Forest is an ensemble technique that adds randomness to Decision Trees, which are individual predictive machine learning models. This randomness allows the evaluation of a larger solution space for associated loci in GWAS-style analyses. A key parameter in the Random Forest is the number of features to be evaluated at each node of each tree (mtry). mtry controls randomness and substantially affects performance by limiting the search space. This is crucial for multigenic diseases, where sets of features may incrementally obtain a strong disease association (deep trees), thereby competing with individual features of moderate association (shallow trees). Previous studies explore the optimal value of mtry; however, this depends highly on the dataset and its characteristics. We propose a method where the value of mtry varies during the training process. This allows the creation of trees with diverse depths. The ensemble hence captures the strongest individual features, but importantly also sets of features associated with the disease. Furthermore, we evaluate altering mtry at the node level, which allows a more comprehensive search of the solution space. We assess our approach on Bone Mineral Density (BMD) case/control datasets.
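The idea of varying mtry across trees can be sketched as a hand-rolled bagging ensemble (an illustration under assumed parameter values, not the proposed method's implementation; node-level mtry variation would require modifying the tree builder itself):

```python
# Sketch: a bagged ensemble of decision trees whose max_features ("mtry")
# differs per tree, mixing shallow and deep search behaviour.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=40, random_state=0)
rng = np.random.RandomState(0)
trees = []
for t in range(100):
    mtry = rng.choice([2, 6, 20])            # vary mtry across trees
    idx = rng.randint(0, len(X), len(X))     # bootstrap sample
    tree = DecisionTreeClassifier(max_features=int(mtry), random_state=t)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote over the ensemble.
votes = np.mean([t.predict(X) for t in trees], axis=0)
pred = (votes > 0.5).astype(int)
```

A small mtry forces trees to explore weak features jointly (deeper trees), while a large mtry lets individually strong features dominate (shallower trees); mixing both in one ensemble is the diversity the abstract describes.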
Short Abstract: Design and optimization of CRISPR-Cas9 gRNAs can be complex. To assist in this process, numerous machine-learning-based models capable of predicting on-target activity and potential off-targets have been developed. However, these models ignore the natural differences between individual genomes. Differences in the chromatin environment can dramatically affect the on-target activity of CRISPR-Cas9. Similarly, natural genetic variation can alter the off-target landscape. To address this limitation, we developed two tools, GT-Scan2 and VARSCOT, which incorporate chromatin and variant information to better predict on-target and off-target activity. GT-Scan2 combines both sequence features and histone modifications to accurately predict target efficiency down to the cell type. VARSCOT, our off-target predictor, utilizes a new bi-directional mapper which allows it to rapidly identify potential off-target sites based on sequence similarity. Importantly, our pipeline is the first of its kind to incorporate variant information, enabling it to identify off-target sites unique to a given individual or population. We have implemented our models as a free web app utilizing new developments in serverless computing from AWS, making it a highly scalable and efficient tool for CRISPR-Cas9 target design.
Short Abstract: In the last decade, the single-cell RNA sequencing (scRNA-seq) technique has allowed researchers to analyze the transcriptome of individual cells in greater depth, providing a higher resolution of cellular differences. The computational analysis of scRNA-seq data usually includes a clustering module that defines the cell types and merges cells into clusters on the basis of transcriptome similarity. In this work, we exploit machine learning techniques to improve the set of genes that represents the final signature of each cluster. A dataset of 5000 immunological cells analyzed with the scRNA-seq technique was used. The computational pipeline Seurat was used to analyze the dataset, obtaining 9 clusters. The results were then analyzed using a machine learning approach that takes advantage of a decision tree algorithm implemented in the Weka tool, and an entropy measure was used to explore the genes contained in the decision trees. This measure allows us to set a threshold for cutting the trees of each cluster at the level defined by the entropy, capturing the whole set of genes contained. The analysis revealed a more consistent set of genes as the signature of each cluster with respect to the set identified by the computational pipeline alone.
Short Abstract: Epilepsy is a common neurological disorder characterized by recurrent seizures. Its pathogenesis is associated with diverse molecular dysregulation resulting in the remodelling of brain networks. Thus, a better understanding of the molecular mechanism of epileptogenesis, and in particular the latent period, is crucial to impede the development of epilepsy. Previous studies showed that alterations in microRNA (miRNA) expression significantly affects the stability and translation of miRNA target genes. Identification of miRNA-mRNA regulatory networks, therefore, can provide valuable insight into the aetiology of the disease. In this study, a Bayesian model was developed to integrate the expression profiles of miRNAs and mRNAs from an animal model of epileptogenesis. The temporal behaviours of influential miRNAs and their target mRNAs were inferred in an interaction network and further functional analysis highlighted the most strongly affected pathways and their interconnections. The resulting network, which contains both direct and indirect miRNA-mRNA interactions, was compared to sequence-based target prediction in order to identify the direct interactions with higher confidence. The findings in this research provide a list of key miRNAs and their targets with therapeutic potential for further experimental investigation.
Short Abstract: Major Histocompatibility Complex II (MHC II) binds to peptides and presents them on the surface of antigen-presenting cells. Interaction of T helper (Th) cells with the pMHC II complex is a necessary event in Th cell activation. Binding of peptides to MHC happens with high specificity, but the polymorphism of MHC molecules leads to diverse peptide binding repertoires. Pan-specific predictors can model peptide interaction with any MHC molecule; an example is the NetMHCIIpan suite of neural network models. Here we present results of improved MHC II epitope prediction by data integration and adaptation of NNalign, the algorithm underlying NetMHCIIpan. Mass spectrometry eluted ligand data has proved to be a valuable addition to the training sets of MHC peptide prediction models. This data integration was achieved through new model architectures and training algorithms. NNalign_MA represents a further development that extends the scope of eluted ligand datasets to complex peptide mixtures. In this work we show improved CD4+ epitope prediction by NNalign_MA compared to the state-of-the-art prediction tool NetMHCIIpan-3.2. Furthermore, we investigate the influence of including ligand proteolytic signals on prediction. Improved CD4+ epitope predictions have applications in the de-immunization of protein drugs.
Short Abstract: Sepsis represents a suppressed immune response to infection. Despite advances in modern medicine, severe sepsis remains a major cause of mortality globally. Currently, physicians rely on their interpretations of heterogeneous clinical symptoms and medical scoring systems, often resulting in misdiagnoses. Gene expression biomarker studies tend to be inconclusive, resulting in no effective diagnostic panels [1]. In this global study of 300 emergency room patients, we assessed the predictive power of machine learning classifiers built using clinical and gene expression (RNA-Seq) data. We applied elastic net, support vector machine, and random forest to each feature type separately and to the features combined. The highest area under the receiver operator characteristic curve was 0.88 and was achieved using elastic net on the combined model. The combined approach involved direct concatenation of the features followed by a z-score transformation. We used “block-scaling” to address the issue of the model being dominated by gene features; this scales each feature by the inverse of the total number of features in a data-type-specific block [2]. Genetic and clinical feature coefficients were also explored for biological relevance. 1. Biron et al. (2015) Biomarker Insights, 10:7-17. 2. Jeppe et al. (2008) Tox Sci, 102(2):444-54.
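The "block-scaling" step described above can be sketched as follows (an illustrative implementation based on the description, not the study's code):

```python
# Sketch: z-score each feature, then divide by the number of features in its
# data-type block so many gene features cannot dominate few clinical features.
import numpy as np

def block_scale(blocks):
    """blocks: dict mapping block name -> (n_samples, n_features) array."""
    scaled = []
    for name, X in blocks.items():
        z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)  # z-score per feature
        scaled.append(z / X.shape[1])  # inverse-block-size weighting
    return np.hstack(scaled)

rng = np.random.RandomState(0)
combined = block_scale({"clinical": rng.rand(10, 5), "genes": rng.rand(10, 500)})
```

After this step, each block contributes a comparable total variance to the concatenated feature matrix, regardless of how many features it holds.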