Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Knowledge Guided Machine Learning in Biology

Schedule subject to change.
All times in Central Daylight Time (CDT)
Tuesday, May 11th
Keynote: Multi-scale Inference of Genetic Trait Architecture using Biologically Annotated Neural Networks
  • Lorin Crawford

A consistent theme of the work done in the Crawford Lab is to take modern
computational approaches and develop theory that enables their interpretations to be
related back to classical genomic principles. The central aim of this talk is to address
variable selection and interpretability questions in nonlinear regression models (e.g.,
neural networks). Motivated by statistical genetics, where interactions are of particular
interest, we introduce novel, interpretable, and computationally efficient ways to
summarize the relative importance of genetic variants contributing to broad-sense
heritability and phenotypic variation.

Knowledge-Guided Machine Learning for predicting the promiscuity of enzymes
  • Soha Hassoun, Tufts University, United States

Experimental characterization of enzymatic activities on molecules has already advanced our understanding of cellular metabolism and provided a rich view of health and disease states. Despite this progress, comprehensive characterization of enzyme function on substrate molecules remains elusive. Although enzymes are traditionally assumed to be specific, each transforming a single substrate, many interact with a wide range of substrates. Because of the large number of enzyme sequences and substrates, comprehensively characterizing these promiscuous interactions through experimentation and curation is prohibitive. Importantly, this limited characterization fundamentally hampers our understanding of cellular metabolism and delays biological discovery.

This talk will present several machine learning models for predicting the promiscuity of enzymes. The techniques leverage knowledge that is specific to enzymes, molecules, and the underlying biological network to achieve enhanced performance. First, the talk will discuss a neural network hierarchy-informed multi-label classifier that predicts enzyme classes, as defined via their enzyme commission numbers, for a given query molecule. The hierarchy is derived from the enzyme commission classification system. Next, the talk will present a neural network recommender system suited for recommending enzymes to molecules and vice versa. The recommender system performance is improved when utilizing molecular relationships culled from catalogued biochemical interactions. The talk will then present a graph-embedding link prediction model that predicts biochemical transformations between two molecules. We show that molecular attributes can significantly enhance the predictions. The talk will conclude by highlighting some knowledge integration and curation challenges when predicting the promiscuity of enzymes.

Penguin: Predicting RNA Pseudouridine Sites in Nanopore Sequencing Data
  • Sarath Janga, IUPUI, United States
  • Doaa Salem, IUPUI, United States
  • Daniel Acevedo, University of Texas Rio Grande Valley, United States
  • Swapna Vidhur Daulatabad, IUPUI, United States
  • Quoseena Mir, IUPUI, United States

Pseudouridine is one of the most abundant RNA modifications, formed when the isomerization of uridine is catalyzed by pseudouridine synthase proteins. It plays an important role in many biological processes, such as stabilizing RNA by enhancing the function of transfer RNA and ribosomal RNA, and it is also important in drug development. Recently, single-molecule sequencing techniques such as the direct RNA sequencing platform offered by Oxford Nanopore Technologies have enabled direct detection of RNA modifications on the molecule being sequenced, but to our knowledge this technology has not been used to identify RNA pseudouridine sites. We address this gap by introducing a tool called Penguin that integrates several machine learning (ML) models (i.e., predictors) to identify RNA pseudouridine sites in Nanopore direct RNA sequencing reads. Penguin extracts a set of features from the raw signal measured by the Oxford Nanopore device and from the corresponding basecalled k-mer. These features are used to train the predictors included in Penguin, which in turn predict whether the signal is modified by the presence of pseudouridine sites. Penguin includes several predictors: support vector machine (SVM), random forest (RF), and neural network (NN). Results on two benchmark data sets show that Penguin identifies pseudouridine sites with high accuracy, 93.38% and 92.61% using SVM in random split testing and independent validation testing, respectively. Penguin thus outperforms the existing pseudouridine predictors in the literature, which achieved an accuracy of at most 76.0% in independent validation testing. The tool is available on GitHub at https://github.com/Janga-Lab/Penguin.

Knowledge-based Meta-Learning for Cancer Prediction and Survival Analysis
  • Aidong Zhang, Departments of Computer Science, University of Virginia, United States

Most AI approaches, such as deep learning, are very effective at perception and classification in the presence of large amounts of labeled data. However, in the biomedical domain, annotated data can be extremely limited and labeling data is typically expensive. The ability to generalize from a few examples will be key for prediction systems. How machines can recognize and generalize new patterns and their variations after observing a few examples, as humans do, is still an open problem. Meta-learning is a new paradigm that utilizes prior knowledge learned from related tasks and generalizes to new tasks with limited supervised experience. It has been applied in many fields to tackle the problem of scarce annotated data, such as cancer research and drug discovery. In this talk, I will discuss the potential of applying meta-learning algorithms to analyze The Cancer Genome Atlas (TCGA) cancer patient data and demonstrate the effectiveness and superiority of meta-learning methods for cancer prediction. I will also discuss how meta-learning can be used for cancer survival analysis.

Learning to align with differentiable dynamic programming
  • Michiel Stock, Ghent University, Belgium
  • Dimitri Boeckaerts, https://michielstock.github.io/, Belgium
  • Steff Taelman, Ghent University, Belgium
  • Wim Van Criekinge, Ghent University, Belgium

The alignment of two or more biological sequences is one of the main workhorses in bioinformatics because it can quantify similarity and reveal conserved patterns. Dynamic programming allows for rapidly computing the optimal alignment between two sequences by recursively splitting the problem into smaller tractable choices, i.e., deciding whether it is best to extend a current alignment or introduce a gap in one of the sequences. This process leads to the optimal alignment score, and backtracking yields the optimal alignment. Starting from a collection of pairwise alignments, one can heuristically compute a multiple sequence alignment of many sequences. If one is interested in the effect of a small change in the alignment parameters or the sequences, one has to compute the gradient of the alignment score with respect to these inputs. Regrettably, computing this gradient is not possible because the individual maximisation (minimisation) steps in the dynamic programming are non-differentiable.

However, Mensch and Blondel recently showed that by smoothing the maximum operator, for example, by regularising with an entropic term, one can design fully differentiable dynamic programming algorithms. The individual smoothed maximum operators have various desirable properties, such as being efficient to compute, inducing sparsity, or admitting a probabilistic interpretation. Building on this work, we created a differentiable version of the Needleman–Wunsch algorithm.
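The idea of smoothing the maximum can be sketched in a few lines. The following Python snippet is only an illustration under stated assumptions, not the authors' Julia implementation: it uses the entropic (log-sum-exp) smoothing with a temperature `lam`, a simple match/mismatch score and a linear gap cost, and it estimates the gradient with a finite difference rather than automatic differentiation. The function names `soft_max`, `soft_nw_score` and `match_gradient` are hypothetical.

```python
import math


def soft_max(values, lam):
    """Entropy-regularised maximum: lam * log(sum(exp(v / lam))).

    Approaches the ordinary maximum as lam goes to 0.
    """
    m = max(values)  # subtract the maximum for numerical stability
    return m + lam * math.log(sum(math.exp((v - m) / lam) for v in values))


def soft_nw_score(s, t, match=1.0, mismatch=-1.0, gap=1.0, lam=1.0):
    """Smoothed global alignment score of s and t (linear gap cost)."""
    n, m = len(s), len(t)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = -gap * i
    for j in range(1, m + 1):
        D[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = soft_max(
                [D[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                 D[i - 1][j] - gap,      # gap in t
                 D[i][j - 1] - gap],     # gap in s
                lam,
            )
    return D[n][m]


def match_gradient(s, t, lam=1.0, h=1e-4):
    """Central finite-difference estimate of d(score)/d(match) at match = 1."""
    up = soft_nw_score(s, t, match=1.0 + h, lam=lam)
    down = soft_nw_score(s, t, match=1.0 - h, lam=lam)
    return (up - down) / (2 * h)
```

Because every step is a smooth function of its inputs, the score varies smoothly with the scoring parameters, and for small `lam` it approaches the classical Needleman–Wunsch score.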

The resulting gradient has an immediate diagnostic and statistical interpretation, such as computing the Fisher information to create uncertainty estimates. Furthermore, it enables us to use sequence alignment in differentiable computing, allowing one to learn an optimal substitution matrix and gap cost from a set of homologous sequences. This flexibility allows these parameters to vary across regions of the sequences, for example, depending on the secondary structure. Conversely, one can fix the alignment parameters and optimise the sequences for alignment. This scheme allows for finding consensus sequences, which can be useful in creating a multiple sequence alignment. More broadly, our algorithm can be incorporated in arbitrary artificial neural network architectures, making it an attractive alternative to the popular convolutional neural networks, LSTMs or transformer networks currently used to learn from biological sequences.

We provide a performant implementation of our method, compatible with deep learning, optimisation and probabilistic programming packages. To this end, we use the Julia programming language and provide custom gradients that are compatible with the major automatic differentiation packages, allowing for seamless integration with other packages.

EDI Panel Introduction
New machine learning approaches to estimate the functional consequence of mutations in diverse human populations
  • Yuval Itan, Icahn School of Medicine at Mount Sinai, United States
  • Cigdem Sevim Bayrak, Icahn School of Medicine at Mount Sinai, United States
  • Avner Schlessinger, Icahn School of Medicine at Mount Sinai, United States
  • Yiming Wu, Icahn School of Medicine at Mount Sinai, United States

The genome of a patient with a genetic disease contains about 20,000 non-synonymous variations, of which only one (or a few) is disease-causing. Current computational methods cannot predict the functional consequence of a mutation: whether it results in gain-of-function (GOF) or loss-of-function (LOF). Moreover, computational predictions of mutation pathogenicity still lack specificity when analyzing diverse human genetic data. Here we present two novel approaches to address these shortcomings: (1) a machine learning study to computationally differentiate GOF from LOF mutations, using natural language processing (NLP) and feature selection to generate the first large-scale human inherited GOF and LOF mutation database; and (2) a deep learning neural network approach to classify mutations by the human phenotype ontology (HPO) disease group. We demonstrate the utility of combining our state-of-the-art approaches with gold-standard methods in case-control studies of inflammatory bowel disease (IBD) and congenital heart disease (CHD), where we discovered new genetic etiologies and genomic architectures for these diseases.

Supervised machine learning approach for diagnostic screening of cardiovascular disease using gut microbiome data
  • Sachin Aryal, University of Toledo College of Medicine and Life Sciences, United States
  • Ahmad Alimadadi, University of Toledo College of Medicine and Life Sciences, United States
  • Ishan Manandhar, University of Toledo College of Medicine and Life Sciences, United States
  • Bina Joe, University of Toledo College of Medicine and Life Sciences, United States
  • Xi Cheng, University of Toledo College of Medicine and Life Sciences, United States

Cardiovascular disease (CVD), the leading cause of death worldwide, encompasses many different morbid conditions, such as hypertension, heart failure, and atherosclerosis, which can develop simultaneously or lead to each other. An array of different clinical assays and imaging approaches is required for a comprehensive evaluation of cardiovascular health. Therefore, a systematic screening for any existing cardiovascular dysfunction could save diagnostic time and initiate early therapeutic interventions. Gut microbiota dysbiosis has been reported in patients with certain types of CVD, such as hypertension. We therefore hypothesized that supervised machine learning (ML) models could be trained on gut microbiome data for systematic diagnostic screening of CVD. To test our hypothesis, we analyzed 16S rRNA sequencing data from stool samples collected through the American Gut Project. The stool 16S metagenomics data of 478 CVD and 473 non-CVD subjects were analyzed using five supervised ML algorithms: random forest (RF), support vector machine, decision tree, elastic net, and neural networks (NN). Interestingly, we identified 39 differential bacterial taxa (LEfSe: LDA > 2) between the CVD and non-CVD groups, but initial ML classifications using these taxonomic features could only achieve an AUC (0.0: perfect antidiscrimination; 0.5: random guessing; 1.0: perfect discrimination) of ~0.58 (RF and NN). Alternatively, the top 500 high-variance features of operational taxonomic units (OTUs) were used for training ML models, and an improved AUC of ~0.65 (RF) was achieved. The top 25 highly contributing OTU features (HCOFs) were further selected from those high-variance OTU features, and the RF model trained with only HCOFs achieved an improved AUC of ~0.70. Overall, our study identified dysregulated gut microbiota in CVD patients and, for the first time, developed a gut microbiome-based ML approach for promising systematic diagnostic screening of CVD.
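Two quantities in this abstract, the AUC and the selection of high-variance features, can be made concrete with a small sketch. The Python code below is illustrative only and is not the authors' pipeline: it computes the AUC as the fraction of positive/negative pairs that are ranked correctly (ties count one half) and picks the columns with the highest variance from a small feature table. The function names `auc` and `top_variance_features` and the toy inputs are assumptions.

```python
from statistics import pvariance


def auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    labels are 0/1; tied scores contribute 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def top_variance_features(X, k):
    """Indices of the k columns of X (a list of rows) with the highest variance."""
    n_features = len(X[0])
    variances = [pvariance([row[j] for row in X]) for j in range(n_features)]
    order = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return order[:k]


# Toy usage: three samples, three features; keep the two most variable columns.
selected = top_variance_features([[0, 1, 5], [0, 2, -5], [0, 3, 5]], k=2)
```

With this definition, an AUC of 0.5 corresponds to random guessing and 1.0 to perfect discrimination, matching the scale given in the abstract.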

Sources of Funding:
The work was supported by the Dean’s Postdoctoral to Faculty Fellowship from the University of Toledo College of Medicine and Life Sciences to Xi Cheng. Xi Cheng acknowledges grant support from the P30 Core Center Pilot Grant from NIDA Center of Excellence in Omics, Systems Genetics, and the Addictome. Bina Joe acknowledges grant support from the National Heart, Lung, and Blood Institute (HL143082).

Aryal S, Alimadadi A, Manandhar I, Joe B, Cheng X. Machine Learning Strategy for Gut Microbiome-Based Diagnostic Screening of Cardiovascular Disease. Hypertension. 2020;76: 1555–1562. https://www.ahajournals.org/doi/abs/10.1161/HYPERTENSIONAHA.120.15885

Supervised machine learning for gut microbiome-based detection of inflammatory bowel diseases
  • Sachin Aryal, University of Toledo College of Medicine and Life Sciences, United States
  • Ahmad Alimadadi, University of Toledo College of Medicine and Life Sciences, United States
  • Ishan Manandhar, University of Toledo College of Medicine and Life Sciences, United States
  • Bina Joe, University of Toledo College of Medicine and Life Sciences, United States
  • Xi Cheng, University of Toledo College of Medicine and Life Sciences, United States
  • Patricia B Munroe, Queen Mary University of London, United Kingdom

Inflammatory bowel diseases (IBD) are characterized by chronic inflammation of the gastrointestinal (GI) tract. Crohn’s disease (CD) and ulcerative colitis (UC) are the two major subtypes of IBD. Although various clinical approaches are available for diagnosing IBD, such as endoscopy and colonoscopy, misdiagnosis of IBD occurs frequently, so there is a clinical need to further improve diagnosis of this condition. Since dysbiosis in the GI tract is reported in IBD patients, we hypothesized that gut microbiome data can be used to develop an artificial intelligence-based strategy for diagnostic screening of IBD. To test our hypothesis, fecal 16S metagenomics data of 729 IBD patients and 700 non-IBD controls collected from the American Gut Project were analyzed using five supervised machine learning (ML) models: random forest (RF), decision tree, elastic net, support vector machine and neural networks. Fifty bacterial taxa were identified as significantly differential between the IBD and non-IBD groups. Supervised ML classifications trained with these 50 taxonomic features achieved a testing AUC (area under the receiver operating characteristic curve) of ~0.80 using the RF model. Next, we tested whether operational taxonomic units (OTUs), instead of bacterial taxa, could be used as ML features for diagnostic classification of IBD. The five ML models described above were trained on the top 500 high-variance OTUs, and an improved AUC of ~0.82 was achieved by RF. Further, we tested the capability of the RF model to distinguish between CD and UC using 331 CD and 141 UC samples. A total of 117 bacterial taxa were identified as significantly differential between CD and UC, and the RF model trained with these bacterial features achieved a testing AUC of ~0.91. Furthermore, the RF model trained with the top 500 high-variance OTUs achieved a slightly improved AUC of ~0.92.
In summary, we demonstrated robust supervised ML modeling for diagnostic screening of IBD and its subtypes.

Sources of Funding:
The work was supported by the Dean’s Postdoctoral to Faculty Fellowship from the University of Toledo College of Medicine and Life Sciences to Xi Cheng. Xi Cheng also acknowledges grant support from the P30 Core Center Pilot Grant from NIDA Center of Excellence in Omics, Systems Genetics, and the Addictome. Bina Joe acknowledges grant support from the National Heart, Lung, and Blood Institute (HL143082). Patricia B. Munroe acknowledges support from the National Institute for Health Research Cardiovascular Biomedical Research Centre at Barts and Queen Mary University of London.

1. Manandhar I, Alimadadi A, Aryal S, Munroe PB, Joe B, Cheng X. Gut microbiome-based supervised machine learning for clinical diagnosis of inflammatory bowel diseases. American Journal of Physiology-Gastrointestinal and Liver Physiology. (doi.org/10.1152/ajpgi.00360.2020)

Gene Signatures of COVID-19 Infection Severity Identified Using Graph Convolutional Neural Networks On Single Cell RNA-Seq Data
  • Mario Flores, UTSA, United States
  • Yufang Jin, UTSA, United States
  • Huang Wenjian, UTSA, United States
  • Ricardo Ramirez, UTSA, United States
  • Karla Paniagua, UTSA, United States

One of the mysteries of Coronavirus Disease 2019 (COVID-19) is why some people suffer severe symptoms, even life-threatening complications, while others suffer no symptoms or only mild ones. Several studies have related the severity of COVID-19 infection to immune system features that make some groups more vulnerable to this viral infection. The goal of this study is to elucidate the response signatures of COVID-19 infection by identifying gene markers and activation patterns of cells related to patients with different degrees of severity. In particular, single-cell RNA-Seq (scRNA-Seq) datasets of severe and mild cases were compared to uninfected cases using a deep learning approach. We developed a novel graph convolutional neural network (GCNN) model to classify the severity of COVID-19 infection. Our GCNN models predicted cells from patients with mild and severe symptoms with an accuracy of 90%. The learned GCNN features in the hidden layer were extracted to identify the leading genes and functional modules of the immune response for different severities of COVID-19 infection.

Machine learning on knowledge graphs and ontologies
  • Justin Reese, Lawrence Berkeley National Laboratory, United States
  • Deepak Unni, Lawrence Berkeley National Laboratory, United States
  • Nico Matentzoglu, Semanticly Ltd, United Kingdom
  • Nomi Harris, Lawrence Berkeley National Laboratory, United States
  • William Duncan, Lawrence Berkeley National Laboratory, United States
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States

Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. Knowledge graphs (KGs) are well-suited for integrating the heterogeneous data related to COVID-19. We constructed KG-COVID-19 [1], a KG that integrates a wide variety of data related to COVID-19, and a performant software package for machine learning on KGs [2]. We applied machine learning algorithms to produce actionable knowledge from the KG. Our strategy included ranking of drug repurposing candidates based on cosine similarity, and training/application of link prediction classifiers (multi-layer perceptron, random forest, decision tree, and logistic regression). Using this strategy, we produced ranked lists of drug repurposing candidates for COVID-19 treatment. We then used clinical data from N3C to validate these drug repurposing candidates using a retrospective case-cohort strategy.
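Ranking candidates by cosine similarity can be sketched as follows. This Python snippet is a minimal illustration and does not use the KG-COVID-19 or embiggen code: it assumes that each candidate already has an embedding vector and ranks candidates by their cosine similarity to a target vector. The function names, the drug names and the two-dimensional vectors are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target, embeddings):
    """Return (name, similarity) pairs sorted by decreasing similarity to target."""
    scored = [(name, cosine_similarity(target, vec))
              for name, vec in embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real node embeddings would have many more dimensions.
toy_embeddings = {
    "drug_a": [1.0, 0.0],
    "drug_b": [0.0, 1.0],
    "drug_c": [1.0, 1.0],
}
ranking = rank_candidates([1.0, 0.1], toy_embeddings)
```

In this toy example the candidate whose vector points in nearly the same direction as the target is ranked first, regardless of vector length.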

To generalize and extend the graph machine learning tooling that we developed to facilitate COVID-19 research, we have developed a framework called NEAT (Network Embedding All the Things) [3] for configuring reproducible machine learning pipelines on knowledge graphs and ontologies. NEAT machine learning tasks are entirely driven by human-readable configuration files, which both remove the requirement for users to write code and serve as a detailed record of how each machine learning task was conducted. NEAT supports both node2vec-like algorithms to embed the graph structure and NLP algorithms to embed textual elements of nodes and edges in the graphs (e.g., class descriptions, node labels). The combination of embeddings from graph and textual elements improves performance of graph ML tasks.

1. Reese JT, Unni D, Callahan TJ, Cappelletti L, Ravanmehr V, Carbon S, et al. KG-COVID-19: a framework to produce customized knowledge graphs for COVID-19 response. Patterns (N Y). 2020; 100155.
2. https://github.com/monarch-initiative/embiggen
3. https://github.com/Knowledge-Graph-Hub/NEAT

Supervised prediction of aging-related genes from a weighted dynamic protein-protein interaction network
  • Tijana Milenković, University of Notre Dame, United States
  • Khalique Newaz, University of Notre Dame, United States
  • Qi Li, University of Notre Dame, United States

Human aging is linked to many prevalent diseases, such as diabetes, cancer, cardiovascular disease, and Alzheimer's disease. Even the recent and widespread COVID-19 seems to be related to aging. The aging process is highly influenced by genetic factors. However, analyzing human aging via wet lab experiments is difficult due to the long human life span and ethical constraints. Analyzing human aging computationally can fill this gap. This includes prediction of aging-related genes via supervised learning from human -omics data, which is the task that we focus on. Gene expression-based methods for this task predict a gene as aging-related if its expression level varies with age. While such approaches do capture aging-specific information, they ignore interactions between genes, i.e., their protein products. Protein-protein interaction (PPI) network-based methods for this task predict a gene as aging-related if its position (i.e., node representation/embedding/feature) in the PPI network is "similar enough" to the network positions of known aging-related genes. While these approaches do consider the PPIs that carry out cellular functioning, the PPIs are context-unspecific, i.e., they span different conditions, such as cell types, tissues, diseases, and environments.

Unlike the above approaches, we consider a dynamic aging-specific PPI subnetwork that was inferred by integrating aging-specific gene expression data and the entire context-unspecific PPI network data, which should yield more accurate aging-related gene predictions because aging is a dynamic process. However, the considered dynamic subnetwork did not improve prediction performance compared to a static aging-specific subnetwork, despite the aging process being dynamic. This could be because the dynamic subnetwork was inferred using an induced subgraph approach, which is quite naive, as it considers all PPIs from the context-unspecific network that exist between only the active genes at a given age. However, first, not all PPIs between the active genes might be equally "important", and the induced approach has no way of identifying the most important of all such PPIs. Second, the induced approach fails to consider any inactive genes that might critically connect the active genes in the network. Instead, we recently inferred a dynamic aging-specific subnetwork using a methodologically more advanced notion of network propagation (NP), which improved upon the induced dynamic subnetwork in unsupervised analyses of the aging process. Intuitively, NP maps expression levels (i.e., activities) onto the genes in the entire context-unspecific network via random walk or diffusion, to assign condition-specific weights to the nodes (genes) or edges (PPIs) in the entire PPI network. Finally, NP assumes that the highest-weighted network regions are the most relevant for the condition of interest, i.e., such regions form the context-specific subnetwork. Hence, as opposed to the induced approach, first, NP assigns weights to PPIs that can help identify the most "important" PPIs. Second, NP can consider a non-active gene if, for example, the gene is connected to sufficiently many active genes.
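The intuition behind network propagation can be illustrated with a random walk with restart, one common form of propagation. The Python sketch below is an assumption-laden toy, not the authors' method: the function name `random_walk_with_restart`, the restart probability of 0.5 and the four-gene path network are all hypothetical. It shows how an inactive gene that links two active genes still receives a high weight.

```python
def random_walk_with_restart(adj, seed_scores, restart=0.5, tol=1e-10, max_iter=1000):
    """Propagate seed activity over an undirected network.

    adj maps each node to a list of its neighbours (every neighbour must also
    be a key of adj); seed_scores maps nodes to non-negative activity levels.
    Returns a dict of stationary node weights that sum to 1.
    """
    nodes = sorted(adj)
    total = sum(seed_scores.get(n, 0.0) for n in nodes)
    if total <= 0.0:
        raise ValueError("at least one node needs a positive seed score")
    p0 = {n: seed_scores.get(n, 0.0) / total for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}  # jump back to the seeds
        for u in nodes:
            if adj[u]:
                share = (1.0 - restart) * p[u] / len(adj[u])
                for v in adj[u]:
                    nxt[v] += share  # spread the rest evenly over neighbours
            else:
                nxt[u] += (1.0 - restart) * p[u]  # isolated node keeps its mass
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Toy path network A - B - C - D where only A and C are active.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
weights = random_walk_with_restart(toy_network, {"A": 1.0, "C": 1.0})
```

Here gene B is inactive but sits between the two active genes, so it ends up with a larger weight than gene D, which touches only one active gene. An induced subgraph on the active genes alone would have dropped B entirely.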

Here, we evaluate whether using our existing NP-based dynamic subnetwork will improve upon using the dynamic and static subnetworks constructed by the induced approach in the supervised prediction of aging-related genes. However, the existing NP-based subnetwork is unweighted, i.e., it gives equal importance to each of the aging-specific PPIs. Because accounting for aging-specific edge weights might be important, we additionally propose a weighted NP-based dynamic aging-specific subnetwork. We demonstrate that a predictive machine learning model trained and tested on our weighted NP-based dynamic aging-specific subnetwork yields higher accuracy when predicting aging-related genes than predictive models run on any of the existing unweighted dynamic or static subnetworks.

Our proposed weighted dynamic aging-specific subnetwork could guide the discovery of novel aging-related gene candidates for future wet lab validation with higher confidence than the existing dynamic and static subnetworks.

International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
