Knowledge Guided Machine Learning in Biology

Schedule subject to change.
All times in Central Daylight Time (CDT)

Tuesday, May 11th

9:00-10:00

Keynote: Multi-scale Inference of Genetic Trait Architecture using Biologically Annotated Neural Networks

Lorin Crawford

10:00-10:30

Knowledge-Guided Machine Learning for predicting the promiscuity of enzymes

Soha Hassoun, Tufts University, United States

10:30-10:45

Penguin: Predicting RNA Pseudouridine Sites in Nanopore Sequencing Data

Sarath Janga, IUPUI, United States
Doaa Salem, IUPUI, United States
Daniel Acevedo, University of Texas Rio Grande Valley, United States
Swapna Vidhur Daulatabad, IUPUI, United States
Quoseena Mir, IUPUI, United States

11:00-11:30

Knowledge-based Meta-Learning for Cancer Prediction and Survival Analysis

Aidong Zhang, Departments of Computer Science, University of Virginia, United States

11:30-11:45

Learning to align with differentiable dynamic programming

Michiel Stock, Ghent University, Belgium
Dimitri Boeckaerts, https://michielstock.github.io/, Belgium
Steff Taelman, Ghent University, Belgium
Wim Van Criekinge, Ghent University, Belgium

The alignment of two or more biological sequences is one of the main workhorses in bioinformatics because it can quantify similarity and reveal conserved patterns. Dynamic programming rapidly computes the optimal alignment between two sequences by recursively splitting the problem into smaller tractable choices, i.e., deciding whether it is best to extend the current alignment or introduce a gap in one of the sequences. This process yields the optimal alignment score, and backtracking yields the optimal alignment. Starting from a collection of pairwise alignments, one can heuristically compute a multiple sequence alignment of many sequences. To study the effect of a small change in the alignment parameters or the sequences, one has to compute the gradient of the alignment score with respect to these inputs. Regrettably, computing this gradient is not possible because the individual maximisation (minimisation) steps in the dynamic programming are non-differentiable.

However, Mensch and Blondel recently showed that by smoothing the maximum operator, for example, by regularising with an entropic term, one can design fully differentiable dynamic programming algorithms. The individual smoothed maximum operators have various desirable properties, such as efficient computation, sparsity, or a probabilistic interpretation. Building on this work, we created a differentiable version of the Needleman–Wunsch algorithm.

The resulting gradient has immediate diagnostic and statistical uses, such as computing the Fisher information to create uncertainty estimates. Furthermore, it enables us to use sequence alignment in differentiable computing, allowing one to learn an optimal substitution matrix and gap cost from a set of homologous sequences. The flexibility allows these parameters to vary across regions of the sequences, for example, depending on the secondary structure. Conversely, one can fix the alignment parameters and optimise the sequences for alignment. This scheme allows for finding consensus sequences, which can be useful in creating a multiple sequence alignment. More broadly, our algorithm can be incorporated in arbitrary artificial neural network architectures, making it an attractive alternative to the popular convolutional neural networks, LSTMs or transformer networks currently used to learn from biological sequences.

We provide a performant implementation of our method, compatible with deep learning, optimisation and probabilistic programming packages. To this end, we use the Julia programming language, for which we provide custom gradients that are compatible with the major automatic differentiation packages, allowing for seamless integration with other packages.
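To make the smoothing idea in the abstract above concrete, here is a minimal Python sketch of a Needleman–Wunsch recursion in which the hard maximum is replaced by an entropy-smoothed maximum (a scaled log-sum-exp). It is not the authors' Julia implementation: the function names, the scoring values (match, mismatch, gap) and the smoothing parameter `gamma` are all assumptions chosen for illustration, and the sensitivity to the gap cost is estimated with finite differences rather than with custom gradients.

```python
import math


def smooth_max(values, gamma):
    """Entropy-smoothed maximum: gamma * log(sum(exp(v / gamma)))."""
    top = max(values)  # subtract the maximum for numerical stability
    return top + gamma * math.log(sum(math.exp((v - top) / gamma) for v in values))


def alignment_score(s, t, gamma=None, match=1.0, mismatch=-1.0, gap=1.0):
    """Global alignment score of s and t.

    gamma=None uses the ordinary (non-differentiable) maximum;
    gamma > 0 uses the smoothed maximum. Scoring values are illustrative.
    """
    pick = max if gamma is None else (lambda vals: smooth_max(vals, gamma))
    n, m = len(s), len(t)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = -gap * i  # prefix of s aligned against gaps only
    for j in range(1, m + 1):
        d[0][j] = -gap * j  # prefix of t aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            d[i][j] = pick([
                d[i - 1][j - 1] + sub,  # extend the alignment
                d[i - 1][j] - gap,      # gap in t
                d[i][j - 1] - gap,      # gap in s
            ])
    return d[n][m]


def gap_sensitivity(s, t, gamma, gap=1.0, eps=1e-4):
    """Central finite difference of the smoothed score with respect to the gap cost."""
    up = alignment_score(s, t, gamma=gamma, gap=gap + eps)
    down = alignment_score(s, t, gamma=gamma, gap=gap - eps)
    return (up - down) / (2 * eps)


if __name__ == "__main__":
    a, b = "GATTACA", "GCATGCU"
    print("hard score:", alignment_score(a, b))
    print("soft score (gamma=1.0):", alignment_score(a, b, gamma=1.0))
    print("soft score (gamma=0.001):", alignment_score(a, b, gamma=1e-3))
    print("d score / d gap:", gap_sensitivity(a, b, gamma=1.0))
```

The smoothed score is never below the hard score and approaches it as `gamma` shrinks, while larger `gamma` values make the score vary smoothly with the parameters.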

11:45-12:00

EDI Panel Introduction

13:00-13:15

New machine learning approaches to estimate the functional consequence of mutations in diverse human populations

Yuval Itan, Icahn School of Medicine at Mount Sinai, United States
Cigdem Sevim Bayrak, Icahn School of Medicine at Mount Sinai, United States
Avner Schlessinger, Icahn School of Medicine at Mount Sinai, United States
Yiming Wu, Icahn School of Medicine at Mount Sinai, United States

13:15-13:30

Supervised machine learning approach for diagnostic screening of cardiovascular disease using gut microbiome data

Sachin Aryal, University of Toledo College of Medicine and Life Sciences, United States
Ahmad Alimadadi, University of Toledo College of Medicine and Life Sciences, United States
Ishan Manandhar, University of Toledo College of Medicine and Life Sciences, United States
Bina Joe, University of Toledo College of Medicine and Life Sciences, United States
Xi Cheng, University of Toledo College of Medicine and Life Sciences, United States

Cardiovascular disease (CVD), the leading cause of death worldwide, encompasses many morbid conditions, such as hypertension, heart failure, and atherosclerosis, which can develop simultaneously or lead to one another. A comprehensive evaluation of cardiovascular health requires an array of clinical assays and imaging approaches. A systematic screen for any existing cardiovascular dysfunction could therefore save diagnostic time and allow therapeutic interventions to begin early. Gut microbiota dysbiosis has been reported in patients with certain types of CVD, such as hypertension. We therefore hypothesized that supervised machine learning (ML) models could be trained on gut microbiome data for systematic diagnostic screening of CVD. To test our hypothesis, we analyzed 16S rRNA sequencing data from stool samples collected through the American Gut Project. The stool 16S metagenomics data of 478 CVD and 473 non-CVD subjects were analyzed using five supervised ML algorithms: random forest (RF), support vector machine, decision tree, elastic net, and neural networks (NN). We identified 39 differential bacterial taxa (LEfSe: LDA > 2) between the CVD and non-CVD groups, but initial ML classifiers trained on these taxonomic features achieved an AUC (0.0: perfect antidiscrimination; 0.5: random guessing; 1.0: perfect discrimination) of only ~0.58 (RF and NN). When the top 500 high-variance operational taxonomic unit (OTU) features were used for training instead, the AUC improved to ~0.65 (RF). The top 25 highly contributing OTU features (HCOFs) were then selected from those high-variance OTU features, and the RF model trained with only HCOFs achieved an improved AUC of ~0.70. Overall, our study identified dysregulated gut microbiota in CVD patients and developed, for the first time, a gut microbiome-based ML approach that shows promise for systematic diagnostic screening of CVD.
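The AUC scale quoted in the abstract above (0.0, 0.5, 1.0) can be illustrated with a short Python sketch. It computes the AUC as the fraction of (case, control) pairs in which the case receives the higher score, with ties counted as one half. The function name and the toy labels and scores are assumptions for illustration; they are not data or code from the study.

```python
def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    labels: iterable of 0/1 class labels (1 = case, 0 = control).
    scores: iterable of classifier scores, higher meaning "more likely a case".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # correctly ordered pair
            elif p == q:
                wins += 0.5   # a tie counts as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]
    print(auc(toy_labels, [0.1, 0.2, 0.8, 0.9]))  # every case outscores every control
    print(auc(toy_labels, [0.9, 0.8, 0.2, 0.1]))  # every control outscores every case
    print(auc(toy_labels, [0.5, 0.5, 0.5, 0.5]))  # uninformative scores
```

This pairwise form runs in quadratic time, which is acceptable for a few hundred samples; rank-based implementations compute the same quantity faster.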

Sources of Funding:
The work was supported by the Dean’s Postdoctoral to Faculty Fellowship from the University of Toledo College of Medicine and Life Sciences to Xi Cheng. Xi Cheng acknowledges grant support from the P30 Core Center Pilot Grant from NIDA Center of Excellence in Omics, Systems Genetics, and the Addictome. Bina Joe acknowledges grant support from the National Heart, Lung, and Blood Institute (HL143082).

Reference:
Aryal S, Alimadadi A, Manandhar I, Joe B, Cheng X. Machine Learning Strategy for Gut Microbiome-Based Diagnostic Screening of Cardiovascular Disease. Hypertension. 2020;76: 1555–1562. https://www.ahajournals.org/doi/abs/10.1161/HYPERTENSIONAHA.120.15885

13:30-13:45

Supervised machine learning for gut microbiome-based detection of inflammatory bowel diseases

Sachin Aryal, University of Toledo College of Medicine and Life Sciences, United States
Ahmad Alimadadi, University of Toledo College of Medicine and Life Sciences, United States
Ishan Manandhar, University of Toledo College of Medicine and Life Sciences, United States
Bina Joe, University of Toledo College of Medicine and Life Sciences, United States
Xi Cheng, University of Toledo College of Medicine and Life Sciences, United States
Patricia B Munroe, Queen Mary University of London, United Kingdom

Inflammatory bowel diseases (IBD) are characterized by chronic inflammation of the gastrointestinal (GI) tract. Crohn’s disease (CD) and ulcerative colitis (UC) are the two major subtypes of IBD. Although various clinical approaches are available for diagnosing IBD, such as endoscopy and colonoscopy, misdiagnosis occurs frequently, so there is a clinical need to further improve diagnosis of this condition. Since dysbiosis in the GI tract is reported in IBD patients, we hypothesized that gut microbiome data can be used to develop an artificial intelligence-based strategy for diagnostic screening of IBD. To test our hypothesis, fecal 16S metagenomics data of 729 IBD patients and 700 non-IBD controls collected from the American Gut Project were analyzed using five supervised machine learning (ML) models: random forest (RF), decision tree, elastic net, support vector machine and neural networks. Fifty bacterial taxa were found to differ significantly between the IBD and non-IBD groups. Supervised ML classifiers trained with these 50 taxonomic features achieved a testing AUC (area under the receiver operating characteristic curve) of ~0.80 using the RF model. Next, we tested whether operational taxonomic units (OTUs), instead of bacterial taxa, could be used as ML features for diagnostic classification of IBD. The five ML models described above were trained on the top 500 high-variance OTUs, and an improved AUC of ~0.82 was achieved by RF. Further, we tested the capability of the RF model to distinguish between CD and UC using 331 CD and 141 UC samples. A total of 117 bacterial taxa were found to differ significantly between CD and UC, and the RF model trained with these bacterial features achieved a testing AUC of ~0.91. Furthermore, the RF model trained with the top 500 high-variance OTUs achieved a slightly higher AUC of ~0.92.
In summary, we demonstrated robust supervised ML modeling for diagnostic screening of IBD and its subtypes.
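The "top 500 high-variance OTUs" step mentioned in the abstract above can be sketched as a simple variance filter. The Python example below ranks the columns of a small abundance table by population variance and keeps the k highest. The abstract does not say how variance was computed or normalised, so the use of `statistics.pvariance`, the function names, and the toy table are assumptions for illustration only.

```python
from statistics import pvariance


def top_variance_features(table, k):
    """Return the column indices of the k highest-variance features.

    table: list of samples, each a list of feature values (e.g. OTU abundances).
    """
    n_features = len(table[0])
    variances = [pvariance([row[j] for row in table]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    return sorted(ranked[:k])  # keep the original column order


def select_columns(table, columns):
    """Project every sample onto the chosen feature columns."""
    return [[row[j] for j in columns] for row in table]


if __name__ == "__main__":
    # Toy table: 4 samples x 3 hypothetical OTUs (column 0 never varies).
    toy_table = [
        [0, 5, 1],
        [0, 1, 1],
        [0, 9, 2],
        [0, 3, 1],
    ]
    keep = top_variance_features(toy_table, k=2)
    print(keep)
    print(select_columns(toy_table, keep))
```

Because the filter never looks at the class labels, it can be applied before the data are split for training and testing without leaking label information.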

Sources of Funding:
The work was supported by the Dean’s Postdoctoral to Faculty Fellowship from the University of Toledo College of Medicine and Life Sciences to Xi Cheng. Xi Cheng also acknowledges grant support from the P30 Core Center Pilot Grant from NIDA Center of Excellence in Omics, Systems Genetics, and the Addictome. Bina Joe acknowledges grant support from the National Heart, Lung, and Blood Institute (HL143082). Patricia B. Munroe acknowledges support from the National Institute of Health Research Cardiovascular Biomedical Research Centre at Barts and Queen Mary University of London.

Reference:
Manandhar I, Alimadadi A, Aryal S, Munroe PB, Joe B, Cheng X. Gut microbiome-based supervised machine learning for clinical diagnosis of inflammatory bowel diseases. American Journal of Physiology-Gastrointestinal and Liver Physiology. https://doi.org/10.1152/ajpgi.00360.2020

14:00-14:15

Gene Signatures of COVID-19 Infection Severity Identified Using Graph Convolutional Neural Networks On Single Cell RNA-Seq Data

Mario Flores, UTSA, United States
Yufang Jin, UTSA, United States
Huang Wenjian, UTSA, United States
Ricardo Ramirez, UTSA, United States
Karla Paniagua, UTSA, United States

14:15-14:30

Machine learning on knowledge graphs and ontologies

Justin Reese, Lawrence Berkeley National Laboratory, United States
Deepak Unni, Lawrence Berkeley National Laboratory, United States
Nico Matentzoglu, Semanticly Ltd, United Kingdom
Nomi Harris, Lawrence Berkeley National Laboratory, United States
William Duncan, Lawrence Berkeley National Laboratory, United States
Chris Mungall, Lawrence Berkeley National Laboratory, United States

Integrated, up-to-date data about SARS-CoV-2 and COVID-19 is crucial for the ongoing response to the COVID-19 pandemic by the biomedical research community. Knowledge graphs (KGs) are well-suited for integrating the heterogeneous data related to COVID-19. We constructed KG-COVID-19 [1], a KG that integrates a wide variety of data related to COVID-19, and developed a performant software package for machine learning on KGs [2]. We applied machine learning algorithms to produce actionable knowledge from the KG. Our strategy included ranking drug repurposing candidates by cosine similarity, and training and applying link prediction classifiers (multi-layer perceptron, random forest, decision tree, and logistic regression). Using this strategy, we produced ranked lists of drug repurposing candidates for COVID-19 treatment. We then used clinical data from N3C to validate these drug repurposing candidates using a retrospective case-cohort strategy.
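As a rough illustration of ranking candidates by cosine similarity, the Python sketch below orders a handful of node embeddings by their similarity to a target embedding. It does not use the KG-COVID-19 data or the API of the software cited as [2]; the two-dimensional vectors, the node names (`drug_A` to `drug_D`) and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target, candidates):
    """Return candidate names sorted from most to least similar to the target embedding."""
    return sorted(
        candidates,
        key=lambda name: cosine_similarity(target, candidates[name]),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical embeddings; real graph embeddings typically have many more dimensions.
    target_embedding = [1.0, 0.0]
    drug_embeddings = {
        "drug_C": [0.0, 1.0],
        "drug_A": [2.0, 0.0],
        "drug_D": [-1.0, 0.0],
        "drug_B": [1.0, 1.0],
    }
    for name in rank_candidates(target_embedding, drug_embeddings):
        print(name, round(cosine_similarity(target_embedding, drug_embeddings[name]), 3))
```

Cosine similarity ignores vector length, so `drug_A` ranks first here even though its embedding is twice as long as the target's.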

To generalize and extend the graph machine learning tooling that we developed to facilitate COVID-19 research, we have developed a framework called NEAT (Network Embedding All the Things) [3] for configuring reproducible machine learning pipelines on knowledge graphs and ontologies. NEAT machine learning tasks are entirely driven by human-readable configuration files, which both remove the requirement for users to write code and serve as a detailed record of how each machine learning task was conducted. NEAT allows reproducible machine learning on knowledge graphs and ontologies using both node2vec-like algorithms to embed the graph structure and NLP algorithms to embed textual elements of nodes and edges in the graphs (e.g. class descriptions, node labels). The combination of embeddings from graph and textual elements improves performance of graph ML tasks.

References
1. Reese JT, Unni D, Callahan TJ, Cappelletti L, Ravanmehr V, Carbon S, et al. KG-COVID-19: a framework to produce customized knowledge graphs for COVID-19 response. Patterns (N Y). 2020; 100155.
2. https://github.com/monarch-initiative/embiggen
3. https://github.com/Knowledge-Graph-Hub/NEAT

14:30-14:45

Supervised prediction of aging-related genes from a weighted dynamic protein-protein interaction network

Tijana Milenković, University of Notre Dame, United States
Khalique Newaz, University of Notre Dame, United States
Qi Li, University of Notre Dame, United States

Human aging is linked to many prevalent diseases, such as diabetes, cancer, cardiovascular disease, and Alzheimer's disease. Even the recent and widespread COVID-19 seems to be related to aging. The aging process is highly influenced by genetic factors. However, analyzing human aging via wet lab experiments is difficult due to the long human life span and ethical constraints. Analyzing human aging computationally can fill this gap. This includes prediction of aging-related genes via supervised learning from human -omics data, which is the task that we focus on. Gene expression-based methods for this task predict a gene as aging-related if its expression level varies with age. While such approaches do capture aging-specific information, they ignore interactions between genes, i.e., between their protein products. Protein-protein interaction (PPI) network-based methods for this task predict a gene as aging-related if its position (i.e., node representation/embedding/feature) in the PPI network is "similar enough" to the network positions of known aging-related genes. While these approaches do consider the PPIs that carry out cellular functioning, the PPIs are context-unspecific, i.e., they span different conditions, such as cell types, tissues, diseases, and environments.

Unlike the above approaches, we consider a dynamic aging-specific PPI subnetwork that was inferred by integrating aging-specific gene expression data and the entire context-unspecific PPI network data, which should yield more accurate aging-related gene predictions because aging is a dynamic process. However, the considered dynamic subnetwork did not improve prediction performance compared to a static aging-specific subnetwork, despite the aging process being dynamic. This could be because the dynamic subnetwork was inferred using an induced subgraph approach, which is quite naive as it considers all PPIs from the context-unspecific network that exist between only the active genes at a given age. However, first, not all PPIs between the active genes might be equally "important", and the induced approach has no way of identifying the most important of them. Second, the induced approach fails to consider any inactive genes that might critically connect the active genes in the network. Instead, we recently inferred a dynamic aging-specific subnetwork using a methodologically more advanced notion of network propagation (NP), which improved upon the induced dynamic subnetwork in unsupervised analyses of the aging process. Intuitively, NP maps expression levels (i.e., activities) onto the genes in the entire context-unspecific network via random walk or diffusion, to assign condition-specific weights to the nodes (genes) or edges (PPIs) in the entire PPI network. NP then assumes that the highest-weighted network regions are the most relevant for the condition of interest, i.e., such regions form the context-specific subnetwork. Hence, as opposed to the induced approach, first, NP assigns weights to PPIs that can help identify the most "important" PPIs. Second, NP can consider a non-active gene if, for example, the gene is connected to sufficiently many active genes.
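The intuition behind network propagation can be sketched with a random walk with restart on a toy graph. This is one common form of propagation and is used here only as an illustration; the abstract does not specify which NP variant the authors use. The gene names, the restart probability and the graph are hypothetical, and the sketch assumes every node has at least one neighbour.

```python
def propagate(adjacency, seeds, restart=0.5, n_iter=100):
    """Random walk with restart over an undirected graph.

    adjacency: dict mapping each node to a list of its neighbours (none may be empty).
    seeds: dict mapping active nodes to a positive activity (e.g. expression level).
    Returns a dict of node weights that sum to 1.
    """
    nodes = sorted(adjacency)
    total = sum(seeds.get(v, 0.0) for v in nodes)
    start = {v: seeds.get(v, 0.0) / total for v in nodes}  # restart distribution
    weights = dict(start)
    for _ in range(n_iter):
        nxt = {v: restart * start[v] for v in nodes}  # jump back to the active genes
        for u in nodes:
            share = (1.0 - restart) * weights[u] / len(adjacency[u])
            for v in adjacency[u]:
                nxt[v] += share  # spread the remaining mass evenly over neighbours
        weights = nxt
    return weights


if __name__ == "__main__":
    # Hypothetical PPI graph: A, B, C are active; X, Y, Z are inactive.
    toy_ppi = {
        "A": ["X", "Y"],
        "B": ["X"],
        "C": ["X"],
        "X": ["A", "B", "C"],
        "Y": ["A", "Z"],
        "Z": ["Y"],
    }
    scores = propagate(toy_ppi, seeds={"A": 1.0, "B": 1.0, "C": 1.0})
    for gene in sorted(scores, key=scores.get, reverse=True):
        print(gene, round(scores[gene], 3))
```

In this toy graph the inactive gene X, which connects all three active genes, ends up with more weight than the peripheral inactive genes Y and Z, which is the behaviour that an induced subgraph over active genes alone cannot capture.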

Here, we evaluate whether using our existing NP-based dynamic subnetwork will improve upon using the dynamic and static subnetworks constructed by the induced approach in the supervised prediction of aging-related genes. However, the existing NP-based subnetwork is unweighted, i.e., it gives equal importance to each of the aging-specific PPIs. Because accounting for aging-specific edge weights might be important, we additionally propose a weighted NP-based dynamic aging-specific subnetwork. We demonstrate that a predictive machine learning model trained and tested on our weighted NP-based dynamic aging-specific subnetwork yields higher accuracy when predicting aging-related genes than predictive models run on any of the existing unweighted dynamic or static subnetworks.

Our proposed weighted dynamic aging-specific subnetwork could guide the discovery of novel aging-related gene candidates for future wet lab validation with higher confidence than the existing dynamic and static subnetworks.
