Posters - Schedules

Monday, July 11 and Tuesday, July 12 between 12:30 PM CDT and 2:30 PM CDT
Wednesday, July 13 between 12:30 PM CDT and 2:30 PM CDT

Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 11 between 7:30 AM CDT and 10:00 AM CDT
Session A Posters dismantle:
Tuesday, July 12 at 6:00 PM CDT

Session B Poster Set-up and Dismantle
Session B Posters set up:
Wednesday, July 13 between 7:30 AM CDT and 10:00 AM CDT
Session B Posters dismantle:
Thursday, July 14 at 2:00 PM CDT

Virtual: A Novel Method to Predict Intercellular Signaling in Single-cell RNA-seq Data via Graph Convolutional Network
COSI: MLCSB
  • Akram Vasighizaker, University of Windsor, Canada
  • Sheena Hora, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada


Presentation Overview:

Computational approaches, including link prediction methods, have recently been used to study intercellular signaling networks (cell-cell communication) based on graph-structured data. However, by considering only the likelihood of node interactions, they achieve high performance on only some specific networks. Subgraph-based methods address this by analyzing local subgraphs instead of the whole graph.
In this work, we present a novel method that uses an attributed graph convolutional neural network to predict cell-cell communication from single-cell RNA-seq data. Our method extracts both the latent and the explicit attributes of an attributed graph constructed from the gene expression profiles of individual cells. One challenge is converting the high-dimensional, sparse single-cell RNA-seq data to a graph format. We overcome it by applying SoptSC, a similarity-based optimization method in which the cell-cell network is built from a cell-cell similarity matrix learned from gene expression data. We performed experiments on six datasets extracted from human and mouse pancreas tissue. Our comparative analysis shows that our method outperforms latent feature-based approaches, as well as WLNM, a state-of-the-art link prediction method, with 0.99 ROC and 99% prediction accuracy. The datasets are publicly available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE84133 and the code at https://github.com/sheenahora/SEGCECO.

Virtual: A Symbolic Regression Approach to Hepatocellular Carcinoma Diagnosis Using Hypermethylated CpG Islands in Circulating Cell-Free DNA
COSI: MLCSB
  • Rushank Goyal, Betsos, India


Presentation Overview:

Hepatocellular carcinoma is the most common primary liver cancer and a major cause of death worldwide. Despite this, current diagnostic methods are limited by their low sensitivity. DNA methylation changes offer an alternative method of diagnosis through measuring such changes in circulating cell-free DNA present in blood plasma. A genetic programming-based symbolic regression approach was applied to gain the benefits of machine learning while avoiding the opacity drawbacks of "black box" models. The data included plasma samples from 36 patients with hepatocellular carcinoma as well as a control group of 55 that contained patients with and without cirrhosis. The symbolic regression methodology developed an equation utilizing the methylation levels of three biomarkers, with an accuracy of 91.3%, a sensitivity of 100%, and a specificity of 87.5% on the test data. All three biomarkers are differentially methylated in cancerous and non-cancerous samples. The performance matches prior research while providing the added benefits of transparency. Circulating cell-free DNA presents opportunities for minimally invasive early diagnosis of hepatocellular carcinoma. Future validation of the model obtained here on a larger and more diverse dataset can reveal the potential for such approaches in cancer diagnosis and open the way for further research.
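
As a worked illustration of how the three reported figures relate to one another, the sketch below computes accuracy, sensitivity, and specificity from confusion-matrix counts. The counts used (7 true positives, 0 false negatives, 14 true negatives, 2 false positives) are an assumption, chosen only because they reproduce the reported percentages; the abstract does not state the composition of the test set.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return accuracy, sensitivity, specificity


# Hypothetical counts, NOT taken from the abstract.
acc, sens, spec = diagnostic_metrics(tp=7, fn=0, tn=14, fp=2)
print(f"accuracy={acc:.1%} sensitivity={sens:.1%} specificity={spec:.1%}")
# prints: accuracy=91.3% sensitivity=100.0% specificity=87.5%
```

Other count combinations may give the same rounded percentages, so this should be read as an arithmetic check rather than a reconstruction of the study's data.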

Virtual: An Interpretable Machine Learning model for Selectivity of Small Molecules against Homologous Protein Family
COSI: MLCSB
  • Sarveswara Rao Vangala, Tata Consultancy Services Ltd., India
  • Navneet Bung, Tata Consultancy Services Ltd., India
  • Sowmya Krishnan, Tata Consultancy Services Ltd., India
  • Arijit Roy, TCS Innovation Labs-Hyderabad, Tata Consultancy Services Limited, India


Presentation Overview:

The primary goal of drug design is to develop potent small molecules that can inhibit the target protein with high selectivity. Various experimental and computational methods are used to measure the specificity of small molecules for the target protein. Selectivity remains a challenge, especially when the target protein belongs to a homologous family. We have developed a multi-task deep learning model for predicting the selectivity of small molecules across the closely related homologs of the target protein. The multi-task model, which can learn from the training data of related tasks, has been tested on the Janus kinase and dopamine receptor families of proteins. The performance of the multi-task model was evaluated using various representations of small molecules, such as fingerprints (ECFP4) and molecular graphs. The feature-based representation (ECFP4) with XGBoost performed marginally better than the deep neural network models on most of the evaluation metrics. To interpret the model's decisions on selectivity, fragments with positive and negative contributions towards the activity of each homologous protein were identified using the SHapley Additive exPlanations (SHAP) method. The proposed method can be used to screen molecules for selectivity during the initial stage of drug discovery.

Virtual: Antibiotic Resistance Prediction and Biomarker Discovery in Neisseria gonorrhoeae
COSI: MLCSB
  • Rushank Goyal, All India Institute of Medical Sciences, India
  • Rashmi Chowdhary, All India Institute of Medical Sciences, India


Presentation Overview:

Antibiotic resistance is a global problem projected to kill 10 million people each year by 2050, with Neisseria gonorrhoeae among the most urgent threats. Using machine learning techniques, models for resistance prediction were developed and compared. They were also used to test for the existence of genetic signatures (i.e., biomarkers) that had a statistically significant correlation with resistance to azithromycin, ciprofloxacin, or cefixime, three drugs used against N. gonorrhoeae. Eight models were trained on three datasets of 3000+ samples and their corresponding resistance values. Each sample consisted of a unique pattern of certain consensus regions of the genome. XGBoost was the highest-performing model on the testing dataset, with an accuracy of 97% on azithromycin resistance prediction. Support Vector Machine and Random Forest were not far behind, with accuracies of 96% and 95% respectively on azithromycin data. The models were used to determine potential consensus regions correlated with resistance. Out of 584,362 regions, 135, 2612 and 6 regions were correlated with azithromycin, ciprofloxacin and cefixime resistance respectively. This study led to the creation of accurate machine learning models and identified resistance biomarkers for three drugs. The models can be used for genotype-based resistance diagnosis and the biomarkers can be further developed for point-of-care testing.

Virtual: Biological Interpretation of Cell Painting and Gene Expression Features for Mitochondrial Toxicity Prediction
COSI: MLCSB
  • Srijit Seal, Department of Chemistry, University of Cambridge, United Kingdom
  • Jordi Carreras-Puigvert, Uppsala University, Sweden
  • Maria-Anna Trapotsi, Department of Chemistry, University of Cambridge, United Kingdom
  • Hongbin Yang, Department of Chemistry, University of Cambridge, United Kingdom
  • Ola Spjuth, Uppsala University, Sweden
  • Andreas Bender, Department of Chemistry, University of Cambridge, United Kingdom


Presentation Overview:

Cell Painting features are versatile biological descriptors, but they are computational in nature and challenging to interpret. In this study, we investigated the effects of Tox21 on mitochondrial membrane depolarization using Cell Painting, Gene Expression features and Morgan fingerprints for 382 compounds. We found that mitochondrial toxicants differ significantly from non-toxic compounds and that some compounds with similar mechanisms of action cluster together. The correlation between Cell Painting and Gene Expression features associated with mitochondrial toxicity mechanisms allowed us to interpret their biological significance. Granularity features of Cell Painting were highly predictive of mitochondrial toxicity. Fusion models that combined Cell Painting, Gene Expression features, and Morgan fingerprints improved the detection of mitochondrial toxicants by 60% when tested with an external set of 236 compounds, compared to models that use only structural features. Using structural features alone to predict mitochondrial toxicity limited the model applicability domain to the chemical space of the training dataset. Our fusion models extrapolated to new chemical spaces and correctly predicted mitochondrial toxicity when Tox21 assay outcomes were inconclusive because of cytotoxicity. The combination of Cell Painting and Gene Expression features with structural fingerprints improved the applicability domain while facilitating mechanistic analysis by interpreting the features.

Virtual: Biomarkers for Responsiveness to Folfirinox in treatment of PDAC
COSI: MLCSB
  • Tien Van Le, MCPHS University, United States
  • Hao Huynh, MCPHS University, United States
  • George Acquaah-Mensah, MCPHS University, United States
  • Kawther Abdilleh, Pancreatic Cancer Action Network, United States


Presentation Overview:

The prognosis and survival rate of pancreatic ductal adenocarcinoma (PDAC) remain poor due to unsuccessful early detection and complications with treatment. One commonly used standard treatment is a combination of 5-fluorouracil, leucovorin, irinotecan, and oxaliplatin (FOLFIRINOX). A difference in gene expression profiles between treatment-responsive and non-responsive patients was observed in this study. Data for the study were retrieved from the Pancreatic Cancer Action Network (PanCAN)’s SPARK platform. A cohort of patients from PanCAN’s Know Your Tumor program with malignant tumors was examined for Differentially Expressed Genes (DEGs) between responsive and non-responsive groups via DESeq2. Over-representation and Gene Set Enrichment analyses were conducted, and modules of a protein-protein interaction network bearing the DEGs were analyzed.
Among non-responders, keratinization pathway genes were the most significant. Machine learning studies indicated that their expression could be used to predict non-responsiveness. Cross-linked envelope proteins of keratinocytes are closely related to other proline-rich proteins, a major nutrient source for tumor metabolism. Keratin family proteins include KRT and KRTAP. Furthermore, up-regulation of KRT17 is associated with cancer proliferation and chemo-resistance in PDAC. These genes are of paramount importance: they represent probable biomarkers for precision treatment of PDAC and merit further investigation in the development of novel therapeutics.

Virtual: CanProSite: Predicting lung cancer prone sites using deep neural network
COSI: MLCSB
  • Medha Pandey, IIT Madras, India


Presentation Overview:

Lung adenocarcinoma (LUAD) and lung squamous carcinoma (LUSC) are prominent types of lung cancer that lead to high mortality rates worldwide and occur mainly due to somatic driver mutations in proteins. Screening for such mutations is often cost- and time-intensive. Hence, in this study, we systematically analyzed the preferred residues, residue pairs, and motifs of 4172 disease-prone sites in 195 proteins and compared them with 4137 neutral sites. We observed that the motifs LG, QF and TST are preferred in disease-prone sites, whereas GK, KA and ISL are predominant in neutral sites. The residues Gly, Asp, Glu, Gln and Trp are preferred in disease-prone sites. Further, utilizing deep neural networks, we developed CanProSite for predicting disease-prone sites with amino acid sequence-based features (physicochemical properties, conservation scores, secondary structure, and di- and tri-peptide motifs). Our model predicts disease-prone sites with an accuracy of 81%, a sensitivity of 82%, and a specificity of 78% on 10-fold cross-validation. On a test set of 417 disease-causing and 413 neutral sites, we obtained an accuracy of 80% and an AUC of 0.89. This method can serve in the identification of disease-causing and neutral sites in lung cancer.
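
To make the di- and tri-peptide motif features mentioned above concrete, here is a minimal sketch that counts overlapping motifs of a given length in an amino acid sequence. The function name and the toy sequence are hypothetical, and the sketch does not reproduce the CanProSite feature pipeline, which the abstract does not describe in detail.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count overlapping k-residue motifs in an amino acid sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy sequence, used only for illustration.
seq = "MLGQFTSTLG"
di = kmer_counts(seq, 2)   # dipeptide motifs such as LG, QF
tri = kmer_counts(seq, 3)  # tripeptide motifs such as TST
print(di["LG"], di["QF"], tri["TST"])  # prints: 2 1 1
```

Counts like these could be normalized by sequence length or restricted to a window around a site of interest before being passed to a classifier; which variant the authors used is not stated.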

Virtual: Contrastive learning on protein embeddings enlightens midnight zone
COSI: MLCSB
  • Michael Heinzinger, Technical University Munich, Germany
  • Maria Littmann, Technical University Munich, Germany
  • Ian Sillitoe, University College London, United Kingdom
  • Nicola Bordin, University College London, United Kingdom
  • Christine Orengo, University College London, United Kingdom
  • Burkhard Rost, Technical University Munich, Germany


Presentation Overview:

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, is better able to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the “midnight zone” of protein similarity, i.e., the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
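
The idea of embedding-based annotation transfer described above can be sketched as a nearest-neighbor lookup: the query inherits the label of the closest annotated embedding. This is only an illustration under assumptions. The function name, the three-dimensional vectors, and the labels are hypothetical, Euclidean distance is one possible choice of metric, and real pLM embeddings have far more dimensions. The sketch does not include the contrastive re-embedding step that ProtTucker adds.

```python
import math


def transfer_annotation(query, lookup):
    """Return (label, distance) of the lookup embedding closest to `query`."""
    best_label, best_dist = None, math.inf
    for label, embedding in lookup:
        dist = math.dist(query, embedding)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Hypothetical annotated embeddings; labels are placeholders, not CATH codes.
lookup = [
    ("fold_A", [0.9, 0.1, 0.0]),
    ("fold_B", [0.0, 0.8, 0.6]),
]
label, dist = transfer_annotation([0.1, 0.7, 0.7], lookup)
print(label)  # prints: fold_B
```

In practice the returned distance can double as a rough confidence signal: a large distance to the nearest hit suggests the transferred label should be treated with caution.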

Virtual: DCellNet: An End-to-End Detection Network for Dense Cells
COSI: MLCSB
  • Qinyu Wang, Department of Computer Science and Technology, Wuhan University of Science and Technology, China
  • Niu Feng, Department of Food Science and Technology, Huazhong Agricultural University, China
  • Xiangxiang Zeng, Department of Computer Science and Engineering, Hunan University, China
  • Chunhua Deng, Department of Computer Science and Technology, Wuhan University of Science and Technology, China


Presentation Overview:

Motivation: Accurately detecting cells or other biomolecules under microscopic imaging can provide the basis for numerous downstream biological applications, such as the accurate measurement of various types of blood cells to reflect the health status of the body. In recent years, some deep learning methods have successfully advanced research in this area. However, diverse, dense and even adherent cells pose a serious challenge to these methods. To advance the study of dense cell detection, we contribute a microscopic image dataset containing a large number of adherent cells, PSM Dataset. Furthermore, we propose a kernel representation of cells, which focuses on the most core regions of cells while ignoring boundary information, as a way to separate cells. In our proposed framework called DCellNet, both the border information of cells and the aforementioned kernel representation are predicted, and the combination of the two ensures accurate detection of cells and separation between cells.
Results: We compared our method with the professional cell detection software ImageJ and with state-of-the-art detection algorithms based on deep learning. The results show that our method is highly competitive in dense cell detection.
Availability and implementation: The PSM dataset and the source code of DCellNet are available at https://github.com/Wang-Qinyu/DCellNet

Virtual: Deep Bayesian Neural Networks-based Prediction of Protein-Protein Interactions
COSI: MLCSB
  • Jovana Aleksic, Universidad Politécnica de Madrid, Spain
  • Miguel García Remersal, Universidad Politécnica de Madrid, Spain
  • Joel Malek, Weill Cornell Medicine Qatar, Qatar
  • Stephanie Ramadan, Weill Cornell Medicine Qatar, Qatar
  • Nayra Al-Thani, Weill Cornell Medicine Qatar, Qatar


Presentation Overview:

Knowledge about protein-protein interactions (PPIs) provides valuable insights into many biological processes. Traditional ways of identifying PPIs are expensive and labor-intensive. Different deep learning approaches for the prediction of PPIs have been developed recently. However, these do not provide predictions beyond binary interaction labels. We resort to deep Bayesian neural networks (BNNs) to predict PPIs based only on the primary amino acid sequences. We treat this as a regression problem, and we aim to predict the strength of interaction and the interacting region. We propose modeling the strength of interaction as Gaussian distributions instead of point estimates. This is still ongoing research. Currently our model combines probabilistic Convolutional and Long Short-Term Memory layers, and outputs probability distributions quantifying the uncertainty of the prediction. The dataset used in this study was generated at Weill Cornell Medicine Qatar using the recently developed all-vs-all sequencing (AVA-seq) approach to determine PPIs. The prediction of the strength of interaction and the interacting region provides information that allows for a better understanding of protein functions. Furthermore, the use of BNNs facilitates estimation of confidence intervals for the results. The proposed method can be generalized to regression problems where response probability distributions are available, or can be derived, instead of point estimates.
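
To illustrate what predicting a Gaussian distribution instead of a point estimate can mean in practice, the sketch below evaluates the Gaussian negative log-likelihood, one common objective for models that output a mean and a standard deviation. This is an assumption for illustration: the abstract does not state which loss the authors use, and all numeric values are hypothetical.

```python
import math


def gaussian_nll(y, mean, std):
    """Negative log-likelihood of observation y under N(mean, std**2)."""
    var = std ** 2
    return 0.5 * math.log(2 * math.pi * var) + (y - mean) ** 2 / (2 * var)


# Hypothetical predictions for a measured interaction strength.
confident_right = gaussian_nll(y=1.0, mean=1.0, std=0.1)  # narrow and correct
hedged_right = gaussian_nll(y=1.0, mean=1.0, std=1.0)     # wide and correct
confident_wrong = gaussian_nll(y=3.0, mean=1.0, std=0.1)  # narrow and wrong
print(confident_right < hedged_right < confident_wrong)   # prints: True
```

The ordering shows why such an objective encourages calibrated uncertainty: a narrow distribution is rewarded when the prediction is right and penalized heavily when it is wrong.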

Virtual: DeepCircos: A novel deep learning framework for integration and analysis of multi-omics data using circular maps
COSI: MLCSB
  • Xiaojia Tang, Mayo Clinic, United States
  • Naresh Prodduturi, Mayo Clinic, United States
  • Kevin Thompson, Mayo Clinic, United States
  • Vera Suman, Mayo Clinic, United States
  • Krishna Kalari, Mayo Clinic, United States


Presentation Overview:

Deep-learning algorithms like convolutional neural networks (CNNs) have been successfully applied in speech recognition and image classification. However, their application to multi-omics molecular data analysis remains challenging. Previous studies that converted tabular molecular data to images for deep-learning applications were mostly limited to gene expression. We propose a novel framework, DeepCircos, to integrate and visualize multi-omics data based on chromosomal locations as images that can be used to train various supervised and unsupervised image-based deep-learning algorithms. As a proof of concept, we first built training and test datasets from lung adenocarcinoma (n=512) and squamous cell carcinoma (n=498) patients in The Cancer Genome Atlas to assess DeepCircos's ability to differentiate these two subtypes. Applying DeepCircos with the bilinear-CNN classifier to gene expression and copy number data yielded a balanced accuracy of 0.93 in the test set. Grad-CAM analysis showed that the DeepCircos-trained model identified copy number events in chr3q as key differentiating events, consistent with previous findings. A second assessment of DeepCircos involved distinguishing cell types from single-cell RNA-Seq data of mouse brain tissue. Using a pre-trained VGG16 model for feature extraction and a K-means classifier, DeepCircos yielded a balanced accuracy of >95%. DeepCircos showed great potential to accurately classify and identify key features using high-throughput multi-omics data.
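
Balanced accuracy, the metric reported above, is the mean of the per-class recalls, which keeps a majority class from dominating the score. A minimal sketch follows; the labels and predictions are hypothetical and are not drawn from the study.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Hypothetical labels: 4 LUAD and 2 LUSC samples, one LUAD misclassified.
y_true = ["LUAD", "LUAD", "LUAD", "LUAD", "LUSC", "LUSC"]
y_pred = ["LUAD", "LUAD", "LUAD", "LUSC", "LUSC", "LUSC"]
print(balanced_accuracy(y_true, y_pred))  # prints: 0.875
```

Here plain accuracy would be 5/6, while balanced accuracy averages a LUAD recall of 0.75 and a LUSC recall of 1.0.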

Virtual: Domain-Agnostic Self-Supervised Contrastive Learning for Computational Histopathology
COSI: MLCSB
  • Stella Su, Henry M. Gunn High School, United States
  • Rikiya Yamashita, Stanford University, United States


Presentation Overview:

The performance of machine learning models drops when they are deployed across different domains (e.g. scanners, staining protocols, medical sites) in histopathology applications. This domain shift issue is a key challenge that hampers the clinical applicability of such models. Another challenge is the scarcity of labeled data from experienced pathologists.

We developed a novel deep learning algorithm that incorporates histopathology-specific augmentations into self-supervised contrastive learning (SSL). We created an unlabeled dataset from 19 public histopathology datasets spanning 14 organs to learn domain-agnostic representations of histopathology images.

The representations learned by our SSL model transfer well to downstream tasks with domain shift. In lymph node tumor classification, our unsupervised pre-training (using 10% of training labels) outperformed its fully supervised counterpart (using 100% of training labels). Our experiments showed unsupervised pre-training with histopathology-specific augmentations performed significantly better than unsupervised pre-training with augmentations for natural-scene images (AUROC 0.985 vs 0.806). When transferring representations to the downstream mitotic figure detection task, our unsupervised pre-training far surpassed its ImageNet supervised counterpart (AP 45.5 vs 38.4).

Our algorithm (SSL + histopathology-specific augmentations) effectively improves model performance on histopathology data with domain shift and compensates for the scarcity of labeled datasets.
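
For readers unfamiliar with contrastive objectives, the sketch below computes an InfoNCE-style loss for a single anchor: the embedding of one augmented view is pulled towards the embedding of another view of the same image (the positive) and pushed away from embeddings of other images (the negatives). This is an assumption for illustration. The abstract does not name the exact loss, and the vectors, function names, and temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor: -log softmax of the positive similarity."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Hypothetical 2-D embeddings of augmented image views.
anchor = [1.0, 0.0]
aligned = info_nce(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
print(aligned < misaligned)  # prints: True
```

The choice of augmentations determines which pairs count as positives, which is where domain-specific transformations such as the histopathology-specific ones described above would enter.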

Virtual: Drug-Target Interaction Prediction Using Transfer Learning
COSI: MLCSB
  • Alperen Dalkıran, Middle East Technical University, Turkey
  • M. Yağız Gündüz, Middle East Technical University, Turkey
  • Ahmet Rifaioglu, University of Heidelberg, Germany
  • Maria Jesus Martin, EMBL-European Bioinformatics Institute, United Kingdom
  • Rengul Atalay, University of Chicago, United States
  • Aybar Can Acar, Middle East Technical University, Turkey
  • Tunca Dogan, Hacettepe University, Turkey
  • Volkan Atalay, Middle East Technical University, Turkey


Presentation Overview:

Virtual screening of compounds against a target cell or protein is used widely during the initial steps of the drug discovery process. Lately, deep learning-based models for drug-target interaction prediction have yielded highly promising results. However, the impact has been limited since the majority of proposed deep learning models developed so far require large-scale training data. Such large-scale data is not available for many of the target proteins or families, and therefore, no prediction models are available for these. For many proteins, there is very little bioactivity data recorded in the databases or none at all. On the other hand, approaches developed recently, such as transfer learning, few- and zero-shot learning, can learn from such small amounts of data. Transfer learning is a machine learning approach where a model is trained for a source task, and this trained source model is then reused as an initial configuration to train a model (target model) for a different but related target task. Transfer learning has not been extensively exploited so far in the area of drug-target interaction prediction. In this study, we use transfer learning methods for the prediction of interactions between drugs/compounds and understudied target proteins that have scarce training data.
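
The definition of transfer learning given above can be sketched with a deliberately small stand-in model: a logistic regression is trained on a data-rich source task, and its weights are reused as the starting point for a data-poor target task. Everything here is an assumption for illustration, including the toy data, the feature layout, and the step counts. The study itself concerns deep learning models, not logistic regression.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights, data):
    """Mean binary cross-entropy of a logistic model on (features, label) pairs."""
    eps = 1e-12
    total = 0.0
    for x, y in data:
        p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def train(data, init, steps, lr=0.5):
    """Plain gradient descent for logistic regression, starting from `init`."""
    w = list(init)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj / len(data)
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


# Hypothetical toy data; features are (bias, descriptor_1, descriptor_2).
source = [([1, 2.0, 0.1], 1), ([1, 1.5, -0.2], 1), ([1, 1.0, 0.4], 1),
          ([1, -1.8, 0.3], 0), ([1, -2.2, 0.0], 0), ([1, -1.0, -0.3], 0)]
target = [([1, 1.2, 0.2], 1), ([1, -0.9, 0.1], 0)]  # scarce target data

zeros = [0.0, 0.0, 0.0]
source_w = train(source, init=zeros, steps=200)    # source model
scratch_w = train(target, init=zeros, steps=3)     # target model, no transfer
transfer_w = train(target, init=source_w, steps=3) # target model, transferred init

print(log_loss(transfer_w, target) < log_loss(scratch_w, target))  # prints: True
```

The benefit shown here depends on the source and target tasks being related; with unrelated tasks the transferred initialization may not help.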

Virtual: DrugnomeAI: An ensemble semi-supervised learning framework for predicting druggability of candidate drug targets across the whole exome
COSI: MLCSB
  • Arwa Raies, AstraZeneca, United Kingdom
  • Ewa Tulodziecka, AstraZeneca, United Kingdom
  • James Stainer, AstraZeneca, United Kingdom
  • Lawrence Middleton, AstraZeneca, United Kingdom
  • Ryan Dhindsa, AstraZeneca, United Kingdom
  • Pamela Hill, AstraZeneca, United Kingdom
  • Ola Engkvist, AstraZeneca, United Kingdom
  • Andrew Harper, AstraZeneca, United Kingdom
  • Slave Petrovski, AstraZeneca, United Kingdom
  • Dimitrios Vitsios, AstraZeneca, United Kingdom


Presentation Overview:

The druggability of targets is a crucial consideration in drug target selection. Predicting druggability using standard machine learning (ML) approaches is challenging due to the small number of known targets, lack of robust negative data, and extreme data imbalance. Here, we overcome these challenges by re-purposing our stochastic semi-supervised ML framework (Mantis-ML; Vitsios et al., 2020) to develop DrugnomeAI, which estimates the druggability likelihood for every protein-coding gene in the human exome. DrugnomeAI integrates gene-level properties from 15 sources, including protein-protein interaction (PPI) networks, protein binding affinity, and toxicogenomics data, resulting in 326 features. The tool generates exome-wide predictions based on labelled sets of known drug targets (AUCs: 93.9%-99.0%), highlighting PPI networks as top predictors. DrugnomeAI provides specialised models stratified by disease type (e.g., oncology and non-oncology) and therapeutic modality (small molecules, monoclonal antibodies, PROTACs). The top-ranking DrugnomeAI genes were significantly enriched for genes previously selected for clinical development programs (p < 1.0 × 10^-308) and for genes achieving genome-wide significance in phenome-wide association studies from 450K UK Biobank exomes for binary (p = 1.7 × 10^-5) and quantitative traits (p = 1.6 × 10^-7). We provide a freely accessible web application to visualize druggability predictions and key features around gene druggability, across disease types and modalities.

Virtual: EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction
COSI: MLCSB
  • Hannes Stärk, Massachusetts Institute of Technology, Germany
  • Octavian-Eugen Ganea, Massachusetts Institute of Technology, United States
  • Lagnajit Pattanaik, Massachusetts Institute of Technology, United States
  • Regina Barzilay, Massachusetts Institute of Technology, United States
  • Tommi Jaakkola, Massachusetts Institute of Technology, United States


Presentation Overview:

Predicting how a drug-like molecule binds to a specific protein target is a core problem in drug discovery. An extremely fast computational binding method would enable key applications such as fast virtual screening or drug engineering. Existing methods are computationally expensive as they rely on heavy candidate sampling coupled with scoring, ranking, and fine-tuning steps. We challenge this paradigm with EquiBind, an SE(3)-equivariant geometric deep learning model performing direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand's bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. Further, we show extra improvements when coupling it with existing fine-tuning techniques at the cost of increased running time. Finally, we propose a novel and fast fine-tuning model that adjusts torsion angles of a ligand's rotatable bonds based on closed-form global minima of the von Mises angular distance to a given input atomic point cloud, avoiding previous expensive differential evolution strategies for energy minimization.

Virtual: Evaluating deep learning for predicting epigenomic profiles
COSI: MLCSB
  • Ziqi Tang, Cold Spring Harbor Laboratory, United States
  • Shushan Toneyan, Cold Spring Harbor Laboratory, United States
  • Peter Koo, Cold Spring Harbor Laboratory, United States


Presentation Overview:

Deep learning has been successful at predicting epigenomic profiles from DNA sequences. Most approaches frame this task as a binary classification relying on peak callers to define functional activity. Recently, quantitative models have emerged to directly predict the experimental coverage values as a regression. As new models continue to emerge with different architectures and training configurations, a major bottleneck is forming due to the lack of ability to fairly assess the novelty of proposed models and their utility for downstream biological discovery. Here we introduce a unified evaluation framework and use it to compare various binary and quantitative models trained to predict chromatin accessibility data. We highlight various modeling choices that affect generalization performance, including a downstream application of predicting variant effects. In addition, we introduce a robustness metric that can be used to enhance model selection and improve variant effect predictions. Our empirical study largely supports that quantitative modeling of epigenomic profiles leads to better generalizability and interpretability.

Virtual: Explainable Multi-omics Variational Autoencoder (M-VAE) Prediction Model for Alzheimer’s disease Dementia
COSI: MLCSB
  • Sithara Vivek, University of Minnesota, United States
  • Bharat Thyagarajan, University of Minnesota, United States
  • Weihua Guan, University of Minnesota, India


Presentation Overview:

Alzheimer’s disease (AD) dementia is a complex multifactorial process in which epigenetic and biochemical changes occur several years before clinical diagnosis. During the last decade, large amounts of high-throughput molecular data from blood have improved our understanding of the complex pathology of AD dementia. However, early detection of AD dementia remains challenging due to heterogeneity in disease onset and progression. We propose to develop a multi-omics variational autoencoder (M-VAE) model using genetic variants, epigenetic (DNAm), and transcriptomic (RNA-Seq) data from 143 participants with dementia and 2897 participants without dementia in the Health and Retirement Study (HRS). The proposed method will incorporate existing biological knowledge (such as gene-gene interaction) as constraints to improve model generalizability. We optimized methods to account for class imbalance in training the VAE model and evaluated different strategies for feature selection. Biological interpretation of the latent space representation from the DNAm VAE model using differentially methylated probes (p=1064) highlighted genes (e.g. RNF39 and PRDM16) that have been previously reported as epigenetic biomarkers of AD dementia. These preliminary data suggest that M-VAE can help provide novel insights into the complex biology of AD dementia and help develop a non-invasive, blood-based multi-component biomarker signature for early detection of AD.

Virtual: FusionAI: Predicting fusion breakpoint from DNA sequence with deep learning
COSI: MLCSB
  • Pora Kim, The University of Texas Health Science Center at Houston, United States
  • Hua Tan, The University of Texas Health Science Center at Houston, United States
  • Jiajia Liu, The University of Texas Health Science Center at Houston, United States
  • Mengyuan Yang, The University of Texas Health Science Center at Houston, United States
  • Xiaobo Zhou, The University of Texas Health Science Center at Houston, United States


Presentation Overview:

Identifying the molecular mechanisms related to genomic breakage is an important goal of cancer mechanism studies. Among the diverse locations of structural variants, fusion genes, which have their breakpoints in the gene bodies and are typically identified from the split reads of RNA-seq data, provide a prominent structural variant resource for studying genomic breakages together with their expression and potential pathogenic impacts. In this study, we developed FusionAI, which utilizes deep learning to predict gene fusion breakpoints based on DNA sequence and lets us identify the fusion breakage code and genomic context. FusionAI leverages the known fusion breakpoints to provide a prediction model of the fusion genes from the primary genomic sequences via deep learning, thereby helping researchers select fusion genes more accurately and better understand genomic breakage.

Virtual: GEARS: Predicting transcriptional outcomes of novel multi-gene perturbations
COSI: MLCSB
  • Yusuf Roohani, Stanford University, United States
  • Kexin Huang, Stanford University, United States
  • Jure Leskovec, Stanford University, United States


Presentation Overview:

Cellular response to genetic perturbation is central to numerous biomedical applications ranging from identifying genetic interactions involved in cancer to new methods for regenerative medicine. However, the combinatorial explosion in the number of possible multi-gene perturbations severely limits experimental interrogation. Here, we present GEARS, a method that can predict transcriptional response to both single and multi-gene perturbations using single-cell RNA-sequencing data from perturbational screens. GEARS uses graph neural networks to learn multi-dimensional representations for each gene and its perturbation while incorporating a knowledge graph of gene-gene relationships. GEARS is unique in its ability to accurately predict outcomes of perturbing novel genes not seen perturbed during training and combinations thereof. GEARS significantly improves performance over current approaches for predicting post-perturbation gene expression. It has higher precision in predicting five distinct genetic interaction sub-types and can identify the strongest interactions more than twice as well. GEARS can predict novel phenotypic outcomes to multi-gene perturbation and can thus be leveraged for the design of perturbational experiments.

Virtual: Generating drug-like molecules from gene expression signatures using transformer model
COSI: MLCSB
  • Prashant Govindarajan, Indian Institute of Technology Madras, India
  • Sundar Raman P, Indian Institute of Technology Madras, India


Presentation Overview:

The advancements in deep learning algorithms are expected to play an instrumental role in reducing the costs and time associated with drug discovery by accelerating de novo drug design. We propose an attention-based transformer model that can generate new drug-like molecules that can induce a desired transcriptomic profile. Drug-induced gene expression signatures were obtained from the LINCS L1000 database. Our transformer architecture can generate novel and active chemical molecules from a target gene expression signature given as input. Almost half of the generated molecules were valid, unique and synthesizable. Upon evaluating our model on unseen gene expression signatures, we show that the molecules generated by the transformer are not only similar to the actual compounds, but the model also learns to preserve certain structural and chemical features. The model was also evaluated on disease-associated drug-induced gene expression signatures.

Virtual: Generative adversarial networks to study the action of a cardiac inotrope on virtual population of computational models
COSI: MLCSB
  • Viatcheslav Gurev, IBM Research, United States
  • Tim Rumbell, IBM Research, United States
  • James Kozloski, IBM Research, United States


Presentation Overview:

Discovering new patterns and mechanisms in biological data with computational modeling is often challenging due to difficulties in model parameter inference for datasets with large feature variability. To address these difficulties, we developed a novel method to construct virtual populations of ventricular myocytes by integrating mechanistic modeling and machine learning, and employed the method to analyze in vitro cardiac mechanics data, solving the inverse problem of parameter inference. The method is based on a generative adversarial network (GAN) with a loss that constrains biophysical model parameters using cell signals and prior information on the model parameters. The GAN was employed to construct virtual populations of cardiac ventricular myocytes in a study of the action of Omecamtiv Mecarbil (OM), a positive cardiac inotrope, using signals of unloaded myocyte shortening in experiments on rat myocytes, in the presence and absence of OM. The analysis suggests a novel action of OM due to changes in the interactions between myosin and tropomyosin proteins. In the validation stage, model parameters inferred with the GAN were used to replicate other in vitro experimental protocols. Our approach is critical for parameter inference of biophysical models and exploration of complex biological systems.

Virtual: Hyperbolic geometry-based deep learning methods to produce population trees from genotype data
COSI: MLCSB
  • Aman Patel, Department of Computer Science, Stanford University, United States
  • Daniel Mas Montserrat, Department of Biomedical Data Science, Stanford University, United States
  • Carlos Bustamante, Department of Biomedical Data Science, Stanford University, United States
  • Alexander Ioannidis, Institute for Computational and Mathematical Engineering, Stanford University, United States


Presentation Overview:

The production of population-level trees from individual-level genomic data is a fundamental task in population genetics. Typically, these trees are produced using methods like hierarchical clustering, neighbor joining, or maximum likelihood. However, such methods are generally non-parametric: the addition of new data points necessitates regeneration of the entire tree, a potentially expensive process. They also do not easily integrate with larger workflows. We aim to address these problems by introducing parametric deep learning methods for tree formation. Our models specifically create continuous representations of population trees in hyperbolic space, which has previously proven effective in embedding hierarchically structured data. We present two different architectures - a multi-layer perceptron (MLP) and a variational autoencoder (VAE) - and we analyze performance using a variety of metrics. Both models tested produce embedding spaces that reflect human evolutionary history. We also demonstrate the generalizability of these models by verifying that addition of new samples to an existing tree occurs in a semantically meaningful manner. Finally, we compare the quality of trees generated by our models to those produced by established methods. Even though the benchmark methods are directly fit on the evaluation data, our models outperform some of these and achieve highly comparable performance overall.

Virtual: Identifying selections operating on HIV RT via Uniform Manifold Approximation and Projection
COSI: MLCSB
  • Shefali Qamar, University of California Santa Cruz, United States
  • Manel Camps, University of California Santa Cruz, United States
  • Jay Kim, University of California Santa Cruz, United States


Presentation Overview:

We analyze 14,651 HIV-1 reverse transcriptase sequences from the Stanford HIV database to study the evolution of HIV RT under multiple-drug selection in the clinic. Our goal is to map out distinct sectors of HIV RT’s sequence space that are accessible to evolution. We utilize Uniform Manifold Approximation and Projection (UMAP), a graph-based dimensionality reduction technique uniquely suited for the detection of non-linear dependencies, and visualize the results using an unsupervised clustering algorithm based on density analysis, producing 21 distinct clusters of sequences. Supporting the biological significance of these clusters, they represent phylogenetically related sequences with strong correspondence to distinct treatment regimens. Further, the two largest clusters segregate sequences associated with nucleoside reverse transcriptase inhibitor and with non-nucleoside reverse transcriptase inhibitor treatments. Note that treatment information was not used as input in this analysis. Thus, the visualization of areas of HIV RT sequence space that are being explored by evolution allows the identification of individual selections. Further, subtracting known diagnostic mutations from the outcome of our dimensionality reduction analysis provides information about the higher-order epistatic context facilitating the evolution of distinct HIV RT drug resistance mutational pathways, information that is generally not accessible by other types of epistatic analyses.

Virtual: Improved prediction of bacterial CRISPRi guide efficiency through data integration and automated machine learning
COSI: MLCSB
  • Yanying Yu, Helmholtz Institute for RNA-based Infection Research (HIRI) / Helmholtz Centre for Infection Research (HZI), Germany
  • Sandra Gawlitt, Helmholtz Institute for RNA-based Infection Research (HIRI) / Helmholtz Centre for Infection Research (HZI), Germany
  • Lisa Barros de Andrade E Sousa, Helmholtz AI, Helmholtz Zentrum München, Germany
  • Erinc Merdivan, Helmholtz AI, Helmholtz Zentrum München, Germany
  • Marie Piraud, Helmholtz AI, Helmholtz Zentrum München, Germany
  • Chase Beisel, Helmholtz Institute for RNA-based Infection Research (HIRI), Faculty of Medicine, University of Würzburg, Germany
  • Lars Barquist, Helmholtz Institute for RNA-based Infection Research (HIRI), Faculty of Medicine, University of Würzburg, Germany


Presentation Overview:

CRISPR interference (CRISPRi), the targeting of a catalytically dead Cas protein to block transcription, is the leading technique to silence gene expression in bacteria. However, deriving design rules for optimal guide RNAs (gRNAs) remains challenging. Using automated machine learning on integrated data derived from three genome-wide CRISPRi screens, we show that depletion of gRNAs in the screens can be predicted by a combination of rich biological features, with gene expression having an outsized influence on gene depletion, and that tree ensemble-based regressors outperform linear regression and deep learning models on available training data. By further implementing mixed-effect models to separate out gene-specific effects, we develop a random forest regression model that isolates effects manipulable in gRNA design and apply methods from explainable AI to derive interpretable design rules. We validate our method with a high-throughput saturating screen of gRNAs targeting purine biosynthesis genes in Escherichia coli, which shows that targeting can be effective deep within the coding sequence, in contrast to previous findings. Our approach not only offers a powerful tool for CRISPRi (freely available at http://ciao.helmholtz-hiri.de) but also provides a blueprint for the development and interpretation of predictive models for other CRISPR-based technologies in bacteria.

Virtual: Inference of cell state transitions and cell fate plasticity from single-cell with MARGARET
COSI: MLCSB
  • Kushagra Pandey, Indian Institute of Technology, Kanpur, India
  • Hamim Zafar, Indian Institute of Technology, Kanpur, India


Presentation Overview:

Despite recent advances in inferring cellular dynamics using single-cell RNA-seq data, existing trajectory inference (TI) methods face difficulty in accurately reconstructing the cell-state manifold and inferring trajectory and cell fate plasticity for complex topologies. We present MARGARET (https://github.com/Zafar-Lab/Margaret), a novel and scalable computational method for inferring single-cell trajectory and fate mapping for diverse dynamic cellular processes. MARGARET reconstructs complex trajectory topologies using deep unsupervised metric learning and a graph-partitioning approach based on a novel connectivity measure, automatically detects terminal cell states, and generalizes the quantification of fate plasticity for complex topologies. On a diverse benchmark consisting of real and simulated datasets with ground-truth trajectory annotations, MARGARET outperformed state-of-the-art methods in recovering global topology and cell pseudotime ordering. For human hematopoiesis, MARGARET accurately identified all major lineages and associated gene expression trends and helped identify transitional progenitors associated with key branching events. For embryoid body differentiation, MARGARET identified novel transitional populations that were validated by bulk sequencing and functionally characterized different precursor populations in the mesoderm lineage. For colon differentiation, MARGARET characterized the lineage for BEST4/OTOP2 cells and the heterogeneity in the goblet cell lineage in the colon under normal and inflamed ulcerative colitis conditions.

Virtual: Machine Learning Prediction of Molecular Initiating Events using Chemical Target Annotations and Gene Expression
COSI: MLCSB
  • Joseph L. Bundy, U.S. EPA, United States
  • Richard S. Judson, US EPA, United States
  • Imran Shah, US Environmental Protection Agency, United States
  • Antony J. Williams, US Environmental Protection Agency, United States
  • Chris Grulke, U.S. Environmental Protection Agency, United States
  • Logan J. Everett, U.S. Environmental Protection Agency, United States


Presentation Overview:

The advent of high-throughput transcriptomic screening technologies has resulted in a wealth of gene expression signatures associated with chemical treatment. Here, we integrate data from a large public compendium of transcriptomic responses to chemical exposure (LINCS) with a comprehensive database of chemical-protein associations (RefChemDB) in order to train binary classifiers that predict molecular initiating events (MIEs) from the transcriptomic response to each chemical exposure. We trained classifiers using expression data associated with chemical treatments linked to 51 distinct MIEs, and compared results for multiple classification algorithms and input feature sets. We also performed multiple analyses to validate the utility of these models, which ultimately resulted in high performance classifiers for 9 MIEs using MCF7 cells. MIEs modeled with dissimilar accuracies between MCF7 and PC3 cells, such as estrogen receptor activation, were found to correspond to targets that have different baseline expression in these cell lines. Overall, this methodology can offer insight when prioritizing chemicals of interest for targeted validation assays, which is relevant to U.S. EPA’s goal of increasing efficiency in chemical safety screening. This abstract does not necessarily reflect US EPA policy.

Virtual: Maize Feature Store (MFS): a graphical user interface with multi-omics data integration and machine-learning applications for Maize B73v5 gene classifications
COSI: MLCSB
  • Shatabdi Sen, Iowa State University, United States
  • Carson Andorf, Iowa State University, United States
  • Margaret Woodhouse, USDA, United States
  • John Portwood, USDA, United States


Presentation Overview:

The big-data analysis of multi-omics data associated with maize genomes is increasingly utilized to accelerate genetic research and improve agronomic traits. As a result, efforts have increased to integrate diverse datasets and extract meaning from these measurements. In the past, multiple multi-layer data structures have been proposed for the integration of multi-omics biological information associated with maize, but none have incorporated an unstructured omics data warehouse along with providing end-to-end solutions for evaluating and linking these data to target gene annotations. In our work, we present the Maize Feature Store (MFS), a versatile NoSQL-based Flask application that combines features built on multi-omics data to facilitate exploration and modeling of a broader spectrum of heterogeneous information via a variety of univariate, bivariate, and multivariate analysis modules. In addition, MFS integrates various machine-learning algorithms, both supervised and unsupervised, that can significantly simplify the analysis and prediction of complex genome annotations. MFS was able to create an accurate pan-genome classification model with an AUC-ROC score of 0.853. The functionality of MFS can be freely accessed via an online webserver (https://mfs.maizegdb.org/).

Virtual: Mixed benefit from neural networks for transcriptomic machine learning tasks
COSI: MLCSB
  • Benjamin J. Heil, Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, United States
  • Jake Crawford, Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, United States
  • Casey S. Greene, Center for Health AI, University of Colorado Anschutz School of Medicine, United States


Presentation Overview:

Deep learning methods can discover complex, nonlinear decision surfaces that have proven useful in many fields. However, classifiers employed in biology and medicine that use transcriptomic data generally rely on linear models. It is unclear when transcriptomic data possesses the necessary and sufficient conditions for deep learning to outperform linear models. To better understand these conditions, we compared a linear model (ridge logistic regression) and two neural networks, evaluating their performance on tissue type prediction in simulated, GTEx, and ReCount3 data across a spectrum of sample counts and tissues. We found that deep learning doesn’t necessarily surpass linear models, even in cases with thousands of samples and demonstrated nonlinear relationships between the data and the prediction target. This work shows that while deep learning is a promising tool, it is crucial to put results in the context of a representative baseline model to justify the increase in complexity inherent in highly parameterized models.

Virtual: Multi-omics Topic Modeling for Breast Cancer Classification
COSI: MLCSB
  • Filippo Valle, University of Turin and INFN, Italy
  • Matteo Osella, University of Turin and INFN, Italy
  • Michele Caselle, University of Turin and INFN, Italy


Presentation Overview:

Topic modeling is a widely used approach to extracting relevant information from large datasets. Recently, the problem of finding a latent structure in a dataset has been mapped to the community detection problem in network theory, and a new class of topic modeling strategies has been introduced. We tested this approach on lung and breast cancer samples from the TCGA, using messenger RNA and microRNA data. The established cancer subtype organization is well reconstructed in the inferred latent topic structure. Moreover, the “topics” that the algorithm extracts correspond to genes involved in cancer development, and they are enriched in genes known to play a role in the corresponding disease; they are also strongly related to the survival probability of patients.
In biology, integrating transcriptional data with other layers of information, such as the post-transcriptional regulation mediated by microRNAs, can be crucial in identifying the driver genes and the subtypes of complex and heterogeneous diseases such as cancer. More specifically, we show how an algorithm based on a hierarchical version of stochastic block modeling can be adapted to integrate any combination of data. We will also show that the inclusion of the microRNAs layer significantly improves the accuracy of subtype classification.

Virtual: Navigating Connectivity Mapping Workflows for Predicting Molecular Targets with Gecco
COSI: MLCSB
  • Imran Shah, US Environmental Protection Agency, United States
  • Joseph Bundy, U.S. EPA, United States
  • Bryant Chambers, US Environmental Protection Agency, United States
  • Logan Everett, U.S. Environmental Protection Agency, United States
  • Derik Haggard, US EPA, United States
  • Joshua Harrill, US Environmental Protection Agency, United States
  • Richard Judson, US EPA, United States


Presentation Overview:

There are thousands of chemicals in commerce, but we know the health effects of only a fraction due to the cost of animal testing. High-throughput transcriptomics could efficiently prioritize bioactive chemicals and reduce dependence on animal testing with computational approaches. We developed the generalized connectivity toolkit (Gecco) to harmonize disparate connectivity mapping workflows to match transcriptomic profiles of environmental chemicals with gene signatures associated with targets. Statistical aggregation (gene set enrichment analysis (GSEA), parametric and non-parametric t-statistics), and vector similarity (cosine, correlation, and Jaccard index) measures are implemented in Gecco using an extensible object-oriented approach. To perform these calculations at scale, Gecco includes a MongoDB framework for managing gene signatures (e.g., MSigDB, Dorothea, CREEDS) and transcriptomic profile collections (e.g., GEO, ArrayExpress, Connectivity Map, and LINCS). We are using Gecco to manage transcriptomic data for thousands of environmental perturbagens and evaluate the performance of different connectivity mapping strategies to predict their putative molecular targets. This talk will present an overview of Gecco and illustrative examples of everyday chemicals linked with nuclear receptor and stress response pathway activation.
This abstract does not necessarily reflect US EPA policy.
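
The three vector similarity measures named in the Gecco abstract above (cosine, correlation, and Jaccard index) can be illustrated with a minimal Python sketch. This is not Gecco's code or API: the function names, the toy profiles, and the choice to compute the Jaccard index on sets of gene identifiers are assumptions made purely for illustration.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson_correlation(x, y):
    """Pearson correlation, computed as the cosine of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)

def jaccard_index(set_a, set_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(set_a), set(set_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical values for the same four genes in a profile and a signature.
profile = [1.0, 2.0, 3.0, 4.0]
signature = [2.0, 4.0, 6.0, 8.0]

# Hypothetical gene sets (placeholder identifiers, not real annotations).
profile_genes = {"GENE_A", "GENE_B", "GENE_C"}
signature_genes = {"GENE_B", "GENE_C", "GENE_D"}

print(cosine_similarity(profile, signature))
print(pearson_correlation(profile, signature))
print(jaccard_index(profile_genes, signature_genes))
```

Cosine and correlation compare the direction of two numeric profiles, whereas the Jaccard index compares set membership only, which is why it is shown here on gene sets rather than on the numeric vectors.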

Virtual: Predicting Adverse Drug Reactions via Heterogeneous Graph Neural Network Learning using Multi-Modal Input Data
COSI: MLCSB
  • Sophia Krix, Fraunhofer Institute for Scientific Computing and Algorithms, Germany
  • Lauren DeLong, University of Edinburgh, United Kingdom
  • Sumit Madan, Fraunhofer Institute for Scientific Computing and Algorithms, Germany
  • Daniel Domingo-Fernández, Fraunhofer Institute for Scientific Computing and Algorithms, Germany
  • Holger Froehlich, Fraunhofer Institute for Scientific Computing and Algorithms, Germany


Presentation Overview:

Clinical trials often have to be terminated at a late stage because of unexpected adverse events, after many years of research and substantial investment. In order to avoid late-stage failures of drug candidates, it is crucial to have information on their potential side effects prior to their first trial. With the help of novel graph neural networks, we are able to predict the likelihood of unwanted side effects of drugs of interest before they enter a clinical trial.
We therefore developed a biomedical knowledge graph to model relationships between drugs, proteins and side effects. We incorporated multi-modal information about gene expression experiments, chemical compound structure, gene ontology, protein structure and indication areas in the generation of the knowledge graph. We extended different graph neural networks (Relational Graph Attention Networks, Relational Graph Convolutional Neural Networks) to handle multi-modal input data for individual features. We compared our approach against the PREDICT method and Random Forests, demonstrating superior prediction performance of our method. Altogether, our work demonstrates the potential of modern graph neural networks to integrate multiple data types as well as a knowledge graph for highly accurate prediction of adverse drug reactions.

Virtual: Prediction of Virus-Host Protein-Protein Interactions Using Protein Sequence Embeddings in a Deep Siamese Twin Network
COSI: MLCSB
  • Sumit Madan, Fraunhofer Institute for Scientific Computing and Algorithms, Germany
  • Victoria Demina, NEUWAY Pharma GmbH, Germany
  • Marcus Stapf, NEUWAY Pharma GmbH, Germany
  • Oliver Ernst, NEUWAY Pharma GmbH, Germany
  • Holger Fröhlich, Fraunhofer Institute for Scientific Computing and Algorithms, Germany


Presentation Overview:

Viruses causing infectious diseases have always been a major public health concern and will be in the future. Therefore, prediction and understanding of virus-host interactions have huge relevance for the development of novel therapeutic intervention strategies. Here, we propose a novel deep Siamese twin neural network architecture using the pre-trained ProtBERT model to predict virus-host protein-protein interactions (PPIs) based on just protein sequences. We evaluate our approach on several PPI datasets and we apply it to two use cases, SARS-CoV-2 and John Cunningham polyomavirus (JCV), to predict virus protein to human host interactions. Our method predicted for the SARS-CoV-2 spike protein an interaction with the sigma 2 receptor that has been suggested as a drug target in the literature. Additionally, we applied our method to predict interactions of the JCV capsid protein VP1 showing an enrichment of PPIs with neurotransmitters, which are known to function as an entry point of the virus into glial brain cells. We also identify the parts of a pair of sequences that contribute to the PPI through Explainable AI (XAI) techniques. Altogether our work highlights the potential of pre-trained protein sequence embeddings as well as XAI methods for the analysis of PPIs of disease-causing viruses.

Virtual: Prioritizing Biomarkers and Driver Genes in Lung Adenocarcinoma using Graph Convolutional Networks
COSI: MLCSB
  • Shreya Raghavendra, Boltzmann Labs, India
  • Shravya Gupta, Boltzmann Labs, India
  • Sarath Kolli, Boltzmann Labs, India


Presentation Overview:

In the past decade, single-omic analyses have greatly helped understand human disease. However, heterogeneous diseases like cancer, having many etiologies for a single phenotype, are better characterized by multi-omics studies. In this study, we use Graph Convolutional Networks (GCNs) to prioritize target genes in Lung Adenocarcinoma. We integrate Gene Expression, Single Nucleotide Variation, Copy Number changes, DNA Methylation, miRNA Expression, and Protein Interaction Data into a single predictive model. GCNs outperform network-only and feature-only models in this regard as they can combine data originating from varied assay types, while also modeling the flow and interaction of biological information at multiple levels. Training the GCN on single-omics data, we identify important biomarkers – hyper and hypo-methylated CpG sites and genes, deregulated miRNAs, differentially expressed genes, and hypermutated genes. Our multi-omics model recovers 75% of known genes, a ~10% improvement over current methods. Additionally, while epigenetic markers usually contribute to 15-20% of the recovery, using miRNA expressions increased this value to 30%, which is a significant inference as epigenetic modifications are reversible. To our knowledge, this is the first study to integrate miRNA data with the other stated omics into a single oncological system and prioritize driver genes with such high accuracy.

Virtual: Proteins as language: NLP, Machine Learning & Protein sequences
COSI: MLCSB
  • Dan Ofer, The Hebrew University of Jerusalem, Israel
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel


Presentation Overview:

Natural language processing (NLP) is a field of computer science concerned with automated text and language analysis. In recent years, following a series of breakthroughs in deep learning and machine learning, NLP methods have shown overwhelming progress. Here, we review the success, promise and pitfalls of applying NLP algorithms to the study of proteins. Proteins, which can be represented as strings of amino-acid letters, are a natural fit to many NLP methods. We explore conceptual similarities and differences between proteins and language, and review a range of protein-related tasks amenable to machine learning. We present methods for encoding the information of proteins as text and analyzing it with NLP methods, reviewing classic concepts such as bag-of-words, k-mers/n-grams and text search, as well as modern techniques such as word embedding, contextualized embedding, deep learning and neural language models. In particular, we focus on recent innovations such as masked language modeling, self-supervised learning and attention-based models. Finally, we discuss trends and challenges in the intersection of NLP and protein research.

Virtual: R and machine learning analysis of a dataset with liver profile and multiclass categorical classification
COSI: MLCSB
  • Silvia Vasquez, Universidad Peruana Cayetano Heredia, Peru


Presentation Overview:

Objective
To load, summarize, and visualize a dataset and to evaluate the fit of suitable models.
Methods
The “HCV” dataset was donated by its authors Dua, D and Graff, C. to the UCI Machine learning repository. The dataset presented 615 instances and 14 attributes. The target attribute presented 5 classes. A statistical summary of the data and class distribution was performed. The data was separated for training (80%) and validation (20%) and optimal models were selected.
Results
The dataset presented numerical attributes of laboratory blood tests to detect liver damage (cirrhosis, fibrosis) or hepatitis C. According to our model selection, the most important variables were ALP, AST, GGT, and ALT. The least important were sex and age. Of the 5 models built, RF, SVM, and LDA showed the best accuracy. The RF model was run directly on the validation set and the results were summarized in a confusion matrix; the largest errors were for the fibrosis and hepatitis classes. The smallest error occurred in blood donors.
Conclusion
Exploring a multiclass dataset by processing the data for statistical analysis and visualization is adequate for evaluating optimal models with ML and R.
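
The 80%/20% training and validation separation described in the Methods above can be sketched as follows. This is an illustration in Python rather than the author's R code; splitting within each class (stratification), the function name, the seed, and the toy class labels and counts are all assumptions, since the abstract does not say how the split was drawn.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, valid_idx), splitting each class separately."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_idx, valid_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        valid_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(valid_idx)

# Hypothetical, imbalanced toy labels (not the real class counts of the dataset).
labels = ["donor"] * 50 + ["hepatitis"] * 10 + ["fibrosis"] * 10 + ["cirrhosis"] * 10
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))
```

Splitting per class keeps rare classes represented in both subsets, which matters when, as here, one class is much larger than the others.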

Virtual: RubricOE: what Machine Learning can say about Alzheimer’s Disease
COSI: MLCSB
  • Aldo Guzman-Saenz, IBM Thomas J. Watson Research Center, United States
  • Daniel Platt, IBM Thomas J. Watson Research Center, United States
  • Filippo Utro, IBM Thomas J. Watson Research Center, United States
  • Aritra Bose, IBM Thomas J. Watson Research Center, United States
  • Subrata Saha, IBM Thomas J. Watson Research Center, United States
  • Laxmi Parida, IBM, United States


Presentation Overview:

Alzheimer’s disease (AD) is notable for having substantial heritability unaccounted for by single nucleotide polymorphism alleles. A small handful of SNPs strongly linked with the APOE E4 allele shows very strong odds ratios associated with Alzheimer’s. Among AD patients, early onset (EOAD) vs. late onset (LOAD) show different impacts in memory and language vs. motor and atypical AD symptoms, as well as familial heritability in EOAD. Generally, genetic association studies do not account for the heritability.

This study considers the question of what machine learning (ML) approaches may reveal about pathogenic processes through identified alleles. We have constructed an ML pipeline, which we call RubricOE, comprised of linear kernel support vector machines, using feature ranking based on heritability estimated by variance predicted by linear ridge regression, and with multiple layers of cross validation to identify “stable sets” of the most strongly predictive features that remain consistent across all the training/test and validation splits. We evaluate these features using logistic regression characterizing expected sampling-induced variability, to relate cross-validation stable sets with GWAS confidence levels and to identify novel features that ML identifies but GWAS-like methods miss.

Virtual: scDREAMER: atlas-level integration of single-cell datasets using deep generative model paired with adversarial classifier
COSI: MLCSB
  • Ajita Shree, Indian Institute of Technology Kanpur, India
  • Krushna Pavan Musale, Indian Institute of Technology Kanpur, India
  • Hamim Zafar, Indian Institute of Technology Kanpur, India


Presentation Overview:

Recent advances in single-cell sequencing techniques have resulted in the generation of complex datasets harboring batch effects associated with different sequencing protocols, tissue locations, donors, time points, and conditions, necessitating the development of data integration algorithms. Here we present a novel deep learning-based data integration framework, called scDREAMER, that employs a novel adversarial variational autoencoder consisting of an autoencoder and a discriminator, and a batch classifier (a multi-layer neural network) for learning lower-dimensional cellular embeddings from high-dimensional scRNA-seq data while overcoming batch effects. We evaluated the performance of scDREAMER on 5 real datasets consisting of up to 1 million cells and 30 batches. For these data integration tasks, scDREAMER was able to overcome a variety of challenges including the presence of skewed cell types among batches, nested batch effects, a large number of batches, and preservation of developmental trajectory across batches. We demonstrated scDREAMER's superiority in batch correction, conservation of biological variation, and identification of rare cells over state-of-the-art methods using nine different metrics as well as composite score metrics that assess the holistic performance of a method. scDREAMER is scalable and can perform atlas-level integration across species (e.g., human and mouse), as shown using a dataset of 1 million cells.

Virtual: Sequence predictability score to boost classification of protein sequences
COSI: MLCSB
  • Ved Piyush, University of Nebraska - Lincoln, United States
  • Ruibo Zhang, Texas Tech University, United States
  • Ranadip Pal, Texas Tech University, United States
  • Souparno Ghosh, University of Nebraska - Lincoln, United States


Presentation Overview:

Amino acid sequences corresponding to spike proteins of SARS-CoV-2 have been successfully used to empirically classify variants with very high precision. It appears that the spike protein signatures of Variants of Concern (VoC) and Variants of Interest (VoI) are quite distinctive. However, there exist a large number of uncategorized variants that exhibit a high level of heterogeneity in their amino acid sequences. Even state-of-the-art classifiers trained to predict SARS-CoV-2 variants from their spike protein sequences exhibit substantial loss in accuracy and precision when the foregoing uncategorized variants are thrown into the mix. Hence, the key problem is how to handle high heterogeneity in the amino acid sequences associated with this imprecisely categorized class when developing a simultaneous multiclass classifier. We approach this classification problem from a sequence-to-sequence prediction perspective. The central conceit of our approach is that for sequences that have precise labels (in this case Delta) there exists a pattern of prediction error when sequences of length, say, l_1 are used to predict the immediately succeeding sequence of length, say, l_2. However, for more diffuse classes, the ability of sequences of length l_1 to predict their immediately succeeding sequences of length l_2 would reveal a different pattern of prediction error.

Virtual: Similarity metric learning in high dimensional perturbational datasets
COSI: MLCSB
  • Ian Smith, University of Toronto, Canada
  • Petr Smirnov, University of Toronto, Canada
  • Benjamin Haibe-Kains, University Health Network, Canada


Presentation Overview:

High-throughput perturbational datasets, like the Next Generation Connectivity Map (L1000), use similarity metrics to identify perturbations or disease states that induce similar changes in the biological feature space. Similarities among perturbations are then used to identify drug mechanisms of action, to nominate therapeutics for a particular disease, and to construct biological networks among perturbations and genes. Standard similarity metrics include correlations, cosine distance, and a variety of gene set methods like GSEA, but these methods do not optimize or incorporate information about the measurement space. We introduce a similarity metric learning method to learn a class of similarity functions from the data that maximizes discrimination of replicate signatures by transforming the biological measurements into a natural basis. The learned similarity functions show substantial improvement for recovering known biological relationships from the data. In addition to capturing a more powerful notion of similarity, the transformed basis data can be used for other machine learning tasks on the data, like classification and clustering. Similarity metric learning on biological data is a powerful tool for the analysis of large biological datasets.
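The idea of comparing signatures in a transformed basis can be illustrated with a small sketch. It is not the authors' method: the signatures and the diagonal transform W below are hypothetical values chosen by hand, whereas the abstract describes learning the transform from replicate data. The sketch only shows how a linear change of basis can alter which signatures a cosine similarity considers close.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def transform(matrix, vec):
    """Apply a linear map (given as a list of rows) to a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]

def transformed_similarity(matrix, u, v):
    """Cosine similarity measured after mapping both vectors through the matrix."""
    return cosine_similarity(transform(matrix, u), transform(matrix, v))

# Hypothetical signatures: two replicates of one perturbation and one
# unrelated perturbation. The third feature is dominated by noise.
replicate_a = [1.0, 0.2, 5.0]
replicate_b = [1.1, 0.1, -4.0]
unrelated = [-1.0, 0.3, 4.5]

# Hypothetical transform that down-weights the noisy third feature.
# In a metric learning setting this matrix would be fit to the data.
W = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.05],
]

plain_rep = cosine_similarity(replicate_a, replicate_b)
plain_unrel = cosine_similarity(replicate_a, unrelated)
learned_rep = transformed_similarity(W, replicate_a, replicate_b)
learned_unrel = transformed_similarity(W, replicate_a, unrelated)

print(f"plain cosine:       replicate={plain_rep:.3f} unrelated={plain_unrel:.3f}")
print(f"transformed cosine: replicate={learned_rep:.3f} unrelated={learned_unrel:.3f}")
```

With these toy values, plain cosine scores the unrelated signature as more similar than the true replicate, while the transformed similarity reverses that ordering.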

Virtual: Single Cell Pathway Embedding Neural Network (scPENN) for Single Cell RNA-Seq Analysis
COSI: MLCSB
  • Thatchayut Unjitwattana, University of Michigan, United States
  • Shashank Yadav, University of Michigan, United States
  • Bing He, University of Michigan, United States
  • Lana Garmire, University of Michigan, United States


Presentation Overview:

Single-cell RNA sequencing (scRNA-seq) has been used to discover rare cell populations, track the trajectories of distinct cell lineages in development, and uncover regulatory relationships between genes. Compared to traditional machine learning techniques, classification using the deep neural network (NN) approach could potentially capture complex features of high-dimensional scRNA-seq data. However, the major disadvantage of the NN approach is the lack of interpretability, as the decision-making process behind this approach is a black box and is not always intuitive. Integrating prior biological knowledge into the NN model could help provide higher-level biological interpretations. The deep learning model's relationships between input nodes (genes) and hidden nodes (pathways) are determined by gene-pathway mapping. However, pathway embedding as a form of enhancing the interpretability of NN is currently lacking in the scRNA-Seq analysis domain. Here we introduce a single cell pathway embedding neural network (scPENN), which integrates the gene-pathway relationships as the connections between nodes in the model. Our model yielded better performance than traditional machine learning and fully-connected neural network approaches while providing higher-level pathway interpretation. We demonstrate that cell-specific pathway scores can be used as effective features alternative to gene-level features and pathway biomarkers for a specific cell type.

Virtual: SVM based classifier to identify plant-derived antimicrobial peptides
COSI: MLCSB
  • Mohini Jaiswal, National Institute of Plant Genome Research, India
  • Shailesh Kumar, National Institute of Plant Genome Research, India


Presentation Overview:

The presence of multiple genes in plants encoding antimicrobial peptides, and the ability of these peptides to interact with multitudinous low-affinity targets in microbes, laid the foundation of this study. The current study aimed to develop prediction models for characterizing plant-derived peptides for four different activities: antimicrobial, antibacterial, antifungal, and antiviral. Constructing the models involved specific requirements highlighted by previous research findings. Firstly, a compendium of plant-derived bioactive peptides was needed, for which we selected PlantPepDB. Secondly, there was a need to represent peptides as feature vectors that truly reflect their intrinsic properties. Lastly, the choice of machine-learning algorithm plays an important role in a classification study. For this supervised classification study, a support vector machine (SVM) was used with a radial basis function kernel. The best hyperplane was selected after tuning the learning and kernel options of the SVM, such as C (regularization factor) and gamma. The developed models are integrated into a web server named PTPAMP for predicting plant-derived antimicrobial peptides. Overall, the goal of this study is to develop a platform for identifying plant-derived antimicrobial peptides in a high-throughput and cost-effective manner, followed by characterization according to their physicochemical properties.
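As a brief illustration of the kernel parameter mentioned above, the following sketch evaluates the standard radial basis function kernel, K(x, y) = exp(-gamma * ||x - y||^2), for several gamma values. The feature vectors and the candidate gamma values are made up for illustration and are not taken from the study. The C parameter does not appear in the kernel itself; it sets the penalty for margin violations during SVM training.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Hypothetical peptide feature vectors (e.g., scaled composition features).
peptide_a = [0.10, 0.30, 0.60]
peptide_b = [0.20, 0.10, 0.70]

# Candidate gamma values of the kind one might scan in a grid search.
gamma_grid = [0.1, 1.0, 10.0]

kernel_values = {g: rbf_kernel(peptide_a, peptide_b, g) for g in gamma_grid}
for g, k in kernel_values.items():
    # A larger gamma makes similarity decay faster with distance,
    # which yields a more local, more flexible decision boundary.
    print(f"gamma={g:<5} K(a, b)={k:.4f}")
```

A tuning procedure would typically train one model per (C, gamma) pair and keep the pair with the best cross-validated performance.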

Virtual: Systematic annotation of mutations through de novo prediction of protein binding sites underlies importance of extracellular interactions in cancer
COSI: MLCSB
  • Xingyan Kuang, University of Chicago, United States
  • Andi Dhroso, Worcester Polytechnic Institute, United States
  • Hongzhu Cui, Worcester Polytechnic Institute, United States
  • Nathan T. Johnson, Worcester Polytechnic Institute, United States
  • Elina Tjioe, University of California, San Francisco, United States
  • Ursula Pieper, University of California, San Francisco, United States
  • Andrej Sali, University of California, San Francisco, United States
  • Dmitry Korkin, Worcester Polytechnic Institute, United States


Presentation Overview:

The standard molecular-phenotypic definition of cancer relates genomic instability with increased proliferation, evasion of growth suppressors, avoidance of immune destruction, and metastasis. Recurrent single nucleotide variations (SNVs) resulting from genomic instability may target essential protein functions such as phosphorylation and acetylation. The focus of this work is to identify whether cancer SNVs target a protein’s binding site. During the past decade, numerous sequence-based protein binding site prediction methods have been published that can be applied to any known protein sequence; however, they are less accurate than structure-based methods, which are limited by the small amount of available protein structural data. Therefore, we developed a new de novo protein binding site prediction method, Comparative Binding Region Annotator (COBRA), which expands the number of proteins that can be assessed without losing accuracy. Our new de novo protein prediction method was assessed across eight different methods. Using COBRA, we performed a large-scale annotation of SNVs across eight different cancer types to identify those located in protein binding sites (bSNVs). bSNVs were assessed for their enrichment of known cancer drivers, functional enrichment, patient survivability, clinical significance, and protein-protein interaction network. Finally, we analyzed whether the observed patterns were similar to those of SNVs that affect phosphorylation (pSNVs).

Virtual: Uncovering signals of cell state transition using topological data analysis in single cell data
COSI: MLCSB
  • Aldo Guzmán-Sáenz, IBM THOMAS J. WATSON RESEARCH CENTER, United States
  • Kahn Rhrissorrakrai, IBM, United States
  • Filippo Utro, IBM THOMAS J. WATSON RESEARCH CENTER, United States
  • Laxmi Parida, IBM THOMAS J. WATSON RESEARCH CENTER, United States


Presentation Overview:

Single cell sequencing offers tremendous insight into the composition of individual cells and helps identify the features or combinations of features that lead from one state to another. However, the increased dimensionality of studying 1000s of cells over 20,000+ genes necessitates new approaches to identify the potentially small fraction of cells whose states are truly in transition. This is particularly relevant in cancer, where the development of drug-resistant states is endemic. Transition Topological Data Analysis (T2DA) enables the identification of single cell sub-populations, i.e. states, along with the features that characterize the transition between states. T2DA performs a transformation of the data that preserves the original feature space while finding a backbone connecting cell states. Transition features extracted from the backbone may then be used to understand the mechanisms of cell state change. In a melanoma single cell RNA sequencing study of drug resistance, T2DA was able to identify biologically relevant transition genes describing cells that change from a drug-responsive to a drug-resistant state. Such putative resistance genes can guide patient treatment planning or drug development. T2DA offers a novel approach for extracting greater value from single cell sequencing data and capturing the mechanisms that underlie cell state transitions.

Virtual: Using Deep Learning in Lyme Disease Diagnosis
COSI: MLCSB
  • Tejaswi Koduru, Thomas Jefferson High School for Science and Technology, United States


Presentation Overview:

Untreated Lyme disease can lead to neurological, cardiac, and dermatological complications. Rapid recognition of the erythema migrans (EM) rash, a characteristic symptom of Lyme disease, is therefore crucial to early diagnosis and treatment. In this study, we aim to utilize deep learning frameworks including TensorFlow and Keras to create deep convolutional neural networks (DCNN) to detect acute Lyme disease from images of erythema migrans. This study uses a custom database of erythema migrans images of varying quality to train a DCNN capable of classifying images of EM rashes vs non-EM rashes. Images from publicly available sources were mined to create an initial database. Machine-based removal of duplicate images was then performed, followed by a thorough examination of all images by a clinician. The resulting database was combined with images of confounding rashes and regular skin, resulting in a total of 683 images. This database was then used to create a DCNN with an accuracy of 93% when classifying images of rashes as EM vs non-EM. Finally, this model was converted into a web and mobile application to allow for rapid diagnosis of EM rashes by both patients and clinicians.

Virtual: ViralCounter: An automated algorithm that allows to count the number of viruses in cells
COSI: MLCSB
  • Dagmara Błaszczyk, Małopolska Centre of Biotechnology, UJ, Kraków, Poland
  • Katarzyna Owczarek, Małopolska Centre of Biotechnology, UJ, Kraków, Poland
  • Artur Szczepański, Małopolska Centre of Biotechnology, UJ, Kraków, Poland
  • Krzysztof Pyrć, Małopolska Centre of Biotechnology, UJ, Kraków, Poland
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology, UJ, Kraków, Poland


Presentation Overview:

Due to the pandemic, the scientific community has joined forces to take advantage of complementary expertise. In the field of automated image processing and interpretation, there are also techniques that could help in coronavirus-related studies, especially in studying and characterizing cells that contain SARS-CoV-2. Our research focuses on quantitative examination of coronavirus presence in each cell.
The dataset currently contains 50 multi-cell 3D images taken with a confocal microscope, in which three components are marked in different colors: nuclei, leucines in cytoskeletons, and viruses.
Current work focuses on determining the area of each cell in those images, for which the points that carry information about the nucleus and the leucines in the cytoskeleton are crucial. Extracting the cells from an image relies on finding the cell nucleus and determining cell boundaries by searching for accumulated leucine points. The algorithm selects clearly visible cells for further analysis.
Later steps include using the extracted cells as a dataset for machine learning algorithms that identify the viruses. The project includes training an artificial neural network and tuning its parameters to segment the virus areas in each chosen cell, then counting the segmented objects.

Virtual: XClone: Statistical modelling of copy number variations in single cells
COSI: MLCSB
  • Rongting Huang, The University of Hong Kong, Hong Kong
  • Xianjie Huang, The University of Hong Kong, Hong Kong
  • Yuanhua Huang, The University of Hong Kong, Hong Kong


Presentation Overview:

Somatic copy number variations (CNVs) are major mutations in the development and clonal evolution of various cancers. Analysing CNVs in single-cell RNA-seq data is of critical importance for both detecting the CNV states in tumour cells and revealing their impact on transcriptional phenotypes. However, the intrinsic low coverage and high noise of scRNA-seq make it difficult to call CNVs accurately. A few computational methods (inferCNV, CopyKAT, HoneyBADGER, CaSpER) have been recently proposed to analyse CNVs from scRNA-seq data, but their accuracy and computational efficiency have not been well benchmarked. Here we present a statistical method, XClone, that integrates expression levels and allelic balance to enhance the detection of haplotype-aware CNVs from scRNA-seq data and the reconstruction of tumour clonal phylogeny. Compared to commonly used methods, XClone is found to be a promising tool for accurate CNV analysis across multiple data sets, including a well-characterized and verified gastric cancer sample that covers copy loss, gain, and loss of heterozygosity.

W-001: Towards an in silico cell
COSI: MLCSB
  • Yue Qin, University of California San Diego, United States
  • Emma Lundberg, KTH-Royal Institute of Technology, Sweden
  • Trey Ideker, University of California San Diego, United States


Presentation Overview:

The cell is a multi-scale structure with modular organization across at least four orders of magnitude. Two central approaches for mapping this structure – protein fluorescent imaging and protein biophysical association – each generate extensive datasets, but of distinct qualities and resolutions that are typically treated separately. Here, we integrate immunofluorescence images in the Human Protein Atlas (HPA) with affinity purifications in BioPlex to create a unified hierarchical map of human cell architecture. Integration is achieved by configuring each approach as a general measure of protein distance, then calibrating the two measures using machine learning. The map, called the Multi-Scale Integrated Cell (MuSIC 1.0), resolves 69 subcellular systems of which approximately half are undocumented. Accordingly we perform 134 additional affinity purifications, validating subunit associations for the majority of systems. The map reveals a pre-ribosomal RNA processing assembly and accessory factors, which we show govern rRNA maturation, and functional roles for SRRM1 and FAM120C in chromatin and RPS3A in splicing. By integration across scales, MuSIC increases the resolution of imaging while giving protein interactions a spatial dimension, paving the way to incorporate diverse types of data in proteome-wide cell maps.

W-002: A Context-aware Deconfounding Autoencoder for Robust Prediction of Personalized in vivo Drug Response From Cell Line Compound Screening
COSI: MLCSB
  • Di He, The City University of New York, United States
  • Qiao Liu, The City University of New York, United States
  • You Wu, The City University of New York, United States
  • Lei Xie, The City University of New York, United States


Presentation Overview:

Accurate and robust prediction of patient-specific responses to a new compound is critical to personalized drug discovery and development. However, patient data are often too scarce to train a generalized machine learning model. Although many methods have been developed to utilize cell line screens for predicting clinical responses, their performance is unreliable due to data heterogeneity and distribution shift. We have developed a novel Context-aware Deconfounding Autoencoder (CODE-AE) that can extract intrinsic biological signals masked by context-specific patterns and confounding factors. Extensive comparative studies demonstrated that CODE-AE effectively alleviated the out-of-distribution problem for model generalization and significantly improved accuracy and robustness over state-of-the-art methods in predicting patient-specific in vivo drug responses purely from in vitro compound screens. Using CODE-AE, we screened 59 drugs for 9,808 cancer patients. Our results are consistent with existing clinical observations, suggesting the potential of CODE-AE in developing personalized anti-cancer therapies and drug-response biomarkers.

W-003: Adversarial Mutation Predicts the Spillover Risk of Viruses from their Genomic Sequences
COSI: MLCSB
  • Nathan Bollig, University of Wisconsin - Madison, United States
  • Tony Goldberg, University of Wisconsin - Madison, United States
  • Mark Craven, University of Wisconsin - Madison, United States


Presentation Overview:

While metagenomics has significantly expanded the set of known viral genomes, many newly discovered viruses cannot be easily linked to the range of hosts they have the potential to infect. Our work uses machine learning methods to induce models that characterize the host range of viruses based on their genomic sequences. Using these models, we define a computational search task called adversarial mutation, which leverages adversarial machine learning to identify high-risk variants of given viruses. We evaluate the performance of adversarial mutation operating on machine learning models trained to predict whether a virus can infect a human host, using a dataset of spike protein sequences from human and non-human-infecting coronaviruses. We demonstrate that adversarial mutation is effective at identifying viruses that present a high risk of spilling over into human populations.
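The search task described above can be sketched as a greedy substitution search against a scoring model. Everything in this sketch is an assumption made for illustration: the scoring function is a toy stand-in for a trained host-prediction classifier, the motif it rewards is invented, and greedy single-residue substitution is a simple placeholder strategy that may differ from the search the authors actually use.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_human_infection_score(seq):
    """Toy stand-in for a trained classifier.

    Returns the fraction of positions that match a repeating, made-up motif.
    A real model would return a predicted probability of human infection.
    """
    motif = "KRKR"  # hypothetical, not a real biological signal
    tiled = motif * (len(seq) // len(motif) + 1)
    matches = sum(1 for a, b in zip(seq, tiled) if a == b)
    return matches / len(seq)

def greedy_adversarial_mutation(seq, score_fn, max_mutations):
    """Greedily apply up to max_mutations single-residue substitutions,
    each time keeping the substitution that most increases the score."""
    current = seq
    for _ in range(max_mutations):
        best_seq, best_score = current, score_fn(current)
        for pos in range(len(current)):
            for aa in AMINO_ACIDS:
                if aa == current[pos]:
                    continue
                candidate = current[:pos] + aa + current[pos + 1:]
                score = score_fn(candidate)
                if score > best_score:
                    best_seq, best_score = candidate, score
        if best_seq == current:
            break  # no single substitution improves the score
        current = best_seq
    return current

start = "AAAAAAAA"
variant = greedy_adversarial_mutation(start, toy_human_infection_score, max_mutations=2)
n_changes = sum(1 for a, b in zip(start, variant) if a != b)
print(variant, toy_human_infection_score(variant), n_changes)
```

Capping the number of substitutions keeps each candidate variant close to the original sequence, which is what makes a high score interpretable as a nearby high-risk variant rather than an unrelated sequence.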

W-004: Increased prioritization of known cancer drug targets from genetic screens using Deep Link Prediction
COSI: MLCSB
  • Pieter-Paul Strybol, UGent-IMEC / Galapagos NV, Belgium
  • Maarten Larmuseau, UGent, Belgium
  • Louise de Schaetzen van Brienen, UGent, Belgium
  • Tim Van den Bulcke, Galapagos NV, Belgium
  • Kathleen Marchal, UGent, Belgium


Presentation Overview:

We present Deep Link Prediction (DLP), a method for the interpretation of genetic screens. Our approach uses representation-based link prediction to reprioritize phenotypic readouts by integrating screening experiments with gene-gene interaction networks. We validate on two different loss-of-function technologies, RNAi and CRISPR, using datasets obtained from the DepMap consortium. Extensive benchmarking shows that our novel model DLP-DeepWalk outperforms other methods in recovering cell-specific cancer dependencies, achieving an average precision well above 90% across seven different cancer types with varying numbers of samples available, on both RNAi and CRISPR data. We show that genes ranked highest by DLP-DeepWalk are appreciably more enriched in drug targets compared to the ranking based on original screening scores, without explicitly showing this information to the model; in other words, the drug target prioritization happens in an unsupervised manner. Interestingly, this enrichment is more pronounced on RNAi data than on CRISPR data, consistent with the greater inherent noise of RNAi screens. Finally, we demonstrate how DLP-DeepWalk can infer the molecular mechanism through which putative targets trigger cell line mortality and could potentially uncover drug synergies or additions.

W-005: Brain and Organoid Manifold Alignment (BOMA), a machine learning framework for comparative gene expression data analysis across brains and organoids
COSI: MLCSB
  • Chenfeng He, Waisman Center, University of Wisconsin at Madison; Department of Biostatistics and Medical Informatics, UW-Madison, United States
  • André Sousa, Waisman Center, University of Wisconsin at Madison; Department of Neuroscience, UW-Madison, United States
  • Xinyu Zhao, Waisman Center, University of Wisconsin at Madison; Department of Neuroscience, UW-Madison, United States
  • Daifeng Wang, Waisman Center, University of Wisconsin at Madison; Department of Biostatistics and Medical Informatics, UW-Madison, United States


Presentation Overview:

Brain organoids have become useful models for understanding gene expression dynamics and gene functions that govern human brain development. However, whether such developmental gene expression programs are preserved between organoids and human brains, especially in specific cell types, remains unclear. Importantly, there is a lack of dedicated computational approaches for comparative data analyses between in vitro and in vivo developmental trajectories. To address this, we have developed a machine learning pipeline to align samples of brains and organoids by jointly embedding their RNA-seq data onto manifolds. Primarily, this pipeline first uses time warping for a global alignment and then manifold learning to locally refine the alignment. Such aligned samples reveal conserved developmental trajectories between brains and organoids (or specific trajectories from unaligned samples). We applied this pipeline to several recently published data sets, including both bulk tissue and single-cell transcriptomic data, demonstrating its generality and scalability.

W-006: Discovering interpretable features of the intrinsically disordered dark proteome by using evolution for contrastive learning
COSI: MLCSB
  • Alex Lu, Microsoft Research, United States
  • Amy Lu, Berkeley, United States
  • Iva Pritišanac, Medizinische Universität Graz, Austria
  • Taraneh Zarin, Center for Genomic Regulation, Spain
  • Julie Forman-Kay, University of Toronto, Canada
  • Alan Moses, University of Toronto, Canada


Presentation Overview:

A major challenge to the characterization of intrinsically disordered regions (IDRs), which are widespread in the proteome, but relatively poorly understood, is the identification of molecular features that mediate functions of these regions, such as short motifs, amino acid repeats and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call “reverse homology”, exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homologue from another set of IDRs otherwise sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of what residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences.

W-007: Assessment and Optimization of the Interpretability of Machine Learning Models Applied to Transcriptomic Data
COSI: MLCSB
  • Yongbing Zhao, Mayo Clinic, United States
  • Jinfeng Shao, National Institutes of Health, United States
  • Yan Asmann, Mayo Clinic, United States


Presentation Overview:

Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, understanding of the applicability of these model explainers to biological data is still lacking. In this study, we comprehensively evaluated multiple explainers by interpreting pretrained models that predict tissue types from transcriptomic data, and by identifying the top contributing genes from each sample with the greatest impact on model prediction. To improve the reproducibility and interpretability of results generated by model explainers, we proposed a series of optimization strategies for each explainer on two different model architectures, Multilayer Perceptron (MLP) and Convolutional Neural Network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.

W-008: A Bayesian Experimental Design Framework to Optimize Microbial Communities
COSI: MLCSB
  • Jaron Thompson, University of Wisconsin, United States
  • Ophelia Venturelli, University of Wisconsin, United States
  • Victor Zavala, University of Wisconsin, United States


Presentation Overview:

Microbial communities have enormous functional potential, including the ability to valorize biofuel production, enhance agricultural yields, clean up environmental waste and benefit human health. While communities of microbial species have the potential to outperform individual species for a wide range of functions, discovering highly functional communities and optimizing environmental parameters to enhance target functions is difficult due to poorly understood mechanisms and the inability to experimentally observe all possible conditions. Data-driven approaches to rationally select microbial communities have emerged as a promising avenue for microbiome engineering. Despite a handful of initial successes in data-driven approaches to optimize microbial communities, previous studies have relied on fitting standard ecological models and did not leverage model uncertainty to maximize the information content of experimental designs. Improving the ability to engineer and optimize the functions of microbial communities will require advancements in model development, parameter estimation, and optimal experimental design. Toward this end, we present a Bayesian design-of-experiments framework to model and optimize microbial community functions. Our framework includes a recurrent neural network architecture tailored to model microbial community dynamics, a Bayesian inference method for parameter estimation, and a model-guided optimization approach to select microbial community experiments that maximize information content and community function.

W-009: JAMIE: Joint Autoencoders for Multi-Modal Imputation and Embedding
COSI: MLCSB
  • Noah Cohen Kalafut, University of Wisconsin-Madison: CS; Waisman Center, United States
  • Daifeng Wang, University of Wisconsin-Madison: BMI, CS; Waisman Center, United States


Presentation Overview:

Single-cell multi-modal datasets allow us to acquire a deeper understanding of underlying molecular mechanisms and functions at cellular resolution. However, multi-modal data for single cells is typically noisy and heterogeneous across modalities. Generating data in several modalities for many cells is costly, time-consuming, and often impractical. Integrating multi-modal data and interpreting cross-modal links remains challenging as well. To address these issues, we developed a novel machine learning model, Joint Autoencoders for Multi-modal Imputation and Embedding (JAMIE). When running, JAMIE first infers correspondence from multi-modal data. With the addition of prior matching information, JAMIE trains the coupled autoencoders for each modality such that matched samples have the same embeddings (i.e., latent features after encoders). These joint embeddings can be used for sample clustering and label transference. Moreover, mixing the trained encoders and decoders enables reconstructing various modalities from a single input modality (multi-modal imputation). We applied JAMIE to recent single-cell multi-modal datasets: (1) gene expression and DNA methylation in lung adenocarcinomas; (2) gene expression and electrophysiology of mouse visual neurons. We found that JAMIE outperforms existing state-of-the-art algorithms for cell type transfer applications and modality imputation. JAMIE is open-source and available on GitHub.

W-010: Leveraging machine learning essentiality predictions and chemogenomic interactions to identify antifungal targets
COSI: MLCSB
  • Sean Liston, University of Toronto, Canada
  • Leah Cowen, University of Toronto, Canada
  • Chad Myers, University of Minnesota, United States
  • Nicole Robbins, University of Toronto, Canada
  • Suzanne Noble, UCSF School of Medicine, United States
  • Matthew O'Meara, University of Michigan, United States
  • Teresa O'Meara, University of Michigan Medical School, United States
  • Charles Boone, University of Toronto, Canada
  • Anne-Claude Gingras, Lunenfeld-Tanenbaum Research Institute, Canada
  • Yoko Yashiroda, RIKEN Center for Sustainable Resource Science, Japan
  • Jing Hou, University of Toronto, Canada
  • Benjamin Vandersluis, University of Minnesota, United States
  • Xiang Zhang, University of Minnesota, United States
  • Elizabeth Polvi, University of Toronto, Canada
  • Zhen-Yuan Lin, Lunenfeld-Tanenbaum Research Institute, Canada
  • Cassandra Wong, Lunenfeld-Tanenbaum Research Institute, Canada
  • Nicole Revie, University of Toronto, Canada
  • Huijuan Yan, UCSF School of Medicine, United States
  • Alice Xue, University of Toronto, Canada
  • Emma Lash, University of Toronto, Canada
  • Kali Iyer, University of Toronto, Canada
  • Amanda Veri, University of Toronto, Canada
  • Ci Fu, University of Toronto, Canada


Presentation Overview:

Candida albicans is an opportunistic fungal pathogen that can lead to deadly infections in humans. Understanding which genes are essential for growth of this organism would provide opportunities for developing more effective therapeutics. Unlike in the model yeast, Saccharomyces cerevisiae, construction of mutants is considerably more laborious in C. albicans. To prioritize efforts for mutant construction and identification of essential genes, we built a random forest-based machine learning model, leveraging a set of 2,327 C. albicans GRACE (gene replacement and conditional expression) strains that have been previously constructed as a basis for training. We identified several relevant features contributing unique information to the predictions. Through cross-validation analysis on our random forest model, we estimated an AUC of 0.92 and an average precision of 0.77. Given these strong results, we prioritized the construction of an additional set of >800 strains and discovered essential genes at a rate of ~64% amongst these new predictions relative to an expected background rate of essentiality of ~20%. Our machine learning approach is an effective strategy for efficient discovery of essential genes, and a similar approach may also be useful in other species.
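Two of the quantities reported above can be made concrete with a short sketch: average precision over a ranked list of predictions, and the fold enrichment implied by the approximate hit rate and background rate. The toy ranking below is invented for illustration and is unrelated to the study's data. Only the two rates (~64% and ~20%) come from the abstract, and both are approximate.

```python
def average_precision(labels_ranked):
    """Average precision for binary labels sorted by descending model score.

    Each true positive contributes the precision at its rank, and the
    contributions are averaged over all true positives.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(labels_ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def fold_enrichment(hit_rate, background_rate):
    """Ratio of the observed hit rate to the expected background rate."""
    return hit_rate / background_rate

# Invented example: 1 = essential gene, 0 = non-essential, in ranked order.
toy_ranking = [1, 1, 0, 1, 0, 0]
ap = average_precision(toy_ranking)  # (1/1 + 2/2 + 3/4) / 3

# Approximate rates quoted in the abstract.
enrichment = fold_enrichment(0.64, 0.20)

print(f"toy average precision: {ap:.3f}")
print(f"fold enrichment over background: {enrichment:.1f}")
```

Under these approximate rates, the prioritized strains contain essential genes at roughly 3.2 times the background frequency.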

W-011: OncoRx: An Integrative Approach to Identification of Pan-Cancer Molecular Biomarkers and Prediction of Targeted Multi-Drug Cancer Therapeutics
COSI: MLCSB
  • Darsh Mandera, Independent, United States


Presentation Overview: Show

Cancer is a highly heterogeneous disease with complex underlying biology. The current approach to treating cancer is expensive, time consuming, and ineffective 75% of the time. Conventional monotherapeutic techniques non-selectively target proliferating cells, which leads to the destruction of both healthy and cancerous cells. With microRNA being an established cancer biomarker, treatment based on microRNA should provide the highest specificity and sensitivity due to its cancer-specific expression and stability. However, identifying particular microRNAs that play a key role in tumorigenesis remains a challenge, as expression of some microRNAs is significantly different between normal tissues and tumor tissues. In this research, a machine learning model was trained and tested with 23 cancer types using microRNA and pharmacological data from The Cancer Genome Atlas to identify top cancer biomarkers and predict drug combinations. Feature selection using ExtraTreesClassifier identified 84 of 705 microRNAs as cancer drivers. These microRNA biomarkers were validated through KEGG pathway analysis, Gene Ontology enrichment analysis and overall survival analysis. The model was tested with multiple machine learning classifiers, including K-NearestNeighbors, AdaBoostClassifier, and OneVsRestClassifier. OneVsRestClassifier, when combined with cross-validation, outperformed the other approaches and is able to predict drug combinations for cancer patients with high accuracy.

W-012: DLEB: a web-based application for building deep learning models to solve biological problems
COSI: MLCSB
  • Suyeon Wy, Konkuk University, South Korea
  • Daehong Kwon, Konkuk University, South Korea
  • Kisang Kwon, Konkuk University, South Korea
  • Jaebum Kim, Konkuk University, South Korea


Presentation Overview: Show

Deep learning has been applied to many biological problems and has shown outstanding performance. Applying deep learning in research requires knowledge of deep learning theory and programming skills, so researchers have developed diverse deep learning platforms that allow users to build models without programming. Despite these efforts, it is still difficult for biologists to use deep learning because of limitations of the existing platforms. To alleviate this situation, we developed a user-friendly web application called DLEB (Deep Learning Editor for Biologists) for building deep learning models tailored to biologists' needs. DLEB helps researchers (i) design deep learning models easily and (ii) generate corresponding Python code to run directly on their machines. DLEB provides other useful features for biologists, such as recommending deep learning models for specific learning tasks and data, pre-processing input biological data, and offering various template models and example biological datasets for model training. DLEB can serve as a highly valuable platform for easily applying deep learning to many important biological problems. DLEB is freely available at http://dleb.konkuk.ac.kr/.

W-014: A Novel Machine Learning-Based Approach for Predicting Molecular Biological Binding Affinity
COSI: MLCSB
  • Bomin Wei, Princeton International School of Mathematics and Science, United States
  • Yue Zhang, School of Medicine, the University of Utah, United States
  • Xiang Gong, Princeton International School of Mathematics and Science, United States


Presentation Overview: Show

Combating COVID-19 requires prompt development of an effective treatment, but conventional procedures are costly and have a high failure rate. Identifying interactions between drug molecules and target proteins with computational methods is essential for speeding up drug development and can thus lower the cost. In this work, we propose a novel deep learning-based model, LPIDeep, consisting of a ResNet-based CNN and a bi-directional LSTM, which uses sequence inputs of drug molecules and proteins to predict their interactions. LPIDeep utilizes pre-trained methods to embed text inputs into dense vector representations for higher accuracy and generalizability. We use the BindingDB dataset for model training and evaluation. The results show that LPIDeep performs well on binding affinity prediction as a regression task: it achieves R and MSE of 0.89 and 0.46 on the training set and 0.84 and 0.64 on the independent test set, better than recently published models. This result suggests that LPIDeep is accurate and generalizes well, demonstrating its potential to pinpoint new drug-target interactions and to find better destinations for proven drugs.

W-015: Construction of in silico protein-protein interaction networks across different topologies using machine learning
COSI: MLCSB
  • Loïc Lannelongue, University of Cambridge, United Kingdom
  • Michael Inouye, University of Cambridge, United Kingdom


Presentation Overview: Show

Protein-protein interactions (PPIs) are essential to understanding biological pathways as well as their roles in development and disease. Computational tools have been successful at predicting PPIs in silico, but the scarcity of reliable frameworks has led to network models that are difficult to compare and, overall, a low level of trust in the predicted PPIs. To better understand the underlying mechanisms that underpin these models, we designed B4PPI, an open-source framework for benchmarking that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. We use B4PPI to shed light on the impact of network topology and how algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models, two popular approaches, on human PPIs, we show their complementarity as the former performs best on lone proteins while the latter specialises in interactions involving hubs. We also show that algorithm design has little impact on performance with functional genomic data. We replicate our results on yeast data and demonstrate that models using functional genomics can be better suited to cross-species prediction. With rapidly increasing amounts of sequence and genomics data, our study provides a systematic foundation for future construction, comparison and application of PPI networks.

W-016: Fast, flexible motif discovery using convolutional dictionary learning
COSI: MLCSB
  • Shane Chu, Washington University in St. Louis, United States
  • Gary Stormo, Washington University in St. Louis, United States


Presentation Overview: Show

Discovering the binding site motifs for transcription factors (TFs) is an important task in computational biology. For simple motifs that can be represented by a single position weight matrix (PWM), there are several good, commonly used approaches. But for more complex TF-DNA interactions, such as cases where there can be multiple modes of binding or where the motif has multiple parts separated by variable spacing, the task is much more challenging and good methods are lacking. We have used convolutional dictionary learning (CDL) to rapidly identify sets of statistically significant filters that can then be combined, modified and extended to create PWMs for complex binding motifs. We show that for simple motifs our method is comparable to existing methods, and that it can also find complex motifs where other approaches often fail.
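For readers unfamiliar with PWMs, the sketch below shows the simple single-matrix case mentioned above: a log-odds PWM is built from aligned sites and scanned along a sequence. It does not implement convolutional dictionary learning. The aligned sites, the query sequence, the pseudocount, and the uniform background are all hypothetical choices for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds position weight matrix from equal-length aligned binding sites."""
    width = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(width):
        column = [site[i] for site in sites]
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm):
    """Score every window of the sequence against the PWM."""
    width = len(pwm)
    return [
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]


# Hypothetical aligned sites and query sequence.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(toy_sites)
scores = scan("CCGTTGACTCAGGA", pwm)
best_start = max(range(len(scores)), key=scores.__getitem__)
```

A single matrix like this cannot represent variable spacing or alternative binding modes, which is the gap the abstract says the CDL-based approach targets.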

W-017: Fast and interpretable genomic data analysis using multiple approximate kernel learning
COSI: MLCSB
  • Ayyüce Begüm Bektaş, Koç University, Turkey
  • Çiğdem Ak, Oregon Health and Science University, United States
  • Mehmet Gönen, Koç University, Turkey


Presentation Overview: Show

Dataset sizes in biology have increased drastically thanks to improved data collection tools and growing patient cohorts. Previous kernel-based machine learning algorithms proposed for interpretability fail on large datasets owing to their lack of scalability. We therefore propose a fast and efficient multiple kernel learning algorithm that integrates kernel approximation and group Lasso formulations into a conjoint model. Our method extracts significant and meaningful information from genomic data while conjointly learning a model for out-of-sample prediction. It scales with increasing sample size by approximating distinct kernel matrices instead of calculating them.

To test our computational framework, Multiple Approximate Kernel Learning (MAKL), we ran experiments on three cancer datasets and showed that MAKL can outperform the baseline algorithm while using only a small fraction of the input features. We also report selection frequencies of the approximated kernel matrices associated with feature subsets (i.e., gene sets/pathways), which reflect their relevance for the given classification task. Because it produces sparse solutions, MAKL is promising for computational biology applications given its scalability and the highly correlated structure of genomic datasets, and it can be used to discover new biomarkers and new therapeutic guidelines.
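The idea of approximating a kernel instead of computing the full kernel matrix can be sketched with random Fourier features. This is an assumption for illustration only: the abstract does not name the approximation MAKL uses, and the group Lasso part is not shown. The function names, the RBF kernel choice, and all parameter values below are hypothetical.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def make_rff_map(dim, n_features, gamma, seed=0):
    """Return an explicit feature map whose inner products approximate the RBF kernel."""
    rng = random.Random(seed)
    # For exp(-gamma * ||x - y||^2), frequencies are drawn from N(0, 2 * gamma * I).
    sigma = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, sigma) for _ in range(dim)] for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [
            scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return feature_map


# Hypothetical three-dimensional samples.
x = [0.1, 0.2, 0.3]
y = [0.4, 0.1, 0.0]
gamma = 0.5
phi = make_rff_map(dim=3, n_features=4000, gamma=gamma)
approx = sum(a * b for a, b in zip(phi(x), phi(y)))
exact = rbf_kernel(x, y, gamma)
```

Because each sample is mapped to a fixed-length vector, a linear model on these features costs time proportional to the number of samples rather than to its square, which is the scalability argument made above.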

W-018: PathwayMultiomics: pathway-regularized GPU-accelerated matrix factorization
COSI: MLCSB
  • David Merrell, University of Wisconsin - Madison, United States
  • Anthony Gitter, University of Wisconsin - Madison, United States


Presentation Overview: Show

Biological pathways provide a useful vocabulary for describing the state of a biological system. For example, many types of cancer are known to have pathways with abnormally high or low activity levels. Pathway-level descriptions have clinical value because they yield candidate therapeutic targets beyond individual altered proteins.

Meanwhile, modern biological research generates a flood of -omic data, such as genomic, epigenomic, transcriptomic, and proteomic measurements. Different assays provide complementary views of a biological system. However, extracting the full potential from multi-omic data remains an open line of research.

We present PathwayMultiomics, a matrix factorization model that infers pathway activity levels from multi-omic data. It accommodates data with heterogeneous distributional assumptions, e.g., normal, Poisson, logistic, and ordinal. The model also accounts for batch effects and sample conditions. The matrix factorization is regularized by pathways’ known network structures. It outputs sample-specific pathway activation scores, along with interpretable latent factor representations of pathways.

We implement PathwayMultiomics as a Julia package. GPU acceleration allows our model to scale efficiently to large datasets. The model reliably recovers true signal in simulated data. We also demonstrate PathwayMultiomics on The Cancer Genome Atlas data, showing that its inferred pathway activities are useful for downstream tasks.

W-019: Multi-omics data integration in the cloud: Using machine learning methods to predict statistically significant associations between clinical and molecular features between disparate breast cancer cohorts
COSI: MLCSB
  • George Acquaah-Mensah, Massachusetts College of Pharmacy and Health Sciences, United States
  • Kawther Abdilleh, Pancreatic Cancer Action Network, United States
  • Boris Aguilar, Institute for Systems Biology, United States


Presentation Overview: Show

Among women, breast invasive carcinoma (BrCA) remains a leading cause of mortality. There are, however, disparities in biomolecular and clinical presentations, racial distribution, and incidences of aggressive types of breast cancer. We examined The Cancer Genome Atlas (TCGA) BrCA samples from stage II patients aged 50 or younger that are black (B/AA50) or white (W50). We combined a variety of multi-omic datasets to further characterize the disparities for insights. Methylation data was processed through the Methylmix algorithm to identify differentially methylated genes between the cohorts. Machine learning algorithms (including Random Forest and Deep Learning (Dl4j)) were trained on differential methylation values of driver genes. Simultaneously, we coupled gene expression (RNAseq) data with protein-protein interaction data to identify two bi-clusters. The trained algorithms were largely successful in predicting the bi-cluster assignment of each sample upon ten-fold cross-validation (bi-cluster 1: Precision and Recall were 0.964 and 0.853 for Dl4j and bi-cluster 2: Precision and Recall were 0.725 and 0.925 for Dl4j). There was a positive association between the cluster membership and cohorts.
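As a reminder of how the quoted metrics are defined, the sketch below computes precision and recall from confusion counts and combines a precision/recall pair into an F1 score. The toy counts are hypothetical. The F1 values are derived here from the precision and recall reported above and are not figures reported by the authors.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for one bi-cluster.
toy_precision, toy_recall = precision_recall(tp=8, fp=2, fn=2)

# Precision and recall as quoted in the abstract for Dl4j.
reported = {"bi-cluster 1": (0.964, 0.853), "bi-cluster 2": (0.725, 0.925)}
f1_by_cluster = {name: f1_score(p, r) for name, (p, r) in reported.items()}
```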

W-020: Artificial intelligence & machine learning approaches using gene expression and variant data for predictive and personalized medicine
COSI: MLCSB
  • Sreya Vadapalli, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States
  • Habiba Abdelhalim, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States
  • Saman Zeeshan, Cancer Institute of New Jersey, Rutgers, The State University of New Jersey, United States
  • Zeeshan Ahmed, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States


Presentation Overview: Show

The convergence of genomics and transcriptomics data, along with staggering developments in artificial intelligence (AI) and machine learning (ML), has the potential to elevate diagnostic and predictive analyses of major causes of mortality, modifiable risk factors, and other clinically actionable information. The grand challenge today is the successful assimilation of genetics into precision medicine that translates across different ancestries, diverse diseases, and other distinct populations, which will require clever use of AI/ML methods. Our goal here was to implement and evaluate different AI/ML approaches that can be used in genomics and precision medicine. We narrowed our scope to the application of AI/ML algorithms for statistical and predictive analysis using WGS and WES for gene variants, and RNA-seq and microarrays for gene expression. We did not limit our study to specific diseases or data sources. Within this scope, we identified and investigated 32 different AI/ML approaches and algorithms for predictive diagnostics across several diseases. We conclude that SVM and RF are the most successful AI/ML algorithms, and that ANN, KNN, NB, LR, and AB are among the best options available for bioinformatics, statistics, and predictive analysis of a wide variety of diseases using genomics data.

W-021: Application of Transfer Learning Drug Repurposing for Glioblastoma
COSI: MLCSB
  • Jennifer Fisher, University of Alabama at Birmingham, United States
  • Vishal Oza, University of Alabama at Birmingham, United States
  • Brittany Lasseigne, University of Alabama at Birmingham, United States


Presentation Overview: Show

Rare diseases collectively affect 25-30 million people in the US. Even though 80% of rare diseases have a genetic component, 95% of rare diseases do not have an identified molecular target for therapy. As machine learning approaches require statistically powered datasets to develop accurate models, rare disease datasets, which are small, have posed a significant challenge in handling technical and biological variation during model building and interpretation. In this study, we combined transfer learning with signature reversion drug repurposing principles: we identified disease-associated genes (DAGs) from disease-associated signatures (DASs, i.e., latent variables) via differential latent variable analysis, and used them to identify drug repurposing candidates for glioblastoma multiforme (GBM), a rare brain tumor with a poor prognosis of 10-15 months. The results of this study suggest that transfer learning’s latent variable set is more stable during downsampling and that the predicted drug candidates perturb molecular processes known to be dysfunctional in GBM. In conclusion, these results suggest that transfer learning is an effective method for rare disease drug repurposing.

W-022: The multifocal transcriptomic landscape of locally advanced prostate cancer
COSI: MLCSB
  • Maarten Larmuseau, Ghent University, Belgium
  • Kim Van der Eecken, Ghent University, Belgium
  • Louise de Schaetzen van Brienen, Ghent University, Belgium
  • Piet Ost, Ghent University, Belgium
  • Kathleen Marchal, Ghent University, Belgium


Presentation Overview: Show

Understanding the molecular alterations that allow transitioning from local to invasive disease is essential to improve cancer treatments. Investigating tumor progression has, however, been hampered by substantial intra- and intertumor heterogeneity. Here, we introduce a multifocal cohort of locally advanced prostate cancer, where several primary lesions and metastatic lymph nodes per patient have been transcriptomically profiled. Modeling pathway activity using a centroid-based approach allows tracing cancer progression from primary to metastatic tissue, highlighting how mainly invasion-related processes are altered. Moreover, using an external dataset we demonstrate that in lymph node positive primary tumors the activity of certain signatures is altered to resemble metastatic lymph nodes. We use this observation to identify the most likely seeding primary lesion in each patient of our cohort. The predicted seeding lesions agree well with seeding lesions estimated from RNA-seq derived somatic variants and are enriched in the invasive PAM50 Luminal B subtype. Finally, we leverage the unique design of our cohort and develop a new testing procedure to identify molecular processes that characterize the seeding lesions. Importantly, the poor correspondence between a lesion’s Gleason score and seeding status suggests that multifocal designs will be pivotal for the study of invasive disease.

W-023: Multi-omics data integration reveals correlated regulatory features of triple negative breast cancer
COSI: MLCSB
  • Kevin Chappell, University of Arkansas at Little Rock, United States
  • Stephanie Byrum, University of Arkansas for Medical Sciences, United States
  • Kanishka Manna, University of Arkansas at Little Rock, United States
  • Sayem Miah, University of Arkansas for Medical Sciences, United States


Presentation Overview: Show

Triple negative breast cancer (TNBC) is an aggressive type of breast cancer with few treatment options. TNBC is heterogeneous with large alterations in multiple omic landscapes leading to various subtypes with differing responses to therapeutic treatments. We applied a multi-omics data integration method to evaluate the correlation of important regulatory features in TNBC BRCA1 wild-type MDA-MB-231 and TNBC BRCA1 5382insC mutated HCC1937 cells compared with non-tumorigenic epithelial breast MCF10A cells. The data includes DNA methylation, RNAseq, protein, phosphoproteomics, and histone post-translational modification. Data integration methods identified regulatory features from each omics method that had greater than 80% positive correlation within each TNBC subtype. Key regulatory features at each omics level were identified distinguishing the three cell lines and were involved in important cancer related pathways such as TGFβ signaling, PI3K/AKT/mTOR, and Wnt/beta-catenin signaling. We observed overexpression of PTEN, which antagonizes the PI3K/AKT/mTOR pathway, and MYC, which downregulates the same pathway in the HCC1937 cells relative to the MDA-MB-231 cells. The PI3K/AKT/mTOR and Wnt/beta-catenin pathways are both downregulated in HCC1937 cells relative to MDA-MB-231 cells, which likely explains the divergent sensitivities of these cell lines to inhibitors of downstream signaling pathways. The data is available via GEO GSE171958 and ProteomeXchange PXD025238.

W-024: A molecular generative model with genetic algorithm and tree search for cancer samples
COSI: MLCSB
  • Sejin Park, Gwangju Institute of Science and Technology, South Korea
  • Hyunju Lee, Gwangju Institute of Science and Technology, South Korea


Presentation Overview: Show

Personalized medicine is expected to maximize the intended drug effects and minimize side effects by treating patients based on their genetic profiles. Thus, it is important to generate drugs based on the genetic profiles of diseases, especially in anticancer drug discovery. However, this is challenging because the vast chemical space and the variation in cancer properties make the search for suitable molecules very time-consuming. Therefore, an efficient and fast search method that considers genetic profiles is required for de novo molecular design of anticancer drugs. Here, we propose a faster molecular generative model with genetic algorithm and tree search for cancer samples (FasterGTS). FasterGTS combines a genetic algorithm and a Monte Carlo tree search with three deep neural networks (supervised learning, self-trained, and value networks), and it generates anticancer molecules based on the genetic profiles of a cancer sample. Compared to other methods, FasterGTS generated cancer sample-specific molecules with the general chemical properties required for cancer drugs within a limited number of samplings. We expect that FasterGTS will contribute to anticancer drug generation.

W-025: MOMA: a multi-task attention learning algorithm for multi-omics data interpretation and classification
COSI: MLCSB
  • Sehwan Moon, Gwangju Institute of Science and Technology, South Korea
  • Hyunju Lee, Gwangju Institute of Science and Technology, South Korea


Presentation Overview: Show

Accurate diagnostic classification and biological interpretation are important in biology and medicine, which are data-rich sciences. Thus, integration of different data types is necessary for the high predictive accuracy of clinical phenotypes, and more comprehensive analyses for predicting the prognosis of complex diseases are required.

Here, we propose a novel multi-task attention learning algorithm for multi-omics data, termed MOMA, which captures important biological processes for high diagnostic performance and interpretability.
MOMA vectorizes features and modules using a geometric approach and focuses on important modules in multi-omics data via an attention mechanism. Experiments using public data on Alzheimer’s disease and cancer with various classification tasks demonstrated the superior performance of this approach.
The utility of MOMA was also verified through a comparison experiment with the attention mechanism turned on or off, and through biological analysis.

This paper was published in Bioinformatics, 38(8): 2287–2296 (2022).

W-026: SDGCCA: Supervised deep generalized canonical correlation analysis for multi-omics integration
COSI: MLCSB
  • Sehwan Moon, Gwangju Institute of Science and Technology, South Korea
  • Jeongyoung Hwang, Gwangju Institute of Science and Technology, South Korea
  • Hyunju Lee, Gwangju Institute of Science and Technology, South Korea


Presentation Overview: Show

Integration of multi-omics data provides opportunities for revealing biological mechanisms related to certain phenotypes. We propose a novel method of multi-omics integration called supervised deep generalized canonical correlation analysis (SDGCCA) for modeling correlation structures between nonlinear multi-omics manifolds, with the aim of improving classification of phenotypes and revealing biomarkers related to phenotypes. SDGCCA addresses the limitations of other canonical correlation analysis (CCA)-based models (e.g., deep CCA, deep generalized CCA) by considering complex/nonlinear cross-data correlations and discriminating phenotype groups. Although there are a few methods for nonlinear CCA projection aimed at discriminating phenotypes, they consider only two views. SDGCCA, in contrast, is a nonlinear multiview CCA projection method for discrimination. When we applied SDGCCA to the prediction of patients with Alzheimer’s disease (AD) and the discrimination of early- and late-stage cancers, it outperformed other CCA-based methods and other supervised methods. In addition, we demonstrate that SDGCCA can be used for feature selection to identify important multi-omics biomarkers. In the application to AD data, SDGCCA identified clusters of genes in multi-omics data that are well known to be associated with AD.

W-027: Essential Regression - a generalizable framework for inferring causal latent factors from multi-omic datasets
COSI: MLCSB
  • Xin Bing, Cornell University, United States
  • Tyler Lovelace, University of Pittsburgh, United States
  • Florentina Bunea, Cornell University, United States
  • Marten Wegkamp, Cornell University, United States
  • Sudhir Pai Kasturi, Emory University, United States
  • Harinder Singh, University of Pittsburgh, United States
  • Panayiotis V. Benos, University of Pittsburgh, United States
  • Jishnu Das, University of Pittsburgh, United States


Presentation Overview: Show

High-dimensional cellular and molecular profiling of biological samples highlights the need for analytical approaches that can integrate multi-omic datasets to generate prioritized causal inferences. Current methods are limited by the high dimensionality of the combined datasets, the differences in their data distributions, and the difficulty of integrating them to infer causal relationships. Here we present Essential Regression (ER), a novel latent-factor-regression-based interpretable machine learning approach that addresses these problems by identifying latent factors and their likely cause-effect relationships with system-wide outcomes/properties of interest. ER can integrate many multi-omic datasets without structural or distributional assumptions regarding the data. It outperforms a range of state-of-the-art methods in terms of prediction. ER can be coupled with probabilistic graphical modeling, thereby strengthening the causal inferences. The utility of ER is demonstrated using multi-omic systems immunology datasets to generate and validate novel cellular and molecular inferences in a wide range of contexts, including immunosenescence and immune dysregulation.

W-028: CMOT: Cross Modality Optimal Transport for multi-modal inference and label prediction
COSI: MLCSB
  • Sayali Anil Alatkar, University of Wisconsin - Madison, United States
  • Daifeng Wang, University of Wisconsin - Madison, United States


Presentation Overview: Show

Next-generation sequencing technologies have generated multi-modal data, enabling a deeper understanding of complex biological mechanisms. However, simultaneous profiling of multiple modalities remains challenging, e.g., because some modalities are missing. Furthermore, multi-modal data integration is difficult since modalities may not always have paired samples or matched features, leaving partial or no correspondence information.

Here, we developed Cross Modality Optimal Transport (CMOT), an optimal-transport-based framework to infer additional modalities and labels for samples with a single modality. First, CMOT aligns the samples with multi-modal data onto a common low-dimensional space. Second, CMOT applies optimal transport to map the distribution of aligned samples (source) to the samples with a single modality (target). It minimizes the Wasserstein distance between the source and target distributions to find an optimal correspondence mapping. Once the samples are transported, CMOT uses nearest neighbors to infer additional modalities for the target samples. In addition, CMOT can predict missing sample labels, e.g., phenotypes. We applied CMOT to three emerging multi-modal datasets: (1) genotype and gene expression data of Alzheimer's disease patients, (2) electrophysiology and gene expression of single neurons, and (3) gene expression and surface protein expression of peripheral blood mononuclear cells. We found that CMOT outperforms state-of-the-art methods for cross-modal inference.

W-029: A machine learning-based framework for high-throughput drug combination screens
COSI: MLCSB
  • William Wright, St. Jude Children's Research Hospital, United States
  • Min Pan, St. Jude Children's Research Hospital, United States
  • Hyeong-Min Lee, St. Jude Children's Research Hospital, United States
  • Gregory Phelps, St. Jude Children's Research Hospital, United States
  • Jonathan Low, St. Jude Children's Research Hospital, United States
  • Duane Currier, St. Jude Children's Research Hospital, United States
  • Richard Lee, St. Jude Children's Research Hospital, United States
  • Taosheng Chen, St. Jude Children's Research Hospital, United States
  • Paul Geeleher, St. Jude Children's Research Hospital, United States


Presentation Overview: Show

Drug combinations are the basis of treatment for modern diseases but arriving at successful combination therapies is fraught with challenges. The current pool of single-agent drugs to potentially combine is far too large to brute-force screen, and purely computational predictions have performed poorly. Suitable screening methods are needed, but the design of experimental approaches has proven to be highly complex; researchers need to carefully balance many variables such as the number of doses from each drug, inclusion of replicates, and throughput. Critically, the number of tested doses must be sufficient to accurately capture synergy. However, high-density combination matrices are resource-intensive and thus limit throughput. This barrier often results in combination screens that use small or sparse matrices, which risk overlooking synergistic hits.
Here, we develop a machine learning approach that predicts the response values of a 10×10 matrix design, where the only input concentrations are the single agents and the matrix diagonal values. We screened various compounds and cell lines in fully measured matrices, which we then used to train a regression model. This strategy allows synergy to be calculated on high-density matrices but with fewer total concentrations tested, ultimately allowing for higher throughput.
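To make the design concrete, the sketch below lists the diagonal wells of an n×n combination matrix and scores synergy on a filled matrix. The abstract does not state which synergy model is used; Bliss independence is assumed here purely for illustration, responses are assumed to be fractional inhibition between 0 and 1, and all names and numbers are hypothetical.

```python
def diagonal_wells(n_doses):
    """Combination wells measured in the sparse design: matched dose indices only."""
    return [(i, i) for i in range(n_doses)]


def bliss_expected(effect_a, effect_b):
    """Expected combined effect if the two drugs act independently (Bliss model)."""
    return [[a + b - a * b for b in effect_b] for a in effect_a]


def bliss_excess(observed, effect_a, effect_b):
    """Observed minus expected effect; positive values suggest synergy."""
    expected = bliss_expected(effect_a, effect_b)
    return [
        [observed[i][j] - expected[i][j] for j in range(len(effect_b))]
        for i in range(len(effect_a))
    ]


# In a 10x10 design, only the diagonal combination wells are measured.
measured = diagonal_wells(10)
measured_fraction = len(measured) / (10 * 10)

# Hypothetical 2x2 example: single-agent effects and a fully known response matrix.
toy_a = [0.1, 0.5]
toy_b = [0.2, 0.4]
toy_observed = [[0.28, 0.50], [0.70, 0.90]]
toy_excess = bliss_excess(toy_observed, toy_a, toy_b)
```

In the approach described above, the off-diagonal entries of the response matrix would come from the trained regression model rather than from measurement before a synergy score of this kind is computed.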

W-030: Prediction of specific TCR-peptide binding from large dictionaries of TCR-peptide pairs
COSI: MLCSB
  • Yoram Louzoun, Bar Ilan University, Israel
  • Ido Springer, Bar Ilan University, Israel


Presentation Overview: Show

Current sequencing methods allow for detailed sampling of T cell receptor (TCR) repertoires. To determine from a repertoire whether its host had been exposed to a target, computational tools that predict TCR-epitope binding are required. Current tools are based on conserved motifs and are applied to peptides with many known binding TCRs.

We employ new Natural Language Processing (NLP) based methods to predict whether any TCR and peptide bind. We combined large-scale TCR-peptide dictionaries with deep learning methods to produce ERGO (pEptide tcR matchinG predictiOn), a highly specific and generic TCR-peptide binding predictor.

We define a set of standard tests for evaluating peptide-TCR binding prediction, including detecting TCRs that bind a given peptide/antigen, choosing among a set of candidate peptides for a given TCR, and determining whether any given TCR-peptide pair binds. ERGO reaches results similar to those of state-of-the-art methods in these tests even when not trained specifically for each test.
The software implementation and data sets are available at https://github.com/louzounlab/ERGO. ERGO is also available through a webserver at: http://tcr.cs.biu.ac.il/

W-031: LOGICS: a framework for Learning Optimal Generative distribution Iteratively for the focused Chemical Structures
COSI: MLCSB
  • Bongsung Bae, Gwangju Institute of Science and Technology, South Korea
  • Hojung Nam, Gwangju Institute of Science and Technology, South Korea


Presentation Overview: Show

In de novo drug design, various computational methods have been proposed to search for molecules with desired properties and biological activities, e.g. binding affinity to target proteins. Here, we focus on a class of deep generative chemical models that includes transfer learning and reinforcement learning approaches, in which SMILES generators are fine-tuned with information from an independent prediction module. Previous approaches have been shown to generate focused structures with the desired target properties at a high frequency. However, it is questionable whether the diversity of the generated molecules and the chemical space they cover really match the distribution of the true targeted molecules. In this study, we present LOGICS, a framework for Learning Optimal Generative distribution Iteratively for the focused Chemical Structures. We raise the exploration-exploitation dilemma and tackle it with an experience memory that records the history of positive generations and a layered tournament selection process for sophisticated formation of the fine-tuning set. The proposed method was demonstrated on bioactivity optimization towards two protein targets, κ-opioid receptor and PIK3CA, and evaluated by distributional distance metrics between the generated molecules and known actives. In the PIK3CA task, LOGICS achieved a 38% FC distance score compared to the prior generator.
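The selection step mentioned above can be pictured with a plain, single-layer tournament selection. This is a minimal sketch only: the abstract does not specify how the layered variant works, so the function `tournament_select`, the toy SMILES strings, and their scores are all hypothetical illustrations rather than the LOGICS implementation.

```python
import random

def tournament_select(candidates, score, k, n_winners, rng):
    """Pick n_winners items; each is the best of k randomly drawn candidates."""
    winners = []
    for _ in range(n_winners):
        contestants = rng.sample(candidates, k)  # draw k distinct candidates
        winners.append(max(contestants, key=score))
    return winners

# Hypothetical SMILES strings with made-up predicted activity scores.
predicted_activity = {"CCO": 0.10, "CCN": 0.40, "c1ccccc1": 0.75, "CC(=O)O": 0.55}

rng = random.Random(0)  # seeded so the sketch is reproducible
fine_tuning_set = tournament_select(
    list(predicted_activity), predicted_activity.get, k=2, n_winners=3, rng=rng
)
```

Small tournaments (low `k`) keep weaker molecules in play and favor exploration, while large tournaments favor exploitation of the current best scorers.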

W-032: Machine learning for pooled microscopy-based spatial proteomics
COSI: MLCSB
  • Jiri Reinis, CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Austria
  • Andreas Reicher, CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Austria
  • Maria Ciobanu, CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Austria
  • Stefan Kubicek, CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Austria


Presentation Overview: Show

The current spatial proteomics toolbox is missing an important item: the ability to monitor subcellular localization of large numbers of proteins upon perturbation – within the same sample. Mass spectrometry and immunofluorescence microscopy can generally only capture a single timepoint per sample. This is not the case for live-cell imaging of fluorescently labeled reporter cell lines, but their generation is a very laborious task.
We have used a highly scalable intron-tagging protocol to fluorescently tag over 1,000 proteins. Our ambition is to use live-cell imaging to study changes in localization and abundance of these proteins in response to thousands of perturbations. To achieve high throughput, a pooled format is required, meaning that each microplate well contains a mix of clones carrying different tagged proteins, and these clones must be discriminated at the single-cell level.
To address this task, we devised a double-tagging and visual barcoding strategy that allows for the discrimination of clones using computer vision and convolutional neural networks. Preliminary results suggest it is possible to identify hundreds of clones with more than 90% accuracy. Our current focus is on screening a targeted set of 100 cancer-associated proteins against drug libraries to search for novel therapeutic strategies.

W-033: Robust comparison of multi-omics clustering methods for subtype discovery in chronic diseases
COSI: MLCSB
  • Teemu J. Rintala, Institute of Biomedicine, University of Eastern Finland, Finland
  • Vittorio Fortino, Institute of Biomedicine, University of Eastern Finland, Finland


Presentation Overview: Show

Multi-omics data is becoming increasingly available and there has been significant interest in using it to identify driving mechanisms and subtypes of chronic diseases. Unsupervised machine learning methods such as multi-view clustering analysis have been applied to partition patients based on their omics profiles. Often these methods are evaluated based on how accurately known partitions are recovered and how significant the survival differences between cancer patient subgroups are without accounting for relevant covariates, which is not ideal for discovering clinically actionable subtypes. In addition, the robustness to relatively small differences in the input data has not been considered. Here we have evaluated the stability of clustering results and metrics by resampling patients in publicly available omics datasets. In addition, we used linear models to account for covariates in survival analysis to better characterize the clinical relevance of clustering results. We compared popular multi-omics clustering approaches and considered recent kernel-based approaches that aim to integrate biological information in the form of functional pathways. Our analyses suggested that clustering results obtained from many methods were not reproducible on subsets of the data. Among the tested approaches, integrative non-negative matrix factorization (IntNMF) was particularly stable and yielded clinically relevant results.
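One way to quantify the stability described above is to compare the cluster labels that two runs assign to the same patients. The sketch below uses the plain Rand index as an assumed, simple agreement measure; the abstract does not name the metric the authors used, and the labels are toy values.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two partitions agree.

    A pair agrees when both partitions put the two samples in the same
    cluster, or both put them in different clusters. Cluster names are
    irrelevant, so relabeled but identical partitions score 1.0.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) < 2:
        raise ValueError("need two label lists of equal length >= 2")
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

# Toy labels for five patients shared by a full run and a resampled run.
full_run = ["A", "A", "B", "B", "C"]
resampled_run = ["x", "x", "y", "y", "y"]
stability = rand_index(full_run, resampled_run)
```

Averaging this score over many resamples gives a single stability estimate per clustering method.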

W-034: Multi-Classification of Breast Cancer from Histopathology Images using Deep learning Approach
COSI: MLCSB
  • V. P. Subramanyam Rallabandi, 3BIGS CO.,LTD., South Korea
  • Sameer Mohammed, 3BIGS CO.,LTD., South Korea
  • Sridhar Srinivasan, 3BIGS CO.,LTD., South Korea
  • Md Ataul Islam, 3BIGS CO.,LTD., South Korea
  • Sathishkumar Natarajan, 3BIGS CO.,LTD., South Korea
  • Dawood Dudekula, 3BIGS CO.,LTD., South Korea
  • Junhyung Park, 3BIGS CO.,LTD., South Korea


Presentation Overview: Show

Tumor histology is an important tool in cancer diagnosis and treatment. Recent advances in deep learning for medical image analysis allude to the utility of radiologic data in describing disease characteristics and risk stratification. Here, we propose a hybrid deep learning approach, CNN-LSTM RNN, which combines a convolutional neural network (CNN) with a long short-term memory recurrent neural network (LSTM RNN), for classifying and predicting four subtypes of benign and malignant breast cancer. We used the BreakHis dataset, comprising 2480 benign and 5429 malignant images acquired at different resolutions (40x, 100x, 200x and 400x). We compared existing CNN models (VGG-16, ResNet50 and Inception) with our proposed InceptionResNetV2 and hybrid CNN-LSTM RNN models. All models were built using three different optimizers, Adam, root mean square propagation (RMSProp) and stochastic gradient descent (SGD), while varying the number of epochs. We found that the Adam optimizer gave the best accuracy on both the training and validation sets, with lower model loss. The highest accuracies were 94.4% using InceptionResNetV2 for 40x, 91.1% using CNN-LSTM RNN for 100x, 86.4% using InceptionResNetV2 for 200x and 91% using ResNet50 for 400x. We conclude that the deep learning approaches performed better than traditional machine learning models in classifying benign and malignant cancer subtypes.

W-035: Predict Clinical Trial Success using 3BIGS Machine Learning Model
COSI: MLCSB
  • V. P. Subramanyam Rallabandi, 3BIGS CO.,LTD., South Korea
  • Sameer Mohammed, 3BIGS CO.,LTD., South Korea
  • Sridhar Srinivasan, 3BIGS CO.,LTD., South Korea
  • Md Ataul Islam, 3BIGS CO.,LTD., South Korea
  • Sathishkumar Natarajan, 3BIGS CO.,LTD., South Korea
  • Dawood Dudekula, 3BIGS CO.,LTD., South Korea
  • Junhyung Park, 3BIGS CO.,LTD., South Korea


Presentation Overview: Show

Prediction of success is one of the most critical components of clinical trial design. Several factors contribute to a successful trial, such as the drug, treatment regime, heterogeneous population, and primary and secondary outcome assessments. We propose machine learning (ML)-based prediction of clinical trial success from the physicochemical properties of the drug, the inclusion and exclusion eligibility criteria for enrollment, and the primary and secondary outcome measures in large-scale clinical trial data for a disease. As a case study, all diabetic drugs, age groups, gender, enrolled patients, study phases, interventions, and primary and secondary outcome measures were collected from an online resource (http://clinicaltrials.gov). Furthermore, features were extracted for the drug physicochemical properties, and the unstructured text of primary and secondary measures, such as standard clinical scales and diagnostic assessments, was transformed into structured data. To train various ML models, we labeled completed phase 4 trials whose results met the primary and secondary outcome measures as positives, and suspended, withdrawn, and terminated trials as negatives. The proposed model identified the features responsible for clinical trial success. We evaluated the best model on data not used for training. We conclude that an ML-based approach will improve the accuracy of predicting whether a clinical trial will succeed or fail.

W-036: Multicellular and Multiscale Modeling of Proliferative Immune Response Under Joint Synergistic and Antagonistic Cytokine Signaling
COSI: MLCSB
  • Komlan Atitey, National Institute of Health - National Institute of Environmental Health Sciences, United States
  • Benedict Anchang, National Institute of Health - National Institute of Environmental Health Sciences-National Cancer Institute, United States


Presentation Overview: Show

Although machine learning and mechanistic modelling have independently been used to study the dynamics of infectious diseases such as coronavirus disease 2019 (COVID-19), which has resulted in almost 1 million deaths in the US, combining both approaches in a hybrid simulation model has not been exploited, especially to address biological processes associated with COVID-19 infection such as cytokine storm and cytokine release syndrome, which result from the interacting effects of multiple cytokines acting simultaneously during an immune response. To address this challenge, we recently published a model, the Multiscale Multicellular Quantitative Evaluator (MMQE), which adopts a hybrid computational approach comprising continuous, discrete and stochastic non-linear model formulations to predict a system-level immune response as a function of multiple dependent signals and interacting agents, including cytokines and targeted immune cells. MMQE quantifies the dynamics of lymphocyte proliferation mediated by a joint downregulation of IL-4 and upregulation of IL-2 during pathogen invasion. We validated our in-silico results with in-vivo and in-vitro experimental studies. We propose an extension of the model to account for the early innate immune response to COVID-19 infection, involving cells from myeloid lineages such as monocytes, by integrating data from single-cell analysis using machine learning, stochastic and mechanistic modeling.

W-037: A Coronavirus Infection and Host-Shift Predictor
COSI: MLCSB
  • Betül Asiye Karpuzcu, Mugla Sitki Kocman University, Turkey
  • Erdem Türk, Mugla Sitki Kocman University, Turkey
  • Onur Can Karabulut, Mugla Sitki Kocman University, Turkey
  • Ahmed Hassan Ibrahim, Mugla Sitki Kocman University, Turkey
  • Barış Süzek, Mugla Sitki Kocman University, Turkey


Presentation Overview: Show

The recent devastating COVID-19 pandemic underscored the need for robust viral infection prediction techniques and tools to identify potential host shifts. To this end, we re-purposed our previously developed machine-learning based adenoviral infection prediction approach, namely ML-AdVInfect, to predict the infection capacity of Coronaviridae for candidate hosts.
Here, we collected the sequence data for the orthologs of the literature-curated human coronavirus receptors (ACE2, TMPRSS2, DPP4, and ANPEP) as well as coronavirus spike proteins along with their respective host species. Then, we created a positive dataset (i.e. infection occurs) of 89 and a negative dataset (i.e. no known infection) of 48,751 coronavirus–host pairs covering all vertebrates. Next, we used publicly available virus–host protein–protein interaction tools (DeNovo, HOPITOR, InterSPPI-HVPPI) to predict interactions between host receptors and spike proteins to create a feature vector for our infection prediction models. Finally, we experimented with several ML-based classifiers for coronavirus infection prediction. Overall, Random Forest models performed best, achieving the highest sensitivity (SN) or specificity (SP) with alternative hyper-parameters (SN: 0.97±0.006, SP: 0.66±0.001, AUC: 0.81±0.003; SN: 0.77±0.022, SP: 0.88±0.001, AUC: 0.82±0.011).
Our work supports the use of computational approaches in infection prediction and identification of potential host shifts, helping early detection of viral infection threats in public health.
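For readers less familiar with the SN and SP figures quoted above, the sketch below shows how both are derived from binary labels. The function name and the toy labels are illustrative assumptions, not the authors' code or data. With 89 positives against 48,751 negatives, plain accuracy would be dominated by the negative class, which is presumably why the class-wise rates are reported.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = infection)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    sensitivity = tp / (tp + fn)  # share of true infections that are detected
    specificity = tn / (tn + fp)  # share of non-infections correctly rejected
    return sensitivity, specificity

# Toy example: 3 infecting pairs and 5 non-infecting pairs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
sn, sp = sensitivity_specificity(y_true, y_pred)
```

The two hyper-parameter settings in the abstract trade one rate against the other, which this pair of numbers makes explicit.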

W-039: Estimation of protein topological properties from single sequences using a highly parallelized minified language model
COSI: MLCSB
  • Guilherme Bottino, Institute of Chemistry, University of Campinas - UNICAMP, Brazil
  • Leandro Martinez, Institute of Chemistry, University of Campinas - UNICAMP, Brazil


Presentation Overview: Show

Important advances were made in Bioinformatics through the implementation of Artificial Intelligence algorithms, to the point where extensive language models can be used to model protein structures when little or no homology information is available. However, as those models become larger and more sophisticated, their accessibility to researchers and users, as well as training and execution costs, become a concern. Small-scale models that maintain a competitive level of accuracy with reduced size, execution time, and code length are thus important. In this work, we present TintiNet.jl, a fresh approach to 1D protein-property estimation, which predicts residue-wise secondary structures, phi and psi angles, and solvent accessibility. Using contemporary neural network architectures, we designed a minified language model combining Inception and Transformer modules, which can achieve top performance on the prediction task when compared to cutting-edge algorithms, with only a fraction of the parameter count. It also achieves the smallest time to generate a prediction per sequence among the benchmarked models. Being fast and light, our model can run with very limited resources. This work should inspire other minification approaches and extend the accessibility of AI research for Structural Proteomics. All data and source code are available at https://github.com/hugemiler/TintiNet.jl.

W-040: Evaluation of Mutational Effects on Protein Structures with Alphafold
COSI: MLCSB
  • Michael Fujarski, University of Münster, Germany
  • Linda Ebbert, University of Münster, Germany
  • Sarah Sandmann, University of Münster, Germany
  • Julian Varghese, University of Münster, Germany


Presentation Overview: Show

The structures of proteins are fundamental for their behavior and need to be determined to gain insight into biological processes. Nevertheless, experiments determining protein structures are time-consuming and expensive. Thus, structure analysis of mutated proteins needs in-silico approaches.

The prediction of unknown protein structures is an ongoing challenge. Google’s DeepMind developed Alphafold, a robust deep model that proved to produce valid results with close to experimental accuracy. We present an evaluation of Alphafold on mutant proteins focusing on the hemoglobin beta subunit (Hbb).
Alphafold performs well on the wildtype with a Cα root mean square deviation (RMSD) of 0.541Å. However, the sickle cell mutant is predicted to cause an insignificant distortion of the structure with RMSD of only 0.039Å while the reference deviation based on UniProt should reach ∼1Å. The introduction of consecutive substitutions of peptides with glutamic acid produces similar results to the sickle cell mutant. The substitution of 15 consecutive peptides distorted the protein structure with an RMSD of 0.85Å.
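The RMSD values above compare matched Cα positions between two structures. As a minimal sketch, assuming the two coordinate sets have already been superimposed (the alignment step is omitted here, and the coordinates are toy values), the calculation is:

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between matched 3D coordinates (in Å).

    Both inputs are lists of (x, y, z) tuples for the same atoms in the
    same order, e.g. one Cα atom per residue, after superposition.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy Cα traces: the second is the first shifted by 1 Å along x.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
deviation = rmsd(reference, shifted)
```

Because the deviation is averaged over all residues, a local change at a single mutated position contributes little to a whole-chain RMSD.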

Although Alphafold performs well on preserved protein structures, the model is currently not suitable for mutational effect predictions. Therefore, we propose an adaptation of Alphafold’s training architecture to increase the accuracy of predicted mutational effects on the protein structure.

W-041: An accelerated data science workflow coupled with machine learning to predict the bioactivity of SARS-CoV-2 inhibitors
COSI: MLCSB
  • Smitha Sunil Kumaran Nair, Middle East College, Oman
  • Saqar Said Nasser Al Maskari, Middle East College, Oman
  • Kiran Gopakumar Rajalekshmi, Middle East College, Oman
  • Beema Shafreenb Rajamohamed, Dr. Umayal Ramanathan College for Women, India
  • Nallusamy Sivakumar, Sultan Qaboos University, Oman
  • Adhraa Al-Mawaali, Ministry of Health, Oman


Presentation Overview: Show

Data science encompassing machine learning algorithms facilitates multiple drug discovery processes. In this study, a feature-based virtual screening of active compounds against the therapeutic targets is adopted. A dataset of drug molecules with IC50 values known to interact with SARS-CoV-2 inhibitors was compiled from the ChEMBL database. Compounds were labeled as active or inactive according to their potency, as determined by the compound's standard value. The SMILES notations of the compounds were encoded using Lipinski descriptors from RDKit and PubChem fingerprints from PaDEL descriptors. Exploratory data analysis confirmed, based on p-values, that the pharmacokinetic profiles discriminate between active and inactive compounds with statistical significance. The high dimensionality of the latter descriptors was reduced through Principal Component Analysis. A Random Forest (RF) classifier was trained on the feature vectors, and 10-fold cross-validation was used to estimate prediction accuracy. Repeating the entire workflow with cuDF and cuML, which enable accelerated computation in the RAPIDS ecosystem, yielded an accuracy around 3% higher than that obtained with the Scikit-learn RF, with reduced time complexity. The results confirm that leveraging an accelerated data science workflow coupled with the machine learning libraries of RAPIDS is promising for predicting the bioactivity of drug compounds, including ethnomedicine-based compounds.

W-042: Joint modeling of rare variant genetic effects using deep learning and data-driven burden scores
COSI: MLCSB
  • Brian Clarke, German Cancer Research Center (DKFZ), Germany
  • Eva Holtkamp, German Cancer Research Center (DKFZ), Germany
  • Hakime Öztürk, German Cancer Research Center (DKFZ), Germany
  • Felix Brechtmann, Department of Informatics, Technical University of Munich, Germany
  • Florian Hölzlwimmer, Department of Informatics, Technical University of Munich, Germany
  • Julien Gagneur, Department of Informatics, Technical University of Munich, Germany
  • Oliver Stegle, German Cancer Research Center (DKFZ), Germany


Presentation Overview: Show

Population-scale genomic sequencing provides novel opportunities to survey the effect of rare variants on phenotypes. Existing methods for rare-variant-association studies (RVASs), such as burden or variance-component tests, make strong assumptions about which variants exhibit phenotypic effects, limiting their efficacy.

Here, we propose DeepRVAT (Deep Rare Variant Association Testing), a data-driven framework that uses deep neural networks to learn a flexible rare variant aggregation function. Specifically, we build on DeepSet networks to efficiently model variant effects and interactions. Compared to existing methods, DeepRVAT (1) learns variant effects without strong filtering or specifying a kernel, (2) models nonlinear and epistatic effects, (3) efficiently incorporates dozens of multi-modal variant annotations, (4) provides trait-specific burden scores, and (5) utilizes GPUs for biobank-scale analyses.

We apply DeepRVAT to multiple phenotypes on 167,000 whole-exome-sequenced samples from UK Biobank. Compared with previous state-of-the-art methods, we obtain significantly increased power (e.g., 29 vs. 15 genes associated with human height; FDR < 0.05), while maintaining statistical calibration. Furthermore, we validate our results by multiple methods (e.g., enrichment analysis, comparison to larger studies) to ensure biological plausibility of our associations. Collectively, our results demonstrate increased power and robustness for studying gene-trait associations using rare variants.
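The DeepSet idea behind the aggregation function can be sketched as a per-variant embedding, an order-independent sum over variants, and a final transform. The code below is a toy illustration with fixed, made-up weights; the function names, the single tanh unit, and the sigmoid output are assumptions for exposition and do not reflect the DeepRVAT architecture.

```python
import math

def phi(annotations, weights):
    """Per-variant embedding: one tanh unit over the annotation vector."""
    return math.tanh(sum(w * a for w, a in zip(weights, annotations)))

def burden_score(variants, weights, scale):
    """Permutation-invariant gene score: rho(sum of phi over variants)."""
    pooled = sum(phi(v, weights) for v in variants)  # sum pooling over the set
    return 1.0 / (1.0 + math.exp(-scale * pooled))   # rho: squash to (0, 1)

# Toy data: three rare variants in one gene, two annotations each.
variants = [(0.9, 0.1), (0.2, 0.7), (0.5, 0.5)]
weights = (1.5, -0.5)  # hypothetical values; these would be learned in practice
score = burden_score(variants, weights, scale=2.0)
score_reordered = burden_score(list(reversed(variants)), weights, scale=2.0)
```

Sum pooling is what lets a gene carry any number of variants in any order while still producing a single score per individual.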

W-043: HIDTI: integration of heterogeneous information to predict drug-target interactions
COSI: MLCSB
  • Jihee Soh, Gwangju Institute of Science and Technology, South Korea
  • Hyunju Lee, Gwangju Institute of Science and Technology, South Korea
  • Sejin Park, Gwangju Institute of Science and Technology, South Korea


Presentation Overview: Show

Identification of drug-target interactions (DTIs) plays a crucial role in drug development. Traditional laboratory-based DTI discovery is generally costly and time-consuming. Therefore, computational approaches have been developed to predict interactions between drug candidates and disease-causing proteins. We designed a novel method, termed heterogeneous information integration for DTI prediction (HIDTI), based on the concept of predicting vectors for all unknown/unavailable heterogeneous drug- and protein-related information. We applied a residual network in HIDTI to extract features of such heterogeneous information for predicting DTIs, and tested the model using drug-based ten-fold cross-validation to examine the prediction performance for unseen drugs. As a result, HIDTI outperformed existing models that use heterogeneous information, demonstrating that our method predicted heterogeneous information on unseen data better than other models. In conclusion, our study suggests that HIDTI has the potential to advance the field of drug development by accurately predicting the targets of new drugs. This paper was published in Sci Rep. 2022 Mar 8;12(1):3793.

W-044: Exploring Autoencoder Neural Network Architectures for Use in Gene Regulon Inference Through Bacterial Transcriptomic Compendia
COSI: MLCSB
  • Willow Kion-Crosby, Helmholtz Center RNA-based Infection Research, Germany
  • Lars Barquist, Helmholtz Center RNA-based Infection Research, Germany


Presentation Overview: Show

Gene regulation governs how bacteria respond to their environment. Our understanding of regulation mostly comes from model bacteria, which are often not representative of the many strains relevant to human health and industrial applications. In principle, transcriptomics can provide a complete picture of gene expression in diverse conditions, and serve as a basis from which to rapidly infer the regulatory networks governing bacterial behavior without the need for prior knowledge. A competitive approach to this problem is the use of autoencoders (AEs): an unsupervised neural network-based analysis tool. However, AEs are extremely flexible, and it is unclear how network inference is affected by model parameters, such as the choice of activation function or network depth. Here we investigate which AE network architectures are best suited for the recovery of gene regulatory interactions from bacterial expression data by evaluating the performance of a variety of architecture choices on an RNA-seq compendium for Escherichia coli K-12 MG1655. We have subsequently used these networks to recover expression modules uniquely associated with human infections from uropathogenic E. coli (UPEC) transcriptomic data as a proof-of-principle. This work illustrates the potential for deep learning as a powerful means of inferring regulation from large transcriptomic data sets.

W-045: BuDDI: BUlk Deconvolution with Domain Invariance
COSI: MLCSB
  • Natalie Davidson, University of Colorado Anschutz Medical Campus, United States
  • Casey Greene, University of Colorado Anschutz Medical Campus, United States


Presentation Overview: Show

Single-cell experiments provide greater resolution within a single sample; however, they currently lack coverage across biological conditions. Furthermore, some single-cell experiments are inherently more difficult to perform than bulk experiments due to difficulties in dissociating cells or limited tissue amounts. To bridge this gap, we can leverage the large corpora of bulk RNA-Seq to infer missing single-cell observations using domain adaptation techniques.

We propose BuDDI (BUlk Deconvolution with Domain Invariance) to estimate cell-type proportions in bulk samples using single-cell references. BuDDI reconciles the variability between single-cell and bulk experiments by disentangling experimental and biological noise from a target signal (cell-type proportion). BuDDI learns three independent latent spaces within a single variational autoencoder: 1) target signal, 2) structured noise, and 3) remaining variation. A key aspect of BuDDI’s structure is that it can adapt missing observations across technologies.

We compare BuDDI against BayesPrism and CIBERSORTx on pseudo-bulks with simulated and real biological and technical noise. BuDDI performed better than CIBERSORTx and comparably to BayesPrism. In addition, we validated that the cell-type latent representation was disentangled from the structured and unstructured noise. In future work, we plan to infer cell-type specific perturbation response using only reference single-cell profiles and perturbed bulk observations.

W-046: Bayesian Inference as a Robust Alternative to Non-Linear Regression for Dose-Response Parameters Assessment
COSI: MLCSB
  • Caroline Labelle, IRIC / Université de Montréal, Canada
  • Anne Marinier, IRIC / Université de Montréal, Canada
  • Sebastien Lemieux, IRIC / Université de Montréal, Canada


Presentation Overview: Show

Large-scale dose-response screens are used to test efficacy of therapeutic agents for various conditions. We previously proposed a Bayesian inference model (BiDRA) for the assessment of dose-response parameters.

Using large-scale pharmacological datasets, we demonstrate the robustness and the gain of using BiDRA, compared to the standard Levenberg-Marquardt algorithm. Notably, we demonstrate that discrepancies in replicated experiments are in part due to the analytical approach of obtaining dose-response parameters.

We identify the main limitation of Levenberg-Marquardt as being incomplete and seemingly unresponsive response curves. For such experiments, Levenberg-Marquardt either does not converge (for a given number of iterations) or forces a fit and returns unreliable parameters. We identify those experiments as being an important driver of discrepancies. For instance, for replicates of incomplete response curves, Levenberg-Marquardt's HDR (high-dose response) estimates are arbitrary and often discordant, even though the experiments are concordant in the sense that their HDRs are unobservable. In contrast, BiDRA’s posterior distributions are representative of the uncertainty of the parameters and align with one another, suggesting that the experiments are indeed concordant. Overall, we observe that BiDRA’s posterior distributions have higher correlation coefficients when comparing the parameters of replicated experiments.
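To see why the HDR is unobservable on an incomplete curve, consider a four-parameter log-logistic curve, a common dose-response form. This is an illustrative sketch: the abstract does not state the exact parametrization BiDRA uses, and the function name, parameter names, and values below are assumptions.

```python
def log_logistic(dose, ldr, hdr, ic50, slope):
    """Response at a dose: close to ldr at low doses, close to hdr at high doses."""
    return hdr + (ldr - hdr) / (1.0 + (dose / ic50) ** slope)

# Tested doses stay far below the IC50, i.e. an incomplete response curve.
doses = [0.001, 0.01, 0.1]

# Two candidate fits that differ only in their high-dose response (HDR).
curve_hdr_0 = [log_logistic(d, ldr=100.0, hdr=0.0, ic50=100.0, slope=1.0) for d in doses]
curve_hdr_50 = [log_logistic(d, ldr=100.0, hdr=50.0, ic50=100.0, slope=1.0) for d in doses]

# Largest difference between the two fits over the tested dose range.
max_gap = max(abs(a - b) for a, b in zip(curve_hdr_0, curve_hdr_50))
```

Over the tested range the two curves are nearly indistinguishable, so a point estimate of HDR is essentially arbitrary, whereas a posterior distribution can express that the parameter is unconstrained.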

W-047: Deep auxiliary learning for multi-modal integration and imputation to improve genotype-phenotype prediction
COSI: MLCSB
  • Pramod Bharadwaj Chandrashekar, University of Wisconsin Madison, United States
  • Chenfeng He, University of Wisconsin Madison, United States
  • Ting Jin, University of Wisconsin Madison, United States
  • Sayali Alatkar, University of Wisconsin Madison, United States
  • Saniya Khullar, University of Wisconsin Madison, United States
  • Daifeng Wang, University of Wisconsin Madison, United States


Presentation Overview: Show

Genotype-phenotype associations have been found in many biological systems, such as the brain and brain diseases. However, predicting phenotypes from genotypes remains challenging, primarily due to complex underlying molecular and cellular mechanisms. Emerging multi-modal data enables studying such mechanisms at different scales. However, it is still challenging to integrate and interpret these modalities for phenotype prediction, especially when some modality is missing. To address this, we developed an interpretable deep learning model to improve genotype-phenotype prediction from multi-modal data. The model can use prior biological knowledge to define the neural network architecture. In particular, it embeds an auxiliary-learning layer for cross-modal imputation while training the model. Using this pre-trained layer, we can impute latent features of additional modalities and enable predicting phenotypes from a single modality only. Finally, the model uses integrated gradients to prioritize multi-modal features and links for phenotypes. We applied it to population-level genotypic and gene expression data for predicting clinical phenotypes in schizophrenia and Alzheimer's disease, and to gene expression and electrophysiology data of single-cell neurons for predicting cell cortical layers. We found that our model, Deepdice, not only outperforms existing state-of-the-art methods but also provides a deeper understanding of gene regulatory mechanisms at cellular resolution from genotype to phenotype.

W-048: wenda_gpu: fast domain adaptation for genomic data
COSI: MLCSB
  • Ariel Hippen, University of Pennsylvania, United States
  • Jake Crawford, University of Pennsylvania, United States
  • Jacob Gardner, University of Pennsylvania, United States
  • Casey Greene, University of Colorado School of Medicine, United States


Presentation Overview: Show

Supervised prediction models have been used for many purposes in bioinformatics. A fundamental assumption in supervised machine learning is that the data being classified is derived from the same distribution as the data used to train the classifier. However, challenges in data acquisition often mean few or no labeled examples are available for a distribution of interest. For these situations, the field of domain adaptation has established principled ways to develop predictors for the data of interest (target data) using data from a similar but distinct distribution (source data). A recent method, weighted elastic net domain adaptation or wenda, learns the complex interactions between features of genomic data and leverages them to maximize transferability, but the method is too computationally demanding to apply to many genome-sized datasets.
We have developed wenda_gpu, which uses GPyTorch to train models approximately 10-fold faster and can train on genome-sized data within hours on a single GPU-enabled machine. We demonstrate that wenda_gpu returns comparable results to the original wenda implementation, and that it can be used for improved prediction of cancer mutation status on small sample sizes compared to a regular elastic net.

W-049: Predicting Complex Disorders by Combining Comorbidity Data and Polygenic Risk Scores
COSI: MLCSB
  • Myson Burch, Department of Computer Science, Purdue University, United States
  • Pritesh Jain, Department of Biological Sciences, Purdue University, United States
  • Zhiyu Yang, Finnish Institute for Molecular Medicine, United States
  • Apostolia Topaloudi, Department of Biological Sciences, Purdue University, United States
  • Peristera Paschou, Department of Biological Sciences, Purdue University, United States
  • Aritra Bose, IBM T.J. Watson Research Center, United States
  • Petros Drineas, Department of Computer Science, Purdue University, United States


Presentation Overview:

Using modeling and forecasting techniques, we can add considerable value to healthcare and precision medicine strategies. To that end, comorbidity data and Polygenic Risk Scores (PRS) can be leveraged to predict levels of disease risk for individuals. In our work, we use ensemble classifiers to improve predictive performance in classification problems by combining the predictions of multiple baseline algorithms. We take advantage of the capabilities of well-performing models on a classification or regression task and make predictions that perform better than any single algorithm in the ensemble. We build models that exclusively use the comorbidity data, as well as models that exclusively use the PRS, and combine their predictions to generate more powerful models. We apply such analyses to comorbidity data extracted from the UK Biobank for a variety of conditions and their respective PRS computed using PRScs and SBLUP. We observe that ensemble classifiers applied to such data improve on baseline classifiers, sometimes considerably. Here, we show an analysis framework using predictive analytics that can aid in clinical diagnosis for complex diseases and can also be used to impute phenotypes, thus improving the power of Genome Wide Association Studies.

W-050: Deep Learning model of T-cell recognition of antigens and its applications in cancer
COSI: MLCSB
  • Olga Lyudovyk, Memorial Sloan Kettering Cancer Center, United States
  • Artem Streltsov, Cornell University, United States
  • Yuval Elhanati, Memorial Sloan Kettering Cancer Center, United States
  • Quaid Morris, Memorial Sloan Kettering Cancer Center, United States
  • Benjamin Greenbaum, Memorial Sloan Kettering Cancer Center, United States


Presentation Overview:

Understanding T-cell specificity is an unsolved problem relevant to biomarker, cell therapy, and vaccine development for infectious diseases, cancers, and auto-immune diseases. Large experimental datasets of T-cell receptors (TCR) and the epitopes (parts of antigens) they recognize have recently been published and present an opportunity to explore principles and patterns of T-cell-antigen recognition. We trained an attention-based transformer model on a dataset containing 635,581 epitope and TCR pairs to predict whether a TCR recognizes a specific epitope. Our BERT-based (Bidirectional Encoder Representations from Transformers) classifier achieves a test set Average Precision and Area Under the Receiver Operating Characteristic curve (AUROC) of 0.925 and 0.911 respectively, significantly outperforming the current state-of-the-art (SOTA) model based on an LSTM architecture (Average Precision of 0.844 and AUROC of 0.876). Our model generalizes well to previously unseen TCR and epitope sequences and, unlike the SOTA model, can learn discontinuous sequence patterns.
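
For readers less familiar with the two metrics reported above, the following is a minimal sketch of how Average Precision and AUROC can be computed from binary labels and classifier scores. The labels and scores are made-up toy values, not data from the poster, and the function names are chosen only for this illustration.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def auroc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = TCR recognizes the epitope, 0 = it does not.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
ap = average_precision(toy_labels, toy_scores)
roc = auroc(toy_labels, toy_scores)
```

Average Precision weights performance towards the top of the ranking, which makes it informative when true binders are rare, whereas AUROC summarizes ranking quality over all positive-negative pairs.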

The trained model elucidates SARS-CoV-2 antigen-specific recognition motifs of TCR sequences of COVID-19 patients and investigates the clinical relevance of the abundance of COVID-19-specific TCRs. We then applied our model to analyze TCR repertoires of tumor-infiltrating T-cells and circulating T-cells in blood samples of cancer patients to identify putative tumor-specific T-cells.

W-051: Pre-training on molecular simulations improves protein sequence-function modeling
COSI: MLCSB
  • Sam Gelman, University of Wisconsin-Madison, United States
  • Jerry Duan, University of Wisconsin-Madison, United States
  • Sameer D'Costa, University of Wisconsin-Madison, United States
  • Kaustubh Amritkar, University of Wisconsin-Madison, United States
  • Bryce Johnson, University of Wisconsin-Madison, United States
  • Philip Romero, University of Wisconsin-Madison, United States
  • Anthony Gitter, University of Wisconsin-Madison, United States


Presentation Overview:

Neural networks can infer the mapping between protein sequence and function based on labeled sequence-function examples. However, many experimental sequence-function datasets are limited in size or have a biased distribution of mutations. This affects the networks’ ability to generalize beyond the training data and decreases the utility for protein engineering. To overcome these challenges, we propose METL (mutational effect transfer learning), a method for predicting protein function based on transfer learning from specially pre-trained networks. METL is unlike existing protein representation learning methods. Instead of pre-training on large collections of naturally existing proteins, we employ weakly supervised pre-training on large synthetic datasets generated using molecular simulations. We train source models to predict simulated energy terms, which capture various aspects of protein stability. Then, we transfer the learned representation to predict the score from experimental sequence-function data. We implement a transformer encoder that uses relative positional embeddings based on three-dimensional residue distances. We evaluate our approach on five deep mutational scanning datasets using small training set sizes and extrapolation-based data splits to mimic training with limited experimental data. Preliminary results show METL outperforms multiple non-transfer learning baselines and a state-of-the-art protein language model.

W-052: An experimentally-based functional gene embedding
COSI: MLCSB
  • Felix Brechtmann, Department of Informatics, Technical University of Munich, Germany
  • Thibault Bechtler, Department of Informatics, Technical University of Munich, Germany
  • Brian Clarke, German Cancer Research Center (DKFZ), Germany
  • Oliver Stegle, German Cancer Research Center (DKFZ), Germany
  • Julien Gagneur, Department of Informatics, Technical University of Munich, Germany


Presentation Overview:

Gene embeddings, i.e. numerical representations of gene function, are of high relevance for modeling in genomics. Here we propose a generic gene embedding with distinctive features: our embedding is based on multiple experimental data modalities, excluding text-mining sources, to prevent ascertainment bias towards well-studied genes. Specifically, we integrate gene expression across human tissues (GTEx), protein-protein interactions extracted from STRING, and two recently published datasets: genome-wide deletion screens (DepMap) and an embedding derived from protein sequences (Elnaggar et al.).

Using our embedding, we first predict curated trait-gene associations and reach a mean precision of at least 0.2 for 6 out of 10 traits while generally improving the performance over embeddings based on any single modality. Second, we predict gene-aggregated GWAS signals using MAGMA and obtain a median R² of 6.6% across 25 blood biomarkers. Third, we improve the mean precision by 62% in the cancer driver gene prediction task over OncoVar. For all these prediction tasks we used a single embedding, emphasizing its generality.

Overall, our embedding captures different aspects of gene functions and can be easily integrated into diverse prediction tasks that can benefit from a general-purpose gene embedding.

W-053: maxATAC: Predicting Transcription Factor Binding at Disease Risk Loci from ATAC-seq and DNA Sequence with Convolutional Neural Networks
COSI: MLCSB
  • Tareian Cazares, University of Cincinnati, United States
  • Faiz Rizvi, University of Cincinnati, United States
  • Balaji Iyer, University of Cincinnati, United States
  • Xiaoting Chen, Cincinnati Children's Hospital Medical Center, United States
  • Michael Kotliar, Cincinnati Children's Hospital Medical Center, United States
  • Joseph Wayman, Cincinnati Children's Hospital Medical Center, United States
  • Anthony Bejjani, University of Cincinnati, United States
  • Omer Donmez, Cincinnati Children's Hospital Medical Center, United States
  • Benjamin Wronowski, Cincinnati Children's Hospital Medical Center, United States
  • Sreeja Parameswaran, Cincinnati Children's Hospital Medical Center, United States
  • Leah Kottyan, Cincinnati Children's Hospital Medical Center, United States
  • Artem Barski, Cincinnati Children's Hospital Medical Center, United States
  • Matthew Weirauch, Cincinnati Children's Hospital Medical Center, United States
  • Vb Surya Prasath, Cincinnati Children's Hospital Medical Center, United States
  • Emily Miraldi, Cincinnati Children's Hospital Medical Center, United States


Presentation Overview:

Most disease-associated genetic variants fall outside of protein-coding DNA and are often enriched in regulatory elements associated with DNA binding proteins known as transcription factors (TFs). Computational methods are largely used to predict TF binding sites (TFBS) as the experimental characterization of most human TFs is intractable due to technical limitations. Instead, the most popular approaches use TF motifs and chromatin accessibility data to predict TF binding. Here, we present “maxATAC”, a suite of deep neural network models for genome-wide TFBS prediction from the assay for transposase-accessible chromatin (ATAC-seq) in any cell type, with models available for 127 human TFs. We demonstrate maxATAC’s capabilities by identifying TFBS associated with allele-dependent chromatin accessibility at atopic dermatitis genetic risk loci. We analyzed activated T cells isolated from patients with atopic dermatitis and their age-matched controls. Patient-specific ATAC-seq signal and DNA sequence were used as input for maxATAC to predict the binding of 103 nominally expressed TFs. We predicted increased binding of several TFs relevant to T cells, including MYB and FOXP1, in patients with atopic dermatitis. These results illustrate the utility of maxATAC models for exploring potential regulators of cellular biology and the effects of genetic variants on genomic regulation.

W-054: Prediction of drug-perturbed cancer cell line gene expression using graph neural networks
COSI: MLCSB
  • Nathaniel Evans, Oregon Health & Science University, United States
  • Shannon McWeeney, Oregon Health & Science University, United States
  • Guanming Wu, Oregon Health & Science University, United States


Presentation Overview:

Ineffective or limited precision oncology treatments are a cause of patient mortality. We seek to address this challenge by improving pre-clinical drug repurposing and drug combination discovery. We highlight the methodological challenge of training drug response models on single-drug data such that they generalize well to multi-drug perturbations. We operate on the premise that protein-protein interactions mediate cellular drug response and hypothesize that incorporating this prior knowledge in a deep learning framework can overcome limitations in drug response modeling. To do this, we propose a machine learning model to predict drug-perturbed mRNA expression from intrinsic cancer features using graph neural networks (GNN) that operate on literature-curated protein functional interactions and drug-target interactions. We have shown the promise of our approach using synthetic data and are in the process of applying it to experimental datasets (LINCS L1000). The successful outcome of our method will enable novel GNN-based approaches to drug prioritization.

W-055: Multiclass Classifier for Predicting Congenital Heart Disease Subgroups using data from Copy Number Variants
COSI: MLCSB
  • Jacqueline Penaloza, The Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • Blythe Moreland, The Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • Kim McBride, The Center for Cardiovascular and Pulmonary Research, Nationwide Children’s Hospital, United States
  • Peter White, The Institute for Genomic Medicine, Nationwide Children’s Hospital, United States


Presentation Overview:

Congenital Heart Disease (CHD) is a global health burden that is a major cause of infant mortality. The heart is the first organ to develop, and disruption of this process leads to defects. Copy Number Variants (CNVs) are chromosomal gains or losses. In this study, we leverage our access to data from the Cytogenomics of Cardiovascular Malformations (CCVM) Consortium. This dataset includes information on demographics, diagnosis, and clinical chromosomal microarray analysis of patients with CHD. The CHD subgroups we focused on are (Ventricular/Atrial) Septal Defect (VSD/ASD), Right Ventricular Outflow Tract Obstruction (RVOTO), Left Ventricular Outflow Tract Obstruction (LVOTO), Heterotaxy (HTX), Conotruncal Defect (CTD), Atrioventricular Septal Defect (AVSD), and Anomalous Pulmonary Venous Return (APVR). We developed a machine learning classifier to identify CNV patterns, both distinct and shared, between CHD subgroups, among features such as genomic region, biological processes, and other factors. Several tree-based algorithms were evaluated, with random forest chosen as the final model. Shapley Additive Explanations (SHAP) were used to explain classification of the CHD subgroups. We found that CNV type, brain-related pathways, cell migration, and cell projection are important factors in determining the classification. We demonstrated that machine learning is an effective tool for functional annotation of CNVs.

W-056: Genome based influenza risk assessment using tree-guided sparse learning
COSI: MLCSB
  • Cheng Gao, University of Missouri - Columbia, United States
  • Jane Tao, Rice University, United States
  • Xiu-Feng Wan, University of Missouri - Columbia, United States


Presentation Overview:

Genetic reassortment happens when two different influenza viruses infect the same cell, and reassortment between human and avian influenza A viruses (IAVs) facilitated the emergence of all of the last three pandemic IAVs. Here we formulate influenza risk assessment as a sparse learning problem to assess genomic compatibility among avian and human IAVs and to select synergistic features affecting virus replication efficiency, assuming that the higher the replication efficiency, the higher the genomic compatibility. A tree-guided sparse learning model (TGSL) was developed by assigning all proteins as the tree leaves with five internal nodes, representing HA, NA, RNP (PB2, PB1, PA, and NP), NS (NS1 and NS2), and MP (M1 and M2). The replication efficiency data for reassortants among human IAVs (H1N1 and H3N2), avian IAVs (H5N1, H7N9, and H9N2), and canine IAVs (H3N2) were used as the training data. The 10-fold cross-validation showed that TGSL outperformed the L1-norm, L2-norm, L1-∞ norm, and Sparse Group Lasso models against which it was compared. Residues associated with genomic compatibility were identified across all 10 proteins, but most were in the contact regions among proteins in ribonucleoprotein complexes. The computational model and the derived features can help us understand factors associated with genomic compatibility and facilitate pandemic influenza preparedness.

W-057: Generating challenging negatives for contrastive learning in hyperbolic space
COSI: MLCSB
  • Daniel McNeela, University of Wisconsin, Madison, United States
  • Anthony Gitter, University of Wisconsin, Madison, United States
  • Fred Sala, University of Wisconsin, Madison, United States


Presentation Overview:

Graph representation learning for chemical and protein graphs is a key component of advances in predictive modeling for drug discovery, protein engineering, and interactions among drugs and proteins. Recent unsupervised approaches use contrastive learning for generating graph embeddings in Euclidean space. The idea is to learn representations which exhibit high mutual information between local substructures and the global graph structure for positive samples while pushing away the representations for negative samples. While much focus has gone into how to generate positive samples, comparatively little attention has been paid to approaches for generating negative samples. We propose a method for generating negative samples that uses differential geometry to embed graphs on a hyperbolic manifold. Because hyperbolic geometry captures the hierarchical relationships in graphs and enables low-distortion embeddings, we show that we are able to achieve high-quality learned representations by embedding and generating negative samples in hyperbolic space. We are applying our graph representation learning framework to a variety of downstream tasks on chemical and protein graphs.
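
To make the geometric idea above more concrete, the sketch below computes distances in the Poincaré ball, one common model of hyperbolic space, and ranks candidate negatives by their hyperbolic distance to an anchor embedding. Both the choice of the Poincaré ball and the "closest candidates are the most challenging negatives" selection rule are assumptions made for illustration; the abstract does not specify the manifold model or the sampling procedure, and the function names are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    def sq_norm(w):
        return sum(c * c for c in w)

    diff = [a - b for a, b in zip(u, v)]
    arg = 1.0 + 2.0 * sq_norm(diff) / ((1.0 - sq_norm(u)) * (1.0 - sq_norm(v)))
    return math.acosh(max(1.0, arg))  # guard against rounding below 1


def hardest_negatives(anchor, candidates, k=1):
    """Return the k candidates closest to the anchor in hyperbolic distance.

    Illustrative selection rule only: nearby negatives are assumed to be
    the most challenging ones for a contrastive objective.
    """
    return sorted(candidates, key=lambda c: poincare_distance(anchor, c))[:k]


origin_to_half = poincare_distance((0.0, 0.0), (0.5, 0.0))
picked = hardest_negatives((0.1, 0.0), [(0.8, 0.0), (0.2, 0.0), (-0.5, 0.5)])
```

Distances grow rapidly near the boundary of the ball, which is what lets tree-like hierarchies be embedded with low distortion.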

W-058: Cross-species transcriptome-based regression to discover equivalents of human samples and genes in biomedical research organisms
COSI: MLCSB
  • Hao Yuan, Michigan State University, United States
  • Christopher Mancuso, Michigan State University, United States
  • Kayla Johnson, Michigan State University, United States
  • Ingo Braasch, Michigan State University, United States
  • Arjun Krishnan, Michigan State University, United States


Presentation Overview:

Biomedical research organisms are irreplaceable systems for studying the mechanisms of human biology and disease in vivo, but it is still challenging to find the right organism and experimental setting for investigating a given human trait or disease. We address this challenge by leveraging millions of publicly available transcriptomic profiles of human and research organisms. We have developed a sparse regression-based approach to identify a minimum set of transcriptomes in a research organism that can recapitulate a query human transcriptome. The custom regression model (trained based on one-to-one orthologs) is highly interpretable and helps find genotypes, tissues, experimental conditions, and phenotypes that could bring about a global transcriptional response in the research organism that is equivalent to the human profile. Further, the regression model can be used to predict genes in the organism’s genome that have equivalent expression in the biological context captured by the human sample. Our approach is general and can be applied to map gene expression data between human and any research organism to discover equivalent samples and genes. We have demonstrated this application by prioritizing samples in zebrafish equivalent to human samples that are biologically meaningful.
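
The sparse-regression idea described above can be sketched as follows. This is a generic lasso fitted by coordinate descent, used here only as a stand-in for the authors' custom regression model, whose details are not given in the abstract. Each column represents one hypothetical research-organism transcriptome measured over shared ortholog genes, the target is a hypothetical human transcriptome, and non-zero weights mark the organism samples selected to reconstruct it. All numbers are made up.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam (the lasso proximal step)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def lasso_coordinate_descent(columns, target, lam, sweeps=100):
    """Minimise (1/2n)*||target - sum_j w_j*column_j||^2 + lam*||w||_1."""
    n = len(target)
    weights = [0.0] * len(columns)
    residual = list(target)  # target minus current fit
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Correlation of column j with the residual, ignoring its own fit
            rho = sum(c * (r + weights[j] * c)
                      for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, lam) / z
            shift = new_w - weights[j]
            if shift:
                residual = [r - shift * c for r, c in zip(residual, col)]
            weights[j] = new_w
    return weights


# Three hypothetical organism samples over six ortholog genes.
organism_samples = [
    [1.0, 0.0, 2.0, 0.0, 1.0, 3.0],
    [0.0, 1.0, 0.0, 2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
]
# Hypothetical human profile that is exactly twice the first sample.
human_profile = [2.0 * v for v in organism_samples[0]]
weights = lasso_coordinate_descent(organism_samples, human_profile, lam=0.1)
```

In this toy case only the first sample receives a non-zero weight, which is the kind of small, interpretable selection the abstract describes.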

W-059: SIMBA: SIngle-cell eMBedding Along with features
COSI: MLCSB
  • Huidong Chen, Harvard Medical School/MGH/Broad, United States
  • Jayoung Ryu, Harvard Medical School/MGH/Broad, United States
  • Michael Vinyard, Harvard Medical School/MGH/Broad, United States
  • Adam Lerer, Facebook, United States
  • Luca Pinello, Harvard Medical School/MGH/Broad, United States


Presentation Overview:

Recent advances in single-cell omics technologies enable the individual and joint profiling of cellular measurements including gene expression, epigenetic features, chromatin structure and DNA sequences. Currently, most single-cell analysis pipelines are cluster-centric, i.e., they first cluster cells into non-overlapping cellular states and then extract their defining genomic features. These approaches assume that discrete clusters correspond to biologically relevant subpopulations and do not explicitly model the interactions between different feature types. In addition, single-cell methods are generally designed for a particular task as distinct single-cell problems are formulated differently. To address these current shortcomings, we present SIMBA, a graph embedding method that jointly embeds single cells and their defining features, such as genes, chromatin accessible regions, and transcription factor binding sequences into a common latent space. By leveraging the co-embedding of cells and features, SIMBA allows for the study of cellular heterogeneity, clustering-free marker discovery, gene regulation inference, batch effect removal, and omics data integration. SIMBA has been extensively applied to scRNA-seq, scATAC-seq, and dual-omics data. We show that SIMBA provides a single framework that allows diverse single-cell analysis problems to be formulated in a unified way and thus simplifies the development of new analyses and integration of other single-cell modalities.

W-060: Protein sequence encoder for secondary structural representation using a pretrained model
COSI: MLCSB
  • Incheol Shin, Pusan National University, South Korea
  • Giltae Song, Pusan National University, South Korea


Presentation Overview:

Machine learning is widely used for drug discovery. Many drug discovery software tools are designed around a representation of protein structures. Although proteins fold into tertiary structures, they are often represented by their primary structure, i.e., as an amino acid sequence. This representation degrades the performance of drug discovery tools because machine learning models built on it lose some of the real structural information of the protein. The more structural information about proteins can be expressed, the better drug discovery tools can perform.
In this study, we propose a pretrained protein encoder using protein secondary structure. Our model feeds more protein structural information to machine learning models than existing encoders based on primary structural representation. The primary structural representation components in existing machine learning tools for drug discovery can be replaced by the output of our encoder. For pretraining, we use tasks such as masked language modeling and secondary structure prediction. We evaluate our encoder by applying it to secondary structure prediction, remote homology detection, fluorescence landscape prediction, and stability landscape prediction. We compare our encoder with TAPE (Tasks Assessing Protein Embeddings), which is one of the most popular protein encoders.

W-061: MAINE – a statistical and machine learning-based framework for feature selection and explanatory analysis of multi-omics datasets
COSI: MLCSB
  • Aleksandra Gruca, Silesian University of Technology, Poland
  • Joanna Henzel, Silesian University of Technology, Poland
  • Iwona Kostorz, Silesian University of Technology; Łukasiewicz Research Network – Institute of Innovative Technologies EMAG, Poland
  • Tomasz Stęclik, Silesian University of Technology; Łukasiewicz Research Network – Institute of Innovative Technologies EMAG, Poland
  • Łukasz Wróbel, Silesian University of Technology, Poland
  • Marek Sikora, Silesian University of Technology; Łukasiewicz Research Network – Institute of Innovative Technologies EMAG, Poland


Presentation Overview:

Multi-omics data are high-dimensional, yet only a small fraction of their features is informative. In medical sciences, in addition to a robust feature selection procedure, the ability to discover human-readable patterns in the analysed data is also desirable. To address this need we developed MAINE – a Multi-omic Analysis and Exploration framework for feature selection and explanatory analysis of multi-omics data.

MAINE provides an intuitive interface that allows applying different statistical and machine learning methods for feature selection and explanatory analysis of multi-omics datasets. The unique functionality of MAINE is the ability to explain multidimensional dependencies between selected multi-omics features and event outcome prediction or patient survival time. Our approach is based on the observation that the most interesting features are related to event outcome prediction (classification) or patient survival probability (survival analysis), therefore learned patterns are visualized in the form of interpretable decision/survival trees and rules.

We evaluate MAINE by analysing selected features (genes) for two exemplary TCGA datasets and we show that they are involved in processes related to tumor initiation, progression and metastasis. In many cases we also identify literature sources that link selected features to leukemia (TCGA-AML) or lung cancer (TCGA-LUAD).

W-062: Exploring protein sequence space through extrapolation of deep neural nets
COSI: MLCSB
  • Chase Freschlin, University of Wisconsin - Madison, United States
  • Sarah Fahlberg, University of Wisconsin - Madison, United States
  • Pete Heinzelman, University of Wisconsin - Madison, United States
  • Philip Romero, University of Wisconsin - Madison, United States


Presentation Overview:

Proteins are complex biomolecules whose function is encoded in their amino acid sequence. They are a critical route for innovation across diverse applications, but typically require extensive engineering to optimize for some target property. Protein sequence space is too large to fully explore with experimental methods. Instead, machine learning (ML) models can infer the sequence-function mapping from functional datasets, which is used to identify optimal proteins with diverse sequences through search heuristics. Although the general feasibility of ML-guided protein design is well-established, key questions remain concerning which types of ML models are best for protein design and how model uncertainty impacts design success as we extrapolate beyond the training set. To investigate these questions, we trained a series of ML models with different architectures on a comprehensive functional dataset for binding protein GB1. We challenge our models to design proteins for a range of mutational regimes, each representing increased distance from wildtype GB1 with up to 90% sequence variation from wildtype, and functionally characterize a subset of designs for each model and distance. Distinct patterns in mutation frequency and function arise, revealing clear performance outcomes and limitations of different model architectures for protein engineering experiments.

W-063: Predicting tissue and cell type specific DNA methylation using structured learning
COSI: MLCSB
  • Mirae Kim, Rice University, United States
  • Yufei Cui, Massachusetts Institute of Technology, United States
  • Vicky Yao, Rice University, United States


Presentation Overview:

DNA methylation (DNAm) is a gene regulatory mechanism that involves the addition of a methyl group at CpG sites. DNAm does not directly alter the DNA sequence, and it can dynamically regulate expression while also stably inactivating large portions of the genome for a lifetime. Because of this, individual cell types show distinct patterns of DNAm influenced by the environment and innate cell characteristics. Existing methods have used methylation patterns for disease subtype and local tissue classification, but none have performed a structurally informed multi-tissue classification with feature interpretability. Here, we aim to find the key CpG sites associated with tissue and cell types, with the end goal of sample classification and the disentanglement of stable vs. dynamic cell-specific CpG sites. We predict tissue- and cell-specific DNAm using a structured support vector machine (SSVM) and a tissue ontology. By using the anatomical relationships along with a large compendium of DNAm data, our model can predict a set of related labels for DNAm samples of unknown origin. We demonstrate promising performance on a set of 6,746 samples covering 36 tissue/cell types spanning 149 relationships and show that our results are interpretable for further analysis using feature weights and structure-aware evaluation.

W-064: Data-driven prediction of membranolytic anticancer peptides targeting lung cancer cells
COSI: MLCSB
  • Fatemeh Alimirzaei, Auburn University, United States
  • Chris Kieslich, Auburn University, United States


Presentation Overview:

Lung cancer remains the leading cause of cancer deaths worldwide, with more than 1.8 million deaths annually. Membranolytic anticancer peptides (ACPs) have been shown to be effective at targeting and killing cancer cells, and the development of efficient computational methods can aid in the identification of potential ACP candidates. We have developed support vector machine (SVM) models to predict membranolytic anticancer activity given a peptide sequence. Oscillations in physicochemical properties in protein sequences have been shown to be predictive of protein structure and function, and in this work, we take advantage of these known periodicities to predict whether a peptide has ACP activity given its amino acid sequence. Fourier transforms were applied to property factor vectors to measure the amplitude of physicochemical oscillations, which served as the features for our SVM models. Peptides targeting lung cancer cells were collected from the CancerPPD database and converted into physicochemical vectors using 10 property factors for the 20 natural amino acids. Cross-validation was applied to train and tune the model on multiple training and testing sets, with an accuracy of around 80% for the model predicting ACPs. In addition, approximately 150 features are the minimum required to maintain accuracy.
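
The feature construction described above, Fourier amplitudes of a per-residue property profile, can be illustrated with a small sketch. It maps a peptide to a numeric vector using the Kyte-Doolittle hydropathy scale as a stand-in for the 10 property factors mentioned in the abstract (which are not listed there), then computes discrete Fourier transform amplitudes. The peptide `LKLKLKLK` is a made-up example, and the question of how peptides of different lengths are aligned to a common feature length is left open here.

```python
import cmath
import math

# Kyte-Doolittle hydropathy values, used as an illustrative property scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_vector(peptide, scale=KYTE_DOOLITTLE):
    """Convert an amino acid sequence into a per-residue property profile."""
    return [scale[aa] for aa in peptide]


def dft_amplitudes(values):
    """Amplitudes |X_k| of the discrete Fourier transform for k = 0..n//2."""
    n = len(values)
    amplitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(values))
        amplitudes.append(abs(coeff))
    return amplitudes


# Alternating hydrophobic/charged residues give a period-2 oscillation.
features = dft_amplitudes(property_vector("LKLKLKLK"))
dominant_frequency = features.index(max(features))
```

For this alternating peptide the amplitude is concentrated at the highest frequency bin (period of two residues); amplitude vectors of this kind could then be passed to a classifier such as an SVM.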

W-065: When overfitting is a good thing: using supervised machine learning for multi-omics integration
COSI: MLCSB
  • Dhoha Abid, Washington University in Saint Louis, United States
  • Michael Brent, Washington University in Saint Louis, United States


Presentation Overview:

Supervised machine learning models are typically used for interpolation or extrapolation. They are trained on instances with known labels and then used to predict the labels of other instances whose true label is unknown. Here, we present a completely different way of using such models in order to integrate distinct types of input features into a single numeric score. The model is applied to the same set of instances on which it was trained, then each “prediction” is used as a score that integrates information from the training features and the labels (Fig. 1). Overfitting the model to the training data increases the influence of the labels on the final score – in the limit of an exact fit, the predictions would be identical to the labels. In practice, the aim is to have both the features and the labels influence predictions.
We illustrate this idea using a model for integrating gene expression data and transcription factor (TF) binding location data to predict a TF network map. An accurate TF network map consists of edges that connect TFs to their direct, functional targets – the genes the TFs regulate by binding in their regulatory DNA.
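
The scoring idea described in this abstract can be illustrated with a deliberately simple stand-in model. The sketch below uses k-nearest-neighbour averaging, chosen only because its degree of fit is easy to control; it is not the model used by the authors. The model is "trained" and applied on the same instances: with k = 1 every instance is its own nearest neighbour, so the scores reproduce the labels exactly, while larger k lets the features (through neighbourhood structure) pull the scores away from the labels. The feature values and labels are made up.

```python
def knn_scores(features, labels, k):
    """Score each training instance by averaging the labels of its k nearest
    training instances (the instance itself included)."""
    scores = []
    for xi in features:
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(xi, xj)), j)
            for j, xj in enumerate(features)
        )
        neighbours = [j for _, j in dists[:k]]
        scores.append(sum(labels[j] for j in neighbours) / k)
    return scores


# Hypothetical one-dimensional feature and binary labels.
toy_features = [[0.0], [0.1], [1.0], [1.1]]
toy_labels = [1, 0, 1, 1]

exact_fit = knn_scores(toy_features, toy_labels, k=1)  # labels dominate
blended = knn_scores(toy_features, toy_labels, k=2)    # features and labels
no_fit = knn_scores(toy_features, toy_labels, k=4)     # global mean only
```

Intermediate settings correspond to the regime the abstract aims for, where both the features and the labels influence the final score.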

W-066: An efficient not-only-linear correlation coefficient based on machine learning
COSI: MLCSB
  • Milton Pividori, Department of Genetics, University of Pennsylvania, United States
  • Marylyn Ritchie, Department of Genetics, University of Pennsylvania, United States
  • Diego Milone, Research Institute for Signals, Systems and Computational Intelligence, sinc(i), FICH-UNL, CONICET, Argentina
  • Casey Greene, Center for Health AI, University of Colorado School of Medicine, United States


Presentation Overview:

Correlation coefficients are widely used to identify relevant patterns in data. In transcriptomics, genes with correlated expression often share functions or are part of disease-relevant biological processes. Here we introduce the Clustermatch Correlation Coefficient (CCC), an efficient, easy-to-use and not-only-linear coefficient based on machine learning models. We show that CCC can capture biologically meaningful linear and non-linear patterns missed by standard, linear-only correlation coefficients. CCC efficiently captures general patterns in data by applying clustering algorithms and automatically adjusting the model complexity. When applied to human gene expression data, CCC identifies robust linear relationships while detecting non-linear patterns associated with sex differences that are not captured by the Pearson or Spearman correlation coefficients. Gene pairs highly ranked by CCC but not detected by linear-only coefficients showed high probabilities of interaction in tissue-specific networks built from diverse data sources including protein-protein interaction, transcription factor regulation, and chemical and genetic perturbations. CCC is much faster than state-of-the-art not-only-linear coefficients such as the Maximal Information Coefficient (MIC). CCC is a highly-efficient, next-generation not-only-linear correlation coefficient that can readily be applied to genome-scale data and other domains across different data types.
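
One way to read the clustering-based idea above is sketched below. This is a simplified illustration and not the authors' implementation: each variable is partitioned into k rank-based (quantile) clusters for several values of k, every pair of partitions is compared with the adjusted Rand index (ARI), and the largest agreement is taken as the coefficient. The function names, the tie handling, and the range of k are assumptions.

```python
from math import comb


def quantile_partition(values, k):
    """Assign each value to one of k equally sized rank-based clusters."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * k // n
    return labels


def adjusted_rand_index(a, b):
    """Chance-corrected agreement between two partitions of the same items."""
    n = len(a)
    cells, rows, cols = {}, {}, {}
    for la, lb in zip(a, b):
        cells[(la, lb)] = cells.get((la, lb), 0) + 1
        rows[la] = rows.get(la, 0) + 1
        cols[lb] = cols.get(lb, 0) + 1
    index = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 0.0
    return (index - expected) / (maximum - expected)


def clustering_correlation(x, y, k_max=10):
    """Maximum ARI over all pairs of quantile partitions of x and y."""
    ks = range(2, min(k_max, len(x) - 1) + 1)
    best = 0.0
    for kx in ks:
        px = quantile_partition(x, kx)
        for ky in ks:
            best = max(best, adjusted_rand_index(px, quantile_partition(y, ky)))
    return best


xs = list(range(-10, 11))
linear_score = clustering_correlation(xs, xs)
quadratic_score = clustering_correlation(xs, [v * v for v in xs])
```

For the symmetric quadratic example, Pearson's correlation between x and x² is zero, whereas this clustering-based score is clearly positive, which illustrates what "not-only-linear" means in practice.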

W-067: Determining the Optimal Embedding Technique for Mapping Gene Expression Samples into a Distributed Ontology Space
COSI: MLCSB
  • Filip Jevtic, Michigan State University, United States
  • Christopher Mancuso, Michigan State University, United States
  • Arjun Krishnan, Michigan State University, United States


Presentation Overview: Show

There are more than 1 million human gene expression samples that are publicly available. These samples together constitute an extremely valuable resource that any researcher can use to gain new biological insights about genes and cellular mechanisms. However, most of these samples lack systematic annotation of the exact tissue and cell type they originate from. Recently, Wang et al. developed a method, named OnClass, that uses a neural network to map gene expression samples into an ontology space to improve sample classification, including prediction of tissues/cell types not seen in the training data. In this work, we investigate a variety of embedding techniques to optimize and improve the mapping between expression space and ontology space. We performed a systematic evaluation of seven techniques with ten embedding dimension sizes and ten hidden layer size settings using 8,416 gene expression samples with known tissue labels from 52 different tissues. Our overall findings show that random-walk-based methods designed specifically for network-structured data outperform more conventional dimensionality reduction techniques like PCA and SVD for classifying the tissue type of an expression sample.

W-068: Interpretation of machine learning methods for the prediction of breast cancer risk
COSI: MLCSB
  • Adam Klie, University of California, San Diego, United States
  • James Talwar, University of California, San Diego, United States
  • Meghana Pagadala, University of California, San Diego, United States
  • Hannah Carter, University of California, San Diego, United States


Presentation Overview: Show

Recent studies that applied machine learning methods to complex trait risk prediction from single nucleotide polymorphism (SNP) array data have shown promise in improving risk stratification. However, current performance gains for nonlinear machine learning methods when compared to traditional polygenic risk scoring (PRS) and linear methods have been modest. Moreover, there remains substantial debate as to the effect of capturing epistatic interactions in risk modeling. Central to the debate has been the difficulty in interpreting the complex, nonlinear mapping learned by these models. To decipher the importance of capturing nonlinear interactions in modeling cancer risk, we first applied several PRS approaches to the prediction of breast cancer status. Consistent with previous studies, we noted greater predictive capability upon inclusion of more loci in our model but little to no performance gains when nonlinearity was captured. We applied several interpretation methods to derive a score for each SNP per method type, including several which, to our knowledge, have not been applied in this context. We noted varying degrees of concordance between the scores assigned by each method. Our work represents a comprehensive study of methods for inferring SNP level contributions to cancer risk.

W-069: Dirichlet allocation of mutations captures the action of DNA damage and misrepair processes
COSI: MLCSB
  • Cait Harrigan, University of Toronto, Canada
  • Kieran Campbell, Lunenfeld-Tanenbaum Research Institute, Canada
  • Quaid Morris, Sloan Kettering Institute, United States
  • Tyler Funnell, Sloan Kettering Institute, United States


Presentation Overview: Show

Cancer develops as a function of the accumulation of somatic mutations. Both DNA-damaging agents and deficiencies in DNA repair mechanisms leave behind historical traces in the distributions of somatic mutations, known as mutational signatures. We introduce DAMUTA, a Bayesian probabilistic model for de novo mutational signature extraction, based on a modified latent Dirichlet allocation model. We define signatures of damage and misrepair, and reframe COSMIC mutational signatures in terms of their damage and misrepair components. We fit DAMUTA to primary tumours from the Pan-cancer Analysis of Whole Genomes consortium, and post-treatment metastatic samples from the Hartwig Medical Foundation. Our findings indicate that the COSMIC reference set displays some redundancy with respect to misrepair signatures: for example, several mismatch repair signatures contain the same pattern. We also investigate the utility of a hierarchical tissue-specific prior on signature exposures, and assess this in the context of recent findings on tissue-specific variation in mutational signatures (Degasperi et al. 2020).

W-070: Deep multi-omic network fusion for network-level biomarker discovery of Alzheimer’s Disease
COSI: MLCSB
  • Linhui Xie, Department of Electrical and Computer Engineering, Indiana University Purdue University Indianapolis, United States
  • Yash Raj, Department of BioHealth Informatics, Indiana University Purdue University Indianapolis, United States
  • Pradeep Varathan, Department of BioHealth Informatics, Indiana University Purdue University Indianapolis, United States
  • Bing He, Department of BioHealth Informatics, Indiana University Purdue University Indianapolis, United States
  • Kwangsik Nho, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, United States
  • Paul Salama, Department of Electrical and Computer Engineering, Indiana University Purdue University Indianapolis, United States
  • Andrew J. Saykin, Department of Radiology and Imaging Sciences, Indiana University School of Medicine, United States
  • Jingwen Yan, Department of BioHealth Informatics, Indiana University Purdue University Indianapolis, United States


Presentation Overview: Show

Multi-omic data spanning from genotype and gene expression to protein expression have been increasingly explored, in an attempt to better interpret genetic findings from genome-wide association studies and to gain more insight into the disease mechanism. However, gene expression and protein expression are part of a dynamic process that changes in various ways as a cell ages. Expression data captured by existing technology are often noisy and capture only a snapshot of the dynamic process. The performance of models built on top of these expression data is undoubtedly compromised. To address this problem, we propose a new interpretable deep multi-omic network fusion model (MoFNet) for predictive modeling of Alzheimer’s disease. In particular, the information flow from DNA to protein is leveraged as a prior multi-omic network to enhance the signal in gene and protein expression data so as to achieve better prediction power. The proposed model MoFNet significantly outperformed all other state-of-the-art classifiers when evaluated using genotype, gene expression and protein expression data from the ROS/MAP cohort. Instead of individual markers, MoFNet yielded 3 major multi-omic subnetworks related to the innate immune system, clearance of unwanted cells or misfolded proteins, and neurotransmitter release, respectively.

W-071: SUPREME: A cancer subtype prediction methodology integrating multiomics data using Graph Convolutional Neural Network
COSI: MLCSB
  • Ziynet Nesibe Kesimoglu, University of North Texas, United States
  • Serdar Bozdag, University of North Texas, United States


Presentation Overview: Show

To pave the road towards precision medicine, patients with similar biology should be grouped into cancer subtypes. Utilizing high-dimensional multiomics data generated from cancer tissues, integrative computational approaches have been developed to uncover cancer subtypes. Recently, Graph Neural Networks have been introduced that utilize node features and associations simultaneously on graph-structured data. Addressing limitations that existing tools have in leveraging these architectures, we developed SUPREME, an integrative approach that comprehensively analyzes multiomics data and patient associations with multiplex graph convolutions. Unlike existing tools, SUPREME generates patient similarity networks and obtains embeddings from each using Graph Convolutional Network models, on which it utilizes all multiomics. Also, SUPREME integrates embeddings with raw features to capture local and global features simultaneously. SUPREME integrates all embedding combinations as separate tasks, thus being interpretable regarding utilized networks and features. On TCGA, METABRIC, and combined datasets, SUPREME significantly outperformed seven supervised methods, with consistent results. SUPREME-inferred subtypes consistently had significant survival differences, mostly more significant than survival differences between ground truth (PAM50) subtypes, and outperformed all nine methods. Our findings suggest that, by properly utilizing multiple datatypes and associations, SUPREME could demystify subtype characteristics that cause significant survival differences and could improve ground truth, which depends mainly on one datatype.

W-072: Unraveling the complex cell-cell interactions driving phenotype with Tensor-cell2cell
COSI: MLCSB
  • Erick Armingol, University of California, San Diego, United States
  • Hratch Baghdassarian, University of California, San Diego, United States
  • Cameron Martino, University of California, San Diego, United States
  • Araceli Perez-Lopez, Universidad Nacional Autónoma de México, Mexico
  • Caitlin Aamodt, University of California, San Diego, United States
  • Rob Knight, University of California, San Diego, United States
  • Nathan Lewis, University of California, San Diego, United States


Presentation Overview: Show

Cell interactions determine phenotypes, and intercellular communication is shaped by the contexts of the cells, including disease state, organismal life stage, and tissue microenvironment. Single-cell technologies measure the molecules mediating cell-cell communication, and emerging computational tools can exploit these data to decipher intercellular communication. However, current methods either disregard cellular context or rely on simple pairwise comparisons between samples, thus limiting the ability to decipher complex cell-cell communication across multiple time points, levels of disease severity, or spatial contexts. We have developed Tensor-cell2cell, an unsupervised method using tensor decomposition, which is the first strategy to decipher intercellular communication by simultaneously accounting for multiple stages, states, or locations of the cells. Using Tensor-cell2cell we were able to identify which cells are interacting and which ligand-receptor pairs were likely involved in varying severities of COVID-19 and Autism Spectrum Disorder. Thus, we present a powerful new approach to integrate large numbers of single cell RNA-seq data sets to understand the complex intercellular interactions underlying phenotypes.

W-073: Predicting Alzheimer’s disease progression using a Bidirectional Gated Recurrent Unit-based architecture
COSI: MLCSB
  • Mohammad Al Olaimat, University of North Texas, United States
  • Serdar Bozdag, University of North Texas, United States


Presentation Overview: Show

Alzheimer's disease (AD) is a neurodegenerative disease that affects millions of people worldwide. Mild cognitive impairment (MCI) is an intermediary stage between the cognitively normal (CN) state and AD. Not all people who have MCI convert to AD. The diagnosis of AD is made after significant symptoms of dementia such as short-term memory loss are already present. Since AD is currently an irreversible disease, diagnosis at the onset of disease brings a huge burden on patients, their caretakers, and the health sector. Thus, there is a crucial need to develop methods for the early prediction of AD. Recurrent Neural Networks (RNNs) have been successfully used to handle Electronic Health Records (EHRs) for predicting conversion from MCI to AD. In this study, we propose a predictive model based on a Bidirectional Gated Recurrent Unit (Bi-GRU) for early prediction of conversion from MCI to AD using longitudinal and demographic data. To minimize the effect of the irregular time intervals between visits, we propose using age at each visit as an indicator of time change between successive visits.

W-074: Cell Type Specific DNA Signatures of Transcription Factor Binding
COSI: MLCSB
  • Aseel Awdeh, University of Ottawa, Canada
  • Marcel Turcotte, University of Ottawa, Canada
  • Theodore Perkins, Ottawa Hospital Research Institute/ University of Ottawa, Canada


Presentation Overview: Show

Transcription factors (TFs) can bind to different parts of the genome in different types of cells. These differences may be due to alterations in the DNA-binding preferences of a TF itself, or to cofactors, or mechanisms such as chromatin accessibility, that result in a DNA "signature" of differential binding. We propose a method, based on deep learning, to detect and quantify cell type specificity in a TF's DNA-binding signature. We conduct a wide-scale investigation of 194 distinct TFs across various cell types. We demonstrate the existence of cell type specificity in ~30 of the TFs, and the lack thereof in other TFs. Finally, to further explain the biology behind a transcription factor's cell type specificity, or lack thereof, we conduct a wide-scale motif enrichment analysis of all transcription factors in question. We show that the presence of alternate motifs correlates with a higher degree of cell type specificity in TFs, such as ATF7, while the presence of consistent motifs throughout is usually associated with the absence of cell type specificity in a TF, such as CTCF. Our comprehensive investigation provides a basis for further study of the mechanisms behind differences in TF-DNA binding in different cell types.

W-075: AbLang: An antibody language model for completing antibody sequences
COSI: MLCSB
  • Tobias H. Olsen, Department of Statistics, University of Oxford, United Kingdom
  • Iain H. Moal, GSK Medicines Research Centre, GSK, United Kingdom
  • Charlotte M. Deane, Department of Statistics, University of Oxford, United Kingdom


Presentation Overview: Show

Motivation: General protein language models have been shown to summarise the semantics of protein sequences into representations that are useful for state-of-the-art predictive methods. However, for antibody specific problems, such as restoring residues lost due to sequencing errors, a model trained solely on antibodies may be more powerful. Antibodies are one of the few protein types where the volume of sequence data needed for such language models is available, for example in the Observed Antibody Space (OAS) database.

Results: Here, we introduce AbLang, a language model trained on the antibody sequences in the OAS database. We demonstrate the power of AbLang by using it to restore missing residues in antibody sequence data, a key issue with B-cell receptor repertoire sequencing; for example, over 40% of OAS sequences are missing the first 15 amino acids. AbLang restores the missing residues of antibody sequences better than using IMGT germlines or the general protein language model ESM-1b. Further, AbLang does not require knowledge of the germline of the antibody and is seven times faster than ESM-1b.

W-076: Experimental exploration of a ribozyme neutral network using evolutionary algorithm and deep learning.
COSI: MLCSB
  • Rachapun Rotrattandumrong, Okinawa Institute of Science and Technology, Japan
  • Yohei Yokobayashi, Okinawa Institute of Science and Technology, Japan


Presentation Overview: Show

The fitness landscape of a biomolecule is a map between genotype (sequence) and phenotype (e.g., catalytic activity or structure) that determines how evolution would proceed. Neutral networks connect all genotypes with equivalent phenotypes in a fitness landscape and enable evolution across large mutational distances without detrimental effects on fitness. Earlier theoretical work on RNA fitness landscapes revealed that many neighbouring RNA sequences are predicted to fold into the same secondary structure, forming an extensive neutral network. However, not a single experimental work has shown that such a network exists in real sequence-function maps. In this work, we used a hybrid approach that combines information from experimental evaluation of over 120,000 ribozyme sequences with a deep learning-guided evolutionary algorithm to explore the fitness landscape of an RNA ligase ribozyme. Our algorithm led to the discovery of the first empirical evidence of a large-scale neutral network. Furthermore, by experimentally mapping the entire combinatorial space connecting two active ribozymes separated by 16 mutations, we revealed that neutral paths may be predictable using only information from lower-order mutational interactions. Our work provides important empirical and computational evidence that neutral networks can increase the accessibility and predictability of fitness landscapes.

W-077: Machine Learning-Based Prediction with Metabolic Models of Bacterial Growth Requirements on Various Substrates
COSI: MLCSB
  • Zahmeeth Sakkaff, Argonne National Laboratory, United States
  • Christopher Henry, Argonne National Laboratory, United States
  • James Davis, Argonne National Laboratory, United States
  • Pamela Weisenhorn, Argonne National Laboratory, United States


Presentation Overview: Show

The use of growth phenotype data from diverse biological systems to discover and validate new protein functions continues to be a significant challenge. Current efforts to understand and validate new gene functions suffer from significant barriers. These barriers could be broken by adopting new computational methods such as machine learning (ML) combined with mechanistic insights from metabolic modeling. Our goal is to establish stepping stones in gene function discovery and validation by predicting the functions of genes that mechanistically explain observed and predicted growth phenotypes for microbial genomes. To build a data set for our efforts to understand, predict, and model, we have curated data from the literature and performed experimental studies to link 178 diverse microbial genome sequences with growth data for 64 carbon sources measured using the Biolog system. To compare ML model performance against performance from mechanistic approaches, we similarly constructed metabolic models based on RAST annotations in KBase. Overall, ML models outperformed metabolic models substantially but lacked the power of metabolic models to mechanistically explain phenotypes and to translate phenotype predictions into new protein annotations. However, by combining both approaches, we can use ML to improve models while using models to improve protein annotations across microbial genomes.

W-078: Using Pathway-based Genetic Interactions to Predict Disease Phenotypes with Machine Learning
COSI: MLCSB
  • Mathew Fischbach, University of Minnesota, Department of Computer Science and Engineering, Bioinformatics and Computational Biology, United States
  • Wen Wang, University of Minnesota, Department of Computer Science and Engineering, United States
  • Chad Myers, University of Minnesota, Department of Computer Science and Engineering, Bioinformatics and Computational Biology, United States


Presentation Overview: Show

Genome-wide association studies (GWAS) aim to find associations between genotypes and phenotypes to identify specific genetic loci that affect disease risk. GWAS has been successful in discovering many new risk loci for diseases, yet we are unable to fully explain the heritability of many diseases with the risk loci discovered from GWAS. This missing heritability could be explained by genetic interactions. Our lab has shown that genetic interactions underlying human diseases form structured gene networks within and between biological pathways, and we have developed a method called BridGE for the systematic discovery of genetic interactions. This general framework has successfully mapped complex genetic interactions from human genotype data for multiple diseases. However, it has not been applied to improve predictive models for quantifying individuals’ disease risk.

In this work, we created a pathway-based machine learning pipeline for predicting case-control phenotypes that leverages combinatorial (pairwise) genotype information. We tested our pipeline with two independent Parkinson's disease cohorts, and our preliminary results show that models that include variant-pairs as features can improve predictions over models based on only collections of single variants. These results demonstrate the utility of incorporating combinations of loci into disease risk prediction models such as polygenic risk scores.

W-079: Application of Autoencoders to High-Throughput Transcriptomics in Chemical Screening
COSI: MLCSB
  • Jacob Fredenburg, U.S. Environmental Protection Agency, United States
  • Joshua Harrill, U.S. Environmental Protection Agency, United States
  • Logan Everett, U.S. Environmental Protection Agency, United States


Presentation Overview: Show

Recent developments in transcriptomic technologies have made it possible to assay cellular responses to chemical exposures in a high throughput manner. Due to the high dimensionality of the resulting data, there could be biologically relevant signals that are not readily apparent using traditional methods. Linear methods, such as principal component analysis (PCA), may be insufficient to capture the complex structure of gene expression patterns. Autoencoders are a natural choice for non-linear dimension reduction while also learning biologically relevant structure from the data. Autoencoders, using neural networks, compress (encode) data into a lower dimensional space, then reconstruct (decode) the original input. Additionally, variational autoencoders encode the data into a probability distribution and typically lead to disentangled latent variables. We have applied both traditional and variational autoencoder frameworks to transcriptomic data from high-throughput screening of chemical bioactivity in two distinct cell lines. We compare and explore the resulting latent spaces of these two machine learning frameworks in an interpretable way to better inform chemical safety assessments. This abstract does not necessarily reflect US EPA policy.

W-080: Probabilistic mapping of single cells to spatial transcriptomics datasets for gene expression prediction, identification of spatial regions, and subspot resolution
COSI: MLCSB
  • Marcel Reinders, Delft University of Technology, Netherlands
  • Ahmed Mahfouz, Leiden University Medical Center, Netherlands
  • Mohammed Charrout, Delft University of Technology, Netherlands


Presentation Overview: Show

Single-cell technologies allow comprehensive gene expression measurement of individual cells. Spatial transcriptomics technologies lose the cellular resolution but provide spatial context to gene expression profiles, which is essential in understanding tissue biology. An estimate of the location of the single cells can be predicted by using the similarity between spot and cell profiles. However, the position of a cell can be ambiguous when its profile correlates with spots over multiple disjoint regions. It is therefore essential to get a probabilistic estimate of the position to quantify the uncertainty in the mapping.

To this end, we propose a probabilistic mapping of single cells to their spatial coordinates guided by a reference spatial transcriptomics dataset. We use a graph representation of the spatial data as the domain of a Gaussian process latent variable model to get a per-cell probability distribution over the nodes representing the measured spots. Expression of the genes is modeled as linear combinations of a smaller number of latent factors that can be used to identify global spatial expression patterns and distinct spatial regions. Furthermore, by using a more granular graph than the spot resolution, the model provides a probabilistic way of predicting expression at a subspot resolution.

W-081: A general deep learning-based architecture for nucleus instance segmentation on Histopathological images
COSI: MLCSB
  • Aik Choon Tan, Moffitt Cancer Center, United States
  • Ching-Nung Lin, Moffitt Cancer Center, United States


Presentation Overview: Show

Nucleus segmentation is one of the initial steps in digital pathology research. Recently, deep learning methods for nucleus instance segmentation on histopathological images have outperformed intensity-based methods. We propose an innovative deep learning architecture to achieve state-of-the-art performance on several public datasets for nucleus segmentation. Our proposed architecture is a modified Mask-RCNN based on ResNet-152. We added several steps to avoid any image scaling process by removing the pool layer and standardizing the sample ratio of the box and mask poolers to 1. We used PanNuke based on TCGA (7741 256x256 images) as the training set, and Triple Negative Breast Cancers (TNBC) of UHCW (50 512x512 images) and CoNSeP (41 1000x1000 images) of Curie Institute as two independent test sets. The test data were padded and patched using 256x256 tiles. The model was trained for 584 epochs using an Nvidia V100 GPU. The batch size was 4 and the data were randomly augmented by horizontal/vertical flipping. The method achieved Dice coefficients of 0.782 and 0.714 for the TNBC and CoNSeP test sets, respectively. The results were superior to the state-of-the-art HoVer-Net. We demonstrated that our proposed architecture achieves good performance and generalizes to various datasets.
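
For readers unfamiliar with the reported metric, the sketch below computes a plain Dice coefficient, 2|A ∩ B| / (|A| + |B|), between two binary masks given as nested lists. It is only an illustration under assumptions: the function name and the toy masks are hypothetical, and the abstract does not say which Dice variant (pixel-level or instance-level) was used in the evaluation.

```python
def dice_coefficient(pred, truth):
    """Dice overlap between two binary masks of the same shape (nested lists)."""
    flat_pred = [bool(v) for row in pred for v in row]
    flat_truth = [bool(v) for row in truth for v in row]
    intersection = sum(p and t for p, t in zip(flat_pred, flat_truth))
    total = sum(flat_pred) + sum(flat_truth)
    # Two empty masks are treated as a perfect match by convention here.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x3 masks: 3 foreground pixels each, 2 of them shared.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(dice_coefficient(predicted, reference))
```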

W-083: GETI: A New feature selection algorithm for survival prediction in genetic datasets
COSI: MLCSB
  • Rikhiya Ghosh, Icahn School of Medicine at Mount Sinai, United States
  • Kuan-Lin Huang, Icahn School of Medicine at Mount Sinai, United States


Presentation Overview: Show

Survival prediction for cancer patients is a hard problem that approximates the interaction of clinical parameters with several multi-omics factors. Several constraints negatively affect the predictive capacity of survival models, including scarcity and incompleteness of data and incompatibility of datasets, which make the algorithms less stable across datasets. This paper introduces GETI, a joint predictive modeling algorithm that explores the combined effect of genetic mutations and clinical factors on survival. The novelty of our approach lies in a new feature selection algorithm that uses the explainability of machine learning algorithms, augmented with ontological knowledge of pathways, to infuse existing knowledge and overcome the constraints of incomplete datasets. We have benchmarked our algorithm using breast cancer and renal cell carcinoma datasets against standard feature selection algorithms. In our experiments, we find that GETI performs significantly better (2-9% increase in AUC/accuracy) than the standard algorithms in accuracy of survival prediction and yields more stable features across different Monte Carlo simulations of the training dataset. In addition, we observed that the gap between training and validation dataset accuracy is the lowest for GETI, which indicates the ability to bridge the gap between incompatible datasets.

W-084: Connecting cell and tissue phenotype to protein expression using canonical correlation analysis
COSI: MLCSB
  • Vaishnavi Subramanian, University of Illinois at Urbana-Champaign, United States
  • Benjamin Chidester, Carnegie Mellon University, United States
  • Minh Do, University of Illinois at Urbana-Champaign, United States


Presentation Overview: Show

New imaging-based spatially-resolved ‘omics technologies are paving the way to a deeper understanding of tissue composition and disease. These technologies allow for the study of connections between high-dimensional gene or protein expression and various features of tissue and cell phenotype, including spatial context, subcellular expression patterns, and morphological features. To explore these connections on a large scale, principled computational tools are needed. Here, we propose the use of canonical correlation analysis (CCA), a robust, linear method well-suited for analyzing correlations of high-dimensional, multimodal data. As a proof of principle, we applied CCA to highly-multiplexed images measuring protein expressions that were acquired by multiplexed ion beam imaging (MIBI) of breast cancer tissue. A CCA analysis of all types of cells yielded statistically significant spatial relations between features describing the neighborhood of cells and their protein levels. The composition of cluster identities of the corresponding CCA embeddings in tissue could readily distinguish normal tissue from cancer tissue, with emphasis placed on myoepithelium regions. When applied to epithelial cells alone, CCA demonstrated the potential to discover sub-types by leveraging the expression and phenotype information jointly. Our findings highlight the opportunity to dive deeper into structure-function relations with fundamental methods on these new datasets.

W-085: BITES: Balanced Individual Treatment Effect for Survival data
COSI: MLCSB
  • Andreas Schäfer, Universität Regensburg, Germany
  • Stefan Solbrig, Universität Regensburg, Germany
  • Robert Lohmayer, Leibniz Institute for Immunotherapy Regensburg, Germany
  • Wolfram Gronwald, Universität Regensburg, Germany
  • Peter J. Oefner, Universität Regensburg, Germany
  • Tim Beißbarth, University Medical Center Göttingen, Germany
  • Rainer Spang, Universität Regensburg, Germany
  • Helena Zacharias, Universität Kiel, Germany
  • Michael Altenbuchinger, University Medical Center Göttingen, Germany
  • Stefan Schrod, University Medical Center Göttingen, Germany


Presentation Overview: Show

Estimating the effects of interventions on patient outcome is one of the key aspects of personalized medicine. Their inference is often challenged by the fact that the training data comprise only the outcome for the administered treatment, and not for alternative treatments (the so-called counterfactual outcomes). Several methods were suggested for this scenario based on observational data, i.e., data where the intervention was not applied randomly, for both continuous and binary outcome variables. However, patient outcome is often recorded in terms of time-to-event data, comprising right-censored event times if an event does not occur within the observation period. Despite their enormous importance, time-to-event data are rarely used for treatment optimization.
We suggest an approach named BITES (Balanced Individual Treatment Effect for Survival data), which combines a treatment-specific semi-parametric Cox loss with a treatment-balanced deep neural network; i.e., we regularize differences between treated and non-treated patients using Integral Probability Metrics (IPM). We show in simulation studies that this approach outperforms the state of the art. Further, we demonstrate in an application to a cohort of breast cancer patients that hormone treatment can be optimized based on six routine parameters. We successfully validated this finding in an independent cohort. We provide BITES as an easy-to-use Python implementation including scheduled hyper-parameter optimization.
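
To make the Cox loss mentioned above more concrete, here is a minimal sketch of the standard negative log partial likelihood for right-censored data, written without any tie correction. It is not the BITES implementation: the function name, the plain-list inputs, and the averaging over events are assumptions made for illustration, and the IPM balancing term is not shown.

```python
import math


def cox_neg_log_partial_likelihood(risk_scores, times, events):
    """Average negative log partial likelihood over observed events.

    risk_scores: model outputs (log relative hazards), one per patient
    times:       observed follow-up times
    events:      1 if the event was observed, 0 if right-censored
    """
    loss = 0.0
    n_events = 0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if not e_i:
            continue  # censored patients only contribute through risk sets
        # Risk set: everyone still under observation at time t_i.
        at_risk = [risk_scores[j] for j, t_j in enumerate(times) if t_j >= t_i]
        log_denominator = math.log(sum(math.exp(s) for s in at_risk))
        loss -= risk_scores[i] - log_denominator
        n_events += 1
    return loss / max(n_events, 1)


times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
concordant = [2.0, 1.0, 0.0]   # higher risk for earlier events
discordant = [0.0, 1.0, 2.0]   # higher risk for later events
print(cox_neg_log_partial_likelihood(concordant, times, events))
print(cox_neg_log_partial_likelihood(discordant, times, events))
```

Scores that rank earlier events as higher risk give a lower loss. A treatment-specific variant could, for example, evaluate this loss separately within each treatment arm, though the abstract does not spell out those details.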