Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Posters - Schedules

Poster presentations at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19 and no later than July 23. All registered conference participants will have access to the posters and presentations through the conference platform until October 31, 2021. Q&A opportunities are available through a chat function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.

Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:

View Posters By Category

Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC
A Bayesian Nonparametric Model for Inferring Subclonal Populations from Structured DNA Sequencing Data
  • Shai He, Department of Mathematics and Statistics, University of Massachusetts, Amherst, United States
  • Aaron Schein, Data Science Institute, Columbia University, United States
  • Vishal Sarsani, Department of Mathematics and Statistics, University of Massachusetts, Amherst, United States
  • Patrick Flaherty, Department of Mathematics and Statistics, University of Massachusetts, Amherst, United States

Short Abstract: There is often extensive genetic heterogeneity in a tumor as evidenced by single-cell and bulk DNA sequencing data. Understanding the genetic composition of a tumor is important in the personalized design of targeted therapeutic combinations and monitoring for possible recurrence after treatment. Thus, we propose a Bayesian nonparametric hierarchical Dirichlet process mixture model to jointly infer the underlying genotypes of tumor subpopulations and the distribution of those subpopulations in individual tumors by integrating single-cell and bulk sequencing data.

Inference with our model provides estimates of the subpopulation genotypes and the distribution over subpopulations in each sample. We represent the model as a Gamma-Poisson hierarchical model and derive a fast Gibbs sampling algorithm with analytical sampling steps based on this representation using the augment-and-marginalize method. This representation and inference algorithm can also be used with other hierarchical Dirichlet process prior models to derive a fast Gibbs sampler.

Experiments with simulated data show that our model outperforms standard numerical and statistical methods for decomposing admixed count data. Analyses of a real acute lymphoblastic leukemia sequencing dataset show that our model improves upon state-of-the-art bioinformatic methods. Interpreting the results of our model on this real dataset reveals co-mutated loci across samples.

A flexible, interpretable, and accurate approach for imputing the expression of unmeasured genes
  • Arjun Krishnan, Michigan State University, United States
  • Chris Mancuso, Michigan State University, United States
  • Jacob Canfield, Michigan State University, United States
  • Deepak Singla, NTU Singapore, Singapore

Short Abstract: While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96–570 and LINCS), and multiple imputation tasks (within and across microarray/RNA-seq datasets) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
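
The core idea of SampleLASSO, fitting a sparse regression for each target sample with training samples as the predictors and carrying the learned sample weights over to the unmeasured genes, can be illustrated with a minimal sketch. This is a toy coordinate-descent LASSO on assumed example data, not the authors' implementation; the function names (`lasso_cd`, `impute_sample`) are hypothetical.

```python
def soft_threshold(x, lam):
    """Soft-thresholding operator used in coordinate-descent LASSO."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_cd(A, y, lam=0.01, n_iter=100):
    """Minimize (1/2n)||y - Aw||^2 + lam*||w||_1 by coordinate descent.
    A: rows = measured genes of the target sample, cols = training samples."""
    n, p = len(A), len(A[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # residual with feature j's own contribution removed
            r = [y[i] - sum(A[i][k] * w[k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(A[i][j] * r[i] for i in range(n)) / n
            z = sum(A[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, lam) / z if z else 0.0
    return w

def impute_sample(train_measured, train_unmeasured, target_measured, lam=0.01):
    """Fit the target's measured genes on the training samples, then apply
    the learned sample weights to the training values of unmeasured genes."""
    w = lasso_cd(train_measured, target_measured, lam)
    return [sum(row[j] * w[j] for j in range(len(w)))
            for row in train_unmeasured]
```

On a toy matrix where the target matches one training sample exactly, the fit recovers a weight near 1 on that sample (slightly shrunk by the L1 penalty) and 0 elsewhere, so the imputed values track that sample's unmeasured genes.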

A machine learning approach to identify novel antifungal targets in Candida albicans
  • Xiang Zhang, University of Minnesota, United States
  • Ci Fu, University of Toronto, Canada
  • Amanda Veri, University of Toronto, Canada
  • Kali Iyer, University of Toronto, Canada
  • Emma Lash, University of Toronto, Canada
  • Alice Xue, University of Toronto, Canada
  • Nicole Revie, University of Toronto, Canada
  • Elizabeth Polvi, University of Toronto, Canada
  • Sean Liston, University of Toronto, Canada
  • Benjamin Vandersluis, University of Minnesota, United States
  • Charles Boone, University of Toronto, Canada
  • Teresa O'Meara, University of Michigan, United States
  • Matthew O'Meara, University of Michigan, United States
  • Nicole Robbins, University of Toronto, Canada
  • Leah Cowen, University of Toronto, Canada
  • Chad Myers, University of Minnesota, United States

Short Abstract: Candida albicans is an opportunistic fungal pathogen that can cause deadly infections in humans. Understanding which genes are essential for growth of this organism would provide opportunities for developing more effective therapeutics. Unlike in the model yeast Saccharomyces cerevisiae, construction of mutants is considerably more laborious in C. albicans. To prioritize efforts for mutant construction and identification of essential genes, we built a random forest-based machine learning model, using a previously constructed set of 2,327 C. albicans GRACE (gene replacement and conditional expression) strains as a basis for training. We identified several relevant features contributing unique information to the predictions. Through cross-validation analysis of our random forest model, we estimated an AUC of 0.92 and an average precision of 0.77. Given these strong results, we prioritized the construction of an additional set of >800 strains and discovered essential genes at a rate of ~64% amongst these new predictions, relative to an expected background rate of essentiality of ~20%. Our machine learning approach is an effective strategy for efficient discovery of essential genes, and a similar approach may also be useful in other species.

A novel information-theoretic approach to improving training efficiency of clinical diagnostic classifiers
  • Amitesh Pratap, Inflammatix, United Kingdom
  • Michael Mayhew, Inflammatix, United Kingdom

Short Abstract: Successful identification of an acute infection using patients’ gene-expression data with machine learning classifiers is made complicated by training on incomplete and noisy multi-cohort gene expression datasets. The classifier performance is adversely affected by the introduction of noisy data points during training; therefore, a quantitative assessment of information content of the candidate datasets can lead to more efficient learning. We adapt the Bayesian Optimal Experimental Design (BOED) framework to improve classifier performance and its generalisation by sequentially training on datasets that have the highest information content. An objective function is used to rank the datasets according to their predicted Expected Information Gain (EIG) value for a chosen Gaussian Process (GP) classifier. We find that the classifier training performed on datasets with high EIG values gives the largest increases in the classifier accuracy. An online learning algorithm is proposed where the gene-expression datasets are used sequentially for GP classifier training in order of their EIG values. Furthermore, the predicted EIG values for unsampled regions of gene-expression dataspace can be used to guide future clinical data acquisition for improving classifier accuracy.
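
The sequential selection loop, ranking candidate datasets by predicted informativeness and training on the best first, can be illustrated with a much simpler proxy than the BOED/EIG machinery: mean predictive entropy of the current classifier on each candidate dataset. This sketch is an assumption-laden stand-in (the actual method computes EIG under a Gaussian Process classifier), and all names are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def rank_datasets(candidate_predictions):
    """Rank candidate datasets by mean predictive entropy of the current
    classifier, highest (most informative under this proxy) first.
    candidate_predictions: {dataset_name: [class-probability vectors]}"""
    score = {name: sum(predictive_entropy(p) for p in preds) / len(preds)
             for name, preds in candidate_predictions.items()}
    return sorted(score, key=score.get, reverse=True)
```

A cohort on which the classifier is uncertain (near-uniform predictions) is ranked ahead of one it already classifies confidently, mirroring the intuition that high-EIG datasets give the largest accuracy gains.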

Adverse Event Prediction and Drug Repositioning through a Relational Graph Convolutional Neural Network Approach
  • Holger Fröhlich, Fraunhofer SCAI, Germany
  • Daniel Domingo-Fernández, Fraunhofer SCAI, Germany
  • Sophia Krix, Fraunhofer SCAI, Germany
  • Lauren Nicole DeLong, Fraunhofer SCAI, Germany
  • Marc Jacobs, Fraunhofer SCAI, Germany
  • Sheraz Gul, Fraunhofer ITMP, Germany
  • Andrea Zaliani, Fraunhofer ITMP, Germany

Short Abstract: Novel medications often fail during expensive and time-consuming late-stage clinical trials due to unforeseen adverse drug reactions (ADRs). Previous computational solutions to avoid these failures often have one of two aims: proactively predicting ADRs, or repurposing established medications which have already passed clinical trials.
We developed an approach using the framework of relational graph convolutional neural networks (R-GCNs). Our approach learns relationships between drugs, proteins and conditions in a heterogeneous multi-relational knowledge graph and allows us to perform two tasks: i) predicting ADRs and ii) drug repositioning. A specific aspect of our method is the integration of multi-modal data (chemical structure, protein sequence, gene expression, microscopy images, representations of electronic health records) to capture relevant biomedical metadata. We compared our method to several state-of-the-art representation learning approaches for both ADR prediction and drug repositioning.
Overall, our method can predict novel indications for drug repurposing candidates as well as potential ADRs.

An Active Learning Workflow for 3D Morphological Analysis of Bioimages
  • Ziji Zhang, Stony Brook University, United States
  • Qinyun Zhao, Stony Brook University, United States
  • Hong Wang, Stony Brook University, United States
  • Rebecca Adikes, Stony Brook University, United States
  • Peng Zhang, Stony Brook University, United States
  • Benjamin Martin, Stony Brook University, United States
  • David Matus, Stony Brook University, United States
  • Yuefan Deng, Stony Brook University, United States

Short Abstract: Manual segmentation and quantification of complex 3D fluorescence images is a limiting factor in data analysis for many biologists. Here, we develop a morphological analysis workflow to automatically reconstruct 3D cell models with accurate geometrical quantifications and measurements by combining a semi-unsupervised segmentation system and a Delaunay triangulation mesh generator. We applied our workflow to C. elegans sex myoblasts, an in vivo model of a cell that migrates during larval development and differentiates into the adult uterine and vulval muscles. Thus, the conventional raw confocal images, which are sparse and noisy, are made useful for accurately understanding the migration and differentiation of mesodermal precursor cells. We also improve upon existing supervised machine learning approaches that heavily rely on massive, hard-to-get manual annotations. Our active-learning-based segmentation system for single z-stack images combines multiple agent networks and fuses their pseudo-label masks, which are then stacked and extracted to generate 3D cell surfaces. The workflow, applied to high-resolution time-lapse confocal images, allowed for the quantification of numerous filopodia and cell protrusions. Finally, to test our workflow, we analyzed additional live imaging data from the C. elegans anchor cell (AC) and from a different model system, the developing zebrafish tailbud.

An Explainable Multi-Modal Neural Network Architecture for Predicting Epilepsy Comorbidities Based on Administrative Claims Data
  • Thomas Linden, University of Bonn, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Holger Fröhlich, Fraunhofer Institute for Algorithms and Scientific Computing (FHG), Germany
  • Johann de Jong, UCB Pharma S. A., Germany
  • Chao Lu, UCB Pharma S. A., United States
  • Kathrin Haeffs, UCB Pharma S. A., Germany
  • Victor Kiri, UCB Pharma S. A., United States

Short Abstract: Epilepsy is a complex brain disorder characterized by repetitive seizure events. Epilepsy patients often suffer from various and severe physical and psychological comorbidities (e.g. anxiety, migraine, stroke). While general comorbidity prevalences and incidences can be estimated from epidemiological data, such an approach does not take into account that actual patient-specific risks can depend on various individual factors, including medication. This motivates the development of a machine learning approach for predicting the risk of future comorbidities for the individual epilepsy patient.

In this work we use inpatient and outpatient administrative health claims data of around 19,500 US epilepsy patients. We propose a dedicated multi-modal neural network architecture, DeepLORI, to predict the time-dependent risk of six common comorbidities of epilepsy patients. We demonstrate superior performance of DeepLORI in a comparison with several existing methods. Moreover, we show that DeepLORI-based predictions can be interpreted on the level of individual patients. Using a game-theoretic approach, we identify relevant features in DeepLORI models and demonstrate that model predictions are explainable in the light of existing knowledge about the disease. Finally, we validate the model on independent data from around 97,000 patients, showing good generalization and stable prediction performance over time.

Anti-cancer drug response prediction on locally connected gene expression manifolds leveraging kernelized feature selection
  • Tamas Madl, Amazon Web Services, Netherlands
  • Matthew Howard, Amazon Web Services, United Kingdom
  • Alessandro Riccombeni, Amazon Web Services, United Kingdom

Short Abstract: Selecting the most effective drug for individual patients is an important task in precision medicine and one that machine learning-based approaches may be able to expedite. However, drug response prediction using machine learning presents substantial challenges, such as the ‘curse’ of dimensionality and lack of well-defined distance metrics defining similar patients or cell lines.
We have constructed a recommender system that generates drug response predictions as weighted averages of the response values of similar cell lines. We tackle high dimensionality by leveraging HSIC Lasso (Climente-González et al., 2019) to identify maximally relevant and minimally redundant genes. To represent similarity, we leverage proximity on an approximated Riemannian manifold, and compare this approach with simpler similarity metrics such as correlation. We evaluate the proposed architecture on the Genomics of Drug Sensitivity in Cancer database.
The resulting model outperforms recommender algorithms based on traditional similarity metrics (Suphavilai et al., 2020) by a significant margin (accuracy/drug: 81% vs. 77%; correlation with predicted drug response: 0.76 vs. 0.65). This substantiates the effectiveness of similarity metrics on locally connected manifold approximations derived from dimensionality-reduced gene expression data, and takes a step towards making recommender-based drug response prediction models accurate enough to be relevant for clinical practice and patient benefit.
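
The recommender's core step, predicting a drug response as a similarity-weighted average over the most similar cell lines, can be sketched as below. Pearson correlation stands in here for the manifold-based similarity of the full method; the data and function names are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation between two expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def predict_response(target_expr, train_exprs, train_responses, k=2):
    """Similarity-weighted average of drug responses over the k most
    similar training cell lines (non-positive similarities get zero weight)."""
    sims = sorted(((pearson(target_expr, e), r)
                   for e, r in zip(train_exprs, train_responses)),
                  reverse=True)[:k]
    wsum = sum(max(s, 0.0) for s, _ in sims)
    if wsum == 0.0:  # no similar neighbours: fall back to the mean response
        return sum(train_responses) / len(train_responses)
    return sum(max(s, 0.0) * r for s, r in sims) / wsum
```

Swapping `pearson` for a distance on a learned manifold embedding, as the abstract describes, changes only the similarity function, not the weighting scheme.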

Application of Deep Learning in Target Identification Through Determining Mechanism of Action given Cellular Signature Data
  • Omar Abul-Hassan, Ocean Lakes High School, United States

Short Abstract: Current techniques for target identification in drug development are costly, lengthy, and carry a high degree of uncertainty about whether a drug will succeed in modulating a target. To address this, we explored the ability of computational approaches to leverage cellular signature data for in-silico target identification. The NIH LINCS program compiled and publicized data consisting of cell viability measurements and high-throughput gene-expression drug and target screens (measured by the L1000 assay) for ~5,000 small molecules and their respective mechanisms of action (MoA). This study uses an ensemble of transfer learning and a novel neural network to map this cellular signature data to the mechanism of action of a compound. Validating on a held-out subset of our training data, our approach achieved an AUC (area under the receiver operating characteristic curve) of 91.51%. Our approach was also shown to significantly outperform other machine learning approaches (p<0.05) and demonstrates advantages over previously published in-silico techniques. Overall, the algorithm provides a reliable framework for expedited drug target identification.

Assessment of a Novel Feature Selection Algorithm for Virus-Host Protein-Protein Interaction.
  • Ahmad Hassan Ibrahim, Mugla Sitki Kocman University, Turkey
  • Erdem Turk, Mugla Sitki Kocman University, Turkey
  • Betul Asiye Karpuzcu, Mugla Sitki Kocman University, Turkey
  • Selahattin Aksoy, Mugla Sitki Kocman University, Turkey
  • Onur Can Karabulut, Mugla Sitki Kocman University, Turkey
  • Baris Suzek, Mugla Sitki Kocman University, Turkey

Short Abstract: The development of machine learning-based virus-host protein-protein interaction (VHPPI) predictors is an active research area.

Here, we assessed the impact of a feature selection algorithm on the performance of machine learning-based VHPPI prediction. Typically, these VHPPI predictors use tripeptide frequencies of host and virus proteins computed on reduced amino acid alphabets. In this study, we used a 7-letter amino acid alphabet and created a total of 686 (7^3 + 7^3; host + virus) features using tripeptide frequencies normalized by unit normalization. Assuming VHPPIs are mediated by tripeptides from virus and host proteins sharing similar distributions, we computed the Pearson Correlation Coefficient (PCC) between each host and virus feature (343 × 343 pairs) and filtered down the pairs using various thresholds. The selected features were used to train Random Forest Classifiers (RFCs).

We conducted experiments using three training sets (each 18,120 positive/181,200 negative) and three independent test sets (each 4,533 positive/45,330 negative) provided by HVPPI. For PCC thresholds of 0.08-0.13, the number of features ranged from 203/186 (host/virus) down to 24/20. Although feature selection achieved a ~93.5% reduction in features, the RFC's sensitivity and specificity on the independent test sets remained the same: 0.84 ± 0.02 and 0.83 ± 0.02, respectively. Our models outperformed the alternative tools HVPPI, Hopitor, and DeNovo on all the independent test sets regardless of the choice of PCC threshold.
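
The correlation-based filter described above can be sketched as follows. A tiny feature dictionary and an arbitrary threshold stand in for the full 343 × 343 tripeptide-frequency comparison, and the use of absolute PCC (rather than signed) is an assumption of this sketch.

```python
def pcc(x, y):
    """Pearson correlation coefficient between two value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def select_feature_pairs(host_feats, virus_feats, threshold):
    """Keep (host, virus) tripeptide-feature pairs whose frequency profiles
    across interacting protein pairs correlate above the threshold."""
    return [(h, v)
            for h, hx in host_feats.items()
            for v, vx in virus_feats.items()
            if abs(pcc(hx, vx)) >= threshold]
```

Only the surviving feature pairs would then be fed to the downstream classifier, which is how a large reduction in features can leave performance unchanged if the discarded pairs carried little signal.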

BABEL enables cross-modality translation between multiomic profiles at single-cell resolution
  • Kevin Wu, Stanford University, United States
  • Kathryn Yost, Stanford University, United States
  • Howard Chang, Stanford University, United States
  • James Zou, Stanford University, United States

Short Abstract: Simultaneous profiling of multi-omic modalities within single cells is a grand challenge for biology. While there have been impressive technical innovations demonstrating feasibility of co-measurement technologies, widespread application of joint profiling methods is challenging due to experimental complexity, noise, and cost. Here, we introduce BABEL, a deep learning method that enables a multi-omic view of single cells given only a single measured modality. Leveraging a series of interoperable neural networks, BABEL can predict single-cell expression from single-cell chromatin accessibility, and vice versa, after training on relevant data. BABEL is robust across varying noise profiles, and across diverse biological contexts not seen during training. For example, BABEL can generate single-cell expression profiles from patient-derived basal cell carcinoma (BCC) chromatin accessibility data, enabling insights into transcriptomic states despite never being trained on BCC data. BABEL’s predictions are even comparable to analyses of empirical BCC scRNA-seq data. We further show that BABEL can be easily extended to incorporate additional single-cell data modalities such as protein expression, thus enabling unified translation across chromatin, RNA, and protein states. BABEL offers a powerful approach for multi-omic data exploration and hypothesis generation.

Benchmarking atlas-level data integration in single-cell genomics
  • Fabian Theis, Helmholtz Center Munich, Germany
  • Malte Luecken, Helmholtz Center Munich, Germany
  • Maren Büttner, Helmholtz Center Munich, Germany
  • Kridsadakorn Chaichoompu, Helmholtz Center Munich, Germany
  • Anna Danese, Helmholtz Center Munich, Germany
  • Marta Interlandi, University of Münster, Germany
  • Michaela Mueller, Helmholtz Center Munich, Germany
  • Daniel Strobl, Helmholtz Center Munich, Germany
  • Luke Zappia, Helmholtz Center Munich, Germany
  • Martin Dugas, University of Münster, Germany
  • Maria Colomé-Tatché, Helmholtz Center Munich, Germany

Short Abstract: Cell atlases often include samples that span locations, labs, and conditions leading to complex, nested batch effects in data. Thus, joint analysis of atlas datasets requires reliable data integration. Choosing a data integration method is a challenge due to the difficulty of defining integration success. Here, we benchmark 68 integration approaches on 85 batches of gene expression, chromatin accessibility, and simulation data from 23 publications, representing >1.2 million cells distributed in 13 atlas-level integration tasks. Our integration tasks span several common sources of variation such as individuals, species, and experimental labs. We evaluate methods according to scalability, usability, and their ability to remove batch effects while retaining biological variation.
Using 14 evaluation metrics, we find that highly variable gene selection improves data integration performance, whereas scaling pushes methods to prioritize batch removal over conservation of biological variation. Overall, Scanorama, scANVI, and scGen perform well, particularly on complex integration tasks; Seurat v3 and Harmony perform well on simpler tasks with distinct biological signals; and scATAC-seq integration performance is strongly affected by choice of feature space. Our reproducible Python module and benchmarking pipeline can be used to identify optimal data integration methods for new data, benchmark new methods, and guide method development.

Celloscope: a probabilistic model for cell type deconvolution in spatial transcriptomics data
  • Agnieszka Geras, Warsaw University of Technology, University of Warsaw, Poland
  • Shadi Darvish Shafighi, University of Warsaw, Poland
  • Kacper Domżał, University of Warsaw, Poland
  • Igor Filipiuk, University of Warsaw, Poland
  • Łukasz Rączkowski, University of Warsaw, Poland
  • Hosein Toosi, Royal Institute of Technology, Sweden
  • Leszek Kaczmarek, Nencki Institute of Experimental Biology of the Polish Academy of Sciences, Poland
  • Łukasz Koperski, Medical University of Warsaw, Poland
  • Jens Lagergren, Royal Institute of Technology, Sweden
  • Dominika Nowis, Medical University of Warsaw, Poland
  • Ewa Szczurek, University of Warsaw, Poland

Short Abstract: Spatial transcriptomics (ST) makes it possible not only to measure the level of gene activity, but also to map this activity spatially. This is because, unlike single-cell RNA-seq experiments, ST retains information on cells' positions within the tissue. However, ST spots contain multiple cells, so the observed signal inevitably reflects mixtures of cells of different types. To deconvolute these mixtures and infer the spatial cell-type composition, various methods combining the two complementary technologies, ST and single-cell RNA-seq, have been proposed. Unfortunately, these methods require both types of data and may be prone to bias due to platform-specific effects, such as sequencing depth. To address these issues, we present an approach that does not require single-cell data but instead relies on additional prior knowledge of marker genes. Our novel probabilistic model for cell type deconvolution in ST data, called Celloscope, was applied to mouse brain data and successfully indicated brain structures and spatially distinguished between the two main neuron types: inhibitory and excitatory.

Computational characterization of thermostability in fungi and predicting thermostability using machine learning approaches
  • Sankar Mahesh R, SASTRA Deemed to be University, India
  • Ragothaman Yennamalli, SASTRA Deemed to be University, India

Short Abstract: Fungi are generally mesophilic in nature; however, thermophilic/thermostable fungi also exist. Various molecular factors have been proposed to cause thermostability, such as salt bridges and side-chain–side-chain hydrogen bonds, but these factors cannot be generalized for all fungi. Identifying the factors imparting thermostability can reveal how fungal thermophilic proteins gain it. We curated a dataset of 14 thermophilic fungi and their evolutionarily closer mesophiles. Initially, the proteome data of Rhizopus microsporus and its evolutionarily related mesophile Mucor circinelloides were analyzed. Using eggNOG, we classified the proteomes into COGs. Excluding COGs R and S, we extracted sequence features using Protr. Currently, we find that in COGs A, B, K, T, and U the amino acids Lys, Arg, Asp, Glu, Phe, Tyr, and Trp are more highly represented in the thermophile than in the mesophile. Analyzing the features with an ensemble feature selection tool, we selected feature sets scoring above a threshold of 0.6 (on a scale of 0.0 to 1.0). These include amino acid composition, pseudo amino acid composition, quasi-sequence-order descriptors, conjoint triad, dipeptide composition, and Geary, Moran, and normalized Moreau-Broto autocorrelation. Both supervised and unsupervised machine learning approaches are being trained to derive a model that differentiates a fungal sequence as thermophilic or mesophilic.
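
Two of the simplest feature families mentioned above, amino acid composition and dipeptide composition, can be computed as in this minimal sketch. Protr provides these (and the more elaborate descriptors) in R; the Python function names here are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq):
    """Fraction of each of the 20 standard amino acids in a sequence."""
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}

def dipeptide_composition(seq):
    """Fraction of each adjacent residue pair among the n-1 dipeptides,
    over all 400 possible pairs."""
    n = len(seq) - 1
    comp = {a + b: 0.0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    for i in range(n):
        comp[seq[i:i + 2]] += 1.0 / n
    return comp
```

Feature vectors like these, computed per protein and aggregated per COG, are the kind of input the ensemble feature selection step would score against the 0.6 threshold.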

Concatenated Convolutional Neural Network and Transformer-Encoder to Diagnose Cancerous Thyroid Nodules in Ultrasound Cine Images
  • Tara Kapoor, Palo Alto High School, United States
  • Rikiya Yamashita, Stanford University, United States

Short Abstract: Incidental thyroid nodule detection has increased notably in recent years. Biopsy is invasive and expensive, and risk stratification of nodule ultrasounds is done manually by radiologists through the Thyroid Imaging, Reporting & Data System (TI-RADS), motivating the need for effective computerized diagnosis solutions.

I hypothesized that deep-learning could achieve superior classification performance in sensitivity, specificity, and AUROC compared to TI-RADS using ultrasound cine-clip images.

I developed a novel deep learning-based algorithm, trained on ultrasound cine-clip images with prospectively-collected biopsy data as ground truths. The algorithm consists of a MobileNet-v2 convolutional-neural-network (CNN) for feature extraction with a bi-layer Transformer-Encoder network (each with self-attention and feedforward sub-layers).

I addressed two main technical challenges: extreme class-imbalance (benign/malignant patient ratio=175/17) and processing patient-wide cine-clip images’ sequentiality. For the former, I applied weighted oversampling and back-propagated focal loss (alpha=0.9, gamma=2.4). For the latter, I stacked CNN inputs (adjacent or equally-spaced frames), then channeled extracted feature vectors into the Transformer-Encoder for patient-wide attention-mechanism processing.
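
The focal loss used against the class imbalance, with the alpha and gamma values quoted above, down-weights easy, well-classified examples so the rare malignant class dominates training. A per-example sketch for binary labels (a hedged illustration of the standard formula, not the author's exact training code):

```python
import math

def focal_loss(p, y, alpha=0.9, gamma=2.4):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class y."""
    pt = p if y == 1 else 1.0 - p
    at = alpha if y == 1 else 1.0 - alpha
    return -at * (1.0 - pt) ** gamma * math.log(max(pt, 1e-12))
```

With gamma=0 and alpha=1 this reduces to ordinary cross-entropy; larger gamma suppresses the loss contribution of confident correct predictions, and alpha near 1 up-weights the positive (malignant) class.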

My deep-learning techniques, the adjacent-frame (AUROC=0.867) and equally-spaced-frame (AUROC=0.858) models, outperformed TI-RADS scoring (AUROC=0.798). The CNN+Transformer model can classify cine-clips as a pre-screening tool, or be used jointly with radiologists for improved prediction.

Correcting gradient-based attribution methods for neural networks in genomics
  • Peter Koo, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States
  • Antonio Majdandzic, Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, United States

Short Abstract: Deep neural networks (DNNs) have been applied successfully to many regulatory genomics tasks, such as predicting the binding strength between proteins and DNA. To explain DNN predictions, post hoc attribution methods are employed to provide importance scores for each nucleotide in a given sequence, often revealing motif-like representations that are important for model predictions. Among attribution methods, we find that those based on gradients are often employed in a naïve way, not respecting the categorical simplex constraints of the 1-hot encoded sequence, which can lead to a significant source of error for the importance of nucleotides. This can affect the efficacy of downstream analyses that use attribution scores, such as scoring the effects of disease-associated variants. Here we derive a simple correction for gradient-based attribution scores and demonstrate its effectiveness using synthetic and in vivo sequences across different networks and attribution methods. We find that our correction consistently leads to a small, but significant improvement in attribution scores for motif positions and reduces spurious attributions at other positions. While not intended to improve the classification performance, this correction provides a consistent improvement in interpretability, leading toward better transparency of a network’s decision-making process, providing clearer insights into the underlying biology.
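
One correction consistent with the simplex constraint described above is to remove the component of the gradient that points off the probability simplex, i.e. subtract the per-position mean across the four nucleotide channels so each position's attribution scores sum to zero. The sketch below illustrates that idea; it is not necessarily the authors' exact derivation, and the function name is hypothetical.

```python
def correct_attribution(grads):
    """grads: one list of 4 channel gradients (A, C, G, T) per position.
    Returns gradients with the per-position mean across channels removed,
    so each corrected position sums to zero (lies within the simplex)."""
    corrected = []
    for g in grads:
        mu = sum(g) / len(g)
        corrected.append([x - mu for x in g])
    return corrected
```

Applied to a saliency map, this leaves relative nucleotide importances at each position intact while discarding the shared offset that the 1-hot encoding can never realize.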

Deep Generative Model for Flow Cytometry Data Integration
  • Mike Phuycharoen, The University of Manchester, United Kingdom
  • Verena Kaestele, The University of Manchester, United Kingdom
  • Thomas Williams, The University of Manchester, United Kingdom
  • Lijing Lin, The University of Manchester, United Kingdom
  • Tracy Hussell, The University of Manchester, United Kingdom
  • John Grainger, The University of Manchester, United Kingdom
  • Magnus Rattray, The University of Manchester, United Kingdom

Short Abstract: The COVID-19 pandemic accelerated the need for large-scale immune profiling and analysis of collected patient data. Flow cytometry experiments can be used to analyze the composition of blood samples, but often result in heterogeneous datasets due to the changing availability of markers and other technical batch effects. To enable scalable flow data integration, we created a variational neural network model for simultaneous generative embedding, bias correction, and semi-supervised cell-type prediction. The model explicitly merges subspaces defined by markers shared across heterogeneous panels. Latent batch correction is performed before any downstream tasks in order to avoid integrating panel-specific batch effects. To provide consistent control sets for normalization, we resample and balance the data distribution using partially trained classifiers as training progresses. Finally, we perform latent space clustering and visualize the combined space in 2D. Our model allows for sequential and parallel training, and scales well to millions of cells. Model selection is automated using a Bayesian optimizer, which determines global as well as task-specific hyper-parameters such as subnetwork size and individual training frequencies. We apply the model to identify cell populations enriched in COVID-19 cases from several hospitals in our region and to estimate correlations with disease trajectory.

Deep learning-based detection and analysis of chest X-rays to prognosticate the type of respiratory tract infection
  • Saksham Saxena, NSIT DELHI UNIV NEW DELHI, India
  • Namrata Swain, NSIT DELHI UNIV NEW DELHI, India
  • Akanksha Kulshreshtha, Netaji Subhas University of Technology, New Delhi, India

Short Abstract: Machine Learning (ML) based methods have shown unparalleled success and are a powerful approach for accurate analysis of medical images; the classification of images with highly similar features is a typical application of machine learning-based image analysis. This paper describes a learning strategy based on convolutional neural networks to identify X-ray images indicating various respiratory diseases such as Pneumonia, TB, Pneumothorax, and COVID-19. To optimise data augmentation, parameters such as rescaling, horizontal and vertical flip, and zoom augmentation are varied and transformed. The pooling layers, dense layers, learning rate, and epochs are carefully adjusted when building the CNN model in order to achieve the model's optimum efficiency.
17,272 chest X-ray scan samples were obtained from the Kaggle datasets, the National Institutes of Health dataset, and the Stanford ML Group (CheXpert database) to assess the model's results, with 15,257 used for training and 2,015 for validation.
The proposed CNN model executes efficiently and evinces the following performance parameters:
(specificity_at_sensitivity:0.9996, sensitivity_at_specificity:0.9983, accuracy: 0.9269, val_specificity_at_sensitivity: 0.9937, val_sensitivity_at_specificity: 0.9727 etc.)
These findings show that deep learning neural networks can be used to detect COVID-19, Pneumonia, and other respiratory tract infections.
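The augmentation parameters mentioned above (rescaling, flips, zoom) can be sketched in a framework-agnostic way; the function name and values here are illustrative, not the authors' configuration:

```python
import numpy as np

def augment(img, rng, zoom_max=0.2):
    """Toy augmentation: random horizontal/vertical flips and a central
    zoom-crop, standing in for rescale/flip/zoom transformations."""
    if rng.random() < 0.5:
        img = img[:, ::-1]          # horizontal flip
    if rng.random() < 0.5:
        img = img[::-1, :]          # vertical flip
    z = rng.uniform(0, zoom_max)    # crop fraction for the "zoom"
    h, w = img.shape
    dh, dw = int(h * z / 2), int(w * z / 2)
    img = img[dh:h - dh, dw:w - dw]
    return img / 255.0              # rescale intensities to [0, 1]

rng = np.random.default_rng(0)
xray = np.arange(64 * 64, dtype=float).reshape(64, 64) % 256  # fake image
out = augment(xray, rng)
```
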

Deep Learning Framework for Predicting Phenotype from TCR-β Immune Repertoire
  • Ahmed Metwally, Illumina AI Lab, Illumina Inc., United States
  • Kyle Farh, Illumina AI Lab, Illumina Inc., United States

Short Abstract: The architecture of the T-cell receptor (TCR) repertoire largely contributes to the performance of the adaptive immune response against viral or bacterial infection. Each human has about 10^7–10^8 unique TCRs. The diversity of the TCR repertoire is primarily affected by age, HLA genetic variabilities, and prior exposure to viral or bacterial infections. With the advent of immune sequencing, whether bulk or single-cell RNAseq, the TCR repertoire can be characterized and used in predicting disease prognosis. Immune repertoire classification can be seen as a multiple instance learning problem with extremely low witness rates (~0.01%), and the overlap of immune repertoires of different individuals is low. These properties hinder the development of end-to-end deep learning frameworks for classifying individuals' phenotypes based on their TCR repertoire. This work presents our new framework that combines statistical and deep learning components to predict individuals' phenotypes based on their TCR-β immune repertoire. We applied the proposed framework to a dataset of 641 TCR-β immune repertoires (289 CMV+ and 352 CMV-). Our performance evaluation shows robust performance in classifying CMV+ subjects (auROC = 0.95). The developed framework can be applied to other diseases such as infectious diseases, cancers, and autoimmune diseases.
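The low-witness-rate multiple instance learning setting can be illustrated with a toy simulation (max-pooling stands in for a learned aggregation, and all numbers are illustrative, not from the authors' framework):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_bag(positive, n=10_000, witness_rate=1e-4):
    """Simulate per-receptor scores for one repertoire (bag). In a positive
    bag only ~witness_rate of instances carry signal (scores shifted up)."""
    scores = rng.normal(size=n)
    if positive:
        k = max(1, int(n * witness_rate))
        scores[:k] += 10.0          # the few phenotype-associated receptors
    return scores

def mil_predict(bag, threshold=6.0):
    """Max-pooling MIL: the bag label comes from its best-scoring instance."""
    return bag.max() > threshold

pos, neg = make_bag(True), make_bag(False)
```

With a witness rate of 1e-4, only 1 of 10,000 instances is informative, which is why naive per-bag averaging would wash out the signal and max-style pooling is commonly used.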

Deep Learning in Automated Breast Cancer Diagnosis by Learning the Breast Histology from Microscopy Images
  • Qiangqiang Gu, Mayo Clinic, United States
  • Naresh Prodduturi, Mayo Clinic, United States
  • Steven N Hart, Mayo Clinic, United States

Short Abstract: Breast cancer is one of the most common cancers in women. However, the concordance rate of breast cancer diagnosis from histology slides is unacceptably low. Classifying normal versus tumor breast tissues from breast histology microscopy images is an ideal case to use for deep learning and could help to more reproducibly diagnose breast cancer.

We tested the accuracy of tumor versus normal classification using the BreAst Cancer Histology dataset. We first tested the patch-level classification accuracy for 16 combinations of non-specialized models, data preprocessing techniques, and hyperparameter configurations, and chose the model with the highest patch-level accuracy. Then we computed the slide-level accuracy of the selected models and compared them with 26 hyperparameter sets of a pathology-specific attention based multiple-instance learning model.

Two generic models (One-Shot Learning and the DenseNet201 with highly tuned parameters) achieved 94% slide-level validation accuracy compared to only 88% for the pathology-specific model.

The combination of image data preprocessing and hyperparameter configurations has a direct impact on the performance of image classifiers. To identify a well-performing model to classify tumor versus normal breast histology, researchers should not only focus on developing novel models, since hyperparameter tuning for existing methods can also achieve high prediction accuracy.

Deep multitask learning of gene risk for comorbid neurodevelopmental disorders
  • A. Ercument Cicek, Bilkent University, Turkey
  • Ilayda Beyreli, Bilkent University, Turkey
  • Oguzhan Karakahya, Bilkent University, Turkey

Short Abstract: Autism Spectrum Disorder (ASD) and Intellectual Disability (ID) are comorbid neurodevelopmental disorders with complex genetic architectures. Despite large-scale sequencing studies, only a fraction of the risk genes has been identified for both. Here, we present a novel network-based gene risk prioritization algorithm named DeepND that performs cross-disorder analysis to improve prediction power by exploiting the comorbidity of ASD and ID via multitask learning. Our model leverages information from gene co-expression networks that model human brain development using graph convolutional neural networks and learns which spatio-temporal neurodevelopmental windows are important for disorder etiologies. We show that our approach substantially improves the state-of-the-art prediction power in both single-disorder and cross-disorder settings. DeepND identifies the prefrontal cortex brain region and the early-mid fetal period as the highest neurodevelopmental risk window for both ASD and ID. Finally, we investigate frequent ASD- and ID-associated copy number variation regions and confident false findings to suggest several novel susceptibility gene candidates. DeepND can be generalized to analyze any combination of comorbid disorders and is released at github.com/ciceklab/deepnd.

Deep-learning based tumor microenvironment segmentation is predictive of tumor mutations and patient survival in lung cancer
  • Łukasz Rączkowski, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Poland
  • Iwona Paśnik, Department of Clinical Pathomorphology, Medical University of Lublin, Poland
  • Michał Kukiełka, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Poland
  • Marcin Nicoś, Department of Pneumology, Oncology and Allergology, Medical University of Lublin, Poland
  • Magdalena Budzińska, Ardigen, Poland
  • Tomasz Kucharczyk, Department of Pneumology, Oncology and Allergology, Medical University of Lublin, Poland
  • Justyna Szumiło, Department of Clinical Pathomorphology, Medical University of Lublin, Poland
  • Paweł Krawczyk, Department of Pneumology, Oncology and Allergology, Medical University of Lublin, Poland
  • Nicola Crosetto, Department of Medical Biochemistry and Biophysics, Karolinska Institutet, Sweden
  • Ewa Szczurek, Faculty of Mathematics, Informatics, and Mechanics, University of Warsaw, Poland

Short Abstract: The importance of the tumor microenvironment has been studied extensively in recent years. While the immune microenvironment specifically has been the focus of research, a more general approach has not yet been attempted. In this work we show that there is a relationship between general microenvironment composition and both gene mutations and patient survival in non-small cell lung cancer. We introduce a new training dataset for tissue classification in lung cancer, LubLung. With this, we train an accurate and reliable deep learning model, ARA-CNN, and use it to segment tissue slides from TCGA. These segmented slides are used to compute novel spatial metrics and are utilised in the tasks of gene mutation classification and survival prediction. We show that there are gene mutations in lung cancer that can be predicted based on tissue prevalence and tumor microenvironment quantified from images. It is likely that similar relationships can be found in other cancer types. In addition, the tumor microenvironment is also highly important in survival analysis. We observe a clear link between the tumor neighbourhood structure and patient survival.

DEPICTION: an interpretability toolbox for computational biologists
  • Maria Rodriguez Martinez, IBM Research Europe, Switzerland
  • An-Phi Nguyen, IBM Research Europe, ETH Zurich, Switzerland

Short Abstract: Thanks to their outstanding performance, deep learning models have often been the method of choice in life sciences research. Despite their results, however, their decision process often remains obscure. The black-box nature of these models greatly limits their usage in high-stakes scenarios such as healthcare or slows down further scientific discovery. Arguably, these are some of the reasons for the recent surge of interest in the field of Interpretable Machine Learning. As a consequence, computational biologists are now presented with an overwhelmingly high number of interpretability methods to start unraveling the decision-making rules of an opaque model. In this work, we introduce DEPICTION, a toolbox designed to help computational biologists make their machine learning models more interpretable. DEPICTION provides a unified interface for the most well-established interpretability methods, which can be included in existing pipelines with little change to the code. Further, DEPICTION facilitates the comparison of different interpretability techniques. As a proof of concept, we apply DEPICTION to the analysis of a single-cell proteomic dataset in which different immune cell populations were profiled. In all cases, DEPICTION identified the most informative markers to distinguish each cellular population, shedding light on the molecular phenotypes that characterise the immune system.

Design of peptides with desirable activity assisted by machine learning and data mining strategies
  • Francisca Rodríguez-Cabello, Centre for Biotechnology and Bioengineering, Chile
  • David Medina-Ortiz, Centre for Biotechnology and Bioengineering, Chile
  • Alvaro Olivera-Nappa, Centre for Biotechnology and Bioengineering, Chile

Short Abstract: Peptides are molecules composed mainly of amino acids linked together by peptide bonds. They have different properties and characteristics that make them unique structures and provide various advantages in biotechnology and bioengineering. One of the main problems of working with this type of structure lies in the high economic cost of synthesis, so it is necessary to create and implement computational methods that facilitate the design of peptides with desirable activity. Strategies such as directed evolution or rational design are now being combined with computational methods based on artificial intelligence to improve their performance. However, implementing predictive models is an arduous and complex task: as generality is sought, effectiveness is lost, since specific methods are implemented for particular tasks. Based on an analytical approach and using statistical methods, bioinformatics approaches, different encoding strategies, and data mining techniques, we have developed a system that proposes varied activity-definition rules for particular peptide sequences using the information in our recently built peptide database. Employing these rules and combining them with varied optimization algorithms, we propose designing and implementing an automated system to generate peptide sequences with desirable chemical properties.

Development of an objective proteomic indicator of trauma severity
  • Sara Masarone, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, and The Alan Turing Institute, United Kingdom
  • Gerard Hernandez, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Jennifer Ross, Centre for Trauma Sciences, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Jason Pott, Centre for Trauma Sciences, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Karim Brohi, Centre for Trauma Sciences, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom
  • Michael R. Barnes, Centre for Translational Bioinformatics, The William Harvey Research Institute, Queen Mary University of London, United Kingdom
  • Daniel J. Pennington, Blizard Institute, Barts and the London School of Medicine and Dentistry, Queen Mary University of London, United Kingdom

Short Abstract: Trauma is one of the leading causes of death worldwide, causing more fatalities than HIV and Tuberculosis combined. Quickly and correctly assessing trauma severity is essential, both to improve survival and to minimise adverse long-term outcomes. Although numerous scoring systems have been devised to categorise trauma severity and predict mortality, most are subjective and based solely on observable patient characteristics. To address this, we have developed a machine learning pipeline to reliably categorise patient severity in the hyperacute window (<2hr post-trauma) using an ensemble-based classifier trained on proteomic data. This allowed us to identify eight proteins (CPLX1/2, GSN, IFI16, CHGB, CCL22, ADSSL1, PLG) that together achieved an AUC of 0.93 (5-fold CV) for correctly identifying critical patients, outperforming current severity indicators. These proteins are involved in platelet degranulation, dissolution of fibrin clots, and a variety of signalling pathways. Patients who were assigned a higher probability of being critical were also more likely to require ventilation, longer hospitalisation and more transfusions. One of the identified biomarkers, Gelsolin (GSN), is a protein that has recently been associated with poor prognosis in a series of other health conditions, supporting its use as a general indicator of adverse clinical outcomes in human disease.

DIVERSE: Bayesian Data IntegratiVE learning for precise drug ResponSE prediction
  • Betul Guvenc Paltun, Aalto University, Finland
  • Samuel Kaski, Aalto University, Finland
  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan

Short Abstract: Detecting predictive biomarkers from multi-omics data is important for precision medicine, to improve diagnostics of complex diseases, and for better treatments. This needs substantial experimental efforts that are made difficult by the heterogeneity of cell lines and huge cost. An effective solution is to build a computational model over the diverse omics data, including genomic, molecular, and environmental information. However, choosing informative and reliable data sources from among the different types of data is a challenging problem. We propose DIVERSE, a framework of Bayesian importance-weighted tri- and bi-matrix factorization (DIVERSE3 or DIVERSE2) to predict drug responses from data of cell lines, drugs, and gene interactions. DIVERSE integrates the data sources systematically, in a step-wise manner, examining the importance of each added data set in turn. More specifically, we sequentially integrate five different data sets, which have not all been combined in earlier bioinformatic methods for predicting drug responses. Empirical experiments show that DIVERSE clearly outperformed five other methods including three state-of-the-art approaches, under cross-validation, particularly in out-of-matrix prediction, which is closer to the setting of real use cases and more challenging than simpler in-matrix prediction. Additionally, case studies for discovering new drugs further confirmed the performance advantage of DIVERSE.
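The bi-matrix factorization at the core of such approaches can be sketched with a toy, non-Bayesian alternating least-squares version (names and dimensions are illustrative; DIVERSE's importance-weighted Bayesian formulation is more involved):

```python
import numpy as np

def factorize(R, rank=2, iters=5, seed=0):
    """Toy bi-matrix factorization R ~ U @ V.T via alternating least
    squares -- a stand-in for the factorization models named above."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(R.shape[1], rank))
    for _ in range(iters):
        U = R @ V @ np.linalg.pinv(V.T @ V)    # least-squares update of U
        V = R.T @ U @ np.linalg.pinv(U.T @ U)  # least-squares update of V
    return U, V

# Toy "cell line x drug" response matrix with exact rank-2 structure
rng = np.random.default_rng(1)
R = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
U, V = factorize(R)
err = np.linalg.norm(U @ V.T - R) / np.linalg.norm(R)  # relative error
```

On an exactly low-rank matrix, ALS recovers the factorization almost immediately; real response matrices are noisy and only approximately low-rank.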

Early detection of breast and prostate cancer risk based on routine check-up data using machine learning and survival analysis
  • Ron Shamir, Tel-Aviv University, Israel
  • Dan Coster, Tel-Aviv University, Israel
  • Eyal Fisher, University of Cambridge, United Kingdom
  • Shani Shenhar-Tsarfaty, Tel-Aviv University, Israel
  • Tehillah Menes, Tel-Aviv University, Israel
  • Shlomo Berliner, Tel-Aviv University, Israel
  • Ori Rogowski, Tel-Aviv University, Israel
  • David Zeltser, Tel-Aviv University, Israel
  • Itzhak Shapira, Tel-Aviv University, Israel
  • Eran Halperin, University of California, Los Angeles, California, United States
  • Saharon Rosset, Tel-Aviv University, Israel
  • Malka Gorfine, Tel-Aviv University, Israel

Short Abstract: Cancer is a leading cause of death, and early cancer detection can affect prognosis and increase treatment effectiveness. Towards this challenge, we asked the following research question: Is medical data on healthy, undiagnosed individuals predictive of their risk to develop cancer later?
We analyzed electronic medical records of 20,000 healthy individuals who underwent routine checkups at the Tel-Aviv Medical Center Inflammation Survey between 2001 and 2017. Those records encompass more than 600 parameters per visit, including laboratory tests, vital signs, medical history, medication profile, etc. We identified those who developed cancer later using the Israeli National Cancer Registry.
We developed a novel ensemble method for risk prediction of multivariate time series data using a random forest (RF) model of survival trees for left-truncated and right-censored (LTRC) data. Our method uses an adapted version of the log-rank score as a splitting criterion.
Our method predicted future prostate cancer and breast cancer six months before diagnosis with an area under the ROC curve of 0.62±0.05 and 0.6±0.03, respectively. Performance was better than that of prior predictors. Our model was able to detect individuals who were not detected by typical screening tests such as mammography and clinical breast examination for breast cancer.
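The log-rank splitting idea can be sketched for the simpler right-censored case (this is the standard two-sample log-rank statistic, not the authors' adapted left-truncation-aware score):

```python
import numpy as np

def logrank_statistic(time, event, group):
    """Two-sample log-rank statistic for right-censored data.
    time:  event/censoring times
    event: 1 if the event was observed, 0 if censored
    group: 0/1 membership (e.g. left/right child of a candidate split)
    """
    time, event, group = map(np.asarray, (time, event, group))
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):        # distinct event times
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()   # total events at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        obs_minus_exp += d1 - d * n1 / n         # observed minus expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp / np.sqrt(var) if var > 0 else 0.0

# Clearly separated groups should yield a large statistic,
# making this split attractive for a survival tree.
time = [1, 2, 3, 4, 10, 11, 12, 13]
event = [1, 1, 1, 1, 1, 1, 1, 1]
group = [1, 1, 1, 1, 0, 0, 0, 0]
z = logrank_statistic(time, event, group)
```
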

Elastic net outperforms other machine learning models in gene expression prediction across global ancestries
  • Paul Okoro, Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States
  • Ryan Schubert, Department of Mathematics and Statistics, Loyola University Chicago, Chicago, IL, United States
  • Xiuqing Guo, The Lundquist Institute and Department of Pediatrics at Harbor-UCLA Medical Center, Torrance, CA, United States
  • Craig Johnson, Department of Biostatistics, University of Washington, Seattle, WA, United States
  • Jerome Rotter, The Lundquist Institute and Department of Pediatrics at Harbor-UCLA Medical Center, Torrance, CA, United States
  • Ina Hoeschele, Fralin Life Sciences Institute, Virginia Tech, Blacksburg, VA, United States
  • Yongmei Liu, Department of Medicine, Duke University School of Medicine, Durham, NC, United States
  • Hae Kyung Im, Section of Genetic Medicine, Department of Medicine, University of Chicago, Chicago, IL, United States
  • Amy Luke, Parkinson School of Health Sciences and Public Health, Loyola University Chicago, Maywood, IL, United States
  • Lara Dugas, Parkinson School of Health Sciences and Public Health, Loyola University Chicago, Maywood, IL, United States
  • Heather Wheeler, Program in Bioinformatics, Loyola University Chicago, Chicago, IL, United States

Short Abstract: Transcriptome prediction methods have become popular in complex trait mapping, and most models have been trained in European populations using parametric linear models like elastic net (EN). We built transcriptome prediction models using both linear and non-linear machine learning (ML) algorithms and compared their performance to EN. We trained models using genotype and blood monocyte transcriptome data from the Multi-Ethnic Study of Atherosclerosis, comprising individuals of African, Hispanic, and European ancestries, and tested them using genotype and whole-blood transcriptome data from the Modeling the Epidemiology Transition Study, comprising individuals of African ancestries. We show that prediction performance is highest when the training and testing populations share similar ancestries, regardless of the prediction algorithm used. While EN generally outperformed random forest (RF), support vector regression (SVR), and K nearest neighbor (KNN), we found that RF outperformed EN for some genes, particularly between disparate ancestries, suggesting potential robustness and reduced variability of RF imputation performance across populations. We show that including RF prediction models in PrediXcan revealed potential gene associations missed by EN models. Therefore, by integrating other ML modeling into PrediXcan and diversifying our training populations to include more global ancestries, we may uncover new genes associated with complex traits.

ENNGene: an Easy Neural Network model building tool for Genomics
  • Panagiotis Alexiou, Central European Institute of Technology (CEITEC), Czechia
  • Eliska Chalupova, Masaryk University, Czechia
  • Ondrej Vaculik, Masaryk University, Czechia
  • Filip Josefov, Masaryk University, Czechia
  • Jakub Polacek, Masaryk University, Czechia
  • Tomas Majtner, Central European Institute of Technology (CEITEC), Czechia

Short Abstract: Here we present ENNGene (an Easy Neural Network model building tool for Genomics), an application that simplifies the local training of custom Convolutional Neural Network models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple streams of input information, including sequence, evolutionary conservation, and secondary structure, and performs the needed preprocessing steps, allowing simple input such as genomic coordinates. The network architecture can be customized by the user, ranging from the number and types of layers (convolutional, GRU, LSTM, or dense layers) to the precise set-up of each layer, e.g. a dropout rate. ENNGene then deals with all steps of training and evaluation of the model, exporting useful metrics such as multi-class AUC-ROC or precision-recall curve plots, as well as TensorBoard log files. To facilitate interpretation of the predicted results, we deploy the Integrated Gradients method, providing the user with a graphical representation of the attribution level of each nucleotide. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching state-of-the-art performance, while improving the performance on several of the proteins by including the evolutionary conservation score and tuning the network architecture for each protein.
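Integrated Gradients itself is simple to sketch for any differentiable model (a toy analytic gradient function stands in for a trained network here; this is an illustration of the method, not ENNGene's code):

```python
import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=100):
    """Integrated Gradients: attribution_i = (x - baseline)_i times the
    average gradient along the straight path from baseline to x."""
    alphas = (np.arange(steps) + 0.5) / steps       # midpoint rule
    grads = np.array([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Toy model f(x) = sum(x**2); its gradient 2x is known analytically
f = lambda x: np.sum(x**2)
grad_fn = lambda x: 2 * x
x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(grad_fn, x, baseline)
```

The completeness axiom (attributions sum to f(x) - f(baseline)) holds exactly for this toy model, which is a useful sanity check in practice.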

Ensemble of deep and shallow graph convolutional networks for identifying disease-gene association
  • Giltae Song, Pusan National University, South Korea
  • Donghyun Son, Pusan National University, South Korea

Short Abstract: Identifying disease-gene associations is essential for discovering disease mechanisms and developing therapeutic drugs. Although experimental studies have been conducted for decades, they are too slow and costly to fill in all the association relationships between diseases and genes.
In this study, we develop a deep learning approach based on graph convolutional networks for determining associations between diseases and genes. We use OMIM (Online Mendelian Inheritance in Man) data, which includes experimentally validated disease-gene association information. Since too few association relationships are available in these data to train a machine learning model, we add other auxiliary data sources to increase the volume of training data. We apply a multi-layered Graph Convolutional Network model to capture unknown nonlinear associations between diseases and genes. To avoid information loss in deep layers, we combine a model with shallow layers and another model with deep layers. We evaluate our ensemble approach using OMIM data. Our results improve on the performance of other existing methods. We believe that our graph convolutional network model can discover novel biomarker genes for diseases such as cancer.
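A single graph-convolution propagation step, as used in such models, can be sketched as follows (the standard Kipf-Welling rule; the authors' exact architecture and ensembling are not reproduced here):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)                   # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric normalization
    return np.maximum(0, D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# Toy disease-gene graph: 4 nodes, 3 input features, 2 output features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

shallow = gcn_layer(A, H, W)                # one layer ("shallow" branch)
W2 = rng.normal(size=(2, 2))
deep = gcn_layer(A, shallow, W2)            # stacked layers ("deep" branch)
```

An ensemble in this spirit would combine the outputs of the shallow and deep branches, e.g. by averaging their predictions.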

Evaluation Of Convolutional Neural Networks Containing Interactions Between Genomic Motifs
  • Bernhard Renard, Hasso-Plattner-Institute, Germany
  • Marta Lemanczyk, Hasso-Plattner-Institute, Germany

Short Abstract: To obtain new insights into biological mechanisms, one common approach is to interpret Convolutional Neural Networks (CNNs) by identifying patterns in genomic sequences. Attribution methods assign importance scores to each input feature and can uncover the relevance of single positions in the input sequence to the model's prediction. These scores are created independently of other positions, but many features in biological sequences are in relationships with others, forming motifs with specific functions. Moreover, groups of such locally dependent features can be part of a higher-order interaction representing a regulatory logic between motifs hidden in deeper layers. Interactions could lead to noisy or missing importance scores that do not capture the underlying ground truth in an understandable manner.
We define interaction logic concepts in biological sequences and generate data sets that capture these concepts. Then, we investigate whether current post-hoc attribution methods are capable of capturing motifs when the input sequences contain interactions like the previously defined interaction concepts.

Evaluation of Dimension Reduction Methods for Transfer Learning
  • Vishal H. Oza, The University of Alabama at Birmingham, United States
  • Brittany N. Lasseigne, The University of Alabama at Birmingham, United States
  • Jennifer L. Fisher, The University of Alabama at Birmingham, United States
  • Elizabeth Ramsey, The University of Alabama at Birmingham, United States

Short Abstract: Even though eighty percent of rare diseases have a genetic component, ninety-five percent of rare diseases do not have a molecular target for therapy. One reason for this problem is that rare diseases have scarce datasets (datasets with a low number of samples) which makes it difficult to develop accurate statistical and machine learning models for identifying drug targets and drug repurposing candidates. However, with transfer learning, we can identify patterns from larger, statistically powered datasets and apply them to a rare disease scarce dataset, for example, to identify pathomechanisms or drug targets. The objective of this study is to evaluate different dimension reduction methods, a required step in transfer learning. We applied several dimension reduction methods including principal component analysis (PCA), independent component analysis (ICA), and non-negative matrix factorization (NMF) to identify gene expression patterns from Recount2 (a large and publicly available gene expression data set) and transferred these patterns to Glioblastoma Multiforme (GBM) gene expression profiles and gene expression profiles from cell lines perturbed by temozolomide, the standard chemotherapy for GBM. Further, we evaluated the performance of these dimension reduction methods to identify gene expression patterns with potential clinical impact.
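As an illustration of the transfer step, patterns learned by PCA on a large reference can be projected onto a small target dataset (random matrices stand in for Recount2 and GBM data; only PCA of the methods listed is sketched):

```python
import numpy as np

# Learn principal components ("expression patterns") on a large reference
# matrix, then transfer them to a small target dataset by projection.
rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 50))   # stand-in: samples x genes
target = rng.normal(size=(10, 50))       # stand-in: small disease cohort

# PCA via SVD on the centered reference
mean = reference.mean(axis=0)
U, S, Vt = np.linalg.svd(reference - mean, full_matrices=False)
components = Vt[:10]                     # top 10 gene-space loadings

# Transfer: project target samples onto the reference patterns
target_scores = (target - mean) @ components.T
```

The target samples are now described in the reference pattern space, where downstream models can be trained despite the small target sample size.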

Evolutionary-based variational generative models for biological sequences
  • Amine Remita, Université du Québec à Montréal, Canada
  • Abdoulaye Baniré Diallo, Université du Québec à Montréal, Canada

Short Abstract: Generative frameworks designed for genomics data are emerging as powerful approaches to study complex phenomena in biology, including protein functions and structures, single-cell RNA-seq analyses and phylogenetic-based studies. However, for studying molecular evolutionary processes, most deep generative models do not explicitly consider the underlying evolutionary dynamics of biological sequences as is done within the Bayesian phylogenetic inference framework. Here we propose a variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters and generates sequence alignments. Moreover, it is instantiated and tuned for continuous-time Markov chain substitution models such as the generalized time reversible model. The architecture of our method consists of a set of deep variational encoders that infer the parameters of evolutionary-latent-variable distributions and allow sampling, and a generative model that computes probability transition matrices from sampled latent variables and generates a distribution of sequence alignments from reconstructed ancestral states. We train the model via a low-variance variational objective function and a site-wise stochastic gradient ascent algorithm. Experimentally, we show the effectiveness and efficiency of the method on synthetic sequence alignments simulated with several evolutionary schemas and on real aligned virus DNA sequences.
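As background, the generalized time-reversible (GTR) model mentioned above yields transition probabilities P(t) = exp(Qt); a minimal sketch with illustrative frequencies and exchangeabilities (not the authors' learned parameters):

```python
import numpy as np

def gtr_rate_matrix(pi, rates):
    """Build a GTR rate matrix Q from stationary frequencies pi (A,C,G,T)
    and 6 exchangeabilities (AC, AG, AT, CG, CT, GT)."""
    Q = np.zeros((4, 4))
    idx = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    for r, (i, j) in zip(rates, idx):
        Q[i, j] = r * pi[j]
        Q[j, i] = r * pi[i]
    np.fill_diagonal(Q, -Q.sum(axis=1))      # rows of Q sum to zero
    return Q

def transition_matrix(Q, t):
    """P(t) = expm(Q t) via eigendecomposition (Q is diagonalizable
    for reversible models)."""
    w, V = np.linalg.eig(Q)
    return (V @ np.diag(np.exp(w * t)) @ np.linalg.inv(V)).real

pi = np.array([0.3, 0.2, 0.2, 0.3])          # illustrative frequencies
rates = np.array([1.0, 2.0, 1.0, 1.0, 2.0, 1.0])
Q = gtr_rate_matrix(pi, rates)
P = transition_matrix(Q, t=0.5)
```

Two standard sanity checks: each row of P(t) is a probability distribution, and pi is stationary under P(t) by time reversibility.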

Exploring the kinome using graph representation learning
  • Sachin Gavali, University of Delaware, United States
  • Karen Ross, Georgetown University Medical Center, United States
  • Chuming Chen, University of Delaware, United States
  • Julie Cowart, University of Delaware, United States
  • Cathy Wu, University of Delaware, United States

Short Abstract: The human kinome contains a vast network of interacting kinases and phosphorylation substrates. Some of these kinases are very well studied and have proven useful as therapeutic targets, but many are poorly understood and their biological roles unknown. In this work, we use the latest advancements in graph-based machine learning methods to explore the biological roles of these understudied kinases. We use the post-translational modification data from iPTMnet to build a kinase-substrate interaction network, and enrich this network using Gene Ontology functional annotations to provide a biological context to these interactions. We then use the node2vec algorithm to learn vector representations of the kinases and substrates in this network, and use these representations to predict novel interactions for understudied kinases using a Random Forest model. We also perform a bioinformatics analysis of the predicted interactions to understand the biological roles of understudied kinases. For two of the understudied kinases, Q9UEE5 (STK17A) and Q9H422 (HIPK3), we were able to ascertain through functional enrichment analysis of their predicted interaction partners that they play important roles in cancer activity and in mediating an inflammatory immune response.
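The node2vec step can be sketched as a biased second-order random walk (an unweighted toy graph here; the embeddings would then be learned from many such walks, which is not shown):

```python
import random

def node2vec_walk(adj, start, length, p=1.0, q=0.5, seed=0):
    """One biased node2vec random walk over an adjacency-list graph.
    p penalizes returning to the previous node; q < 1 favors exploring
    nodes farther from the previous node (DFS-like behavior)."""
    rnd = random.Random(seed)
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:
            break                              # dead end
        if len(walk) == 1:
            walk.append(rnd.choice(nbrs))      # first step is unbiased
            continue
        prev = walk[-2]
        weights = []
        for x in nbrs:
            if x == prev:
                weights.append(1.0 / p)        # return to previous node
            elif x in adj[prev]:
                weights.append(1.0)            # stays at distance 1
            else:
                weights.append(1.0 / q)        # explores outward
        walk.append(rnd.choices(nbrs, weights=weights)[0])
    return walk

# Toy kinase-substrate graph as adjacency lists (node ids are arbitrary)
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
walk = node2vec_walk(adj, start=0, length=10)
```
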

Feature selection with VAE for scRNA-seq analysis
  • Toshiya Tanaka, Osaka University, Japan
  • Shigeto Seno, Osaka University, Japan
  • Hideo Matsuda, Osaka University, Japan

Short Abstract: In recent years, single-cell RNA sequencing (scRNA-seq) technologies have made rapid progress. They have provided many valuable insights into biological systems, such as characterizing cell populations and unraveling complex cellular processes. Analyzing scRNA-seq data requires a number of procedures such as data normalization, feature selection and dimension reduction for visualization and clustering. In addition, their hyperparameters need to be set properly.

To address this problem, we propose a method based on a variational autoencoder (VAE). By using raw count data as input, VAE captures higher-order dependencies of gene expressions in end-to-end training. VAE usually uses a Gaussian distribution as the prior probability distribution in the latent space. However, it cannot represent multiple modes of different cell types in scRNA-seq data. Therefore, we use the Gaussian-mixture model instead of the Gaussian distribution. It allows us to represent the latent space more flexibly. If the VAE captures the relevant features of the scRNA-seq data, each cell type will form a group as a subpopulation within the overall Gaussian-mixture model population. We show the result of clustering cells with latent representations and how VAE selects the relevant features.
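The Gaussian-mixture prior can be sketched as a log-density computed with log-sum-exp over components (shapes and parameters are illustrative, not the trained model's):

```python
import numpy as np

def log_gmm_prior(z, means, log_var, weights):
    """Log-density of a diagonal Gaussian-mixture prior at latent points z.
    z: (n, d), means: (k, d), log_var: (k, d), weights: (k,)."""
    n, d = z.shape
    diff = z[:, None, :] - means[None, :, :]              # (n, k, d)
    log_comp = -0.5 * (d * np.log(2 * np.pi)
                       + log_var.sum(axis=1)[None, :]
                       + (diff**2 / np.exp(log_var)[None, :, :]).sum(axis=2))
    log_w = np.log(weights)[None, :]
    # log-sum-exp over components for numerical stability
    m = (log_w + log_comp).max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_w + log_comp - m)
                       .sum(axis=1, keepdims=True))).squeeze(1)

# Two well-separated components standing in for two cell types
means = np.array([[0.0, 0.0], [5.0, 5.0]])
log_var = np.zeros((2, 2))
weights = np.array([0.5, 0.5])
z = np.array([[0.1, -0.1], [4.9, 5.2]])
logp = log_gmm_prior(z, means, log_var, weights)
```

Points near either mode get comparable prior density, which is what lets each cell type occupy its own mixture component in the latent space.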

Flimma: A federated and privacy-preserving tool for differential gene expression analysis
  • Jan Baumbach, Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  • Markus List, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • David Blumenthal, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Olga Zolotareva, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Reza Nasirigerdeh, AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  • Julian Matschinske, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Reihaneh Torkzadehmahani, AI in Medicine and Healthcare, Technical University of Munich, Munich, Germany
  • Mohammad Bakhtiari, Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany
  • Julian Späth, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Tobias Frisch, Department of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
  • Amir Abbasinejad, Chair of Experimental Bioinformatics, TUM School of Life Sciences, Technical University of Munich, Germany
  • Paolo Tieri, CNR National Research Council, IAC Institute for Applied Computing, Rome, Italy
  • Nina Wenke, Chair of Computational Systems Biology, University of Hamburg, Hamburg, Germany

Short Abstract: Aggregating clinical transcriptomics data across hospitals can increase sensitivity and robustness of differential gene expression analyses, yielding deeper clinical insights. As data exchange is often restricted by privacy legislation, meta-analyses are frequently employed to pool local results. However, if class labels or confounders are inhomogeneously distributed between cohorts, results may differ significantly from those of aggregated analyses.
The dilemma between accuracy and privacy may be resolved by employing privacy-aware techniques such as federated learning (FL) or secure multi-party computation (SMPC). Here we present Flimma, a novel privacy-preserving implementation of a popular differential gene expression analysis workflow, limma voom. Flimma is based on HyFed (github.com/tum-aimed/hyfed), a hybrid federated framework that enables participants to hide the real values of their local models from the server while preserving the utility of the global model. Unlike limma voom, Flimma preserves the privacy of the data, as only highly noisy model parameters are shared with the server. Flimma is user-friendly and publicly available at exbio.wzw.tum.de/flimma/, including tutorials and documentation. A full version of this article is available at arxiv.org/abs/2010.16403.
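The masking idea can be sketched in a few lines. This is a simplified illustration of additive masking in the spirit of hybrid federated frameworks, not the actual HyFed or Flimma protocol: each client sends a noisy share of its local statistic to the aggregation server and the noise itself to a separate party, so neither party alone sees the true local value, yet the noise cancels in the aggregate.

```python
import random

def mask_local_value(value, rng):
    """Split a client's local statistic into a noisy share (for the
    aggregation server) and the noise itself (for a separate compensator).
    Simplified sketch, not the actual HyFed protocol."""
    noise = rng.uniform(-1e6, 1e6)
    return value + noise, noise

rng = random.Random(42)
local_means = [3.2, 2.9, 3.5]  # e.g. per-hospital mean expression of one gene
server_shares, comp_noises = zip(*(mask_local_value(v, rng) for v in local_means))

# The server and the compensator each aggregate only what they received...
server_sum = sum(server_shares)
noise_sum = sum(comp_noises)
# ...and only the combined, noise-cancelled global statistic is revealed.
global_mean = (server_sum - noise_sum) / len(local_means)
print(round(global_mean, 6))
```

Neither the server (which sees only `server_shares`) nor the compensator (which sees only `comp_noises`) can recover an individual hospital's value.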

G2PDeep: a web-based deep-learning framework for quantitative phenotype prediction and discovery of genomic markers
  • Shuai Zeng, University of Missouri-Columbia, United States
  • Ziting Mao, University of Missouri-Columbia, United States
  • Yijie Ren, University of Missouri-Columbia, United States
  • Duolin Wang, University of Missouri-Columbia, United States
  • Dong Xu, University of Missouri-Columbia, United States
  • Trupti Joshi, University of Missouri-Columbia, United States

Short Abstract: G2PDeep is an open-access web server that provides a deep-learning framework for quantitative phenotype prediction and discovery of genomic markers. It uses zygosity or single nucleotide polymorphism (SNP) information from plants and animals as input to predict quantitative phenotypes of interest and genomic markers associated with those phenotypes. It provides a one-stop-shop platform for researchers to create deep-learning models through an interactive web interface and train these models with uploaded data, using high-performance computing resources at the back end. G2PDeep also provides a series of informative interfaces to monitor the training process and compare performance among the trained models. The trained models can then be deployed automatically. The quantitative phenotype and genomic markers are predicted using a user-selected trained model, and the results are visualized. Our state-of-the-art model has been benchmarked by other researchers and demonstrated competitive performance in quantitative phenotype prediction. In addition, the server integrates the soybean nested association mapping (SoyNAM) dataset with five phenotypes: grain yield, height, moisture, oil, and protein. A publicly available dataset for seed protein and oil content has also been integrated into the server. The G2PDeep server is publicly available at g2pdeep.org. The Python-based deep-learning model is available at github.com/shuaizengMU/G2PDeep_model.

Gene Network Connectivity Conveys Robustness in Gene Expression across Individuals, Cell types and Species
  • Amirreza Shaeiri, EMBL-Heidelberg, Iran
  • Olga Sigalova, EMBL-Heidelberg, Germany
  • Judith Zaugg, EMBL-Heidelberg, Germany

Short Abstract: One of the remarkable properties of living systems is their robustness against various sources of variation. A growing body of work has studied this topic from the viewpoint of gene expression programs, under dissimilar names. These efforts, while shedding great light on the subject, have different limitations: they lack a comprehensive and comparative approach to studying variation across tissues and cell lines; the connectivity of genes through co-expression or gene regulatory networks is often not taken into account; and a comparative analysis of gene expression variation across different species is missing. In this work, we aimed to address these challenges by analyzing a variety of datasets. We present extensive evidence for the important role of networks in controlling variation. To the best of our knowledge, we build the most inclusive set of gene-specific regulatory features. Moreover, we predict our statistics of interest based on (1) only the genome sequence, (2) only the features, and (3) the features together with the network, utilizing a variety of methods. We further observe that expression variation is conserved between more closely related species in matching tissues.

High-dimensional multi-trait GWAS by reverse prediction of genotypes
  • Muhammad Ammar Malik, University of Bergen, Norway
  • Adriaan-Alexander Ludl, University of Bergen, Norway
  • Tom Michoel, University of Bergen, Norway

Short Abstract: Reverse linear regression, where genotypes of genetic variants are regressed on multiple traits simultaneously, has emerged as a way to extend multi-trait genome-wide association studies (GWAS) to high-dimensional settings where the number of traits exceeds the number of samples.
Here we demonstrate that all multi-trait GWAS methods can be written as reverse genotype prediction methods. We analyzed linear and non-linear machine learning methods for multi-trait GWAS. Using genotypes, gene expression data, and ground-truth transcriptional regulatory networks from the DREAM5 SysGen Challenge and from a cross between two yeast strains, we found that genotype prediction accuracy varies across variants but does not correlate with a variant's overall effect on gene expression. Moreover, feature coefficients correlated with the association strength between variants and individual traits and were predictive of true trans-eQTL target genes. Feature selection allowed us to distinguish between genomic regions of high and low transcriptional activity in random forest models, but not in ridge regression or SVM models. In summary, feature coefficients of models predicting genotype from high-dimensional traits identify biologically relevant variant-trait associations, but comparing the relative importance of variants across such models in a GWAS-like manner using a single test statistic remains an open challenge.
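As a toy illustration of the reverse-regression setup (a generic ridge sketch, not the authors' pipeline), the genotype of one variant is regressed on many traits, and the resulting feature coefficients indicate which traits are associated with that variant. The data below are made up.

```python
def ridge_coefficients(X, y, lam=1.0):
    """Ridge solution beta = (X^T X + lam*I)^{-1} X^T y via Gaussian
    elimination. In reverse-GWAS, y is the genotype of one variant and
    the columns of X are traits (e.g. expression levels of many genes)."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(p)] for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    for col in range(p):  # forward elimination with partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):  # back substitution
        beta[r] = (b[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return beta

# Toy example: trait 0 tracks the genotype, trait 1 is noise.
X = [[1.0, 0.3], [1.1, -0.2], [-0.9, 0.1], [-1.0, -0.4], [0.95, 0.0]]
y = [1, 1, 0, 0, 1]  # genotype coded 0/1
beta = ridge_coefficients(X, y, lam=0.1)
print(beta)
```

The coefficient on the genotype-tracking trait dominates, mirroring the abstract's observation that feature coefficients reflect variant-trait association strength.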

HydrAMP: a deep generative model for antimicrobial peptide discovery
  • Paulina Szymczak, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Marcin Możejko, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Tomasz Grzegorzek, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Marta Bauer, Medical University of Gdańsk, Poland
  • Wojciech Kamysz, Medical University of Gdańsk, Poland
  • Michał Michalski, The Centre of New Technologies, University of Warsaw, Poland
  • Damian Neubauer, Medical University of Gdańsk, Poland
  • Piotr Setny, The Centre of New Technologies, University of Warsaw, Poland
  • Jacek Sroka, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland
  • Ewa Szczurek, Faculty of Mathematics, Informatics and Mechanics of the University of Warsaw, Poland

Short Abstract: The development of resistance to conventional antibiotics in pathogenic bacteria poses a global health hazard. Antimicrobial peptides (AMPs) are an emerging group of compounds with the potential to become a new generation of antibiotics. Deep learning methods are widely used by wet-laboratory researchers to screen for the most promising candidates. We propose HydrAMP, a generative model based on a semi-supervised variational autoencoder that can generate new AMPs, improve existing ones, and perform analogue discovery. Novel features of our approach include non-iterative training, parameter-regulated model creativity, generation of more diverse peptides, and disentanglement of the latent space from the conditions. Our model enables fast and efficient discovery of peptides with the desired biological activity. The peptides generated by HydrAMP are similar to known AMPs in terms of physicochemical properties. Wet-lab validation confirmed that HydrAMP can find potent analogues, as demonstrated for Pexiganan, for which we obtained a new, more active analogue. The proposed model may contribute to the fight against the antibiotic-resistance crisis.

Identifying cellular cancer mechanisms through pathway-driven data integration
  • Noel Malod-Dognin, Barcelona Supercomputing Center, Spain
  • Natasa Przulj, ICREA; Barcelona Supercomputing Center; University College London, Spain
  • Sam Windels, Barcelona Supercomputing Center; University College London, Spain

Short Abstract: Motivation:
Cancer is a genetic disease in which mutations of cancer driver genes induce a functional reorganisation of the cell by reprogramming existing cellular pathways. Many approaches have therefore been suggested to predict cancer-affected pathways, typically based on how strongly they are perturbed by differentially expressed genes. However, we observed that cancer driver genes perform hub roles in the communication between pathways. It is therefore likely that it is not primarily the pathways themselves that are perturbed, but rather their pathway-pathway interactions and, with them, their mutual functional relationships. We thus aim to identify cancer pathways and cancer genes based on their functional relationships and how these change in cancer.
To learn an embedding space that captures the functional organisation of pathways in the cell, we present a pathway-driven non-negative matrix tri-factorisation model (P-NMTF), which simultaneously decomposes a list of sub-adjacency matrices encoding how each pathway interacts within the cell. To predict cancer pathways and cancer genes, we define an NMTF centrality and a moving distance, which respectively allow us to compute the functional importance of a pathway or gene in the cell and how strongly its functional relationships are disrupted in cancer.

Identifying Cross-Cancer Similar Patients via a Semi-Supervised Deep Clustering Approach
  • Oznur Tastan, Sabanci University, Turkey
  • Duygu Ay, Sabancı University, Turkey

Short Abstract: With the characterization of cancer tumors at the molecular level, there have been reports of patients being similar despite being diagnosed with different cancer types. Motivated by these observations, we aim to discover cross-cancer patients, which we define as patients whose tumors are more similar to tumors diagnosed as another cancer type. We develop DeepCrossCancer to identify cross-cancer patients who consistently co-cluster with patients from another cancer type. The input to DeepCrossCancer is the transcriptomic profiles of the patient tumors and the age and sex of the patients. To solve the clustering problem, we use a semi-supervised deep learning-based clustering method in which the clustering task is supervised by cancer type labels and the survival times of the patients. Applying the method to patient data from nine different cancers, we discover 20 cross-cancer patients that consistently co-cluster. By analyzing the predictive genes of the cross-cancer patients and other genomic information available for the patients, such as somatic mutations and copy number variations, we identify striking genomic similarities across these patients. The detection of cross-cancer patients opens up possibilities for transferring clinical decisions across patients at the single-patient level.

Inferring developmental trajectories and optimized dimension reduction from temporal single-cell RNA-sequencing data
  • Maren Hackenberg, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany
  • Harald Binder, Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center - University of Freiburg, Germany

Short Abstract: Single-cell RNA-sequencing data from multiple time points promises insights into mechanisms controlling differentiation and cell fate decisions at the level of individual cells. Yet, at each time point, a different, heterogeneous sample of cells from diverse types and developmental stages is obtained, complicating the identification of specific developmental trajectories across multiple time points.
To address this challenge, we propose a modeling approach that integrates neural network-based dimension reduction with inference of the temporal dynamics. More specifically, a low-dimensional, latent representation of gene expression is obtained with an autoencoder. In the latent space, we describe trajectories by alternating between assigning cells into groups based on the current dynamic model prediction, and optimizing the model parameters by matching the predicted and true distributions in each group using a quantile-based loss function.
Based on simulated data, we show that this approach allows for inferring distinct developmental trajectories despite the lack of one-to-one correspondence between cells at different time points. Jointly optimizing the neural network for dimension reduction and the dynamic model allows for learning an improved low-dimensional representation specifically adapted to the underlying dynamics. We additionally present an application to single-cell RNA-sequencing data from several time points during mouse cortical differentiation.

Interpretable deep recommender system model for prediction of kinase inhibitor efficacy across cancer cell lines
  • Krzysztof Koras, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
  • Dilafruz Juraeva, Merck KGaA, Translational Medicine, Department of Bioinformatics, Germany
  • Ewa Kizling, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland
  • Eike Staub, Merck KGaA, Translational Medicine, Department of Bioinformatics, Germany
  • Ewa Szczurek, Faculty of Mathematics, Informatics and Mechanics, University of Warsaw, Poland

Short Abstract: Computational models for drug sensitivity prediction have the potential to significantly improve personalized cancer medicine. Drug sensitivity assays, combined with profiling of cancer cell lines and drugs, are becoming increasingly available for training such models. Multiple methods have been proposed for predicting drug sensitivity from cancer cell line features, some in a multi-task fashion. So far, no such model has leveraged drug inhibition profiles. Importantly, multi-task models require a tailored approach to model interpretability. In this work, we develop DEERS, a neural network recommender system for kinase inhibitor sensitivity prediction. The model utilizes molecular features of the cell lines and kinase inhibition profiles of the drugs. DEERS incorporates two autoencoders to project cell line and drug features into 10-dimensional hidden representations and a feed-forward neural network to combine them into a response prediction. We propose a novel interpretability approach that, in addition to the set of modeled features, also considers genes and processes outside of this set. Our approach outperforms simpler matrix factorization models, achieving R=0.82 correlation between true and predicted responses for unseen cell lines. The interpretability analysis identifies 67 biological processes that drive cell line sensitivity to particular compounds. Detailed case studies are shown for PHA-793887, XMD14-99, and Dabrafenib.

Interpretable T-cell receptor binding prediction using Feature-wise Additive Networks
  • Maria Rodriguez Martinez, IBM Research Europe, Switzerland
  • An-Phi Nguyen, IBM Research Europe, ETH Zurich, Switzerland

Short Abstract: Linear models are often regarded as the canonical example of an interpretable method. Arguably, one of the main reasons is the possibility of reasoning about each feature separately from the others. However, linear models are limited in that they cannot approximate more complex functions. In this work, we introduce feature-wise additive neural networks: each feature is processed separately from the others, and a latent representation is produced for each of them. Subsequently, all the feature-wise representations are simply summed and finally passed through a final network for classification. We show that the feature-wise and additive aspects of our model greatly increase its interpretability compared to other deep learning models, while still retaining the capability of approximating complex functions. This claim is supported by our experiments on a T-cell receptor binding prediction task. While motivated by interpretability, our model can also be useful for multimodal problems (e.g., prediction using diverse clinical information) and problems with missing data (with no need for imputation with arbitrary values).

Learning Interpretable Pharmacophoric Representations to Improve Deep Generative Models for Fragment Elaboration
  • Charlotte M. Deane, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom
  • Thomas E. Hadfield, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom
  • Fergus Imrie, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom
  • Torsten Schindler, F. Hoffman-La Roche AG, Switzerland
  • William Pitt, UCB Pharma, United Kingdom
  • Andy Merritt, LifeArc, United Kingdom
  • Kristian Birchall, LifeArc, United Kingdom
  • Garrett M. Morris, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom

Short Abstract: Despite recent interest in deep generative models for scaffold elaboration, their applicability to fragment-to-lead campaigns has been limited. This is in part because they are currently unable to account for local protein structure when proposing molecules and are not designed to pursue specific design hypotheses. We propose a novel method for fragment elaboration, STRIFE, which uses Fragment Hotspot Maps to extract meaningful and interpretable information from the protein and uses it to generate elaborations with complementary pharmacophores. We demonstrate substantial improvements over an existing, structure-unaware fragment elaboration model on a large-scale test set and show that our approach allows a significant degree of control over the elaborations produced. On a challenging case study derived from the literature, despite our model being provided with only the fragment and the associated protein structure, a number of the molecules ranked highly by STRIFE closely matched the design hypothesis specified by the authors.

  • Adrian J. Green, NC State University, United States
  • Jhuma Das, The University of North Carolina at Chapel Hill, United States
  • Meenal Chaudhari, North Carolina A&T State University, United States
  • Lisa Truong, Oregon State University, United States
  • Robyn L. Tanguay, Oregon State University, United States
  • Martin Mohlenkamp, Ohio University, United States
  • David Reif, NC State University, United States

Short Abstract: There are currently 85,000 chemicals registered with the US Environmental Protection Agency, but only a small fraction have measured toxicological data. To address this gap, high-throughput screening (HTS) and computational methods are vital. As part of one such HTS effort, embryonic zebrafish were used to examine a suite of morphological and mortality endpoints for over 1,000 chemicals found in the ToxCast library. We hypothesized that a conditional generative adversarial network (GAN-ZT) or deep neural network (Go-ZT) could efficiently predict toxic outcomes of untested chemicals. We converted the 3D chemical structural information into a weighted set of points and used the in vivo toxicity data to train the two models. Our results showed that Go-ZT significantly outperformed GAN-ZT, support vector machine (SVM), random forest (RF), and multilayer perceptron (MLP) models in cross-validation and when tested against an external test dataset. By combining Go-ZT and GAN-ZT, our consensus model improved the Kappa and area under the receiver operating characteristic curve to 0.673 and 0.837, respectively. Considering their potential use as prescreening tools, these models could provide in vivo toxicity predictions and insight into the hundreds of thousands of untested chemicals and help prioritize compounds for HT testing.

Light Attention Predicts Protein Location from the Language of Life
  • Hannes Stärk, Department of Informatics, Technical University of Munich, Germany
  • Christian Dallago, Department of Informatics, Technical University of Munich, Germany
  • Michael Heinzinger, Department of Informatics, Technical University of Munich, Germany
  • Burkhard Rost, Department of Informatics, Technical University of Munich, Germany

Short Abstract: Although knowing where a protein functions in a cell is important for characterizing biological processes, this information remains unavailable for most known proteins. Machine learning narrows the gap through predictions from expertly chosen input features leveraging evolutionary information, which is resource-expensive to generate. We showcase the use of embeddings from protein language models for competitive localization prediction without relying on evolutionary information. Our lightweight deep neural network architecture uses a softmax-weighted aggregation mechanism with linear complexity in sequence length, referred to as light attention (LA). The method significantly outperformed the state of the art for ten localization classes by about eight percentage points (Q10). The novel models are available as a web service and as a stand-alone application at embed.protein.properties.
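The softmax-weighted aggregation at the heart of light attention can be sketched as follows. This is an illustrative simplification: in the actual LA architecture the attention scores and values come from 1D convolutions over the residue embeddings, whereas here both are supplied directly.

```python
import math

def light_attention_pool(embeddings, attn_scores):
    """Softmax-weighted aggregation of per-residue embeddings into one
    fixed-size protein vector; cost is linear in sequence length."""
    m = max(attn_scores)  # subtract max for numerical stability
    exp_s = [math.exp(s - m) for s in attn_scores]
    z = sum(exp_s)
    weights = [e / z for e in exp_s]
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings))
            for d in range(dim)]

# Three residues with 2-d embeddings; the second residue gets a high score.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = light_attention_pool(emb, [0.0, 5.0, 0.0])
print(pooled)
```

Because the pooled vector has fixed size regardless of sequence length, a small feed-forward classifier on top suffices for the localization classes.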

Manifold-based gene density estimates reveal immune signaling in meningioma tumors
  • Aarthi Venkat, Yale University, United States
  • Danielle Miyagishima, Yale University, United States
  • Alexander Tong, Yale University, United States
  • Murat Günel, Yale University, United States
  • Smita Krishnaswamy, Yale University, United States

Short Abstract: Single-cell sequencing analysis often includes clustering cells and identifying differentially expressed genes (DEGs) between clusters. However, the number of clusters, clustering algorithm, and choice of hyperparameters can have a large effect on downstream analyses and biological interpretation.

To address these difficulties, we present our method, which leverages manifold learning to identify DEGs in a cluster-independent way. We begin by modeling the cellular state space as a manifold. We then calculate a kernel density estimate (KDE) of each gene over the graph, using a form of KDE generalized to smooth manifolds. Finally, we compute the L1 distance between each gene's density distribution and the uniform distribution. By ranking genes based on this L1 distance, we identify genes with highly localized expression along the manifold.
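The ranking idea can be illustrated with a crude stand-in for the manifold KDE (not the authors' operator): smooth each gene's signal over a cell-cell graph with a lazy random-walk diffusion, then score genes by the L1 distance of the smoothed density from uniform. The toy graph and signals below are made up.

```python
def smooth_density(adj, indicator, steps=10, alpha=0.5):
    """Lazy random-walk smoothing of a per-cell gene signal over a
    cell-cell graph; a simple stand-in for the heat-kernel KDE."""
    f = list(indicator)
    n = len(f)
    for _ in range(steps):
        f = [alpha * f[i] + (1 - alpha) * sum(f[j] for j in adj[i]) / len(adj[i])
             for i in range(n)]
    total = sum(f)
    return [v / total for v in f]  # normalize to a density over cells

def l1_to_uniform(density):
    """L1 distance to the uniform distribution; large = localized gene."""
    n = len(density)
    return sum(abs(p - 1.0 / n) for p in density)

# Path graph of 6 cells; gene A is localized to one end, gene B is ubiquitous.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
gene_a = smooth_density(adj, [1, 1, 0, 0, 0, 0])
gene_b = smooth_density(adj, [1, 1, 1, 1, 1, 1])
print(l1_to_uniform(gene_a), l1_to_uniform(gene_b))
```

The localized gene scores high while the ubiquitously expressed gene scores near zero, so ranking by this distance surfaces localized expression without any clustering step.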

We demonstrate the utility of this approach on spatial and RNA sequencing of meningioma tumors from nine patients. Our method identifies enrichment of critical immune signaling DEGs in NF2-mut versus NF2-wt tumors, corroborating a link between NF2 loss and immune infiltration. Within NF2-mut tumors, our approach discovers cell-cell communication networks by identifying downstream targets colocalized with enriched ligands in the spatial profile. Together, these results show that our method enables cluster-independent exploration of the immune infiltrate during brain tumorigenesis.

Mapping cell structure across scales by fusing protein images and interactions
  • Yue Qin, University of California San Diego, United States
  • Casper Winsnes, KTH-Royal Institute of Technology, Sweden
  • Edward Huttlin, Harvard Medical School, United States
  • Fan Zheng, University of California San Diego, United States
  • Wei Ouyang, KTH-Royal Institute of Technology, Sweden
  • Jisoo Park, University of California San Diego, United States
  • Adriana Pitea, University of California San Diego, United States
  • Jason Kreisberg, University of California San Diego, United States
  • Steven Gygi, Harvard Medical School, United States
  • Wade Harper, Harvard Medical School, United States
  • Jianzhu Ma, Purdue University, United States
  • Emma Lundberg, KTH-Royal Institute of Technology, Sweden
  • Trey Ideker, University of California San Diego, United States

Short Abstract: The eukaryotic cell is a multi-scale structure with modular organization across at least four orders of magnitude. Two central approaches for mapping this structure – protein fluorescent imaging and protein biophysical association – each generate extensive datasets but of distinct qualities and resolutions that are typically treated separately. Here, we integrate immunofluorescent images in the Human Protein Atlas with ongoing affinity purification experiments from the BioPlex resource to create a unified hierarchical map of eukaryotic cell architecture. Integration involves configuring each approach to produce a general measure of protein distance, then calibrating the two measures using machine learning. The evolving map, called the Multi-Scale Integrated Cell (MuSIC 1.0), currently resolves 69 subcellular systems of which approximately half are undocumented. Based on these findings we perform 134 additional affinity purifications, validating close subunit associations for the majority of systems. The map elucidates roles for poorly characterized proteins; identifies new protein assemblies in ribosomal biogenesis and RNA splicing; and reveals crosstalk between cytoplasmic and mitochondrial ribosomal proteins. By integration across scales, MuSIC substantially increases the mapping resolution obtained from imaging while giving protein interactions a spatial dimension, paving the way to incorporate many molecular data types in proteome-wide maps of cells.

MatchMaker: A Deep Learning Framework for Drug Synergy Prediction
  • A. Ercument Cicek, Bilkent University, Turkey
  • Halil İbrahim Kuru, Bilkent University, Turkey
  • Oznur Tastan, Sabanci University, Turkey

Short Abstract: Drug combination therapies are commonly used for the treatment of complex diseases such as cancer due to increased efficacy and reduced side effects. However, experimentally validating all possible combinations for synergistic interaction, even with high-throughput screens, is intractable due to the vast combinatorial search space. Computational techniques are therefore used to reduce the number of combinations to be evaluated experimentally by prioritizing promising candidates. We present MatchMaker, a deep neural network-based drug synergy prediction algorithm that predicts synergy scores for a pair of drugs using the drugs' chemical structures and gene expression profiles of untreated cell lines as input. The model contains three neural subnetworks: two subnetworks learn representations of the two drugs separately, conditioned on the gene expression of the given cell line; the outputs of these two subnetworks are then fed into a third subnetwork that predicts the Loewe synergy score of the pair. We train MatchMaker on the DrugComb dataset, which contains 286,421 examples. MatchMaker yields improvements of up to ~15% in correlation and ~33% in mean squared error (MSE) over the next best method, DeepSynergy.

ME-VAE: Multi-Encoder Variational AutoEncoder for Controlling Multiple Transformational Features in Single Cell Image Analysis
  • Luke Ternes, Oregon Health & Science University, United States
  • Mark Dane, Oregon Health & Science University, United States
  • Marilyne Labrie, Oregon Health & Science University, United States
  • Gordon Mills, Oregon Health & Science University, United States
  • Joe Gray, Oregon Health & Science University, United States
  • Laura Heiser, Oregon Health & Science University, United States
  • Young Hwan Chang, Oregon Health & Science University, United States

Short Abstract: See attached long abstract

Measuring hidden phenotype: Quantifying the shape of barley seeds using the Euler Characteristic Transform
  • Erik Amezquita, Michigan State University, United States
  • Michelle Quigley, Michigan State University, United States
  • Tim Ophelders, TU Eindhoven, Netherlands
  • Jacob Landis, Cornell University, United States
  • Daniel Koenig, University of California Riverside, United States
  • Daniel H. Chitwood, Michigan State University, United States
  • Elizabeth Munch, Michigan State University, United States

Short Abstract: Shape plays a fundamental role in biology. Traditional phenotypic analysis methods measure some features but fail to capture the information embedded in shape comprehensively. To extract, compare, and analyze this information in a robust and concise way, we turn to Topological Data Analysis (TDA), specifically the Euler Characteristic Transform (ECT). TDA measures shape comprehensively using descriptors grounded in algebraic topology. To study its use, we compute both traditional and topological shape descriptors to quantify the morphology of 3,121 barley seeds scanned with X-ray computed tomography (CT). The ECT produces vectors that can be thought of as shape signatures for each barley seed. Using these vectors, we successfully train a support vector machine (SVM) to classify 28 different accessions of barley based solely on the 3D shape of their grains. We observe that combining traditional and topological descriptors classifies barley seeds to their correct accession better than using traditional descriptors alone. This improvement suggests that TDA is a powerful complement to traditional morphometrics, describing shape nuances that are otherwise ignored. TDA can quantify aspects of phenotype that have remained "hidden", enabling the reconstruction of objects based on their topological signatures.
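For intuition, here is a 2D toy version of the ECT (the actual work uses 3D CT voxels): for each direction, compute the Euler characteristic of the sublevel set of the height function at a sequence of thresholds; concatenating these curves over many directions gives the shape signature. The Euler characteristic of a set of filled unit pixels is vertices minus edges plus faces of the induced cubical complex.

```python
def euler_characteristic(pixels):
    """Euler characteristic (V - E + F) of a set of filled unit pixels,
    treated as a cubical complex with shared edges and corners."""
    verts, edges = set(), set()
    for (x, y) in pixels:
        for v in [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]:
            verts.add(v)
        edges.update([((x, y), (x + 1, y)), ((x, y + 1), (x + 1, y + 1)),
                      ((x, y), (x, y + 1)), ((x + 1, y), (x + 1, y + 1))])
    return len(verts) - len(edges) + len(pixels)

def ect(pixels, direction, thresholds):
    """Euler Characteristic Transform in one direction: EC of the
    sublevel set of the height function <pixel, direction> at each t."""
    dx, dy = direction
    return [euler_characteristic([p for p in pixels if p[0] * dx + p[1] * dy <= t])
            for t in thresholds]

# A 3x3 square of pixels with the center removed: an annulus, chi = 0.
ring = [(x, y) for x in range(3) for y in range(3) if (x, y) != (1, 1)]
print(euler_characteristic(ring))        # -> 0
print(ect(ring, (1, 0), [-1, 0, 1, 2]))  # EC as the shape is swept left to right
```

The hole only closes once the final column enters the sublevel set, which is exactly the kind of "hidden" topological information the ECT records and traditional descriptors miss.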

Mining Antimicrobial Resistance Genes from a Salmonella enterica pan-genome using A Cross-Validated Feature Selection (CVFS) Approach
  • Ming-Ren Yang, Taipei Medical University, Taiwan
  • Yu-Wei Wu, Taipei Medical University, Taiwan

Short Abstract: Understanding genes and their underlying mechanisms is critical in deciphering how antimicrobial-resistant (AMR) bacteria withstand the detrimental effects of antibiotic drugs. Current AMR databases, however, may not be comprehensive, since new mechanisms are continuously discovered. It is thus critical to expand the potential AMR gene repertoire for more accurate inference of AMR strains.

We developed a Cross-Validated Feature Selection (CVFS) approach for robustly mining genes related to AMR activities in an unbiased manner. The core idea behind the CVFS approach is interrogating features among randomly split, non-overlapping sub-datasets to ensure the representativeness of the features. The interrogation process is repeated several times to minimize random effects, and only features selected by most of the repeated runs are chosen as the final feature set. By testing this idea on a Salmonella enterica pan-genome dataset, we show that this approach extracts the most representative features (genes selected by the CVFS approach; CVFS-genes), which predict AMR activities very well, indicating a strong association between the CVFS-genes and AMR activities. Functional analysis demonstrates that the majority of these genes encode hypothetical proteins (i.e., with unknown functional roles), highlighting the potential of CVFS-genes to significantly expand antimicrobial resistance databases.
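
The split-interrogate-vote loop described above can be sketched as follows; this is a minimal illustration with a stand-in scoring function and hypothetical names, not the authors' code:

```python
import random
from collections import Counter

def select_features(X, y, k):
    """Score features by absolute difference in class means (a simple
    stand-in for the interrogation step) and keep the top k."""
    def score(j):
        pos = [x[j] for x, label in zip(X, y) if label == 1]
        neg = [x[j] for x, label in zip(X, y) if label == 0]
        if not pos or not neg:
            return 0.0
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return set(sorted(range(len(X[0])), key=score, reverse=True)[:k])

def cvfs(X, y, k=1, n_splits=3, n_repeats=20, seed=0):
    """Cross-Validated Feature Selection sketch: repeatedly split the
    samples into non-overlapping subsets, keep only features selected in
    every subset, and return features chosen in a majority of repeats."""
    rng = random.Random(seed)
    votes, idx = Counter(), list(range(len(X)))
    for _ in range(n_repeats):
        rng.shuffle(idx)
        chosen = None
        for s in range(n_splits):
            part = idx[s::n_splits]  # non-overlapping sub-dataset
            feats = select_features([X[i] for i in part],
                                    [y[i] for i in part], k)
            chosen = feats if chosen is None else chosen & feats
        votes.update(chosen)
    return {f for f, v in votes.items() if v > n_repeats // 2}
```

In the actual study, the samples are Salmonella enterica genomes, the features are pan-genome genes, and the selection criterion is more elaborate than this toy score.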

Multiple Instance Learning to support tumor area detection in histopathological scans
  • Sylwia Szymanska, Silesian University of Technology, Poland
  • Joanna Polanska, Silesian University of Technology, Poland

Short Abstract: Breast cancer is one of the most common causes of death among women. The acquisition of an annotated dataset is still a time-consuming process; therefore, the use of computer-aided diagnosis (CAD) for automatic classification of histopathological images can improve the analysis process.
Our goal was to investigate selected machine learning algorithms on datasets consisting of images of invasive ductal carcinoma (IDC) breast cancer, from which 277,524 slices were extracted (198,738 negative and 78,786 positive). The determined features and/or interactions were later used to analyse the precision of tumour area detection.
Analysis was based on three Multiple Instance Learning (MIL) methods: APR MIL, Citation-kNN MIL, and MILBoost. The data were grouped into labelled bags, which allows the use of less descriptive annotations of instances in datasets. We compared the results from three different datasets: the first containing features, the second interactions, and the third both features and interactions. Indicators calculated from the confusion matrix showed that the best detection of tumour areas was achieved using the set containing both features and interactions. The highest accuracy was obtained by MILBoost (96-95%), followed by Citation-kNN MIL (81-70%) and APR MIL (66-60%).
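
The bag construction common to all three MIL methods can be sketched as follows (a generic illustration with hypothetical names, not the authors' code): instances are grouped into labelled bags, and under the standard MIL assumption a bag is positive if any of its instances is.

```python
def make_bags(instances, labels, bag_size):
    """Group instance-level data into labelled bags; a bag inherits a
    positive label if any member instance is positive."""
    bags = []
    for s in range(0, len(instances), bag_size):
        chunk = instances[s:s + bag_size]
        bags.append((chunk, int(any(labels[s:s + bag_size]))))
    return bags

def predict_bag(instances, instance_model):
    """Bag-level prediction: positive iff at least one instance is
    predicted positive by the instance-level model."""
    return int(any(instance_model(x) for x in instances))
```

This is why MIL needs only bag-level annotations: the slice-level labels inside a bag never have to be specified individually.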

Multitask group Lasso for Genome-Wide Association Studies in admixed populations
  • Asma Nouira, MINES PARISTECH - Institut Curie - INSERM, France
  • Chloé-Agathe Azencott, MINES PARISTECH - Institut Curie - INSERM, France

Short Abstract: Population stratification refers to the presence of differences in allele frequencies between subpopulations within samples, due to different ancestry. It is one of the major challenges in Genome-Wide Association Studies (GWAS), as it increases type I error. An additional issue in GWAS is the presence of correlation between SNPs, or Linkage Disequilibrium (LD). To account for LD, we consider associations at the level of LD-groups (groups of correlated SNPs) rather than at the individual SNP level. In this contribution, we introduce a multitask group Lasso for feature selection, where each task corresponds to a subpopulation and each feature corresponds to an LD-group. Our algorithm selects either LD-groups shared across all tasks or population-specific LD-groups. We incorporate stability selection to improve the stability of sparsity-enforcing penalties, and we use safe screening rules to provide a significant speed-up and scale the algorithm to GWAS data. To our knowledge, this is the first framework for admixed populations that combines feature selection, stability selection, and safe screening rules at the LD-group level. We show that our approach outperforms all standard methods on a simulated dataset and on two real cancer datasets.
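
A standard building block of solvers for group-Lasso-type penalties like the one described is the group soft-thresholding (proximal) operator, which shrinks, and possibly zeroes out, an entire group, here an LD-group, at once. A minimal numpy sketch (names ours, not the authors' solver):

```python
import numpy as np

def group_soft_threshold(v, lam):
    """Proximal operator of the group Lasso penalty for one group:
    shrink the whole coefficient block toward zero by its L2 norm,
    setting it exactly to zero when the norm falls below lam."""
    norm = np.linalg.norm(v)
    if norm <= lam:
        return np.zeros_like(v)
    return (1 - lam / norm) * v

def prox_group_lasso(beta, groups, lam):
    """Apply group-wise shrinkage; `groups` lists index arrays, one per
    LD-group of correlated SNPs."""
    out = np.array(beta, dtype=float)
    for g in groups:
        out[g] = group_soft_threshold(out[g], lam)
    return out
```

Because whole groups are zeroed together, the fitted model selects or discards entire LD-groups rather than individual SNPs, which is exactly the granularity the abstract argues for.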

Non-negative matrix factorization of multi-species single-cell RNA-seq data
  • William Stafford Noble, University of Washington, United States
  • Xinxian Deng, University of Washington, United States
  • Christine Disteche, University of Washington, United States
  • Mu Yang, University of Washington, United States
  • Giancarlo Bonora, University of Washington, United States
  • Jacob Schreiber, Stanford University, United States

Short Abstract: Analyzing transcriptomic data collected from multiple species can be useful for studying evolutionary changes between genes in different species. When considering a single species, a powerful tool for collecting such data is high-throughput single-cell sequencing technology, which allows profiling of hundreds of thousands of cells to enable the exploration of expression differences among them. However, cross-species data require establishing a common set of genes, which can be challenging given the existence of orthologous and paralogous genes. Single-cell datasets also raise issues such as data sparsity and batch effects. We propose to use an extension of non-negative matrix factorization (NMF), based on a deep neural network, to compare expression of orthologous X-linked genes to that of their corresponding autosomal genes across species. Our data, represented as a 4D tensor, is a single-cell RNA-seq dataset that consists of multiple species and cell types. We extract information from the tensor by using deep NMF to induce latent factors corresponding to genes, cells, species, and cell types. We demonstrate that, using this approach, cells from different species and cell types can be jointly embedded in a latent space, which should facilitate cross-species and cross-cell-type expression comparison of orthologs.

On the estimation of epigenetic energy landscapes from nanopore sequencing data
  • Jordi Abante, Johns Hopkins University, United States
  • Sandeep Kambhampati, Johns Hopkins University, United States
  • Andrew P. Feinberg, Johns Hopkins University, United States
  • John Goutsias, Johns Hopkins University, United States

Short Abstract: High-throughput third-generation sequencing devices, such as nanopore sequencing, can generate long reads spanning thousands of bases. This new technology offers the possibility of considering a wide range of epigenetic modifications and provides the capability to interrogate previously inaccessible regions of the genome, such as highly repetitive regions, and perform comprehensive allele-specific methylation analysis, among other applications. It is well-known, however, that detection of DNA methylation from nanopore data results in a substantially reduced per-read accuracy compared to bisulfite sequencing, due to noise introduced by the sequencer and its underlying pore chemistry. Therefore, new methods must be developed for reliable modeling and analysis of DNA methylation landscapes using nanopore sequencing data. Here we introduce such a method and, by using simulations, we provide evidence of its superiority to the state-of-the-art. The proposed approach establishes a solid foundation for developing a comprehensive framework for the statistical analysis of DNA methylation and possibly of other epigenetic marks using nanopore sequencing data and potential energy landscapes.

OncoML: A Multi-omics-based Machine Learning Approach for Targeted Cancer Drug Prediction
  • Darsh Mandera, Independent, United States

Short Abstract: The current approach to cancer treatment is a one-size-fits-all approach that fails to account for tumor heterogeneity, resulting in 75% ineffectiveness of cancer treatment. Recent research has focused on drug prediction by applying machine learning to genetic mutations or using microRNA (miRNA), a key biomarker of cancer. Although these approaches demonstrate improved potential for targeted drug prediction, they present some limitations. Gene mutations have been shown to account for only a small subset of candidate biomarkers, and while miRNA-based gene expression is regarded as offering more predictive modalities, both can be complemented by the integration and analysis of a multi-omic view of cancer. In this research, a machine learning model was trained and tested with over 80% of cancer types using gene mutation, miRNA, and drug response data from The Cancer Genome Atlas. Feature selection using ExtraTreesClassifier identified 945 gene mutations and miRNAs as key features out of over 18,000 features. The model was tested with multiple machine learning classifiers, including DecisionTree, K-NearestNeighbors, and Ensemble Learning-based approaches (AdaBoostClassifier and OneVsRestClassifier). OneVsRestClassifier, when combined with cross-validation, outperformed the other approaches and can predict drugs for cancer patients based on their gene mutations and miRNA data with an accuracy of 83%.

PandoraGAN: Generating antiviral peptides using Generative Adversarial Network
  • Shraddha Surana, ThoughtWorks technologies, India
  • Pooja Arora, ThoughtWorks technologies, India
  • Divye Singh, ThoughtWorks technologies, India
  • Deepti Sahasrabuddhe, ThoughtWorks technologies, India
  • Jayaraman Valadi, ThoughtWorks technologies, India

Short Abstract: The continuous increase in pathogenic viruses and the intensive laboratory research required to develop novel antiviral therapies pose challenges for cost- and time-efficient drug design. This accelerates the search for alternative drug candidates and contributes to the recent rise in research on antiviral peptides against many viruses. With limited information regarding these peptides and their activity, modifying an existing peptide backbone or developing a novel peptide is a very time-consuming and tedious process. Advanced deep learning approaches such as generative adversarial networks (GANs) can help wet-lab scientists screen potential antiviral candidates of interest and expedite the initial stage of peptide drug development. To our knowledge, this is the first use of GAN models for antiviral peptides across the viral spectrum.

Results: In this study, we develop PandoraGAN, which utilizes a GAN to design bioactive antiviral peptides. Available antiviral peptide data were manually curated to prepare a training dataset that includes peptides with lower IC50 values. We further validated the generated sequences by comparing the physico-chemical properties of the generated antiviral peptides with the training data.

Pathogenic potential prediction of novel fungal DNA using ResNets based on a newly curated fungi-host database
  • Jakub M. Bartoszewicz, Data Analytics & Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Ferdous Nasri, Data Analytics & Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany
  • Bernhard Y. Renard, Data Analytics & Computational Statistics, Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Germany

Short Abstract: Novel fungal disease outbreaks are happening more often now than ever, and they are hard to predict and diagnose. Currently, a multi-drug resistant fungal species, Candida auris, is emerging worldwide with a mortality rate of 30-60%. Pathogenic fungi are under-researched, while they are constantly evolving to survive in new environments and hosts as their natural habitats are destroyed by climate change. At the same time, genetic sequencing technologies have advanced rapidly, opening the door for computational approaches to forecast pathogenic potential.

We construct the most comprehensive database, to our knowledge, of pathogenic fungal species and their hosts by mining literature, books, and existing databases. This database consists of 982 pathogenic fungal species, including 471 species with publicly available genomic data. Furthermore, we present a classifier based on Residual Neural Networks (ResNets) to predict the pathogenic potential of novel fungi directly from unannotated DNA sequences. We compare our model with the Basic Local Alignment Search Tool (BLAST) and achieve higher prediction accuracy. We believe our work provides valuable information and opens the door for future research in fungal pathology.

PENGUINN: Precise Exploration of Nuclear G-Quadruplexes Using Interpretable Neural Network
  • Eva Klimentová, Faculty of Informatics, Masaryk University, Czechia
  • Jakub Poláček, Faculty of Informatics, Masaryk University, Czechia
  • Petr Šimeček, Central European Institute of Technology (CEITEC), Masaryk University, Czechia
  • Panagiotis Alexiou, Central European Institute of Technology (CEITEC), Masaryk University, Czechia

Short Abstract: G-quadruplexes (G4s) are a class of stable nucleic acid secondary structures that are known to play a role in a wide spectrum of genomic functions, such as DNA replication and transcription. The classical understanding of G4 structure points to four variable-length guanine strands joined by variable-length nucleotide stretches. G4 immunoprecipitation and sequencing experiments have produced a large number of highly probable G4-forming genomic sequences. The expense and technical difficulty of experimental techniques highlight the need for computational approaches to G4 identification. Here, we present PENGUINN, a machine learning method based on convolutional neural networks, that learns the characteristics of G4 sequences and accurately predicts G4s, outperforming state-of-the-art methods. We provide both a standalone implementation of the trained model and a web application that can be used to evaluate sequences for their G4 potential.
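
The classical G4 pattern described above ("four variable-length guanine strands joined by variable-length nucleotide stretches") is often approximated by a simple regular expression. The sketch below shows that common heuristic baseline, not PENGUINN's neural network:

```python
import re

# Canonical G4 heuristic: four runs of >=3 guanines separated by
# loops of 1-7 arbitrary nucleotides.
G4_PATTERN = re.compile(
    r"G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}[ACGT]{1,7}G{3,}")

def find_g4(seq):
    """Return (start, end) spans of candidate canonical G4s in seq."""
    return [m.span() for m in G4_PATTERN.finditer(seq.upper())]
```

Pattern-based rules of this kind miss non-canonical G4s (e.g. bulged or long-loop variants), which is part of the motivation for learning G4 characteristics from sequencing data instead.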

Predicting Condition-Specific Gene Expression in Fungi
  • Ananthan Nambiar, Department of Bioengineering, University of Illinois at Urbana-Champaign, United States
  • Veronika Dubinkina, Department of Bioengineering, University of Illinois at Urbana-Champaign, United States
  • Simon Liu, Department of Computer Science, University of Illinois at Urbana-Champaign, United States
  • Sergei Maslov, Department of Bioengineering, University of Illinois at Urbana-Champaign, United States

Short Abstract: Bioengineered fungi can be used as ‘microbial factories’ to manufacture biofuels and other chemicals through processes that are economically and ecologically sustainable. To do so, it is important to understand the organism’s condition-specific transcriptional regulation of genes. Here we propose a neural network, FUngal pRomoter sequence to conditiOn specific expRession (FUROR), that takes as inputs the promoter sequence of a gene and the expression of transcription factors (TFs) to predict the condition-specific expression of the gene as High, Mean, or Low. We tested FUROR on N. crassa, for which it achieved an accuracy of 0.58 on a test set of randomly withheld elements from the gene-condition matrix and 0.48 on withheld conditions. Both of these results are significantly better than random assignment of classes. The neural network was also interpreted for the de novo discovery of TF-DNA binding motifs. This indicates that FUROR could be used for the discovery of gene regulatory networks of non-model fungal species. Currently, we are extending the model to other species, including I. orientalis.

Predicting Protein Relative Solvent Accessible Surface Using Graph Neural Networks
  • Gonzalo Rubio, Genomica Bioinformatics Ltd., Canada
  • German Novakovsky, Genomica Bioinformatics Ltd., Canada
  • Ophir Greif, Genomica Bioinformatics Ltd., Canada
  • Eldad Haber, Genomica Bioinformatics Ltd., Canada

Short Abstract: Proteins are complex molecules that have vital roles in virtually every biological function. Understanding their properties, including local structure, surface accessibility, and interactions is key to developing therapeutics and understanding their overall function.

In this work, we develop techniques to assess the relative solvent accessible surface area (rASA). If the 3D structure of a protein is known, the rASA can be computed easily. In most cases, however, the structure is unknown, and the rASA is difficult to assess.

To do so, we use graph neural networks accompanied by a data preprocessing pipeline that uses homologous proteins to generate graph properties. We use the ProteinNet database, which was generated for the task of training protein folding algorithms. This allowed us to train our network on roughly 40K proteins with known structures, accompanied by their Multiple Sequence Alignments (MSAs) and Position-Specific Scoring Matrices (PSSMs). The MSAs provide an estimate of the contact maps, which are used in the graph network to explore long-range relationships between residues. As we show in our numerical experiments, we significantly improve over the state of the art on some of the latest datasets.

Prediction of Antiviral Activity in Peptides from Venomous Animals Using an Ensemble Deep Learning Model
  • Caio Fontes de Castro, Institute of Mathematics and Statistics, University of São Paulo, Brazil
  • Fernanda Midori Abukawa, Laboratory of Applied Toxinology, Butantan Institute, Brazil
  • Milton Yutaka Nishiyama Junior, Laboratory of Applied Toxinology, Butantan Institute, Brazil

Short Abstract: Antiviral Peptides (AVPs), a subset of Antimicrobial Peptides (AMPs), have been studied as alternatives to traditional antiviral drugs. However, the identification of promising bioactive peptides, especially from venomous animals, is a challenge, given that the development and pre-clinical testing of a single peptide can be expensive and show low effectiveness. This has motivated the development of many machine learning and deep learning models that attempt to predict AVPs based on amino acid sequence and derived features. We have developed an ensemble model, based on both a Random Forest (RF) model and an LSTM network, that combines physicochemical and sequence information to predict AVPs. To train the model, we gathered experimentally validated AMPs from seven databases, resulting in 15,650 AMPs and 2,818 validated AVPs. We tested our model with several feature sets as input for the RF; the best-performing models utilized PCP descriptors for the RF and achieved 10-fold cross-validation mean values of 96% accuracy and 0.98 AUC in the AVP prediction task, and 96% accuracy and 0.99 AUC in the general AMP prediction task. Next, toxin proteins from Arachnida species will be used in an attempt to identify novel AVPs and AMPs.

Prediction of Drug-Kinase Binding Affinities with Focus on Conserved Protein Kinase Domain
  • Davor Oršolić, Ruđer Bošković Institute, Croatia
  • Bono Lučić, Ruđer Bošković Institute, Croatia
  • Višnja Stepanić, Ruđer Bošković Institute, Croatia
  • Tomislav Šmuc, Ruđer Bošković Institute, Croatia

Short Abstract: Protein kinases, important signaling proteins whose deregulation or over-expression contributes to many diseases, are among the most common pharmacological targets for novel drug discovery.
Previous approaches implemented on benchmark datasets show poor performance in more rigorous test settings containing previously unseen small compounds or protein kinase targets, limiting their real-world application. For this reason, we extended the compound space of putative protein kinase inhibitors by combining several publicly available databases with popular benchmark datasets. In addition, we limited the size of the input information by considering only compounds with small molecular weight (< 900 Da) and learning representations of the conserved protein kinase domain sequence only, instead of using whole sequences, since most kinase inhibitors bind to the ATP binding site or to other catalytic subunits in its proximity.
Our predictive methodology relied on an ensemble approach, XGBoost trained on fingerprint-based and sequence-based similarity features, as a baseline, and on graph convolutional networks (GCNs) as a more advanced representation-learning method. To assess the uncertainty of model predictions, we defined a structure-based applicability domain focused on the density of compound space in the training set.

Prediction of Human Pancreas Cell Types via ConvNet on Two-dimensional Mapping of Single-cell RNA-seq Data
  • Akram Vasighizaker, University of Windsor, Canada
  • Li Zhou, University of Windsor, Canada
  • Luis Rueda, University of Windsor, Canada

Short Abstract: Single-cell RNA sequencing (scRNA-seq) technology has been widely applied in biological studies such as drug discovery. Prior to in-depth investigations of the functionality of single cells for pathological goals, identification of cell types is an essential step that can be sped up using computational methods. Recently, supervised learning methods have been developed to automatically identify cell types, but due to the lack of sufficient annotated datasets, these methods have not been commonly used in scRNA-seq studies. Classification methods can take advantage of feature selection techniques to improve cell type prediction while identifying the most informative genes among the high number of genes in high-dimensional scRNA-seq datasets. In this regard, we introduce a combination of two powerful techniques, representation learning and unsupervised feature selection, to achieve automatic cell type identification in two steps. An average prediction accuracy of 98% was obtained on six different cell types in a Human Pancreas scRNA-seq dataset. In addition, we found that 11 out of 13 selected genes are biologically related to two cell types in the Human Pancreas, which confirms the effectiveness of the proposed approach.

Probeset selection for targeted spatial transcriptomics
  • Fabian Theis, Helmholtz Center Munich, Germany
  • Malte Lücken, Helmholtz Center Munich, Germany
  • Louis Kümmerle, Helmholtz Center Munich, Germany
  • Lukas Heumos, Helmholtz Center Munich, Germany

Short Abstract: Spatial transcriptomics is an emerging technology that enables cellular variation to be placed into a spatial context. This has produced novel insights into tissue heterogeneity under healthy and diseased conditions. High resolution and cost-efficient methods require a pre-selection of genes that will be measured. Here we present a method to select probe sets for spatial transcriptomics and benchmark this approach and others in a custom pipeline. Our method optionally incorporates prior knowledge and accounts for experimental constraints.

ProteinBERT: A universal protein language & function model
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel
  • Dan Ofer, The Hebrew University of Jerusalem, Israel

Short Abstract: Self-supervised deep learning is a powerful approach for sequence modeling in natural language and potentially biological sequences. However, existing models (e.g. BERT) and pre-training methods are designed for natural languages, not protein sequences. Protein-specific models and pre-training tasks are necessary to better capture the information within protein sequences. Here, we introduce a novel deep-learning model based on BERT, called ProteinBERT. We present a pre-training scheme that consists of masked language modeling combined with a protein-specific pre-training task of predicting Gene Ontology (GO) functions. We introduce novel architectural elements that make the model more efficient and flexible with very large sequence lengths and multiple inputs. We obtain state-of-the-art results despite using a far smaller model than other deep learning models, and show major benefits from pre-training. Code and models are available.

RandomSCM: interpretable ensembles of sparse classifiers tailored for omics data
  • Thibaud Godon, Laval University, Canada
  • Pier-Luc Plante, Laval University, Canada
  • Baptiste Bauvin, Laval University, Canada
  • Élina Francovic-Fontaine, Laval University, Canada
  • Alexandre Drouin, Element AI, a ServiceNow company; Laval University, Canada
  • François Laviolette, Laval University, Canada
  • Jacques Corbeil, Laval University, Canada

Short Abstract: Recent metabolomics measurement devices, such as mass spectrometers, produce extremely high-dimensional data. Together with small sample sizes, this setting is known as the fat-data (or p >> n) problem. Biomarker discovery in this configuration is a challenge: classical statistical methods fail, and common Machine Learning (ML) algorithms produce models too complex to be interpretable. ML algorithms that rely on sparsity to predict phenotypes using very few covariates have been shown to thrive in this setting. While sparsity helps to avoid overfitting, it also leads to concise models that are easier to interpret for biomarker discovery.

The Set Covering Machine (SCM) algorithm produces sparse models based on simple decision rules. Recent work has applied SCMs to the genotype-to-phenotype prediction of antibiotic resistance and achieved state-of-the-art accuracy. To adapt this approach to metabolomics (fat) data, we developed a bootstrap aggregation of SCM models: RandomSCM.

We explored applications of RandomSCM beyond genotype-to-phenotype prediction by applying it to five metabolomics datasets. Prediction performance is at the state-of-the-art level. Furthermore, studying the decision rules in RandomSCM revealed valid biomarkers of the phenotypes. These results demonstrate the high potential of the RandomSCM algorithm for biomarker discovery in the omics sciences.
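
For illustration, the greedy rule selection at the core of a single SCM (RandomSCM aggregates many such models over bootstrap samples) can be sketched as follows; the toy rule set, trade-off parameter p, and function names are ours, not the authors' implementation:

```python
def greedy_scm_conjunction(rules, X, y, p=1.0, max_rules=3):
    """Greedy Set Covering Machine sketch (conjunction case): repeatedly
    add the boolean rule that excludes the most remaining negative
    examples, with penalty p on positives it also excludes."""
    neg = {i for i, label in enumerate(y) if label == 0}
    pos = {i for i, label in enumerate(y) if label == 1}
    model = []
    while neg and len(model) < max_rules:
        def utility(rule):
            covered = sum(1 for i in neg if not rule(X[i]))  # negatives removed
            errors = sum(1 for i in pos if not rule(X[i]))   # positives lost
            return covered - p * errors
        best = max(rules, key=utility)
        if utility(best) <= 0:
            break
        # keep only examples still classified positive by the conjunction
        neg = {i for i in neg if best(X[i])}
        pos = {i for i in pos if best(X[i])}
        model.append(best)
    return model

def predict(model, x):
    """A conjunction predicts positive only if every rule fires."""
    return int(all(rule(x) for rule in model))
```

The sparsity cap (`max_rules`) is what keeps the resulting models small enough to read off candidate biomarkers directly from the selected rules.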

Real-time pathogenicity prediction during sequencing of novel viruses and bacteria
  • Jakub Bartoszewicz, Hasso Plattner Institute, Germany
  • Ulrich Genske, Hasso Plattner Institute, Germany
  • Bernhard Renard, Hasso Plattner Institute, Germany

Short Abstract: Novel pathogens evolve quickly and may emerge rapidly, causing dangerous outbreaks or even global pandemics. Next-generation sequencing is the state-of-the-art in open-view pathogen detection, and one of the few methods available at the earliest stages of an epidemic, even when the biological threat is unknown. Analyzing the samples as the sequencer is running can greatly reduce the turnaround time, but existing tools rely on close matches to lists of known pathogens and perform poorly on novel species. Machine learning approaches can predict if single reads originate from more distant, unknown pathogens, but require relatively long input sequences and processed data from a finished run. We train ResNets to classify raw, incomplete Illumina and Nanopore reads and integrate our models with HiLive2, a real-time Illumina mapper. This approach outperforms alternatives based on machine learning and sequence alignment on simulated and real data, including SARS-CoV-2 sequencing runs. After just 50 Illumina cycles, we observe an 80-fold sensitivity increase compared to real-time mapping. The first 250 bp of Nanopore reads, corresponding to 0.5 s of sequencing time, are enough to yield predictions more accurate than mapping the finished long reads. The approach could also be used for screening synthetic sequences against biosecurity threats.

Representation learning of genomic sequence motifs via information maximization
  • Peter Koo, Cold Spring Harbor Laboratory, United States
  • Nicholas Lee, Cold Spring Harbor Laboratory, United States

Short Abstract: Convolutional neural networks have been applied in supervised learning to a variety of computational genomics problems, taking DNA sequences as inputs and predicting regulatory functions as outputs. Despite their strong performance, these supervised learning approaches have limitations: supervised models commonly learn only the sequence features that immediately help accurate prediction of the regulatory function outputs, ignoring other features present within the input sequences. Additionally, the genomic features learned by supervised models are often very basic sequence features (e.g., GC content).

Here we present Genomic Representations with Information Maximization (GRIM), an unsupervised learning method based on the Infomax principle that enables more comprehensive identification of whole sequence motifs. We demonstrate that GRIM is able to discover motifs known to be present in genomic sequences but which are not detectable using supervised methods. We also demonstrate the efficacy of the representations of genomic sequences learned by GRIM by showing that relatively simple models trained on these representations can approach the performance of more complex, fully supervised models trained on raw genomic sequences. We further demonstrate the utility of GRIM in analyzing several in vivo genomic datasets, illuminating use cases for our method.

Scaled Bernoulli Mixture Model for Clustering of Single-cell ATAC-seq Data
  • Mudassar Iqbal, University of Manchester, United Kingdom
  • Syed Murtuza Baker, University of Manchester, United Kingdom
  • Adam Farooq, Aston University, United Kingdom
  • Magnus Rattray, University of Manchester, United Kingdom

Short Abstract: Single-cell and single-nucleus ATAC-seq methods are increasingly employed for studying chromatin accessibility. Due to technical issues in the sequencing protocols, there may be large differences in sequencing depth across cells. This can strongly impact the downstream clustering analysis and commonly employed approaches can produce clusters defined by these sequencing artefacts rather than by the underlying biology. This warrants the development of a clustering approach for binary ATAC-seq that is capable of dealing with differences in sequencing depth.
We develop a binary mixture model in which the underlying Bernoulli distribution is modified with an additional cell-specific parameter modelling sequencing depth. We develop a bespoke Expectation-Maximisation-based inference method combined with a model-based feature-selection approach and a cluster splitting/merging heuristic to improve performance. Our method robustly identifies clusters and informative open chromatin features, and can automatically detect the number of clusters. We validate our method on synthetic data and apply it to publicly available real single-cell ATAC-seq datasets. We compare against a standard Bernoulli mixture model and a state-of-the-art clustering method, and show that our method is tolerant to variation in sequencing depth and provides biologically meaningful clustering.
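
For illustration, EM inference for a plain Bernoulli mixture, i.e. the baseline model without the cell-specific depth parameter or the splitting/merging heuristic described above, can be sketched as follows (all names are ours):

```python
import numpy as np

def bernoulli_mixture_em(X, k, n_iter=50, seed=0):
    """EM for a k-component Bernoulli mixture over binary accessibility
    profiles X (cells x peaks). Returns mixture weights pi, per-cluster
    open probabilities theta, and responsibilities r."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                   # mixture weights
    theta = rng.uniform(0.25, 0.75, (k, d))    # per-cluster open probabilities
    for _ in range(n_iter):
        # E-step: log responsibilities, kept in log space for stability
        log_p = (X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
                 + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights and open probabilities
        nk = r.sum(axis=0)
        pi = nk / n
        theta = np.clip((r.T @ X) / nk[:, None], 1e-6, 1 - 1e-6)
    return pi, theta, r
```

Because theta is shared across all cells in a cluster, a low-depth cell with few observed open peaks can be pulled into the wrong cluster; the poster's cell-specific depth parameter is designed to absorb exactly that effect.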

Schema: metric learning enables interpretable synthesis of heterogeneous single-cell modalities
  • Rohit Singh, Massachusetts Institute of Technology, United States
  • Brian Hie, Massachusetts Institute of Technology, United States
  • Ashwin Narayan, Massachusetts Institute of Technology, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States

Short Abstract: A complete understanding of biological processes requires synthesizing information across heterogeneous modalities, such as age, disease status, or gene expression. Technological advances in single-cell profiling have enabled researchers to assay multiple modalities simultaneously. We present Schema, which uses a principled metric learning strategy that identifies informative features in a modality to synthesize disparate modalities into a single coherent interpretation. We use Schema to infer cell types by integrating gene expression and chromatin accessibility data; demonstrate informative data visualizations that synthesize multiple modalities; perform differential gene expression analysis in the context of spatial variability; and estimate evolutionary pressure on peptide sequences.

scJoint: transfer learning for data integration of atlas-scale single-cell RNA-seq and ATAC-seq
  • Yingxin Lin, The University of Sydney, Australia
  • Tung-Yu Wu, Stanford University, United States
  • Sheng Wan, National Chiao Tung University, Taiwan
  • Jean Yang, The University of Sydney, Australia
  • Wing Wong, Stanford University, United States
  • Rachel Wang, The University of Sydney, Australia

Short Abstract: Single-cell multi-omics data continues to grow at an unprecedented pace, and effectively integrating different modalities holds the promise of better characterization of cell identities. Although a number of methods have demonstrated promising results in integrating multiple modalities from the same tissue, the complexity and scale of data compositions typically present in cell atlases still pose a significant challenge for existing methods. Here we present scJoint, a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and uses a neural network to train on labeled and unlabeled data simultaneously, enabling label transfer and joint visualization in an integrative framework. Using multiple atlas datasets and a biologically varying multi-modal dataset, we demonstrate that scJoint is computationally efficient and consistently achieves significantly higher cell type label accuracy than existing methods while providing meaningful joint visualizations. This suggests scJoint is effective in overcoming the heterogeneity of different modalities towards a more comprehensive understanding of cellular phenotypes.

Towards More Realistic Simulated Datasets for Benchmarking Deep Learning Models in Regulatory Genomics
  • Anshul Kundaje, Stanford University, United States
  • Eva Prakash, Stanford University, United States
  • Avanti Shrikumar, Stanford University, United States

Short Abstract: Interpretation of neural networks trained on regulatory sequence data is an important problem in computational genomics. Several methods such as in-silico mutagenesis, Grad-CAM, DeepLIFT, and Integrated Gradients have been developed to explain such networks. However, the limitations that arise when applying these methods to genomic data are not completely understood. While simulated datasets with known ground-truth DNA motifs can be used to test whether a given interpretability method can accurately recover the motifs, such simulations do not reflect the complexity of real biological data.

In this work, we propose a systematic pipeline for designing simulated datasets to mirror the complexity of a given experimental dataset. We apply the pipeline to build simulated datasets based on publicly-available chromatin accessibility experiments. We use these simulated datasets to quantify the performance and identify pitfalls of different interpretation methods based on how well they can recover the ground-truth motifs. We further explore the impact of user-defined settings on the interpretation methods, and find that some commonly-used settings from the computer vision literature are not always a good choice for genomics. Based on our analysis, we suggest some best practices for practitioners interested in applying these model interpretation methods to their own genomic datasets.
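The basic building block of such simulations is planting a known ground-truth motif in a subset of random sequences. The sketch below shows only this planting step; the authors' pipeline additionally matches the statistics of a real experimental dataset, which this toy version does not attempt. The motif `TGACTCA` (an AP-1-like consensus) is chosen here purely for illustration.

```python
import numpy as np

def simulate_motif_dataset(n_seqs=200, seq_len=100, motif="TGACTCA", seed=0):
    """Generate a toy labelled dataset with a known ground-truth motif
    planted in the positive sequences at a random position.

    Returns (sequences, labels), where label 1 means the motif was planted.
    """
    rng = np.random.default_rng(seed)
    bases = np.array(list("ACGT"))
    seqs, labels = [], []
    for i in range(n_seqs):
        seq = list(rng.choice(bases, size=seq_len))   # random background
        label = i % 2                                 # alternate pos/neg
        if label:
            start = rng.integers(0, seq_len - len(motif))
            seq[start:start + len(motif)] = list(motif)
        seqs.append("".join(seq))
        labels.append(label)
    return seqs, labels
```

With the motif positions known, one can score an interpretation method by how much attribution it places inside the planted motif versus the background, which is the evaluation strategy the abstract describes.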

Towards Multimodal Transformers Trained on Biomedical Text and Knowledge Graphs
  • Helena Balabin, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Charles Tapley Hoyt, Laboratory of Systems Pharmacology, Harvard Medical School, United States
  • Benjamin M Gyori, Laboratory of Systems Pharmacology, Harvard Medical School, United States
  • John Bachman, Laboratory of Systems Pharmacology, Harvard Medical School, United States
  • Colin Birkenbihl, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Alpha Tom Kodamullil, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Martin Hofmann-Apitius, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany
  • Daniel Domingo-Fernández, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Germany

Short Abstract: Although Transformer-based Language Models (LMs) can learn contextualized representations of language, they have difficulties in representing factual knowledge. Many proposed solutions are based on integrating relational facts in the form of Knowledge Graph (KG) triples into LMs, which has proven to significantly improve such contextualized representations, specifically in the biomedical domain. However, a major limitation of these approaches is their dependence on entity linking (i.e., the process of aligning text tokens and KG entities). To overcome this impediment, we propose a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (STonKGs), which can operate on unaligned pairs of text sequences and KG triples. STonKGs is a large-scale Transformer-based model trained on several million text-triple pairs from PubMed, assembled by the Integrated Network and Dynamical Reasoning Assembler (INDRA). First, we adapt the output of node2vec to represent KG triples as sequential input data. This, in conjunction with text tokens, is used as the input to our model, which then uses multi-modal attention to learn rich interdependencies between text tokens and KG entities. By evaluating STonKGs on various fine-tuning tasks and comparing it to an NLP- and a KG-baseline, we empirically validate the added value of our knowledge integration method.

Transcriptomic deconvolution of neuroendocrine neoplasms predicts clinically relevant characteristics via training on data of healthy origin
  • Raik Otto, Humboldt-Universität zu Berlin, Germany
  • Katharina Detjen, Charité Berlin, Germany
  • Pamela Riemer, Laboratory of Molecular Tumorpathology, Institute of Pathology, University Medicine Charite, Germany
  • Carsten Groetzinger, Virchow-Klinikum, University Medicine Charite, Germany
  • Guido Rindi, Università Cattolica del Sacro Cuore, Italy
  • Bertram Wiedenmann, Charité Berlin, Germany
  • Christine Sers, Laboratory of Molecular Tumorpathology, Institute of Pathology, University Medicine Charite, Germany
  • Ulf Leser, Institut für Informatik, Humboldt-Universität zu Berlin, Germany

Short Abstract: Comprehensive training of Machine-Learning models is frequently not possible for rare and diverse cancer types such as pancreatic neuroendocrine neoplasms (panNENs). We report on a novel data-augmentation technique which substitutes neoplastic training data with data of healthy origin based on a transcriptomic deconvolution algorithm. The output of the deconvolution is subsequently utilized as training data for Machine-Learning models, which in turn predict clinical characteristics of panNENs. Deconvolution-trained models efficiently predict the neoplastic grading and disease-related patient survival, and can differentiate between the neuroendocrine tumor and carcinoma subtypes, while achieving the same prediction accuracy as a baseline model trained on neoplastic expression data and on panNENs classified by the gold-standard Ki-67 biomarker. The method thus complements the clinical characterization of panNENs and of rare cancer types in general.
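The abstract does not detail the deconvolution algorithm itself, but the generic form of transcriptomic deconvolution — estimating the composition of a bulk profile from reference signatures of healthy origin — can be sketched with non-negative least squares. This is a standard textbook formulation, not the authors' specific method.

```python
import numpy as np
from scipy.optimize import nnls

def deconvolve(bulk, reference):
    """Estimate cell-type fractions in a bulk expression profile by
    non-negative least squares against reference profiles of healthy
    origin, then normalise the coefficients to fractions.

    bulk      : (genes,) bulk expression vector
    reference : (genes, cell_types) signature matrix
    """
    coef, _ = nnls(reference, bulk)
    total = coef.sum()
    return coef / total if total > 0 else coef
```

In the workflow the abstract describes, outputs of such a deconvolution step (rather than scarce neoplastic samples) would then serve as training data for the downstream predictive models.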

Using topic modeling to detect cellular crosstalk in scRNA-seq
  • Alexandrina Pancheva, University of Glasgow, United Kingdom
  • Helen Wheadon, University of Glasgow, United Kingdom
  • Simon Rogers, University of Glasgow, United Kingdom
  • Thomas Otto, University of Glasgow, United Kingdom

Short Abstract: Cell-cell interactions are vital for numerous biological processes including development, differentiation, and response to inflammation. Currently, most methods for studying interactions at the scRNA-seq level are based on curated databases of ligands and receptors. Whilst useful, such methods are limited by current biological knowledge. Recent advances in single-cell protocols have allowed physically interacting cells to be captured, enabling complementary methods for studying interactions that do not rely on prior information. We introduce a new method, based on Latent Dirichlet Allocation (LDA), for detecting genes whose expression changes as a result of interaction in such datasets. We validate our method on synthetic data before applying our approach to two datasets of physically interacting cells, allowing us to identify genes that change as a result of interaction. For each dataset we produce a ranking of genes that change in subpopulations of the interacting cells. Lastly, we apply our method to a dataset generated by a standard droplet-based protocol not designed to capture interacting cells, and discuss its suitability for analysing interaction. We are able to rank genes that change as a result of interaction without relying on prior clustering and the generation of synthetic reference profiles, as current methods do.
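The LDA step at the heart of such an approach treats each cell as a "document" and each gene as a "word". The sketch below fits topics to a cells-by-genes count matrix with scikit-learn; the poster's downstream ranking of interaction-driven genes is not reproduced here.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def fit_expression_topics(counts, n_topics=2, seed=0):
    """Fit LDA topics to a cells-by-genes count matrix, treating cells as
    documents and genes as words.

    Returns (cell_topics, gene_topics): per-cell topic proportions and
    per-topic gene distributions (rows sum to 1).
    """
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    cell_topics = lda.fit_transform(counts)          # (cells, topics)
    gene_topics = lda.components_ / lda.components_.sum(axis=1, keepdims=True)
    return cell_topics, gene_topics
```

Genes loading heavily on a topic that is enriched in interacting-cell doublets would be the candidates for interaction-driven expression changes in a workflow of this kind.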

