Posters

Posters at ISMB 2020 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform along with a PDF of their poster. All registered conference participants will have access to the posters and presentations during the conference, and the content will remain available until October 31, 2020. A chat function provides Q&A opportunities for interaction between presenters and participants.

Preliminary information on preparing your poster and poster talk is available at: https://www.iscb.org/ismb2020-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:

View Posters By Category

Poster Session A: July 13 & July 14, 7:45 am - 9:15 am Eastern Daylight Time
Bioinfo-core; CAMDA COSI; COVID-19; Education COSI; EvoCompGen COSI; Function / CAFA 4; MLCSB COSI; SCANGEN (Special Session); SysMod COSI; Systems Immunology (Special Session); Text Mining
July 14 between 10:40 am - 2:00 pm EDT: iRNA COSI

Poster Session B: July 15 & July 16, 7:45 am - 9:15 am Eastern Daylight Time
3DSIG COSI; Bio-Ontologies COSI; BioVis COSI; CompMS COSI; COVID-19; EvoCompGen COSI; HitSeq COSI; MICROBIOME COSI; NetBio COSI; RegSys COSI; TransMed COSI; VarI COSI; General Comp Bio


A Gene-Expression-Based Ensemble Learning Approach for Targeted Cancer Drug Prediction

COSI: MLCSB COSI

Darsh Mandera, Independent, United States

Short Abstract: MicroRNA has been identified as a key biomarker of cancer. MicroRNAs are noncoding RNAs that regulate gene expression in the post-transcriptional phase. When microRNA count is depleted, gene expression can become dysregulated, consequently leading to progression or drug resistance in cancer.

Even though cancer is a complex and extremely heterogeneous condition, the current practice of treating cancer is a one-size-fits-all approach that is expensive and time-consuming; this causes patients to suffer. Worse, prescribed treatment is ineffective 75% of the time.

Recently, machine learning has been used in cancer diagnosis and detection. In this research, machine learning models were built, trained, and tested on human patient microRNA and drug response data of 80% of cancer types from The Cancer Genome Atlas. Four different models were implemented using DecisionTreeClassifier, k-NearestNeighborsClassifier, AdaBoostClassifier, and OneVsRestClassifier, of which the latter two are ensemble learning methods. Cross-validation was conducted using K-Fold and GridSearchCV. The final ensemble learning model based on OneVsRestClassifier used a median scoring method during cross-validation to select the best hyperparameters and outperformed the other aforementioned classifiers. This approach predicts cancer drugs based on patients’ microRNA data with an accuracy of 74.1%, which is a significant improvement over the current one-size-fits-all approach.
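
As a rough illustration of the cross-validation setup described in this abstract, the sketch below tunes a one-vs-rest classifier with GridSearchCV over K-Fold splits in scikit-learn. The synthetic data, the logistic-regression base estimator, and the parameter grid are assumptions made for illustration; the abstract does not specify them, and the median scoring method it mentions is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.multiclass import OneVsRestClassifier

# Synthetic stand-in for microRNA expression (features) and drug labels (classes).
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# One-vs-rest wrapper around a base estimator (hypothetical choice of estimator).
model = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# Hyperparameters of the wrapped estimator use the "estimator__" prefix.
param_grid = {"estimator__C": [0.1, 1.0, 10.0]}

# Five shuffled folds; GridSearchCV averages the fold scores for each setting.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(model, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

best_C = search.best_params_["estimator__C"]
best_score = search.best_score_  # mean accuracy across the five folds
print(best_C, round(best_score, 3))
```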

A Machine Learning Method for Functional Characterization of Newly Discovered Genes

COSI: MLCSB COSI

Priyanka Bhandary, Iowa State University, United States
Sagnik Banerjee, Iowa State University, United States
Karin Dorman, Iowa State University, United States
Eve Syrkin Wurtele, Iowa State University, United States

Short Abstract: Homologous proteins often exhibit similar functions. Sequence similarity has therefore been used to assign functions to newly discovered proteins that bear homology to well-characterized ones. However, every organism possesses a set of genes that bear no sequence homology to any other proteins, making function annotation via sequence comparison impossible. We developed a predictive machine-learning framework that exploits expression counts to associate functions with proteins. Tested on multiple regulons (sets of genes co-expressed across a broad range of biological conditions) composed of genes involved in the same biochemical pathway, our approach outperforms other techniques. A compendium of machine learning classifiers was trained using gene counts to predict regulon information for unidentified genes in aggregated RNA-Seq data from Arabidopsis thaliana. A Multilayer Perceptron classifier achieved a precision of 0.80, recall of 0.82, and F1-score of 0.81 for the large Photosynthesis regulon. Three other methods were compared with the machine learning approach, two of which employ correlations. Machine learning techniques can pave the way to better hypothesis generation about the biological processes of uncharacterized genes by identifying patterns in gene expression profiles. Thus, they can infer potential roles for uncharacterized genes and can improve the quality of existing gene functional annotations.
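
The F1-score reported in this abstract can be checked against the reported precision and recall, since F1 is their harmonic mean. The helper name below is illustrative only.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Values reported for the Photosynthesis regulon.
f1 = f1_from_precision_recall(0.80, 0.82)
print(round(f1, 2))  # 0.81
```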

A method for tracking cell migration in vivo based on deep learning with target detection

COSI: MLCSB COSI

Tsubasa Mizugaki, Department of Bioinformatic Engineering, Osaka University, Japan
Utkrisht Rajkumar, Department of Computer Science and Engineering, University of California, San Diego, United States
Kenji Fujimoto, Department of Bioinformatic Engineering, Osaka University, Japan
Hironori Shigeta, Department of Bioinformatic Engineering, Osaka University, Japan
Shigeto Seno, Department of Bioinformatic Engineering, Osaka University, Japan
Yutaka Uchida, Department of Immunology and Cell Biology, Osaka University, Japan
Masaru Ishii, Department of Immunology and Cell Biology, Osaka University, Japan
Vineet Bafna, Department of Computer Science and Engineering, University of California, San Diego, United States
Hideo Matsuda, Department of Bioinformatic Engineering, Osaka University, Japan

Short Abstract: Cell migration is an important criterion for determining the effects of inflammatory and/or chemical stimulation on cells. Traditional methods for detecting cell movement, such as optical flow-based methods, are difficult to apply because the cells have similar fluorescence intensities and shapes. We adopt a tracking approach based on a convolutional neural network (CNN), using time-series images observed with two-photon excitation microscopy. Our tracking tool is based on MDNet (multi-domain network), developed by Nam and Han (2016). This tool tracks objects by combining convolutional layers for object recognition with a domain-specific classification layer. The locations of migrated cells are indicated by the predicted bounding boxes, and their shapes are identified from those bounding boxes. We extend MDNet to track multiple cells in microscopic images. First, we select the initial bounding box of a cell with a graph-cut optimization algorithm, and then we start tracking the cell from that bounding box with MDNet. We have applied our method to tracking leukocyte migration in living animals under inflammatory stimulation. We will present the performance of our cell tracking in comparison with traditional tracking methods.

A new strategy for whole exome sequencing based copy number variation detection by long short-term memory method

COSI: MLCSB COSI

Xiaoying Lv, DCH Technologies Inc, United States
Mengchun Gong, DCH Technologies Inc, United States
Wenzhao Shi, DCH Technologies Inc, United States
Shengyin Qin, Bio-X Institutes, Shanghai JiaoTong University, China
Gang Feng, DCH Technologies Inc, United States

Short Abstract: Background:
Whole exome sequencing (WES) protocols are widely used in clinical studies. However, precise CNV calling from WES data remains challenging because of sequencing limitations and the capture efficiency of kits. Given the weak power of current methods, we put forward a new strategy that uses a long short-term memory (LSTM) model to predict CNVs across the whole exome before segmentation.
Methods & Results:
WES data and gold-standard genotyping microarray CNV data from 136 schizophrenia samples and 100 normal samples were used to develop this new strategy. For each bin, we extracted the normalized depth, GC content, and bin length, and calculated the average depth and the depth fluctuation across the 100 normal samples as input features; we then tested several parameters (iterations, batch size, number of units, and so on) for the LSTM model. The new strategy was significantly better suited to CNV detection, improving the bin-level F1 score from 0.031 to 0.11 compared with CNVkit, the best of the current CNV callers.
Conclusion:
We explored a new strategy for predicting whole-exome CNVs with an LSTM, a recurrent neural network that can remove the effect of ambiguous factors. The strategy offers new insight into exome CNV detection.
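
The per-bin input features listed in the Methods of this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function and key names are hypothetical, depths are assumed to be normalized already, and "depth fluctuation" is taken here to mean the standard deviation across normal samples, which the abstract does not confirm.

```python
from statistics import mean, pstdev

def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

def bin_features(seq, start, end, sample_depth, normal_depths):
    """Feature dictionary for one exome bin (hypothetical layout)."""
    return {
        "normalized_depth": sample_depth,
        "gc_content": gc_content(seq),
        "bin_length": end - start,
        # Summary of the same bin across the normal reference samples.
        "normal_mean_depth": mean(normal_depths),
        "normal_depth_sd": pstdev(normal_depths),
    }

# Toy bin of 8 bases with depths from three normal samples.
features = bin_features("ACGTGGCC", 100, 108, 0.5, [1.0, 1.2, 0.8])
print(features["gc_content"], features["bin_length"])  # 0.75 8
```

Feature vectors of consecutive bins would then form the input sequence for a recurrent model such as an LSTM; that step is not shown here.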

A Non-Parametric Bayesian Framework for Detecting Coregulated Splicing Signals in Heterogeneous RNA Datasets with Applications to Acute Myeloid Leukemia

COSI: MLCSB COSI

David Wang, University of Pennsylvania, United States
Mathieu Quesnel-Vallieres, University of Pennsylvania, United States
Yoseph Barash, University of Pennsylvania, United States

Short Abstract: Analysis of RNASeq data from large patient cohorts can reveal transcriptomic perturbations that are associated with disease. This is typically framed as an unsupervised learning task to discover latent structure in a data matrix. However, the heterogeneity of these datasets makes such analysis challenging. For example, in acute myeloid leukemia, mutations in splice factor genes occurring in a subset of the patients may only result in alteration of a subset of splicing events. Thus, there is a need to identify “tiles”, defined by a subset of samples and splicing events with abnormal signals. Although algorithms exist for this task, they fail to model splicing data.
To address these challenges, we propose CHESSBOARD, a non-parametric Bayesian model for unsupervised discovery of tiles. Our algorithm does not require a priori knowledge of the number of the tiles and uses a unique missing value model for cancer data. First, we apply our model to synthetic datasets and show it outperforms several baseline approaches. Next, we show that it recovers tiles characterized by splicing aberrations which are reproducible in multiple AML patient cohorts. Finally, we show that tiles we discover are correlated with drug response to therapeutics, pointing to translational potential of our findings.

A pretrained high-sensitive CNN-based multi-task learning model for taxonomic assignment of human viruses

COSI: MLCSB COSI

Haoran Ma, National University of Singapore, China
Tin Wee Tan, National University of Singapore, Singapore
Hon Kim Kenneth Ban, National University of Singapore, Singapore

Short Abstract: Background
Taxonomic assignment is important in the identification of infectious diseases. Although some taxonomic assignment tools have been developed, they are limited by factors such as read length, error rate, and the choice of k value. In addition, coverage calculation, an important function that reflects the overall distribution of aligned reads, is not included in the taxonomic reports of these tools.

Methods
A CNN-based multi-task learning model was developed that performs both taxonomic assignment and coverage calculation for human viruses. In the architecture, multiple MaxPooling and BatchNormalization layers were added to tolerate mutations in virus nucleotide sequences. To calculate coverage, the location of each input read was predicted as the second task of the architecture. The model was trained on augmented ICTV datasets using an early-stopping strategy.

Results
For taxonomic assignment, the CNN model outperformed Kraken2, Centrifuge and Bowtie2 on simulated reads, and identified about twice as many SARS-CoV-2 viruses as the other three tools on four real RNA-seq datasets. For coverage calculation, the accuracy on an unseen test dataset was higher than 95%. The pretrained model and datasets are available via GitHub.

Significance
Our CNN model is highly sensitive for human viruses and can provide coverage calculation.

A regularized functional regression model enabling transcriptome-wide dosage-dependent association study of cancer drug response

COSI: MLCSB COSI

Evanthia Koukouli, Lancaster University, United Kingdom
Frank Dondelinger, Lancaster University, United Kingdom
Juhyun Park, Lancaster University, United Kingdom
Dennis Wang, University of Sheffield, United Kingdom

Short Abstract: Tumor genetic makeup plays an important role in cancer drug sensitivity. Gene expression markers are a promising tool to build decision aids for treatment selection or dosage tuning. Using in vitro cancer cell line dose-response and gene expression data from the Genomics of Drug Sensitivity in Cancer project (Iorio et al., 2016), we build a dose-varying regression model. Unlike existing approaches, this allows us to estimate dosage-dependent associations with gene expression, enabling fine-grained dosage tuning and drug selection.

We include the transcriptomic profiles of unexposed cell lines as dose-invariant covariates into the regression model and assume that their effect varies smoothly over the dosage levels. A two-stage variable selection algorithm is used to identify genetic factors that are associated with drug response over the varying dosages.

We evaluated the effectiveness of our method using simulation studies and cross-validation for predictive accuracy assessment. We further tested the model on data from five MAPK/ERK targeted compounds applied to different cancer cell lines under different dosages. We reveal the dosage-dependent dynamics of the associations between the selected genes and drug response, and we perform pathway enrichment analysis to highlight the role of the selected genes in tumorigenesis and DNA damage response.

Accurately and Interpretably Imputing the Expression of Unmeasured Genes

COSI: MLCSB COSI

Jacob Canfield, Michigan State University, United States
Christopher Mancuso, Michigan State University, United States
Deepak Singla, Nanyang Technological University, Singapore
Arjun Krishnan, Michigan State University, United States

Short Abstract: While there are >2 million publicly-available human microarray gene-expression profiles, these profiles were measured using a variety of platforms that each cover a pre-defined, limited set of genes. Therefore, key to reanalyzing and integrating this massive data collection are methods that can computationally reconstitute the complete transcriptome in partially-measured microarray samples by imputing the expression of unmeasured genes. Current state-of-the-art imputation methods are tailored to samples from a specific platform and rely on gene-gene relationships regardless of the biological context of the target sample. We show that sparse regression models that capture sample-sample relationships (termed SampleLASSO), built on-the-fly for each new target sample to be imputed, outperform models based on fixed gene relationships. Extensive evaluation involving three machine learning algorithms (LASSO, k-nearest-neighbors, and deep-neural-networks), two gene subsets (GPL96-570 and LINCS), and three imputation tasks (within and across microarray/RNA-seq) establishes that SampleLASSO is the most accurate model. Additionally, we demonstrate the biological interpretability of this method by showing that, for imputing a target sample from a certain tissue, SampleLASSO automatically leverages training samples from the same tissue. Thus, SampleLASSO is a simple, yet powerful and flexible approach for harmonizing large-scale gene-expression data.
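
The idea of a sparse regression over samples, fitted afresh for each target sample, can be sketched with scikit-learn's Lasso. This is a toy illustration under stated assumptions, not the authors' SampleLASSO implementation: the random data, the matrix sizes, and the `alpha` value are all made up for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Hypothetical training compendium: 30 fully measured samples x 50 genes.
n_train, n_genes = 30, 50
train = rng.normal(size=(n_train, n_genes))

# Toy target sample: a noisy mixture of the first two training samples.
target_full = 0.6 * train[0] + 0.4 * train[1] + rng.normal(scale=0.01, size=n_genes)
measured = np.arange(40)        # genes measured in the target sample
unmeasured = np.arange(40, 50)  # genes to impute

# Rows are measured genes and columns are training samples, so the model
# learns which training samples best explain the target sample.
model = Lasso(alpha=0.01)
model.fit(train[:, measured].T, target_full[measured])

# Apply the learned sample weights to the training values of the unmeasured genes.
imputed = model.predict(train[:, unmeasured].T)
n_used = int(np.count_nonzero(model.coef_))  # training samples with non-zero weight
print(imputed.shape, n_used)
```

Because the coefficients attach to training samples rather than genes, inspecting which samples receive non-zero weight is what makes this kind of model interpretable in the sense the abstract describes.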

Adversarial Deconfounding Autoencoder for Learning Robust Gene Expression Embeddings

COSI: MLCSB COSI

Ayse Dincer, University of Washington, United States
Joseph Janizek, University of Washington, United States
Su-In Lee, University of Washington, United States

Short Abstract: The increasing number of gene expression profiles has enabled the use of complex models, such as deep unsupervised neural networks, to extract a latent space from these profiles. However, expression profiles, especially when collected in large numbers, inherently contain variations introduced by technical artifacts (e.g., batch effects) and uninteresting biological variables (e.g., age) in addition to the true signals of interest. These sources of variation, called confounders, produce embeddings that fail to capture biological variables of interest and fail to transfer to different domains. To remedy this problem, we attempt to disentangle confounders from true signals to generate biologically informative embeddings by introducing the AD-AE (Adversarial Deconfounding AutoEncoder) approach. The AD-AE model consists of an autoencoder and an adversary network that we jointly train to generate embeddings that encode as much information as possible without encoding any confounding signal. By applying AD-AE to two distinct gene expression datasets, we show that our model can (1) generate embeddings that do not encode confounder information, (2) conserve the biological signals present in the original space, and (3) generalize successfully across different confounder domains. We believe that this adversarial deconfounding approach can be the key to discovering robust expression patterns.

An improvement of ComiR algorithm for microRNA target prediction by exploiting coding region sequences of mRNAs

COSI: MLCSB COSI

Giorgio Bertolazzi, University of Palermo, Department of Economics, Business and Statistics, Italy
Panayiotis Benos, University of Pittsburgh, Department of Computational and Systems Biology, United States
Claudia Coronnello, Fondazione Ri.MED, Italy
Michele Tumminello, University of Palermo, Department of Economics, Business and Statistics, Italy

Short Abstract: MicroRNA regulation activity depends on the recognition of binding sites located on mRNA molecules. Traditionally, only microRNA targets on the 3’UTR of mRNAs are considered. ComiR is an algorithm that predicts the targets of a set of microRNAs, considering their abundance as part of its probabilistic model. ComiR was trained with information on binding sites in the 3’UTR, using a reliable dataset containing the targets of endogenously expressed microRNAs in D. melanogaster S2 cells. This dataset was obtained by comparing the results from two experimental approaches, i.e., inhibition and immunoprecipitation of the AGO1 protein.
In this work, we tested whether including coding region binding sites in the ComiR algorithm improves the performance of the tool. We focused the analysis on D. melanogaster and updated the ComiR database with the currently available releases of mRNA and microRNA sequences. We found that the information related to the coding regions increases the efficiency of ComiR compared with the version trained on the 3’UTR alone. We therefore propose to upgrade the existing ComiR web tool by including the model trained on coding regions, available alongside the 3’UTR-based one.

Associating Protein Function with Human Traits through PWAS

COSI: MLCSB COSI

Nathan Linial, The Hebrew University of Jerusalem, Israel
Nadav Brandes, The Hebrew University of Jerusalem, Israel
Michal Linial, The Hebrew University of Jerusalem, Israel

Short Abstract: Over the last decades, GWAS has become a canonical tool for exploratory genetic research, generating countless gene-phenotype associations. Despite its accomplishments, several limitations still hinder its success, including low statistical power and obscurity about the causality of implicated variants. We introduce PWAS (Proteome-Wide Association Study), a new method for detecting protein-coding genes associated with phenotypes through protein function alterations. PWAS aggregates the signal of all variants jointly affecting a protein-coding gene and assesses their overall impact on the protein’s function using machine-learning and probabilistic models. Subsequently, it tests whether the gene exhibits functional variability between individuals that correlates with the phenotype of interest. By collecting the genetic signal across many variants in light of their rich proteomic context, PWAS can detect subtle patterns that standard GWAS and other methods overlook. To demonstrate its applicability for a wide range of human traits, we applied PWAS to a cohort derived from the UK Biobank (~330K individuals) and evaluated it on 49 prominent phenotypes. Of the significant PWAS associations in that cohort, 23% were missed by standard GWAS. A comparison of PWAS with existing methods demonstrates its capacity to recover causal protein-coding genes and to highlight new associations with a plausible biological mechanism.

aTEMPO: Pathway-Specific Temporal Anomalies for Precision Therapeutics

COSI: MLCSB COSI

Christopher Pietras, Tufts University, United States
Liam Power, Tufts University, United States
Donna Slonim, Tufts University, United States

Short Abstract: Predictable dynamic processes characterize many biological states, from development to the cell cycle to healthy aging. Dynamic processes are also inherently important in disease, and identifying disease-related disruptions of normal dynamic processes can provide information about individual patients. In previous work, our group has characterized individuals' disease states via pathway-based anomalies in expression data with CSAX. We have also identified disease-correlated disruption of predictable dynamic patterns by modeling a virtual time series in static data with TEMPO. Here we combine the insights from the two approaches, using an anomaly detection model and virtual time series to identify anomalous temporal processes in specific disease states. Our approach produces an anomaly score for an individual sample by combining the level of pathway-specific temporal dysregulation detected in that sample across a set of functional pathways. This allows us to identify which temporal patterns are most disrupted in each sample. We show that aTEMPO performs significantly better than general-purpose comparator anomaly detection methods when used to detect diseases and disorders with a developmental or age-related component, including autism spectrum disorders, Alzheimer’s disease, and Huntington’s disease. We further demonstrate that this approach can informatively characterize individual patients, suggesting personalized therapeutic approaches.

Automated inference of CRISPRi guide efficiency in bacteria from genome-wide essentiality screens

COSI: MLCSB COSI

Yanying Yu, Helmholtz Institute for RNA-based Infection Research, Germany
Sandra Gawlitt, Helmholtz Institute for RNA-based Infection Research, Germany
Chase Beisel, Helmholtz Institute for RNA-based Infection Research, Germany
Lars Barquist, Helmholtz Institute for RNA-based Infection Research, Germany

Short Abstract: New CRISPR-based technologies are increasingly being applied to manipulate the genomes and transcriptomes of microbes. Despite this, deriving design rules for optimal guide RNAs remains challenging. Genome-wide screens offer one potential source of information, as depletion in a screen reflects a mixture of fitness effects and guide efficiency. By applying automated machine learning to data from publicly available CRISPRi screens, we show that it is possible to accurately predict the depletion of guides targeting essential genes using a combination of genomic, sequence, thermodynamic, and transcriptomic features. We then developed an iterative procedure to separate the contribution of features that can be manipulated in guide design from gene-intrinsic features to produce a predictive model for guide efficiency. We demonstrate that this model predicts guide efficiency in flow-cytometry-based assays and outperforms existing tools for guide design intended for genome editing. The features of good CRISPRi guides in E. coli are distinct from those previously derived for genome editing applications. Our model not only addresses the need for an effective tool for CRISPRi guide design but also provides a blueprint for the development and interpretation of predictive models for other CRISPR-based technologies.

BinOpt: An algorithm to optimally assign feature importance to classes

COSI: MLCSB COSI

Malvika Sudhakar, Indian Institute of Technology Madras, India
Karthik Raman, Indian Institute of Technology Madras, India
Raghunathan Rengaswamy, Indian Institute of Technology Madras, India

Short Abstract: Many real-world problems are tackled by formulating them as classification problems. In biology, specifically, classification is used to stratify patients, differentiate between cell types, and so on. The differences between classes are studied using features such as genes or proteins, and the contribution of these features to the various classes is of interest. Often, studies seek to identify the best set of features that can differentiate between classes. However, there are multiple equivalent feature sets that can aid in classification equally well. In this study, we aim to leverage these equivalent solutions to identify the true contribution of each feature to a class. We have developed an algorithm called BinOpt, which solves a mixed-integer linear programming problem using multiple subsets of features as input for prediction. We studied the effect of problem size, sparsity, and the number of subsets given as input on the performance of the algorithm. We show that the algorithm performs better on sparse problems and that the time complexity increases with problem size. We illustrate the utility of BinOpt on a cancer somatic mutation dataset to identify the contribution of genes to different cancer types. We gained multiple insights about driver genes involved in multiple cancer types and identified genes unique to a cancer type.

Can We Trust Convolutional Neural Networks for Genomics?

COSI: MLCSB COSI

Peter Koo, Cold Spring Harbor Laboratory, United States
Matthew Ploenzke, Harvard University, United States

Short Abstract: Convolutional neural networks (CNNs) are powerful methods to predict transcription factor binding sites from DNA sequence. Although CNNs are largely considered a "black box", attribution-based interpretability methods can be employed to identify single nucleotide variants that are important for model predictions. However, there is no guarantee that attribution methods will recover meaningful features even for state-of-the-art CNNs. Here we train CNNs with different architectures and training procedures on synthetic sequences embedded with known motifs and then quantitatively measure how well attribution methods recover ground truth. We find that deep CNNs tend to recover less interpretable motifs, despite yielding superior performance on held-out test data. This suggests that good model performance does not necessarily imply good model interpretability. Strikingly, we find that adversarial training, a method to promote robustness to small perturbations of the input data, can significantly improve the efficacy of attribution methods. We also find that CNNs specially designed with an inductive bias to learn strong motif representations consistently improve interpretability. We then show that these results generalize to in vivo ChIP-seq data. This work highlights the importance of moving beyond performance on benchmark datasets when considering whether to trust a CNN’s prediction in genomics.

Cancer Prediction using DNA Methylation Profiles

COSI: MLCSB COSI

Annika Viswesh, Palo Alto High School, United States

Short Abstract: Tools to help healthcare professionals identify a cancer type based on DNA methylation profiles are currently unavailable. Changes in DNA methylation patterns occur early, so they can be used in the early detection of cancer. Machine learning algorithms were developed in this study to predict a particular kind of cancer. DNA methylation profiles of Acute Myeloid Leukemia, Sarcoma, Skin Cutaneous Melanoma, Stomach Adenocarcinoma, Lung Squamous Cell Carcinoma, and Brain Lower Grade Glioma were downloaded from The Cancer Genome Atlas. In total, 1200 samples, with 200 records per cancer type and 485,577 features per record, were extracted, cleaned, balanced, and feature engineered. Five different computational models were built using Logistic Regression, Decision Trees, Gaussian Naïve Bayes, Support Vector Machines, and Random Forest. Precision, recall, and F1 score were used to evaluate and compare their performance. Logistic Regression and Linear Support Vector Machines gave the best results, with precision, recall, and F1 score of 99.5%. This project demonstrates an accurate, big-data-driven machine learning approach to predictive analytics for early cancer detection, which can improve cancer patient outcomes. Although the study uses data for only six cancer types, this concept of machine learning-assisted decision-making can be extended to incorporate more cancers in the future.

Chemical representation learning for toxicity prediction

COSI: MLCSB COSI

Greta Markert, IBM Research Zurich, Switzerland
Matteo Manica, IBM, Switzerland
Gisbert Schneider, ETH Zurich, Switzerland
Jannis Born, ETH Zurich, Switzerland
Maria Rodriguez Martinez, IBM Research Zurich, Switzerland

Short Abstract: Toxicity is the main cause of both the high attrition rate in drug discovery and the majority of withdrawals of approved drugs. Striving to reduce toxicity necessitates developing reliable and interpretable QSAR prediction models. At the core of these models is the goal of finding data representations with the highest predictive power for a given task. In a case study of several toxicity databases (environmental toxicity and cytotoxicity), we present an overarching comparison of various representations of molecules, including molecular fingerprints, graph convolutional neural networks, and SMILES-based representations. We are the first to benchmark both established and state-of-the-art graph kernel techniques on the Tox21 dataset. Since SMILES models yield the highest accuracy, we examined the different variants of this notation. The result is a unifying SMILES preprocessing pipeline (with an open-source Python implementation), which is used to systematically investigate different SMILES styles, e.g. by removing the stereoinformation or adding bonds explicitly. To improve model interpretability, neural attention mechanisms are employed on the SMILES sequences. The resulting attention maps are validated on a panel of known toxicophores. We show an application of the toxicity models by incorporating them into a conditional generative model that strives to generate non-toxic pharmaceutical candidate drugs.

Clustering of Position Specific Weight Matrix in Various Protein Families

COSI: MLCSB COSI

Omer Ali, Radium Hospital, Norway
Junbai Wang, Radium Hospital, Norway

Short Abstract: Background: With the advent of information technology and its use in biological applications, new high-throughput methods have produced an ever-increasing number of transcription factor (TF) binding motif collections, which are maintained in different TF databases generated from various sources. There are few efficient, automatic ways to combine biologically relevant or similar TFs into groups using prior knowledge about the TFs (e.g., protein family). Thus, there is a need for a tool that can efficiently group TF position weight matrices (PWMs) into representative motifs.
Results: Here, we present a prototype package that clusters the PWMs of TFs and estimates a consensus PWM for each cluster. First, the program assigns a DNA-Binding Domain (DBD) family label to each input PWM and classifies the PWMs into their respective DBD families. Then, affinity propagation clustering is applied within each DBD family to group similar PWMs. A simple statistical assessment of cluster quality and other relevant information about the clustering results are exported as formatted text files.
Conclusion: Similar PWMs were assigned to the same cluster, and each cluster is represented by a representative PWM. The pipeline also includes an automatic cluster quality analysis, which distinguishes it from existing tools. A software pipeline will be available online once completed.

Deep learning approach to determining the type of long reads

COSI: MLCSB COSI

Lovro Vrcek, Genome Institute of Singapore, A*STAR, Singapore, Singapore
Megan Hong Hui Huang, Genome Institute of Singapore, A*STAR, Singapore, Singapore
Robert Vaser, University of Zagreb, Faculty of Electrical Engineering and Computing, Zagreb, Croatia, Croatia
Mile Šikić, Genome Institute of Singapore, Singapore

Short Abstract: Single-genome and metagenome de novo assembly of long reads is still one of the most difficult problems in bioinformatics. An often-used paradigm, called Overlap-Layout-Consensus, aims at finding a Hamiltonian path through an assembly graph obtained from overlapping reads in a sample. However, these graphs can be extremely complex due to repetitive regions in genomes and sequencing artifacts such as chimeric reads, which lead to higher fragmentation of the assembled genomes. A popular approach for tackling this problem is based on dividing reads into three categories and processing them appropriately. These three categories of reads are regular, repetitive, and chimeric.

A drawback of read classification with heuristic algorithms in existing assemblers is the manual selection of parameters based on just a few genomes. In this work, we propose a deep learning approach for classifying reads based on their pile-o-grams, plots of coverage versus base index. The model was trained on a hand-labeled dataset consisting of pile-o-gram images from multiple bacteria, and tested on a different bacterial species not included in the training set. With this setup, and with balanced classes, an accuracy of 93% was achieved, which opens the possibility of creating more accurate and less fragmented assemblies.
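
As an illustration of the input described above, the following minimal sketch computes a pile-o-gram, i.e. coverage versus base index, for a single read. The function name and the `(start, end)` interval format for overlaps are assumptions made for this example and are not taken from the poster.

```python
def pileogram(read_length, overlaps):
    """Return per-base coverage for one read.

    overlaps: iterable of (start, end) half-open intervals in read
    coordinates where other reads overlap this read (hypothetical format).
    """
    # Difference array: +1 where an overlap begins, -1 where it ends.
    diff = [0] * (read_length + 1)
    for start, end in overlaps:
        start = max(0, start)
        end = min(read_length, end)
        if start < end:
            diff[start] += 1
            diff[end] -= 1

    # A running sum of the difference array gives the coverage at each base.
    coverage = []
    depth = 0
    for i in range(read_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


example = pileogram(10, [(0, 6), (2, 10), (4, 8)])
print(example)
```

A classifier such as the one described would then take this coverage profile, rendered as an image, as its input.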

Deep learning enables automated and extensible peak group identification for multi-transition chromatogram-based data-independent acquisition data analysis

COSI: MLCSB COSI

Leon L. Xu, University of Toronto, Canada
Hannes L. Röst, University of Toronto, Canada

Short Abstract: Data-independent acquisition (DIA) is a novel mass spectrometric method that achieves high reproducibility and quantitative accuracy through a deterministic acquisition strategy; however, existing heuristic and knowledge-based software cannot keep up with the increasing complexity of the resulting data. Here, we present a novel method based on deep learning (DL), which is able to automatically extract and classify peak group features directly from chromatographic DIA data. Our approach is an end-to-end neural network that performs feature extraction and scoring in a single step, unlike previous approaches that relied on manual feature engineering. First, we extract one-dimensional chromatographic data from the raw two-dimensional mass-to-charge and retention time data map. Then, we train a neural network based on the Transformer architecture to operate directly on that chromatographic data to annotate all data points that belong to a peak group feature. This neural network can be trained using only labelled data, but is also capable of integrating unlabelled data in the training process. Overall, the end-to-end nature of the method allows us to annotate subjectively better peak boundaries, and our method shows higher performance than existing methods on comparable data in terms of precision and recall.

Development of Coexpression-guided Generative Model for Plant Gene Expression Inference

COSI: MLCSB COSI

Yuichi Aoki, Tohoku University, Japan
Shinichi Yamazaki, Tohoku University, Japan
Takeshi Obayashi, Tohoku University, Japan

Short Abstract: Gene expression is one of the most prominent molecular phenotypes associated with plant responses to the environment. Therefore, an accurate understanding of the gene expression state space is fundamentally essential for plant physiology. Generative models, such as variational autoencoders (VAEs) and generative adversarial networks (GANs), have the potential to imitate the phenomena of interest and provide useful information to understand the data generation mechanisms. The recent accumulation of gene expression data brings the opportunity to apply this generative model-based inference approach to transcription events. In this study, we attempted to build VAEs and GANs for the gene expression state space of higher plants by using a comprehensive gene expression dataset provided by ATTED-II (atted.jp). Training the models with a regularization term for gene coexpression states successfully decreased the generalization error, and based on the constructed models, we successfully estimated some tissue-specific gene expression profiles by utilizing the neural style transfer approach. At this conference, we would like to discuss the possibility of latent space interpretation of gene expression generative models, and the applicability of these techniques to inter-species comparative gene coexpression analyses.

DNA methylation predictive capacity is dispersed across the genome and not limited to promoter regions

COSI: MLCSB COSI

Xiavan Roopnarinesingh, University of Oklahoma Health Sciences Center, United States
Cory Giles, Oklahoma Medical Research Foundation, United States
Hunter Porter, University of Oklahoma Health Sciences Center, United States
Chase Brown, University of Oklahoma Health Sciences Center, United States
Constantin Georgescu, Oklahoma Medical Research Foundation, United States
Jonathan Wren, Oklahoma Medical Research Foundation, United States

Short Abstract: It is understood that promoter methylation is associated with regulation, but methylation is not limited to promoters. We sought to examine methylation’s predictive capacity across the methylome, predicting sex, tissue, age and expression. This analysis is performed on over 40,000 publicly available Illumina 450k microarrays from the Gene Expression Omnibus, across 30 tissues. For each feature, a k-nearest neighbors, a logistic regression, and a random forest classifier are trained. For age, elastic net models are trained. For gene expression, prediction scores are the mean correlation of a given gene with a GO term’s associated genes. We observe that promoter and gene body methylation are similarly predictive of tissue and sex (f-scores>0.9), with age predicted with a MAD of 4.8. Iteratively removing probes from each model showed a performance drop at 150 randomly selected probes. Gene expression AUROC scores of 0.68 were obtained, but no methylation terms were able to achieve an AUROC greater than the negative control score of 0.49. This separation describes methylation with an informative capacity that predicts sample-associated features, while minimal information about gene function can be predicted with mean methylation.

Efficient nonlinear genotype-to-phenotype prediction for heel bone mineral density

COSI: MLCSB COSI

Aleksandr Medvedev, Skolkovo Institute of Science and Technology, Russia
Elena Nabieva, Skolkovo Institute of Science and Technology, Russia
Satyarth Mishra Sharma, Skolkovo Institute of Science and Technology, Russia
Dmitry Yarotsky, Skolkovo Institute of Science and Technology, Russia

Short Abstract: For a long time, linear models have been a standard for genotype-to-phenotype predictions. In recent years, the amount of data available to researchers has increased substantially, opening the way to constructing and reliably testing complex nonlinear models. The important open problems are the size of the nonlinear part of heritability and the optimal architecture of predictive models capable of revealing this nonlinear heritability. In this paper, we explore the applicability of nonlinear methods to the prediction of heel bone mineral density -- an important risk factor for fractures and the defining feature of osteoporosis.
We compare a panel of linear and nonlinear methods, and find that some nonlinear methods can produce significantly more accurate predictions than linear methods such as LASSO. In particular, gradient boosted decision trees implemented in the library XGBoost explain 32% of bone density variance on the UK Biobank dataset, while LASSO explains only 27%. Moreover, the nonlinearity revealed by XGBoost appears to be rather complex in the sense that the achieved gain in accuracy cannot be alternatively obtained by some common "weakly nonlinear" models, such as linear models with a nonlinear transformation of the output or shallow neural networks with a small number of hidden neurons.

eNetXplorer: an R package for the quantitative exploration of elastic net families for generalized linear models

COSI: MLCSB COSI

Julian Candia, National Institutes of Health, United States
John Tsang, National Institutes of Health, United States

Short Abstract: Background: Rigorous, exploratory analysis for the identification of correlates in a multivariable setting is needed in a variety of contexts, especially in systems biology where data involving a large number of features are highly prevalent. In many scenarios, e.g. typical single- and multi-omics data analysis, regression requires regularization models such as ridge and lasso. To explore different levels of regularization and penalty in regression, the elastic net spans the full range from ridge to lasso. However, there remain several open issues, ranging from model selection and assessment of model-level statistical significance to feature selection and significance.

Results: By generating null-model ensembles via random permutations of the response variable, eNetXplorer provides a cross-validated framework to address these outstanding issues. Feature importance is evaluated by flexible criteria based on out-of-bag predictions, which are assessed via user-defined quality functions. eNetXplorer fits linear, binomial, multinomial, and Cox regression models and provides a set of standard plots, summary statistics and output tables for downstream analysis. This R package is available under GPLv3 license at CRAN.

Significance: eNetXplorer enables quantitative, exploratory analysis to generate hypotheses on which features may be associated with biological phenotypes of interest, such as for the identification of biomarkers for therapeutic responsiveness.
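
The null-model idea in the Results paragraph can be illustrated with a minimal sketch: shuffle the response variable many times and ask how often the shuffled data score as well as the real data. This is not the eNetXplorer API (the package is written in R); the function names are hypothetical, and an absolute Pearson correlation stands in for the cross-validated model performance that the package evaluates.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Empirical p-value of |r| against a null built by shuffling y."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


feature = list(range(10))
response = [2 * v + 1 for v in feature]
print(permutation_p_value(feature, response))
```

The add-one correction is a common convention for permutation tests; whether the package uses the same convention is not stated in the abstract.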

Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies

COSI: MLCSB COSI

Rashika Ramola, Northeastern University, United States
Shantanu Jain, Northeastern University, United States
Predrag Radivojac, Northeastern University, United States

Short Abstract: Accurately estimating the performance of machine learning classifiers is of fundamental importance in biomedical research, with potential societal consequences upon the deployment of best-performing tools in everyday life. Although classification has been extensively studied, there remain understudied problems when the training data violate the statistical assumptions relied upon for accurate learning and model characterization. This particularly holds true in the open world setting where observations of a phenomenon generally guarantee its presence but the absence of such evidence cannot be interpreted as the evidence of its absence. Learning from such data is often referred to as positive-unlabeled learning, a form of semi-supervised learning where all labeled data belong to one (say, positive) class. We here study the quality of estimated performance in positive-unlabeled learning. We provide evidence that such estimates can be wildly inaccurate. We then present correction methods for several such measures and demonstrate that the knowledge or accurate estimates of class priors in the unlabeled data and noise in the labeled data are sufficient for the recovery of true classification performance. We provide theoretical support as well as empirical evidence for the efficacy of the new performance estimation methods.
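
To see why a class prior can suffice for such a correction, consider a minimal sketch under a simplifying assumption that is not made in the abstract: the labeled positives are noise-free. The unlabeled set is then a mixture with a fraction `alpha` of positives, so the rate at which the classifier calls unlabeled examples positive equals `alpha * TPR + (1 - alpha) * FPR`, which can be solved for the true false positive rate. The function and parameter names are hypothetical, and this is not presented as the authors' estimator.

```python
def corrected_fpr(tpr, pu_fpr, alpha):
    """Recover the false positive rate on true negatives.

    tpr:    true positive rate measured on labeled positives
            (assumed noise-free in this sketch)
    pu_fpr: fraction of unlabeled examples predicted positive
    alpha:  proportion of positives in the unlabeled data, 0 <= alpha < 1
    """
    if not 0 <= alpha < 1:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    # pu_fpr = alpha * tpr + (1 - alpha) * fpr, solved for fpr.
    return (pu_fpr - alpha * tpr) / (1 - alpha)


# With TPR = 0.8, true FPR = 0.1 and alpha = 0.3, the naive estimate is
# 0.3 * 0.8 + 0.7 * 0.1 = 0.31, roughly three times the true value.
print(corrected_fpr(tpr=0.8, pu_fpr=0.31, alpha=0.3))
```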

Fourier-transform-based attribution priors improve the stability and interpretability of deep learning models for regulatory genomics

COSI: MLCSB COSI

Alex Tseng, Stanford University, United States
Avanti Shrikumar, Stanford University, United States
Anshul Kundaje, Stanford University, United States

Short Abstract: Deep learning models of regulatory DNA can accurately predict transcription factor (TF) binding and chromatin accessibility profiles. Base-resolution importance (i.e. "attribution") scores learned by the models can highlight predictive motifs and syntax. Unfortunately, these models are prone to overfitting and are sensitive to random initializations, often resulting in noisy and irreproducible attributions that obfuscate underlying motifs. To address these shortcomings, we propose a novel attribution prior, where the Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized. We evaluate different model architectures with and without attribution priors to predict binary or continuous profiles of TF binding or chromatin accessibility. The prior is agnostic to the model architecture or predicted experimental assay, yet provides similar gains across all experiments. We show that our attribution prior dramatically improves the models’ stability, interpretability, and performance on held-out data, including when training data is severely limited. Our attribution prior also allows models to identify motifs more sensitively and precisely within individual regulatory elements. This work represents an important advancement in improving the reliability of deep learning models for deciphering the cis-regulatory code from regulatory profiling experiments.
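
The penalty described above can be sketched in a simplified form: take the Fourier transform of a per-base attribution track and measure how much of the spectrum lies above a frequency cutoff. The hard cutoff, the normalisation, and the function names are assumptions made for this illustration; the abstract does not specify how frequencies are weighted, and a training-time version would need to be written in a differentiable framework.

```python
import cmath
import math


def dft_magnitudes(values):
    """Magnitudes of the discrete Fourier transform for k = 0 .. n // 2."""
    n = len(values)
    magnitudes = []
    for k in range(n // 2 + 1):
        total = sum(
            v * cmath.exp(-2j * cmath.pi * k * t / n)
            for t, v in enumerate(values)
        )
        magnitudes.append(abs(total))
    return magnitudes


def high_frequency_penalty(attributions, cutoff):
    """Fraction of non-constant spectral magnitude at frequencies k > cutoff."""
    magnitudes = dft_magnitudes(attributions)[1:]  # drop the k = 0 term
    total = sum(magnitudes)
    if total == 0:
        return 0.0
    # After dropping k = 0, index i holds frequency k = i + 1.
    return sum(magnitudes[cutoff:]) / total


smooth = [math.sin(2 * math.pi * t / 32) for t in range(32)]
noisy = [1.0 if t % 2 == 0 else -1.0 for t in range(32)]
print(high_frequency_penalty(smooth, cutoff=4))
print(high_frequency_penalty(noisy, cutoff=4))
```

A smooth track scores near zero and a rapidly alternating one scores near one, which is the behaviour a penalty on high-frequency components is meant to reward and discourage, respectively.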

Fragment-Based Deep Generative Model for Molecules

COSI: MLCSB COSI

Emmanuel Noutahi, InVivo AI, Canada
Prudencio Tossou, Laval University, Canada

Short Abstract: Recent advances in machine learning have allowed the application of deep generative modelling methods to unstructured data such as molecules. Consequently, generative models are now being explored for de novo drug design. These approaches offer considerable opportunities for early-stage drug discovery but still face several challenges, including chemical validity, drug-likeness and synthetic accessibility of generated compounds. In particular, a large body of previous work focuses on generating SMILES (Simplified Molecular-Input Line-Entry System), a string representation of molecules, by leveraging recent advances in recurrent neural networks. However, as the SMILES syntax is sensitive to even slight alterations, these approaches often produce chemically invalid structures.
In this work, we seek to address some of the shortcomings of existing deep generative approaches for molecule design by proposing a fragment-based framework which leverages a new type of representation we call “retrosynthetic fragmentation tree”. We show how this representation allows generation of molecules in a coarse-to-fine manner, based on a structure-by-structure design step as opposed to the currently prevailing atom-by-atom generation. We benchmark our method against current state-of-the-art approaches and show that not only does it improve chemical validity, but it also induces favorable bias regarding synthetic accessibility and drug-likeness of generated compounds.

Generalizing RNA velocity to transient cell states through dynamical modeling

COSI: MLCSB COSI

Volker Bergen, Helmholtz Center Munich, Germany
Marius Lange, Helmholtz Center Munich, Germany
Stefan Peidli, Helmholtz Center Munich, Germany
Alex Wolf, Helmholtz Center Munich, Germany
Fabian Theis, Helmholtz Center Munich, Germany

Short Abstract: The introduction of RNA velocity in single cells has opened up new ways of studying cellular differentiation. The originally proposed framework obtains velocities as the deviation of the observed ratio of spliced and unspliced mRNA from an inferred steady state. Errors in velocity estimates arise if the central assumptions of a common splicing rate and steady-state mRNA levels are violated. With scVelo (scvelo.org), we address these restrictions by solving the full transcriptional dynamics of splicing kinetics using a likelihood-based dynamical model. This generalizes RNA velocity to a wide variety of systems comprising transient cell states, which are common in development and in response to perturbations. We infer gene-specific rates of transcription, splicing and degradation, and recover the latent time of the underlying cellular processes. This latent time represents the cell’s internal clock and is based only on its transcriptional dynamics. Moreover, scVelo allows us to identify regimes of regulatory changes such as stages of cell fate commitment and, therein, systematically detects putative driver genes. We demonstrate that scVelo enables disentangling heterogeneous subpopulation kinetics with unprecedented resolution in hippocampal neurogenesis and pancreatic endocrinogenesis. We anticipate that scVelo will greatly facilitate the study of lineage decisions, gene regulation, and pathway activity identification.
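
For context, the steady-state formulation that the abstract describes as the original framework can be sketched for a single gene as follows. This is not scVelo's dynamical model or its API. The sketch assumes a splicing rate normalised to 1 and fits the steady-state ratio `gamma` by least squares through the origin over all cells, which is a simplification chosen for this example; the function names are hypothetical.

```python
def steady_state_gamma(unspliced, spliced):
    """Least-squares slope through the origin for unspliced ~ gamma * spliced."""
    denominator = sum(s * s for s in spliced)
    if denominator == 0:
        raise ValueError("spliced counts are all zero")
    return sum(u * s for u, s in zip(unspliced, spliced)) / denominator


def steady_state_velocity(unspliced, spliced):
    """Per-cell velocity: deviation of unspliced counts from the steady-state line."""
    gamma = steady_state_gamma(unspliced, spliced)
    return [u - gamma * s for u, s in zip(unspliced, spliced)]


# One gene measured in four cells; the first cell has excess unspliced mRNA,
# which this model reads as the gene being up-regulated in that cell.
spliced = [1.0, 2.0, 3.0, 4.0]
unspliced = [1.0, 1.0, 1.5, 2.0]
print(steady_state_velocity(unspliced, spliced))
```

If cells are not actually at steady state, the fitted `gamma` is biased, which is the kind of error the abstract attributes to violations of the steady-state assumption.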

Genomics-guided discovery of fungicidal bacteria

COSI: MLCSB COSI

Matthew Biggs, AgBiome, United States
Mathias Twizeyimana, AgBiome, United States
Esther Gachango, AgBiome, United States
Kelly Craig, AgBiome, United States
David Ingham, AgBiome, United States

Short Abstract: AgBiome leverages a vast collection of microbes to discover crop-protective products. We are continually refining the use of genomics to make our screening and discovery strategies more effective. We recently completed a discovery program for bacterial isolates with fungicidal activity against Colletotrichum, the causative agent of a devastating plant disease called Sorghum Anthracnose. Starting from more than 70,000 bacterial isolates, we implemented genomics-based sampling of our bacterial search space and systematic exploration of local optima. We eventually screened 1,131 strategically-selected bacterial isolates, 106 of which control Sorghum Anthracnose at greater than 70% in vitro. The protective isolates are highly diverse, representing 4 phyla and 18 genera. Furthermore, using a machine learning approach, we identified Biosynthetic Gene Clusters that are predictive of fungicidal activity and validated the predictive value of these genomic features through further in vitro screening.

GMStool: GWAS-based Marker Selection tool for Genomic Prediction from Genomic Data

COSI: MLCSB COSI

Seongmun Jeong, Korea Research Institute of Bioscience and Biotechnology, South Korea
Jae-Yoon Kim, Korea Research Institute of Bioscience and Biotechnology, South Korea
Namshin Kim, Korea Research Institute of Bioscience and Biotechnology, South Korea

Short Abstract: Genomic selection is a scheme to predict the genetic merits of individuals using genome-wide markers. However, genomic prediction accuracy is affected by many factors, including missing heritability, the number of genetic markers, models, and trait features. Among them, genetic information affects traits directly or indirectly through SNP-SNP interactions with other genetic markers. In this study, we present GMStool, a genome-wide association-based marker selection method using a heuristic search, for improving the prediction accuracy of traits with various heritability. To validate our method, we used four phenotypes with various heritability (0.271 to 0.774) in rice and soybean populations; markers were selected through cross-validation, a prediction model was built, and the model was applied to a test set. The number of selected markers ranged from 68 to 418, and the prediction accuracy (Pearson correlation) was 0.81 to 0.92 on the training set and 0.72 to 0.86 on the test set. GMStool is written in R and is freely available at github.com/lovemun/GMStool.

Gromov-Wasserstein based optimal transport to align single-cell multi-omics data

COSI: MLCSB COSI

Pinar Demetci, Brown University, United States
Rebecca Santorella, Brown University, United States
Bjorn Sandstede, Brown University, United States
Ritambhara Singh, Brown University, United States
William S. Noble, University of Washington, United States

Short Abstract: Data integration of single-cell measurements is critical for our understanding of cell development and disease, but the lack of correspondence between different types of single-cell measurements makes such efforts challenging. Several unsupervised algorithms are capable of aligning heterogeneous types of single-cell measurements in a shared space, enabling the creation of mappings between single cells in different data modalities.
We present Single-Cell alignment using Optimal Transport (SCOT), an unsupervised learning algorithm that uses Gromov-Wasserstein-based optimal transport to align single-cell multi-omics datasets. SCOT calculates a probabilistic coupling matrix that matches cells across two datasets. The optimization uses k-nearest neighbor graphs, thus preserving the local geometry of the data. We use the resulting coupling matrix to project one single-cell dataset onto another via barycentric projection. We compare the alignment performance of SCOT with state-of-the-art algorithms on three simulated and two real datasets. Our results demonstrate that SCOT yields results that are comparable in quality to those of competing methods, but SCOT is significantly faster and requires tuning fewer hyperparameters.
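
The final projection step can be illustrated with a minimal sketch. Given a coupling matrix, each cell of one dataset is mapped to the coupling-weighted average of the cells in the other dataset. Computing the coupling itself (the Gromov-Wasserstein optimization) is not shown; the function name and the list-of-lists input format are assumptions for this example and do not reflect the SCOT code.

```python
def barycentric_projection(coupling, targets):
    """Project source cells into the target space.

    coupling: coupling[i][j] is the transport mass between source cell i
              and target cell j
    targets:  targets[j] is the coordinate vector of target cell j
    """
    dim = len(targets[0])
    projected = []
    for row in coupling:
        mass = sum(row)
        if mass == 0:
            raise ValueError("a source cell has no transport mass")
        # Weighted average of target coordinates, normalised by the row mass.
        point = [
            sum(w * target[d] for w, target in zip(row, targets)) / mass
            for d in range(dim)
        ]
        projected.append(point)
    return projected


coupling = [[0.5, 0.0], [0.25, 0.25]]
targets = [[0.0, 0.0], [2.0, 4.0]]
print(barycentric_projection(coupling, targets))
```

In this toy case the first source cell is matched entirely to the first target cell, while the second is split evenly and lands at the midpoint of the two targets.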

Inferring Signaling Pathways with Probabilistic Programming

COSI: MLCSB COSI

David Merrell, University of Wisconsin, United States
Anthony Gitter, University of Wisconsin, United States

Short Abstract: Cells regulate themselves via complex biochemical processes called signaling pathways. These are usually depicted as networks, where nodes represent proteins and edges indicate their influence relationships. To understand diseases and therapies at the cellular level, it is crucial to understand the signaling pathways at work. Because signaling pathways can be rewired by disease, inferring signaling pathways from context-specific data is highly valuable.

We formulate signaling pathway inference as a Dynamic Bayesian Network (DBN) structure learning problem on phosphoproteomic time course data. We take a Bayesian approach, using MCMC to sample DBN structures. We use a novel proposal distribution that efficiently samples large, sparse graphs. We also relax some modeling assumptions made in past works. We call the resulting method Sparse Signaling Pathway Sampling (SSPS). We implement SSPS in Julia, using the Gen probabilistic programming language.

We evaluate SSPS on simulated and real data. SSPS attains superior scalability for large problems, in comparison to other DBN techniques. We also find that it competes well against established methods on the HPN-DREAM breast cancer network reconstruction challenge. SSPS significantly improves Bayesian techniques for network inference, and gives a proof of concept for probabilistic programming in this setting.

Integrating diffusion components of multi-omics datasets with application to cancer molecular subtyping

COSI: MLCSB COSI

Xin Duan, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Du Cai, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Yufeng Chen, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Qiqi Zhu, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Zeping Huang, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Chenghang Li, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Xiaojian Wu, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China
Feng Gao, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, China

Short Abstract: Cancer is a heterogeneous disease and consists of multiple molecular subtypes underlying the diverse clinical outcomes. Most strategies for cancer molecular subtyping are based on unsupervised classification of a single type of transcriptome data, especially gene expression profiles. However, molecular heterogeneity also exists on other genetic or epigenetic levels. For a more comprehensive analysis of cancer heterogeneity, multi-omics data integration provides a more effective solution. Here, we propose DMCI, which integrates the first diffusion component of each multi-omics dataset into a joint variable and combines it with K-means to dissect cancer heterogeneity. The diffusion map is a spectral non-linear dimensionality reduction method in which the first diffusion component is the most important dimension. The joint variable learned by DMCI not only captures the complementary information from different data sources but is also more computationally efficient. To demonstrate its effectiveness, we applied DMCI to colorectal cancer and ovarian cancer subtyping. Compared with other data integration methods, our approach showed much better performance and identified molecular subtypes that are much more clinically relevant.

Interpretable Deep Learning Algorithms for Prokaryotic Genome Annotation

COSI: MLCSB COSI

Mohammad Ruhul Amin, Fordham University, United States
Hirak Sarkar, University of Maryland, United States
Rob Patro, University of Maryland, United States

Short Abstract: With the recent rapid growth of deep learning applications, the use of off-the-shelf models has gained tremendous popularity. Despite these successes, understanding and thereby uncovering the underlying learning mechanism of such a model needs detailed exploration. Recent investigations in this area have led researchers to create algorithms for interpretable deep learning models. Motivated by these advancements, we would like to extend our prokaryotic genome annotation model to be interpretable. Our previously published tool, called DeepAnnotator, achieved an F-score of ~94% and established a generalized computational approach for genome annotation using deep learning. We used a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) to demonstrate the potential of deep learning networks to annotate genome sequences. In this work, we have extended the DeepAnnotator model with an interpretable layer to discover the sequence patterns that play a significant role in the identification of protein-coding sequence boundaries. An algebraic transformation of the vector representation (from the interpretable layer) of the DNA sequence around the start codon shows that the anti-Shine-Dalgarno sequence helps the model to detect the start of the gene. Therefore, such a network has the potential to discover new features that help ribosomes in protein synthesis.

Investigating the importance of matched training data for deep learning in genomics

COSI: MLCSB COSI

Sebastian Röner, Berlin Institute of Health (BIH), Germany
Max Schubach, Berlin Institute of Health (BIH), Germany
Louisa-Marie Krützfeldt, Berlin Institute of Health (BIH), Germany
Martin Kircher, Berlin Institute of Health (BIH), Germany

Short Abstract: Modern biological research is increasingly relying on computational methods, especially machine learning, to identify patterns in a wide variety of large datasets. To utilize machine learning to its full potential, a suitable amount of data has to be preprocessed, cleaned and prepared. One important aspect is the avoidance of technical biases and imbalances in the selection of training data. Here, we investigated the effects of sampling biases in the selection of negative training data for deep learning models trained on DNA sequence from open chromatin regions (e.g. DNase-seq or ATAC-seq). As a result, we developed our own sampling method, which selects windows over the whole genome and matches the positive and negative labeled regions according to their GC content and number per chromosome. We compared our results to other commonly used methods like genNullSeqs of the gkmSVM package (Ghandi, et al, Bioinformatics 2016) and simple random sampling, seeing an improvement in model performance and a decrease in introduced biases from model interpretation. To improve the applicability of our method further, we are currently working on the generalizability of our tool by including more features for negative selection, for example repeat annotations and distance to gene annotation features.
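
The GC-matching part of such a sampling scheme can be sketched as follows. This is a simplified illustration, not the authors' tool: it bins sequences by GC content and draws one negative from the same bin for each positive, and it omits the per-chromosome matching mentioned above. The function names, the number of bins, and the input format (plain sequence strings) are assumptions.

```python
import random
from collections import defaultdict


def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sample_gc_matched(positives, candidates, n_bins=10, seed=0):
    """Draw one candidate from the same GC bin for each positive, if available."""
    rng = random.Random(seed)

    def bin_of(seq):
        # A GC content of exactly 1.0 is placed in the top bin.
        return min(int(gc_content(seq) * n_bins), n_bins - 1)

    # Group candidate negatives by GC bin and shuffle within each bin.
    pools = defaultdict(list)
    for seq in candidates:
        pools[bin_of(seq)].append(seq)
    for pool in pools.values():
        rng.shuffle(pool)

    negatives = []
    for seq in positives:
        pool = pools[bin_of(seq)]
        if pool:  # positives without a matching candidate are skipped
            negatives.append(pool.pop())
    return negatives


positives = ["GGCC", "ATAT"]
candidates = ["GCGC", "TTAA", "GGGG", "ATGC"]
print(sample_gc_matched(positives, candidates))
```

Sampling without replacement within each bin keeps the negative set's GC distribution close to that of the positives, at the cost of dropping positives for which no candidate remains.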

KCML: a machine-learning framework for inference of multi-scale gene functions from genetic perturbation screens

COSI: MLCSB COSI

Jens Rittscher, University of Oxford, United Kingdom
Lucas Pelkmans, University of Zurich, Switzerland
Heba Sailem, University of Oxford, United Kingdom

Short Abstract: Characterising context‐dependent gene functions is crucial for understanding the genetic bases of health and disease. To date, inference of gene functions from large‐scale genetic perturbation screens is based on ad hoc analysis pipelines involving unsupervised clustering and functional enrichment. We present Knowledge‐ and Context‐driven Machine Learning (KCML), a framework that systematically predicts multiple context‐specific functions for a given gene based on the similarity of its perturbation phenotype to those with known function. As a proof of concept, we test KCML on three datasets describing phenotypes at the molecular, cellular and population levels and show that it outperforms traditional analysis pipelines. In particular, KCML identified an abnormal multicellular organisation phenotype associated with the depletion of olfactory receptors, and TGFβ and WNT signalling genes in colorectal cancer cells. We validate these predictions in colorectal cancer patients and show that olfactory receptors expression is predictive of worse patient outcomes. These results highlight KCML as a systematic framework for discovering novel scale‐crossing and context‐dependent gene functions which can provide a means to augment existing gene ontologies. KCML is highly generalisable and applicable to various large‐scale genetic perturbation screens.

Knowledge-guided graph-partitioning for biclustering

COSI: MLCSB COSI

Qian Yang, Tsinghua University, China
Haiyan Huang, University of California, Berkeley, United States
Xuegong Zhang, Tsinghua University, China

Short Abstract: Biclustering is an analysis of underlying relations between groups of rows and groups of columns in a data matrix. It has many applications in biology. Inderjit S. Dhillon converted the task to a partitioning task on a bipartite graph, in which finding a bicluster is then equivalent to pooling nodes from the two disjoint sets.

We propose a generative model to solve the partition problem on a bipartite graph. We model the generation of edges among nodes as sampling from different probability distributions. Maximum likelihood inference of the model parameters gives the partition of the graph. However, this inference problem is not tractable due to the huge parameter space. We therefore propose to incorporate known relations between nodes in one set. We implemented an EM algorithm to obtain the maximum likelihood estimates. Imperfect knowledge can also be updated in an iterative manner after the initial solution is obtained.

We conducted a series of simulation experiments, and our results showed the convergence of the algorithm as well as its better performance compared with the previous graph-based biclustering method. We also applied the method to an eQTL dataset to show its potential in real applications.

Latent periodic process inference from single-cell RNA-seq data

COSI: MLCSB COSI

Shaoheng Liang, The University of Texas MD Anderson Cancer Center, United States
Fang Wang, The University of Texas MD Anderson Cancer Center, United States
Jincheng Han, The University of Texas MD Anderson Cancer Center, United States
Ken Chen, The University of Texas MD Anderson Cancer Center, United States

Short Abstract: The development of a phenotype in a multicellular organism often involves multiple, simultaneously occurring biological processes. Advances in single-cell RNA-sequencing make it possible to infer latent developmental processes from the transcriptomic profiles of cells at various developmental stages. Accurate characterization is challenging, however, particularly for periodic processes such as cell cycle. To address this, we develop Cyclum, an autoencoder approach to identify circular trajectories in the gene expression space. Experiments using the scRNA-seq data from a set of proliferating cell-lines and mouse embryonic stem cells show that Cyclum reconstructed experimentally labeled cell-cycle stages and rediscovered known cell-cycle genes more accurately than Cyclone, ccRemover, Seurat, and reCAT. Applying Cyclum to removing cell-cycle effects substantially improves delineations of cell subpopulations, which is useful for establishing various cell atlases and studying tumor heterogeneity. Comparing circular patterns in each gene between nicotine treated human embryonic cells and a control sample proposes proven and new target genes of nicotine. Thus, Cyclum can be applied as a generic tool for characterizing periodic processes underlying cellular development/differentiation and cellular architecture in the scRNA-seq data. These features make it useful for constructing the Human Cell Atlas, the Human Tumor Atlas, and other cell ontologies.

Learning Context-aware Structural Representations to Predict Antigen and Antibody Binding Interfaces

COSI: MLCSB COSI

Srivamshi Pittala, Dartmouth College, United States
Chris Bailey-Kellogg, Dartmouth College, United States

Short Abstract: Understanding how antibodies specifically interact with their antigens can enable better drug and vaccine design, as well as provide insights into natural immunity. Experimental structural characterization can detail the “ground truth” of antibody-antigen interactions, but computational methods are required to efficiently scale to large-scale studies. In order to increase prediction accuracy as well as to provide a means to gain new biological insights into these interactions, we have developed PECAN, a unified deep learning-based framework to predict binding interfaces on both antibodies and antigens. PECAN leverages three key aspects of antibody-antigen interactions to learn predictive structural representations: (1) since interfaces are formed from multiple residues in spatial proximity, we employ graph convolutions to aggregate properties across local regions in a protein; (2) since interactions are specific between antibody-antigen pairs, we employ an attention layer to explicitly encode the context of the partner; (3) since more data is available for general protein-protein interactions, we employ transfer learning to leverage this data as a prior for the specific case of antibody-antigen interactions. We show that PECAN achieves state-of-the-art performance at predicting binding interfaces on both antibodies and antigens, and that each of its three aspects drives additional improvement in the performance.

Modeling gene expression dynamics in neuronal differentiation using ODE models trained on single-cell RNA seq data

COSI: MLCSB COSI

Yi Xing Hu, McGill University, Canada
Claudia Kleinman, McGill University, Canada
Selin Jessa, McGill University, Canada
Paul Francois, McGill University, Canada

Short Abstract: Single cell transcriptomics allows us to characterize intracellular heterogeneity and investigate detailed gene expression patterns within samples. Many algorithms exist for inferring cell trajectories across pseudo time to capture the dynamics of cellular processes. Predictive models, however, remain scarce. With the goal of predicting the effects of perturbations on the system and investigating optimal control strategies, we have implemented a method based on ordinary differential equations (ODE) that is able to predict the system’s behaviour over time during neuronal differentiation. We used a pseudo time trajectory constructed from single cell data from the mouse brain during different stages of development as a training dataset. The model was able to predict the gene expression dynamics across pseudo-time for missing data along the trajectory. Here, we present the analysis of our results, including a detailed overview of our experimental setup, methods and future directions.

Modelling cellular fate decisions using RNA velocity

COSI: MLCSB COSI

Marius Lange, Helmholtz Center Munich, Germany
Volker Bergen, Helmholtz Center Munich, Germany
Michal Klein, Institute of Computational Biology, Helmholtz Center Munich, Germany, Germany
Manu Setty, Program for Computational and Systems Biology, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, United States
Bernhard Reuter, Department of Computer Science, University of Tübingen, Germany, Germany
Dana Pe'Er, Program for Computational and Systems Biology, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, United States
Fabian Theis, Helmholtz Center Munich, Germany

Short Abstract: Single-cell RNA-sequencing has become one of the main tools for studying gene mechanisms and how they relate to cellular fate decisions. A central challenge is that single-cell RNA-seq only reveals static snapshots of gene expression. To overcome this limitation, and to link transient gene expression programmes to their eventual outcomes, a multitude of computational methods has been developed. With few exceptions, these methods are based on transcriptomic similarity between cells. This can lead to false conclusions as transcriptomic similarity does not necessarily imply developmental relatedness. Recently, RNA velocity has been introduced, recovering directed dynamic information from splicing kinetics which relaxes the aforementioned assumption. Here, we propose CellRank, a probabilistic model based on Markov processes which makes use of both transcriptomic similarity as well as RNA velocity to model cellular fate decisions. CellRank infers developmental start- and endpoints as well as probabilistic fate choices. We demonstrate CellRank’s capabilities in pancreatic endocrinogenesis, hippocampal dentate gyrus neurogenesis as well as in lung regeneration. In these applications, CellRank yields insights into the timing and mechanisms of fate choices. CellRank scales to large numbers of cells and is fully compatible with the analysis toolkit scanpy.
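
The idea of combining a velocity-based and a similarity-based view in one Markov chain, then reading off fate probabilities, can be sketched on a toy example. This is not CellRank's API: the function names, the mixing weight, and the four-state transition matrices below are all hypothetical, and fate probabilities are computed by plain fixed-point iteration on an absorbing chain.

```python
def mix_kernels(t_velocity, t_similarity, weight=0.2):
    """Convex combination of two row-stochastic transition matrices."""
    return [
        [(1 - weight) * v + weight * s for v, s in zip(rv, rs)]
        for rv, rs in zip(t_velocity, t_similarity)
    ]

def fate_probabilities(transition, terminal, n_iter=500):
    """Probability that a walk from each state ends in each terminal state.

    Terminal states are treated as absorbing; the one-step recursion is
    applied repeatedly until it stabilises."""
    n = len(transition)
    probs = [[1.0 if i == t else 0.0 for t in terminal] for i in range(n)]
    for _ in range(n_iter):
        new = []
        for i in range(n):
            if i in terminal:
                new.append(probs[i])
            else:
                new.append([
                    sum(transition[i][j] * probs[j][k] for j in range(n))
                    for k in range(len(terminal))
                ])
        probs = new
    return probs

# States: 0 = progenitor, 1 = intermediate, 2 = fate A, 3 = fate B (toy values).
t_velocity = [
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
t_similarity = [
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
transition = mix_kernels(t_velocity, t_similarity, weight=0.2)
fates = fate_probabilities(transition, terminal=[2, 3])
```

Each row of `fates` gives the probability of a cell in that state reaching fate A or fate B; for non-terminal states the two entries sum to one once the iteration has converged.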

MultiML: a method to predict optimal sample size in Multi-Omics data analyses using Machine Learning.

COSI: MLCSB COSI

Leandro Balzano-Nogueira, University of Florida, Gainesville, Florida, United States
Sonia Tarazona, Polytechnic University of Valencia, Valencia, Spain, Spain
Ana Conesa, University of Florida, United States

Short Abstract: Multi-omics methods are applied in clinical fields to find biomarkers or to model biology. Running many assays in parallel is expensive, and when developing a classifier it is necessary to establish the optimal number of observations for predictive power. Currently, no methodology exists to estimate sample size (SS) for omics classification problems or for multi-omics scenarios. We introduce MultiML, a machine learning algorithm that estimates the SS needed in multi-omics experiments to achieve a given classification error rate (CER). MultiML takes a multi-omics pilot dataset and runs a simulation using the user's chosen machine learning algorithm (the default implementation includes PLS-DA and Random Forest) to estimate the minimum SS required to keep the CER below a desired threshold. MultiML draws subsamples of different SS, fits a classifier for each case, and measures the error rate (ER); it then fits a first-order smoothed penalized P-spline regression to predict the ER for larger SS. This is done across combinations of omics platforms to determine the best combinations. MultiML also reports the method applied and the SS required to reach a user-given ER. We applied MultiML to the TCGA Glioblastoma dataset, containing gene expression, DNA methylation, proteomics, RNAseq and miRNAseq data from 515 individuals with different cancer subtypes. Tests were performed with CER = 0.01, considering combinations of 2 to 5 omics. MultiML found that adding omics reduces the ER, but that some (methylation) penalize predictor performance. MultiML proved to be a powerful estimator of the SS required to keep the ER within acceptable limits.
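
The extrapolation step, fitting a curve to error rates measured on pilot subsamples and solving for the sample size that reaches a target error, can be sketched as follows. Note the substitution: the abstract describes a penalized P-spline regression, whereas this sketch uses a simpler inverse power law, error = a * n^(-b), fitted by least squares in log-log space. The function names and pilot numbers are hypothetical.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (
        sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        / sum((x - mx) ** 2 for x in xs)
    )
    a = math.exp(my - slope * mx)
    return a, -slope

def required_sample_size(a, b, target_error):
    """Smallest integer n with a * n**(-b) <= target_error (up to rounding)."""
    return math.ceil((a / target_error) ** (1 / b))

# Hypothetical pilot: error rates of classifiers trained on subsamples.
pilot_sizes = [10, 20, 40, 80]
pilot_errors = [0.40, 0.20, 0.10, 0.05]
a, b = fit_power_law(pilot_sizes, pilot_errors)
n_needed = required_sample_size(a, b, target_error=0.01)
```

A spline-based fit, as in the abstract, makes fewer assumptions about the curve's shape; the power law is used here only because it keeps the example short and has a closed-form inverse.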

Multiview learning for understanding functional multiomics

COSI: MLCSB COSI

Nam Nguyen, Stony Brook University, United States
Daifeng Wang, University of Wisconsin - Madison, United States

Short Abstract: The molecular mechanisms and functions in complex biological systems remain elusive. Recent high-throughput techniques have generated a wide variety of multiomics datasets that enable the identification of biological functions and mechanisms via multiple facets. However, integrating these large-scale multiomics data and discovering functional insights remain challenging tasks. To address this, machine learning has been applied to analyze multiomics. This review introduces multiview learning, an emerging machine learning field, and envisions its potentially powerful applications to multiomics. In particular, multiview learning is more effective than previous integrative methods for learning data’s heterogeneity and revealing cross-talk patterns. Although it has been applied to various contexts including computer vision and speech recognition, multiview learning has not yet been widely applied to biological data, specifically multiomics data. Therefore, we first review recent multiview learning methods and unify them in a framework called multiview empirical risk minimization (MV-ERM). We further discuss the potential applications of each method to multiomics, including genomics, transcriptomics, and epigenomics, aiming to discover the mechanistic interpretations across omics. Second, we explore possible applications to different biological systems, including human diseases, plants, and single-cell analysis, and discuss both the benefits and caveats of using multiview learning to discover the molecular mechanisms of these systems.

NECminer: a novel alignment-free tool for the identification and classification of nitrification-related enzymes using deep learning

COSI: MLCSB COSI

Jeanette Norton, Utah State University, United States
Naveen Duhan, Utah State University, United States
Rakesh Kaundal, Bioinformatics Facility, Center for Integrated BioSystems, Utah State University, United States

Short Abstract: Nitrification is an important microbial two-step transformation in the nitrogen cycle, as it is the only natural process that produces nitrate within a system. The time and resources needed for determining the function of enzymes experimentally are restrictively costly. Therefore, an accurate computational prediction of the nitrification-related enzymes has become much more important.
We propose NECminer, a novel end-to-end feature selection and classification model training approach for nitrification-related enzyme prediction. The algorithm has been developed using Deep Learning, a class of machine learning algorithms. Two large datasets of protein sequences, enzymes and non-enzymes respectively, were used to train the models with protein sequence features such as amino acid composition, dipeptide composition, conformation transition and distribution (CTD), conjoint, quasi order, etc. K-fold cross-validation and independent testing were performed to validate our model training. Among all features, the CTD-based method gave the best prediction performance (accuracy of 84.18%), followed by the Dipep feature (82.93%). The MCC values ranged from 0.65 to 0.68. Currently, we are implementing more enhanced features to significantly improve the prediction performance. NECminer uses a two-tier approach for prediction: it first predicts whether a query sequence is an enzyme or a non-enzyme, and then predicts the nitrification-related class. It is freely accessible at bioinfo.usu.edu/NECminer/.

Neuroblastoma logistic regression for survival patients prediction

COSI: MLCSB COSI

André Luiz Molan, UNESP - São Paulo State University, Brazil
José Luiz Rybarczyk-Filho, UNESP - São Paulo State University, Brazil

Short Abstract: Neuroblastoma is an extracranial solid tumor with highly predictive clinical behavior that mainly affects individuals under 15 years old. With NGS technologies, gene expression profiling of tumors has become more common, making meta-analysis and machine learning techniques indispensable for effective study. Here, based on RNA-seq data from 498 patients, we performed a meta-analysis and a logistic regression to predict the survival status of patients from their gene expression profiles. The patients were grouped according to their survival status (dead or alive). The meta-analysis was performed in R with the WGCNA package, which uses the Stouffer method and generates adjusted p-values (q-values) for each gene. Only q-values lower than 0.000001 were considered significant, resulting in 77 genes. Based on the expression values of these genes, we performed a logistic regression in R using the caret package. The dataset was split into two partitions, 70% for training and 30% for testing. Cross-validation was performed, and the model was then trained. When tested, the model achieved an accuracy of 79.05%, with a sensitivity of 85.47% and a specificity of 54.84%.
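
To make the three reported metrics concrete, the sketch below computes accuracy, sensitivity, and specificity from confusion-matrix counts. The counts are hypothetical: they were chosen so that the output matches the percentages in the abstract, which does not report the confusion matrix itself or state which survival status was treated as the positive class.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return accuracy, sensitivity, specificity

# Hypothetical counts (148 test samples) consistent with the reported figures.
acc, sens, spec = classification_metrics(tp=100, fn=17, tn=17, fp=14)
print(f"{acc:.2%} {sens:.2%} {spec:.2%}")  # prints "79.05% 85.47% 54.84%"
```

With counts like these, the gap between sensitivity and specificity reflects class imbalance: the smaller class contributes few samples, so a handful of errors moves its rate substantially.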

One-Class recommender systems for modeling enzyme-substrate interactions

COSI: MLCSB COSI

Xinmeng Li, Department of Computer Science, Tufts University, United States
Soha Hassoun, Department of Computer Science, Tufts University, United States

Short Abstract: Although traditionally assumed specific (transforming a single substrate), many, if not all, enzymes are promiscuous and act on multiple substrates. Characterizing such promiscuous activity on molecules is essential in advancing biological engineering applications such as selecting enzymatic steps when creating synthesis pathways and predicting unexpected interactions within cellular hosts. Despite tremendous experimental efforts and documentation in large databases (e.g. BRENDA), the extent of enzyme promiscuity continues to be largely unexplored and under-documented.
Viewing enzymes as items and substrates as users (or vice versa), we evaluate SVD-based and neural network models to recommend substrates to enzymes, or vice versa. Our key contribution is formulating this problem as a one-class recommender system because only positive enzyme-substrate interactions are observed in databases such as KEGG or BRENDA. We use sampling of the unknown, yet presumably negative, interactions and bagging, and compare the results against a recommender system obtained assuming all unknown reactions are negative, where negative and positive interactions are equally weighted. We train and test the recommender systems using interaction data from the KEGG database and report performance using Mean Average Precision (MAP), R-Precision, and Precision@K. Integrating this recommender system with pathway synthesis tools can advance biological engineering practices.
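
The three ranking metrics named in this abstract can be sketched in a few lines. The definitions below are the standard ones; the function names and the toy ranking are illustrative assumptions, not the authors' evaluation code.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k

def r_precision(ranked, relevant):
    """Precision at R, where R is the number of relevant items."""
    if not relevant:
        return 0.0
    return precision_at_k(ranked, relevant, len(relevant))

def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant item."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(rankings, relevants):
    """Average precision, averaged over all queries (e.g. over enzymes)."""
    scores = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores)

# Hypothetical ranking of substrates recommended for one enzyme.
ranked = ["s1", "s2", "s3", "s4"]
relevant = {"s1", "s3"}
p_at_2 = precision_at_k(ranked, relevant, 2)
r_prec = r_precision(ranked, relevant)
ap = average_precision(ranked, relevant)
```

In a one-class setting only the positive interactions are known, so "relevant" here means held-out observed interactions; unobserved pairs ranked highly count against precision even though some may be true but undocumented.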

PaccMann^RL: Designing anticancer drugs from transcriptomic data via reinforcement learning

COSI: MLCSB COSI

Ali Oskooei, IBM, Switzerland
Joris Cadow, IBM, Switzerland
Karsten Borgwardt, ETH Zurich, Switzerland
Matteo Manica, IBM, Switzerland
Jannis Born, ETH Zurich, Switzerland
Maria Rodriguez Martinez, IBM Research Zurich, Switzerland

Short Abstract: While state-of-the-art deep learning approaches have shown potential in generating compounds with desired chemical properties, they disregard the cellular biomolecular properties of the target disease. We introduce a novel framework for de-novo molecular design that systematically leverages systems biology information into the drug discovery process. Embodied through two separate VAEs, the drug generation is driven by a disease context (transcriptomic profiles of cancer cells) deemed to represent the target environment of the drug. Showcased at the task of anticancer drug discovery, our conditional generative model is demonstrated to tailor anticancer compounds to target desired biomolecular profiles. Specifically, we reveal how the molecule generation can be biased towards compounds with high predicted inhibitory effect against individual cell-lines or cell-lines from specific cancer sites. We verify our approach by investigating candidate drugs generated against specific cancer types and find highest structural similarity to existing compounds with known efficacy against these types. Despite no direct optimization of other pharmacological properties, we report good agreement with cancer drugs in metrics like drug-likeness, synthesizability and solubility. We envision our approach to be a step towards increasing success rates in lead compound discovery and finding more targeted medicines by leveraging the cellular environment of the disease.

Partially Connected AutoEncoders (PCAE) for single cell RNAseq data mining

COSI: MLCSB COSI

Luca Alessandri, Dept. of Molecular Biotechnology and Health Sciences, University of Torino, Italy
Francesca Cordero, Dept. of Computer Science, University of Torino, Italy
Marco Beccuti, Dept. of Computer Science, University of Torino, Italy
Maddalena Arigoni, Dept. of Molecular Biotechnology and Health Sciences, University of Torino, Italy
Martina Olivero, Dept. of Oncology, University of Torino, Italy
Raffaele Calogero, Dept. of Molecular Biotechnology and Health Sciences, University of Torino, Italy

Short Abstract: rCASC is a framework providing an integrated analysis environment for single-cell RNAseq. Sub-population discovery can be achieved using different clustering techniques, based on different distance metrics. Quality of clusters is estimated through the Cell Stability Score, which describes the stability of a cell in a cluster as a consequence of perturbations induced by removing a random set of cells from the overall cell population.
In this work we introduce a new rCASC functionality for the functional annotation of cell clusters exploiting autoencoders. Autoencoder-based dimensionality reduction was used to discover functional/molecular features underlying cell subpopulation organization (cell clusters), using an architecture (Partially Connected AutoEncoders, PCAE) informed by prior biological data (TF targets, miRNA targets, etc.). Our preliminary data are based on a simple latent space structure, with only one hidden layer, in which each hidden node represents a transcription factor (TF) or miRNA. The edges connecting input/output nodes (genes) with each hidden node are based on experimental evidence of a functional relation between the gene and the TF/miRNA hidden node. Our results indicate that latent space clustering allows the identification of molecular signatures involved in cell subpopulation organization. Furthermore, markers extracted from the TF-based latent space perfectly fit with subpopulation biological characteristics.
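
The partial connectivity described above can be sketched as a 0/1 mask applied to a layer's weights, so that each hidden node (a TF or miRNA) only receives input from genes linked to it by prior knowledge. This is a minimal forward-pass sketch, not the rCASC implementation: the gene names, target sets, weights, and the tanh activation are all assumptions made for illustration.

```python
import math

def masked_layer(inputs, weights, mask):
    """Forward pass of one partially connected layer.

    Each weight is multiplied by a 0/1 mask entry, so a hidden node only
    sees the inputs it is connected to."""
    hidden = []
    for w_row, m_row in zip(weights, mask):
        z = sum(w * m * x for w, m, x in zip(w_row, m_row, inputs))
        hidden.append(math.tanh(z))  # activation choice is an assumption
    return hidden

# Hypothetical prior knowledge: which genes each TF is known to target.
genes = ["geneA", "geneB", "geneC", "geneD"]
tf_targets = {"TF1": {"geneA", "geneB"}, "TF2": {"geneC"}}

# One mask row per hidden node, one column per input gene.
mask = [[1 if g in targets else 0 for g in genes] for targets in tf_targets.values()]
weights = [[0.5, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0]]
expression = [1.0, 2.0, 3.0, 4.0]
hidden = masked_layer(expression, weights, mask)
```

Because masked weights never influence the output, they also receive zero gradient during training, which is what keeps each latent node interpretable as a single regulator.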

Predicting Critical Events after Hematopoietic Stem Cell Transplantation

COSI: MLCSB COSI

Lisa Eisenberg, Department of Computer Science, University of Tübingen, Germany
Amin T. Turki, Department of Bone Marrow Transplantation, University Hospital Essen, Germany
Aleksandra Pillibeit, Department of Bone Marrow Transplantation, University Hospital Essen, Germany
Christian Brossette, Department of Pediatric Oncology and Hematology, Saarland University, Germany
Stefan Theobald, Department of Pediatric Oncology and Hematology, Saarland University, Germany
Yvonne Braun, Department of Pediatric Oncology and Hematology, Saarland University, Germany
Andrea Grandjean, Averbis GmbH, Germany
Saskia Leserer, Department of Bone Marrow Transplantation, University Hospital Essen, Germany
Jürgen Rissland, Institute of Virology, Saarland University, Germany
Dominic Kaddu-Mulindwa, Department of Internal Medicine, Saarland University, Germany
Kerstin Rohm, Fraunhofer Institute for Biomedical Engineering, Germany
Jochen Rauch, Fraunhofer Institute for Biomedical Engineering, Germany
Gabriele Weiler, Fraunhofer Institute for Biomedical Engineering, Germany
Stephan Kiefer, Fraunhofer Institute for Biomedical Engineering, Germany
Nico Pfeifer, Department of Computer Science, University of Tübingen, Germany

Short Abstract: Allogeneic hematopoietic stem cell transplantation (HSCT) effectively treats leukemia and lymphoma through its alloreactive immune effects. However, this immunotherapy may have life-threatening complications, such as graft-versus-host disease (GVHD) or Cytomegalovirus (CMV) viremia. Preventing and responding to complications requires different actions from clinicians, e.g. increasing immunosuppression to address GVHD, which may be fatal with CMV. Clinical decision support systems based on machine learning models which predict these events accurately and early have the potential to alert clinicians to impending complications, to support difficult treatment decisions, and to ultimately improve care.

In this research project we analyzed a large dataset of more than 1,700 patients with HSCT. Combining temporal virology and laboratory data with information on pre-HSCT constellations (e.g. donor matching, treatment before HSCT), we trained machine learning models to predict impending critical events after HSCT. We focused on the prediction of survival and CMV reactivation in the following x days, and varied x to assess in which time frame predictions are possible. Though we are still far from application in clinical practice, our models show promising first results. For x = 21, for instance, we predict survival with AUROC 0.90 and AUPRC 0.48, and CMV with AUROC 0.87 and AUPRC 0.46.

Predicting gene essentiality in Caenorhabditis elegans by feature engineering and machine-learning

COSI: MLCSB COSI

Tulio L. Campos, The University of Melbourne, Brazil
Pasi K. Korhonen, The University of Melbourne, Australia
Paul W. Sternberg, California Institute of Technology, United States
Robin B. Gasser, The University of Melbourne, Australia
Neil D. Young, The University of Melbourne, Australia

Short Abstract: Defining genes that are essential for life has major implications for understanding critical biological processes and mechanisms. Although essential genes have been identified and characterised experimentally using functional genomic tools, it is challenging to predict with confidence such genes from molecular and phenomic data sets using computational methods. Using extensive data sets available for the model organism Caenorhabditis elegans, we constructed here a machine-learning (ML)-based workflow for the prediction of essential genes on a genome-wide scale. We identified strong predictors for such genes and showed that trained ML models consistently achieve highly-accurate classifications. Complementary analyses revealed an association between essential genes and chromosomal location. We believe that the present ML-based approach will be applicable to other metazoans for which comprehensive data (i.e. genomic, transcriptomic, proteomic, variomic, epigenetic and phenomic) are available.

Prediction of tissue of origin and molecular subtypes for cancers of unknown primary (CUP) using machine learning

COSI: MLCSB COSI

Yue Zhao, The Jackson Laboratory, United States
Sandeep Namburi, The Jackson Laboratory, United States
Ziwei Pan, The Jackson Laboratory and UConn Health Center, United States
Carolyn Paisie, The Jackson Laboratory, United States
Honey Reddi, The Jackson Laboratory (At the time of work), United States
Richard Tothill, University of Melbourne, Australia
Jens Rueter, The Jackson Laboratory, United States
Kanwal Raghav, The University of Texas MD Anderson Cancer Center, United States
William Flynn, The Jackson Laboratory, United States
Sheng Li, The Jackson Laboratory, United States
R. Krishna Murthy Karuturi, The Jackson Laboratory, United States
Joshy George, Jackson Laboratory for Genomic Medicine, United States

Short Abstract: Knowledge of a tumor’s tissue of origin and molecular subtype plays a critical role in the choice of treatment regimen and in prognosis. However, approximately 5% of metastatic tumors have an unknown tissue of origin and are therefore classified as cancers of unknown primary (CUP). CUP patients are denied tissue-specific therapy and have a poor prognosis. Furthermore, there are no tools available for molecular subtyping. We developed a 1D Inception convolutional neural network (1D-iCNN) model to identify the primary site using the transcriptional profiles of annotated primary tumors across 32 cancer types from The Cancer Genome Atlas project (TCGA). The 1D-iCNN model utilizes chromosomally ordered gene expression data and multiple convolutional kernels with different configurations simultaneously to improve generalization. Our 1D-iCNN identifies the tissue of origin with a top-1 accuracy of >97% in cross-validation and >92% in independent external validation of metastatic tumors with known primaries. For 11 tissues of origin, using the same gene expression data, we developed random forest models that classify the molecular subtype of the tumor. Our random forest models attain an accuracy of >80%. Together, our methods to identify the tissue of origin and molecular subtype will provide better therapeutic opportunities via precise characterization of tumors for CUP patients.

ProbeRating: A recommender system to infer binding profiles for nucleic acid-binding proteins

COSI: MLCSB COSI

Shu Yang, Department of Computer Science, University of British Columbia, Canada
Xiaoxi Liu, RIKEN Center for Integrative Medical Sciences, Japan
Raymond T. Ng, Department of Computer Science, University of British Columbia, Canada

Short Abstract: Determining the binding preferences of nucleic acid-binding proteins (NBPs), namely RNA-binding proteins (RBPs) and transcription factors (TFs), is the key to deciphering the protein-nucleic acid interaction code. Today, available NBP binding data from experiments are still limited, which leaves a large portion of NBPs uncovered. Unfortunately, most computational methods that model NBP binding preferences require experimental data for the NBPs of interest, and thus only focus on experimentally characterized NBPs. The binding preferences of experimentally unexplored NBPs remain largely unknown.
Here, we introduce ProbeRating, a nucleic acid recommender system that utilizes techniques from deep learning and word embeddings. ProbeRating is developed to predict binding profiles for unexplored or poorly studied NBPs by exploiting their homologous NBPs that currently have available binding data. ProbeRating adapts FastText from Facebook AI Research to extract biological features, and it then builds a neural network based recommender system. We evaluate the performance of ProbeRating on two different tasks: one for RBPs and one for TFs. ProbeRating outperforms previous methods on both tasks. The results show that ProbeRating can be a useful tool to study the binding mechanism for the many NBPs that lack direct experimental evidence.

Quantifying gene selection in cancer through protein functional alteration bias

COSI: MLCSB COSI

Nadav Brandes, The Hebrew University of Jerusalem, Israel
Nathan Linial, The Hebrew University of Jerusalem, Israel
Michal Linial, The Hebrew University of Jerusalem, Israel

Short Abstract: Compiling the catalogue of genes actively involved in cancer is an ongoing endeavor, with profound implications for the understanding and treatment of the disease. An abundance of computational methods have been developed to screen the genome for candidate driver genes based on genomic data of somatic mutations in tumors. Existing methods make many implicit and explicit assumptions about the distribution of random mutations. We present FABRIC, a new framework for quantifying the selection of genes in cancer by assessing the effects of de-novo somatic mutations on protein-coding genes. Using a machine-learning model, we quantified the functional effects of ∼3M somatic mutations extracted from over 10 000 human cancerous samples, and compared them against the effects of all possible single-nucleotide mutations in the coding human genome. We detected 593 protein-coding genes showing statistically significant bias towards harmful mutations. These genes, discovered without any prior knowledge, show an overwhelming overlap with known cancer genes, but also include many overlooked genes. FABRIC is designed to avoid false discoveries by comparing each gene to its own background model using rigorous statistics, making minimal assumptions about the distribution of random somatic mutations. The framework is an open-source project with a simple command-line interface.

Quantitative evaluation of structural alerts extracted from deep learning QSAR models

COSI: MLCSB COSI

Sangrak Lim, KIST Europe, Germany
Yong Oh Lee, KIST Europe, Germany
Young Jun Kim, KIST Europe, Germany

Short Abstract: Background: In toxicity evaluation, structural alerts have been used as an efficient method for screening the chemical-protein interaction. Recently, deep learning QSAR models have improved the toxicity prediction performance with the capability of the automated feature extraction. Several studies analyzed their model with structural alerts extracted from the hidden weights of the model. However, the usual qualitative analysis of a model may lead to irreproducible results.
Methods: We built a simple deep learning model based on a convolutional neural network and extracted structural alerts using the weights of the model. To evaluate the validity of the extracted structural alerts, we used the bioalerts library to derive structural alerts with a statistical method. The structural alerts extracted from the deep learning model were compared to those computed by bioalerts.
Results: Substring similarity results on the TOX21 dataset showed a 34% match between the structural alerts extracted from the deep learning model and those from the statistical method, even when the deep learning model achieved high performance. As future work, we will (1) provide substructure alert features to deep learning models and (2) validate the extracted substructure alerts, which could function as toxicity biomarkers.

Robust Identification of Temporal Biomarkers in Longitudinal Omics Studies

COSI: MLCSB COSI

Ahmed Metwally, Stanford University, United States
Michael Snyder, Stanford University, United States

Short Abstract: Longitudinal omics studies increasingly collect rich data sampled frequently in time across large cohorts to capture dynamic health fluctuations and disease transitions. There is a need for statistical frameworks to identify not only which omics features are differentially regulated between groups but also over what time intervals. Longitudinal omics data additionally may have inconsistencies including nonuniform sampling intervals, missing datapoints, subject drop out, and different number of samples for different subjects.

We developed a statistical method that provides robust identification of time intervals of temporal omics biomarkers. The proposed method takes a semi-parametric approach, using smoothing splines to model longitudinal data and inferring significant time intervals of omics features from an empirical distribution constructed through a permutation procedure. Analysis of 5 simulated datasets with diverse temporal patterns indicated specificity > 0.99 and sensitivity > 0.72. Application of the proposed method to a longitudinal multiomics study of 105 participants revealed temporal sexual dimorphism in amino acids, lipids, and hormone metabolites following a respiratory viral infection. We provide an open-source Bioconductor R package, OmicsLonDA, to make our method accessible.
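
The permutation step can be illustrated with a stripped-down example: within one time interval, compare the observed difference between two groups against the distribution obtained by shuffling group labels. This sketch omits the smoothing-spline modelling described above and is not the OmicsLonDA API; the function name, the difference-in-means statistic, and the data are hypothetical.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Empirical two-sided p-value for a difference in group means,
    obtained by randomly reassigning group labels."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly positive.
    return (extreme + 1) / (n_perm + 1)

# Hypothetical feature levels for two groups within one time interval.
group_a = [5.1, 5.3, 4.9, 5.2, 5.0, 5.4]
group_b = [3.9, 4.1, 4.0, 3.8, 4.2, 4.0]
p_separated = permutation_p_value(group_a, group_b)
p_identical = permutation_p_value([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

In a longitudinal setting the same idea would be applied per interval to a statistic derived from the fitted curves, with subjects (not individual samples) permuted between groups so that within-subject correlation is preserved.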

scClassify: sample size estimation and multiscale classification of cells using single and multiple reference

COSI: MLCSB COSI

Yingxin Lin, The University of Sydney, Australia
Yue Cao, The University of Sydney, Australia
Hani Jieun Kim, The University of Sydney, Australia
Agus Salim, La Trobe University, Australia
Terence Speed, Walter and Eliza Hall Institute of Medical Research, Australia
David Lin, Cornell University, United States
Pengyi Yang, The University of Sydney, Australia
Jean Yang, The University of Sydney, Australia

Short Abstract: Automated cell type identification is a key computational challenge in single-cell RNA-sequencing (scRNA-seq) data. To capitalise on the large collection of well-annotated scRNA-seq datasets, we developed scClassify, a multiscale classification framework based on ensemble learning and cell type hierarchies constructed from single or multiple annotated datasets as references. scClassify enables the estimation of sample size required for accurate classification of cell types in a cell type hierarchy and allows joint classification of cells when multiple references are available. We show that scClassify consistently performs better than other supervised cell type classification methods across 114 pairs of reference and testing data, representing a diverse combination of sizes, technologies, and levels of complexity, and further demonstrate the unique components of scClassify through simulations and compendia of experimental datasets. Finally, we demonstrate the scalability of scClassify on large single-cell atlases and highlight a novel application of identifying subpopulations of cells from the Tabula Muris data that were unidentified in the original publication. Together, scClassify represents state-of-the-art methodology in automated cell type identification from scRNA-seq data.

SCIM: Universal Single-Cell Matching with Unpaired Feature Sets

COSI: MLCSB COSI

Stefan Stark, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland, Switzerland
Joanna Ficek, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland, Switzerland
Francesco Locatello, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland, Switzerland
Ximena Bonilla, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland, Switzerland
Stéphane Chevrier, Department of Quantitative Biomedicine, University of Zürich, Winterthurerstrasse 190, 8057 Zürich, Switzerland, Switzerland
Franziska Singer, Nexus Personalized Health Technologies, ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland, Switzerland
Kjong-Van Lehmann, Biomedical Informatics, Dept. of Computer Science, ETH Zürich, Universitätsstrasse 6, 8092 Zürich, Switzerland, Switzerland
Gunnar Rätsch, ETH Zürich, Switzerland

Short Abstract: Multi-modal molecular profiling of samples on a single-cell level can yield deeper insights into tissue microenvironment and disease dynamics. Profiling technologies like scRNA-seq often consume the analyzed cells and cellular correspondences between data modalities are lost. To exploit single cell ’omics technologies jointly, we propose Single-Cell data Integration via Matching (SCIM), a scalable and accurate approach to recover such correspondences in two or more technologies, even in the absence of overlapping feature sets. SCIM assumes that cells share a common underlying structure and reconstructs such technology-invariant latent space using an auto-encoder framework with an adversarial objective. Cell pairs across technologies are then identified using a customized bipartite matching scheme operating on the latent representations. We evaluate SCIM on a simulated branching process designed for scRNA-seq data (total of 192,000 cells) and show that the cell-to-cell matches reflect the same pseudotime (Pearson’s coefficient: 0.86). Moreover, we apply our method to a real-world melanoma tumor sample, and achieve 93% cell-matching accuracy with respect to cell-type label when aligning scRNA-seq and CyTOF datasets. SCIM is a scalable and flexible algorithm that bridges the gap between generation and integrative interpretation of diverse multi-modal data.
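The matching step can be pictured with a small Python sketch that pairs cells from two technologies by minimising total distance in a shared latent space. This is only an illustration under stated assumptions: the latent coordinates are invented, the encoder is not modelled at all, and the exhaustive search over permutations used here is a brute-force substitute for SCIM's customized bipartite matching scheme, suitable only for a handful of cells.

```python
import math
from itertools import permutations

def min_cost_matching(latent_a, latent_b):
    """One-to-one pairing of cells that minimises the summed Euclidean distance.

    Exhaustive search over all permutations; feasible only for tiny inputs.
    """
    n = len(latent_a)
    if n != len(latent_b):
        raise ValueError("this sketch assumes equally sized cell sets")
    cost = [[math.dist(a, b) for b in latent_b] for a in latent_a]
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(enumerate(best))

# Hypothetical 2-D latent codes for three cells measured with each technology.
latent_rna = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
latent_cytof = [(5.1, 4.9), (0.1, -0.1), (0.9, 1.2)]

pairs = min_cost_matching(latent_rna, latent_cytof)
print(pairs)  # each tuple is (index in latent_rna, index in latent_cytof)
```

For realistic cell counts, a polynomial-time assignment solver would be needed in place of the permutation search.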

SEABED: An unsupervised clustering approach to stratify drug responsive subpopulations

COSI: MLCSB COSI

Tzen Szen Toh, University of Sheffield, United Kingdom
Nirmal Keshava, Constellation Analytics, LLC, United States
Haobin Yuan, University of Sheffield, United Kingdom
Bingxun Yang, University of Sheffield, United Kingdom
Michael Menden, Helmholtz Zentrum München—German Research Center for Environmental Health, Germany
Dennis Wang, University of Sheffield, United Kingdom

Short Abstract: Personalised cancer treatments are challenging to develop due to disease heterogeneity. Difficulties in predicting response from preclinical studies and in reliably measuring drug sensitivity have hindered the detection of pharmacological biomarkers and clinical trial success. Arbitrarily defined thresholds of drug response used in clinical trials result in a higher likelihood of drawing false conclusions from pharmacology response data.

Published in npj Systems Biology and Applications, we introduce SEABED (SEgmentation And Biomarker Enrichment of Differential treatment response), an unsupervised machine learning approach to compare drug response between 327 anti-cancer therapies. SEABED integrated multiple measures of drug response, identifying subpopulations responding differently to drugs with similar or different targets. This enabled further identification of a combination of drugs that are effective for subpopulations with a particular mutation. Systematic analysis of preclinical data for a failed phase III clinical trial for KRAS mutant non-small cell lung cancer patients suggests potential indications in KRAS mutant pancreatic and colorectal cancers. Our approach differs from existing methods by maximizing the differences in drug response profiles between groups of individuals and then applying statistical models to identify genetic biomarkers of subpopulations, further defining them. This study exemplifies a method to identify novel cancer subpopulations, their genetic biomarkers, and effective drug combinations.

Selecting representative subsets of genomic loci for high-throughput functional validation

COSI: MLCSB COSI

Jacob Schreiber, Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA, United States
Jeffrey A. Bilmes, University of Washington, United States
William S. Noble, University of Washington, United States
Galip Yardimci, University of Washington, United States

Short Abstract: Functional characterization assays test large collections of genomic loci for functional activity; however, the strategy for selecting target loci to be tested varies across assays. Some methods perform a high throughput scanning of megabase-scale gene loci in a tiling fashion, whereas other methods test candidate regulatory sites as defined by DNase hypersensitivity. Here, we describe a submodular optimization framework for identifying representative subsets of the genome in an unbiased manner that have minimal redundancy for the purpose of functional characterization studies. Our approach uses two types of input: features that describe the epigenetic state, such as histone ChIP-seq assays, and DNA sequence derived features. Using these features, we performed representative subset selection along the entire length of the human genome at megabase resolution and along five genome loci at 250 basepair resolution. We assessed our method’s ability to identify the optimal representative subset by comparing the subset against Segway annotations and the 3D subcompartment structure of the genome. We show that the representative subsets we obtain cover diverse types of genomic elements in a non-redundant manner and better represent rarer element types. We anticipate that our method will allow unbiased and principled selection of target loci for functional characterization assays.
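To make the idea of representative subset selection concrete, here is a minimal Python sketch of greedy maximisation of a facility-location function, a common submodular objective. The abstract does not state which objective or optimiser the authors use, so the objective, the similarity function, and the toy feature vectors below are all assumptions chosen for illustration.

```python
def similarity(u, v):
    """Similarity that decays with squared Euclidean distance (illustrative choice)."""
    return 1.0 / (1.0 + sum((x - y) ** 2 for x, y in zip(u, v)))

def greedy_facility_location(features, k):
    """Greedily pick k items maximising sum_j max_{i in S} sim(i, j)."""
    n = len(features)
    sim = [[similarity(features[i], features[j]) for j in range(n)] for i in range(n)]
    selected = []
    best_cover = [0.0] * n  # best similarity of each locus to the current subset

    def gain(candidate):
        # Marginal increase in the objective if the candidate is added.
        return sum(max(sim[candidate][j] - best_cover[j], 0.0) for j in range(n))

    for _ in range(k):
        candidates = [c for c in range(n) if c not in selected]
        choice = max(candidates, key=gain)
        selected.append(choice)
        best_cover = [max(best_cover[j], sim[choice][j]) for j in range(n)]
    return selected

# Hypothetical feature vectors: four near-duplicate loci, two similar loci, one rare locus.
features = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 0.0), (5.1, 0.0),
    (10.0, 10.0),
]
selected = greedy_facility_location(features, k=3)
print(selected)
```

Because the marginal gain of a near-duplicate is small once its neighbour is selected, the greedy procedure spreads its picks across the three groups, including the rare one.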

Simultaneous prediction of multiple outcomes using revised stacking algorithms.

COSI: MLCSB COSI

Li Xing, University of Saskatchewan, Canada
Xuekui Zhang, University of Victoria, Canada
Mary Lesperance, University of Victoria, Canada

Short Abstract: Motivation
HIV is difficult to treat because the virus mutates at a high rate and mutated viruses easily develop resistance to existing drugs. If the relationships between mutations and drug resistance can be determined from historical data, patients can be provided personalized treatment according to their own mutation information. The HIV Drug Resistance Database was built to investigate these relationships. Our goal is to build a model using data in this database that simultaneously predicts resistance to multiple drugs using mutation information from viral sequences for any new patient.

Results
We propose two variations of a stacking algorithm that borrow information among multiple prediction tasks to improve multivariate prediction performance. The most attractive feature of our proposed methods is the flexibility with which complex multivariate prediction models can be constructed using any univariate prediction models. Using cross-validation studies, we show that our proposed methods outperform other popular multivariate prediction methods.

snoGloBe: a gradient boosting machine for the prediction of box C/D snoRNA interactions

COSI: MLCSB COSI

Gabrielle Deschamps-Francoeur, Université de Sherbrooke, Canada
Michelle S Scott, Université de Sherbrooke, Canada

Short Abstract: Box C/D small nucleolar RNAs (snoRNAs) are a subfamily of small non-coding RNAs known to guide methylation of ribosomal and small nuclear RNAs. This function is well characterized and requires an interaction with specific regions of the snoRNAs. In humans, however, some snoRNAs do not have known canonical targets. Also, some snoRNAs exhibit non-canonical functions, such as regulation of alternative splicing and of mRNA stability, the deregulation of which has been implicated in diseases such as Prader-Willi syndrome and cancer. New methodologies were developed allowing the high-throughput detection of RNA-RNA interactions. The study of these datasets revealed that the canonical interactions only account for 21% of all snoRNA interactions.

The aim of this project is to develop a tool to predict snoRNA interactions, both canonical and non-canonical, using a gradient boosting machine. The model was trained using manually curated and high-throughput data. The interaction sequences and positional information were fed to the algorithm. The resulting model, called snoGloBe, obtains an accuracy of 0.98 and a Matthews correlation coefficient of 0.75 on the independent test set.

With this tool, we will be able to predict novel potential snoRNA interactions, shedding light on their non-canonical functions and their implication in different diseases.
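The two metrics reported above can diverge when the classes are imbalanced, which is why both are worth reading together. The Python sketch below computes accuracy and the Matthews correlation coefficient from a confusion matrix. The counts are hypothetical and chosen only to illustrate the effect; they are not the snoGloBe test set.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0 when a marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical imbalanced test set: 40 true interactions among 1000 candidates.
model = dict(tp=30, tn=950, fp=10, fn=10)
always_negative = dict(tp=0, tn=960, fp=0, fn=40)

acc_model, mcc_model = accuracy(**model), mcc(**model)
acc_naive, mcc_naive = accuracy(**always_negative), mcc(**always_negative)

print(f"model:           accuracy={acc_model:.2f}  MCC={mcc_model:.2f}")
print(f"always negative: accuracy={acc_naive:.2f}  MCC={mcc_naive:.2f}")
```

In this toy setting a classifier that never predicts an interaction still reaches high accuracy, while its MCC is zero, so the MCC is the more informative of the two numbers on imbalanced data.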

SPACE: A clustering method for identifying cell population in spatial transcriptomics data

COSI: MLCSB COSI

Minsheng Hao, Tsinghua University, China
Kui Hua, Tsinghua University, China
Xuegong Zhang, Tsinghua University, China

Short Abstract: The relative position of cells in tissues affects their morphology and function. Recent developments in spatial sequencing technologies provide powerful tools that enable researchers to understand cells in the context of tissue microenvironments. However, there are currently only a limited number of computational methods that can simultaneously process spatial and transcriptome information for cell identification. We propose SPACE (SPAtial CEll population), a clustering method for identifying spatial cell populations based not only on their gene expression profiles but also on their spatial locations. The adjustable parameter that balances these two kinds of information in the objective function allows us to explore how transcriptome and spatial location contribute to cell identities. The eigenvalue decomposition solution of SPACE avoids the need for iterative calculations and can be GPU-accelerated, making it scalable to large datasets. We tested our method on multiple mouse brain datasets, and SPACE successfully found clusters that are highly consistent with existing knowledge about the spatial distribution of basic functional regions and anatomical morphological structures of tissues.

STRUCTURAL ALPHABET PREDICTION USING DEEP LEARNING APPROACH

COSI: MLCSB COSI

Gabriel Cretin, Université de Paris - INSERM UMR-S 1134 - INTS - DSIMB Team, France
Tatiana Galochkina, Université de Paris - INSERM UMR-S 1134 - INTS - DSIMB Team, France
Charlotte Périn, Université de Paris - INSERM UMR-S 1134 - INTS - DSIMB Team, France
Alexandre G. De Brevern, Université de Paris - INSERM UMR-S 1134 - INTS - DSIMB Team, France
Jean-Christophe Gelly, Université de Paris - INSERM UMR-S 1134 - INTS - DSIMB Team, France

Short Abstract: Structural alphabets describe protein local conformation more precisely than secondary structures. Thanks to structural alphabets, complex protein structures (3D) can be encoded as a sequence vector of consecutive states (1D). Numerous applications such as protein structure analysis and protein structure prediction use structural alphabets. Here we present a new method based on Deep Learning to predict the local structure of a given protein sequence in terms of the structural alphabet Protein Blocks. We show that our method outperforms the state-of-the-art method for structural alphabet prediction. Our method improves the performance of methods based on structural alphabets, such as fold recognition and de novo structure prediction.

Systematic tissue annotations of genomic samples by modeling unstructured metadata

COSI: MLCSB COSI

Nathaniel T. Hawkins, Michigan State University, United States
Marc Maldaver, Michigan State University, United States
Arjun Krishnan, Michigan State University, United States

Short Abstract: There are currently well over two million genomics samples from human and model organisms that are publicly available. Ideally, researchers can reuse these data by curating datasets/samples relevant to their question of interest rather than investing thousands of dollars per experiment to collect additional data. However, "publicly available" does not mean "accessible." Retrieving relevant data is a challenge because the associated metadata for these data are described using natural language text in unstructured fields. Additionally, a vast majority of genomics samples are missing basic information about the sex and age of the organism, tissue of origin, disease state, etc. Using character-level embeddings (high-dimensional numerical representations) derived from experimental metadata, we have trained natural-language-processing-based machine learning models that can be used to annotate genomics samples for tissues and cell types based on their text descriptions alone. By utilizing sample text, our approaches are agnostic to the underlying genomics data type and will be able to annotate data from vastly different experiments without needing to retrain, a major limitation of current work in this field. We aim to provide systematic and complete labels for publicly available genomics samples in order to facilitate future research efforts.

t-SNE and unsupervised clustering based SNP filters improve the performance in whole genome sequencing studies.

COSI: MLCSB COSI

Jianhu Wu, DCH Technologies Inc, Cambridge, MA, 02142, USA, China
Wenhao Zhou, Center for Molecular Medicine, Children’s Hospital of Fudan University, Shanghai 201102, China, China
Yulan Lu, Center for Molecular Medicine, Children’s Hospital of Fudan University, Shanghai 201102, China, China
Mengchun Gong, DCH Technologies Inc, United States
Wenzhao Shi, DCH Technologies Inc, United States
Gang Feng, DCH Technologies Inc, United States

Short Abstract: Next-generation sequencing (NGS), especially whole genome sequencing (WGS), is a powerful tool in rare disease clinical research and diagnosis. However, published methods for SNP calling have a high false positive rate and low heterozygous consistency, especially in WGS cohort studies. Many existing supervised machine learning based SNP filters are sensitive to their training datasets, so they transfer poorly to different populations. In this study, we propose a new method based on t-distributed stochastic neighbor embedding (t-SNE) and unsupervised clustering for SNP filtering. Furthermore, to improve analysis efficiency, we adopted a parallel network system to accelerate the analysis task. We used hg001-hg005 from the National Institute of Standards and Technology (NIST v3.3.2) to evaluate the performance of our proposed strategy. Our current results show that this method (1) is more accurate than traditional SNP filtering methods; (2) has better heterozygous consistency; (3) is significantly more efficient; and (4) is more robust across different datasets.

Training and interpreting hierarchical cell type classification models using mass, heterogeneous RNA-seq data from human primary cells

COSI: MLCSB COSI

Matthew Bernstein, Morgridge Institute for Research, United States
Zhongjie Ma, University of Wisconsin - Madison, United States
Michael Gleicher, University of Wisconsin - Madison, United States
Colin Dewey, University of Wisconsin-Madison, United States

Short Abstract: Gene expression-based classification of a biological sample’s cell type is an important step in many transcriptomic analyses, including that of annotating cell types in single-cell RNA-seq datasets. In this work, we utilize the breadth of publicly available, healthy, primary cell data for training cell type classifiers. To this end, we curated a novel training dataset from the Sequence Read Archive consisting of 4,167 purified bulk RNA-seq samples from 263 studies and labeled with 294 cell type terms in the Cell Ontology. Furthermore, we explore the novel application of hierarchical classification algorithms that take into account the graph structure of the ontology to this task. These algorithms improve on state-of-the-art methods and produce accurate cell type predictions on both bulk and single-cell data across diverse and fine-grained cell types. In addition, the algorithms make extensive use of linear models, which are particularly amenable to interpretation, and we have developed a visualization tool for interpreting the classifiers in the context of the Cell Ontology. Our cell type classifier is publicly available as a Python package called CellO (github.com/deweylab/CellO) and its associated visualization tool, the CellO Viewer, is available via a web interface (uwgraphics.github.io/CellOViewer/).

Treating biomolecular interaction as an image classification problem - a case study on T-cell receptor-epitope recognition prediction

COSI: MLCSB COSI

Pieter Moris, University of Antwerp, Belgium
Joey De Pauw, University of Antwerp, Belgium
Anna Postovskaya, University of Antwerp, Belgium
Benson Ogunjimi, University of Antwerp, Antwerp University Hospital, Belgium
Pieter Meysman, University of Antwerp, Belgium
Kris Laukens, University of Antwerp, Belgium

Short Abstract: The prediction of epitope recognition by T-cell receptors (TCR) has seen much progress in recent years, with several methods now available that can predict TCR-epitope recognition for a given set of epitopes. However, the generic case of evaluating all possible TCR-epitope pairs remains challenging, mainly due to the high diversity of sequences and the limited amount of currently available training data.

In this work, we present a novel feature engineering approach for sequence-based predictive molecular interaction models, and demonstrate its potential in generic TCR-epitope recognition. The approach is based on the pairwise combination of the physicochemical properties of the individual amino acids in both sequences, which can provide a convolutional neural network with a combined representation of both sequences.

We found indications that this simplifies the prediction task and that it can improve the generalization capabilities of the model to a certain degree. We postulate that similar feature engineering methods could pave the way towards general epitope-agnostic models, although further improvements and additional data are still necessary. In addition, we highlight that appropriate validation strategies are required to accurately assess the generalization performance of TCR-epitope recognition models when applied to both known and novel epitopes.
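The feature engineering idea, combining per-residue physicochemical properties pairwise so that two sequences become a 2-D grid a convolutional network can read, can be sketched in a few lines of Python. The abstract does not specify the properties or the combination operator, so this example assumes a single property (the Kyte-Doolittle hydropathy scale) combined by absolute difference; the short sequences are toy strings, not data from the study.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def interaction_map(receptor_seq, epitope_seq, scale=KYTE_DOOLITTLE):
    """Return a len(receptor) x len(epitope) grid of pairwise property differences.

    Assumed operator: absolute difference of one property per residue pair.
    Several properties could be stacked as channels, like colours in an image.
    """
    return [[abs(scale[a] - scale[b]) for b in epitope_seq] for a in receptor_seq]

# Toy sequences for illustration only.
grid = interaction_map("CASSL", "GILG")
for row in grid:
    print(" ".join(f"{value:4.1f}" for value in row))
```

Each row corresponds to one receptor residue and each column to one epitope residue, so local patterns in the grid reflect combinations of neighbouring residues from both sequences.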

Using Deep Learning for Ultra-fast Identification of V(D)J Segments

COSI: MLCSB COSI

Mohammad Ruhul Amin, Fordham University, United States
Thomas MacCarthy, Stony Brook University, United States

Short Abstract: Antibody repertoire (Rep-Seq) profiling of B-cells typically requires computational analysis of antibody sequences that have undergone V(D)J recombination and somatic hypermutation. An essential step in Rep-Seq bioinformatics is to determine the original V, D, and J segments and their boundaries within the sequence. Widely used techniques such as IMGT and IgBLAST are computationally intense, and newly proposed methods based on probability models even more so, usually taking many hours to complete. Deep learning methods offer the possibility of separating the model training (computationally intense) from the classification step, which is very fast. Furthermore, the training process can be performed beforehand thus enabling the V(D)J classification to be performed rapidly by the end-user as part of a Rep-Seq bioinformatics pipeline. Deep learning requires very large amounts of high-quality data, but these are now becoming available as more Rep-Seq datasets are published. We have developed a deep learning approach to enable ultra-fast processing of Rep-Seq data.

Workflow Platform for ML without Growing Pains

COSI: MLCSB COSI

David Yuan, European Bioinformatics Institute, United Kingdom
Tony Wildish, European Bioinformatics Institute, United Kingdom

Short Abstract: === PLEASE SEE ABSTRACT IN PDF ATTACHED ===
