
Proceedings Track Presentations


3DSIG: Structural Bioinformatics and Computational Biophysics



Predicting MHC-peptide binding affinity by differential boundary tree

  • Jianzhu Ma, Department of Computer Science, Purdue University, United States
  • Peiyuan Feng, Institute for Interdisciplinary Information Sciences, Tsinghua University, China
  • Jianyang Zeng, Institute for Interdisciplinary Information Sciences, Tsinghua University, China

Presentation Overview:

The prediction of binding between peptides and major histocompatibility complex (MHC) molecules plays an important role in neoantigen identification. Although a large number of computational methods have been developed to address this problem, they produce high false-positive rates in practical applications, since a single residue mutation can substantially alter a peptide's binding affinity to MHC, an effect that conventional deep learning methods often fail to capture.
We developed a differential boundary tree model, named DBTpred, to address this problem. We demonstrated that DBTpred can accurately predict MHC class I binding affinity compared to state-of-the-art deep learning methods. We also developed a parallel training algorithm to accelerate training and inference, enabling DBTpred to be applied to large datasets. By investigating the statistical properties of differential boundary trees and the prediction paths of test samples, we show that DBTpred provides an intuitive interpretation and offers hints for detecting residue mutations that strongly influence binding affinity.


Bio-Ontologies



KG4SL: Knowledge Graph Neural Network for Synthetic Lethality Prediction in Human Cancers

  • Min Wu, I2R, A*STAR, Singapore
  • Yong Liu, Nanyang Technological University, Singapore
  • Shike Wang, ShanghaiTech University, China
  • Fan Xu, ShanghaiTech University, China
  • Yunyang Li, ShanghaiTech University, China
  • Jie Wang, ShanghaiTech University, China
  • Ke Zhang, ShanghaiTech University, China
  • Jie Zheng, ShanghaiTech University, China

Presentation Overview:

Motivation: Synthetic lethality (SL) is a promising gold mine for the discovery of anti-cancer drug targets. Wet-lab screening of SL pairs is hampered by high cost, batch effects, and off-target problems. Current computational methods for SL prediction include gene knock-out simulation, knowledge-based data mining, and machine learning methods. Existing methods tend to assume that SL pairs are independent of each other, without taking into account their intrinsic correlation. Although several methods have incorporated genomic and proteomic data to aid SL prediction, these methods involve manual feature engineering that relies heavily on domain knowledge.
Results: Here we propose a novel graph neural network (GNN)-based model, named KG4SL, that incorporates knowledge graph message-passing into SL prediction. The knowledge graph was constructed from 11 kinds of entities (including genes, compounds, diseases, and biological processes) and 24 kinds of relationships that could be pertinent to SL. Integrating the knowledge graph helps address the independence issue and circumvents manual feature engineering by conducting message-passing on the graph. Our model outperformed all the state-of-the-art baselines in AUC, AUPR, and F1. Extensive experiments, including the comparison of our model with an unsupervised TransE model, a vanilla GCN (graph convolutional network) model, and their combination, demonstrated the significant impact of incorporating a knowledge graph into a GNN for SL prediction.
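As a toy illustration of the kind of knowledge-graph message-passing the abstract describes (pure Python, with invented node and relation embeddings; this is not KG4SL's actual architecture), one aggregation round for a gene node might look like:

```python
import math

# One round of relation-weighted neighborhood aggregation on a tiny
# knowledge graph. Embeddings and relation names are made up for the
# sake of the example.

def aggregate(node_emb, rel_emb, edges, node):
    """Average relation-weighted neighbor messages for `node`,
    then combine with its own embedding through a tanh nonlinearity."""
    msgs = []
    for (head, rel, tail) in edges:
        if head == node:
            msgs.append([rel_emb[rel][i] * node_emb[tail][i]
                         for i in range(len(node_emb[tail]))])
    if not msgs:
        return node_emb[node]
    dim = len(msgs[0])
    mean = [sum(m[i] for m in msgs) / len(msgs) for i in range(dim)]
    return [math.tanh(node_emb[node][i] + mean[i]) for i in range(dim)]

node_emb = {"geneA": [0.5, -0.2], "geneB": [0.1, 0.3], "disease1": [0.4, 0.4]}
rel_emb = {"associated_with": [1.0, 0.5], "interacts": [0.8, 1.2]}
edges = [("geneA", "interacts", "geneB"),
         ("geneA", "associated_with", "disease1")]

updated = aggregate(node_emb, rel_emb, edges, "geneA")
```

In a real GNN the weights are learned and several such rounds are stacked, so that each gene's final embedding reflects its multi-hop knowledge-graph context before SL pairs are scored.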


BioVis: Biological Data Visualizations



Metaball skinning of synthetic astroglial morphologies into realistic mesh models for in silico simulations and visual analytics

  • Marwan Abdellah, Blue Brain Project (BBP) / EPFL, Switzerland
  • Alessandro Foni, Blue Brain Project (BBP) / EPFL, Switzerland
  • Eleftherios Zisis, Blue Brain Project (BBP) / EPFL, Switzerland
  • Nadir Roman Guerrero, Blue Brain Project (BBP) / EPFL, Switzerland
  • Samuel Lapere, Blue Brain Project (BBP) / EPFL, Switzerland
  • Jay S. Coggan, Blue Brain Project (BBP) / EPFL, Switzerland
  • Daniel Keller, Blue Brain Project (BBP) / EPFL, Switzerland
  • Henry Markram, Blue Brain Project (BBP) / EPFL, Switzerland
  • Felix Schürmann, Blue Brain Project (BBP) / EPFL, Switzerland

Presentation Overview:

Motivation: Astrocytes, the most abundant glial cells in the mammalian brain, play an instrumental role in developing neuronal circuits. They contribute to the physical structuring of the brain, modulate synaptic activity, and maintain the blood-brain barrier, among other significant aspects that impact brain function. Biophysically detailed astrocytic models are key to unraveling their functional mechanisms via molecular simulations at microscopic scales. Detailed and complete biological reconstructions of astrocytic cells are sparse. Nonetheless, data-driven digital reconstructions of astroglial morphologies that are statistically identical to their biological counterparts are becoming available. We use these synthetic morphologies to generate astrocytic meshes with realistic geometries, making it possible to perform such simulations.

Results: We present an unconditionally robust method capable of reconstructing high-fidelity polygonal meshes of astroglial cells from algorithmically synthesized morphologies. Our method uses implicit surfaces, or metaballs, to skin the different structural components of astrocytes and then blends them seamlessly. We also provide an end-to-end pipeline to produce optimized two- and three-dimensional meshes for visual analytics and simulations, respectively. The performance of our pipeline was assessed on a group of 5,000 astroglial morphologies, and the geometric metrics of the resulting meshes are evaluated. The usability of the meshes is then demonstrated with different use cases.

Implementation and availability: Our metaball skinning algorithm is implemented in Blender 2.82 relying on its Python API (Application Programming Interface). To make it accessible to computational biologists and neuroscientists, the implementation has been integrated into NeuroMorphoVis.
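The metaball idea underlying this skinning approach can be sketched in a few lines: each structural primitive contributes a smooth implicit field, and the "skin" is the isosurface where the summed field crosses a threshold. A minimal stand-alone illustration with invented sphere centers and radii (the actual pipeline evaluates such fields inside Blender and extracts a polygonal surface from them):

```python
# Minimal metaball field: each ball adds an inverse-square contribution,
# and a point is "inside the skin" when the summed field exceeds a
# threshold. Centers, radii, and the threshold are illustrative values.

def field(p, balls):
    total = 0.0
    for (cx, cy, cz, r) in balls:
        d2 = (p[0] - cx) ** 2 + (p[1] - cy) ** 2 + (p[2] - cz) ** 2
        if d2 > 0:
            total += r * r / d2
    return total

balls = [(0.0, 0.0, 0.0, 1.0),   # two overlapping metaballs: their fields
         (1.5, 0.0, 0.0, 1.0)]   # sum, so the surface blends seamlessly
threshold = 1.0

inside = field((0.75, 0.0, 0.0), balls) >= threshold   # midpoint, blended region
outside = field((5.0, 0.0, 0.0), balls) >= threshold   # far away from both balls
```

Because the fields of adjacent components simply add, branches meet in smooth blends rather than sharp intersections, which is the property that makes metaball skinning attractive for branching glial processes.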

OncoThreads: Visualization of Large Scale Longitudinal Cancer Molecular Data

  • Theresa Anisja Harbig, Harvard Medical School, United States
  • Sabrina Nusrat, Harvard Medical School, United States
  • Tali Mazor, Dana-Farber Cancer Institute, United States
  • Qianwen Wang, Harvard Medical School, United States
  • Alexander Thomson, Novartis Institutes for BioMedical Research, United States
  • Hans Bitter, Novartis Institutes for BioMedical Research, United States
  • Ethan Cerami, Dana-Farber Cancer Institute, United States
  • Nils Gehlenborg, Harvard Medical School, United States

Presentation Overview:

Motivation: Molecular profiling of patient tumors and liquid biopsies over time with next-generation sequencing technologies and new immuno-profiling assays is becoming part of standard research and clinical practice. With the wealth of new longitudinal data, there is a critical need for visualizations that help cancer researchers explore and interpret temporal patterns, not just in a single patient but across cohorts.
Results: To address this need, we developed OncoThreads, a tool for the visualization of longitudinal clinical, cancer genomics, and other molecular data in patient cohorts. The tool visualizes patient cohorts as temporal heatmaps and Sankey diagrams that support interactive exploration and ranking of a wide range of clinical and molecular features. This allows analysts to discover temporal patterns in longitudinal data, such as the impact of mutations on response to a treatment, e.g., the emergence of resistant clones. We demonstrate the functionality of OncoThreads using a cohort of 23 glioma patients sampled at 2-4 timepoints.
Availability and Implementation: Freely available at http://oncothreads.gehlenborglab.org. Implemented in JavaScript using the cBioPortal web API as a backend.
Contact: nils@hms.harvard.edu
Supplementary Material: Supplementary figures and video.


CAMDA: Critical Assessment of Massive Data Analysis



Investigation of REFINED CNN ensemble learning for anti-cancer drug sensitivity prediction

  • Omid Bazgir, Texas Tech University, United States
  • Souparno Ghosh, University of Nebraska-Lincoln, United States
  • Ranadip Pal, Texas Tech University, United States

Presentation Overview:

Motivation: Predicting anti-cancer drug sensitivity for individual cell lines using deep learning models is a significant challenge in personalized medicine. Recently developed REFINED (REpresentation of Features as Images with NEighborhood Dependencies) CNN (Convolutional Neural Network)-based models have shown promising results in improving drug sensitivity prediction. The primary idea behind REFINED-CNN is to represent high-dimensional vectors as compact images with spatial correlations that can benefit from CNN architectures. However, the mapping from a high-dimensional vector to a compact 2D image depends on an a priori choice of distance metric and projection scheme, with limited empirical procedures guiding these choices.
Results: In this article, we consider an ensemble of REFINED-CNN models built under different choices of distance metrics and/or projection schemes that can improve upon a single-projection REFINED-CNN model. Results, illustrated using the NCI60 and NCI-ALMANAC databases, demonstrate that the ensemble approaches can provide significant improvements in prediction performance compared to individual models. We also develop a theoretical framework for combining different distance metrics to arrive at a single 2D mapping. The distance-averaged REFINED-CNN produced performance comparable to stacking a REFINED-CNN ensemble, but at significantly lower computational cost.

Asynchronous Parallel Bayesian Optimization for AI-driven Cloud Laboratories

  • Trevor Frisby, Carnegie Mellon University, United States
  • Zhiyun Gong, Carnegie Mellon University, United States
  • Christopher Langmead, Carnegie Mellon University, United States

Presentation Overview:

Motivation: The recent emergence of cloud laboratories, collections of automated wet-lab instruments that are accessed remotely, presents new opportunities to apply Artificial Intelligence and Machine Learning in scientific research. Among these is the challenge of automating the process of optimizing experimental protocols to maximize data quality.
Results: We introduce a new deterministic algorithm, called PROTOCOL (PaRallel OptimizaTiOn for ClOud Laboratories), that improves experimental protocols via asynchronous, parallel Bayesian optimization. The algorithm achieves exponential convergence with respect to simple regret. We demonstrate PROTOCOL in both simulated and real-world cloud labs. In the simulated lab, it outperforms alternative approaches to Bayesian optimization in terms of its ability to find optimal configurations and the number of experiments required to find the optimum. In the real-world lab, the algorithm made progress towards the optimal setting but failed to reach it due to budgetary constraints.
Availability: PROTOCOL is available as both a stand-alone Python library and as part of an R Shiny application at https://github.com/clangmead/PROTOCOL
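The convergence guarantee above is stated in terms of simple regret: after t evaluations, the gap between the true optimum and the best value found so far. A minimal stdlib sketch of how this quantity is tracked, using random sampling of a made-up one-dimensional objective as a stand-in for the actual Bayesian optimizer:

```python
import random

# Track simple regret over a sequence of evaluations. The objective and
# its known optimum (0.0 at x = 0.3) are invented for illustration;
# PROTOCOL would propose the next x via a surrogate model instead of
# sampling uniformly at random.

def objective(x):
    return -(x - 0.3) ** 2  # maximum value 0.0 attained at x = 0.3

random.seed(0)
best = float("-inf")
regrets = []
for t in range(50):
    x = random.random()            # stand-in for a proposed protocol setting
    best = max(best, objective(x)) # best observed value so far
    regrets.append(0.0 - best)     # simple regret: f(x*) - best-so-far
```

Since the best-so-far value can only improve, the regret sequence is non-increasing; PROTOCOL's guarantee is that, under its proposal strategy, this sequence shrinks exponentially fast.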

PathCNN: Interpretable convolutional neural networks for survival prediction and pathway analysis applied to glioblastoma

  • Jung Hun Oh, Memorial Sloan Kettering Cancer Center, United States
  • Wookjin Choi, Virginia State University, United States
  • Euiseong Ko, University of Nevada, Las Vegas, United States
  • Mingon Kang, University of Nevada, Las Vegas, United States
  • Allen Tannenbaum, Stony Brook University, United States
  • Joseph Deasy, Memorial Sloan Kettering Cancer Center, United States

Presentation Overview:

Motivation: Convolutional neural networks (CNNs) have achieved great success in the areas of image processing and computer vision, handling grid-structured inputs and efficiently capturing local dependencies through multiple levels of abstraction. However, a lack of interpretability remains a key barrier to the adoption of deep neural networks, particularly in predictive modeling of disease outcomes. Moreover, because biological array data are generally represented in a non-grid structured format, CNNs cannot be applied directly.
Results: To address these issues, we propose a novel method, called PathCNN, that constructs an interpretable CNN model on integrated multi-omics data using a newly-defined pathway image. PathCNN showed promising predictive performance in differentiating between long-term survival (LTS) and non-LTS when applied to glioblastoma multiforme (GBM). The adoption of a visualization tool coupled with statistical analysis enabled the identification of plausible pathways associated with survival in GBM. In summary, PathCNN demonstrates that CNNs can be effectively applied to multi-omics data in an interpretable manner, resulting in promising predictive power while identifying key biological correlates of disease.


CompMS: Computational Mass Spectrometry



On the Feasibility of Deep Learning Applications Using Raw Mass Spectrometry Data

  • Maria Rodriguez Martinez, IBM Research Europe, Switzerland
  • Joris Cadow, IBM Research Europe, Switzerland
  • Matteo Manica, IBM Research Europe, Switzerland
  • Roland Mathis, IBM Research Europe, Switzerland
  • Tiannan Guo, Westlake University, China
  • Ruedi Aebersold, ETH Zuerich, Switzerland

Presentation Overview:

In recent years, SWATH-MS has become the proteomic method of choice for data-independent acquisition, as it enables high proteome coverage, accuracy, and reproducibility. However, data analysis is convoluted and requires prior information and expert curation. Furthermore, as quantification is limited to a small set of peptides, potentially important biological information may be discarded.
Here we demonstrate that deep learning can be used to learn discriminative features directly from raw MS data, hence eliminating the need for elaborate data-processing pipelines. Using transfer learning to overcome sample sparsity, we exploit a collection of publicly available deep learning models already trained for the task of natural image classification. These models are used to produce feature vectors from each raw mass spectrometry (MS) image, which are later used as input for a classifier trained to distinguish tumor from normal prostate biopsies. Although the deep learning models were originally trained for a completely different classification task and no additional fine-tuning was performed on them, we achieve remarkable classification performance (0.876 AUC).
We investigate different types of image preprocessing and encoding, and whether the inclusion of the secondary MS2 spectra improves classification performance. Across all tested models, we use standard protein expression vectors as gold standards. Even with our naïve implementation, our results suggest that the application of deep learning and transfer learning techniques might pave the way to broader usage of raw mass spectrometry data in real-time diagnosis.

DIAmeter: Matching peptides to data-independent acquisition mass spectrometry data

  • Yang Young Lu, University of Washington, United States
  • Jeff Bilmes, University of Washington, United States
  • Ricard Mias, University of Washington, United States
  • Judit Villen, University of Washington, United States
  • William Stafford Noble, University of Washington, United States

Presentation Overview:

Tandem mass spectrometry data acquired using data independent acquisition (DIA) is challenging to interpret because the data exhibits complex structure along both the mass-to-charge (m/z) and time axes. The most common approach to analyzing this type of data makes use of a library of previously observed DIA data patterns (a "spectral library"), but this approach is expensive because the libraries do not typically generalize well across laboratories. Here we propose DIAmeter, a search engine that detects peptides in DIA data using only a peptide sequence database. Unlike other library-free DIA analysis methods, DIAmeter supports data generated using both wide and narrow isolation windows, can readily detect peptides containing post-translational modifications, can analyze data from a variety of instrument platforms, and is capable of detecting peptides even in the absence of detectable signal in the survey (MS1) scan.

MS2Planner: Optimizing Coverage in Tandem Mass Spectrometry based Metabolomics by Iterative Data Acquisition

  • Zeyuan Zuo, Computational Biology Department, Carnegie Mellon University, United States
  • Liu Cao, Computational Biology Department, Carnegie Mellon University, United States
  • Louis-Félix Nothias, Skaggs School of Pharmacy, University of California San Diego, United States
  • Hosein Mohimani, Computational Biology Department, Carnegie Mellon University, United States

Presentation Overview:

Untargeted tandem mass spectrometry experiments enable the profiling of metabolites in complex biological samples. The collected fragmentation spectra are the fingerprints of metabolites that are used for molecule identification and discovery. Two main mass spectrometry strategies exist for the collection of fragmentation spectra: data-dependent acquisition (DDA) and data-independent acquisition (DIA). In the DIA strategy, all ions in predefined mass-to-charge-ratio ranges are co-isolated and co-fragmented, resulting in highly multiplexed fragmentation spectra. While DIA comprehensively collects the fragmentation ions of all precursors, the resulting highly multiplexed spectra limit subsequent annotation. In contrast, in the DDA strategy, fragmentation spectra are collected specifically for the most abundant ions dynamically observed. While DDA results in less multiplexed fragmentation spectra, its coverage is limited. We introduce the MS2Planner workflow, an Iterative Data Acquisition (ItDA) strategy that optimizes the number of high-quality fragmentation spectra over multiple experimental acquisitions using topological sorting. Our results show that MS2Planner is 62.5% more sensitive and 9.4% more specific than existing acquisition techniques.


Education: Computational Biology Education



Comparison of online learning designs during the COVID-19 pandemic within Bioinformatics courses in higher education

  • Marcela Davila, Gothenburg University, Sweden
  • Sanna Abrahamsson, Gothenburg University, Sweden

Presentation Overview:

Motivation: Due to the worldwide COVID-19 pandemic, new strategies had to be adopted to move from classroom-based education to online education in a very short time. The lack of time to set up these strategies hindered proper design of online instruction and delivery of knowledge. Onsite practical education, including bioinformatics-related training, tends to rely on extensive practice, where students and instructors interact face-to-face to improve the learning outcome. For these courses to maintain their high quality when adapted as online courses, different designs need to be tested and the students' perceptions need to be heard.

Results: This study focuses on short bioinformatics-related courses for graduate students at the University of Gothenburg, Sweden, which were originally developed for onsite training. Once adapted as online courses, several modifications in their design were tested to find the best-fitting learning strategy for the students. To improve the online learning experience, we propose a combination of: 1) short synchronized sessions, 2) extended time for individual and group practical work, 3) recorded live lectures, and 4) increased opportunities for feedback in several formats.


Evolution and Comparative Genomics



Improved Inference of Tandem Domain Duplications

  • Chaitanya Aluru, Princeton University, United States
  • Mona Singh, Princeton University, United States

Presentation Overview:

Motivation: Protein domain duplications are a major contributor to the functional diversification of protein families. These duplications can occur one at a time through single domain duplications, or as tandem duplications where several consecutive domains are duplicated together as part of a single evolutionary event. Existing methods for inferring domain level evolutionary events are based on reconciling domain trees with gene trees. While some formulations consider multiple domain duplications, they do not explicitly model tandem duplications; this leads to inaccurate inference of which domains duplicated together over the course of evolution.
Results: Here, we introduce a reconciliation-based approach that considers the relative positions of domains within extant sequences. We use this information to uncover tandem domain duplications within the evolutionary history of these genes. We devise an integer linear programming (ILP) formulation that solves this problem exactly, and a heuristic approach that works well in practice. We perform extensive simulation studies to demonstrate that our approaches can accurately uncover single and tandem domain duplications, and additionally test our approach on a well-studied orthogroup where lineage-specific domain expansions exhibit varying and complex domain duplication patterns.

Gene Tree and Species Tree Reconciliation with Endosymbiotic Gene Transfer

  • Yoann Anselmetti, Université de Sherbrooke, Canada
  • Nadia El-Mabrouk, Université de Montréal, Canada
  • Manuel Lafond, Université de Sherbrooke, Canada
  • Aïda Ouangraoua, Université de Sherbrooke, Canada

Presentation Overview:

It is largely established that all extant mitochondria originated from a unique endosymbiotic event integrating an alpha-proteobacterial genome into a eukaryotic cell. Subsequently, eukaryote evolution has been marked by episodes of gene transfer, mainly from the mitochondria to the nucleus, resulting in a significant reduction of the mitochondrial genome, which has completely disappeared in some lineages. However, in other lineages, such as land plants, high variability in gene repertoire distribution, including genes encoded in both the nuclear and mitochondrial genomes, indicates an ongoing process of Endosymbiotic Gene Transfer (EGT). Understanding how both nuclear and mitochondrial genomes have been shaped by gene loss, duplication, and transfer is expected to shed light on a number of open questions regarding the evolution of eukaryotes, including the rooting of the eukaryotic tree.
We address the problem of inferring the evolution of a gene family through duplication, loss, and EGT events, the latter considered a special case of horizontal gene transfer occurring between the mitochondrial and nuclear genomes of the same species. We present a linear-time algorithm for computing the DEL (Duplication, EGT and Loss) distance, as well as an optimal reconciled tree, under unitary costs, and a dynamic programming algorithm that outputs all optimal reconciliations for arbitrary operation costs. We illustrate the application of our EndoRex software, analyse different cost settings on a plant dataset, and discuss the resulting reconciled trees.

Build a Better Bootstrap and the RAWR Shall Beat a Random Path to Your Door: Phylogenetic Support Estimation Revisited

  • Wei Wang, Michigan State University, United States
  • Ahmad Hejasebazzi, Michigan State University, United States
  • Julia Zheng, Michigan State University, United States
  • Kevin Liu, Michigan State University, United States

Presentation Overview:

The standard bootstrap method is used throughout science and engineering to perform general-purpose non-parametric resampling and re-estimation. Among the most widely cited and widely used such applications is the phylogenetic bootstrap method, which Felsenstein proposed in 1985 as a means to place statistical confidence intervals on an estimated phylogeny. A key simplifying assumption of the bootstrap method is that input data are independent and identically distributed (i.i.d.). However, the i.i.d. assumption is an over-simplification for biomolecular sequence analysis, as Felsenstein noted. Special-purpose fully parametric or semi-parametric methods for phylogenetic support estimation have since been introduced, some of which are intended to address this concern.
In this study, we introduce a new sequence-aware non-parametric resampling technique, which we refer to as RAWR ("RAndom Walk Resampling"). RAWR consists of random walks that synthesize and extend the standard bootstrap method and the "mirrored inputs" idea of Landan and Graur. We apply RAWR to the task of phylogenetic support estimation and compare its performance to the state of the art using synthetic and empirical data spanning a range of dataset sizes and evolutionary divergences. We show that RAWR support estimates offer comparable or typically superior type I and type II error compared to phylogenetic bootstrap support. We also conduct a re-analysis of large-scale genomic sequence data from a recent study of Darwin's finches. Our findings clarify phylogenetic uncertainty in a charismatic clade that serves as an important model for complex adaptive evolution. We conclude with thoughts on future research directions.
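As a rough illustration of the random-walk resampling idea (a sketch of the general flavor, not the authors' exact procedure), a replicate can be drawn by walking left and right over alignment columns instead of sampling columns i.i.d. as the standard bootstrap does, so that neighboring sites, which are not independent in biomolecular sequences, tend to be resampled together:

```python
import random

# Draw a pseudo-replicate of alignment column indices by a random walk:
# start at a random column, step +/-1, occasionally reversing direction
# and bouncing off the alignment ends. Parameter names and the reversal
# probability are illustrative choices.

def random_walk_sample(n_sites, length, reverse_prob=0.1, seed=None):
    rng = random.Random(seed)
    pos = rng.randrange(n_sites)
    direction = 1
    sampled = [pos]
    while len(sampled) < length:
        if rng.random() < reverse_prob:
            direction = -direction
        pos += direction
        if pos < 0 or pos >= n_sites:   # bounce off the alignment ends
            direction = -direction
            pos += 2 * direction
        sampled.append(pos)
    return sampled

replicate = random_walk_sample(n_sites=100, length=100, seed=42)
```

Each replicate is then re-estimated exactly as in the standard bootstrap; the only change is that the resampled columns preserve local dependence rather than being drawn independently.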

Advancing admixture graph estimation via maximum likelihood network orientation

  • Erin Molloy, University of California, Los Angeles, United States
  • Arun Durvasula, University of California, Los Angeles, United States
  • Sriram Sankararaman, University of California, Los Angeles, United States

Presentation Overview:

Motivation: Admixture, the interbreeding between previously distinct populations, is a pervasive force in evolution.
The evolutionary history of populations in the presence of admixture can be modeled by augmenting phylogenetic trees with additional nodes that represent admixture events. While enabling a more faithful representation of evolutionary history, admixture graphs present formidable inferential challenges, and there is an increasing need for methods that are accurate, fully automated, and computationally efficient. One key challenge arises from the size of the space of admixture graphs. Given that exhaustively evaluating all admixture graphs can be prohibitively expensive, heuristics have been developed to enable efficient search over this space. One heuristic, implemented in the popular method TreeMix, consists of adding edges to a starting tree while optimizing a suitable objective function.

Results: Here, we present a demographic model (with one admixed population incident to a leaf) where TreeMix, and any other starting-tree-based maximum likelihood heuristic using its likelihood function, is guaranteed to get stuck in a local optimum and return an incorrect network topology. To address this issue, we propose a new search strategy that we term maximum likelihood network orientation (MLNO).
We augment TreeMix with an exhaustive search for an MLNO, referring to this approach as OrientAGraph. In evaluations including published admixture graphs, OrientAGraph outperformed TreeMix on 4/8 models (there were no differences in the other cases). Overall, OrientAGraph found graphs with higher likelihood scores and topological accuracy while remaining computationally efficient. Lastly, our study reveals several directions for improving ML admixture graph estimation.

Data-driven speciation tree prior for better species divergence times in calibration-poor molecular phylogenies

  • Qiqing Tao, Temple University, United States
  • Sudhir Kumar, Temple University, United States
  • Jose Barba-Montoya, Temple University, United States

Presentation Overview:

Motivation: Precise time calibrations needed to estimate the ages of species divergences are not always available due to the incompleteness of fossil records. Consequently, clock calibrations available for Bayesian analyses can be few and diffuse, i.e., phylogenies are calibration-poor, impeding reliable inference of the timetree of life. We examined the role of the speciation tree prior in Bayesian node age estimates for calibration-poor phylogenies and tested the usefulness of an informative, data-driven (dd) tree prior for enhancing the accuracy and precision of estimated times.
Results: We present a simple method to estimate the parameters of the birth-death (BD) tree prior from the molecular phylogeny. The use of ddBD tree priors can improve Bayesian node age estimates for calibration-poor phylogenies. We show that the ddBD tree prior, along with only a few well-constrained calibrations, can produce excellent node ages and credibility intervals, whereas the use of a flat tree prior may require more calibrations. Relaxed clock dating with the ddBD tree prior also produced better results than a flat tree prior when using diffuse node calibrations. We also suggest using ddBD tree priors to improve the detection of outliers and influential calibrations in cross-validation analyses.
Conclusion: Bayesian dating analyses with ddBD tree priors will enable better node age estimates. Our results have practical use because the ddBD tree prior reduces the number of well-constrained calibrations necessary to obtain reliable node age estimates. This would help address key impediments in building the grand timetree of life, revealing the process of speciation, and elucidating the dynamics of biological diversification.


Function SIG: Gene and Protein Function Annotation



DeepGraphGO: graph neural network for large-scale, multispecies protein function prediction

  • Ronghui You, Fudan University, China
  • Shuwei Yao, Fudan University, China
  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan
  • Shanfeng Zhu, Fudan University, China

Presentation Overview:

Motivation: Automated function prediction (AFP) of proteins is a large-scale multi-label classification problem. Two limitations of most network-based methods for AFP are that 1) a single model must be trained for each species and 2) protein sequence information is totally ignored. These limitations cause weaker performance than sequence-based methods. Thus, the challenge is how to develop a powerful network-based method for AFP that overcomes these limitations.
Results: We propose DeepGraphGO, an end-to-end, multispecies graph neural network-based method for AFP that makes the most of both protein sequence and high-order protein network information. Our multispecies strategy allows a single model to be trained for all species, providing a larger number of training samples than existing methods. Extensive experiments with a large-scale dataset show that DeepGraphGO significantly outperformed a number of competing state-of-the-art methods, including DeepGOPlus and three representative network-based methods: GeneMANIA, deepNF, and clusDCA. We further confirmed the effectiveness of our multispecies strategy and the advantage of DeepGraphGO on so-called difficult proteins. Finally, we integrated DeepGraphGO as a component into the state-of-the-art ensemble method NetGO and achieved a further performance improvement.


HitSeq: High-throughput Sequencing



Haplotype-based membership inference from summary genomic data

  • Haixu Tang, Indiana University Bloomington, United States
  • Diyue Bu, Indiana University Bloomington, United States
  • Xiaofeng Wang, Indiana University Bloomington, United States

Presentation Overview:

Motivation: The availability of human genomic data, together with the enhanced capacity to process them, is leading to transformative technological advances in biomedical science and engineering. However, the public dissemination of such data has been difficult due to privacy concerns. Specifically, it has been shown that the presence of a human subject in a case group can be inferred from the shared summary statistics of the group, e.g., the allele frequencies, or even the presence/absence of genetic variants (e.g., shared by the Beacon project) in the group. These methods rely on the availability of a second sample, i.e., the DNA profile of a target human subject, and thus are often referred to as membership inference methods.
Results: In this paper, we demonstrate that haplotypes, i.e., sequences of single nucleotide variants (SNVs) showing strong genetic linkage in human genome databases, may be inferred from the summary of genomic data without using a second sample. Furthermore, novel haplotypes that did not appear in the database may be reconstructed solely from the allele frequencies of genomic datasets. These reconstructed haplotypes can then be used in a haplotype-based membership inference algorithm to identify target subjects in a case group with greater power than existing SNV-based methods.
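For intuition, the classical frequency-based membership inference that this work builds on can be sketched as a toy statistic (in the spirit of allele-frequency attacks; this is not the paper's haplotype-based algorithm, and all alleles and frequencies below are invented):

```python
# Toy frequency-based membership inference: compare the target's alleles
# against the case pool's published allele frequencies versus a population
# reference. A large positive statistic suggests the target is in the pool.

def membership_statistic(target, pool_freqs, ref_freqs):
    """Sum over SNVs of |x - ref| - |x - pool|; larger values suggest
    the target contributed to the pool's summary statistics."""
    return sum(abs(x - r) - abs(x - p)
               for x, p, r in zip(target, pool_freqs, ref_freqs))

target = [1, 0, 1, 1]            # target's alleles at four SNVs (invented)
pool   = [0.9, 0.1, 0.8, 0.7]    # allele frequencies in the case pool
ref    = [0.5, 0.5, 0.5, 0.5]    # population reference frequencies
print(round(membership_statistic(target, pool, ref), 2))  # → 1.3
```

The paper's contribution is to lift such inference from individual SNV frequencies to reconstructed haplotypes, which removes the need for the target's own DNA profile.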

SAILER: Scalable and Accurate Invariant Representation Learning for Single-cell ATACseq Processing and Integration

  • Jing Zhang, UC Irvine, United States
  • Laiyi Fu, Xi'an Jiaotong University, China
  • Yingxin Cao, University of California, Irvine, United States
  • Jie Wu, University of California, Irvine, United States
  • Qin Ke Peng, Xi'an Jiaotong University, China
  • Qing Nie, U. of California, Irvine, United States
  • Xiaohui Xie, University of California, Irvine, United States

Presentation Overview:

Motivation: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) provides new opportunities to dissect epigenomic heterogeneity and elucidate transcriptional regulatory mechanisms. However, computational modeling of scATAC-seq data is challenging due to its high dimension, extreme sparsity, complex dependencies, and high sensitivity to confounding factors from various sources.

Results: Here we propose a new deep generative model framework, named SAILER, for analysing scATAC-seq data. SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects. SAILER adopts the conventional encoder-decoder framework to learn the latent representation but imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Experimental results on both simulated and real scATAC-seq datasets demonstrate that SAILER learns better and biologically more meaningful representations of cells than other methods. Its noise-free cell embeddings bring significant benefits in downstream analyses: clustering and imputation based on SAILER result in 6.9% and 18.5% improvements over existing methods, respectively. Moreover, because no matrix factorization is involved, SAILER can easily scale to millions of cells. We implemented SAILER as a software package, freely available to the scientific community for large-scale scATAC-seq data analysis.

Topology-based Sparsification of Graph Annotations

  • Daniel Danciu, ETH Zurich, Switzerland
  • Mikhail Karasikov, ETH Zurich, Switzerland
  • Harun Mustafa, ETH Zurich, Switzerland
  • Andre Kahles, ETH Zurich, Switzerland
  • Gunnar Ratsch, ETH Zurich, Switzerland

Presentation Overview:

Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are needed to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.
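The core intuition, storing a vertex's annotation as a difference against an adjacent vertex, can be sketched in a few lines (a simplification with invented vertex and sample names; real RowDiff picks diff successors and anchors via graph topology and compresses the resulting sparse rows):

```python
# Toy RowDiff-style sparsification: each vertex stores only the symmetric
# difference of its label set against a designated successor, so that
# similar adjacent annotations compress to small (often empty) diffs.

labels = {
    "v1": {"sampleA", "sampleB"},
    "v2": {"sampleA", "sampleB"},
    "v3": {"sampleA", "sampleB", "sampleC"},
}
succ = {"v1": "v2", "v2": "v3"}   # v3 is an anchor storing its labels in full

diffs = {v: labels[v] ^ labels[succ[v]] for v in succ}

def reconstruct(v):
    """Recover a vertex's labels by following diffs down to the anchor."""
    if v not in succ:
        return set(labels[v])
    return diffs[v] ^ reconstruct(succ[v])

print(reconstruct("v1") == labels["v1"])   # lossless reconstruction
print(sorted(len(d) for d in diffs.values()))  # diffs are tiny: [0, 1]
```

Because adjacent vertices in a labeled de Bruijn graph usually carry near-identical annotations, most stored diffs are empty or near-empty, which is what makes the subsequent matrix compression so effective.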

Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections

  • Rob Patro, University of Maryland, United States
  • Jamshed Khan, University of Maryland, United States

Presentation Overview:

Motivation: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.
Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 hours, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed in under 9 hours, using ∼84 GB of memory. The only other tool able to complete these tasks on the hardware took over 23 hours using ∼126 GB of memory, and over 16 hours using ∼289 GB of memory, respectively.

CentromereArchitect: inference and analysis of the architecture of centromeres

  • Tatiana Dvorkina, Center for Algorithmic Biotechnology, Saint Petersburg State University, Russia
  • Olga Kunyavskaya, Center for Algorithmic Biotechnology, Saint Petersburg State University, Russia
  • Andrey V. Bzikadze, Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, United States
  • Ivan Alexandrov, Center for Algorithmic Biotechnology, Saint Petersburg State University, Russia
  • Pavel A. Pevzner, Department of Computer Science and Engineering, University of California, San Diego, United States

Presentation Overview:

Motivation: Recent advances in long-read sequencing technologies have led to rapid progress in centromere assembly in the last year and, for the first time, opened the possibility of addressing long-standing questions about the architecture and evolution of human centromeres. However, since these advances have not yet been accompanied by the development of centromere-specific bioinformatics algorithms, even the fundamental questions (e.g., centromere annotation by deriving the complete set of human monomers and high-order repeats), let alone more complex ones (e.g., explaining how monomers and high-order repeats evolved), remain open. Moreover, even though there has been a four-decade-long series of studies aimed at cataloging all human monomers and high-order repeats, rigorous algorithmic definitions of these concepts are still lacking. Thus, the development of a centromere annotation tool is a prerequisite for follow-up personalized biomedical studies of centromeres across the human population and evolutionary studies of centromeres across various species.
Results: We describe CentromereArchitect, the first tool for centromere annotation in a newly sequenced genome, apply it to the recently generated complete assembly of a human genome by the Telomere-to-Telomere consortium, generate the complete set of human monomers and high-order repeats for the so-called live centromeres, and reveal a vast set of hybrid monomers that may represent focal points of centromere evolution.

Long Reads Capture Simultaneous Enhancer-Promoter Methylation Status for Cell-type Deconvolution

  • Roded Sharan, Tel Aviv University, Israel
  • Sapir Margalit, Tel Aviv University, Israel
  • Yotam Abramson, Tel Aviv University, Israel
  • Hila Sharim, Tel Aviv University, Israel
  • Zohar Manber, Tel Aviv University, Israel
  • Surajit Bhattacharya, Children’s National Hospital, Washington DC, United States
  • Yi-Wen Chen, Children’s National Hospital, Washington DC; George Washington University, United States
  • Eric Vilain, Children’s National Hospital, Washington DC; George Washington University, United States
  • Hayk Barseghyan, Children’s National Hospital, Washington DC; George Washington University, United States
  • Ran Elkon, Tel Aviv University, Israel
  • Yuval Ebenstein, Tel Aviv University, Israel

Presentation Overview:

Motivation: While promoter methylation is associated with reinforcing fundamental tissue identities, the methylation status of distant enhancers has been shown by genome-wide association studies to be a powerful determinant of cell state and cancer. With the recent availability of long reads that report the methylation status of enhancer-promoter pairs on the same molecule, we hypothesized that probing these pairs at the single-molecule level may serve as the basis for detecting rare cancerous transformations in a given cell population. We explore various analysis approaches for deconvolving cell-type mixtures based on their genome-wide enhancer-promoter methylation profiles.
Results: To evaluate our hypothesis, we examined long-read optical methylome data for the GM12878 cell line and myoblast cell lines from two donors. We identified over 100,000 enhancer-promoter pairs that co-exist on at least 30 individual DNA molecules per pair. We developed a detailed methodology for mixture deconvolution and applied it to estimate the proportional cell compositions in synthetic mixtures based on their pairwise enhancer-promoter methylation. We found that our methodology leads to very accurate estimates, outperforming our promoter-based deconvolutions. Moreover, we show that it generalizes from deconvolving different cell types to subtler scenarios where one wishes to deconvolve different cell populations of the same cell type.

Availability: The code used in this work to analyze single-molecule Bionano Genomics optical maps is available via the GitHub repository https://github.com/ebensteinLab/Single_molecule_methylation_in_EP.
Contact: uv@post.tau.ac.il (Y.E.), roded@tauex.tau.ac.il (R.S.)
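As a rough illustration of mixture deconvolution in general (not the paper's enhancer-promoter method), the two-population case has a closed-form least-squares solution; all methylation profile values below are invented:

```python
# Toy two-population deconvolution: given methylation profiles of two pure
# cell types and of a mixture, estimate the mixing fraction alpha by
# least squares, i.e. minimize ||mixture - alpha*pure1 - (1-alpha)*pure2||².

def estimate_fraction(mixture, pure1, pure2):
    """Closed-form least-squares alpha, clipped to [0, 1]."""
    num = sum((m - b) * (a - b) for m, a, b in zip(mixture, pure1, pure2))
    den = sum((a - b) ** 2 for a, b in zip(pure1, pure2))
    return min(1.0, max(0.0, num / den))

pure1 = [0.9, 0.1, 0.8, 0.2]   # methylation levels of cell type 1 (invented)
pure2 = [0.1, 0.9, 0.2, 0.8]   # methylation levels of cell type 2 (invented)
alpha = 0.3
mixture = [alpha * a + (1 - alpha) * b for a, b in zip(pure1, pure2)]
print(round(estimate_fraction(mixture, pure1, pure2), 3))  # recovers 0.3
```

The paper's approach replaces these per-site levels with joint enhancer-promoter methylation states read from individual long molecules, which carry more cell-type information than promoters alone.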

Constructing small genome graphs via string compression

  • Carl Kingsford, Carnegie Mellon University, United States
  • Yutong Qiu, Carnegie Mellon University, United States

Presentation Overview:

Motivation: The size of a genome graph --- the space required to store the nodes, their labels and edges --- affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs.

Results: We point out similarities between the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. Supported by these similarities, we present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. We show that the algorithms yield an upper bound on the size of the constructed genome graph in terms of an optimal EPM compression. To further optimize the size of the genome graph, we propose the source assignment problem, which optimizes over the equivalent choices during compression, and introduce an ILP formulation that solves it optimally. As a proof of concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv algorithm. We show that using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space required to store a genome graph by 40.7% on average compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings.

Availability: The RLZ-Graph software is available at https://github.com/Kingsford-Group/rlzgraph

Practical selection of representative sets of RNA-seq samples using a hierarchical approach

  • Carl Kingsford, Carnegie Mellon University, United States
  • Laura Tung, Carnegie Mellon University, United States

Presentation Overview:

Motivation: Despite the numerous RNA-seq samples available in large databases, most RNA-seq analysis tools are evaluated on a limited number of samples. This drives a need for methods that select a representative subset of all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches to representative set selection (e.g., a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), the large numbers of available samples and of k-mers/sequences per sample make computing the full similarity matrix for an entire large database (e.g., the SRA) challenging in both memory and runtime; this makes direct representative set selection infeasible with limited computing resources.
Results: We developed a novel computational method called “hierarchical representative set selection” to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that it achieves summarization quality close to that of direct representative set selection while greatly reducing the runtime and memory required to compute the full similarity matrix (up to an 8.4X runtime reduction and a 5.35X memory reduction for 10,000 and 12,000 samples, respectively, the largest sets that could practically be run with direct selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases like the SRA.
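A toy version of the divide-and-conquer idea, with 1-D points standing in for samples and greedy farthest-point selection standing in for the actual summarization step, might look like this (all names and values invented; not the paper's pipeline):

```python
# Toy hierarchical representative set selection: pick representatives within
# each chunk, then select among the chunk winners, so no all-pairs similarity
# matrix over the full collection is ever computed.

def farthest_point(points, k):
    """Greedy k-representative selection by maximum distance (1-D toy)."""
    reps = [min(points)]
    while len(reps) < k:
        reps.append(max(points, key=lambda p: min(abs(p - r) for r in reps)))
    return reps

def hierarchical_select(points, chunk_size, k_per_chunk, k_final):
    chunks = [points[i:i + chunk_size] for i in range(0, len(points), chunk_size)]
    candidates = [r for c in chunks for r in farthest_point(c, k_per_chunk)]
    return sorted(farthest_point(candidates, k_final))

samples = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.8, 9.9, 10.0]
print(hierarchical_select(samples, chunk_size=3, k_per_chunk=2, k_final=3))
```

Each level only compares items within a chunk (or among chunk winners), which is what turns the quadratic all-pairs cost into something tractable on collections the size of the SRA.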

doubletD: Detecting doublets in single-cell DNA sequencing data

  • Leah Weber, University of Illinois at Urbana-Champaign, United States
  • Palash Sashittal, University of Illinois at Urbana-Champaign, United States
  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

While single-cell DNA sequencing (scDNA-seq) has enabled the study of intra-tumor heterogeneity at an unprecedented resolution, current technologies are error-prone and often result in doublets where two or more cells are mistaken for a single cell. Not only do doublets confound downstream analyses, but the increase in doublet rate is also a major bottleneck preventing higher throughput with current single-cell technologies. Although doublet detection and removal is standard practice in scRNA-seq data analysis, there are no standalone doublet detection methods for scDNA-seq data.

We present doubletD, the first standalone method for detecting doublets in scDNA-seq data. Underlying our method is a simple maximum likelihood approach with a closed-form solution. We demonstrate the performance of doubletD on simulated data as well as real datasets, outperforming current methods for downstream analysis of scDNA-seq data that jointly infer doublets as well as standalone approaches for doublet detection in scRNA-seq data. Incorporating doubletD in scDNA-seq analysis pipelines will reduce complexity and lead to more accurate results.
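The flavor of such a maximum-likelihood test can be sketched with binomial likelihoods of variant allele fractions (a generic illustration with invented read counts; doubletD's actual model additionally accounts for allelic dropout and other scDNA-seq error modes):

```python
# Toy doublet test: compare the likelihood of observed variant read counts
# under singlet VAFs {0, 0.5, 1} versus doublet VAFs {0, 0.25, 0.5, 0.75, 1},
# averaging over the hidden genotype at each locus.

import math

def log_likelihood(counts, vafs, eps=0.01):
    """Sum over loci of the log mean binomial likelihood across candidate VAFs."""
    total = 0.0
    for alt, depth in counts:
        liks = []
        for v in vafs:
            p = min(1 - eps, max(eps, v))   # sequencing-error floor/ceiling
            liks.append(math.comb(depth, alt) * p ** alt * (1 - p) ** (depth - alt))
        total += math.log(sum(liks) / len(liks))
    return total

counts = [(5, 20), (15, 20), (10, 20)]      # (alt reads, depth) per locus, invented
singlet = log_likelihood(counts, [0.0, 0.5, 1.0])
doublet = log_likelihood(counts, [0.0, 0.25, 0.5, 0.75, 1.0])
print("doublet" if doublet > singlet else "singlet")
```

Loci with allele fractions near 0.25 or 0.75 are poorly explained by singlet genotypes, so the doublet hypothesis wins the likelihood comparison for this cell.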

Real-time mapping of nanopore raw signals

  • Chirag Jain, Indian Institute of Science, India
  • Srinivas Aluru, Georgia Institute of Technology, United States
  • Haowen Zhang, Georgia Institute of Technology, United States
  • Haoran Li, Ohio State University, United States
  • Haoyu Cheng, Harvard Medical School, United States
  • Kin Fai Au, Ohio State University, United States
  • Heng Li, Harvard Medical School and Dana-Farber Cancer Institute, United States

Presentation Overview:

Motivation: Oxford Nanopore Technologies sequencing devices support adaptive sequencing, in which undesired reads can be ejected from a pore in real time. This feature allows targeted sequencing aided by computational methods for mapping partial reads, rather than complex library preparation protocols. However, existing mapping methods either require a computationally expensive base calling procedure before using aligners to map partial reads, or work well only on small genomes.

Results: In this work, we present a new streaming method that can map nanopore raw signals for real-time selective sequencing. Rather than converting read signals to bases, we propose to convert reference genomes to signals and operate fully in the signal space. Our method features a new way to index reference genomes using k-d trees, a novel seed selection strategy, and a seed chaining algorithm tailored to the current signal characteristics. We implemented the method as the tool Sigmap and evaluated it on both simulated and real data, comparing it to the state-of-the-art nanopore raw signal mapper Uncalled. Our results show that Sigmap yields better accuracy on mapping real yeast raw signals and is 4.4 times faster. Moreover, our method performed well on mapping raw signals to genomes larger than 100 Mbp and correctly mapped 11.49% more real raw signals of green algae, leading to a significantly higher F1-score (0.9354 vs. 0.8660).
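The signal-space idea, matching raw signal segments against expected signal levels derived from reference k-mers instead of base calling the read, can be illustrated with a brute-force toy (invented pore-model levels and sequences; Sigmap itself uses k-d trees, seed selection, and chaining rather than exhaustive search):

```python
# Toy signal-space mapping: convert every reference k-mer to an expected
# signal vector via a made-up pore model, then locate a noisy query signal
# segment by nearest-neighbour search over those vectors.

PORE_MODEL = {"A": 80.0, "C": 95.0, "G": 110.0, "T": 125.0}  # invented levels
K = 3

def signalize(seq):
    return [PORE_MODEL[b] for b in seq]

def index_reference(ref):
    """Map each reference position to its k-mer's expected signal vector."""
    return [(i, signalize(ref[i:i + K])) for i in range(len(ref) - K + 1)]

def map_signal(query_signal, index):
    """Return the reference position whose k-mer signal is closest (L2)."""
    def dist(vec):
        return sum((a - b) ** 2 for a, b in zip(query_signal, vec))
    return min(index, key=lambda entry: dist(entry[1]))[0]

ref = "ACGTACCGT"
index = index_reference(ref)
noisy = [81.0, 94.0, 111.5]      # noisy raw signal for "ACG"
print(map_signal(noisy, index))  # → 0, matching "ACG" at the reference start
```

Replacing the brute-force scan with a k-d tree over the signal vectors is what makes this kind of lookup fast enough for real-time selective sequencing.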

Sequence-specific minimizers via polar sets

  • Hongyu Zheng, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States
  • Guillaume Marcais, Carnegie Mellon University, United States

Presentation Overview:

Motivation: Minimizers are efficient methods for sampling k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods for constructing efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that hit every sufficiently long sequence) to upper-bound the sketch size. In contrast, the problem of sequence-specific minimizers, that is, constructing efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences.
Results: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers.
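For reference, the baseline window-minimizer scheme that sequence-specific methods aim to improve can be sketched as follows (a plain lexicographic k-mer order is used here in place of a random or polar-set-informed ordering):

```python
# Standard window minimizers: in every window of w consecutive k-mers, select
# the position of the smallest k-mer under a fixed order (ties broken by
# position). The selected positions form the sketch of the sequence.

def minimizers(seq, k, w):
    """Sorted positions of the smallest k-mer in each window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (kmers[i], i)))
    return sorted(selected)

print(minimizers("ACGTTGCA", k=3, w=3))
```

Because adjacent windows usually share their minimum, far fewer than one position per window is selected; choosing the k-mer order so that even fewer positions are sampled on a particular reference is exactly the sequence-specific problem the polar-set framework addresses.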


iRNA: Integrative RNA Biology



Weakly supervised learning of RNA modifications from low-resolution epitranscriptome data

  • Daiyun Huang, Xi'an Jiaotong-Liverpool University, China
  • Bowen Song, Xi'an Jiaotong-Liverpool University, China
  • Jingjue Wei, Xi'an Jiaotong-Liverpool University, China
  • Jionglong Su, Xi'an Jiaotong-Liverpool University, China
  • Frans Coenen, University of Liverpool, United Kingdom
  • Jia Meng, Xi'an Jiaotong-Liverpool University, China

Presentation Overview:

Motivation: Increasing evidence suggests that post-transcriptional RNA modifications regulate essential biomolecular functions and are related to the pathogenesis of various diseases. Precise identification of RNA modification sites is essential for understanding the regulatory mechanisms of RNAs. To date, many computational approaches have been developed for the prediction of RNA modifications, most of which were based on strong supervision enabled by base-resolution epitranscriptome data. However, high-resolution data may not be available.
Results: We propose WeakRM, the first weakly supervised learning framework for predicting RNA modifications from low-resolution epitranscriptome datasets, such as those generated from acRIP-seq and hMeRIP-seq. Evaluations on three independent datasets (corresponding to three different RNA modification types and their respective sequencing technologies) demonstrated the effectiveness of our approach in predicting RNA modifications from low-resolution data. WeakRM outperformed state-of-the-art multi-instance learning methods for genomic sequences, such as WSCNN, which was originally designed for transcription factor binding site prediction. Additionally, our approach captured motifs that are consistent with existing knowledge, and visualization of the predicted modification-containing regions unveiled the potential of detecting RNA modifications with improved resolution.

JEDI: Circular RNA Prediction based on Junction Encoders and Deep Interaction among Splice Sites

  • Jyun-Yu Jiang, University of California, Los Angeles, United States
  • Chelsea J.-T. Ju, University of California, Los Angeles, United States
  • Junheng Hao, University of California, Los Angeles, United States
  • Muhao Chen, Information Sciences Institute, USC, United States
  • Wei Wang, University of California, Los Angeles, United States

Presentation Overview:

Circular RNAs are a novel class of long non-coding RNAs that have been broadly discovered in the eukaryotic transcriptome. Their circular structure arises from a non-canonical splicing process, in which a donor site is back-spliced to an upstream acceptor site, and these circular RNA sequences are conserved across species.
More importantly, mounting evidence suggests their vital roles in gene regulation and association with diseases. As a fundamental step toward elucidating their functions and mechanisms, several computational methods have been proposed to predict the circular structure from the primary sequence. Recently, advanced computational methods have leveraged deep learning to capture the relevant patterns from RNA sequences and model their interactions to facilitate the prediction. However, these methods fail to fully explore the positional information of splice junctions and their deep interactions.

Results: We present a robust end-to-end framework, JEDI, for circular RNA prediction using only nucleotide sequences. JEDI first leverages an attention mechanism to encode each junction site based on deep bidirectional recurrent neural networks and then introduces a novel cross-attention layer to model the deep interactions among these sites for backsplicing. Finally, JEDI can not only predict circular RNAs but also interpret the relationships among splice sites to discover backsplicing hotspots within a gene region. Experiments demonstrate that JEDI significantly outperforms state-of-the-art approaches in circular RNA prediction at both the isoform and gene levels. Moreover, JEDI also shows promising results on zero-shot backsplicing discovery, which none of the existing approaches can achieve.

Availability: The implementation of our framework is available at https://github.com/hallogameboy/JEDI

Thermodynamic modeling reveals widespread multivalent binding by RNA-binding proteins

  • Salma Sohrabi-Jahromi, Max Planck Institute for Biophysical Chemistry, Germany
  • Johannes Söding, Max Planck Institute for Biophysical Chemistry, Germany

Presentation Overview:

Motivation: Understanding how proteins recognize their RNA targets is essential to elucidate regulatory processes in the cell. Many RNA-binding proteins (RBPs) form complexes or have multiple domains that allow them to bind to RNA in a multivalent, cooperative manner. They can thereby achieve higher specificity and affinity than proteins with a single RNA-binding domain. However, current approaches to de-novo discovery of RNA binding motifs do not take multivalent binding into account.

Results: We present Bipartite Motif Finder (BMF), which is based on a thermodynamic model of RBPs with two cooperatively binding RNA-binding domains. We show that bivalent binding is a common strategy among RBPs, yielding higher affinity and sequence specificity. We furthermore illustrate that the spatial geometry between the binding sites can be learned from bound RNA sequences. These discovered bipartite motifs are consistent with previously known motifs and binding behaviors. Our results demonstrate the importance of multivalent binding for RNA-binding proteins and highlight the value of bipartite motif models in representing the multivalency of protein-RNA interactions.

Availability: BMF source code is available at https://github.com/soedinglab/bipartite_motif_finder under a GPL license. The BMF web server is accessible at https://bmf.soedinglab.org.


MICROBIOME



Umibato: estimation of time-varying microbial interaction using continuous-time regression hidden Markov model

  • Shion Hosoda, Waseda University, Japan
  • Tsukasa Fukunaga, The University of Tokyo, Japan
  • Michiaki Hamada, Waseda University, Japan

Presentation Overview:

Motivation: Accumulating evidence has highlighted the importance of microbial interaction networks. Various methods have been developed to estimate such networks, of which those based on the generalized Lotka-Volterra equation (gLVE) can estimate a directed interaction network. However, previous gLVE-based methods for estimating microbial interaction networks did not consider time-varying interactions.
Results: In this study, we developed the unsupervised-learning-based microbial interaction inference method using Bayesian estimation (Umibato) for estimating time-varying microbial interactions. The Umibato algorithm comprises Gaussian process regression (GPR) and a new Bayesian probabilistic model, the continuous-time regression hidden Markov model (CTRHMM). Growth rates are estimated by GPR, and interaction networks are estimated by CTRHMM, which can estimate time-varying interaction networks using interaction states defined as hidden variables. Umibato outperformed existing methods on synthetic datasets. In addition, it yielded reasonable estimates in experiments on a mouse gut microbiota dataset, providing novel insights into the relationship between consumed diets and the gut microbiota.

Bacteriophage classification for assembled contigs using Graph Convolutional Network

  • Jiayu Shang, City Univeristy of Hong Kong, Hong Kong
  • Jingzhe Jiang, South China Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, China
  • Yanni Sun, City University of Hong Kong, Hong Kong

Presentation Overview:

Motivation: Bacteriophages (aka phages), which mainly infect bacteria, play key roles in the biology of microbes. Phages are the most abundant biological entities on the planet, and the number of discovered phages is only the tip of the iceberg. Recently, many new phages have been revealed using high-throughput sequencing, particularly metagenomic sequencing. Compared to the fast accumulation of phage-like sequences, there is a serious lag in the taxonomic classification of phages. High diversity, high abundance, and the limited number of known phages pose great challenges for taxonomic analysis. In particular, alignment-based tools have difficulty classifying the fast-accumulating contigs assembled from metagenomic data.

Results: In this work, we present a novel semi-supervised learning model, named PhaGCN, to conduct taxonomic classification of phage contigs. In this model, we construct a knowledge graph by combining DNA sequence features learned by a convolutional neural network (CNN) with protein sequence similarity derived from a gene-sharing network. We then apply a graph convolutional network (GCN) to utilize both labeled and unlabeled samples in training to enhance the learning ability. We tested PhaGCN on both simulated and real sequencing data. The results clearly show that our method competes favorably against available phage classification tools.

Statistical approaches for differential expression analysis in metatranscriptomics

  • Yancong Zhang, Harvard T. H. Chan School of Public Health, United States
  • Kelsey Thompson, Harvard T. H. Chan School of Public Health, United States
  • Curtis Huttenhower, Harvard T. H. Chan School of Public Health, United States
  • Eric Franzosa, Harvard T. H. Chan School of Public Health, United States

Presentation Overview:

Motivation: Metatranscriptomics (MTX) has become an increasingly practical way to profile the functional activity of microbial communities in situ. However, MTX remains underutilized due to experimental and computational limitations. The latter are complicated by non-independent changes in both RNA transcript levels and their underlying genomic DNA copies (as microbes simultaneously change their overall abundance in the population and regulate individual transcripts), genetic plasticity (as whole loci are frequently gained and lost in microbial lineages), and measurement compositionality and zero-inflation. Here, we present a systematic evaluation of and recommendations for differential expression (DE) analysis in MTX.

Results: We designed and assessed six statistical models for DE discovery in MTX that incorporate different combinations of DNA and RNA normalization and assumptions about the underlying changes of gene copies or species abundance within communities. We evaluated these models on multiple simulated and real multi-omic datasets. Models adjusting transcripts relative to their encoding gene copies as a covariate were significantly more accurate in identifying DE from MTX in both simulated and real datasets. Moreover, we show that when paired DNA measurements (metagenomic data, MGX) are not available, models normalizing MTX measurements within-species while also adjusting for total-species RNA balance sensitivity, specificity, and interpretability of DE detection, as does filtering likely technical zeros. The efficiency and accuracy of these models pave the way for more effective MTX-based DE discovery in microbial communities.

Availability: The analysis code and synthetic datasets used in this evaluation are available online at http://huttenhower.sph.harvard.edu/mtx2021.


MLCSB: Machine Learning in Computational and Systems Biology

MLCSB: Machine Learning in Computational and Systems Biology


stPlus: a reference-based method for the accurate enhancement of spatial transcriptomics

  • Boheng Zhang, Tsinghua University, China
  • Xuegong Zhang, Tsinghua University, China
  • Rui Jiang, Tsinghua University, China
  • Shengquan Chen, Tsinghua University, China
  • Xiaoyang Chen, Tsinghua University, China

Presentation Overview: Show

Motivation: Single-cell RNA sequencing (scRNA-seq) techniques have revolutionized the investigation of the transcriptomic landscape in individual cells. Recent advances in spatial transcriptomic technologies further enable gene expression profiling and spatial organization mapping of cells simultaneously. Among these technologies, imaging-based methods offer higher spatial resolution, but they are limited by either the small number of genes imaged or low gene detection sensitivity. Although several methods have been proposed for enhancing spatially resolved transcriptomics, inadequate accuracy of gene expression prediction and insufficient ability to identify cell populations still impede the applications of these methods.

Results: We propose stPlus, a reference-based method that leverages information in scRNA-seq data to enhance spatial transcriptomics. Based on an auto-encoder with a carefully tailored loss function, stPlus performs joint embedding and predicts spatial gene expression via a weighted k-NN. stPlus outperforms baseline methods with higher gene-wise and cell-wise Spearman correlation coefficients. We also introduce a clustering-based approach to assess the enhancement performance systematically. Using the data enhanced by stPlus, cell populations can be identified better than with the measured data. The predicted expression of genes unique to scRNA-seq data can also effectively characterize spatial cell heterogeneity. Moreover, stPlus is robust and scalable to datasets of diverse gene detection sensitivity levels, sample sizes, and numbers of spatially measured genes. We anticipate stPlus will facilitate the analysis of spatial transcriptomics.

Availability: stPlus with detailed documentation is freely accessible at http://health.tsinghua.edu.cn/software/stPlus/ and the source code is openly available at https://github.com/xy-chen16/stPlus.
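The prediction step the abstract describes, a weighted k-NN in the joint embedding space, can be sketched as follows. This is an illustrative reimplementation of the general idea only, not the stPlus code; the function name, the inverse-distance weighting, and the toy data are assumptions.

```python
import numpy as np

def predict_spatial_expression(emb_spatial, emb_ref, expr_ref, k=2):
    """Predict unmeasured gene expression for spatial cells from scRNA-seq
    reference cells that lie nearby in a shared embedding space.
    emb_spatial: (n_spatial, d) embeddings of spatial cells
    emb_ref:     (n_ref, d)     embeddings of reference scRNA-seq cells
    expr_ref:    (n_ref, g)     reference expression of the genes to predict
    """
    preds = np.zeros((emb_spatial.shape[0], expr_ref.shape[1]))
    for i, cell in enumerate(emb_spatial):
        dist = np.linalg.norm(emb_ref - cell, axis=1)  # distance to every reference cell
        nn = np.argsort(dist)[:k]                      # indices of the k nearest neighbors
        w = 1.0 / (dist[nn] + 1e-8)                    # inverse-distance weights
        w /= w.sum()
        preds[i] = w @ expr_ref[nn]                    # weighted average of neighbor profiles
    return preds

# Toy example: two spatial cells, four reference cells, one gene
emb_ref = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
expr_ref = np.array([[1.0], [1.2], [9.0], [9.5]])
emb_spatial = np.array([[0.0, 0.5], [5.0, 5.5]])
pred = predict_spatial_expression(emb_spatial, emb_ref, expr_ref, k=2)
```

Each spatial cell's prediction is pulled toward the reference cells closest to it in the embedding, so the first toy cell lands between 1.0 and 1.2 and the second between 9.0 and 9.5.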

Predicting mechanism of action of novel compounds using compound structure and transcriptomic signature co-embedding

  • Gwanghoon Jang, Korea University, South Korea
  • Sungjoon Park, Korea University, South Korea
  • Sanghoon Lee, Korea University, South Korea
  • Sunkyu Kim, Korea University, South Korea
  • Sejeong Park, Korea University, South Korea
  • Jaewoo Kang, Korea University, South Korea

Presentation Overview: Show

Motivation: Identifying mechanisms of action (MoAs) of novel compounds is crucial in drug discovery. A careful understanding of MoAs can help avoid potential side effects of drug candidates. Efforts have been made to identify MoAs using the transcriptomic signatures induced by compounds. However, those approaches fail to reveal MoAs in the absence of actual compound signatures.

Results: We present MoAble, which predicts MoAs without requiring compound signatures. We train a deep learning-based co-embedding model to map compound signatures and compound structures into the same embedding space. The model generates a low-dimensional compound signature representation from the compound structure. To predict MoAs, pathway enrichment analysis is performed based on the connectivity between the embedding vectors of compounds and those of genetic perturbations. Results show that MoAble is comparable to methods that use actual compound signatures. We demonstrate that MoAble can reveal the MoAs of novel compounds without measuring their compound signatures, with the same prediction accuracy as if they were measured.

Availability: MoAble is available at https://github.com/dmis-lab/moable

Contact: kangj@korea.ac.kr, sungjoonpark@korea.ac.kr

Supplementary information: Supplementary data are available at Bioinformatics online.
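The connectivity step described above can be illustrated as ranking genetic perturbations by cosine similarity to a compound's embedding; the top-ranked set would then feed the pathway enrichment analysis. This is a hedged sketch of the general idea, not MoAble's implementation; the function name, embedding shapes, and toy vectors are all assumptions.

```python
import numpy as np

def rank_perturbations(compound_emb, gene_embs):
    """Rank genetic perturbations by cosine similarity ("connectivity")
    to a compound's embedding vector in the shared co-embedding space.
    compound_emb: (d,) embedding of the compound
    gene_embs:    (n_genes, d) embeddings of genetic perturbations
    Returns indices ordered from most to least similar, plus the scores."""
    sims = gene_embs @ compound_emb / (
        np.linalg.norm(gene_embs, axis=1) * np.linalg.norm(compound_emb))
    return np.argsort(sims)[::-1], sims

# Toy 2-D embeddings: gene 0 points the same way as the compound,
# gene 2 is at 45 degrees, gene 1 is nearly orthogonal
gene_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
compound = np.array([1.0, 0.1])
order, sims = rank_perturbations(compound, gene_embs)
```

In a real pipeline the ordered perturbation list would be tested for pathway enrichment (e.g., with a rank-based statistic) to name the MoA.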

CROTON: An Automated and Variant-Aware Deep Learning Framework for Predicting CRISPR/Cas9 Editing Outcomes

  • Victoria Li, Hunter College High School, United States
  • Zijun Zhang, Flatiron Institute, Simons Foundation, United States
  • Olga Troyanskaya, Princeton University, United States

Presentation Overview: Show

CRISPR/Cas9 is a revolutionary gene-editing technology that has been widely utilized in biology, biotechnology, and medicine. CRISPR/Cas9 editing outcomes depend on local DNA sequences at the target site and are thus predictable. However, existing prediction methods are dependent on both feature and model engineering, which restricts their performance to existing knowledge about CRISPR/Cas9 editing. Herein, deep multi-task convolutional neural networks (CNNs) and neural architecture search (NAS) were used to automate both feature and model engineering and create an end-to-end deep-learning framework, CROTON (CRISPR Outcomes Through cONvolutional neural networks). The CROTON model architecture was tuned automatically with NAS on a synthetic large-scale construct-based dataset and then tested on an independent primary T cell genomic editing dataset. CROTON outperformed existing expert-designed models and non-NAS CNNs in predicting 1 base pair insertion and deletion probability as well as deletion and frameshift frequency. Interpretation of CROTON revealed local sequence determinants for diverse editing outcomes. Finally, CROTON was utilized to assess how single nucleotide variants (SNVs) affect the genome editing outcomes of four clinically relevant target genes: the viral receptors ACE2 and CCR5 and the immune checkpoint inhibitors PDCD1 and CTLA4. Large SNV-induced differences in CROTON predictions in these target genes suggest that SNVs should be taken into consideration when designing widely-applicable gRNAs.

Model learning to identify systemic regulators of the peripheral circadian clock

  • Julien Martinelli, Inria/Inserm/Institut Curie, France
  • Annabelle Ballesta, Inserm/Institut Curie, France
  • Xiao-Mei Li, Inserm/Université Paris Saclay, France
  • Sandrine Dulong, Inserm/Université Paris Saclay, France
  • Francis Lévi, Inserm/Université Paris Saclay, France
  • Michèle Teboul, Université Côté D'azur/CNRS/Inserm/IbV, France
  • Sylvain Soliman, Inria, France
  • François Fages, Inria, France

Presentation Overview: Show

Motivation: Personalized medicine aims to provide patient-tailored therapeutics based on multi-type data towards improved treatment outcomes. Chronotherapy, which consists of adapting drug administration to the patient's circadian rhythms, may be improved by such an approach. Recent clinical studies demonstrated large variability in patients' circadian coordination and optimal drug timing. Consequently, new eHealth platforms allow the monitoring of circadian biomarkers in individual patients through wearable technologies (rest-activity, body temperature), blood or salivary samples (melatonin, cortisol), and daily questionnaires (food intake, symptoms). A current clinical challenge involves designing a methodology to predict, from circadian biomarkers, the patient's peripheral circadian clocks and the associated optimal drug timing. Because the mammalian circadian timing system is largely conserved between mouse and human, albeit with opposite phase, the study was developed using available mouse datasets.
Results: We investigated at the molecular scale the influence of systemic regulators (e.g. temperature, hormones) on peripheral clocks, through a model learning approach involving systems biology models based on ordinary differential equations. Using as prior knowledge our existing circadian clock model, we derived an approximation for the action of systemic regulators on the expression of three core-clock genes: Bmal1, Per2 and Rev-Erb-alpha.
These time profiles were then fitted with a population of models, based on linear regression. Selected models involved a modulation of either Bmal1 or Per2 transcription most likely by temperature or nutrient exposure cycles. This agreed with biological knowledge on temperature-dependent control of Per2 transcription. The strengths of systemic regulations were found to be significantly different according to mouse sex and genetic background.

TUGDA: Task uncertainty guided domain adaptation for robust generalization of cancer drug response prediction from in vitro to in vivo settings

  • Rafael Peres da Silva, School of Computing, National University of Singapore, Singapore
  • Chayaporn Suphavilai, Genome Institute of Singapore, Singapore
  • Niranjan Nagarajan, Genome Institute of Singapore, Singapore

Presentation Overview: Show

Motivation: Large-scale cancer omics studies have highlighted the diversity of patient molecular profiles and the importance of leveraging this information to deliver the right drug to the right patient at the right time. Key challenges in learning predictive models for this include the high-dimensionality of omics data and heterogeneity in biological and clinical factors affecting patient response. The use of multi-task learning (MTL) techniques has been widely explored to address dataset limitations for in vitro drug response models, while domain adaptation (DA) has been employed to extend them to predict in vivo response. In both of these transfer learning settings, noisy data for some tasks (or domains) can substantially reduce the performance for others compared to single-task (domain) learners, i.e. lead to negative transfer (NT).

Results: We describe a novel multi-task unsupervised DA method (TUGDA) that addresses these limitations in a unified framework by quantifying uncertainty in predictors and weighting their influence on shared feature representations. TUGDA's ability to rely more on predictors with low uncertainty allowed it to notably reduce cases of NT for in vitro models (94% overall) compared to state-of-the-art methods. For DA to in vivo settings, TUGDA improved over previous methods for patient-derived xenografts (9 out of 14 drugs) as well as patient datasets (significant associations in 9 out of 22 drugs). TUGDA's ability to avoid NT thus provides a key capability as we try to integrate diverse drug-response datasets to build consistent predictive models with in vivo utility.

Availability: https://github.com/CSB5/TUGDA
Contact: nagarajann@gis.a-star.edu.sg
Supplementary information: Attached
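The idea of weighting each task's influence by its uncertainty can be sketched with a heteroscedastic-style loss, in which a task's error term is scaled by its precision exp(-log-variance). This is a generic illustration of uncertainty-guided task weighting, not TUGDA's actual formulation; the function, the scaling constants, and the toy numbers are assumptions.

```python
import numpy as np

def uncertainty_weighted_loss(task_errors, log_vars):
    """Aggregate per-task losses, damping uncertain tasks:
    each task's error is multiplied by its precision exp(-log_var), so the
    gradient flowing into shared features from a high-uncertainty task is
    reduced; the 0.5*log_var term penalizes declaring every task uncertain.
    task_errors: (T,) raw error per task (e.g., per-drug MSE)
    log_vars:    (T,) learned log-variance per task
    Returns (total loss, per-task weighted error contributions)."""
    precision = np.exp(-log_vars)
    weighted_err = 0.5 * precision * task_errors
    total = np.sum(weighted_err + 0.5 * log_vars)
    return total, weighted_err

errors = np.array([0.1, 0.1])    # identical raw errors for two tasks (drugs)
log_vars = np.array([0.0, 2.0])  # task 1 is flagged as high-uncertainty
total, weighted = uncertainty_weighted_loss(errors, log_vars)
```

With equal raw errors, the high-uncertainty task contributes a much smaller error term to the shared loss, which is the mechanism that limits negative transfer.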

Modeling drug combination effects via latent tensor reconstruction

  • Tianduanyi Wang, Aalto University; University of Helsinki, Finland
  • Sandor Szedmak, Aalto University, Finland
  • Haishan Wang, Aalto University, Finland
  • Tero Aittokallio, Aalto University; University of Helsinki; University of Turku; Oslo University Hospital; University of Oslo, Finland
  • Tapio Pahikkala, University of Turku, Finland
  • Anna Cichonska, Aalto University; University of Helsinki, Finland
  • Juho Rousu, Aalto University, Finland

Presentation Overview: Show

Motivation: Combination therapies have emerged as a powerful treatment modality to overcome drug resistance and improve treatment efficacy. However, the number of possible drug combinations increases very rapidly with the number of individual drugs in consideration which makes the comprehensive experimental screening infeasible in practice. Machine learning models offer time- and cost-efficient means to aid this process by prioritising the most effective drug combinations for further pre-clinical and clinical validation. However, the complexity of the underlying interaction patterns across multiple drug doses and in different cellular contexts poses challenges to the predictive modelling of drug combination effects.
Results: We introduce comboLTR, a highly time-efficient method for learning complex, nonlinear target functions that describe the responses of therapeutic agent combinations across doses and cancer cell contexts. The method is based on polynomial regression via powerful latent tensor reconstruction. It uses a combination of recommender-system-style features indexing the data tensor of response values in different contexts, and chemical and multi-omics features, as inputs. We demonstrate that comboLTR outperforms state-of-the-art methods in terms of predictive performance and running time, and produces highly accurate results even in the challenging and practical inference scenario where full dose-response matrices are predicted for completely new drug combinations with no available combination or monotherapy response measurements in any training cell line.
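The multiplicative structure at the heart of latent tensor reconstruction can be sketched as a rank-R scoring function over the modes of the response tensor (here: drug 1, drug 2, cell context). This is a minimal illustration of the general CP-style decomposition idea, not comboLTR's parameterization; the factor matrices and features below are random placeholders.

```python
import numpy as np

def ltr_predict(x_drug1, x_drug2, x_cell, U, V, W):
    """Rank-R latent tensor reconstruction score:
    y_hat = sum_r (U_r . x_drug1) * (V_r . x_drug2) * (W_r . x_cell).
    Each factor matrix maps one mode's features into R latent components;
    the product of components captures a multiplicative (polynomial)
    interaction between the two drugs and the cellular context."""
    return float(np.sum((U @ x_drug1) * (V @ x_drug2) * (W @ x_cell)))

rng = np.random.default_rng(1)
R, d_drug, d_cell = 4, 6, 5
U = rng.normal(size=(R, d_drug))   # factors for drug 1 features
V = rng.normal(size=(R, d_drug))   # factors for drug 2 features
W = rng.normal(size=(R, d_cell))   # factors for cell-context features
x1, x2, xc = rng.normal(size=d_drug), rng.normal(size=d_drug), rng.normal(size=d_cell)

y_hat = ltr_predict(x1, x2, xc, U, V, W)
# Swapping the drugs together with their factor matrices leaves the score unchanged
y_swapped = ltr_predict(x2, x1, xc, V, U, W)
```

Training would fit U, V, W (and typically higher-order terms) by minimizing reconstruction error over observed dose-response entries.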

TITAN: T Cell Receptor Specificity Prediction with Bimodal Attention Networks

  • Anna Weber, IBM, Zurich Research Laboratory and ETH Zurich, Switzerland
  • Jannis Born, IBM, Zurich Research Laboratory and ETH Zurich, Switzerland
  • Maria Rodriguez Martinez, IBM, Zurich Research Laboratory, Switzerland

Presentation Overview: Show

Motivation: The activity of the adaptive immune system is governed by T cells and their specific T-cell receptors (TCRs), which selectively recognize foreign antigens. Recent advances in experimental techniques have enabled sequencing of TCRs and their antigenic targets (epitopes), making it possible to study the missing link between TCR sequence and epitope binding specificity. Scarcity of data and a large sequence space make this task challenging, and to date only models limited to a small set of epitopes have achieved good performance. Here, we establish a K-NN classifier as a strong baseline and then propose TITAN (Tcr epITope bimodal Attention Networks), a bimodal neural network that explicitly encodes both TCR sequences and epitopes to enable the independent study of generalization capabilities to unseen TCRs and/or epitopes.
Results: By encoding epitopes on the atomic level with SMILES sequences, we leverage transfer learning techniques to enrich the input data and boost performance. TITAN achieves high performance on general unseen TCR prediction (ROC-AUC 0.87 in 10-fold CV) and surpasses the results of the current state of the art (ImRex) by a large margin. While unseen epitope generalization remains challenging, we report two major breakthroughs. First, by dissecting the attention heatmaps, we demonstrate that the sparsity of available epitope data favors an implicit treatment of epitopes as classes. This may be a general problem that limits unseen epitope performance for sufficiently complex models. Second, we show that TITAN nevertheless exhibits significantly improved performance on unseen epitopes and is capable of focusing attention on chemically meaningful molecular structures.

Bayesian information sharing enhances detection of regulatory associations in rare cell types

  • Alexander P. Wu, Massachusetts Institute of Technology, United States
  • Jian Peng, University of Illinois at Urbana-Champaign, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States
  • Hyunghoon Cho, Broad Institute of MIT and Harvard, United States

Presentation Overview: Show

Recent advances in single-cell RNA-sequencing (scRNA-seq) technologies promise to enable the study of gene regulatory associations at unprecedented resolution in diverse cellular contexts. However, identifying unique regulatory associations observed only in specific cell types or conditions remains a key challenge; this is particularly so for rare transcriptional states whose sample sizes are too small for existing gene regulatory network inference methods to be effective. We present ShareNet, a Bayesian framework for boosting the accuracy of cell type-specific gene regulatory networks by propagating information across related cell types via an information sharing structure that is adaptively optimized for a given single-cell dataset. The techniques we introduce can be used with a range of general network inference algorithms to enhance the output for each cell type. We demonstrate the enhanced accuracy of our approach on three benchmark scRNA-seq datasets. We find that our inferred cell type-specific networks also uncover key changes in gene associations that underpin the complex rewiring of regulatory networks across cell types, tissues, and dynamic biological processes. Our work presents a path towards extracting deeper insights about cell type-specific gene regulation in the rapidly growing compendium of scRNA-seq datasets.

CALLR: a semi-supervised cell type annotation method for single-cell RNA sequencing data

  • Ziyang Wei, Department of Statistics, University of Chicago, United States
  • Shuqin Zhang, School of Mathematical Sciences, Fudan University, China

Presentation Overview: Show

Single-cell RNA sequencing (scRNA-seq) technology has been widely applied to capture the heterogeneity of different cell types within complex tissues. An essential step in scRNA-seq data analysis is the annotation of cell types. Traditional cell type annotation mainly clusters the cells first and then labels each cluster using the aggregated cluster-level expression profiles and marker genes. Such methods depend heavily on the clustering results, which are still insufficient for accurate annotation. In this work, we propose a semi-supervised learning method for cell type annotation called CALLR. It combines unsupervised learning, represented by the Laplacian matrix constructed from all the cells, with a supervised term using logistic regression. By alternately updating the cell clusters and annotation labels, high annotation accuracy can be achieved. The model is formulated as an optimization problem, and a computationally efficient algorithm is developed to solve it. Experiments on seven real datasets show that CALLR outperforms the compared (semi-)supervised learning methods and the popular clustering methods.
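The flavor of objective described above, graph-Laplacian smoothness over all cells plus a logistic loss on the labeled cells, can be sketched for a binary toy problem. This is not the authors' formulation; the unnormalized Laplacian, the {-1, +1} label coding, the weight lam, and the toy graph are all assumptions.

```python
import numpy as np

def laplacian(S):
    """Unnormalized graph Laplacian L = D - S of a symmetric similarity matrix."""
    return np.diag(S.sum(axis=1)) - S

def semi_supervised_objective(f, S, labeled_idx, y_labeled, lam=1.0):
    """Semi-supervised objective in the spirit of CALLR:
    smoothness of scores f over the cell-cell graph (f^T L f)
    plus logistic loss on the labeled cells (labels in {-1, +1})."""
    smooth = f @ laplacian(S) @ f
    z = f[labeled_idx]
    logistic = np.mean(np.log1p(np.exp(-y_labeled * z)))
    return smooth + lam * logistic

# Four cells forming two connected pairs; only cells 0 and 2 are labeled
S = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
labels = np.array([1.0, -1.0])
f_consistent = np.array([2.0, 2.0, -2.0, -2.0])    # agrees with graph and labels
f_inconsistent = np.array([2.0, -2.0, -2.0, 2.0])  # violates graph smoothness
obj_good = semi_supervised_objective(f_consistent, S, [0, 2], labels)
obj_bad = semi_supervised_objective(f_inconsistent, S, [0, 2], labels)
```

Scores that respect both the graph structure and the known labels achieve a lower objective, which is what alternating updates over clusters and labels exploit.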


NetBio: Network Biology

NetBio: Network Biology


Graph Transformation for Enzymatic Mechanisms

  • Rolf Fagerberg, IMADA, University of Southern Denmark, Denmark
  • Christoph Flamm, Department of Theoretical Chemistry, University of Vienna, Austria
  • Walter Fontana, Department of Systems Biology, Harvard Medical School, United States
  • Juraj Kolčák, IMADA, University of Southern Denmark, Denmark
  • Christophe V.F.P. Laurent, IMADA, University of Southern Denmark, Denmark
  • Daniel Merkle, IMADA, University of Southern Denmark, Denmark
  • Nikolai Nøjgaard, IMADA, University of Southern Denmark, Denmark

Presentation Overview: Show

Motivation: The design of enzymes is as challenging as it is consequential for making chemical synthesis in medical and industrial applications more efficient, cost-effective and environmentally friendly. While several aspects of this complex problem are computationally assisted, the drafting of catalytic mechanisms, i.e. the specification of the chemical steps—and hence intermediate states—that the enzyme is meant to implement, is largely left to human expertise. The ability to capture specific chemistries of multi-step catalysis in a fashion that enables its computational construction and design is therefore highly desirable and would equally impact the elucidation of existing enzymatic reactions whose mechanisms are unknown.

Results: We use the mathematical framework of graph transformation to express the distinction between rules and reactions in chemistry. We derive about 1000 rules for amino acid side chain chemistry from the M-CSA database, a curated repository of enzymatic mechanisms. Using graph transformation we are able to propose hundreds of hypothetical catalytic mechanisms for a large number of unrelated reactions in the Rhea database. We analyze these mechanisms to find that they combine, in chemically sound fashion, individual steps from a variety of known multi-step mechanisms, showing that plausible novel mechanisms for catalysis can be constructed computationally.

Availability and Implementation: The source code of the initial prototype of our approach is available at https://github.com/Nojgaard/mechsearch
Contact: daniel@imada.sdu.dk
Supplementary information: Supplementary data are available at https://cheminf.imada.sdu.dk/preprints/ECCB-2021

Disease Gene Prediction with Privileged Information and Heteroscedastic Dropout

  • Jianzhu Ma, Department of Computer Science and Department of Biochemistry, Purdue University, United States
  • Sheng Wang, Paul G. Allen School of Computer Science, University of Washington, United States
  • Juan Shu, Department of Statistics, Purdue University, United States
  • Yu Li, Department of Computer Science and Engineering, The Chinese University of Hong Kong, China
  • Bowei Xi, Department of Statistics, Purdue University, United States

Presentation Overview: Show

Recently, machine learning models have achieved tremendous success in prioritizing candidate genes for genetic diseases. These models are able to accurately quantify the similarity among diseases and genes based on the intuition that similar genes are more likely to be associated with similar diseases. However, the genetic features these methods rely on are often hard to collect due to high experimental cost and various other technical limitations. Existing solutions to this problem significantly increase the risk of overfitting and decrease the generalizability of the models.

In this work, we propose a graph neural network (GNN) version of the Learning Under Privileged Information (LUPI) paradigm to predict new disease-gene associations. Unlike previous gene prioritization approaches, our model does not require the genetic features to be the same at the training and test stages. If a genetic feature is hard to measure and therefore missing at the test stage, our model can still efficiently incorporate its information during the training process. We develop a Heteroscedastic Gaussian Dropout algorithm, in which the dropout probability of the GNN model is determined by another GNN with a mirrored architecture. We compared our method with four state-of-the-art methods on the Online Mendelian Inheritance in Man (OMIM) dataset for prioritizing candidate disease genes. Extensive evaluations show that our model improves prediction accuracy over the other methods when all features are available. More importantly, it makes very accurate predictions even when >90% of the features are missing at the test stage.
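The Gaussian dropout ingredient can be illustrated in isolation: each feature is multiplied by noise drawn from N(1, alpha_i), with a per-feature variance alpha_i. In the paper the alphas come from a mirrored GNN; in this sketch they are just a fixed vector, and the function name and toy data are assumptions.

```python
import numpy as np

def heteroscedastic_gaussian_dropout(x, alpha, rng):
    """Gaussian dropout with per-feature noise variance:
    feature i of every sample is multiplied by noise ~ N(1, alpha[i]),
    so features with larger alpha are perturbed (regularized) more.
    x:     (n, d) input matrix
    alpha: (d,) per-feature noise variances
    """
    noise = rng.normal(loc=1.0, scale=np.sqrt(alpha), size=x.shape)
    return x * noise

rng = np.random.default_rng(42)
x = np.ones((10000, 2))
alpha = np.array([0.01, 0.5])  # feature 1 is far noisier than feature 0
out = heteroscedastic_gaussian_dropout(x, alpha, rng)
var0, var1 = out[:, 0].var(), out[:, 1].var()
```

The multiplicative noise is mean-one, so activations are unbiased in expectation while the per-feature variance controls how strongly each feature is damped.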

A novel constrained genetic algorithm-based Boolean network inference method from steady-state gene expression data

  • Hung-Cuong Trinh, Ton Duc Thang University, Viet Nam
  • Yung-Keun Kwon, University of Ulsan, South Korea

Presentation Overview: Show

Motivation: It is a challenging problem in systems biology to infer both the network structure and the dynamics of a gene regulatory network from steady-state gene expression data. Some methods based on Boolean or differential equation models have been proposed, but they were not efficient for inference of large-scale networks. Therefore, it is necessary to develop a method that infers the network structure and dynamics accurately on large-scale networks using steady-state expression data.
Results: In this study, we propose a novel constrained genetic algorithm-based Boolean network inference (CGA-BNI) method, in which a Boolean canalyzing update rule scheme is employed to capture coarse-grained dynamics. Given steady-state gene expression data as input, CGA-BNI identifies a set of path consistency-based constraints by comparing gene expression levels between the wild-type and mutant experiments. It then searches for Boolean networks that satisfy the constraints and induce attractors most similar to the steady-state expressions. We devised a heuristic mutation operation for faster convergence and implemented a parallel evaluation routine to reduce execution time. Through extensive simulations on artificial and real gene expression datasets, CGA-BNI showed better performance than four other existing methods in terms of both structural and dynamics prediction accuracy.
Conclusion: Taken together, CGA-BNI is a promising and scalable tool for predicting both the structure and the dynamics of a gene regulatory network when the highest accuracy is needed, at the cost of longer execution time.
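A canalyzing Boolean update rule, the scheme the abstract names, evaluates regulators in a priority order: the first regulator found at its canalyzing value fixes the target gene's next state, otherwise a default applies. The sketch below shows the rule form only; the rule encoding and the toy regulators are assumptions, not CGA-BNI's data structures.

```python
def canalyzing_update(inputs, rule):
    """Nested canalyzing Boolean update: scan regulators in priority order;
    the first regulator found at its canalyzing value determines the output,
    otherwise fall through to the default value.
    inputs: list of current Boolean states (0/1) of all genes
    rule:   ([(regulator_index, canalyzing_value, canalyzed_output), ...], default)
    """
    priorities, default = rule
    for idx, canal_in, canal_out in priorities:
        if inputs[idx] == canal_in:
            return canal_out
    return default

# Target gene with two regulators: if regulator 0 is ON the output is OFF
# (repressor dominates); else if regulator 1 is ON the output is ON
# (activator); otherwise the gene stays OFF.
rule = ([(0, 1, 0), (1, 1, 1)], 0)
out_a = canalyzing_update([1, 1], rule)  # repressor wins
out_b = canalyzing_update([0, 1], rule)  # activator acts
out_c = canalyzing_update([0, 0], rule)  # default
```

Restricting candidate networks to such rules keeps the dynamics coarse-grained and the genetic algorithm's search space tractable.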


RegSys: Regulatory and Systems Genomics

RegSys: Regulatory and Systems Genomics


Resolving diverse protein-DNA footprints from exonuclease-based ChIP experiments

  • Anushua Biswas, CSIR-National Chemical Laboratory, India
  • Leelavati Narlikar, CSIR-National Chemical Laboratory, India

Presentation Overview: Show

High-throughput chromatin immunoprecipitation (ChIP) sequencing-based assays capture genomic regions associated with the profiled transcription factor (TF). ChIP-exo is a modified protocol, which uses lambda exonuclease to digest DNA close to the TF-DNA complex, in order to improve the positional resolution of the TF-DNA contact. Because digestion occurs in the 5′–3′ orientation, it produces directional footprints near the complex, on both sides of the DNA. Like all ChIP-based methods, ChIP-exo reports a mixture of different regions associated with the TF: those bound directly as well as via intermediaries. However, the distribution of footprints is likely to be indicative of the complex forming at the DNA. We present ExoDiversity, which uses a model-based framework to learn a joint distribution over footprints and motifs, thus resolving the mixture of ChIP-exo footprints into diverse binding modes. It uses no prior motif or TF information and automatically learns the number of different modes from the data. We show its application on a wide range of TFs and organisms/cell types. Because its goal is to explain the complete set of reported regions, it is able to identify co-factor TF motifs that appear in a small fraction of the dataset. Further, ExoDiversity discovers small nucleotide variations within and outside canonical motifs, which co-occur with variations in footprints, suggesting that the TF-DNA structural configuration at those regions is likely to be different. Finally, we show that detected modes have specific DNA shape features and conservation signals, giving insights into the structure and function of the putative TF-DNA complexes.

DECODE: A Deep-learning Framework for Condensing Enhancers and Refining Boundaries with Large-scale Functional Assays

  • Mark Gerstein, Yale University, United States
  • Min Xu, Carnegie Mellon University, United States
  • Zhanlin Chen, Yale University, United States
  • Jing Zhang, UC Irvine, United States
  • Jason Liu, Yale University, United States
  • Yi Dai, UC Irvine, United States
  • Donghoon Lee, Icahn School of Medicine at Mount Sinai, United States
  • Martin Min, NEC Laboratories America, United States

Presentation Overview: Show

Summary: Mapping distal regulatory elements, such as enhancers, is the cornerstone for investigating genome evolution, understanding critical biological functions, and ultimately elucidating how genetic variations may influence diseases. Previous enhancer prediction methods have used either unsupervised approaches or supervised methods with limited training data. Moreover, past approaches have operationalized enhancer discovery as a binary classification problem without accurate enhancer boundary detection, producing low-resolution annotations with redundant regions and reducing the statistical power for downstream analyses (e.g., causal variant mapping and functional validations). Here, we addressed these challenges via a two-step model called DECODE. First, we employed direct enhancer activity readouts from novel functional characterization assays, such as STARR-seq, to train a deep neural network classifier for accurate cell-type-specific enhancer prediction. Second, to improve the annotation resolution (~500 bp), we implemented a weakly-supervised object detection framework for enhancer localization with precise boundary detection (at 10 bp resolution) using gradient-weighted class activation mapping.
Results: Our DECODE binary classifier outperformed the state-of-the-art enhancer prediction methods by 24% in transgenic mouse validation. Further, DECODE object detection can condense enhancer annotations to only 12.6% of their original size while still reporting higher conservation scores and genome-wide association study variant enrichments. Overall, DECODE improves the efficiency of regulatory element mapping with graphics processing units for deep-learning applications and is a powerful tool for enhancer prediction and boundary localization.

EnHiC: Learning fine-resolution Hi-C contact maps using a generative adversarial framework

  • Yangyang Hu, University of California Riverside, United States
  • Wenxiu Ma, University of California Riverside, United States

Presentation Overview: Show

Motivation:
The Hi-C technique has enabled genome-wide mapping of chromatin interactions. However, high-resolution Hi-C data require costly deep sequencing and have therefore only been achieved for a limited number of cell types. Machine learning models based on neural networks have been developed as a remedy to this problem.

Results:
In this work, we propose a novel method, EnHiC, for predicting high-resolution Hi-C matrices from low-resolution input data based on a generative adversarial network (GAN) framework. Inspired by non-negative matrix factorization, our model fully exploits the unique properties of Hi-C matrices and extracts rank-1 features from multi-scale low-resolution matrices to enhance the resolution. Using three human Hi-C datasets, we demonstrate that EnHiC accurately and reliably enhances the resolution of Hi-C matrices and outperforms other GAN-based models. Moreover, EnHiC-predicted high-resolution matrices facilitate accurate detections of TADs and fine-scale chromatin interactions.

Availability:
EnHiC is publicly available at https://github.com/wmalab/EnHiC.
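The rank-1 feature idea the abstract attributes to non-negative matrix factorization can be illustrated on a symmetric, non-negative contact matrix: its dominant component can be recovered by power iteration, and by Perron-Frobenius the dominant eigenvector is non-negative, matching an NMF-style factor. This is a conceptual sketch only, not EnHiC's network, and the function name and toy matrix are assumptions.

```python
import numpy as np

def rank1_feature(M, n_iter=50):
    """Leading rank-1 component of a symmetric, non-negative contact
    matrix via power iteration. Returns (w, v) with M ~ w * outer(v, v),
    where v is the (unit-norm) dominant eigenvector and w its eigenvalue."""
    v = np.ones(M.shape[0]) / np.sqrt(M.shape[0])
    for _ in range(n_iter):
        v = M @ v
        v /= np.linalg.norm(v)
    w = v @ M @ v
    return w, v

# A noiseless rank-1 "contact map" is recovered exactly
u = np.array([1.0, 2.0, 3.0])
M = np.outer(u, u)
w, v = rank1_feature(M)
approx = w * np.outer(v, v)
```

In EnHiC such rank-1 features are extracted from multi-scale low-resolution matrices and fed to the generator, rather than computed by explicit power iteration as here.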

scPNMF: sparse gene encoding of single cells to facilitate gene selection for targeted gene profiling

  • Dongyuan Song, University of California, Los Angeles, United States
  • Kexin Li, University of California, Los Angeles, United States
  • Zachary Hemminger, University of California, Los Angeles, United States
  • Roy Wollman, University of California, Los Angeles, United States
  • Jingyi Jessica Li, University of California, Los Angeles, United States

Presentation Overview: Show

Motivation: Single-cell RNA sequencing (scRNA-seq) captures whole-transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for closer study. A first question is therefore how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity, and extra (e.g., spatial) information; however, they typically can only measure up to a few hundred genes. A second, more challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.
Results: Here we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and cell-type annotation on targeted gene profiling data.
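After the basis-selection step, picking informative genes from a PNMF-style loading matrix reduces to ranking genes by their weights within the retained bases. The sketch below shows only that final ranking step; the scoring rule (max weight across kept bases), the function name, and the toy loading matrix are assumptions, not the scPNMF algorithm itself.

```python
import numpy as np

def select_genes(W, selected_bases, n_genes):
    """Pick informative genes from a PNMF-style loading matrix
    W (genes x bases): keep the genes with the largest weights within the
    retained bases. The basis-selection step (dropping bases that track
    technical variation rather than cell types) is assumed done already."""
    score = W[:, selected_bases].max(axis=1)  # best weight of each gene across kept bases
    return np.argsort(score)[::-1][:n_genes]  # indices of top-scoring genes

# Toy loading matrix: 5 genes x 3 bases; basis 2 is "uninformative" and dropped
W = np.array([[0.9, 0.0, 0.5],
              [0.1, 0.8, 0.0],
              [0.0, 0.1, 0.9],
              [0.7, 0.2, 0.0],
              [0.0, 0.0, 0.8]])
genes = select_genes(W, selected_bases=[0, 1], n_genes=2)
```

Genes 0 and 1 are selected because they dominate the two informative bases; gene 4 is ignored since its weight lies entirely in the dropped basis.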


Text Mining



Utilizing Image and Caption Information for Biomedical Document Classification

  • Pengyuan Li, University of Delaware, United States
  • Gongbo Zhang, University of Delaware, United States
  • Xiangying Jiang, University of Delaware, United States
  • Juan Trelles Trabucco, University of Illinois at Chicago, United States
  • Daniela Raciti, California Institute of Technology, United States
  • Cynthia Smith, The Jackson Laboratory, United States
  • Martin Ringwald, The Jackson Laboratory, United States
  • G. Elisabeta Marai, University of Illinois at Chicago, United States
  • Cecilia Arighi, University of Delaware, United States
  • Hagit Shatkay, University of Delaware, United States

Presentation Overview:

Biomedical research findings are typically disseminated through publications. To simplify access to domain-specific knowledge while supporting the research community, several biomedical databases devote significant effort to manual curation of the literature, a labor-intensive process. The first step in biocuration is identifying articles relevant to the specific area on which the database focuses. Thus, automatically identifying publications relevant to a specific topic within a large volume of publications is an important task for expediting the biocuration process and, in turn, biomedical research. Current methods focus on textual contents, typically extracted from the title-and-abstract. Notably, images and captions are often used in publications to convey pivotal evidence about processes, experiments and results.
We present a new document classification scheme, using both image and caption information, in addition to titles-and-abstracts. To use the image information, we introduce a new image representation, namely Figure-word, based on class labels of subfigures. We use word embeddings for representing captions and titles-and-abstracts. To utilize all three types of information, we introduce two information integration methods. The first combines Figure-words and textual features obtained from captions and titles-and-abstracts into a single larger vector for document representation; the second employs a meta-classification scheme. Our experiments and results demonstrate the usefulness of the newly proposed Figure-words for representing images. Moreover, the results showcase the value of Figure-words, captions, and titles-and-abstracts in providing complementary information for document classification; these three sources of information, when combined, lead to an overall improved classification performance.
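The first integration method described above is early fusion: a bag-of-Figure-words vector built from subfigure class labels is concatenated with the caption and title-and-abstract embeddings. A minimal sketch of that idea, with a hypothetical label vocabulary (the actual Figure-word construction and embeddings are those of the paper, not reproduced here):

```python
import numpy as np

def figure_word_vector(subfig_labels, vocab):
    """Bag-of-Figure-words: count subfigure class labels over a fixed vocabulary.

    subfig_labels: predicted class labels of a document's subfigures.
    vocab: dict mapping label -> index (hypothetical example vocabulary).
    """
    v = np.zeros(len(vocab))
    for lab in subfig_labels:
        if lab in vocab:
            v[vocab[lab]] += 1
    return v

def combine_features(fig_vec, caption_vec, abstract_vec):
    # Early-fusion variant: a single concatenated document vector,
    # which is then fed to any standard classifier.
    return np.concatenate([fig_vec, caption_vec, abstract_vec])
```

The alternative, meta-classification, would instead train one classifier per source and combine their predictions; early fusion is the simpler of the two to sketch.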


TransMed: Translational Medical Informatics



“Single-subject studies”-derived analyses unveil altered biomechanisms between very small cohorts: implications for rare diseases

  • Dillon Aberasturi, The University of Arizona, United States
  • Nima Pouladi, The University of Utah, United States
  • Samir Rachid Zaim, The University of Arizona, United States
  • Colleen Kenost, The University of Utah, United States
  • Joanne Berghout, Pfizer, United States
  • Walter W. Piegorsch, The University of Arizona, United States
  • Yves A. Lussier, The University of Utah, United States

Presentation Overview:

Motivation: Identifying altered transcripts between very small human cohorts is particularly challenging and is compounded by the low accrual rate of human subjects in rare diseases or sub-stratified common disorders. Yet, single-subject studies (S3) can compare paired transcriptome samples drawn from the same patient under two conditions (e.g., treated vs pre-treatment) and suggest patient-specific responsive biomechanisms based on the overrepresentation of functionally defined gene sets. These improve statistical power by: (i) reducing the total features tested and (ii) relaxing the requirement of within-cohort uniformity at the transcript level. We propose Inter-N-of-1, a novel method, to identify meaningful differences between very small cohorts by using the effect size of “single-subject-study”-derived responsive biological mechanisms.
Results: In each subject, Inter-N-of-1 requires applying the previously published S3-type N-of-1-pathways MixEnrich to two paired samples (e.g., diseased vs unaffected tissues) to determine patient-specific enriched gene sets: Odds Ratios (S3-OR) and S3-variance using Gene Ontology Biological Processes. To evaluate small cohorts, we calculated the precision and recall of Inter-N-of-1 and of a control method (GLM+EGS) when comparing two cohorts of decreasing sizes (from 20 vs 20 to 2 vs 2) in a comprehensive six-parameter simulation and in a proof-of-concept clinical dataset. In simulations, the Inter-N-of-1 median precision and recall are > 90% and > 75% in cohorts of 3 vs 3 distinct subjects (regardless of the parameter values), whereas conventional methods outperform Inter-N-of-1 at sample sizes of 9 vs 9 and larger. Similar results were obtained in the clinical proof-of-concept dataset.
Availability: R software is available at Lussierlab.net/BSSD.
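The per-subject effect sizes the abstract refers to are odds ratios from a 2x2 contingency table (differentially expressed vs not, in-pathway vs not). A minimal numpy sketch of that computation, with a Haldane-style continuity correction; this illustrates the generic pathway odds ratio only, not the MixEnrich model itself:

```python
import numpy as np

def pathway_odds_ratio(deg_mask, pathway_mask, eps=0.5):
    """Odds ratio of a gene set's enrichment in one subject's paired samples.

    deg_mask: boolean array, True where a gene is differentially expressed
              between the subject's two paired samples.
    pathway_mask: boolean array, True where a gene belongs to the gene set.
    eps: continuity correction so the OR is finite with empty cells.
    """
    a = np.sum(deg_mask & pathway_mask) + eps     # DEG, in pathway
    b = np.sum(deg_mask & ~pathway_mask) + eps    # DEG, outside pathway
    c = np.sum(~deg_mask & pathway_mask) + eps    # non-DEG, in pathway
    d = np.sum(~deg_mask & ~pathway_mask) + eps   # non-DEG, outside pathway
    return (a * d) / (b * c)
```

Computing this OR per subject and per Gene Ontology Biological Process yields the patient-level effect sizes that are then compared across the two small cohorts.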

Optimising Blood-Brain Barrier Permeation through Deep Reinforcement Learning for De Novo Drug Design

  • Tiago Pereira, University of Coimbra, Portugal
  • Maryam Abbasi, Univeristy of Coimbra, Portugal
  • José Oliveira, University of Aveiro, Portugal
  • Bernardete Ribeiro, University of Coimbra, Portugal
  • Joel Arrais, University of Coimbra, Portugal

Presentation Overview:

The process of placing new drugs into the market is time-consuming, expensive and complex. The application of computational methods for designing molecules with bespoke properties can contribute to saving resources throughout this process. However, the fundamental properties to be optimised are often either not considered or conflict with each other. In this work, we propose a novel approach that considers both the biological property and the bioavailability of compounds through a deep reinforcement learning framework for the targeted generation of compounds. We aim to obtain a promising set of compounds that are selective for the adenosine A2A receptor and, simultaneously, have the necessary solubility and permeability across the blood-brain barrier to reach the site of action. The cornerstone of the framework is the Generator, a Recurrent Neural Network architecture that learns the rules for building valid molecules so that it can sample new compounds. In addition, two Predictors are trained to estimate the properties of interest of the new molecules. Finally, the Generator is fine-tuned with reinforcement learning, integrated with multi-objective optimisation and exploratory techniques, to ensure that it is adequately biased.
The biased Generator can generate an interesting set of molecules, with approximately 85% having the two fundamental properties biased as desired. Thus, this approach has transformed a general molecule generator into a model focused on optimising specific objectives. Furthermore, the molecules' synthesisability and drug-likeness demonstrate the potential applicability of this de novo drug design approach in medicinal chemistry.
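The fine-tuning step described above is policy-gradient reinforcement learning: molecules sampled from the Generator are scored by the Predictors, and tokens of high-reward molecules are made more likely. A toy REINFORCE update on a per-step categorical policy, in numpy, can sketch the mechanics (this is a generic illustration, not the authors' framework, and the scalar `reward` here stands in for the multi-objective Predictor scores):

```python
import numpy as np

def reinforce_step(logits, sampled_ids, reward, lr=0.1):
    """One REINFORCE update for a sequence of categorical token choices.

    logits: (seq_len, vocab) unnormalized token scores of the policy.
    sampled_ids: (seq_len,) tokens actually sampled for one molecule.
    reward: scalar score of the sampled molecule (e.g., from Predictors).

    Uses grad log p(a) = onehot(a) - softmax(logits) per step.
    """
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    grad = -probs
    grad[np.arange(len(sampled_ids)), sampled_ids] += 1.0
    return logits + lr * reward * grad  # ascend reward-weighted log-likelihood
```

A positive reward nudges the policy toward the sampled tokens; a multi-objective scalarization (e.g., a weighted sum of affinity and blood-brain-barrier scores) would supply `reward` in the setting the abstract describes.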

Expected 10-anonymity of HyperLogLog sketches for federated queries of clinical data repositories

  • Ziye Tao, University of Toronto, Canada
  • Griffin M. Weber, Harvard Medical School, United States
  • Yun William Yu, University of Toronto, Canada

Presentation Overview:

Motivation: The rapid growth of electronic medical records provides immense potential to researchers, but the records are often siloed at separate hospitals. As a result, federated networks have arisen, which allow simultaneously querying medical databases at a group of connected institutions. The most basic such query is the aggregate count—e.g., how many patients have diabetes? However, depending on the protocol used to estimate that total, there is always a trade-off between the accuracy of the estimate and the risk of leaking confidential data. Prior work has shown that it is possible to empirically control that trade-off by using the HyperLogLog (HLL) probabilistic sketch.
Results: In this article, we prove complementary theoretical bounds on the k-anonymity privacy risk of using HLL sketches, as well as exhibit code to efficiently compute those bounds.
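The HLL sketch at issue keeps only a small array of registers, each recording the maximum leading-zero rank seen among the hashes routed to it, which is why it leaks far less per patient than an exact ID list. A minimal standard-library sketch of the data structure (generic HLL, not the paper's k-anonymity analysis or code):

```python
import hashlib

def hll_add(registers, item, p=4):
    """Insert an item into a HyperLogLog sketch with m = 2**p registers.

    The top p hash bits pick a register; the register keeps the maximum
    'rank' (position of the first 1-bit) seen in the remaining bits.
    """
    h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
    idx = h >> (64 - p)                     # register index from top p bits
    rest = h & ((1 << (64 - p)) - 1)        # remaining 64 - p bits
    rank = (64 - p) - rest.bit_length() + 1 # leading zeros + 1
    registers[idx] = max(registers[idx], rank)
    return registers

def hll_estimate(registers):
    """Raw HLL cardinality estimate (no small/large-range corrections)."""
    m = len(registers)
    alpha = 0.673 if m == 16 else 0.7213 / (1 + 1.079 / m)
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

Because many distinct patients collapse onto the same register value, an observer of the sketch cannot in general isolate an individual—this is the intuition the k-anonymity bounds in the article make precise.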


VarI: Variant Interpretation



A variant selection framework for genome graphs

  • Chirag Jain, Indian Institute of Science, India
  • Neda Tavakoli, Georgia Institute of Technology, United States
  • Srinivas Aluru, Georgia Institute of Technology, United States

Presentation Overview:

Variation graph representations are projected to either replace or supplement conventional single genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants for many species now exist, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping.

In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants (SNPs, indels), and on whether the goal is to minimize the number of positions at which variants are listed or the total number of variants listed. We classify the computational complexity of these problems and, where feasible, provide efficient algorithms along with their software implementation. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short- and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of indel structural variants can be safely excluded from the human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis.
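The intuition behind the α/δ constraint is that a variant may be dropped only if every window of length α still approximates each haplotype path within δ differences. A toy greedy heuristic for the SNP-only case can convey the flavor (this is an illustrative sketch of the windowed-budget idea, not the algorithms or complexity results of the paper):

```python
def removable_snps(positions, alpha, delta):
    """Greedy left-to-right sketch: drop a SNP only if no window of length
    alpha would then contain more than delta dropped variants.

    positions: sorted-able list of SNP coordinates on the reference.
    Returns the list of positions marked removable under this heuristic.
    """
    dropped = []
    for pos in sorted(positions):
        # dropped SNPs that could share an alpha-length window with this one
        window_drops = [q for q in dropped if pos - q < alpha]
        if len(window_drops) < delta:
            dropped.append(pos)  # within budget: safe to omit from the graph
    return dropped
```

With a long-read-style budget (large α, large δ), almost every SNP fits under the window budget, which is consistent with the near-total SNP exclusion the abstract reports for α = 10 kbp, δ = 1000.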



International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
