Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Proceedings Track Presentations

3DSIG: Structural Bioinformatics and Computational Biophysics

A joint method for marker-free alignment of tilt series in electron tomography
COSI: 3DSIG: Structural Bioinformatics and Computational Biophysics
Date: July 22 and July 23

  • Renmin Han, KAUST, Saudi Arabia
  • Zhipeng Bao, Tsinghua University, China
  • Xiangrui Zeng, Carnegie Mellon University, United States
  • Tongxin Niu, Institute of Biophysics, CAS, China
  • Fa Zhang, Institute of Computing Technology, CAS, China
  • Min Xu, Carnegie Mellon University, United States
  • Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

Motivation: Electron tomography (ET) is a widely used technology for 3D macromolecular structure reconstruction. To obtain a satisfactory tomogram reconstruction, several key processes are involved, one of which is the calibration of projection parameters of the tilt series. Although fiducial marker-based alignment for tilt series has been well studied, marker-free alignment remains a challenge, because it requires identifying and tracking the identical objects (landmarks) through different projections. However, the tracking of these landmarks is usually affected by the pixel density (intensity) change caused by the geometry difference in different views. The tracked landmarks are used to determine the projection parameters, while different projection parameters in turn affect the localization of landmarks. Currently, there is no alignment method that takes into account the interrelationship between the projection parameters and the landmarks.
Results: Here, we propose a novel, joint method for marker-free alignment of tilt series in ET, by utilizing the information underlying the interrelationship between the projection model and the landmarks. The proposed method is the first joint solution that combines the extrinsic (track-based) alignment and the intrinsic (intensity-based) alignment, in which the localization of landmarks and projection parameters keep refining each other until convergence. This iterative approach makes our solution robust to different initial parameters and extreme geometric changes, which ensures a better reconstruction for marker-free electron tomography. Comprehensive experimental results on three real datasets show that our new method achieved a significant improvement in alignment accuracy and reconstruction quality, compared to the state-of-the-art methods.
Availability: The main program is available at https://github.com/icthrm/joint-marker-free-alignment.

Precise Modelling and Interpretation of Bioactivities of Ligands Targeting G Protein-coupled Receptors
COSI: 3DSIG: Structural Bioinformatics and Computational Biophysics
Date: July 22 and July 23

  • Jiansheng Wu, Nanjing University of Posts and Telecommunications, China
  • Yang Zhang, University of Michigan, United States

Presentation Overview:

Motivation: Accurate prediction and interpretation of ligand bioactivities are essential for virtual screening and drug discovery. Unfortunately, many important drug targets lack experimental data about the ligand bioactivities; this is particularly true for G protein-coupled receptors (GPCRs), which account for the targets of about a third of drugs currently on the market. Computational approaches with the potential of precise assessment of ligand bioactivities and determination of key substructural features which determine ligand bioactivities are needed to address this issue.
Results: A new method, SED, was proposed to predict ligand bioactivities and to recognize key substructures associated with GPCRs through the coupling of screening for Lasso of long extended-connectivity fingerprints (ECFPs) with deep neural network training. The SED pipeline contains three successive steps: 1) representation of long ECFPs for ligand molecules, 2) feature selection by screening for Lasso of ECFPs, and 3) bioactivity prediction through a deep neural network regression model. The method was examined on a set of sixteen representative GPCRs that cover most subfamilies of human GPCRs, where each has 300–5000 ligand associations. The results show that SED achieves excellent performance in modelling ligand bioactivities, especially for those in the GPCR datasets without sufficient ligand associations, where SED improved the baseline predictors by 12% in correlation coefficient (r²) and 19% in root mean square error. Detailed data analyses suggest that the major advantage of SED lies in its ability to detect substructures from long ECFPs, which significantly improves the predictive performance.
Availability: The source code and datasets of SED are freely available at https://zhanglab.ccmb.med.umich.edu/SED/.


DIFFUSE: Predicting isoform functions from sequences and expression profiles via deep learning
COSI: Bio-Ontologies
Date: July 23 and July 24

  • Hao Chen, University of California, Riverside, United States
  • Dipan Shaw, University of California, Riverside, United States
  • Jianyang Zeng, Tsinghua University, China
  • Dongbo Bu, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China
  • Tao Jiang, University of California, Riverside; Tsinghua University, United States

Presentation Overview:

Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory for at least two reasons: i) Unlike gene-level annotations, isoform-level functional annotations are scarce. ii) The information on isoform functions is concealed in various types of data, including isoform sequences, co-expression relationships among isoforms, etc. In this study, we present a novel approach, DIFFUSE, to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refining the prediction using a conditional random field (CRF) based on co-expression relationships. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE could effectively predict the functions of isoforms and genes. It achieves an average AUC of 0.840 and AUPRC of 0.581 over 4,184 GO functional categories, which are significantly higher than those of the state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity, and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences.

CAMDA: Critical Assessment of Massive Data Analysis

Comprehensive evaluation of transcriptome-based cell-type quantification methods for immuno-oncology
COSI: CAMDA: Critical Assessment of Massive Data Analysis
Date: July 24 and July 25

  • Gregor Sturm, Chair of Experimental Bioinformatics, TUM School of Life Sciences Weihenstephan, Technical University of Munich, Germany
  • Francesca Finotello, Medical University of Innsbruck, Austria
  • Florent Petitprez, Ligue Nationale Contre le Cancer, France
  • Jitao David Zhang, Roche Innovation Center Basel, F. Hoffmann-La-Roche AG, Switzerland
  • Jan Baumbach, Technical University of Munich, Germany
  • Wolf H. Fridman, Cordeliers Research Centre, UMRS_1138, INSERM, University Paris-Descartes, Sorbonne University, Paris, France
  • Markus List, Technical University of Munich, Germany
  • Tatsiana Aneichyk, Pieris Pharmaceuticals GmbH, Lise-Meitner-Straße 30, 85354 Freising, Germany

Presentation Overview:

Motivation: The composition and density of immune cells in the tumor microenvironment profoundly influence tumor progression and success of anti-cancer therapies. Flow cytometry, immunohistochemistry staining, and single-cell sequencing are often unavailable, so we rely on computational methods to estimate the immune-cell composition from bulk RNA-sequencing (RNA-seq) data. Various methods have been proposed recently, yet their capabilities and limitations have not been evaluated systematically. A general guideline leading the research community through cell type deconvolution is missing.

Results: We developed a systematic approach for benchmarking such computational methods and assessed the accuracy of tools at estimating nine different immune- and stromal cells from bulk RNA-seq samples. We used a single-cell RNA-seq dataset of ~11,000 cells from the tumor microenvironment to simulate bulk samples of known cell type proportions, and validated the results using independent, publicly available gold-standard estimates. This allowed us to analyze and condense the results of more than a hundred thousand predictions to provide an exhaustive evaluation across seven computational methods over nine cell types and ~1,800 samples from five simulated and real-world datasets. We demonstrate that computational deconvolution performs at high accuracy for well-defined cell-type signatures and propose how fuzzy cell-type signatures can be improved. We suggest that future efforts should be dedicated to refining cell population definitions and finding reliable signatures.

Availability: A snakemake pipeline to reproduce the benchmark is available at https://github.com/grst/immune_deconvolution_benchmark. An R package allows the community to perform integrated deconvolution using different methods (https://grst.github.io/immunedeconv).
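The simulation step described in the Results, building bulk samples with known cell type proportions from single-cell profiles, can be illustrated with a minimal sketch. It is not the benchmark's pipeline: the function name simulate_bulk, the toy two-gene profiles and the averaging scheme are assumptions made only for illustration.

```python
import random


def simulate_bulk(cells_by_type, fractions, n_cells, rng):
    """Average randomly drawn single-cell profiles into one pseudo-bulk sample.

    cells_by_type: {cell_type: [expression vector, ...]} (hypothetical input layout)
    fractions:     {cell_type: desired fraction of the simulated sample}
    Returns the pseudo-bulk profile and the fractions actually realised.
    """
    n_genes = len(next(iter(cells_by_type.values()))[0])
    bulk = [0.0] * n_genes
    counts = {}
    for cell_type, fraction in fractions.items():
        counts[cell_type] = round(fraction * n_cells)
        for _ in range(counts[cell_type]):
            # Draw with replacement from the cells annotated with this type.
            cell = rng.choice(cells_by_type[cell_type])
            for gene, value in enumerate(cell):
                bulk[gene] += value
    total = sum(counts.values())
    bulk = [value / total for value in bulk]
    realised = {cell_type: count / total for cell_type, count in counts.items()}
    return bulk, realised


# Toy data: gene 0 marks T cells, gene 1 marks B cells (illustrative values only).
toy_cells = {
    "T cell": [[10.0, 0.0], [12.0, 0.0]],
    "B cell": [[0.0, 8.0], [0.0, 10.0]],
}
bulk_profile, true_fractions = simulate_bulk(
    toy_cells, {"T cell": 0.25, "B cell": 0.75}, n_cells=100, rng=random.Random(0)
)
```

Because the realised fractions are known, the estimate of any deconvolution method run on bulk_profile can be scored against true_fractions.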

PRECISE: A domain adaptation approach to transfer predictors of drug response from pre-clinical models to tumors
COSI: CAMDA: Critical Assessment of Massive Data Analysis
Date: July 24 and July 25

  • Soufiane Mourragui, Delft University of Technology and the Netherlands Cancer Institute, Netherlands
  • Marco Loog, TU Delft and University of Copenhagen, Netherlands
  • Mark van de Wiel, VUmc Amsterdam, Netherlands
  • Marcel Reinders, TU Delft and Leiden University Medical Center, Netherlands
  • Lodewyk Wessels, The Netherlands Cancer Institute, Netherlands

Presentation Overview:

Motivation: Cell lines and patient-derived xenografts (PDX) have been used extensively to understand the molecular underpinnings of cancer. While core biological processes are typically conserved, these models also show important differences compared to human tumors, hampering the translation of findings from pre-clinical models to the human setting. In particular, employing drug response predictors generated on data derived from pre-clinical models to predict patient response remains a challenging task. As very large drug response datasets have been collected for pre-clinical models, and patient drug response data is often lacking, there is an urgent need for methods that efficiently transfer drug response predictors from pre-clinical models to the human setting.

Results: We show that cell lines and PDXs share common characteristics and processes with human tumors. We quantify this similarity and show that a regression model cannot simply be trained on cell lines or PDXs and then applied on tumors. We developed PRECISE, a novel methodology based on domain adaptation that captures the common information shared amongst pre-clinical models and human tumors in a consensus representation. Employing this representation, we train predictors of drug response on pre-clinical data and apply these predictors to stratify human tumors. We show that the resulting domain-invariant predictors show a small reduction in predictive performance in the pre-clinical domain but, importantly, reliably recover known associations between independent biomarkers and their companion drugs on human tumors.

CompMS: Computational Mass Spectrometry

ADAPTIVE: leArning DAta-dePendenT, concIse molecular VEctors for fast, accurate metabolite identification from tandem mass spectra
COSI: CompMS: Computational Mass Spectrometry
Date: July 23

  • Dai Hai Nguyen, Kyoto University, Japan
  • Canh Hao Nguyen, Bioinformatics Center, ICR, Kyoto University, Japan
  • Hiroshi Mamitsuka, Kyoto University / Aalto University, Japan

Presentation Overview:

Motivation: Metabolite identification is an important task in metabolomics to enhance the knowledge of biological systems. There have been a number of machine learning based methods proposed for this task, which predict a chemical structure of a given spectrum through an intermediate (chemical structure) representation called molecular fingerprints. They usually have two steps: 1) predicting fingerprints from spectra; 2) searching chemical compounds (in database) corresponding to the predicted fingerprints. Fingerprints are feature vectors, which are usually very large to cover all possible substructures and chemical properties, and therefore heavily redundant, in the sense of having many molecular (sub)structures irrelevant to the task, causing limited predictive performance and slow prediction.
Results: We propose ADAPTIVE, which has two parts: learning two mappings 1) from structures to molecular vectors and 2) from spectra to molecular vectors. The first part learns molecular vectors for metabolites from given data, to be consistent with both spectra and chemical structures of metabolites. In more detail, molecular vectors are generated by a model parameterized by a message passing neural network (MPNN), and the parameters are estimated by maximizing the correlation between molecular vectors and the corresponding spectra in terms of the Hilbert-Schmidt Independence Criterion (HSIC). Molecular vectors generated by this model are compact and, importantly, adaptive (specific) to both the given data and the task of metabolite identification. The second part uses input output kernel regression (IOKR), the current cutting-edge method of metabolite identification. We empirically confirmed the effectiveness of ADAPTIVE by using benchmark data, where ADAPTIVE outperformed the original IOKR in both predictive performance and computational efficiency.
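For reference, a commonly used biased empirical estimator of HSIC for n paired samples is trace(KHLH) / (n - 1)^2, where K and L are kernel matrices of the two sample sets and H = I - (1/n)11^T is the centering matrix. The sketch below computes it in plain Python. The Gaussian kernel, the bandwidths and the toy vectors are assumptions for illustration and are not taken from ADAPTIVE, which additionally optimises the MPNN parameters to maximise this quantity.

```python
import math


def gaussian_kernel_matrix(points, sigma):
    """Return the n x n Gaussian (RBF) kernel matrix for a list of vectors."""
    n = len(points)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            gram[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return gram


def double_center(gram):
    """Compute H K H without forming H explicitly."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [gram[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, sigma_x=1.0, sigma_y=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    k_centered = double_center(gaussian_kernel_matrix(xs, sigma_x))
    l_gram = gaussian_kernel_matrix(ys, sigma_y)
    # trace(K H L H) equals trace((H K H) L) by cyclicity of the trace.
    trace = sum(k_centered[i][j] * l_gram[j][i] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


# Toy stand-ins for molecular vectors (xs) and spectra (ys).
xs = [[i / 10.0] for i in range(20)]
ys_dependent = [[2.0 * x[0]] for x in xs]   # fully determined by xs
ys_constant = [[0.0] for _ in xs]           # carries no information about xs
```

On this toy data hsic(xs, ys_dependent) is clearly positive, while hsic(xs, ys_constant) is zero up to rounding, which is the behaviour that makes HSIC usable as a dependence objective.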

NPS: scoring and evaluating the statistical significance of peptidic natural product–spectrum matches
COSI: CompMS: Computational Mass Spectrometry
Date: July 23

  • Azat Tagirdzhanov, Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
  • Alexander Shlemov, Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
  • Alexey Gurevich, Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia

Presentation Overview:

Motivation: Peptidic Natural Products (PNPs) are considered a promising compound class that has many applications in medicine. Recently developed mass spectrometry-based pipelines are transforming PNP discovery into a high-throughput technology. However, the current computational methods for PNP identification via database search of mass spectra are still in their infancy and could be substantially improved.
Results: Here we present NPS, a statistical learning-based approach for scoring PNP–spectrum matches. We incorporated NPS into two leading PNP discovery tools and benchmarked them on millions of natural product mass spectra. The results demonstrate more than 45% increase in the number of identified spectra and 20% more found PNPs at a false discovery rate of 1%.
Availability: NPS is available as a command line tool and as a web application at http://cab.spbu.ru/software/NPS

pNovo 3: precise de novo peptide sequencing using a learning-to-rank framework
COSI: CompMS: Computational Mass Spectrometry
Date: July 23

  • Hao Yang, Institute of Computing Technology, CAS, China
  • Hao Chi, Institute of Computing Technology, CAS, China
  • Wen-Feng Zeng, Institute of Computing Technology, CAS, China
  • Wen-Jing Zhou, Institute of Computing Technology, CAS, China
  • Si-Min He, Institute of Computing Technology, CAS, China

Presentation Overview:

Motivation: De novo peptide sequencing based on tandem mass spectrometry data is the key technology of shotgun proteomics for identifying peptides without any database and assembling unknown proteins. However, owing to the low ion coverage in tandem mass spectra, the order of certain consecutive amino acids cannot be determined if all of their supporting fragment ions are missing, which results in the low precision of de novo sequencing.
Results: In order to solve this problem, we developed pNovo 3, which uses a learning-to-rank framework to distinguish similar peptide candidates for each spectrum. Three metrics for measuring the similarity between each experimental spectrum and its corresponding theoretical spectrum were used as important features, in which the theoretical spectra can be precisely predicted by the pDeep algorithm using deep learning. On seven benchmark data sets from six diverse species, pNovo 3 recalled 29–102% more correct spectra, and the precision was 11–89% higher than three other state-of-the-art de novo sequencing algorithms. Furthermore, compared with the newly developed DeepNovo, which also uses a deep learning approach, pNovo 3 still identified 21–50% more spectra on the nine data sets used in the study of DeepNovo. In summary, the deep learning and learning-to-rank techniques implemented in pNovo 3 significantly improve the precision of de novo sequencing, and such a machine learning framework is worth extending to other related research fields to distinguish similar sequences.

Unsupervised segmentation of mass spectrometric ion images characterizes morphology of tissues
COSI: CompMS: Computational Mass Spectrometry
Date: July 23

  • Dan Guo, Northeastern University, United States
  • Kylie Bemis, Northeastern University, United States
  • Catherine Rawlins, Northeastern University, United States
  • Jeffrey Agar, Northeastern University, United States
  • Olga Vitek, Northeastern University, United States

Presentation Overview:

Mass spectrometry imaging (MSI) characterizes the spatial distribution of ions in complex biological samples such as tissues. Since many tissues have complex morphology, treatments and conditions often affect the spatial distribution of the ions in morphology-specific ways. Evaluating the selectivity and the specificity of ion localization and regulation across morphology types is biologically important. However, MSI lacks algorithms for segmenting images at both single-ion and spatial resolution. This manuscript contributes Spatial-DGMM, an algorithm and a workflow for the analyses of MSI experiments, that detects components of single-ion images with homogeneous spatial composition. The approach extends Dirichlet Gaussian mixture models (DGMM) to account for the spatial structure of MSI. Evaluations on simulated and experimental datasets with diverse MSI workflows demonstrated that Spatial-DGMM accurately segments ion images, and can distinguish ions with homogeneous and heterogeneous spatial distribution. We also demonstrated that the extracted spatial information is useful for downstream analyses, such as detecting morphology-specific ions, finding groups of ions with similar spatial patterns, and detecting changes in chemical composition of tissues between conditions.

Education: Computational Biology Education

scOrange – A Tool for Hands-On Training of Concepts from Single Cell Data Analytics
COSI: Education: Computational Biology Education
Date: July 24 and July 25

  • Martin Stražar, University of Ljubljana, Slovenia
  • Lan Žagar, University of Ljubljana, Slovenia
  • Jaka Kokošar, University of Ljubljana, Slovenia
  • Vesna Tanko, University of Ljubljana, Slovenia
  • Aleš Erjavec, University of Ljubljana, Slovenia
  • Pavlin Poličar, University of Ljubljana, Slovenia
  • Anže Starič, University of Ljubljana, Slovenia
  • Janez Demšar, University of Ljubljana, Slovenia
  • Gad Shaulsky, Baylor College of Medicine, United States
  • Menon Vilas, Howard Hughes Medical Institute, United States
  • Andrew Lamire, Howard Hughes Medical Institute, United States
  • Anup Parikh, Naringi Inc., United States
  • Blaž Zupan, University of Ljubljana, Slovenia

Presentation Overview:

MOTIVATION: Single-cell RNA sequencing allows us to simultaneously profile the transcriptomes of thousands of cells and to indulge in exploring cell diversity, development and discovery of new molecular mechanisms. Analysis of scRNA data involves a combination of non-trivial steps from statistics, data visualization, bioinformatics, and machine learning. Training molecular biologists in single-cell data analysis and empowering them to review and analyze their data can be challenging, both because of the complexity of the methods and the steep learning curve.
RESULTS: We propose a workshop-style training in single cell data analytics that relies on an explorative data analysis toolbox and a hands-on teaching style. The training relies on scOrange, a newly developed extension of a data mining framework that features workflow design through visual programming and interactive visualizations. Workshops with scOrange can proceed much faster than similar training methods that rely on computer programming and analysis through scripting in R or Python, allowing the trainer to cover more ground in the same time-frame. We here review the design principles of the scOrange toolbox that support such workshops and propose a syllabus for the course. We also provide examples of data analysis workflows that instructors can use during the training.

Evolution and Comparative Genomics

A Divide-and-Conquer Method for Scalable Phylogenetic Network Inference from Multi-locus Data
COSI: Evolution and Comparative Genomics
Date: July 23

  • Jiafan Zhu, Rice University, United States
  • Xinhao Liu, Rice University, United States
  • Huw Ogilvie, Rice University, United States
  • Luay Nakhleh, Rice University, United States

Presentation Overview:

Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting (ILS). However, these methods can only handle a small number of loci from a handful of genomes.

In this paper, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied the performance of the method, in terms of both running time and accuracy, on simulated as well as on biological data sets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference.
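The abstract does not say which heuristic is used for the Hitting Set formulation, nor exactly how trinets map onto it. As a generic illustration only, the sketch below shows the textbook greedy heuristic for Hitting Set: repeatedly pick the element contained in the largest number of sets that are not yet hit. The function name and the toy input are hypothetical.

```python
def greedy_hitting_set(sets_to_hit):
    """Greedy heuristic for Hitting Set.

    Returns a list of elements such that every input set contains at least
    one chosen element. The result is not guaranteed to be minimum.
    """
    remaining = [set(s) for s in sets_to_hit]
    if any(not s for s in remaining):
        raise ValueError("an empty set can never be hit")
    chosen = []
    while remaining:
        # Count how many still-unhit sets each element would hit.
        counts = {}
        for current in remaining:
            for element in current:
                counts[element] = counts.get(element, 0) + 1
        # Ties are broken by taking the smallest element, for determinism.
        best = max(sorted(counts), key=lambda element: counts[element])
        chosen.append(best)
        remaining = [current for current in remaining if best not in current]
    return chosen


# Toy instance: three sets over the elements a to d.
toy_sets = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
solution = greedy_hitting_set(toy_sets)
print(solution)  # ['b', 'c']
```

The greedy choice keeps the selection small in practice, but because Hitting Set is NP-hard in general, a heuristic of this kind trades optimality for speed.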

We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet).

Efficient Merging of Genome Profile Alignments
COSI: Evolution and Comparative Genomics
Date: July 23

  • André Hennig, Center for Bioinformatics Tübingen, University of Tübingen, Germany
  • Kay Nieselt, Center for Bioinformatics Tübingen, University of Tübingen, Germany

Presentation Overview:

Motivation: Whole-genome alignment methods show insufficient scalability towards the generation of large-scale whole-genome alignments (WGAs). Profile alignment-based approaches revolutionized the field of multiple sequence alignment construction by significantly reducing computational complexity and runtime. However, WGAs need to consider genomic rearrangements between genomes, which makes the profile-based extension of several whole genomes challenging. Currently, none of the available methods offer the possibility to align or extend WGA profiles.
Results: Here, we present GPA, an approach that aligns the profiles of WGAs and is capable of producing large-scale WGAs many times faster than conventional methods. Our concept relies on already available whole-genome aligners, which are used to compute several smaller sets of aligned genomes that are combined into a full WGA with a divide-and-conquer approach. To align or extend WGA profiles, we make use of the SuperGenome data structure, which features a bidirectional mapping between individual sequence and alignment coordinates. This data structure is used to efficiently transfer different coordinate systems into a common one based on the principles of profile alignments. The approach allows the computation of a WGA where alignments are subsequently merged along a guide tree. The current implementation uses progressiveMauve (Darling et al., 2010) and offers the possibility for parallel computation of independent genome alignments. Our results based on various bacterial data sets of up to several hundred genomes show that we can reduce the runtime from months to hours with a quality that is negligibly worse than the WGA computed with the conventional progressiveMauve tool.
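The idea of a bidirectional mapping between sequence and alignment coordinates can be sketched for a single gapped alignment block. This is a simplified illustration, not the SuperGenome implementation: it ignores rearrangements, multiple blocks and strand, and the function names are hypothetical.

```python
def build_coordinate_maps(aligned_row, gap="-"):
    """Build both coordinate mappings for one row of an alignment block.

    seq_to_aln[i] is the alignment column of sequence position i.
    aln_to_seq[c] is the sequence position in column c, or None for a gap.
    """
    seq_to_aln = []
    aln_to_seq = []
    for column, symbol in enumerate(aligned_row):
        if symbol == gap:
            aln_to_seq.append(None)
        else:
            aln_to_seq.append(len(seq_to_aln))
            seq_to_aln.append(column)
    return seq_to_aln, aln_to_seq


def project(seq_pos, source_maps, target_maps):
    """Translate a position in one genome into another via the shared columns."""
    column = source_maps[0][seq_pos]
    return target_maps[1][column]  # None if the target has a gap there


# Two rows of a toy alignment block (0-based coordinates).
maps_a = build_coordinate_maps("AC--GT")
maps_b = build_coordinate_maps("A-TTGT")
print(maps_a[0])                   # [0, 1, 4, 5]
print(project(2, maps_a, maps_b))  # 3
print(project(1, maps_a, maps_b))  # None
```

Going through the alignment columns as a common coordinate system is what lets positions from different genomes be compared without pairwise lookups.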

Estimating the predictability of cancer evolution
COSI: Evolution and Comparative Genomics
Date: July 23

  • Sayed-Rzgar Hosseini, Cancer Research UK, Cambridge Institute, United Kingdom
  • Ramon Diaz-Uriarte, Dept. Biochemistry, Universidad Autonoma de Madrid, Instituto de Investigaciones Biomedicas “Alberto Sols” (UAM-CSIC), Spain
  • Florian Markowetz, University of Cambridge, United Kingdom
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Presentation Overview:

Motivation: How predictable is the evolution of cancer? This fundamental question is of immense relevance for the diagnosis, prognosis, and treatment of cancer. Evolutionary biologists have approached the question of predictability based on the underlying fitness landscape. However, empirical fitness landscapes of tumor cells are impossible to determine in vivo. Thus, in order to quantify the predictability of cancer evolution, alternative approaches are required that circumvent the need for fitness landscapes.
Results: We developed a computational method based on Conjunctive Bayesian Networks (CBNs) to quantify the predictability of cancer evolution directly from mutational data, without the need for measuring or estimating fitness. Using simulated data derived from more than 200 different fitness landscapes, we show that our CBN-based notion of evolutionary predictability strongly correlates with the classical notion of predictability based on fitness landscapes under the Strong Selection Weak Mutation assumption. The statistical framework enables robust and scalable quantification of evolutionary predictability. We applied our approach to driver mutation data from the TCGA and the MSK-IMPACT clinical cohorts to systematically compare the predictability of 15 different cancer types. We found that cancer evolution is remarkably predictable as only a small fraction of evolutionary trajectories are feasible during cancer progression.
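To illustrate why order constraints make evolution more predictable, the sketch below counts the mutation orderings that respect a given partial order, which is the kind of constraint a CBN encodes. It is not the authors' predictability measure, whose definition is not given in this abstract; the mutation names and prerequisite relations are invented for the example.

```python
from itertools import permutations
from math import factorial


def feasible_trajectories(mutations, prerequisites):
    """List the orderings in which every mutation follows all its prerequisites.

    prerequisites maps a mutation to the set of mutations that must already
    be present before it can occur (hypothetical input layout).
    """
    feasible = []
    for order in permutations(mutations):
        acquired = set()
        for mutation in order:
            if not prerequisites.get(mutation, set()) <= acquired:
                break  # a prerequisite is missing, so this ordering is infeasible
            acquired.add(mutation)
        else:
            feasible.append(order)
    return feasible


# Toy poset: B needs A; D needs both B and C.
mutations = ["A", "B", "C", "D"]
prerequisites = {"B": {"A"}, "D": {"B", "C"}}
feasible = feasible_trajectories(mutations, prerequisites)
fraction = len(feasible) / factorial(len(mutations))
print(len(feasible), fraction)  # 3 0.125
```

Here only 3 of the 24 possible orderings are feasible. Brute-force enumeration is used for clarity and is practical only for a handful of mutations.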

Inference of clonal selection in cancer populations using single-cell sequencing data
COSI: Evolution and Comparative Genomics
Date: July 23

  • Pavel Skums, Georgia State University, United States
  • Viachaslau Tsyvina, Georgia State University, United States
  • Alex Zelikovsky, GSU, United States

Presentation Overview:

Intra-tumor heterogeneity is one of the major factors influencing cancer progression and treatment outcome. However, evolutionary dynamics of cancer clone populations remain poorly understood. Quantification of clonal selection and inference of fitness landscapes of tumors is a key step to understanding evolutionary mechanisms driving cancer. These problems could be addressed using single cell sequencing, which provides an unprecedented insight into intra-tumor heterogeneity, making it possible to study and quantify selective advantages of individual clones. Here we present SCIFIL, a computational tool for inference of fitness landscapes of heterogeneous cancer clone populations from single cell sequencing data. SCIFIL estimates maximum likelihood fitnesses of clone variants and measures their selective advantages and order of appearance by fitting an evolutionary model to the tumor phylogeny. We demonstrate the accuracy of our approach, and show how it could be applied to experimental tumor data to study clonal selection and infer evolutionary history. SCIFIL can be used to provide new insight into the evolutionary dynamics of cancer. Its source code is available at https://github.com/compbel/SCIFIL

Large-Scale Mammalian Genome Rearrangements Coincide with Chromatin Interactions
COSI: Evolution and Comparative Genomics
Date: July 23

  • Krister Swenson, CNRS, Université de Montpellier, France
  • Mathieu Blanchette, McGill University, Canada

Presentation Overview:

Motivation: Genome rearrangements drastically change gene order along great stretches of a chromosome. There has been initial evidence that these apparently non-local events in the 1D sense may have breakpoints that are close in the 3D sense. We harness the power of the Double Cut and Join model of genome rearrangement, along with Hi-C chromosome capture data to test this hypothesis between human and mouse.
Results: We devise novel statistical tests which show that indeed, rearrangement scenarios that transform the human into the mouse gene order are enriched for pairs of breakpoints that have frequent chromosome interactions. This is observed for both intra-chromosomal breakpoint pairs, as well as for inter-chromosomal pairs. For intra-chromosomal rearrangements, the enrichment exists for close (<20Mbs) and far (100Mbs) pairs. Further, the pattern exists across multiple cell lines, from multiple laboratories, in different states of the cell cycle. We show that similarities in the contact frequencies between these many experiments contribute to the enrichment. We conclude that either 1) rearrangements usually involve breakpoints that are spatially close, or 2) there is selection against rearrangements that act on spatially distant breakpoints.

Statistical Compression of Protein Sequences and Inference of Marginal Probability Landscapes over Competing Alignments using Finite State Models and Dirichlet Priors
COSI: Evolution and Comparative Genomics
Date: July 23

  • Dinithi Sumanaweera, Monash University, Australia
  • Lloyd Allison, Monash University, Australia
  • Arun Konagurthu, Monash University, Australia

Presentation Overview:

The information criterion of Minimum Message Length (MML) provides a powerful statistical framework for inductive reasoning from observed data. We apply MML to the problem of protein sequence comparison using finite state models with Dirichlet distributions. The resulting framework allows us to supersede the ad hoc cost functions commonly used in the field, by systematically addressing the problem of arbitrariness in alignment parameters, and the disconnect between substitution scores and gap costs. Furthermore, our framework enables the generation of marginal probability landscapes over all possible alignment hypotheses, with potential to facilitate the users to simultaneously rationalise and assess competing alignment relationships between protein sequences, beyond simply reporting a single (best) alignment. We demonstrate the performance of our program on benchmarks containing distantly related protein sequences.

Summarizing the Solution Space in Tumor Phylogeny Inference by Multiple Consensus Trees
COSI: Evolution and Comparative Genomics
Date: July 23

  • Nuraini Aguse, University of Illinois at Urbana-Champaign, United States
  • Yuanyuan Qi, University of Illinois at Urbana-Champaign, United States
  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States

Presentation Overview: Show

Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees.
We introduce the MULTIPLE CONSENSUS TREE (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data in a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T.

TreeMerge: A new method for improving the scalability of species tree estimation methods
COSI: Evolution and Comparative Genomics
Date: July 23

  • Erin Molloy, University of Illinois at Urbana-Champaign, United States
  • Tandy Warnow, University of Illinois at Urbana-Champaign, United States

Presentation Overview: Show

Motivation: At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n^5) running time for datasets with n species.

Results: Here we present a new method called "TreeMerge" that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework (only O(n^2) time). We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets on which they would otherwise fail when given 64 GB of memory and 48 hours maximum running time. Thus, TreeMerge is a step towards a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All.

Availability: TreeMerge is publicly available on GitHub (http://github.com/ekmolloy/treemerge).

Function SIG: Gene and Protein Function Annotation

Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN
COSI: Function SIG: Gene and Protein Function Annotation
Date: July 22

  • Muhao Chen, University of California, Los Angeles, United States
  • Chelsea J.-T. Ju, University of California, Los Angeles, United States
  • Guangyu Zhou, University of California, Los Angeles, United States
  • Shirley Chen, University of California, Los Angeles, United States
  • Tianran Zhang, University of California, Los Angeles, United States
  • Kai-Wei Chang, University of California, Los Angeles, United States
  • Carlo Zaniolo, University of California, Los Angeles, United States
  • Wei Wang, University of California, Los Angeles, United States

Presentation Overview: Show

Motivation: Sequence-based protein-protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences. Based on these features, statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage on the PPI information.
Results: We present an end-to-end framework, PIPR, for PPI predictions using only the primary sequences. PIPR incorporates a deep residual recurrent convolutional neural network in the Siamese architecture, which leverages both robust local features and contextualized information that are significant for capturing the mutual influence of protein sequences. Our framework reduces the data pre-processing effort required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that PIPR outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows promising performance on the more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short.

Reconstructing Signaling Pathways Using Regular-Language Constrained Paths
COSI: Function SIG: Gene and Protein Function Annotation
Date: July 22

  • Mitchell Wagner, Virginia Tech, United States
  • Aditya Pratapa, Virginia Tech, United States
  • T. M. Murali, Virginia Tech, United States

Presentation Overview: Show

Motivation: High-quality curation of the proteins and interactions in signaling pathways is slow and painstaking. As a result, many experimentally-detected interactions are not annotated to any pathways. A natural question that arises is whether or not it is possible to automatically leverage existing pathway annotations to identify new interactions for inclusion in a given pathway.

Results: We present RegLinker, an algorithm that achieves this purpose by computing multiple short paths from pathway receptors to transcription factors (TFs) within a background interaction network. The key idea underlying RegLinker is the use of regular-language constraints to control the number of non-pathway interactions that are present in the computed paths. We systematically evaluate RegLinker and five alternative approaches against a comprehensive set of 15 signaling pathways and demonstrate that RegLinker recovers withheld pathway proteins and interactions with the best precision and recall. We used RegLinker to propose new extensions to the pathways. We discuss the literature that supports the inclusion of these proteins in the pathways. These results show the broad potential of automated analysis to attenuate difficulties of traditional manual inquiry.

Availability: https://github.com/Murali-group/RegLinker
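
The idea of constraining paths with a regular language can be illustrated by searching a product of the network and an automaton. The following is a minimal sketch and not the RegLinker implementation: it assumes a toy directed network whose edges carry a hypothetical label 'p' (pathway) or 'x' (non-pathway), and it uses the simplest possible language, "at most k non-pathway edges", tracked by a counter. All function names, node names, and weights are illustrative assumptions.

```python
import heapq

def constrained_shortest_path(edges, source, target, max_non_pathway):
    """Dijkstra over (node, non-pathway-edges-used) states.

    edges: iterable of (u, v, weight, label), label 'p' or 'x' (assumed labels).
    Returns (cost, path) or None if no admissible path exists.
    """
    adj = {}
    for u, v, w, label in edges:
        adj.setdefault(u, []).append((v, w, label))

    start = (source, 0)
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, source, 0)]
    while heap:
        cost, node, used = heapq.heappop(heap)
        if cost > best.get((node, used), float("inf")):
            continue  # stale heap entry
        if node == target:
            # Walk parents back to the source to recover the node sequence.
            path, state = [], (node, used)
            while state is not None:
                path.append(state[0])
                state = parent[state]
            return cost, path[::-1]
        for nxt, w, label in adj.get(node, []):
            nxt_used = used + (label == "x")
            if nxt_used > max_non_pathway:
                continue  # would leave the allowed language
            state = (nxt, nxt_used)
            new_cost = cost + w
            if new_cost < best.get(state, float("inf")):
                best[state] = new_cost
                parent[state] = (node, used)
                heapq.heappush(heap, (new_cost, nxt, nxt_used))
    return None

# Toy network: receptor R, transcription factor TF (hypothetical data).
EDGES = [
    ("R", "A", 1, "p"),
    ("A", "TF", 5, "p"),
    ("R", "B", 1, "x"),
    ("B", "TF", 1, "p"),
]
print(constrained_shortest_path(EDGES, "R", "TF", 0))
print(constrained_shortest_path(EDGES, "R", "TF", 1))
```

Allowing one non-pathway edge opens the cheaper route through B, which is the kind of candidate extension such a search would surface. A general regular language would replace the counter with the states of a finite automaton.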

SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences
COSI: Function SIG: Gene and Protein Function Annotation
Date: July 22

  • Jian Zhang, Xinyang Normal University, China
  • Lukasz Kurgan, Virginia Commonwealth University, United States

Presentation Overview: Show

Motivation: Accurate predictions of protein-binding residues (PBRs) enhance understanding of molecular-level rules governing protein-protein interactions, help protein-protein docking, and facilitate annotation of protein functions. Recent studies show that current sequence-based predictors of PBRs severely cross-predict residues that interact with other types of protein partners (e.g., RNA and DNA) as PBRs. Moreover, these methods are relatively slow, prohibiting genome-scale use.
Results: We propose a novel, accurate and fast sequence-based predictor of PBRs that minimizes the cross-predictions. Our SCRIBER (SeleCtive pRoteIn-Binding rEsidue pRedictor) method takes advantage of three innovations: a comprehensive dataset that covers multiple types of binding residues, novel types of inputs that are relevant to the prediction of PBRs, and an architecture that is tailored to reduce the cross-predictions. The dataset includes complete protein chains and offers improved coverage of binding annotations that are transferred from multiple protein-protein complexes. We utilize an innovative two-layer architecture where the first layer generates a prediction of protein-binding, RNA-binding, DNA-binding, and small ligand-binding residues. The second layer re-predicts PBRs by reducing overlap between PBRs and the other types of binding residues produced in the first layer. Empirical tests on an independent test dataset reveal that SCRIBER significantly outperforms current predictors and that all three innovations contribute to its high predictive performance. SCRIBER reduces cross-predictions by between 41% and 69% and our conservative estimates show that it is at least 3 times faster. We provide putative PBRs produced by SCRIBER for the entire human proteome and use these results to hypothesize that about 14% of currently known human protein domains bind proteins.
Availability: SCRIBER webserver is available at http://biomine.cs.vcu.edu/servers/SCRIBER/.

General Computational Biology

Bayesian Metabolic Flux Analysis reveals intracellular flux couplings
COSI: General Computational Biology
Date: July 24

  • Markus Heinonen, Aalto University, Finland
  • Maria Osmala, Aalto University, Finland
  • Henrik Mannerström, Aalto University, Finland
  • Janne Wallenius, Institute for Molecular Medicine Finland, Finland
  • Samuel Kaski, Aalto University, Finland
  • Juho Rousu, Aalto University, Finland
  • Harri Lähdesmäki, Aalto University, Finland

Presentation Overview: Show

Motivation: Metabolic flux balance analysis is a standard tool in analysing metabolic reaction rates compatible with measurements, steady-state and the metabolic reaction network stoichiometry. Flux analysis methods commonly place unrealistic assumptions on fluxes due to the convenience of formulating the problem as a linear programming model, and most methods ignore the notable uncertainty in flux estimates.
Results: We introduce a novel paradigm of Bayesian metabolic flux analysis that models the reactions of the whole genome-scale cellular system in probabilistic terms, and can infer the full flux vector distribution of genome-scale metabolic systems based on exchange and intracellular (e.g. 13C) flux measurements, steady-state assumptions, and objective function assumptions. The Bayesian model couples all fluxes jointly together in a simple truncated multivariate posterior distribution, which reveals informative flux couplings. Our model is a plug-in replacement for conventional metabolic balance methods, such as flux balance analysis (FBA). Our experiments indicate that we can characterise the genome-scale flux covariances, reveal flux couplings, and determine more intracellular unobserved fluxes in C. acetobutylicum from 13C data than flux variability analysis.
Availability: The COBRA compatible software is available at github.com/markusheinonen/bamfa
Contact: markus.o.heinonen@aalto.fi

Controlling Large Boolean Networks with Single-Step Perturbations
COSI: General Computational Biology
Date: July 24

  • Alexis Baudin, École Normale Supérieure Paris-Saclay, France
  • Soumya Paul, University of Luxembourg, Luxembourg
  • Cui Su, University of Luxembourg, Luxembourg
  • Jun Pang, University of Luxembourg, Luxembourg

Presentation Overview: Show

The control of Boolean networks has traditionally focussed on strategies where the perturbations are applied to the nodes of the network for an extended period of time. In this work, we study if and how a Boolean network can be controlled by perturbing a minimal set of nodes for a single step and letting the system evolve afterwards according to its original dynamics. More precisely, given a Boolean network BN, we compute a minimal subset Cmin of the nodes such that BN can be driven from any initial state in an attractor to another `desired' attractor by perturbing some or all of the nodes of Cmin for a single step. This kind of control is attractive for biological systems because it is less time consuming than the traditional strategies for control while also being financially more viable. However, due to the phenomenon of state-space explosion, computing such a minimal subset is computationally inefficient, and an approach that deals with the entire network in one go does not scale well for large networks.
We develop a `divide-and-conquer' approach by decomposing the network into smaller partitions, computing the minimal control on the projection of the attractors to these partitions and then composing the results to obtain Cmin for the whole network. We implement our method and test it on various real-life biological networks to demonstrate its applicability and efficiency.
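
To make single-step control concrete, the following is a minimal brute-force sketch on a toy network. It is the naive "entire network in one go" search that the abstract says does not scale, not the divide-and-conquer method. It assumes synchronous updates and a single source state, both simplifications chosen for illustration; the three-node network and all function names are hypothetical.

```python
from itertools import combinations

def step(state, rules):
    """Apply every update rule once (synchronous update, an assumption)."""
    return tuple(int(rule(state)) for rule in rules)

def attractor_of(state, rules):
    """Follow the trajectory from `state` until it cycles; return the cycle."""
    seen = {}
    trajectory = []
    while state not in seen:
        seen[state] = len(trajectory)
        trajectory.append(state)
        state = step(state, rules)
    return frozenset(trajectory[seen[state]:])

def minimal_single_step_control(source, target_attractor, rules):
    """Smallest set of node indices to flip once so that the network,
    left to its own dynamics afterwards, reaches target_attractor."""
    n = len(source)
    for size in range(n + 1):
        for nodes in combinations(range(n), size):
            perturbed = tuple(
                1 - v if i in nodes else v for i, v in enumerate(source)
            )
            if attractor_of(perturbed, rules) == target_attractor:
                return nodes
    return None

# Hypothetical 3-node network: x0' = x0, x1' = x0, x2' = x1 AND x0.
RULES = [
    lambda s: s[0],
    lambda s: s[0],
    lambda s: s[1] and s[0],
]
OFF = (0, 0, 0)
ON = frozenset({(1, 1, 1)})
print(minimal_single_step_control(OFF, ON, RULES))
```

Flipping node 0 alone is enough here because the other two nodes follow it. The search enumerates up to 2^n subsets and simulates each one, which is why decomposition matters for large networks.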

MCS^2: Minimal coordinated supports for fast enumeration of minimal cut sets in metabolic networks
COSI: General Computational Biology
Date: July 24

  • Seyed Reza Miraskarshahi, Simon Fraser University, Canada
  • Hooman Zabeti, Simon Fraser University, Canada
  • Tamon Stephen, Simon Fraser University, Canada
  • Leonid Chindelevitch, Simon Fraser University, Canada

Presentation Overview: Show

Motivation: Constraint-based modeling of metabolic networks helps researchers gain insight into the metabolic processes of many organisms, both prokaryotic and eukaryotic. Minimal Cut Sets (MCSs) are minimal sets of reactions whose inhibition blocks a target reaction in a metabolic network. Most approaches for finding the MCSs in constraint-based models require, either as an intermediate step or as a byproduct of the calculation, the computation of the set of elementary flux modes (EFMs), a convex basis for the valid flux vectors in the network. Recently, Ballerstein et al. proposed a method for computing the MCSs of a network without first computing its EFMs, by creating a dual network whose EFMs are a superset of the MCSs of the original network. However, their dual network is always larger than the original network and depends on the target reaction. Here we propose the construction of a different dual network, which is typically smaller than the original network and is independent of the target reaction, for the same purpose. We prove the correctness of our approach, MCS2, and describe how it can be modified to compute the few smallest MCSs for a given target reaction.
Results: We compare MCS2 to the method of Ballerstein et al. and two other existing methods. We show that MCS2 succeeds in calculating the full set of MCSs in many models where other approaches cannot finish within a reasonable amount of time. Thus, in addition to its theoretical novelty, our approach provides a practical advantage over existing methods.
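
The definition of a minimal cut set can be illustrated with the classical EFM-based route that the abstract contrasts against. The following is a minimal sketch, not the MCS2 algorithm. It assumes the EFMs are already known and given as sets of reaction names, and it enumerates minimal sets of reactions that intersect every EFM containing the target. The reaction names and the function name are hypothetical.

```python
from itertools import combinations

def minimal_cut_sets(efms, target):
    """Brute-force minimal hitting sets of the EFMs that use `target`.

    efms: iterable of sets of reaction names (assumed precomputed).
    Returns the cut sets in order of increasing size.
    """
    target_modes = [set(m) for m in efms if target in m]
    if not target_modes:
        return []
    reactions = sorted(set().union(*target_modes))
    found = []
    for size in range(1, len(reactions) + 1):
        for cand in combinations(reactions, size):
            cand_set = set(cand)
            # Skip supersets of a smaller cut set: they are not minimal.
            if any(prev <= cand_set for prev in found):
                continue
            # A cut set must disable every mode that carries the target.
            if all(cand_set & mode for mode in target_modes):
                found.append(cand_set)
    return found

# Hypothetical toy network with three EFMs; "t" is the target reaction.
EFMS = [{"r1", "r2", "t"}, {"r1", "r3", "t"}, {"r4"}]
print(minimal_cut_sets(EFMS, "t"))
```

The sketch needs the full EFM list up front and tries exponentially many candidates, which is the cost that dual-network approaches aim to avoid.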

HitSeq: High-throughput Sequencing

Alignment-free Filtering for cfNA Fusion Fragments
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Xiao Yang, Grail, Inc, United States
  • Yasushi Saito, Grail, Inc, United States
  • Arjun Rao, Grail, Inc, United States
  • Hyunsung John Kim, Grail, Inc, United States
  • Pranav Singh, Grail, Inc, United States
  • Eric Scott, Grail, Inc, United States
  • Matthew Larson, Grail, Inc, United States
  • Wenying Pan, Grail, Inc, United States
  • Mohini Desai, Grail, Inc, United States
  • Earl Hubbell, Grail Bio, United States

Presentation Overview: Show

Motivation: Cell-free nucleic acid (cfNA) sequencing data require improvements to existing fusion detection methods along multiple axes: high depth of sequencing, low allele fractions, short fragment lengths, and specialized barcodes such as unique molecular identifiers.
Results: AF4 was developed to address these challenges. It uses a novel alignment-free k-mer-based method to detect candidate fusion fragments with high sensitivity and orders of magnitude faster than existing tools. Candidate fragments are then filtered using a max-cover criterion that significantly reduces spurious matches while retaining authentic fusion fragments. This efficient first stage reduces the data sufficiently that commonly used criteria can process the remaining information, or sophisticated filtering policies that may not scale to the raw reads can be used. AF4 provides both targeted and de novo fusion detection modes. We demonstrate both modes in benchmark simulated and real RNA-seq data as well as clinical and cell-line cfNA data.
Availability: AF4 is open sourced, licensed under Apache License 2.0, and is available at:
Contact: {xyang,ysaito}@grail.com
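
The general idea of alignment-free candidate detection can be sketched with a k-mer lookup table. The following is an illustrative simplification and not AF4's algorithm: it assumes a tiny set of hypothetical gene sequences, ignores reverse complements and barcodes, and omits the max-cover filter entirely. A fragment is flagged as a candidate when it shares enough gene-specific k-mers with at least two different genes.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

def build_index(genes, k):
    """Map each k-mer to the set of gene names that contain it."""
    index = {}
    for name, seq in genes.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index

def candidate_fusion(fragment, index, k, min_hits=2):
    """Return the genes supported by the fragment if there are two or more,
    otherwise an empty list. min_hits is an assumed threshold."""
    hits = Counter()
    for kmer in kmers(fragment, k):
        owners = index.get(kmer, ())
        if len(owners) == 1:  # ignore k-mers shared between genes
            hits[next(iter(owners))] += 1
    supported = sorted(g for g, c in hits.items() if c >= min_hits)
    return supported if len(supported) >= 2 else []

# Hypothetical toy sequences; real genes are far longer.
GENES = {"geneA": "ACGTACGGTCA", "geneB": "TTGACCATGCA"}
K = 4
INDEX = build_index(GENES, K)
FUSED = GENES["geneA"][:7] + GENES["geneB"][4:]
print(candidate_fusion(FUSED, INDEX, K))
print(candidate_fusion(GENES["geneA"][:8], INDEX, K))
```

No alignment is computed at any point, only hash lookups, which is what makes this style of first-pass filter cheap relative to read mapping.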

Building Large Updatable Colored de Bruijn Graphs via Merging
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Martin Muggli, Colorado State University, United States
  • Bahar Alipanahi, University of Florida, United States
  • Christina Boucher, University of Florida, United States

Presentation Overview: Show

Motivation: There exist several massive genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph have been developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets, which need both scalability and the ability to update the construction.
Results: We create a method for constructing and updating the colored de Bruijn graph on a very large dataset through partitioning the data into smaller subsets, building the colored de Bruijn graph using an FM-index based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this paper. We refer to the resulting method as VariMerge. We validate our approach, and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8,000 strains. Lastly, we compare VariMerge to other competing methods (Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method by Almodaresi, and Multi-BRWT) and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16,000 strains in a manner that allows additional samples to be added. Competing methods either did not scale to a dataset this large or cannot allow for additions without reconstruction.
Availability: VariMerge is at https://github.com/cosmo-team/cosmo/tree/VARI-merge under the GPLv3 license.

cloudSPAdes: Assembly of Synthetic Long Reads Using de Bruijn graphs
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Ivan Tolstoganov, Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St.Petersburg State University, Russia, Russia
  • Anton Bankevich, Dept. of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA, United States
  • Zhoutao Chen, Universal Sequencing Technology Corporation, Carlsbad, CA, USA, United States
  • Pavel Pevzner, Dept. of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA, United States

Presentation Overview: Show

The recently developed barcoding-based Synthetic Long Read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly. We describe the algorithmic challenge of the SLR assembly and present a cloudSPAdes algorithm for SLR assembly that is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed. The project was supported by the Russian Science Foundation (grant 19-14-00172).

Fully-sensitive Seed Finding in Sequence Graphs Using a Hybrid Index
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Ali Ghaffaari, Max-Planck Institut für Informatik, Germany
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Presentation Overview: Show

Motivation: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods.
Results: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and on a whole human genome graph constructed from variants in the 1000 Genomes Project data set. On this graph, PSI outperforms GCSA2 in terms of index size, query time, and sensitivity.

hicGAN infers super resolution Hi-C data with generative adversarial networks
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Qiao Liu, Tsinghua University, China
  • Hairong Lv, Tsinghua University, China
  • Rui Jiang, Tsinghua University, China

Presentation Overview: Show

Motivation: Hi-C is a genome-wide technology for investigating 3D chromatin conformation by measuring physical contacts between pairs of genomic regions. The resolution of Hi-C data directly impacts the effectiveness and accuracy of downstream analysis such as identifying topologically associating domains (TADs) and meaningful chromatin loops. High resolution Hi-C data are valuable resources which implicate the relationship between 3D genome conformation and function, especially linking distal regulatory elements to their target genes. However, high resolution Hi-C data across various tissues and cell types are not always available due to the high sequencing cost. It is therefore indispensable to develop computational approaches for enhancing the resolution of Hi-C data.
Results: We propose hicGAN, an open-source framework for inferring high resolution Hi-C data from low resolution Hi-C data with generative adversarial networks (GANs). To the best of our knowledge, this is the first study to apply GANs to 3D genome analysis. We demonstrate that hicGAN effectively enhances the resolution of low resolution Hi-C data by generating matrices that are highly consistent with the original high resolution Hi-C matrices. A typical usage scenario for our approach is to enhance low resolution Hi-C data in new cell types, especially where the high resolution Hi-C data are not available. Our study not only presents a novel approach for enhancing Hi-C data resolution, but also provides insights into the complex mechanisms underlying the formation of chromatin contacts.

Integrating read-based and population-based phasing for dense and accurate haplotyping of individual genomes
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Vikas Bansal, University of California San Diego, United States

Presentation Overview: Show

Motivation: Reconstruction of haplotypes for human genomes is an important problem in medical and population genetics. Hi-C sequencing generates read pairs with long-range haplotype information that can be computationally assembled to generate chromosome-spanning haplotypes. However, the haplotypes have limited completeness and low accuracy. Haplotype information from population reference panels can potentially be used to improve the completeness and accuracy of Hi-C haplotyping.

Results: In this paper, we describe a likelihood based method to integrate short-range haplotype information from a population reference panel of haplotypes with the long-range haplotype information present in sequence reads from methods such as Hi-C to assemble dense and highly accurate haplotypes for individual genomes. Our method leverages a statistical phasing method and a maximum spanning tree algorithm to determine the optimal second-order approximation of the population-based haplotype likelihood for an individual genome. The population-based likelihood is encoded using pseudo-reads which are then used as input along with sequence reads for haplotype assembly using an existing tool, HapCUT2. Using whole-genome Hi-C data for two human genomes (NA19240 and NA12878), we demonstrate that this integrated phasing method enables the phasing of 97-98% of variants, reduces the switch error rates by 3-6 fold, and outperforms an existing method for combining phase information from sequence reads with population-based phasing. On Strand-seq data for NA12878, our method improves the haplotype completeness from 71.4% to 94.6% and reduces the switch error rate 2-fold, demonstrating its utility for phasing using multiple sequencing technologies.

Availability and Implementation: Code and datasets are available at github.com/vibansal/IntegratedPhasing

Contact: vibansal@ucsd.edu

Locality sensitive hashing for the edit distance
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Guillaume Marçais, Carnegie Mellon University, United States
  • Dan DeBlasio, Carnegie Mellon University, United States
  • Prashant Pandey, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview: Show

Motivation: Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality Sensitive Hashing (LSH) is one method used to estimate the likelihood of two sequences to have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have an alignment from those that may have an alignment. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e., omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. Moreover, because a practical LSH method for edit distance has been lacking, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy in practice.

Results: We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is not only sensitive to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.
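
The difference between a bag-of-k-mers sketch and an order-aware one can be shown in a few lines. The following is a minimal sketch in the spirit of the description above, not the authors' implementation: for each of m seeded hash functions it picks the l k-mer occurrences with the smallest hash values and then records them in the order they appear in the sequence. The seeded hash construction, the function names, and the parameter values are assumptions made for illustration.

```python
import hashlib

def _hash(seed, kmer, occurrence):
    """Seeded 64-bit hash of a k-mer occurrence (assumed construction)."""
    data = f"{seed}:{kmer}:{occurrence}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def omh_sketch(seq, k, l, m):
    """Return m tuples, each holding l k-mers in sequence order."""
    counts = {}
    items = []  # (position, kmer, occurrence number of that kmer)
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        occ = counts.get(kmer, 0)
        counts[kmer] = occ + 1
        items.append((pos, kmer, occ))
    sketch = []
    for seed in range(m):
        # Pick the l occurrences with the smallest hash values ...
        chosen = sorted(items, key=lambda it: _hash(seed, it[1], it[2]))[:l]
        # ... then keep them in the order they occur in the sequence.
        chosen.sort(key=lambda it: it[0])
        sketch.append(tuple(kmer for _, kmer, _ in chosen))
    return sketch

def omh_similarity(a, b):
    """Fraction of sketch positions where the two sketches agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

S1 = omh_sketch("ACGTTGCAACGT", k=3, l=2, m=8)
S2 = omh_sketch("ACGTTGCAACGT", k=3, l=2, m=8)
print(omh_similarity(S1, S2))
```

Because each tuple preserves sequence order, two sequences with the same k-mer content but a different arrangement can produce different tuples, whereas a plain minHash sketch would not distinguish them.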

Minnow: A principled framework for rapid simulation of dscRNA-seq data at the read level
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Hirak Sarkar, Stony Brook University, United States
  • Avi Srivastava, Stony Brook university, United States
  • Robert Patro, Stony Brook University, United States

Presentation Overview: Show

With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, and RNA velocity in single cells. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or UMI deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-seq (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as PCR amplification, CB (cellular barcodes) and UMI (Unique Molecule Identifiers) selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification, and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment.

TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain
COSI: HitSeq: High-throughput Sequencing
Date: July 22 and July 23

  • Yan Gao, Harbin Institute of Technology, China
  • Bo Liu, Harbin Institute of Technology, China
  • Yadong Wang, Harbin Institute of Technology, China
  • Yi Xing, Children’s Hospital of Philadelphia, United States

Presentation Overview: Show

Motivation: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing technologies can produce long-reads up to tens of kilobases, but with high error rates. In order to reduce sequencing error, Rolling Circle Amplification (RCA) has been used to improve library preparation by amplifying circularized template molecules. Linear products of the RCA contain multiple tandem copies of the template molecule. By integrating additional in silico processing steps, these tandem sequences can be collapsed into a consensus sequence with a higher accuracy than the original raw reads. Existing pipelines using alignment-based methods to discover the tandem repeat patterns from the long-reads are either inefficient or lack sensitivity.
Results: We present a novel tandem repeat detection and consensus calling tool, TideHunter, to efficiently discover tandem repeat patterns and generate high-quality consensus sequences from amplified tandemly repeated long-read sequencing data. TideHunter works with noisy long-reads (PacBio and ONT) at error rates of up to 20% and does not have any limitation of the maximal repeat pattern size. We benchmarked TideHunter using simulated and real datasets with varying error rates and repeat pattern sizes. TideHunter is tens of times faster than state-of-the-art methods and has a higher sensitivity and accuracy.
Availability and Implementation: TideHunter is written in C, is open source, and is available at https://github.com/yangao07/TideHunter
Contact: bo.liu@hit.edu.cn; ydwang@hit.edu.cn; XINGYI@email.chop.edu
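
To illustrate the seed idea behind tandem repeat detection, the following is a minimal sketch and not TideHunter's actual seed-and-chain algorithm: it indexes k-mers in a read and takes the most frequent distance between consecutive occurrences of the same k-mer as a guess for the repeat unit length. The function name `estimate_period` and the choice `k=5` are assumptions made for this example, and the toy read is error-free, unlike real PacBio or ONT data.

```python
from collections import Counter, defaultdict


def estimate_period(read, k=5):
    """Guess the tandem repeat unit length from distances between repeated k-mers."""
    positions = defaultdict(list)
    for i in range(len(read) - k + 1):
        positions[read[i:i + k]].append(i)

    distances = Counter()
    for hits in positions.values():
        # Distance between consecutive occurrences of the same k-mer (seed).
        for left, right in zip(hits, hits[1:]):
            distances[right - left] += 1

    if not distances:
        return None  # no k-mer occurs twice, so no repeat evidence
    return distances.most_common(1)[0][0]


unit = "ACGTTGCAAG"          # hypothetical 10-nt template
read = unit * 6              # error-free concatemer of six copies
print(estimate_period(read))  # 10
```

On noisy reads many seeds are destroyed by errors, which is presumably why a chaining step over the surviving seeds is needed in practice; that step is not shown here.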


Large Scale Microbiome Profiling in the Cloud
Date: July 23

  • Camilo Valdes, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States
  • Vitalii Stebliankin, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States
  • Giri Narasimhan, Bioinformatics Research Group (BioRG), School of Computing and Information Sciences, Florida International University, United States

Presentation Overview: Show

Bacterial metagenomics profiling for whole metagenome sequencing (WGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient, and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources.

We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43,552 bacterial genomes from Ensembl. Flint runs on Amazon's Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40K genomes on 64 machines in 67 seconds — an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments.

Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint. Supplementary materials and data are available at http://biorg.cs.fiu.edu.
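
As a quick sanity check on the figures quoted above, the short calculation below converts the single-batch result (1 million reads in 67 seconds) to an hourly rate and derives a cost per million reads from the stated sustained rate and hourly cluster cost. The variable names are chosen for this example only; the numbers are taken directly from the abstract.

```python
reads_in_batch = 1_000_000
batch_seconds = 67
hourly_cost_usd = 8.00
sustained_reads_per_hour = 55_000_000

# Rate implied by the single 1M-read batch, extrapolated to one hour.
batch_rate_per_hour = reads_in_batch / batch_seconds * 3600

# Cost of mapping one million reads at the sustained streaming rate.
cost_per_million_reads = hourly_cost_usd / (sustained_reads_per_hour / 1_000_000)

print(round(batch_rate_per_hour / 1e6, 1))  # 53.7 (million reads per hour)
print(round(cost_per_million_reads, 3))     # 0.145 (USD per million reads)
```

The extrapolated batch rate of roughly 54 million reads per hour is consistent with the reported sustained rate of 55 million reads per hour.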

Learning a Mixture of Microbial Networks Using Minorization-Maximization
Date: July 23

  • Sahar Tavakoli, University of Central Florida, United States
  • Shibu Yooseph, University of Central Florida, United States

Presentation Overview: Show

The interactions among the constituent members of a microbial community play a major role in determining the overall behavior of the community and the abundance levels of its members. These interactions can be modeled using a network whose nodes represent microbial taxa and edges represent pairwise interactions. A microbial network is typically constructed from a sample-taxa count matrix that is obtained by sequencing multiple biological samples and identifying taxa counts. From large-scale microbiome studies, it is evident that microbial community compositions and interactions are impacted by environmental and/or host factors. Thus, it is not unreasonable to expect that a sample-taxa matrix generated as part of a large study involving multiple environmental or clinical parameters can be associated with more than one microbial network. However, to our knowledge, microbial network inference methods proposed thus far assume that the sample-taxa matrix is associated with a single network.
We present a mixture model framework to address the scenario when the sample-taxa matrix is associated with K microbial networks. This count matrix is modeled using a mixture of K Multivariate Poisson Log-Normal distributions and parameters are estimated using a maximum likelihood framework. Our parameter estimation algorithm is based on the Minorization-Maximization principle combined with gradient ascent and block updates. Synthetic datasets were generated to assess the performance of our approach on absolute count data, compositional data, and normalized data. We also addressed the recovery of sparse networks based on an l1-penalty model.
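
For orientation, a mixture of K Multivariate Poisson Log-Normal distributions can be written as the hierarchical model below. This is a sketch of the standard form and not necessarily the exact parameterization used by the authors; the symbols (mixing weights \pi_k, latent vector z_i, component label c_i, precision matrix \Omega_k, penalty strength \lambda) are notation assumed for this example. Reading each network from the sparsity pattern of \Omega_k is a common convention that is also assumed here.

```latex
\begin{align*}
c_i &\sim \mathrm{Categorical}(\pi_1, \dots, \pi_K), \\
z_i \mid c_i = k &\sim \mathcal{N}\!\left(\mu_k, \Omega_k^{-1}\right), \\
x_{ij} \mid z_i &\sim \mathrm{Poisson}\!\left(e^{z_{ij}}\right), \qquad j = 1, \dots, p, \\[4pt]
\ell(\pi, \mu, \Omega) &= \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, p\!\left(x_i \mid \mu_k, \Omega_k\right)
  \;-\; \lambda \sum_{k=1}^{K} \lVert \Omega_k \rVert_1 .
\end{align*}
```

Here x_{ij} is the count of taxon j in sample i, and the last line is the l1-penalized log-likelihood that a sparse-network variant would maximize.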

TADA: Phylogenetic augmentation of microbiome samples enhances phenotype classification
Date: July 23

  • Erfan Sayyari, University of California San Diego, United States
  • Siavash Mirarab, University of California San Diego, United States
  • Ban Kawas, IBM, United States

Presentation Overview: Show

Motivation: Learning associations of traits with the microbial composition of a set of samples is a fundamental goal in microbiome studies. Recently, machine learning methods have been explored for this goal, with some promise. However, in comparison to other fields, microbiome data is high-dimensional and not abundant, leading to a high-dimensional, low-sample-size, under-determined system. Moreover, microbiome data is often unbalanced and biased. Given such training data, machine learning methods often fail to perform a classification task with sufficient accuracy. Lack of signal is especially problematic when classes are represented in an unbalanced way in the training data, with some classes under-represented. The presence of inter-correlations among subsets of observations further compounds these issues. As a result, machine learning methods have had only limited success in predicting many traits from the microbiome. Data augmentation consists of building synthetic samples and adding them to the training data and is a technique that has proved helpful for many machine learning tasks.
Results: In this paper, we propose a new data augmentation technique for classifying phenotypes based on the microbiome. Our algorithm, called TADA, uses available data and a statistical generative model to create new samples augmenting existing ones, addressing issues of low-sample-size. In generating new samples, TADA takes into account phylogenetic relationships between microbial species. On two real datasets, we show that adding these synthetic samples to the training set improves the accuracy of downstream classification, especially when the training data have an unbalanced representation of classes.

MLCSB: Machine Learning in Computational and Systems Biology

Adversarial domain adaptation for cross data source macromolecule in situ structural classification in cellular electron cryo-tomograms
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Ruogu Lin, Carnegie Mellon University, United States
  • Xiangrui Zeng, Carnegie Mellon University, United States
  • Kris M. Kitani, Carnegie Mellon University, United States
  • Min Xu, Carnegie Mellon University, United States

Presentation Overview: Show

Motivation: Since 2017, an increasing amount of attention has been paid to supervised deep learning based macromolecule in situ structural classification (i.e. subtomogram classification) in Cellular Electron Cryo-Tomography (CECT) due to the substantially higher scalability of deep learning. However, the success of such a supervised approach relies heavily on the availability of large amounts of labeled training data. For CECT, creating valid training data from the same data source as prediction data is usually laborious and computationally intensive. It would be beneficial to have training data from a separate data source where the annotation is readily available or can be performed in a high-throughput fashion. However, cross data source prediction is often biased due to the different image intensity distributions (a.k.a. domain shift).

Results: We adapt a deep learning based adversarial domain adaptation method (3D-ADA) to address the domain shift problem in CECT data analysis. 3D-ADA first uses a source domain feature extractor to extract discriminative features from the training data as the input to a classifier. Then it adversarially trains a target domain feature extractor to reduce the distribution differences of the extracted features between training and prediction data. As a result, the same classifier can be directly applied to the prediction data. We tested 3D-ADA on both experimental and realistically simulated subtomogram datasets under different imaging conditions. 3D-ADA consistently improved cross data source prediction and outperformed two popular domain adaptation methods. Furthermore, we demonstrate that 3D-ADA can improve cross data source recovery of novel macromolecular structures.

Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Héctor Climente-González, Institut Curie, France
  • Chloé-Agathe Azencott, MINES ParisTech, France
  • Samuel Kaski, Aalto University, Finland
  • Makoto Yamada, Kyoto University, Japan

Presentation Overview: Show

Motivation: Finding nonlinear relationships between biomolecules and a biological outcome is computationally expensive and statistically challenging. Existing methods have important drawbacks, including among others lack of parsimony, non-convexity, and computational overhead. Here we propose block HSIC Lasso, a nonlinear feature selector that does not present the previous drawbacks.
Results: We compare block HSIC Lasso to other state-of-the-art feature selection techniques in both synthetic and real data, including experiments over three common types of genomic data: gene-expression microarrays, single-cell RNA sequencing, and genome-wide association studies. In all cases, we observe that features selected by block HSIC Lasso retain more information about the underlying biology than those selected by other techniques. As a proof of concept, we applied block HSIC Lasso to a single-cell RNA sequencing experiment on mouse hippocampus. We discovered that many genes linked in the past to brain development and function are involved in the biological differences between the types of neurons.
Availability: Block HSIC Lasso is implemented in the Python 2/3 package pyHSICLasso, available on PyPI. Source code is available on GitHub (https://github.com/riken-aip/pyHSICLasso).
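
To make the dependence measure underlying HSIC Lasso concrete, here is a minimal pure-Python sketch of the biased empirical Hilbert-Schmidt Independence Criterion with Gaussian kernels. It does not use the pyHSICLasso API and does not implement the block approximation or the Lasso selection step; the function names, the kernel bandwidth `sigma=1.0`, and the toy data are assumptions made for illustration.

```python
import math


def gaussian_gram(values, sigma=1.0):
    """Gram matrix K[i][j] = exp(-(v_i - v_j)^2 / (2 sigma^2)) for scalar samples."""
    return [[math.exp(-((a - b) ** 2) / (2 * sigma ** 2)) for b in values]
            for a in values]


def center(gram):
    """Double-center a Gram matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(gram)
    row_means = [sum(row) / n for row in gram]
    col_means = [sum(gram[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[gram[i][j] - row_means[i] - col_means[j] + total_mean
             for j in range(n)] for i in range(n)]


def hsic(x, y, sigma=1.0):
    """Biased empirical HSIC: trace(Kc Lc) / n^2 with centered Gram matrices."""
    n = len(x)
    kc = center(gaussian_gram(x, sigma))
    lc = center(gaussian_gram(y, sigma))
    return sum(kc[i][j] * lc[j][i] for i in range(n) for j in range(n)) / n ** 2


feature = [-2.0, -1.0, 0.0, 1.0, 2.0]
outcome_nonlinear = [v * v for v in feature]  # depends on the feature, but not linearly
outcome_constant = [1.0] * len(feature)       # carries no information

print(hsic(feature, outcome_nonlinear) > hsic(feature, outcome_constant))  # True
```

A feature selector built on this measure would favor features with high HSIC against the outcome and low HSIC against each other, which is where the Lasso formulation comes in.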

Collaborative Intra-Tumor Heterogeneity Detection
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Sahand Khakabimamaghani, Simon Fraser University, Canada
  • Salem Malikic, Simon Fraser University, Canada
  • Jeffrey Tang, Simon Fraser University, Canada
  • Dujian Ding, Simon Fraser University, Canada
  • Ryan Morin, Simon Fraser University, Canada
  • Leonid Chindelevitch, Simon Fraser University, Canada
  • Martin Ester, Simon Fraser University, Canada

Presentation Overview: Show

Motivation: Despite the remarkable advances in sequencing and computational techniques, noise in the data and complexity of the underlying biological mechanisms render deconvolution of the phylogenetic relationships between cancer mutations difficult. To overcome these limitations, new methods are required for integrating and harnessing the full potential of the existing data.

Results: We introduce a method called Hintra for intra-tumor heterogeneity detection. Hintra integrates sequencing data for a cohort of tumors and infers tumor phylogeny for each individual based on the evolutionary information shared between different tumors. Through an iterative process, Hintra learns the repeating evolutionary patterns and uses this information for resolving the phylogenetic ambiguities of individual tumors. The results of synthetic experiments show an improved performance compared to two state-of-the-art methods. The experimental results with a recent Breast Cancer dataset are consistent with the existing knowledge and provide potentially interesting findings.

DeepLigand: accurate prediction of MHC class I ligands using peptide embedding
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Haoyang Zeng, Massachusetts Institute of Technology, United States
  • David Gifford, Massachusetts Institute of Technology, United States
  • Brandon Carter, Massachusetts Institute of Technology,

Presentation Overview: Show

The computational modeling of peptide display by class I major histocompatibility complexes (MHCs) is essential for peptide-based therapeutics design. Existing computational methods for peptide display focus on modeling the peptide-MHC binding affinity. However, such models are not able to characterize the sequence features for the other cellular processes in the peptide display pathway that determine MHC ligand selection. We introduce a semi-supervised model, DeepLigand, that outperforms the state-of-the-art models in MHC class I ligand prediction. DeepLigand combines a peptide language model and peptide binding affinity prediction to score MHC class I peptide presentation. The peptide language model characterizes sequence features that correspond to secondary factors in MHC ligand selection other than binding affinity. The peptide embedding is learned by pre-training on natural ligands, and discriminates between ligands and non-ligands in the absence of binding affinity prediction. While conventional affinity-based models fail to classify peptides with moderate affinities, DeepLigand discriminates ligands from non-ligands with consistently high accuracy.

Identifying progressive imaging genetic patterns via multi-task sparse canonical correlation analysis: a longitudinal study of the ADNI cohort
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Lei Du, Northwestern Polytechnical University, China
  • Kefei Liu, University of Pennsylvania, United States
  • Lei Zhu, Xi'an University of Technology, China
  • Xiaohui Yao, University of Pennsylvania, United States
  • Shannon Leigh Risacher, Indiana University School of Medicine, United States
  • Andrew Saykin, Indiana University School of Medicine, United States
  • Lei Guo, Northwestern Polytechnical University, China
  • Li Shen, University of Pennsylvania, United States

Presentation Overview: Show

Identifying the genetic basis of brain structure, function, and disorder by using imaging quantitative traits (QTs) as endophenotypes is an important task in brain science. Brain QTs often change over time as the disorder progresses, so understanding how genetic factors affect progressive brain QT changes is of great importance. Most existing imaging genetics methods analyze only the baseline neuroimaging data, and thus omit longitudinal imaging data across multiple time points that contain important disease progression information. We propose a novel temporal imaging genetic model that performs multi-task sparse canonical correlation analysis (T-MTSCCA). Our model uses longitudinal neuroimaging data to uncover how SNPs affect brain QTs over time. By incorporating the relationships among the longitudinal imaging data and those among SNPs, T-MTSCCA can identify a trajectory of progressive imaging genetic patterns over time. We propose an efficient algorithm to solve the problem and show its convergence. We evaluate T-MTSCCA on 408 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database with longitudinal magnetic resonance imaging (MRI) data and genetic data available. The experimental results show that T-MTSCCA performs better than or on par with the state-of-the-art methods. In particular, T-MTSCCA identifies higher canonical correlation coefficients and captures clearer canonical weight patterns. This suggests that T-MTSCCA identifies time-consistent and time-dependent SNPs and imaging QTs, which further helps in understanding the genetic basis of brain QT changes during disease progression.

Inheritance and variability of kinetic gene expression parameters in microbial cells: Modelling and inference from lineage tree data
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Aline Marguet, Univ. Grenoble Alpes, Inria, 38000 Grenoble, France, France
  • Marc Lavielle, Inria Saclay & Ecole Polytechnique, Palaiseau, France, France
  • Eugenio Cinquemani, Univ. Grenoble Alpes, Inria, 38000 Grenoble, France, France

Presentation Overview: Show

Motivation: Modern experimental technologies enable monitoring of gene expression dynamics in individual cells and quantification of its variability in isogenic microbial populations. Among the sources of this variability is the randomness that affects inheritance of gene expression factors at cell division. Known parental relationships among individually observed cells provide invaluable information for the characterization of this extrinsic source of gene expression noise. Despite this fact, most existing methods to infer stochastic gene expression models from single-cell data dedicate little attention to the reconstruction of mother-daughter inheritance dynamics.
Results: Starting from a transcription and translation model of gene expression, we propose a stochastic model for the evolution of gene expression dynamics in a population of dividing cells. Based on this model, we develop a method for the direct quantification of inheritance and variability of kinetic gene expression parameters from single-cell gene expression and lineage data. We demonstrate that our approach provides unbiased estimates of mother-daughter inheritance parameters, whereas indirect approaches using lineage information only in the post-processing of individual cell parameters underestimate inheritance. Finally, we show on yeast osmotic shock response data that daughter cell parameters are largely determined by the mother, thus confirming the relevance of our method for the correct assessment of the onset of gene expression variability and the study of the transmission of regulatory factors.

Model-Based Optimization of Subgroup Weights for Survival Analysis
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Jakob Richter, TU Dortmund, Germany
  • Katrin Madjar, TU Dortmund University, Germany
  • Jörg Rahnenführer, Technische Universität Dortmund, Germany

Presentation Overview: Show

Motivation: To obtain a reliable prediction model for a specific cancer subgroup or cohort is often difficult due to limited sample size and, in survival analysis, due to potentially high censoring rates.
Sometimes similar data from other patient subgroups are available, e.g., from other clinical centers.
Simple pooling of all subgroups can decrease the variance of the predicted parameters of the prediction models, but also increase the bias due to heterogeneity between the cohorts.
A promising compromise is to identify those subgroups with a similar relationship between covariates and target variable and then include only these for model building.
Results: We propose a subgroup-based weighted likelihood approach for survival prediction with high-dimensional genetic covariates.
When predicting survival for a specific subgroup, for every other subgroup an individual weight determines the strength with which its observations enter into model building.
MBO (model-based optimization) can be used to quickly find a good prediction model in the presence of a large number of hyperparameters.
We use MBO to identify the best model for survival prediction of a specific subgroup by optimizing the weights for additional subgroups for a Cox model.
The approach is evaluated on a set of lung cancer cohorts with gene expression measurements.
The resulting models have competitive prediction quality, and they reflect the similarity of the corresponding cancer subgroups, with both weights close to 0, weights close to 1, and medium weights.
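
The idea of letting each observation enter the likelihood with a subgroup-specific weight can be sketched as a weighted Cox partial log-likelihood. The code below is an illustration under stated assumptions, not the authors' implementation: the exact likelihood and any penalty term are not given in the abstract, ties are not corrected for, and `scores` stands for the linear predictor of each observation. In such a setup every observation would carry the weight of its subgroup, with the target subgroup fixed at 1 and the remaining weights tuned by MBO.

```python
import math


def weighted_cox_loglik(times, events, scores, weights):
    """Weighted Cox partial log-likelihood without tie correction.

    times   : observed survival or censoring times
    events  : 1 if the event was observed, 0 if censored
    scores  : linear predictors (covariates times coefficients)
    weights : per-observation weights in [0, 1], one value per subgroup
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i or weights[i] == 0:
            continue  # censored or zero-weight observations add no event term
        # Weighted risk set: everyone still under observation at time t_i.
        risk = sum(w * math.exp(s)
                   for t, s, w in zip(times, scores, weights) if t >= t_i)
        total += weights[i] * (scores[i] - math.log(risk))
    return total


times = [1.0, 2.0, 3.0]
events = [1, 1, 1]
scores = [0.0, 0.0, 0.0]

pooled = weighted_cox_loglik(times, events, scores, [1.0, 1.0, 1.0])
excluded = weighted_cox_loglik(times, events, scores, [1.0, 1.0, 0.0])
print(pooled, excluded)
```

A weight of 0 reproduces the likelihood obtained by dropping that observation, and a weight of 1 reproduces simple pooling, so intermediate weights interpolate between the two extremes described above.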

Modeling Clinical and Molecular Covariates of Mutational Process Activity in Cancer
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Welles Robinson, University of Maryland, United States
  • Roded Sharan, School of computer science, Tel Aviv university, Israel
  • Mark Leiserson, University of Maryland, United States

Presentation Overview: Show

Motivation: Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures.
Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis.
Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions.
However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret.
Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g., smoking history) or molecular features (e.g., inactivations to DNA damage repair genes).
Results: To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures.
To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor's observed covariates.
We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations.
On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures.
We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors.
We also discover four signatures in a combined melanoma and lung cancer cohort -- using cancer type as a covariate -- and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas.

Availability: TCSM is implemented in Python 3 and available at https://github.com/lrgr/tcsm, along with a data workflow for reproducing the experiments in the paper.

MOLI: Multi-Omics Late Integration with deep neural networks for drug response prediction
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Hossein Sharifi-Noghabi, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, Canada
  • Olga Zolotareva, Faculty of Technology and Center for Biotechnology, Bielefeld University, Germany, Germany
  • Colin Collins, Vancouver Prostate Centre, Vancouver, BC, Canada, Canada
  • Martin Ester, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada, Canada

Presentation Overview: Show

Motivation: Historically, gene expression has been shown to be the most informative data for drug response prediction. Recent evidence suggests that integrating additional omics can improve the prediction accuracy, which raises the question of how to integrate the additional omics. Regardless of the integration strategy, clinical utility and translatability are crucial. Thus, we reasoned that a multi-omics approach combined with clinical datasets would improve drug response prediction and clinical relevance.
Results: We propose MOLI, a Multi-Omics Late Integration method based on deep neural networks. MOLI takes somatic mutation, copy number aberration, and gene expression data as input, and integrates them for drug response prediction. MOLI uses type-specific encoding subnetworks to learn features for each omics type, concatenates them into one representation and optimizes this representation via a combined cost function consisting of a triplet loss and a binary cross-entropy loss. The former makes the representations of responder samples more similar to each other and different from the non-responders, and the latter makes this representation predictive of the response values. We validate MOLI on in vitro and in vivo datasets for five chemotherapy agents and two targeted therapeutics. Compared to state-of-the-art single-omics and early integration multi-omics methods, MOLI achieves higher prediction accuracy in external validations. Moreover, a significant improvement in MOLI’s performance is observed for targeted drugs when training on a pan-drug input, i.e. using all the drugs with the same target compared to training only on drug-specific inputs. MOLI's high predictive power suggests it may have utility in precision oncology.
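
The combined cost described above can be sketched for a single triplet of sample representations as follows. This is a minimal illustration and not MOLI's code: the Euclidean distance, the `margin`, and the trade-off weight `gamma` are assumptions for the example, and in practice the representations would be the concatenated outputs of the omics-specific encoders rather than hand-written vectors.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the same-class pair together, push the other-class sample away."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def binary_cross_entropy(label, prob, eps=1e-12):
    """Classification loss for a predicted response probability."""
    prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def combined_cost(anchor, positive, negative, label, prob, gamma=0.5, margin=1.0):
    """Binary cross-entropy plus a gamma-weighted triplet term (gamma is assumed)."""
    return (binary_cross_entropy(label, prob)
            + gamma * triplet_loss(anchor, positive, negative, margin))


responder_a = [0.0, 0.0]
responder_b = [0.0, 1.0]
non_responder = [3.0, 4.0]

# Well-separated classes: the triplet term vanishes.
print(triplet_loss(responder_a, responder_b, non_responder))  # 0.0
# Swapped roles: the triplet term penalizes the poor separation.
print(triplet_loss(responder_a, non_responder, responder_b))  # 5.0
```

The triplet term shapes the geometry of the representation, while the cross-entropy term ties that representation to the response label.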

PRISM: Methylation Pattern-based, Reference-free Inference of Subclonal Makeup
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Dohoon Lee, Seoul National University, South Korea
  • Sangseon Lee, Seoul National University, South Korea
  • Sun Kim, Seoul National University, South Korea

Presentation Overview: Show

Motivation: Characterizing cancer subclones is crucial for the ultimate conquest of cancer. Thus, a number of bioinformatic tools have been developed to infer heterogeneous tumor populations based on genomic signatures such as mutations and copy number variations. Despite accumulating evidence for the significance of global DNA methylation reprogramming in certain cancer types including myeloid malignancies, none of the bioinformatic tools are designed to exploit subclonally reprogrammed methylation patterns to reveal constituent populations of a tumor. In accordance with the notion of global methylation reprogramming, our preliminary observations on acute myeloid leukemia (AML) samples implied the existence of subclonally-occurring focal methylation aberrance throughout the genome.
Results: We present PRISM, a tool for inferring the composition of epigenetically distinct subclones of a tumor solely from methylation patterns obtained by reduced representation bisulfite sequencing (RRBS). PRISM adopts DNA methyltransferase 1 (DNMT1)-like hidden Markov model-based in silico proofreading for the correction of erroneous methylation patterns. With error-corrected methylation patterns, PRISM focuses on short individual genomic regions harboring dichotomous patterns that can be split into fully methylated and unmethylated patterns. The frequencies of these two patterns form a sufficient statistic for subclonal abundance. A set of statistics collected from each genomic region is modeled with a beta-binomial mixture. Fitting the mixture with the expectation-maximization algorithm finally provides the inferred composition of subclones. Applying PRISM to two acute myeloid leukemia samples, we demonstrate that PRISM could infer the evolutionary history of malignant samples from an epigenetic point of view.
Availability: PRISM is freely available on GitHub (https://github.com/dohlee/prism).
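
The beta-binomial mixture step can be illustrated with the probability mass function and the E-step of expectation-maximization for a single genomic region. This is a sketch, not code from the PRISM repository: the function names, the two hypothetical components, and the reading of `k` as the number of fully methylated patterns among `n` dichotomous patterns are assumptions for the example.

```python
import math


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, a, b):
    """P(k successes out of n) when the success rate is Beta(a, b) distributed."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))


def responsibilities(k, n, components):
    """E-step for one region; components is a list of (weight, a, b) tuples."""
    joint = [w * beta_binomial_pmf(k, n, a, b) for w, a, b in components]
    total = sum(joint)
    return [value / total for value in joint]


# Two hypothetical clusters with mean methylated fractions of about 0.2 and 0.8.
components = [(0.5, 2.0, 8.0), (0.5, 8.0, 2.0)]

# A region with 2 fully methylated patterns out of 10 is assigned mostly to the first.
print(responsibilities(2, 10, components))
```

The M-step, which re-estimates the weights and the beta parameters from these responsibilities, is omitted to keep the example short.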

Rotation equivariant and invariant neural networks for microscopy image analysis
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Benjamin Chidester, Carnegie Mellon University, United States
  • Tianming Zhou, Carnegie Mellon University, United States
  • Minh N. Do, University of Illinois at Urbana-Champaign, United States
  • Jian Ma, Carnegie Mellon University, United States

Presentation Overview: Show

Neural networks have been widely used to analyze high-throughput microscopy images. However, the performance of neural networks can be significantly improved by encoding known invariance for particular tasks. Highly relevant to the goal of automated cell phenotyping from microscopy image data is rotation invariance. Here we consider the application of two schemes for encoding rotation equivariance and invariance in a convolutional neural network, namely, the group-equivariant CNN (G-CNN), and a new architecture with simple, efficient conic convolution, for classifying microscopy images. We additionally integrate the 2D-discrete-Fourier transform (2D-DFT) as an effective means for encoding global rotational invariance. We call our new method the Conic Convolution and DFT Network (CFNet). We evaluated the efficacy of CFNet and G-CNN as compared to a standard CNN for several different image classification tasks, including simulated and real microscopy images of subcellular protein localization, and demonstrated improved performance. We believe CFNet has the potential to improve many high-throughput microscopy image analysis applications. Source code of CFNet is available at: https://github.com/bchidest/CFNet.
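
The reason a discrete Fourier transform can encode rotation invariance is that DFT magnitudes do not change under a cyclic shift of the input. The sketch below shows this in one dimension with the standard library only. It assumes, for illustration, that rotating the image by one angular step cyclically shifts a vector of rotation-equivariant responses; CFNet itself applies a 2D-DFT to feature maps, which is not reproduced here, and the response values are made up.

```python
import cmath


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * f * t / n)
                    for t, x in enumerate(signal)))
            for f in range(n)]


def cyclic_shift(signal, steps):
    """Rotate the sequence to the right by the given number of positions."""
    steps %= len(signal)
    return signal[-steps:] + signal[:-steps]


# Hypothetical responses of one filter at 8 equally spaced orientations.
responses = [0.1, 0.9, 0.4, 0.0, 0.3, 0.7, 0.2, 0.5]
rotated = cyclic_shift(responses, 3)  # same object, rotated by 3 angular steps

difference = max(abs(a - b) for a, b in
                 zip(dft_magnitudes(responses), dft_magnitudes(rotated)))
print(difference < 1e-9)  # True
```

The shift only changes the phase of each Fourier coefficient, so a classifier that sees the magnitudes receives the same input for every rotation on the sampling grid.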

Weighted Elastic Net for Unsupervised Domain Adaptation with Application to Age Prediction from DNA Methylation Data
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: July 24 and July 25

  • Lisa Handl, University of Tübingen, Germany
  • Adrin Jalali, Max-Planck-Institut für Informatik, Germany
  • Michael Scherer, Max-Planck Institute for Informatics, Germany
  • Ralf Eggeling, University of Tübingen, Germany
  • Nico Pfeifer, Department of Computer Science, University of Tübingen, Germany

Presentation Overview: Show

Motivation: Predictive models are a powerful tool for solving complex problems in computational biology. They are typically designed to predict or classify data coming from the same unknown distribution as the training data. In many real-world settings, however, uncontrolled biological or technical factors can lead to a distribution mismatch between datasets acquired at different times, causing model performance to deteriorate on new data. A common additional obstacle in computational biology is scarce data with many more features than samples. To address these problems, we propose a method for unsupervised domain adaptation that is based on a weighted elastic net. The key idea of our approach is to compare dependencies between inputs in training and test data and to increase the cost of differently behaving features in the elastic-net regularization term. In doing so, we encourage the model to assign a higher importance to features that are robust and behave similarly across domains.

Results: We evaluate our method both on simulated data with varying degrees of distribution mismatch and on real data, considering the problem of age prediction based on DNA methylation data across multiple tissues. Compared to a non-adaptive standard model, our approach substantially reduces errors on samples with a mismatched distribution. On real data, we achieve far lower errors on cerebellum samples, a tissue which is not part of the training data and poorly predicted by standard models. Our results demonstrate that unsupervised domain adaptation is possible for applications in computational biology, even with many more features than samples.
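
The key idea, raising the regularization cost of features that behave differently across domains, can be sketched as a per-feature weighted elastic net penalty. The form below is an assumption for illustration: the abstract does not state whether the weights apply to both the L1 and L2 terms, nor how the weights are computed from the training and test dependencies, and the names `lam` and `alpha` follow the usual elastic net convention.

```python
def weighted_elastic_net_penalty(coefs, feature_weights, lam=1.0, alpha=0.5):
    """lam * sum_j w_j * (alpha * |b_j| + (1 - alpha) / 2 * b_j^2)."""
    return lam * sum(
        w * (alpha * abs(b) + 0.5 * (1.0 - alpha) * b * b)
        for b, w in zip(coefs, feature_weights)
    )


coefs = [1.0, -2.0]

# Uniform weights recover the standard elastic net penalty.
print(weighted_elastic_net_penalty(coefs, [1.0, 1.0]))  # 2.75

# Tripling the weight of the second feature makes its coefficient costlier to keep.
print(weighted_elastic_net_penalty(coefs, [1.0, 3.0]))  # 6.75
```

During fitting, a larger weight shrinks the corresponding coefficient more strongly, so features that are stable across domains end up carrying more of the prediction.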

NetBio: Network Biology

Inferring signalling dynamics by integrating interventional with observational data
COSI: NetBio: Network Biology
Date: July 23

  • Mathias Cardner, ETH Zurich, Switzerland
  • Nathalie Meyer-Schaller, University Hospital of Basel, Switzerland
  • Gerhard Christofori, University of Basel, Switzerland
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Presentation Overview: Show

Motivation: In order to infer a cell signalling network, we generally need interventional data from perturbation experiments. If the perturbation experiments are time-resolved, then signal progression through the network can be inferred. However, such designs are infeasible for large signalling networks, where it is more common to have steady-state perturbation data on the one hand, and a non-interventional time series on the other. Such was the design in a recent experiment investigating the coordination of epithelial–mesenchymal transition (EMT) in murine mammary gland cells. We aimed to infer the underlying signalling network of transcription factors and microRNAs coordinating EMT, as well as the signal progression during EMT.

Results: In the context of nested effects models, we developed a method for integrating perturbation data with a non-interventional time series. We applied the model to RNA sequencing data obtained from an EMT experiment. Part of the network inferred from RNA interference was validated experimentally using luciferase reporter assays. Our model extension is formulated as an integer linear programme, which can be solved efficiently using heuristic algorithms. This extension allowed us to infer the signal progression through the network during an EMT time course, and thereby assess when each regulator is necessary for EMT to advance.

Robust network inference using response logic
COSI: NetBio: Network Biology
Date: July 23

  • Torsten Gross, IRI Life Sciences, Humboldt University, Berlin, Germany
  • Matthew Wongchenko, Genentech Inc., Oncology Biomarker Development, United States
  • Yibing Yan, Genentech Inc., Oncology Biomarker Development, United States
  • Nils Blüthgen, Charité - Universitätsmedizin Berlin, Institut für Pathologie, Berlin, Germany

Motivation: A major challenge in molecular and cellular biology is to map out the regulatory networks of cells. As regulatory interactions typically cannot be observed directly in experiments, various computational methods have been proposed to disentangle direct and indirect effects. Most of these rely on assumptions that are rarely met or cannot be adapted to a given context.
Results: We present a network inference method that is based on a simple response logic with minimal presumptions. It requires that we can experimentally observe whether or not some of the system’s components respond to perturbations of some other components, and then identifies the directed networks that most accurately account for the observed propagation of the signal. To cope with the intractable number of possible networks, we developed a logic programming approach that can infer networks of hundreds of nodes, while being robust to noisy, heterogeneous or missing data. This approach allows the direct integration of prior network knowledge and additional constraints such as sparsity. We systematically benchmark our method on KEGG pathways, and show that it outperforms existing approaches in the DREAM3 and DREAM4 challenges. Applied to a novel perturbation data set on PI3K and MAPK pathways in isogenic models of a colon cancer cell line, it generates plausible network hypotheses that explain distinct sensitivities towards EGFR inhibitors by different PI3K mutants.
Availability and Implementation: A Python/Answer Set Programming implementation can be accessed at github.com/GrossTor/response-logic. Data and analysis scripts are available at github.com/GrossTor/response-logic-projects.
Contact: nils.bluethgen@charite.de
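
The response logic described above can be sketched in a few lines, assuming that a node responds to a perturbation exactly when a directed path leads to it from the perturbed node. This is only a scoring sketch for a single candidate network. The function names and data layout are hypothetical, and the search over candidate networks (done with Answer Set Programming in the authors' implementation) is not reproduced here.

```python
def response_pattern(nodes, edges):
    """Map each perturbed node to the set of nodes reachable from it."""
    successors = {n: set() for n in nodes}
    for src, dst in edges:
        successors[src].add(dst)
    pattern = {}
    for start in nodes:
        seen = set()
        frontier = [start]
        while frontier:  # depth-first traversal along directed edges
            node = frontier.pop()
            for nxt in successors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        pattern[start] = seen
    return pattern


def count_mismatches(pattern, observed):
    """Count observations that a candidate network fails to explain.

    observed maps (perturbed, measured) -> True/False; pairs that were not
    measured are simply left out, which is one way to handle missing data.
    """
    return sum(
        1
        for (perturbed, measured), responded in observed.items()
        if (measured in pattern[perturbed]) != responded
    )


# Candidate chain A -> B -> C, scored against three observations.
pattern = response_pattern(["A", "B", "C"], [("A", "B"), ("B", "C")])
observed = {("A", "C"): True, ("C", "A"): False, ("B", "A"): True}
errors = count_mismatches(pattern, observed)  # only ("B", "A") is unexplained
```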

RegSys: Regulatory and Systems Genomics

A statistical simulator scDesign for rational scRNA-seq experimental design
COSI: RegSys: Regulatory and Systems Genomics
Date: July 22 and July 23

  • Wei Vivian Li, University of California, Los Angeles, United States
  • Jingyi Jessica Li, University of California, Los Angeles, United States

Motivation: Single-cell RNA-sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths, and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information.
Results: Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and six different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experiment design based on specific research goals and compares various scRNA-seq computational methods.
Availability: We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign.
Contact: jli@stat.ucla.edu

Comprehensive Evaluation of Deep Learning Architectures for Prediction of DNA/RNA Sequence Binding Specificities
COSI: RegSys: Regulatory and Systems Genomics
Date: July 22 and July 23

  • Ameni Trabelsi, Colorado State University, United States
  • Mohamed Chaabane, Colorado State University, United States
  • Asa Ben-Hur, Colorado State University, United States

Motivation: Deep learning architectures have recently demonstrated their power in predicting DNA- and RNA-binding specificity.
Existing methods fall into three classes: some are based on Convolutional Neural Networks (CNNs), others use Recurrent Neural Networks (RNNs), and others rely on hybrid architectures combining CNNs and RNNs.
However, based on existing studies, the relative merit of the various architectures is still unclear.

Results: In this study, we present a systematic exploration of deep learning architectures for predicting DNA- and RNA-binding specificity. For this purpose, we present deepRAM, an end-to-end deep learning tool that provides an implementation of a wide selection of architectures; its fully automatic model selection procedure allows us to perform a fair and unbiased comparison of deep learning architectures.
We find that deeper, more complex architectures provide a clear advantage with sufficient training data, and that hybrid CNN/RNN architectures outperform other methods in terms of accuracy.
Our work provides guidelines that can assist the practitioner in choosing an appropriate network architecture, and provides insight on the difference between the models learned by convolutional and recurrent networks.
In particular, we find that although recurrent networks improve model accuracy, this comes at the expense of a loss in the interpretability of the features learned by the model.

GkmExplain: Fast and Accurate Interpretation of Nonlinear Gapped k-mer SVMs
COSI: RegSys: Regulatory and Systems Genomics
Date: July 22 and July 23

  • Avanti Shrikumar, Stanford University, United States
  • Eva Prakash, BASIS Independent Silicon Valley, United States
  • Anshul Kundaje, Stanford University, United States

Support Vector Machines with gapped k-mer kernels (gkm-SVMs) have been used to learn predictive models of regulatory DNA sequence. However, interpreting predictive sequence patterns learned by gkm-SVMs can be challenging. Existing interpretation methods such as deltaSVM, in-silico mutagenesis (ISM), or SHAP either do not scale well or make limiting assumptions about the model that can produce misleading results when the gkm kernel is combined with nonlinear kernels. Here, we propose GkmExplain: a computationally efficient feature attribution method for interpreting predictive sequence patterns from gkm-SVM models that has theoretical connections to the method of Integrated Gradients. Using simulated regulatory DNA sequences, we show that GkmExplain identifies predictive patterns with high accuracy while avoiding pitfalls of deltaSVM and ISM and being orders of magnitude more computationally efficient than SHAP. By applying GkmExplain and a recently developed motif discovery method called TF-MoDISco to gkm-SVM models trained on in vivo TF binding data, we obtain superior recovery of consolidated, non-redundant transcription factor (TF) motifs compared to other motif discovery methods. Mutation impact scores derived using GkmExplain consistently outperform deltaSVM and ISM at identifying regulatory genetic variants from gkm-SVM models of chromatin accessibility in lymphoblastoid cell-lines.

Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts
COSI: RegSys: Regulatory and Systems Genomics
Date: July 22 and July 23

  • Surag Nair, Stanford University, United States
  • Daniel Kim, Stanford University, United States
  • Jacob Perricone, Stanford University, United States
  • Anshul Kundaje, Stanford University, United States

Motivation: Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks (CNNs) have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types.
Results: We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis and trans regulation of chromatin dynamics across 123 diverse cellular contexts.
Availability: The code is available at https://github.com/kundajelab/ChromDragoNN
Contact: akundaje@stanford.edu

Large-scale inference of competing endogenous RNA networks with sparse partial correlation
COSI: RegSys: Regulatory and Systems Genomics
Date: July 22 and July 23

  • Markus List, Technical University of Munich, Germany
  • Azim Dehghani Amirabad, International Max Planck Research School for Computer Science, Germany
  • Dennis Kostka, University of Pittsburgh, United States
  • Marcel Schulz, Goethe University, Germany

Motivation: MicroRNAs (miRNAs) are important non-coding post-transcriptional regulators that are involved in many biological processes and human diseases. Individual miRNAs may regulate hundreds of genes, giving rise to a complex gene regulatory network in which transcripts carrying miRNA binding sites act as competing endogenous RNAs (ceRNAs). Several methods for the analysis of ceRNA interactions exist, but these often do not adjust for statistical confounders or address the problem that more than one miRNA interacts with a target transcript.

Results: We present SPONGE, a method for the fast construction of ceRNA networks. SPONGE uses 'multiple sensitivity correlation', a newly defined measure for which we can estimate a distribution under a null hypothesis. SPONGE can accurately quantify the contribution of multiple miRNAs to a ceRNA interaction with a probabilistic model that addresses previously neglected confounding factors and allows fast p-value calculation, thus outperforming existing approaches. We applied SPONGE to paired miRNA and gene expression data from The Cancer Genome Atlas for studying global effects of miRNA-mediated cross-talk. Our results highlight already established and novel protein-coding and non-coding ceRNAs that could serve as biomarkers in cancer.

Availability: SPONGE is available as an R/Bioconductor package (https://doi.org/doi:10.18129/B9.bioc.SPONGE)

Contact: markus.list@wzw.tum.de and marcel.schulz@em.uni-frankfurt.de
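
As a rough illustration of the kind of measure involved, the sketch below computes a sensitivity correlation for the simplest case of one miRNA: the correlation between two genes minus their partial correlation given the miRNA. This single-miRNA form is an assumption based on the usual definition of sensitivity correlation; the abstract does not give the formula for the multiple-miRNA generalization, the null model or the p-value computation, so none of those are shown, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y given z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def sensitivity_correlation(gene1, gene2, mirna):
    """Share of the gene-gene correlation that disappears once the miRNA is controlled for."""
    return pearson(gene1, gene2) - partial_correlation(gene1, gene2, mirna)


# Toy data in which the whole gene-gene correlation is explained by the miRNA.
mirna = [-1.0, -1.0, 1.0, 1.0]
gene1 = [-3.0, -1.0, 1.0, 3.0]
gene2 = [-3.0, -1.0, 3.0, 1.0]
scor = sensitivity_correlation(gene1, gene2, mirna)
```

In this toy data set the two genes correlate at 0.8, and the partial correlation given the miRNA is zero, so the sensitivity correlation equals the full 0.8.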

Learning signaling networks from combinatorial perturbations by exploiting siRNA off-target effects
COSI: RegSys: Regulatory and Systems Genomics
Date: July 22 and July 23

  • Jerzy Tiuryn, Warsaw University, Poland
  • Ewa Szczurek, University of Warsaw, Poland

Motivation: Perturbation experiments constitute the central means to study cellular networks. Several confounding factors complicate computational modeling of signaling networks from these data. First, the technique of RNA interference (RNAi), designed and commonly used to knock down specific genes, suffers from off-target effects. As a result, each experiment is a combinatorial perturbation of multiple genes. Second, the perturbations propagate along unknown connections in the signaling network. Once the signal is blocked by perturbation, proteins downstream of the targeted proteins also become inactivated. Finally, all perturbed network members, either directly targeted by the experiment, or by propagation in the network, contribute to the observed effect, either in a positive or negative manner. One of the key questions of computational inference of signaling networks from such data is: how many and what combinations of perturbations are required to uniquely and accurately infer the model?
Results: Here, we introduce an enhanced version of linear effects models (LEMs), which extends the original by accounting for both negative and positive contributions of the perturbed network proteins to the observed phenotype. We prove that the enhanced LEMs are identifiable from data measured under perturbations of all single proteins, pairs and triplets of network proteins. For small networks of up to five nodes, only perturbations of single proteins and pairs of proteins are required for identifiability. Extensive simulations demonstrate that enhanced LEMs achieve excellent accuracy of parameter estimation and network structure learning, outperforming the previous version on realistic data. LEMs applied to Bartonella henselae infection RNAi screening data identified known interactions between eight nodes of the infection network, confirming high specificity of our model, and suggested one new interaction.
Availability: https://github.com/EwaSzczurek/LEM
Contact: szczurek@mimuw.edu.pl

Selfish: Discovery of Differential Chromatin Interactions via a Self-Similarity Measure
COSI: RegSys: Regulatory and Systems Genomics
Date: July 22 and July 23

  • Abbas Roayaei Ardakany, University of California Riverside, United States
  • Ferhat Ay, La Jolla Institute for Allergy and Immunology, United States
  • Stefano Lonardi, University of California Riverside, United States

Motivation: High-throughput conformation capture experiments such as Hi-C provide genome-wide maps of chromatin interactions, enabling life scientists to investigate the role of the three-dimensional structure of genomes in gene regulation and other essential cellular functions. A fundamental problem in the analysis of Hi-C data is how to compare two contact maps derived from Hi-C experiments. Detecting similarities and differences between contact maps is critical in evaluating the reproducibility of replicate experiments and for identifying differential genomic regions with biological significance. Due to the complexity of chromatin conformations and the presence of technology-driven and sequence-specific biases, the comparative analysis of Hi-C data is analytically and computationally challenging.
Results: We present a novel method called Selfish for the comparative analysis of Hi-C data that takes advantage of the structural self-similarity in contact maps. We define a novel self-similarity measure to design algorithms for (i) measuring reproducibility for Hi-C replicate experiments and (ii) finding differential chromatin interactions between two contact maps. Extensive experimental results on simulated and real data show that Selfish is more accurate and robust than state-of-the-art methods.

RNA: Computational RNA Biology

FunDMDeep-m6A: Identification and prioritization of functional differential m6A methylation genes.
COSI: RNA: Computational RNA Biology
Date: July 24 and July 25

  • Song-Yao Zhang, Northwestern Polytechnical University, China
  • Shao-Wu Zhang, Northwestern Polytechnical University, China
  • Xiao-Nan Fan, Northwestern Polytechnical University, China
  • Teng Zhang, Northwestern Polytechnical University, China
  • Jia Meng, Department of Biological Sciences, HRINU, SUERI, Xi’an Jiaotong-Liverpool University, China
  • Yufei Huang, Department of Electrical and Computer Engineering, the University of Texas at San Antonio, United States

Motivation: As the most abundant mammalian mRNA methylation, N6-methyladenosine (m6A) exists in >25% of human mRNAs and is involved in regulating many different aspects of mRNA metabolism, stem cell differentiation and diseases like cancer. However, the dynamic changes of m6A levels, and how a change in the m6A level of a specific gene contributes to biological processes like stem cell differentiation and diseases like cancer, remain largely elusive.
Results: To address this, we propose FunDMDeep-m6A, a novel pipeline for identifying context-specific (e.g., disease vs. normal, differentiated cells vs. stem cells or gene knockdown cells vs. wild type cells) m6A-mediated functional genes. As its first step, FunDMDeep-m6A includes DMDeep-m6A, a novel method based on a deep learning model and a statistical test for identifying differential m6A methylation (DmM) sites from MeRIP-Seq data at single-base resolution. FunDMDeep-m6A then identifies and prioritizes functional DmM genes (FDmMGenes) by combining the DmM genes (DmMGenes) with differential expression analysis using a network-based method. This network method includes a novel m6A-signaling bridge (MSB) score to quantify the functional significance of DmMGenes by assessing functional interaction of DmMGenes with their signaling pathways using a heat diffusion process in protein-protein interaction (PPI) networks. The test results on 4 context-specific MeRIP-Seq datasets showed that FunDMDeep-m6A can identify more context-specific and functionally significant FDmMGenes than m6A-Driver. The functional enrichment analysis of these genes revealed that m6A targets key genes of many important context-related biological processes including embryonic development, stem cell differentiation, transcription, translation, cell death, cell proliferation and cancer-related pathways. These results demonstrate the power of FunDMDeep-m6A for elucidating m6A regulatory functions and its roles in biological processes and diseases.
Availability: The R-package for DMDeep-m6A is freely available from https://github.com/NWPU-903PR/DMDeepm6A1.0.

LinearFold: Linear-Time Approximate RNA Folding by 5’-to-3’ Dynamic Programming and Beam Search
COSI: RNA: Computational RNA Biology
Date: July 24 and July 25

  • Liang Huang, Oregon State University and Baidu Research USA, United States
  • He Zhang, Baidu Research USA, United States
  • Dezhong Deng, Oregon State University, United States
  • Kai Zhao, Google, United States
  • Kaibo Liu, Baidu Research USA, United States
  • David Hendrix, Oregon State University, United States
  • David Mathews, University of Rochester, United States

Motivation: Predicting the secondary structure of an RNA sequence is useful in many applications. Existing algorithms (based on dynamic programming) suffer from a major limitation: their runtimes scale cubically with the RNA length, and this slowness limits their use in genome-wide applications.

Results: We present a novel alternative O(n^3)-time dynamic programming algorithm for RNA folding that is amenable to heuristics that make it run in O(n) time and O(n) space, while producing a high-quality approximation to the optimal solution. Inspired by incremental parsing for context-free grammars in computational linguistics, our alternative dynamic programming algorithm scans the sequence in a left-to-right (5’-to-3’) direction rather than in a bottom-up fashion, which allows us to employ the effective beam pruning heuristic. Our work, though inexact, is the first RNA folding algorithm to achieve linear runtime (and linear space) without imposing constraints on the output structure. Surprisingly, our approximate search results in even higher overall accuracy on a diverse database of sequences with known structures. More interestingly, it leads to significantly more accurate predictions on the longest sequence families in that database (16S and 23S Ribosomal RNAs), as well as improved accuracies for long-range base pairs (500+ nucleotides apart), both of which are well known to be challenging for the current models.

Availability: Our source code is available at https://github.com/LinearFold/LinearFold, and our webserver is at http://linearfold.org (max. sequence length on server: 100,000nt).
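
The left-to-right scan with beam pruning can be illustrated with a toy sketch. It is not the LinearFold implementation: the score is simply the number of base pairs rather than a thermodynamic or learned model, states are kept as explicit stacks without the dynamic-programming state merging, and the names `beam_fold`, `beam_size` and `min_loop` are hypothetical.

```python
# Watson-Crick and G-U wobble pairs
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def beam_fold(seq, beam_size=100, min_loop=3):
    """Scan 5'-to-3', keeping only the beam_size best partial structures.

    A state is (score, stack of open positions, dot-bracket prefix).
    The number of states expanded per position is bounded by beam_size, so
    the number of expansions grows linearly with sequence length (the tuple
    and string copies in this toy version add extra cost).
    """
    beam = [(0, (), "")]
    for j, base in enumerate(seq):
        candidates = []
        for score, stack, struct in beam:
            candidates.append((score, stack, struct + "."))         # j unpaired
            candidates.append((score, stack + (j,), struct + "("))  # j opens a pair
            if stack:
                i = stack[-1]
                if (seq[i], base) in PAIRS and j - i > min_loop:
                    # j closes the most recently opened position
                    candidates.append((score + 1, stack[:-1], struct + ")"))
        candidates.sort(key=lambda state: state[0], reverse=True)
        beam = candidates[:beam_size]                               # beam pruning
    score, stack, struct = beam[0]
    chars = list(struct)
    for i in stack:  # positions opened but never closed stay unpaired
        chars[i] = "."
    return score, "".join(chars)


score, structure = beam_fold("GGGAAACCC")
```

For this short hairpin the beam is wide enough to keep every useful state, so the sketch recovers the three nested G-C pairs; on long sequences a narrow beam trades exactness for speed, which is the trade-off the abstract describes.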

Prediction of mRNA subcellular localization using deep recurrent neural networks
COSI: RNA: Computational RNA Biology
Date: July 24 and July 25

  • Zichao Yan, McGill University, Canada
  • Eric Lécuyer, Institut de Recherche Clinique de Montréal, Canada
  • Mathieu Blanchette, McGill University, Canada

Motivation: Messenger RNA subcellular localization mechanisms play a crucial role in post-transcriptional gene regulation. This trafficking is mediated by trans-acting RNA-binding proteins interacting with cis-regulatory elements called zipcodes. While new sequencing-based technologies allow the high-throughput identification of RNAs localized to specific subcellular compartments, the precise mechanisms at play, and their dependency on specific sequence elements, remain poorly understood.
Results: We introduce RNATracker, a novel deep neural network built to predict, from their sequence alone, the distributions of mRNA transcripts over a predefined set of subcellular compartments. RNATracker integrates several state-of-the-art deep learning techniques (e.g. CNN, LSTM and attention layers) and can make use of both sequence and secondary structure information. We report on a variety of evaluations showing RNATracker's strong predictive power, which is significantly superior to a variety of baseline predictors. Despite its complexity, several aspects of the model can be isolated to yield valuable, testable mechanistic hypotheses, and to locate candidate zipcode sequences within transcripts.
Availability: Code and data can be accessed at https://www.github.com/HarveyYan/RNATracker

ShaKer: RNA SHAPE prediction using graph kernel
COSI: RNA: Computational RNA Biology
Date: July 24 and July 25

  • Stefan Mautner, Albert-Ludwigs-University Freiburg, Germany
  • Soheila Montaseri, Albert-Ludwigs-University Freiburg, Iran
  • Milad Miladi, University of Freiburg, Germany
  • Martin Mann, University of Freiburg, Germany
  • Fabrizio Costa, University of Exeter, United Kingdom
  • Rolf Backofen, Albert-Ludwigs-University Freiburg, Germany

SHAPE experiments are used to probe the structure of RNA molecules.
We present ShaKer to predict SHAPE data for RNA using a graph-kernel-based machine learning approach that is trained on experimental SHAPE information.
While other available methods require a manually curated reference structure, ShaKer predicts reactivity data based on sequence input only and by sampling the ensemble of possible structures.
Thus, ShaKer is well placed to enable experiment-driven, transcriptome-wide SHAPE data prediction, which supports the study of RNA structuredness and can improve RNA structure and RNA-RNA interaction prediction.

For performance evaluation, we measure accuracy and accessibility against experimental SHAPE data and competing methods. We show that ShaKer outperforms its competitors and is able to predict high-quality SHAPE annotations even when no reference structure is provided.

TransMed: Translational Medical Informatics

Deep Learning with Multimodal Representation for Pancancer Prognosis Prediction
COSI: TransMed: Translational Medical Informatics
Date: July 22

  • Anika Cheerla, Stanford University, United States
  • Olivier Gevaert, Stanford University, United States

Motivation: Estimating the future course of patients with cancer lesions is invaluable to physicians; however, current clinical methods fail to effectively use the vast amount of multimodal data that is available for cancer patients. To tackle this problem, we constructed a multi-modal neural network based model to predict the survival of patients for 20 different cancer types using clinical data, mRNA expression data, microRNA expression data and histopathology whole slide images (WSIs). We developed an unsupervised encoder to compress these four data modalities into a single feature vector for each patient, handling missing data through a resilient, multimodal dropout method. Encoding methods were tailored to each data type - using deep highway networks to extract features from clinical and genomic data, and convolutional neural networks to extract features from WSIs.
Results: We used pancancer data to train these feature encodings and predict single cancer and pancancer overall survival, achieving a C-index of 0.78 overall. This work shows that it is possible to build a pancancer model for prognosis that also predicts prognosis in single cancer sites. Furthermore, our model handles multiple data modalities, efficiently analyzes WSIs, and represents patient multi-modal data flexibly into an unsupervised, informative representation. We thus present a powerful automated tool to accurately determine prognosis, a key step towards personalized treatment for cancer patients.
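
For readers unfamiliar with the C-index reported above, the following minimal sketch computes Harrell's concordance index for right-censored survival data. It is a generic textbook version, not the authors' evaluation code, and the argument names are chosen here for illustration.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index.

    times  -- observed follow-up time per patient
    events -- 1 if the event (death) was observed, 0 if censored
    risks  -- predicted risk score; higher means shorter expected survival
    A pair (i, j) is comparable when patient i had an observed event
    before patient j's follow-up time.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions count half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
perfect = concordance_index(times, events, [0.9, 0.7, 0.4, 0.1])
reversed_order = concordance_index(times, events, [0.1, 0.4, 0.7, 0.9])
```

A value of 1.0 means every comparable pair is ranked correctly, 0.5 corresponds to random ranking, and 0.0 means every pair is ranked in the wrong order.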

Drug repositioning based on bounded nuclear norm regularization
COSI: TransMed: Translational Medical Informatics
Date: July 22

  • Mengyun Yang, Central South University, China
  • Huimin Luo, Central South University, China
  • Yaohang Li, Old Dominion University, United States
  • Jianxin Wang, Central South University, China

Motivation: Computational drug repositioning is a cost-effective strategy to identify novel indications for existing drugs. Drug repositioning is often modeled as a recommendation system problem. Taking advantage of the known drug-disease associations, the objective of the recommendation system is to identify new treatments by filling out the unknown entries in the drug-disease association matrix, which is known as matrix completion. Underpinned by the fact that common molecular pathways contribute to many different diseases, the recommendation system assumes that the underlying latent factors determining drug-disease associations are highly correlated. In other words, the drug-disease matrix to be completed is low-rank. Accordingly, matrix completion algorithms efficiently constructing low-rank drug-disease matrix approximations consistent with known associations can be of immense help in discovering the novel drug-disease associations.
Results: In this article, we propose to use a Bounded Nuclear Norm Regularization (BNNR) method to complete the drug-disease matrix under the low-rank assumption. Instead of strictly fitting the known elements, BNNR is designed to tolerate the noisy drug-drug and disease-disease similarities by incorporating a regularization term to balance the approximation error and the rank properties. Moreover, additional constraints are incorporated into BNNR to ensure that all predicted matrix entry values are within a specified interval. BNNR is carried out on an adjacency matrix of a heterogeneous drug-disease network, which integrates the drug-drug, drug-disease, and disease-disease networks. It not only makes full use of available drugs, diseases, and their association information, but is also capable of dealing with cold start naturally. Our computational results show that BNNR yields higher drug-disease association prediction accuracy than the current state-of-the-art methods. The most significant gain is in prediction precision, measured as the fraction of the positive predictions that are truly positive, which is particularly useful in drug design practice. Case studies also confirm the accuracy and reliability of BNNR.
Availability: The code of BNNR is freely available at https://github.com/BioinformaticsCSU/BNNR
Contact: jxwang@mail.csu.edu.cn

Enhancing the Drug Discovery Process: Bayesian Inference for the Analysis and Comparison of Dose-Response Experiments
COSI: TransMed: Translational Medical Informatics
Date: July 22

  • Caroline Labelle, University of Montreal, Canada
  • Anne Marinier, University of Montreal, Canada
  • Sébastien Lemieux, University of Montreal, Canada

Motivation: The efficacy of a chemical compound is often tested through dose-response experiments from which efficacy metrics, such as the IC50, can be derived. The Marquardt-Levenberg algorithm (non-linear regression) is commonly used to compute estimates of these metrics. Such analyses are, however, limited and can lead to biased conclusions. The approach does not evaluate the certainty (or uncertainty) of the estimates, nor does it allow for the statistical comparison of two datasets. To compensate for these shortcomings, intuition plays an important role in the interpretation of results and the formulation of conclusions. Here we propose a Bayesian inference methodology for the analysis and comparison of dose-response experiments.

Results: Our results demonstrate the gain in informativeness of our Bayesian approach over the commonly used Marquardt-Levenberg algorithm. It can characterize the noise of a dataset while inferring distributions of probable values for the efficacy metrics. It can also evaluate the difference between the metrics of two datasets and compute the probability that one value is greater than the other. The conclusions that can be drawn from such analyses are more precise.

Availability: We implemented a simple web interface that allows the users to analyze a single dose-response dataset, as well as to statistically compare the metrics of two datasets.
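
The kind of comparison described above can be sketched with a simple grid approximation. Everything specific here is an assumption made for illustration, since the abstract does not state the model: a two-parameter Hill curve with fixed top and slope, Gaussian noise with known `sigma`, a flat prior over log10(IC50), and synthetic noise-free data.

```python
import math


def response(dose, ic50, hill=1.0, top=100.0):
    """Assumed Hill-type dose-response curve (percent of control)."""
    return top / (1.0 + (dose / ic50) ** hill)


def ic50_posterior(doses, responses, grid, sigma=5.0):
    """Posterior over log10(IC50) on a grid, with flat prior and Gaussian noise."""
    log_post = []
    for log_ic50 in grid:
        ic50 = 10.0 ** log_ic50
        sse = sum((y - response(d, ic50)) ** 2 for d, y in zip(doses, responses))
        log_post.append(-sse / (2.0 * sigma ** 2))
    peak = max(log_post)  # subtract the peak for numerical stability
    weights = [math.exp(v - peak) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def prob_greater(post_a, post_b):
    """P(IC50_a > IC50_b) for independent posteriors on the same ascending grid."""
    prob = 0.0
    cumulative_b = 0.0  # P(b lies strictly below the current grid point)
    for pa, pb in zip(post_a, post_b):
        prob += pa * cumulative_b
        cumulative_b += pb
    return prob


grid = [-3.0 + 0.1 * k for k in range(61)]  # log10(IC50) from -3 to 3
doses = [0.01, 0.1, 1.0, 10.0, 100.0]
data_a = [response(d, 1.0) for d in doses]   # synthetic compound, IC50 = 1
data_b = [response(d, 10.0) for d in doses]  # synthetic compound, IC50 = 10
post_a = ic50_posterior(doses, data_a, grid)
post_b = ic50_posterior(doses, data_b, grid)
p_b_greater = prob_greater(post_b, post_a)
```

Unlike a single point estimate from non-linear regression, the posterior carries the uncertainty of each IC50, and `prob_greater` turns two posteriors into a direct probability statement about which compound is more potent.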

Identifying and ranking potential driver genes of Alzheimer's Disease using multi-view evidence aggregation
COSI: TransMed: Translational Medical Informatics
Date: July 22

  • Sumit Mukherjee, Sage Bionetworks, United States
  • Thanneer Malai Perumal, Sage Bionetworks, United States
  • Kenneth Daily, Sage Bionetworks, United States
  • Solveig Sieberts, Sage Bionetworks, United States
  • Larsson Omberg, Sage Bionetworks, United States
  • Christoph Preuss, The Jackson Laboratory, United States
  • Gregory Carter, The Jackson Laboratory, United States
  • Lara Mangravite, Sage Bionetworks, United States
  • Benjamin Logsdon, Sage Bionetworks, United States

Motivation: Late onset Alzheimer’s disease (LOAD) is currently a disease with no known effective treatment options. To address this, there has been a recent surge in the generation of multi-modality data (Hodes and Buckholtz, 2016; Mueller et al., 2005) to understand the biology of the disease and potential drivers that causally regulate it. However, most analytic studies using these datasets focus on uni-modal analysis of the data. Here we propose a data-driven approach that integrates multiple data types and analytic outcomes to aggregate evidence supporting the hypothesis that a gene is a genetic driver of the disease. The main algorithmic contributions of our paper are: i) a general machine learning framework to learn the key characteristics of a few known driver genes from multiple feature sets and to identify other potential driver genes that have similar feature representations, and ii) a flexible ranking scheme with the ability to integrate external validation in the form of Genome Wide Association Study (GWAS) summary statistics.
While we currently focus on demonstrating the effectiveness of the approach using different analytic outcomes from RNA-Seq studies, this method is easily generalizable to other data modalities and analysis types.

Results: We demonstrate the utility of our machine learning algorithm on two benchmark multi-view datasets, where it significantly outperforms the baseline approaches in predicting missing labels. We then use the algorithm to predict and rank potential drivers of Alzheimer’s disease. We show that our ranked genes are significantly enriched for SNPs associated with Alzheimer’s disease, and are enriched in pathways that have been previously associated with the disease.

Predicting drug-induced transcriptome responses of a wide range of human cell lines by a novel tensor-train decomposition algorithm
COSI: TransMed: Translational Medical Informatics
Date: July 22

  • Michio Iwata, Kyushu Institute of Technology, Japan
  • Longhao Yuan, RIKEN Center for Advanced Intelligence Project, Saitama Institute of Technology, Japan
  • Qibin Zhao, RIKEN Center for Advanced Intelligence Project, Japan
  • Yasuo Tabei, RIKEN Center for Advanced Intelligence Project, Japan
  • Francois Berenger, Kyushu Institute of Technology, Japan
  • Ryusuke Sawada, Kyushu Institute of Technology, Japan
  • Sayaka Akiyoshi, Kyushu University, Japan
  • Momoko Hamano, Kyushu Institute of Technology, Japan
  • Yoshihiro Yamanishi, Kyushu Institute of Technology, Japan

Genome-wide identification of the transcriptomic responses of human cell lines to drug treatments is a challenging issue in medical and pharmaceutical research. However, drug-induced gene expression profiles are largely unknown and unobserved for all combinations of drugs and human cell lines, which is a serious obstacle in practical applications. Here, we developed a novel computational method to predict unknown parts of drug-induced gene expression profiles for various human cell lines and predict new drug therapeutic indications for a wide range of diseases. We proposed a tensor-train weighted optimization (TT-WOPT) algorithm to predict the potential values for unknown parts in tensor-structured gene expression data. Our results revealed that the proposed TT-WOPT algorithm can accurately reconstruct drug-induced gene expression data for a range of human cell lines in the Library of Integrated Network-based Cellular Signatures. The results also revealed that in comparison with the use of original gene expression profiles, the use of imputed gene expression profiles improved the accuracy of drug repositioning. We also performed a comprehensive prediction of drug indications for diseases with gene expression profiles, which suggested many potential drug indications that were not predicted by previous approaches.
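
To make the tensor-train idea above more concrete, the sketch below evaluates single entries of a tensor stored as a chain of small cores and scores the fit on observed entries only. It is an assumption-laden illustration rather than the TT-WOPT algorithm itself: the helper names `tt_entry` and `masked_sq_error`, the toy cores, and the reading of the three modes as drug, gene, and cell line are all hypothetical, and no optimization of the cores is shown.

```python
def tt_entry(cores, index):
    """Evaluate one tensor entry from tensor-train cores.

    cores[k][i] is the (r_{k-1} x r_k) matrix selected by index i of mode k,
    with boundary ranks r_0 = r_last = 1.
    """
    vec = [1.0]  # running 1 x r row vector
    for core, i in zip(cores, index):
        mat = core[i]
        vec = [
            sum(vec[a] * mat[a][b] for a in range(len(vec)))
            for b in range(len(mat[0]))
        ]
    return vec[0]


def masked_sq_error(cores, observed):
    """Sum of squared errors over observed entries only (missing ones are ignored)."""
    return sum((tt_entry(cores, idx) - val) ** 2 for idx, val in observed.items())


# Toy 2 x 2 x 2 tensor with TT ranks (1, 2, 2, 1); values are made up.
cores = [
    [[[1, 0]], [[0, 1]]],                    # mode 0: two 1 x 2 matrices
    [[[1, 0], [0, 2]], [[0, 1], [1, 0]]],    # mode 1: two 2 x 2 matrices
    [[[1], [1]], [[2], [3]]],                # mode 2: two 2 x 1 matrices
]

# Hypothetical observed entries, keyed by (drug, gene, cell line) index.
observed = {(0, 0, 0): 1.0, (1, 1, 1): 2.5, (1, 0, 1): 6.0}

print(masked_sq_error(cores, observed))  # 0.25
print(tt_entry(cores, (0, 1, 0)))        # prediction for an unobserved entry
```

A weighted optimization of this general kind would adjust the cores to reduce the masked error, after which `tt_entry` supplies predictions for the unobserved combinations.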

Representation Transfer for Differentially Private Drug Sensitivity Prediction
COSI: TransMed: Translational Medical Informatics
Date: July 22

  • Teppo Niinimäki, Aalto University, Finland
  • Mikko Heikkilä, University of Helsinki, Finland
  • Antti Honkela, University of Helsinki, Finland
  • Samuel Kaski, Aalto University, Finland

Motivation: Human genomic datasets often contain sensitive information that limits use and sharing of the data. In particular, simple anonymisation strategies fail to provide a sufficient level of protection for genomic data, because the data are inherently identifiable. Differentially private machine learning can help by guaranteeing that the published results do not leak too much information about any individual data point. Recent research has reached promising results on differentially private drug sensitivity prediction using gene expression data. Differentially private learning with genomic data is challenging because it is more difficult to guarantee privacy in high dimensions. Dimensionality reduction can help, but if the dimension reduction mapping is learned from the data, then it needs to be differentially private too, which can carry a significant privacy cost. Furthermore, the selection of any hyperparameters (such as the target dimensionality) needs to also avoid leaking private

Results: We study an approach that uses a large public dataset of similar type to learn a compact representation for differentially private learning. We compare three representation learning methods: variational autoencoders, PCA and random projection. We solve two machine learning tasks on gene expression of cancer cell lines: cancer type classification, and drug sensitivity prediction. The experiments demonstrate significant benefit from all representation learning methods, with variational autoencoders providing the most accurate predictions most often. Our results significantly improve over the previous state of the art in accuracy of differentially private drug sensitivity prediction.

Availability: Code used in the experiments is available at https://github.com/DPBayes/dp-representation-transfer

VarI: Variant Interpretation

Efficient haplotype matching between a query and a panel for genealogical search
COSI: VarI: Variant Interpretation
Date: July 24

  • Ardalan Naseri, University of Central Florida, United States
  • Erwin Holzhauser, University of Central Florida, United States
  • Degui Zhi, University of Texas Health Science Center at Houston, United States
  • Shaojie Zhang, University of Central Florida, United States

Motivation: With the wide availability of whole genome genotype data, there is an increasing need for conducting genetic genealogical searches efficiently. Computationally, this task amounts to identifying shared DNA segments between a query individual and a very large panel containing millions of haplotypes. The celebrated Positional Burrows-Wheeler Transform (PBWT) data structure is a precomputed index of the panel that enables constant time matching at each position between one haplotype and an arbitrarily large panel. However, the existing algorithm (Durbin’s Algorithm 5) can only identify set-maximal matches, the longest matches ending at any location in a panel, while in real genealogical search scenarios, multiple “good enough” matches are desired.

Results: In this work, we developed two algorithmic extensions of Durbin’s Algorithm 5 that can find all L-long matches, that is, matches longer than or equal to a given length L, between a query and a panel. In the first algorithm, PBWT-Query, we introduce “virtual insertion” of the query into the PBWT matrix of the panel, and then scan up and down for the PBWT match block with length greater than L. In our second algorithm, L-PBWT-Query, we further speed up PBWT-Query by introducing additional data structures that allow us to avoid iterating through blocks of incomplete matches. The efficiency of PBWT-Query and L-PBWT-Query is demonstrated using simulated data and the UK Biobank data. Our results show that the proposed algorithms can efficiently detect related individuals for a given query in very large cohorts, which enables fast on-line query search.
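
For readers unfamiliar with the term, the following sketch shows what an L-long match is by brute force: it scans every haplotype in a small panel and reports each maximal run of sites identical to the query with length at least L. This is a naive baseline for illustration only, not PBWT-Query or L-PBWT-Query; the function name `l_long_matches` and the toy haplotypes are assumptions, and the linear scan over the whole panel is exactly the cost that a PBWT index is meant to avoid.

```python
def l_long_matches(query, panel, min_len):
    """Return (haplotype_index, start, end) for every maximal run of identical
    sites between the query and a panel haplotype with length >= min_len.
    The end position is exclusive."""
    matches = []
    for idx, hap in enumerate(panel):
        start = None  # start of the current run of matching sites, if any
        for pos in range(len(query) + 1):
            same = pos < len(query) and query[pos] == hap[pos]
            if same:
                if start is None:
                    start = pos
            else:
                # A run (if any) ends here; keep it only if it is long enough.
                if start is not None and pos - start >= min_len:
                    matches.append((idx, start, pos))
                start = None
    return matches


# Toy binary haplotypes over 7 sites (made-up data).
panel = ["0101101", "0111101", "1101001"]
query = "0101001"

print(l_long_matches(query, panel, 3))  # [(0, 0, 4), (2, 1, 7)]
```

With `min_len` set to 3, haplotype 0 matches the query over sites 0 to 3 and haplotype 2 over sites 1 to 6, while haplotype 1 shares only shorter runs and is not reported.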

Using the structure of genome data in the design of deep neural networks for predicting amyotrophic lateral sclerosis from genotype
COSI: VarI: Variant Interpretation
Date: July 24

  • Bojian Yin, Centrum Wiskunde en Informatica, Netherlands
  • Marleen Balvert, Centrum Wiskunde en Informatica/Utrecht University, Netherlands
  • Rick A. A. van der Spek, UMC Utrecht, Netherlands
  • Bas E. Dutilh, Utrecht University, Netherlands
  • Sander Bohté, Centrum Wiskunde en Informatica, Netherlands
  • Jan Veldink, UMC Utrecht, Netherlands
  • Alexander Schoenhuth, Centrum Wiskunde en Informatica/Utrecht University, Netherlands

Motivation: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease caused by aberrations in the genome. While several disease-causing variants have been identified, a major part of heritability remains unexplained. ALS is believed to have a complex genetic basis where non-additive combinations of variants constitute disease, which cannot be picked up using the linear models employed in classical genotype-phenotype association studies. Deep learning, on the other hand, is highly promising for identifying such complex relations. We therefore developed a deep-learning-based approach for the classification of ALS patients versus healthy individuals from the Dutch cohort of the Project MinE dataset. Based on the recent insight that regulatory regions harbour the majority of disease-associated variants, we employ a two-step approach: first, promoter regions that are likely associated with ALS are identified, and second, individuals are classified based on their genotype in the selected genomic regions. Both steps employ a deep convolutional neural network. The network architecture accounts for the structure of genome data by applying convolution only to parts of the data where this makes sense from a genomics perspective.

Results: Our approach identifies potentially ALS-associated promoter regions, and generally outperforms other classification methods. Test results support the hypothesis that non-additive combinations of variants contribute to ALS. Architectures and protocols developed are tailored towards processing population-scale, whole-genome data. We consider this a relevant first step towards deep learning assisted genotype-phenotype association in whole genome-sized data.