Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Proceedings Track Presentations

Attention Presenters - please review the Speaker Information Page available here
Presenters please check the individual COSI schedules available under Program for exact presentation times.

The following Proceedings talks will be scheduled in the Communities of Special Interest (COSI) tracks as noted below. Final times will be confirmed in May (each presentation will be scheduled for 20 minutes). ISMB 2018 Proceedings will be published in OUP Bioinformatics in a special issue in late June. (Check back here to access the issue.)

Cole Trapnell
COSI:
Date: Saturday, July 7

  • Cole Trapnell, University of Washington, United States


M. Madan Babu
COSI:
Date: Monday, July 9

  • M. Madan Babu, MRC Laboratory of Molecular Biology, Cambridge, United Kingdom


Martha L. Bulyk
COSI:
Date: Sunday, July 8

  • Martha L. Bulyk, Brigham & Women's Hospital and Harvard Medical School, Boston, United States


Ruth Nussinov - A woman’s computational biology journey
COSI:
Date: Tuesday, July 10

  • Ruth Nussinov, National Cancer Institute, National Institutes of Health, United States, Tel Aviv University, Israel

Presentation Overview:

From the dynamic programming algorithm to fold RNA to unraveling the hallmarks of oncogenic signaling, it has been a long and fascinating journey, one that aspired to tackle significant and pressing questions where computational biology can make a difference. My adventures began when revolutionary sequencing methods produced the first long DNA sequences, with the development of an efficient algorithm to fold RNA, followed by pioneering bioinformatic DNA sequence analyses. They continued with the principles of protein-protein interactions, harnessing the unraveled interface architectures for prediction, and proposing fundamental biophysical principles based on a dynamic view of protein conformational ensembles. This view led us to suggest that all (dynamic) proteins are allosteric, and to propose the universal “conformational selection and population shift” mechanism in molecular recognition, replacing the textbook “induced-fit” paradigm. Finally, my inspirational journey confronts oncogenic Ras signaling, a problem at the center of the NCI initiative, which will be the focus of my talk.

Steven Salzberg - 25 years of human gene finding: are we there yet?
COSI:
Date: Friday, July 6

  • Steven Salzberg, Bloomberg Distinguished Professor, Center for Computational Biology, Johns Hopkins University, Baltimore, United States

Presentation Overview:

How many genes do we have? The Human Genome Project was launched with the promise of revealing all of our genes, the “code” that would help explain our biology. The publication of the human genome in 2001 provided only a very rough answer to this question. For more than a decade following, the number of protein-coding genes steadily shrank, but the introduction of RNA sequencing revealed a vast new world of splice variants and RNA genes. In this talk, I will review where we’ve been and where we are today, and I will describe a new effort to use an unprecedentedly large RNA sequencing resource to create a comprehensive new human gene catalog.

This talk describes joint work with Mihaela Pertea, Alaina Shumate, Ales Varabyou, and Geo Pertea.


CompMS: Computational Mass Spectrometry

Bayes networks for mass spectrometric metabolite identification via molecular fingerprints
COSI: CompMS: Computational Mass Spectrometry
Date: Saturday, July 7

  • Marcus Ludwig, Friedrich-Schiller-University Jena, Germany
  • Kai Dührkop, Friedrich-Schiller-University Jena, Germany
  • Sebastian Böcker, Friedrich-Schiller-University Jena, Germany

Presentation Overview:

Metabolites, small molecules that are involved in cellular reactions, provide
a direct functional signature of cellular state. Untargeted metabolomics
experiments usually rely on tandem mass spectrometry to identify the
thousands of compounds in a biological sample. Recently, we presented
CSI:FingerID for searching in molecular structure databases using tandem mass
spectrometry data. CSI:FingerID predicts a molecular fingerprint that
encodes the structure of the query compound, then uses this to search a
molecular structure database such as PubChem. Scoring of the predicted query
fingerprint and deterministic target fingerprints is carried out assuming
independence between the molecular properties constituting the fingerprint.

We present a scoring that takes into account dependencies between molecular
properties. As before, we predict posterior probabilities of molecular
properties using machine learning. Dependencies between molecular properties
are modeled as a Bayesian tree network; the tree structure is estimated on
the fly from the instance data. For each edge, we also estimate the expected
covariance between the two random variables. For fixed marginal
probabilities, we then estimate conditional probabilities using the known
covariance. Now, the corrected posterior probability of each candidate can
be computed, and candidates are ranked by this score. Modeling dependencies
improves identification rates of CSI:FingerID by 2.85 percentage points.
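
The step from a fixed pair of marginal probabilities and a covariance to conditional probabilities can be illustrated with a short sketch. It is not the CSI:FingerID implementation: the function names, the tree encoding (`parent`, `cov`), and the way a candidate fingerprint is scored are assumptions made for illustration. The only fact used is that, for two binary variables, P(X=1, Y=1) = P(X=1)·P(Y=1) + Cov(X, Y).

```python
from math import log

def conditional_from_covariance(p_parent, p_child, cov):
    """Return (P(child=1 | parent=1), P(child=1 | parent=0)) for two binary variables.

    Assumes 0 < p_parent < 1 and a covariance compatible with the marginals.
    """
    joint_11 = p_parent * p_child + cov   # P(parent=1, child=1) = E[XY]
    joint_01 = p_child - joint_11         # P(parent=0, child=1)
    return joint_11 / p_parent, joint_01 / (1.0 - p_parent)

def tree_log_score(fingerprint, marginals, parent, cov):
    """Log-probability of a binary fingerprint under a tree-structured model.

    parent[i] is the index of property i's parent (None for the root);
    cov[i] is the covariance between property i and its parent (hypothetical layout).
    """
    total = 0.0
    for i, bit in enumerate(fingerprint):
        if parent[i] is None:
            p_one = marginals[i]
        else:
            given1, given0 = conditional_from_covariance(
                marginals[parent[i]], marginals[i], cov[i])
            p_one = given1 if fingerprint[parent[i]] == 1 else given0
        total += log(p_one if bit == 1 else 1.0 - p_one)
    return total
```

With zero covariance on every edge the score reduces to the independence assumption described in the first paragraph; a positive covariance raises the score of candidates whose linked properties agree.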

SIMPLE: Sparse Interaction Model over Peaks of moLEcules for fast, interpretable metabolite identification from tandem mass spectra
COSI: CompMS: Computational Mass Spectrometry
Date: Saturday, July 7

  • Dai Hai Nguyen, Kyoto University, Japan
  • Canh Hao Nguyen, Kyoto University, Japan
  • Hiroshi Mamitsuka, Kyoto University, Japan

Presentation Overview:

Motivation: Recent success in metabolite identification from tandem mass spectra has been led by machine learning, which has two stages: mapping mass spectra to molecular fingerprint vectors and then retrieving candidate molecules from the database. In the first stage, i.e. fingerprint prediction, spectrum peaks are features, and considering their interactions would be reasonable for more accurate identification of unknown metabolites. Existing approaches to fingerprint prediction are based only on individual peaks in the spectra, without explicitly considering peak interactions. Also, the current cutting-edge method is based on kernels, which makes it computationally heavy and its results hard to interpret.

Results: We propose two learning models that incorporate peak interactions for fingerprint prediction. First, we extend the state-of-the-art kernel learning method by developing kernels for peak interactions to combine with kernels for peaks through multiple kernel learning (MKL). Second, we formulate a sparse interaction model for metabolite peaks, which we call SIMPLE, that is computationally efficient and interpretable for fingerprint prediction. The formulation of SIMPLE is convex and guarantees global optimization, for which we develop an alternating direction method of multipliers (ADMM) algorithm. Experiments using the MassBank dataset show that both models achieved prediction accuracy comparable to the current top-performing kernel method. Furthermore, SIMPLE clearly revealed individual peaks and their interactions which contribute to enhancing the performance of fingerprint prediction.


Function SIG: Gene and Protein Function Annotation

DeepFam: Deep learning based alignment-free method for protein family modeling and prediction
COSI: Function SIG: Gene and Protein Function Annotation
Date: Saturday, July 7

  • Seokjun Seo, Seoul National University, South Korea
  • Minsik Oh, Seoul National University, South Korea
  • Youngjune Park, Seoul National University, South Korea
  • Sun Kim, Seoul National University, South Korea

Presentation Overview:

A large number of newly sequenced proteins are generated by next-generation sequencing technologies, and assigning biochemical functions to these proteins is an important task. However, biological experiments are too expensive to characterize such a large number of protein sequences, so protein function prediction is primarily done by computational modeling methods, such as profile hidden Markov models (pHMMs) and k-mer based methods. Nevertheless, existing methods have some limitations: k-mer based methods are not accurate enough to assign protein functions, and pHMMs are not fast enough to handle the large number of protein sequences from numerous genome projects. Therefore, a more accurate and faster protein function prediction method is needed.
In this paper, we introduce DeepFam, an alignment-free method that can extract functional information directly from sequences without the need for multiple sequence alignments. In extensive experiments using the Clusters of Orthologous Groups (COGs) and G protein-coupled receptor (GPCR) datasets, DeepFam achieved better performance in terms of accuracy and runtime for predicting functions of proteins than state-of-the-art methods, both alignment-free and alignment-based. Additionally, we showed that DeepFam is able to capture conserved regions to model protein families. In fact, DeepFam was able to detect conserved regions documented in the Prosite database while predicting functions of proteins. Our deep learning method will be useful in characterizing functions of the ever-increasing number of protein sequences.
Code is available at https://bhi-kimlab.github.io/DeepFam.

HFSP: High speed homology-driven function annotation of proteins
COSI: Function SIG: Gene and Protein Function Annotation
Date: Saturday, July 7

  • Yannick Mahlich, Technical University of Munich, Germany
  • Martin Steinegger, Max-Planck-Institute, Korea, Democratic People's Republic of
  • Burkhard Rost, Technical University of Munich, Germany
  • Yana Bromberg, Rutgers University, United States

Presentation Overview:

Motivation: The rapid drop in sequencing costs has produced many more (predicted) protein sequences than can feasibly be functionally annotated with wet-lab experiments. Thus, many computational methods have been developed for this purpose. Most of these methods employ homology-based inference, approximated via sequence alignments, to transfer functional annotations between proteins. The increase in the number of available sequences, however, has drastically increased the search space, thus significantly slowing down alignment methods.
Results: Here we describe HFSP, a novel computational method that uses results of a high-speed alignment algorithm, MMseqs2, to infer functional similarity of proteins on the basis of their alignment length and sequence identity. We show that our method is accurate (83% accuracy) and fast (more than 40-fold speed increase over state-of-the-art). HFSP can help correct at least a 20% error in legacy curations, even for a resource of as high quality as Swiss-Prot. These findings suggest HFSP as an ideal resource for large-scale functional annotation efforts.


General Computational Biology

Bayesian parameter estimation for biochemical reaction networks using region-based adaptive parallel tempering
COSI: General Computational Biology
Date: Saturday, July 7

  • Benjamin Ballnus, Helmholtz-Zentrum München, Germany
  • Steffen Schaper, Bayer AG -- Engineering & Technologies, Germany
  • Fabian J. Theis, Helmholtz-Zentrum München, Germany
  • Jan Hasenauer, Helmholtz-Zentrum München, Germany

Presentation Overview:

Motivation:
Mathematical models have become standard tools for the investigation of cellular processes and the unraveling of signal processing mechanisms. The parameters of these models are usually derived from the available data using optimization and sampling methods. However, the efficiency of these methods is limited by the properties of the mathematical model, e.g., non-identifiabilities, and the resulting posterior distribution. In particular, multi-modal distributions with long valleys or pronounced tails are difficult to optimize and sample. Thus, the development or improvement of optimization and sampling methods is subject to ongoing research.

Results:
We suggest a region-based adaptive parallel tempering algorithm which adapts to the problem-specific posterior distributions, i.e. modes and valleys. The algorithm combines several established algorithms to overcome their individual shortcomings and to improve sampling efficiency. We assessed its properties for established benchmark problems and two ordinary differential equation models of biochemical reaction networks. The proposed algorithm outperformed state-of-the-art methods in terms of calculation efficiency and mixing. Since the algorithm does not rely on a specific problem structure, but adapts to the posterior distribution, it is suitable for a variety of model classes.

Availability:
The code, written in MATLAB, is available both as supplementary material and in a Git repository.
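
For readers unfamiliar with parallel tempering, the basic scheme that the region-based adaptive variant builds on can be sketched in a few lines. This is plain, non-adaptive parallel tempering on a hypothetical two-mode toy density, written in Python rather than the authors' MATLAB; the target, the temperature ladder, and all function names are assumptions for illustration and do not reproduce the algorithm of the paper.

```python
import math
import random

def log_target(x):
    """Unnormalised log-density of a toy posterior with modes near -3 and +3."""
    a = -0.5 * ((x + 3.0) / 0.5) ** 2
    b = -0.5 * ((x - 3.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def swap_log_ratio(beta_i, beta_j, logp_i, logp_j):
    """Log acceptance ratio for exchanging the states of two tempered chains."""
    return (beta_i - beta_j) * (logp_j - logp_i)

def accept(rng, log_ratio):
    """Metropolis acceptance test that avoids taking log(0)."""
    return rng.random() < math.exp(min(0.0, log_ratio))

def parallel_tempering(n_iter, betas, step=1.0, seed=0):
    """Return samples of the chain at betas[0], which should equal 1.0."""
    rng = random.Random(seed)
    states = [0.0] * len(betas)
    samples = []
    for _ in range(n_iter):
        # Random-walk Metropolis update in each chain, targeting density ** beta.
        for k, beta in enumerate(betas):
            proposal = states[k] + rng.gauss(0.0, step)
            if accept(rng, beta * (log_target(proposal) - log_target(states[k]))):
                states[k] = proposal
        # Propose one swap between a random pair of neighbouring temperatures.
        k = rng.randrange(len(betas) - 1)
        ratio = swap_log_ratio(betas[k], betas[k + 1],
                               log_target(states[k]), log_target(states[k + 1]))
        if accept(rng, ratio):
            states[k], states[k + 1] = states[k + 1], states[k]
        samples.append(states[0])
    return samples
```

A call such as `parallel_tempering(5000, [1.0, 0.5, 0.1])` runs three chains; the flatter chains cross the valley between modes more easily and pass those states down through swaps, which is the mixing benefit the abstract refers to.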


CAMDA: Critical Assessment of Massive Data Analysis

Personalized Regression Enables Sample-Specific Pan-Cancer Analysis
COSI: CAMDA: Critical Assessment of Massive Data Analysis
Date: Saturday, July 7 and Sunday, July 8

  • Ben Lengerich, Carnegie Mellon University, United States
  • Bryon Aragam, Carnegie Mellon University, United States
  • Eric Xing, Carnegie Mellon University, United States

Presentation Overview:

Motivation: In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, many traditional methods for genomic analysis seek to estimate a single model which is shared by all samples in a population, ignoring this inter-sample heterogeneity entirely. In order to better understand patient heterogeneity, it is necessary to develop practical, personalized statistical models.
Results: To uncover this inter-sample heterogeneity, we propose a novel regularizer for achieving patient-specific personalized estimation. This regularizer operates by learning two latent distance metrics – one between personalized parameters and one between clinical covariates – and attempting to match the induced distances as closely as possible. Crucially, we do not assume these distance metrics are already known. Instead, we allow the data to dictate the structure of these latent distance metrics. Finally, we apply our method to learn patient-specific, interpretable models for a pan-cancer gene expression dataset containing samples from more than 30 distinct cancer types and find strong evidence of personalization effects between cancer types as well as between individuals. Our analysis uncovers sample-specific aberrations that are overlooked by population-level methods, suggesting a promising new path for precision analysis of complex diseases such as cancer.
Availability: Software for personalized linear and personalized logistic regression, along with code to reproduce experimental results, is freely available at github.com/blengerich/personalized_regression.
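
The distance-matching idea in the Results paragraph can be made concrete with a small sketch. In the paper both distance metrics are latent and learned from data; here, purely as a simplifying assumption, plain Euclidean distance stands in for both, and the function names are hypothetical. The penalty is small when samples with similar covariates also have similar personalized parameters.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matching_penalty(params, covariates):
    """Sum over sample pairs of (d(params_i, params_j) - d(cov_i, cov_j)) ** 2.

    params[i] is the parameter vector of sample i and covariates[i] its
    clinical covariate vector. Fixed Euclidean distances replace the learned
    metrics of the paper (an assumption made for illustration).
    """
    n = len(params)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            gap = (euclidean(params[i], params[j])
                   - euclidean(covariates[i], covariates[j]))
            total += gap ** 2
    return total
```

Adding a term like this to a per-sample regression loss discourages two patients with near-identical covariates from receiving very different models, while still allowing distant patients to differ.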


RNA: Computational RNA Biology

aliFreeFold: an alignment-free approach to predict secondary structure from homologous RNA sequences
COSI: RNA: Computational RNA Biology
Date: Saturday, July 7 and Sunday, July 8

  • Jean-Pierre Glouzon, University of Sherbrooke, Canada
  • Aida Ouangraoua, Université de Sherbrooke, Canada

Presentation Overview:

Motivation: Predicting the conserved secondary structure of homologous ribonucleic acid (RNA) sequences is crucial for understanding RNA functions. However, fast and accurate RNA structure prediction is challenging, especially when the number and the divergence of homologous RNA increase. To address this challenge, we propose aliFreeFold, based on a novel alignment-free approach which computes a representative structure from a set of homologous RNA sequences using suboptimal secondary structures generated for each sequence. It is based on a vector representation of suboptimal structures capturing structure conservation signals by weighting structural motifs according to their conservation across the suboptimal structures.

Results: We demonstrate that aliFreeFold provides a good balance between speed and accuracy regarding predictions of representative structures for sets of homologous RNA compared to traditional methods based on sequence and structure alignment. We show that aliFreeFold is capable of uncovering conserved structural features quickly and effectively thanks to its weighting scheme, which gives more (resp. less) importance to common (resp. uncommon) structural motifs. The weighting scheme is also shown to be capable of capturing conservation signal as the number of homologous RNA increases. These results demonstrate the ability of aliFreeFold to efficiently and accurately provide interesting structural representatives of RNA families.

Availability: aliFreeFold was implemented in C++. Source code and Linux binary are freely available at https://github.com/UdeS-CoBIUS/aliFreeFold.

Convolutional neural networks for classification of alignments of non-coding RNA sequences
COSI: RNA: Computational RNA Biology
Date: Saturday, July 7 and Sunday, July 8

  • Genta Aoki, Keio University, Japan
  • Yasubumi Sakakibara, Keio University, Japan

Presentation Overview:

Motivation: The convolutional neural network (CNN) has been applied to the classification problem of DNA sequences, with the additional purpose of motif discovery. The training of CNNs with distributed representations of four nucleotides has successfully derived position weight matrices on the learned kernels that corresponded to sequence motifs such as protein-binding sites.
Results: We propose a novel application of CNNs to classification of pairwise alignments of sequences for accurate clustering of sequences and show the benefits of the CNN method of inputting pairwise alignments for clustering of non-coding RNA (ncRNA) sequences and for motif discovery. Classification of a pairwise alignment of two sequences into positive and negative classes corresponds to the clustering of the input sequences. After we combined the distributed representation of RNA nucleotides with the secondary-structure information specific to ncRNAs and furthermore with mapping profiles of next-generation sequence reads, the training of CNNs for classification of alignments of RNA sequences yielded accurate clustering in terms of ncRNA families and outperformed the existing clustering methods for ncRNA sequences. Several interesting sequence motifs and secondary-structure motifs known for the snoRNA family and specific to microRNA and tRNA families were identified.

Dissecting newly transcribed and old RNA using GRAND-SLAM
COSI: RNA: Computational RNA Biology
Date: Saturday, July 7 and Sunday, July 8

  • Christopher Jürges, Julius-Maximilians-Universität Würzburg, Germany
  • Florian Erhard, Julius-Maximilians-Universität Würzburg, Germany

Presentation Overview:

Global quantification of total RNA is used to investigate steady state levels of gene expression. However, being able to differentiate pre-existing RNA (that has been synthesized prior to a defined point in time) and newly transcribed RNA can provide invaluable information, e.g. to estimate RNA half-lives or identify fast and complex regulatory processes. Recently, new techniques based on metabolic labeling and RNA-seq have emerged that allow quantification of new and old RNA: nucleoside analogs are incorporated into newly transcribed RNA and are made detectable as point mutations in mapped reads. However, relatively infrequent incorporation events and significant sequencing error rates make the differentiation between old and new RNA a highly challenging task.
We developed a statistical approach termed GRAND-SLAM that, for the first time, allows estimation of the proportion of old and new RNA in such an experiment. Uncertainty in the estimates is quantified in a Bayesian framework. Simulation experiments show our approach to be unbiased and highly accurate. Furthermore, we analyze how uncertainty in the proportion translates into uncertainty in estimating RNA half-lives and give guidelines for planning experiments. Finally, we demonstrate that our estimates of RNA half-lives compare favorably to other experimental approaches and that biological processes affecting RNA half-lives can be investigated with greater power than offered by other methods.
Availability: GRAND-SLAM is available under the Apache 2.0 license at http://software.erhard-lab.de; R scripts to generate all figures are available on Zenodo (doi:10.5281/zenodo.1162340).
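
One common way to model data of this kind is a two-component binomial mixture: each read either comes from old RNA, where apparent conversions arise only from sequencing errors, or from new RNA, where they also arise from analog incorporation. The sketch below computes a grid posterior for the fraction of new RNA under that assumption with a uniform prior. The abstract does not spell out GRAND-SLAM's exact model, so the mixture form, the treatment of both rates as known, and all names here are illustrative assumptions.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def new_fraction_posterior(reads, p_err, p_conv, grid_size=101):
    """Grid posterior over the fraction of new RNA under a uniform prior.

    reads is a list of (n_positions, n_conversions) pairs, one per read.
    p_err and p_conv are the per-position conversion rates in old and new
    RNA; both are treated as known here, which is a simplification.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = []
    for frac in grid:
        like = 1.0
        for n, k in reads:
            like *= (frac * binom_pmf(k, n, p_conv)
                     + (1.0 - frac) * binom_pmf(k, n, p_err))
        weights.append(like)
    total = sum(weights)
    return grid, [w / total for w in weights]
```

The spread of the returned posterior is what expresses uncertainty in the estimate. For realistic read counts the product of likelihoods should be accumulated in log space to avoid underflow.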


Education: Computational Biology Education

Training for translation between disciplines: a philosophy for life and data sciences curricula
COSI: Education: Computational Biology Education
Date: Sunday, July 8

  • K. Anton Feenstra, Vrije Universiteit Amsterdam, Netherlands
  • Sanne Abeln, Vrije Universiteit Amsterdam, Netherlands
  • Johan Westerhuis, University of Amsterdam, Netherlands
  • Filipe Brancos Dos Santos, Universiteit van Amsterdam, Netherlands
  • Douwe Molenaar, Vrije Universiteit Amsterdam, Netherlands
  • Bas Teusink, Vrije Universiteit Amsterdam, Netherlands
  • Huub Hoefsloot, Universiteit van Amsterdam, Netherlands
  • Jaap Heringa, Vrije Universiteit Amsterdam, Netherlands

Presentation Overview:

Motivation: Our society has become data-rich to the extent that research in many areas has become impossible without computational approaches. Educational programmes seem to be lagging behind this development. At the same time, there is a growing need not only for strong data-science skills, but foremost for the ability to translate between tools and methods on the one hand, and applications and problems on the other.
Results: Here we present our experiences with shaping and running a master's programme in Bioinformatics and Systems Biology in Amsterdam. From this, we have developed a comprehensive philosophy on how translation in training may be achieved in a dynamic and multidisciplinary research area, which is described here. We furthermore describe two requirements that enable translation, which we have found to be crucial: sufficient depth and focus on multidisciplinary topic areas, coupled with a balanced breadth from adjacent disciplines. Finally, we present concrete suggestions on how this may be implemented in practice, which may be relevant for the effectiveness of life science and data science curricula in general, and of particular interest to those who are in the process of setting up such curricula.


NetBio: Network Biology

An optimization framework for network annotation
COSI: NetBio: Network Biology
Date: Sunday, July 8

  • Sushant Patkar, University of Maryland, United States
  • Roded Sharan, Tel Aviv University, Israel

Presentation Overview:

A chief goal of systems biology is the reconstruction of large-scale executable models of cellular processes of interest. While accurate continuous models are still beyond reach, a powerful alternative is to learn a logical model of the processes under study, which predicts the logical state of any node of the model as a Boolean function of its incoming nodes. Key to learning such models is the functional annotation of the underlying physical interactions with activation/repression (sign) effects. Such annotations are fairly common for a few well-studied biological pathways. Here we present a novel optimization framework for large-scale sign annotation that employs different models of signaling and combines them in a rigorous manner. We apply our framework to two large-scale knockout datasets in yeast and evaluate its different components as well as the combined model on different subsets of physical interactions. Overall, we obtain an accurate predictor that outperforms previous work by a considerable margin.

Classifying tumors by supervised network propagation
COSI: NetBio: Network Biology
Date: Sunday, July 8

  • Wei Zhang, University of California San Diego, United States
  • Jianzhu Ma, University of California San Diego, United States
  • Trey Ideker, University of California San Diego, United States

Presentation Overview:

Motivation: Network propagation has been widely used to aggregate and amplify the effects of tumor mutations using knowledge of molecular interaction networks. However, propagating mutations through interactions irrelevant to cancer leads to erosion of pathway signals and complicates the identification of cancer subtypes.
Results: To address this problem we introduce a propagation algorithm, Network-Based Supervised Stratification (NBS2), which learns the mutated subnetworks underlying tumor subtypes using a supervised approach. Given an annotated molecular network and reference tumor mutation profiles for which subtypes have been predefined, NBS2 is trained by adjusting the weights on interaction features such that network propagation best recovers the provided subtypes. After training, weights are fixed such that mutation profiles of new tumors can be accurately classified. We evaluate NBS2 on breast and glioblastoma tumors, demonstrating that it outperforms the best network-based approaches in classifying tumors to known subtypes for these diseases. By interpreting the interaction weights, we highlight characteristic molecular pathways driving selected subtypes.
Availability: The NBS2 package is freely available at: https://github.com/wzhang1984/NBSS
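
Network propagation itself, before any supervision is added, is often implemented as a random walk with restart. The sketch below shows that unsupervised baseline on a small adjacency matrix; it is not the NBS2 code, it uses fixed edge weights where NBS2 learns them, and the function name and the default restart parameter are assumptions for illustration.

```python
def propagate(adjacency, initial, alpha=0.5, n_iter=100):
    """Random walk with restart: iterate F <- alpha * W F + (1 - alpha) * F0.

    adjacency is a square list of lists of non-negative edge weights, W is
    its column-normalised form, and initial (F0) holds the starting score of
    each gene, e.g. 1.0 for mutated genes and 0.0 otherwise.
    """
    n = len(adjacency)
    col_sums = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    scores = list(initial)
    for _ in range(n_iter):
        # Each node j passes its score to its neighbours in proportion to edge weight.
        spread = [
            sum(adjacency[i][j] / col_sums[j] * scores[j]
                for j in range(n) if col_sums[j] > 0)
            for i in range(n)
        ]
        scores = [alpha * s + (1.0 - alpha) * f0
                  for s, f0 in zip(spread, initial)]
    return scores
```

On a three-gene path with only the first gene mutated, `propagate([[0, 1, 0], [1, 0, 1], [0, 1, 0]], [1.0, 0.0, 0.0])` gives scores that decay with distance from the mutated gene. Making the entries of `adjacency` trainable is the point at which a supervised method departs from this baseline.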

PrimAlign: PageRank-Inspired Markovian Alignment for Large Biological Networks
COSI: NetBio: Network Biology
Date: Sunday, July 8

  • Karel Kalecky, Baylor University, United States
  • Young-Rae Cho, Baylor University, United States

Presentation Overview:

Motivation: Cross-species analysis of large-scale protein-protein interaction (PPI) networks has played a significant role in understanding the principles deriving evolution of cellular organizations and functions. Recently, network alignment algorithms have been proposed to predict conserved interactions and functions of proteins. These approaches are based on the notion that orthologous proteins across species are sequentially similar and that topology of PPIs between orthologs is often conserved. However, high accuracy and scalability of network alignment are still a challenge.
Results: We propose a novel pairwise global network alignment algorithm, called PrimAlign, which is modeled as a Markov chain and iteratively transited until convergence. The proposed algorithm also incorporates the principles of PageRank. This approach is evaluated on tasks with human, yeast, and fruit fly PPI networks. The experimental results demonstrate that PrimAlign outperforms several prevalent methods with statistically significant differences in multiple evaluation measures. PrimAlign, which is multiplatform, achieves superior performance in runtime with its linear asymptotic time complexity. Further evaluation is done with synthetic networks and results suggest that popular topological measures do not reflect real precision of alignments.
Availability: The source code is available at http://web.ecs.baylor.edu/faculty/cho/PrimAlign
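
The phrase "modeled as a Markov chain and iteratively transited until convergence" describes power iteration, the same computation used by PageRank. A minimal sketch of that computation on a generic row-stochastic matrix follows. It does not build the PrimAlign alignment graph; the damping value, tolerance, and function name are assumptions for illustration.

```python
def stationary_distribution(transition, damping=0.85, tol=1e-10, max_iter=1000):
    """PageRank-style power iteration on a row-stochastic transition matrix.

    With probability `damping` the walker follows the chain; otherwise it
    jumps to a uniformly chosen state. Iteration stops once the L1 change
    between successive distributions falls below `tol`.
    """
    n = len(transition)
    dist = [1.0 / n] * n
    for _ in range(max_iter):
        nxt = [
            (1.0 - damping) / n
            + damping * sum(dist[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist
```

This dense version costs time quadratic in the number of states per iteration. A sparse edge-list representation, where each step touches every edge once, is the usual route to the kind of linear scaling the abstract mentions.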


TransMed: Translational Medical Informatics

AnoniMME: Bringing Anonymity to the Matchmaker Exchange Platform for Rare Disease Gene Discovery
COSI: TransMed: Translational Medical Informatics
Date: Sunday, July 8

  • Bristena Oprisanu, University College London, United Kingdom
  • Emiliano De Cristofaro, University College London, United Kingdom

Presentation Overview:

Motivation: Advances in genome sequencing and genomics research are bringing us closer to a new era of personalized medicine, where healthcare can be tailored to the individual’s genetic makeup, and to more effective diagnosis and treatment of rare genetic diseases. Much of this progress depends on collaborations and access to genomes, and thus a number of initiatives have been introduced to support seamless data sharing. Among these, the Global Alliance for Genomics and Health runs a popular platform, called Matchmaker Exchange, which allows researchers to perform queries for rare genetic disease discovery over multiple federated databases. Queries include gene variations which are linked to rare diseases, and the ability to find other researchers that have seen or have interest in those variations is extremely valuable. Nonetheless, in some cases, researchers may be reluctant to use the platform since the queries they make (thus, what they are working on) are revealed to other researchers, and this creates concerns with privacy and competitive advantage.
Contributions: We present AnoniMME, a novel framework geared to enable anonymous queries within the Matchmaker Exchange platform. The framework, building on a cryptographic primitive called Reverse Private Information Retrieval (PIR), lets researchers anonymously query the federated platform in a multi-server setting. Specifically, they write their query, along with a public encryption key, anonymously in a public database. Responses are also supported, so that other researchers can respond to queries by providing their encrypted contact details.
Availability and Implementation: https://github.com/bristena-op/AnoniMME.

Association Mapping in Biomedical Time Series via Statistically Significant Shapelet Mining
COSI: TransMed: Translational Medical Informatics
Date: Sunday, July 8

  • Christian Bock, ETH Zurich, Switzerland
  • Thomas Gumbsch, ETH Zurich, Switzerland
  • Michael Moor, ETH Zurich, Switzerland
  • Bastian Rieck, ETH Zurich, Switzerland
  • Damian Roqueiro, ETH Zurich, Switzerland
  • Karsten Borgwardt, ETH Zurich, Switzerland

Presentation Overview:

Motivation: Most modern intensive care units continuously record the physiological and vital signs of patients. When processed, these data can be used to extract temporal signatures (biomarkers) that help physicians understand the biological complexity of many syndromes. However, most biological biomarkers suffer from either poor predictive performance or weak explanatory power. Recent developments in time series classification focus on discovering shapelets, i.e., subsequences that are most predictive in terms of class membership. Shapelets have the advantage of combining an interpretable component—their shape—with high predictive performance. Currently, most shapelet discovery methods do not rely on statistical tests to verify the significance of individual shapelets. Therefore, it is of the utmost importance to identify statistically significant associations between the shapelets of physiological biomarkers and patients that exhibit certain phenotypes of interest. This would enable the discovery and subsequent ranking of novel physiological signatures that are interpretable, statistically validated, and accurate predictors of clinical endpoints.
Results: We present a novel and scalable method for scanning time series and identifying discriminative patterns that are statistically significant. The significance of a shapelet is evaluated while considering the problem of multiple hypothesis testing and mitigating it by efficiently pruning untestable shapelet candidates with Tarone’s method. We demonstrate the utility of our method by discovering patterns in a patient’s vital signs (heart rate, respiratory rate, and systolic blood pressure) that are early indicators of the severity of a future sepsis event, i.e., an inflammatory response to an infective agent that can lead to organ failure and death, if not treated in time.
Availability: We make our method and the scripts that are required to reproduce the experiments publicly available at https://github.com/BorgwardtLab/S3M.
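As a rough illustration of the ideas in this abstract, the sketch below scores how well a candidate shapelet matches a time series and tabulates its presence against a binary phenotype. It is not taken from the S3M repository: the function names, the Euclidean distance, and the fixed distance threshold are all assumptions made for the example. A 2x2 table of this kind is the usual input to an association test, and the abstract's Tarone-based pruning would act on such tables.

```python
import math


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any
    equal-length window of the series (hypothetical helper)."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        d = math.sqrt(sum((series[start + i] - shapelet[i]) ** 2 for i in range(m)))
        best = min(best, d)
    return best


def contingency_table(dataset, labels, shapelet, threshold):
    """Count series by shapelet presence and class label.

    Rows: shapelet present (distance <= threshold), shapelet absent.
    Columns: label == 1, label != 1.
    The threshold is an assumed, user-chosen value.
    """
    table = [[0, 0], [0, 0]]
    for series, label in zip(dataset, labels):
        row = 0 if shapelet_distance(series, shapelet) <= threshold else 1
        col = 0 if label == 1 else 1
        table[row][col] += 1
    return table
```

For example, `contingency_table(patients, phenotypes, candidate, 0.5)` could be evaluated for each candidate subsequence, with the resulting tables passed to a significance test of the reader's choice.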

Driver gene mutations based clustering of tumors: methods and applications
COSI: TransMed: Translational Medical Informatics
Date: Sunday, July 8

  • Wensheng Zhang, Xavier University of Louisiana, United States
  • Erik Flemington, Tulane University, United States
  • Kun Zhang, Xavier University of Louisiana, United States

Presentation Overview:

Motivation: Somatic mutations in proto-oncogenes and tumor suppressor genes constitute a major category of causal genetic abnormalities in tumor cells. The mutation spectra of thousands of tumors have been generated by The Cancer Genome Atlas (TCGA) and other whole genome (exome) sequencing projects. A promising approach to utilizing these resources for precision medicine is to identify genetic similarity-based subtypes within a cancer type and relate the pinpointed subtypes to the clinical outcomes and pathologic characteristics of patients.
Results: We propose two novel methods, ccpwModel and xGeneModel, for mutation-based clustering of tumors. In the former, binary variables indicating the status of cancer driver genes in tumors and the genes’ involvement in the core cancer pathways are treated as the features in the clustering process. In the latter, the functional similarities of putative cancer driver genes and their confidence scores as the “true” driver genes are integrated with the mutation spectra to calculate the genetic distances between tumors. We apply both methods to the TCGA data of 16 cancer types. Promising results are obtained when these methods are compared to state-of-the-art approaches as to the associations between the determined tumor clusters and patient race (or survival time). We further extend the analysis to detect mutation-characterized transcriptomic prognostic signatures, which are directly relevant to the etiology of carcinogenesis.

GSEA-InContext: Identifying novel and common patterns in expression experiments
COSI: TransMed: Translational Medical Informatics
Date: Sunday, July 8

  • Rani Powers, University of Colorado, United States
  • Andrew Goodspeed, University of Colorado, United States
  • Harrison Pielke-Lombardo, University of Colorado, United States
  • Aik-Choon Tan, University of Colorado, United States
  • James Costello, University of Colorado, United States

Presentation Overview:

Motivation: Gene Set Enrichment Analysis (GSEA) is routinely used to analyze and interpret coordinate pathway-level changes in transcriptomics experiments. For an experiment where fewer than seven samples per condition are compared, GSEA employs a competitive null hypothesis to test significance. A gene set enrichment score is tested against a null distribution of enrichment scores generated from permuted gene sets, where genes are randomly selected from the input experiment. Looking across a variety of biological conditions, however, genes are not randomly distributed, with many showing consistent patterns of up- or down-regulation. As a result, common patterns of positively and negatively enriched gene sets are observed across experiments. Placing a single experiment into the context of a relevant set of background experiments allows us to identify both the common and experiment-specific patterns of gene set enrichment.

Results: We compiled a compendium of 442 small molecule transcriptomic experiments and used GSEA to characterize common patterns of positively and negatively enriched gene sets. To identify experiment-specific gene set enrichment, we developed the GSEA-InContext method that accounts for gene expression patterns within a background set of experiments to identify statistically significantly enriched gene sets. We evaluated GSEA-InContext on experiments using small molecules with known targets to show that it successfully prioritizes gene sets that are specific to each experiment, thus providing valuable insights that complement standard GSEA analysis.

Availability and Implementation: GSEA-InContext implemented in Python, supplemental results, and the background expression compendium are available at: https://github.com/CostelloLab/GSEA-InContext

Learning with multiple pairwise kernels for drug bioactivity prediction
COSI: TransMed: Translational Medical Informatics
Date: Sunday, July 8

  • Anna Cichonska, University of Helsinki, Finland
  • Tapio Pahikkala, University of Turku, Finland
  • Sandor Szedmak, Aalto University, Finland
  • Heli Julkunen, Aalto University, Finland
  • Antti Airola, University of Turku, Finland
  • Markus Heinonen, Aalto University, Finland
  • Tero Aittokallio, University of Helsinki, Finland
  • Juho Rousu, Aalto University, Finland

Presentation Overview:

Motivation: Many inference problems in bioinformatics, including drug bioactivity prediction, can be formulated as pairwise learning problems, in which one is interested in making predictions for pairs of objects, e.g., drugs and their targets. Kernel-based approaches have emerged as powerful tools for solving problems of that kind, and especially multiple kernel learning (MKL) offers promising benefits as it enables integrating various types of complex biomedical information sources in the form of kernels, along with learning their importance for the prediction task. However, the immense size of pairwise kernel spaces remains a major bottleneck, making the existing MKL algorithms computationally infeasible even for a small number of input pairs.
Results: We introduce pairwiseMKL, the first method for time- and memory-efficient learning with multiple pairwise kernels. pairwiseMKL first determines the mixture weights of the input pairwise kernels, and then learns the pairwise prediction function. Both steps are performed efficiently without explicit computation of the massive pairwise matrices, therefore making the method applicable to solving large pairwise learning problems. We demonstrate the performance of pairwiseMKL in two related tasks of quantitative drug bioactivity prediction using up to 167 995 bioactivity measurements and 3 120 pairwise kernels: (i) prediction of anticancer efficacy of drug compounds across a large panel of cancer cell lines; and (ii) prediction of target profiles of anticancer compounds across their kinome-wide target spaces. We show that pairwiseMKL provides accurate predictions using sparse solutions in terms of selected kernels, and therefore it also automatically identifies data sources relevant for the prediction problem.

LONGO: An R Package for Interactive Gene Length Dependent Analysis for Neuronal Identity
COSI: TransMed: Translational Medical Informatics
Date: Sunday, July 8

  • Matthew J. McCoy, Washington University, United States
  • Alexander J. Paul, Saint Louis University, United States
  • Matheus B. Victor, Washington University, United States
  • Michelle Richner, Washington University, United States
  • Harrison W. Gabel, Washington University, United States
  • Haijun Gong, Saint Louis University, United States
  • Andrew S. Yoo, Washington University, United States
  • Tae-Hyuk Ahn, Saint Louis University, United States

Presentation Overview:

Motivation: Reprogramming somatic cells into neurons holds great promise to model neuronal development and disease. The efficiency and success rate of neuronal reprogramming, however, may vary between different conversion platforms and cell types, thereby necessitating an unbiased, systematic approach to estimate neuronal identity of converted cells. Recent studies have demonstrated that long genes (>100 kb from transcription start to end) are highly enriched in neurons, which provides an opportunity to identify neurons based on the expression of these long genes.
Results: We have developed a versatile R package, LONGO, to analyze gene expression based on gene length. We propose a systematic analysis of long gene expression (LGE) with a metric termed the long gene quotient (LQ) that quantifies LGE in RNA-seq or microarray data to validate neuronal identity at the single-cell and population levels. This unique feature of neurons provides an opportunity to utilize measurements of LGE in transcriptome data to quickly and easily distinguish neurons from non-neuronal cells. By combining this conceptual advancement and statistical tool in a user-friendly and interactive software package, we intend to encourage and simplify further investigation into LGE, particularly as it applies to validating and improving neuronal differentiation and reprogramming methodologies.


VarI: Variant Interpretation

A pan-genome-based machine learning approach for predicting antimicrobial resistance activities of the Escherichia coli strains
COSI: VarI: Variant Interpretation
Date: Sunday, July 8

  • Hsuan-Lin Her, Taipei Medical University, Taiwan
  • Yu-Wei Wu, Taipei Medical University, Taiwan

Presentation Overview:

Motivation: Antimicrobial resistance (AMR) is becoming a huge problem in both developed and developing countries, and identifying strains resistant or susceptible to certain antibiotics is essential in fighting against antibiotic-resistant pathogens. Whole-genome sequences have been collected for different microbial strains in order to identify crucial characteristics that allow certain strains to become resistant to antibiotics; however, a global inspection of the gene content responsible for AMR activities remains to be done.
Results: We propose a pan-genome-based approach to characterize antibiotic-resistant microbial strains and test this approach on the bacterial model organism Escherichia coli. By identifying core and accessory gene clusters and predicting AMR genes for the E. coli pan-genome, we not only showed that certain classes of genes are unevenly distributed between the core and accessory parts of the pan-genome but also demonstrated that only a portion of the identified AMR genes belong to the accessory genome. Application of machine learning algorithms to predict whether specific strains were resistant to antibiotic drugs yielded the best prediction accuracy for the set of AMR genes within the accessory part of the pan-genome, suggesting that these gene clusters were most crucial to AMR activities. Selecting subsets of AMR genes for different antibiotic drugs based on a genetic algorithm achieved better prediction performances than the gene sets established in the literature, hinting that the gene sets selected by the genetic algorithm may warrant further analysis in investigating more details about how E. coli fight against antibiotic drugs.

DisruPPI: Structure-based computational redesign algorithm for protein binding disruption
COSI: VarI: Variant Interpretation
Date: Sunday, July 8

  • Yoonjoo Choi, Korea Advanced Institute of Science and Technology, South Korea
  • Jacob Furlon, Dartmouth College, United States
  • Ryan Amos, Princeton University, United States
  • Karl Griswold, Dartmouth College, United States
  • Chris Bailey-Kellogg, Dartmouth College, United States

Presentation Overview:

Motivation: Disruption of protein-protein interactions can mitigate antibody recognition of therapeutic proteins, yield monomeric forms of oligomeric proteins, and elucidate signaling mechanisms, among other applications. While designing affinity-enhancing mutations remains generally quite challenging, both statistically- and physically-based computational methods can precisely identify affinity-reducing mutations. In order to leverage this ability to design variants of a target protein with disrupted interactions, we developed the DisruPPI protein design method (DISRUpting Protein-Protein Interactions) to optimize combinations of mutations simultaneously for both disruption and stability, so that incorporated disruptive mutations do not inadvertently affect the target protein adversely.
Results: Two existing methods for predicting mutational effects on binding, FoldX and INT5, were demonstrated to be quite precise in selecting disruptive mutations from the SKEMPI and AB-Bind databases of experimentally determined changes in binding free energy. DisruPPI was implemented to use an INT5-based disruption score integrated with an AMBER-based stability assessment, and was applied to disrupt protein interactions in a set of different targets representing diverse applications. In retrospective evaluation with three different case studies, comparison of DisruPPI-designed variants to published experimental data showed that DisruPPI was able to identify more diverse interaction-disrupting and stability-preserving variants more efficiently and effectively than previous approaches. In prospective application to an interaction between enhanced green fluorescent protein (EGFP) and a nanobody, DisruPPI was used to design five EGFP variants, all of which were shown to have significantly reduced nanobody binding while maintaining function and thermostability. This demonstrates that DisruPPI may be readily utilized for effective removal of known epitopes of therapeutically-relevant proteins.
Availability: DisruPPI is implemented in the EpiSweep package, freely available under an academic use license.

Finding associated variants in genome-wide association studies on multiple traits
COSI: VarI: Variant Interpretation
Date: Sunday, July 8

  • Lisa Gai, University of California, Los Angeles, United States
  • Eleazar Eskin, University of California, Los Angeles, United States

Presentation Overview:

Many variants identified by genome-wide association studies (GWAS) have been found to affect multiple traits, either directly or through shared pathways. There is currently a wealth of GWAS data collected in numerous phenotypes, and analyzing multiple traits at once can increase power to detect shared variant effects. However, traditional meta-analysis methods are not suitable for combining studies on different traits. When applied to dissimilar studies, these meta-analysis methods can be underpowered compared to univariate analysis. The degree to which traits share variant effects is often not known, and the vast majority of GWAS meta-analysis only consider one trait at a time. Here we present a flexible method for finding associated variants from GWAS summary statistics for multiple traits. Our method estimates the degree of shared effects between traits from the data. Using simulations, we show that our method properly controls the false positive rate and increases power when an effect is present in a subset of traits. We then apply our method to the North Finland Birth Cohort and UK Biobank data sets using a variety of metabolic traits and discover novel loci.


HitSeq: High-throughput Sequencing

A graph-based approach to diploid genome assembly
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Shilpa Garg, MPI-INF, Germany
  • Mikko Rautiainen, Max Planck Institute for Informatics, Germany
  • Adam Novak, UC Santa Cruz, United States
  • Erik Garrison, Wellcome Trust Sanger Institute, United Kingdom
  • Richard Durbin, Wellcome Trust Sanger Institute, United Kingdom
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Presentation Overview:

Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community.
Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50x coverage Illumina data and 10x PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants.

A space and time-efficient index for the compacted colored de Bruijn graph
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Fatemeh Almodaresi, Stony Brook University, United States
  • Hirak Sarkar, Stony Brook University, United States
  • Avi Srivastava, Stony Brook University, United States
  • Robert Patro, Stony Brook University, United States

Presentation Overview:

Motivation: Indexing large ensembles of genomic sequences is an important building block for various sequence analysis pipelines. De Bruijn graphs are extensively used for representing large genomic indices, although the direct sequence query in a de Bruijn graph is, in general, time consuming and computationally intensive. This substantially slows down downstream methods, such as those used for mapping and alignment. Therefore, a fast, succinct and exact graph based index can be instrumental for performing efficient and accurate genomic analyses at a large scale.
Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences.
Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.
Availability: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

A Spectral Clustering-Based Method for Identifying Clones from High-throughput B cell Repertoire Sequencing Data
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Nima Nouri, Yale School of Medicine, Yale University, United States
  • Steven H. Kleinstein, Yale University School of Medicine, United States

Presentation Overview:

B cells derive their antigen-specificity through the expression of Immunoglobulin (Ig) receptors on their surface. These receptors are initially generated stochastically by somatic rearrangement of the DNA and further diversified following antigen-activation by a process of somatic hypermutation, which introduces mainly point substitutions into the receptor DNA at a high rate. Recent advances in next-generation sequencing (NGS) have enabled large-scale profiling of the B cell Ig repertoire from blood and tissue samples. A key computational challenge in the analysis of these data is partitioning the sequences to identify descendants of a common B cell (i.e. a clone). Current methods group sequences using a fixed distance threshold, or a likelihood calculation that is computationally-intensive. Here, we propose a new method based on spectral clustering with an adaptive threshold to determine the local sequence neighborhood. Validation using simulated and experimental data sets demonstrates that this method has high sensitivity and specificity compared to a fixed threshold that is optimized for these measures. In addition, this method works on data sets where choosing an optimal fixed threshold is difficult and is more computationally efficient in all cases. The ability to quickly and accurately identify members of a clone from repertoire sequencing data will greatly improve downstream analyses. Clonally-related sequences cannot be treated independently in statistical models, and clonal partitions are used as the basis for the calculation of diversity metrics, lineage reconstruction and selection analysis. Thus, the spectral clustering-based method here represents an important contribution to repertoire analysis.

AmpUMI: Design and analysis of unique molecular identifiers for deep amplicon sequencing
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Kendell Clement, Harvard University, United States
  • Rick Farouni, Harvard University, United States
  • Daniel Bauer, Boston Children's Hospital, United States
  • Luca Pinello, Harvard University, United States

Presentation Overview:

Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take into account UMI information in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments.
Results: Based on the total number of DNA fragments and the distribution of allele frequencies, we present a model for the determination of the minimum UMI length required to prevent UMI collisions and reduce allelic distortion. We also introduce a user-friendly software tool called AmpUMI to assist in the design and the analysis of UMI-based amplicon sequencing studies. AmpUMI provides quality control metrics on frequency and quality of UMIs, and trims and deduplicates amplicon sequences with user specified parameters for use in downstream analysis. AmpUMI is open-source and freely available at http://github.com/pinellolab/AmpUMI
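To make the notion of a UMI collision concrete, the following sketch estimates the shortest UMI length for which the probability of any two molecules receiving the same UMI stays below a chosen bound. It is a simplified birthday-problem calculation, not the AmpUMI model: it assumes UMIs are drawn uniformly from all 4^L sequences and ignores the allele-frequency distribution that the abstract says the authors take into account. The function names are hypothetical.

```python
import math


def collision_probability(num_molecules, umi_length):
    """Probability that at least two molecules share a UMI, assuming
    UMIs are drawn uniformly at random from 4**umi_length sequences."""
    space = 4 ** umi_length
    if num_molecules > space:
        return 1.0  # pigeonhole: a collision is certain
    # log P(no collision) = sum over i of log(1 - i / space)
    log_no_collision = sum(math.log1p(-i / space) for i in range(num_molecules))
    return 1.0 - math.exp(log_no_collision)


def min_umi_length(num_molecules, max_collision_prob, max_length=30):
    """Smallest UMI length whose collision probability is at most
    max_collision_prob (max_length is an arbitrary search cap)."""
    for length in range(1, max_length + 1):
        if collision_probability(num_molecules, length) <= max_collision_prob:
            return length
    raise ValueError("no UMI length up to max_length meets the bound")
```

With two molecules and single-base UMIs the collision probability is 1/4, so `min_umi_length(2, 0.1)` moves on to length 2, where the probability drops to 1/16.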

Asymptotically optimal minimizers schemes
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Guillaume Marçais, Carnegie Mellon University, United States
  • Dan DeBlasio, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview:

Motivation: The minimizers technique is a method to sample k-mers that is used in many bioinformatics software tools to reduce computation, memory usage and run time. The number of applications using minimizers keeps on growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizers schemes, and the related local and forward schemes, will allow designing schemes with lower density, thereby making existing and future bioinformatics tools even more efficient.

Results: From the analysis of the asymptotic behavior of minimizers, forward and local schemes, we show that the previously believed lower bound on minimizers schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the three types of schemes.
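For readers unfamiliar with the technique, the sketch below shows a basic minimizers scheme: in every window of w consecutive k-mers, the smallest k-mer under a chosen order is selected. The example uses plain lexicographic order with leftmost tie-breaking, which is an assumption made for illustration and is not one of the asymptotically optimal orders described in the abstract. The `density` helper reports the fraction of k-mers selected in one particular sequence, which only approximates the expected density studied in the paper.

```python
def minimizer_positions(seq, k, w):
    """Start positions of the k-mers selected by a (w, k) minimizers
    scheme using lexicographic order (leftmost k-mer wins ties)."""
    n_kmers = len(seq) - k + 1
    if k <= 0 or w <= 0 or n_kmers < w:
        raise ValueError("sequence too short for the given k and w")
    selected = set()
    for win in range(n_kmers - w + 1):
        # Pick the smallest k-mer among the w k-mers in this window.
        best = min(range(win, win + w), key=lambda p: (seq[p:p + k], p))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w):
    """Fraction of the sequence's k-mers that are selected."""
    return len(minimizer_positions(seq, k, w)) / (len(seq) - k + 1)
```

Swapping the sort key for a different order on k-mers (for example a hash of the k-mer) changes which positions are selected and therefore the density, which is the dependence the abstract refers to.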

Haplotype Phasing in Single-Cell DNA Sequencing Data
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Gryte Satas, Brown University, United States
  • Ben Raphael, Princeton University, United States

Presentation Overview:

Motivation: Single-cell DNA sequencing is a promising technology that allows researchers to examine the genomic content of individual cells. Because the amount of DNA in a single cell is too little to sequence directly, single-cell sequencing requires a method of whole-genome amplification (WGA). WGA introduces biases in the data, including high rates of allelic drop out and non-uniform coverage of the genome. These biases confound many downstream analyses, including the detection of genomic variants. Here, we show that amplification biases have a potential upside: long range correlations in the rates of allelic drop out give a signal for phasing haplotypes at the lengths of PCR amplicons from WGA, rather than individual sequence reads.

Results: We describe a statistical test to evaluate concurrent allelic drop out between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. Using this model, we derive an algorithm to perform haplotype assembly on single cells. We demonstrate that the algorithm predicts SNP-pair phasing with high accuracy using whole-genome sequencing data from only seven single cells, and results in haplotype blocks (median length 10.2kb) that are orders of magnitude longer than with sequence reads alone (median length 312bp), with low switch error rates (< 2%). We demonstrate similar advantages on whole-exome data, where we obtain haplotype blocks with lengths on the order of typical gene lengths (median length 9.2kb) compared to median lengths of 41bp with sequence reads alone, with low switch error rates (< 4%).

Novo&Stitch: Accurate Reconciliation of Genome Assemblies via Optical Maps
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Weihua Pan, University of California, Riverside, United States
  • Steve Wanamaker, UC Riverside, United States
  • Audrey Ah-Fong, UC Riverside, United States
  • Howard Judelson, UC Riverside, United States
  • Stefano Lonardi, UC Riverside, United States

Presentation Overview:

Motivation: De novo genome assembly is a challenging computational problem due to the high repetitive content of eukaryotic genomes and the imperfections of sequencing technologies (i.e., sequencing errors, uneven sequencing coverage, and chimeric reads). Several assembly tools are currently available, each of which has strengths and weaknesses in dealing with the trade-off between maximizing contiguity and minimizing assembly errors (e.g., mis-joins). To obtain the best possible assembly, it is common practice to generate multiple assemblies from several assemblers and/or parameter settings and try to identify the highest quality assembly. Unfortunately, often there is no assembly that both maximizes contiguity and minimizes assembly errors, so one has to compromise one for the other.

Results: The concept of assembly reconciliation has been proposed as a way to obtain a higher quality assembly by merging or reconciling all the available assemblies. While several reconciliation methods have been introduced in the literature, we have shown in (Alhakami et al., 2017) that none of them can consistently produce assemblies that are better than the assemblies provided as input. Here we introduce Novo&Stitch, a novel method that takes advantage of optical maps to accurately carry out assembly reconciliation (assuming that the assembled contigs are sufficiently long to be reliably aligned to the optical maps, e.g., 50 Kbp or longer). Experimental results demonstrate that Novo&Stitch can double the contiguity (N50) of the input assemblies without introducing mis-joins or reducing genome completeness.

Availability: Novo&Stitch can be obtained from https://github.com/ucrbioinfo/Novo_Stitch
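The contiguity measure N50 cited in the Results above can be computed from contig lengths alone. The sketch below uses the common definition (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length). It is a generic illustration and is not taken from the Novo&Stitch code.

```python
def n50(contig_lengths):
    """N50 of an assembly: walk contigs from longest to shortest and
    return the length at which the running total first reaches half
    of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig length is required")
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
```

For contigs of lengths 80, 70, 50, 40, 30 and 20 (total 290), the two longest contigs already cover 150, so the N50 is 70. Merging contigs during reconciliation raises this value, which is the sense in which contiguity is said to double.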

Strand-seq Enables Reliable Separation of Long Reads by Chromosome via Expectation Maximization
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Maryam Ghareghani, Max Planck Institute for Informatics, Germany
  • David Porubsky, Max Planck Institute for Informatics, Germany
  • Ashley Sanders, European Molecular Biology Laboratory, Germany
  • Sascha Meiers, European Molecular Biology Laboratory, Germany
  • Evan Eichler, University of Washington, United States
  • Jan Korbel, EMBL, Germany
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Presentation Overview:

Motivation: Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.
Results: We show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization (EM) algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome.
For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it allows one to assess the amount of uncertainty inherent to sparse Strand-seq data on the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of PacBio reads with 30.1x coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.

Versatile genome assembly evaluation with QUAST-LG
COSI: HitSeq: High-throughput Sequencing
Date: Sunday, July 8 and Monday, July 9

  • Alla Mikheenko, St. Petersburg State University, Russia
  • Andrey Prjibelski, St. Petersburg State University, Russia
  • Vladislav Saveliev, St. Petersburg State University, Russia
  • Dmitry Antipov, St. Petersburg State University, Russia
  • Alexey Gurevich, St. Petersburg State University, Russia

Presentation Overview:

Motivation: The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian-size genomes.

Results: In this manuscript we demonstrate the performance of state-of-the-art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST-LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce a concept of upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference.

Availability and implementation: http://cab.spbu.ru/software/quast-lg
Contact: aleksey.gurevich@spbu.ru
Supplementary information: Supplementary data are available.


BioVis: Biological Data Visualization

NeuroMorphoVis: a collaborative framework for visualization and analysis of neuronal morphology skeletons reconstructed from microscopy stacks
COSI: BioVis: Biological Data Visualization
Date: Monday, July 9

  • Marwan Abdellah, Blue Brain Project / EPFL, Switzerland
  • Juan Hernando, Blue Brain Project / EPFL, Switzerland
  • Stefan Eilemann, Blue Brain Project / EPFL, Switzerland
  • Samuel Lapere, Blue Brain Project / EPFL, Switzerland
  • Nicolas Antille, Blue Brain Project / EPFL, Switzerland
  • Henry Markram, Blue Brain Project / EPFL, Switzerland
  • Felix Schürmann, Blue Brain Project / EPFL, Switzerland

Presentation Overview: Show

Motivation: From image stacks to computational models, processing digital representations of neuronal morphologies is essential to neuroscientific research. Workflows involve various techniques and tools, leading in certain cases to convoluted and fragmented pipelines.
The existence of an integrated, extensible and free framework for processing, analysis and visualization of those morphologies is a challenge that is still largely unfulfilled.

Results: We present NeuroMorphoVis, an interactive, extensible and cross-platform framework for building, visualizing and analyzing digital reconstructions of neuronal morphology skeletons extracted from microscopy stacks.
Our framework is capable of detecting and repairing tracing artifacts, allowing the generation of high fidelity surface meshes and high resolution volumetric models for simulation and in silico imaging studies.
The applicability of NeuroMorphoVis is demonstrated with two case studies.
The first simulates the construction of three-dimensional profiles of neuronal somata and the other highlights how the framework is leveraged to create volumetric models of neuronal circuits for simulating different types of in vitro imaging experiments.

Availability and implementation: The source code and documentation are freely available on https://github.com/BlueBrain/NeuroMorphoVis under the GNU public license.
The morphological analysis, visualization and surface meshing are implemented as an extensible Python API (Application Programming Interface) based on Blender, and the volume reconstruction and analysis code is written in C++ and parallelized using OpenMP. The framework features are accessible from a user-friendly GUI (Graphical User Interface) and a rich CLI (Command Line Interface).

The Kappa platform for rule-based modeling
COSI: BioVis: Biological Data Visualization
Date: Monday, July 9

  • Pierre Boutillier, Harvard University, United States
  • Mutaamba Maasha, Harvard University, United States
  • Xing Li, Edgewise Networks, United States
  • Hector Medina-Abarca, Harvard University, United States
  • Jean Krivine, Universite Paris 7, France
  • Jerome Feret, Ecole Normale Superieure Paris, France
  • Ioana Cristescu, Harvard University, United States
  • Angus Forbes, University of California at Santa Cruz, United States
  • Walter Fontana, Harvard University, United States

Presentation Overview: Show

Motivation: We present an overview of the Kappa platform, an integrated suite of analysis and visualization techniques for building and interactively exploring rule-based models. The main components of the platform are the Kappa Simulator, the Kappa Static Analyzer, and the Kappa Story Extractor. In addition to these components, we describe the Kappa User Interface, which includes a range of interactive visualization tools for rule-based models needed to make sense of the complexity of biological systems. We argue that, in this approach, modeling is akin to programming and can likewise benefit from an integrated development environment. Our platform is a step in this direction.

Results: We discuss details about the computation and rendering of static, dynamic, and causal views of a model, which include the contact map, snapshots at different resolutions, the dynamic influence network, and causal compression. We provide use cases illustrating how these concepts generate insight. Specifically, we show how the contact map and snapshots provide information about systems capable of polymerization, such as Wnt signaling. A well-understood model of the KaiABC oscillator, translated into Kappa from the literature, is deployed to demonstrate the dynamic influence network and its use in understanding systems dynamics. Finally, we discuss how pathways might be discovered or recovered from a rule-based model by means of causal compression, as exemplified for early events in EGF signaling.

Availability: The Kappa platform is available via the project website at kappalanguage.org. All components of the platform are open source and freely available through the authors' code repositories.


Evolution and Comparative Genomics

Accurate prediction of orthologs in the presence of divergence after duplication
COSI: Evolution and Comparative Genomics
Date: Monday, July 9

  • Manuel Lafond, University of Ottawa, Canada
  • Mona Meghdari Miardan, University of Ottawa, Canada
  • David Sankoff, University of Ottawa, Canada

Presentation Overview: Show

Motivation: When gene duplication occurs, one of the copies may become free of selective pressure and evolve at an accelerated pace. This has important consequences on the prediction of orthology relationships, since two orthologous genes separated by divergence after duplication may differ in both sequence and function. In this work, we make the distinction between the primary orthologs, which have not been affected by accelerated mutation rates on their evolutionary path, and the secondary orthologs, which have.
Similarity-based prediction methods will tend to miss secondary orthologs, whereas phylogeny-based methods cannot separate primary and secondary orthologs. However, both types of orthology have applications in important areas such as gene function prediction and phylogenetic reconstruction, motivating the need for methods that can distinguish the two types.

Results: We formalize the notion of divergence after duplication, and provide a theoretical basis for the inference of primary and secondary orthologs.
We then put these ideas to practice with the HyPPO (Hybrid Prediction of Paralogs and Orthologs) framework, which combines ideas from both similarity and phylogeny approaches. We apply our method to simulated and empirical datasets, and show that we achieve superior accuracy in predicting primary orthologs, secondary orthologs and paralogs.
Availability: HyPPO is a modular framework with a core developed in Python, and is provided with a variety of C++ modules. The source code is available at https://github.com/manuellafond/HyPPO.

An evolutionary model motivated by physico-chemical properties of amino acids reveals variation among proteins
COSI: Evolution and Comparative Genomics
Date: Monday, July 9

  • Edward Braun, University of Florida, United States

Presentation Overview: Show

Motivation: The relative rates of amino acid interchanges over evolutionary time are likely to vary among proteins. Variation in those rates has the potential to reveal information about constraints on proteins. However, the most straightforward model that could be used to estimate relative rates of amino acid substitution is parameter-rich and it is therefore impractical to use for this purpose.
Results: A six-parameter model of amino acid substitution that incorporates information about the physicochemical properties of amino acids was developed. It showed that amino acid side chain volume, polarity, and aromaticity have major impacts on protein evolution. It also revealed variation among proteins in the relative importance of those properties. The same general approach can be used to improve the fit of empirical models, like the commonly used PAM and LG models.
Availability: Perl code and test data are available from https://github.com/ebraun68/sixparam

Deconvolution and phylogeny inference of structural variations in tumor genomic samples
COSI: Evolution and Comparative Genomics
Date: Monday, July 9

  • Jesse Eaton, Carnegie Mellon University, United States
  • Jingyi Wang, Carnegie Mellon University, United States
  • Russell Schwartz, Carnegie Mellon University, United States

Presentation Overview: Show

Motivation: Phylogenetic reconstruction of tumor evolution has emerged as a crucial tool for making sense of the complexity of emerging cancer genomic data sets. Despite the growing use of phylogenetics in cancer studies, though, the field has only slowly adapted to many ways that tumor evolution differs from classic species evolution. One crucial question in that regard is how to handle inference of structural variations (SVs), which are a major mechanism of evolution in cancers but have been largely neglected in tumor phylogenetics to date, in part due to the challenges of reliably detecting and typing SVs and interpreting them phylogenetically.

Results: We present a novel method for reconstructing evolutionary trajectories of SVs from bulk whole-genome sequence data via joint deconvolution and phylogenetics, to infer clonal subpopulations and reconstruct their ancestry. We establish a novel likelihood model for joint deconvolution and phylogenetic inference on bulk SV data and formulate an associated optimization algorithm. We demonstrate the approach to be efficient and accurate for realistic scenarios of SV mutation on simulated data. Application to breast cancer genomic data from The Cancer Genome Atlas (TCGA) shows it to be practical and effective at reconstructing features of SV-driven evolution in single tumors.

Availability and Implementation: Python source code and associated documentation are available at https://github.com/jaebird123/tusv

Contact: Russell Schwartz (russells@andrew.cmu.edu)

Inference of Species Phylogenies from Bi-allelic Markers Using Pseudo-likelihood
COSI: Evolution and Comparative Genomics
Date: Monday, July 9

  • Jiafan Zhu, Rice University, United States
  • Luay Nakhleh, Rice University, United States

Presentation Overview: Show

Phylogenetic networks represent reticulate evolutionary histories. Statistical methods for their inference under the multispecies coalescent have recently been developed. A particularly powerful approach uses data that consist of bi-allelic markers (e.g., single nucleotide polymorphism data) and allows for exact likelihood computations of phylogenetic networks while numerically integrating over all possible gene trees per marker. While the approach has good accuracy in terms of estimating the network and its parameters, likelihood computations remain a major computational bottleneck and limit the method's applicability. In this paper, we first demonstrate why likelihood computations of networks take orders of magnitude more time when compared to trees. We then propose an approach for inference of phylogenetic networks based on pseudo-likelihood using bi-allelic markers. We demonstrate the scalability and accuracy of phylogenetic network inference via pseudo-likelihood computations on simulated data. Furthermore, we demonstrate aspects of robustness of the method to violations in the underlying assumptions of the employed statistical model. Finally, we demonstrate the application of the method to biological data. The proposed method allows for analyzing larger data sets in terms of the numbers of taxa and reticulation events. While pseudo-likelihood had been proposed before for data consisting of gene trees, the work here uses sequence data directly, offering several advantages as we discuss. The methods have been implemented in PhyloNet (http://bioinfocs.rice.edu/phylonet).


3DSIG: Structural Bioinformatics and Computational Biophysics

A novel methodology on distributed representations of proteins using their interacting ligands
COSI: 3DSIG: Structural Bioinformatics and Computational Biophysics
Date: Monday, July 9 and Tuesday, July 10

  • Hakime Öztürk, Boğaziçi University, Turkey
  • Elif Ozkirimli, Bogazici University, Turkey
  • Arzucan Ozgur, Bogazici University, Turkey

Presentation Overview: Show

Motivation: The effective representation of proteins is a crucial task that directly affects the performance of many bioinformatics problems. Related proteins usually bind to similar ligands. Chemical characteristics of ligands are known to capture the functional and mechanistic properties of proteins suggesting that a ligand based approach can be utilized in protein representation.

Methods: In this study, we propose SMILESVec, a SMILES-based method to represent ligands, and a novel method to compute the similarity of proteins by describing them based on their ligands. The proteins are defined utilizing the word embeddings of the SMILES strings of their ligands. The performance of the proposed protein description method is evaluated in a protein clustering task using the TransClust and MCL algorithms. It is compared against two protein representation methods that utilize protein sequence, BLAST and ProtVec, and two compound-fingerprint-based protein representation methods.
Results: We showed that ligand-based protein representation, which uses only SMILES strings of the ligands that proteins bind to, performs as well as protein-sequence-based representation methods in protein clustering. The results suggest that ligand-based protein description can be an alternative to the traditional sequence- or structure-based representation of proteins, and that this novel approach can be applied to different bioinformatics problems such as prediction of new protein-ligand interactions and protein function annotation.
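
As a rough illustration of the ligand-based description above, the following Python sketch builds a protein vector by averaging vectors of overlapping SMILES substrings ("chemical words") over a protein's ligands. It is not the authors' implementation: the word length of 8, the averaging scheme, the toy_word_vector stand-in for trained word embeddings, and the example ligand sets are all assumptions made for illustration.

```python
import random


def chemical_words(smiles, k=8):
    """Split a SMILES string into overlapping substrings of length k."""
    if len(smiles) <= k:
        return [smiles]
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


def toy_word_vector(word, dim=16):
    """Hypothetical stand-in for a trained embedding lookup.

    Seeding the generator with the word makes the vector deterministic,
    but it carries no chemical meaning; a real system would use vectors
    learned from a large SMILES corpus.
    """
    rng = random.Random(word)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def ligand_vector(smiles, k=8, dim=16):
    """Average the word vectors of one ligand (assumed pooling scheme)."""
    return mean_vector([toy_word_vector(w, dim) for w in chemical_words(smiles, k)])


def protein_vector(ligand_smiles, k=8, dim=16):
    """Describe a protein by the mean vector of its ligands."""
    return mean_vector([ligand_vector(s, k, dim) for s in ligand_smiles])


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


# Made-up ligand sets for two hypothetical proteins.
ligands_1 = ["CC(=O)OC1=CC=CC=C1C(=O)O", "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"]
ligands_2 = ["CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]
similarity = cosine(protein_vector(ligands_1), protein_vector(ligands_2))
```

A pairwise similarity matrix computed this way could then be passed to a clustering algorithm; with the toy vectors used here the similarity values themselves are not meaningful.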

An integration of fast alignment and maximum-likelihood methods for electron subtomogram averaging and classification
COSI: 3DSIG: Structural Bioinformatics and Computational Biophysics
Date: Monday, July 9 and Tuesday, July 10

  • Yixiu Zhao, Carnegie Mellon University, United States
  • Xiangrui Zeng, Carnegie Mellon University, United States
  • Qiang Guo, Max Planck Institute for Biochemistry, Germany
  • Min Xu, Carnegie Mellon University, United States

Presentation Overview: Show

Motivation: Cellular Electron CryoTomography (CECT) is an emerging 3D imaging technique that visualizes subcellular organization of single cells at sub-molecular resolution and in near-native state. CECT captures large numbers of macromolecular complexes of highly diverse structures and abundances. However, the structural complexity and imaging limits complicate the systematic de novo structural recovery and recognition of these macromolecular complexes. Efficient and accurate reference-free subtomogram averaging and classification represent the most critical tasks for such analysis. Existing subtomogram alignment based methods are prone to the missing wedge effects and low signal-to-noise ratio (SNR). Moreover, existing maximum-likelihood based methods rely on integration operations, which are in principle computationally infeasible for accurate calculation.

Results: Building on existing work, we propose an integrated method, the Fast Alignment Maximum Likelihood method (FAML), which uses fast subtomogram alignment to sample sub-optimal rigid transformations. The transformations are then used to approximate integrals for the maximum-likelihood update of subtomogram averages through an expectation-maximization algorithm. Our tests on simulated and experimental subtomograms showed that, compared to our previously developed fast alignment method (FA), FAML is significantly more robust to noise and missing wedge effects, with a moderate increase in computation cost. In addition, FAML performs well with significantly fewer input subtomograms when the FA method fails. Therefore, FAML can serve as a key component for improved construction of initial structural models from macromolecules captured by CECT.

Protein threading using residue co-variation and deep learning
COSI: 3DSIG: Structural Bioinformatics and Computational Biophysics
Date: Monday, July 9 and Tuesday, July 10

  • Jianwei Zhu, Chinese Academy of Sciences, China
  • Sheng Wang, Toyota Technological Institute at Chicago, United States
  • Dongbo Bu, Chinese Academy of Sciences, China
  • Jinbo Xu, Toyota Technological Institute at Chicago, United States

Presentation Overview: Show

Template-based modeling (TBM), including homology modeling and protein threading, is a popular method for protein 3D structure prediction. However, alignment generation and template selection for protein sequences without close templates remain very challenging. We present a new method called DeepThreader to improve protein threading, including both alignment generation and template selection, by making use of deep learning and residue co-variation information. Our method first employs deep learning to predict the inter-residue distance distribution from residue co-variation and sequential information (e.g., sequence profile and predicted secondary structure), and then builds a sequence-template alignment by integrating the predicted distance information with sequential information through an ADMM algorithm. Experimental results suggest that predicted inter-residue distance is helpful to both protein alignment and template selection, especially for protein sequences without very close templates, and that our method outperforms the currently popular homology modeling method HHpred and the threading method CNFpred by a large margin, and greatly outperforms the latest contact-assisted protein threading method EigenTHREADER.


Bio-Ontologies

A Gene-Phenotype Relationship Extraction Pipeline from the Biomedical Literature Using a Representation Learning Approach
COSI: Bio-Ontologies
Date: Monday, July 9 and Tuesday, July 10

  • Wenhui Xing, Wuhan University of Technology, China
  • Junsheng Qi, China Agricultural University, China
  • Xiaohui Yuan, Wuhan University of Technology, China
  • Lin Li, Wuhan University of Technology, China
  • Xiaoyu Zhang, Huazhong University of Science and Technology, China
  • Yuhua Fu, Wuhan University of Technology, China
  • Shengwu Xiong, Wuhan University of Technology, China
  • Lun Hu, Wuhan University of Technology, China
  • Jing Peng, Wuhan University of Technology, China

Presentation Overview: Show

Motivation: The fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, which are often reported across a large body of publications. Because the lexical features of genes are relatively regular in text, the main challenge of this relation extraction is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants.
Methods: We propose a pipeline for extracting phenotypes, genes and their relations from biomedical literature. Combined with abbreviation revision and sentence template extraction, we improve the unsupervised word-embedding-to-sentence-embedding cascaded approach as representation learning to recognize the various broad phenotypic information in literature. In addition, a dictionary- and rule-based method is applied for gene recognition. Finally, we integrate OLLIE, a well-known information extraction system, to identify gene-phenotype relations.
Results: To demonstrate the applicability of the pipeline, we established two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the TAIR gene-phenotype manual relationship dataset to test its validity. The results showed that our proposed pipeline can cover 70.94% of the original dataset and add 373 new relations to expand it.

Deep neural networks and distant supervision for geographic location mention extraction
COSI: Bio-Ontologies
Date: Monday, July 9 and Tuesday, July 10

  • Arjun Magge, Arizona State University, United States
  • Davy Weissenbacher, University of Pennsylvania, United States
  • Abeed Sarker, University of Pennsylvania, United States
  • Matthew Scotch, Arizona State University, United States
  • Graciela Gonzalez, University of Pennsylvania, United States

Presentation Overview: Show

Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.

Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.

Enumerating consistent subgraphs of directed acyclic graphs: an insight into biomedical ontologies
COSI: Bio-Ontologies
Date: Monday, July 9 and Tuesday, July 10

  • Yisu Peng, Indiana University Bloomington, United States
  • Yuxiang Jiang, Indiana University Bloomington, United States
  • Predrag Radivojac, Indiana University Bloomington, United States

Presentation Overview: Show

Motivation: Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontologies can be seen as directed acyclic graphs, where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent subgraph; that is, a subgraph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent subgraphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology.

Results: We propose an algorithm for enumerating consistent subgraphs of directed acyclic graphs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm, propose several practical accelerations, evaluate it on random graphs, and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into the complexity of concept annotation spaces and its potential influence on the predictability of ontological annotation.
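
To make the notion of a consistent subgraph concrete, here is a minimal Python sketch. It covers only the rooted-tree (forest) base case mentioned above, plus an exponential-time brute-force count for small DAGs; it does not reproduce the authors' recursive partitioning or their accelerations. The function names, the data layout, and the convention of counting the empty annotation are assumptions made for this example.

```python
from itertools import combinations


def count_tree(children, root):
    """Count consistent subgraphs of a rooted tree that contain `root`.

    `children` maps a node to the list of its child nodes.
    """
    total = 1
    for child in children.get(root, []):
        # A child subtree is either left out (1 way) or included in one
        # of count_tree(children, child) consistent ways.
        total *= 1 + count_tree(children, child)
    return total


def count_forest(children, roots):
    """Count consistent subgraphs of a forest, including the empty one."""
    total = 1
    for root in roots:
        total *= 1 + count_tree(children, root)
    return total


def count_dag_bruteforce(parents):
    """Reference count for a small DAG by checking every node subset.

    `parents` maps every node to the list of its direct parents. A subset
    is consistent when each chosen node has all of its parents chosen,
    which by induction means all of its ancestors are chosen.
    """
    nodes = list(parents)
    count = 0
    for size in range(len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            if all(p in chosen for node in chosen for p in parents[node]):
                count += 1
    return count


# Tree: a -> b, a -> c, b -> d
tree_children = {"a": ["b", "c"], "b": ["d"]}
tree_parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"]}

# Diamond DAG: d has two parents, so it needs both b and c.
diamond_parents = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
```

On the tree, the recursion visits each node once, which is the linear-time behaviour the base case relies on. The brute-force routine is only practical for a handful of nodes and serves as a cross-check.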

Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations
COSI: Bio-Ontologies
Date: Monday, July 9 and Tuesday, July 10

  • Fatima Zohra Smaili, King Abdullah University of Science and Technology, Saudi Arabia
  • Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview: Show

Motivation: Biological knowledge is widely represented in the form of ontology-based annotations: ontologies describe the phenomena assumed to exist within a domain, and the annotations associate a (kind of) biological entity with a set of phenomena within the domain. The structure and information contained in ontologies and their annotations make them valuable for developing machine learning, data analysis and knowledge extraction algorithms; notably, semantic similarity is widely used to identify relations between biological entities, and ontology-based annotations are frequently used as features in machine learning applications.

Results: We propose the Onto2Vec method, an approach to learn feature vectors for biological entities based on their annotations to biomedical ontologies. Our method can be applied to a wide range of bioinformatics research problems such as similarity-based prediction of interactions between proteins, classification of interaction types using supervised learning, or clustering. To evaluate Onto2Vec, we use the Gene Ontology (GO) and jointly produce dense vector representations of proteins, the GO classes to which they are annotated, and the axioms in GO that constrain these classes. First, we demonstrate that Onto2Vec-generated feature vectors can significantly improve prediction of protein-protein interactions in human and yeast. We then illustrate how Onto2Vec representations provide the means for constructing data-driven, trainable semantic similarity measures that can be used to identify particular relations between proteins. Finally, we use an unsupervised clustering approach to identify protein families based on their Enzyme Commission numbers. Our results demonstrate that Onto2Vec can generate high quality feature vectors from biological entities and ontologies. Onto2Vec has the potential to significantly outperform the state-of-the-art in several predictive applications in which ontologies are involved.

Availability: https://github.com/bio-ontology-research-group/onto2vec

Contact: robert.hoehndorf@kaust.edu.sa and xin.gao@kaust.edu.sa


MICROBIOME

MicroPheno: Predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples
COSI: MICROBIOME
Date: Monday, July 9 and Tuesday, July 10

  • Ehsaneddin Asgari, University of California, Berkeley, United States
  • Kiavash Garakani, University of California, Berkeley, United States
  • Alice Carolyn McHardy, Helmholtz Centre for Infection Research, Germany
  • Mohammad R.K. Mofrad, University of California, Berkeley, United States

Presentation Overview: Show

Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities of different environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent falls in sequencing costs have increased the demand for simple, efficient, and accurate methods for rapid detection or diagnosis with proven applications in medicine, agriculture, and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing based on k-mer representations that benefits from a bootstrapping framework for investigating the sufficiency of shallow sub-samples. Deep learning methods as well as classical approaches were explored for predicting environments and host phenotypes.
Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping the computationally costly sequence alignments required in OTU-picking, and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and SVM.
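
The k-mer representation of a shallow sub-sample can be sketched in a few lines of Python. This is an illustrative reading of the idea rather than the MicroPheno code: the function names, the handling of ambiguous bases, and the toy reads are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def shallow_subsample(reads, n, seed=0):
    """Draw at most n reads without replacement from a sample."""
    rng = random.Random(seed)
    return rng.sample(reads, min(n, len(reads)))


def kmer_profile(reads, k):
    """Normalized k-mer frequency vector over all 4**k DNA k-mers.

    k-mers containing characters outside A/C/G/T are skipped
    (an assumption; other policies are possible).
    """
    vocabulary = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set(ALPHABET):
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Toy reads standing in for 16S rRNA gene sequences.
sample_reads = ["ACGTACGTAC", "TTGACCGTAA", "ACGTTTGACC", "GGGACGTACA"]
features = kmer_profile(shallow_subsample(sample_reads, 2), k=3)
```

The resulting fixed-length vector (here 64 entries for k = 3) can be fed to any standard classifier. Repeating the sub-sampling with different seeds gives a simple way to check how stable the profile is at a given sampling depth.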

Viral quasispecies reconstruction via tensor factorization with successive read removal
COSI: MICROBIOME
Date: Monday, July 9 and Tuesday, July 10

  • Soyeon Ahn, The University of Texas at Austin, United States
  • Ziqi Ke, The University of Texas at Austin, United States
  • Haris Vikalo, The University of Texas at Austin, United States

Presentation Overview: Show

As RNA viruses mutate and adapt to environmental changes, often developing resistance to antiviral vaccines and drugs, they form an ensemble of viral strains, a viral quasispecies. While high-throughput sequencing has enabled in-depth studies of viral quasispecies, sequencing errors and limited read lengths render the problem of reconstructing the strains and estimating their spectrum challenging. Inference of viral quasispecies is difficult due to generally non-uniform frequencies of the strains, and is further exacerbated when the genetic distances between the strains are small.
This paper presents TenSQR, an algorithm that utilizes a tensor factorization framework to analyze high-throughput sequencing data and reconstruct viral quasispecies characterized by highly uneven frequencies of their components. Fundamentally, TenSQR performs clustering with successive data removal to infer strains in a quasispecies in order from the most to the least abundant one; every time a strain is inferred, sequencing reads generated from that strain are removed from the dataset. The proposed successive strain reconstruction and data removal enables discovery of rare strains in a population and facilitates detection of deletions in such strains. Results on simulated datasets demonstrate that TenSQR can reconstruct full-length strains having widely different abundances, generally outperforming state-of-the-art methods at diversities of 1-10% and detecting long deletions even in rare strains. A study on a real HIV-1 dataset demonstrates that TenSQR outperforms competing methods in experimental settings as well. Finally, we apply TenSQR to analyze a Zika virus sample and reconstruct the full-length strains it contains.


MLCSB: Machine Learning in Computational and Systems Biology

A new method for constructing tumor specific gene co-expression networks based on samples with tumor purity heterogeneity
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Francesca Petralia, Icahn School of Medicine at Mount Sinai, United States
  • Li Wang, Icahn School of Medicine at Mount Sinai, United States
  • Jie Peng, University of California, Davis, United States
  • Arthur Yan, Icahn School of Medicine at Mount Sinai, United States
  • Jun Zhu, Icahn School of Medicine at Mount Sinai, United States
  • Pei Wang, Icahn School of Medicine at Mount Sinai, United States

Presentation Overview: Show

Tumor tissue samples often contain an unknown fraction of stromal cells. This problem, widely known as tumor purity heterogeneity (TPH), was recently recognized as a severe issue in omics studies. Specifically, if TPH is ignored when inferring co-expression networks, edges are likely to be estimated among genes with a mean shift between non-tumor and tumor cells rather than among gene pairs interacting with each other in tumor cells. To address this issue, we propose TSNet, a new method which constructs tumor-cell specific gene/protein co-expression networks based on gene/protein expression profiles of tumor tissues. TSNet treats the observed expression profile as a mixture of expressions from different cell types and explicitly models the tumor purity percentage in each tumor sample.

Using extensive synthetic data experiments, we demonstrate that TSNet outperforms a standard graphical model which does not account for tumor purity heterogeneity. We then apply TSNet to estimate tumor specific gene co-expression networks based on TCGA ovarian cancer RNAseq data. We identify novel co-expression modules and hub structure specific to tumor cells.
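
The mixture view in this abstract can be made concrete with a short sketch. It assumes a two-component mixture (tumor and stroma) weighted by purity; the function names and all numbers are hypothetical and are not taken from TSNet. Two genes that are constant within each cell type, and therefore not co-expressed in tumor cells, still appear perfectly correlated across samples once purity varies.

```python
def observed_expression(purity, tumor, stroma):
    """Bulk expression modeled as a purity-weighted mixture of two cell types."""
    return purity * tumor + (1.0 - purity) * stroma


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical tumor purities of four samples.
purities = [0.2, 0.4, 0.6, 0.8]

# Genes A and B do not vary within a cell type; they only differ in mean
# between tumor and stroma (the "mean shift" mentioned in the abstract).
gene_a = [observed_expression(p, tumor=10.0, stroma=2.0) for p in purities]
gene_b = [observed_expression(p, tumor=8.0, stroma=1.0) for p in purities]

# The correlation is driven entirely by purity, not by any interaction.
print(pearson(gene_a, gene_b))
```

A method that ignores purity would draw an edge between A and B here; modeling purity explicitly is what allows that edge to be attributed to the mixture rather than to tumor-cell biology.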

A scalable estimator of SNP heritability for Biobank-scale data
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Yue Wu, University of California, Los Angeles, United States
  • Sriram Sankararaman, University of California, Los Angeles, United States

Presentation Overview: Show

Heritability, the proportion of variation in a trait that can be explained by genetic variation, is an important parameter in efforts to understand the genetic architecture of complex phenotypes as well as in the design and interpretation of genome-wide association studies. Attempts to understand the heritability of complex phenotypes attributable to genome-wide SNP variation data have motivated the analysis of large datasets as well as the development of sophisticated tools to estimate heritability in these datasets.

Linear Mixed Models (LMMs) have emerged as a key tool for heritability estimation, where the parameters of the LMMs, i.e., the variance components, are related to the heritability attributable to the SNPs analyzed. Likelihood-based inference in LMMs, however, poses serious computational burdens.

We propose a scalable randomized algorithm for estimating variance components in LMMs. Our method is based on a method-of-moments (MoM) estimator that has a runtime complexity of O(NMB) for N individuals and M SNPs (where B is a parameter that controls the number of random matrix-vector multiplications). Further, by leveraging the structure of the genotype matrix, we can reduce the time complexity to O(NMB / max(log_3 N, log_3 M)).

We demonstrate the scalability and accuracy of our method on simulated as well as empirical data. On standard hardware, our method computes heritability on a dataset of 500,000 individuals and 100,000 SNPs in 38 minutes.
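
A minimal sketch of how a randomized method-of-moments estimate of this kind can be organized is given below. It is an illustration under stated assumptions, not the authors' software: the kinship matrix is taken as K = X X^T / M for a column-standardized genotype matrix X, the moment equations are those of a two-variance-component model, tr(K^2) is approximated with B random +/-1 vectors (a Hutchinson-style estimator), and all function names are hypothetical. The additional speed-up from the structure of the genotype matrix is not implemented.

```python
import random


def standardize_columns(G):
    """Center and scale each SNP column to mean 0 and variance 1."""
    n, m = len(G), len(G[0])
    X = [[0.0] * m for _ in range(n)]
    for j in range(m):
        column = [G[i][j] for i in range(n)]
        mean = sum(column) / n
        sd = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
        for i in range(n):
            X[i][j] = (column[i] - mean) / sd if sd > 0 else 0.0
    return X


def apply_K(X, v):
    """Return K v for K = X X^T / M using two matrix-vector products, O(NM)."""
    n, m = len(X), len(X[0])
    u = [sum(X[i][j] * v[i] for i in range(n)) for j in range(m)]  # X^T v
    return [sum(X[i][j] * u[j] for j in range(m)) / m for i in range(n)]


def estimate_trace_K2(X, B, rng):
    """Estimate tr(K^2) as the average of ||K z||^2 over B random +/-1 vectors."""
    n = len(X)
    total = 0.0
    for _ in range(B):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Kz = apply_K(X, z)
        total += sum(a * a for a in Kz)  # z^T K K z = ||K z||^2
    return total / B


def mom_heritability(X, y, B=10, rng=None):
    """Solve the two moment equations for (sigma_g^2, sigma_e^2); return their ratio."""
    rng = rng or random.Random(0)
    n = len(y)
    mean_y = sum(y) / n
    y = [v - mean_y for v in y]
    Ky = apply_K(X, y)
    yKy = sum(a * b for a, b in zip(y, Ky))
    yy = sum(a * a for a in y)
    tr_K = sum(x * x for row in X for x in row) / len(X[0])
    tr_K2 = estimate_trace_K2(X, B, rng)
    # [[tr_K2, tr_K], [tr_K, n]] [sigma_g2, sigma_e2]^T = [yKy, yy]^T
    det = tr_K2 * n - tr_K * tr_K
    sigma_g2 = (yKy * n - tr_K * yy) / det
    sigma_e2 = (tr_K2 * yy - tr_K * yKy) / det
    return sigma_g2 / (sigma_g2 + sigma_e2)


# Small simulated example (hypothetical sizes; estimates are noisy at this scale).
rng = random.Random(1)
N, M = 60, 30
G = [[rng.choice((0, 1, 2)) for _ in range(M)] for _ in range(N)]
X = standardize_columns(G)
beta = [rng.gauss(0.0, (0.5 / M) ** 0.5) for _ in range(M)]
y = [sum(X[i][j] * beta[j] for j in range(M)) + rng.gauss(0.0, 0.5 ** 0.5) for i in range(N)]
print(mom_heritability(X, y, B=10, rng=rng))
```

The kinship matrix is never formed: each random vector costs two passes over X, so the trace estimate costs O(NMB), matching the complexity quoted in the abstract.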

A unifying framework for joint trait analysis under a non-infinitesimal model
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Ruth Johnson, University of California, Los Angeles, United States
  • Huwenbo Shi, University of California, Los Angeles, United States
  • Bogdan Pasaniuc, University of California, Los Angeles, United States
  • Sriram Sankararaman, University of California, Los Angeles, United States

Presentation Overview: Show

Motivation: A large proportion of risk regions identified by genome-wide association studies (GWAS) are shared across multiple diseases and traits. Understanding whether this clustering is due to sharing of causal variants or chance colocalization can provide insights into shared etiology of complex traits and diseases.

Results: In this work, we propose a flexible, unifying framework to quantify the overlap between two traits called UNITY (Unifying Non-Infinitesimal Trait analYsis). We formulate a full generative model that makes minimal assumptions under a non-infinitesimal model and performs inference starting from GWAS summary association data. To address the very large parameter space, we propose a Metropolis-Hastings within collapsed Gibbs sampler to perform inference. Through comprehensive simulations and an analysis of height and BMI, we show that our method produces estimations consistent with the known genetic makeup of both height and BMI.

COSSMO: Predicting Competitive Alternative Splice Site Selection using Deep Learning
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Hannes Bretschneider, University of Toronto, Canada
  • Shreshth Gandhi, Deep Genomics, Inc., Canada
  • Amit G Deshwar, Deep Genomics, Inc., Canada
  • Khalid Zuberi, Deep Genomics Inc., Toronto ON, Canada
  • Brendan Frey, Deep Genomics, Inc., Canada

Presentation Overview: Show

Motivation: Alternative splice site selection is inherently competitive, and the probability that a given splice site is used depends strongly on the strength of neighboring sites. Here we present a new model named Competitive Splice Site Model (COSSMO), which explicitly models these competitive effects and predicts the PSI distribution over any number of putative splice sites. We model an alternative splicing event as the choice of a 3’ acceptor site conditional on a fixed upstream 5’ donor site, or the choice of a 5’ donor site conditional on a fixed 3’ acceptor site. We build four different architectures that use convolutional layers, communication layers, LSTMs, and residual networks, respectively, to learn relevant motifs from sequence alone. We also construct a new dataset from genome annotations and RNA-Seq read data that we use to train our model.

Results: COSSMO is able to predict the most frequently used splice site with an accuracy of 70% on unseen test data, and achieves an R² of 60% in modeling the PSI distribution. We visualize the motifs that COSSMO learns from sequence and show that COSSMO recognizes the consensus splice site sequences as well as many known splicing factors with high specificity.

Availability: Our dataset is available from http://cossmo.deepgenomics.com.
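
The competitive element can be illustrated with a small sketch. It assumes that each putative splice site receives a scalar score and that the scores are normalized with a softmax to give a PSI distribution; the scores below are made-up numbers rather than network outputs, and `psi_distribution` is a hypothetical name.

```python
import math


def psi_distribution(scores):
    """Softmax over per-site scores: non-negative values that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


# The same first site, with and without a stronger competing site nearby.
two_sites = psi_distribution([2.0, 1.0])
three_sites = psi_distribution([2.0, 1.0, 3.0])
print(two_sites[0], three_sites[0])
```

Because the normalization runs over all candidate sites of an event, adding a strong competitor lowers the predicted usage of every other site even though their own scores are unchanged, and the list can have any length.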

Discriminating early- and late-stage cancers using multiple kernel learning on gene sets
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Arezou Rahimi, Koç University, Turkey
  • Mehmet Gönen, Koç University, Turkey

Presentation Overview: Show

Motivation: Identifying molecular mechanisms that drive cancers from early to late stages is highly important to develop new preventive and therapeutic strategies. Standard machine learning algorithms could be used to discriminate early- and late-stage cancers from each other using their genomic characterisations. Even though these algorithms would achieve satisfactory predictive performance, their knowledge extraction capability would be quite restricted due to the highly correlated nature of genomic data. That is why we need algorithms that can also extract relevant information about these biological mechanisms using our prior knowledge about pathways/gene sets.

Results: In this study, we addressed the problem of separating early- and late-stage cancers from each other using their gene expression profiles. We proposed to use a multiple kernel learning formulation that makes use of pathways/gene sets (i) to obtain satisfactory/improved predictive performance and (ii) to identify biological mechanisms that might have an effect on cancer progression. We extensively compared our proposed multiple kernel learning on gene sets algorithm against two standard machine learning algorithms, namely, random forests and support vector machines, on 20 diseases from TCGA cohorts for two different sets of experiments. Our method obtained statistically significantly better or comparable predictive performance on most of the datasets using significantly fewer gene expression features. We also showed that our algorithm was able to extract meaningful and disease-specific information that gives clues about the progression mechanism.

Availability: Our implementations of support vector machine and multiple kernel learning algorithms in R are available at https://github.com/mehmetgonen/gsbc together with the scripts that replicate the reported experiments.

DLBI: Deep learning guided Bayesian inference for structure reconstruction of super-resolution fluorescence microscopy
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Yu Li, KAUST, Saudi Arabia
  • Fan Xu, Chinese Academy of Sciences, China
  • Fa Zhang, Chinese Academy of Sciences, China
  • Pingyong Xu, Chinese Academy of Sciences, China
  • Mingshu Zhang, Chinese Academy of Sciences, China
  • Ming Fan, Hangzhou Dianzi University, China
  • Lihua Li, Hangzhou Dianzi University, China
  • Xin Gao, KAUST, Saudi Arabia
  • Renmin Han, KAUST, Saudi Arabia

Presentation Overview: Show

Super-resolution fluorescence microscopy, with a resolution beyond the diffraction limit of light, has become an indispensable tool to directly visualize biological structures in living cells at a nanometer-scale resolution. Despite advances in high-density super-resolution fluorescent techniques, existing methods still have bottlenecks, including extremely long execution time, artificial thinning and thickening of structures, and lack of ability to capture latent structures.

Here we propose a novel deep learning guided Bayesian inference approach, DLBI, for the time-series analysis of high-density fluorescent images. Our method combines the strength of deep learning and statistical inference, where deep learning captures the underlying distribution of the fluorophores that are consistent with the observed time-series fluorescent images by exploring local features and correlation along time-axis, and statistical inference further refines the ultrastructure extracted by deep learning and endues physical meaning to the final image. In particular, our method contains three main components. The first one is a simulator that takes a high-resolution image as the input, and simulates time-series low-resolution fluorescent images based on experimentally calibrated parameters, which provides supervised training data to the deep learning model. The second one is a multi-scale deep learning module to capture both spatial information in each input low-resolution image as well as temporal information among the time-series images. And the third one is a Bayesian inference module that takes the image from the deep learning module as the initial localization of fluorophores and removes artifacts by statistical inference. Comprehensive experimental results on both real and simulated datasets demonstrate that our method provides more accurate and realistic local patch and large-field reconstruction than the state-of-the-art method, the 3B analysis, while our method is more than two orders of magnitude faster.

Gene Prioritization Using Bayesian Matrix Factorization with Genomic and Phenotypic Side Information
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Pooya Zakeri, Katholieke Universiteit Leuven, Belgium
  • Jaak Simm, Katholieke Universiteit Leuven, Belgium
  • Adam Arany, Katholieke Universiteit Leuven, Belgium
  • Sarah Elshal, Katholieke Universiteit Leuven, Belgium
  • Yves Moreau, Katholieke Universiteit Leuven, Belgium

Presentation Overview: Show

Motivation: Most gene prioritization methods model each disease or phenotype individually, but this fails to capture patterns common to several diseases or phenotypes. To overcome this limitation, we formulate the gene prioritization task as the factorization of a sparsely filled gene-phenotype matrix, where the objective is to predict the unknown matrix entries. To deliver more accurate gene-phenotype matrix completion, we extend classical Bayesian matrix factorization to work with multiple side information sources. The availability of side information allows us to make nontrivial predictions for genes for which no previous disease association is known.

Results: Our gene prioritization method integrates not only data sources describing genes but also data sources describing Human Phenotype Ontology terms. Experimental results on our benchmarks show that our proposed model can effectively improve accuracy over the well-established gene prioritization method, Endeavour. In particular, our proposed method offers promising results on diseases of the nervous system; diseases of the eye and adnexa; endocrine, nutritional and metabolic diseases; and congenital malformations, deformations and chromosomal abnormalities, when compared to Endeavour.

Improved pathway reconstruction from RNA interference screens by exploiting off-target effects
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Sumana Srivatsa, ETH Zurich, Switzerland
  • Jack Kuipers, ETH Zurich, Switzerland
  • Fabian Schmich, ETH Zurich, Switzerland
  • Simone Eicher, Biozentrum, University of Basel, Switzerland
  • Mario Emmenlauer, Biozentrum, University of Basel, Switzerland
  • Christoph Dehio, Biozentrum, University of Basel, Switzerland
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Presentation Overview: Show

Motivation: Pathway reconstruction has proven to be an indispensable tool for analyzing the molecular mechanisms of signal transduction underlying cell function. Nested effects models (NEMs) are a class of probabilistic graphical models designed to reconstruct signalling pathways from high-dimensional observations resulting from perturbation experiments, such as RNA interference (RNAi). NEMs assume that the short interfering RNAs (siRNAs) designed to knock down specific genes are always on-target. However, it has been shown that most siRNAs exhibit strong off-target effects, which further confound the data, resulting in unreliable reconstruction of networks by NEMs.

Results: Here, we present an extension of NEMs called probabilistic combinatorial nested effects models (pc-NEMs), which capitalize on the ancillary siRNA off-target effects for network reconstruction from combinatorial gene knockdown data. Our model employs an adaptive simulated annealing search algorithm for simultaneous inference of network structure and error rates inherent to the data. Evaluation of pc-NEMs on simulated data with varying number of phenotypic effects and noise levels as well as real data demonstrates improved reconstruction compared to classical NEMs. Application to Bartonella henselae infection RNAi screening data yielded an eight node network largely in agreement with previous works, and revealed novel binary interactions of direct impact between established components.

Improving genomics-based predictions for precision medicine through active elicitation of expert knowledge
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Iiris Sundin, Aalto University, Helsinki, Finland
  • Tomi Peltola, Aalto University, Helsinki, Finland
  • Luana Micallef, Aalto University, Helsinki, Finland
  • Homayun Afrabandpey, Aalto University, Helsinki, Finland
  • Marta Soare, Aalto University, Helsinki, Finland
  • Muntasir Mamun Majumder, University of Helsinki, Finland
  • Pedram Daee, Aalto University, Helsinki, Finland
  • Chen He, University of Helsinki, Finland
  • Baris Serim, University of Helsinki, Finland
  • Aki Havulinna, National Institute for Health and Welfare THL, Helsinki, Finland
  • Caroline Heckman, University of Helsinki, Finland
  • Giulio Jacucci, University of Helsinki, Finland
  • Pekka Marttinen, Aalto University, Helsinki, Finland
  • Samuel Kaski, Aalto University, Helsinki, Finland

Presentation Overview: Show

Motivation: Precision medicine requires the ability to predict the efficacies of different treatments for a given individual using high-dimensional genomic measurements. However, identifying predictive features remains a challenge when the sample size is small. Incorporating expert knowledge offers a promising approach to improve predictions, but collecting such knowledge is laborious if the number of candidate features is very large.
Results: We introduce a probabilistic framework to incorporate expert feedback about the impact of genomic measurements on the outcome of interest, and present a novel approach to collect the feedback efficiently, based on Bayesian experimental design. The new approach outperformed other recent alternatives in two medical applications: prediction of metabolic traits and prediction of sensitivity of cancer cells to different drugs, both using genomic features as predictors. Furthermore, the intelligent approach to collect feedback reduced the workload of the expert to approximately 11%, compared to a baseline approach.
Availability: Source code implementing the introduced computational methods is freely available at https://github.com/AaltoPML/knowledge-elicitation-for-precision-medicine.
Contact: first.last@aalto.fi
Supplementary information: Supplementary data are available at Bioinformatics online.

mGPfusion: Predicting protein stability changes with Gaussian process kernel learning and data fusion
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Emmi Jokinen, Aalto University, Finland
  • Markus Heinonen, Aalto University, Finland
  • Harri Lähdesmäki, Aalto University, Finland

Presentation Overview: Show

Motivation: Proteins are commonly used by the biochemical industry for numerous processes. Refining these proteins' properties via mutations causes stability effects as well. Accurate computational methods to predict how mutations affect protein stability are necessary to facilitate efficient protein design. However, the accuracy of predictive models is ultimately constrained by the limited availability of experimental data.

Results: We have developed mGPfusion, a novel Gaussian process (GP) method for predicting protein stability changes upon single and multiple mutations. This method complements the limited experimental data with large amounts of molecular simulation data. We introduce a Bayesian data fusion model that re-calibrates the experimental and in silico data sources and then learns a predictive GP model from the combined data. Our protein-specific model requires experimental data only regarding the protein of interest, and performs well even with few experimental measurements. mGPfusion models proteins by contact maps and infers the stability effects caused by mutations with a mixture of graph kernels. Our results show that mGPfusion outperforms state-of-the-art methods in predicting protein stability on a dataset of 15 different proteins and that incorporating molecular simulation data improves the model learning and prediction accuracy.

Availability: Software implementation and datasets are available at github.com/emmijokinen/mgpfusion
Contact: emmi.jokinen@aalto.fi

Modeling Polypharmacy Side Effects with Graph Convolutional Networks
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Marinka Zitnik, Stanford University, United States
  • Monica Agrawal, Stanford University, United States
  • Jure Leskovec, Stanford University, United States

Presentation Overview: Show

Motivation: The use of drug combinations, termed polypharmacy, is common to treat patients with complex diseases or co-existing conditions. However, a major consequence of polypharmacy is a much higher risk of adverse side effects for the patient. Polypharmacy side effects emerge because of drug-drug interactions, in which the activity of one drug may change, favorably or unfavorably, if taken with another drug. The knowledge of drug interactions is often limited because these complex relationships are rare and are usually not observed in relatively small clinical testing. Discovering polypharmacy side effects thus remains an important challenge with significant implications for patient mortality and morbidity.

Results: Here, we present Decagon, an approach for modeling polypharmacy side effects. The approach constructs a multimodal graph of protein-protein interactions, drug-protein target interactions, and the polypharmacy side effects, which are represented as drug-drug interactions, where each side effect is an edge of a different type. Decagon is developed specifically to handle such multimodal graphs with a large number of edge types. Our approach develops a new graph convolutional neural network for multirelational link prediction in multimodal networks. Unlike approaches limited to predicting simple drug-drug interaction values, Decagon can predict the exact side effect, if any, through which a given drug combination manifests clinically. Decagon accurately predicts polypharmacy side effects, outperforming baselines by up to 69%. We find that it automatically learns representations of side effects indicative of co-occurrence of polypharmacy in patients. Furthermore, Decagon models polypharmacy side effects that have a strong molecular basis particularly well, while on predominantly non-molecular side effects it achieves good performance because of effective sharing of model parameters across edge types. Decagon opens up opportunities to use large pharmacogenomic and patient population data to flag and prioritize polypharmacy side effects for follow-up analysis via formal pharmacological studies.

Optimization and profile calculation of ODE models using second order adjoint sensitivity analysis
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Paul Stapor, Helmholtz Center for Environmental Health, Germany
  • Fabian Fröhlich, Helmholtz Center for Environmental Health, Germany
  • Jan Hasenauer, Helmholtz Center for Environmental Health, Germany

Presentation Overview: Show

Motivation: Parameter estimation methods for ordinary differential equation (ODE) models of biological processes can exploit gradients and Hessians of objective functions to achieve convergence and computational efficiency. However, the computational complexity of established methods to evaluate the Hessian scales linearly with the number of state variables and quadratically with the number of parameters. This limits their application to low-dimensional problems.
Results: We introduce second order adjoint sensitivity analysis for the computation of Hessians and a hybrid optimization-integration based approach for profile likelihood computation. Second order adjoint sensitivity analysis scales linearly with the number of parameters and state variables. The Hessians are effectively exploited by the proposed profile likelihood computation approach. We evaluate our approaches on published biological models with real measurement data. Our study reveals an improved computational efficiency and robustness of optimization compared to established approaches, when using Hessians computed with adjoint sensitivity analysis. The hybrid computation method was more than two-fold faster than the best competitor. Thus, the proposed methods and implemented algorithms allow for the improvement of parameter estimation for medium and large scale ODE models.
Availability: The algorithms for second order adjoint sensitivity analysis are implemented in the Advanced MATLAB Interface to CVODES and IDAS (AMICI, https://github.com/ICB-DCM/AMICI/). The algorithm for hybrid profile likelihood computation is implemented in the parameter estimation toolbox (PESTO, https://github.com/ICB-DCM/PESTO/). Both toolboxes are freely available under the BSD license.
Contact: jan.hasenauer@helmholtz-muenchen.de
Supplementary information: Supplementary data are available at Bioinformatics online.
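
For readers unfamiliar with profile calculation, the idea can be sketched on a toy model. The example below is an assumption-laden illustration and not the hybrid optimization-integration method or the adjoint sensitivity analysis of the paper: it uses the closed-form solution y(t) = a * exp(-b * t) of a one-state decay ODE, takes the sum of squared errors as a stand-in for the negative log-likelihood under Gaussian noise, fixes the rate b on a grid, and re-optimizes the amplitude a in closed form at each grid point. All names and data are hypothetical.

```python
import math


def best_amplitude(times, data, rate):
    """Closed-form least-squares amplitude a for y = a * exp(-rate * t)."""
    basis = [math.exp(-rate * t) for t in times]
    return sum(y * e for y, e in zip(data, basis)) / sum(e * e for e in basis)


def profile_sse(times, data, rate):
    """Objective at a fixed rate after re-optimizing the remaining parameter."""
    a = best_amplitude(times, data, rate)
    return sum((y - a * math.exp(-rate * t)) ** 2 for t, y in zip(times, data))


# Noise-free synthetic measurements generated with a = 2.0 and b = 0.5.
times = [0.0, 1.0, 2.0, 3.0, 4.0]
data = [2.0 * math.exp(-0.5 * t) for t in times]

# Profile over the rate parameter: one inner optimization per grid point.
grid = [k / 10 for k in range(1, 11)]
profile = [(rate, profile_sse(times, data, rate)) for rate in grid]
best_rate = min(profile, key=lambda pair: pair[1])[0]
print(best_rate)
```

In realistic models the inner optimization has no closed form and must be repeated numerically at every point of the profile, which is where efficient gradients and Hessians of the objective become valuable.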

Random forest based similarity learning for single cell RNA sequencing data
COSI: MLCSB: Machine Learning in Computational and Systems Biology
Date: Monday, July 9 and Tuesday, July 10

  • Maziyar Baran Pouyan, University of Pittsburgh, United States
  • Dennis Kostka, University of Pittsburgh, United States

Presentation Overview: Show

Motivation: Genome-wide transcriptome sequencing applied to single cells (scRNA-seq) is rapidly becoming an assay of choice across many fields of biological and biomedical research. Scientific objectives often revolve around discovery or characterization of types or sub-types of cells, and therefore obtaining accurate cell–cell similarities from scRNA-seq data is a critical step in many studies. While rapid advances are being made in the development of tools for scRNA-seq data analysis, few approaches exist that explicitly address this task. Furthermore, the abundance and type of noise present in scRNA-seq datasets suggest that application of generic methods, or of methods developed for bulk RNA-seq data, is likely suboptimal.
Results: Here we present RAFSIL, a random forest based approach to learn cell-cell similarities from scRNA-seq data. RAFSIL implements a two-step procedure, where feature construction geared towards scRNA-seq data is followed by similarity learning. It is designed to be adaptable and expandable, and RAFSIL similarities can be used for typical exploratory data analysis tasks like dimension reduction, visualization, and clustering. We show that our approach compares favorably with current methods across a diverse collection of datasets, and that it can be used to detect and highlight unwanted technical variation in scRNA-seq datasets in situations where other methods fail. Overall, RAFSIL implements a flexible approach yielding a useful tool that improves the analysis of scRNA-seq data.
Availability and Implementation: The RAFSIL R package is available at www.kostkalab.net/software.html


RegSys: Regulatory and Systems Genomics

Covariate-Dependent Negative Binomial Factor Analysis of RNA Sequencing Data
COSI: RegSys: Regulatory and Systems Genomics
Date: Monday, July 9 and Tuesday, July 10

  • Siamak Zamani Dadaneh, Texas A&M University, United States
  • Mingyuan Zhou, The University of Texas at Austin, United States
  • Xiaoning Qian, Texas A&M University, United States

Presentation Overview: Show

Motivation: High-throughput sequencing technologies, in particular RNA sequencing (RNA-seq), have become the basic practice for genomic studies in biomedical research. In addition to studying genes individually, for example, through differential expression analysis, investigating coordinated expression variations of genes may help reveal the underlying cellular mechanisms to derive better understanding and more effective prognosis and intervention strategies. Although there exists a variety of co-expression network based methods to analyze microarray data for this purpose, instead of blindly extending these methods for microarray data that may introduce unnecessary bias, it is crucial to develop methods well adapted to RNA-seq data to identify the functional modules of genes with similar expression patterns.
Results: We have developed a fully Bayesian covariate-dependent negative binomial factor analysis method, dNBFA, for RNA-seq count data, to capture coordinated gene expression changes, while considering effects from covariates reflecting different influencing factors. Unlike existing co-expression network based methods, our proposed model does not require multiple ad-hoc choices on data processing, transformation, as well as co-expression measures, and can be directly applied to RNA-seq data. Furthermore, being capable of incorporating covariate information, the proposed method can tackle setups with complex confounding factors in different experiment designs. Finally, the natural model parameterization removes the need for a normalization preprocessing step, as commonly adopted to compensate for the effect of sequencing-depth variations. Efficient Bayesian inference of model parameters is derived by exploiting conditional conjugacy via novel data augmentation techniques. Experimental results on several real-world RNA-seq datasets on complex diseases suggest dNBFA as a powerful tool for discovering the gene modules with significant differential expression and meaningful biological insight.
Availability: dNBFA is implemented in R language and is available at https://github.com/siamakz/dNBFA.

Predicting CTCF-mediated chromatin loops using CTCF-MP
COSI: RegSys: Regulatory and Systems Genomics
Date: Monday, July 9 and Tuesday, July 10

  • Ruochi Zhang, Carnegie Mellon University, United States
  • Yuchuan Wang, Carnegie Mellon University, United States
  • Yang Yang, Carnegie Mellon University, United States
  • Yang Zhang, Carnegie Mellon University, United States
  • Jian Ma, Carnegie Mellon University, United States

Presentation Overview: Show

The three-dimensional organization of chromosomes within the cell nucleus is highly regulated. It is known that CTCF is an important architectural protein that mediates long-range chromatin loops. Recent studies have shown that the majority of CTCF binding motif pairs at chromatin loop anchor regions are in convergent orientation. However, it remains unknown whether the genomic context at the sequence level can determine if a convergent CTCF motif pair is able to form a chromatin loop. In this paper, we directly ask whether and what sequence-based features (other than the motif itself) may be important to establish CTCF-mediated chromatin loops. We found that motif conservation measured by "branch-of-origin", which accounts for motif turn-over in evolution, is an important feature. We developed a new machine learning algorithm called CTCF-MP based on word2vec to demonstrate that sequence-based features alone have the capability to predict if a pair of convergent CTCF motifs would form a loop. Together with functional genomic signals from CTCF ChIP-seq and DNase-seq, CTCF-MP is able to make highly accurate predictions on whether a convergent CTCF motif pair would form a loop in a single cell type and also across different cell types. Our work represents an important step toward understanding the sequence determinants that may guide the formation of complex chromatin architectures.
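
The notion of a convergent motif pair can be made concrete with a short sketch. It assumes the usual convention that a pair is convergent when the upstream motif lies on the forward strand and the downstream motif on the reverse strand, so that the two motifs point toward each other; the positions, the distance cutoff, and the function names are hypothetical. Pairs enumerated this way are the kind of candidates a classifier such as CTCF-MP would then label as looping or non-looping.

```python
def is_convergent(upstream_strand, downstream_strand):
    """True when the upstream motif is on '+' and the downstream motif is on '-'."""
    return upstream_strand == "+" and downstream_strand == "-"


def convergent_pairs(motifs, max_distance):
    """List convergent motif pairs no farther apart than max_distance.

    motifs: iterable of (position, strand) tuples on one chromosome.
    """
    ordered = sorted(motifs)
    pairs = []
    for i, (pos_a, strand_a) in enumerate(ordered):
        for pos_b, strand_b in ordered[i + 1:]:
            if pos_b - pos_a > max_distance:
                break  # motifs are sorted, so later ones are even farther away
            if is_convergent(strand_a, strand_b):
                pairs.append((pos_a, pos_b))
    return pairs


# Hypothetical CTCF motif occurrences: (genomic position, strand).
motifs = [(100, "+"), (500, "-"), (900, "+"), (1500, "-")]
print(convergent_pairs(motifs, max_distance=1000))
```

Orientation alone yields many candidate pairs, which is why additional sequence-level features are needed to decide which of them actually form loops.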

Quantifying the similarity of topological domains across normal and cancer human cell types
COSI: RegSys: Regulatory and Systems Genomics
Date: Monday, July 9 and Tuesday, July 10

  • Natalie Sauerwald, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview: Show

Motivation: Three-dimensional chromosome structure has been increasingly shown to influence various levels of cellular and genomic functions. Through Hi-C data, which maps contact frequency on chromosomes, it has been found that structural elements termed topologically associating domains (TADs) are involved in many regulatory mechanisms. However, we have little understanding of the level of similarity or variability of chromosome structure across cell types and disease states. In this work we present a method to quantify resemblance and identify structurally similar regions between any two sets of TADs.
Results: We present an analysis of 23 human Hi-C samples representing various tissue types in normal and cancer cell lines. We quantify global and chromosome-level structural similarity, and compare the relative similarity between cancer and non-cancer cells. We find that cancer cells show higher structural variability around commonly mutated pan-cancer genes than normal cells at these same locations.
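
One way to quantify how much two sets of TADs resemble each other is sketched below. The abstract as listed here does not name the measure, so the choice of the variation of information between the two induced bin partitions is an assumption made for illustration, as are the function names and the toy domain coordinates. Domains are half-open bin intervals (start, end); bins outside any domain share one "no domain" label.

```python
import math
from collections import Counter


def bin_labels(domains, n_bins):
    """Label each bin with the index of the domain covering it (-1 if none)."""
    labels = [-1] * n_bins
    for index, (start, end) in enumerate(domains):
        for b in range(start, end):
            labels[b] = index
    return labels


def entropy(counts, n):
    """Shannon entropy (in nats) of a partition given its block sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts)


def variation_of_information(domains_a, domains_b, n_bins):
    """VI = 2 H(A, B) - H(A) - H(B); zero exactly when the partitions coincide."""
    a = bin_labels(domains_a, n_bins)
    b = bin_labels(domains_b, n_bins)
    h_a = entropy(Counter(a).values(), n_bins)
    h_b = entropy(Counter(b).values(), n_bins)
    h_ab = entropy(Counter(zip(a, b)).values(), n_bins)
    return 2.0 * h_ab - h_a - h_b


# Two hypothetical TAD sets over the same 8-bin region.
tads_1 = [(0, 4), (4, 8)]
tads_2 = [(0, 2), (2, 8)]
print(variation_of_information(tads_1, tads_2, n_bins=8))
```

Lower values indicate more similar domain structure; evaluating the same quantity over sliding windows rather than whole chromosomes is one way to localize structurally similar regions.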

Scalable preprocessing for sparse scRNA-seq data exploiting prior knowledge
COSI: RegSys: Regulatory and Systems Genomics
Date: Monday, July 9 and Tuesday, July 10

  • Sumit Mukherjee, University of Washington, United States
  • Yue Zhang, University of Washington, United States
  • Joshua Fan, University of Washington, United States
  • Georg Seelig, University of Washington, United States
  • Sreeram Kannan, University of Washington, United States

Presentation Overview: Show

Motivation: Single cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (1) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (2) Many tools simply cannot handle the size of the resulting datasets. (3) Prior biological knowledge such as bulk RNA-seq information of certain cell types or qualitative marker information is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data, that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge.
Results: We find that preprocessing using UNCURL consistently improves performance of commonly used scRNA-seq tools for clustering, visualization, and lineage estimation, both in the absence and presence of prior knowledge. Finally, we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on an scRNA-seq dataset containing 1.3 million cells.
Availability: Source code is available at https://github.com/yjzhang/uncurl_python
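
To make the idea of a non-negative factorization of sparse count data concrete, the sketch below factorizes a tiny genes-by-cells count matrix under a Poisson (KL-divergence) objective using standard multiplicative updates. This is an illustration under stated assumptions, not UNCURL's actual algorithm or API: the function names, the toy matrix, and the choice of update rule are all hypothetical and are not taken from the abstract.

```python
import random
from math import log

EPS = 1e-10  # guards against division by zero and log(0)

def matmul(w, h):
    """Plain-Python product of a (genes x k) and a (k x cells) matrix."""
    return [[sum(w[i][a] * h[a][j] for a in range(len(h)))
             for j in range(len(h[0]))] for i in range(len(w))]

def kl_divergence(x, wh):
    """Generalized KL divergence, i.e. the Poisson deviance up to a constant."""
    total = 0.0
    for x_row, p_row in zip(x, wh):
        for obs, pred in zip(x_row, p_row):
            total += pred - obs
            if obs > 0:
                total += obs * log(obs / (pred + EPS))
    return total

def poisson_nmf(x, k, iters=200, seed=0):
    """Factorize x ~ W H with non-negative W (genes x k) and H (k x cells)."""
    rng = random.Random(seed)
    genes, cells = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(genes)]
    h = [[rng.random() + 0.1 for _ in range(cells)] for _ in range(k)]
    history = [kl_divergence(x, matmul(w, h))]  # objective before any update
    for _ in range(iters):
        # Multiplicative update of H with W held fixed.
        wh = matmul(w, h)
        ratio = [[x[i][j] / (wh[i][j] + EPS) for j in range(cells)] for i in range(genes)]
        for a in range(k):
            col_sum = sum(w[i][a] for i in range(genes)) + EPS
            for j in range(cells):
                h[a][j] *= sum(w[i][a] * ratio[i][j] for i in range(genes)) / col_sum
        # Multiplicative update of W with the new H held fixed.
        wh = matmul(w, h)
        ratio = [[x[i][j] / (wh[i][j] + EPS) for j in range(cells)] for i in range(genes)]
        for a in range(k):
            row_sum = sum(h[a][j] for j in range(cells)) + EPS
            for i in range(genes):
                w[i][a] *= sum(h[a][j] * ratio[i][j] for j in range(cells)) / row_sum
        history.append(kl_divergence(x, matmul(w, h)))
    return w, h, history

# Hypothetical toy counts: 4 genes x 6 cells, two loosely separated cell types.
counts = [
    [5, 4, 6, 0, 0, 1],
    [3, 5, 4, 0, 1, 0],
    [0, 1, 0, 6, 5, 4],
    [0, 0, 1, 4, 6, 5],
]
w, h, history = poisson_nmf(counts, k=2)
print("divergence before:", round(history[0], 3), "after:", round(history[-1], 3))
```

In a sketch like this, the columns of W play the role of per-state mean expression profiles and the columns of H give each cell's weights over those states; a downstream tool could then cluster or visualize H or the product W H instead of the raw counts.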

Unsupervised embedding of single-cell Hi-C data
COSI: RegSys: Regulatory and Systems Genomics
Date: Monday, July 9 and Tuesday, July 10

  • Jie Liu, University of Washington, United States
  • Dejun Lin, University of Washington, United States
  • Gurkan Yardimci, University of Washington, United States
  • William Noble, University of Washington, United States

Presentation Overview: Show

Single-cell Hi-C (scHi-C) data promises to enable scientists to interrogate the 3D architecture of DNA in the nucleus of the cell, studying how this structure varies stochastically or along developmental or cell cycle axes. However, Hi-C data analysis requires methods that take into account the unique characteristics of this type of data. In this work, we explore whether methods that have been developed previously for the analysis of bulk Hi-C data can be applied to scHi-C data when used in conjunction with unsupervised embedding. We find that one of these methods, HiCRep, when combined with multidimensional scaling (MDS), strongly outperforms three other methods, including a technique that has been used previously for scHi-C analysis. We also provide evidence that the HiCRep/MDS method is robust to extremely low per-cell sequencing depth, that this robustness is improved even further when high-coverage and low-coverage cells are projected together, and that the method can be used to jointly embed cells from multiple published datasets.
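
The embedding step described above can be sketched as classical MDS applied to a matrix of pairwise cell-to-cell similarity scores. The example below is a minimal illustration, not the authors' implementation: the similarity values are made up rather than computed by HiCRep, the conversion d = sqrt(2 - 2s) from similarity to distance is an assumption, and the eigenvectors are found by simple power iteration to keep the code dependency-free.

```python
from math import sqrt

def similarity_to_distance(s):
    """Assumed conversion of a similarity score in [-1, 1] to a distance."""
    return sqrt(max(0.0, 2.0 - 2.0 * s))

def double_center(d2):
    """Turn squared distances into a Gram matrix B = -1/2 * J * D^2 * J."""
    n = len(d2)
    row_mean = [sum(row) / n for row in d2]
    grand_mean = sum(row_mean) / n
    return [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpair(b, iters=500):
    """Power iteration for the dominant eigenpair of a symmetric matrix."""
    n = len(b)
    # Non-uniform start, because B maps the all-ones vector to zero.
    v = [1.0 / (i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(x * y for x, y in zip(v, bv)), v

def classical_mds(similarity, dims=2):
    """Embed items in `dims` dimensions from a symmetric similarity matrix."""
    n = len(similarity)
    d2 = [[similarity_to_distance(similarity[i][j]) ** 2 for j in range(n)]
          for i in range(n)]
    b = double_center(d2)
    coords = [[0.0] * dims for _ in range(n)]
    eigenvalues = []
    for k in range(dims):
        lam, v = top_eigenpair(b)
        eigenvalues.append(lam)
        scale = sqrt(max(lam, 0.0))  # negative eigenvalues contribute nothing
        for i in range(n):
            coords[i][k] = scale * v[i]
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords, eigenvalues

def euclid(p, q):
    return sqrt(sum((a - c) ** 2 for a, c in zip(p, q)))

# Hypothetical similarity scores for four cells forming two groups.
similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
coords, eigenvalues = classical_mds(similarity)
for cell, point in enumerate(coords):
    print(cell, [round(c, 3) for c in point])
```

With these toy scores, cells 0 and 1 land close together and far from cells 2 and 3, which is the behaviour one would hope for when similar contact maps receive high similarity scores.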

SigMat: A Classification Scheme for Gene Signature Matching
COSI: RegSys: Regulatory and Systems Genomics
Date: Monday, July 9 and Tuesday, July 10

  • Jinfeng Xiao, University of Illinois at Urbana-Champaign, United States
  • Charles Blatti, University of Illinois at Urbana-Champaign, United States
  • Saurabh Sinha, University of Illinois at Urbana-Champaign, United States

Presentation Overview: Show

Motivation: Several large-scale efforts have been made to collect gene expression signatures from a variety of biological conditions, such as response of cell lines to treatment with drugs, or tumor samples with different characteristics. These gene signature collections are utilized through bioinformatics tools for “signature matching”, whereby a researcher studying an expression profile can identify previously cataloged biological conditions most related to their profile. Signature matching tools typically retrieve from the collection the signature that has highest similarity to the user-provided profile. Alternatively, classification models may be applied where each biological condition in the signature collection is a class label; however, such models are trained on the collection of available signatures and may not generalize to the novel cellular context or cell line of the researcher’s expression profile.
Results: We present an advanced multi-way classification algorithm for signature matching, called SigMat, that is trained on a large signature collection from a well-studied cellular context, but can also classify signatures from other cell types by relying on an additional, small collection of signatures representing the target cell type. It uses these “tuning data” to learn two additional parameters that help adapt its predictions for other cellular contexts. SigMat outperforms other similarity scores and classification methods in identifying the correct label of a query expression profile from as many as 244 candidate classes (drug treatments) cataloged by the LINCS L1000 project. SigMat retains its high accuracy in cross-cell line applications even when the amount of tuning data is severely limited.
Availability: SigMat is available on GitHub at https://github.com/JinfengXiao/SigMat.
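
The baseline that the Motivation describes, retrieving the cataloged signature most similar to a query profile, can be sketched as follows. This illustrates similarity-based retrieval only and is not SigMat itself: the use of Pearson correlation as the similarity score, the function names, and the toy signatures are assumptions made for the example.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant vector carries no ranking information
    return cov / (sd_x * sd_y)

def match_signature(query, collection):
    """Return (label, score) of the cataloged signature most similar to query."""
    return max(
        ((label, pearson(query, signature)) for label, signature in collection.items()),
        key=lambda pair: pair[1],
    )

# Hypothetical collection: condition label -> differential expression of 5 genes.
collection = {
    "drug_A": [2.0, -1.0, 0.5, 0.0, -2.0],
    "drug_B": [-1.5, 2.0, 0.0, 1.0, 0.5],
    "drug_C": [0.0, 0.0, 1.0, -1.0, 0.0],
}
query = [1.8, -0.7, 0.4, 0.1, -1.9]

best_label, best_score = match_signature(query, collection)
print(best_label, round(best_score, 3))
```

A retrieval scheme of this kind has no trainable parameters, which is one reason it may transfer poorly when the query comes from a cell type that differs from the one used to build the collection.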