Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Proceedings Track Presentations

CATS (Coordinates of Atoms by Taylor Series): Protein design with backbone flexibility in all locally feasible directions
Date: TBA
Room: TBA
  • Mark Hallen, Toyota Technological Institute at Chicago, United States
  • Bruce Donald, Duke University, United States

Motivation: When proteins mutate or bind to ligands, their backbones often move significantly, especially in loop regions. Computational protein design algorithms must model these motions in order to accurately optimize protein stability and binding affinity. However, methods for backbone conformational search in design have been much more limited than for sidechain conformational search. This is especially true for combinatorial protein design algorithms, which aim to search a large sequence space efficiently and thus cannot rely on temporal simulation of each candidate sequence.
Results: We alleviate this difficulty with a new parameterization of backbone conformational space, which represents all degrees of freedom of a specified segment of protein chain that maintain valid bonding geometry (by maintaining the original bond lengths and angles and ω dihedrals). In order to search this space, we present an efficient algorithm, CATS, for computing atomic coordinates as a function of our new continuous backbone internal coordinates. CATS generalizes the iMinDEE and EPIC protein design algorithms, which model continuous flexibility in sidechain dihedrals, to model continuous, appropriately localized flexibility in the backbone dihedrals φ and ψ as well. We show using 81 test cases based on 29 different protein structures that CATS finds sequences and conformations that are significantly lower in energy than methods with less or no backbone flexibility do. In particular, we show that CATS can model the viability of an antibody mutation known experimentally to increase affinity, but that appears sterically infeasible when modeled with less or no backbone flexibility.
Availability: Our code is available as free software at https://github.com/donaldlab/OSPREY_refactor
Contact: mhallen@ttic.edu, brd+ismb17@cs.duke.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

Deep learning based subdivision approach for large scale macromolecules structure recovery from electron cryo tomograms
Date: TBA
Room: TBA
  • Min Xu, Carnegie Mellon University, United States
  • Xiaoqi Chai, Carnegie Mellon University, United States
  • Hariank Muthakana, Carnegie Mellon University, United States
  • Xiaodan Liang, Carnegie Mellon University, United States
  • Ge Yang, Carnegie Mellon University, United States
  • Tzviya Zeev-Ben-Mordehai, University of Oxford, United Kingdom
  • Eric Xing, Carnegie Mellon University, United States

Motivation: Cellular Electron CryoTomography (CECT) enables 3D visualization of cellular organization at near-native state and in sub-molecular resolution, making it a powerful tool for analyzing structures of macromolecular complexes and their spatial organizations inside single cells. However, the high degree of structural complexity, together with practical imaging limitations, makes the systematic de novo discovery of structures within cells challenging. It would likely require averaging and classifying millions of subtomograms potentially containing hundreds of highly heterogeneous structural classes. Although it is no longer difficult to acquire CECT data containing such numbers of subtomograms due to advances in data acquisition automation, existing computational approaches have very limited scalability or discrimination ability, making them incapable of processing such amounts of data.

Results: To complement existing approaches, in this paper we propose a new approach for subdividing subtomograms into smaller but relatively homogeneous subsets. The structures in these subsets can then be separately recovered using existing computation intensive methods. Our approach is based on supervised structural feature extraction using deep learning, in combination with unsupervised clustering and reference-free classification. Our experiments show that, compared to existing unsupervised rotation invariant feature and pose-normalization based approaches, our new approach achieves significant improvements in both discrimination ability and scalability. More importantly, our new approach is able to discover new structural classes and recover structures that do not exist in training data.

Availability: Source code freely available at http://www.cs.cmu.edu/~mxu1/software

Contact: mxu1@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

Large-scale structure prediction by improved contact predictions and model quality assessment
Date: Saturday, July 22 2:00 PM - 2:20 PM
Room: TBA
  • Arne Elofsson, Stockholm University, Sweden
  • David Menendez Hurtado, Stockholm University, Sweden
  • Mirco Michel, Stockholm University, Sweden

Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very big protein families, decreasing their utility. However, recent progress by combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known.

Results: We present the PconsFold2 pipeline that uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that can be reliably identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 450 Pfam families of unknown structure. Of these, 343 have not been reported before.
Availability: Datasets as well as models of all the 450 Pfam families are available at http://c3.pcons.net/. All programs used here are freely available.
Contact: arne@bioinfo.se
Supplementary information: No supplementary data

SnapDock - Template Based Docking by Geometric Hashing
Date: TBA
Room: TBA
  • Michael Estrin, Tel Aviv University, Israel
  • Haim J. Wolfson, Tel Aviv University, Israel

A highly efficient template-based protein-protein docking algorithm, nicknamed SnapDock, is presented. It employs a Geometric Hashing based structural alignment scheme to align the target proteins to the interfaces of non-redundant protein-protein interface libraries. Docking of a pair of proteins utilizing the 22,600-interface PIFACE library is performed in less than 2 minutes on average. A flexible version of the algorithm allowing hinge motion in one of the proteins is presented as well. To evaluate the performance of the algorithm, a blind re-modelling of 3,547 PDB complexes that were uploaded after the PIFACE publication was performed, with a success ratio of about 35%. Interestingly, a similar experiment with the template-free PatchDock docking algorithm yielded a success rate of about 23%, with roughly 1/2 of the solutions different from those of SnapDock. Thus the combination of the two methods gave a 42% success ratio.

BIOSSES: A Semantic Sentence Similarity Estimation System for the Biomedical Domain
COSI: Bio-Ontologies
Date: TBA
Room: TBA
  • Gizem Soğancıoğlu, Boğaziçi University, Turkey
  • Hakime Öztürk, Boğaziçi University, Turkey
  • Arzucan Özgür, Boğaziçi University, Turkey

Motivation: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarisation. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text.

Methods: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods.

Results: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems by up to 42.6% in terms of the Pearson correlation metric.

Availability: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/
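
As a rough illustration of the evaluation setup described above, the sketch below scores sentence pairs with one simple string similarity measure (token-set Jaccard overlap) and compares the scores with human ratings using the Pearson correlation. The choice of Jaccard, the whitespace tokenizer, and the function names are assumptions made for this example; they are not taken from the BIOSSES code.

```python
import math


def jaccard(sentence_a, sentence_b):
    """Token-set Jaccard similarity of two sentences (whitespace tokens)."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

A system would be evaluated by computing `jaccard` (or any other similarity measure) for every annotated sentence pair and passing the resulting list, together with the mean human scores, to `pearson`.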

Deep Learning with Word Embeddings improves Biomedical Named Entity Recognition
COSI: Bio-Ontologies
Date: TBA
Room: TBA
  • Maryam Habibi, Humboldt-Universität zu Berlin, Germany
  • Leon Weber, Humboldt-Universität zu Berlin, Germany
  • Mariana Neves, Hasso-Plattner-Institute, Germany
  • David Luis Wiegandt, Humboldt-Universität zu Berlin, Germany
  • Ulf Leser, Humboldt-Universität zu Berlin, Germany

Motivation: Text mining has become an important tool for biomedical research. The most fundamental text mining task is the recognition of biomedical named entities (NER), such as genes, chemicals, and diseases. Current NER methods rely on predefined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
Results: We show that a completely generic method based on deep learning and statistical word embeddings (called LSTM-CRF) outperforms state-of-the-art entity-specific NER tools, often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, the F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
Availability: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora.

Applying meta-analysis to Genotype-Tissue Expression data from multiple tissues to identify eQTLs and increase the number of eGenes
Date: TBA
Room: TBA
  • Dat Duong, UCLA, United States
  • Lisa Gai, UCLA, United States
  • Sagi Snir, Institute of evolution, Israel
  • Eun Yong Kang, University of California, Los Angeles, United States
  • Jae Hoon Sul, Brigham and Women's Hospital, Boston, USA, United States
  • Buhm Han, Asan Institute for Life Sciences, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea, Korea
  • Eleazar Eskin, University of California, Los Angeles, United States

Motivation: There is recent interest in using gene expression data to contextualize findings from traditional genome wide association studies (GWAS). Conditioned on a tissue, expression quantitative trait loci (eQTLs) are genetic variants associated with gene expression, and eGenes are genes whose expression levels are associated with genetic variants. eQTLs and eGenes provide great supporting evidence for GWAS hits and important insights into the regulatory pathways involved in many diseases. When a significant variant or a candidate gene identified by GWAS is also an eQTL or eGene, there is strong evidence to further study this variant or gene. Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. Unfortunately, these datasets often have small sample sizes in some tissues. For this reason, there have been many meta-analysis methods designed to combine gene expression data across many tissues to increase power for finding eQTLs and eGenes. However, these existing techniques are not scalable to datasets containing many tissues, like the GTEx data. Furthermore, these methods ignore a biological insight that the same variant may be associated with the same gene across similar tissues.
Results: We introduce a meta-analysis model that addresses these problems in existing methods. We focus on the problem of finding eGenes in gene expression data from many tissues, and show that our model is better than other types of meta-analyses.
Availability: Source code and supplementary data are at https://github.com/datduong/RECOV.
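
To make the idea of combining per-tissue evidence concrete, here is a minimal sketch of a textbook fixed-effects, inverse-variance meta-analysis for a single variant-gene pair. It is a generic baseline chosen for illustration only, not the model introduced in this paper; the function name and the inputs (per-tissue effect estimates and standard errors) are assumptions.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Combine per-tissue effect estimates with inverse-variance weights.

    Returns (pooled effect, pooled standard error, z-score).
    """
    weights = [1.0 / (se * se) for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, pooled_beta / pooled_se
```

For example, `fixed_effects_meta([0.2, 0.4], [0.1, 0.1])` pools two equally precise tissues into an effect of about 0.3 with a smaller standard error than either tissue alone, which is the source of the power gain mentioned in the abstract.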

Rectified Factor Networks for Biclustering of Omics Data
Date: TBA
Room: TBA
  • Sepp Hochreiter, Institute of Bioinformatics, Johannes Kepler University Linz, Austria
  • Gundula Povysil, Johannes Kepler University Linz, Austria
  • Thomas Unterthiner, Institute of Bioinformatics, Johannes Kepler University Linz, Austria, Austria
  • Djork-Arné Clevert, Bayer AG, Germany

Motivation: Biclustering has become a major tool for analyzing large data sets given as a matrix of samples times features and has been successfully applied in life sciences and e-commerce for drug design and recommender systems, respectively. FABIA, one of the most successful biclustering methods, is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated and sample membership is difficult to determine. We propose to use the recently introduced unsupervised Deep Learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method which enforces non-negative and normalized posterior means. Each code unit represents a bicluster, where samples for which the code unit is active belong to the bicluster and features that have activating weights to the code unit belong to the bicluster.
Results: On 400 benchmark data sets and on three gene expression data sets with known clusters, RFN outperformed 13 other biclustering methods including FABIA. On data of the 1000 Genomes Project, RFN could identify DNA segments which indicate that interbreeding with other hominins had already started before ancestors of modern humans left Africa.
Availability and Implementation: https://github.com/bioinf-jku/librfn
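
The membership rule stated in the Motivation (a sample belongs to a bicluster if the code unit is active for it, a feature belongs if it has an activating weight to the code unit) can be sketched as follows. The matrix layout, the threshold `eps`, and the reading of "activating weight" as a weight whose magnitude exceeds `eps` are assumptions for this example and are not taken from librfn.

```python
def extract_biclusters(hidden, weights, eps=0.0):
    """Read biclusters off a trained factor model.

    hidden[s][u]  : non-negative posterior mean of code unit u for sample s
    weights[u][f] : weight connecting code unit u to feature f
    Returns one (sample indices, feature indices) pair per code unit.
    """
    biclusters = []
    for unit, unit_weights in enumerate(weights):
        # Samples for which this code unit is active.
        samples = [s for s, row in enumerate(hidden) if row[unit] > eps]
        # Features with a non-negligible weight to this code unit.
        features = [f for f, w in enumerate(unit_weights) if abs(w) > eps]
        biclusters.append((samples, features))
    return biclusters
```

Because the posterior means are sparse and non-negative, most entries of `hidden` are exactly zero, so each code unit selects only a small subset of samples.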

DextMP: Deep dive into Text for predicting Moonlighting Proteins
COSI: Function
Date: TBA
Room: TBA
  • Ishita Khan, Purdue University, United States
  • Mansurul Bhuiyan, IUPUI, United States
  • Daisuke Kihara, Purdue University, United States

Motivation: Moonlighting proteins (MPs) are proteins that perform more than one independent cellular function. MPs are gaining more attention in recent years as they are found to play important roles in various systems including disease developments. MPs also have a significant impact in computational function prediction and annotation in databases. Currently MPs are not labeled as such in biological databases even in cases where multiple distinct functions are known for the proteins. In this work, we propose a novel method named DextMP, which predicts whether a protein is an MP or not based on its textual features extracted from scientific literature and the UniProt database.
Results: DextMP extracts three categories of textual information for a protein: titles, abstracts from literature, and function description in UniProt. Three language models were applied and compared: a state-of-the-art deep unsupervised learning algorithm along with two other language models of different types, Term Frequency-Inverse Document Frequency in the bag-of-words and Latent Dirichlet Allocation in the topic modeling category. Cross-validation results on a dataset of known MPs and non-MPs showed that DextMP successfully predicted MPs with over 91% accuracy with significant improvement over existing MP prediction methods. Lastly, we have run DextMP with the best performing language models and text-based feature combinations on three genomes, human, yeast, and Xenopus laevis, and found that about 2.5 to 35% of the proteomes are potential MPs.
Availability: Code available at http://kiharalab.org/DextMP
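
Of the three language models named in the Results, Term Frequency-Inverse Document Frequency is the simplest to show in a few lines. The sketch below uses one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency) on whitespace tokens; the exact weighting and preprocessing used by DextMP are not stated in the abstract, so both are assumptions here.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return vectors
```

Terms that occur in every document receive weight zero, so the resulting vectors emphasize words that distinguish one protein's text from another's before they are passed to a classifier.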

Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference
COSI: Function
Date: TBA
Room: TBA
  • Christophe Dessimoz, University of Lausanne, Switzerland
  • Adrian Altenhoff, ETH Zurich, Switzerland
  • Gaston Gonnet, ETH Zurich, Switzerland
  • Natasha Glover, University of Lausanne, Switzerland
  • Clément Marie Train, University of Lausanne, Switzerland

Motivation: Accurate orthology inference is a fundamental step in many phylogenetic and comparative analyses. Many methods have been proposed, including OMA (Orthologous MAtrix). Yet substantial challenges remain, in particular in coping with fragmented genes or genes evolving at different rates after duplication, and in scaling to large datasets. With more and more genomes available, it is necessary to improve the scalability and robustness of orthology inference methods.
Results: We present improvements in the Orthologous MAtrix (OMA) algorithm: 1) refining the pairwise orthology inference step to account for same-species paralogs evolving at different rates, and 2) minimizing errors in the pairwise orthology verification step by testing the consistency of pairwise distance estimates, which can be problematic in the presence of fragmentary sequences. In addition we introduce a more scalable procedure for hierarchical orthologous group (HOG) clustering, which is several orders of magnitude faster on large datasets. Using the Quest for Orthologs consortium orthology benchmark service, we show that these changes translate into substantial improvement on multiple empirical datasets.
Availability: This new OMA 2.0 algorithm is used in the OMA database (http://omabrowser.org) from the March 2017 release onwards, and can be run on custom genomes using OMA standalone version 2.0 and above (http://omabrowser.org/standalone).

Abundance estimation and differential testing on strain level in metagenomics data
Date: TBA
Room: TBA
  • Martina Fischer, Robert Koch Institute, Germany
  • Benjamin Strauch, Robert Koch Institute, Germany
  • Bernhard Y. Renard, Robert Koch Institute, Germany

Motivation: Current metagenomics approaches allow analyzing the composition of microbial communities at high resolution. Important changes to the composition are known to occur even on strain level and to go hand in hand with changes in disease or ecological state. However, specific challenges arise for strain level analysis due to the highly similar genome sequences present. Only a limited number of tools approach taxa abundance estimation beyond species level and there is a strong need for dedicated tools for strain resolution and differential abundance testing.
Methods: We present DiTASiC (Differential Taxa Abundance including Similarity Correction) as a novel approach for quantification and differential assessment of individual taxa in metagenomics samples. We introduce a generalized linear model for the resolution of shared read counts which cause a significant bias on strain level. Further, we capture abundance estimation uncertainties, which play a crucial role in differential abundance analysis. A novel statistical framework is built, which integrates the abundance variance and infers abundance distributions for differential testing sensitive to strain level.
Results: As a result, we obtain highly accurate abundance estimates down to sub-strain level and enable fine-grained resolution of strain clusters. We demonstrate the relevance of read ambiguity resolution and integration of abundance uncertainties for differential analysis. Accurate detections of even small changes are achieved and false-positives are significantly reduced. Superior performance is shown on latest benchmark sets of various complexities and in comparison to existing methods. DiTASiC code is freely available from https://rki_bioinformatics.gitlab.io/ditasic.

Chromatin Accessibility Prediction via Convolutional Long Short-Term Memory Networks with k-mer Embedding
Date: TBA
Room: TBA
  • Xu Min, Tsinghua University, China
  • Wanwen Zeng, Tsinghua University, China
  • Ning Chen, Tsinghua University, China
  • Ting Chen, Tsinghua University, China
  • Rui Jiang, Tsinghua University, China

Motivation: Experimental techniques for measuring chromatin accessibility are expensive and time consuming, calling for the development of computational methods to precisely predict open chromatin regions from DNA sequences. Along this direction, existing computational methods fall into two classes: one based on handcrafted k-mer features and the other based on convolutional neural networks. Although both categories have shown good performance in specific applications thus far, a comprehensive framework to integrate useful k-mer co-occurrence information with recent advances in deep learning is still lacking.
Method and results: We fill this gap by addressing the problem of chromatin accessibility prediction with a convolutional Long Short-Term Memory (LSTM) network with k-mer embedding. We first split DNA sequences into k-mers and pre-train k-mer embedding vectors based on the co-occurrence matrix of k-mers by using an unsupervised representation learning approach. We then construct a supervised deep learning architecture comprised of an embedding layer, three convolutional layers and a Bidirectional LSTM (BLSTM) layer for feature learning and classification. We demonstrate that our method gains high-quality fixed-length features from variable-length sequences and consistently outperforms baseline methods. We show that k-mer embedding can effectively enhance model performance by exploring different embedding strategies. We also prove the efficacy of both the convolution and the BLSTM layers by comparing two variations of the network architecture. We confirm the robustness of our model to hyper-parameters by performing sensitivity analysis. We hope our method can eventually reinforce our understanding of employing deep learning in genomic studies and shed light on research regarding mechanisms of chromatin accessibility.
Availability and implementation: The source code can be downloaded from https://github.com/minxueric/ismb2017_lstm.
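
The first preprocessing step described above, splitting DNA sequences into k-mers and collecting their co-occurrence statistics, can be sketched as follows. The window size, the stride of one, and the symmetric (unordered) pair counting are assumptions for this example; the embedding training that consumes these counts is not shown.

```python
from collections import Counter


def split_kmers(sequence, k, stride=1):
    """Split a DNA sequence into overlapping k-mers."""
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def cooccurrence(sequences, k, window=2):
    """Count unordered k-mer pairs that occur within `window` k-mers of each other."""
    counts = Counter()
    for sequence in sequences:
        kmers = split_kmers(sequence, k)
        for i, center in enumerate(kmers):
            # Look only to the right; sorting the pair makes the count symmetric.
            for j in range(i + 1, min(i + window + 1, len(kmers))):
                counts[tuple(sorted((center, kmers[j])))] += 1
    return counts
```

The resulting counts play the role of the co-occurrence matrix from which an unsupervised method would learn one embedding vector per k-mer.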

deBGR: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph
Date: TBA
Room: TBA
  • Prashant Pandey, Stony Brook University, United States
  • Michael A. Bender, Stony Brook University, United States
  • Rob Johnson, Stony Brook University, United States
  • Rob Patro, Stony Brook University, United States

Motivation: Almost all de novo short-read genome and transcriptome assemblers start by building a representation of the de Bruijn Graph of the reads they are given as input (Compeau et al., 2011; Pevzner et al., 2001; Simpson et al., 2009; Schulz et al., 2012; Zerbino and Birney, 2008; Grabherr et al., 2011; Chang et al., 2015; Liu et al., 2016; Kannan et al., 2016). Even when other approaches are used for subsequent assembly (e.g., when one is using “long read” technologies like those offered by PacBio or Oxford Nanopore), efficient k-mer processing is still crucial for accurate assembly (Carvalho et al., 2016; Koren et al., 2017), and state-of-the-art long-read error-correction methods use de Bruijn Graphs (Salmela et al., 2016). Because of the centrality of de Bruijn Graphs, researchers have proposed numerous methods for representing de Bruijn Graphs compactly (Pell et al., 2012; Pellow et al., 2016; Chikhi and Rizk, 2013; Salikhov et al., 2013). Some of these proposals sacrifice accuracy to save space. Further, none of these methods store abundance information, i.e., the number of times that each k-mer occurs, which is key in transcriptome assemblers.

Results: We present a method for compactly representing the weighted de Bruijn Graph (i.e., with abundance information) with essentially no errors. Our representation yields zero errors while increasing the space requirements by less than 18%–28%. Our technique is based on a simple invariant that all weighted de Bruijn Graphs must satisfy, and hence is likely to be of general interest and applicable in most weighted de Bruijn Graph-based systems.
Availability: https://github.com/splatlab/debgr
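
The abstract does not spell out the invariant it relies on. As an illustration of the kind of constraint a weighted de Bruijn Graph satisfies, the sketch below builds k-mer abundances from reads and checks a flow-balance property: at every (k-1)-mer node, incoming minus outgoing edge weight equals the number of reads ending there minus the number starting there. Treating this as the relevant invariant is an assumption made for this example, as are the function names; reverse complements are ignored and k is assumed to be at least 2.

```python
from collections import Counter


def weighted_de_bruijn(reads, k):
    """Abundance of every k-mer (edge); nodes are the (k-1)-mers."""
    edges = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            edges[read[i:i + k]] += 1
    return edges


def flow_imbalance(edges):
    """Incoming minus outgoing edge weight at every node."""
    balance = Counter()
    for kmer, count in edges.items():
        balance[kmer[1:]] += count   # the edge enters its suffix node
        balance[kmer[:-1]] -= count  # the edge leaves its prefix node
    return balance


def read_end_imbalance(reads, k):
    """Reads ending at a node minus reads starting at it."""
    balance = Counter()
    for read in reads:
        if len(read) >= k:
            balance[read[-(k - 1):]] += 1
            balance[read[:k - 1]] -= 1
    return balance


def nonzero(counter):
    """Drop zero entries so two balances can be compared as plain dicts."""
    return {node: value for node, value in counter.items() if value != 0}
```

A representation that stores approximate counts could use a constraint of this kind to detect, and possibly correct, abundances that violate the balance at some node.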

Discovery and genotyping of novel sequence insertions in many sequenced individuals
Date: TBA
Room: TBA
  • Pınar Kavak, Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey, Turkey
  • Yen-Yi Lin, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  • Ibrahim Numanagić, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  • Hossein Asghari, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  • Tunga Güngör, Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey, Turkey
  • Can Alkan, Department of Computer Engineering, Bilkent University, Ankara, Turkey, Turkey
  • Faraz Hach, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada

Despite recent advances in algorithm design for characterizing structural variation using high-throughput short read sequencing (HTS) data, characterization of novel sequence insertions longer than the average read length remains a challenging task. This is mainly due to both computational difficulties and the complexities imposed by genomic repeats in generating reliable assemblies to accurately detect both the sequence content and the exact location of such insertions. Additionally, de novo genome assembly algorithms typically require a very high depth of coverage, which may be a limiting factor for most genome studies. Therefore, characterization of novel sequence insertions is not a routine part of most sequencing projects. There are only a handful of algorithms that are specifically developed for novel sequence insertion discovery that can bypass the need for whole genome de novo assembly. Still, most such algorithms rely on high depth of coverage, and to our knowledge there is only one method (PopIns) that can use multi-sample data to "collectively" obtain a very high coverage dataset to accurately find insertions common in a given population. Here we present Pamir, a new algorithm to efficiently and accurately discover and genotype novel sequence insertions using either single or multiple genome sequencing datasets. Pamir is able to detect breakpoint locations of the insertions and calculate their zygosity (i.e. heterozygous vs. homozygous) by analyzing multiple sequence signatures, matching one-end-anchored sequences to small-scale de novo assemblies of unmapped reads, and conducting strand-aware local assembly. We test the efficacy of Pamir on both simulated and real data, and demonstrate its potential use in accurate and routine identification of novel sequence insertions in genome projects.
Availability: Pamir is available at https://github.com/vpc-ccg/pamir
Contact: fhach@sfu.ca, calkan@cs.bilkent.edu.tr

HopLand: Single-cell pseudotime recovery using continuous Hopfield network based modeling of Waddington’s epigenetic landscape
Date: TBA
Room: TBA
  • Jing Guo, Nanyang Technological University, Singapore
  • Jie Zheng, Nanyang Technological University, Singapore

Motivation: The interpretation of transcriptome dynamics in single-cell data, especially pseudotime estimation, could help understand the transition of gene expression profiles. The recovery of pseudotime increases the temporal resolution of single-cell transcriptional data, but is challenging due to the high variability in gene expression between individual cells. Here, we introduce HopLand, a pseudotime recovery method using a continuous Hopfield network to map cells onto Waddington’s epigenetic landscape. It reveals from the single-cell data the combinatorial regulatory interactions of genes that control the dynamic progression through successive cellular states.
Results: We applied HopLand to different types of single-cell transcriptome data. It achieved high accuracies of pseudotime prediction compared to existing methods. Moreover, a kinetic model can be extracted from each dataset. Through the analysis of such a model, we identified key genes and regulatory interactions driving the transition of cell states. Therefore, our method has the potential to generate fundamental insights into cell fate regulation.
Availability and implementation: The Matlab implementation of HopLand is available at https://github.com/NetLand-NTU/HopLand.

Improved Data-Driven Likelihood Factorizations for Transcript Abundance Estimation
Date: TBA
Room: TBA
  • Rob Patro, Stony Brook University, United States
  • Fatemehalsadat Almodaresi T S, Stony Brook University, United States
  • Avi Srivastava, Stony Brook University, United States
  • Mohsen Zakeri, Stony Brook University, United States

Motivation: Many methods for transcript-level abundance estimation reduce the computational burden
associated with the iterative algorithms they use by adopting an approximate factorization of the likelihood
function they optimize. This leads to considerably faster convergence of the optimization procedure, since
each round of e.g., the EM algorithm, can execute much more quickly. However, these approximate
factorizations of the likelihood function simplify calculations at the expense of discarding certain information
that can be useful for accurate transcript abundance estimation.
Results: We demonstrate that model simplifications (i.e., factorizations of the likelihood function) adopted
by certain abundance estimation methods can lead to a diminished ability to accurately estimate the
abundances of highly-related transcripts. In particular, considering factorizations based on
transcript-fragment compatibility alone can result in a loss of accuracy compared to the per-fragment, unsimplified
model. However, we show that such shortcomings are not an inherent limitation of approximately factorizing
the underlying likelihood function. By considering the appropriate conditional fragment probabilities, and
adopting improved, data-driven factorizations of this likelihood, we demonstrate that such approaches can
achieve performance nearly indistinguishable from methods that consider the complete (i.e., per-fragment)
Availability: Our data-driven factorizations are incorporated into a branch of the Salmon transcript
quantification tool: https://github.com/COMBINE-lab/salmon/tree/factorizations
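
The difference between a compatibility-only factorization and one that keeps conditional fragment probabilities can be illustrated with a toy EM loop. The sketch below is not Salmon's implementation: the function name `em_equivalence_classes`, the `weights` argument, and the example counts are all hypothetical, and effective transcript lengths are ignored for brevity.

```python
def em_equivalence_classes(classes, n_transcripts, weights=None, n_iter=200):
    """Toy EM over fragment equivalence classes (illustrative only).

    classes: dict mapping a tuple of transcript ids to the number of
        fragments compatible with exactly that set of transcripts.
    weights: optional dict with the same keys, giving one conditional
        weight per transcript in the class. When omitted, every weight
        is 1.0, i.e. only compatibility is used.
    Returns the estimated fraction of fragments from each transcript.
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    total = sum(classes.values())
    for _ in range(n_iter):
        expected = [0.0] * n_transcripts
        for txs, count in classes.items():
            w = weights[txs] if weights else (1.0,) * len(txs)
            denom = sum(theta[t] * wt for t, wt in zip(txs, w))
            if denom == 0.0:
                continue  # class cannot be explained by current estimates
            # E-step: split this class's fragments among its transcripts.
            for t, wt in zip(txs, w):
                expected[t] += count * theta[t] * wt / denom
        # M-step: renormalise expected counts into abundances.
        theta = [e / total for e in expected]
    return theta


# Two near-identical transcripts sharing every fragment (made-up counts).
ambiguous = {(0, 1): 100}
flat = em_equivalence_classes(ambiguous, 2)
informed = em_equivalence_classes(ambiguous, 2, weights={(0, 1): (0.9, 0.1)})
```

With compatibility alone, the two transcripts in `ambiguous` stay at equal abundance because nothing distinguishes them. Supplying per-class conditional weights lets the estimate move toward the better-supported transcript.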

Improving the performance of minimizers and winnowing schemes
Date: TBA
Room: TBA
  • Carl Kingsford, Carnegie Mellon University, United States
  • Ron Shamir, School of Computer Science, Tel Aviv University, Israel
  • Yaron Orenstein, MIT, United States
  • Daniel Bork, University of Pittsburgh, United States
  • David Pellow, Tel Aviv University, Israel
  • Guillaume Marçais, Carnegie Mellon University, United States

Presentation Overview:

Motivation: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g., too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues.

Results: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worst behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence.

Availability: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.
Contact: gmarcais@cs.cmu.edu
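
As a rough illustration of how the k-mer ordering enters the minimizers scheme, the sketch below selects the smallest k-mer in every window of w consecutive k-mers under a pluggable ordering. It is not the authors' software: the function names, the leftmost tie-breaking rule, and the `random_order` helper are assumptions made for this example, and no universal hitting set is constructed.

```python
import random


def minimizer_positions(seq, k, w, key=None):
    """Start positions of the k-mers chosen as minimizers.

    Every window of w consecutive k-mers contributes its smallest k-mer
    under `key` (leftmost on ties). The default key is the k-mer itself,
    i.e. lexicographic order.
    """
    if key is None:
        key = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (key(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w, key=None):
    """Fraction of k-mer positions that end up selected."""
    return len(minimizer_positions(seq, k, w, key)) / (len(seq) - k + 1)


def random_order(seed=0):
    """Hypothetical randomized ordering: each distinct k-mer gets a random rank."""
    rng = random.Random(seed)
    ranks = {}

    def key(kmer):
        if kmer not in ranks:
            ranks[kmer] = rng.random()
        return ranks[kmer]

    return key


lexicographic = minimizer_positions("ACGTACGT", k=2, w=3)        # [0, 1, 4]
randomized = minimizer_positions("ACGTACGT", k=2, w=3, key=random_order(1))
```

Comparing `density` under the two orderings on sequences of interest gives a quick feel for how much the choice of ordering changes the number of selected k-mers.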

Modelling haplotypes with respect to reference cohort variation graphs
Date: TBA
Room: TBA
  • Benedict Paten, UCSC, United States
  • Jordan Eizenga, University of California Santa Cruz, United States
  • Yohei Rosen, University of California, Santa Cruz, United States

Presentation Overview:

Motivation: Current statistical models of haplotypes are limited to panels of haplotypes whose genetic variation can be represented by arrays of values at linearly ordered bi- or multiallelic loci. These methods cannot model structural variants or variants that nest or overlap.
Results: A variation graph is a mathematical structure that can encode arbitrarily complex genetic variation. We present the first haplotype model that operates on a variation graph-embedded population reference cohort. We describe an algorithm to calculate the likelihood that a haplotype arose from this cohort through recombinations and demonstrate time complexity linear in haplotype length and sublinear in population size. We furthermore demonstrate a method of rapidly calculating likelihoods for related haplotypes. We describe mathematical extensions to allow modelling of mutations. This work is an important incremental step for clinical genomics and genetic epidemiology since it is the first haplotype model which can represent all sorts of variation in the population.

Tumor Phylogeny Inference Using Tree-Constrained Importance Sampling
Date: TBA
Room: TBA
  • Benjamin Raphael, Princeton University, United States
  • Gryte Satas, Brown University, United States

Presentation Overview:

A tumor arises from an evolutionary process that can be modeled as a phylogenetic tree. However, reconstructing this tree is challenging as most cancer sequencing uses bulk tumor tissue containing heterogeneous mixtures of cells.
We introduce PASTRI (Probabilistic Algorithm for Somatic Tree Inference), a new algorithm for bulk-tumor sequencing data that clusters somatic mutations into clones and infers a phylogenetic tree that describes the evolutionary history of the tumor. PASTRI uses an importance sampling algorithm that combines a probabilistic model of DNA sequencing data with an enumeration algorithm based on the combinatorial constraints defined by the underlying phylogenetic tree. As a result, tree inference is fast, accurate and robust to noise. We demonstrate on simulated data that PASTRI outperforms other cancer phylogeny algorithms in terms of runtime and accuracy. On real data from a chronic lymphocytic leukemia (CLL) patient, we show that a simple linear phylogeny better explains the data than the complex branching phylogeny that was previously reported. PASTRI provides a robust approach for phylogenetic tree inference from mixed samples.

A New Method to Study the Change of miRNA-mRNA Interactions Due to Environmental Exposures
COSI: NetBio
Date: TBA
Room: TBA
  • Francesca Petralia, Icahn School of Medicine at Mount Sinai, United States
  • Vasily Aushev, Icahn School of Medicine at Mount Sinai, United States
  • Kalpana Gopalakrishnan, Icahn School of Medicine at Mount Sinai, United States
  • Maya Kappil, Icahn School of Medicine at Mount Sinai, United States
  • Nyan Win Khin, Icahn School of Medicine at Mount Sinai, United States
  • Jia Chen, Icahn School of Medicine at Mount Sinai, United States
  • Susan Teitelbaum, Icahn School of Medicine at Mount Sinai, United States
  • Pei Wang, Icahn School of Medicine at Mount Sinai, United States

Presentation Overview:

Motivation: Integrative approaches characterizing the interactions among different types of biological molecules have been demonstrated to be useful for revealing informative biological mechanisms. One such example is the interaction between microRNA (miRNA) and messenger RNA (mRNA), whose deregulation may be sensitive to environmental insult leading to altered phenotypes. The goal of this work is to develop an effective data integration method to characterize deregulation between miRNA and mRNA due to environmental toxicant exposures. We will use data from an animal experiment designed to investigate the effect of low-dose environmental chemical exposure on normal mammary gland development in rats to motivate and evaluate the proposed method.

Results: We propose a new network approach, integrative Joint Random Forest (iJRF), which characterizes the regulatory system between miRNAs and mRNAs using a network model. iJRF is designed to work under the high-dimension low-sample-size regime, and can borrow information across different treatment conditions to achieve more accurate network inference. It also effectively takes into account prior information of miRNA-mRNA regulatory relationships from existing databases. When iJRF was applied to the data from the environmental chemical exposure study, we detected a few important miRNAs that regulated a large number of mRNAs in the control group but not in the exposed groups, suggesting the disruption of miRNA activity due to chemical exposure. Effects of chemical exposure on two affected miRNAs were further validated using human breast cancer cell lines.

Alignment of dynamic networks
COSI: NetBio
Date: TBA
Room: TBA
  • Vipin Vijayan, University of Notre Dame, United States
  • Dominic Critchlow, Austin Peay State University, United States
  • Tijana Milenkovic, University of Notre Dame, United States

Presentation Overview:

Network alignment (NA) aims to find a node mapping that conserves similar regions between compared networks. NA is applicable to many fields, including computational biology, where NA can guide the transfer of biological knowledge from well- to poorly-studied species across aligned network regions. Existing NA methods can only align static networks. However, most complex real-world systems evolve over time and should thus be modeled as dynamic networks. We hypothesize that aligning dynamic network representations of evolving systems will produce superior alignments compared to aligning the systems' static network representations, as is currently done. For this purpose, we introduce the first ever dynamic NA method, DynaMAGNA++. This proof-of-concept dynamic NA method is an extension of a state-of-the-art static NA method, MAGNA++. Even though both MAGNA++ and DynaMAGNA++ optimize edge as well as node conservation across the aligned networks, MAGNA++ conserves static edges and similarity between static node neighborhoods, while DynaMAGNA++ conserves dynamic edges (events) and similarity between evolving node neighborhoods. For this purpose, we introduce the first ever measure of dynamic edge conservation and rely on our recent measure of dynamic node conservation. Importantly, the two dynamic conservation measures can be optimized with any state-of-the-art NA method and not just MAGNA++. We confirm our hypothesis that dynamic NA is superior to static NA, on synthetic and real-world networks, in computational biology and social domains. DynaMAGNA++ is parallelized and has a user-friendly graphical interface.

Image-based Spatiotemporal Causality Inference for Protein Signaling Networks
COSI: NetBio
Date: TBA
Room: TBA
  • Robert F. Murphy, Carnegie Mellon University, United States
  • Christoph Wülfing, University of Bristol, United Kingdom
  • Xiongtao Ruan, Carnegie Mellon University, United States

Presentation Overview:

Motivation: Efforts to model how signaling and regulatory networks work in cells have largely either not considered spatial organization or have used compartmental models with minimal spatial resolution. Fluorescence microscopy provides the ability to monitor the spatiotemporal distribution of many molecules during signaling events, but as of yet no methods have been described for large scale image analysis to learn a complex protein regulatory network. Here we present and evaluate methods for identifying how changes in concentration in one cell region influence concentration of other proteins in other regions.
Results: Using 3D confocal microscope movies of GFP-tagged T cells undergoing costimulation, we learned models containing putative causal relationships among 12 proteins involved in T cell signaling. The models included both relationships consistent with current knowledge and novel predictions deserving further exploration. Further, when these models were applied to the initial frames of movies of T cells that had been only partially stimulated, they predicted the localization of proteins at later times with statistically significant accuracy. The methods, consisting of spatiotemporal alignment, automated region identification, and causal inference, are anticipated to be applicable to a number of biological systems.

Incorporating Interaction Networks into the Determination of Functionally Related Hit Genes in Genomic Experiments with Markov Random Fields
COSI: NetBio
Date: TBA
Room: TBA
  • Laurent Guyon, CEA, France
  • J. Pablo Radicella, CEA, France
  • Guillaume Pinna, CEA, France
  • Anna Campalans, CEA, France
  • Jaakko Nevalainen, University of Turku, Finland
  • Sean Robinson, University of Turku, Finland

Presentation Overview:

Motivation: Incorporating gene interaction data into the identification of ‘hit’ genes in genomic experiments is a well-established approach leveraging the ‘guilt by association’ assumption to obtain a network based hit list of functionally related genes. We aim to develop a method to allow for multivariate gene scores and multiple hit labels in order to extend the analysis of genomic screening data within such an approach.

Results: We propose a Markov random field based method to achieve our aim and show that the particular advantages of our method compared to those currently used lead to new insights in previously analysed data as well as for our own motivating data. Our method additionally achieves the best performance in an independent simulation experiment. The real data applications we consider comprise a survival analysis and differential expression experiment and a cell-based RNA interference functional screen.

Availability: We provide all of the data and code related to the results in the paper.

Multiple network-constrained regressions expand insights into influenza vaccination responses
COSI: NetBio
Date: TBA
Room: TBA
  • Steven H. Kleinstein, Yale University School of Medicine, United States
  • Albert C. Shaw, Yale School of Medicine, United States
  • Sui Tsang, Yale School of Medicine, United States
  • Barbara Siconolfi, Yale School of Medicine, United States
  • Samit R. Joshi, Yale School of Medicine, United States
  • Heidi Zapata, Yale School of Medicine, United States
  • Jean Wilson, Yale School of Medicine, United States
  • Subhasis Mohanty, Yale School of Medicine, United States
  • Stefan Avey, Yale University, United States

Presentation Overview:

Motivation: Systems immunology leverages recent technological advancements that enable broad profiling of the immune system to better understand the response to infection and vaccination, as well as the dysregulation that occurs in disease. An increasingly common approach to gain insights from these large-scale profiling experiments involves the application of statistical learning methods to predict disease states or the immune response to perturbations. However, the goal of many systems studies is not to maximize accuracy, but rather to gain biological insights. The predictors identified using current approaches can be uninterpretable or present only one of many equally predictive models, leading to a narrow understanding of the underlying biology.
Results: Here we show that incorporating prior biological knowledge within a logistic modeling framework by using network-level constraints on transcriptional profiling data significantly improves interpretability. Moreover, incorporating different types of biological knowledge produces models that highlight distinct aspects of the underlying biology, while maintaining predictive accuracy. We propose a new framework, Logistic Multiple Network-constrained Regression (LogMiNeR), and apply it to understand the mechanisms underlying differential responses to influenza vaccination. While standard logistic regression approaches were predictive, they were minimally interpretable. Incorporating prior knowledge using LogMiNeR led to models that were equally predictive yet highly interpretable. In this context, B cell-specific genes and mTOR signaling were associated with an effective vaccination response in young adults. Overall, our results demonstrate a new paradigm for analyzing high-dimensional immune profiling data in which multiple networks encoding prior knowledge are incorporated to improve model interpretability.
Availability: The R source code described in this article is publicly available at
Contact: steven.kleinstein@yale.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

Predicting multicellular function through multi-layer tissue networks
COSI: NetBio
Date: TBA
Room: TBA
  • Jure Leskovec, Stanford University, United States
  • Marinka Zitnik, Stanford University, United States

Presentation Overview:

Motivation: Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine.

Results: Here we present OhmNet, a hierarchy-aware unsupervised node feature learning approach for multi-layer networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale tissue hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems.

Denoising Genome-wide Histone ChIP-seq with Convolutional Neural Networks
COSI: RegGen
Date: TBA
Room: TBA
  • Pang Wei Koh, Stanford University, United States
  • Emma Pierson, Stanford, United States
  • Anshul Kundaje, Stanford University, United States

Presentation Overview:

Motivation: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are commonly used to obtain genome-wide profiles of histone modifications associated with different types of functional genomic elements. However, the quality of histone ChIP-seq data is affected by many experimental parameters such as the amount of input DNA, antibody specificity, ChIP enrichment, and sequencing depth. Making accurate inferences from chromatin profiling experiments that involve diverse experimental parameters is challenging.

Results: We introduce a convolutional denoising algorithm, Coda, that uses convolutional neural networks to learn a mapping from suboptimal to high-quality histone ChIP-seq data. This overcomes various sources of noise and variability, substantially enhancing and recovering signal when applied to low-quality chromatin profiling datasets across individuals, cell types, and species. Our method has the potential to improve data quality at reduced costs. More broadly, this approach -- using a high-dimensional discriminative model to encode a generative noise process -- is generally applicable to other biological domains where it is easy to generate noisy data but difficult to analytically characterize the noise or underlying data distribution.

Direct AUC Optimization of Regulatory Motifs
COSI: RegGen
Date: TBA
Room: TBA
  • Lin Zhu, Tongji University, China
  • Hong-Bo Zhang, Tongji University, China
  • De-Shuang Huang, Tongji University, China

Presentation Overview:

Motivation: The discovery of transcription factor binding site (TFBS) motifs is essential for untangling the complex mechanism of genetic variation under different developmental and environmental conditions. Among the large number of computational approaches for de novo identification of TFBS motifs, discriminative motif learning (DML) methods have proven promising for harnessing the large accumulated body of high-throughput binding data. However, they have to sacrifice accuracy for speed and could fail to fully utilize the information of the input sequences.
Results: We propose a novel algorithm called CDAUC for optimizing DML-learned motifs based on the area under the receiver-operating characteristic curve (AUC) criterion, which has been widely used in the literature to evaluate the significance of extracted motifs. We show that when the considered AUC loss function is optimized in a coordinate-wise manner, the cost function of each resultant sub-problem is a piece-wise constant function, whose optimal value can be found exactly and efficiently. Further, a key step of each iteration of CDAUC can be efficiently solved as a computational geometry problem. Experimental results on real-world high-throughput datasets illustrate that CDAUC outperforms competing methods for refining DML motifs, while being one order of magnitude faster. Preliminary results also suggest that CDAUC may be useful for improving the interpretability of convolutional kernels generated by emerging deep learning approaches for predicting TF sequence specificities.
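
The piece-wise constant behaviour of AUC along a single coordinate can be seen in a much simpler setting than motif learning. The sketch below is not CDAUC: it assumes a plain linear scorer over made-up feature vectors, and the names `auc`, `score`, and `best_coordinate_value` are hypothetical. Each (positive, negative) pair changes its ranking at exactly one value of the chosen weight, so probing one point per interval between those breakpoints finds the coordinate-wise optimum exactly.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def score(w, x):
    """Linear score of feature vector x under weights w (an assumed model)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def best_coordinate_value(w, j, pos, neg):
    """Maximise AUC over w[j] alone, holding the other weights fixed."""
    breaks = set()
    for p in pos:
        for n in neg:
            dj = p[j] - n[j]
            if dj != 0:
                # Score difference with the w[j] contribution removed.
                rest = score(w, p) - score(w, n) - w[j] * dj
                breaks.add(-rest / dj)  # value of w[j] where the pair ties
    ordered = sorted(breaks)

    def auc_at(value):
        trial = list(w)
        trial[j] = value
        return auc([score(trial, p) for p in pos],
                   [score(trial, n) for n in neg])

    if not ordered:
        return w[j], auc_at(w[j])
    # AUC is constant between breakpoints, so one probe per interval suffices.
    probes = [ordered[0] - 1.0]
    probes += [(a + b) / 2.0 for a, b in zip(ordered, ordered[1:])]
    probes.append(ordered[-1] + 1.0)
    best = max(probes, key=auc_at)
    return best, auc_at(best)


pos = [(1.0, 0.0), (2.0, 1.0)]   # made-up positive examples
neg = [(0.0, 1.0), (1.5, 3.0)]   # made-up negative examples
value, best_auc = best_coordinate_value([0.0, 1.0], 0, pos, neg)
```

On this toy data the starting weights give an AUC of 0.125, and optimizing the first weight alone raises it to 0.75.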

Exploiting sequence-based features for predicting enhancer-promoter interactions
COSI: RegGen
Date: TBA
Room: TBA
  • Yang Yang, Carnegie Mellon University, United States
  • Ruochi Zhang, Tsinghua University, China
  • Shashank Singh, Carnegie Mellon University, United States
  • Jian Ma, Carnegie Mellon University, United States

Presentation Overview:

Motivation: A large number of distal enhancers and proximal promoters form enhancer-promoter interactions to regulate target genes in the human genome. Although recent high-throughput genome-wide mapping approaches have allowed us to more comprehensively recognize potential enhancer-promoter interactions, it is still largely unknown whether sequence-based features alone are sufficient to predict such interactions.

Results: Here we develop a new computational method (named PEP) to predict enhancer-promoter interactions based on sequence-based features only, when the locations of putative enhancers and promoters in a particular cell type are given. The two modules in PEP (PEP-Motif and PEP-Word) use different but complementary feature extraction strategies to exploit sequence-based information. The results across six different cell types demonstrate that our method is effective in predicting enhancer-promoter interactions as compared to the state-of-the-art methods that use functional genomic signals. Our work demonstrates that sequence-based features alone can reliably predict enhancer-promoter interactions genome-wide, which could potentially facilitate the discovery of important sequence determinants for long-range gene regulation.

Availability: The source code of PEP is available at: https://github.com/ma-compbio/PEP.

Contact: jianma@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

miniMDS: 3D structural inference from high-resolution Hi-C data
COSI: RegGen
Date: TBA
Room: TBA
  • Shaun Mahony, Pennsylvania State University, United States
  • Lila Rieber, Pennsylvania State University, United States

Presentation Overview:


Motivation: Recent experiments have provided Hi-C data at resolution as high as 1 Kbp. However, 3D structural inference from high-resolution Hi-C datasets is often computationally unfeasible using existing methods.

Results: We have developed miniMDS, an approximation of multidimensional scaling (MDS) that partitions a Hi-C dataset, performs high-resolution MDS separately on each partition, and then reassembles the partitions using low-resolution MDS. miniMDS is faster, more accurate, and uses less memory than existing methods for inferring the 3D structure of the human genome at high resolution (10 Kbp).

Availability: A Python implementation of miniMDS is available on GitHub: https://github.com/seqcode/miniMDS

Contact: mahony@psu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

TITER: predicting translation initiation sites by deep learning
COSI: RegGen
Date: TBA
Room: TBA
  • Jianyang Zeng, Institute for Interdisciplinary Information Sciences, Tsinghua University, China
  • Lei Zhang, School of Medicine, Tsinghua University, China
  • Tao Jiang, Department of Computer Science and Engineering, University of California, Riverside, United States
  • Hailin Hu, School of Medicine, Tsinghua University, China
  • Sai Zhang, Institute for Interdisciplinary Information Sciences, Tsinghua University, China

Presentation Overview:

Motivation: Translation initiation is a key step in the regulation of gene expression. In addition to the annotated translation initiation sites (TISs), the translation process may also start at multiple alternative TISs (including both AUG and non-AUG codons), which makes it challenging to predict TISs and study the underlying regulatory mechanisms. Meanwhile, the advent of several high-throughput sequencing techniques for profiling initiating ribosomes at single-nucleotide resolution, e.g., GTI-seq and QTI-seq, provides abundant data for systematically studying the general principles of translation initiation and for developing computational methods for TIS identification.

Methods: We have developed a deep learning based framework, named TITER, for accurately predicting TISs on a genome-wide scale based on QTI-seq data. TITER extracts the sequence features of translation initiation from the surrounding sequence contexts of TISs using a hybrid neural network and further integrates the prior preference of TIS codon composition into a unified prediction framework.

Results: Extensive tests demonstrated that TITER can greatly outperform the state-of-the-art prediction methods in identifying TISs. In addition, TITER was able to identify important sequence signatures for individual types of TIS codons, including a Kozak-sequence-like motif for the AUG start codon. Furthermore, the TITER prediction score can be related to the strength of translation initiation in various biological scenarios, including the repressive effect of the upstream open reading frames (uORFs) on gene expression and the mutational effects influencing translation initiation efficiency.

DeepBound: Accurate Identification of Transcript Boundaries via Deep Convolutional Neural Fields
Date: TBA
Room: TBA
  • Mingfu Shao, Computational Biology Department, Carnegie Mellon University, United States
  • Jianzhu Ma, University of California, San Diego, United States
  • Sheng Wang, Department of Human Genetics, University of Chicago, United States

Presentation Overview:

Reconstructing the full-length expressed transcripts (a.k.a. the transcript assembly problem) from the short sequencing reads produced by the RNA-seq protocol plays a central role in identifying novel genes and transcripts as well as in studying gene expression and gene function. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the read alignments. In contrast to the splicing junctions, which can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging, because the signal related to boundaries is noisy and weak.
We present DeepBound, an effective approach to identify boundaries of expressed transcripts from RNA-seq read alignments. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we introduce the novel step of incorporating the AUC (area under the curve) score into the optimization objective. To address the issue that deep probabilistic graphical models require a large number of labeled training samples, we propose using simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulation datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods.

Efficient approximations of RNA kinetics landscape using non-redundant sampling
Date: TBA
Room: TBA
  • Juraj Michalik, Inria Saclay, France
  • Hélène Touzet, CNRS, University of Lille and INRIA, France
  • Yann Ponty, CNRS/LIX, Polytechnique, France

Presentation Overview:

Motivation: Kinetics is key to understanding many phenomena involving RNAs, such as co-transcriptional folding and riboswitches. Exact out-of-equilibrium studies induce extreme computational demands, leading state-of-the-art methods to rely on approximated kinetics landscapes, obtained using sampling strategies that strive to generate the key landmarks of the landscape topology. However, such methods are impeded by a high level of redundancy within sampled sets. Such redundancy is uninformative and obfuscates important intermediate states, leading to an incomplete vision of RNA dynamics.
Results: We introduce RNANR, a new set of algorithms for the exploration of RNA kinetics landscapes at the secondary structure level. RNANR considers locally optimal structures, a reduced set of RNA conformations, in order to focus its sampling on basins in the kinetic landscape. Along with an exhaustive enumeration, RNANR implements a novel non-redundant stochastic sampling, and offers a rich array of structural parameters. Our tests on both real and random RNAs reveal that RNANR generates more unique structures in a given time than its competitors, and allows a deeper exploration of kinetics landscapes.
Availability: RNANR is freely available at https://project.inria.fr/rnalands/rnanr

Integrative Deep Models for Alternative Splicing
Date: TBA
Room: TBA
  • Yoseph Barash, University of Pennsylvania, United States
  • Matthew Gazzara, University of Pennsylvania Perelman School of Medicine, United States
  • Anupama Jha, University of Pennsylvania, United States

Presentation Overview:

Motivation: Advancements in sequencing technologies have highlighted the role of alternative splicing (AS) in increasing transcriptome complexity. This role of AS, combined with the relation of aberrant splicing to malignant states, motivated two streams of research, experimental and computational. The first involves a myriad of techniques such as RNA-Seq and CLIP-Seq to identify splicing regulators and their putative targets. The second involves probabilistic models, also known as splicing codes, which infer regulatory mechanisms and predict splicing outcome directly from genomic sequence. To date, these models have utilized only expression data. In this work we address two related challenges: Can we improve on previous models for AS outcome prediction, and can we integrate additional sources of data to improve predictions for AS regulatory factors?

Results: We perform a detailed comparison of two previous modeling approaches, Bayesian and deep neural networks, dissecting the confounding effects of datasets and target functions. We then develop a new target function for AS prediction in exon skipping events and show it significantly improves model accuracy. Next, we develop a modeling framework that leverages transfer learning to incorporate CLIP-Seq, knockdown and overexpression experiments, which are inherently noisy and suffer from missing values. Using several datasets involving key splice factors in mouse brain, muscle and heart, we demonstrate both the prediction improvements and biological insights offered by our new models. Overall, the framework we propose offers a scalable integrative solution to improve splicing code modeling as vast amounts of relevant genomic data become available.
Availability: code and data available at:

A scalable moment-closure approximation for large-scale biochemical reaction networks
COSI: SysMod
Date: TBA
Room: TBA
  • Atefeh Kazeroonian, Technische Universität München, Germany
  • Fabian J. Theis, Helmholtz Zentrum München, Germany
  • Jan Hasenauer, Helmholtz Zentrum München, Germany

Presentation Overview:

Motivation: Stochastic molecular processes are a leading cause of cell-to-cell variability. Their dynamics are often described by continuous-time discrete-state Markov chains and simulated using stochastic simulation algorithms. As these stochastic simulations are computationally demanding, ordinary differential equation models for the dynamics of the statistical moments have been developed. The number of state variables of these approximating models, however, grows at least quadratically with the number of biochemical species. This limits their application to small- and medium-sized processes.

Results: In this manuscript, we present a scalable moment-closure approximation (sMA) for the simulation of statistical moments of large-scale stochastic processes. The sMA exploits the structure of the biochemical reaction network to reduce the covariance matrix. We prove that sMA yields approximating models whose number of state variables depends predominantly on local properties, i.e. the average node degree of the reaction network, instead of the overall network size. The resulting complexity reduction is assessed by studying a range of medium- and large-scale biochemical reaction networks. To evaluate the approximation accuracy and the improvement in computational efficiency, we study models for JAK2/STAT5 signalling and NFkB signalling. Our method is applicable to generic biochemical reaction networks and we provide an implementation, including an SBML interface, which renders the sMA easily accessible.

Availability: The sMA is implemented in the open-source MATLAB toolbox CERENA and is available from https://github.com/CERENADevelopers/CERENA.
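
The scaling argument in the abstract above can be made concrete with a small counting sketch. A second-order moment model for n species tracks n means plus n(n+1)/2 distinct variances and covariances. The sparse count below keeps only covariances between species that share a reaction; this rule, the chain network, and the function names are illustrative assumptions, not the reduction rule implemented in CERENA.

```python
from itertools import combinations


def full_second_order_size(n_species):
    """Means plus all distinct (co)variances: n + n(n+1)/2."""
    return n_species + n_species * (n_species + 1) // 2


def sparse_second_order_size(n_species, reactions):
    """Means, variances, and covariances only for species pairs sharing a reaction.

    `reactions` is an iterable of sets of species indices (an assumed encoding).
    """
    pairs = set()
    for species in reactions:
        pairs.update(combinations(sorted(species), 2))
    return 2 * n_species + len(pairs)


# Hypothetical linear chain: species i is converted to species i + 1.
n = 100
chain = [{i, i + 1} for i in range(n - 1)]
print(full_second_order_size(n))           # 5150
print(sparse_second_order_size(n, chain))  # 299
```

In this toy chain every species takes part in at most two reactions, so the sparse count grows linearly with n while the full count grows quadratically.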

Efficient Simulation of Intrinsic, Extrinsic and External Noise in Biochemical Systems
COSI: SysMod
Date: TBA
Room: TBA
  • Dennis Pischel, Otto-von-Guericke University Magdeburg, Germany
  • Kai Sundmacher, Max Planck Institute for Dynamics of Complex Technical Systems, Germany
  • Robert J Flassig, Max Planck Institute for Dynamics of Complex Technical Systems, Germany

Presentation Overview:

Motivation: Biological cells operate in a noisy regime influenced by intrinsic, extrinsic, and external noise, which leads to large differences of individual cell states. Stochastic effects must be taken into account to accurately characterize biochemical kinetics. Since the exact solution of the chemical master equation, which governs the underlying stochastic process, cannot be derived for most biochemical systems, approximate methods are used to obtain a solution.
Results: In this study a method to efficiently simulate the various sources of noise simultaneously is proposed and benchmarked on several examples. The method relies on the combination of the sigma point approach to describe extrinsic and external variability and the τ-leaping algorithm to account for the stochasticity due to probabilistic reactions. The comparison of our method to extensive Monte Carlo calculations demonstrates an immense computational advantage while losing an acceptable amount of accuracy. Additionally the application to parameter optimization problems in stochastic biochemical reaction networks is shown, which is rarely applied due to its huge computational burden. To give further insight a MATLAB® script is provided including the proposed method applied to a simple toy example of gene expression.
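
As a rough illustration of the τ-leaping ingredient only (not the authors' combined sigma point method, and not their MATLAB script), the sketch below simulates a toy gene expression model with constant production and first-order decay. The rate constants, step size, and function names are hypothetical, and the Poisson sampler is Knuth's simple method, which is adequate only for small means.

```python
import math
import random


def poisson(lam, rng):
    """Sample a Poisson variate with Knuth's method (suitable for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def tau_leap_birth_death(k_prod, k_deg, x0, tau, n_steps, rng):
    """Tau-leaping for production (rate k_prod) and decay (rate k_deg * x)."""
    x = x0
    trajectory = [x]
    for _ in range(n_steps):
        # Fire each reaction channel a Poisson-distributed number of times per leap.
        births = poisson(k_prod * tau, rng)
        deaths = poisson(k_deg * x * tau, rng)
        # Clamp at zero: plain tau-leaping can overshoot to negative counts.
        x = max(x + births - deaths, 0)
        trajectory.append(x)
    return trajectory


rng = random.Random(0)
traj = tau_leap_birth_death(k_prod=10.0, k_deg=0.1, x0=0, tau=0.1, n_steps=5000, rng=rng)
tail = traj[1000:]
# The time average should fluctuate around the stationary mean k_prod / k_deg = 100.
print(sum(tail) / len(tail))
```

Extrinsic variability could then be layered on top by repeating such runs for a small set of deterministically chosen parameter values, which is the role the abstract assigns to the sigma point approach.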

Estimation of time-varying growth, uptake and excretion rates from dynamic metabolomics data
COSI: SysMod
Date: TBA
Room: TBA
  • Eugenio Cinquemani, INRIA Grenoble - Rhone-Alpes, France
  • Valerie Laroute, Universite de Toulouse, France
  • Muriel Cocaign-Bousquet, INRA, France
  • Hidde de Jong, INRIA Grenoble - Rhone-Alpes, France
  • Delphine Ropers, INRIA Grenoble - Rhone-Alpes, France

Presentation Overview:

Motivation: Technological advances in metabolomics have made it possible to monitor the concentration of extracellular metabolites over time. From these data it is possible to compute the rates of uptake and excretion of the metabolites by a growing cell population, providing precious information on the functioning of intracellular metabolism. The computation of the rate of these exchange reactions, however, is difficult to achieve in practice for a number of reasons, notably noisy measurements, correlations between the concentration profiles of the different extracellular metabolites, and discontinuities in the profiles due to sudden changes in metabolic regime.
Results: We present a method for precisely estimating time-varying uptake and excretion rates from time-series measurements of extracellular metabolite concentrations, specifically addressing all of the above issues. The estimation problem is formulated in a regularized Bayesian framework and solved by a combination of extended Kalman filtering and smoothing. The method is shown to improve upon methods based on spline smoothing of the data. Moreover, when applied to two actual datasets, the method recovers known features of overflow metabolism in E. coli and L. lactis, and provides evidence for acetate uptake by L. lactis after glucose exhaustion. The results raise interesting perspectives for further work on rate estimation from measurements of intracellular metabolites.

popFBA: tackling intratumour heterogeneity with Flux Balance Analysis
COSI: SysMod
Date: TBA
Room: TBA
  • Giancarlo Mauri, Dept of Informatics, Systems and Communication, University Milano-Bicocca, 20126, Milano, Italy
  • Riccardo Colombo, Dept of Informatics, Systems and Communication, University Milano-Bicocca, 20126, Milano, Italy
  • Davide Maspero, Dept of Biotechnology and Biosciences, University Milano-Bicocca, 20126, Milano, Italy
  • Dario Pescini, Dept of Statistics and Quantitative Methods, University Milano-Bicocca, 20126, Milano, Italy
  • Marzia Di Filippo, Dept of Biotechnology and Biosciences, University Milano-Bicocca, 20126, Milano, Italy
  • Chiara Damiani, Dept of Informatics, Systems and Communication, University Milano-Bicocca, 20126, Milano, Italy

Presentation Overview:

Motivation: Intratumour heterogeneity poses many challenges to the treatment of cancer. Unfortunately, the transcriptional and metabolic information retrieved by currently available computational and experimental techniques portrays the average behaviour of intermixed and heterogeneous cell subpopulations within a given tumour. Emerging single-cell genomic analyses are nonetheless unable to characterise the interactions among cancer subpopulations. In this work, we propose popFBA, an extension to classic Flux Balance Analysis (FBA), to explore how metabolic heterogeneity and cooperation phenomena affect the overall growth of cancer cell populations.
Results: We show how clones of a metabolic network of human central carbon metabolism, sharing the same stoichiometry and capacity constraints, may follow several different metabolic paths and cooperate to maximise the growth of the total population. We also introduce a method to explore the space of possible interactions, given some constraints on plasma supply of nutrients. We illustrate how alternative nutrients in plasma supply and/or a dishomogeneous distribution of oxygen provision may affect the landscape of heterogeneous phenotypes. We finally provide a technique to identify the most proliferative cells within the heterogeneous population.
Availability: the popFBA MATLAB function and the SBML model are available at https://github.com/BIMIB-DISCo/popFBA
Contact: chiara.damiani@unimib.it

Association testing of bisulfite sequencing methylation data via a Laplace approximation
COSI: TransMed
Date: Tuesday, July 25 10:35 - 10:50
Room: TBA
  • Omer Weissbrod, Technion and Tel Aviv University, Israel
  • Elior Rahmani, Tel Aviv University, Israel
  • Regev Schweiger, Tel Aviv University, Israel
  • Saharon Rosset, Tel Aviv University, Israel
  • Eran Halperin, UCLA, United States

Presentation Overview:

Epigenome-wide association studies (EWAS) can provide novel insights into the regulation of genes involved in traits and diseases. The rapid emergence of bisulfite sequencing technologies enables performing such genome-wide studies at the resolution of single nucleotides. However, analysis of data produced by bisulfite sequencing poses statistical challenges owing to low and uneven sequencing depth, as well as the presence of confounding factors. The recently introduced Mixed model Association for Count data via data AUgmentation (MACAU) can address these challenges via a generalized linear mixed model (GLMM) when confounding can be encoded via a single variance component. However, MACAU cannot be used in the presence of multiple variance components. Additionally, MACAU uses a computationally expensive Markov Chain Monte Carlo (MCMC) procedure, which cannot directly approximate the model likelihood.

We present a new method, Mixed model Association via a Laplace ApproXimation (MALAX), that is more computationally efficient than MACAU and allows modeling of multiple variance components. MALAX uses a Laplace approximation rather than MCMC-based approximations, which enables direct approximation of the model likelihood. Through an extensive analysis of simulated and real data, we demonstrate that MALAX successfully addresses statistical challenges introduced by bisulfite sequencing while controlling for complex sources of confounding, and can be over 50% faster than the state of the art.

Availability and Implementation:
The full source code of MALAX is available at

Contact: omerw@cs.technion.ac.il or ehalperin@cs.ucla.edu

Identification of Associations between Genotypes and Longitudinal Phenotypes via Temporally-constrained Group Sparse Canonical Correlation Analysis
COSI: TransMed
Date: Tuesday, July 25 10:50am - 11:05am
Room: TBA
  • Xiaoke Hao, Nanjing University of Aeronautics and Astronautics, China
  • Chanxiu Li, Nanjing University of Aeronautics and Astronautics, China
  • Jingwen Yan, Indiana University, United States
  • Xiaohui Yao, Indiana University, United States
  • Shannon Risacher, Indiana University, United States
  • Andrew Saykin, Indiana University, United States
  • Li Shen, Indiana University, United States
  • Daoqiang Zhang, Nanjing University of Aeronautics and Astronautics, China

Presentation Overview:

Motivation: Neuroimaging genetics identifies the relationships between genetic variants (i.e., the single nucleotide polymorphisms (SNPs)) and brain imaging data to reveal the associations from genotypes to phenotypes. So far, most existing machine learning approaches detect associations between genetic variants and brain imaging data at a single time-point. However, those associations are based on static phenotypes and ignore the temporal dynamics of the phenotypical changes. The phenotypes across multiple time-points may exhibit temporal patterns that can be used to facilitate the understanding of the degenerative process. In this paper, we propose a novel temporally-constrained group sparse canonical correlation analysis (TGSCCA) framework to identify genetic associations with longitudinal phenotypic markers.
Results: The proposed TGSCCA method is able to capture the temporal changes in the brain from longitudinal phenotypes by incorporating the fused penalty, which requires that the differences between two consecutive canonical weight vectors from adjacent time-points be small. A new efficient optimization algorithm is designed to solve the objective function. Furthermore, we demonstrate the effectiveness of our algorithm on both synthetic and real data (i.e., the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort, including progressive mild cognitive impairment (pMCI), stable MCI (sMCI) and Normal Control (NC) participants). In comparison with conventional SCCA, our proposed method can achieve strong associations and discover phenotypic biomarkers across multiple time-points to guide disease-progressive interpretation.
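
The fused penalty described above can be sketched as a sum of distances between canonical weight vectors at adjacent time-points. The version below assumes an L1 distance, which is one common choice for fused penalties; the exact form used in TGSCCA may differ, and the function name and example weights are hypothetical.

```python
def fused_penalty(weights):
    """Sum over adjacent time-points of the L1 distance between weight vectors.

    `weights` is a list of weight vectors, one per time-point, in temporal order.
    """
    return sum(
        sum(abs(b - a) for a, b in zip(prev, nxt))
        for prev, nxt in zip(weights, weights[1:])
    )


# Hypothetical canonical weights for three imaging features at three time-points.
smooth = [[0.5, 0.0, 0.25], [0.5, 0.0, 0.5], [0.75, 0.0, 0.5]]
erratic = [[0.5, 0.0, 0.25], [-0.5, 1.0, 0.0], [0.75, 0.0, 0.5]]
print(fused_penalty(smooth))   # 0.5
print(fused_penalty(erratic))  # 5.0
```

Adding such a term to the objective favors weight trajectories that change gradually over time, which is the temporal constraint the abstract refers to.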

Molecular signatures that can be transferred across different omics platforms
COSI: TransMed
Date: Tuesday, July 25 9:15am - 9:30am
Room: TBA
  • Michael Altenbuchinger, University of Regensburg, Germany
  • Philipp Schwarzfischer, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Thorsten Rehberg, Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Jörg Reinders, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Christian W. Kohler, Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Wolfram Gronwald, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Julia Richter, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Monika Szczepanowski, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Neus Masqué-Soler, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Wolfram Klapper, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Peter J. Oefner, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Rainer Spang, University of Regensburg, Germany

Presentation Overview:

Motivation: Molecular signatures for treatment recommendations are well researched. Still it is challenging to apply them to data generated by different protocols or technical platforms.

Results: We analyzed paired data for the same tumors (Burkitt lymphoma, diffuse large B-cell lymphoma) and features that had been generated by different experimental protocols and analytical platforms including the nanoString nCounter and Affymetrix Gene Chip transcriptomics as well as the SWATH and SRM proteomics platforms. A statistical model that assumes independent sample and feature effects accounted for 69% to 94% of technical variability. We analyzed how variability is propagated through linear signatures possibly affecting predictions and treatment recommendations. Linear signatures with feature weights adding to zero were substantially more robust than unbalanced signatures. They yielded consistent predictions across data from different platforms, both for transcriptomics and proteomics data. Similarly stable were their predictions across data from fresh frozen and matching formalin-fixed paraffin-embedded human tumor tissue.

Availability: The R-package “zeroSum” can be downloaded at https://github.com/rehbergT/zeroSum. Complete data and R codes necessary to reproduce all our results can be received from the authors upon request.
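
One way to see why weights that add to zero help: if a platform change adds the same constant a to every feature of a sample (an assumed additive, log-scale form of the sample effect mentioned above), then a linear signature with weights w_j changes by a times the sum of the w_j, which vanishes exactly when the weights sum to zero. The sketch below checks this numerically; the weights, feature values, and shift are hypothetical, and the code does not use the zeroSum package.

```python
def linear_score(weights, features, intercept=0.0):
    """Score of a linear signature: intercept + sum_j w_j * x_j."""
    return intercept + sum(w * x for w, x in zip(weights, features))


zero_sum = [0.5, -1.5, 1.0]    # weights add to zero
unbalanced = [0.5, -0.5, 1.0]  # weights add to 1.0

sample = [2.0, 4.0, 8.0]       # hypothetical log-scale feature values
shift = 3.0                    # hypothetical sample-wide additive platform effect
shifted = [x + shift for x in sample]

print(linear_score(zero_sum, sample), linear_score(zero_sum, shifted))      # 3.0 3.0
print(linear_score(unbalanced, sample), linear_score(unbalanced, shifted))  # 7.0 10.0
```

Feature-specific offsets, by contrast, add the same constant to the score of every sample and can be absorbed into the intercept.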

Predicting phenotypes from microarrays using amplified, initially marginal, eigenvector regression
COSI: TransMed
Date: Tuesday, July 25 11:05am - 11:20am
Room: TBA
  • Daniel Mcdonald, Indiana University, United States
  • Lei Ding, Indiana University Bloomington, United States

Presentation Overview:

Motivation: The discovery of relationships between gene expression measurements and phenotypic responses is hampered by both computational and statistical impediments. Conventional statistical methods are less than ideal because they either fail to select relevant genes, predict poorly, ignore the unknown interaction structure between genes, or are computationally intractable. Thus, the creation of new methods which can handle many expression measurements on relatively small numbers of patients while also uncovering gene-gene relationships and predicting well is desirable.

Results: We develop a new technique for using the marginal relationship between gene expression measurements and patient survival outcomes to identify a small subset of genes which appear highly relevant for predicting survival, produce a low-dimensional embedding based on this small subset, and amplify this embedding with information from the remaining genes. We motivate our methodology by using gene expression measurements to predict survival time for patients with diffuse large B-cell lymphoma, illustrate the behavior of our methodology on carefully constructed synthetic examples, and test it on a number of other gene expression datasets. Our technique is computationally tractable, generally outperforms other methods, is extensible to other phenotypes, and also identifies different genes (relative to existing methods) for possible future study.

Key words: regression; principal components; matrix sketching; preconditioning; high-dimensional

Availability: All of the code and data are available at http://mypage.iu.edu/~dajmcdon/research/

Contact: dajmcdon@indiana.edu

Supplementary information: Supplementary material is available at Bioinformatics online.

Systematic identification of feature combinations for predicting drug response with Bayesian multi-view multi-task linear regression
COSI: TransMed
Date: Tuesday, July 25 11:20 am - 11:35 am
Room: TBA
  • Tero Aittokallio, Institute of Molecular Medicine Finland, University of Helsinki, Finland
  • Krister Wennerberg, Institute of Molecular Medicine Finland, University of Helsinki, Finland
  • Suleiman Ali Khan, Institute of Molecular Medicine Finland, University of Helsinki, Finland
  • Muhammad Ammad-Ud-Din, Institute of Molecular Medicine Finland, University of Helsinki, Finland

Presentation Overview:

Motivation: A prime challenge in precision cancer medicine is to identify genomic and molecular features that are predictive of drug treatment responses in cancer cells. Although there are several computational models for accurate drug response prediction, these often lack the ability to infer which feature combinations are the most predictive, particularly for high-dimensional molecular data sets. As increasing amounts of diverse genome-wide data sources are becoming available, there is a need to build new computational models that can effectively combine these data sources and identify maximally predictive feature combinations.
Results: We present a novel approach that leverages systematic integration of data sources to identify response-predictive features of multiple drugs. To solve the modeling task, we implement a Bayesian linear regression method. To further improve the usefulness of the proposed model, we exploit the known human cancer kinome for identifying biologically relevant feature combinations. In case studies with a synthetic data set and two publicly available cancer cell line data sets, we demonstrate the improved accuracy of our method compared to widely used approaches in drug response analysis. As key examples, our model identifies meaningful combinations of features for the well-known EGFR, ALK, PLK and PDGFR inhibitors.
Availability: The source code of the method is available at https://github.com/suleimank/mvlr
Contact: muhammad.ammad-ud-din@helsinki.fi or suleiman.khan@helsinki.fi
Supplementary information: Supplementary data are available at Bioinformatics online.

Genomes as documents of evolutionary history: a probabilistic macrosynteny model for the reconstruction of ancestral genomes
Date: TBA
Room: TBA
  • Yoichiro Nakatani, Trinity College Dublin, University of Dublin, Ireland
  • Aoife McLysaght, Trinity College Dublin, University of Dublin, Ireland

Presentation Overview:

It has been argued that whole-genome duplication (WGD) exerted a profound influence on the course of evolution. For the purpose of fully understanding the impact of WGD, several formal algorithms have been developed for reconstructing pre-WGD gene order in yeast and plants. However, to the best of our knowledge, those algorithms have never been successfully applied to WGD events in teleosts and vertebrates, impeded by extensive gene shuffling and gene losses.
Here we present a probabilistic model of macrosynteny (i.e., conserved linkage or chromosome-scale distribution of orthologs), develop a variational Bayes algorithm for inferring the structure of pre-WGD genomes, and study estimation accuracy by simulation. Then, by applying the method to the teleost WGD, we demonstrate the effectiveness of the algorithm in a situation where gene-order reconstruction algorithms perform relatively poorly due to a high rate of rearrangement and extensive gene losses. Our high-resolution reconstruction reveals previously overlooked small-scale rearrangements, necessitating a revision to previous views on genome structure evolution in teleosts and vertebrates.
We have reconstructed the structure of a pre-WGD genome by employing a variational Bayes approach that was originally developed for inferring topics from millions of text documents. Interestingly, comparison of the macrosynteny and topic model algorithms suggests that macrosynteny can be regarded as documents on ancestral genome structure. From this perspective, the present study would seem to provide a textbook example of the prevalent metaphor that genomes are documents of evolutionary history.

Increasing the power of meta-analysis of genome-wide association studies to detect heterogeneous effects
Date: TBA
Room: TBA
  • Cue Hyunkyu Lee, Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, Korea
  • Eleazar Eskin, Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, United States
  • Buhm Han, Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, Korea

Presentation Overview:

Meta-analysis is essential to combine the results of genome-wide association studies (GWASs). Recent large-scale meta-analyses have combined studies of different ethnicities, environments, and even studies of different related phenotypes. These differences between studies can manifest as effect size heterogeneity. We previously developed a modified random effects model (RE2) that can achieve higher power to detect heterogeneous effects than the commonly used fixed effects model (FE). However, RE2 cannot perform meta-analysis of correlated statistics, which are found in recent research designs, and the identified variants often overlap with those found by FE. Here, we propose RE2C, which increases the power of RE2 in two ways. First, we generalized the likelihood model to account for correlations of statistics to achieve optimal power, using an optimization technique based on spectral decomposition for efficient parameter estimation. Second, we modified the statistic to focus on the heterogeneous effects that FE cannot detect, thereby increasing the power to identify new associations. We developed an efficient and accurate p-value approximation procedure using analytical decomposition of the statistic. In simulations, RE2C achieved a 71% increase in power compared with 21% for the decoupling approach when the statistics were correlated. Even when the statistics are uncorrelated, RE2C achieves a modest increase in power. Applications to real genetic data supported the utility of RE2C. RE2C is highly efficient and can meta-analyze one hundred GWASs in one day.
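
For orientation, the sketch below shows the standard inverse-variance fixed effects estimate together with Cochran's Q heterogeneity statistic on made-up effect sizes. It is not an implementation of RE2 or RE2C; it only illustrates why a fixed effects test can miss heterogeneous effects that cancel across studies. All numbers and function names are hypothetical.

```python
import math


def fixed_effects(betas, ses):
    """Inverse-variance weighted estimate, its standard error, and z-score."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, beta / se


def cochran_q(betas, ses):
    """Weighted sum of squared deviations from the fixed effects estimate."""
    weights = [1.0 / se ** 2 for se in ses]
    beta_fe, _, _ = fixed_effects(betas, ses)
    return sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))


# Hypothetical per-study effect sizes with equal standard errors.
ses = [0.05, 0.05, 0.05, 0.05]
homogeneous = [0.10, 0.12, 0.09, 0.11]
heterogeneous = [0.30, -0.30, 0.25, -0.25]

for name, betas in [("homogeneous", homogeneous), ("heterogeneous", heterogeneous)]:
    beta, se, z = fixed_effects(betas, ses)
    print(name, "z =", round(z, 2), "Q =", round(cochran_q(betas, ses), 2))
```

In the heterogeneous example the opposing effects cancel in the weighted mean, so the fixed effects z-score is zero even though Q is large. Signals of this kind are what heterogeneity-aware statistics aim to capture.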

When loss-of-function is loss of function: assessing mutational signatures and impact of loss-of-function genetic variants
Date: TBA
Room: TBA
  • Kymberleigh Pagel, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Vikas Pejaver, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Guan Ning Lin, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Hyunjun Nam, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Matthew Mort, Institute of Medical Genetics, Cardiff University, United Kingdom
  • David N Cooper, Institute of Medical Genetics, Cardiff University, United Kingdom
  • Jonathan Sebat, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Lilia M Iakoucheva, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Sean D Mooney, Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, United States
  • Predrag Radivojac, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States

Presentation Overview:

Motivation: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation with little to no consideration of the structural and functional implications for the protein. They further do not provide information to the user regarding specific molecular alterations potentially causative of disease.
Results: To address this, we investigate protein features underlying loss-of-function genetic variation and develop a machine learning method, MutPred-LOF, for the discrimination of pathogenic and tolerated variants that can also generate hypotheses on specific molecular events disrupted by the variant. We investigate a large set of human variants derived from the Human Gene Mutation Database, ClinVar, and the Exome Aggregation Consortium. Our prediction method shows an area under the Receiver Operating Characteristic curve of 0.85 for all loss-of-function variants and 0.75 for proteins in which both pathogenic and neutral variants have been observed. We applied MutPred-LOF to a set of 1,142 de novo variants from neurodevelopmental disorders and find enrichment of pathogenic variants in affected individuals. Overall, our results highlight the potential of computational tools to elucidate causal mechanisms underlying loss of protein function in loss-of-function variants.