Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Proceedings Track Presentations

Special Issue of Bioinformatics
A New Method to Study the Change of miRNA-mRNA Interactions Due to Environmental Exposures
COSI: proceedings
  • Pei Wang, Icahn School of Medicine at Mount Sinai, United States
  • Susan L. Teitelbaum, Icahn School of Medicine at Mount Sinai, United States
  • Jia Chen, Icahn School of Medicine at Mount Sinai, United States
  • Nyan Win Khin, Icahn School of Medicine at Mount Sinai, United States
  • Maya Kappil, Icahn School of Medicine at Mount Sinai, United States
  • Kalpana Gopalakrishnan, Icahn School of Medicine at Mount Sinai, United States
  • Vasily N. Aushev, Icahn School of Medicine at Mount Sinai, United States
  • Francesca Petralia, Icahn School of Medicine at Mount Sinai, United States

Motivation: Integrative approaches characterizing the interactions among different types of biological molecules have been demonstrated to be useful for revealing informative biological mechanisms. One such example is the interaction between microRNA (miRNA) and messenger RNA (mRNA), whose deregulation may be sensitive to environmental insult leading to altered phenotypes. The goal of this work is to develop an effective data integration method to characterize deregulation between miRNA and mRNA due to environmental toxicant exposures. We will use data from an animal experiment designed to investigate the effect of low-dose environmental chemical exposure on normal mammary gland development in rats to motivate and evaluate the proposed method.

Results: We propose a new network approach - integrative Joint Random Forest (iJRF), which characterizes the regulatory system between miRNAs and mRNAs using a network model. iJRF is designed to work under the high-dimension low-sample-size regime, and can borrow information across different treatment conditions to achieve more accurate network inference. It also effectively takes into account prior information of miRNA-mRNA regulatory relationships from existing databases. When iJRF was applied to the data from the environmental chemical exposure study, we detected a few important miRNAs that regulated a large number of mRNAs in the control group but not in the chemical exposure groups, suggesting disruptions of miRNA activities due to chemical exposure. Effects of chemical exposure on two miRNAs were further validated using human breast cancer cell lines.

A scalable moment-closure approximation for large-scale biochemical reaction networks
COSI: proceedings
  • Atefeh Kazeroonian, Technische Universität München, Germany
  • Fabian Theis, Helmholtz-Zentrum München, Germany
  • Jan Hasenauer, Helmholtz-Zentrum München, Germany

Motivation: Stochastic molecular processes are a leading cause of cell-to-cell variability. Their dynamics are often described by continuous-time discrete-state Markov chains and simulated using stochastic simulation algorithms. As these stochastic simulations are computationally demanding, ordinary differential equation models for the dynamics of the statistical moments have been developed. The number of state variables of these approximating models, however, grows at least quadratically with the number of biochemical species. This limits their application to small- and medium-sized processes.

Results: In this manuscript, we present a scalable moment-closure approximation (sMA) for the simulation of statistical moments of large-scale stochastic processes. The sMA exploits the structure of the biochemical reaction network to reduce the covariance matrix. We prove that sMA yields approximating models whose number of state variables depends predominantly on local properties, i.e. the average node degree of the reaction network, instead of the overall network size. The resulting complexity reduction is assessed by studying a range of medium- and large-scale biochemical reaction networks. To evaluate the approximation accuracy and the improvement in computational efficiency, we study models for JAK2/STAT5 signalling and NFkB signalling. Our method is applicable to generic biochemical reaction networks and we provide an implementation, including an SBML interface, which renders the sMA easily accessible.

Availability: The sMA is implemented in the open-source MATLAB toolbox CERENA and is available from https://github.com/CERENADevelopers/CERENA.
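
The scaling argument in the abstract can be made concrete with a small counting sketch. It compares the number of state variables of a full second-order moment model (all means, variances, and covariances) with a reduced model that keeps a covariance only for species pairs that take part in a common reaction. That reduction rule, the function names, and the toy chain network are assumptions made for illustration; this is not the CERENA implementation.

```python
from itertools import combinations


def full_moment_count(n_species):
    """State variables of a full second-order model: n means + n(n+1)/2 (co)variances."""
    return n_species + n_species * (n_species + 1) // 2


def reduced_moment_count(n_species, reactions):
    """State variables when covariances are kept only for species sharing a reaction.

    `reactions` is a list of tuples of species indices. This rule is an
    assumption for illustration, not the published sMA reduction.
    """
    pairs = set()
    for species in reactions:
        for a, b in combinations(sorted(set(species)), 2):
            pairs.add((a, b))
    # n means + n variances + one covariance per retained pair
    return 2 * n_species + len(pairs)


def chain_network(n_species):
    """Hypothetical toy network: species i is converted into species i + 1."""
    return [(i, i + 1) for i in range(n_species - 1)]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, full_moment_count(n), reduced_moment_count(n, chain_network(n)))
```

In the toy chain every species interacts with at most two neighbours, so the reduced count grows linearly with the number of species while the full count grows quadratically.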

Alignment of dynamic networks
COSI: proceedings
  • Tijana Milenkovic, University of Notre Dame, United States
  • Dominic Critchlow, University of Notre Dame, United States
  • Vipin Vijayan, University of Notre Dame, United States

Networks can model real-world systems in a variety of domains. Network alignment (NA) aims to find a node mapping that conserves similar regions between compared networks. NA is applicable to many fields, including computational biology, where NA can guide the transfer of biological knowledge from well- to poorly-studied species across aligned network regions. Existing NA methods can only align static networks. However, most complex real-world systems evolve over time and should thus be modeled as dynamic networks. We hypothesize that aligning dynamic network representations of evolving systems will produce superior alignments compared to aligning the systems' static network representations, as is currently done. For this purpose, we introduce the first ever dynamic NA method, DynaMAGNA++. This proof-of-concept dynamic NA method is an extension of a state-of-the-art static NA method, MAGNA++. Even though both MAGNA++ and DynaMAGNA++ optimize edge as well as node conservation across the aligned networks, MAGNA++ conserves static edges and similarity between static node neighborhoods, while DynaMAGNA++ conserves dynamic edges (events) and similarity between evolving node neighborhoods. For this purpose, we introduce the first ever measure of dynamic edge conservation and rely on our recent measure of dynamic node conservation. Importantly, the two dynamic conservation measures can be optimized using any state-of-the-art NA method and not just MAGNA++. We confirm our hypothesis that dynamic NA is superior to static NA, under fair comparison conditions, on synthetic and real-world networks, in computational biology and social network domains. DynaMAGNA++ is parallelized and it includes a user-friendly graphical interface.

Applying meta-analysis to Genotype-Tissue Expression data from multiple tissues to identify eQTLs and increase the number of eGenes
COSI: proceedings
  • Dat Duong, UCLA, U.S.A

Motivation: There is recent interest in using gene expression data to contextualize findings from traditional genome-wide association studies (GWAS). Conditioned on a tissue, expression quantitative trait loci (eQTLs) are genetic variants associated with gene expression, and eGenes are genes whose expression levels are associated with genetic variants. eQTLs and eGenes provide great supporting evidence for GWAS hits and important insights into the regulatory pathways involved in many diseases. When a significant variant or a candidate gene identified by GWAS is also an eQTL or eGene, there is strong evidence to further study this variant or gene. Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. Unfortunately, these datasets often have small sample sizes in some tissues. For this reason, there have been many meta-analysis methods designed to combine gene expression data across many tissues to increase power for finding eQTLs and eGenes. However, these existing techniques are not scalable to datasets containing many tissues, like the GTEx data. Furthermore, these methods ignore a biological insight that the same variant may be associated with the same gene across similar tissues.
Results: We introduce a meta-analysis model that addresses these problems in existing methods. We focus on the problem of finding eGenes in gene expression data from many tissues, and show that our model is better than other types of meta-analyses.
Availability: Source code and supplementary data are at https://github.com/datduong/RECOV.

BIOSSES: A Semantic Sentence Similarity Estimation System for the Biomedical Domain
COSI: proceedings
  • Arzucan Özgür, Boğaziçi University, Turkey
  • Gizem Soğancıoğlu, Boğaziçi University, Turkey
  • Hakime Öztürk, Boğaziçi University, Turkey

Motivation: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarisation. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text.

Methods: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods.

Results: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems by up to 42.6% in terms of the Pearson correlation metric.

Availability: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/
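
As a rough illustration of the evaluation setup described above, the sketch below scores sentence pairs with one simple string similarity measure (token-level Jaccard overlap) and compares system scores with human scores using the Pearson correlation. The choice of Jaccard overlap and both function names are assumptions for illustration; the abstract does not state which string measures BIOSSES uses.

```python
import math


def jaccard_similarity(sentence_a, sentence_b):
    """Token-level Jaccard overlap between two whitespace-tokenized sentences."""
    tokens_a = set(sentence_a.lower().split())
    tokens_b = set(sentence_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)
```

A system would be evaluated by computing a similarity score for each benchmark sentence pair and passing those scores, together with the averaged human annotations, to `pearson`.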

CATS (Coordinates of Atoms by Taylor Series): Protein design with backbone flexibility in all locally feasible directions
COSI: proceedings
  • Mark Hallen, Toyota Technological Institute at Chicago, United States
  • Bruce Donald, Duke University, United States

Motivation: When proteins mutate or bind to ligands, their backbones often move significantly, especially in loop regions. Computational protein design algorithms must model these motions in order to accurately optimize protein stability and binding affinity. However, methods for backbone conformational search in design have been much more limited than for sidechain conformational search. This is especially true for combinatorial protein design algorithms, which aim to search a large sequence space efficiently and thus cannot rely on temporal simulation of each candidate sequence.
Results: We alleviate this difficulty with a new parameterization of backbone conformational space, which represents all degrees of freedom of a specified segment of protein chain that maintain valid bonding geometry (by maintaining the original bond lengths and angles and ω dihedrals). In order to search this space, we present an efficient algorithm, CATS, for computing atomic coordinates as a function of our new continuous backbone internal coordinates. CATS generalizes the iMinDEE and EPIC protein design algorithms, which model continuous flexibility in sidechain dihedrals, to model continuous, appropriately localized flexibility in the backbone dihedrals φ and ψ as well. We show using 81 test cases based on 29 different protein structures that CATS finds sequences and conformations that are significantly lower in energy than methods with less or no backbone flexibility do. In particular, we show that CATS can model the viability of an antibody mutation known experimentally to increase affinity, but that appears sterically infeasible when modeled with less or no backbone flexibility.
Availability: Our code is available as free software at https://github.com/donaldlab/OSPREY_refactor
Contact: mhallen@ttic.edu, brd+ismb17@cs.duke.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

Deep Learning with Word Embeddings improves Biomedical Named Entity Recognition
COSI: proceedings
  • Ulf Leser, Humboldt-Universität zu Berlin, Germany
  • David Luis Wiegandt, Humboldt-Universität zu Berlin, Germany
  • Mariana Neves, Hasso-Plattner-Institute, Germany
  • Leon Weber, Humboldt-Universität zu Berlin, Germany
  • Maryam Habibi, Humboldt-Universität zu Berlin, Germany

Motivation: Text mining has become an important tool for biomedical research. The most fundamental text mining task is the recognition of biomedical named entities (NER), such as genes, chemicals, and diseases. Current NER methods rely on predefined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold standard corpus, which makes extrapolation of quality measures difficult.
Results: We show that a completely generic method based on deep learning and statistical word embeddings (called LSTM-CRF) outperforms state-of-the-art entity-specific NER tools, and often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
Availability: The source code for LSTM-CRF is available at https://github.com/glample/tagger and the links to the corpora are available at https://corposaurus.github.io/corpora.
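
For readers unfamiliar with the metric, the sketch below shows one common way to compute precision, recall, and F1-score for NER at the entity level, counting a prediction as correct only when its span and type match a gold annotation exactly. The exact-match criterion, the `(start, end, type)` tuple layout, and the function name are assumptions for illustration; the abstract does not specify the matching rule used in the study.

```python
def entity_f1(gold, predicted):
    """Return (precision, recall, F1) for exact-match entity annotations.

    `gold` and `predicted` are iterables of (start, end, entity_type) tuples.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = [(0, 5, "Gene"), (10, 17, "Chemical"), (20, 27, "Disease")]
    predicted = [(0, 5, "Gene"), (10, 17, "Disease")]
    print(entity_f1(gold, predicted))
```

Because F1 is the harmonic mean of precision and recall, a gain in recall alone raises F1 as long as precision does not drop by a comparable amount.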

Efficient Simulation of Intrinsic, Extrinsic and External Noise in Biochemical Systems
COSI: proceedings
  • Dennis Pischel, Otto-von-Guericke University Magdeburg, Germany
  • Kai Sundmacher, Max Planck Institute for Dynamics of Complex Technical Systems, Germany
  • Robert J. Flassig, Max Planck Institute for Dynamics of Complex Technical Systems, Germany

Estimation of time-varying growth, uptake and excretion rates from dynamic metabolomics data
COSI: proceedings
  • Eugenio Cinquemani, INRIA Grenoble - Rhone-Alpes, France
  • Valérie Laroute, Universite de Toulouse, France
  • Muriel Cocaign-Bousquet, INRIA, France
  • Hidde de Jong, INRIA Grenoble - Rhone-Alpes, France
  • Delphine Ropers, INRIA Grenoble - Rhone-Alpes, France

Genomes as documents of evolutionary history: a probabilistic macrosynteny model for the reconstruction of ancestral genomes.
COSI: proceedings
  • Yoichiro Nakatani, Trinity College Dublin, University of Dublin, Ireland
  • Aoife McLysaght, Trinity College Dublin, University of Dublin, Ireland

Motivation:
It has been argued that whole-genome duplication (WGD) exerted a profound influence on the course of evolution. For the purpose of fully understanding the impact of WGD, several formal algorithms have been developed for reconstructing pre-WGD gene order in yeasts and plants. However, to the best of our knowledge, those algorithms have never been successfully applied to WGD events in teleosts and vertebrates, impeded by extensive gene shuffling and gene losses.
Results:
Here we present a probabilistic model of macrosynteny (i.e., conserved linkage or chromosome-scale distribution of orthologs), develop a variational Bayes algorithm for inferring the structure of pre-WGD genomes, and study estimation accuracy by simulation. Then, by applying the method to the teleost WGD, we demonstrate the effectiveness of the algorithm in a situation where gene-order reconstruction algorithms perform relatively poorly due to a high rate of rearrangement and extensive gene losses. Our high-resolution reconstruction reveals previously overlooked small-scale rearrangements, necessitating a revision to previous views on genome structure evolution in teleosts and vertebrates.
Conclusions:
We have reconstructed the structure of a pre-WGD genome by employing a variational Bayes approach that was originally developed for inferring topics from millions of text documents. Interestingly, comparison of the macrosynteny and topic model algorithms suggests that macrosynteny can be regarded as documents on ancestral genome structure. From this perspective, the present study would seem to provide a textbook example of the prevalent metaphor that genomes are documents of evolutionary history.

Image-based Spatiotemporal Causality Inference for Protein Signaling Networks
COSI: proceedings
  • Christoph Wülfing, University of Bristol, United Kingdom
  • Xiongtao Ruan, Carnegie Mellon University, United States
  • Robert F. Murphy, Carnegie Mellon University, United States

Motivation: Efforts to model how signaling and regulatory networks work in cells have largely either not considered spatial organization or have used compartmental models with minimal spatial resolution. Fluorescence microscopy provides the ability to monitor the spatiotemporal distribution of many molecules during signaling events, but as of yet no methods have been described for large scale image analysis to learn a complex protein regulatory network. Here we present and evaluate methods for identifying how changes in concentration in one cell region influence concentration of other proteins in other regions.

Results: Using 3D confocal microscope movies of GFP-tagged T cells undergoing costimulation, we learned models containing putative causal relationships among 12 proteins involved in T cell signaling. The models included both relationships consistent with current knowledge and novel predictions deserving further exploration. Further, when these models were applied to the initial frames of movies of T cells that had been only partially stimulated, they predicted the localization of a number of proteins at later times with statistically significant accuracy. The methods, consisting of spatiotemporal alignment, automated region identification, and causal inference, are anticipated to be applicable to a number of biological systems.

Increasing the power of meta-analysis of genome-wide association studies to detect heterogeneous effects.
COSI: proceedings
  • Cue Hyunkyu Lee, Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, South Korea
  • Eleazar Eskin, Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, United States
  • Buhm Han, Department of Convergence Medicine, University of Ulsan College of Medicine & Asan Institute for Life Sciences, Asan Medical Center, South Korea

Meta-analysis is essential to combine the results of genome-wide association studies (GWASs). Recent large-scale meta-analyses have combined studies of different ethnicities, environments, and even studies of different related phenotypes. These differences between studies can manifest as effect size heterogeneity. We previously developed a modified random effects model (RE2) that can achieve higher power to detect heterogeneous effects than the commonly used fixed effects model (FE). However, RE2 cannot perform meta-analysis of correlated statistics, which are found in recent research designs, and the identified variants often overlap with those found by FE. Here, we propose RE2C, which increases the power of RE2 in two ways. First, we generalized the likelihood model to account for correlations of statistics to achieve optimal power, using an optimization technique based on spectral decomposition for efficient parameter estimation. Second, we modified the statistic to focus on the heterogeneous effects that FE cannot detect, thereby increasing the power to identify new associations. We developed an efficient and accurate p-value approximation procedure using analytical decomposition of the statistic. In simulations, RE2C achieved a 71% increase in power compared with 21% for the decoupling approach when the statistics were correlated. Even when the statistics are uncorrelated, RE2C achieves a modest increase in power. Applications to real genetic data supported the utility of RE2C. RE2C is highly efficient and can meta-analyze one hundred GWASs in one day.
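
For context, the fixed effects model (FE) that the abstract uses as its baseline can be sketched in a few lines: per-study effect estimates are combined with inverse-variance weights, and Cochran's Q summarizes how much the studies disagree. This is the textbook FE calculation only; it does not implement RE2 or RE2C, and the function name and return layout are assumptions for illustration.

```python
import math


def fixed_effects_meta(betas, standard_errors):
    """Inverse-variance fixed effects meta-analysis.

    Returns (pooled_beta, pooled_se, z_score, cochran_q).
    """
    if len(betas) != len(standard_errors) or not betas:
        raise ValueError("need one standard error per effect estimate")
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z_score = pooled_beta / pooled_se
    # Cochran's Q: weighted squared deviations from the pooled estimate
    cochran_q = sum(w * (b - pooled_beta) ** 2 for w, b in zip(weights, betas))
    return pooled_beta, pooled_se, z_score, cochran_q


if __name__ == "__main__":
    print(fixed_effects_meta([0.1, 0.3], [0.1, 0.1]))
```

A large Q relative to the number of studies signals effect size heterogeneity, which is the situation in which the abstract reports that FE loses power.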

popFBA: tackling intratumour heterogeneity with Flux Balance Analysis
COSI: proceedings
  • Chiara Damiani, Dept of Informatics, Systems and Communication, University Milano-Bicocca, 20126, Milano, Italy
  • Marzia Di Filippo, Dept of Biotechnology and Biosciences, University Milano-Bicocca, 20126, Milano, Italy
  • Dario Pescini, Dept of Statistics and Quantitative Methods, University Milano-Bicocca, 20126, Milano, Italy
  • Davide Maspero, Dept of Biotechnology and Biosciences, University Milano-Bicocca, 20126, Milano, Italy
  • Riccardo Colombo, Dept of Informatics, Systems and Communication, University Milano-Bicocca, 20126, Milano, Italy
  • Giancarlo Mauri, Dept of Informatics, Systems and Communication, University Milano-Bicocca, 20126, Milano, Italy

Proceedings: Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference
COSI: proceedings
  • Clement Train, University of Lausanne, Switzerland
  • Natasha Glover, University of Lausanne, Switzerland
  • Gaston Gonnet, ETH Zurich, Switzerland
  • Adrian Altenhoff, University of Lausanne, Switzerland
  • Christophe Dessimoz, University of Lausanne, Switzerland

Rectified Factor Networks for Biclustering of Omics Data
COSI: proceedings
  • Djork-Arné Clevert, Bayer AG, Germany

Motivation: Biclustering has become a major tool for analyzing large data sets given as a matrix of samples times features and has been successfully applied in life sciences and e-commerce for drug design and recommender systems, respectively. FABIA, one of the most successful biclustering methods, is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated and sample membership is difficult to determine. We propose to use the recently introduced unsupervised deep learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method, which enforces non-negative and normalized posterior means. Each code unit represents a bicluster, where samples for which the code unit is active belong to the bicluster and features that have activating weights to the code unit belong to the bicluster.
Results: On 400 benchmark data sets and on three gene expression data sets with known clusters, RFN outperformed 13 other biclustering methods including FABIA. On data of the 1000 Genomes Project, RFN could identify DNA segments which indicate that interbreeding with other hominins started before ancestors of modern humans left Africa.
Availability and Implementation: https://github.com/bioinf-jku/librfn
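
The membership rule stated in the abstract (a sample belongs to a bicluster when the code unit is active for it, and a feature belongs when it has an activating weight to that unit) can be sketched as follows. The matrix layout, the tolerance threshold, and the function name are assumptions for illustration; this sketch does not use or reproduce the librfn API.

```python
def extract_biclusters(codes, weights, tol=1e-9):
    """Read bicluster memberships from a code matrix and a weight matrix.

    codes:   list of rows, one per sample, with one non-negative value per code unit
    weights: list of rows, one per feature, with one weight per code unit
    tol:     assumed threshold below which a value counts as inactive
    Returns one (sample_indices, feature_indices) pair per code unit.
    """
    n_units = len(codes[0])
    biclusters = []
    for unit in range(n_units):
        samples = [i for i, row in enumerate(codes) if row[unit] > tol]
        features = [j for j, row in enumerate(weights) if abs(row[unit]) > tol]
        biclusters.append((samples, features))
    return biclusters


if __name__ == "__main__":
    codes = [[0.5, 0.0], [0.0, 1.2], [0.3, 0.0]]
    weights = [[1.0, 0.0], [0.0, -2.0], [0.4, 0.0], [0.0, 0.0]]
    print(extract_biclusters(codes, weights))
```

Because the code values are sparse and non-negative, membership can be read off directly with a threshold, which is the property the abstract contrasts with FABIA's hard-to-determine sample membership.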

SnapDock - Template Based Docking by Geometric Hashing
COSI: proceedings
  • Michael Estrin, Tel Aviv University, Israel
  • Haim J. Wolfson, Tel Aviv University, Israel

A highly efficient template-based protein-protein docking algorithm, nicknamed SnapDock, is presented. It employs a Geometric Hashing based structural alignment scheme to align the target proteins to the interfaces of non-redundant protein-protein interface libraries. Docking of a pair of proteins utilizing the 22,600 interface PIFACE library is performed in less than 2 minutes on average. A flexible version of the algorithm allowing hinge motion in one of the proteins is presented as well. To evaluate the performance of the algorithm, a blind re-modelling of 3,547 PDB complexes that were uploaded after the PIFACE publication was performed, with a success ratio of about 35%. Interestingly, a similar experiment with the template-free PatchDock docking algorithm yielded a success rate of about 23%, with roughly 1/2 of the solutions different from those of SnapDock. Thus the combination of the two methods gave a 42% success ratio.

When loss-of-function is loss of function: assessing mutational signatures and impact of loss-of-function genetic variants
COSI: proceedings
  • Kymberleigh Pagel, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Vikas Pejaver, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Guan Ning Lin, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Hyunjun Nam, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Matthew Mort, Institute of Medical Genetics, Cardiff University, United Kingdom
  • David N Cooper, Institute of Medical Genetics, Cardiff University, United Kingdom
  • Jonathan Sebat, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Lilia M Iakoucheva, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Sean D Mooney, Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, United States
  • Predrag Radivojac, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States

Motivation: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation with little to no consideration of the structural and functional implications for the protein. They further do not provide information to the user regarding specific molecular alterations potentially causative of disease.
Results: To address this, we investigate protein features underlying loss-of-function genetic variation and develop a machine learning method, MutPred-LOF, for the discrimination of pathogenic and tolerated variants that can also generate hypotheses on specific molecular events disrupted by the variant. We investigate a large set of human variants derived from the Human Gene Mutation Database, ClinVar, and the Exome Aggregation Consortium. Our prediction method shows an area under the Receiver Operating Characteristic curve of 0.85 for all loss-of-function variants and 0.75 for proteins in which both pathogenic and neutral variants have been observed. We applied MutPred-LOF to a set of 1,142 de novo variants from neurodevelopmental disorders and found enrichment of pathogenic variants in affected individuals. Overall, our results highlight the potential of computational tools to elucidate causal mechanisms underlying loss of protein function in loss-of-function variants.
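
The area under the ROC curve quoted above has a simple interpretation: it is the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen tolerated one. The sketch below computes it by direct pairwise comparison, counting ties as one half. The label encoding (1 for pathogenic, 0 for tolerated) and the function name are assumptions for illustration and are not taken from MutPred-LOF.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The pairwise loop is quadratic in the number of variants; a rank-based formulation gives the same value more efficiently for large data sets.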

01:Deep learning based subdivision approach for large scale macromolecules structure recovery from electron cryo tomograms
COSI: proceedings
  • Min Xu, Carnegie Mellon University, United States
  • Xiaoqi Chai, Carnegie Mellon University, United States
  • Hariank Muthakana, Carnegie Mellon University, United States
  • Xiaodan Liang, Carnegie Mellon University, United States
  • Ge Yang, Carnegie Mellon University, United States
  • Tzviya Zeev-Ben-Mordehai, University of Oxford, United Kingdom
  • Eric Xing, Carnegie Mellon University, United States

Motivation: Cellular Electron CryoTomography (CECT) enables 3D visualization of cellular organization at near-native state and in sub-molecular resolution, making it a powerful tool for analyzing structures of macromolecular complexes and their spatial organizations inside single cells. However, the high degree of structural complexity together with practical imaging limitations makes the systematic de novo discovery of structures within cells challenging. It would likely require averaging and classifying millions of subtomograms potentially containing hundreds of highly heterogeneous structural classes. Although it is no longer difficult to acquire CECT data containing such numbers of subtomograms due to advances in data acquisition automation, existing computational approaches have very limited scalability or discrimination ability, making them incapable of processing such amounts of data.

Results: To complement existing approaches, in this paper we propose a new approach for subdividing subtomograms into smaller but relatively homogeneous subsets. The structures in these subsets can then be separately recovered using existing computation intensive methods. Our approach is based on supervised structural feature extraction using deep learning, in combination with unsupervised clustering and reference-free classification. Our experiments show that, compared to existing unsupervised rotation invariant feature and pose-normalization based approaches, our new approach achieves significant improvements in both discrimination ability and scalability. More importantly, our new approach is able to discover new structural classes and recover structures that do not exist in training data.

Availability: Source code freely available at http://www.cs.cmu.edu/~mxu1/software

Contact: mxu1@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

02:Molecular signatures that can be transferred across different omics platforms
COSI: proceedings
  • Rainer Spang, University of Regensburg, Germany
  • Peter J. Oefner, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Wolfram Klapper, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Neus Masqué-Soler, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Monika Szczepanowski, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Julia Richter, Department of Pathology, Hematopathology Section and Lymph Node Registry, University Hospital Schleswig-Holstein, Campus Kiel/Christian-Albrecht University, Kiel, Germany
  • Wolfram Gronwald, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Christian W. Kohler, Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Jörg Reinders, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Thorsten Rehberg, Statistical Bioinformatics, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Philipp Schwarzfischer, Institute of Functional Genomics, University of Regensburg, Regensburg, Germany
  • Michael Altenbuchinger, University of Regensburg, Germany

Presentation Overview: Show

02:Tumor Phylogeny Inference Using Tree-Constrained Importance Sampling
COSI: proceedings
  • Benjamin Raphael, Princeton University, United States
  • Gryte Satas, Brown University, United States

Presentation Overview: Show

03:Direct AUC Optimization of Regulatory Motifs
COSI: proceedings
  • Lin Zhu, Tongji University, China
  • Hong-Bo Zhang, Tongji University, China
  • De-Shuang Huang, Tongji University, China

Presentation Overview: Show

03:Proceedings: DextMP: Deep dive into Text for predicting Moonlighting Proteins
COSI: proceedings
  • Ishita Khan, Purdue University, United States
  • Mansurul Bhuiyan, IUPUI, United States
  • Daisuke Kihara, Purdue University, United States

Presentation Overview: Show

04:Association testing of bisulfite sequencing methylation data via a Laplace approximation
COSI: proceedings
  • Omer Weissbrod, Weizmann Institute of Science, Israel

Presentation Overview: Show

04:Predicting multicellular function through multi-layer tissue networks
COSI: proceedings
  • Jure Leskovec, Stanford University, United States
  • Marinka Zitnik, Stanford University, United States

Presentation Overview: Show

Motivation: Understanding functions of proteins in specific human tissues is essential for insights into disease diagnostics and therapeutics, yet prediction of tissue-specific cellular function remains a critical challenge for biomedicine.

Results: Here we present OhmNet, a hierarchy-aware unsupervised node feature learning approach for multi-layer tissue networks. We build a multi-layer network, where each layer represents molecular interactions in a different human tissue. OhmNet then automatically learns a mapping of proteins, represented as nodes, to a neural embedding based low-dimensional space of features. OhmNet encourages sharing of similar features among proteins with similar network neighborhoods and among proteins activated in similar tissues. The algorithm generalizes prior work, which generally ignores relationships between tissues, by modeling tissue organization with a rich multiscale hierarchy. We use OhmNet to study multicellular function in a multi-layer protein interaction network of 107 human tissues. In 48 tissues with known tissue-specific cellular functions, OhmNet provides more accurate predictions of cellular function than alternative approaches, and also generates more accurate hypotheses about tissue-specific protein actions. We show that taking into account the tissue hierarchy leads to improved predictive power. Remarkably, we also demonstrate that it is possible to leverage the tissue hierarchy in order to effectively transfer cellular functions to a functionally uncharacterized tissue. Overall, OhmNet moves from flat networks to multiscale models able to predict a range of phenotypes spanning cellular subsystems.
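
One way to picture how a tissue hierarchy can encourage feature sharing is a penalty that pulls each tissue's embedding of a protein toward the embedding held by its parent in the hierarchy. The sketch below is an illustration under that assumption, not OhmNet's published objective; the function name, the tissue names, and the toy vectors are all hypothetical.

```python
def hierarchy_penalty(embeddings, parent):
    """Sum of squared distances between a child's and its parent's
    embedding of the same protein, over all child -> parent links."""
    total = 0.0
    for child, par in parent.items():
        for protein, vec in embeddings[child].items():
            # Only proteins present at both levels contribute.
            if protein in embeddings[par]:
                par_vec = embeddings[par][protein]
                total += sum((a - b) ** 2 for a, b in zip(vec, par_vec))
    return total

# Hypothetical two-level hierarchy: two tissues under one parent.
parent = {"cerebellum": "brain", "cortex": "brain"}
embeddings = {
    "brain": {"P1": [0.0, 0.0]},
    "cerebellum": {"P1": [1.0, 0.0]},
    "cortex": {"P1": [0.0, 2.0]},
}
penalty = hierarchy_penalty(embeddings, parent)
```

Adding such a term to a per-tissue embedding loss would make embeddings of the same protein more similar in tissues that sit close together in the hierarchy, which is the kind of sharing the abstract describes.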

05:Identification of Associations between Genotypes and Longitudinal Phenotypes via Temporally-constrained Group Sparse Canonical Correlation Analysis
COSI: proceedings
  • Daoqiang Zhang, Nanjing University of Aeronautics and Astronautics, China
  • Andrew Saykin, Indiana University, United States
  • Li Shen, Indiana University, United States
  • Shannon Risacher, Indiana University, United States
  • Jingwen Yan, Indiana University, United States
  • Xiaohui Yao, Indiana University, United States
  • Chanxiu Li, Nanjing University of Aeronautics and Astronautics, China
  • Xiaoke Hao, Nanjing University of Aeronautics and Astronautics, China

Presentation Overview: Show

05:Incorporating Interaction Networks into the Determination of Functionally Related Hit Genes in Genomic Experiments with Markov Random Fields
COSI: proceedings
  • Laurent Guyon, CEA, France
  • J. Pablo Radicella, CEA, France
  • Anna Campalans, CEA, France
  • Guillaume Pinna, CEA, France
  • Jaakko Nevalainen, University of Turku, Finland
  • Sean Robinson, University of Turku, Finland

Presentation Overview: Show

Motivation: Incorporating gene interaction data into the identification of ‘hit’ genes in genomic experiments is a well-established approach leveraging the ‘guilt by association’ assumption to obtain a network based hit list of functionally related genes. We aim to develop a method to allow for multivariate gene scores and multiple hit labels in order to extend the analysis of genomic screening data within such an approach.

Results: We propose a Markov random field based method to achieve our aim and show that the particular advantages of our method compared to those currently used lead to new insights in previously analysed data as well as for our own motivating data. Our method additionally achieves the best performance in an independent simulation experiment. The real data applications we consider comprise a survival analysis and differential expression experiment and a cell-based RNA interference functional screen.

Availability: We provide all of the data and code related to the results in the paper.

06:Modelling haplotypes with respect to reference cohort variation graphs
COSI: proceedings
  • Benedict Paten, UCSC, United States
  • Jordan Eizenga, University of California Santa Cruz, United States
  • Yohei Rosen, University of California, Santa Cruz, United States

Presentation Overview: Show

06:Predicting phenotypes from microarrays using amplified, initially marginal, eigenvector regression
COSI: proceedings
  • Daniel Mcdonald, Indiana University, United States
  • Lei Ding, Indiana University Bloomington, United States

Presentation Overview: Show

07:Large-scale structure prediction by improved contact predictions and model quality assessment
COSI: proceedings
  • Arne Elofsson, Stockholm University, Sweden
  • David Menendez Hurtado, Stockholm University, Sweden
  • Mirco Michel, Stockholm University, Sweden

Presentation Overview: Show

Motivation: Accurate contact predictions can be used for predicting the structure of proteins. Until recently these methods were limited to very big protein families, decreasing their utility. However, recent progress by combining direct coupling analysis with machine learning methods has made it possible to predict accurate contact maps for smaller families. To what extent these predictions can be used to produce accurate models of the families is not known.

Results: We present the PconsFold2 pipeline that uses contact predictions from PconsC3, the CONFOLD folding algorithm and model quality estimations to predict the structure of a protein. We show that the model quality estimation significantly increases the number of models that can be reliably identified. Finally, we apply PconsFold2 to 6379 Pfam families of unknown structure and find that PconsFold2 can, with an estimated 90% specificity, predict the structure of up to 450 Pfam families of unknown structure. Of these, 343 have not been reported before.

Availability: Datasets as well as models of all the 450 Pfam families are available at http://c3.pcons.net/. All programs used here are freely available.

Contact: arne@bioinfo.se

Supplementary information: No supplementary data

07:Systematic identification of feature combinations for predicting drug response with Bayesian multi-view multi-task linear regression
COSI: proceedings
  • Tero Aittokallio, Institute of Molecular Medicine Finland, University of Helsinki, Finland
  • Krister Wennerberg, Institute of Molecular Medicine Finland, University of Helsinki, Finland
  • Suleiman Ali Khan, Institute of Molecular Medicine Finland, University of Helsinki, Finland
  • Muhammad Ammad-Ud-Din, Institute of Molecular Medicine Finland, University of Helsinki, Finland

Presentation Overview: Show

08:Denoising Genome-wide Histone ChIP-seq with Convolutional Neural Networks
COSI: proceedings
  • Pang Wei Koh, Stanford University, United States
  • Emma Pierson, Stanford, United States
  • Anshul Kundaje, Stanford University, United States

Presentation Overview: Show

09:Discovery and genotyping of novel sequence insertions in many sequenced individuals.
COSI: proceedings
  • Pınar Kavak, Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
  • Yen-Yi Lin, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  • Ibrahim Numanagić, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  • Hossein Asghari, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
  • Tunga Güngör, Department of Computer Engineering, Boğaziçi University, İstanbul, Turkey
  • Can Alkan, Department of Computer Engineering, Bilkent University, Ankara, Turkey
  • Faraz Hach, School of Computing Science, Simon Fraser University, Burnaby, BC, Canada

Presentation Overview: Show

09:Exploiting sequence-based features for predicting enhancer-promoter interactions
COSI: proceedings
  • Yang Yang, Carnegie Mellon University, United States
  • Ruochi Zhang, Tsinghua University, China
  • Shashank Singh, Carnegie Mellon University, United States
  • Jian Ma, Carnegie Mellon University, United States

Presentation Overview: Show

10:Chromatin Accessibility Prediction via Convolutional Long Short-Term Memory Networks with k-mer Embedding
COSI: proceedings
  • Ning Chen, Tsinghua University, China
  • Rui Jiang, Tsinghua University, China
  • Ting Chen, Tsinghua University, China
  • Wanwen Zeng, Tsinghua University, China
  • Xu Min, Tsinghua University, China

Presentation Overview: Show

11:HopLand: Single-cell pseudotime recovery using continuous Hopfield network based modeling of Waddington’s epigenetic landscape
COSI: proceedings
  • Jie Zheng, Nanyang Technological University, Singapore
  • Jing Guo, Nanyang Technological University, Singapore

Presentation Overview: Show

12:Multiple network-constrained regressions expand insights into influenza vaccination responses
COSI: proceedings
  • Albert C. Shaw, Yale School of Medicine, United States
  • Steven H. Kleinstein, Yale School of Medicine, United States
  • Sui Tsang, Yale School of Medicine, United States
  • Barbara Siconolfi, Yale School of Medicine, United States
  • Samit R. Joshi, Yale School of Medicine, United States
  • Heidi Zapata, Yale School of Medicine, United States
  • Jean Wilson, Yale School of Medicine, United States
  • Stefan Avey, Yale School of Medicine, United States
  • Subhasis Mohanty, Yale School of Medicine, United States

Presentation Overview: Show

Systems immunology leverages recent technological advancements that enable broad profiling of the immune system to better understand the response to infection and vaccination, as well as the dysregulation that occurs in disease. An increasingly common approach to gain insights from these large-scale profiling experiments involves the application of statistical learning methods to predict disease states or the immune response to perturbations. However, the goal of many systems studies is not to maximize accuracy, but rather to gain biological insights. The predictors identified using current approaches can be uninterpretable or present only one of many equally predictive models, leading to a narrow understanding of the underlying biology.

Here we show that incorporating prior biological knowledge within a logistic modeling framework by using network-level constraints on transcriptional profiling data significantly improves interpretability. Moreover, incorporating different types of biological knowledge produces models that highlight distinct aspects of the underlying biology, while maintaining predictive accuracy. We propose a new framework, Logistic Multiple Network-constrained Regression (LogMiNeR), and apply it to understand the mechanisms underlying differential responses to influenza vaccination. While standard logistic regression approaches were predictive, they were minimally interpretable. Incorporating prior knowledge using LogMiNeR led to models that were equally predictive yet highly interpretable. In this context, B cell-specific genes and mTOR signaling were associated with an effective vaccination response in young adults. Overall, our results demonstrate a new paradigm for analyzing high-dimensional immune profiling data in which multiple networks encoding prior knowledge are incorporated to improve model interpretability.
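
To make the idea of a network-level constraint concrete, the sketch below adds a graph-smoothness penalty to a logistic loss, so that genes connected in a prior-knowledge network are pushed toward similar coefficients. This is one commonly used form of network-constrained regression and is assumed here for illustration; the abstract does not specify LogMiNeR's exact penalty, and the function names, the toy network, and the weight `lam` are hypothetical.

```python
import math

def laplacian_penalty(beta, edges):
    """Sum of (beta_i - beta_j)^2 over network edges (i, j).

    Equals beta^T L beta for the unnormalized graph Laplacian L.
    """
    return sum((beta[i] - beta[j]) ** 2 for i, j in edges)

def logistic_loss(beta, X, y):
    """Mean negative log-likelihood of a logistic model (no intercept)."""
    total = 0.0
    for row, label in zip(X, y):
        z = sum(b * x for b, x in zip(beta, row))
        # log(1 + exp(z)), computed stably for large |z|
        softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
        total += softplus - label * z
    return total / len(y)

def network_constrained_loss(beta, X, y, edges, lam):
    """Logistic loss plus lam times the network smoothness penalty."""
    return logistic_loss(beta, X, y) + lam * laplacian_penalty(beta, edges)

# Toy network: genes 0-1-2 form a chain, gene 3 is unconnected.
edges = [(0, 1), (1, 2)]
smooth = [1.0, 1.0, 1.0, -2.0]   # connected genes share a coefficient
rough = [1.0, -1.0, 1.0, -2.0]   # coefficients disagree along both edges
smooth_penalty = laplacian_penalty(smooth, edges)
rough_penalty = laplacian_penalty(rough, edges)
```

Swapping in a different edge list (for example, one derived from another source of prior knowledge) changes which coefficient patterns are favored without changing the data-fit term, which is consistent with the abstract's point that different networks highlight different aspects of the biology at similar predictive accuracy.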

12:TITER: predicting translation initiation sites by deep learning
COSI: proceedings
  • Sai Zhang, Institute for Interdisciplinary Information Sciences, Tsinghua University, China
  • Hailin Hu, School of Medicine, Tsinghua University, China
  • Tao Jiang, Department of Computer Science and Engineering, University of California, Riverside, United States
  • Lei Zhang, School of Medicine, Tsinghua University, China
  • Jianyang Zeng, Institute for Interdisciplinary Information Sciences, Tsinghua University, China

Presentation Overview: Show

15:Efficient approximations of RNA kinetics landscape using non-redundant sampling
COSI: proceedings
  • Juraj Michalik, Inria Saclay, France
  • Hélène Touzet, CNRS, University of Lille and INRIA, France
  • Yann Ponty, CNRS/LIX, Polytechnique, France

Presentation Overview: Show

19:DeepBound: Accurate identification of transcript boundaries via deep convolutional neural fields
COSI: proceedings
  • Mingfu Shao, Computational Biology Department, Carnegie Mellon University, United States
  • Jianzhu Ma, University of California, San Diego, United States
  • Sheng Wang, Department of Human Genetics, University of Chicago, United States

Presentation Overview: Show

Motivation: Reconstructing the full-length expressed transcripts (a.k.a. the transcript assembly problem) from the short sequencing reads produced by the RNA-seq protocol plays a central role in identifying novel genes and transcripts as well as in studying gene expression and gene function. A crucial step in transcript assembly is to accurately determine the splicing junctions and boundaries of the expressed transcripts from the read alignments. In contrast to the splicing junctions, which can be efficiently detected from spliced reads, the problem of identifying boundaries remains open and challenging because the signal related to boundaries is noisy and weak.

Results: We present DeepBound, an effective approach to identify boundaries of expressed transcripts from RNA-seq read alignments. At its core, DeepBound employs deep convolutional neural fields to learn the hidden distributions and patterns of boundaries. To accurately model the transition probabilities and to solve the label-imbalance problem, we incorporate, in a novel way, the AUC (area under the curve) score into the objective function being optimized. To address the issue that deep probabilistic graphical models require a large number of labeled training samples, we propose to use simulated RNA-seq datasets to train our model. Through extensive experimental studies on both simulated datasets of two species and biological datasets, we show that DeepBound consistently and significantly outperforms the two existing methods.

Availability: DeepBound is freely available at https://github.com/realbigws/DeepBound
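
To illustrate why an AUC term is attractive when labels are imbalanced, the sketch below computes AUC as the fraction of (positive, negative) pairs that a scorer ranks correctly and compares it with plain accuracy on a toy example. It is a generic illustration, not DeepBound's training objective; `pairwise_auc` and the toy data are hypothetical.

```python
def pairwise_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    """
    positives = [s for s, l in zip(scores, labels) if l == 1]
    negatives = [s for s, l in zip(scores, labels) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))

# Imbalanced toy data: 1 boundary position among 9 non-boundary positions.
labels = [1] + [0] * 9
trivial_scores = [0.0] * 10  # a model that never predicts a boundary

# Accuracy at a 0.5 threshold looks good despite the model being useless.
accuracy = sum(
    1 for s, l in zip(trivial_scores, labels) if (s >= 0.5) == (l == 1)
) / len(labels)
auc = pairwise_auc(trivial_scores, labels)
```

Here the trivial model reaches high accuracy simply because negatives dominate, while its AUC stays at chance level. The pairwise comparison is not differentiable, so a training objective would normally use a smooth surrogate; the abstract does not say which form DeepBound uses.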

19:Improved Data-Driven Likelihood Factorizations for Transcript Abundance Estimation
COSI: proceedings
  • Avi Srivastava, Stony Brook University, United States
  • Fatemehalsadat Almodaresi T S, Stony Brook University, United States
  • Mohsen Zakeri, Stony Brook University, United States
  • Rob Patro, Stony Brook University, United States

Presentation Overview: Show

22:deBGR: An Efficient and Near-Exact Representation of the Weighted de Bruijn Graph
COSI: proceedings
  • Michael A. Bender, Stony Brook University, United States
  • Prashant Pandey, Stony Brook University, United States
  • Rob Johnson, Stony Brook University, United States
  • Rob Patro, Stony Brook University, United States

Presentation Overview: Show

22:miniMDS: 3D structural inference from high-resolution Hi-C data
COSI: proceedings
  • Lila Rieber, Pennsylvania State University, United States
  • Shaun Mahony, Pennsylvania State University, United States

Presentation Overview: Show

23:Abundance estimation and differential testing on strain level in metagenomics data
COSI: proceedings
  • Benjamin Strauch, Robert Koch Institute, Germany
  • Bernhard Y. Renard, Robert Koch Institute, Germany
  • Martina Fischer, Robert Koch Institute, Germany

Presentation Overview: Show

24:Integrative Deep Models for Alternative Splicing
COSI: proceedings
  • Anupama Jha, University of Pennsylvania, United States
  • Matthew Gazzara, University of Pennsylvania Perelman School of Medicine, United States
  • Yoseph Barash, University of Pennsylvania, United States

Presentation Overview: Show

Motivation: Advancements in sequencing technologies have highlighted the role of alternative splicing (AS) in increasing transcriptome complexity. This role of AS, combined with the relation of aberrant splicing to malignant states, motivated two streams of research, experimental and computational. The first involves a myriad of techniques such as RNA-Seq and CLIP-Seq to identify splicing regulators and their putative targets. The second involves probabilistic models, also known as splicing codes, which infer regulatory mechanisms and predict splicing outcome directly from genomic sequence. To date, these models have utilized only expression data. In this work we address two related challenges: can we improve on previous models for AS outcome prediction, and can we integrate additional sources of data to improve predictions for AS regulatory factors?

Results: We perform a detailed comparison of two previous modeling approaches, Bayesian and deep neural networks, dissecting the confounding effects of datasets and target functions. We then develop a new target function for AS prediction in exon skipping events and show it significantly improves model accuracy. Next, we develop a modeling framework that leverages transfer learning to incorporate CLIP-Seq, knockdown and overexpression experiments, which are inherently noisy and suffer from missing values. Using several datasets involving key splice factors in mouse brain, muscle and heart we demonstrate both the prediction improvements and biological insights offered by our new models. Overall, the framework we propose offers a scalable integrative solution to improve splicing code modeling as vast amounts of relevant genomic data become available.

Availability: Code and data are available at majiq.biociphers.org/jha_et_al_2017/

28:Improving the performance of minimizers and winnowing schemes
COSI: proceedings
  • Guillaume Marçais, Carnegie Mellon University, United States
  • David Pellow, Tel Aviv University, Israel
  • Daniel Bork, University of Pittsburgh, United States
  • Yaron Orenstein, MIT, United States
  • Ron Shamir, School of Computer Science, Tel Aviv University, Israel
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview: Show