Award Winners

Art & Science Award
Wikipedia Award
Wikidata Award
Fight Against Ebola Award
Ian Lawson Van Toch Memorial Award for Outstanding Student Paper
Late Breaking Research Award sponsored by PLOS
University of California Berkeley Center for Computational Biology Outstanding Oral Poster Prize
F1000Research Poster Award
RCSB PDB Poster Prize

Art & Science Award

The Dark Proteome
Sean O'Donoghue, CSIRO & Garvan Institute, Australia
Christopher Hammang, Garvan Institute, Australia
Julian Heinrich, CSIRO, Australia

Wikipedia Competition

1st Place - Alexander Hausser & Leonie Jahn - Docking (molecular)
2nd Place - Leandro Poli - BioJava
3rd Place - Vivek Rai - Vienna RNA Package

Wikidata Competition

1st Place - Vivek Rai - Vienna RNA Package

Fight Against Ebola Award

Winner

Mark N. Wass, University of Kent, United Kingdom
Title: Using Computational Biology to Investigate Ebola Virus Pathogenicity
Authors: Morena Pappalardo¹, Miguel Julia¹, Diego Cantoni¹, Francesca Collu², James Macpherson², Mark J Howard¹, Franca Fraternali², Jeremy S. Rossman¹, Martin Michaelis¹
¹School of Biosciences, University of Kent, UK.
²Randall Division, King's College London, UK.

Honourable Mentions

Ahmed Arslan, KU Leuven, Centre of Microbial and Plant Genetics (CMPG), Leuven, Belgium
Title: From conserved protein residues to therapeutic targets for Ebola Virus Disease
Co-author: Vera van Noort, KU Leuven, Centre of Microbial and Plant Genetics (CMPG), Leuven, Belgium

Tamir Tuller, Tel-Aviv University, Israel
Title: Computational large scale exploration of functional regions in Ebola for therapy and vaccination
Authors: Eli Goz¹, Kiril Lomakin²,³, Leslie Lobel¹
¹Department of Biomedical Engineering, Tel-Aviv University, Israel.
²Department of Microbiology, Immunology and Genetics, Faculty of Health Sciences, Ben Gurion University of the Negev, Beer Sheva, Israel.
³Department of Emerging and Re-emerging Diseases and Special Pathogens, Uganda Virus Research Institute (UVRI), Entebbe, Uganda.

Ian Lawson Van Toch Memorial Award for Outstanding Student Paper

TP041: RCK: accurate and efficient inference of sequence and structure-based protein-RNA binding models from RNAcompete data
Presenting Author: Yaron Orenstein, MIT, United States

Motivation: Protein-RNA interactions, which play vital roles in many processes, are mediated through both RNA sequence and structure. CLIP-based methods, which measure protein-RNA binding in vivo, suffer from experimental noise and systematic biases, whereas in vitro experiments capture a clearer signal of protein-RNA binding. Among them, RNAcompete provides binding affinities of a specific protein to more than 240,000 unstructured RNA probes in one experiment. The computational challenge is to infer RNA structure- and sequence-based binding models from these data. The state-of-the-art in sequence models, Deepbind, does not model structural preferences. RNAcontext models both sequence and structure preferences, but was outperformed by GraphProt. Unfortunately, GraphProt cannot detect structural preferences from RNAcompete data due to the unstructured nature of the data, as noted by its developers.
Results: We develop RCK, an efficient, scalable algorithm to infer sequence and structure preferences based on a new k-mer model. Remarkably, even though RNAcompete data is designed to be unstructured, RCK can still learn structural preferences from it. RCK significantly outperforms both RNAcontext and Deepbind in in vitro binding prediction for 244 RNAcompete experiments. Moreover, RCK is also faster and uses less memory, which enables scalability. While currently on par with existing methods in in vivo binding prediction on a small scale test, we demonstrate that RCK will increasingly benefit from experimentally measured RNA structure profiles as compared to computationally predicted ones. By running RCK on the entire RNAcompete dataset, we generate and provide as a resource a set of protein-RNA structure-based models on an unprecedented scale.
Availability: Software and models are freely available at http://groups.csail.mit.edu/cb/rck/.
Supplementary information: Supplementary data are available at Bioinformatics online.

PLOS - Late Breaking Research Award

TP87: Data-Driven Analysis of Lymphocyte Infiltration in Breast Cancer Development and Progression
Presenting Author: Ruth Dannenfelser, Princeton University, United States
Additional Authors: Josie Ursini-Siegel, Lady Davis Institute for Medical Research, Canada
Vessela Kristensen, Radiumhospitalet, Norway
Olga Troyanskaya, Princeton University, United States

The tumor microenvironment is now widely recognized for its role in tumor progression, treatment response, and clinical outcome. The intratumoral immunological landscape, in particular, has been shown to exert both pro-tumorigenic and anti-tumorigenic effects. Thus far, direct detailed studies of the cell composition of tumor infiltration have been limited, with some studies giving approximate quantifications using immunohistochemistry and other small studies obtaining detailed measurements by laboriously isolating cells from newly excised tumors and sorting them using flow cytometry. Herein we use a machine learning-based approach to identify lymphocyte markers with which we can quantify the presence of B cells, cytotoxic T-lymphocytes, T-helper 1, and T-helper 2 cells in any gene expression data set, and apply it to studies of breast tissue. By leveraging many samples from existing large-scale studies, we find an inherent cell heterogeneity in clinically characterized immune infiltrates, a strong link between estrogen receptor status and infiltration in normal and tumor tissues, and changes with genomic complexity, and we identify characteristic differences in lymphocyte expression among molecular groupings. Furthermore, we explore the effects that detailed infiltration patterns have on patient survival and changes with anti-estrogen therapy.

University of California Berkeley Center for Computational Biology Outstanding Oral Poster Prize

OP34: Hidden RNA Codes Revealed from in vivo RNA Structurome
Presenting Author: Yin Tang, Pennsylvania State University
Additional Authors:
Anton Nekrutenko, Pennsylvania State University, University Park, United States of America
Philip Bevilacqua, Pennsylvania State University, University Park, United States of America
Sarah Assmann, Pennsylvania State University, University Park, United States of America

RNA can fold into secondary and tertiary structures, which are important for regulation of gene expression. We recently developed a method to perform genome-wide RNA structure profiling in vivo employing high-throughput sequencing techniques, and applied this methodology to Arabidopsis. This method makes it possible to probe thousands of RNA structures at one time in living cells. Hidden RNA codes have been revealed by bioinformatic analyses of our RNA structuromes including RNA structures related to alternative polyadenylation and splicing [1].
Recently, further analysis of this dataset revealed a correlation between mRNA structure and the encoded protein structure, wherein the regions of individual mRNAs that code for protein domains generally have significantly higher structural reactivity than regions that encode protein domain junctions. This relationship is prominent for proteins annotated for catalytic activity but is reversed in proteins annotated for binding and transcription regulatory activity. We also found that mRNA segments that code for ordered regions have significantly higher structural reactivity than those that encode disordered regions [2].
We also developed a new computational platform, StructureFold, to facilitate the analysis of high throughput RNA structure profiling data. As a component of the Galaxy platform (https://usegalaxy.org), StructureFold integrates four computational modules in a user-friendly web-based interface or via local installation [3].

[1] Ding Y, Tang Y, Kwok CK, Zhang Y, Bevilacqua PC, Assmann SM. Nature. 2014;505:696-700.
[2] Tang Y, Assmann SM, Bevilacqua PC. J Mol Biol. 2016;428:758-766.
[3] Tang Y, Bouvier E, Kwok CK, Ding Y, Nekrutenko A, Bevilacqua PC, Assmann SM. Bioinformatics. 2015;31:2668-2675.

University of California Berkeley Center for Computational Biology Outstanding Oral Poster Prize

OP14: Widespread misannotation of samples in genomics studies
Presenting Author: Lilah Toker, University of British Columbia, Canada
Additional Authors:
Min Feng, University of British Columbia, Canada
Paul Pavlidis, University of British Columbia, Canada

Concern about the reproducibility and reliability of biomedical research has been rising. A bedrock principle of research conduct is that the samples analyzed are correctly identified and not mixed up during processing, but this has rarely been assessed formally.
Here we studied the prevalence of sample misannotation in a large corpus of genomics studies by comparing meta-data annotations of sex to predictions from expression of sex-specific genes. We identified apparently misannotated samples in 46% of the datasets sampled. Extrapolating beyond our corpus, we estimate that at least 33% of all studies have at least one such mix-up (99% confidence interval). Because this method can only identify a subclass of potential misannotations, this provides a conservative estimate for the breadth of the problem. In an additional set of studies that used samples from the same subjects, two of four had misannotated samples. These misannotations are likely to result from laboratory mix-ups rather than subject meta-data collection errors.
Our findings emphasize the need for genomics researchers to implement more stringent sample tracking and data quality control steps, and suggest that re-use of published data should be done in conjunction with careful re-examination of meta-data.
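
The comparison described above, annotated sex versus sex predicted from expression, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the marker genes or the decision rule, so the choice of XIST (female-expressed) and RPS4Y1 (Y-linked), the simple "higher of the two" rule, and all sample values below are assumptions made for illustration only.

```python
def predict_sex(expression, female_gene="XIST", male_gene="RPS4Y1"):
    """Predict 'F' or 'M' from the relative expression of two marker genes.

    The marker genes and the comparison rule are illustrative assumptions,
    not the method used in the study.
    """
    return "F" if expression[female_gene] > expression[male_gene] else "M"


def find_mismatches(samples):
    """Return IDs of samples whose annotated sex disagrees with the prediction.

    samples: iterable of (sample_id, annotated_sex, expression_dict).
    """
    return [
        sample_id
        for sample_id, annotated, expression in samples
        if predict_sex(expression) != annotated
    ]


# Hypothetical toy data: log-scale expression values for the two markers.
samples = [
    ("s1", "F", {"XIST": 9.1, "RPS4Y1": 2.0}),
    ("s2", "M", {"XIST": 1.5, "RPS4Y1": 10.2}),
    ("s3", "M", {"XIST": 8.7, "RPS4Y1": 1.9}),  # annotation disagrees with expression
]

mismatched = find_mismatches(samples)
print(mismatched)
```

A real analysis would likely need several marker genes and a rule that tolerates ambiguous samples; the sketch only shows where a disagreement between meta-data and expression would be flagged.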

F1000 Poster Awards

OP23/B11: The evolutionary origin of orphan genes
Presenting Author: Zebulun Arendsee, Iowa State University, United States of America
Additional Author:
Eve Syrkin Wurtele, Iowa State University, United States of America

Many of the most powerful tools in biology rely on inference of homologs via sequence-based algorithms. However, many loci are invisible to such methods. Those that are short or rapidly evolving, such as orphan genes and small non-coding RNAs, may yield no significant hits, whereas low-complexity or high-copy-number loci may hide in a crowd of false positives. Searching by context bypasses this problem. We present an algorithm for tracing loci between genomes using a synteny map, and test its efficacy by mapping all Arabidopsis thaliana-specific genes to the genomes of eight related species. By reducing the search space and winnowing false positives, we were able to assess the origin of the individual orphan genes with unprecedented resolution. We traced many to their non-genic cousins, identifying the non-genic footprint from which they arose. We linked others to putative genes in related species from which they diverged beyond recognition. Knowing the approximate location of each gene across species also provides a starting point for future studies. Our pipeline can easily be adapted to contextualize elusive elements such as small RNAs and lineage-specific genes in any species for which reliable synteny maps can be built.

OP27/O42: TRACE: Reconstructing trajectories of cell cycle evolution using single-cell mass cytometry data
Presenting Author: Maria Anna Rapsomaniki, IBM Research Zurich
Additional Authors:
Xiaokang Lun, University of Zurich, Switzerland
Bernd Bodenmiller, University of Zurich, Switzerland
Maria Rodriguez Martinez, IBM Research Zurich, Switzerland

As single-cell experimental approaches become increasingly popular, cell-to-cell heterogeneity has emerged as a key determinant factor contributing to variability in gene expression and signaling responses. Mass cytometry (CyTOF) is a new proteomic technology that enables the simultaneous quantification of dozens of proteins in thousands of individual cells. In the context of cancer research, recent applications of CyTOF include the characterization of inter- and intra-tumor heterogeneity and the identification of novel cell subpopulations. However, as already demonstrated for single-cell RNA-seq, the resulting measurements are largely influenced by confounding factors, such as the cell cycle and cell volume. We present here TRACE, a novel computational approach to quantify this source of variability. TRACE first exploits a hybrid machine learning approach to classify single cells into discrete cell cycle phases according to measurements of established markers. Next, a metric embedding optimization technique creates a one-dimensional continuous marker that tracks biological pseudotime and individual cells are subsequently ordered according to this pseudotime marker. The resulting cell cycle trajectories across perturbation time points allow us to separate cell cycle effects from experimentally induced responses, enabling the direct comparison of signaling responses through cell cycle progression. Additionally we show that volume biases can be corrected using housekeeping gene measurements. Our approach, implemented in a simple and intuitive Graphical User Interface, was used to analyze data from various cell lines subject to different stimulations. In each case, TRACE was able to separate confounding effects from signaling responses, enabling the unbiased analysis of biological processes.

OP06/N16: The Landscape of Circular RNA in Cancer
Presenting Author: Nguyen Vo, University of Michigan-Ann Arbor, United States of America
Additional Authors:
Marcin Cieslik, University of Michigan-Ann Arbor, United States of America
Yajia Zhang, University of Michigan-Ann Arbor, United States of America
Xuhong Cao, University of Michigan-Ann Arbor, United States of America
Alexey Nesvizhskii, University of Michigan-Ann Arbor, United States of America
Arul Chinnaiyan, University of Michigan-Ann Arbor, United States of America

Circular RNAs (circRNAs) are a new class of abundant, non-adenylated, and stable RNAs that form a covalently closed loop. Recent studies have suggested that circRNAs play important regulatory roles through interactions with miRNAs and ribonucleoproteins. High-throughput RNA-sequencing to detect circRNAs requires non-poly(A) selected protocols. In this study, we established the use of an Exome Capture RNA-Seq protocol to profile circRNAs across more than 1000 human cancer samples. We validated our protocol against two other gold-standard methods, depletion of rRNA (Ribo-Zero) and digestion of linear transcripts (RNase-R). Capture RNA-seq was shown to greatly facilitate the high-throughput profiling of circRNAs, providing the most comprehensive catalogue of circRNA species to date. Specifically, our method achieved significantly better enrichment for circRNAs than rRNA depletion and, unlike RNase-R treatment, preserved accurate circular-to-linear ratios. Although the correlation between circular and linear isoform abundance was modest in general, we found strong evidence that the lineage specificity of circular RNAs is due to the lineage specificity of their parent genes. To shed light on the mechanism of circRNA biogenesis, we are investigating the associations between mutations in canonical splicing sites and splicing factors and the aberrant formation of circRNAs. Finally, the ratio of circular to linear transcript abundance was explored to give insight into the dynamics between transcriptome stability/turnover and cell proliferation. Overall, our compendium provides a comprehensive resource that could aid the exploration of circRNAs as a new type of biomarker, or as intriguing splicing and regulatory phenomena.

OP17/O20: Deconvolution of Cell and Environment Specific Signals and Their Interactions from Complex Mixtures in Biological Samples
Presenting Author: Urszula Czerwinska, Institut Curie, France
Additional Authors:
Emmanuel Barillot, Institut Curie, France
Vassili Soumelis, Institut Curie, France
Andrei Zinovyev, Institut Curie, France

Background:
In many fields of science, observations of a studied system represent complex mixtures of signals of various origins. Tumors are engulfed in a complex tumor microenvironment (TME) that critically impacts progression and response to therapy. It includes tumor cells, fibroblasts, and a diversity of immune cells. It is known that, under some assumptions, complex signal mixtures can be separated using classical and advanced methods of source separation and dimension reduction.

Description:
In this work, we apply independent component analysis (ICA) to decipher the sources of signals shaping the transcriptomes (global quantitative profiling of mRNA molecules) of tumor samples, with a particular focus on immune system-related signals. We apply ICA iteratively, decomposing signals into sub-signals that can be interpreted using pre-existing immune signatures through correlation or enrichment analysis.

Results:
Our analysis showed that signals related to groups of immune cell types can be identified with an unsupervised learning approach in a breast cancer dataset. Using Fisher's exact test, we identified significant groups corresponding to three out of five sub-signals: (1) T-cells, (2) DC/Macrophages, (3) Monocytes/Macrophages/Eosinophils/Neutrophils. The T-cell metagene correlates well with tumor grade (Kruskal-Wallis test p-value=0.003).

Discussion:
Ongoing analysis aims to evaluate the robustness of the represented groups and possible differences between several types of cancer. We aim to characterize the degree of immune infiltration in the cancer transcriptome dataset and further correlate it with patients’ survival and tumor characteristics. If successful, the results will be used in cancer diagnosis and therapy, especially immunotherapies.
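
The enrichment step mentioned in the Results (testing whether the genes of a sub-signal overlap a known immune signature more than expected by chance) can be sketched as a one-sided Fisher's exact test, which reduces to a hypergeometric tail probability. This is a minimal illustration under stated assumptions: the function name, its parameters, and the gene counts are hypothetical, and the abstract does not say how the authors configured their test.

```python
from math import comb


def fisher_enrichment_p(overlap, n_component, n_signature, n_universe):
    """One-sided Fisher's exact test p-value for over-representation.

    Returns P(X >= overlap), where X is the number of signature genes found
    among n_component genes drawn without replacement from a universe of
    n_universe genes containing n_signature signature genes.
    All argument names are illustrative, not taken from the study.
    """
    total = comb(n_universe, n_component)
    upper = min(n_component, n_signature)
    tail = sum(
        comb(n_signature, x) * comb(n_universe - n_signature, n_component - x)
        for x in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 3 top genes of a component, a 3-gene signature,
# a universe of 10 genes, and all 3 component genes in the signature.
p_value = fisher_enrichment_p(overlap=3, n_component=3, n_signature=3, n_universe=10)
print(p_value)
```

With many signatures and several sub-signals tested at once, the resulting p-values would normally also need a multiple-testing correction.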

OP20/B02: AODP: An improved method for signature oligonucleotide design
Presenting Author: Christine Lowe, Agriculture and Agri-Food Canada
Additional Authors:
Manuel Zahariev, Skwez Technology Corp
Min-Ru Lin, Agriculture and Agri-Food Canada
Jonathan Litovitz, Agriculture and Agri-Food Canada
Hai D.T. Nguyen, Agriculture and Agri-Food Canada
C. André Lévesque, Agriculture and Agri-Food Canada
Wen Chen, Agriculture and Agri-Food Canada

High-throughput Next Generation Sequencing (NGS) technologies and reference databases have enhanced our ability to explore diversity at genetic and taxonomic levels. Most off-the-shelf tools for examining genetic diversity implement algorithms that rely on sequence similarity and composition, which can lead to resolution loss in genetic comparisons, particularly at the species/sub-species taxonomic ranks. We present a new version of the Automated Oligonucleotide Design Pipeline (AODP). AODP designs signature oligonucleotides (SO) with specificity and fidelity based on genome or DNA barcode sequence identity, reducing the resolution loss observed with existing approaches. SO designed with AODP highlight regions with taxon or clade-specific polymorphisms that are useful for comparative genomics and provide suitable candidates for the design of primers/probes in diagnostic assays. AODP has several unique features: 1) The AODP algorithm uses a novel packed-Trie data structure, with support for multi-threaded insertion, optimized for DNA nucleotide strings, which scales well to multi-processor architectures; 2) SO can be designed for a large dataset with relatively small memory footprint; 3) Regions of DNA with a single nucleotide polymorphism (SNP) can be optionally ignored to minimize noise caused by sequencing errors during NGS; 4) The specificity of SO can be further validated against large reference databases; 5) SO thermodynamic properties can be calculated for wet-lab experimental conditions; and 6) SO can be directly used for in silico identification of taxa from environmental NGS data.
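
The core idea of a signature oligonucleotide, a subsequence that occurs in exactly one taxon of the input set, can be shown with a small sketch. It deliberately swaps in a plain dictionary of k-mers for the packed-Trie structure described above, and it ignores SNP filtering, thermodynamic checks, and database validation; the function name and the toy sequences are hypothetical and this is not the AODP implementation.

```python
from collections import defaultdict


def signature_kmers(sequences, k):
    """Return, for each taxon, the k-mers found in that taxon and no other.

    sequences: dict mapping taxon name -> DNA sequence string.
    A dictionary-based stand-in for a trie; illustrative only.
    """
    owners = defaultdict(set)  # k-mer -> set of taxa containing it
    for taxon, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            owners[seq[i:i + k]].add(taxon)

    signatures = {taxon: set() for taxon in sequences}
    for kmer, taxa in owners.items():
        if len(taxa) == 1:  # unique to a single taxon
            signatures[next(iter(taxa))].add(kmer)
    return signatures


# Hypothetical toy sequences differing at one position.
toy = {"taxonA": "ACGTAC", "taxonB": "ACGTTC"}
sig = signature_kmers(toy, k=4)
print(sig)
```

In the toy input, the shared 4-mer ACGT is discarded and only the k-mers spanning the differing position remain as candidates for each taxon.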

RCSB PDB Poster Prize

OP24/L28: Tertiary Structural Propensities Reveal Basic Sequence-Structure Relationships in Proteins
Presenting Author: Fan Zheng, Dartmouth College, United States of America
Additional Authors:
Jian Zhang, Dartmouth College, United States of America
Gevorg Grigoryan, Dartmouth College, United States of America

The Protein Data Bank (PDB) is a key resource of general principles that has shaped our understanding of protein structure. Most existing statistical generalizations of protein structures are made either for secondary structures, which are often too generic to satisfy many specific design goals, or for protein domains, for which the PDB distribution is highly biased by evolution or human sampling and is thus not physically meaningful. To fill this gap, we proposed local tertiary motifs (TERMs) as a new fundamental level of structural unit. TERMs are combinations of non-continuous small secondary fragments connected by inter-residue contacts. We hypothesized that the PDB contains valuable quantitative information on the level of TERMs. We studied the propensities of TERMs within their corresponding ensembles, i.e. geometrically similar structural fragments from completely unrelated proteins. The TERM propensities are physically meaningful in many contexts. By breaking a protein structure into its constituent TERMs, we can evaluate the accuracy of structure-prediction models and identify poorly predicted regions via a metric we named “structure score”, which captures the sequence-structure relationships in TERMs. Also, querying TERMs affected by point mutations enables straightforward prediction of mutational free energies. Our performance exceeds or is comparable to state-of-the-art methods. Our results suggest that the data in the PDB are now sufficient to enable the quantification of complex structural features, such as those associated with entire TERMs. This should present opportunities for advances in computational structural biology techniques, including structure prediction and design.