General Computational Biology

COSI Track Presentations

Session Chairs:

Laxmi Parida (10:15 am - 12:40 pm)
Chaya Levovitz (2:00 pm - 6:00 pm)

Schedule subject to change
Saturday, July 7th
10:15 AM-10:20 AM
General: Introduction
Room: Columbus KL
10:20 AM-10:40 AM
Measuring and Closing the Genotype Leakage from Genomic Signal Profiles: Is it Ethical to Share RNA-seq wiggle files?
Room: Columbus KL
  • Arif Harmanci, University of Texas Health - Houston, United States
  • Mark Gerstein, Yale University, United States

Presentation Overview:

Functional genomics data is emerging as a valuable resource for personalized medicine. Here, we focus on the privacy aspects of genome-wide RNA-seq and ChIP-seq signal profiles, which represent measurements of activity at each genomic position. The signal profiles do not contain any explicit nucleotide information, i.e. no actual read information is revealed, and are thought to be safe to share publicly. Several consortia, for example IHEC, GTEx, and TCGA, publicly share functional genomics signal profiles while the genotype data is under restricted access. Here, we show that the signal profiles can be used to correctly genotype genomic deletions. Moreover, we demonstrate that these deletions can be used in linking attacks to identify individuals in genotype datasets. We develop measures of correct genotype prediction and information leakage from the RNA-seq signal profiles. We then present practical methods for genotyping deletions and for accurately linking individuals to a large sample. To close the genotype leakage, we present an effective anonymization procedure against linking attacks based on genotype prediction. Considering the extent to which RNA-seq signal profiles are shared publicly on the web, our results point to a critical source of sensitive information leakage, which our anonymization technique can potentially mitigate.

10:40 AM-11:00 AM
A Comprehensive Comparison of Gene Set Projection (GSP) Methods
Room: Columbus KL
  • Ali Amin-Mansour, Boston University, United States
  • Rui Hong, Boston University, United States
  • Gary Benson, Boston University, United States
  • Stefano Monti, Boston University, United States

Presentation Overview:

With the advent of microarray- and RNA-seq-based gene expression profiling, finding gene sets with phenotype-associated functional changes has been the focus of many studies. While gene set enrichment analysis tools have been extensively used to test for significant association between gene sets and a phenotype of interest, gene set projection (GSP) tools have been used to assign each individual sample an enrichment score for a gene set of interest. Multiple GSP tools have been developed to perform this task, including ssGSEA, ASSIGN, Pathifier, and GSVA, among others. In this analysis, we seek to compare their performance on real and simulated data. In particular, we use simulated data derived from real gene expression profiles to compare the sensitivity and specificity of the different methods, maintaining the correlation structure between the genes in our cohort and adding a constant effect size, so that we can assess how well each tool recovers the true enrichment in the underlying data. We have measured the influence of various parameters, such as gene set size, correlation between genes within the gene set, and effect size, on the accuracy of the gene set projection tools.
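
To make the idea of a per-sample gene set score concrete, here is a minimal sketch of a naive projection: each gene is z-scored across samples, and a sample's score is the mean z-score of the genes in the set. This is only an illustrative baseline, not the algorithm used by ssGSEA, ASSIGN, Pathifier, or GSVA; the function names and the dictionary layout of the expression matrix are assumptions made for this example.

```python
from statistics import mean, pstdev


def zscore_rows(expr):
    """Z-score each gene across samples.

    expr: dict mapping gene name -> list of expression values (one per sample).
    Genes with zero variance get a z-score of 0.0 for every sample.
    """
    z = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return z


def project_gene_set(expr, gene_set):
    """Return one score per sample: the mean z-score of the set's genes.

    Illustrative baseline only; genes missing from expr are ignored.
    """
    z = zscore_rows(expr)
    members = [g for g in gene_set if g in z]
    if not members:
        raise ValueError("no gene of the set is present in the expression data")
    n_samples = len(z[members[0]])
    return [mean(z[g][s] for g in members) for s in range(n_samples)]
```

A call such as `project_gene_set(expr, {"G1", "G2"})` returns a list with one score per sample, which can then be compared against phenotype labels.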

11:00 AM-11:20 AM
Single Cell-2-Cell Communicator (SC2CC) to Elicit Cell-to-Cell Communication Network using Single Cell RNA-seq (scRNA-seq) Data
Room: Columbus KL
  • Benjamin Walton, The Jackson Laboratory, United States
  • Joshy George, The Jackson Laboratory, United States
  • Kyuson Yun, Houston Methodist Research Institute, United States
  • R. Krishna Murthy Karuturi, The Jackson Laboratory, United States

Presentation Overview:

Understanding cell-to-cell communication (C2CC) is critical to deciphering the complexity of the cellular ecosystems that control biological functions and disease states. For example, analyzing communication among cancer cells and stromal cells in a tumor has revealed important biological insights. Ligand-receptor (LR) interactions typically drive C2CC, and scRNA-seq helps identify C2CC by exploiting known LR interactions. However, no tool is available to identify C2CC networks and explore modes of communication among different cells at the single-cell level. Hence, we developed a methodology and an interactive tool called Single Cell-2-Cell Communicator (SC2CC) for end-to-end analysis of scRNA-seq data, from normalization to eliciting the C2CC network. Using the SC2CC tool, we analyzed scRNA-seq data of a spontaneous murine medulloblastoma for C2CC network identification. The C2CC network indicates that all immune cell populations can communicate with each other. Interestingly, while tumor cell populations and astrocytes could signal robustly to all infiltrating macrophages, reciprocal signaling from immune cells to tumor cells or astrocytes was not detected. In addition, tumor cell signaling to brain-resident macrophages (microglia) was absent and may require an infiltrating macrophage intermediary. This network shows hierarchical interaction among different cell types in a tumor and suggests subpopulation-specific C2CC in tumors. Our results underscore the importance of the SC2CC tool.
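
The general idea of deriving a directed communication network from known ligand-receptor pairs can be sketched as follows. This is a simplified illustration, not the SC2CC method: it assumes per-cell-type mean expression values have already been computed, and it draws an edge whenever the sender expresses the ligand and the receiver expresses the receptor above a fixed threshold. The function name, the threshold, and the toy values used with it are assumptions.

```python
def communication_edges(mean_expr, lr_pairs, threshold=1.0):
    """List directed (sender, receiver, ligand, receptor) edges.

    mean_expr: dict mapping cell type -> dict of gene -> mean expression.
    lr_pairs:  iterable of (ligand_gene, receptor_gene) tuples.
    An edge is reported when the sender's ligand expression and the
    receiver's receptor expression both reach the threshold.
    """
    lr_pairs = list(lr_pairs)
    edges = []
    for sender, sender_expr in mean_expr.items():
        for receiver, receiver_expr in mean_expr.items():
            for ligand, receptor in lr_pairs:
                if (sender_expr.get(ligand, 0.0) >= threshold
                        and receiver_expr.get(receptor, 0.0) >= threshold):
                    edges.append((sender, receiver, ligand, receptor))
    return edges
```

Because edges are directed, a network built this way can show signaling from one population to another without the reciprocal edge, which is the kind of asymmetry the abstract describes.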

11:20 AM-11:40 AM
IDP-denovo: de novo transcriptome assembly and isoform annotation by hybrid sequencing
Room: Columbus KL
  • Shuhua Fu, University of Iowa, United States
  • Yingke Ma, University of Iowa, United States
  • Hui Yao, Peking Union Medical College, Beijing 100193, China
  • Zhichao Xu, Peking Union Medical College, Beijing 100193, China
  • Shilin Chen, Peking Union Medical College, Beijing 100193, China
  • Jingyuan Song, Peking Union Medical College, Beijing 100193, China
  • Kin Fai Au, University of Iowa, United States

Presentation Overview:

In recent years, long-read (LR) sequencing technologies, such as Pacific Biosciences and Oxford Nanopore Technologies, have been shown to substantially improve the quality of genome assembly and transcriptome characterization. Compared to the high cost of genome assembly by LR sequencing, it is more affordable to generate LRs for transcriptome characterization. Thus, when informative transcriptome LR data are available without a high-quality genome, a method for de novo transcriptome assembly and annotation is in high demand.

Without a reference genome, IDP-denovo performs de novo transcriptome assembly, isoform annotation and quantification by integrating the strengths of LRs and short reads (SRs). Using GM12878 human data as a gold standard, we demonstrated that IDP-denovo had superior sensitivity of transcript assembly and high accuracy of isoform annotation. In addition, IDP-denovo outputs two abundance indices to provide a comprehensive expression profile of genes/isoforms. IDP-denovo represents a robust approach for transcriptome assembly, isoform annotation and quantification for non-model organism studies. Applying IDP-denovo to a non-model organism, Dendrobium officinale, we discovered many novel genes and novel isoforms that were not reported by the existing annotation library. These results reveal a high diversity of isoforms in D. officinale that is not captured by the existing annotation library.

11:40 AM-12:00 PM
Proceedings Presentation: Bayesian parameter estimation for biochemical reaction networks using region-based adaptive parallel tempering
Room: Columbus KL
  • Benjamin Ballnus, Helmholtz-Zentrum München, Germany
  • Steffen Schaper, Bayer, Germany
  • Fabian Theis, Helmholtz Centre, Institute of Computational Biology, Munich, Germany
  • Jan Hasenauer, Institute of Computational Biology, Helmholtz Zentrum München, Germany

Presentation Overview:

Mathematical models have become standard tools for the investigation of cellular processes and the unraveling of signal processing mechanisms. The parameters of these models are usually derived from the available data using optimization and sampling methods. However, the efficiency of these methods is limited by the properties of the mathematical model, e.g., non-identifiabilities, and the resulting posterior distribution. In particular, multi-modal distributions with long valleys or pronounced tails are difficult to optimize and sample. Thus, the development or improvement of optimization and sampling methods is subject to ongoing research.

We suggest a region-based adaptive parallel tempering algorithm which adapts to the problem-specific posterior distributions, i.e. modes and valleys. The algorithm combines several established algorithms to overcome their individual shortcomings and to improve sampling efficiency. We assessed its properties for established benchmark problems and two ordinary differential equation models of biochemical reaction networks. The proposed algorithm outperformed state-of-the-art methods in terms of calculation efficiency and mixing. Since the algorithm does not rely on a specific problem structure, but adapts to the posterior distribution, it is suitable for a variety of model classes.

The code, written in MATLAB, is available both as supplementary material and in a Git repository.
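
As background on the family of methods discussed above, the following is a minimal sketch of plain parallel tempering on a one-dimensional bimodal target. It is not the region-based adaptive algorithm of the talk and is not taken from the authors' MATLAB code: the temperature ladder and step size are fixed rather than adapted, and the target density, function names, and parameter values are assumptions chosen for illustration.

```python
import math
import random


def log_post(x):
    """Toy bimodal log-density: equal mixture of Gaussians at -3 and +3 (sd 0.5)."""
    a = -0.5 * ((x + 3.0) / 0.5) ** 2
    b = -0.5 * ((x - 3.0) / 0.5) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def parallel_tempering(log_density, betas, n_iter, step=1.0, seed=0):
    """Return samples of the beta = betas[0] chain (use betas[0] = 1.0).

    Each chain i targets log_density(x) * betas[i]. Per iteration, every
    chain makes one random-walk Metropolis move, then one randomly chosen
    pair of neighbouring chains proposes to swap states.
    """
    rng = random.Random(seed)
    xs = [0.0] * len(betas)
    samples = []
    for _ in range(n_iter):
        for i, beta in enumerate(betas):
            proposal = xs[i] + rng.gauss(0.0, step)
            log_alpha = beta * (log_density(proposal) - log_density(xs[i]))
            if rng.random() < math.exp(min(0.0, log_alpha)):
                xs[i] = proposal
        # Swap acceptance: exp((beta_i - beta_j) * (log p(x_j) - log p(x_i))).
        i = rng.randrange(len(betas) - 1)
        j = i + 1
        log_alpha = (betas[i] - betas[j]) * (log_density(xs[j]) - log_density(xs[i]))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            xs[i], xs[j] = xs[j], xs[i]
        samples.append(xs[0])
    return samples
```

With a ladder such as `[1.0, 0.3, 0.1, 0.03]`, the flattened high-temperature chains cross between the two modes easily, and the swaps pass those crossings down to the chain that samples the actual posterior.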

12:00 PM-12:20 PM
SplitsTree5 - a new provenance-graph-based program for calculating and exploring phylogenetic trees and networks
Room: Columbus KL
  • Daniel H. Huson, University of Tuebingen, Germany
  • David Bryant, Otago University, New Zealand

Presentation Overview:

Phylogenetic networks are used to analyze evolution in the presence
of speciation-by-hybridization, HGT or recombination. There are a number of
programs for computing them, including SplitsTree4, which we published in 2006.

While SplitsTree4 is still widely used (~500 citations per year), it was
designed for much smaller datasets than considered today, and doesn't employ parallelization.
So, for example, the program can't parse a file of 10,000 bootstrap trees on
600 taxa and compute a consensus network, now a typical task.

Here we present SplitsTree5, a completely new open-source JavaFX program that will replace SplitsTree4.
Many of the data-structures and algorithms used have been redesigned to address current-day datasets and
computing architectures.
Based on the concept of a provenance graph, the program explicitly models all data, algorithms and
parameters used in an analysis. This ensures reproducibility. Moreover, the graph is used to
generate a Methods Section that describes the analyses performed and parameters used,
and provides citations to the methods papers.

As well as replacing SplitsTree4, we are also reimplementing much of the functionality of the programs
Dendroscope, PopArt and Network so as to provide many of the most important methods for unrooted-, rooted- and haplotype networks
in a single program.

12:20 PM-12:40 PM
Theoretical analysis of graph-based and alignment-based hybrid error correction methods for error-prone long reads
Room: Columbus KL
  • Anqi Wang, University of Iowa, United States
  • Kin Fai Au, University of Iowa, United States

Presentation Overview:

Third Generation Sequencing (TGS) technologies, including Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), have drawn growing attention in biological research. However, the application of long reads currently suffers from relatively high error rates. High-quality Second Generation Sequencing (SGS) short reads can be used to correct the TGS long reads, an approach termed hybrid error correction. Most existing hybrid correction methods fall into two classes: alignment-based and graph-based methods. Several principal factors affect error correction performance, including long read and short read error rates, short read coverage, the alignment criterion and the solid k-mer size. We propose a theoretical framework for hybrid-correction algorithms. By modeling short read alignment and solid k-mer occurrence, we analyze the effects of different factors on error correction performance (i.e., accuracy gain) for alignment- and graph-based methods. The theoretical results are further validated on simulated and real data. We also make a general comparison of the two kinds of methods in terms of error correction performance. Our study serves as guidance for method selection, parameter design and future method development for hybrid correction.
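
For readers unfamiliar with the term, a k-mer is commonly called "solid" when it occurs in the short reads at least some minimum number of times, which makes it unlikely to be a sequencing error. The sketch below shows that counting step only; it is not the theoretical framework of the talk, and the function names and the threshold parameter are assumptions for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(reads, k, min_count):
    """Return the set of k-mers seen at least min_count times in the reads."""
    return {kmer for kmer, c in count_kmers(reads, k).items() if c >= min_count}
```

Both the k-mer size and the count threshold trade sensitivity against specificity: a larger `k` or `min_count` excludes more erroneous k-mers but also discards more true ones at low coverage.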

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:20 PM
RWEN: Response-Weighted Elastic Net For Prediction of Chemosensitivity of Cancer Cell Lines
Room: Columbus KL
  • Amrita Basu, UCSF, United States
  • Ritwik Mitra, AT&T, United States
  • Han Liu, Princeton University, United States
  • Stuart Schreiber, Harvard University, United States
  • Paul Clemons, Harvard University, United States

Presentation Overview:

In recent years there have been several efforts to generate sensitivity profiles of collections
of genomically characterized cell lines to panels of candidate therapeutic compounds. These data provide
the basis for the development of in silico models of sensitivity based on cellular, genetic, or expression
biomarkers of cancer cells. However, a remaining challenge is an efficient way to identify accurate sets
of biomarkers to validate. To address this challenge, we developed methodology using gene-expression
profiles of human cancer cell lines to predict the responses of these cell lines to a panel of compounds.
We developed an iterative weighting scheme which, when applied to elastic net, a regularized
regression method, significantly improves the overall accuracy of predictions, particularly in the highly
sensitive response region. In addition to application of these methods to actual chemical sensitivity data,
we investigated the effects of sample size, number of features, model sparsity, signal-to-noise ratio, and
feature correlation on predictive performance using a simulation framework, particularly for situations
where the number of covariates is much larger than sample size. While our method aims to be useful in
therapeutic discovery, it is generally applicable in any domain where predictions of extreme
responses are of highest importance.

2:20 PM-2:40 PM
Decomposing spatially dependent and cell type specific contributions to cellular heterogeneity
Room: Columbus KL
  • Qian Zhu, Harvard University, United States
  • Sheel Shah, California Institute of Technology, United States
  • Ruben Dries, Harvard University, United States
  • Long Cai, California Institute of Technology, United States
  • Guo-Cheng Yuan, Harvard University, United States

Presentation Overview:

Both the intrinsic regulatory network and the spatial environment contribute to cellular identity and result in cell state variations. However, their individual contributions remain poorly understood. Here we present a systematic approach to integrate both sequencing- and imaging-based single-cell transcriptomic profiles, thereby combining whole-transcriptomic and spatial information from these assays. We applied this approach to dissect the heterogeneity associated with cell type and spatial domain within the mouse visual cortex region. Our analysis identified distinct spatially associated signatures within glutamatergic and astrocyte cell compartments, indicating strong interactions between cells and their surrounding environment. Using these signatures as a guide to analyze single-cell RNA-seq data, we identified previously unknown but spatially associated subpopulations. As such, our integrated approach provides a powerful tool for dissecting the roles of intrinsic regulatory networks and spatial environment in the maintenance of cellular states.

2:40 PM-3:00 PM
Computational prediction of natural metabolites for suppression of tumor progression
Room: Columbus KL
  • Boris Reva, Icahn School of Medicine at Mount Sinai, United States
  • Anna Calinawan, Icahn School of Medicine at Mount Sinai, United States
  • Eric Schadt, Icahn School of Medicine at Mount Sinai, United States

Presentation Overview:

Diet and certain food supplements can significantly impact cancer outcomes. However, the diversity of molecular alterations in tumors necessitates a personalized approach to determining optimal drugs, diet and food supplements for each patient. Motivated by the hypotheses that:

(i) upregulation of certain metabolic pathways results in downregulation of specific oncogenes;
(ii) expression levels of metabolic pathway genes can be upregulated by certain metabolic substrates (food supplements);
(iii) downregulation of tumor specific oncogenes will suppress tumor progression,

we studied co-regulation between gene expression levels using RNA-seq profiles produced by The Cancer Genome Atlas (TCGA) consortium. We found that major cancer genes are significantly co-regulated with large sets of genes, which we call “gene clouds”, many of which were significantly enriched in genes of various biological pathways. In particular, KEGG’s ribosome and oxidative phosphorylation pathways were universally overrepresented in oncogene clouds across many cancers. We developed a computational protocol that determines gene clouds, overrepresented biological pathways and related metabolites. So far, metabolites have been derived for ~30 common oncogenes and ~30 tumor suppressors for ~20 TCGA cancers. The computational protocol can be applied to an individual set of oncogenes and tumor suppressors to propose metabolites for testing in cell lines and mouse models.

3:00 PM-3:20 PM
HyperMinHash: MinHash in LogLog space
Room: Columbus KL
  • Yun William Yu, Harvard Medical School, United States
  • Griffin M. Weber, Harvard Medical School, United States

Presentation Overview:

One of the central problems in bioinformatics is computing the similarity of two objects, e.g. genomes or patient cohorts. Exact algorithms for these problems are often computationally expensive, so researchers sometimes turn to sketching algorithms instead. Sketching algorithms represent big data as small probabilistic data structures that require only a small amount of memory and produce fast, accurate estimates of properties of a dataset, e.g. Jaccard similarity. The MinHash sketch in particular has proven useful for accelerating genomic distance computations.

We describe and analyze a streaming probabilistic sketch, HyperMinHash, which provides an expected four-fold reduction in memory consumption for most applications currently using MinHash. HyperMinHash is a compression of standard O(ϵ^(-2) log n)-space MinHash. For a multiplicative approximation error 1+ϵ on a Jaccard similarity t, given a random oracle, HyperMinHash needs only O(ϵ^(-2) (log log n + log(1/(tϵ)))) space. Unlike comparable fingerprinting algorithms, HyperMinHash retains MinHash’s features of streaming updates, unions, and cardinality estimation. At equivalent memory consumption to MinHash, HyperMinHash enables software to either increase accuracy at the same cardinalities or support much larger cardinalities.
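
For context, the uncompressed baseline that HyperMinHash builds on can be sketched in a few lines. The code below is plain MinHash with one minimum per hash function, not HyperMinHash itself: it stores full 64-bit minima and does none of the compression described above. The function names, the use of BLAKE2b as a stand-in for a random oracle, and the default sketch size are assumptions for illustration.

```python
import hashlib


def _hash64(item, seed):
    """Deterministic 64-bit hash of a string under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, num_hashes=128):
    """Sketch a non-empty collection: the minimum hash value per seed."""
    items = list(items)
    return [min(_hash64(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a, sketch_b):
    """Fraction of positions where the two sketches agree."""
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


def merge_sketches(sketch_a, sketch_b):
    """Sketch of the union of the two underlying sets (element-wise minimum)."""
    return [min(a, b) for a, b in zip(sketch_a, sketch_b)]
```

The element-wise minimum in `merge_sketches` is what makes union and streaming updates cheap, which are the MinHash properties the abstract says HyperMinHash retains.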

3:20 PM-3:40 PM
Identification of Candidate Genes Associated with Osteoarthritis by Microarray Data Analysis
Room: Columbus KL
  • Sanjana Choudhury, University of Dhaka, Bangladesh
  • Dr Khademul Islam, University of Dhaka, Bangladesh
  • Mohammad Sayeem, University of Cambridge, United Kingdom

Presentation Overview:

By analyzing a cohort of microarray datasets from osteoarthritis (OA) patients, we identified differentially modulated genes, including highly deregulated genes termed ‘drivers’ of the OA mechanism. Genes whose expression level was deregulated more than expected by chance on average were determined using the “Gitools-oncodrive” script in the Gitools software, with the binomial (Bernoulli) test as the statistical test of significance. Our approach not only identified previously reported OA-associated genes, e.g. ACAN, OGN and LOXL4, which validates our concept, but also identified new genes such as ASPM, AURKB, HAPLN1, HAS3, MAP1B and PCSK1, which function in extracellular matrix stability, cell cycle regulation and inflammation, making them potential regulators of the OA mechanism. Most of the up-regulated driver genes detected are associated with inflammatory agents, e.g. the cytokines IL-8, IL-11 and IL-6, which may drive increased production of matrix-degrading enzymes in OA joint tissues. Most down-regulated driver genes found are involved in stabilizing the proteoglycan monomers with hyaluronic acid in the extracellular cartilage matrix. Because of this significant down-regulation (as detected in this study), the stability of the ECM may be compromised in OA cartilage. Our identified driver genes not only provide better insight into the disease mechanism but can also serve as biomarkers in OA diagnosis and may be potential drug targets.

3:40 PM-4:00 PM
ModulOmics: Integrating Multi-Omics Data to Identify Cancer Driver Modules
Room: Columbus KL
  • Dana Silverbush, Blavatnik School of Computer Science, Israel
  • Simona Cristea, Harvard University, United States
  • Gali Yanovich, Sackler Faculty of Medicine, Israel
  • Tamar Geiger, Sackler Faculty of Medicine, Israel
  • Niko Beerenwinkel, ETH Zurich, Switzerland
  • Roded Sharan, Blavatnik School of Computer Science, Israel

Presentation Overview:

The identification of molecular pathways driving cancer progression is a fundamental unsolved problem in tumorigenesis, which can substantially further our understanding of cancer mechanisms and inform the development of targeted therapies. Most current approaches to address this problem use primarily somatic mutations, not fully exploiting additional layers of biological information. Here, we describe ModulOmics, a method to de novo identify cancer driver pathways, or modules, by integrating multiple data types (protein-protein interactions, mutual exclusivity of mutations or copy number alterations, transcriptional co-regulation, and RNA co-expression) into a single probabilistic model. To efficiently search the exponential space of candidate modules, ModulOmics employs a two-step optimization procedure that combines integer linear programming with stochastic search. Across several cancer types, ModulOmics identifies highly functionally connected modules enriched with cancer driver genes, outperforming state-of-the-art methods. For breast cancer subtypes, the inferred modules recapitulate known molecular mechanisms and suggest novel subtype-specific functionalities. These findings are supported by an independent patient cohort, as well as independent proteomic and phosphoproteomic datasets.

4:00 PM-4:40 PM
Coffee Break
4:40 PM-5:00 PM
Biomedical concept normalization using sequence-to-sequence LSTM model
Room: Columbus KL
  • Negacy Hailu, University of Colorado, Boulder, United States
  • Asmelash Teka Hadgu, L3S Research Center, Germany
  • Michael Bada, CU-Denver Anschutz Medical Campus, United States
  • Larry Hunter, CU-Denver Anschutz Medical Campus, United States

Presentation Overview:

Concept normalization is an important step in biomedical information extraction. The task is to identify references to particular controlled vocabulary or ontology terms in text. In this work, we present a novel sequence-to-sequence architecture to normalize biomedical concepts.

We develop a sequence-to-sequence model with LSTM (Long Short-Term Memory) units that has an encoder-decoder architecture. The encoder is a bidirectional LSTM that takes as input a one-hot encoding of the characters of biomedical mentions. The decoder is a unidirectional LSTM that takes the output of the encoder. An attention model sits between the encoder and the decoder; it assigns different weights to the encoder outputs, determining how much attention the decoder should give to each of them. The decoder predicts the most likely concept IDs of biomedical mentions.

The proposed architecture is evaluated against gold textual mentions. It is also evaluated in an end-to-end system, where the input is scientific articles, so that we can compare it with other systems. For the end-to-end system, we extracted the spans of biomedical mentions in text using an existing state-of-the-art approach, conditional random fields.

5:00 PM-5:20 PM
MetaSRA: Normalized Human Sample-Specific Metadata for the Sequence Read Archive
Room: Columbus KL
  • Matthew Bernstein, University of Wisconsin-Madison, United States

Presentation Overview:

Motivation: The NCBI’s Sequence Read Archive (SRA) promises great biological insight if one could analyze the data in the aggregate; however, the data remain largely underutilized, in part, due to the poor structure of the metadata associated with each sample. The rules governing submissions to the SRA do not dictate a standardized description. As a result, the metadata include many synonyms, spelling variants and references to outside sources of information. Furthermore, manual annotation of the data remains intractable due to the large number of samples in the archive. For these reasons, it has been difficult to perform large-scale analyses that study the relationships between biomolecular processes and phenotype across diverse diseases, tissues and cell types present in the SRA.

Results: We present MetaSRA, a database of normalized SRA human sample-specific metadata. Our normalized metadata schema involves mapping samples to terms in biomedical ontologies, labeling each sample with a sample-type category, and extracting real-valued properties. We automated these tasks via a novel computational pipeline. The MetaSRA is available at metasra.biostat.wisc.edu via both a searchable web interface and bulk downloads.

Statement of significance: MetaSRA provides normalized sample-specific metadata for the SRA, enabling more effective queries of SRA metadata and large-scale meta-analyses.

5:20 PM-5:40 PM
Re-Identification of Individuals in Genomic Data-Sharing Beacons via Allele Inference
Room: Columbus KL
  • Nora von Thenen, Bilkent University, Turkey
  • Erman Ayday, Case Western Reserve University, United States
  • A. Ercument Cicek, Bilkent University, Turkey

Presentation Overview:

Genomic data-sharing beacons provide a standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. Previously deemed secure against re-identification attacks, beacons were shown to be vulnerable. Recent studies have demonstrated that it is possible to determine whether the victim is in the dataset by repeatedly querying the beacon for his/her SNPs. Here, we propose a novel re-identification attack and show that the privacy risk is more serious than previously thought. Even if the victim systematically hides informative SNPs, it is possible to infer the alleles at positions of interest, as well as the query results, with high confidence. We use linkage disequilibrium and a high-order Markov chain-based algorithm for inference. We show that in a simulated beacon with 65 individuals from the CEU population, we can infer membership of individuals with 95% confidence with only 5 queries, even when SNPs with MAF less than 0.05 are hidden. We need fewer than 0.5% of the queries that existing works require to determine beacon membership under the same conditions. We show that countermeasures such as a query budget would still fail to protect the privacy of the participants.
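
The yes/no interface that such attacks query can be sketched as below. This is a toy model for illustration, not a real beacon implementation and not the attack presented in the talk, which additionally infers hidden alleles through linkage disequilibrium and a Markov chain model. The class, the function, and the variant tuples used with them are hypothetical.

```python
class Beacon:
    """Toy beacon: answers whether any individual carries a given allele."""

    def __init__(self, genomes):
        # genomes: iterable of sets of (chromosome, position, allele) tuples,
        # one set per individual in the dataset.
        self._alleles = set().union(*genomes)

    def query(self, chromosome, position, allele):
        """Return True if the allele is present in at least one individual."""
        return (chromosome, position, allele) in self._alleles


def yes_fraction(beacon, victim_alleles):
    """Fraction of the victim's alleles for which the beacon answers yes.

    A dataset member always scores 1.0; a non-member typically scores lower
    once rare alleles are queried. Real attacks use a likelihood-ratio test
    rather than this raw fraction.
    """
    answers = [beacon.query(*allele) for allele in victim_alleles]
    return sum(answers) / len(answers)
```

Each individual answer reveals little, but the pattern of answers over a victim's alleles is what leaks membership, which is why the number of queries needed is the key measure in the abstract.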

5:40 PM-6:00 PM
GangSTR: Genome-wide Analysis aNd Genotyping of Short Tandem Repeats
Room: Columbus KL
  • Nima Mousavi, University of California San Diego, United States
  • Sharona Shleizer-Burko, University of California San Diego, United States
  • Melissa Gymrek, University of California San Diego, United States

Presentation Overview:

Short Tandem Repeat (STR) expansions have been associated with dozens of genetic diseases such as Huntington’s Disease and Fragile X Syndrome. Common diagnostic genetic tests for genotyping STRs are highly specialized to certain disorders and only assay a single locus at a time. Next-generation sequencing (NGS) can theoretically genotype all potentially pathogenic variants simultaneously. However, expanded STRs have proven difficult to genotype using existing tools. Recent efforts have developed more generalized tools that use information extracted from paired-end short-read sequencing data to genotype a target set of STRs. However, these methods only use a subset of the available information, and are designed to genotype a predefined set of known loci rather than to identify novel expansions. We present GangSTR, a novel statistical model for accurately genotyping STRs, alongside a genome-wide STR reference panel. GangSTR employs a unified likelihood model that combines multiple sources of information extracted from paired-end reads to genotype short and expanded alleles. GangSTR is a standalone tool that is capable of genome-wide scanning for repeat expansions. We identify and experimentally validate novel STR expansions from a high-coverage whole-genome dataset. Our method outperforms existing STR callers in both accuracy and run time when genotyping known disease loci.