Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

#ISMB2016

Theme Presentation Schedule

Highlights, Late Breaking Research and Proceedings Track submissions are presented by scientific theme as part of the combined Theme Presentation schedule.
Presenters' names are in bold (for updates and changes, email steven@iscb.org)

Attention Conference Presenters - please review the Speaker Information Page available here.

TP001 (HT) - Robust Detection of Alternative Splicing in a Population of Single Cells
Date: Sunday, July 10 10:10 am - 10:30 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Joshua Welch, UNC Chapel Hill, United States
  • Yin Hu, Sage Bionetworks, United States
  • Jan Prins, UNC Chapel Hill, United States

Area Session Chair: Ioannis Xenarios

Single cell RNA-seq data promises to be an invaluable tool for characterizing cellular heterogeneity, but study of alternative splicing in single cells has been limited by the unique challenges of single cell data and lack of suitable analysis methods. We present SingleSplice, which is to our knowledge the first algorithm for identifying alternative splicing in a population of single cells. SingleSplice uses a statistical model trained on the technical noise profile of synthetic spike-in transcripts to identify genes exhibiting biological variation in isoform composition. We applied SingleSplice to data from 279 mouse embryonic stem cells and discovered genes that show significant alternative splicing across the set of cells. A subset of these genes are linked to cell cycle stage, suggesting a novel connection between alternative splicing and the cell cycle. Using SingleSplice, we also characterized the isoform usage heterogeneity of 466 adult and fetal human cortical cells.

TP002 (PT) - DFLpred: High throughput prediction of disordered flexible linker regions in protein sequences
Date: Sunday, July 10 10:10 am - 10:30 am
Room: Northern Hemisphere A3/A4
Theme: PROTEINS
  • Fanchi Meng, University of Alberta, Canada
  • Lukasz Kurgan, Virginia Commonwealth University, United States

Area Session Chair: Lenore Cowen

Motivation: Disordered flexible linkers (DFLs) are disordered regions that serve as flexible linkers/spacers in multi-domain proteins or between structured constituents in domains. They are different from flexible linkers/residues since they are disordered and longer. Availability of experimentally annotated DFLs provides an opportunity to build high-throughput computational predictors of these regions from protein sequences. To date, there are no computational methods that directly predict DFLs and they can be found only indirectly by filtering predicted flexible residues with predictions of disorder.
Results: We conceptualized, developed and empirically assessed a first-of-its-kind sequence-based predictor of DFLs, DFLpred. This method outputs the propensity to form DFLs for each residue in the input sequence. DFLpred uses a small set of empirically selected features that quantify propensities to form certain secondary structures, disordered regions and structured regions, which are processed by a fast linear model. Our high-throughput predictor can be used on the whole-proteome scale; it needs less than 1 hour to predict an entire proteome on a single CPU. When assessed on an independent test dataset with low sequence-identity proteins, it secures an area under the ROC curve (AUC) equal to 0.715 and outperforms existing alternatives that include methods for the prediction of flexible linkers, flexible residues, intrinsically disordered residues, and various combinations of these methods. Prediction on the complete human proteome reveals that about 10% of proteins have a large content of over 30% DFL residues. We also estimate that about 6000 DFL regions are long, with 30 or more consecutive residues.
Availability: http://biomine.ece.ualberta.ca/DFLpred/.

TP003 (LBR) - FUNCTIONALLY PROFILING METAGENOMES AND METATRANSCRIPTOMES AT SPECIES-LEVEL RESOLUTION
Date: Sunday, July 10 10:10 am - 10:30 am
Room: Northern Hemisphere E1/E2
Theme: SYSTEMS / GENES
  • Eric Franzosa, Harvard T. H. Chan School of Public Health, United States
  • Lauren McIver, Harvard T. H. Chan School of Public Health, United States
  • Gholamali Rahnavard, Harvard T. H. Chan School of Public Health, United States
  • George Weingart, Harvard T. H. Chan School of Public Health, United States
  • Karen Schwarzberg, Northern Arizona University, United States
  • Luke Thompson, University of Colorado at Boulder, United States
  • Rob Knight, University of California San Diego, United States
  • J. Gregory Caporaso, Northern Arizona University, United States
  • Nicola Segata, University of Trento, Italy
  • Curtis Huttenhower, Harvard T. H. Chan School of Public Health, United States

Area Session Chair: Alex Bateman

Profiling microbial community function typically involves mapping millions of metagenomic or metatranscriptomic (“meta’omic”) sequencing reads against comprehensive reference sequence databases, often by translated search. In addition to being time-consuming and error-prone, this approach only provides an aggregate profile for a community, thus obscuring the contributions of individual species. To address these challenges, we designed a new tiered strategy for meta’omic functional profiling (HUMAnN2). Our method 1) rapidly identifies the species in a meta’omic sample, 2) maps sequencing reads to a sample-specific database constructed from those species’ pangenomes, and 3) only falls back to translated search for unclassified reads. In evaluations using synthetic data, HUMAnN2’s predicted functional profiles were 87% accurate at the community level (vs. 33% for pure translated search), and 79 to 91% accurate at the level of individual species. We applied HUMAnN2 to identify conserved metabolic pathways among 921 metagenomes from the Human Microbiome Project. In this task, HUMAnN2 tended to explain the majority of sample reads 10x faster than traditional search methods, thus saving 1,000s of CPU hours of compute time. Moreover, by highlighting individual species’ functional contributions, HUMAnN2 revealed new ecological patterns of functional conservation in the human microbiome (e.g. conserved metabolic pathways contributed by different species in different individuals). We expect our improvements to the performance and resolution of meta’omic functional profiling to be broadly applicable to analyses of host- and environmentally-associated microbial communities. HUMAnN2 is open-source, fully documented, and available for download now from http://huttenhower.sph.harvard.edu/humann2.
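
The tiered strategy described above can be illustrated with a toy sketch. It is not HUMAnN2 code: the function name, the dictionary standing in for a sample-specific pangenome database, and the callable standing in for translated search are all hypothetical, and step 1 (species identification) is assumed to have already produced the index.

```python
def tiered_assign(reads, pangenome_index, translated_search):
    """Count reads per (species, gene_family) using a tiered lookup.

    reads: iterable of read sequences (strings).
    pangenome_index: dict mapping read -> (species, gene_family); a toy
        stand-in for nucleotide mapping to a sample-specific database
        (hypothetical, assumed to be built from the detected species).
    translated_search: callable read -> gene_family or None; a toy
        stand-in for the slower translated-search fallback (hypothetical).
    """
    counts = {}
    for read in reads:
        # Fast tier: species-resolved lookup in the sample-specific index.
        hit = pangenome_index.get(read)
        if hit is None:
            # Fallback tier: only reads that failed the fast tier get here.
            family = translated_search(read)
            hit = ("unclassified", family if family else "UNMAPPED")
        counts[hit] = counts.get(hit, 0) + 1
    return counts
```

The point of the design is that the expensive fallback runs only on reads the fast, species-resolved tier could not place, which is also what keeps per-species contributions visible in the output.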

TP004 (LBR) - Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects
Date: Sunday, July 10 10:30 am - 10:50 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Florian Buettner, EMBL-EBI, United Kingdom
  • John C. Marioni, EMBL-EBI, United Kingdom
  • Oliver Stegle, EMBL-EBI, United Kingdom

Area Session Chair: Ioannis Xenarios

Single-cell RNA-sequencing (scRNA-seq) allows heterogeneity in gene expression levels to be studied in large populations of cells. However, such heterogeneity can arise due to both technical and biological factors, thus making decomposing sources of variation extremely difficult. Current methods to dissect this heterogeneity have critical limitations as they do not scale to large datasets comprising tens of thousands of cells and in particular do not permit joint modelling of the effects of biological factors and additional unknown and confounding sources of variation. We here describe a computationally efficient model that uses latent factors to jointly infer both biological and confounding sources of gene expression variation. We validate the method using simulations, demonstrating both its accuracy and its ability to scale to large datasets with up to 100,000 cells. Moreover, through application of the model to the largest single-cell RNA-seq study generated to date, consisting of 49,300 retina cells, we show that our model can robustly decompose scRNA-seq datasets into interpretable components as well as facilitating the identification of novel sub-populations.

TP005 (HT) - Unexpected Features of the Dark Proteome
Date: Sunday, July 10 10:30 am - 10:50 am
Room: Northern Hemisphere A3/A4
Theme: PROTEINS
  • Nelson Perdigão, Universidade de Lisboa, Portugal
  • Julian Heinrich, CSIRO, Australia
  • Christian Stolte, CSIRO, Australia
  • Kenneth Sabir, Garvan Institute of Medical Research, Australia
  • Michael Buckley, CSIRO, Australia
  • Bruce Tabor, CSIRO, Australia
  • Beth Signal, Garvan Institute of Medical Research, Australia
  • Brian Gloss, Garvan Institute of Medical Research, Australia
  • Christopher Hammang, Garvan Institute of Medical Research, Australia
  • Burkhard Rost, Technische Universität München, Germany
  • Andrea Schafferhans, Technische Universität München, Germany
  • Sean O'Donoghue, CSIRO & Garvan Institute, Australia

Area Session Chair: Lenore Cowen

We surveyed the "dark" proteome - that is, regions of proteins never observed by experimental structure determination and inaccessible to homology modeling. For 546,000 Swiss-Prot proteins, we found that 44-54% of the proteome in eukaryotes and viruses was dark, compared with only 14% in archaea and bacteria. Surprisingly, most of the dark proteome could not be accounted for by conventional explanations, such as intrinsic disorder or transmembrane regions. Nearly half of the dark proteome comprised dark proteins, in which the entire sequence lacked similarity to any known structure. Dark proteins fulfill a wide variety of functions, but a subset showed distinct and largely unexpected features, such as association with secretion, specific tissues, the endoplasmic reticulum, disulfide bonding, and proteolytic cleavage. Dark proteins also had short sequence length, low evolutionary reuse, and few known interactions with other proteins. These results suggest new research directions in structural and computational biology.

TP006 (LBR) - Integrating very large multi'omics data by hierarchical all-against-all association testing
Date: Sunday, July 10 10:30 am - 10:50 am
Room: Northern Hemisphere E1/E2
Theme: SYSTEMS
  • Gholamali Rahnavard, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
  • Eric A. Franzosa, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
  • Lauren J. McIver, Harvard T.H. Chan School of Public Health, United States
  • George Weingart, Harvard T.H. Chan School of Public Health, United States
  • Emma Schwager, The Broad Institute, Harvard T.H. Chan School of Public Health, United States
  • Yo Sup Moon, Harvard T.H. Chan School of Public Health, United States
  • Xochitl C. Morgan, Harvard T.H. Chan School of Public Health, United States
  • Levi Waldron, City University of New York School of Public Health, Hunter College, United States
  • Curtis Huttenhower, The Broad Institute, Harvard T.H. Chan School of Public Health, United States

Area Session Chair: Alex Bateman

Modern multi’omic screens of biological samples readily produce enormous numbers of measurements, yet finding statistically significant association patterns among features within these data remains challenging, in part due to the loss of statistical power inherent with testing large numbers of hypotheses. Here, we present and validate a novel hierarchical framework, HAllA (Hierarchical All-against-All association testing), for general purpose and well-powered association discovery in high-dimensional heterogeneous datasets. HAllA combines hierarchical nonparametric hypothesis testing with false discovery rate correction to enable high-sensitivity discovery of linear and non-linear associations in high-dimensional datasets (which may be categorical, continuous, or mixed). HAllA operates by 1) discretizing data to a unified representation, 2) hierarchically clustering paired high-dimensional datasets, 3) applying dimensionality reduction to boost power and potentially improve signal-to-noise ratio, and 4) iteratively testing associations between blocks of progressively more related features. We validated and optimized HAllA using synthetic datasets of known correlation structure. At a fixed false discovery rate, HAllA is consistently better-powered than naive all-against-all association testing across a range of association types. As an example application, we used HAllA to identify associations between high-throughput profiles of microbial genera and metabolites of the human gut microbiome. In addition to recapitulating known associations, we identified 60 previously unobserved associations, including between Ruminococcus and Lithocholic acid. Our implementation of HAllA is highly modular, enabling addition or substitution of alternative methods at each step, and is available with documentation at http://huttenhower.sph.harvard.edu/halla.
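
For readers unfamiliar with false discovery rate correction, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, one common way to control the FDR. The abstract does not say which correction HAllA applies, so this choice is an assumption for illustration, and the function name is hypothetical.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans, True where a hypothesis is rejected at FDR alpha.

    Generic Benjamini-Hochberg step-up procedure; an illustrative
    assumption, not taken from the HAllA implementation.
    """
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected
```

Because the per-rank threshold shrinks as the number of tests m grows, reducing the number of hypotheses actually tested (for example by testing blocks of related features first) is one way a hierarchical scheme can recover power.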

TP007 (LBR) - Lightweight transcriptomics
Date: Sunday, July 10 10:50 am - 11:10 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Surojit Biswas, Harvard University, United States
  • Konstantin Kerner, Sainsbury Laboratory Cambridge University, Germany
  • Sandra Cortijo, Sainsbury Laboratory Cambridge University, United Kingdom
  • Varodom Charoensawan, Sainsbury Laboratory Cambridge University, United Kingdom
  • Vladimir Jojic, UNC-Chapel Hill, United States
  • Philip Wigge, Sainsbury Laboratory Cambridge University, United Kingdom

Area Session Chair: Ioannis Xenarios

Transcript levels are a critical determinant of the proteome and hence cellular function. Because the transcriptome is an outcome of the interactions between genes and their products, we reasoned it may be accurately represented by a subset of transcript abundances. By analyzing thousands of publicly available RNA-Seq datasets, we show that the transcriptomes of A. thaliana and M. musculus are highly compressible. Capitalizing on this observation, we develop a method, Tradict, to reconstruct the expression of globally representative biological processes or the entire transcriptome with the abundances of a small, machine-learned subset of 100 transcripts. These findings suggest natural improvements to both the time and cost of performing forward genetic and small molecule drug screens, mapping eQTLs in natural populations, identifying tumor subtypes, and accurately profiling individual single-cell transcriptomes at scale.

TP008 (HT) - Widespread Expansion of Protein Interaction Capabilities by Alternative Splicing
Date: Sunday, July 10 10:50 am - 11:10 am
Room: Northern Hemisphere A3/A4
Theme: PROTEINS
  • Xinping Yang, Dana-Farber Cancer Institute, United States
  • Jasmin Coulombe-Huntington, McGill University, Canada
  • Shuli Kang, University of California, San Diego, United States
  • Gloria M. Sheynkman, Dana-Farber Cancer Institute, United States
  • Tong Hao, Dana-Farber Cancer Institute, United States
  • Aaron Richardson, Dana-Farber Cancer Institute, United States
  • Song Sun, University of Toronto, Canada
  • Fan Yang, University of Toronto, Canada
  • Yun A. Shen, Dana-Farber Cancer Institute, United States
  • Ryan R. Murray, Dana-Farber Cancer Institute, United States
  • Kerstin Spirohn, Dana-Farber Cancer Institute, United States
  • Bridget E. Begg, Dana-Farber Cancer Institute, United States
  • Miquel Duran-Frigola, Institute for Research in Biomedicine (IRB Barcelona), Spain
  • Andrew MacWilliams, Dana-Farber Cancer Institute, United States
  • Samuel J. Pevzner, Dana-Farber Cancer Institute, United States
  • Quan Zhong, Dana-Farber Cancer Institute, United States
  • Shelly A. Trigg, Dana-Farber Cancer Institute, United States
  • Stanley Tam, Dana-Farber Cancer Institute, United States
  • Lila Ghamsari, Dana-Farber Cancer Institute, United States
  • Nidhi Sahni, Dana-Farber Cancer Institute, United States
  • Song Yi, Dana-Farber Cancer Institute, United States
  • Maria D. Rodriguez, Dana-Farber Cancer Institute, United States
  • Dawit Balcha, Dana-Farber Cancer Institute, United States
  • Guihong Tan, University of Toronto, Canada
  • Michael Costanzo, University of Toronto, Canada
  • Brenda Andrews, University of Toronto, Canada
  • Charles Boone, University of Toronto, Canada
  • Xianghong J. Zhou, University of Southern California, United States
  • Kourosh Salehi-Ashtiani, Dana-Farber Cancer Institute, United States
  • Benoit Charloteaux, Dana-Farber Cancer Institute, United States
  • Alyce A. Chen, Dana-Farber Cancer Institute, United States
  • Michael A. Calderwood, Dana-Farber Cancer Institute, United States
  • Patrick Aloy, Institute for Research in Biomedicine (IRB Barcelona), Spain
  • Frederick P. Roth, University of Toronto, Canada
  • David E. Hill, Dana-Farber Cancer Institute, United States
  • Lilia M. Iakoucheva, University of California, San Diego, United States
  • Yu Xia, McGill University, Canada
  • Marc Vidal, Dana-Farber Cancer Institute, United States

Area Session Chair: Lenore Cowen

While alternative splicing is known to diversify the functional characteristics of some genes, the extent to which protein isoforms globally contribute to functional complexity on a proteomic scale remains unknown. To address this systematically, we cloned full-length open reading frames of alternatively spliced transcripts for a large number of human genes, and combined protein-protein interaction profiling with computer modeling to functionally compare hundreds of protein isoform pairs. The majority of isoform pairs share less than 50% of their interactions. In the global context of interactome network maps, alternative isoforms tend to behave like distinct proteins rather than minor variants of each other. Interaction partners specific to alternative isoforms tend to be expressed in a highly tissue-specific manner and belong to distinct functional modules. Our integrated experimental and computational strategy reveals a widespread expansion of protein interaction capabilities through alternative splicing and suggests that many alternative isoforms are functionally divergent.

TP009 (HT) - Single molecule-level characterization of bacterial epigenomes, heterogeneity and gene regulation
Date: Sunday, July 10 10:50 am - 11:10 am
Room: Northern Hemisphere E1/E2
Theme: GENES
  • John Beaulaurier, Icahn School of Medicine at Mount Sinai, United States
  • Xue-Song Zhang, New York University Medical School, United States
  • Shijia Zhu, Icahn School of Medicine at Mount Sinai, United States
  • Robert Sebra, Icahn School of Medicine at Mount Sinai, United States
  • Chaggai Rosenbluh, Icahn School of Medicine at Mount Sinai, United States
  • Gintaras Deikus, Icahn School of Medicine at Mount Sinai, United States
  • Nan Shen, Icahn School of Medicine at Mount Sinai, United States
  • Diana Munera, Harvard Medical School, United States
  • Matthew Waldor, Harvard Medical School, United States
  • Andrew Chess, Icahn School of Medicine at Mount Sinai, United States
  • Martin Blaser, New York University Medical School, United States
  • Eric Schadt, Icahn School of Medicine at Mount Sinai, United States
  • Gang Fang, Icahn School of Medicine at Mount Sinai, United States

Area Session Chair: Alex Bateman

Beyond its role in host defense, bacterial DNA methylation also plays important roles in the regulation of gene expression, virulence and antibiotic resistance. Bacterial cells in a clonal population can generate epigenetic heterogeneity to increase population-level phenotypic plasticity. Single molecule, real-time (SMRT) sequencing enables the detection of N6-methyladenine and N4-methylcytosine, two major types of DNA modifications comprising the bacterial methylome. However, existing SMRT sequencing-based methods for studying bacterial methylomes rely on a population-level consensus that lacks the single-cell resolution required to observe epigenetic heterogeneity. Here, we present SMALR (single-molecule modification analysis of long reads), a novel framework for single molecule-level detection and phasing of DNA methylation. Using seven bacterial strains, we show that SMALR yields significantly improved resolution and reveals distinct types of epigenetic heterogeneity. SMALR is a powerful new tool that enables de novo detection of epigenetic heterogeneity and empowers investigation of its functions in bacterial populations.

TP010 (PT) - Analysis of aggregated cell-cell statistical distances within pathways unveils therapeutic-resistance mechanisms in circulating tumor cells
Date: Sunday, July 10 11:40 am - 12:00 pm
Room: Northern Hemisphere A1/A2
Theme: DISEASE / SYSTEMS
  • Alfred Schissler, Lussier Lab, United States
  • Qike Li, The University of Arizona, United States
  • James Chen, The Ohio State University, United States
  • Colleen Kenost, The University of Arizona, United States
  • Ikbel Achour, The University of Arizona, United States
  • D. Dean Billheimer, The University of Arizona, United States
  • Haiquan Li, University of Arizona, United States
  • Walter W. Piegorsch, University of Arizona Center for Biomedical Informatics and Biostatistics, United States
  • Yves Lussier, University of Arizona, United States

Area Session Chair: Ioannis Xenarios

Motivation: As ‘omics’ biotechnologies accelerate the capability to contrast a myriad of molecular measurements from a single cell, they also exacerbate current analytical limitations for detecting meaningful single-cell dysregulations. Moreover, mRNA expression alone lacks functional interpretation, limiting opportunities for translation of single-cell transcriptomic insights to precision medicine. Lastly, most single-cell RNA-sequencing analytic approaches are not designed to investigate small populations of cells such as circulating tumor cells shed from solid tumors and isolated from patient blood samples.
Results: In response to these characteristics and limitations in current single-cell RNA-sequencing methodology, we introduce an analytic framework that models transcriptome dynamics through the analysis of aggregated cell-cell statistical distances within biomolecular pathways. Cell-cell statistical distances are calculated from pathway mRNA fold changes between two cells. Within an elaborate case study of circulating tumor cells derived from prostate cancer patients, we develop analytic methods of aggregated distances to identify five differentially expressed pathways associated with therapeutic resistance. Our aggregation analyses perform comparably to Gene Set Enrichment Analysis (GSEA) and better than differentially expressed genes followed by gene set enrichment. However, these methods were not designed to inform on differential pathway expression for a single cell. As such, our framework culminates with the novel aggregation method, cell-centric statistics (CCS). CCS quantifies the effect size and significance of differentially expressed pathways for a single cell of interest. Improved rose plots of differentially expressed pathways in each cell highlight the utility of CCS for therapeutic decision-making.
Availability: http://www.lussierlab.org/publications/CCS/

TP011 (HT) - Large-scale Text Mining Web Services for Bioinformatics Research
Date: Sunday, July 10 11:40 am - 12:00 pm
Room: Northern Hemisphere A3/A4
Theme: DATA
  • Chih-Hsuan Wei, NCBI, United States
  • Robert Leaman, NCBI, United States
  • Zhiyong Lu, NCBI, United States

Area Session Chair: Lenore Cowen

Processing the biomedical literature with automated tools becomes more important as its growth accelerates. We present NCBI text-mining web services, an online version of our text mining suite for biomedical concept recognition and information extraction. Our service incorporates five state of the art tools we developed previously: DNorm (for diseases), GNormPlus (genes/proteins), SR4GN (species), tmChem (chemicals and drugs), and tmVar (variants). Using our service, users can instantly retrieve results from all five tools for any abstract in PubMed. Users may also process arbitrary text – such as full-text articles or non-PubMed publications – using our asynchronous batch mode, or easily visualize results through our web-based application PubTator. We simplify interoperability by supporting multiple data formats, and handle large requests through a computer cluster to ensure scalability. Our web service is already in wide use, supporting research projects in biocuration, crowdsourcing and translational bioinformatics. The web service is freely available at: http://www.ncbi.nlm.nih.gov/CBBresearch/Lu/Demo/tmTools/#curl

TP012 (HT) - Genetic Architectures of Quantitative Variation in RNA Editing Pathways
Date: Sunday, July 10 11:40 am - 12:00 pm
Room: Northern Hemisphere E1/E2
Theme: GENES
  • Tongjun Gu, University of Florida, United States
  • Daniel Gatti, The Jackson Laboratory, United States
  • Anuj Srivastava, The Jackson Laboratory, United States
  • Elizabeth Snyder, The Jackson Laboratory, United States
  • Narayanan Raghupathy, The Jackson Laboratory, United States
  • Petr Simecek, The Jackson Laboratory, United States
  • Karen Svenson, The Jackson Laboratory, United States
  • Ivan Dotu, The Jackson Laboratory, United States
  • Jeffrey Chuang, The Jackson Laboratory, United States
  • Mark Keller, University of Wisconsin, United States
  • Alan Attie, University of Wisconsin, United States
  • Robert Braun, The Jackson Laboratory, United States
  • Gary Churchill, The Jackson Laboratory, United States

Area Session Chair: Alex Bateman

RNA editing refers to post-transcriptional processes that alter the base sequence of RNA. Recently, hundreds of new RNA editing targets have been reported. However, the mechanisms that determine the specificity and degree of editing are not well understood. We examined quantitative variation of site-specific editing in a genetically diverse multiparent population, Diversity Outbred mice, and mapped polymorphic loci that alter editing ratios globally for C-to-U editing and at specific sites for A-to-I editing. An allelic series in the C-to-U editing enzyme Apobec1 influences the editing efficiency of Apob and 58 additional C-to-U editing targets. We identified 49 A-to-I editing sites with polymorphisms in the edited transcript that alter editing efficiency. In contrast to the shared genetic control of C-to-U editing, most of the variable A-to-I editing sites were determined by local nucleotide polymorphisms in proximity to the editing site in the RNA secondary structure. Our results indicate that RNA editing is a quantitative trait subject to genetic variation and that evolutionary constraints have given rise to distinct genetic architectures in the two canonical types of RNA editing.

TP013 (LBR) - DEVELOPMENT OF A BAYESIAN TENSOR FACTORIZATION MODEL TO PREDICT DRUG RESPONSE CURVES IN CANCER CELL LINES
Date: Sunday, July 10 12:00 pm - 12:20 pm
Room: Northern Hemisphere A1/A2
Theme: DISEASE / DATA
  • Nathan Lazar, Oregon Health & Science University, United States
  • Mehmet Gonen, Koç University, Turkey
  • Shannon McWeeney, Oregon Health & Science University, United States
  • Adam Margolin, Oregon Health & Science University, United States
  • Kemal Sonmez, Oregon Health & Science University, United States

Area Session Chair: Ioannis Xenarios

Biological data is inherently multi-dimensional in nature, yet most computational methods used today are based to some extent on flattening these data into two-dimensional matrices. We present a new model, BaTFLED (Bayesian Tensor Factorization Linked to External Data), that predicts values in a three-dimensional response tensor using input features for each of the dimensions. We apply this to predict full dose response curves in a panel of 599 cancer cell lines treated with 545 compounds as part of the Cancer Target Discovery and Development (CTD2) effort. BaTFLED learns projection matrices mapping features for cell lines and drugs into latent representations that combine to form the responses. Predictions for new cell lines, drugs or combinations of the two can be made by multiplying through these projection matrices. A Bayesian framework allows us to place distributions on the unknown variables, which encourage sparsity both row-wise in the projection matrices (for feature selection) and in the core tensor which combines the latent vectors (selecting interactions between latent representations). We train the model using a highly efficient variational method that learns optimal parameters for a distribution approximating the true posterior. This talk will explore implications of model design choices, demonstrate initial results on the CTD2 data and discuss how these methods may be applied to other multi-dimensional datasets.
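
The "multiplying through these projection matrices" step can be sketched as a point-estimate forward pass of a Tucker-style model. This is an illustration only: it ignores the Bayesian treatment entirely, assumes the third tensor dimension is dose, and uses hypothetical function and argument names rather than anything from the BaTFLED software.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def predict_responses(cell_feats, drug_feats, dose_feats,
                      proj_cell, proj_drug, proj_dose, core):
    """Predict a cells x drugs x doses response tensor (nested lists).

    Each *_feats matrix holds one feature row per entity; each proj_*
    matrix maps features to a latent vector; core[a][b][c] weights the
    interaction of latent components a, b and c. All names and the
    choice of dose as the third mode are assumptions for this sketch.
    """
    latent_cell = matmul(cell_feats, proj_cell)
    latent_drug = matmul(drug_feats, proj_drug)
    latent_dose = matmul(dose_feats, proj_dose)
    # Combine the three latent vectors through the core tensor.
    return [[[sum(core[a][b][c] * u[a] * v[b] * w[c]
                  for a in range(len(u))
                  for b in range(len(v))
                  for c in range(len(w)))
              for w in latent_dose]
             for v in latent_drug]
            for u in latent_cell]
```

Because a prediction depends only on feature rows and the learned matrices, a cell line or drug that was absent during training can be scored by supplying its feature row, which matches the abstract's description of predicting for new cell lines or drugs.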

TP014 (HT) - Text as Data: Using text-based features for proteins representation and for computational prediction of their characteristics
Date: Sunday, July 10 12:00 pm - 12:20 pm
Room: Northern Hemisphere A3/A4
Theme: DATA / PROTEINS
  • Hagit Shatkay, University of Delaware, United States
  • Scott Brady, University of Toronto, Canada
  • Andrew Wong, Mount Sinai Hospital, Canada

Area Session Chair: Lenore Cowen

The current era of large-scale biology is characterized by a fast-paced growth in the number of sequenced genomes and, consequently, by a multitude of identified proteins whose function has yet to be determined.
Simultaneously, any known or postulated information concerning genes and proteins is part of the ever-growing published scientific literature, which is expanding at a rate of over a million new publications per year.
Computational tools that attempt to automatically predict and annotate protein characteristics, such as function and localization patterns, are being developed along with systems that aim to support the process via text mining.
Most work on protein characterization focuses on features derived directly from protein sequence data. Protein-related work that does aim to utilize the literature typically concentrates on extracting specific facts (e.g., protein interactions) from text.
In the past few years we have taken a different route, treating the literature as a source of text-based features, which can be employed just as sequence-based protein-features were used in earlier work, for predicting protein subcellular location and possibly also function. We discuss here in detail the overall approach, along with results from work we have done in this area demonstrating the value of this method and its potential use.

TP015 (PT) - A novel algorithm for calling mRNA m6A peaks by modeling biological variances in MeRIP-seq data
Date: Sunday, July 10 12:00 pm - 12:20 pm
Room: Northern Hemisphere E1/E2
Theme: GENES
  • Xiaodong Cui, UTSA, United States
  • Jia Meng, Xi'an Jiaotong-Liverpool University, China
  • Shaowu Zhang, Northwestern Polytechnical University, China
  • Yidong Chen, UTHSCSA, United States
  • Yufei Huang, UTSA, United States

Area Session Chair: Alex Bateman

Motivation: N6-methyl-adenosine (m6A) is the most prevalent mRNA methylation, but precise prediction of its mRNA location is important for understanding its function. A recent sequencing technology, known as Methylated RNA Immunoprecipitation Sequencing (MeRIP-seq), has been developed for transcriptome-wide profiling of m6A. We previously developed a peak calling algorithm called exomePeak. However, exomePeak over-simplifies data characteristics and ignores the reads’ variances among replicates or reads dependency across a site region. To further improve the performance, a new model is needed to address these important issues of MeRIP-seq data.
Results: We propose a novel, graphical model-based peak calling method, MeTPeak, for transcriptome-wide detection of m6A sites from MeRIP-seq data. MeTPeak explicitly models reads count of an m6A site and introduces a hierarchical layer of Beta variables to capture the variances and a Hidden Markov model (HMM) to characterize the reads dependency across a site. In addition, we developed a constrained Newton’s method and designed a log-barrier function to compute analytically intractable, positively constrained Beta parameters. We applied our algorithm to simulated and real biological datasets and demonstrated significant improvement in detection performance and robustness over exomePeak. Prediction results on publicly available MeRIP-seq datasets are also validated and shown to be able to recapitulate the known patterns of m6A, further validating the improved performance of MeTPeak.
Availability: The package ‘MeTPeak’ is implemented in R and C++, and additional details are available at https://github.com/compgenomics/MeTPeak

TP016 (PT) - DrugE-Rank: Improving Drug-Target Interaction Prediction of New Candidate Drugs or Targets by Ensemble Learning to Rank
Date: Sunday, July 10 12:20 pm - 12:40 pm
Room: Northern Hemisphere A1/A2
Theme: DISEASE / DATA
  • Qing-Jun Yuan, Fudan University, China
  • Junning Gao, FDU, China
  • Dongliang Wu, Fudan University, China
  • Shihua Zhang, University of Southern California, United States
  • Hiroshi Mamitsuka, Kyoto University, Japan
  • Shanfeng Zhu, Fudan University, China

Area Session Chair: Ioannis Xenarios

Presentation Overview: Show

Motivation: Identifying drug-target interaction is an important task in drug discovery. To reduce heavy time and financial cost in experimental identification of drug-target interaction, many computational approaches have been proposed. Although these approaches have used many different principles, their performance is far from satisfactory, especially in predicting drug-target interactions of new drugs or new targets.

Methods: Machine learning approaches to this problem can be divided into two types: feature-based and similarity-based methods. Learning to rank (LTR) is the most powerful known technique among the feature-based methods, while similarity-based methods are well accepted because they connect the chemical and genomic spaces, represented by drug and target similarities, respectively. We propose a new method, DrugE-Rank, that improves performance on this problem by combining the advantages of the two types of methods. That is, DrugE-Rank uses LTR, for which multiple well-known similarity-based methods can be used as components of ensemble learning.

Results: The performance of DrugE-Rank was thoroughly examined in three main experiments using data from DrugBank: 1) cross-validation on FDA (US Food and Drug Administration) approved drugs before March 2014, 2) an independent test on FDA approved drugs after March 2014, and 3) an independent test on FDA experimental drugs. Experimental results show that DrugE-Rank significantly outperformed competing methods, in particular achieving more than 30% improvement in AUPR (area under the precision-recall curve) for new FDA approved drugs and FDA experimental drugs.
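
As an illustration of the AUPR metric mentioned above, the following minimal sketch computes average precision, one common step-wise estimate of the area under the precision-recall curve. It is not the DrugE-Rank evaluation code; the function name `average_precision` and the toy scores and labels are assumptions made up for this example.

```python
def average_precision(scores, labels):
    """Average precision: a step-wise estimate of the area under the
    precision-recall curve (illustrative, not the authors' code)."""
    # Rank candidate drug-target pairs from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one known interaction is required")
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision at the rank of each true interaction.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Hypothetical toy data: five candidate pairs, two of which are true interactions.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_aupr = average_precision(toy_scores, toy_labels)
```

Here the true interactions sit at ranks 1 and 3, so the estimate is (1/1 + 2/3) / 2. Unlike ROC-based measures, this score depends heavily on how the few positives are ranked, which is why it suits sparse interaction data.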

TP017 (LBR) - Good news: we are getting better at predicting protein function
Date: Sunday, July 10 12:20 pm - 12:40 pm
Room: Northern Hemisphere A3/A4
Theme: PROTEINS / DATA
  • Predrag Radivojac, Indiana University, United States
  • Yuxiang Jiang, Indiana University, United States
  • Sean Mooney, University of Washington, United States
  • Tal Ronen-Oron, The Buck Institute for Aging Research, United States
  • Casey Greene, University of Pennsylvania, United States
  • Iddo Friedberg, Iowa State University, United States

Area Session Chair: Lenore Cowen

Presentation Overview: Show

Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to the discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging.

Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2.

Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.

TP018 (PT) - RNAiFold2T: Constraint Programming design of thermo-IRES switches
Date: Sunday, July 10 12:20 pm - 12:40 pm
Room: Northern Hemisphere E1/E2
Theme: GENES
  • Juan Antonio Garcia-Martin, Department of Biology, Boston College, United States
  • Ivan Dotu, Research Programme on Biomedical Informatics (GRIB), Department of Experimental and Health Sciences, Universitat Pompeu Fabra, Spain
  • Javier Fernandez-Chamorro, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Gloria Lozano, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Jorge Ramajo, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Encarnacion Martinez-Salas, Centro de Biologia Molecular Severo Ochoa, Consejo Superior de Investigaciones Cientificas – Universidad Autonoma de Madrid, Spain
  • Peter Clote, Department of Biology, Boston College, United States

Area Session Chair: Alex Bateman

Presentation Overview: Show

Motivation: RNA thermometers (RNATs) are cis-regulatory elements that change secondary structure upon temperature shift. Often involved in the regulation of heat shock, cold shock and virulence genes, RNATs constitute an interesting potential resource in synthetic biology, where engineered RNATs could prove to be useful tools in biosensors and conditional gene regulation.
Results: Solving the 2-temperature inverse folding problem is critical for RNAT engineering. Here we introduce RNAiFold2T, the first Constraint Programming (CP) and Large Neighborhood Search (LNS) algorithms to solve this problem. Benchmarking tests of RNAiFold2T against existing inverse folding programs (adaptive walk and genetic algorithm) show that our software generates two orders of magnitude more solutions, thus allowing ample exploration of the space of solutions. Subsequently, solutions can be prioritized by computing various measures, including probability of target structure in the ensemble, melting temperature, etc. Using this strategy, we rationally designed two thermosensor internal ribosome entry site (thermo-IRES) elements, whose normalized cap-independent translation efficiency is approximately 50% greater at 42°C than at 30°C, when tested in reticulocyte lysates. Translation efficiency is lower than that of the wild-type IRES element, which on the other hand is fully resistant to temperature shift-up. This appears to be the first purely computational design of functional RNA thermoswitches, and certainly the first purely computational design of functional thermo-IRES elements.
Availability: RNAiFold2T is publicly available as part of the new release RNAiFold3.0 at https://github.com/clotelab/RNAiFold and http://bioinformatics.bc.edu/clotelab/RNAiFold, the latter of which also has a web server. The software is written in C++ and uses the OR-Tools CP search engine.
Contact: clote@bc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

TP019 (HT) - Temporal dynamics of collaborative networks in large scientific consortia
Date: Sunday, July 10 2:00 pm - 2:20 pm
Room: Northern Hemisphere A1/A2
Theme: SYSTEMS / DATA
  • Daifeng Wang, Yale University, United States
  • Koon-Kiu Yan, Yale University, United States
  • Joel Rozowsky, Yale University, United States
  • Eric Pan, Yale University, United States
  • Mark Gerstein, Yale University, United States

Area Session Chair: Hagit Shatkay

Presentation Overview: Show

The emergence of collective creative enterprises such as large scientific consortia is a unique feature of modern scientific research, especially in the biomedical field. Recent examples include the ENCyclopedia Of DNA Elements (ENCODE) consortium annotating the human genome and the 1000 Genomes consortium generating a catalog of uniformly called variants for the biomedical community. To ensure that the scientific community can benefit from these efforts, it is important to understand the connections between consortium members and researchers outside of the consortium. To address the issue, we analyzed the temporal co-authorship network structures of the ENCODE and modENCODE consortia [1]. Our analysis revealed their publication patterns, showing that the consortium members work closely as a community whereas non-members collaborate on the scale of a few laboratories. We also identified a few brokers who play an important role in facilitating collaborations with outside researchers, which suggests that large scientific consortia should set up a formal outreach group to communicate with outside researchers.

[1] Daifeng Wang, Koon-Kiu Yan, Joel Rozowsky, Eric Pan, Mark Gerstein, "Temporal dynamics of collaborative networks driven by large scientific consortia," in press, Trends in Genetics, 2016, doi: 10.1016/j.tig.2016.02.006

TP020 (LBR) - INTEGRATIVE COMPUTATIONAL MODELING ACROSS TUMORS REVEALS CONTEXT SPECIFIC IMPACT OF MUTATIONS
Date: Sunday, July 10 2:00 pm - 2:20 pm
Room: Northern Hemisphere A3/A4
Theme: DISEASE / GENES
  • Hatice Osmanbeyoglu, Memorial Sloan Kettering Cancer Center, United States
  • Eneda Toska, Memorial Sloan Kettering Cancer Center, United States
  • Jose Baselga, Memorial Sloan Kettering Cancer Center, United States
  • Christina Leslie, Memorial Sloan Kettering Cancer Center, United States

Area Session Chair: Paul Horton

Presentation Overview: Show

Pan-cancer analyses of somatic mutations and copy number aberrations have confirmed that the same genes or pathways are often altered across multiple tumor types. There is great interest in deploying targeted therapies in a pan-cancer manner, matching pathway-targeted drugs to the mutational profile of the tumor regardless of cancer type. However, ‘actionable mutations’ interact with distinct cancer-specific gene regulatory programs and signaling networks and can occur against different genetic backgrounds across tumor types. To better model the context-dependent role of somatic alterations, we applied a novel computational strategy for integrating parallel phosphoproteomic and mRNA sequencing data across 12 tumor data sets from The Cancer Genome Atlas (TCGA), linking dysregulation of upstream signaling pathways with altered transcriptional response. We then developed a statistical approach to interpret the impact of mutations and copy number events in terms of functional outcomes such as altered signaling and transcription factor (TF) activity. Our analysis revealed both known and novel transcriptional regulators downstream of oncogenic pathways. These results have implications for the prospective experimental investigation of targeted therapies in tumors harboring specific mutations. Our evolving understanding of the context-dependent role of somatic alterations may potentially enhance current approaches for combinatorial clinical trial design.

TP021 (LBR) - Boosting alignment accuracy through adaptive local realignment
Date: Sunday, July 10 2:00 pm - 2:20 pm
Room: Northern Hemisphere E1/E2
Theme: PROTEINS
  • Dan Deblasio, University of Arizona, United States
  • John Kececioglu, University of Arizona, United States

Area Session Chair: Jianlin Cheng

Presentation Overview: Show

Mutation rates can vary across the residues of a protein, but when multiple sequence alignments are computed for protein sequences, the same choice of values for the substitution score and gap penalty parameters is often used across their entire length. We provide for the first time a new method called adaptive local realignment that automatically uses diverse alignment parameter settings in different regions of the input sequences when computing protein multiple sequence alignments. This allows parameter settings to locally adapt across a protein to more closely match varying mutation rates.

Our method builds on our prior work on global alignment parameter advising with the Facet alignment accuracy estimator. Given a computed alignment, in each region that has low estimated accuracy, a collection of candidate realignments is generated using a precomputed set of alternate parameter choices. If one of these alternate realignments has higher estimated accuracy than the original subalignment, the region is replaced with the realignment, and the concatenation of these realigned regions forms the new output alignment.
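
The replacement loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Opal or Facet implementation: the callables `estimate_accuracy` and `realign`, the `threshold` cutoff, and the toy data are all hypothetical stand-ins.

```python
def adaptive_local_realignment(regions, estimate_accuracy, realign,
                               parameter_choices, threshold):
    """Replace each low-accuracy region by its best-scoring candidate
    realignment, if any candidate beats the original (illustrative sketch)."""
    output = []
    for region in regions:
        best_region = region
        best_score = estimate_accuracy(region)
        # Only regions with low estimated accuracy are realigned.
        if best_score < threshold:
            for params in parameter_choices:
                candidate = realign(region, params)
                candidate_score = estimate_accuracy(candidate)
                if candidate_score > best_score:
                    best_region, best_score = candidate, candidate_score
        output.append(best_region)
    # The concatenation of kept and replaced regions is the new alignment.
    return output


# Hypothetical stand-ins: each region carries a precomputed accuracy estimate,
# and "realigning" simply looks up a made-up accuracy per parameter choice.
toy_accuracy_by_params = {"gap-heavy": 0.55, "gap-light": 0.35}


def toy_estimate(region):
    return region["accuracy"]


def toy_realign(region, params):
    return {"name": region["name"] + "@" + params,
            "accuracy": toy_accuracy_by_params[params]}


toy_regions = [{"name": "r1", "accuracy": 0.9},
               {"name": "r2", "accuracy": 0.4},
               {"name": "r3", "accuracy": 0.6}]
toy_result = adaptive_local_realignment(
    toy_regions, toy_estimate, toy_realign,
    parameter_choices=["gap-heavy", "gap-light"], threshold=0.5)
```

In this toy run only `r2` falls below the threshold, and it is replaced by the candidate with the higher estimated accuracy; `r1` and `r3` pass through unchanged.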

Adaptive local realignment significantly improves the quality of alignments over using the single best default parameter choice. In particular, this new method of local advising, when combined with prior methods for global advising, boosts alignment accuracy by almost 23% over the best default parameter setting on the hardest-to-align benchmarks (and almost 5.9% over using global advising alone).

A new version of the Opal multiple sequence aligner that incorporates adaptive local realignment, using Facet for parameter advising, is available free for non-commercial use at facet.cs.arizona.edu.

TP022 (HT) - Positive and negative forms of replicability in gene network analysis
Date: Sunday, July 10 2:20 pm - 2:40 pm
Room: Northern Hemisphere A1/A2
Theme: SYSTEMS / DATA
  • Wim Verleyen, Cold Spring Harbor Laboratory, United States
  • Sara Ballouz, Cold Spring Harbor Laboratory, United States
  • Jesse Gillis, Cold Spring Harbor Laboratory, United States

Area Session Chair: Hagit Shatkay

Presentation Overview: Show

Presentation description
In this work, we build a model of scientific communities in which simulated researchers characterize gene function through an individual analysis of particular network data. We model each researcher by sampling from a pool of machine learning algorithms, each of which then samples individually from various public resources. By simulating groups of researchers operating under different constraints, we are able to assess practices leading to successful group decisions. Our analysis reveals an important principle limiting the value of replication, namely that targeting it directly causes ‘easy’ or uninformative replication to dominate analyses. We provide examples of this problem in action and walk through seminal results which replicate precisely because they are unlikely to be true. We also show that this bias has a strong impact in protein-protein interaction data, leading to negative correlations between replicability and good quality control. We discuss some implications for public discourse, particularly on scientific matters.

Scientific Justification
Our recent work analyzes what is usually considered a fundamental basis of science – replication – and shows that not only can it be useless as a general heuristic for discovering the truth, it can be damaging when applied naively. Intuitively, the idea is close to that of overfitting in machine learning. Two researchers both of whom overfit to some data might obtain more replicable results, but this form of replicability is of little value. Using real data and analysis techniques, we show this problem is apparent in the field of gene network analysis as a whole.

While we focus on the field-wide meta-analysis, the detailed examples in the paper are particularly important:

A) We show that a seminal result in autism genetics replicates because it is false. Our detailed walk-through makes results that are otherwise very surprising into intuitive principles.

B) We show that the negative relationship our model predicts between replicability and quality control can be seen directly even in reports for individual protein-protein interactions.

Our research in this area is ongoing and our talk will discuss additional examples, drawn principally from medically important cases (e.g., point (A)) which I think will be of high interest at ISMB, as well as methods for identifying these problems.

Although the focus is on networks, the model and examples are of relevance to any knowledge-base (hence our area choice). This is work that repays careful consideration and I’m confident that discussing it at ISMB will provide exceptional value to our colleagues.

TP023 (HT) - COSMOS: accurate detection of somatic structural variations through asymmetric comparison between tumor and normal samples
Date: Sunday, July 10 2:20 pm - 2:40 pm
Room: Northern Hemisphere A3/A4
Theme: DISEASE / GENES
  • Koichi Yamagata, AIST, Japan
  • Ayako Yamanishi, Graduate School of Medicine, Osaka University, Japan
  • Chikara Kokubu, Graduate School of Medicine, Osaka University, Japan
  • Junji Takeda, Graduate School of Medicine, Osaka University, Japan
  • Jun Sese, AIST, Japan

Area Session Chair: Paul Horton

Presentation Overview: Show

An important challenge in cancer genomics is precise detection of structural variations (SVs) by high-throughput short-read sequencing, which is hampered by the high false discovery rates of existing analysis tools. Here we propose an accurate SV detection method named COSMOS, which compares the statistics of the mapped read pairs in tumor samples with isogenic normal control samples in a distinct asymmetric manner. COSMOS also prioritizes the candidate SVs using strand-specific read-depth information. Performance tests on modeled tumor genomes revealed that COSMOS outperformed existing methods in terms of F-measure. We also applied COSMOS to an experimental mouse cell-based model, in which SVs were induced by genome engineering and gamma-ray irradiation, followed by polymerase chain reaction-based confirmation. The precision of COSMOS was 84.5%, while that of the next best existing method was 70.4%. Moreover, the sensitivity of COSMOS was the highest, indicating that COSMOS has great potential for cancer genome analysis.
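
For readers less familiar with the evaluation terms used above, the following minimal sketch shows how precision, sensitivity (recall) and the F-measure relate. The function name and the call counts are made up for illustration and are not taken from the COSMOS study.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F-measure) from SV call counts."""
    called = true_positives + false_positives       # all SVs the tool reported
    actual = true_positives + false_negatives       # all SVs truly present
    precision = true_positives / called if called else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # The F-measure is the harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 8 confirmed calls, 2 false calls, 8 missed SVs.
toy_precision, toy_recall, toy_f = precision_recall_f(8, 2, 8)
```

Because the harmonic mean is pulled toward the smaller value, a tool cannot reach a high F-measure by trading away either precision or sensitivity.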

TP024 (LBR) - The Post-Genomic Era of Biological Network Alignment: Latest Insights
Date: Sunday, July 10 2:20 pm - 2:40 pm
Room: Northern Hemisphere E1/E2
Theme: SYSTEMS
  • Lei Meng, University of Notre Dame, United States
  • Vipin Vijayan, University of Notre Dame, United States
  • Tijana Milenkovic, University of Notre Dame, United States

Area Session Chair: Jianlin Cheng

Presentation Overview: Show

Analogous to genomic sequence alignment, biological network alignment (NA) aims to find regions of similarities between molecular networks of different species. NA can be divided into local (LNA) or global (GNA). LNA finds small, highly conserved network regions; GNA finds large, suboptimally conserved regions. When a new NA method is proposed, it is compared against existing methods from the same NA category. However, both LNA and GNA aim to allow for transferring functional knowledge from well- to poorly-studied species between conserved (aligned) network regions. So, which one to choose, LNA or GNA? To answer this, we introduce the first systematic evaluation of the two NA categories and new measures of alignment quality that allow for fair comparison of the different LNA and GNA outputs. We find that LNA and GNA give complementary results: LNA has high functional but low topological quality, while GNA has the opposite. Thus, we propose IGLOO, a new approach that integrates GNA and LNA. IGLOO allows for a trade-off between topological and functional alignment quality better than any existing LNA and GNA methods. NA can also be divided into pairwise NA of two networks (PNA) vs. multiple NA of more than two networks (MNA). MNA may be more useful since it can capture at once biological knowledge common to multiple species. We present multiMAGNA++, a novel and superior MNA approach, and we introduce new MNA quality measures to allow for more complete alignment characterization and more fair MNA method evaluation compared to the existing measures.

TP025 (LBR) - Efficient Data-Driven Model Learning for Dynamical Systems
Date: Sunday, July 10 2:40 pm - 3:00 pm
Room: Northern Hemisphere A1/A2
Theme: SYSTEMS / DATA
  • Ermao Cai, Carnegie Mellon University, United States
  • Ifigeneia Apostolopoulou, Carnegie Mellon University, United States
  • Pranay Ranjan, Carnegie Mellon University, United States
  • Paul Pan, Carnegie Mellon University, United States
  • Mark Wuebbens, Carnegie Mellon University, United States
  • Diana Marculescu, Carnegie Mellon University, United States

Area Session Chair: Hagit Shatkay

Presentation Overview: Show

In the analysis of non-linear dynamical biological systems, it is often of interest to determine an efficient, qualitative estimate of the behavior of the state variables as opposed to exact, quantitative measures which may be intractable or too expensive to obtain. Moreover, established closed form mathematical rules governing system behavior are not always available and one may need to emulate the nature of the system on the basis of observations and experimental data only. In this paper, we propose to rely on Boolean models for analyzing dynamical systems and develop a polynomial time complexity heuristic algorithm to infer such Boolean functions for dynamical systems with refractory periods. Our algorithm is structured to perform even more efficiently for systems with a nested canalizing behavior with respect to certain features, which is indeed the case for life science applications. For data obtained from existing dynamical systems, e.g., T helper (Th) cell signaling network, T-LGL survival network, and T-cell differentiation, our algorithm is 100X faster than two other state-of-the-art methods, yet achieves similar or better accuracy.
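
To make the idea of learning a Boolean model from observations concrete, the sketch below builds a truth table for one node from a binary time series and reports whether a candidate set of regulators explains it consistently. It is a generic illustration, not the heuristic algorithm of the abstract, and it ignores refractory periods and nested canalizing structure; all names and the toy trajectory are hypothetical.

```python
def infer_truth_table(trajectory, target, regulators):
    """Map regulator states at time t to the target state at time t+1.

    Returns the observed (partial) truth table, or None when the chosen
    regulators give contradictory outputs for the same input.
    """
    table = {}
    for current, following in zip(trajectory, trajectory[1:]):
        key = tuple(current[name] for name in regulators)
        value = following[target]
        # setdefault stores the first observation; later ones must agree.
        if table.setdefault(key, value) != value:
            return None
    return table


# Hypothetical trajectory generated by the rule C(t+1) = A(t) AND NOT B(t).
toy_trajectory = [
    {"A": 1, "B": 0, "C": 0},
    {"A": 1, "B": 1, "C": 1},
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 1, "C": 0},
    {"A": 1, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
]
toy_table = infer_truth_table(toy_trajectory, "C", ["A", "B"])
toy_table_a_only = infer_truth_table(toy_trajectory, "C", ["A"])
```

With both regulators the table is consistent and matches the generating rule; with `A` alone the same input leads to different outputs, so the candidate set is rejected.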

TP026 (LBR) - intSKAT, an integrated Sequence Kernel Association Test, to identify novel clinically impactful somatic mutations in melanomas
Date: Sunday, July 10 2:40 pm - 3:00 pm
Room: Northern Hemisphere A3/A4
Theme: DISEASE / DATA
  • Yian Chen, Moffitt Cancer Center, United States
  • Zachary Thompson, Moffitt Cancer Center, United States
  • Jamie Teer, Moffitt Cancer Center, United States
  • Fernanda Flores, Moffitt Cancer Center, United States
  • Manali Phadke, Moffitt Cancer Center, United States
  • Zhihua Chen, Moffitt Cancer Center, United States
  • Eric Welsh, Moffitt Cancer Center, United States
  • Michael Schell, Moffitt Cancer Center, United States
  • Keiran Smalley, Moffitt Cancer Center, United States

Area Session Chair: Paul Horton

Presentation Overview: Show

INTRODUCTION
In recent years, much has been learned about the molecular basis of cancer progression, and therapeutic strategies based on mutation information have been developed for some cancer types. Taking melanoma as an example, it is known that ~50% of melanomas have BRAF mutations, and BRAF inhibitors have been developed with initial success for treatment. However, after accounting for patients with the major known driver mutations, BRAF (~50%), NRAS (~15%-20%), and NF1 (~14%), there are still ~20% of melanoma patients without clearly known driver mutations responsible for the development or aggressiveness of the disease. The lack of identified important non-passenger mutations in this subgroup (or in any other cancer type) poses a significant challenge and also provides a great opportunity for developing therapeutic strategies. This becomes particularly important for developing personalized therapeutic strategies.
Traditionally, driver mutations are identified in one of the following ways: some methods look for genes whose mutation frequencies are higher than expected, while others determine positive selection for non-silent mutations (such as frameshift indels, nonsense and splice-site mutations) by weighting the predicted functional impact and observed frequencies. Although these methods have been shown to be useful for identifying driver mutations, they have limited power to detect infrequently mutated driver genes.
We proposed an expansive and integrated approach to link genotype to phenotypes to identify clinically relevant somatic mutations. This is accomplished by performing a flexible and powerful gene-based association test, intSKAT, to investigate the association between mutations in each gene and patients’ overall survival outcome.
METHODS
Built upon the gene-based sequence kernel association test (SKAT) [1], which was developed for germline studies, we developed an integrated association test, intSKAT, to identify novel somatic mutations that are associated with clinically relevant outcomes, e.g., overall survival (OS). We first coded the multi-allelic mutations into bi-allelic variants with reference versus alternative allele. Our method included an expansive suite of eight gene-based methods: 1. Burden test; 2. SKAT; 3. SKAT-O; 4-6. Burden, SKAT, and SKAT-O weighted by PolyPhen-2 score; 7. Cox regression with mutation status in a gene (0/1) as the predictor; and 8. Cox regression with the number of mutations in a gene as the predictor.
This method can not only evaluate joint effects of mutations within a gene and identify important genes with infrequent mutations, but also has the flexibility of leveraging functional predictions when available. It also allows combinations of different directions of mutations (protective or deleterious) and different levels of functional predictions (unknown or functional prediction) to be ranked high. FDR adjustment is performed to account for multiple comparisons within each method, and a minFDR of 10^-3 across all methods is used to declare statistical significance. Furthermore, we performed robust regression to regress the number of mutations within each gene against the length of the longest transcript. Genes significantly associated with OS that also had a higher than expected standardized residual were considered more likely to be non-passenger genes.
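
The minFDR rule described above can be sketched as follows, assuming the per-method adjustment is Benjamini-Hochberg (the abstract says only "FDR") and reading the significance cutoff as 1e-3. The function names, method labels and p-values are hypothetical and are not data from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def min_fdr_across_methods(p_values_by_method):
    """Adjust within each method, then take the per-gene minimum.

    p_values_by_method maps a method name to per-gene p-values, with the
    genes listed in the same order for every method.
    """
    fdr_by_method = [benjamini_hochberg(p) for p in p_values_by_method.values()]
    return [min(gene_fdrs) for gene_fdrs in zip(*fdr_by_method)]


# Hypothetical p-values for three genes under two of the eight methods.
toy_p_values = {
    "burden": [0.0001, 0.02, 0.5],
    "skat": [0.003, 0.00001, 0.9],
}
toy_min_fdr = min_fdr_across_methods(toy_p_values)
toy_significant = [fdr < 1e-3 for fdr in toy_min_fdr]
```

Taking the minimum across methods lets a gene be declared significant if any single test detects it, which favors sensitivity; the strict cutoff partly compensates for the extra comparisons this introduces.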
Using the targeted exome sequencing data from 185 melanoma patients in the Total Cancer Care (TCC) database at Moffitt Cancer Center, we applied intSKAT to investigate the association between mutations in genes and patients’ OS as a proof of principle study. Briefly, for sequencing and variant calling, tumor samples from the TCC project were subjected to genomic capture (performed by BGI, Shenzhen using SureSelect custom designs targeting 1,321 genes, Agilent Technologies, Inc., Santa Clara, CA) and massively parallel sequencing. Sequences were aligned to the hs37d5 human reference with the Burrows-Wheeler Aligner (BWA). Insertion/deletion realignment, quality score recalibration, and variant identification were performed with the Genome Analysis ToolKit (GATK). Sequence variants were annotated with ANNOVAR and custom scripts. We limited variants to those within the 1,321 gene target regions plus 100 flanking base pairs. High quality variants were retained by including only variants with GQ score >=15 and excluding variants in the least specific VQSR Tranche (100.00). Variants were further retained if >=80% of the samples had a high quality genotype call (reference or variant) at that position. Somatic mutations were enriched by removing variants observed at >1% in 1000 Genomes, ESP African or ESP European populations. Variants were also removed if observed at >1% in a set of 238 normal tissue samples subjected to the same capture and sequencing procedure. Variants were finally filtered to include only protein altering (nonsynonymous, frameshifting or non-frameshifting indels, stopgain, stoploss, and splicing variants) or only protein altering plus UTR.

In addition to performing intSKAT, we performed robust regression to further narrow down the non-passenger mutations that can drive disease aggressiveness in the discovery phase (Figure 1). For validation, we used the whole exome sequencing data and overall survival information from TCGA (N=211) (Figure 1). For validation studies, variants were limited to the 1,321 gene target regions plus 100 flanking base pairs. We decided to use real-world sequencing data and patients’ survival data to reflect real-world complexity.

Finally, after identifying our top gene with mutations, we performed cell line experiments to elucidate the potential roles of the mutations in the gene(s). The melanoma cell lines Malme-3M and MeWo were purchased from ATCC. Malme-3M and MeWo cells were cultured in RPMI complete medium with 20% and 10% FBS, respectively. Cells were grown at 37°C in a 5% CO2 humidified atmosphere.

Three-dimensional spheroid assay
The three-dimensional melanoma spheroids were prepared using the liquid overlay method. Melanoma cells were added to a 96-well plate coated with agar for 72h. Spheroids were harvested, implanted into collagen I, and left to grow for 72h. Then, spheroids were washed in PBS and treated with Calcein-AM and propidium iodide for 1h at 37°C. Afterwards, pictures were taken using a Nikon-300 inverted fluorescence microscope. The percentage of invasion was determined using ImageJ software. siEPHA7 knockdown experiments were performed to investigate its effect on invasion.

Inverse Matrigel invasion assay
The Matrigel invasion assay was performed. Matrigel was prepared 1:1 in ice cold PBS and inserted in 8 micron pore 6.5 mm diameter uncoated Transwells in the wells of a 24 well tissue culture plate and incubated for 30 min at 37°C. Cell suspensions (1 x 10^5/ml) were added onto the upward facing underside of the filter and incubated in the inverted state for 4 hours. Each transwell was washed in serum free medium, and 100 μl of RPMI with 10% FBS was added into the transwell and incubated for 72h at 37°C. The cells were fixed in 1 ml of 4% para-formaldehyde/0.2% Triton-X 100 and stained with 1 ml of 4 μM Calcein AM solution for 1h at room temperature. The images were obtained by confocal microscopy. siEPHA7 knockdown experiments were performed to investigate its effect on invasion.

RESULTS & DISCUSSION
A total of 22,848 variants were identified in 1,345 genes, of which 24 were genes near the targeted 1,321 genes. In the discovery phase, 12 genes had minFDR < 10^-3. Among these, the 6 genes with standardized residuals greater than 2 were ADAMTS18, DNAH8, EPHA7, LRP1B, MUC16, and TTN (p<0.008). We are in the process of downloading and processing the TCGA data for a formal validation analysis. We did a quick validation and looked up the association between mutations and OS using cBioPortal. Among the 6 genes, 3 were supported by this initial quick lookup: EPHA7 (P Burden = 1.47x10^-7 for TCC; p log-rank test = 0.03 for TCGA) and MUC16 (P Burden = 2.23x10^-6 for TCC; p log-rank test = 0.015 for TCGA) validated at p < 0.05, while TTN (P Burden = 1.22x10^-6 for TCC) showed a similar trend in TCGA but with p = 0.07 using the log-rank test. The melanoma cell lines Malme-3M and MeWo contain some mutations. The siEPHA7 knockdown experiments using 3-D spheroid assays and inverse Matrigel invasion assays showed that EPHA7 knockdown significantly reduced cell invasion by 40% (p<0.01) and by 53.4% (p<0.001), respectively. The striking impact of EPHA7 knockdown on both cell survival and cell invasion showed that EPHA7 likely plays a major role as a regulator of metastasis.
Identifying EPHA7 as a gene with important non-passenger mutations demonstrated the power of evaluating the association between infrequent mutations jointly in a gene with patients’ clinical outcomes.

CONCLUSIONS
Through our three-phase melanoma study, we have demonstrated that our proposed integrated approach, combining intSKAT and robust regression, can successfully identify novel clinically impactful genes with mutations. Our proposed method should be readily applicable to discover novel mutations for other cancer types and provide potentially important strategies for personalized treatment options.



FIGURE 1. A three-phase study design to test our proposed method intSKAT, an integrated approach to discover novel clinically impactful mutations in melanoma patients.

REFERENCES

1. Wu, M.C., et al., Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 2011. 89(1): p. 82-93.

TP027 (HT) - Covariation Is a Poor Measure of Molecular Coevolution
Date: Sunday, July 10 2:40 pm - 3:00 pm
Room: Northern Hemisphere E1/E2
Theme: PROTEINS
  • David Talavera, University of Manchester, United Kingdom
  • Simon Lovell, University of Manchester, United Kingdom
  • Simon Whelan, Uppsala University, Sweden

Area Session Chair: Jianlin Cheng

Presentation Overview: Show

Covariation of amino-acid residues is widely studied for applications such as protein structure prediction, protein design and analysis of protein-protein interactions. However, there is no consensus as to the underlying evolutionary mechanisms that give rise to covariation. We have developed a theoretical model with the aim of understanding the origins of covariation. Our model predicts that covariation is generated only if strong selective pressure is present for extremely long periods of time. Our empirical analyses confirm this expectation, as we demonstrate 1) that covariation methods select pairs of residues with slow evolutionary rates; and 2) that the location of conserved residues in the core of the protein structure explains the precision of these methods at finding residues in close proximity. Altogether, our results explain the relative performance and limitations of current covariation methods, and the difficulties in developing evolutionary models for detecting coevolution.
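
For readers unfamiliar with covariation scores, one widely used measure is the mutual information between two alignment columns. The sketch below is an illustration only: the abstract does not say which covariation methods were analysed, and the toy alignment and function name are hypothetical.

```python
import math
from collections import Counter


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        p_a = freq_a[a] / n
        p_b = freq_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: one string per sequence, one column per site.
alignment = ["AKLE", "AKLE", "ARLD", "ARLD"]
columns = list(zip(*alignment))

covarying = column_mutual_information(columns[1], columns[3])  # K/R with E/D
conserved = column_mutual_information(columns[0], columns[1])  # invariant A
print(covarying, conserved)
```

In this toy case the two sites that change together score one bit, while any pair involving the invariant site scores zero.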

TP028 (HT) - Quantitative analysis of microRNA mediated regulation on competing endogenous RNAs
Date: Sunday, July 10 3:30 pm - 3:50 pm
Room: Northern Hemisphere A1/A2
Theme: SYSTEMS / GENES
  • Ye Yuan, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
  • Bing Liu, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
  • Peng Xie, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
  • Michael Zhang, Department of Molecular and Cell Biology, Center for Systems Biology, University of Texas, Dallas, United States
  • Yanda Li, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
  • Zhen Xie, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China
  • Xiaowo Wang, Bioinformatics Division, Center for Synthetic and Systems Biology, Tsinghua National Laboratory for Information Science and Technology/Department of Automation, Tsinghua University, China

Area Session Chair: Hagit Shatkay

Presentation Overview: Show

Each microRNA species can bind various types of target RNAs. Therefore, target RNAs can indirectly regulate each other by sequestering shared microRNAs. This phenomenon is called the competing endogenous RNA (ceRNA) effect. The off-target phenomenon in RNAi technology is also closely related to this effect. Combining systems biology modeling with synthetic biology experiments, we established a mathematical model to describe microRNA regulation and built corresponding synthetic gene circuits in cultured human cells to quantify the ceRNA effect under variable conditions. The results suggested that the ceRNA effect is affected by the abundance of microRNA and targets, the number and affinity of binding sites, and the mRNA degradation pathway determined by the degree of microRNA-mRNA complementarity. Furthermore, a non-reciprocal competing effect of microRNA and RNAi was also demonstrated, which provides a new direction for the improvement of RNAi technology.
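
The sequestration idea can be illustrated with a toy titration model. This is a minimal sketch under stated assumptions, not the authors' model: two target RNAs share one microRNA, binding removes both partners, and all rate constants and names are hypothetical.

```python
def simulate_cerna(k1, k2, k_mir=2.0, decay_rna=0.1, decay_mir=0.1,
                   k_on=1.0, dt=0.01, steps=20000):
    """Forward-Euler integration of a toy two-target titration model.

    k1, k2 : transcription rates of target RNA 1 and target RNA 2
    k_mir  : transcription rate of the shared microRNA
    Returns the approximate steady-state free levels (r1, r2, mir).
    """
    r1 = r2 = mir = 0.0
    for _ in range(steps):
        bind1 = k_on * mir * r1  # microRNA binding target 1
        bind2 = k_on * mir * r2  # microRNA binding target 2
        d_r1 = k1 - decay_rna * r1 - bind1
        d_r2 = k2 - decay_rna * r2 - bind2
        d_mir = k_mir - decay_mir * mir - bind1 - bind2
        r1 += dt * d_r1
        r2 += dt * d_r2
        mir += dt * d_mir
    return r1, r2, mir


# Raising the competitor's transcription rate frees target 1 from repression.
low_competitor = simulate_cerna(k1=1.0, k2=0.5)
high_competitor = simulate_cerna(k1=1.0, k2=5.0)
print(low_competitor[0], high_competitor[0])
```

With these hypothetical parameters, target 1 rises when its competitor is expressed more strongly, even though the two RNAs never interact directly.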

TP029 (LBR) - A Weighted Exact Test for Significance of Mutually Exclusive Mutations in Cancer
Date: Sunday, July 10 3:30 pm - 3:50 pm
Room: Northern Hemisphere A3/A4
Theme: DISEASE / GENES
  • Mark Leiserson, Brown University, United States
  • Matthew Reyna, Brown University, United States
  • Benjamin Raphael, Brown University, United States

Area Session Chair: Paul Horton

Presentation Overview: Show

Large-scale cancer sequencing efforts over the past decade from consortia such as The Cancer Genome Atlas have revealed that different combinations of mutations cause cancer in different patients. One method for distinguishing the driver mutations responsible for cancer from the random mutations with no role in cancer is to search for combinations of mutations that are mutually exclusive across tumors. We introduce a new statistical test for mutual exclusivity that uses the observed number of mutations in genes and tumor samples. The statistical test weights mutations with per gene, per sample mutation probabilities. We present a formula for computing this test exactly, and derive an approximation that can compute the tail probability quickly and accurately. We demonstrate our approach by applying it to hundreds of colorectal, thyroid, and endometrial cancers.
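
A simplified sketch of a weighted exclusivity tail probability follows. It assumes that mutations are independent Bernoulli events with per-gene, per-sample probabilities and, unlike the test described above, it does not condition on the observed number of mutations per gene; the function names are hypothetical.

```python
def exclusive_probability(probs):
    """P(exactly one gene of the set is mutated) in a single sample."""
    total = 0.0
    for i, p in enumerate(probs):
        term = p
        for j, q in enumerate(probs):
            if j != i:
                term *= 1.0 - q
        total += term
    return total


def poisson_binomial_tail(success_probs, threshold):
    """Exact P(T >= threshold), T a sum of independent Bernoulli variables."""
    dist = [1.0]  # dist[k] = P(k successes so far)
    for q in success_probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - q)
            nxt[k + 1] += mass * q
        dist = nxt
    return sum(dist[threshold:])


def exclusivity_p_value(prob_matrix, observed_exclusive):
    """prob_matrix[s] lists the mutation probability of each gene in sample s."""
    per_sample = [exclusive_probability(row) for row in prob_matrix]
    return poisson_binomial_tail(per_sample, observed_exclusive)


# Two genes, two samples, every mutation probability 0.5 (hypothetical).
print(exclusivity_p_value([[0.5, 0.5], [0.5, 0.5]], 2))
```

The dynamic program computes the tail exactly in time quadratic in the number of samples, which is one reason a fast approximation is attractive for large cohorts.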

TP030 (PT) - CMsearch: simultaneous exploration of protein sequence space and structure space improves not only protein homology detection but also protein structure prediction
Date: Sunday, July 10 3:30 pm - 3:50 pm
Room: Northern Hemisphere E1/E2
Theme: PROTEINS
  • Xuefeng Cui, KAUST, Saudi Arabia
  • Zhiwu Lu, Renmin University, China
  • Sheng Wang, Toyota Technological Institute at Chicago, United States
  • Jingyan Wang, KAUST, Saudi Arabia
  • Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia

Area Session Chair: Jianlin Cheng

Presentation Overview: Show

Motivation: Protein homology detection, a fundamental problem in computational biology, is an indispensable step towards predicting protein structures and understanding protein functions. Despite the advances in recent decades on sequence alignment, threading, and alignment-free methods, protein homology detection remains a challenging open problem. Recently, network methods that try to find transitive paths in the protein structure space demonstrate the importance of incorporating network information of the structure space. Yet, current methods merge the sequence space and the structure space into a single space, and thus introduce inconsistency in combining different sources of information.

Method: We present a novel network-based protein homology detection method, CMsearch, based on cross-modal learning. Instead of exploring a single network built from the mixture of sequence and structure space information, CMsearch builds two separate networks to represent the sequence space and the structure space. It then learns sequence-structure correlation by simultaneously taking sequence information, structure information, sequence space information and structure space information into consideration.

Results: We tested CMsearch on two challenging tasks, protein homology detection and protein structure prediction, by querying all 8,332 PDB40 proteins. Our results demonstrate that CMsearch is insensitive to the similarity metrics used to define the sequence and the structure spaces. By using HMM-HMM alignment as the sequence similarity metric, CMsearch clearly outperforms state-of-the-art homology detection methods and the CASP-winning template-based protein structure prediction methods.

TP031 (PT) - Reconstructing the temporal progression of HIV-1 immune response pathways
Date: Sunday, July 10 3:50 pm - 4:10 pm
Room: Northern Hemisphere A1/A2
Theme: SYSTEMS / DISEASE
  • Siddhartha Jain, Carnegie Mellon University, United States
  • Joel Arrais, Universidade de Aveiro, IEETA, Portugal
  • Narasimhan J. Venkatachari, University of Pittsburgh, United States
  • Velpandi Ayyavoo, University of Pittsburgh, United States
  • Ziv Bar-Joseph, Carnegie Mellon University, United States

Area Session Chair: Hagit Shatkay

Presentation Overview: Show

We present TimePath, a new method that integrates time series and static datasets to reconstruct dynamic models of host response to stimulus. TimePath uses an Integer Programming formulation to select a subset of pathways that, together, explain the observed dynamic responses. Applying TimePath to study human response to HIV-1 led to accurate reconstruction of several known regulatory and signaling pathways and to novel mechanistic insights. We experimentally validated several of TimePath's predictions, highlighting the usefulness of temporal models.

TP032 (LBR) - Clonal evolution inference and visualization in metastatic colorectal cancer
Date: Sunday, July 10 3:50 pm - 4:10 pm
Room: Northern Hemisphere A3/A4
Theme: DISEASE / GENES
  • Ha X. Dang, Washington University in St. Louis, United States
  • Julie Grossman, Washington University in St. Louis, United States
  • Brian White, Washington University in St. Louis, United States
  • Steven Foltz, Washington University in St. Louis, United States
  • Christopher Miller, Washington University in St. Louis, United States
  • Jingqin Luo, Washington University in St. Louis, United States
  • Timothy Ley, Washington University in St. Louis, United States
  • Richard Wilson, Washington University in St. Louis, United States
  • Elaine Mardis, Washington University in St. Louis, United States
  • Ryan Fields, Washington University in St. Louis, United States
  • Christopher Maher, Washington University in St. Louis, United States

Area Session Chair: Paul Horton

Presentation Overview: Show


Dissecting genomic heterogeneity and clonal evolution in tumors is critical to understanding cancer progression, metastasis, and recurrence. To identify subclonal populations of cancer cells, somatic variants identified via sequencing are often clustered across tumor samples based on their variant allele frequencies (VAF) or cancer cell fractions (CCF). We developed ClonEvol, a tool to infer and visualize clonal evolution models in multiple related tumor samples using pre-clustered variants. We demonstrated that ClonEvol was able to infer clonal evolution models using published and simulated datasets. We also used ClonEvol to infer clonal evolution models for an unpublished dataset of whole genome/exome and targeted sequencing of multi-organ, multi-region primary and metastatic tumors from a metastatic colorectal cancer cohort. We discovered that metastasis seeding in colorectal cancers was a complex event that involved multiple subclones from primary and metastatic tumors. Moreover, the critical subclones that drove metastasis were often missed when a single biopsy was sequenced from the primary tumors, thus necessitating multi-region sequencing for monitoring clonal evolution and identifying critical events driving metastasis. ClonEvol is available at https://github.com/hdng/clonevol
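
A constraint commonly used when ordering clones is the sum rule: in every sample, the cancer cell fractions of a clone's direct children cannot add up to more than the fraction of the parent. The sketch below checks a candidate tree against that rule. It is not ClonEvol's own procedure, and the data layout and names are hypothetical.

```python
def sum_rule_violations(parent_of, ccf, tolerance=0.0):
    """List (sample, parent) pairs where child CCFs exceed the parent CCF.

    parent_of : dict mapping each child cluster to its parent cluster
    ccf       : dict mapping cluster -> {sample name: cancer cell fraction}
    """
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    violations = []
    for parent, kids in children.items():
        for sample, parent_fraction in ccf[parent].items():
            child_sum = sum(ccf[kid][sample] for kid in kids)
            if child_sum > parent_fraction + tolerance:
                violations.append((sample, parent))
    return violations


# Hypothetical pre-clustered variants: founding clone 1 with subclones 2 and 3.
tree = {2: 1, 3: 1}
fractions = {
    1: {"primary": 1.0, "metastasis": 1.0},
    2: {"primary": 0.6, "metastasis": 0.7},
    3: {"primary": 0.3, "metastasis": 0.5},
}
print(sum_rule_violations(tree, fractions))
```

Here the tree is compatible with the primary sample but not with the metastasis, so clusters 2 and 3 cannot both be direct children of cluster 1 in a model that must hold across all samples.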

TP033 (PT) - Ensemble-Based Evaluation for Protein Structure Models
Date: Sunday, July 10 3:50 pm - 4:10 pm
Room: Northern Hemisphere E1/E2
Theme: PROTEINS
  • Michal Jamroz, Warsaw University, Poland
  • Andrzej Kolinski, Warsaw University, Poland
  • Daisuke Kihara, Purdue University, United States

Area Session Chair: Jianlin Cheng

Presentation Overview: Show

Motivation: Comparing protein tertiary structures is a fundamental procedure in structural biology and protein bioinformatics. Structure comparison is particularly important for evaluating computational protein structure models. Most model structure evaluation methods perform rigid-body superimposition of a structure model onto its crystal structure and measure the difference between the corresponding residue or atom positions. However, these methods neglect the intrinsic flexibility of proteins by treating the native structure as a rigid molecule. Since different parts of proteins have different levels of flexibility (for example, exposed loop regions are usually more flexible than the core region of a protein structure), disagreement between a model and the native structure needs to be evaluated differently depending on the flexibility of residues in a protein.
Results: We propose a score named FlexScore for comparing protein structures that considers the flexibility of each residue in the native state of proteins. Flexibility information may be extracted from experiments such as NMR or from molecular dynamics simulation. FlexScore considers an ensemble of conformations of a protein, described as a multivariate Gaussian distribution of atomic displacements, and compares a query computational model to the ensemble. We compare FlexScore with other commonly used structure similarity scores over various examples. FlexScore agrees with experts' intuitive assessment of computational models and provides information on the practical usefulness of models.
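
The idea of weighting deviations by residue flexibility can be sketched as follows. This is a deliberately simplified illustration, not the FlexScore formula: it assumes one atom per residue, structures that are already superimposed, and an isotropic Gaussian per residue in place of a full multivariate one. All names and coordinates are hypothetical.

```python
def residue_statistics(ensemble):
    """Mean position and isotropic variance of each residue in an ensemble.

    ensemble : list of conformations, each a list of (x, y, z) tuples.
    """
    n_conf = len(ensemble)
    stats = []
    for r in range(len(ensemble[0])):
        mean = tuple(sum(conf[r][axis] for conf in ensemble) / n_conf
                     for axis in range(3))
        msd = sum(sum((conf[r][axis] - mean[axis]) ** 2 for axis in range(3))
                  for conf in ensemble) / n_conf
        stats.append((mean, msd / 3.0))
    return stats


def flexibility_weighted_deviation(model, ensemble, min_variance=1e-3):
    """Squared deviation of each model residue, scaled by ensemble variance."""
    scores = []
    for coords, (mean, variance) in zip(model, residue_statistics(ensemble)):
        squared = sum((coords[axis] - mean[axis]) ** 2 for axis in range(3))
        scores.append(squared / max(variance, min_variance))
    return scores


# Residue 0 is rigid in the ensemble, residue 1 is flexible (toy values).
ensemble = [
    [(0.1, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(-0.1, 0.0, 0.0), (-2.0, 0.0, 0.0)],
]
model = [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(flexibility_weighted_deviation(model, ensemble))
```

Both model residues sit 1 unit from the ensemble mean, but the rigid residue is penalised far more heavily than the flexible one.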

TP034 (HT) - Identification of essential molecular and cellular processes controlling the response time and intensity of inflammation
Date: Sunday, July 10 4:10 pm - 4:30 pm
Room: Northern Hemisphere A1/A2
Theme: SYSTEMS / DISEASE
  • Alexander Mitrophanov, Department of Defense Biotechnology High Performance Computing Software Applications Institute, United States
  • Sridevi Nagaraja, Department of Defense Biotechnology High Performance Computing Software Applications Institute, United States
  • Jaques Reifman, Department of Defense Biotechnology High Performance Computing Software Applications Institute, United States

Area Session Chair: Hagit Shatkay

Presentation Overview: Show

Pathological inflammation, including inflammatory response with exaggerated intensity (sepsis) or with delayed resolution (chronic inflammation), has defied attempts at efficacious treatment. Here, we developed and applied a computational strategy to demonstrate how specific molecular and cellular components can be manipulated to achieve targeted modulation of the inflammatory response time and intensity. The strategy was based on comprehensive sensitivity and correlation analyses using our recently developed kinetic model that can represent thousands of possible inflammation scenarios. We identified three molecular mediators whose inhibition may robustly restore pathological inflammation to its normal course. We found that inflammation timing was more difficult to control than its intensity. Yet, simultaneous inhibition of two distinct targets suggested a reliable means to normalize both excessively strong and abnormally prolonged inflammatory responses. Our model was validated with existing experimental data and suggested new in vivo experiments.

TP035 (HT) - Robust discrimination of cell types from tissue expression profiles
Date: Sunday, July 10 4:10 pm - 4:30 pm
Room: Northern Hemisphere A3/A4
Theme: DISEASE / DATA
  • Aaron M. Newman, Stanford University, United States
  • Andrew J. Gentles, Stanford University, United States
  • Chih Long Liu, Stanford University, United States
  • Michael R. Green, University of Nebraska Medical Center, United States
  • Weiguo Feng, Stanford University, United States
  • Scott V. Bratman, University of Toronto, Canada
  • Dongkyoon Kim, Stanford University, United States
  • Yue Xu, Stanford University, United States
  • Amanda Khuong, Stanford University, United States
  • Chuong D. Hoang, National Cancer Institute, United States
  • Viswam S. Nair, Stanford University, United States
  • Robert B. West, Stanford University, United States
  • Sylvia K. Plevritis, Stanford University, United States
  • Maximilian Diehn, Stanford University, United States
  • Ash A. Alizadeh, Stanford University, United States

Area Session Chair: Paul Horton

Presentation Overview: Show

Changes in cellular composition underlie diverse physiological states. While flow cytometry and immunohistochemistry are commonly used to characterize tissue heterogeneity, the former requires cell dissociation, which can alter representation, while the latter is generally limited to one marker per section. To complement these methods, we developed CIBERSORT, an in silico deconvolution approach that robustly enumerates cell subsets of interest from gene expression profiles (GEPs) of bulk tissues. We evaluated CIBERSORT using fresh, frozen, and fixed specimens, including solid tumors, and found that it outperforms previous deconvolution methods with respect to noise, unknown mixture content, and closely related cell types. When applied to GEPs from 25 tumor types in a pan-cancer analysis, CIBERSORT revealed complex associations between 22 tumor-infiltrating leukocyte subsets and clinical outcomes. Predictions linking specific immune phenotypes to survival were validated in lung adenocarcinoma. CIBERSORT provides a novel platform for tissue characterization without requiring antibodies, disaggregation, or living cells.
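
Expression deconvolution treats a bulk profile as a weighted mixture of cell-type signature profiles. The sketch below uses plain non-negative least squares solved by projected gradient descent, which is a generic stand-in and not CIBERSORT's own algorithm. The signature values, step size and names are hypothetical, and the step size would need tuning for data on a different scale.

```python
def deconvolve(signature, mixture, steps=2000, rate=0.005):
    """Estimate cell-type fractions by non-negative least squares.

    signature : rows are genes, columns are cell types
    mixture   : bulk expression value for each gene
    Returns fractions normalised to sum to one.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    weights = [1.0 / n_types] * n_types
    for _ in range(steps):
        residual = [
            sum(signature[g][t] * weights[t] for t in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        for t in range(n_types):
            gradient = sum(signature[g][t] * residual[g] for g in range(n_genes))
            # Gradient step followed by projection onto non-negative values.
            weights[t] = max(0.0, weights[t] - rate * gradient)
    total = sum(weights)
    return [w / total for w in weights] if total > 0 else weights


# Three marker genes, two cell types mixed 70:30 (hypothetical values).
signature = [[10.0, 1.0], [1.0, 10.0], [5.0, 5.0]]
mixture = [7.3, 3.7, 5.0]
print(deconvolve(signature, mixture))
```

On this noise-free toy mixture the estimate recovers the 70:30 proportions; real bulk data adds noise, unknown cell types and collinear signatures, which are the difficulties the abstract highlights.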

TP036 (LBR) - Investigating molecular determinants of ebolavirus pathogenicity
Date: Sunday, July 10 4:10 pm - 4:30 pm
Room: Northern Hemisphere E1/E2
Theme: DISEASE / PROTEINS
  • Morena Pappalardo, University of Kent, United Kingdom
  • Miguel Juliá, University of Kent, United Kingdom
  • Mark Howard, University of Kent, United Kingdom
  • Jeremy Rossman, University of Kent, United Kingdom
  • Martin Michaelis, University of Kent, United Kingdom
  • Mark Wass, University of Kent, United Kingdom

Area Session Chair: Jianlin Cheng

Presentation Overview: Show

The West Africa Ebola virus outbreak has killed thousands of people and demonstrated the scale on which the virus threatens human life. Using extensive sequencing data obtained during the outbreak, we compare Ebolavirus genomes to identify potential molecular determinants of Ebolavirus pathogenicity. Of the five Ebolavirus species, only Reston viruses are not pathogenic in humans. We compared the Reston virus genome with those from the four human pathogenic species to identify specificity determining positions (SDPs) that are differentially conserved and may therefore act as molecular determinants of pathogenicity. We initially identified 189 SDPs using 196 Ebolavirus genome sequences. We report a reduced number of SDPs using a much larger set of sequences from the current outbreak. Structural analysis was performed to identify SDPs that are likely to alter protein structure and function and could be associated with pathogenicity. The most striking findings were in Ebolavirus proteins VP24 and VP40. In particular, SDPs present in VP24 are likely to impair binding to human karyopherin alpha proteins and therefore prevent inhibition of interferon signaling in response to viral infection. VP24 is also critical for Ebolavirus adaptation to novel hosts, and as only a few SDPs distinguish Reston virus VP24 from VP24 of other Ebolaviruses, it is possible that human pathogenic Reston viruses may emerge.
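
The notion of a differentially conserved position can be sketched with a strict rule: a column is an SDP if each group is invariant at that column and the two groups carry different residues. Practical SDP methods typically tolerate partial conservation and score positions rather than applying a yes/no rule; the sequences and function name below are hypothetical.

```python
def specificity_determining_positions(group_a, group_b):
    """Indices of columns conserved within each group but differing between them.

    group_a, group_b : lists of aligned sequences of equal length.
    """
    positions = []
    columns_a = zip(*group_a)
    columns_b = zip(*group_b)
    for index, (col_a, col_b) in enumerate(zip(columns_a, columns_b)):
        residues_a = set(col_a)
        residues_b = set(col_b)
        if len(residues_a) == 1 and len(residues_b) == 1 and residues_a != residues_b:
            positions.append(index)
    return positions


# Toy aligned fragments standing in for the two groups of viral proteins.
non_pathogenic = ["MKTL", "MKTL"]
pathogenic = ["MRTL", "MRTV"]
print(specificity_determining_positions(non_pathogenic, pathogenic))
```

Column 1 qualifies because K is fixed in one group and R in the other, while column 3 does not because the second group is variable there.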

TP037 (LBR) - LINEs between species: Evolutionary dynamics of LINE-1 retrotransposons across the eukaryotic tree of life
Date: Monday, July 11 10:10 am - 10:30 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Atma Ivancevic, The University of Adelaide, Australia
  • Dan Kortschak, The University of Adelaide, Australia
  • Terry Bertozzi, South Australian Museum, Australia
  • David Adelson, University of Adelaide, Australia

Area Session Chair: Yana Bromberg

Presentation Overview: Show

LINE-1 (L1) retrotransposons are dynamic elements. They have the potential to cause great genomic change by inserting copies of themselves throughout the genome, resulting in the duplication and rearrangement of regulatory DNA. Active L1, in particular, are often thought of as tightly constrained, homologous and ubiquitous elements with well-characterised domain organisation. For the past 30 years, model organisms have been used to define L1s as 6-8kb sequences containing a 5’-UTR, two open reading frames working harmoniously in cis, and a 3’-UTR with a polyA tail.
In this study, we demonstrate the remarkable and overlooked diversity of L1s via a comprehensive phylogenetic analysis of over 500 species from widely divergent branches of the tree of life. The rapid and recent growth of L1 elements in mammalian species is juxtaposed against their decline in plant species and complete extinction in most reptiles and insects. In fact, some of these previously unexplored mammalian species (e.g. snub-nosed monkey, minke whale) exhibit L1 retrotranspositional ‘hyperactivity’ far surpassing that of human or mouse. In contrast, non-mammalian L1s have become so varied that the current classification system seems to inadequately capture their structural characteristics. Our findings illustrate how both long-term inherited evolutionary patterns and random bursts of activity in individual species can significantly alter genomes, highlighting the importance of L1 dynamics in eukaryotes.

TP038 (PT) - Convolutional neural network architectures for predicting DNA-protein binding
Date: Monday, July 11 10:10 am - 10:30 am
Room: Northern Hemisphere A3/A4
Theme: DATA / PROTEINS
  • Haoyang Zeng, Massachusetts Institute of Technology, United States
  • Matthew Edwards, MIT, United States
  • Ge Liu, MIT, United States
  • David Gifford, MIT, United States

Area Session Chair: Bruno Gaeta

Presentation Overview: Show

Convolutional neural networks (CNN) have outperformed conventional methods in modeling the sequence specificity of DNA-protein binding. Yet inappropriate CNN architectures can yield poorer performance than simpler models. Thus an in-depth understanding of how to match CNN architecture to a given task is needed to fully harness the power of CNNs for computational biology applications. We present a systematic exploration of CNN architectures for predicting DNA sequence binding using a large compendium of transcription factor datasets. We identify the best-performing architectures by varying CNN width, depth, and pooling designs. We find that adding convolutional kernels to a network is important for motif-based tasks. We show the benefits of CNNs in learning rich higher-order sequence features, such as secondary motifs and local sequence context, by comparing network performance on multiple modeling tasks ranging in difficulty. We also demonstrate how careful construction of sequence benchmark datasets, using approaches that control potentially confounding effects like positional or motif strength bias, is critical in making fair comparisons between competing methods. We explore how to establish the sufficiency of training data for these learning tasks, and we have created a flexible cloud-based framework that permits the rapid exploration of alternative neural network architectures for problems in computational biology.
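
The basic building block being varied here, a convolutional kernel scanned over one-hot encoded DNA followed by a rectifier and global max pooling, can be written out in a few lines. This is a minimal sketch with a single hand-set kernel rather than a trained network; the kernel weights and function names are hypothetical.

```python
BASES = "ACGT"


def one_hot(sequence):
    """Encode a DNA string as a list of 4-element indicator rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in sequence]


def conv_max_pool(sequence, kernel, bias=0.0):
    """Scan one kernel over the sequence, apply ReLU, then global max pooling.

    kernel : list of rows, one per motif position, each with 4 weights (A, C, G, T).
    """
    encoded = one_hot(sequence)
    width = len(kernel)
    if len(encoded) < width:
        raise ValueError("sequence is shorter than the kernel")
    activations = []
    for start in range(len(encoded) - width + 1):
        total = bias
        for offset in range(width):
            for channel in range(4):
                total += kernel[offset][channel] * encoded[start + offset][channel]
        activations.append(max(0.0, total))  # ReLU
    return max(activations)


# Hand-set kernel rewarding the motif TGA: +1 for a match, -1 otherwise.
tga_kernel = [
    [-1.0, -1.0, -1.0, 1.0],  # T
    [-1.0, -1.0, 1.0, -1.0],  # G
    [1.0, -1.0, -1.0, -1.0],  # A
]
print(conv_max_pool("CCTGACC", tga_kernel), conv_max_pool("CCCCCCC", tga_kernel))
```

A network's width corresponds to the number of such kernels, its depth to stacked convolutional layers, and its pooling design to how the per-position activations are summarised.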

TP039 (PT) - What Time is It? Deep Learning Approaches for Circadian Rhythms
Date: Monday, July 11 10:10 am - 10:30 am
Room: Northern Hemisphere E1/E2
Theme: SYSTEMS / GENES
  • Forest Agostinelli, University of California-Irvine, United States
  • Nicholas Ceglia, University of California-Irvine, United States
  • Babak Shahbaba, University of California-Irvine, United States
  • Paolo Sassone-Corsi, University of California-Irvine, United States
  • Pierre Baldi, University of California-Irvine, United States

Area Session Chair: Nicola Mulder

Presentation Overview: Show

Motivation: Circadian rhythms date back to the origins of life, are found in virtually every species and every cell, and play fundamental roles in functions ranging from metabolism to cognition. Modern high-throughput technologies allow the measurement of concentrations of transcripts, metabolites, and other species along the circadian cycle, creating novel computational challenges and opportunities, including the problems of inferring whether a given species oscillates in circadian fashion or not, and inferring the time at which a set of measurements was taken.

Results: We first curate several large synthetic and biological time series data sets containing labels for both periodic and aperiodic signals. We then use deep learning methods to develop and train BIO_CYCLE, a system to robustly estimate which signals are periodic in high-throughput circadian experiments, producing estimates of amplitudes, periods, phases, as well as several statistical significance measures. Using the curated data, BIO_CYCLE is compared to other approaches and shown to achieve state-of-the-art performance across multiple metrics. We then use deep learning methods to develop and train BIO_CLOCK to robustly estimate the time at which a particular single-time-point transcriptomic experiment was carried out. In most cases, BIO_CLOCK can reliably predict time, within approximately one hour, using the expression levels of only a small number of core clock genes.
BIO_CLOCK is shown to work reasonably well across tissue types, and often with only small degradation across conditions. BIO_CLOCK is used to annotate most mouse experiments found in the GEO database with an inferred time stamp.

Availability: All data and software are publicly available on the CircadiOmics web portal: circadiomics.igb.uci.edu/.

TP040 (PT) - phRAIDER: Pattern-Hunter Based Rapid Ab Initio Detection of Elementary Repeats
Date: Monday, July 11 10:30 am - 10:50 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Charlotte Schaeffer, Miami University, United States
  • Nathan Figueroa, Miami University, United States
  • Xiaolin Liu, Miami University (Ohio), United States
  • John Karro, Miami University (Ohio), United States

Area Session Chair: Yana Bromberg

Presentation Overview: Show

Motivation: Transposable Elements and repetitive DNA make up a sizable fraction of Eukaryotic genomes, and their annotation is crucial to the study of the structure, organization, and evolution of any newly sequenced genome. While RepeatMasker and nHMMER are useful for identifying these repeats, they require a pre-compiled repeat library, which is not always available. De novo tools such as Recon, RepeatScout, or RepeatGluer serve to identify TEs purely from sequence content, but are either limited by runtimes that prohibit whole-genome use or degrade in quality in the presence of substitutions that disrupt the sequence patterns.

Results: phRAIDER is a de novo transposable element tool that addresses the issue of runtime without sacrificing sensitivity, as compared to competing tools. The underlying model is a new definition of elementary repeats that incorporates the PatternHunter spaced seed model, allowing for greater sensitivity in the presence of genomic substitutions. As compared to the premier tool in the literature, RepeatScout, phRAIDER shows an average 10x speedup on any single human chromosome and has the ability to process the whole human genome in just over three hours. Here we present the tool, the theoretical model underlying the tool, and the results demonstrating its effectiveness.
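
The benefit of a spaced seed can be shown with a small index: only positions marked 1 in the seed mask must match, so two copies of a repeat that differ by a substitution at a 0 position still share a key. The sketch below illustrates the spaced seed idea only, not phRAIDER's elementary repeat model, and the short mask is hypothetical.

```python
def spaced_key(window, mask):
    """Keep only the bases at positions where the seed mask is '1'."""
    return "".join(base for base, flag in zip(window, mask) if flag == "1")


def index_spaced_seeds(sequence, mask):
    """Map each spaced-seed key to the start positions where it occurs."""
    index = {}
    for start in range(len(sequence) - len(mask) + 1):
        key = spaced_key(sequence[start:start + len(mask)], mask)
        index.setdefault(key, []).append(start)
    return index


# ACGT (position 0) and ACTT (position 6) differ only at the masked position.
seed_mask = "1101"
seed_index = index_spaced_seeds("ACGTAAACTT", seed_mask)
print(seed_index["ACT"])
```

A contiguous seed of the same length would place the two windows under different keys, so the diverged copy would be missed.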

Availability: phRAIDER is an open source tool available from https://github.com/karroje/phRAIDER.

TP041 (PT) - RCK: accurate and efficient inference of sequence- and structure-based protein-RNA binding models from RNAcompete data
Date: Monday, July 11 10:30 am - 10:50 am
Room: Northern Hemisphere A3/A4
Theme: DATA / GENES
  • Yaron Orenstein, MIT, United States
  • Yuhao Wang, MIT, United States
  • Bonnie Berger, MIT, United States

Area Session Chair: Bruno Gaeta

Presentation Overview: Show

Motivation: Protein-RNA interactions, which play vital roles in many processes, are mediated through both RNA sequence and structure. CLIP-based methods, which measure protein-RNA binding in vivo, suffer from experimental noise and systematic biases, whereas in vitro experiments capture a clearer signal of protein-RNA binding. Among them, RNAcompete provides binding affinities of a specific protein to more than 240,000 unstructured RNA probes in one experiment. The computational challenge is to infer RNA structure- and sequence-based binding models from these data. The state-of-the-art in sequence models, Deepbind, does not model structural preferences. RNAcontext models both sequence and structure preferences, but was outperformed by GraphProt. Unfortunately, GraphProt cannot detect structural preferences from RNAcompete data due to the unstructured nature of the data, as noted by its developers.
Results: We develop RCK, an efficient, scalable algorithm to infer sequence and structure preferences based on a new k-mer model. Remarkably, even though RNAcompete data is designed to be unstructured, RCK can still learn structural preferences from it. RCK significantly outperforms both RNAcontext and Deepbind in in vitro binding prediction for 244 RNAcompete experiments. Moreover, RCK is also faster and uses less memory, which enables scalability. While currently on par with existing methods in in vivo binding prediction on a small scale test, we demonstrate that RCK will increasingly benefit from experimentally measured RNA structure profiles as compared to computationally predicted ones. By running RCK on the entire RNAcompete dataset, we generate and provide as a resource a set of protein-RNA structure-based models on an unprecedented scale.
Availability: Software and models are freely available at http://groups.csail.mit.edu/cb/rck/.
Contact: bab@mit.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

TP042 (HT) - Core Regulatory Circuitry of the Plant Circadian System
Date: Monday, July 11 10:30 am - 10:50 am
Room: Northern Hemisphere E1/E2
Theme: SYSTEMS / GENES
  • Mathias Foo, University of Warwick, United Kingdom
  • David Somers, The Ohio State University, United States
  • Pan-Jun Kim, Asia Pacific Center for Theoretical Physics, Korea, Republic of

Area Session Chair: Nicola Mulder

Presentation Overview: Show

Sleep/wake cycles in animals exemplify daily biological rhythms driven by internal molecular clocks, circadian clocks, which are important for plant life as well. The plant circadian clock is much more complex than that of any other organism, eluding our understanding of its design principle. Based on the mechanistic modeling and simulation of Arabidopsis thaliana, we successfully identified a kernel of the plant circadian system, the critical gene regulatory circuitry for clock function. The kernel integrates four major negative feedback loops for molecular circadian oscillations. Strikingly, the kernel structure, as well as the whole clock circuitry, was found to be overwhelmingly composed of inhibitory, not activating, interactions among genes. This fact facilitates the global coordination of plant circadian molecular profiles to often exhibit sharply-shaped, cuspidate waveforms, which indicate clock events that are markedly peaked at very specific times of day. Our approach elucidates a design principle of biological clockwork, with implications for synthetic biology.

TP043 (HT) - DNA editing of LTR retrotransposons reveals the impact of APOBECs on vertebrate genomes
Date: Monday, July 11 10:50 am - 11:10 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Binyamin Knisbacher, The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Israel
  • Erez Levanon, The Mina & Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Israel

Area Session Chair: Yana Bromberg

Presentation Overview:

LTR retrotransposons are retrovirus-like entities widespread in vertebrate genomes. These replicating endogenous retroviruses (ERVs) must be restricted to prevent deleterious mutations and maintain genome integrity. The APOBEC DNA-editing enzymes can do so by inflicting C-to-U hypermutation in retrotransposon DNA during their mobilization. In some cases, hypermutated retrotransposons successfully integrate into the genome, introducing unique sequences, which increase retrotransposon diversity and the probability of developing new function at the loci of insertion. We developed a computational approach to identify such events, applied it to genomes of 123 diverse species and identified numerous DNA-edited sites in humans and various vertebrate lineages. Unexpectedly, DNA editing is exceptionally prevalent in some birds, including one of Darwin's finches. Edited ERVs are enriched in genic regions, thereby raising the probability of their exaptation for novel function. Our results show that DNA editing has a substantial role in vertebrate innate immunity and may accelerate genome evolution.

TP044 (HT) - Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning
Date: Monday, July 11 10:50 am - 11:10 am
Room: Northern Hemisphere A3/A4
Theme: DATA / PROTEINS
  • Hannes Bretschneider, University of Toronto, Canada
  • Brendan Frey, University of Toronto, Canada
  • Andrew Delong, Deep Genomics, Canada
  • Babak Alipanahi, University of Toronto, Canada

Area Session Chair: Bruno Gaeta

Presentation Overview:

Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with ‘deep learning’ techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a ‘mutation map’ that indicates how variations affect binding within a specific sequence.

TP045 (LBR) - A Framework for Integrating Co-expression Networks with GWAS to Prioritize Candidate Genes in Maize
Date: Monday, July 11 10:50 am - 11:10 am
Room: Northern Hemisphere E1/E2
Theme: SYSTEMS / GENES
  • Robert Schaefer, University of Minnesota, United States
  • Jean-Michel Michno, University of Minnesota, United States
  • Joseph Jeffers, University of Minnesota, United States
  • Owen Hoekenga, Independent Consultant, United States
  • Brian Dilkes, Purdue University, United States
  • Ivan Baxter, Donald Danforth Plant Science Center / USDA-ARS Plant Genetics Research Unit, United States
  • Chad Myers, University of Minnesota, United States

Area Session Chair: Nicola Mulder

Presentation Overview:

Genome-wide association studies (GWAS) have identified thousands of loci linked to hundreds of traits in many different species. However, in many cases, the causal genes and the cellular processes they contribute to remain unknown. This problem is even more pronounced in non-model species where functional annotations are sparse. To address these issues, we developed a computational framework called Camoco (Co-Analysis of Molecular Components) that systematically integrates loci identified by GWAS with gene co-expression networks to identify a focused set of putative causal genes that are coordinately regulated. We demonstrate the utility of our approach on new GWAS in maize, the world’s most produced staple crop. Using our approach, candidate SNPs associated with elemental accumulation in maize kernels were reduced by two orders of magnitude. Our study reveals the importance of gene expression data context, as only root tissue-specific co-expression networks based on gene expression signatures across genotypically diverse individuals were able to provide signal for interpreting GWAS candidate SNPs. Both the software tools we developed and the lessons on integrating GWAS data with co-expression networks generalize to other contexts.

TP046 (PT) - Read-Based Phasing of Related Individuals
Date: Monday, July 11 11:40 am - 12:00 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / SYSTEMS
  • Shilpa Garg, MPI-INF, Germany
  • Marcel Martin, Science for Life Laboratory, Sweden
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Area Session Chair: Yana Bromberg

Presentation Overview:

Motivation: Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information - reads and pedigree - has the potential to deliver results better than each individually.
Results: We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2x for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15x coverage per individual.

TP047 (HT) - Revisiting the computational analysis of DNase sequencing
Date: Monday, July 11 11:40 am - 12:00 pm
Room: Northern Hemisphere A3/A4
Theme: GENES
  • Ivan G. Costa, RWTH Aachen University, Germany
  • Eduardo Gadde Gusmao, RWTH Aachen University, Germany
  • Manuel Allhoff, RWTH Aachen University, Germany
  • Martin Zenke, RWTH Aachen University, Germany

Area Session Chair: Bruno Gaeta

Presentation Overview:

DNase-seq is a powerful technique for genome-wide detection of cell-specific binding sites. Computational footprinting methods, which search for footprint-like DNase I cleavage patterns on the DNA, allow the detection of binding sites at base-pair resolution. There is, however, a debate in the literature on the influence of experimental artifacts, such as DNase I cleavage bias and transcription factor residence time, on computational footprinting methods. We investigated these artifacts in a comprehensive panel of DNase-seq data sets, 10 footprinting methods and 88 transcription factors. Our comparative analysis indicates the advantage of HINT, DNase2TF and PIQ over other footprinting methods. We demonstrate that correcting the DNase-seq signal based on cleavage bias estimation significantly improves the accuracy of computational footprinting. We also propose a score to detect footprints arising from transcription factors with short residence time, as footprints of such factors have low predictive performance.

TP048 (PT) - Novel Applications of Multi-task Learning and Multiple Output Regression to Multiple Genetic Trait Prediction
Date: Monday, July 11 11:40 am - 12:00 pm
Room: Northern Hemisphere E1/E2
Theme: GENES / DATA
  • Dan He, IBM T.J. Watson, United States
  • Laxmi Parida, IBM T J Watson Research Center, United States

Area Session Chair: Nicola Mulder

Presentation Overview:

Given a set of biallelic molecular markers, such as SNPs, with genotype values encoded numerically on a collection of plant, animal or human samples, the goal of genetic trait prediction is to predict the quantitative trait values by simultaneously modeling all marker effects. Genetic trait prediction is usually formulated as a linear regression model. In many cases, multiple traits are observed for the same set of samples and markers, and some of these traits may be correlated with each other. Modeling all of the traits together may therefore improve the prediction accuracy. In this work, we view the multi-trait prediction problem from a machine learning angle: as either a multi-task learning problem or a multiple output regression problem, depending on whether different traits share the same genotype matrix or not. We then adapt multi-task learning algorithms and multiple output regression algorithms to solve the multi-trait prediction problem, and propose a few strategies to improve the least-squares error of the predictions from these algorithms. Our experiments show that modeling multiple traits together can improve the prediction accuracy for correlated traits.

TP049 (PT) - An Algorithm for Computing the Gene Tree Probability under the Multispecies Coalescent and its Application in the Inference of Population Tree
Date: Monday, July 11 12:00 pm - 12:20 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / SYSTEMS
  • Yufeng Wu, Computer Science and Engineering Department, University of Connecticut, United States

Area Session Chair: Yana Bromberg

Presentation Overview:

Motivation: A gene tree represents the evolutionary history of gene lineages that originate from multiple related populations. Under the multispecies coalescent model, lineages may coalesce outside the species (population) boundary. Given a species tree (with branch lengths), the gene tree probability is the probability of observing a specific gene tree topology under the multispecies coalescent model. There are two existing algorithms for computing the exact gene tree probability. The first algorithm is due to Degnan and Salter (2005), who enumerate all the so-called coalescent histories for the given species tree and the gene tree topology. Their algorithm runs in exponential time in the number of gene lineages in general. The second algorithm is the STELLS algorithm (2012), which is usually faster but also runs in exponential time in almost all cases.

Results: In this paper, we present a new algorithm, called CompactCH, for computing the exact gene tree probability. This new algorithm is based on the notion of compact coalescent histories: multiple coalescent histories are represented by a single compact coalescent history. The key advantage of our new algorithm is that it runs in polynomial time in the number of gene lineages if the number of populations is fixed to be a constant. The new algorithm is more efficient than the STELLS algorithm both in theory and in practice when the number of populations is small and there are multiple gene lineages from each population. As an application, we show that CompactCH can be applied in the inference of population trees (i.e. the population divergence history) from population haplotypes. Simulation results show that the CompactCH algorithm enables efficient and accurate inference of population trees with many more haplotypes than a previous approach.

Availability: The CompactCH algorithm is implemented in the STELLS software package, which is available for download at http://www.engr.uconn.edu/~ywu/STELLS.html.

Contact: ywu@engr.uconn.edu

TP050 (LBR) - The Role of Genome Accessibility in Transcription Factor Binding in Bacteria
Date: Monday, July 11 12:00 pm - 12:20 pm
Room: Northern Hemisphere A3/A4
Theme: GENES / PROTEINS
  • Antonio Gomes, Columbia University, United States
  • Harris Wang, Columbia University, United States

Area Session Chair: Bruno Gaeta

Presentation Overview:

ChIP-seq enables genome-scale identification of regulatory regions that govern gene expression. However, the biological insights generated from ChIP-seq analysis have been limited to predictions of binding sites and cooperative interactions. Furthermore, ChIP-seq data often poorly correlate with in vitro measurements or predicted motifs, highlighting that binding affinity alone is insufficient to explain transcription factor (TF)-binding in vivo. One possibility is that binding sites are not equally accessible across the genome. A more comprehensive biophysical representation of TF-binding is required to improve our ability to understand, predict, and alter gene expression. Here, we show that genome accessibility is a key parameter that impacts TF-binding in bacteria. We developed a thermodynamic model that parameterizes ChIP-seq coverage in terms of genome accessibility and binding affinity. The role of genome accessibility is validated using a large-scale ChIP-seq dataset of the M. tuberculosis regulatory network. We find that accounting for genome accessibility led to a model that explains 63% of the ChIP-seq profile variance, while a model based on motif score alone explains only 35% of the variance. Moreover, our framework enables de novo ChIP-seq peak prediction and is useful for inferring TF-binding peaks in new experimental conditions, reducing the need for additional experiments. We observe that the genome is more accessible in intergenic regions, and that increased accessibility is positively correlated with gene expression and anti-correlated with distance to the origin of replication. Our biophysical model provides a more comprehensive description of TF-binding in vivo from first principles towards a better representation of gene regulation in silico, with promising applications in systems biology.

TP051 (PT) - A Network-driven Approach for Genome-wide Association Mapping
Date: Monday, July 11 12:00 pm - 12:20 pm
Room: Northern Hemisphere E1/E2
Theme: GENES / DISEASE
  • Seunghak Lee, Carnegie Mellon University, United States
  • Soonho Kong, Carnegie Mellon University, United States
  • Eric Xing, Carnegie Mellon University, United States

Area Session Chair: Nicola Mulder

Presentation Overview:

Motivation:

It remains a challenge to detect associations between genotypes and phenotypes because of insufficient sample sizes and complex underlying mechanisms involved in associations. Fortunately, it is becoming more feasible to obtain gene expression data in addition to genotypes and phenotypes, giving us new opportunities to detect true genotype-phenotype associations while unveiling their association mechanisms.

Results:

In this paper, we propose a novel method, NETAM, that accurately detects associations between SNPs and phenotypes, as well as gene traits involved in such associations. We take a network-driven approach: NETAM first constructs an association network, where nodes represent SNPs, gene traits, or phenotypes, and edges represent the strength of association between two nodes. NETAM assigns a score to each path from an SNP to a phenotype, and then identifies significant paths based on the scores. In our simulation study, we show that, by taking advantage of gene expression data, NETAM finds significantly more phenotype-associated SNPs than traditional genotype-phenotype association analysis under false positive control. Furthermore, we applied NETAM to late-onset Alzheimer's disease data and identified 477 significant path associations, among which we analyzed paths related to beta-amyloid, estrogen, and nicotine pathways. We also provide hypothetical biological pathways to explain our findings.
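
The network-driven idea described above can be illustrated with a small sketch. The abstract does not state how NETAM scores a path or decides significance, so the product-of-edge-weights score, the fixed threshold, and all node names below are assumptions made purely for illustration.

```python
# Toy sketch of path scoring in an association network.
# ASSUMPTIONS (not from the abstract): a path score is the product of its
# edge weights, and "significant" means score >= a fixed threshold.
# All SNP, gene-trait and phenotype names are hypothetical.

def score_paths(snp_gene, gene_pheno):
    """Score every SNP -> gene trait -> phenotype path."""
    scores = {}
    for (snp, gene), w_sg in snp_gene.items():
        for (gene2, pheno), w_gp in gene_pheno.items():
            if gene2 == gene:  # the two edges share the gene-trait node
                scores[(snp, gene, pheno)] = w_sg * w_gp
    return scores

def significant_paths(scores, threshold):
    """Return paths whose score reaches the threshold, best first."""
    kept = [path for path, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda path: -scores[path])

# Edge weights stand in for association strengths between two nodes.
snp_gene = {("snp1", "geneA"): 0.9, ("snp2", "geneA"): 0.2, ("snp2", "geneB"): 0.8}
gene_pheno = {("geneA", "phenotype"): 0.7, ("geneB", "phenotype"): 0.1}

paths = significant_paths(score_paths(snp_gene, gene_pheno), threshold=0.5)
```

Here only the path through `snp1` and `geneA` survives, because both of its edges are strong; a SNP with a strong link to a gene trait that is itself weakly linked to the phenotype is filtered out.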

TP052 (HT) - Deciphering evolutionary strata on plant sex chromosomes and fungal mating-type chromosomes through compositional segmentation
Date: Monday, July 11 12:20 pm - 12:40 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / SYSTEMS
  • Rajeev Azad, University of North Texas, United States
  • Ravi Shanker Pandey, University of North Texas, United States

Area Session Chair: Yana Bromberg

Presentation Overview:

Abstract:
Sex chromosomes have evolved from a pair of homologous autosomes which differentiated into sex determination systems, such as XY or ZW systems, as a consequence of successive recombination suppression between gametologous chromosomes. To identify regions of recombination suppression, the “evolutionary strata”, even when only the sequence of the sex chromosome in the homogametic sex (i.e. X or Z chromosome) is available, we have developed an integrated segmentation and clustering method. In order to understand the early evolution of sex chromosomes, we applied our method to recently evolved plant sex chromosomes. Our method could decipher all known evolutionary strata on the papaya and Silene latifolia X chromosomes, and deciphered two, yet unknown, evolutionary strata on an incipient sex chromosome of Populus trichocarpa. Application to sex chromosome V of the brown alga Ectocarpus sp. recovered sex-determining and pseudoautosomal regions, and application to mating-type chromosomes of the anther-smut fungus Microbotryum lychnidis-dioicae uncovered five new strata.

Justification:
Evolution of sex chromosomes is relatively better studied in animals and birds than in plants, although 48 dioecious plants have already been reported. A key aspect in understanding sex chromosome evolution is to decipher the successive regions of recombination suppression between the gametologous sex chromosomes. However, until now, only two plants, Silene latifolia and papaya, have been examined for the recombination-suppressed regions, namely, the evolutionary strata, on their X chromosomes. This was made possible by sequencing of sex-linked genes on both X and Y chromosomes, which is a requirement of all current methods that determine strata structure based on comparison of gametologous sex chromosomes. To circumvent this limitation and detect strata even in the absence of Y chromosome sequence, we have developed an integrated segmentation and clustering method, which could recapitulate the previously identified strata on the Silene latifolia and papaya X chromosomes without X-Y comparison, and deciphered two, yet unknown, strata on an incipient sex chromosome of Populus trichocarpa.

The emergence and evolution of sex chromosomes in many plants are much more recent than the mammalian sex chromosome histories, and therefore, our approach provides a much-needed tool for understanding early evolution of sex chromosomes using dioecious plants as model systems. The paucity of heterogametic sex chromosome sequence (Y or W sequence) makes our approach even more relevant, and perhaps the only available tool, for understanding sex chromosome evolution without being constrained by the unavailability of Y or W sequence, or by the loss of Y-linked or W-linked genes.

TP053 (HT) - Predicting effects of noncoding variants with deep learning-based sequence model
Date: Monday, July 11 12:20 pm - 12:40 pm
Room: Northern Hemisphere A3/A4
Theme: GENES / DATA
  • Jian Zhou, Princeton University, United States
  • Olga Troyanskaya, Princeton University, United States

Area Session Chair: Bruno Gaeta

Presentation Overview:

Identifying functional effects of noncoding variants is a major challenge in human genetics. To predict the noncoding-variant effects de novo from sequence, we developed a deep learning-based algorithmic framework, DeepSEA (http://deepsea.princeton.edu/), that directly learns a regulatory sequence code from large-scale chromatin-profiling data, enabling prediction of chromatin effects of sequence alterations with single-nucleotide sensitivity. We further used this capability to improve prioritization of functional variants including expression quantitative trait loci (eQTLs) and disease-associated variants.

TP054 (HT) - Integrative genomics analyses unveil downstream biological effectors of disease-specific polymorphisms buried in intergenic regions
Date: Monday, July 11 12:20 pm - 12:40 pm
Room: Northern Hemisphere E1/E2
Theme: GENES / DISEASE
  • Haiquan Li, University of Arizona, United States
  • Ikbel Achour, University of Arizona Center for Biomedical Informatics and Biostatistics, United States
  • Lisa Bastarache, Vanderbilt University, United States
  • Joanne Berghout, The University of Arizona, United States
  • Vincent Gardeux, The University of Illinois at Chicago, France
  • Jianrong Li, University of Arizona, United States
  • Younghee Lee, University of Utah, United States
  • Lorenzo Pesce, The University of Chicago, United States
  • Xinan Yang, The University of Chicago, United States
  • Kenneth Ramos, The University of Arizona, United States
  • Ian Foster, Argonne National Laboratory & The University of Chicago, United States
  • Joshua Denny, Vanderbilt University, United States
  • Jason Moore, University of Pennsylvania, United States
  • Yves Lussier, The University of Arizona, United States

Area Session Chair: Nicola Mulder

Presentation Overview:

Altered biological mechanisms arising from disease-associated polymorphisms remain difficult to characterize when those variants are intergenic. We developed a computational method that identifies shared downstream mechanisms by which inter- and intragenic SNPs contribute to a specific physiopathology. Modelling 2,000,000 pairs of disease-associated SNPs (GWAS) with eQTL and Gene Ontology functional annotations, we predicted 3,870 inter-intra and inter-intra SNP-pairs with convergent biological mechanisms (FDR<0.05). These SNP-pairs with overlapping mRNA targets or similar functional annotations were more associated with the same disease than unrelated pathologies (OR>12). We independently confirmed synergistic and antagonistic genetic interactions for prioritized SNP-pairs of Alzheimer’s (p=0.046), cancer (p=0.039), and rheumatoid arthritis (p<10^-4). Using ENCODE, we validated that the biological mechanisms shared within prioritized SNP-pairs are frequently governed by matching transcription factor binding sites and long-range chromatin interactions. These results provide a “roadmap” of disease mechanisms emerging from GWAS and further identify downstream candidate therapeutic targets of intergenic SNPs.

TP055 (PT) - DeepMeSH: Deep Semantic Representation for Improving Large-scale MeSH Indexing
Date: Monday, July 11 2:00 pm - 2:20 pm
Room: Northern Hemisphere BCD
Theme: DATA
  • Shengwen Peng, Fudan University, China
  • Ronghui You, Fudan University, China
  • Hongning Wang, Department of Computer Science at University of Virginia, United States
  • Chengxiang Zhai, UIUC, United States
  • Hiroshi Mamitsuka, Kyoto University, Japan
  • Shanfeng Zhu, Fudan University, China

Area Session Chair: Russell Schwartz

Presentation Overview:

Motivation: Medical Subject Headings (MeSH) indexing, the task of assigning a set of MeSH main headings to citations, is crucial for many important tasks in biomedical text mining and information retrieval. Large-scale MeSH indexing has two challenging aspects: the citation side and the MeSH side. For the citation side, all existing methods, including Medical Text Indexer (MTI) by NLM (National Library of Medicine) and the state-of-the-art method, MeSHLabeler, deal with text by bag-of-words, which cannot capture semantic and context-dependent information well.

Methods: We propose DeepMeSH, which incorporates deep semantic information for large-scale MeSH indexing. It addresses the challenges on both the citation side and the MeSH side. The citation-side challenge is solved by a new deep semantic representation, D2V-TFIDF, which concatenates both sparse and dense semantic representations. The MeSH-side challenge is solved by using the "learning to rank" framework of MeSHLabeler, which integrates various types of evidence generated from the new semantic representation.
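
As a rough illustration of what concatenating sparse and dense representations can look like, the sketch below builds a TF-IDF vector for one citation and appends a dense document vector to it. The dense vector is a hand-written placeholder standing in for the output of a document-embedding model, and the unit-length normalization of each part before concatenation is an assumption, as are the vocabulary and counts.

```python
import math

# Toy sketch of a combined sparse + dense document representation.
# ASSUMPTIONS: each part is scaled to unit length before concatenation,
# the dense vector is a placeholder for a learned document embedding,
# and the vocabulary and document frequencies are invented.

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def tfidf_vector(doc_tokens, vocabulary, doc_freq, n_docs):
    """Sparse bag-of-words part: raw term frequency times log inverse document frequency."""
    vec = []
    for term in vocabulary:
        tf = doc_tokens.count(term)
        idf = math.log(n_docs / doc_freq[term]) if doc_freq.get(term) else 0.0
        vec.append(tf * idf)
    return vec

def concat_representation(sparse_vec, dense_vec):
    """Concatenate the normalized sparse and dense parts into one feature vector."""
    return l2_normalize(sparse_vec) + l2_normalize(dense_vec)

vocabulary = ["gene", "protein", "cell"]
doc_freq = {"gene": 2, "protein": 1, "cell": 4}   # documents containing each term
tokens = ["gene", "gene", "cell"]                 # one tokenized citation
dense = [3.0, 4.0]                                # placeholder embedding

combined = concat_representation(tfidf_vector(tokens, vocabulary, doc_freq, n_docs=4), dense)
```

Normalizing each part first keeps either representation from dominating purely because of its scale; whether DeepMeSH does exactly this is not stated in the abstract.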

Results: DeepMeSH achieved a Micro F-measure of 0.6323, 2% higher than 0.6218 of MeSHLabeler and 12% higher than 0.5637 of MTI, for BioASQ3 challenge data with 6,000 citations.

TP056 (HT) - Alignment-free scaffolding of large genome drafts using long sequences and jumping library MPET reads
Date: Monday, July 11 2:00 pm - 2:20 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Rene Warren, BC Cancer Agency, Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer Agency, Genome Sciences Centre, Canada
  • Sarah Yeo, BC Cancer Agency, Genome Sciences Centre, Canada
  • Chen Yang, BC Cancer Agency, Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Agency, Genome Sciences Centre, Canada
  • Austin Hammond, BC Cancer Agency, Genome Sciences Centre, Canada
  • Hamid Mohamadi, BC Cancer Agency, Genome Sciences Centre, Canada
  • Ben Vandervalk, BC Cancer Agency, Genome Sciences Centre, Canada
  • Erdi Kucuk, BC Cancer Agency, Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Agency, Genome Sciences Centre, Canada

Area Session Chair: Pedja Radivojac

Presentation Overview:

Over the past months, single-molecule long reads from established and emerging technologies have proven valuable for assembling complete bacterial draft genomes and for helping to track viral outbreaks. At the moment, the use of those technologies on their own is still often too costly for de novo assembly of mammalian-size genomes. Last year, we demonstrated that despite the lower base accuracy associated with long-read sequencing platforms, long reads are indisputably effective for scaffolding small and large high-quality draft genomes: scaffolding increases the contiguity and completeness of low-cost assemblies, and thereby reduces the complexity of genome drafts. During the course of the year, a new read-linking technology from 10X Genomics has emerged, and holds promise for genome scaffolding. We will present advances in scaffolding and genome finishing, describing further developments to the LINKS scaffolder and how we applied these technologies to the large genomes of American bullfrog and spruce.

We submit the enclosed manuscript entitled “LINKS: Scalable, alignment-free scaffolding of draft genomes with long reads” for consideration as a presentation for the highlights track of ISMB.
Long sequence reads are of prime importance to genome assembly, which is in turn a cornerstone of genome characterization. Although long reads from existing and upcoming technologies still have some way to go before being used routinely in de novo genome assembly projects, their utility for scaffolding existing good-quality assemblies is paramount. The scaffolding problem has been explored by many, including our group, but has only recently been applied to emerging long DNA sequence reads from Oxford Nanopore Technologies (ONT) Ltd.
In our presentation we discuss an effective and elegant method for genome scaffolding with long and imperfect sequences that uses linked k-mers at set distance intervals. We present new developments since publication, including native scaffolding with jumping library (MPET) reads and the use of an improved Bloom filter to exclude erroneous k-mer pairs. We demonstrate that even low-accuracy sequence data has tremendous potential for increasing genome assembly contiguity without the need for error correction or pre-processing, and show how our alignment-free solution scales up to large eukaryotic genomes.
We anticipate that this timely work will be of broad interest to ISMB attendees as the uptake of genomics in research labs and in the clinic increases with the affordability of DNA sequencing. We expect LINKS to have utility in helping assemble large genomes, as we enter the era of long DNA sequence reads.
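
To make the phrase "linked k-mers at set distance intervals" concrete, the sketch below extracts k-mer pairs from a single long read. It is only a schematic of the pairing step: the function name, the way the distance is measured (between the start positions of the two k-mers), and the parameter values are assumptions, and the real LINKS tool may define and filter pairs differently.

```python
# Toy sketch of extracting linked k-mer pairs from a long read.
# ASSUMPTIONS: `distance` is measured between the start positions of the two
# k-mers and `step` is the sliding interval; neither is taken from LINKS itself.

def kmer_pairs(read, k, distance, step):
    """Return (first_kmer, second_kmer) pairs whose starts are `distance` bases apart."""
    pairs = []
    last_start = len(read) - distance - k  # last position where both k-mers still fit
    for i in range(0, last_start + 1, step):
        first = read[i:i + k]
        second = read[i + distance:i + distance + k]
        pairs.append((first, second))
    return pairs

pairs = kmer_pairs("ACGTACGTAC", k=3, distance=5, step=2)
```

For this ten-base toy read the call yields two pairs, ("ACG", "CGT") and ("GTA", "TAC"). In a scaffolder, pairs whose two k-mers map to different contigs would then serve as evidence that those contigs lie roughly `distance` bases apart, without any read alignment.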

TP057 (PT) - A Cross-Species Bi-Clustering Approach to Identifying Conserved Co-regulated Genes
Date: Monday, July 11 2:00 pm - 2:20 pm
Room: Northern Hemisphere A3/A4
Theme: GENES / SYSTEMS
  • Jiangwen Sun, University of Connecticut, United States
  • Zongliang Jiang, University of Connecticut, United States
  • X Cindy Tian, University of Connecticut, United States
  • Jinbo Bi, University of Connecticut, United States

Area Session Chair: Reinhard Schneider

Presentation Overview:

Motivation: A growing number of studies have explored the process of pre-implantation embryonic development of multiple mammalian species. However, the conservation and variation among different species in their developmental programming are poorly defined due to the lack of effective computational methods for detecting co-regulated genes that are conserved across species. The most sophisticated method to date for identifying conserved co-regulated genes is a two-step approach. This approach first identifies gene clusters for each species by a cluster analysis of gene expression data, and subsequently computes the overlaps of clusters identified from different species to reveal common subgroups. This approach is ineffective at dealing with the noise in the expression data introduced by the complicated procedures used to quantify gene expression. Furthermore, due to the sequential nature of the approach, the gene clusters identified in the first step may have little overlap among different species in the second step, making it difficult to detect conserved co-regulated genes.

Results: We propose a cross-species bi-clustering approach which first denoises the gene expression data of each species into a data matrix. The rows of the data matrices of different species represent the same set of genes, which are characterized by their expression patterns over the developmental stages of each species as columns. A novel bi-clustering method is then developed to cluster genes into subgroups by a joint sparse rank-one factorization of all the data matrices. This method decomposes a data matrix into a product of a column vector and a row vector, where the column vector is a consistent indicator across the matrices (species) to identify the same gene cluster and the row vector specifies for each species the developmental stages in which the clustered genes are co-regulated. An efficient optimization algorithm has been developed with convergence analysis. This approach was first validated on synthetic data and compared to the two-step method and several recent joint clustering methods. We then applied this approach to two real-world datasets of gene expression during the pre-implantation embryonic development of human and mouse. Co-regulated genes consistent between human and mouse were identified, offering insights into conserved functions, as well as similarities and differences in genome activation timing between human and mouse embryos.
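
The structure of the factorization described above, one shared gene-indicator column vector and one stage row vector per species, can be illustrated with a few lines of code. This sketch only shows the decomposition and how its fit can be measured; it does not implement the authors' sparse optimization, and the vectors and matrices are invented.

```python
# Toy sketch of a joint rank-one structure across two species.
# ASSUMPTIONS: binary gene indicator, made-up stage weights and data; this
# shows the shape of the model, not the optimization that finds u and v.

def outer(u, v):
    """Rank-one matrix formed as column vector u times row vector v."""
    return [[ui * vj for vj in v] for ui in u]

def squared_error(matrix, approx):
    """Sum of squared differences between a data matrix and its approximation."""
    return sum((a - b) ** 2
               for row_m, row_a in zip(matrix, approx)
               for a, b in zip(row_m, row_a))

u = [1, 1, 0]                 # shared across species: genes 1 and 2 form the cluster
v_human = [2.0, 0.0, 1.0]     # human: three developmental stages
v_mouse = [0.0, 3.0]          # mouse: two developmental stages

x_human = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
x_mouse = [[0.0, 3.0], [0.0, 3.0], [0.5, 0.0]]

err_human = squared_error(x_human, outer(u, v_human))
err_mouse = squared_error(x_mouse, outer(u, v_mouse))
```

Because `u` is shared, the same genes are selected in both species, while each species keeps its own stage profile; the nonzero mouse error comes entirely from the third gene, which lies outside the cluster.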

Availability: The R package containing the implementation of the proposed method in C++ is available at: https://github.com/JavonSun/mvbc.git and also at the R platform https://www.r-project.org/.

TP058 (LBR) - Candidate gene prioritization with Endeavour
Date: Monday, July 11 2:00 pm - 2:20 pm
Room: Northern Hemisphere E1/E2
Theme: DISEASE / DATA
  • Léon-Charles Tranchevent, Laboratoire de Biologie et de Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, Université de Lyon, France
  • Amin Ardeshirdavani, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Sarah Elshal, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Daniel Alcaide, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Jan Aerts, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium
  • Didier Auboeuf, Laboratoire de Biologie et de Modélisation de la Cellule, Ecole Normale Supérieure de Lyon, Université de Lyon, France
  • Yves Moreau, KU Leuven ESAT - STADIUS, Stadius Centre for Dynamical Systems, Signal Processing and Data Analytics, Belgium

Area Session Chair: Judith Blake

Presentation Overview: Show

Genomic studies and high-throughput experiments often produce large lists of candidate genes among which only a few are truly relevant to the disease, phenotype, or biological process of interest. Gene prioritization tackles this problem by profiling candidate genes across multiple genomic data sources and integrating this heterogeneous information into a global ranking. We describe an extended version of our gene prioritization method, Endeavour, now available for 6 species and integrating 75 data sources. Validation of our results indicates that this extended version of Endeavour efficiently prioritizes candidate genes. The Endeavour web server is freely available at https://endeavour.esat.kuleuven.be/
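
The idea of fusing per-source rankings into one global ranking can be sketched as follows. This is a minimal illustration that averages rank ratios; it is not Endeavour's actual fusion procedure, and the function name `fuse_rankings`, the source names, and the gene labels are all hypothetical.

```python
def fuse_rankings(rankings):
    """Fuse per-source rankings into one global ranking.

    rankings maps a data-source name to a list of genes ordered from
    most to least promising. Each gene receives the mean of its rank
    ratios (position / list length) over the sources that rank it;
    a lower mean is better. Ties are broken alphabetically.
    """
    ratios = {}
    for ordered in rankings.values():
        n = len(ordered)
        for position, gene in enumerate(ordered, start=1):
            ratios.setdefault(gene, []).append(position / n)
    mean_ratio = {gene: sum(r) / len(r) for gene, r in ratios.items()}
    return sorted(mean_ratio, key=lambda gene: (mean_ratio[gene], gene))


# Toy input: three hypothetical data sources ranking three candidates.
toy = {
    "expression": ["G1", "G2", "G3"],
    "literature": ["G2", "G1", "G3"],
    "interactions": ["G2", "G3", "G1"],
}
global_ranking = fuse_rankings(toy)
```

Averaging rank ratios rather than raw scores keeps sources with different score scales comparable, which is the main reason rank-based fusion is attractive for heterogeneous data.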

TP059 (HT) - Translation of Genotype to Phenotype by a Hierarchy of Cell Subsystems
Date: Monday, July 11 2:20 pm - 2:40 pm
Room: Northern Hemisphere BCD
Theme: DATA / SYSTEMS
  • Michael Ku Yu, UCSD, United States
  • Michael Kramer, UCSD, United States
  • Janusz Dutkowski, UCSD, Data4Cure, United States
  • Rohith Srivas, UCSD, Stanford University, United States
  • Katherine Licon, UCSD, United States
  • Jason F. Kreisberg, UCSD, United States
  • Cherie Ng, aTyr Pharmaceuticals, United States
  • Nevan Krogan, UCSF, United States
  • Roded Sharan, Tel Aviv University, Israel
  • Trey Ideker, UCSD, United States

Area Session Chair: Russell Schwartz

Presentation Overview: Show

Accurately translating genotype to phenotype requires accounting for the functional impact of genetic variation at many biological scales. Here, we present a strategy for genotype-phenotype reasoning based on existing knowledge of cellular subsystems. These subsystems and their hierarchical organization are defined by the Gene Ontology or a complementary ontology inferred directly from previously published datasets. Guided by the ontology’s hierarchical structure, we organize genotype data into an “ontotype,” that is, a hierarchy of perturbations representing the effects of genetic variation at multiple cellular scales. The ontotype is then interpreted using logical rules generated by machine learning to predict phenotype. This approach substantially outperforms previous non-hierarchical methods for translating yeast genotype to cell growth phenotype, and it accurately predicts the growth outcomes of two new screens of 2,503 double gene knockouts affecting DNA repair or nuclear lumen. Ontotypes also generalize to larger knockout combinations, setting the stage for interpreting the complex genetics of disease.

TP060 (PT) - Genome assembly from synthetic long read clouds
Date: Monday, July 11 2:20 pm - 2:40 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Volodymyr Kuleshov, Stanford University, United States
  • Michael Snyder, Stanford University, United States
  • Serafim Batzoglou, Stanford University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview: Show

Motivation: Despite rapid progress in sequencing technology, assembling de-novo the genomes of new species as well as reconstructing complex metagenomes remain major technological challenges. New synthetic long read (SLR) technologies promise significant advances towards these goals; however, their applicability is limited by high sequencing requirements and the inability of current assembly paradigms to cope with combinations of short and long reads.
Results: Here, we introduce Architect, a new de-novo scaffolder aimed at synthetic long read technologies. Unlike previous assembly strategies, Architect does not require a costly subassembly step; instead it assembles genomes directly from the SLR’s underlying short reads, which we refer to as read clouds. This enables a 4- to 20-fold reduction in sequencing requirements and a five-fold increase in assembly contiguity on both genomic and metagenomic datasets relative to state-of-the-art assembly strategies aimed directly at fully-subassembled long reads.

TP061 (PT) - Structure-Based Prediction of Transcription Factor Binding Specificity using an Integrative Energy Function
Date: Monday, July 11 2:20 pm - 2:40 pm
Room: Northern Hemisphere A3/A4
Theme: PROTEINS
  • Alvin Farrel, University of North Carolina at Charlotte, United States
  • Jonathan Murphy, University of North Carolina at Charlotte, United States
  • Jun-Tao Guo, University of North Carolina at Charlotte, United States

Area Session Chair: Reinhard Schneider

Presentation Overview: Show

Transcription factors (TFs) regulate gene expression through binding to specific target DNA sites. Accurate annotation of transcription factor binding sites (TFBSs) at genome scale represents an essential step toward our understanding of gene regulation networks. In this paper, we present a structure-based method for computational prediction of TFBSs using a novel, integrative energy function. The new energy function combines a multibody knowledge-based potential and two atomic energy terms (hydrogen bond and π-interaction) that might not be accurately captured by the knowledge-based potential due to the mean force nature and low count problem. We applied the new energy function to the TFBS prediction using a non-redundant dataset that consists of transcription factors from 12 different families. Our results show that the new integrative energy function improves the prediction accuracy over the knowledge-based, statistical potentials, especially for homeodomain transcription factors, the second largest TF family in mammals.

TP062 (LBR) - Furthering understanding of human diseases through integrative cross-species analysis
Date: Monday, July 11 2:20 pm - 2:40 pm
Room: Northern Hemisphere E1/E2
Theme: DISEASE / DATA
  • Victoria Yao, Princeton University, United States
  • Rachel Kaletsky, Princeton University, United States
  • Coleen Murphy, Princeton University, United States
  • Olga Troyanskaya, Princeton University, United States

Area Session Chair: Judith Blake

Presentation Overview: Show

The etiology of complex human diseases is challenging to study, as they are likely a combination of many environmental and genetic factors. Elucidating the molecular basis of pathophysiologies of such diseases requires a combination of systems-level analyses in human and experimental investigations in model organisms. To fully leverage model systems to study human disease, we propose a framework that can combine human quantitative genetics results and computational models of model organism tissue biology to drive experimental screens for disruption of disease-relevant processes and identify candidate disease genes. Specifically, we develop a novel semi-supervised regularized Bayesian integration method to integrate a large compendium of heterogeneous datasets, primarily composed of publicly available expression datasets in model organism C. elegans. Using this method, we construct 203 tissue- and cell-type specific networks, and we demonstrate the accuracy of these networks in capturing tissue-specific functional signal, even for very small tissues and specific cell types. Combining these model organism functional maps with human quantitative genetics signal, we make disease gene predictions for 10 different diseases based on GWAS studies. Focusing on Parkinson’s disease, we further experimentally screen 45 of the top Parkinson's disease predictions for age-related motility defects. Analysis of 13,255 worms across 1,823 videos identifies significant age-related Parkinson's endophenotypes. Genes that correspond to strong phenotypes are prime candidates for further inquiry in human and could eventually be pursued as potential therapeutic targets.

TP063 (PT) - Jumping across biomedical contexts using compressive data fusion
Date: Monday, July 11 2:40 pm - 3:00 pm
Room: Northern Hemisphere BCD
Theme: DATA / DISEASE
  • Marinka Zitnik, Stanford University, United States
  • Blaz Zupan, University of Ljubljana, Slovenia

Area Session Chair: Russell Schwartz

Presentation Overview: Show

Motivation:
The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene-disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling.

Results:
We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous data sets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene-disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics.

TP064 (LBR) - Multi-Genome Scaffold Co-Assembly Based on the Analysis of Gene Orders and Genomic Repeats
Date: Monday, July 11 2:40 pm - 3:00 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Sergey Aganezov, Computational Biology Institute & Department of Mathematics, The George Washington University, United States
  • Max Alekseyev, George Washington University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview: Show

Advances in the DNA sequencing technology over the past decades have increased the volume of raw sequenced genomic data available for further assembly and analysis. While there exist many software tools for assembly of sequenced genomic material, they often experience difficulties with reconstructing complete chromosomes. Major obstacles include uneven read coverage and presence of long similar DNA subsequences (repeats). Genome assemblers therefore often are able to reliably reconstruct only long fragments, called scaffolds. We present a method for simultaneous co-assembly of all fragmented genomes (represented as collections of scaffolds rather than chromosomes) in a given set of annotated genomes. The method is based on the analysis of gene orders and relies on the evolutionary model, which includes genome rearrangements as well as gene insertions and deletions. It can also utilize information about genomic repeats and the phylogenetic tree of the given genomes, further improving their assembly quality.

TP065 (LBR) - Most of the tight positional conservation of transcription factor binding sites near the transcription start site is due to their co-localization within regulatory modules
    Cancelled
Date: Monday, July 11 2:40 pm - 3:00 pm
Room: Northern Hemisphere A3/A4
Theme: GENES / PROTEINS
  • Natalia Acevedo-Luna, Iowa State University, United States
  • Leonardo Mariño-Ramírez, NIH, United States
  • Armand Halbert, NIH, United States
  • Ulla Hansen, Boston University, United States
  • David Landsman, NIH, United States
  • John Spouge, NIH, United States

Area Session Chair: Reinhard Schneider

Presentation Overview: Show

Transcription factors (TFs) form complexes that bind regulatory modules (RMs) within DNA. Consider a “Subunit Hypothesis”: sometimes, different TF complexes contain inexact copies of a subunit that coordinates the regulation of specific genes. Then, within the RMs for the genes, transcription factor binding sites should display tightly consistent positions relative to each other, and possibly, consistent positions relative to the transcription start site (TSS), too. Our statistics found 43 significant sets of TF motifs with positional preferences relative to the TSS, with 38 preferences tight (±5 bp). Each set of motifs corresponded to a “gene group” of 135 to 3304 genes, some groups independently validated with FDR < 10^-4. The Subunit Hypothesis also implies that motifs corresponding to two TFs in a subunit should co-occur more than by chance alone, “enriching” the intersection of the gene groups corresponding to the two TFs. Of the 43 significant gene groups, we found 779 pairs of gene groups with significantly enriched intersections, many independently validated. A user-friendly web site at http://go.usa.gov/3kjsH permits experimental biologists to explore the interaction network of our TFBSs to identify candidate subunit RMs. Gene duplication and convergent evolution within a genome provide obvious biological mechanisms for replicating an RM that binds a particular TF subunit.

TP066 (HT) - SynLethDB: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets
Date: Monday, July 11 2:40 pm - 3:00 pm
Room: Northern Hemisphere E1/E2
Theme: DISEASE / SYSTEMS
  • Jing Guo, School of Computer Engineering, Nanyang Technological University, Singapore
  • Hui Liu, Changzhou University, China
  • Jie Zheng, School of Computer Engineering, Nanyang Technological University, Singapore

Area Session Chair: Judith Blake

Presentation Overview: Show

250-word Scientific Justification

Synthetic lethality (SL) is a type of genetic interaction between two genes such that simultaneous perturbations of the two genes result in cell death, while a perturbation of either gene alone is not lethal. Hence, the inhibition of SL partners of genes with cancer-specific mutations could selectively kill cancer cells but spare normal cells. Therefore, SL is emerging as a promising anticancer strategy that could potentially overcome the drawbacks of traditional chemotherapies by reducing severe side effects. However, there has not been a comprehensive database dedicated to collecting SL pairs and related knowledge. In this paper, we propose a comprehensive database, SynLethDB (http://histone.sce.ntu.edu.sg/SynLethDB/), which contains SL pairs collected from biochemical assays, computational predictions and text mining results for human and four model species, i.e. mouse, fruit fly, worm and yeast. For each SL pair, a confidence score was calculated by integrating individual scores derived from different evidence sources. We also developed a statistical analysis module to estimate the sensitivity of cancer cells to drugs targeting human SL partners, based on large-scale genomics data, gene expression profiles and drug sensitivity profiles on more than 1000 cancer cell lines. To help users access, mine and interpret the wealth of data, we implemented functionalities such as search and filtering, orthology search and gene set enrichment analysis, as well as a user-friendly web interface. SynLethDB would be a useful resource for the biomedical research community and the pharmaceutical industry.



150-word Presentation Description

Synthetic lethality (SL) is a type of genetic interaction between two genes such that simultaneous perturbations of the two genes result in cell death, while a perturbation of either gene alone is not lethal. Hence, the inhibition of SL partners of genes with cancer-specific mutations could selectively kill cancer cells but spare normal cells. Therefore, SL is an emerging anticancer strategy that could potentially overcome the drawbacks of traditional chemotherapies by reducing severe side effects. However, there has not been a comprehensive database dedicated to collecting SL pairs and related knowledge. In this talk, I will present the SynLethDB database (http://histone.sce.ntu.edu.sg/SynLethDB/), which contains SL pairs collected from chemical assays and computational predictions on human and model species. I will introduce the computational problem of SL prediction, with SynLethDB as benchmark data. Biologists can use the knowledge and data resources to guide wet-lab screenings of SL using the newest technologies (e.g. CRISPR-Cas9).



Source of Original Publication:
Jing Guo, Hui Liu, Jie Zheng. SynLethDB: synthetic lethality database toward discovery of selective and sensitive anticancer drug targets. Nucleic Acids Research, 44 (D1): D1011 – D1017, 2016 (Impact Factor = 9.112).

TP067 (HT) - CellCODE: a robust latent variable approach to differential expression analysis for heterogeneous cell populations
Date: Monday, July 11 3:30 pm - 3:50 pm
Room: Northern Hemisphere BCD
Theme: DATA / DISEASE
  • Maria Chikina, University of Pittsburgh, United States
  • Stuart Sealfon, Icahn School of Medicine at Mount Sinai, United States
  • Elena Zaslavsky, Icahn School of Medicine at Mount Sinai, United States

Area Session Chair: Russell Schwartz

Presentation Overview: Show

Identifying alterations in gene expression associated with different clinical states is important for the study of human biology. However, clinical samples used in gene expression studies are often derived from heterogeneous mixtures with variable cell-type composition, complicating statistical analysis.

Considerable effort has been devoted to modeling sample heterogeneity, and presently there are many methods that can estimate cell proportions or pure cell-type expression from mixture data. However, there is no method that comprehensively addresses mixture analysis in the context of differential expression without relying on additional proportion information, which can be inaccurate and is frequently unavailable.

In this study we consider a clinically relevant situation where neither accurate proportion estimates nor pure cell expression is of direct interest, but where we are rather interested in detecting and interpreting relevant differential expression in mixture samples. We develop a method, cell-type COmputational Differential Estimation (CellCODE), that addresses the specific statistical question directly, without requiring a physical model for mixture components. Our approach is based on latent variable analysis and is computationally transparent, requires no additional experimental data, yet outperforms existing methods that use independent proportion measurements. CellCODE has few parameters that are robust and easy to interpret. The method can be used to track changes in proportion, improve power to detect differential expression and assign the differentially expressed genes to the correct cell-type.

TP068 (PT) - deBWT: parallel construction of Burrows-Wheeler Transform for large collection of genomes with de Bruijn-branch encoding
Date: Monday, July 11 3:30 pm - 3:50 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / DATA
  • Bo Liu, Center for Bioinformatics, Harbin Institute of Technology, China
  • Dixian Zhu, Center for Bioinformatics, Harbin Institute of Technology, China
  • Yadong Wang, Center for Bioinformatics, Harbin Institute of Technology, China

Area Session Chair: Pedja Radivojac

Presentation Overview: Show

Motivation: With the development of high-throughput sequencing, the number of assembled genomes continues to rise. It is critical to organize and index many assembled genomes well to promote future genomics studies. The Burrows-Wheeler Transform (BWT) is an important data structure for genome indexing, with many fundamental applications; however, it is still non-trivial to construct the BWT for a large collection of genomes, especially for highly similar or repetitive genomes. Moreover, the state-of-the-art approaches cannot support scalable parallel computing well due to their incremental nature, which is a bottleneck in utilizing modern computers to accelerate BWT construction.
Results: We propose de Bruijn branch-based BWT constructor (deBWT), a novel parallel BWT construction approach. deBWT represents and organizes the suffixes of the input sequence with a novel data structure, de Bruijn branch encoding. This data structure takes advantage of the de Bruijn graph to facilitate the comparison between suffixes with a long common prefix, which breaks the bottleneck of BWT construction for repetitive genomic sequences. Meanwhile, deBWT also utilizes the structure of the de Bruijn graph to reduce unnecessary comparisons between suffixes. Benchmarking suggests that deBWT is efficient and scalable for constructing the BWT of large datasets by parallel computing. It is well-suited to index many genomes, such as a collection of individual human genomes, with multiple-core servers or clusters.
Availability: deBWT is implemented in C; the source code is available at https://github.com/hitbc/deBWT or https://github.com/DixianZhu/deBWT
Contact: ydwang@hit.edu.cn
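
For readers unfamiliar with the transform itself, the following is a minimal sketch of what a BWT is, using the textbook sorted-rotations definition. It is not deBWT's algorithm: it is sequential, needs quadratic memory, and is only suitable for tiny inputs. The function name `bwt` and the `$` sentinel are choices made here for illustration.

```python
def bwt(text, sentinel="$"):
    """Return the Burrows-Wheeler Transform of text.

    Textbook construction: append a sentinel that sorts before every
    other character, sort all rotations of the string, and read off
    the last column.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


example = bwt("banana")
```

Sorting rotations compares long shared prefixes character by character, which is exactly the cost that grows on highly similar or repetitive genomes and that the abstract describes as the bottleneck.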

TP069 (PT) - Finding correct protein-protein docking models using ProQDock
Date: Monday, July 11 3:30 pm - 3:50 pm
Room: Northern Hemisphere A3/A4
Theme: PROTEINS
  • Sankar Basu, Linköping University, Sweden
  • Bjorn Wallner, Linköping University, Sweden

Area Session Chair: Reinhard Schneider

Presentation Overview: Show

Motivation: Protein-protein interactions are key in virtually all biological processes. For a detailed understanding of the biological processes, the structure of the protein complex is essential. Given the current experimental techniques for structure determination, the vast majority of all protein complexes will never be solved by experimental techniques. In the absence of experimental data, computational docking methods can be used to predict the structure of the protein complex. A common strategy is to generate many alternative docking solutions (atomic models) and then use a scoring function to select the best. The success of the computational docking technique is, to a large degree, dependent on the ability of the scoring function to accurately rank and score the many alternative docking models.
Results: Here, we present ProQDock, a scoring function that predicts the absolute quality of docking models as measured by a novel protein docking quality score (DockQ). ProQDock uses support vector machines trained to predict the quality of protein docking models using features that can be calculated from the docking model itself. By combining different types of features describing both the protein-protein interface and the overall physical chemistry, it was possible to improve the correlation with DockQ from 0.25 for the best individual feature (EC) to 0.49 for the final version of ProQDock. ProQDock performed better than the state-of-the-art methods ZRANK and ZRANK2 in terms of correlations, ranking and finding correct models on an independent test set. Finally, we also demonstrate that it is possible to combine ProQDock with ZRANK and ZRANK2 to improve performance even further.

TP070 (HT) - Gene essentiality and synthetic lethality in haploid human cells
Date: Monday, July 11 3:30 pm - 3:50 pm
Room: Northern Hemisphere E1/E2
Theme: GENES / SYSTEMS
  • Jacques Colinge, IRCM Inserm U1194, University of Montpellier, ICM, France
  • Vincent Blomen, NKI, Netherlands
  • Peter Májek, CeMM, Austria
  • Lucas Jae, NKI, Netherlands
  • Johannes Bigenzahn, CeMM, Austria
  • Joppe Nieuwenhuis, NKI, Netherlands
  • Jacqueline Staring, NKI, Netherlands
  • Roberto Sacco, CeMM, Austria
  • Ferdy van Diemen, NKI, Netherlands
  • Nadine Olk, CeMM, Austria
  • Alexey Stukalov, CeMM, Austria
  • Caleb Marceau, Stanford University School of Medicine, United States
  • Hans Janssen, NKI, Netherlands
  • Jan Carette, Stanford University School of Medicine, United States
  • Keiryn Bennett, CeMM, Austria
  • Giulio Superti-Furga, CeMM, Austria
  • Thijn Brummelkamp, NKI, Netherlands

Area Session Chair: Judith Blake

Presentation Overview: Show

Among the many things one might want to know about a human cell, the list of its indispensable components, i.e. genes, is of great interest. Due to technical barriers, transposition of pioneering work done in yeast has taken years. We present a first genome-wide mutational screen conducted in human haploid cells that uncovered ~2000 genes required for fitness under culture conditions. Bioinformatic analyses were performed to extract global characteristics of human essential genes and the interactions they have with other genes. By performing similar screens on cells depleted of specific genes, we could obtain a synthetic lethality network around the secretory pathway, thus providing a first genetic interaction network in human cells obtained by mutagenesis.

Finally, we will comment on differences and similarities with concomitant essential gene lists published by two other groups (Wang et al., Science, 2015; Hart et al., Cell, 2015).

TP071 (LBR) - Solving the influence maximization problem on biological networks; a case study involving the cell cycle regulatory network in Saccharomyces cerevisiae
Date: Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere BCD
Theme: DATA / SYSTEMS
  • David Gibbs, Institute for Systems Biology, United States
  • Ilya Shmulevich, Institute for Systems Biology, United States

Area Session Chair: Russell Schwartz

Presentation Overview: Show

The Influence Maximization Problem (IMP) aims to discover the set of nodes with the greatest influence on network dynamics. The problem has previously been applied in epidemiology and social network analysis. Here, we demonstrate the application to cell cycle regulatory network analysis of Saccharomyces cerevisiae.
Fundamentally, gene regulation is linked to the flow of information. Therefore, our implementation of the IMP was framed as an information theoretic problem on a diffusion network. Utilizing all regulatory edges from YeastMine, gene expression dynamics were encoded as edge weights using a variant of time lagged transfer entropy, a method for quantifying information transfer across variables. Influence, for a particular number of sources, was measured using a diffusion model based on Markov chains with absorbing states. By maximizing over different numbers of sources, an influence ranking on genes was produced.
The influence ranking was compared to other metrics of network centrality. Although ‘top genes’ from each centrality ranking contained well known cell cycle regulators, there was little agreement and no clear winner. However, it was found that influential genes tend to directly regulate or sit upstream of genes ranked by other centrality measures. This is quantified by computing node reachability between gene sets; on average, 59% of central genes can be reached when starting from the influential set, compared to 7% of influential genes when starting at another centrality metric.
Influential nodes are critical sources of information flow; perturbing them may alter the state of the network and potentially lead to disease.
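
The reachability comparison described above can be illustrated with a short breadth-first search. This is a sketch under stated assumptions, not the authors' implementation: the graph is a plain dictionary of directed edges, a target that is itself a source counts as reached, and the function names and toy nodes are hypothetical.

```python
from collections import deque


def reachable_from(graph, sources):
    """Return every node reachable from any source by directed edges."""
    seen = set(sources)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def fraction_reached(graph, sources, targets):
    """Fraction of targets reachable when starting from sources."""
    targets = set(targets)
    if not targets:
        return 0.0
    return len(reachable_from(graph, sources) & targets) / len(targets)


# Toy regulatory graph: D -> A -> B -> C.
toy_graph = {"D": ["A"], "A": ["B"], "B": ["C"]}
forward = fraction_reached(toy_graph, ["A"], ["C", "D"])
backward = fraction_reached(toy_graph, ["C"], ["A", "B"])
```

Computing the fraction in both directions, as in the 59% versus 7% comparison, shows whether one gene set sits upstream of the other rather than merely overlapping with it.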

TP072 (PT) - Compacting de Bruijn graphs from sequencing data quickly and in low memory
Date: Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / DATA
  • Rayan Chikhi, CNRS, France
  • Antoine Limasset, IRISA, France
  • Paul Medvedev, Pennsylvania State University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview: Show

As the quantity of data per sequencing experiment increases, the challenges of fragment assembly are becoming increasingly computational. The de Bruijn graph is a widely used data structure in fragment assembly algorithms, used to represent the information from a set of reads. Compaction is an important data reduction step in most de Bruijn graph based algorithms where long simple paths are compacted into single vertices. Compaction has recently become the bottleneck in assembly pipelines, and improving its running time and memory usage is an important problem.

We present an algorithm and a tool BCALM 2 for the compaction of de Bruijn graphs. BCALM 2 is a parallel algorithm that distributes the input based on a minimizer hashing technique, allowing for good balance of memory usage throughout its execution. For human sequencing data, BCALM 2 reduces the computational burden of compacting the de Bruijn graph to roughly an hour and 3 GB of memory. We also applied BCALM 2 to the 22 Gbp loblolly pine and 20 Gbp white spruce sequencing datasets. Compacted graphs were constructed from raw reads in less than 2 days and 40 GB of memory on a single machine. Hence, BCALM 2 is at least an order of magnitude more efficient than other available methods.
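
To make the compaction step concrete, here is a naive in-memory sketch that merges each maximal non-branching path of k-mers into a single sequence (a unitig). It is not BCALM 2's algorithm: there is no minimizer-based partitioning or parallelism, and reverse complements are ignored for simplicity. The function name `compact_unitigs` is hypothetical.

```python
def compact_unitigs(kmers):
    """Compact a node-centric de Bruijn graph into unitigs.

    Nodes are k-mers; an edge joins x to y when the last k-1 characters
    of x equal the first k-1 characters of y. Simplification: only the
    forward strand is considered.
    """
    kmers = set(kmers)

    def successors(x):
        return [x[1:] + c for c in "ACGT" if x[1:] + c in kmers]

    def predecessors(x):
        return [c + x[:-1] for c in "ACGT" if c + x[:-1] in kmers]

    def starts_unitig(x):
        preds = predecessors(x)
        # A path starts here unless x has one predecessor whose only
        # successor is x.
        return len(preds) != 1 or len(successors(preds[0])) != 1

    visited = set()

    def walk(start):
        path, current = start, start
        visited.add(start)
        while True:
            succs = successors(current)
            if len(succs) != 1:
                break
            nxt = succs[0]
            if nxt in visited or len(predecessors(nxt)) != 1:
                break
            path += nxt[-1]
            visited.add(nxt)
            current = nxt
        return path

    unitigs = [walk(x) for x in sorted(kmers) if starts_unitig(x)]
    # Any k-mers left over lie on isolated cycles with no branching.
    for x in sorted(kmers):
        if x not in visited:
            unitigs.append(walk(x))
    return sorted(unitigs)


# 3-mers of the toy reads ACGTT and ACGTA, which share the prefix ACGT.
toy_kmers = ["ACG", "CGT", "GTT", "GTA"]
toy_unitigs = compact_unitigs(toy_kmers)
```

Every k-mer is tested for membership in one shared in-memory set, which is why memory, rather than computation, tends to dominate at genome scale.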

TP073 (LBR) - HUMAN PROTEIN COMPLEX MAP: INTEGRATION OF 10K MASS SPECTROMETRY EXPERIMENTS
Date: Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere A3/A4
Theme: PROTEINS
  • Kevin Drew, University of Texas at Austin, United States
  • Edward Marcotte, University of Texas at Austin, United States

Area Session Chair: Reinhard Schneider

Presentation Overview: Show

Protein complexes carry out essential functions in the cell but we currently lack knowledge of their composition, formation and function. Several recent studies using high throughput discovery of protein interactions have allowed the construction of protein complex maps but the protein overlap of these maps is limited. Here we take an integrated approach by combining protein interaction experiments from multiple published mass spectrometry datasets and construct a more complete human protein complex map. We evaluate both pairwise interactions and complexes using a novel clique-based comparison method and show improved performance over the published complex maps. Additionally, we find several new complexes including ones with enrichment for developmental disorders, suggesting candidate disease genes. The expansiveness and accuracy of this complex map yield greater understanding of cellular function and provide avenues for better disease characterization.

TP074 (PT) - Influence maximization in time bounded network identifies transcription factors regulating perturbed pathways
Date: Monday, July 11 3:50 pm - 4:10 pm
Room: Northern Hemisphere E1/E2
Theme: GENES
  • Kyuri Jo, Seoul National University, Korea, Republic of
  • Inuk Jung, Seoul National University, Korea, Republic of
  • Ji Hwan Moon, Seoul National University, Korea, Republic of
  • Sun Kim, Seoul National University, Korea, Republic of

Area Session Chair: Judith Blake

Presentation Overview: Show

To understand the dynamic nature of the biological process, it is crucial to identify perturbed pathways in an altered environment and also to infer regulators that trigger the response. Current time-series analysis methods, however, are not powerful enough to identify perturbed pathways and regulators simultaneously. Widely used methods determine gene sets, such as differentially expressed genes or gene clusters, and these gene sets need to be further interpreted in terms of biological pathways using other tools. Most pathway analysis methods are not designed for time series data and they do not consider gene-gene influence on the time dimension. In this paper, we propose a novel time-series analysis method, TimeTP, for determining transcription factors regulating pathway perturbation, which narrows the focus to perturbed sub-pathways and utilizes the gene regulatory network and protein-protein interaction network to locate transcription factors triggering the perturbation. TimeTP first identifies perturbed sub-pathways that propagate the expression changes over time. Starting points of the perturbed sub-pathways are mapped into the network and the most influential transcription factors are determined by an influence maximization technique. The analysis result is visually summarized in a TF-Pathway map in time clock. TimeTP was applied to a PIK3CA knock-in dataset and found significant sub-pathways and their regulators relevant to the PIP3 signaling pathway.

TP075 (LBR) - Scalable Tools for Quantitative Analysis of Chemical-Genetic Interactions from Sequencing-Based Chemical-Genetic Interaction Screens
Date: Monday, July 11 4:10 pm - 4:30 pm
Room: Northern Hemisphere BCD
Theme: DATA / SYSTEMS
  • Scott Simpkins, University of Minnesota, United States
  • Justin Nelson, University of Minnesota, United States
  • Raamesh Desphande, University of Minnesota, United States
  • Jeffrey Piotrowski, Yumanity Therapeutics, United States
  • Sheena Li, RIKEN Institute for Sustainable Resource Science, Japan
  • Charles Boone, University of Toronto, Canada
  • Chad Myers, University of Minnesota, United States

Area Session Chair: Russell Schwartz

Presentation Overview:

Recent improvements in the throughput of chemical-genetic interaction screens have necessitated the development of new, scalable pipelines for processing raw sequencing data from these experiments and interpreting the resulting chemical-genetic interaction profiles. We developed two computational tools, BEAN-counter and CG-TARGET, to respectively process and interpret the large influx of data from high-throughput chemical-genomic screens. These pipelines were applied to chemical-genetic interaction screens of more than 18,000 compounds in S. cerevisiae, ultimately yielding more than 2,000 compounds with high confidence predictions to biological process targets. We confirmed that our process-level target predictions overlap with the known functions of compounds and, importantly, enable us to discover novel compound modes-of-action. Additionally, these tools provided the foundation for new investigations into the nature of chemical interactions with biological systems.

TP076 (LBR) - Succinct Colored de Bruijn Graphs
Date: Monday, July 11 4:10 pm - 4:30 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / DATA
  • Martin Muggli, Colorado State University, United States
  • Alex Bowe, National Institute of Informatics, Chiyoda-ku, Tokyo, Japan, Japan
  • Travis Gagie, Department of Computer Science, University of Helsinki, Finland
  • Robert Raymond, Colorado State University, United States
  • Noelle R. Noyes, Colorado State University, United States
  • Paul Morley, Colorado State University, United States
  • Keith Belk, Colorado State University, United States
  • Simon Puglisi, University of Helsinki, Finland
  • Christina Boucher, Colorado State University, United States

Area Session Chair: Pedja Radivojac

Presentation Overview:

MOTIVATION: Iqbal et al. (Nature Genetics, 2012) introduced the colored de Bruijn graph, a variant of the classic de Bruijn graph, which is aimed at "detecting and genotyping simple and complex genetic variants in an individual or population".
Because they are intended to be applied to massive population level data, it is essential that the graphs be represented efficiently.
Unfortunately, current succinct de Bruijn graph representations are not directly applicable to the colored de Bruijn graph, which requires additional information to be succinctly encoded as well as support for non-standard traversal operations.
RESULTS: Our data structure dramatically reduces the amount of memory required to store and use the colored de Bruijn graph, with some penalty to runtime, allowing it to be applied in much larger and more ambitious sequence projects than was previously possible. In particular, we use our method along with a custom curated database of antimicrobial resistant genes to track changes in the resistome across food production facilities. A short video of our work is available at http://cdbg.martindmuggli.com.
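For readers unfamiliar with the colored de Bruijn graph described above, a minimal Python sketch follows. It uses a plain dictionary rather than the succinct representation the abstract is about, and the function names and toy sequences are assumptions chosen for illustration.

```python
from collections import defaultdict

def build_colored_dbg(samples, k):
    """Map each k-mer (an edge) to the set of sample names (colors) that contain it."""
    colors = defaultdict(set)
    for name, seq in samples.items():
        for i in range(len(seq) - k + 1):
            colors[seq[i:i + k]].add(name)
    return dict(colors)

def successors(colors, node, color=None):
    """List outgoing k-mers from a (k-1)-mer node, optionally restricted to one color."""
    out = []
    for base in "ACGT":
        kmer = node + base
        if kmer in colors and (color is None or color in colors[kmer]):
            out.append(kmer)
    return out

# Two hypothetical samples that differ by a single base.
samples = {"s1": "ACGTA", "s2": "ACGGA"}
graph = build_colored_dbg(samples, k=3)
print(graph["ACG"])                        # shared by both samples
print(successors(graph, "CG"))             # the graph branches at the variant
print(successors(graph, "CG", color="s1")) # color-restricted traversal
```

The color-restricted traversal is the kind of non-standard operation a succinct encoding has to support; storing a Python set per k-mer, as done here, is exactly the memory cost that such an encoding aims to avoid.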

TP077 (PT) - An Integer Programming Framework for Inferring Disease Complexes from Network Data
Date: Monday, July 11 4:10 pm - 4:30 pm
Room: Northern Hemisphere A3/A4
Theme: PROTEINS / DISEASE
  • Arnon Mazza, Tel Aviv University, Israel
  • Konrad Klockmeier, Max Delbrück Center for Molecular Medicine, Germany
  • Erich Wanker, Max Delbrück Center for Molecular Medicine, Germany
  • Roded Sharan, School of Computer Science, Tel Aviv University, Israel

Area Session Chair: Reinhard Schneider

Presentation Overview:

Unraveling the molecular mechanisms that underlie disease calls for methods that go beyond the identification of single causal genes to inferring larger protein assemblies that take part in the disease process. Here we develop an exact, integer-programming-based method for associating protein complexes with disease. Our approach scores proteins based on their proximity in a protein-protein interaction network to a prior set that is known to be relevant for the studied disease. These scores are combined with interaction information to infer densely interacting protein complexes that are potentially disease-associated. We show that our method outperforms previous ones and leads to predictions that are well supported by current experimental data and literature knowledge.

TP078 (HT) - Mogrify: a predictive system for cell reprogramming
    Cancelled
Date: Monday, July 11 4:10 pm - 4:30 pm
Room: Northern Hemisphere E1/E2
Theme: GENES / SYSTEMS
  • Owen Rackham, Duke-NUS, Singapore
  • Jaber Firas, Monash University, Australia
  • Jose Polo, Monash University, Australia
  • Julian Gough, University of Bristol, United Kingdom

Area Session Chair: Judith Blake

Presentation Overview:

Transdifferentiation, the process of converting from one cell type to another without going through a pluripotent state, has great promise for regenerative medicine. The identification of key transcription factors for reprogramming is currently limited by the cost of exhaustive experimental testing of plausible sets of factors, an approach that is inefficient and unscalable. Here we present a predictive system (Mogrify http://mogrify.net) that combines gene expression data with regulatory network information to predict the reprogramming factors necessary to induce cell conversion. We have applied Mogrify to over 300 human cell types and tissues, defining an atlas of cellular reprogramming. Mogrify correctly predicts the transcription factors used in known transdifferentiations. Furthermore, we validated two new transdifferentiations predicted by Mogrify. We provide a practical and efficient mechanism for systematically implementing novel cell conversions, facilitating the generalization of reprogramming of human cells. Predictions are made available to help rapidly further the field of cell conversion.

TP079 (HT) - Compressive Mapping for Next-Generation Sequencing
Date: Tuesday, July 12 10:10 am - 10:30 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Deniz Yorukoglu, Massachusetts Institute of Technology, United States
  • Yun William Yu, Massachusetts Institute of Technology, United States
  • Jian Peng, University of Illinois at Urbana-Champaign, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States

Area Session Chair: Scott Markel

Presentation Overview:

The high cost of mapping next-generation sequencing (NGS) read data onto a reference is a major bottleneck to sequencing analysis pipelines. We introduce COmpressive Read-mapping Accelerator (CORA), a framework that first maps reads to reads and reference to reference, exploiting inherent redundancies in both read and reference sequences, to accelerate read to reference mapping. We use this framework to map paired-end reads from the 1000 Genomes Project to the human reference, eliminating redundant sequence comparisons and improving time and sensitivity by orders of magnitude, particularly for multi-reads. The relative speed advantage of our approach will increase with the explosion of NGS data and advances in sequencing technologies, allowing researchers to keep pace with this data onslaught.
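The redundancy-exploiting idea in this abstract can be illustrated with a small Python sketch. This is not CORA itself: it assumes exact matching only, collapses identical reads so each distinct sequence is looked up once, and relies on a k-mer index in which repeated reference k-mers share a single entry. All names and sequences are hypothetical.

```python
from collections import defaultdict

def index_reference(ref, k):
    """Index every k-mer of the reference; repeated k-mers share one entry."""
    idx = defaultdict(list)
    for i in range(len(ref) - k + 1):
        idx[ref[i:i + k]].append(i)
    return idx

def compressive_map(reads, ref, k):
    """Map reads by exact match, doing one lookup per distinct read sequence."""
    idx = index_reference(ref, k)
    groups = defaultdict(list)  # read sequence -> ids of reads sharing it
    for rid, seq in reads.items():
        groups[seq].append(rid)
    hits, lookups = {}, 0
    for seq, rids in groups.items():
        lookups += 1
        # Seed with the first k-mer, then verify the full read at each position.
        positions = [p for p in idx.get(seq[:k], []) if ref.startswith(seq, p)]
        for rid in rids:
            hits[rid] = positions
    return hits, lookups

reference = "ACGTACGTTT"
reads = {"r1": "ACGTA", "r2": "ACGTA", "r3": "CGTT"}
hits, lookups = compressive_map(reads, reference, k=4)
print(hits, lookups)
```

Three reads are resolved with two lookups here; the saving grows with the amount of duplicated sequence in the read set.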

TP080 (LBR) - Interactome based drug discovery and disease-disease connections
Date: Tuesday, July 12 10:10 am - 10:30 am
Room: Northern Hemisphere A3/A4
Theme: PROTEINS / DISEASE
  • Gaurav Chopra, Purdue University, United States
  • Ram Samudrala, SUNY Buffalo, United States

Area Session Chair: Natasa Przulj

Presentation Overview:

We have developed a Computational Analysis of Novel Drug Opportunities (CANDO) platform (http://protinfo.org/cando/), funded by a 2010 NIH Director's Pioneer Award, that analyzes compound-proteome interaction signatures to determine drug behavior, in contrast to traditional single (or few) target approaches. Our platform implements a modeling pipeline that generates an interaction matrix between 3,733 human approved drugs and 48,278 proteins using a hierarchical chem- and bio-informatic fragment-based docking with dynamics protocol (~1 billion predicted interactions evaluated, considering multiple binding sites per protein). The platform then uses similarity of interaction signatures across all proteins as indicative of similar functional behavior, and nonsimilar signatures for off- and anti-target (side) effects, in effect inferring homology of compound/drug behavior at a proteomic level. The benchmarking accuracy using this approach to rank compounds for over 650 indications/diseases is ~36%, in contrast to accuracies of ~0.2% obtained when using scrambled control matrices. We prospectively validated “high value” predictions in in vitro and in vivo preclinical studies for more than a dozen indications, including type 1 diabetes, herpes, dental caries, dengue, tuberculosis, malaria, hepatitis B, and different cancers. Our drug prediction accuracy is ~35% across the nine indications, where 57/162 compounds validated thus far show comparable or better activity than an existing drug, or micromolar inhibition at the cellular level, and serve as novel repurposeable therapies. Taken together with benchmarking accuracy and the effect of druggable protein classes on repurposing accuracy, our multitargeting results indicate that a large number of protein structures with diverse fold space and a specific polypharmacological interactome are necessary for accurate drug predictions using our proteomic and evolutionary drug discovery and repurposing platform.
Our approach is broadly applicable beyond repurposing, enables personalized and precision medicine, and foreshadows a new era of faster drug and target discovery using novel disease-disease connections.
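The signature-comparison step described above can be sketched in a few lines of Python. The sketch assumes cosine similarity as the comparison measure and uses made-up three-protein signatures; the abstract does not specify the platform's actual metric, so both are illustrative assumptions, as are the function names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length interaction signatures."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_similar(signatures, query):
    """Rank all other compounds by signature similarity to the query compound."""
    q = signatures[query]
    scored = [(name, cosine(q, sig)) for name, sig in signatures.items() if name != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical compound-by-protein interaction scores.
signatures = {
    "drugA": [0.9, 0.1, 0.8],
    "drugB": [0.8, 0.2, 0.9],
    "drugC": [0.1, 0.9, 0.0],
}
ranking = rank_similar(signatures, "drugA")
print(ranking)
```

Under this reading, a compound ranked highly against a drug with a known indication becomes a repurposing candidate for that indication.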

TP081 (LBR) - Classifying Cancer Samples by microRNA Profiles: Read the Fine Print!
Date: Tuesday, July 12 10:10 am - 10:30 am
Room: Northern Hemisphere E1/E2
Theme: DISEASE / GENES
  • Roni Rasnic, The Hebrew University of Jerusalem, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel
  • Nathan Linial, The Hebrew University of Jerusalem, Israel

Area Session Chair: Yves Moreau

Presentation Overview:

MicroRNAs (miRNAs) primarily function in regulating genes and maintaining cell homeostasis. Indeed, carcinogenesis is often represented by drastic perturbations in miRNA profiles. Many cancerous tissues share similar miRNA profiles with only a few dominating miRNAs. The Cancer Genome Atlas (TCGA) provides a rich resource with thousands of human samples covering >25 major cancer types. Here, we test the significance of miRNA information from TCGA in characterizing the cancer tissues and distinguishing their types and tissue origin. We apply an SVM multiclass classifier for assessing the separation power between cancer types and some of their healthy tissues. The ML approach was applied to 8522 samples associated with expression data for 1047 miRNAs. We find that the set of the lowest expressed miRNAs, which comprises only 0.003% of total miRNA reads, has a higher separation power. In fact, including the complementary set of the highly expressed miRNAs deteriorates the classification success. We are able to improve the identification following a simple discretization of the data, improving the success from 56% with the naïve usage of the miRNA profiles to ~90%. We suggest using the separation capacity of the low-expressing miRNAs for characterization of metastatic tumors with unknown tissue origin. Furthermore, we gain surprising and useful insights on classes that suffer a consistent failure in identification.

TP082 (PT) - RapMap: A Rapid, Sensitive and Accurate Tool for Mapping RNA-seq Reads to Transcriptomes
Date: Tuesday, July 12 10:30 am - 10:50 am
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Avi Srivastava, Stony Brook University, United States
  • Hirak Sarkar, Stony Brook University, United States
  • Nitish Gupta, Stony Brook University, United States
  • Rob Patro, Stony Brook University, United States

Area Session Chair: Scott Markel

Presentation Overview:

Motivation: The alignment of sequencing reads to a transcriptome is a common and important step in many RNA-seq analysis tasks. When aligning RNA-seq reads directly to a transcriptome (as is common in the de novo setting or when a trusted reference annotation is available), care must be taken to report the potentially large number of multi-mapping locations per read. This can pose a substantial computational burden for existing aligners, and can considerably slow downstream analysis.

Results: We introduce a novel concept, quasi-mapping, and an efficient algorithm implementing this approach for mapping sequencing reads to a transcriptome. By attempting only to report the potential loci of origin of a sequencing read, and not the base-to-base alignment by which it derives from the reference, RapMap - our tool implementing quasi-mapping - is capable of mapping sequencing reads to a target transcriptome substantially faster than existing alignment tools. The algorithm we employ to implement quasi-mapping uses several efficient data structures and takes advantage of the special structure of shared sequence prevalent in transcriptomes to rapidly provide highly-accurate mapping information. We demonstrate how quasi-mapping can be successfully applied to the problems of transcript-level quantification from RNA-seq reads and the clustering of contigs from de novo assembled transcriptomes into biologically-meaningful groups.

Availability: RapMap is implemented in C++11 and is available as open-source software, under GPL v3, at https://github.com/COMBINE-lab/RapMap.

Contact: rob.patro@cs.stonybrook.edu
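To make the quasi-mapping concept from the RapMap abstract concrete, here is a simplified Python sketch that reports candidate loci of origin for a read without computing any base-to-base alignment. It is not RapMap's implementation: the plain k-mer hash index, the requirement that every k-mer of the read agrees on the locus, and the names `build_index` and `quasi_map` are assumptions made for the example.

```python
from collections import defaultdict

def build_index(transcripts, k):
    """Map each k-mer to the set of (transcript, position) pairs where it occurs."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add((name, i))
    return index

def quasi_map(read, index, k):
    """Return (transcript, read start) loci consistent with every k-mer of the read."""
    candidates = None
    for offset in range(len(read) - k + 1):
        kmer = read[offset:offset + k]
        # Shift each hit back by the offset so all k-mers vote for the read start.
        hits = {(t, pos - offset) for t, pos in index.get(kmer, ())}
        candidates = hits if candidates is None else candidates & hits
        if not candidates:
            return []
    return sorted(candidates or [])

# Two hypothetical transcripts sharing an internal sequence.
transcripts = {"tx1": "AACGTGCA", "tx2": "TTCGTGCC"}
index = build_index(transcripts, k=4)
multi = quasi_map("CGTGC", index, k=4)    # shared sequence: multi-mapping read
unique = quasi_map("CGTGCA", index, k=4)  # one extra base resolves the transcript
print(multi, unique)
```

The first read maps to both transcripts, which mirrors the multi-mapping situation the abstract highlights for transcriptome targets.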

TP083 (PT) - A convex optimization approach for identification of human tissue-specific interactomes
Date: Tuesday, July 12 10:30 am - 10:50 am
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS / DISEASE
  • Shahin Mohammadi, Purdue University, United States
  • Ananth Grama, Department of Computer Science, Purdue University, United States

Area Session Chair: Natasa Przulj

Presentation Overview:

Motivation: Analysis of organism-specific interactomes has yielded novel insights into cellular function and coordination, understanding of pathology, and identification of markers and drug targets. Genes, however, can exhibit varying levels of cell-type specificity in their expression, and their coordinated expression manifests in tissue-specific function and pathology. Tissue-specific/selective interaction mechanisms have significant applications in drug discovery, as they are more likely to reveal drug targets. Furthermore, tissue-specific transcription factors (tsTFs) are significantly implicated in human disease, including cancers. Finally, disease genes and protein complexes have the tendency to be differentially expressed in tissues in which defects cause pathology. These observations motivate the construction of refined tissue-specific interactomes from organism-specific interactomes.

Results: We present a novel technique for constructing human tissue-specific interactomes. Using a variety of validation tests (ESEA, GO Enrichment, Disease-Gene Subnetwork Compactness), we show that our proposed approach significantly outperforms state-of-the-art techniques. Finally, using case studies of Alzheimer's and Parkinson's diseases, we show that tissue-specific interactomes derived from our study can be used to construct pathways implicated in pathology and demonstrate the use of these pathways in identifying novel targets.

Availability: http://www.cs.purdue.edu/homes/mohammas/projects/ActPro.html

TP084 (LBR) - RNA sequencing-based cell proliferation analysis across 19 cancers identifies a subset of proliferation-informative cancers with a common survival signature
Date: Tuesday, July 12 10:30 am - 10:50 am
Room: Northern Hemisphere E1/E2
Theme: DISEASE
  • Brittany Lasseigne, HudsonAlpha Institute for Biotechnology, United States
  • Ryne Ramaker, HudsonAlpha Institute for Biotechnology and The University of Alabama at Birmingham, United States
  • Laura Palacio, HudsonAlpha Institute for Biotechnology, United States
  • David Gunther, HudsonAlpha Institute for Biotechnology, United States
  • Sara Cooper, HudsonAlpha Institute for Biotechnology, United States
  • Richard Myers, HudsonAlpha Institute for Biotechnology, United States

Area Session Chair: Yves Moreau

Presentation Overview:

Despite advances in cancer diagnosis and treatment strategies, it has been difficult to identify robust prognostic signatures in cancer. Cell proliferation has long been recognized as a potential prognostic marker in cancer, but has not been investigated across multiple cancers using tissue-based RNA sequencing. Here we explore the role of cell proliferation across 19 cancers (n=6,312 patients) from The Cancer Genome Atlas project by employing a ‘proliferative index’ derived from gene expression associated with PCNA expression. This proliferative index is significantly associated with patient survival (Cox, p-value<0.05) in 8/19 cancers, which we have defined as ‘proliferation-informative cancers’ (PICs). In PICs the proliferative index is strongly correlated with tumor stage and nodal invasion. Furthermore, PICs demonstrate lower proliferation machinery expression relative to other cancers (Spearman, p=1.76E-23). Transcriptome-wide predictive survival modeling using multivariate Cox regression with L1-penalized log partial likelihood (LASSO) for feature selection outperformed the ‘proliferative index’ in 18/19 cancers. Survival-associated expression patterns were relatively unique between cancers; however, PICs have a common survival signature of 86 genes (Cox, p<0.05 across all 8 cancers). Additionally, we find that the proliferative index is significantly associated with somatic mutation burden (Spearman, p=1.76E-23). This study presents cancers for which cell proliferation may be an important prognostic marker and demonstrates that modern machine learning techniques can identify survival models more predictive than, and independent of, proliferative index for most cancers. We also present evidence for cell proliferation as a proxy for clinical parameters and confirm an association between cell proliferation and somatic mutation burden across cancers.

TP085 (HT) - ADAGE-Based Extraction of Biological Context from Public Gene Expression Data
Date: Tuesday, July 12 10:50 am - 11:10 am
Room: Northern Hemisphere A1/A2
Theme: GENES / DATA
  • Jie Tan, Geisel School of Medicine at Dartmouth, United States
  • John Hammond, Geisel School of Medicine at Dartmouth, United States
  • Deborah Hogan, Geisel School of Medicine at Dartmouth, United States
  • Casey Greene, University of Pennsylvania, United States

Area Session Chair: Scott Markel

Presentation Overview:

In this talk, I will introduce the overarching question that I’m addressing in my thesis: “How do we extract biological patterns from heterogeneous public gene expression data using unsupervised methods.” To address this challenge, we recently developed and published ADAGE (Analysis using Denoising Autoencoders for Gene Expression) in the journal mSystems. ADAGE is a method based on deep learning that extracts features representing biological states of an organism from the organism’s complete expression compendium without requiring pathway annotations or other curated knowledge. In this talk, I’ll primarily highlight the ADAGE method, and I’ll demonstrate how ADAGE can be applied to analyzing new RNA-Seq datasets. I’ll cover how ADAGE can be used to generate new hypotheses about how different environments activate distinct pathways. I’ll wrap up by mentioning an upcoming contribution: an approach that we call eADAGE that significantly improves the abundance and completeness of pathways extracted by ADAGE.

TP086 (HT) - Precision drug repurposing and multi-target drug design using structural systems pharmacology
Date: Tuesday, July 12 10:50 am - 11:10 am
Room: Northern Hemisphere A3/A4
Theme: PROTEINS / DISEASE
  • Thomas Hart, Rockefeller University, United States
  • Shihab Dider, Hunter College, CUNY, United States
  • Weiwei Han, Jilin University, China
  • Hua Xu, University of Texas Health Center, United States
  • Zhongming Zhao, University of Texas Health Center, United States
  • Philip Bourne, National Institutes of Health, United States
  • Lei Xie, Hunter College, The City University of New York, United States

Area Session Chair: Natasa Przulj

Presentation Overview:

Precision medicine is an emerging method for disease treatment. However, its advance is hindered by a lack of mechanistic understanding of the energetics and dynamics of genome-wide drug-target and genetic interactions. To address this challenge, we have developed a novel structural systems pharmacology approach to elucidate the molecular basis and genetic biomarkers of drug action. We have applied our approach to repurposing metformin, an anti-diabetes drug, for precision anti-cancer therapy. By searching the human structural proteome, we identified putative metformin binding targets and experimentally verified the predictions. Subsequently, we linked these binding targets to genes whose expression is altered by metformin through protein-protein interactions, and identified network biomarkers of drug phenotypic response. The key nodes in the genetic networks are largely consistent with the existing experimental evidence. Their interactions can be affected by the observed cancer mutations. This study demonstrates that structural systems pharmacology is a powerful tool for precision medicine.

TP087 (LBR) - Data-Driven Analysis of Lymphocyte Infiltration in Breast Cancer Development and Progression
Date: Tuesday, July 12 10:50 am - 11:10 am
Room: Northern Hemisphere E1/E2
Theme: DISEASE
  • Ruth Dannenfelser, Princeton University, United States
  • Josie Ursini-Siegel, Lady Davis Institute for Medical Research, Canada
  • Vessela Kristensen, Radiumhospitalet, Norway
  • Olga Troyanskaya, Princeton University, United States

Area Session Chair: Yves Moreau

Presentation Overview:

The tumor microenvironment is now widely recognized for its role in tumor progression, treatment response, and clinical outcome. The intratumoral immunological landscape, in particular, has been shown to exert both pro-tumorigenic and anti-tumorigenic effects. Thus far, direct detailed studies of the cell composition of tumor infiltration have been limited, with some studies giving approximate quantifications using immunohistochemistry and other small studies obtaining detailed measurements by laboriously isolating cells from newly excised tumors and sorting them using flow cytometry. Herein we utilize a machine learning based approach to identify lymphocyte markers with which we can quantify the presence of B cells, cytotoxic T-lymphocytes, T-helper 1, and T-helper 2 cells in any gene expression data set, and apply it to studies of breast tissue. By leveraging many samples from existing large scale studies, we are able to find an inherent cell heterogeneity in clinically characterized immune infiltrates, a strong link between estrogen receptor status and infiltration in normal and tumor tissues, and changes with genomic complexity, and to identify characteristic differences in lymphocyte expression among molecular groupings. Furthermore, we explore the effects detailed infiltration patterns have on patient survival and changes with anti-estrogen therapy.

TP088 (PT) - SHARAKU: An algorithm for aligning and clustering read mapping profiles of deep sequencing in non-coding RNA processing
Date: Tuesday, July 12 11:40 am - 12:00 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Mariko Tsuchiya, Keio University, Japan
  • Kojiro Amano, Keio University, Japan
  • Masaya Abe, Keio University, Japan
  • Misato Seki, Keio University, Japan
  • Sumitaka Hase, Keio University, Japan
  • Kengo Sato, Keio University, Japan
  • Yasubumi Sakakibara, Keio University, Japan

Area Session Chair: Scott Markel

Presentation Overview:

Motivation: Deep sequencing of the transcripts of regulatory non-coding RNA generates footprints of post-transcriptional processes. After obtaining sequence reads, the short reads are mapped to a reference genome, and specific mapping patterns can be detected called read mapping profiles, which are distinct from random non-functional degradation patterns. These patterns reflect the maturation processes that lead to the production of shorter RNA sequences. Recent next-generation sequencing studies have revealed not only the typical maturation process of miRNAs but also the various processing mechanisms of small RNAs derived from tRNAs and snoRNAs.
Results: We developed an algorithm termed SHARAKU to align two read mapping profiles of next-generation sequencing outputs for non-coding RNAs. In contrast with previous work, SHARAKU incorporates the primary and secondary sequence structures into an alignment of read mapping profiles to allow for the detection of common processing patterns. Using a benchmark simulated dataset, SHARAKU exhibited superior performance to previous methods for correctly clustering the read mapping profiles with respect to 5’-end processing and 3’-end processing from degradation patterns and in detecting similar processing patterns in deriving the shorter RNAs. Further, using experimental data of small RNA sequencing for the common marmoset brain, SHARAKU succeeded in identifying the significant clusters of read mapping profiles for similar processing patterns of small derived RNA families expressed in the brain.

TP089 (LBR) - Nucleotide Sequence Composition Adjacent to Intronic 5’ End Improves Translation Costs in Fungi
Date: Tuesday, July 12 11:40 am - 12:00 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS / GENES
  • Zohar Zafrir, Tel Aviv University, Israel
  • Tamir Tuller, Tel Aviv University, Department of Biomedical Engineering, Israel

Area Session Chair: Natasa Przulj

Presentation Overview:

It is generally believed that introns are not translated; therefore, the potential intronic sequence features that may be related to the translation step (occurring after splicing) have yet to be thoroughly studied. Focusing on four fungi as model organisms (S. cerevisiae, S. pombe, A. nidulans, and C. albicans), we performed a comprehensive large-scale systems biology study to characterize for the first time how translation is encoded in introns and affects their evolution. When considering the reading frame of exons upstream and adjacent to introns, we find evidence suggesting a preference for intronic STOP codons close to the intronic 5’ end, and that the beginning of introns is selected for codons with higher translation efficiency, presumably resulting in reduced translation and metabolic costs in cases of non-spliced introns. Ribosomal profiling data analysis in S. cerevisiae supports the conjecture that in this organism intron retention frequently occurs; thus, introns are partially translated, and their translation efficiency affects organismal fitness. We also show that this selection is stronger in highly translated and highly spliced genes, but is not associated only with genes with a specific function. Finally, we discuss the potential relation of the reported signals to an efficient nonsense-mediated decay (NMD) pathway due to splicing errors. These new discoveries, supported by population-genetics considerations, contribute to a broader understanding of intron evolution, and of how silent mutations affect gene expression and organismal fitness.

The talk is based on a paper that will be published (accepted) in the journal: DNA Research; I will also review very recent related studies (Zafrir & Tuller, RNA, 2015; Yofe* and Zafrir* et al., PLoS Genetics, 2014).

TP090 (LBR) - Phenotype Stratification from the Electronic Health Record using Autoencoders
Date: Tuesday, July 12 11:40 am - 12:00 pm
Room: Northern Hemisphere E1/E2
Theme: DISEASE / DATA
  • Brett K Beaulieu-Jones, University of Pennsylvania, United States
  • Jason H Moore, University of Pennsylvania, United States
  • Casey S Greene, University of Pennsylvania, United States

Area Session Chair: Yves Moreau

Presentation Overview:

Genetic association and, on a larger scale, personalized medicine require highly specific and accurate phenotypes. Research-quality phenotyping is costly and can require manual clinician review. Electronic Health Records (EHRs) contain a wealth of phenotypic information but were built for clinical and billing purposes. Effectively extracting this information for research is challenging because many records are sparsely filled and labeled with billing codes. Here, we show the unsupervised use of autoencoders to model patients in the EHR. To evaluate model fit, we created a semi-supervised classifier by adding a random forest to the trained autoencoder. Semi-supervised denoising autoencoders showed classification improvements in simulation models, particularly when small numbers of patients have high quality phenotypes. Deep autoencoders with dropout effectively imputed missing data in the PRO-ACT ALS clinical trial dataset as measured by spike-in imputation accuracy. Deep autoencoder imputed data enabled more accurate ALS disease progression prediction as defined by the ALS Functional Rating System. Finally, we show that despite symptomatic heterogeneity, ALS disease progression appears homogeneous, with time from onset being the most important predictor.

TP091 (PT) - Analysis of differential splicing suggests different modes of short-term splicing regulation
Date: Tuesday, July 12 12:00 pm - 12:20 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Hande Topa, Aalto University, Finland
  • Antti Honkela, University of Helsinki, Finland

Area Session Chair: Scott Markel

Presentation Overview:

Motivation: Alternative splicing is an important mechanism in which the regions of pre-mRNAs are differentially joined in order to form different transcript isoforms. Alternative splicing is involved in the regulation of normal physiological functions but also linked to the development of diseases such as cancer. We analyse differential expression and splicing using RNA-seq time series in three different settings: overall gene expression levels, absolute transcript expression levels and relative transcript expression levels.
Results: Using estrogen receptor alpha signalling response as a model system, our Gaussian process (GP)-based test identifies genes with differential splicing and/or differentially expressed transcripts. We discover genes with consistent changes in alternative splicing independent of changes in absolute expression and genes where some transcripts change while others stay constant in absolute level. The results suggest classes of genes with different modes of alternative splicing regulation during the experiment.
Availability: R and Matlab codes implementing the method are available at https://github.com/PROBIC/diffsplicing. An interactive browser for viewing all model fits is available at http://users.ics.aalto.fi/hande/splicingGP/.

TP092 (PT) - Prediction of Ribosome Footprint Profile Shapes from Transcript Sequences
Date: Tuesday, July 12 12:00 pm - 12:20 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS / GENES
  • Tzu-Yu Liu, University of Pennsylvania, United States
  • Yun S. Song, University of California, Berkeley, United States

Area Session Chair: Natasa Przulj

Motivation: Ribosome profiling is a useful technique for studying translational dynamics and quantifying protein synthesis. Applications of this technique have shown that ribosomes are not uniformly distributed along mRNA transcripts. Understanding how each transcript-specific distribution arises is important for unraveling the translation mechanism.

Results: Here, we apply kernel smoothing to construct predictive features and build a sparse model to predict the shape of ribosome footprint profiles from transcript sequences alone. Our results on Saccharomyces cerevisiae data show that the marginal ribosome densities can be predicted with high accuracy. The proposed novel method has a wide range of applications, including inferring isoform-specific ribosome footprints, designing transcripts with fast translation speeds, and discovering unknown modulation during translation.

TP093 (HT) - Leveraging electronic medical records for systematic drug repositioning
Date: Tuesday, July 12 12:00 pm - 12:20 pm
Room: Northern Hemisphere E1/E2
Theme: DISEASE / DATA
  • Hyojung Paik, UCSF, United States
  • Ah-Young Chung, Korea University, Korea, Republic of
  • Hae-Chul Park, Korea University, Korea, Republic of
  • Rae Woong Park, Ajou University, Korea, Republic of
  • Kyoungho Suk, Kyungpook National University, Korea, Republic of
  • Atul Butte, UCSF, United States
  • Jihyun Kim, Ajou University, Korea, Republic of
  • Hyosil Kim, Ajou University, Korea, Republic of

Area Session Chair: Yves Moreau

Prediction of new disease indications for approved drugs by computational methods has been based largely on the genomics signatures of drugs and diseases. We propose a method for drug repositioning that uses the clinical signatures extracted from electronic medical records of a tertiary hospital, including > 9.4 M laboratory tests from > 530,000 patients, in addition to diverse genomics signatures. Cross-validation shows this approach outperforms various predictive models based on genomics signatures. The prediction suggests that terbutaline sulfate, which is widely used for asthma, is a promising candidate for amyotrophic lateral sclerosis, for which there are few therapeutic options. In in vivo tests, terbutaline sulfate prevents defects in neuromuscular degeneration and also has therapeutic potential. Cotreatment with a β2-adrenergic receptor antagonist, butoxamine, suggests that the effect of terbutaline is mediated by activation of β2-adrenergic receptors. Our approach suggests that EMRs are valuable resources for discovering novel indications of drugs.

TP094 (HT) - Fast and accurate computation of differential splicing across multiple conditions
Date: Tuesday, July 12 12:20 pm - 12:40 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Jc Entizne, Pompeu Fabra University, Spain
  • A Pages, Pompeu Fabra University, Spain
  • Jl Trincado, Pompeu Fabra University, Spain
  • Gp Alamancos, Pompeu Fabra University, Spain
  • M Skalic, Pompeu Fabra University, Spain
  • N Bellora, Pompeu Fabra University, Spain
  • Eduardo Eyras, Pompeu Fabra University, Spain

Area Session Chair: Scott Markel

Abstract

Alternative splicing plays an essential role in many cellular processes in eukaryotes, and high-throughput RNA sequencing has allowed genome-wide studies of splicing across multiple conditions. However, the increasing number of data sets represents a major computational challenge and there are no dedicated tools for the study of splicing changes across multiple conditions. We describe SUPPA (Alamancos et al. 2015), a computational tool to calculate relative inclusion values of alternative splicing events from transcript quantification. Using simulated and experimental datasets, SUPPA achieves similar accuracies compared to standard methodologies but is a thousand times faster. We extended SUPPA to calculate differential splicing across multiple conditions. Applied to data across different stages of cell differentiation, SUPPA uncovers new splicing regulatory networks governing specific cell fates. SUPPA facilitates the study of splicing regulation across multiple conditions with large numbers of samples and limited computational resources.
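The idea of computing relative inclusion values from transcript quantification can be sketched as follows. This is a minimal illustration, not SUPPA's actual code or command-line interface: the function names, the transcript identifiers and the abundance values are all hypothetical, and the event definition (which transcripts carry the inclusion form) is assumed to be given.

```python
import math
from typing import Dict, Iterable


def psi(tpm: Dict[str, float], inclusion: Iterable[str], total: Iterable[str]) -> float:
    """Relative inclusion of one event in one sample.

    Numerator: summed abundance of transcripts carrying the inclusion form.
    Denominator: summed abundance of all transcripts that define the event.
    """
    num = sum(tpm.get(t, 0.0) for t in inclusion)
    den = sum(tpm.get(t, 0.0) for t in total)
    if den == 0:
        return math.nan  # event not expressed; inclusion is undefined
    return num / den


def delta_psi(tpm_a, tpm_b, inclusion, total) -> float:
    """Change in relative inclusion between two conditions (b minus a)."""
    return psi(tpm_b, inclusion, total) - psi(tpm_a, inclusion, total)


# Hypothetical transcript abundances (e.g. TPM) in two conditions.
cond_a = {"tx1": 30.0, "tx2": 10.0, "tx3": 5.0}
cond_b = {"tx1": 10.0, "tx2": 30.0, "tx3": 5.0}
inclusion = ["tx1"]      # transcripts containing the alternative exon (assumed)
total = ["tx1", "tx2"]   # all transcripts involved in the event (assumed)

print(psi(cond_a, inclusion, total))                 # 0.75
print(delta_psi(cond_a, cond_b, inclusion, total))   # -0.5
```

Because the calculation reuses existing transcript abundances rather than re-processing reads, it is cheap to repeat across many samples and conditions.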

Impact

Alternative pre-mRNA splicing diversifies the repertoire of transcripts in multicellular organisms, thereby providing a complex layer of gene regulation. There is increasing evidence that alternative splicing plays a crucial role in development and disease, and it has been identified as a key regulatory mechanism capable of triggering undifferentiated cell states (Gabut et al. 2011, Han et al. 2013). High-throughput sequencing technologies allow the determination of splicing patterns across multiple conditions, but pose major computational challenges. SUPPA meets these challenges by allowing for fast computation of splicing patterns across multiple conditions. SUPPA’s accuracy has been extensively tested using RNA sequencing data for a 23-point time-course of Arabidopsis plants transferred from 20°C to 4°C, and by comparison with an RT-PCR platform using the same samples (Zhang et al. 2015). This has moreover facilitated the identification of new splicing changes in response to temperature. We have applied SUPPA to data across different stages of cell differentiation in human to uncover novel regulatory programs of pluripotency controlled by RNA binding proteins. In summary, SUPPA provides a powerful means to uncover new relevant gene regulatory mechanisms and allows the systematic analysis of splicing by small labs with limited computational resources (Sebestyen et al. 2016). Finally, SUPPA is developed in Python and is an open source project with multiple contributors (https://bitbucket.org/regulatorygenomicsupf/suppa).

Alamancos et al. (2015). RNA 21(9):1521-31.
Zhang et al. (2015). New Phytol 208(1):96-101.
Sebestyen et al. (2016). http://biorxiv.org/content/early/2015/08/02/023010
Gabut et al. (2011). Cell 147:132-146.
Han et al. (2013). Nature 498(7453):241-5.

TP095 (LBR) - Rapid Translation Initiation Prevents Mitochondrial Localization of mRNA
Date: Tuesday, July 12 12:20 pm - 12:40 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS / GENES
  • Thomas Poulsen, National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Kenichiro Imai, National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Martin Frith, National Institute of Advanced Industrial Science and Technology (AIST), Japan
  • Paul Horton, National Institute of Advanced Industrial Science and Technology (AIST), Japan

Area Session Chair: Natasa Przulj

The mRNAs of some, but not all, nuclear encoded mitochondrial proteins localize to the periphery of mitochondria. Previous studies have shown that both the nascent polypeptide chain and an mRNA binding protein play a role in this phenomenon, and have noted a positive correlation between mRNA length and mitochondrial localization. Here, we report the first investigation into the relationship between mRNA translation initiation rate and mRNA mitochondrial localization. Our results indicate that translation initiation promoting factors such as Kozak sequences are associated with cytosolic localization, while inhibiting factors such as 5' UTR secondary structure correlate with mitochondrial localization. Moreover, the frequencies of nucleotides in various positions of the 5' UTR show higher correlation with localization than those of the 3' UTR. These results suggest that rapid translation initiation may prevent mRNA mitochondrial localization. Interestingly, this may explain why short mRNAs, which are thought to initiate translation rapidly, seldom localize to mitochondria. We therefore propose a model in which translating mRNA has reduced mobility and tends not to reach the mitochondria. Finally, we explore this model with a simulation of mRNA diffusion using previously estimated translation initiation probabilities and confirm that our model produces localization values similar to those measured in experimental studies.

TP096 (PT) - Comparative Analyses of Population-scale Phenomic Data in Electronic Medical Records Reveal Race-specific Disease Networks
Date: Tuesday, July 12 12:20 pm - 12:40 pm
Room: Northern Hemisphere E1/E2
Theme: DISEASE / SYSTEMS
  • Benjamin S. Glicksberg, Icahn School of Medicine at Mount Sinai, United States
  • Li Li, Icahn School of Medicine at Mount Sinai, United States
  • Marcus A. Badgeley, Icahn School of Medicine at Mount Sinai, United States
  • Khader Shameer, Icahn School of Medicine at Mount Sinai, United States
  • Roman Kosoy, Icahn School of Medicine at Mount Sinai, United States
  • Noam D. Beckmann, Icahn School of Medicine at Mount Sinai, United States
  • Nam Pho, Harvard Medical School, United States
  • Joerg Hakenberg, Icahn School of Medicine at Mount Sinai, United States
  • Meng Ma, Icahn School of Medicine at Mount Sinai, United States
  • Kristin L. Ayers, Icahn School of Medicine at Mount Sinai, United States
  • Gabriel E. Hoffman, Icahn School of Medicine at Mount Sinai, United States
  • Shuyu Dan Li, Icahn School of Medicine at Mount Sinai, United States
  • Eric E. Schadt, Icahn School of Medicine at Mount Sinai, United States
  • Chirag J. Patel, Harvard Medical School, United States
  • Rong Chen, Icahn School of Medicine at Mount Sinai, United States
  • Joel T. Dudley, Icahn School of Medicine at Mount Sinai, United States

Area Session Chair: Yves Moreau

Motivation: Underrepresentation of racial groups represents an important challenge and major gap in phenomics research. Most of the current human phenomics research is based primarily on European populations; hence it is an important challenge to expand it to consider other population groups. One approach is to utilize data from EMR databases that contain patient data from diverse demographics and ancestries. The implications of this racial underrepresentation of data can be profound regarding effects on the healthcare delivery and actionability. To the best of our knowledge, our work is the first attempt to perform comparative, population-scale analyses of disease networks across three different populations, namely Caucasian (EA), African American (AA), and Hispanic/Latino (HL).
Results: We compared susceptibility profiles and temporal connectivity patterns for 1,988 diseases and 37,282 disease pairs represented in a clinical population of 1,025,573 patients. Accordingly, we revealed appreciable differences in disease susceptibility, temporal patterns, network structure, and underlying disease connections between EA, AA, and HL populations. We found 2,158 significantly comorbid diseases for the EA cohort, 3,265 for AA, and 672 for HL. We further outlined key disease pair associations unique to each population as well as categorical enrichments of these pairs. Finally, we identified 51 key “hub” diseases that are the focal points in the race-centric networks and of particular clinical importance. Incorporating race-specific disease co-morbidity patterns will produce a more accurate and complete picture of the disease landscape overall and could support more precise understanding of disease relationships and patient management towards improved clinical outcomes.

TP097 (PT) - Using genomic annotations increases statistical power to detect eGenes
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Dat Duong, UCLA, United States
  • Jennifer Zou, UCLA, United States
  • Farhad Hormozdiari, School of Computing Science, UCLA, United States
  • Jae Hoon Sul, Brigham and Women's Hospital, Boston, USA, United States
  • Jason Ernst, UCLA, United States
  • Buhm Han, Asan Institute for Life Sciences, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Republic of Korea, Korea, Republic of
  • Eleazar Eskin, University of California, Los Angeles, United States

Area Session Chair: Janet Kelso

Expression quantitative trait loci (eQTL) are genetic variants that affect gene expression. In eQTL studies, one important task is to find eGenes, or genes whose expression is associated with at least one eQTL. The standard statistical method to determine if a gene is an eGene requires association testing at all nearby variants and the permutation test to correct for multiple testing. The standard method, however, does not consider genomic annotation of the variants. In practice, variants near gene transcription start sites or certain histone modifications are likely to regulate gene expression. In this paper, we introduce a novel eGene detection method that considers this empirical evidence and thereby increases the statistical power. We applied our method to the liver Genotype-Tissue Expression (GTEx) data using distance from transcription start sites, DNase hypersensitivity sites, and six histone modifications as the genomic annotations for the variants. Each of these annotations helped us detect more candidate eGenes. Distance from transcription start site appears to be the most important annotation; specifically, using this annotation, our method discovered 50% more candidate eGenes than the standard permutation method.
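The standard permutation approach mentioned in the abstract can be sketched as below. This illustrates only the baseline test, not the annotation-aware method presented in the talk. The test statistic (maximum absolute Pearson correlation across nearby variants), the function names and the toy genotype data are assumptions made for illustration.

```python
import random
from statistics import mean
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def max_abs_corr(genotypes: List[Sequence[float]], expression: Sequence[float]) -> float:
    """Strongest association between the gene and any nearby variant."""
    return max(abs(pearson(g, expression)) for g in genotypes)


def egene_pvalue(genotypes, expression, n_perm: int = 999, seed: int = 0) -> float:
    """Gene-level p-value: shuffle expression across individuals and ask how
    often the best variant in permuted data matches or beats the observed one.
    Taking the maximum over variants corrects for testing many variants."""
    rng = random.Random(seed)
    observed = max_abs_corr(genotypes, expression)
    shuffled = list(expression)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max_abs_corr(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data: 12 individuals, 3 variants coded as allele counts (hypothetical).
genotypes = [
    [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
    [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2, 2, 1, 0, 0, 2, 1],
]
expression = [float(g) for g in genotypes[0]]  # expression tracks variant 1

p_value = egene_pvalue(genotypes, expression)
```

A method that uses genomic annotations would, in some form, give more weight to variants that are a priori more likely to be regulatory; how the presented method does so is not specified in the abstract.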

TP098 (PT) - Simultaneous prediction of enzyme orthologs from chemical transformation patterns for de novo metabolic pathway reconstruction
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS / PROTEINS
  • Yasuo Tabei, Japan Science and Technology Agency, Japan
  • Yoshihiro Yamanishi, Kyushu University, Japan
  • Masaaki Kotera, Tokyo Institute of Technology, Japan

Area Session Chair: Trey Ideker

Motivation: Metabolic pathways are an important class of molecular networks consisting of compounds, enzymes, and their interactions. The understanding of global metabolic pathways is extremely important for various applications in ecology and pharmacology. However, large parts of metabolic pathways remain unknown, and most organism-specific pathways contain many missing enzymes.
Results: In this study we propose a novel method to predict the enzyme orthologs that catalyze the putative reactions to facilitate the de novo reconstruction of metabolic pathways from metabolome-scale compound sets. The algorithm detects the chemical transformation patterns of substrate-product pairs using chemical graph alignments, and constructs a set of enzyme-specific classifiers to simultaneously predict all the enzyme orthologs that could catalyze the putative reactions of the substrate-product pairs in the joint learning framework. The originality of the method lies in its ability to make predictions for thousands of enzyme orthologs simultaneously, as well as its extraction of enzyme-specific chemical transformation patterns of substrate-product pairs. We demonstrate the usefulness of the proposed method by applying it to some ten thousands of metabolic compounds, and analyze the extracted chemical transformation patterns that provide insights into the characteristics and specificities of enzymes. The proposed method will open the door to both primary (central) and secondary metabolism in genomics research, increasing research productivity to tackle a wide variety of environmental and public health matters.

TP099 (PT) - Classifying and Segmenting Microscopy Images with Deep Multiple Instance Learning
Date: Tuesday, July 12 2:00 pm - 2:20 pm
Room: Northern Hemisphere E1/E2
Theme: DATA
  • Oren Kraus, University of Toronto, Canada
  • Lei Jimmy Ba, University of Toronto, Canada
  • Brendan Frey, University of Toronto, Canada

Area Session Chair: Curtis Huttenhower

Motivation: High content screening (HCS) technologies have enabled large scale imaging experiments for studying cell biology and for drug screening. These systems produce hundreds of thousands of microscopy images per day and their utility depends on automated image analysis. Recently, deep learning approaches that learn feature representations directly from pixel intensity values have dominated object recognition challenges. These tasks typically have a single centred object per image and existing models are not directly applicable to microscopy datasets. Here we develop an approach that combines deep convolutional neural networks (CNNs) with multiple instance learning (MIL) in order to classify and segment microscopy images using only whole image level annotations.
Results: We introduce a new neural network architecture that uses MIL to simultaneously classify and segment microscopy images with populations of cells. We base our approach on the similarity between the aggregation function used in MIL and pooling layers used in CNNs. To facilitate aggregating across large numbers of instances in CNN feature maps we present the Noisy-AND MIL pooling function, a new MIL operator that is robust to outliers. Combining CNNs with MIL enables training CNNs using whole microscopy images with image level labels. We show that training end-to-end MIL CNNs outperforms several previous methods on both mammalian and yeast datasets without requiring any segmentation steps.
Availability: We will make our implementation and training data available for the final version of the manuscript.
Contact: oren.kraus@mail.utoronto.ca
Supplementary information: Supplementary data are available at Bioinformatics online.
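The contrast between a pooling function that aggregates over all instances and plain max pooling can be sketched as follows. The abstract does not give the Noisy-AND formula, so the sigmoid-based form below, and the parameter names and defaults `a` (slope) and `b` (threshold on the fraction of positive instances), are assumptions made for illustration rather than the authors' implementation.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def noisy_and(instance_probs: Sequence[float], a: float = 10.0, b: float = 0.5) -> float:
    """Pool per-instance class probabilities into one image-level probability.

    The output is a smooth step of the mean instance probability, rescaled so
    that a mean of 0 maps to 0 and a mean of 1 maps to 1. The class is called
    present only when more than roughly a fraction b of instances support it.
    """
    if not instance_probs:
        raise ValueError("need at least one instance")
    mean_p = sum(instance_probs) / len(instance_probs)
    low = sigmoid(-a * b)
    high = sigmoid(a * (1.0 - b))
    return (sigmoid(a * (mean_p - b)) - low) / (high - low)


# One spurious high-scoring instance among 100 (hypothetical values).
outlier_image = [1.0] + [0.0] * 99
print(max(outlier_image))               # max pooling is driven by the outlier
print(noisy_and(outlier_image) < 0.01)  # mean-based pooling stays near zero
```

In a CNN the instance probabilities would come from positions in a class-specific feature map, and the threshold would typically be learned per class; both details are left out of this sketch.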

TP100 (HT) - GeneiASE: Detection of condition-dependent and static allele-specific expression from RNA-seq data without haplotype information
Date: Tuesday, July 12 2:20 pm - 2:40 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Daniel Edsgärd, KTH Royal Institute of Technology, Sweden
  • Maria Jesus Iglesias, KTH Royal Institute of Technology, Sweden
  • Sarah-Jayne Reilly, Karolinska Institute, Sweden
  • Anders Hamsten, Karolinska Institute, Sweden
  • Per Tornvall, Karolinska Institutet, Sweden
  • Jacob Odeberg, Karolinska Institutet, Sweden
  • Olof Emanuelsson, KTH Royal Institute of Technology, Sweden

Area Session Chair: Janet Kelso

Allele-specific expression (ASE) is the imbalance in transcription between maternal and paternal alleles at a locus and can be probed in single individuals using massively parallel DNA sequencing technology. Assessing ASE within a single sample provides a static picture of the ASE, but the magnitude of ASE for a given transcript may vary between different biological conditions in an individual. Such condition-dependent ASE could indicate a genetic variation with a functional role in the phenotypic difference. We developed a method, GeneiASE, to detect genes exhibiting static or condition-dependent ASE in single individuals. GeneiASE performed consistently over a range of read depths and ASE effect sizes, and did not require phasing of variants to estimate haplotypes. We applied GeneiASE on both our own and publicly available data sets, and validated a number of ASE cases using qPCR. GeneiASE is available at https://sourceforge.net/projects/geneiase/.

TP101 (PT) - Fast metabolite identification with Input Output Kernel Regression
Date: Tuesday, July 12 2:20 pm - 2:40 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS / DATA
  • Céline Brouard, Aalto University, Finland
  • Huibin Shen, Aalto University, Finland
  • Kai Dührkop, Friedrich Schiller University Jena, Germany
  • Florence d'Alché-Buc, Télécom ParisTech/Institut Mines-Télécom, France
  • Sebastian Böcker, Friedrich Schiller University Jena, Germany
  • Juho Rousu, Aalto University, Finland

Area Session Chair: Trey Ideker

An important problem in metabolomics is to identify metabolites using tandem mass spectrometry data. Machine learning methods have been proposed recently to solve this problem by predicting molecular fingerprints and matching these fingerprints against existing databases. In this work we propose to address the metabolite identification problem using a structured output prediction approach.
We use the Input Output Kernel Regression method to learn the mapping between tandem mass spectra and molecular structures. The principle of this method is to encode the structured inputs with an operator-valued kernel and the structured outputs with an output kernel. The mapping between the two structured sets is approximated by learning a function with values in the feature space associated with the output kernel and solving a pre-image problem for the prediction step. We show that our approach achieves state-of-the-art accuracy in metabolite identification. Moreover, our method has the advantage of decreasing the running times for the training step and the test step by several orders of magnitude over the preceding methods.

TP102 (PT) - PHOCOS: Inferring Multi-Feature Phenotypic Crosstalk Networks
Date: Tuesday, July 12 2:20 pm - 2:40 pm
Room: Northern Hemisphere E1/E2
Theme: DATA
  • Yue Deng, School of Pharmacy, UCSF, United States
  • Steven Altschuler, School of Pharmacy, UCSF, United States
  • Lani Wu, School of Pharmacy, UCSF, United States

Area Session Chair: Curtis Huttenhower

Motivation: Quantification of cellular changes to perturbations can provide a powerful approach to infer crosstalk among molecular components in biological networks. Existing crosstalk inference methods conduct network-structure learning based on a single phenotypic feature (e.g. abundance) of a biomarker. These approaches are insufficient for analyzing perturbation data that can contain information about multiple features (e.g. abundance, activity or localization) of each biomarker.
Results: We propose a computational framework for inferring phenotypic crosstalk (PHOCOS) that is suitable for high-content microscopy or other modalities that capture multiple phenotypes per biomarker. PHOCOS uses a robust graph-learning paradigm to predict direct effects from potential indirect effects and identify errors due to noise or missing links. The result is a multi-feature, sparse network that parsimoniously captures direct and strong interactions across phenotypic attributes of multiple biomarkers. We use simulated and biological data to demonstrate the ability of PHOCOS to recover multi-attribute crosstalk networks from cellular perturbation assays.

TP103 (PT) - Data-driven mechanistic analysis method to reveal dynamically evolving regulatory networks
Date: Tuesday, July 12 2:40 pm - 3:00 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / SYSTEMS
  • Jukka Intosalmi, Aalto University, Finland
  • Kari Nousiainen, Aalto University, Finland
  • Helena Ahlfors, The Babraham Institute, United Kingdom
  • Harri Lähdesmäki, Aalto University, Finland

Area Session Chair: Janet Kelso

Mechanistic models based on ordinary differential equations provide powerful and accurate means to describe the dynamics of molecular machinery which orchestrates gene regulation. When combined with appropriate statistical techniques, mechanistic models can be calibrated using experimental data and, in many cases, also the model structure can be inferred from time-course measurements. However, existing mechanistic models are limited in the sense that they rely on the assumption of static network structure and cannot be applied when transient phenomena affect, or rewire, the network structure. In the context of gene regulatory network inference, network rewiring results from the net impact of possible unobserved transient phenomena such as changes in signaling pathway activities or epigenome, which are generally difficult, but important, to account for.

We introduce a novel method that can be used to infer dynamically evolving regulatory networks from time-course data. Our method is based on the notion that all mechanistic ordinary differential equation models can be coupled with a latent process that approximates the network structure rewiring process. We illustrate the performance of the method using simulated data and, further, we apply the method to study the regulatory interactions during T helper 17 cell differentiation using time-course RNA sequencing data. The computational experiments with the real data show that our method is capable of capturing the experimentally verified rewiring effects of the core Th17 regulatory network. We predict Th17 lineage specific subnetworks that are activated sequentially and control the differentiation process in an overlapping manner.

TP104 (PT) - Faster and More Accurate Graphical Model Identification of Tandem Mass Spectra using Trellises
Date: Tuesday, July 12 2:40 pm - 3:00 pm
Room: Northern Hemisphere A3/A4
Theme: PROTEINS
  • Shengjie Wang, University of Washington, United States
  • John Halloran, University of Washington, United States
  • Jeff Bilmes, University of Washington, United States
  • William Stafford Noble, University of Washington, United States

Area Session Chair: Trey Ideker

Tandem mass spectrometry (MS/MS) is the dominant high throughput technology for identifying and quantifying proteins in complex biological samples. Analysis of the tens of thousands of fragmentation spectra produced by an MS/MS experiment begins by assigning to each observed spectrum the peptide that is hypothesized to be responsible for generating the spectrum. This assignment is typically done by searching each spectrum against a database of peptides. To our knowledge, all existing MS/MS search engines compute scores individually between a given observed spectrum and each possible candidate peptide from the database. In this work, we use a trellis, a data structure capable of jointly representing a large set of candidate peptides, to avoid redundantly recomputing common sub-computations among different candidates. We show how trellises may be used to significantly speed up existing scoring algorithms, and we theoretically quantify the expected speed-up afforded by trellises. Furthermore, we demonstrate that compact trellis representations of whole sets of peptides enable efficient discriminative learning of a dynamic Bayesian network for spectrum identification, leading to greatly improved peptide identification accuracy.
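The benefit of representing candidate peptides jointly can be illustrated with a much simpler structure than the trellis described above: a prefix tree in which each shared prefix has its cumulative residue mass computed once. This is a sketch of the sharing idea only, under the assumption that prefix masses are a representative sub-computation; the class and function names are hypothetical and it is not the authors' data structure or scoring function.

```python
from typing import Dict, List, Tuple

# Monoisotopic residue masses in daltons for a few amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259,
}


class Node:
    """One prefix shared by every peptide that starts with it."""

    def __init__(self, mass: float) -> None:
        self.mass = mass                      # cumulative mass of the prefix
        self.children: Dict[str, "Node"] = {}


def build_prefix_tree(peptides: List[str]) -> Tuple[Node, int]:
    """Insert all peptides; return the root and the number of mass additions."""
    root = Node(0.0)
    additions = 0
    for peptide in peptides:
        node = root
        for aa in peptide:
            child = node.children.get(aa)
            if child is None:
                # Computed only the first time this prefix is seen.
                child = Node(node.mass + RESIDUE_MASS[aa])
                node.children[aa] = child
                additions += 1
            node = child
    return root, additions


def prefix_masses(root: Node, peptide: str) -> List[float]:
    """Read back the stored prefix masses for one peptide."""
    node, masses = root, []
    for aa in peptide:
        node = node.children[aa]
        masses.append(node.mass)
    return masses


candidates = ["GASP", "GASK", "GAVL"]
root, shared_additions = build_prefix_tree(candidates)
naive_additions = sum(len(p) for p in candidates)
print(shared_additions, naive_additions)  # 7 12
```

Scoring each candidate independently repeats the work for the shared prefixes "G", "GA" and "GAS"; the joint representation does it once, and the saving grows with the number of candidates that overlap.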

TP105 (HT) - CD30 cell graphs of Hodgkin lymphoma are not scale-free—an image analysis approach
Date: Tuesday, July 12 2:40 pm - 3:00 pm
Room: Northern Hemisphere E1/E2
Theme: DATA / DISEASE
  • Hendrik Schäfer, Johann Wolfgang Goethe Universität, Germany
  • Tim Schäfer, Institute of Computer Science, Department of Molecular Bioinformatics, Germany
  • Joerg Ackermann, Johann Wolfgang Goethe Universität, Germany
  • Norbert Dichter, Institute of Computer Science, Department of Molecular Bioinformatics, Germany
  • Claudia Döring, Senckenberg Institute of Pathology, Germany
  • Sylvia Hartmann, Senckenberg Institute of Pathology, Germany
  • Martin-Leo Hansmann, Senckenberg Institute of Pathology, Germany
  • Ina Koch, Johann Wolfgang Goethe University Frankfurt am Main, Germany

Area Session Chair: Curtis Huttenhower

In this talk, we present an investigation from the field of digital pathology. Using whole slide images, we analyzed the cell distribution of CD30 positive cells in Hodgkin lymphoma (HL). HL is a malignancy of the immune system that usually originates from B cells. For diagnosis, biopsies are taken from patients and immunostained. We detected cells in digitized versions of the images using a custom imaging pipeline. The spatial distribution of CD30 cells in the tissue was modeled as a CD30 cell graph. We found that the cell distribution in the tissue is not random. The cells show pronounced clustering in the tissue, which is higher for the lymphoma cases. The vertex degree distributions of the graphs could be modeled by the Gamma distribution, and thus were not scale-free. Our findings are a first step towards modeling the complex spatial interactions of different cell types in the lymph node.
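Building a cell graph from detected cell positions and reading off its vertex degree distribution can be sketched as below. The abstract does not state how edges are defined, so connecting cells that lie within a fixed distance of each other is an assumption, as are the function names, the coordinates and the threshold.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def cell_graph_degrees(cells: Sequence[Point], max_dist: float) -> List[int]:
    """Degree of each cell when cells closer than max_dist are connected."""
    degrees = [0] * len(cells)
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            if math.dist(cells[i], cells[j]) <= max_dist:
                degrees[i] += 1
                degrees[j] += 1
    return degrees


def degree_distribution(degrees: Sequence[int]) -> Dict[int, float]:
    """Fraction of vertices having each observed degree."""
    counts = Counter(degrees)
    n = len(degrees)
    return {k: counts[k] / n for k in sorted(counts)}


# Hypothetical cell centroids: a tight cluster of three plus one isolated cell.
cells = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
degrees = cell_graph_degrees(cells, max_dist=1.5)
print(degrees)                       # [2, 2, 2, 0]
print(degree_distribution(degrees))  # {0: 0.25, 2: 0.75}
```

On real slides the empirical distribution would then be compared against candidate models, for example a Gamma distribution versus a power law, to decide whether the graph is scale-free.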

TP106 (PT) - A novel method for discovering local spatial clusters of genomic regions with functional relationships from DNA contact maps
Date: Tuesday, July 12 3:30 pm - 3:50 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Xihao Hu, The Chinese University of Hong Kong, Hong Kong
  • Christina Huan Shi, The Chinese University of Hong Kong, Hong Kong
  • Kevin Yip, The Chinese University of Hong Kong, Hong Kong

Area Session Chair: Janet Kelso

Motivation: The three-dimensional structure of genomes makes it possible for genomic regions not adjacent in the primary sequence to be spatially proximal. These DNA contacts have been found to be related to various molecular activities. Previous methods for analyzing DNA contact maps obtained from Hi-C experiments have largely focused on studying individual interactions, forming spatial clusters composed of contiguous blocks of genomic locations, or classifying these clusters into general categories based on some global properties of the contact maps.

Results: Here we describe a novel computational method that can flexibly identify small clusters of spatially proximal genomic regions based on their local contact patterns. Using simulated data that highly resemble Hi-C data obtained from real genome structures, we demonstrate that our method identifies spatial clusters that are more compact than those found by methods previously used for clustering genomic regions based on DNA contact maps. The clusters identified by our method enable us to confirm functionally-related genomic regions previously reported to be spatially proximal in different species. We further show that each genomic region can be assigned a numeric affinity value that indicates its degree of participation in each local cluster, and these affinity values correlate quantitatively with DNase I hypersensitivity, gene expression, super enhancer activities and replication timing in a cell type specific manner. We also show that these cluster affinity values can precisely define boundaries of reported topologically associating domains (TADs), and further define local sub-domains within each domain.

Availability: The source code of BNMF and tutorials on how to use the software to extract local clusters from contact maps are available at http://yiplab.cse.cuhk.edu.hk/bnmf/.

TP107 (PT) - BioASF: A Framework for Automatically Generating Executable Pathway Models Specified in BioPAX
Date: Tuesday, July 12 3:30 pm - 3:50 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS / DATA
  • Reza Haydarlou, VU University Amsterdam, Netherlands
  • Annika Jacobsen, VU University Amsterdam, Netherlands
  • Nicola Bonzanni, VU University Amsterdam, Netherlands
  • K. Anton Feenstra, VU University Amsterdam, Netherlands
  • Sanne Abeln, VU University, Netherlands
  • Jaap Heringa, VU University Amsterdam, Netherlands

Area Session Chair: Trey Ideker

Motivation: Biological pathways play a key role in most cellular functions. To better understand these functions, diverse computational and cell biology researchers use biological pathway data for various analysis and modeling purposes. For specifying these biological pathways, a community of researchers has defined BioPAX and provided various tools for creating, validating, and visualizing BioPAX models. However, a generic software framework for simulating BioPAX models is missing. Here, we attempt to fill this gap by introducing a generic simulation framework for BioPAX. The framework explicitly separates the execution model from the model structure as provided by BioPAX, with the advantage that the modelling process becomes more reproducible and intrinsically more modular; this ensures natural biological constraints are satisfied upon execution. The framework is based on the principles of discrete event systems and multi-agent systems, and is capable of automatically generating a hierarchical multi-agent system for a given BioPAX model.
Results: To demonstrate the applicability of the framework, we simulated two types of biological network models: a gene regulatory network modeling the haematopoietic stem cell regulators and a signal transduction network modeling the Wnt/β-catenin signaling pathway. We observed that the results of the simulations performed using our framework were entirely consistent with the simulation results reported by the researchers who developed the original models in a proprietary language.
Availability and Implementation: The framework, implemented in Java, is open source and its source code, documentation, and tutorial are available at http://www.ibi.vu.nl/programs/BioASF.
Contact: j.heringa@vu.nl

TP108 (LBR) - Tracking the Evolution of 3D Gene Organization
Date: Tuesday, July 12 3:50 pm - 4:10 pm
Room: Northern Hemisphere A1/A2
Theme: GENES
  • Alon Diament, Tel Aviv University, Israel
  • Tamir Tuller, Tel Aviv University, Israel

Area Session Chair: Janet Kelso

Presentation Overview: Show

The study of eukaryotic genomic organization has been rapidly advancing in recent years, with next generation sequencing technologies, such as Hi-C, providing large scale measurements of 3D genomic organization at unprecedented resolution. It has recently been shown that the distribution of genes in eukaryotic genomes is not random and that their organization is strongly related to gene expression and function. It has also been shown that some level of conservation of this organization exists between organisms. However, almost all studies of 3D genomic organization have analyzed each organism independently of the others.

Here we propose a novel approach for inter-organismal analysis of the evolution of the 3D organization of genes based on a network representation of Hi-C data from S. cerevisiae and S. pombe. We report global signals of conservation and re-organization of genes in the genome that are correlated with changes in their functionality and expression. Furthermore, we describe algorithms for identifying spatially co-evolving orthologous modules (SCOMs) and demonstrate them for various proposed types of modules, including: modules of co-localizing genes with conserved 3D positions; modules of genes that underwent significant changes in their 3D co-localization during evolution; and additional, more complex gene arrangements.

We show that this approach enables identifying biologically relevant modules of co-evolving genes with shared function. The approach is expected to contribute to the study of genome evolution, gene expression, and even tumorigenesis.

TP109 (HT) - PSAMM: A Portable System for the Analysis of Metabolic Models
Date: Tuesday, July 12 3:50 pm - 4:10 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS
  • Jon Steffensen, University of Rhode Island, United States
  • Keith Dufault-Thompson, University of Rhode Island, United States
  • Ying Zhang, University of Rhode Island, United States

Area Session Chair: Trey Ideker

Presentation Overview: Show

The broad application of genome-scale metabolic modeling has made it a useful technique for tackling fundamental questions in biological research and engineering. Today over 100 models have been constructed for organisms of diverse metabolic activities spanning all three kingdoms of life. These models, however, have been curated independently following different conventions. The maintenance of model consistency has been challenging due to the lack of consensus in model representation and the absence of integrated modeling software for associating mathematical simulations with the annotation and biological interpretation of metabolic models. To solve this problem, we developed a new software package, PSAMM, and a new model format that incorporates heterogeneous, model-specific annotation information into modular representations of model definitions and simulation settings. PSAMM provides significant advances in standardizing the workflow of model annotation and consistency checking. Compared to existing tools, PSAMM supports more flexible configurations and is more efficient in running constraint-based simulations.

TP110 (HT) - A Low-Latency, Big Database System and Browser for Storage, Querying and Visualization of 3D Genomic Data
Date: Tuesday, July 12 4:10 pm - 4:30 pm
Room: Northern Hemisphere A1/A2
Theme: GENES / DATA
  • Alexander Butyaev, McGill University, Canada
  • Ruslan Mavlyutov, University of Fribourg, Switzerland
  • Mathieu Blanchette, McGill University, Canada
  • Philippe Cudré-Mauroux, University of Fribourg, Switzerland
  • Jérôme Waldispühl, McGill University, Canada

Area Session Chair: Janet Kelso

Presentation Overview: Show

Recent releases of genome three-dimensional (3D) structures have the potential to transform our understanding of genomes. Nonetheless, storage technology and visualization tools need to evolve to offer the scientific community fast and convenient access to these data. We simultaneously introduce a database system to store and query 3D genomic data (3DBG), and a 3D genome browser to visualize and explore 3D genome structures (3DGB). We benchmark 3DBG against state-of-the-art systems and demonstrate that it is faster than previous solutions and, importantly, scales gracefully with the size of the data. We also illustrate the usefulness of our 3D genome Web browser to explore human genome structures.
The 3D genome browser is available at http://3dgb.cs.mcgill.ca/.

TP111 (PT) - Linear effects models of signaling pathways from combinatorial perturbation data
Date: Tuesday, July 12 4:10 pm - 4:30 pm
Room: Northern Hemisphere A3/A4
Theme: SYSTEMS
  • Ewa Szczurek, University of Warsaw, Poland
  • Niko Beerenwinkel, ETH Zurich, Switzerland

Area Session Chair: Trey Ideker

Presentation Overview: Show

Motivation: Perturbations constitute the central means to study signaling pathways. Interrupting components of the pathway and analyzing observed effects of those interruptions can give insight into unknown connections within the signaling pathway itself, as well as the link from the pathway to the effects. Different pathway components may have different individual contributions to the measured perturbation effects, such as gene expression changes. Those effects will be observed in combination when the pathway components are perturbed. Extant approaches focus either on the reconstruction of pathway structure or on resolving how the pathway components control the downstream effects.
Results: Here, we propose a linear effects model, which can be applied to infer both from combinatorial perturbation data. We use simulated data to demonstrate the accuracy of learning the pathway structure as well as estimation of the individual contributions of pathway components to the perturbation effects. The practical utility of our approach is illustrated by an application to perturbations of the mitogen-activated protein kinase pathway in Saccharomyces cerevisiae.
Availability: lem is available as an R package at http://www.mimuw.edu.pl/~szczurek/lem
Contact: niko.beerenwinkel@bsse.ethz.ch
Supplementary information: Supplementary data are available at Bioinformatics online.
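
Illustration (not from the authors): the notion that effects of combined perturbations add up from individual component contributions can be sketched as an ordinary least-squares fit. This is a deliberately simplified, assumed setup; it ignores pathway structure, is written in Python rather than R, and does not use or reproduce the lem package. The three components, the design matrix, and the effect values are all hypothetical.

```python
def solve_least_squares(x, y):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    p = len(x[0])
    # Augmented matrix [X^T X | X^T y].
    a = [[sum(row[i] * row[j] for row in x) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(x, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        piv = a[col][col]
        a[col] = [val / piv for val in a[col]]
        for r in range(p):
            if r != col:
                factor = a[r][col]
                a[r] = [vr - factor * vc for vr, vc in zip(a[r], a[col])]
    return [a[i][p] for i in range(p)]


# Hypothetical design: one row per experiment, one column per pathway
# component (A, B, C); 1 means the component was perturbed.
design = [
    [1, 0, 0],  # A alone
    [0, 1, 0],  # B alone
    [0, 0, 1],  # C alone
    [1, 1, 0],  # A and B together
    [0, 1, 1],  # B and C together
]

# Hypothetical per-component contributions used to generate the toy data.
true_contributions = [2.0, -1.0, 0.5]
observed = [sum(d * c for d, c in zip(row, true_contributions))
            for row in design]

contributions = solve_least_squares(design, observed)
print([round(c, 6) for c in contributions])
```

Because the toy observations are noise-free and exactly additive, the fit recovers the generating contributions; real perturbation data would be noisy and would also require inferring which components lie upstream of which.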