Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


#ISMB2016

Sponsors

Silver:
Bronze:
F1000
Recursion Pharmaceuticals

Copper:
Iowa State University

General and Travel Fellowship Sponsors:
Seven Bridges, GBP, GigaScience, Overleaf, PLOS Computational Biology, BioMed Central, 3DS Biovia, Genentech, HiTSeq, IRB-Group, Schrödinger, TOMA Biosciences

ISMB 2016 Special Sessions

Attention Conference Presenters - please review the Speaker Information Page available here.

SST01:  Lost in ribosome-profiling
Sunday, July 10, 2:00 pm – 4:30 pm Room: Northern Hemisphere BCD
Organizer(s):

Tamir Tuller is an Associate Professor, the head of the Laboratory of Computational Systems and Synthetic Biology at Tel-Aviv University, and co-founder of SynVaccine Ltd. He has a multidisciplinary academic background, including two PhDs from TAU, one in computer science and one in medical science. He is the author of more than 110 peer-reviewed scientific articles and has received various awards and fellowships (e.g. Sackler, Yeshaya Horowitz, Asi Shnidman, Koshland, ARCHES). His multidisciplinary research focuses on various aspects of gene expression, and specifically on mRNA translation, aiming among other goals at modeling and engineering it.

Presentation Overview:

Ribosome profiling (or Ribo-Seq) [Ingolia et al., Science, 2009] is a relatively new methodology for studying translation, and currently the most popular one. Specifically, it potentially provides ribosome densities over an entire transcriptome in vivo at the resolution of a single nucleotide. Ribo-Seq has been employed extensively in recent years to decipher various fundamental aspects of gene expression regulation [Guo et al., Nature, 2010; Tuller et al., Cell, 2010; Oh et al., Cell, 2011; Ingolia et al., Cell, 2011; Li et al., Nature, 2012; Brar et al., Science, 2012; Bazzini et al., Science, 2012; Andrew et al., Nature, 2012; Lee et al., PNAS, 2012; Stern-Ginossar et al., Science, 2012; Stumpf et al., Mol. Cell, 2013; Liu et al., Mol. Cell, 2013; Lee et al., Nature, 2013; Wang et al., Nature, 2014; Zid and O’Shea, Nature, 2014; Jan et al., Science, 2014; Ishimura et al., Science, 2014; Geula et al., Science, 2015; Jovanovic et al., Science, 2015; Shieh et al., Science, 2015; Cho et al., Science, 2015].

Among other findings, it was shown that the speed at which ribosomes progress along the mRNA (and thus central biomedical phenomena such as co-translational protein folding; ribosomal allocation, jamming, and drop-off; organismal fitness and human diseases; and more) is affected by different local features of the coding sequence (e.g. interaction between the ribosome and the nascent peptide, mRNA folding, tRNA availability, etc.), and by trans intracellular regulatory mechanisms. However, despite its promising throughput, analysis of Ribo-Seq data has led to contradictory conclusions between studies and to active discussions about the protocol's biases (see, for example, Tuller and Zur, Nucleic Acids Res., 2015; Stadler and Fire, RNA, 2011; Charneski and Hurst, PLoS Biol., 2013; Artieri and Fraser, Genome Res., 2014; Dana and Tuller, Nucleic Acids Res., 2014; Gardin et al., eLife, 2014; Gerashchenko and Gladyshev, Nucleic Acids Res., 2014; Hussmann et al., PLoS Genet., 2015).

Ribo-Seq analysis is especially challenging for, among others, the following reasons: 1) Computationally efficient analyses of the extremely large sets of NGS data generated in the experiment (typically millions of reads) are required. 2) The data and problems are statistically very challenging, since Ribo-Seq tends to include non-trivial biases, is based on very short reads (of length ~30 nt), and often (in contrast to other NGS-based approaches) the aim is to infer signals at single-nucleotide resolution. 3) The phenomena studied with Ribo-Seq include non-trivial intracellular biophysical phenomena (e.g. co-translational folding and ribosomal movement) and molecular evolutionary processes (e.g. the evolution of codon usage bias). Thus, successful computational Ribo-Seq approaches should consider all (or at least many) of these aspects.
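As a minimal illustration of the single-nucleotide inference problem, the sketch below builds a per-nucleotide ribosome density profile by shifting each footprint's 5' end by a fixed offset to its approximate A-site. The function name, the CDS-relative coordinate convention, and the constant 15 nt offset are simplifying assumptions of this sketch; real pipelines calibrate length- and protocol-dependent offsets and correct for biases.

```python
def asite_density(read_5p_positions, cds_length, offset=15):
    """Toy per-nucleotide ribosome density for one coding sequence.

    read_5p_positions: CDS-relative, 0-based 5' mapping coordinates of
    ribosome footprints; each is shifted by a fixed offset to estimate
    the ribosomal A-site position it covers.
    """
    density = [0] * cds_length
    for pos in read_5p_positions:
        asite = pos + offset
        if 0 <= asite < cds_length:  # discard footprints mapping off the CDS
            density[asite] += 1
    return density

# Three ~30 nt footprints: two share a 5' end, one starts a codon later
profile = asite_density([0, 0, 3], cds_length=30)
```

Summing such profiles per codon, rather than per nucleotide, is the usual first step toward estimating codon-specific dwell times.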

Relevant topics that will be discussed include: inference and parameter estimation for biophysical and molecular evolution models based on Ribo-Seq; handling biases in Ribo-Seq; computationally efficient approaches for analyzing Ribo-Seq data; and suggestions for novel/improved experimental approaches for Ribo-Seq based on computational Ribo-Seq analysis.


Part A: Lost in ribosome-profiling
(2:00-2:30)

Bio:

Tamir Tuller is an Associate Professor, the head of the Laboratory of Computational Systems and Synthetic Biology at Tel-Aviv University, and co-founder of SynVaccine Ltd. He has a multidisciplinary academic background, including two PhDs from TAU, one in computer science and one in medical science. He is the author of more than 110 peer-reviewed scientific articles and has received various awards and fellowships (e.g. Sackler, Yeshaya Horowitz, Asi Shnidman, Koshland, ARCHES). His multidisciplinary research focuses on various aspects of gene expression, and specifically on mRNA translation, aiming among other goals at modeling and engineering it.

Session Description:

We will give a general, illustrative introduction to the session, focusing on: 1) the promise of ribosome profiling; 2) the experimental, computational, and statistical challenges related to ribosome profiling, with quantitative examples and illustrations; 3) some specific recent computational approaches and methodologies for dealing with these challenges.

Part B: The hidden code behind the genetic code: codon optimality regulates mRNA translation and stability during the maternal to zygotic transition
(2:30-3:00)

Bio:

Antonio studied Chemistry and Molecular Biology at the University of Cadiz and the Universidad Autónoma de Madrid. As an undergraduate, he worked with Gines Morata at the CBM in Madrid. Antonio did his PhD with Stephen Cohen at the EMBL (Heidelberg) (1998-2002) and a post-doc with Alex Schier at the Skirball Institute (NYU) and Harvard (2003-2006). Antonio is currently a Professor in the Genetics Department at Yale University and is part of the Computational Biology Program. His laboratory combines computational biology, genomics, biochemistry, systems biology, and developmental biology to investigate the gene regulatory interactions that shape gene expression during cellular transitions in development. Antonio has received several international awards, such as Pew Biomedical Scholar, Blavatnik Awards for Young Scientists finalist, and the Vilcek Prize for Creative Promise in Biomedical Science.

Session Description:

Developmental reprogramming requires the activation of a new program and the removal of the previous developmental program. Upon fertilization, animals initiate the maternal-to-zygotic transition, a universal developmental step whereby the genome of the embryo becomes activated and the maternally deposited mRNAs are degraded. During this transition there is a profound post-transcriptional remodeling of mRNAs. While previous studies have analyzed the function of microRNAs, the RNA-binding proteins (readers) and the regulatory code underlying the clearance and repression of thousands of mRNAs are still poorly understood.

We have combined ribosome footprinting with a novel RNA in vivo selection method to identify the regulatory function of untranslated and translated sequences during the maternal-to-zygotic transition in vertebrates. Using this method, we discovered that mRNA translation by the ribosome and codon identity have a crucial role in regulating mRNA stability and translation. We observed that thousands of mRNAs in zebrafish, Xenopus, and mouse are regulated through codon identity, whereby optimal codons mediate mRNA stabilization and non-optimal codons induce mRNA decay, depending on translation of the mRNA. Furthermore, we observe that the codon optimality defined in vivo corresponds with mRNA steady-state levels during tissue homeostasis across 36 human tissues, indicating that codon optimality is conserved across vertebrates and directly influences mRNA levels across tissues.

These results identify a novel layer of the genetic code, a "codon optimality code", within vertebrates, whereby codon identity has a regulatory function in mRNA stability and translation efficiency by modulating mRNA deadenylation and ribosome translocation. We hypothesize that this regulatory layer has profound implications for gene regulation across development, reprogramming, growth, and differentiation.

Part C: Uncovering tumor-specific amino acid vulnerabilities by differential ribosome codon reading
(3:00-3:30)

Bio:

Reuven Agami is a graduate of the Weizmann Institute of Science, Rehovot, Israel. His postgraduate training was completed at the Netherlands Cancer Institute, Amsterdam. He is currently the head of the Division of Biological Stress Response at the Netherlands Cancer Institute and, since 2008, also Professor at the Genetics Department, Erasmus Medical Center, Rotterdam. His research focuses on RNA regulation and its use to develop novel onco-genomic technologies in search of cancer genes. His recent research lines encompass a variety of topics, ranging from enhancers and their associated RNAs to mRNA translation and its utility for identifying amino-acid vulnerabilities in cancer.

Session Description:

The demand for amino acids for protein synthesis, nucleotide synthesis, and energy production is very high in the growing tumor. The metabolic changes a tumor undergoes to adapt to deregulated growth represent vulnerabilities that can be exploited for therapy. This was successfully demonstrated over the past 50 years for the amino acid asparagine, resulting in a very effective combined treatment of chemotherapy and L-asparaginase in acute lymphoblastic leukemia (ALL). To exploit metabolic vulnerabilities related to amino acids, one needs to overcome the major obstacle of identifying which amino acid is restrictive to the tumor. It is important to realize that amino acid demand depends on many genetic and environmental factors of the tumor growing in the organism. We recently developed a novel measurement approach, suitable for determining restrictive amino acids in cells, that we named diricore (differential ribosome codon reading); it is based on the ribosome profiling technology. Using diricore we have already uncovered a shortage of proline in breast cancer cell lines expanded in vivo, and in human kidney tumors. Intriguingly, proline shortage was linked to high levels of PYCR1, a key enzyme in proline production. In particular, PYCR1 knockout did not affect cell proliferation under normal culture conditions, but compromised tumor growth in vivo. These data demonstrate the capacity of diricore to identify specific amino acid shortages in growing tumors.
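The core intuition behind a diricore-style comparison can be caricatured in a few lines: count how often ribosomal A-sites sit on each codon in two samples, and look for codons with elevated relative occupancy, which may indicate a restrictive amino acid. The helper names and the plain log-ratio statistic below are illustrative assumptions of this sketch, not the published diricore method.

```python
import math
from collections import Counter

def codon_occupancy(asite_codons):
    """Fraction of ribosome footprints whose A-site sits on each codon."""
    counts = Counter(asite_codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def differential_codon_reading(sample, control):
    """Log2 occupancy ratio per codon; positive values suggest ribosomes
    dwell longer on that codon in the sample than in the control."""
    occ_s, occ_c = codon_occupancy(sample), codon_occupancy(control)
    return {c: math.log2(occ_s[c] / occ_c[c]) for c in occ_s if c in occ_c}

# Toy sample stalling on a proline codon (CCG) relative to control
diff = differential_codon_reading(
    ["CCG"] * 6 + ["AAA"] * 2,
    ["CCG"] * 2 + ["AAA"] * 6,
)
```

A real analysis would additionally control for sequence context, library size, and multiple testing before calling a codon (and hence an amino acid) restrictive.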

Part D: Statistical Methods for the Analysis of Ribosome Profiling Data
(3:30-4:00)

Bio:

Adam Olshen received his PhD in Biostatistics from the University of Washington. After eight years on the faculty at the Memorial Sloan-Kettering Cancer Center, he became a Professor of Epidemiology and Biostatistics at the University of California, San Francisco. There he also runs the Computational Biology Core in the Helen Diller Family Comprehensive Cancer Center. His main research activities involve statistical genomics. Before becoming interested in ribosome profiling, he helped to develop methods for analyzing copy number data and for integrating multiple types of genomic data.

Session Description:

During translation, messenger RNA produced by transcription is decoded by ribosomes to produce specific polypeptides. Ribosome profiling, a second-generation sequencing technology, was developed to measure the positions and counts of ribosomes. When combined with corresponding mRNA sequencing data, ribosome profiling data can give insights into translational efficiency. We developed the Babel framework to discover gene-level changes in translational efficiency between different conditions utilizing ribosome profiling data. Here we describe the Babel framework and show how its utilization has led to biologically interesting findings. We also describe a more classical regression approach to the same problem.
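In its simplest form, the quantity at stake can be computed directly: translational efficiency (TE) relates footprint counts to mRNA counts per gene, and a change in TE between conditions is a log ratio of TEs. The sketch below, with a hypothetical pseudocount, shows only this naive estimator; Babel itself models the counts statistically rather than taking plain ratios.

```python
import math

def translational_efficiency(ribo_counts, mrna_counts, pseudo=0.5):
    """Naive per-gene TE: ribosome-footprint counts over mRNA-seq counts.
    The pseudocount avoids division by zero for unobserved genes."""
    return {g: (ribo_counts[g] + pseudo) / (mrna_counts[g] + pseudo)
            for g in ribo_counts}

def delta_log2_te(te_cond_a, te_cond_b):
    """Log2 change in TE between two conditions, per gene."""
    return {g: math.log2(te_cond_b[g] / te_cond_a[g]) for g in te_cond_a}

# Two genes with equal mRNA levels but very different footprint counts
te = translational_efficiency({"g1": 100, "g2": 10}, {"g1": 100, "g2": 100})
```

The weakness of this naive ratio, that it ignores count noise at low coverage, is precisely what motivates the statistical treatment described in the talk.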

Part E: Understanding Biases in Ribosome Profiling Experiments Reveals Signatures of Translation Dynamics in Yeast
(4:00-4:30)

Bio:

Jeff Hussmann studied math and electrical engineering at Duke University before obtaining a PhD in computational and applied mathematics at the University of Texas at Austin under the supervision of William Press and Sara Sawyer. He is currently a postdoc in the labs of Jonathan Weissman and Carol Gross at UCSF. His research interests are broadly centered on understanding how to use high-throughput sequencing to produce accurate quantitative inferences about biological systems and processes, with a particular focus on understanding translational regulation and the selective forces shaping synonymous codon usage.

Session Description:

Ribosome profiling produces snapshots of the locations of actively translating ribosomes on messenger RNAs. These snapshots can be used to make inferences about translation dynamics. Recent ribosome profiling studies in yeast, however, have reached contradictory conclusions regarding the average translation rate of each codon. Some experiments have used cycloheximide (CHX) to stabilize ribosomes before measuring their positions, and these studies all counterintuitively report a weak negative correlation between the translation rate of a codon and the abundance of its cognate tRNA. In contrast, some experiments performed without CHX report strong positive correlations. To explain this contradiction, we identify unexpected patterns in ribosome density downstream of each type of codon in experiments that use CHX. These patterns are evidence that elongation continues to occur in the presence of CHX but with dramatically altered codon-specific elongation rates. The measured positions of ribosomes in these experiments therefore do not reflect the amounts of time ribosomes spend at each position in vivo. These results suggest that conclusions from experiments in yeast using CHX may need reexamination. In particular, we show that in all such experiments, codons decoded by less abundant tRNAs were being translated more slowly before the addition of CHX disrupted these dynamics.


SST02:  Compressive Omics: Making Big Data Manageable through Data Compression
Monday, July 11, 10:10 am – 12:40 pm Room: Northern Hemisphere BCD
Organizer(s):

Peter W. Rose is Site Head of the RCSB Protein Data Bank West and leads the Structural Bioinformatics Laboratory at the San Diego Supercomputer Center at UC San Diego. He received his Ph.D. in Chemistry from the Technical University of Munich, Germany, in 1990. Prior to joining UC San Diego in 2007, he held research and management positions of increasing responsibility at Pfizer Global R&D La Jolla, formerly Agouron Pharmaceuticals. He was instrumental in the establishment of the structure-based design platform at Agouron and its global adoption by Pfizer. As Director of Computational Chemistry and Bioinformatics, he oversaw Structural Bioinformatics, Structure-Based Drug Design, and Scientific Computing groups. More recently, he was a member of the Scientific Advisory Board at Dart NeuroScience.

Olgica Milenkovic graduated with an MSc degree in Mathematics from the University of Michigan in 2001. She earned her PhD at the same place in 2002, in Electrical and Computer Engineering. She then joined the University of Colorado as an Assistant Professor, and currently she is a professor at the University of Illinois Urbana-Champaign. Her main research interests are in the field of bioinformatics, information theory, signal processing, compressive sensing and error-control coding.

Presentation Overview:

The rapid growth of data in all areas of biomedical research offers new opportunities and challenges. The US National Institutes of Health has created the Big Data to Knowledge (BD2K) initiative in response to these challenges. The aim of this session is to introduce the audience to new and exciting software and accompanying algorithmic developments in data compression and their application in genomics, structural biology, biological networks, and biomedical image analysis. We will present state-of-the-art data compression and dimensionality reduction techniques that aim to enable data-intensive analysis and visualization workflows.

The speakers will seek community input and feedback about requirements for software and suggestions for new applications. This session will also benefit the scientific community by providing a perspective on the cutting-edge software development efforts for biomedical big data. The attendees will also have the opportunity to learn about NIH’s perspective on the software needs of the biomedical community, based on detailed information on the projects recently funded by the BD2K program.


Part A: Computational Biology in the 21st Century: Scaling with Compressive Algorithms
(10:10 am-10:30 am)

Bio:

Bonnie Berger is a Professor of Mathematics and Computer Science at MIT. After beginning her career working in algorithms at MIT, she was one of the pioneer researchers in computational biology and, together with the many students she has mentored, has been instrumental in defining the field. She has received numerous honors including: member of the American Academy of Arts and Sciences, the NIH Margaret Pittman Director's Lecture Award, Biophysical Society's Dayhoff Award, Technology Review Magazine's inaugural TR100 as a top young innovator, ACM Fellow, ISCB Fellow, AIMBE Fellow, NSF Career Award and Honorary Doctorate from EPFL. She currently serves as Vice President of ISCB, as Head of the Steering Committee for RECOMB, and on the NIGMS Advisory Council.   


Session Description:

The last two decades have seen an exponential increase in genomic and biomedical data, which will soon outstrip advances in computing power. Extracting new science from these massive datasets will require not only faster computers but also algorithms that scale sublinearly in the size of the datasets. We show how a novel class of algorithms that scale with the entropy of the dataset, by exploiting both its redundancy and its low fractal dimension, can be used to address large-scale challenges in genomics, personal genomics and chemogenomics.
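One way to picture such entropy-scaling behavior is a coarse-then-fine search over clustered, redundant sequence data: a query is compared against one representative per cluster, and only matching clusters are opened. The sketch below is a generic illustration of this idea under assumed predicate functions, not the specific algorithms presented in this talk.

```python
def compressive_search(query, clusters, coarse_hit, fine_hit):
    """Search only clusters whose representative matches the query:
    work scales with the number of distinct clusters (roughly, the
    data's entropy) rather than with the raw number of sequences.

    clusters: {representative_sequence: [member_sequences]}
    coarse_hit/fine_hit: caller-supplied match predicates (assumptions
    of this sketch; real tools use alignment-based scoring).
    """
    hits = []
    for rep, members in clusters.items():
        if coarse_hit(query, rep):
            hits.extend(m for m in members if fine_hit(query, m))
    return hits

# A redundant dataset collapsed into two clusters
clusters = {"AAAA": ["AAAA", "AAAT"], "GGGG": ["GGGG"]}
hits = compressive_search(
    "AAAT", clusters,
    coarse_hit=lambda q, r: q[0] == r[0],
    fine_hit=lambda q, m: sum(a != b for a, b in zip(q, m)) <= 1,
)
```

The key design point is that the fine, expensive comparison never touches clusters ruled out at the coarse stage, which is where the sublinear behavior on redundant data comes from.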


Part B: Trends and Methods in Genomic Data Compression
(10:30 am-10:50 am)

Bio:

Idoia Ochoa obtained an MSc degree in Electrical Engineering from Stanford University in 2012, where she is currently a PhD student working with professor Tsachy Weissman.

Olgica Milenkovic obtained an MSc degree in Mathematics from the University of Michigan in 2001. She earned her Ph.D. at the same place in 2002, in Electrical and Computer Engineering. Currently she is a professor at the University of Illinois Urbana-Champaign.

Tsachy Weissman obtained a BSc in Electrical Engineering from Technion in 1997, and earned his PhD at the same place in 2001. Currently he is a professor at Stanford University.


Session Description:

We present an overview of our recent results on lossless and lossy compression of heterogeneous genomic data, including raw sequencing reads, whole genomes, metagenomic samples and RNA-seq files. In particular, we focus on our contributions in the areas of lossy compression of quality scores, lossless compression of expression and ChIP-seq data, and whole-genome databases of related species. Furthermore, we describe new methods for taxonomy identification, classification and clustering of microbial communities used in two different metagenomic compression suites. Our methods combine state-of-the-art techniques from source coding, correlation and spectral clustering, hashing and filtering.


Part C: Meaningful Data Compression and Reduction of High-Throughput Sequencing Data
(10:50 am-11:10 am)

Bio:

Alexander Schliep received a PhD degree in computer science from the Center for Applied Computer Science at the Universität zu Köln, Germany (2001), working in collaboration with the Theoretical Biology and Biophysics Group at Los Alamos National Laboratory. From 2002-2009 he was the group leader of the Bioinformatics Algorithms Group in the Department for Computational Molecular Biology at the Max Planck Institute for Molecular Genetics in Berlin.

In August 2009 he joined Rutgers University as an associate professor. The position is jointly between the Department of Computer Science and the BioMaPS Institute for Quantitative Biology. He is on the graduate faculty in Computer Science and the Program in Computational Biology and Molecular Biophysics. He is also a permanent member of the Center for Discrete Mathematics and Theoretical Computer Science.


Session Description:

This project aims to develop novel computational algorithms for lossless data compression and lossy data reduction of sequencing data. The new development would allow direct downstream computation on compressed data without decompression. As such, the potential impact of the proposed development is not limited to data storage and transfer; it will also affect high-throughput sequence analysis tasks such as genome comparison and metagenomics analysis. A compressive-genomics middleware and a related Application Programming Interface (API) will be developed to allow communication with existing genomic analysis software without much new code development.


Part D: Compressive Structural Bioinformatics: High Efficiency 3D Structure Compression
(11:40 am-12:00 pm)

Bio:

Peter W. Rose is Site Head of the RCSB Protein Data Bank West and leads the Structural Bioinformatics Laboratory at the San Diego Supercomputer Center at UC San Diego. He received his Ph.D. in Chemistry from the Technical University of Munich, Germany, in 1990. Prior to joining UC San Diego in 2007, he held research and management positions of increasing responsibility at Pfizer Global R&D La Jolla, formerly Agouron Pharmaceuticals. He was instrumental in the establishment of the structure-based design platform at Agouron and its global adoption by Pfizer. As Director of Computational Chemistry and Bioinformatics, he oversaw Structural Bioinformatics, Structure-Based Drug Design, and Scientific Computing groups. More recently, he was a member of the Scientific Advisory Board at Dart NeuroScience.

Session Description:

As technologies in structural biology continue to improve, many new large, complex 3D structures are being characterized. Interactive visualization of large, complex structures and large-scale queries or structural comparisons across the entire Protein Data Bank (PDB) archive are becoming a bottleneck in terms of network bandwidth, I/O, parsing, and memory consumption. We have developed a compact and extensible representation of 3D molecules to overcome these challenges. This compact representation enables efficient data transfer for interactive visualization. In large-scale distributed parallel processing, the compressed PDB or large subsets of it can be kept in memory, leading to large efficiency gains as we move the data to the processor.
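One core trick behind compact 3D structure representations can be shown in a few lines: quantize coordinates to fixed-point integers and delta-encode them, so that spatially adjacent atoms yield small integers that a generic entropy coder compresses well. This is a hedged sketch of the general idea; the precision factor and encoding layout are assumptions, not the exact format presented in this talk.

```python
def encode_coords(coords, factor=1000):
    """Quantize coordinates (e.g. in Angstroms) to 0.001 precision and
    delta-encode: the first value is stored as-is, the rest as differences."""
    ints = [round(c * factor) for c in coords]
    return [ints[0]] + [b - a for a, b in zip(ints, ints[1:])]

def decode_coords(deltas, factor=1000):
    """Invert the delta encoding and rescale back to floats."""
    coords, acc = [], 0
    for d in deltas:
        acc += d
        coords.append(acc / factor)
    return coords

# Bonded atoms are ~1.5 A apart, so deltas after the first stay small
encoded = encode_coords([12.345, 13.800, 13.802])
```

Because the deltas occupy far fewer bits than raw 32-bit floats, the quantized stream is both smaller on the wire and cheaper to parse than a text-based coordinate format.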


Part E: Theoretical Foundations and Software Infrastructure for Biological Network Databases
(12:00 pm-12:20 pm)

Bio:

Mehmet Koyuturk is T. & A. Schroeder Associate Professor at the Department of Electrical Engineering and Computer Science at Case Western Reserve University. He received his Ph.D. degree in Computer Science from Purdue University in 2006. His research mainly focuses on the analysis of biological networks, systems biology of complex diseases, and computational genomics. He currently serves as an associate editor for the IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) and EURASIP Journal on Bioinformatics and Systems Biology. Mehmet received an NSF CAREER Award in 2010.


Session Description:

Ever-increasing amounts of physical, functional, and statistical interaction data among bio-molecules, including DNA regulatory regions, functional RNAs, proteins, metabolites, and lipids, offer unprecedented opportunities for computational discovery and for constructing a unified systems view of the cellular machinery. These data and associated formalisms have enabled systems approaches that have led to unique advances in biomedical sciences. However, storage schemes, data structures, representations, and query mechanisms for network data are considerably more complex compared to other, "flat" or low-dimensional data representations (e.g., sequences or molecular expression).

In this talk, we describe our approach to answering fundamental questions that relate to efficient utilization of large network-structured datasets. What are (provably) optimal storage schemes for large network-structured databases? How should multiple versions of the same or related datasets be stored? How does one trade off compression against query efficiency? How does one suitably abstract network data so that users can interactively interrogate them using front-end visualization software? Our approach to addressing these problems is to develop theoretically grounded and computationally validated storage schemes, algorithms, and software that will enable efficient and effective storage, update, processing, and querying of biological networks.


Part F: Task-Specific Compression for Biomedical Big Data
(12:20 pm-12:40 pm)

Bio:

Ali Bilgin received his Ph.D. degree in electrical engineering from the University of Arizona where he is currently an Associate Professor with the Departments of Biomedical Engineering, Electrical and Computer Engineering, and Medical Imaging. His research interests are in the areas of signal and image processing, data compression, and imaging. He has served/continues to serve on the editorial boards of journals and organizing committees of conferences in these areas including as Technical Program Co-chair for IEEE Data Compression Conference and as Associate Editor for IEEE Signal Processing Letters, IEEE Transactions on Image Processing, and IEEE Transactions on Computational Imaging.


Session Description:

Contemporary biomedical imaging techniques generate very large datasets. As this seemingly unending supply of biomedical Big Data is collected, processed, and stored, a key challenge is preserving and delivering this data efficiently while maintaining high quality. There is growing consensus in the image science community that the quality of an image should be defined in terms of how well an observer (human or machine) can perform a specified task of practical importance. The overarching goal of this work is to develop methods and open source software for optimization of image compression to maximize clinical task performance.



SST03:  Genomic Big Data Management, Modeling and Computing
Tuesday, July 12, 10:10 am – 12:40 pm Room: Northern Hemisphere BCD
Organizer(s):

Stefano Ceri is Professor at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work has been generally concerned with extending database technology to incorporate new features: distribution, object-orientation, rules, streaming data, crowd-based and genomic computing. He is currently leading the PRIN project GenData 2020, focused on building query and data analysis systems for genomic data as produced by fast DNA sequencing technology. He is the recipient of the ACM SIGMOD "Edgar F. Codd Innovations Award" (2013), and an ACM Fellow and member of the Academia Europaea.

Marco Masseroli received a PhD in Biomedical Engineering in 1996, from Universidad de Granada, Spain. He is Associate Professor at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano, Italy. His research interests are in bioinformatics and biomedical informatics, focused on biomolecular databases, biomedical terminologies, and ontologies to effectively retrieve, manage, analyze, and semantically integrate genomic information with patient clinical and high-throughput genomic data. He is the author of more than 170 scientific articles in international journals, books and conference proceedings.

Emanuel Weitschek is a researcher in Computer Science at the Department of Engineering of Uninettuno International University. Additionally, he works with the bioinformatics group of the Institute of Systems Analysis and Computer Science Antonio Ruberti of the Italian National Research Council (IASI - CNR) in Rome. He obtained his PhD in Computer Science at the Engineering Department of Roma Tre University. His research interests are biomedical data analysis, bioinformatics and software engineering. He is involved in species classification with DNA Barcode sequences, gene expression profile analysis, clinical data mining, viruses/bacteria classification with alignment-free techniques, and next generation sequencing (NGS) analysis.

Presentation Overview:

Modern genomics promises to answer fundamental questions for biological and clinical research, e.g., how cancer develops, how driving mutations occur. Unprecedented efforts in genomics are made possible by Next Generation Sequencing (NGS), a family of technologies that is progressively reducing the cost and time of reading the DNA. Huge amounts of sequences are continuously collected by research laboratories, often organized through world-wide consortia (such as ENCODE [1], TCGA [2], the 1000 Genomes Project [3], and Epigenomic Roadmap [4]); personalized medicine, based on genomic information, is becoming a reality.

So far, the bioinformatics research community has been mostly challenged by primary analysis (production of sequences in the form of short DNA segments, or ''reads'') and secondary analysis (alignment of reads to a reference genome and search for specific genomic features on the reads, such as variants/mutations and peaks of expression); the most important emerging problem is the so-called tertiary analysis [5], concerned with sense making, e.g., discovering how heterogeneous regions interact with each other, by integrating heterogeneous DNA features, such as variants or mutations in a DNA position, or signals and peaks of expression, or structural properties of the DNA, e.g., break points (where the DNA is damaged) or junctions (where DNA creates loops). According to many biologists, answers to crucial genomic questions are hidden within genomic data already available in public repositories, but suitable tools for processing them are lacking.

The Data-Driven Genomic Computing (GenData 2020) PRIN project, March 2013 – February 2016 (http://www.bioinformatics.deib.polimi.it/gendata/), which includes 9 top-quality research groups throughout Italy, focuses on tertiary analysis. GenData 2020 proposes a paradigm shift for genomic data management, based on the Genomic Data Model (GDM), which mediates existing data formats, and the GenoMetric Query Language (GMQL - http://www.bioinformatics.deib.polimi.it/GMQL/), a high-level, declarative query language required by tertiary data analysis [6]. The first version of the cloud implementation of GMQL has been released [6], and a second enhanced version will be publicly available by the end of the project, supporting both Flink (https://flink.apache.org/) and Spark (https://spark.apache.org/), two cloud data frameworks which have proven extremely efficient in supporting massive genomic queries [7].

During GenData 2020, the Partners jointly developed several prototypes, including BMP (http://www-db.disi.unibo.it/research/GenData/), for efficient pattern-based queries on genomic feature datasets [8]; SoSGEM (http://www.bioinformatics.deib.polimi.it/sosgem/), a semantic data integrator supporting search upon ENCODE metadata [9]; and TCGA2BED (http://bioinf.iasi.cnr.it/tcga2bed/), for modeling TCGA data in GDM [10]; the latter works are the first steps towards a global repository for tertiary data analysis which integrates data of [1-4].

The Special Section discusses the current status and perspectives of NGS tertiary analysis, as well as the results of the GenData 2020 project, which are of high interest and relevance to the ISMB and bioinformatics community.

Part A: Genomic big data management and the GenoMetric Query Language
(10:10-10:30)

Bio:

Stefano Ceri is Professor at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work has been generally concerned with extending database technology to incorporate new features: distribution, object-orientation, rules, streaming data, crowd-based and genomic computing. He is currently leading the PRIN project GenData 2020, focused on building query and data analysis systems for genomic data as produced by fast DNA sequencing technology. He is the recipient of the ACM-SIGMOD “Edward T. Codd Innovation Award” (2013), and an ACM Fellow and member of the Academia Europaea.

Session Description:

I will describe an approach to "tertiary genomic data analysis" (making sense of data) which goes beyond classical alignment and feature calling and focuses on integrating the data resulting from these processes, a step that is normally more an "art" than a science. I cite the existence of huge world-wide repositories of processed data (ENCODE, TCGA, 1000 Genomes, Epigenomic Roadmap) as sources that can be integrated with experimental data produced at each lab, and then discuss how we model processed data (both from repositories and from the bench), how we query them (with many biological examples, e.g., detecting gene-enhancer pairs favored by proximity in 3D space, which requires integrating several data types over the whole genome), and how our system is efficiently implemented on the cloud. This talk is based on a new data management system for genomics, described in "GenoMetric Query Language: a novel approach to large-scale genomic data management", Bioinformatics. 2015 Jun 15;31(12):1881-8. doi: 10.1093/bioinformatics/btv048, which integrates tens of heterogeneous genomic dataset types, in particular those used in epigenetic experiments (ChIP-seq, RNA-seq, DNA-seq), but also annotations and new data types, e.g., ChIA-PET for 3D loops.

Part B: TCGA2BED and CAMUR for cancer NGS data processing
(10:30-10:50)

Bio:

Emanuel Weitschek is full researcher in Computer Science at the Department of Engineering of Uninettuno International University. Additionally, he works with the bioinformatics group of the Institute of Systems Analysis and Computer Science Antonio Ruberti of the Italian National Research Council (IASI - CNR) in Rome. He obtained the PhD in Computer Science at the Engineering Department of Roma Tre University. His research interests are biomedical data analysis, bioinformatics and software engineering. He is involved in species classification with DNA Barcode sequences, gene expression profile analysis, clinical data mining, viruses/bacteria classification with alignment-free techniques, and next generation sequencing (NGS) analysis.

Session Description:

Data extraction and integration methods are becoming essential to effectively access huge amounts of genomic and clinical data. In this work, we focus on The Cancer Genome Atlas (TCGA), a comprehensive archive of tumor data from Next Generation Sequencing experiments covering more than 30 cancer types. We propose TCGA2BED, a software tool to download public TCGA data and convert it into the structured BED format. Additionally, we extend TCGA data with further data from several other genomic databases (i.e., NCBI Entrez, HGNC, UCSC) and provide an updated data repository with all publicly available TCGA copy number, DNA-seq, RNA-seq, miRNA-seq, and DNA-methylation experimental data and metadata. The use of the BED format reduces the time needed to manage and analyze TCGA data; it makes it possible to efficiently deal with huge amounts of cancer data, and to search, query, and extend them. It facilitates investigators in performing knowledge discovery analyses aimed at aiding cancer treatments. Finally, we propose to analyze the TCGA data with a supervised approach using CAMUR, a tool able to extract a large amount of knowledge by computing many rule-based classification models, and therefore to identify most of the clinical and genomic variables related to the predicted cancer class.
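
To make the role of the BED format concrete, here is a hedged sketch (the input values are hypothetical, not TCGA2BED's actual schema) of reducing a genomic feature to a tab-separated BED6 line:

```python
def to_bed6(chrom, start, end, name, score=0, strand="."):
    """Render one genomic feature as a tab-separated BED6 line.
    BED coordinates are 0-based and half-open: start is inclusive,
    end is exclusive."""
    return "\t".join([chrom, str(start), str(end), name, str(score), strand])

# A hypothetical TCGA-like gene record reduced to its genomic footprint:
line = to_bed6("chr7", 140433812, 140624564, "BRAF|673", score=42, strand="-")
```

Once every experiment type is expressed in this uniform region format, the datasets can be intersected, queried, and extended with common genomic tools.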

Part C: Searching patterns in genomic feature regions
(10:50-11:10)

Bio:

Ilaria Bartolini is Associate Professor at the Department of Computer Science and Engineering, University of Bologna. She graduated in Computer Science and received a PhD in Electronic and Computer Engineering from the University of Bologna. She has been a visiting researcher at CWI (Amsterdam), NJIT (Newark, NJ), and HKUST (Hong Kong). Her research interests mainly include similarity and preference-based search techniques for large multimedia data collections; she has developed query models, together with efficient and scalable query processing algorithms, that are used worldwide by highly differentiated users. More recently, she has specialized in genomic computing (within the GenData 2020 project), focusing on pattern-based querying of genomic data.

Session Description:

Genomics is opening many interesting practical and theoretical computational problems due to the large amounts of heterogeneous data it generates. One of them is the search for collections of genomic regions at given distances from each other (i.e., patterns of genomic regions along the whole genome).
I will describe an optimized pattern-search algorithm able to efficiently find, within a large set of genomic data, sequences of genomic regions that are similar to a given pattern. I will present the method and its several variants, giving formal problem definitions and examples of the defined solutions. I will start with the simplest problem, which is solved using dynamic programming enhanced with an efficient window-based approach; then, I will proceed to more complex problems, where the method is extended with cost-based and similarity-based matching models, in order to cope with practical applications in revealing interesting and unknown regions of the genome, thus making it an important ingredient in supporting biological research. The method is applied to enhancer detection, a relevant biological problem, and shown to be both efficient and accurate.
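
A drastically simplified sketch of the underlying idea (my own reduction for illustration, not the presented algorithm): encode the pattern as the expected gaps between consecutive regions and slide it along the sorted region starts of a chromosome, accepting positions where every gap matches within a tolerance.

```python
def match_pattern(starts, pattern_gaps, tol):
    """Return indices i where the regions starting at starts[i], starts[i+1], ...
    are spaced (within +/- tol bp) as prescribed by pattern_gaps.
    `starts` must be sorted region start coordinates on one chromosome."""
    k = len(pattern_gaps)
    hits = []
    for i in range(len(starts) - k):
        if all(abs((starts[i + j + 1] - starts[i + j]) - pattern_gaps[j]) <= tol
               for j in range(k)):
            hits.append(i)
    return hits
```

The actual method replaces this exact-window check with dynamic programming and cost/similarity-based matching, which tolerate missing or extra regions.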

Part D: Large epi(genomics) data sets, secondary analysis and network based approaches
(11:40-12:00)

Bio:

Alfonso Valencia is a Spanish biologist, the current director of the Spanish National Bioinformatics Institute and the Structural and Computational Biology Group leader at the Spanish National Cancer Research Centre (CNIO). He is President of the International Society for Computational Biology. His research is focused on the study of protein families and their interaction networks.

Session Description:

Large epigenomics data sets are increasingly available, including histone modifications, genome mapping preferences of Chromatin Binding Proteins (CRPs), and Chromatin Capture experiments describing the organization of the chromatin in the nucleus.

We have processed the heterogeneous ChIP-Seq information to build a comprehensive genome co-localization network of CRPs, histone marks and DNA modifications. In this network, co-localization preferences are specific to "Chromatin States", such as active regions or enhancers. The analysis of the properties of the co-localization network points to the DNA modification 5hmC as the key component in the organization of this network. The importance of 5hmC in the network is reinforced by the evolutionary analysis of the protein components of the network, in which 5hmC acts as a mediator in the co-evolution of the CRP components of the mESC network.
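
The construction of such a co-localization network can be illustrated with a toy sketch (illustrative only, not the paper's pipeline): represent each factor's binding as a binary occupancy vector over genomic bins and connect factors whose profiles overlap strongly, e.g. by Jaccard similarity.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two equal-length binary vectors."""
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return inter / union if union else 0.0

def colocalization_network(occupancy, threshold):
    """occupancy: {factor_name: binary vector over genomic bins}.
    Returns edges between factors whose occupancy profiles have
    Jaccard similarity >= threshold."""
    return [(f, g) for f, g in combinations(sorted(occupancy), 2)
            if jaccard(occupancy[f], occupancy[g]) >= threshold]
```

Network-level properties (hubs, mediators) are then read off the resulting graph, which is how a component such as 5hmC can emerge as a communication hub.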

We have further explored the functional significance of the “Epigenetic Properties” and “Chromatin States” by analysing them in the context of the structure of the nucleus. The results revealed interesting properties of the organization of the mESC epigenetic control system, in line with the emerging models of gene expression control and chromatin organization. These two approaches demonstrate the growing importance of Network Biology techniques in the exploration of the functional and evolutionary properties of complex biological systems.

To make the analysis of this information accessible to end users, we have developed a new resource that enables the direct comparison of epigenetic features and chromatin states between cell types.

 

Juan D, et al. Epigenomic Co-localization and Co-evolution Reveal a Key Role for 5hmC as a Communication Hub in the Chromatin Network of ESCs. Cell Rep. 2016; 14(5):1246-1257. http://www.cell.com/cell-reports/pdf/S2211-1247(16)00028-0.pdf

Pancaldi V, et al. Integrating epigenomic data and 3D genomic structure with a new measure of chromatin assortativity. Genome Biol. 2016 (in press). http://arxiv.org/abs/1512.00268

Fernandez-Gonzalez JM, et al. EPICO platform: a reference cyber-infrastructure for comparative epigenomics. The BLUEPRINT Data Analysis Portal as a practical case. 2016 (submitted)

 

This work was developed in collaboration with the Vingron (MPIMG, Berlin) and Fraser (Babraham Institute) labs as part of the BLUEPRINT project (http://www.blueprint-epigenome.eu/).

Part E: Semi-automated human genome annotation using chromatin data
(12:00-12:20)

Bio:

Michael Hoffman is a principal investigator at the Princess Margaret Cancer Centre and Assistant Professor in Medical Biophysics and Computer Science, University of Toronto. He researches machine learning techniques for epigenomic data. He previously led the NIH ENCODE Project's large-scale integration task group while at the University of Washington. He has a PhD from the University of Cambridge, where he conducted computational genomics studies at the European Bioinformatics Institute. He was named a Genome Technology Young Investigator and has received several awards for his academic work, including an NIH K99/R00 Pathway to Independence Award.

Session Description:

Sequence census methods like ChIP-seq now produce an unprecedented amount of genome-anchored data. We have developed an integrative method, Segway, to identify patterns from multiple experiments simultaneously while taking full advantage of high-resolution data, discovering joint patterns across different assay types. We applied this method to ENCODE chromatin data for multiple human cell types, including ChIP-seq data on covalent histone modifications and transcription factor binding, and DNase-seq and FAIRE-seq readouts of open chromatin. In an unsupervised fashion, we identified patterns associated with transcription start sites, gene ends, enhancers, CTCF elements, and repressed regions. The method yields a model which elucidates the relationship between assay observations and functional elements in the genome. This model identifies sequences likely to affect transcription, and we verify these predictions in laboratory experiments. We have made software and integrative genome browser tracks freely available.
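
As a drastically simplified stand-in for this kind of integration (Segway itself fits a dynamic Bayesian network, not k-means), one can cluster per-bin vectors of assay signals into a small number of labels:

```python
def kmeans_segment(bins, k=2, iters=20):
    """Assign each genomic bin, represented as a vector of assay signals
    (e.g. one value per ChIP-seq/DNase-seq track), to one of k labels
    via plain k-means with naive first-k initialization."""
    centers = [list(b) for b in bins[:k]]
    labels = [0] * len(bins)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, b in enumerate(bins):
            labels[i] = min(range(k),
                            key=lambda c: sum((x - y) ** 2
                                              for x, y in zip(b, centers[c])))
        # update step: move each center to the mean of its members
        for c in range(k):
            members = [bins[i] for i in range(len(bins)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Unlike this sketch, Segway models each track probabilistically and exploits the linear structure of the genome, so neighboring bins influence each other's labels.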

Part F: Genomic Computing challenges and perspectives
(12:20-12:40)

Bio:

Alfonso Valencia is a Spanish biologist, the current director of the Spanish National Bioinformatics Institute and the Structural and Computational Biology Group leader at the Spanish National Cancer Research Centre (CNIO). He is President of the International Society for Computational Biology. His research is focused on the study of protein families and their interaction networks.

Bio:

Michael Hoffman is a principal investigator at the Princess Margaret Cancer Centre and Assistant Professor in Medical Biophysics and Computer Science, University of Toronto. He researches machine learning techniques for epigenomic data. He previously led the NIH ENCODE Project's large-scale integration task group while at the University of Washington. He has a PhD from the University of Cambridge, where he conducted computational genomics studies at the European Bioinformatics Institute. He was named a Genome Technology Young Investigator and has received several awards for his academic work, including an NIH K99/R00 Pathway to Independence Award.

Bio:

Stefano Ceri is Professor at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work has been generally concerned with extending database technology to incorporate new features: distribution, object-orientation, rules, streaming data, crowd-based and genomic computing. He is currently leading the PRIN project GenData 2020, focused on building query and data analysis systems for genomic data as produced by fast DNA sequencing technology. He is the recipient of the ACM-SIGMOD “Edward T. Codd Innovation Award” (2013), and an ACM Fellow and member of the Academia Europaea.

Bio:

Pierre Baldi earned MS degrees in Mathematics and Psychology from the University of Paris, and a PhD in Mathematics from the California Institute of Technology. He is currently Chancellor's Professor in the Department of Computer Science, Director of the Institute for Genomics and Bioinformatics, and Associate Director of the Center for Machine Learning and Intelligent Systems at the University of California Irvine. The long term focus of his research is on understanding intelligence in brains and machines. He has pioneered the development and application of deep learning methods to problems in the natural sciences such as the detection of exotic particles in physics, the prediction of reactions in chemistry, and the prediction of protein structures and gene regulatory mechanisms in bioinformatics. He is an elected Fellow of the ISCB.

Bio:

Søren Brunak, Ph.D., is professor of Disease Systems Biology at the University of Copenhagen and professor of Bioinformatics at the Technical University of Denmark. He is Research Director at the Novo Nordisk Foundation Center for Protein Research at the University of Copenhagen Medical School. He leads a research effort where molecular-level systems biology data are combined with the analysis of phenotypic data from the healthcare sector, such as electronic patient records, registry information and biobank questionnaires. A major aim is to understand the network basis for comorbidities and discriminate between treatment-related disease correlations and other comorbidities, thereby stratifying patients not only from their genotype, but also phenotypically based on the clinical descriptions in their medical records. Prof. Brunak started work within bioinformatics in the mid-1980s, and was in 1993 the founding Director of the Center for Biological Sequence Analysis at DTU, which was formed as a multi-disciplinary research group of molecular biologists, biochemists, medical doctors, physicists, and computer scientists. The center offers a wide range of services at its web site, www.cbs.dtu.dk, including bioinformatics tools developed over the past 25 years.

Session Description:

Round table and discussion with the audience about the Genomic Computing hot topic, and related aspects.

Moderator: Marco Masseroli, Politecnico di Milano, Italy.


 

SST04:  Molecular Communication and Networking with Applications to Precision Medicine
Tuesday, July 12, 2:00 pm – 4:30 pm Room: Northern Hemisphere BCD
Organizer(s):

Radu Marculescu is a Professor in the Dept. of Electrical and Computer Engineering at Carnegie Mellon University. He received his Ph.D. in Electrical Engineering from the University of Southern California (1998). He has received multiple best paper awards, NSF Career Award (2000), Outstanding Research Award from the College of Engineering (2013). He has been involved in organizing several international symposia, conferences, workshops, and tutorials, as well as guest editor of special issues in archival journals and magazines. His research focuses on design methodologies for embedded and cyber-physical systems, biological systems, and social networks. Radu Marculescu is an IEEE Fellow.

Presentation Overview:

Context: This special session focuses on the emerging area of molecular communication and nano-networking, which targets cell-based therapeutics. Cell-based therapeutics is a key component of precision medicine, i.e., the new paradigm for disease prevention and treatment aimed at providing customized healthcare solutions on a patient-to-patient basis. While in recent years there has been significant progress towards understanding cellular behavior and controlling individual cells, the understanding of heterogeneous populations of cells still lacks an appropriate computational framework to capture the various interactions and emerging behaviors that manifest at the population level.
Contents: This special session involves four presentations covering the biological basis for modeling bacterial communities, computational models for inter-cellular communication and synchronization in populations of bacteria, automated model generation and hardware acceleration, as well as graph algorithms for microbiome applications. The session will end with a 20-minute panel that involves both presenters and audience in a highly interactive discussion about these new ideas.
Relevance: This special session brings a new perspective on nano-communication and nano-networking at the population level (as opposed to the single-cell level), which is critical to engineer cell behavior, reprogram cell-cell communication, and eventually develop new strategies that can control the dynamics of cell populations. Such systems are poised to revolutionize our understanding and treatment of major diseases like antibiotic-resistant infections or cancer.

Part A: Biological Basis for Modeling Bacterial Communities
(2:00-2:30)

Bio:

Dr. Hiller is an Assistant Professor at Carnegie Mellon University, and adjunct faculty at the Center for Excellence in Biofilm Research at the Allegheny Health Network. Dr. Hiller’s work is focused on bacterial communities, specifically cell-cell communication and genomic plasticity over short-time scales (single infections to a few decades).

Session Description:

Antibiotic resistance is one of the major challenges of this century. The emergence of drug resistance emphasizes the need for new therapies, and the absence of new compounds pinpoints the need for new classes of targets. Recent advances have transformed our understanding of microbial disease. We now realize that bacteria are organized into structured communities, termed biofilms. Within biofilms, cells sense and respond to one another and to their environment via quorum sensing systems. The knowledge of dynamic biofilms with inter-cellular communication offers a new vision of microbial life, which can be exploited in the search for antimicrobials.
The success of novel compounds will depend not only on their efficacy, but also on the emergence and spread of drug resistance. Mathematical models and computational simulations can transform our understanding of resistance by exposing the conditions and selective pressures that stimulate or suppress its spread. Further, diseases occur in genetically heterogeneous and dynamic environments, thus the development of accurate models will require an in-depth understanding of the diversity and plasticity of bacterial genomes. We propose that modeling bacterial evolution will generate testable hypotheses regarding approaches that minimize the emergence and spread of drug resistance, and maximize the efficacy of novel therapies.
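
As a toy example of the kind of model meant here (all parameters and dynamics invented for illustration), a per-generation simulation of resistance emergence under drug pressure:

```python
import random

def simulate_resistance(pop=1000, mu=1e-3, kill=0.9, generations=30, seed=42):
    """Toy model: each generation the antibiotic kills a fraction `kill` of
    susceptible cells, survivors regrow to carrying capacity `pop`, and each
    susceptible division yields a resistant mutant with probability mu.
    Returns the resistant count after each generation."""
    rng = random.Random(seed)
    susceptible, resistant = pop, 0
    trajectory = []
    for _ in range(generations):
        susceptible = int(susceptible * (1 - kill))      # drug pressure
        total = susceptible + resistant
        if total == 0:                                   # population extinct
            trajectory.append(0)
            continue
        grow_s = round(pop * susceptible / total)        # proportional regrowth
        resistant = pop - grow_s
        mutants = sum(1 for _ in range(grow_s) if rng.random() < mu)
        susceptible = grow_s - mutants
        resistant += mutants
        trajectory.append(resistant)
    return trajectory
```

Even this crude model shows the qualitative point: once a mutant arises, drug pressure drives it toward fixation within a few generations, so limiting the mutation supply matters as much as killing efficiency.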

Part B: Molecular Tweeting: Bacteria Network Formation, Dynamics, and Control with Healthcare Applications
(2:30-3:00)

Bio:

Radu Marculescu is a Professor in the Dept. of Electrical and Computer Engineering at Carnegie Mellon University. He received his Ph.D. in Electrical Engineering from the University of Southern California (1998). He has received multiple best paper awards, NSF Career Award (2000), Outstanding Research Award from the College of Engineering (2013). He has been involved in organizing several international symposia, conferences, workshops, and tutorials, as well as guest editor of special issues in archival journals and magazines. His research focuses on design methodologies for embedded and cyber-physical systems, biological systems, and social networks. Radu Marculescu is an IEEE Fellow.

Session Description:

Computational models offer an attractive alternative for observing and understanding bacterial ecosystems. Indeed, the ability to efficiently run simulation-based experiments with finely tuned environmental parameters and generate reproducible results can allow researchers to explore the subtleties of the bacterial inter-cellular network and its implications for biofilm dynamics.
As computational models become more and more powerful, a network-centric approach to studying bacteria populations can improve our understanding of their social behaviors and possibly help control the infectious diseases they cause. This is a major step towards developing new drugs and targeted medical treatments to fight biofilm-related infections.

Part C: Data-Driven Modeling and In Silico Simulation of Cell Signaling Pathways
(3:30-3:50)

Bio:

Diana Marculescu is a Professor in the Department of Electrical and Computer Engineering, Carnegie Mellon University. Her research interests are in computing for sustainability and life science applications. Diana Marculescu has served as an associate editor for several journals, including IEEE Transactions on Computers, IEEE Transactions on VLSI Systems, and is a recipient of several best paper awards, the NSF Faculty Career Award (2000), and CIT George Tallman Ladd Research Award (2004). She was an IEEE CAS Distinguished Lecturer (2004-2005) and is the recipient of the Marie R. Pistilli Women in EDA Achievement Award (2014). She is an IEEE Fellow.

Session Description:

Complex systems are characterized by internal structures intrinsically connected across varying scales of time or space that exhibit, at macroscale, behaviors that are not characteristic for “simple systems.” Such emergent behavior resulting from the interaction of subsystems (and not evident from analysis of each subsystem) is a fundamental characteristic of cellular mechanisms governing cell signaling pathways.
In this talk, I will describe how such mechanisms can be automatically extracted from data and modeled in a manner that enables fast in silico simulation and incremental modification for in vitro behavior prediction. The modeling/simulation paradigm is orders of magnitude faster than existing software-based in silico modeling methods and can help identify parameters that match biological behavior, accelerate nonlinear ODE simulation of pathway dynamics, and provide mechanisms for abstracting intra-cell behavior for inter-cell interaction analysis.
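
A minimal example of the kind of pathway dynamics being simulated (a one-step activation cycle chosen purely for illustration, integrated with plain forward Euler):

```python
def simulate_pathway(k_on=1.0, k_off=0.5, signal=1.0, a_total=1.0,
                     dt=0.01, steps=2000):
    """Forward-Euler integration of a one-step activation cycle:
        d[A*]/dt = k_on * signal * (A_total - A*) - k_off * A*
    Returns the trajectory of the active fraction A*."""
    a_active = 0.0
    traj = []
    for _ in range(steps):
        da = k_on * signal * (a_total - a_active) - k_off * a_active
        a_active += dt * da
        traj.append(a_active)
    return traj
```

Hardware acceleration pays off when many coupled equations like this must be integrated repeatedly, e.g. while searching parameter space for values that match in vitro measurements.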

Part D: On Scaling Graph Algorithms for Microbiome Applications
(3:50-4:10)

Bio:

Ananth Kalyanaraman is an Associate Professor and Boeing Centennial Chair in Computer Science at the School of Electrical Engineering and Computer Science in Washington State University. He received his PhD from Iowa State University. His main research interests are in designing parallel algorithms and software for solving large-scale data-driven problems in the life sciences. He is a recipient of a DOE Early Career Award, Early Career Impact Award from Iowa State University, and two best paper awards. He serves on editorial boards of leading parallel computing journals (TPDS, JPDC). Ananth is a member of AAAS, ACM, IEEE-CS, and ISCB.

Session Description:

Microbiome characterization is a hot research topic with a foundational importance to our understanding of the microbial ecosystems that surround us, and with a translational potential that can revolutionize personalized delivery of health and agriculture biotechnology. In this talk, I will focus on the pivotal role of graph-theoretic modeling and analytics in microbiome characterization. More specifically, I will describe community detection algorithms, their parallelization, and their application to microbiome data sets. The aim is to functionally characterize the microbiomes through discovering key molecular pathways that are represented in an underlying microbial community. From a computational perspective, I will present efficient parallel algorithms and heuristics that are designed to scale on a wide variety of parallel architectures including multi-core/many-core architectures and distributed memory machines.
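
As a small illustration of community detection (plain sequential label propagation; the parallel algorithms in the talk are considerably more sophisticated):

```python
def label_propagation(adj, iterations=20):
    """Each node repeatedly adopts the most frequent label among its
    neighbors (ties broken by smallest label); densely connected groups
    converge to a shared label.  adj: {node: set of neighbor nodes}."""
    labels = {v: v for v in adj}
    for _ in range(iterations):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            top = max(counts.values())
            best = min(l for l in counts if counts[l] == top)
            if labels[v] != best:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels
```

In a microbiome setting the nodes might be genes or proteins and the edges similarity or co-abundance relations; communities then suggest candidate molecular pathways present in the underlying microbial community.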

Part E: Panel
(4:10-4:30)

Session Description:

Panel session that involves all presenters and the audience in a highly interactive discussion. A few questions to be addressed are:
1. How can microbial communities be changed via quorum sensing (QS) manipulation? What is the probability of the emergence of drug-resistant strains?
2. What are some of the technological and scientific challenges along our path to precisely understanding and modeling inter-cellular and intra-cellular communication?
3. How do we translate these model-based studies into predictive/prescriptive tools for precision medicine?
