Tutorials

ISMB 2018 - Tutorials

Tutorial AM1: Single cell RNA-seq toolkit (SOLD OUT)
Tutorial AM2: Machine learning methods in the analysis of genomic and clinical data (SOLD OUT)
Tutorial AM3: Integrated network analysis: Cytoscape automation using R and Python (SOLD OUT)
Tutorial AM4: Computational methods for comparative regulatory genomics (SOLD OUT)
Tutorial PM5: Visualization of large biological data (SOLD OUT)
Tutorial PM6: Deep learning for network biology (SOLD OUT)
Tutorial PM7: High-throughput sequencing: Identification of disease variants in exomes and genomes
Tutorial PM8: Ontologies in computational biology

Tutorial AM1: Single cell RNA-seq toolkit

(SOLD OUT)

July 6, 2018, 9:00 am - 1:00 pm

Room: Grand Ballroom A

Presenters

Tyler Faits, Boston University, United States
Matan Hofree, Broad Institute, United States
Ayshwarya Subramanian, Broad Institute, United States
Alex Tsankov, Broad Institute, United States

Overview

Single cell transcriptomics has emerged as a powerful tool to identify and interrogate novel cell types in homeostatic and perturbed states. Unlike bulk transcriptomics, single cell data provides resolution at the level of individual cells while working with much smaller quantities of RNA. As such, analysis of single cell RNA sequencing (scRNA-seq) data presents challenges of scale and technical noise, while providing the resolution necessary to pursue novel questions that earlier technologies did not allow.

The objective of the tutorial is to provide an overview of the laboratory and computational challenges involved in generating and analyzing scRNA-seq data. Participants will be introduced to popular molecular technologies for generating scRNA-seq data, and gain hands-on experience with existing software tools and computational methods for its analysis. The tutorial will briefly introduce approaches for preprocessing of scRNA-seq data, including demultiplexing, sequence alignment, and quality control. Then, starting from a cell x gene expression matrix, participants will learn standard methods to infer heterogeneity by identifying clusters of cells and perform analyses to assign cell identity and function. Participants will also be introduced to specialized analytical methods for exploring expression signatures of cell states, cellular differentiation trajectories, inference of cellular localization, and modern methods targeted towards better understanding of cancer biology. Analyses will be performed by executing commands in RStudio as well as leveraging newly developed point-and-click graphical R/Shiny interfaces.
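As a rough illustration of the quality-control and preprocessing steps mentioned above, the following minimal Python sketch filters low-quality cells from a small cell x gene count matrix and applies library-size normalization with a log transform. The function name, the thresholds (`min_genes`, `scale`) and the toy counts are assumptions made for this example; they are not taken from the tutorial materials, which use R.

```python
import math

def filter_and_normalize(counts, min_genes=2, scale=10_000):
    """Drop low-quality cells, then library-size normalize and log-transform.

    counts: list of rows, one per cell, each a list of raw gene counts.
    min_genes and scale are illustrative thresholds, not tutorial values.
    """
    normalized = []
    for row in counts:
        genes_detected = sum(1 for c in row if c > 0)
        total = sum(row)
        if genes_detected < min_genes or total == 0:
            continue  # treat as a failed or empty cell
        # Scale each cell to the same total count, then apply log(1 + x).
        normalized.append([math.log1p(c / total * scale) for c in row])
    return normalized

# Hypothetical toy matrix: 3 cells x 4 genes.
cells = [
    [5, 0, 3, 2],  # passes QC
    [0, 0, 1, 0],  # only one gene detected, removed
    [0, 4, 4, 2],  # passes QC
]
matrix = filter_and_normalize(cells)
```

The resulting matrix is the kind of input that clustering and cell-type inference would typically start from.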

Audience

Familiarity with basic RNA-Seq data analysis and working knowledge of R.

Requirements

The tutorial will utilize web/cloud-based computing infrastructure with all software preinstalled, such that the only user requirement will be a personal laptop with the Google Chrome web browser installed. From within the Chrome web browser, users will access RStudio and additional web-based utilities for computation.

Schedule Overview

9:00-9:45 am	Introduction: Tutorial infrastructure setup; Technologies for scRNA-seq data generation; Description of course datasets, case study and analysis questions
9:45-10:15 am	Quality-control and preprocessing; introduction to scRNA-seq data structures in R
10:15-11:00 am	Basic analyses of scRNA-seq data; batch effect correction, clustering and inference of cell-types
11:00-11:15 am	Coffee Break
11:15-11:45 am	Cell cluster-based differential expression and pathway analysis
11:45 am-12:30 pm	Interactive tools for visualization and scRNA-seq analysis
12:30-1:00 pm	Specialized scRNA-seq applications, currently available resources, and data repositories

Presenter Bios

Matan Hofree, Broad Institute, United States Dr. Hofree completed his PhD at UC San Diego under the supervision of Trey Ideker, developing approaches for improved inference, classification and biological subtype discovery in cancer, using prior biological knowledge encoded in gene interaction networks. He received his B.Sc. in Computer Science and Computational Biology at the Hebrew University of Jerusalem, Israel. Presently, Dr. Hofree is working under the mentorship of Dr. Regev, developing computational techniques for single cell transcriptomics, and studying how transcriptional plasticity and heterogeneity drive diverse tumors.
Ayshwarya Subramanian, Broad Institute, United States Ayshwarya Subramanian completed her PhD at Carnegie Mellon University under the supervision of Russell Schwartz. Her dissertation focused on developing computational methods for resolving heterogeneity in high-throughput data from tumors, detecting progression markers, and using these markers for phylogenetic inference. Her postdoctoral training was completed at the Harvard T.H. Chan School of Public Health’s Department of Biostatistics, with Curtis Huttenhower and Rafael Irizarry, where she developed probabilistic models of transcriptional activity states from bulk transcriptome sequencing data and approaches for analysis of metagenomic data. She is a computational scientist at the Broad Institute with Anna Greka and Aviv Regev, working on understanding kidney biology and disease using single cell transcriptomics.
Alex Tsankov, Broad Institute, United States Alex Tsankov completed his M.S. and Ph.D. in electrical engineering and computer science at Massachusetts Institute of Technology. Under the mentorship of Aviv Regev and Oliver Rando, his Ph.D. thesis characterized the role of nucleosome positioning in the evolution of Ascomycota yeasts. His postdoctoral training in Alex Meissner’s lab at Harvard University focused on understanding transcription factor and epigenetic dynamics during differentiation of human embryonic stem cells into the three germ layers and also led to the creation of a quantitative assay of functional potency called the ScoreCard. Presently, Dr. Tsankov is a computational scientist at the Broad Institute and is using single cell transcriptomics to build a cellular atlas of the human lung and to study transcriptional heterogeneity and metastasis in lung cancer.
Tyler Faits, Boston University, United States Tyler Faits is a Ph.D. candidate in the Bioinformatics Program at Boston University. His thesis, supervised by Dr. Evan Johnson, is focused on creating tools for the optimization of single cell RNA-sequencing experimental designs, and developing interactive portals for transcriptomic data in applications that include single cell RNA-sequencing and metatranscriptomic data analysis.

Tutorial AM2: Machine learning methods in the analysis of genomic and clinical data

(SOLD OUT)

July 6, 2018, 9:00 am - 1:00 pm

Room: Grand Ballroom B

Presenters

Felipe Llinares-López, ETH Zurich, Basel, Switzerland
Damian Roqueiro, ETH Zurich, Basel, Switzerland

Website: https://www.bsse.ethz.ch/mlcb/education/tutorial-ismb18.html

Overview

This tutorial covers various machine learning (ML) tools that have been developed for the analysis of genomic and clinical data. It is an intermediate level tutorial targeted to an audience with previous experience in diverse bioinformatics methods such as: i) genome-wide association studies, ii) comparison of structured data such as graphs or time-series, and iii) traditional text mining. State-of-the-art methods and their applications are presented. We will also discuss illustrative examples of how deep learning is currently being used in the analysis of biomedical data.

Audience

Beginner or intermediate. For hands-on sessions: programming experience in R/Python is required.

Collect your name badge July 6 between 8:00 am - 8:45 am at the Conference Registration Desk, Ballroom Foyer, East Tower (lower level) Hyatt Regency Chicago.

Participant Requirements

For the hands-on session, if you wish to follow the steps we present, you will need one of the following installed on your laptop:
- R 3.4 or newer
(or)
- Python 2.7 or 3

Schedule Overview

9:00 - 9:10 am	Damian Roqueiro	Introduction: ML in Bioinformatics. Overview of topics presented in the session.
9:10 - 10:10 am	Felipe Llinares-López	Module I: Significant pattern mining (SPM) and pruning the search space in association studies to increase statistical power
10:10 - 11:00 am	Damian Roqueiro	Hands-on session: applying SPM on genomic data with the package “sigPatSearch”
11:00 - 11:15 am	Coffee break
11:15 - 12:00 pm	Damian Roqueiro	Module II: ML methods to compare structured biomedical data such as strings, graphs and time series.
12:00 - 12:30 pm	Felipe Llinares-López	Hands-on session: Computing graph kernels with the package “graphKernels”
12:30 - 1:00 pm	Damian Roqueiro and Felipe Llinares-López	Module III: Deep learning and its applications to biomedical data. Illustrative examples, with a focus on text mining and processing of electronic health records.

Presenter Bios

Felipe Llinares-López, ETH Zurich, Basel, Switzerland Felipe Llinares-López is a PhD student in the Machine Learning and Computational Biology lab at ETH Zurich. The main focus of his PhD research has been the development of algorithms to assess the statistical association between a target of interest and high-order interactions between features, and applying these methods to selected problems in computational biology, such as genome-wide association studies.
Damian Roqueiro, ETH Zurich, Basel, Switzerland Damian Roqueiro is a postdoc at the Machine Learning and Computational Biology lab at ETH Zurich. His research has been focused on the development and application of machine learning techniques to better understand the association between specific diseases and the genetic makeup of individuals afflicted by those diseases.

Tutorial AM3: Integrated network analysis: Cytoscape automation using R and Python

(SOLD OUT)

July 6, 2018, 9:00 am - 1:00 pm

Room: Columbus IJ

Presenters

Alexander Pico, Gladstone Institutes, United States
John “Scooter” Morris, UCSF, United States
Barry Demchak, UCSD, United States

Overview

Cytoscape is one of the most popular applications for network analysis and visualization. In this workshop, we will demonstrate new capabilities to integrate Cytoscape into programmatic workflows and pipelines using R and Python. We will begin with an overview of network biology themes and concepts, and then we will translate these into Cytoscape terms for practical applications. The bulk of the workshop will be a hands-on demonstration of accessing and controlling Cytoscape from R and Python to perform a network analysis of tumor expression and variant data.

Learning Objectives

By the end of the tutorial, you should be able to:
• Know when and how to use Cytoscape in your research area
• Identify and discriminate relevant sources of interactions, networks and datasets
• Command programmatic control over Cytoscape
• Integrate Cytoscape into your bioinformatics pipelines
• Publish, share and export networks
• Generalize network analysis methods to multiple problem domains

Schedule Overview

9:00-9:20 am	Introductory (20 min): Quick introductions: presenters & audience; General network biology perspective and applications; Cytoscape introduction
9:20-10:30 am	Getting relevant networks: Types of networks, sources, and relevant apps; How to choose a network source; Hands-on exercise: STRING, NDEx, WikiPathways
10:30-11:00 am	Intermediate (30 min): Driving Cytoscape from R and Python; Overview of Cytoscape automation; Launch Cytoscape and connect; Getting Disease Networks; Query STRING database from R and Python via CyREST
11:00-11:15 am	Coffee break
11:15 am-12:15 pm	Advanced (60 min): Interacting with Cytoscape following R and Python vignettes; CyREST and Commands; R and Python packages; Visualizing data on networks; Loading multiple data types into Cytoscape; Setting visual styles; Subnetwork selection; Data-driven and diffusion-based subnetworks; Saving, sharing and publishing; Session files, images and web export
12:15-1:00 pm	Additional Topics and Q&A (45 min): More docs, more exercises; New features planned for Cytoscape 3.7; CyBrowser & web integration

Intended audience

This tutorial is intended for an audience that has prior experience with at least one of the following:
• Cytoscape software
• Network biology concepts
• Bioinformatics analysis using R or Python

Participant requirements

Participants are required to bring a laptop with Cytoscape, R, RStudio and Python installed. Installation instructions will be provided in the weeks preceding the tutorial.

Presenter Bios

Alexander Pico, Gladstone Institutes, United States Alex is the Executive Director of the National Resource for Network Biology, the Vice President of the Cytoscape Consortium, and Associate Director of Bioinformatics at Gladstone Institutes. He has been a contributing member to Cytoscape since 2006 and has led numerous Cytoscape and Network Biology workshops and mentoring programs over the past 10 years.
John “Scooter” Morris, UCSF, United States Scooter is the Executive Director of the Resource for Biocomputing, Visualization, and Informatics at UCSF, the “Roving Engineer” for Cytoscape, and an Adjunct Assistant Professor of Pharmaceutical Chemistry at UCSF. He has given numerous presentations on using and extending Cytoscape and is a Cytoscape core developer as well as the developer of over a dozen Cytoscape apps, including chemViz, structureViz, clusterMaker, and cddApp.
Barry Demchak, UCSD, United States Barry is the Chief Architect of Cytoscape, Secretary/Treasurer of the Cytoscape Consortium and Project Manager in the Ideker lab at UCSD. He has been a contributing member to Cytoscape development since 2012 and has led numerous Cytoscape and Network Biology workshops and mentored projects over the past 5 years.

Tutorial AM4: Computational methods for comparative regulatory genomics

(SOLD OUT)

July 6, 2018, 9:00 am - 1:00 pm

Room: Columbus KL

Presenters

Saurabh Sinha, Institute of Genomic Biology, University of Illinois, Urbana-Champaign, United States
Colin Dewey, Genome Center of Wisconsin, University of Wisconsin-Madison, United States
Siavash Mirarab, Center for Microbiome Innovation, University of California, San Diego, United States
Ferhat Ay, La Jolla Institute for Allergy and Immunology, University of California, San Diego, United States
Sushmita Roy, Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, United States

Learning Objectives

• Gain an overview of key challenges arising in the comparative analysis of molecular data at the sequence, expression, chromatin, and network level.
• Learn about recent algorithms, software tools and their applications to tackle these challenges.

Schedule Overview

9:00-9:40 am	Whole genome alignment
9:40-10:20 am	Identification and comparative analysis of regulatory sequence elements
10:20-11:00 am	Phylogenetic tree construction
11:20-12:00 pm	Comparative analysis of chromatin state and 3D genome organization
12:00-12:40 pm	Inference and comparative analysis of transcriptional regulatory networks
12:40-1:00 pm	Tutorial wrap up

Audience

Beginner or intermediate

Participant requirements

None

Presenter Bios

Saurabh Sinha, Institute of Genomic Biology, University of Illinois, Urbana-Champaign, United States Saurabh Sinha is a Professor of Computer Science at the University of Illinois Urbana-Champaign. His research focuses on regulatory and comparative genomics and has been supported by NIH, NSF and USDA. He is co-Director of the NIH BD2K Center of Excellence at the University of Illinois. He chairs the M.S. Bioinformatics program of the department, and leads the educational program of the Mayo Clinic-University of Illinois Alliance. He serves as Program co-Chair of the RECOMB Regulatory and Systems Genomics (RSG) conference. He is an NSF CAREER award recipient and was recognized as a University Scholar in 2018.
Colin Dewey, Genome Center of Wisconsin, University of Wisconsin-Madison, United States Colin Dewey is Associate Professor in the Department of Biostatistics and Medical Informatics at UW-Madison, which he joined in 2006. His research focuses on the development of computational and statistical methodology for the analysis of biological sequence data, with RNA-seq data and whole genome sequences of particular interest. Among the methods his group has developed are RSEM (for RNA-seq transcript quantification), DETONATE (for de novo transcriptome assembly evaluation), and Mercator (for multiple whole-genome orthology mapping).
Siavash Mirarab, Center for Microbiome Innovation, University of California, San Diego, United States Siavash Mirarab is an Assistant Professor in the Department of Electrical and Computer Engineering at University of California, San Diego, where he has been since 2015. He obtained his Ph.D. from the Computer Science department at UT-Austin and was advised by Prof. Tandy Warnow. His dissertation received an honorable mention for the 2015 ACM Doctoral Dissertation Award and he is a recipient of the 2017 Sloan Research Fellowship in Computational & Evolutionary Molecular Biology. His lab develops methods for evolutionary computational biology, mostly targeting large-scale datasets. His specific areas of research span many topics, including reconstruction of species trees from gene trees (phylogenomics), large-scale multiple sequence alignment, HIV transmission network reconstruction, and metagenomic analyses using phylogenetic approaches.
Ferhat Ay, La Jolla Institute for Allergy and Immunology, University of California, San Diego, United States Ferhat Ay is the Institute Leadership Assistant Professor of Computational Biology at the La Jolla Institute for Allergy and Immunology and an Assistant Adjunct Professor at the UC San Diego - School of Medicine. His primary research areas are bioinformatics, computational biology, epigenomics, regulatory genomics and 3D/4D Nucleome. He has developed several methods to model the 3D structure of chromatin and its relation to gene regulation in several diseases including malaria and cancer.
Sushmita Roy, Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, United States Sushmita Roy is an assistant professor in the Biostatistics and Medical Informatics department and a faculty member at the Wisconsin Institute for Discovery, University of Wisconsin, Madison. Her lab focuses on the development and application of methods for inference and analysis of gene regulatory networks and their dynamics on developmental and evolutionary lineages. Sushmita is a recipient of the 2014 Alfred P. Sloan Foundation Fellowship, an NSF CAREER award, and a James S. McDonnell Foundation Scholar award.

Tutorial PM5: Visualization of large biological data

(SOLD OUT)

July 6, 2018, 2:00 pm - 6:00 pm

Room: Columbus IJ

Presenters

G. Elisabeta Marai, University of Illinois at Chicago, United States
Kay Nieselt, Center for Bioinformatics, University of Tübingen, Germany
Michael Krone, Center for Bioinformatics, University of Tübingen, Germany

Overview

The aim of this tutorial is to familiarise the participants with modern visual analytics methodologies applied to biological data and to provide simple hands-on training. Questions such as what is data visualization, what is visual analytics and how can biological data be visualised to gain insight are addressed, so that hypotheses can be generated or explored and further targeted analyses can be defined. Topics covered are:
• Digital/Electronic Visualisation of data
• Understanding color
• Visual Design Principles
• Examples of visualisation of biological data
• Challenges of large-scale biological data visualisation

Learning Objectives

• Understand the relationship between visual analysis and bioinformatics
• Use principles of human perception and cognition in visual biological data analysis
• Understand and use visual design principles
• Know the basics and do’s and don’ts of visualisation
• Critically evaluate data visual representations and suggest improvements and refinements
• Create simple web-based interactive visualizations using HTML 5, JavaScript, and possibly D3

Schedule Overview

2:00-2:15pm	Welcome & Introduction to tutorial structure
2:15-2:45pm	What is (electronic) visualization; Understanding color: luminance, color choice, mapping data to color
2:45-3:30pm	Visual design principles: Tufte’s design principles; small multiples; Shneiderman’s mantra
3:30-4:00pm	General visualization software: D3, prefuse, JavaScript; more specific tools for biological data
4:00-4:15pm	Break
4:15-4:45pm	BioVis examples: sequences; macromolecules; multivariate data; networks
4:45-5:15pm	Introduction to HTML5 and JavaScript; generate a simple interactive, web-based visual analysis tool
5:15-6:00pm	Introduction to D3

Audience

This course is designed for everyone who would like to learn and apply visualization techniques in the analysis of large biological data sets. The course provides useful background material on data visualization principles, but the focus is on methods and tools for visualization of next-generation sequencing data, other omics data and network data.

Participant Requirements

None, if participants just wish to listen. Those who would also like to work on the programming part should bring a laptop and have programming knowledge (e.g. Java, C++ or similar).

Presenter Bios

G. Elisabeta Marai, University of Illinois at Chicago, United States G. Elisabeta Marai is an Associate Professor of Computer Science at the University of Illinois at Chicago, affiliated with the Electronic Visualization Laboratory. Her research interests are in biomedical imaging, biology data visualization, and data visual analysis. Liz is a recipient of an NSF CAREER award, of multiple NSF and NIH R01 awards, and of multiple Outstanding Paper awards, and has co-created open-source software (RuleBender, MOSBIE) used by biologists across over 40 institutions. She received her Ph.D. from Brown University in 2007.
http://evl.uic.edu/marai
Kay Nieselt, Center for Bioinformatics, University of Tübingen, Germany Kay Nieselt received her PhD in Mathematics at the Max Planck Institute for Biophysical Chemistry in Göttingen, Germany. Since 2002, she has been a group leader at the Center for Bioinformatics Tübingen. Her main research interests are transcriptomics, small non-coding RNAs, ancient pathogenomics and visual analytics of life science data. Some of her visual analytics software products are Mayday, an open-source framework for transcriptome data analysis, GenomeRing for visualisation of multiple genomes, and Pan-Tetris, an interactive platform for pan-genomes. In 2015, together with Liz Marai, she was General Chair of the Symposium on Biological Data Visualization (BioVis, http://www.biovis.net) at ISMB.

Tutorial PM6: Deep learning for network biology

(SOLD OUT)

July 6, 2018, 2:00 pm - 6:00 pm

Room: Grand Ballroom A

Presenters

Marinka Zitnik, Stanford University, United States
Jure Leskovec, Stanford University, United States

Overview

Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from single-cell to population level. Network approaches have been used many times to combine and amplify signals from individual genes, and have led to remarkable discoveries in biology, including drug discovery, protein function prediction, disease diagnosis, and precision medicine. Mathematical machinery that is central to these approaches is machine learning on networks. The main challenge in machine learning on networks is to find a way to extract information about interactions between nodes and to incorporate that information into a machine learning model. To extract information from networks, classic machine learning approaches often rely on summary statistics (e.g., degrees or clustering coefficients) or carefully engineered features to measure local neighborhood structures (e.g., network motifs). These classic approaches can be limited because these hand-engineered features are inflexible, they often do not generalize to networks derived from other organisms, tissues and experimental technologies, and can fail on datasets with low experimental coverage.
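To make the "summary statistics" mentioned above concrete, here is a minimal Python sketch that computes node degree and the local clustering coefficient on a small undirected graph stored as an adjacency dictionary. The function names and the toy interaction network are hypothetical, chosen only for illustration; they do not come from the tutorial.

```python
def degree(adj, node):
    """Number of neighbors of a node in an undirected graph."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = list(adj[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbors; use 0 by convention
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in adj[neighbors[i]]
    )
    return 2.0 * links / (k * (k - 1))

# Hypothetical toy interaction network (undirected, symmetric adjacency sets).
ppi = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
features = {n: (degree(ppi, n), clustering_coefficient(ppi, n)) for n in ppi}
```

Each node ends up with a fixed, hand-chosen feature vector, which is exactly the kind of representation that learned embeddings aim to replace.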

Recent years have seen a surge in approaches that automatically learn to encode network structure into low-dimensional representations using transformation techniques based on deep learning and nonlinear dimensionality reduction. The idea behind these representation learning approaches is to learn a data transformation function that maps nodes to embeddings, points in a low-dimensional space. Deep representation learning methods have revolutionized the state-of-the-art in network science. This tutorial will investigate methods and case studies for analyzing biological networks and extracting actionable insights, and in doing so, it will provide attendees with a toolbox of next-generation algorithms for network biology.
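Random-walk embedding methods such as DeepWalk, listed in the schedule for this tutorial, begin by sampling node sequences from the graph. The sketch below shows only that sampling step with uniform random walks; learning the embeddings themselves (typically by training a skip-gram model on the walks) is not shown. The function name, parameters and toy graph are assumptions for illustration, not the presenters' code.

```python
import random

def random_walks(adj, walk_length=5, walks_per_node=2, seed=0):
    """Sample fixed-length uniform random walks starting from every node."""
    rng = random.Random(seed)  # seeded for reproducibility
    walks = []
    for start in sorted(adj):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = sorted(adj[walk[-1]])
                if not neighbors:
                    break  # dead end: stop this walk early
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical toy graph (undirected, symmetric adjacency lists).
graph = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
walks = random_walks(graph)
```

Nodes that co-occur often in these walks would end up close together in the learned embedding space.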

Tutorial Website: http://snap.stanford.edu/deepnetbio-ismb

2:00-2:30 pm	Part 1 - Introduction and overview of network biology: Biological network maps and interaction resources; Concepts of network theory; Organizing principles of network biomedicine (hubs, local principle, network parsimony principle, shared components principle); Standard prediction tasks (node classification, link prediction, and node clustering)
2:30-3:30 pm	Part 2 - Matrix factorization and network propagation: Matrix factorization and Laplacian eigenmaps; Random-walk embeddings (e.g., DeepWalk, node2vec, metapath2vec, struc2vec); Integrative matrix factorization and propagation methods to improve performance
3:30-4:00 pm	Part 3 - Introduction to graph autoencoders: Principles of graph autoencoder approaches (encoding, message passing, decoding)
4:00-4:15 pm	Coffee Break
4:15-5:00 pm	Part 4 - Graph autoencoders and deep representation learning: Detailed description of graph convolutional networks (GCNs); Embedding nodes, entire graphs, and extensions for multimodal graphs
5:00-6:00 pm	Part 5 - Applications in network biology and new directions: Single-cell genomics and gene regulation (e.g., clustering of cells, biomarker discovery); Human disease (e.g., disease pathway discovery, multi-omic and clinical data); Tissue-specific protein function prediction; Computational pharmacology (medical indications, polypharmacy side effects, drug repurposing)

Participant Overview

The tutorial will be of broad interest to researchers who work with network data coming from biology, medicine, and life sciences. Graph-structured data arise in many different areas of data mining and predictive analytics, so the tutorial should be of theoretical and practical interest to a large part of the data mining and network science communities.

The tutorial will not require prior knowledge beyond fundamental concepts covered in introductory machine learning and network science classes. Attendees will come away with the broad knowledge necessary to understand state-of-the-art representation learning methods and to use these methods to solve central problems in network biology.

Presenter Bios

Marinka Zitnik, Stanford University, United States Marinka Zitnik is a postdoctoral fellow in Computer Science at Stanford University. Her research focuses on network science and representation learning methods for biomedicine. She received her PhD in Computer Science from the University of Ljubljana in 2015 while also conducting research at Imperial College London, University of Toronto, and Baylor College of Medicine. She received outstanding research awards at ISMB, CAMDA, RECOMB, and BC2 conferences, and is involved in projects at Chan Zuckerberg Biohub.
Jure Leskovec, Stanford University, United States Jure Leskovec is an Associate Professor of Computer Science at Stanford University and Chan Zuckerberg Biohub Investigator. His recent research focuses on biological and biomedical problems and applications of network science to problems in biomedicine and health. Jure received his PhD in Machine Learning from Carnegie Mellon University in 2008 and spent a year at Cornell University. His work received five best paper awards, won the ACM KDD Cup and topped the Battle of the Sensor Networks competition.

Tutorial PM7: High-throughput sequencing: Identification of disease variants in exomes and genomes

July 6, 2018, 2:00 pm - 6:00 pm

Room: Grand Ballroom B

Presenters

Francisco De La Vega, D.Sc. Stanford University & Fabric Genomics
Chad Huff, Ph.D. The University of Texas, MD Anderson Cancer Center
Suzanne Leal, Ph.D. Baylor College of Medicine
Mark Yandell, Ph.D. University of Utah
Yao Yu, Ph.D. The University of Texas, MD Anderson Cancer Center

Overview

With the advent and the continuous drop in cost of next-generation sequencing, whole exome (WES) and whole genome sequencing (WGS) have become the platforms of choice for the diagnosis of Mendelian disease. New clinical applications of genome sequencing continue to appear, such as the diagnosis of idiopathic disease and the rapid diagnosis of rare childhood diseases in neonatal/pediatric intensive care units. In the research setting, this technology makes it possible to explore the role of rare genetic variation in common, complex diseases through the sequencing of patient cohorts and case/control studies. A number of research studies are now generating WES or WGS data for sample sizes ranging from hundreds to thousands of cases. As the cost of sequencing drops further, it is expected that the number of cases sequenced will reach the hundreds of thousands, providing the statistical power to identify disease associations with rare variants. For example, Genomics England is well underway to achieve its goal of sequencing 100,000 genomes, about half of these for rare genetic diseases and half for cancer patients. Regeneron Pharmaceuticals, in collaboration with the Geisinger Healthcare system, has already sequenced about 100,000 exomes, with the goal of reaching 250,000. In addition, Regeneron has recently proposed to build a coalition to sequence the ~250K cases of the Wellcome Trust Case Control Study. Many healthcare systems around the world are starting to conceive similar projects, where a key aspect of the initiative is that sampling of cases is carried out as part of the healthcare of patients, and while the data obtained will be used in aggregate to look for findings that can drive drug development and new therapeutic approaches, a diagnostic of immediate value to the patient should be provided as well. Finally, the NIH “All of Us” million-person project is starting to move forward, where the ultimate goal will be the sequencing of the genomes of all of the participants.

Motivation

Identification of disease-causing variants, whether in clinical diagnostics or research studies, requires algorithms and statistical methods to score variants with respect to their likely relevance to the disease at hand. In diagnostics, these scores should allow clinicians to focus quickly on a relatively small number of candidate variants to examine their evidence and be able to classify them as either pathogenic or benign, with respect to the patient’s disease phenotype. In research studies, the goal is to understand the role of genetic variation in complex disease, using methods that can aggregate the burden of many rare deleterious variants in key genes to assess their contribution to the trait. In addition, analysis methods should be able to identify donors harboring a Mendelian-like version of the disease, with ultra-rare variants of very strong effect: natural knock-outs of genes that, while they may not result in early developmental disease, may significantly influence late-onset disease, either by accelerating its onset or by protecting against it. A notable success of this paradigm was the finding of homozygotes for deleterious variants in the PCSK9 gene, a very rare genotype that protects carriers against cardiovascular disease (CVD). This finding led to the development of the latest class of CVD drugs, and finding more cases like this is driving a lot of pharmaceutical genome sequencing. Analysis strategies to deal with each of these cases are different, and yet they need to be considered together in the analysis of projects generating large-scale WGS/WES data from patients.
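The idea of aggregating the burden of rare variants within a gene can be sketched in a few lines of Python. This is a deliberately simplified illustration under stated assumptions: the genotypes, allele frequencies and the 1% rarity threshold are hypothetical, and the code is not one of the presenters' tools. A real analysis would filter for deleterious variants, apply a formal statistical test and adjust for covariates.

```python
def gene_burden(genotypes, allele_freqs, max_af=0.01):
    """Count rare alternate alleles per sample across one gene's variants.

    genotypes: one row per sample, one alternate-allele count (0, 1 or 2)
    per variant. allele_freqs: population allele frequency per variant.
    max_af is an illustrative rarity threshold.
    """
    rare = [i for i, af in enumerate(allele_freqs) if af <= max_af]
    return [sum(sample[i] for i in rare) for sample in genotypes]

def mean(values):
    return sum(values) / len(values)

# Hypothetical data: three variants in one gene; the second is common.
allele_freqs = [0.002, 0.30, 0.008]
cases = [[1, 1, 0], [0, 2, 1], [1, 0, 1]]
controls = [[0, 1, 0], [0, 2, 0], [0, 1, 1]]

case_burden = gene_burden(cases, allele_freqs)
control_burden = gene_burden(controls, allele_freqs)
burden_difference = mean(case_burden) - mean(control_burden)
```

Collapsing variants into a single per-sample score is what gives such gene-level methods power when each individual variant is too rare to test on its own.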

Goals of the Tutorial

The goal of this tutorial is to present an overview of the current state of disease variant identification approaches, and to describe both the most common methods used to interpret variants from WGS/WES patient datasets for clinical diagnostics and the statistical methods applied to the analysis of large cohorts of patients with WES/WGS data for finding novel disease genes. To ensure we deliver practical knowledge, we will discuss in some detail specific tools that the presenters have developed, explaining the fundamentals of the algorithms underlying them, how to use them in real use cases, and how these tools compare to other available tools and approaches. Since the scale of genome datasets keeps growing, it is also important to understand the techniques that make these analyses scalable.

Learning Objectives

At the end of the tutorial the participants will have an understanding of: 1) What the challenges of analyzing WES/WGS data for clinical diagnostics and disease association studies are; 2) How variant prioritization can be performed probabilistically and why it is superior to empirical filtering schemes; 3) How to take advantage of family structures and phenotype information in these endeavors; 4) What the difficulties in the analysis of rare variants for disease gene finding are; 5) What the typical and most advanced tools for rare variant analysis are; and 6) What the novel approaches are for the analysis of disease cohorts, both for identifying rare variants influencing common disease and for identifying ultra-rare homozygotes with very strong effects.

Intended Audience

The participants of this tutorial will be bioinformaticians, statisticians, or geneticists who anticipate being involved in the analysis of WES/WGS data for either clinical diagnostics or case/control association studies with an emphasis on rare variants. This tutorial will appeal to participants with either an academic or an industry (e.g. pharmaceutical industry/clinical diagnostic labs) background.

Participant requirements

This is a theoretical tutorial; the only requirements are familiarity with the basics of next-generation sequencing of genomes and exomes, the basics of human genetics, and ideally an understanding of how classical GWAS for common variants work.

Schedule Overview

Timing	Presenter	Topic
2:00-2:50 pm	F. De La Vega	Introduction to variant prioritization in Mendelian disease diagnostics: variant annotation and effects; assessment of deleteriousness; leveraging population allele frequencies; variant interpretation schemes; challenges of annotation of small vs structural variants
3:00-3:50 pm	G. Wang	Analysis of large-scale rare variant association studies: common variant vs rare variant disease susceptibility; rare variant study design and power; rare variant association tests; burden vs variance component tests; VAT - quality control and analysis of population-based exome association studies
4:00-4:15 pm	Coffee Break
4:15-5:15 pm	M. Yandell	Discovery of rare and ultra-rare disease variants in case/control and cohort studies: VAAST and VVP algorithms; rare and common variant association from WES/WGS with VAAST; power considerations for ratios of cases and controls; challenges in finding Mendelian genotypes embedded in case/control studies
5:15-6:15 pm	C. Huff and Yao Yu	Rare variant prioritization and association analysis with VAAST, XPAT, PHEVOR, and related tools: rare variant association studies with VAAST; familial studies with pVAAST; cross-platform sequencing association studies with XPAT; leveraging phenotype information with PHEVOR

References

Mendelian disease analysis by WGS/WES

Eilbeck, K., Quinlan, A. & Yandell, M. Settling the score: variant prioritization and Mendelian disease. Nature Reviews Genetics 1–14 (2017). doi:10.1038/nrg.2017.52

Wright, C. F., FitzPatrick, D. R. & Firth, H. V. Paediatric genomics: diagnosing rare disease in children. Nature Publishing Group 10, 1–16 (2018).

Coonrod, E. M., Margraf, R. L., Russell, A., Voelkerding, K. V. & Reese, M. G. Clinical analysis of genome next-generation sequencing data using the Omicia platform. Expert Rev Mol Diagn 13, 529–540 (2013).

Rare Variant Association Tests

Nicolae, D. L. Association Tests for Rare Variants. Annu. Rev. Genom. Human Genet. 17, 117–130 (2016).

Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-Variant Association Analysis: Study Designs and Statistical Tests. The American Journal of Human Genetics 95, 5–23 (2014).

Auer PL et al (2016) Guidelines for Large-Scale Sequence-Based Complex Trait Association Studies: Lessons Learned from the NHLBI Exome Sequencing Project, Am J Hum Genet. 99 (4): 791-801.

Analysis Tools

F. Anthony San Lucas, Gao Wang, Paul Scheet, and Bo Peng (2012) Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools, Bioinformatics 28 (3): 421-422.

Gao Wang, Bo Peng and Suzanne M. Leal (2014) Variant Association Tools for Quality Control and Analysis of Large-Scale Sequence and Genotyping Array Data, The American Journal of Human Genetics 94 (5): 770–83.

Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, Jorde LB, Reese MG. A probabilistic disease-gene finder for personal genomes. Genome Res 2011, 21(9):1529-1542.

Singleton M., Guthery SL., Voelkerding KV., Chen K., Kennedy BJ., Margraf RL., Durtschi J., Eilbeck K., Reese MG., Jorde LB., Huff CD., Yandell M. Phevor Combines Multiple Biomedical Ontologies for Accurate Identification of Disease-Causing Alleles in Single Individuals and Small Nuclear Families. Am J Hum Genet. 2014 Apr 3;94(4):599-610.

Flygare S, Hernandez EJ, Phan L, et al. The VAAST Variant Prioritizer (VVP): ultrafast, easy to use whole genome variant prioritization tool. BMC Bioinformatics. 2018;19:57.

Yu, Y. et al. XPAT: a toolkit to conduct cross-platform association studies with heterogeneous sequencing datasets. Nucleic Acids Research 1–11 (2017). doi:10.1093/nar/gkx1280

Di Zhang et al. SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data. The American Journal of Human Genetics 101, 115–122 (2017).

Links to Tools and Code

VAAST 2.0 and pVAAST at Yandell Lab: http://www.yandell-lab.org/software/vaast.html
PHEVOR 2.0 web service: http://weatherby.genetics.utah.edu/phevor2/index.html
Variant Tools: http://varianttools.sourceforge.net/
XPAT at Huff Lab: http://www.hufflab.org/software/xpat/
Materials and guides at Leal lab: https://statgen.research.bcm.edu/index.php/Tutorials

Presenter Bios

Francisco M. De La Vega, D.Sc., Stanford University School of Medicine & Fabric Genomics, United States Adjunct Professor at the Department of Biomedical Data Science at Stanford, and SVP of Genomics at Fabric Genomics. Dr. De La Vega is a geneticist and computational biologist with interests in cancer, population, and clinical genomics, and with extensive experience in the life sciences industry. He has led the development of new methods and software for the analysis of next-generation sequencing data and has been involved in major population-scale sequencing projects such as the 1000 Genomes Project, the PanCancer Analysis of Whole Genomes project of the ICGC, and standard-setting public-private partnerships such as the NIST Genome-in-a-Bottle Consortium.
Chad Huff, Ph.D., The University of Texas MD Anderson Cancer Center, United States Associate Professor, Department of Epidemiology, The University of Texas MD Anderson Cancer Center. He works on understanding human evolution and the genetic basis of human disease through statistical, computational, and population genomics. His current focus is on developing new methods to analyze genomic data and applying these methods to discover novel insights about the genetic basis of human disease, with particular emphasis on identifying and characterizing genes that increase the risk of developing common cancers.
Suzanne Leal, Ph.D., Baylor College of Medicine, United States Professor in the Department of Molecular and Human Genetics at Baylor College of Medicine and Director of the Center for Statistical Genetics, and also an adjunct Professor in the Department of Statistics at Rice University and a Senior Research Associate at The Rockefeller University. Dr. Leal's interests lie in statistical genetics and genetic epidemiology, and she has worked extensively on developing methods to aid in gene identification and in understanding disease etiology. Her current focus is on the development of methods to analyze rare variants. Dr. Leal is also pioneering big-data architectures to more effectively process large WES/WGS datasets from case/control studies.
Mark Yandell, Ph.D., University of Utah, United States Professor of Human Genetics and H.A. and Edna Benning Presidential Endowed Chair at University of Utah. Dr. Yandell develops computational algorithms and software tools to analyze genomics data and uses these tools to identify disease-causing variants in clinical settings, to understand the molecular basis of gene dysfunction, and to understand evolution. He spent three years at the Genome Sequencing Center at Washington University School of Medicine, St. Louis, and then three years at Celera Genomics, where he led the Annotation Software Research and Development group. He has led the development of innovative variant prioritization tools and of novel methods that take advantage of a patient's disease phenotype by leveraging biomedical phenotype ontologies, and more recently has been extending these tools to make them more efficient and applicable to large cohort studies.
Yao Yu, Ph.D., The University of Texas MD Anderson Cancer Center, United States Computational Scientist at the Department of Epidemiology, The University of Texas MD Anderson Cancer Center. His research interests cover a wide range of topics in computational biology, including genetics, genomics, transcriptomics, and metabolomics. He is the lead developer of the Cross-Platform Association Toolkit (XPAT), a suite of tools designed to support and conduct large-scale association studies with heterogeneous sequencing datasets.

Tutorial PM8: Ontologies in computational biology

July 6, 2018, 2:00 pm - 6:00 pm

Room: Columbus KL

Presenters

Michel Dumontier, Maastricht University, Netherlands
Robert Hoehndorf, King Abdullah University of Science and Technology, Kingdom of Saudi Arabia

Overview

Ontologies have long provided a core foundation in the organization of biomedical entities, their attributes, and their relationships. With over 500 biomedical ontologies currently available, a number of exciting new opportunities are emerging in using ontologies for large-scale data sharing and data analysis. This tutorial will help you understand what ontologies are and how they are being used in computational biology and bioinformatics.

Learning Objectives

This is an introductory-level course on ontologies and ontology-based data analysis in bioinformatics. In this tutorial, participants will learn:
- what ontologies are and where to find them
- how to understand and use ontology semantics through automated reasoning
- how to measure semantic similarity
- how to incorporate ontologies and semantic similarity measures in bioinformatics analyses
- recent developments in bio-ontologies
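
To give a flavor of the semantic similarity objective listed above, here is a minimal sketch of one widely used measure, Resnik similarity, in which two classes are as similar as the information content of their most informative common ancestor. The toy ontology, the gene annotations, and all function names are invented for illustration and are not taken from the tutorial materials, which use their own notebooks.

```python
from math import log

# Hypothetical toy ontology: each class maps to its direct superclasses (is-a).
PARENTS = {
    "phenotype": [],
    "cardiac abnormality": ["phenotype"],
    "arrhythmia": ["cardiac abnormality"],
    "cardiomyopathy": ["cardiac abnormality"],
    "renal abnormality": ["phenotype"],
}

# Hypothetical annotations: each entity maps to its directly annotated classes.
ANNOTATIONS = {
    "geneA": {"arrhythmia"},
    "geneB": {"cardiomyopathy"},
    "geneC": {"renal abnormality"},
    "geneD": {"arrhythmia"},
}


def ancestors(term):
    """Return the class itself together with all of its superclasses."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def information_content():
    """IC(c) = log(N / n_c), where n_c counts entities annotated to c or a subclass."""
    total = len(ANNOTATIONS)
    ic = {}
    for term in PARENTS:
        n = sum(
            1
            for classes in ANNOTATIONS.values()
            if any(term in ancestors(c) for c in classes)
        )
        if n:
            ic[term] = log(total / n)
    return ic


def resnik(term1, term2, ic):
    """Information content of the most informative common ancestor."""
    common = ancestors(term1) & ancestors(term2)
    return max((ic.get(t, 0.0) for t in common), default=0.0)


ic = information_content()
print(resnik("arrhythmia", "cardiomyopathy", ic))
print(resnik("arrhythmia", "renal abnormality", ic))
```

In this toy example the two cardiac classes share a moderately specific ancestor and so score above zero, while a cardiac class and a renal class share only the root, whose information content is zero.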

Intended Audience

The tutorial will be of interest to any researcher who will use or produce large structured datasets in computational biology. The tutorial will be at an introductory level, but will also describe current research directions and challenges that will be of broad interest to researchers in computational biology.

Requirements

The tutorial will contain a hands-on part. If you want to participate (instead of just watching the presentation), please download and install Jupyter Notebook (http://jupyter.org/) with a SciJava kernel. For the latest updates on this tutorial, see https://github.com/bio-ontology-research-group/ontology-tutorial

Presenter Bios

Michel Dumontier, Maastricht University, Netherlands Michel Dumontier is a Distinguished Professor of Data Science at Maastricht University. His research focuses on the development of computational methods for scalable integration and reproducible analysis of FAIR (Findable, Accessible, Interoperable and Reusable) data across scales - from molecules, tissues, organs, individuals, populations to the environment. His group combines semantic web technologies with effective indexing, machine learning and network analysis for drug discovery and personalized medicine. Dr. Dumontier leads a new inter-faculty Institute for Data Science at Maastricht University with a focus on accelerating discovery science, empowering communities, and improving health and well being. He is the editor-in-chief for the IOS press journal Data Science and an associate editor for the IOS press journal Semantic Web. He is the scientific director for Bio2RDF, an open source project to generate Linked Data for the Life Sciences and is a technical lead for the FAIR (Findable, Accessible, Interoperable, Re-usable) data initiative. He has published over 125 articles in top rated journals and international conferences. He is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies as evidenced by awards, keynote talks at international conferences, and collaborations on international projects.
Robert Hoehndorf, King Abdullah University of Science and Technology, Kingdom of Saudi Arabia Robert Hoehndorf is an Assistant Professor in Computer Science at King Abdullah University of Science and Technology in Thuwal. His research focuses on the applications of ontologies in biology and biomedicine, with a particular emphasis on integrating and analyzing heterogeneous, multimodal data. Dr. Hoehndorf has developed the PhenomeNET system for ontology-based prioritization of disease genes using model organism phenotypes, and contributed to the development of the AberOWL ontology repository. He is an associate editor for the Journal of Biomedical Semantics, BMC Bioinformatics, and Applied Ontology, and an editorial board member of the IOS press journal Data Science. He has published over 90 papers in journals and international conferences, and has presented previous tutorials on ontologies and their applications at ISMB, OWL-ED, and ECCB.