ISMB 2018 - Tutorials
- Tutorial AM1: Single cell RNA-seq toolkit (SOLD OUT)
- Tutorial AM2: Machine learning methods in the analysis of genomic and clinical data (SOLD OUT)
- Tutorial AM3: Integrated network analysis: Cytoscape automation using R and Python (SOLD OUT)
- Tutorial AM4: Computational methods for comparative regulatory genomics (SOLD OUT)
- Tutorial PM5: Visualization of large biological data (SOLD OUT)
- Tutorial PM6: Deep learning for network biology (SOLD OUT)
- Tutorial PM7: High-throughput sequencing: Identification of disease variants in exomes and genomes
- Tutorial PM8: Ontologies in computational biology
Tutorial AM1: Single cell RNA-seq toolkit (SOLD OUT)
Room: Grand Ballroom A
Tyler Faits, Boston University, United States
Matan Hofree, Broad Institute, United States
Ayshwarya Subramanian, Broad Institute, United States
Alex Tsankov, Broad Institute, United States
Single cell transcriptomics has emerged as a powerful tool to identify and interrogate novel cell types in homeostatic and perturbed states. Unlike bulk transcriptomics, single cell data provides resolution at the level of individual cells while working with much smaller quantities of RNA. As such, analysis of single cell RNA sequencing (scRNA-seq) data presents challenges of scale and technical noise, while providing the resolution necessary to pursue novel questions that earlier technologies did not allow.
The objective of the tutorial is to provide an overview of the laboratory and computational challenges involved in generating and analyzing scRNA-seq data. Participants will be introduced to popular molecular technologies for generating scRNA-seq data, and gain hands-on experience with existing software tools and computational methods for its analysis. The tutorial will briefly introduce approaches for preprocessing of scRNA-seq data, including demultiplexing, sequence alignment, and quality control. Then, starting from a cell x gene expression matrix, participants will learn standard methods to infer heterogeneity by identifying clusters of cells and perform analyses to assign cell identity and function. Participants will also be introduced to specialized analytical methods for exploring expression signatures of cell states, cellular differentiation trajectories, inference of cellular localization, and modern methods targeted towards better understanding of cancer biology. Analyses will be performed by executing commands in RStudio as well as leveraging newly developed point-and-click graphical R/Shiny interfaces.
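The workflow described above (quality control, normalization, then clustering starting from a cell x gene matrix) can be sketched in a few lines. This is only an illustrative sketch in plain Python, not the tutorial's own R material: the toy counts, the `min_counts` threshold, the scale factor and the choice of k-means are all assumptions made for the example.

```python
import math

# Toy cell x gene count matrix (rows = cells, columns = genes); values are invented.
counts = [
    [10, 0, 1, 0],   # cell 0
    [8, 1, 0, 0],    # cell 1
    [0, 9, 0, 12],   # cell 2
    [1, 11, 0, 9],   # cell 3
    [0, 0, 0, 1],    # cell 4: nearly empty, expected to fail QC
]

def qc_filter(matrix, min_counts=5):
    """Keep cells whose total count reaches a (hypothetical) threshold."""
    return [row for row in matrix if sum(row) >= min_counts]

def log_normalize(matrix, scale=10_000):
    """Scale each cell to a common library size, then apply log(1 + x)."""
    return [[math.log1p(v / sum(row) * scale) for v in row] for row in matrix]

def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-first initial centres."""
    centres = [points[0]]
    while len(centres) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centres[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

filtered = qc_filter(counts)
normalized = log_normalize(filtered)
labels = kmeans(normalized, k=2)
print(len(filtered), "cells kept; cluster labels:", labels)
```

In practice, dedicated packages handle these steps with more careful statistics (for example variable-gene selection and graph-based clustering); the sketch only shows how the stages connect.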
Familiarity with basic RNA-Seq data analysis and working knowledge of R.
The tutorial will utilize web/cloud-based computing infrastructure with all software preinstalled, such that the only user requirement will be a personal laptop with the Google Chrome web browser installed. From within the Chrome web browser, users will access RStudio and additional web-based utilities for computation.
|9:00-9:45 am||Introduction: Tutorial infrastructure setup; Technologies for scRNA-seq data generation; Description of course datasets, case study and analysis questions|
|9:45-10:15 am||Quality-control and preprocessing; introduction to scRNA-seq data structures in R|
|10:15-11:00 am||Basic analyses of scRNA-seq data; batch effect correction, clustering and inference of cell-types|
|11:00-11:15 am||Coffee Break|
|11:15-11:45 am||Cell cluster-based differential expression and pathway analysis|
|11:45 am-12:30 pm||Interactive tools for visualization and scRNA-seq analysis|
|12:30-1:00 pm||Specialized scRNA-seq applications, currently available resources, and data repositories|
Matan Hofree, Broad Institute, United States Dr. Hofree completed his PhD at UC San Diego under the supervision of Trey Ideker, developing approaches for improved inference, classification and biological subtype discovery in cancer, using prior biological knowledge encoded in gene interaction networks. He received his B.Sc. in Computer Science and Computational Biology at the Hebrew University of Jerusalem, Israel. Presently, Dr. Hofree is working under the mentorship of Dr. Regev, developing computational techniques for single cell transcriptomics, and studying how transcriptional plasticity and heterogeneity drive diverse tumors.
Ayshwarya Subramanian, Broad Institute, United States Ayshwarya Subramanian completed her PhD at Carnegie Mellon University under the supervision of Russell Schwartz. Her dissertation focused on developing computational methods for resolving heterogeneity in high-throughput data from tumors, detecting progression markers, and using these markers for phylogenetic inference. Her postdoctoral training was completed at the Harvard T.H. Chan School of Public Health’s Department of Biostatistics, with Curtis Huttenhower and Rafael Irizarry, where she developed probabilistic models of transcriptional activity states from bulk transcriptome sequencing data and approaches for analysis of metagenomic data. She is a computational scientist at the Broad Institute with Anna Greka and Aviv Regev, working on understanding kidney biology and disease using single cell transcriptomics.
Alex Tsankov, Broad Institute, United States Alex Tsankov completed his M.S. and Ph.D. in electrical engineering and computer science at Massachusetts Institute of Technology. Under the mentorship of Aviv Regev and Oliver Rando, his Ph.D. thesis characterized the role of nucleosome positioning in the evolution of Ascomycota yeasts. His postdoctoral training in Alex Meissner’s lab at Harvard University focused on understanding transcription factor and epigenetic dynamics during differentiation of human embryonic stem cells into the three germ layers and also led to the creation of a quantitative assay of functional potency called the ScoreCard. Presently, Dr. Tsankov is a computational scientist at the Broad Institute and is using single cell transcriptomics to build a cellular atlas of the human lung and to study transcriptional heterogeneity and metastasis in lung cancer.
Tyler Faits, Boston University, United States Tyler Faits is a Ph.D. candidate in the Bioinformatics Program at Boston University. His thesis, supervised by Dr. Evan Johnson, is focused on creating tools for the optimization of single cell RNA-sequencing experimental designs, and developing interactive portals for transcriptomic data in applications that include single cell RNA-sequencing and metatranscriptomic data analysis.
Tutorial AM2: Machine learning methods in the analysis of genomic and clinical data (SOLD OUT)
Room: Grand Ballroom B
Felipe Llinares-López, ETH Zurich, Basel, Switzerland
Damian Roqueiro, ETH Zurich, Basel, Switzerland
This tutorial covers various machine learning (ML) tools that have been developed for the analysis of genomic and clinical data. It is an intermediate level tutorial targeted to an audience with previous experience in diverse bioinformatics methods such as: i) genome-wide association studies, ii) comparison of structured data such as graphs or time-series, and iii) traditional text mining. State-of-the-art methods and their applications are presented. We will also discuss illustrative examples of how deep learning is currently being used in the analysis of biomedical data.
Beginner or intermediate. For hands-on sessions: programming experience in R/Python is required.
Collect your name badge July 6 between 8:00 am - 8:45 am at the Conference Registration Desk, Ballroom Foyer, East Tower (lower level) Hyatt Regency Chicago.
For the hands-on session, if you wish to follow the steps we present, you will need one of the following installed on your laptop:
- R 3.4 or newer
- Python 2.7 or 3
|9:00 - 9:10 am||Damian Roqueiro||Introduction: ML in Bioinformatics. Overview of topics presented in the session.|
|9:10 - 10:10 am||Felipe Llinares-López||Module I: Significant pattern mining (SPM) and pruning the search space in association studies to increase statistical power|
|10:10 - 11:00 am||Damian Roqueiro||Hands-on session: applying SPM on genomic data with the package “sigPatSearch”|
|11:00 - 11:15 am||Coffee break|
|11:15 - 12:00 pm||Damian Roqueiro||Module II: ML methods to compare structured biomedical data such as strings, graphs and time series.|
|12:00 - 12:30 pm||Felipe Llinares-López||Hands-on session: Computing graph kernels with the package “graphKernels”|
|12:30 - 1:00 pm||Damian Roqueiro and Felipe Llinares-López||Module III: Deep learning and its applications to biomedical data. Illustrative examples, with a focus on text mining and processing of electronic health records.|
Felipe Llinares-López, ETH Zurich, Basel, Switzerland Felipe Llinares-López is a PhD student in the Machine Learning and Computational Biology lab at ETH Zurich. The main focus of his PhD research has been the development of algorithms to assess the statistical association between a target of interest and high-order interactions between features, and applying these methods to selected problems in computational biology, such as genome-wide association studies.
Damian Roqueiro, ETH Zurich, Basel, Switzerland Damian Roqueiro is a postdoc in the Machine Learning and Computational Biology lab at ETH Zurich. His research has been focused on the development and application of machine learning techniques to better understand the association between specific diseases and the genetic makeup of individuals afflicted by those diseases.
Tutorial AM3: Integrated network analysis: Cytoscape automation using R and Python (SOLD OUT)
Room: Columbus IJ
Alexander Pico, Gladstone Institutes, United States
John “Scooter” Morris, UCSF, United States
Barry Demchak, UCSD, United States
Cytoscape is one of the most popular applications for network analysis and visualization. In this workshop, we will demonstrate new capabilities to integrate Cytoscape into programmatic workflows and pipelines using R and Python. We will begin with an overview of network biology themes and concepts, and then we will translate these into Cytoscape terms for practical applications. The bulk of the workshop will be a hands-on demonstration of accessing and controlling Cytoscape from R and Python to perform a network analysis of tumor expression and variant data.
By the end of the tutorial, you should be able to:
• Know when and how to use Cytoscape in your research area
• Identify and discriminate relevant sources of interactions, networks and datasets
• Command programmatic control over Cytoscape
• Integrate Cytoscape into your bioinformatics pipelines
• Publish, share and export networks
• Generalize network analysis methods to multiple problem domains
|9:00-9:20 am||Introductory (20 min)|
|9:20-10:30 am||Getting relevant networks|
|10:30-11:00 am||Intermediate (30 min)|
|11:00-11:15 am||Coffee break|
|11:15 am-12:15 pm||Advanced (60 min)|
|12:15-1:00 pm||Additional Topics and Q&A (45 min)|
This tutorial is intended for an audience that has prior experience with at least one of the following:
• Cytoscape software
• Network biology concepts
• Bioinformatics analysis using R or Python
Participants are required to bring a laptop with Cytoscape, R, RStudio and Python installed. Installation instructions will be provided in the weeks preceding the tutorial.
Alexander Pico, Gladstone Institutes, United States Alex is the Executive Director of the National Resource for Network Biology, the Vice President of the Cytoscape Consortium, and Associate Director of Bioinformatics at Gladstone Institutes. He has been a contributing member to Cytoscape since 2006 and has led numerous Cytoscape and Network Biology workshops and mentoring programs over the past 10 years.
John “Scooter” Morris, UCSF, United States Scooter is the Executive Director of the Resource for Biocomputing, Visualization, and Informatics at UCSF, the “Roving Engineer” for Cytoscape, and an Adjunct Assistant Professor of Pharmaceutical Chemistry at UCSF. He has given numerous presentations on using and extending Cytoscape and is a Cytoscape core developer as well as the developer of over a dozen Cytoscape apps, including chemViz, structureViz, clusterMaker, and cddApp.
Barry Demchak, UCSD, United States Barry is the Chief Architect of Cytoscape, Secretary/Treasurer of the Cytoscape Consortium and Project Manager in the Ideker lab at UCSD. He has been a contributing member to Cytoscape development since 2012 and has led numerous Cytoscape and Network Biology workshops and mentored projects over the past 5 years.
Tutorial AM4: Computational methods for comparative regulatory genomics (SOLD OUT)
Room: Columbus KL
Saurabh Sinha, Institute of Genomic Biology, University of Illinois, Urbana-Champaign, United States
Colin Dewey, Genome Center of Wisconsin, University of Wisconsin-Madison, United States
Siavash Mirarab, Center for Microbiome Innovation, University of California, San Diego, United States
Ferhat Ay, La Jolla Institute for Allergy and Immunology, University of California, San Diego, United States
Sushmita Roy, Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, United States
• Gain an overview of key challenges arising in the comparative analysis of molecular data at the sequence, expression, chromatin, and network level.
• Learn about recent algorithms, software tools and their applications to tackle these challenges.
|9:00-9:40 am||Whole genome alignment|
|9:40-10:20 am||Identification and comparative analysis of regulatory sequence elements|
|10:20-11:00 am||Phylogenetic tree construction|
|11:20-12:00 pm||Comparative analysis of chromatin state and 3D genome organization|
|12:00-12:40 pm||Inference and comparative analysis of transcriptional regulatory networks|
|12:40-1:00 pm||Tutorial wrap up|
Beginner or intermediate
Saurabh Sinha, Institute of Genomic Biology, University of Illinois, Urbana-Champaign, United States Saurabh Sinha is a Professor of Computer Science at the University of Illinois Urbana-Champaign. His research focuses on regulatory and comparative genomics and has been supported by NIH, NSF and USDA. He is co-Director of the NIH BD2K Center of Excellence at the University of Illinois. He chairs the M.S. Bioinformatics program of the department, and leads the educational program of the Mayo Clinic-University of Illinois Alliance. He serves as Program co-Chair of the RECOMB Regulatory and Systems Genomics (RSG) conference. He is an NSF CAREER award recipient and was recognized as a University Scholar in 2018.
Colin Dewey, Genome Center of Wisconsin, University of Wisconsin-Madison, United States Colin Dewey is Associate Professor in the Department of Biostatistics and Medical Informatics at UW-Madison, which he joined in 2006. His research focuses on the development of computational and statistical methodology for the analysis of biological sequence data, with RNA-seq data and whole genome sequences of particular interest. Among the methods his group has developed are RSEM (for RNA-seq transcript quantification), DETONATE (for de novo transcriptome assembly evaluation), and Mercator (for multiple whole-genome orthology mapping).
Siavash Mirarab, Center for Microbiome Innovation, University of California, San Diego, United States Siavash Mirarab is an Assistant Professor in the Department of Electrical and Computer Engineering at the University of California, San Diego, where he has been since 2015. He obtained his Ph.D. from the Computer Science department at UT-Austin and was advised by Prof. Tandy Warnow. His dissertation received an honorable mention for the 2015 ACM Doctoral Dissertation Award and he is a recipient of the 2017 Sloan Research Fellowship in Computational & Evolutionary Molecular Biology. His lab develops methods for evolutionary computational biology, mostly targeting large-scale datasets. His specific areas of research span many topics, including reconstruction of species trees from gene trees (phylogenomics), large-scale multiple sequence alignment, HIV transmission network reconstruction, and metagenomic analyses using phylogenetic approaches.
Ferhat Ay, La Jolla Institute for Allergy and Immunology, University of California, San Diego, United States Ferhat Ay is the Institute Leadership Assistant Professor of Computational Biology at the La Jolla Institute for Allergy and Immunology and an Assistant Adjunct Professor at the UC San Diego - School of Medicine. His primary research areas are bioinformatics, computational biology, epigenomics, regulatory genomics and 3D/4D Nucleome. He has developed several methods to model the 3D structure of chromatin and its relation to gene regulation in several diseases including malaria and cancer.
Sushmita Roy, Wisconsin Institute for Discovery, University of Wisconsin-Madison, Madison, United States Sushmita Roy is an assistant professor in the Biostatistics and Medical Informatics department and a faculty member at the Wisconsin Institute for Discovery, University of Wisconsin, Madison. Her lab focuses on the development and application of methods for inference and analysis of gene regulatory networks and their dynamics on developmental and evolutionary lineages. Sushmita is a recipient of the 2014 Alfred P. Sloan Foundation Fellowship, an NSF CAREER award, and a James S. McDonnell Foundation Scholar award.
Tutorial PM5: Visualization of large biological data (SOLD OUT)
Room: Columbus IJ
G. Elisabeta Marai, University of Illinois at Chicago, United States
Kay Nieselt, Center for Bioinformatics, University of Tübingen, Germany
Michael Krone, Center for Bioinformatics, University of Tübingen, Germany
The aim of this tutorial is to familiarise the participants with modern visual analytics methodologies applied to biological data and to provide simple hands-on training. Questions such as what is data visualization, what is visual analytics and how can biological data be visualised to gain insight are addressed, so that hypotheses can be generated or explored and further targeted analyses can be defined. Topics covered are:
• Digital/Electronic Visualisation of data
• Understanding color
• Visual Design Principles
• Examples of visualisation of biological data
• Challenges of large-scale biological data visualisation
• Understand the relationship between visual analysis and bioinformatics
• Use principles of human perception and cognition in visual biological data analysis
• Understand and use visual design principles
• Know the basics and do’s and don’ts of visualisation
• Critically evaluate data visual representations and suggest improvements and refinements
|2:00-2:15pm||Welcome & Introduction to tutorial structure|
|2:15-2:45pm||What is (electronic) visualization - Understanding color|
|2:45-3:30pm||Visual design principles|
|5:15-6:00pm||Introduction to D3|
This course is designed for everyone who would like to learn and apply visualization techniques in the analysis of large biological data sets. The course provides useful background material on data visualization principles, but the focus is on methods and tools for visualization of next-generation sequencing data, other omics data and network data.
None, if participants just wish to listen. Those who would also like to work on the programming part should bring a laptop and have programming knowledge (e.g. Java, C++ or similar).
G. Elisabeta Marai, University of Illinois at Chicago, United States G. Elisabeta Marai is an Associate Professor of Computer Science at the University of Illinois at Chicago, affiliated with the Electronic Visualization Laboratory. Her research interests are in biomedical imaging, biology data visualization, and data visual analysis. Liz is a recipient of an NSF CAREER award, of multiple NSF and NIH R01 awards, and of multiple Outstanding Paper awards, and has co-created open-source software (RuleBender, MOSBIE) used by biologists across over 40 institutions. She received her Ph.D. from Brown University in 2007.
Kay Nieselt, Center for Bioinformatics, University of Tübingen, Germany Kay Nieselt received her PhD in Mathematics at the Max Planck Institute for Biophysical Chemistry in Göttingen, Germany. She has been a group leader at the Center for Bioinformatics Tübingen since 2002. Her main research interests are transcriptomics, small non-coding RNAs, ancient pathogenomics and visual analytics of life science data. Some of her visual analytics software products are Mayday, an open-source framework for transcriptome data analysis, GenomeRing for visualisation of multiple genomes, and Pan-Tetris, an interactive platform for pan-genomes. In 2015, together with Liz Marai, she was General Chair of the Symposium on Biological Data Visualization (BioVis, http://www.biovis.net) at ISMB.
Tutorial PM6: Deep learning for network biology (SOLD OUT)
Room: Grand Ballroom A
Marinka Zitnik, Stanford University, United States
Jure Leskovec, Stanford University, United States
Networks are ubiquitous in biology where they encode connectivity patterns at all scales of organization, from single-cell to population level. Network approaches have been used many times to combine and amplify signals from individual genes, and have led to remarkable discoveries in biology, including drug discovery, protein function prediction, disease diagnosis, and precision medicine. Mathematical machinery that is central to these approaches is machine learning on networks. The main challenge in machine learning on networks is to find a way to extract information about interactions between nodes and to incorporate that information into a machine learning model. To extract information from networks, classic machine learning approaches often rely on summary statistics (e.g., degrees or clustering coefficients) or carefully engineered features to measure local neighborhood structures (e.g., network motifs). These classic approaches can be limited because these hand-engineered features are inflexible, they often do not generalize to networks derived from other organisms, tissues and experimental technologies, and can fail on datasets with low experimental coverage.
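As a concrete illustration of the hand-engineered summary statistics mentioned above, the sketch below computes node degree and the local clustering coefficient on a toy undirected network. The edge list and node names are invented for the example, and the code is a minimal plain-Python sketch rather than anything taken from the tutorial materials.

```python
from itertools import combinations

# Toy undirected interaction network; node names are invented.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

def build_adjacency(edge_list):
    """Map each node to the set of its neighbours."""
    adj = {}
    for u, v in edge_list:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, node):
    """Number of neighbours of a node."""
    return len(adj[node])

def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbours that are themselves connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0  # undefined for fewer than two neighbours; 0 by convention here
    links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))

adj = build_adjacency(edges)
# One fixed-length feature vector (degree, clustering coefficient) per node.
features = {n: (degree(adj, n), clustering_coefficient(adj, n)) for n in sorted(adj)}
for node, feats in features.items():
    print(node, feats)
```

Each node ends up with the same two fixed features regardless of the task, which is the inflexibility that learned representations aim to overcome.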
Recent years have seen a surge in approaches that automatically learn to encode network structure into low-dimensional representations using transformation techniques based on deep learning and nonlinear dimensionality reduction. The idea behind these representation learning approaches is to learn a data transformation function that maps nodes to embeddings, points in a low-dimensional space. Deep representation learning methods have revolutionized the state-of-the-art in network science. This tutorial will investigate methods and case studies for analyzing biological networks and extracting actionable insights, and in doing so, it will provide attendees with a toolbox of next-generation algorithms for network biology.
Tutorial Website: http://snap.stanford.edu/deepnetbio-ismb
|2:00-2:30 pm||Part 1 - Introduction and overview of network biology|
|2:30-3:30 pm||Part 2 - Matrix factorization and network propagation|
|3:30-4:00 pm||Part 3 - Introduction to graph autoencoders|
|4:00-4:15 pm||Coffee Break|
|4:15-5:00 pm||Part 4 - Graph autoencoders and deep representation learning|
|5:00-6:00 pm||Part 5 - Applications in network biology and new directions|
The tutorial will be of broad interest to researchers who work with network data coming from biology, medicine, and life sciences. Graph-structured data arise in many different areas of data mining and predictive analytics, so the tutorial should be of theoretical and practical interest to a large part of the data mining and network science communities.
The tutorial will not require prior knowledge beyond fundamental concepts covered in introductory machine learning and network science classes. Attendees will come away with a broad knowledge necessary to understand state-of-the-art representation learning methods and to use these methods to solve central problems in network biology.
Marinka Zitnik, Stanford University, United States Marinka Zitnik is a postdoctoral fellow in Computer Science at Stanford University. Her research focuses on network science and representation learning methods for biomedicine. She received her PhD in Computer Science from University of Ljubljana in 2015 while also conducting research at Imperial College London, University of Toronto, and Baylor College of Medicine. She received outstanding research awards at ISMB, CAMDA, RECOMB, and BC2 conferences, and is involved in projects at Chan Zuckerberg Biohub.
Jure Leskovec, Stanford University, United States Jure Leskovec is an Associate Professor of Computer Science at Stanford University and Chan Zuckerberg Biohub Investigator. His recent research focuses on biological and biomedical problems and applications of network science to problems in biomedicine and health. Jure received his PhD in Machine Learning from Carnegie Mellon University in 2008 and spent a year at Cornell University. His work received five best paper awards, won the ACM KDD cup and topped the Battle of the Sensor Networks competition.
Tutorial PM7: High-throughput sequencing: Identification of disease variants in exomes and genomes
Room: Grand Ballroom B
Francisco De La Vega, D.Sc. Stanford University & Fabric Genomics
Chad Huff, Ph.D. The University of Texas, MD Anderson Cancer Center
Suzanne Leal, Ph.D. Baylor College of Medicine
Mark Yandell, Ph.D. University of Utah
Yao Yu, Ph.D. The University of Texas, MD Anderson Cancer Center
With the advent and the continuous drop in cost of next-generation sequencing, whole exome (WES) and whole genome sequencing (WGS) have become the platforms of choice for the diagnosis of Mendelian disease. New clinical applications of genome sequencing continue to appear, such as the diagnosis of idiopathic disease and the rapid diagnosis of rare childhood diseases in neonatal/pediatric intensive care units. In the research setting, this technology is making it possible to explore the role of rare genetic variation in common, complex diseases through the sequencing of patient cohorts and case/control studies. A number of research studies are now generating WES or WGS data for sample sizes ranging from hundreds to thousands of cases. As the cost of sequencing drops further, it is expected that the number of cases sequenced will reach the hundreds of thousands, providing the statistical power to identify disease associations with rare variants. For example, Genomics England is well on its way to achieving its goal of sequencing 100,000 genomes, about half of these for rare genetic diseases and half for cancer patients. Regeneron Pharmaceuticals, in collaboration with the Geisinger Healthcare system, has already sequenced about 100,000 exomes, with the goal of reaching 250,000. In addition, Regeneron has recently proposed to build a coalition to sequence the ~250K cases of the Wellcome Trust Case Control Study. Many healthcare systems around the world are starting to conceive similar projects, where a key aspect of the initiative is that sampling of cases is carried out as part of the healthcare of patients. While the data obtained will be used in aggregate to look for findings that can drive drug development and new therapeutic approaches, a diagnostic of immediate value to the patient should be provided as well.
Finally, the NIH “All of Us” million-person project is starting to move forward, with the ultimate goal of sequencing the genomes of all participants.
Identification of disease-causing variants, whether in clinical diagnostics or research studies, requires algorithms and statistical methods to score variants with respect to their likely relevance to the disease at hand. In diagnostics, these scores should allow clinicians to focus quickly on a relatively small number of candidate variants, examine the evidence for each, and classify them as either pathogenic or benign with respect to the patient’s disease phenotype. In research studies, the goal is to understand the role of genetic variation in complex disease, using methods that can aggregate the burden of many rare deleterious variants in key genes to assess their contribution to the trait. In addition, analysis methods should be able to identify donors harboring a Mendelian-like version of the disease, with ultra-rare variants of very strong effect: natural knock-outs of genes that, while they may not result in early developmental disease, may significantly influence late-onset disease, either by accelerating its onset or by protecting against it. A notable success of this paradigm was the finding of homozygotes for deleterious variants in the PCSK9 gene, a very rare genotype that protects carriers against cardiovascular disease (CVD). This finding led to the development of the latest class of CVD drugs, and the search for more cases like this is driving much of the genome sequencing undertaken by pharmaceutical companies. Analysis strategies to deal with each of these cases are different, and yet they need to be considered together in the analysis of projects generating large-scale WGS/WES data from patients.
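The idea of aggregating the burden of rare variants within a gene can be illustrated with a simple collapsing test: each person is scored as a carrier if they have any rare variant in the gene, and carrier counts in cases and controls are compared with a one-sided Fisher's exact test. This is a minimal sketch on invented genotypes and is not the method implemented by any of the presenters' tools; the function names are hypothetical.

```python
from math import comb

# Toy genotypes for one gene: each row is a person, each column a rare variant
# (1 = carries the variant). All values are invented for illustration.
cases = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 0, 0], [0, 0, 0]]
controls = [[0, 0, 0], [0, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0], [0, 0, 0]]

def carriers(genotypes):
    """Collapse variants: a person counts once if they carry any rare variant."""
    return sum(1 for person in genotypes if any(person))

def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the 2x2 table [[a, b], [c, d]]."""
    n_row1, n_col1, n = a + b, a + c, a + b + c + d
    upper = min(n_row1, n_col1)
    total = comb(n, n_col1)
    tail = sum(comb(n_row1, x) * comb(n - n_row1, n_col1 - x)
               for x in range(a, upper + 1))
    return tail / total

a = carriers(cases)        # case carriers
b = len(cases) - a         # case non-carriers
c = carriers(controls)     # control carriers
d = len(controls) - c      # control non-carriers
p_value = fisher_one_sided(a, b, c, d)
print(f"carriers: {a}/{len(cases)} cases, {c}/{len(controls)} controls, p = {p_value:.4f}")
```

Collapsing trades per-variant resolution for power: no single variant here is frequent enough to test on its own, but the gene-level carrier count can be.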
Goals of the Tutorial
The goal of this tutorial is to present an overview of the current state of disease variant identification approaches, describe the most common methods used to interpret variants from WGS/WES patient datasets for clinical diagnostics, as well as the statistical methods applied to the analysis of large cohorts of patients with WES/WGS data for finding novel disease genes. To ensure we deliver practical knowledge, we will discuss in some detail specific tools that the presenters have developed, explaining the fundamentals of the algorithms underlying them, how to use them in real use cases, and how these tools compare to other available tools and approaches. Since the scale of the genome datasets keeps growing, it is also important to understand the techniques to make these analyses scalable.
At the end of the tutorial the participants will have an understanding of: 1) What are the challenges of analyzing WES/WGS data for clinical diagnostics and disease association studies; 2) How variant prioritization can be performed probabilistically and why it is superior to empirical filtering schemes; 3) How to take advantage of family structures and phenotype information in these endeavors; 4) What are the difficulties in the analysis of rare variants for disease gene finding; 5) What are the typical and most advanced tools for rare variant analysis; and 6) What are the novel approaches for the analysis of disease cohorts for both identifying rare variants influencing common disease as well as ultra-rare homozygotes with very strong effects.
The participants of this tutorial will be bioinformaticians, statisticians, or geneticists who anticipate being involved in the analysis of WES/WGS data for either clinical diagnostics or case/control association studies with an emphasis on rare variants. This tutorial will appeal to participants with either an academic or an industry (e.g. pharmaceutical industry/clinical diagnostic labs) background.
This is a theoretical tutorial, and the only requirements would be familiarity with the basics of next-generation sequencing of genomes and exomes, the basics of human genetics, and ideally an understanding of how classical GWAS studies for common variants work.
|2:00-2:50 pm||F. De La Vega||Introduction to variant prioritization in Mendelian disease diagnostics|
|3:00-3:50 pm||G. Wang||Analysis of Large-Scale Rare Variant Association Studies|
|4:00-4:15 pm||Coffee Break|
|4:15-5:15 pm||M. Yandell||Discovery of rare and ultra-rare disease variants in case/control and cohort studies|
|5:15-6:15 pm||C. Huff and Yao Yu||Rare variant prioritization and association analysis with VAAST, XPAT, PHEVOR, and related tools|
Mendelian disease analysis by WGS/WES
Eilbeck, K., Quinlan, A. & Yandell, M. Settling the score: variant prioritization and Mendelian disease. Nature Publishing Group 1–14 (2017). doi:10.1038/nrg.2017.52
Wright, C. F., FitzPatrick, D. R. & Firth, H. V. Paediatric genomics: diagnosing rare disease in children. Nature Publishing Group 10, 1–16 (2018).
Coonrod, E. M., Margraf, R. L., Russell, A., Voelkerding, K. V. & Reese, M. G. Clinical analysis of genome next-generation sequencing data using the Omicia platform. Expert Rev Mol Diagn 13, 529–540 (2013).
Rare Variant Association Tests
Nicolae, D. L. Association Tests for Rare Variants. Annu. Rev. Genom. Human Genet. 17, 117–130 (2016).
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-Variant Association Analysis: Study Designs and Statistical Tests. The American Journal of Human Genetics 95, 5–23 (2014).
Auer PL et al (2016) Guidelines for Large-Scale Sequence-Based Complex Trait Association Studies: Lessons Learned from the NHLBI Exome Sequencing Project, Am J Hum Genet. 99 (4): 791-801.
F. Anthony San Lucas, Gao Wang, Paul Scheet, and Bo Peng (2012) Integrated annotation and analysis of genetic variants from next-generation sequencing studies with variant tools, Bioinformatics 28 (3): 421-422.
Gao Wang, Bo Peng and Suzanne M. Leal (2014) Variant Association Tools for Quality Control and Analysis of Large-Scale Sequence and Genotyping Array Data, The American Journal of Human Genetics 94 (5): 770–83.
Yandell M, Huff C, Hu H, Singleton M, Moore B, Xing J, Jorde LB, Reese MG. A probabilistic disease-gene finder for personal genomes. Genome Res 2011, 21(9):1529-1542.
Singleton M., Guthery SL., Voelkerding KV., Chen K., Kennedy BJ., Margraf RL., Durtschi J., Eilbeck K., Reese MG., Jorde LB., Huff CD., Yandell M. Phevor Combines Multiple Biomedical Ontologies for Accurate Identification of Disease-Causing Alleles in Single Individuals and Small Nuclear Families. Am J Hum Genet. 2014 Apr 3;94(4):599-610.
Flygare S, Hernandez EJ, Phan L, et al. The VAAST Variant Prioritizer (VVP): ultrafast, easy to use whole genome variant prioritization tool. BMC Bioinformatics. 2018;19:57.
Yu, Y. et al. XPAT: a toolkit to conduct cross-platform association studies with heterogeneous sequencing datasets. Nucleic Acids Research 1–11 (2017). doi:10.1093/nar/gkx1280
Di Zhang et al. SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data. The American Journal of Human Genetics 101, 115–122 (2017).
Links to Tools and Code
VAAST 2.0 and pVAAST at Yandell Lab: http://www.yandell-lab.org/software/vaast.html
PHEVOR 2.0 web service: http://weatherby.genetics.utah.edu/phevor2/index.html
Variant Tools: http://varianttools.sourceforge.net/
XPAT at Huff Lab: http://www.hufflab.org/software/xpat/
Materials and guides at Leal lab: https://statgen.research.bcm.edu/index.php/Tutorials
Francisco M. De La Vega, D.Sc., Stanford University School of Medicine & Fabric Genomics, United States Adjunct Professor at the Department of Biomedical Data Science of Stanford, and SVP of Genomics at Fabric Genomics. Dr. De La Vega is a geneticist and computational biologist with interests in cancer, population, and clinical genomics, and with extensive experience in the life sciences industry. He has led the development of new methods and software for the analysis of next-generation sequencing data and has been involved in major population-scale sequencing projects such as the 1000 Genomes Project, the PanCancer Analysis of Whole Genomes project of the ICGC, and standard-setting public-private partnerships such as the NIST Genome-in-a-Bottle Consortium.
Chad Huff, Ph.D., The University of Texas MD Anderson Cancer Center, United States Associate Professor, Department of Epidemiology, The University of Texas MD Anderson Cancer Center. He works on understanding human evolution and the genetic basis of human disease through statistical, computational, and population genomics. His current focus is on developing new methods to analyze genomic data and applying these methods to discover novel insights about the genetic basis of human disease, with particular emphasis on identifying and characterizing genes that increase the risk of developing common cancers.
Suzanne Leal, Ph.D., Baylor College of Medicine, United States Professor in the Department of Molecular and Human Genetics at Baylor College of Medicine and Director of the Center for Statistical Genetics, and also an adjunct Professor in the Department of Statistics at Rice University and a Senior Research Associate at The Rockefeller University. Dr. Leal's interests lie in statistical genetics and genetic epidemiology, and she has worked extensively on developing methods to aid in gene identification and understanding disease etiology. Her current focus is the development of methods to analyze rare variants. Dr. Leal is also pioneering big-data architectures to more effectively process large WES/WGS datasets from case/control studies.
Mark Yandell, Ph.D., University of Utah, United States Professor of Human Genetics and H.A. and Edna Benning Presidential Endowed Chair at University of Utah. Dr. Yandell develops computational algorithms and software tools to analyze genomics data and uses these tools to identify disease-causing variants in clinical settings, to understand the molecular basis of gene dysfunction, and to understand evolution. He spent three years at the Genome Sequencing Center at Washington University School of Medicine, St. Louis, and then three years at Celera Genomics where he led the Annotation Software Research and Development group. Mark has led the development of innovative variant prioritization tools and novel methods that take advantage of a patient's disease phenotype by leveraging biomedical phenotype ontologies, and more recently has been extending these tools to make them more efficient and applicable to large cohort studies.
Yao Yu, Ph.D., The University of Texas MD Anderson Cancer Center, United States Computational Scientist at the Department of Epidemiology, The University of Texas MD Anderson Cancer Center. His research interests cover a wide range of topics in computational biology, including genetics, genomics, transcriptomics, and metabolomics. He is the lead developer of the Cross-Platform Association Toolkit (XPAT), a suite of tools designed to support and conduct large-scale association studies with heterogeneous sequencing datasets.
Tutorial PM8: Ontologies in computational biology
Room: Columbus KL
Michel Dumontier, Maastricht University, Netherlands
Robert Hoehndorf, King Abdullah University of Science and Technology, Kingdom of Saudi Arabia
Ontologies have long provided a core foundation in the organization of biomedical entities, their attributes, and their relationships. With over 500 biomedical ontologies currently available, exciting new opportunities are emerging in using ontologies for large-scale data sharing and data analysis. This tutorial will help you understand what ontologies are and how they are being used in computational biology and bioinformatics.
This is an introductory-level course to ontologies and ontology-based data analysis in bioinformatics. In this tutorial, participants will learn:
- what ontologies are and where to find them
- how to understand and use ontology semantics through automated reasoning
- how to measure semantic similarity
- how to incorporate ontologies and semantic similarity measures in bioinformatics analyses
- recent developments in bio-ontologies
The tutorial will be of interest to any researcher who will use or produce large structured datasets in computational biology. The tutorial will be at an introductory level, but will also describe current research directions and challenges that will be of broad interest to researchers in computational biology.
The tutorial will contain a hands-on part. If you want to participate (instead of just watching the presentation), please download and install Jupyter Notebook (http://jupyter.org/) with a SciJava kernel. For the latest updates on this tutorial, see https://github.com/bio-ontology-research-group/ontology-tutorial
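To give a flavor of the semantic similarity topic listed above, here is a minimal sketch in Python. It is not taken from the tutorial materials: the toy is-a hierarchy, the term names, and the helper functions `ancestors` and `jaccard_similarity` are all hypothetical. It assumes one simple measure, the Jaccard index over the sets of superclasses of two terms; the tutorial itself may cover other measures.

```python
# Toy is-a hierarchy mapping each term to its direct superclasses.
# These terms are hypothetical and not taken from any real ontology.
PARENTS = {
    "thing": [],
    "cell": ["thing"],
    "blood_cell": ["cell"],
    "t_cell": ["blood_cell"],
    "b_cell": ["blood_cell"],
    "neuron": ["cell"],
}


def ancestors(term, parents=PARENTS):
    """Return the term together with all of its direct and indirect superclasses."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def jaccard_similarity(a, b, parents=PARENTS):
    """Jaccard index of the two ancestor sets: |A ∩ B| / |A ∪ B|."""
    anc_a = ancestors(a, parents)
    anc_b = ancestors(b, parents)
    return len(anc_a & anc_b) / len(anc_a | anc_b)


if __name__ == "__main__":
    # Sibling terms share more superclasses than distant ones.
    print(jaccard_similarity("t_cell", "b_cell"))
    print(jaccard_similarity("t_cell", "neuron"))
```

In this toy hierarchy the two sibling terms score higher than the distant pair because they share more superclasses. With a real ontology, the superclass sets would typically come from an automated reasoner rather than a hand-written dictionary.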
Michel Dumontier, Maastricht University, Netherlands Michel Dumontier is a Distinguished Professor of Data Science at Maastricht University. His research focuses on the development of computational methods for scalable integration and reproducible analysis of FAIR (Findable, Accessible, Interoperable and Reusable) data across scales, from molecules, tissues, organs, individuals, and populations to the environment. His group combines semantic web technologies with effective indexing, machine learning and network analysis for drug discovery and personalized medicine. Dr. Dumontier leads a new inter-faculty Institute for Data Science at Maastricht University with a focus on accelerating discovery science, empowering communities, and improving health and well-being. He is the editor-in-chief for the IOS Press journal Data Science and an associate editor for the IOS Press journal Semantic Web. He is the scientific director for Bio2RDF, an open source project to generate Linked Data for the Life Sciences, and is a technical lead for the FAIR data initiative. He has published over 125 articles in top-rated journals and international conferences. He is internationally recognized for his contributions in bioinformatics, biomedical informatics, and semantic technologies as evidenced by awards, keynote talks at international conferences, and collaborations on international projects.
Robert Hoehndorf, King Abdullah University of Science and Technology, Kingdom of Saudi Arabia Robert Hoehndorf is an Assistant Professor in Computer Science at King Abdullah University of Science and Technology in Thuwal. His research focuses on the applications of ontologies in biology and biomedicine, with a particular emphasis on integrating and analyzing heterogeneous, multimodal data. Dr. Hoehndorf has developed the PhenomeNET system for ontology-based prioritization of disease genes using model organism phenotypes, and contributed to the development of the AberOWL ontology repository. He is an associate editor for the Journal of Biomedical Semantics, BMC Bioinformatics, and Applied Ontology, and an editorial board member of the IOS Press journal Data Science. He has published over 90 papers in journals and international conferences, and has presented previous tutorials on ontologies and their applications at ISMB, OWL-ED, and ECCB.