ISMB 2020 - Tutorials
ISMB 2020 features pre-conference tutorial sessions on Sunday, July 12, 2020, one day prior to the start of the conference scientific program.
Register early for tutorials as seating is limited.
Tutorial attendees should register using the on-line registration system - pricing is available at https://www.iscb.org/ismb2020-registration. Tutorial participants must be registered for the ISMB conference to attend a tutorial. Attendees will receive a Tutorial Entry Pass (ticket) at the time they register on site.
Lunch is not included as part of the tutorial program.
- Tutorial FD1: Mutational signature analysis: pipelines, machine learning, and benchmarking on synthetic data
- Tutorial FD2: Finding and analyzing data in the cloud with Gen3, Dockstore, Terra, and Galaxy
- Tutorial AM1: Full-Length RNA-Seq Analysis using PacBio long reads: from reads to functional interpretation
- Tutorial AM2: A practical introduction to biomedical text mining in the era of deep learning
- Tutorial AM3: BioC++ - solving daily bioinformatic tasks with C++ efficiently
- Tutorial PM1: Translational use of multifaceted RNA-Seq bioinformatics analysis in genetic disease investigation
- Tutorial PM2: Enhancing Molecular Dynamics Simulations Using Deep Learning
- Tutorial PM3: Automation of Network Analysis in the Cytoscape Ecosystem
Tutorial FD1: Mutational signature analysis: pipelines, machine learning, and benchmarking on synthetic data
Steven G. Rozen, Duke-NUS Centre for Computational Biology, Duke-NUS Medical School, Singapore
Arnoud Boot, PhD, Postdoctoral Fellow, Duke-NUS Medical School, Singapore
Mutational signature analysis focuses on patterns of mutations across the genome to infer their causes, and is now an essential component of cancer-genomics studies. Over the last decade, mutational signatures have revealed endogenous mutational processes that are widespread in many cancer types but that were not previously known. Signatures also showed that exposure to naturally occurring mutagens that cause liver cancer is much more widespread than suspected. Mutational signature analysis can also provide insight into the causes of specific oncogenic mutations and can reveal gaps in our understanding of the mechanisms of DNA damage and repair. Mutational signatures can either be delineated in experimental systems (e.g. cell culture or rodents) or can be discovered by machine learning across sets of hundreds to 10s of thousands of tumors. More than 100 mutational signatures have been described, many of which have unknown causes. In line with the importance of mutational signature analysis, there are now ~20 software packages that use machine learning to discover mutational signatures and assess their activity in tumors. Unfortunately, however, the cancer genomics literature contains numerous erroneous mutational signature results stemming from uncritical application of these packages.
We will cover the basic concepts of mutational signature analysis and show how this analysis is important for understanding cancer development, for detecting mutational exposures that cause cancer, and for understanding DNA damage as processed by normal and defective DNA repair. We will introduce the computational analysis needed to delineate mutational signatures in experimental systems (e.g. cultured cells or rodents), including the computational subtraction of the signatures of background mutagenesis and of experimental artifacts. We will cover in detail machine learning approaches to discovering mutational signatures in large sets of tumors and the strengths and weaknesses of these approaches. We will also discuss in depth the importance of benchmarking the machine-learning approaches on synthetic data. Finally, the tutorial will show examples of the importance of interpreting machine-learning results in the light of all available evidence to obtain biologically relevant results.
This tutorial will equip participants with the ability to run machine-learning software to discover mutational signatures and to assess their activity in tumors and with strategies to evaluate the biological relevance of the results.
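As a rough illustration of the machine-learning step described above, the following minimal sketch applies non-negative matrix factorization (NMF) with multiplicative updates to a small synthetic mutation catalog. It is not the tutorial's software: the function names (`nmf`, `init_factors`, `frobenius_error`), the six mutation classes, and the two synthetic signatures with their exposures are all invented for illustration, and real analyses typically use many more mutation classes and dedicated packages.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(M):
    return [list(col) for col in zip(*M)]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W H, i.e. the reconstruction error."""
    WH = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, WH) for v, r in zip(rv, rr)) ** 0.5


def init_factors(n, m, k, seed=0):
    """Strictly positive random starting point for W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    return W, H


def nmf(V, k, n_iter=300, seed=0, eps=1e-12):
    """Factor V ~ W H with multiplicative updates (squared-error objective)."""
    W, H = init_factors(len(V), len(V[0]), k, seed)
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H


# Hypothetical synthetic data: 2 signatures over 6 mutation classes.
TRUE_SIGNATURES = [  # rows: mutation classes, columns: signatures (each sums to 1)
    [0.50, 0.05],
    [0.30, 0.05],
    [0.10, 0.10],
    [0.05, 0.10],
    [0.03, 0.30],
    [0.02, 0.40],
]
TRUE_EXPOSURES = [  # rows: signatures, columns: tumors (mutation counts)
    [200, 150, 20, 10, 100],
    [10, 30, 300, 250, 100],
]
CATALOG = matmul(TRUE_SIGNATURES, TRUE_EXPOSURES)  # classes x tumors

W0, H0 = init_factors(6, 5, 2)
initial_error = frobenius_error(CATALOG, W0, H0)
W, H = nmf(CATALOG, k=2)
final_error = frobenius_error(CATALOG, W, H)
print(f"reconstruction error: {initial_error:.2f} -> {final_error:.2f}")
```

Because the catalog is generated from known signatures, the columns of `W` (after normalising each to sum to 1 and matching them to the closest true signature) can be compared with `TRUE_SIGNATURES`. This is the basic idea behind benchmarking on synthetic data, although NMF solutions are not unique and a careful comparison needs more than this sketch provides.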
Computing experience: there will be exercises using the command line in Python and R. Participants must have a laptop with Python and R installed, along with the following packages (draft list):
Relatively small data sets will also need to be downloaded before the tutorial.
Participants will need a basic understanding of genome organization, mutations, and modern high-throughput Illumina-type sequencing (BAM files, variant call files, etc.).
Maximum Participants: 60
|9:00 - 10:00 am||Overview of mutational signatures|
|10:00 - 10:30 am||Case study: Kucab et al., 2019|
|10:30 - 11:00 am||(Hands on) Computational analysis of experimentally delineated mutational signatures|
|11:00 - 11:15 am||Coffee break|
|11:15 - 12:00 pm||(Hands on) Computational analysis of experimentally delineated mutational signatures; subtracting the signatures of background mutagenesis and of experimental artifacts|
|12:00 - 1:00 pm||Machine learning for discovering mutational signatures|
|1:00 - 2:00 pm||Lunch Break (lunch is not included as part of tutorial)|
|2:00 - 3:00 pm||(Hands on) Non-negative matrix factorization for learning mutational signatures|
|3:00 - 4:00 pm||(Hands on) Hierarchical Dirichlet processes for learning mutational signatures|
|4:00 - 4:15 pm||Coffee Break|
|4:15 - 5:00 pm||The importance of testing signature discovery and signature assignment on synthetic data|
|5:30 - 6:00 pm||Future perspectives and summary|
Tutorial FD2: Finding and analyzing data in the cloud with Gen3, Dockstore, Terra, and Galaxy
Geraldine Van der Auwera, Broad Institute of MIT and Harvard, United States
Robert Majovski, Broad Institute of MIT and Harvard, United States
The era of big data for biomedical research is here. Massive data sets and cloud-based platforms will enable breakthrough discoveries while overcoming challenges of cost, accessibility, and security. A key strength of this new research landscape is the availability of interoperable, community-driven components that enable robust analyses for a variety of research needs.
One challenge to fully realizing this vision for your research is not only learning how several new products and platforms work, but at the same time learning how they work together. In this full-day tutorial, we will guide you through a research journey that highlights the capabilities and components of the NHGRI Genomic Data Science Analysis, Visualization and Informatics Lab-space (AnVIL) resource. You will integrate a suite of interoperable platforms to complete a sample project, gaining working knowledge of how the components work together to perform an end-to-end genetic analysis.
Specifically, you will learn how to:
- Find and access data in Gen3
- Locate analysis tools in the Dockstore repository
- Bring these data and tools together into a computational workspace in Terra
- Process data with automated, reproducible analysis pipelines
- Leverage Hail and Bioconductor in Jupyter Notebooks to do interactive analysis
- Perform genome-wide association studies with Galaxy workflows
While we will work in the context of AnVIL, you will be able to apply your new skills to myriad other genomic-related data sets and tools. Attendees must bring a WiFi-enabled laptop with the Chrome browser installed. Prior coding experience (R and/or Python) is required.
Maximum Audience: 40
- Section I: Introduction
- Section II: Finding and analyzing data in the cloud with Gen3, Dockstore and Terra
- Find and access data in Gen3
- Locate analysis tools in the Dockstore repository
- Export both data and tools to Terra and run an analysis
- Section III: Interactive analysis
- Find data
- Hail with Jupyter Notebooks in Terra
- Bioconductor with Jupyter Notebooks in Terra
- Section IV: Genome-wide association study workflows
- Galaxy workflows and complementary components
Tutorial AM1: Full-Length RNA-Seq Analysis using PacBio long reads: from reads to functional interpretation
Ana Conesa, University of Florida, United States
Elizabeth Tseng, Pacific Biosciences, United States
Angeles Arzalluz, Polytechnical University Valencia, Spain
Francisco Pardo, Polytechnical University Valencia, Spain
The PacBio Single-Molecule Real-Time sequencing technology produces highly accurate long reads that are suitable for full-length RNA sequencing. The Iso-Seq method generates full-length transcript sequences of 10 kb or longer that do not require transcript assembly or error correction. The high accuracy (>99%) of Iso-Seq transcripts allows for unambiguous characterization of alternative splicing events, direct ORF prediction without a reference genome, and identification of single cell barcodes.
The unique features of Iso-Seq data require a special set of bioinformatics tools that typical short-read RNA-Seq tools fail to provide. The PacBio SMRT Analysis software processes raw sequencing data into full-length transcript sequences, which can then be analyzed with community tools that have been developed specifically for long-read data: SQANTI compares Iso-Seq transcripts against known annotations (e.g., GENCODE) to classify novel vs. known genes and transcripts, and to remove artifacts; IsoAnnot functionally annotates Iso-Seq transcripts; tappAS compares multiple Iso-Seq samples to identify differential features. Existing short-read RNA-Seq data are often paired with Iso-Seq data to strengthen the analysis.
Further, the Iso-Seq method can also be applied to single cell analysis. Matching single cell libraries of both long and short read data can be generated and combined, using the deeper coverage of short reads to identify cell types while using matching cell barcodes to link full-length isoforms generated by the long-read data back to individual cell types.
In this tutorial, we provide an overview of the Iso-Seq tools for both bulk and single cell RNA-Seq analysis and guide the audience through hands-on analyses.
Beginner or intermediate. This tutorial will be of broad interest to researchers from academia or industry who want to learn to understand the unique features and tool sets of long read RNA sequencing (Iso-Seq) data using PacBio’s SMRT Technology.
Attendees are expected to have basic Unix command line skills and some familiarity with R/Rstudio. Programming knowledge is not required though most of the tools are written in Python.
Maximum Audience: 40
Attendees are expected to bring their own laptops and have installed R/RStudio and the tappAS software. We will be using a shared instance in AWS for the first part of the analysis (Iso-Seq and SQANTI), then running tappAS on the local laptops.
|9:00 - 9:30 am||Introduction|
|9:30 - 10:15 am||Demo & Hands-On Session: Iso-Seq using BioConda|
|10:15 - 11:00 am||Demo & Hands-On Session: Functional analysis of Iso-Seq data|
|11:00 - 11:15 am||Coffee Break|
|11:15 - 11:45 am||Single Cell Iso-Seq|
|12:15 - 12:45 pm||Hands-On Session: Single Cell Iso-Seq + RNA-Seq|
|12:50 - 1:00 pm||Wrap Up|
Tutorial AM2: A practical introduction to biomedical text mining in the era of deep learning
Qingyu Chen, National Library of Medicine, National Institutes of Health
Robert Leaman, National Library of Medicine, National Institutes of Health
Cecilia Arighi, Delaware Biotechnology Institute, University of Delaware
Zhiyong Lu, National Library of Medicine, National Institutes of Health
The volume of biomedical literature is growing at an exponential rate. PubMed, a biomedical literature search engine managed by the National Library of Medicine, has ~2 new articles indexed per minute. Such rapid growth challenges manual information extraction, curation and annotation. Biomedical text mining aims to apply natural language processing techniques to biomedical literature and automatically assist biocurators, biologists and health professionals to overcome the burden. Biomedical text mining has matured significantly in recent years. More specifically, deep learning – end-to-end neural networks inspired by biological systems – has achieved state-of-the-art performance in a range of biomedical text mining applications. In the bioinformatics community, the use of text mining via deep learning to support other research in the biological and medical sciences has been increasing. Not restricted to standalone tools, deep learning models have also been fully deployed to public web servers, further improving the quality of biomedical text mining tools and lowering the barriers for non-specialists.
This tutorial aims to introduce the audience to text mining the biomedical literature using deep learning methods and to provide hands-on training. The tutorial will address questions such as “What is biomedical text mining?”, “What is deep learning?”, “How can deep learning be applied to address biomedical text mining problems?”, and “What biomedical text mining tools are currently available?”. The tutorial will cover the basics of biomedical text mining and deep learning with concrete examples. The latest deep learning methods in biomedical text mining will also be explained and discussed. In addition, the audience will have the opportunity to gain first hands-on experience in developing their own deep learning models for biomedical literature analysis. Topics include:
- Fundamentals of biomedical text mining and literature mining
- Overview of deep learning in biomedical text mining
- Word, sentence, concept embeddings for biomedical textual analysis
- Public biomedical text mining tools for biomedical information retrieval and extraction
- Case studies: biomedical literature analysis
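To make the embedding topic in the list above a little more concrete, here is a minimal sketch of how word vectors can be averaged into a sentence vector and compared with cosine similarity. The three-dimensional vectors in `TOY_EMBEDDINGS` are made up for illustration (real biomedical embeddings are learned from large corpora and have hundreds of dimensions), and the function names are hypothetical rather than taken from any tool covered in the tutorial.

```python
import math

# Hypothetical, hand-written word vectors; real embeddings are learned from text.
TOY_EMBEDDINGS = {
    "brca1": [0.9, 0.1, 0.0],
    "tp53": [0.8, 0.2, 0.1],
    "mutation": [0.7, 0.3, 0.0],
    "aspirin": [0.1, 0.9, 0.2],
    "dose": [0.0, 0.8, 0.3],
}


def sentence_embedding(tokens, embeddings=TOY_EMBEDDINGS):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token has a known embedding")
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


gene_a = sentence_embedding(["brca1", "mutation"])
gene_b = sentence_embedding(["tp53", "mutation"])
drug = sentence_embedding(["aspirin", "dose"])

sim_gene_gene = cosine_similarity(gene_a, gene_b)
sim_gene_drug = cosine_similarity(gene_a, drug)
print(f"gene vs gene: {sim_gene_gene:.3f}, gene vs drug: {sim_gene_drug:.3f}")
```

With these toy vectors the two gene-related phrases score as more similar to each other than to the drug-related phrase, which is the property that embedding-based retrieval and clustering rely on.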
This tutorial is an activity of the ISCB COSI on Text Mining.
We intend the tutorial to be for participants who are not text mining specialists but who use text mining or are interested in using it. This tutorial will provide a brief introduction, including a description of existing tools and datasets. In addition, the session will give participants an opportunity to describe their needs to text mining specialists.
Maximum Audience: 60
None, if participants just wish to listen. Those who would like to also participate in the hands-on exercises are required to provide their own laptop and should have a basic knowledge of programming in Python.
|9:00 - 9:30 am||Introduction to biomedical text mining|
|9:30 - 10:00 am||Introduction to deep learning|
|10:00 - 11:00 am||Biomedical language models|
|11:00 - 11:15 am||Coffee Break|
|11:15 - 12:00 pm||Demonstration: deep learning tools and datasets for biomedical text mining tasks|
|12:00 - 12:50 pm||Hands-on Session: biomedical literature analysis|
|12:50 - 1:00 pm||Q&A and feedback|
Tutorial AM3: BioC++ - solving daily bioinformatic tasks with C++ efficiently
René Rahn, Max Planck Institute for Molecular Genetics, Algorithmic Bioinformatics, Germany
Svenja Mehringer, Free University Berlin, Algorithmic Bioinformatics, Germany
Marcel Ehrhardt, Free University Berlin, Algorithmic Bioinformatics, Germany
In this half-day tutorial we are going to teach how to use modern C++ and utilise modern C++ libraries to rapidly develop tools and scripts for operating on and manipulating large-scale sequencing data.
The high variability and heterogeneity often observed within various genomic data is challenging for many standard tools, for example for read alignment and variant calling. Often, these tools are wrapped in complicated pre- and postprocessing data curation steps in order to obtain results with higher quality. However, these additional steps impose a high maintenance and performance burden on the established work process and often do not scale with larger data sets. C++ is seldom considered as the language of choice for these small processes, although it is the main language used in high-performance computing. We are going to show that programming in modern C++ can be as easy as using other modern high-level languages.
This tutorial is organised as a half-day tutorial. At the beginning we are going to introduce fundamental concepts and principles of the C++ programming language. Further, we will teach how modern C++ features such as ranges and concepts can be used to rapidly develop high-quality C++ applications. This introduction to C++ is followed by a practical session where participants will read in typical files from sequencing experiments using the C++ library SeqAn and operate on the data with the taught principles to solve diverse problems, e.g. filtering out reads with low sequencing quality. In the last 30 minutes of the tutorial we are going to summarise the learned concepts and compare the developed methods to current approaches.
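The tutorial itself works in C++ with SeqAn. Purely to illustrate the kind of task mentioned above (filtering out reads with low sequencing quality), the following is a minimal Python sketch, not the tutorial's code. It assumes four-line FASTQ records with Phred+33 quality encoding; the function names and the mean-quality threshold of 20 are hypothetical choices made for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4-line records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality  # drop the leading '@'


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(char) - offset for char in quality) / len(quality)


def filter_reads(records, min_mean_quality=20.0):
    """Keep only reads whose mean base quality reaches the threshold."""
    for name, sequence, quality in records:
        if quality and mean_phred(quality) >= min_mean_quality:
            yield name, sequence, quality


# Two made-up reads: 'I' encodes Phred 40, '!' encodes Phred 0.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"

kept = [name for name, _, _ in filter_reads(read_fastq(io.StringIO(EXAMPLE_FASTQ)))]
print(kept)
```

Using generators keeps memory use independent of file size, which is the same streaming idea that range-based processing provides in modern C++.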
This tutorial is mostly suited for computational biologists and bioinformaticians with a research focus on sequence analysis (e.g., genomics, metagenomics, proteomics, read alignment, variant detection, etc.). A fundamental knowledge about sequencing experiments and the involved data is required. We expect that attendees have an intermediate knowledge in programming with any high-level programming language, e.g. Python, Java or C++. Some basic C++ knowledge is helpful but not mandatory to successfully complete the course.
This tutorial is targeting beginners and intermediate C++ developers that want to learn more about modern C++ features like ranges and concepts.
Attendees should bring their own laptop.
Software for the tutorial can be installed beforehand, but we will also dedicate some extra time for installing required software during the tutorial.
- g++ >= 7
- SeqAn 3 - (https://github.com/seqan/seqan3)
- CMake >= 3.12
or VirtualBox, if the attendee wishes to use the provided virtual image running Ubuntu.
Maximum Attendees: 25
|9:00 - 10:30 am||Introduction to modern C++ [talk: 30 min]; Initial app and parsing sequencing data [hands-on: 60 min]|
|10:30 - 11:00 am||Coffee Break|
|11:00 - 12:30 pm||Filtering and data manipulation (hands-on)|
|12:30 - 1:00 pm||Wrap-up [talk: 30 min]|
Tutorial PM1: Translational use of multifaceted RNA-Seq bioinformatics analysis in genetic disease investigation
Gavin R. Oliver, Center for Individualized Medicine, Mayo Clinic, United States
Garrett Jenkinson, Mayo Clinic, United States
Eric W. Klee, PhD, Center for Individualized Medicine, Mayo Clinic, United States
RNA-Seq is increasingly being recognized as a testing modality with significant untapped potential in the field of genetic disease studies. These data present a unique opportunity for diverse, multifaceted analysis. Data profiling methods including expression outlier analysis, aberrant splicing detection, fusion transcript identification and allele-specific expression have been demonstrated to achieve genetic diagnosis of diseases escaping resolution through traditional clinical and research-based DNA testing. Recently published works have highlighted the ability to increase diagnostic rates by as much as 35% utilizing RNA-Seq analysis, but analytical workflows are diverse and non-trivial to implement or interpret. This tutorial focuses on the utilization of RNA-based analysis for the improved diagnosis of rare genetic disease. An introduction will be given to the current state of genetic disease diagnostics and the benefits revealed to date by RNA-Seq. RNA-based testing paradigms will be introduced individually and discussed in terms of translational utility, with a focus on data analysis methodologies and considerations. Each computational analysis solution will be reviewed, with hands-on sessions highlighting the analytical capabilities of a specific informatics solution for each testing paradigm. Means of prioritizing results based on biological and phenotypic relevance will be addressed and cutting-edge computational solutions demonstrated. Finally, consideration will be given to the principles underlying final data integration, review and analysis to maximize the likelihood of patient diagnosis amidst a growing data deluge.
Researchers or scientists with computational or genomics training and an interest in analytical techniques aimed at the improved diagnosis of rare genetic disease. Individuals with prior and current experience in the field of rare genetic disease will benefit from the ability to utilize the knowledge gained immediately in their own work. Programming knowledge (e.g., R, Python, bash, or similar) is required only if participants wish to perform the practical components of the hands-on sessions. Instructions will be provided on downloading relevant data and setting up the user’s compute environment. Attendees wishing to perform the practical components of the hands-on sessions are required to provide their own laptop.
Maximum Audience: 40
|2:00 - 2:30 pm||Introduction|
|2:30 - 3:00 pm||Confounding variable correction and outlier expression analysis|
|3:00 - 3:35 pm||Hands-On Practicum: OUTRIDER expression analyses|
|3:35 - 4:05 pm||Fusion transcript detection in rare genetic disease|
|4:05 - 4:20 pm||Coffee Break|
|4:20 - 4:55 pm||Hands-On Practicum: Fusion filtering and prioritization|
|4:55 - 5:25 pm||Identification of aberrant splicing events in rare genetic disease patients|
|5:25 - 6:00 pm||Hands-On Practicum: Leafcutter for detecting splicing outliers|
Tutorial PM2: Enhancing Molecular Dynamics Simulations Using Deep Learning
Emmanuel Salawu, Machine Learning Laboratory, Amazon Web Services, United States
Lee-Wei Yang, Institute of Bioinformatics and Structural Biology, National Tsing Hua University, Taiwan
Computational studies of molecules (such as through Molecular Dynamics, MD, simulations) offer a set of vital approaches for the elucidation of molecular mechanisms at resolutions that are currently difficult or impossible to obtain from wet-lab experiments. In a similar way, the use of Artificial Intelligence (most especially, Deep Learning, DL) techniques makes it possible to study and even design and create new molecules that have specific desired properties with high success rates. The computational studies of molecules using a combination of MD simulations and DL techniques are, therefore, an active area of research and present unprecedented tools for advancing our understanding of biology at the molecular level.
In this tutorial, we will introduce and teach MD simulations using Classical Mechanics and Force Fields/Empirical Potentials. Temperature Control and Pressure Control, which form the basis for the Canonical Ensemble (NVT) and the Isothermal–isobaric Ensemble (NPT), will also be introduced. Thereafter, we will introduce the Building Blocks and Architectures of Deep Neural Networks (namely, Layers, Activation Functions, Loss Functions, Optimization, etc.), Convolutional Neural Networks (CNN), AutoEncoders, Variational AutoEncoders, and Adversarial Autoencoders. There will be three hands-on subsections on (1) Implementation and Execution of Unbiased/Vanilla MD Simulations of a Medium-Size Protein System; (2) Implementation and Execution of DL-Enhanced MD Simulations for the Protein System; and (3) Comparison of the Vanilla MD Simulations Results and the DL-Enhanced MD Simulations.
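As a small illustration of what MD with classical mechanics and an empirical potential means in practice, the sketch below integrates Newton's equations for a single particle in a one-dimensional harmonic potential using the velocity Verlet scheme. It is a toy example under stated assumptions, not the tutorial's material: the potential, the unit mass and force constant, the time step, and all function names are chosen here for illustration. It has no temperature or pressure control, so it samples constant energy rather than the NVT or NPT ensembles.

```python
def harmonic_force(x, k=1.0):
    """Force from the potential U(x) = 0.5 * k * x**2."""
    return -k * x


def total_energy(x, v, m=1.0, k=1.0):
    """Kinetic plus potential energy of the particle."""
    return 0.5 * m * v * v + 0.5 * k * x * x


def velocity_verlet(x, v, n_steps, dt=0.01, m=1.0, k=1.0):
    """Propagate position and velocity; returns the list of (x, v) states."""
    force = harmonic_force(x, k)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * force / m  # half-step velocity update
        x = x + dt * v_half                # full-step position update
        force = harmonic_force(x, k)       # force at the new position
        v = v_half + 0.5 * dt * force / m  # second half-step velocity update
        trajectory.append((x, v))
    return trajectory


trajectory = velocity_verlet(x=1.0, v=0.0, n_steps=1000)
e_start = total_energy(*trajectory[0])
e_end = total_energy(*trajectory[-1])
relative_drift = abs(e_end - e_start) / e_start
print(f"relative energy drift after 1000 steps: {relative_drift:.2e}")
```

Production MD engines apply the same idea to many atoms with far richer force fields, and add thermostats and barostats to reach the NVT and NPT ensembles mentioned above.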
The participants/attendees of this tutorial will benefit tremendously from learning MD Simulations, Deep Learning, and their combinations (MD+DL). They will have the opportunity to learn and practice how to write computer programs that use MD, DL, and MD+DL techniques to study molecules. All of this will allow the participants to add to their arsenal of New Cutting-Edge Tools for doing Molecular Biology (as well as Computational Chemistry) research.
The tutorial is suitable for beginners/intermediates and for people with little or no experience in MD Simulations and/or Deep Learning. Basic knowledge of and experience in computer programming in Python will be a great asset. Estimated Level of Difficulty: Intermediate.
Maximum Audience: 80
None, if participants just wish to listen and watch. However, to actively participate and do the hands-on exercises, each participant should bring a laptop computer with the following pre-installed:
1. Anaconda for Python 3.7 (see https://www.anaconda.com/distribution/ )
2. OpenMM for Python 3.7 (see http://docs.openmm.org/latest/userguide/application.html#installing-openmm )
3. Pytorch for Python 3.7 (see https://pytorch.org/ )
|2:00 - 2:30 pm||Introduction to (and Mathematical Formalization of) Molecular Dynamics (MD) Simulations|
|2:30 - 2:45 pm||Examples and Limitations of Vanilla/Unbiased MD Simulations and the Need for Enhanced Sampling|
|2:45 - 3:15 pm||Introduction to Deep Learning|
|3:30 - 4:00 pm||Techniques for Achieving Enhanced Sampling in MD Simulations Using Deep Learning (DL)|
|4:00 - 4:15 pm||Coffee Break|
|4:15 - 4:45 pm||Hands-on 1|
|4:45 - 5:15 pm||Hands-on 2|
|5:15 - 5:45 pm||Hands-on 3|
|5:45 - 5:55 pm||Reflections and Conclusions|
|5:55 - 6:00 pm||Questions and Answers|
Tutorial PM3: Automation of Network Analysis in the Cytoscape Ecosystem
Dexter Pratt, UC San Diego School of Medicine, United States
Alexander Pico, Gladstone Institutes, United States
John “Scooter” Morris, UCSF, United States
This tutorial is intended for an audience that has prior experience with:
- R or Python
- The Cytoscape desktop application
- Bioinformatics analysis using R or Python
Maximum Audience: 60
|2:00 - 2:40 pm||Introduction|
|2:40 - 3:20 pm||Setting up the Workspace|
|3:20 - 4:00 pm||Network I/O to NDEx and Basic Visualization|
|4:00 - 4:15 pm||Coffee break|
|4:15 - 5:00 pm||Data to Networks|
|5:00 - 6:00 pm||Additional Topics and Q&A|