ISMB/ECCB 2019 - Tutorials

ISMB/ECCB 2019 features pre-conference tutorial sessions on Sunday, July 21, 2019, one day prior to the start of the conference scientific program.

Tutorial attendees should register using the online registration system; pricing is available at https://www.iscb.org/ismbeccb2019-registration. Tutorial participants must be registered for the ISMB/ECCB conference to attend a tutorial. Attendees will receive a Tutorial Entry Pass (ticket) when they register on site.

Tutorial FD1: Interpretability for deep learning models in computational biology

Sunday, July 21, 9:00 am - 6:00 pm

Room: Montreal (2nd Floor)

Download Slide Deck Instructions
Presenters

Dr. María Rodríguez Martínez, IBM Research – Zürich.
Dr. Matteo Manica, IBM Research – Zürich.
Dr. Ali Oskooei, IBM Research – Zürich.
An-Phi Nguyen, IBM Research – Zürich.

Overview

The recent application of deep neural networks to long-standing problems such as the prediction of functional DNA sequences, the inference of protein-protein interactions or the detection of cancer cells in histopathology images has brought a breakthrough in performance and predictive power. However, high accuracy often comes at the price of interpretability: many of these models are built as black boxes that fail to provide new biological insights. This tutorial illustrates some of the recent advances in the field of Interpretable Artificial Intelligence. We will show how explainable, smaller models can achieve levels of performance similar to those of cumbersome ones, while shedding light on the underlying biological principles driving model decisions.

We will demonstrate how to build and extract knowledge using interpretable approaches in two different domains of computational biology: the functional analysis of raw DNA sequencing data and drug sensitivity prediction models. The choice of these two applications is motivated by the availability of adequately large datasets that can support deep learning approaches and by their high relevance for personalized medicine. We will exploit both publicly available deep learning models as well as in-house developed models.

The tutorial aims to strike the right balance between theoretical input and practical exercises. It has been designed to provide participants not only with the theory behind deep learning and interpretability, but also with a set of frameworks, tools and real-life examples that they can apply in their own projects.
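To make the perturbation-based family of techniques covered in the afternoon sessions concrete, here is a minimal sketch in Python (the language of the hands-on exercises). It scores each input feature of a toy model by how much the prediction moves when that feature is masked; the model is a random linear stand-in, not one of the tutorial's models, and for DNA sequence models the same idea appears as in-silico mutagenesis.

    # Toy perturbation-based attribution (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=16)  # weights of a hypothetical "black-box" model

    def model(x):
        # Stand-in for any trained predictor returning a score per input.
        return x @ W

    def occlusion_importance(x, baseline=0.0):
        # Score each feature by the prediction change when it is replaced
        # with a baseline value.
        reference = model(x)
        scores = np.empty_like(x)
        for i in range(x.size):
            perturbed = x.copy()
            perturbed[i] = baseline
            scores[i] = reference - model(perturbed)
        return scores

    x = rng.normal(size=16)
    print(occlusion_importance(x))  # large |score| = influential feature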

Audience

This course is designed for everyone who would like to learn the basics of interpretability techniques for deep learning. The tutorial will provide a brief introduction to key concepts in deep learning, before exploring recent developments in the field of interpretability.

Requirements

None, if participants just wish to listen. Those who would like to take part in the hands-on exercises must bring their own laptop and should have basic programming knowledge of Python and shell scripting. All material for the lectures and hands-on exercises will be available for download prior to the day of the tutorial.

Maximum Participants: 100

Schedule Overview
9:00 - 10:00 am Introduction to deep learning
  • Deep learning: What, why, how deep?
  • Activation functions
  • Cost functions
  • Backpropagation
  • Regularization
  • Optimization
10:00 - 11:00 am Common deep learning models
  • Multi-Layer Perceptron (MLP)
  • Autoencoders (AE)
  • Convolutional Neural Networks (CNN)
  • Recurrent Neural Networks (RNN)
11:00 - 11:15 am Coffee Break
11:15 am - 12:30 pm Interpretability in deep learning
  • Introduction to interpretability
  • A few techniques for interpretability
    • Backpropagation-like approaches
    • Perturbation-based approaches
    • Attention mechanisms
    • Surrogate Models
    • Other models
  • Discussion
2:00 - 3:00 pm CellTyper: interpretability on simple models. (Hands-on)
3:00 - 4:00 pm Understanding DeepBind: actionable interpretability? (Hands-on)
4:00 - 4:15 pm Coffee Break
4:15 - 5:00 pm PaccMann - Interpreting complex models: are model-agnostic interpretability methods the way to go? (Hands-on)
5:00 - 6:00 pm PaccMann - Built-in interpretability: attention-mechanisms to the rescue. (Hands-on)

Tutorial AM2: Recent Advances in Statistical Methods and Computational Algorithms for Single-Cell Omics Analysis

Sunday, July 21, 9:00 am - 1:00 pm

Room: Sydney (2nd Floor)

Presenters

Rhonda Bacher, PhD, Assistant Professor, Department of Biostatistics, University of Florida, United States
Yuchao Jiang, PhD, Assistant Professor, Department of Biostatistics, University of North Carolina at Chapel Hill, United States
Jingshu Wang, PhD, Postdoctoral Fellow, Department of Statistics, University of Pennsylvania, United States

Overview

Single-cell genomics is the study of individual cells using omics approaches; it circumvents the averaging artifacts associated with traditional bulk population data and yields new insights into cellular heterogeneity. The field has seen rapid development both in technologies and in the statistical methods and computational algorithms used to analyze the resulting data. This tutorial focuses on recently developed advanced statistical and computational methods for single-cell omics data. The first half of the tutorial will include a brief introduction, followed by “generalized” methods and workflows for scRNA-seq data, including data normalization, visualization, batch correction, and denoising. The second half of the tutorial will cover “specific” topics and applications in the single-cell domain, including pseudotime reconstruction, simultaneous measurements of single-cell transcriptomic and V(D)J profiles, multimodal alignment of single-cell transcriptomic and epigenomic data, and single-cell inference of tumor heterogeneity.
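For orientation, the sketch below chains together one common scRNA-seq preprocessing-and-visualization workflow in Python using Scanpy; the dataset and parameter choices are arbitrary, and the tutorial itself covers a broader set of tools and methods.

    # Minimal scRNA-seq normalization + UMAP sketch (illustrative only).
    import scanpy as sc

    adata = sc.datasets.pbmc3k()                  # small public 10x dataset
    sc.pp.filter_cells(adata, min_genes=200)      # basic quality control
    sc.pp.filter_genes(adata, min_cells=3)
    sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var.highly_variable]
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata, n_neighbors=15)        # kNN graph underlying UMAP
    sc.tl.umap(adata)
    sc.pl.umap(adata)                             # 2-D embedding of cells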

Website: https://github.com/rhondabacher/ISMB2019_SingleCellTutorial

Audience

This tutorial is intended for an audience with a genomics/computational background who are interested in cutting-edge developments of single-cell research, including both method development and application. Previous experience in analyzing single-cell data is preferred. Advanced tools that have recently been developed in the field will be taught from a high-level perspective.

Maximum Participants: 100

Schedule Overview
9:00 - 9:40 am Introduction: tutorial infrastructure setup; technologies for scRNA-seq data generation; types of analysis that can be carried out; data normalization, spike-ins, and technical artifacts (RB).
9:40 - 10:00 am Data visualization, including UMAP, t-SNE, etc. (YJ).
10:00 - 10:30 am Denoising, batch correction (JW).
10:30 am - 11:00 am Autoencoder and transfer learning for scRNA-seq (JW).
11:00 - 11:15 am Coffee Break
11:15 - 11:40 am Pseudotime reconstruction, cell ordering (RB).
11:40 am - 12:00 pm scRNA-seq in immunology (V(D)J, cell surface protein) (RB).
12:00 - 12:30 pm Methods for scATAC-seq analysis and multimodal alignment of single-cell transcriptomic and epigenomic data (YJ).
12:30 - 1:00 pm Single-cell omics analysis in cancer, including assessing cancer heterogeneity and inferring tumor phylogeny by scRNA-seq, and profiling copy number changes by scDNA-seq (YJ).

RB: Rhonda Bacher. YJ: Yuchao Jiang. JW: Jingshu Wang


Tutorial AM3: Building a Distributed Knowledge Graph to Assist with Computational Drug Discovery

Sunday, July 21, 9:00 am - 1:00 pm

Room: Kairo 1/2 (Ground Floor)

Presenters

Rabie Saidi, European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom
Maryam Abdollahyan, Queen Mary University of London, United Kingdom
Andrew Nightingale, European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom
Maria J Martin, European Bioinformatics Institute (EMBL-EBI), Cambridge, United Kingdom

Overview

Drug discovery pipelines are expensive in time and resources, which are wasted if a drug is rejected due to toxicities discovered in late stages. Computational investigation of the different entities (proteins, diseases, pathways, ...) that are involved in drug discovery could help provide a better understanding of the dynamics governing their relations and the downstream effects of targeting proteins with drugs. Using various data sources, including the UniProt Knowledgebase, disease ontologies, the DrugBank database and protein interactions and pathways data, we present data integration approaches to build a distributed knowledge graph (DKG) that will assist with computational discovery of drugs.

In this tutorial, the participants will be introduced to two emerging tools in the field of big data, namely the Apache Spark computing framework and the Apache Zeppelin interactive analytics framework. Spark can be used from within Zeppelin and coupled with other back-end languages and tools to provide deeper insights. Participants will also learn about data structures for representing knowledge graphs (GraphFrames) and building machine learning (ML) models.
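To give a feel for the GraphFrames representation, here is a minimal PySpark sketch that builds a toy three-node graph and runs a motif query; the schema, entity IDs and relation names are invented for illustration, and the graphframes package is assumed to be available on the Spark classpath (e.g. via spark-submit --packages).

    # Toy knowledge graph with GraphFrames (illustrative only).
    from pyspark.sql import SparkSession
    from graphframes import GraphFrame

    spark = SparkSession.builder.appName("dkg-sketch").getOrCreate()

    # Vertices need an "id" column; "type" is our own label.
    vertices = spark.createDataFrame(
        [("P1", "protein"), ("D1", "drug"), ("X1", "disease")],
        ["id", "type"])
    # Edges need "src" and "dst" columns; "rel" names the relation.
    edges = spark.createDataFrame(
        [("D1", "P1", "targets"), ("P1", "X1", "associated_with")],
        ["src", "dst", "rel"])

    g = GraphFrame(vertices, edges)

    # Motif query: drugs whose protein targets are associated with a disease.
    (g.find("(d)-[t]->(p); (p)-[a]->(x)")
      .filter("t.rel = 'targets' AND a.rel = 'associated_with'")
      .selectExpr("d.id AS drug", "x.id AS disease")
      .show())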

Audience

Beginner or intermediate. This tutorial will be of broad interest to researchers from academia or industry who would like to apply an interactive analytics platform coupled with other back-end languages and tools to build machine learning models for analysis of drug-discovery-related data.

This tutorial is mainly a hands-on session using Apache Spark and Apache Zeppelin. Programming knowledge (e.g. Scala, Java, Python or similar) is required. Instructions on how to set up the environment will be provided in advance.

Attendees are required to provide their own laptop.

Maximum Participants: 40

Schedule Overview
9:00 - 9:05 am Introduction
9:05 - 9:35 am Overview:
  • Motivation
  • Challenges
  • Data Sources (e.g. UniProt, Disease Ontologies, DrugBank, Protein Interactions and Pathways Databases, etc.)
9:35 - 11:00 am Hands-on Session: Generating the DKG
  • Data Transformation with Apache Spark
  • Linking Data Sources
  • Building the DKG
11:00 - 11:15 am Coffee Break
11:15 - 12:25 pm Hands-on Session: Exploring the DKG
  • Interactive Analytics with Zeppelin
  • Visualising and Querying the DKG
12:25 - 12:50 pm Hands-on Session: Predictive Analytics
12:50 - 1:00 pm Perspectives + Q&A
  • Integrating Additional Resources
  • Relation to Other Drug Discovery Projects

Tutorial AM4: A Practical Introduction to Reproducible Computational Workflows

Sunday, July 21, 9:00 am - 1:00 pm

Room: Shanghai 1/2 (Ground Floor)

Download Materials

Presenters

Peter W. Rose, Director, Structural Bioinformatics Lab, San Diego Supercomputer Center, UC San Diego, United States
Tim Head, Project member of JupyterHub & mybinder.org, and Wild Tree Tech, Brugg, Switzerland
Fergus Boyles, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom
Fergus Imrie, Oxford Protein Informatics Group, Department of Statistics, University of Oxford, United Kingdom

Overview

This hands-on tutorial teaches participants the key requirements and practical skills to set up a reproducible and reusable computational research environment. The tutorial is intended for Python and R users, and for anyone interested in using Jupyter Notebooks, which support over 50 programming languages. We will work through a few bioinformatics use cases step by step, including biological visualization and machine learning. We will then share the results using Binder (mybinder.org), a publicly hosted environment to run Jupyter Notebooks in a fully reproducible and interactive manner. We also cover collaborative development practices. After attending this workshop, participants should be able to set up their own projects by applying the principles and techniques learned and publish reproducible research protocols.
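As one small example of the reproducibility principles involved, the hypothetical snippet below pins every package version in the running kernel to a file that can be committed next to a notebook; Binder itself consumes dependency files such as environment.yml or requirements.txt, so a capture like this is only a starting point.

    # Pin installed package versions so an environment can be rebuilt later.
    # (Sketch only; the output filename is arbitrary.)
    from importlib.metadata import distributions

    pins = sorted({f"{d.metadata['Name']}=={d.version}" for d in distributions()})
    with open("requirements-freeze.txt", "w") as fh:
        fh.write("\n".join(pins))
    print(f"pinned {len(pins)} packages")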

Audience

This course is designed for everyone who would like to gain hands-on experience in setting up reproducible computational environments for their own projects. Introductory-level Python skills are required; R skills are optional.

Prerequisites

  • Create a GitHub account
  • Install Miniconda/Anaconda
  • Attendees are required to provide their own laptop.

Maximum Participants: 40

Schedule Overview
9:00 - 9:30 am Introduction
  • Best practices for reproducible research
  • Run example from mybinder.org
9:30 - 9:45 am Hands-on Session: Set up your Conda environment
9:45 - 10:15 am Hands-on Session: Create and run Jupyter Notebooks
  • Jupyter Notebook/Lab basics
  • Visualize biological data using plugins (3D structures, sequences, networks)
10:15 - 11:00 am Hands-on Session: Open-source your code and collaborate using GitHub
  • GitHub GUI
  • Command line
  • Merging, branching, and version control
11:00 - 11:15 am Coffee Break
11:15 - 11:45 am Hands-on Session: Make your code reproducible by anyone, anywhere
  • Share Jupyter Notebook or RStudio on mybinder.org
  • Share single Jupyter Notebook on Google Colaboratory
11:45 am - 12:45 pm Hands-on Session: Work on provided example projects or your own project
  • Show and tell of what you did
12:45 - 1:00 pm Wrap Up

Tutorial PM5: Biomarker discovery and machine learning in large pharmacogenomics datasets

Sunday, July 21, 2:00 pm - 6:00 pm

Room: Kairo 1/2 (Ground Floor)

Presenters

Arvind Singh Mer, Princess Margaret Cancer Center, University of Toronto, Canada
Zhaleh Safikhani, Princess Margaret Cancer Center, University of Toronto, Canada
Petr Smirnov, Princess Margaret Cancer Center, Vector Institute, University of Toronto, Canada
Benjamin Haibe-Kains, Princess Margaret Cancer Center, Vector Institute, Ontario Institute for Cancer Research, University of Toronto, Canada

Overview

Over the past decade there has been an explosion in the availability of massive datasets combining drug screening with high-throughput molecular profiling in cancer model systems. These datasets have become a rich community resource that can be leveraged for biomarker discovery, in-silico validation, drug repurposing and drug mechanism of action prediction, and for training statistical machine learning models for drug response prediction. However, these data pose unique challenges during analysis and require methods that are robust to the noise inherent in drug sensitivity assays. Furthermore, the irreproducibility of some findings across studies strongly motivates integrative, cross-study analysis. Fortunately, tools have been developed implementing bioinformatics and machine learning methods designed specifically for the analysis of pre-clinical pharmacogenomics data.

In this tutorial, participants will become familiar with common preclinical cancer models (such as cell lines, patient-derived xenografts and organoids) and publicly available large pharmacogenomics datasets. Next, in the hands-on session, they will be introduced to the tools and packages published for the analysis of these datasets, with a focus on tools written in R. After becoming familiar with the challenges posed by the noise in the pharmacological assays observed in high-throughput pharmacogenomics, participants will gain hands-on experience using these datasets for biomarker discovery and validation, as well as for building machine learning models predictive of drug response. A focus will be on translational research: validating discoveries from in vitro datasets using in vivo pharmacogenomic and clinical datasets. The hands-on sessions will be conducted primarily in R and RStudio.
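Although the hands-on work is in R, the idea behind univariate biomarker discovery is easy to sketch in a few lines; the Python example below, on purely synthetic data, correlates each gene's expression with drug response and applies a Benjamini-Hochberg correction (all names and numbers are invented).

    # Synthetic sketch of univariate biomarker discovery (illustrative only).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_lines, n_genes = 200, 1000
    expr = rng.normal(size=(n_lines, n_genes))           # cell lines x genes
    drug = 0.8 * expr[:, 42] + rng.normal(size=n_lines)  # response; gene 42 is causal

    # Test every gene's association with drug response.
    pvals = np.array([stats.pearsonr(expr[:, g], drug)[1] for g in range(n_genes)])

    # Benjamini-Hochberg FDR control.
    order = np.argsort(pvals)
    raw_q = pvals[order] * n_genes / np.arange(1, n_genes + 1)
    fdr = np.minimum.accumulate(raw_q[::-1])[::-1]       # enforce monotonicity
    print("candidate biomarkers:", order[fdr < 0.05])    # should recover gene 42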

Audience

This tutorial is open to all participants who are interested in mining large cancer pharmacogenomic data for precision oncology. For hands-on sessions, some prior experience with the following is required:

  • Bioinformatics analysis using R
  • Knowledge of high throughput genomic data (gene expression, mutation etc.)
  • Familiarity with basic machine learning concepts
Requirements:

Participants are required to bring a laptop with R and RStudio installed. Installation instructions will be provided in the weeks preceding the tutorial.

Maximum Participants: 60

Schedule Overview
2:00 - 2:30 pm Introduction to high-throughput pharmacogenomics
  • Quick introductions: presenters & audience
  • Preclinical models in cancer: cell lines, organoids, patient-derived xenografts (PDXs), patient-derived cells
  • Sensitivity and perturbation experiments in pharmacogenomics
    • Common experimental designs
2:30 - 3:00 pm Pharmacogenomics datasets
  • Publicly available in-vitro and in-vivo datasets
    • (CCLE, GDSC, L1000, PDX Encyclopedia)
  • Web-Based Exploratory Resources
    • CellMinerCDB, PharmacoDB
3:00 - 4:00 pm Hands-on Session: Tools for pharmacogenomics analysis
  • GDSCTools
  • GRCalculator
  • PharmacoGx
  • Xeva
4:00 - 4:15 pm Coffee Break
4:15 - 4:40 pm Statistics and machine learning on pharmacogenomics data
  • Evaluating reproducibility and handling noise in pharmacogenomics data
  • Meta-analysis across studies
  • Applications of machine learning for drug ranking and predictive modeling
4:40 - 5:10 pm Hands-on Session: Finding anticancer drug biomarkers
  • Univariate biomarker discovery
  • Validating known biomarkers
  • Integrative analysis across in-vitro and in-vivo data
5:10 - 5:40 pm Hands-on Session: Machine learning using pharmacogenomics data
  • Building machine learning models to predict drug response
  • Personalized drug ranking
  • Testing models on clinical data
5:40 - 6:00 pm Q&A and tutorial wrap-up

Tutorial PM6: Visualization of Large Biological Data

Sunday, July 21, 2:00 pm - 6:00 pm

Room: Sydney (2nd Floor)

Presenters

Prof. G. Elisabeta Marai, Ph.D., University of Illinois at Chicago, United States
Prof. Dr. Kay Nieselt, Center for Bioinformatics, University of Tübingen, Germany
Jun.-Prof. Dr. Michael Krone, Center for Bioinformatics, University of Tübingen, Germany

Overview

The aim of this tutorial is to familiarize participants with modern visual analytics methodologies applied to biological data and to provide simple hands-on training. Questions such as what data visualization is, what visual analytics is, and how large-scale biological data can be visualized to gain insight will be addressed, so that hypotheses can be generated or explored and further targeted analyses can be defined. The tutorial will cover the basics necessary to create visualizations for biological data, including a general introduction to visualization, the basics of visual design, and the fundamentals of human color perception. Based on these generally applicable principles, various examples of visualizations and visual analysis tools for biological data that adhere to these fundamentals and best practices will be presented and discussed. A specific focus will be on visualization approaches for large-scale (omics) data. Finally, attendees will have the opportunity to gain first hands-on experience in creating their own interactive web-based visualization application using modern web technologies such as HTML5, JavaScript, and D3.

Topics Include

  • Digital/Electronic visualization of data
  • Understanding color
  • Visual Design Principles
  • Examples of visualization of biological data
  • Challenges of large-scale biological data visualization
  • Introduction to web-based visualization for biological data
Audience

The tutorial is designed for anyone who has little or no prior knowledge of data visualization and wants to learn the basics (beginner level). The course provides useful background material on data visualization principles, but the focus is on methods and tools for visualizing next-generation sequencing data, other omics data, and network data. Previous programming knowledge is a plus for the hands-on part, but is not required to participate. Attendees who want to participate actively in the hands-on session should bring a laptop with a text editor and a modern web browser (passive participation is also possible).

Maximum Participants: 60

Schedule Overview
2:00 - 2:15 pm Welcome & Introduction to tutorial structure
2:15 - 2:45 pm What is (electronic) visualization - Understanding color
  • Color perception and luminance
  • Mapping data to color
2:45 - 3:30 pm Visual design principles
  • Tufte’s design principles
  • Shneiderman’s mantra
  • Small multiples etc.
3:30 - 4:00 pm Introduction to Biological Data Visualization
  • Topics in BioVis (including examples)
  • Visualization of sequences, macromolecules, omics data, biological networks
4:00 - 4:15 pm Coffee Break
4:15 - 4:45 pm Tools and Software for Biological Visualization
  • Specific tools for visualizing large-scale biological data
4:45 - 5:00 pm Introduction to HTML5 and JavaScript
  • Hands-on: basics of web application development
5:00 - 6:00 pm Introduction to D3
  • Hands-on: generating a simple interactive, web-based visualization

Tutorial PM7: Tools for reproducible research

Sunday, July 21, 2:00 pm - 6:00 pm

Room: Shanghai 1/2 (Ground Floor)

Conda Cheat Sheet
Snakemake Live Demo
Snakemake Talk Slides
Tools for reproducible research
Presenters

Johannes Koester - Group Leader, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen
Bjoern Gruening - Bioinformatician, University of Freiburg, Freiburg, Germany
Devon Ryan - Bioinformatician, Max Planck Institute for Immunobiology and Epigenetics, Freiburg, Germany

Overview

The typical data analyst must simultaneously juggle multiple projects, each with its own duration and software requirements. As few analysts have any formal training in structuring or even writing the code necessary to perform an analysis, it is unsurprising that the iterative analytic process can produce a wide assortment of almost identically named files (e.g., “final_results.txt”, “final_results.version2.txt”, “final_results.really_final.txt”), all with unclear origins and produced with a hodge-podge of similarly poorly named scripts. The near impossibility of tracing a results file to the exact process that produced it creates untold difficulties both when it comes time to publish results and when planning subsequent experiments months or years later (after all, which of the “final_results” files was really the “right one”?). These issues are further compounded by software paths and other similar assumptions being hard-coded into scripts, preventing easy replication of the analysis elsewhere. Performing analyses in a reproducible and traceable manner is clearly needed to combat such problems.

In this hands-on tutorial, we demonstrate how Conda can be used to deploy specific software versions easily, reproducibly, and without administrator credentials. Moreover, we demonstrate how Conda’s ability to create isolated software environments helps to avoid side-effects between different analyses or different steps of the same analysis. Attendees will also learn how to create conda recipes themselves, so they can contribute new packages to projects such as Bioconda. We further demonstrate how Snakemake can be used in combination with Conda and Containers to create reproducible analysis workflows and execute them on any platform from workstations to clusters and the cloud. Finally, using snakePipes as an example, we demonstrate how Conda and Snakemake can be used to define reproducible and flexible workflows for complex genomics analysis.
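To preview where the final hands-on session is headed, below is a minimal sketch of a Snakefile for the mapping/indexing/coverage example from the schedule; sample names, paths and the index location are hypothetical, and each rule is assumed to pull hisat2, samtools and deepTools from a Conda environment file when run with snakemake --use-conda.

    # Snakefile sketch (illustrative only; all paths are made up).
    SAMPLES = ["sample1", "sample2"]

    rule all:
        input:
            expand("coverage/{sample}.bw", sample=SAMPLES)

    # Map reads with HISAT2 and sort the alignments.
    rule map_reads:
        input:
            "reads/{sample}.fastq.gz"
        output:
            "mapped/{sample}.bam"
        conda:
            "envs/mapping.yaml"  # assumed to list hisat2 and samtools
        shell:
            "hisat2 -x index/genome -U {input} | samtools sort -o {output} -"

    # Index the sorted BAM.
    rule index_bam:
        input:
            "mapped/{sample}.bam"
        output:
            "mapped/{sample}.bam.bai"
        conda:
            "envs/mapping.yaml"
        shell:
            "samtools index {input}"

    # Create a bigWig coverage track with deepTools.
    rule coverage:
        input:
            bam="mapped/{sample}.bam",
            bai="mapped/{sample}.bam.bai"
        output:
            "coverage/{sample}.bw"
        conda:
            "envs/mapping.yaml"  # assumed to list deeptools
        shell:
            "bamCoverage -b {input.bam} -o {output}"

Running snakemake --use-conda --cores 4 would then build every coverage track, creating each rule's environment on first use.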

Audience

Beginners, Intermediates, Core-Facility Staff
The expected audience should have basic familiarity with Python, Git and the command line.

Requirements:

Laptop running Linux or macOS
Miniconda pre-installed: https://conda.io/miniconda.html

Maximum Participants: 40

Schedule Overview

2:00 - 2:10 pm Installing Conda and Snakemake
2:10 - 2:30 pm Intro to Conda and Bioconda (slides)
2:30 - 3:30 pm Hands-on Session: creating Conda environments and installing packages from the Bioconda repository
  • This practical requires installing hisat, samtools and deeptools via Bioconda
3:30 - 4:00 pm Hands-on Session: writing Conda recipes
4:00 - 4:15 pm Coffee Break
4:15 - 4:35 pm Intro to Snakemake
4:35 - 6:00 pm Hands-on Session: Writing a Snakemake workflow wrapper for mapping, indexing and creating coverage files