Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Tutorials

There will be a series of in-person and virtual tutorials prior to the start of the conference. Tutorial registration fees are shown at: https://www.iscb.org/ismb2024/register#tutorials

In-person Tutorials (All times EDT)

Virtual Tutorials: (All times EDT) Presented through the conference platform


Tutorial IP1: Advanced machine learning methods for modeling, analyzing, and interpreting single-cell omics and spatial transcriptomics data

Room: TBD
Date: Friday, July 12, 2024 9:00 – 18:00 EDT

Organizer:
Juexin Wang

Speakers:
Mauminah Raina, (Ph.D. student) Indiana University Purdue University Indianapolis, United States
Juexin Wang, Indiana University Purdue University Indianapolis, United States
Anjun Ma, Ohio State University, United States
Qin Ma, Ohio State University, United States
Dong Xu, University of Missouri, United States

Max Participants: 50

Description
Emerging single-cell omics and spatial transcriptomics technologies provide unprecedented opportunities and challenges for molecular biology studies. The central questions in this area are how to model these vast sequencing data across different modalities, how to perform computational analyses, and how to interpret mechanisms by identifying biologically and pathologically meaningful cell types, regulatory relations, and key markers.

Advanced machine learning methods and tools provide a promising approach to address these challenges. scGNN (https://github.com/juexinwang/scGNN) is a graph neural network-based framework for clustering and imputing scRNA-seq data by modeling the single cells as a cell graph. Targeting single-cell multi-omics data, DeepMAPS (https://bmblx.bmi.osumc.edu/) introduces a heterogeneous graph transformer to infer single-cell biological networks. BSP (https://github.com/juexinwang/BSP) proposes a granularity-based statistical approach to identify spatially variable genes in 2D and 3D spatial transcriptomics.
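
To make the idea of a "cell graph" concrete, here is a minimal sketch, assuming the graph is a k-nearest-neighbor graph built from Euclidean distances between expression profiles. The function name `knn_cell_graph`, the toy matrix, and the choice of `k` are illustrative assumptions; this is not scGNN's actual implementation.

```python
import math

def knn_cell_graph(expression, k):
    """Return {cell_index: [indices of its k nearest cells]} (Euclidean distance)."""
    n = len(expression)
    graph = {}
    for i in range(n):
        # Distance from cell i to every other cell, paired with that cell's index
        dists = [(math.dist(expression[i], expression[j]), j) for j in range(n) if j != i]
        dists.sort()
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Hypothetical toy data: 4 cells x 3 genes; cells 0/1 and cells 2/3 are similar
cells = [
    [5.0, 0.0, 1.0],
    [4.5, 0.2, 1.1],
    [0.1, 6.0, 3.0],
    [0.0, 5.5, 3.2],
]
graph = knn_cell_graph(cells, k=1)
print(graph)
```

In a graph neural network setting, an adjacency structure like this would typically be combined with per-cell feature vectors; the quadratic loop here is only suitable for small examples.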

Our tutorial will cover key advancements in machine learning methods developed for single-cell multi-omics and spatial transcriptomics research over the past few years, emphasizing new opportunities in bioinformatics enabled by such advancements. We will start with a technical talk about the machine learning algorithms behind the covered approaches, including scGNN, DeepMAPS, and BSP, spanning model training through model interpretation (discovery of cell types, regulatory relations, and key markers). We will then demonstrate the impact of machine learning on discovering underlying mechanisms in complex diseases, such as cancer, Alzheimer’s disease, and kidney disease.

Learning Objectives

  • To understand the basic principles of deep learning, graph representation learning, and model interpretation.
  • To understand the specifics of computational tools such as scGNN, DeepMAPS, and BSP, and become aware of the appropriate tools to use in different applications in single-cell multi-omics and spatial transcriptomics studies.
  • To gain hands-on experience in applying tools and interpreting results using the standalone Python-based software scGNN, the R-based BSP, and the webserver-based DeepMAPS.

Intended Audience and Level
The target audiences are graduate students, researchers, scientists, and practitioners in both academia and industry who are interested in applications of deep learning in bioinformatics (Broad Interest). The tutorial is aimed towards entry-level participants with knowledge of the fundamentals of biology and machine learning (beginner). Basic experience with Python and R programming languages is recommended for the participants.

The tutorial slides and materials for hands-on exercises (e.g., links to demo, code implementation, and datasets) will be posted online prior to the tutorial and made available to all participants.

Schedule

9:00

Part 1: Overview: Introduction to single-cell multi-omics and spatial transcriptomics and corresponding challenges.

  • Recent developments in single-cell multi-omics and spatial transcriptomics sequencing.
  • General computational approaches in data modeling.
  • The impact of advanced machine learning approaches on discovering cell types, regulatory relations, and key markers in complex diseases and other biological phenomena.
9:45

Part 2: Introduction to biological analysis methods.

  • Data visualization on single cells
    • UMAP and t-SNE
    • Sankey plot, Circos plot, and Dot plot
  • Cell-cell communications on scRNA-seq data and spatial transcriptomics data
    • CellChat, CellChatDB, and COMMOT
10:45 Coffee Break
11:00

Part 3: Clustering-based single-cell analysis and scGNN.

  • Identifying cell types using scRNA-seq data.
    • Louvain and Leiden clustering
    • Selecting marker genes from the clusters
    • Cell type annotations with singleR and scType
  • Modeling with graph neural networks.
    • Concepts of graph neural networks
    • Stacked graph autoencoders in scGNN 1.0
    • Integrating scRNA-seq and bulk RNA-seq in scGNN 2.0
11:30

Part 4: Applications #1: Single-cell RNA-seq dataset acquisition, model training, and analysis.

  • Dataset acquisition, including methods that convert the data to the required format of scGNN.
    • Data type: scRNA-seq
    • Format: comma-separated values (CSV), 10X sparse format and hdf5 file.
  • Modeling, including deep learning models and automated hyperparameter tuning.
    • Models
      • scGNN model
      • Quick mode
      • Including LTMG as regulatory priors
    • Default option
      • Basic models with hyperparameters.
    • Advanced option
      • Selecting the number of clusters with parameter resolution
      • Tuning the models by automated hyperparameter tuning algorithm.
    • Downstream analysis of Alzheimer’s disease
      • Application of Alzheimer’s disease using single-cell RNA-seq data
      • Cell type annotation and validation
13:00 Lunch
14:00

Part 5: Network analysis on single-cell multi-omics and DeepMAPS.

  • Network analysis using single-cell multi-omics.
    • Classical network analysis in integrating modalities
      • Identifying cell-specific regulatory relations
    • Modeling with heterogeneous graph transformer.
      • Transformer model
      • Various graph transformer models
      • Heterogeneous graph transformer for modeling heterogeneous graphs
14:30

Part 6: Applications #2: Single-cell multi-omics dataset acquisition, model training, and analysis.

  • Dataset acquisition, including methods that convert the data to the required format of DeepMAPS.
    • Data type: scRNA-seq, scATAC-seq, CITE-seq
    • Format: comma-separated values (CSV), 10X sparse format and hdf5 file.
  • Modeling, including deep learning models and automated hyperparameter tuning.
    • Models
      • DeepMAPS model
    • Default option
      • Basic models with hyperparameters tuned by us.
    • Job management
      • Monitoring the training steps via the interface.
      • Visualizing the learning curve during the model training.
      • Comparing the performance of different models.
      • Downstream analysis
  • Applications.
    • Analysis of human IFNB-stimulated and control PBMCs with multiple scRNA-seq data
    • Analysis of human PBMC and lung tumor leukocytes with CITE-seq data
16:00 Coffee Break
16:15

Part 7: Marker analysis on spatial transcriptomics and BSP.

  • Spatially variable gene identification in spatial transcriptomics.
    • Definition of spatially variable genes
    • Statistical approaches in identifying spatially variable genes
  • Granularity-based statistical approach.
    • BSP model
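
As an illustration of what a statistical test for spatial variability measures, here is a minimal sketch of Moran's I, a classical spatial autocorrelation statistic. This is a swapped-in example and not the BSP method; the function name `morans_i`, the binary neighbor weights, and the toy coordinates are assumptions made for illustration.

```python
import math

def morans_i(coords, values, radius):
    """Moran's I with binary weights: spots within `radius` of each other are neighbors."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    weighted_sum = 0.0   # sum over neighbor pairs of dev[i] * dev[j]
    weight_total = 0.0   # total number of (ordered) neighbor pairs
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= radius:
                weighted_sum += dev[i] * dev[j]
                weight_total += 1.0
    variance_sum = sum(d * d for d in dev)
    return (n / weight_total) * (weighted_sum / variance_sum)

# Hypothetical toy data: six spots on a line, one unit apart
spots = [(x, 0) for x in range(6)]
patterned = [1, 1, 1, 0, 0, 0]     # expression clusters in space
alternating = [1, 0, 1, 0, 1, 0]   # expression alternates between neighbors

i_patterned = morans_i(spots, patterned, radius=1)
i_alternating = morans_i(spots, alternating, radius=1)
print(i_patterned, i_alternating)
```

Positive values indicate that neighboring spots have similar expression, and negative values indicate that neighbors tend to differ. The sketch assumes at least one neighbor pair and a non-constant gene; a real analysis would also need a significance test.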
16:45

Part 8: Applications #3: Spatial transcriptomics dataset acquisition, model fitting, and analysis.

  • Dataset acquisition, including methods that convert the data to the required format of BSP.
    • Data type: 10X Visium, MERFISH, Seq-FISH, Slide-seq, and Slide-seq V2.
    • Format: comma-separated values (CSV), 10X sparse format and hdf5 file.
  • Model fitting.
    • Models
      • BSP model with lognormal and beta distributions
    • Default option
      • Basic models with different granularities.
    • Applications
      • Kidney research on 10X Visium
      • Rheumatoid Arthritis research on 3D spatial omics


Tutorial IP2: Just-in-time compiled Python for bioinformatics research

Room: TBD
Date: Friday, July 12, 2024 9:00 – 18:00 EDT

Organizer:
Sven Rahmann

Speakers:
Johanna Schmitz, Center for Bioinformatics Saar and Saarland University, Saarland Informatics Campus, Saarbrücken, Germany; Saarbrücken Graduate School of Computer Science
Jens Zentgraf, Center for Bioinformatics Saar and Saarland University, Saarland Informatics Campus, Saarbrücken, Germany; Saarbrücken Graduate School of Computer Science
Sven Rahmann, Center for Bioinformatics Saar and Saarland University, Saarland Informatics Campus, Saarbrücken, Germany

Max Participants: 20

Description
Python has a reputation for being a clean and easy-to-learn language, but slow when it comes to execution, and difficult concerning multi-threaded execution. Nonetheless, it is one of the most popular languages in science, including bioinformatics, because for many tasks, efficient libraries exist, and Python acts as a glue language. In this tutorial, we explore how to write efficient multi-threaded applications in Python using the numba just-in-time compiler. In this way, we can use Python’s flexibility and the existing packages to handle high-level functionality (e.g., design the user interface, run machine learning models), and then use compiled Python for additional custom compute-heavy tasks; these parts can even run in parallel.

Over a full tutorial day, we introduce a small (but still interesting and relevant) problem as an example: efficient search for bipartite DNA motifs. We develop an efficient tool that outputs every match in a reference genome in a matter of seconds. Starting with an introduction to the problem and a (slow) pure Python implementation, we learn how to write more jit-compiler-friendly code, transition towards a compiled version and observe speed increases until we obtain C-like speed. We parallelize the tool to make it even faster, and add more options for more flexible searching. Finally, we add a simple but effective GUI, which can increase the potential user-base of such a tool by an order of magnitude.
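
The "(slow) pure Python" starting point described above might look like the following minimal sketch. It assumes a bipartite motif means two fixed sequences separated by a gap of variable length; the function name `find_bipartite`, its parameters, and the toy sequence are illustrative assumptions, not the tutorial's actual code.

```python
def find_bipartite(seq, left, right, min_gap, max_gap):
    """Return (start, gap) for every match of left + <gap arbitrary bases> + right."""
    hits = []
    for start in range(len(seq) - len(left) + 1):
        # Check the left half of the motif first
        if seq[start:start + len(left)] != left:
            continue
        # Try every allowed gap length before the right half
        for gap in range(min_gap, max_gap + 1):
            r_start = start + len(left) + gap
            if seq[r_start:r_start + len(right)] == right:
                hits.append((start, gap))
    return hits

# Hypothetical toy sequence: TTGACA, then a 3-base gap, then TATAAT
sequence = "AATTGACACCCTATAATGG"
matches = find_bipartite(sequence, "TTGACA", "TATAAT", min_gap=2, max_gap=4)
print(matches)
```

String slicing and growing a Python list are convenient but are the kind of constructs that usually need rethinking for a compiled version, for example by working on integer-encoded arrays with preallocated output.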

Learning Objectives

  • Understand the difference between interpretation, lazy and eager/early compilation
  • Understand the possibilities and limitations of the numba just-in-time compiler
  • Explore several examples about when numba can accelerate your code (and when it cannot)
  • Understand pre-requisites for compiling a function
  • Learn the differences between compilable and non-compilable Python code
  • Learn about parallelizing Python in spite of the Global Interpreter Lock (GIL) with compiled functions
  • Learn how to scale up a prototype to handle large data
  • Get an understanding of DNA motif search

Intended Audience and Level
The tutorial addresses active bioinformatics researchers, from graduate students to principal investigators, who write software tools as part of their research. In particular, we address researchers who are looking for an easier transition from research prototype software to software that scales to large datasets and is usable by a large non-technical user-base. Therefore, our participants should have at least some experience developing bioinformatics research software.

Prior experience with the Python programming language is required, as well as some experience with managing environments with installed software, ideally using (bio)conda / mamba.

Schedule

9:00 Introduction to the numba just-in-time compiler for Python; small examples, possibilities, limitations, how the compilation works. Last 30 minutes are short hands-on exercises (timing iterated execution of a small function in pure vs. compiled Python).
10:45 Coffee break
11:00 Introduction to DNA motif search and a “motif description” mini-language, with examples from the literature. Automaton-based pattern search and a bit-parallel algorithm. Hands-on: Implementation in pure Python (45 min, 15-20 lines).
13:00 Lunch break
14:00 Transforming a Python implementation to a numba-compiled implementation; separation of high-level and low-level code parts; managing memory allocations; introduction of type annotations (1 hour principles, 1 hour supervised coding).
16:00 Coffee break
16:15 Parallelization: Using threads to parallelize the application (e.g. parallel search across chromosomes); replacing the command-line interface with a simple but effective GUI using streamlit. Hands-on coding: Splitting the task, collecting and visualizing the results.
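
The pattern of splitting a search across chromosomes and collecting the results can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: `count_motif`, `search_all`, and the toy genome are hypothetical names and data, and the worker is plain Python, so the Global Interpreter Lock still serializes the actual work. Real speedups from threads would require workers that release the GIL, such as compiled functions.

```python
from concurrent.futures import ThreadPoolExecutor

def count_motif(sequence, motif):
    """Count (possibly overlapping) occurrences of motif in sequence."""
    count = 0
    pos = sequence.find(motif)
    while pos != -1:
        count += 1
        pos = sequence.find(motif, pos + 1)
    return count

def search_all(chromosomes, motif, threads=2):
    """Run one search task per chromosome and collect the counts by name."""
    names = list(chromosomes)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        counts = pool.map(lambda name: count_motif(chromosomes[name], motif), names)
        # Consume the lazy iterator while the pool is still open
        return dict(zip(names, counts))

# Hypothetical toy "genome"
genome = {"chr1": "ACGTACGTAC", "chr2": "TTTT", "chr3": "ACACAC"}
result = search_all(genome, "AC")
print(result)
```

Keeping the per-chromosome function free of shared state makes it straightforward to swap in a faster worker later without changing the task-splitting code.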


Tutorial IP3: Multi-omic data integration for microbiome research using scikit-bio

Room: TBD
Date: Friday, July 12, 2024 9:00 – 18:00 EDT

Organizer:
Qiyun Zhu

Speakers:
Qiyun Zhu
James Morton
Daniel McDonald
Matthew Aton
Lars Hunger

Max Participants: 40

Description
Modern microbiome research is marked by the extensive use of high-throughput, multi-omic data derived from complex biological systems, such as amplicons, metagenomes, metatranscriptomes, metaproteomes, and metabolomes, as well as data and metadata of the host or environment. The complexity and richness of data demand robust, scalable, and reproducible integration and analysis methods. Our full-day tutorial offers an essential guide to leveraging the expanded capabilities of scikit-bio, alongside the broader Python data science ecosystem. Scikit-bio is a core library behind the widely used QIIME 2 project, and provides various data structures, metrics and algorithms commonly used in bioinformatics. This tutorial is designed to provide researchers, educators, and developers with an overview of current trends, foundational principles, and analytical strategies in microbiome research. Participants will engage in hands-on exercises on handling data and metadata, analyzing communities and features, as well as correlating and predicting biological traits. This tutorial aims to equip attendees with knowledge and practical skills that are adaptable to various applications in microbiome research and beyond.

Exercises will be delivered through Jupyter Notebooks with clear code and documentation. Tutorial materials, including data, slides, and notebooks, will be hosted in a public GitHub repository under a BSD open-source license.

Learning Objectives
Participants will learn how to use scikit-bio and other common Python libraries to analyze and integrate multiple types of omic data that are usually involved in studies of microbiomes and their roles in the host or natural environment. Specifically, participants will:

  • Understand and work with various summarized omic data types.
  • Handle sparse, high-dimensional data tables and associated metadata.
  • Analyze community composition using ecological, phylogenetic and statistical approaches.
  • Identify important microbial or functional features associated with sample properties.
  • Construct supervised learning models to predict traits of hosts or environments.
  • Develop reusable workflows for microbiome research.

By the end of the full-day tutorial, each participant will have completed an analytical workflow based on a demo dataset that can be customized and extended to other datasets.

Intended Audience and Level
This tutorial is for researchers, educators and developers interested in analyzing various types of biological “omic” data, such as metagenomics, metabolomics, and host transcriptomics. Attendees should have basic skills in Python (preferred), or any other programming language (such as R or C/C++). Experience with the Linux command line is not required. Optionally, attendees may benefit from basic knowledge in bioinformatics, biostatistics, and any specific biological research fields, such as microbiology, ecology, molecular biology, and epidemiology.

Each participant should bring their own laptop or tablet (with keyboard). The exercises will be conducted using Google Colab or a local Jupyter environment, depending on the participant’s preference.

Schedule

9:00

Introduction and software setup
Lecture: Current trends in microbiome research

  1. Give an overview of the latest developments in microbiome research, emphasizing the shift towards high-throughput, multi-omics and meta-analysis.
  2. Introduce scikit-bio within the Python ecosystem, explaining that it is a core library of the widely used QIIME 2 project, highlighting its role in facilitating reproducible, scalable, and expandable bioinformatics research.

Exercise: Setting up the software environment.

  1. Walk through the setup of scikit-bio in Google Colab or a local Jupyter environment, depending on participant preference.
  2. Test the basic functionalities of scikit-bio, ensuring that the software is installed correctly and participants are comfortable with its interface and capabilities.
  3. Briefly demonstrate the basics of Python, in order to (re)familiarize participants with various technical backgrounds.
10:00

Working with various omic data types
Lecture: Omic data types in microbiome research

  1. Review typical and emerging omic data types in microbiome research, such as 16S, ITS, metagenomics, metatranscriptomics, metaproteomics, metabolomics, and corresponding host data.
  2. Discuss derived data types, such as taxonomic, functional, genetic and metabolic profiles.
  3. Navigate key public data sources, like Qiita, GNPS, NMDC, ENA and SRA.
  4. Discuss the challenges and techniques for integrating different omics. Examples include combining 16S and shotgun data using Greengenes2 and WoL2, and integrating sequencing and mass spectrometry data via KEGG.

Exercise: A real-world multi-omic dataset

  1. Download a demo dataset, which is subsampled from a representative study, consisting of several omic data types, metadata, and biological questions.
  2. Explore the components of the dataset using Python.
10:45 Coffee break
11:00

Working with sparse, high-dimensional data tables
Lecture: Nuances of omic data tables

  1. Introduce the structure of feature tables commonly used in omics.
  2. Discuss the nature of omic data, highlighting its high-dimensionality and sparsity, and their implications on data processing and analysis.
  3. Discuss and compare sparse vs. dense matrices, and the strategies for handling them.
  4. Introduce the BIOM format, a Genomic Standards Consortium standard format for representing sparse feature data.

Exercise: Working with omic data tables

  1. Navigate the basic techniques for loading, viewing, and editing data tables using scikit-bio and BIOM-format.
  2. Manipulate data tables according to the statistical properties of data, and the biological properties of samples informed by metadata.
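
The difference between dense and sparse storage of a feature table can be sketched in a few lines. This is a minimal illustration using a coordinate dictionary; the helper names `to_sparse` and `to_dense` and the toy table are assumptions, and dedicated formats and libraries implement the same idea far more efficiently.

```python
def to_sparse(dense):
    """Keep only nonzero entries as {(row, col): value}."""
    return {
        (i, j): value
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    }

def to_dense(sparse, n_rows, n_cols):
    """Rebuild the full table, filling missing entries with zeros."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for (i, j), value in sparse.items():
        dense[i][j] = value
    return dense

# Hypothetical feature table: 3 features (rows) x 4 samples (columns)
table = [
    [0, 12, 0, 0],
    [3, 0, 0, 0],
    [0, 0, 0, 7],
]
sparse = to_sparse(table)
density = len(sparse) / (len(table) * len(table[0]))
print(sparse, density)
```

Here only 3 of 12 cells are stored. Real omic tables are often far sparser, which is why the storage format matters for memory use and speed.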
12:00

Analyzing microbial community structures
Lecture: Microbial community ecology

  1. Introduce the fundamentals of microbial communities, explaining the notions of alpha and beta diversity.
  2. Introduce phylogeny, gene ontology, and the general notion of knowledge graph, addressing their roles in community analysis.
  3. Discuss the compositionality of omic data, explaining its implications in data analysis.

Exercise: Community diversity analyses

  1. Calculate alpha and beta diversity metrics, with or without phylogeny.
  2. Construct diversity distance matrices.
  3. Perform ordination of communities, visualize, and interpret.
  4. Compare matrices and ordinations across different omics.
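
For orientation, the two notions of diversity can be sketched without any library. The following minimal example computes the Shannon index (an alpha diversity metric, here with the natural logarithm) and the Bray-Curtis dissimilarity (a beta diversity metric) on hypothetical count vectors. The function names and data are illustrative assumptions; scikit-bio provides its own implementations, whose defaults (such as the logarithm base) may differ.

```python
import math

def shannon(counts):
    """Shannon index of one sample, using the natural logarithm."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two samples (0 = identical, 1 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / (sum(u) + sum(v))

# Hypothetical counts of four features in three samples
samples = {
    "A": [10, 10, 0, 0],
    "B": [5, 5, 5, 5],
    "C": [0, 0, 10, 10],
}

alpha = {name: shannon(counts) for name, counts in samples.items()}
# Pairwise distance matrix stored as {(sample1, sample2): distance}
beta = {(p, q): bray_curtis(samples[p], samples[q]) for p in samples for q in samples}
print(alpha)
print(beta[("A", "B")], beta[("A", "C")])
```

Sample B is the most even and therefore has the highest Shannon index, while samples A and C share no features and are maximally dissimilar.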
13:00 Lunch break
14:00

Inferring and associating critical features
Lecture: Microbial signatures and their biological roles

  1. Discuss the role of microbial signatures in biological processes, explaining that signatures may emerge at different levels: taxonomy, function, and molecules.
  2. Introduce multivariate statistical tests (such as PERMANOVA).
  3. Introduce differential abundance analysis (such as ANCOM).

Exercise: Statistical modeling and tests

  1. Perform multivariate statistical tests to correlate microbial community composition with sample metadata.
  2. Perform differential abundance analyses to identify important microbial features that may play roles in specific biological processes.
  3. Perform canonical analyses to associate features across different omic data types, revealing interconnections and dependencies.
  4. Utilize metadata for flexible and sophisticated statistical modeling.
15:00

Predicting host and environmental traits
Lecture: Microbiomes are predictive of biology

  1. Give an overview of the interactions between microbiomes, hosts and environments, and the importance of understanding these relationships.
  2. Introduce the principles of supervised machine learning, and interpretable machine learning.

Exercise: Constructing predictive models

  1. Bridge scikit-bio data structures with machine learning models in scikit-learn.
  2. Construct predictive models for categorical and numeric traits of samples.
  3. Navigate and interpret the results in a biological context.
16:00 Coffee break
16:15

Developing an analytical protocol for publication
Lecture: Good practices in scientific data analysis

  1. Introduce the FAIR principles in scientific research.
  2. Discuss good practices for developing and publishing analytical protocols, addressing key considerations such as documentation, accessibility and reproducibility.

Exercise: Assembling an analytical protocol

  1. Wrap up the analyses in one Jupyter Notebook and document them.
  2. Test and validate the protocol using both the provided dataset and different data, emphasizing the adaptability and robustness of the protocol.
  3. Discuss how the workflows can be adapted or expanded for the participants’ specific research needs.
17:15

Debugging, wrapping-up and open questions
Exercise: Troubleshooting and expansion

  1. Address participants’ specific questions and challenges in completing the practice on the demo dataset.
  2. Address participants’ questions in analyzing specific data types with specific research goals.

Lecture: Looking beyond

  1. Open discussion on the current and future trends of multi-omic biological research and scientific computing.
  2. Welcome participants to contribute to open-source projects.


Tutorial IP4: Quantum-enabled multi-omics analysis

Room: TBD
Date: Friday, July 12, 2024 9:00 – 18:00 EDT

Organizer:
Aritra Bose
Laxmi Parida

Speakers:
Aritra Bose, PhD, Research Scientist, IBM Research, Yorktown, NY
Hakan Doga, PhD, Postdoctoral Researcher, IBM Research, Cleveland, OH
Filippo Utro, PhD, Senior Research Scientist, IBM Research, Yorktown, NY
Laxmi Parida, PhD, ISCB Fellow, IBM Fellow

Max Participants: 50

Description
Single-cell and -omic analyses have provided profound insights into the heterogeneity of complex tissues by measuring multiple cells together, drawing on a wide array of multi-omics data such as genomics, proteomics, and transcriptomics. Single-cell analysis is often plagued by challenges such as missingness, the difficulty of developing robust machine learning algorithms for discovering complex features, finding patterns in the spatial structure of single-cell transcriptomics or proteomics, and, most importantly, integrating multi-omics data to create meaningful embeddings for the cells. Machine learning (ML) techniques have been extensively used in analyzing, predicting, and understanding multi-omics data. For the purposes of this tutorial, we will use the term classical ML to refer to these techniques. Quantum machine learning (QML) has the potential to overcome many of the above limitations of ML in single-cell analysis. This tutorial will be structured into five sessions as follows:

  • In the first session we will introduce quantum computing fundamentals such as notations, operations, quantum states, entanglement, quantum gates, and circuits.
  • In the second session, we will set up Qiskit, an open-source quantum computing toolkit based on Python and run a demo algorithm.
  • In the third session, we will process and analyze single-cell multi-omics data from the Single Cell atlas or TCGA, etc. using classical ML algorithms to create a baseline.
  • In the fourth session, we will set up the data in Qiskit and run a QML algorithm to classify disease subtypes.
  • In the fifth and concluding session, we will summarize the tutorial and do an interactive Q&A session with the attendees.
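
The fundamentals named in the first session (states, gates, circuits, entanglement) can be previewed with plain linear algebra. The following minimal sketch simulates a two-qubit circuit by hand: a Hadamard gate on the first qubit followed by a CNOT, which produces an entangled Bell state. It deliberately does not use Qiskit; the helper `apply` and the textbook basis ordering |00>, |01>, |10>, |11> (first qubit leftmost) are assumptions of this sketch, and Qiskit orders qubits differently.

```python
import math

def apply(matrix, state):
    """Multiply a gate matrix with a state vector."""
    return [sum(m * s for m, s in zip(row, state)) for row in matrix]

h = 1 / math.sqrt(2)

# Hadamard on the first qubit, identity on the second (H tensor I)
H_ON_FIRST = [
    [h, 0, h, 0],
    [0, h, 0, h],
    [h, 0, -h, 0],
    [0, h, 0, -h],
]

# CNOT: the first qubit controls a flip of the second qubit
CNOT = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 0],
]

state = [1, 0, 0, 0]                       # start in |00>
bell = apply(CNOT, apply(H_ON_FIRST, state))
probabilities = [abs(a) ** 2 for a in bell]
print(probabilities)
```

Measuring this state yields 00 or 11 with equal probability and never 01 or 10: the two qubits are perfectly correlated, which is the signature of entanglement.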

Learning Objectives
Participants in this tutorial will learn a new paradigm of analyzing multi-omics data with hands-on experience with a quantum computer. Specifically, the major takeaways of this tutorial are:

  • Understand the basics of quantum computing including hands-on experience on quantum gates and circuits using Qiskit.
  • Identify the class of problem: analyzing machine learning methods on multi-omics data for biomarker discovery, disease subtyping, etc.
  • How to process single cell data for quantum-enabled algorithms.
  • How to apply QML algorithms on single cell data.
  • How to design experiments for healthcare and life sciences data using quantum computers

Intended Audience and Level
This tutorial is aimed at computational biologists, bioinformaticians, clinicians, practitioners, data analysts, including early-career to senior researchers in the fields of healthcare and life sciences enthusiastic to learn about new frontiers of computational biology. There are very few prerequisites for the tutorial, listed as follows:

  • Create an IBM Quantum account on the IBM Quantum Learning website: click on “Create an IBMid” and follow the instructions.
  • Watch the Qiskit Global Summer School videos – QML 2021 or (optional)
  • Entry-level knowledge of single-cell data and multi-omics analyses.

Schedule

9:00 Session I: Quantum Information and Fundamentals
10:45 Coffee Break
11:00 Session II: Hello Qiskit!: Writing your first program in Qiskit
12:30 Session III: Processing multi-omics data with classical ML algorithms
13:00 Lunch
14:00 Session IV, Part I: Design and implement QML algorithm for single-cell data in Qiskit.
16:00 Coffee Break
16:15 Session IV, Part II: Analyze QML algorithm and compare with classical ML
17:00 Session V: Interactive Q&A session with the participants.


Tutorial IP5: Modelling Multi-Modal Biomedical Data Using Networks

Room: TBD
Date: Friday, July 12, 2024 9:00 – 18:00 EDT

Organizer:
Ian Simpson

Speakers:
Ian Simpson, Professor of Biomedical Informatics, School of Informatics, University of Edinburgh
Barry Ryan, PhD Student, UKRI Centre for Doctoral Training in Biomedical Artificial Intelligence, School of Informatics, University of Edinburgh
Sebestyén Kamp, PhD Student, UKRI Centre for Doctoral Training in Biomedical Artificial Intelligence, School of Informatics, University of Edinburgh

Max Participants: 30

Description
Network structures allow us to model complex data in an extremely flexible way, enabling a wide range of downstream analytic approaches to help us gain insight into the biological processes and systems we model. The ability of networks to capture myriad features of the primary data and explore high order relationships between them makes them highly suitable to address questions that are not easily answered by classical statistical approaches that typically only look at first-order interactions. Networks have been widely used in the biomedical sciences to study gene and protein expression profiles, protein-protein interactions, metabolic processes, dynamic pathway models, and diseases amongst others. The emergence of multi-modal data in the biomedical setting has gathered pace significantly over recent years whereby several different types of data are measured from the same sample source. Integration of these data is proving incredibly valuable at increasing the breadth and depth of our understanding of the underlying systems by reducing noise, increasing information content, facilitating our handling of missing and/or incomplete data, and crucially, increasing our predictive power beyond that of uni-modal data analysis.

In this comprehensive tutorial we will introduce participants to network analysis from first principles using real-world multi-modal data derived from the Generation Scotland study, a world-leading longitudinal research programme and an excellent use case for biomedical network analysis. Participants will perform hands-on end-to-end network construction and computational analysis using a ground-up approach which will give them the skills, experience, and confidence to develop their own network analytic pipelines in the future. We will work in the context of human disease using both molecular and clinical data and introduce analysis approaches for network-based tasks including clustering, functional annotation analysis, and classification using graph neural networks.

Learning Objectives
Participants will learn how to analyse biological datasets using networks. They will gain hands-on experience with a real-world dataset as an exemplar that can be directly transferred to their own work in the future. Following the course they will be able to:

  • Understand core network concepts and fundamentals
  • Construct networks using Python and R
  • Develop network models for uni-modal and multi-modal data (e.g. gene expression + DNA methylation)
  • Perform functional annotation analysis and community clustering
  • Implement simple Graph Neural Network-based approaches for classification tasks
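
As a rough idea of what "constructing a network" from expression data can involve, the following minimal sketch links genes whose profiles are strongly correlated and then groups them into connected components. Everything here is an illustrative assumption: the function names, the toy profiles, and the threshold are hypothetical, and connected components are only the simplest possible grouping, not a community detection algorithm.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_network(profiles, threshold):
    """Connect two nodes when the absolute correlation reaches the threshold."""
    names = list(profiles)
    edges = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(profiles[a], profiles[b])) >= threshold:
                edges[a].add(b)
                edges[b].add(a)
    return edges

def connected_components(edges):
    """Group nodes that are reachable from one another."""
    seen, components = set(), []
    for start in edges:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(edges[node] - component)
        seen |= component
        components.append(component)
    return components

# Hypothetical expression profiles across four samples
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [5.0, 1.0, 4.0, 2.0],
    "geneD": [10.0, 2.5, 8.0, 4.0],
}
network = correlation_network(profiles, threshold=0.9)
modules = connected_components(network)
print(modules)
```

The choice of similarity measure and threshold strongly shapes the resulting network, so these would need to be justified for any real dataset.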

Intended Audience and Level
Introductory Level.

This tutorial is aimed at an audience who have little prior experience working with and analysing data using networks. They will need at least a basic level of knowledge in Python and R programming. Specifically, participants are expected to be familiar with the Python packages pandas, NumPy, and Matplotlib and the R packages ggplot2 and dplyr.

The workshop will be conducted in both R and Python. We will communicate with participants in advance so that they have installed Visual Studio Code (Python) and RStudio (R) prior to the tutorial, but we can troubleshoot minor installation issues on the day and provide cloud compute instances of these if needed. All materials and data will be made available open-source through a dedicated GitHub repository. All analyses will be streamlined so that there are no challenging compute requirements for participants; a standard modern laptop will be suitable to take part.

Schedule

9:00 Welcome & Introduction
9:10 “An Introduction to Networks”
9:40 Practical Session 1
10:45 Coffee Break
11:00 “The Do’s and Don’ts of Biomedical Network Construction”
11:30 Practical Session 2
13:00 Lunch
14:00 “Common Approaches to the Analysis of Biomedical Networks”
14:30 Practical Session 3
16:00 Coffee Break
16:15 “An Introduction to Network Inference Using Graph Neural Networks”
16:45 Practical Session 4
17:50 Closing Remarks


Tutorial IP6: Creating and running cloud-native pipelines with WDL, Dockstore, and Terra

Room: TBD
Date: Friday, July 12, 2024 9:00 – 13:00 EDT

Organizer:
David Steinberg

Speakers:
Denis Yuen, Team Lead, Dockstore, Ontario Institute for Cancer Research
David Charles Steinberg, University of California, Santa Cruz
Leyla Tarhan, PhD, Senior Science Writer, Data Sciences Platform, Broad Institute of MIT and Harvard
Aseel Awdeh, PhD, Computational Biologist, Data Sciences Platform, Broad Institute of MIT and Harvard

Max Participants: 40

Description
With the advent of efficient sequencing technology, the scientific community produces petabytes of data daily. These data are prepared to answer diverse biological questions, each requiring unique sequencing approaches. To combine these disparate datasets and transform them into meaningful insights, researchers are turning to cloud-based approaches that adhere to Findable, Accessible, Interoperable, and Reusable (FAIR) practices. These include cloud-computing environments that allow for efficient resource-sharing and scalability. While the potential of these new resources is thrilling, the migration to cloud computing might feel daunting, as it requires new pipelines that harness the expanse of cloud tools. In this half-day tutorial, we introduce participants to key components that help them create cloud-native pipelines, including portable workflows written in the Workflow Description Language (WDL; pronounced “widdle”), portable packages of software and dependencies known as Docker containers, and Dockstore, a public platform for sharing Docker-based workflows. Participants will get hands-on experience with these resources by developing their own simple WDL workflow and Docker image for genomic analysis. They will push their workflows to Dockstore and export them to the cloud-based Terra platform so that they can run their workflow on real data.

Learning Objectives
In this tutorial, participants will learn how to:

  • Write a basic WDL workflow with inputs and outputs
  • Make a Docker image from a Dockerfile
  • Navigate Dockstore, a platform for Docker-based workflows
  • Find, evaluate, and share workflows in Dockstore
  • Automatically sync WDL workflows on GitHub with Dockstore
  • Export a WDL workflow from Dockstore to Terra
  • Set up and run a workflow in Terra
  • Find resources for writing advanced WDL workflows

Intended Audience and Level
Researchers and tool developers interested in bringing their analyses to the cloud. A basic understanding of the command line and a GitHub account are required, and participants are encouraged to have basic familiarity with genomics terminology and standard high-throughput sequencing data formats. The introduction to basic WDL syntax is designed for novice WDL writers and starts with a basic hello-world script.

Schedule

9:00 Welcome/opening remarks/review agenda and learning goals
9:05 Introduction to Docker
● How Docker improves software and scientific reproducibility
● Docker and Dockerfile basics
● Finding and using Dockers
9:15 Building and Using Dockers
● Pull and use an existing Docker
● Create a Dockerfile to build a Docker
9:45 Introduction to WDL
● Anatomy of a WDL
● Where to find and run existing WDLs
10:00 Basic WDL scripting
● Writing your first WDL Hello-world script for Terra
● Running WDLs in Terra
10:45 Coffee Break
11:00 Introduction to Dockstore
● Finding and assessing the quality of workflows on Dockstore
● Launching workflows from Dockstore
11:30 Integrate your GitHub with Dockstore
● Use GitHub apps to streamline the development cycle
12:00 Real genomics example: Modify, export and run a WDL
12:30 Wrap-up and Q&A

- top -

Tutorial IP7: Federated Ensemble Learning for Biomedical Data

Room: TBD
Date: Friday, July 12, 2024 14:00 – 18:00 EDT

Organizer:
Hryhorii Chereda

Speakers:
Anne-Christin Hauschild, Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
Hryhorii Chereda, Ph.D., Medical Bioinformatics, University Medical Center Göttingen, Göttingen, Germany
Youngjun Park (MSc), Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
Maryam Moradpour (MSc), Medical Informatics, University Medical Center Göttingen, Göttingen, Germany

Max Participants: 15

Description
The digital revolution in healthcare, fostered by novel high-throughput sequencing technologies and electronic health records (EHRs), is moving the field of medical bioinformatics towards an era of big data. While machine learning (ML) methods have proven to be advantageous in such settings for a multitude of medical applications, they generally depend on centralized datasets. Unfortunately, centralization is not suited to sensitive medical data, which is often distributed across different institutions, exhibits intrinsic distribution shifts, and cannot be easily shared due to privacy and security concerns.

Federated learning, initially proposed by Google in 2017, allows the training of machine learning models on geographically or legally divided data sets without sharing sensitive data. When combined with additional privacy-enhancing techniques, such as differential privacy or homomorphic encryption, it is a privacy-aware alternative to central data collections while still enabling the training of machine learning models on the whole data set. However, in such federated settings, both infrastructure and algorithms become much more complex than in centralized machine learning approaches. Some of the most intuitive implementations rely on ensemble learning approaches, where only the model parameters are transferred. For example, we can exchange split values of tree nodes as in federated random forest or combine local subgraph-based graph neural network (GNN) models into a global federated Ensemble-GNN.
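
As a rough illustration of the ensemble idea described above, the following Python sketch trains one decision stump (a one-split tree) per simulated client and shares only the split parameters, never the data. It is a toy stand-in for a federated random forest, not the tutorial's implementation: the synthetic data, the three clients, and all function names are assumptions made for this example.

```python
import random

def make_client_data(rng, n):
    """Toy data: two features in [0, 1); the label is 1 when feature 0 exceeds 0.5."""
    rows = [(rng.random(), rng.random()) for _ in range(n)]
    return [(x, int(x[0] > 0.5)) for x in rows]

def fit_stump(data):
    """Exhaustively pick the (feature, threshold, polarity) with the fewest errors."""
    best = None
    for f in range(2):
        for t in sorted({x[f] for x, _ in data}):
            for polarity in (1, 0):  # class predicted when x[f] > t
                errors = sum((polarity if x[f] > t else 1 - polarity) != y
                             for x, y in data)
                if best is None or errors < best[0]:
                    best = (errors, f, t, polarity)
    _, f, t, polarity = best
    return {"feature": f, "threshold": t, "polarity": polarity}

def predict_stump(stump, x):
    above = x[stump["feature"]] > stump["threshold"]
    return stump["polarity"] if above else 1 - stump["polarity"]

def predict_global(stumps, x):
    """Global model: majority vote over the stumps received from all clients."""
    votes = sum(predict_stump(s, x) for s in stumps)
    return int(2 * votes > len(stumps))

rng = random.Random(0)
clients = [make_client_data(rng, 50) for _ in range(3)]  # data stays at each site
shared_stumps = [fit_stump(d) for d in clients]          # only split parameters are shared

test_data = make_client_data(rng, 200)
accuracy = sum(predict_global(shared_stumps, x) == y for x, y in test_data) / len(test_data)
print(f"global ensemble accuracy: {accuracy:.2f}")
```

A real federated random forest would exchange many deeper trees per site and typically add secure communication, but the division of labour is the same: local fitting, parameter exchange, and a combined vote.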

This tutorial covers the general theory of federated learning and the practice of federated ensemble learning. We will explain the concepts and benefits of federated ensemble learning, and demonstrate how to use Python to implement two state-of-the-art methods: federated random forest and Ensemble-GNN. The participants will learn how to apply these methods to breast cancer data, including clinical and gene expression features, and how to deploy the models in a federated setup. By the end of this tutorial, the participants will have both theoretical and practical skills in federated ensemble learning and privacy-preserving techniques for biomedical data analysis.

Availability of the tutorial’s material: https://gitlab.gwdg.de/cdss/tutorial-federated-ensemblelearning-for-biomedical-data

Learning Objectives

  1. Participants will learn the basics of federated machine learning theory and will be introduced to federated ensemble learning:
    1. Participants will learn about federated random forest.
    2. Participants will be introduced to GNNs, which utilize a molecular subnetwork structuring input genomic data, and they will learn how GNNs can be combined into an ensemble (Ensemble-GNN).
  2. Participants will learn how to practically implement and apply a federated random forest.
  3. Participants will learn how to use GPUs to train Ensemble-GNN and how to apply it in both centralized and federated scenarios.
    1. Optionally, participants can learn how to implement their own GNN as a new base learner for Ensemble-GNN.

Intended Audience and Level
The intended audience is bioinformaticians, data scientists, and medical informaticians who have at least beginner-level experience in machine learning. Participants should have a laptop with Linux, macOS, or Windows and an internet connection. Access to a computational environment will be provided by the organisers.

Level requirements are the following:

  1. Basic knowledge of machine learning.
  2. Basic knowledge of Python.

Schedule

14:00

Lecture: Federated ensemble learning in biomedical health data
The basic concepts of federated ensemble learning are introduced. Advantages and challenges of central machine learning are discussed based on practical examples. Finally, privacy-assuring techniques such as differential privacy and homomorphic encryption are explained.

Anne-Christin Hauschild

14:30

Hands-on tutorial: how to develop and implement a federated random forest

  1. Intro to federated ensemble learning with decision trees and random forests
  2. Implementation of federated random forest model with scikit-learn

Hryhorii Chereda, Maryam Moradpour, Youngjun Park

15:45 Coffee Break
16:00

Continuation of hands-on tutorial: how to develop and implement a federated random forest

  1. Evaluation and comparison of the global model with client-specific local models

Maryam Moradpour, Youngjun Park

16:15

Lecture: Federated ensemble learning with graph neural networks

GNNs are specifically designed to perform tasks on graphs. For instance, a patient can be represented by a biological network where the nodes contain patient-specific omics features. In this case, GNNs perform graph classification to predict a patient's clinical endpoint. The Ensemble-GNN approach builds predictive models utilizing protein-protein interaction (PPI) networks containing various node features such as gene expression and/or DNA methylation. To do this, Ensemble-GNN derives relevant PPI network communities and trains an ensemble of GNN models based on the inferred communities. Sharing local GNN models allows for the deployment of a federated ensemble of GNNs.

Hryhorii Chereda

16:30

Hands-on tutorial: how to train and apply federated Ensemble-GNN

  1. Intro to a ChebNet GNN model as a base learner of Ensemble-GNN
  2. Training Ensemble-GNN (using PyTorch) in centralized and federated setups
  3. Evaluation and comparison of the global model with client-specific local models
  4. Showcase of a federated scenario where data distributions substantially differ across the clients

Hryhorii Chereda, Maryam Moradpour, Youngjun Park
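
The graph-classification setting from the 16:15 lecture, where each patient is a network whose nodes carry omics features, can be sketched in a few lines of plain Python. The example shows one round of mean-neighbour message passing followed by a graph-level readout. It is not ChebNet or Ensemble-GNN: the four-gene network, the feature values, and the function names are made up for illustration, and a real model would add learned weights and a classifier on top of the pooled vector.

```python
def mean_aggregate(adjacency, features):
    """One round of message passing: each node averages itself and its neighbours."""
    updated = {}
    for node, neighbours in adjacency.items():
        group = [node] + neighbours
        dim = len(features[node])
        updated[node] = [sum(features[n][d] for n in group) / len(group)
                         for d in range(dim)]
    return updated

def graph_readout(features):
    """Pool node embeddings into one patient-level vector (mean over nodes)."""
    nodes = list(features)
    dim = len(features[nodes[0]])
    return [sum(features[n][d] for n in nodes) / len(nodes) for d in range(dim)]

# Hypothetical 4-gene network shared by all patients (undirected edges).
adjacency = {
    "geneA": ["geneB", "geneC"],
    "geneB": ["geneA"],
    "geneC": ["geneA", "geneD"],
    "geneD": ["geneC"],
}
# Made-up, patient-specific node features: [expression, methylation].
patient = {
    "geneA": [1.0, 0.0],
    "geneB": [3.0, 1.0],
    "geneC": [2.0, 1.0],
    "geneD": [0.0, 0.0],
}

hidden = mean_aggregate(adjacency, patient)
embedding = graph_readout(hidden)  # input to a downstream classifier
print(hidden["geneA"], embedding)
```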

- top -

Tutorial VT1: A Practical Introduction to Large Language Models in Biomedical Data Science Research

Part 1: Monday, July 8, 2024 14:00 – 18:00 EDT
Part 2: Tuesday, July 9, 2024 14:00 – 18:00 EDT

Organizer:
Robert Xiangru Tang

Speakers:
Robert Xiangru Tang, Yale University, USA.
Qiao Jin, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), USA.
Hufeng Zhou, Biostatistics Department, Harvard T. H. Chan School of Public Health, Harvard University, USA.
Shubo Tian, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), USA.
Zhiyong Lu, National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), USA.
Mark Gerstein, Yale University, USA.

Max Participants: 50
Website: https://llm4biomed.github.io/

Description
Large Language Models (LLMs) like ChatGPT have exhibited remarkable capabilities in understanding and generating language across diverse disciplines. In the realm of biomedical data science and computational biology, LLMs can significantly aid the processes of information accessibility, data analysis, and knowledge discovery. In this tutorial, we offer an introductory-level hands-on guide to understanding and utilizing these LLMs in the field of biomedical data science. Our tutorial begins with leveling the learning ground by providing introductions to LLMs and Biomedical Data Science. Subsequently, we delve into the core applications of LLMs in biomedical data science/computational biology via retrieval-augmented generation, database functionalities, and code generation. To facilitate thought-provoking discussions, pertinent case studies will be discussed, emphasizing how to harness the power of LLMs to bridge the gap between technical feasibility and practical utility in biomedical data science. Furthermore, hands-on exercises are included to enable participants to apply their learning in real time. Participants will also get acquainted with OpenAI's ChatGPT and open-source LLMs, as well as their design, use cases, limitations, and prospects.

Our topics include:

  • Introduction
    • Large Language Models (LLMs) and their evolution from RNNs and LSTMs to Transformers and the GPT family.
    • In-depth interaction with OpenAI’s ChatGPT, learning about its overview, capabilities, and implementation, focusing on Chain-of-Thought Prompting.
    • Open-source LLMs
  • Novel applications of LLMs in computational biology and biomedical data sciences
    • Database query generation with LLMs.
    • Retrieval-augmented generation.
    • Language agents and code generation.
  • Advanced topics of LLMs for bioinformatics
    • Biomedical text retrieval and literature mining
    • Gene set analysis
    • Developing Representations of Disease-Relevant Molecules
  • Guided hands-on exercises using provided datasets and problem statements for practical understanding and implementation.
  • Limitations and challenges (e.g. hallucination, fairness, and safety) of using LLMs for science.
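
To make the retrieval-augmented generation topic listed above more concrete, here is a minimal Python sketch of the retrieval and prompt-building steps. It ranks passages by bag-of-words cosine similarity and places the best match into a prompt. No LLM or API is called; the three passages, the prompt wording, and the function names are assumptions for this example, and practical systems usually use dense embeddings over a literature index instead of word counts.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(passages,
                    key=lambda p: cosine(q, Counter(tokenize(p))),
                    reverse=True)
    return ranked[:k]

def build_prompt(question, passages, k=1):
    """Assemble a prompt that grounds the answer in the retrieved context."""
    context = "\n".join(f"- {p}" for p in retrieve(question, passages, k))
    return ("Answer using only the context below.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")

passages = [
    "BRCA1 is a tumor suppressor gene involved in DNA repair.",
    "Insulin is a hormone that regulates blood glucose levels.",
    "Hemoglobin carries oxygen in red blood cells.",
]
question = "Which gene is involved in DNA repair?"
prompt = build_prompt(question, passages)
print(prompt)  # this string would then be sent to the LLM of choice
```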

Learning Objectives

  • Familiarizing with the key aspects of large-scale biomedical data.
  • Leveraging LLMs to handle and interpret vast amounts of biomedical data.
  • Learning cutting-edge research topics from two invited talks.
  • Utilizing OpenAI APIs for GPTs and open-source LLMs in Python.
  • Integrating LLMs to enhance coding efficiency in bioinformatics.
  • Deploying LLMs for biomedical question-answering and academic literature exploration.

Intended Audience and Level
This tutorial is designed for graduate students, researchers, data analysts, and practitioners in the domains of bioinformatics, computational biology, and biomedical informatics who are seeking to harness the potential of Large Language Models (LLMs) in their work. The didactic content would be chiefly beneficial for individuals who are keen on enhancing the breadth and depth of their analytical skills.

While the focus of the workshop lies in catering to beginners or users with little experience in LLMs, intermediates will find the advanced topics and in-depth case studies enriching as well. Participants should ideally possess a basic understanding of Python programming and machine learning concepts. Preliminary experience with Linux-based operating systems or interacting with APIs would provide an added advantage but is not a prerequisite.

Our discussion on using OpenAI's ChatGPT and other open-source LLMs, such as LLaMA, along with hands-on exercises and case studies, will offer an immersive learning experience that spans theory and practice. Researchers looking to streamline their data analysis processes and improve the efficiency and accuracy of their results will find this tutorial particularly useful.

Relevant resources and tutorial materials for hands-on activities will be shared online before the commencement of the tutorial, ensuring an unhampered learning experience for all attendees.

Schedule

Part 1
14:00 Overview and Welcome
14:10 Introduction to LLMs with a focus on Biomedical Data Science
14:40 How to use GPT-3.5 and GPT-4 with Python
15:10 How to use Open-source LLMs with Python
15:30 Break
15:45 Database Query Generation with LLMs
16:10 Retrieval-augmented Generation with Large Language Models
16:35 Code generation in Bioinformatics
Part 2
14:00 Large Language Models for Biomedicine: from PubMed Search to Gene Set Analysis
14:45 AI in Biomedicine: Developing Representations of Disease-Relevant Molecules
15:30 Break
15:45 Integrating Biomedical Data Database Development with LLMs
16:10 Querying PubMed with RAG to answer biomedical questions with GPT-4
16:35 Code generation in Bioinformatics with Open-source LLMs
16:55 Closing Remarks

- top -

Tutorial VT2: BioViz: Interactive data visualisation and ML for omics data

Part 1: Monday, July 8, 2024 14:00 – 18:00 EDT
Part 2: Tuesday, July 9, 2024 14:00 – 18:00 EDT

Organizer:
Ragothaman M Yennamalli

Speakers:
Ragothaman M. Yennamalli - Assistant Professor, SASTRA Deemed to be University, Thanjavur, India
Dr Farzana Rahman – Assistant Professor, Kingston University London, UK.
Shashank Ravichandran - Senior Software Engineer, Incedo Inc, India
Megha Hegde, PhD Researcher, Kingston University London, UK.
Jean-Christophe Nebel, Professor of Computer Science, Kingston University London, UK.

Max Participants: 30

Description
Data Science and Machine Learning are intricately connected, particularly in computational biology. In a time when biological data is being produced on an unprecedented scale (encompassing genomic sequences, protein interactions, and metabolic pathways), meeting the demand for its analysis has never been more crucial.

Data visualisation plays a crucial role in biological data sciences since it allows the transformation of complex, often incomprehensible raw data into visual formats that are easier to understand and interpret. This allows biologists to recognise patterns, anomalies, and correlations that would otherwise be lost in the sheer volume of data. In addition, machine learning (ML) has brought about a revolution in the analysis of biological data. Exploiting extensive datasets, ML provides tools to model complex systems and generate predictions. Indeed, ML algorithms excel at uncovering subtle patterns in data, contributing to tasks like predicting protein structures, comprehending genetic variations and their implications for diseases, and even facilitating drug discovery by predicting molecular interactions.

The integration of data visualisation and machine learning is particularly powerful. In particular, visualisation may aid in interpreting machine learning models, allowing biologists to understand and trust their predictions. It could also help fine-tune these models by identifying outliers or anomalies in the data.

Due to its remarkable capability, there has been a surge in the development and application of tools that combine data visualisation and machine learning in biology. Platforms that integrate these technologies enable biologists to conduct comprehensive analyses without needing deep expertise in computer science. Assuredly, this democratisation of data science and ML has empowered more and more biologists to engage in sophisticated, data-driven research.

Learning Objectives
This tutorial is divided into two parts. In the first part of the tutorial, the participants will learn how to install and use tools for data visualisation using Python.  The second part will focus on installing and using ML tools for feature selection, model training, and model optimisation using Python.  By the end of this tutorial, the participants will be able to:

  1. Explain the role and significance of data visualisation in the context of scientific research.
  2. Apply fundamental principles of data visualisation to create clear and informative visual representations of data.
  3. Create a variety of data visualisations using Python libraries, i.e., Matplotlib, Seaborn, and Plotly.
  4. Understand the basics of colour theory and its implications for creating accessible and aesthetically pleasing visualisations.
  5. Design data visualisations that are accessible to a diverse audience, including those with colour vision deficiencies.
  6. Gain practical skills in preprocessing data and selecting appropriate features for machine learning models.
  7. Build, train, and evaluate machine learning models using Python libraries like Scikit-learn and TensorFlow/Keras.
  8. Implement machine learning algorithms on real-world biological datasets, demonstrating an understanding of the application of these techniques in biology.
  9. Create integrated visualisations of machine learning results using tools like Yellowbrick, Bokeh, and TensorBoard.
  10. Critically evaluate and discuss the applications, challenges, and implications of data visualisation and machine learning in scientific research, particularly in biology.
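
One accessibility check related to objectives 4 and 5 can be sketched with the Python standard library alone: the WCAG contrast ratio between a plot colour and its background. The sketch below applies it to the Okabe-Ito palette, which is widely recommended for readers with colour vision deficiencies. Choosing this palette and a white background is an assumption of the example, not something the tutorial specifies, and luminance contrast is only one aspect of accessible design.

```python
def relative_luminance(hex_colour):
    """WCAG relative luminance of an sRGB colour given as '#RRGGBB'."""
    channels = [int(hex_colour[i:i + 2], 16) / 255 for i in (1, 3, 5)]
    linear = [c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(colour_a, colour_b):
    """WCAG contrast ratio, from 1 (identical luminance) up to 21 (black on white)."""
    lighter, darker = sorted(
        (relative_luminance(colour_a), relative_luminance(colour_b)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)

# Okabe-Ito palette.
okabe_ito = {
    "orange": "#E69F00",
    "sky blue": "#56B4E9",
    "bluish green": "#009E73",
    "yellow": "#F0E442",
    "blue": "#0072B2",
    "vermillion": "#D55E00",
    "reddish purple": "#CC79A7",
    "black": "#000000",
}

background = "#FFFFFF"
for name, colour in okabe_ito.items():
    print(f"{name:15s} {contrast_ratio(colour, background):5.2f}")
```

Colours with a low ratio against the chosen background, such as yellow on white, are better reserved for filled areas than for thin lines or text.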

Intended Audience and Level
The tutorial is aimed at entry-level participants (graduate students, researchers, and scientists) in both academia and industry who are interested in Data Visualisation and ML. Prerequisites: basic knowledge of computer programming (preferably Python) and machine learning (beginner level). No knowledge of art or aesthetics is required.

Schedule

Part 1
14:00 Lecture Introduction to Data Visualisation: Importance and Basic principles of data visualization in scientific research
Jean-Christophe Nebel
15:00 Hands-on Python Libraries for Visualization: Matplotlib, Seaborn, Plotly and others
Farzana Rahman, Ragothaman Yennamalli, Shashank Ravichandran, and Megha Hegde
15:45 Coffee/Tea Break
16:00 Lecture Colour theory in Visualization: Colour palettes, Accessible and Inclusive Visualisations
Ragothaman Yennamalli
17:00 Hands-on Creating various types of charts, plots for clarity and aesthetics. Case studies with real world datasets
Farzana Rahman, Ragothaman Yennamalli, Shashank Ravichandran, and Megha Hegde
Part 2
14:00 Lecture Fundamentals of Machine Learning: Types of ML, Data preprocessing and feature selection, model selection and training
Ragothaman Yennamalli and Farzana Rahman
15:00 Hands-on Python libraries for Machine Learning: Scikit-learn, Pandas, NumPy, TensorFlow/Keras. Building models using real-world biological data
Shashank Ravichandran, and Megha Hegde
16:00 Coffee/Tea Break
16:15 Hands-on Integrating Data Viz and ML: Yellowbrick, Bokeh, TensorBoard, Scikit-plot, etc.
Farzana Rahman and Megha Hegde
17:15 Question and Answer session

- top -

Tutorial VT3: Using LinkML (Linked data Modeling Language) to model your data

Date: Monday, July 8, 2024 14:00 – 18:00 EDT

Organizer:
Nomi L. Harris

Speakers:
Sierra Moxon, software developer, Lawrence Berkeley National Laboratory
Chris Mungall, Principal Investigator, Lawrence Berkeley National Laboratory
Kevin Schaper, software developer, University of Colorado
Harry Caufield, Data Scientist, Lawrence Berkeley National Laboratory
Patrick Kalita, software developer, Lawrence Berkeley National Laboratory

Max Participants: 30

Description
LinkML (Linked data Modeling Language; linkml.io) is an open, extensible modeling framework that allows computers and people to work cooperatively to model, validate, and distribute data that is reusable and interoperable. It is designed to create interoperable data from the start without the overhead normally required for doing this. LinkML can help even non-techies create better, FAIRer, more reusable data models backed by ontologies.

Collecting and organizing biomedical data for an individual project presents a huge challenge; doing so in a way that allows for later reanalysis and reuse across projects is even harder. Many data standards are not machine-actionable, or are defined in isolation, leading to siloization. The quantity and variety of data being generated in biomedical fields is increasing rapidly, but is still often captured in unstructured formats like publications, posters, lab notebooks, or spreadsheets. Researchers at all levels struggle with collecting, managing, and analyzing data and complex knowledge, due to a confusing landscape of schemas, standards, and tools. These challenges impede scientific progress and limit our ability to tailor treatments based on data (precision medicine). AI and ML increasingly enable large-scale data analysis, but lack of data harmonization limits cross-disciplinary applications.

LinkML addresses these issues, weaving together elements of the Semantic Web with aspects of conventional modeling languages to provide a pragmatic way to work with a broad range of data types, maximizing interoperability and computability across sources and domains. LinkML meets data producers where they are technically, and speaks many different modeling languages. Data models can be authored in a variety of languages including YAML, JSON Schema, or even spreadsheets. LinkML supports all steps of the data analysis workflow: data generation, submission, cleaning, annotation, integration, and dissemination. LinkML enables even non-developers to create data models that are understandable and usable across the layers from data stores to user interfaces, reducing translation issues and increasing efficiency.

LinkML is an easy-to-use framework that both emerging and established data-generating communities can use to generate interoperable, reusable datasets and workflows. It has already seen wide uptake by projects across the biomedical spectrum and beyond, including the German Human Genome-Phenome archive, Critical Path Institute, iSample project, National Microbiome Data Collaborative, Center for Cancer Data Harmonization, INCLUDE project, NCATS Biomedical Data Translator, Reactome, Alliance of Genome Resources, Open Microscopy Environment (Next Generation File Format), and Genomics Standards Consortium.

In this tutorial, we will discuss best practices for data modeling; introduce LinkML as a modeling framework and tool suite; work together to set up a LinkML project from scratch; develop a model and validate it with test data; and auto-generate model documentation. If time permits, we will discuss using LinkML and Large Language Models (LLMs) like ChatGPT to extract data from text that conforms to a LinkML model.
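
The "develop a model and validate it with test data" step can be pictured with a small Python sketch. The dictionary below loosely mirrors the classes/attributes layout of a LinkML schema, with `range` and `required` keys, but the validator is a hand-written toy and does not use the LinkML toolkit. The `Sample` class, its slots, and the `validate` function are hypothetical names chosen for this example.

```python
# A plain-dict stand-in for a schema with one class and three slots.
schema = {
    "classes": {
        "Sample": {
            "attributes": {
                "id": {"range": "string", "required": True},
                "organism": {"range": "string", "required": True},
                "read_count": {"range": "integer"},
            }
        }
    }
}

# Map the range names used above onto Python types (assumption of this sketch).
PYTHON_TYPES = {"string": str, "integer": int}

def validate(record, class_name, schema):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    attributes = schema["classes"][class_name]["attributes"]
    for name, spec in attributes.items():
        if name not in record:
            if spec.get("required"):
                problems.append(f"missing required slot: {name}")
            continue
        if not isinstance(record[name], PYTHON_TYPES[spec["range"]]):
            problems.append(f"slot {name} should be {spec['range']}")
    for name in record:
        if name not in attributes:
            problems.append(f"unknown slot: {name}")
    return problems

good = {"id": "S1", "organism": "Homo sapiens", "read_count": 1200}
bad = {"id": "S2", "read_count": "many"}

print(validate(good, "Sample", schema))
print(validate(bad, "Sample", schema))
```

In an actual LinkML project the schema would normally be written in YAML and checked with the framework's own validation tooling, which also handles ontology mappings, enumerations, and inheritance.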

Learning Objectives

  • Learn how to author a new data model that exercises some of the main LinkML modeling components.
  • Generate documentation for the new model, and get familiar with generating the model in different formats.
  • Get familiar with LinkML’s bootstrapping tools that help migrate existing models to LinkML.
  • Learn about LinkML integration with Large Language Model (LLM) technologies to extract data conformant to a LinkML model from text.

Intended Audience and Level
This tutorial is aimed at anyone who generates or works with data: biologists, biocurators, data scientists, and data modelers. No programming or data modeling expertise is required. Listening through the hands-on aspects is encouraged with or without participating directly. To participate in hands-on training, we assume that participants have basic familiarity with running commands from the command line in a terminal (for example, calling Python scripts or running simple commands like “cat” and “grep”), and that they have a GitHub account and basic familiarity with using GitHub.

Schedule

14:00 Ontologies, data models, and FAIR data
Chris Mungall
14:15 Section 1: Set up a LinkML repository
Kevin Schaper
15:00

Section 2: Authoring a LinkML Model

  1. Model components
  2. Classes and slots
  3. Mappings, definitions, enumerations

Sierra Moxon

15:30 BREAK
15:45 Section 3: Common modeling practices
Chris Mungall
16:00 Section 4: Validating your model
Patrick Kalita
16:30 Section 5: Generating code from your model
Kevin Schaper
16:45 BREAK
17:00 Section 6: LinkML and Large Language Models
Harry Caufield
17:45 Wrap up/Questions

- top -

Tutorial VT4: Computational Approaches for Identifying Context-Specific Transcription Factors using Single-Cell Multi-Omics Datasets

Date: Tuesday, July 9, 2024 14:00 – 18:00 EDT

Organizer:
Hatice Ulku Osmanbeyoglu

Speakers:
Hatice Ulku Osmanbeyoglu, Assistant Professor, University of Pittsburgh, USA
Merve Sahin, Computational Biologist, Memorial Sloan Kettering Cancer Center, USA
Parham Hadikhani, Postdoctoral fellow, University of Pittsburgh, USA
Linan Zhang, Assistant Professor, Ningbo University, China

Max Participants: 30

Description
Development of specialized cell types and their functions are controlled by external signals that initiate and propagate cell-type specific transcriptional programs. Activation or repression of genes by key combinations of transcription factors (TFs) drives these transcriptional programs and controls cellular identity and functional state. For example, ectopic expression of the TFs Oct4, Sox2, Klf4 and c-Myc is sufficient to reprogram fibroblasts into induced pluripotent stem cells. Conversely, disruption of TF activity can cause a broad range of diseases including cancer. Hence, identifying context-specific TFs is particularly relevant to human health and disease.

Systematically identifying key TFs for each cell type represents a formidable challenge. Determination of TF activity in bulk tissue is confounded by cell-type heterogeneity. Single-cell technologies now measure different modalities from individual cells such as RNA, protein, and chromatin states. For example, recent technological breakthroughs have coupled the relatively sparse single-cell RNA sequencing (scRNA-seq) signal with robust detection of highly abundant and well-characterized surface proteins using index sorting and barcoded antibodies such as cellular indexing of transcriptomes and epitopes by sequencing (CITE-seq). But these approaches are limited to surface proteins, whereas TFs are intracellular. Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) measures genome-wide chromatin accessibility and reveals cellular memory and response to stimuli or developmental decisions. Recently, several computational methods have leveraged these omics datasets to systematically estimate TF activity influencing cell states. We will cover these TF activity inference methods using scRNA-seq, scATAC-seq, Multiome and CITE-seq data through hybrid lectures and hands-on training sessions. We will cover the principles underlying these methods, their assumptions and trade-offs. We will apply multiple methods, interpret results and discuss strategies for further in silico validation. The audience will be equipped with the practical knowledge and essential skills to conduct TF activity inference independently on their own datasets and interpret results.

Learning Objectives for Tutorial
At the completion of the tutorial, participants will have gained an understanding of the basic concepts and recent advances in transcription factor inference methods for single-cell omics datasets including scRNA-seq, scATAC-seq, CITE-seq and Multiome. Four learning objectives are proposed:

  1. Understand the basic principles underlying TF activity inference from single-cell omics
  2. Understand the specific methodologies, assumptions, and trade-offs between computational inference methods
  3. Gain hands-on experience in applying tools and interpreting results using multiple TF activity inference methods on public scRNA-seq, scATAC-seq, multiome and CITE-seq datasets
  4. Discuss current bottlenecks, gaps in the field, and opportunities for future work.

Intended Audience and Level
This tutorial is designed for individuals at the beginner to intermediate level, specifically targeting bioinformaticians or computational biologists with some prior experience in analyzing single-cell RNA sequencing (scRNA-seq), single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq), Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq), and Multiome data, or those familiar with next-generation sequencing (NGS) methods. A foundational understanding of basic statistics is assumed.

While participants are expected to be beginners, a minimum level of experience in handling NGS datasets is required. The workshop will be conducted using Python and JupyterLab, necessitating prior proficiency in Python programming and familiarity with command-line tools.

To facilitate the learning process, participants will be provided with pre-processed count matrices derived from real datasets. All analyses, including JupyterLab notebooks and tutorial steps, will be available on GitHub for reference.

The tutorial will employ publicly accessible data, with examples showcased using datasets that will be made available through repositories such as the Gene Expression Omnibus or similar public platforms. This hands-on workshop aims to equip participants with practical skills and knowledge, enabling them to navigate and analyze complex datasets in the field of single-cell omics.

Schedule

14:00 Welcome remarks and tutorial overview
Hatice
14:05

Basic principles behind TF activity inference methods

  • Overview of the importance of context-specific TF regulation in biological systems.
  • Significance of TF dynamics in health and disease.
  • Single-cell multi-omics technologies for TF activity inference (scRNA-seq, scATAC-seq, Multiome)

Hatice

14:45 Overview of computational TF inference methods based on single cell omics
Hatice, Merve
15:45 Break
16:00 Hands-on experience in applying tools and interpreting results with multiple TF activity inference methods on public scRNA-seq data
Linan and Merve
16:45 Hands-on experience in applying tools and interpreting results with multiple TF activity inference methods on public scATAC-seq and multiome data
Parham and Merve
17:30 Hands-on experience in applying tools and interpreting results with TF activity inference methods on public CITE-seq data
Parham and Hatice
17:55 Discuss current bottlenecks, gaps in the field, and opportunities for future work
Hatice

- top -

Tutorial VT5: Explainability in Graph Deep Learning for Biomedicine

Date: Monday, July 8, 2024 14:00 – 18:00 EDT

Organizer:
Guadalupe Gonzalez

Speakers:
Guadalupe Gonzalez, Prescient, Genentech Computational Sciences, Genentech.
Chirag Agarwal, Harvard University.

Max Participants: 50

Description
In the rapidly evolving field of biomedical research, graph deep learning (DL) has emerged as a powerful tool for analyzing complex biological data like molecular graphs, protein-protein interaction networks, and patient similarity networks. However, modern graph DL models are complex black-box neural networks comprising millions of parameters, and it is crucial to understand their model predictions before employing them in life-critical applications. Our proposed tutorial is designed to address the above challenge by providing a brief overview of explainability research in the context of graph neural networks (GNNs) and their applications to biomedical problems.

The tutorial will start with an introduction to graph DL, focusing on its relevance and potential in biomedicine. We will discuss why explainability is not just a desirable trait but a necessity in this domain, where model decisions can have significant implications for both model developers and relevant stakeholders.

The second part of the tutorial delves into the core of explainability research in GNNs. We will define what constitutes an explanation in GNN models, introduce post-hoc explainers, and explore metrics for evaluating explanations along with criteria for assessing their quality. We will also introduce explanation-directed message passing, a novel approach that integrates post-hoc explanations directly into the training pipeline of GNNs. Finally, we will introduce existing interpretable graph models in biomedicine.
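As a rough illustration of what a post-hoc explainer and an explanation metric can look like, the sketch below uses a toy setup that is not taken from the tutorial materials. The one-layer message-passing model, its weight and bias, the example graph, and the function names `predict`, `edge_occlusion`, and `fidelity` are all assumptions made for this example. Edge occlusion scores each edge by how much the prediction drops when that edge is removed, and the fidelity-style score measures the drop after deleting the top-ranked edges.

```python
import math

def predict(features, edges, weight=1.0, bias=-2.0):
    """Toy one-layer GNN (illustrative only): every node adds its neighbours'
    features to its own, the graph score is the mean over nodes, and a
    sigmoid turns it into a probability. weight and bias are arbitrary."""
    agg = list(features)          # each node keeps its own feature
    for u, v in edges:            # undirected edges pass messages both ways
        agg[u] += features[v]
        agg[v] += features[u]
    graph_score = sum(agg) / len(features)
    return 1.0 / (1.0 + math.exp(-(weight * graph_score + bias)))

def edge_occlusion(features, edges):
    """Post-hoc explanation: the importance of an edge is the drop in the
    prediction when that single edge is removed from the graph."""
    full = predict(features, edges)
    return {e: full - predict(features, [x for x in edges if x != e])
            for e in edges}

def fidelity(features, edges, importance, k):
    """Fidelity-style score: prediction drop after deleting the k edges
    that the explanation ranks as most important."""
    top = sorted(edges, key=lambda e: importance[e], reverse=True)[:k]
    kept = [e for e in edges if e not in top]
    return predict(features, edges) - predict(features, kept)

# Hypothetical 5-node path graph with one scalar feature per node.
features = [2.0, 1.5, 0.1, 0.1, 0.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]

importance = edge_occlusion(features, edges)
for edge, score in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(edge, round(score, 4))
print("fidelity, top-1 edge:", round(fidelity(features, edges, importance, 1), 4))
```

In this toy graph the edge joining the two high-feature nodes receives the largest score, and removing it changes the prediction the most. Real explainers and metrics for GNNs are considerably more involved, so this is only meant to make the vocabulary concrete.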

In the third part, we will apply these concepts to high-stakes biomedical applications such as predicting molecular properties, discovering new drug targets, and analyzing patient data. We will discuss each application in depth, demonstrating how explainability enhances our understanding of modern GNNs and drives decision-making in biomedicine.

Finally, the tutorial will feature interactive demonstrations and a hands-on practical session. Participants will engage with real-world biomedical datasets, applying explainability techniques to GNN models. This session aims to provide attendees with practical experience and insights into developing and utilizing explainability techniques and interpretable GNN models effectively in their research.

By the end of this tutorial, participants will have a solid understanding of the importance, methods, and applications of explainability in GNNs within the biomedical sphere, equipped with the knowledge and skills to implement these techniques in their work.

Learning objectives

  1. Understand the fundamentals of graph deep learning:
    • Gain a solid understanding of graph DL and GNNs.
    • Recognize the significance and applications of graph DL in biomedicine.
  2. Learn the importance of the explainability and interpretability of machine learning models in biomedical applications:
    • Learn why the explainability and interpretability of machine learning models are crucial in biomedical research.
    • Appreciate the implications of model predictions in healthcare and research settings.
  3. Learn methods and metrics for explainability:
    • Understand different approaches to generating explanations for GNN model predictions.
    • Get acquainted with various metrics and desiderata used to assess the quality and effectiveness of explanations.
  4. Explore post-hoc explanation techniques and explanation-directed message passing:
    • Discover methods for post-hoc analysis of GNN predictions.
    • Delve into explanation-directed message passing and its role in enhancing model interpretability.
  5. Gain hands-on experience with explainability in GNN models:
    • Participate in interactive demonstrations and hands-on exercises to learn how to generate explanations of GNN model predictions for the tasks of molecular property prediction, drug target discovery, and patient data analysis.
    • Understand how explainability aids in the decision-making process in these applications.

Intended Audience and Level
This tutorial is primarily intended for:

  • Researchers and academics: Individuals working in the fields of bioinformatics, computational biology, biomedical research, and related areas. This includes both experienced researchers and graduate students who are exploring interpretability in the context of graph machine learning techniques in biomedicine.
  • Data scientists and machine learning practitioners: Professionals in data science and machine learning working on graph DL and seeking to expand their knowledge into the interpretability domain.
  • Industry professionals: Individuals from biotech, pharmaceutical, and healthcare technology companies who are involved in research and development, particularly in areas intersecting with AI and machine learning.

The tutorial is designed to be intermediate. Participants are expected to have:

  • A basic understanding of machine learning concepts.
  • Familiarity with the fundamentals of DL.
  • Some knowledge of Python programming, as practical exercises will involve coding.

No prior expertise in graph DL or specific biomedical applications is required. The tutorial will provide an introduction to these areas, but will also delve into more advanced topics suitable for attendees with existing knowledge in graph DL or bioinformatics.

Schedule

14:00

Part 1: Introduction to graph deep learning in biomedicine

  • Overview of graph DL: Basics of graph DL and GNNs.
  • Importance of explainability and interpretability: Exploring the significance of explainability and interpretability in biomedical applications.
14:30

Part 2: Understanding and measuring explainability in GNNs

  • What are explanations?: Defining explanations in the context of graph DL.
  • Metrics for goodness of explanations: Discuss various metrics for evaluating the faithfulness of explanations and criteria used to evaluate their quality.
  • Post-hoc explainers: Introduction to post-hoc methods for explaining GNN model predictions.
  • Explanation-directed message passing: Exploring advanced techniques that incorporate explanation into the message-passing mechanism of GNNs.
  • Towards interpretable GNN models: Overview of existing interpretable GNN models in biomedicine.
15:45 Coffee break
16:00

Part 3: Applying explainability techniques to GNN model predictions in biomedical contexts

  • Molecular property prediction: This section will demonstrate the application of explainability techniques to pretrained GNN models for molecular property prediction.
  • Drug target discovery: In this section, we will showcase the use of explainability techniques on GNN model predictions for the discovery and validation of drug targets.
  • Patient data analysis: Illustrating the application of explainability techniques to GNNs analyzing patient data, focusing on how explanations can enhance personalized treatment strategies and disease comprehension.
16:45 Coffee break
17:00

Part 4: Hands-on demonstrations and practical session

  • Interactive demos: Real-time demonstrations showcasing the application of explainability techniques in graph DL.
  • Hands-on exercises: Participants engage in practical exercises applying explainability methods to state-of-the-art GNN models trained on biomedical datasets, such as those included in the MoleculeNet benchmark (https://moleculenet.org/) and Chemprop (https://github.com/chemprop/chemprop).
  • Practical advice: Sharing best practices for developing and utilizing interpretable graph models in biomedicine.
