Schedule subject to change
All times listed are in BST
Monday, July 14th
9:00-13:00
Tutorial VT6: Beyond Bioinformatics: Snakemake for Versatile Computational Workflows
Format: In person


Snakemake is a powerful, Python-based workflow management system that revolutionises how computational tasks are designed, executed and reproduced. By allowing researchers to define workflows as a series of interconnected rules, Snakemake simplifies complex computational pipelines and ensures reproducibility across diverse scientific domains. This intensive workshop introduces participants to the full potential of Snakemake, moving beyond traditional bioinformatics applications to demonstrate its versatility in machine learning and data analysis.

Designed for researchers dealing with complex computational challenges, this tutorial will explore Snakemake’s capabilities through practical, hands-on examples that span multiple disciplines. Participants will learn how to create robust, scalable, and efficient workflows that can adapt to various research challenges. By the end of the workshop, attendees will have a comprehensive understanding of workflow management principles and the skills to implement sophisticated pipelines using Snakemake.
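
To make the idea of "interconnected rules" concrete, here is a minimal Python sketch of how a workflow engine can derive an execution order by matching each rule's inputs to another rule's outputs. It is not Snakemake syntax and not how Snakemake is implemented; the rule names, file names, and the `execution_order` helper are hypothetical and chosen only for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: each one declares the files it reads and the files it writes.
rules = {
    "clean": {"input": ["raw.csv"], "output": ["clean.csv"]},
    "train": {"input": ["clean.csv"], "output": ["model.pkl"]},
    "report": {"input": ["model.pkl", "clean.csv"], "output": ["report.html"]},
}


def execution_order(rules):
    """Return rule names ordered so that every rule runs after its producers."""
    # Map each output file to the rule that creates it.
    producer = {out: name for name, rule in rules.items() for out in rule["output"]}
    # A rule depends on whichever rules produce its input files;
    # inputs with no producer (e.g. raw.csv) are assumed to exist already.
    graph = {
        name: {producer[f] for f in rule["input"] if f in producer}
        for name, rule in rules.items()
    }
    return list(TopologicalSorter(graph).static_order())


print(execution_order(rules))
```

Because dependencies are inferred from file names rather than written out by hand, adding a new rule only requires declaring its inputs and outputs.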

9:00-18:00
Tutorial VT1: Visualising and interpreting your -omics results using ggplot2 and R
Format: In person


This full-day tutorial introduces participants to the principles of impactful data visualisation and equips them with the skills to create publication-ready visualisations for -omics data. Designed for beginners with basic knowledge of the R programming language, the tutorial will guide attendees through creating and interpreting key visualisations such as volcano plots, box plots, heatmaps, dot plots, and network diagrams.

Participants will work with real biological datasets in Quarto notebooks and learn how to use tools like ggplot2, ComplexHeatmap, igraph, ggraph, and clusterProfiler. Through hands-on coding exercises and interactive lectures, attendees will develop an intuition for ggplot2’s grammar of graphics, best practices in data visualisation, and the application of functional enrichment methods to contextualise results.

Whether you are a student or researcher wanting to improve your visualisation skills, or a computational biologist looking to enhance the presentation of your results, this tutorial will provide the tools and knowledge to produce professional-quality figures.

14:00-18:00
Tutorial VT5: Comprehensive Bioinformatics and Statistical Approaches for High-Throughput Sequencing Data Analysis, Including scRNA-seq, in Biomarker Discovery
Format: In person


With the significant advancements in genomic profiling technologies and the emergence of selective molecular targeted therapies, biomarkers have played an increasingly pivotal role in both the prognosis and treatment of various diseases, most notably cancer. This workshop begins with an introductory overview of the basic concepts of biomarkers, the diverse categories of biomarkers, and commonly employed biotechnologies for biomarker detection, with a special focus on gene mutation and gene expression using DNA-seq, RNA-seq, and scRNA-seq data. We will then discuss the processes of biomarker discovery and development, outlining the key steps involved and the current analytical methodologies used. Following this, we will discuss the identification of driver gene mutations and altered gene expression, using The Cancer Genome Atlas (TCGA) lung cancer data and PBMC scRNA-seq data as illustrative examples, with R code as practical demonstrations. In the latter part of this workshop, we will discuss commonly used biostatistics and bioinformatics tools, including data visualization, survival analysis and machine learning methods, which are employed to predict disease progression and patient survival outcomes based on these critical biomarkers. By the conclusion of this course, participants will have acquired a broad and fundamental understanding of biomarker discovery, particularly in cancer, encompassing key concepts, data sources, data analysis techniques, and interpretation strategies. Such expertise will equip participants with the knowledge necessary to contribute to the development of precision medicine in cancer patient treatment.

Tutorial VT8: Generative AI for Single-Cell Perturbation Modeling: Theoretical and practical considerations
Format: In person


Single-cell perturbation modelling is revolutionising how we understand the effects of genetic interventions, drugs, and cellular stimulants on molecular and cellular physiology. This half-day virtual tutorial will introduce participants to highly performant Generative AI tools—scGEN and scPRAM—designed to simulate perturbations on single-cell datasets and extrapolate to unseen conditions. Through concise presentations and hands-on exercises, attendees will explore the theoretical underpinnings of these tools, preprocess single-cell data, train generative models, and interpret results using advanced metrics such as R², E-distance, and Maximum Mean Discrepancy as well as dimensionality reduction techniques. Special focus will be placed on challenging scenarios like extrapolating to unseen patient responses and cross-species predictions, leveraging benchmarking insights from the EU BH 2024 Perturb-Bench initiative. Participants will leave with actionable knowledge to implement, evaluate, and benchmark generative perturbation models, supported by practical resources hosted on Google Colab, Jupyter Book, and GitHub.

Tuesday, July 15th
9:00-13:00
Tutorial VT3: Computational approaches for deciphering cell-cell communication from single-cell transcriptomics and spatial transcriptomics data
Format: In person


Tissues and organs are complex and highly organized systems composed of diverse cells that work together to maintain homeostasis, drive development and mediate the progression of complex diseases such as myocardial infarction (Kuppe et al. 2022). A key focus of modern biology is understanding how heterogeneous populations of cells coexist and communicate with each other (intercellular signaling), how they properly respond (intracellular signaling) within a tissue and organ system and how these processes vary across different experimental conditions (comparative analysis). Recently, a rapid expansion of computational tools exploring the expression of ligands and receptors has enabled the systematic inference of cell-cell communication from single-cell transcriptomics and spatial transcriptomics data (Armingol et al. 2021; Armingol, Baghdassarian, and Lewis 2024). These tools are crucial in unravelling the complex landscape of biological systems.

This tutorial aims to provide a comprehensive introduction to computational approaches for cell-cell communication inference using high-throughput transcriptomics data. It covers the fundamental concepts of cellular communication and the assumptions underlying the analysis, focusing on the main computational methods used in the field. These include computational approaches for intercellular communication inference (CellphoneDB (Efremova et al. 2020); LIANA (Dimitrov et al. 2022, 2024)) and for intracellular signaling (scSeqComm (Baruzzo, Cesaro, and Di Camillo 2022); NicheNet (Browaeys, Saelens, and Saeys 2019)). Next, we will describe approaches for comparative analysis of cell-cell networks in distinct biological conditions (CrossTalkeR (Nagai et al. 2021)) and methods for spatially resolved cell-cell communication (Ischia (Regan-Komito 2024); DeepCOLOR (Kojima et al. 2024)).

In the first part of the tutorial, participants will be introduced to the theoretical basis of state-of-the-art computational approaches and will learn how to use representative tools for inferring intercellular signaling and intracellular signaling pathways. In the second part, we will focus on comparative analyses, i.e. changes in cell-cell communication between two conditions, and subsequently highlight the unique insights spatial transcriptomics data can provide for understanding tissue architecture and cellular communication. Both sections will be followed by a hands-on component based on the analysis of single-cell and spatial transcriptomics data from the myocardial infarction atlas (Kuppe et al. 2022). To promote transparency, all the code, software tools and datasets used throughout the tutorial will be available through open-access repositories (e.g. GitHub or Zenodo).

Tutorial VT9: Biomedical text mining for knowledge extraction
Format: In person


Modern bioinformatics analyses rely heavily on the existing knowledge of the role of genes and mutations in different diseases, as well as the complex interactions between genes, proteins and drugs. However, access to this information is often limited for many biomedical problems, especially niche areas, due to a lack of knowledge bases and large curation costs. The information is often locked in the text of the original research papers. Machine learning methods, particularly natural language processing techniques, offer an automated approach to extracting knowledge from the research literature to build bespoke knowledge bases for scientists’ needs. This tutorial will provide a hands-on introduction to the core tasks in biomedical natural language processing (BioNLP). These include identifying mentions of important concepts (e.g. phenotypes, cell lines, etc.) and extracting nuanced relationships between them. Finally, it will show how large language models have changed how quickly information can be extracted, and highlight the challenges they bring.

9:00-18:00
Tutorial VT2: OmicsViz: Interactive Visualization and ML for Omics Data
Format: In person


Data Science and Machine Learning are intricately connected, particularly in computational biology. In a time when biological data is being produced on an unprecedented scale, encompassing genomic sequences, protein interactions, and metabolic pathways, meeting the demand for analysing it has never been more crucial.

Data visualization plays a crucial role in biological data sciences since it allows the transformation of complex, often incomprehensible raw data into visual formats that are easier to understand and interpret. This allows biologists to recognize patterns, anomalies, and correlations that would otherwise be lost in the sheer volume of data. In addition, machine learning (ML) has brought about a revolution in the analysis of biological data. Exploiting extensive datasets, ML provides tools to model complex systems and generate predictions. Indeed, ML algorithms excel at uncovering subtle patterns in data, contributing to tasks like predicting protein structures, comprehending genetic variations and their implications for diseases, and even facilitating drug discovery by predicting molecular interactions.

The integration of data visualization and machine learning is particularly powerful. In particular, visualization may aid in interpreting machine learning models, allowing biologists to understand and trust their predictions. It could also help fine-tune these models by identifying outliers or anomalies in the data.

Due to the remarkable capability of this combination, there has been a surge in the development and application of tools that combine data visualization and machine learning in biology. Platforms that integrate these technologies enable biologists to conduct comprehensive analyses without needing deep expertise in computer science. This democratization of data science and ML has empowered more and more biologists to engage in sophisticated, data-driven research.

14:00-18:00
Tutorial VT4: An applied genomics approach to crop breeding: A suite of tools for exploring natural and artificial diversity
Format: In person


The urgent need for crop improvement is hindered by the lack of precision in crop breeding. Although significant progress has been made in genomics, many causal genes of important agronomic and nutritional traits remain unknown. This is due to the inefficient identification of causal genetic features in candidate genes. With the growing body of sequencing data and phenotype information, current advances in genomics and the development of bioinformatics tools offer improvements in candidate gene selection. In this workshop, SoyHUB, the suite of tools, strategies, and solutions for soybean applied genomics, will be presented along with its extension to other crops at KBCommons. Our methodology for data integration, curation, reuse, and leveraging will be highlighted. Practical utilization of integrated data with the tools will be demonstrated. We will focus on the selection of best-performing markers, identification of causal genes, and exploration of alleles and genomic variation. We will cover simple Mendelian traits, showing how to analyze variation in protein-coding regions and promoters and touch on copy number variation. Solutions for complex cases, such as multiple independent alleles in a single gene or quantitative traits, will also be introduced. This workshop will highlight recent achievements in leveraging big data to improve precision in GWAS-driven discoveries and, therefore, accelerate the breeding of soybean and other crops.

Tutorial VT7: Assessing and Enhancing Digital Accessibility of Biological Data and Visualizations
Format: In person


As computational biologists, we produce biological datasets, visualizations, and computational tools. Our shared goal is to make our data and tools widely usable and accessible. However, we often fail to meet the needs of certain groups of people. Our recent comprehensive evaluation [1] (https://inscidar.org) shows that biological data resources are largely inaccessible to people with disabilities, with severe accessibility issues in almost 75% of all 3,112 data portals included in the study. To address these critical accessibility barriers, it is important to increase awareness of accessibility in the community and teach the workforce practical ways to enhance the accessibility of biological data resources. While there are existing resources and training opportunities that focus on content that includes little or no data, there is a lack of solutions and resources that provide insights into how to make data-intensive content accessible.

Our tutorial is designed to help participants understand the importance of digital accessibility in computational biology and practice various approaches to test and implement digital accessibility of biological data and visualizations. We will demonstrate our evaluation results [1] to help participants understand the critical barriers in biological research and education for people with disabilities, such as those involving vision, cognitive, and physical function. We will use hands-on examples that are familiar to and widely used by computational biologists, such as computational notebooks, genome browsers, and other visualizations. Our tutorial will conclude by introducing open problems and recent innovations, such as the accessibility of interactive genomics data visualizations [2]. We will ensure that our tutorial and all the materials are accessible.

Sunday, July 20th
9:00-10:45
Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels
Room: 11A
Format: In person


Omics data analysis presents unique challenges due to its high dimensionality and complexity. Supervised machine learning (ML) offers powerful tools for gaining insights from these data but currently faces a crisis of reproducibility due to poor adherence to best practices in feature selection and model evaluation, and a need for greater interpretability.

This full-day tutorial introduces participants to the common pitfalls and best practices of applying ML to omics research. It demonstrates good practice by example, using the Tidymodels framework for ML workflows in R, tailored to omics applications. The course will feature a mixture of lectures, quizzes, real-life coding tutorials and hands-on practicals with one-to-one support. Example applications will illustrate regression analysis with methylation clocks, gene prioritisation and classification with cancer biomarker discovery.

Special attention will be paid to challenges in working with highly multivariate data and integrating various data types as well as providing tips to extract meaningful insights from complex data. Beginner-level R skills are required, and attendees will leave with practical skills to apply Tidymodels to their own datasets.

Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource
Room: 03A
Format: In person


This tutorial is designed to empower bioinformatics researchers with the knowledge and skills to effectively use Massively Parallel Reporter Assay (MPRA) data in their work. MPRAs are gaining wider applications across the functional genomics community and are used as part of the Impact of Genomic Variation on Function (IGVF) Consortium. IGVF is a collaborative research initiative funded by the NHGRI that aims to systematically study how genomic variations affect genome function and, consequently, phenotypes. By integrating experimental and computational approaches, IGVF seeks to map and predict the functional impacts of genetic variants, providing a comprehensive catalog of these effects.

This tutorial provides a thorough introduction to MPRAs and IGVF data resources, practical training on MPRA data, and insights into advanced analysis methods for such data. Participants will gain an understanding of MPRA experiments, including their various experimental designs and the rationale for using them in functional genomics. This will involve learning the process of associating tags/barcodes with sequences incorporated in the reporter constructs from raw sequencing reads and counting barcodes from DNA sequencing and RNA expression. The tutorial will guide participants through data processing using MPRAsnakeflow, a streamlined Snakemake workflow developed with IGVF for efficient MPRA data handling and QC reporting. Statistical analysis for sequence-level and variant-level effect testing of MPRA count data will be introduced using BCalm, a barcode-level MPRA analysis package developed as part of our IGVF efforts.

Further, the tutorial will provide a starting point for training (deep learning) sequence models on MPRA data and related functional genomics datasets. Participants will learn how to extract meaningful insights from their datasets by investigating the sequence activity relationship and extracting important sequence motifs. By integrating these topics and methods, participants will leave the tutorial equipped with both theoretical knowledge and practical skills necessary for analyzing and using MPRA data effectively.

Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research
Room: 04AB
Format: In person


The interpretation of genetic variation is important for understanding human health and disease. Increased knowledge leads to societal benefits including faster disease diagnosis, a better understanding of disease progression, and more efficient identification and prioritisation of drug targets for testing, resulting in better overall health outcomes for a population. Whilst sequencing has become faster and cheaper, the complexity of variant interpretation remains a bottleneck for understanding. This tutorial will explore the variety of annotations and techniques available to assess human variation and the implications of variant effects on human health and disease.

Tutorial IP4: Quantum Machine Learning for multi-omics analysis
Room: 03B
Format: In person


Single-cell and population-level multi-omics analyses have greatly enhanced our understanding of biological complexity. By integrating various types of biological data (such as genomics, proteomics, and transcriptomics, collectively known as multi-omics), these approaches have provided deep insights into the molecular mechanisms underlying complex diseases, both at the cellular level and across patient populations. As the size and complexity of multi-omics data continue to grow, the need to leverage emerging technologies such as artificial intelligence (AI) and quantum computing (QC) also grows. Recently, advances in QC have shown promise in solving real-world problems in machine learning and optimization in biomedicine, drug discovery, biomarker discovery, and clinical trials, among other healthcare and life sciences objectives [1,2,3,4,5].

In this tutorial, participants will learn the fundamental concepts of QC and engage in hands-on experiments that apply classical machine learning (ML) techniques. They will also learn best practices for pre-processing multi-omics data in preparation for quantum machine learning (QML) tasks. Through a systematic evaluation of various data complexity measures and their impact on the performance of different ML and QML models, participants will gain insights into when QML models can be used effectively. Additionally, they will explore quantum-classical hybrid workflows for ML, with a focus on biomedical data analysis.

Tutorial IP5: Introduction to Causal Analysis using Mendelian Randomisation
Room: 12
Format: In person


Mendelian randomisation (MR) is a method that uses genetic variation associated with an exposure (e.g., behaviours, biomarkers) to infer its causal effect on an outcome (e.g. health status). In statistical terms, it functions as an "instrumental variable" approach.

By mimicking the design of a randomised controlled trial through genetic inheritance, MR provides a framework for addressing confounding and reverse causation, making it a valuable tool in epidemiological and biomedical research.

This workshop offers a beginner-friendly introduction to the key concepts and assumptions underlying MR, such as the use of genome-wide association study (GWAS) data and the three key assumptions for valid instrumental variables: relevance, independence, and exclusion restriction. Participants will explore common challenges in MR analysis, including pleiotropy, population stratification, and measurement error, while learning strategies to overcome these using advanced methods. The workshop also includes a two-hour hands-on session in which attendees will work with real-world data to conduct MR analyses using R. By the end of the session, participants will have a clear understanding of MR principles, the ability to critically evaluate MR studies, and practical skills to apply MR methods in their own research.
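
As a rough illustration of the instrumental-variable idea described above, the sketch below computes two standard summary-data MR estimates in plain Python: the Wald ratio for a single variant and the inverse-variance weighted (IVW) estimate across several variants. This is not the workshop's material (which uses R); the function names and the summary statistics are hypothetical, and the sketch assumes the three instrument assumptions hold and uses first-order weights based only on the outcome standard errors.

```python
from math import sqrt


def wald_ratio(beta_exposure, beta_outcome):
    """Causal effect estimate from one variant: outcome effect per unit exposure effect."""
    return beta_outcome / beta_exposure


def ivw_estimate(variants):
    """Inverse-variance weighted estimate across variants.

    `variants` is a list of (beta_exposure, beta_outcome, se_outcome) tuples,
    i.e. per-variant GWAS effect sizes and the outcome standard error.
    Returns (estimate, standard_error).
    """
    numerator = sum(bx * by / se**2 for bx, by, se in variants)
    denominator = sum(bx**2 / se**2 for bx, by, se in variants)
    return numerator / denominator, sqrt(1 / denominator)


# Hypothetical summary statistics, invented purely for illustration.
variants = [
    (0.10, 0.050, 0.01),
    (0.20, 0.100, 0.02),
    (0.05, 0.025, 0.01),
]

estimate, std_error = ivw_estimate(variants)
print(f"IVW estimate: {estimate:.3f} (SE {std_error:.3f})")
```

In this toy input every variant has the same Wald ratio, so the IVW estimate coincides with it; with real data the per-variant ratios differ, and their spread is one signal of possible pleiotropy.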

Tutorial IP6: Hello Nextflow: Getting started with workflows for bioinformatics
Room: 11BC
Format: In person


Nextflow is a powerful and flexible open-source workflow management system that simplifies the development, execution, and scalability of data-driven computational pipelines. It is widely used in bioinformatics and other scientific fields to automate complex analyses, making it easier to manage and reproduce large-scale data analysis workflows.

This training workshop is intended as a “getting started” course for students and early-career researchers who are completely new to Nextflow. It aims to equip participants with foundational knowledge and skills in three key areas: (1) understanding the logic of how data analysis workflows are constructed, (2) Nextflow language proficiency and (3) command-line interface (CLI) execution.

Participants will be guided through hands-on, goal-oriented exercises that will allow them to practice the following skills:

Use core components of the Nextflow language to construct simple multi-step workflows effectively.
Launch Nextflow workflows locally, navigate output directories to access results, interpret log outputs for insights into workflow execution, and troubleshoot basic issues that may arise during workflow execution.

By the end of the workshop, participants will be well-prepared for tackling the next steps in their journey to develop and apply reproducible workflows for their scientific computing needs. Additional study-at-home materials will be provided for them to continue learning and developing their skills further.

11:00-13:00
Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels
Room: 11A
Format: In person



Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource
Room: 03A
Format: In person



Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research
Room: 04AB
Format: In person



Tutorial IP4: Quantum Machine Learning for multi-omics analysis
Room: 03B
Format: In person



Tutorial IP5: Introduction to Causal Analysis using Mendelian Randomisation
Room: 12
Format: In person


Authors List: Show

Presentation Overview: Show

Mendelian randomisation (MR) is a method that uses genetic variation associated with an exposure (e.g., behaviours, biomarkers) to infer its causal effect on an outcome (e.g. health status). In statistical terms, it functions as an "instrumental variable" approach.

By mimicking the design of a randomised controlled trial through genetic inheritance, MR provides a framework for addressing confounding and reverse causation, making it a valuable tool in epidemiological and biomedical research.

This workshop offers a beginner-friendly introduction to the key concepts underlying MR, such as the use of genome-wide association study (GWAS) data and the three core assumptions for valid instrumental variables: relevance, independence, and exclusion restriction. Participants will explore common challenges in MR analysis, including pleiotropy, population stratification, and measurement error, while learning strategies to overcome these using advanced methods. The workshop also includes a two-hour hands-on session in which attendees will work with real-world data to conduct MR analyses using R. By the end of the session, participants will have a clear understanding of MR principles, the ability to critically evaluate MR studies, and practical skills to apply MR methods in their own research.
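
As a rough illustration of the instrumental-variable idea described in this abstract, the simplest MR estimate for a single genetic variant is the Wald ratio: the variant-outcome association divided by the variant-exposure association. Several variants can then be combined with inverse-variance weights. The sketch below is not taken from the workshop materials (which use R). The function names and all numbers are hypothetical, and the standard error is a first-order approximation that ignores uncertainty in the variant-exposure estimate.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-variant MR estimate with a first-order standard error."""
    if beta_exposure == 0:
        # A variant unrelated to the exposure violates the relevance assumption.
        raise ValueError("variant is not associated with the exposure")
    estimate = beta_outcome / beta_exposure
    std_error = se_outcome / abs(beta_exposure)
    return estimate, std_error


def ivw_estimate(variants):
    """Inverse-variance weighted average of per-variant Wald ratios.

    `variants` is an iterable of (beta_exposure, beta_outcome, se_outcome).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for beta_exposure, beta_outcome, se_outcome in variants:
        ratio, se = wald_ratio(beta_exposure, beta_outcome, se_outcome)
        weight = 1.0 / se ** 2
        weighted_sum += weight * ratio
        weight_total += weight
    return weighted_sum / weight_total, math.sqrt(1.0 / weight_total)


# Hypothetical GWAS summary statistics for two variants.
example_variants = [(0.20, 0.05, 0.01), (0.10, 0.025, 0.01)]
estimate, std_error = ivw_estimate(example_variants)
print(f"IVW causal estimate = {estimate:.3f} (SE {std_error:.3f})")
```

The estimate is only interpretable as causal if the three instrumental-variable assumptions named above hold. Pleiotropy in particular would bias both functions.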

Tutorial IP6: Hello Nextflow: Getting started with workflows for bioinformatics
Room: 11BC
Format: In person


Nextflow is a powerful and flexible open-source workflow management system that simplifies the development, execution, and scalability of data-driven computational pipelines. It is widely used in bioinformatics and other scientific fields to automate complex analyses, making it easier to manage and reproduce large-scale data analysis workflows.

This training workshop is intended as a “getting started” course for students and early-career researchers who are completely new to Nextflow. It aims to equip participants with foundational knowledge and skills in three key areas: (1) understanding the logic of how data analysis workflows are constructed, (2) Nextflow language proficiency and (3) command-line interface (CLI) execution.

Participants will be guided through hands-on, goal-oriented exercises that will allow them to practice the following skills:

Use core components of the Nextflow language to construct simple multi-step workflows effectively.
Launch Nextflow workflows locally, navigate output directories to access results, interpret log outputs for insights into workflow execution, and troubleshoot basic issues that may arise during workflow execution.

By the end of the workshop, participants will be well-prepared for tackling the next steps in their journey to develop and apply reproducible workflows for their scientific computing needs. Additional study-at-home materials will be provided for them to continue learning and developing their skills further.

14:00-16:00
Tutorial IP7: AI large cellular models and in-silico perturbation
Room: 11BC
Format: In person


Transformer-based large language models (LLMs) are changing the world. The capabilities they have shown in sophisticated natural language, vision and multi-modal tasks have inspired the development of large cellular models (LCMs) for single-cell transcriptomic data, such as scBERT, Geneformer, scGPT, scFoundation, GeneCompass and scMulan. After pretraining on massive amounts of single-cell RNA-seq data, agnostic to any downstream task, these transformer-based models have demonstrated exceptional performance in tasks such as cell type annotation, data integration, gene network inference, and the prediction of drug sensitivity or perturbation responses. Such advances, although still at an early stage, suggest promising new approaches for leveraging AI to understand the complex system of cells from extensive datasets beyond human analytical capacity. In particular, such models have made it possible to conduct in-silico perturbation on cells of various types, predicting their responses to gene perturbations without performing experiments on the cells. These models provide prototypes of digital virtual cells that can be used to reconstruct and simulate live cells, which will revolutionize many aspects of future biomedical studies.
Although the community is highly enthusiastic about this exciting progress, the structures and algorithms of LCMs and other AI models of similar scale remain mysterious to many people without the relevant background. This tutorial aims to fill this gap. We will begin with an introduction to the basic principles of deep neural networks and explain the basic structure and algorithm of the original Transformer for natural language tasks. We will show attendees how to build such models on current machine learning platforms. We will then introduce several successful ways to build large cellular models based on the basic Transformer model, and give an overview of how such models are pretrained on single-cell RNA-seq data. Attendees will get to practice using LCMs for basic tasks such as cell type annotation, and we will look into the specific application of LCMs to in-silico perturbation tasks. Hands-on activities include building basic transformer models and executing downstream single-cell tasks, including cell type annotation and in-silico perturbation. These activities will demystify LCMs for attendees and help them better understand how LCMs can be built and applied.

Tutorial IP8: Representation Learning and Feature Engineering for Genomic Sequences Analysis
Room: 12
Format: In person


Machine learning (ML) has been successfully applied to various omics problems, such as sequence classification in genomics. The effectiveness of ML methods relies greatly on the choice of data representation, or features, used to extract meaningful information from sequences. Genomic sequences can be viewed as one-dimensional strings of successive letters representing nucleotides. However, to make these sequences compatible with ML methods, they must first be transformed into structured numerical representations, such as vectors or matrices. Traditional methods for sequence classification often rely on manually crafted or pre-defined features, which require domain expertise and may not fully capture the complexity of the underlying biological information. Recently, representation learning has emerged as a powerful alternative, enabling the automatic extraction of latent patterns directly from raw data and reducing the dependence on manually crafted features. In genomics, representation learning methods have been introduced to characterize DNA and RNA sequences. Techniques such as Word2Vec, Convolutional Neural Networks (CNNs) and Large Language Models (LLMs) have demonstrated the ability to learn optimal sequence representations that effectively capture both local and global patterns in these sequences.

This tutorial provides a comprehensive introduction to feature engineering and representation learning for genomic sequences (DNA/RNA). Participants will explore traditional techniques for extracting features from genomic sequences, building a foundation in classical approaches. The tutorial will then cover representation learning, introducing concepts such as embeddings and their applications. Topics include methods such as Word2Vec and LLMs for obtaining meaningful representations from genomic sequences. Through hands-on exercises and comparative analyses, attendees will learn to combine traditional feature engineering with representation learning approaches, developing practical skills and insights that are adaptable to diverse genomic research challenges. The goal is to offer participants the knowledge and tools to enhance genomic sequence analysis using different techniques for sequence representation.
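
To make the step from letters to numbers concrete, here is a minimal sketch of two classical hand-crafted representations of the kind this abstract refers to: a one-hot matrix and a k-mer count vector. It is an illustration only, not material from the tutorial. The function names and the example sequence are hypothetical, and ambiguous bases such as N are simply ignored.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def one_hot(sequence):
    """Encode a DNA sequence as rows of four 0/1 values in A, C, G, T order.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in sequence.upper()]


def kmer_counts(sequence, k=2):
    """Count overlapping k-mers and return a fixed-length vector.

    The vector has 4**k entries ordered lexicographically (AA, AC, AG, ...),
    so sequences of different lengths map to vectors of the same size.
    """
    sequence = sequence.upper()
    vocabulary = ["".join(p) for p in product(NUCLEOTIDES, repeat=k)]
    counts = dict.fromkeys(vocabulary, 0)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases
            counts[kmer] += 1
    return [counts[kmer] for kmer in vocabulary]


example = "ACGTAC"
print(one_hot(example)[0])        # row for the first base, A
print(kmer_counts(example, k=2))  # 16-dimensional dinucleotide count vector
```

The one-hot matrix keeps positional information and suits models such as CNNs, whereas the k-mer vector discards position but has a fixed size. Learned embeddings aim to capture what both miss.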

Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels
Room: 11A
Format: In person


Omics data analysis presents unique challenges due to its high dimensionality and complexity. Supervised machine learning (ML) offers powerful tools for gaining insights from these data but currently faces a crisis of reproducibility, driven by poor adherence to best practices in feature selection and model evaluation and by a need for greater interpretability.

This full-day tutorial introduces participants to the common pitfalls and best practices of applying ML to omics research. It demonstrates good practice by example, using the Tidymodels framework for ML workflows in R, tailored to omics applications. The course will feature a mixture of lectures, quizzes, real-life coding tutorials and hands-on practicals with one-to-one support. Example applications will illustrate regression analysis with methylation clocks, gene prioritisation and classification with cancer biomarker discovery.

Special attention will be paid to the challenges of working with highly multivariate data and integrating various data types, and tips will be provided for extracting meaningful insights from complex data. Beginner-level R skills are required, and attendees will leave with practical skills to apply Tidymodels to their own datasets.

Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource
Room: 03A
Format: In person


This tutorial is designed to empower bioinformatics researchers with the knowledge and skills to effectively utilize Massively Parallel Reporter Assay (MPRA) data in their work. MPRAs are gaining wider applications across the functional genomics community and are used as part of the Impact of Genomic Variation on Function (IGVF) Consortium. IGVF is a collaborative research initiative funded by the NHGRI that aims to systematically study how genomic variations affect genome function and, consequently, phenotypes. By integrating experimental and computational approaches, IGVF seeks to map and predict the functional impacts of genetic variants, providing a comprehensive catalog of these effects.

This tutorial provides a thorough introduction to MPRAs and IGVF data resources, practical training on MPRA data, and insights into advanced analysis methods for such data. Participants will gain an understanding of MPRA experiments, including their various experimental designs and the rationale for using them in functional genomics. This will involve learning how tags/barcodes are associated with the sequences incorporated in the reporter constructs from raw sequencing reads, and how barcodes are counted from DNA sequencing and RNA expression. The tutorial will guide participants through data processing using MPRAsnakeflow, a streamlined Snakemake workflow developed with IGVF for efficient MPRA data handling and QC reporting. Statistical analysis for sequence-level and variant-level effect testing of MPRA count data will be introduced using BCalm, a barcode-level MPRA analysis package developed as part of our IGVF efforts.

Further, the tutorial will provide a starting point for training (deep learning) sequence models on MPRA data and related functional genomics datasets. Participants will learn how to extract meaningful insights from their datasets by investigating the sequence-activity relationship and extracting important sequence motifs. By integrating these topics and methods, participants will leave the tutorial equipped with both the theoretical knowledge and the practical skills necessary for analyzing and using MPRA data effectively.
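
As a simplified picture of what barcode association and counting lead to, a common first summary of MPRA data is the log2 ratio of RNA to DNA barcode counts for each tested sequence. The sketch below is an assumption-laden illustration and does not reproduce what MPRAsnakeflow or BCalm do internally. The function name, the pseudocount, and the toy barcodes and counts are all hypothetical, and no library-size normalisation or statistical testing is performed.

```python
import math
from collections import defaultdict


def activity_per_sequence(barcode_to_sequence, dna_counts, rna_counts, pseudocount=1.0):
    """Aggregate barcode counts per sequence and return log2(RNA / DNA).

    barcode_to_sequence: mapping from barcode to the tested sequence it tags
    dna_counts, rna_counts: mapping from barcode to its read count
    pseudocount: added to both totals to avoid division by zero (an assumption)
    """
    dna_total = defaultdict(float)
    rna_total = defaultdict(float)
    for barcode, sequence in barcode_to_sequence.items():
        dna_total[sequence] += dna_counts.get(barcode, 0)
        rna_total[sequence] += rna_counts.get(barcode, 0)
    return {
        sequence: math.log2((rna_total[sequence] + pseudocount)
                            / (dna_total[sequence] + pseudocount))
        for sequence in dna_total
    }


# Toy data: two barcodes tag one candidate enhancer, one barcode tags a control.
assignment = {"AAAC": "enhancer_1", "AAAG": "enhancer_1", "CCCT": "control_1"}
dna = {"AAAC": 10, "AAAG": 5, "CCCT": 7}
rna = {"AAAC": 40, "AAAG": 23, "CCCT": 7}
print(activity_per_sequence(assignment, dna, rna))
```

Pooling barcodes before taking the ratio discards barcode-to-barcode variability, which is the information a barcode-level analysis is designed to use.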

Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research
Room: 04AB
Format: In person


The interpretation of genetic variation is important for understanding human health and disease. Increased knowledge leads to societal benefits including faster disease diagnosis, a better understanding of disease progression, and more efficient identification and prioritisation of drug targets for testing, resulting in better overall health outcomes for a population. Whilst sequencing has become faster and cheaper, the complexity of variant interpretation remains a bottleneck for understanding. This tutorial will explore the variety of annotations and techniques available to assess human variation and the implications of variant effects on human health and disease.

Tutorial IP4: Quantum Machine Learning for multi-omics analysis
Room: 03B
Format: In person


16:15-18:00
Tutorial IP7: AI large cellular models and in-silico perturbation
Room: 11BC
Format: In person


Tutorial IP8: Representation Learning and Feature Engineering for Genomic Sequences Analysis
Room: 12
Format: In person


Tutorial IP1: Machine Learning for Omics: Best practices and Real-Life Insights with TidyModels
Room: 11A
Format: In person


Tutorial IP2: Massively parallel reporter assays in functional regulatory genomics and as part of the IGVF data resource
Room: 03A
Format: In person


Tutorial IP3: Genomic Variant Interpretation & prioritisation for clinical research
Room: 04AB
Format: In person


Tutorial IP4: Quantum Machine Learning for multi-omics analysis
Room: 03B
Format: In person

