ISCBacademy 2023 Webinars



To view previous webinars use the links below

2020 Webinars | 2021 Webinars | 2022 Webinars | 2024 Webinars

ISCBacademy is an online webinar series that includes the ISCB COSI and COVID webinars, Indigenous Voices, and practical tutorials. We aim to inspire, connect, and communicate the science, providing hands-on experience in accessing and using newly developed bioinformatics tools and ensuring best practices for rigour and reproducibility.


  • January 10, 2023 - Preterm Birth Prediction Microbiome DREAM Challenge Rebroadcast by Tomiko Oskotsky, Abigail Kuntzleman, and Eunyong Kim - Hosted by RegSys
  • January 17, 2023 - Design principles of complex cellular decision-making networks in cancer by Mohit Kumar Jolly, IISC - Hosted by SysMod
  • January 24, 2023 - Constructing disease trajectories at scale – incorporation of deep temporal information from patient records using text mining by Søren Brunak, University of Copenhagen - Hosted by TextMining
  • January 31, 2023 - Quantitative Systems Metabolism: Applications in basic research and personalized medicine by Nikolaus Berndt, Charité – Universitätsmedizin Berlin - Hosted by TransMed
  • February 14, 2023 - Substructural similarity searching: having more fun(ctions) with increasing biological structure data by Mohd Firdaus Raih, Universiti Kebangsaan Malaysia - Hosted by 3D-SIG
  • February 27, 2023 - Structural Bioinformatics: From sequence to structure of protein by Arooj Shafiq, University of Cambridge - Hosted by ISCB
  • March 2, 2023 - Ontologies and Machine Learning for Metabolism: Towards a Virtuous Cycle by Janna Hastings, University of Zurich - Hosted by Bio-Ontologies
  • March 7, 2023 - Navigating and Making Sense of Large 3D Visualizations of Biological Data by David Kouřil, Masaryk University - Hosted by BioVis
  • March 14, 2023 - Re-Thinking the Patient's Role in a Learning Health System: Lessons from the Patient-Led Research Collaborative by Hannah Wei, Patient-Led Research Collaborative - Hosted by BOSC
  • March 21, 2023 - Comparative analysis of information-theory-based statistical methods and transformer-based machine learning techniques for scientific literature classification by Ihor Stepanov, Institute of Molecular Biology and Genetics of NASU - Hosted by CAMDA
  • April 11, 2023 - Machine learning enables prediction of metabolic system evolution in bacteria by Naoki Konno, University of Tokyo - Hosted by EvolCompGen
  • April 21, 2023 - Immune cell profiling from single cell RNA with R by Meghana Kshirsagar, Gauri Vaidya - Hosted by ISCB
  • May 2, 2023 - A systematic search for RNA structural switches across the human transcriptome by Matvei Khoroshkin, University of California San Francisco - Hosted by iRNA
  • May 9, 2023 - Creating genome-wide maps of identical repeats that can mediate complex genomic rearrangements in the human genome by Claudia Gonzaga-Jauregui, National Autonomous University of Mexico - Hosted by JPI
  • May 16, 2023 - PRECISE MICROBIOME GENOMICS TO PRECISION MEDICINE by Ami S Bhatt, Stanford University - Hosted by MICROBIOME
  • May 30, 2023 - Introduction to Explainable AI by Lisa Barros de Andrade e Sousa, Helmholtz Munich; Donatella Cea, Helmholtz Munich; Elisabeth Georgii, Helmholtz Munich; Helena Pelin, Helmholtz Munich - Hosted by ISCB
  • June 2, 2023 - Anti-PD1 DREAM Challenge Rebroadcast by Jing Tang, Raghvendra Mall, Óscar Lapuente-Santana, Teemu Laajala, James Kozloski - Hosted by RegSys
  • June 20, 2023 - 3D-BioInfo Webinar on Protein Design and Evolution
  • June 26, 2023 - Introduction to Open Science by Jessica Miller, The Royal Society - Hosted by ISCB
  • June 27, 2023 - Learning to understand life: how to gain knowledge of complex living systems (from a limited number of samples) by Sanne Abeln, Utrecht University - Hosted by TransMed
  • July 4, 2023 - Computational Exploration of Lung Function Genetics Across Populations via Public GWAS Data Integration by Afeefa Zainab - Hosted by VarI
  • July 4, 2023 - When the outlier is the signal: RNA-seq based diagnostics of rare disorders by Vicente Yepez - Hosted by VarI
  • August 29, 2023 - Systems Biomedicine and Pharmaceutics: Multiscale Modeling of Tissue Remodeling and Damage by Ashlee N Ford Versypt, University at Buffalo - Hosted by SysMod
  • September 20, 2023 - Getting Started with Peer Review by Jessica Miller, The Royal Society - Hosted by ISCB
  • October 3, 2023 - LinkML: an open data modeling framework, grounded with ontologies by Sierra Moxon, Lawrence Berkeley National Laboratory - Hosted by BOSC
  • October 5, 2023 - Clustering predicted structures at the scale of the known protein universe by Pedro Beltrao, Swiss Institute of Bioinformatics - Hosted by 3D-SIG
  • October 10, 2023 - Antimicrobial Resistance Prediction and Forensics CAMDA 2023 by Nelly Sélem Mojica, UNAM - Hosted by CAMDA
  • October 23, 2023 - Challenges and applications of small databases in mass spectrometry-based proteomics data analysis by Andy Lin, PNNL - Hosted by CompMS
  • October 24, 2023 - The BioCyc Web Services API by Suzanne Paley, SRI International - Hosted by ISCB
  • November 2, 2023 - Using the Ensembl genome browser and REST API to retrieve genome annotation data Part 1 by Aleena Mushtaq, EMBL-EBI - Hosted by ISCB
  • November 3, 2023 - Using the Ensembl genome browser and REST API to retrieve genome annotation data Part 2 by Aleena Mushtaq, EMBL-EBI - Hosted by ISCB
  • November 7, 2023 - Exploring the Potential and Risks of Third-Generation Long-Read Transcriptome Sequencing for Unraveling Transcriptome Complexity by Ana Conesa, Spanish National Research Council - Hosted by HiTSeq
  • November 9, 2023 - Empirical evaluations of large language models for bioinformatics education and research by Stephen Piccolo, Brigham Young University - Hosted by Education
  • November 10, 2023 - Phylogenetic network estimation from site patterns using composite likelihood by Sungsik Kong, University of Wisconsin-Madison - Hosted by EvolCompGen
  • December 12, 2023 - Contextualizing protein representations using deep learning on protein networks and single-cell data by Michelle Li, Harvard University - Hosted by NetBio
  • December 20, 2023 - The Synthetic Patient - Predicting Health Trajectories for Diabetes eHRs by Francisco M. Ortuño, University of Granada - Hosted by CAMDA

    Preterm Birth Prediction Microbiome DREAM Challenge Rebroadcast
    by Tomiko Oskotsky, Abigail Kuntzleman, and Eunyong Kim

    January 10, 2023

    Leveraging Microbiome Data in the Era of Precision Medicine
    Our DREAM challenge was to predict (a) preterm or (b) early preterm birth from 9 publicly available studies of the vaginal microbiome representing 3578 samples from 1268 pregnant individuals, aggregated from raw sequences via an open-source tool, MaLiAmPi. We validated the crowdsourced models on 2 novel datasets. From 318 participants we received 148 and 121 submissions for our prediction tasks with top-ranking submissions achieving bootstrapped AUROC scores of 0.69 and 0.87 respectively.

    Random forest model accurately predicts early preterm labor
    About 10% of births worldwide are preterm (delivery before 37 weeks), and 2% of births worldwide are early preterm (delivery before 32 weeks). For the Preterm Birth Microbiome DREAM Challenge we used 3,578 vaginal microbiome samples from 1,268 individuals to predict both preterm and early preterm birth. We explore the use of a generative adversarial network (GAN) to generate additional training data for preterm birth prediction, but find that a basic random forest model trained on real relative abundances, diversity metrics, community state types, race, and collection week outperforms both a random forest and a support vector machine trained on generated relative abundance data. For early preterm birth prediction, we employ a basic random forest model and find that the most important features include diversity statistics, collection week, race, community state types, and many phylotypes. However, few of these features show a significant difference between early preterm and post-32-week samples, indicating that individual features on their own are not good predictors of early preterm birth. When tested on the validation dataset for the challenge, our early preterm birth prediction model had an AUC ROC of 0.87, an AUC PRC of 0.44, and an accuracy of 0.91.

    Prediction model construction of early-preterm birth via vaginal microbiomes based on ensemble learning approach
    Preterm birth (PTB), including early preterm birth, is estimated to account for 15 million births worldwide each year. PTB is a great concern as it is one of the leading causes of neonatal mortality, and inflammation associated with the vaginal microbiome is known as the major cause of PTB. Because of the complexity of the vaginal microbial environment in pregnancy, it is necessary to accurately predict (early) preterm birth using computational approaches based on microbiome characteristics and meta-information. In this Preterm Birth Prediction Microbiome DREAM Challenge, we constructed prediction models with selected features, handling the highly sparse and similar data points in the given raw data. We applied the minimum redundancy maximum relevance method to select relevant features. Then, various machine learning models were tested and combined into ensemble models to avoid overfitting and optimize performance. The constructed prediction models achieved AUCs of 0.635 and 0.841 for tasks 1 and 2, respectively.
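
    The abstracts above report bootstrapped AUROC scores. As a rough illustration of what such a score involves, the following sketch computes an AUROC from labels and predicted scores and then resamples the evaluation set with replacement to obtain a mean and a percentile interval. It is not the challenge's scoring code: the function names, the 95% percentile interval, and the toy data are all assumptions made for this example.

```python
import random


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc(labels, scores, n_boot=1000, seed=0):
    """Return (mean, lower, upper) of AUROC over bootstrap resamples.

    The lower/upper bounds are the 2.5th and 97.5th percentiles
    (an assumed convention, chosen only for this sketch).
    """
    rng = random.Random(seed)
    n = len(labels)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; AUROC is undefined
        estimates.append(auroc(ys, [scores[i] for i in idx]))
    estimates.sort()
    mean = sum(estimates) / n_boot
    lower = estimates[int(0.025 * n_boot)]
    upper = estimates[min(n_boot - 1, int(0.975 * n_boot))]
    return mean, lower, upper


# Toy data invented for illustration (1 = early preterm, 0 = not).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(toy_labels, toy_scores))  # 0.75: three of four pairs ranked correctly
print(bootstrap_auroc(toy_labels, toy_scores, n_boot=200))
```

    With real data, the labels and scores would come from a held-out validation set; the pairwise AUROC above is quadratic in sample count, which is acceptable for a sketch but slow for large cohorts.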



    Design principles of complex cellular decision-making networks in cancer
    by Mohit Kumar Jolly

    January 17, 2023

    Elucidating the design principles of regulatory networks driving cellular decision-making is of fundamental importance in mapping and controlling cell-fate. Despite their size and complexity, large regulatory networks often lead to a limited number of phenotypes. How this canalization is achieved remains largely elusive. Here, we investigated multiple different networks governing cellular plasticity during cancer metastasis, and identified a latent design principle in their topology that limits their phenotypic repertoire – the presence of two “teams” of nodes engaging in a mutually inhibitory feedback loop. These "teams" are specific to these networks and directly shape the phenotypic landscape and consequently the cell-fate trajectories. Our analysis reveals that network topology alone can contain information about phenotypic distributions it can lead to, thus obviating the need to simulate them. We present experimental evidence of such "teams" in transcriptomic datasets across many contexts (cancer cell plasticity in breast cancer, melanoma, lung cancer etc.). Overall, we propose these “teams” as a network design principle that can drive cell-fate canalization in diverse decision-making processes.



    Constructing disease trajectories at scale – incorporation of deep temporal information from patient records using text mining
    by Søren Brunak

    January 24, 2023

    Analyses of disease progression patterns in multimorbid patients typically try to find systematic patterns of risk factors, diseases and complications. Such analyses are complicated by the fact that certain risk factors can also present as complications, thus representing “promiscuous” diseases that appear in quite different contexts. Another problem is that similar outcomes can be caused by different mechanisms, or mixed etiologies, that can be difficult to disentangle longitudinally. Using population-wide registry data from Denmark (7-10 M patients) we construct disease and prescription trajectories that reflect these situations. However, compared to classical disease registries, electronic patient records contain even deeper information, for example in the clinical narratives. The talk will describe how the text in EHRs can add to the construction of disease trajectories and reflect temporal patterns that are not coded in structured form as conventional registry data are.



    Quantitative Systems Metabolism: Applications in basic research and personalized medicine
    by Nikolaus Berndt

    January 31, 2023

    In vivo studies of human metabolism are encumbered with serious ethical and technical issues. Only very few metabolic parameters can be directly measured, and many parameters, such as metabolic fluxes, are hard or impossible to assess experimentally.

    Inferring the response of a biological system to external or internal perturbations from the properties and interactions of its constituting molecules is a central goal of systems biology. For metabolic systems, reaching this goal requires the establishment of mathematical models enabling the computation of metabolite concentrations and fluxes at given external conditions (nutrients and hormones), level of metabolic enzymes, and the system’s history (e.g. current filling of nutrient stores).

    Classical approaches include biostatistical methods (gene set enrichment) and stoichiometric models (such as flux balance analysis) but neglect most of the known regulatory mechanisms. We develop comprehensive biochemistry-based kinetic models of the central metabolism for different organ systems taking into account the regulation of enzyme activities by their reactants, allosteric effectors, and hormone-dependent phosphorylation.

    The models have their utility in basic research, medicine, and pharmacology. Using proteomics data to scale maximal enzyme activities, the models help to investigate alterations in the metabolic functions of tissues or cells. Applications include genetic manipulations, diseases, and pharmacological treatments.

    Our computational models have reached a high level of maturity making them suitable as screening platforms on an individual basis enabling a mechanistic understanding of functional mitochondrial and metabolic alterations. In the future, we will use our platform to translate findings from animal studies by conducting virtual clinical trials applying the observed modes of action to real-world proteomic data of a well-characterized patient cohort. Furthermore, we aim to couple multiple organs to better understand whole-body metabolism.



    Substructural similarity searching: having more fun(ctions) with increasing biological structure data
    by Mohd Firdaus Raih

    February 14, 2023

    Although the functions of proteins and nucleic acids are determined by the 3D structures that they fold into, only a subset of residues is directly involved in a particular function’s mechanism. Such crucial residues form substructural 3D arrangements that are conserved as motifs and partake in molecular interactions such as binding sites, catalytic mechanisms, and the maintenance of specific folds or domains. Recently, protein structure prediction algorithms such as AlphaFold have computed highly accurate 3D structure models of protein sequences available in the UniProt database, resulting in more than 200 million models. The deposition rate of experimentally determined structures into the Protein Data Bank has also increased, and the archive surpassed the 200,000-entry mark in January 2023. Our research group has developed tools and resources that allow for the searching and comparison of 3D substructures in protein and RNA molecules. These capabilities allow for the functional annotation of existing structures as well as new experimentally determined or computationally generated structures with unknown functions, and thus provide further insights into the extent of diversity or conservation of functional mechanisms. This can in turn lead to knowledge regarding a wider repertoire of functions that use known molecular mechanisms and usher in a new era of structure-similarity-driven function annotation beyond the sequence-similarity-based function annotation that has been in use for decades. These substructure searching tools are available at http://mfrlab.org/grafss/.



    Structural Bioinformatics: From sequence to structure of protein
    by Arooj Shafiq

    February 27, 2023

    Protein sequence and three-dimensional structural analysis can provide valuable insight into a protein's function and biological significance. This tutorial session aims to explore protein sequence and structure retrieval databases, analysis resources, and tools. The session will include an introduction and hands-on training in exploring protein sequence (UniProt) and structural (Protein Data Bank (PDB)) databases with the analysis tools available in these databases, as well as a hands-on session on protein sequence similarity and homology through pairwise sequence alignment. Protein secondary structure prediction will be conducted using PSIPRED and TMHMM. Protein structural modeling and analysis will be done using SWISS-MODEL and PyMOL. All the software and databases used in the tutorial session are free, open source, web browser-based, and compatible with Windows or Mac OS X; none requires installation except PyMOL. A license for the educational version of PyMOL is available free of charge.



    Ontologies and Machine Learning for Metabolism: Towards a Virtuous Cycle
    by Janna Hastings

    March 2, 2023

    Small molecular metabolites are a fundamental component of biological processes. Bio-ontologies such as ChEBI describe and classify biologically relevant metabolites with information about their molecular structures, metadata such as names and identifiers, and biological activities. However, the number of metabolites in living systems far exceeds the size of available ontologies, and manual expert curation limits the speed at which knowledge resources can grow. Ontology-informed machine learning provides a promising approach to automatically extend chemical ontologies by using the information contained in the ontology to train a model that is able to automatically classify new metabolites in the ontology. In addition, we have seen that a model that was pre-trained on this ontology classification task was able to perform better on a subsequent toxicity prediction task than a model without pre-training. This shows that ontologies and machine learning can work together to create a virtuous cycle of knowledge capture and discovery.



    Navigating and Making Sense of Large 3D Visualizations of Biological Data
    by David Kouřil

    March 7, 2023

    Over the years, scientific 3D visualization technology has matured to the extent where large and complex datasets can be displayed at interactive framerates, shifting the bottleneck in visualization from rendering to intuitive and effective interaction. In this talk, I will introduce my research in solving challenges related to biological data featuring several specific characteristics: they are multi-scale, multi-instance, three-dimensional, and incredibly dense. I will summarize my doctoral research work aimed at visualization for science communication. Afterward, I'll introduce our recent work on web-based visualization of 3D chromatin structures modeled from Hi-C data.



    Re-Thinking the Patient's Role in a Learning Health System: Lessons from the Patient-Led Research Collaborative
    by Hannah Wei

    March 14, 2023

    So often, a patient’s role in research has been limited to the donor of data and receiver of care, limiting the possibilities in between for participatory methodologies and open science collaboration.

    This webinar will be a case study of the patient-led research model. We will explore how the model was formed under the context of those with COVID-19 developing their own research capacity to study Long COVID, or Post-Acute Sequelae of COVID (PASC). We will present the levels of patient-involvement, data ownership and access principles, and accommodations for disabled contributors from our community. We will further discuss how patient-generated data can guide research direction and inform research hypotheses, while patients’ lived experiences, experiments and own research can enrich a learning health system. We will suggest ways to incorporate patient-led research models into open science innovations.

    The talk will be delivered by a patient-researcher, co-founder and technologist at the Patient-Led Research Collaborative, a multi-disciplinary, patient-run organization dedicated to placing patient voices at the forefront of Long COVID research.



    Comparative analysis of information-theory-based statistical methods and transformer-based machine learning techniques for scientific literature classification
    by Ihor Stepanov

    March 21, 2023

    Scientific literature grows very fast. One of the first studies of scientific literature production was conducted by De Solla Price, who used publication data collected over 100 years (1862–1961) to calculate a doubling time. The results showed a doubling time of 13.5 years for the scientific corpus, corresponding to a 5.1% annual growth rate (de Solla Price, 1965). The development of technologies created conditions for scientific literature production, which made scientific information more accessible and introduced new challenges.
    Our research focuses on the biomedical domain, which is one of the largest and most rapidly developing. The accessibility of biomedical literature through databases such as Medline (Medline, 2021), together with research activity in biomedicine, creates an opportunity to use natural language processing (NLP) techniques.
    We implemented an information-theory-based statistical approach and compared it with modern transformers on a relevant practical task: classifying biomedical papers related to Drug-Induced Liver Injury (DILI) as part of the CAMDA 2022 Challenge 1. DILI is a clinically significant condition and is one reason for drug registration failures. Scientific literature is the primary source of information related to DILI. Thus, collecting and processing vast amounts of biomedical literature can help pharma companies, research organizations, and regulators to find relevant information.
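
    The doubling time and growth rate quoted above are linked by the standard exponential-growth relation. The short sketch below works through that arithmetic under two common conventions (annual compounding and continuous growth). Which convention the cited study used is not stated in the abstract, so the function names and the choice of conventions here are assumptions for illustration only.

```python
import math


def doubling_time_annual(rate):
    """Years to double when growth compounds once per year at `rate`."""
    return math.log(2) / math.log(1 + rate)


def doubling_time_continuous(rate):
    """Years to double under continuous exponential growth at `rate`."""
    return math.log(2) / rate


def annual_rate_from_doubling(years):
    """Annually compounded growth rate implied by a doubling time."""
    return 2 ** (1 / years) - 1


# 5.1% growth, as quoted in the abstract.
print(round(doubling_time_annual(0.051), 2))      # about 13.93 years
print(round(doubling_time_continuous(0.051), 2))  # about 13.59 years

# Growth rate implied by a 13.5-year doubling time.
print(round(100 * annual_rate_from_doubling(13.5), 2))  # about 5.27 percent
```

    Under the continuous-growth convention, a 5.1% rate gives a doubling time of roughly 13.6 years, close to the 13.5 years quoted; with annual compounding the same rate gives roughly 13.9 years.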



    Machine learning enables prediction of metabolic system evolution in bacteria
    by Naoki Konno

    April 11, 2023

    Diverse organisms change their genomes during evolution, and prediction of those changes is a long-standing problem. While recent laboratory evolution studies have shown the predictability of short-term and sequence-level evolution, that of long-term and system-level evolution has not been systematically examined. Here, we show that the gene content evolution of metabolic systems is generally predictable by applying ancestral gene content reconstruction and machine learning techniques to ~3000 bacterial species. Our computational framework, Evodictor, successfully predicted gene gain and loss evolution at the branches of the reference phylogenetic tree, suggesting that evolutionary pressures and constraints on metabolic systems are universally shared. Investigation of pathway architectures and meta-analysis of metagenomic datasets confirmed that these evolutionary patterns have physiological and ecological bases as functional dependencies among metabolic reactions and bacterial habitat changes. Furthermore, pan-genomic analysis of intraspecies gene content variations showed that even “ongoing” evolution in extant bacterial species is predictable in our framework. I herein present our findings on the predictability of biological system evolution, and discuss future perspectives on the versatility of the Evodictor concept for extracting evolutionary rules from growing datasets of genetic/phenotypic traits and megaphylogenies.



    Immune cell profiling from single cell RNA with R
    by Meghana Kshirsagar, Gauri Vaidya

    April 21, 2023

    Over the past couple of decades, immunotherapy treatments have been widely adopted as an alternative treatment for a variety of cancers. The study of immune cells in the tumour microenvironment, such as macrophages, T cells and B cells, can help to unravel the mystery of differential outcomes of immunotherapy treatments. Gene expression profiling can help to identify the patterns of genes expressed in major immune cells amongst cohorts of patients at different stages of cancer to generate new biological hypotheses. Statistical approaches can facilitate the identification of highly variable genes and their expression in immune cells through analysis of scRNA sequencing data. The tutorial will be divided into three parts: comparing popular annotation tools, applying dimensionality reduction techniques to support multi-stage downstream analysis of scRNA data, and extracting crucial insights from immune cell populations and subpopulations. Throughout the tutorial we will follow the Seurat pipeline, version 4.0.



    A systematic search for RNA structural switches across the human transcriptome
    by Matvei Khoroshkin

    May 2, 2023

    RNA structural switches are key regulators of gene expression in bacteria, yet their characterization in Metazoa remains limited. Here we present SwitchSeeker, a comprehensive computational and experimental approach for systematic identification of functional RNA structural switches. We applied SwitchSeeker to the human transcriptome and identified 245 putative RNA switches. To validate our approach, we characterized a previously unknown RNA switch in the 3’UTR of the RORC transcript. In vivo DMS-MaPseq, coupled with cryogenic electron microscopy, confirmed its existence as two alternative structural conformations. Furthermore, we used genome-scale CRISPR screens to identify trans factors that regulate gene expression through this RNA structural switch. We found that nonsense-mediated mRNA decay acts on this element in a conformation-specific manner. SwitchSeeker provides an unbiased, experimentally-driven method for discovering RNA structural switches that shape the eukaryotic gene expression landscape.

    Use this link to register for this webinar: https://pennmedicine.zoom.us/meeting/register/tJIscuiurTIpE9EDBk5Qour1gyweNlBMxyqp



    Creating genome-wide maps of identical repeats that can mediate complex genomic rearrangements in the human genome
    by Claudia Gonzaga-Jauregui

    May 9, 2023

    Repeated sequences spread throughout the human genome play a role in genomic stability, structural variant complexity, and the generation of complex genomic rearrangements contributing to disease burden. Over time, experimental analyses of genomic rearrangements have identified features that characterize repeated regions as substrates for cellular mechanisms that can generate genomic rearrangements. Bioinformatic analyses of the key features of repeated sequences such as orientation, size, similarity, and distribution can help predict possible rearrangements, type of structural variants, and major mechanisms occurring at specific genomic regions. The study of these genomic features can help identify regions that are prone to result in genomic disorders and provide insights into mechanisms of genome evolution.



    PRECISE MICROBIOME GENOMICS TO PRECISION MEDICINE
    by Ami S Bhatt

    May 16, 2023

    More than 1,000 species of bacteria, archaea, viruses and fungi live in the human gut. Far from being passive passengers, these organisms strongly interact with one another and with their host’s metabolism and immune system. Compelling early experiments have demonstrated associations between the intestinal microbiome composition and obesity, cardiovascular diseases, and certain cancer chemotherapies’ efficacy. Yet teasing apart the mechanisms by which microbes impact host health has been challenging. To accelerate an otherwise challenging, slow and tedious process and to deconstruct mechanisms, we must critically examine our existing “tool kit” for studying the microbiome, and mature our measurement tools to meet these challenges. Our translational laboratory builds and designs observational and interventional clinical cohorts to study. We have also steadfastly worked to develop genomic tools to study strain level dynamics of the microbiome, how microbial genomes change over time and how microbes use hidden “microproteins” to communicate with each other and their human hosts. In this presentation, I will speak about three recent developments in our lab: (1) I will introduce the importance of absolute quantification in microbiome research, and our efforts toward that. This will form the basis of revised estimates of the number of microbial cells in a human body. (2) I will give an overview of a new computational workflow called “Phanta”, which enables simultaneous taxonomic profiling of eukaryotes, bacteria and viruses in a human gut metagenomic sample. (3) I will share exciting new unpublished work on our discovery of intragenic inversion as a previously unappreciated mechanism of generating genetic diversity in microbial genes.



    Introduction to Explainable AI
    by Lisa Barros de Andrade e Sousa, Helmholtz Munich; Donatella Cea, Helmholtz Munich; Elisabeth Georgii, Helmholtz Munich; Helena Pelin, Helmholtz Munich

    May 30, 2023

    Complex supervised Machine Learning (ML) and Deep Learning (DL) models are often considered to be “Black Boxes” because it can be hard to understand why the model has made certain predictions. Particularly in biology, it is increasingly important not just to accurately predict the outcome of a biological system with a machine learning model but also to uncover the mechanisms within that system that led to a certain outcome. To uncover these underlying mechanisms, we have to work on the interpretability of our models, which are able to learn the underlying patterns in our data. Besides this important scientific aspect, interpretability can play an important role in addressing problems related to safety and ethics. To deploy our models to the real world, it is indispensable not only to make accurate predictions but also to understand the logic behind those predictions. Only by understanding the model’s decision-making process can we ensure that the model produces valuable insights. Indeed, the application of interpretability methods allows us to identify potential biases, ensure fairness, and prevent gender, social, and race discrimination. Moreover, in many fields, and in the biomedical domain in particular, explainability can be the key to acceptance and trust.
    During this 2-hour tutorial on eXplainable Artificial Intelligence (XAI), we want to identify the main reasons why interpretability is so important in an era where AI is becoming more and more pervasive. After this general introduction, we will focus on model-agnostic state-of-the-art methods like Permutation Feature Importance and SHAP, as well as model-specific methods like Forest-Guided Clustering. During the practical sessions, the participants can discuss those methods with peers and mentors, and get familiar with the most popular Python libraries for XAI.
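
    To give a flavour of one of the model-agnostic methods named above, here is a minimal sketch of Permutation Feature Importance: each feature column is shuffled in turn, and the resulting drop in a performance score is taken as that feature's importance. This is not the tutorial's material. The function names, the use of accuracy as the score, and the toy model and data are assumptions made for this illustration.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling each feature column.

    `model` is any callable mapping a feature row (a list) to a label,
    so the procedure does not depend on the model's internals.
    """
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the labels
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical model that looks only at feature 0 and ignores feature 1.
def toy_model(row):
    return int(row[0] > 0.5)


X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
     [0.6, 3.0], [0.7, 5.0], [0.8, 1.0], [0.9, 2.0]]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)  # feature 0 has a positive importance; feature 1 is exactly 0.0
```

    Because the toy model ignores feature 1, shuffling that column never changes a prediction, so its importance is zero, while shuffling feature 0 degrades accuracy.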



    Anti-PD1 DREAM Challenge Rebroadcast
    by Jing Tang, Raghvendra Mall, Óscar Lapuente-Santana, Teemu Laajala, James Kozloski

    June 2, 2023

    Non–small cell lung cancer (NSCLC) is the leading cause of cancer-related death worldwide and a complex disease where multiple treatment options are required to address the needs of different patient populations. Significant progress has been made in the first-line treatment for patients with advanced NSCLC, ineligible for targeted therapy, by the investigation of checkpoint pathway blockade. Such immunotherapy (I-O)-based treatment regimens have been assessed as a monotherapy in patients whose tumors express programmed death ligand 1 (PD-L1), as well as in combination with chemotherapy, regardless of PD-L1 expression. While durable responses and prolonged survival have been demonstrated in some patients treated with I-O, there remains a high disease burden and a need to improve the ability to predict which patients are more likely to receive benefit from treatment with I-O.

    CheckMate 026 is a phase III, international, randomized, open-label trial comparing the efficacy and safety of single-agent nivolumab with those of platinum-based chemotherapy as first-line therapy in patients with NSCLC and tumor PD-L1 ≥1%. While nivolumab had a favorable safety profile compared with chemotherapy, progression-free survival (PFS) was not significantly longer with nivolumab, and overall survival was similar between treatment groups. Owing to the complexity of the immune system, novel biomarkers of response to I-O–based treatments are still needed to better predict survival and durable responses for a subset of NSCLC patients, and to facilitate better patient stratification in the development of novel therapeutic approaches for NSCLC patients. While immunostaining for PD-L1 expression is currently used to identify patients eligible for I-O monotherapy, several challenges remain with accurate detection and scoring of PD-L1.

    This Anti-PD-1 Response DREAM Challenge is a crowdsourced effort aiming to identify predictive biomarkers of response to anti-PD-1 monotherapy (nivolumab) in patients with metastatic NSCLC, using clinical data and gene expression data from CheckMate 026.

    Hosted by:

    - top -


    3D-BioInfo Webinar on Protein Design and Evolution
    by

    June 20, 2023

    ISCB’s 3DSIG, in collaboration with Elixir 3D-Biodata Community, is pleased to announce this upcoming webinar. This webinar will have two speakers. Detailed information on the program and how to join the session are located at https://elixir-europe.org/events/3d-bioinfo-webinar-protein-design-and-evolution

    Hosted by:

    - top -


    Introduction to Open Science
    by Jessica Miller

    June 26, 2023

    Open Access has emerged in recent decades as a way to make scientific research accessible, without being behind a paywall. Many funders are now requiring Open Access publication, and the landscape is rapidly changing – for researchers and publishers. This has also resulted in the emergence of new funding models which don’t place the cost of Open Access publishing on the authors.
    Building on Open Access, Open Science seeks to open up research and the research process even further. For example, open data policies allow the code and data underlying a study to be made available, enhancing replicability. Transparent peer review, whereby the reviews, decision letters and responses are published alongside an article, can provide insight into the journey of a manuscript. Preprints are a way to give readers access to an article, before peer review and publication. They can also facilitate feedback on manuscripts prior to formal submission to a journal.
    During this talk, we will explain some of the background to Open Science, including an overview of Open Access. We will then go into some of the concepts that Open Science entails, including open data, preprints and transparent peer review. The advantages of the concepts will also be discussed, along with any potential pitfalls. Finally, we will start to explore where Open Science may be headed in the future.

    Click here to watch

    Hosted by:

    - top -


    Learning to understand life: how to gain knowledge of complex living systems (from a limited number of samples)
    by Sanne Abeln

    June 27, 2023

    Recent advances in machine learning have opened new paths for making predictions based on large sets of complex life science data. However, is it also possible to gain an understanding of the complex underlying systems? And, is this possible when there is only a small number of labelled samples available?
    Firstly, we demonstrate how to analyse the impact of a genomic alteration using system level response data. We employed machine learning methodology to explore the problem and subsequently developed a statistical approach to analyse the impact of genomic alterations using as few as 20 samples. Applying this approach, we identified several novel structural variants that are likely to have a significant impact on the development of colorectal cancer.
    In a second example, we demonstrate that it is possible to predict which proteins are likely to be found in extracellular vesicles (EVs). By using meaningful input features, we can gain an understanding of both the vesicle-sorting mechanism and why specific proteins are likely to be found in vesicles. In addition, we can shed light on which experimental EV extraction protocols are most reliable. Finally, I illustrate how we can tackle similar questions that have a limited number of labelled data by using multi-task deep learning architectures that can be enhanced by using labelled data from related problems.

    Click here to watch

    Hosted by:

    - top -


    Computational Exploration of Lung Function Genetics Across Populations via Public GWAS Data Integration
    by Afeefa Zainab

    July 4, 2023

    Chronic obstructive pulmonary disease (COPD) is highly prevalent and a leading cause of death worldwide. Several GWAS have been performed across multiple populations to measure lung function and identify loci associated with COPD. Population-specific GWAS show that populations differ in their ancestral genetic composition for the same disease. To analyze trans-ethnic genetics, GWAS meta-analysis is the commonly used method; however, meta-analysis has some limitations in terms of genetic heterogeneity when used for cross-population GWAS analysis, even though trans-ethnic analyses are becoming increasingly important for personalized medicine in each population. In this study, we proposed a trans-ethnic linkage disequilibrium (LD) analysis to identify common and unique functional variants in different population cohorts.
    Lung function measurement is used as an indicator for the risk prediction of COPD; therefore, we used lung function GWAS data from two populations. The results from the Japanese and European population GWAS for lung function were re-evaluated using a trans-ethnic LD approach.
    This study identified nine novel independent significant single nucleotide variants (SNVs) and four lead SNVs in three genomic risk loci in the Japanese GWAS, whereas five novel lead SNVs and 17 novel independent significant SNPs were identified in 21 genomic risk loci in the European population. Comparative analysis revealed 28 genes common to the prioritized gene lists of both populations. We also performed a meta-analysis-based post-GWAS analysis, which identified 18 genes common to both populations, fewer than our approach. Our approach identified significant novel associations and genes that have not been previously reported or were missed in the meta-analysis.

    Hosted by:

    - top -


    When the outlier is the signal: RNA-seq based diagnostics of rare disorders
    by Vicente Yepez

    July 4, 2023

    RNA sequencing has emerged as a powerful complementary approach to DNA sequencing to discover disease-causing gene regulatory defects in individuals affected by rare disorders. However, as large rare disease consortia are implementing RNA-seq-based diagnostics, new challenges arise to cope with the heterogeneity of diseases, various tissues and sample sizes, and the multiplicity of interpreters. I will talk about different statistical methods that we have developed in the lab to aid rare disease diagnostics using RNA-seq data. These include autoencoder-based methods to detect outliers, variant effect prediction algorithms, and quality control processes. We have developed and benchmarked them thoroughly using the multi-tissue GTEx dataset of > 8,000 samples and applied them to a variety of rare disease cohorts, including the European Solve-RD and the North American UDN. Altogether, our new algorithms and processes contribute to improved methodologies for RNA-seq-based rare disease diagnostics.

    Hosted by:

    - top -


    Systems Biomedicine and Pharmaceutics: Multiscale Modeling of Tissue Remodeling and Damage
    by Ashlee N Ford Versypt

    August 29, 2023

    Dr. Ford Versypt leads the Systems Biomedicine and Pharmaceutics research lab, which develops and uses multiscale systems engineering approaches including mathematical modeling and computational simulation to enhance understanding of the mechanisms governing tissue remodeling and damage as a result of diseases and infections and to simulate the treatment of those conditions to improve human health. The lab specializes in (a) modeling mass transport of biochemicals through heterogeneous porous materials—primarily extracellular matrices—that change morphology dynamically due to the influence of chemical reactions and (b) modeling dynamic, multi-species biological systems involving chemical, physical, and biological interactions of diverse, heterogeneous cell populations with these materials and the chemical species in tissue microenvironments.

    In this seminar, vignettes of three lines of research will be highlighted including (1) glucose-stimulated damage to kidney cells during diabetes, (2) viral and immune-induced damage in SARS-CoV-2 infected lung tissue, and (3) bone restoration via dietary supplementation of short chain fatty acids. The work is currently supported by an NSF CAREER award and NIH R35 MIRA and R21 grants.

    Click here to watch

    Hosted by:

    - top -


    Getting Started with Peer Review
    by Jessica Miller

    September 20, 2023

    Peer review plays a crucial role in helping to determine whether an article meets the quality and suitability for publication in a given journal. In this webinar, we will give an overview of the peer review process which will shine a light on where peer review fits in with the overall publication process. We will cover the basics of peer reviewing for a journal. This will cover topics such as deciding whether to accept an invitation, structuring a review, correspondence with the journal office, and a number of other factors that it’s important to take into consideration.

    We will also go on to think about the bigger picture and discuss some of the questions, opportunities and challenges associated with peer review. What are the incentives for peer review? How can we address reviewer fatigue? What are the challenges in ensuring peer review is carried out in an ethical way?

    This webinar will be particularly suitable for early career researchers who may be getting started with peer review and would like to find out more. However, researchers from all stages of their career are welcome to attend.

    Jessica works as an Editorial Coordinator for Royal Society Publishing. She helps facilitate the submission and peer review processes for Journal of the Royal Society Interface and Interface Focus, as well as organising social media for both journals.

    Click here to watch

    Hosted by:

    - top -


    LinkML: an open data modeling framework, grounded with ontologies
    by Sierra Moxon

    October 3, 2023

    The Linked data Modeling Language (LinkML, https://linkml.io) is an open, extensible modeling framework that allows computers and people to work cooperatively to document, validate, and distribute data that is reusable and interoperable. It provides a flexible yet expressive standard for describing many kinds of data models from value sets and flat, checklist-style standards to complex normalized data structures that use polymorphism and inheritance. LinkML enables even non-developers to create data models. It is designed to allow software engineers and subject matter experts to communicate effectively in the same language, while also providing the semantic underpinnings to make data easier to validate, understand and reuse computationally. LinkML has an active and growing user community, and has seen uptake by projects including cancer data harmonization, environmental genomics, microbiome data, knowledge graph integration, ontology mappings and language profiles.

    In this talk, which expands on the short talk I gave at BOSC 2023, I will describe the LinkML framework and give examples demonstrating how to use it for biological data modeling, ontology-supported validation, conversion, and serialization. I will discuss how LinkML was used to create the Biolink Model, a unifying data model that brings together heterogeneous datasets to answer novel biomedical questions. I’ll also introduce the role the Biolink Model plays in the NCATS Biomedical Data Translator project.

    Click here to watch

    Hosted by:

    - top -


    Clustering predicted structures at the scale of the known protein universe
    by Pedro Beltrao

    October 5, 2023

    Proteins are key to all cellular processes and their structure is important in understanding their function and evolution. Sequence-based predictions of protein structures have increased in accuracy, with over 214 million predicted structures available in the AlphaFold database (AFDB). However, studying protein structures at this scale requires highly efficient methods. Here, we developed a structural-alignment-based clustering algorithm - Foldseek cluster - that can cluster hundreds of millions of structures. Using this method we have clustered all structures in AFDB, identifying 2.27M non-singleton structural clusters, of which 31% lack annotations, representing likely novel structures. Clusters without annotation tend to have few representatives, covering only 4% of all proteins in the AFDB. Evolutionary analysis suggests that most clusters are ancient in origin but 4% seem species specific, representing lower quality predictions or examples of de-novo gene birth. Additionally, we show how structural comparisons can be used to predict domain families and their relationships, identifying examples of remote homology. Based on these analyses we identify several examples of human immune related proteins with remote homology in prokaryotic species, which illustrates the value of this resource for studying protein function and evolution across the tree of life.

    Click here to watch

    Hosted by:

    - top -


    Antimicrobial Resistance Prediction and Forensics CAMDA 2023
    by Nelly Sélem Mojica

    October 10, 2023

    Taxonomic and Antimicrobial Resistance (AMR) patterns arise from microorganisms of different cities. Every year, the Community of Interest Critical Assessment of Massive Data Analysis (CAMDA) provides a challenge that helps scientists build capacities and data practices. We explored microbiome data from 16 cities. Samples from 2016 and 2017 were supplied by MetaSUB, with the aim of identifying a mysterious city given an AMR pattern. Here, we address both 1) the forensic geolocalization challenge, i.e., given a training set, predicting the city label of a test set, and 2) discovering the mysterious city given an AMR profile. We used Negative Binomial regression to address variable selection by identifying differentially abundant OTUs, using its results to reduce the number of OTUs and the sparsity of the dataset. Additionally, when applied to a mysterious sample, topological data analysis suggests that Escherichia and Klebsiella may contain resistance markers due to horizontal gene transfer, but that is not the case for Enterobacter. Finally, a meta-pangenome study suggests some gene families that may help to discriminate between cities with similar patterns.

    Click here to watch

    Hosted by:

    - top -


    Challenges and applications of small databases in mass spectrometry-based proteomics data analysis
    by Andy Lin

    October 23, 2023

    The most common statistic for assigning statistical significance to peptide detections that result from a proteomics experiment is the false discovery rate (FDR). The FDR of a set of peptide-spectrum matches (PSMs) is typically estimated through a process called target-decoy competition, where spectra generated from a proteomics experiment are searched against a database of target and decoy sequences. This methodology is well adapted to most proteomics experiments, where the database is large or many proteins in the database are present in the sample. However, this methodology is challenged by small databases or when few peptides in the database are present in the sample. Recent advances, such as subset-neighbor search, aim to address this challenge. Unfortunately, we show this approach fails for cases when the database is extremely small. In addition to discussing the unclear future of how to approach this scenario, we give several example applications that could utilize future solutions.
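
    The target-decoy competition described above can be sketched in a few lines. This is an illustrative toy, not the speaker's method: the function names, the scores, and the tie-breaking rule are assumptions, and the estimate used is the commonly cited conservative form (decoys + 1) / targets.

```python
def compete(target_scores, decoy_scores):
    """For each spectrum keep only the better of its best target and best decoy match.

    Returns a list of (score, is_decoy) tuples. Ties go to the decoy here,
    which is a conservative simplification chosen for this sketch.
    """
    winners = []
    for target, decoy in zip(target_scores, decoy_scores):
        if target > decoy:
            winners.append((target, False))
        else:
            winners.append((decoy, True))
    return winners


def tdc_fdr(winners, threshold):
    """Estimate the FDR among target PSMs scoring at or above `threshold`."""
    targets = sum(1 for score, is_decoy in winners if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in winners if score >= threshold and is_decoy)
    if targets == 0:
        return 1.0  # nothing accepted, so report the most pessimistic value
    return min(1.0, (decoys + 1) / targets)


# Made-up scores for six spectra (one best target and one best decoy each).
target_scores = [9.1, 8.4, 7.7, 3.2, 2.9, 6.5]
decoy_scores = [4.0, 3.1, 2.2, 3.9, 1.5, 2.0]

winners = compete(target_scores, decoy_scores)
print(tdc_fdr(winners, 5.0))  # 4 targets, 0 decoys -> 0.25
print(tdc_fdr(winners, 2.0))  # 5 targets, 1 decoy  -> 0.4
```

    One thing the toy numbers make visible: with only a handful of accepted targets, the +1 term keeps the estimate from falling below 1/targets, so very few detections can never reach a low estimated FDR.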

    Click here to watch

    Hosted by:

    - top -


    The BioCyc Web Services API
    by Suzanne Paley

    October 24, 2023

    BioCyc.org is an extensive web portal containing 20,000 genomes and associated metabolic pathways for microbes, model eukaryotes, and humans. The BioCyc website provides extensive bioinformatics tools for searching and analyzing these databases, and leveraging them for analysis of omics datasets. In addition, BioCyc offers several types of web services for programmatically retrieving and managing data and creating visualizations. This tutorial will provide a survey of the different services we offer, with detailed examples.

    Click here to watch

    Hosted by:

    - top -


    Using the Ensembl genome browser and REST API to retrieve genome annotation data Part 1
    by Aleena Mushtaq

    November 2, 2023

    The Ensembl project provides freely available access to genome annotation datasets including gene, variant and regulatory feature annotation as well as comparative genomics analyses for over 300 vertebrate species and 30,000 non-vertebrate eukaryotes and prokaryotes. All of the data can be retrieved through Ensembl’s online genome browsers (www.ensembl.org, www.ensemblgenomes.org and rapid.ensembl.org) as well as programmatically via Ensembl’s REST API.

    This workshop will introduce you to the range of data available through Ensembl and the appropriate platforms to visualise and extract it for analysis and interpretation. The morning sessions of this workshop will guide you through the Ensembl genome browser and BioMart interfaces to show you how to explore and retrieve both small and large-scale datasets. In the afternoon session, this workshop will introduce you to the concepts of the Ensembl REST API and guide you through the principles of retrieving Ensembl data programmatically using both Python and R.

    To participate in the hands-on aspects of this workshop, including live polling, exploring online databases and exercises, you will need to bring a laptop. Workshop materials, including slides, screenshots, exercises, sample files and solutions will be available before the workshop and will remain permanently online at the Ensembl training portal: https://training.ensembl.org.

    Learning objectives


    1. Outline the different data types available through Ensembl

    2. Identify the appropriate methods for data retrieval from Ensembl

    3. Visualise and retrieve genome annotation data through the online Ensembl genome browser

    4. Export custom datasets from Ensembl using the BioMart export tool

    5. Perform queries and extract data returned from the Ensembl REST API
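
    As a flavour of the last objective, the sketch below builds a gene lookup request against the public Ensembl REST server and parses the JSON reply using only the Python standard library. It is a minimal example written for this listing, not workshop material; the helper names lookup_symbol_url and fetch_json are assumptions, and the endpoint details should be checked against the Ensembl REST documentation.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol):
    """Build the URL for looking up a gene by species name and gene symbol."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    # Ask the server for a JSON response via the content-type query parameter.
    return SERVER + path + "?content-type=application/json"


def fetch_json(url, timeout=30):
    """Send a GET request and decode the JSON body (requires network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = lookup_symbol_url("homo_sapiens", "BRCA2")
print(url)
```

    Calling fetch_json(url) with a working internet connection should return a dictionary describing the gene; the exact fields returned are best confirmed in the Ensembl REST documentation.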


    Maximum number of participants
    30

    Intended audience and level
    The workshop is aimed at new and existing Ensembl users, from both the wet-lab and bioinformatics communities. The workshop is designed to provide participants with a greater understanding of the data available through the Ensembl interfaces and how to efficiently retrieve it at various scales.
    There are no prerequisites for this workshop, although a basic understanding of programming with Python or R would be beneficial. For the interactive aspects of this workshop, participants are required to bring their personal laptops.

    Click here to watch

    Hosted by:

    - top -


    Using the Ensembl genome browser and REST API to retrieve genome annotation data Part 2
    by Aleena Mushtaq

    November 3, 2023

    The Ensembl project provides freely available access to genome annotation datasets including gene, variant and regulatory feature annotation as well as comparative genomics analyses for over 300 vertebrate species and 30,000 non-vertebrate eukaryotes and prokaryotes. All of the data can be retrieved through Ensembl’s online genome browsers (www.ensembl.org, www.ensemblgenomes.org and rapid.ensembl.org) as well as programmatically via Ensembl’s REST API.

    This workshop will introduce you to the range of data available through Ensembl and the appropriate platforms to visualise and extract it for analysis and interpretation. The morning sessions of this workshop will guide you through the Ensembl genome browser and BioMart interfaces to show you how to explore and retrieve both small and large-scale datasets. In the afternoon session, this workshop will introduce you to the concepts of the Ensembl REST API and guide you through the principles of retrieving Ensembl data programmatically using both Python and R.

    To participate in the hands-on aspects of this workshop, including live polling, exploring online databases and exercises, you will need to bring a laptop. Workshop materials, including slides, screenshots, exercises, sample files and solutions will be available before the workshop and will remain permanently online at the Ensembl training portal: https://training.ensembl.org.

    Learning objectives


    1. Outline the different data types available through Ensembl

    2. Identify the appropriate methods for data retrieval from Ensembl

    3. Visualise and retrieve genome annotation data through the online Ensembl genome browser

    4. Export custom datasets from Ensembl using the BioMart export tool

    5. Perform queries and extract data returned from the Ensembl REST API


    Maximum number of participants
    30

    Intended audience and level
    The workshop is aimed at new and existing Ensembl users, from both the wet-lab and bioinformatics communities. The workshop is designed to provide participants with a greater understanding of the data available through the Ensembl interfaces and how to efficiently retrieve it at various scales.
    There are no prerequisites for this workshop, although a basic understanding of programming with Python or R would be beneficial. For the interactive aspects of this workshop, participants are required to bring their personal laptops.

    Click here to watch

    Hosted by:

    - top -


    Exploring the Potential and Risks of Third-Generation Long-Read Transcriptome Sequencing for Unraveling Transcriptome Complexity.
    by Ana Conesa

    November 7, 2023

    Third-generation long-read sequencing technologies offer the potential to sequence entire transcripts and unravel the intricacies of transcriptomes. However, the analysis of long-read transcriptomics (lrRNA-seq) data presents numerous challenges. These challenges encompass distinguishing biological variability from technical noise, accurately predicting transcript models, providing precise estimates of transcript expression levels and differential expression, and elucidating the biological significance of isoform diversity. In my presentation, I will showcase the research conducted in my laboratory that addresses these challenges. Moreover, I will discuss how, through the implementation of appropriate experimental techniques and bioinformatics approaches, lrRNA-seq has the capability to unveil fresh insights into the biology of the transcriptome.

    Click here to watch

    Hosted by:

    - top -


    Empirical evaluations of large language models for bioinformatics education and research
    by Stephen Piccolo

    November 9, 2023

    Large language models (LLMs) have ignited great interest in the bioinformatics community. Many of us are interested in their potential uses (and misuses?) in research and education. The Piccolo Lab is undertaking empirical research to answer three such questions. 1) How well can LLMs translate human-language prompts into functional computer code? 2) How well can LLMs translate code snippets from one programming language to another? 3) How effectively can LLMs act as virtual assistants to students who are learning to write code? We have addressed each of these questions in the context of introductory bioinformatics and biostatistics education at the undergraduate level. In this seminar, I will describe our findings and discuss potential practical implications for education and research.

    Click here to watch

    Hosted by:

    - top -


    Phylogenetic network estimation from site patterns using composite likelihood
    by Sungsik Kong

    November 10, 2023

    While phylogenies have been essential in understanding how species evolve, they do not adequately describe some evolutionary processes. For instance, hybridization, a common phenomenon where interbreeding between two species leads to formation of a new species, must be depicted by a phylogenetic network, a structure that modifies a phylogenetic tree by allowing two branches to merge into one, resulting in reticulation. Here, we propose a novel method, PhyNEST (Phylogenetic Network Estimation using SiTe patterns), that estimates level-1 phylogenetic networks directly from sequence alignment. PhyNEST achieves computational efficiency by using composite likelihood, and accuracy by using the full genomic data to incorporate all sources of variability. To search network space, we implement both hill-climbing and simulated annealing algorithms. Simulation studies show that PhyNEST is robust to model misspecification and leads to improved accuracy in many of the cases evaluated, in comparison to SNaQ and PhyloNet, which use composite likelihood and a set of gene trees as input. PhyNEST is implemented in an open-source Julia package and is publicly available.
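
    To make the notion of site patterns concrete, the sketch below counts them for four aligned sequences: each alignment column is reduced to a pattern such as xxyy, recording only which sequences share a base. This is a generic Python illustration of the input statistic, not PhyNEST's Julia implementation; the function names and the toy alignment are assumptions.

```python
from collections import Counter


def canonical_pattern(column):
    """Relabel the bases in one column by order of first appearance (x, y, z, w)."""
    labels = {}
    pattern = []
    for base in column:
        if base not in labels:
            labels[base] = "xyzw"[len(labels)]
        pattern.append(labels[base])
    return "".join(pattern)


def site_pattern_counts(alignment):
    """Count site patterns over the columns of a four-sequence alignment.

    Columns containing anything other than A, C, G or T (gaps, ambiguity
    codes) are skipped in this sketch.
    """
    counts = Counter()
    for column in zip(*alignment):
        if all(base in "ACGT" for base in column):
            counts[canonical_pattern(column)] += 1
    return counts


# Toy alignment of four taxa, six sites each.
alignment = [
    "ACGTAC",
    "ACGTAA",
    "ACCTGA",
    "ATCTGC",
]

counts = site_pattern_counts(alignment)
print(dict(counts))
```

    For this toy alignment the counts are xxxx: 2, xxxy: 1, xxyy: 2 and xyyx: 1; frequencies of this kind are the sort of summary a site-pattern-based likelihood is computed from.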

    Hosted by:

    - top -


    Contextualizing protein representations using deep learning on protein networks and single-cell data
    by Michelle Li

    December 12, 2023

    Understanding protein function and developing molecular therapies require deciphering the cell types in which proteins act as well as the interactions between proteins. However, modeling protein interactions across diverse biological contexts, such as tissues and cell types, remains a significant challenge for existing algorithms. We introduce PINNACLE, a flexible geometric deep learning approach that is trained on contextualized protein interaction networks to generate context-aware protein representations. Leveraging a human multi-organ single-cell transcriptomic atlas, PINNACLE provides 394,760 protein representations split across 156 cell type contexts from 24 tissues and organs. PINNACLE's contextualized representations of proteins reflect cellular and tissue organization and PINNACLE's tissue representations enable zero-shot retrieval of the tissue hierarchy. Pretrained PINNACLE's protein representations can be adapted for downstream tasks: to enhance 3D structure-based protein representations for important protein interactions in immuno-oncology (PD-1/PD-L1 and B7-1/CTLA-4) and to study the effects of drugs across cell type contexts. PINNACLE outperforms state-of-the-art, yet context-free, models in nominating therapeutic targets for rheumatoid arthritis and inflammatory bowel diseases, and can pinpoint cell type contexts that predict therapeutic targets better than context-free models. PINNACLE is a graph-based contextual AI model that dynamically adjusts its outputs based on biological contexts in which it operates.

    Click here to watch

    Hosted by:

    - top -


    The Synthetic Patient - Predicting Health Trajectories for Diabetes eHRs
    by Francisco M. Ortuño

    December 20, 2023

    Diabetes currently affects more than half a billion people, including men, women, and children of all ages and nationalities, and is projected to affect >1.3 billion people 30 years from now. Diabetes also commonly leads to other health complications, although those can be anticipated from patient health records, and suitable interventions often prevent or delay their onset. Access to health records, however, is often restricted to protected health environments by regulations that safeguard patient privacy, preventing novel biomedical discoveries by data scientists.

    Recent generative models like Generative Adversarial Networks (GANs) or Variational AutoEncoders (VAEs) have emerged as powerful tools for the generation of synthetic data capturing complex relationships among variables, even when those relationships are unknown. In this talk we present their power in the context of creating realistic patient records for the scientific community while safeguarding patient privacy.

    In an ongoing project, we generate completely open datasets with highly realistic Electronic Health Records (eHRs) from >1.2 million real diabetic patients from the Population Health Database of the Andalusian Ministry of Health in Spain. Initial synthetic datasets were generated with a Dual Adversarial AutoEncoder (DAAE). The first of these datasets includes 999,936 synthetic patients with eHRs provided as a list of ordered chronic pathologies together with some demographic information, like gender or age bracket. We have also generated another dataset with increased patient age resolution, giving a specific age for each diagnosed pathology.

    The aim of sharing these synthetic datasets is to enable the scientific community to find possible unknown relationships and trajectories in diabetes-associated pathologies, which could help predict pathologies before they are diagnosed and so allow earlier interventions. Community predictions will subsequently undergo internal validation on real diabetic patients, confirming that the patterns identified from the generated data generalize to real patient cohorts. Specifically, we first test this in the prediction of well-known clinical endpoints reflecting known consequences of diabetes, including retinopathies, chronic kidney diseases, ischemic heart disease, and other cardiovascular problems.

    The datasets are provided in the form of ISMB/CAMDA Contest Challenges (www.camda.info). Bioinformaticians and data scientists are encouraged to build models for making their own predictions of the selected or other endpoints to jointly establish one of the most comprehensive realistic synthetic clinical datasets open to the scientific community.

    Click here to watch

    Hosted by:

    - top -