ISCBacademy 2022 Webinars




ISCBacademy is an online webinar series that includes the ISCB COSI webinars, COVID webinars, Indigenous Voices, and practical tutorials. We aim to inspire, connect, and communicate the science while providing hands-on experience in accessing and using newly developed bioinformatics tools and ensuring best practices for rigour and reproducibility.


  • January 11, 2022 - Integrated analysis of single-cell data across technologies and modalities by Rahul Satija, New York Genome Center - Hosted by RegSys
  • January 18, 2022 - Accelerating biomedical discovery with large-scale knowledge assembly and human-machine collaboration by Benjamin Gyori, Harvard Medical School - Hosted by SysMod
  • February 1, 2022 - COVID-19 Disease Map: building a computational repository of SARS-CoV-2 virus-host interaction mechanisms by Marek Ostaszewski, University of Luxembourg - Hosted by TransMed
  • February 8, 2022 - Identifying SNPs that are shared by multiple traits with PolarMorphism by Joanna von Berg, Princess Máxima Center for Pediatric Oncology - Hosted by VarI
  • February 8, 2022 - Unraveling the effect of sex on human genetic architecture by Elena Bernabeu, Institute of Genetics and Cancer - Hosted by VarI
  • February 15, 2022 - Critical Assessment of Computational Hit-finding Experiments (CACHE): An Initiative to Guide Future Computational Drug Design by Matthieu Schapira, University of Toronto - Hosted by 3D-SIG
  • February 18, 2022 - Tidy Transcriptomics for Single-cell RNA Sequencing Analyses by Stefano Mangiola, Maria Doyle - Hosted by ISCB
  • February 22, 2022 - Growing open source communities with internships by Yo Yehudi, Wellcome Trust - Hosted by BOSC
  • March 1, 2022 - More is better: how integrating multiple biomedical ontologies can unlock artificial intelligence applications by Catia Pesquita, University of Lisbon - Hosted by Bio-Ontologies
  • March 18, 2022 - Spinning a semantic web of protein information by Monique Zahn, SIB Swiss Institute of Bioinformatics - Hosted by ISCB
  • March 22, 2022 - Filter Drug-induced Liver Injury (DILI) Literature with Natural Language Processing and Ensemble Learning by Xianghao Zhan, Stanford University - Hosted by CAMDA
  • March 29, 2022 - Mapping cell structure across scales by fusing protein images and interactions by Yue Qin & Leah Schaffer, University of California, San Diego - Hosted by CompMS
  • April 5, 2022 - The Bioinformatics education scenario in Latin America: from its beginnings to the present day by Vinicius Maracaja-Coutinho, Maria Bernardi, and Patricia Carvajal-López - Hosted by Education
  • April 12, 2022 - Combining evolution and proteomics to discover protein complexes conserved across plants by Claire McWhite, University of Texas at Austin - Hosted by EvolCompGen
  • April 22, 2022 - Integrating gene expression and biological knowledge for drug discovery and repurposing by Mahmoud Ahmed, Trang Huyen Lai - Hosted by ISCB
  • April 28, 2022 - Using evolutionary statistics to define emergent organization by Arjun Raman, University of Chicago - Hosted by Function
  • April 28, 2022 - Efficient sequence similarity searches with strobemers and applications to read mapping by Kristoffer Sahlin, Stockholm University - Hosted by HiTSeq
  • May 4, 2022 - Systematic mapping of nuclear domain-associated transcripts by Rasim Barutcu, University of Toronto - Hosted by iRNA
  • May 5, 2022 - Improving data transfer, analysis, and driving community challenge solutions with the precisionFDA platform by Sam Westreich, xVantage Group - Hosted by ISCB
  • May 10, 2022 - From hairballs to hypotheses: microbial network analysis by Karoline Faust, Sam Röttjers - Hosted by MICROBIOME
  • May 18, 2022 - Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale by Jian Zhou, University of Texas - Hosted by MLCSB
  • May 28, 2022 - Using R/tidyverse to analyze and visualize genomic data by Janani Ravi, Michigan State University - Hosted by ISCB
  • June 7, 2022 - Network-based dynamic modeling of biological systems: toward understanding and control by Reka Albert, Pennsylvania State University - Hosted by SysMod
  • June 28, 2022 - Massively parallel phenotyping of coding variants in cancer with Perturb-seq by Oana Ursu, Genentech - Hosted by VarI
  • June 28, 2022 - Unraveling the genomic landscape of cancer through de novo extraction of mutational signatures by Marcos Diaz-Gay, University of California, San Diego - Hosted by VarI
  • June 29, 2022 - Reconstructing the mutational histories of healthy and cancer genomes by Maria Secrier, University College London - Hosted by TransMed
  • June 30, 2022 - Modular, reproducible bioinformatics workflows with the targets R package by Joel Nitta - Hosted by ISCB
  • August 25, 2022 - Single-Cell-Resolution, Image-Based CRISPR Screening at Druggable Genome Scale by Max R Salick, insitro - Hosted by ISCB
  • August 31, 2022 - An Introduction to accessing genomic data using the Ensembl REST API by Benjamin Moore, Ensembl - Hosted by ISCB
  • September 6, 2022 - Functional variomics: Systematic annotation of somatic mutations and gene fusions in cancer by Nidhi Sahni, University of Texas - Hosted by 3D-SIG
  • September 13, 2022 - Exploring Automatic Inconsistency Detection for Literature-based Gene Ontology Annotation by Jiyu Chen, University of Melbourne - Hosted by Bio-Ontologies
  • September 30, 2022 - Essential Elements for Next Generation Sequencing Data Analysis by Prashanth N Suravajhala, Kiran K Telukunta and Gareth Price - Hosted by ISCB
  • October 11, 2022 - DeSIDE-DDI: Interpretable prediction of drug-drug interactions using drug-induced gene expressions by Eunyoung Kim, Gwangju Institute of Science and Technology - Hosted by CAMDA
  • October 25, 2022 - Bioinformatics Education & Training Effort at Asia Pacific by Yam Wai Keat, International Medical University; APBioNET; MaSBiC - Hosted by Education
  • November 1, 2022 - Accurate Identification of Transcription Regulatory Sequences and Genes in Coronaviruses by Mohammed El-Kebir, University of Illinois - Hosted by EvolCompGen
  • November 2, 2022 - Data Science for Biomarkers Discovery by Saed Sayad - Hosted by ISCB
  • November 18, 2022 - How to make bioinformatics tools you've developed easily accessible for UK Biobank Users by Brenton Pyle, Ben Busby, Ted Laderas - Hosted by ISCB
  • November 22, 2022 - False-positive IRESes, mRNA annotation errors, and a paradigm “unshift” in mammalian development by Christina Akirtava, Carnegie Mellon University - Hosted by iRNA
  • December 6, 2022 - High-resolution large-scale metagenomics of the human microbiome by Nicola Segata, University of Trento - Hosted by MICROBIOME
  • December 20, 2022 - Integrating and formatting biomedical data as pre-calculated knowledge graph embeddings in the Bioteque by Adria Fernandez-Torras, Institute for Research in Biomedicine Barcelona - Hosted by NetBio

    Integrated analysis of single-cell data across technologies and modalities
    by Rahul Satija

    January 11, 2022

    The simultaneous measurement of multiple modalities represents an exciting frontier for single-cell genomics and necessitates computational methods that can define cellular states based on multimodal data. Here, we introduce “weighted-nearest neighbor” analysis, an unsupervised framework to learn the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities. We apply our procedure to a CITE-seq dataset of 211,000 human peripheral blood mononuclear cells (PBMCs) with panels extending to 228 antibodies to construct a multimodal reference atlas of the circulating immune system. Multimodal analysis substantially improves our ability to resolve cell states, allowing us to identify and validate previously unreported lymphoid subpopulations. Moreover, we demonstrate how to leverage this reference to rapidly map new datasets and to interpret immune responses to vaccination and coronavirus disease 2019 (COVID-19). Our approach represents a broadly applicable strategy to analyze single-cell multimodal datasets and to look beyond the transcriptome toward a unified and multimodal definition of cellular identity.



    Accelerating biomedical discovery with large-scale knowledge assembly and human-machine collaboration
    by Benjamin Gyori

    January 18, 2022

    The rate at which biomedical knowledge is produced (both at the level of new publications and data sets) is accelerating, and there is an increasing need to monitor, extract and assemble this knowledge in an actionable form. Classic mechanistic models take substantial human effort to construct and rarely scale to the level of omics datasets, while statistical approaches often do not make use of prior knowledge about mechanisms. To address these challenges, we present INDRA, an automated knowledge assembly system which integrates multiple text mining tools that process the scientific literature, and structured sources (pathway databases, drug-target databases, etc.). INDRA standardizes knowledge extracted from these sources and corrects errors, resolves redundancies, fills in missing information, and calculates confidence to create a coherent knowledge base. From this knowledge, various executable model types (ODEs, Boolean networks, etc.) and causal networks can be generated automatically for further analysis. We discuss technology built on top of INDRA, including human-machine dialogue systems, and EMMAA, a framework which makes available a set of self-updating and self-analyzing models of specific diseases and pathways. We present applications of these tools to automatically construct explanations for experimental observations in multiple disease areas.



    COVID-19 Disease Map: building a computational repository of SARS-CoV-2 virus-host interaction mechanisms
    by Marek Ostaszewski

    February 1, 2022

    Disease Maps are computational and visual knowledge repositories constructed to catalogue, standardise, and model disease-related mechanisms. They help bridge the knowledge gap between biomedical experts and computational biologists, supporting contextualised data analysis and modelling of a given pathophysiology. Disease Maps are built using graphical and computational Systems Biology standards and can be used as interactive knowledge repositories, platforms for visual analytics of omics datasets, or integrated into large-scale computational workflows. With the global impact of COVID-19, we organised a community effort to develop a COVID-19 Disease Map to help researchers worldwide study the mechanisms of SARS-CoV-2 – host interactions. Our effort engaged over 250 members, contributing as domain experts, diagram curators, analysts, and modellers. This talk will discuss the challenges of community biocuration and integration of a plethora of resources, from Systems Biology diagrams, through interaction databases and text mining results, to modelling pipelines of varying granularity.



    Identifying SNPs that are shared by multiple traits with PolarMorphism
    by Joanna von Berg

    February 8, 2022

    Pleiotropic SNPs are associated with multiple traits. Such SNPs can help pinpoint biological processes with an effect on multiple traits or point to a shared etiology between traits. We present PolarMorphism, a new method for the identification of pleiotropic SNPs from GWAS summary statistics. PolarMorphism can be readily applied to more than two traits or whole trait domains. PolarMorphism makes use of the fact that trait-specific SNP effect sizes can be seen as Cartesian coordinates and can thus be converted to polar coordinates r (distance from the origin) and theta (angle with the Cartesian x-axis). r describes the overall effect of a SNP, while theta describes the extent to which a SNP is shared. r and theta are used to determine the significance of SNP sharedness, resulting in a p-value per SNP that can be used for further analysis. We apply PolarMorphism to a large collection of publicly available GWAS summary statistics enabling the construction of a pleiotropy network that shows the extent to which traits share SNPs. This network shows how PolarMorphism can be used to gain insight into relationships between traits and trait domains. Furthermore, pathway analysis of the newly discovered pleiotropic SNPs demonstrates that analysis of more than two traits simultaneously yields more biologically relevant results than the combined results of pairwise analysis of the same traits. Finally, we show that PolarMorphism is more efficient and more powerful than previously published methods.
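
    The coordinate change described in this abstract can be illustrated with a short sketch. This is not the PolarMorphism implementation: the function name `to_polar` and the example effect sizes are hypothetical, and the sketch covers only the two-trait conversion, not the significance test.

```python
import math

def to_polar(effect_x, effect_y):
    """Convert two trait-specific effect sizes (Cartesian) to polar (r, theta)."""
    r = math.hypot(effect_x, effect_y)      # distance from the origin
    theta = math.atan2(effect_y, effect_x)  # angle with the x-axis, in radians
    return r, theta

# Hypothetical SNP effect sizes, invented for illustration only
trait_specific = to_polar(3.0, 0.0)  # effect on trait X only
shared = to_polar(2.0, 2.0)          # equal effect on both traits
```

    In this toy case, a SNP that affects only trait X lies on the x-axis (theta = 0), while a SNP with equal effects on both traits sits at theta = pi/4.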



    Unraveling the effect of sex on human genetic architecture
    by Elena Bernabeu

    February 8, 2022

    Males and females present differences in complex traits and in the risk of a wide array of diseases. Genotype by sex (GxS) interactions are thought to account for some of these differences. However, the extent and basis of GxS are poorly understood. Here, we provide insights into both the scope and the mechanism of GxS across the genome of about 450,000 individuals of European ancestry and 530 complex traits in the UK Biobank. We found small yet widespread differences in genetic architecture across traits. We also found that, in some cases, sex-agnostic analyses may be missing trait-associated loci and looked into possible improvements in the prediction of high-level phenotypes. Finally, we studied the potential functional role of the differences observed through sex-biased gene expression and gene-level analyses. Our results suggest the need to consider sex-aware analyses for future studies to shed light onto possible sex-specific molecular mechanisms.



    Critical Assessment of Computational Hit-finding Experiments (CACHE): An Initiative to Guide Future Computational Drug Design
    by Matthieu Schapira

    February 15, 2022

    Computational methods used to facilitate small molecule drug discovery are currently witnessing a revived optimism, fueled by continuous leaps in computational power, increased accessibility to commercial compounds, improved physics-based methods, and the emerging potential of generative models and newer machine learning approaches. It is fair to say that the question is not whether in silico design will transform the early phase of drug discovery, but how profoundly and how fast. But there is currently no metric to systematically evaluate and compare these approaches, and no mechanism to highlight the most promising methods and identify the fastest route to success. CACHE is a prospective hit-finding competition where compounds selected by virtual screening or invented by generative models are procured and tested experimentally. Hit rate, diversity, potency, and drug-likeness are used to evaluate and compare methods. All data and method descriptions are publicly released. We expect that CACHE will define the state-of-the-art as computational hit-finding evolves over the years, and will act as an accelerator in the field.



    Tidy Transcriptomics for Single-cell RNA Sequencing Analyses
    by Stefano Mangiola, Maria Doyle

    February 18, 2022

    This tutorial will present how to perform analysis of single-cell RNA sequencing data following the tidy data paradigm. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions. This can be achieved with the integration of packages present in the R CRAN and Bioconductor ecosystem, including tidyseurat, tidySingleCellExperiment, and tidyverse. These packages are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis.

    Instructors:
    Dr. Stefano Mangiola is a Postdoctoral researcher in the laboratory of Prof. Tony Papenfuss. His background spans biotechnology, bioinformatics, and biostatistics. His research focuses on the prostate and breast tumour microenvironment, the development of statistical models for the analysis of RNA sequencing data, and data analysis and visualisation interfaces.
    Dr. Maria Doyle is the Application and Training Specialist for Research Computing at the Peter MacCallum Cancer Centre in Melbourne, Australia. She has a PhD in Molecular Biology and currently works in bioinformatics and data science education and training. She is passionate about supporting researchers, reproducible research, open source and tidy data.

    Recommended Prerequisites:
    Basic knowledge of single cell transcriptomic analyses
    Basic knowledge of tidyverse



    Growing open source communities with internships
    by Yo Yehudi

    February 22, 2022

    Building communities for your open source computational tooling requires more than just technical expertise, and often isn't as straightforward as building the tool itself. Having a community of contributors and users can make a big difference in many ways - additional community members will spot opportunities and bugs in your code that previously you didn't notice, and may be able to offer unique skillsets to your team.

    One effective way to grow your community can be via internships. Programs such as Google Summer of Code and Outreachy offer the chance to work with interns for 6-12 weeks; the interns work on individual supervised projects and are paid for their work.

    This webinar will cover the ins-and-outs of participating in internship programs like this, from the perspective of a mentoring organisation. Topics will include:
    • Getting started with internship programs - finding mentors and defining a set of projects
    • Time commitments for mentors, before the application period and after interns are selected
    • Funding for internship programs (it's not as tricky as you may fear - others handle this bit!)
    • Keeping interns engaged during the program and bringing them in as long-term contributors afterwards



    More is better: how integrating multiple biomedical ontologies can unlock artificial intelligence applications
    by Catia Pesquita

    March 1, 2022

    Biomedical AI applications are increasingly multi-domain, with areas such as personalized medicine and systems biology showcasing the increasing need to explore heterogeneous data. Biomedical Ontologies represent an unparalleled opportunity in this area because they add meaning to the underlying data which can be used to support heterogeneous data integration, provide scientific context to the data augmenting AI performance, and afford explanatory mechanisms allowing the contextualization of AI predictions.
    In this talk I will present our recent work on aligning and integrating multiple ontologies and building knowledge graphs to support both supervised learning and explainable AI approaches for biomedicine. I will highlight lessons learned and chart a path for the coming challenges in biomedical ontology and knowledge graph alignment as AI becomes an integral part of biomedical research.



    Spinning a semantic web of protein information
    by Monique Zahn

    March 18, 2022

    Description:
    Life science is the most demanding research field in terms of data quantity and complexity, with many relevant reference databases. To generate knowledge, heterogeneous data from various sources must often be combined. Semantic Web technologies, and in particular RDF and its companion query language SPARQL, provide a common framework allowing data to be shared and reused between resources. Many life science databases have recently turned to RDF to model their data, developed SPARQL endpoints and joined the Linked (Open) Data cloud. This tutorial will introduce neXtProt (www.nextprot.org/), one of the major public knowledge bases on human proteins, its comprehensive RDF data model, and its large collection of reusable example queries, including federated queries to other resources.

    At the end of the course, the participants are expected to:
    • Describe the neXtProt data model
    • Run example queries that answer biological questions
    • Search for data by modifying existing SPARQL queries
    • Understand how federated queries are constructed

    Instructor:
    Monique Zahn is the Quality Manager of the CALIPHO group which develops neXtProt. She is responsible for testing user interfaces and the contents of each release. She has established quality control procedures involving SPARQL queries carried out at each data release. She has taught biology in undergraduate degree programs in Switzerland and is also Training Manager at the SIB.



    Filter Drug-induced Liver Injury (DILI) Literature with Natural Language Processing and Ensemble Learning
    by Xianghao Zhan

    March 22, 2022

    Abstract: Drug-induced liver injury (DILI) is an adverse effect of drugs characterized by abnormalities in liver tests, and it may lead to acute liver failure. As a key assessment for new drug candidates, DILI events are reported in the publications of clinical practices and preliminary in vitro and in vivo experiments. Conventionally, screening the large corpus of publications to label DILI-related reports is carried out manually, which substantially limits the processing speed. The development of natural language processing (NLP) techniques enables the automatic processing of texts. Here, we report a model for filtering DILI literature with four NLP text vectorization techniques and ensemble learning. The model with TF-IDF and logistic regression outperformed others with an AUROC of 0.990, an accuracy of 0.957, and an AUPRC of 0.990. An ensemble model with similar performance but the fewest false-negative cases was built based on 12 models. Both models showed good performance on the hold-out validation data, and the ensemble model reached a higher accuracy of 0.954 and an F1 score of 0.955. On the additional hold-out test data without title, the TF-IDF model reached a higher accuracy of 0.927 and a higher F1 score of 0.930. Additionally, important words in positive/negative predictions were identified by interpreting the models. Generally, the ensemble model reached satisfactory classification results, which can be used by researchers to quickly filter DILI-related literature.
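
    As a rough illustration of the TF-IDF vectorization step mentioned in this abstract, the sketch below computes term weights with the plain formula tf * log(N / df). It is not the authors' pipeline: the function name `tfidf`, the weighting variant, and the example titles are assumptions made for illustration, and the classifier step (such as logistic regression) is omitted.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical titles, invented for illustration only
titles = [
    "drug induced liver injury case report",
    "liver enzyme elevation after drug exposure",
    "drug binding kinetics in solution",
]
vectors = tfidf(titles)
```

    With this weighting, a word that occurs in every title (here "drug") gets a weight of zero, while rarer words such as "injury" receive higher weights and so carry more signal for a downstream classifier.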



    Mapping cell structure across scales by fusing protein images and interactions
    by Yue Qin & Leah Schaffer

    March 29, 2022

    Determining the structure of biological systems and its relation to biological function is one of the ultimate goals of the biological sciences. There has been a recent push to produce large-sized data sets from the different major approaches for mapping cell structure including the Human Protein Atlas (HPA), a major imaging effort consisting of >10,000 immunofluorescence (IF) images, and the BioPlex networks, derived from systematic affinity-purification mass spectrometry (AP-MS) experiments. Integrating the resulting datasets from these two different approaches provides an opportunity to generate a more complete map of cell structure across scales, from individual protein interactions to subcellular location of whole complexes. Towards this goal, we developed an approach for creating Multi-Scale Integrated Cell (MuSIC) maps of cellular structure by integrating HPA IF images and the BioPlex protein interaction network; integration is achieved by configuring each approach as a general measure of protein distance, then calibrating the two measures using machine learning. This approach resulted in a MuSIC map for the 661 proteins in HEK293 present in both the HPA and BioPlex data sets consisting of 69 subcellular systems, approximately half of which were novel. This map of the cell determined roles for poorly characterized proteins and identified new protein assemblies in processes such as ribosomal biogenesis and RNA splicing. We will also describe initial developments to build MuSIC map v2.0 in the U2OS cell line consisting of ~5,000 proteins, representing a significant expansion of the first MuSIC map and providing a more global view of cell structure.



    The Bioinformatics education scenario in Latin America: from its beginnings to the present day
    by Vinicius Maracaja-Coutinho, Maria Bernardi, and Patricia Carvajal-López

    April 5, 2022

    In this talk, we will walk through an almost 20-year journey in the field of bioinformatics education in the Latin American region, from the first events and meetings to today’s perspectives and opportunities. Different communities, institutions and initiatives have contributed to the development of academic programs and networks to take bioinformatics education to a wider audience and cover different needs of the scientific community. The impact of these initiatives can be seen in the increased number of publications in the field coming from different countries in the region. Despite nearly two decades of development and several academic programs created throughout LATAM, these programs are still few in number, and wide geographical regions and centralised systems have been an obstacle to reaching more universities and institutions. This results in hotspot cities, regions and countries with academic programs devoted to bioinformatics, curbing the development of key academic and industrial areas for the region that would benefit from the transversality of bioinformatics and computational biology. There are key challenges to be solved in the coming decade in order to consolidate the bioinformatics environment in LATAM. Multiple aspects related to decentralisation, equity, diversity and inclusion, language barriers, and entrepreneurship, among others, should be addressed jointly by institutions and professional groups. Join us for a conversation about these and other challenges and, most importantly, the initiatives being established to strengthen bioinformatics education in this important region.



    Combining evolution and proteomics to discover protein complexes conserved across plants
    by Claire McWhite

    April 12, 2022

    Plants are foundational for global ecological and economic systems, but most plant proteins and many protein complexes remain uncharacterized. In plants, highly duplicated protein families pose challenges for protein identification, which is heavily reliant on unique peptides. To address this problem, we developed an evolution-informed proteomics strategy that combines orthology analysis with mass spectrometry. We applied this strategy to 13 plant species of scientific and agricultural importance, greatly expanding the known repertoire of stable protein complexes in plants. We recovered known complexes, confirmed complexes predicted to occur in plants, and identified previously unknown interactions conserved over 1.1 billion years of green plant evolution. The resulting map offers a cross-species view of conserved, stable protein assemblies shared across plant cells and provides a mechanistic, biochemical framework for interpreting plant genetics and mutant phenotypes.

    McWhite et al. A Pan-plant Protein Complex Map Reveals Deep
    Conservation and Novel Assemblies. Cell. 2020;181(2):460-474.e14.
    doi:10.1016/j.cell.2020.02.049



    Integrating gene expression and biological knowledge for drug discovery and repurposing
    by Mahmoud Ahmed, Trang Huyen Lai

    April 22, 2022

    Description

    Part One: a talk describing the construction of a database for cancer-cell-specific perturbations of biological networks (LINPS). We pre-computed cancer-cell-specific perturbation amplitudes of several biological networks and made the output available in a database with an interactive web interface.
    Part Two: a talk describing the building of a functional network model of the metastasis suppressor RKIP and its regulators in breast cancer cells. In this case study, we applied text mining and a manual literature search to extract known interactions between several metastasis suppressors and their regulators. Then we adopted a reverse causal reasoning approach to evaluate and prioritize pathways that are most consistent and responsive to drugs that inhibit cell growth. We further validated some of the predicted regulatory links in the breast cancer cell line MCF7 experimentally and highlighted the points of uncertainty in our model.
    Part Three: a code walkthrough encoding directed interactions into the biological expression language (BEL), computing the network perturbation amplitudes (NPA), and interpreting the output.

    Pre-requisites

    Time

    Three half-hour parts, with breaks and time for discussion, for up to two hours in total (subject to change).



    Using evolutionary statistics to define emergent organization
    by Arjun Raman

    April 28, 2022

    Cellular phenotypes emerge from layers of molecular interactions: proteins interact to form complexes, pathways, and phenotypes. We show that hierarchical networks of protein interactions can be extracted purely from the statistical pattern of proteome variation as measured across thousands of bacteria and that these networks reflect the emergence of complex bacterial phenotypes. We validate our results through gene-set enrichment analysis and comparison to existing experimentally-derived databases. We demonstrate the biological utility of our approach by creating a model of motility in Pseudomonas aeruginosa and using it to identify a protein that affects pilus-mediated motility. We anticipate that our method, SCALES (Spectral Correlation Analysis of Layered Evolutionary Signals), will be useful for interrogating genotype-phenotype relationships in bacteria.



    Efficient sequence similarity searches with strobemers and applications to read mapping
    by Kristoffer Sahlin

    April 28, 2022

    Sequence similarity search is a fundamental bioinformatic problem used in many read alignment and sequence clustering applications. In many of these applications, a seed-filter-extend methodology is used where short local matches (seeding) are found, some match sites are selected (filtering), and exact Smith-Waterman alignment is performed (extension). We have recently seen advances in all the above stages, making sequence similarity search an exciting and fast-moving research area.

    In this talk, I will focus on the seeding step. In particular, I will discuss some seeding techniques and describe our recently proposed strobemer seeds, a type of fuzzy seed that can match over substitutions and indels. I will expand on our previous talk on strobemers (HiTSeq, 2021) to include applications where strobemers have proven useful, such as short-read mapping and long-read overlapping, and briefly discuss open problems and future research directions.
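
    To make the idea of a fuzzy seed more concrete, here is a simplified sketch of a two-part seed in which the second part is picked from a window by a hash. It is only loosely modelled on the strobemer idea and is not the published method or its implementation: the function name `randstrobes`, the parameters `k`, `w_min`, and `w_max`, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib

def randstrobes(seq, k=3, w_min=4, w_max=8):
    """Yield (position, seed) pairs of simplified two-part seeds."""
    for i in range(len(seq) - k + 1):
        first = seq[i:i + k]
        # Candidate start positions for the second part, clipped to the sequence end
        lo = i + w_min
        hi = min(i + w_max, len(seq) - k)
        if lo > hi:
            break  # no room left for a second part
        # Pick the candidate whose combined hash with the first part is smallest
        j = min(range(lo, hi + 1),
                key=lambda p: zlib.crc32((first + seq[p:p + k]).encode()))
        yield i, first + seq[j:j + k]

# Hypothetical sequence, invented for illustration only
seq = "ACGTTGCATGACCTGAAGTC"
seeds = list(randstrobes(seq))
```

    Because the distance between the two parts is not fixed, a seed built this way can in principle still match when an insertion or deletion falls between them, which a contiguous k-mer of the same total length cannot do.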



    Systematic mapping of nuclear domain-associated transcripts
    by Rasim Barutcu

    May 4, 2022

    The nucleus is highly compartmentalized through the formation of distinct classes of membraneless domains, yet the composition and function of many of these structures are not well understood. Using APEX2-mediated proximity labelling and RNA sequencing, we surveyed transcripts associated with nuclear speckles, several additional domains, and the lamina. Remarkably, speckles and lamina are associated with distinct classes of retained introns enriched in genes that function in RNA processing, translation, and the cell cycle. In contrast to the lamina-proximal introns, retained introns associated with speckles are relatively short, GC-rich, and enriched for functional sites of RNA binding proteins that are concentrated in these domains. They are also highly differentially regulated across diverse cellular contexts, including the cell cycle. Our study thus provides a rich resource of nuclear domain-associated transcripts and further reveals speckles and lamina as hubs of distinct populations of retained introns linked to gene regulation and cell cycle progression.

    Click here to watch



    Improving data transfer, analysis, and driving community challenge solutions with the precisionFDA platform
    by Sam Westreich

    May 5, 2022

    The Food and Drug Administration (FDA) faces significant challenges involving bioinformatics data, including receiving data from academic collaborators and industry sponsors, analyzing data in a secure and compliant environment, and fostering community solutions to real-world bioinformatics problems. We present precisionFDA, a cloud-based platform with FedRAMP Moderate security clearance, built and maintained by DNAnexus. This platform:
    · allows FDA reviewers to collaborate with industry sponsors through shared scalable development environments;
    · offers in-browser application development for user-driven construction of custom bioinformatic tools, or adaptation of existing tools;
    · supports hosting public challenges to engage the bioinformatics community, focusing on real-world data analysis and current problems through an intuitive user experience.

    The precisionFDA platform helps improve the tools, capabilities, and speed of analysis available to government researchers. At DNAnexus, we welcome external collaborators and work to present community challenges to help find novel solutions to biology and data problems.

    Click here to watch



    From hairballs to hypotheses: microbial network analysis
    by Karoline Faust, Sam Röttjers

    May 10, 2022

    The construction of microbial networks has become a popular method to analyse microbial sequencing data, with dozens of network inference tools available. However, these tools usually return "hairballs", i.e. densely connected networks, which require further analysis in order to derive biological hypotheses from them. Here, I will present a set of tools designed to address this challenge, including manta for clustering, anuran for comparing, and mako for querying microbial networks.

    Click here to watch



    Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale
    by Jian Zhou

    May 18, 2022

    The structural organization of the genome plays an important role in multiple aspects of genome function. Understanding how genomic sequence influences 3D organization can help elucidate their roles in various processes in healthy and disease states. However, the sequence determinants of genome structure across multiple spatial scales are still not well understood. To learn the complex sequence dependencies of multiscale genome architecture, here we developed a sequence-based deep learning approach, Orca, that predicts genome 3D architecture from kilobase to whole-chromosome scale, covering structures including chromatin compartments and topologically associating domains. Orca also makes both intrachromosomal and interchromosomal predictions and captures the sequence dependencies of diverse types of interactions, from CTCF-mediated to enhancer-promoter interactions and Polycomb-mediated interactions. Orca enables the interpretation of the effects of any structural variant of any size on multiscale genome organization and provides an in silico model to help study the sequence-dependent mechanistic basis of genome architecture. We show that the models accurately recapitulate effects of experimentally studied structural variants at varying sizes (300 bp to 80 Mb) using only sequence. Furthermore, these sequence models enable in silico virtual screen assays to probe the sequence basis of genome 3D organization at different scales. At the submegabase scale, the models predicted specific transcription factor motifs underlying cell-type-specific genome interactions. At the compartment scale, based on virtual screens of sequence activities, we propose a new model for the sequence basis of chromatin compartments: sequences at active transcription start sites are primarily responsible for establishing the expression-active compartment A, while the inactive compartment B typically requires extended stretches of AT-rich sequences (at least 6-12 kb) and can form ‘passively’ without depending on any particular sequence pattern. Orca thus effectively provides an “in silico genome observatory” to predict variant effects on genome structure and probe the sequence-based mechanisms of genome organization.

    Click here to watch



    Using R/tidyverse to analyze and visualize genomic data
    by Janani Ravi

    May 28, 2022

    About the workshop
    This 3-hour workshop will present how to analyze & visualize processed genomics data. The session will be divided into 5 parts:
    Part 1: Getting started w/ readr
    Part 2: Reshaping data w/ tidyr
    Part 3: Data wrangling w/ dplyr
    Part 4: Visualizing tidy data w/ ggplot2
    Part 5: Export and Wrap-up w/ rmarkdown

    Learning Objectives
    By the end of this workshop, you will be able to load your genomic dataset, perform basic data tidying & wrangling, visualize the data, and save/export your results using tidyverse! Hopefully, you will also have a newfound appreciation for reproducible research and R!

    Click here to watch



    Network-based dynamic modeling of biological systems: toward understanding and control
    by Reka Albert

    June 7, 2022

    My group models cell types as attractors of a dynamic system of interacting (macro)molecules, and we aim to find the network patterns that determine these attractors. We collaborate with wet-bench biologists to develop and validate predictive dynamic models of specific systems. Over the years we have found that network-based discrete dynamic modeling is very useful in synthesizing causal interaction information into a predictive, mechanistic model. We use the accumulated knowledge gained from specific models to draw general conclusions that connect a network's structure and dynamics. An example of such a general connection is our identification of stable motifs, which are self-sustaining cyclic structures that determine points of no return in the dynamics of the system. We have shown that control of stable motifs can guide the system into a desired attractor. We have recently translated the concept of stable motif to a broad class of continuous models. Stable motif-based attractor control can form the foundation of therapeutic strategies across a wide range of applications.
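
    To make the idea of attractors in a discrete dynamic model concrete, here is a minimal sketch of a Boolean network. The three nodes and their rules are hypothetical, and the synchronous update scheme is an assumption made for simplicity; it is not the speaker's model or method.

```python
from itertools import product

# Hypothetical three-node network (names and rules are illustrative only).
RULES = {
    "A": lambda s: not s["B"],  # B inhibits A
    "B": lambda s: not s["A"],  # A inhibits B
    "C": lambda s: s["A"],      # A activates C
}
NODES = sorted(RULES)


def step(state: tuple) -> tuple:
    """Apply every rule at once (synchronous update, an assumption)."""
    named = dict(zip(NODES, state))
    return tuple(bool(RULES[node](named)) for node in NODES)


def find_attractors() -> set:
    """Follow each initial state until a state repeats; collect the cycles."""
    attractors = set()
    for start in product([False, True], repeat=len(NODES)):
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = step(state)
        # Everything from the first repeated state onward is the attractor.
        attractors.add(frozenset(seen[seen.index(state):]))
    return attractors


attractors = find_attractors()
fixed_points = sorted(next(iter(a)) for a in attractors if len(a) == 1)
print(fixed_points)
```

    In this toy network the mutual inhibition between A and B is self-sustaining: once A is on and B is off (or the reverse), the state no longer changes, which gives the two fixed points. This is only loosely analogous to the stable motifs mentioned above. The remaining two-state cycle arises from updating all nodes simultaneously.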

    Click here to watch



    Massively parallel phenotyping of coding variants in cancer with Perturb-seq
    by Oana Ursu

    June 28, 2022

    Genome sequencing studies have identified millions of somatic variants in cancer, but it remains challenging to predict the phenotypic impact of most. Experimental approaches to distinguish impactful variants often use phenotypic assays that report on predefined gene-specific functional effects in bulk cell populations. Here, we develop an approach to functionally assess variant impact in single cells by pooled Perturb-seq. We measured the impact of 200 TP53 and KRAS variants on RNA profiles in over 300,000 single lung cancer cells, and used the profiles to categorize variants into phenotypic subsets to distinguish gain-of-function, loss-of-function and dominant negative variants, which we validated by comparison with orthogonal assays. We discovered that KRAS variants did not merely fit into discrete functional categories, but spanned a continuum of gain-of-function phenotypes, and that their functional impact could not have been predicted solely by their frequency in patient cohorts. Our work provides a scalable, gene-agnostic method for coding variant impact phenotyping, with potential applications in multiple disease settings.

    Click here to watch



    Unraveling the genomic landscape of cancer through de novo extraction of mutational signatures
    by Marcos Diaz-Gay

    June 28, 2022

    All cancers are caused by somatic mutations imprinted by the activities of different mutational processes, with each process leaving a characteristic pattern of mutations termed a mutational signature. Characterizing mutational signatures can help understand the processes behind the onset and progression of tumors, with potential application as biomarkers in clinical practice. In our group, we have recently developed SigProfilerExtractor, an automated tool for accurate de novo extraction of mutational signatures for all types of somatic mutations. We have performed comprehensive benchmarking, with 34 distinct scenarios encompassing 2,500 simulated signatures operative in more than 60,000 unique synthetic genomes and 20,000 synthetic exomes, demonstrating that SigProfilerExtractor outperforms 13 other tools for extracting mutational signatures. For genome simulations with 5% noise, reflecting high-quality genomic datasets, SigProfilerExtractor identified between 20% and 50% more true positive signatures while yielding more than 5-fold fewer false positive signatures. Applying SigProfilerExtractor to 4,643 whole-genome and 19,184 whole-exome sequenced cancers revealed four previously missed mutational signatures, including a signature putatively attributed to tobacco smoking in bladder cancer and normal bladder epithelium.

    Click here to watch



    Reconstructing the mutational histories of healthy and cancer genomes
    by Maria Secrier

    June 29, 2022

    A variety of mutational processes shape genome evolution and can lead to the development of cancer by inducing DNA damage in the cells. These processes are triggered by environmental as well as intrinsic risk factors, and they leave specific footprints of somatic alterations in the genome. These mutational footprints, called “mutational signatures”, can be read from the tumour sequencing data and reveal the main sources of DNA damage driving neoplastic progression. In this sense, they can be considered a form of evidence for historical mutational events that have acted throughout an individual's lifetime. I will discuss some of the methodological innovations that have enabled the exploration of these mutational events in cancer genomes through the identification of systematic patterns of mutations in large-scale sequencing cohorts. I will also illustrate some of the applications of this methodology to studying both healthy ageing and cancer.

    Click here to watch



    Modular, reproducible bioinformatics workflows with the targets R package
    by Joel Nitta

    June 30, 2022

    Modern bioinformatics pipelines can be incredibly complex, but all tend to follow a common pattern: they start with raw data, then pass the data through various programs until arriving at a final result. If this is done in an ad hoc, unorganized fashion, the results may never be reproducible or, even worse, may be unreliable or wrong. Pipeline management software is therefore essential to obtain results that are robust and reproducible. The targets R package is a recently developed workflow manager that comes with many excellent features for bioinformatics, including data caching, pipeline-level parallelization, and HPC support. In this hands-on workshop, I will demonstrate how targets can be used in concert with other tools like docker and conda to orchestrate modular, reproducible bioinformatics pipelines. The workshop will feature variant calling as an example, but the concepts and tools can be applied to nearly any analysis.

    Pre-requisites: Basic familiarity with R. Installations of recent versions of R, conda, and docker.

    Duration: 2 hours

    Click here to watch



    Single-Cell-Resolution, Image-Based CRISPR Screening at Druggable Genome Scale
    by Max R Salick

    August 25, 2022

    Pooled CRISPR screening has emerged as a powerful method of uncovering entire gene networks and modulators of critical biomarkers [1,2], due to its scalability, low cost, and substantial resistance to inter-well and inter-plate artifacts. Unfortunately, the current methods of pooled CRISPR screening are only compatible with fitness or FACS-sortable phenotypes, while high-dimensional readout methods such as Perturb-seq are costly and only apply to transcriptional readouts [3] and/or scalar protein readouts [4]. With the recent emergence of pooled optical screening methods [5,6], perturbagens such as gene-targeting gRNAs can be amplified and directly measured via in situ sequencing while maintaining cellular structure and spatial features. This enables CRISPR screens to be coupled with a nearly limitless range of imaging assays, such as cell migration, calcium signaling, CellPaint, quantitative phase contrast, protein aggregation, multicellular/cell-cell interaction assays, and more. Here we describe an automated platform that has been developed to allow for pooled optical screening at industrial capacity, as well as multiple optical CRISPR screens done at increasing scales. We describe the first screen conducted on morphological phenotypes using a modified version of CellPaint, in which genes targeting various core pathways were edited. We demonstrate that ultra-high-throughput morphological analysis successfully identified and grouped these gene clusters using simply CellPaint and high-dimensional morphological readouts. In the second screen, we conducted a druggable-genome scale screen to identify both marker-based modifiers of the mTOR pathway, as well as biomarker-free clustering of gene networks using machine learning and feature-based analysis. While only a handful of pooled optical screens have been conducted so far in the field, we demonstrate the beginning of a promising new stage of CRISPR screening technology, allowing for high-throughput functional genomics screens to span into a vastly wider assortment of imaging-based assays.

    1. O. Shalem, Science, 353, 6166 (2013)
    2. R. Ihry, Cell Reports, 27, 2 (2019)
    3. A. Dixit, Cell, 167, 7 (2016)
    4. M. Stoeckius, Nature Methods, 9, 865-868 (2017)
    5. D. Feldman, Cell, 179, 3 (2019)
    6. L. Funk, biorxiv, doi:10.1101/2021.11.28.470116 (2021)

    Click here to watch



    An Introduction to accessing genomic data using the Ensembl REST API
    by Benjamin Moore

    August 31, 2022

    The Ensembl REST API allows language-agnostic programmatic access to Ensembl data. This webinar will provide an introduction to the REST API and its documentation, and how to access various data types.

    Pre-requisites: None, although basic knowledge of a programming language (particularly Python or R) would be helpful
    Length: 1 hour
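
    For readers who want a feel for what programmatic access looks like, the following is a minimal sketch of a single lookup request using only the Python standard library. The helper names are hypothetical, and the choice of the `lookup/id` endpoint and of the example gene ID is an assumption made for illustration; the webinar itself may cover different endpoints.

```python
import json
import urllib.request

SERVER = "https://rest.ensembl.org"


def lookup_url(stable_id: str) -> str:
    """Build the URL for a lookup-by-ID request that asks for JSON."""
    return f"{SERVER}/lookup/id/{stable_id}?content-type=application/json"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Send a GET request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def describe_gene(stable_id: str) -> dict:
    """Fetch the record for one Ensembl stable ID."""
    return fetch_json(lookup_url(stable_id))


# Building the URL does not require a network connection.
example_url = lookup_url("ENSG00000157764")
print(example_url)
```

    Calling `describe_gene("ENSG00000157764")` on a machine with internet access should return a dictionary describing that gene. Because the API is plain HTTP, the same request can be made from R, a shell, or any other language.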



    Functional variomics: Systematic annotation of somatic mutations and gene fusions in cancer
    by Nidhi Sahni

    September 6, 2022

    Proteins interact with other macromolecules in complex cellular networks for signal transduction and biological function. Our previous work in Mendelian disorders found a widespread phenomenon that disease-associated alleles often perturb distinct protein activities rather than grossly affecting folding and stability. In the context of cancer, the functional impact of the vast majority of somatic mutations remains unknown, representing a critical knowledge gap for implementing precision oncology. Here, we present the development of a high-throughput functional variomics platform consisting of efficient mutant generation, sensitive cell viability and drug response assays, and functional proteomic profiling of signaling effects for select aberrations. We apply the platform to annotate thousands of genomic aberrations, including point mutations, indels, and gene fusions, potentially doubling the number of driver mutations characterized in clinically actionable genes. Further, the platform is sufficiently sensitive to identify weak drivers. Our data are accessible through a user-friendly, public data portal. Our study will facilitate biomarker discovery, prediction algorithm improvement, and drug development.



    Exploring Automatic Inconsistency Detection for Literature-based Gene Ontology Annotation
    by Jiyu Chen

    September 13, 2022

    Literature-based Gene Ontology Annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in primary literature. Quality assurance of GOA is crucial for supporting biological research, such as gene expression analysis and gene clustering. However, a range of different kinds of inconsistencies can be identified between GOA and the scientific literature that serves as evidence for these annotations. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance.

    In this talk, Jiyu will present results from two recent studies exploring GOA inconsistencies and assessing the feasibility of automatic detection of such inconsistencies. Jiyu will introduce the basic framework for the implementation of automatic GOA quality assurance systems that support human-in-the-loop curation. Jiyu will discuss opportunities in their findings and point out feasible future studies on the implementation of automatic GOA quality assurance.

    Click here to watch



    Essential Elements for Next Generation Sequencing Data Analysis
    by Prashanth N Suravajhala, Kiran K Telukunta and Gareth Price

    September 30, 2022

    With the advent of next-generation sequencing (NGS) technologies, there arose a need to identify candidate causal mutations. A challenge often faced in identifying and inferring causal SNPs from sequence data is that different methods are preferred for predicting the effect of mutations and for establishing that they are bona fide. While NGS has delivered a wide array of highly sensitive, if less stringent, methods in the recent past, this workshop aims to bridge the gap by using a systems genomics approach that takes command-line scripts to Galaxy-based workflows. A special focus of this workshop is on current trends in genome analysis, with special insights into NGS analysis. The sessions largely focus on whole exome sequencing (WES) and whole transcriptome shotgun sequencing (WTSS, or RNA-seq) pipelines, Galaxy-integrated workflows, and the latest trends in single-cell sequencing, with demonstrations of the various steps of data analysis, including quality control, generation of variant calls, and gene expression analysis. Ample time will be set aside for discussing case studies on various disease phenotypes.



    DeSIDE-DDI: Interpretable prediction of drug-drug interactions using drug-induced gene expressions
    by Eunyoung Kim

    October 11, 2022

    Adverse drug-drug interaction (DDI) is a major concern in polypharmacy because of its unexpected side effects, and it must be identified at an early stage of drug discovery and development. Many computational methods have been proposed for this purpose, but most require specific types of information or pay little attention to interpretation in terms of the underlying genes. We propose a deep learning-based framework for DDI prediction with drug-induced gene expression signatures so that the model can provide expression-level interpretability for DDIs. The model engineers dynamic drug features using a gating mechanism that mimics the co-administration effects by imposing attention on genes. Also, each side effect is projected into a latent space through translating embedding. As a result, the model achieved an AUC of 0.889 and an AUPR of 0.915 in unseen interaction prediction, which is highly accurate and outperforms other state-of-the-art methods. Furthermore, it can predict potential DDIs with new compounds not used in training. In conclusion, using drug-induced gene expression signatures followed by gating and translating embedding can increase DDI prediction accuracy while providing model interpretability.

    Click here to watch



    Bioinformatics Education & Training Effort at Asia Pacific
    by Yam Wai Keat

    October 25, 2022

    There have been many education and training efforts to raise bioinformatics awareness and research in Asia Pacific. In this talk, I will share our experience and thoughts on bioinformatics education and training efforts in Asia Pacific and how the Asia Pacific Bioinformatics Network (APBioNET) bridges, facilitates, and fills the gaps in them.

    Click here to watch



    Accurate Identification of Transcription Regulatory Sequences and Genes in Coronaviruses
    by Mohammed El-Kebir

    November 1, 2022

    Transcription regulatory sequences (TRSs), which occur upstream of structural and accessory genes as well as the 5' end of a coronavirus genome, play a critical role in discontinuous transcription in coronaviruses. We introduce two problems collectively aimed at identifying these regulatory sequences as well as their associated genes. First, we formulate the TRS IDENTIFICATION problem of identifying TRS sites in a coronavirus genome sequence with prescribed gene locations. We introduce CORSID-A, an algorithm that solves this problem to optimality in polynomial time. We demonstrate that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses. Second, we demonstrate for the first time how TRS sites can be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS AND GENE IDENTIFICATION problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID to solve this problem and show that it outperforms state-of-the-art gene finding methods in coronavirus genomes. Furthermore, we demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronavirus genomes. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.

    Click here to watch



    Data Science for Biomarkers Discovery
    by Saed Sayad

    November 2, 2022

    The advent of multi-omics technologies (e.g., genomics, transcriptomics, proteomics, and metabolomics) has brought the hope of discovering novel biomarkers that can be used for the diagnosis, prognosis, and treatment of diseases. Data science has an important role in identifying biomarkers (biological markers) using data from microarray and RNA-Seq experiments. In this hands-on tutorial, you will learn how to use data science and transcriptomic data to discover biomarkers for diagnosis, prognosis, response to treatment, monitoring, and risk assessment.

    Click here to watch



    How to make bioinformatics tools you've developed easily accessible for UK Biobank Users
    by Brenton Pyle, Ben Busby, Ted Laderas

    November 18, 2022

    Do you have a data science tool you’d like to see people use at scale on UK Biobank and other biobank-sized data? If so, in this tutorial you will learn how to deploy your application as a standalone app, Jupyter notebook, RStudio object, or CWL/WDL workflow on the DNAnexus-enabled UK Biobank Research Analysis Platform. You'll also learn the easiest ways to distribute this functionality to others!

    Click here to watch



    False-positive IRESes, mRNA annotation errors, and a paradigm “unshift” in mammalian development
    by Christina Akirtava

    November 22, 2022

    Hyperconserved genomic sequences have great promise for understanding core biological processes. It has been recently proposed that scores of hyperconserved 5′ untranslated regions (UTRs), also known as transcript leaders (hTLs), encode internal ribosome entry sites (IRESes) that drive cap-independent translation, in part, via interactions with ribosome expansion segments. However, the direct functional significance of such interactions has not yet been definitively demonstrated. We provide evidence that the putative IRESes previously reported in Hox gene hTLs are rarely included in transcript leaders. Instead, these regions function independently as transcriptional promoters. In addition, we find the proposed RNA structure of the putative Hoxa9 IRES is not conserved. Instead, sequences previously shown to be essential for putative IRES activity encode a hyperconserved transcription factor binding site (E-box) that contributes to its promoter activity and is bound by several transcription factors, including USF1 and USF2. Similar E-box sequences enhance the promoter activities of other putative Hoxa gene IRESes. Moreover, we provide evidence that the vast majority of hTLs with putative IRES activity overlap transcriptional promoters, enhancers, and 3′ splice sites that are most likely responsible for their reported IRES activities. These results argue strongly against recently reported widespread IRES-like activities from hTLs and contradict proposed interactions between ribosomal expansion segment ES9S and putative IRESes. Furthermore, our work underscores the importance of accurate transcript annotations, controls in bicistronic reporter assays, and the power of synthesizing publicly available data from multiple sources.

    Click here to register

    Click here to watch



    High-resolution large-scale metagenomics of the human microbiome
    by Nicola Segata

    December 6, 2022

    TBD



    Integrating and formatting biomedical data as pre-calculated knowledge graph embeddings in the Bioteque
    by Adria Fernandez-Torras

    December 20, 2022

    Biomedical data are accumulating at an unprecedented rate and integrating them in a unified framework is a major challenge of the post-genomics era. We have created a gigantic heterogeneous network (more than 450k nodes and 30M edges) that harmonizes and connects data points from more than 150 sources. Overall, 12 types of biological entities (e.g. genes, diseases, drugs) were linked by 67 types of relationships (e.g. drug treats disease, gene interacts with gene). In order to properly exploit the gathered knowledge, we systematically encoded these connections as numerical vectors (embeddings), creating the Bioteque, a resource of biological network embeddings of unprecedented size and scope (https://bioteque.irbbarcelona.org). We demonstrate that this concise representation of the data retains the meaningful information contained within the biological network and can be plugged into machine learning implementations, and we show how it can be used to characterize a given set of experimental observations.

    Click here to watch
