ISCBacademy is an online webinar series including the ISCB COSI, COVID webinars, Indigenous Voices and practical tutorials. We aim to inspire, connect, and communicate the science while providing a hands-on experience accessing and using newly developed bioinformatics tools while ensuring best practices for rigour and reproducibility.
January 11, 2022
The simultaneous measurement of multiple modalities represents an exciting frontier for single-cell genomics and necessitates computational methods that can define cellular states based on multimodal data. Here, we introduce “weighted-nearest neighbor” analysis, an unsupervised framework to learn the relative utility of each data type in each cell, enabling an integrative analysis of multiple modalities. We apply our procedure to a CITE-seq dataset of 211,000 human peripheral blood mononuclear cells (PBMCs) with panels extending to 228 antibodies to construct a multimodal reference atlas of the circulating immune system. Multimodal analysis substantially improves our ability to resolve cell states, allowing us to identify and validate previously unreported lymphoid subpopulations. Moreover, we demonstrate how to leverage this reference to rapidly map new datasets and to interpret immune responses to vaccination and coronavirus disease 2019 (COVID-19). Our approach represents a broadly applicable strategy to analyze single-cell multimodal datasets and to look beyond the transcriptome toward a unified and multimodal definition of cellular identity.
Hosted by:
January 18, 2022
The rate at which biomedical knowledge is produced (both at the level of new publications and data sets) is accelerating, and there is an increasing need to monitor, extract and assemble this knowledge in an actionable form. Classic mechanistic models take substantial human effort to construct and rarely scale to the level of omics datasets, while statistical approaches often do not make use of prior knowledge about mechanisms. To address these challenges, we present INDRA, an automated knowledge assembly system which integrates multiple text mining tools that process the scientific literature, and structured sources (pathway databases, drug-target databases, etc.). INDRA standardizes knowledge extracted from these sources and corrects errors, resolves redundancies, fills in missing information, and calculates confidence to create a coherent knowledge base. From this knowledge, various executable model types (ODEs, Boolean networks, etc.) and causal networks can be generated automatically for further analysis. We discuss technology built on top of INDRA, including human-machine dialogue systems, and EMMAA, a framework which makes available a set of self-updating and self-analyzing models of specific diseases and pathways. We present applications of these tools to automatically construct explanations for experimental observations in multiple disease areas.
Hosted by:
February 1, 2022
Disease Maps are computational and visual knowledge repositories constructed to catalogue, standardise, and model disease-related mechanisms. They help bridge the knowledge gap between biomedical experts and computational biologists, supporting contextualised data analysis and modelling of a given pathophysiology. Disease Maps are built using graphical and computational Systems Biology standards and can be used as interactive knowledge repositories, platforms for visual analytics of omics datasets, or integrated into large-scale computational workflows. With the global impact of COVID-19, we organised a community effort to develop a COVID-19 Disease Map to help researchers worldwide study the mechanisms of the SARS-CoV-2 – host interactions. Our effort engaged over 250 members, contributing as domain experts, diagram curators, analysts, and modellers. This talk will discuss the challenges of community biocuration and integration of a plethora of resources, from Systems Biology diagrams, through interaction databases and text mining results to modelling pipelines of varying granularity.
Hosted by:
February 8, 2022
Pleiotropic SNPs are associated with multiple traits. Such SNPs can help pinpoint biological processes with an effect on multiple traits or point to a shared etiology between traits. We present PolarMorphism, a new method for the identification of pleiotropic SNPs from GWAS summary statistics. PolarMorphism can be readily applied to more than two traits or whole trait domains. PolarMorphism makes use of the fact that trait-specific SNP effect sizes can be seen as Cartesian coordinates and can thus be converted to polar coordinates r (distance from the origin) and theta (angle with the Cartesian x-axis). r describes the overall effect of a SNP, while theta describes the extent to which a SNP is shared. r and theta are used to determine the significance of SNP sharedness, resulting in a p-value per SNP that can be used for further analysis. We apply PolarMorphism to a large collection of publicly available GWAS summary statistics enabling the construction of a pleiotropy network that shows the extent to which traits share SNPs. This network shows how PolarMorphism can be used to gain insight into relationships between traits and trait domains. Furthermore, pathway analysis of the newly discovered pleiotropic SNPs demonstrates that analysis of more than two traits simultaneously yields more biologically relevant results than the combined results of pairwise analysis of the same traits. Finally, we show that PolarMorphism is more efficient and more powerful than previously published methods.
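The coordinate conversion described above can be illustrated with a minimal Python sketch. It covers only the Cartesian-to-polar step for two traits; it does not reproduce the PolarMorphism package, any preprocessing it performs, or its significance testing. The SNP names and effect sizes are invented for illustration.

```python
import math

def to_polar(effect_x: float, effect_y: float) -> tuple[float, float]:
    """Convert a SNP's two trait-specific effect sizes to (r, theta)."""
    r = math.hypot(effect_x, effect_y)       # distance from the origin
    theta = math.atan2(effect_y, effect_x)   # angle with the x-axis, in radians
    return r, theta

# Hypothetical effect sizes for three SNPs on two traits (assumed values).
snps = {
    "snp_trait_x_only": (4.0, 0.0),
    "snp_shared": (3.0, 3.0),
    "snp_trait_y_only": (0.0, 4.0),
}

for name, (x, y) in snps.items():
    r, theta = to_polar(x, y)
    print(f"{name}: r={r:.2f}, theta={math.degrees(theta):.0f} deg")
```

In this toy setting, a SNP with equal effects on both traits lies on the diagonal (theta of 45 degrees), while a trait-specific SNP lies on one of the axes.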
Hosted by:
February 8, 2022
Males and females present differences in complex traits and in the risk of a wide array of diseases. Genotype by sex (GxS) interactions are thought to account for some of these differences. However, the extent and basis of GxS are poorly understood. Here, we provide insights into both the scope and the mechanism of GxS across the genome of about 450,000 individuals of European ancestry and 530 complex traits in the UK Biobank. We found small yet widespread differences in genetic architecture across traits. We also found that, in some cases, sex-agnostic analyses may be missing trait-associated loci and looked into possible improvements in the prediction of high-level phenotypes. Finally, we studied the potential functional role of the differences observed through sex-biased gene expression and gene-level analyses. Our results suggest the need to consider sex-aware analyses for future studies to shed light onto possible sex-specific molecular mechanisms.
Hosted by:
February 15, 2022
Computational methods used to facilitate small molecule drug discovery are currently witnessing a revived optimism, fueled by continuous leaps in computational power, increased accessibility to commercial compounds, improved physics-based methods, and the emerging potential of generative models and newer machine learning approaches. It is fair to say that the question is not whether in silico design will transform the early phase of drug discovery, but how profoundly and how fast. However, there is currently no metric to systematically evaluate and compare these approaches, and no mechanism to highlight the most promising methods and identify the fastest route to success. CACHE is a prospective hit-finding competition where compounds selected by virtual screening or invented by generative models are procured and tested experimentally. Hit rate and diversity, potency and drug-likeness are used to evaluate and compare methods. All data and method descriptions are publicly released. We expect that CACHE will define the state-of-the-art as computational hit-finding evolves over the years, and will act as an accelerator in the field.
Hosted by:
February 18, 2022
This tutorial will present how to perform analysis of single-cell RNA sequencing data following the tidy data paradigm. The tidy data paradigm provides a standard way to organise data values within a dataset, where each variable is a column, each observation is a row, and data is manipulated using an easy-to-understand vocabulary. Most importantly, the data structure remains consistent across manipulation and analysis functions. This can be achieved with the integration of packages present in the R CRAN and Bioconductor ecosystem, including tidyseurat, tidySingleCellExperiment, and tidyverse. These packages are part of the tidytranscriptomics suite that introduces a tidy approach to RNA sequencing data representation and analysis.
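As a language-neutral illustration of the tidy layout described above, the following sketch reshapes a small "wide" count table into one row per observation. It uses plain Python rather than the R packages named in the tutorial, and it does not reflect the tidyseurat or tidySingleCellExperiment APIs. The gene counts are invented for illustration.

```python
# Hypothetical wide table: one entry per gene, one nested entry per cell.
wide = {
    "CD3E": {"cell_1": 5, "cell_2": 0},
    "MS4A1": {"cell_1": 0, "cell_2": 7},
}

def to_tidy(wide_counts):
    """Return one row per (cell, gene) observation, each variable in its own column."""
    return [
        {"cell": cell, "gene": gene, "count": count}
        for gene, per_cell in wide_counts.items()
        for cell, count in per_cell.items()
    ]

tidy = to_tidy(wide)

# Because every row has the same columns, filtering is a uniform operation.
expressed = [row for row in tidy if row["count"] > 0]
for row in expressed:
    print(row)
```

The point of the layout is that the same filter, group, or plot operation applies to any variable without reshaping the data again.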
Instructors:
Dr. Stefano Mangiola is a postdoctoral researcher in the laboratory of Prof. Tony Papenfuss. His background ranges from biotechnology to bioinformatics and biostatistics. His research focuses on the prostate and breast tumour microenvironment, the development of statistical models for the analysis of RNA sequencing data, and data analysis and visualisation interfaces.
Dr. Maria Doyle is the Application and Training Specialist for Research Computing at the Peter MacCallum Cancer Centre in Melbourne, Australia. She has a PhD in Molecular Biology and currently works in bioinformatics and data science education and training. She is passionate about supporting researchers, reproducible research, open source and tidy data.
Recommended Prerequisites:
Basic knowledge of single cell transcriptomic analyses
Basic knowledge of tidyverse
Hosted by:
February 22, 2022
Building communities for your open source computational tooling requires more than just technical expertise, and often isn't as straightforward as building the tool itself. Having a community of contributors and users can make a big difference in many ways - additional community members will spot opportunities and bugs in your code that previously you didn't notice, and may be able to offer unique skillsets to your team.
One effective way to grow your community can be via internships. Programs such as Google Summer of Code and Outreachy offer the chance to work with interns for 6-12 weeks; the interns work on individual supervised projects and are paid for their work.
This webinar will cover the ins-and-outs of participating in internship programs like this, from the perspective of a mentoring organisation. Topics will include:
• Getting started with internship programs: finding mentors and defining a set of projects
• Time commitments for mentors, before the application period and after interns are selected
• Funding for internship programs (it's not as tricky as you may fear: others handle this bit!)
• Keeping interns engaged during the program and bringing them in as long-term contributors afterwards
Hosted by:
March 1, 2022
Biomedical AI applications are increasingly multi-domain, with areas such as personalized medicine and systems biology showcasing the increasing need to explore heterogeneous data. Biomedical Ontologies represent an unparalleled opportunity in this area because they add meaning to the underlying data which can be used to support heterogeneous data integration, provide scientific context to the data augmenting AI performance, and afford explanatory mechanisms allowing the contextualization of AI predictions.
In this talk I will present our recent work on aligning and integrating multiple ontologies and building knowledge graphs to support both supervised learning and explainable AI approaches for biomedicine. I will highlight lessons learned and chart a path for the coming challenges in biomedical ontology and knowledge graph alignment as AI becomes an integral part of biomedical research.
Hosted by:
March 18, 2022
Description:
Life science is the most demanding research field in terms of data quantity and complexity, with many relevant reference databases. To generate knowledge, heterogeneous data from various sources must often be combined. Semantic Web technologies, and in particular RDF and its companion query language SPARQL, provide a common framework allowing data to be shared and reused between resources. Many life science databases have recently turned to RDF to model their data, developed SPARQL endpoints and joined the Linked (Open) Data cloud. This tutorial will introduce neXtProt (www.nextprot.org/), one of the major public knowledge bases on human proteins, its comprehensive RDF data model, and its large collection of reusable example queries, including federated queries to other resources.
At the end of the course, the participants are expected to:
• Describe the neXtProt data model
• Run example queries that answer biological questions
• Search for data by modifying existing SPARQL queries
• Understand how federated queries are constructed
Instructor:
Monique Zahn is the Quality Manager of the CALIPHO group which develops neXtProt. She is responsible for testing user interfaces and the contents of each release. She has established quality control procedures involving SPARQL queries carried out at each data release. She has taught biology in undergraduate degree programs in Switzerland and is also Training Manager at the SIB.
Hosted by:
March 22, 2022
Abstract: Drug-induced liver injury (DILI) is an adverse effect of drugs characterized by abnormalities in liver tests, and it may lead to acute liver failure. As a key assessment for new drug candidates, DILI events are reported in the publications of clinical practices and preliminary in vitro and in vivo experiments. Conventionally, screening the large corpus of publications to label DILI-related reports is carried out manually, which substantially limits the processing speed. The development of natural language processing (NLP) techniques enables the automatic processing of texts. Here, we report a model for filtering DILI literature with four NLP text vectorization techniques and ensemble learning. The model with TF-IDF and logistic regression outperformed others with an AUROC of 0.990, an accuracy of 0.957, and an AUPRC of 0.990. An ensemble model with similar performance but the fewest false-negative cases was built based on 12 models. Both models showed good performance on the hold-out validation data, and the ensemble model reached a higher accuracy of 0.954 and an F1 score of 0.955. On the additional hold-out test data without title, the TF-IDF model reached a higher accuracy of 0.927 and a higher F1 score of 0.930. Additionally, important words in positive/negative predictions were identified by interpreting the models. Generally, the ensemble model reached satisfactory classification results, which can be used by researchers to quickly filter DILI-related literature.
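To make the TF-IDF vectorization step mentioned above concrete, here is a minimal sketch of one plain, unsmoothed TF-IDF variant in standard-library Python. It is not the authors' pipeline: the tokenization, the weighting formula, and the example titles are assumptions made for illustration, and libraries often use smoothed variants. The classifier itself is not shown.

```python
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each term by its in-document frequency and its rarity across documents."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized titles, invented for illustration.
docs = [
    "drug induced liver injury case report".split(),
    "liver enzyme elevation after drug exposure".split(),
    "drug target binding assay".split(),
]

for vec in tf_idf(docs):
    print({term: round(weight, 3) for term, weight in vec.items()})
```

In this variant, a term that occurs in every document (here "drug") receives a weight of zero, while rarer terms such as "injury" are weighted more heavily. Vectors like these would then be passed to a classifier such as logistic regression.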
Hosted by:
March 29, 2022
Determining the structure of biological systems and its relation to biological function is one of the ultimate goals of the biological sciences. There has been a recent push to produce large-sized data sets from the different major approaches for mapping cell structure including the Human Protein Atlas (HPA), a major imaging effort consisting of >10,000 immunofluorescence (IF) images, and the BioPlex networks, derived from systematic affinity-purification mass spectrometry (AP-MS) experiments. Integrating the resulting datasets from these two different approaches provides an opportunity to generate a more complete map of cell structure across scales, from individual protein interactions to subcellular location of whole complexes. Towards this goal, we developed an approach for creating Multi-Scale Integrated Cell (MuSIC) maps of cellular structure by integrating HPA IF images and the BioPlex protein interaction network; integration is achieved by configuring each approach as a general measure of protein distance, then calibrating the two measures using machine learning. This approach resulted in a MuSIC map for the 661 proteins in HEK293 present in both the HPA and BioPlex data sets consisting of 69 subcellular systems, approximately half of which were novel. This map of the cell determined roles for poorly characterized proteins and identified new protein assemblies in processes such as ribosomal biogenesis and RNA splicing. We will also describe initial developments to build MuSIC map v2.0 in the U2OS cell line consisting of ~5,000 proteins, representing a significant expansion of the first MuSIC map and providing a more global view of cell structure.
Hosted by:
April 5, 2022
In this talk, we will walk through an almost 20-year journey in the field of bioinformatics education in the Latin American region, from the first events and meetings to today’s perspectives and opportunities. Different communities, institutions and initiatives have contributed to the development of academic programs and networks to take bioinformatics education to a wider audience and cover different needs of the scientific community. The impact of these initiatives can be seen in the increased number of publications in the field coming from different countries in the region. Despite nearly two decades of development and several academic programs created throughout LATAM, such programs are still few in number, and wide geographical regions and centralised systems have been an obstacle to reaching more universities and institutions. This results in hotspot cities, regions and countries with academic programs devoted to bioinformatics, curbing the development of key academic and industrial areas for the region that would benefit from the transversality of bioinformatics and computational biology. There are key challenges to be solved in the coming decade in order to consolidate the bioinformatics environment in LATAM. Multiple aspects related to decentralisation, equity, diversity and inclusion, language barriers, and entrepreneurship, among others, should be addressed jointly by institutions and professional groups. Join us to have a conversation about these and other challenges and, most importantly, the initiatives being established to strengthen bioinformatics education in this important region.
Hosted by:
April 12, 2022
Plants are foundational for global ecological and economic systems, but most plant proteins and many protein complexes remain uncharacterized. In plants, highly duplicated protein families pose challenges for protein identification, which is heavily reliant on unique peptides. To address this problem, we developed an evolution-informed proteomics strategy that combines orthology analysis with mass spectrometry. We applied this strategy to 13 plant species of scientific and agricultural importance, greatly expanding the known repertoire of stable protein complexes in plants. We recovered known complexes, confirmed complexes predicted to occur in plants, and identified previously unknown interactions conserved over 1.1 billion years of green plant evolution. The resulting map offers a cross-species view of conserved, stable protein assemblies shared across plant cells and provides a mechanistic, biochemical framework for interpreting plant genetics and mutant phenotypes.
McWhite et al. A Pan-plant Protein Complex Map Reveals Deep Conservation and Novel Assemblies. Cell. 2020;181(2):460-474.e14. doi:10.1016/j.cell.2020.02.049
Hosted by:
April 22, 2022
Hosted by:
April 28, 2022
Cellular phenotypes emerge from layers of molecular interactions: proteins interact to form complexes, pathways, and phenotypes. We show that hierarchical networks of protein interactions can be extracted purely from the statistical pattern of proteome variation as measured across thousands of bacteria and that these networks reflect the emergence of complex bacterial phenotypes. We validate our results through gene-set enrichment analysis and comparison to existing experimentally-derived databases. We demonstrate the biological utility of our approach by creating a model of motility in Pseudomonas aeruginosa and using it to identify a protein that affects pilus-mediated motility. We anticipate that our method, SCALES (Spectral Correlation Analysis of Layered Evolutionary Signals), will be useful for interrogating genotype-phenotype relationships in bacteria.
Hosted by:
April 28, 2022
Sequence similarity search is a fundamental bioinformatic problem used in many read alignment and sequence clustering applications. In many of these applications, a seed-filter-extend methodology is used where short local matches (seeding) are found, some match sites are selected (filtering), and exact Smith-Waterman alignment is performed (extension). We have recently seen advances in all the above stages, making sequence similarity search an exciting and fast-moving research area.
In this talk, I will focus on the seeding step. In particular, I will discuss some seeding techniques and describe our recently proposed strobemer-seeds, a type of fuzzy seed that can match over substitutions and indels. I will expand on our previous talk on strobemers (HitSeq, 2021) to include applications where strobemers have proven useful, such as short-read mapping and long-read overlapping, as well as briefly discuss open problems and future research directions.
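As background for the seeding step, the sketch below shows plain exact k-mer seeding in Python. It is not an implementation of strobemers; it only illustrates the baseline behaviour that fuzzy seeds aim to improve on, namely that a single substitution removes every exact k-mer seed that overlaps it. The sequences and the choice of k are invented for illustration.

```python
from collections import defaultdict

def kmer_index(seq: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

def find_seeds(query: str, reference: str, k: int) -> list[tuple[int, int]]:
    """Return (query_pos, reference_pos) pairs of exact k-mer matches."""
    index = kmer_index(reference, k)
    seeds = []
    for q in range(len(query) - k + 1):
        for r in index.get(query[q:q + k], []):
            seeds.append((q, r))
    return seeds

reference = "ACGTTGCATGCCGATAGC"   # hypothetical reference sequence
read_exact = "TGCATGCCGA"          # copied from the reference (positions 4-13)
read_subst = "TGCATACCGA"          # same read with one substitution

print(len(find_seeds(read_exact, reference, k=5)))
print(len(find_seeds(read_subst, reference, k=5)))
```

With these toy sequences, the exact read yields a seed at every k-mer position, whereas the read with one substitution keeps only the single k-mer that does not overlap the changed base.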
Hosted by:
May 4, 2022
The nucleus is highly compartmentalized through the formation of distinct classes of membraneless domains, yet the composition and function of many of these structures is not well understood. Using APEX2-mediated proximity labelling and RNA sequencing, we surveyed transcripts associated with nuclear speckles, several additional domains, and the lamina. Remarkably, speckles and lamina are associated with distinct classes of retained introns enriched in genes that function in RNA processing, translation, and the cell cycle. In contrast to the lamina-proximal introns, retained introns associated with speckles are relatively short, GC-rich, and enriched for functional sites of RNA binding proteins that are concentrated in these domains. They are also highly differentially regulated across diverse cellular contexts, including the cell cycle. Our study thus provides a rich resource of nuclear domain-associated transcripts and further reveals speckles and lamina as hubs of distinct populations of retained introns linked to gene regulation and cell cycle progression.
Hosted by:
May 5, 2022
The Food and Drug Administration (FDA) faces significant challenges involving bioinformatics data, including receiving data from academic collaborators and industry sponsors, analyzing data in a secure and compliant environment, and fostering community solutions to real-world bioinformatics problems. We present precisionFDA, a cloud-based platform with FedRAMP Moderate security clearance, built and maintained by DNAnexus. This platform:
· allows FDA reviewers to collaborate with industry sponsors through shared scalable development environments;
· offers in-browser application development for user-driven construction of custom bioinformatic tools, or adaptation of existing tools;
· supports hosting public challenges to engage the bioinformatics community, focusing on real-world data analysis and current problems through an intuitive user experience.
The precisionFDA platform helps improve the tools, capabilities, and speed of analysis available to government researchers. At DNAnexus, we welcome external collaborators and work to present community challenges to help find novel solutions to biology and data problems.
Hosted by:
May 10, 2022
The construction of microbial networks has become a popular method to analyse microbial sequencing data, with dozens of network inference tools available. However, these tools usually return "hairballs", i.e. densely connected networks, which require further analysis in order to derive biological hypotheses from them. Here, I will present a set of tools designed to address this challenge, including manta for clustering, anuran for comparing and mako for querying microbial networks.
Hosted by:
May 18, 2022
The structural organization of the genome plays an important role in multiple aspects of genome function. Understanding how genomic sequence influences 3D organization can help elucidate their roles in various processes in healthy and disease states. However, the sequence determinants of genome structure across multiple spatial scales are still not well understood. To learn the complex sequence dependencies of multiscale genome architecture, here we developed a sequence-based deep learning approach, Orca, that predicts genome 3D architecture from kilobase to whole-chromosome scale, covering structures including chromatin compartments and topologically associating domains. Orca also makes both intrachromosomal and interchromosomal predictions and captures the sequence dependencies of diverse types of interactions, from CTCF-mediated to enhancer-promoter interactions and Polycomb-mediated interactions. Orca enables the interpretation of the effects of any structural variant at any size on multiscale genome organization and provides an in silico model to help study the sequence-dependent mechanistic basis of genome architecture. We show that the models accurately recapitulate effects of experimentally studied structural variants at varying sizes (300bp-80Mb) using only sequence. Furthermore, these sequence models enable in silico virtual screen assays to probe the sequence-basis of genome 3D organization at different scales. At the submegabase scale, the models predicted specific transcription factor motifs underlying cell-type-specific genome interactions. 
At the compartment scale, based on virtual screens of sequence activities, we propose a new model for the sequence basis of chromatin compartments: sequences at active transcription start sites are primarily responsible for establishing the expression-active compartment A, while the inactive compartment B typically requires extended stretches of AT-rich sequences (at least 6-12kb) and can form ‘passively’ without depending on any particular sequence pattern. Orca thus effectively provides an “in silico genome observatory” to predict variant effects on genome structure and probe the sequence-based mechanisms of genome organization.
Hosted by:
May 28, 2022
About the workshop
This 3-hour workshop will present how to analyze & visualize processed genomics data. The session will be divided into 5 parts:
Part 1: Getting started w/ readr
Part 2: Reshaping data w/ tidyr
Part 3: Data wrangling w/ dplyr
Part 4: Visualizing tidy data w/ ggplot
Part 5: Export and Wrap-up w/ rmarkdown
Learning Objectives
By the end of this workshop, you will be able to load your genomic dataset, perform basic data tidying & wrangling, data visualization, and save/export your results using tidyverse! Hopefully, you will also have a newfound appreciation for reproducible research and R!
Hosted by:
June 7, 2022
My group models cell types as attractors of a dynamic system of interacting (macro)molecules, and we aim to find the network patterns that determine these attractors. We collaborate with wet-bench biologists to develop and validate predictive dynamic models of specific systems. Over the years we found that network-based discrete dynamic modeling is very useful in synthesizing causal interaction information into a predictive, mechanistic model. We use the accumulated knowledge gained from specific models to draw general conclusions that connect a network's structure and dynamics. An example of such a general connection is our identification of stable motifs, which are self-sustaining cyclic structures that determine points of no return in the dynamics of the system. We have shown that control of stable motifs can guide the system into a desired attractor. We have recently translated the concept of stable motif to a broad class of continuous models. Stable motif - based attractor control can form the foundation of therapeutic strategies on a wide application domain.
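The idea of attractors in a discrete dynamic model can be illustrated with a toy Boolean network. The sketch below is not the group's method and does not implement stable motif analysis: the two-node network, its update rules, and the use of synchronous updating are all assumptions chosen to keep the example small.

```python
from itertools import product

# Hypothetical update rules for a toy two-node network (not from the talk).
RULES = {
    "A": lambda s: s["B"],             # A is activated by B
    "B": lambda s: s["A"] and s["B"],  # B needs both A and itself
}
NODES = sorted(RULES)

def step(state: tuple[bool, ...]) -> tuple[bool, ...]:
    """Apply every rule at once (synchronous update)."""
    named = dict(zip(NODES, state))
    return tuple(bool(RULES[n](named)) for n in NODES)

def find_attractors() -> set[tuple[tuple[bool, ...], ...]]:
    """Follow every start state until it revisits a state; collect the cycles."""
    attractors = set()
    for start in product([False, True], repeat=len(NODES)):
        seen = []
        state = start
        while state not in seen:
            seen.append(state)
            state = step(state)
        cycle = seen[seen.index(state):]
        # Rotate so the same cycle found from different starts compares equal.
        pivot = cycle.index(min(cycle))
        attractors.add(tuple(cycle[pivot:] + cycle[:pivot]))
    return attractors

for attractor in sorted(find_attractors()):
    print(attractor)
```

This toy network has two fixed-point attractors, one with both nodes off and one with both nodes on; the latter is self-sustaining because each node keeps the other active once both are on.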
Hosted by:
June 28, 2022
Genome sequencing studies have identified millions of somatic variants in cancer, but it remains challenging to predict the phenotypic impact of most. Experimental approaches to distinguish impactful variants often use phenotypic assays that report on predefined gene-specific functional effects in bulk cell populations. Here, we develop an approach to functionally assess variant impact in single cells by pooled Perturb-seq. We measured the impact of 200 TP53 and KRAS variants on RNA profiles in over 300,000 single lung cancer cells, and used the profiles to categorize variants into phenotypic subsets to distinguish gain-of-function, loss-of-function and dominant negative variants, which we validated by comparison with orthogonal assays. We discovered that KRAS variants did not merely fit into discrete functional categories, but spanned a continuum of gain-of-function phenotypes, and that their functional impact could not have been predicted solely by their frequency in patient cohorts. Our work provides a scalable, gene-agnostic method for coding variant impact phenotyping, with potential applications in multiple disease settings.
Hosted by:
June 28, 2022
All cancers are caused by somatic mutations imprinted by the activities of different mutational processes, with each process leaving a characteristic pattern of mutations termed a mutational signature. Characterizing mutational signatures can help understand the processes behind the onset and progression of tumors, with potential application as biomarkers in clinical practice. In our group, we have recently developed SigProfilerExtractor, an automated tool for accurate de novo extraction of mutational signatures for all types of somatic mutations. We have performed comprehensive benchmarking, with 34 distinct scenarios encompassing 2,500 simulated signatures operative in more than 60,000 unique synthetic genomes and 20,000 synthetic exomes, demonstrating that SigProfilerExtractor outperforms 13 other tools for extracting mutational signatures. For genome simulations with 5% noise, reflecting high-quality genomic datasets, SigProfilerExtractor identified between 20% and 50% more true positive signatures while yielding more than 5-fold fewer false positive signatures. Applying SigProfilerExtractor to 4,643 whole-genome and 19,184 whole-exome sequenced cancers revealed four previously missed mutational signatures, including a signature putatively attributed to tobacco smoking in bladder cancer and normal bladder epithelium.
Hosted by:
June 29, 2022
A variety of mutational processes shape genome evolution and can lead to the development of cancer by inducing DNA damage in the cells. These processes are triggered by environmental as well as intrinsic risk factors, and they leave specific footprints of somatic alterations in the genome. These mutational footprints, called “mutational signatures”, can be read from the tumour sequencing data and reveal the main sources of DNA damage driving neoplastic progression. In this sense, they can be considered a form of evidence for historical mutational events that have acted throughout an individual's lifetime. I will discuss some of the methodological innovations that have enabled the exploration of these mutational events in cancer genomes through the identification of systematic patterns of mutations in large-scale sequencing cohorts. I will also illustrate some of the applications of this methodology to studying both healthy ageing and cancer.
Hosted by:
June 30, 2022
Modern bioinformatics pipelines can be incredibly complex, but all tend to follow a common pattern: they start with raw data, then pass the data through various programs until arriving at a final result. If this is done in an ad-hoc, unorganized fashion, the results may not be reproducible or, even worse, may be unreliable or wrong. Pipeline management software is therefore essential to obtain results that are robust and reproducible. The targets R package is a recently developed workflow manager that comes with many excellent features for bioinformatics, including data caching, pipeline-level parallelization, and HPC support. In this hands-on workshop, I will demonstrate how targets can be used in concert with other tools like docker and conda to orchestrate modular, reproducible bioinformatics pipelines. The workshop will feature variant-calling as an example, but the concepts and tools can be applied to nearly any analysis.
Pre-requisites: Basic familiarity with R. Installations of recent versions of R, conda, and docker.
Duration: 2 hours
Hosted by:
August 25, 2022
Pooled CRISPR screening has emerged as a powerful method of uncovering entire gene networks and modulators of critical biomarkers [1,2], due to its scalability, low cost, and substantial resistance to inter-well and inter-plate artifacts. Unfortunately, the current methods of pooled CRISPR screening are only compatible with fitness or FACS-sortable phenotypes, while high-dimensional readout methods such as perturb-seq are costly and only apply to transcriptional readouts [3] and/or scalar protein readouts [4]. With the recent emergence of pooled optical screening methods [5,6], perturbagens such as gene-targeting gRNAs can be amplified and directly measured via in situ sequencing while maintaining cellular structure and spatial features. This enables CRISPR screens to be coupled with a nearly limitless range of imaging assays, such as cell migration, calcium signaling, CellPaint, quantitative phase contrast, protein aggregation, multicellular/cell-cell interaction assays, and more. Here we describe an automated platform that has been developed to allow for pooled optical screening at industrial capacity, as well as multiple optical CRISPR screens done at increasing scales. We describe the first screen conducted on morphological phenotypes using a modified version of CellPaint, in which genes targeting various core pathways were edited. We demonstrate that ultra-high-throughput morphological analysis successfully identified and grouped these gene clusters using simply CellPaint and high-dimensional morphological readouts. In the second screen, we conducted a druggable-genome scale screen to identify both marker-based modifiers of the mTOR pathway, as well as biomarker-free clustering of gene networks using machine learning and feature-based analysis. 
While only a handful of pooled optical screens have been conducted so far in the field, we demonstrate the beginning of a promising new stage of CRISPR screening technology, allowing for high-throughput functional genomics screens to span into a vastly wider assortment of imaging-based assays.
1. O. Shalem, Science, 353, 6166 (2013)
2. R. Ihry, Cell Reports, 27, 2 (2019)
3. A. Dixit, Cell, 167, 7 (2016)
4. M. Stoeckius, Nature Methods, 9, 865-868 (2017)
5. D. Feldman, Cell, 179, 3 (2019)
6. L. Funk, bioRxiv, doi:10.1101/2021.11.28.470116 (2021)
Hosted by:
August 31, 2022
The Ensembl REST API allows language-agnostic programmatic access to Ensembl data. This webinar will provide an introduction to the REST API and its documentation, and show how to access various data types.
Pre-requisites: None, although basic knowledge of a programming language (particularly Python or R) would be helpful.
Length: 1 hour
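As a rough idea of what programmatic access looks like, the sketch below builds a request to the public Ensembl REST server's lookup-by-symbol endpoint and parses the JSON reply using only the Python standard library. It is not taken from the webinar materials. The helper names lookup_url and fetch_json are hypothetical, and the gene symbol in the usage comment is an arbitrary example.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"

def lookup_url(symbol, species="homo_sapiens"):
    """Build the URL for the lookup/symbol endpoint (gene lookup by symbol)."""
    return f"{SERVER}/lookup/symbol/{species}/{urllib.parse.quote(symbol)}"

def fetch_json(url):
    """Send a GET request asking for JSON and return the decoded response."""
    request = urllib.request.Request(url, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)

# Example usage (requires network access, so it is left commented out):
# gene = fetch_json(lookup_url("BRCA2"))
# print(gene["id"], gene["seq_region_name"], gene["start"], gene["end"])
```

Because the API is plain HTTP returning JSON, the same request can be issued from R, a shell with curl, or any other language with an HTTP client.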
Hosted by:
September 6, 2022
Proteins interact with other macromolecules in complex cellular networks for signal transduction and biological function. Our previous work in Mendelian disorders found a widespread phenomenon that disease-associated alleles often perturb distinct protein activities rather than grossly affecting folding and stability. In the context of cancer, the functional impact of the vast majority of somatic mutations remains unknown, representing a critical knowledge gap for implementing precision oncology. Here, we present the development of a high-throughput functional variomics platform consisting of efficient mutant generation, sensitive cell viability and drug response assays, and functional proteomic profiling of signaling effects for select aberrations. We apply the platform to annotate thousands of genomic aberrations, including point mutations, indels, and gene fusions, potentially doubling the number of driver mutations characterized in clinically actionable genes. Further, the platform is sufficiently sensitive to identify weak drivers. Our data are accessible through a user-friendly, public data portal. Our study will facilitate biomarker discovery, prediction algorithm improvement, and drug development.
Hosted by:
September 13, 2022
Literature-based Gene Ontology Annotations (GOA) are biological database records that use controlled vocabulary to uniformly represent gene function information that is described in primary literature. Quality assurance of GOA is crucial for supporting biological research, such as gene expression analysis and gene clustering. However, a range of different kinds of inconsistencies can be identified between GOA and the scientific literature that serves as evidence for these annotations. The existing manual-curation approach to GOA consistency assurance is inefficient and is unable to keep pace with the rate of updates to gene function knowledge. Automatic tools are therefore needed to assist with GOA consistency assurance.
In this talk, Jiyu will present results from two recent studies that explore GOA inconsistencies and assess the feasibility of detecting such inconsistencies automatically. Jiyu will introduce a basic framework for implementing automatic GOA quality assurance systems that support human-in-the-loop curation. Jiyu will also discuss opportunities arising from these findings and point out feasible future studies on the implementation of automatic GOA quality assurance.
Hosted by:
September 30, 2022
With the advent of next-generation sequencing (NGS) technologies, there arose a need to identify candidate causal mutations. A challenge often faced in identifying and inferring causal SNPs from sequence data is that different methods must be chosen to predict the effect of mutations and determine whether they are bona fide. While NGS has delivered a wide array of highly sensitive, if less stringent, methods in the recent past, this workshop aims to bridge the gap by taking a systems genomics approach, moving from command-line scripts to Galaxy-based workflows. A special focus of this workshop is on current trends in genome analysis, with special insights into NGS analysis. The sessions largely focus on whole exome sequencing (WES) and whole transcriptome shotgun sequencing (WTSS) or RNA-seq pipelines, Galaxy-integrated workflows, and the latest trends in single-cell sequencing, with a detailed demonstration of the data analysis steps, including quality control, generation of variant calls, and gene expression analysis. Ample time will be set aside for discussing case studies on various disease phenotypes.
Hosted by:
October 11, 2022
Adverse drug-drug interactions (DDIs) are a major concern in polypharmacy because of their unexpected side effects, and they must be identified at an early stage of drug discovery and development. Many computational methods have been proposed for this purpose, but most require specific types of information or pay little attention to interpretation in terms of the underlying genes. We propose a deep learning-based framework for DDI prediction with drug-induced gene expression signatures so that the model can provide expression-level interpretability for DDIs. The model engineers dynamic drug features using a gating mechanism that mimics co-administration effects by imposing attention on genes. In addition, each side effect is projected into a latent space through translating embedding. As a result, the model achieved an AUC of 0.889 and an AUPR of 0.915 in unseen interaction prediction, outperforming other state-of-the-art methods. Furthermore, it can predict potential DDIs for new compounds not used in training. In conclusion, using drug-induced gene expression signatures followed by gating and translating embedding can increase DDI prediction accuracy while providing model interpretability.
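The idea of a translating embedding can be sketched in a few lines: a side effect is represented as a vector that, when added to the first drug's vector, should land close to the second drug's vector. The example below is a generic TransE-style scoring sketch, not the authors' model. The two-dimensional vectors, the side-effect names, and the function names are all made up for illustration. In the framework described above, the drug vectors would instead come from the gated gene expression features.

```python
import math

def translation_distance(head, relation, tail):
    """Euclidean distance ||head + relation - tail||; smaller means more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def rank_side_effects(drug_a, drug_b, side_effects):
    """Return side-effect names ordered from most to least plausible for the pair."""
    return sorted(
        side_effects,
        key=lambda name: translation_distance(drug_a, side_effects[name], drug_b),
    )

# Hypothetical, hand-picked embeddings for illustration only.
drug_a = [0.2, 0.1]
drug_b = [0.7, 0.5]
side_effects = {
    "side_effect_a": [0.5, 0.4],   # translates drug_a almost exactly onto drug_b
    "side_effect_b": [-0.3, 0.2],  # points away from drug_b
}

ranked = rank_side_effects(drug_a, drug_b, side_effects)
print(ranked)
```

In a trained model these vectors are learned so that observed drug-drug-side-effect triples receive small distances and unobserved ones large distances.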
Hosted by:
October 25, 2022
There have been many education and training efforts to raise bioinformatics awareness and research in the Asia Pacific region. In this talk, I will share our experience of and thoughts on bioinformatics education and training efforts in Asia Pacific, and how the Asia Pacific Bioinformatics Network (APBioNET) bridges, facilitates, and fills the gaps in these efforts.
Hosted by:
November 1, 2022
Transcription regulatory sequences (TRSs), which occur upstream of structural and accessory genes as well as the 5' end of a coronavirus genome, play a critical role in discontinuous transcription in coronaviruses. We introduce two problems collectively aimed at identifying these regulatory sequences as well as their associated genes. First, we formulate the TRS IDENTIFICATION problem of identifying TRS sites in a coronavirus genome sequence with prescribed gene locations. We introduce CORSID-A, an algorithm that solves this problem to optimality in polynomial time. We demonstrate that CORSID-A outperforms existing motif-based methods in identifying TRS sites in coronaviruses. Second, we demonstrate for the first time how TRS sites can be leveraged to identify gene locations in the coronavirus genome. To that end, we formulate the TRS AND GENE IDENTIFICATION problem of simultaneously identifying TRS sites and gene locations in unannotated coronavirus genomes. We introduce CORSID to solve this problem and show that it outperforms state-of-the-art gene finding methods in coronavirus genomes. Furthermore, we demonstrate that CORSID enables de novo identification of TRS sites and genes in previously unannotated coronavirus genomes. CORSID is the first method to perform accurate and simultaneous identification of TRS sites and genes in coronavirus genomes without the use of any prior information.
Hosted by:
November 2, 2022
The advent of multi-omics technologies (e.g., genomics, transcriptomics, proteomics, and metabolomics) has brought the hope of discovering novel biomarkers that can be used for the diagnosis, prognosis, and treatment of diseases. Data science has an important role in identifying biomarkers (biological markers) using data from microarray and RNA-Seq experiments. In this hands-on tutorial, you will learn how to use data science and transcriptomic data to discover biomarkers for diagnosis, prognosis, response to treatment, monitoring, and risk assessment.
Hosted by:
November 18, 2022
Do you have a data science tool you’d like to see people use at scale on UK Biobank and other biobank-sized data? If so, in this tutorial you will learn how to deploy your application as a standalone app, Jupyter notebook, RStudio object, or CWL/WDL workflow on the DNAnexus-enabled UK Biobank Research Analysis Platform. You'll also learn the easiest ways to distribute this functionality to others!
Hosted by:
November 22, 2022
Hyperconserved genomic sequences have great promise for understanding core biological processes. It has been recently proposed that scores of hyperconserved 5′ untranslated regions (UTRs), also known as transcript leaders (hTLs), encode internal ribosome entry sites (IRESes) that drive cap-independent translation, in part, via interactions with ribosome expansion segments. However, the direct functional significance of such interactions has not yet been definitively demonstrated. We provide evidence that the putative IRESes previously reported in Hox gene hTLs are rarely included in transcript leaders. Instead, these regions function independently as transcriptional promoters. In addition, we find the proposed RNA structure of the putative Hoxa9 IRES is not conserved. Instead, sequences previously shown to be essential for putative IRES activity encode a hyperconserved transcription factor binding site (E-box) that contributes to its promoter activity and is bound by several transcription factors, including USF1 and USF2. Similar E-box sequences enhance the promoter activities of other putative Hoxa gene IRESes. Moreover, we provide evidence that the vast majority of hTLs with putative IRES activity overlap transcriptional promoters, enhancers, and 3′ splice sites that are most likely responsible for their reported IRES activities. These results argue strongly against recently reported widespread IRES-like activities from hTLs and contradict proposed interactions between ribosomal expansion segment ES9S and putative IRESes. Furthermore, our work underscores the importance of accurate transcript annotations and of controls in bicistronic reporter assays, as well as the power of synthesizing publicly available data from multiple sources.
Hosted by:
December 6, 2022
TBD
Hosted by:
December 20, 2022
Biomedical data are accumulating at an unprecedented rate and integrating them in a unified framework is a major challenge of the post-genomics era. We have created a gigantic heterogeneous network (more than 450k nodes and 30M edges) that harmonizes and connects data points from over 150 sources. Overall, 12 types of biological entities (e.g. genes, diseases, drugs) were linked by 67 types of relationships (e.g. drug treats disease, gene interacts with gene). In order to properly exploit the gathered knowledge, we systematically encoded these connections as numerical vectors (embeddings), creating the Bioteque, a resource of biological network embeddings of unprecedented size and scope (https://bioteque.irbbarcelona.org). We show that this concise representation of the data retains the meaningful information contained within the biological network and can be plugged into machine learning implementations, and we illustrate how it can be used to characterize a given set of experimental observations.
Hosted by: