| Monday, July 24, between 18:00 CEST and 19:00 CEST | Tuesday, July 25, between 18:00 CEST and 19:00 CEST |
|---|---|
| Session A Poster Set-up and Dismantle. Session A Posters set up: Monday, July 24, between 08:00 CEST and 08:45 CEST. Session A Posters dismantle: Monday, July 24, at 19:00 CEST | Session B Poster Set-up and Dismantle. Session B Posters set up: Tuesday, July 25, between 08:00 CEST and 08:45 CEST. Session B Posters dismantle: Tuesday, July 25, at 19:00 CEST |

| Wednesday, July 26, between 18:00 CEST and 19:00 CEST | |
|---|---|
| Session C Poster Set-up and Dismantle. Session C Posters set up: Wednesday, July 26, between 08:00 CEST and 08:45 CEST. Session C Posters dismantle: Wednesday, July 26, at 19:00 CEST | |

| Virtual |
|---|
Presentation Overview:
Bioinformatic pipelines for variant calling have recently undergone dramatic improvements given the decreasing costs of next-generation sequencing experiments. However, variant discovery in tumoral samples is hindered by the great variety of cancer types, high tumoral heterogeneity, and unpredictability of sequencing errors. A fully characterized validation dataset for somatic variant calling that takes all these aspects into account is still missing. In this work, we performed an extensive review of nine somatic sample simulators (Synggen, SVEngine, BAMSurgeon, VarSim, Xome-Blender, tHapMix, Pysim-sv, SCNVSim, and HeteroGenesis), evaluating their ability to control variant features (such as type, number, position, length, content and zygosity) and tumoral features (such as clonality), to learn variant and error profiles from real data, and to retrieve all files needed for variant calling validation. No individual simulator was able to provide the user full control over both variant features and tumoral features, together with adequate modeling of in-silico sequencing. However, Synggen, with its ad-hoc built-in read simulator that combines three different error models, perfectly emulates the variability of technical noise in real sequencing data, and SVEngine provides the most complete framework for simulating biological variability of tumoral samples, allowing the user to define all variant features for each individual variant.
Presentation Overview:
Advances in spatial transcriptomics technologies have enabled the gene expression profiling of tissues while retaining their spatial context. Effective exploitation of this data combination requires spatially informed analysis tools to perform three key tasks: spatial clustering, multi-sample integration, and cell type deconvolution. Here, we present GraphST, a novel graph self-supervised contrastive learning method that incorporates spatial location information and gene expression profiles to accomplish all three tasks in a streamlined process while outperforming existing methods in each task. GraphST combines graph neural networks with self-supervised contrastive learning to learn informative and discriminative spot representations by minimizing the embedding distance between spatially adjacent spots and maximizing it between non-adjacent spots. With GraphST, we achieved 10% higher clustering accuracy on multiple datasets than competing methods, and better delineated the fine-grained structures in tissues such as the brain and embryo. Moreover, GraphST is the only method that can jointly analyze multiple tissue slices in both vertical and horizontal integration while correcting for batch effects. Lastly, compared to other methods, GraphST’s cell type deconvolution achieved higher accuracy on simulated data. On experimentally acquired data, it better captured spatial niches such as lymph node germinal centers and exhausted tumor infiltrating T cells in breast tumor tissue.
Presentation Overview:
Patents play a crucial role in the drug discovery process by providing legal protection for discoveries and incentivising investments in research and development. By identifying patterns within patent data resources, researchers can gain insight into the market trends and priorities of the pharmaceutical industry, as well as additional perspectives on more fundamental aspects such as the emergence of potential new drug targets. In this paper, we used the PEMT to integrate and analyse patent literature for rare diseases (RD) and Alzheimer's disease (AD). This is followed by a systematic review of the underlying patent landscape to decipher trends and applications in patents. We start by discussing organisations involved in R&D in AD and RD. This allows us to gain an understanding of the importance of AD and RD from specific organisational perspectives. Next, we analysed the historical focus of patents on therapeutic targets and correlated it with market scenarios, allowing the identification of prominent targets for a disease. Lastly, we identified repurposed drugs within the two diseases with the help of patents. The study demonstrates the expanded applicability of patent documents from legal use to drug discovery, design, and research, thus providing a valuable resource for future drug discovery efforts.
Presentation Overview:
In higher eukaryotes, pre-mRNA splicing is a set of reactions catalyzed by the spliceosome, a complex consisting of small nuclear ribonucleoproteins (U1, U2, U4, U5 and U6 snRNPs). The importance of splicing is illustrated by the fact that 50% of the reported human genetic diseases arise from disruption of splicing by mutations in the splicing sites or in the cis-acting splicing regulatory sites. In yeast, it is easy to identify the branch point sequence (BPS) because it is a nearly invariant UACUAAC sequence with the branch point adenosine (BPA) being the sixth nucleotide, which is exactly complementary to GUAGUA in U2 snRNP. However, BPS characterization in mammals has been a far more complicated task because the BPS is highly variable. In this paper, we propose a novel computational framework integrating candidate BPS and polypyrimidine tract (PPT) information for BPS prediction. A novel scoring system that integrates the scores of the BPS and PPT sequences was developed to predict the BPS. We demonstrate that our method outperformed previously published methods. Compared to the SVM method, this new method can be easily applied to other mammals and can predict the BPS without the “TNA” structure.
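The yeast case described above is simple enough to sketch as a plain motif scan. The snippet below is only an illustration of that baseline, not the mammalian scoring framework proposed in the abstract; the function name `find_yeast_branch_points` and the example intron are hypothetical.

```python
YEAST_BPS = "UACUAAC"  # near-invariant yeast branch point sequence
BPA_OFFSET = 5         # the branch point adenosine is the sixth nucleotide (0-based index 5)


def find_yeast_branch_points(intron_rna: str) -> list[int]:
    """Return 0-based positions of candidate branch point adenosines."""
    seq = intron_rna.upper().replace("T", "U")  # accept DNA or RNA input
    positions = []
    start = seq.find(YEAST_BPS)
    while start != -1:
        positions.append(start + BPA_OFFSET)
        start = seq.find(YEAST_BPS, start + 1)  # continue scanning after this hit
    return positions


# Hypothetical toy intron, invented for illustration only.
example_intron = "GUAUGUAAAUACUAACAUUUUUAG"
print(find_yeast_branch_points(example_intron))
```

An exact-match scan like this is adequate only where the motif is nearly invariant; for highly variable mammalian BPS, a scoring approach such as the one described in the abstract is needed.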
Presentation Overview:
Motivation: Improving the efficacy and overcoming the malfunctions of systems are significant challenges. Variability characterizes all levels of complex biological systems. We reviewed the relevant publications and described a method for improving the systems' function.
Results: The constrained disorder principle (CDP) defines the function of living systems based on their degree of variability. Per the CDP, the boundaries of a system define its function and efficiency. The present paper aims to describe the role of variability in biological systems and the generation of CDP-based second-generation artificial intelligence (AI) algorithms designed to improve effectiveness and correct malfunctions of biological organisms by focusing on implementing personalized variability signatures. The paper describes some of the challenges of first-generation AI systems, focusing on the three-step process of establishing the second-generation platforms, which comprises: the use of a pseudorandom number generator in an open-loop system, implementing variability based on feedback in a closed-loop system, and quantifying variability signatures in a personalized way for improving the algorithm's output. Examples of its use in humans are provided. The CDP provides a platform for improving disturbed systems' functions using second-generation AI systems.
Presentation Overview:
Resistance to programmed cell death (PCD) is a hallmark of cancer. While some PCD components are prognostic in cancer, the roles of many molecules are masked by redundancies and crosstalk between PCD pathways, impeding the development of targeted therapeutics. Recent studies characterizing these redundancies have identified PANoptosis, an innate immune-mediated inflammatory PCD pathway that integrates components from other PCD pathways. Here, we designed a systematic computational framework to determine the pancancer clinical significance of PANoptosis and identify targetable biomarkers. We found that high expression of PANoptosis genes was detrimental in low-grade glioma (LGG) and kidney renal cell carcinoma (KIRC). ZBP1, ADAR, CASP2, CASP3, CASP4, CASP8 and GSDMD expression consistently had negative effects on prognosis in LGG across multiple survival models, while AIM2, CASP3, CASP4 and TNFRSF10 expression had negative effects for KIRC. Conversely, high expression of PANoptosis genes was beneficial in skin cutaneous melanoma (SKCM), with ZBP1, NLRP1, CASP8 and GSDMD expression consistently having positive prognostic effects. As a therapeutic proof of concept, we treated melanoma cells with a combination therapy that activates ZBP1 and showed that this treatment induced PANoptosis. Overall, through our systematic framework, we identified and validated key innate immune biomarkers that can be targeted to improve patient outcomes in cancers.
Presentation Overview:
High-throughput screening based on CRISPR-Cas9 libraries has become an attractive and powerful technique to identify target genes for functional studies. However, accessibility of public data is limited due to the lack of user-friendly utilities and up-to-date resources covering experiments from third parties. Here, we describe iCSDB, an integrated database of CRISPR screening experiments using human cell lines. We compiled two major sources of CRISPR-Cas9 screening: the DepMap portal and BioGRID ORCS. The DepMap portal is itself an integrated database that includes three large-scale CRISPR screening projects. We additionally aggregated CRISPR screens from BioGRID ORCS, which is a collection of screening results from PubMed articles. Currently, iCSDB contains 1375 genome-wide screens across 976 human cell lines, covering 28 tissues and 70 cancer types. Importantly, the batch effects from different CRISPR libraries were removed and the screening scores were converted into a single metric to estimate the knockout efficiency. Clinical and molecular information was also integrated to help users readily select cell lines of interest. Furthermore, we have implemented various interactive tools and viewers to help users choose, examine and compare the screen results at both the gene and guide RNA levels. iCSDB is available at https://www.kobic.re.kr/icsdb/.
Presentation Overview:
The heterochromatic and highly repetitive state of the Y chromosome leads to multiple difficulties when assembling its scaffolds and contigs, resulting in a lack of final assembled sequences for it. In Drosophila melanogaster, the Y chromosome has an estimated size of 41 Mb of repeat-rich sequences, but only 10% of them are assembled in the most recent genome release. In addition, the protocol for designing probes used in full chromosome fluorescent labelling experiments does not include repetitive sequences to avoid off-target hybridization, resulting in <1500 oligopaint probes for this Y chromosome, a value at least 10x smaller than for the other chromosomes of the same species. Here we present OligoY, a pipeline that allows the design of oligopaint probes for the Y chromosome of any species. Built on open-source bioinformatics tools, OligoY gives the user the autonomy to choose parameters and effectively uses repetitive sequences to design probes that are exclusive to the target chromosome, thus maximising the overall efficiency of cytogenetic experiments. After extensive tests and validations in silico and in situ, we verified that OligoY allows staining of the Y chromosome without generating off-target signal, despite using repetitive sequences for oligopaint probe design.
Presentation Overview:
Although the value and importance of data is increasingly recognized, data generated in the research process is difficult to share and collaborate on due to the different ways researchers collect and manage data and the lack of standardized data models. In particular, the food field produces a wide variety of research data for each field including processing, safety, and functionality. To solve this, we conducted a study to analyze and standardize food research data formats. In this study, we developed a data management plan format and collected and integrated metadata from 20 research projects to identify data types. We categorized the collected data into sample data and result data, and defined data model names for each data type. The essential elements of the selected data models were identified through interviews with food research data experts. We selected 12 and 15 data model names to group sample data and result data, respectively.
Developing a standardized data model can increase the accuracy and consistency of data and facilitate data sharing, reuse, and integration across different platforms and systems. This facilitates preliminary research utilizing existing data and reduces duplicate production of data, ultimately reducing the time and cost of food research.
Presentation Overview:
Hypertension is a polygenic disease that affects over 1.2 billion adults aged 30–79 worldwide. It is a major risk factor for renal, cerebrovascular, and cardiovascular diseases. The heritability of hypertension is estimated to be high; nevertheless, the understanding of the underlying mechanisms remains scarce and incomplete. Using a novel method called PWAS (proteome-wide association study) on participants from the UK Biobank (UKB), we discovered 70 statistically significant associated genes, most of which failed to reach significance by the routine GWAS, which is variant-based. Our findings were validated against independent cohorts, including the Finnish Biobank, and confirmed a substantial fraction of the PWAS hypertension-associated genes. The gene-based analyses that were performed on both sexes separately revealed a sex-dependent genetic signal with a stronger component associated with females. Analysis of the measurements for systolic and diastolic blood pressure for the entire UKB cohort confirmed the dominant genetic contribution for females. In this study, we will demonstrate the advantage of applying gene-based association methods over the classical GWAS in interpretability and in identifying sex-specific genetic signals as a lead towards mechanistic understanding of hypertension and related phenotypes.
Presentation Overview:
Hypertension is a polygenic disease that affects over 1.2 billion adults worldwide. It is a major risk factor for renal, cerebrovascular, and cardiovascular diseases. The understanding of the underlying mechanisms remains scarce and incomplete. This study covered participants of European ancestry from the UK Biobank, with 74,090 cases diagnosed with essential (primary) hypertension and 200,734 controls. We compared the findings from large-scale GWAS to the gene-based method of proteome-wide association studies (PWAS). PWAS is based on a machine-learning-trained model to assess the impact of any variant on protein functionality. Applying PWAS in a case-control setting, 70 statistically significant associated genes were identified, most of which failed to reach significance in variant-based GWAS. A third of the PWAS-associated genes were replicated in independent cohorts. Gene-based analyses that were performed separately on females and males revealed sex-dependent genetics with a stronger component associated with females. Analysis of systolic and diastolic blood pressure measurements confirmed strong female-specific genetic effects. We demonstrated that gene-based approaches provide insight into the biology of hypertension with top-ranked significant genes that are involved in cellular immunity. We conclude that studying hypertension and blood pressure via gene-based association methods improves interpretability and exposes sex-dependent genetic effects, which enhances clinical utility.
Presentation Overview:
High-throughput RNA-sequencing technologies that provide spatial resolution of transcripts, popularly known as spatial transcriptomics, are on the rise. This technology seems highly promising and has untapped potential for expression-driven discovery in development and disease. Nonetheless, it also faces the central challenge of mixed cell type signals due to limitations in resolution. This is apparent in sequencing-based 10X Visium, where slides have relatively large spots of 55 μm. This mixed transcriptional signal can pose inferential problems; however, it can theoretically be deconvoluted into underlying cell types. To this end, we developed a systematic deconvolution framework and performed benchmarking in previously unvalidated healthy and disease samples from human coronary arterial and kidney disease. We used Cell2location, RCTD and spatialDWLS, which have previously been shown to perform well in mouse brain and simulated data (1). We show that all three methods are capable of deconvoluting verifiable cell types when benchmarked against expert-provided ground truth, based on accuracy scores (0.7-0.73). Kidney podocyte cells and major populations of macrophages, smooth muscle cells and fibroblasts in arteries are all deconvoluted with a high level of agreement. Bayesian Cell2location is more computationally demanding; however, it provides quality solutions when less reference data is available.
Presentation Overview:
Profiling the molecular features of all cells with their anatomical and functional attributes is essential for understanding the human body in health and disease. Scientists have been enthusiastic about building such atlases of human cells using single-cell omics technologies. The community has conducted more and more single-cell studies with the rapid development and popularization of single-cell RNA-sequencing technologies. A tremendous amount of single-cell data has been accumulating in the public domain. This suggests the possibility of building cell atlases by assembling such “shot-gun” data in scattered publications. Cell atlas assembly faces several major challenges compared with the shot-gun assembly of the human genome. We proposed a unified information framework for assembling atlases and built the first cell-centric human Ensemble Cell Atlas (hECA) assembled from scattered data. We developed the “in data” cell sorting scheme, which allows cells to be extracted from the hECA using logic formulas, treating it as a “virtual human body” to investigate scientific questions involving multiple organs and cell types. We also developed a multidimensional coordinate system, UniCoord, for different physical and biological attributes of cells by adopting a supervised variational autoencoder (VAE) neural network model, and trained it on hECA to make it represent the diversity of healthy human cells.
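The idea of extracting cells with a logic formula can be illustrated with a small filter over annotated cell records. This is only a sketch under assumed conventions: the record fields (`organ`, `cell_type`, `expr`), the marker thresholds, and the function names are hypothetical and do not reflect hECA's actual interface.

```python
# Toy cell records; all values are invented for illustration.
cells = [
    {"id": "c1", "organ": "heart", "cell_type": "T cell",
     "expr": {"CD3E": 5.2, "PTPRC": 3.1}},
    {"id": "c2", "organ": "lung", "cell_type": "fibroblast",
     "expr": {"CD3E": 0.0, "PTPRC": 0.1}},
    {"id": "c3", "organ": "lung", "cell_type": "T cell",
     "expr": {"CD3E": 4.0, "PTPRC": 2.7}},
]


def sort_cells(cell_records, formula):
    """Keep the cells for which the logic formula evaluates to True."""
    return [cell for cell in cell_records if formula(cell)]


def cd3e_ptprc_positive_in_heart_or_lung(cell):
    """Example formula: (organ is heart OR lung) AND CD3E > 1 AND PTPRC > 1."""
    in_organs = cell["organ"] in {"heart", "lung"}
    cd3e_high = cell["expr"].get("CD3E", 0.0) > 1.0
    ptprc_high = cell["expr"].get("PTPRC", 0.0) > 1.0
    return in_organs and cd3e_high and ptprc_high


selected = sort_cells(cells, cd3e_ptprc_positive_in_heart_or_lung)
print([cell["id"] for cell in selected])
```

Expressing the selection as a predicate keeps the criteria independent of the original study labels, which is what allows cells from multiple organs to be pooled in one query.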
Presentation Overview:
During the pandemic, children were less susceptible to contracting COVID-19, and studies have shown that children diagnosed with leukemia often had prior infection with COVID-19. Recent studies showed that MDA5 (encoded by IFIH1) is responsible for children’s increased immunity to COVID-19. Our goal is to test the hypothesis that IFIH1 and its regulating miRNAs are biomarkers linked to AML in children. We also wanted to identify candidate genes that protect children from viral infections and leukemia development. Because miRNAs are important regulators of gene expression, we investigated our project goal in the context of miRNA targeting mechanisms.
By querying TarBase for IFIH1 and then searching for genes regulated by its targeting miRNAs, we identified two significant miRNAs, hsa-196a-5p and hsa-196b-5p, and 51 of their target genes that have high expression (>500 TPM) reported in TCGA AML RNA-Seq samples. Protein-protein interaction analysis with the STRING database indicated that two genes, STAT3 and MAP3K1, directly interact with IFIH1. Our DAVID/KEGG pathway analysis results further revealed that the three candidate genes (IFIH1, STAT3, MAP3K1) were also involved in Hepatitis B (p-value < 0.0004). Our results for these three genes in AML samples indicate that IFIH1 is likely a candidate biomarker for AML.
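The expression filter described above (miRNA target genes with more than 500 TPM) can be sketched as follows. The function name, the placeholder gene names, and the TPM values are all hypothetical; only the two miRNA identifiers and the 500 TPM threshold come from the abstract.

```python
def highly_expressed_targets(targets_by_mirna, tpm_by_gene, min_tpm=500.0):
    """Return target genes of any listed miRNA whose expression exceeds min_tpm."""
    all_targets = set().union(*targets_by_mirna.values())
    return sorted(g for g in all_targets if tpm_by_gene.get(g, 0.0) > min_tpm)


# Toy inputs: target assignments and TPM values are invented for illustration.
targets_by_mirna = {
    "hsa-196a-5p": {"GENE_A", "GENE_B"},
    "hsa-196b-5p": {"GENE_B", "GENE_C", "GENE_D"},
}
tpm_by_gene = {"GENE_A": 820.0, "GENE_B": 120.0, "GENE_C": 610.5, "GENE_D": 500.0}

print(highly_expressed_targets(targets_by_mirna, tpm_by_gene))
```

The comparison is strict, so a gene at exactly 500 TPM is excluded, matching the ">500 TPM" wording.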
Presentation Overview:
The One-Stop Database is an initiative that aims to create a comprehensive resource for researchers and medical professionals interested in celiac disease. The database includes updated information on genes, protein sequences and structures, -omics data, SNP data and clinical trial information associated with celiac disease. Furthermore, the integration of the BLAST search feature will enable users to query their sequences for similarity with celiac-related proteins. By providing accurate and up-to-date information on celiac disease, the database will be a valuable resource for the scientific and medical communities, with the ultimate goal of advancing the discovery of novel therapeutics and improving our knowledge of the disease. By creating this resource, we aim to bridge the knowledge gap in celiac disease and ultimately improve patient diagnosis and treatment options. Additionally, we have developed SVM and LSTM models using machine-learning techniques to classify celiac-inducing and non-inducing proteins. The resource can be accessed at: https://celiacindia.in/
Presentation Overview:
Three-dimensional (3D) genome organization is tightly coupled with gene regulation in various biological processes and diseases. In cancer, various types of large-scale genomic rearrangements can disrupt the 3D genome, leading to oncogenic gene expression. However, unraveling the pathogenicity of the 3D cancer genome remains a challenge since closer examinations have been greatly limited due to the lack of appropriate tools specialized for disorganized higher-order chromatin structure. Here, we updated a 3D-genome Interaction Viewer and database named 3DIV by uniformly processing ∼230 billion raw Hi-C reads to expand our contents to the 3D cancer genome. The updates of 3DIV are listed as follows: (i) the collection of 401 samples including 220 cancer cell line/tumor Hi-C data, 153 normal cell line/tissue Hi-C data, and 28 promoter capture Hi-C data, (ii) the live interactive manipulation of the 3D cancer genome to simulate the impact of structural variations and (iii) the reconstruction of Hi-C contact maps by user-defined chromosome order to investigate the 3D genome of the complex genomic rearrangement. In summary, the updated 3DIV will be the most comprehensive resource to explore the gene regulatory effects of both the normal and cancer 3D genome. ‘3DIV’ is freely available at http://3div.kr.
Presentation Overview:
Plants form the foundation of the nutrition that sustains life on Earth. To meet the increasing needs of the human population and tackle climate change, crops can provide protein-rich alternatives to animal-based protein. However, little is known about crop proteomes, which control every aspect of plant life. To address this gap, we launched the international doctoral program "The Proteomes that Feed the World" funded by the Elite Network of Bavaria. We aim to create a Crop Proteome Atlas by charting the proteomes of the 100 most vital crop plants for human nutrition. We established a robust protocol for analyzing plant tissues, and the resulting data will be publicly accessible. We also provide detailed information on our processing pipelines, along with an extensive update on ProteomicsDB. Our dataset serves as a valuable resource for developing tools in plant biology and provides new biological insights, including better genome annotations, cross-species analysis, homology inference, and protein function prediction. Our interdisciplinary team, comprising 16 PhD students and 12 principal investigators, supported by over 30 international partners, has optimized the project workflows. We seek partners to leverage the potential of this data. This Atlas is the first of its kind for many important crop plants.
Presentation Overview:
Citrus plants are a diverse group that belongs to the Rutaceae family. Among them, the genus Citrus is highly valued due to its economic and nutritional value. However, identifying the species and variety of seedlings acquired by growers can be challenging, as the leaves are similar. To address this issue, we propose using DArTseq technology to genetically analyze citrus samples and create a rapid species identification kit using the HRM technique. For this, 94 citrus samples from the state of Espírito Santo, Brazil, were sent to the Service of Genetic Analysis for Agriculture (SAGA) in Mexico for analysis. The results showed 64,442 SNP markers and 69,963 SilicoDArT markers. After data filtering, the number of SNPs was reduced to 9,073 and the number of SilicoDArT markers to 3,496. Their polymorphic information content (PIC) was 0.24 and 0.28, respectively. The dendrogram generated from these markers separated the nine citrus species into eight clusters. SNPs are being selected using RStudio and Biopython software for use in the HRM. Among the nine chromosomes, chromosome 2 presented the most SNPs, indicating the need for deeper analysis. We are continuing the analysis, and the expected results are promising.
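For readers unfamiliar with PIC, the sketch below computes it from allele frequencies using the classical formula PIC = 1 - Σ p_i² - Σ_{i<j} 2 p_i² p_j². This is an assumption for illustration: the abstract does not state which PIC definition was applied to the DArTseq markers, and the function name `pic` is hypothetical.

```python
from itertools import combinations


def pic(allele_freqs):
    """Polymorphic information content from allele frequencies that sum to 1."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    # Expected homozygosity: sum of squared allele frequencies.
    homozygosity = sum(p * p for p in allele_freqs)
    # Correction term over all unordered pairs of distinct alleles.
    cross_term = sum(2 * p * p * q * q for p, q in combinations(allele_freqs, 2))
    return 1.0 - homozygosity - cross_term


# A biallelic marker with equal allele frequencies reaches the biallelic maximum.
print(pic([0.5, 0.5]))
# A marker with a rarer minor allele is less informative.
print(pic([0.8, 0.2]))
```

Under this formula a biallelic marker cannot exceed 0.375, which gives some context for the reported averages of 0.24 and 0.28.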
Presentation Overview:
Understanding the cellular composition of complex tissues can help in uncovering disease mechanisms, treatment effects, and biological processes. Cell-type deconvolution methods quantify cellular composition from bulk RNA sequencing data using cell-type-specific transcriptomic signatures. While first-generation deconvolution methods are based on predefined signatures, second-generation deconvolution methods can directly learn these signatures from single-cell RNA sequencing data for virtually any cell type. However, differences in programming language, inputs, semantics, and workflows of these methods complicate their unified execution, and validating them poses additional challenges.
To address these issues, the omnideconv ecosystem was developed. It includes two R packages, omnideconv and SimBu, and a web app, DeconvExplorer, that can facilitate the systematic benchmarking of second-generation methods under different experimental conditions. The packages allow for the invocation of R and Python-based second-generation methods with single functions, and the simulation of pseudo-bulk RNA-seq datasets under different scenarios, respectively. Finally, DeconvExplorer provides a user-friendly web interface to analyze deconvolution results and signatures.
This framework makes second-generation deconvolution methods more accessible and streamlined and can aid in effectively utilizing large single-cell atlases. The omnideconv ecosystem is a novel resource that helps to benchmark second-generation methods and validate context-specific cell-type signatures.
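The pseudo-bulk simulation mentioned above can be sketched in a few lines: cells are sampled per cell type according to target fractions, and their counts are summed into one bulk profile. This is a simplified illustration in Python and not SimBu's API (SimBu is an R package); the function name, data layout, and toy counts are all hypothetical.

```python
import random


def simulate_pseudo_bulk(cells_by_type, fractions, n_cells, seed=0):
    """Sum counts of cells sampled per type to form one pseudo-bulk profile.

    cells_by_type: dict mapping cell type -> list of {gene: count} dicts
    fractions:     dict mapping cell type -> desired fraction of the sample
    """
    rng = random.Random(seed)
    bulk = {}
    for cell_type, fraction in fractions.items():
        n_draw = round(fraction * n_cells)
        # Sample cells of this type with replacement.
        for cell in rng.choices(cells_by_type[cell_type], k=n_draw):
            for gene, count in cell.items():
                bulk[gene] = bulk.get(gene, 0) + count
    return bulk


# Toy single-cell counts, invented for illustration.
cells_by_type = {
    "T cell": [{"CD3E": 10, "COL1A1": 0}, {"CD3E": 8, "COL1A1": 1}],
    "fibroblast": [{"CD3E": 0, "COL1A1": 5}, {"CD3E": 1, "COL1A1": 7}],
}
bulk = simulate_pseudo_bulk(cells_by_type, {"T cell": 0.75, "fibroblast": 0.25}, n_cells=100)
print(bulk)
```

Because the true fractions are known by construction, a profile like this can serve as ground truth when comparing deconvolution methods.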
Presentation Overview:
Somatic mutations in human cells have a heterogeneous genomic distribution, with increased burden in late-replicating, heterochromatic domains. This regional mutation density (RMD) varies between tissues, in association with tissue-specific replication timing (RT) or chromatin organization. We hypothesized that the RMD additionally varies between individual tumors independently of the tissue. Here, we identified three tissue-independent global RMD signatures that describe mutation risk redistribution across megabase-sized domains in >4000 tumors. First, we identified an RMD redistribution preferentially affecting facultative heterochromatin, Polycomb-marked domains, enriched in the B1 subcompartment and in malleable Hi-C domains. This RMD signature strongly reflects recurrent patterns of plasticity in DNA RT and heterochromatin domains linked with a higher expression of cell cycle genes. Consistently, occurrence of this mutation redistribution pattern is associated with altered cell cycle control via loss of activity of the RB1 gene. Second, another independent global RMD signature was associated with loss-of-function of the TP53 pathway, mainly affecting the redistribution of mutation rates within late-RT regions. Our study highlights that RMDs at the domain scale are variable across tumors in a manner independent of tissue-of-origin, but associated with loss-of-function in cell cycle genes, which may trigger the local remodeling of heterochromatin, spatial chromatin contacts or the RT program.
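As a minimal illustration of what a regional mutation density profile is, the sketch below bins mutation positions into megabase-sized windows and normalizes the counts to fractions of the tumor's total. It is a simplified assumption, not the authors' pipeline: the function name, the fixed 1 Mb windows, and the toy coordinates are hypothetical.

```python
from collections import Counter


def regional_mutation_density(mutations, window=1_000_000):
    """Fraction of a tumor's mutations falling in each (chromosome, window) bin.

    mutations: iterable of (chromosome, 0-based position) tuples
    """
    counts = Counter((chrom, pos // window) for chrom, pos in mutations)
    total = sum(counts.values())
    return {bin_key: n / total for bin_key, n in counts.items()}


# Toy mutation calls, invented for illustration.
mutations = [("chr1", 100), ("chr1", 999_999), ("chr1", 1_000_000), ("chr2", 5)]
print(regional_mutation_density(mutations))
```

Normalizing by the total removes differences in overall mutation burden, so profiles from different tumors can be compared for redistribution between domains.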
Presentation Overview:
Translating knowledge from lab to field is not straightforward because field conditions are very different from lab conditions. We are therefore developing a new strategy to study the molecular wiring of plant traits directly in the field, based on profiling of individual field-grown plants (single-plant omics). During a recent field trial, we profiled the autumnal rosette leaf transcriptome and a range of phenotypes (before winter and at time of harvest) of 192 plants of the winter-type rapeseed variety Darmor, along with several environmental data layers at individual plant resolution, such as microbiomes and soil nutrient profiles. To analyze this spatial multi-omics dataset, we use unsupervised methods for integration of spatial omics data (MEFISTO) and supervised machine learning methods. The latter model plant phenotypes as a function of other data layers, such as autumnal gene expression, and identify features that potentially influence plant yield. Important features in our yield models include genes involved in vegetative-to-reproductive phase transition and floral transition, indicating that developmental processes in autumn influence final yield in summer. Conceptual similarity between single-plant and single-cell data allows us to apply methods from the single-cell field, such as trajectory inference, to our single-plant data to further unravel these developmental effects.
Presentation Overview:
Background: With the dramatic rise in clinical trials for recombinant adeno-associated virus (rAAV)-based gene therapies, there is increasing demand from regulatory agencies for more standardized and systematic approaches for nucleic acid characterization to mitigate vector toxicity. Single-molecule, real-time (SMRT) sequencing enables interrogation of rAAV genomes and packaged product- and process-related impurities at a single molecule level without fragmentation. This technology can thus address one of the remaining challenges in producing rAAV vectors, which is gaining an understanding of packaged impurities that may impact the efficacy and safety of rAAV vectors.
Method: We have developed a SMRT sequencing and computational workflow to characterize rAAV vectors and DNA impurities. Our approach recovers reads with low base calling accuracy and incorporates barcode scores to profile single-stranded and self-complementary rAAVs.
Results: This method identifies product- and process-related impurities, including truncated rAAV genomes, chimeras of rAAV genomes, and residual plasmid and host cell genomic DNA. In addition, we found that current recommendations restricting the analysis to High Fidelity (≥99% accuracy) reads with high-quality barcode scores (≥80) skew estimations of intact rAAV genomes.
Conclusion: Pairing long read sequencing with the computational tools we have developed offers novel insights into rAAV genome integrity and impurities.
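To make the filtering effect in the results concrete, the sketch below compares the estimated fraction of intact genomes with and without the recommended thresholds (≥99% accuracy, barcode score ≥80). It is a toy illustration only: the read records, field names, and function name are hypothetical, and the invented numbers do not indicate the direction or size of the skew the authors observed.

```python
def intact_fraction(reads, min_accuracy=0.0, min_barcode=0):
    """Fraction of retained reads classified as intact genomes (None if no reads pass)."""
    kept = [
        read for read in reads
        if read["accuracy"] >= min_accuracy and read["barcode_score"] >= min_barcode
    ]
    if not kept:
        return None
    return sum(read["intact"] for read in kept) / len(kept)


# Toy reads, invented for illustration.
reads = [
    {"accuracy": 0.995, "barcode_score": 90, "intact": True},
    {"accuracy": 0.999, "barcode_score": 85, "intact": True},
    {"accuracy": 0.970, "barcode_score": 60, "intact": False},
    {"accuracy": 0.980, "barcode_score": 75, "intact": False},
]

print(intact_fraction(reads))                                     # all reads
print(intact_fraction(reads, min_accuracy=0.99, min_barcode=80))  # recommended filter
```

If read quality correlates with genome type, as in this toy data, the filtered estimate no longer reflects the packaged population, which is the concern raised in the results.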
Presentation Overview:
The accumulation of high-quality genome assemblies has facilitated a more accurate comparison of genomes among multiple species. Furthermore, the availability of various omic data has further extended the scope of such comparative studies to identify the consequences of multi-omic signatures and underlying mechanisms. When performing such comparative multi-omic analyses, the genome-wide comparison of multi-omic data and the visualization of the results are critical. In addition, the visualization needs to be efficient enough to handle a large volume of multi-omic data. However, there is still a lack of applications that fulfill such requirements. In this study, we developed a web-based system for comparative multi-omic analyses. Using the data generation pipelines in our system, users can easily (i) compare multiple genomes, (ii) produce the profiles of omic data, and (iii) perform integrative analyses using the profiles in a web interface. Users can also browse the analysis results in a web interface, which also helps discover genomic regions harboring interesting multi-omic signatures easily. The web interface works very efficiently because of intelligent indexing and multi-level data sampling. Our system will contribute to making the use of multi-omic data easier and more effective.
Presentation Overview: Show
Understanding the evolution of proteins and their interactions with other molecules is critically important. Recent advancements in prediction of 3D protein structure with AlphaFold2 have revolutionized structural biology, enabling the exploration of the protein universe at a depth previously impossible, because structure is conserved beyond the twilight zone of sequence similarity. Structural alignment, where evolutionarily related residues of multiple structures are grouped together, is the core of structural comparative analysis. However, with hundreds of millions (soon to be billions) of structures available, our current set of comparative alignment tools cannot scale to this enormous volume of data. Here we propose FoldMason, an alignment method capable of aligning huge sets of monomeric proteins. FoldMason is a progressive alignment tool built on top of Foldseek, our tool for rapid searches of massive protein structure databases. Foldseek utilises the 3D-interactions (3Di) alphabet, a novel structural alphabet based on tertiary interactions between neighbouring residues within proteins, to discretize structures, making them amenable to fast sequence alignment algorithms. FoldMason leverages this to construct multiple structure alignments of large protein datasets using a progressive alignment approach. Preliminary results on reference datasets show that FoldMason is orders of magnitude faster than gold-standard tools while maintaining comparable accuracy.
Presentation Overview: Show
Peptides are relevant in several biotechnology applications. These molecules have different biological activities, such as therapeutic, signalling, antimicrobial, and antitumoral. In particular, peptides are attractive as therapeutic agents. New research has driven an exponential increase in the number of these molecules in general and specialised databases. However, more user-friendly tools are needed so that researchers without bioinformatics or machine learning skills can study peptide sequences. In this work, we developed pepti-tools, a user-friendly web application for peptide analysis using bioinformatics and machine learning methods. On the bioinformatics side, we incorporate phylogenetic analysis of sequences through alignments against databases and multiple sequence alignments, as well as functional prediction (Gene Ontology/Pfam), secondary structure prediction, and structure search. Pepti-tools uses physicochemical properties, statistical property comparison, and analysis techniques to integrate sequence characterizers. On the machine learning side, it supports building predictive models and pattern recognition, facilitating the exploration of algorithms, hyperparameters, and numerical representations. Finally, functional activity classification models, prediction of antiviral/HIV activity (IC50), solubility estimation, immunogenicity, and promiscuity probability have been enabled, making it a powerful and highly usable alternative for studying peptide sequences without relying on bioinformatics and machine learning skills.
Presentation Overview: Show
In this study, we present the "multivariate MArginal ePIstasis Test" (mvMAPIT), a multi-outcome generalization of a recently proposed epistatic detection method which seeks to detect marginal epistasis, that is, the combined pairwise interaction effects between a given variant and all other variants. By searching for marginal epistatic effects, one can identify genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with conventional explicit search-based methods. Our proposed mvMAPIT builds upon this strategy by taking advantage of the correlation structure between traits to improve the identification of variants involved in epistasis. We formulate mvMAPIT as a multivariate linear mixed model and develop a multi-trait variance component estimation algorithm for efficient parameter inference and P-value computation. Together with reasonable model approximations, our proposed approach is scalable to moderately sized GWA studies. With simulations, we illustrate the benefits of mvMAPIT over univariate (or single-trait) epistatic mapping strategies. We also apply the mvMAPIT framework to protein sequence data from two broadly neutralizing anti-influenza antibodies and to approximately 2,000 heterogeneous stock mice from the Wellcome Trust Centre for Human Genetics.
Presentation Overview: Show
Trypanosoma cruzi (T. cruzi), a kinetoplastid protozoan parasite, is the etiologic agent of Chagas disease, which affects an estimated 8 million people worldwide, mainly in Latin America. Progress in developing improved treatments for Chagas disease is compromised by limitations in our knowledge of the mechanistic processes associated with the persistence of T. cruzi. During its life cycle, the parasite undergoes changes in morphology, metabolism, and gene expression as it passes from the epimastigote replicative stage in the insect midgut to the metacyclic trypomastigote form, which infects humans. Trypomastigotes progress into tissues, where they become amastigotes. We developed a robust method for isolating populations of amastigote parasites, followed by integrated proteome/transcriptome profiling to identify discriminant markers and pathways that distinguish dormant (amastigote) from replicating (epimastigote) parasites.
Presentation Overview: Show
Neuroblastoma is characterised by extensive inter- and intra-tumour genetic heterogeneity and varying clinical outcomes. One possible driver of this heterogeneity is extrachromosomal DNA (ecDNA), which segregates independently to the daughter cells during cell division and can lead to rapid amplification of oncogenes. While ecDNA-mediated oncogene amplification has been shown to be associated with poor prognosis in many cancer entities, the effects of ecDNA copy number heterogeneity on intermediate phenotypes are still poorly understood.
Here, we leverage DNA and RNA sequencing data from the same single cells in cell lines and neuroblastoma patients to investigate these effects. We utilise ecDNA amplicon structures to determine precise ecDNA copy numbers and reveal extensive intercellular ecDNA copy number heterogeneity. We further provide direct evidence for the effects of this heterogeneity on gene expression of cargo genes, including MYCN and its downstream targets, and the overall transcriptional state of neuroblastoma cells.
These results highlight the potential for rapid adaptability of cellular states within a tumour cell population mediated by ecDNA copy number, emphasising the need for ecDNA-specific treatment strategies to tackle tumour formation and adaptation.
Presentation Overview: Show
Molecular genetics correlates genotype with phenotype to discover important genomic regions. With the direct integration of NGS technologies into in-house workflows, genotyping has improved dramatically in terms of the number of genome-wide variants, such as SNP and indel markers, and thus the amount of genomic information. However, current variant calling methods are time-consuming and labor-intensive, and they lack a suitable mechanism for managing millions of SNPs and indels in-house, all of which delays innovation. At Karyosoft, we developed Variants, a cloud-based, user-friendly platform, to circumvent these issues, reducing the time from the usual 28+ hours per sample with 30x data coverage to 4 hours for 8 samples. Our Variants platform is 7 to 12 times faster, can reduce the cost by 7x to 12x, and can save up to 168 days for 96 samples. Additionally, our cloud-based Variant Mining Studio helps to manage and mine millions of SNPs and indels in seconds. Our platforms are broadly applicable to mutation discovery, direct genotyping, custom chip design, and amplicon sequencing. Above all, our user-friendly platform makes variant genotyping easy for scientists with any level of computational skills and empowers them to innovate faster.
Presentation Overview: Show
Bacterial pathogens use so-called bet-hedging to switch between different states, improving their chances of developing multiple resistance mechanisms in fluctuating antibiotic conditions, particularly in nosocomial environments. However, the underlying mechanism has not yet been fully investigated because few methods can explore bacterial heterogeneity at the sub-population level. Here, we utilized microbial single-cell RNA sequencing (Msc-RNA-seq), a high-throughput bacterial scRNA-seq technique, to profile multi-drug resistant Klebsiella pneumoniae populations at the single-cell level. Msc-RNA-seq employs random primers for in situ reverse transcription and droplets for DNA barcoding, allowing for high sensitivity and throughput. In our experimental scenarios, Msc-RNA-seq detected a median of approximately 700 genes per cell. Downstream scRNA-seq analysis revealed heterogeneity in K. pneumoniae populations under sub-lethal ceftazidime/avibactam exposure, and further confirmed our finding in clinical settings: the plasmid-encoded β-lactamase CTX-M-65 and its variant CTX-M-249 showed bet-hedging resistance against ceftazidime/avibactam and cefotaxime simultaneously. In addition, the workflow and framework we developed, including the experimental protocol and computational pipeline, will facilitate future discoveries in the evolution of resistant bacteria and, more broadly, promote bacterial population studies at the single-cell level.
Presentation Overview: Show
Following viral infection, the human immune system generates broad and dynamic CD8+ T cell responses to virus antigens. Characterizing such T cell responses helps to understand infection history and its contribution to protective immunity.
We performed in-depth profiling of CD8+ T cells reactive to CMV, EBV and Influenza virus derived antigens in peripheral blood samples from 114 healthy donors and 55 cancer patients using high-dimensional mass cytometry with combinatorial barcoding of peptide-MHC-I multimers and subsequent single cell RNA sequencing/VDJ-CITE-Seq for phenotypes and TCR repertoire analysis of identified antigen-specificities.
We analysed the expression of up to 138 surface markers from more than 500 antigen-specific T cell responses across six different HLA alleles by applying multiple machine learning approaches. Our data revealed unique phenotypic signatures of T cells specific for antigens from different virus categories. Based on these signatures, we built an ML approach to predict virus specificity from bulk CD8+ T cells. We validated our prediction capabilities in silico using an independent sample cohort and also in vitro by TCR expression in a Jurkat reporter assay. Our data suggest that machine learning can be used as a statistically rigorous and unbiased way to accurately predict antigen specificity from T cell phenotypes.
Presentation Overview: Show
Cytometry techniques are widely used to discover cellular characteristics at single-cell resolution. Many data analysis methods for cytometry data focus solely on identifying subpopulations via clustering and testing for differential cell abundance. For differential expression analysis of markers between conditions, only a few tools exist. These tools either reduce the data distribution to medians, discarding valuable information, or have underlying assumptions that may not hold for all expression patterns. Here, we systematically evaluated existing and novel approaches for differential expression analysis on real and simulated CyTOF data. We found that methods using median marker expressions compute fast and reliable results when the data are not strongly zero-inflated. Methods using all data detect changes in strongly zero-inflated markers, but some of them suffer from overprediction or cannot handle big datasets. We present a new method, CyEMD, based on calculating the earth mover’s distance between expression distributions, which can handle strong zero-inflation without being too sensitive. Additionally, we developed CYANUS, a user-friendly R Shiny App allowing the user to analyze cytometry data with state-of-the-art tools, including well-performing methods from our comparison. A public web interface is available at https://exbio.wzw.tum.de/cyanus/.
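The contrast between median-based and distribution-based comparison can be illustrated with a minimal sketch. It is not the CyEMD implementation; it only relies on the standard fact that, for one-dimensional data, the earth mover's distance between two empirical distributions equals the area between their empirical cumulative distribution functions. The function name `emd_1d` and the toy marker values are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import median


def emd_1d(a, b):
    """Earth mover's distance between two 1D samples.

    Computed as the area between the two empirical CDFs.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    xs, ys = sorted(a), sorted(b)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    # Both ECDFs are constant on each interval [left, right) of the grid.
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(xs, left) / len(xs)
        cdf_b = bisect_right(ys, left) / len(ys)
        total += abs(cdf_a - cdf_b) * (right - left)
    return total


# Hypothetical zero-inflated marker: most cells show no expression.
control = [0, 0, 0, 0]
treated = [0, 0, 0, 4]

print(median(control), median(treated))  # both medians are 0
print(emd_1d(control, treated))          # 1.0
```

In this toy case the medians are identical, so a median-based test sees no change, while the distance between the full distributions is non-zero.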
Presentation Overview: Show
Cytokinin dehydrogenase (CKX) is a small gene family that regulates the level of cytokinin in plants. In Triticum aestivum, 11 CKX subfamilies were identified with similar gene structures, motifs, domains, cis-acting elements, and an average signal peptide length of 25 amino acids. We performed a genome-wide identification of CKX family members in the Triticum aestivum genome to determine their chromosomal location, gene structure, cis-elements, phylogeny, synteny, and tissue- and stage-specific expression, along with gene ontology. This study also describes tissue- and stage-specific expression in detail and provides a resource for further analysis of CKX in the regulation of biotic and abiotic stress resistance, growth, and development in Triticum and other cereals, with the aim of higher production and proper management.
Presentation Overview: Show
IMGT®, the International ImMunoGeneTics Information System®, is the reference resource in immunogenetics and immunoinformatics. Its main objective is to provide the scientific community with the basic knowledge, databases and tools needed to explore the adaptive immune response using IMGT-ONTOLOGY standards. IMGT® is dedicated to advancing research and development in this field, with a focus on three key areas. Axis I centers on identifying and characterising immunoglobulin (IG) and T cell receptor (TR) genes in jawed vertebrates, with the aim of understanding the adaptive immune response. This axis serves as a foundation for the remaining two axes. Axis II focuses on analysing and exploring expressed IG and TR repertoires in normal and pathological situations, achieved through comparing these repertoires with IMGT reference directories. Axis III investigates the 2D and 3D structures of engineered antibodies and TR, with their functions, the amino acid changes and the modifications of their properties. This talk will focus on the most recent features of the IMGT® databases, tools, reference directories and web resources, with an emphasis on their relevance to the current challenges in adaptive immune response studies.
Presentation Overview: Show
Tandem repeat proteins (TRPs) are proteins containing repeated units that can be formed either by identical or nearly identical amino acid sequences in the primary structure or by structural patterns that can be superimposed in three-dimensional space. These tandem repeat units occur in various forms across all domains of life, ranging from short dipeptide repeats to longer units. TRPs can serve multiple functions, such as providing structural stability or catalytic activity, among others. However, predicting TRPs can be challenging due to their structural complexity and variability in sequence and length, making it difficult to predict their presence solely based on their primary sequences; therefore, methods based on the three-dimensional structure of proteins have been used to obtain the precise position of repeat units. Although these methods yield spectacular results, they come at a price: a substantial computational cost, making them unsuitable for large-scale analysis. The launch of AlphaFold marked a milestone in the history of computational biology, with over 200 million predicted structures. To perform large-scale detection of TRPs on AlphaFold-predicted structures, we have developed an ultra-fast prediction method that combines repeated element analysis in secondary structure with a subsequent three-dimensional evaluation.
Presentation Overview: Show
The assembly of highly heterozygous genomes, primarily those of hybrid organisms, from short sequencing reads remains challenging due to difficulties in accurately recovering different haplotypes. When standard assembly processes encounter highly heterozygous genomes, they tend to collapse homozygous regions and report heterozygous regions in alternative contigs. This creates boundaries between homozygous and heterozygous regions, leading to multiple assembly paths that are difficult to resolve. The result is usually a highly fragmented assembly with a larger total size than expected, causing problems in downstream analyses, such as fragmented gene model predictions, incorrect gene copy number and broken synteny. To address these issues, we present Redundans2, a Python3-based pipeline specifically designed to handle the short-read assembly of heterozygous genomes from small to large size. This pipeline includes a reduction step to recognize and selectively remove alternative heterozygous contigs, which can be applied to contigs derived from both short and long reads. In addition, Redundans2 allows the use of long reads as well as reference-based strategies for scaffolding. Our method is available for free at https://github.com/Gabaldonlab/redundans.
Presentation Overview: Show
Background: Drug development is a costly and challenging process, partly because of high failure rates at late stages of the drug discovery process. Lack of comprehensive knowledge of disease mechanisms and of the causal effects induced by perturbation of selected drug targets are key causes of failure. The increasing availability of multi-omics data provides opportunities to address these gaps through quantitative assessment of mechanistic hypotheses from such data.
Method: A two-step approach is proposed to infer mechanisms from multi-omics data. First, agreements between observed data and prior-knowledge molecular interaction graphs are identified. Second, measures of confidence are added to mechanistic hypotheses by joining evidence from causal-reasoning with evidence from multi-omics-based protein activity estimations.
Results: The approach is evaluated using proteomics, phosphoproteomics, and transcriptomics data from pro- and anti-inflammatory macrophages. Results demonstrate how omics layers complement each other to provide mechanistic insights e.g. for key regulators like STAT1 and STAT6.
Conclusion: The presented approach enables quantitative inference of mechanistic insights from complex biological systems, linking disease-causing genes to measured phenotypes and explaining causal routes from drug targets to perturbation-induced effects. This approach can help to identify novel high-confidence drug targets, reveal unfavorable off-target mechanisms, and thereby facilitate the drug discovery process.
Presentation Overview: Show
Bacterial genome annotation is key to identifying genes, providing insight into bacterial biology, metabolic pathways, strain classification and potential novel drug targets, and aiding in the development of new treatments. We evaluated the performance of four widely used annotation tools (NCBI Prokaryotic Genome Annotation Pipeline (PGAP), Prokka, Bakta, eggNOG-mapper) using 14,319 genomes from the Genome Taxonomy Database, each from a unique species. Each genome was also subjected to random deletions to simulate different states of genome assembly (“noise”). In non-modified conditions, PGAP predicted the highest median gene count (3907, IQR: 2890, 5041), while Prokka predicted the lowest (3768, IQR: 2768, 4843). However, PGAP struggled to annotate the predicted genes and had the second-highest median proportion of hypothetical proteins (19%, IQR: 16.5%, 21.8%), compared to Bakta (3%, IQR: 1.5%, 6.3%). Under noise conditions, PGAP retained the best annotation stability. The statistical results on how taxa influence annotation quality are still pending, but they are an important cornerstone of this work, so that users can choose the right strategy depending on their data. Our preliminary data already highlight the need for careful consideration when choosing between prokaryotic annotation tools.
Presentation Overview: Show
We developed and analyzed a novel in vitro hypoxic kidney organoid (K-org) model to study how hypoxia contributes to the development and progression of kidney disease.
K-orgs containing kidney-specific architecture were generated from human pluripotent stem cells and exposed to hypoxic conditions for 24h; immunostaining, ELISA and 13C-glucose flux analysis confirmed expected protein and functional response. Bulk and single cell (sc) transcriptional profiling was performed, and the latter integrated by RPCA using Seurat v4.0 (500-5000 genes, >50% mitochondrial reads/cell). Ten cell-type clusters were identified and analyzed using CellxGene. Sc and bulk RNA findings were integrated with experimental validations.
Differential expression analysis comparing hypoxic to normoxic organoids revealed increased but variable expression of HIF1A and HIF1A targets in podocytes, stromal, proximal tubular, and distal tubular cells. Causal network inference analysis confirmed HIF1A as the top upstream regulator of the observed bulk and sc transcriptional responses. Moreover, key metabolic pathways (glycolysis, sirtuin signaling, gluconeogenesis and mitochondrial dysfunction) were activated.
Computational analyses integrating multiple omics datasets demonstrate that hypoxic K-orgs capture key metabolic pathway perturbations seen in kidney disease. In combination with protein expression and functional studies, these data demonstrate the relevance of the hypoxic K-org model to study pathomechanisms in kidney disease.
Presentation Overview: Show
Computing is rapidly becoming one of the major contributors to carbon emissions. Mitigation strategies to reduce greenhouse gas emissions include shifting computation in location and time, making software more efficient, and using older hardware to avoid hardware obsolescence. In the public and especially the medical sector, users may be reluctant to location shift their computation due to privacy concerns. Therefore, we suggest time shifting. To date, no convenient tool is available to time shift the computation to a ‘greener time’ and report the estimated carbon emissions. In this talk, we will present a toolkit for carbon-aware bioinformatics which minimizes the effort required from the user to use greener energy and report emissions. It consists of an API returning the most favorable time to run the computation within a given tolerance frame, and a Python package to streamline the integration of the API into Python code and bioinformatics pipelines. Currently, users need to specify the estimated run time of their task, the percentage of renewable energy, an area code, and a deadline for the task. Users can upload reports on the resource usage of general tasks and pipelines, which will allow us to provide automatic estimates in the future.
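The time-shifting idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the toolkit's API: the function `best_start_hour`, its parameters, and the hourly carbon-intensity forecast are all hypothetical, and the sketch ignores the renewable-energy percentage and area code that the real service takes as input. It picks the start hour whose run window has the lowest mean carbon intensity while still finishing by the deadline.

```python
def best_start_hour(forecast, runtime_hours, deadline_hour):
    """Return (start_hour, mean_intensity) for the lowest-carbon run window.

    forecast: hypothetical hourly carbon intensities (gCO2eq/kWh); index 0 is now.
    runtime_hours: estimated run time of the task, in whole hours.
    deadline_hour: the task must finish by this index into the forecast.
    """
    if runtime_hours <= 0:
        raise ValueError("runtime_hours must be positive")
    latest_start = min(deadline_hour, len(forecast)) - runtime_hours
    if latest_start < 0:
        raise ValueError("task cannot finish before the deadline")

    # Sliding-window sum over all admissible start hours.
    window = sum(forecast[:runtime_hours])
    best_start, best_sum = 0, window
    for start in range(1, latest_start + 1):
        window += forecast[start + runtime_hours - 1] - forecast[start - 1]
        if window < best_sum:
            best_start, best_sum = start, window
    return best_start, best_sum / runtime_hours


# Made-up forecast for the next eight hours.
forecast = [400, 380, 300, 150, 120, 200, 350, 420]
print(best_start_hour(forecast, runtime_hours=2, deadline_hour=8))  # (3, 135.0)
```

With a tighter deadline the admissible windows shrink, so the chosen slot may be less favorable than the global optimum of the forecast.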
Presentation Overview: Show
The background mutation rate (BMR) in cancer is the neutral accumulation of passenger mutations that occur spontaneously during DNA replication and repair, and is influenced by diverse endogenous and exogenous factors such as mutagen exposures. Estimating the BMR is essential for quantifying selection in cancer evolution studies and identifying driver genes in cancer.
Given the increasing availability of whole-genome sequencing (WGS) data, we have developed HyperInVEx, a Bayesian-regularized Poisson regression model to quantify selection in cancer, refining the previous “InVEx” approach. HyperInVEx estimates the local BMR based on the intronic and intergenic mutations. Confounding by trinucleotide and pentanucleotide composition is stringently accounted for via a locus sampling approach.
Using 8054 whole-cancer genomes from 28 cancer types, we demonstrated that our intronic-based BMR can more accurately model local neutral mutation rates than covariate signals utilized by state-of-the-art selection models such as dNdScv and MutSigCV. Using HyperInVEx, we identified many known cancer genes, detected by dNdScv and MutSigCV, as well as a long tail of putative cancer driver genes that await replication.
Presentation Overview: Show
Background:
High content brightfield imaging enables cost effective, longitudinal and high throughput assessment of human adipose progenitor (AP) cell fate cultured in vitro. This is relevant for inferring new insights into human metabolic health. It is however not clear how to classify genetic perturbations from brightfield images of adipocytes.
Method:
We present a novel workflow for classifying brightfield images of CRISPR genetic perturbations, using literature-based gene function annotations for proof of concept. Our method encompasses virtual staining of brightfield images using deep neural networks to create fluorescence-like images of neutral lipid droplets, followed by training of Support Vector Machines (SVM) to distinguish the effect of gene loss of function on features extracted with CellProfiler. From the SVM results, we calculated the distance of images to the hyperplane, which helps us determine Z prime values to estimate differences in cellular phenotypes following genetic perturbations.
Result:
Our method enables high throughput investigation of novel regulators of AP cell fate from brightfield images with direct impact on human metabolic health.
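The Z prime calculation mentioned in the method can be sketched as follows. This is an illustration using the commonly used definition of the Z′ factor, Z′ = 1 − 3(σ₁ + σ₂) / |μ₁ − μ₂|, applied to two groups of values; the abstract does not state which exact variant the authors use, and the function name `z_prime` and the signed distances below are made up for the example.

```python
from statistics import mean, stdev


def z_prime(group_a, group_b):
    """Z' factor for two groups of values, e.g. signed distances to an SVM hyperplane.

    Uses the common definition 1 - 3 * (sd_a + sd_b) / |mean_a - mean_b|
    with sample standard deviations.
    """
    mu_a, mu_b = mean(group_a), mean(group_b)
    if mu_a == mu_b:
        raise ValueError("group means are identical; Z' is undefined")
    return 1.0 - 3.0 * (stdev(group_a) + stdev(group_b)) / abs(mu_a - mu_b)


# Hypothetical signed distances for perturbed and control images.
perturbed = [1.0, 2.0, 3.0]
control = [-11.0, -10.0, -9.0]
print(z_prime(perturbed, control))  # 0.5
```

Values close to 1 indicate well-separated groups with little spread, while values at or below 0 indicate overlapping groups.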
Presentation Overview: Show
Administrative medical databases contain a wealth of information on patients’ pathways that can be leveraged to improve our understanding of survival outcomes, like disease progression or treatment response. However, to fully capture the complexity of follow-up, it is essential to analyse all events that patients may experience. Recurrent events refer to subsequent occurrences of the same event, such as recurrences or rehospitalizations, which are common in many diseases.
In this context, we present an extension of the random forests algorithm for the analysis of survival data with recurrent events, utilizing concepts from non-parametric survival analysis and statistical learning.
The proposed approach is an ensemble of survival trees with the pseudo-score test as the splitting rule and the Nelson-Aalen estimator of the mean cumulative function for each terminal node. Model discrimination, measured with an adapted concordance index, and variable importance were computed to assess the algorithm. Cross-validation was used for hyperparameter optimisation and performance evaluation. We evaluated our methodology on both simulated and real-world data settings, and the results were promising with consistent findings.
The proposed methodology has the potential to facilitate the analysis of recurrent events in biological systems, providing key insights into the underlying mechanisms of survival outcomes.
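The terminal-node estimate described above can be illustrated with a minimal sketch of the Nelson-Aalen type estimator of the mean cumulative function (MCF). It is not the authors' forest implementation; the data layout (one tuple of event times and end of follow-up per patient) and the function name are assumptions. At each distinct event time, the estimate grows by the number of events at that time divided by the number of patients still under observation.

```python
from collections import Counter


def mean_cumulative_function(subjects):
    """Nelson-Aalen type estimate of the mean cumulative number of events.

    subjects: list of (event_times, followup_end) tuples, one per patient.
    Returns a list of (time, mcf_value) steps at the distinct event times.
    """
    # Count events per time point, ignoring any recorded after follow-up ended.
    events = Counter(
        t for times, end in subjects for t in times if t <= end
    )
    followup_ends = [end for _, end in subjects]

    steps, total = [], 0.0
    for t in sorted(events):
        at_risk = sum(1 for end in followup_ends if end >= t)
        total += events[t] / at_risk
        steps.append((t, total))
    return steps


# Three hypothetical patients: two recurrences, one recurrence, none.
patients = [([2, 5], 6), ([3], 4), ([], 8)]
for time, value in mean_cumulative_function(patients):
    print(time, round(value, 3))
```

For these toy patients the estimate steps to 1/3 at time 2, 2/3 at time 3, and 7/6 at time 5, where only two patients remain under observation.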
Presentation Overview: Show
Bacteriophages (phages), viruses that kill bacterial pathogens, are being collected for use in phage therapies, with the intention of applying these bactericidal viruses directly to infection sites in bespoke phage cocktails. Using such a biological agent for infection control requires a deep understanding of the phage. Despite the great unsampled phage diversity available for this purpose, a critical issue hampering the roll-out of phage therapy is the poor-quality functional annotation of the majority of phages.
To this end, we have formulated a pipeline, including machine learning-based algorithms that capture informative features and experimental validation, to predict key types of phage proteins. Most recently, we developed PhageProfiler, based on protein language models to annotate phage proteins with over 15 core functions, from the prevailing capsid and tail proteins, to the rare but critical anti-CRISPR and depolymerase proteins. Benefitting from the protein language models that learn patterns from millions of protein sequences across all life domains, PhageProfiler can capture the key characteristics to distinguish phage proteins with different functions. Having been extensively validated on various benchmarking tests and case studies, PhageProfiler represents the state-of-the-art method to accurately annotate phage genomes in a high throughput manner.
Presentation Overview: Show
Neurodevelopmental disorders (NDDs) such as intellectual disabilities, autism, epilepsy and others are genetic disorders that primarily affect the brain, despite being caused by germline mutations present throughout the body. They are characterized by a wide range of developmental and neurological manifestations, even for a single disorder, indicating a multitude of possible underlying disease mechanisms. We explored the tissue-specific expression patterns of around 1000 NDD risk genes in GTEx to identify subgroups with distinct molecular mechanisms. Using hierarchical agglomerative clustering, as well as gene and disease ontology enrichment, we found that the largest group of genes showed uniform expression across all tissues, suggesting that the brain-specific molecular context of the gene product, rather than the gene product itself, causes the observed brain-specific phenotype upon mutation of the gene. With this in mind, we are employing an integrative systems approach, combining various types of omics data of the brain and other tissues, evolutionary gene relationships, protein-protein interactions, and mutation data collected from cohorts of NDD patients, to further improve our mechanistic understanding of brain-specific processes in neurodevelopment. Our study has the potential to unravel pathways affected in NDDs, as well as to establish an approach for studying rare diseases in general.
Presentation Overview: Show
Two-Photon Microscopy (TPM) enables deep-tissue live imaging. However, its axial resolution is inferior to the lateral resolution, and this makes it difficult to reconstruct the three-dimensional structure of cellular details such as synapses or microglia spines. Previous studies have been insufficient for improving deep-tissue TPM images, or they are not suitable for live imaging.
We built a deep neural network that deblurs and improves the axial resolution of TPM images (“deblurring model”). Since we do not have the true structures of objects in TPM images, the deblurring model was trained in combination with a blurring generative model simulating the blurring process of TPM.
For quantitative evaluations, we first applied our model to simulation data and real images of beads, and we found that our model accurately inferred the true shapes of the objects. Second, we applied our model to images of axons, and we found that the model deblurred the images and improved image resolution, yielding clearer cellular shapes.
We expect this method to enable more accurate evaluations of the three-dimensional structure of living cells in deep tissue.
Presentation Overview: Show
The human body comprises over 37 trillion cells with diverse forms and functions, which can exhibit dynamic changes based on their environmental context. Understanding the spatial interactions between cells and changes in their state within the tissue microenvironment is crucial to comprehending the development of human diseases. State-of-the-art technologies such as PhenoCycler, IMC, CosMx, Xenium, and others can deeply phenotype cells in their native environment, providing a high-throughput means of identifying spatially related changes in cell state.
The Statial Bioconductor package offers a suite of complementary approaches for identifying changes in cell state explained by changes in cell type localization. In this presentation, we introduce new functionality in the Statial package that can 1) identify changes in cell state between distinct tissue environments, 2) uncover changes in marker expression associated with cell proximities, and 3) model spatial relationships between cells in the context of hierarchical cell lineage structures. We provide context for these approaches and explain when and why modeling spatial relationships between cells in these ways is appropriate. Finally, we demonstrate how these approaches can be used in a classification setting to predict patient prognosis or treatment response.
Presentation Overview: Show
Repetitivity and modularity of proteins are two related notions incorporated into multiple evolutionary concepts. We study whether they may also be essential for functional amyloids. Amyloids are proteins that create very regular and usually highly insoluble fibrils, often associated with neurodegeneration. However, recent discoveries revealed that the amyloid structure of a protein can also be beneficial and desired, e.g., to promote cell adhesion. Functional amyloids differ in their characteristics from pathological amyloids in that fibril formation is more tightly controlled by the organism. We propose that repeats in the sequence could regulate the aggregation propensity of these proteins. The inclusion of multiple symmetric interactions, due to the presence of the repeats, may support and strengthen the desirable structural properties of functional amyloids. Our results show that tandem repeats in bacterial functional amyloids have specific characteristics. The pattern of repeats supports the appropriate level of fibril formation and better controllability of fibril stability. The repeats tend to be more imperfect, which attenuates excessive aggregation propensity. Their desired structure and function are also reinforced by their amino acid profile. Although in this study we focused on bacterial functional amyloids, due to their importance in biofilm formation, we propose that similar mechanisms could be employed in other functional amyloids that have evolved to aggregate in a desirable manner.
Presentation Overview: Show
The immune system plays a critical role in recognizing and subsequently eliminating tumor cells. Structural genomic modifications can alter the expression of tumor antigens and immune signaling molecules, and can affect the tumor's ability to evade the immune response. Through the analysis of copy number and structural variation data from whole genome sequencing of ovarian tissue samples, genomic instability signatures have demonstrated the ability to classify tumor samples into distinct categories of severity. In this study, our objective is to analyze genomic instability across multiple ovarian tumor sites by assessing copy number and structural variations in single-cell sequencing data. Specifically, the study seeks to test the reliability of these signatures in classifying genomic instability across different sample types, comparing results from both whole genome sequencing and single-cell data. The entire analysis is focused on paired samples derived from the same patients to assess the degree of similarity between classification results obtained from the two datasets. This work could provide new insights into genomic instability in ovarian tumors and determine whether genomic instability signatures can be applied to single-cell data for improved tumor classification.
Presentation Overview: Show
The Percellome database [1], which allows quantitative comparison of gene expression profiles induced by toxic chemicals through the process of estimating mRNA copy number per cell, is a useful resource for inferring the molecular mechanisms of chemical exposure-induced toxicity. This database contains gene expression profiles (number of mRNA copies per cell estimated by the above process) by exposure dose and time in mice for various chemicals. By quantitatively capturing the dynamic changes in gene expression, it is possible to extract “what kind of molecular network changes lead to the manifestation of toxicity due to chemical exposure”. The targets of our analysis include known PPARα (Peroxisome Proliferator Activated Receptor Alpha) ligands and chemicals that have been suggested to be PPARα ligands based on previous research results (clofibrate, valproic acid, estragole, di(2-ethylhexyl)phthalate, and phenobarbital). We compared the patterns of dynamic changes in gene expression levels induced by these five chemicals and detected common and unique patterns among them. We report findings on the inferred toxicological mechanisms of these chemicals obtained by tracing how biological responses tied to the characteristics of these chemicals change with dose and time of administration.
[1] Kanno J. et al., J. Toxicol. Sci. 2013;38(4): 643-654
Presentation Overview: Show
Neuroblastoma is the most common extracranial tumor in children. Compared to adult cancers, neuroblastoma has a distinctly lower number of somatic mutations, with known drivers including MYCN, NRAS, and ALK. Two distinct cell states in neuroblastoma, adrenergic (ADRN) and mesenchymal (MES), dynamically interconvert in the process of noradrenergic-to-mesenchymal transition (NMT). MES cells are implicated in conferring an additional level of pathogenicity due to their more migratory and therapy-resistant phenotype. Epigenetic mechanisms, known to be involved in neuroblastoma, are likely important in regulating NMT. A recent study linked alternative polyadenylation (APA) to proliferation and neuronal differentiation in neuroblastoma.
Using an integrated computational and experimental approach, we explore whether changes in APA affect NMT. With scRNA-seq of five neuroblastoma cell lines, we identified distinct ADRN and MES populations and compared their usage of 3’UTR polyadenylation sites. Preliminary results show differential ADRN vs. MES 3’UTR usage in 180 genes that include transcription factors and chromatin modifiers. We are establishing an in vitro neuroblastoma APA model by biasing 3’UTR usage to shorter or longer extremes, which will enable us to study the effect of globally truncated or extended 3’UTRs on NMT. Elucidating the role of APA in NMT may reveal novel targetable vulnerabilities in neuroblastoma.
Presentation Overview: Show
Cytometry is a powerful method used in many areas of biology and medicine, e.g., immunology, haematology, cancer research, and microbiology. It is based on antibodies and allows multiple cell parameters to be measured in a large number of cells. In addition to flow cytometry, there are other, high-throughput types of cytometry, such as mass cytometry (also known as CyTOF) and imaging cytometry. High-throughput cytometry data analysis requires sophisticated software and algorithms and often some previous programming knowledge. To overcome these limitations, we developed “CytoEXpert”, a free web portal based on HTML, PHP, MySQL and JavaScript. The portal allows users to pre-process and normalize raw cytometry data. CytoEXpert comprises a dynamic web-based gating tool, which allows identification of cell populations, and incorporates additional R-based tools, such as SPADE, Citrus, tSNE, viSNE and flowSOM. The results can be further analysed with different statistical tests, PCA or correspondence analysis. Besides tabular outputs, results can be visualized with different types of plots, e.g., heatmaps, box plots, dot plots, volcano plots and others. This work was supported by grants APVV-19-0212, APVV-20-0183, and MZSR 2019/14-BMCSAV-9.
Presentation Overview: Show
Genome-wide association studies (GWAS) of complex neuropsychiatric phenotypes are often limited in their ability to detect statistically significant single nucleotide polymorphisms, partly owing to the broad range and variability of considered symptoms. While this limitation can be mitigated in some cases by leveraging the larger sample sizes offered by biobanking initiatives such as the UK Biobank, studies of many common disorders, including schizophrenia and bipolar disorder, remain underpowered. GWAS of intermediate phenotypes, derived from phenotype-associated quantities such as neuroimaging biomarkers, could increase the potential for detecting significant genetic signals by refining the problem space. This concept has recently been explored using GWAS of tabular data derived from brain imaging; however, these approaches do not usually consider non-linear relationships between derived measures and outcomes. Here, we propose the use of deep-learning models, such as convolutional neural networks (CNNs) and autoencoders, to derive secondary phenotypes of neuropsychiatric conditions from neuroimaging data. We apply our methods to an Alzheimer’s disease dataset and compare the genetic properties of the derived phenotypes to primary GWAS results.
Presentation Overview: Show
Methionine adenosyltransferase (MAT2A, hereafter MAT) catalyses the synthesis of S-adenosylmethionine from L-methionine and ATP. MAT is a pharmacologically validated cancer target. Furthermore, a binding protein (MAT2B, hereafter BP) was recently shown to bind to and stabilize MAT. While MAT enzymes are ubiquitous in nature, the distribution of BP remains unknown. In addition, the details of the coevolution of MAT and BP and the molecular mechanisms involved remain elusive. To tackle these questions, I investigate the evolution of MAT and BP with computational methods that extract coevolutionary signals in interacting proteins. In addition, molecular dynamics (MD) simulation is employed to understand the interaction between MAT and BP as well as the interaction between BP and other potential ligands. The findings of the computational analysis are complemented by experimental investigations.
Our preliminary results from computational analyses suggest that Craniata possess a BP with a conserved C-terminus, while all other organisms possess a version with a shortened C-terminus (BP-like). MD simulations of the MAT-“BP-like” complex imply that the “BP-like” protein shows low or no affinity for its MAT counterpart. This finding is confirmed by the ITC experiment.
In conclusion, it seems that the C-terminus of BP is important for binding to MAT.
Presentation Overview: Show
Precision Medicine is defined by the U.S. Food & Drug Administration as “an innovative approach to tailoring disease prevention and treatment that considers differences in people’s genes, environments and lifestyles”.
To succeed in providing personalised medicine to patients, it will be necessary to combine medical, biological and molecular data not only to identify all complex disease subtypes (patient stratification), but also to understand the underlying molecular mechanisms. Biomedical Knowledge Graphs (BKGs) are limited to the integration of prior knowledge data and do not integrate real-world data (RWD) that would allow for the incorporation of patient-level information.
With this work we propose a first step towards using graphs and graph machine learning in a fully integrated precision medicine strategy. We show that RWD can be integrated with a BKG to form a Patient & Biomedical Knowledge Graph. This makes it possible to create new patient representations using graph representation learning, which can be used to combine the strength of RWD studies in identifying disease subtypes with the strength of BKGs in bridging medical and molecular information.
We applied our methodology to atopic dermatitis (AD), identifying 7 subgroups of patients, characterising the medical, biological, and molecular evidence of each subtype.
Presentation Overview: Show
Cryo-electron tomography (CryoET) is a powerful method for obtaining 3D images of biological samples, offering invaluable insights into cellular structures and their functions. However, this technique encounters challenges, such as radiation damage, low signal-to-noise ratio, and difficulty in determining particle orientation. To address these issues, we introduce a deep learning-based denoising model that employs a Vision Transformer (ViT).
Our approach assesses the ViT's capacity to identify particle locations and morphologies. The model utilizes three input tilt images—anchor, positive, and negative—and predicts the indirect angle between them. The attention heatmap produced by our model demonstrates its focus on particle locations and shapes within the tilt image.
By capitalizing on the model's reliable attention to particles, we anticipate that our approach can be applied to few-shot learning methods, enabling rapid adaptation to new tasks with minimal training data. Additionally, we expect that the ViT can effectively denoise CryoET images and maintain accuracy across diverse noise levels using a noise-to-noise training scheme. Our method aims to provide a versatile and efficient denoising model for CryoET, which could lead to significant advancements in structural biology.
Presentation Overview: Show
The multiscale hierarchical spatial structure of the mammalian genome is defined by chromatin loops, TADs, compartments, and chromosomal territories. Chromatin looping is observed when the CTCF protein participates in the loop extrusion process driven by the ring-like cohesin molecular motor. The cohesin-mediated chromatin interactions vary among cell types and conditions. Such spatial variability correlates with the differences in gene expression between those cellular states and contributes to the microscale transcription and DNA replication processes. Our research aims to develop and test the concept of the structural epigenomic landscape (SEL) of regulatory elements around promoter regions for selected cell types and for different individuals of the human population. We will propose a biophysical method to construct probabilistic ensembles of three-dimensional conformations at the scale of genomic domains (i.e. chromatin contact domains - CCDs, or topologically associating domains - TADs), compartments, chromosomal territories and, finally, the whole genome.
Presentation Overview: Show
The transition from evaluating a single time point to examining the entire dynamic evolution of a system is possible only in the presence of a proper framework. The strong variability of dynamic evolution makes the definition of an explanatory procedure for data fitting and clustering a challenging task. We developed CONNECTOR, a data-driven framework able to analyze longitudinal data in a straightforward and revealing way. CONNECTOR is based on a functional clustering method, which provides fitted curves as well as cluster memberships through the estimation of a functional model written using natural cubic splines with random coefficients. CONNECTOR includes a collection of tools that help the user to visualize the data, to properly set the free parameters (the model selection phase) and to inspect the fitting and clustering results.
When used to analyze tumor growth kinetics over time in 1599 patient-derived xenograft growth curves from ovarian and colorectal cancers, CONNECTOR allowed the aggregation of time-series data into informative clusters through an unsupervised approach. We give a new perspective on mechanism interpretation: specifically, we define novel model aggregations and identify unanticipated molecular associations with response to clinically approved therapies. CONNECTOR is freely available under the GNU GPL license at https://qbioturin.github.io/connector.
Presentation Overview: Show
Background: Chronic Kidney Disease (CKD) and Nonalcoholic steatohepatitis (NASH) are multi-factorial metabolic diseases with an interplay of fibrotic and inflammatory insults. The combination of single-cell RNA sequencing (scRNASeq) and spatial transcriptomics (ST) could give an unprecedented molecular understanding of disease at single-cell resolution. Notably, cell-specific ligand-receptor (L-R) interactions, learned across disease stages, have the potential to reveal novel disease features and contribute significantly to the early drug target discovery and validation process.
Methods: We present a systematic analytical framework to combine scRNASeq with ST to pinpoint the L-R pairs that play a role in disease-centric inter-cellular signaling. Our framework uses state-of-the-art methods such as CellChat, Cell2location, and a co-occurrence model to integrate ST and scRNASeq information.
Results: Our framework identified L-R pairs driving the cellular crosstalk in CKD and NASH. These cell-cell interactions co-occur in ST data and can be visualized directly in the tissue slides. Several of those L-R protein pairs are known CKD and NASH drivers, while some are novel potential targets.
Conclusion: This integration of scRNASeq and ST modalities provides a comprehensive understanding of molecular mechanisms in CKD and NASH that could not be attained by either technology alone, thus paving the way towards potential future therapeutic targets.
Presentation Overview: Show
Copy number variants (CNVs) are genome-wide structural variations involving the duplication or deletion of large nucleotide sequences. While these types of variations can be commonly found in humans, large and rare CNVs including coding sequence gains or losses are known to contribute substantially to the development of various neurodevelopmental disorders (NDDs), and particularly to autism spectrum disorder (ASD). Nevertheless, given that these NDD-risk CNVs cover broad regions of the genome, it is particularly challenging to pinpoint the critical gene(s) responsible for the expression of the phenotype. Here we performed a meta-analysis study with 11,570 NDD patients and 4,114 controls from the SFARI-Gene database to identify NDD-risk regions and to later determine which deleted or duplicated genes within these broad regions were driving the phenotypic effects. We identified 38 NDD-risk CNV loci surpassing Bonferroni correction, including 23 novel ones, and provided evidence for dosage-sensitive genes within these regions being significantly enriched for driver gene candidates. Finally, we conducted a burden analysis using 4,194 NDD cases from Decipher and iHART and 2,504 neurotypical controls from the 1000 Genomes Project, which validated the association of 152 dosage sensitive driver genes with risk for NDDs, including 21 novel NDD-risk genes.
Presentation Overview: Show
Immune checkpoint blockade (ICB) therapies are now an important tool in the arsenal for the treatment of advanced kidney cancer, prolonging progression-free survival and overall survival. However, only a subset of patients respond to ICB therapies, creating an urgent need for novel approaches to better select patients who may benefit from immunotherapy. Although substantial effort has been devoted to the role of T cells in ICB treatment response, other cell types are also involved in this process. Here, we used primary and metastatic ccRCC samples obtained before ICB treatment and performed cell deconvolution analysis to investigate novel biomarkers of ICB treatment response. We found that several cell types in the tumor microenvironment (TME) of metastatic ccRCC samples define TME subtypes with significant differences in anti-PD-1 (Nivolumab) treatment response, cancer progression and overall survival. Moreover, differential gene expression analyses between these TME subtypes revealed a 5-gene signature associated with the TME cluster harboring the worst ICB response values. A numerical score was then built to predict the treatment response outcome (overall response rate, ORR) for Nivolumab-treated patients; it showed strong classification performance (AUC-ROC=0.88) compared to other existing scores (AUC-ROCs ranging from 0.55 to 0.80).
Presentation Overview: Show
Genome-wide architectural landscapes of chromatin in the nucleus can be identified by advanced high-throughput sequencing-based 3C-type methods such as Hi-C, ChIA-PET, and HiChIP. The spatial organisation of chromatin in the nucleus is stabilised by structural proteins, such as the CCCTC-binding factor (CTCF), RNAPOL2 and cohesin complex. These proteins play an essential role in establishing long-range chromatin interactions (chromatin loops), facilitating topologically associating domain formation and allowing for the coordination of genes with their corresponding regulatory elements. Here, we discuss the exact role of CTCF, RNAPOL2 and cohesin in shaping chromatin multiscale three-dimensional architecture, particularly how static architecture defined by CTCF is re-shaped by the dynamical activity of cohesin (LEM: loop extrusion model), and re-organized during transcriptional activity by RNAPOL2. We analyse CTCF, cohesin and RNAPOL2 binding sites that account for the topological regulation of chromatin loops, the dynamics of loop extrusion, and phase separation condensates related to the transcriptional factories.
Presentation Overview: Show
Several computational drug repurposing studies have highlighted candidate repurposed drugs, as have clinical trial studies testing drugs in different phases. To our knowledge, the aggregation of information from previous studies has not been widely exploited. To fill this knowledge gap, we performed a weight-modulated majority voting of the modes of action, initial indications and targeted pathways of the drugs in the Drug Repurposing Hub repository. Our method, DReAmocracy, exploits this information and creates frequency tables and finally a disease suitability score for each drug from the selected library. This method was applied to Alzheimer’s, Parkinson’s and Huntington’s Disease, and Multiple Sclerosis. A super-reference table with drug suitability scores has been created for the four diseases. Based on this methodology, we will present an R-shiny tool, the DReAmocracy-app, which provides the user with the following options: (1) select or upload the library of drug lists from prior efforts on a specific disease, (2) query a registered drug of interest, and (3) adjust the parameters of the weight-modulated majority voting scheme for scoring a drug in terms of its candidacy against a disease.
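To make the idea of a weighted vote over drug attributes more concrete, here is a minimal sketch of one way such a scheme could be written. It is not the DReAmocracy implementation: the study names, weights, drug names, attribute labels and the scoring formula (share of all weighted votes that fall on a drug's attributes) are hypothetical choices made only for illustration.

```python
from collections import Counter


def frequency_table(study_lists, annotations, weights):
    """Weighted frequency of each attribute value (e.g. a mode of action)
    among the drugs proposed by prior studies. Each study that lists a drug
    casts one vote, scaled by that study's weight, for each of the drug's
    attribute values."""
    table = Counter()
    for study, drugs in study_lists.items():
        for drug in drugs:
            for value in annotations.get(drug, ()):
                table[value] += weights.get(study, 1.0)
    return table


def suitability(drug, annotations, table):
    """Hypothetical score: votes received by the drug's attribute values,
    divided by the total number of votes cast."""
    total = sum(table.values())
    if total == 0:
        return 0.0
    return sum(table[v] for v in annotations.get(drug, ())) / total


# Made-up inputs: two prior studies, with the second weighted twice as much.
study_lists = {"study_1": ["drug_x", "drug_y"], "study_2": ["drug_x"]}
weights = {"study_1": 1.0, "study_2": 2.0}
annotations = {
    "drug_x": ["moa_1"],
    "drug_y": ["moa_2"],
    "drug_z": ["moa_1"],  # library drug sharing a mode of action with drug_x
    "drug_w": ["moa_3"],  # library drug with no support from prior studies
}

table = frequency_table(study_lists, annotations, weights)
for drug in ("drug_z", "drug_y", "drug_w"):
    print(drug, suitability(drug, annotations, table))
```

Changing the entries of `weights` plays the role of adjusting the voting parameters: a library drug scores higher when its attributes are shared with drugs that many, or heavily weighted, prior studies have proposed.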
Presentation Overview: Show
Neisseria meningitidis can cross human endothelial cells, causing meningitis. The species-specific mechanism involves an initial interaction between meningococcal type IV pili (Tfp) and the human CD147 receptor [1].
Using computational methods, we predicted the structure of the meningococcal Tfp, which consists of pilins-E and pilins-V; using GlyProt [2] and Sweet [3] software, we constructed various models of the glycosylated human CD147 receptor and glycosylated CD147 receptor of mouse and chimpanzee as control structures. The interaction between meningococcal Tfp and the various CD147 receptors was simulated with Web servers that performed glycosylated protein-protein docking.
The simulations predicted energetically favourable interactions between the glycan bound to asparagine 186 of human CD147, with Neu5Ac sialic acid (typical human) without fucosylations, and meningococcal Tfp. In contrast, when glycan contains Neu5Gc (typical animal) and fucosylations, no positive interactions were predicted. In the simulations conducted between meningococcal Tfp and glycosylated chimpanzee and mouse CD147 (containing once Neu5Ac and once Neu5Gc), no possible interaction was found between involved glycan and meningococcal Tfp.
This study emphasises the importance of glycans in this interaction and of the sialic acid Neu5Ac present on glycan antennae.
References
[1] Bernard et al. Nat Med. 2014 Jul;20(7):725-31.
[2] http://www.glycosciences.de/modeling/glyprot/php/main.php
[3] http://www.glycosciences.de/modeling/sweet2/doc/index.php
Presentation Overview: Show
Since the outbreak of SARS-CoV-2 in November 2019, several variants of concern (VOCs) have appeared and spread rapidly worldwide, in particular Omicron sub-lineages such as BA.1, BA.2, BA.3, BA.4 and BA.2.75, the latter unofficially referred to as Centaurus [PMID: 36366461]. We compiled a non-redundant dataset containing 172 representative spike ensembles bound to different antibodies, manually selected from the PDB database (www.rcsb.org). For each of them, we automatically modelled the variants mentioned above, despite all the anomalies in the PDB files [PMID: 37031054]. We considered the H-bonds and hydrophobic interactions between each chain of the non-mutated spike and the antibody in the selected complexes using LigPlot [PMID: 21919503] and detected the ionic interactions with an in-house Perl script. The complete analyses were supported by R programming (https://www.r-project.org). Lastly, we computed the interactions of the variants and compared them with those of the non-mutated spike. This allowed us to determine whether the modifications could alter the way spike chains and antibodies bind to each other. Our results provide information on the behavior of mutant spikes in the presence of antibodies, which is important for both therapeutic development and the evaluation of vaccine efficacy.
Presentation Overview: Show
In recent years, 3D cell cultures (organoids and tumor spheroids) have generated great interest in biological engineering because they seem to allow a better representation of biological complexity than monolayer cell cultures. However, few works have yet focused on the detailed analysis of the levels of molecular and cellular similarity between different 3D culture models, or between these models and the corresponding reference tissues. We will present our framework, which integrates and automates several bioinformatics approaches to facilitate the simultaneous analysis of multiple experimental conditions. We will also show the challenges associated with the comparison of transcriptomic proximity between 3D cultures and reference tissues, simulated from single-cell RNA-seq data, and the strategies we have implemented to investigate this question.
Presentation Overview: Show
Herbal medicines, widely employed in ethnomedicine including traditional Asian medicine, are often derived from a mixture of herbs, considered to impart a synergistic effect beyond the capabilities of single-herb extracts. To elucidate the metabolite-level differences between these complex herbal extracts, we employed molecular networking applied to LC/MS profiles obtained from the herbal medicines. Molecular networking provides a tool to visualize structural similarities among precursors in these profiles, and to uncover chemical alterations that arise from the extraction process involving multiple herbs.
To further explore the differences, we quantified the fold-change in precursor abundance between profiles, and assessed the statistical significance of these fold-changes using bootstrap sampling. This approach helps in managing variability in abundance measurements that may be attributed to factors such as ionization efficiency, chromatographic separation, matrix effects, and instrumental variability.
In our study comparing PM and YM extracts, where PM is a composite of eight herbs and YM is derived from six of the eight herbs used in PM in traditional Asian medicine, we revealed significant metabolite-level changes that occur when multiple herbs are co-extracted. Our results underscore the complex chemical interplay in mixed herbal medicines and highlight the value of molecular networking in understanding these interactions at a metabolite level.
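The fold-change and bootstrap step described above can be sketched as follows. This is only an illustration under stated assumptions: it assumes several replicate abundance measurements of one precursor per extract, uses a percentile bootstrap confidence interval on the log2 fold-change, and calls a change significant when that interval excludes zero. The abundance values are invented, and the abstract does not specify the exact resampling scheme or test that was used.

```python
import random
from math import log2
from statistics import mean


def bootstrap_log2fc(a, b, n_boot=2000, seed=0):
    """Percentile-bootstrap 95% interval for log2(mean(b) / mean(a)).

    a, b: positive replicate abundances of one precursor in two extracts.
    Returns (observed_log2fc, (lower, upper)).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = log2(mean(b) / mean(a))
    boots = []
    for _ in range(n_boot):
        # Resample replicates with replacement within each extract.
        ra = rng.choices(a, k=len(a))
        rb = rng.choices(b, k=len(b))
        boots.append(log2(mean(rb) / mean(ra)))
    boots.sort()
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return observed, (lower, upper)


# Hypothetical replicate intensities of one precursor in the two extracts.
ym = [100.0, 110.0, 95.0, 105.0]
pm = [400.0, 420.0, 390.0, 410.0]

observed, (lower, upper) = bootstrap_log2fc(ym, pm)
significant = lower > 0 or upper < 0  # interval excludes zero
print(observed, (lower, upper), significant)
```

Resampling within each extract keeps run-to-run measurement variability in the interval, so a precursor is flagged only when its change is large relative to that variability.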
Presentation Overview: Show
Cell-type identification is an important task for single-cell RNA-seq data analysis. Due to the recent successes of contrastive learning, we propose a novel contrastive learning-based cell-type identification method GsRCL. The experimental results suggest that GsRCL successfully obtained state-of-the-art performance and outperformed other well-known cell-type identification methods.
Presentation Overview: Show
The degradation of coarse particulate organic matter (CPOM) in streams largely depends on microbes, specifically fungi and bacteria. However, the effect of environmental factors, such as temperature and salinity, on the metabolic pathways and enzymatic reactions of these degraders is undetermined. Our aim is to elucidate the particular roles of different fungal and bacterial groups in the enzymatic decomposition of CPOM in the Emscher/Boye and Kinzig catchments, with and without stressors. We use DNA stable isotope probing (DNA-SIP), amplicon, and metatranscriptomic sequencing to study the taxonomical and functional diversity involved in leaf litter degradation. To determine the active taxa involved in degradation, we have used 13C-labeled alder leaves. The DNA of the active degraders of 13C leaves will be extracted by DNA-SIP. These targeted organisms will be further analyzed for metabolic pathways and enzymatic reactions. To analyze the metatranscriptomic sequences, we built a pipeline to preprocess the eukaryotic and prokaryotic mRNA and map it to databases like Mycocosm, CAZy, and NCBI for taxonomic and functional information. While studying the effect of multiple stressors, such as temperature, salinity, and flow velocity, we are testing the hypothesis that function recovers faster than community due to functional redundancy.
Presentation Overview: Show
Alzheimer's disease (AD) is a complex neurodegenerative disorder with multiple underlying biological pathways and mechanisms. Recent studies have shed light on these pathways and mechanisms, yet there remains a need to comprehensively integrate genetic susceptibility from multiple pathways for the effective prediction of AD.
We propose a novel approach based on transfer learning and multi-task polygenic risk score modeling (MT-PRS). The proposed MT-PRS involves learning key biomarkers (Aβ [A], tau processing [T], and neurodegeneration [N]) associated with the development of AD and exploring the role of pathway-specific genetic susceptibility in AD. In addition, we performed a conditional analysis according to APOE status.
We identified significant PRSs for AD and A/T/N biomarkers, with the proposed MT-PRS outperforming these predictors. Moreover, we identified significant pathways associated with AD not only with A/T/N but also with immunity and endocytosis. The identified pathways contributed to predicting AD risk after adjusting for APOE status.
In conclusion, our findings emphasize the usefulness of our approach, which takes into account additional biological pathways, in predicting AD and enhancing biological resolution. The proposed MT-PRS holds the potential to enhance the current understanding of the genetic basis of AD and provide novel insights into personalized prevention and treatment strategies.
Presentation Overview: Show
We developed GraphFusion, a light, modular, and scalable platform as a web application to combine several graph analysis methodologies and data fusion tools. It works as a graph analytics and visualization instrument that helps users explore, analyze, and visualize complex networks and graphs. It is an ideal solution for working with social networks, biological networks, and other graph datasets.
GraphFusion's user-friendly interface allows users to readily load and manipulate datasets in a wide range of models or representations, such as undirected graphs, directed graphs, probabilistic graphs, simplets, and hypergraphs.
GraphFusion also includes advanced graph inspection and comparison algorithms to work with, including pairwise analysis, data versus model analysis, network alignment, node clustering, annotations-based enrichment analysis, and a flexible data-fusion section to frame any joint NMF-based (Non-negative Matrix Factorization) optimization problems without having to code the optimization procedure. Those tools can help identify essential nodes and highlight features and communities within a graph or group of related graphs.
Overall, GraphFusion is a powerful and intuitive interface ideal for researchers and data analysts who need to study and visualize complex graph datasets with easy-to-use analysis algorithms and visualization capabilities.
Presentation Overview: Show
Multimorbidity, the co-occurrence of two or more distinct diseases, presents a challenge in research and healthcare. Cytokines and their regulators play an important role in acute and chronic inflammatory and immune responses. Moreover, they are pleiotropic, thereby having implications for multimorbid diseases. We aimed to characterise the role of cytokines as determinants of multimorbidity, and identify druggable cytokines relevant for therapeutic interventions in multiple diseases.
We leveraged Mendelian randomisation analysis to identify associations between 139 immune-proteins and 64 traits relevant in multimorbidity. The multiplicity corrected results were annotated with information on druggability, licensed indications and trait-tissue associations, creating a knowledge graph. This network was queried to identify cytokine communities and to identify a prioritised subset of pleiotropic cytokines.
Diseases affected by multiple proteins included inflammatory bowel disease (IBD), lung cancer and type 2 diabetes. Twenty-four proteins were prioritised based on their graph importance, including the druggable proteins C4B, CX3CL1 and IL1RN. Diseases and traits predominantly affected by these proteins included IBD, dementia, cholesterol and asthma.
We found strong genetic support for plasma cytokines partially determining the onset of multimorbidity and identified druggable proteins which might be pursued in drug development programs.
Presentation Overview: Show
The shape of a cell is tightly controlled, and plays a crucial role during processes such as morphogenesis.
Cell shape is strongly linked to the differentiation of the cells and can further be used to infer important cellular properties such as force generation, cortical tension and adhesion properties.
Modern microscopy and image processing technologies have made it easier than ever to access accurate, high-resolution cell shapes.
We propose FlowShape, a new framework to study the shape of cells.
Our method represents shapes as a single function on a sphere: the mean curvature.
This function is then characterized by decomposing it into a sum of simple functions called spherical harmonics.
We applied these methods to real data from C. elegans embryos.
We show how this decomposition can be applied in a variety of applications:
aligning cells by finding the optimal rotation that matches their shapes,
clustering cells into groups with similar shapes,
scanning for structural features such as lamellipodia and
statistically determining phenotypes that show changes in cell shape.
Presentation Overview: Show
Somatic copy number alterations include large-scale events, such as chromosome arm-level gains and losses, as well as focal amplifications and deletions, and play a key role in the evolutionary processes that shape cancer genomes. In the case of small-scale events such as point mutations and indels, there exists a list of established mutational signatures that can be linked to distinct exogenous or endogenous exposures such as tobacco use. Despite previous efforts, accurate and meaningful copy-number signatures are still elusive. The biggest obstacle in creating copy-number signatures is that, due to their cascading nature, traditional segment-based representations of copy number do not reveal individual evolutionary events.
Here we introduce a new method for deriving copy-number signatures that explicitly models evolutionary copy-number events. We derive these events using a minimum evolution framework based on our phylogenetic copy-number model MEDICC2 (Kaufmann 2022, Genome Biology) and employ a probabilistic approach to resolve ambiguous evolutionary trajectories.
We demonstrate the power of our approach on an independent simulation of mutational processes and real world data from 2,778 tumors from the Pan Cancer Analysis of Whole Genomes and demonstrate how the extracted copy-number signatures reveal novel insights into the nature of the mutational processes shaping cancer genomes.
Presentation Overview: Show
In this study, we focused on an often-overlooked extreme aspect of biology: the outliers of the protein length distribution, specifically those with more than 5000 amino acids, which we refer to as Huge Proteins. By examining UniProtKB we discovered more than 41,000 Huge Proteins, the majority in Eukaryotes and a significant proportion in Prokaryotes. The phyla with the highest propensity for Huge Proteins are Apicomplexa and Fornicata. Moreover, we observed that certain Bacteria, mostly members of the PVC superphylum, have the same tendency to possess Huge Proteins as the average Eukaryote. To investigate if these proteins represent “real” proteins, we explored several indirect metrics, finding that the vast majority most likely exist. Additionally, we examined the orthologs of these proteins and identified around 7,000 clusters of homologous sequences, revealing functional groups related to key cellular processes such as cytoskeleton organization and the regulation of transcription and translation. For Bacteria, the major clusters have functions related to non-ribosomal peptide synthesis/polyketide synthesis, pathogen-host attachment or recognition surface proteins. Further exploration of the domain annotations supported the previously identified functional groups. These findings underscore the need for further investigation of the cellular and ecological roles of these remarkable proteins and their potential impact and applications.
Presentation Overview: Show
Chromatin interactions are essential in enhancer-promoter interactions (EPIs) and transcriptional regulation. The transcription factor (TF) CTCF, which binds to chromatin interaction anchors, is the main insulator protein for EPIs in vertebrates. However, there is still no overall understanding of TFs and proteins involved in chromatin interactions and insulator functions. Here, we describe a systematic and comprehensive deep-learning-based approach to identify the DNA-binding motifs of TFs. We discover 99 directional and non-directional biases of motifs in human fibroblast cells, which include those of 23 TFs related to an insulator function, CTCF, and/or other transcriptional regulations in previous studies. The estimated CTCF orientation bias is consistently proportional to the CTCF orientation rate at chromatin interaction anchors. Non-directional motifs consist only of palindromic motifs of TFs and their interacting TF. These findings reveal that the directional bias of motifs is associated with insulator functions and other chromatin regulations potentially through structural interactions.
Presentation Overview: Show
Human mobility is known to be a key factor in the spread of infectious diseases. During the COVID-19 pandemic, the rapid spread of the virus caused healthcare systems to collapse in many countries, contributing to a large number of deaths. To avoid these undesirable outcomes, understanding the causal relationships between commuting flows and the spread of infectious diseases is crucial. With this objective in mind, we applied an information-theoretic approach called Transfer Entropy, TE(X,Y), to measure the directed influence over time of the mobility-associated risk on patch X on the COVID-19 incidence on another patch Y. We first validated our approach using simulated epidemiological data generated by an SIR model called EpiCommute. We then calculated the TE between all provinces in Spain using the time-series data from the Spanish cross-referenced Covid-19 Flow-Maps geographic information system. As a result, we identified the main drivers of the pandemic at each time period, spotting important known epidemiological events such as the outbreak in Lleida during the summer of 2020 caused by the incoming flow of temporary workers. These results help clarify how human mobility contributes to the dynamic spread of infectious diseases and can be used to inform future non-pharmaceutical interventions.
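To make the transfer entropy measure used in this abstract concrete, the sketch below computes a plug-in estimate of TE(X,Y) for two discrete time series with a history length of one step. The binarisation, the history length and the function name `transfer_entropy` are assumptions made for illustration; the abstract does not describe the estimator that was actually applied to the mobility and incidence series.

```python
import random
from collections import Counter
from math import log2


def transfer_entropy(x, y):
    """Plug-in estimate of TE(X -> Y) in bits, using one step of history.

    TE = sum p(y_next, y_now, x_now) *
         log2( p(y_next | y_now, x_now) / p(y_next | y_now) )
    x and y are equal-length sequences of discrete symbols.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x) - 1
    triples = Counter()   # counts of (y_next, y_now, x_now)
    pairs_yx = Counter()  # counts of (y_now, x_now)
    pairs_yy = Counter()  # counts of (y_next, y_now)
    singles = Counter()   # counts of y_now
    for t in range(n):
        triples[(y[t + 1], y[t], x[t])] += 1
        pairs_yx[(y[t], x[t])] += 1
        pairs_yy[(y[t + 1], y[t])] += 1
        singles[y[t]] += 1
    te = 0.0
    for (y_next, y_now, x_now), count in triples.items():
        p_joint = count / n
        p_given_both = count / pairs_yx[(y_now, x_now)]
        p_given_self = pairs_yy[(y_next, y_now)] / singles[y_now]
        te += p_joint * log2(p_given_both / p_given_self)
    return te


# Toy example: Y copies X with a one-step delay, so information flows X -> Y.
rng = random.Random(0)
x_series = [rng.randint(0, 1) for _ in range(2000)]
y_series = [0] + x_series[:-1]
te_xy = transfer_entropy(x_series, y_series)  # expected to be large
te_yx = transfer_entropy(y_series, x_series)  # expected to be near zero
```

The asymmetry between the two directions is what allows a measure of this kind to point at likely drivers rather than only at correlated patches.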
Presentation Overview: Show
The single-cell revolution has made it possible to investigate heterogeneity in biological systems at single-cell resolution. Multiple modalities and computational methods are available to analyse such datasets. These datasets cover a wide range of biological scenarios such as tissue development, perturbation, and disease phenotypes. As there are no well-established protocols to automatically annotate and optimally integrate these datasets, it is challenging to leverage their full potential for systematic data-driven discovery of disease signatures. For example, different research groups annotate their cell-types manually and the importance of the marker genes employed is not shared. This leads to a situation where similar cell-types can be annotated differently using different sets of markers. Furthermore, most existing tools for data integration are not yet interpretable. Moreover, these methods are computationally expensive to use as they require GPUs to perform efficiently. This puts some of these methods out of the reach of researchers without access to expensive computational hardware. To address interpretability, reproducibility and scalability, we have developed a set of tools for automatic annotation (MACA) and integration of different modalities (MASI). I will present our benchmark studies and our newer graph-based approaches to integrate spatial transcriptomics, single-cell chromatin accessibility, DNA methylation, and histone modification data.
Presentation Overview: Show
Allostery is the process by which binding at one site perturbs a distant site. Allosteric drugs activate or inhibit proteins and offer advantages over non-allosteric drugs. However, the identification of allosteric sites is challenging due to their distance and lack of conservation across protein structures. Machine learning (ML) approaches have been employed to predict allosteric sites, but the performance of these methods needs further improvement. This research investigates the potential of incorporating Large Language Models (LLMs) such as ProtBERT into ML/DL approaches for better prediction of allosteric sites. Preliminary results show that small multilayer perceptrons (MLPs) without LLMs can achieve an F1 score of up to ~50%. This study contributes to research on protein structure and function prediction, potentially enabling identification of allosteric sites for drug discovery and protein engineering.
Presentation Overview: Show
Childhood asthma is the most common reason for hospitalization in early childhood. From epidemiological studies, it is evident that the prevalence is higher in boys than girls. After puberty, it is more prominent in women than men. The heritability of childhood asthma is estimated to be between 60% and 90%. This suggests that the genetic components driving the development of childhood asthma have a sex-specific effect. Yet, most association studies do not consider sex in their analysis.
In this project, a Bayesian logistic regression model with a variant-sex interaction term was developed to identify SNPs that have a sex-specific effect on childhood asthma. Discovery studies were conducted in a dataset of 1189 children with severe asthma from Copenhagen Prospective Studies on Asthma in Childhood and 5094 controls. 77 variants have a posterior probability of interaction higher than 95%. Sex-stratified analysis confirms the sex-specific effect in both data sets.
Variants are found in the genes IL1R1 and CLEC16A, which have previously been associated with asthma, and 4 of the top 9 interacting SNPs are expressed in lung tissue.
Presentation Overview: Show
This project aims to extract molecular markers and pathogenic mechanisms associated with the progression of Parkinson’s Disease (PD) from the analysis of multi-omics (MO) datasets. We utilise the Parkinson’s Progression Marker Initiative dataset which includes multi-omics datasets for PD patients (blood RNA, miRNA and plasma proteomics).
We obtain pathway enrichment analysis (PEA) results using a single modality enrichment analysis method and two integrative PEA tools: MO Gene Set Analysis (MOGSA) and Multi-Omics Factors Analysis (MOFA). MOGSA produces an integrative pathway enrichment analysis from a combined set of differentially expressed features. MOFA uses dimensionality reduction to produce an integrated view of the three modalities and then performs pathway analysis on this combined output.
We present the enriched processes obtained from the single modality enrichment analysis and the two integrative methodologies to highlight disease mechanisms. We find a high overlap between the pathways obtained from the three methods. The integration methods allow us to re-rank and prioritise pathways that are important across all layers. In addition, pathways with low significance from only one omics layer are discarded, allowing a smaller more confident set to be obtained.
Presentation Overview: Show
Integrating the abundance of biomedical information available for a disease is a great challenge, yet it can help us better understand the underlying mechanisms and build more comprehensive profiles. This study aims to develop a computational framework that integrates multi-source data methods and network-based approaches for more precise diagnostic and therapeutic approaches.
To do so, Alzheimer’s disease (AD) is used as a case study. Subjects with normal cognition (CN), mild cognitive impairment (MCI) and AD are collected. MRI measurements, protein expression data and clinical assessments are obtained from the AD Neuroimaging Initiative (ADNI) database.
Single layer analyses and multi-layer analyses are conducted to obtain molecular and biomedical profiles. List2Net, an in-house tool that represents lists in a network context, is used to create subject-to-subject networks for all different combinations of the CN, MCI, AD based on their within layers and across layers correlation. Multi-omics factor analysis (MOFA) and mixOmics tool will be used to obtain an integrated vector of brain imaging and protein expression data. Graph clustering methods are applied in both the single and the multi-layer generated networks and are evaluated with the label-based clustering to assess the contribution of each approach.
Presentation Overview: Show
INTRODUCTION: In this study, we investigated the association between various viral agents and the risk of Alzheimer's disease (AD). We examined a large sample of AD cases and controls by comparing the quantity of viral reads identified in their DNA samples.
METHODS: We used both whole exome sequencing (WES) and whole genome sequencing (WGS) datasets and selected DNA sequence reads that did not align to the human genome, mapped them to viral reference sequences, quantified them, and tested them for association with AD.
RESULTS: Our results showed that several viruses were significant predictors of AD based on machine learning classifiers. Subsequent regression analyses showed that HSV-1 (OR=3.71, P=8.03×10−4) and HPV-71 (OR=3.56, P=0.02) were significantly associated with AD after Bonferroni correction. The quantity of reads from the phylogenetic family Herpesviridae was significantly associated with AD in several strata of the data (P<0.01). Utilizing a novel propensity score matching algorithm, we found a significant association between HSV-1 and AD (OR=1.10, P=0.02) using a regression model on a sample of 5828 AD cases and 6487 controls.
DISCUSSION: Overall, our findings support the hypothesis that viral infection, particularly HSV-1, is linked to AD risk.
Presentation Overview: Show
Like gene expression, the 3D structural organization of animal genomes varies widely across cell types or environments. Chromosome compartmentalization, for instance, characterizes genomic regions of different properties: active transcription and open chromatin for “A” compartments, compact chromatin and low gene expression for “B” compartments. Available computational tools can process chromosome conformation capture “Hi-C” data from a single experiment and assign compartment types to genomic regions, usually using a PCA-based dimensionality reduction. Much remains to be done to accurately detect compartmentalization differences between groups of samples.
Here we present HiCDOC, a Hi-C data analysis method for the automatic identification and comparison of A/B compartments between groups of Hi-C matrices. Unlike traditional PCA-based methods, HiCDOC performs a constrained K-means clustering to assign A or B compartments to genomic regions in multiple datasets simultaneously, using information from biological replicates to enhance accuracy. A statistic reflects the prediction confidence at each position and identifies regions with significant compartment differences (A=>B or B=>A) between experimental groups. First results show that HiCDOC compares favorably with dcHiC, the only other tool with similar functionalities.
HiCDOC is available as an R Bioconductor package: https://github.com/mzytnicki/HiCDOC
Presentation Overview: Show
Computational methods that decipher rare and private somatic changes can provide critical insights into the underlying mechanisms of cancer development and progression. Identifying potential cancer subtypes that might be associated with diverse biological responses is a key first step to define target therapeutics.
Machine and deep learning (ML/DL) methods that use clinical and/or multi-omics data have been adopted for the identification of cancer subtypes. There also exists a growing collection of sequence-based ML/DL models that accurately predict different epigenetic traits (e.g. transcription factor binding), and allow for estimating the impact of individual somatic aberrations. The application of sequence-based ML/DL on a genome-wide scale enables augmenting somatic mutations by a model-based view that captures functionally relevant differences between individuals.
In this study, we adopt SEI, a sequence-based DL model that is trained to predict more than 21K different regulatory activities, to obtain mutation impact embeddings. We first identify mutations with strong impacts through investigating clusters of alternative and reference sequence embeddings. Then, mutation impact embeddings are utilized to generate a patient similarity network (PSN) for unsupervised identification of patient subgroups. The proposed approach provides a novel strategy of utilizing variant impact scores in PSNs for cancer subtyping.
Presentation Overview: Show
In the aftermath of the terrorist attacks on the World Trade Center (WTC), the first responders received intense exposure to a complex mix of airborne carcinogens that elevated their cancer risk. However, the development of hematologic malignancy is not well studied. With current molecular genomic testing methods, acquired genetic alterations in hematopoietic precursor cells can be detected even prior to overt hematological manifestations. This finding has been defined as clonal hematopoiesis of indeterminate potential (CHIP). We hypothesized that exposure to WTC debris may have led to CHIP-specific mutations.
This study aims to determine whether i) prevalence of CHIP is elevated in WTC responders and ii) CHIP mutations are associated with phenotypes such as age, ancestry, smoking, WTC debris exposure and blood count parameters. To this end, we performed deep whole exome sequencing of blood in 350 WTC responders. We then analyzed CHIP mutations associated with hematologic malignancy.
Consistent with literature, we found that prevalence of CHIP increased with age. Furthermore, we observed that the responders exposed to the WTC debris had significantly higher rates of CHIP mutations than unexposed individuals. These findings will aid in the development of specialized cancer screening programs for WTC responders.
Presentation Overview: Show
Clonal hematopoiesis of indeterminate potential (CHIP) refers to the presence of somatic mutations in blood in hematologic malignancy associated genes, but without any clinical evidence of hematologic disease. However, CHIP is a known risk factor for hematologic malignancy and other systemic diseases. Some factors that increase CHIP prevalence include age, smoking and inflammatory conditions. As inflammatory bowel diseases (IBD), including ulcerative colitis (UC) and Crohn’s disease (CD), are characterized by increased inflammation, we hypothesized that individuals with IBD may have elevated rates of CHIP-specific mutations.
This study aims to characterize the role of disease activity or clinical phenotype in the prevalence of CHIP in IBD patients. To this end, we analyzed CHIP mutations from whole exome sequencing data of IBD patients (587 CD and 441 UC) and 293 controls from Mount Sinai’s IBD cohort, and performed CHIP association analysis using multivariate logistic regression. We then validated our results in an independent cohort.
We found that the prevalence of CHIP mutations increased with age, with the top CHIP genes being TET2, DNMT3A, ASXL1 and PPM1D. Interestingly, UC patients had significantly higher levels of CHIP mutations than controls. These findings will aid in the development of CHIP-screening programs for IBD patients.
Presentation Overview: Show
RNA localization plays a significant role in gene expression regulation. It has been implicated in buffering protein levels from bursty transcription, nuclear size control, protein localization, and even disease. Hence, estimating transcript localization is of major importance. An approach that many studies have traditionally followed to investigate relative nuclear and cytosolic RNA localization is RNA sequencing (RNA-seq) coupled with cellular fractionation. Nevertheless, transcript quantification estimates obtained independently from nuclear and cytosolic RNA cannot be compared, as the total amount of RNA in each of these cellular compartments is usually unknown. Here we show that if, in addition to nuclear and cytosolic RNA-seq, whole cell RNA-seq is also performed, then accurate estimations of the localization of transcripts can be obtained. We first establish the theoretical basis that supports this by formalizing mathematically the relationship between the different RNA abundances. Based on that, we designed a method that estimates a localization index for every transcript. We evaluated our methodology on simulated data. Finally, we compared transcript localization in different human cell lines using bulk RNA-seq data from the ENCODE project, and attempted to explain the differences based on features known to regulate RNA localization.
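One way to see why whole cell RNA-seq resolves the comparison problem is the following sketch. It assumes a simple mixing model in which the whole-cell relative abundance of each transcript is w_i = β·n_i + (1 − β)·c_i, where n_i and c_i are the nuclear and cytosolic relative abundances and β is the unknown fraction of cellular RNA residing in the nucleus. This model, the least-squares estimate of β, and the definition of the index as the nuclear share of each transcript are assumptions made for illustration, not the formalization used by the authors.

```python
def estimate_nuclear_fraction(nuc, cyt, whole):
    """Least-squares estimate of beta in whole = beta*nuc + (1-beta)*cyt."""
    numerator = sum((w - c) * (n - c) for n, c, w in zip(nuc, cyt, whole))
    denominator = sum((n - c) ** 2 for n, c in zip(nuc, cyt))
    if denominator == 0:
        raise ValueError("nuclear and cytosolic profiles are identical")
    # Clamp to [0, 1] because beta is a fraction.
    return min(1.0, max(0.0, numerator / denominator))


def localization_index(nuc, cyt, beta):
    """Per-transcript share of molecules located in the nucleus (0 to 1)."""
    index = []
    for n, c in zip(nuc, cyt):
        total = beta * n + (1 - beta) * c
        index.append(beta * n / total if total > 0 else float("nan"))
    return index


# Hypothetical relative abundances for three transcripts.
nuclear = [0.5, 0.3, 0.2]
cytosolic = [0.1, 0.3, 0.6]
true_beta = 0.25
whole_cell = [true_beta * n + (1 - true_beta) * c
              for n, c in zip(nuclear, cytosolic)]

beta_hat = estimate_nuclear_fraction(nuclear, cytosolic, whole_cell)
indices = localization_index(nuclear, cytosolic, beta_hat)
```

Under this assumed model, the second transcript has equal relative abundance in both fractions, yet its index equals β rather than 0.5, which illustrates why fraction-level estimates alone cannot be compared directly.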
Presentation Overview: Show
Gene therapy has the potential to address many loss-of-function genetic disorders by inducing wild-type transgene expression of the faulty gene in the affected cell. Adeno-associated viruses (AAVs) are a promising vector to deliver the transgene because they are replication defective and are not associated with any human disease.
The therapeutic AAV genome contains a gene of interest (GOI) in an ITR – GOI – ITR cassette. ITRs (inverted terminal repeats) allow for this cassette to be packaged inside viral capsids. ITRs are the only viral genetic material that is part of the therapy. Our standard QC of the cultured cassettes in plasmids includes restriction digests, capillary electrophoresis, Sanger sequencing, and Illumina sequencing. With long read PacBio CCS sequencing, we demonstrate limitations in existing QC methods in calling ITR variants.
We discovered a heterogeneous population of plasmids with multiple ITR variants, with both the flip and flop alleles in the plasmid and deletions of the hairpin loops. Existing bioinformatic tools are unable to effectively call these variants even using CCS reads. We circumvent these limitations using a custom bioinformatics pipeline. Our work identifies appropriate methods to support AAV production using plasmids.
Presentation Overview: Show
The microenvironment of solid tumours comprises a wide range of innate and adaptive immune cells, which can characterize the microenvironment as one of two types: cold or hot. Hot tumours are defined by the presence of tumour-infiltrating T lymphocytes and molecular signatures of immune activation, while cold tumours lack these hallmarks. The clinical significance of these categories is that hot tumours are more likely to respond well to immune checkpoint blockade therapy. To understand the tumour microenvironment, we need to understand the nature and state of individual cells as well as their juxtaposition. For this project, three pediatric cerebellar brain cancers, known to have variable immune recruitment, will be subject to spatial transcriptomics via the 10X Genomics Visium platform and scRNA-seq. Preliminary findings suggest the presence of 3 groupings: (1) metabolically active non-dividing tumour cells, (2) rapidly growing tumour cells expressing markers for both transcriptional activity and stemness, as well as cell cycle genes, and lastly (3) an immune infiltrating region. Identifying patterns in the tumour microarchitecture will permit the identification of local interactions involved in immune cell infiltration in various pediatric cerebellar brain tumours and will provide knowledge to develop prognostic and predictive biomarkers to guide therapy.
Presentation Overview: Show
Different samples of the same tumor type can differ in their across-genome mutation rate spectrum, due to having undergone different combinations of mutational processes, such as those arising from DNA repair pathway deficiencies. These mutational processes could be summarized by SBS-based signatures, but most of these are convoluted (i.e. comprising several processes). Also, SBS-based signatures rely on knowledge of the DNA repair deficiencies of the training samples in order to assign a signature to a specific aetiology. In our approach, each sample is summarized by a specific profile, which consists of a vector of regression coefficients from the associations between local mutation rates and the local activities/abundances of each DNA repair mark included in the model. We also account for factors known to play a role in the mutation rate spectrum, such as replication time and trinucleotide context. Then, via non-negative matrix factorization (NMF), we reduce the dimensions of the all-samples coefficient profile matrix into signatures: each signature will have a different exposure in each sample, so sample outliers could potentially have a DNA repair pathway deficiency that results in an altered mutation pattern, and therefore in an unexpected association between mutation rate and DNA repair activity.
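The NMF step described in this abstract can be sketched as follows. The example factorizes a small samples-by-marks matrix V into exposures W and signatures H using the standard Lee and Seung multiplicative updates for the Frobenius norm. It assumes the coefficient matrix has already been made non-negative (regression coefficients can be negative, and the abstract does not say how this is handled); the matrix values, the rank, and the helper names are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, rank, n_iter=200, seed=0, eps=1e-9):
    """Factorize non-negative V (samples x marks) as W (exposures) x H (signatures)."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps)
              for j in range(n_cols)] for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(rank)] for i in range(n_rows)]
    return w, h


# Hypothetical non-negative profiles: 4 samples x 3 DNA repair marks.
profiles = [
    [1.0, 0.2, 0.1],
    [0.9, 0.3, 0.1],
    [0.1, 0.8, 1.0],
    [0.2, 0.9, 1.1],
]
exposures, signatures = nmf(profiles, rank=2)
error = frobenius_error(profiles, exposures, signatures)
```

Each row of `exposures` gives one sample's weight on each signature, so a sample whose exposures differ sharply from the rest would be a candidate outlier in the sense described above.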
Presentation Overview: Show
We propose a generative model of artificial datasets that are similar to omics datasets in terms of correlation structure. Our method produces multidimensional data with a desired correlation structure, where the distribution of the generated variables is known. It is an alternative to black-box models or classical approaches using Cholesky factorization, which face problems when sample sizes are small.
It is based on the local cluster-wise dimensionality reduction of weighted gene correlation network analysis (WGCNA). In the WGCNA approach, one can simulate correlated variables using a low-dimensional projection of the clusters to which they belong. Our approach uses that method iteratively in conjunction with our hierarchical clique-based clustering algorithm. We find multiple basis clusterings using edge weight thresholding to learn the structure at multiple resolutions.
We have compared our approach (S2) to a basic one-level simulation protocol (S1) on a reference dataset of 8673 genes, using network statistics and the partition similarity of clusters found in simulations to clusters found in the reference. As the number of thresholds increases, the clustering coefficient distribution converges. A good fit requires accurately capturing the correlation structure, making the model a useful analytical tool.
Presentation Overview: Show
As antibiotic resistance becomes more prevalent, we explore the dynamics of sensitivity versus resistance to antibiotics in bacterial pathogens using multiple epidemiological compartmental models and stochastic simulations. These employ the fitness cost and advantage of resistance to understand the mutual relationship, coexistence, co-evolution and relative dynamics of sensitive and resistant strains as a function of antibiotic usage. The aim is to determine the antibiotic usage that gives the best treatment outcomes and the lowest risk of resistance emergence and spread.
Stochastic simulations are performed to analyse model behaviour and to see where the models agree or differ in their epidemiological dynamics. Inference methods including iterated filtering and particle Markov chain Monte Carlo are applied to demonstrate that the fitness cost and advantage of resistance can be estimated from prevalence data on both susceptible and resistant infections. Model validation and comparison can be used to establish which model, if any, can explain the dataset at hand. This study will help understand model sensitivity on a stochastic scale, whereas previous studies considered a deterministic version.
Once a model and corresponding parameters have been selected and validated, it becomes possible to make predictions on future resistance dynamics under different scenarios of antibiotic use, and make recommendations for optimal use of antibiotics to avoid further increase in resistance.
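A stochastic simulation of this kind can be sketched with the Gillespie algorithm. The two-strain susceptible-infected-susceptible model below is an assumption made for illustration, not one of the models compared in the abstract: `cost` scales down transmission of the resistant strain (fitness cost), and `treatment` is an extra clearance rate that only affects sensitive infections (advantage of resistance). All names and parameters are hypothetical.

```python
import random


def gillespie_two_strain(n_hosts, i_sens, i_res, beta, gamma, cost,
                         treatment, t_end, seed=0):
    """Simulate a hypothetical two-strain SIS model with the Gillespie algorithm.

    Returns a list of (time, susceptible, infected_sensitive, infected_resistant).
    """
    rng = random.Random(seed)
    s = n_hosts - i_sens - i_res
    t = 0.0
    trajectory = [(t, s, i_sens, i_res)]
    while t < t_end:
        rates = [
            beta * s * i_sens / n_hosts,               # 0: sensitive infection
            beta * (1.0 - cost) * s * i_res / n_hosts, # 1: resistant infection
            (gamma + treatment) * i_sens,              # 2: sensitive clearance
            gamma * i_res,                             # 3: resistant clearance
        ]
        total = sum(rates)
        if total == 0.0:
            break  # no infections left, nothing can happen
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            break
        # Pick one event with probability proportional to its rate,
        # considering only events that can actually occur.
        active = [(rate, idx) for idx, rate in enumerate(rates) if rate > 0.0]
        threshold = rng.random() * total
        cumulative = 0.0
        event = active[-1][1]
        for rate, idx in active:
            cumulative += rate
            if threshold < cumulative:
                event = idx
                break
        if event == 0:
            s, i_sens = s - 1, i_sens + 1
        elif event == 1:
            s, i_res = s - 1, i_res + 1
        elif event == 2:
            s, i_sens = s + 1, i_sens - 1
        else:
            s, i_res = s + 1, i_res - 1
        trajectory.append((t, s, i_sens, i_res))
    return trajectory
```

Running the function over a grid of `treatment` values and many seeds would give the distribution of resistant prevalence under different usage scenarios, which is the kind of output the inference and prediction steps above would consume.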
Presentation Overview: Show
Gastric cancer affects the lining of the stomach and is known to be one of the leading causes of death worldwide. Several factors contribute to this malady, including environmental factors, genetic factors and Helicobacter pylori infection. Unfortunately, these cancers are often diagnosed only at advanced stages, resulting in poor outcomes for the patient. The development of non-invasive and easy-to-use diagnostic methods can help in catching the disease at an early stage. To this end, salivary transcriptomic data could reveal potential signatures that indicate the presence of disease. Here we present the results of a comparative analysis of transcriptomes from saliva samples and gastric cancer tissue samples. The results lend credence to considering saliva as an alternative biofluid for diagnostic purposes.
Presentation Overview: Show
Colorectal cancer (CRC) is uncontrolled tumor growth that originally starts in either the rectum or the colon. Our research is focused on the microbiome in the gut. The end goal is to target signaling pathways in order to decrease the metastasis and malignancy of gut tumors by increasing the expression of certain bacterial genes in CRC. The byproducts of probiotic bacteria may play a role in this process. Using the R programming language, we first narrowed our target proteins down to a few that were common among known probiotic bacteria. We then used NCBI BLAST to align the genomes of the probiotic bacteria in order to find structural similarities and differences that may play a role in how effective each probiotic bacterium is in inhibiting CRC. Currently, we are analyzing bacteria reported in a recent cancer microbiome review paper to reveal novel phenotypic and genotypic differences at the protein and signaling/pathway levels. We hope to perform protein annotation and KEGG pathway analysis to reveal undiscovered relationships. Eventually, we hope our research will help narrow down specific proteins/pathways in bacteria that wet-lab microbiology researchers can manipulate in order to find cheaper and novel ways to reduce colorectal cancer.
Presentation Overview: Show
Recent years have seen a surge of novel neural network architectures for multi-omics integration. One important parameter is the integration depth: the point at which the latent representations are computed or merged, which can be early, intermediate, or late. The literature on integration methods grows steadily. However, close to nothing is known about the relative performance of these methods under fair experimental conditions and under consideration of different use cases. We developed a comparison framework that trains multi-omics integration methods under equal conditions. We incorporated four recent deep learning methods, early integration, PCA, and a novel method, Omics Stacking, that combines the advantages of intermediate and late integration. Experiments were conducted on a drug response data set with multiple omics data. Our experiments confirmed that early integration has the lowest predictive performance. Overall, statistically significant differences can rarely be observed. In terms of the average ranks of methods, however, Super.FELT performed best in a cross-validation setting and Omics Stacking best on the external test set. When faced with a new data set, Super.FELT is a good option in the cross-validation setting, and Omics Stacking in the external test set setting.
Presentation Overview: Show
Modeling requires the estimation of model parameters from experimental data. Probabilistic inference returns a distribution, thereby inherently estimating the uncertainty associated with the parameters.
We present Eulerian Parameter Inference (EPI), a probabilistic inference method based on the concept of random variable transformations. The input of EPI is a simulation model and a data distribution that is assumed to be generated by an underlying parameter distribution. EPI estimates this parameter distribution. In practice, we often have to estimate this distribution from individual samples by using established density estimation approaches. EPI transforms the estimated data distribution into a parameter distribution that is consistent with the observed data. This can be done by only using point-wise evaluations of the simulation model and approximations of its derivatives with respect to the parameters, which directly returns a density value in the parameter space. In particular, we do not require an explicit formulation of the inverse mapping from the output to the parameters.
EPI is parameter-free and provably correct if the parameter inference problem is well-posed.
Besides academic examples, we apply EPI to a diverse set of models ranging from algebraic equations over chaotic maps to ordinary differential equation systems, thereby proving its practical applicability.
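The random variable transformation behind this idea can be illustrated in one dimension. For an invertible, differentiable model s, the textbook change-of-variables formula gives the parameter density as p_Q(q) = p_Y(s(q)) · |s'(q)|, which needs only point-wise evaluations of s and an approximation of its derivative. The sketch below is a minimal instance of that formula under this one-dimensional, invertible assumption; it is not the EPI implementation, and all function names are hypothetical.

```python
import math


def normal_pdf(y, mean, std):
    """Density of a normal distribution, used here as an example data density."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def numerical_derivative(model, q, step=1e-5):
    """Central finite-difference approximation of d model / d q."""
    return (model(q + step) - model(q - step)) / (2.0 * step)


def parameter_density(model, data_density, q):
    """Change of variables: p_Q(q) = p_Y(model(q)) * |model'(q)|.

    Valid as written only for a scalar, invertible, differentiable model.
    """
    return data_density(model(q)) * abs(numerical_derivative(model, q))
```

For example, with the model s(q) = 2q + 1 and data distributed as a normal with mean 1 and standard deviation 2, `parameter_density` recovers the standard normal density for q, without ever forming the inverse map.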
Presentation Overview: Show
Genes are not randomly distributed in the nucleus space, but are organized within more or less dynamical spatial clusters. This genome spatial organization plays a major role in gene expression regulation. Using a variety of experimental datasets, we show that genes in spatial proximity share the same nucleotide composition biases, which could at least in part explain the spatial genome self-organization. In addition, co-localized genes equally biased have a higher probability of being co-regulated by the same transcription factors. They also produce RNAs that share the same nucleotide composition biases, that are co-regulated by the same RNA-binding proteins, and that generate proteins sharing the same amino acid composition biases. As a consequence, proteins produced by co-localized genes share the same physicochemical properties and have a higher probability of belonging to the same cellular sub-compartments. Thus, by analyzing compositional biases - as a proxy of the physicochemical properties of genes and their products - we uncover a link between the spatial organization of genes in the nucleus and the spatial organization of their products (i.e. proteins) in the cell.
Presentation Overview: Show
Single-cell technologies have transformed our understanding of human tissues. Yet, single-cell studies typically capture only a limited number of donors and disagree on cell type definitions. Integrating many datasets can address these limitations of individual studies and capture the variability in the population. Here, we present the integrated Human Lung Cell Atlas (HLCA), combining 49 datasets of the human respiratory system into a single atlas spanning over 2.4 million cells from 486 individuals. The HLCA presents a consensus cell-type re-annotation with matching marker genes, including annotations of rare and previously undescribed cell types. Leveraging the number and diversity of individuals in the HLCA, we identify gene modules that are associated with demographic covariates such as age, sex and BMI, as well as gene modules changing expression along the proximal-to-distal axis of the bronchial tree. Mapping new data to the HLCA enables rapid data annotation and interpretation. Using the HLCA as a reference for the study of disease, we identify shared cell states across multiple lung diseases, including SPP1+ profibrotic monocyte-derived macrophages in COVID-19, pulmonary fibrosis, and lung carcinoma. Overall, the HLCA serves as an example for the development and use of large-scale, cross-dataset organ atlases within the Human Cell Atlas.
Presentation Overview: Show
Axon guidance governs the growth direction of axons and forms neural circuits, and is crucially dependent on cell-cell interactions. Understanding these interactions provides insights into how circuit formation is achieved in normal and disease brains and can potentially inform neuroregenerative therapies. Single-cell RNA sequencing (scRNA-seq) technologies hold great promise in providing a comprehensive analysis of cell-to-cell interactions through axon guidance factors. However, the lack of computational methods hinders the realization of this potential.
Here, we present a novel data analysis framework, scAG (Single Cell RNA-seq analysis for Axon Guidance), to uncover cell-cell interactions in axon guidance using scRNA-seq data. scAG employs scRNA-seq data and Axon Guidance Related Genes (AGRG) ligand-receptor database for interaction detection and analysis, enabling thorough interaction identification, temporal progression analysis, and comparative studies in axon guidance.
We applied scAG to scRNA-seq data from the mouse cerebral cortex, obtained at different developmental stages from wild-type and Fezf2 mutant mice (Di Bella et al., 2021, Nature). Our analysis unveiled stage-specific and mutant-specific cell-type pairs and AGRG ligand-receptor pairs, providing a detailed temporal and comparative overview of cell-cell interactions guiding axon growth. These findings elucidate the intricate cellular coordination necessary for proper neural circuit formation and have implications for understanding neurological disorders.
Presentation Overview: Show
Pharmaceutical research has long used differential gene expression signatures to study external stimuli like pathogenic determinants or small molecule treatments. These signatures measure expression values for multiple tags and are often compared using the concept of connectivity. Despite the scientific community's efforts to produce unbiased datasets for evaluating connectivity-based methods for drug identification and repurposing, the lack of reliable benchmarking data hinders their effectiveness.
To address this, we developed a simulation method for connected differential expression signatures that is based on a three-layer decomposition and relies on a statistical framework with different levels of parametrization.
We benchmarked seven connectivity score methods from the literature using our simulated signatures. We then evaluated the capacity of each method to retrieve the most reversed signatures for a specific query, using the area under the precision-recall curve. Moreover, we introduced a novel application perspective by training a siamese neural network with our simulated data to predict the connectivity score.
Overall, our method is a significant advance in pharmaceutical research, providing a reliable way to simulate connected differential expression signatures. It will help develop and evaluate algorithms for comparing signatures to find the most connected or reversed, leading to more effective drug repurposing.
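The retrieval evaluation described above can be sketched with average precision, one common estimator of the area under the precision-recall curve. The abstract does not say which estimator was used, so this choice is an assumption, and `average_precision` is a hypothetical helper. It expects scores where higher means "more likely relevant"; for reversed signatures one could pass the negated connectivity score.

```python
def average_precision(scores, is_relevant):
    """Average precision of a ranking, an estimator of the area under the PR curve.

    scores: one number per candidate signature, higher = ranked earlier.
    is_relevant: one boolean per candidate, True for the truly reversed signatures.
    """
    ranked = sorted(zip(scores, is_relevant), key=lambda pair: pair[0], reverse=True)
    total_relevant = sum(1 for _, relevant in ranked if relevant)
    if total_relevant == 0:
        raise ValueError("at least one relevant item is required")
    hits = 0
    precision_sum = 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            # Precision measured at the rank of each relevant item.
            precision_sum += hits / rank
    return precision_sum / total_relevant
```

Because simulated signatures come with known ground truth, `is_relevant` is available for every query, which is what makes this kind of benchmark possible.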
Presentation Overview: Show
Advances in next-generation sequencing (NGS) technologies such as whole-exome sequencing (WES) and targeted sequencing (TS) have revolutionized cancer genomics and precision medicine. However, accurate interpretation of somatic genomic profiling results from NGS requires reliable computational tools. This is where synggen comes in: a powerful tool written in the C programming language that enables researchers to rapidly generate realistic synthetic WES and TS datasets for benchmarking.
Synggen closely mimics real-life cancer sequencing scenarios by using non-cancer NGS sequencing files in BAM format to generate reference models, and by incorporating user-specified phased germline polymorphisms, complex allele-specific somatic copy number aberrations and point mutations, as well as the clonality of somatic events and the overall tumor content of the sample.
To demonstrate the effectiveness of synggen we simulated two liquid biopsy cfDNA scenarios: cancer data at decreasing tumor content, and cancer data simulating temporal sampling from a patient with dynamic tumor sub-clones’ populations.
Generating WES reference models using one control sample takes approximately 5 minutes with 4 cores, and 2.5 minutes with 16 cores. Generating a FASTQ file with 100 million reads using the same number of cores requires about 10 minutes and 4 minutes, respectively.
Presentation Overview: Show
Viral metagenomics is increasingly used for the detection of viral pathogens in clinical diagnostic settings, and a wide variety of bioinformatic tools are available for taxonomic classification of metagenomic data. A growing number of studies have benchmarked the performance of taxonomic classifiers using experimental and simulated NGS datasets generated for known viruses. However, benchmarking studies focusing on the detection of genetically distinct viral sequences are scarce.
RNA viruses evolve rapidly with a high rate of accruing mutations in their genomes. Classification of the newly emerging RNA viruses with taxonomic classifiers can be a challenge if the changes in the emerging virus genomes are not reflected in the reference genome databases used for classification. How sensitive are the results of taxonomic classification to the mutations in the newly emerging viruses?
In this study we evaluate the performance of three types of taxonomic classifiers (DNA-to-DNA, DNA-to-protein, and mixed) in the detection of several RNA viruses with simulated mutations that mimic evolution of the virus and produce closely related organisms at a controlled relative phylogenetic distance. With this approach, we establish the phylogenetic distance thresholds for effective detection and classification by each tool.
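The idea of producing related genomes at a controlled distance can be sketched as below. This is a deliberately simplified assumption: it applies uniform, independent substitutions only (no indels, no evolutionary model) and uses the proportion of differing sites (p-distance) as a stand-in for phylogenetic distance. The abstract does not describe the actual mutation model, and `mutate` and `p_distance` are hypothetical names.

```python
import random

NUCLEOTIDES = "ACGT"


def mutate(sequence, substitution_rate, seed=0):
    """Substitute each base independently with probability substitution_rate."""
    rng = random.Random(seed)
    mutated = []
    for base in sequence:
        if rng.random() < substitution_rate:
            # Always pick a different base so every hit is a visible change.
            mutated.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


def p_distance(seq_a, seq_b):
    """Proportion of aligned positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)
```

Sweeping `substitution_rate` upward and recording where a classifier stops assigning the mutated genome to the correct taxon would give a per-tool detection threshold of the kind described above.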
Presentation Overview: Show
Nonribosomal peptide synthetases (NRPS) are modular enzymes that produce many important secondary metabolites, including antibiotics. Adenylation (A) domains within NRPS determine the final product by recognizing and activating its building blocks - specific amino acids. A-domain specificity prediction is vital for exploring metabolites and engineering NRPS pathways. However, the existing prediction tools have limited accuracy. Their developers lack a comprehensive resource with confirmed A-domain specificities and train their tools on ad hoc datasets.
To address this gap, we present ADD, a database encompassing A-domain sequences, specificities, neighbouring domains, biosynthetic gene clusters (BGCs), and producers' taxonomies. With 3459 entries, our database is the largest of its kind and includes both bacterial (3063) and fungal (396) A-domains. ADD incorporates and unifies records from previously published A-domain specificity datasets and MIBiG, the largest collection of experimentally validated links between secondary metabolite BGCs and their products. We complement our database with a benchmarking utility for assessing the quality of specificity prediction algorithms.
We believe ADD will become a useful resource for training and benchmarking A-domain specificity predictors, and might shed light on the evolutionary dynamics of A-domains in bacteria and fungi.
Presentation Overview: Show
Heterogeneity and composition of the tumor have a major impact on tumor growth, division, resistance (to treatment) and metastasis. We want to test to what extent the impact of tumor heterogeneity on disease outcome is explained by the molecular features of the tumor (i.e. gene expression and DNA methylation (DNAm), an epigenetic mark regulating gene expression). Our goal is to develop a new multimodal high-dimensional mediation analysis framework to unravel these causal links.
THEMA (our project) will concentrate on pancreatic ductal adenocarcinoma (PDAC) which is a highly heterogeneous cancer and is expected to become the second leading cause of cancer-related mortality by 2025.
Based on methylomes and transcriptomes from public cohorts, we will study the role of DNAm and gene expression in the causal link between tumor heterogeneity and outcome. Then we will perform multimodal high-dimension mediation analysis and question the relationship between the identified mediators (gene expression and DNAm). Finally, we will test how the exposure to treatment affects the mediators.
We expect that THEMA will help to identify molecular mediators of tumor heterogeneity, at both the DNAm and gene expression levels, and offer perspectives for the development of new biomarkers and personalized therapeutic treatments.
Presentation Overview: Show
Food natural compounds are of interest as modulators of cancer progression and prognosis, as they participate in cellular processes such as growth and differentiation, DNA repair, programmed cell death and oxidative stress. Here we select dietary biocompounds for specific subgroups of 308 colorectal adenocarcinoma (COAD) samples by finding bioactives with opposite transcriptomic profiles to the subgroup-specific tumoral transcriptomes, hypothesizing they may counteract the cancer gene-expression profiles. First, we selected 2189 CpGs based on their differentially variable methylation between tumor and normal samples by a combination of linear and Bartlett tests. Afterwards, samples were meta-clustered by 1) classifying each sample by 8 different methods (including k-means and hierarchical clustering), 2) building a network and 3) meta-clustering it by the edge-betweenness method. We extracted 6 main subgroups, 2 of them with immune-enriched transcriptomes. Finally, we compared the transcriptomes of the 6 subgroups with the ones of 56 in vitro bioactive studies from GEO by Gene Set Enrichment Analysis (GSEA), resulting in a potential positive effect of resveratrol, japonicone A and vitamin D. In summary, we present a promising in silico strategy to propose specific bioactives as co-adjuvants in cancer treatment. Supported by Spanish PN I+D+i PID2019-110183RB-C21 and FNS-Cloud project H2020-EU.3.2.2.3 863059.
Presentation Overview: Show
Post-transcriptional RNA modifications have emerged as crucial regulatory elements, exerting influence over diverse cellular processes. Among the 143 distinct modifications identified thus far, tRNA molecules display remarkable diversity with respect to modifications. Accurate prediction of tRNA modifications is essential for unraveling their functional significance and exploring potential therapeutic implications. Although promising results have been achieved using random forest models to predict m1A modifications, ongoing research aims to expand these efforts to encompass additional modifications. We aim to develop deep learning techniques by which the prediction accuracy and efficiency of tRNA modifications can be significantly enhanced. These advanced techniques have the capability to capture intricate patterns and relationships within tRNA sequences, enabling precise identification of modified sites. Moreover, the prediction pipeline presented here incorporates a specially designed, BED-based data format for storing modifications and their corresponding predictions.
By employing well-engineered machine learning models to predict a wide range of tRNA modifications, we aim to provide valuable insights into the regulatory mechanisms underlying tRNA modifications and their implications. Additionally, the use of a user-friendly, genome-viewer-compatible data format enhances accessibility and usefulness for non-computer scientists.
Presentation Overview: Show
Valuable insights into the factors driving complex diseases such as heart failure can be gained from high-throughput omics technologies. Liquid chromatography coupled to mass spectrometry (LC-MS) enables the high-throughput profiling of diverse types of molecules such as proteins, peptides, metabolites and lipids. The timsTOF Pro MS raises the dimensionality of generated datasets by an additional ion mobility separation. This results in increased peak capacity and acquisition speed, but the extra dimension significantly increases data complexity and thus requires computationally highly efficient solutions for raw-data processing. Data processing for such complex and voluminous data must be efficient, flexible, and able to be incorporated into existing workflows.
Therefore, we developed Proteolizard, a software toolset bridging high-performance C++ code with user-friendly Python bindings. It enables the seamless integration of timsTOF raw data into Python-centric machine-learning libraries like TensorFlow, PyTorch, and scikit-learn. Proteolizard facilitates effective utilization of multi-core systems or accelerators such as GPUs, and the implementation of novel algorithms based on deep learning, enhancing the analysis of high-dimensional omics data. Specifically, we implement data access, representation, processing, and visualization in three separate Python modules.
Proteolizard is accessible and offered free-of-charge under the GPL3 license on GitHub: https://github.com/theGreatHerrLebert/proteolizard-[data, algorithm, vis].
Presentation Overview: Show
The purpose of this work is to construct an organized dataset of single nucleotide variants (SNVs) for patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). With Genomic Data Commons and cBioPortal as primary resources, SNV information from whole-exome or whole-genome sequencing data of 149 patients with AML and 603 patients with ALL was obtained. For each patient group, Python scripts were written to compile individual patients’ data files into a single dataset. The resulting AML and ALL datasets allowed us to obtain an overall statistical comparison of SNV counts between tumor and normal samples and between the two leukemia types. The numbers of distinct SNVs found in the AML and ALL datasets were 136071 and 174459 respectively, with the vast majority occurring only in tumor samples. We observed that the number of distinct variants per patient with AML is higher than that for ALL and that the tumor sample variants in both AML and ALL favored mutations that would reduce GC content in genes. These datasets will be used for downstream bioinformatics analyses to compare the two leukemia types with the ultimate aim of identifying SNV effects that can help discover potential targets for gene therapies.
Presentation Overview: Show
Sepsis is a life-threatening condition triggered by an immune response to infection, and it exhibits varying outcomes in patients based on their sex. We aim to investigate the molecular patterns associated with sex differences in sepsis. We retrieved sepsis-related datasets from GEO-NCBI and developed an R-based workflow to analyze the transcriptomic data. Our workflow involved data processing steps including quality control, normalization, outlier identification, and probe summarization. To confirm or determine the sex of samples, we used a set of genes known as immune sex expression signatures (iSEXS) to differentiate between male and female groups. After identifying differentially expressed genes (DEGs) consistently across multiple datasets, we performed a deconvolution analysis with CIBERSORTx to determine the proportion of specific cell types in each group. Additionally, we used CEMiTool to identify gene co-expression modules and explore enriched signaling pathways in each module. Our study found differentiated DEGs in septic patients, with both groups showing increased myelocytes and band neutrophils, but segmented neutrophil percentage being higher only in pre-pubertal males. Our study contributes to the understanding of sex differences in sepsis by revealing differences in gene expression profiles and the disparity of immature neutrophils with regard to sex and age.
Presentation Overview: Show
Genomic analysis often involves complex sequences of operations on chromosomal regions. Examples include combining intervals of different sizes, such as the peaks derived from ChIP-Seq experiments or regions associated with genes, measuring the intensity of a signal in these intervals, and performing differential analysis of these signals. This is especially true in the field of cancer epigenomics, which requires integrating signals for various epigenetic marks with gene expression data. While tools to perform these operations are commonly available, they are not designed to interoperate easily in a reproducible way.
GenoMaker facilitates complex genomic analysis making it efficient, reproducible, and self-documenting. Operations to be executed are specified in a human-readable, high-level language that hides differences between the various underlying tools, and are translated into a “make” file that is then automatically executed. This allows GenoMaker to take advantage of the useful features of the make command: defining generic rules to express transformations between different types of files, and only performing operations that are necessary, without re-creating files that are already up to date. We present an example of its use in a complex cancer epigenomics project. GenoMaker is currently under development, and is available at https://github.com/uf-icbr-bioinformatics/GenoMaker.
Presentation Overview: Show
As a cornerstone of modern biology, structural bioinformatics is an important tool for the understanding of biomolecular structure, interactions, and dynamics. A plethora of software tools for this purpose have been developed over recent decades, usually specialized with respect to their focus or target audience. However, they often require in-depth knowledge of both the programming framework and the language to be used efficiently. The versatile Julia language bridges the gap between ease of use and high-performance capabilities. In particular, it allows for highly efficient native software with an easily accessible programming interface that can be used, e.g., for exploratory data analysis or incorporated into full-fledged data processing pipelines. Here we present BiochemicalAlgorithms.jl, our free and open-source Julia package, built around a comprehensive representation for biomolecular systems, accessible in a DataFrame-based manner. Our representations support multiple common data formats and can be processed through a toolset of data preparation routines, including routines for bond reconstruction and for inferring missing atoms. Prepared systems can further be used for energy calculations or structure optimization, with protein docking and simple molecular dynamics simulations planned as future features. BiochemicalAlgorithms.jl is accompanied by a second Julia package that extends its functionality with visualization capabilities for our systems.
Presentation Overview: Show
Loss of crops to plant disease costs billions of dollars every year. Artificial intelligence, including machine vision, can improve manual lab assays to detect and score the severity of disease, allowing more robust conclusions to be drawn when assessing and comparing potential solutions. The disease lesion image data necessary to train computer vision models is sparse. Existing tools to gather and annotate image data are often unable to capture the unique biological context necessary to draw conclusions. Annotation is therefore typically a slow, manual process, lacking continuity between research groups.
I developed a Python package, CDAScorer, to quickly record coordinate and severity score data for cell death areas (CDAs) on plant leaves. CDAScorer is run from the command line. The user interacts with a graphical application window built using the Tkinter framework, dragging a bounding box around the CDA matching given positional metadata, then entering its score. Using a dataset built with CDAScorer, I will train deep learning models to create a computer vision tool to automatically score CDAs without the subjectivity inherent in qualitative visual scoring. Severity scoring will help to automate and streamline plant resistance breeding programs, supporting the development of climate resilient crops.
Presentation Overview: Show
Observational studies on the use of commercially available wearable devices for infection detection lack the rigor of controlled clinical studies, where time of exposure and onset of infection are exactly known. Towards that end, we carried out a feasibility study using a commercial smartwatch for monitoring of heart rate, skin temperature, and body acceleration on subjects as they underwent a controlled human malaria infection (CHMI) challenge. Subjects were asked to wear the smartwatch for at least 12 hours/day from 2 weeks pre-challenge to 4 weeks post-challenge. Using these data, we developed 2B-Healthy, a Bayesian-based infection prediction algorithm that estimates a probability of infection. We compared the infection probability over time with the time to onset of parasitemia, as determined by a daily FDA-approved blood smear diagnostic. Among 10 CHMI subjects, nine developed parasitemia, with an average time to parasitemia of 12 days. 2B-Healthy detected infection in seven of nine subjects (78% sensitivity), where in six subjects it detected infection 6 days before parasitemia (on average). We also investigated 2B-Healthy on eight control subjects for 4 weeks and obtained a false-positive rate of 6%/week. Our findings demonstrate the feasibility of wearables as a screening device to provide early warning of infection.
Presentation Overview: Show
Copy number alterations (CNAs), ranging from local to whole-chromosome level, are common in cancer genomes. Different cancer types show distinct patterns of those alterations. However, what shapes those patterns and what causes the differences between tissues is poorly understood. We reasoned that differences in the probability of occurrence of CNAs (e.g. the epigenome at genomic breakpoints or lamina attachment regions), and selection acting on CNAs (e.g. negative selection acting against CNA-induced differential production of proteins, or positive selection favouring amplification of specific regions that buffer the detrimental effects of deleterious mutations) would shape the observed tissue-specific patterns of CNAs. To test this, we first identified individual features correlating with the frequency of focal or chromosomal amplification. We identified a number of genomic, transcriptomic and functional features explaining observed tissue-specific CNA patterns. We then fitted multivariable models that significantly improved the prediction of amplification patterns, demonstrating that combining multiple features can be useful to predict tissue-specific CNA patterns. Our ongoing efforts focus on extending our model by adding epigenomic properties of chromosomes. Taken together, our results highlight the need for a systematic analysis of determinants of tissue-specific alteration patterns and might guide our understanding of tissue-specific tumour evolution and, ultimately, therapy response.
There is an urgent need to diversify the pipeline for discovering novel natural products due to the increase in multi-drug resistant infections. Like bacteria, fungi produce secondary metabolites that have potent bioactivity and rich chemical diversity. To avoid self-toxicity, fungi encode resistance genes, which are often present within the biosynthetic gene clusters (BGCs) of the corresponding bioactive compounds. Recent advances in genome mining tools have enabled the detection and prediction of BGCs responsible for the biosynthesis of secondary metabolites. The main challenge now is to prioritize the most promising BGCs that produce bioactive compounds with novel modes of action. With target-directed genome mining methods, it is possible to predict the mode of action of a compound encoded in an uncharacterized BGC based on the presence of resistant target genes. Here we introduce the “Fungal bioActive compound Resistant Target Seeker” (FunARTS), available at https://funarts.ziemertlab.com. This is a specific and efficient mining tool for the identification of fungal bioactive compounds with interesting and novel targets. FunARTS rapidly links housekeeping and known resistance genes to BGC proximity and duplication events, allowing for automated, target-directed mining of fungal genomes. Additionally, FunARTS generates gene cluster networks by comparing the similarity of BGCs across multiple genomes.
Predicting chemical-induced toxicity presents a multifaceted challenge due to its complex nature and our limited understanding of underlying mechanisms. We are reimagining toxicity by elucidating the biological states during adaptation to chemical exposure and the tipping points to adversity. Here, we investigate the hypothesis that single-cell high-throughput transcriptomics (sc-HTTr) can help decode tipping points between cellular adaptation and adversity that are obscured in gene expression from bulk samples. We treated a human hepatic cell line (HepaRG) with five chemicals that disrupt cellular homeostasis through different pathways, including mitochondrial disruption, oxidative stress, endoplasmic reticulum stress, heat shock, and DNA damage. After dissociating the cells, we used the TempO-LINC platform to generate over 74,000 sc-HTTr profiles with an average of 5,000 genes detected per cell. Clustering the profiles revealed diverse cellular states, including normal, adaptive, autophagic, and apoptotic. We identified putative tipping points marking boundaries between cellular adaptation and adversity by integrating empirical and simulated trajectories based on literature-derived signaling and regulatory networks. This presentation underscores the transformative potential of sc-HTTr in decoding toxicological tipping points, offering a novel perspective on the mechanisms of chemical toxicity and a new approach for estimating human health risks of chemical exposures.
Single-cell sequencing of tumors enables a detailed understanding of intratumor heterogeneity and the individuality of cells, but it misses the spatial context. Constructing a 3D picture that includes the spatial context of the tumor microenvironment (TME) is critical for understanding the selection of malignant cells with proliferative potential at the tumor front, immune surveillance and suppression of malignant immunogenic clones, and for deciphering the spatial modes of growth and dispersal that impact tumor-immune co-evolution. At the IMAXT consortium, we used a 4T1 polyclonal mouse model to map the TNBC tumor and its TME at single-cell resolution as a function of immune competency. We employed scRNA-seq and CITE-seq to identify tumor and immune cell states and to design protein panels for single-cell spatial imaging methods (IMC and merFISH). Leveraging scDNA-seq, we identified clones within a mixed tumor population. This work specifically tackles multimodal single-cell integration challenges by presenting an analysis framework and devising strategies for tying together different data types using common anchors. Here, we project cell types/states discovered by single-cell sequencing onto an accurate map of spatial organization in the TME by integrating CITE-seq and IMC. Moreover, with the scDNA-seq, scRNA-seq and merFISH modalities, we create an accurate spatial map of tumor clones and their TME context.
Human endogenous retroviruses (ERVs), genomic sequences integrated from ancient exogenous retroviruses, account for nearly 9% of human DNA. These ERVs are generally silenced through epigenetic mechanisms. Growing evidence shows that many ERVs are re-activated and associated with a variety of diseases, including colon cancer. However, comprehensive profiling of ERVs and of their clinical significance is lacking. We systematically profiled the expression of 3,320 ERVs in 307 tumors and 41 adjacent normal tissues using RNA sequencing. ERV expression differed markedly between tumors and their adjacent normal tissues, with most ERVs showing increased expression in tumors. These ERVs were mainly located in intergenic regions or in intronic regions of protein-coding genes or lncRNAs. Host or nearby genes of ERVs with increased expression in tumors were significantly enriched in viral protein interactions with cytokines and cytokine receptors. Tumor subtypes defined by ERV expression were significantly associated with the tumors’ methylation subtypes, MSI status, and hyper-mutation status. After adjustment for other known covariates, we found that 152 ERVs were significantly associated with disease-specific survival, 51 of which were also differentially expressed. Our comprehensive analysis provides in-depth insights into abnormal ERV expression in colon cancer and its clinical importance for tumor subclassification and clinical outcomes.
High-dimensional biological data such as transcriptomics and proteomics often suffer from batch effects that arise when technical variables are not controlled during data acquisition. Current batch effect correction methods like ComBat, though robust, frequently struggle with data affected by class imbalance or class-batch confounding. This underscores the need for an approach that can model both batch and class variables effectively during batch effect estimation and correction. In this study, we present a novel nonlinear approach to modelling batch effects. This method uses the underlying empirical cumulative distribution function of the dataset to map class-batch variables. Compared to ComBat and batch mean centering, our batch effect correction method consistently achieves lower Euclidean distances between batch-associated clusters after correction across varying severities of class imbalance, partially class-batch confounded datasets, and different distributions, in both simulated datasets and proteomics datasets. Visualization using t-SNE and principal component analysis also shows improved clustering of class variables post-correction. As we anticipate an increase in the prevalence of high-throughput methods, we hope that this approach can address future nuances of batch effects in high-dimensional data, such as interaction effects and different distributions.
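To make the idea of an ECDF-based mapping concrete, the sketch below shows generic quantile mapping: each value in one batch is replaced by the reference-batch value that sits at the same empirical cumulative probability. This is a textbook technique used here as a stand-in, not the authors' method; the function name `map_to_reference` and the toy values are assumptions. In a class-aware setting one would presumably apply such a mapping separately within each class, which is also an assumption.

```python
from bisect import bisect_right


def map_to_reference(batch_values, reference_values):
    """Quantile-map batch_values onto the distribution of reference_values.

    Each value is sent to the smallest reference value whose empirical CDF
    is at least the value's empirical CDF within its own batch.
    """
    ordered_batch = sorted(batch_values)
    ordered_ref = sorted(reference_values)
    n, m = len(ordered_batch), len(ordered_ref)
    mapped = []
    for value in batch_values:
        # rank / n is the ECDF of `value` within its own batch.
        rank = bisect_right(ordered_batch, value)
        # Integer ceiling of rank * m / n avoids floating-point rounding.
        ref_index = -(-rank * m // n) - 1
        mapped.append(ordered_ref[ref_index])
    return mapped


# Toy example: the second batch carries an additive shift of +10.
reference_batch = [1, 2, 3, 4]
shifted_batch = [13, 11, 14, 12]
corrected = map_to_reference(shifted_batch, reference_batch)
```

Because the mapping depends only on ranks, it is nonlinear in general and removes shifts or rescalings between batches while preserving the ordering of samples within a batch.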
Omics technologies produce data with varying characteristics, which, in turn, can influence the performance of downstream analyses, such as normalization methods or statistical tests, and may also be at the core of differing performance results in benchmarking studies.
Here, we show typical data characteristic patterns for selected omics data types – including proteomics, metabolomics (mass spectrometry (MS)- and nuclear magnetic resonance (NMR)-based), RNA-sequencing, microarray, and microbiome data – and demonstrate, using normalization methods as an example, how particular data properties render those methods inapplicable.
Based on our results, we encourage thorough inspection of the data characteristics of omics datasets prior to conducting downstream analyses, since the inappropriate use of algorithms on those datasets is prone to introducing bias.
Background
The reliable detection of emerging novel pathogens from next-generation sequencing data is a key challenge. Traditional approaches, which depend on sequence similarity, may not be able to detect novel species when closely related reference sequences are unavailable. In contrast, machine learning methods can detect novel pathogens even when the biological context is unavailable.
Method
A list of pathogenic and nonpathogenic bacteria was retrieved from Integrated Microbial Genomes and Microbiomes (IMG/M). One strain per species was included. Nonpathogenic strains of well-known pathogenic species were discarded. This resulted in a list of 446 species (342 pathogens and 67 non-pathogens). We simulated 10 million paired-end Illumina reads per class using InSilicoSeq. Reverse complements of the simulated reads were added to the final list. One-hot encoding was used to represent DNA sequences. The final list was divided into 90% training, 5% validation, and 5% test sets.
Our deep learning model for predicting pathogenicity from DNA sequence reads includes two convolutional neural networks (CNNs).
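The preprocessing steps named above (reverse-complement augmentation, one-hot encoding, and a 90/5/5 split) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the A, C, G, T channel order, the all-zero encoding of `N`, the fixed random seed, and the toy reads are all assumptions.

```python
import random

BASES = "ACGT"  # assumed channel order for the one-hot rows
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(read):
    """Return the reverse complement of a DNA read."""
    return read.translate(COMPLEMENT)[::-1]


def one_hot(read):
    """Encode a read as rows of 4 indicators (A, C, G, T); N becomes all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in read]


def split_dataset(items, seed=0):
    """Shuffle and split items into 90% training, 5% validation, 5% test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 90 // 100
    n_val = len(shuffled) * 5 // 100
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Toy reads standing in for simulated Illumina reads.
reads = ["AACG", "GGTA", "ACNT"]
augmented = reads + [reverse_complement(r) for r in reads]
encoded = [one_hot(r) for r in augmented]
train, validation, test = split_dataset(list(range(100)))
```

Adding reverse complements doubles the dataset and lets a model see both strands of each read, since a sequencer may report either orientation.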
Fusion genes or chimeras typically comprise sequences from two different genes. The chimeric RNAs of such joined sequences often serve as cancer drivers. Identifying such driver fusions in a given cancer or complex disease is important for diagnosis and treatment. The advent of next-generation sequencing technologies, such as DNA-Seq or RNA-Seq, together with the development of suitable computational tools, has made the global identification of chimeras in tumors possible. However, tests of over 20 computational methods showed them to be limited in terms of chimera prediction sensitivity, specificity, and accurate quantification of junction reads. These shortcomings motivated us to develop the first 'reference-based' approach, termed ChiTaH (Chimeric Transcripts from High-throughput sequencing data). ChiTaH uses 43,466 non-redundant known human chimeras as a reference database to map sequencing reads and to accurately identify chimeric reads. We benchmarked ChiTaH and four other methods for identifying human chimeras, leveraging both simulated and real sequencing datasets. ChiTaH was found to be the most accurate and fastest method for identifying known human chimeras in both simulated and real sequencing datasets. Moreover, ChiTaH uncovered heterogeneity of the BCR-ABL1 chimera in both bulk samples and single cells of the K-562 cell line, which was confirmed experimentally.
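The general idea of a reference-based search for junction reads can be illustrated with a toy example: given known chimeras, build a short signature spanning each junction and count the reads that contain it. This is only a conceptual sketch under simplifying assumptions (exact substring matching, made-up sequences, a hypothetical chimera name and `flank` parameter); it is not ChiTaH's implementation, which the abstract describes as mapping reads to a reference database.

```python
def junction_signature(upstream, downstream, flank=5):
    """Join the last `flank` bases of the 5' partner to the first of the 3' partner."""
    return upstream[-flank:] + downstream[:flank]


def count_junction_reads(reads, reference, flank=5):
    """Count reads containing the junction signature of each known chimera.

    `reference` maps a chimera name to its (upstream, downstream) sequences.
    """
    signatures = {
        name: junction_signature(upstream, downstream, flank)
        for name, (upstream, downstream) in reference.items()
    }
    counts = {name: 0 for name in reference}
    for read in reads:
        for name, signature in signatures.items():
            if signature in read:
                counts[name] += 1
    return counts


# Hypothetical reference entry and reads (toy sequences, not real genes).
reference = {"GENE_A--GENE_B": ("ACGTACGTAC", "TTGGCCAATT")}
reads = ["TACGTACTTGGCCA", "ACGTACGTAC", "GGCGTACTTGGCAA"]
junction_counts = count_junction_reads(reads, reference)
```

Reads that lie entirely within one partner gene, like the second toy read, do not span the junction and are therefore not counted.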