Presentation Overview: Show
Motivation: Machine learning methods can be used to support scientific discovery in healthcare-related research fields. However, these methods can only be reliably used if they can be trained on high-quality and curated datasets. Currently, no such dataset for the exploration of Plasmodium falciparum protein antigen candidates exists. The parasite Plasmodium falciparum causes the infectious disease malaria. Thus, identifying potential antigens is of utmost importance for the development of antimalarial drugs and vaccines. Since exploring antigen candidates experimentally is an expensive and time-consuming process, applying machine learning methods to support this process has the potential to accelerate the development of drugs and vaccines, which are needed for fighting and controlling malaria.
Results: We developed PlasmoFAB, a curated benchmark that can be used to train machine learning methods for the exploration of Plasmodium falciparum protein antigen candidates. We combined an extensive literature search with domain expertise to create high-quality labels for Plasmodium falciparum-specific proteins that distinguish between antigen candidates and intracellular proteins. Additionally, we used our benchmark to compare different well-known prediction models and available protein localization prediction services on the task of identifying protein antigen candidates. We show that available general-purpose services do not perform sufficiently well at identifying protein antigen candidates and are outperformed by our models, which were trained on this tailored data.
Profiling the molecular features of all cells together with their anatomical and functional attributes is essential for understanding the human body in health and disease. Scientists have been enthusiastic about building such atlases of human cells using single-cell omics technologies. With the rapid development and popularization of single-cell RNA-sequencing technologies, the community has conducted more and more single-cell studies, and a tremendous amount of single-cell data has been accumulating in the public domain. This suggests the possibility of building cell atlases by assembling such “shot-gun” data from scattered publications. Cell atlas assembly faces several major challenges compared with the shot-gun assembly of the human genome. We proposed a unified information framework for assembling atlases and built the first cell-centric human Ensemble Cell Atlas (hECA) assembled from scattered data. We developed the “in data” cell sorting scheme, which allows cells to be extracted from hECA using logic formulas, so that hECA serves as a “virtual human body” for investigating scientific questions involving multiple organs and cell types. We also developed UniCoord, a multidimensional coordinate system for different physical and biological attributes of cells, by adopting a supervised variational autoencoder (VAE) neural network model and training it on hECA so that it represents the diversity of healthy human cells.
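The idea of extracting cells with a logic formula can be illustrated with a small sketch. This is not the hECA interface: the `Cell` record, its field names, the helper functions, and the toy data are all hypothetical and only show how a logical condition over organ and expression attributes might select cells across organs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Cell:
    """One cell record; field names are hypothetical, not the hECA schema."""
    cell_id: str
    organ: str
    cell_type: str
    expression: dict = field(default_factory=dict)  # gene symbol -> expression


def in_organs(*organs):
    """Condition: the cell comes from one of the given organs."""
    return lambda cell: cell.organ in organs


def expresses(gene, threshold=0.0):
    """Condition: expression of `gene` is above `threshold`."""
    return lambda cell: cell.expression.get(gene, 0.0) > threshold


def all_of(*conditions):
    """Logical AND of several conditions."""
    return lambda cell: all(cond(cell) for cond in conditions)


def none_of(*conditions):
    """Logical NOT of the OR of several conditions."""
    return lambda cell: not any(cond(cell) for cond in conditions)


def sort_cells(cells, formula):
    """Return the cells for which the logic formula holds."""
    return [cell for cell in cells if formula(cell)]


# Toy data, invented for illustration only.
CELLS = [
    Cell("c1", "lung", "T cell", {"CD3E": 2.1, "CD8A": 1.4}),
    Cell("c2", "lung", "B cell", {"CD19": 3.0}),
    Cell("c3", "liver", "T cell", {"CD3E": 1.7}),
    Cell("c4", "heart", "T cell", {"CD3E": 2.5}),
]

# (organ is lung OR liver) AND CD3E expressed AND NOT CD19 expressed
formula = all_of(
    in_organs("lung", "liver"),
    expresses("CD3E"),
    none_of(expresses("CD19")),
)
selected = sort_cells(CELLS, formula)
print([cell.cell_id for cell in selected])
```

Composing small condition functions keeps the formula readable and lets the same query run unchanged over cells from any organ.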
Trigeminal neurons play a crucial role in the perception of pain. Single-cell transcriptomics provides a promising avenue to decipher the underlying mechanisms of pain. However, a reference atlas is still missing. To fill this gap, we collected the most comprehensive set of datasets available and performed a meta-analysis. We used four publicly available datasets of trigeminal tissues generated with different technologies: three Drop-seq datasets and one from 10x Genomics. Raw sequencing reads were aligned to the mm10 reference genome using CellRanger. Standardized quality control was performed to exclude estimated empty cells, ambient RNA, and doublets using dropkick, SoupX, and DoubletFinder, respectively. The processed datasets were integrated with scVI to reduce sample effects. We clustered the cells using the Leiden algorithm, visualized the clusters with UMAP, and annotated them using known cell type marker genes. The resulting atlas contains 44,770 cells in 16 clusters, comprising 6 neuron populations, 3 glial populations, 4 immune populations, fibroblasts, and endothelial cells. To facilitate public use of the atlas, we provide an automated cell type annotation utility based on scArches, and the reference is visualized with CELLxGENE. This atlas may serve as a valuable data resource for the mouse trigeminal community.
Exponential growth in the volume of biological sequences presents ever-increasing bioinformatics challenges. Clustering is often a first step in bioinformatics workflows to reduce sequences to more manageable numbers. Therefore, clustering presents a scalability bottleneck that is constrained by time and memory limitations. Here, I describe the development of Clusterize, a novel method for accurately clustering with linear time and memory complexity. The Clusterize algorithm linearizes the clustering problem through a process termed relatedness sorting. After linear-time relatedness sorting, each sequence only needs to be compared to a fixed number of nearby sequences in the ordering and is clustered with one of them if it falls within the similarity threshold. I compare the performance of Clusterize to that of popular clustering programs, including CD-HIT, MMseqs, and UCLUST. Clusterize is able to quickly and accurately cluster tens of millions of homologous sequences, such as 16S amplicons, and non-homologous sequences, such as the UniProt database. Clusterize is far more accurate than another linear-time clustering algorithm, Linclust, on many typical clustering tasks. Overall, Clusterize represents a novel approach for scaling clustering to new heights while keeping pace with the continual flood of biological sequences. Clusterize is part of the DECIPHER package for R, available from the Bioconductor package repository.
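The "sort, then compare only to a fixed number of neighbours" idea can be sketched as follows. This is not the Clusterize implementation: the abstract does not describe how relatedness sorting works, so plain lexicographic sorting stands in for it here (and costs O(n log n) rather than linear time), and k-mer Jaccard similarity is an assumed similarity measure. The function names and parameters are hypothetical.

```python
def kmers(seq, k=3):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def windowed_cluster(seqs, threshold=0.5, window=2, k=3):
    """Cluster sequences by comparing each one only to the `window`
    sequences that precede it in a sorted ordering.

    Lexicographic order is a placeholder for relatedness sorting.
    After sorting, at most len(seqs) * window comparisons are made.
    """
    order = sorted(range(len(seqs)), key=lambda i: seqs[i])
    profiles = [kmers(s, k) for s in seqs]
    labels = [-1] * len(seqs)
    next_label = 0
    for pos, i in enumerate(order):
        # Check the nearest preceding neighbours first.
        for j in reversed(order[max(0, pos - window):pos]):
            if jaccard(profiles[i], profiles[j]) >= threshold:
                labels[i] = labels[j]
                break
        if labels[i] == -1:  # no neighbour was similar enough
            labels[i] = next_label
            next_label += 1
    return labels


# Toy sequences, invented for illustration only.
SEQS = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "TTTTGGGGCA", "ACGTACGTAG"]
labels = windowed_cluster(SEQS)
print(labels)
```

The quality of such a scheme depends entirely on the ordering placing related sequences close together, which is the role relatedness sorting plays in the method described above.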
Following viral infection, the human immune system generates broad and dynamic CD8+ T cell responses to virus antigens. Characterizing such T cell responses makes it possible to understand infection history and its contribution to protective immunity.
We performed in-depth profiling of CD8+ T cells reactive to CMV, EBV and Influenza virus derived antigens in peripheral blood samples from 114 healthy donors and 55 cancer patients using high-dimensional mass cytometry with combinatorial barcoding of peptide-MHC-I multimers and subsequent single cell RNA sequencing/VDJ-CITE-Seq for phenotypes and TCR repertoire analysis of identified antigen-specificities.
We analysed the expression of up to 138 surface markers from more than 500 antigen-specific T cell responses across six different HLA alleles by applying multiple machine learning approaches. Our data revealed unique phenotypic signatures of T cells specific for antigens from different virus categories. Based on these signatures, we built an ML approach to predict virus specificity from bulk CD8+ T cells. We validated our prediction capabilities in silico using an independent sample cohort and also in vitro by TCR expression in a Jurkat reporter assay. Our data suggest that machine learning can be used as a statistically rigorous and unbiased way to accurately predict antigen specificity from T cell phenotypes.
The rapid improvements in genomic sequencing technology have led to the proliferation of locally collected genomic datasets. Given the sensitivity of genomic data, it is crucial to conduct collaborative studies while preserving the privacy of the individuals. However, before starting any collaborative research effort, the quality of the data needs to be assessed. One of the essential steps of the quality control process is population stratification: identifying genetic differences among individuals that are due to subpopulations. One of the common methods used to group genomes of individuals based on ethnicity is principal component analysis (PCA). In this paper, we propose a framework to perform population stratification using PCA across multiple collaborators in a privacy-preserving way. In our proposed client-server-based scheme, we initially let the server train a global PCA model on a publicly available genomic dataset which contains individuals from multiple populations. Each collaborator (client) later uses the global PCA model to reduce the dimensionality of its local data. After adding noise to achieve local differential privacy (LDP), the collaborators send metadata (in the form of their local PCA outputs) about their research datasets to the server, which then aligns the local PCA results to identify the genetic differences among collaborators' datasets. Our results on real genomic data show that the proposed framework can perform population stratification with high accuracy while preserving the privacy of the research participants.
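The client-side step (project local data with the global PCA model, then add noise before sharing) might look roughly like the sketch below. The abstract does not say which LDP mechanism is used, so the Laplace mechanism with clipping is an assumption, as are all function names, the clipping bound, and the toy numbers.

```python
import random


def project(genotype, components, mean):
    """Project one genotype vector onto the global principal components."""
    centered = [g - m for g, m in zip(genotype, mean)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize(scores, clip, epsilon, rng):
    """Clip each PCA score to [-clip, clip] and add Laplace noise.

    Assumed mechanism: any two clipped vectors differ by at most
    2 * clip * d in L1 norm, so Laplace noise with scale
    (2 * clip * d) / epsilon per coordinate gives epsilon-LDP.
    """
    clipped = [max(-clip, min(clip, s)) for s in scores]
    sensitivity = 2.0 * clip * len(clipped)
    scale = sensitivity / epsilon
    return [s + laplace_noise(scale, rng) for s in clipped]


# Toy global model and genotype, invented for illustration only.
COMPONENTS = [[1, 0, 0], [0, 1, 0]]  # two principal axes over three SNPs
MEAN = [1, 1, 1]
genotype = [2, 0, 1]

scores = project(genotype, COMPONENTS, MEAN)
rng = random.Random(0)
shared = privatize(scores, clip=3.0, epsilon=1.0, rng=rng)
print(scores, shared)
```

Smaller values of `epsilon` give stronger privacy but noisier PCA outputs, which is the trade-off the server-side alignment has to tolerate.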
The spatial positioning of chromosomes relative to functional nuclear bodies is intertwined with genome functions such as transcription, but the sequence patterns and epigenomic features that collectively influence chromatin spatial positioning in a genome-wide manner are not well understood. Here, we develop a new transformer-based deep learning model, called UNADON, that predicts the genome-wide cytological distance to a specific type of nuclear body, as measured by TSA-seq, using both sequence features and epigenomic signals. Evaluations of UNADON in four cell lines (K562, H1, HFFc6, HCT116) show high accuracy in predicting chromatin spatial positioning to nuclear bodies when trained on a single cell line. UNADON also performed well in an unseen cell type. Importantly, we reveal potential sequence and epigenomic factors that affect large-scale chromatin compartmentalization to nuclear bodies. Together, UNADON provides new insights into the relationship between sequence features and large-scale chromatin spatial localization, which has important implications for understanding nuclear structure and function.
Motivation: The Chemical Master Equation (CME) is a set of linear differential equations that describes the evolution of the probability distribution on all possible configurations of a (bio-)chemical reaction system. Since the number of configurations, and therefore the dimension of the CME, rapidly increases with the number of molecules, its applicability is restricted to small systems. A widely applied remedy for this challenge is the use of moment-based approaches, which consider the evolution of the first few moments of the distribution as summary statistics for the complete distribution.
Here, we investigate the performance of two moment-estimation methods for reaction systems whose equilibrium distributions are heavy-tailed and hence do not possess statistical moments.
Results: We show that estimation via Stochastic Simulation Algorithm (SSA) trajectories loses consistency over time and that the estimated moment values span a wide range even for large sample sizes. In comparison, the Method of Moments returns smooth moment estimates but is not able to indicate the non-existence of the allegedly predicted moments. We furthermore analyze the negative effect of a CME solution's heavy-tailedness on SSA run times and explain inherent difficulties.
While moment estimation techniques are a commonly applied tool in the simulation of (bio-)chemical reaction networks, we conclude that they should be used with care, as neither the system definition nor the moment estimation techniques themselves reliably indicate the potential heavy-tailedness of the CME's solution.
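To make the trajectory-based estimator concrete, here is a minimal sketch of SSA-based moment estimation for a simple birth-death system. The system and rate constants are chosen for illustration and are not taken from the study. Its stationary distribution is Poisson, so all moments exist and the estimates are well behaved; the sketch shows only the estimator itself, not the heavy-tailed failure mode discussed above.

```python
import random


def ssa_birth_death(birth, death, n0, t_end, rng):
    """Simulate one trajectory of 0 -> X (rate `birth`) and X -> 0
    (rate `death` * n) with the Stochastic Simulation Algorithm and
    return the copy number at time `t_end`.
    """
    t, n = 0.0, n0
    while True:
        total = birth + death * n           # total propensity
        t += rng.expovariate(total)         # waiting time to next reaction
        if t > t_end:
            return n
        if rng.random() * total < birth:    # pick which reaction fires
            n += 1
        else:
            n -= 1


def estimate_moments(samples):
    """Sample mean and unbiased sample variance."""
    m = len(samples)
    mean = sum(samples) / m
    var = sum((x - mean) ** 2 for x in samples) / (m - 1)
    return mean, var


rng = random.Random(1)
samples = [ssa_birth_death(10.0, 1.0, 0, 10.0, rng) for _ in range(300)]
mean, var = estimate_moments(samples)
print(mean, var)  # both should be close to birth / death = 10
```

For a heavy-tailed system the same `estimate_moments` call would still return finite numbers, which is why the estimator alone cannot reveal that the underlying moments do not exist.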
In this study, we present the “multivariate MArginal ePIstasis Test” (mvMAPIT), a multi-outcome generalization of a recently proposed epistatic detection method which seeks to detect marginal epistasis, that is, the combined pairwise interaction effects between a given variant and all other variants. By searching for marginal epistatic effects, one can identify genetic variants that are involved in epistasis without the need to identify the exact partners with which the variants interact, thus potentially alleviating much of the statistical and computational burden associated with conventional explicit search-based methods. Our proposed mvMAPIT builds upon this strategy by taking advantage of the correlation structure between traits to improve the identification of variants involved in epistasis. We formulate mvMAPIT as a multivariate linear mixed model and develop a multi-trait variance component estimation algorithm for efficient parameter inference and P-value computation. Together with reasonable model approximations, our proposed approach is scalable to moderately sized GWA studies. With simulations, we illustrate the benefits of mvMAPIT over univariate (or single-trait) epistatic mapping strategies. We also apply the mvMAPIT framework to protein sequence data from two broadly neutralizing anti-influenza antibodies and to approximately 2,000 heterogeneous stock mice from the Wellcome Trust Centre for Human Genetics.
Motivation: Single cell RNA-sequencing (scRNA-seq) offers a powerful tool to dissect the complexity of biological tissues through cell sub-population identification in combination with clustering approaches. Feature selection is a critical step for improving the accuracy and interpretability of single-cell clustering. Existing feature selection methods do not make full use of the information of cell-type-discriminating ability of genes. We hypothesize that incorporating such information could further boost the performance of single cell clustering.
Results: We develop CellBRF, a feature selection method that considers genes’ relevance to cell types for single-cell clustering. The key idea is to identify the genes that are most important for discriminating cell types through random forests guided by predicted cell labels. Moreover, CellBRF uses a class balancing strategy to mitigate the impact of unbalanced cell type distributions on feature importance evaluation. We benchmark CellBRF on thirty-three scRNA-seq datasets representing diverse biological scenarios, and demonstrate that it substantially outperforms state-of-the-art feature selection methods in terms of clustering accuracy and cell neighborhood consistency. Furthermore, we demonstrate the strong performance of our selected features with three case studies on cell differentiation stage identification, non-malignant cell subtype identification, and rare cell identification. CellBRF provides a new and effective tool to boost single-cell clustering accuracy.
Availability and implementation: All source code of CellBRF is freely available at github.com/xuyp-csu/CellBRF.
Hypertension is a polygenic disease that affects over 1.2 billion adults aged 30–79 worldwide. It is a major risk factor for renal, cerebrovascular, and cardiovascular diseases. The heritability of hypertension is estimated to be high; nevertheless, the understanding of the underlying mechanisms remains scarce and incomplete. Using a novel method called PWAS (proteome-wide association study) on participants from the UK Biobank (UKB), we discovered 70 statistically significant associated genes, most of which failed to reach significance by the routine GWAS, which is variant-based. Our findings were validated against independent cohorts, including the Finnish Biobank, and confirmed a substantial fraction of the PWAS hypertension-associated genes. The gene-based analyses that were performed on both sexes separately revealed a sex-dependent genetic signal with a stronger component associated with females. Analysis of the measurements for systolic and diastolic blood pressure for the entire UKB cohort confirmed the dominant genetic contribution for females. In this study, we will demonstrate the advantage of applying gene-based association methods over the classical GWAS in interpretability and in identifying sex-specific genetic signals as a lead towards mechanistic understanding of hypertension and related phenotypes.
Cancer-associated mutagenesis is accelerated by genome instability (GIN), a hallmark of cancer cells. GIN promotes cell transformation by altering cancer driver genes, and it may also facilitate the adaptation of tumor cells to therapy. The DNA repair machinery plays a central role in preventing DNA damage and GIN, and indeed various DNA repair pathways are often deficient in cancer. One of these pathways is translesion DNA synthesis (TLS), which allows replication forks to bypass bulky DNA lesions, preventing gross chromosomal rearrangements but at the cost of introducing point mutations and indels. TLS is a conserved mechanism that diversified during evolution, leading to multiple TLS enzymes with different but overlapping specificities for DNA lesions in human cells. However, given their common ancestry, TLS polymerases perform redundant functions and can substitute for each other, which makes them very difficult to study using single knockouts (KOs). Here, we develop a double knockout strategy using CRISPR-Cas12 to analyze pairwise deletions of TLS enzymes in human cells, and unveil interactions between them under a wide range of commonly used chemotherapeutic agents in lung adenocarcinoma and ovarian carcinoma cell models. Understanding genetic interactions between DNA repair enzymes under particular therapies may reveal new strategies aimed at specifically targeting cancer cells.
Recent years have seen a surge of novel neural network architectures for multi-omics integration. One important parameter is the integration depth: the point at which the latent representations are computed or merged, which can be early, intermediate, or late. The literature on integration methods grows steadily. However, close to nothing is known about the relative performance of these methods under fair experimental conditions and under consideration of different use cases. We developed a comparison framework that trains multi-omics integration methods under equal conditions. We incorporated four recent deep learning methods, early integration, PCA, and a novel method, Omics Stacking, that combines the advantages of intermediate and late integration. Experiments were conducted on a drug response data set with multiple omics data. Our experiments confirmed that early integration has the lowest predictive performance. Overall, statistically significant differences can rarely be observed. In terms of the average ranks of methods, however, Super.FELT performed best in a cross-validation setting and Omics Stacking performed best on the external test set. When faced with a new data set, Super.FELT is therefore a good option in the cross-validation setting, and Omics Stacking is a good option in the external test set setting.
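Comparing methods by average rank, as mentioned above, can be sketched as follows. The abstract does not describe the exact ranking procedure, so this assumes that higher scores are better and that tied methods share the average of their ranks. The method names and scores are invented placeholders, not results from the study.

```python
def rank_scores(scores):
    """Rank scores so that 1 is best (highest); ties share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            ranks[i] = shared_rank
        pos = end + 1
    return ranks


def average_ranks(results):
    """results maps a method name to its scores, one per task, in the same
    task order for every method. Returns the mean rank of each method."""
    methods = list(results)
    n_tasks = len(results[methods[0]])
    totals = {m: 0.0 for m in methods}
    for t in range(n_tasks):
        task_ranks = rank_scores([results[m][t] for m in methods])
        for m, r in zip(methods, task_ranks):
            totals[m] += r
    return {m: totals[m] / n_tasks for m in methods}


# Placeholder methods and scores, invented for illustration only.
RESULTS = {
    "method_A": [0.9, 0.8, 0.7],
    "method_B": [0.8, 0.8, 0.9],
    "method_C": [0.7, 0.6, 0.8],
}
avg = average_ranks(RESULTS)
print(avg)
```

Average ranks summarize relative ordering across tasks without implying that the differences are statistically significant, which matches the caveat in the abstract.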
Patents play a crucial role in the drug discovery process by providing legal protection for discoveries and incentivising investments in research and development. By identifying patterns within patent data resources, researchers can gain insight into the market trends and priorities of the pharmaceutical industry, as well as additional perspectives on more fundamental aspects such as the emergence of potential new drug targets. In this paper, we used the PEMT to integrate and analyse patent literature for rare diseases (RD) and Alzheimer's disease (AD). This is followed by a systematic review of the underlying patent landscape to decipher trends and applications in patents. We start by discussing organisations involved in R&D in AD and RD. This allows us to gain an understanding of the importance of AD and RD from specific organisational perspectives. Next, we analysed the historical focus of patents on therapeutic targets and correlated it with market scenarios, allowing the identification of prominent targets for a disease. Lastly, we identified repurposed drugs within the two diseases with the help of patents. The study demonstrates the expanded applicability of patent documents from legal use to drug discovery, design, and research, thus providing a valuable resource for future drug discovery efforts.
Cancer immunotherapy has greatly improved the quality of life of cancer patients, and it hinges on the discovery of novel cancer antigens that could be targeted to improve disease outcomes. We have developed a pan-cancer, pan-HLA, and pan-tissue database containing immunopeptidomics data mapped to transcriptomic, genomic, immunological and biochemical data. The database was generated from 77 different publicly available immunopeptidomics mass spectrometry datasets collected between 2015 and 2022 (73 cancer and 4 normal datasets), covering 15 different types of cancers and 152 different HLA-I alleles. The peptides contained in our database were obtained by a combination of closed, open and de novo searches using an in-house developed computational pipeline. Following rigorous false discovery rate (FDR) estimation at 1% and a second-round search to eliminate any false signals that may not have been detected in the previous round of FDR estimation, we obtained a list of 11.2 million peptide-HLA combinations comprising both coding and non-coding regions of the genome as well as bacterial peptides. These peptides have been mapped to chromosomal coordinates to facilitate adoption of this resource on antigen presentation by the genomics community. Our database includes a FAIR knowledge graph which contextualizes and enriches the data.
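Filtering identifications at a 1% false discovery rate can be illustrated with a small sketch. The abstract does not state how the FDR is estimated, so the target-decoy approach used here (estimated FDR = decoy hits divided by target hits above a score cutoff) is an assumption, and the function name, input layout, and toy scores are hypothetical.

```python
def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Find the lowest score cutoff whose estimated FDR is within `max_fdr`.

    psms is a list of (score, is_decoy) pairs, one per peptide-spectrum
    match. The FDR at a cutoff is estimated as decoys / targets among
    all matches scoring at or above that cutoff (assumed target-decoy
    strategy). Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff


# Toy matches, invented for illustration: 200 high-scoring targets,
# followed by a mix of decoys and one more target at lower scores.
PSMS = [(float(s), False) for s in range(101, 301)]
PSMS += [(100.0, True), (99.0, False), (98.0, True), (97.0, True)]

cutoff = score_cutoff_at_fdr(PSMS, max_fdr=0.01)
accepted = [s for s, is_decoy in PSMS if not is_decoy and s >= cutoff]
print(cutoff, len(accepted))
```

A second-round search like the one described above would then re-examine the accepted matches; that step is not modelled here.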
Background: Heterogeneity of tumor immune microenvironment accounts for differential prognosis and immunotherapy response among colorectal cancer (CRC) patients. Here, we developed novel immune subtypes through integrative multi-omics analysis to characterize CRC heterogeneity.
Methods: Immune-related gene expression, mutation, and methylation profiles were collected from TCGA (n = 627) to perform multi-omics factor analysis (MOFA) and establish the Multi-Omics Tumor Immune Features Clustering of CRC (MotifCC). Transcriptomic, genomic, and epigenetic landscapes were analyzed to characterize differences among MotifCC clusters. Independent validation of MotifCC was performed in our large-scale, in-house COCC cohort (Clinical Omics Study of Colorectal Cancer in China, n = 1001).
Results: The three MotifCC clusters showed distinct characteristics. Cluster1 was a high immune and stroma infiltration subtype with the worst prognosis. Cluster2 was characterized by low immune infiltration, high metabolic intensity, and the best survival. With medium immune infiltration and intermediate prognosis, Cluster3 was distinguished by the highest methylation and stemness status. In addition, despite its intermediate prognosis, Cluster3 was associated with a better response to immunotherapy.
Conclusion: We established the MotifCC, novel immune subtypes capturing the multi-omics heterogeneity of CRC and facilitating patient stratification for immunotherapy.
Proteases play a crucial role in various cellular processes, and the precise cleavage of substrates by proteases is essential for these processes to occur correctly. Accurately predicting substrate cleavage sites is a crucial step in understanding protease function and substrate specificity. Many bioinformatics methods have been developed to predict protease-specific substrate cleavage sites. However, with the development of mass spectrometry technology, an enormous amount of protease substrate cleavage data has been generated and will continue to grow. Consequently, it is neither efficient nor practical to train new models on these rapidly accumulating data every year and update the prediction server for the wider community. In this study, we developed a multi-faceted, versatile bioinformatics tool, termed ProsperousPlus, that enables fast, accurate and high-throughput prediction of substrate cleavage sites for 110 proteases. Benchmarking tests show that ProsperousPlus achieves competitive predictive performance compared with state-of-the-art approaches. Furthermore, ProsperousPlus helps users without a programming background to build their own customised in-house models and meet their specific needs. It is anticipated that researchers with little bioinformatics expertise will be able to efficiently use rapidly accumulating substrate cleavage data to train in-house prediction models that meet their specific requirements.
Computational methods that decipher rare and private somatic changes can provide critical insights into the underlying mechanisms of cancer development and progression. Identifying potential cancer subtypes that might be associated with diverse biological responses is a key first step to define target therapeutics.
Machine and deep learning (ML/DL) methods that use clinical and/or multi-omics data have been adopted for the identification of cancer subtypes. There also exists a growing collection of sequence-based ML/DL models that accurately predict different epigenetic traits (e.g. transcription factor binding) and allow the impact of individual somatic aberrations to be estimated. Applying sequence-based ML/DL on a genome-wide scale makes it possible to augment somatic mutations with a model-based view that captures functionally relevant differences between individuals.
In this study, we adopt SEI, a sequence-based DL model that is trained to predict more than 21K different regulatory activities, to obtain mutation impact embeddings. We first identify mutations with strong impacts through investigating clusters of alternative and reference sequence embeddings. Then, mutation impact embeddings are utilized to generate a patient similarity network (PSN) for unsupervised identification of patient subgroups. The proposed approach provides a novel strategy of utilizing variant impact scores in PSNs for cancer subtyping.
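How a patient similarity network might be derived from embeddings can be sketched as follows. The abstract does not specify the similarity measure, the graph construction, or the grouping method, so cosine similarity, a k-nearest-neighbour graph, and connected components are all assumptions here; in practice a community detection method would more likely be used for the grouping step. The function names and the toy per-patient vectors (standing in for aggregated mutation impact embeddings) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_psn(embeddings, k=1):
    """Connect each patient to its k most similar patients.

    Returns a dict mapping an undirected edge (sorted pair of patient
    ids) to the cosine similarity of the two patients.
    """
    edges = {}
    for p, vec_p in embeddings.items():
        neighbours = sorted(
            ((cosine(vec_p, vec_q), q) for q, vec_q in embeddings.items() if q != p),
            reverse=True,
        )
        for similarity, q in neighbours[:k]:
            edges[tuple(sorted((p, q)))] = similarity
    return edges


def subgroups(patients, edges):
    """Group patients by connected components of the network."""
    adjacency = {p: set() for p in patients}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for start in patients:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group))
    return groups


# Toy per-patient embeddings, invented for illustration only.
EMBEDDINGS = {
    "P1": [1.0, 0.0],
    "P2": [0.9, 0.1],
    "P3": [0.0, 1.0],
    "P4": [0.1, 0.9],
}
edges = build_psn(EMBEDDINGS, k=1)
groups = subgroups(list(EMBEDDINGS), edges)
print(groups)
```

The choice of `k` controls how densely the network is connected and therefore how coarse the resulting subgroups are.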