Session A: Monday, July 24, 18:00 to 19:00 CEST. Posters set up: Monday, July 24, 08:00 to 08:45 CEST; dismantled: Monday, July 24, at 19:00 CEST.
Session B: Tuesday, July 25, 18:00 to 19:00 CEST. Posters set up: Tuesday, July 25, 08:00 to 08:45 CEST; dismantled: Tuesday, July 25, at 19:00 CEST.
Session C: Wednesday, July 26, 18:00 to 19:00 CEST. Posters set up: Wednesday, July 26, 08:00 to 08:45 CEST; dismantled: Wednesday, July 26, at 19:00 CEST.
Virtual
Presentation Overview:
We used long-read DNA sequencing to assemble the genome of a Southern Han Chinese male. We organized the sequence into chromosomes and filled in gaps using the recently completed T2T-CHM13 genome as a guide, yielding a gap-free genome, Han1, containing 3,099,707,698 bases. Using the T2T-CHM13 annotation as a reference, we mapped all genes onto the Han1 genome and identified additional gene copies, generating a total of 60,708 putative genes, of which 20,003 are protein coding. A comprehensive comparison between the genes revealed that 235 protein-coding genes were substantially different between the individuals, with frameshifts or truncations affecting the protein-coding sequence. Most of these were heterozygous variants in which one gene copy was unaffected. This represents the first gene-level comparison between two finished, annotated individual human genomes.
Presentation Overview:
Brain aging is characterized by a progressive loss of tissue integrity and increased cellular heterogeneity, leading to impaired function, increased susceptibility to disease, and death. In this work, we collected and analyzed two large datasets of mRNA levels from Illumina and Oxford Nanopore technologies in young adult and aged adult mice. We report the first transcriptome-wide differential transcript usage study of brain aging. We provide the community with a large resource of whole-brain transcriptomes and comprehensive analyses that identify widespread diversity in RNAs during aging. Specifically, we observed that mRNAs encoding neuronal synaptic proteins are upregulated with age, and that this is conserved in humans. We also observed that a subset of the genes upregulated in the aged brain is associated with neurodegenerative diseases. In addition, we report that longer RNA molecules with shorter 3'UTRs are more abundant in the aged brain, and that a subset of these are alternatively spliced at the 3'UTR. Finally, we observed a difference in the turnover of mRNAs at different ages. Overall, based on these observations, we speculate that alterations at the 3'UTR may play an active role in the aging process.
Presentation Overview:
We used long-read sequencing (LRS) to identify structural variants (SVs) in 141 schizophrenia (SCZ) cases, obtaining a median of 14,392 high-confidence SVs per individual using an alignment-based pipeline. We compared these SVs with those detected by short-read sequencing (SRS) of the same samples and found that 31.5% of the SVs detected by LRS were consistent with 64.5% of those detected by SRS. LRS-specific SVs were enriched in segmental duplication (SD) and simple repeat (SR) regions, demonstrating the advantages of LRS in these regions. After filtering against public population-scale SV sets, we identified 618 potentially pathogenic SVs, detected only by LRS-based methods, that were enriched in SR regions and carried by multiple cases. Among this set we also identified previously reported SCZ-associated SVs, such as a translocation (TRA) in DISC1 and VNTRs in SLC6A4. These potentially pathogenic SVs affected 551 genes, which were significantly enriched in developmental and synaptic pathways that tend to be associated with SCZ. Our study highlights the effectiveness of LRS in identifying SVs, particularly in repeat regions, and provides new insights into the genetic mechanisms of disease.
Presentation Overview:
Short-read and long-read RNA sequencing technologies each have their strengths and weaknesses for transcriptome assembly. While short reads are highly accurate, they rarely span multiple exons. Long-read technology can capture full-length transcripts, but its relatively high error rate often leads to mis-identified splice sites. The initial version of StringTie2, our guided transcriptome assembler, could handle either short or long reads, but not both data types at the same time. We improved on StringTie2 and released a new version of StringTie capable of handling mixed transcriptomic data that includes both short and long RNA-seq reads sequenced from the same sample. By taking advantage of the strengths of both read types, hybrid-read assembly with StringTie is more accurate than long-read-only or short-read-only assembly, and on some datasets it can more than double the number of correctly assembled transcripts while obtaining substantially higher precision than assembly from the long-read data alone. Using real and simulated data, we show that hybrid-read assemblies achieve greater precision and sensitivity than long-read-only assemblies (with or without error correction) and short-read-only assemblies, as well as better estimates of gene expression levels.
Presentation Overview:
Precise identification of alleles in the human leukocyte antigen (HLA) region of the human genome is crucial for many clinical and research applications. However, HLA typing remains challenging due to the highly polymorphic nature of the HLA loci. With next-generation sequencing (NGS) data becoming widely accessible, many computational tools have been developed to predict HLA types from RNA sequencing (RNA-seq) data. Despite this development, there remains a lack of comprehensive and systematic benchmarking of RNA-seq-based HLA callers. To address this limitation, we rigorously compared the performance of all nine currently published HLA callers on six gold-standard datasets spanning 652 RNA-seq samples. For each caller, we computed accuracy, defined as the percentage of correctly predicted alleles, and reported the HLA genes and alleles most prone to misprediction. Furthermore, we evaluated each caller's runtime and CPU and memory usage. Our study is also the first to evaluate the effect of read length on prediction quality for each tool, as well as the effect of ancestral origin (African vs. European) on accuracy. This study offers crucial information for researchers and clinicians regarding appropriate choices of methods for HLA typing.
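The accuracy metric described above is simple to make concrete. Below is a minimal sketch (not the benchmark's actual code) computing accuracy as the percentage of correctly predicted alleles for one sample; the allele names are hypothetical:

```python
def hla_accuracy(predicted, truth):
    """Percentage of truth alleles matched by the caller's predictions.

    Each predicted allele can match at most one truth allele, so
    homozygous calls must be predicted twice to count twice.
    """
    remaining = list(predicted)
    correct = 0
    for allele in truth:
        if allele in remaining:
            remaining.remove(allele)
            correct += 1
    return 100.0 * correct / len(truth)

# Hypothetical two-field calls for one sample (not from the study):
truth     = ["A*02:01", "A*03:01", "B*07:02", "B*44:02"]
predicted = ["A*02:01", "A*03:01", "B*07:02", "B*15:01"]
print(hla_accuracy(predicted, truth))  # 75.0
```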
Presentation Overview:
Spatial transcriptomics is a cutting-edge technique that enables the analysis of gene expression patterns within specific regions of tissue or organs. However, analyzing the large and complex datasets generated from spatial transcriptomics experiments remains a challenge. Here we propose U-CIE, a method for visualizing high-dimensional data by encoding it as colors using a combination of dimensionality reduction and the CIELAB color space. U-CIE allows genome-wide expression patterns within tissue or organ sections to be visualized and highlights the distribution of different cell types across the spatial transcriptomics data. U-CIE first uses UMAP to reduce high-dimensional gene expression data to three dimensions while preserving the spatial information. Next, the resulting three-dimensional representation is embedded within the CIELAB color space, generating a color encoding that captures much of the original structure of the data. U-CIE has been successfully applied to a mouse brain section dataset to highlight the distribution of different cell types across the spatial transcriptomics data and provide insights into the organization of these cells within brain regions. U-CIE has the potential to be a powerful tool for exploring spatial transcriptomics data and gaining new insights into cellular organization and function.
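The core encoding step can be illustrated with a toy sketch. This is not U-CIE's implementation (which fits the embedding to the CIELAB gamut more carefully); it simply rescales a hypothetical 3-D UMAP embedding linearly into the nominal CIELAB ranges:

```python
def to_cielab(points):
    """Linearly rescale 3-D embedding coordinates into CIELAB ranges:
    L* in [0, 100], a* and b* in [-100, 100]."""
    ranges = [(0.0, 100.0), (-100.0, 100.0), (-100.0, 100.0)]
    cols = list(zip(*points))  # one column per embedding dimension
    out_cols = []
    for values, (lo, hi) in zip(cols, ranges):
        vmin, vmax = min(values), max(values)
        span = (vmax - vmin) or 1.0  # avoid division by zero
        out_cols.append([lo + (v - vmin) / span * (hi - lo) for v in values])
    return [tuple(c) for c in zip(*out_cols)]

# Toy 3-D embedding of four cells/spots:
emb = [(0.0, 0.0, 0.0), (1.0, 2.0, -1.0), (2.0, 4.0, 1.0), (4.0, 1.0, 3.0)]
print(to_cielab(emb))
```

Each spot then receives the color given by its (L*, a*, b*) triple, so nearby points in expression space get similar colors on the tissue image.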
Presentation Overview:
Cancer-associated mutagenesis is accelerated by genome instability (GIN), a hallmark of cancer cells. GIN promotes cell transformation by altering cancer driver genes, and it may also facilitate tumor cell adaptation to therapy. The DNA repair machinery plays a central role in preventing DNA damage and GIN, and indeed various DNA repair pathways are often deficient in cancer. One of these pathways is translesion DNA synthesis (TLS), which allows replication forks to bypass bulky DNA lesions, preventing gross chromosomal rearrangements but at the cost of introducing point mutations and indels. TLS is a conserved mechanism that diversified during evolution, leading to multiple TLS enzymes with different but overlapping specificities for DNA lesions in human cells. However, given their common ancestry, TLS polymerases perform redundant functions and can substitute for each other, making them very difficult to study using single knockouts. Here, we develop a double-knockout strategy using CRISPR-Cas12 to analyze pairwise deletions of TLS enzymes in human cells, and unveil interactions between them under a wide range of commonly used chemotherapeutic agents in lung adenocarcinoma and ovarian carcinoma cell models. Understanding genetic interactions between DNA repair enzymes under particular therapies may reveal new strategies for specifically targeting cancer cells.
Presentation Overview:
Motivation: Computational analysis of large-scale metagenomics sequencing datasets has proved incredibly valuable for extracting isolate-level taxonomic and functional insights from complex microbial communities. However, thanks to an ever-expanding ecosystem of metagenomics-specific algorithms and file formats, designing studies, implementing seamless and scalable end-to-end workflows, and exploring the massive amounts of output data have become studies unto themselves. Furthermore, there is little inter-communication between output data serving different analytic purposes, such as short-read classification and metagenome-assembled genome (MAG) reconstruction. One-click pipelines have helped to organize these tools into targeted workflows, but they suffer from general compatibility and maintainability issues.
Results: To address the gap in easily extensible yet robustly distributable metagenomics workflows, we have developed a module-based metagenomics analysis system written in Snakemake, a popular workflow management system, along with a standardized module and working-directory architecture. Each module can be run independently or conjointly with a series of others to produce the target data format (e.g., short-read preprocessing alone, or short-read preprocessing followed by de novo assembly), and outputs aggregated summary-statistics reports and semi-guided Jupyter notebook-based visualizations. The module system is a bioinformatics-optimized scaffold designed to be rapidly iterated upon by the research community at large.
Presentation Overview:
Advances in high-throughput sequencing technologies have enabled the use of genomic information to better understand biological processes through studies such as genome-wide association studies, polygenic risk score estimation and chromosome conformation capture. The study of spatial chromosome organization of the human genome plays an important role in understanding gene regulation. Chromosome conformation capture techniques, such as Hi-C, can capture long-range interactions between all pairs of loci on all chromosomes. These techniques have revealed structures of genome organization, such as A/B compartments, topologically associated domains, chromatin loops and frequently interacting regions.
Although the advancement of Hi-C techniques enables the generation of massive amounts of high-resolution data, several challenges remain, such as a high proportion of missing data and noisy observed interaction frequencies. It is therefore currently unfeasible to reconstruct high-resolution genome structures efficiently and at high accuracy using existing state-of-the-art methods. To remedy this situation, we present PEKORA, a high-performance 3D genome reconstruction method using k-th order Spearman's rank correlation approximation. PEKORA outperforms the state of the art by an average margin of 35%.
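As an illustration of the correlation measure underlying PEKORA, here is plain (first-order) Spearman's rank correlation in pure Python; PEKORA's k-th order approximation and 3D reconstruction machinery are beyond this sketch:

```python
def ranks(xs):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(xs, ys):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))  # monotone increasing -> ~1
print(spearman([1, 2, 3, 4], [40, 30, 20, 10]))  # monotone decreasing -> ~-1
```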
Presentation Overview:
The Mouse Organogenesis Spatiotemporal Transcriptomic Atlas (MOSTA) contains spatial gene expression measurements for ~25,000 genes, captured by Stereo-Seq at 500nm resolution, for the entire mouse embryo, across eight developmental time points from E9.5 to E16.5. Data on this scale (terabytes) provide the opportunity to develop methods that untangle spatial organization, and its association with biological function, at the level of the whole organism.
We created a pipeline to identify spatial programs at the level of gene expression, niche organization, and cellular morphology. Using MOSTA expression and nuclei imaging data from timepoints E12.5, E14.5, and E16.5, we optimized a Hidden Markov Random Field model that differentially weights feature similarity with neighborhood effect to classify individual spatial units into larger spatially coherent domains. It is compatible with different types of spatial information: expression levels, tissue organization, and cell morphology. We also identified spatial co-expression modules (genes displaying spatial co-expression in a smoothed spatial k-nearest neighbor network) and evaluated the stability of these modules across adjacent tissue sections and consecutive time points of development. These analyses allow scientists to assess the biological role and organization of multiple spatial programs and how they change in a spatio-temporal manner.
Presentation Overview:
Variant callers typically produce massive numbers of false positives for structural variations, such as cancer-relevant copy-number alterations and fusion genes resulting from genome rearrangements. Here we describe an ultrafast and accurate detector of somatic structural variations that reduces read-mapping costs by filtering out reads matched to pan-genome k-mer sets. The detector, which we named ETCHING (for efficient detection of chromosomal rearrangements and fusion genes), reduces the number of false positives by leveraging machine-learning classifiers trained with six breakend-related features (clipped-read count, split-reads count, supporting paired-end read count, average mapping quality, depth difference and total length of clipped bases). When benchmarked against six callers on reference cell-free DNA, validated biomarkers of structural variants, matched tumour and normal whole genomes, and tumour-only targeted sequencing datasets, ETCHING was 11-fold faster than the second-fastest structural-variant caller at comparable performance and memory use. The speed and accuracy of ETCHING may aid large-scale genome projects and facilitate practical implementations in precision medicine.
Citation: https://doi.org/10.1038/s41551-022-00980-5 (Nature Biomed. Eng.).
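The k-mer filtering idea at the heart of ETCHING can be sketched as follows. This toy example (not ETCHING's implementation, and with an unrealistically small k) keeps only reads that carry at least one k-mer absent from a hypothetical pan-genome k-mer set, since such reads may support a structural variant:

```python
def kmers(seq, k):
    """All k-mers of a sequence as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_reads(reads, pangenome_kmers, k=5):
    """Keep only reads carrying at least one k-mer absent from the
    pan-genome set; reads fully explained by the pan-genome are
    discarded before the expensive mapping step."""
    return [r for r in reads if kmers(r, k) - pangenome_kmers]

# Toy pan-genome built from one reference-like sequence:
reference = "ACGTACGTGGCCAATT"
pan = kmers(reference, 5)
reads = [
    "ACGTACGTGG",  # every 5-mer is in the pan-genome -> filtered out
    "ACGTTTTTGG",  # carries novel 5-mers -> kept
]
print(filter_reads(reads, pan))  # ['ACGTTTTTGG']
```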
Presentation Overview:
As single cell RNA-sequencing becomes more accessible and reliable, research groups can design replicated experiments with multiple biological samples. For large experiments, sample multiplexing is often used to reduce cost and limit batch effects.
A commonly used multiplexing technique involves tagging cells prior to pooling with a hashtag oligo (HTO) that can be sequenced along with the cells’ RNA to determine their sample of origin. Several tools have been developed to demultiplex HTO sequencing data and assign cells to samples, but these tools are often tested using data with high-quality HTO labelling and low contamination.
Using experimental data sets with both good and poor labelling of samples, we critically assess the performance of seven HTO demultiplexing tools: hashedDrops, HTODemux, GMM-Demux, demuxmix, deMULTIplex, BFF and HashSolo. Each sample in our data sets has also been demultiplexed using genetic variants from the RNA, enabling comparison of HTO demultiplexing techniques against complementary data from the genetic “ground truth”. We find that all methods perform similarly where HTO labelling is of high quality, but methods that assume a bimodal counts distribution perform poorly on lower quality data. We also provide heuristic approaches for assessing the quality of HTO counts in a scRNA-seq experiment.
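A naive baseline for HTO demultiplexing is a fixed count threshold per tag, which is roughly what the bimodal-counts assumption amounts to in the simplest case. The sketch below uses hypothetical counts and a hypothetical threshold; the tools benchmarked above fit statistical models to the counts instead:

```python
def demux(cell_counts, threshold=50):
    """Assign a cell from its HTO counts: 'negative' if no tag passes
    the threshold, 'doublet' if more than one does, else the tag name."""
    passing = [tag for tag, n in cell_counts.items() if n >= threshold]
    if not passing:
        return "negative"
    if len(passing) > 1:
        return "doublet"
    return passing[0]

# Hypothetical per-cell HTO counts:
print(demux({"HTO1": 400, "HTO2": 6}))    # HTO1 (clean singlet)
print(demux({"HTO1": 350, "HTO2": 290}))  # doublet (two tags pass)
print(demux({"HTO1": 12, "HTO2": 9}))     # negative (no tag passes)
```

A fixed threshold like this breaks down exactly in the poorly labelled regime the benchmark describes, where the "positive" and "background" count modes overlap.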
Presentation Overview:
In biological applications it is common for the underlying biochemical theory to indicate that some observable should follow a Poisson distribution, often as a result of assuming that processes occur with a constant mean rate; an example is the base coverage of a DNA sequencing effort. However, it is well known that the observed distributions often demonstrate significant over-dispersion compared to a Poisson. In this work we assume this is the result of a marginalisation (‘blurring’) of the sampling rate, and use advanced demarginalisation techniques to recover the functional form of the underlying sampling bias, which can result from features in the genome or inherently from the sequencing platform. Using Bayesian methods, we infer the statistical significance of multiple underlying models. In doing so, we discover a highly multi-modal sampling bias, with some cases demonstrating that no part of the genome was sampled at the naively recovered mean sampling rate. We suggest that this result could have important uses in the fields of cancer detection and sequence data quality control, and that this methodology generalises to multiple other processes within the field of genomics.
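The 'blurring' of a Poisson rate and the resulting over-dispersion are easy to demonstrate. The sketch below draws each sample's rate from a Gamma distribution (one convenient choice of blurring, giving a negative-binomial marginal) and shows the variance-to-mean ratio rising well above the Poissonian value of 1:

```python
import math
import random

def poisson(lam, rng):
    """Knuth's algorithm for a single Poisson draw (fine for modest lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

rng = random.Random(0)
n = 10000
lam = 10.0

# Constant rate: variance is close to the mean.
pure = [poisson(lam, rng) for _ in range(n)]

# 'Blurred' rate: each draw's rate comes from a Gamma distribution with
# the same mean (a Gamma-Poisson, i.e. negative-binomial, mixture),
# whose variance exceeds its mean.
blurred = [poisson(rng.gammavariate(2.0, lam / 2.0), rng) for _ in range(n)]

def dispersion(xs):
    """Variance-to-mean ratio: 1 for a Poisson, >1 if over-dispersed."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs) / m

print(dispersion(pure))     # close to 1
print(dispersion(blurred))  # well above 1
```

The demarginalisation problem described in the abstract is the inverse of this simulation: given only the over-dispersed counts, recover the distribution that blurred the rate.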
Presentation Overview:
Epithelial ovarian cancer (EOC) is one of the most frequently diagnosed cancers in women and a major cause of cancer mortality. High-grade serous EOC (HGS-EOC) represents its most aggressive subtype. Effective therapy includes primary surgical cytoreduction followed by chemotherapy based on platinum analogues, but, despite an initial positive response, the majority of patients relapse and have a poor outcome. RNA-seq data give us the power to investigate differential gene expression, immune gene signatures, and pathways to define the tumor microenvironment (TME) and immune profile of this EOC. Meanwhile, single-cell RNA sequencing (scRNA-seq) has become a powerful method to investigate cell-to-cell transcriptomic variation, revealing new cell types and providing insights into developmental processes of this heterogeneous disease. A first cohort of patients, consisting of RNA-seq of longitudinal biopsies of HGS-EOC collected at primary surgery (naïve tumor) and at relapse after chemotherapy, was compared with a scRNA-seq patient dataset composed of four sites: the ovary sampled at exploratory laparoscopy, naïve to chemotherapy, and three metastatic sites sampled at debulking surgery after neoadjuvant chemotherapy (NACT). Through the analysis and comparison of these two datasets, this study aims to identify transcriptional and cell-composition changes occurring during chemotherapy treatment and relapse.
Presentation Overview:
Mitochondria are a main control center for metabolism and oxidative phosphorylation (OXPHOS), and the phenotypic manifestations of impaired mitochondrial function can be highly heterogeneous. Further, with the new technologies of single-cell and spatial transcriptomics, it is now possible to explore alterations and dissect heterogeneity at single-cell resolution.
With the aim of providing a tool to explore mitochondrial activity in different types of transcriptomic profiles, we developed the mitology R package. A list of genes was obtained from mitochondria-specific databases and from the Gene Ontology database. Then, from the Reactome pathway database and the Gene Ontology database, pathways and terms enriched in our list were selected and reorganized into categories used to determine processes associated with well-defined gene sets. Leveraging these categories, we can dissect mitochondrial processes at different levels of specificity. Our tool uses this information to perform single-transcriptome analysis from the gene expression of samples, cells, or spots.
Here, we provide a new tool that helps inspect and dissect mitochondrial activity. It is a valuable instrument for studying mitochondria and their impact on disease onset and progression, as transcriptomes can now be examined from the mitochondrial point of view and contribute powerfully to clinical studies.
Presentation Overview:
Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e., haplotypes) in one virus population helps study the viruses’ evolution and their interactions with the host/other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length by next-generation sequencing makes complete haplotype reconstruction difficult. In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to automatically learn latent features from aligned reads. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF can achieve highly robust performance on data with different coverage, haplotype number, and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against the state-of-the-art tools on simulated and real sequencing TGS data on different viruses. The results show that HaploDMF competes favorably against all others.
Presentation Overview:
Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community, but correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem. We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora accurately re-assigns reads by first estimating abundances through an expectation-maximization algorithm and then utilizing the abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over-assigning reads to the same genomes. On simulated diverse reads, this allows Mora to achieve F1 scores comparable to other algorithms while requiring less runtime. However, Mora significantly outshines other algorithms on very similar reads. We show that the high penalty for over-assigning reads to a common reference genome allows Mora to accurately infer correct strains for real data in the form of short E. coli reads and long SARS-CoV-2 reads.
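The abundance-estimation step can be illustrated with a minimal expectation-maximization sketch over ambiguous read mappings (hypothetical data; Mora additionally weighs alignment qualities and re-assignment penalties):

```python
def em_abundances(read_candidates, n_iters=50):
    """Estimate genome abundances from ambiguous read mappings.

    read_candidates: for each read, the list of genomes it maps to.
    E-step: split each read among its candidates in proportion to the
    current abundances.  M-step: abundances = normalized totals.
    """
    genomes = sorted({g for cands in read_candidates for g in cands})
    abundance = {g: 1.0 / len(genomes) for g in genomes}
    for _ in range(n_iters):
        totals = {g: 0.0 for g in genomes}
        for cands in read_candidates:
            norm = sum(abundance[g] for g in cands)
            for g in cands:
                totals[g] += abundance[g] / norm
        n_reads = len(read_candidates)
        abundance = {g: t / n_reads for g, t in totals.items()}
    return abundance

# Toy data: strain A gets unique reads, while strain B appears only in
# reads that also map to A -- so EM drives B's abundance toward zero.
reads = [["A"], ["A"], ["A", "B"], ["A", "B"], ["A"]]
ab = em_abundances(reads)
print(ab)
```

With the abundances in hand, a re-assignment step like Mora's can prefer candidate genomes whose estimated abundance supports them, rather than assigning every ambiguous read to the same common reference.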
Presentation Overview:
Although the advent of next-generation sequencing has increased diagnostic success in instances of monogenic disease, analysis of exonic sequences can still only provide a molecular diagnosis in ~20-40% of cases. The functional impact of variants in non-coding regions is more difficult to predict. We have created and evaluated a multi-omics pipeline to increase diagnostic yield by identifying regions of aberrant allelic imbalance in ATAC-seq and RNA-seq from the cell populations that best display the phenotype of an individual patient. As proof of concept, we tested the workflow on two patients with a known non-coding variant associated with familial hemophagocytic lymphohistiocytosis type 3 (FHL3). By filtering variants based on differential accessibility and allelic imbalance in ATAC-seq from patient cells vs. control subsets, in addition to minor allele frequency (MAF) in a population cohort and genomic-region conservation, candidate variants in non-coding regions could be identified and ranked for functional analysis.
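Allelic imbalance at a heterozygous site is commonly assessed with an exact binomial test against a balanced 50/50 null; a minimal sketch with hypothetical read counts (the pipeline's actual filtering combines such a signal with accessibility, MAF, and conservation):

```python
from math import comb

def binom_two_sided_p(ref_count, alt_count):
    """Two-sided exact binomial p-value for allelic imbalance, under
    the null hypothesis of a balanced 50/50 allele ratio: sum the
    probabilities of all outcomes at least as extreme as observed."""
    n = ref_count + alt_count
    p_obs = comb(n, ref_count) * 0.5 ** n
    return min(1.0, sum(comb(n, k) * 0.5 ** n
                        for k in range(n + 1)
                        if comb(n, k) * 0.5 ** n <= p_obs))

# Hypothetical allele-specific read counts over a heterozygous site:
print(binom_two_sided_p(18, 2))  # strong imbalance -> small p-value
print(binom_two_sided_p(11, 9))  # roughly balanced -> large p-value
```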
Presentation Overview:
Transcriptional rewiring is a fundamental process in the adaptation of pathogens, yet previous studies of Salmonella have disproportionately highlighted coding regions while neglecting the potential role of non-coding mutations and transcriptional rewiring. In this study, we re-implemented the iterative comparison of gene co-expression method in R and made it available as a package for future research (https://github.com/Cuypers-Wim/gccR). This package enables the calculation of gene co-expression conservation between two bacterial strains, taking into account technical variation in the dataset, and identifies genes with significantly diverged or conserved co-expression profiles. To illustrate the efficacy of this approach, we investigated the differences in gene co-expression between two Salmonella strains: the commonly used lab strain S. Typhimurium 14028s and the clinical isolate S. Typhimurium D23580, representative of the ST313 strains causing bloodstream infections in sub-Saharan Africa. Our results indicate little overall divergence in gene co-expression between the strains, with high conservation among genes linked to translation processes. However, we observed lower conservation for genes linked to colonizing the gut than for genes required for intracellular replication and survival, potentially reflecting the more host-adapted properties of S. Typhimurium D23580.
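Co-expression conservation can be sketched as follows: each gene's co-expression profile is its vector of correlations with all other genes, and conservation is the correlation of that profile between the two strains. The toy matrices below are hypothetical (strain 2 simply rescales strain 1, so conservation is perfect); gccR additionally corrects for technical variation:

```python
def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def coexpression_profile(expr, gene):
    """Correlation of `gene` with every other gene across conditions."""
    return [pearson(expr[gene], expr[g]) for g in sorted(expr) if g != gene]

def conservation(expr1, expr2, gene):
    """Correlate a gene's co-expression profile between two strains."""
    return pearson(coexpression_profile(expr1, gene),
                   coexpression_profile(expr2, gene))

# Toy expression matrices (gene -> values across conditions); strain 2
# doubles every value, preserving all pairwise correlations exactly.
strain1 = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8],
           "C": [4, 3, 2, 1], "D": [1, 3, 2, 4]}
strain2 = {g: [2 * v for v in vals] for g, vals in strain1.items()}
print(conservation(strain1, strain2, "A"))  # ~1: fully conserved
```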
Presentation Overview:
Large datasets containing different omics are increasingly available in public databases. However, capturing all the information contained in these data is a major challenge. Since the different omics are biologically related, it is fundamental to statistically detect their interplay with models that take all the omics into account and identify the key molecular players involved. To address this need, we provide an easy-to-use R package for omics data integration.
Our package is designed to detect the association between the expression of a target and its regulators while taking into account their genomic modifications, such as copy number variations and methylation. In some cases, the number of regulators for a given target can be very high. To handle this eventuality, we provide a penalized model that automatically keeps only the most important regulators. We are also evaluating the possibility of adding more models for integration and expanding the package to single-cell data. The package also provides functions for visualizing results, making model interpretation straightforward.
Our package provides a solid and easy-to-use way to solve the problem of multi-omics integration while allowing the detection and visualization of their interplay.
Presentation Overview:
While the primary focus of many RNA-seq applications is to estimate gene expression levels, a crucial first step in assessing gene activity is to distinguish technical or biological transcriptional noise from actively expressed genes. Typically, this is accomplished by setting an arbitrary abundance threshold. However, the use of a fixed abundance threshold often leads to either a loss of information or an increase in false positives. To overcome these limitations, we propose an updated approach for the Bgee database. We identify a set of genes that are confidently non-expressed in each library using reads mapped to curated intergenic regions, and from their distribution we compute a p-value. Additionally, we introduce a weighted merging function to aggregate per-gene expression call signals from multiple libraries into a single presence/absence call. Its accuracy in determining the true state of genes outperforms other existing methods when compared to reference sets in three distinct species. This approach can be applied to bulk as well as single-cell RNA-seq. We also show that our method yields considerably fewer false positives when classifying genes carried by sex chromosomes. Overall, this novel approach has the potential to enhance the accuracy of the set of genes accessible to differential expression analysis across conditions.
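The intergenic-null idea can be sketched as a simple empirical p-value: the probability of seeing a gene's abundance or higher among the abundances of curated intergenic regions (hypothetical TPM values below; Bgee's actual procedure models the intergenic distribution per library):

```python
def empirical_pvalue(gene_tpm, intergenic_tpms):
    """P(abundance >= gene_tpm under the intergenic null), with the
    standard +1 correction so the p-value is never exactly zero."""
    exceed = sum(1 for x in intergenic_tpms if x >= gene_tpm)
    return (exceed + 1) / (len(intergenic_tpms) + 1)

# Hypothetical abundances (TPM) of curated intergenic regions:
null = [0.0, 0.1, 0.05, 0.2, 0.0, 0.3, 0.15, 0.02, 0.08]

print(empirical_pvalue(5.0, null))   # smallest attainable p for 9 nulls
print(empirical_pvalue(0.05, null))  # within the noise -> large p
```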
Presentation Overview:
Nucleoside analogues like 4-thiouridine (4sU) are used to metabolically label newly synthesized RNA. Chemical conversion of 4sU before sequencing induces T-to-C mismatches in reads sequenced from labelled RNA, making it possible to obtain total and labelled RNA expression profiles from a single sequencing library. Cytotoxicity due to extended labeling periods or high 4sU concentrations has been described, but the effects of extensive 4sU labeling on expression estimates from nucleotide conversion RNA-seq have not been studied. Here, we performed nucleotide conversion RNA-seq with escalating doses of 4sU with short-term labeling (1h) and over a progressive time course (up to 2h) in different cell lines. With high concentrations or at later time points, expression estimates were biased in an RNA half-life dependent manner. We show that this bias arose from a combination of reduced mappability of reads carrying multiple conversions and a global, unspecific underrepresentation of labelled RNA due to impaired reverse transcription efficiency and a potentially global reduction of RNA synthesis. We developed a computational tool to rescue unmappable reads, which performed favourably compared to previous read mappers, and a statistical method, which could fully remove the remaining bias.
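To illustrate the kind of model such analyses build on (a sketch in the spirit of binomial-mixture estimators; not the authors' actual tool, and all rates and data below are illustrative), the labelled fraction can be estimated by EM from per-read counts of T positions and T-to-C conversions, assuming known conversion and sequencing-error rates:

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def estimate_new_fraction(reads, p_conv=0.05, p_err=1e-3, iters=100):
    """EM estimate of the labelled (newly synthesized) RNA fraction from
    per-read (n_T_positions, n_TtoC_conversions) tuples, modeled as a
    two-component binomial mixture with known conversion/error rates."""
    pi = 0.5
    for _ in range(iters):
        post = 0.0
        for n, k in reads:
            new = pi * binom_pmf(k, n, p_conv)
            old = (1 - pi) * binom_pmf(k, n, p_err)
            post += new / (new + old)  # posterior P(read is labelled)
        pi = post / len(reads)
    return pi

# Toy data consistent with roughly half labelled RNA (20 T positions per read)
reads = [(20, 0)] * 67 + [(20, 1)] * 20 + [(20, 2)] * 9 + [(20, 3)] * 4
pi_hat = estimate_new_fraction(reads)
```

Note that many labelled reads carry zero conversions at these rates, which is exactly why a mixture model is needed rather than simply counting converted reads.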
Presentation Overview:
Pangenomic alignment has emerged as an opportunity to reduce bias in biomedical research. Traditionally, short-read aligners---such as Bowtie and BWA---were used to index a single reference genome, which was then used to find approximate alignments of reads to that genome. Unfortunately, these methods can only index a small number of genomes. Moni, an emerging pangenomic aligner, uses a preprocessing technique called prefix-free parsing to build a dictionary and a parse from the input---these, in turn, are used to build the main run-length encoded BWT (RLBWT) and suffix array of the input. This is accomplished in space linear in the size of the dictionary and parse. Therein lies the open problem that we tackle in this paper: although the dictionary scales sub-linearly with the size of the input, the parse becomes orders of magnitude larger than the dictionary. To scale the construction of Moni, we need to remove the parse from the construction of the RLBWT and suffix array. We solve this problem and demonstrate that this improves the construction time and memory requirements, allowing us to build the RLBWT and suffix array for 1000 diploid human haplotypes from the 1000 Genomes Project using less than 600GB of memory.
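A toy version of prefix-free parsing helps make the dictionary/parse distinction concrete (illustrative only; real implementations use rolling Karp-Rabin hashes and sentinel symbols). The text is split at "trigger" windows chosen by a hash, so consecutive phrases overlap by the window length and repeated phrases are stored once in the dictionary:

```python
def prefix_free_parse(text, w=4, p=5):
    """Toy prefix-free parsing: split `text` into overlapping phrases at
    trigger windows of length w whose hash is 0 mod p. Returns the
    dictionary of distinct phrases and the parse (sequence of phrase ids).
    A tiny deterministic hash stands in for a rolling Karp-Rabin hash."""
    def h(s):
        v = 0
        for c in s:
            v = (v * 31 + ord(c)) % 1000003
        return v
    triggers = [i for i in range(len(text) - w + 1) if h(text[i:i+w]) % p == 0]
    bounds = [0] + triggers + [len(text)]
    phrases = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        # each phrase runs up to and including the next trigger window
        phrases.append(text[a:min(b + w, len(text))])
    dictionary = sorted(set(phrases))
    parse = [dictionary.index(ph) for ph in phrases]
    return dictionary, parse

# On repetitive text the parse reuses dictionary entries; consecutive
# phrases overlap by w characters, so the text can be reconstructed.
text = "GATTACAGATTACACATGATTACAGGT" * 3
d, parse = prefix_free_parse(text, w=4, p=5)
phrases = [d[i] for i in parse]
recon = phrases[0] + "".join(ph[4:] for ph in phrases[1:])
```

On highly repetitive input (such as many haplotypes of the same genome) the dictionary stays small while the parse grows with the input, which is precisely the scaling problem described above.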
Presentation Overview:
To date, no methods are available for the targeted identification of genomic regions with differences in sequencing read distributions between two conditions. Existing approaches either cannot be targeted to genomic regions of interest, only determine changes in total read numbers, require a predefined subdivision of input regions, or average across multiple input regions. Here, we present RegCFinder, which automatically identifies subregions of input windows with differences in read density between two conditions. For this purpose, the problem is defined as an instance of the well-established all maximal scoring subsequences problem, which can be solved in linear time. Subsequently, statistical significance and relative use of these regions within the input windows are determined with DEXSeq. RegCFinder allows flexible definition of input windows, making it possible to target the analysis to any regions of interest, e.g. promoter regions, gene bodies, or peak regions. Furthermore, any type of sequencing data can be used as input; thus, RegCFinder lends itself to a wide range of applications and biological questions. We illustrate the usefulness of RegCFinder in two applications. In both cases, we confirm previous observations regarding changes in read distributions and also identify interesting novel subgroups of genes with distinctive changes.
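For reference, the all maximal scoring subsequences problem admits the linear-time algorithm of Ruzzo and Tompa (1999). Below is our own compact sketch of it (with a naive inner scan in place of the paper's amortized-constant search), which finds all disjoint maximal positive-scoring runs in one pass:

```python
def max_scoring_subsequences(scores):
    """Ruzzo-Tompa-style algorithm: all disjoint maximal scoring
    subsequences of a list of real scores. Returns (start, end, score)
    triples with `end` exclusive."""
    I = []  # stack of [start, end, L, R]: cumulative totals before/after
    cum = 0.0
    for i, x in enumerate(scores):
        if x > 0:
            cand = [i, i + 1, cum, cum + x]
            while True:
                # rightmost j with L_j < L_cand (naive scan in this sketch)
                j = None
                for t in range(len(I) - 1, -1, -1):
                    if I[t][2] < cand[2]:
                        j = t
                        break
                if j is None or I[j][3] >= cand[3]:
                    I.append(cand)
                    break
                # merge I[j..] with the candidate and retry
                cand = [I[j][0], cand[1], I[j][2], cand[3]]
                del I[j:]
        cum += x
    return [(s, e, R - L) for s, e, L, R in I]

# Example scores, e.g. per-position log-ratios of read densities
res = max_scoring_subsequences([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5])
```

In the RegCFinder setting the per-position scores would be derived from the two conditions' read densities, so each reported subsequence is a candidate subregion with a shifted read distribution.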
Presentation Overview:
Advancements in multiplexing techniques have enabled the application of single-cell genomic methods to comprehensively study the effects of high-throughput perturbation experiments at whole-embryo scale. Such analyses aim to pinpoint key genes, cell types, and signaling pathways that control cell fate decisions during development. However, there is a lack of statistically principled tools for measuring how cell types shift after perturbations (genetic, chemical, or environmental) and identifying which genes regulate those transitions. Hooke is a new software package that uses Poisson-Lognormal models to perform differential analysis of cell abundances for perturbation experiments read out by single-cell RNA-seq. This versatile framework allows users to both 1) perform multivariate statistical regression to describe how perturbations alter the relative abundances of each cell state and 2) describe how all pairs of states co-vary as a parsimonious network of partial correlations. To demonstrate Hooke’s utility, we analyzed a single-cell atlas of zebrafish organogenesis that includes wild-type and genetic perturbations at whole-embryo scale across multiple time points. With this method, we identified novel genetic requirements for relatively rare cell types in the embryonic kidney. Hooke will be available as an open-source R package and will enable users to dissect genetic dependencies in single-cell perturbation experiments.
Presentation Overview:
Recent benchmarks of structural variant (SV) detection tools revealed that most medium-range (50-10,000 bp) SVs cannot be resolved with short-read sequencing, whereas long-read SV callers achieve great results. However, long-read sequencing has a higher cost and requires more input DNA. Lowering sequencing coverage reduces cost, but long-read SV callers perform poorly with coverage below 10×. Synthetic long-read (SLR) technologies have great potential for SV detection, though their long-range information has been hard to utilize for events shorter than 50 kbp.
We propose a novel integrated alignment- and local-assembly-based algorithm, Blackbird, that uses SLR together with low-coverage long reads to improve the detection of challenging medium-size events. Without the need for a whole-genome assembly, Blackbird uses barcode information encoded in SLR to accurately assemble small segments and uses long reads to improve the assembly.
We evaluated Blackbird on simulated and real human genome datasets. Using the HG002 GIAB callset, we showed that in hybrid mode Blackbird achieves results comparable to state-of-the-art long-read tools at significantly lower long-read coverage: it requires only 5× to achieve F1 scores similar to those of PBSV and Sniffles2 at 10× long-read coverage. Additionally, Blackbird in SLR-only mode is more sensitive than popular short-read methods.
Presentation Overview:
Correct structural variant (SV) detection is integral to rare disease diagnostics and cancer genomics, as well as to the study of genome variation in populations.
Existing methods for structural variant detection with long reads usually detect SVs from abstractions of SV signals in read-to-reference alignments. Typical problems are the detection of very long or complex SVs, as well as SVs within difficult regions such as repeats or low-complexity regions.
We propose a novel method in which we assemble parts of reads that are aligned to SV candidate regions in order to resolve such difficult loci or reconstruct complex SVs. The great advantage we try to exploit is that the depth of coverage can be significantly lower than would be necessary for de novo assembly, while difficult regions can still be resolved better than with the initial alignments. This has become feasible since the per-base error rates of available long-read technologies have decreased dramatically in the past four years. The challenges are to identify the regions correctly and to determine the correct number of haplotypes.
Presentation Overview:
In this study, we conducted whole-exome sequencing to investigate the origins and inheritance patterns of de novo mutations in pediatric cancer patients. We analyzed a cohort of 280 patients using Illumina short-read trio exome sequencing and identified variants using GATK/VarScan pipelines. Among the high-impact SNP/indel variants, 0.36% were identified as de novo. Our analysis revealed that 22.5% of the patients had three de novo mutations each, with some individuals having 28. Interestingly, 5.71% of patients did not exhibit any de novo mutations. The estimated rate of exonic de novo sequence variants in our study was 6.67×10⁻⁸ per generation. We observed a positive correlation between paternal age and the number of de novo mutations, while maternal age did not show a significant effect. Further investigation focused on Tier 1-3 genes, encompassing 364 genes. Mutations in these genes were identified in 6.78% of the cohort, and pathogenic variants were found in the TP53, SOS1, PTPN11 and MSH6 genes, potentially associated with leukemia (BCP-ALL) and glioblastoma. Moreover, our analysis revealed a higher prevalence of de novo mutations within CpG regions compared to non-CpG regions. By performing read-based haplotype phasing, we determined that 74.28% of phased de novo mutations originated from the paternal lineage, while 25.7% originated from the maternal lineage.
Presentation Overview:
Population count data derived from high-throughput sequencing experiments provides valuable information about the composition of biological environments such as cell populations or microbial communities. However, accurate statistical modeling of such data is challenging due to its high dimensionality, excessive number of zeros, correlation of features, and compositional constraints. The recent class of compositional power interaction models (PIMs; Yu et al., 2021) accommodates these properties and can be optimized efficiently through score matching methods.
In our work, we use PIMs to model covariate influence on the data composition and subsequently perform differential testing. We extend the score matching estimator to include latent effect variables and derive an extended parameter optimization scheme that selects the ideal power transform for accurate data representation without needing zero replacement strategies. We demonstrate that PIMs are better suited than other popular distributional methods to describe real and simulated high-throughput sequencing data with correlated features while requiring very low computational resources.
We further evaluate the model’s ability to discover significant pairwise interactions between features and showcase its flexibility through applications to blood cell compositions obtained through single-cell RNA sequencing, as well as amplicon sequencing data of the human gut microbiome.
Presentation Overview:
The advent of large-scale data (e.g., from biotechnology) has made the development of suitable statistical techniques a cornerstone of modern interdisciplinary research. These data often contain many features but a limited sample size, and are accompanied by experimental noise. A common research question in data-driven observational studies is to determine how features impact a readout of interest. Typically, only a subset of features is relevant, and they may interact in a concerted fashion. Thus, a major concern is to identify these relevant effects from a large number of possible combinations of features. To address this, we propose a robust statistical workflow to recover interactions in the data-scarce regime. Our multi-stage approach uses a lasso model for hierarchical interactions combined with stability-based model selection in a replicate-consistent workflow. We demonstrate its superior performance compared to state-of-the-art techniques using synthetic data and show its wide applicability in a number of different biological applications, including histone modification-protein interactions and combinatorial drug effects on cell morphological features.
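The flavour of such a workflow can be sketched as follows: a toy stand-in with a plain coordinate-descent lasso over main and pairwise interaction features, plus subsampling-based selection frequencies. This is our own simplified illustration; the actual workflow uses a hierarchical interaction lasso and replicate-consistent selection.

```python
import numpy as np

def lasso_cd(X, y, lam, iters=100):
    """Plain coordinate-descent lasso (columns roughly standardized)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return beta

def stability_selection(X, y, lam, n_sub=30, frac=0.5, seed=0):
    """Selection frequency of each feature over random subsamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    freq = np.zeros(p)
    for _ in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        freq += lasso_cd(X[idx], y[idx], lam) != 0
    return freq / n_sub

# Toy data: the readout depends on x0, x1 and the interaction x0*x1
rng = np.random.default_rng(1)
n, p = 200, 6
X = rng.normal(size=(n, p))
pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
XI = np.column_stack([X] + [X[:, i] * X[:, j] for i, j in pairs])
y = 2 * X[:, 0] - 1.5 * X[:, 1] + 2 * X[:, 0] * X[:, 1] + 0.3 * rng.normal(size=n)
freq = stability_selection(XI, y, lam=20.0)
```

Features that are selected across nearly all subsamples (here x0, x1, and their interaction, column 6 of `XI`) are the stable effects; spurious features are selected only sporadically.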
Presentation Overview:
Single-cell ATAC-sequencing (scATAC-seq) is a powerful technique for studying chromatin regulation at the single-cell level. Typically, scATAC-seq data is binarized to indicate open chromatin regions, but the implications of this binarization are not well-understood. In this study, we demonstrate that a quantitative treatment of scATAC-seq data improves the goodness-of-fit of existing models and their applications, including clustering, cell type identification, and batch integration. Our contribution is twofold. First, we show that fragment counts, but not read counts, can be modeled using standard count distributions. Second, we compare the effects of binarization versus a count-based model (PoissonVAE) on scATAC-seq data using publicly available datasets and highlight the biological effects that are missed by a binary treatment. We show that high count peaks in scATAC-seq data correspond to important regulatory regions such as super-enhancers and highly transcribed promoters, similar to observations in bulk ATAC-seq data. Furthermore, we demonstrate that fragment counts in promoter regions correlate with gene expression, emphasizing a quantitative signal in promoter accessibility. Our results have significant implications for scATAC-seq analysis, suggesting that handling the data quantitatively can improve the accuracy of machine learning models used for investigating single-cell regulation.
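One way to see why the fragment/read distinction matters (a toy simulation, assuming each sequenced fragment contributes exactly two reads): doubling a Poisson variable doubles its variance-to-mean ratio and yields only even values, so read counts cannot follow a standard count distribution even when fragment counts do.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 1.3
fragments = rng.poisson(lam, size=100_000)  # per-cell fragment counts in a peak
reads = 2 * fragments                       # paired-end: two reads per fragment

frag_ratio = fragments.var() / fragments.mean()  # ~1: Poisson-like
read_ratio = reads.var() / reads.mean()          # ~2: overdispersed, even-only
```

This is why the fragment count matrix, not the read count matrix, is the natural input for count-based models such as a Poisson VAE.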
Presentation Overview:
Studying biological sequences typically involves using a reference genome, but obtaining accurate assemblies from sequencing data can be challenging due to genomic repeats, errors, and biases.
Hence, working directly with the raw data output by sequencers, without pre-processing, can be preferable. Our objective is to develop multifaceted indexes able to identify the reads containing a specific k-mer in a given dataset. Popular indexes, dubbed colored de Bruijn graphs, associate each k-mer with its datasets of origin among thousands of datasets; however, they are not able to index each read separately.
To address this challenge, we present K2R, which leverages redundancy in the data to limit memory usage. Specifically, we use super-k-mers to reduce the number of entries in our structures and employ the concept of color to minimize the memory impact of repetitive k-mer data. We present the main results obtained by comparing K2R with state-of-the-art methods such as hashing methods (e.g., Short Read Connector) and full-text indexing (e.g., r-index) in terms of memory impact, throughput, and construction and query time. We compare the performance of the tools across varying coverage levels and error rates to evaluate their advantages, disadvantages and respective comfort zones.
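The super-k-mer idea can be sketched in a few lines (toy code with a lexicographic minimizer; real tools typically use canonical k-mers and hashed minimizers): consecutive k-mers sharing a minimizer are stored once as a single, longer string, which is what reduces the number of index entries.

```python
def minimizer(kmer, m):
    """Lexicographically smallest m-mer of a k-mer (toy minimizer)."""
    return min(kmer[i:i+m] for i in range(len(kmer) - m + 1))

def super_kmers(seq, k, m):
    """Split a read into super-k-mers: maximal substrings whose
    constituent k-mers all share the same minimizer."""
    out = []
    start, cur = 0, None
    for i in range(len(seq) - k + 1):
        mz = minimizer(seq[i:i+k], m)
        if cur is None:
            cur = mz
        elif mz != cur:
            out.append(seq[start:i - 1 + k])  # close previous super-k-mer
            start, cur = i, mz
    out.append(seq[start:])
    return out

seq = "ACGTACGTTGCAACGGT"
sks = super_kmers(seq, k=7, m=3)
```

Every k-mer of the read appears in exactly one super-k-mer, and the total stored length is much smaller than storing each k-mer separately.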
Presentation Overview:
Integration of heterogeneous datasets generated through advanced single-cell sequencing techniques enables an improved understanding of the cellular states and expression programs underlying complex biological systems. Hence, integration methods focus on the removal of complex, nested batch effects arising from heterogeneity in samples generated across tissue locations, time points and conditions. However, it is equally important to conserve the biology while improving batch correction. The goal is to leverage any available cell type annotation for an improved integration. Here we present a supervised data integration framework called scDREAMER-Sup that employs a novel adversarial hierarchical variational autoencoder with two neural network classifiers for improved bio-conservation and batch correction, respectively. We evaluated the performance of scDREAMER-Sup on 5 challenging real datasets consisting of ~1 million cells and multiple cell subtypes. We further introduced a semi-supervised setting to address the challenge of missing cell type annotations in integration tasks, experimenting with 10%, 20% and 50% of cell type annotations missing. We compared scDREAMER-Sup’s performance against that of scANVI and scGEN, the top-performing state-of-the-art methods that utilize cell type labels, under both supervised and semi-supervised settings, and demonstrated that scDREAMER-Sup significantly outperforms the other methods, with an overall improvement of 36%-48% in the combined composite score.
Presentation Overview:
Exclusion regions are sections of reference genomes with abnormal pileups of short sequencing reads. Removing reads overlapping them improves the biological signal, and these benefits are most pronounced in differential analysis settings. Several labs have created exclusion region sets, available primarily through ENCODE and GitHub. However, the variety of exclusion sets creates uncertainty about which sets to use. Furthermore, gap regions (e.g., centromeres, telomeres, short arms) create additional considerations in generating exclusion sets. We generated exclusion sets for the latest human T2T-CHM13 and mouse GRCm39 genomes and systematically assembled and annotated these and other sets in the 'excluderanges' R/Bioconductor data package, also accessible via the BEDbase.org API. The package provides unified access to 82 systematically annotated GenomicRanges objects covering six organisms, multiple genome assemblies and several types of exclusion regions. For the human hg38 genome assembly, we recommend 'hg38.Kundaje.GRCh38_unified_blacklist' as the most well-curated and annotated set, and the sets generated by the Blacklist tool for other organisms. Package website: https://bioconductor.org/packages/excluderanges/, https://dozmorovlab.github.io/excluderanges/
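Applying an exclusion set reduces to an interval-overlap filter; a minimal sketch with half-open intervals is below (in practice GenomicRanges- or bedtools-style overlap machinery is used instead of this naive scan):

```python
def filter_excluded(reads, exclude):
    """Drop reads overlapping any exclusion region. Reads and regions
    are (chrom, start, end) tuples with half-open coordinates."""
    kept = []
    for chrom, start, end in reads:
        hits = any(c == chrom and start < e and s < end
                   for c, s, e in exclude)
        if not hits:
            kept.append((chrom, start, end))
    return kept

exclude = [("chr1", 1000, 2000), ("chr2", 500, 800)]
reads = [("chr1", 900, 1001),   # overlaps an excluded region -> dropped
         ("chr1", 2000, 2100),  # abuts but does not overlap -> kept
         ("chr2", 100, 200),    # no overlap -> kept
         ("chr2", 790, 900)]    # overlaps -> dropped
kept = filter_excluded(reads, exclude)
```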
Presentation Overview:
Clustered regularly interspaced short palindromic repeats (CRISPR)-based genetic perturbation screens are a powerful tool to probe gene function. However, experimental noise, especially for lowly expressed genes, needs to be accounted for to maintain proper control of the false positive rate. We developed a statistical method, named CRISPR screen with Expression Data Analysis (CEDA), to integrate gene expression profiles and CRISPR screen data for identifying essential genes. CEDA stratifies genes based on expression level and adopts a three-component mixture model for the log-fold change of single-guide RNAs (sgRNAs). An empirical Bayesian prior and the expectation-maximization algorithm are used for parameter estimation and false discovery rate inference. Taking advantage of gene expression data, CEDA identifies essential genes with higher expression. Compared to existing methods, CEDA shows comparable reliability but higher sensitivity in detecting essential genes with moderate sgRNA fold changes. Therefore, using the same CRISPR data, CEDA generates an additional hit gene list.
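As a rough illustration of the mixture idea (a plain three-component Gaussian EM on simulated log-fold changes; CEDA's actual model additionally uses expression stratification and an empirical Bayesian prior), the depleted/neutral/enriched components can be recovered as follows:

```python
import numpy as np

def em_3_gaussians(x, iters=200):
    """EM for a three-component Gaussian mixture on sgRNA log-fold
    changes (depleted / neutral / enriched). Returns weights, means,
    standard deviations and per-observation posteriors."""
    mu = np.array([x.min(), 0.0, x.max()])  # rough, well-separated init
    sd = np.full(3, x.std())
    w = np.full(3, 1 / 3)
    for _ in range(iters):
        # E-step: responsibility of each component for each sgRNA
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, sds
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sd, resp

# Toy LFCs: 20% depleted near -3, 70% neutral near 0, 10% enriched near +2
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.4, 200),
                    rng.normal(0, 0.4, 700),
                    rng.normal(2, 0.4, 100)])
w, mu, sd, resp = em_3_gaussians(x)
```

The posterior responsibilities (`resp`) are what a screen analysis would feed into FDR estimation for the depleted component.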
Presentation Overview:
Next Generation Sequencing (NGS) panels are routinely used for small variant detection in cancer clinical care. The capture efficiency of these panels is not optimal, resulting in off-target DNA being sequenced alongside on-target DNA. These off-target reads are distributed across the genome, allowing a genome-wide profile to be derived, although the read counts in these regions show significant biases. Here we present a method to correct these biases and generate robust genome-wide copy number profiles.
A variety of capture-based NGS protocols were analysed using our novel computational method and compared with the current gold standard, shallow whole genome sequencing (sWGS), demonstrating that comparable results can be obtained from either sWGS or targeted NGS samples depending on sample purity, preservation method, and read depth.
We then benchmarked our algorithm against other NGS-based copy number extraction methods, showing at least equivalent performance without the need for matched normal tissue or for filtering out noisy data, which are usual drawbacks of other methods in the cancer field. As most clinical sequencing workflows rely on targeted capture gene panels, our method has the potential to enable new biomarker discovery through the addition of robust copy number profiles to these assays.
Presentation Overview:
Differential composition analysis – the identification of cell types that show a statistically significant change in abundance between multiple experimental conditions – is one of the most common tasks in single-cell omic data analysis. However, it remains challenging to perform differential composition analysis in the presence of flexible experimental designs and uncertainty in cell type assignment. Here, we introduce a statistical model and an open-source R package, DCATS, for differential composition analysis based on a beta-binomial regression framework that addresses these challenges. Our empirical evaluation shows that DCATS consistently maintains high sensitivity and specificity compared to state-of-the-art methods.
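The beta-binomial idea can be illustrated with a toy two-condition likelihood-ratio test (our own sketch, assuming a known fixed overdispersion and a one-dimensional search; DCATS's actual regression framework is more general): overdispersion across replicate samples inflates the variance relative to a binomial, so naive proportion tests would be anti-conservative.

```python
from math import lgamma, erfc, sqrt

def bb_loglik(ks, ns, p, phi):
    """Beta-binomial log-likelihood with mean p and concentration phi."""
    a, b = p * phi, (1 - p) * phi
    ll = 0.0
    for k, n in zip(ks, ns):
        ll += (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + lgamma(k + a) + lgamma(n - k + b) - lgamma(n + a + b)
               + lgamma(a + b) - lgamma(a) - lgamma(b))
    return ll

def mle_p(ks, ns, phi):
    """Ternary search for the ML mean proportion (unimodal in p)."""
    lo, hi = 1e-4, 1 - 1e-4
    for _ in range(100):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if bb_loglik(ks, ns, m1, phi) < bb_loglik(ks, ns, m2, phi):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

def diff_composition_test(kA, nA, kB, nB, phi=30.0):
    """LRT for a shift in one cell type's proportion between conditions.
    kX: cells of the type per sample; nX: total cells per sample."""
    ll0 = bb_loglik(kA + kB, nA + nB, mle_p(kA + kB, nA + nB, phi), phi)
    ll1 = (bb_loglik(kA, nA, mle_p(kA, nA, phi), phi)
           + bb_loglik(kB, nB, mle_p(kB, nB, phi), phi))
    stat = max(0.0, 2 * (ll1 - ll0))
    return stat, erfc(sqrt(stat / 2))  # chi-square df=1 p-value

stat_sig, p_sig = diff_composition_test([30, 25, 35], [200] * 3,
                                        [70, 80, 65], [200] * 3)
stat_null, p_null = diff_composition_test([30, 25, 35], [200] * 3,
                                          [28, 33, 30], [200] * 3)
```

A clear proportion shift (15% vs ~36%) is detected, while near-identical compositions are not.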
Presentation Overview:
Short DNA molecules found in blood plasma originate mostly from dying cells, including tumour cells in cancer patients. Genomic information can be extracted from cell-free DNA (cfDNA) in several ways, including the presence of point mutations, cytosine methylation and others. Chromosomal rearrangements and somatic copy number alterations (SCNAs) are key events in many cancer types; they can help in the diagnosis of cancer and are often associated with treatment response. Bioinformatic tools for SCNA detection from shallow whole genome sequencing (sWGS) data are widely used in cancer research. Some of the algorithms are specifically tailored to detect SCNAs in cfDNA rather than in genomic DNA from tumour tissues, and most of them require baseline estimation from a panel of normal samples (PoN) with an a priori absence of SCNAs. PoNs are often distributed within software packages; however, they are generated on datasets that are not always based on cfDNA. Given the specific properties of cfDNA (short fragment length, biased coverage profiles, etc.), we investigate the influence of PoNs generated with various parameters and on different data types. We use samples from the 1000 Genomes Project and cfDNA WGS data to assess the performance of CNAclinic and ichorCNA on sWGS of cfDNA from cancer patients.
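The role of a PoN can be sketched in a few lines (toy numpy simulation; the bin counts, bias shape, and median-based normalization are illustrative, not the actual CNAclinic/ichorCNA processing): systematic coverage bias shared with the normals cancels out, leaving the true copy number signal.

```python
import numpy as np

def pon_baseline(normal_counts):
    """Per-bin baseline from a panel of normals: median of the
    median-centered log2 read counts in each genomic bin."""
    logs = np.log2(normal_counts + 1)
    logs = logs - np.median(logs, axis=1, keepdims=True)  # per-sample centering
    return np.median(logs, axis=0)

def correct_sample(sample_counts, baseline):
    log = np.log2(sample_counts + 1)
    log -= np.median(log)
    return log - baseline

# Toy data: 5 normals, 100 bins; bins 40-49 are systematically over-covered
rng = np.random.default_rng(0)
bias = np.ones(100); bias[40:50] = 1.6
normals = rng.poisson(100 * bias, size=(5, 100))
# tumour sample with a true gain (1.5x) in bins 70-79, on top of the same bias
truth = np.ones(100); truth[70:80] = 1.5
tumour = rng.poisson(100 * bias * truth)

baseline = pon_baseline(normals)
corrected = correct_sample(tumour, baseline)
```

After correction, the technically biased bins are flat while the true gain remains near log2(1.5) ≈ 0.58, which is why the data type used to build the PoN (cfDNA or not) matters.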
Presentation Overview:
CONTEXT: RNA-seq allows us to uncover general molecular differences. However, it does not have enough depth to identify more subtle disparities between different cellular subpopulations. Spatial transcriptomics (ST) is a novel method whose strong point is preserving the spatial location of gene expression; but can we compare different conditions to obtain disease-associated signatures and susceptibility pathways within cell populations using ST? We address this question in autoimmune thyroid diseases (AITD; Hashimoto's thyroiditis, HT, and Graves' disease, GD) versus controls.
METHODOLOGY: 3 HT, 3 GD and 2 control samples were sequenced using Visium ST (10X Genomics). We compared the isolation of cell populations by pathology-based and unsupervised clustering. We then integrated the samples using Harmony, followed by a re-clustering strategy and pseudobulk differential expression analysis of thyrocytes, connective tissue (CT) and vessels separately. Finally, we validated the results using public single-cell (SC) repositories and immunostaining.
RESULTS: We obtained a significant correlation between pathology-based and unsupervised clustering. We revealed damaged epithelial cells in AITD close to infiltration, molecular signatures that correlate with fibroblast subpopulations in CT from HT and GD samples, and differential angiogenic processes inferred from their vessels. SC data, immunostaining and the literature validated the ST results.
CONCLUSIONS: ST is also useful for integrating samples and inferring molecular distinctions in a cellular context across conditions.
Presentation Overview:
In humans, gene transcription is under the control of the promoter, located immediately upstream of a gene, and of enhancers, which are distal regulatory regions. Recent studies suggest the existence of hubs of interactions involving multiple enhancers and promoters. To define the structure of these interacting hubs of chromatin regions at high resolution, we have implemented the Pore-C approach, based on long-read multiway contact nanopore sequencing.
Deshpande et al. (Nature Biotechnology, 2022) proposed dedicated bioinformatics tools to identify multiway contacts and reveal significant cooperativities. We used the tools and results of this study to validate our Pore-C experiments obtained on reference cell lines. We first mapped fragments located in multiway contact reads. Then, we generated virtual pairwise contacts and verified that we reproduced 3D genomic features classically observed in Hi-C maps. We also showed that the range of inter- and intra-chromosomal interactions was very similar. We finally identified significant intra-chromosomal higher-order contacts in regulatory regions using the Chromunity tool described in the paper. We will now generate new Pore-C datasets on different cell types and conditions to explore the role of several proteins involved in chromatin remodeling in the formation of these hubs.
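The expansion of multiway contacts into virtual pairwise contacts is combinatorially simple: each read contacting f fragments yields C(f, 2) pairs. A minimal sketch with integer fragment ids (illustrative; real pipelines work on aligned restriction fragments):

```python
from itertools import combinations

def virtual_pairwise_contacts(multiway_reads):
    """Expand multiway contacts (one list of fragment ids per read)
    into a dict of virtual pairwise contact counts."""
    pairs = {}
    for frags in multiway_reads:
        for a, b in combinations(sorted(set(frags)), 2):
            pairs[(a, b)] = pairs.get((a, b), 0) + 1
    return pairs

# Three reads: two 3-way-and-2-way contacts sharing a pair, one 4-way contact
reads = [[1, 5, 9], [1, 5], [2, 7, 9, 12]]
contacts = virtual_pairwise_contacts(reads)
```

Accumulating these pairs over all reads produces a matrix directly comparable to a Hi-C contact map.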
Presentation Overview:
The mouse reference genome, based on the laboratory strain C57BL/6J, has served as a foundation for improving our understanding of human health and genetics for over twenty years. However, the current mouse reference genome (GRCm39) contains over 170 known gaps and issues, and is missing key features such as telomere and centromere sequences. Here, we combine novel high-molecular-weight DNA extraction methodologies and ultra-long sequencing technologies on mESCs from a C57BL/6J x CAST/EiJ F1 animal to generate two of the most complete reference-quality mouse genome sequences to date. These new T2T mouse assemblies add significant amounts of novel sequence when compared to their respective current reference genomes (over 150Mbp and 250Mbp for C57BL/6J and CAST/EiJ respectively, of which 100Mbp and 150Mbp constitute newly placed telocentric sequences). Our C57BL/6J assembly closes over 95% of the previously unresolved autosomal gaps in GRCm39 with over 12Mbp of novel sequence. Additionally, we show that our new T2T assemblies significantly improve the representation of previously hard-to-assemble regions when compared to the current reference genomes (e.g. PAR, KZFPs). As a result, these assemblies represent a major milestone in the journey towards a fully complete mouse reference genome.
Presentation Overview:
CZ CELLxGENE Discover has released all of its human and mouse single-cell data through its Census (cellxgene-census.readthedocs.io) – a free-to-use service with an API and data that allow for querying its single-cell data corpus directly from Python or R. The API uses a new technology that allows for efficient, low-latency querying. The data are fully standardized and hosted publicly for free access, and they comprise a count matrix of 50 million cells (observations) by more than 60,000 genes (features), accompanied by 11 cell metadata variables (e.g. cell type, tissue, sequencing technology, donor id, etc.) and gene metadata that includes GENCODE-based IDs and gene names. While these data are built from more than 500 datasets, the APIs enable convenient cell- and gene-based filtering to obtain any slice of interest in a matter of seconds. All data can be quickly transformed into numpy, pandas, anndata, Seurat, or base R objects.
Presentation Overview:
Correctly identifying all organisms in an environmental or clinical sample is fundamental to many metagenomic sequencing projects. Over recent years, many tools have been developed that classify short and long sequencing reads by comparing their nucleotide sequences to a predefined set of references. Although these methods already utilize flexible data structures with low memory requirements, the constantly increasing number of reference genomes in the databases poses a major computational challenge to profilers regarding memory usage, index construction and query time. Here, we present Taxor, a fast and space-efficient tool for taxonomic profiling that utilizes hierarchical interleaved XOR filters. Taxor shows a precision of 99.9% for read classification at the species level while retaining a recall of 96.7%, outperforming tools like Kraken2 and Centrifuge in terms of precision by 3-9%. Our benchmarking based on simulated and real data indicates that Taxor accurately performs taxonomic read classification while reducing the index size of the reference database and the memory requirements for querying by a factor of 2-12 compared to other profiling tools.
Presentation Overview:
Advanced prostate cancers comprise distinct phenotypes, but tumor classification remains clinically challenging. Here, we harnessed circulating tumor DNA (ctDNA) to study tumor phenotypes by ascertaining nucleosome positioning patterns associated with transcription regulation. We sequenced plasma ctDNA whole genomes from patient-derived xenografts representing a spectrum of androgen receptor active (ARPC) and neuroendocrine (NEPC) prostate cancers. Nucleosome patterns associated with transcriptional activity were reflected in ctDNA at regions of genes, promoters, histone modifications, transcription factor binding, and accessible chromatin. We identified the activity of key phenotype-defining transcriptional regulators from ctDNA, including AR, ASCL1, HOXB13, HNF4G, and GATA2. To distinguish NEPC and ARPC in patient plasma samples, we developed prediction models that achieved accuracies of 97% for dominant phenotypes and 87% for mixed clinical phenotypes. Although phenotype classification is typically assessed by IHC or transcriptome profiling from tumor biopsies, we demonstrate that ctDNA provides comparable results with diagnostic advantages for precision oncology.
Presentation Overview:
Single-cell multiomics provides an opportunity to comprehend the regulatory relationships across modalities, including transcriptome and regulome. However, this approach is experimentally limited to revealing static snapshots at the time of observation, which hinders our understanding of dynamic state changes orchestrated across modalities. Although RNA velocity addresses this issue by estimating temporal changes in transcriptome, inferring dynamics in other modalities remains challenging.
To overcome this limitation, we develop a deep generative model named mmVelo, which estimates cell-state dependent dynamics across multiple modalities. mmVelo learns the dynamics of cellular states based on spliced and unspliced mRNA counts and projects them onto other modalities, thereby inferring cross-modal dynamics, such as RNA-chromatin accessibility dynamics.
We applied mmVelo to single-cell multiomics data from the developing mouse brain and validated the accuracy of the estimated chromatin accessibility dynamics. Furthermore, we discovered that known lineage-determining transcription factors play a crucial role in regulating chromatin accessibility in mouse skin. Finally, we demonstrated in human brain development that, by using multiomics data as a bridge, the dynamics of other modalities can be inferred from single-modal data through cross-modal generation.
Overall, mmVelo offers a unique advantage in understanding the dynamic interactions between modalities, providing insights into regulatory relationships across molecular layers.
Presentation Overview:
Single-cell RNA sequencing (scRNA-seq) is a powerful tool for characterizing cell types and states. However, it has limitations in measuring changes in gene expression during dynamic biological processes such as differentiation, due to the destruction of cells during analysis. Recent studies combining scRNA-seq with lineage tracing have provided clonal information but still face challenges, such as observations at discrete time points and difficulty in tracking cells within a lineage over the time course, since early observations are not direct ancestors of cells in the same lineage observed at a later time point. To address these issues, we developed Lineage Variational Inference (LineageVI), a model based on the variational autoencoder (VAE) framework, to convert single-cell transcriptome observations with DNA barcoding into latent state dynamics consistent with the clonal relationships by assuming a common ancestor. This model enables us to quantitatively capture cell state transitions. We demonstrate how our model can recapitulate differentiation trajectories in hematopoiesis, learn the underlying dynamics, and estimate backward transitions from later to earlier observations in the latent space. Reconstructing transcriptomes at each time point in each lineage showed an increase in undifferentiated-marker expression and a decrease in differentiation-marker expression toward the inferred ancestors.
Presentation Overview:
Assigning taxonomic labels to metagenomic reads involves a trade-off between specificity and sensitivity, depending on the sequence type employed. DNA-based metagenomic classifiers offer higher specificity by capitalizing on mutations to differentiate closely related taxa. Conversely, amino acid (AA)-based classifiers provide higher sensitivity in detecting homology due to the greater conservation of amino acid sequences.
To resolve this trade-off, we developed Metabuli, based on a novel k-mer structure, the metamer, which simultaneously stores AA and DNA information. Metabuli compares metamers first at the AA level for sensitivity and subsequently at the DNA level for specificity. We compared Metabuli to DNA-based (Kraken2, KrakenUniq, Centrifuge) and AA-based (Kraken2X, Kaiju, MMseqs2 Taxonomy) tools. In an inclusion test, where all 2,382 query subspecies were present in the databases, DNA-based tools classified up to twice as many reads to the correct (sub)species as AA-based tools. However, in an exclusion test, where 367 query species were absent from the databases, AA-based tools showed about twice the sensitivity in genus-level classification.
Only Metabuli achieved state-of-the-art performance in both settings, with species-level precision of ~99% and sensitivity of ~97% in the inclusion test, and precision of ~65% and sensitivity of ~48% in the exclusion test. This demonstrates the robustness of Metabuli across diverse metagenomic study contexts. (metabuli.steineggerlab.com)
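The two-stage idea can be sketched in a few lines. The encoding and rank logic below are simplified illustrations, not Metabuli's actual metamer structure or classification algorithm: matching on the amino-acid level first is the sensitive step, and confirming with DNA is the specific one.

```python
# Illustrative two-stage k-mer comparison in the spirit of Metabuli's metamer:
# match at the amino-acid level first (sensitive), then confirm with DNA
# (specific). This is a toy sketch, not Metabuli's actual data structure.

CODON_TABLE = {
    "TTT": "F", "TTC": "F", "CTG": "L", "ATG": "M",
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
}  # truncated codon table, enough for the example

def metamer(dna):
    """Store the AA translation together with the DNA, like a metamer."""
    aa = "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna), 3))
    return (aa, dna)

def match(query, reference):
    """Stage 1: amino-acid comparison; stage 2: DNA comparison."""
    q_aa, q_dna = metamer(query)
    r_aa, r_dna = metamer(reference)
    if q_aa != r_aa:
        return "no match"                 # AA stage already fails: unrelated
    return "species" if q_dna == r_dna else "genus"  # AA same, DNA decides rank

# Synonymous codons (GCT vs GCC, both Ala) match at the AA but not DNA level:
print(match("ATGGCT", "ATGGCC"))  # genus
print(match("ATGGCT", "ATGGCT"))  # species
```

The DNA stage separates close relatives that the AA stage alone would merge, which is exactly the inclusion-test scenario described above.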
Presentation Overview:
Resistance to therapy in germ cell tumours (GCTs) causes serious complications. Although these processes have been studied, the exact mechanism still requires further clarification. Here, we examined changes in transcriptome profiles between parental cell lines and their cisplatin-resistant derivatives. We studied 7 samples from 6 different parental cell lines and the same number of derived resistant cell lines. Using a standard differential expression analysis with statistical testing in DESeq2, we identified two protein-coding genes, DAZ1 and DAZ2, as differentially expressed, along with two non-coding RNA genes. Gene enrichment analysis with gProfiler2 on a wider set of potentially differentially expressed genes showed upregulation of genes involved in cancers and cancer-related processes (drug metabolism, constitutive signalling by aberrant PI3K). In parallel, we are sequencing additional samples and performing DNA-based variant analysis. This research aims to help us understand the mechanism behind cisplatin resistance in GCTs and potentially improve treatment. This work was supported by the Slovak Research and Development Agency under Contract no. APVV-20-0158 and with the support of the OP Integrated Infrastructure for the project with the code ITMS: 313011AVH7, co-financed by the European Regional Development Fund.
Presentation Overview:
The importance of somatic mutations in cancer has led to the development of various somatic mutation callers, yet benchmarking such callers needs further standardization through extensive ground truth data, especially for low-frequency mutations. While consortium efforts such as TCGA-MC3, PCAWG-Pilot63, and SEQC2 address this by providing orthogonal deep sequencing data for a subset of called mutations, they do not cover the entire mutational spectrum, and for patient-derived cohorts additional orthogonal sequencing is not possible due to the lack of sufficient DNA material. Here we present a cell-line-based data set that contains matched whole-exome sequencing data with two technical replicates for tumor and normal DNA of three melanoma and pancreatic cancer cell lines. All mutations called by Mutect2, Strelka2, and our AI-based variant caller, VariantMedium, were deep sequenced using Illumina MiSeq (mean coverage = 34,870X). The data set contains a total of 1,395 variants, consisting of 1,222 SNVs and 173 indels, with 894 confirmed somatic mutations, 152 confirmed germline mutations, and 349 variants categorized as no mutation. Notably, 21% of all variant candidates have variant allele frequencies below 0.1. With this data set, more extensive benchmarking studies can be performed, specifically for low-frequency mutations.
Presentation Overview:
The size and complexity of single-cell RNA-seq data are increasing, making standard workflows too computationally demanding. Existing single-cell tools neither scale efficiently to such large datasets nor operate out-of-memory. To address this problem, more efficient algorithms and out-of-memory data representations become essential for the analysis.
This work benchmarks SVD algorithms for computing the first 50 principal components of single-cell RNA-seq data, comparing the IRLBA and randomized SVD algorithms as implemented in the R/Bioconductor BiocSingular package and the Python Scanpy and scikit-learn packages. We tested these methods on a real single-cell RNA-seq dataset from 10X Genomics containing approximately 1.3 million cells and 30,000 genes isolated from the mouse brain. We found that randomized PCA was the most memory efficient, requiring 7.05 GB of RAM, while IRLBA PCA took 7.48 minutes to compute the top 50 principal components using a single core. This benchmark provides a useful guideline for finding the best trade-off between time and memory consumption and for observing how computational times and costs change using GPU-based rather than CPU-based pipelines.
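The two algorithm families being compared can be illustrated on a small synthetic matrix (a low-rank stand-in, not the 10X dataset): `scipy.sparse.linalg.svds` is an ARPACK/Lanczos solver, the same family IRLBA belongs to, while the hand-rolled function below follows the Halko et al. randomized SVD scheme.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
# Low-rank synthetic stand-in for a centered expression matrix (not real data).
U0, _ = np.linalg.qr(rng.standard_normal((1000, 60)))
V0, _ = np.linalg.qr(rng.standard_normal((300, 60)))
X = U0 @ np.diag(np.linspace(50.0, 5.0, 60)) @ V0.T \
    + 0.01 * rng.standard_normal((1000, 300))

def randomized_svd(A, k, n_oversample=10, n_iter=4):
    """Randomized range finder, then an exact SVD of the projected matrix."""
    P = rng.standard_normal((A.shape[1], k + n_oversample))
    Y = A @ P
    for _ in range(n_iter):              # power iterations sharpen the basis
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

k = 50
_, s_rand, _ = randomized_svd(X, k)
_, s_arpack, _ = svds(X, k=k)            # ARPACK returns ascending order
s_arpack = s_arpack[::-1]

print("leading singular values agree:", np.allclose(s_rand, s_arpack, rtol=1e-2))
```

The randomized solver trades a small, controllable accuracy loss for a single pass over the matrix plus a few power iterations, which is why it tends to win on memory for matrices the size of the benchmark's.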
Presentation Overview:
Upon cell death, short cell-free DNA (cfDNA) fragments are released into the circulatory system, allowing a non-invasive view of various cellular processes in tissues. Of particular interest are tumor-derived cfDNA (ctDNA) fragments, which appear at very low abundance. Recently, “fragmentomic” features, including cfDNA fragment length distribution, end motifs, and genomic location, have been utilized to distinguish between ctDNA and normal cfDNA. As has been shown, different methods for library preparation and alignment often bias these features. In this study, we investigated how to properly analyze Illumina-sequenced cfDNA data for optimal accuracy.
Five library preparation methods were applied to cfDNA from 10 healthy individuals. Shallow whole-genome sequencing (sWGS) was performed, and multiple aligners were compared to quantify variation in the results. We also aligned both trimmed and untrimmed FASTQ files to test whether soft-clipping could bias the feature calls. The biases in features inferred using different tools will be reported.
Trim Galore and Trimmomatic do not trim 5’ adapters and may not be suitable for library preparations that include 5’ manipulation. BWA-MEM assumes fragment lengths are normally distributed, whereas in cfDNA both mono- and di-nucleosomal fragments are expected, requiring specific parameter settings. We investigated and recommend a set of tools that yield the expected and robust fragment length distribution.
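The expected shape of the length profile can be checked with a short script. The fragment lengths here are simulated around the known mono- (~167 bp) and di-nucleosomal (~334 bp) modes rather than read from a BAM, where they would come from the template lengths (TLEN) of aligned read pairs:

```python
import numpy as np

# Simulated cfDNA fragment lengths with the expected bimodal structure.
rng = np.random.default_rng(0)
lengths = np.concatenate([
    rng.normal(167, 10, 8000),   # mono-nucleosomal fragments
    rng.normal(334, 15, 2000),   # di-nucleosomal fragments
]).round().astype(int)

mono = ((lengths >= 140) & (lengths <= 200)).mean()
di = ((lengths >= 300) & (lengths <= 370)).mean()
print(f"mono-nucleosomal fraction: {mono:.2f}, di-nucleosomal fraction: {di:.2f}")
# An aligner whose insert-size prior suppresses long pairs would deplete `di`.
```

A pipeline whose trimming or alignment settings erase the di-nucleosomal mass is producing the kind of bias the study describes.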
Presentation Overview:
We present the Spanish polygenic risk score (PRS) reference distribution, a database for the Spanish population consisting of 3,124 PRS distributions for common diseases and quantitative traits. The reference includes PRSs for various types of cancer; disorders of the digestive, cardiovascular, neuronal, and immune systems; and quantitative traits such as hematological and anthropometric measurements. The distributions can be explored at http://csvs.clinbioinfosspa.es/?tab=prs.
We released the pipeline we used to preprocess, phase, and impute samples in our reference cohort. This makes it possible to compute PRSs for external genomes and exomes, which can then be compared to a specific reference distribution in a standardized manner. Our pipeline is designed to handle large cohorts in parallel and can be run on local or cloud-based infrastructures.
The use of these resources can assist in selecting the most suitable PRS, determining the relative risk for patient stratification, and calibrating absolute risk values of PRS for the Spanish population. This can aid in the incorporation of PRS in the Spanish healthcare system. Furthermore, this approach can be applied to establish population-specific PRS distributions for other populations, facilitating the adoption of PRS in healthcare systems worldwide.
Presentation Overview:
Oxford Nanopore sequencing allows the simultaneous capture of long sequencing reads and rich signal data, facilitating epigenetic modification detection and genome assembly.
Data simulation methods can be used to create large datasets for developing and benchmarking computational methods, and to augment existing datasets to train robust machine learning models. The use of simulated data can alleviate the need to generate costly real-world data and avoid data privacy issues. However, simulated data must accurately resemble real data to be useful.
Here we propose NanoDS (Nanopore Distribution Simulator), a probabilistic deep learning method for simulating nanopore sequencing data. NanoDS predicts the parameters that describe the distributions of signal level and event length (i.e., the number of signal measurements) for each k-mer in a DNA sequence. We model signal levels and event lengths with Gaussian and k-inflated negative binomial distributions, respectively.
NanoDS was trained using two datasets from PCR-amplified DNA samples. We validated NanoDS by basecalling the generated signals with a state-of-the-art basecaller. We also compared NanoDS to other simulation methods proposed in the literature. These experiments showed that NanoDS can accurately simulate nanopore sequencing signals and can be used to generate datasets for developing and benchmarking computational methods.
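The generative side of this noise model is easy to sketch: for each k-mer state, draw an event length from a k-inflated negative binomial and then draw that many Gaussian signal samples. All parameter values below are invented for illustration; in NanoDS they would be predicted by the network.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_kmer(mu, sigma, n, p_nb, pi_k, k, size):
    """Draw `size` events for one k-mer state: signal ~ Normal(mu, sigma);
    event length = k with probability pi_k (the inflated mass), otherwise
    k plus a NegativeBinomial(n, p_nb) number of extra measurements."""
    lengths = k + rng.negative_binomial(n, p_nb, size=size)
    inflated = rng.random(size) < pi_k
    lengths[inflated] = k                       # the "k-inflated" point mass
    signals = [rng.normal(mu, sigma, size=int(l)) for l in lengths]
    return signals, lengths

signals, lengths = simulate_kmer(mu=90.0, sigma=2.5, n=3, p_nb=0.5,
                                 pi_k=0.2, k=2, size=1000)
print("min length:", lengths.min(), " mean length:", round(lengths.mean(), 2))
```

The inflation term guarantees every event has at least k measurements while still placing extra probability mass exactly at k, which a plain negative binomial cannot do.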
Presentation Overview:
Background: Species traditionally considered non-model organisms are increasingly becoming the focus of research; most recently, the Golden Hamster (Mesocricetus auratus), with its human-like COVID-19 lung pathology. In this analysis, we use a mouse cell atlas to characterise bone marrow from M. auratus, using self-assembling manifolds (SAM). The most immediate application of this profile is to pathology. However, for the single-cell community, there was a deeper, more methodological motive: to amplify the power of reference-based mapping.
Results: Bone marrow was extracted from hamster legs and sequenced using the 10X platform. Hamster genes were annotated via reciprocal BLAST against mouse. This facilitated integration with the Tabula Muris Senis, using SAM for reference-based mapping of cell identity (SAMap). Assignments were consistent with, and exceeded the predictive power of, marker-based mapping conducted in parallel, supporting marker conservation from mouse.
Conclusions: Cellular atlases can improve significantly on the assignment of identity in single-cell datasets. Our study recapitulates this finding. Critically, we succeeded in extending this improvement across species, through the use of innovative SAM algorithms. Thus, we spotlight a novel approach for the community to exploit curated cell atlases and transcriptomes, beyond the constraints of their original, model organism; and beyond this coincidental case of the Golden Hamster.
Presentation Overview:
Advancements in technology have made it possible to use RNA sequencing in situ. This enables comprehensive analysis of the entire transcriptome with almost single-cell accuracy while preserving the spatial information of the tissue. The resulting spatially resolved ’omics data present a new analytical challenge for molecular biology: how to leverage the spatial aspect of the data effectively? Since tissues are congregations of intercommunicating cells, identifying local and global patterns of spatial association is imperative to elucidate the processes that underlie tissue function. Analyzing spatial data requires attention to the distinct properties of data with a spatial dimension, which carry their own statistical and inferential considerations. By their nature, the geographical sciences primarily use spatially oriented data and over many years have developed the tools and methods needed to analyse them robustly. Here we discuss the application of a selection of such methods in a biological context, examining a publicly available dataset from the 10X Visium platform.
Presentation Overview:
Minimizers (m-mers where m < k) play a key role in modern methods for efficient searching, mapping, and indexing of long genomic sequences. Grouping k-mers by their minimizers is useful for distributed and parallel processing, as well as for compact encoding using super-k-mers (consecutive k-mers sharing the same minimizer). One way to choose minimizers is to use universal hitting sets (UHS), which select a small set of minimizers such that every k-mer is guaranteed to contain at least one of them. While standard minimizer selection usually achieves a density of 2 selected minimizers per k-mer, UHS-based approaches typically achieve a lower density.
We introduce fractional hitting sets (FHS), which select a fraction of the minimizers uniformly at random, without having to cover every k-mer. By relaxing the universality constraint, we can reduce the density even further than with UHS. We derive a theoretical model for FHS that predicts the expected density and super-k-mer size based on the chosen fraction. We then show that FHS are suitable both for high coverage with reduced density and for small, unbiased coverage with a minimal footprint.
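The core selection rule is simple to state: hash each m-mer and keep it iff the hash lands in a fraction f of the hash space. The sketch below (with blake2b as an illustrative hash, not the paper's construction) shows how super-k-mers arise from the selected minimizers, and how some k-mers are left uncovered once f < 1:

```python
import hashlib

def mhash(mmer):
    """64-bit hash of an m-mer (illustrative choice of hash function)."""
    return int.from_bytes(
        hashlib.blake2b(mmer.encode(), digest_size=8).digest(), "big")

MAX_H = 2 ** 64

def selected(mmer, f):
    """FHS rule: keep an m-mer iff its hash falls in a fraction f of the space."""
    return mhash(mmer) < f * MAX_H

def super_kmers(seq, k, m, f):
    """Group consecutive k-mers sharing the same selected minimizer.
    Returns (minimizer, start) pairs; minimizer is None when no m-mer of a
    k-mer is selected, which becomes possible once f < 1 (unlike with UHS)."""
    def minimizer(kmer):
        cands = [kmer[i:i + m] for i in range(k - m + 1)]
        sel = [c for c in cands if selected(c, f)]
        return min(sel, key=mhash) if sel else None

    runs, prev = [], object()
    for i in range(len(seq) - k + 1):
        cur = minimizer(seq[i:i + k])
        if cur != prev:
            runs.append((cur, i))
            prev = cur
    return runs

# With f=1 every k-mer has a minimizer (classic scheme); smaller f leaves gaps.
print(super_kmers("ACGTACGTGGTTACGT", k=8, m=4, f=0.5))
```

Shrinking f lowers the density of selected positions, at the cost of runs of k-mers with no selected m-mer at all, which is the coverage/density trade-off the theoretical model quantifies.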
Presentation Overview:
Unlike bulk and common single-cell RNA-seq analyses, spatial transcriptomics provides the option to analyze the positional context of cells in a tissue. However, common approaches lack single-cell resolution, so the analysis of these data requires a deconvolution step to infer the cell type composition of each spot. Several tools and algorithms are available for this purpose, most of which require a matching single-cell dataset and a list of specific marker genes for each cell type, usually defined via the top differentially expressed genes. However, this approach considers only changes in expression levels and their significance.
We performed a detailed analysis of genes and their eligibility as markers, considering both changes in expression level and the specificity of gene expression. The ideal marker gene shows significantly increased expression and is expressed in all cells belonging to a specific cell type, while not being expressed in any of the others. Comparing different approaches on real spatial transcriptomics data from six cases of hepatoblastoma, we developed a filtration strategy that yields a better and more specific list of marker genes for the deconvolution of spatial transcriptomics data.
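The specificity criterion can be sketched as a simple detection-rate filter on top of differential expression. The thresholds (80% within the target type, 10% elsewhere) and the toy counts are illustrative assumptions, not the values used in the study:

```python
import numpy as np

def marker_mask(counts, labels, target, min_in=0.8, max_out=0.1):
    """counts: cells x genes matrix; labels: cell type per cell.
    Keep genes detected in most target cells and few other cells."""
    in_type = labels == target
    frac_in = (counts[in_type] > 0).mean(axis=0)    # detection rate in target
    frac_out = (counts[~in_type] > 0).mean(axis=0)  # detection rate elsewhere
    return (frac_in >= min_in) & (frac_out <= max_out)

labels = np.array(["tumor"] * 50 + ["immune"] * 50)
counts = np.zeros((100, 3))
counts[:50, 0] = 5.0   # gene 0: expressed only in tumor cells -> good marker
counts[:, 1] = 3.0     # gene 1: highly expressed but ubiquitous -> rejected
counts[:15, 2] = 2.0   # gene 2: only a subset of tumor cells -> rejected
print(marker_mask(counts, labels, "tumor"))  # [ True False False]
```

Gene 1 illustrates why fold change alone is not enough: a ubiquitous gene can still rank highly in a differential expression test while being useless for deconvolution.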
Presentation Overview:
Mutational signatures are context-specific frequency patterns of somatic mutations that result from endogenous and/or exogenous factors, and the presence of distinct types of mutational signatures can provide information on the aetiology of tumours. Reference catalogues of mutational signatures have been created and are used to determine the presence or absence of the corresponding patterns in a tumour sample.
However, the differences between two signatures based on the mutational profile alone can be very subtle. Typically, the cosine distance is used to define a formal distance between two mutational signatures. This, however, can provide misleading results when trying to identify signature similarities or the presence of signatures in existing data.
Therefore, we have been comparing different distance metrics to determine the most robust metric for distinguishing, and in turn quantifying, the presence of mutational signatures in given samples. To compare the metrics, we simulated mutational signatures based on the existing COSMIC reference catalogues. We considered several performance criteria to provide a better understanding of how to assess mutational signatures in the future.
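The kind of comparison described can be illustrated on a pair of nearly identical signatures: a small, localized perturbation of a 96-context profile is weighted very differently by cosine, Jensen-Shannon, and L1 distance. The signatures below are simulated, not COSMIC entries:

```python
import numpy as np
from scipy.spatial.distance import cosine, jensenshannon

rng = np.random.default_rng(0)
sig_a = rng.dirichlet(np.ones(96))    # random signature over 96 contexts
sig_b = sig_a.copy()
sig_b[:4] += 0.02                     # subtle shift in four contexts
sig_b /= sig_b.sum()                  # renormalize to a probability vector

print(f"cosine        : {cosine(sig_a, sig_b):.4f}")
print(f"Jensen-Shannon: {jensenshannon(sig_a, sig_b):.4f}")
print(f"L1            : {np.abs(sig_a - sig_b).sum():.4f}")
```

Running such comparisons over many simulated perturbations is one way to ask which metric best separates truly distinct signatures from noisy copies of the same one.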
Presentation Overview:
Epithelial ovarian cancer (EOC) is a gynecological malignancy that develops within the ovary or the fallopian tubes. It is classified into different histotypes, including high-grade serous, low-grade serous, endometrioid, clear cell, and mucinous, with heterogeneous profiles and clinical outcomes. However, information regarding the heterogeneity of tumor-infiltrating immune cells (TIICs) and blood-associated immune cells, as well as the role of the surrounding tumor microenvironment (TME) in tumor progression, is lacking.
We performed 3’ single-cell RNA sequencing on peripheral blood mononuclear cells (PBMCs), sorted TIICs, and tumor cells in a cohort of EOC patients encompassing four histotypes: high-grade serous, endometrioid, clear cell, and mucinous. We investigated the heterogeneity of the TME using different unsupervised learning approaches, including a holistic clustering approach, functional enrichment analysis (Reactome), differentiation and activation trajectory analysis (RNA velocity, CellRank), cell-cell communication (NicheNet), cytokine profiling (CytoSig database), and transcription factor inference (SCENIC).
Overall, our analysis revealed that the different histotypes are characterized by high heterogeneity in their immune cell subpopulations, in terms of both proportion and phenotype. Moreover, it provided insights into tumor-immune cell interactions, outlining a correlation between the immune system and clinical outcome in EOC patients.
Presentation Overview:
Recent findings indicated an association between MAIT cells and the immune response to the BNT162b2 vaccine, with MAIT cell frequency being associated with an increased adaptive immune response.
Here, to investigate the effect of repeated SARS-CoV-2 vaccinations on MAIT cells, we performed longitudinal 5’ scRNA-seq coupled with scTCR-seq on peripheral blood samples from six healthy adults naïve to SARS-CoV-2 infection and immunized with two doses of the mRNA-based vaccine BNT162b2. Taking advantage of computational approaches, including functional pathway enrichment analyses and correlation between gene expression and effector-cell polarization fate probabilities (RNA velocity and CellRank), we identified MAIT cells as the major source of TNF-α among circulating lymphocytes, and this TNF-high signature increased upon the second administration of the vaccine. Notably, the increased TNF-α expression correlated with SARS-CoV-2-specific antibody titers. Moreover, by modeling intercellular communication with the NicheNet algorithm, we observed that the TNF-α profile predicts the transcriptional changes of conventional switched memory B cells, which are responsible for high-affinity long-term memory.
Overall, our results indicate that MAIT cells promote B cell functionality in response to the vaccine, favoring effective and long-term protection against SARS-CoV-2 infection, suggesting the use of MAIT cells as cellular adjuvants in mRNA-based vaccines.
Presentation Overview:
The Curly Su (dMPO) protein, a homolog of human myeloperoxidase (hMPO), is involved in wing development in Drosophila melanogaster. dMPO contributes to various cellular and physiological processes through its production of reactive oxygen species (ROS). As the sequences of dMPO and hMPO are similar, dMPO is an excellent candidate for experimental validation in development and immunity studies. To investigate the effects of specific mutations on dMPO and hMPO, we performed saturated computational mutagenesis to identify target mutations based on predicted folding energy changes and proximity to post-translational modification (PTM) sites. We constructed transgenic fruit flies with G378W, Del 305-687, S590A, K552R, and W621R mutations using genome editing. We observed wing phenotypes and the overall lifespan of the flies during husbandry. Transcriptome analysis was conducted for both transgenic and wild-type samples using RNA-seq. We utilized the R Bioconductor package BigPint to visualize differentially expressed genes (DEGs) between treatment samples and conducted gene ontology analysis of the DEGs to assign functional attributes to down- and up-regulated genes. This combination of computational tools and genetic experiments enabled rapid insights into the novel functional effects of missense mutations in target proteins.
Presentation Overview:
We introduce RNA-tailor, a novel tool designed to precisely inventory the repertoire of alternatively spliced transcripts of a target gene from third-generation sequencing data. Alternative splicing (AS) is a regulatory mechanism that enables the production of various RNA isoforms, and abnormally spliced RNAs can be responsible for various diseases and cancers. For the study of AS, and despite their higher sequencing error rate, long-read sequencing technologies are preferred over short reads, as they capture full-length transcripts and thus the combinatorics of exons. Thanks to splice-aware alignment tools and fine refinement steps, RNA-tailor aims to provide a nucleotide-level picture of the alternative transcripts of a given target gene using only a reference sequence. This makes the method usable for both model and non-model species, and provides unbiased, accurate results that better reflect the transcript diversity of a sample, with great potential for novel isoform discovery. We will present results on an ONT mouse transcriptome dataset and a subset of 19 genes whose annotated isoforms differ in number or AS events. Preliminary results show that RNA-tailor predicts isoforms better than recent tools (Freddie, FLAIR), without prior knowledge.
Presentation Overview:
Targeted amplicon sequencing of the 16S ribosomal RNA gene is a common approach to studying the microbial communities at a site. However, the accuracy of this methodology strongly depends on the choice of primer pairs.
In our previous work, we developed mopo16S [DOI:10.1186/s12859-018-2360-6], a multi-objective optimization framework to simultaneously maximize primer efficiency, specificity, and coverage. Here we present mopo16Sweb, a powerful tool for easily designing optimal 16S primers that further improves on the functionality of mopo16S.
First, we have extended the multi-objective optimization framework by designing new fitness functions to include user-specified constraints in the optimization process.
Second, we have simplified the specification of the required input by including built-in presets of known bacterial sequences and primer pairs, querying the most widely used metagenomic databases (GreenGenes, Ribosomal Database Project, SILVA, and probeBase).
Third, we have improved the analysis of the optimized output primers by adding interactive plots and tables that simplify the selection of the best primers among those on the output Pareto front.
We have included all the above updates in mopo16Sweb, a novel interactive (containerized) web app for 16S primer optimization, available on the cloud (https://mopo16sweb.dei.unipd.it/) or on a user's own server.
Presentation Overview:
Detecting pathogenic variants in PMS2 is crucial for diagnosing Lynch Syndrome, an autosomal dominant cancer predisposition syndrome. However, the presence of a highly homologous pseudogene, PMS2CL, complicates variant calling using short-read sequencing. In this study, we introduce a computational method, Multi-Region Joint Detection (MRJD), to address these challenges and improve the reliability of small variant calling in the ~11kb PMS2 homology region.
MRJD detects variants in paralogous regions by jointly genotyping all paralogous regions, including reads with ambiguous alignment. The method offers a default mode that balances precision and recall, and a high-sensitivity mode that maximizes the ability to identify all potential variants. Both modes are implemented in the DRAGEN software suite v4.2, allowing users to choose the best fit for their needs.
We benchmarked MRJD on 150 samples from the Illumina Polaris diversity panel and found it outperforms default germline small variant calling, particularly for INDELs. The high sensitivity mode achieves 96% aggregated recall for SNPs and INDELs, while maintaining acceptable precision.
MRJD contributes to a more reliable diagnosis of Lynch Syndrome, improving risk assessment for affected individuals. This method also paves the way for further research on variant calling in genes with high homology challenges.
Presentation Overview:
Splicing is dysregulated in many tumors and may result in tumor-specific transcripts that can encode neoantigens, which are promising targets for cancer immunotherapy. Detecting tumor-specific splicing is challenging because many non-canonical splice junctions identified in tumor transcriptomes also appear in healthy tissues. However, somatic mutations can disrupt canonical splicing motifs or create novel ones, leading to truly tumor-specific targets.
Here, we developed splice2neo to integrate the predicted splice effects from somatic mutations with splice junctions detected in tumor RNA-seq for individual cancer patients. Using splice2neo, we exclude canonical splice junctions and splice junctions from healthy tissue samples, annotate resulting transcript and peptide sequences, and integrate targeted re-quantification of supporting RNA-seq reads. By analyzing melanoma patient cohorts, we established a stringent detection rule to predict splice junctions as mutation-derived and tumor-specific targets. In an independent melanoma cohort, we identified 1.7 target splice junctions per tumor with an estimated false discovery rate of less than 5% and established tumor-specificity using additional healthy tissue samples. For individual examples of exon-skipping events, we confirmed the expression in tumor-derived RNA by quantitative real-time PCR experiments. Most target splice junctions encoded at least one neoepitope candidate with predicted MHC I or MHC II binding. Compared to neoepitope candidates derived from non-synonymous point mutations, the splicing-derived neoepitope candidates had a lower self-similarity to corresponding wild-type peptides.
The identification of mutation-derived and tumor-specific splice junctions can lead to additional neoantigen candidates to expand the target repertoire for cancer immunotherapies.
Splice2neo is available on GitHub: https://github.com/TRON-Bioinformatics/splice2neo
Presentation Overview:
Synonymous codons display an inherent non-random distribution, called codon usage bias, which can differ across species at levels from genes to genomes. This has been exploited by the biotechnology industry, where recombinant proteins are back-translated to DNA by selecting codons to maximise transcription and yield. However, obtaining accurate and representative codon bias estimates requires identifying highly expressed genes and their codon variation across samples and conditions. To address this, we developed Codon Usage Bias from RNA-sequencing (CUBseq), a fully automatic meta-analysis pipeline to build highly expressed gene panels and estimate codon usage bias from RNA sequencing experiments.
Here, we use CUBseq to estimate codon usage bias in Escherichia coli (E. coli), leveraging publicly available RNA sequencing data from 6,763 samples across 72 strains. We found that, on average, almost 40% of samples harbour synonymous variants across the E. coli meta-transcriptome, demonstrating the importance of adjusting codon usage estimates to account for this variation. We then identified a set of 81 highly expressed genes in our dataset with negligible variation across strains, suggesting that codon usage is stable across strains. A species-level characterisation of the codon usage bias of all transcriptome-derived protein-coding genes, compared to the genome-derived CoCoPUTs codon usage table, found negligible differences, suggesting that codon usage in E. coli is consistent across strains and experimental conditions. In contrast, when we compared the codon usage bias estimates of our transcriptome-derived highly expressed genes to the widely used genome-derived Kazusa and CoCoPUTs codon usage tables, we found significant variation at the amino acid and codon levels. This variation is further emphasised when comparing the transcriptome-derived codon frequencies of our highly expressed genes to those of all protein-coding genes, demonstrating that the panel of genes used has a large effect on relative codon frequencies. Finally, we reveal new relationships between the codon frequencies of highly expressed genes and the abundance of their cognate tRNA genes.
In summary, we present (to our knowledge) the largest transcriptome-wide study of codon usage in E. coli, in which we characterised a highly complex mutational landscape and found previously unknown global patterns in codon usage at the transcriptional level, suggesting that the transcriptome plays an important role in shaping codon preference. Overall, CUBseq provides a novel and robust method for large-scale transcriptome-based codon usage analysis, which we anticipate will have broad applications in areas such as accurate quantification of codon usage for codon optimisation, studies of transcriptional and translational control, effects on protein structure and cellular fitness, and the molecular evolution of species and host-pathogen relationships.
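At its core, the codon usage estimation step reduces to counting codons over a gene panel and reporting relative frequencies. A minimal version (with invented toy CDS sequences, not the CUBseq pipeline itself) looks like:

```python
from collections import Counter

def codon_frequencies(cds_list):
    """Relative codon frequencies over a panel of coding sequences."""
    counts = Counter()
    for cds in cds_list:
        assert len(cds) % 3 == 0, "CDS length must be a multiple of 3"
        counts.update(cds[i:i + 3] for i in range(0, len(cds), 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in sorted(counts.items())}

panel = ["ATGGCTGCCTAA", "ATGGCGGCTTGA"]  # toy CDSs, invented for the example
freqs = codon_frequencies(panel)
for codon, f in freqs.items():
    print(codon, round(f, 3))
```

The abstract's central point is visible even at this scale: the frequencies returned depend entirely on which genes are placed in `panel`, which is why the choice between a highly expressed gene panel and all protein-coding genes changes the estimates.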
The development of single-cell atlases has become a significant area of research in the last decade. However, there is currently no global quality control tool to evaluate the final reconstructed atlas. To address this issue, the Checkatlas bioinformatic tool was created to provide a user-friendly way to assess the overall quality of single-cell atlases. Checkatlas screens all relevant files in the working directory and produces a quality control report for every atlas in a single HTML file. The tool generates a summary table of all the atlases in the working directory, quality control metrics in table and figure formats, and metadata regarding cell annotation and experimental design. Checkatlas also includes a catalog of essential metrics for evaluating clustering, annotation, and visualization, making it a valuable resource for researchers working with diverse single-cell atlas datasets. The tool is multi-threaded and can be deployed on a computing cluster for efficient and speedy analysis. In this presentation, we will demonstrate Checkatlas's use cases and its successful application in evaluating the quality of 80 COVID and healthy donor atlases.
Single-cell RNA sequencing (scRNA-seq) technology, which enables parallel profiling of hundreds of thousands of cells, has already been successfully applied to search for new cell types or to understand different cellular states. A well-known feature of scRNA-seq is data sparsity, i.e. a high percentage of zero counts in the data matrix. Such an event could occur due to: (i) technical reasons, when a cell's transcript is present but undetected because of inefficient cDNA polymerization, amplification error, or low sequencing depth; (ii) biological reasons, when zeros reflect a real lack of expression in a cell. We investigated several technical factors that can contribute to expression shift between bulk and scRNA-seq platforms and found that a low level of bulk gene expression representing true expression is the main factor; however, RNA integrity, gene or UTR3 length, and the number of transcripts could also be important. Next, we developed a true biological zero (TBZ) score by calculating the ratio between the distribution of genes that are not expressed in scRNA-seq but observed in bulk and that of normally expressed genes. Finally, we used the TBZ score to test existing data imputation methods that can preserve true zeros: ALRA, DrImpute, SAVER, and scImpute.
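A minimal sketch of the kind of dropout summary underlying such a score might look as follows; the binning and the exact ratio are illustrative assumptions, not the authors' TBZ definition.

```python
# For genes expressed in bulk, compute the fraction that are undetected in
# scRNA-seq, stratified by bulk expression level. A high ratio in the
# low-expression bins is consistent with technical (not biological) zeros.
def dropout_ratio_by_bin(bulk_expr, sc_detected, n_bins=3):
    """bulk_expr: bulk expression per gene; sc_detected: whether the gene
    was detected in scRNA-seq. Returns one dropout ratio per bin."""
    expressed = [(e, d) for e, d in zip(bulk_expr, sc_detected) if e > 0]
    expressed.sort(key=lambda t: t[0])           # low -> high bulk expression
    size = len(expressed) // n_bins
    ratios = []
    for b in range(n_bins):
        chunk = (expressed[b * size:(b + 1) * size]
                 if b < n_bins - 1 else expressed[b * size:])
        zeros = sum(1 for _, d in chunk if not d)
        ratios.append(zeros / len(chunk))
    return ratios

bulk = [1, 2, 3, 10, 20, 30, 100, 200, 300]
detected = [False, False, True, True, False, True, True, True, True]
print(dropout_ratio_by_bin(bulk, detected))
```

In this toy example, dropouts concentrate in the lowest-expression bin, mirroring the abstract's finding that low bulk expression is the main factor behind technical zeros.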
Recent advances in DNA sequencing technologies have allowed the detailed characterization of whole-exomes and whole-genomes in large cohorts of tumors. These studies have highlighted the extreme heterogeneity of somatic mutations between tumors. Such heterogeneity hinders our ability to identify alterations important for the disease. Several tools have been developed to identify somatic mutations related to cancer phenotypes. However, such tools identify only correlations, with no guarantee of highlighting causal relations.
We describe ALLSTAR, a novel tool to infer reliable causal relations between somatic mutations and cancer phenotypes. In particular, our tool identifies reliable causal rules highlighting combinations of somatic mutations with the highest impact in terms of average effect on the phenotype. While we prove that the underlying computational problem is NP-hard, we develop a branch-and-bound approach that employs PPI networks and novel bounds for pruning the search space, while correcting for multiple hypothesis testing.
Our extensive experimental evaluation on synthetic data shows that ALLSTAR is able to identify reliable causal relations in large cancer cohorts. Moreover, on cancer data ALLSTAR identifies several somatic mutations known to be relevant for cancer phenotypes, as well as novel, biologically meaningful relations.
Structural variants (SVs) have been associated with many monogenic and complex disorders. However, their accurate and reliable detection remains challenging, often resulting in diverging results from current SV detection methods. As a guide for SV calling in the context of clinical diagnostics, we compared SV detection methods based on Illumina short-reads, PacBio long-reads, 10x Genomics and TELL-Seq linked-reads, as well as Bionano optical maps in a cohort of 20 patients. This comparison was enabled by a novel approach for constructing manually curated benchmark SV callsets for each patient. Our results indicate caller-specific performance differences across categories of SV types, sizes, and genomic contexts.
Furthermore, we present an unsupervised learning approach to systematically identify the failure modes of current short-read-based SV detection methods. For that, each SV was annotated with a number of features. These include features about the variant itself, the genomic context, and the alignment of the reads supporting an SV. These features serve as the basis to construct a kNN-graph for clustering SVs. Next, groups of clusters that are not well detected by short-read sequencing are inferred. Identifying such biases can help improve short-read-based SV calling.
Recent advances in long-read sequencing technologies enabled accurate and contiguous de novo assemblies of large genomes and metagenomes. However, even long and accurate high-fidelity (HiFi) reads do not resolve repeats that are longer than the read lengths. This limitation negatively affects the contiguity of diploid human genome assemblies since two haplomes share many long identical regions. To generate the telomere-to-telomere assemblies of diploid genomes, biologists now construct their HiFi-based phased assemblies and use additional experimental technologies to transform these phased assemblies into more contiguous diploid assemblies. Barcoded linked-reads, generated using the inexpensive TELL-Seq technology, provide an attractive way to bridge unresolved repeats in phased assemblies of diploid genomes.
Here, we present SpLitteR, a tool for haplotype phasing and scaffolding in an assembly graph using barcoded linked-reads. We benchmark SpLitteR on assembly graphs produced by various long-read assemblers and show how TELL-Seq reads facilitate phasing and scaffolding in these graphs. This benchmarking demonstrates that SpLitteR improves upon state-of-the-art linked-read scaffolders in accuracy and contiguity metrics. SpLitteR is implemented in C++ as a part of the freely available SPAdes package and is available at https://cab.spbu.ru/software/splitter.
Spatial transcriptomics is one of the most promising technologies for analyzing spatial distribution and interaction. Spatially barcoded next-generation sequencing (NGS)-based methods enable the detection of transcripts on tissue sections. Several of these technologies are near single-cell resolution, with spots encompassing 1-20 cells. Therefore, a deconvolution step is required to gain insight into the mixture of cells.
Here, we propose a new method named CellFromSpace (CFS), based on independent component analysis, a blind signal separation method, to deconvolute spatial transcriptomic data without reference single-cell data. We developed an R package and a Shiny interface to accelerate the annotation of the signal.
Visium fresh-frozen and FFPE samples of adult mouse brain and human tumors from 10x Genomics were analyzed. We were able to recapitulate the structure of the mouse brain with high fidelity using our method. Furthermore, we quickly identified cell types and activities within heterogeneous cancer tissues. The method also makes it possible to subset the signals and spots corresponding to specific cells, enabling further analyses usually performed for scRNA-seq, such as trajectory inference.
In conclusion, CFS provides a full workflow to analyze and quickly interpret results from NGS-based spatial transcriptomics without a reference single-cell dataset.
Every RNA-seq experiment demands rigorous quality control (QC) to remove poor-quality samples, which may mask meaningful results. While the specifics of RNA-seq quality control depend on the goals of an experiment, thus far there has been minimal investigation into which metrics best identify poor-quality samples. In addition, there is a lack of tools that support context-specific quality control analysis. Together, this creates obstacles for researchers looking for a data-informed justification on whether to exclude a potentially problematic sample.
To address this, we investigated the utility of different RNA-seq QC metrics using 252 RNA-seq samples from human patients. We examined the relationships between these QC metrics and three distinct endpoint metrics that together characterize RNA-seq sample quality holistically. Then, we trained and interrogated a random forest model to identify the most useful QC metrics and key inflection points in their effect on quality that inform guidelines for sample exclusion. Lastly, we developed an open source software tool, QC Doctor, which generates holistic visual summaries of RNA-seq sample quality that can be customized to a user's experimental goals. Together, this work provides data-informed guidelines and tools to aid researchers in improving the rigor of their RNA-seq experiments.
Accurate detection of somatic copy number variants (CNVs) in single-cell whole-genome sequencing (scWGS) is challenging due to sequencing depth limitations and amplification artifacts. Existing single-cell CNV callers are primarily designed for large CNVs arising in neoplastic cells, and often have low sensitivity for non-clonal CNVs and small (<1Mb) CNVs. Here, we propose scBIC-seq, a novel CNV detection algorithm that utilizes read count, B-allele frequency, and haplotype signals to accurately infer both clonal and non-clonal CNVs. Benchmarking experiments demonstrated superior performance compared to existing methods, especially at the sub-megabase scale. Remarkably, scBIC-seq enabled the detection of subtle genomic changes in minute cell populations in a longitudinal meningioma case, tracing the origin of the second clonal expansion back to a time even before the first surgical resection. Furthermore, applying scBIC-seq to scWGS data from eleven neurotypical human brains revealed an increase in somatic CNV burden in post-mitotic neurons during normal aging.
The Mayo Clinic Tapestry study is a population-scale sequencing initiative with a goal of sequencing 100,000 consented patient participants. To accelerate genomic findings, we developed a data mining workbench that integrates genetic variants with phenotypic data in EMRs. The workbench is designed to enable the selection of appropriate analysis workflows based on the prevalence of a disease trait. For rare disease studies, resources such as curated gene lists and prior knowledge of gene inheritance patterns are critical. For common disease studies, we implemented multiple software packages for data QC, population stratification estimation, GWAS and PRS analysis. To demonstrate its broad applications, we applied the workbench to two common diseases, obesity and nonalcoholic fatty liver disease (NAFLD). Through rare variant analysis, the workflow prioritized key genes in obesity that were previously found to be significantly associated with BMI in multiple studies. In the NAFLD cohort, a PRS in the 95th percentile confers increased risk (OR=4.653; 95% CI, 3.88-5.58) compared to the control cohort. Finally, applied to a rare disease study, we identified rare pathogenic germline variants in a cohort of 146 cholangiocarcinoma subjects, which are among key variants previously known to be involved in cholangiocarcinoma pathogenesis.
Single-cell CRISPR screens combine CRISPR-Cas9 genetic perturbations with single cell RNA sequencing to directly test how genetic elements or variants impact gene expression. As the data collected from individual screens continues to grow exponentially, a need has emerged for computational methods that can efficiently analyze millions of single-cell transcriptomic profiles while remaining robust to technical confounders. To address this gap, we developed PerTurbo: a fast, fully Bayesian tool for estimating perturbation effects in single cell CRISPR screens. Like previous CRISPR-specific analysis methods, PerTurbo identifies differentially expressed genes (DEGs) between perturbed and unperturbed cells, while treating the observed guide RNA (gRNA) counts as noisy measurements of the true perturbation(s) received by each cell. But thanks to its efficient, GPU-accelerated implementation using stochastic variational inference, PerTurbo runs thousands of times faster, making it feasible to run transcriptome-wide DEG tests for each gRNA. Additionally, unlike other methods, PerTurbo tests for effects on both the mean and variance of transcript counts. We highlight PerTurbo's superior scalability and performance by performing transcriptome-wide tests on several datasets to investigate the effects of perturbing enhancers, genes, or disease-associated variants on gene expression at single-cell resolution.
Cancer is the uncontrolled growth of a collection of cells. These cells can be parsed into distinct subgroups, called clones, each characterized by a unique set of genetic mutations. A cancer treatment's success hinges on its ability to destroy each of these clones; however, genetic variation across clones often leaves treatment ineffective. Hence, it is of clinical importance to understand the clonal structure of cancer.
Circulating tumour DNA (ctDNA) consists of fragments of tumour DNA found in blood. Serial samples of ctDNA can be obtained non-invasively via blood samples and provide a lens into changing tumour abundance and clonal structure. Existing methods estimate tumour abundance using ctDNA but ignore its underlying clonal structure.
We introduce LiquidBayes, a Bayesian statistical model that infers tumour abundance and clonal structure by integrating ctDNA with single-cell whole-genome sequencing (scWGS). LiquidBayes uses scWGS to infer clone-specific copy number profiles and leverages these as prior information on clonal structure.
LiquidBayes outperformed two state-of-the-art approaches, ichorCNA and MRDetect, on semi-synthetic data. We performed scWGS and ctDNA sequencing on pre- and post-treatment samples from a patient with triple-negative breast cancer. LiquidBayes shows both a reduction in overall tumour abundance and a shift in clonal structure after treatment.
Motivation: Cell-free DNA (cfDNA) sequencing is a promising diagnostic and monitoring tool in cancer care. The development of novel computational and analytic workflows in the field is, however, impeded by limited data sharing due to the strict control of genomic data. While the analysis of copy number variants (CNV), nucleosome footprints, and fragmentation patterns from cfDNA sequencing data does not theoretically require actual sequence information, current bioinformatics software is generally developed to process alignment files containing sensitive sequence data.
Results: We present Fragmentstein, a lightweight command line tool for converting non-sensitive cfDNA fragmentation data, consisting only of cfDNA fragment coordinates, into sequence alignment mapping (SAM/BAM) files which most contemporary cfDNA sequencing analysis tools require as input. Fragmentstein merges fragment coordinates and mapping quality scores with sequence information from a reference genome to create SAM/BAM files that contain fragment coordinates from the sample but no sensitive genome sequence. To demonstrate the utility of Fragmentstein, we analyze a publicly available dataset and show that CNVs, nucleosome footprints, and fragment length features can be fully recovered from non-sensitive fragment data.
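The core idea can be sketched in a few lines: fragment coordinates plus a reference genome suffice to emit a SAM record. The sketch below follows the SAM field layout for a single-end record; it is illustrative and is not Fragmentstein's implementation.

```python
# Rebuild a minimal SAM line from non-sensitive fragment coordinates by
# pulling the bases from a reference genome. The 11 mandatory SAM fields:
# QNAME FLAG RNAME POS MAPQ CIGAR RNEXT PNEXT TLEN SEQ QUAL.
def fragment_to_sam(name, chrom, start, end, mapq, reference):
    """start/end are 0-based half-open coordinates; SAM POS is 1-based."""
    seq = reference[chrom][start:end]      # sequence taken from the reference
    cigar = f"{len(seq)}M"                 # fully matched by construction
    fields = [name, "0", chrom, str(start + 1), str(mapq),
              cigar, "*", "0", "0", seq, "*"]
    return "\t".join(fields)

ref = {"chr1": "ACGTACGTACGTACGT"}         # toy reference
line = fragment_to_sam("frag1", "chr1", 4, 10, 60, ref)
print(line.split("\t")[9])  # ACGTAC
```

Because the emitted sequence comes from the public reference rather than the sample, the resulting BAM carries fragment coordinates but no sensitive per-individual sequence, which is the property the abstract relies on.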
A challenge for bioinformatics is to keep up with the amount of data generated by high-throughput sequencing.
Being able to compare such volumes of data remains a scalability challenge and is the focus of many methodological papers.
To achieve drastic memory cost reduction, one possibility is to transform documents into "sketches" of highly reduced size that can be quickly compared to compute document similarity with bounded error.
The most widely used tools rely on fixed-size sketches built with techniques such as MinHash or HyperLogLog.
However, those techniques have relatively poor accuracy when the compared datasets are very dissimilar in size or content.
To cope with this problem, novel methods propose constructing adaptive sketches, scaling linearly with the size of the input, by selecting a fraction of the documents' k-mers.
Several techniques have been proposed to perform uniform sub-sampling with theoretical guarantees, such as modimizer/modminhash and scaled minhash/FracMinHash.
With SuperSampler, we improve such schemes by combining them with the concept of super-k-mers, drastically reducing resource usage (CPU, memory, disk).
In this poster, we show that SuperSampler can use an order of magnitude fewer resources than the state of the art with equivalent results.
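The fraction-based sub-sampling mentioned above (scaled minhash/FracMinHash) can be sketched as follows; the hash function, scale factor, and k are illustrative choices, and SuperSampler's super-k-mer machinery is not shown.

```python
import hashlib

# FracMinHash-style sub-sampling: keep a k-mer iff its hash falls below a
# fixed fraction of the hash range, so sketch size scales with input size
# and two sketches remain directly comparable.
MAX_HASH = 2**64
SCALE = 4  # keep ~1/4 of distinct k-mers; real tools use much larger scales

def kmer_hash(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k=4):
    """Hashes of the sub-sampled distinct k-mers of seq."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {h for h in map(kmer_hash, kmers) if h < MAX_HASH // SCALE}

def jaccard_estimate(sa, sb):
    """Jaccard similarity estimated from two sketches."""
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0
```

Because the same keep-threshold is applied to every input, the intersection and union of two sketches are unbiased sub-samples of the true intersection and union, which is what gives the bounded-error similarity estimate.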
A typical analysis estimates the presence of known mutational signatures in each sample. However, current approaches rely on a large number of mutations to accurately estimate mutational signature exposure. Making this analysis possible when only sparse mutation data are available, such as data generated from panel sequencing or samples with low mutational burden, requires novel developments in current methodologies for estimating mutational signature exposures. Here we present our work on assessing signature exposures using a novel predictive modeling approach. Our strategy follows two main steps. First, using a statistical model, we identify relevant signals from cancer mutations based on a mutational signature reference catalog (e.g., COSMIC [2]). Second, we use these mutational signals to train a predictive model. The model aims to identify regions of the cancer genome sequence that are informative with respect to mutational signatures and that are then considered when estimating the mutational signature exposure of a single sample.
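The exposure-fitting step that such methods build on can be illustrated with a small non-negative decomposition of a sample's mutation counts against a fixed catalog. The sketch below uses generic multiplicative updates and a two-signature toy catalog standing in for COSMIC; it does not reproduce the authors' predictive model.

```python
# Fit non-negative per-signature exposures so that the catalog-weighted sum
# of signatures reconstructs the observed per-channel mutation counts
# (multiplicative updates for a KL-divergence objective).
def fit_exposures(catalog, counts, iters=500):
    """catalog: list of signatures, each a list of channel probabilities;
    counts: observed mutation counts per channel. Returns exposures >= 0."""
    k = len(catalog)
    expo = [1.0] * k
    for _ in range(iters):
        recon = [sum(expo[j] * catalog[j][c] for j in range(k))
                 for c in range(len(counts))]
        for j in range(k):
            num = sum(catalog[j][c] * counts[c] / max(recon[c], 1e-12)
                      for c in range(len(counts)))
            den = sum(catalog[j]) or 1e-12
            expo[j] *= num / den
    return expo

catalog = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]  # two toy signatures, 3 channels
counts = [90, 20, 90]                          # exactly 100*sig1 + 100*sig2
exposures = fit_exposures(catalog, counts)     # converges to ~[100, 100]
```

With realistic 96-channel catalogs and only a handful of mutations, this plain fit becomes unstable, which is the sparse-data problem the predictive-model approach above is designed to address.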
Motivation: The positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw)-time, by Durbin’s Algorithm 5. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory.
Results: We leverage the notion of r-index proposed for the BWT to present a memory efficient method for constructing and storing the run-length encoded PBWT, and computing one-vs-all set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file.
Availability: https://github.com/dlcgold/muPBWT and https://bioconda.github.io/recipes/mupbwt/README.html
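The prefix-array sorting that underlies the PBWT (Durbin's Algorithm 1) can be sketched in a few lines; the run-length encoding and r-index machinery that make μ-PBWT memory-efficient are not shown.

```python
# At each site k, haplotypes are re-sorted by their reversed prefixes
# (stable bucket sort on the current allele). This is what places
# haplotypes sharing a long match adjacently in the positional prefix
# array, enabling O(hw)-time match finding.
def pbwt_prefix_arrays(haplotypes):
    """haplotypes: list of equal-length 0/1 lists.
    Yields the positional prefix array a_k before each site k."""
    h = len(haplotypes)
    a = list(range(h))
    for k in range(len(haplotypes[0])):
        yield list(a)
        zeros, ones = [], []
        for idx in a:                      # stable pass in current order
            (zeros if haplotypes[idx][k] == 0 else ones).append(idx)
        a = zeros + ones                   # 0-allele haplotypes first
    yield list(a)                          # final ordering after all sites

haps = [[0, 1, 0], [1, 1, 0], [0, 0, 1]]
arrays = list(pbwt_prefix_arrays(haps))
```

Since each column's reordering only appends to two buckets, consecutive prefix arrays tend to contain long runs, which is exactly the structure that run-length encoding (and hence μ-PBWT's r-index) exploits.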
Advances in sequencing technologies have enabled construction of individualized references representing genetic variations across populations. However, existing graph genome software has the disadvantage of being memory and storage-intensive, as it often stores complete reference sequences along the graph. We introduce ChromMiniGraph, a tool for constructing space-efficient pangenome references that uses k-mer sampling to reduce memory and storage requirements while maintaining accuracy. The tool constructs a directed acyclic graph through iterative chaining with colored nodes representing haplotypes, which can then be used to map sequencing reads. ChromMiniGraph maps reads by subsampling them and performing colinear chaining onto a topologically linearized coordinate of the graph using banded dynamic programming. To further improve chaining accuracy, ChromMiniGraph identifies superbubbles in the graph to augment the linearized coordinate with an alternative sparse distance matrix to score anchors that straddle superbubbles. ChromMiniGraph can correctly assign haplotypes to simulated reads with short indels and complex structural variations, and it successfully maps PacBio reads from HG01243 to a reference graph constructed using Human Chromosome 20 reference sequences obtained from the Genome Reference Consortium and 1000 Genomes Project with accuracy and efficiency. Overall, ChromMiniGraph offers a streamlined workflow for creating and visualizing pangenome references, read phasing, and identifying structural variations.
Bulk transcriptomes and epigenomes are routinely measured in the clinic to diagnose and classify cancer patients. However, these classifications do not account for intra-tumor heterogeneity, i.e. proportions of the cell types mixed in a sample. This information is critical as it has an impact on the tumor behavior with respect to its evolution and treatment response.
The current state-of-the-art procedure to de-mix samples is to apply deconvolution algorithms on one block of data, either the transcriptome or the methylome. Nevertheless, there is no consensus on the best deconvolution method. In our project, we propose to combine both blocks. Our working hypothesis is that joint deconvolution should perform better: more information, of a different nature, should improve the de-mixing task.
We face the following challenges: how to perform joint deconvolution, and how to compare multi-block versus single-block approaches?
We first collected several deconvolution tools from the literature and tested different timings for the block integration step within the deconvolution pipeline. We then compared multi-block strategies with single-block ones. Our first results showed that the multi-block approach exhibited superior performance.
Finally, we will use the output of multi-block deconvolution to build a more refined stratification of pancreatic cancer patients.
Third-generation sequencing (TGS) has made draft genome production more accessible; however, the structural annotation of those genomes has not progressed at the same pace. This study explores the potential of using long-read RNA sequencing (lrRNA-seq) to improve gene annotation. The accuracy of lrRNA-seq-supported gene prediction was evaluated for different sequencing platforms (PacBio or Nanopore), pipelines, and annotation approaches, in addition to the gene prediction tool AUGUSTUS.
We used Nanopore and PacBio sequencing data from the WTC11 human cell line, processed independently and in combination using IsoSeq3 or FLAIR pipelines. Incorporating lrRNA-seq data during the gene prediction step significantly improved gene prediction accuracy, particularly with transcript models generated from PacBio long-reads.
This approach was then applied to annotate the Trichechus manatus latirostris genome, resulting in 25% and 12.3% more BUSCO genes than using only experimental data or ab initio predictions, respectively. The findings suggest that lrRNA-seq is a valuable source of experimental data for supporting gene annotation in mammalian species.
Deciphering molecular processes in oncology is essential for precise diagnostics and deployment of the most effective targeted therapy, leading to improved treatment outcomes and reduced risk of adverse effects. However, inter- and intra-tumor heterogeneity hides the underlying mechanisms of tumor drug response and adaptation or resistance. Here we linked molecular profiles of melanoma with drug resistance.
We analyzed single-cell RNAseq data of four NRAS-mutant melanoma cell lines (MelJuso, Sklmel30, M20, IPC298) in four stages of treatment with MEK1/2 plus CDK4/6 inhibitors, applying the previously developed R/Bioconductor package consICA. This package implements a reference-free deconvolution method that separates mixed molecular profiles into statistically independent signals. We were able to map single cells on a cell cycle and observed a strong linkage between the proportion of proliferating cells and adaptation to the treatment, which occurred with individual speed for each cell line. In three of four cell lines, we observed increased motility in resistant samples. We also observed several signals linked to ATP synthesis and metabolism that were modulated by the treatment and resistance. Interestingly, gene signals involved in mRNA processing and chromatin remodeling were mainly down-regulated in resistant cells, suggesting potential changes at the epigenetic level.
Advances in next-generation sequencing have increased the usage of whole genome sequencing (WGS) for studying disease-related polymorphisms. Accurate detection of genomic variants is imperative, and selecting the appropriate bioinformatic pipeline is non-trivial. We assessed three alignment tools (bwa-mem, minimap2 and dragmap-os) and three variant callers (GATK, GATK-DRAGEN and DeepVariant) using Genome in a Bottle consortium datasets and real data.
GATK showed the lowest accuracy for indels, whereas SNV results were similar across pipelines, with slight improvements in DeepVariant. Base quality score recalibration significantly increased computational time and had adverse effects on DeepVariant's accuracy. Filtering difficult genomic regions reduced variant calling time for GATK tools but had no effect on DeepVariant, which is the fastest tool when a GPU is available.
All aligners produced equally good results with minimal differences. Minimap2 was the fastest of the three, while bwa-mem and dragmap-os had similar runtimes. Notably, aligning data with bwa-mem is still recommended for some structural variant detection methods.
Overall, DeepVariant consistently performed well across aligners, with high sensitivity and precision. It required no region filtering and had comparable resource requirements. The choice of preprocessing pipeline depends on the study's requirements, emphasizing the need for careful consideration of downstream analysis right from the start.
Transcription factors (TFs) regulate sets of genes in a biological pathway simultaneously, driving dynamic alterations in cellular states such as differentiation and reprogramming. High-throughput epigenetic assays such as ChIP-seq and CUT&Tag can identify the DNA binding sites of TFs, followed by prediction of their binding motifs and target genes. While each epigenetic dataset captures only a snapshot of the binding profiles in one condition, integrating large-scale epigenetic data, including single-cell data, allows us to infer the stochasticity and context-dependence of TF binding at each genomic location.
In this study, we performed a large-scale comparison of ChIP-seq datasets for TFs and their family genes. By combining the epigenetic and co-expression patterns in pluripotent stem cells as well as other somatic cells, we attempt to reveal the mechanism of context-dependent binding of reprogramming factors.
As a result, our analysis recapitulates the strong overlap of binding sites between cooperative reprogramming factors not only in pluripotent stem cells but also in many other tissues and organs. Because the integration of binding profiles from different experiments indicates the stochasticity of TF binding, it may provide a clue to how biological context determines the binding probability of TFs. Further comparison of the co-expression patterns of TF family genes will help predict the pioneering ability and dosage dependency of each TF and its interacting proteins.
Nanopore sequencing has emerged as a significant technique for DNA methylation analysis, as it enables the detection of base modifications without the need for additional conversion steps. This allows for the measurement of methylation using a small amount of input material, making it increasingly useful in liquid biopsies to detect cell-free DNA (cfDNA) methylation patterns. Such patterns hold promise for the early diagnosis and monitoring of tumors originating from various sources. Identifying the originating cell composition (OCC) is a key step; however, its accurate prediction poses a challenge due to data sparsity and limited overlapping CpG sites across samples. To address this challenge, we have developed a method that leverages the methrix R package's speed and efficacy to fit individual models based on the non-missing sites in each sample, thereby estimating OCC efficiently. This method can utilize both array and sequencing-based references and works with any sequencing-based cfDNA methylomes. Although our estimates demonstrate that even 1 million reads suffice for accurate OCC prediction, we have incorporated quality control measures to assess the discriminatory ability of covered CpGs in the reference dataset. Overall, our approach provides a simple way of OCC estimation based on the DNA methylation patterns of cfDNA.
Cell-free DNA (cfDNA) is a valuable liquid biopsy biomarker for cancer diagnosis and monitoring, carrying genetic and epigenetic information released into the bloodstream from normal and cancerous cells. While cfDNA analysis often relies on DNA sequencing and subsequent bioinformatics processing, the effects of bioinformatics preprocessing on cfDNA measurements remain understudied. To investigate whether preprocessing choices affect cfDNA analysis outputs, we built a modular bioinformatics pipeline that evaluates a range of commonly used preprocessing settings. We evaluated the effect of preprocessing on low-pass whole-genome sequencing of plasma cfDNA (median coverage 2.38x). cfDNA fragment size, coverage, copy number changes, and differential coverage over DNase hypersensitivity sites were recovered in a cohort of 20 lung cancer and 20 healthy cfDNA samples. We found that the analyzed features remain robust to preprocessing choices such as read trimming and reference genome builds. However, we observed that strict alignment filtering improves the differentiation between cancer and healthy samples for coverage-related features, such as differential coverage analysis over DNase hypersensitivity sites. In conclusion, our findings indicate that bioinformatic preprocessing choices in cfDNA analysis have minimal impact on distinguishing cancer from healthy samples, with only a few features benefiting from strict alignment filtering.
Most high-throughput sequencing methods for single-cells focus on quantifying gene expression. However, recent advancements in multimodal profiling techniques have enabled the simultaneous measurement of multiple -omics within the same cells. To facilitate the development of new statistical and computational methods for analyzing such data, it is important to have readily available landmark datasets that adhere to standard data classes.
We have gathered, processed, and compiled publicly available landmark datasets from several single-cell multimodal protocols, including CITE-Seq, ECCITE-Seq, SCoPE2, scNMT, 10X Multiome, seqFISH, and G&T-seq. These datasets are released as Bioconductor classes and are documented and distributed as the SingleCellMultiModal package through Bioconductor's ExperimentHub. This allows landmark datasets from seven different single-cell multimodal data generation technologies to be retrieved with a single command, eliminating the need for additional data processing or manipulation and allowing users to analyze the data and develop methods within Bioconductor's extensive ecosystem.
We present two illustrative examples of integrative analyses that are greatly simplified by the use of SingleCellMultiModal. This package will facilitate the advancement of bioinformatic and statistical methods within the Bioconductor framework, enabling researchers to address the challenges associated with integrating multiple molecular layers and analyzing phenotypic outcomes, such as cell differentiation, activity, and disease.
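The value of a standard multimodal data class is that several per-modality matrices are tied to a shared set of cells. As a language-neutral illustration of that data structure (a conceptual Python analogue, not the package's actual R/Bioconductor API), a multimodal container can be modelled as modality name -> per-cell feature map:

```python
class MultiModalData:
    """Toy multimodal container: each modality maps cell barcode -> features."""

    def __init__(self, modalities):
        # e.g. {"rna": {barcode: counts}, "adt": {barcode: protein levels}}
        self.modalities = modalities

    def common_cells(self):
        """Cells measured in every modality (the cross-modal intersection)."""
        sets = [set(mod) for mod in self.modalities.values()]
        return set.intersection(*sets)

    def subset(self, cells):
        """Restrict all modalities to the given cells, keeping them aligned."""
        return MultiModalData({
            name: {c: vec for c, vec in mod.items() if c in cells}
            for name, mod in self.modalities.items()
        })
```

Keeping the barcode bookkeeping inside one container is what frees integrative analyses from ad hoc matching code, which is the convenience the package provides within Bioconductor.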
Recombinant adeno-associated virus (rAAV)-mediated gene therapy has been applied to human diseases. However, rAAV capsids contain heterogeneous mixtures of full-length and truncated genomes, as well as residual host cell and plasmid DNA, depending on the manufacturing process. Therefore, a method is needed to characterize the encapsidated DNA of rAAV products in order to support process development and batch release. Emerging long-read sequencing (LRS) has achieved single-genome resolution for AAV. Here we propose a Python-based LRS profiling framework to classify and quantitate residual DNA species in rAAV products. We designed a reference that contains universal genetic components commonly used in rAAV production, including the AmpR, KanR, Rep, and Cap genes along with the HPV18, Ad5, and hg38 genomes. We assessed the impurities of rAAV production from public and in-house LRS datasets. Analyzing the lambda fragments supplemented in these datasets showed that sequencing introduced size biases, which could not be fully rescued by regression but can be mitigated during library preparation. The functional potential of impurities was assessed through indicators derived from long-read alignments, which enabled us to quantitatively compare impurities between manufacturing batches. We demonstrated that LRS provides informative metrics for rAAV production and can facilitate process development to ensure therapeutic product safety and quality.
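The classify-and-quantitate step can be sketched as assigning each long read to its best-scoring reference component and reporting per-component read fractions. This is an illustration of the principle only; the alignment records and component names below are hypothetical, and a real framework would parse actual alignments (e.g. minimap2 output) against the composite reference described above.

```python
from collections import Counter

def classify_reads(alignments):
    """Assign each read to its best-scoring reference component.

    alignments: list of (read_id, component, alignment_score) tuples,
    one per candidate hit. Returns read_id -> best component.
    """
    best = {}
    for read_id, component, score in alignments:
        if read_id not in best or score > best[read_id][1]:
            best[read_id] = (component, score)
    return {r: comp for r, (comp, _) in best.items()}

def component_fractions(assignments):
    """Fraction of reads per reference component; components other than
    the vector genome represent residual-DNA impurities."""
    counts = Counter(assignments.values())
    total = sum(counts.values())
    return {comp: n / total for comp, n in counts.items()}
```

Comparing these fractions across batches is the kind of quantitative impurity metric the abstract describes.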
Increasingly detailed investigations of the spatial organization of genomes reveal that chromosome folding influences or regulates dynamic processes such as transcription, DNA repair, and segregation. The Hi-C approach is commonly used to characterize genome architecture by quantifying the frequency of physical contacts between pairs of loci through high-throughput sequencing. Repeated sequences cause challenges during the alignment step of the analysis, due to the multiplicity of plausible positions to which sequencing reads can be assigned. These unknown parts of the genome architecture, which may contain biological information, remain hidden throughout downstream functional analysis. To overcome these limitations, we have developed HiC-BERG, a method combining statistical inference with input from DNA polymer behavior characteristics and features of the Hi-C protocol to assign repeated reads in a genome with robust confidence and to "fill in" empty vectors in contact maps. HiC-BERG is intended to be applicable to different types of organisms. We will present the program and key validation tests before applying it to unveil hidden parts of the genomes of E. coli, S. cerevisiae, and P. falciparum. HiC-BERG shows that repeated sequences may be involved in distinctive genomic architectures. Our method can provide an alternative visualization of genomic contacts under a wide variety of biological conditions, allowing a more complete view of genome plasticity.
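The statistical idea of combining polymer behavior with read assignment can be sketched as follows: when a read from a repeated sequence has several plausible genomic positions, each candidate can be weighted by the expected contact probability with its uniquely mapped mate, using the well-known power-law decay of Hi-C contact frequency with genomic distance, P(s) ~ s^-alpha. This is an illustration of the principle, not HiC-BERG's actual model; the exponent and positions are toy values.

```python
def reassignment_weights(mate_pos, candidate_positions, alpha=1.0):
    """Probability of each candidate position for an ambiguous Hi-C read.

    mate_pos: position of the uniquely mapped mate (same chromosome assumed).
    candidate_positions: plausible positions of the repeated read.
    Weights follow a power-law decay of contact frequency with distance,
    P(s) ~ s**-alpha, normalized over the candidates.
    """
    # `or 1` guards against a zero distance (read overlapping its mate).
    raw = [(abs(p - mate_pos) or 1) ** -alpha for p in candidate_positions]
    total = sum(raw)
    return [w / total for w in raw]
```

Under this prior, a repeat copy 10 kb from the mate receives almost all of the probability mass relative to a copy several Mb away, which is how intra-chromosomal contacts of repeated elements can be recovered with confidence.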