Monday, July 11 and Tuesday, July 12 between 12:30 PM CDT and 2:30 PM CDT |
Wednesday, July 13 between 12:30 PM CDT and 2:30 PM CDT |
---|---|
Session A Poster Set-up and Dismantle
Session A Posters set up: Monday, July 11 between 7:30 AM CDT and 10:00 AM CDT. Session A Posters dismantle: Tuesday, July 12 at 6:00 PM CDT |
Session B Poster Set-up and Dismantle
Session B Posters set up: Wednesday, July 13 between 7:30 AM CDT and 10:00 AM CDT. Session B Posters dismantle: Thursday, July 14 at 2:00 PM CDT |
Presentation Overview:
Recent advances in single-cell RNA-seq (scRNA-seq) technology have made it possible to perform high-throughput, large-scale transcriptome profiling at single-cell resolution. Unsupervised learning, such as data clustering, is central to identifying and characterizing novel cell types and gene expression patterns. Clustering is used to computationally identify groups of cells by comparing the gene-expression profiles of the groups. The result of the clustering enables us to summarize complex scRNA-seq data into a digestible format for human interpretation. This allows us to describe population heterogeneity with discrete labels that are easier to understand, rather than trying to understand the higher-dimensional manifolds in which cells exist.
Common approaches include graph-based clustering and k-means clustering, in which the gene expression values of each cell serve as features. However, these methods do not reflect the cell types in each cluster. We therefore propose a clustering method that uses the results of cell type identification as features. Cell type identification yields a score for each cell type, and the method uses the proportions derived from these scores for clustering. We will demonstrate the performance of the method by applying it to several benchmark single-cell RNA-seq datasets.
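As a rough illustration of clustering on cell type scores rather than on genes, the sketch below normalizes each cell's scores to proportions and groups the cells with a small k-means routine. It is not the authors' method: the use of k-means, the deterministic initialization, the function names, and the example scores are all assumptions made for this sketch.

```python
def to_proportions(scores):
    """Normalize one cell's (positive) cell type scores so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Tiny k-means; the first k points are the initial centers (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign every cell to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Move each center to the mean of its assigned cells.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical scores per cell for three cell types (e.g. T cell, B cell, monocyte).
cell_scores = [
    [8.0, 1.0, 1.0],
    [1.0, 8.0, 1.0],
    [9.0, 1.0, 0.5],
    [1.0, 9.0, 1.0],
]
features = [to_proportions(s) for s in cell_scores]
cluster_labels = kmeans(features, k=2)
```

With these made-up scores, the first and third cells share one cluster and the second and fourth share the other, so each cluster can be read directly in terms of its dominant cell type.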
Presentation Overview:
Clonal hematopoiesis (CH) of indeterminate potential (CHIP) is a premalignant state in which leukemia-associated driver genes acquire somatic mutations in peripheral blood at a variant allele frequency (VAF) of 2% or greater, yet the individual does not meet the World Health Organization diagnostic criteria for a hematologic neoplasm. CHIP is a risk factor for various hematologic malignancies and cardiovascular diseases. The VAF cutoff of >=2% was set arbitrarily, reflecting both the limitations of standard next-generation sequencing (NGS) platforms in detecting small clones, owing to their relatively high sequencing error, and the rarity of clinical consequences associated with mutations at lower VAFs. However, individuals with CH at >=1% VAF in leukemia driver genes also had a significantly increased risk of developing acute myeloid leukemia (AML), as did those with >=2% VAF. Popular variant calling algorithms used for CHIP detection often lose power on variants with low VAFs, and no analytical pipeline has yet been developed specifically for CHIP detection. This study presents UNIfied SOmatic calling of Next-generation sequencing data (UNISON), a software toolkit designed for streamlined CHIP discovery in population studies, even with suboptimal sequencing coverage. UNISON should be broadly applicable to CHIP detection in large-scale whole-exome sequencing (WES) and whole-genome sequencing (WGS) projects.
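The VAF cutoffs discussed above reduce to simple arithmetic on read counts. The sketch below shows that arithmetic only; it is not part of UNISON, and the function names and read counts are hypothetical.

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads covering a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def passes_vaf_cutoff(alt_reads: int, total_reads: int, cutoff: float = 0.02) -> bool:
    """True when the VAF is at or above the cutoff (2% by default)."""
    return variant_allele_frequency(alt_reads, total_reads) >= cutoff


# Hypothetical site: 6 variant-supporting reads out of 400 (VAF = 1.5%).
at_two_percent = passes_vaf_cutoff(6, 400)          # below the 2% cutoff
at_one_percent = passes_vaf_cutoff(6, 400, 0.01)    # above a 1% cutoff
```

At 100x coverage a 2% VAF corresponds to only two supporting reads, which suggests why small clones are hard to separate from sequencing error when coverage is suboptimal.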
Presentation Overview:
Circular chromosome conformation capture with high-throughput sequencing (4C-seq) is a next-generation sequencing technique that offers detailed insights into the three-dimensional structure of the genome around a chosen viewpoint. Since 4C-seq is characterized by a semi-quantitative, fragmented data structure and technical biases that distort the actual signal, the analysis strategies have to be adapted accordingly. 4C-seq replicate experiments are invaluable for the detection of regular and differential interactions between conditions, but pose additional challenges for the analysis.
We present Basic4CVis, an R/Shiny app for the analysis and visualization of 4C-seq data. The R package offers routines for the filtering and quality control of a 4C-seq experiment's virtual fragment library data, related statistical overviews per sample, functionality for both near-cis and far-cis visualization of single samples and replicate interactions, and a user-friendly graphical user interface that allows users to display overlaps and specific interactions for settings with multiple conditions and differential interactions. While standard data preprocessing and virtual fragment library generation are conducted with the R/Bioconductor package Basic4Cseq, sets of interacting regions can be imported from other 4C-seq analysis algorithms or text files for visualization purposes. Thus, Basic4CVis is a flexible addition to 4C-seq analyses with replicates or multiple conditions.
Presentation Overview:
Bioinformaticians often write one-off computer programs to perform a specific task instead of reusing existing code. This practice leads to lower software quality and non-reusable code. As bioinformatics analyses become increasingly complex and deal with ever more data, high-quality code is needed for reliable and reproducible performance. The solution is well-designed and well-documented libraries, such as SeqAn, a C++ library that implements algorithms and data structures commonly used in bioinformatics. Here, we present the btllib library as an addition to this ecosystem, with the goal of providing highly efficient, scalable, and ergonomic implementations of bioinformatics algorithms and data structures. The library is implemented in C++, with Python bindings available as a high-level interface. What sets it apart from other libraries is its focus on specialized algorithms designed for efficiency and scalability, as its aim is to enable sequence processing for large genomes. Parallelization, thread safety, and race condition minimization where they help performance (e.g., by maximizing throughput) are core principles of btllib. The goal of btllib is not to compete with, but to complement, other available libraries with applications in bioinformatics and genomics research.
Presentation Overview:
The mouse is a widely studied animal and a valuable model organism, as in many respects its biology is conserved with ours. However, the complexity of its transcriptome is not yet fully elucidated. One of the mechanisms responsible for this complexity is alternative splicing (AS), which has been reported in almost all mammalian genes and may be one of the most widely exploited mechanisms for increasing transcriptomic and proteomic complexity. We present the results of studying splicing events in data from an experiment focused on neuropathic pain in mice. We focused on finding commonalities across a collection of fairly diverse samples in order to expand the currently known mouse transcriptomic landscape. We hypothesized that including different pathological factors should allow the detection of novel alternative splicing events (nASEs) in the spinal cord that are characteristic of all conditions. We found that the vast majority of nASEs are common to all samples in our study. Furthermore, functional analysis showed a clear connection to the nervous system. This result might indicate that the mouse reference model lacks information for brain tissue, but may also reflect expected neuroplasticity.
Presentation Overview:
Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM).
We propose GenASM, the first ASM acceleration framework for genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint, and we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth.
We demonstrate that GenASM is a flexible, high-performance, and low-power framework, which provides significant performance and power benefits for three different use cases in genome sequence analysis: 1) GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while consuming 37x and 2.7x less power. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. 2) GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while consuming 1.7x less power and significantly improving the filtering accuracy. 3) GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while consuming 548-582x and 67x less power.
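For readers unfamiliar with Bitap, the sketch below shows the textbook bit-parallel formulation of approximate matching (the Wu-Manber extension that tolerates up to `max_errors` edits). It illustrates the baseline algorithm only, not the modified version or the hardware design that GenASM proposes; the function name and the "1 means match" bit convention are choices made for this sketch.

```python
def bitap_search(text: str, pattern: str, max_errors: int = 0) -> list[int]:
    """Return end indices in `text` where `pattern` matches within `max_errors` edits."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1      # keeps every state vector m bits wide
    accept = 1 << (m - 1)    # set when the whole pattern has been matched

    # masks[c] has bit i set when pattern[i] == c.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)

    # state[d], bit i: pattern[:i + 1] matches a suffix of the text read so far
    # with at most d errors. Before any text is read, up to d leading pattern
    # characters can be skipped, so the lowest d bits start set.
    state = [((1 << d) - 1) & full for d in range(max_errors + 1)]

    hits = []
    for j, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev = state[:]  # states before reading this character
        state[0] = ((prev[0] << 1) | 1) & mask
        for d in range(1, max_errors + 1):
            match = ((prev[d] << 1) | 1) & mask
            substitution = (prev[d - 1] << 1) | 1
            skip_pattern_char = (state[d - 1] << 1) | 1
            skip_text_char = prev[d - 1]
            state[d] = (match | substitution | skip_pattern_char | skip_text_char) & full
        if state[max_errors] & accept:
            hits.append(j)
    return hits


exact_hits = bitap_search("xabcx", "abc")                 # exact matching
fuzzy_hits = bitap_search("abd", "abc", max_errors=1)     # one edit allowed
```

Each text character costs a constant number of shifts and bitwise operations per error level, and that regular structure is what makes the algorithm a natural candidate for parallel hardware.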
Presentation Overview:
An increasing number of genome assembly projects are exclusively utilizing long sequencing reads despite their still appreciable error rates (read accuracy of 87-98%), and polishing of the resulting assemblies is highly desirable. Popular methods for polishing assemblies with long reads, e.g. Racon, rely on sequence alignments for nucleotide base error correction, a costly paradigm that, although robust, does not scale to large (>3 Gbp) genomes, requiring large-memory servers and long run times. We present GoldRush-Edit, a RAM-efficient polishing pipeline that corrects base errors in long-read assemblies using a scalable and targeted k-mer-based method. GoldRush-Edit uses ntEdit for fast correction and tagging of low-quality sequence, then Sealer for finishing these tagged regions using an implicit de Bruijn graph. To achieve acceptable base accuracy, each genomic locus under scrutiny obtains its long-read mappings from ntLink, a lightweight minimizer-based scaffolder and gap-filling application. The k-mers of the corresponding long reads are extracted to build reusable, targeted in-memory Bloom filters for use by ntEdit and Sealer. GoldRush-Edit achieves a reduction of more than 60% in both indels and mismatches on an assembly of the genome of a human cell line, NA24385, using matched nanopore long-read data exclusively, while using orders of magnitude less memory than Racon.
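The targeted Bloom filters described above can be pictured with a minimal sketch. It is an illustration only and does not reproduce the ntEdit or Sealer implementations; the class name `KmerBloomFilter`, the SHA-256 based hashing, the filter size, and the example read are all assumptions chosen for brevity.

```python
import hashlib


def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerBloomFilter:
    """Fixed-size bit array supporting approximate k-mer membership queries."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer: str):
        # Derive several bit positions by salting one hash function with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer: str) -> None:
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer: str) -> bool:
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer))


# Hypothetical long read mapped to the locus being polished.
read = "ACGTACGTTGCAAGCT"
locus_filter = KmerBloomFilter()
for km in kmers(read, 5):
    locus_filter.add(km)
```

Because the filter stores only bits rather than the k-mers themselves, its memory use is fixed by `num_bits`, at the price of occasional false positives.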
Presentation Overview:
Generating high-quality de novo genome assemblies for model and non-model organisms opens the door to a plethora of important downstream studies. To leverage the repeat-spanning evidence from long-read sequencing technologies, we previously developed ntLink, a minimizer-based long-read scaffolding tool. However, most scaffolders, including ntLink, introduce gap sequences (“N”s) between joined sequences, leaving large stretches of unresolved assembly bases, and naively join overlapping sequences. To address these limitations, we added two new features to ntLink: overlap detection and gap-filling. These features are crucial to our new de novo long-read assembly tool, GoldRush, and are integrated in the GoldRush-Link stage of the pipeline. Both the overlap detection and gap-filling features are alignment-free, relying on lightweight minimizer mappings. As demonstrated by tests on assemblies from human individuals NA24385 and NA19240, these new features increase the contig NGA50 lengths 502-fold and 7-fold, respectively, while maintaining the high scaffold NGA50 lengths achieved through scaffolding with ntLink. With these two functionalities, >99% of gaps were filled for each individual, leaving fewer than 55 N’s per 100 kbp in the final assemblies. These modular improvements in ntLink should benefit a wide variety of assembly workflows, including but not limited to GoldRush.
Presentation Overview:
Although current personalized cancer treatment approaches primarily target mutations in protein-coding regions, the relevance of non-coding regulatory regions has been previously demonstrated. However, the ability to detect these mutations through statistical methods is limited by their low recurrence and the resulting lack of statistical power. We overcome this by applying the REMIND-Cancer Pipeline, an integrative computational pipeline that combines genomic, transcriptomic, and chromatin accessibility information to identify functional promoter mutations. The pipeline consists of three major steps: (1) exclude all mutations that show no potential for increasing gene expression by modifying promoter sequences, (2) rank the remaining candidates by a multivariate scoring function, and (3) allow for the in-depth analysis of the top-scoring mutations through a multi-functional visualization tool. We analyzed the publicly available PCAWG dataset, which consists of 2,583 patients and 43,639,986 SNVs, and the pipeline, together with manual inspection of its candidates, highlighted 8 candidate promoter mutations. In validation experiments, 7 of these mutations exhibited an increase in promoter activity when comparing the mutant to its wild type. With a specificity of 87.5% and a 3-week lab validation turnover, our method represents a substantial improvement over existing workflows, and the pipeline is approaching applicability in precision oncology programs.
Presentation Overview:
RNA expression at the isoform level can potentially reveal cellular subsets and corresponding biomarkers that are not visible at the gene level. However, due to the strong 3ʹ bias of the sequencing protocol, mRNA quantification for high-throughput single-cell RNA sequencing, such as 10× Genomics Chromium Single Cell 3ʹ, is currently performed at the gene level. We have developed an isoform-level quantification method for high-throughput single-cell RNA sequencing by exploiting the concepts of transcription clusters and isoform paralogs. The method, called Scasa, compares well in simulations against competing approaches, including Alevin, Cellranger, Kallisto, Salmon, Terminus, and STARsolo, at both isoform- and gene-level expression. Reanalysis of a CITE-Seq dataset with isoform-based Scasa reveals a subgroup of CD14 monocytes missed by gene-based methods.
Presentation Overview:
Advancements in state-of-the-art molecular profiling techniques have resulted in a better understanding of pediatric cancers and their drivers. Many new types and subtypes of pediatric cancers have been identified with distinct molecular and clinical characteristics. The ITCC-P4 consortium is a preclinical collaboration between academic centers across Europe and several pharmaceutical companies, with the overall aim of establishing a sustainable platform of >400 molecularly well-characterized PDX models of high-risk pediatric cancers and using them for in vivo testing of novel mechanism-of-action based treatments. Currently, 340 models are fully established, including 87 brain and 253 non-brain tumor models, together representing different tumor types both from primary (113) and relapsed (92)/metastatic disease (42). Of these models, 252 have been fully molecularly characterized, representing 18 pediatric cancer entities and 43 different subtypes. Using low-coverage whole-genome and whole-exome sequencing, somatic mutation calling, DNA copy number analysis, transcriptome analysis, and methylation profiling, we have observed that the molecular profile of most PDX models closely mimics that of their original tumors. Clonal evolution of somatic variants was only observed in some PDX-tumor pairs or between disease states. Somatic copy number variant analysis highlights specific alterations; for instance, MYB, MYC, MYCN, NTRK3, and PTEN loss are distributed differently between PDX-patient tumor pairs in high-grade gliomas.
Presentation Overview:
Mapping genomic sequences to references is an essential step in genomic analysis. Since the early days of genomics research, developers of genomic sequence mapping and alignment tools have devoted great effort to improving accuracy and decreasing resource usage. Over the years, mapping software has improved substantially, fueled by the diversity of data structures and algorithms developed by the community. Here we present miBF-mapper, a long-read mapping tool that indexes the reference genome with our in-house data structure, the multi-indexed Bloom Filter (miBF). Considering >10% overlap with the true region as a correct mapping, miBF-mapper achieved 99.9% accuracy in mapping 45k simulated ONT reads to the C. elegans reference in 5 minutes 32 seconds using 1 GB of RAM, and 92.6% accuracy in mapping 50k simulated ONT reads to the H. sapiens GRCh38 reference in 35 minutes 21 seconds using 148 GB of RAM. We discuss the miBF-mapper algorithm in detail; it is a successful application of the novel miBF data structure that should be of interest to the community.
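The correctness criterion quoted above (more than 10% overlap with the true region) can be written down as a small check. The abstract does not say what the overlap is measured against, so this sketch assumes it is a fraction of the true region's length; it also assumes half-open coordinates on the same reference sequence, and the function names are hypothetical.

```python
def overlap_fraction(mapped: tuple[int, int], truth: tuple[int, int]) -> float:
    """Fraction of the true interval covered by the mapped interval.

    Intervals are half-open (start, end) pairs on the same reference sequence.
    """
    start = max(mapped[0], truth[0])
    end = min(mapped[1], truth[1])
    overlap = max(0, end - start)
    return overlap / (truth[1] - truth[0])


def is_correct_mapping(mapped, truth, min_overlap: float = 0.10) -> bool:
    """True when the mapped interval covers more than `min_overlap` of the truth."""
    return overlap_fraction(mapped, truth) > min_overlap


def mapping_accuracy(pairs) -> float:
    """Share of (mapped, truth) pairs judged correct."""
    return sum(is_correct_mapping(m, t) for m, t in pairs) / len(pairs)


# Hypothetical simulated reads: (mapped interval, true interval).
example_pairs = [
    ((100, 200), (150, 250)),   # 50% of the true region covered
    ((0, 105), (100, 200)),     # only 5% covered
]
accuracy = mapping_accuracy(example_pairs)
```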
Presentation Overview:
The tropical tasar silkworm, a semi-domesticated wild sericigenous insect, is found in India in the form of 44 ecoraces with variations in phenotypic traits. Across its wide range of distribution, the species has encountered the diverse geographic and climatic conditions of distinct areas, leading to marked differences not only in phenotypic and physiological traits but also in commercial and technological aspects. A. mylitta Drury, which is an exclusive ecorace of the states of Andhra Pradesh and Telangana, is well known for its superior commercial characters but is on the verge of extinction due to its weaknesses in voltinism, emergence, hatching, low yield, etc. Conservation of the ecoraces is essential to utilize their valuable genes in enhancing productivity and to build variation in new populations through hybridization. Modern sequencing methods such as NGS technologies and in silico analysis are used in population genetic studies to investigate the evolutionary forces affecting genetic variation. In the present study, the genomic DNA of the parental ecoraces of A. mylitta, Andhra local and Daba TV, and of their hybrid populations was sequenced independently using the Illumina NextSeq500 in order to analyze their genetic relationships. The sequencing libraries had fragment sizes ranging from 200 bp to 700 bp, and 35877 sites were identified in 8 samples. Further, the phylogenetic tree showed closely and distantly related taxa among the populations.
Presentation Overview:
Every day, massive amounts of data are generated by next-generation sequencing (NGS) technologies. However, streamlined analysis remains a major barrier to effectively utilizing the technology. In recent years, many algorithms, statistical methods, and software tools have been developed to perform the individual analysis steps of various NGS applications. We have developed a Python package, pySeqRNA, that allows fast, efficient, manageable, and reproducible RNA-Seq analysis, with a uniform workflow interface and support for running on a High-Performance Computing Cluster (HPCC) as well as on local computers. It is an extensible pipeline for performing end-to-end analysis with automated report generation. The pySeqRNA workflow consists of quality checking and pre-processing of raw sequence reads; accurate mapping of millions of sequencing reads to a reference genome; quantification of gene expression levels in two ways, from (i) uniquely mapped reads and (ii) multi-mapped groups, a novel feature; differential analysis of gene expression among different biological conditions; and functional enrichment analysis. By integrating several command-line tools and custom Python scripts, it allows effective use of existing software and tools alongside newly written modules, without restricting users to a collection of pre-defined methods and environments. This package accelerates the retrieval of reproducible results from NGS experiments. pySeqRNA is available at http://bioinfo.usu.edu/pySeqRNA/.
Presentation Overview:
Database fingerprinting has been widely used to discourage unauthorized redistribution of data by providing means to identify the source of data leakages. However, there is no fingerprinting scheme aiming at achieving liability guarantees when sharing genomic databases. Thus, we are motivated to fill in this gap by devising a vanilla fingerprinting scheme specifically for genomic databases. Moreover, since malicious genomic database recipients may compromise the embedded fingerprint by launching effective correlation attacks which leverage the intrinsic correlations among genomic data (e.g., Mendel’s law and linkage disequilibrium), we also augment the vanilla scheme by developing mitigation techniques to achieve robust fingerprinting of genomic databases against correlation attacks.
We first show that correlation attacks against fingerprinting schemes for genomic databases are very powerful. In particular, the correlation attacks can distort more than half of the fingerprint bits while causing only a small utility loss (e.g., in database accuracy and in the consistency of SNP-phenotype associations measured via p-values). Next, we show experimentally that the correlation attacks can be effectively mitigated by our proposed mitigation techniques. We validate that the attacker can hardly compromise a large portion of the fingerprint bits even if it pays a higher cost in terms of degradation of the database utility. For example, with around 24% loss in accuracy and 20% loss in the consistency of SNP-phenotype associations, the attacker can only distort about 30% of the fingerprint bits, which is insufficient for it to avoid being accused. We also show that the proposed mitigation techniques preserve the utility of the shared genomic databases.
Presentation Overview:
In Alzheimer’s disease (AD), reactive microglia and astrocytes are suggested to disrupt neuronal functions potentially leading to neurodegeneration and cognitive decline. As microglia and astrocytes may act together in the disease process, we aimed to identify ligand-receptor interaction pairs involved in AD pathology by using single nuclei RNA sequencing (snRNA-seq) NeuN-negative profiles of glial cells from brain tissue of 18 AD and control donors.
Six permutation-based approaches implemented in the LIANA framework (CellChat, Connectome, iTALK, CellPhoneDB, NATMI and SCA) that assigned cell-cell interaction scores were used in combination to identify astrocyte-microglia subtype interactions specific to AD. A public snRNA-seq study including 24 AD and 24 control donors (Mathys et al., 2019) was not only used to validate these interactions but was further utilized to infer glial-neuronal interactions. Interactions associated with progression of AD were identified by correlating interaction scores to pathological determinants such as APOE status, Braak stage, total tangle and total plaque. Interactions uniquely occurring in early and late pathology AD indicated involvement of specific biological processes in different disease stages.
This study highlights human glial-neuronal interactions with known AD GWAS hits, drug targets or association to AD pathology.
Presentation Overview:
Aligning sequencing reads onto a reference is an essential step in the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today’s diverse array of alignment methods. We survey algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2021, for both short and long reads. We discuss the weaknesses and strengths of the algorithms using our rigorous experimental evaluation. We separately discuss how longer read lengths bring unique advantages and limitations to read alignment techniques.
Our review focuses on the interplay between technological development and algorithm development. It can explain the success behind popular read aligners, guide the choice of the most appropriate read alignment tools for particular problems, and identify new algorithmic research directions in response to the advancement of long-read technologies and novel sequencing protocols. It also discusses how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoires of T and B cell receptors, and human microbiome studies.
Presentation Overview:
The recent breakthrough of single-cell RNA velocity methods holds the promise of automatically identifying directed trajectories in cell differentiation, state transitions, and responses to perturbations, a capability uniquely needed for in vivo applications and abnormal conditions. However, existing RNA velocity methods, including scVelo, often return erroneous results, partly because complex expression profiles violate model assumptions or because temporal regularization is lacking. Here, we present UniTVelo, a statistical framework for RNA velocity that models the flexible transcription dynamics of spliced and unspliced RNAs via a spliced-RNA-oriented design. Uniquely, it also supports the effective inference of a unified latent time across genes and orders cells on individual genes in the phase portrait, especially for genes with multiple-rate kinetics and those with stable, monotonic changes across the transcriptome. Using ten datasets, we demonstrate that UniTVelo returns the expected trajectories in different biological systems, including hematopoietic differentiation and systems with weak kinetics or complex branches. Specifically, UniTVelo correctly identifies the differentiation trajectories of human bone marrow development, from hematopoietic stem cells to three distinct branches. This system is complex and cannot be fully resolved by other currently available RNA velocity methods.
Presentation Overview:
Single-cell transcriptomics has been used to study dynamical processes such as cell differentiation. RNA velocity (La Manno et al. 2018) was a breakthrough towards obtaining a more complete description of the dynamics of such processes. Here, simultaneous measurement of new unspliced and old spliced mRNA adds a temporal dimension to the data. The change in mRNA abundance, called RNA velocity, is used to infer the progression of cells through the dynamical process. However, reliable velocity analysis is still impeded by multiple computational issues. State-of-the-art methods (Bergen et al. 2020) have issues in both velocity inference and visualisation. Moreover, there are inconsistencies in current processing pipelines, and the single-cell specific (stochastic) part of the dynamics is lost through multiple layers of data smoothing.
We introduce a new method for RNA velocity analysis that addresses some of the issues in velocity estimation. We also propose that visualisation of the velocities based on the Nystroem projection method represents the single-cell stochasticity better than current practices. Finally, we adjust the processing pipeline for consistency with downstream velocity estimation. We validate our model on simulated and real data, and compare it to the current state of the art.
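To make the notion of velocity concrete, the sketch below uses the widely cited steady-state formulation in which spliced abundance s changes as ds/dt = beta * u - gamma * s, with u the unspliced abundance. It is background illustration only, not the new method introduced in this abstract; fitting gamma by a least-squares line through the origin over all cells is a simplification, and the function names and counts are hypothetical.

```python
def fit_gamma(unspliced, spliced):
    """Least-squares slope through the origin of unspliced against spliced counts."""
    numerator = sum(u * s for u, s in zip(unspliced, spliced))
    denominator = sum(s * s for s in spliced)
    return numerator / denominator


def velocity(u, s, gamma, beta=1.0):
    """Rate of change of spliced mRNA: beta * u - gamma * s."""
    return beta * u - gamma * s


# Hypothetical counts for one gene in three cells assumed to be at steady state.
steady_unspliced = [1.0, 2.0, 3.0]
steady_spliced = [2.0, 4.0, 6.0]
gamma = fit_gamma(steady_unspliced, steady_spliced)

# A cell with more unspliced mRNA than the steady state predicts is inducing the gene.
v_induced = velocity(3.0, 4.0, gamma)
```

A positive value means spliced mRNA is expected to increase, a negative value that it is expected to decrease, which is how velocities point cells along the dynamical process.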
Presentation Overview:
DNA sequencing is a unique way to gain insight into the structure of the genome and the functions of an organism. In this study, we compared the widely used Illumina short-read and Oxford Nanopore long-read sequencing technologies for the structural and functional annotation of non-model bacteria. We examined Schlegelella thermodepolymerans strains DSM 15264, LMG 21645, and CCUG 50061, non-model Gram-negative bacteria of industrial interest. Although these bacteria have significant potential for producing polyhydroxyalkanoates, degradable bioplastics, from agro-food industry waste, assemblies of their genomes are not available.
The results revealed Nanopore sequencing to be the more efficient approach for initial genome characterization. Compared to Illumina, Nanopore revealed more structural genomic features and assigned more genes to the Clusters of Orthologous Groups (COGs). Moreover, Nanopore assemblies had a largest contig and an N50 many times higher, and a number of contigs many times lower, than Illumina assemblies. On the other hand, Nanopore sequencing has been shown to be error-prone. Consequently, individual genomic features in Nanopore assemblies are less accurate, resulting in incomplete structural annotation and incorrect functional annotation in several cases. Illumina sequencing is, therefore, more applicable to detailed studies of specific genomic regions.
Presentation Overview:
Motivation:
Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is however a lack of understanding behind the theory of why certain methods perform well.
Results:
We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and performed a long-read transcriptome mapping experiment. Our results give new insights into how new k-mer selection strategies offer new parameterizations that can be used for optimizing speed and alignment quality.
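Two of the selection schemes named above can be sketched in a few lines. This is a simplified illustration rather than the authors' code or the modified minimap2: lexicographic order stands in for the hash-based ordering that tools normally use, ties are broken by the leftmost position, and the function names and the parameters `w`, `s`, and `t` are labels chosen for this sketch.

```python
def minimizers(seq: str, k: int, w: int) -> list[int]:
    """Start positions of the smallest k-mer in every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Lowest-ordered k-mer in the window; leftmost position wins ties.
        selected.add(min(window, key=lambda i: (kmers[i], i)))
    return sorted(selected)


def open_syncmers(seq: str, k: int, s: int, t: int = 0) -> list[int]:
    """Start positions of k-mers whose smallest s-mer sits at offset t."""
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # list.index returns the leftmost occurrence of the minimal s-mer.
        if smers.index(min(smers)) == t:
            selected.append(i)
    return selected


sequence = "ACGTACGT"
minimizer_positions = minimizers(sequence, k=3, w=2)
syncmer_positions = open_syncmers(sequence, k=3, s=1)
```

Whether a k-mer is a syncmer depends only on the k-mer itself, whereas a minimizer depends on its neighbours in the window, so a mutation in an adjacent k-mer can change which minimizer is selected. This difference is one intuition behind the conservation results discussed above.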
Presentation Overview:
RNA-Seq is the leading technology for genome-wide transcript quantification and characterization. RNA-Seq data may contain useful information about transcribed genetic variants. However, accurate variant calling from RNA-Seq data is challenging due to the huge variation in depth of coverage. We leveraged DeepVariant to call variants by retraining a previous whole-exome sequencing CNN model with RNA-Seq alignments. We trained on RNA-Seq data from three cell lines used as sources of the Genome-in-a-Bottle (GiaB) reference materials. To represent assay variability, we sequenced HG002 and HG005 in triplicate and HG001 in 10 replicates. Benchmarking shows that our training improves the F1 score for all coding regions from 0.08 for the initial whole-exome sequencing starting model to 0.64 after the training cycle. With a genotype quality score threshold set to provide a ≤1.5% false discovery rate, we obtained a sensitivity of 37% for all coding regions and 92% for coding regions of highly expressed genes. Our results show that DeepVariant models trained with RNA-Seq data and high-quality truth sets can deliver accurate germline variant calls.
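The benchmarking metrics quoted above follow from counts of true positive, false positive, and false negative calls against a truth set. The sketch below shows the standard definitions only; the function names and the counts are hypothetical and are not taken from the study.

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def sensitivity(tp: int, fn: int) -> float:
    """Also called recall: the share of true variants that were called."""
    return tp / (tp + fn) if tp + fn else 0.0


def false_discovery_rate(tp: int, fp: int) -> float:
    """Share of calls that are wrong; equals 1 - precision when calls exist."""
    return fp / (tp + fp) if tp + fp else 0.0


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and sensitivity."""
    p, r = precision(tp, fp), sensitivity(tp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical benchmark: 80 correct calls, 20 wrong calls, 20 missed variants.
example_f1 = f1_score(tp=80, fp=20, fn=20)
example_fdr = false_discovery_rate(tp=80, fp=20)
```

Raising the genotype quality threshold typically lowers the false discovery rate at the cost of sensitivity, which is the trade-off the reported 37% and 92% sensitivities reflect.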
Presentation Overview:
The ability to identify and track T cell receptor (TCR) sequences from patient samples has become central to the field of cancer research. The established high-throughput method for profiling T cell receptor repertoires is TCR sequencing (TCR-Seq). However, the available TCR-Seq data is limited compared to RNA sequencing data. We have benchmarked the ability of RNA-Seq-based methods to profile TCR repertoires by examining 19 bulk RNA-Seq samples across four cancer cohorts, including both T cell rich and T cell poor tissues. We have performed a comprehensive evaluation of the existing RNA-Seq-based repertoire profiling methods using targeted TCR-Seq as the gold standard. We also highlight scenarios in which the RNA-Seq approach is suitable and can provide accuracy comparable to that of the TCR-Seq approach. Results show that these methods are able to effectively capture the clonotypes and estimate the diversity of TCR repertoires, as well as provide relative frequencies of clonotypes, in T cell rich tissues and monoclonal repertoires. However, these methods have limited power in T cell poor tissues, especially in polyclonal repertoires. The results of our benchmarking provide an appealing argument for incorporating RNA-Seq into immune repertoire screening of cancer patients, as it offers insight into transcriptomic changes beyond the limited information provided by TCR-Seq.
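Relative clonotype frequencies and repertoire diversity, as mentioned above, can be computed from a list of observed clonotypes. The abstract does not name a diversity measure, so this sketch assumes Shannon entropy; the function names and the clonotype labels are hypothetical placeholders, not real CDR3 sequences.

```python
import math
from collections import Counter


def clonotype_frequencies(clonotypes):
    """Relative frequency of each clonotype among the observed receptor reads."""
    counts = Counter(clonotypes)
    total = sum(counts.values())
    return {clone: n / total for clone, n in counts.items()}


def shannon_diversity(frequencies):
    """Shannon entropy (natural log); 0 for a monoclonal repertoire."""
    return -sum(p * math.log(p) for p in frequencies.values() if p > 0)


# Hypothetical repertoires: one dominated by a single clone, one evenly polyclonal.
monoclonal = ["clone_A"] * 20
polyclonal = ["clone_A", "clone_B", "clone_C", "clone_D"] * 5

mono_diversity = shannon_diversity(clonotype_frequencies(monoclonal))
poly_diversity = shannon_diversity(clonotype_frequencies(polyclonal))
```

In a polyclonal repertoire each clonotype contributes few reads, which is consistent with the limited power reported for T cell poor tissues, where receptor-derived RNA-Seq reads are scarce to begin with.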
Presentation Overview: Show
SARS-CoV-2 sequencing has scaled dramatically, yet existing genome annotation methods can result in missing or incorrect gene/protein sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that is reference-free and overcomes atypical genomic traits. With this, we analyzed 66,000 genomes and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction, compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to Prokka (base) and VAPiD, we obtained 6.4- and 1.8-fold increases in protein annotations, respectively. Our method generated 13,000,000 gene, protein, and domain sequences, some conserved spatiotemporally and others representing emerging mutations, e.g., D614G and N501Y. For spike glycoprotein domains, we achieved >97.9% reference sequence identity and characterized RBD variants. We demonstrated robustness and extensibility on an additional 4,000 genomes spanning eight variants of concern and interest. In this cohort, we successfully identified all keystone spike glycoprotein mutations with >99% accuracy and demonstrated high protein and domain annotation accuracy. This work comprehensively presents the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections as they arise.
Presentation Overview: Show
The gene UGT1A1 encodes the enzyme responsible for the glucuronidation of SN-38, the active metabolite of irinotecan (IRI), a chemotherapy drug. Wild-type UGT1A1 contains six TA repeats [A(TA)6TAA] in its promoter region. Polymorphic UGT1A1 alleles with a higher number of TA repeats, such as the UGT1A1*28/(TA)7 and *37/(TA)8 alleles, cause decreased enzyme activity and are associated with adverse events of irinotecan. Genotyping of UGT1A1 polymorphisms from NGS data is challenging due to artifacts from DNA polymerase slippage. We developed a novel method, BayeSTR, to call accurate UGT1A1 repeat genotypes from target capture NGS data. BayeSTR analyzes read alignments to a graph-based model representing the possible repeat alleles, applies an empirically derived “stutter” denoising model, and then performs genotype calling by a Bayesian model. We validated our method with germline data from the Tempus xT tumor-normal matched NGS test, which targets 648 cancer related genes including the UGT1A1 promoter. We observed 100% accuracy through analysis of sequencing data from a collection of 54 Coriell cell-line DNA samples whose UGT1A1 genotypes were established orthogonally. BayeSTR allows for automated, accurate UGT1A1 promoter genotyping from targeted NGS data.
Presentation Overview: Show
Splice junctions, which govern the removal of introns by the RNA splicing machinery, are a vital component of eukaryotic genes. Identification of splice junctions provides valuable insight into alternative splicing and fusion transcript events, which have been found in most of the hallmarks of cancer and can potentially be applied to cancer diagnosis, prognosis, and therapy. However, most of the available tools for splice junction detection directly align paired-end short reads to the genomic reference and identify the splice junctions from the discordant read pairs. Although computationally efficient, alignment-based approaches are fundamentally limited in detecting sequences that are substantially different from the reference, which are the sequences most likely to contain splice junctions, because of the challenges in accurately splitting and aligning short fragments. On the other hand, the de novo whole transcriptome assembly approach, which attempts to assemble all reads into a single consensus transcriptome, is computationally intensive. In this study, we propose a local assembly-based framework, called novoBreak-rna, which modifies our well-attested genomic structural variation breakpoint assembly tool novoBreak to assemble novel splice junctions in RNA-seq data. The results using real prostate cancer data from TCGA demonstrate that our method can achieve higher sensitivity in detecting novel splice junctions.
Presentation Overview: Show
Annotation of structural variations (SVs) and base-level karyotyping in cancer cells remains challenging. Here, we present Integrative Framework for Genome Reconstruction (InfoGenomeR), a graph-based framework that can reconstruct individual SVs into karyotypes based on whole-genome sequencing data, by integrating SVs, total copy number alterations, allele-specific copy numbers, and haplotype information. Using whole-genome sequencing data sets of patients with breast cancer, glioblastoma multiforme, and ovarian cancer, we demonstrate the analytical potential of InfoGenomeR. We identify recurrent derivative chromosomes derived from chromosomes 11 and 17 in breast cancer samples, with homogeneously staining regions for CCND1 and ERBB2, and double minutes and breakage-fusion-bridge cycles in glioblastoma multiforme and ovarian cancer samples, respectively. Moreover, we show that InfoGenomeR can discriminate private and shared SVs between primary and metastatic cancer sites that could contribute to tumour evolution. These findings indicate that InfoGenomeR can guide targeted therapies by unravelling cancer-specific SVs on a genome-wide scale. This paper was published in Nat Commun 12, 2467 (2021) https://doi.org/10.1038/s41467-021-22671-6.
Presentation Overview: Show
Researchers who focus on model organisms greatly benefit from compendia like recount, GTEx, and The Cancer Genome Atlas, which improve the findability and decrease the analysis burden for gene expression data from different experiments. While gene expression compendia exist for some bacterial model organisms like Escherichia coli and Pseudomonas aeruginosa, no compendium exists that unites gene expression profiles across all bacterial and archaeal species. We produced a compendium that integrates the 59,239 publicly available isolate bacterial and archaeal RNA-seq samples, creating a community resource that stands to improve data access and decrease time-to-insight for researchers interested in microbial gene expression. The main product of the pipeline is a normalized ortholog count table that includes all processed samples. Additionally, to support cross-domain and domain-specific inquiries the pipeline allows flexible data outputs. These include strain- or species-specific count tables and interconversion between annotation formats (e.g. ortholog to reference genome). All research products, including strain profiles, reference pangenomes, raw and normalized compendia, annotation maps, and analysis code will be made publicly available. Our pipeline is encoded in Snakemake and is available at github.com/greenelab/2022-microberna.
Presentation Overview: Show
Complex genomic rearrangements (CGRs) are common in cancer and are known to form via two aberrant cellular structures: micronuclei and chromatin bridges. However, which mechanism is more relevant to CGR formation in cancer and whether there are other undiscovered mechanisms remain unknown. Here we developed a computational algorithm ‘Starfish’ to analyze 2,014 CGRs from 2,428 whole-genome-sequenced tumors and discover six CGR signatures based on their copy number and breakpoint patterns. Through extensive benchmarking, we show that our CGR signatures are highly accurate and biologically meaningful. Three signatures can be attributed to known biological processes: micronuclei- and chromatin-bridge-induced chromothripsis and circular extrachromosomal DNA. More than half of the CGRs belong to the remaining three signatures, which have not been reported previously. A unique signature, which we named “hourglass chromothripsis”, with localized breakpoints and a small amount of DNA loss, is abundant in prostate cancer. We find that SPOP is associated with hourglass chromothripsis and may play an important role in maintaining genome integrity.
Presentation Overview: Show
Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate the assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions. Using in silico and real assembly results from multiple long-read datasets and assemblers, we demonstrate that in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.
Presentation Overview: Show
Single cell genomics is a rapidly growing and widely used technology that helps to understand cellular heterogeneity and to elucidate the cell type-specific mechanisms mediating disease susceptibility. Nevertheless, the costs of single cell genomic assays remain relatively high and sample throughput low. Genetic demultiplexing is a method that can be used to identify cells from individuals based on natural genetic variation in single cell datasets. It has been used in several human studies but has not been applied to data from non-human systems. A detailed examination of the factors influencing the power and accuracy of labelled sample assignments is not available.
Here we examine the parameters affecting the success of a single cell genetic demultiplexing study using demuxlet, a tool used to separate samples using genetic variation. We find that sequencing depth is the main determinant of demultiplexing success, suggesting that independent genetic variants are the key quantity powering genetic demultiplexing. We provide a pipeline that can be used to split a BAM file by individual cell barcodes and downsample each individual cell’s reads to determine the robustness of sample assignments as a function of sequencing depth.
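The per-cell downsampling step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes reads have already been extracted as (cell barcode, read ID) pairs, for example from a barcode tag in the BAM file, and the function names `group_by_barcode` and `downsample_per_barcode` are hypothetical.

```python
import random
from collections import defaultdict


def group_by_barcode(reads):
    """Group (barcode, read_id) pairs into {barcode: [read_id, ...]}."""
    groups = defaultdict(list)
    for barcode, read_id in reads:
        groups[barcode].append(read_id)
    return dict(groups)


def downsample_per_barcode(reads, fraction, seed=0):
    """Keep a fixed fraction of each cell's reads, sampled without replacement.

    Assumption: sampling is done independently per cell barcode, so every
    cell loses depth in the same proportion.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)  # seeded so a depth series is reproducible
    kept = {}
    for barcode, read_ids in sorted(group_by_barcode(reads).items()):
        n_keep = int(len(read_ids) * fraction)  # rounds down
        kept[barcode] = rng.sample(read_ids, n_keep)
    return kept
```

Running the function at several fractions (for example 1.0, 0.5, 0.25) and re-running sample assignment on each subset would give the kind of depth-versus-robustness curve the abstract refers to.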
Presentation Overview: Show
Modern RNA-sequencing protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of the splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge, and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared with two popular assemblers, StringTie2 and Scallop. Scallop2 represents a significant leap forward for transcript assembly and therefore enables further improvement in the identification of novel transcripts and in downstream isoform-level expression analysis. More importantly, Scallop2 enables accurate construction of transcriptomes at single-cell resolution, which supports broader use and advances biological and biomedical research in the era of single-cell omics.
Presentation Overview: Show
Gene editing is a powerful approach to improve our ability to treat specific diseases with an unmet medical need. CRISPR-Cas-based gene editing has broad therapeutic applications but also has the potential to increase the possibility of chromosomal translocations after introducing genomic cuts, especially when introducing multiple edits (multiplex editing), nullifying or diminishing the benefits of the therapy by precipitating additional disorders. The development of computational tools to support translocation detection and quantification methods therefore represents a necessary and impactful contribution to the field.
Here, we enhance tools designed for unidirectional sequencing to improve scalable detection and characterization of on-on, on-off, and off-off target translocation events in edited genomes. Our bioinformatics package, TransACT (Translocation Analysis Computational Toolkit), can detect translocations in unidirectional as well as targeted amplicon next generation sequencing data. In addition, we implement advanced false positive filtering to increase the confidence level and generate summary statistics with translocation visualizations at single base-pair resolution. Finally, we demonstrate the accuracy and limit of detection using spike-in translocation datasets.
TransACT is a sophisticated translocation detection and quantification method especially useful for the evaluation of multiplex editing techniques to assess the pre-clinical and clinical safety of gene editing drug products.
Presentation Overview: Show
Tracking the spread of SARS-CoV-2 variants has been an essential tool in the public health response to the COVID-19 pandemic. The inflow to public wastewater treatment facilities is a source of SARS-CoV-2 viruses from the community served by the facility. Short-read sequencing of these viral samples has the potential to identify variants present in the sample. However, the combination of the short read length and the heterogeneity of the sample poses challenges to the analysis. We demonstrate a novel graph-theory based analytical approach to the analysis of sequencing data from heterogeneous SARS-CoV-2 samples. Briefly, we identify sites in the viral genome which are polymorphic in the sample, and then identify subsets of these, which we term “discriminating mutation sets,” which segregate with reads. We applied this analysis to data from sequencing of wastewater sampled in January 2022 and, by counting reads consistent with each of the discriminating mutation sets, were able to provide estimates of the relative abundance of Delta and Omicron variants in the samples. This technique also shows potential for identification and relative quantification of variants at a more fine-grained phylogenetic level.
Presentation Overview: Show
Somatic evolution plays a key role in development and aging as well as in disease processes, notably cancer. The importance of understanding mechanisms of somatic mutability has promoted a proliferation of new sequencing technologies, each with distinctive capabilities and limitations. The enormous space of possible combinations of sequencing modalities poses a substantial challenge for selecting optimal technologies for any particular scientific question. Versatile simulation tools are thus needed to make it possible to explore and optimize potential study designs. We present a clonal evolution and sequencing simulator that generates synthetic data from a wide range of clonal lineages, variant classes, and sequencing technologies, and that is designed for evaluating study designs for assessing somatic mutation mechanisms. Users can define properties of the somatic evolutionary process, mutation classes (e.g., single nucleotide polymorphisms, copy number changes, and classes of structural variation), and biotechnology options (e.g., coverage, bulk vs single cell, whole genome vs exome, error rate, number of samples). The simulator then generates synthetic sequence reads and their corresponding ground-truth parameters for the given study design. We demonstrate its utility in evaluating and optimizing study designs to detect differences in somatic mutation mechanisms between sequence samples.
Presentation Overview: Show
Storing and analyzing large sequencing datasets is computationally expensive and developing scalable data structures and algorithms is essential for analyzing their information content. Here, we introduce Stash, a novel hash-based data structure based on stochastic tile hashing (Stashing), which provides a lossy representation of nucleotide sequences, such as long reads.
Stash is implemented as a two-dimensional bit array and populated using sliding windows of spaced seed patterns to hash input sequences. The sequence hashes indicate the memory loci, and sequence ID hashes determine the stored value.
By measuring the number of tile matches for related Stash frames, one can detect whether two genomic regions are covered by the same set of sequencing reads. We report this score on a chromosome of the human genome reference after Stash is filled with experimental Oxford Nanopore Technologies sequencing reads and show that as the distance between two loci of the reference contig increases, the metric decreases, since fewer common reads cover those regions.
We expect Stash to provide benefits to a variety of bioinformatics applications, including de novo genome assembly and misassembly detection.
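The hashing scheme described above (spaced seeds choose the memory location, the sequence ID chooses the stored value) can be illustrated with a toy Python sketch. This is not the Stash implementation: the table size, the 8-bit tile width, the seed pattern `1101011`, the overwrite-on-collision policy, and every name in the code are assumptions made for illustration only.

```python
import hashlib


def stable_hash(text):
    """Deterministic 64-bit hash; Python's built-in hash() is salted per process."""
    digest = hashlib.blake2b(text.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def spaced_words(sequence, pattern):
    """Yield one spaced-seed word per window: '1' keeps a base, '0' skips it."""
    keep = [i for i, flag in enumerate(pattern) if flag == "1"]
    for start in range(len(sequence) - len(pattern) + 1):
        yield "".join(sequence[start + i] for i in keep)


class ToyStash:
    """Toy lossy table: row chosen by the sequence hash, value by the ID hash."""

    def __init__(self, n_rows=4096, tile_bits=8, pattern="1101011"):
        self.n_rows = n_rows
        self.tile_bits = tile_bits
        self.pattern = pattern
        self.rows = [0] * n_rows  # each entry stands for one tile_bits-wide row

    def tile_for(self, seq_id):
        """Stored value derived from the sequence ID (assumed scheme)."""
        return stable_hash(seq_id) % (1 << self.tile_bits)

    def insert(self, seq_id, sequence):
        tile = self.tile_for(seq_id)
        for word in spaced_words(sequence, self.pattern):
            # Later reads overwrite earlier ones, which makes the table lossy.
            self.rows[stable_hash(word) % self.n_rows] = tile

    def tiles(self, region):
        """Look up the stored value for every spaced-seed window of a region."""
        return [self.rows[stable_hash(word) % self.n_rows]
                for word in spaced_words(region, self.pattern)]


def shared_tiles(stash, region_a, region_b):
    """Tile values seen in both regions; more overlap suggests shared reads."""
    return set(stash.tiles(region_a)) & set(stash.tiles(region_b))
```

In this toy version, two regions covered by the same read return that read's tile value from both lookups, so the size of the shared set plays the role of the tile-match score described in the abstract.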
Presentation Overview: Show
De novo genome assembly is a cornerstone of a variety of genomic analyses. Long sequencing read technologies have enabled researchers to assemble draft genomes with high contiguity and few structural errors. Most long read assemblers adopt the overlap layout consensus paradigm, a quadratic run time algorithm in its naïve implementation, to address the high number of base errors present in long reads. Recently, ONT and PacBio have made tremendous strides in improving the quality of their long read sequencing technologies, and opportunities for new long read assembly algorithms have emerged. We present GoldRush-Path, a memory-efficient long read assembly algorithm that runs in linear time in the number of reads, as part of the GoldRush pipeline. GoldRush-Path iterates through the long reads and identifies a set of “golden path” sequences that cover ~1X of the target genome by querying each read against a multi-index Bloom filter and inserting it only if its associated sequence signatures are missing. GoldRush-Path, the costliest step in the GoldRush pipeline, consumes at most 73 GB of RAM when assembling human genomes. The selected golden path is then polished and scaffolded in the pipeline, yielding NGA50 lengths of 12 Mbp for human genome assemblies in our tests.
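The single-pass selection described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the GoldRush-Path code: a plain Python set stands in for the multi-index Bloom filter, plain k-mers stand in for the sequence signatures, and the `max_seen_fraction` threshold and function names are hypothetical.

```python
def kmers(sequence, k):
    """Return the set of k-mers in a sequence (stand-in for read signatures)."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def select_golden_path(reads, k=5, max_seen_fraction=0.5):
    """Greedy single pass: keep a read only if most of its signatures are new.

    Each read is queried once and inserted at most once, so the work grows
    linearly with the number of reads.
    """
    seen = set()  # assumption: an exact set replaces the Bloom filter
    golden = []
    for read in reads:
        signatures = kmers(read, k)
        if not signatures:
            continue  # read shorter than k
        seen_fraction = len(signatures & seen) / len(signatures)
        if seen_fraction <= max_seen_fraction:
            golden.append(read)
            seen |= signatures
    return golden
```

For example, `select_golden_path(["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"])` keeps the first and third reads and skips the duplicate, since all of the duplicate's k-mers are already recorded.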
Presentation Overview: Show
Amplicon sequencing is widely applied to explore heterogeneity and rare variants in genetic populations. Resolving true biological variants and accurately quantifying their abundance from noisy amplicon sequence data is crucial for downstream analyses, but measured abundances are distorted by stochasticity and bias in amplification, plus errors during Polymerase Chain Reaction (PCR) and sequencing. Previously we presented AmpliCI, a reference-free, model-based method for rapidly resolving the number, abundance and identity of error-free sequences in massive Illumina amplicon datasets. Here we present AmpliCI v2, which can take into account Unique Molecular Identifier (UMI) information to achieve higher resolution when denoising Illumina amplicon data. The v2 version includes a new module, DAUMI, a probabilistic framework to resolve haplotypes and deduplicated abundance from amplicon sequence data with UMIs. We demonstrate that AmpliCI v2 achieves better performance in haplotype identification and accurate abundance estimation compared to the previous AmpliCI version and other UMI-aware clustering methods.
Presentation Overview: Show
Alternative isoforms that arise from internal splicing as well as transcription start site (TSS) or transcription end site (TES) choice are known to play key roles during postnatal development. Long-read RNA-seq (lrRNA-seq) sequences through the entire transcript, thus providing not only the ends but also the internal structure of each transcript, and can be applied to both bulk and single-cell samples.
As a part of the final phase of the ENCODE Consortium, we collected 5 tissues (adrenal glands, gastrocnemius muscle, heart, hippocampus, and cortex) from C57BL6J/Castaneus F1 hybrid mice at 7 postnatal timepoints (P4, P10, P14, P25, P36, P2mo, P18-20mo). We have sequenced all of these timepoints using bulk long-read RNA-seq in adrenal gland and gastrocnemius, and a subset of these at key developmental timepoints in the remaining tissues. We have also profiled adrenal gland and hippocampus using single-cell long-read RNA-seq (LR-Split-seq) with both PacBio and Oxford Nanopore (ONT) platforms. We call cell type and timepoint specific isoforms, TSSs, and TESs. We integrate the LR-Split-seq results with matching single-cell multiome data. This approach allows us to connect coaccessible regulatory DNA regions to alternative TSSs that we observe, giving us insight into the regulatory underpinnings guiding promoter choice.
Presentation Overview: Show
Biological tissues are heterogeneous and comprise cells undergoing continuous biological processes like cell differentiation. Single-cell RNA-sequencing technologies enable the investigation of these processes. However, generating large cohorts of single-cell data is challenging compared to bulk transcriptomic data. Although many computational methods have been developed for inferring cell type abundance from bulk transcriptomic data, these approaches rely on cell type gene expression signatures and ignore intra-cluster variability. Continuous Deconvolution, ConDecon, is a clustering-independent deconvolution algorithm specifically developed to predict complex changes in single-cell abundance from bulk tissue. This approach estimates the probability that each cell in a reference single-cell dataset is present in a query bulk sample. We compared ConDecon to 17 other methods and found that ConDecon performs comparably to state-of-the-art algorithms when inferring discrete cell type abundances. We then focus on ConDecon’s ability to estimate dynamic cell abundances along continuous cellular processes. To that end, we applied ConDecon to well-characterized biological systems like B-cell maturation and immune activation. Finally, we use it to identify changes in the activation of tumor-infiltrating microglia during the mesenchymal transformation of pediatric ependymoma. We anticipate that ConDecon will extend the utility of current methods to characterize single-cell dynamics in bulk tissue.
Presentation Overview: Show
Genome editing technologies are rapidly evolving, and analysis of deep sequencing data from target and off-target regions is necessary for evaluating editing efficiency, precision and specificity. Our group has developed the widely-used tool, CRISPResso2, which standardized quantification of editing frequencies at predefined loci using amplicon sequencing. However, this and other methods are only able to detect small insertions and deletions. In order to quantify complex genome editing events including large insertions, inversions and translocations, assays have been proposed which enrich for DNA sequences using only one PCR origin as the anchor for amplification. We developed a novel analytic tool called CRISPRLungo to analyze sequencing data produced from single-anchor PCR which can quantify and visualize complex genome editing events without any a priori assumption of the expected outcomes. We generated single-anchor amplification data for a therapeutic genome editing experiment and show that our tool can take advantage of the richness of unidirectional sequencing data to both sensitively and specifically detect a variety of complex genome editing outcomes, including identifying rare chromosomal alterations not detectable using current analysis toolkits. CRISPRLungo is available as open-source software that enables researchers to comprehensively assess genome editing outcomes without the biases of amplicon sequencing.
Presentation Overview: Show
The ability to generate and analyze massive data can accelerate our understanding of gene editing processes. However, the generation of such data imposes two major challenges. The first is the experimental procedure, which parallelizes many samples/conditions at once. The second is the computational analysis, which aims to produce a few metrics for fast, meaningful comparison. Solutions to these challenges must be scalable and reproducible, while limiting human intervention to reduce errors. At EmendoBio, we developed a procedure that takes thousands of samples and automatically forms a DNA library prep for next-generation sequencing (NGS), using a robotic Biomek i7 system. This step involves target specific amplification, different amplicon mixing and Illumina distinct indexing. Strict input validation steps are taken to meet pre-defined formats. Following sequencing procedures from various sources, the analysis of many samples is triggered at once (automatically or manually by user request) using Amazon serverless technology combined with parallel batch processing. The analysis space comprises many different bioinformatics tools such as CRISPR on/off target analysis, transcriptome characterization assays, mutation and SNP phasing, RNA-Seq analysis and others. Each analysis is followed by specific post-processing calculations, visualization and summarized metrics. Our simplified and automated procedure enables efficient cross-experimental conclusions regarding gene editing processes.
Presentation Overview: Show
While absolute quantification is challenging in high-throughput measurements, changes of features between conditions can often be determined with high precision. Therefore, analysis of fold changes is the standard method and is sufficient for differential expression; often, however, the analysis of “changes of changes” is required. Differential alternative splicing is an application of such a doubly differential analysis. EmpiReS is a quantitative approach for various kinds of omics data based on fold changes for appropriate features of biological objects. Empirical error distributions for these fold changes are estimated from Replicate measurements and used to quantify feature fold changes and their directions.
We assess the performance of EmpiReS to detect differentially expressed genes applied to RNA-Seq using simulated data. It achieved higher precision than established tools at nearly the same recall level. Furthermore, we assess the detection of alternatively Spliced genes via changes of isoform fold changes on distribution free simulations and on experimentally validated splicing events. EmpiReS achieves the best precision-recall values for simulations based on different biological datasets. We propose EmpiReS as a general, quantitative and fast approach with high reliability and an excellent trade-off between sensitivity and precision for both differential expression and differential alternative splicing.
Presentation Overview: Show
Identification of plasmids from sequencing data is an important and challenging problem related to antimicrobial resistance spread. We provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. Unlike previous machine-learning approaches for this problem, which classify individual contigs separately, we employ graph neural networks (GNNs) to include information from the assembly graph. Propagation of information from nearby nodes in the graph allows accurate classification of even short contigs that are difficult to classify based on sequence features or database searches alone.
Our new species-agnostic software tool plASgraph outperforms the recently developed PlasForest, which uses database searches to supplement sequence-based features. Since our tool does not rely on existing plasmid databases, it is more suitable for classification of contigs in novel species. Our tool can also be trained on a specific species, and in that scenario it outperforms mlplasmids trained on the same species.
On one hand, our work provides a new, accurate, and easy to use tool for plasmid classification; on the other hand, it serves as a motivation for more widespread use of GNNs in bioinformatics, such as in pangenome sequence analysis, where sequence graphs serve as a fundamental data structure.
Availability: https://github.com/cchauve/plASgraph
Presentation Overview: Show
Third-generation (i.e., long-read) sequencing platforms like Oxford Nanopore and PacBio implement additional capabilities for collecting genomic information, including novel isoform detection, due to their ability to sequence the entire length of mRNA transcripts. While measuring which genes and/or transcripts are differentially expressed across conditions is common, it is only one way to compare gene expression and is susceptible to missing important biological information. As nothing in biology acts in isolation, there is a need to describe patterns present in entire gene expression profiles in addition to comparing individual differentially-expressed genes. Another way to measure transcriptional differences globally is transcriptional diversity. Transcriptional diversity can refer to the overall number of genes expressed, or it can refer to differential isoform usage. Transcriptional diversity has been previously described in many ways, but different measures of transcriptional diversity may distinctly capture biological and technical variation. Here, we compare transcriptional diversity metrics including coefficient of variation (CV), Shannon entropy, and the Gini index in publicly-available Genotype-Tissue Expression (GTEx) project long-read RNA sequencing data with respect to brain region and biological sex.
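The three metrics named above have standard textbook definitions, sketched below in Python for a single vector of non-negative expression values (for example, isoform counts for one gene or gene counts for one sample). The choice of population variance for the CV, log base 2 for the entropy, and the function names are assumptions for illustration, not details taken from the study.

```python
import math


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    mean = sum(values) / len(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return math.sqrt(variance) / mean


def shannon_entropy(values):
    """Entropy in bits of the proportions; zero entries contribute nothing."""
    total = sum(values)
    proportions = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in proportions)


def gini_index(values):
    """Gini index: 0 for perfectly even values, near 1 when one value dominates."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    rank_weighted = sum((i + 1) * x for i, x in enumerate(ordered))
    return (2 * rank_weighted) / (n * total) - (n + 1) / n
```

For four equally expressed isoforms, `[5, 5, 5, 5]`, the CV and Gini index are 0 and the entropy is 2 bits; when one isoform carries all the expression, `[0, 0, 0, 10]`, the entropy falls to 0 and the Gini index rises to 0.75. The metrics move in different directions and on different scales, which is part of why they may capture variation differently.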
Presentation Overview: Show
The Genome in a Bottle consortium generates variant benchmarks for a set of human genomes to enable evaluation and comparison of sequencing technologies and variant detection methods. While these technologies can resolve most of the genome, correctly calling variants in complex or repetitive regions remains a challenge. We currently have general heuristics to predict incorrectly-called variants (more repetition is harder, etc); however, we lack a data-driven model to link variant caller performance to specific, quantifiable genomic contexts.
We aim to make such a model using explainable boosting machines (EBMs). EBMs are a linear combination of arbitrary univariate and bivariate functions (generalized additive models with interaction terms). Despite being flexible, the relative simplicity of EBMs will allow interpretation of the functional relationship and relative contribution of each feature. For example, the model revealed A/T homopolymers longer than ~15bp predict higher Illumina single nucleotide variant (SNV) and insertion/deletion (INDEL) error rates. For G/C homopolymers, any length above 0bp and increasing imperfect fraction predicted higher error rates for Illumina.
Ultimately, this will provide a data-driven foundation for comparing variant caller methods and/or sequencing technologies in difficult regions of the genome, and enable improved design of stratifications delineating difficult regions.
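Written out, a model of the kind described above (a generalized additive model with pairwise interaction terms) has the following general form. The notation is ours rather than the abstract's: g is a link function, β₀ an intercept, f_i a learned univariate function of feature x_i (for example, homopolymer length), and f_ij a learned bivariate function of a selected feature pair.

```latex
g\bigl(\mathbb{E}[y]\bigr) \;=\; \beta_0 \;+\; \sum_{i} f_i(x_i) \;+\; \sum_{(i,j)} f_{ij}(x_i, x_j)
```

Because each term depends on at most two features, every f_i can be plotted as a curve and every f_ij as a heat map, which is what makes the per-feature contributions directly interpretable.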
Presentation Overview: Show
Autosomal dominant polycystic kidney disease (ADPKD) is characterized by the development of cysts in the kidneys that increase in number and volume with age. The increase in the quantity and size of cysts eventually interferes with normal kidney function and ultimately leads to end-stage kidney disease. Roughly 1 in 1000 people are affected by ADPKD, and it is the fourth leading cause of end-stage kidney disease. ADPKD is a multisystem disease; therefore, patients can also suffer from hemorrhagic stroke, cardiac arrest, and/or complications from severe cystic liver disease. The disease is predominantly caused by mutations in the PKD1 and PKD2 genes, which encode polycystin 1 (PC1) and polycystin 2 (PC2), respectively. ADPKD displays metabolic changes including alternative glucose metabolism similar to the Warburg effect, oxidative phosphorylation, and fatty acid synthesis. Additionally, dietary modifications including caloric restriction have been shown to improve symptoms. However, metabolic changes at single-cell resolution have not been thoroughly examined. Here we use single-cell RNA-seq approaches to explore cell-specific metabolic pathway changes in publicly available human ADPKD and Pkd2 knock-out mouse datasets.
Presentation Overview:
There is no end to innovation in modern biology. Experimental designs are becoming increasingly complex, encompassing perturbations across multiple dimensions and conditions. Despite the immense information gain, statistical analyses, e.g. with DESeq2, often focus on a 1-vs-1 or 1-vs-all design. Moreover, these tests are all based on questionable null-hypothesis testing and the resulting p-values, which are often misunderstood and misapplied.
Here we introduce EmpiReR, which employs a fuzzy-value-based representation of the count data. Through fuzzy binning, it captures the empirical error distributions of the data and estimates whether the features are consistent with the rest of the data in the same condition and whether they show a significant change when conditions are compared.
EmpiReR allows a p-value-free analysis of data and the analysis of higher-order differentials. This is important not only for complex data but also for comparisons such as pattern extraction in iATAC, where the combination of data types (RNA-seq and ATAC-seq) is critical to the analysis. EmpiReR can be used not only to compute the ‘best’ higher-order differential changes, but also to extract and visualize evidence for complex patterns, e.g. fuzzy differential flows of fold changes for time series data across various conditions.
Presentation Overview:
Tn5 was one of the first prokaryotic transposons to be identified, and Tn5 transposase has been widely adopted in genomic protocols to explore the genome and epigenome in a high-throughput fashion. In particular, ATAC-seq and CUT&Tag-seq are becoming the most widely used epigenomic experimental approaches for measuring chromatin accessibility and detecting DNA-protein interactions. With large-scale data production, correctly processing these epigenomic data has become the new bottleneck. Many bioinformatics tools have been developed for processing ATAC-seq and CUT&Tag-seq data; however, a comprehensive comparison and benchmarking of these methods is still lacking. Here, we conducted a comprehensive benchmark to evaluate the performance of eight popular software tools in processing ATAC-seq and CUT&Tag-seq data, including AIAP, MACS2, SEACR, HMMRATAC, CUT&RUNTools2.0, and ChromHMM. We further tested the performance of differential analysis strategies for ATAC-seq and CUT&Tag-seq data. In conclusion, our study provides comprehensive bioinformatics guidance for ATAC-seq and CUT&Tag-seq data processing and differential analysis. The recommended analysis strategy was compiled into a Docker/Singularity image, allowing biologists to easily perform data analysis by executing a single command.