Posters - Schedules

Monday, July 11 and Tuesday, July 12 between 12:30 PM CDT and 2:30 PM CDT
Wednesday, July 13 between 12:30 PM CDT and 2:30 PM CDT
Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 11 between 7:30 AM CDT and 10:00 AM CDT
Session A Posters dismantle:
Tuesday, July 12 at 6:00 PM CDT
Session B Poster Set-up and Dismantle
Session B Posters set up:
Wednesday, July 13 between 7:30 AM CDT and 10:00 AM CDT
Session B Posters dismantle:
Thursday, July 14 at 2:00 PM CDT
Virtual: A clustering method based on cell type identification for single-cell RNA-seq data
COSI: HiTSeq
  • Sohta Nishida, Graduate School of Information Science and Technology, Osaka University, Japan
  • Junko Yoshida, Department of Physiology II, Nara Medical University, Japan
  • Shigeto Seno, Graduate School of Information Science and Technology, Osaka University, Japan
  • Kyoji Horie, Department of Physiology II, Nara Medical University, Japan
  • Hideo Matsuda, Graduate School of Information Science and Technology, Osaka University, Japan


Recent advances in single-cell RNA-seq (scRNA-seq) technology have made it possible to perform high-throughput, large-scale transcriptome profiling at single-cell resolution. Unsupervised learning, such as data clustering, is central to identifying and characterizing novel cell types and gene expression patterns. Clustering is used to computationally identify groups of cells by comparing the gene-expression profiles of the groups. The result of the clustering enables us to summarize complex scRNA-seq data into a digestible format for human interpretation. This allows us to describe population heterogeneity with discrete labels that are easier to understand, rather than trying to understand the higher-dimensional manifolds in which cells exist.
Common clustering approaches, such as graph-based clustering and k-means clustering, use the expression levels of each cell's genes as features. However, these methods do not reflect the cell types present in each cluster. We therefore propose a clustering method that uses the results of cell type identification as features. Cell type identification yields a score for each cell type, and the method uses the proportions of these scores for clustering. The performance of the method will be demonstrated by applying it to several benchmark single-cell RNA-seq datasets.
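The idea of clustering on cell-type scores rather than on gene expression can be illustrated with a minimal sketch. It is not the authors' implementation: the score matrix, the three reference cell types, the normalization into proportions, and the use of plain k-means with a farthest-point initialization are all assumptions made for illustration.

```python
import math

def to_proportions(scores):
    """Normalize one cell's non-negative cell-type scores so they sum to 1."""
    total = sum(scores)
    if total == 0:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]

def kmeans(points, k, n_iter=50):
    """Plain Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each cell goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Hypothetical scores for 6 cells against 3 reference cell types
# (columns could be, e.g., T cell, B cell, monocyte).
raw_scores = [
    [9.0, 1.0, 0.5],
    [8.0, 2.0, 1.0],
    [1.0, 7.0, 1.0],
    [0.5, 9.0, 0.5],
    [1.0, 1.0, 8.0],
    [0.5, 1.5, 9.0],
]
features = [to_proportions(row) for row in raw_scores]
labels = kmeans(features, k=3)
print(labels)
```

Because the features are cell-type proportions, cells that score similarly against the reference types end up in the same cluster regardless of how their individual genes are expressed.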

Virtual: A unified somatic calling of next-generation sequencing data enhances the detection of clonal hematopoiesis of indeterminate potential
COSI: HiTSeq
  • Shulan Tian, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Garrett Jenkinson, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Alejandro Ferrer, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Huihuang Yan, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Saurabh Baheti, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Terra Lasho, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Joel Morales-Rosado, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Mrinal Patnaik, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Wei Ding, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Konstantinos Lazaridis, Division of Gastroenterology & Hepatology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, USA, United States
  • Eric Klee, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, USA, United States


Clonal hematopoiesis (CH) of indeterminate potential (CHIP) is a premalignant state in which leukemia-associated driver genes acquire somatic mutations in peripheral blood at a variant allele frequency (VAF) of 2% or greater, yet the individual does not meet the World Health Organization diagnostic criteria for a hematologic neoplasm. CHIP represents a risk factor for various hematologic malignancies and cardiovascular diseases. The VAF cutoff of >=2% was set arbitrarily, reflecting both the limited ability of standard next-generation sequencing (NGS) platforms to detect small clones, owing to their relatively high sequencing error rates, and the rarity of clinical consequences associated with mutations at lower VAFs. However, individuals with CH at >=1% VAF in leukemia driver genes also had a significantly increased risk of developing AML, as did those with >=2% VAF. Popular variant calling algorithms for CHIP detection often lose power on variants with low VAFs. Currently, no analytical pipeline has been developed specifically for CHIP detection. This study presents UNIfied SOmatic calling of Next-generation sequencing data (UNISON), a software toolkit designed for streamlined CHIP discovery from population studies, even with suboptimal sequencing coverage. UNISON should be broadly applicable to CHIP detection in large-scale WES and WGS projects.
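The VAF cutoffs discussed above reduce to a simple ratio of read counts. The sketch below only illustrates that arithmetic; it is not part of UNISON, and the function names and read counts are hypothetical.

```python
def variant_allele_frequency(alt_reads, total_reads):
    """VAF = reads supporting the variant allele / all reads covering the site."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads

def passes_chip_vaf(alt_reads, total_reads, min_vaf=0.02):
    """Apply a VAF cutoff; 0.02 mirrors the conventional >=2% CHIP threshold."""
    return variant_allele_frequency(alt_reads, total_reads) >= min_vaf

# Hypothetical sites, each covered by 200 reads.
print(passes_chip_vaf(6, 200))                # VAF 3.0%: kept at the 2% cutoff
print(passes_chip_vaf(3, 200))                # VAF 1.5%: dropped at the 2% cutoff
print(passes_chip_vaf(3, 200, min_vaf=0.01))  # VAF 1.5%: kept at a 1% cutoff
```

At 200x coverage a 1.5% VAF corresponds to only three supporting reads, which shows why low-VAF calls are hard to separate from sequencing error.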

Virtual: Basic4CVis: an R/Shiny app for 4C-seq quality control and interaction visualization
COSI: HiTSeq
  • Carolin Walter, Westfälische Wilhelms-Universität Münster, Germany
  • Julian Varghese, Westfälische Wilhelms-Universität Münster, Germany


Circular chromosome conformation capture with high-throughput sequencing (4C-seq) is a next-generation sequencing technique that offers detailed insights into the three-dimensional structure of the genome around a chosen viewpoint. Since 4C-seq is characterized by a semi-quantitative, fragmented data structure and technical biases that distort the actual signal, the analysis strategies have to be adapted accordingly. 4C-seq replicate experiments are invaluable for the detection of regular and differential interactions between conditions, but pose additional challenges to the analysis.
We present Basic4CVis, an R/Shiny app for the analysis and visualization of 4C-seq data. The R package offers routines for the filtering and quality control of a 4C-seq experiment's virtual fragment library data, related statistical overviews per sample, functionality for both near-cis and far-cis visualization of single samples and replicate interactions, and a user-friendly graphical user interface that allows users to display overlaps and specific interactions for settings with multiple conditions and differential interactions. While standard data preprocessing and virtual fragment library generation are conducted with the R/Bioconductor package Basic4Cseq, sets of interacting regions can be imported from other 4C-seq analysis algorithms or text files for visualization purposes. Thus, Basic4CVis is a flexible addition to 4C-seq analyses with replicates or multiple conditions.

Virtual: btllib: A C++ library with Python interface for efficient sequence processing
COSI: HiTSeq
  • Vladimir Nikolic, BC Cancer Agency - Genome Sciences Centre, Canada
  • Parham Kazemi, BC Cancer Agency - Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer Agency - Genome Sciences Centre, Canada
  • Johnathan Wong, Genome Sciences Centre, Canada
  • Amirhossein Afshinfard, BC Cancer Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer Agency, Canada
  • Inanc Birol, BC Genome Sciences Centre, Canada


Bioinformaticians often write one-off computer programs to perform a specific task instead of reusing existing code. This practice leads to lower software quality and non-reusable code. As bioinformatics analyses become increasingly complex and deal with ever more data, high-quality code is needed for reliable and reproducible performance. The solution is well-designed and well-documented libraries, such as SeqAn, a C++ library that implements algorithms and data structures commonly used in bioinformatics. Here, we present the btllib library as an addition to this ecosystem, with the goal of providing highly efficient, scalable, and ergonomic implementations of bioinformatics algorithms and data structures. The library is implemented in C++, with Python bindings available for a high-level interface. What sets it apart from other libraries is its focus on specialized algorithms designed for efficiency and scalability, as its aim is to enable sequence processing for large genomes. Parallelization, thread safety, and race condition minimization where they help performance (e.g., to maximize throughput) are core principles of btllib. The goal of btllib is not to compete with, but to complement, other available libraries with applications in bioinformatics and genomics research.

Virtual: Exploring mouse transcriptomic landscape
COSI: HiTSeq
  • Agata Muszyńska, Institute of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland, Poland
  • Ryszard Przewłocki, Department of Molecular Neuropharmacology, Institute of Pharmacology Polish Academy of Sciences, Kraków, Poland, Poland
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology UJ, Kraków, Poland, Poland


The mouse is a widely studied animal: in many respects its biology is conserved with ours, making it a valuable model organism. However, the complexity of its transcriptome is not yet fully elucidated. One of the mechanisms responsible for this complexity is alternative splicing (AS), which has been reported to be characteristic of almost all genes in mammals and may be one of the most widely exploited mechanisms responsible for increased transcriptomic and proteomic complexity. We present the results of studying splicing events in data from an experiment focused on neuropathic pain in mice. In our study, we focused on finding commonalities in a collection of fairly diverse samples to expand the currently known mouse transcriptomic landscape. We hypothesize that the inclusion of different pathological factors should allow detection of novel alternative splicing events (nASEs) in the spinal cord that are characteristic of all conditions. We found that the vast majority of nASEs are common to all samples in our study. Furthermore, the results of the functional analysis showed a clear connection to the nervous system. This result might indicate that the mouse reference model lacks information for brain tissue, but may also reflect expected neuroplasticity.

Virtual: GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis
COSI: HiTSeq
  • Damla Senol Cali, Bionano Genomics, United States
  • Gurpreet Singh Kalsi, Intel, United States
  • Zülal Bingöl, Bilkent University, Turkey
  • Can Firtina, ETH Zurich, Switzerland
  • Lavanya Subramanian, Facebook, United States
  • Jeremie S. Kim, ETH Zurich, Switzerland
  • Rachata Ausavarungnirun, King Mongkut's University of Technology North Bangkok, Thailand
  • Mohammed Alser, ETH Zurich, Switzerland
  • Juan Gómez Luna, ETH Zurich, Switzerland
  • Amiral Boroumand, Carnegie Mellon University, United States
  • Anant Nori, Intel, United States
  • Allison Scibisz, Carnegie Mellon University, United States
  • Sreenivas Subramoney, Intel Labs, India
  • Can Alkan, Bilkent University, Department of Computer Engineering, Turkey
  • Saugata Ghose, University of Illinois Urbana-Champaign, United States
  • Onur Mutlu, ETH Zurich, Switzerland


Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM).

We propose GenASM, the first ASM acceleration framework for genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint, and we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth.

We demonstrate that GenASM is a flexible, high-performance, and low-power framework, which provides significant performance and power benefits for three different use cases in genome sequence analysis: 1) GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while consuming 37x and 2.7x less power. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. 2) GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while consuming 1.7x less power and significantly improving the filtering accuracy. 3) GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while consuming 548-582x and 67x less power.
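For readers unfamiliar with Bitap, the following minimal sketch shows the classic bit-parallel formulation with up to k edits (the Wu-Manber extension). It is a plain software baseline written for illustration, not GenASM's modified algorithm or its hardware design; the function name and example sequences are hypothetical.

```python
def bitap_search(text, pattern, k):
    """Return the end positions in `text` where `pattern` matches with <= k edits.

    State R[d] is a bit vector: bit i is set when pattern[0..i] matches some
    suffix of the text read so far with at most d edits.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    full = (1 << m) - 1
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    # With d edits allowed, the first d pattern characters can match empty text.
    R = [((1 << d) - 1) & full for d in range(k + 1)]
    accept = 1 << (m - 1)
    hits = []
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev = R[:]
        R[0] = ((prev[0] << 1) | 1) & mask
        for d in range(1, k + 1):
            match = ((prev[d] << 1) | 1) & mask      # characters agree
            insertion = prev[d - 1]                  # extra character in the text
            substitution = (prev[d - 1] << 1) | 1    # mismatching character
            deletion = (R[d - 1] << 1) | 1           # pattern character skipped
            R[d] = (match | insertion | substitution | deletion) & full
        if R[k] & accept:
            hits.append(pos)
    return hits

# Hypothetical read fragment containing the pattern with one extra 'T'.
print(bitap_search("CCGATTTACACC", "GATTACA", 1))
```

Each text character is processed with a handful of shifts and bitwise operations per error level, which is the kind of regular, data-parallel work that lends itself to hardware acceleration.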

Virtual: GoldRush-Edit: A targeted, alignment-free polishing & finishing pipeline for long read assembly, using long read k-mers
COSI: HiTSeq
  • Vladimir Nikolic, BC Cancer Agency - Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer Agency - Genome Sciences Centre, Canada
  • Johnathan Wong, Genome Sciences Centre, Canada
  • Janet Li, BC Cancer Agency - Genome Sciences Centre, Canada
  • Inanc Birol, BC Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer Agency, Canada


An increasing number of genome assembly projects are exclusively utilizing long sequencing reads despite their still appreciable error rates (87-98% accuracy), and polishing of the resulting assemblies would be highly desirable. Popular methods for polishing assemblies with long reads, e.g. Racon, rely on sequence alignments for nucleotide base error correction, a costly paradigm that, although robust, is not scalable for large (>3Gbp) genomes, requiring large memory servers and long run times. We present GoldRush-Edit, a RAM-efficient polishing pipeline to correct base errors in long read assemblies using a scalable and targeted k-mer-based method. GoldRush-Edit uses ntEdit for fast correction and low-quality sequence tagging, then Sealer for finishing of these tagged regions using an implicit de Bruijn graph. To achieve acceptable base accuracy, each genomic locus under scrutiny obtains its long read mappings from ntLink, a lightweight minimizer-based scaffolder and gap-filling application. Corresponding long read k-mers are extracted to build reusable targeted Bloom filters in-memory to be used by ntEdit and Sealer. GoldRush-Edit achieves more than 60% reduction in both indels and mismatches on an assembly of the genome of a human cell line, NA24385, using matched nanopore long read data exclusively, while using orders of magnitude less memory compared to Racon.
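The role of a k-mer Bloom filter in this kind of polishing can be sketched as follows. This is an illustrative toy, not the ntEdit or Sealer implementation: the class, the SHA-256-based hashing, the filter size, and the example sequences are all assumptions.

```python
import hashlib

class KmerBloomFilter:
    """A small Bloom filter over k-mers: no false negatives, rare false positives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, kmer):
        # Derive n_hashes bit positions by salting the k-mer with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, kmer):
        for p in self._positions(kmer):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, kmer):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(kmer))

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def flag_unsupported(draft, bloom, k):
    """Start positions of draft k-mers that are absent from the read k-mer filter."""
    return [i for i, kmer in enumerate(kmers(draft, k)) if kmer not in bloom]

# Hypothetical long read and a draft locus that differs at index 8 (T -> A).
read = "ACGTACGGTCAGTTAGC"
draft = "ACGTACGGACAGTTAGC"
k = 5
bloom = KmerBloomFilter()
for kmer in kmers(read, k):
    bloom.add(kmer)
print(flag_unsupported(draft, bloom, k))
```

Draft k-mers missing from the filter cluster around the erroneous base, which is how a k-mer-based polisher can localize candidate errors without aligning reads to the assembly.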

Virtual: GoldRush-Link: Integrating minimizer-based overlap detection and gap-filling to the ntLink long read scaffolder
COSI: HiTSeq
  • Lauren Coombe, BC Cancer, Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Vladimir Nikolic, BC Cancer, Genome Sciences Centre, Canada
  • Johnathan Wong, BC Cancer, Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer, Genome Sciences Centre, Canada


Generating high-quality de novo genome assemblies for model and non-model organisms opens the door to a plethora of important downstream studies. To leverage the repeat-spanning evidence from long-read sequencing technologies, we previously developed ntLink, a minimizer-based long-read scaffolding tool. However, most scaffolders, including ntLink, introduce gap sequences (“N”s) between joined sequences, leaving large stretches of unresolved assembly bases, and naively join overlapping sequences. To address these limitations, we added two new features to ntLink: overlap detection and gap-filling. These features are crucial to our new de novo long read assembly tool, GoldRush, and are integrated in the GoldRush-Link stage of the pipeline. Both the overlap detection and gap-filling features are alignment-free, relying on lightweight minimizer mappings. As demonstrated by tests on assemblies from human individuals NA24385 and NA19240, these new features increase the contig NGA50 lengths 502-fold and 7-fold, respectively, while maintaining the high scaffold NGA50 lengths achieved through scaffolding with ntLink. With these two functionalities, >99% of gaps were filled for each individual, leaving fewer than 55 N’s per 100 kbp in the final assemblies. These modular improvements in ntLink would benefit a wide variety of assembly workflows, including but not limited to GoldRush.
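The minimizer mappings mentioned above can be illustrated with a small sketch. It is not ntLink's implementation: real tools typically order k-mers by a hash rather than lexicographically, and the function names, parameters, and sequences here are hypothetical.

```python
def minimizers(seq, k, w):
    """Pick the smallest k-mer (lexicographic order) in every window of w k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda i: window[i])  # leftmost smallest
        pos = start + offset
        # Adjacent windows often agree on the minimizer; record it once.
        if not picked or picked[-1][0] != pos:
            picked.append((pos, kmers[pos]))
    return picked

def shared_offsets(a, b, k, w):
    """Offsets (position in a minus position in b) implied by shared minimizers."""
    b_index = {}
    for pos, kmer in minimizers(b, k, w):
        b_index.setdefault(kmer, []).append(pos)
    return {pa - pb for pa, kmer in minimizers(a, k, w) for pb in b_index.get(kmer, [])}

# Two hypothetical sequences: the last 9 bases of `a` are the first 9 bases of `b`.
a = "TTGACCGTAAGCT"
b = "CCGTAAGCTGGAT"
print(minimizers(a, 3, 3))
print(shared_offsets(a, b, 3, 3))
```

When several shared minimizers agree on one offset, that offset is consistent evidence that the two sequences overlap there, obtained without computing a base-level alignment.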

Virtual: Identifying Functional, Non-Coding Somatic Single Nucleotide Variants through the REMIND-Cancer Bioinformatics Pipeline
COSI: HiTSeq
  • Nicholas Abad, Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany, Germany
  • Cindy Körner, Division of Molecular Genome Analysis, German Cancer Research Center (DKFZ), Heidelberg, Germany, Germany
  • Lars Feuerbach, Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany, Germany


Although current personalized cancer treatment approaches primarily target mutations in protein-coding regions, the relevance of non-coding regulatory regions has been previously demonstrated. However, the ability to detect these mutations through statistical methods is limited by their low recurrence and the resulting lack of statistical power. We overcome this by applying the REMIND-Cancer Pipeline, an integrative computational pipeline that combines genomic, transcriptomic and chromatin accessibility information to identify functional promoter mutations. The pipeline consists of three major steps: (1) exclude all mutations that show no potential for increasing gene expression by modifying promoter sequences, (2) rank the remaining candidates by a multivariate scoring function, and (3) allow for the in-depth analysis of the top-scoring mutations through a multi-functional visualization tool. We analyzed the publicly available PCAWG dataset, which consists of 2,583 patients and 43,639,986 SNVs; the pipeline, together with manual inspection of the candidates, highlighted 8 candidate promoter mutations. In validation experiments, 7 of these mutations exhibited an increase in promoter activity when comparing the mutant to its wild type. With a specificity of 87.5% and a 3-week lab validation turnover, our method represents a substantial improvement over existing workflows, and the pipeline is approaching applicability in precision oncology programs.

Virtual: Isoform-level quantification for single-cell RNA sequencing
COSI: HiTSeq
  • Lu Pan, Karolinska Institutet, Sweden
  • Trung Nghia Vu, Karolinska Institutet, Sweden
  • Yudi Pawitan, Karolinska Institutet, Sweden
  • Huy Dinh, McArdle Laboratory for Cancer Research, Department of Oncology, University of Wisconsin, United States


RNA expression at the isoform level can potentially reveal cellular subsets and corresponding biomarkers that are not visible at the gene level. However, due to the strong 3ʹ bias of the sequencing protocol, mRNA quantification for high-throughput single-cell RNA sequencing such as Chromium Single Cell 3ʹ 10× Genomics is currently performed at the gene level. We have developed an isoform-level quantification method for high-throughput single-cell RNA sequencing by exploiting the concepts of transcription clusters and isoform paralogs. The method, called Scasa, compares well in simulations against competing approaches including Alevin, Cellranger, Kallisto, Salmon, Terminus and STARsolo at both isoform- and gene-level expression. The reanalysis of a CITE-Seq dataset with isoform-based Scasa reveals a subgroup of CD14 monocytes missed by gene-based methods.

Virtual: ITCC-P4: Genomic profiling and analyses of pediatric patient tumor and patient-derived xenograft (PDX) models for high throughput in vivo testing
COSI: HiTSeq
  • Justyna Wierzbinska, Bayer AG, Pharmaceuticals, Research and Development, Berlin, Germany, Germany
  • Marcel Kool, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany
  • Stefan M Pfister, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany
  • Jan Koster, Department of Oncogenomics, Amsterdam University Medical Centre, Amsterdam, the Netherlands, Netherlands
  • Gudrun Schleiermacher, Institut Curie Research Centre, Paris, France, France
  • Natalie Jäger, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany
  • Gilles Vassal, Department of Clinical Research, Gustave Roussy, Villejuif, France, France
  • Louis F Stancato, Eli Lilly and Company, Indianapolis, IN, USA, United States
  • Andreas Schlicker, Bayer AG, Pharmaceuticals, Research and Development, Berlin, Germany, Germany
  • Joshua Waterfall, Institut Curie Research Centre, Paris, France, France
  • Apurva Gopisetty, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany
  • Benjamin Schwalm, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany
  • Norman Mack, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany
  • Anna-Lisa Böttcher, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany
  • Alexandra Saint-Charles, Institut Curie Research Centre, Paris, France, France
  • Yasmine Iddir, Institut Curie Research Centre, Paris, France, France
  • Elnaz Saberi-Ansari, Institut Curie Research Centre, Paris, France, France
  • Didier Surdez, Institut Curie Research Centre, Paris, France, France
  • Aniello Federico, German Cancer Research Center (DKFZ) and German Cancer Consortium (DKTK), Heidelberg, Germany, Germany


Advancements in state-of-the-art molecular profiling techniques have resulted in better understanding of pediatric cancers and their drivers. Many new types and subtypes of pediatric cancers have been identified with distinct molecular and clinical characteristics.
The ITCC-P4 consortium is a preclinical collaboration between academic centers across Europe and several pharmaceutical companies, with the overall aim to establish a sustainable platform of >400 molecularly well-characterized PDX models of high-risk pediatric cancers and use them for in vivo testing of novel mechanism-of-action based treatments.
Currently, 340 models are fully established, including 87 brain and 253 non-brain tumor models, together representing different tumor types both from primary (113) and relapsed (92)/metastatic disease (42). 252 of these models have been fully molecularly characterized, representing 18 pediatric cancer entities and 43 different subtypes.
Using low coverage whole-genome and whole-exome sequencing, somatic mutation calling, DNA copy number, transcriptome analysis and methylation profiling, we have observed that the molecular profile of most PDX models closely mimics their original tumors. Clonal evolution of somatic variants was only observed in some PDX-tumor pairs or between disease states. Somatic copy number variant analysis highlights specific alterations; for instance, MYB, MYC, MYCN, NTRK3, PTEN loss differently distributed between PDX-patient tumor pairs in high-grade gliomas.

Virtual: Mapping noisy long-reads with multi-indexed Bloom Filter: miBF-mapper
COSI: HiTSeq
  • Lauren Coombe, Canada's Michael Smith Genome Science Centre, Canada
  • Rene Warren, Canada's Michael Smith Genome Science Centre, Canada
  • Vladimir Nikolic, Canada's Michael Smith Genome Science Centre, Canada
  • Johnathan Wong, Canada's Michael Smith Genome Science Centre, Canada
  • Inanc Birol, Canada's Michael Smith Genome Science Centre, Canada
  • Talha Murathan Goktas, Canada's Michael Smith Genome Science Centre, Canada
  • Ka Ming Nip, Canada's Michael Smith Genome Science Centre, Canada


Mapping genomic sequences to references is an essential step in genomic analysis. Since the early days of genomics research, developers of sequence mapping and alignment tools have put great effort into improving accuracy and decreasing resource usage. Over the years, mapping software has improved substantially, fueled by the diversity of data structures and algorithms developed by the community. Here we present miBF-mapper, a long-read mapping tool in which we index the reference genome with our in-house data structure, the multi-indexed Bloom Filter (miBF). Considering >10% overlap with the true region as a correct mapping, miBF-mapper had 99.9% mapping accuracy in mapping 45k simulated ONT reads to the C. elegans reference in 5 minutes 32 seconds, requiring 1 GB of RAM, and 92.6% accuracy in mapping 50k simulated ONT reads to the H. sapiens GRCh38 reference in 35 minutes 21 seconds, requiring 148 GB of RAM. Here we discuss the miBF-mapper algorithm in detail; it is a successful application of the novel miBF data structure that should be of interest to the community.
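The accuracy criterion above (a mapping counts as correct when it overlaps the true region by more than 10%) can be made concrete with a short sketch. The abstract does not state which interval the 10% is measured against, so this sketch assumes it is the length of the true region; the function names, the half-open coordinate convention, and the example intervals are likewise hypothetical.

```python
def overlap_fraction(true_region, mapped_region):
    """Fraction of the true region covered by the mapping.

    Regions are (chromosome, start, end) tuples with half-open coordinates.
    """
    t_chrom, t_start, t_end = true_region
    m_chrom, m_start, m_end = mapped_region
    if t_chrom != m_chrom or t_end <= t_start:
        return 0.0
    shared = max(0, min(t_end, m_end) - max(t_start, m_start))
    return shared / (t_end - t_start)

def mapping_accuracy(pairs, min_overlap=0.10):
    """Share of reads whose mapping overlaps the true region by more than min_overlap."""
    correct = sum(1 for t, m in pairs if overlap_fraction(t, m) > min_overlap)
    return correct / len(pairs)

# Hypothetical simulated reads: (true origin, reported mapping).
pairs = [
    (("chrI", 1000, 2000), ("chrI", 1850, 2850)),   # 15% overlap: correct
    (("chrI", 1000, 2000), ("chrI", 1950, 2950)),   # 5% overlap: incorrect
    (("chrI", 1000, 2000), ("chrII", 1000, 2000)),  # wrong chromosome: incorrect
    (("chrI", 5000, 6000), ("chrI", 5000, 6000)),   # exact: correct
]
print(mapping_accuracy(pairs))
```

A threshold this permissive rewards placing a read in the right neighborhood rather than at exact coordinates, which is worth keeping in mind when comparing accuracy figures across mappers.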

Virtual: Next-Generation Sequencing (NGS) in populations of Indian Tropical Tasar Silkworm, Antheraea mylitta
COSI: HiTSeq
  • Renuka Gattu, Kakatiya University, India
  • Shamitha Gangupanthula, Kakatiya University, India


The tropical tasar silkworm, a semi-domesticated wild sericigenous insect, is found in the form of 44 ecoraces in India, with variations in phenotypic traits. The wide distribution of the species has exposed it to the diverse geographic and climatic conditions of distinct areas, leading to marked differences not only in phenotypic and physiological traits but also in commercial and technological aspects. A. mylitta Drury, which is an exclusive ecorace of the states of Andhra Pradesh and Telangana, is well known for its superior commercial characters but is on the verge of extinction due to its weaknesses in voltinism, emergence, hatching, low yield, etc. Conservation of the ecorace is essential to utilize its valuable genes in enhancing productivity and to build variation in new populations through hybridization. Modern sequencing methods such as NGS technologies and in silico analysis are used in population genetic studies to investigate the evolutionary forces affecting genetic variation. In the present study, the genomic DNA of the parental ecoraces of A. mylitta, Andhra local and Daba TV, and of their hybrid populations was sequenced independently using the Illumina NextSeq500 in order to analyze their genetic relationship. The sequencing library revealed that the fragment size ranged from 200 bp to 700 bp, and 35877 sites were identified in 8 samples. Further, the phylogenetic tree showed closely and distantly related taxa among the populations.

Virtual: pySeqRNA: An automated Python package for Next-Generation Sequencing data analysis and report generation
COSI: HiTSeq
  • Naveen Duhan, Utah State University, United States
  • Rakesh Kaundal, Utah State University, United States


Every day, massive amounts of data are generated by Next-Generation Sequencing (NGS) technologies. However, streamlined analysis remains a major barrier to effectively utilizing the technology. In recent years, many algorithms, statistical methods, and software tools have been developed to perform the individual analysis steps of various NGS applications. We have developed a Python package (pySeqRNA) that allows fast, efficient, manageable, and reproducible RNA-Seq analysis with a uniform workflow interface and support for running on a High-Performance Computing Cluster (HPCC) as well as on local computers. It is an extensible pipeline for performing end-to-end analysis with automated report generation. The pySeqRNA workflow consists of quality check and pre-processing of raw sequence reads; accurate mapping of millions of sequencing reads to a reference genome, including the identification of gene expression levels in two ways, (i) uniquely mapped reads and (ii) multi-mapped groups, a novel feature; differential analysis of gene expression among different biological conditions; and functional enrichment analysis. By integrating several command-line tools and custom Python scripts, it allows effective use of existing software and tools with newly written modules without restricting users to a collection of pre-defined methods and environments. This package accelerates retrieval of reproducible results from NGS experiments. http://bioinfo.usu.edu/pySeqRNA/.

Virtual: Robust Fingerprinting of Genomic Databases
COSI: HiTSeq
  • Erman Ayday, Case Western Reserve University, United States
  • Tianxi Ji, Case Western Reserve University, United States
  • Emre Yilmaz, University of Houston-Downtown, United States
  • Pan Li, Case Western Reserve University, United States


Database fingerprinting has been widely used to discourage unauthorized redistribution of data by providing means to identify the source of data leakages. However, there is no fingerprinting scheme aiming at achieving liability guarantees when sharing genomic databases. Thus, we are motivated to fill in this gap by devising a vanilla fingerprinting scheme specifically for genomic databases. Moreover, since malicious genomic database recipients may compromise the embedded fingerprint by launching effective correlation attacks which leverage the intrinsic correlations among genomic data (e.g., Mendel’s law and linkage disequilibrium), we also augment the vanilla scheme by developing mitigation techniques to achieve robust fingerprinting of genomic databases against correlation attacks.

We first show that correlation attacks against fingerprinting schemes for genomic databases are very powerful. In particular, the correlation attacks can distort more than half of the fingerprint bits while causing only a small utility loss (e.g., in database accuracy and in the consistency of SNP-phenotype associations measured via p-values). Next, we experimentally show that the correlation attacks can be effectively mitigated by our proposed mitigation techniques. We validate that the attacker can hardly compromise a large portion of the fingerprint bits even if it pays a higher cost in terms of degradation of the database utility. For example, with around 24% loss in accuracy and 20% loss in the consistency of SNP-phenotype associations, the attacker can only distort about 30% of the fingerprint bits, which is insufficient for it to avoid being accused. We also show that the proposed mitigation techniques preserve the utility of the shared genomic databases.

Virtual: snRNA-seq resolved glial and neuronal communication changes during Alzheimer’s disease progression
COSI: HiTSeq
  • Yashna Paul, AbbVie Deutschland GmbH & Co. KG, Genomics Research Center, Knollstrasse, 67061 Ludwigshafen, Germany
  • Gen Lin, AbbVie Deutschland GmbH & Co. KG, Genomics Research Center, Knollstrasse, 67061 Ludwigshafen, Germany
  • Maya Woodbury, AbbVie, Cambridge Research Center, 200 Sidney Street Cambridge, MA 02139, United States
  • Robert Talanian, AbbVie, Cambridge Research Center, 200 Sidney Street Cambridge, MA 02139, United States
  • Knut Biber, AbbVie Deutschland GmbH & Co. KG, Genomics Research Center, Knollstrasse, 67061 Ludwigshafen, Germany
  • Janina S. Ried, AbbVie Deutschland GmbH & Co. KG, Genomics Research Center, Knollstrasse, 67061 Ludwigshafen, Germany
  • Astrid Wachter, AbbVie Deutschland GmbH & Co. KG, Genomics Research Center, Knollstrasse, 67061 Ludwigshafen, Germany


Presentation Overview: Show

In Alzheimer’s disease (AD), reactive microglia and astrocytes are suggested to disrupt neuronal functions potentially leading to neurodegeneration and cognitive decline. As microglia and astrocytes may act together in the disease process, we aimed to identify ligand-receptor interaction pairs involved in AD pathology by using single nuclei RNA sequencing (snRNA-seq) NeuN-negative profiles of glial cells from brain tissue of 18 AD and control donors.
Six permutation-based approaches implemented in the LIANA framework (CellChat, Connectome, iTALK, CellPhoneDB, NATMI and SCA) that assigned cell-cell interaction scores were used in combination to identify astrocyte-microglia subtype interactions specific to AD. A public snRNA-seq study including 24 AD and 24 control donors (Mathys et al., 2019) was not only used to validate these interactions but was further utilized to infer glial-neuronal interactions. Interactions associated with progression of AD were identified by correlating interaction scores to pathological determinants such as APOE status, Braak stage, total tangle and total plaque. Interactions uniquely occurring in early and late pathology AD indicated involvement of specific biological processes in different disease stages.
This study highlights human glial-neuronal interactions with known AD GWAS hits, drug targets or association to AD pathology.

Virtual: Technology dictates algorithms: recent developments in read alignment
COSI: HiTSeq
  • Sergey Knyazev, University of California, Los Angeles, United States
  • Serghei Mangul, USC, United States
  • Onur Mutlu, Carnegie Mellon University, ETH Zurich, United States
  • Can Alkan, Bilkent University, Department of Computer Engineering, Turkey
  • Alex Zelikovsky, Georgia State University, United States
  • Pavel Skums, Georgia State University, United States
  • David Koslicki, Penn State University, United States
  • Brunilda Balliu, Leiden University Medical Center, Netherlands
  • Benjamin D. Singer, Northwestern University Feinberg School of Medicine, United States
  • Mohammed Alser, ETH Zurich, Switzerland
  • Victor Xue, University of California Los Angeles, United States
  • Harry Taegyun Yang, UCLA Department of Computer Science, Zarlab, United States
  • Pelin Burcak Icer, ETH-Zurich, Switzerland
  • Huwenbo Shi, Harvard University, United States
  • Kodi Taraszka, UCLA, United States
  • Dhrithi Deshpande, University of Southern California, United States
  • Jeremy Rotman, University of Southern California, United States


Presentation Overview: Show

Aligning sequencing reads onto a reference is an essential step in the majority of genomic analysis pipelines. Computational algorithms for read alignment have evolved in accordance with technological advances, leading to today’s diverse array of alignment methods. We survey algorithmic foundations and methodologies across 107 alignment methods published between 1988 and 2021, for both short and long reads. We discuss the weaknesses and strengths of the algorithms using our rigorous experimental evaluation. We separately discuss how longer read lengths produce unique advantages and limitations for read alignment techniques.

Our review focuses on the interplay between technological development and algorithm development. It can explain the success behind popular read aligners, guide the choice of the most appropriate read alignment tools for particular problems, and identify new algorithmic research directions in response to the advancement of long-read technologies and novel sequencing protocols. It also discusses how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoires of T and B cell receptors, and human microbiome studies.

Virtual: UniTVelo: temporally unified RNA velocity reinforces single-cell trajectory inference
COSI: HiTSeq
  • Mingze Gao, The University of Hong Kong, Hong Kong
  • Yuanhua Huang, The University of Hong Kong, Hong Kong
  • Chen Qiao, The University of Hong Kong, Hong Kong


Presentation Overview: Show

The recent breakthrough of single-cell RNA velocity methods promises automatic identification of directed trajectories in cell differentiation, state transitions and responses to perturbations, which is particularly needed for in vivo applications and abnormal conditions. However, the existing RNA velocity methods, including scVelo, are often found to return erroneous results, partly due to model violation of complex expression profiles or lack of temporal regularization. Here, we present UniTVelo, a statistical framework of RNA velocity that models the flexible transcription dynamics of spliced and unspliced RNAs via a spliced-RNA-oriented framework. Uniquely, it also supports the effective inference of a unified latent time across genes and orders cells on individual genes in the phase portrait, especially for multiple-rate kinetics genes and those with stable and monotonic changes across the transcriptome. With ten datasets, we demonstrate that UniTVelo returns the expected trajectory in different biological systems, including hematopoietic differentiation and even those with weak kinetics or complex branches. Specifically, UniTVelo correctly identifies the differentiation trajectories of human bone marrow development, from hematopoietic stem cells to three distinct branches. This system is complex and cannot be fully resolved by other currently available RNA velocity methods.

A-001: Gene fusion detection and characterization in long-read cancer transcriptomes with FusionSeeker
COSI: HiTSeq
  • Yu Chen, University of Alabama at Birmingham, United States
  • Yiqing Wang, University of Alabama at Birmingham, United States
  • Weisheng Chen, University of Alabama at Birmingham, United States
  • Yuwei Song, University of Alabama at Birmingham, United States
  • Herbert Chen, University of Alabama at Birmingham, United States
  • Zechen Chong, University of Alabama at Birmingham, United States


Presentation Overview: Show

Long abstract

A-002: κ-velo improves single-cell RNA-velocity estimation
COSI: HiTSeq
  • Valérie Marot-Lassauzaie, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany, Germany
  • Brigitte Joanne Bouman, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany, Germany
  • Fearghal Declan Donaghy, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany, Germany
  • Laleh Haghverdi, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany, Germany


Presentation Overview: Show

Single-cell transcriptomics has been used to study dynamical processes such as cell differentiation. RNA velocity (La Manno et al. 2020) was a breakthrough towards obtaining a more complete description of the dynamics of such processes. Here, simultaneous measurement of new unspliced and old spliced mRNA adds a temporal dimension to the data. The change in mRNA abundance, called RNA velocity, is used to infer the progression of cells through the dynamical process. However, reliable velocity analysis is still impeded by multiple computational issues. State-of-the-art methods for velocity inference (Bergen et al. 2020) have issues in both velocity estimation and visualisation. Moreover, there are inconsistencies in current processing pipelines, and the single-cell specific (stochastic) part of the dynamics is lost through multiple layers of data smoothing.
We introduce a new method for RNA velocity analysis that addresses some of the issues in velocity estimation. We also propose that visualisation of the velocities based on the Nystroem projection method represents the single-cell stochasticity better than current practices. Finally, we adjust the processing pipeline for consistency with downstream velocity estimation. We validate our model on simulated and real data, and compare it to the current state of the art.
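
To make the notion of "change in mRNA abundance" concrete, the following minimal sketch uses the standard splicing-kinetics model on which RNA velocity is commonly based: unspliced mRNA u is spliced at rate beta and spliced mRNA s is degraded at rate gamma, so ds/dt = beta * u - gamma * s. This is not the method introduced in this abstract; the function name, the rate values and the example counts are assumptions chosen purely for illustration.

```python
def rna_velocity(unspliced, spliced, beta=1.0, gamma=0.5):
    """Per-cell velocity of one gene under ds/dt = beta * u - gamma * s.

    unspliced, spliced: equal-length sequences of abundances for one gene.
    beta, gamma: splicing and degradation rates (illustrative values only).
    """
    if len(unspliced) != len(spliced):
        raise ValueError("unspliced and spliced must have the same length")
    return [beta * u - gamma * s for u, s in zip(unspliced, spliced)]


# Hypothetical abundances for two cells (not data from the study).
# Cell 1 has excess unspliced mRNA, so the gene is being induced (v > 0);
# cell 2 has excess spliced mRNA, so the gene is being repressed (v < 0).
velocities = rna_velocity([2.0, 0.5], [1.0, 3.0])
print(velocities)
```

A positive value indicates that spliced abundance is expected to rise and a negative value that it is expected to fall; the sign pattern across genes is what places a cell along the dynamical process.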

A-003: Comparison of short and long reads in structural and functional annotation of non-model bacteria
COSI: HiTSeq
  • Jana Musilova, Brno University of Technology, FEEC, Department of Biomedical Engineering, Czechia
  • Xenie Kourilova, Brno University of Technology, FCH, Department of Food Chemistry and Biotechnology, Czechia
  • Matej Bezdicek, University Hospital Brno, Department of Internal Medicine—Hematology and Oncology, Czechia
  • Stanislav Obruca, Brno University of Technology, FCH, Department of Food Chemistry and Biotechnology, Czechia
  • Karel Sedlar, Brno University of Technology, FEEC, Department of Biomedical Engineering, Czechia


Presentation Overview: Show

DNA sequencing is a unique way to gain insight into the structure of the genome and the functions of an organism. In this study, we compared the widely used Illumina short-read and Oxford Nanopore long-read sequencing technologies in structural and functional annotation of non-model bacteria. We examined Schlegelella thermodepolymerans subspecies DSM 15264, LMG 21645, and CCUG 50061, non-model Gram-negative industrially utilizable representatives. Although these bacteria have significant potential for the production of polyhydroxyalkanoates (degradable bioplastics) from agro-food industry waste, assemblies of their genomes are not available.
The results revealed Nanopore as the more efficient approach for initial genome characterization. Compared to Illumina, Nanopore revealed more structural genomic features and assigned more genes to the Clusters of Orthologous Groups (COGs). Moreover, Nanopore assemblies had a largest contig and an N50 many times higher, and a number of contigs many times lower, than Illumina assemblies. On the other hand, Nanopore sequencing has been shown to be error-prone. Consequently, individual genomic features in Nanopore assemblies are less accurate, resulting in incomplete structural annotation and incorrect functional annotation in several cases. Illumina sequencing is, therefore, more applicable for detailed studies of specific genomic regions.
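
Because the comparison above relies on N50 and contig counts, a short sketch of how N50 is computed may help. N50 is the contig length L such that contigs of length at least L together cover at least half of the total assembly. The contig lengths below are invented for illustration and are not results from this study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    if not contig_lengths:
        raise ValueError("need at least one contig")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Two hypothetical assemblies of the same total size (1500 bp).
fragmented = [100, 200, 300, 400, 500]   # many short contigs
contiguous = [1200, 300]                 # few long contigs
print(n50(fragmented), n50(contiguous))
```

With the same total length, the assembly made of fewer, longer contigs has the higher N50, which is the pattern the abstract reports for Nanopore relative to Illumina.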

A-004: Theory of local k-mer selection with applications to long-read alignment
COSI: HiTSeq
  • Jim Shaw, University of Toronto, Canada
  • Yun William Yu, University of Toronto, Canada


Presentation Overview: Show

Motivation:

Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, little theoretical understanding of why certain methods perform well.

Results:

We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and performed a long-read transcriptome mapping experiment. Our results give new insights into how new k-mer selection strategies offer new parameterizations that can be used for optimizing speed and alignment quality.
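
The selection schemes named above can be illustrated with a minimal sketch. It assumes plain lexicographic ordering of k-mers and s-mers (practical tools typically order by a hash) and breaks ties by taking the leftmost minimum; the function names are hypothetical and this is not the authors' code or the modified minimap2. A minimizer keeps the lowest-ordered k-mer in each window of w consecutive k-mers. An open syncmer is a k-mer whose smallest s-mer starts at a fixed offset t, and a closed syncmer is one whose smallest s-mer sits at its first or last position.

```python
def kmers(seq, k):
    """All substrings of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w):
    """Start positions of the lowest-ordered k-mer in every window of w k-mers."""
    ks = kmers(seq, k)
    selected = set()
    for start in range(len(ks) - w + 1):
        window = range(start, start + w)
        # Lexicographic order; ties go to the leftmost k-mer in the window.
        selected.add(min(window, key=lambda i: (ks[i], i)))
    return sorted(selected)


def open_syncmers(seq, k, s, t=0):
    """Start positions of k-mers whose smallest s-mer starts at offset t."""
    picked = []
    for i, kmer in enumerate(kmers(seq, k)):
        smers = kmers(kmer, s)
        if smers.index(min(smers)) == t:  # index() returns the leftmost minimum
            picked.append(i)
    return picked


def closed_syncmers(seq, k, s):
    """Start positions of k-mers whose smallest s-mer is at the first or last offset."""
    picked = []
    for i, kmer in enumerate(kmers(seq, k)):
        smers = kmers(kmer, s)
        if smers.index(min(smers)) in (0, k - s):
            picked.append(i)
    return picked


seq = "ACGTTGCATGCAAGT"  # arbitrary example sequence
print(minimizers(seq, k=5, w=3))
print(open_syncmers(seq, k=5, s=2))
print(closed_syncmers(seq, k=5, s=2))
```

Whether a k-mer is a syncmer depends only on the k-mer itself, whereas whether it is a minimizer also depends on its neighbours in the window, so a mutation outside a k-mer can still change the minimizer selection.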

A-005: Accurate germline variant calling from RNA-Seq data using deep learning and Genome-in-a-Bottle reference cell-line data
COSI: HiTSeq
  • Aarti Venkat, Tempus Labs Inc., United States
  • Daniel Cook, Google Health, United States
  • Yannick Pouliot, Tempus Labs, Inc., United States
  • Pi-Chuan Chang, Google Health, United States
  • Andrew Carroll, Google Health, United States
  • Francisco De La Vega, Tempus Labs, Inc., United States


Presentation Overview: Show

RNA-Seq is the leading technology for genome-wide transcript quantification and characterization. RNA-Seq data may contain useful information about transcribed genetic variants. However, accurate variant calling from RNA-Seq data is challenging due to the huge variation in depth of coverage. We leveraged DeepVariant to call variants by retraining a previous whole-exome sequencing CNN model with RNA-Seq alignments. We trained on RNA-Seq data from three cell lines used as sources of the Genome-in-a-Bottle (GiaB) reference materials. To represent assay variability, we sequenced HG002 and HG005 in triplicate and HG001 in 10 replicates. Benchmarking shows that our training improves the F1 score for all coding regions from 0.08 for the initial whole exome sequencing starting model to 0.64 after the training cycle. With a genotype quality score threshold set to provide a ≤1.5% false discovery rate, we obtained a sensitivity of 37% for all coding regions and 92% for coding regions of highly expressed genes. Our results show that DeepVariant models trained with RNA-Seq data with high quality truth sets can deliver accurate germline variant calls.
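
The benchmarking figures above (F1 score, sensitivity, false discovery rate) all derive from counts of true-positive, false-positive and false-negative calls against a truth set. The sketch below shows the standard definitions; the function name and the counts are hypothetical and do not reproduce the reported results.

```python
def call_metrics(tp, fp, fn):
    """Standard variant-calling metrics from confusion counts.

    tp: calls present in the truth set
    fp: calls absent from the truth set
    fn: truth-set variants that were not called
    """
    if tp + fp == 0 or tp + fn == 0:
        raise ValueError("metrics are undefined without calls or truth variants")
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)      # also called recall
    fdr = fp / (tp + fp)              # false discovery rate = 1 - precision
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and sensitivity
    return {"precision": precision, "sensitivity": sensitivity,
            "fdr": fdr, "f1": f1}


# Hypothetical counts, chosen only to show the arithmetic.
metrics = call_metrics(tp=80, fp=1, fn=20)
print(metrics)
```

Raising the genotype quality threshold removes low-confidence calls, which tends to lower the false discovery rate at the cost of sensitivity; that trade-off is why the abstract reports sensitivity at a fixed false discovery rate.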

A-006: Rigorous benchmarking of T cell receptor repertoire profiling methods for cancer RNA sequencing
COSI: HiTSeq
  • Kerui Peng, University of Southern California, United States
  • Serghei Mangul, University of Southern California, United States


Presentation Overview: Show

The ability to identify and track T cell receptor (TCR) sequences from patient samples has become central to the field of cancer research. The available high-throughput method to profile T cell receptor repertoires is TCR sequencing (TCR-Seq). However, the available TCR-Seq data is limited compared to RNA sequencing. We have benchmarked the ability of RNA-Seq-based methods to profile TCR repertoires by examining 19 bulk RNA-Seq samples across four cancer cohorts including both T cell rich and poor tissues. We have performed a comprehensive evaluation of the existing RNA-Seq-based repertoire profiling methods using targeted TCR-Seq as the gold standard. We also highlighted scenarios under which the RNA-Seq approach is suitable and can provide comparable accuracy to the TCR-Seq approach. Results show that these methods are able to effectively capture the clonotypes and estimate the diversity of TCR repertoires, as well as provide relative frequencies of clonotypes in T cell rich tissues and monoclonal repertoires. However, these methods have limited power in T cell poor tissues, especially in polyclonal repertoires. The results of our benchmarking provide an appealing argument to incorporate RNA-Seq into immune repertoire screening of cancer patients, as it offers insight into transcriptomic changes beyond the limited information provided by TCR-Seq.
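
As an illustration of the two repertoire summaries mentioned above, the sketch below computes relative clonotype frequencies and a diversity value from a list of clonotype calls. The abstract does not state which diversity measure was used; Shannon entropy is assumed here as one common choice, and the clonotype labels are placeholders rather than real CDR3 sequences from the study.

```python
import math
from collections import Counter


def clonotype_frequencies(clonotype_calls):
    """Relative frequency of each clonotype among the supplied calls."""
    counts = Counter(clonotype_calls)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no clonotype calls supplied")
    return {clone: n / total for clone, n in counts.items()}


def shannon_diversity(frequencies):
    """Shannon entropy (natural log) of a clonotype frequency table."""
    return -sum(p * math.log(p) for p in frequencies.values() if p > 0)


# Placeholder clonotype labels for two hypothetical samples.
monoclonal = ["clone_1"] * 20
polyclonal = ["clone_1", "clone_2", "clone_3", "clone_4"] * 5

for name, calls in (("monoclonal", monoclonal), ("polyclonal", polyclonal)):
    freqs = clonotype_frequencies(calls)
    print(name, round(shannon_diversity(freqs), 3))
```

A repertoire dominated by one clone has entropy near zero, while an even polyclonal repertoire approaches the logarithm of the number of clonotypes. Polyclonal repertoires spread the same number of reads over many clonotypes, which is consistent with the reduced power the abstract reports for them.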

A-007: Semi-Supervised Pipeline for Autonomous Annotation of SARS-CoV-2 Genomes
COSI: HiTSeq
  • Kristen Beck, IBM, United States
  • Edward Seabolt, IBM, United States
  • Akshay Agarwal, IBM Corp, United States
  • Gowri Nayar, IBM Research, United States
  • Simone Bianco, IBM, United States
  • Harsha Krishnareddy, IBM, United States
  • Timothy Ngo, IBM, United States
  • Mark Kunitomi, IBM, United States
  • Vandana Mukherjee, IBM Research, Almaden, United States
  • James Kaufman, IBM, United States


Presentation Overview: Show

SARS-CoV-2 sequencing has scaled dramatically, yet existing genome annotation methods can result in missing or incorrect gene/protein sequences. To overcome this limitation, we developed a novel semi-supervised pipeline for automated gene, protein, and functional domain annotation of SARS-CoV-2 genomes that is reference-free and overcomes atypical genomic traits. With this, we analyzed 66,000 genomes and identified the comprehensive set of known proteins with 98.5% set membership accuracy and 99.1% accuracy in length prediction, compared to proteome references, including Replicase polyprotein 1ab (with its transcriptional slippage site). Compared to Prokka (base) and VAPiD, our pipeline yielded 6.4- and 1.8-fold increases in protein annotations, respectively. Our method generated 13,000,000 gene, protein, and domain sequences, some conserved spatiotemporally and others representing emerging mutations, e.g., D614G and N501Y. For spike glycoprotein domains, we achieved >97.9% reference sequence identity and characterized RBD variants. We demonstrated robustness and extensibility on an additional 4,000 genomes spanning eight variants of concern and interest. In this cohort, we successfully identified all keystone spike glycoprotein mutations with >99% accuracy and demonstrated high protein and domain annotation accuracy. This work comprehensively presents the molecular targets to refine biomedical interventions for SARS-CoV-2 with a scalable, high-accuracy method to analyze newly sequenced infections as they arise.

A-008: Accurate genotyping of UGT1A1 dinucleotide repeat polymorphism from targeted NGS data for the assessment of irinotecan chemotherapy adverse events
COSI: HiTSeq
  • Yan Yang, Tempus Labs, Inc., United States
  • Len Trigg, Real Time Genomics, Inc., New Zealand
  • Kurt Gaastra, Real Time Genomics, Inc., New Zealand
  • Sean Irvine, Real Time Genomics, Inc., New Zealand
  • Gene Selkov, Tempus Labs, Inc., United States
  • Kyung Choi, Tempus Labs, Inc., United States
  • Robert Huether, Tempus Labs, Inc., United States
  • Francisco De La Vega, Tempus Labs, Inc., United States


Presentation Overview: Show

The gene UGT1A1 encodes the enzyme responsible for the glucuronidation of SN-38, the active metabolite of irinotecan (IRI), a chemotherapy drug. Wild-type UGT1A1 contains six TA repeats [A(TA)6TAA] in its promoter region. Polymorphic UGT1A1 alleles with a higher number of TA repeats, such as the UGT1A1*28/(TA)7 and *37/(TA)8 alleles, cause decreased enzyme activity and are associated with adverse events of irinotecan. Genotyping of UGT1A1 polymorphisms from NGS data is challenging due to artifacts from DNA polymerase slippage. We developed a novel method, BayeSTR, to call accurate UGT1A1 repeat genotypes from target capture NGS data. BayeSTR analyzes read alignments to a graph-based model representing the possible repeat alleles, applies an empirically derived “stutter” denoising model, and then performs genotype calling by a Bayesian model. We validated our method with germline data from the Tempus xT tumor-normal matched NGS test, which targets 648 cancer related genes including the UGT1A1 promoter. We observed 100% accuracy through analysis of sequencing data from a collection of 54 Coriell cell-line DNA samples whose UGT1A1 genotypes were established orthogonally. BayeSTR allows for automated, accurate UGT1A1 promoter genotyping from targeted NGS data.
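
The general idea of combining a stutter model with Bayesian genotype calling can be sketched on a toy scale. The code below is not BayeSTR: it skips the graph-based alignment step, assumes each read has already been reduced to an observed TA repeat count, uses invented stutter probabilities in place of an empirically derived model, and applies a flat prior over diploid genotypes. All names and parameter values are hypothetical.

```python
import math
from itertools import combinations_with_replacement

ALLELES = (5, 6, 7, 8)  # candidate TA repeat counts


def read_prob(observed, allele, down=0.10, up=0.02, floor=1e-4):
    """P(observed repeat count | true allele) under a toy stutter model.

    down/up: chance that polymerase slippage removes or adds one repeat.
    floor: small value for larger shifts so no read has zero likelihood
           (the toy distribution is therefore only approximately normalised).
    """
    diff = observed - allele
    if diff == 0:
        return 1.0 - down - up
    if diff == -1:
        return down
    if diff == 1:
        return up
    return floor


def genotype_posteriors(read_repeat_counts, alleles=ALLELES):
    """Posterior over unordered diploid genotypes with a flat prior."""
    log_like = {}
    for genotype in combinations_with_replacement(alleles, 2):
        a1, a2 = genotype
        # Each read is assumed equally likely to come from either allele.
        log_like[genotype] = sum(
            math.log(0.5 * read_prob(r, a1) + 0.5 * read_prob(r, a2))
            for r in read_repeat_counts
        )
    best = max(log_like.values())
    weights = {g: math.exp(v - best) for g, v in log_like.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


# Hypothetical reads: mostly 6 and 7 repeats, plus one stutter product at 5.
reads = [6] * 10 + [7] * 9 + [5]
posteriors = genotype_posteriors(reads)
print(max(posteriors, key=posteriors.get))
```

Without the stutter term, the single read showing five repeats would argue for a third allele; modelling it as a slippage product lets the two well-supported alleles dominate the posterior.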

A-009: novoBreak-rna: local assembly for novel splice junction detection from RNA-seq data
COSI: HiTSeq
  • Yukun Tan, UT MD Anderson Cancer Center, United States
  • Vakul Mohanty, UT MD Anderson Cancer Center, United States
  • Shaoheng Liang, UT MD Anderson Cancer Center, United States
  • Kun Hee Kim, UT MD Anderson Cancer Center, United States
  • Jun Ma, UT MD Anderson Cancer Center, United States
  • Marc Jan Bonder, German Cancer Research Center, Germany
  • Xinghua Shi, Temple University, United States
  • Zechen Chong, University of Alabama at Birmingham, United States
  • Ken Chen, UT MD Anderson Cancer Center, United States


Presentation Overview: Show

Splice junctions, which govern the process of removing introns by the RNA splicing machinery, are a vital component of eukaryotic genes. Identification of splice junctions provides valuable insights into alternative splicing and fusion transcript events, which have been found in most of the hallmarks of cancer and can potentially be applied to cancer diagnosis, prognosis, and therapy. However, most of the available tools for splice junction detection directly align paired-end short reads to the genomic reference and identify the splice junctions from the discordant read pairs. Although computationally efficient, alignment-based approaches are fundamentally limited in detecting sequences that are substantially different from the reference, which are the sequences most likely to contain splice junctions, because of the challenges in accurately splitting and aligning short fragments. On the other hand, the de novo whole transcriptome assembly approach, which attempts to assemble all reads into a single consensus transcriptome, is computationally intensive. In this study, we propose a local assembly-based framework, called novoBreak-rna, which modifies our well-attested genomic structural variation breakpoint assembly tool novoBreak to assemble novel splice junctions in RNA-seq data. The results using real prostate cancer data from TCGA demonstrate that our method achieves higher sensitivity in detecting novel splice junctions.

A-010: Integrative reconstruction of cancer genome karyotypes using InfoGenomeR
COSI: HiTSeq
  • Yeonghun Lee, Gwangju Institute of Science and Technology, South Korea
  • Hyunju Lee, Gwangju Institute of Science and Technology, South Korea


Presentation Overview: Show

Annotation of structural variations (SVs) and base-level karyotyping in cancer cells remains challenging. Here, we present Integrative Framework for Genome Reconstruction (InfoGenomeR), a graph-based framework that can reconstruct individual SVs into karyotypes based on whole-genome sequencing data, by integrating SVs, total copy number alterations, allele-specific copy numbers, and haplotype information. Using whole-genome sequencing data sets of patients with breast cancer, glioblastoma multiforme, and ovarian cancer, we demonstrate the analytical potential of InfoGenomeR. We identify recurrent derivative chromosomes derived from chromosomes 11 and 17 in breast cancer samples, with homogeneously staining regions for CCND1 and ERBB2, and double minutes and breakage-fusion-bridge cycles in glioblastoma multiforme and ovarian cancer samples, respectively. Moreover, we show that InfoGenomeR can discriminate private and shared SVs between primary and metastatic cancer sites that could contribute to tumour evolution. These findings indicate that InfoGenomeR can guide targeted therapies by unravelling cancer-specific SVs on a genome-wide scale. This paper was published in Nat Commun 12, 2467 (2021) https://doi.org/10.1038/s41467-021-22671-6.

A-011: Building a compendium of publicly available microbial isolate RNA-seq data
COSI: HiTSeq
  • Taylor Reiter, University of Colorado Anschutz Medical Campus, United States
  • Casey Greene, University of Colorado Anschutz Medical Campus, United States


Presentation Overview: Show

Researchers who focus on model organisms greatly benefit from compendia like recount, GTEx, and The Cancer Genome Atlas, which improve the findability and decrease the analysis burden for gene expression data from different experiments. While gene expression compendia exist for some bacterial model organisms like Escherichia coli and Pseudomonas aeruginosa, no compendium exists that unites gene expression profiles across all bacterial and archaeal species. We produced a compendium that integrates the 59,239 publicly available isolate bacterial and archaeal RNA-seq samples, creating a community resource that stands to improve data access and decrease time-to-insight for researchers interested in microbial gene expression. The main product of the pipeline is a normalized ortholog count table that includes all processed samples. Additionally, to support cross-domain and domain-specific inquiries the pipeline allows flexible data outputs. These include strain- or species-specific count tables and interconversion between annotation formats (e.g. ortholog to reference genome). All research products, including strain profiles, reference pangenomes, raw and normalized compendia, annotation maps, and analysis code will be made publicly available. Our pipeline is encoded in Snakemake and is available at github.com/greenelab/2022-microberna.

A-012: Mutational signatures of complex genomic rearrangements in human cancer
COSI: HiTSeq
  • Lixing Yang, University of Chicago, United States


Presentation Overview: Show

Complex genomic rearrangements (CGRs) are common in cancer and are known to form via two aberrant cellular structures: micronuclei and chromatin bridges. However, which mechanism is more relevant to CGR formation in cancer and whether there are other undiscovered mechanisms remain unknown. Here we developed a computational algorithm, ‘Starfish’, to analyze 2,014 CGRs from 2,428 whole-genome-sequenced tumors and discover six CGR signatures based on their copy number and breakpoint patterns. Through extensive benchmarking, we show that our CGR signatures are highly accurate and biologically meaningful. Three signatures can be attributed to known biological processes: micronuclei- and chromatin-bridge-induced chromothripsis and circular extrachromosomal DNA. More than half of the CGRs belong to the remaining three signatures, which have not been reported previously. A unique signature, which we named “hourglass chromothripsis”, with localized breakpoints and a small amount of DNA loss, is abundant in prostate cancer. We find that SPOP is associated with hourglass chromothripsis and may play an important role in maintaining genome integrity.

A-013: Accurate long-read de novo assembly evaluation with Inspector
COSI: HiTSeq
  • Yu Chen, UAB, United States
  • Yixin Zhang, UAB, United States
  • Amy Wang, UAB, United States
  • Min Gao, UAB, United States
  • Zechen Chong, UAB, United States


Presentation Overview: Show

Long-read de novo genome assembly continues to advance rapidly. However, there is a lack of effective tools to accurately evaluate the assembly results, especially for structural errors. We present Inspector, a reference-free long-read de novo assembly evaluator which faithfully reports types of errors and their precise locations. Notably, Inspector can correct the assembly errors based on consensus sequences derived from raw reads covering erroneous regions. Based on in silico and long-read assembly results from multiple long-read data and assemblers, we demonstrate that in addition to providing generic metrics, Inspector can accurately identify both large-scale and small-scale assembly errors.

A-014: Quantifying the accuracy of genetic demultiplexing of pooled single cell genomics in the mouse across multiple tissues and data types
COSI: HiTSeq
  • Marina Yurieva, The Jackson Laboratory for Genomic Medicine, United States
  • Dan Skelly, The Jackson Laboratory for Genomic Medicine, United States
  • Candice Baker, The Jackson Laboratory for Genomic Medicine, United States
  • Will Schott, The Jackson Laboratory for Genomic Medicine, United States
  • Sandy Diagle, The Jackson Laboratory for Genomic Medicine, United States
  • Joshy George, The Jackson Laboratory for Genomic Medicine, United States


Presentation Overview: Show

Single cell genomics is a rapidly growing and widely used technology that helps to understand cellular heterogeneity and to elucidate the cell type-specific mechanisms mediating disease susceptibility. Nevertheless, the costs of single cell genomic assays remain relatively high and sample throughput low. Genetic demultiplexing is a method that can be used to identify cells from individuals based on natural genetic variation in single cell datasets. It has been used in several human studies but has not been applied to data from non-human systems. A detailed examination of the factors influencing the power and accuracy of labelled sample assignments is not available.

Here we examine the parameters affecting the success of a single cell genetic demultiplexing study using demuxlet, a tool used to separate samples using genetic variation. We find that sequencing depth is the main determinant of demultiplexing success, suggesting that independent genetic variants are the key quantity powering genetic demultiplexing. We provide a pipeline that can be used to split a BAM file by individual cell barcodes and downsample each individual cell’s reads to determine the robustness of sample assignments as a function of sequencing depth.
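
The split-and-downsample step described above can be sketched as follows. This is not the authors' pipeline, which operates on BAM files; as a simplifying assumption, reads are represented here as in-memory (cell barcode, read identifier) pairs, and the function names, barcodes and sampling fraction are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_barcode(reads):
    """Group read identifiers by their cell barcode."""
    per_cell = defaultdict(list)
    for barcode, read_id in reads:
        per_cell[barcode].append(read_id)
    return dict(per_cell)


def downsample(per_cell, fraction, seed=0):
    """Keep a random subset of each cell's reads, sampled without replacement."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return {
        barcode: rng.sample(read_ids, int(len(read_ids) * fraction))
        for barcode, read_ids in per_cell.items()
    }


# Hypothetical reads from two cells with different depths.
reads = [("AAAC", f"r{i}") for i in range(10)] + [("GGGT", f"q{i}") for i in range(4)]
per_cell = split_by_barcode(reads)
half_depth = downsample(per_cell, fraction=0.5)
print({barcode: len(ids) for barcode, ids in half_depth.items()})
```

Repeating the sample assignment at several fractions and comparing it with the full-depth assignment gives a curve of assignment robustness against sequencing depth.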

A-015: Accurate assembly of multi-end RNA-seq data with Scallop2
COSI: HiTSeq
  • Qimin Zhang, The Pennsylvania State University, United States
  • Qian Shi, The Pennsylvania State University, United States
  • Mingfu Shao, The Pennsylvania State University, United States


Presentation Overview: Show

Modern RNA-sequencing protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of the splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge, and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves the assembly accuracy compared with two popular assemblers, StringTie2 and Scallop. Scallop2 represents a significant leap forward for transcript assembly and therefore enables further improvement of the identification of novel transcripts and the downstream isoform-level expression analysis. More importantly, Scallop2 enables accurate construction of transcriptomes at single-cell resolution, which benefits a broader use and advances biological and biomedical research in the era of single-cell omics.

A-016: TransACT provides enhanced detection and characterization of translocation events from high-throughput sequencing data at base-pair resolution for gene editing products
COSI: HiTSeq
  • Timothy Collingsworth, Vor Biopharma, United States
  • Kit Cummins, Vor Biopharma, United States
  • Michael Pettiglio, Vor Biopharma, United States
  • Nipul Patel, Vor Biopharma, United States
  • Caroline McGowan, Vor Biopharma, United States
  • Julianna Xavier-Ferrucio
  • Ruijia Wang, Vor Biopharma, United States
  • Shu Wang, Vor Biopharma, United States
  • Michelle Lin, Vor Biopharma, United States
  • John Lydeard, Vor Biopharma, United States
  • Gary Ge, Vor Biopharma, United States
  • Tirtha Chakraborty, Vor Biopharma, United States


Presentation Overview: Show

Gene editing is a powerful approach to improve our ability to treat specific diseases with an unmet medical need. CRISPR-Cas-based gene editing has broad therapeutic applications but also has the potential to increase the possibility of chromosomal translocations after introducing genomic cuts, especially when introducing multiple edits (multiplex editing), nullifying or diminishing the benefits of the therapy by precipitating additional disorders. The development of computational tools to support translocation detection and quantification methods therefore represents a necessary and impactful contribution to the field.

Here, we enhance tools designed for unidirectional sequencing to improve scalable detection and characterization of on-on, on-off, and off-off target translocation events in edited genomes. Our bioinformatics package, TransACT (Translocation Analysis Computational Toolkit), can detect translocations in unidirectional as well as targeted amplicon next-generation sequencing data. In addition, we implement advanced false-positive filtering to increase the confidence level and generate summary statistics with translocation visualizations at single base-pair resolution. Finally, we demonstrate the accuracy and limit of detection using spike-in translocation datasets.

TransACT is a sophisticated translocation detection and quantification method especially useful for the evaluation of multiplex editing techniques to assess the pre-clinical and clinical safety of gene editing drug products.

A-017: Detecting multiple SARS-CoV-2 variants from short-read sequencing reads of community wastewater samples
COSI: HiTSeq
  • James Denvir, Marshall University, United States
  • Vinícius Magalhães Borges, Marshall University, United States
  • Alejandro Q. Nato Jr., Marshall University, United States
  • Adeoluwa Adeluola, Marshall University, United States


Presentation Overview:

Tracking the spread of SARS-CoV-2 variants has been an essential tool in the public health response to the COVID-19 pandemic. The inflow to public wastewater treatment facilities is a source of SARS-CoV-2 viruses from the community served by the facility. Short-read sequencing of these viral samples has the potential to identify variants present in the sample. However, the combination of the short read length and the heterogeneity of the sample poses challenges to the analysis. We demonstrate a novel graph-theory-based approach to the analysis of sequencing data from heterogeneous SARS-CoV-2 samples. Briefly, we identify sites in the viral genome that are polymorphic in the sample, and then identify subsets of these, which we term “discriminating mutation sets,” that segregate with reads. We applied this analysis to data from sequencing of wastewater sampled in January 2022 and, by counting reads consistent with each of the discriminating mutation sets, were able to provide estimates of the relative abundance of Delta and Omicron variants in the samples. This technique also shows potential for identification and relative quantification of variants at a more fine-grained phylogenetic level.
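
The read-counting step described above can be illustrated with a minimal sketch. It does not reproduce the authors' graph-based identification of discriminating mutation sets; it assumes those sets are already known. The positions, bases, and names (`variant_A`, `variant_B`, `count_consistent_reads`, `relative_abundance`) are hypothetical, and the consistency rule used here (a read must cover at least one site of a set and agree at every site it covers) is an assumption made for illustration.

```python
from collections import Counter

def count_consistent_reads(reads, mutation_sets):
    """Count reads consistent with each mutation set.

    reads: list of dicts mapping genome position -> observed base.
    mutation_sets: dict mapping set name -> {position: expected base}.
    Assumed rule: a read supports a set if it covers at least one of the
    set's sites and matches the expected base at every site it covers.
    """
    counts = Counter({name: 0 for name in mutation_sets})
    for read in reads:
        for name, alleles in mutation_sets.items():
            covered = [pos for pos in alleles if pos in read]
            if covered and all(read[pos] == alleles[pos] for pos in covered):
                counts[name] += 1
    return counts

def relative_abundance(counts):
    """Normalize supporting-read counts so they sum to 1."""
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()} if total else {}

# Hypothetical polymorphic sites and toy reads (not real SARS-CoV-2 data).
mutation_sets = {
    "variant_A": {100: "T", 250: "G"},
    "variant_B": {100: "C", 250: "A"},
}
reads = [
    {100: "T", 250: "G"},  # supports variant_A
    {100: "T"},            # supports variant_A
    {250: "G"},            # supports variant_A
    {250: "A"},            # supports variant_B
    {100: "C", 250: "A"},  # supports variant_B
    {100: "T", 250: "A"},  # mixed alleles: supports neither set
    {400: "G"},            # covers no discriminating site
]
counts = count_consistent_reads(reads, mutation_sets)
abundance = relative_abundance(counts)
```

Reads that mix alleles from different sets, or that cover none of the discriminating sites, contribute to no estimate in this sketch.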

A-018: A Clonal Evolution Sequence Simulator for Planning Somatic Evolution Studies
COSI: HiTSeq
  • Arjun Srivatsa, Carnegie Mellon University, United States
  • Haoyun Lei, Carnegie Mellon University, United States
  • Russell Schwartz, Carnegie Mellon University, United States


Presentation Overview:

Somatic evolution plays a key role in development and aging as well as in disease processes, notably cancer. The importance of understanding mechanisms of somatic mutability has promoted a proliferation of new sequencing technologies, each with distinctive capabilities and limitations. The enormous space of possible combinations of sequencing modalities poses a substantial challenge for selecting optimal technologies for any particular scientific question. Versatile simulation tools are thus needed to make it possible to explore and optimize potential study designs. We present a clonal evolution and sequencing simulator that generates synthetic data from a wide range of clonal lineages, variant classes, and sequencing technologies, and that is designed for evaluating study designs for assessing somatic mutation mechanisms. Users can define properties of the somatic evolutionary process, mutation classes (e.g., single nucleotide polymorphisms, copy number changes, and classes of structural variation), and biotechnology options (e.g., coverage, bulk vs single cell, whole genome vs exome, error rate, number of samples). The simulator then generates synthetic sequence reads and their corresponding ground-truth parameters for the given study design. We demonstrate its utility in evaluating and optimizing study designs to detect differences in somatic mutation mechanisms between sequence samples.

A-019: Stash: A data structure based on stochastic tile hashing
COSI: HiTSeq
  • Armaghan Sarvar, Genome Sciences Centre, BC Cancer Agency, Canada
  • Lauren Coombe, Genome Sciences Centre, BC Cancer Agency, Canada
  • René Warren, Genome Sciences Centre, BC Cancer Agency, Canada
  • Inanc Birol, Genome Sciences Centre, BC Cancer Agency, Canada


Presentation Overview:

Storing and analyzing large sequencing datasets is computationally expensive, and developing scalable data structures and algorithms is essential for analyzing their information content. Here, we introduce Stash, a novel hash-based data structure based on stochastic tile hashing (Stashing), which provides a lossy representation of nucleotide sequences, such as long reads.
Stash is implemented as a two-dimensional bit array and populated using sliding windows of spaced seed patterns to hash input sequences. The sequence hashes indicate the memory loci, and sequence ID hashes determine the stored value.
By measuring the number of tile matches for related Stash frames, one can detect whether two genomic regions are covered by the same set of sequencing reads. We report this score on a chromosome of the human genome reference after Stash is filled with experimental Oxford Nanopore Technologies sequencing reads and show that as the distance between two loci of the reference contig increases, the metric decreases, since fewer common reads cover those regions.
We expect Stash to provide benefits to a variety of bioinformatics applications, including de novo genome assembly and misassembly detection.

A-020: GoldRush-Path: A de novo assembler for long reads with linear time complexity
COSI: HiTSeq
  • Johnathan Wong, BC Cancer, Genome Sciences Centre, Canada
  • Vladimir Nikolic, BC Cancer, Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer, Genome Sciences Centre, Canada
  • Emily Zhang, BC Cancer, Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer, Genome Sciences Centre, Canada


Presentation Overview:

De novo genome assembly is a cornerstone of a variety of genomic analyses. Long-read sequencing technologies have enabled researchers to assemble draft genomes with high contiguity and few structural errors. Most long-read assemblers adopt the overlap-layout-consensus paradigm, a quadratic run time algorithm in its naïve implementation, to address the high number of base errors present in long reads. Recently, ONT and PacBio have made tremendous strides in improving the quality of their long-read sequencing technologies, and opportunities for new long-read assembly algorithms have emerged. We present GoldRush-Path, a memory-efficient long-read assembly algorithm that runs in linear time in the number of reads, as part of the GoldRush pipeline. GoldRush-Path iterates through the long reads and identifies a set of “golden path” sequences that cover ~1X of the target genome by querying each read against a multi-index Bloom filter and inserting it only if its associated sequence signatures are missing. GoldRush-Path, the costliest step in the GoldRush pipeline, consumes at most 73 GB of RAM when assembling human genomes. The selected golden path is then polished and scaffolded in the pipeline, yielding NGA50 lengths of 12 Mbp for human genome assemblies in our tests.
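
The single-pass selection idea can be sketched as follows. This is a toy illustration, not the GoldRush-Path implementation: a plain Python `set` stands in for the multi-index Bloom filter, k-mers stand in for the "sequence signatures", and the function names and the `max_seen_fraction` threshold are assumptions introduced here.

```python
def kmers(seq, k):
    """Return the set of k-mers in seq (a stand-in for sequence signatures)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def select_golden_path(reads, k=4, max_seen_fraction=0.0):
    """Pick reads whose signatures are not yet recorded, in one pass.

    Each read is examined exactly once, so the number of read-level
    operations grows linearly with the number of reads.
    """
    seen = set()   # assumption: exact set used instead of a Bloom filter
    golden = []
    for read in reads:
        sigs = kmers(read, k)
        if not sigs:
            continue  # read shorter than k: nothing to query
        seen_fraction = len(sigs & seen) / len(sigs)
        if seen_fraction <= max_seen_fraction:
            golden.append(read)
            seen |= sigs  # record this read's signatures
    return golden

# Toy reads: the second duplicates the first, the fourth overlaps the third.
reads = ["ACGTACGTAA", "ACGTACGTAA", "TTGGCCAATT", "GGCCAATTGG"]
golden = select_golden_path(reads)
```

With the default threshold, a read is kept only when none of its k-mers has been recorded, so duplicate and heavily overlapping reads are skipped and the retained reads cover each region roughly once.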

A-021: Accurate estimation of haplotypes and abundances from Illumina amplicon data by AmpliCI
COSI: HiTSeq
  • Xiyu Peng, Memorial Sloan Kettering Cancer Center, United States
  • Karin Dorman, Iowa State University, United States


Presentation Overview:

Amplicon sequencing is widely applied to explore heterogeneity and rare variants in genetic populations. Resolving true biological variants and accurately quantifying their abundance from noisy amplicon sequence data is crucial for downstream analyses, but measured abundances are distorted by stochasticity and bias in amplification, plus errors during Polymerase Chain Reaction (PCR) and sequencing. Previously we presented AmpliCI, a reference-free, model-based method for rapidly resolving the number, abundance and identity of error-free sequences in massive Illumina amplicon datasets. Here we present AmpliCI v2, which can take into account Unique Molecular Identifier (UMI) information to achieve higher resolution when denoising Illumina amplicon data. Version 2 includes a new module, DAUMI, a probabilistic framework to resolve haplotypes and deduplicated abundance from amplicon sequence data with UMIs. We demonstrate that AmpliCI v2 achieves better performance in haplotype identification and more accurate abundance estimation than the previous AmpliCI version and other UMI-aware clustering methods.

A-022: Characterizing alternative splicing in the ENCODE4 mouse postnatal time course using bulk and single-nucleus long-read RNA-seq
COSI: HiTSeq
  • Fairlie Reese, University of California, Irvine, United States
  • Elisabeth Rebboah, University of California, Irvine, United States
  • Narges Rezaie, University of California, Irvine, United States
  • Brian Williams, California Institute of Technology, United States
  • Heidi Liang, University of California, Irvine, United States
  • Magdalena Gantuz, University of California, Irvine, United States
  • Barbara Wold, California Institute of Technology, United States
  • Ali Mortazavi, University of California, Irvine, United States


Presentation Overview:

Alternative isoforms that arise from internal splicing as well as transcription start site (TSS) or transcription end site (TES) choice are known to play key roles during postnatal development. Long-read RNA-seq (lrRNA-seq) sequences through the entire transcript, thus providing not only the ends but also the internal structure of each transcript, and can be applied to both bulk and single-cell samples.

As a part of the final phase of the ENCODE Consortium, we collected 5 tissues (adrenal glands, gastrocnemius muscle, heart, hippocampus, and cortex) from C57BL6J/Castaneus F1 hybrid mice at 7 postnatal timepoints (P4, P10, P14, P25, P36, P2mo, P18-20mo). We have sequenced all of these timepoints using bulk long-read RNA-seq in adrenal gland and gastrocnemius, and a subset of these at key developmental timepoints in the remaining tissues. We have also profiled adrenal gland and hippocampus using single-cell long-read RNA-seq (LR-Split-seq) with both PacBio and Oxford Nanopore (ONT) platforms. We call cell-type- and timepoint-specific isoforms, TSSs, and TESs. We integrate the LR-Split-seq results with matching single-cell multiome data. This approach allows us to connect coaccessible regulatory DNA regions to the alternative TSSs that we observe, giving us insight into the regulatory underpinnings guiding promoter choice.

A-023: ConDecon: a clustering-independent method for estimating single-cell abundance in bulk tissues using reference single-cell RNA-seq data
COSI: HiTSeq
  • Rachael Aubin, University of Pennsylvania, United States
  • Javier Montelongo, University of Pennsylvania, United States
  • Pablo Camara, University of Pennsylvania, United States


Presentation Overview:

Biological tissues are heterogeneous and comprise cells undergoing continuous biological processes like cell differentiation. Single-cell RNA-sequencing technologies enable the investigation of these processes. However, generating large cohorts of single-cell data is challenging compared to bulk transcriptomic data. Although many computational methods have been developed for inferring cell type abundance from bulk transcriptomic data, these approaches rely on cell type gene expression signatures and ignore intra-cluster variability. Continuous Deconvolution, ConDecon, is a clustering-independent deconvolution algorithm specifically developed to predict complex changes in single-cell abundance from bulk tissue. This approach estimates the probability that each cell in a reference single-cell dataset is present in a query bulk dataset. We compared ConDecon to 17 other methods and found that ConDecon performs comparably to state-of-the-art algorithms when inferring discrete cell type abundances. We then focused on ConDecon’s ability to estimate dynamic cell abundances along continuous cellular processes. To that end, we applied ConDecon to well-characterized biological systems like B-cell maturation and immune activation. Finally, we used it to identify changes in the activation of tumor-infiltrating microglia during the mesenchymal transformation of pediatric ependymoma. We anticipate that ConDecon will extend the utility of current methods to characterize single-cell dynamics in bulk tissue.

A-024: Quantification of complex genome editing events including large insertions and translocations using CRISPRlungo
COSI: HiTSeq
  • Kendell Clement, Massachusetts General Hospital / Harvard Medical School, United States
  • Linda Lin, Boston Children's Hospital / Harvard Medical School, United States
  • Pengpeng Liu, UMass Medical School, United States
  • Jing Zeng, Boston Children's Hospital / Harvard Medical School, United States
  • Amy Nguyen, Boston Children's Hospital / Harvard Medical School, United States
  • Scot Wolfe, UMass Medical School, United States
  • Daniel Bauer, Boston Children's Hospital / Harvard Medical School, United States
  • Luca Pinello, Massachusetts General Hospital / Harvard Medical School, United States


Presentation Overview:

Genome editing technologies are rapidly evolving, and analysis of deep sequencing data from target and off-target regions is necessary for evaluating editing efficiency, precision and specificity. Our group has developed the widely used tool CRISPResso2, which standardized quantification of editing frequencies at predefined loci using amplicon sequencing. However, this and other methods are only able to detect small insertions and deletions. In order to quantify complex genome editing events including large insertions, inversions and translocations, assays have been proposed that enrich for DNA sequences using only one PCR origin as the anchor for amplification. We developed a novel analytic tool called CRISPRlungo to analyze sequencing data produced from single-anchor PCR, which can quantify and visualize complex genome editing events without any a priori assumption of the expected outcomes. We generated single-anchor amplification data for a therapeutic genome editing experiment and show that our tool can take advantage of the richness of unidirectional sequencing data to both sensitively and specifically detect a variety of complex genome editing outcomes, including rare chromosomal alterations not detectable using current analysis toolkits. CRISPRlungo is available as open-source software that enables researchers to comprehensively assess genome editing outcomes without the biases of amplicon sequencing.

A-025: Comprehensive bioinformatics tools for quantitative analysis of gene editing
COSI: HiTSeq
  • Noa Oded Elkayam, Emendo Biotherapeutics Ltd, Israel
  • Michal Sharabi Schwager, Emendo Biotherapeutics Ltd, Israel
  • Malka Aker, Emendo Biotherapeutics Ltd, Israel
  • Ella Segal, Emendo Biotherapeutics Ltd, Israel
  • Idit Buch, Emendo Biotherapeutics Ltd, Israel


Presentation Overview:

The ability to generate and analyze massive data can accelerate our understanding of gene editing processes. However, the generation of such data poses two major challenges. The first is the experimental procedure, which parallelizes many samples/conditions at once. The second is the computational analysis, which aims to produce a few metrics for fast, meaningful comparison. Solutions to these challenges must be scalable and reproducible, while limiting human intervention to reduce errors. At EmendoBio, we developed a procedure that takes thousands of samples and automatically forms a DNA library prep for next-generation sequencing (NGS), using a robotic Biomek i7 system. This step involves target-specific amplification, mixing of different amplicons, and distinct Illumina indexing. Strict input validation steps are taken to meet pre-defined formats. Following sequencing procedures from various sources, the analysis of many samples is triggered at once (automatically or manually by user request) using Amazon serverless technology combined with parallel batch processing. The analysis space comprises many different bioinformatics tools such as CRISPR on/off target analysis, transcriptome characterization assays, mutation and SNP phasing, RNA-Seq analysis and others. Each analysis is followed by specific post-processing calculations, visualization and summarized metrics. Our simplified and automated procedure enables efficient cross-experimental conclusions regarding gene editing processes.

A-026: EmpiReS: Differential Analysis of Gene Expression and Alternative Splicing
COSI: HiTSeq
  • Gergely Csaba, LMU Munich, Germany
  • Evi Berchtold, LMU Munich, Germany
  • Armin Hadziahmetovic, LMU Munich, Germany
  • Markus Gruber, LMU Munich, Germany
  • Constantin Ammar, LMU Munich, Germany
  • Ralf Zimmer, LMU Munich, Germany


Presentation Overview:

While absolute quantification is challenging in high-throughput measurements, changes of features between conditions can often be determined with high precision. Therefore, analysis of fold changes is the standard method sufficient for differential expression, but often, the analysis of “changes of changes” is required. Differential alternative splicing is an application of such a doubly differential analysis. EmpiReS is a quantitative approach for various kinds of omics data based on fold changes for appropriate features of biological objects. Empirical error distributions for these fold changes are estimated from Replicate measurements and used to quantify feature fold changes and their directions.
We assess the performance of EmpiReS in detecting differentially expressed genes from RNA-Seq using simulated data. It achieved higher precision than established tools at nearly the same recall level. Furthermore, we assess the detection of alternatively Spliced genes via changes of isoform fold changes on distribution-free simulations and on experimentally validated splicing events. EmpiReS achieves the best precision-recall values for simulations based on different biological datasets. We propose EmpiReS as a general, quantitative and fast approach with high reliability and an excellent trade-off between sensitivity and precision for both differential expression and differential alternative splicing.
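
The notion of a "change of changes" can be made concrete with a small worked example. This sketch only shows the arithmetic of a doubly differential quantity on log2 fold changes; it does not model the empirical error distributions that EmpiReS estimates from replicates. The abundance values and function names are hypothetical.

```python
import math

def log2_fold_change(cond_a, cond_b):
    """Log2 fold change of a feature from condition A to condition B."""
    return math.log2(cond_b / cond_a)

def change_of_changes(isoform_1, isoform_2):
    """Difference between the log2 fold changes of two isoforms.

    Each isoform is given as (abundance in A, abundance in B). A value far
    from 0 means the two isoforms of a gene respond differently to the
    condition, which is the signal of differential alternative splicing.
    """
    return log2_fold_change(*isoform_1) - log2_fold_change(*isoform_2)

# Hypothetical abundances for two isoforms of one gene.
isoform_1 = (100.0, 400.0)  # 4-fold increase, log2FC = 2
isoform_2 = (200.0, 400.0)  # 2-fold increase, log2FC = 1
delta = change_of_changes(isoform_1, isoform_2)
```

Both isoforms are up-regulated here, so a gene-level test would report differential expression; the nonzero difference of fold changes additionally indicates that the isoform proportions shift between conditions.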

A-027: plASgraph - using graph neural networks to detect plasmid contigs from an assembly graph
COSI: HiTSeq
  • Janik Sielemann, Bielefeld University, Germany
  • Katharina Sielemann, Bielefeld University, Germany
  • Broňa Brejová, Comenius University in Bratislava, Slovakia
  • Tomas Vinar, Comenius University in Bratislava, Slovakia
  • Cedric Chauve, Simon Fraser University, Canada


Presentation Overview:

Identification of plasmids from sequencing data is an important and challenging problem related to antimicrobial resistance spread. We provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. Unlike previous machine-learning approaches for this problem, which classify individual contigs separately, we employ graph neural networks (GNNs) to include information from the assembly graph. Propagation of information from nearby nodes in the graph allows accurate classification of even short contigs that are difficult to classify based on sequence features or database searches alone.

Our new species-agnostic software tool plASgraph outperforms recently developed PlasForest, which uses database searches to supplement sequence-based features. Since our tool does not rely on existing plasmid databases, it is more suitable for classification of contigs in novel species. Our tool can also be trained on a specific species, and in that scenario it outperforms mlplasmids trained on the same species.

On one hand, our work provides a new, accurate, and easy to use tool for plasmid classification; on the other hand, it serves as a motivation for more widespread use of GNNs in bioinformatics, such as in pangenome sequence analysis, where sequence graphs serve as a fundamental data structure.

Availability: https://github.com/cchauve/plASgraph

A-028: Comparing transcriptional diversity metrics across brain regions and biological sex in long-read RNA sequencing data
COSI: HiTSeq
  • Timothy Howton, The University of Alabama at Birmingham, United States
  • Vishal Oza, The University of Alabama at Birmingham, United States
  • Brittany Lasseigne, The University of Alabama at Birmingham, United States
  • Anisha Haldar, The University of Alabama at Birmingham, United States
  • Avery Williams, The University of Alabama at Birmingham, United States
  • Emma Jones, The University of Alabama at Birmingham, United States


Presentation Overview:

Third-generation (i.e., long-read) sequencing platforms like Oxford Nanopore and PacBio provide additional capabilities for collecting genomic information, including novel isoform detection, due to their ability to sequence the entire length of mRNA transcripts. While measuring which genes and/or transcripts are differentially expressed across conditions is common, it is only one way to compare gene expression and is susceptible to missing important biological information. As nothing in biology acts in isolation, there is a need to describe patterns present in entire gene expression profiles in addition to comparing individual differentially expressed genes. Another way to measure transcriptional differences globally is transcriptional diversity. Transcriptional diversity can refer to the overall number of genes expressed, or it can refer to differential isoform usage. Transcriptional diversity has been previously described in many ways, but different measures of transcriptional diversity may distinctly capture biological and technical variation. Here, we compare transcriptional diversity metrics including coefficient of variation (CV), Shannon entropy, and the Gini index in publicly available Genotype-Tissue Expression (GTEx) project long-read RNA sequencing data with respect to brain region and biological sex.
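
For readers unfamiliar with the three metrics named above, the following sketch computes them with their common textbook definitions on hypothetical isoform counts for a single gene. Whether the poster uses exactly these variants (for example, population versus sample standard deviation, or the log base for entropy) is not stated, so those choices are assumptions.

```python
import math

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance) / mean

def shannon_entropy(values):
    """Entropy in bits of the proportions implied by non-negative values."""
    total = sum(values)
    proportions = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in proportions)

def gini_index(values):
    """Gini index: 0 when all values are equal, larger when one dominates."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

# Hypothetical read counts for four isoforms of one gene in two samples.
even_usage = [25, 25, 25, 25]   # all isoforms used equally
skewed_usage = [97, 1, 1, 1]    # one dominant isoform
```

Even isoform usage gives a CV of 0, the maximum entropy of 2 bits for four isoforms, and a Gini index of 0; the skewed sample gives a higher CV, lower entropy, and a Gini index of 0.72, showing how the metrics move in different directions as usage concentrates on one isoform.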

A-029: Using Machine Learning Models to Understand Errors in Human Genomic Variation
COSI: HiTSeq
  • Peter Tonner, NIST, United States
  • Nathan Dwarshuis, NIST, United States
  • Justin Wagner, NIST, United States
  • Nathanael Olson, NIST, United States
  • Jennifer McDaniel, NIST, United States
  • Justin Zook, NIST, United States


Presentation Overview:

The Genome in a Bottle consortium generates variant benchmarks for a set of human genomes to enable evaluation and comparison of sequencing technologies and variant detection methods. While these technologies can resolve most of the genome, correctly calling variants in complex or repetitive regions remains a challenge. We currently have general heuristics to predict incorrectly-called variants (more repetition is harder, etc); however, we lack a data-driven model to link variant caller performance to specific, quantifiable genomic contexts.

We aim to make such a model using explainable boosting machines (EBMs). EBMs are a sum of arbitrary univariate and bivariate functions (generalized additive models with interaction terms). Despite being flexible, the relative simplicity of EBMs allows interpretation of the functional relationship and relative contribution of each feature. For example, the model revealed that A/T homopolymers longer than ~15bp predict higher Illumina single nucleotide variant (SNV) and insertion/deletion (INDEL) error rates. For G/C homopolymers, any length above 0bp and increasing imperfect fraction predicted higher error rates for Illumina.
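
For reference, a generalized additive model with pairwise interaction terms is commonly written in the form below. The notation is standard rather than taken from the poster: $g$ is a link function, $\beta_0$ an intercept, $f_i$ the univariate shape function of feature $x_i$, and $f_{ij}$ the bivariate function for a pair $(i, j)$ in the set $\mathcal{P}$ of selected feature pairs.

```latex
g\big(\mathbb{E}[y]\big) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{(i,j) \in \mathcal{P}} f_{ij}(x_i, x_j)
```

Because each term depends on at most two features, each $f_i$ can be plotted as a curve and each $f_{ij}$ as a heat map, which is what makes statements such as the homopolymer-length effect above directly readable from the fitted model.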

Ultimately, this will provide a data-driven foundation for comparing variant caller methods and/or sequencing technologies in difficult regions of the genome, and enable improved design of stratifications delineating difficult regions.

A-030: Changes in Cellular Metabolism in Autosomal Dominant Polycystic Kidney Disease
COSI: HiTSeq
  • Timothy Howton, University of Alabama at Birmingham, United States
  • Vishal Oza, University of Alabama at Birmingham, United States
  • Elizabeth Wilk, University of Alabama at Birmingham, United States
  • Michal Mrug, University of Alabama at Birmingham, United States
  • Bradley Yoder, University of Alabama at Birmingham, United States
  • Brittany Lasseigne, University of Alabama at Birmingham, United States


Presentation Overview:

Autosomal dominant polycystic kidney disease (ADPKD) is characterized by the development of cysts in the kidneys that increase in number and volume with age. The increase in the quantity and size of cysts eventually interferes with normal kidney function and ultimately leads to end-stage kidney disease. Roughly 1 in 1000 people are affected by ADPKD, and it is the fourth leading cause of end-stage kidney disease. ADPKD is a multisystem disease; patients can therefore also suffer from hemorrhagic stroke, cardiac arrest, and/or complications from severe cystic liver disease. The disease is predominantly caused by mutations in the PKD1 and PKD2 genes, which encode polycystin 1 (PC1) and polycystin 2 (PC2), respectively. ADPKD displays metabolic changes including alternative glucose metabolism similar to the Warburg effect, oxidative phosphorylation, and fatty acid synthesis. Additionally, dietary modifications including caloric restriction have been shown to improve symptoms. However, metabolic changes at single-cell resolution have not been thoroughly examined. Here we use single-cell RNA-seq approaches to explore cell-specific metabolic pathway changes in publicly available human ADPKD and Pkd2 knock-out mouse datasets.

A-031: EmpiReR: Model-free reliable analysis of higher order differentials in complex replicate count data
COSI: HiTSeq
  • Felix Offensperger, LMU Munich, Germany
  • Evi Berchtold, LMU Munich, Germany
  • Ralf Zimmer, LMU Munich, Germany


Presentation Overview:

There is no end to innovation in modern biology. Experimental designs are becoming increasingly complex, encompassing perturbations of multiple dimensions and conditions. Despite the immense information gain, statistics, e.g. with DESeq2, often focus on a 1-vs-1 or 1-vs-all design. Moreover, these tests are all based on questionable null hypothesis testing and the resulting p-values. These are often misunderstood and misapplied.

Here we introduce EmpiReR, which employs a fuzzy-value-based representation of the count data. By fuzzy binning, it captures the empirical error distributions of the data and estimates whether the features are consistent compared to the rest of the data in the same condition and whether they show a significant change when comparing conditions.

EmpiReR allows a p-value free analysis of data and the analysis of higher order differentials. This is important not only for complex data but also for comparisons, like pattern extraction in iATAC, where combination of data types (RNAseq and ATACseq) is critical to the analysis. EmpiReR can be used not only to compute the ‘best’ higher order differential changes, but also to extract and visualize evidence for complex patterns, e.g. fuzzy differential flows of foldchanges for time series data along various conditions.

A-032: Benchmark of analysis strategies for ATAC-seq and CUT&Tag-seq
COSI: HiTSeq
  • Bo Zhang, Washington University in St. Louis, United States
  • Siyuan Cheng, Washington University in St. Louis, United States
  • Benpeng Miao, Washington University in St. Louis, United States


Presentation Overview:

Tn5 was one of the first identified prokaryotic transposons, and Tn5 transposase has been widely adopted in different genomic protocols to explore the genome and epigenome in a high-throughput fashion. Specifically, ATAC-seq and CUT&Tag-seq are becoming the most widely used epigenomic experimental approaches to measure chromatin accessibility and detect DNA-protein interactions. With large-scale data production, correctly processing these epigenomic data has become the new bottleneck. Many bioinformatics tools have been developed for processing ATAC-seq and CUT&Tag-seq data; however, a comprehensive comparison and benchmarking of these methods is still lacking. Here, we conducted a comprehensive benchmark to evaluate the performance of eight popular software tools in processing ATAC-seq and CUT&Tag-seq data, including AIAP, MACS2, SEACR, HMMRATAC, CUT&RUNTools2.0, and ChromHMM. We further tested the performance of differential analysis strategies for ATAC-seq and CUT&Tag-seq data. In conclusion, our study provides comprehensive bioinformatics guidance for ATAC-seq and CUT&Tag-seq data processing and differential analysis. The recommended analysis strategy was compiled into a Docker/Singularity image, allowing biologists to easily perform data analysis by executing a single command.