HiTSeq COSI

Schedule subject to change
All times listed are in CDT
Tuesday, July 12th
10:30-11:30
Utilizing multi-omics, networks and bacterial immune systems to understand microbiome
Room: Madison B
Format: Live from venue

  • Yuzhen Ye


Microbiomes (communities of microorganisms) are related to almost every aspect of human life. We have been developing computational approaches that use multi-omics data to characterize the composition and function of microbial communities. Applying these approaches to human microbiome multi-omics data revealed gut microbial signatures that are likely expressed and are predictive of host phenotypes. We have also developed methods for inferring biological pathways and bacteria-bacteria co-occurrence networks, which provide important context for understanding the functions of microbial organisms and their impacts on their hosts or the environment. Using the tools that we have developed for characterizing bacterial immune systems (the CRISPR-Cas systems), we can study the arms race between microbial organisms and their invaders, such as phages, and use the dynamics of the CRISPR-Cas systems as a tool for studying the heterogeneity of microbial species and their adaptation to rapidly changing environments such as the human gut.

11:30-11:50
Proceedings Presentation: CALDERA: Finding all significant de Bruijn subgraphs for bacterial GWAS
Room: Madison B
Format: Live-stream

  • Hector Roux de Bezieux, Pendulum Therapeutics, United States
  • Leandro Lima, LBBE, UCBL1, INRIA, France
  • Fanny Perraudeau, Pendulum Therapeutics, San Francisco, United States
  • Arnaud Mary, LBBE, France
  • Sandrine Dudoit, Associate Professor, Division of Biostatistics, University of California, Berkeley, United States
  • Laurent Jacob, CNRS, France


Motivation: Genome-wide association studies (GWAS), which aim to find genetic variants associated with a trait, have been widely used on bacteria to identify genetic determinants of drug resistance or hypervirulence. Recent bacterial GWAS methods usually rely on k-mers, whose presence in a genome can denote variants ranging from single nucleotide polymorphisms to mobile genetic elements. This approach does not require a reference genome, making it easier to account for accessory genes. However, the same gene can exist in slightly different versions across different strains, leading to diluted effects.
Results: Here we overcome this issue by testing covariates built from closed connected subgraphs of the de Bruijn graph defined over genomic k-mers. These covariates capture polymorphic genes as a single entity, improving k-mer-based GWAS in terms of both power and interpretability. However, a method naively testing all possible subgraphs would be powerless due to multiple testing corrections, and the mere exploration of these subgraphs would quickly become computationally intractable. The concept of testable hypotheses has successfully been used to address both problems in similar contexts. We leverage this concept to test all closed connected subgraphs by proposing a novel enumeration scheme for these objects that fully exploits the pruning opportunity offered by testability, resulting in drastic improvements in computational efficiency. Our method integrates with existing visual tools to facilitate interpretation.
Availability: We provide an implementation of our method, as well as code to reproduce all results at https://github.com/HectorRDB/Caldera_ISMB.

11:50-12:10
Proceedings Presentation: MeConcord: a new metric to quantitatively characterize DNA methylation heterogeneity across reads and CpG sites
Room: Madison B
Format: Live-stream

  • Xianglin Zhang, Tsinghua University, China
  • Xiaowo Wang, Tsinghua University, China


Motivation: Intermediately methylated regions occupy a significant fraction of the human genome and are markedly associated with epigenetic regulation or cell-type deconvolution of bulk data. However, these regions show distinct methylation patterns, corresponding to different biological mechanisms. Although some metrics have been developed for investigating these regions, their high sensitivity to noise limits their utility for distinguishing distinct methylation patterns.
Results: We propose a method named MeConcord to measure local methylation concordance across reads and across CpG sites. MeConcord showed the most stable performance in distinguishing distinct methylation patterns (‘identical’, ‘uniform’, and ‘disordered’) compared with other metrics. Applying MeConcord to the whole genome across 25 cell lines, primary cells, or tissues, we found that distinct methylation patterns were associated with different genomic characteristics, such as CTCF binding or imprinted genes. Further, we used MeConcord to show the differences in CpG island hypermethylation patterns between senescence and tumorigenesis. MeConcord is a powerful method to study local read-level methylation patterns for both the whole genome and specific regions of interest.
Availability: MeConcord is available at https://github.com/vhang072/MeConcord.

12:10-12:30
Proceedings Presentation: ReadBouncer: Precise and Scalable Adaptive Sampling for Nanopore Sequencing
Room: Madison B
Format: Live from venue

  • Jens-Uwe Ulrich, Hasso Plattner Institute, Germany
  • Ahmad Lutfi, Hasso Plattner Institute, Germany
  • Kilian Rutzen, Robert Koch Institute, Germany
  • Bernhard Renard, Hasso Plattner Institute, Germany


Motivation: Nanopore sequencers allow targeted sequencing of nucleotide sequences of interest by rejecting other sequences from individual pores. This feature facilitates the enrichment of low-abundance sequences by depleting overrepresented ones in silico. Existing tools for adaptive sampling either apply signal alignment, which cannot handle human-sized reference sequences, or apply read mapping in sequence space, relying on fast GPU base callers for real-time read rejection. Nanopore long-read mapping tools are also not optimal for mapping the shorter reads that are usually analyzed in adaptive sampling applications.
Results: Here we present a new approach for nanopore adaptive sampling that combines fast CPU and GPU base calling with read classification based on Interleaved Bloom Filters (IBF). ReadBouncer improves the potential enrichment of low-abundance sequences through its high read classification sensitivity and specificity, outperforming existing tools in the field. It robustly removes even reads belonging to large reference sequences while running on commodity hardware without graphics processing units (GPUs), making adaptive sampling accessible for in-field researchers. ReadBouncer also provides a user-friendly interface and installer files for end users without a bioinformatics background.
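
As a rough illustration of the read-classification idea above, the following Python sketch shows a toy interleaved Bloom filter: every hash position stores one bit per reference bin, so a single k-mer lookup reports all candidate bins at once. The class name, the hashing scheme and the parameters are assumptions made for this example and are not taken from ReadBouncer.

```python
import hashlib


class ToyInterleavedBloomFilter:
    """Toy IBF: rows[pos] is an integer bitmask with one bit per bin."""

    def __init__(self, num_bins, rows, num_hashes=2):
        self.num_bins = num_bins
        self.num_hashes = num_hashes
        self.rows = [0] * rows

    def _positions(self, kmer):
        # Derive several hash positions by salting one hash function
        # (an illustrative choice, not ReadBouncer's hashing).
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % len(self.rows)

    def add(self, kmer, bin_index):
        for pos in self._positions(kmer):
            self.rows[pos] |= 1 << bin_index

    def query(self, kmer):
        # AND the per-bin bitmasks of all hash positions; surviving bits
        # are the bins that may contain the k-mer (false positives possible).
        hits = (1 << self.num_bins) - 1
        for pos in self._positions(kmer):
            hits &= self.rows[pos]
        return [b for b in range(self.num_bins) if (hits >> b) & 1]
```

A read would then be classified by counting, per bin, how many of its k-mers are reported; the counting threshold is left out of this sketch.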

14:30-14:50
Rigorous benchmarking of T cell receptor repertoire profiling methods for cancer RNA sequencing
Room: Madison B
Format: Live from venue

  • Kerui Peng, University of Southern California, United States
  • Serghei Mangul, University of Southern California, United States


The ability to identify and track T cell receptor (TCR) sequences from patient samples has become central to cancer research. The established high-throughput method for profiling TCR repertoires is TCR sequencing (TCR-Seq). However, available TCR-Seq data are limited compared with RNA sequencing (RNA-Seq) data. We have benchmarked the ability of RNA-Seq-based methods to profile TCR repertoires by examining 19 bulk RNA-Seq samples across four cancer cohorts, including both T cell rich and T cell poor tissues. We have performed a comprehensive evaluation of the existing RNA-Seq-based repertoire profiling methods using targeted TCR-Seq as the gold standard. We also highlight scenarios under which the RNA-Seq approach is suitable and can provide accuracy comparable to the TCR-Seq approach. Results show that these methods are able to effectively capture the clonotypes and estimate the diversity of TCR repertoires, as well as provide relative frequencies of clonotypes in T cell rich tissues and monoclonal repertoires. However, these methods have limited power in T cell poor tissues, especially in polyclonal repertoires. The results of our benchmarking provide an appealing argument for incorporating RNA-Seq into immune repertoire screening of cancer patients, as it offers insight into transcriptomic changes that goes beyond the limited information provided by TCR-Seq.

14:50-15:10
A unified somatic calling of next-generation sequencing data enhances the detection of clonal hematopoiesis of indeterminate potential
Room: Madison B
Format: Live-stream

  • Shulan Tian, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  • Garrett Jenkinson, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  • Alejandro Ferrer, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, United States
  • Huihuang Yan, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  • Saurabh Baheti, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  • Terra Lasho, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, United States
  • Joel Morales-Rosado, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States
  • Mrinal Patnaik, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, United States
  • Wei Ding, Division of Hematology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, United States
  • Konstantinos Lazaridis, Division of Gastroenterology & Hepatology, Department of Internal Medicine, Mayo Clinic, Rochester, MN 55905, United States
  • Eric Klee, Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55905, United States


Clonal hematopoiesis (CH) of indeterminate potential (CHIP) is a premalignant state in which leukemia-associated driver genes acquire somatic mutations in peripheral blood at a variant allele frequency (VAF) of 2% or greater, yet the individual does not meet the World Health Organization diagnostic criteria for a hematologic neoplasm. CHIP represents a risk factor for various hematologic malignancies and cardiovascular diseases. The VAF cutoff of >=2% was set arbitrarily, reflecting the limitations of standard next-generation sequencing (NGS) platforms in detecting small clones due to their relatively high sequencing error rate, and the rarity of clinical consequences associated with mutations at lower VAFs. However, individuals with CH at >=1% VAF in leukemia driver genes also had a significantly increased risk of developing AML, as did those with >=2% VAF. Popular variant calling algorithms for CHIP detection often lose power on variants with low VAFs. Currently, no analytical pipeline has been developed specifically for CHIP detection. This study presents UNIfied SOmatic calling of Next-generation sequencing data, or UNISON for short, a software toolkit designed for streamlined CHIP discovery from population studies, even with suboptimal sequencing coverage. UNISON should be broadly applicable to CHIP detection in large-scale WES and WGS projects.
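
To make the VAF cutoffs in this abstract concrete, the short Python sketch below computes a variant allele frequency from read counts and compares it with a threshold. The function names and the example read counts are hypothetical and are not part of UNISON; real callers also model sequencing error, which this arithmetic ignores.

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads at a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def passes_vaf_cutoff(alt_reads: int, total_reads: int, cutoff: float = 0.02) -> bool:
    """True if the observed VAF reaches the cutoff (default 2%)."""
    return variant_allele_frequency(alt_reads, total_reads) >= cutoff
```

For example, 1 variant read out of 100 fails the 2% cutoff but passes a 1% cutoff, which is the range where the abstract notes that standard callers lose power.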

15:10-15:30
Mutational signatures of complex genomic rearrangements in human cancer
Room: Madison B
Format: Live from venue

  • Lixing Yang, University of Chicago, United States


Complex genomic rearrangements (CGRs) are common in cancer and are known to form via two aberrant cellular structures: micronuclei and chromatin bridges. However, which mechanism is more relevant to CGR formation in cancer, and whether there are other undiscovered mechanisms, remains unknown. Here we developed a computational algorithm, ‘Starfish’, to analyze 2,014 CGRs from 2,428 whole-genome-sequenced tumors and discovered six CGR signatures based on their copy number and breakpoint patterns. Through extensive benchmarking, we show that our CGR signatures are highly accurate and biologically meaningful. Three signatures can be attributed to known biological processes: micronuclei- and chromatin-bridge-induced chromothripsis and circular extrachromosomal DNA. More than half of the CGRs belong to the remaining three signatures, which have not been reported previously. A unique signature, which we named “hourglass chromothripsis”, with localized breakpoints and a small amount of DNA loss, is abundant in prostate cancer. We find that SPOP is associated with hourglass chromothripsis and may play an important role in maintaining genome integrity.

16:00-17:00
Deep molecular- and cellular-phenotyping of zebrafish development at whole organism scale
Room: Madison B
Format: Live-stream

  • Cole Trapnell


Single-cell transcriptomics now enables comprehensive “atlases” of gene expression across whole embryos. In principle, single-cell sequencing could be used for high-content phenotyping at unprecedented scale to study genetic programs of development. However, cost and workflow complexity have limited such efforts to profiling no more than a handful of embryos and perturbations. Here, we present a new experimental and analytical approach for high-resolution phenotyping of thousands of whole, individually barcoded zebrafish embryos in response to myriad genetic perturbations at multiple stages of development. Using this approach, we (i) comprehensively map the zebrafish developmental landscape from 18 to 96 hours post-fertilization in ~1,220 embryos, (ii) statistically assess the effects of 23 genetic perturbations across 645 embryos and 5 timepoints, comprising 98 conditions, (iii) resolve developmental trajectories and define new lineage-specific markers for rare cell populations in the peripheral nervous system, and (iv) identify a transcriptional program that sheds light on the origin of head cartilage in vertebrates. We anticipate that this dataset and workflow will expand the genetic screening capabilities in zebrafish and other organisms and catalyze mechanistic insights for understanding genetic circuits in whole developing organisms.

17:00-17:20
Proceedings Presentation: Semi-deconvolution of bulk and single-cell RNA-seq data with application to metastatic progression in breast cancer
Room: Madison B
Format: Live from venue

  • Haoyun Lei, Carnegie Mellon University, United States
  • Xiaoyan Guo, Carnegie Mellon University, United States
  • Yifeng Tao, Carnegie Mellon University, United States
  • Kai Ding, UPMC Hillman Cancer Center, Magee-Womens Research Institute, United States
  • Xuecong Fu, Carnegie Mellon University, United States
  • Steffi Oesterreich, UPMC Hillman Cancer Center, Magee-Womens Research Institute, United States
  • Adrian Lee, UPMC Hillman Cancer Center, Magee-Womens Research Institute, United States
  • Russell Schwartz, Carnegie Mellon University, United States


Identifying cell types and their abundances and how these evolve during tumor progression is critical to understanding the mechanisms of metastasis and identifying predictors of metastatic potential that can guide the development of new diagnostics or therapeutics. Single-cell RNA sequencing (scRNA-seq) has been especially promising in resolving heterogeneity of expression programs at the single-cell level, but is not always feasible, for example for large cohort studies or longitudinal analysis of archived samples. In such cases, clonal subpopulations may still be inferred via genomic deconvolution, but deconvolution methods have limited ability to resolve fine clonal structure and may require reference cell type profiles that are missing or imprecise. Prior methods can eliminate the need for reference profiles but show unstable performance when few bulk samples are available. In this work, we develop a new method using reference scRNA-seq to interpret sample collections for which only bulk RNA-seq is available for some samples, e.g., clonally resolving archived primary (PRM) tissues using scRNA-seq from metastases (METs). By integrating such information in a Quadratic Programming (QP) framework, our method can recover more accurate cell types and corresponding cell type abundances in bulk samples. Application to a breast tumor bone metastases dataset confirms the power of scRNA-seq data to improve cell-type inference and quantification in same-patient bulk samples.

17:20-17:40
Gene fusion detection and characterization in long-read cancer transcriptomes with FusionSeeker
Room: Madison B
Format: Live from venue

  • Yu Chen, University of Alabama at Birmingham, United States
  • Yiqing Wang, University of Alabama at Birmingham, United States
  • Weisheng Chen, University of Alabama at Birmingham, United States
  • Yuwei Song, University of Alabama at Birmingham, United States
  • Herbert Chen, University of Alabama at Birmingham, United States
  • Zechen Chong, University of Alabama at Birmingham, United States


Long abstract

17:40-18:00
Accurate assembly of multi-end RNA-seq data with Scallop2
Room: Madison B
Format: Live from venue

  • Qimin Zhang, The Pennsylvania State University, United States
  • Qian Shi, The Pennsylvania State University, United States
  • Mingfu Shao, The Pennsylvania State University, United States


Modern RNA-sequencing protocols can produce multi-end data, where multiple reads originating from the same transcript are attached to the same barcode. The long-range information in the multi-end reads is beneficial in phasing complicated spliced isoforms, but assembly algorithms that leverage such information are lacking. Here we introduce Scallop2, a reference-based assembler optimized for multi-end RNA-seq data. The algorithmic core consists of three steps: (1) using an algorithm to ‘bridge’ multi-end reads into single-end phasing paths in the context of the splice graph, (2) employing a method to refine erroneous splice graphs by utilizing multi-end reads that fail to bridge, and (3) piping the refined splice graph and bridged phasing paths into an algorithm that integrates multiple phase-preserving decompositions. Tested on 561 cells in two Smart-seq3 datasets and on ten Illumina paired-end RNA-seq samples, Scallop2 substantially improves assembly accuracy compared with two popular assemblers, StringTie2 and Scallop. Scallop2 represents a significant leap forward for transcript assembly and therefore enables further improvement in the identification of novel transcripts and in downstream isoform-level expression analysis. More importantly, Scallop2 enables accurate construction of transcriptomes at single-cell resolution, which supports broader use and advances biological and biomedical research in the era of single-cell omics.

Wednesday, July 13th
10:30-11:30
Keynote Presentation: A Draft Human Pangenome Reference
Room: Madison B
Format: Live from venue

  • Benedict Paten


Keynote

11:30-11:50
Proceedings Presentation: The Effect of Genome Graph Expressiveness on the Discrepancy Between Genome Graph Distance and String Set Distance
Room: Madison B
Format: Live-stream

  • Yutong Qiu, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States


Motivation: Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, these true string sets are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs are often used to represent such sets of strings. However, a genome graph is generally more expressive than string sets and is able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not closely model the distance between true string sets.

Results: We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that FGTED always underestimates the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and improve FGTED so that it reduces the expected error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the deviation of the estimated distance from the true string set distances.

11:50-12:10
Stash: A data structure based on stochastic tile hashing
Room: Madison B
Format: Live-stream

  • Armaghan Sarvar, Genome Sciences Centre, BC Cancer Agency, Canada
  • Lauren Coombe, Genome Sciences Centre, BC Cancer Agency, Canada
  • René Warren, Genome Sciences Centre, BC Cancer Agency, Canada
  • Inanc Birol, Genome Sciences Centre, BC Cancer Agency, Canada


Storing and analyzing large sequencing datasets is computationally expensive and developing scalable data structures and algorithms is essential for analyzing their information content. Here, we introduce Stash, a novel hash-based data structure based on stochastic tile hashing (Stashing), which provides a lossy representation of nucleotide sequences, such as long reads.
Stash is implemented as a two-dimensional bit array and populated using sliding windows of spaced seed patterns to hash input sequences. The sequence hashes indicate the memory loci, and sequence ID hashes determine the stored value.
By measuring the number of tile matches for related Stash frames, one can detect whether two genomic regions are covered by the same set of sequencing reads. We report this score on a chromosome of the human genome reference after Stash is filled with experimental Oxford Nanopore Technologies sequencing reads and show that as the distance between two loci of the reference contig increases, the metric decreases, since fewer common reads cover those regions.
We expect Stash to provide benefits to a variety of bioinformatics applications, including de novo genome assembly and misassembly detection.

12:10-12:30
GenASM: A High-Performance, Low-Power Approximate String Matching Acceleration Framework for Genome Sequence Analysis
Room: Madison B
Format: Live-stream

  • Damla Senol Cali, Bionano Genomics, United States
  • Gurpreet Singh Kalsi, Intel, United States
  • Zülal Bingöl, Bilkent University, Turkey
  • Can Firtina, ETH Zurich, Switzerland
  • Lavanya Subramanian, Facebook, United States
  • Jeremie S. Kim, ETH Zurich, Switzerland
  • Rachata Ausavarungnirun, King Mongkut's University of Technology North Bangkok, Thailand
  • Mohammed Alser, ETH Zurich, Switzerland
  • Juan Gómez Luna, ETH Zurich, Switzerland
  • Amiral Boroumand, Carnegie Mellon University, United States
  • Anant Nori, Intel, United States
  • Allison Scibisz, Carnegie Mellon University, United States
  • Sreenivas Subramoney, Intel Labs, India
  • Can Alkan, Bilkent University, Department of Computer Engineering, Turkey
  • Saugata Ghose, University of Illinois Urbana-Champaign, United States
  • Onur Mutlu, ETH Zurich, Switzerland


Genome sequence analysis has enabled significant advancements in medical and scientific areas such as personalized medicine, outbreak tracing, and the understanding of evolution. Unfortunately, it is currently bottlenecked by the computational power and memory bandwidth limitations of existing systems, as many of the steps in genome sequence analysis must process a large amount of data. A major contributor to this bottleneck is approximate string matching (ASM).

We propose GenASM, the first ASM acceleration framework for genome sequence analysis. We modify the underlying ASM algorithm (Bitap) to significantly increase its parallelism and reduce its memory footprint, and we design the first hardware accelerator for Bitap. Our hardware accelerator consists of specialized compute units and on-chip SRAMs that are designed to match the rate of computation with memory capacity and bandwidth.

We demonstrate that GenASM is a flexible, high-performance, and low-power framework, which provides significant performance and power benefits for three different use cases in genome sequence analysis: 1) GenASM accelerates read alignment for both long reads and short reads. For long reads, GenASM outperforms state-of-the-art software and hardware accelerators by 116x and 3.9x, respectively, while consuming 37x and 2.7x less power. For short reads, GenASM outperforms state-of-the-art software and hardware accelerators by 111x and 1.9x. 2) GenASM accelerates pre-alignment filtering for short reads, with 3.7x the performance of a state-of-the-art pre-alignment filter, while consuming 1.7x less power and significantly improving the filtering accuracy. 3) GenASM accelerates edit distance calculation, with 22-12501x and 9.3-400x speedups over the state-of-the-art software library and FPGA-based accelerator, respectively, while consuming 548-582x and 67x less power.
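
For readers unfamiliar with Bitap, the Python sketch below shows the classic software formulation of bit-parallel approximate string matching (in the style of Wu and Manber), which the abstract names as the algorithm GenASM starts from. It is not GenASM's modified algorithm or its hardware design; the function name and the example strings are assumptions for illustration.

```python
def bitap_search(text, pattern, max_errors=0):
    """Return end positions in text where pattern matches with at most
    max_errors edits (substitutions, insertions, deletions)."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    full = (1 << m) - 1
    accept = 1 << (m - 1)
    # state[d] bit i is set when pattern[:i + 1] matches some suffix of the
    # text read so far with at most d edits.
    state = [((1 << d) - 1) & full for d in range(max_errors + 1)]
    hits = []
    for pos, ch in enumerate(text):
        mask = masks.get(ch, 0)
        prev = state[:]
        state[0] = ((prev[0] << 1) | 1) & mask
        for d in range(1, max_errors + 1):
            match = ((prev[d] << 1) | 1) & mask
            substitution = (prev[d - 1] << 1) | 1
            insertion = prev[d - 1]               # extra character in text
            deletion = (state[d - 1] << 1) | 1    # pattern character skipped
            state[d] = (match | substitution | insertion | deletion) & full
        if state[max_errors] & accept:
            hits.append(pos)
    return hits
```

Each text character updates one machine word per allowed error, which is the kind of regular bitwise work that lends itself to parallel hardware.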

14:30-14:50
Proceedings Presentation: The minimizer Jaccard estimator is biased and inconsistent
Room: Madison B
Format: Live from venue

  • Mahdi Belbasi, The Pennsylvania State University, United States
  • Antonio Blanca, Penn State, United States
  • Robert S. Harris, The Pennsylvania State University, United States
  • David Koslicki, Penn State University, United States
  • Paul Medvedev, The Pennsylvania State University, United States


Sketching is now widely used in bioinformatics to reduce data size and increase data processing speed. Sketching approaches entice with improved scalability but also carry the danger of decreased accuracy and added bias. In this paper, we investigate the minimizer sketch and its use to estimate the Jaccard similarity between two sequences. We show that the minimizer Jaccard estimator is biased and inconsistent, which means that the expected difference (i.e., the bias) between the estimator and the true value is not zero, even in the limit as the lengths of the sequences grow. We derive an analytical formula for the bias as a function of how the shared k-mers are laid out along the sequences. We show both theoretically and empirically that there are families of sequences where the bias can be substantial (e.g. the true Jaccard can be more than double the estimate). Finally, we demonstrate that this bias affects the accuracy of the widely used mashmap read mapping tool.
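
The quantities compared in this abstract can be sketched in a few lines of Python: the true Jaccard similarity uses all k-mers, while the minimizer Jaccard estimator uses only the minimizers of each window of w consecutive k-mers. The hash-based k-mer ordering, the function names and the parameters are assumptions for illustration; the paper's analysis is not reproduced here.

```python
import hashlib


def kmers(seq, k):
    """All k-mers of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def _order(kmer):
    # Deterministic pseudo-random ordering of k-mers (illustrative choice).
    return hashlib.blake2b(kmer.encode(), digest_size=8).digest()


def minimizer_set(seq, k, w):
    """Smallest k-mer (under _order) in each window of w consecutive k-mers."""
    ks = kmers(seq, k)
    return {min(ks[i:i + w], key=_order) for i in range(len(ks) - w + 1)}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def true_jaccard(s, t, k):
    return jaccard(set(kmers(s, k)), set(kmers(t, k)))


def minimizer_jaccard(s, t, k, w):
    return jaccard(minimizer_set(s, k, w), minimizer_set(t, k, w))
```

Comparing `true_jaccard` with `minimizer_jaccard` on pairs of related sequences gives a feel for how far the estimate can drift from the true value.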

14:50-15:10
Proceedings Presentation: Sparse and Skew Hashing of K-Mers
Room: Madison B
Format: Live-stream

  • Giulio Ermanno Pibiri, ISTI-CNR, Italy


Motivation: A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports exact membership queries. This data structure is at the heart of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, on the order of several billion strings. In such cases, the memory consumption and query efficiency of the data structure are a concrete challenge.
Results: To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is, a data structure where strings are represented in compact form and each of them is associated with a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions.
Availability: The C++ implementation of the dictionary is available at https://github.com/jermp/sshash.
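
The interface described above (exact membership plus a unique identifier in [0,n) for each stored k-mer) can be illustrated with a deliberately simple Python sketch. It uses 2-bit encoding and binary search over a sorted array, not the minimizer-based minimal perfect hashing of the paper, and the class and method names are hypothetical rather than the sshash API.

```python
from bisect import bisect_left

_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Pack a k-mer into an integer using 2 bits per base."""
    value = 0
    for base in kmer:
        value = (value << 2) | _CODE[base]
    return value


class ToyKmerDictionary:
    """Associative k-mer dictionary; assumes all k-mers share one length k."""

    def __init__(self, kmers):
        self._codes = sorted({encode(x) for x in kmers})

    def __len__(self):
        return len(self._codes)

    def lookup(self, kmer):
        """Return the k-mer's identifier in [0, n), or -1 if absent."""
        code = encode(kmer)
        i = bisect_left(self._codes, code)
        if i < len(self._codes) and self._codes[i] == code:
            return i
        return -1
```

Lookups here cost O(log n) comparisons; the point of the work presented is to do better than such baselines in both space and time.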

15:10-15:30
Matchtigs: minimum plain text representation of kmer sets
Room: Madison B
Format: Live from venue

  • Sebastian Schmidt, Helsinki University, Finland
  • Shahbaz Khan, Indian Institute of Technology Roorkee, India
  • Jarno Alanko, University of Helsinki, Finland
  • Alexandru I. Tomescu, University of Helsinki, Finland


Kmer-based methods are widely used in bioinformatics, which raises the question of what is the smallest practically usable representation (i.e. plain text) of a set of kmers.
We propose a polynomial algorithm computing a minimum such representation (which was previously posed as a potentially NP-hard open problem), as well as an efficient near-minimum greedy heuristic.
When compressing genomes of large model organisms, read sets (Illumina short reads) thereof or bacterial pangenomes, with only a minor runtime increase, we decrease the size of the representation by up to 60% over unitigs and 27% over previous work.
Additionally, the number of strings is decreased by up to 97% over unitigs and 91% over previous work.
Finally, our small representation has advantages in downstream applications, as it speeds up queries on the popular kmer indexing tool Bifrost by 1.66x over unitigs and 1.29x over previous work.

16:00-16:20
Proceedings Presentation: Markov chains improve the significance computation of overlapping genome annotations
Room: Madison B
Format: Live from venue

  • Askar Gafurov, Comenius University in Bratislava, Slovakia
  • Broňa Brejová, Comenius University in Bratislava, Slovakia
  • Paul Medvedev, Penn State University, United States


Genome annotations are a common way to represent genomic features such as genes, regulatory elements or epigenetic modifications. The amount of overlap between two annotations is often used to ascertain if there is an underlying biological connection between them. In order to distinguish between true biological association and overlap by pure chance, a robust measure of significance is required. One common way to do this is to determine if the number of intervals in the reference annotation that intersect the query annotation is statistically significant. However, currently employed statistical frameworks are often either inefficient or inaccurate when computing p-values on the scale of the whole human genome. We show that finding the p-values under the typically used “gold” null hypothesis is NP-hard. This motivates us to reformulate the null hypothesis using Markov chains. To be able to measure the fidelity of our Markovian null hypothesis, we develop a fast direct sampling algorithm to estimate the p-value under the gold null hypothesis. We then present an open-source software tool, MCDP, that computes the p-values under the Markovian null hypothesis in O(m^2+n) time and O(m) memory, where m and n are the numbers of intervals in the reference and query annotations, respectively. Notably, MCDP runtime and memory usage are independent of the genome length, allowing it to outperform previous approaches in runtime and memory usage by orders of magnitude on human genome annotations, while maintaining the same level of accuracy.

16:20-16:40
GoldRush-Path: A de novo assembler for long reads with linear time complexity
Room: Madison B
Format: Live from venue

  • Johnathan Wong, BC Cancer, Genome Sciences Centre, Canada
  • Vladimir Nikolic, BC Cancer, Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer, Genome Sciences Centre, Canada
  • Emily Zhang, BC Cancer, Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer, Genome Sciences Centre, Canada


Presentation Overview: Show

De novo genome assembly is a cornerstone of a variety of genomic analyses. Long-read sequencing technologies have enabled researchers to assemble draft genomes with high contiguity and few structural errors. Most long-read assemblers adopt the overlap-layout-consensus paradigm, a quadratic run time algorithm in its naïve implementation, to address the high number of base errors present in long reads. Recently, ONT and PacBio have made tremendous strides in improving the quality of their long-read sequencing technologies, and opportunities for new long-read assembly algorithms have emerged. We present GoldRush-Path, a memory-efficient long-read assembly algorithm that runs in linear time in the number of reads, as part of the GoldRush pipeline. GoldRush-Path iterates through the long reads and identifies a set of “golden path” sequences that cover ~1X of the target genome by querying each read against a multi-index Bloom filter and inserting it only if its associated sequence signatures are missing. GoldRush-Path, the costliest step in the GoldRush pipeline, consumes at most 73 GB of RAM when assembling human genomes. The selected golden path is then polished and scaffolded in the pipeline, yielding NGA50 lengths of 12 Mbp for human genome assemblies in our tests.
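
The query-then-insert loop described above can be sketched in a few lines. This is a toy illustration and not the GoldRush-Path implementation: a plain Python set stands in for the multi-index Bloom filter, k-mers stand in for the sequence signatures, and the names `select_golden_path` and `max_seen_fraction` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of k-mers in a sequence (stand-in for read signatures)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def select_golden_path(reads, k=5, max_seen_fraction=0.5):
    """Single pass over the reads, keeping those whose signatures are mostly new.

    Assumption: a read counts as "missing" when at most max_seen_fraction
    of its k-mers were already inserted by earlier reads.
    """
    seen = set()    # stand-in for the multi-index Bloom filter
    golden = []
    for read in reads:
        sigs = kmers(read, k)
        if not sigs:
            continue  # read shorter than k: nothing to query
        seen_fraction = len(sigs & seen) / len(sigs)
        if seen_fraction <= max_seen_fraction:
            golden.append(read)
            seen |= sigs  # insert the read's signatures
    return golden

reads = ["ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"]
print(select_golden_path(reads))  # the duplicate second read is skipped
```

Because every read is queried once and inserted at most once, the number of filter operations grows linearly with the number of reads, which matches the complexity claim in the abstract.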

16:40-17:00
Quantification of complex genome editing events including large insertions and translocations using CRISPRlungo
Room: Madison B
Format: Live from venue

  • Kendell Clement, Massachusetts General Hospital / Harvard Medical School, United States
  • Linda Lin, Boston Children's Hospital / Harvard Medical School, United States
  • Pengpeng Liu, UMass Medical School, United States
  • Jing Zeng, Boston Children's Hospital / Harvard Medical School, United States
  • Amy Nguyen, Boston Children's Hospital / Harvard Medical School, United States
  • Scot Wolfe, UMass Medical School, United States
  • Daniel Bauer, Boston Children's Hospital / Harvard Medical School, United States
  • Luca Pinello, Massachusetts General Hospital / Harvard Medical School, United States


Presentation Overview: Show

Genome editing technologies are rapidly evolving, and analysis of deep sequencing data from target and off-target regions is necessary for evaluating editing efficiency, precision and specificity. Our group has developed the widely used tool CRISPResso2, which standardized quantification of editing frequencies at predefined loci using amplicon sequencing. However, this and other methods are only able to detect small insertions and deletions. In order to quantify complex genome editing events including large insertions, inversions and translocations, assays have been proposed which enrich for DNA sequences using only one PCR origin as the anchor for amplification. We developed a novel analytic tool called CRISPRlungo that analyzes sequencing data produced from single-anchor PCR and can quantify and visualize complex genome editing events without any a priori assumption of the expected outcomes. We generated single-anchor amplification data for a therapeutic genome editing experiment and show that our tool can take advantage of the richness of unidirectional sequencing data to both sensitively and specifically detect a variety of complex genome editing outcomes, including identifying rare chromosomal alterations not detectable using current analysis toolkits. CRISPRlungo is available as open-source software that enables researchers to comprehensively assess genome editing outcomes without the biases of amplicon sequencing.

17:00-17:20
Proceedings Presentation: Robust Fingerprinting of Genomic Databases
Room: Madison B
Format: Live-stream

  • Erman Ayday, Case Western Reserve University, United States
  • Tianxi Ji, Case Western Reserve University, United States
  • Emre Yilmaz, University of Houston-Downtown, United States
  • Pan Li, Case Western Reserve University, United States


Presentation Overview: Show

Database fingerprinting has been widely used to discourage unauthorized redistribution of data by providing means to identify the source of data leakages. However, there is no fingerprinting scheme that aims to achieve liability guarantees when sharing genomic databases. Thus, we are motivated to fill this gap by devising a vanilla fingerprinting scheme specifically for genomic databases. Moreover, since malicious genomic database recipients may compromise the embedded fingerprint by launching effective correlation attacks which leverage the intrinsic correlations among genomic data (e.g., Mendel’s law and linkage disequilibrium), we also augment the vanilla scheme by developing mitigation techniques to achieve robust fingerprinting of genomic databases against correlation attacks.

We first show that correlation attacks against fingerprinting schemes for genomic databases are very powerful. In particular, the correlation attacks can distort more than half of the fingerprint bits while causing only a small utility loss (e.g., in database accuracy and in the consistency of SNP-phenotype associations measured via p-values). Next, we experimentally show that the correlation attacks can be effectively mitigated by our proposed mitigation techniques. We validate that the attacker can hardly compromise a large portion of the fingerprint bits even if it pays a higher cost in terms of degradation of the database utility. For example, with around 24% loss in accuracy and 20% loss in the consistency of SNP-phenotype associations, the attacker can only distort about 30% of the fingerprint bits, which is insufficient for it to avoid being accused. We also show that the proposed mitigation techniques preserve the utility of the shared genomic databases.

17:20-17:40
ConDecon: a clustering-independent method for estimating single-cell abundance in bulk tissues using reference single-cell RNA-seq data
Room: Madison B
Format: Live from venue

  • Rachael Aubin, University of Pennsylvania, United States
  • Javier Montelongo, University of Pennsylvania, United States
  • Pablo Camara, University of Pennsylvania, United States


Presentation Overview: Show

Biological tissues are heterogeneous and comprise cells undergoing continuous biological processes like cell differentiation. Single-cell RNA-sequencing technologies enable the investigation of these processes. However, generating large cohorts of single-cell data is challenging compared to bulk transcriptomic data. Although many computational methods have been developed for inferring cell type abundance from bulk transcriptomic data, these approaches rely on cell type gene expression signatures and ignore intra-cluster variability. Continuous Deconvolution, ConDecon, is a clustering-independent deconvolution algorithm specifically developed to predict complex changes in single-cell abundance from bulk tissue. This approach estimates the probability that each cell in a reference single-cell dataset is present in a query bulk dataset. We compared ConDecon to 17 other methods and found that ConDecon performs comparably to state-of-the-art algorithms when inferring discrete cell type abundances. We then focus on ConDecon’s ability to estimate dynamic cell abundances along continuous cellular processes. To that end, we applied ConDecon to well-characterized biological systems like B-cell maturation and immune activation. Finally, we use it to identify changes in the activation of tumor-infiltrating microglia during the mesenchymal transformation of pediatric ependymoma. We anticipate that ConDecon will extend the utility of current methods to characterize single-cell dynamics in bulk tissue.

17:40-18:00
κ-velo improves single-cell RNA-velocity estimation
Room: Madison B
Format: Live from venue

  • Valérie Marot-Lassauzaie, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany
  • Brigitte Joanne Bouman, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany
  • Fearghal Declan Donaghy, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany
  • Laleh Haghverdi, Berlin Institute for Medical Systems Biology, Max Delbrück Center in the Helmholtz Association, Berlin, Germany


Presentation Overview: Show

Single-cell transcriptomics has been used to study dynamical processes such as cell differentiation. RNA velocity (La Manno et al. 2020) was a breakthrough towards obtaining a more complete description of the dynamics of such processes. Here, simultaneous measurement of new unspliced and old spliced mRNA adds a temporal dimension to the data. The change in mRNA abundance, called RNA velocity, is used to infer the progression of cells through the dynamical process. However, reliable velocity analysis is still impeded by multiple computational issues. State-of-the-art methods (Bergen et al. 2020) have issues in both velocity inference and visualisation. Moreover, there are inconsistencies in current processing pipelines, and the single-cell specific (stochastic) part of the dynamics is lost through multiple layers of data smoothing.
We introduce a new method for RNA velocity analysis that addresses some of the issues in velocity estimation. We also propose that visualisation of the velocities based on the Nystroem projection method represents the single-cell stochasticity better than current practices. Finally, we adjust the processing pipeline for consistency with downstream velocity estimation. We validate our model on simulation and on real data, and compare it to the current state of the art.
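
For readers unfamiliar with the quantity being estimated, the "change in mRNA abundance" mentioned above is usually written with the standard splicing kinetics model from the RNA velocity literature. The block below states that common model as background only; it is not the κ-velo formulation, which the abstract does not spell out. Here u and s denote unspliced and spliced mRNA abundance for one gene, and α, β and γ are the transcription, splicing and degradation rates.

```latex
\begin{align}
\frac{du}{dt} &= \alpha - \beta\,u, \\
\frac{ds}{dt} &= \beta\,u - \gamma\,s, \\
v &= \frac{ds}{dt}.
\end{align}
```

Under this model the velocity v vanishes at steady state, where u = (γ/β) s, so cells above that line in the (s, u) plane are up-regulating the gene and cells below it are down-regulating it.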