HiTSeq COSI

Attention Presenters: please review the Speaker Information Page.
Schedule subject to change
All times listed are in UTC
Sunday, July 25th
11:00-11:20
Dynamic Adaptive Sampling During Nanopore Sequencing and Assembly using Bayesian Experimental Design
Format: Pre-recorded with live Q&A

Moderator(s): Can Alkan 

  • Lukas Weilguny, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
  • Nicola De Maio, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom
  • Nick Goldman, European Molecular Biology Laboratory, European Bioinformatics Institute, United Kingdom

Presentation Overview:

One particularly promising feature of nanopore sequencing is the ability to reject reads, enabling real-time selection of molecules without complex sample preparation. Previously, such decisions were based on an a priori choice. Instead, they could also incorporate already-observed data in order to maximise information gain. For example, during resequencing, the genotype of sites without variation is confirmed by few reads, whereas more data would be desirable at variable sites.

We present BOSS-RUNS, a mathematical model to calculate the expected benefit of new reads and an algorithm to generate dynamically updated decision strategies.
During sequencing, we quantify the uncertainty at each site and, for each novel read, decide whether the potential decrease in uncertainty at the sites it will most likely cover warrants complete sequencing.

In simulations of a microbial community, we show that this can mitigate coverage bias or yield higher minimum coverage in regions of interest, compared to sequencing without adaptive sampling or with an a priori strategy.
Further, we consider the problem of de novo assembly by adapting our framework to genome graphs, which allows for the rejection of fragments from well-assembled regions to focus on reads that extend contigs or resolve repeats instead.
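
As a concrete illustration of the decision rule described above, the following minimal sketch scores a read by the expected entropy reduction at the sites it covers, under an assumed two-allele Beta-Bernoulli site model; the pseudo-counts, threshold and model are illustrative stand-ins, not BOSS-RUNS' actual benefit calculation.

    import math

    def entropy(p):
        """Shannon entropy (bits) of a Bernoulli posterior."""
        if p in (0.0, 1.0):
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def posterior_alt(alt, ref, prior_alt=1.0, prior_ref=1.0):
        """Posterior mean of the alternative-allele probability."""
        return (alt + prior_alt) / (alt + ref + prior_alt + prior_ref)

    def expected_benefit(alt, ref):
        """Expected entropy drop at one site after one more observation."""
        p = posterior_alt(alt, ref)
        h_now = entropy(p)
        # The next base supports alt with probability p, ref with 1 - p.
        h_next = (p * entropy(posterior_alt(alt + 1, ref))
                  + (1 - p) * entropy(posterior_alt(alt, ref + 1)))
        return h_now - h_next

    def decide(read_sites, threshold=0.001):
        """Sequence the read fully if the summed expected benefit over
        the sites it will likely cover exceeds a threshold; else eject."""
        benefit = sum(expected_benefit(a, r) for a, r in read_sites)
        return "sequence" if benefit >= threshold else "reject"

    # A read covering a variable site (5 alt / 5 ref) is kept, while one
    # covering only a well-confirmed site (0 alt / 40 ref) is ejected.
    print(decide([(5, 5)]), decide([(0, 40)]))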

11:20-11:40
Proceedings Presentation: Topology-based Sparsification of Graph Annotations
Format: Pre-recorded with live Q&A

Moderator(s): Can Alkan 

  • Daniel Danciu, ETH Zurich, Switzerland
  • Mikhail Karasikov, ETH Zurich, Switzerland
  • Harun Mustafa, ETH Zurich, Switzerland
  • Andre Kahles, ETH Zurich, Switzerland
  • Gunnar Ratsch, ETH Zurich, Switzerland

Presentation Overview:

Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are needed to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. In this paper, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10,000 RNA-seq datasets show that RowDiff combined with Multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a Multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST.
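
As a toy illustration of the sparsification idea (a simplification under assumed set-based rows, not the paper's exact encoding), each vertex can store only the symmetric difference of its label set against a designated successor, with full rows kept at anchor vertices:

    def sparsify(rows, succ, anchors):
        """rows: vertex -> set of labels; succ: vertex -> successor vertex.
        Anchors keep their full rows; everyone else stores a diff."""
        return {v: rows[v] if v in anchors else rows[v] ^ rows[succ[v]]
                for v in rows}

    def reconstruct(v, diffs, succ, anchors):
        """Recover the full label set of v by accumulating diffs along
        the successor path until an anchor is reached."""
        acc = set()
        while v not in anchors:
            acc ^= diffs[v]
            v = succ[v]
        return acc ^ diffs[v]

    rows = {"A": {1, 2}, "B": {1, 2, 3}, "C": {1, 3}}
    succ = {"A": "B", "B": "C"}            # C is an anchor
    diffs = sparsify(rows, succ, anchors={"C"})
    assert reconstruct("A", diffs, succ, {"C"}) == {1, 2}

Because adjacent vertices share most labels, the diff rows are much sparser than the originals, and a generic compressed binary-matrix scheme can then be applied on top.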

11:40-12:00
Proceedings Presentation: Cuttlefish: Fast, parallel, and low-memory compaction of de Bruijn graphs from large-scale genome collections
Format: Pre-recorded with live Q&A

Moderator(s): Can Alkan 

  • Rob Patro, University of Maryland, United States
  • Jamshed Khan, University of Maryland, United States

Presentation Overview:

Motivation: The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.
Results: We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 hours, using ~29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 hours, using ~84 GB of memory. The only other tool completing these tasks on the hardware took over 23 hours using ~126 GB of memory, and over 16 hours using ~289 GB of memory, respectively.
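
The automaton view can be conveyed in a few lines (an assumption-heavy toy: reverse complements and the real compact state encoding are ignored). Each k-mer keeps one state per side; a second distinct neighbor on a side makes that side ambiguous, marking a boundary of a maximal unitig:

    UNSEEN, AMBIGUOUS = None, "*"

    def build_states(seqs, k):
        states = {}                        # k-mer -> [left_state, right_state]
        for seq in seqs:
            for i in range(len(seq) - k + 1):
                st = states.setdefault(seq[i:i + k], [UNSEEN, UNSEEN])
                for side, ch in ((0, seq[i - 1] if i else None),
                                 (1, seq[i + k] if i + k < len(seq) else None)):
                    if ch is None:
                        continue           # k-mer at a sequence end
                    if st[side] is UNSEEN:
                        st[side] = ch      # first neighbor seen on this side
                    elif st[side] != ch:
                        st[side] = AMBIGUOUS
        return states

    # GTA is followed by both C and T across the sequences, so its right
    # side becomes ambiguous and a unitig boundary falls after GTA.
    print(build_states(["ACGTAC", "CGTAT"], k=3)["GTA"])   # -> ['C', '*']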

12:00-12:20
Proceedings Presentation: Constructing small genome graphs via string compression
Format: Pre-recorded with live Q&A

Moderator(s): Can Alkan 

  • Carl Kingsford, Carnegie Mellon University, United States
  • Yutong Qiu, Carnegie Mellon University, United States

Presentation Overview:

Motivation: The size of a genome graph (the space required to store the nodes, their labels and edges) affects the efficiency of operations performed on it. For example, the time complexity to align a sequence to a graph without a graph index depends on the total number of characters in the node labels and the number of edges in the graph. This raises the need for approaches to construct space-efficient genome graphs.

Results: We point out similarities in the string encoding mechanisms of genome graphs and the external pointer macro (EPM) compression model. Supported by these similarities, we present a pair of linear-time algorithms that transform between genome graphs and EPM-compressed forms. We show that the algorithms result in an upper bound on the size of the genome graph constructed in terms of an optimal EPM compression. To further optimize the size of the genome graph, we propose the source assignment problem, which optimizes over the equivalent choices during compression, and introduce an ILP formulation that solves that problem optimally. As a proof-of-concept, we introduce RLZ-Graph, a genome graph constructed based on the relative Lempel-Ziv algorithm. We show that using RLZ-Graph, across all human chromosomes, we are able to reduce the disk space to store a genome graph on average by 40.7% compared to colored compacted de Bruijn graphs constructed by Bifrost under the default settings.
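
For context, here is a hedged sketch of a greedy relative Lempel-Ziv (RLZ) parse, the EPM-style factorization that RLZ-Graph builds on (the quadratic reference scan is for clarity only; a practical construction would use suffix-array machinery):

    def rlz_parse(target, reference):
        """Factor `target` into (start, length) copies from `reference`,
        falling back to single literal characters."""
        factors, i = [], 0
        while i < len(target):
            best = (None, 0)
            for s in range(len(reference)):        # O(n*m) toy search
                l = 0
                while (s + l < len(reference) and i + l < len(target)
                       and reference[s + l] == target[i + l]):
                    l += 1
                if l > best[1]:
                    best = (s, l)
            if best[1] == 0:
                factors.append(target[i])          # literal character
                i += 1
            else:
                factors.append(best)
                i += best[1]
        return factors

    print(rlz_parse("ACGTTACGT", "ACGTT"))         # -> [(0, 5), (0, 4)]

Each copy factor reuses a reference substring, which is exactly the kind of shared node label a small genome graph exploits.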

Availability: The RLZ-Graph software is available at https://github.com/Kingsford-Group/rlzgraph

12:40-13:00
Proceedings Presentation: CentromereArchitect: inference and analysis of the architecture of centromeres
Format: Pre-recorded with live Q&A

Moderator(s): Ana Conesa

  • Tatiana Dvorkina, Center for Algorithmic Biotechnology, Saint Petersburg State University, Russia
  • Olga Kunyavskaya, Center for Algorithmic Biotechnology, Saint Petersburg State University, Russia
  • Andrey V. Bzikadze, Graduate Program in Bioinformatics and Systems Biology, University of California, San Diego, United States
  • Ivan Alexandrov, Center for Algorithmic Biotechnology, Saint Petersburg State University, Russia
  • Pavel A. Pevzner, Department of Computer Science and Engineering, University of California, San Diego, United States

Presentation Overview:

Motivation: Recent advances in long-read sequencing technologies have led to rapid progress in centromere assembly in the last year and, for the first time, opened the possibility of addressing long-standing questions about the architecture and evolution of human centromeres. However, since these advances have not yet been accompanied by the development of centromere-specific bioinformatics algorithms, even the fundamental questions (e.g., centromere annotation by deriving the complete set of human monomers and high-order repeats), let alone more complex questions (e.g., explaining how monomers and high-order repeats evolved), about human centromeres remain open. Moreover, even though there has been a four-decade-long series of studies aimed at cataloging all human monomers and high-order repeats, rigorous algorithmic definitions of these concepts are still lacking. Thus, development of a centromere annotation tool is a prerequisite for follow-up personalized biomedical studies of centromeres across the human population and evolutionary studies of centromeres across various species.
Results: We describe CentromereArchitect, the first tool for centromere annotation in a newly sequenced genome, apply it to the recently generated complete assembly of a human genome by the Telomere-to-Telomere consortium, generate the complete set of human monomers and high-order repeats for so-called live centromeres, and reveal a vast set of hybrid monomers that may represent the focal points of centromere evolution.

13:00-13:20
The statistics of k-mers from a sequence undergoing a simple mutation process without spurious matches
Format: Pre-recorded with live Q&A

Moderator(s): Ana Conesa

  • Antonio Blanca, Penn State, United States
  • Robert S. Harris, The Pennsylvania State University, United States
  • David Koslicki, Penn State University, United States
  • Paul Medvedev, The Pennsylvania State University, United States

Presentation Overview:

K-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g. a genome or a read) undergoes a simple mutation process whereby each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of non-mutated k-mers). We then derive hypothesis tests and confidence intervals for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without minhash). We demonstrate the usefulness of our results using a few select applications: obtaining a confidence interval to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long read alignments to a de Bruijn graph by Jabba.
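
The headline expectation is easy to state and invert. Under the model above, a k-mer is unaffected only if all k of its bases escape mutation, so with q = (1 - r)^k and L = |S| - k + 1 k-mers, E[#mutated k-mers] = L(1 - q). A small worked example (illustrative only; the paper additionally derives variances and proper confidence intervals):

    def expected_mutated_kmers(L, k, r):
        """E[# mutated k-mers] = L * (1 - (1 - r)^k)."""
        return L * (1 - (1 - r) ** k)

    def estimate_r(n_mut, L, k):
        """Point estimate of r by inverting the expectation."""
        q_hat = 1 - n_mut / L              # estimated k-mer survival prob.
        return 1 - q_hat ** (1 / k)

    L, k, r = 10_000, 21, 0.01
    n = expected_mutated_kmers(L, k, r)    # about 1903 of 10000 k-mers
    print(round(n), round(estimate_r(n, L, k), 4))   # -> 1903 0.01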

13:20-13:40
Proceedings Presentation: Long Reads Capture Simultaneous Enhancer-Promoter Methylation Status for Cell-type Deconvolution
Format: Pre-recorded with live Q&A

Moderator(s): Ana Conesa

  • Roded Sharan, Tel Aviv University, Israel
  • Sapir Margalit, Tel Aviv University, Israel
  • Yotam Abramson, Tel Aviv University, Israel
  • Hila Sharim, Tel Aviv University, Israel
  • Zohar Manber, Tel Aviv University, Israel
  • Surajit Bhattacharya, Children’s National Hospital, Washington DC, United States
  • Yi-Wen Chen, Children’s National Hospital, Washington DC; George Washington University, United States
  • Eric Vilain, Children’s National Hospital, Washington DC; George Washington University, United States
  • Hayk Barseghyan, Children’s National Hospital, Washington DC; George Washington University, United States
  • Ran Elkon, Tel Aviv University, Israel
  • Yuval Ebenstein, Tel Aviv University, Israel

Presentation Overview:

Motivation: While promoter methylation is associated with reinforcing fundamental tissue identities, the methylation status of distant enhancers was shown by genome-wide association studies to be a powerful determinant of cell state and cancer. With the recent availability of long reads that report the methylation status of enhancer-promoter pairs on the same molecule, we hypothesized that probing these pairs on the single-molecule level may serve as the basis for detection of rare cancerous transformations in a given cell population. We explore various analysis approaches for deconvolving cell-type mixtures based on their genome-wide enhancer-promoter methylation profiles.
Results: To evaluate our hypothesis, we examine long-read optical methylome data for the GM12787 cell line and myoblast cell lines from two donors. We identified over 100,000 enhancer-promoter pairs that co-exist on at least 30 individual DNA molecules per pair. We developed a detailed methodology for mixture deconvolution and applied it to estimate the proportional cell compositions in synthetic mixtures based on analyzing their enhancer-promoter pairwise methylation. We found our methodology to yield very accurate estimates, outperforming our promoter-based deconvolutions. Moreover, we show that it can be generalized from deconvolving different cell types to subtle scenarios where one wishes to deconvolve different cell populations of the same cell type.
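
A hedged sketch of the deconvolution step itself (assumption: plain least squares on synthetic reference profiles; the paper's methodology is built on pairwise enhancer-promoter methylation states rather than this generic formulation):

    import numpy as np

    rng = np.random.default_rng(1)
    R = rng.random((5000, 3))            # reference profiles: features x 3 cell types
    w_true = np.array([0.7, 0.2, 0.1])   # hidden mixture proportions
    mix = R @ w_true + rng.normal(0, 0.01, 5000)   # observed mixture profile

    w, *_ = np.linalg.lstsq(R, mix, rcond=None)    # fit proportions
    w = np.clip(w, 0, None)
    w /= w.sum()                         # project onto the simplex
    print(np.round(w, 3))                # -> approximately [0.7 0.2 0.1]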

Availability: The code used in this work to analyze single-molecule Bionano Genomics optical maps is available via the GitHub repository https://github.com/ebensteinLab/Single_molecule_methylation_in_EP.
Contact: uv@post.tau.ac.il (Y.E), roded@tauex.tau.ac.il (R.S)

13:40-14:00
Proceedings Presentation: Practical selection of representative sets of RNA-seq samples using a hierarchical approach
Format: Pre-recorded with live Q&A

Moderator(s): Ana Conesa

  • Laura Tung, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview:

Motivation: Despite the numerous RNA-seq samples available in large databases, most RNA-seq analysis tools are evaluated on a limited number of RNA-seq samples. This drives a need for methods to select a representative subset from all available RNA-seq samples to facilitate comprehensive, unbiased evaluation of bioinformatics tools. In sequence-based approaches for representative set selection (e.g. a k-mer counting approach that selects a subset based on k-mer similarities between RNA-seq samples), computing the full similarity matrix using k-mers/sequences for the entire set of RNA-seq samples in a large database (e.g. the SRA) poses memory and runtime challenges because of the large numbers of available RNA-seq samples and of k-mers/sequences in each sample; this makes direct representative set selection infeasible with limited computing resources.
Results: We developed a novel computational method called “hierarchical representative set selection” to handle this challenge. Hierarchical representative set selection is a divide-and-conquer-like algorithm that breaks representative set selection into sub-selections and hierarchically selects representative samples through multiple levels. We demonstrate that hierarchical representative set selection can achieve summarization quality close to that of direct representative set selection, while largely reducing the runtime and memory requirements of computing the full similarity matrix (up to an 8.4X runtime reduction and a 5.35X memory reduction, for the 10,000- and 12,000-sample sets respectively that could still practically be run with direct selection). We show that hierarchical representative set selection substantially outperforms random sampling on the entire SRA set of RNA-seq samples, making it a practical solution to representative set selection on large databases like the SRA.
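
The divide-and-conquer shape of the method can be sketched as follows (assumption: greedy farthest-point selection stands in for the actual sub-selection criterion, and a 1-D toy distance replaces k-mer similarity):

    import random

    def farthest_point_subset(items, m, dist):
        """Greedy k-center: pick m items that are spread out under dist."""
        reps = [items[0]]
        while len(reps) < min(m, len(items)):
            reps.append(max(items, key=lambda x: min(dist(x, r) for r in reps)))
        return reps

    def hierarchical_select(items, m, dist, chunk=100):
        """Select within chunks, then recurse on the survivors, so the
        full pairwise similarity matrix is never materialized."""
        if len(items) <= chunk:
            return farthest_point_subset(items, m, dist)
        survivors = []
        for i in range(0, len(items), chunk):
            survivors += farthest_point_subset(items[i:i + chunk], m, dist)
        return hierarchical_select(survivors, m, dist, chunk)

    pts = [random.random() for _ in range(1000)]
    reps = hierarchical_select(pts, 10, lambda a, b: abs(a - b))
    print(sorted(round(p, 2) for p in reps))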

14:20-15:20
HiTSeq Keynote: Toward high-throughput and full-length characterization of transcript isoforms, including their function
Format: Live-stream

Moderator(s): Ana Conesa

  • Angela Brooks
Monday, July 26th
11:00-12:00
HiTSeq Keynote: Population scale analysis of human sequence data
Format: Live-stream

Moderator(s): Birte Kehr

  • Bjarni Halldórsson
12:00-12:20
Co-linear chaining with overlaps and gap costs
Format: Pre-recorded with live Q&A

Moderator(s): Birte Kehr

  • Chirag Jain, Indian Institute of Science, India
  • Daniel Gibney, University of Central Florida, United States
  • Sharma V. Thankachan, University of Central Florida, United States

Presentation Overview:

Co-linear chaining has proven to be a powerful technique for finding approximately optimal alignments and approximating edit distance. It is used as an intermediate step in numerous mapping tools that follow a seed-and-extend strategy. Despite this popularity, subquadratic-time algorithms for the case where chains support anchor overlaps and gap costs are not currently known. Moreover, a theoretical connection between co-linear chaining cost and edit distance has remained unknown. We present algorithms to solve the co-linear chaining problem with anchor overlaps and gap costs in O(polylog(n)*n) time, where n denotes the count of anchors. We establish the first theoretical connection between co-linear chaining cost and edit distance. Specifically, we prove that for a fixed set of anchors under a carefully designed chaining cost function, the optimal `anchored' edit distance equals the optimal co-linear chaining cost. Finally, we demonstrate experimentally that the optimal co-linear chaining cost under the proposed cost function can be computed significantly faster than edit distance, and achieves high correlation with edit distance for both closely and distantly related sequences.
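
For orientation, the classical quadratic-time chaining recurrence with a gap cost looks as follows (a simplified baseline with non-overlapping anchors and an assumed diagonal-difference gap penalty; the talk's contribution is solving richer variants in subquadratic time):

    def chain_score(anchors, gap_w=0.1):
        """anchors: (x, y, length) exact matches, scored by length.
        best[i] = best chain score over chains ending at anchor i."""
        anchors = sorted(anchors)              # by x, then y
        best = [a[2] for a in anchors]
        for i, (xi, yi, li) in enumerate(anchors):
            for j in range(i):
                xj, yj, lj = anchors[j]
                if xj + lj <= xi and yj + lj <= yi:    # co-linear, no overlap
                    gap = abs((xi - xj) - (yi - yj))   # diagonal difference
                    best[i] = max(best[i], best[j] + li - gap_w * gap)
        return max(best)

    print(chain_score([(0, 0, 10), (12, 12, 10), (25, 40, 10)]))  # -> 28.5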

12:40-13:00
Proceedings Presentation: Sequence-specific minimizers via polar sets
Format: Pre-recorded with live Q&A

Moderator(s): Kjong Lehmann

  • Hongyu Zheng, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States
  • Guillaume Marcais, Carnegie Mellon University, United States

Presentation Overview:

Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, that is, constructing efficient minimizers that sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences.
We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers.
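
For context, a standard (w, k) random minimizer scheme, the baseline whose sequence-specific counterpart polar sets aim to improve, fits in a few lines (the hash function and parameters here are arbitrary choices):

    import hashlib

    def hval(kmer):
        """Pseudo-random order on k-mers via a stable hash."""
        return int.from_bytes(
            hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

    def minimizers(seq, w, k):
        """Sample, for each window of w consecutive k-mers, the position
        of the minimum-hash k-mer; adjacent windows usually agree, so
        few positions are picked overall."""
        picked = set()
        for i in range(len(seq) - (w + k - 1) + 1):
            window = range(i, i + w)
            picked.add(min(window, key=lambda j: hval(seq[j:j + k])))
        return sorted(picked)

    print(minimizers("ACGTACGTTGCATGCA", w=4, k=5))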

13:00-13:20
Proceedings Presentation: Real-time mapping of nanopore raw signals
Format: Pre-recorded with live Q&A

Moderator(s): Kjong Lehmann

  • Haowen Zhang, Georgia Institute of Technology, United States
  • Haoran Li, Ohio State University, United States
  • Chirag Jain, Indian Institute of Science, United States
  • Haoyu Cheng, Harvard Medical School, United States
  • Kin Fai Au, Ohio State University, United States
  • Heng Li, Harvard Medical School and Dana-Farber Cancer Institute, United States
  • Srinivas Aluru, Georgia Institute of Technology, United States

Presentation Overview:

Motivation: Oxford Nanopore Technologies sequencing devices support adaptive sequencing, in which undesired reads can be ejected from a pore in real time. This feature allows targeted sequencing aided by computational methods for mapping partial reads, rather than complex library preparation protocols. However, existing mapping methods either require a computationally expensive base calling procedure before using aligners to map partial reads, or work well only on small genomes.

Results: In this work, we present a new streaming method that can map nanopore raw signals for real-time selective sequencing. Rather than converting read signals to bases, we propose to convert reference genomes to signals and fully operate in the signal space. Our method features a new way to index reference genomes using k-d trees, a novel seed selection strategy, and a seed chaining algorithm tailored towards the current signal characteristics. We implemented the method in the tool Sigmap, evaluated it on both simulated and real data, and compared it to the state-of-the-art nanopore raw signal mapper Uncalled. Our results show that Sigmap yields better accuracy on mapping real yeast raw signals and is 4.4 times faster. Moreover, our method performed well on mapping raw signals to genomes of size >100 Mbp and correctly mapped 11.49% more real raw signals of green algae, which leads to a significantly higher F1-score (0.9354 vs. 0.8660).
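
The core indexing idea, turning the reference into signal-space vectors and querying them with a k-d tree, can be sketched as below (a synthetic signal and toy window features are assumed; Sigmap's real event features, seeding and chaining differ). Requires numpy and scipy.

    import numpy as np
    from scipy.spatial import cKDTree

    rng = np.random.default_rng(0)
    ref = np.cumsum(rng.normal(size=500))      # stand-in reference signal
    vecs = np.lib.stride_tricks.sliding_window_view(ref, 6)  # 6-D vector per position
    tree = cKDTree(vecs)                       # the signal-space index

    query = vecs[137] + rng.normal(0, 0.01, 6) # noisy "read" event vector
    dist, pos = tree.query(query)              # nearest reference position
    print(pos, round(float(dist), 3))          # -> 137 and a tiny distance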

13:20-13:40
Strobemers: an alternative to k-mers for sequence comparison
Format: Pre-recorded with live Q&A

Moderator(s): Kjong Lehmann

  • Kristoffer Sahlin, Stockholm University, Sweden

Presentation Overview:

K-mer-based methods are widely used in bioinformatics for sequence comparison. However, a single mutation affects k consecutive k-mers, making most k-mer-based applications sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, e.g., spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step perform a pairing or grouping of k-mers. Such techniques produce many redundant k-mer matches due to the size of k.

Here, we propose strobemers, a new data structure for sequence comparison. We use simulated data to show that, under several mutation rates, strobemers outperform k-mers and spaced k-mers for sequence similarity searches by producing more evenly spread sequence matches and higher match coverage. We further implement a proof-of-concept sequence matching tool StrobeMap. We use StrobeMap with synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios. A reference implementation of our tool StrobeMap together with code for analyses is available at https://github.com/ksahlin/strobemers.
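
The construction is easy to prototype. Below is an illustrative randstrobe-style seed of order 2 (a simplified link function is assumed; see the paper and repository for the real minstrobe/randstrobe definitions):

    import zlib

    def h(s):
        return zlib.crc32(s.encode())          # stable toy hash

    def randstrobes(seq, k=3, w_min=2, w_max=5):
        """Couple the k-mer at i with the downstream k-mer in the window
        [i+w_min, i+w_max] that minimizes a link hash. An indel between
        the two strobes perturbs the seed less than one long k-mer."""
        seeds = []
        for i in range(len(seq) - w_max - k + 1):
            first = seq[i:i + k]
            j = min(range(i + w_min, i + w_max + 1),
                    key=lambda p: h(first + seq[p:p + k]))
            seeds.append((i, j, first + seq[j:j + k]))
        return seeds

    print(randstrobes("ACGTTGCATGCATCAG")[:3])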

13:40-14:00
Efficient linked-read barcode mapping without read alignment
Format: Pre-recorded with live Q&A

Moderator(s): Kjong Lehmann

  • Richard Lüpken, Berlin Institute of Health, Germany
  • Birte Kehr, Regensburg Center for Interventional Immunology (RCI), Germany

Presentation Overview:

When sequencing whole genomes, one faces a tremendous amount of mostly unstructured data. Obtaining all reads corresponding to a specific genomic location currently requires the computationally expensive alignment of all reads. Linked-read sequencing technologies provide an additional level of structure through the use of barcodes: reads with the same barcode originate from a small set of large DNA molecules. This provides opportunities that have not yet been used to their full potential.
Here we introduce an efficient approach for determining barcode intervals in a reference genome without performing a costly read alignment. Simultaneously, we construct an index to quickly retrieve all reads of a given barcode from the input read files. Our barcode mapping approach queries minimizers from an open-addressing k-mer index, and the resulting hits are clustered into barcode intervals using a sliding-window approach based on a scoring function. Mapping barcodes of a full set of reads took us 6.5 CPU hours, whereas aligning the same read set with BWA-MEM took 244 CPU hours. When faced with WGS data but interested in a specific genomic location, our approach can quickly return all barcodes and reads belonging to the locus of interest.
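
The interval-calling step can be approximated as clustering sorted minimizer hit positions with a gap rule (the parameters here are invented for illustration; the actual tool scores sliding windows rather than using a fixed gap threshold):

    def hit_intervals(positions, max_gap=10_000, min_hits=5):
        """Cluster sorted hit positions of one barcode into reference
        intervals, splitting wherever the gap exceeds max_gap."""
        intervals, run = [], []
        for p in sorted(positions):
            if run and p - run[-1] > max_gap:
                if len(run) >= min_hits:
                    intervals.append((run[0], run[-1]))
                run = []
            run.append(p)
        if len(run) >= min_hits:
            intervals.append((run[0], run[-1]))
        return intervals

    print(hit_intervals([100, 900, 1200, 3000, 5000, 250_000, 251_000,
                         252_500, 253_000, 254_000]))
    # -> [(100, 5000), (250000, 254000)]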

14:20-14:40
Proceedings Presentation: doubletD: Detecting doublets in single-cell DNA sequencing data
Format: Pre-recorded with live Q&A

Moderator(s): Kjong Lehmann

  • Leah Weber, University of Illinois at Urbana-Champaign, United States
  • Palash Sashittal, University of Illinois at Urbana-Champaign, United States
  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

While single-cell DNA sequencing (scDNA-seq) has enabled the study of intra-tumor heterogeneity at an unprecedented resolution, current technologies are error-prone and often produce doublets, where two or more cells are mistaken for a single cell. Not only do doublets confound downstream analyses, but the increase in doublet rate is also a major bottleneck preventing higher throughput with current single-cell technologies. Although doublet detection and removal are standard practice in scRNA-seq data analysis, there are no standalone doublet detection methods for scDNA-seq data.

We present doubletD, the first standalone method for detecting doublets in scDNA-seq data. Underlying our method is a simple maximum likelihood approach with a closed-form solution. We demonstrate the performance of doubletD on simulated data as well as real datasets, outperforming current methods for downstream analysis of scDNA-seq data that jointly infer doublets as well as standalone approaches for doublet detection in scRNA-seq data. Incorporating doubletD in scDNA-seq analysis pipelines will reduce complexity and lead to more accurate results.
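
To convey the flavor of the likelihood comparison (a deliberately simplified stand-in: binomial read counts and clean genotype VAFs; doubletD's actual model also handles allele dropout and sequencing error):

    from math import comb, log

    def binom_ll(alt, tot, vaf, eps=1e-3):
        vaf = min(max(vaf, eps), 1 - eps)      # guard against log(0)
        return log(comb(tot, alt)) + alt * log(vaf) + (tot - alt) * log(1 - vaf)

    def site_ll(alt, tot, vafs):
        """Best log-likelihood over the genotype VAFs a model allows."""
        return max(binom_ll(alt, tot, v) for v in vafs)

    SINGLET = (0.0, 0.5, 1.0)                  # genotypes 0/0, 0/1, 1/1
    DOUBLET = (0.0, 0.25, 0.5, 0.75, 1.0)      # mixtures of two cells

    def is_doublet(counts, penalty=2.0):
        """counts: (alt, total) per SNV site for one droplet; the penalty
        crudely offsets the doublet model's extra freedom."""
        ls = sum(site_ll(a, t, SINGLET) for a, t in counts)
        ld = sum(site_ll(a, t, DOUBLET) for a, t in counts)
        return ld - ls > penalty

    print(is_doublet([(5, 20), (15, 20), (6, 24), (18, 24)]))   # True
    print(is_doublet([(10, 20), (0, 20), (20, 20), (12, 24)]))  # False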

14:40-15:00
ACE: Explaining single-cell cluster from an adversarial perspective
Format: Pre-recorded with live Q&A

Moderator(s): Kjong Lehmann

  • Yang Lu, University of Washington, United States
  • Timothy Yu, University of Washington, United States
  • Giancarlo Bonora, University of Washington, United States
  • William Noble, University of Washington, United States

Presentation Overview:

A common workflow in single-cell RNA-seq analysis is to project the data to a latent space, cluster the cells in that space, and identify sets of marker genes that explain the differences among the discovered clusters. A primary drawback to this three-step procedure is that each step is carried out independently, thereby neglecting the effects of the nonlinear embedding and inter-gene dependencies on the selection of marker genes. Here we propose an integrated deep learning framework, Adversarial Clustering Explanation (ACE), that bundles all three steps into a single workflow. The method thus moves away from the notion of "marker genes" to instead identify a panel of explanatory genes. This panel may include genes that are not only enriched but also depleted relative to other cell types, as well as genes that exhibit differences between closely related cell types. Empirically, we demonstrate that ACE is able to identify gene panels that are both highly discriminative and nonredundant.

15:00-15:20
Proceedings Presentation: SAILER: Scalable and Accurate Invariant Representation Learning for Single-cell ATACseq Processing and Integration
Format: Pre-recorded with live Q&A

Moderator(s): Kjong Lehmann

  • Jing Zhang, UC Irvine, United States
  • Laiyi Fu, Xi'an Jiaotong University, China
  • Yingxin Cao, University of California, Irvine, United States
  • Jie Wu, University of California, Irvine, United States
  • Qin Ke Peng, Xi'an Jiaotong University, China
  • Qing Nie, University of California, Irvine, United States
  • Xiaohui Xie, University of California, Irvine, United States

Presentation Overview:

Motivation: Single-cell sequencing assay for transposase-accessible chromatin (scATAC-seq) provides new opportunities to dissect epigenomic heterogeneity and elucidate transcriptional regulatory mechanisms. However, computational modeling of scATAC-seq data is challenging due to its high dimensionality, extreme sparsity, complex dependencies, and high sensitivity to confounding factors from various sources.

Results: Here we propose a new deep generative model framework, named SAILER, for analysing scATAC-seq data. SAILER aims to learn a low-dimensional nonlinear latent representation of each cell that defines its intrinsic chromatin state, invariant to extrinsic confounding factors like read depth and batch effects. SAILER adopts the conventional encoder-decoder framework to learn the latent representation but imposes additional constraints to ensure the independence of the learned representations from the confounding factors. Experimental results on both simulated and real scATAC-seq datasets demonstrate that SAILER learns better and biologically more meaningful representations of cells than other methods. Its noise-free cell embeddings bring significant benefits in downstream analyses: clustering and imputation based on SAILER result in 6.9% and 18.5% improvements over existing methods, respectively. Moreover, because no matrix factorization is involved, SAILER can easily scale to millions of cells. We implemented SAILER as a software package, freely available to the scientific community for large-scale scATAC-seq data analysis.
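
The invariance trick can be conveyed with a deliberately tiny numpy sketch (random untrained weights and invented dimensions, nothing like the real architecture): the decoder receives the confounders directly, so the latent code has no incentive to encode read depth or batch.

    import numpy as np

    rng = np.random.default_rng(0)
    W_enc = rng.normal(size=(2, 1000)) * 0.01    # 1000 peaks -> 2-D latent
    W_dec = rng.normal(size=(1000, 4)) * 0.01    # latent(2)+depth+batch -> peaks

    def encode(x):
        return W_enc @ x                          # z: intrinsic chromatin state

    def decode(z, depth, batch):
        h = np.concatenate([z, [depth, batch]])   # confounders bypass the latent
        return 1 / (1 + np.exp(-(W_dec @ h)))     # per-peak accessibility prob.

    x = (rng.random(1000) < 0.05).astype(float)   # sparse binary cell profile
    z = encode(x)
    print(decode(z, depth=np.log(x.sum()), batch=0.0)[:5])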

Tuesday, July 27th
11:00-11:20
Comparative genome analysis using sample-specific string detection in accurate long reads
Format: Pre-recorded with live Q&A

Moderator(s): Birte Kehr

  • Parsoa Khorsand, University of California, Davis, United States
  • Luca Denti, Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, France
  • Paola Bonizzoni, Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, Italy
  • Rayan Chikhi, Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, France
  • Fereydoun Hormozdiari, University of California, Davis, United States

Presentation Overview:

Motivation: Comparative genome analysis of two or more whole-genome-sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in a population, case-control analysis in common disease, and studies of rare disorders. With the current progress of accurate long-read sequencing technologies (e.g., circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g., segmental duplications) and hard-to-detect variants (e.g., complex structural variants).

Results: We propose a novel framework for comparative genome analysis based on the discovery of strings that are specific to one genome ("sample-specific" strings). We have developed an accurate and efficient novel method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach gives us the ability to perform comparative genome analysis without the need to map the reads, and is not hindered by shortcomings of the reference genome. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g., PacBio HiFi data).
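
In the simplest possible caricature, the notion reduces to a k-mer set difference (the real method finds variable-length strings directly from reads and must cope with sequencing errors):

    def kmers(reads, k):
        return {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}

    def sample_specific(reads_a, reads_b, k=4):
        """Strings (here: k-mers) present in sample A but absent from B."""
        return kmers(reads_a, k) - kmers(reads_b, k)

    a = ["ACGTACGGA", "GGACTTAC"]
    b = ["ACGTACGTA", "GGACTTAC"]
    print(sorted(sample_specific(a, b)))    # -> ['ACGG', 'CGGA']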

Availability: The proposed tool is publicly available at https://github.com/Parsoa/PingPong.

11:20-11:40
Polishing Copy Number Variant Calls on Exome Sequencing Data via Deep Learning
Format: Pre-recorded with live Q&A

Moderator(s): Birte Kehr

  • Furkan Ozden, Bilkent University, Turkey
  • Can Alkan, Bilkent University, Turkey
  • A. Ercument Cicek, Bilkent University, Turkey

Presentation Overview:

Accurate and efficient detection of copy number variants (CNVs) is of critical importance due to their significant association with complex genetic diseases. Although algorithms that use whole-genome sequencing (WGS) data provide stable results with mostly valid statistical assumptions, copy number detection on whole-exome sequencing (WES) data shows comparatively lower accuracy. This is unfortunate, as WES data is cost-efficient, compact, and relatively ubiquitous. The bottleneck is primarily due to the non-contiguous nature of the targeted capture: biases in targeted genomic hybridization, GC content, targeting probes, and sample batching during sequencing. Here, we present a novel deep learning model, DECoNT, which uses matched WES and WGS data and learns to correct the copy number variations reported by any off-the-shelf WES-based germline CNV caller. We train DECoNT on the 1000 Genomes Project data, and we show that we can efficiently triple the duplication call precision and double the deletion call precision of the state-of-the-art algorithms. We also show that our model consistently improves the performance independent of (i) sequencing technology, (ii) exome capture kit, and (iii) CNV caller. Using DECoNT as a universal exome CNV call polisher has the potential to improve the reliability of germline CNV detection on WES data sets.

11:40-12:00
Robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing
Format: Pre-recorded with live Q&A

Moderator(s): Birte Kehr

  • Timofey Prodanov, University of California San Diego, United States
  • Vikas Bansal, University of California San Diego, United States

Presentation Overview:

Copy number and sequence variation in more than 150 genes that overlap low-copy repeats (LCRs) is associated with risk for rare and complex human diseases. Such duplicated genes are problematic for standard NGS analysis pipelines, since a large fraction of reads derived from these regions cannot be mapped unambiguously to the genome. We have developed a computational framework, Parascopy, to estimate the total and paralog-specific copy number of genes that overlap LCRs using whole-genome sequencing (WGS) data. Parascopy jointly analyzes reads aligned to a genomic region and its paralogous sequences without relying on read mapping quality, uses a multi-sample hidden Markov model (HMM) to infer aggregate copy number, and leverages an EM algorithm to jointly estimate paralog-specific copy number and identify invariant paralogous sequence variants (PSVs). Analysis of WGS data for 2504 samples from the 1000 Genomes Project and validation using experimental data show that Parascopy outperforms existing methods for several disease-relevant genes such as SMN1/2, RHCE and SRGAP2, can automatically identify invariant PSVs, and can estimate copy number for more than 165 duplicated gene loci for a single human genome in less than 20 minutes.
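
The aggregate copy number idea rests on a simple observation: pooled read depth over all copies of a repeat is informative even when individual reads cannot be placed. A back-of-envelope version (the real method replaces this division with a multi-sample HMM across windows and samples):

    def aggregate_cn(depths_at_copies, haploid_depth):
        """Pool depth across paralogous copies and normalize by the
        background haploid depth; mapping quality is irrelevant because
        a read counts wherever it lands within the repeat family."""
        return round(sum(depths_at_copies) / haploid_depth)

    # Two paralogs at ~30x and ~32x with ~15.5x per haploid copy -> agCN 4.
    print(aggregate_cn([30.2, 31.7], 15.5))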

12:00-12:20
phasebook: haplotype-aware de novo assembly of diploid genomes from long reads
Format: Pre-recorded with live Q&A

Moderator(s): Birte Kehr

  • Xiao Luo, Bielefeld University, Germany
  • Xiongbin Kang, Bielefeld University, Germany
  • Alexander Schönhuth, Bielefeld University, Germany

Presentation Overview:

Haplotype-aware diploid genome assembly is crucial in genomics, precision medicine, and many other disciplines. Long-read sequencing technologies have greatly improved genome assembly thanks to the advantages of their read length. However, current long-read assemblers usually introduce disturbing biases or fail to capture the haplotype diversity of the diploid genome. Here, we present phasebook, a novel approach for reconstructing the haplotypes of diploid genomes from long reads de novo. Benchmarking experiments demonstrate that our method outperforms other approaches in terms of haplotype coverage by large margins, while preserving competitive performance or even achieving advantages in all other aspects relevant to genome assembly.

12:40-13:00
Strainline: full-length de novo viral haplotype reconstruction from noisy long reads
Format: Pre-recorded with live Q&A

Moderator(s): Francisco De La Vega

  • Xiao Luo, Bielefeld University, Germany
  • Alexander Schönhuth, Bielefeld University, Germany
  • Xiongbin Kang, Bielefeld University, Germany

Presentation Overview:

Haplotype-resolved assembly of highly diverse virus genomes is critical for the prevention, control and treatment of viral diseases. Current methods either handle only accurate short reads or collapse haplotype-specific variation. Here, we present Strainline, a novel approach to reconstruct viral haplotypes from noisy long reads. As a crucial novelty, Strainline is the first approach to reconstruct viral haplotypes of RNA virus quasispecies from error-prone long-read data both accurately and at full length. Benchmarking experiments on both simulated and real datasets of varying complexity and diversity confirm this by demonstrating the superiority of Strainline in terms of relevant criteria in comparison with the state of the art.

13:00-13:20
BinSPreader: refine binning results for fuller MAG reconstruction
Format: Pre-recorded with live Q&A

Moderator(s): Francisco De La Vega

  • Yury Kamenev, ITMO University, Russia
  • Roman Kruglikov, Lomonosov Moscow State University, Russia
  • Ivan Tolstoganov, Saint Petersburg State University, Russia
  • Anton Korobeynikov, Saint Petersburg State University, Russia

Presentation Overview:

Despite recent advances in high-throughput sequencing, analysis of the metagenome of a whole microbial population remains a challenge. In particular, metagenome-assembled genomes (MAGs) are often fragmented due to interspecies repeats, uneven coverage and vastly different strain abundance.
MAGs are usually constructed via a dedicated binning process that uses different features of the input data in order to cluster contigs that might belong to the same species. This process has limitations, and binners therefore usually discard contigs that are shorter than several kilobases. As a result, binning of even simple metagenome assemblies can miss a sizeable fraction of contigs, and the resulting MAGs oftentimes lack important conserved sequences.
In this work we present BinSPreader, a novel binning refiner tool that exploits the assembly graph topology and other connectivity information to refine an existing binning, correct binning errors, propagate binning from longer to shorter contigs, and infer contigs belonging to multiple bins. Furthermore, BinSPreader can split the input reads in accordance with the resulting binning, predicting reads that potentially belong to multiple MAGs.
We show that BinSPreader can effectively complete the binning, increasing the completeness of the bins without sacrificing purity, and can predict contigs belonging to several MAGs.
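
A minimal sketch of the propagation idea (assumption: plain majority-vote label propagation over assembly graph adjacency; the actual tool also uses read pairing and supports multi-bin assignments):

    from collections import Counter

    def propagate(bins, edges, rounds=10):
        """bins: contig -> bin label, or None for unbinned short contigs;
        edges: contig -> adjacent contigs in the assembly graph."""
        bins = dict(bins)
        for _ in range(rounds):
            changed = False
            for c, b in bins.items():
                if b is not None:
                    continue
                votes = Counter(bins[n] for n in edges.get(c, ()) if bins[n])
                if votes:
                    bins[c] = votes.most_common(1)[0][0]
                    changed = True
            if not changed:
                break
        return bins

    edges = {"c1": ["c2"], "c2": ["c1", "c3"], "c3": ["c2", "c4"], "c4": ["c3"]}
    print(propagate({"c1": "binA", "c2": None, "c3": None, "c4": "binA"}, edges))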

13:20-13:40
Proceedings Presentation: Haplotype-based membership inference from summary genomic data
Format: Pre-recorded with live Q&A

Moderator(s): Francisco De La Vega

  • Haixu Tang, Indiana University Bloomington, United States
  • Diyue Bu, Indiana University Bloomington, United States
  • Xiaofeng Wang, Indiana University Bloomington, United States

Presentation Overview:

Motivation: The availability of human genomic data, together with the enhanced capacity to process them, is leading to transformative technological advances in biomedical science and engineering. However, the public dissemination of such data has been difficult due to privacy concerns. Specifically, it has been shown that the presence of a human subject in a case group can be inferred from the shared summary statistics of the group, e.g., the allele frequencies, or even the presence/absence of genetic variants (e.g., shared by the Beacon project) in the group. These methods rely on the availability of a second sample, i.e., the DNA profile of a target human subject, and are thus often referred to as membership inference methods.
Results: In this paper, we demonstrate that haplotypes, i.e., sequences of single nucleotide variations (SNVs) showing strong genetic linkage in human genome databases, may be inferred from the summary of genomic data without using a second sample. Furthermore, novel haplotypes that did not appear in the database may be reconstructed solely from the allele frequencies of genomic datasets. These reconstructed haplotypes can be used by a haplotype-based membership inference algorithm to identify target subjects in a case group with greater power than existing methods based on SNVs.

13:40-14:00
Biological discovery and consumer genomics databases activate latent privacy risk in functional genomics data
Format: Pre-recorded with live Q&A

Moderator(s): Francisco De La Vega

  • Steven Brenner, University of California, Berkeley, United States
  • Zhiqiang Hu, University of California, Berkeley, United States

Presentation Overview:

The privacy risks from individuals’ genomes have garnered increasing attention. Recent research studies and forensics have underscored the ability to re-identify a person using genomic-identified relatives and quasi-identifiers, such as sex, birthdate and zip code. However, summary omics data, such as gene expression values and DNA methylation sites, are generally treated as safe to share, with low privacy risks, though research studies have indicated they could be linked to existing genomes. We have demonstrated that some types of summary omics data can be accurately linked to a unique genome. We developed methods to match against genotypes in consumer genealogy databases with their restricted tools. Thus, the theoretical privacy concerns regarding summary omics data are now practically relevant. The ability to link sets of quasi-identifiers can reveal a research participant’s identity and protected health information. Most importantly, such risks increase over time, activated by new techniques, new knowledge, and new databases. Thus, public omics data may become privacy time bombs: safe at the time of distribution, but increasingly likely to compromise personal information. The need to preserve individuals’ genomic privacy for their lifetime and beyond (for descendants and relatives) poses unique challenges to the effective sharing of high-throughput molecular data.

14:20-15:20
HiTSeq Keynote: The perils of contamination in genome databases
Format: Live-stream

Moderator(s): Francisco De La Vega

  • Steven Salzberg


