HiTSeq COSI

Presentations

Schedule subject to change
Wednesday, July 15th
10:40 AM-11:40 AM
HiTSeq Keynote: Genetic and epigenetic maps of human centromeric regions
Format: Live-stream

  • Karen Miga, UC Santa Cruz, United States
12:00 PM-12:20 PM
Proceedings Presentation: The String Decomposition Problem and its Applications to Centromere Analysis and Assembly
Format: Pre-recorded with live Q&A

  • Andrey V. Bzikadze, University of California San Diego, United States
  • Tatiana Dvorkina, Saint Petersburg State University, Saint Petersburg, Russia
  • Pavel A. Pevzner, University of California San Diego, United States

Presentation Overview:

Motivation: Recent attempts to assemble extra-long tandem repeats (such as centromeres) faced the challenge of translating long error-prone reads from the nucleotide alphabet into the alphabet of repeat units. Human centromeres represent a particularly complex type of higher-order repeats (HORs) formed by chromosome-specific monomers. Given a set of all human monomers, translating a read from a centromere into the monomer alphabet is modeled as the String Decomposition Problem. The accurate translation of reads into the monomer alphabet turns centromere assembly into a more tractable problem than the notoriously difficult problem of assembling centromeres in the nucleotide alphabet.

Results: We describe a StringDecomposer algorithm for solving this problem, benchmark it on the set of long error-prone reads generated by the Telomere-to-Telomere consortium, and identify a novel (rare) monomer that extends the set of known X-chromosome specific monomers. Our identification of a novel monomer emphasizes the importance of identifying all (even rare) monomers for future centromere assembly efforts and evolutionary studies. To further analyze novel monomers, we applied StringDecomposer to the set of recently generated long accurate Pacific Biosciences HiFi reads. This analysis revealed that the set of known human monomers and HORs remains incomplete. StringDecomposer opens the possibility of generating a complete set of human monomers and HORs for use in the ongoing efforts to generate the complete assembly of the human genome.

Availability: StringDecomposer is available at https://github.com/ablab/stringdecomposer.
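
To make the String Decomposition Problem concrete, here is a toy dynamic program that translates a read into the monomer alphabet. It is a sketch, not the authors' StringDecomposer: blocks are scored by Hamming distance over windows of exactly the monomer length, whereas the real tool performs edit-distance alignment and tolerates indels; the monomers and read below are invented (real alpha-satellite monomers are ~171 bp).

```python
def decompose(read, monomers):
    """Translate `read` into (monomer index, mismatch count) blocks."""
    n, INF = len(read), float("inf")
    dp = [INF] * (n + 1)      # dp[i]: best mismatch total for read[:i]
    back = [None] * (n + 1)   # back[i]: (previous cut, monomer index)
    dp[0] = 0
    for i in range(1, n + 1):
        for mid, m in enumerate(monomers):
            j = i - len(m)
            if j < 0 or dp[j] == INF:
                continue
            cost = dp[j] + sum(a != b for a, b in zip(read[j:i], m))
            if cost < dp[i]:
                dp[i], back[i] = cost, (j, mid)
    if dp[n] == INF:
        return None           # length not expressible as monomer blocks
    blocks, i = [], n
    while i > 0:
        j, mid = back[i]
        blocks.append((mid, dp[i] - dp[j]))
        i = j
    return blocks[::-1]

monomers = ["ACGTAC", "TTGCAA"]       # two hypothetical 6 bp monomers
read = "ACGTACTTGCAAACGAAC"           # third block is a noisy copy
print(decompose(read, monomers))      # [(0, 0), (1, 0), (0, 1)]
```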

12:20 PM-12:40 PM
Proceedings Presentation: Weighted minimizer sampling improves long read mapping
Format: Pre-recorded with live Q&A

  • Sergey Koren, National Institutes of Health, United States
  • Chirag Jain, National Institutes of Health, United States
  • Arang Rhie, National Institutes of Health, United States
  • Haowen Zhang, Georgia Institute of Technology, United States
  • Claudia Chu, Georgia Institute of Technology, United States
  • Adam Phillippy, National Institutes of Health, United States
  • Brian Walenz, National Institutes of Health, United States

Presentation Overview:

In this era of exponential data growth, minimizer sampling has become a standard algorithmic technique for rapid genome sequence comparison. This technique yields a sub-linear representation of sequences, enabling their comparison in reduced space and time. A key property of the minimizer technique is that if two sequences share a substring of a specified length, then they are guaranteed to share a matching minimizer. However, because the k-mer distribution in eukaryotic genomes is highly uneven, minimizer-based tools (e.g., Minimap2, Mashmap) opt to discard the most frequently occurring minimizers from the genome in order to avoid excessive false positives. By doing so, the underlying guarantee is lost and accuracy is reduced in repetitive genomic regions.

We introduce a novel weighted-minimizer sampling algorithm. A unique feature of the proposed algorithm is that it performs minimizer sampling while taking into account a weight for each k-mer; i.e., the higher the weight of a k-mer, the more likely it is to be selected. By down-weighting frequently occurring k-mers, we are able to meet both objectives: (i) avoid excessive false-positive matches, and (ii) maintain the minimizer match guarantee. We tested our algorithm, Winnowmap, using both simulated and real long-read data and compared it to a state-of-the-art long-read mapper, Minimap2. Our results demonstrate a reduction in the mapping error rate from 0.14% to 0.06% in the recently finished human X chromosome (154.3 Mbp), and from 3.6% to 0% within the highly repetitive X centromere (3.1 Mbp). Winnowmap improves mapping accuracy within repeats and achieves these results with sparser sampling, leading to better index compression and competitive runtimes.
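
A minimal sketch of one such weighted ordering (my simplification of the idea, not Winnowmap's implementation): k-mers flagged as frequent rank after all non-frequent ones, so a window selects one only when it has no alternative, suppressing repeats while keeping the window guarantee. The hash and example data are arbitrary.

```python
import hashlib

def h(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def weighted_minimizers(seq, k, w, frequent):
    """Return sorted (position, k-mer) minimizers of `seq`."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Ordering key (is_frequent, hash) demotes repetitive k-mers.
    key = lambda i: (kmers[i] in frequent, h(kmers[i]))
    picked = {min(range(s, s + w), key=key)
              for s in range(len(kmers) - w + 1)}
    return sorted((i, kmers[i]) for i in picked)

seq = "ACGTACGTACGTTTGCA"
print(weighted_minimizers(seq, k=5, w=4, frequent={"ACGTA", "CGTAC"}))
```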

12:40 PM-1:00 PM
Proceedings Presentation: TandemTools: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats
Format: Pre-recorded with live Q&A

  • Alla Mikheenko, Saint Petersburg State University, Russia
  • Andrey Bzikadze, University of California San Diego, United States
  • Alexey Gurevich, Center for Algorithmic Biotechnology, St. Petersburg State University, Russia
  • Karen Miga, UC Santa Cruz Genomics Institute, University of California Santa Cruz, United States
  • Pavel Pevzner, University of California San Diego, United States

Presentation Overview:

Extra-long tandem repeats (ETRs) are widespread in eukaryotic genomes and play an important role in fundamental cellular processes, such as chromosome segregation. Although emerging long-read technologies have enabled ETR assemblies, the accuracy of such assemblies is difficult to evaluate since there are no tools for their quality assessment. Moreover, since the mapping of error-prone reads to ETRs remains an open problem, it is not clear how to polish draft ETR assemblies. To address these problems, we developed the TandemTools package, which includes the TandemMapper tool for mapping reads to ETRs and the TandemQUAST tool for polishing ETR assemblies and assessing their quality. We demonstrate that TandemTools not only reveals errors in ETR assemblies but also improves the recently generated assemblies of human centromeres.

2:00 PM-2:20 PM
PopDel detects large deletions jointly in tens of thousands of genomes
Format: Pre-recorded with live Q&A

  • Sebastian Niehus, Berlin Institute of Health (BIH) / Charité – Universitätsmedizin Berlin, Germany
  • Janina Schoenberger, Berlin Institute of Health (BIH) / Charité – Universitätsmedizin Berlin, Germany
  • Hákon Jónsson, deCODE genetics/Amgen Inc, Iceland
  • Eythór Björnsson, deCODE genetics/Amgen Inc, Iceland
  • Doruk Beyter, deCODE genetics/Amgen Inc, Iceland
  • Hannes P. Eggertsson, deCODE genetics/Amgen Inc, Iceland
  • Patrick Sulem, deCODE genetics/Amgen Inc, Iceland
  • Kári Stefánsson, deCODE genetics/Amgen Inc, Iceland
  • Bjarni V. Halldórsson, deCODE genetics/Amgen Inc, Iceland
  • Birte Kehr, Berlin Institute of Health (BIH) / Charité – Universitätsmedizin Berlin, Germany

Presentation Overview:

Catalogs of genetic variation for large numbers of individuals are a foundation for modern research on human diversity and disease. Creating such catalogs for small variants from whole-genome sequencing (WGS) data is now commonly done for thousands of individuals collectively. We have transferred this joint calling idea from SNPs and indels to larger deletions and developed the first joint calling tool, PopDel, that can detect and genotype deletions in WGS data of tens of thousands of individuals simultaneously, as demonstrated by our evaluation on data of up to 49,962 human genomes. Extensive tests on simulated and real data, and comparisons to other state-of-the-art SV callers, demonstrate good sensitivity, precision, and genotype correctness. PopDel detects deletions in HG002 and NA12878 with high sensitivity while maintaining a low false-positive rate, as shown by our comparison with different high-confidence reference sets. On data of up to 6,794 trios, inheritance patterns are in concordance with Mendelian inheritance rules and exhibit a close-to-ideal transmission rate. PopDel reliably reports common, rare, and de novo deletions. It therefore enables routine scans for deletions in large-scale sequencing studies; we are currently implementing the detection of other SV types.

2:20 PM-2:40 PM
Metalign: Efficient alignment-based metagenomic profiling via containment min hash
Format: Pre-recorded with live Q&A

  • Nathan Lapierre, University of California, Los Angeles, United States
  • Mohammed Alser, ETH Zurich, Switzerland
  • Eleazar Eskin, University of California, Los Angeles, United States
  • David Koslicki, Pennsylvania State University, United States
  • Serghei Mangul, University of Southern California, United States

Presentation Overview:

Whole-genome shotgun sequencing enables the analysis of microbial communities in unprecedented detail, with important implications in medicine and ecology. Predicting the presence and relative abundances of microbes in a sample, known as “metagenomic profiling”, is a critical step in microbiome analysis. Existing profiling methods have been shown to suffer from high false-positive or false-negative rates, while alignment-based approaches are often considered accurate but computationally infeasible. Here we present a novel method, Metalign, that addresses these concerns by performing efficient alignment-based metagenomic profiling.
Metalign employs a high-speed, high-recall pre-filtering method based on the mathematical concept of Containment Min Hash to reduce the reference database size dramatically before alignment, followed by a method to estimate organism relative abundances in the sample by handling reads aligned to multiple genomes. We show that Metalign achieves significantly improved results over existing methods on simulated datasets from a large benchmarking study, CAMI, and performs well on in vitro mock community data and environmental data from the Tara Oceans project. Metalign is freely available at https://github.com/nlapier2/Metalign, and via bioconda.
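
The containment estimate behind the pre-filter is easy to sketch. Below is a conceptual version using a bottom-s MinHash sketch and exact sets; Metalign's pipeline instead uses the CMash implementation with Bloom filters, so treat this as an illustration only. Reference genomes whose estimated containment in the sample exceeds a threshold would be kept for the alignment stage.

```python
import hashlib

def h(kmer):
    """Deterministic 64-bit hash of a k-mer."""
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def containment(ref_kmers, sample_kmers, s=64):
    """Estimate |ref ∩ sample| / |ref| from a bottom-s sketch of ref."""
    sketch = sorted(h(x) for x in ref_kmers)[:s]   # bottom-s MinHash
    sample_hashes = {h(x) for x in sample_kmers}
    return sum(v in sample_hashes for v in sketch) / len(sketch)

ref = kmers("ACGTACGTTTGCAACGGA", 8)
sample = kmers("ACGTACGTTTGCAACGGA" + "T" * 12, 8)
print(containment(ref, sample, s=8))   # 1.0: reference fully contained
```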

2:40 PM-3:00 PM
META^2: Memory-efficient taxonomic classification and abundance estimation for metagenomics with deep learning
Format: Pre-recorded with live Q&A

  • Gunnar Rätsch, ETH Zürich, Switzerland
  • Andreas Georgiou, ETH Zürich, Switzerland
  • Vincent Fortuin, ETH Zürich, Switzerland
  • Harun Mustafa, ETH Zürich, Switzerland

Presentation Overview:

Taxonomic classification is an important step in the analysis of samples found in metagenomic studies. Conventional mapping-based methods trade off between high memory and low recall, with recent deep learning methods suffering from very large model sizes. We aim to develop a more memory-efficient technique for taxonomic classification. A task of particular interest is abundance estimation. Current methods initially classify reads independently and are agnostic to the co-occurrence patterns between taxa. In this work, we also attempt to take these patterns into account. We develop a novel memory-efficient read classification technique, combining deep learning and locality-sensitive hashing. We show that this approach outperforms conventional mapping-based and other deep learning methods for taxonomic classification when restricting all methods to a fixed memory footprint. Moreover, we formulate the task of abundance estimation as a Multiple Instance Learning problem and we extend current deep learning architectures with two types of permutation-invariant MIL pooling layers: a) deepsets and b) attention-based pooling. We illustrate that our architectures can exploit the co-occurrence of species in metagenomic read sets and outperform the single-read architectures in predicting the distribution over taxa at higher taxonomic ranks.
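
The two pooling variants named above are small, permutation-invariant operations; the sketch below shows only that pooling step on synthetic per-read embeddings (shapes and parameters are invented, not the paper's architectures). Permuting the reads leaves both outputs unchanged, which is the property that makes the bag-level prediction well defined.

```python
import numpy as np

def deepsets_pool(H):
    """DeepSets-style pooling: average the per-read embeddings."""
    return H.mean(axis=0)

def attention_pool(H, V, w):
    """Attention pooling: softmax(w^T tanh(V h_i)) weights per read."""
    scores = w @ np.tanh(V @ H.T)          # one score per read
    a = np.exp(scores - scores.max())
    a /= a.sum()                           # softmax over the bag
    return a @ H                           # weighted mean embedding

rng = np.random.default_rng(0)
H = rng.normal(size=(1000, 32))            # 1,000 reads, 32-dim embeddings
V, w = rng.normal(size=(16, 32)), rng.normal(size=16)
print(deepsets_pool(H).shape, attention_pool(H, V, w).shape)  # (32,) (32,)
```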

3:20 PM-3:40 PM
Error correction enables use of Oxford Nanopore technology for reference-free transcriptome analysis
Format: Pre-recorded with live Q&A

  • Kristoffer Sahlin, Department of Mathematics, Stockholm University, Sweden
  • Botond Sipos, Oxford Nanopore Technologies Ltd., United Kingdom
  • Phillip James, Oxford Nanopore Technologies Ltd., United Kingdom
  • Daniel Turner, Oxford Nanopore Technologies Ltd., United Kingdom
  • Paul Medvedev, The Pennsylvania State University, United States

Presentation Overview:

Oxford Nanopore (ONT) is a leading long-read technology that has been revolutionizing transcriptome analysis through its capacity to sequence the majority of transcripts from end to end. This has greatly increased our ability to study the diversity of transcription mechanisms such as transcription initiation, termination, and alternative splicing. However, ONT still suffers from high error rates, which have thus far limited its scope to reference-based analyses. When a reference is not available or is not a viable option due to reference bias, error correction is a crucial step towards the reconstruction of the sequenced transcripts and downstream sequence analysis of transcripts. In this paper, we present a novel computational method, called isONcorrect, to error-correct ONT cDNA sequencing data. isONcorrect is able to jointly use all isoforms from a gene during error correction, thereby allowing it to correct reads at low sequencing depths. We are able to obtain an accuracy of 98.9-99.6%, demonstrating the feasibility of applying cost-effective full-length cDNA sequencing for reference-free transcriptome analysis.

3:40 PM-4:00 PM
Reference-guided transcript discovery and quantification for long read RNA-Seq data
Format: Pre-recorded with live Q&A

  • Ploy Pratanwanich, Genome Institute of Singapore, Singapore
  • Ying Chen, Genome Institute of Singapore, Singapore
  • Fei Yao, Genome Institute of Singapore, Singapore
  • Boon Hsi Sarah Ng, Genome Institute of Singapore, Singapore
  • Wee Siong Sho Goh, Genome Institute of Singapore, Singapore
  • Jonathan Göke, Genome Institute of Singapore, Singapore
  • Yuk Kei Wan, Genome Institute of Singapore, Singapore
  • Hwee Meng Low, Genome Institute of Singapore, Singapore
  • Viktoriia Iakovleva, Genome Institute of Singapore, Singapore
  • Lixia Xin, Duke NUS Medical School, Singapore
  • Puay Leng Lee, Genome Institute of Singapore, Singapore
  • Qiang Yu, Genome Institute of Singapore, Singapore
  • Torsten Wüstefeld, Genome Institute of Singapore, Singapore

Presentation Overview:

Transcriptome profiling is one of the most frequently used technologies and is key to interpreting the function of the genome in human diseases. However, quantification of transcript expression with short-read RNA sequencing remains challenging, as different transcripts from the same gene are often highly similar. Nanopore RNA sequencing reduces the complexity of transcriptome profiling with ultra-long reads that can cover the full length of the isoforms. However, the technology has a high sequencing error rate and often generates shorter, fragmented reads due to RNA degradation, and currently no transcript quantification method exists specifically for such data. Here, we present bambu, a long-read isoform discovery and quantification method. Bambu performs probabilistic assignment of reads to annotated and novel transcripts across samples to improve the accuracy of transcript expression estimates. We apply our method to cancer cell line data with spike-in controls and compare the results with estimates obtained from short-read data. Bambu recovered annotated isoforms from spike-ins and showed consistency in gene expression estimation with existing methods for short-read RNA sequencing data, but improved accuracy in transcript expression estimation. The method is implemented in R (https://github.com/GoekeLab/bambu), enabling simple, fast, and accurate analysis of long-read transcriptome profiling data.
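
Probabilistic read assignment of this kind is classically solved with an EM iteration; the toy below (Python for brevity, although bambu is an R package) shows the general technique, not bambu's model, which additionally handles novel isoforms and degraded reads. Equal alignment weights per compatible transcript are an assumption here.

```python
import numpy as np

def em_abundance(compat, n_transcripts, iters=100):
    """compat[r]: array of transcript ids read r aligns to equally well."""
    theta = np.full(n_transcripts, 1.0 / n_transcripts)
    for _ in range(iters):
        counts = np.zeros(n_transcripts)
        for txs in compat:                 # E-step: fractional assignment
            counts[txs] += theta[txs] / theta[txs].sum()
        theta = counts / counts.sum()      # M-step: renormalize
    return theta

# Reads 0-2 are ambiguous between tx 0 and 1; reads 3-5 map uniquely
# to tx 1; read 6 maps uniquely to tx 2.
compat = [np.array([0, 1])] * 3 + [np.array([1])] * 3 + [np.array([2])]
print(em_abundance(compat, 3).round(3))    # [0.    0.857 0.143]
```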

4:00 PM-4:20 PM
Improving RNA-seq mapping and haplotype-specific transcript inference using variation graphs
Format: Pre-recorded with live Q&A

  • Jonas A. Sibbesen, University of California Santa Cruz, United States
  • Jordan Eizenga, University of California Santa Cruz, United States
  • Benedict Paten, University of California Santa Cruz, United States

Presentation Overview:

Current methods for analyzing RNA-seq data are generally based on first mapping the reads to a reference genome or a known set of reference transcripts. However, this approach can bias read mappings toward the reference, which negatively affects downstream analyses such as haplotype-specific expression quantification. One way to mitigate this reference bias is to use variation graphs, which contain both the primary reference and known genetic variants. For RNA-seq data specifically, variation graphs can also be augmented with splice junctions and haplotype-specific transcripts can be embedded as paths. In this work, we introduce a pipeline based on the variation graph (vg) toolkit for both mapping RNA-seq data to spliced variation graphs and inferring the expression of known haplotype-specific transcripts from the mapped reads. We demonstrate that spliced variation graphs reduce reference bias and show that vg improves mapping of RNA-seq data compared to other mapping algorithms. We also demonstrate that our novel method, rpvg, can accurately estimate expression among millions of haplotype-specific transcripts derived from the GENCODE transcript annotation and the haplotypes from the 1000 Genomes Project.

4:20 PM-4:40 PM
Proceedings Presentation: Terminus enables the discovery of data-driven, robust transcript groups from RNA-seq data
Format: Pre-recorded with live Q&A

  • Rob Patro, University of Maryland, United States
  • Avi Srivastava, Stony Brook University, United States
  • Hirak Sarkar, University of Maryland, United States
  • Hector Corrada Bravo, University of Maryland, United States
  • Michael I. Love, University of North Carolina at Chapel Hill, United States

Presentation Overview:

Motivation: Advances in sequencing technology, inference algorithms and differential testing methodology have enabled transcript-level analysis of RNA-seq data. Yet, the inherent inferential uncertainty in transcript-level abundance estimation, even among the most accurate approaches, means that robust transcript-level analysis often remains a challenge. Conversely, gene-level analysis remains a common and robust approach for understanding RNA-seq data, but it coarsens the resulting analysis to the level of genes, even if the data strongly support specific transcript-level effects.

Results: We introduce a new data-driven approach for grouping together transcripts in an experiment based on their inferential uncertainty. Transcripts that share large numbers of ambiguously mapping fragments with other transcripts, in complex patterns, often cannot have their abundances confidently estimated. Yet, the total transcriptional output of that group of transcripts will have greatly reduced inferential uncertainty, thus allowing more robust and confident downstream analysis. Our approach, implemented in the tool terminus, groups together transcripts in a data-driven manner, allowing transcript-level analysis where it can be confidently supported and deriving transcriptional groups where the inferential uncertainty is too high to support a transcript-level result.

Availability: Terminus is implemented in Rust, and is freely available and open source. It can be obtained from https://github.com/COMBINE-lab/Terminus.
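
One way to build such groups, sketched below under strong assumptions: union transcripts whenever the fraction of ambiguously mapping fragments they share exceeds a cutoff. This conveys the general flavor of the idea, not terminus' algorithm, which works with posterior inferential replicates; the equivalence classes and cutoff are invented.

```python
class DSU:
    """Union-find over transcript ids."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def group_transcripts(n_tx, eq_classes, cutoff=0.3):
    """eq_classes: list of (tuple of transcript ids, fragment count)."""
    total = [0] * n_tx              # fragments touching each transcript
    shared = {}                     # (i, j) -> fragments shared by i, j
    for txs, count in eq_classes:
        for t in txs:
            total[t] += count
        for i in txs:
            for j in txs:
                if i < j:
                    shared[(i, j)] = shared.get((i, j), 0) + count
    dsu = DSU(n_tx)
    for (i, j), c in shared.items():
        if c / min(total[i], total[j]) > cutoff:
            dsu.union(i, j)
    groups = {}
    for t in range(n_tx):
        groups.setdefault(dsu.find(t), []).append(t)
    return list(groups.values())

# tx 0 and 1 share most of their fragments; tx 2 is mostly unique.
eq = [((0, 1), 80), ((0,), 10), ((1,), 10), ((2,), 50), ((1, 2), 5)]
print(group_transcripts(3, eq))   # [[0, 1], [2]]
```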

5:00 PM-6:00 PM
HiTSeq Keynote: Cell-type specific isoform expression of coding and non-coding genes
Format: Live-stream

  • Hagen Tilgner, Brain and Mind Research Institute, Weill Cornell Medicine, United States
Thursday, July 16th
10:40 AM-11:40 AM
HiTSeq Keynote: Genomics 2020: Real-time Nanopore signal mapping, learned index structures, and comparisons across hundreds of thousands of genomes
Format: Live-stream

  • Michael Schatz, Johns Hopkins University, United States
12:00 PM-12:20 PM
Proceedings Presentation: Identification of Conserved Evolutionary Trajectories in Tumors
Format: Pre-recorded with live Q&A

  • Cenk Sahinalp, National Cancer Institute, NIH, United States
  • Colin Collins, University of British Columbia, Canada
  • Salem Malikic, Indiana University Bloomington, United States
  • Ermin Hodzic, Simon Fraser University, Canada
  • Raunak Shrestha, University of California, San Francisco, United States
  • Kevin Litchfield, Francis Crick Institute, United Kingdom
  • Samra Turajlic, Francis Crick Institute, United Kingdom

Presentation Overview:

As multi-region, time-series, and single cell sequencing data become more widely available, it is becoming clear that certain tumors share evolutionary characteristics with others. In the last few years, several computational methods have been developed with the goal of inferring the subclonal composition and evolutionary history of tumors from tumor biopsy sequencing data. However, the phylogenetic trees that they report differ significantly between tumors (even those with similar characteristics).

In this paper, we present a novel combinatorial optimization method, CONETT, for the detection of recurrent tumor evolution trajectories. Our method constructs a consensus tree of conserved evolutionary trajectories based on information about the temporal order of alteration events in a set of tumors. We apply our method to previously published datasets of 100 clear-cell renal cell carcinoma and 99 non-small-cell lung cancer patients and identify both conserved trajectories that were reported in the original studies and new trajectories.

12:20 PM-12:40 PM
Proceedings Presentation: Identifying tumor clones in sparse single-cell mutation data
Format: Pre-recorded with live Q&A

  • Benjamin Raphael, Princeton University, United States
  • Matthew Myers, Princeton University, United States
  • Simone Zaccaria, Princeton University, United States

Presentation Overview:

Motivation: Recent single-cell DNA sequencing technologies enable whole-genome sequencing of hundreds to thousands of individual cells. However, these technologies have ultra-low sequencing coverage (<0.5x per cell), which has limited their use to the analysis of large copy-number aberrations (CNAs) in individual cells. While CNAs are useful markers in cancer studies, single-nucleotide mutations are equally important, both in cancer studies and in other applications. However, ultra-low coverage sequencing yields single-nucleotide mutation data that is too sparse for current single-cell analysis methods.

Results: We introduce SBMClone, a method to infer clusters of cells, or clones, that share groups of somatic single-nucleotide mutations. SBMClone uses a stochastic block model to overcome sparsity in ultra-low coverage single-cell sequencing data, and we show that SBMClone accurately infers the true clonal composition on simulated datasets with coverage as low as 0.2x. We applied SBMClone to single-cell whole-genome sequencing data from two breast cancer patients obtained using two different sequencing technologies. On the first patient, sequenced using the 10X Genomics CNV Solution with sequencing coverage 0.03x, SBMClone recovers the major clonal composition when incorporating a small amount of additional information. On the second patient, where pre- and post-treatment tumor samples were sequenced using DOP-PCR with sequencing coverage 0.5x, SBMClone shows that tumor cells are present in the post-treatment sample, contrary to published analysis of this dataset.
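
The block structure being recovered is easy to visualize on synthetic data. As a rough stand-in for fitting a stochastic block model, the sketch below applies scikit-learn's spectral co-clustering to a sparse cell-by-mutation matrix; the data, sparsity level, and choice of method are all illustrative rather than SBMClone's inference.

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering

rng = np.random.default_rng(1)
truth = np.zeros((100, 200), dtype=int)   # 100 cells x 200 mutations
truth[:50, :100] = 1                      # clone A carries block A
truth[50:, 100:] = 1                      # clone B carries block B
observed = truth * (rng.random(truth.shape) < 0.1)  # ~90% entries lost
observed = observed[:, observed.sum(axis=0) > 0]    # drop empty sites

model = SpectralCoclustering(n_clusters=2, random_state=0).fit(observed)
# Up to a label swap, cells 0-49 and 50-99 get different row labels.
print(model.row_labels_[:50].mean(), model.row_labels_[50:].mean())
```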

12:40 PM-1:00 PM
Proceedings Presentation: PhISCS-BnB: A Fast Branch and Bound Algorithm for the Perfect Tumor Phylogeny Reconstruction Problem
Format: Pre-recorded with live Q&A

  • S. Cenk Sahinalp, National Cancer Institute, National Institutes of Health, United States
  • E. Michael Gertz, National Cancer Institute, National Institutes of Health, United States
  • Alejandro A. Schäffer, National Cancer Institute, National Institutes of Health, United States
  • Erfan Sadeqi Azer, Indiana University, United States
  • Salem Malikic, Indiana University, United States
  • Farid Rashidi Mehrabadi, Indiana University, National Cancer Institute, National Institutes of Health, United States
  • Xuan Cindy Li, University of Maryland, National Cancer Institute, National Institutes of Health, United States
  • Chi-Ping Day, National Cancer Institute, National Institutes of Health, United States
  • Eva Pérez-Guijarro, National Cancer Institute, National Institutes of Health, United States
  • Kerrie Marie, National Cancer Institute, National Institutes of Health, United States
  • Maxwell P. Lee, National Cancer Institute, National Institutes of Health, United States
  • Glenn Merlino, National Cancer Institute, National Institutes of Health, United States
  • Funda Ergun, Indiana University, United States
  • Kevin Litchfield, Cancer Evolution and Genome Instability Laboratory, Lung Cancer Centre of Excellence, United Kingdom
  • Osnat Bartok, Weizmann Institute of Science, Israel
  • Ronen Levy, Weizmann Institute of Science, Israel
  • Yardena Samuels, Weizmann Institute of Science, Israel

Presentation Overview:

Motivation: Recent advances in single-cell sequencing (SCS) offer an unprecedented insight into tumor emergence and evolution. Principled approaches to tumor phylogeny reconstruction via SCS data are typically based on general computational methods for solving an integer linear program (ILP) or a constraint satisfaction program (CSP), which, although guaranteeing convergence to the most likely solution, are very slow. Others, based on Markov chain Monte Carlo (MCMC) or alternative heuristics, not only offer no such guarantee but are also not faster in practice. As a result, novel methods that can scale up to handle the size and noise characteristics of emerging SCS data are highly desirable to fully utilize this technology.
Results: We introduce PhISCS-BnB, a branch and bound algorithm to compute the most likely perfect phylogeny (PP) on an input genotype matrix extracted from an SCS data set. PhISCS-BnB not only offers an optimality guarantee, but is also 10 to 100 times faster than the best available methods on simulated tumor SCS data. We also applied PhISCS-BnB to a recently published large melanoma data set derived from the sub-lineages of a cell line, involving 20 clones with 2367 mutations, which returned the optimal tumor phylogeny in less than 4 hours. The resulting phylogeny agrees with and extends the published results by providing a more detailed picture of the clonal evolution of the tumor.
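
The feasibility test at the heart of perfect-phylogeny search is compact: a complete, noise-free binary genotype matrix admits a perfect phylogeny exactly when no pair of columns contains all three row patterns (0,1), (1,0) and (1,1). The checker below implements that test; branch-and-bound methods of this kind search for a minimum set of bit flips that removes every such conflict (the matrices here are invented).

```python
from itertools import combinations

def is_conflict_free(matrix):
    """matrix[i][j] = 1 iff cell i carries mutation j (no missing data)."""
    for p, q in combinations(range(len(matrix[0])), 2):
        pairs = {(row[p], row[q]) for row in matrix}
        if {(0, 1), (1, 0), (1, 1)} <= pairs:   # three-gametes conflict
            return False
    return True

clean    = [[1, 0, 0], [1, 1, 0], [1, 0, 1]]    # tree-like: passes
conflict = [[0, 1, 0], [1, 0, 0], [1, 1, 0]]    # columns 0,1 conflict
print(is_conflict_free(clean), is_conflict_free(conflict))  # True False
```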

3:20 PM-3:40 PM
Single-cell copy number lineage tracing enabling gene discovery
Format: Pre-recorded with live Q&A

  • Ken Chen, The University of Texas MD Anderson Cancer Center, United States
  • Fang Wang, The University of Texas MD Anderson Cancer Center, United States
  • Qihan Wang, Rice University, United States
  • Vakul Mohanty, The University of Texas MD Anderson Cancer Center, United States
  • Shaoheng Liang, The University of Texas MD Anderson Cancer Center, United States
  • Jinzhuang Dou, The University of Texas MD Anderson Cancer Center, United States
  • Jincheng Han, The University of Texas MD Anderson Cancer Center, United States
  • Darlan Minussi, The University of Texas MD Anderson Cancer Center, United States
  • Ruli Gao, Houston Methodist Research Institute, United States
  • Li Ding, Washington University School of Medicine, United States
  • Nicholas Navin, The University of Texas MD Anderson Cancer Center, United States

Presentation Overview:

Aneuploidy plays critical roles in genome evolution. Alleles whose dosages affect the fitness of an ancestor will have altered frequencies in the descendant populations upon perturbation. Single-cell sequencing enables comprehensive genome-wide copy number profiling of thousands of cells at various evolutionary stages and lineages, making it possible to discover dosage effects that are invisible at the tissue level, provided that the cell lineages can be accurately reconstructed.
Here, we present the Minimal Event Distance Aneuploidy Lineage Tree (MEDALT) algorithm, which infers the evolutionary history of a cell population based on single-cell copy number (SCCN) profiles. We also present a statistical routine named lineage speciation analysis, which facilitates discovery of fitness-associated alterations and genes from SCCN lineage trees.
We assessed our approaches using a variety of single-cell datasets. Overall, MEDALT appeared more accurate than phylogenetics approaches in reconstructing copy number lineages. From the single-cell DNA-sequencing data of 20 triple-negative breast cancer patients, our approaches effectively prioritized genes that are essential for breast cancer cell fitness and are predictive of patient survival, including those implicating convergent evolution. Similar benefits were observed when applying our approaches to single-cell RNA sequencing data obtained from cancer patients.
The source code of our study is available at https://github.com/KChen-lab/MEDALT.
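
A simplified version of the minimal event distance can be computed in closed form when every event adds or subtracts one copy over a contiguous segment: gains and losses are handled separately, and each costs the number of times the required change steps up along the profile. This ignores MEDALT's handling of copy number zero and of whole-chromosome events, so it illustrates the distance's flavor, not the paper's exact definition.

```python
def min_event_distance(a, b):
    """Fewest contiguous +1/-1 segment events turning profile a into b."""
    d = [y - x for x, y in zip(a, b)]
    cost = 0
    for sign in (1, -1):               # gains, then losses, independently
        prev = 0
        for v in d:
            v = max(sign * v, 0)
            cost += max(v - prev, 0)   # a new event must start here
            prev = v
    return cost

# One segmental gain over positions 1-3 plus one loss at position 4.
print(min_event_distance([2, 2, 2, 2, 2], [2, 3, 3, 3, 1]))   # 2
```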

3:40 PM-4:00 PM
Proceedings Presentation: Artificial-Cell-Type Aware Cell Type Classification in CITE-seq
Format: Pre-recorded with live Q&A

  • Qiuyu Lian, Tsinghua University, China
  • Hongyi Xin, University of Pittsburgh, United States
  • Jianzhu Ma, Purdue University, United States
  • Liza Konnikova, University of Pittsburgh, United States
  • Wei Chen, University of Pittsburgh, United States
  • Jin Gu, Tsinghua University, China
  • Kong Chen, University of Pittsburgh, United States

Presentation Overview:

Motivation: Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) couples the measurement of surface marker proteins with simultaneous sequencing of mRNA at the single-cell level, bringing accurate cell surface phenotyping to single-cell transcriptomics. Unfortunately, multiplets in CITE-seq datasets create artificial cell types and complicate the automation of cell surface phenotyping.
Results: We propose CITE-sort, an artificial-cell-type aware surface marker clustering method for CITE-seq. CITE-sort is aware of and is robust to multiplet-induced artificial cell types. We benchmarked CITE-sort with real and simulated CITE-seq datasets and compared CITE-sort against canonical clustering methods. We show that CITE-sort produces the best clustering performance across the board. CITE-sort not only accurately identifies real biological cell types but also consistently and reliably separates multiplet-induced artificial-cell-type droplet clusters from real biological-cell-type droplet clusters. In addition, CITE-sort organizes its clustering process with a binary tree, which facilitates easy interpretation and verification of its clustering results and simplifies cell type annotation with domain knowledge in CITE-seq.

4:00 PM-4:20 PM
Proceedings Presentation: Hopper: A Mathematically Optimal Algorithm for Sketching Biological Data
Format: Pre-recorded with live Q&A

  • Bonnie Berger, Massachusetts Institute of Technology, United States
  • Benjamin DeMeo, Massachusetts Institute of Technology, United States

Presentation Overview:

Single-cell RNA-sequencing (scRNA-seq) has grown massively in scale since its inception, presenting substantial analytic and computational challenges. Even simple downstream analyses, such as dimensionality reduction and clustering, require days of runtime and hundreds of gigabytes of memory for today's largest datasets. In addition, current methods often favor common cell types and miss salient biological features captured by small cell populations. Here we present Hopper, a single-cell toolkit that both speeds up the analysis of single-cell datasets and highlights their transcriptional diversity by intelligent subsampling, or sketching. Hopper realizes the optimal polynomial-time approximation of the Hausdorff distance between the full and downsampled dataset, ensuring that each cell is well-represented by some cell in the sample. Unlike prior sketching methods, Hopper adds points iteratively and allows for additional sampling from regions of interest, enabling fast and targeted multi-resolution analyses.
In a dataset of over 1.3 million mouse brain cells, we detect a cluster of just 64 inflammatory macrophages (0.004% of the full dataset) from a Hopper sketch containing just 5,000 cells, along with several other small but biologically interesting immune cell populations that are invisible to analysis of the full data. On an even larger dataset consisting of ~2 million developing mouse organ cells, we show even representation of important cell types at small sketch sizes, in contrast with prior sketching methods. By condensing the transcriptional information encoded in large datasets, Hopper grants the individual user with a laptop the same analytic capabilities as a large consortium.
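
The minimax objective Hopper targets is classically approximated by greedy farthest-first traversal, which repeatedly adds the worst-represented cell; the sketch below shows that baseline on synthetic data (Hopper itself adds partitioning to scale beyond this quadratic-time version). Note how the rare, far-away subpopulation is guaranteed representation.

```python
import numpy as np

def farthest_first_sketch(X, size, seed=0):
    """Greedy k-center / farthest-first sampling of `size` rows of X."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)  # to nearest chosen
    while len(chosen) < size:
        nxt = int(dist.argmax())                     # worst-covered cell
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(chosen), float(dist.max())       # sketch, Hausdorff

rng = np.random.default_rng(1)
# 10,000 "cells": one big cluster plus a 0.5% subpopulation far away.
X = np.vstack([rng.normal(0, 1, (9950, 20)), rng.normal(8, 1, (50, 20))])
sketch, hausdorff = farthest_first_sketch(X, size=100)
print((sketch >= 9950).sum())    # several cells of the rare population
```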

4:40 PM-5:00 PM
Proceedings Presentation: Mutational Signature Learning with Supervised Negative Binomial Non-Negative Matrix Factorization
Format: Pre-recorded with live Q&A

  • Xinrui Lyu, Department of Computer Science, ETH Zürich, Switzerland
  • Jean Garret, Department of Mathematics, ETH Zürich, Switzerland
  • Gunnar Rätsch, Department of Computer Science, ETH Zürich, Switzerland
  • Kjong-Van Lehmann, Department of Computer Science, ETH Zürich, Switzerland

Presentation Overview:

Motivation: Understanding the underlying mutational processes of cancer patients has been a long-standing goal in the community and promises to provide new insights that could improve cancer diagnoses and treatments. Mutational signatures are summaries of the mutational processes, and improving the derivation of mutational signatures can yield new discoveries previously obscured by technical and biological confounders. Existing mutational signature extraction methods depend on the size of the available patient cohort and solely focus on the analysis of mutation count data without exploiting available metadata.
Results: Here we present a supervised method that utilizes cancer type as metadata to extract more distinctive signatures. More specifically, we use a Negative Binomial Non-Negative Matrix Factorization and add a Support Vector Machine loss. We show that mutational signatures extracted by our proposed method have a lower reconstruction error and are designed to be more predictive of cancer type than those generated by unsupervised methods. This design reduces the need for elaborate post-processing strategies to recover most of the known signatures, unlike existing unsupervised signature extraction methods. Signatures extracted by a supervised model used in conjunction with cancer type labels are also more robust, especially for small and potentially cancer-type-limited patient cohorts. Finally, we adapted our model such that molecular features can be utilized to derive corresponding mutational signatures. We used APOBEC expression and MUTYH mutation status to demonstrate the possibilities that arise from this ability. We conclude that our method, which exploits available metadata, improves the quality of mutational signatures and helps derive more interpretable representations.
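
One plausible way to write the stated objective, for the binary-label case (my reconstruction from the abstract, not the paper's notation): X is the mutation-count matrix, W the signatures, H the per-patient exposures, α the negative binomial dispersion, y_p ∈ {−1, +1} the cancer-type label of patient p with exposure column h_p, and λ trades reconstruction against separability.

```latex
% Assumed formalization: NB-NMF reconstruction plus an SVM hinge loss
% on the exposures; beta is the linear classifier being co-trained.
\min_{W \ge 0,\; H \ge 0,\; \beta}\;
  -\log \Pr_{\mathrm{NB}}\!\left(X \mid W H,\ \alpha\right)
  \;+\; \lambda \sum_{p} \max\!\left(0,\ 1 - y_p\, \beta^{\top} h_p\right)
```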

5:00 PM-5:20 PM
Proceedings Presentation: Improved Design and Analysis of Practical Minimizers
Format: Pre-recorded with live Q&A

  • Carl Kingsford, Carnegie Mellon University, United States
  • Guillaume Marçais, Carnegie Mellon University, United States
  • Hongyu Zheng, Carnegie Mellon University, United States

Presentation Overview:

Minimizers are methods to sample k-mers from a sequence, with the guarantee that similar sequences yield similar sets of selected k-mers. A minimizer scheme is parameterized by the k-mer length k, a window length w, and an ordering on the k-mers. Minimizers are used in many software tools and pipelines to improve computational efficiency and decrease memory usage. Despite the method’s popularity, many theoretical questions regarding its performance remain open. The core metric for measuring the performance of a minimizer is its density, which measures the sparsity of sampled k-mers. The theoretically optimal density for a minimizer is 1/w, which is provably not achievable in general. For given k and w, little is known about asymptotically optimal minimizers, that is, minimizers with density O(1/w).
We derive a necessary and sufficient condition for the existence of asymptotically optimal minimizers. We also provide a randomized algorithm, called the Miniception, to design minimizers with the best theoretical guarantee to date on density in practical scenarios. Constructing and using the Miniception is as easy as constructing and using a random minimizer, which allows the design of efficient minimizers that scale to the values of k and w used in current bioinformatics software programs.
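
Density is straightforward to measure empirically. The sketch below computes it for a plain random-ordering minimizer on a random sequence and compares it against the known expectation of about 2/(w+1) and the 1/w lower bound; the Miniception's construction itself is more involved and not reproduced here.

```python
import hashlib, random

def h(kmer):
    """Hash used as the random ordering on k-mers."""
    return hashlib.blake2b(kmer.encode(), digest_size=8).digest()

def density(seq, k, w):
    """Fraction of k-mer positions a random minimizer selects."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = {min(range(s, s + w), key=lambda i: h(kmers[i]))
              for s in range(len(kmers) - w + 1)}
    return len(picked) / len(kmers)

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(100_000))
print(density(seq, k=15, w=10))  # ≈ 2/(w+1) ≈ 0.182, vs the 1/w bound 0.1
```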

5:20 PM-5:40 PM
Proceedings Presentation: REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets
Format: Pre-recorded with live Q&A

  • Rayan Chikhi, Pasteur Institute, CNRS, France
  • Camille Marchet, Université de Lille, CRIStAL, France
  • Zamin Iqbal, EBI, United Kingdom
  • Daniel Gautheret, Université Paris-Sud, Orsay, France
  • Mikaël Salson, Université de Lille, CRIStAL, France

Presentation Overview:

Analyzing abundances of sequences within large collections of sequencing datasets is of prime importance for biological investigations, such as the detection and quantification of variants in genomes and transcriptomes. In this work we present REINDEER, a novel computational method that performs indexing of k-mers and records their counts across a collection of datasets. We demonstrate that REINDEER is able to index counts within 2,585 human RNA-seq datasets using only 36 GB of RAM and 60 GB of disk space during construction, and 75 GB of RAM during random access queries. To the best of our knowledge, REINDEER is the first practical method that can index k-mer counts in large dataset collections. Briefly, REINDEER constructs the compacted de Bruijn graph (DBG) of each dataset, then implicitly represents a merged DBG of all datasets. Indexing all the k-mers present in the merged DBG would be too expensive; instead, REINDEER indexes a set of minitigs, which are sequences of coherently grouped k-mers. Minitigs are then associated with vectors of counts per dataset. Software is available at github.com/kamimrcht/REINDEER.
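
A toy version of the final associative structure clarifies the idea. Here "minitigs" are simplified to groups of k-mers with identical count signatures (the real definition additionally requires contiguity in the de Bruijn graph), and plain dictionaries stand in for REINDEER's compressed representations.

```python
from collections import defaultdict

def build_index(datasets, k):
    """datasets: list of read lists; returns count-vector -> k-mer group."""
    counts = defaultdict(lambda: [0] * len(datasets))
    for d, reads in enumerate(datasets):
        for seq in reads:
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]][d] += 1
    groups = defaultdict(list)   # k-mers sharing a vector, stored once
    for kmer, vec in counts.items():
        groups[tuple(vec)].append(kmer)
    return groups

datasets = [["ACGTACGT"], ["ACGTACGTT", "TTTT"]]
for vec, grouped in build_index(datasets, 4).items():
    print(vec, grouped)   # e.g. (1, 1) ['CGTA', 'GTAC', 'TACG']
```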

5:40 PM-6:00 PM
Proceedings Presentation: Distance Indexing and Seed Clustering in Sequence Graphs
Format: Pre-recorded with live Q&A

  • Benedict Paten, University of California Santa Cruz, United States
  • Xian Chang, University of California Santa Cruz Genomics Institute, United States
  • Jordan Eizenga, University of California Santa Cruz, United States
  • Adam Novak, UC Santa Cruz, United States
  • Jouni Sirén, University of California, Santa Cruz, United States

Presentation Overview:

Graph representations of genomes are capable of expressing more genetic variation and can therefore better represent a population than standard linear genomes. However, due to the greater complexity of genome graphs relative to linear genomes, some functions that are trivial on linear genomes become much more difficult in genome graphs. Calculating distance is one such function: it is simple in a linear genome but complicated in a graph context. In read mapping algorithms, such distance calculations are fundamental to determining if seed alignments could belong to the same mapping.
We have developed an algorithm for quickly calculating the minimum distance between positions on a sequence graph using a minimum distance index. We have also developed an algorithm that uses the distance index to cluster seeds on a graph. We demonstrate that our implementations of these algorithms are efficient and practical to use for a new generation of mapping algorithms based upon genome graphs.
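
For contrast with an index-based lookup, the sketch below answers a single minimum-distance query with plain Dijkstra over a toy sequence graph; the node-length edge weighting and the adjacency encoding are my simplification of a sequence graph, and the paper's index instead preprocesses the graph (via its snarl decomposition) so queries avoid any traversal.

```python
import heapq

def min_distance(adj, lengths, src, dst):
    """Min base-pair distance from the start of node `src` to the start
    of node `dst`, walking whole nodes (offsets omitted for brevity)."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v in adj.get(u, []):
            nd = d + lengths[u]           # traverse all of node u first
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None                           # dst unreachable from src

lengths = {"a": 5, "b": 3, "c": 4, "d": 2}
adj = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}   # bubble: a->(b|c)->d
print(min_distance(adj, lengths, "a", "d"))        # 8, via the b branch
```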