Attention Presenters - please review the Presenter Information Page available here
Schedule subject to change
All times listed are in EDT
Saturday, July 13th
10:40-10:45
Welcome
Room: 517d
Format: In Person



10:45-11:40
Invited Presentation: Unsupervised learning approaches for genomics to decipher structure and dynamics of 3D genome organization and gene regulatory networks
Confirmed Presenter: Sushmita Roy

Room: 517d
Format: In Person


Authors List:

  • Sushmita Roy

Presentation Overview:

Advances in genomic technologies have substantially expanded our repertoire of high-dimensional datasets that capture different modalities, such as the transcriptome, epigenome, and chromosome conformation, across many different cellular contexts. An open challenge is to effectively analyze these datasets to extract meaningful structures such as cell types, chromosomal domains, gene modules, and regulatory networks. Unsupervised machine learning, which aims to extract structure, often low-dimensional, from unlabeled data, is a powerful paradigm for unbiased analysis of omic datasets. In this talk, I will present two examples of such approaches, non-negative matrix factorization (NMF) and graph structure learning, to tackle problems in regulatory genomics. We consider multi-task extensions of NMF for examining the three-dimensional organization of the genome. Our results show that NMF is a powerful approach for analyzing 3D genome organization from Hi-C assays that can recover biologically meaningful topological units and their dynamics. In the second part of my talk, I will present factorization and graph learning approaches for single-cell omic datasets. Using these approaches, we have identified key gene expression programs and cell type-specific gene regulatory networks that are informative of cell state and fate specification in different dynamic processes such as cellular differentiation and reprogramming.
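
For readers who want a concrete picture of the factorization step, the following minimal Python sketch applies an off-the-shelf NMF (scikit-learn) to a simulated block-structured matrix standing in for a Hi-C contact map; it is a toy illustration of reading low-rank factors as candidate topological units, not the multi-task method presented in the talk.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)
    # Toy "contact map": two blocks of elevated contact frequency plus noise,
    # standing in for domain-like structure along a chromosome segment.
    contacts = rng.poisson(1.0, size=(100, 100)).astype(float)
    contacts[:50, :50] += 5.0
    contacts[50:, 50:] += 5.0
    contacts = (contacts + contacts.T) / 2          # symmetrize like a Hi-C matrix

    model = NMF(n_components=2, init="nndsvd", max_iter=500)
    W = model.fit_transform(contacts)               # bins x components
    H = model.components_                           # components x bins

    # Assign each genomic bin to its dominant component; contiguous runs of the
    # same label suggest domain-like units.
    labels = W.argmax(axis=1)
    print(labels)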

11:40-12:00
An Adaptive K-Nearest Neighbor Graph Optimized for Single-cell and Spatial Clustering
Confirmed Presenter: Qi Liu, Vanderbilt University Medical Center, United States

Room: 517d
Format: In Person


Authors List:

  • Jia Li, Vanderbilt University Medical Center, United States
  • Yu Shyr, Vanderbilt University Medical Center, United States
  • Qi Liu, Vanderbilt University Medical Center, United States

Presentation Overview:

Unsupervised clustering is crucial for characterizing cellular heterogeneity in single-cell and spatial transcriptomics analysis. While conventional clustering methods have difficulty identifying rare cell types, approaches specifically tailored for detecting rare cell types gain that sensitivity at the cost of poorer performance in grouping abundant ones. We introduce aKNNO, a method to identify abundant and rare cell types simultaneously based on an adaptive k-nearest neighbor graph with optimization. Unlike traditional kNN graphs, which require a predetermined and fixed k value for all cells, aKNNO selects k for each cell adaptively based on its local distance distribution. This adaptive approach enables accurate capture of the inherent cellular structure. Through extensive evaluation across 38 simulated scenarios and 20 single-cell and spatial transcriptomics datasets spanning various species, tissues, and technologies, aKNNO consistently demonstrates its power in accurately identifying both abundant and rare cell types. Remarkably, aKNNO outperforms conventional and even specifically tailored methods by uncovering both known and novel rare cell types without compromising clustering performance for abundant ones. Most notably, when utilizing transcriptome data alone, aKNNO delineates stereotyped fine-grained anatomical structures more precisely than integrative approaches combining expression with spatial locations and/or histology images, including GraphST, SpaGCN, BayesSpace, stLearn, and DR-SC.
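
As a rough illustration of choosing k per cell from its local distance distribution, the sketch below builds an adaptive kNN graph with a largest-gap cutoff; the gap rule, k_max, and the simulated data are placeholder assumptions, not aKNNO's actual optimization.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))            # toy cells x features (e.g. a PCA embedding)

    k_max = 30
    nn = NearestNeighbors(n_neighbors=k_max + 1).fit(X)
    dist, idx = nn.kneighbors(X)              # column 0 is each cell itself

    edges = []
    for i in range(X.shape[0]):
        d = dist[i, 1:]                       # distances to the k_max nearest cells
        # Placeholder rule: cut the neighbor list at the largest gap in the sorted
        # local distances, so cells in sparse (possibly rare) populations keep
        # fewer neighbors than cells in dense ones.
        k_i = int(np.argmax(np.diff(d))) + 1
        for j in idx[i, 1:k_i + 1]:
            edges.append((i, int(j)))

    print(len(edges), "directed edges in the adaptive kNN graph")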

12:00-12:20
Proceedings Presentation: Forseti: A mechanistic and predictive model of the splicing status of scRNA-seq reads
Confirmed Presenter: Yuan Gao, University of Maryland, College Park, United States

Room: 517d
Format: In Person


Authors List:

  • Dongze He, University of Maryland, College Park, United States
  • Yuan Gao, University of Maryland, College Park, United States
  • Spencer Skylar Chan, University of Maryland, College Park, United States
  • Natalia Quintana-Parrilla, University of Puerto Rico, Mayagüez Campus, United States
  • Rob Patro, University of Maryland, College Park, United States

Presentation Overview:

Motivation: Short-read single-cell RNA-sequencing (scRNA-seq) has been used to study cellular heterogeneity, cellular fate, and transcriptional dynamics. Modeling splicing dynamics in scRNA-seq data is challenging, with inherent difficulty in even the seemingly straightforward task of elucidating the splicing status of the molecules from which the underlying sequenced fragments are drawn. This difficulty arises, in part, from the limited read length and positional biases, which substantially reduce the specificity of the sequenced fragments. As a result, the splicing status of many reads in scRNA-seq is ambiguous because of a lack of definitive evidence. We are therefore in need of methods that can recover the splicing status of ambiguous reads which, in turn, can lead to more accuracy and confidence in downstream analyses.
Results: We develop Forseti, a predictive model to probabilistically assign a splicing status to scRNA-seq reads. Our model has two key components. First, we train a binding affinity model to assign a probability that a given transcriptomic site is used in fragment generation. Second, we fit a robust fragment length distribution model that generalizes well across datasets deriving from different species and tissue types. Forseti combines these two trained models to predict the splicing status of the molecule of origin of reads by scoring putative fragments that associate each alignment of sequenced reads with proximate potential priming sites. Using both simulated and experimental data, we show that our model can precisely predict the splicing status of reads and identify the true gene origin of multi-gene mapped reads.
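
To make the two-component scheme concrete, here is a toy Python sketch that combines a priming-probability model and a fragment-length model to score candidate priming sites for one read alignment; the logistic affinity function, the normal length distribution, and the example numbers are illustrative assumptions, not Forseti's trained models.

    import math

    def priming_prob(a_run_len):
        """Placeholder binding-affinity model: probability that an A-rich site
        primes a fragment, as a logistic function of the A-run length."""
        return 1.0 / (1.0 + math.exp(-(a_run_len - 6)))

    def frag_len_prob(length, mean=350.0, sd=80.0):
        """Placeholder fragment-length model (normal density)."""
        z = (length - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

    def score_read(alignment_pos, candidate_sites):
        """Score each candidate priming site for one read alignment and report
        the splicing status supported by the best-scoring putative fragment."""
        best = None
        for site_pos, a_run_len, status in candidate_sites:
            frag_len = abs(site_pos - alignment_pos)
            score = priming_prob(a_run_len) * frag_len_prob(frag_len)
            if best is None or score > best[0]:
                best = (score, status)
        return best

    # One read aligned at position 1,000 with two hypothetical priming sites.
    sites = [(1300, 8, "spliced"), (2100, 12, "unspliced")]
    print(score_read(1000, sites))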

12:20-12:40
Invited Presentation: Computational Advances In Multiomics Analysis Using HiFi Sequencing
Confirmed Presenter: Liz Tseng

Room: 517d
Format: In Person


Authors List:

  • Liz Tseng

Presentation Overview:

PacBio HiFi sequencing has been used to generate the latest and most complete version of the human genome and has ushered in a new era of bioinformatics development. The two main characteristics of HiFi data – long read length and high accuracy – are critical for applications that require near-perfect consensus sequencing or long-range phasing information.
In this workshop, we will describe the bioinformatics tools that have been developed for HiFi data. These tools often address genetic puzzles that were previously challenging or impossible to solve with short reads. For example, Paraphase for resolving segmental duplications, StarPhase for diplotyping important pharmacogenetic genes (e.g. HLA, CYP2D6), and TRGT for repeat expansion profiling. In other cases, new methods were developed based on the unique nature of HiFi sequencing that include methylation signals (e.g. MethBat). Beyond the genome, the ability to sequence full-length transcripts without the need for computational assembly has brought about new long-read-aware tools that address isoform classification (e.g. SQANTI3), fusion detection (e.g. pbfusion, CTAT-LR-fusion), and quantification (e.g. Oarfish). Together, these tools reveal novel insights across the whole spectrum of genetic applications, from uncovering de novo mutations in rare diseases, detecting allele-specific methylation patterns, to helping design new therapeutic targets in neurodegenerative diseases.

14:20-14:40
Proceedings Presentation: Sigmoni: classification of nanopore signal with a compressed pangenome index
Confirmed Presenter: Vikram Shivakumar, Johns Hopkins University, United States

Room: 517d
Format: In Person


Authors List:

  • Vikram Shivakumar, Johns Hopkins University, United States
  • Omar Ahmed, Johns Hopkins University, United States
  • Sam Kovaka, Johns Hopkins University, United States
  • Mohsen Zakeri, Johns Hopkins University, United States
  • Ben Langmead, Johns Hopkins University, United States

Presentation Overview:

Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling, but past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbps. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics, all in linear query time without the need for seed-chain-extend. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes. Sigmoni is the first signal-based tool to scale to a complete human genome and pangenome while remaining fast enough for adaptive sampling applications.
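
The quantization step can be pictured with a few lines of Python: raw picoamp samples are binned into a small discrete alphabet so that exact-matching index structures apply; the bin boundaries below are arbitrary placeholders, not Sigmoni's alphabet.

    import numpy as np

    # Assumed cutpoints over a typical picoamp range (7 cutpoints -> 8 bins).
    boundaries = np.linspace(60.0, 140.0, num=7)
    alphabet = "ABCDEFGH"

    def quantize(signal):
        """Map each raw picoamp sample to a symbol of the discrete alphabet."""
        bins = np.digitize(signal, boundaries)   # values in 0..7
        return "".join(alphabet[b] for b in bins)

    raw = np.array([72.5, 80.1, 79.8, 95.0, 130.2, 61.0])
    print(quantize(raw))   # a short string over the alphabet, amenable to exact matching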

14:40-15:00
Proceedings Presentation: Label-guided seed-chain-extend alignment on annotated De Bruijn graphs
Confirmed Presenter: Harun Mustafa, ETH Zurich, Switzerland

Room: 517d
Format: In Person


Authors List:

  • Harun Mustafa, ETH Zurich, Switzerland
  • Mikhail Karasikov, ETH Zurich, Switzerland
  • Nika Mansouri Ghiasi, ETH Zurich, Switzerland
  • Gunnar Rätsch, ETH Zürich, Department for Computer Science, Switzerland
  • André Kahles, ETH Zurich, Switzerland

Presentation Overview:

Motivation: Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, which complicates finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs produce shorter alignments due to fragmentation, leading to low recall for long reads. While some (e.g., label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.

Results: We introduce a new scoring model, multi-label alignment (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically-relevant sample combinations, Label Change incorporates more informative global sample similarity into local scores. To improve connectivity, Node Length Change dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seeds from SCA’s alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically-relevant alignments, decreasing average weighted UniFrac errors by 63.1–66.8% and covering 45.5–47.4% (median) more long-read query characters than state-of-the-art aligners. MLA’s runtimes are competitive with label-free alignment and substantially faster than single-label alignment.

Availability: https://github.com/ratschlab/mla.
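
As a rough intuition for how a label-change operation could enter a chaining score, the toy below penalizes switches between sample labels less when the two samples are globally similar; the similarity values and penalty scale are invented for illustration, and this is not the MLA scoring model itself.

    # Toy chain scoring with a label-change term (illustrative assumptions only).
    global_similarity = {
        ("sampleA", "sampleB"): 0.9,   # e.g. closely related genomes
        ("sampleA", "sampleC"): 0.2,
        ("sampleB", "sampleC"): 0.2,
    }

    def label_change_cost(a, b, scale=10.0):
        if a == b:
            return 0.0
        sim = global_similarity.get((a, b), global_similarity.get((b, a), 0.0))
        return scale * (1.0 - sim)     # switching to a dissimilar sample costs more

    def chain_score(anchors):
        """anchors: list of (match_score, sample_label) along a candidate chain."""
        score, prev = 0.0, None
        for match_score, label in anchors:
            score += match_score
            if prev is not None:
                score -= label_change_cost(prev, label)
            prev = label
        return score

    chain = [(15, "sampleA"), (12, "sampleB"), (20, "sampleB"), (8, "sampleC")]
    print(chain_score(chain))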

15:00-15:20
Compressed Indexing for Pangenome Substring Queries
Confirmed Presenter: Stephen Hwang, XDBio Program, Johns Hopkins School of Medicine, United States

Room: 517d
Format: In Person


Authors List:

  • Stephen Hwang, XDBio Program, Johns Hopkins School of Medicine, United States
  • Nathaniel K. Brown, Department of Computer Science, Johns Hopkins University, United States
  • Omar Y. Ahmed, Department of Computer Science, Johns Hopkins University, United States
  • Katharine Jenike, Department of Computer Science, Johns Hopkins University, United States
  • Sam Kovaka, Department of Computer Science, Johns Hopkins University, United States
  • Michael C. Schatz, Department of Computer Science, Johns Hopkins University, United States
  • Ben Langmead, Department of Computer Science, Johns Hopkins University, United States

Presentation Overview:

Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and queries that count the number of genomes containing k-mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8x smaller than a comparable KMC3 index and 11.4x smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5x faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
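
The two query types can be made concrete with a naive k-mer sketch that counts genomes directly, without MEMO's MEM-based index; the sequences and window below are illustrative only.

    def kmers(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def conservation_query(genomes, window_seq, k):
        """For each k-mer in the window, count how many genomes contain it."""
        genome_sets = [kmers(g, k) for g in genomes]
        return {km: sum(km in s for s in genome_sets) for km in kmers(window_seq, k)}

    def membership_query(genomes, window_seq, k):
        """Presence/absence version of the same query."""
        return {km: c > 0 for km, c in conservation_query(genomes, window_seq, k).items()}

    pangenome = ["ACGTACGTGG", "ACGTACCTGG", "TTGTACGTGG"]
    print(conservation_query(pangenome, window_seq="ACGTACGT", k=4))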

15:20-15:40
Sequence-to-graph alignment based copy number calling using a flow network formulation
Confirmed Presenter: Hugo Magalhães, Institute for Medical Biometry and Bioinformatics, Medical Faculty, and Center for Digital Medicine, HHU, Düsseldorf, Germany

Room: 517d
Format: In Person


Authors List:

  • Hugo Magalhães, Institute for Medical Biometry and Bioinformatics, Medical Faculty, and Center for Digital Medicine, HHU, Düsseldorf, Germany
  • Timofey Prodanov, Institute for Medical Biometry and Bioinformatics, Medical Faculty, and Center for Digital Medicine, HHU, Düsseldorf, Germany
  • Jonas Weber, Institute of Medical Microbiology and Hospital Hygiene, HHU, Düsseldorf, Germany
  • Gunnar Klau, Algorithmic Bioinformatics, HHU, Düsseldorf, Germany
  • Tobias Marschall, Institute for Medical Biometry and Bioinformatics, Medical Faculty, and Center for Digital Medicine, HHU, Düsseldorf, Germany

Presentation Overview:

Variation of copy number (CN) between individuals has been associated with phenotypic differences. Consequently, CN calling is an important step for disease association and identification, as well as in genome assembly. Traditionally, sequencing reads are mapped to a linear reference genome, after which CN is estimated based on observed read depth. This approach, however, leads to inconsistent CN assignments and is hampered by sequences not represented in a linear reference. To address this issue, we propose a method for CN calling with respect to a graph genome using a flow network formulation.
The tool processes read alignments to any bidirected genome graph and calculates CN probabilities for every node according to a negative binomial distribution and the total base-pair coverage across the node. Integer linear programming is then employed to find a maximum-likelihood flow through the graph, resulting in CN predictions for each node. In this way, the method achieves consistent CN assignments across the graph.
The proposed method is capable of processing a wide variety of input graphs and read mappings from different sequencing technologies. We processed reads aligned to a Verkko assembly graph for HG02492 (HGSVC) using high-coverage mixed HiFi and ONT-UL reads in under 2 hours using one thread and <2 GB peak memory. For 18% of nodes, the method produced different CN values than those expected from read depth alone, showcasing how the graph topology informs CN assignment. Further applications include CN assignment as part of diploid/polyploid (pan)genome assembly workflows.
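
The per-node likelihood can be sketched as follows: for a node with a given total base-pair coverage, score each candidate copy number under a negative binomial model. The haploid coverage and dispersion are assumed values, and the actual method couples nodes through a maximum-likelihood flow rather than scoring them independently.

    from scipy.stats import nbinom

    def cn_log_likelihoods(node_coverage, haploid_cov=15.0, dispersion=0.1, max_cn=6):
        """Log-likelihood of candidate copy numbers for one graph node, using a
        negative binomial with mean cn * haploid_cov and overdispersed variance."""
        lls = {}
        for cn in range(1, max_cn + 1):
            mean = cn * haploid_cov
            var = mean + dispersion * mean ** 2
            p = mean / var                   # scipy's success probability
            r = mean ** 2 / (var - mean)     # scipy's number of successes
            lls[cn] = nbinom.logpmf(int(node_coverage), r, p)
        return lls

    lls = cn_log_likelihoods(node_coverage=64)
    print(max(lls, key=lls.get), lls)        # most likely CN for this node in isolation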

15:40-16:00
Targeted genotyping of complex polymorphic genes using short and long reads
Confirmed Presenter: Timofey Prodanov, Institute for Medical Biometry and Bioinformatics, Heinrich Heine University, 40225 Düsseldorf, Germany

Room: 517d
Format: In Person


Authors List:

  • Timofey Prodanov, Institute for Medical Biometry and Bioinformatics, Heinrich Heine University, 40225 Düsseldorf, Germany
  • Tobias Marschall, Institute for Medical Biometry and Bioinformatics, Heinrich Heine University, 40225 Düsseldorf, Germany

Presentation Overview:

The human genome contains numerous highly polymorphic loci, rich in tandem repeats and structural variants. There, read alignments are often ambiguous and unreliable, resulting in hundreds of disease-associated genes being inaccessible for accurate variant calling. In such regions, structural variant callers show limited sensitivity, k-mer based tools cannot exploit full linkage information of a sequencing read, and gene-specific methods cannot be easily extended to process more loci. Improved ability to genotype highly polymorphic genes can increase diagnostic power and uncover novel disease associations.
We present Locityper, a targeted tool capable of genotyping complex polymorphic loci using both short- and long-read whole genome sequencing, including error-prone ONT data. For each target, Locityper recruits WGS reads and aligns them to possible locus haplotypes (e.g. extracted from a pangenome). By optimizing read alignment, insert size, and read depth profiles across haplotypes, Locityper efficiently estimates the likelihood of each haplotype pair. This is achieved by solving integer linear programming problems or by employing stochastic optimization.
Across 256 challenging medically relevant loci and 40 HPRC Illumina datasets, 95% of Locityper haplotypes were accurate (QV, Phred-scaled divergence, ≥33), compared to 27% accurate haplotypes reconstructed from the phased NYGC call set. In leave-one-out (LOO) evaluation, Locityper produced 60% accurate haplotypes, a fraction that will increase with larger reference panels, as >91% of haplotypes were very close (ΔQV≤5) to the best available haplotypes. Overall, 82% of 1KGP trio haplotypes were concordant. Finally, across 36 HLA genes, LOO Locityper correctly predicted the protein product in 94% of cases, outperforming the specialized HLA-genotyper T1K at 78%.

16:40-17:00
VISTA: An integrated framework for structural variant discovery
Confirmed Presenter: Varuni Sarwal, UCLA, United States

Room: 517d
Format: In Person


Authors List:

  • Varuni Sarwal, UCLA, United States
  • Seungmo Lee, UCLA, United States
  • Jianzhi Yang, USC, United States
  • Sriram Sankararaman, UCLA, United States
  • Mark Chaisson, USC, United States
  • Eleazar Eskin, UCLA, United States
  • Serghei Mangul, USC, United States

Presentation Overview:

Structural variation (SV) refers to insertions, deletions, inversions, and duplications in human genomes. With advances in whole genome sequencing (WGS) technologies, a plethora of SV detection methods have been developed. However, dissecting SVs from WGS data remains a challenge, with the majority of SV detection methods prone to a high false-positive rate, and no existing method able to precisely detect the full range of SVs present in a sample. Here, we report an integrated structural variant calling framework, VISTA (Variant Identification and Structural Variant Analysis), that leverages the results of individual callers using a novel and robust filtering and merging algorithm. In contrast to existing consensus-based tools, which ignore variant length and coverage, VISTA overcomes this limitation by executing various combinations of top-performing callers based on variant length and genomic coverage to generate SV events with high accuracy. We evaluated the performance of VISTA on comprehensive gold-standard datasets across varying organisms and coverage. We benchmarked VISTA using the Genome-in-a-Bottle (GIAB) gold standard SV set, haplotype-resolved de novo assemblies from the Human Pangenome Reference Consortium (HPRC), along with an in-house PCR-validated mouse gold standard set. VISTA maintained the highest F1 score among top consensus-based tools measured using a comprehensive gold standard across both mouse and human genomes. In conclusion, VISTA represents a significant advancement in structural variant calling, offering a robust and accurate framework that outperforms existing consensus-based tools and sets a new standard for SV detection in genomic research.

17:00-18:00
Invited Presentation: Long-read sequencing and pangenome perspective of structural variation
Room: 517d
Format: In Person


Authors List:

  • Evan Eichler, University of Washington, United States
Sunday, July 14th
10:40-11:40
Invited Presentation: Why and how long reads are used to improve gene isoform quantification
Confirmed Presenter: Kin Au

Room: 517d
Format: In Person


Authors List:

  • Kin Au
11:40-12:00
Telomere-to-telomere assembly by preserving contained reads
Confirmed Presenter: Sudhanva Shyam Kamath, Indian Institute of Science, Bangalore, India

Room: 517d
Format: Live Stream


Authors List:

  • Sudhanva Shyam Kamath, Indian Institute of Science, Bangalore, India
  • Mehak Bindra, Indian Institute of Science, Bangalore, India
  • Debnath Pal, Indian Institute of Science, Bangalore, India
  • Chirag Jain, Indian Institute of Science, Bangalore, India

Presentation Overview:

Automated telomere-to-telomere (T2T) de novo assembly of diploid and polyploid genomes remains a formidable task. A string graph is a commonly used assembly graph representation in overlap-based algorithms. The string graph formulation employs graph simplification heuristics, which drastically reduce the count of vertices and edges. One of these heuristics involves removing reads contained in longer reads. However, this procedure is not guaranteed to be safe. In practice, it occasionally introduces gaps in the assembly by removing all reads covering one or more genome intervals. The factors contributing to such gaps remain poorly understood. In this work, we mathematically derived the frequency of observing a gap near germline and somatic heterozygous variant loci. Our analysis shows that (i) an assembly gap due to contained read deletion is an order of magnitude more frequent in Oxford Nanopore reads than PacBio HiFi reads due to differences in their read-length distributions, and (ii) this frequency decreases with an increase in the sequencing depth. Drawing cues from these observations, we addressed the weakness of the string graph formulation by developing the RAFT assembly algorithm. RAFT fragments reads and produces a more uniform read-length distribution. The algorithm retains spanned repeats in the reads during the fragmentation. We empirically demonstrate that RAFT significantly reduces the number of gaps using simulated datasets. Using real Oxford Nanopore and PacBio HiFi datasets of the HG002 human genome, we achieved a twofold increase in the contig NG50 and the number of haplotype-resolved T2T contigs compared to Hifiasm.
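
A minimal sketch of the fragmentation idea follows: long reads are cut into roughly uniform, overlapping pieces so that the read-length distribution no longer drives contained-read gaps (RAFT additionally keeps repeat-spanning fragments intact, which this toy ignores; the lengths are placeholder values).

    def fragment_read(read, target_len=20_000, overlap=2_000):
        """Cut one read into overlapping pieces of roughly target_len bases."""
        if len(read) <= target_len:
            return [read]
        pieces, start = [], 0
        while start < len(read):
            pieces.append(read[start:start + target_len])
            if start + target_len >= len(read):
                break
            start += target_len - overlap    # keep an overlap between adjacent pieces
        return pieces

    # Toy "reads": strings stand in for sequences of the given lengths.
    reads = ["A" * 55_000, "C" * 12_000, "G" * 80_000]
    fragmented = [p for r in reads for p in fragment_read(r)]
    print(sorted(len(p) for p in fragmented))   # a much more uniform length distribution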

12:00-12:20
Rawsamble: Overlapping and Assembling Raw Nanopore Signals using a Hash-based Seeding Mechanism
Confirmed Presenter: Can Firtina, ETH Zurich, Switzerland

Room: 517d
Format: In Person


Authors List:

  • Can Firtina, ETH Zurich, Switzerland
  • Maximilian Mordig, Max Planck Institute for Intelligent Systems, ETH Zurich, Germany
  • Joël Lindegger, ETH Zurich, Switzerland
  • Harun Mustafa, ETH Zurich, University Hospital Zurich, Swiss Institute of Bioinformatics, Switzerland
  • Sayan Goswami, ETH Zurich, Switzerland
  • Stefano Mercogliano, ETH Zurich, Switzerland
  • Yan Zhu, University of Toronto, ETH Zurich, Canada
  • Andre Kahles, ETH Zurich, University Hospital Zurich, Swiss Institute of Bioinformatics, Switzerland
  • Onur Mutlu, ETH Zurich, Switzerland

Presentation Overview:

Although raw nanopore signal mapping to a reference genome is widely studied to achieve highly accurate and fast mapping of raw signals, mapping to a reference genome is not possible when the corresponding reference genome of an organism is either unknown or does not exist. To circumvent such cases, all-vs-all overlapping is performed to construct a de novo assembly from overlapping information. However, all-vs-all overlapping of raw nanopore signals remains unsolved due to its unique challenges, such as 1) generating multiple and accurate mapping pairs per read, 2) performing similarity search between a pair of noisy raw signals, and 3) performing space- and compute-efficient operations for portability and real-time analysis.

We introduce Rawsamble, the first mechanism that can quickly and accurately find overlaps between raw nanopore signals without translating them to bases. We find that Rawsamble can 1) find overlaps while meeting real-time requirements, with an average throughput of around 200,000 bp/sec, 2) share a large portion of overlapping pairs with minimap2 (37.12% on average), and 3) enable the construction of long assemblies from these useful overlaps. Finding overlapping pairs from raw signals is critical for enabling previously unexplored directions in raw signal analysis, such as the de novo assembly construction from overlaps that we explore in this work. We believe these overlaps can be useful for many other new directions coupled with real-time analysis.

Rawsamble is integrated in RawHash and available at https://github.com/CMU-SAFARI/RawHash.

14:20-14:40
Proceedings Presentation: Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets
Confirmed Presenter: Igor Martayan, Univ Lille, France

Room: 517d
Format: In Person


Authors List:

  • Igor Martayan, Univ Lille, France
  • Bastien Cazaux, Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
  • Antoine Limasset, CNRS, France
  • Camille Marchet, CNRS, France

Presentation Overview:

In this paper, we introduce the Conway-Bromage-Lyndon (CBL) structure, a compressed, dynamic and exact method for representing k-mer sets. Originating from Conway and Bromage’s concept, CBL innovatively employs the smallest cyclic rotations of k-mers, akin to Lyndon words, to leverage lexicographic redundancies. In order to support dynamic operations and set operations, we propose a dynamic bit vector structure that draws a parallel with Elias-Fano’s scheme. This structure is encapsulated in a Rust library, demonstrating a balanced blend of construction efficiency, cache locality, and compression. Our findings suggest that CBL outperforms existing k-mer set methods, particularly in dynamic scenarios. Unique to this work, CBL stands out as the only known exact k-mer structure offering in-place set operations. Its combined abilities position it as a flexible Swiss-army-knife structure for k-mer set management.
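
The normalization at the heart of this idea, mapping every k-mer to its smallest cyclic rotation as with Lyndon words, can be shown in a few lines; the naive quadratic routine below (Booth's algorithm would give linear time) only illustrates canonicalizing rotations and is not CBL's implementation.

    def smallest_cyclic_rotation(kmer):
        """Return the lexicographically smallest rotation of a k-mer (naive O(k^2))."""
        doubled = kmer + kmer
        return min(doubled[i:i + len(kmer)] for i in range(len(kmer)))

    # All rotations of the same circular word normalize to one representative,
    # which is the lexicographic redundancy exploited when encoding k-mer sets.
    print(smallest_cyclic_rotation("GATTACA"))   # 'ACAGATT'
    for rot in ("ATTACAG", "TTACAGA", "AGATTAC"):
        assert smallest_cyclic_rotation(rot) == smallest_cyclic_rotation("GATTACA")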

14:40-15:00
Proceedings Presentation: Learning Locality-Sensitive Bucketing Functions
Confirmed Presenter: Xin Yuan, Pennsylvania State University, United States

Room: 517d
Format: In Person


Authors List:

  • Xin Yuan, Pennsylvania State University, United States
  • Ke Chen, Pennsylvania State University, United States
  • Xiang Li, Pennsylvania State University, United States
  • Qian Shi, Pennsylvania State University, United States
  • Mingfu Shao, Pennsylvania State University, United States

Presentation Overview:

Many tasks in sequence analysis require identifying biologically related sequences in a large set. Edit distance is widely used in these tasks as a measure. To avoid all-vs-all pairwise comparisons and save on expensive edit distance computations, locality-sensitive bucketing (LSB) functions have been proposed. Formally, a (d1,d2)-LSB function sends sequences into multiple buckets with the guarantee that pairs of sequences of edit distance at most d1 can be found within the same bucket while those of edit distance at least d2 do not share any. LSB functions generalize locality-sensitive hashing (LSH) functions and admit favorable properties, making them potentially ideal solutions to the above problem. However, constructing LSB functions for practical use has so far been scarcely possible. In this work, we aim to utilize machine learning techniques to train LSB functions. With the development of a novel loss function and insights into neural network structures that can extend beyond this specific task, we obtained LSB functions that exhibit nearly perfect accuracy for certain (d1,d2). Compared to the state-of-the-art method OMH, the trained LSB functions achieve a 2- to 5-fold improvement in the sensitivity of recognizing similar sequences. An experiment on analyzing erroneous cell barcode data is also included to demonstrate the application of the trained LSB functions.
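
For concreteness, the (d1,d2) guarantee can be checked mechanically on a toy bucketing function: every pair at edit distance at most d1 must share a bucket, and no pair at edit distance at least d2 may. The substring-based bucketing below is only a stand-in for the learned functions in this work.

    import itertools

    def edit_distance(a, b):
        """Standard dynamic-programming (Levenshtein) edit distance."""
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            prev, dp[0] = dp[0], i
            for j, cb in enumerate(b, 1):
                cur = min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
                prev, dp[j] = dp[j], cur
        return dp[-1]

    def buckets(seq, w=4):
        """Toy bucketing: all length-w substrings (a sequence may land in
        several buckets, as LSB functions allow)."""
        return {seq[i:i + w] for i in range(len(seq) - w + 1)}

    def satisfies_lsb(seqs, d1, d2):
        for a, b in itertools.combinations(seqs, 2):
            d, shared = edit_distance(a, b), bool(buckets(a) & buckets(b))
            if (d <= d1 and not shared) or (d >= d2 and shared):
                return False
        return True

    reads = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC"]
    print(satisfies_lsb(reads, d1=1, d2=8))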

15:00-15:20
Proceedings Presentation: Fast Multiple Sequence Alignment via Multi-Armed Bandits
Confirmed Presenter: Kayvon Mazooji, University of Illinois Urbana-Champaign, United States

Room: 517d
Format: In Person


Authors List:

  • Kayvon Mazooji, University of Illinois Urbana-Champaign, United States
  • Ilan Shomorony, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

Multiple sequence alignment is an important problem in computational biology with applications that include phylogeny and the detection of remote homology between protein sequences. UPP is a popular software package that constructs accurate multiple sequence alignments for large datasets based on ensembles of Hidden Markov Models (HMMs). A computational bottleneck for this method is a sequence-to-HMM assignment step, which relies on the precise computation of probability scores on the HMMs. In this work, we show that we can speed up this assignment step significantly by replacing these HMM probability scores with alternative scores that can be efficiently estimated. Our proposed approach utilizes a multi-armed bandit algorithm to adaptively and efficiently compute estimates of these scores. This allows us to achieve alignment accuracy similar to UPP's with a significant reduction in computation time.
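
A small sketch of the adaptive-estimation idea: treat each HMM as an arm of a bandit, draw cheap noisy partial scores, and spend most of the sampling budget on promising arms before committing a sequence to its best-scoring HMM. The UCB rule, score distributions, and numbers below are generic illustrative assumptions, not the algorithm of this paper.

    import math
    import random

    random.seed(0)

    # Assumed ground-truth mean scores of each "HMM" for one query sequence; in
    # the real setting estimates would come from scoring subsets of positions.
    true_means = [0.52, 0.55, 0.70, 0.50]

    def sample_partial_score(arm):
        """Cheap noisy estimate of HMM `arm`'s score (placeholder)."""
        return random.gauss(true_means[arm], 0.1)

    def ucb_best_arm(n_arms, budget=400):
        counts = [1] * n_arms
        sums = [sample_partial_score(a) for a in range(n_arms)]
        for t in range(n_arms, budget):
            ucb = [sums[a] / counts[a] + math.sqrt(2 * math.log(t + 1) / counts[a])
                   for a in range(n_arms)]
            a = max(range(n_arms), key=lambda i: ucb[i])
            sums[a] += sample_partial_score(a)
            counts[a] += 1
        return max(range(n_arms), key=lambda i: sums[i] / counts[i]), counts

    best, counts = ucb_best_arm(len(true_means))
    print("assigned HMM:", best, "samples per HMM:", counts)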

15:20-15:40
Contrasting and Combining Transcriptome Complexity Captured by Short and Long RNA Sequencing Reads
Confirmed Presenter: Seong Woo Han, University of Pennsylvania, United States

Room: 517d
Format: In Person


Authors List:

  • Seong Woo Han, University of Pennsylvania, United States
  • San Jewell, University of Pennsylvania, United States
  • Andrei Thomas-Tikhonenko, University of Pennsylvania, United States
  • Yoseph Barash, University of Pennsylvania, United States

Presentation Overview:

High-throughput short-read RNA sequencing has given researchers unprecedented capabilities for detecting and quantifying splicing variations across biological conditions and disease states. However, short-read technology is limited in its ability to identify which isoforms are responsible for the observed sequence fragments and how splicing variations across a gene are related. In contrast, more recent long-read sequencing technology offers improved detection of underlying full or partial isoforms but is limited by higher error rates and lower throughput, hindering its ability to accurately detect and quantify all splicing variations in a given condition.

To better understand the underlying isoforms and splicing changes in a given biological condition, it is important to be able to combine the results of both short- and long-read sequencing, together with the annotation of known isoforms. To address this need, we develop MAJIQ-L, a tool to visualize and quantify splicing variations from multiple data sources. MAJIQ-L combines transcriptome annotation, the output of long-read-based isoform detection tools, and MAJIQ-based (Vaquero-Garcia et al., 2016, 2023) short-read RNA-seq analysis of local splicing variations (LSVs). We analyze which splice junctions are supported by which type of evidence (known isoforms, short reads, long reads), followed by the analysis of matched short- and long-read human cell line datasets. Our software can be used to assess any future long-read technology or algorithm, and to combine it with short-read data for improved transcriptome analysis.

15:40-16:00
Quantum Computing for Genomic Analysis
Confirmed Presenter: Sergii Strelchuk, University of Cambridge, United Kingdom

Room: 517d
Format: In Person


Authors List:

  • James Bonfield, Wellcome Sanger Institute, United Kingdom
  • Tony Burdett, European Bioinformatics Institute, European Molecular Biology Laboratory, United Kingdom
  • Peter Clapham, Wellcome Sanger Institute, United Kingdom
  • Josh Cudby, University of Cambridge, United Kingdom
  • Robert Davies, Wellcome Sanger Institute, United Kingdom
  • Richard Durbin, University of Cambridge, United Kingdom
  • David Holland, Wellcome Sanger Institute, United Kingdom
  • Aditya Jain, University of Cambridge, United Kingdom
  • James McCafferty, Wellcome Sanger Institute, United Kingdom
  • Yanisa Sunthornyotin, European Bioinformatics Institute, European Molecular Biology Laboratory, United Kingdom
  • Andrew Whitwham, Wellcome Sanger Institute, United Kingdom
  • Orson Ye, University of Cambridge, United Kingdom
  • David Yuan, European Bioinformatics Institute, European Molecular Biology Laboratory, United Kingdom
  • Sergii Strelchuk, University of Cambridge, United Kingdom

Presentation Overview:

Many essential tasks in genomic analysis are extremely difficult for classical computers because the underlying problems are inherently hard to solve efficiently with classical (empirical) algorithms. Quantum computing offers novel possibilities, with algorithmic techniques capable of achieving provable speedups over existing classical exact algorithms in large-scale genomic analyses. Our work utilizes PhiX174, SARS-CoV-2, and human genome data to explore quantum algorithms and data encoding techniques, paving the way for analyses with better time and space efficiency.

We take a two-pronged approach:

1) Algorithm Development: We will design novel quantum algorithms for MSA subproblems and heuristic methods (QAOA) for de novo assembly.

2) Data Encoding and State Preparation: We develop efficient quantum circuits to encode genomic data and reduce the computational overhead with a variety of techniques, including tensor network methods. It facilitates data encoding into quantum states for Machine Learning applications.

Starting with the PhiX174 genome, we will test our quantum algorithms with provable theoretical speedups compared to classical methods. This allows us to scale the approach to larger and more complex genomes like SARS-CoV-2 and the human genome. We will develop efficient encoding strategies and optimize quantum circuits to minimize resource needs on current hardware. To test how noise sources that appear in a variety of hardware implementations affect the computation, we use recently developed tensor network contraction methods for efficient small-scale classical simulation.

This project aims to identify problem settings where utilizing quantum computing will be the most beneficial to unlocking the vast potential of genomics in healthcare. By studying classical computational bottlenecks and developing ways to speed them up, we aim to achieve a deeper understanding of human health and pathogens.

16:40-17:00
Proceedings Presentation: Adaptive Digital Tissue Deconvolution
Confirmed Presenter: Franziska Görtler, Department of Oncology and Medical Physics, Haukeland University Hospital, Norway

Room: 517d
Format: In Person


Authors List:

  • Franziska Görtler, Department of Oncology and Medical Physics, Haukeland University Hospital, Norway
  • Malte Mensching-Buhr, Department of Medical Bioinformatics, University Medical Center Göttingen, Germany
  • Ørjan Skaar, Computational Biology Unit, University of Bergen, Norway
  • Stefan Schrod, University Medical Center Göttingen, Germany
  • Thomas Sterr, Institute of Theoretical Physics, University of Regensburg, Germany
  • Andreas Schäfer, Institute of Theoretical Physics, University of Regensburg, Germany
  • Tim Beissbarth, University Medicine Göttingen, Germany
  • Anagha Joshi, Department of Clinical Science, Computational Biology Unit, University of Bergen, Norway
  • Helena U. Zacharias, University Medical Center Schleswig-Holstein; Kiel University, Germany
  • Sushma Nagaraja Grellscheid, Computational Biology Unit, University of Bergen, Norway
  • Michael Altenbuchinger, Department of Medical Bioinformatics, University Medical Center Göttingen, Germany

Presentation Overview:

Motivation: The inference of cellular compositions from bulk and spatial transcriptomics data increasingly complements data analyses. Multiple computational approaches have been suggested and, recently, machine learning techniques have been developed to systematically improve estimates. Such approaches allow the inference of additional, less abundant cell types. However, they rely on training data which do not capture the full biological diversity encountered in transcriptomics analyses; data can contain cellular contributions not seen in the training data and, as such, analyses can be biased or blurred. Thus, computational approaches have to deal with unknown, hidden contributions. Moreover, most methods are based on cellular archetypes which serve as a reference; e.g., a generic T-cell profile is used to infer the proportion of T-cells. It is well known that cells adapt their molecular phenotype to the environment and that pre-specified cell archetypes can distort the inference of cellular compositions.
Results: We propose Adaptive Digital Tissue Deconvolution (ADTD) to estimate cellular proportions of pre-selected cell types together with possibly unknown and hidden background contributions. Moreover, ADTD adapts prototypic reference profiles to the molecular environment of the cells, which further resolves cell-type specific gene regulation from bulk transcriptomics data. We verify this in simulation studies and demonstrate that ADTD improves existing approaches in estimating cellular compositions. In an application to bulk transcriptomics data from breast cancer patients, we demonstrate that ADTD provides insights into cell-type specific molecular differences between breast cancer subtypes.
Availability and implementation: A Python implementation of ADTD and a tutorial are available at https://doi.org/10.5281/zenodo.7548362.
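
The basic estimation problem (without ADTD's adaptation of reference profiles) can be sketched as a non-negative least-squares fit of a bulk profile against cell-type references plus a background column; all matrices below are simulated placeholders, and the true background is supplied here only to show why modeling a hidden contribution matters, whereas ADTD estimates it from the data.

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(1)

    n_genes, n_types = 500, 4
    X = rng.gamma(2.0, 1.0, size=(n_genes, n_types))   # reference profiles (genes x cell types)
    background = rng.gamma(2.0, 1.0, size=n_genes)      # hidden contribution absent from X

    true_props = np.array([0.5, 0.2, 0.2, 0.0])
    bulk = X @ true_props + 0.1 * background + rng.normal(0, 0.05, n_genes)

    # Fit without and with an explicit background column.
    props_naive, _ = nnls(X, bulk)
    props_bg, _ = nnls(np.column_stack([X, background]), bulk)

    print("naive:", np.round(props_naive, 2))
    print("with background column:", np.round(props_bg, 2))   # last entry is the background weight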

17:00-17:20
Maximizing accuracy of cellular deconvolution (ACeD)
Confirmed Presenter: Jonathan Bard, State University of New York at Buffalo, United States

Room: 517d
Format: In Person


Authors List:

  • Jonathan Bard, State University of New York at Buffalo, United States
  • Norma Nowak, State University of New York at Buffalo, United States
  • Satrajit Sinha, State University of New York at Buffalo, United States
  • Michael Buck, State University of New York at Buffalo, United States

Presentation Overview:

Bulk RNA-sequencing has been a mainstay for biomedical research since its inception. In cancer alone, the TCGA project has examined 33 cancer types with over 20,000 samples. Each sample has a wealth of patient information associated with it, from survival records to several data modalities including copy number, microbiome, methylation and transcriptomic profiling at the bulk tissue level. However, the challenge with bulk tissue profiling, like RNA-seq, is that the assay measures the average expression across all the cells in the sample, thus hiding cellular heterogeneity. Leveraging cellular deconvolution, these datasets can be used to infer cell type composition and molecular heterogeneity. However, accurate deconvolution is contingent upon using a high-quality single-cell reference dataset with proper cell-type cluster resolution. Therefore, there is a fundamental need for methodology to quantify single-cell dataset quality for deconvolution with optimization of cell-type cluster resolution. To address this challenge, we developed a novel computational strategy to identify the optimal cell-type clustering resolution that maximizes deconvolutional performance. Our R-based software package (ACeD) provides the research community with a valuable toolset to evaluate reference set quality and optimize data upstream of reference-based deconvolution algorithms, enhancing our analysis and understanding of the tumor microenvironment.

17:20-17:40
Evolution of genomic and epigenomic heterogeneity in prostate cancer from tissue and liquid biopsy
Confirmed Presenter: Marjorie Roskes, Weill Cornell Medicine, United States

Room: 517d
Format: In Person


Authors List:

  • Marjorie Roskes, Weill Cornell Medicine, United States
  • Alexander Martinez Fundichely, Weill Cornell Medicine, United States
  • Weiling Li, Weill Cornell Medicine, United States
  • Sandra Cohen, Weill Cornell Medicine, United States
  • Hao Xu, McGill University, Canada
  • Shahd Elnaggar, Barnard College, United States
  • Anisha Tehim, Cornell University, United States
  • Metin Balaban, Princeton University, United States
  • Chen Khuan Wong, Memorial Sloan Kettering Cancer Center, United States
  • Yu Chen, Memorial Sloan Kettering Cancer Center, United States
  • Ben Raphael, Princeton University, United States
  • Ekta Khurana, Weill Cornell Medicine, United States

Presentation Overview:

Castration Resistant Prostate Cancer (CRPC) is an aggressive disease that is highly plastic. Although histologically there are two subtypes of CRPC: adenocarcinoma and neuroendocrine, we have shown it has four distinct molecular subtypes exhibiting differential chromatin and transcriptomic profiles. These are CRPC-AR (androgen receptor dependent), CRPC-WNT (Wnt pathway dependent), CRPC-SCL (stem-cell like), and CRPC-NE (neuroendocrine). During treatment with AR signaling inhibitors, patient tumors can evolve to different subtypes. Clinical identification of these subtypes and mechanistic understanding of the genomic and epigenomic heterogeneity accompanying this evolution is a huge challenge. To address this, we have amassed a unique cohort of 60 CRPC patients with various subtypes from whom cell-free DNA (cfDNA) was collected at various clinically relevant time points and whole-genome sequencing (WGS) was performed. For 24 of these patients, time-matched tissue RNA-seq was performed. We estimated epigenetic/transcriptomic heterogeneity in tissue by deconvolution of bulk RNA-seq data. We performed nucleosomal profiling from cfDNA WGS to infer tumor chromatin accessibility and estimate each epigenetic subtype’s fractional contribution. We can detect the different subtypes in cfDNA and find that CRPC-SCL patients exhibit more heterogeneity than other subtypes in both tissue and cfDNA, likely indicating the transitory state of this subtype. We calculated allele-specific, genome-wide copy number alterations in cfDNA, and can track the parallel evolution of genomic and epigenomic events, e.g. AR gains track with increasing CRPC-AR fraction over time. Our study shows that, beyond biomarker development, cfDNA WGS can be used for characterizing the epigenomic and genomic evolution of patient tumors.

17:40-18:00
Accurate and robust bootstrap inference of single-cell phylogenies by integrating sequencing read counts
Confirmed Presenter: Rija Zaidi, University College London Cancer Institute, United Kingdom

Room: 517d
Format: In Person


Authors List:

  • Rija Zaidi, University College London Cancer Institute, United Kingdom
  • Simone Zaccaria, University College London Cancer Institute, United Kingdom

Presentation Overview:

Recent single-cell DNA sequencing (scDNA-seq) technologies have enabled the parallel investigation of thousands of individual cells. This is required for accurately reconstructing tumour evolution, during which cancer cells acquire a multitude of different genetic alterations. Although the evolutionary analysis of scDNA-seq datasets is complex due to their unique combination of errors and missing data, several methods have been developed to infer single-cell tumour phylogenies by integrating estimates of the false positive and false negative error rates. This integration relies on the assumption that errors are uniformly distributed both within and across cells. However, this assumption does not always hold; error rates depend on sequencing coverage, which is not constant within or across cells in a sequencing experiment due to, e.g., copy-number alterations and the replication status of a cell, limiting the accuracy of existing methods.

To address this challenge, we developed a novel single-cell phylogenetic method that integrates raw sequencing read counts into a statistical framework to robustly correct the errors and missing data. Specifically, our method includes bootstrapping to robustly correct for high error frequency genomic positions and a fast probabilistic heuristic based on hypothesis testing to distinguish the remaining errors from truly observed genotypes. We demonstrate the improved accuracy and robustness of our method compared to existing approaches across several simulation settings. To demonstrate its impact, we applied our method to 42,009 breast cancer cells and 19,905 ovarian cancer cells, revealing more accurate phylogenies consistent with larger genetic alterations.
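
A generic sketch of bootstrapping over genomic positions to assess the stability of a cell grouping is shown below, using simple hierarchical clustering on a toy genotype matrix; the clustering, error rate, and two-clade setup are illustrative assumptions, whereas the presented method works from raw read counts with a hypothesis-testing error model.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    rng = np.random.default_rng(0)

    n_cells, n_sites = 60, 200
    # Toy genotype matrix: two clones differing at the first 50 sites, plus 5% errors.
    genotypes = np.zeros((n_cells, n_sites))
    genotypes[:30, :50] = 1
    genotypes = np.abs(genotypes - (rng.random(genotypes.shape) < 0.05))

    def two_clade_split(matrix):
        """Cluster cells into two groups from a (possibly resampled) site matrix."""
        tree = linkage(pdist(matrix, metric="hamming"), method="average")
        return frozenset(np.where(fcluster(tree, 2, criterion="maxclust") == 1)[0])

    observed = two_clade_split(genotypes)
    support = 0
    for _ in range(100):
        sites = rng.integers(0, n_sites, size=n_sites)   # resample sites with replacement
        split = two_clade_split(genotypes[:, sites])
        support += split in (observed, frozenset(range(n_cells)) - observed)
    print(f"bootstrap support for the observed split: {support}/100")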