ISMB/ECCB 2025

Schedule for HiTSeq


Date	Start Time	End Time	Room	Track	Title	Confirmed Presenter	Format	Authors	Abstract
2025-07-23	11:20:00	12:20:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Learning variant effects on chromatin accessibility and 3D structure without matched Hi-C data	Valentina Boeva	In person	Valentina Boeva	Chromatin interactions provide insights into which DNA regulatory elements connect with specific genes, informing the activation or repression of gene expression. Understanding these interactions is crucial for assessing the role of non-coding mutations or changes in chromatin organization due to cell differentiation or disease. Hi-C and single-cell Hi-C experiments can reveal chromatin interactions, but these methods are costly and labor-intensive. Here, I will introduce our computational approach, UniversalEPI, an attention-based deep ensemble model that predicts regulatory interactions in unseen cell types with a receptive field of 2 million nucleotides, relying solely on DNA sequence data and chromatin accessibility profiles. Demonstrating significantly better performance than state-of-the-art methods, UniversalEPI—with a much lighter architecture—effectively predicts chromatin interactions across malignant and non-malignant cancer cell lines (Spearman’s Rho > 0.9 on unseen cell types). To further expand its applicability, we integrate ASAP, our deep learning toolset that predicts the effects of genomic variants on ATAC-seq profiles. These predicted accessibility profiles can serve as input to UniversalEPI. Importantly, the accuracy of Hi-C interaction prediction remains virtually unchanged when replacing experimental ATAC-seq profiles with those generated by ASAP, indicating strong robustness and enabling predictions even in the absence of experimental accessibility data. This combined framework represents an advancement in in-silico 3D chromatin modeling, essential for exploring genetic variant impacts on disease and monitoring chromatin architecture changes during development.
2025-07-23	12:20:00	12:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Spatial transcriptomics deconvolution methods generalize well to spatial chromatin accessibility data	Laura D. Martens	In person	Sarah Ouologuem, Laura D. Martens, Anna C. Schaar, Maiia Shulman, Julien Gagneur, Fabian J. Theis	Motivation: Spatially resolved chromatin accessibility profiling offers the potential to investigate gene regulatory processes within the spatial context of tissues. However, current methods typically work at spot resolution, aggregating measurements from multiple cells, thereby obscuring cell-type-specific spatial patterns of accessibility. Spot deconvolution methods have been developed and extensively benchmarked for spatial transcriptomics, yet no dedicated methods exist for spatial chromatin accessibility, and it is unclear if RNA-based approaches are applicable to that modality. Results: Here, we demonstrate that these RNA-based approaches can be applied to spot-based chromatin accessibility data by a systematic evaluation of five top-performing spatial transcriptomics deconvolution methods. To assess performance, we developed a simulation framework that generates both transcriptomic and accessibility spot data from dissociated single-cell and targeted multiomic datasets, enabling direct comparisons across both data modalities. Our results show that Cell2location and RCTD, in contrast to other methods, exhibit robust performance on spatial chromatin accessibility data, achieving accuracy comparable to RNA-based deconvolution. Generally, we observed that RNA-based deconvolution exhibited slightly better performance compared to chromatin accessibility-based deconvolution, especially for resolving rare cell types, indicating room for future development of specialized methods. In conclusion, our findings demonstrate that existing deconvolution methods can be readily applied to chromatin accessibility-based spatial data. Our work provides a simulation framework and establishes a performance baseline to guide the development and evaluation of methods optimized for spatial epigenomics. Availability: All methods, simulation frameworks, peak selection strategies, analysis notebooks and scripts are available at https://github.com/theislab/deconvATAC.
2025-07-23	12:40:00	13:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Towards Personalized Epigenomics: Learning Shared Chromatin Landscapes and Joint De-Noising of Histone Modification Assays	Tanmayee Narendra	In person	Tanmayee Narendra, Giovanni Visonà, Crhistian de Jesus Cardona, James Abbott, Gabriele Schweikert	Epigenetic mechanisms enable cellular differentiation and the maintenance of distinct cell-types. They enable rapid responses to external signals through changes in gene regulation and their registration over longer time spans. Consequently, chromatin environments exhibit cell-type and individual specificity contributing to phenotypic diversity. Their genomic distributions are measured using ChIP-Seq and related methods. However, the chromatin landscape introduces significant biases into these measurements. Here, we introduce DecoDen to simultaneously learn shared chromatin landscapes while de-biasing individual measurement tracks. We demonstrate DecoDen's effectiveness on an integrative analysis of histone modification patterns across multiple tissues in personal epigenomes.
2025-07-23	14:00:00	14:20:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-Seq data using virtual colors for accurate genomic pseudoalignment	Noor Pratap Singh	In person	Noor Pratap Singh, Jamshed Khan, Rob Patro	Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a new and modified pseudoalignment scheme that partitions each reference into “virtual colors”. These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct “colors” from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately 3 times less memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of CellRanger.
2025-07-23	14:20:00	14:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Oarfish: Enhanced probabilistic modeling leads to improved accuracy in long read transcriptome quantification	Zahra Zare Jousheghani	In person	Zahra Zare Jousheghani, Noor Pratap Singh, Rob Patro	Motivation: Long read sequencing technology is becoming an increasingly indispensable tool in genomic and transcriptomic analysis. In transcriptomics in particular, long reads offer the possibility of sequencing full-length isoforms, which can vastly simplify the identification of novel transcripts and transcript quantification. However, despite this promise, the focus of much long read method development to date has been on transcript identification, with comparatively little attention paid to quantification. Yet, due to differences in the underlying protocols and technologies, lower throughput (i.e. fewer reads sequenced per sample compared to short read technologies), as well as technical artifacts, long read quantification remains a challenge, motivating the continued development and assessment of quantification methods tailored to this increasingly prevalent type of data. Results: We introduce a new method and corresponding user-friendly software tool for long read transcript quantification called oarfish. Our model incorporates a novel coverage score, which affects the conditional probability of fragment assignment in the underlying probabilistic model. We demonstrate, in both simulated and experimental data, that by accounting for this coverage information, oarfish is able to produce more accurate quantification estimates than existing long read quantification tools. Availability and Implementation: Oarfish is implemented in the Rust programming language, and is made available as free and open-source software under the BSD 3-clause license. The source code is available at https://www.github.com/COMBINE-lab/oarfish.
2025-07-23	14:40:00	15:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Identification of interactions defining 3D chromatin folding from micro to meso-scale	Leonardo Morelli	In person	Leonardo Morelli, Stefano Cretti, Davide Cittaro, Tiago P. Peixoto, Alessio Zippo	Understanding the structural principles of chromatin organization is a central challenge in computational epigenomics, largely due to the sparse, noisy, and complex nature of Hi-C data. Existing methods tend to focus either on local features, such as topologically associating domains (TADs), or global structures, like compartments. This methodological split often leads to poor agreement between models, limiting our ability to obtain a unified view of genome architecture. We introduce HiCONA, a novel graph-based framework that directly infers global 3D chromatin folding from both Hi-C contact maps and super resolution microscopy data. Unlike existing approaches, HiCONA optimizes a nested hierarchical representation of chromatin architecture by minimizing the entropy of the partition, thereby capturing the most informative and functionally relevant interactions. HiCONA enables simultaneous identification of topologically associating domains (TADs) and subcompartments using a single unified model, and performs robustly across gold-standard datasets. In benchmarking experiments, HiCONA recovers key chromatin contacts under both wild-type and cohesin-deficient conditions, offering insight into the structural consequences of architectural protein depletion. Furthermore, HiCONA provides a shared representation that facilitates direct comparison between imaging and sequencing-based data, bridging a major methodological gap in chromatin biology. By capturing chromatin folding from micro to mesoscale, HiCONA opens new avenues for understanding genome organization and its functional implications. This integrative and interpretable framework marks a significant advance in uncovering the forces that shape nuclear architecture, with potential applications in development, disease, and synthetic genome design.
2025-07-23	15:00:00	15:20:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	SpliSync: Genomic language model-driven splice site correction of long RNA reads	Liliana Florea	In person	Wui Wang Lui, Liliana Florea	We developed SpliSync, a deep learning method for accurate splice site correction in long read alignments. It combines a genomic language model, HyenaDNA, and a 1D U-net segmentation head, integrating genome sequence and alignment embeddings. SpliSync improves the detection of splice sites and introns and, when integrated with a short read transcript assembler, allows for improved transcript reconstruction, matching or outperforming reference methods like IsoQuant and FLAIR. The method shows promise for transcriptomic applications, especially in species with incomplete gene annotations or for discovering novel transcript variations.
2025-07-23	15:20:00	15:30:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	adverSCarial: a toolkit for exposing classifier vulnerabilities in single-cell transcriptomics	Ghislain Fievet	In person	Ghislain Fievet, Julien Broséus, David Meyre, Sébastien Hergalant	Adversarial attacks pose a significant risk to machine learning (ML) tools designed for classifying single-cell RNA-sequencing (scRNA-seq) data, with potential implications for biomedical research and future clinical applications. We present adverSCarial, a novel R package that evaluates the vulnerability of scRNA-seq classifiers to various adversarial perturbations, ranging from barely detectable, subtle changes in gene expression to large-scale modifications. We demonstrate how five representative classifiers, spanning marker-based, hierarchical, support vector machine, random forest, and neural network algorithms, respond to these attacks on four hallmark scRNA-seq datasets. Our findings reveal that all classifiers eventually fail under different amplitudes of perturbations, which depend on the ML algorithm they are based on and on the nature of the modifications. Beyond security concerns, adversarial attacks help uncover the inner decision-making mechanisms of the classifiers. The various attack modes and customizable parameters proposed in adverSCarial are useful to identify which gene or set of genes is crucial for correct classification and to highlight the genes that can be substantially altered without detection. These functionalities are critical for the development of more robust and interpretable models, a step toward integrating scRNA-seq classifiers into routine research and clinical workflows. The R package is freely available on Bioconductor (10.18129/B9.bioc.adverSCarial) and helps evaluate the vulnerabilities of scRNA-seq-based ML models in a computationally cheap and time-efficient framework.
2025-07-23	15:30:00	15:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Quality assessment of long read data in multisample lrRNA-seq experiments using SQANTI-reads	Netanya Keil	In person	Netanya Keil, Carolina Monzó, Lauren McIntyre, Ana Conesa	SQANTI-reads leverages SQANTI3, a tool for the analysis of the quality of transcript models, to develop a read-level quality control framework for replicated long-read RNA-seq experiments. The number and distribution of reads, as well as the number and distribution of unique junction chains (transcript splicing patterns), in SQANTI3 structural categories are informative of raw data quality. Multisample visualizations of QC metrics are presented by experimental design factors to identify outliers. We introduce new metrics for 1) the identification of potentially under-annotated genes and putative novel transcripts and for 2) quantifying variation in junction donors and acceptors. We applied SQANTI-reads to two different datasets, a Drosophila developmental experiment and a multiplatform dataset from the LRGASP project and demonstrate that the tool effectively reveals the impact of read coverage on data quality, and readily identifies strong and weak splicing sites. SQANTI-reads is open source and is available in versions ≥ 5.3.0 in the SQANTI3 GitHub repository.
2025-07-23	15:40:00	16:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Transcriptome Assembly at Single-Cell Resolution with Beaver	Qian Shi	Live stream	Qian Shi, Qimin Zhang, Mingfu Shao	Motivation: The established single-cell RNA sequencing technologies (scRNA-seq) have revolutionized biological and biomedical research by enabling the measurement of gene expression at single-cell resolution. However, the fundamental challenge of reconstructing full-length transcripts for individual cells remains unresolved. Existing single-sample assembly approaches cannot leverage shared information across cells while meta-assembly approaches often fail to strike a balance between consensus assembly and preserving cell-specific expression signatures. Results: We present Beaver, a cell-specific transcript assembler designed for short-read scRNA-seq data. Beaver implements a transcript fragment graph to organize individual assemblies and designs an efficient dynamic programming algorithm that searches for candidate full-length transcripts from the graph. Beaver incorporates two random forest models trained on 51 meticulously engineered features that accurately estimate the likelihood of each candidate transcript being expressed in individual cells. Our experiments, performed using both real and simulated Smart-seq3 scRNA-seq data, firmly show that Beaver substantially outperforms existing meta-assemblers and single-sample assemblers. At the same level of sensitivity, Beaver achieved 32.0%-64.6%, 13.5%-36.6%, and 9.8%-36.3% higher precision on average compared to meta-assemblers Aletsch, TransMeta, and PsiCLASS, respectively, with similar improvements over single-sample assemblers Scallop2 (10.1%-43.6%) and StringTie2 (24.3%-67.0%). Availability: Beaver is freely available at https://github.com/Shao-Group/beaver. Scripts that reproduce the experimental results of this manuscript are available at https://github.com/Shao-Group/beaver-test.
2025-07-23	16:40:00	17:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Bioinformatics analysis for long-read RNA sequencing: challenges and promises	Elizabeth Tseng	In person	Pacific Biosciences, TBA , Elizabeth Tseng	Long-read RNA sequencing has emerged as a powerful tool in transcriptomics, offering the ability to sequence full-length cDNAs—often exceeding 10 kb—without the need for transcript assembly. This capability shifted early bioinformatics efforts toward the discovery of novel isoforms, enabling the development of new nomenclature to describe isoform features previously undetectable by short reads. Renewed focus was also placed on identifying and filtering potential cDNA artifacts. With long read lengths and high accuracy, PacBio’s Iso-Seq data prompted new tool developments covering cancer fusion detection, direct open reading frame predictions, allele-specific isoform expression, and finally, differential isoform expression analyses. However, gaps remain in the tool space that need to be addressed with the advent of large, population-scale long-read RNA-Seq projects. In this talk, I will explore how Iso-Seq has propelled the long-read sequencing field forward, highlight the current challenges in tool development and data analysis, and discuss the promising avenues for discovery that lie ahead.
2025-07-23	17:00:00	18:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Quantifying RNA Expression and Modifications using Long Read RNA-Seq	Jonathan Göke	In person	Jonathan Göke	The human genome contains instructions to transcribe more than 200,000 RNAs. However, many RNA transcripts are generated from the same gene, resulting in alternative isoforms that are highly similar. Furthermore, the addition of post-transcriptional RNA modifications further impacts their function. The availability of long read RNA-Seq provides an opportunity to sequence entire RNA transcripts, enabling the analysis of individual RNA isoforms and their modifications. In this presentation I will show how the raw nanopore signal data can be used to identify and distinguish multiple RNA modifications from direct RNA-Seq data. I will also summarise new results from the Singapore Nanopore Expression Project (SG-NEx) and describe computational methods that analyse long read RNA-Seq data to estimate isoform expression, track full length reads, and identify novel isoforms at single cell and spatial resolution.
2025-07-24	08:40:00	09:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Pangenome based analysis of structural variation	Tobias Marschall	In person	Tobias Marschall	Breakthroughs in long-read sequencing technology and assembly methodology enable the routine de novo assembly of human genomes to near completion. Such assemblies open a door to exploring structural variation (SV) in previously inaccessible regions of the genome. The Human Pangenome Reference Consortium (HPRC) and the Human Genome Structural Variation Consortium (HGSVC) have produced high quality genome assemblies, which provide a basis for comparative genome analysis using pangenome graphs. First, we will ask how a pangenomic resource like this can be leveraged in order to better analyze structural variants in samples with short-read whole-genome sequencing (WGS) data. In a process called genome inference, implemented in the PanGenie software, we can use a pangenome reference to infer the haplotype sequences of individual genomes to a quality clearly superior to standard variant calling workflows. This process allows us to detect more than twice the number of structural variants per genome from short-read WGS and therefore provides an opportunity for genome-wide association studies to include these SVs. Second, we introduce Locityper, a tool specifically designed for targeted genotyping of complex loci using short and long-read whole genome sequencing. For each target, Locityper recruits and aligns reads to local haplotypes and finds the likeliest haplotype pair by optimizing read alignment, insert size and read depth profiles. Locityper accurately genotypes up to 194 of 256 challenging medically relevant loci (95% haplotypes at QV33), an 8.8-fold gain compared to 22 genes achieved with standard variant calling pipelines. Furthermore, Locityper provides access to hyperpolymorphic HLA genes and other gene families, including KIR, MUC and FCGR.
2025-07-24	09:40:00	10:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Latest advances in bioinformatics for Oxford Nanopore data	Mike Vella		Oxford Nanopore Technologies, TBA , Mike Vella	Oxford Nanopore Technologies has transformed genomics with real-time, long-read sequencing capable of detecting epigenetic modifications and structural variants at single-molecule resolution. This talk will cover recent advances in bioinformatics for nanopore data, including improvements in basecalling, signal-level analysis, assembly, and variant detection. Particular focus will be given to how deep learning and raw signal analysis are driving gains in accuracy and enabling new applications.
2025-07-24	11:20:00	11:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	CREMSA: Compressed Indexing of (Ultra) Large Multiple Sequence Alignments	Mikaël Salson	In person	Mikaël Salson, Arthur Boddaert, Awa Bousso Gueye, Laurent Bulteau, Yohan Hernandez-Courbevoie, Camille Marchet, Nan Pan, Sebastian Will, Yann Ponty	Recent viral outbreaks motivate a systematic collection of pathogenic genomes, including a strong focus on genomic RNA, in order to accelerate their study and monitor the emergence/spread of variants. Due to their limited length and temporal proximity of their collection, viral genomes are usually organized and analyzed as oversized Multiple Sequence Alignments (MSAs). Such MSAs are largely ungapped, and mostly homogeneous on a column-wise level but not at a sequential level due to local variations, hindering the performance of sequential compression algorithms. In order to enable an efficient manipulation of MSAs, including subsequent statistical analyses, we introduce CREMSA (Column-wise Run-length Encoding for Multiple Sequence Alignments), a new index that builds on sparse bitvector representations to compress an existing or streamed MSA, all the while allowing for an expressive set of accelerated requests to query the alignment without prior decompression. Using CREMSA, a 65GB MSA consisting of 1.9M SARS-CoV-2 genomes could be compressed into 22MB using less than half a gigabyte of main memory, while supporting access requests in the order of 100ns. Such a speed-up enables a comprehensive analysis of covariation over this very large MSA. We further assess the impact of the sequence ordering on the compressibility of MSAs and propose a resorting strategy that, despite the proven NP-hardness of an optimal sort, induces greatly increased compression ratios at a marginal computational cost.
2025-07-24	11:40:00	12:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Exploiting uniqueness: seed-chain-extend alignment on elastic founder graphs	Nicola Rizzo	In person	Nicola Rizzo, Manuel Cáceres, Veli Mäkinen	Sequence-to-graph alignment is a central challenge of computational pangenomics. To overcome the theoretical hardness of the problem, state-of-the-art tools use seed-and-extend or seed-chain-extend heuristics for alignment. We implement a complete seed-chain-extend alignment workflow based on indexable elastic founder graphs (iEFGs) that support linear-time exact searches unlike general graphs. We show how to construct iEFGs, find high-quality seeds, chain, and extend them at the scale of a telomere-to-telomere assembled human chromosome. Our sequence-to-graph alignment tool and the scripts to replicate our experiments are available at https://github.com/algbio/SRFAligner.
2025-07-24	12:00:00	12:20:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	FroM Superstring to Indexing: a space-efficient index for unconstrained k-mer sets using the Masked Burrows-Wheeler Transform (MBWT)	Ondřej Sladký	In person	Ondřej Sladký, Pavel Veselý, Karel Brinda	The exponential growth of DNA sequencing data calls for efficient solutions for storing and querying large-scale k-mer sets. While recent indexing approaches use spectrum-preserving string sets (SPSS), full-text indexes, or hashing, they often impose structural constraints or demand extensive parameter tuning, limiting their usability across different datasets and data types. Here, we propose FMSI, a minimally parametrized, highly space-efficient membership index and compressed dictionary for arbitrary k-mer sets. FMSI combines approximated shortest superstrings with the Masked Burrows-Wheeler Transform (MBWT). Unlike traditional methods, FMSI operates without predefined assumptions on k-mer overlap patterns but exploits them when available. We demonstrate that FMSI offers superior memory efficiency for processing queries over established indexes such as SSHash, Spectral Burrows-Wheeler Transform (SBWT), and Conway-Bromage-Lyndon (CBL), while supporting fast membership and dictionary queries. Depending on the dataset, k, or sampling, FMSI offers 2–3x space savings for processing queries over all state-of-the-art indexes; only a space-optimized SBWT (without indexing reverse complement) matches its memory efficiency in some cases but is 2–3x slower. Overall, this work establishes superstring-based indexing as a highly general, flexible, and scalable approach for genomic data, with direct applications in pangenomics, metagenomics, and large-scale genomic databases.
2025-07-24	12:20:00	12:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	The Alice assembler: dramatically accelerating genome assembly with MSR sketching	Roland Faure	In person	Roland Faure, Jean-François Flot, Dominique Lavenier	The PacBio HiFi technology and the R10.4 Oxford Nanopore flowcells are transforming the genomic world by producing for the first time long and accurate sequencing reads. The low error rate of these reads opens new avenues for computational optimizations. However, genome and particularly metagenome assembly using high-fidelity reads still faces challenges. Current assemblers (e.g., Flye, hifiasm, metaMDBG) struggle to efficiently resolve highly similar haplotypes (homologous chromosomes, bacterial strains, repeats) while maintaining computational speed, creating a gap between rapid and haplotype-resolved methods. We investigated this issue using several datasets, including a human gut microbiome sequencing dataset and a diploid genome, finding that hifiasm_meta and metaFlye required over a month of CPU time to produce an assembly, while metaMDBG, which collapses similar strains, assembles the same dataset in four days. We present Alice, a new assembler which introduces a new sequence sketching method called MSR sketching to bridge this gap and efficiently produce haplotype-resolved assemblies, for both genomic and metagenomic datasets. On the aforementioned human gut dataset, Alice completed the assembly in just 7 CPU hours. Furthermore, the analysis of the assemblies revealed that Alice missed <1% of abundant 31-mers (≥20x coverage), compared to >15% missed by both metaMDBG and hifiasm_meta. Overall, our results indicate that Alice accelerates assembly dramatically while providing high quality assemblies, offering a powerful new tool for the field.
2025-07-24	12:40:00	13:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	BINSEQ: A Family of High-Performance Binary Formats for Nucleotide Sequences	Noam Teyssier	In person	Noam Teyssier, Alexander Dobin	Modern genomics routinely generates billions of sequencing records per run, typically stored as gzip-compressed FASTQ files. This format's inherent limitations—single-threaded decompression and sequential parsing of irregularly sized records—create significant bottlenecks for bioinformatics applications that would benefit from parallel processing. We present BINSEQ, a family of simple binary formats designed for high-throughput parallel processing of sequencing data. The family includes BINSEQ, optimized for fixed-length reads with true random access capability through two-bit encoding, and VBINSEQ, supporting variable-length sequences with optional quality scores and block-based organization. Both formats natively handle paired-end reads, eliminating the need for synchronized files. Our comprehensive evaluation demonstrates that BINSEQ formats deliver substantial performance improvements across bioinformatics workflows while maintaining competitive storage efficiency. Both formats achieve up to 32x faster processing than compressed FASTQ and continue to scale with increasing thread counts where traditional formats quickly plateau due to I/O bottlenecks. These advantages extend to complex workflows like alignment, with BINSEQ formats showing 2-5x speedups at higher thread counts when tested with tools like minimap2 and STAR. Storage requirements remain comparable to or better than existing formats, with BINSEQ (610.35 MB) similar to gzip-compressed FASTA (647.29 MB) and VBINSEQ (509.89 MB) approaching CRAM (491.85 MB) efficiency. To facilitate adoption, we provide high-performance libraries, parallelization APIs, and conversion tools as free, open-source implementations. BINSEQ addresses fundamental inefficiencies in genomic data processing by considering modern parallel computing architectures.
2025-07-24	14:00:00	14:20:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Ultrafast and Ultralarge Multiple Sequence Alignments using TWILIGHT	Yu-Hsiang Tseng	In person	Yu-Hsiang Tseng, Sumit Walia, Yatish Turakhia	Motivation: Multiple sequence alignment (MSA) is a fundamental operation in bioinformatics, yet existing MSA tools are struggling to keep up with the speed and volume of incoming data. This is because the runtimes and memory requirements of current MSA tools become untenable when processing large numbers of long input sequences and they also fail to fully harness the parallelism provided by modern CPUs and GPUs. Results: We present TWILIGHT (Tall and Wide Alignments at High Throughput), a novel MSA tool optimized for speed, accuracy, scalability, and memory constraints, with both CPU and GPU support. TWILIGHT incorporates innovative parallelization and memory-efficiency strategies that enable it to build ultralarge alignments at high speed even on memory-constrained devices. On challenging datasets, TWILIGHT outperformed all other tools in speed and accuracy. It scaled beyond the limits of existing tools and performed an alignment of 1 million RNASim sequences within 30 minutes while utilizing less than 16 GB of memory. TWILIGHT is the first tool to align over 8 million publicly available SARS-CoV-2 sequences, setting a new standard for large-scale genomic alignment and data analysis. Availability: TWILIGHT’s code is freely available under the MIT license at https://github.com/TurakhiaLab/TWILIGHT. The test datasets and experimental results, including our alignment of 8 million SARS-CoV-2 sequences, are available at https://zenodo.org/records/14722035.
2025-07-24	14:20:00	14:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	GreedyMini: Generating low-density DNA minimizers	Arseny Shur	In person	Shay Golan, Ido Tziony, Matan Kraus, Yaron Orenstein, Arseny Shur	Motivation: Minimizers are the most popular k-mer selection scheme in algorithms and data structures analyzing high-throughput sequencing (HTS) data. In a minimizer scheme, the smallest k-mer by some predefined order is selected as the representative of a sequence window containing w consecutive k-mers, which results in overlapping windows often selecting the same k-mer. Minimizers that achieve the lowest frequency of selected k-mers over a random DNA sequence, termed the expected density, are desired for improved performance of HTS analyses. Yet, no method to date exists to generate minimizers that achieve minimum expected density. Moreover, for k and w values used by common HTS algorithms and data structures there is a gap between densities achieved by existing selection schemes and the theoretical lower bound. Results: We developed GreedyMini, a toolkit of methods to generate minimizers with low expected or particular density, to improve minimizers, to extend minimizers to larger alphabets, k, and w, and to measure the expected density of a given minimizer efficiently. We demonstrate over various combinations of k and w values, including those of popular HTS methods, that GreedyMini can generate DNA minimizers that achieve expected densities very close to the lower bound, and both expected and particular densities much lower compared to existing selection schemes. Moreover, we show that GreedyMini's k-mer rank-retrieval time is comparable to common k-mer hash functions. We expect GreedyMini to improve the performance of many HTS algorithms and data structures and advance the research of k-mer selection schemes.
2025-07-24	14:40:00	15:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	LYCEUM: Learning to call copy number variants on low coverage ancient genomes	Mehmet Alper Yilmaz	In person	Mehmet Alper Yilmaz, Ahmet Arda Ceylan, Gun Kaynar, A. Ercument Cicek	Motivation: Copy number variants (CNVs) are pivotal in driving phenotypic variation that facilitates species adaptation. They are significant contributors to various disorders, making ancient genomes crucial for uncovering the genetic origins of disease susceptibility across populations. However, detecting CNVs in ancient DNA (aDNA) samples poses substantial challenges due to several factors: (i) aDNA is often highly degraded; (ii) contamination from microbial DNA and DNA from closely related species introduces additional noise into sequencing data; and finally, (iii) the typically low coverage of aDNA renders accurate CNV detection particularly difficult. Conventional CNV calling algorithms, which are optimized for high coverage read-depth signals, underperform under such conditions. Results: To address these limitations, we introduce LYCEUM, the first machine learning-based CNV caller for aDNA. To overcome challenges related to data quality and scarcity, we employ a two-step training strategy. First, the model is pre-trained on whole genome sequencing data from the 1000 Genomes Project, teaching it CNV-calling capabilities similar to conventional methods. Next, the model is fine-tuned using high-confidence CNV calls derived from only a few existing high-coverage aDNA samples. During this stage, the model adapts to making CNV calls based on the downsampled read depth signals of the same aDNA samples. LYCEUM achieves accurate detection of CNVs even in typically low-coverage ancient genomes. We also observe that the segmental deletion calls made by LYCEUM show correlation with the demographic history of the samples and exhibit patterns of negative selection in line with natural selection. Availability: LYCEUM is available at https://github.com/ciceklab/LYCEUM.
2025-07-24	15:00:00	15:20:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	POPSICLE: a probabilistic method to capture uncertainty in single-cell copy-number calling	Lucrezia Patruno	In person	Lucrezia Patruno, Sophia Chirrane, Simone Zaccaria	During tumour evolution, cancer cells acquire somatic copy-number alterations (CNAs), which are frequent genomic alterations resulting in the amplification or deletion of large genomic regions. Recent single-cell technologies allow the accurate investigation of CNA rates and their underlying mechanism by performing whole-genome sequencing of thousands of individual cancer cells in parallel (scWGS-seq). While several methods have been developed to identify the most likely CNAs from scWGS-seq data, the high levels of variability in these data make the accurate inference of point estimates for CNAs (i.e., a single value for the most likely copy number) challenging. Moreover, given that variability increases with increasing copy numbers, this is especially true when considering high amplifications and highly aneuploid cells, which play a key role in cancer. However, to date, existing methods are limited to the inference of point estimates for CNAs in single cells and do not capture their related uncertainty. To address these limitations, we introduce POPSICLE, a novel probabilistic approach that computes the probability of having different copy numbers for every genomic region in each single cell. Using simulations, we show that POPSICLE improves ploidy and CNA inference for up to 20% of the genome in 90% of cells. Using a dataset comprising more than 60,000 breast and ovarian cancer cells, we show how POPSICLE leverages uncertainty to improve the identification of genes that are recurrently highly amplified and might play a key role in tumour progression.
2025-07-24	15:20:00	15:40:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	MutSuite: A Toolkit for Simulating and Evaluating Mutations in Aligned Sequencing Reads	Kendell Clement	In person	Kendell Clement	Simulated sequencing reads containing known mutations are essential for developing, testing, and benchmarking mutation detection tools. Most existing simulation tools introduce mutations into synthetic reads and then realign them to a reference genome prior to downstream analysis. However, this realignment step can obscure the true position of insertions and deletions, introducing ambiguity and potential error in evaluation. In particular, the alignment process can shift the apparent location of insertions and deletions, complicating efforts to assess recall and precision of variant callers. To address this limitation and support the development of more accurate and sensitive mutation detection algorithms, we developed MutSim, a tool that introduces substitutions, insertions, and deletions directly into aligned reads (e.g., in BAM files). By avoiding realignment, MutSim ensures that each simulated mutation remains at its exact specified position, enabling precise evaluation of variant caller performance. MutSim is part of a larger toolkit we call MutSuite, which also includes MutRun, a companion tool that automates the execution of variant calling software on simulated datasets, and MutAgg, which aggregates and summarizes results across multiple variant callers for performance comparison. Together, these tools provide a robust and flexible framework for mutation simulation and benchmarking. MutSuite is open-source and freely available at: https://github.com/clementlab/mutsuite.
2025-07-24	15:40:00	16:00:00	01A	HiTSeq: High Throughput Sequencing Algorithms & Applications	Landscape of The Dark Genome’s variants and their influence on cancer	Joao P. C. R. Mendonca	In person	Joao P. C. R. Mendonca, Kristoffer Staal Rohrberg, Peter Holst, Frederik Otzen Bagger	Human endogenous retroviruses (HERVs) are remnants of ancient viral infections that now make up ~8% of the human genome. Although typically silenced, HERVs can become reactivated in cancer and are emerging as biomarkers and immunotherapeutic targets. However, their clinical utility is limited by challenges in resolving individual loci due to high sequence similarity, incomplete genome annotations, and an overreliance on linear reference genomes. To address this, we constructed a variational pangenome using long-read sequencing data from Genome in a Bottle and the Platinum Pedigree projects. This approach enables accurate detection of single nucleotide variants (SNVs), insertions/deletions (indels), and structural variants (SVs) in a reference-free manner, revealing polymorphic HERV insertions absent from the human reference genome. By integrating data from the Copenhagen Prospective Personalized Oncology (CoPPO) biobank, we link these variants to HERV expression in cancer, distinguishing potentially pathogenic variants from benign ones. We combine pangenome-informed annotations with locus-specific expression quantification tools to resolve HERV transcription at individual loci and connect specific sequence variants to tumorigenesis and immune modulation. Our findings enhance the resolution of HERV mapping across individuals and cancer types, uncovering previously inaccessible variation in a historically overlooked portion of the genome. This work not only improves our understanding of HERV-driven disease mechanisms but also lays the groundwork for variant-informed biomarker discovery and therapeutic targeting in precision oncology.
