HiTSeq

Attention Presenters: please review the Speaker Information Page.

NOTICE: HiTSeq sessions change rooms between conference days.

Schedule subject to change
All times listed are in CEST
Tuesday, July 25th
10:35-11:30
Invited Presentation: Using single cell data to understand disease and cell type differences across species
Room: Salle Saint Claire 3
Format: Live from venue

  • Irene Papatheodorou


Presentation Overview:

EMBL-EBI’s Expression Atlas is an added-value knowledge base that enables researchers to answer the questions of where (tissue, organism part, developmental stage, cell type) and under which conditions (disease, treatment, gender, etc.) a gene or protein of interest is expressed. Expression Atlas brings together data from >4500 expression studies from >65 different species, across different conditions and tissues. It makes these data freely available in a form that is easy to visualise, re-analysed via standardised pipelines that rely on open-source, community-developed tools. I will focus specifically on the single-cell component of the resource, the Single Cell Expression Atlas (SCEA), which enables gene and cell type queries across hundreds of scRNA-Seq studies. Finally, I will discuss the opportunities and challenges arising from the increasing volume of available single-cell data: how they provide new insights into disease at the cell type level, and how we can begin to look more closely at similarities and differences of cell types across species.

11:30-11:50
Detecting Chromosomal Translocations using Augmented Genome Sequence Graphs
Room: Salle Saint Claire 3
Format: Live from venue

  • Alister D'Costa, Ontario Institute for Cancer Research; University of Toronto, Dept of Computer Science, Canada
  • Philip Zuzarte, Ontario Institute for Cancer Research, Canada
  • Michael Molnar, Ontario Institute for Cancer Research, Canada
  • Tracy Murphy, University Health Network, Canada
  • Mark Minden, University Health Network, Canada
  • Yun William Yu, University of Toronto, Dept of Mathematics, Canada
  • Jared Simpson, Ontario Institute for Cancer Research; University of Toronto, Dept of Computer Science, Dept of Molecular Genetics, Canada


Presentation Overview:

Chromosomal translocations have the potential to generate fusion proteins and disrupt gene expression. Current methods to identify translocations from long DNA sequencing reads typically rely on alignments to a linear reference, with a read requiring a high-quality alignment to two different genomic positions. While effective, reference-based translocation detection may generate false positive calls tied to polymorphic insertions. In this work we show that augmented genome sequence graphs, which contain known variation not found in the linear reference genome, can be used to detect chromosomal translocations with far fewer false positive calls, and in less time, than existing state-of-the-art methods. Demonstrating our method on a set of leukemia samples with known translocations, we show a large decrease in the number of false positive translocation calls using a graph-based translocation detection approach.

11:50-12:10
VeChat: Correcting errors in long reads using variation graphs
Room: Salle Saint Claire 3
Format: Live from venue

  • Xiao Luo, Bielefeld University, Germany
  • Xiongbin Kang, Bielefeld University, Germany
  • Alexander Schoenhuth, Bielefeld University, Germany


Presentation Overview:

Error correction is the canonical first step in long-read sequencing data analysis. Current self-correction methods, however, are affected by consensus-sequence-induced biases that mask true variants in lower-frequency haplotypes present in mixed samples. Unlike consensus sequence templates, graph-based reference systems are not affected by such biases, so they do not mistakenly mask true variants as errors. We present VeChat, an approach that implements this idea: VeChat is based on variation graphs, a popular type of data structure for pangenome reference systems. Extensive benchmarking experiments demonstrate that long reads corrected by VeChat contain 4 to 15 times (Pacific Biosciences) and 1 to 10 times (Oxford Nanopore Technologies) fewer errors than reads corrected by state-of-the-art approaches. Further, using VeChat prior to long-read assembly significantly improves the haplotype awareness of the assemblies. VeChat is an easy-to-use open-source tool, publicly available at https://github.com/HaploKit/vechat.

12:10-12:30
Proceedings Presentation: RawHash: Enabling Fast and Accurate Real-Time Analysis of Raw Nanopore Signals for Large Genomes
Room: Salle Saint Claire 3
Format: Live from venue

  • Can Firtina, ETH Zurich, Switzerland
  • Nika Mansouri Ghiasi, ETH Zurich, Switzerland
  • Joel Lindegger, ETH Zurich, Switzerland
  • Gagandeep Singh, ETH Zurich, Switzerland
  • Meryem Banu Cavlak, ETH Zurich, Switzerland
  • Haiyu Mao, ETH Zurich, Switzerland
  • Onur Mutlu, ETH Zurich, Switzerland


Presentation Overview:

Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either 1) require powerful computational resources that may not be available for portable sequencers or 2) lack scalability for large genomes, rendering them inaccurate or ineffective.

We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value.

We evaluate RawHash on three applications: 1) read mapping, 2) relative abundance estimation, and 3) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides 1) 25.8x and 3.4x better average throughput and 2) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
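The central trick described above, quantizing noisy raw-signal values so that nearby measurements collapse to the same bucket before hashing, can be sketched in a few lines of Python. This is an illustration of the idea only; the step size, event values, and hash function below are stand-ins, not RawHash's actual pipeline.

```python
import hashlib

def quantize(signal, step=0.4):
    # Collapse nearby raw-signal values into the same bucket so that
    # slight measurement noise does not change the downstream hash.
    return tuple(round(x / step) for x in signal)

def signal_hash(signal, step=0.4):
    # Hash the quantized event sequence; two noisy reads of the same
    # DNA content should produce identical quantized tuples.
    q = quantize(signal, step)
    return hashlib.sha1(repr(q).encode()).hexdigest()

a = [0.81, 1.62, 0.40, 2.05]   # reference signal events
b = [0.83, 1.58, 0.41, 2.02]   # same events with slight noise
assert signal_hash(a) == signal_hash(b)
```

With a matching hash, a standard hash-based index can then look up candidate reference positions in constant time per event, which is what makes real-time analysis of large genomes feasible.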

13:40-14:10
ALLSTAR: Inference of ReliAble CausaL RuLes between Somatic MuTAtions and CanceR Phenotypes
Room: Salle Saint Claire 3
Format: Live from venue

  • Dario Simionato, Department of Information Engineering, University of Padua, Italy
  • Antonio Collesei, Veneto Institute of Oncology (IOV-IRCCS), Italy
  • Federica Miglietta, Veneto Institute of Oncology (IOV-IRCCS), Padua, Italy
  • Fabio Vandin, Department of Information Engineering, University of Padua, Italy


Presentation Overview:

Recent advances in DNA sequencing technologies have allowed the detailed characterization of whole exomes and whole genomes in large cohorts of tumors. These studies have highlighted the extreme heterogeneity of somatic mutations between tumors. Such heterogeneity hinders our ability to identify alterations important for the disease. Several tools have been developed to identify somatic mutations related to cancer phenotypes. However, such tools identify only correlations, with no guarantee of highlighting causal relations.
We describe ALLSTAR, a novel tool to infer reliable causal relations between somatic mutations and cancer phenotypes. In particular, our tool identifies reliable causal rules highlighting combinations of somatic mutations with the highest impact in terms of average effect on the phenotype. While we prove that the underlying computational problem is NP-hard, we develop a branch-and-bound approach that employs PPI networks and novel bounds for pruning the search space, while correcting for multiple hypothesis testing.
Our extensive experimental evaluation on synthetic data shows that ALLSTAR is able to identify reliable causal relations in large cancer cohorts. Moreover, the reliable causal rules identified by our tool in cancer data show that ALLSTAR identifies several somatic mutations known to be relevant for cancer phenotypes as well as novel biologically meaningful relations.

14:10-14:30
Estimate mutational signature exposure from sparse clinical sequencing data.
Room: Salle Saint Claire 3
Format: Live from venue

  • Arnab Chakrabarti, Centrum für Integrierte Onkologie (CIO) Köln, RWTH Aachen University, Germany
  • Hiroshi Hamano, RWTH Aachen University, Germany
  • Lancelot Seillier, Centrum für Integrierte Onkologie (CIO) Köln, Uniklinik Köln, Germany
  • Kjong-Van Lehmann, Centrum für Integrierte Onkologie (CIO) Köln, Uniklinik Köln, Germany


Presentation Overview:

A typical analysis estimates the presence of known mutational signatures in each sample. However, current approaches rely on a large number of mutations to accurately estimate mutational signature exposure. Making this analysis possible when only sparse mutation data are available, such as data generated from panel sequencing or samples with low mutational burden, requires novel developments in the current methodologies for estimating mutational signature exposures. Here we present our work on assessing signature exposures using a novel predictive modeling approach. Our strategy follows two main steps. First, using a statistical model, we identify relevant signals from cancer mutations based on a mutational signature reference catalog (e.g., COSMIC [2]). Second, we use these mutational signals to train a predictive model. The model aims to estimate regions of the cancer genome sequence that are informative with respect to mutational signatures, which are then considered when estimating the mutational signature exposure of a single sample.
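For readers unfamiliar with signature refitting, the conventional step this work builds on — decomposing a sample's mutation counts into non-negative exposures over a fixed signature catalog — can be sketched with simple multiplicative updates. The toy signatures and counts below are invented for illustration; this is a generic NMF-style refit, not the authors' predictive model.

```python
def estimate_exposures(signatures, counts, iters=2000):
    # Fit non-negative exposures e so that sum_j e[j] * signatures[j]
    # approximates the observed counts, via multiplicative updates
    # (equivalent to EM for a mixture of multinomials).
    n_sig, n_ch = len(signatures), len(counts)
    e = [1.0] * n_sig
    for _ in range(iters):
        recon = [sum(e[j] * signatures[j][c] for j in range(n_sig))
                 for c in range(n_ch)]
        for j in range(n_sig):
            num = sum(signatures[j][c] * counts[c] / max(recon[c], 1e-12)
                      for c in range(n_ch))
            den = sum(signatures[j][c] for c in range(n_ch))
            e[j] *= num / max(den, 1e-12)
    return e

# Two toy signatures over four mutation channels (each row sums to 1).
S = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.7]]
v = [71, 20, 20, 89]            # observed mutation counts
exposures = estimate_exposures(S, v)   # converges near [85, 115]
```

With only a handful of mutations (the sparse-data regime the talk addresses), `v` becomes too noisy for such a refit to be stable, which motivates the predictive-model approach.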

14:30-14:50
PEKORA: High-Performance 3D Genome Reconstruction Using K-th Order Spearman's Rank Correlation Approximation
Room: Salle Saint Claire 3
Format: Live from venue

  • Yeremia Gunawan Adhisantoso, Leibniz University Hannover, Germany
  • Jan Voges, Leibniz University Hannover, Germany
  • Jörn Ostermann, Leibniz University Hannover, Germany


Presentation Overview:

Advances in high-throughput sequencing technologies have enabled the use of genomic information to better understand biological processes through studies such as genome-wide association studies, polygenic risk score estimation and chromosome conformation capture. The study of the spatial chromosome organization of the human genome plays an important role in understanding gene regulation. Chromosome conformation capture techniques, such as Hi-C, can capture long-range interactions between all pairs of loci on all chromosomes. These techniques have revealed structures of genome organization, such as A/B compartments, topologically associated domains, chromatin loops and frequently interacting regions.
Although the advancement of Hi-C techniques enables the generation of massive amounts of high-resolution data, we face several challenges, such as a high proportion of missing data and noisy observed interaction frequencies. It is therefore currently unfeasible to reconstruct high-resolution genome structures efficiently and at high accuracy using existing state-of-the-art methods. To remedy this situation, we present PEKORA, a high-performance 3D genome reconstruction method using k-th order Spearman's rank correlation approximation. PEKORA outperforms the state of the art by a large margin of 35% on average.
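As background for the method's name: Spearman's rank correlation is simply the Pearson correlation of rank-transformed vectors, which makes it robust to the noisy, non-linear interaction frequencies mentioned above. The sketch below shows plain Spearman correlation with midrank tie handling; PEKORA's k-th order approximation of it is not reproduced here.

```python
def rank(values):
    # Assign average ranks (midranks) to handle ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    # Pearson correlation of the rank-transformed vectors.
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)
```

Because ranks depend only on the ordering of interaction frequencies, any monotone distortion of the Hi-C signal leaves the correlation unchanged.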

14:50-15:10
Modeling fragment counts improves single-cell ATAC-seq analysis
Room: Salle Saint Claire 3
Format: Live from venue

  • Laura Martens, Technical University Munich, Germany
  • David Fischer, Broad Institute, Germany
  • Vicente Yépez, Technical University Munich, Germany
  • Fabian Theis, Helmholtz Center Munich, Germany
  • Julien Gagneur, Technical University Munich, Germany


Presentation Overview:

Single-cell ATAC-sequencing (scATAC-seq) is a powerful technique for studying chromatin regulation at the single-cell level. Typically, scATAC-seq data is binarized to indicate open chromatin regions, but the implications of this binarization are not well-understood. In this study, we demonstrate that a quantitative treatment of scATAC-seq data improves the goodness-of-fit of existing models and their applications, including clustering, cell type identification, and batch integration. Our contribution is twofold. First, we show that fragment counts, but not read counts, can be modeled using standard count distributions. Second, we compare the effects of binarization versus a count-based model (PoissonVAE) on scATAC-seq data using publicly available datasets and highlight the biological effects that are missed by a binary treatment. We show that high count peaks in scATAC-seq data correspond to important regulatory regions such as super-enhancers and highly transcribed promoters, similar to observations in bulk ATAC-seq data. Furthermore, we demonstrate that fragment counts in promoter regions correlate with gene expression, emphasizing a quantitative signal in promoter accessibility. Our results have significant implications for scATAC-seq analysis, suggesting that handling the data quantitatively can improve the accuracy of machine learning models used for investigating single-cell regulation.
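The contrast between binarized and count-based treatments can be made concrete with a toy example (the peak values below are invented; the PoissonVAE itself is not reproduced): binarization maps an ordinary peak and a high-count peak to the same vector, while even the simplest count model, a per-peak Poisson rate fit by maximum likelihood, keeps them apart.

```python
import math

def poisson_logpmf(k, lam):
    # log P(K = k) for a Poisson distribution with rate lam.
    return k * math.log(lam) - lam - math.lgamma(k + 1)

# Toy per-cell fragment counts for two accessible peaks.
peak_enhancer = [1, 1, 2, 1, 1]    # ordinary peak
peak_super    = [6, 8, 7, 9, 6]    # high-count (super-enhancer-like) peak

binar_a = [1 if c > 0 else 0 for c in peak_enhancer]
binar_b = [1 if c > 0 else 0 for c in peak_super]
assert binar_a == binar_b          # the binary view cannot tell them apart

# The Poisson MLE (the per-peak mean) retains the quantitative signal.
lam_a = sum(peak_enhancer) / len(peak_enhancer)
lam_b = sum(peak_super) / len(peak_super)
assert lam_b > lam_a
```

The same quantitative signal is what lets fragment counts in promoters correlate with gene expression, as reported in the abstract.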

15:10-15:30
Proceedings Presentation: Deep statistical modelling of nanopore sequencing translocation times reveals latent non-B DNA structures
Room: Salle Saint Claire 3
Format: Live from venue

  • Marjan Hosseini, University of Connecticut, United States
  • Aaron Palmer, University of Connecticut, United States
  • William Manka, University of Connecticut, United States
  • Patrick GS Grady, University of Connecticut, United States
  • Venkata Patchigolla, University of Connecticut, United States
  • Jinbo Bi, University of Connecticut, United States
  • Rachel O'Neill, University of Connecticut, United States
  • Zhiyi Chi, University of Connecticut, United States
  • Derek Aguiar, University of Connecticut, United States


Presentation Overview:

Motivation: Non-canonical (or non-B) DNA are genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA play an important role in basic cellular processes and are associated with genomic instability, gene regulation, and oncogenesis. Experimental methods are low-throughput and can detect only a limited set of non-B DNA structures, while computational methods rely on non-B DNA base motifs, which are necessary but not sufficient indicators of non-B structures. Oxford Nanopore sequencing is an efficient and low-cost platform, but it is currently unknown whether nanopore reads can be used for identifying non-B structures.
Results: We build the first computational pipeline to predict non-B DNA structures from nanopore sequencing. We formalize non-B detection as a novelty detection problem and develop the GoFAE-DND, an autoencoder (AE) that uses goodness-of-fit (GoF) tests as a regularizer. A discriminative loss encourages non-B DNA to be poorly reconstructed and optimizing Gaussian GoF tests allows for the computation of p-values that indicate non-B structures. Based on whole genome nanopore sequencing of NA12878, we show that there exist significant differences between the timing of DNA translocation for non-B DNA bases compared to B-DNA. We demonstrate the efficacy of our approach through comparisons with novelty detection methods using experimental data and data synthesized from a new translocation time simulator. Experimental validations suggest that reliable detection of non-B DNA from nanopore sequencing is achievable.
Availability: Source code is available at https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.
Contact: {marjan.hosseini, aaron.palmer, derek.aguiar}@uconn.edu
Supplementary information: Supplementary data are available at Bioinformatics online.

16:00-16:20
SuperSampler: efficient scaled sketches for metagenomics and extensive genomics compositional analysis
Room: Salle Saint Claire 3
Format: Live from venue

  • Timothé Rouzé, CNRS, Univ Lille, France
  • Camille Marchet, CNRS, France
  • Antoine Limasset, CNRS, France


Presentation Overview:

A challenge for bioinformatics is to keep up with the amount of data generated by high-throughput sequencing.
Being able to compare such volumes of data remains a scalability challenge which is the focus of many methodological papers.
To achieve drastic memory cost reduction, a possibility is to transform documents into "sketches" of highly reduced size that can be quickly compared to compute document similarity with bounded error.
The most widely used tools rely on fixed-size sketches built with techniques such as MinHash or HyperLogLog.
However, those techniques have relatively poor accuracy when the compared datasets are very dissimilar in size or content.
To cope with this problem, novel methods propose to construct adaptive sketches, scaling linearly with the size of the input, by selecting a fraction of the documents' k-mers.
Several techniques were proposed to perform uniform sub-sampling with theoretical guarantees, such as modimizers/mod-minhash and scaled minhash/FracMinHash.
With SuperSampler, we improve such schemes by combining them with the concept of super-k-mers, thus drastically reducing resource usage (CPU, memory, disk).
In this poster, we show that SuperSampler can use an order of magnitude fewer resources than the state of the art with equivalent results.
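The scaled-sketch idea the abstract builds on can be sketched directly: keep every distinct k-mer whose hash falls below a fixed fraction of the hash space, so sketch size scales linearly with input complexity and sketches of very differently sized inputs remain comparable. This shows the plain FracMinHash/scaled-minhash scheme only; SuperSampler's super-k-mer packing is not reproduced here, and the sequence is randomly generated for illustration.

```python
import hashlib
import random

def h64(kmer):
    # Stable 64-bit hash of a k-mer (any well-mixed hash works).
    return int.from_bytes(
        hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")

def frac_sketch(seq, k=21, scale=0.05):
    # Keep each distinct k-mer whose hash is below scale * 2^64:
    # a uniform subsample of the k-mer set with expected size
    # scale * (#distinct k-mers).
    threshold = int(scale * 2**64)
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {h64(km) for km in kmers if h64(km) < threshold}

def jaccard_estimate(a, b):
    # Jaccard of the sketches estimates Jaccard of the k-mer sets.
    return len(a & b) / len(a | b) if a | b else 0.0

random.seed(7)
genome = ''.join(random.choice("ACGT") for _ in range(5000))
s1 = frac_sketch(genome)
s2 = frac_sketch(genome[:3000])    # a contained subsequence
```

Because the same fixed threshold is applied to both inputs, `s2` is a subset of `s1` here, which is what makes containment queries between datasets of dissimilar size meaningful — the failure mode of fixed-size MinHash sketches noted above.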

16:20-16:40
Proceedings Presentation: Locality-Preserving Minimal Perfect Hashing of K-Mers
Room: Salle Saint Claire 3
Format: Live from venue

  • Giulio Ermanno Pibiri, Ca' Foscari University of Venice, Italy
  • Yoshihiro Shibuya, University Gustave Eiffel, France
  • Antoine Limasset, CNRS, France


Presentation Overview:

Motivation: Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1,...,n} bijectively. It is well known that n log2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k−1 symbols, it seems possible to beat the classic log2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, so as to preserve as much as possible their relationships also in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in better evaluation time when querying consecutive k-mers.
Results: Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
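The locality property can be illustrated with a deliberately naive stand-in: assign each distinct k-mer the rank of its first occurrence along the string, so consecutive k-mers mostly receive consecutive addresses. A real MPHF achieves this in a few bits per key rather than with an explicit hash table, so the code below only demonstrates the desired mapping, not the paper's construction.

```python
def kmers(s, k):
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def locality_mph(reference, k):
    # Toy locality-preserving minimal perfect hash: each distinct k-mer
    # gets the rank of its first occurrence, yielding a bijection onto
    # {0, ..., n-1} in which consecutive k-mers usually map to
    # consecutive addresses.
    table, nxt = {}, 0
    for km in kmers(reference, k):
        if km not in table:
            table[km] = nxt
            nxt += 1
    return table

ref = "ACGGTACGA"
f = locality_mph(ref, 3)
# Querying consecutive k-mers of ref touches consecutive addresses,
# which is the locality of reference the paper exploits.
addrs = [f[km] for km in kmers(ref, 3)]
```

In the real data structure the address stream restarts only where a k-mer repeats, so cache behaviour stays favourable when streaming queries over a read.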

16:40-17:00
Building a Pangenome Alignment Index via Recursive Prefix-Free Parsing
Room: Salle Saint Claire 3
Format: Live from venue

  • Marco Oliva, University of Florida, United States
  • Travis Gagie, Dalhousie University, Canada
  • Christina Boucher, University of Florida, United States


Presentation Overview:

Pangenome alignment has emerged as an opportunity to reduce bias in biomedical research. Traditionally, short-read aligners---such as Bowtie and BWA---were used to index a single reference genome, which was then used to find approximate alignments of reads to that genome. Unfortunately, these methods can only index a small number of genomes. Moni, an emerging pangenomic aligner, uses a preprocessing technique called prefix-free parsing to build a dictionary and a parse from the input---these, in turn, are used to build the main run-length encoded BWT (RLBWT) and suffix array of the input. This is accomplished in space linear in the size of the dictionary and parse. Therein lies the open problem that we tackle in this paper: although the dictionary scales sub-linearly with the size of the input, the parse becomes orders of magnitude larger than the dictionary. To scale the construction of Moni, we need to remove the parse from the construction of the RLBWT and suffix array. We solve this problem and demonstrate that this improves the construction time and memory requirements, allowing us to build the RLBWT and suffix array for 1000 diploid human haplotypes from the 1000 Genomes Project using less than 600GB of memory.
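For orientation, prefix-free parsing itself can be sketched in a few lines: the text is cut at "trigger" windows (substrings whose hash is 0 mod p), consecutive phrases overlap by w characters, the dictionary is the set of distinct phrases, and the parse is the sequence of phrase ids. The window length, modulus, and hash below are simplified placeholders, not the parameters of the actual Moni/Big-BWT implementation.

```python
import hashlib

def _h(s, p):
    # Deterministic window hash (Python's built-in str hash is salted).
    return int.from_bytes(
        hashlib.blake2b(s.encode(), digest_size=4).digest(), "big") % p

def prefix_free_parse(text, w=2, p=3):
    # Pad with sentinels, cut at trigger windows, and emit phrases that
    # overlap by w characters; repeated text yields repeated phrase ids,
    # which is why the dictionary stays small on pangenomic input.
    padded = "$" * w + text + "$" * w
    cuts = [0] + [i for i in range(1, len(padded) - w)
                  if _h(padded[i:i + w], p) == 0] + [len(padded) - w]
    phrases = [padded[a:b + w] for a, b in zip(cuts, cuts[1:])]
    dictionary = sorted(set(phrases))
    parse = [dictionary.index(ph) for ph in phrases]
    return dictionary, parse

d, parse = prefix_free_parse("GATTACAGATTACAGATTACA")
# Gluing the phrases back together (dropping each w-overlap) recovers
# the padded text, so the pair (dictionary, parse) is lossless.
```

On highly repetitive input such as 1000 haplotypes, the dictionary grows sub-linearly while the parse (the id sequence) is what this paper removes from the RLBWT construction.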

17:00-17:20
Proceedings Presentation: Effects of Spaced k-mers on Alignment-Free Genotyping
Room: Salle Saint Claire 3
Format: Live from venue

  • Hartmut Häntze, National Cheng Kung University, Taiwan
  • Paul Horton, National Cheng Kung University, Taiwan


Presentation Overview:

Motivation: Alignment-free, k-mer-based genotyping methods are a fast alternative to alignment-based methods and are particularly well suited for genotyping larger cohorts. The sensitivity of algorithms that work with k-mers can be increased by using spaced seeds; however, the application of spaced seeds in k-mer-based genotyping methods has not yet been researched.
Results: We add spaced-seed functionality to the genotyping software PanGenie and use it to calculate genotypes. This significantly improves sensitivity and F-score when genotyping SNPs, indels and structural variants on reads with low (5x) and high (30x) coverage. The improvements are greater than what could be achieved by just increasing the length of contiguous k-mers, and effect sizes are particularly large for low-coverage data. If applications implement effective algorithms for hashing spaced k-mers, spaced k-mers have the potential to become a useful technique in k-mer-based genotyping.
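The sensitivity gain of spaced seeds comes from don't-care positions: a mismatch at a masked position does not break the seed, whereas it destroys every contiguous k-mer covering it. The minimal sketch below (mask and sequences invented for illustration; not PanGenie's implementation) shows a substitution that breaks the contiguous 5-mer but leaves the weight-4 spaced seed intact.

```python
def spaced_seeds(seq, mask):
    # Extract spaced seeds: mask positions set to 1 contribute to the
    # seed, positions set to 0 are don't-care and tolerate mismatches.
    span = len(mask)
    keep = [i for i, m in enumerate(mask) if m == 1]
    return [''.join(seq[j + i] for i in keep)
            for j in range(len(seq) - span + 1)]

ref  = "ACGTACGT"
read = "ACGAACGT"        # single substitution at position 3
mask = [1, 1, 1, 0, 1]   # weight-4 seed spanning 5 bases
# The contiguous 5-mer at offset 0 differs between ref and read, but the
# spaced seed masks the mismatching position, so a shared seed survives.
```

In a k-mer-based genotyper, that surviving seed is exactly what keeps a variant-supporting read countable despite a nearby sequencing error.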

17:20-17:40
Proceedings Presentation: Seeding with Minimized Subsequence
Room: Salle Saint Claire 3
Format: Live from venue

  • Xiang Li, Department of Computer Science and Engineering, The Pennsylvania State University, United States
  • Qian Shi, Department of Computer Science and Engineering, The Pennsylvania State University, United States
  • Ke Chen, Department of Computer Science and Engineering, The Pennsylvania State University, United States
  • Mingfu Shao, Department of Computer Science and Engineering, The Pennsylvania State University, United States


Presentation Overview:

Modern methods for computation-intensive tasks in sequence analysis (e.g., read mapping, sequence alignment, genome assembly) often first transform each sequence into a list of short, regular-length seeds so that compact data structures and efficient algorithms can be employed to handle the ever-growing large-scale data. Seeding methods using k-mers have gained tremendous success in processing sequencing data with low mutation/error rates. However, they are much less effective for sequencing data with high error rates, as k-mers cannot tolerate errors. We propose SubseqHash, a strategy that uses subsequences, rather than substrings, as seeds. Formally, SubseqHash maps a string of length n to its smallest subsequence of length k, k < n, according to a given order over all length-k strings. Finding the smallest subsequence of a string by enumeration is impractical, as the number of subsequences grows exponentially. To overcome this barrier, we propose a novel algorithmic framework that consists of a specifically designed order (termed the ABC order) and an algorithm that computes the minimized subsequence under an ABC order in polynomial time. We first show that the ABC order exhibits the desired property and that the probability of hash collision under the ABC order is close to a theoretical upper bound. We then show that SubseqHash overwhelmingly outperforms substring-based seeding methods in producing high-quality seeds for three critical applications: read mapping, sequence alignment, and overlap detection. SubseqHash presents a major algorithmic breakthrough for tackling high error rates, and we expect it to be widely adopted for long-read analysis.
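To see why a minimized subsequence can be computed without enumeration, consider the special case of plain lexicographic order, where a classic greedy suffices: at each step pick the smallest admissible character that still leaves enough characters to finish. SubseqHash's ABC orders are different (and are what give its collision guarantees), but they share this polynomial-time flavor; the sketch below covers only the lexicographic special case.

```python
def min_subsequence_lex(s, k):
    # Smallest length-k subsequence of s under lexicographic order.
    # At step j we may pick any position in start .. len(s)-(k-j),
    # since k-j-1 characters must remain after our choice.
    res, start = [], 0
    for j in range(k):
        end = len(s) - (k - j) + 1
        best = min(range(start, end), key=lambda i: (s[i], i))
        res.append(s[best])
        start = best + 1
    return ''.join(res)
```

For example, `min_subsequence_lex("CABAB", 3)` picks the two A's and the final B, a seed no contiguous 3-mer of the string equals; an inserted or deleted character shifts contiguous k-mers everywhere but often leaves the minimized subsequence unchanged, which is the intuition behind subsequence seeding.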

17:40-18:00
Proceedings Presentation: Scalable sequence database search using Partitioned Aggregated Bloom Comb-Trees
Room: Salle Saint Claire 3
Format: Live from venue

  • Camille Marchet, CNRS, France
  • Antoine Limasset, CNRS, France


Presentation Overview:

The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every two years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making immense public resources accessible is beyond the reach of alignment-based strategies. In recent years, abundant literature has tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants with scalability to collections of up to 10,000 eukaryotic samples. Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3 to 6 fold improvement in construction time compared to other compressed methods for comparable index size. A PAC query needs as little as a single random access and can be performed in constant time in favorable instances. Using limited computational resources, we built PAC for very large collections, including 32,000 human RNA-seq samples in five days and the entire GenBank bacterial genome collection in a single day, for an index size of 3.5TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also demonstrate PAC's ability to query 500,000 transcript sequences in less than an hour. PAC's open-source software is available at https://github.com/Malfoy/PAC.
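The building block behind such approximate membership query structures is the Bloom filter: a bit array that answers "possibly present" or "definitely absent" with no false negatives. The minimal version below illustrates that primitive only; PAC's contribution is how many such filters are partitioned and aggregated into comb-trees, which is not reproduced here.

```python
import hashlib

class BloomFilter:
    # Minimal Bloom filter: k bit positions per item, derived from one
    # strong hash. Membership tests never miss inserted items; absent
    # items are reported present only with small probability.
    def __init__(self, m=1 << 16, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8)

    def _positions(self, item):
        d = hashlib.blake2b(item.encode(), digest_size=16).digest()
        for i in range(self.k):
            yield int.from_bytes(d[i * 4:(i + 1) * 4], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

bf = BloomFilter()
for km in ("ACGTA", "CGTAC", "GTACG"):
    bf.add(km)
```

Indexing one filter per dataset and querying a k-mer against all of them is the basic membership-query pattern; the engineering challenge at 32,000 samples is laying those filters out so a query touches as little memory as possible.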

Wednesday, July 26th
10:35-11:30
Invited Presentation: Deciphering genomic disease mechanisms via single-cell & single-molecule sequencing
Room: Lumière Auditorium
Format: Live from venue

  • Jan Korbel

11:30-11:50
Proceedings Presentation: Coriolis: Enabling metagenomic classification on lightweight mobile devices
Room: Lumière Auditorium
Format: Live from venue

  • Andrew Mikalsen, University at Buffalo, United States
  • Jaroslaw Zola, University at Buffalo, United States


Presentation Overview:

Motivation: The introduction of portable DNA sequencers such as the Oxford Nanopore Technologies MinION has enabled real-time, in-the-field DNA sequencing. However, in-the-field sequencing is actionable only when coupled with in-the-field DNA classification. This poses new challenges for metagenomic software, since mobile deployments are typically in remote locations with limited network connectivity and without access to capable computing devices.
Results: We propose new strategies to enable in-the-field metagenomic classification on mobile devices. We first introduce a programming model for expressing metagenomic classifiers that decomposes the classification process into well-defined and manageable abstractions. The model simplifies resource management in mobile setups and enables rapid prototyping of classification algorithms. Next, we introduce the compact string B-tree, a practical data structure for indexing text in external storage, and we demonstrate its viability as a strategy to deploy massive DNA databases on memory-constrained devices. Finally, we combine both solutions into Coriolis, a metagenomic classifier designed specifically to operate on lightweight mobile devices. Through experiments with actual MinION metagenomic reads and a portable supercomputer-on-a-chip, we show that compared to state-of-the-art solutions, Coriolis offers higher throughput and lower resource consumption without sacrificing classification quality.
Availability: Source code and test data can be obtained from http://jzola.org/smarten/. Contact: ajmikals@buffalo.edu, jzola@buffalo.edu

11:50-12:10
Metabuli: sensitive and specific metagenomic classification through a novel joint analysis of amino-acid and DNA sequences.
Room: Lumière Auditorium
Format: Live from venue

  • Jaebeom Kim, Interdisciplinary Program in Bioinformatics, Seoul National University, South Korea
  • Martin Steinegger, School of Biological Sciences, Seoul National University, South Korea


Presentation Overview:

Assigning taxonomic labels to metagenomic reads involves a trade-off between specificity and sensitivity, depending on the sequence type employed. DNA-based metagenomic classifiers offer higher specificity by capitalizing on mutations to differentiate closely related taxa. Conversely, AA-based classifiers provide higher sensitivity in detecting homology due to the increased conservation of AA.

To solve the trade-off, we developed Metabuli based on a novel k-mer structure, metamer, that simultaneously stores AA and DNA. Metabuli compares metamers first using AA for sensitivity and subsequently with DNA for specificity. We compared Metabuli to DNA-based (Kraken2, KrakenUniq, Centrifuge) and AA-based (Kraken2X, Kaiju, MMseqs2 Taxonomy) tools. In an inclusion test, where 2382 query subspecies were present in databases, DNA-based tools classified up to twice as many reads as AA-based tools to correct (sub)species. However, in an exclusion test, where 367 query species were excluded from databases, AA-based tools showed about twice higher sensitivity in genus-level classification.

Only Metabuli showed state-of-the-art performance in both tests, achieving species-level precision of ~99% and sensitivity of ~97% in the inclusion test, and precision of ~65% and sensitivity of ~48% in the exclusion test. This demonstrates the robustness of Metabuli in diverse contexts of metagenomic studies. (metabuli.steineggerlab.com)
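The two-level comparison described above, amino acid first for sensitivity, then DNA for specificity, can be illustrated with a toy pairing of a DNA k-mer with its translation. The codon table fragment, encoding, and match logic below are simplified placeholders, not Metabuli's actual metamer structure.

```python
# Fragment of the standard genetic code, enough for the demo.
CODON = {"CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
         "AAA": "K", "AAG": "K", "GAT": "D", "GAC": "D"}

def metamer(dna):
    # Toy metamer-style pair: the AA translation (sensitive, conserved)
    # alongside the DNA itself (specific).
    aa = ''.join(CODON[dna[i:i + 3]] for i in range(0, len(dna), 3))
    return (aa, dna)

def match(query, target):
    # Compare at the AA level first (detects remote homology), then
    # break ties at the DNA level (separates closely related taxa).
    qa, qd = metamer(query)
    ta, td = metamer(target)
    if qa != ta:
        return "no match"
    return "same taxon" if qd == td else "related taxon"
```

For example, `"CTTAAAGAT"` and `"CTGAAGGAC"` translate to the same peptide (LKD) but differ at the DNA level, so they register as related rather than identical, which is the behaviour that resolves the inclusion/exclusion trade-off.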

12:10-12:30
HaploDMF: viral haplotype reconstruction from long reads via deep matrix factorization
Room: Lumière Auditorium
Format: Live from venue

  • Dehan Cai, City University of Hong Kong, Hong Kong
  • Jiayu Shang, City University of Hong Kong, Hong Kong
  • Yanni Sun, City University of Hong Kong, Hong Kong


Presentation Overview:

Lacking strict proofreading mechanisms, many RNA viruses can generate progeny with slightly changed genomes. Being able to characterize highly similar genomes (i.e., haplotypes) in one virus population helps study the viruses’ evolution and their interactions with the host and other microbes. High-throughput sequencing data has become the major source for characterizing viral populations. However, the inherent limitation on read length of next-generation sequencing makes complete haplotype reconstruction difficult. In this work, we present a new tool named HaploDMF that can construct complete haplotypes using third-generation sequencing (TGS) data. HaploDMF utilizes a deep matrix factorization model with an adapted loss function to automatically learn latent features from aligned reads. The latent features are then used to cluster reads of the same haplotype. Unlike existing tools whose performance can be affected by the overlap size between reads, HaploDMF achieves highly robust performance on data with different coverage, haplotype numbers, and error rates. In particular, it can generate more complete haplotypes even when the sequencing coverage drops in the middle. We benchmark HaploDMF against state-of-the-art tools on simulated and real TGS data from different viruses. The results show that HaploDMF competes favorably against all of them.

13:50-14:10
Proceedings Presentation: SVJedi-graph: improving the genotyping of close and overlapping Structural Variants with long reads using a variation graph
Room: Lumière Auditorium
Format: Live from venue

  • Sandra Romain, INRIA, France
  • Claire Lemaitre, INRIA, France


Presentation Overview: Show

Motivation: Structural variation (SV) is a class of genetic diversity whose importance is increasingly revealed by genome re-sequencing, especially with long-read technologies. One crucial problem when analyzing and comparing SVs across several individuals is their accurate genotyping: determining whether a described SV is present or absent in a sequenced individual and, if present, in how many copies. Only a few methods are dedicated to SV genotyping with long-read data, and all either suffer from a bias towards the reference allele by not representing all alleles equally, or have difficulty genotyping close or overlapping SVs due to a linear representation of the alleles. Results: We present SVJedi-graph, a novel method for SV genotyping that relies on a variation graph to represent all alleles of a set of SVs in a single data structure. The long reads are mapped on the variation graph, and the resulting alignments that cover allele-specific edges in the graph are used to estimate the most likely genotype for each SV. Running SVJedi-graph on simulated sets of close and overlapping deletions showed that this graph model prevents the bias towards the reference allele and maintains high genotyping accuracy regardless of SV proximity, in contrast to other state-of-the-art genotypers. On the human gold-standard HG002 dataset, SVJedi-graph obtained the best performance, genotyping 99.5% of the high-confidence SV callset with an accuracy of 95% in less than 30 minutes.
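The final genotyping step described above can be sketched with a simple binomial model: given counts of reads supporting the reference and the alternative allele (in the graph setting, reads covering allele-specific edges), pick the most likely diploid genotype. The error rate and function names below are illustrative assumptions, not SVJedi-graph's actual parameters.

```python
from math import comb

def genotype_likelihoods(ref_reads, alt_reads, err=0.05):
    """P(data | genotype) for 0/0, 0/1, 1/1 under a binomial read-count model."""
    n = ref_reads + alt_reads
    # Probability that a read supports the alternative allele, per genotype.
    probs = {"0/0": err, "0/1": 0.5, "1/1": 1 - err}
    return {gt: comb(n, alt_reads) * p**alt_reads * (1 - p)**(n - alt_reads)
            for gt, p in probs.items()}

def call_genotype(ref_reads, alt_reads):
    """Return the maximum-likelihood genotype."""
    lik = genotype_likelihoods(ref_reads, alt_reads)
    return max(lik, key=lik.get)
```

The point of the graph representation is that `ref_reads` and `alt_reads` stay well defined even for close or overlapping SVs, because each allele has its own path and edges.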

14:10-14:30
Taxor: Fast and space-efficient taxonomic classification of long reads
Room: Lumière Auditorium
Format: Live from venue

  • Jens-Uwe Ulrich, Hasso Plattner Institute, Germany
  • Bernhard Renard, Hasso Plattner Institute, Germany


Presentation Overview: Show

Correctly identifying all organisms in an environmental or clinical sample is fundamental in many metagenomic sequencing projects. In recent years, many tools have been developed that classify short and long sequencing reads by comparing their nucleotide sequences to a predefined set of references. Although those methods already utilize flexible data structures with low memory requirements, the constantly increasing number of reference genomes in the databases poses a major computational challenge to the profilers in terms of memory usage, index construction, and query time. Here, we present Taxor, a fast and space-efficient tool for taxonomic profiling that utilizes hierarchical interleaved XOR filters. Taxor shows a precision of 99.9% for read classification at the species level while retaining a recall of 96.7%, outperforming tools like Kraken2 and Centrifuge in terms of precision by 3-9%. Our benchmarking based on simulated and real data indicates that Taxor accurately performs taxonomic read classification while reducing the index size of the reference database and the memory requirements for querying by a factor of 2-12x compared to other profiling tools.
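The underlying classification scheme can be sketched in a few lines: each reference genome contributes its k-mers to an index, and a read is assigned to the taxon sharing the most k-mers with it. Taxor stores these k-mers in hierarchical interleaved XOR filters for space efficiency; the plain Python sets below merely stand in for those filters, and the sequences and threshold are made up for illustration.

```python
K = 5

def kmers(seq, k=K):
    """All k-length substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_index(references):
    """Map each taxon name to the set of k-mers in its reference sequence."""
    return {taxon: kmers(seq) for taxon, seq in references.items()}

def classify(read, index, min_shared=2):
    """Assign a read to the taxon with the most shared k-mers, or None."""
    best_taxon, best_hits = None, 0
    for taxon, ref_kmers in index.items():
        hits = len(kmers(read) & ref_kmers)
        if hits > best_hits:
            best_taxon, best_hits = taxon, hits
    return best_taxon if best_hits >= min_shared else None

index = build_index({
    "species_A": "ACGTACGTACGTAAGG",
    "species_B": "TTGGCCAATTGGCCAA",
})
```

Replacing the exact sets with approximate membership filters trades a small false-positive rate for the large memory savings the abstract reports.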

14:30-14:50
Proceedings Presentation: Foreign RNA spike-ins enable accurate allele-specific expression analysis at scale
Room: Lumière Auditorium
Format: Live from venue

  • Asia Mendelevich, Altius Institute for Biomedical Sciences, United States
  • Saumya Gupta, Stem Cell Program, Boston Children’s Hospital; Department of Stem Cell and Regenerative Biology, Harvard University, United States
  • Aleksei Pakharev, ---, United States
  • Athanasios Teodosiadis, Altius Institute for Biomedical Sciences, United States
  • Andrey Mironov, Lomonosov Moscow State University, Institute of Information Transmission Problems, Russia
  • Alexander Gimelbrant, Altius Institute for Biomedical Sciences, United States


Presentation Overview: Show

Analysis of allele-specific expression is strongly affected by the technical noise present in RNA-seq experiments. Previously, we showed that technical replicates can be used for precise estimates of this noise, and we provided a tool for correction of technical noise in allele-specific expression analysis. This approach is very accurate but costly due to the need for two or more replicates of each library. Here, we develop a spike-in approach that is highly accurate at only a small fraction of the cost.
We show that a distinct RNA added as a spike-in before library preparation reflects the technical noise of the whole library and can be used in large batches of samples. We experimentally demonstrate the effectiveness of this approach using combinations of RNA from species distinguishable by alignment, namely, mouse, human, and C. elegans. Our new approach, controlFreq, enables highly accurate and computationally efficient analysis of allele-specific expression in (and between) arbitrarily large studies at an overall cost increase of ∼5%.
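The core of the spike-in idea can be illustrated with a toy moment estimate: because the spike-in RNA has a known allelic composition, the spread of its observed allelic fractions across samples reflects purely technical noise, summarized as overdispersion relative to binomial sampling. The function name and this simple estimator are illustrative assumptions, not the controlFreq model.

```python
def overdispersion(spikein_counts, true_frac=0.5):
    """Ratio of the observed variance of spike-in allelic fractions to the
    variance expected from binomial sampling alone (1.0 = no extra noise).

    spikein_counts: list of (allele1_reads, allele2_reads) per sample.
    """
    fracs = [a / (a + b) for a, b in spikein_counts]
    n_mean = sum(a + b for a, b in spikein_counts) / len(spikein_counts)
    obs_var = sum((f - true_frac) ** 2 for f in fracs) / len(fracs)
    exp_var = true_frac * (1 - true_frac) / n_mean  # binomial expectation
    return obs_var / exp_var
```

A ratio well above 1 indicates library-wide technical noise that should be propagated into the confidence intervals of the allele-specific expression estimates.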

14:50-15:10
Proceedings Presentation: Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes
Room: Lumière Auditorium
Format: Live from venue

  • Jarno Alanko, University of Helsinki, Finland
  • Simon Puglisi, University of Helsinki, Finland
  • Tommi Mäklin, University of Helsinki, Finland
  • Jaakko Vuohtoniemi, University of Helsinki, Finland


Presentation Overview: Show

Motivation: Huge data sets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these data sets, indexing data structures that are both scalable and provide rapid query throughput are paramount.

Results: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes that works for both short- and long-read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 hours, producing an index of 142 gigabytes. In comparison, the best competing tools, Metagraph and Bifrost, were only able to index 11 thousand genomes in the same time. In pseudoalignment, these tools were either an order of magnitude slower than Themisto or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving higher recall than previous methods on Nanopore read sets.

Availability and implementation: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.
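A colored k-mer index and pseudoalignment can be sketched minimally: the index maps each k-mer to the set of reference genomes ("colors") containing it, and a read pseudoaligns to the colors shared by its indexed k-mers. Plain dicts and sets stand in for Themisto's succinct data structures; the sequences are made up for illustration.

```python
K = 4

def build_colored_index(genomes):
    """Map each k-mer to the set of genome names (colors) containing it."""
    index = {}
    for color, seq in genomes.items():
        for i in range(len(seq) - K + 1):
            index.setdefault(seq[i:i + K], set()).add(color)
    return index

def pseudoalign(read, index):
    """Intersect the color sets of all indexed k-mers in the read."""
    colors = None
    for i in range(len(read) - K + 1):
        hit = index.get(read[i:i + K])
        if hit is not None:
            colors = set(hit) if colors is None else colors & hit
    return colors or set()

index = build_colored_index({"strain1": "ACGTTGCA", "strain2": "ACGTAGCA"})
```

Skipping k-mers absent from the index (rather than emptying the intersection) is what makes pseudoalignment tolerant of the high error rates of Nanopore reads.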

15:10-15:30
Proceedings Presentation: A multi-locus approach for accurate variant calling in low-copy repeats using whole-genome sequencing
Room: Lumière Auditorium
Format: Live from venue

  • Timofey Prodanov, Heinrich Heine University, Duesseldorf, Germany
  • Vikas Bansal, University of California San Diego, United States


Presentation Overview: Show

Motivation: Low-copy repeats (LCRs) or segmental duplications are long segments of duplicated DNA that cover > 5% of the human genome. Existing tools for variant calling using short reads exhibit low accuracy in LCRs due to ambiguity in read mapping and extensive copy number variation. Variants in more than 150 genes overlapping LCRs are associated with risk for human diseases.

Methods: We describe a short-read variant calling method, ParascopyVC, that performs variant calling jointly across all repeat copies and utilizes reads independent of mapping quality in LCRs. To identify candidate variants, ParascopyVC aggregates reads mapped to different repeat copies and performs polyploid variant calling. Subsequently, paralogous sequence variants (PSVs) that can differentiate repeat copies are identified using population data and used for estimating the genotype of variants for each repeat copy.

Results: On simulated whole-genome sequence data, ParascopyVC achieved higher precision (0.997) and recall (0.807) than three state-of-the-art variant callers (best precision = 0.956 for DeepVariant and best recall = 0.738 for GATK) in 167 LCR regions. Benchmarking of ParascopyVC using the Genome-in-a-Bottle high-confidence variant calls for the HG002 genome showed that it achieved a very high precision of 0.991 and a high recall of 0.909 across LCR regions, significantly better than FreeBayes (precision = 0.954, recall = 0.822), GATK (precision = 0.888, recall = 0.873) and DeepVariant (precision = 0.983, recall = 0.861). ParascopyVC demonstrated consistently higher accuracy (mean F_1 score = 0.947) than other callers (best F_1 = 0.908) across seven human genomes.

Availability and implementation: ParascopyVC is implemented in Python and is freely available at https://github.com/tprodanov/ParascopyVC.
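The pooled, polyploid candidate-calling step described in the Methods can be sketched as follows: reads from all repeat copies are aggregated, so a variant on one of n copies in a diploid genome is expected at an allele fraction near 1/(2n). The binomial scan below picks the most likely total alternate allele count; the names and error rate are illustrative, not ParascopyVC's interface.

```python
from math import comb

def pooled_allele_count(ref_reads, alt_reads, n_copies, err=0.01):
    """Most likely number of alternate alleles among 2*n_copies total,
    from reads pooled across all repeat copies of a diploid genome."""
    n = ref_reads + alt_reads
    best_k, best_lik = 0, -1.0
    for k in range(2 * n_copies + 1):
        # Expected alternate-allele fraction, clamped by the error rate.
        p = max(err, min(1 - err, k / (2 * n_copies)))
        lik = comb(n, alt_reads) * p**alt_reads * (1 - p)**(n - alt_reads)
        if lik > best_lik:
            best_k, best_lik = k, lik
    return best_k
```

Attributing the pooled count to a specific repeat copy is the harder second step, which is where the paralogous sequence variants (PSVs) come in.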

16:00-16:20
Variational Inference for Single-Cell Transcriptome with DNA Barcoding Reconstructs Unobserved Cell States and Differentiation Trajectories
Room: Lumière Auditorium
Format: Live from venue

  • Koichiro Majima, Division of Systems Biology, Nagoya University Graduate School of Medicine, Japan
  • Kodai Minoura, Japanese Red Cross Nagoya Daiichi Hospital, Japan
  • Yasuhiro Kojima, Laboratory of Computational Life Science, National Cancer Center Research Institute, Japan
  • Teppei Shimamura, Division of Systems Biology, Nagoya University Graduate School of Medicine, Japan


Presentation Overview: Show

Single-cell RNA sequencing (scRNA-seq) is a powerful tool for characterizing cell types and states. However, because cells are destroyed during analysis, it has limitations in measuring changes in gene expression during dynamic biological processes such as differentiation. Recent studies combining scRNA-seq with lineage tracing have provided clonal information but still face challenges, such as observations being limited to discrete time points and difficulty in tracking cells within a lineage over the time course, since cells observed early are not the direct ancestors of cells in the same lineage observed at later time points. To address these issues, we developed Lineage Variational Inference (LineageVI), a model based on the variational autoencoder (VAE) framework, which converts single-cell transcriptome observations with DNA barcoding into latent state dynamics consistent with the clonal relationships by assuming a common ancestor. This model enables us to quantitatively capture cell state transitions. We demonstrate how our model can recapitulate differentiation trajectories in hematopoiesis, learn the underlying dynamics, and estimate backward transitions from later to earlier observations in the latent space. Restoring transcriptomes at each time point in each lineage showed an increase in undifferentiated-marker expression and a decrease in differentiation-marker expression toward the ancestors.

16:20-16:40
Proceedings Presentation: GAN-based Data Augmentation for Transcriptomics: Survey and Comparative Assessment
Room: Lumière Auditorium
Format: Live from venue

  • Alice Lacan, IBISC, University Paris-Saclay (Univ. Evry), France
  • Michele Sebag, TAU, CNRS-INRIA-LISN, University Paris-Saclay, France
  • Blaise Hanczar, IBISC, University Paris-Saclay (Univ. Evry), France


Presentation Overview: Show

Motivation: Transcriptomics data is becoming more accessible due to high-throughput and less costly sequencing methods. However, data scarcity prevents exploiting the full predictive power of deep learning models for phenotype prediction. Artificially enhancing the training sets, namely data augmentation, is suggested as a regularization strategy. Data augmentation corresponds to label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Such transformations are, unfortunately, unknown in the transcriptomic field. Therefore, deep generative models such as Generative Adversarial Networks (GANs) have been proposed to generate additional samples. In this paper, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.
Results: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, training a classifier on only 50 RNA-seq samples yields an accuracy of 94% for binary classification and 70% for tissue classification. In comparison, we achieved 98% and 94% accuracy when adding 1000 augmented samples. Richer architectures and more expensive training of the GAN yield better augmentation performance and generated-data quality overall. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.
Availability: All data used for this research is publicly available and comes from The Cancer Genome Atlas. Reproducible code is available on the GitHub repository: GANs-for-transcriptomics
Contact: alice.lacan@univ-evry.fr
Supplementary information: Supplementary data are available at Bioinformatics online.

16:40-17:00
Visualizing Spatial Transcriptomics with U-CIE Color Encoding
Room: Lumière Auditorium
Format: Live from venue

  • Mikaela Koutrouli, Novo Nordisk Foundation Center of Protein Research, Denmark
  • Radha Swaminathan, Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, United States
  • Jeremy Edwards, Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, United States
  • Lars Juhl Jensen, Novo Nordisk Foundation Center of Protein Research, Denmark


Presentation Overview: Show

Spatial transcriptomics is a cutting-edge technique that enables the analysis of gene expression patterns within specific regions of a tissue or organ. However, analyzing the large and complex datasets generated by spatial transcriptomics experiments remains a challenge. Here we propose U-CIE, a method for visualizing high-dimensional data by encoding it as colors using a combination of dimensionality reduction and the CIELAB color space. U-CIE allows genome-wide expression patterns within tissue or organ sections to be visualized and highlights the distribution of different cell types across the data. U-CIE first uses UMAP to reduce high-dimensional gene expression data to three dimensions. Next, the resulting three-dimensional representation is embedded within the CIELAB color space, generating a color encoding that captures much of the original structure of the data. U-CIE has been successfully applied to a mouse brain section dataset, highlighting the distribution of different cell types and providing insights into the organization of these cells within brain regions. U-CIE has the potential to be a powerful tool for exploring spatial transcriptomics data and gaining new insights into cellular organization and function.
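The color-encoding step can be sketched as follows: after dimensionality reduction yields three coordinates per spot, each axis is rescaled into the L*, a*, b* ranges of CIELAB and converted to sRGB for display. The scaling ranges below are illustrative assumptions; U-CIE additionally optimizes the embedding's placement inside the color gamut.

```python
def lab_to_rgb(L, a, b):
    """Standard CIELAB -> sRGB conversion (D65 white point), clamped to [0, 1]."""
    fy = (L + 16) / 116
    fx, fz = fy + a / 500, fy - b / 200
    finv = lambda t: t**3 if t > 6 / 29 else 3 * (6 / 29) ** 2 * (t - 4 / 29)
    X, Y, Z = 0.95047 * finv(fx), finv(fy), 1.08883 * finv(fz)
    lin = (3.2406 * X - 1.5372 * Y - 0.4986 * Z,
           -0.9689 * X + 1.8758 * Y + 0.0415 * Z,
           0.0557 * X - 0.2040 * Y + 1.0570 * Z)
    gamma = lambda c: 12.92 * c if c <= 0.0031308 else 1.055 * c**(1 / 2.4) - 0.055
    return tuple(min(1.0, max(0.0, gamma(c))) for c in lin)

def embed_to_colors(points):
    """Min-max scale 3D embedding axes into L* [20, 80] and a*, b* [-60, 60]."""
    axes = list(zip(*points))
    lo = [min(ax) for ax in axes]
    hi = [max(ax) for ax in axes]

    def scale(v, i, out_lo, out_hi):
        span = (hi[i] - lo[i]) or 1.0
        return out_lo + (out_hi - out_lo) * (v - lo[i]) / span

    return [lab_to_rgb(scale(x, 0, 20, 80),
                       scale(y, 1, -60, 60),
                       scale(z, 2, -60, 60))
            for x, y, z in points]
```

Because CIELAB is perceptually roughly uniform, spots that are close in the embedding receive perceptually similar colors, which is what makes the tissue image readable.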

17:00-17:20
CellFromSpace: A versatile tool for spatial transcriptomic data analysis through reference-free deconvolution and guided cell type/activity annotation
Room: Lumière Auditorium
Format: Live from venue

  • Corentin Thuilliez, INSERM U1015, Gustave Roussy Cancer Campus, France
  • Maria Eugenia Marques Da Costa, INSERM U1015, Gustave Roussy Cancer Campus, France
  • Nathalie Gaspar, INSERM U1015 & Department of Pediatric and Adolescent Oncology, Gustave Roussy Cancer Campus, France
  • Pierre Khneisser, Department of Medical Biology and Pathology, Gustave Roussy Cancer Campus, France
  • Gael Moquin-Beaudry, INSERM U1015, Gustave Roussy Cancer Campus, France
  • Jean-Yves Scoazec, Department of Medical Biology and Pathology, Gustave Roussy Cancer Campus, France
  • Antonin Marchais, INSERM U1015 & Department of Pediatric and Adolescent Oncology, Gustave Roussy Cancer Campus, France


Presentation Overview: Show

Spatial transcriptomics is one of the most promising technologies for analyzing the spatial distribution and interactions of cells. Spatially barcoded next-generation sequencing (NGS)-based methods enable the detection of transcripts on tissue sections. Several of these technologies are near single-cell resolution, with spots encompassing 1-20 cells. Therefore, a deconvolution step is required to gain insight into the mixture of cells.

Here, we propose a new method named CellFromSpace (CFS), based on independent component analysis (ICA), a blind signal separation method, to deconvolute spatial transcriptomic data without reference single-cell data. We developed an R package and a Shiny interface to accelerate annotation of the signal.

Visium fresh-frozen and FFPE samples of adult mouse brain and human tumors from 10x Genomics were analyzed. Using our method, we were able to recapitulate the structure of the mouse brain with high fidelity. Furthermore, we quickly identified cell types and activities within heterogeneous cancer tissues. The method also enables subsetting of the signal and of the spots corresponding to specific cells, to drive further analyses usually performed on scRNA-seq data, such as trajectory inference.

In conclusion, CFS provides a full workflow to analyze and quickly interpret results from NGS-based spatial transcriptomics without a reference single-cell dataset.

17:20-17:40
demuxSNP: supervised demultiplexing of scRNAseq data using cell hashing and SNPs
Room: Lumière Auditorium
Format: Live from venue

  • Michael P Lynch, University of Limerick, Ireland
  • Yufei Wang, Dana-Farber Cancer Institute, United States
  • Laurent Gatto, UCLouvain, Belgium
  • Aedin C Culhane, University of Limerick, Ireland


Presentation Overview: Show

Single-cell sequencing allows unprecedented understanding of biologically relevant differences between individual cells. Multiplexing, that is, loading multiple biological samples into each sequencing lane, is widely used to further reduce sequencing costs. The sequencing reads must then be demultiplexed, or identified as coming from a particular biological sample. Methods to date have used either cell hashing labels (tags) or SNPs. We present our approach and its corresponding R package, ‘demuxSNP’, which overcomes current technical challenges in demultiplexing scRNA-seq reads and can be applied to genetically distinct biological samples.
demuxSNP uses data from both tags and SNPs. It performs SNP feature selection and then trains a doublet-aware knn classifier on the SNP profiles of singlet cells called with high confidence using cell tagging methods. Low-confidence cells (cells that could not be confidently called using cell tagging methods) are then assigned based on their SNP profiles. demuxSNP is a computationally efficient and cell-type-unbiased algorithm for demultiplexing genetically distinct biological samples.
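A toy version of the SNP-based assignment step: each cell is a vector of SNP observations (0 = ref, 1 = alt, -1 = not observed), cells confidently called by hashing provide the labeled training set, and low-confidence cells take the majority label of their k nearest neighbors under a Hamming-like distance. A doublet-aware version would also add synthetic doublet profiles to the training set; everything here is a simplified illustration, not demuxSNP's implementation.

```python
from collections import Counter

def snp_distance(c1, c2):
    """Fraction of mismatches over SNPs observed in both cells."""
    pairs = [(a, b) for a, b in zip(c1, c2) if a != -1 and b != -1]
    if not pairs:
        return 1.0  # no shared observations: maximally uninformative
    return sum(a != b for a, b in pairs) / len(pairs)

def knn_assign(cell, labeled_cells, k=3):
    """Majority vote among the k nearest high-confidence cells."""
    nearest = sorted(labeled_cells, key=lambda lc: snp_distance(cell, lc[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# (SNP profile, sample label) pairs from confidently hashed singlets.
labeled = [([0, 0, 1, 1], "sampleA"), ([0, 1, 1, 1], "sampleA"),
           ([1, 1, 0, 0], "sampleB"), ([1, 0, 0, 0], "sampleB")]
```

Ignoring unobserved positions (the `-1` entries) is what lets the classifier cope with the sparse SNP coverage typical of scRNA-seq.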

17:40-18:00
Leveraging Evolutionary Constraints to Refine Somatic Variant Calls from Single-Cell Sequencing Data
Room: Lumière Auditorium
Format: Live from venue

  • Gryte Satas, Memorial Sloan Kettering Cancer Center, United States
  • Matthew A. Myers, Memorial Sloan Kettering Cancer Center, United States
  • Seongmin Choi, Memorial Sloan Kettering Cancer Center, United States
  • Sohrab Shah, Memorial Sloan Kettering Cancer Center, United States


Presentation Overview: Show

Single-cell DNA sequencing (scDNA-seq) technologies enable scaled measurements of tumor cell genomes. However, low per-cell coverage and technical biases present analytical challenges for identifying nucleotide resolution mutations. Accurate calling of small variants (e.g., single-nucleotide variants [SNVs]) is a critical prerequisite for many downstream analyses, but call sets from scDNA-seq data often contain many false positives (‘artifacts’). We introduce ArtiCull, a variant call refinement algorithm that exploits evolutionary constraints to identify artifacts in scDNA-seq data. ArtiCull requires no external training data, manual inspection, or prior knowledge of artifact profiles. Instead, ArtiCull uses somatic evolutionary models to identify a subset of high-confidence artifactual and true variants; these labeled variants are then used to train a feature-based classifier. This enables researchers to train patient-, cohort-, or technology-specific classifiers attuned to the specific profile of technical biases in their dataset. Validation with matched bulk sequencing data shows that ArtiCull greatly improves SNV calling precision with minimal loss of recall. We demonstrate that ArtiCull improves the identification of clones in scDNA-seq data, and increases sensitivity of mutational signature analyses to identify processes active in a small number of cells.
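One way to picture using an evolutionary constraint for labeling: a real somatic SNV should be concentrated in the cells of one clone, whereas an artifact tends to be scattered across clones. The entropy score below captures that intuition only; ArtiCull's actual evolutionary models and features are more sophisticated, and the threshold here is arbitrary.

```python
from math import log2

def clone_entropy(cells_with_variant, cell_to_clone):
    """Shannon entropy of the clone distribution of variant-bearing cells."""
    counts = {}
    for cell in cells_with_variant:
        clone = cell_to_clone[cell]
        counts[clone] = counts.get(clone, 0) + 1
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

def label_variant(cells_with_variant, cell_to_clone, max_entropy=0.5):
    """Low entropy: clone-concentrated, likely real. High entropy: scattered."""
    if clone_entropy(cells_with_variant, cell_to_clone) <= max_entropy:
        return "real"
    return "artifact"

# Eight cells split across two clones.
cell_to_clone = {f"c{i}": ("cloneA" if i < 4 else "cloneB") for i in range(8)}
```

Variants labeled this way at high confidence could then serve as training data for a feature-based classifier applied to the ambiguous remainder, which mirrors the self-supervised structure the abstract describes.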