

HiTSeq: High-throughput Sequencing

COSI Track Presentations

Schedule subject to change
Monday, July 22nd
10:15 AM-10:20 AM
Welcome to the HiTSeq
Room: San Francisco (3rd Floor)
10:20 AM-11:20 AM
HiTSeq Keynote: Quantifying the rates and routes of metastasis
Room: San Francisco (3rd Floor)
  • Christina Curtis, Stanford, United States

11:20 AM-11:40 AM
Proceedings Presentation: Alignment-free Filtering for cfNA Fusion Fragments
Room: San Francisco (3rd Floor)
  • Xiao Yang, Grail, Inc, United States
  • Mohini Desai, Grail, Inc, United States
  • Wenying Pan, Grail, Inc, United States
  • Matthew Larson, Grail, Inc, United States
  • Eric Scott, Grail, Inc, United States
  • Pranav Singh, Grail, Inc, United States
  • Hyunsung John Kim, Grail, Inc, United States
  • Arjun Rao, Grail, Inc, United States
  • Yasushi Saito, Grail, Inc, United States
  • Earl Hubbell, Grail Bio, United States

Presentation Overview:

Motivation: Cell-free nucleic acid (cfNA) sequencing data require improvements to existing fusion
detection methods along multiple axes: high depth of sequencing, low allele fractions, short fragment
lengths, and specialized barcodes such as unique molecular identifiers.
Results: AF4 was developed to address these challenges. It uses a novel alignment-free k-mer-based
method to detect candidate fusion fragments with high sensitivity and orders of magnitude faster than
existing tools. Candidate fragments are then filtered using a max-cover criterion that significantly reduces
spurious matches while retaining authentic fusion fragments. This efficient first stage reduces the data
sufficiently that commonly used criteria can process the remaining information, or sophisticated filtering
policies that may not scale to the raw reads can be used. AF4 provides both targeted and de novo fusion
detection modes. We demonstrate both modes in benchmark simulated and real RNA-seq data as well as
clinical and cell-line cfNA data.
Availability: AF4 is open sourced, licensed under Apache License 2.0, and is available at:
Contact: {xyang,ysaito}@grail.com
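
The alignment-free idea in the abstract above can be illustrated with a toy sketch. This is not AF4's implementation: the function names, the `min_hits` threshold, and the rule "a fragment is a candidate if it carries gene-specific k-mers from at least two genes" are assumptions made for illustration, and the max-cover filtering stage is not shown.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genes, k):
    """Map each k-mer to the set of gene names whose sequence contains it."""
    index = defaultdict(set)
    for name, seq in genes.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def candidate_fusion(fragment, index, k, min_hits=2):
    """Return the sorted genes supported by the fragment if there are two or more.

    Only k-mers that occur in exactly one gene are counted, so shared
    sequence does not create spurious support. Returns [] otherwise.
    """
    hits = defaultdict(int)
    for km in kmers(fragment, k):
        owners = index.get(km, ())
        if len(owners) == 1:
            hits[next(iter(owners))] += 1
    supported = sorted(g for g, n in hits.items() if n >= min_hits)
    return supported if len(supported) >= 2 else []


# Hypothetical toy sequences, not real transcripts.
genes = {"A": "ACGGTCAAGTTC", "B": "TTGCCATGGCAT"}
index = build_index(genes, k=5)
chimeric = genes["A"][:8] + genes["B"][4:]
print(candidate_fusion(chimeric, index, k=5))
print(candidate_fusion(genes["A"][:10], index, k=5))
```

Because the lookup is a hash-table probe per k-mer, no alignment is computed for fragments that never touch two genes, which is the intuition behind using such a stage as a fast pre-filter.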

11:40 AM-12:00 PM
Descendant Cell Fraction: Copy-aware Inference of Clonal Composition and Evolution in Cancer
Room: San Francisco (3rd Floor)
  • Gryte Satas, Princeton University, United States
  • Simone Zaccaria, Princeton University, Italy
  • Ben Raphael, Princeton University, United States
  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States

Presentation Overview:

A tumor results from an evolutionary process, giving rise to distinct clones distinguished by somatic mutations including single-nucleotide variants (SNVs) and copy-number aberrations (CNAs). The standard approach to identify such clones is to cluster SNVs that have similar cancer cell fractions (CCFs), i.e., the proportion of tumor cells harboring the mutation. The key assumption is that SNVs with similar CCFs have occurred on the same phylogenetic branch. There are, however, two key deficiencies: (1) the CCF cannot be unambiguously inferred from DNA sequencing data; (2) the CCF does not account for loss of mutations, which is common in tumors with CNAs.

To address these deficiencies, we define a novel quantity, the descendant cell fraction (DCF), which is a summary statistic for both the prevalence and evolutionary history of an SNV. We introduce DeCiFer, an algorithm to simultaneously infer evolutionary histories of individual SNVs and cluster SNVs by their corresponding DCFs under the principle of parsimony. On simulated data, we show that DeCiFer more accurately clusters SNVs than existing methods. On a metastatic prostate cancer dataset, we show that DeCiFer yields more parsimonious evolutionary and migration histories. Thus, DeCiFer enables more accurate quantification of intra-tumor heterogeneity and improves inference of tumor evolution.

12:00 PM-12:20 PM
Subpopulation detection and their comparative analysis across single cell experiments with PopCorn
Room: San Francisco (3rd Floor)
  • Yijie Wang, National Center of Biotechnology Information, National Library of Medicine, NIH, United States
  • Jan Hoinka, NCBI, NIH, United States
  • Teresa Przytycka, National Center of Biotechnology Information, NLM, NIH, United States

Presentation Overview:

One of the key applications of scRNA-seq technology concerns the identification of subpopulations of cells present in a sample, and comparing such subpopulations across multiple samples/experiments. This conceptually natural task is complicated by technical and biological noise, which can obscure the true biological similarities and differences between the samples.

We present a new approach, PopCorn, that leverages several algorithmic ideas to achieve this goal. The key idea is to construct a graph representation of the data that consists of two types of edges: one representing the relation between cells within each experiment and the other representing the relation of the cells between the experiments. These two types of edges are computed differently but are combined into one graph, which is then partitioned into connected subgraphs that define both the subpopulations and their correspondence across experiments.

We tested the performance of PopCorn in three distinct settings: identifying and aligning subpopulations from single cell data from different organisms, aligning biological replicates, and comparing populations of cells from cancer and healthy brain tissues. PopCorn not only outperforms currently used approaches but also introduces mathematical concepts that can serve as stepping stones to improve other tools.

12:20 PM-12:40 PM
Proceedings Presentation: Minnow: A principled framework for rapid simulation of dscRNA-seq data at the read level
Room: San Francisco (3rd Floor)
  • Hirak Sarkar, Stony Brook University, United States
  • Avi Srivastava, Stony Brook University, United States
  • Robert Patro, Stony Brook University, United States

Presentation Overview:

With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, and RNA velocity in single cells. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or UMI deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-seq (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as PCR amplification, CB (cellular barcode) and UMI (Unique Molecular Identifier) selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification, and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment.

2:00 PM-2:20 PM
Proceedings Presentation: Fully-sensitive Seed Finding in Sequence Graphs Using a Hybrid Index
Room: San Francisco (3rd Floor)
  • Ali Ghaffaari, Max-Planck Institut für Informatik, Germany
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Presentation Overview:

Motivation: Sequence graphs are versatile data structures that are, for instance, able to represent the genetic variation found in a population and to facilitate genome assembly. Read mapping to sequence graphs constitutes an important step for many applications and is usually done by first finding exact seed matches, which are then extended by alignment. Existing methods for finding seed hits prune the graph in complex regions, leading to a loss of information especially in highly polymorphic regions of the genome. While such complex graph structures can indeed lead to a combinatorial explosion of possible alleles, the query set of reads from a diploid individual realizes only two alleles per locus, a property that is not exploited by extant methods.
Results: We present the Pan-genome Seed Index (PSI), a fully-sensitive hybrid method for seed finding, which takes full advantage of this property by combining an index over selected paths in the graph with an index over the query reads. This enables PSI to find all seeds while eliminating the need to prune the graph. We demonstrate its performance with different parameter settings on both simulated data and on a whole human genome graph constructed from variants in the 1000 Genomes Project data set. On this graph, PSI outperforms GCSA2 in terms of index size, query time, and sensitivity.

2:20 PM-2:40 PM
Proceedings Presentation: Building Large Updatable Colored de Bruijn Graphs via Merging
Room: San Francisco (3rd Floor)
  • Martin Muggli, Colorado State University, United States
  • Bahar Alipanahi, University of Florida, United States
  • Christina Boucher, University of Florida, United States

Presentation Overview:

Motivation: There exist several massive genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph have been developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets because both scalability and the ability to update the construction are needed.
Results: We create a method for constructing and updating the colored de Bruijn graph on a very large dataset through partitioning the data into smaller subsets, building the colored de Bruijn graph using an FM-index based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this paper. We refer to the resulting method as VariMerge. We validate our approach and show that it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8,000 strains. Lastly, we compare VariMerge to other competing methods (including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method by Almodaresi, and Multi-BRWT) and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16,000 strains in a manner that allows additional samples to be added. Competing methods either did not scale to a dataset this large or cannot allow for additions without reconstruction.
Availability: VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under the GPLv3 license.

2:40 PM-3:00 PM
Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields
Room: San Francisco (3rd Floor)
  • Aranka Steyaert, Ghent University, Belgium
  • Pieter Audenaert, Ghent University, Belgium
  • Jan Fostier, Ghent University, Belgium

Presentation Overview:

Many bioinformatics tools use read-based de Bruijn graphs as an estimated representation of the underlying genome sequence. However, sequencing errors and repeated subsequences complicate the identification of the true underlying sequence. A key step in this process is to infer the multiplicities of nodes/arcs in the graph, i.e., the number of times each k-mer (resp. (k+1)-mer) corresponding to a node (resp. arc) is present in the genomic sequence. Multiplicities thus reveal repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs are reflected in the node/arc coverage; however, coverage variability and coverage biases complicate their determination.
Current methodology determines multiplicities based solely on the information in nodes/arcs individually, underutilising the information present in the sequencing data. To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field model to efficiently combine the coverage information within each node/arc individually with the information of surrounding nodes and arcs. Multiplicities are thus collectively assigned in a more consistent manner. We observe an accuracy improvement and a more robust EM parameter estimation. We believe this methodology can be a useful addition to tools that make use of de Bruijn graphs.
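
To make the definition of multiplicity in the abstract above concrete, the following sketch counts true node and arc multiplicities directly from a known genome string. It only illustrates the quantity being inferred, not the conditional random field model; the function name is hypothetical, and reverse complements are ignored as a simplifying assumption.

```python
from collections import Counter


def true_multiplicities(genome, k):
    """Count how often each k-mer (node) and (k+1)-mer (arc) occurs in genome.

    Simplifying assumption: only the forward strand is considered.
    """
    nodes = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    arcs = Counter(genome[i:i + k + 1] for i in range(len(genome) - k))
    return nodes, arcs


# Hypothetical toy genome in which "ACG" is repeated.
nodes, arcs = true_multiplicities("ACGTACGA", k=3)
print(nodes["ACG"], nodes["CGT"], arcs["ACGT"])
```

In practice the genome is unknown and only read coverage is observed, so a multiplicity of 0 would indicate a k-mer created by sequencing error and a multiplicity above 1 would indicate a repeat.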

3:00 PM-3:20 PM
GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment
Room: San Francisco (3rd Floor)
  • Mikko Rautiainen, Max Planck Institute for Informatics, Germany
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Presentation Overview:

Sequence graphs provide a natural way of expressing variation or uncertainty in a genome, or a collection of genomes. They can be used for diverse applications such as genome assembly, error correction and SV genotyping. With the growing usage of graphs, methods for handling graphs efficiently are becoming more important. In particular, sequence alignment is one of the most fundamental operations in genome analysis and is used in many applications.

We present our tool GraphAligner for aligning long reads to genome graphs. Comparisons with existing tools show that our method is faster by one order of magnitude. To demonstrate the downstream benefits, we present a hybrid error correction pipeline based on aligning long reads to a de Bruijn graph, which achieves error rates up to one order of magnitude lower than competing tools and scales to high coverage whole-genome mammalian datasets.

As sequence alignment is one of the most fundamental operations in genome analysis, better alignment methods will produce many downstream benefits. GraphAligner is a tool for aligning long reads to genome graphs faster than existing methods, enabling many use cases that were previously computationally infeasible. GraphAligner is open source and available on bioconda.

3:20 PM-3:40 PM
Proceedings Presentation: cloudSPAdes: Assembly of Synthetic Long Reads Using de Bruijn graphs
Room: San Francisco (3rd Floor)
  • Ivan Tolstoganov, Center for Algorithmic Biotechnology, Institute for Translational Biomedicine, St. Petersburg State University, Russia
  • Anton Bankevich, Dept. of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, United States
  • Pavel Pevzner, Dept. of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, United States

Presentation Overview:

The recently developed barcoding-based Synthetic Long Read (SLR) technologies have already found many applications in genome assembly and analysis. However, although some new barcoding protocols are emerging and the range of SLR applications is being expanded, the existing SLR assemblers are optimized for a narrow range of parameters and are not easily extendable to new barcoding technologies and new applications such as metagenomics or hybrid assembly. We describe the algorithmic challenge of SLR assembly and present cloudSPAdes, an algorithm for SLR assembly that is based on analyzing the de Bruijn graph of SLRs. We benchmarked cloudSPAdes across various barcoding technologies/applications and demonstrated that it improves on the state-of-the-art SLR assemblers in accuracy and speed. The project was supported by the Russian Science Foundation (grant 19-14-00172).

3:40 PM-4:00 PM
Proceedings Presentation: Locality sensitive hashing for the edit distance
Room: San Francisco (3rd Floor)
  • Guillaume Marçais, Carnegie Mellon University, United States
  • Dan DeBlasio, Carnegie Mellon University, United States
  • Prashant Pandey, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview:

Motivation: Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality Sensitive Hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have an alignment from those that may have an alignment. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e., failing to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. Moreover, due to the lack of a practical LSH method for edit distance, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy in practice.

Results: We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is not only sensitive to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH.
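
The order-aware idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: for each hash function it keeps the `ell` k-mers with the smallest hash values and records them in their order of appearance in the sequence, which is what makes the sketch sensitive to k-mer order. The function name, the use of `hashlib.blake2b` with an integer seed, and the parameter names are all choices made for this example.

```python
import hashlib


def omh_sketch(seq, k, ell, num_hashes):
    """Return num_hashes tuples, each holding ell k-mers in sequence order."""
    # Pair every k-mer with its occurrence number so repeated k-mers
    # are treated as distinct items.
    seen = {}
    items = []  # (position, k-mer, occurrence number)
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        occ = seen.get(km, 0)
        seen[km] = occ + 1
        items.append((i, km, occ))

    sketch = []
    for seed in range(num_hashes):
        def h(item, seed=seed):
            _, km, occ = item
            data = f"{seed}:{km}:{occ}".encode()
            return hashlib.blake2b(data, digest_size=8).digest()

        # Select the ell items with the smallest hash values ...
        smallest = sorted(items, key=h)[:ell]
        # ... then restore their order of appearance in the sequence.
        smallest.sort()
        sketch.append(tuple(km for _, km, _ in smallest))
    return sketch


def sketch_similarity(a, b):
    """Fraction of sketch entries on which two sketches agree exactly."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


s1 = omh_sketch("ACGTACGTTGCAAC", k=3, ell=2, num_hashes=8)
s2 = omh_sketch("ACGTACGTTGCAAC", k=3, ell=2, num_hashes=8)
print(sketch_similarity(s1, s2))
```

With `ell = 1` the order carries no information and the sketch depends only on which k-mers (with their occurrence numbers) are present, in the spirit of a plain minHash; larger `ell` values make two sequences with the same k-mers in a different order less likely to collide.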

4:40 PM-5:00 PM
Proceedings Presentation: TideHunter: efficient and sensitive tandem repeat detection from noisy long-reads using seed-and-chain
Room: San Francisco (3rd Floor)
  • Yan Gao, Harbin Institute of Technology, China
  • Bo Liu, Harbin Institute of Technology, China
  • Yadong Wang, Harbin Institute of Technology, China
  • Yi Xing, Children’s Hospital of Philadelphia, United States

Presentation Overview:

Motivation: Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) sequencing technologies can produce long-reads up to tens of kilobases, but with high error rates. In order to reduce sequencing error, Rolling Circle Amplification (RCA) has been used to improve library preparation by amplifying circularized template molecules. Linear products of the RCA contain multiple tandem copies of the template molecule. By integrating additional in silico processing steps, these tandem sequences can be collapsed into a consensus sequence with a higher accuracy than the original raw reads. Existing pipelines using alignment-based methods to discover the tandem repeat patterns from the long-reads are either inefficient or lack sensitivity.
Results: We present a novel tandem repeat detection and consensus calling tool, TideHunter, to efficiently discover tandem repeat patterns and generate high-quality consensus sequences from amplified tandemly repeated long-read sequencing data. TideHunter works with noisy long-reads (PacBio and ONT) at error rates of up to 20% and does not have any limitation of the maximal repeat pattern size. We benchmarked TideHunter using simulated and real datasets with varying error rates and repeat pattern sizes. TideHunter is tens of times faster than state-of-the-art methods and has a higher sensitivity and accuracy.
Availability and Implementation: TideHunter is written in C, is open source, and is available at https://github.com/yangao07/TideHunter
Contact: bo.liu@hit.edu.cn; ydwang@hit.edu.cn; XINGYI@email.chop.edu

5:00 PM-6:00 PM
HiTSeq Keynote: Bacterial pangenome graphs - an approximate solution to the correct problem
Room: San Francisco (3rd Floor)
  • Zam Iqbal, European Bioinformatics Institute, United Kingdom
Tuesday, July 23rd
10:15 AM-11:20 AM
HiTSeq Keynote: Advances in single-cell epigenomics using combinatorial indexing
Room: San Francisco (3rd Floor)
  • Andrew Adey, Oregon Health and Sciences University, United States
11:20 AM-11:40 AM
Proceedings Presentation: hicGAN infers super resolution Hi-C data with generative adversarial networks
Room: San Francisco (3rd Floor)
  • Qiao Liu, Tsinghua University, China
  • Hairong Lv, Tsinghua University, China
  • Rui Jiang, Tsinghua University, China

Presentation Overview:

Motivation: Hi-C is a genome-wide technology for investigating 3D chromatin conformation by measuring physical contacts between pairs of genomic regions. The resolution of Hi-C data directly impacts the effectiveness and accuracy of downstream analysis such as identifying topologically associating domains (TADs) and meaningful chromatin loops. High resolution Hi-C data are valuable resources which implicate the relationship between 3D genome conformation and function, especially linking distal regulatory elements to their target genes. However, high resolution Hi-C data across various tissues and cell types are not always available due to the high sequencing cost. It is therefore indispensable to develop computational approaches for enhancing the resolution of Hi-C data.
Results: We propose hicGAN, an open-source framework for inferring high resolution Hi-C data from low resolution Hi-C data with generative adversarial networks (GANs). To the best of our knowledge, this is the first study to apply GANs to 3D genome analysis. We demonstrate that hicGAN effectively enhances the resolution of low resolution Hi-C data by generating matrices that are highly consistent with the original high resolution Hi-C matrices. A typical usage scenario for our approach is to enhance low resolution Hi-C data in new cell types, especially where high resolution Hi-C data are not available. Our study not only presents a novel approach for enhancing Hi-C data resolution, but also provides fascinating insights into the complex mechanisms underlying the formation of chromatin contacts.

11:40 AM-12:00 PM
Characterizing chromatin landscape from aggregate and single-cell genomic assays using flexible duration modeling.
Room: San Francisco (3rd Floor)
  • Mariano Gabitto, Flatiron Institute, United States
  • Anders Rasmussen, Flatiron Institute, United States
  • Richard Bonneau, Center for Data Science, New York University, New York, NY, USA, United States

Presentation Overview:

Distilling functional regions from ATAC-seq and other similar genomic technologies presents diverse analysis challenges, due to the relative sparseness of the data produced and the interaction of complex noise with multiple chromatin structure scales. Methods commonly used to analyze chromatin accessibility datasets are adapted from algorithms designed to process different experimental technologies, disregarding the statistical and biological differences intrinsic to the ATAC-seq technology. Here, we present ChromA, a Bayesian statistical approach that uses hidden semi-Markov models to better model the duration of functional and accessible regions. We demonstrate the method on multiple genomic technologies, with a focus on ATAC-seq data. ChromA annotates the cellular epigenetic landscape by integrating information from replicates, producing a consensus de-noised annotation of chromatin accessibility. ChromA can analyze single cell ATAC-seq data, improving cell type identification and correcting many biases generated by the sparse sampling inherent in single cell technologies. We validate ChromA on multiple technologies and biological systems, including mouse and human immune cells, and find it effective at recovering accessible chromatin, establishing ChromA as a top performing general platform for mapping the chromatin landscape in different cellular populations from diverse experimental designs.

12:00 PM-12:20 PM
Proceedings Presentation: Integrating read-based and population-based phasing for dense and accurate haplotyping of individual genomes
Room: San Francisco (3rd Floor)
  • Vikas Bansal, University of California San Diego, United States

Presentation Overview:

Motivation: Reconstruction of haplotypes for human genomes is an important problem in medical and population genetics. Hi-C sequencing generates read pairs with long-range haplotype information that can be computationally assembled to generate chromosome-spanning haplotypes. However, the haplotypes have limited completeness and low accuracy. Haplotype information from population reference panels can potentially be used to improve the completeness and accuracy of Hi-C haplotyping.

Results: In this paper, we describe a likelihood based method to integrate short-range haplotype information from a population reference panel of haplotypes with the long-range haplotype information present in sequence reads from methods such as Hi-C to assemble dense and highly accurate haplotypes for individual genomes. Our method leverages a statistical phasing method and a maximum spanning tree algorithm to determine the optimal second-order approximation of the population-based haplotype likelihood for an individual genome. The population-based likelihood is encoded using pseudo-reads which are then used as input along with sequence reads for haplotype assembly using an existing tool, HapCUT2. Using whole-genome Hi-C data for two human genomes (NA19240 and NA12878), we demonstrate that this integrated phasing method enables the phasing of 97-98% of variants, reduces the switch error rates by 3-6 fold, and outperforms an existing method for combining phase information from sequence reads with population-based phasing. On Strand-seq data for NA12878, our method improves the haplotype completeness from 71.4% to 94.6% and reduces the switch error rate 2-fold, demonstrating its utility for phasing using multiple sequencing technologies.

Availability and Implementation: Code and datasets are available at github.com/vibansal/IntegratedPhasing

Contact: vibansal@ucsd.edu

12:20 PM-12:40 PM
Haplotype Threading: Accurate Polyploid Phasing from Long Reads
Room: San Francisco (3rd Floor)
  • Sven Schrinner, Heinrich Heine University Düsseldorf, Germany
  • Rebecca Serra Mari, Center for Bioinformatics, Saarland University, Saarbrücken; Graduate School of Computer Science, Saarbrücken, Germany
  • Jana Ebler, Center for Bioinformatics, Saarland University; Graduate School of Computer Science; MPI for Informatics, Saarbrücken, Germany
  • Gunnar W. Klau, Heinrich Heine University Düsseldorf; Cluster of Excellence on Plant Sciences (CEPLAS), Düsseldorf, Germany
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Presentation Overview:

The genomes of many plant species, including important food crops, are polyploid. Resolving genomes at haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. While phasing diploid genomes using long reads has become a routine step, polyploid phasing still presents considerable challenges. The Minimum Error Correction (MEC) model, the most common and successful formalization for diploid phasing, is of limited use for polyploid phasing since it does not address regions where two or more haplotypes are identical. In addition, dynamic programming techniques solving diploid MEC become infeasible in the polyploid case.

Here, we present a method for accurate polyploid phasing that overcomes these challenges by departing from the MEC model. We propose a novel two-stage approach based on (i) clustering reads using a position-dependent scoring function and (ii) threading the haplotypes through the resulting clusters by dynamic programming. We demonstrate that our method scales to whole chromosomes and results in more accurate haplotypes than those computed by the state-of-the-art tool H-PoP. Our algorithm is implemented as part of the widely used open source tool WhatsHap and is hence ready to be included in production settings.
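
The MEC objective mentioned in the abstract above can be written down compactly, which also shows why it struggles when haplotypes coincide. The sketch below is a generic statement of the MEC score, not code from WhatsHap or H-PoP; the data layout (reads as position-to-allele dictionaries, haplotypes as allele lists) and the function names are assumptions made for illustration.

```python
def mismatches(read, haplotype):
    """Number of covered positions where the read disagrees with the haplotype."""
    return sum(1 for pos, allele in read.items() if haplotype[pos] != allele)


def mec_score(reads, haplotypes):
    """Assign each read to its closest haplotype and sum the remaining errors."""
    return sum(min(mismatches(r, h) for h in haplotypes) for r in reads)


# Hypothetical toy data: four biallelic variant positions, alleles 0/1.
h0 = [0, 1, 0, 1]
h1 = [1, 0, 1, 0]
reads = [
    {0: 0, 1: 1, 2: 0},  # matches h0 exactly
    {1: 0, 2: 1, 3: 0},  # matches h1 exactly
    {0: 0, 1: 1, 2: 1},  # one error relative to h0
]
print(mec_score(reads, [h0, h1]))

# In a triploid setting, duplicating either haplotype gives the same score,
# so MEC alone cannot tell which haplotype is present twice.
print(mec_score(reads, [h0, h0, h1]) == mec_score(reads, [h0, h1, h1]))
```

The second comparison illustrates the limitation named in the abstract: because the score only uses the closest haplotype per read, it carries no information about how many copies of an identical haplotype exist.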

2:00 PM-2:20 PM
Characterization of large-scale structural variants using Linked-Reads
Room: San Francisco (3rd Floor)
  • Fatih Karaoglanoglu, Bilkent University, Turkey
  • Camir Ricketts, Cornell University, United States
  • Ezgi Ebren, Bilkent University, Turkey
  • Marzieh Eslami Rasekh, Boston University, United States
  • Iman Hajirasouliha, Cornell University, United States
  • Can Alkan, Bilkent University, Department of Computer Engineering, Turkey

Presentation Overview:

Here we propose novel algorithms to characterize large (>40 Kbp) interspersed segmental duplications, (>80 Kbp) inversions, (>100 Kbp) deletions, and (>100 Kbp) translocations using Linked-Read sequencing data. Linked-Read sequencing provides long range information, where Illumina reads are tagged with barcodes that can be used to assign short reads to pools of larger (30-50 Kbp) molecules. Our methods rely on the split molecule sequence signature that we have previously described. Similar to split reads, split molecules are large segments of DNA that span an SV breakpoint. Therefore, when mapped to the reference genome, these segments map discontinuously. We redesign our earlier algorithm, VALOR, to specifically leverage Linked-Read sequencing data to discover large structural variation. We implement our new algorithms in a new software package, called VALOR2.

2:20 PM-2:40 PM
Detection and assembly of novel sequence insertions using Linked-Reads
Room: San Francisco (3rd Floor)
  • Dmitrii Meleshko, Cornell University, United States
  • Patrick Marks, 10X Genomics Inc., United States
  • Stephen Williams, 10X Genomics Inc., United States
  • Iman Hajirasouliha, Cornell University, United States

Presentation Overview:

See the 2-page abstract attached.

2:40 PM-3:00 PM
Genotyping structural variations using long reads data
Room: San Francisco (3rd Floor)
  • Lolita Lecompte, INRIA, France
  • Pierre Peterlongo, INRIA, France
  • Dominique Lavenier, CNRS, France
  • Claire Lemaitre, INRIA, France

Presentation Overview:

Studies on structural variants (SVs) are expanding rapidly. As a result, and thanks to third generation sequencing technologies, more and more SVs are discovered, especially in the human genome. At the same time, for several applications such as clinical diagnoses, it becomes important to genotype newly sequenced individuals on well defined and characterized SVs. Whereas many SV genotypers have been developed for short read data, there is still no approach to assess whether given SVs are present in a newly sequenced sample of long reads from third generation sequencing technologies.

We present a novel method to genotype known SVs from long read sequencing. The principle of our method is based on the generation of a set of reference sequences that represent the two alleles of each SV. After mapping the long reads to these reference sequences, alignments are analyzed and filtered to keep only informative ones, in order to quantify and estimate the presence of each allele. Tests on simulated long reads based on 1000 deletions from dbVar show a precision of 95.8%. We also applied the method to the whole human genome NA12878.

3:00 PM-3:20 PM
Beta-binomial modeling of CRISPR pooled screen data identifies target genes with greater sensitivity and fewer false negatives
Room: San Francisco (3rd Floor)
  • Hyun-Hwan Jeong, Baylor College of Medicine, United States
  • Seon Young Kim, Baylor College of Medicine, United States
  • Maxime W.C. Rousseaux, University of Ottawa, Canada
  • Huda Y. Zoghbi, Howard Hughes Medical Institute, United States
  • Zhandong Liu, Baylor College of Medicine, United States

Presentation Overview: Show

The simplicity and cost-effectiveness of CRISPR technology have made high-throughput pooled screening approaches accessible to virtually any lab. Analyzing the large sequencing data derived from these studies, however, still demands considerable bioinformatics expertise. Various methods have been developed to lessen this requirement, but there are still three tasks for accurate CRISPR screen analysis that involve bioinformatic know-how if not prowess: designing a proper statistical hypothesis test for robust target identification, developing an accurate mapping algorithm to quantify sgRNA levels, and minimizing the number of parameters that need to be fine-tuned. To make CRISPR screen analysis more reliable as well as more readily accessible, we have developed a new algorithm, called CRISPRBetaBinomial or CB2 (https://CRAN.R-project.org/package=CB2). Based on the beta-binomial distribution, which is better suited to sgRNA data, CB2 outperforms the eight most commonly used methods (HiTSelect, MAGeCK, PBNPA, PinAPL-Py, RIGER, RSA, ScreenBEAM, and sgRSEA) in both accurately quantifying sgRNAs and identifying target genes, with greater sensitivity and a much lower false discovery rate. It also accommodates staggered sgRNA sequences. In conjunction with CRISPRcloud, CB2 will bring CRISPR screen analysis within reach for a wider community of researchers.
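
For readers unfamiliar with the beta-binomial distribution, the following Python sketch shows its probability mass function and variance. It does not reproduce the CB2 test itself; the function names and parameter values are illustrative assumptions. It shows that, for the same mean, a beta-binomial count has a larger variance than a binomial one, which is the overdispersion property that makes it a candidate model for count data.

```python
import math


def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(K = k) for K ~ BetaBinomial(n, alpha, beta)."""
    log_comb = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    )
    return math.exp(
        log_comb + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    )


def beta_binomial_variance(n, alpha, beta):
    """Variance of BetaBinomial(n, alpha, beta)."""
    p = alpha / (alpha + beta)
    return n * p * (1 - p) * (alpha + beta + n) / (alpha + beta + 1)


def binomial_variance(n, p):
    """Variance of Binomial(n, p), for comparison."""
    return n * p * (1 - p)
```

With `n = 10` and `alpha = beta = 2`, the mean proportion is 0.5 in both models, but the beta-binomial variance is 7.0 against 2.5 for the binomial.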

3:20 PM-3:40 PM
Fast and accurate bisulfite alignment and methylation calling for mammalian genomes
Room: San Francisco (3rd Floor)
  • Jonas Fischer, Max Planck Institute for Informatics, Germany
  • Marcel Schulz, Goethe University Frankfurt, Germany

Presentation Overview: Show

Assessment of DNA CpG methylation (CpGm) values via whole-genome bisulfite sequencing (WGBS) is computationally demanding. We present FAst MEthylation calling (FAME), the first approach to quantify CpGm values directly from WGBS reads using efficient data structures. FAME is incredibly fast but as accurate as standard methods, which first produce BS alignment files before computing CpGm values, thus solving the current WGBS analysis bottleneck for large-scale datasets without compromising accuracy.
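
To make the notion of a CpGm value concrete, here is a minimal Python sketch of how a per-site methylation level is derived from bisulfite-converted reads. It is not FAME's data structure or algorithm: it assumes forward-strand reads whose positions are already known, and the function name and input layout are hypothetical. After bisulfite treatment an unmethylated cytosine is read as T while a methylated one remains C, so the level at a CpG is the fraction of C among the C and T observations.

```python
def cpg_methylation(reference, placed_reads):
    """Estimate methylation levels at CpG sites of a reference string.

    placed_reads: iterable of (start, sequence) pairs for forward-strand
    bisulfite reads with known 0-based start positions (an assumption;
    the sketch does not perform any read mapping).
    Returns {cpg_position: methylated / (methylated + unmethylated)}.
    """
    cpg_sites = [
        i for i in range(len(reference) - 1) if reference[i:i + 2] == "CG"
    ]
    counts = {site: [0, 0] for site in cpg_sites}  # [methylated, unmethylated]

    for start, sequence in placed_reads:
        for offset, base in enumerate(sequence):
            site = start + offset
            if site in counts:
                if base == "C":
                    counts[site][0] += 1  # C retained: methylated
                elif base == "T":
                    counts[site][1] += 1  # C converted to T: unmethylated

    levels = {}
    for site, (methylated, unmethylated) in counts.items():
        covered = methylated + unmethylated
        if covered:
            levels[site] = methylated / covered
    return levels
```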

3:40 PM-4:00 PM
PipelineOlympics: Benchmarking of processing workflows for bisulfite sequencing data
Room: San Francisco (3rd Floor)
  • Reka Toth, German Cancer Research Center, Germany
  • Yassen Assenov, German Cancer Research Center (DKFZ), Germany
  • Karl Nordstroem, Saarland University, Germany
  • Angelika Merkel, CNAG, Center of Genomic Regulation (CRG), Spain
  • Edahi Gonzalez-Avalos, La Jolla Institute, United States
  • Matthias Bieg, Heidelberg Center for Personalized Oncology, German Cancer Research Center (DKFZ), Germany
  • Stephen Kraemer, German Cancer Research Center (DKFZ), Germany
  • Murat Iskar, German Cancer Research Center, Germany
  • Helene Kretzmer, University of Leipzig, Germany
  • Lelia Wagner, University of Heidelberg, Germany
  • Lilian Leiter, University of Heidelberg, Germany
  • Giuseppe Petroccino, BioMed X Innovation Center, Germany
  • Anand Mayakonda, German Cancer Research Center (DKFZ), Germany
  • Kersten Breuer, German Cancer Research Center (DKFZ), Germany
  • Gideon Zipprich, German Cancer Research Center (DKFZ), Germany
  • Lena Weiser, German Cancer Research Center (DKFZ), Germany
  • Philip Kensche, German Cancer Research Center (DKFZ), Germany
  • Renata Jurkowska, BioMed X Innovation Center, Germany
  • Christian Lawerenz, Berlin Institute of Health (BIH), Charite - University Clinic of Berlin, Germany
  • Ivo Buchhalter, German Cancer Research Center (DKFZ), Germany
  • Steve Hoffmann, Leibniz Institute of Aging - Fritz Lipmann Institute, Germany
  • Simon Heath, CNAG, Center of Genomic Regulation (CRG), Spain
  • Marc Zapatka, German Cancer Research Center, Germany
  • Joern Walter, Saarland University, Germany
  • Matthias Schlesner, Bioinformatics and Omics Data Analytics, Germany
  • Christoph Bock, CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Austria
  • Christoph Plass, German Cancer Research Center (DKFZ), Germany
  • Pavlo Lutsik, German Cancer Research Center, Germany

Presentation Overview: Show

Whole genome bisulfite sequencing (WGBS) is a state-of-the-art method for the genome-scale assessment of DNA methylation levels, used for bulk, low-input and, more recently, single-cell analysis. Basic data processing – read trimming, alignment and site-wise estimation of DNA methylation levels – is crucial for downstream analysis. PipelineOlympics is a collaborative effort of the leading labs in the field to comprehensively benchmark bisulfite sequencing software, and to provide data processing guidelines for popular wet-lab protocols. At the core of the benchmarking is a reference data set of highly accurate DNA methylation measurements obtained with locus-specific assays which we use as the gold-standard. In the initial phase of the benchmark we generated WGBS data of the well-characterized samples from the gold-standard set using several protocols, circulated the data among the partners, and collected methylation calls of ten representative workflows. Our pilot evaluation through comparing the calls to the gold-standard measurements and to each other revealed important differences in workflow performance, further amplified by protocol peculiarities, biological nature of the samples and variable sequencing depth. Furthermore, an exhaustive exploration of the combinatorial space of workflows is being performed. Ultimately, PipelineOlympics will be transformed into a long-term and extensible public benchmarking resource.

4:25 PM-4:40 PM
AWS for Genomics in the Public Sector
Room: San Francisco (3rd Floor)
  • Angus McAllister, Amazon Web Services, United Kingdom

4:40 PM-5:00 PM
Flexible Experimental Designs for Valid Single-cell RNA-sequencing Experiments Allowing Batch Effects Correction
Room: San Francisco (3rd Floor)
  • Yingying Wei, The Chinese University of Hong Kong, Hong Kong

Presentation Overview: Show

Despite their widespread applications, single-cell RNA-sequencing (scRNA-seq) experiments are still plagued by batch effects and dropout events. Although the completely randomized experimental design has frequently been advocated to control for batch effects, it is rarely implemented in real applications due to time and budget constraints. Here, we mathematically prove that under two more flexible and realistic experimental designs, the “reference panel” and the “chain-type” designs, true biological variability can also be separated from batch effects. We develop Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events, and detect differentially expressed genes without requiring a preliminary normalization step. We demonstrate that BUSseq outperforms existing methods with simulated and real data.

5:00 PM-5:20 PM
Bayesian deconvolution of somatic clones and pooled individuals with expressed variants in single-cell RNA-seq data
Room: San Francisco (3rd Floor)
  • Yuanhua Huang, EMBL-European Bioinformatics Institute, United Kingdom
  • Davis McCarthy, EMBL-EBI, United Kingdom
  • Raghd Rostom, Wellcome Sanger Institute, United Kingdom
  • Sarah Teichmann, Wellcome Sanger Institute, United Kingdom
  • Oliver Stegle, EMBL-European Bioinformatics Institute, Germany

Presentation Overview: Show

Decoding the clonal substructures of somatic tissues sheds light on cell growth, development and differentiation in health, ageing and disease. However, approaches to systematically characterize phenotypic and functional variations between individual clones are not established.

Here we present cardelino (https://github.com/PMBio/cardelino), a Bayesian method for inferring the clonal tree configuration and the identity of individual cells by modelling the expressed variants in single-cell RNA-seq (scRNA-seq) data. Critically, cardelino can integrate a clonal tree configuration derived from external data, e.g., bulk DNA sequencing, and adapt it to scRNA-seq observations. Simulations validate the accuracy of our model and its robustness to the errors in the guide clone configuration. We applied cardelino to 32 human dermal fibroblast lines, identifying hundreds of differentially expressed genes between cells from different somatic clones.
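
The assignment of a cell to a clone from its expressed variants can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not cardelino's model: the clone configuration is treated as fixed, the priors over clones are uniform, and the alternative-allele rates `theta_absent` and `theta_present` are hypothetical constants rather than inferred parameters.

```python
import math


def clone_posteriors(alt, depth, config, theta_absent=0.01, theta_present=0.4):
    """Posterior probability of each clone for one cell.

    alt[i], depth[i]: alternative-allele and total read counts of the
    cell at variant i.
    config[i][k]: 1 if variant i is present in clone k, else 0.
    theta_absent / theta_present: assumed alternative-read rates when
    the variant is absent from or present in the clone.
    """
    n_clones = len(config[0])
    log_like = [0.0] * n_clones
    for a, d, row in zip(alt, depth, config):
        if d == 0:
            continue  # variant not covered in this cell: uninformative
        for k, present in enumerate(row):
            theta = theta_present if present else theta_absent
            # The binomial coefficient is identical for all clones and cancels.
            log_like[k] += a * math.log(theta) + (d - a) * math.log(1 - theta)

    peak = max(log_like)
    weights = [math.exp(value - peak) for value in log_like]
    total = sum(weights)
    return [w / total for w in weights]
```

Because scRNA-seq coverage of any single variant is sparse, the evidence is pooled across all covered variants of a cell; variants without reads simply contribute nothing.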

Additionally, Vireo, a variant of cardelino with an efficient variational inference algorithm, solves a similar problem, the deconvolution of multiplexed scRNA-seq data, by inferring genotypes and clustering cells; hence it does not require genotype information for the pooled samples.

Taken together, our method suite makes it possible to identify molecular signatures that differ between clonal cell populations, and to demultiplex pooled scRNA-seq data across a variety of experimental designs and platforms.

5:20 PM-5:40 PM
ImmunoPepper: Generating Neoepitopes from RNA-Seq data
Room: San Francisco (3rd Floor)
  • Matthias Hüser, ETH Zurich, Switzerland
  • Jiayu Chen, ETH Zurich, Switzerland
  • Andre Kahles, ETH Zurich, Switzerland

Presentation Overview: Show

RNA-Seq is often used as a proxy to inform on the state of a cell’s proteome. However, predicting the set of expressed transcripts from shotgun sequencing data is inherently hard. For some applications, it is not necessary to generate full protein isoforms, and one is only interested in the local proteome variability. This is especially relevant in the context of personalized cancer therapy, when predicting immunogenicity of peptide fragments sampled from the proteome.

We present ImmunoPepper, a software tool that generates the set of all plausible peptides from a splicing graph derived from a given RNA-Seq sample. The generated peptide set can be personalized with germline and somatic variants and takes un-annotated introns into account. To facilitate analysis with standardized tools for MHC binding prediction, we provide output for unique k-mer sets of all generated peptides, where typical k-mer lengths range from 8 to 22.
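
The unique k-mer output can be pictured with a short Python sketch. The function names and the example peptides are hypothetical and this is not ImmunoPepper's code; only the k-mer length range of 8 to 22 comes from the abstract.

```python
def unique_kmers(peptides, k):
    """Return the set of distinct length-k substrings over all peptides."""
    return {
        peptide[i:i + k]
        for peptide in peptides
        for i in range(len(peptide) - k + 1)
    }


def kmer_sets(peptides, k_min=8, k_max=22):
    """Map each k in [k_min, k_max] to its set of unique peptide k-mers.

    Peptides shorter than k contribute nothing for that k.
    """
    return {k: unique_kmers(peptides, k) for k in range(k_min, k_max + 1)}
```

Collapsing overlapping peptides into a unique k-mer set avoids submitting the same candidate repeatedly to a downstream binding predictor.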

We demonstrate the versatility of ImmunoPepper with applications to a set of 63 cancer samples from TCGA in contrast to GTEx and the analysis of 5 mouse tumor samples in comparison to more than 300 background samples taken from mouse reference sets. In both cases, we demonstrate the existence of sample-specific (tumor-specific) splicing-derived peptides.

5:40 PM-6:00 PM
BANDITS: a Bayesian hierarchical model for differential splicing accounting for sample-to-sample variability and mapping uncertainty
Room: San Francisco (3rd Floor)
  • Simone Tiberi, University of Zurich, Switzerland
  • Mark D. Robinson, University of Zurich, Switzerland

Presentation Overview: Show

Alternative splicing plays a fundamental role in the biodiversity of proteins as it allows a single gene to generate several transcripts and, hence, to code for multiple proteins. However, variations in splicing patterns can be involved in diseases. When comparing conditions, typically healthy vs disease, scientists are increasingly focusing on differential transcript usage (DTU), i.e. changes in the relative proportions of transcripts.

A big challenge in DTU analyses is that, unlike in gene level studies, the counts at the transcript level, which are of primary interest, are not observed because most reads map to multiple transcripts. Most DTU methods follow a plug-in approach and take estimated transcript-level counts as input, yet neglect the uncertainty in these estimates.
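
The plug-in quantity that such methods compare between conditions can be sketched as follows in Python. The function name, transcript identifiers, and counts are hypothetical; the sketch only shows how relative transcript proportions are obtained from estimated counts, which are treated as if they were exact.

```python
from collections import defaultdict


def transcript_proportions(estimated_counts, gene_of):
    """Relative usage of each transcript within its gene.

    estimated_counts: {transcript_id: estimated read count}
    gene_of: {transcript_id: gene_id}
    Transcripts of genes with zero total count are omitted.
    """
    gene_totals = defaultdict(float)
    for transcript, count in estimated_counts.items():
        gene_totals[gene_of[transcript]] += count

    return {
        transcript: count / gene_totals[gene_of[transcript]]
        for transcript, count in estimated_counts.items()
        if gene_totals[gene_of[transcript]] > 0
    }
```

A gene whose total count is unchanged between conditions can still show DTU: counts of 80 and 20 for two transcripts in one condition against 50 and 50 in the other give proportions of 0.8/0.2 versus 0.5/0.5.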

To overcome the limitations of current methods for DTU, we present Bspliced, an R package to perform DTU, at both transcript and gene level, based on RNA-seq data. Bspliced uses a Bayesian hierarchical structure to explicitly model the variability between samples, and treats the allocations of reads to the transcripts as latent variables. The parameters of the model are inferred via Markov chain Monte Carlo (MCMC) techniques.

We will show how, in both simulation studies and experimental data analyses, the proposed methodology outperforms existing methods.