Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


HiTSeq: High-throughput Sequencing

COSI Track Presentations

Schedule subject to change
Sunday, July 8th
10:15 AM-10:20 AM
HiTSeq: Welcome
Room: Grand Ballroom A
10:20 AM-11:20 AM
Using Biobanks to swim upstream: phenome risk scores as a way to start with function and move to phenotype
Room: Grand Ballroom A
  • Nancy Cox, Vanderbilt University, United States
11:20 AM-11:40 AM
Allelic decomposition and exact genotyping of highly polymorphic and structurally variant genes
Room: Grand Ballroom A
  • Ibrahim Numanagić, Massachusetts Institute of Technology, United States
  • Salem Malikić, Simon Fraser University, Canada
  • Michael Ford, Simon Fraser University, Canada
  • Xiang Qin, Baylor College of Medicine, United States
  • Lorraine Toji, Coriell Institute for Medical Research, United States
  • Milan Radovich, Indiana University, United States
  • Todd Skaar, IUPUI, United States
  • Victoria Pratt, Indiana University School of Medicine, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States
  • Steve Scherer, HGSC, Baylor College of Medicine, United States
  • S. Cenk Sahinalp, Indiana University Bloomington, United States

Presentation Overview:

High-throughput sequencing provides the means to determine the allelic decomposition for any gene of interest: the number of copies and the exact sequence content of each copy of a gene. However, this is a challenging task, as many clinically and functionally important genes are highly polymorphic and have undergone structural alterations. Despite the clinical and scientific need, no high-throughput sequencing data analysis tool has yet been designed to effectively solve the full allelic decomposition problem.

Here we introduce a combinatorial optimization framework that successfully resolves the number of copies and the exact sequence content of each copy of a gene, including for genes that have undergone structural alterations. We provide an associated computational tool Aldy that performs allelic decomposition of highly polymorphic, multi-copy genes through the use of whole or targeted genome sequencing data. For a large and diverse data set obtained through the use of various sequencing platforms, Aldy identifies multiple rare and novel alleles for several important pharmacogenes, significantly improving upon the accuracy and utility of current genotyping assays. As more data sets become available, we expect Aldy to become an essential component of genotyping toolkits.

11:40 AM-12:00 PM
Realignment of short reads around short tandem repeats significantly improves accuracy of genomic variants detection
Room: Grand Ballroom A
  • Daniel Tello, Universidad de los Andes, Colombia
  • Juanita Gil, Universidad de los Andes, Colombia
  • Jorge Duitama, Universidad de los Andes, Colombia

Presentation Overview:

Accurate detection of genomic variants from short reads generated by high-throughput sequencing technologies is a fundamental feature of any successful data analysis production pipeline for applications such as genetic diagnosis in medicine or genomic selection in plant breeding. Our research group maintains a well-established open-source software solution integrating algorithms for discovery of different types of genomic variants, which can be efficiently used from the command line, a rich graphical interface or a web environment. Because incorrect alignments around short tandem repeats (STRs) are a main source of false positives, we present our solution for realignment of short reads spanning these variants. Users can provide STRs predicted from other tools to realign reads spanning these STRs and to genotype them as a single variation locus. We performed extensive benchmark experiments comparing our solution to state-of-the-art software using both simulated datasets and real data from four species and varying conditions of ploidy, read length, average read depth and read alignment software. Our solution consistently shows equal or better accuracy and efficiency than the other solutions under different conditions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for applications such as personalized medicine.

12:00 PM-12:20 PM
Jointly aligning a group of DNA reads improves accuracy of identifying large deletions
Room: Grand Ballroom A
  • Anish Man Singh Shrestha, The University of Tokyo, Japan
  • Martin C Frith, The University of Tokyo, Japan
  • Kiyoshi Asai, The University of Tokyo, Japan
  • Hugues Richard, University Pierre and Marie Curie, France

Presentation Overview:

Performing sequence alignment to identify structural variants, such as large deletions, from genome sequencing data is a fundamental task, but current methods are far from perfect. The current practice is to independently align each DNA read to a reference genome. We show that the propensity of genomic rearrangements to accumulate in repeat-rich regions imposes severe ambiguities in these alignments, and consequently on the variant calls; with current read lengths, this affects more than one third of known large deletions in the C. Venter genome. We present a method to jointly align reads to a genome, whereby alignment ambiguity of one read can be disambiguated by other reads. We show this leads to a significant improvement in the accuracy of identifying large deletions (≥20 bases), while imposing minimal computational overhead and maintaining an overall running time that is on par with current tools. A software implementation is available as an open-source Python program.

12:20 PM-12:40 PM
Convolutional filtering for mutation signature detection
Room: Grand Ballroom A
  • Christopher Yau, University of Birmingham and The Alan Turing Institute, United Kingdom
  • Yun Feng, University of Oxford, United Kingdom

Presentation Overview:

Mutation signatures are the hallmarks of mutagenic processes in cancer that can provide clues about the biochemical mechanisms by which DNA is altered in cancer. The extraction of such signatures from next generation sequencing data has traditionally been formulated as an unsupervised learning problem and solved using non-negative matrix factorization. We present an entirely novel approach based on convolutional filtering, inspired by technologies used in computer vision and image processing for genomic data analysis. We show that our approach (convSig) has state-of-the-art performance compared to standard methods but also generalizes to allow consideration of longer sequence contexts using deep layering of convolutional networks providing a tool that could potentially reveal the impact of high-level genome structure on mutational density.

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:20 PM
Proceedings Presentation: Novo&Stitch: Accurate Reconciliation of Genome Assemblies via Optical Maps
Room: Grand Ballroom A
  • Weihua Pan, University of California, Riverside, United States
  • Steve Wanamaker, UC Riverside, United States
  • Audrey Ah-Fong, UC Riverside, United States
  • Howard Judelson, UC Riverside, United States
  • Stefano Lonardi, UC Riverside, United States

Presentation Overview:

Motivation: De novo genome assembly is a challenging computational problem due to the high repetitive content of eukaryotic genomes and the imperfections of sequencing technologies (i.e., sequencing errors, uneven sequencing coverage, and chimeric reads). Several assembly tools are currently available, each of which has strengths and weaknesses in dealing with the trade-off between maximizing contiguity and minimizing assembly errors (e.g., mis-joins). To obtain the best possible assembly, it is common practice to generate multiple assemblies from several assemblers and/or parameter settings and try to identify the highest quality assembly. Unfortunately, often there is no assembly that both maximizes contiguity and minimizes assembly errors, so one has to compromise one for the other.

Results: The concept of assembly reconciliation has been proposed as a way to obtain a higher quality assembly by merging or reconciling all the available assemblies. While several reconciliation methods have been introduced in the literature, we have shown in (Alhakami et al., 2017) that none of them can consistently produce assemblies that are better than the assemblies provided in input. Here we introduce Novo&Stitch, a novel method that takes advantage of optical maps to accurately carry out assembly reconciliation (assuming that the assembled contigs are sufficiently long to be reliably aligned to the optical maps, e.g., 50 Kbp or longer). Experimental results demonstrate that Novo&Stitch can double the contiguity (N50) of the input assemblies without introducing mis-joins or reducing genome completeness.

Availability: Novo&Stitch can be obtained from https://github.com/ucrbioinfo/Novo_Stitch
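
For readers unfamiliar with the contiguity metric quoted above, the following is a minimal sketch of how N50 is commonly computed from a list of contig lengths. It is not taken from Novo&Stitch; the function name `n50` and the example lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (bp) for an input and a reconciled assembly.
before = [400, 300, 200, 100]
after = [700, 200, 100]
print(n50(before), n50(after))
```

Merging contigs leaves the total length unchanged but moves the half-coverage point to a longer contig, which is why reconciliation can raise N50 without adding sequence.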

2:20 PM-2:40 PM
Proceedings Presentation: A graph-based approach to diploid genome assembly
Room: Grand Ballroom A
  • Shilpa Garg, MPI-INF, Germany
  • Mikko Rautiainen, Max Planck Institute for Informatics, Germany
  • Adam Novak, UC Santa Cruz, United States
  • Erik Garrison, Wellcome Trust Sanger Institute, United Kingdom
  • Richard Durbin, Wellcome Trust Sanger Institute, United Kingdom
  • Tobias Marschall, Max Planck Institute for Informatics, Center for Bioinformatics, Saarland University, Germany

Presentation Overview:

Motivation: Constructing high-quality haplotype-resolved de novo assemblies of diploid genomes is important for revealing the full extent of structural variation and its role in health and disease. Current assembly approaches often collapse the two sequences into one haploid consensus sequence and, therefore, fail to capture the diploid nature of the organism under study. Thus, building an assembler capable of producing accurate and complete diploid assemblies, while being resource-efficient with respect to sequencing costs, is a key challenge to be addressed by the bioinformatics community.
Results: We present a novel graph-based approach to diploid assembly, which combines accurate Illumina data and long-read Pacific Biosciences (PacBio) data. We demonstrate the effectiveness of our method on a pseudo-diploid yeast genome and show that we require as little as 50x coverage Illumina data and 10x PacBio data to generate accurate and complete assemblies. Additionally, we show that our approach has the ability to detect and phase structural variants.

2:40 PM-3:00 PM
Proceedings Presentation: Strand-seq Enables Reliable Separation of Long Reads by Chromosome via Expectation Maximization
Room: Grand Ballroom A
  • Maryam Ghareghani, Max Planck Institute for Informatics, Germany
  • David Porubsky, Max Planck Institute for Informatics, Germany
  • Ashley Sanders, EMBL, Germany
  • Sascha Meiers, EMBL, Germany
  • Evan Eichler, University of Washington, United States
  • Jan Korbel, EMBL, Germany
  • Tobias Marschall, Max Planck Institute for Informatics, Center for Bioinformatics, Saarland University, Germany

Presentation Overview:

Motivation: Current sequencing technologies are able to produce reads orders of magnitude longer than ever possible before. Such long reads have sparked a new interest in de novo genome assembly, which removes reference biases inherent to re-sequencing approaches and allows for a direct characterization of complex genomic variants. However, even with latest algorithmic advances, assembling a mammalian genome from long error-prone reads incurs a significant computational burden and does not preclude occasional misassemblies. Both problems could potentially be mitigated if assembly could commence for each chromosome separately.
Results: To address this, we show how single-cell template strand sequencing (Strand-seq) data can be leveraged for this purpose. We introduce a novel latent variable model and a corresponding Expectation Maximization (EM) algorithm, termed SaaRclust, and demonstrate its ability to reliably cluster long reads by chromosome.
For each long read, this approach produces a posterior probability distribution over all chromosomes of origin and read directionalities. In this way, it makes it possible to assess the amount of uncertainty inherent to sparse Strand-seq data on the level of individual reads. Among the reads that our algorithm confidently assigns to a chromosome, we observed more than 99% correct assignments on a subset of PacBio reads with 30.1x coverage. To our knowledge, SaaRclust is the first approach for the in silico separation of long reads by chromosome prior to assembly.

3:00 PM-3:20 PM
Proceedings Presentation: A space and time-efficient index for the compacted colored de Bruijn graph
Room: Grand Ballroom A
  • Fatemeh Almodaresi, Stony Brook University, United States
  • Hirak Sarkar, Stony Brook University, United States
  • Avi Srivastava, Stony Brook University, United States
  • Robert Patro, Stony Brook University, United States

Presentation Overview:

Motivation: Indexing large ensembles of genomic sequences is an important building block for various sequence analysis pipelines. De Bruijn graphs are extensively used for representing large genomic indices, although the direct sequence query in a de Bruijn graph is, in general, time consuming and computationally intensive. This substantially slows down downstream methods, such as those used for mapping and alignment. Therefore, a fast, succinct and exact graph based index can be instrumental for performing efficient and accurate genomic analyses at a large scale.
Results: We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences.
Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.
Availability: pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

3:20 PM-3:40 PM
Integrating Hi-C links with assembly graphs for chromosome-scale assembly
Room: Grand Ballroom A
  • Jay Ghurye, University of Maryland, United States
  • Arang Rhie, National Human Genome Research Institute, National Institutes of Health, United States
  • Brian Walenz, National Human Genome Research Institute, National Institutes of Health, United States
  • Anthony Schmitt, Arima Genomics, United States
  • Siddarth Selvaraj, Arima Genomics, United States
  • Mihai Pop, University of Maryland, United States
  • Adam Phillippy, National Human Genome Research Institute, National Institutes of Health, United States
  • Sergey Koren, National Human Genome Research Institute, National Institutes of Health, United States

Presentation Overview:

Motivation: Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, few open-source tools are available, and error rates, particularly for inversions and fusions across chromosomes, remain higher than those of alternative scaffolding technologies.
Results: We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes.
Availability and Implementation: The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.

3:40 PM-4:00 PM
Bridging Linear to Graph-based Alignment with Whole Genome Population Reference Graphs
Room: Grand Ballroom A
  • Mohamed Gunady, University of Maryland, United States
  • Stephen Mount, University of Maryland, College Park, United States
  • Hector Corrada Bravo, University of Maryland, United States
  • Sangtae Kim, Illumina Inc., 5200 Illumina Way, San Diego, United States

Presentation Overview:

For next-generation sequencing (NGS), most existing read aligners depend on a linear reference genome that usually represents a single consensus haplotype. With the diversity in genome sequences among individuals, the need to consider characterized variants derived from population haplotypes becomes inevitable in genotyping and disease association studies. This need has ignited a growing interest in developing graph-based aligners to utilize comprehensive catalogues of known genomic variants. Unfortunately, existing graph alignment algorithms are too computationally expensive for practical application.
We propose an approach that takes advantage of representing population haplotypes as a graph, and efficiently linearizes the graph through Yanagi’s segmentation model. Our method generates a set of maximal L-disjoint segments representing the linearized population graph into a reference sequence library that can be used by any alt-aware linear aligner. Using segments empowers any linear aligner with the efficient graph representation of population variations, while avoiding the expensive computational overhead of aligning over graphs.
We tested our approach on the highly polymorphic HLA genes, which have significant medical importance. Preliminary results suggest that linear aligners assisted with population segments can achieve performance comparable to graph aligners without compromising the space and computational requirements of the linear aligners.

4:00 PM-4:40 PM
Coffee Break
4:40 PM-5:00 PM
Hercules: a profile HMM-based hybrid error correction algorithm for long reads
Room: Grand Ballroom A
  • Can Firtina, Bilkent University, Turkey
  • Ziv Bar-Joseph, Carnegie Mellon University, United States
  • Can Alkan, Bilkent University, Turkey
  • A. Ercument Cicek, Bilkent University, Turkey

Presentation Overview:

Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several applications, including de novo assembly and fusion and structural variation detection, require long and accurate reads. Researchers often combine both technologies and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better integration of these two technologies.
We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and highest accuracy when most of the basepairs of a long read are covered with short reads.

5:00 PM-6:00 PM
Non-coding genetic variation in cancer
Room: Grand Ballroom A
  • Ekta Khurana, Weill Cornell Medical College, United States
Monday, July 9th
10:15 AM-11:20 AM
Coordinated evolution of tumor phylogeny inference methods and sequencing technologies
Room: Grand Ballroom A
  • Cenk Sahinalp, Indiana University, United States
11:20 AM-11:40 AM
Proceedings Presentation: Haplotype Phasing in Single-Cell DNA Sequencing Data
Room: Grand Ballroom A
  • Gryte Satas, Brown University, United States
  • Ben Raphael, Princeton University, United States

Presentation Overview:

Motivation: Single-cell DNA sequencing is a promising technology that allows researchers to examine the genomic content of individual cells. Because the amount of DNA in a single cell is too little to sequence directly, single-cell sequencing requires a method of whole-genome amplification (WGA). WGA introduces biases in the data, including high rates of allelic drop out and non-uniform coverage of the genome. These biases confound many downstream analyses, including the detection of genomic variants. Here, we show that amplification biases have a potential upside: long-range correlations in rates of allelic drop out give a signal for phasing haplotypes at the lengths of PCR amplicons from WGA, rather than individual sequence reads.

Results: We describe a statistical test to evaluate concurrent allelic drop out between single-nucleotide polymorphisms (SNPs) across multiple sequenced single cells. Using this model we derive an algorithm to perform haplotype assembly on single cells. We demonstrate that the algorithm predicts SNP-pair phasing with high accuracy using whole-genome sequencing data from only seven single cells, and results in haplotype blocks (median length 10.2kb) that are orders of magnitude longer than with sequence reads alone (median length 312bp), with low switch error rates (< 2%). We demonstrate similar advantages on whole-exome data, where we obtain haplotype blocks with lengths on the order of typical gene lengths (median length 9.2kb) compared to median lengths of 41bp with sequence reads alone, with low switch error rates (< 4%).
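
The abstract does not say how the switch error rates were computed. As a rough illustration only, the sketch below uses the common definition: the fraction of adjacent heterozygous SNP pairs whose relative phase differs between the true and the predicted haplotype. The function names and the 0/1 string encoding are assumptions made for this example.

```python
def switch_errors(truth, predicted):
    """Count switch errors between two phasings of the same SNPs.

    Each argument is a string of '0'/'1' giving, for every heterozygous
    SNP in one block, which allele lies on the first haplotype.
    """
    if len(truth) != len(predicted):
        raise ValueError("phasings must cover the same SNPs")
    switches = 0
    for i in range(1, len(truth)):
        true_same = truth[i] == truth[i - 1]
        pred_same = predicted[i] == predicted[i - 1]
        # A switch occurs when the relative phase of adjacent SNPs disagrees.
        if true_same != pred_same:
            switches += 1
    return switches


def switch_error_rate(truth, predicted):
    """Switch errors divided by the number of adjacent SNP pairs."""
    pairs = len(truth) - 1
    return switch_errors(truth, predicted) / pairs if pairs > 0 else 0.0


# Hypothetical 10-SNP block with a single phase switch in the middle.
print(switch_error_rate("0000000000", "0000011111"))
```

Because only the relative phase of neighbouring SNPs is compared, swapping the two haplotype labels across a whole block does not count as an error.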

11:40 AM-12:00 PM
HiVA: a web platform for haplotyping and copy number analysis of single-cell genomes and mosaicism detection in bulk DNA
Room: Grand Ballroom A
  • Amin Ardeshirdavani, KU Leuven ESAT - STADIUS, Belgium
  • Masoud Zamani Esteki, University Hospital Leuven, Belgium
  • Daniel Alcaide, KU Leuven ESAT - STADIUS, Belgium
  • Heleen Masset, University Hospital Leuven, Belgium
  • Alejandro Sifrim, University Hospital Leuven, Belgium
  • Parveen Kumar, University Hospital Leuven, Belgium
  • Nathalie Brison, University Hospital Leuven, Belgium
  • Niels Van der Aa, University Hospital Leuven, Belgium
  • Eftychia Dimitriadou, University Hospital Leuven, Belgium
  • Koen Theunis, University of Leuven – KU Leuven, Belgium
  • Hilde Peeters, University Hospital Leuven, Belgium
  • Jan Aerts, KU Leuven ESAT-STADIUS, Belgium
  • Joris Vermeesch, University Hospital Leuven, Belgium
  • Thierry Voet, University Hospital Leuven, Belgium
  • Yves Moreau, Katholieke Universiteit Leuven, Belgium

Presentation Overview:

We have developed a pipeline and user interface for single-cell analysis named HiVA (https://hiva.esat.kuleuven.be). HiVA (Haplarithm inference of Variant Alleles) is an interactive web platform for genome haplarithmisis of DNA samples derived from a large number of cells down to a single cell. HiVA automatically reconstructs parental haplarithms (i.e. haplarithm profiles indicating both haplotypes and copy number states) and provides a user-friendly interface for scrutinizing allelic imbalances across the genome. HiVA is unique as it can uncover mechanistic/segregational origin of genomic aberrations. As such, HiVA can discern meiotic and mitotic origin of aberrations and has the capacity of identifying mosaicisms, chimaerisms, parthenogenetic/gynogenetic. These features therefore would allow scrutinizing the aetiology of genetic disorders.

12:00 PM-12:20 PM
Probabilistic inference of clonal gene expression through integration of RNA & DNA-seq at single-cell resolution
Room: Grand Ballroom A
  • Kieran Campbell, The University of British Columbia, Canada
  • Alexandre Bouchard-Cote, The University of British Columbia, Canada
  • Sohrab Shah, BC Cancer Agency, Canada

Presentation Overview:

Cancers form clones - sets of cells that exhibit similar mutations and genomic rearrangements. As clones evolve to resist chemotherapy, understanding their molecular properties is crucial to designing effective treatments. While it is possible to measure both the DNA and RNA in the same single cell, it is far more common to have large datasets where DNA and RNA are measured in separate cells, albeit on the same samples containing the same clones. Therefore, it remains an open problem to link data across the expression space and genomic space in a way that would allow for clone-specific gene expression estimates.

Here we present clonealign, a scalable statistical method to probabilistically assign each cell as measured in gene expression space to a clone defined in copy number space. We derive an expectation-maximization algorithm that allows thousands of cells to be assigned to clones in minutes on commodity hardware. We apply our method to independently generated whole genome scDNA-seq and 10x Genomics scRNA-seq from both patient-derived breast cancer xenografts and ovarian cancer cell lines to characterize the gene expression of expanding clones over time. We validate the method through held-out gene predictions and loss-of-heterozygosity analyses.

12:20 PM-12:40 PM
De novo single-cell transcript sequence reconstruction with Bloom filters
Room: Grand Ballroom A
  • Ka Ming Nip, BC Cancer Genome Sciences Centre, Canada
  • Readman Chiu, BC Cancer Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Presentation Overview:

De novo transcript sequence reconstruction from RNA-seq data is a difficult problem due to the short read length and the wide dynamic range of transcript expression levels. Although more than 10 algorithms for bulk RNA-seq were published over the past decade, very limited effort was made for de novo single-cell transcript sequence reconstruction, likely due to the technical challenges in analyzing single-cell RNA-seq (scRNA-seq) data. Compared to bulk RNA-seq, scRNA-seq tends to yield more variable read depth across each transcript, lower transcript coverage, and lower overall signal-to-noise ratio. Here, we present a fast and lightweight method for de novo single-cell transcript sequence reconstruction that leverages sequence reads across multiple cells. Our method is implemented in a program called “RNA-Bloom,” which utilizes lightweight probabilistic data structures based on Bloom filters. RNA-Bloom pools input reads from all cells to reconstruct read fragments from paired-end reads for individual cells. In particular, the cell-specificity of reconstructed transcripts is still maintained. In our benchmark, RNA-Bloom’s performance and accuracy surpass state-of-the-art methods that were designed for bulk RNA-seq data. While scRNA-seq has primarily been used for gene expression analysis, this work unlocks new territory for identifying unique isoform structures at the single-cell level.
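
To make the Bloom filter idea concrete, here is a minimal generic sketch of a Bloom filter that stores the k-mers of a few reads. It does not reproduce RNA-Bloom's implementation; the class, the double-hashing scheme built on SHA-256, the filter size and the example reads are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array with several hash positions per item."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two 64-bit halves
        # of a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("ascii")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


# Hypothetical reads pooled into one filter.
bf = BloomFilter(num_bits=1 << 16, num_hashes=3)
for read in ["ACGTACGTGA", "TTGACCGTAA"]:
    for kmer in kmers(read, 5):
        bf.add(kmer)

print("ACGTA" in bf)
```

The filter never reports a stored k-mer as absent, and its memory use is fixed by `num_bits` regardless of how many k-mers are inserted, at the cost of occasional false positives.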

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:20 PM
ABySS-LR: de novo Assembly Pipeline for Linked Reads
Room: Grand Ballroom A
  • Benjamin Vandervalk, BC Cancer Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Genome Sciences Centre, Canada
  • Shaun Jackman, BC Cancer Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer Genome Sciences Centre, Canada
  • Jeffrey Tse, BC Cancer Agency, Canada
  • Hamid Mohamadi, BC Cancer Agency, Canada
  • Yee Fay Lim, BC Cancer Agency, Canada
  • Rene Warren, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Presentation Overview:

Recently, 10x Genomics introduced the Chromium library preparation protocol for augmenting Illumina paired-end reads with long range linkage information ("linked reads"). Under the Chromium protocol, each read pair is tagged with a 16 bp barcode that associates it to one or more long DNA molecule(s) up to 100 kbp in length, providing invaluable information for resolving genomic repeat structures during de novo assembly.

Here we present ABySS-LR, a linked reads assembly pipeline that leverages Chromium barcode information to resolve repeat components (ABySS), cut contigs at misassemblies (Tigmint), and build assembly scaffolds (ARCS). ABySS-LR is being developed for assembly of large genomes with multiple Chromium libraries, in conjunction with other sequencing data types such as paired-end reads, mate pair reads, and long reads. On a linked reads data set for human chromosome 21, ABySS-LR yields an NA50 length of 5.1 Mbp, which represents a 50X improvement over a standard ABySS v2.0 assembly.

2:20 PM-2:40 PM
Kourami: Graph-guided assembly for novel HLA allele discovery
Room: Grand Ballroom A
  • Heewook Lee, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview:

Accurate typing of human leukocyte antigen (HLA), a histocompatibility test, is important because HLA genes play various roles in immune responses and disease genesis. The current gold standard for HLA typing uses targeted DNA sequencing technology requiring specially designed primers or probes. Although there exist enrichment-free computational methods that use various types of sequencing data, hyper-polymorphism found in HLA region of the human genome makes it challenging to type HLA genes with high accuracy from whole genome sequencing data. Furthermore, these methods are database-matching approaches where their output is inherently limited by the incompleteness of already known types, forcing them to find the best matching known alleles from a database, thereby causing them to be unsuitable for discovery of novel alleles. In order to ensure both high accuracy as well as the ability to type novel alleles, we have developed a graph-guided assembly technique for classical HLA genes, which is capable of assembling phased, full-length haplotype sequences of typing exons given high-coverage (> 30-fold) whole genome sequencing data. Our method delivers highly accurate HLA typing, comparable to the current state-of-the-art database-matching methods. Using various data, we also demonstrate that our method can type novel alleles.

2:40 PM-3:00 PM
Proceedings Presentation: Versatile genome assembly evaluation with QUAST-LG
Room: Grand Ballroom A
  • Alla Mikheenko, St. Petersburg State University, St. Petersburg, Russia
  • Andrey Prjibelski, St. Petersburg State University, St. Petersburg, Russia
  • Vladislav Saveliev, St. Petersburg State University, St. Petersburg, Russia
  • Dmitry Antipov, St. Petersburg State University, St. Petersburg, Russia
  • Alexey Gurevich, St. Petersburg State University, St. Petersburg, Russia

Presentation Overview: Show

Motivation: The emergence of high-throughput sequencing technologies revolutionized genomics in the early 2000s. The next revolution came with the era of long-read sequencing. These technological advances, along with novel computational approaches, became the next step towards automatic pipelines capable of assembling nearly complete mammalian-size genomes.

Results: In this manuscript we demonstrate performance of the state-of-the-art genome assembly software on six eukaryotic datasets sequenced using different technologies. To evaluate the results, we developed QUAST-LG, a tool that compares large genomic de novo assemblies against reference sequences and computes relevant quality metrics. Since genomes generally cannot be reconstructed completely due to complex repeat patterns and low coverage regions, we introduce a concept of upper bound assembly for a given genome and set of reads, and compute theoretical limits on assembly correctness and completeness. Using QUAST-LG we show how close the assemblies are to the theoretical optimum, and how far this optimum is from the finished reference.

Availability and implementation: http://cab.spbu.ru/software/quast-lg
Contact: aleksey.gurevich@spbu.ru
Supplementary information: Supplementary data are available.

3:00 PM-3:20 PM
Proceedings Presentation: AmpUMI: Design and analysis of unique molecular identifiers for deep amplicon sequencing
Room: Grand Ballroom A
  • Kendell Clement, Harvard University, United States
  • Rick Farouni, Harvard University, United States
  • Daniel Bauer, Boston Children's Hospital, United States
  • Luca Pinello, Harvard Medical School, United States

Presentation Overview: Show

Motivation: Unique molecular identifiers (UMIs) are added to DNA fragments before PCR amplification to discriminate between alleles arising from the same genomic locus and sequencing reads produced by PCR amplification. While computational methods have been developed to take into account UMI information in genome-wide and single-cell sequencing studies, they are not designed for modern amplicon-based sequencing experiments, especially in cases of high allelic diversity. Importantly, no guidelines are provided for the design of optimal UMI length for amplicon-based sequencing experiments.
Results: Based on the total number of DNA fragments and the distribution of allele frequencies, we present a model for the determination of the minimum UMI length required to prevent UMI collisions and reduce allelic distortion. We also introduce a user-friendly software tool called AmpUMI to assist in the design and the analysis of UMI-based amplicon sequencing studies. AmpUMI provides quality control metrics on frequency and quality of UMIs, and trims and deduplicates amplicon sequences with user-specified parameters for use in downstream analysis. AmpUMI is open-source and freely available at http://github.com/pinellolab/AmpUMI
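
As a rough illustration of why UMI length matters, the sketch below applies the standard birthday-problem approximation to estimate the chance that at least two molecules receive the same UMI. It assumes UMIs are drawn uniformly from all 4^L sequences of length L and ignores allele frequencies, so it is a simplification and not the model implemented in AmpUMI; the function names are hypothetical.

```python
import math


def collision_probability(n_molecules, umi_length):
    """Approximate P(at least one UMI collision) for uniformly random UMIs."""
    n_umis = 4 ** umi_length  # number of distinct A/C/G/T strings of this length
    if n_molecules > n_umis:
        return 1.0  # pigeonhole: more molecules than UMIs forces a collision
    # Birthday approximation: 1 - exp(-n(n-1) / (2N)); expm1 keeps precision
    # when the exponent is tiny.
    return -math.expm1(-n_molecules * (n_molecules - 1) / (2 * n_umis))


def min_umi_length(n_molecules, max_collision_prob):
    """Smallest UMI length whose approximate collision probability is acceptable.

    Assumes max_collision_prob > 0 (otherwise the loop would never end for
    two or more molecules).
    """
    length = 1
    while collision_probability(n_molecules, length) > max_collision_prob:
        length += 1
    return length


if __name__ == "__main__":
    for n in (100, 1000, 100000):
        print(n, "molecules ->", min_umi_length(n, 0.01), "nt")
```

Requiring that no collision occurs anywhere in the sample is a strict criterion; a design that tolerates a small fraction of collided molecules would allow shorter UMIs.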

3:20 PM-3:40 PM
Proceedings Presentation: A Spectral Clustering-Based Method for Identifying Clones from High-throughput B cell Repertoire Sequencing Data
Room: Grand Ballroom A
  • Nima Nouri, Yale School of Medicine, United States
  • Steven H. Kleinstein, Yale University, United States

Presentation Overview: Show

B cells derive their antigen-specificity through the expression of Immunoglobulin (Ig) receptors on their surface. These receptors are initially generated stochastically by somatic rearrangement of the DNA and further diversified following antigen-activation by a process of somatic hypermutation, which introduces mainly point substitutions into the receptor DNA at a high rate. Recent advances in next-generation sequencing (NGS) have enabled large-scale profiling of the B cell Ig repertoire from blood and tissue samples. A key computational challenge in the analysis of these data is partitioning the sequences to identify descendants of a common B cell (i.e. a clone). Current methods group sequences using a fixed distance threshold, or a likelihood calculation that is computationally intensive. Here, we propose a new method based on spectral clustering with an adaptive threshold to determine the local sequence neighborhood. Validation using simulated and experimental data sets demonstrates that this method has high sensitivity and specificity compared to a fixed threshold that is optimized for these measures. In addition, this method works on data sets where choosing an optimal fixed threshold is difficult and is more computationally efficient in all cases. The ability to quickly and accurately identify members of a clone from repertoire sequencing data will greatly improve downstream analyses. Clonally related sequences cannot be treated independently in statistical models, and clonal partitions are used as the basis for the calculation of diversity metrics, lineage reconstruction and selection analysis. Thus, the spectral clustering-based method presented here represents an important contribution to repertoire analysis.
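
For context, the fixed-threshold baseline that the abstract contrasts against can be sketched as single-linkage grouping: two sequences end up in the same clone whenever a chain of pairwise distances at or below a cutoff connects them. The sketch below illustrates only that baseline, not the spectral clustering method. The normalized Hamming distance, the 0.2 cutoff, the equal-length non-empty input and the function names are all assumptions made for illustration.

```python
def hamming_fraction(a, b):
    """Fraction of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def fixed_threshold_clones(seqs, threshold=0.2):
    """Group sequence indices by single linkage under a fixed distance cutoff."""
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            if hamming_fraction(seqs[i], seqs[j]) <= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    clones = {}
    for i in range(len(seqs)):
        clones.setdefault(find(i), []).append(i)
    return sorted(clones.values())


if __name__ == "__main__":
    example = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "CCCCCCCCGG"]
    print(fixed_threshold_clones(example, threshold=0.2))
    print(fixed_threshold_clones(example, threshold=0.05))
```

The two calls in the example give different partitions of the same input, which illustrates how sensitive this baseline is to the chosen cutoff.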

3:40 PM-4:00 PM
IsoCon: Deciphering highly similar multigene family transcripts from Iso-Seq data
Room: Grand Ballroom A
  • Kristoffer Sahlin, Pennsylvania State University, United States
  • Marta Tomaszkiewicz, Pennsylvania State University, United States
  • Kateryna Makova, Pennsylvania State University, United States
  • Paul Medvedev, Pennsylvania State University, United States

Presentation Overview: Show

A significant portion of genes in vertebrate genomes belongs to multigene families, with each family containing several gene copies whose presence/absence, as well as isoform structure, can be highly variable across individuals. Existing de novo techniques for assaying the sequences of such highly similar gene families fall short of reconstructing end-to-end transcripts with nucleotide-level precision or assigning alternatively spliced transcripts to their respective gene copies. We present IsoCon, a high-precision method to tackle this challenge. IsoCon is a combination of experimental, computational, and statistical techniques that leverage the power of long PacBio Iso-Seq reads. We apply IsoCon to nine Y chromosome ampliconic gene families, some of which have been associated with male infertility disorders. IsoCon outperforms existing methods on both experimental and simulated data and is able to reconstruct error-free, low-abundance transcripts that differ by as little as one base pair from transcripts with two orders of magnitude greater abundance. IsoCon has allowed us to detect an unprecedented number of novel isoforms and has opened the door for unraveling the structure of many multigene families and gaining a deeper understanding of genome evolution and human disease.

4:00 PM-4:40 PM
Coffee Break
4:40 PM-5:00 PM
CliqueSNV: Scalable Reconstruction of Intra-Host Viral Populations from NGS Reads
Room: Grand Ballroom A
  • Sergey Knyazev, Georgia State University, United States
  • Viachaslau Tsyvina, Georgia State University, United States
  • Andrew Melnyk, Georgia State University, United States
  • Alexander Artyomenko, Georgia State University, United States
  • Tatyana Malygina, lyrical tokarev icfp contest team, Russia
  • Yuri Porozov, Saint Petersburg National Research University of Information Technologies, Mechanics and Optics, Russia
  • Ellsworth Campbell, Centers for Disease Control and Prevention, United States
  • William Switzer, Centers for Disease Control and Prevention, United States
  • Pavel Skums, Georgia State University, United States
  • Alex Zelikovsky, Georgia State University, United States

Presentation Overview: Show

Highly mutable RNA viruses such as influenza A virus, human immunodeficiency virus and hepatitis C virus exist in infected hosts as highly heterogeneous populations of closely related genomic variants. The presence of low-frequency variants with few mutations with respect to major strains may result in immune escape, emergence of drug resistance, and an increase in virulence and infectivity. Next-generation sequencing technologies permit sampling of intra-host viral populations at great depth, thus providing an opportunity to access low-frequency variants. Long read lengths offered by single-molecule sequencing technologies allow all viral variants to be sequenced in a single pass. However, high sequencing error rates limit the ability to study heterogeneous viral populations composed of rare, closely related variants.
In this article, we present CliqueSNV, a novel reference-based method for reconstruction of viral variants from NGS data. It efficiently constructs an allele graph based on linkage between single nucleotide variations and identifies true viral variants by merging cliques of that graph using combinatorial optimization techniques. The full paper text is available at https://www.biorxiv.org/content/early/2018/03/31/264242
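
The idea of an allele graph over linked SNVs can be sketched as follows, under strong simplifying assumptions: reads are gap-free strings aligned to the full reference, two SNVs are called linked when they co-occur in at least `min_support` reads, and maximal cliques are enumerated with a plain Bron-Kerbosch recursion. The linkage criterion, the threshold and all function names are hypothetical, and the clique-merging step described in the abstract is not shown, so this is not the CliqueSNV algorithm itself.

```python
from collections import Counter
from itertools import combinations


def snvs_in_read(read, reference):
    """Return (position, base) pairs where the read differs from the reference."""
    return [(i, b) for i, (b, r) in enumerate(zip(read, reference)) if b != r]


def build_allele_graph(reads, reference, min_support=2):
    """Connect two SNVs when at least min_support reads carry both of them."""
    pair_counts = Counter()
    for read in reads:
        # SNVs are listed in position order, so each pair has one canonical form.
        for u, v in combinations(snvs_in_read(read, reference), 2):
            pair_counts[(u, v)] += 1
    graph = {}
    for (u, v), count in pair_counts.items():
        if count >= min_support:
            graph.setdefault(u, set()).add(v)
            graph.setdefault(v, set()).add(u)
    return graph


def maximal_cliques(graph):
    """Enumerate maximal cliques (Bron-Kerbosch without pivoting)."""
    if not graph:
        return []
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(current))
            return
        for v in list(candidates):
            expand(current | {v}, candidates & graph[v], excluded & graph[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(graph), set())
    return sorted(cliques)


if __name__ == "__main__":
    reference = "AAAAAA"
    reads = ["CAGAAA"] * 3 + ["AAAATT"] * 2 + ["CAAATA"]
    print(maximal_cliques(build_allele_graph(reads, reference)))
```

In the example, the single read that carries both the position-0 and position-4 variants is not enough to link them, so the two groups of SNVs remain separate candidate variants.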

5:00 PM-5:20 PM
MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Room: Grand Ballroom A
  • Martin Steinegger, Max-Planck-Institute, Republic of Korea
  • Johannes Soeding, MPI BPC, Germany

Presentation Overview: Show

Metagenomics is revolutionizing the study of microbes in their natural environments, such as the human gut, the oceans, or soil, and is revealing the enormous impact of microbes on our health, our climate, and ecology. In metagenomics, DNA or RNA of bacteria, archaea and viruses are sequenced directly, making the 99% of microbes that cannot be cultivated in the lab amenable to investigation. Combined with the enormous drop in sequencing costs by a factor of ten thousand in just ten years, this has led to an explosive growth in the amount of sequence data in public databases.

Sensitive sequence searching has become the main bottleneck in the analysis of large metagenomic datasets. We therefore developed the open-source software MMseqs2 (mmseqs.org), which improves on current search tools over the full range of speed-sensitivity trade-offs, achieving sensitivities better than PSI-BLAST at more than 400 times its speed. Sensitive searches enabled us to annotate 1.1 billion sequences in 8.3 hours on 28 cores. MMseqs2 therefore offers great potential to increase the fraction of annotatable (meta)genomic sequences.

5:20 PM-5:40 PM
Quantification of Private Information Leakage and Privacy-Preserving File Formats for Functional Genomics Data
Room: Grand Ballroom A
  • Gamze Gursoy, Yale University, United States

Presentation Overview: Show

Functional genomics experiments on human subjects present a privacy conundrum. On one hand, many of the conclusions we infer from these experiments are not tied to the identity of individuals but represent universal statements about disease and developmental stages. On the other hand, by virtue of the experimental procedures, the reads from them are tagged with small bits of patients' variant information, which presents privacy challenges for sharing the data. By looking at the “data exhaust” from transcriptome analysis, one can infer sensitive information about the individuals studied. However, there is great desire to share the data as broadly as possible. Therefore, there is a need to quantify the amount of sensitive information leaked at every step of the data exhaust. Here we developed information theory-based measures to quantify private information leakage in various stages of functional genomics data. We found that noisy variant calls, while not useful as genotypes, can be used as strong quasi-identifiers for re-identification purposes through linking attacks. We then focused on how quantifications of expression levels can potentially reveal sensitive information about the subject studied, and how one can take steps to protect patient anonymity.
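
The abstract does not define its leakage measures, so the following is only a generic illustration of how information theory can put a number on leaked variant information: the surprisal, in bits, of an observed genotype given a population allele frequency. Hardy-Weinberg genotype probabilities, independence between variants and the function names are all assumptions of this sketch.

```python
import math


def genotype_probability(alt_freq, alt_count):
    """Hardy-Weinberg probability of carrying 0, 1 or 2 alternate alleles."""
    p, q = alt_freq, 1.0 - alt_freq
    return {0: q * q, 1: 2 * p * q, 2: p * p}[alt_count]


def leaked_bits(observations):
    """Total surprisal in bits over (alt_freq, alt_count) pairs.

    Assumes the variants are independent and every observed genotype has
    nonzero probability under the given allele frequency.
    """
    return sum(-math.log2(genotype_probability(f, g)) for f, g in observations)


if __name__ == "__main__":
    # Hypothetical variants: rare homozygous alternate calls carry the most bits.
    calls = [(0.5, 1), (0.1, 2), (0.01, 1)]
    print(round(leaked_bits(calls), 2), "bits")
```

As a point of reference, roughly 33 bits are enough to single out one person among about eight billion, so even a handful of rare variants can add up to a strong quasi-identifier.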

5:40 PM-6:00 PM
Proceedings Presentation: Asymptotically optimal minimizers schemes
Room: Grand Ballroom A
  • Guillaume Marçais, Carnegie Mellon University, United States
  • Dan DeBlasio, Carnegie Mellon University, United States
  • Carl Kingsford, Carnegie Mellon University, United States

Presentation Overview: Show

Motivation: The minimizers technique is a method to sample k-mers that
is used in many bioinformatics tools to reduce computation, memory
usage and run time. The number of applications using minimizers keeps
on growing steadily. Despite its many uses, the theoretical
understanding of minimizers is still very limited. In many
applications, selecting as few k-mers as possible (i.e. having a low
density) is beneficial. The density is highly dependent on the choice
of the order on the k-mers. Different applications use different
orders, but none of these orders are optimal. A better understanding
of minimizers schemes, and the related local and forward schemes, will
allow designing schemes with lower density, thereby making
existing and future bioinformatics tools even more efficient.

Results: From the analysis of the asymptotic behavior of minimizers,
forward and local schemes, we show that the previously believed lower
bound on minimizers schemes does not hold, and that schemes with
density lower than thought possible actually exist. The proof is
constructive and leads to an efficient algorithm to compare k-mers.
These orders are the first known orders that are asymptotically
optimal. Additionally, we give improved bounds on the density
achievable by the three types of schemes.
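
To make the notions of minimizer selection and density concrete, here is a minimal sketch assuming the common definition: in every window of w consecutive k-mers, the smallest k-mer under some order is selected, and density is the fraction of k-mer positions selected at least once. The lexicographic default order, the leftmost tie-breaking rule and the function names are assumptions for illustration; the asymptotically optimal orders from the paper are not implemented here.

```python
def minimizer_positions(seq, k, w, key=None):
    """Start positions of the k-mers selected as minimizers of seq."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kmers) < w:
        raise ValueError("sequence must contain at least w k-mers")
    order = key if key is not None else (lambda kmer: kmer)  # lexicographic
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # min() returns the first minimal element, so ties go to the leftmost k-mer.
        selected.add(min(window, key=lambda i: order(kmers[i])))
    return sorted(selected)


def density(seq, k, w, key=None):
    """Fraction of k-mer positions that are selected as minimizers."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, key)) / n_kmers


if __name__ == "__main__":
    seq = "ACGTACGT"
    print(minimizer_positions(seq, k=3, w=3))  # [0, 1, 4]
    print(density(seq, k=3, w=3))              # 0.5
```

Passing a different `key` function changes the order on k-mers, which is the design choice the abstract identifies as driving the density.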