Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Posters

View Posters By Category

Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
A-94: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
COSI: HiTSeq
  • Martin Steinegger, Max-Planck-Institute, Republic of Korea
  • Johannes Soeding, MPI BPC, Germany

Short Abstract: Metagenomics is revolutionizing the study of microbes in their natural environments, such as the human gut, the oceans, or soil, and is revealing the enormous impact of microbes on our health, our climate, and ecology. In metagenomics, DNA or RNA of bacteria, archaea and viruses are sequenced directly, making the 99% of microbes that cannot be cultivated in the lab amenable to investigation. Combined with the enormous drop in sequencing costs by a factor of ten thousand in just ten years, this has led to an explosive growth in the amount of sequence data in public databases. Sensitive sequence searching has become the main bottleneck in the analysis of large metagenomic datasets. We therefore developed the open-source software MMseqs2 (mmseqs.org), which improves on current search tools over the full range of speed-sensitivity trade-off, achieving sensitivities better than PSI-BLAST at more than 400 times its speed. Sensitive searches enabled us to annotate 1.1 billion sequences in 8.3 hours on 28 cores. MMseqs2 therefore offers great potential to increase the fraction of annotatable (meta)genomic sequences.

A-96: Diagnosis of Glioma Tumors Using Circulating Cell-Free DNA
COSI: HiTSeq
  • Vikrant Palande, The Azrieli Faculty of Medicine, Bar-Ilan University, Israel
  • Dorith Raviv Shay, The Azrieli Faculty of Medicine, Bar-Ilan University, Israel
  • Rajesh Detroja, The Azrieli Faculty of Medicine, Bar-Ilan University, Israel
  • Milana Frenkel-Morgenstern, The Azrieli Faculty of Medicine, Bar-Ilan University, Israel

Short Abstract: Gliomas are the most frequent brain tumors, making up about 30% of all brain and central nervous system tumors, and 80% of all malignant brain tumors. ‘Liquid biopsy’ is a recently developed non-invasive cancer diagnostic technique. It involves collecting blood or urine samples and diagnosing cancer by analyzing molecular fragments or cancer cells that are released from tumor tissue into the blood or urine. Circulating cell-free DNA (cfDNA) fragments are among the molecular fragments released into the bloodstream after rapid apoptosis or necrosis of tumor cells in cancer patients. Our goal is to conduct a comprehensive comparison between distinct types of glioma tumors and the cfDNA of the respective patients, to elucidate the scope of cfDNA in liquid biopsy for glioma diagnosis. We have successfully detected glioma-specific mutations in genes such as IDH1, IDH2, PDGFRA, NOTCH1, PIK3R1 and TP53 in cfDNA isolated from the plasma of glioma patients and could relate these mutations to the different tumor grades of glioma. This study may help in developing liquid biopsy techniques for glioma diagnosis and prognosis.

A-98: Hercules: a profile HMM-based hybrid error correction algorithm for long reads
COSI: HiTSeq
  • Can Firtina, Bilkent University, Turkey
  • Ziv Bar-Joseph, Carnegie Mellon University, United States
  • Can Alkan, Bilkent University, Turkey
  • A. Ercument Cicek, Bilkent University, Turkey

Short Abstract: Choosing whether to use second or third generation sequencing platforms can lead to trade-offs between accuracy and read length. Several applications require long and accurate reads, including de novo assembly and fusion and structural variation detection. Researchers often combine both technologies, and the more erroneous long reads are corrected using the short reads. Current approaches rely on various graph-based alignment techniques and do not take the error profile of the underlying technology into account. Memory- and time-efficient machine learning algorithms that address these shortcomings have the potential to achieve better integration of these two technologies. We designed and developed Hercules, the first machine learning-based long read error correction algorithm. The algorithm models every long read as a profile Hidden Markov Model with respect to the underlying platform's error profile. The algorithm learns a posterior transition/emission probability distribution for each long read and uses this to correct errors in these reads. Using datasets from two DNA-seq BAC clones (CH17-157L1 and CH17-227A2), and human brain cerebellum polyA RNA-seq, we show that Hercules-corrected reads have the highest mapping rate among all competing algorithms and the highest accuracy when most of the basepairs of a long read are covered with short reads.

A-100: De novo assembled transcriptome of Pelodera (syn. Rhabditis) strongyloides provides insight to genes involved in parasitism
COSI: HiTSeq
  • Menglei Zhang, University of Macau, Macao
  • Liisa Heikkinen, University of Jyvaskyla, Finland
  • Emily Knott, University of Jyvaskyla, Finland
  • Garry Wong, University of Macau, Macao

Short Abstract: Pelodera strongyloides is a facultative parasitic nematode whose dauer larvae are infective. The sequences of P. strongyloides remain unknown except for a few available small subunit ribosomal RNA gene (ssrRNA) sequences. Here, we performed a de novo transcriptome assembly with 285Mb 100 bp paired-end RNA-seq reads from normal, starved and dauer (infective) organisms. Trinity generated 104634 transcript contigs with an N50 of 2195 bp and an average contig length of 1103 bp. Results of BLASTX against five nematodes (C. elegans, Strongyloides stercoralis, Necator americanus, Trichuris trichiura, and Pristionchus pacificus) were consistent with the species' evolutionary relationships. Peptide functional annotation was performed using Trinotate. Sixteen genes were identified as homologous to genes of the C. elegans RNAi system. Differentially expressed genes were identified using edgeR. Dauer larvae up-regulated invading-resistance genes, including an infection response gene (irg-3), TIL-domain protease inhibitors, and retinal degeneration and protein degradation genes, which suggests that the dauer stage of P. strongyloides is a potential model stage for studying parasitism. Dauer larvae also down-regulated cuticle, collagen and body morphology genes, supporting the hypothesis that infective worms tend to express fewer collagen transcripts. Our results show that P. strongyloides is a nematode with simple parasitic behavior that may provide a model to investigate mechanisms involved in parasitism.
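
The assembly statistics quoted above (N50 and average contig length) can be illustrated with a minimal Python sketch. It assumes the usual definition of N50 as the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length. The function names and the toy contig lengths are hypothetical and are not taken from the poster or from Trinity.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches half of the sum.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def mean_length(lengths):
    """Return the average contig length."""
    return sum(lengths) / len(lengths)


# Toy contig lengths in bp (hypothetical, not the poster's data).
toy_contigs = [100, 200, 300, 400, 500]
print(n50(toy_contigs))          # → 400
print(mean_length(toy_contigs))  # → 300.0
```

An N50 well above the mean, as reported in the abstract, indicates that a relatively small number of long contigs accounts for much of the assembled sequence.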

A-102: P(Psy)finder: Identification of novel pseudogenes in DNA sequencing data
COSI: HiTSeq
  • Marcela Davila, Gothenburg University, Sweden
  • Sanna Abrahamsson, Gothenburg University, Sweden
  • Anna Rohlin, Gothenburg University, Sweden

Short Abstract: Processed pseudogenes are created by retro-transposition of mRNA from functional protein coding loci back into the genome. These represent a new class of mutations that occur during cancer development. Typically they are characterized by a complete lack of introns, the presence of small flanking direct repeats and a polyadenine tail near the 3 prime end. There have been several efforts to identify and characterize these pseudogenes for the entire human genome. These approaches rely on locus-specific transcription evidence and high throughput sequencing data. Here we present an automatic pipeline that uses targeted sequencing data to identify processed pseudogenes as a complement to SNP analysis. The output of this method encompasses a list of candidate pseudogenes and visualization plots. Currently we are screening 250 samples from hereditary colorectal cancer. We have identified several new processed pseudogenes and their validation is ongoing.

A-104: Integrating Hi-C links with assembly graphs for chromosome-scale assembly
COSI: HiTSeq
  • Jay Ghurye, University of Maryland, United States
  • Arang Rhie, National Human Genome Research Institute, National Institutes of Health, United States
  • Brian Walenz, National Human Genome Research Institute, National Institutes of Health, United States
  • Anthony Schmitt, Arima Genomics, United States
  • Siddarth Selvaraj, Arima Genomics, United States
  • Mihai Pop, University of Maryland, United States
  • Adam Phillippy, National Human Genome Research Institute, National Institutes of Health, United States
  • Sergey Koren, National Human Genome Research Institute, National Institutes of Health, United States

Short Abstract: Motivation: Long-read sequencing and novel long-range assays have revolutionized de novo genome assembly by automating the reconstruction of reference-quality genomes. In particular, Hi-C sequencing is becoming an economical method for generating chromosome-scale scaffolds. Despite its increasing popularity, there are limited open-source tools available. Errors, particularly inversions and fusions across chromosomes, remain more frequent than with alternative scaffolding technologies. Results: We present a novel open-source Hi-C scaffolder that does not require an a priori estimate of chromosome number and minimizes errors by scaffolding with the assistance of an assembly graph. We demonstrate higher accuracy than the state-of-the-art methods across a variety of Hi-C library preparations and input assembly sizes. Availability and Implementation: The Python and C++ code for our method is openly available at https://github.com/machinegun/SALSA.

A-106: TissueEnrich: A tool to calculate tissue-specific gene enrichment
COSI: HiTSeq
  • Ashish Jain, Iowa State University, United States
  • Geetu Tuteja, Iowa State University, United States

Short Abstract: The development of RNA-Seq technology has enabled large-scale comparison of gene expression in a multitude of developmental stages, cell-types, and conditions. RNA-Seq data analysis typically results in the identification of genes that likely have a shared function. Although gene ontology enrichment analysis is widely used to identify enriched pathways in gene groups, it does not determine enrichment of tissue-specific genes. Understanding which groups of genes are tissue-specific is valuable, as tissue-specific genes are more likely to be associated with human disease. Therefore, we developed “TissueEnrich”, a tool to carry out tissue-specific gene enrichment. TissueEnrich uses RNA-Seq datasets from mouse and human to define tissue-specific genes, using a method developed by the Human Protein Atlas. It uses the hypergeometric test to calculate the enrichment of tissue-specific genes in an input gene list. We tested TissueEnrich using genes that are highly expressed in cardiomyocytes and trophoblast-like cells, each differentiated from embryonic stem cells. We found tissue-specific enrichment for heart and placenta, respectively, validating the robustness of our tool. TissueEnrich is available as a web application, allowing the user to visualize tissue-specific gene enrichment as a barplot, as well as visualize expression of tissue-specific genes.
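
The hypergeometric enrichment test mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not TissueEnrich's own code: the function name, its parameters and the toy numbers are hypothetical, and the p-value is taken as the upper tail P(X >= overlap).

```python
from math import comb


def hypergeom_enrichment_p(population, specific, drawn, overlap):
    """Upper-tail hypergeometric p-value P(X >= overlap).

    population: number of genes in the background set
    specific:   number of background genes that are tissue-specific
    drawn:      size of the input gene list
    overlap:    tissue-specific genes observed in the input list
    """
    upper = min(specific, drawn)
    favourable = sum(
        comb(specific, i) * comb(population - specific, drawn - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(population, drawn)


# Toy example (hypothetical numbers): 20 background genes, 5 of them
# tissue-specific, an input list of 4 genes containing 2 tissue-specific ones.
p_value = hypergeom_enrichment_p(population=20, specific=5, drawn=4, overlap=2)
print(round(p_value, 4))  # → 0.2487
```

In practice one p-value would be computed per tissue and then corrected for multiple testing; that step is left out here.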

A-108: Development of a New Pipeline for Haplotype-Specific Expression: case-study in F1 reciprocal cross in pepper
COSI: HiTSeq
  • Andrea Ghelfi, Kazusa DNA Research Institute, Japan
  • Kenta Shirasawa, Kazusa DNA Research Institute, Japan
  • Munetaka Hosokawa, Kyoto University, Japan
  • Sachiko Isobe, Kazusa DNA Research Institute, Japan

Short Abstract: Diploid species have two genomes, derived from the maternal and paternal parents. RNAs are transcribed either from both genomes, comprising two haplotypes, or from only one of them, comprising the haplotype of one parent. In pepper, a diploid species, abnormal growth was observed in F1 plants derived from a cross between the cultivar ‘Charapita’ (Capsicum chinense) as the female and the cultivar ‘28’ (C. baccatum) as the male, namely Charapitax28, while normal growth was exhibited in the reciprocal cross 28xCharapita. Since haplotype-specific expression (HSE) could explain their differential phenotypes, we established a pipeline for HSE analysis as follows: (1) quality control; (2) quantification of the level of sequence similarity of Charapitax28 (and 28xCharapita) RNA-seq reads against both parental genomes to determine the level of stringency; (3) allocation of the reads to parental sources; (4) assembly of the reads for the parents independently; (5) alignment of haplotype-specific transcripts against the respective parental genomes; (6) sequence comparison of the transcriptomes of each haplotype for SNP identification; (7) SNP annotation and functional effect prediction; (8) differential HSE analysis and annotation. We have also investigated contributions of the cytosolic genomes to the abnormal and normal growth.

A-110: Identification and comparison of shared miRNA-mRNA interaction pairs across multiple cancer types
COSI: HiTSeq
  • Yongsheng Bai, Indiana State University, United States

Short Abstract: MicroRNAs, or miRNAs, are short, highly processed oligonucleotides (approx. 17-25 bp) that carry out post-transcriptional regulation of target mRNAs through either degradation of the target mRNA or inhibition of protein translation. Clusters in a miRNA-mRNA interaction network are often interaction complexes and/or parts of pathways. It is valuable to identify clusters with common mRNAs (or genes) and miRNAs across several cancer types since they can be associated with several cancers. We employed sequencing data from multiple cancer types in The Cancer Genome Atlas (TCGA) to identify and compare clusters (corresponding pairs of genes and miRNAs) and determine the cluster similarity. We found that one cluster in Lung Squamous Cell Carcinoma (LUSC) showed the greatest common percentage similarity while having the most compared cancer types. The clusters in Breast Invasive Carcinoma (BRCA) had the largest number of common pairs with high common percentage similarity. Our observations indicate that these cancer clusters likely possess common driving transcription regulation patterns because they share similar miRNA-mRNA interaction pairs. In addition to providing a comparison method for this growing class of large cancer sequencing datasets, our results yield a list of candidate cancer-associated genes and miRNAs likely associated with several cancers.

A-112: HiCcompare: an R-package for joint normalization and comparison of Hi-C datasets
COSI: HiTSeq
  • John Stansfield, Virginia Commonwealth University, United States
  • Kellen Cresswell, Virginia Commonwealth University, United States
  • Vladimir Vladimirov, Virginia Institute for Psychiatric and Behavioral Genetics, United States
  • Mikhail Dozmorov, Virginia Commonwealth University, United States

Short Abstract: Changes in spatial chromatin interactions are emerging as a unifying mechanism orchestrating the regulation of gene expression. Hi-C sequencing technology allows insight into chromatin interactions on a genome-wide scale. However, Hi-C data contains many DNA sequence- and technology-driven biases. These biases prevent effective comparison of chromatin interactions aimed at identifying genomic regions differentially interacting between, e.g., disease-normal states. Several methods have been developed for normalizing individual Hi-C datasets. However, they fail to account for biases between Hi-C datasets, hindering comparative analysis of chromatin interactions. We developed a simple and effective method, HiCcompare, for the joint normalization and differential analysis of multiple Hi-C datasets. The method introduces a distance-centric analysis of the differences between two Hi-C datasets on a single plot that allows for a data-driven normalization of biases using locally weighted linear regression (loess). HiCcompare outperforms methods for normalizing individual Hi-C datasets and methods for differential analysis (diffHiC, FIND) in detecting a priori known chromatin interaction differences while preserving the detection of genomic structures, such as A/B compartments. Additionally, HiCcompare is able to detect differences in genomic regions related to biological pathways as would be expected based on the cell types being compared. HiCcompare is available on Bioconductor: https://bioconductor.org/packages/HiCcompare/
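
The distance-centric normalization described above can be sketched as follows. This is a simplified illustration, not the HiCcompare implementation: for each pair of interacting bins it computes M = log2(IF2 / IF1) and the bin distance D, fits M as a function of D, and splits the fitted bias symmetrically between the two datasets. As a stand-in for loess, the fit here is just the mean of M at each distance; that substitution, the symmetric split, the function name and the toy values are all assumptions.

```python
from collections import defaultdict
from math import log2


def md_normalize(contacts):
    """Jointly normalize two Hi-C datasets on the M-versus-distance scale.

    contacts: list of (bin_i, bin_j, if1, if2) tuples, where if1 and if2
              are positive interaction frequencies from dataset 1 and 2.
    Returns a list of (bin_i, bin_j, if1_adjusted, if2_adjusted).
    """
    # Collect M = log2(IF2 / IF1) for every bin pair, grouped by distance D.
    m_by_distance = defaultdict(list)
    for i, j, if1, if2 in contacts:
        m_by_distance[abs(j - i)].append(log2(if2 / if1))

    # Stand-in for the loess fit f(D): the mean of M at each distance.
    fit = {d: sum(ms) / len(ms) for d, ms in m_by_distance.items()}

    # Remove the fitted bias, half from each dataset.
    normalized = []
    for i, j, if1, if2 in contacts:
        f = fit[abs(j - i)]
        normalized.append((i, j, if1 * 2 ** (f / 2), if2 * 2 ** (-f / 2)))
    return normalized


# Toy data (hypothetical): dataset 2 is inflated 2x at distance 1 and 4x at
# distance 2, mimicking a distance-dependent bias between experiments.
toy = [(0, 1, 10.0, 20.0), (1, 2, 5.0, 10.0), (0, 2, 4.0, 16.0)]
for i, j, a, b in md_normalize(toy):
    print(i, j, round(a, 3), round(b, 3))
```

After this adjustment the average M at every distance is zero, so remaining differences between the two datasets are no longer driven by the distance-dependent bias captured by the fit.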

A-114: A new method for cell line authentication using variants derived from RNA-seq data
COSI: HiTSeq
  • Tabrez Anwar Shamim Mohammad, Greehey Children's Cancer Research Institute (GCCRI), UTHSCSA, United States
  • Yun S Tsai, Greehey Children's Cancer Research Institute (GCCRI), UTHSCSA, United States
  • Safwa Ameer, University of Texas at San Antonio, United States
  • Hung-I Chen, University of Texas Health Science Center at San Antonio, United States
  • Yu-Chiao Chiu, University of Texas Health Science Center at San Antonio, United States
  • Yidong Chen, UT Health Science Center at San Antonio, United States

Short Abstract: Cell lines are an essential component of biomedical research into the underlying mechanisms of normal and disease biology, including cancer. However, the prevalence of cell line contamination is affecting biomedical research, and available methods for cell line authentication are of limited accessibility and too daunting for many researchers. Therefore, we set out to develop a new RNA-seq based approach for cell line authentication. Our method uses RNA-seq data to identify variants and compares them with variant profiles from other cell lines. RNA-seq data for 934 CCLE cell lines were downloaded from NCI-GDC and cell line specific variant profiles were generated. Pair-wise correlations were calculated using frequencies and depth of coverage of variants, and comparative analysis revealed that variant profiles differ significantly from cell line to cell line, whereas identical, synonymous and derivative cell lines share high variant identity and are highly correlated (r > 0.9). Benchmarking studies, including tests on independent datasets, revealed that the new method can identify a cell line with high accuracy and hence can be a valuable tool in biomedical science. Moreover, our method also estimates possible cross-contamination using a linear admixture model if no perfect match is detected.
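
The profile comparison described above can be sketched as a Pearson correlation between variant allele frequency vectors. This is a simplified illustration and not the authors' method: it ignores depth of coverage, treats variants missing from a profile as frequency 0, and reuses the r > 0.9 cut-off quoted in the abstract. The function names, variant identifiers and frequencies are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant profile
    return cov / sqrt(var_x * var_y)


def profile_correlation(profile_a, profile_b):
    """Correlate two {variant_id: allele_frequency} profiles.

    Variants absent from a profile are treated as frequency 0 (assumption).
    """
    sites = sorted(set(profile_a) | set(profile_b))
    a = [profile_a.get(site, 0.0) for site in sites]
    b = [profile_b.get(site, 0.0) for site in sites]
    return pearson(a, b)


def best_match(query, references, threshold=0.9):
    """Return (name, r) of the best-correlated reference, or (None, r)."""
    r, name = max(
        (profile_correlation(query, ref), ref_name)
        for ref_name, ref in references.items()
    )
    return (name, r) if r > threshold else (None, r)


# Toy reference panel and query sample (hypothetical values).
references = {
    "lineA": {"v1": 1.0, "v2": 0.5, "v3": 1.0},
    "lineB": {"v3": 1.0, "v4": 0.5, "v5": 1.0},
}
query = {"v1": 0.9, "v2": 0.5, "v3": 1.0}
print(best_match(query, references))
```

A query that matches no reference above the threshold returns `None` as the name, which is the case where the abstract's admixture model would be applied.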

A-116: Comparative Analysis of Genomic Sequencing Workflow Management Systems
COSI: HiTSeq
  • Liudmila Mainzer, University of Illinois at Urbana-Champaign, United States
  • Azza Ahmed, University of Khartoum, Sudan
  • Saurabh Baheti, Mayo Clinic, United States
  • Matthew Bockol, Mayo Clinic, United States
  • Prakruthi Burra, University of Illinois at Urbana-Champaign, United States
  • Roy Campbell, University of Illinois at Urbana-Champaign, United States
  • Travis Drucker, Mayo Clinic, United States
  • Faisal Fadlelmola, University of Khartoum, Sudan
  • Steven Hart, Mayo Clinic, United States
  • Jacob Heldenbrand, University of Illinois at Urbana-Champaign, United States
  • Mikel Hernaez, University of Illinois at Urbana-Champaign, United States
  • Matthew Hudson, University of Illinois at Urbana-Champaign, United States
  • Ravishankar Iyer, University of Illinois at Urbana-Champaign, United States
  • Michael Kalmbach, Mayo Clinic, United States
  • Katherine Kendig, University of Illinois at Urbana-Champaign, United States
  • Eric Klee, Mayo Clinic, United States
  • Cynthia Liu, University of Illinois at Urbana-Champaign, United States
  • Nathan Mattson, Mayo Clinic, United States
  • Olgica Milenkovic, University of Illinois at Urbana-Champaign, United States
  • Christian Ross, Mayo Clinic, United States
  • Saurabh Sinha, University of Illinois at Urbana-Champaign, United States
  • Ramshankar Venkatakrishnan, University of Illinois at Urbana-Champaign, United States
  • Matthew Weber, University of Illinois at Urbana-Champaign, United States
  • Eric Wieben, Mayo Clinic, United States
  • Matthew Wiepert, Mayo Clinic, United States
  • Derek Wildman, University of Illinois at Urbana-Champaign, United States

Short Abstract: As genomic sequencing becomes widely used in academic and commercial settings, there is a need for advanced tools to manage the sheer volume of data and the complexity of sequencing analyses. Most genomics is run as a complex workflow with many steps, fans, merges, and quality controls. In order to effectively and efficiently manage analyses on large cohorts of samples, workflow management systems are needed to wrap bioinformatics commands and streamline the variant calling process. Here, we compare the various aspects of three popular workflow management systems for large-scale genomic sequencing analyses: Cromwell/WDL, Nextflow, and Swift/T, using a production-grade use case of genomic variant calling, fully automated in a high-performance computing environment. Though all three can fulfill the same general purpose, their different inbuilt functionalities lend them to different usages. We present a qualitative comparison of the three and a delineation of key comparison metrics, and outline the pros and cons of each workflow management system depending on the nature of the required analysis for high-performance computational needs.

A-118: Zika virus (ZIKV) infection and dysregulation patterns in host editome
COSI: HiTSeq
  • Noel-Marie Plonski, Kent State University, United States
  • Madeline Frederick, Kent State University, United States
  • Helen Piontkivska, Kent State University, United States

Short Abstract: Recent outbreaks of Zika virus have been linked with microcephaly and other neurodevelopmental defects, collectively named congenital Zika syndrome (CZS). As part of the host anti-viral immune response, the interferon alpha (IFNA) pathway is activated, in turn upregulating adenosine deaminase acting on RNA (ADAR), an RNA-editing enzyme that plays a key role in neurodevelopment. We hypothesize that virus-induced dysregulation of ADAR will change editing patterns of specific neurally-expressed transcripts. However, to understand the role of dysregulated RNA editing in disease pathogenesis, such as that of CZS, we need better insights into RNA editing during neural development. We delineate the ADAR isoform expression patterns that occur throughout neural development using publicly available RNA-seq samples and the AIDD pipeline, designed in our lab for exploring transcriptome diversity. In addition to differential expression patterns of ADAR isoforms, the status of Q/R sites in GRIA2 transcripts is also explored to determine edited isoform(s) and their relative ratio during development. By characterizing the normal Q/R switch pattern as well as ADARp150 expression among different neural stem cell types, we gain insights into the role of the host innate immune response via ADARs as a potential factor in CZS pathogenesis.

A-120: Realignment of short reads around short tandem repeats significantly improves accuracy of genomic variants detection
COSI: HiTSeq
  • Daniel Tello, Universidad de los Andes, Colombia
  • Juanita Gil, Universidad de los Andes, Colombia
  • Jorge Duitama, Universidad de los Andes, Colombia

Short Abstract: Accurate detection of genomic variants from short reads generated by high throughput sequencing technologies is a fundamental feature of any successful data analysis production pipeline for applications such as genetic diagnosis in medicine or genomic selection in plant breeding. Our research group maintains a well established open-source software solution integrating algorithms for discovery of different types of genomic variants, which can be efficiently used from command line, a rich graphical interface or a web environment. Because incorrect alignments around short tandem repeats (STRs) are a main source of false positives, we present our solution for realignment of short reads spanning these variants. Users can provide STRs predicted from other tools to realign reads spanning these STRs and to genotype them as a single variation locus. We performed extensive benchmark experiments comparing our solution to state-of-the-art software using both simulated datasets and real data from four species and varying conditions of ploidy, read length, average read depth and read alignment software. Our solution consistently shows equal or better accuracy and efficiency than the other solutions under different conditions. We expect that this work will contribute to the continuous improvement of quality in variant calling needed for applications such as personalized medicine.

A-122: Convolutional filtering for mutation signature detection
COSI: HiTSeq
  • Christopher Yau, University of Birmingham and The Alan Turing Institute, United Kingdom
  • Yun Feng, University of Oxford, United Kingdom

Short Abstract: Mutation signatures are the hallmarks of mutagenic processes in cancer that can provide clues about the biochemical mechanisms by which DNA is altered in cancer. The extraction of such signatures from next generation sequencing data has traditionally been formulated as an unsupervised learning problem and solved using non-negative matrix factorization. We present an entirely novel approach based on convolutional filtering, inspired by technologies used in computer vision and image processing for genomic data analysis. We show that our approach (convSig) has state-of-the-art performance compared to standard methods but also generalizes to allow consideration of longer sequence contexts using deep layering of convolutional networks providing a tool that could potentially reveal the impact of high-level genome structure on mutational density.

A-124: Quantification of Private Information Leakage and Privacy-Preserving File Formats for Functional Genomics Data
COSI: HiTSeq
  • Gamze Gursoy, Yale University, United States

Short Abstract: Functional genomics experiments on human subjects present a privacy conundrum. On one hand, many of the conclusions we infer from these experiments are not tied to the identity of individuals but represent universal statements about disease and developmental stages. On the other hand, by virtue of the experimental procedures, the reads from them are tagged with small bits of patients' variant information, which presents privacy challenges for data sharing. By looking at the “data exhaust” from transcriptome analysis, one can infer sensitive information about the subjects. However, there is great desire to share the data as broadly as possible. Therefore, there is a need to quantify the amount of sensitive information leaked at every step of the data exhaust. Here we developed information theory-based measures to quantify private information leakage in various stages of functional genomics data. We found that noisy variant calls, while not useful as genotypes, can be used as strong quasi-identifiers for re-identification through linking attacks. We then focused on how quantifications of expression levels can potentially reveal sensitive information about the subject studied, and how one can take steps to protect patient anonymity.

A-126: A Novel Method for Detecting Variation in Single-Cell Sequencing
COSI: HiTSeq
  • Adam Cornish, University of Nebraska Medical Center, United States
  • Babu Guda, University of Nebraska Medical Center, United States

Short Abstract: Single-cell sequencing enables the acquisition of genetic data from individual cells to better understand diseases such as cancer or autoimmune disorders, which are often affected by changes in rare cells. Currently, no existing software targets the identification of single nucleotide or insertion/deletion variations in single-cell RNA sequencing (scRNA-seq) data, and generating high quality variant data is vital to the study of the aforementioned diseases, among others. We have created such a tool, Red Panda, which takes advantage of the uniqueness of scRNA-seq. Sequencing data, scRNA-seq and exome, were generated from articular chondrocytes extracted from a human knee. On average, Red Panda identified 5,167 variants per cell. Our software outperformed all tested bulk sequencing variant callers. We used the exome as our comparison dataset as it is isogenic with the scRNA-seq data: for Red Panda, on average, 913 variants were shared with the exome and had a PPV of 45.0%; freebayes: 65 variants and an 8.7% PPV; GATK HaplotypeCaller: 705 variants and a 31.7% PPV; GATK UnifiedGenotyper: 222 variants and a 5.8% PPV; and Platypus: 386 variants and a 7.0% PPV. Our method provides a novel and improved mechanism to identify variants in scRNA-seq.

A-128: SAPLING: Suffix Array Piecewise Linear INdex for Genomics
COSI: HiTSeq
  • Michael Kirsche, Johns Hopkins University, United States
  • Michael Schatz, Johns Hopkins University, United States

Short Abstract: Data structures that facilitate rapid string search are essential to many applications of genomic sequencing data such as read mapping and whole genome alignment. Existing approaches to the substring search problem use variants of binary search to locate a query substring in the reference genome’s suffix array, a lexicographically sorted list of its suffix indices. While in theory these approaches scale logarithmically with respect to the input size, the runtime is worse in practice due to the additional time required to make consecutive memory lookups when the memory addresses are on different cache pages. Here we present Sapling, a data structure which augments suffix arrays to enable faster substring search queries. By using a prediction function which better localizes the position of a substring in the suffix array, Sapling can perform a more focused binary search in the neighborhood of the predicted position. This greatly reduces the number of random memory lookups at the cost of the small number of arithmetic operations needed to compute the prediction function. We show that Sapling improves performance in human and other genomes compared to previous binary search techniques.
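
The idea of predicting a suffix array position and then searching only its neighborhood can be sketched as follows. This is a toy illustration, not the Sapling implementation: the prediction is a single linear map from the first k bases of the query (assuming k-mers are spread roughly uniformly) rather than a fitted piecewise linear function, and a galloping search around the guess is used so that the result stays correct however poor the guess is. All names are hypothetical, and the input is assumed to be a non-empty string over A, C, G and T.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def predict_position(query, n, k):
    """Guess a suffix array index from the first k bases of the query.

    The k-mer is read as a base-4 number (short queries are padded with A)
    and scaled linearly onto the range [0, n).
    """
    value = 0
    for i in range(k):
        value = value * 4 + (CODE[query[i]] if i < len(query) else 0)
    return value * n // 4 ** k


def find(text, sa, query, k=4):
    """Return the start of one occurrence of query in text, or -1."""
    n, m = len(sa), len(query)

    def prefix(i):
        # First m characters of the i-th smallest suffix.
        return text[sa[i]:sa[i] + m]

    guess = predict_position(query, n, k)
    step = 1
    if prefix(guess) < query:
        # The match, if any, lies to the right: gallop rightwards.
        lo = hi = guess + 1
        while hi < n and prefix(hi) < query:
            lo = hi + 1
            hi = min(n, hi + step)
            step *= 2
    else:
        # The match, if any, lies at or left of the guess: gallop leftwards.
        lo = hi = guess
        while lo > 0 and prefix(lo - 1) >= query:
            hi = lo - 1
            lo = max(0, lo - step)
            step *= 2
    # Binary search restricted to the bracketed window [lo, hi].
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < query:
            lo = mid + 1
        else:
            hi = mid
    if lo < n and prefix(lo) == query:
        return sa[lo]
    return -1


text = "ACGTACGTTGCA"
sa = build_suffix_array(text)
print(find(text, sa, "GTT"))  # → 6
print(find(text, sa, "GGG"))  # → -1
```

The better the prediction, the smaller the bracketed window and the fewer scattered memory lookups are needed, which is the trade-off the abstract describes.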

A-130: Probabilistic inference of clonal gene expression through integration of RNA & DNA-seq at single-cell resolution
COSI: HiTSeq
  • Kieran Campbell, The University of British Columbia, Canada
  • Alexandre Bouchard-Cote, The University of British Columbia, Canada
  • Sohrab Shah, BC Cancer Agency, Canada

Short Abstract: Cancers form clones - sets of cells that exhibit similar mutations and genomic rearrangements. As clones evolve to resist chemotherapy, understanding their molecular properties is crucial to designing effective treatments. While it is possible to measure both the DNA and RNA in the same single cell, it is far more common to have large datasets where DNA and RNA are measured in separate cells, albeit on the same samples containing the same clones. Therefore, it remains an open problem to link data across the expression space and genomic space in a way that would allow for clone-specific gene expression estimates. Here we present clonealign, a scalable statistical method to probabilistically assign each cell as measured in gene expression space to a clone defined in copy number space. We derive an expectation-maximization algorithm that allows thousands of cells to be assigned to clones in minutes on commodity hardware. We apply our method to independently generated whole genome scDNA-seq and 10x Genomics scRNA-seq from both patient-derived breast cancer xenografts and ovarian cancer cell lines to characterize the gene expression of expanding clones over time. We validate the method through held-out gene predictions and loss-of-heterozygosity analyses.

A-132: The fractured landscape of RNA-seq alignment: The default in our STARs
COSI: HiTSeq
  • Sara Ballouz, Cold Spring Harbor Laboratory, United States
  • Alexander Dobin, Cold Spring Harbor Laboratory, United States
  • Thomas Gingeras, Cold Spring Harbor Laboratory, United States
  • Jesse Gillis, Cold Spring Harbor Laboratory, United States

Short Abstract: Benchmarking assessments for RNA-seq alignment and quantification methods often show comparatively high performance at their default parameters. Alignment assessment is thus subject to Fredkin’s paradox, in which the difficulty of picking a winner reflects this broad similarity in output. Even though performances may be subject to technical or algorithmic differences, subtle choices such as performance metric, assessment task or dataset are rarely themselves questioned or assessed. What, then, should assessments and developers focus on in order to move the field forward? Here, we perform an assessment of assessments, focusing on the characterization of gene expression from RNA-seq data. Our exhaustive assessment of the STAR aligner across a range of alignment parameters, first using common metrics and then on biologically focused tasks, reveals three key insights. First, technical metrics are uninformative, capturing properties unlikely to have any role in biological discovery. Second, changes in alignment parameters have surprisingly little impact. Third, when performance finally does break, it happens mostly in difficult genomic regions. Not surprisingly, these problems generalize when we then assess our specific findings within a broader compendium of RNA-seq data. By determining where results are likely to be robust or fragile, we can establish a better baseline to help methodological progress.

A-134: UNCALLED: an aligner for quickly mapping raw nanopore signals to large references
COSI: HiTSeq
  • Sam Kovaka, Johns Hopkins University, United States
  • Michael Schatz, Johns Hopkins University, United States

Short Abstract: Nanopore sequencing works by measuring ionic current as a nucleotide strand passes through a pore. Different k-mers in the pore have different expected current levels, which can be used to infer individual nucleotides. Oxford Nanopore sequencers have a unique ability known as “read until”, where the sequencing of an unwanted molecule can be stopped in real-time. This can be used to preferentially avoid or enrich for sequencing reads that originate from certain target genomes, such as the contaminants or pathogens in a metagenomics study. Current methods for efficiently aligning nanopore reads require computationally expensive basecalling, and all accurate basecallers require fully sequenced reads. Here we present UNCALLED (Utility for Nanopore Current ALignment to Large Expanses of DNA), an aligner that can rapidly map streaming nanopore current signals to a reference genome without basecalling. This is accomplished by probabilistically considering all possible k-mers that the signal could represent, and then pruning the possibilities based on what is in the reference using an FM-index. UNCALLED currently processes at about 1.5x the rate DNA passes through the pore (~675bp/sec), and requires only ~250bp to accurately map E. coli reads. We are actively optimizing this algorithm for larger genomes such as human.

A-136: Simulating multiple ploidies in RNAseq data
COSI: HiTSeq
  • Adam Voshall, University of Nebraska-Lincoln, United States
  • Etsuko Moriyama, University of Nebraska-Lincoln, United States

Short Abstract: Current RNAseq simulation methods use only a single reference sequence to generate reads, which is equivalent to haploid cells or homozygous backcrosses of model organisms. For non-model species, however, most RNAseq experiments involve strains that are diploid or polyploid. In nature, some plants and amphibians have up to 12 copies of each chromosome, with different combinations of similarity between pairs of chromosomes. However, the impact of these ploidies on various bioinformatics workflows has not been studied well. In order to determine the impact that heterozygous alleles have on transcriptome assembly and other downstream analyses (e.g., gene expression), in this study we examined how benchmark datasets can be generated using simulation. Existing RNAseq simulation methods, such as Flux-Simulator, must be extended to utilize more than one reference sequence from different strains or closely related species. By specifying allele-specific expression for each reference sequence, these simulated datasets can be used for testing tasks such as read mapping, allele quantification, variant calling, differential expression, and time-series analysis under biologically realistic conditions. We discuss the impact of heterozygous alleles on transcript assembly accuracy and transcript quantification at both the overall and allele-specific level.

A-138: Comparing binning methods in metagenomic assembly
COSI: HiTSeq
  • Kyle Wong, University of Illinois at Chicago, United States
  • Mark Maienschein-Cline, University of Illinois at Chicago, United States
  • George Chlipala, University of Illinois at Chicago, United States
  • Pinal Kanabar, University of Illinois at Chicago, United States

Short Abstract: Advances in next-generation sequencing (NGS) have greatly facilitated the generation of high-quality read data from metagenomic datasets. Assembling these reads can yield large sequences that can be used to determine useful data, such as species diversity and gene/pathway prediction. However, directly assembling these unfiltered reads presents the risk of creating chimeric contigs - sequences that contain data from two or more species - and requires considerable computational power (especially memory) for highly diverse data sets. One possible solution to this issue is binning: grouping and assembling similar reads to reduce the risk of misassembly and the computational workload per assembly. We assessed the efficacy of two binning methods, k-mer-based (MetaProb) and taxonomic-based (DIAMOND alignments to NCBI non-redundant database, with Least Common Ancestor taxonomic summaries), on a synthetic metagenomic dataset. We then assembled the bins with the SPAdes assembler and compared the results to an assembly performed on the same dataset without binning. The results will allow us to determine the binning method that most meaningfully reduces the number of misassemblies, as well as evaluate the specificity and sensitivity of different binning strategies. We anticipate that these methods can be used to provide high-quality contigs when performing analyses on metagenomic datasets.

A-140: GenomeScope 2.0: Polyploid-Aware Inference of Genome Characteristics from Unassembled Sequencing Data
COSI: HiTSeq
  • Timothy Ranallo-Benavidez, Johns Hopkins University, United States
  • Michael Schatz, Johns Hopkins University, United States

Short Abstract: De novo genome assembly and analysis are essential to our understanding of the genes and molecular biology of an organism. A crucial first step often involves estimating genome characteristics such as genome size, heterozygosity, and repetitiveness, since knowing these properties can guide the assembly process and inform evolutionary analysis. Current kmer-based methods for inferring these properties, however, are unable to account for the increased complexity of polyploid genomes, thereby limiting their application for many important plant and animal species. Here we present GenomeScope 2.0, which utilizes a polyploid-aware mixture model to computationally infer genome characteristics from unassembled sequencing data. As with GenomeScope 1.0, a mixture of negative binomial distributions is fit to the kmer spectrum of the sequencing data, but the new model includes additional components to capture kmers across higher ploidy levels. We show that GenomeScope 2.0 quickly and accurately analyzes simulated and real data, including sequencing data from the sweet potato (Ipomoea batatas) and wheat (Triticum aestivum) genomes. We also show that GenomeScope 2.0 can be utilized to help phase polyploid genomes and to determine the chromosomal phylogenetic relationship of polyploid genomes.
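The kmer spectrum that such a model is fit to is simply a histogram of kmer multiplicities. A minimal sketch of computing it from reads is shown here; the function name is hypothetical, reverse complements are ignored for brevity, and the mixture-model fitting itself is not shown.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Return a Counter mapping multiplicity m -> number of distinct kmers seen m times."""
    counts = Counter()
    for read in reads:
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Histogram of the per-kmer counts: this is the kmer spectrum.
    return Counter(counts.values())

spectrum = kmer_spectrum(["ACGTAC", "CGTACG"], k=3)
for multiplicity, n_kmers in sorted(spectrum.items()):
    print(multiplicity, n_kmers)
```

For real sequencing data a dedicated kmer counter would normally produce this histogram; the pure-Python loop is only meant to make the definition concrete.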

A-142: An accurate approach for detecting InDels in circulating tumor DNA (ctDNA) and FFPET samples
COSI: HiTSeq
  • Hamid Mirebrahim, Roche Sequencing Solutions, United States
  • Paul W. Shi, Roche Sequencing Solutions, United States
  • Christopher Kingsley, Roche Sequencing Solutions, United States
  • Fergal Casey, Roche Sequencing Solutions, United States
  • Alex Lovejoy, Roche Sequencing Solutions, United States
  • Amrita Pati, Roche Sequencing Solutions, United States

Short Abstract: Next Generation Sequencing (NGS) has accelerated advancements in precision medicine, especially for cancer diagnosis and treatment, by enabling fast and accurate detection of genetic changes. Multiple techniques have been introduced for deep sequencing of circulating tumor DNA (ctDNA, liquid biopsy) and fixed tissue samples (e.g. Formalin-Fixed Paraffin-Embedded Tissue (FFPET, solid tumor biopsy) samples). Insertions and deletions (InDels) are one of the dominant forms of DNA aberration in human cancer. InDels are the driver mutation in many oncogenes and tumor suppressor genes. We present a hybrid approach for detecting functional and de-novo somatic InDels in ctDNA and FFPET samples with high sensitivity and specificity. Accurate barcoding of the reads followed by digital error suppression enables us to detect short and long de-novo InDels with AF as low as 5% on the targeted exonic regions with 99% sensitivity and specificity. To improve the sensitivity of our method, we have compiled a comprehensive list of functional InDels for NSCLC and CRC. Using this list along with specific models for highly mutated genes in NSCLC and CRC, like MET and EGFR, enabled even lower limits of detection (<1%) for the targeted InDels with > 95% sensitivity and specificity.

A-144: Identifying Genotypically Distinct Strains of Mycoplasma bovis Within a Single Set of Reads
COSI: HiTSeq
  • Matthew Waldner, University of Saskatchewan, Canada
  • Anthony Kusalik, University of Saskatchewan, Canada

Short Abstract: Current general laboratory techniques for culturing a single bacterial species do not effectively differentiate genotypically distinct bacterial strains. The subsequent colony selection, DNA isolation, and sequencing steps result in a single set of reads that may contain DNA sequences that are not unique to a single strain. If the reads are assembled without taking strain mixing into consideration, the resultant assembly will be a composition of multiple strains. Our objective is to establish if the genotypically distinct regions within a set of mixed-strain reads can be identified. This objective is being pursued by evaluating the coverage and quality of paired-end reads mapped against the de novo assembly graph generated by SPAdes. Testing is currently being completed on an in vitro mix of multiple genotypically distinct strains of Mycoplasma bovis DNA, as determined by previously sequenced whole genome sequence taxonomies. The previously sequenced strains are also being used to synthetically create M. bovis mixed-strain read sets for testing. We expect the results will include the identification of sequences unique to each of the genotypically distinct strains in the mix and provide a method for the identification of a set of mixed-strain reads preventing the inaccurate assembly of multiple strains as one.

A-146: Using RNA-seq to determine cancer-specific therapeutic targets
COSI: HiTSeq
  • Mainá Bitar, QIMR Berghofer, Australia
  • Elizabeth O'Brien, QIMR Berghofer, Australia
  • Isabela Almeida, QIMR Berghofer and UFPB, Australia
  • Guy Barry, QIMR Berghofer, Australia

Short Abstract: Cancer treatments pose a hazard to the populations of stem cells that reside outside the tumour and are significantly affected by radiotherapy, for example. This can be addressed by developing alternative therapies that target cancer-specific molecules. To identify cancer-specific transcripts, we sequenced the RNA content of multiple human stem cell populations and characterized their overall transcriptional landscape. As proof of principle, we investigated the similarities and differences between these various stem cells and between these and publicly available glioblastoma (GBM) cell lines, based on transcript expression revealed by RNA-Seq. We used different methods for transcript quantification and comparison of expression profiles, also investigating the functional impact of the observed differences. In a pilot experiment we selected 6 uniquely expressed transcripts and used antisense oligonucleotides (ASOs) to modulate their expression in primary GBM cell lines. We observed a drastic decrease in proliferation rates with all the ASOs tested. Overall, contrasting the transcriptomes of stem cells and cancer cells represents an alternative approach to better understanding disease progression and the pathways that lead to cell malignancy. Our findings may further support the development of new forms of treatment that specifically target the malignant cells within a tumour.

A-148: Bridging Linear to Graph-based Alignment with Whole Genome Population Reference Graphs
COSI: HiTSeq
  • Mohamed Gunady, University of Maryland, United States
  • Stephen Mount, University of Maryland, College Par, United States
  • Hector Corrada Bravo, University of Maryland, United States
  • Sangtae Kim, Illumina Inc., 5200 Illumina Way, San Diego, United States

Short Abstract: For next-generation sequencing (NGS), most existing read aligners depend on a linear reference genome that usually represents a single consensus haplotype. With the diversity in genome sequences among individuals, the need to consider characterized variants derived from population haplotypes becomes inevitable in genotyping and disease association studies. This need has ignited a growing interest in developing graph-based aligners to utilize comprehensive catalogues of known genomic variants. Unfortunately, existing graph alignment algorithms are too computationally expensive for practical application. We propose an approach that takes advantage of representing population haplotypes as a graph, and efficiently linearizes the graph through Yanagi’s segmentation model. Our method generates a set of maximal L-disjoint segments that represent the linearized population graph as a reference sequence library, which can be used by any alt-aware linear aligner. Using segments empowers any linear aligner with the efficient graph representation of population variations, while avoiding the expensive computational overhead of aligning over graphs. We tested our approach on the highly polymorphic HLA genes, which have significant medical importance. Preliminary results suggest that linear aligners assisted with population segments can achieve performance comparable to graph aligners without compromising their space and computational requirements.

A-150: Standardized reference sample for reliable differential expression calls across labs
COSI: HiTSeq
  • Paweł P. Łabaj, MCB UJ, Kraków, Poland & Austrian Academy of Sciences, Vienna, Austria
  • David P. Kreil, Boku University, Vienna, Austria

Short Abstract: The deduction of gene function remains a major bottleneck in improving our understanding of living systems. An important source of information about the function of a gene is when, where, and how strongly it is being expressed. In the post-genomic era, genome-scale expression profiling has become a key tool of functional genomics. The recent advances in next generation sequencing (NGS) technology have led to a wave of new findings based on whole-transcriptome sequencing (RNA-Seq). We and others have shown, however, that NGS suffers from different sources of unwanted variation affecting interpretation of the results. In our recent study of the power and limitations of RNA-Seq (by the MAQC-III/SEQC consortium) we have shown that unwanted variation is largely due to library preparation. Taking advantage of the controlled SEQC benchmark, we have demonstrated that appropriate tools for factor analysis like PEER or SVASeq can identify and remove confounding factors. This allows correction for site effects and thus improves specificity without losing sensitivity. Going beyond the original SEQC study, we here present results for a range of realistic effect strengths. Moreover, we demonstrate the benefits that can be gained by analysing novel results in the context of a standardized reference sample: improved reliability across labs.

A-154: Whole genome sequencing to investigate the respiratory syncytial virus outbreak
COSI: HiTSeq
  • Zhengdeng Lei, University of Illinois at Chicago, United States
  • Yijun Zhu, University of Chicago, United States
  • Teresa Zembower, Northwestern University, United States
  • Kristen Metzger, Northwestern Memorial Healthcare, United States
  • Hong Hu, University of Illinois at Chicago, United States
  • George Chlipala, University of Illinois at Chicago, United States
  • Pinal Kanabar, University of Illinois at Chicago, United States
  • Mark Maienschein-Cline, University of Illinois at Chicago, United States
  • Stefan Green, University of Illinois at Chicago, United States
  • Chao Qi, Northwestern University, United States

Short Abstract: A viral whole genome sequencing (WGS) strategy, based on PCR amplification followed by next-generation sequencing, was used to investigate a nosocomial respiratory syncytial virus (RSV-B) outbreak in a hematology-oncology and stem cell transplant unit. RSV-B genomes from 16 patients and healthcare workers (HCWs) suspected to be involved in the outbreak were compared to RSV-B genomes acquired from outpatients during the same time period but epidemiologically unrelated to the outbreak. We used the SPANDx pipeline to perform high-throughput comparative analysis of haploid WGS datasets, which produced a genotyping matrix for building a phylogenetic tree. Phylogenetic analysis of the whole genome identified a cluster of 11 patients and healthcare workers with an identical RSV-B strain that was clearly distinct from strains recovered from individuals unrelated to the outbreak. The purpose of the study is to determine whether WGS would be able to separate cases of RSV-B transmission from the patients with strains unrelated to the transmission in an outbreak over eight weeks. We showed that WGS is a valuable tool for a local outbreak investigation compared to the traditional G gene-based analysis. Accurately identifying transmission and defining outbreak boundaries provides critical information that allows implementation of appropriate infection control and prevention measures.

A-156: Evaluation of the capabilities of mouse TCR profiling from short read RNA-seq data
COSI: HiTSeq
  • Yu Bai, Regeneron Pharmaceuticals, United States
  • David Wang, Cornell University, United States
  • Wentian Li, Feinstein Institute for Medical Research, Northwell Health, United States
  • Ying Huang, Regeneron Pharmaceuticals, United States
  • Xuan Ye, Regeneron Pharmaceuticals, United States
  • Thomas Barry, Regeneron Pharmaceuticals, United States
  • Kurt Edelmann, Regeneron Pharmaceuticals, United States
  • Natasha Levenkova, Regeneron Pharmaceuticals, United States
  • Chunguang Guo, Regeneron Pharmaceuticals, United States
  • Dimitris Skokos, Regeneron Pharmaceuticals, United States
  • Yi Wei, Regeneron Pharmaceuticals, United States
  • Lynn Macdonald, Regeneron Pharmaceuticals, United States
  • Wen Fury, Regeneron Pharmaceuticals, United States

Short Abstract: Profiling the T cell receptor (TCR) repertoire via short read RNA-Seq has the unique advantage of probing TCRs and genome-wide gene expression simultaneously. Nevertheless, only a small percentage of the reads may cover the TCR loci and thus the repertoire can be significantly undersampled. Despite being applied in a few studies, its utility in probing TCR repertoires has not been evaluated extensively. Here we conduct a systematic assessment of RNA-Seq in TCR profiling. We evaluate full-length single cell (Fluidigm) and bulk RNA-Seq on repertoires that are subjected to either naïve or immunogenic conditions, and cross-reference the results with the targeted amplicon approach. TCR sequences are derived by taking the consensus from several published programs. Standard read length and coverage are employed such that the evaluation is in accord with current RNA-Seq practice. We observe that quantifying clones with <1% abundance and the clonality of a repertoire is relatively more difficult despite high sequencing depth. However, top enriched clones with an abundance of a few percent or higher can be faithfully detected. In cases when top TCR clones are of interest and transcriptome sequencing is available, it is worthwhile to conduct TCR profiling using the RNA-Seq data.

A-158: Assessing variations in SNVs identified using different versions of human genome
COSI: HiTSeq
  • Bohu Pan, National Center for Toxicological Research, United States
  • Wenming Xiao, National Center for Toxicological Research, United States
  • Zhichao Liu, National Center for Toxicological Research, United States
  • Weida Tong, National Center for Toxicological Research, United States
  • Huixiao Hong, National Center for Toxicological Research, United States

Short Abstract: The human reference genome is the foundation for next generation sequencing (NGS) data analyses and critical to investigating genetic variations, which offers the potential to inform clinical decisions. Different versions of the human reference genome have been used in NGS data analysis. HG19 and HG38 are the most popularly used versions. The concordance between the results from using HG19 and HG38 has not been assessed. Therefore, we conducted a comparative analysis of the SNVs identified using HG19 and HG38 to assess the impact of human reference genome version, using NGS data from the genome-in-a-bottle (GIAB) project. Twenty different pipelines were used to identify SNVs based on both HG19 and HG38. Two conversion tools were then used to convert the coordinates of SNVs between HG19 and HG38. Discordant rates in SNVs between HG19 and HG38 were calculated to assess the impact of genome versions, and characteristics of the discordant SNVs were examined. We found that some 1.5% of SNVs were discordant between the two versions. Our findings suggest that caution should be taken when translating genetic findings between different reference versions.
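A discordance calculation of this kind can be sketched as below. The abstract does not define the discordant rate, so the definition used here (SNVs present in only one call set after coordinate conversion, plus SNVs that fail to convert, divided by all SNVs considered) is an assumption, as are the function name and the `lift` dictionary standing in for a real coordinate conversion tool.

```python
def discordant_rate(snvs_hg19, snvs_hg38, lift):
    """snvs_*: sets of (chrom, pos, ref, alt); lift: dict (chrom, pos) -> (chrom, pos)."""
    # Convert HG19 calls to HG38 coordinates where a mapping exists.
    lifted = {
        lift[(chrom, pos)] + (ref, alt)
        for chrom, pos, ref, alt in snvs_hg19
        if (chrom, pos) in lift
    }
    # Calls whose coordinates could not be converted count as discordant here.
    unmapped = sum((chrom, pos) not in lift for chrom, pos, _, _ in snvs_hg19)
    # Symmetric difference: calls found with only one reference version.
    discordant = len(lifted ^ snvs_hg38) + unmapped
    total = len(lifted | snvs_hg38) + unmapped
    return discordant / total if total else 0.0

hg19_calls = {("chr1", 100, "A", "G"), ("chr1", 200, "C", "T"), ("chr1", 300, "G", "A")}
hg38_calls = {("chr1", 150, "A", "G"), ("chr1", 250, "C", "T"), ("chr1", 400, "T", "C")}
toy_lift = {("chr1", 100): ("chr1", 150), ("chr1", 200): ("chr1", 250)}
print(discordant_rate(hg19_calls, hg38_calls, toy_lift))
```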

A-160: ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers
COSI: HiTSeq
  • Lauren Coombe, BC Cancer Genome Sciences Centre, Canada
  • Jessica Zhang, BC Cancer Genome Sciences Centre, Canada
  • Benjamin Vandervalk, BC Cancer Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Genome Sciences Centre, Canada
  • Shaun Jackman, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer Genome Sciences Centre, Canada

Short Abstract: The long-range sequencing information captured by linked reads, such as those available from 10x Genomics (10xG), helps resolve genome sequence repeats and yields accurate and contiguous draft genome assemblies. We introduce ARKS, an alignment-free linked read genome scaffolding methodology that uses linked reads to organize genome assemblies further into contiguous drafts. Our approach departs from other read alignment-dependent linked read scaffolders, including our own (ARCS), and uses a kmer-based mapping approach. The kmer mapping strategy has several advantages over read alignment methods, including better usability and faster processing, as it precludes the need for input sequence formatting and draft sequence assembly indexing. We demonstrate how the method provides further improvements of megabase-scale Supernova human genome assemblies, which themselves exclusively use linked read data for assembly. Following ARKS scaffolding of a human genome (NA12878) 10xG Supernova assembly, fewer than 9 scaffolds cover each chromosome, except the largest (chromosome 1, n=13). We analyzed the contiguity and accuracy of the resulting assemblies and conclude that ARKS can provide correct chromosome-scale genome assemblies, promptly. We expect ARKS to have broad utility in helping to refine any draft genome.

A-162: SPRING: A practical compressor for short-read FASTQ data
COSI: HiTSeq
  • Shubham Chandak, Stanford University, United States
  • Kedar Tatwawadi, Stanford University, United States
  • Mikel Hernaez, University of Illinois at Urbana-Champaign, United States
  • Idoia Ochoa, University of Illinois at Urbana-Champaign, United States
  • Tsachy Weissman, Stanford University, United States

Short Abstract: High-Throughput Sequencing (HTS) technologies produce huge amounts of data in the form of short genomic reads, associated quality values, and read identifiers. Because of the significant structure present in the FASTQ data, general purpose compressors are unable to completely exploit much of its inherent redundancy. Although substantial research has been dedicated to designing specialized FASTQ compressors, general-purpose compressors such as Gzip are predominantly used, mainly due to the ease of their deployment and the fact that they provide perfectly lossless compression on any type of data. In this work, we propose SPRING, a practical compressor for FASTQ files produced by short-read NGS sequencers. SPRING is a reference-free compressor and supports a wide variety of compression modes, including perfectly lossless compression, pairing-preserving compression, and lossy compression of quality values among others, while at the same time achieving significantly better compression than existing tools. On data sequenced on Illumina's latest sequencer, NovaSeq, SPRING achieves 2x-5x improvement over Gzip, and 1.3x-1.8x improvement over state-of-the-art FASTQ compressor FaStore, while being competitive in terms of time/memory requirements. SPRING efficiently utilizes the pairing information in the reads to achieve this improved performance. Source code and detailed results are available at https://github.com/shubhamchandak94/SPRING.

A-164: Aggregating multiple rare variant association tests to enhance gene discovery in sequence-based association studies
COSI: HiTSeq
  • Yao Yu, MD Anderson Cancer Center, United States
  • Chad Huff, MD Anderson Cancer Center, United States

Short Abstract: Numerous rare-variant association tests (RVATs) have been proposed for disease-gene association studies with high-throughput sequencing data. The statistical properties of RVATs vary widely according to the genetic architecture of the disease at a given locus. The general solution of conducting multiple RVATs substantially reduces statistical power and can negate any benefit obtained from incorporating additional RVATs. In addition, most RVATs aggregate all variants in a gene, irrespective of gene isoforms, which is not necessarily optimal. We introduce a newly developed cross-Method Association Toolkit (XMAT), which employs a permutation approach to combine statistics from multiple RVATs to test each isoform of a gene, summarizing the contribution of each transcript and each RVAT to calculate a single gene-level p-value. This approach leverages the locus-specific correlation between RVATs and gene isoforms without the need to directly model the correlation between isoform-test combinations. Using XMAT, we conducted a gene-based analysis involving 783 breast cancer cases and 3,607 controls. We assessed the optimal combination of tests among the 25 RVATs for a known set of susceptibility genes. XMAT offers the robustness of multiple RVATs while reducing the multiple testing penalty, resulting in increased statistical power for gene discovery across a wide range of genetic architectures.
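One generic way to combine several tests by permutation into a single p-value is a min-p procedure, sketched below. This is not the XMAT algorithm, whose details are not given in the abstract; the two toy statistics, the function names, and the parameters `n_perm` and `seed` are all assumptions for illustration.

```python
import random

def _mean_diff(values, labels):
    # Absolute difference in mean value between cases (1) and controls (0).
    cases = [v for v, y in zip(values, labels) if y == 1]
    ctrls = [v for v, y in zip(values, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(ctrls) / len(ctrls))

def burden_stat(genotypes, labels):
    # Toy test 1: compare the number of rare alleles carried per individual.
    return _mean_diff([sum(g) for g in genotypes], labels)

def carrier_stat(genotypes, labels):
    # Toy test 2: compare the fraction of individuals carrying any rare allele.
    return _mean_diff([int(any(g)) for g in genotypes], labels)

def combined_p(genotypes, labels, tests, n_perm=200, seed=0):
    rng = random.Random(seed)
    # Row 0 holds the observed statistics; later rows come from permuted labels.
    rows = [[t(genotypes, labels) for t in tests]]
    perm = list(labels)
    for _ in range(n_perm):
        rng.shuffle(perm)
        rows.append([t(genotypes, perm) for t in tests])

    def min_p(row):
        # Empirical p-value of each test for this row, then the smallest one.
        return min(
            sum(other[j] >= row[j] for other in rows) / len(rows)
            for j in range(len(tests))
        )

    min_ps = [min_p(row) for row in rows]
    # Compare the observed min-p with its permutation distribution.
    return sum(m <= min_ps[0] for m in min_ps) / len(min_ps)

genotypes = [[1, 0]] * 20 + [[0, 0]] * 20
labels = [1] * 20 + [0] * 20
print(combined_p(genotypes, labels, [burden_stat, carrier_stat]))
```

Because every test is recomputed on the same permuted labels, the correlation between tests is captured implicitly, which is why no explicit multiple-testing correction across tests is applied in this sketch.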

A-166: RECAP reveals the true statistical significance of ChIP-seq peak calls
COSI: HiTSeq
  • Justin Chitpin, University of Ottawa, Canada
  • Aseel Awdeh, University of Ottawa, Canada
  • Theodore Perkins, Ottawa Hospital Research Institute, Canada

Short Abstract: ChIP-seq is used extensively to identify sites of transcription factor binding or epigenetic modifications to the genome. The fundamental bioinformatics problem is to take ChIP-seq read and control data, and determine genomic regions enriched between the groups. While many programs have been designed to solve this task, nearly all fall into the statistical trap of using the data twice: once to determine candidate enriched regions, and a second time to assess enrichment by classical statistical hypothesis testing. This double use of the data has the potential to invalidate the statistical significance assigned to enriched regions and invalidate false discovery rate estimates. Thus, the true significance or reliability of peak calls remains unknown. We show, through extensive simulation studies of null hypothesis data, that three well-known peak callers, MACS, SICER, and diffReps, output optimistically biased p-values, and therefore optimistic false discovery rate estimates. We also propose a new wrapper algorithm called RECAP that resamples ChIP-seq and control data to estimate and correct for biases built into peak calling algorithms. When tested against ENCODE Consortium ChIP-seq data, RECAP maintains well-calibrated false discovery rates, making it a powerful new tool for assessing the true statistical significance of ChIP-seq peak calls.

A-168: Dense and accurate whole-chromosome haplotyping of individual genomes
COSI: HiTSeq
  • David Porubsky, Max Planck Institute for Informatics, Germany
  • Shilpa Garg, Max Planck Institute for Informatics, Center for Bioinformatics, Saarland University, Germany
  • Ashley Sanders, EMBL, Germany
  • Jan Korbel, EMBL, Germany
  • Victor Guryev, EMBL-EBI, Netherlands
  • Peter Lansdorp, Terry Fox Laboratory, BC Cancer Agency, Canada
  • Tobias Marschall, Max Planck Institute for Informatics, Center for Bioinformatics, Saarland University, Germany

Short Abstract: The diploid nature of the human genome is neglected in many analyses done today, where a genome is perceived as a set of unphased variants with respect to a reference genome. This lack of haplotype-level analyses can be explained by a lack of methods that can produce dense and accurate chromosome-length haplotypes at reasonable costs. Here we introduce an integrative phasing strategy that combines global, but sparse haplotypes obtained from strand-specific single-cell sequencing (Strand-seq) with dense, yet local, haplotype information available through long-read or linked-read sequencing. We provide comprehensive guidance on the required sequencing depths and reliably assign more than 95% of alleles (NA12878) to their parental haplotypes using as few as 10 Strand-seq libraries in combination with 10-fold coverage PacBio data or, alternatively, 10X Genomics linked-read sequencing data. We conclude that the combination of Strand-seq with different technologies represents an attractive solution to chart the genetic variation of diploid genomes.

A-170: De novo single-cell transcript sequence reconstruction with Bloom filters
COSI: HiTSeq
  • Ka Ming Nip, BC Cancer Genome Sciences Centre, Canada
  • Readman Chiu, BC Cancer Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Short Abstract: De novo transcript sequence reconstruction from RNA-seq data is a difficult problem due to the short read length and the wide dynamic range of transcript expression levels. Although more than 10 algorithms for bulk RNA-seq were published over the past decade, very limited effort was made for de novo single-cell transcript sequence reconstruction, likely due to the technical challenges in analyzing single-cell RNA-seq (scRNA-seq) data. Compared to bulk RNA-seq, scRNA-seq tends to yield more variable read depth across each transcript, lower transcript coverage, and lower overall signal-to-noise ratio. Here, we present a fast and lightweight method for de novo single-cell transcript sequence reconstruction that leverages sequence reads across multiple cells. Our method is implemented in a program called “RNA-Bloom,” which utilizes lightweight probabilistic data structures based on Bloom filters. RNA-Bloom pools input reads from all cells to reconstruct read fragments from paired-end reads for individual cells. In particular, the cell-specificity of reconstructed transcripts is still maintained. In our benchmark, RNA-Bloom’s performance and accuracy surpass state-of-the-art methods that were designed for bulk RNA-seq data. While scRNA-seq has primarily been used for gene expression analysis, this work unlocks new territory for identifying unique isoform structures at the single-cell level.
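For readers unfamiliar with the data structure, a Bloom filter storing kmers can be sketched as follows. This is a generic illustration, not RNA-Bloom's implementation; the class, the SHA-256-based hashing scheme, and the parameter values are assumptions chosen for clarity rather than speed.

```python
import hashlib

class BloomFilter:
    """Set-membership sketch: no false negatives, tunable false-positive rate."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True only if every bit for this item is set (may be a false positive).
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def kmers(seq, k):
    # Yield every substring of length k.
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

bf = BloomFilter(n_bits=1024, n_hashes=3)
for km in kmers("ACGTACGTGACC", 5):
    bf.add(km)
print("ACGTA" in bf)
```

The filter stores only bits, not the kmers themselves, which is what keeps memory use low; the price is that a query for an absent kmer can occasionally return true.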

A-225: Organellar genome annotation from the amino acid and nucleotide references
COSI: HiTSeq
  • Jaehee Jung, HongIk University, South Korea
  • Gangman Yi, Dongguk University, South Korea

Short Abstract: Next-generation sequencing (NGS) technologies have led to the accumulation of high-throughput sequence data from a wide range of organisms. Annotating the organellar genomes of these organisms requires better-optimized tools for functional gene annotation. Almost all existing gene annotation tools focus mainly on the chloroplast genome of land plants or the mitochondrial genome of animals. We have developed a web application for fast, user-friendly, and improved annotation of organellar genomes. This application annotates genes based on a BLAST-based homology search and clustering with selected reference sequences from the NCBI database or user-uploaded data. It can annotate the functional genes in almost all mitochondrial and plastid genomes of eukaryotes. Genomes with an exon-intron structure within a gene or with an inverted repeat region can also be annotated. The application reports the start and end positions of each gene, the BLAST results compared with the reference sequence, and a visualization of the gene map by OGDRAW.

A-430: Empirical sample size analysis for comparative RNA-seq studies
COSI: HiTSeq
  • Zixuan Shao, Caltech, United States
  • Julie Kornfield, Caltech, United States

Short Abstract: Sample size calculation is an important optimization step in any comparative RNA sequencing (RNA-seq) analysis. Empirical RNA-seq Sample Size Analysis (ERSSA) is an R software package designed to test whether an existing RNA-seq dataset has sufficient biological replicates to detect a majority of differentially expressed genes (DEGs) between two conditions. Compared to existing RNA-seq sample size analysis algorithms, ERSSA does not rely on any a priori assumptions about the dataset. Rather, ERSSA uses the supplied pilot RNA-seq dataset to test whether the current replicate level is sufficient to detect a majority of DEGs. Based on the number of replicates available, the algorithm subsamples at step-wise replicate levels and uses existing differential expression analysis software (e.g. edgeR and DESeq2) to identify the number of DEGs. This process is repeated a given number of times with unique combinations of samples to generate a distribution of the number of DEGs at each replicate level. Using RNA-seq data from studies including GTEx, the algorithm successfully identified the sample size sufficient to detect a majority of DEGs between conditions. ERSSA is a flexible and easy-to-use tool that offers an alternative approach to identify the appropriate sample size in comparative RNA-seq studies.
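
The subsampling loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not ERSSA's code (ERSSA itself is an R package): the function name, the `count_degs` callback standing in for an edgeR or DESeq2 run, the starting level of two replicates, and the `max_combos` cap are all hypothetical.

```python
import itertools
import random


def subsample_deg_counts(cond_a, cond_b, count_degs, max_combos=5, seed=0):
    """Count DEGs for unique sample combinations at each replicate level.

    cond_a, cond_b: sample identifiers for the two conditions.
    count_degs: hypothetical callback taking two tuples of samples and
        returning the number of DEGs (a stand-in for real DE software).
    """
    rng = random.Random(seed)
    results = {}
    max_level = min(len(cond_a), len(cond_b))
    for level in range(2, max_level + 1):
        # Every pairing of a subset from A with a subset from B is unique.
        combos = [(a, b)
                  for a in itertools.combinations(cond_a, level)
                  for b in itertools.combinations(cond_b, level)]
        rng.shuffle(combos)
        # Keep at most max_combos combinations per level.
        results[level] = [count_degs(a, b) for a, b in combos[:max_combos]]
    return results
```

If the distribution of DEG counts stops rising as the replicate level increases, that would suggest the available replicates are already sufficient; how that plateau is judged is left out of this sketch.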

A-436: Identification of meiotic recombination events through gamete haplotype reconstruction by linked-read sequencing method
COSI: HiTSeq
  • Peng Xu, University of Alabama at Birmingham, United States
  • Zechen Chong, University of Alabama at Birmingham, United States

Short Abstract: In eukaryotes, meiotic recombination (MR) events promote the exchange of genetic material between homologous chromosomes, which can be inherited by offspring. However, direct identification of meiotic recombination events in an individual is still challenging due to the difficulty in resolving chromosome haplotypes. Linked-read sequencing platforms are capable of providing high-quality haplotypes over relatively long ranges, but the haplotype information mainly lies in isolated phased fragments. In this study, we developed a pipeline to reconstruct chromosome-level haplotypes through pedigree analysis of linked-read sequencing data. A whole-chromosome haplotype comparison between parent and child led to the discovery of 462 meiotic reciprocal crossovers in 6 trio datasets. In three trio samples from the Human Genome Structural Variation Consortium, our pipeline identified 149 high-confidence MRs. In addition, it detected 85 new events. MR regions co-localize with meiotic recombination hotspots of human populations and are enriched with the PRDM9 protein-binding motif. Interestingly, about half of the breakpoint regions occur inside a gene, which increases haplotype diversity in genic regions. Taken together, these results demonstrate the great potential of linked-read sequencing analysis for studying haplotype-based genetic exchange in human inheritance and diseases. The source code is publicly available at https://github.com/ChongLab/meiotic_recombination_10x.
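
As a rough illustration of the parent-child haplotype comparison, the Python sketch below marks a crossover wherever the haplotype transmitted to the child switches from matching one parental haplotype to matching the other. It is not the authors' pipeline: the function name and input layout are assumptions, and it ignores genotyping errors, phasing switch errors, and gene conversion, all of which a real analysis would need to handle.

```python
def find_crossovers(positions, parent_hap1, parent_hap2, transmitted):
    """Return (left, right) position pairs that bracket each crossover.

    positions: variant coordinates along one chromosome, in order.
    parent_hap1, parent_hap2: the parent's alleles on each haplotype.
    transmitted: alleles on the haplotype the child inherited from this parent.
    """
    origins = []
    for pos, a1, a2, t in zip(positions, parent_hap1, parent_hap2, transmitted):
        if a1 == a2:
            continue  # homozygous in the parent, so uninformative
        if t == a1:
            origins.append((pos, 1))
        elif t == a2:
            origins.append((pos, 2))
        # alleles matching neither haplotype are skipped

    crossovers = []
    for (prev_pos, prev_hap), (cur_pos, cur_hap) in zip(origins, origins[1:]):
        if prev_hap != cur_hap:
            # The breakpoint lies between these two informative sites.
            crossovers.append((prev_pos, cur_pos))
    return crossovers
```

Each returned pair gives the two flanking informative sites, so the breakpoint resolution depends on how densely heterozygous variants are phased in that region.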