


Poster Categories
Poster Schedule
Preparing your Poster - Information and Poster Size
How to mount your poster
Print your poster in Basel

View Posters By Category

Session A: (July 22 and July 23)
Session B: (July 24 and July 25)

Presentation Schedule for July 22, 6:00 pm – 8:00 pm

Presentation Schedule for July 23, 6:00 pm – 8:00 pm

Presentation Schedule for July 24, 6:00 pm – 8:00 pm

Session A Poster Set-up and Dismantle
Session A Posters set up: Monday, July 22 between 7:30 am and 10:00 am
Session A Posters should be removed at 8:00 pm, Tuesday, July 23.

Session B Poster Set-up and Dismantle
Session B Posters set up: Wednesday, July 24 between 7:30 am and 10:00 am
Session B Posters should be removed at 2:00 pm, Thursday, July 25.

D-001: Reliable, precise, and fast detection of structural variants from long reads
  • Armin Töpfer, Pacific Biosciences, Germany

Short Abstract: Single Molecule Real-Time (SMRT) sequencing enables an unprecedented and complete view of an individual's genome, from single-nucleotide variants (SNVs) and indels of 1 to 50 bp to large structural variants (SVs) spanning kilo- to megabases. Although the variation between two human genomes is dominated in number by approximately 5,000,000 SNVs, the rarer 400,000 indels and 20,000 SVs account for the majority of affected base pairs, roughly 5 and 10 megabases, respectively. Recent studies have found pathogenic SVs that are undetectable by short-read whole-genome sequencing. Consequently, there is a need to extract indels and SVs from low-coverage data quickly and reliably, without requiring a computationally expensive de novo assembly. We present pbsv, a read-mapping-based SV caller that detects insertions, deletions, duplications, translocations, and copy number variations from a single sample up to cohort-scale numbers of low-coverage samples. Leveraging its native joint-calling capability, its predictive power increases in a trio or cohort setting. Using long, high-quality circular consensus sequencing (CCS) reads at 28-fold coverage, pbsv detects 98% of all known HG002/NA24385 SVs while keeping precision above 95%; recall is ~5% higher than that of the state-of-the-art tools sniffles and svim.

D-002: BANDITS: a Bayesian hierarchical model for differential splicing accounting for sample-to-sample variability and mapping uncertainty
  • Simone Tiberi, University of Zurich, Switzerland
  • Mark D. Robinson, University of Zurich, Switzerland

Short Abstract: Alternative splicing plays a fundamental role in the diversity of proteins, as it allows a single gene to generate several transcripts and, hence, to code for multiple proteins. Variations in splicing patterns, however, can be involved in diseases. When comparing conditions, typically healthy vs. diseased, scientists are increasingly focusing on differential transcript usage (DTU), i.e., changes in the relative proportions of a gene's transcripts. A big challenge in DTU analyses is that, unlike in gene-level studies, the transcript-level counts of primary interest are not directly observed, because most reads map to multiple transcripts. Most DTU methods follow a plug-in approach and take estimated transcript-level counts as input, thereby neglecting the uncertainty in these estimates. To overcome the limitations of current methods for DTU, we present BANDITS, an R package to perform DTU analyses of RNA-seq data at both the transcript and the gene level. BANDITS uses a Bayesian hierarchical structure to explicitly model the variability between samples, and treats the allocations of reads to transcripts as latent variables. The parameters of the model are inferred via Markov chain Monte Carlo (MCMC) techniques. We will show how, in both simulation studies and experimental data analyses, the proposed methodology outperforms existing methods.

D-003: A flexible permutation approach to detect cell state transitions from high-throughput single-cell data
  • Simone Tiberi, University of Zurich, Switzerland
  • Mark D. Robinson, University of Zurich, Switzerland

Short Abstract: Technological developments in recent years have led to an explosion of high-throughput single-cell data. Here, we present a flexible and general statistical methodology, based on a permutation approach, to discover differential state (DS) transitions from single-cell data. DS analyses aim to identify cell-type-specific responses between conditions, where genes or markers vary in specific populations of cells. For instance, when investigating diseases, DS tools could identify cell types that respond differently to distinct treatments. The proposed methodology does not rely on asymptotic theory, avoids parametric assumptions, explicitly models biological replicates, and is able to identify both differential patterns involving changes in means and more subtle variations that can carry biological insight. Furthermore, unlike most differential tools, it can be applied to several types of single-cell data, notably single-cell RNA sequencing (scRNA-seq) and high-dimensional flow or mass cytometry (HDCyto) data. Our model also adjusts for covariates and batch effects. We will present results, based on simulated and experimental scRNA-seq and HDCyto datasets, in which our tool outperforms several competitors and detects more patterns of differential expression than canonical differential abundance methods. Ultimately, our method will be distributed as a Bioconductor R package.
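
The core of any permutation approach can be sketched in a few lines. This is an illustrative stand-in, not the authors' implementation: it uses a difference of group means as a hypothetical test statistic, and builds the null distribution by shuffling sample labels rather than relying on parametric assumptions.

```python
import random

def permutation_pvalue(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference of group means.

    The null distribution is built by shuffling sample labels, so no
    distributional assumptions about the data are needed.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:n_a], pooled[n_a:]
        stat = abs(sum(perm_a) / n_a - sum(perm_b) / n_b)
        if stat >= observed:
            hits += 1
    # add-one correction keeps the p-value strictly positive
    return (hits + 1) / (n_perm + 1)
```

In a DS setting, the values fed to such a test would be per-sample summaries (e.g., cluster-wise marker medians), so biological replicates, not cells, are the unit of permutation.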

D-004: ntEdit: scalable genome assembly polishing
  • Hamid Mohamadi, BC Cancer Agency Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer, Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Jessica Zhang, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Short Abstract: Today, genome sequence assemblies are routine practice. Depending on the methodology, the resulting drafts may contain considerable base errors. Although utilities exist for genome base polishing, they do not scale well. We developed ntEdit, a Bloom filter-based genome polishing utility that scales linearly to large genomes. We first tested ntEdit and the assembly improvement tools GATK, Pilon and Racon on controlled E. coli, C. elegans and H. sapiens sequence data. Generally, ntEdit performs well at low sequence depths (<20X), fixing the majority (>97%) of base substitutions and indels, and its performance is consistent across a range of sequence coverages. In all experiments conducted, the ntEdit pipeline executed in <14s, <3m and <40m, on average, on E. coli, C. elegans and H. sapiens, respectively. ntEdit ran in <2h20m to improve long-read and linked-read human genome assemblies of NA12878, using Illumina sequence data from the same individual and fixing frameshifts in coding sequences. We also generated 17-fold-coverage spruce sequence data from haploid sequence sources, and used it to edit our pseudo-haploid assembly of the 20 Gbp spruce genome in <4h, making roughly 50M edits at a (substitution+indel) rate of 0.0024. ntEdit simplifies the polishing of genome sequences. Availability: https://github.com/bcgsc/ntedit
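
The Bloom filter idea at the heart of such polishing tools can be illustrated with a toy k-mer membership structure. This sketch is purely illustrative (ntEdit's actual implementation relies on specialized spaced-seed/ntHash-based filters, not SHA-256): read k-mers are inserted into a fixed-size bit array, after which membership queries never give false negatives and only rarely give false positives.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for k-mer membership (illustrative only)."""

    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, kmer):
        # derive n_hashes bit positions from a salted hash of the k-mer
        for i in range(self.n_hashes):
            digest = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # no false negatives; false positives possible but rare
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(kmer))

def kmers(seq, k):
    return (seq[i:i + k] for i in range(len(seq) - k + 1))
```

Because membership queries touch only a few bits, memory use is fixed by the filter size rather than the genome size, which is what makes the approach scale to 20 Gbp genomes.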

D-005: Containerized open-source framework for NGS data analysis and management
  • Roberto Vera Alvarez, NCBI/NLM/NIH, United States
  • Lorinc Pongor, NCI/NIH, United States
  • David Landsman, NCBI/NLM/NIH, United States

Short Abstract: The management of next-generation sequencing (NGS) data produced by technologies such as RNA-Seq, ChIP-Seq, ATAC-Seq and DNA methylation profiling is complex and demands advanced bioinformatics skills. For example, pre-processing quality control and sample selection based on principal component analysis (PCA) are tasks that should be easily available to researchers producing sequencing samples. In this work we present an open-source containerized framework, easily executed on most workstations, for the integration and management of RNA-Seq, ChIP-Seq and ATAC-Seq data. The framework offers a user-friendly interface for executing the basic steps of data analysis, allowing researchers a quick and straightforward evaluation of the samples produced. It comprises a set of NGS data analysis workflows and pipelines in CWL format, a Python-Django back-end for data management, and a set of Jupyter notebooks as the user interface. Analysis reports with tables, figures and plots are automatically generated from data files, with detail and resolution ready for scientific publication. The framework is built entirely from open-source components and is freely available.

D-006: Discriminating true and false zeros in single-cell RNA-seq data in differential expression analysis and data imputation
  • Zhun Miao, Tsinghua University, China
  • Ke Deng, Tsinghua University, China
  • Xiaowo Wang, Tsinghua University, China
  • Xuegong Zhang, Tsinghua University, China

Short Abstract: Typical single-cell RNA sequencing (scRNA-seq) data contain more zero values than bulk RNA-seq data. These include true zeros, for genes not expressed in the single cell at the time of sampling, and false zeros caused by the so-called "drop-out" events of scRNA-seq procedures. We developed a mathematical model of the distribution of reads in scRNA-seq data under the influence of low RNA-capture and sequencing efficiency. The model can estimate the proportion of true zeros in the data. Based on this model, we developed DEsingle, a method for detecting differentially expressed genes between two groups of scRNA-seq samples that can distinguish three different types of differential expression, and scRECOVER, a method for imputing missing data in scRNA-seq that recovers only the false zeros while leaving the true zeros unaltered. Experiments on both simulated and real data showed the effectiveness of the two proposed methods.
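
The true-vs-false-zero distinction can be illustrated with a deliberately simplified capture model (this is a toy stand-in, not the authors' model): if each of a gene's c transcripts is captured independently with probability p, the gene yields an observed zero with probability (1 - p)^c, from which the expected share of drop-out (false) zeros among all observed zeros follows directly.

```python
import random

def simulate_observed_counts(true_counts, capture_rate, seed=0):
    """Binomial thinning: each true transcript is captured independently
    with probability `capture_rate` (a toy model of low capture efficiency)."""
    rng = random.Random(seed)
    return [sum(1 for _ in range(c) if rng.random() < capture_rate)
            for c in true_counts]

def expected_false_zero_fraction(true_counts, capture_rate):
    """Among observed zeros, the expected fraction that are drop-outs,
    i.e. genes truly expressed (c > 0) but captured zero times."""
    p_zero_given_c = [(1 - capture_rate) ** c for c in true_counts]
    total_zero = sum(p_zero_given_c)
    false_zero = sum(p for c, p in zip(true_counts, p_zero_given_c) if c > 0)
    return false_zero / total_zero
```

Even this toy model shows why imputation must be selective: recovering all zeros would also overwrite the genuinely unexpressed genes.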

D-007: SCOPE: a normalization and copy number estimation method for single-cell DNA sequencing
  • Rujin Wang, The University of North Carolina at Chapel Hill, United States
  • Dan-Yu Lin, The University of North Carolina at Chapel Hill, United States
  • Yuchao Jiang, The University of North Carolina at Chapel Hill, United States

Short Abstract: Whole-genome single-cell DNA sequencing (scDNA-seq) enables characterization of copy number profiles at the cellular level, which circumvents the averaging effects associated with bulk-tissue sequencing and increases resolution while decreasing ambiguity in tracking cancer evolutionary history. ScDNA-seq data are, however, sparse and noisy due to the biases and artifacts introduced during library preparation and sequencing. Here, we propose SCOPE, a normalization and copy number estimation method for scDNA-seq data. Its main features are: (i) a Poisson latent factor model for normalization, which borrows information across cells and regions to estimate bias, using negative-control cells identified by Gini coefficients; (ii) modeling of GC-content bias using an expectation-maximization algorithm embedded in the normalization step, which accounts for aberrant copy number changes that deviate from the null distribution; and (iii) a cross-sample segmentation procedure to identify breakpoints shared across cells from the same subclone. We evaluate SCOPE on a diverse set of scDNA-seq data in cancer genomics and show that it outperforms existing methods, using array-based calls of purified bulk samples as gold standards and whole-exome sequencing and single-cell RNA sequencing as orthogonal validations. Further, we demonstrate SCOPE on three recently released scDNA-seq datasets from 10X Genomics.
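
The Gini coefficient used to flag candidate negative-control cells is a standard inequality measure over per-bin read counts; a minimal sketch (not SCOPE's code) based on the usual rank-weighted formula:

```python
def gini(values):
    """Gini coefficient of non-negative read counts: 0 for perfectly
    uniform coverage, approaching 1 for highly concentrated coverage.
    Cells with near-uniform coverage (low Gini) are plausible diploid
    negative controls; high-Gini cells suggest copy number changes or noise."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # rank-weighted formula: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n
```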

D-008: RNA-Seq of endogenous human stem cells and tumors to identify cancer-specific therapeutic targets
  • Grace Borchert, Griffith University, Australia
  • Charlotte Woods, Queensland University of Technology, Australia
  • Maina Bitar, QIMR Berghofer, Australia
  • Isabela Pimentel de Almeida, Universidade de Sao Paulo and QIMR Berghofer, Brazil
  • Elizabeth O'Brien, QIMR Berghofer, Australia
  • Guy Barry, QIMR Berghofer, Australia

Short Abstract: Stem cells are characterized by their capacity for self-renewal, long-term viability and ability to differentiate into multiple types of specialized cells. Similarly, cancer cells are capable of self-renewal, which allows aggressive and unlimited tumor growth. Interestingly, pathways normally associated with stem cell development overlap with cancer progression. Therefore, stem cell populations residing outside the tumor are significantly affected by cancer treatments. Here we investigate for the first time the similarities and differences between types of endogenous adult human stem cells and publicly available patient-derived glioblastoma/medulloblastoma primary tumors based on transcript expression via RNA-Seq. Additionally, we profiled the known gene targets of all currently FDA approved drugs for cancer treatment in our data. The study included different methods for transcript quantification and comparison of expression profiles. Comparing the transcriptomes of human stem cells and tumors represents an alternative approach to identify better drug targets, with potentially less severe side effects. As proof of principle, we used our data to uncover clinically relevant ASOs targeted to candidate transcripts highly expressed in glioblastoma but negligibly expressed in stem cells, resulting in a marked proliferation decrease. Our findings may support the development of alternative therapies that specifically target the malignant cells within a tumor.

D-009: A method for inferring gene regulatory networks based on pseudo time-series gene expression profiles from single-cell RNA-seq data
  • Kaito Uemura, Osaka University, Japan
  • Naoki Osato, Osaka University, Japan
  • Hironori Shigeta, Osaka University, Japan
  • Shigeto Seno, Osaka University, Japan
  • Hideo Matsuda, Osaka University, Japan

Short Abstract: Single-cell analysis has been widely used to understand cell diversity in detail, and there is demand for a highly accurate gene regulatory network (GRN) inference method based on single-cell gene expression data. We therefore propose a novel GRN inference method that: (1) constructs pseudo-time-series gene expression profiles by calculating the pseudotime of each cell in the expression data, and applies a dynamic Bayesian network model to these profiles to infer the causality of regulatory relationships between genes; and (2) uses a bootstrap method to estimate highly confident regulatory relationships by resampling from groups of cells with the same or similar pseudotime (we regard these groups as pseudo biological replicates). The effectiveness of our method is demonstrated by applying it to GRN inference from the single-cell RNA-seq data on cell differentiation from myeloid progenitors to neutrophils published by Paul et al. (Cell, 2015). We will present inferred GRNs indicating detailed gene-by-gene regulatory relationships in the cell differentiation process, which were not clearly shown by Paul et al.
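
Step (2) can be sketched generically: resample cells within each pseudotime group and tally how often each edge is re-inferred. The `toy_infer` function below is a hypothetical stand-in for the dynamic Bayesian network step, used only to make the bootstrap wrapper runnable.

```python
import random

def bootstrap_edge_confidence(cell_groups, infer_edges, n_boot=100, seed=0):
    """Resample cells with replacement within each pseudotime group
    (treated as pseudo biological replicates) and report, for every
    inferred edge, the fraction of bootstrap replicates containing it."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_boot):
        resampled = [[rng.choice(group) for _ in group] for group in cell_groups]
        for edge in infer_edges(resampled):
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: c / n_boot for edge, c in counts.items()}

# hypothetical stand-in for network inference: call an edge A -> B
# whenever group 0's mean expression exceeds group 1's
def toy_infer(groups):
    means = [sum(g) / len(g) for g in groups]
    return [("A", "B")] if means[0] > means[1] else []

confidence = bootstrap_edge_confidence([[10, 11, 12], [1, 2, 3]], toy_infer, n_boot=50)
```

Edges that survive many resamplings of the pseudo-replicates are the "highly confident" relationships the abstract refers to.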

D-010: A Bioinformatics Pipeline for Identifying Point Mutations in CML Patients
  • Julia Vetter, University of Applied Sciences Upper Austria, Campus Hagenberg, Austria
  • Jonathan Burghofer, Ordensklinikum Linz GmbH, Barmherzige Schwestern, Austria
  • Theodora Malli, Ordensklinikum Linz GmbH, Barmherzige Schwestern, Austria
  • Gerald Webersinke, Ordensklinikum Linz GmbH, Barmherzige Schwestern, Austria
  • Wolfgang Kranewitter, Ordensklinikum Linz GmbH, Barmherzige Schwestern, Austria
  • Susanne Schaller, University of Applied Sciences Upper Austria, Campus Hagenberg, Austria
  • Stephan Winkler, University of Applied Sciences Upper Austria, Campus Hagenberg, Austria

Short Abstract: Unique molecular identifiers (UMIs) have attracted great interest in high-throughput sequencing experiments in recent years. UMIs allow backtracking of sequences to their original RNA molecules, quantification of transcripts, elimination of PCR errors and detection of true variants. Here we show a pipeline that processes UMIs to detect low-frequency (~1%) point mutations in the ABL gene of CML patients. To verify the workflow, we used sequencing data from synthesized positive controls containing wild-type ABL and a defined amount of mutated ABL (c.763G>A). To achieve an appropriate clustering of reads, we developed methods that assemble reads based on their UMIs. Our re-clustering approach allows a specified number of mismatches in UMIs. The size of clusters to be re-clustered can be set individually, and users can set the parameters needed for identifying the consensus sequence. Additionally, we compare sequences in clusters containing only one sequence and discard erroneous sequences. Results show significant differences between normally clustered and re-clustered samples; further, the number of clusters is reduced to one third. In future work we aim to implement an evolutionary strategy to find the most efficient parameter settings, using samples with allele frequencies of 0%, 1%, 10% and 20% at position c.763G>A.
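
A greatly simplified version of mismatch-tolerant UMI re-clustering might look as follows; this greedy sketch (merge low-count UMIs into the nearest abundant representative within a Hamming-distance budget) is an assumption about the general technique, not the authors' pipeline.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def recluster_umis(umi_counts, max_mismatches=1):
    """Greedy UMI re-clustering: visit UMIs from most to least abundant and
    merge each one into the first existing representative within
    `max_mismatches`; otherwise it founds a new cluster."""
    ordered = sorted(umi_counts, key=lambda u: -umi_counts[u])
    clusters = {}     # representative UMI -> aggregated count
    assignment = {}   # original UMI -> its cluster representative
    for umi in ordered:
        match = next((rep for rep in clusters
                      if hamming(rep, umi) <= max_mismatches), None)
        if match is None:
            clusters[umi] = umi_counts[umi]
            assignment[umi] = umi
        else:
            clusters[match] += umi_counts[umi]
            assignment[umi] = match
    return clusters, assignment
```

Merging sequencing-error UMIs into their true parents is what reduces the cluster count and prevents PCR/sequencing errors from masquerading as low-frequency variants.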

D-011: Development of an agnostic RNA-seq analysis method for the identification of potential novel biomarkers in liquid biopsy: application to amyotrophic lateral sclerosis (ALS) patients
  • Gabriel Wajnberg, Atlantic Cancer Research institute, Canada
  • Simi Chacko, Atlantic Cancer Research Institute, Canada
  • Annie-Pier Beauregard, Atlantic Cancer Research Institute, Canada
  • Daniel Saucier, Université de Moncton, Canada
  • Pier Morin, Université de Moncton, Canada
  • Nicolas Crapoulet, Atlantic Cancer Research Institute, Canada
  • Rodney Ouellette, Atlantic Cancer Research Institute, Canada

Short Abstract: Current methods for analyzing the small RNA content of extracellular vesicles (EVs) in liquid biopsy samples focus on identifying specific annotated non-coding RNA molecules, such as miRNAs and fragments of lncRNAs. However, in our recent publication on EV content from ALS patients, small RNA-Seq identified reads from other non-coding regions, including intergenic regions. This finding suggests that an RNA-species-centric analysis might not be the most effective way to identify potential novel biomarkers in liquid biopsy. In this study, we developed an agnostic analysis method that uses the coverage of any expressed chromosome region as a potential diagnostic biomarker. We identified a total of 227 upregulated regions (p-value < 0.05 and log2FC > 1) and 338 downregulated regions (p-value < 0.05 and log2FC < -1) in ALS patient samples compared to control individuals. A total of 40% of these differentially expressed regions lie in transcribed regions that are not annotated. This new strategy has the potential to help identify new diagnostic biomarkers from liquid biopsy, not only for ALS patients but also in other diseases such as cancer.
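
The thresholding step described above reduces to a small filter over per-region statistics. A hedged sketch with hypothetical region names; the fold-change helper uses a pseudocount, a common convention the abstract does not specify.

```python
import math

def classify_regions(regions, alpha=0.05, lfc_cutoff=1.0):
    """Split (name, p_value, log2FC) records into up- and down-regulated
    sets using p < alpha and |log2FC| > lfc_cutoff thresholds."""
    up = [name for name, p, lfc in regions if p < alpha and lfc > lfc_cutoff]
    down = [name for name, p, lfc in regions if p < alpha and lfc < -lfc_cutoff]
    return up, down

def log2_fold_change(mean_case, mean_control, pseudocount=1.0):
    # the pseudocount avoids division by zero for regions absent in one group
    return math.log2((mean_case + pseudocount) / (mean_control + pseudocount))
```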

D-012: Seq: A High-Performance Language for Genomics
  • Ariya Shajii, Massachusetts Institute of Technology, United States
  • Ibrahim Numanagić, Massachusetts Institute of Technology, United States
  • Riyadh Baghdadi, Massachusetts Institute of Technology, United States
  • Saman Amarasinghe, Massachusetts Institute of Technology, United States
  • Bonnie Berger, Massachusetts Institute of Technology, United States

Short Abstract: The scope and scale of biological data are increasing at an exponential rate as technologies like next-generation sequencing become cheaper and more prevalent. Yet, as Moore's Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. Here, we introduce Seq, the first language tailored specifically to genomics, which marries the ease and productivity of Python with C-like performance. Seq is a drop-in replacement for Python that incorporates novel bioinformatics-oriented data types, language constructs and optimizations. Seq enables users to write high-level code without having to worry about low-level optimizations, and allows for seamless expression of the algorithms, idioms and patterns found in many genomics applications. Seq attains performance improvements of up to 175x over Python once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650x improvement. Compared to optimized C++ code, Seq attains up to a 2x improvement with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly optimized bioinformatics software.

D-013: Automated Diagnosis of Viruses in Solanaceous crops in Colombia using High-throughput Sequencing Data
  • Pablo Gutierrez, Universidad Nacional de Colombia, Colombia
  • Daniel Tejada Hernandez, Universidad Nacional de Colombia, Colombia
  • Mauricio Marin Montoya, Universidad Nacional de Colombia, Colombia

Short Abstract: With the advent of high-throughput sequencing (HTS) methods, the detection and characterization of plant viruses have changed dramatically. Previously, virus identification usually required a priori assumptions about the viruses present in a sample, and successful detection was highly dependent on the availability of specific antibodies and nucleic acid probes. These methods often required time-consuming experiments on indicator plants and cumbersome purification protocols. HTS has turned virus detection into a data-centered approach in which samples are sequenced first and viral sequences are identified later using bioinformatic methods. HTS methods are extremely sensitive, allowing detection of even a single viral sequence in databases containing tens of millions of sequences, and have become a promising diagnostic tool in plant surveillance programs. One of the main difficulties non-bioinformaticians face in analyzing HTS data consists in dealing with large amounts of data and complex pipelines. In this work, we describe an automated virus diagnostics pipeline in R for rapid identification of viruses in potato (Solanum tuberosum, S. phureja), bell pepper (Capsicum annuum) and tomato (S. lycopersicum). The pipeline has been optimized to run on standard desktop computers and to be usable by non-bioinformaticians.

D-014: A study on novel estimation method of cell differentiation lineage by single cell trajectory inference
  • Hironori Shigeta, Osaka University, Japan
  • Shigeto Seno, Osaka University, Japan
  • Hideo Matsuda, Osaka University, Japan
  • Shuhei Yao, Osaka University, Japan

Short Abstract: Tools for analyzing gene expression data play an important role in understanding the biological mechanisms underlying cellular responses. Recently, single-cell RNA sequencing has emerged as a notable approach for discovering biological mechanisms, especially the heterogeneity of complex tissues. However, no robust analysis method for single-cell expression datasets has been established, and it remains unclear which method best reflects the underlying biology; for instance, it is important to know which cell trajectory analysis best reflects real cell differentiation. Saelens et al. reported a comprehensive evaluation of single-cell trajectory inference platforms in terms of stability, accuracy, usability and scalability; however, biological usability was inadequately evaluated, as it is difficult to measure. In this presentation, to test biological usability, we report an evaluation of two major methods, Monocle3 and Slingshot, on cellular differentiation using single-cell data of myeloid progenitors (Paul et al., 2015). Slingshot produced a different lineage pattern from Monocle3, and each method has its own pros and cons. This may indicate that yet another method should be developed for investigating single-cell data.

D-015: NanoVar: Accurate Characterization of Patients’ Genomic Structural Variants Using Low-Depth Nanopore Sequencing
  • Cheng Yong Tham, Cancer Science Institute of Singapore, Singapore
  • Roberto Tirado-Magallanes, Cancer Science Institute of Singapore, Singapore
  • Daniel G. Tenen, Cancer Science Institute of Singapore, Harvard Stem Cell Institute, Singapore
  • Touati Benoukraf, Cancer Science Institute of Singapore, Memorial University of Newfoundland, Singapore

Short Abstract: Despite the increasing significance of structural variants (SVs) in the development of many human diseases, progress in novel pathological SV discovery remains impeded, partly due to the challenges of accurate and routine SV characterization in patients. The recent advent of third-generation sequencing (3GS) technologies brings promise for better characterization of genomic aberrations by virtue of longer sequencing reads. However, the applications of 3GS are still limited by their high sequencing error rates and low sequencing throughput. To overcome these limitations, we present NanoVar, an accurate, rapid and low-depth (4X) 3GS SV caller utilizing long reads generated by Oxford Nanopore sequencing. NanoVar employs split reads and hard-clipped reads for initial SV detection, and utilizes a simulation-trained neural network classifier for true SV breakend enrichment. In simulated data, NanoVar demonstrated the highest SV detection accuracy (F1 score = 0.91) among long-read SV callers using 12 gigabases (4X) of sequencing data. Interestingly, in patient samples, NanoVar characterized not only genomic aberrations but also normal alternative sequences or alleles present in healthy individuals. Taken together, NanoVar improves the reliability, scope, and speed of SV characterization at a lower sequencing cost, an approach compatible with clinical studies and large-scale SV-association research.

D-016: Counting genes, cells etc. When ecology meets cell biology.
  • Leonid Bystrykh, UMCG, University of Groningen, Netherlands

Short Abstract: Cellular and molecular barcodes are used in molecular biology to count cells, clones, organelles or biomolecules. They usually employ synthetic DNA tags or specific DNA mutations. Although the only purpose of molecular counting is to estimate the size of a population, counting is complicated by the quality of the barcodes and the noisiness of the (sequencing) readouts. Most frequently, the count of molecular entities is limited to the actual count of barcodes. Sporadically, predictive approaches are used to estimate the complexity of the entire population. Another increasingly popular approach is to use the Shannon index to weight the barcodes that contribute most to the population. The key issue is to use proper controls for these methods and to test them on populations of known size, such as the number of genes in a genome or the number of barcodes in a library. Whereas these counting methods are relatively new in molecular biology, they are routinely used in ecological population studies. We expect that a broad range of richness estimators, such as Good-Turing, ACE, and Chao1 (and Chao2), will become popular in molecular biology as well. Here we provide a few examples of such implementations, highlight critical issues in these applications, and suggest possible ways for improvement.
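
Two of the estimators mentioned are short formulas over a vector of barcode counts; a sketch (textbook definitions, not the authors' code):

```python
import math

def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over barcode frequencies;
    higher H means counts are spread over more, and more even, barcodes."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def chao1(counts):
    """Chao1 lower-bound richness estimate from singletons (f1) and
    doubletons (f2): S_obs + f1^2 / (2 * f2). Rare barcodes observed only
    once or twice are used to predict how many were missed entirely."""
    observed = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    if f2 == 0:
        # bias-corrected form used when no doubletons are observed
        return observed + f1 * (f1 - 1) / 2.0
    return observed + f1 * f1 / (2.0 * f2)
```

Testing such estimators on a library of known size, as the abstract suggests, amounts to checking how close chao1 comes to the true barcode count as sequencing depth varies.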

D-017: Integrative analysis of RNA-Seq and ChIP-Seq data using Pareto Optimization
  • Yingying Cao, University of Duisburg-Essen, Germany
  • Manuela Wuelling, University of Duisburg-Essen, Germany
  • Andrea Vortkamp, University of Duisburg-Essen, Germany
  • Daniel Hoffmann, University of Duisburg-Essen, Germany

Short Abstract: Background: RNA-Seq is an important tool for defining differential gene expression profiles (mRNA levels) between cell populations or experimental conditions. Gene expression regulation is a complex and dynamic process in which post-translational modifications of histones are a key component. The advent of high-throughput sequencing has allowed genome-wide profiling of histone modifications by ChIP-Seq. Matched RNA-Seq and ChIP-Seq data for several different histone modifications exist, but methods to integrate the two data types are still rare. Methods: To integrate RNA-Seq data with ChIP-Seq data for more than two different histone modifications, we developed a novel integrative analysis approach based on Pareto optimization. Results: We demonstrate the approach on several datasets with RNA-Seq and corresponding histone-modification ChIP-Seq data. The results show that histone modification levels and gene expression are well correlated. With the integrative approach, we detected sets of genes showing congruent differences at both the transcriptomic and the epigenomic level, with functions that plausibly feature prominently in the respective cell types.
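
The building block of any Pareto optimization is the Pareto front: the set of candidates (here, genes scored on several criteria, e.g. expression change and each histone-modification change) not dominated by any other candidate. A minimal sketch, assuming maximization in every coordinate; the abstract does not specify the actual objectives.

```python
def pareto_front(points):
    """Return the non-dominated points among tuples of objective values.

    A point p is dominated if some other point q is >= p in every
    coordinate and strictly > p in at least one."""
    front = []
    for p in points:
        dominated = any(
            q != p and all(qi >= pi for qi, pi in zip(q, p))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

Genes on the front are exactly those for which no other gene is at least as extreme in every data type, which is what makes the front a natural set of "congruent" candidates.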

D-018: 4C-seq data: Simulation and algorithm benchmarking
  • Carolin Walter, Westfälische Wilhelms-Universität Münster, Germany
  • Daniel Schuetzmann, University Muenster, Germany
  • Frank Rosenbauer, University of Münster, Institute of Molecular Tumorbiology, Germany
  • Martin Dugas, Institute of Medical Informatics, University of Muenster, Germany, Germany

Short Abstract: Circular chromosome conformation capture combined with high-throughput sequencing (4C-seq) is a next-generation sequencing (NGS) based method that provides information on the three-dimensional organization of the genome. With a fragmented data structure and technical biases that may influence the actual signal, choosing an optimized 4C-seq analysis strategy is of critical importance to prevent misinterpretation of the data. Replicates help to discriminate between signal and noise, but are not trivial to analyze. We therefore present a benchmarking study of the 4C-seq algorithms fourSig, r3Cseq, peakC, FourCSeq, 4C-ker, and Splinter's algorithm, with two to six variants per program. Each variant's performance is tested for both single sample and replicate conditions. We evaluate the respective precision, recall, and F1 score, and compare the similarity of replicate results on simulated data with a known ground truth, and 16 published datasets with a set of 80 validated interactions in total. The simulation of 18 datasets in far-cis, with 12 subsets and 2x5 replicates each, and 32 near-cis datasets with 2x5 replicates allows for a thorough benchmarking of the chosen algorithms' performance. We identify algorithm variants with high precision and recall in a variety of settings, and provide information regarding the programs' run time and usability.
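
The benchmark metrics above are standard; for completeness, a sketch of how precision, recall, and F1 would be computed for a set of called interactions against the validated ground-truth set:

```python
def precision_recall_f1(called, truth):
    """Precision, recall, and F1 for called interactions vs. a validated
    ground-truth set (both given as collections of hashable identifiers)."""
    called, truth = set(called), set(truth)
    tp = len(called & truth)                    # true positives
    precision = tp / len(called) if called else 0.0
    recall = tp / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```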

D-019: Management of Big Variant Data: Evaluation of Tools and Database Systems
  • Mohamed Fawzy, KACST, Saudi Arabia
  • Mohamed El-Kalioby, KACST-KFSHRC, Saudi Arabia
  • Mohamed Abouelhoda, King Faisal Specialist Hospital and Research Center, Saudi Arabia

Short Abstract: The last decade has witnessed the start of mega genome projects studying the genomes of thousands or even millions of individuals. These projects accumulate large numbers of variants in text files in VCF format, which need to be managed and indexed for efficient querying. In this abstract, we review and evaluate different systems developed to manage huge sets of human variants (e.g., GenomicsDB, Gemini, and GTRAC). Moreover, we evaluate the performance of general-purpose database systems (e.g., MySQL, ClickHouse) after straightforward creation of data models and composition of SQL queries. Our experimental results on huge datasets show that each family of tools has its own advantages, with some trade-offs: the most space-efficient tools cannot support complex queries, and the fastest tools have high space consumption. In conclusion, we think that a straightforward implementation based on ClickHouse presents a good choice, with a competitive edge in terms of space consumption, query time, and out-of-the-box support for complex SQL-based querying.
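
To illustrate the "straightforward database implementation" route, here is a minimal sketch using SQLite as a stand-in for the engines evaluated (MySQL, ClickHouse); the schema is a hypothetical flattening of VCF records into one genotype row per sample, not the authors' data model.

```python
import sqlite3

# in-memory database standing in for a real variant store
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE variants (
        sample_id TEXT, chrom TEXT, pos INTEGER,
        ref TEXT, alt TEXT, genotype TEXT
    )
""")
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)",
    [("S1", "chr1", 12345, "A", "G", "0/1"),
     ("S2", "chr1", 12345, "A", "G", "1/1"),
     ("S1", "chr2", 999, "C", "T", "0/1")],
)
# a positional index makes locus lookups efficient
conn.execute("CREATE INDEX idx_pos ON variants (chrom, pos)")

# a typical cohort query: all carriers of a given variant
carriers = conn.execute(
    "SELECT sample_id, genotype FROM variants "
    "WHERE chrom = ? AND pos = ? AND alt = ?",
    ("chr1", 12345, "G"),
).fetchall()
```

Flattening to one row per sample-genotype trades space for exactly the kind of out-of-the-box SQL queryability the abstract credits to the ClickHouse approach.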

D-020: Single-cell Lineage Tracing by Integrating CRISPR-Cas9 Mutations With Transcriptomic Data
  • Hamim Zafar, Computational Biology Department, School of Computer Science, Carnegie Mellon University, United States
  • Chieh Lin, Machine Learning Department, School of Computer Science, Carnegie Mellon University, United States
  • Ziv Bar-Joseph, Computational Biology Department, Machine Learning Department, School of Computer Science, Carnegie Mellon University, United States

Short Abstract: Reconstructing the cell lineages that lead to the formation of tissues and organs is of crucial importance in developmental biology. scGESTALT is an experimental technique that combines the CRISPR-Cas9-based lineage tracing method GESTALT with droplet-based single-cell transcriptomic profiling. Even though expression information is collected, scGESTALT reconstructs the lineage tree solely from the stochastic Cas9-induced mutations via maximum parsimony. As a result, the resulting lineage sometimes fails to separate different cell types and places similar cell types on distant branches. To improve the reconstruction of lineages from CRISPR and scRNA-seq data, we developed a novel statistical method, LinTIMaT (Lineage Tracing by Integrating Mutation and Transcriptomic data), which defines a new optimization problem combining a maximum-likelihood-based tree reconstruction method with Bayesian hierarchical clustering that evaluates the coherence of the expression information, such that the resulting tree concurrently maximizes agreement with both the transcriptomes and the induced mutations from the same cells. We applied LinTIMaT to two zebrafish datasets generated using scGESTALT and show that by combining these two data types we can both improve the reconstruction of trees for individual organisms and generate, for the first time, a consensus model that combines data from multiple individuals studied using scGESTALT.

D-021: Differential Expression Analysis with Latent Variable Models
  • Joana Godinho, Instituto Superior Técnico, ULisboa, Portugal
  • Alexandra Carvalho, Instituto Superior Técnico, ULisboa and Instituto de Telecomunicações, Portugal
  • Susana Vinga, INESC-ID, Portugal

Short Abstract: In molecular biology, genomic signature analysis is a powerful means to unravel obscured cellular aspects. One of its main tasks is the exploration of transcriptomic data, which enables the identification of differentially expressed genes (DEGs). Disease profiling, treatment development, and the identification of new cell populations are some of the most relevant applications that rely on DEG identification. In this context, three main technologies have emerged: DNA microarrays, bulk RNA sequencing (RNA-seq), and single-cell RNA sequencing (scRNA-seq), the main focus of this work. Although scRNA-seq tends to offer more accurate data than the other two, it is still limited by many confounding factors. We introduce two novel approaches to DEG identification in single-cell data. Both techniques rely on latent variable models to account for the misleading factors present in the data. In addition, we benchmark the proposed methods against known DEG detection tools for single-cell data, such as SCDE and MAST, using two public real datasets that have been used in past assessment tasks. The results show that the two procedures are competitive with current methods and allow the identification of putative biomarkers.

D-022: Expression kinetics of differentiating mouse embryonic stem cells
  • Fabian Titz-Teixeira, CECAD, University of Cologne, Germany
  • Andreas Lackner, Mfpl, University of Vienna, Austria
  • Martin Leeb, Mfpl, University of Vienna, Austria
  • Andreas Beyer, Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany

Short Abstract: Due to their capacity for self-renewal and their ability to differentiate into any cell type, embryonic stem cells (ESCs) have great potential in fields such as regenerative medicine. In contrast to the state of pluripotency itself, the exit from the pluripotent state is only poorly characterized. Here, we present the transcriptomic characterization of 74 mouse knockout cell lines exhibiting delayed differentiation potential. In order to quantify the extent of delay of individual sub-networks in these cell lines, we compared knockout expression profiles to data from a dense, 32 h expression time course of differentiating wild-type mESCs. We developed a bioinformatics strategy to ‘position’ each knockout on the time axis of differentiating mESCs, which enables quantification of the differentiation delay of the whole transcriptome and of molecular sub-networks. The analysis led to the surprising finding that the core pluripotency network often lagged behind the state changes of differentiation-associated networks. Thus, our work implies a regulatory disconnect between the core pluripotency network and genes required for establishing subsequent cellular states.

D-023: Mapping genome-wide DNA methylation patterns in gliomas in context of IDH gene mutation status and REST transcription factor binding
  • Bartosz Wojtas, Nencki Institute of Experimental Biology, PAS, Warsaw, Poland
  • Michal J Dabrowski, Institute of Computer Science, PAS, Warsaw, Poland
  • Agata Dziedzic, Institute of Computer Science, PAS, Warsaw, Poland
  • Michal Draminski, Institute of Computer Science, PAS, Warsaw, Poland
  • Rafal Guzik, Nencki Institute of Experimental Biology, PAS, Warsaw, Poland
  • Karolina Stepniak, Nencki Institute of Experimental Biology, PAS, Warsaw, Poland
  • Bartlomiej Gielniewski, Nencki Institute of Experimental Biology, PAS, Warsaw, Poland
  • Tomasz Czernicki, Warsaw Medical University, Poland
  • Bartosz Czapski, Mazovian Brodno Hospital, Poland
  • Miroslaw Zabek, sekretariat@szpital-brodnowski.waw.pl, Poland
  • Wieslawa Grajkowska, Children’s Memorial Health Institute, Warsaw, Poland
  • Katarzyna Kotulska, Children’s Memorial Health Institute, Warsaw, Poland
  • Bozena Kaminska, Nencki Institute of Experimental Biology, PAS, Warsaw, Poland

Short Abstract: Methylation of DNA regulatory regions influences gene expression, and alterations of the methylome play an important role in glioma pathogenesis. Here we identified differentially methylated sites in gliomas of different histopathological WHO grades (I, II, III and IV) or IDH gene mutation status (n = 21), using bisulphite conversion and the SeqCap Epi CpGiant Methylation panel with Illumina NGS sequencing. Additionally, ChIP-seq analysis for the REST transcription factor was performed on chromatin isolated from freshly resected glioma tumors as well as from an IDH-mutant cell line and its paired isogenic WT line. RNA-seq was performed for the same brain tumor samples. We detected differentially affected pathways in IDH-mutant and WT samples, and also noted differential REST transcription factor binding to differentially methylated promoters. We believe that REST may be a mediator of the IDH-related phenotype in human gliomas. Supported by NCN grant 2015/16/W/NZ2/00314 and TEAM TECH CORE FACILITY FNP: Development of comprehensive diagnostics and personalized therapy in neuro-oncology.

D-024: MicroNAP: A pipeline to characterize genetic changes in engineered microbial strains based on Next Generation Sequencing data.
  • Veronika Schusterbauer, Institute of Medical Engineering, Graz University of Technology; bisy e.U., Austria
  • Gerhard Thallinger, Institute of Computational Biotechnology, Graz University of Technology, Austria

Short Abstract: Genetic modification of microbial organisms to express recombinant proteins has been routine for several decades. With the emergence of next generation sequencing (NGS) techniques, a broad spectrum of approaches to detect genomic variation from NGS data has been developed. Nonetheless, no single approach is able to resolve all kinds of genomic variation, especially the complex structural rearrangements introduced during genetic modification. The workflow we are developing aims to resolve single nucleotide variations (SNVs) and small insertions and deletions (InDels), as well as larger structural variants (SVs), with single-base accuracy. Furthermore, we want to be able to identify potential microbial contaminations, validate monoclonality, and detect changes of ploidy. The current pipeline resolves small variations as well as larger simple SVs with high sensitivity and specificity, and it can determine the sequence of gene knockout locations with single-nucleotide resolution. We analyzed 14 sets of short paired-end read data from 13 genetically modified strains of the methylotrophic yeast Komagataella phaffii. The analysis showed that severe off-target effects, such as chromosomal rearrangements, happen more often than expected. For the reliable characterization of integrated recombinant genes, we will likely have to include longer reads in the analysis, such as those produced by PacBio or Oxford Nanopore technologies.

D-025: Using extended context information to construct somatic single nucleotide variant mutational signatures
  • Kwong Leong John Wong, DKFZ, Germany
  • Christian Aichmüller, DKFZ, Germany
  • Dr. Peter Lichter, German Cancer Research Center, Germany
  • Dr. Marc Zapatka, German Cancer Research Center, Germany

Short Abstract: Somatic mutational signatures are useful for stratifying cancer patients and identifying active mutational processes. Yet the context information behind SNV signatures is not fully exploited by trinucleotide signatures; more information could be extracted by extending the context nucleotides used for signature construction. Attempts have been made to use penta-nucleotide context information for SNV signature construction, but the increased complexity of 1536 classes of penta-nucleotide substitutions is also mathematically challenging. Constructing NxNxN penta-nucleotide signatures, which disregard the nucleotide identities at the “x” positions, reduces the complexity to 96 classes. Using non-negative matrix factorization (NMF), forty signatures were identified based on the extended SNV context of ~3500 tumors from the PCAWG and ICGC cohorts. One signature is associated with defective homologous recombination and two signatures are associated with APOBEC activity. The signatures in extended context are not only useful for a better understanding of mutational processes in cancer but also serve as a benchmarking tool for signature assignment algorithms. A toolset was built to test reproducibility across the two signature contexts on shared biological processes, and the proposed metric was used to review the performance of 5 existing signature assignment algorithms.
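The class counts mentioned above follow from simple combinatorics: 6 pyrimidine-centered substitution types, times 4 possible bases at each retained context position (a counting sketch only; the signature extraction itself uses NMF, as described in the abstract):

```python
SUBSTITUTION_TYPES = 6  # C>A, C>G, C>T, T>A, T>C, T>G (pyrimidine-centered)
BASES = 4               # A, C, G, T at each retained context position

def context_classes(retained_positions):
    """Number of mutation classes when `retained_positions` context
    bases around the mutated site are kept."""
    return SUBSTITUTION_TYPES * BASES ** retained_positions

trinucleotide = context_classes(2)    # one base on each side  -> 96
pentanucleotide = context_classes(4)  # two bases on each side -> 1536
nxnxn = context_classes(2)            # "x" positions ignored  -> back to 96
```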

D-026: Accurate determination of node and arc multiplicities in de Bruijn graphs using conditional random fields
  • Aranka Steyaert, Ghent University, Belgium
  • Pieter Audenaert, Ghent University, Belgium
  • Jan Fostier, Ghent University, Belgium

Short Abstract: Many bioinformatics tools use read-based de Bruijn graphs as an approximate representation of the underlying genome sequence. However, sequencing errors and repeated subsequences complicate the identification of the true underlying sequence. A key step in this process is to infer the multiplicities of the nodes/arcs in the graph, i.e., the number of times each k-mer (resp. k+1-mer) corresponding to a node (resp. arc) is present in the genomic sequence. Multiplicities thus reveal repeat structure and the presence of sequencing errors. Multiplicities of nodes/arcs are reflected in the node/arc coverage; however, coverage variability and coverage biases complicate their determination. Current methodology determines multiplicities based solely on the information in each node/arc individually, underutilising the information present in the sequencing data. To improve the accuracy with which node and arc multiplicities in a de Bruijn graph are inferred, we developed a conditional random field model that efficiently combines the coverage information within each node/arc with the information of surrounding nodes and arcs. Multiplicities are thus assigned collectively in a more consistent manner. We observe an accuracy improvement and a more robust EM parameter estimation. We believe this methodology can be a useful addition to tools that make use of de Bruijn graphs.
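The per-node baseline that the abstract contrasts against can be sketched as rounding each node's coverage against the expected coverage per genomic copy (a toy illustration, not the authors' CRF model; the coverage figures are hypothetical):

```python
def naive_multiplicity(node_coverage, coverage_per_copy):
    """Assign each node the multiplicity whose expected coverage is
    closest, ignoring neighbouring nodes and arcs entirely."""
    return max(0, round(node_coverage / coverage_per_copy))

# With ~30x coverage per genomic copy (hypothetical):
assert naive_multiplicity(3.0, 30.0) == 0    # likely a sequencing error
assert naive_multiplicity(33.0, 30.0) == 1   # unique sequence
assert naive_multiplicity(61.0, 30.0) == 2   # two-copy repeat
```

The CRF approach described above improves on this by making the assignments jointly consistent across neighbouring nodes and arcs rather than one node at a time.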

D-027: Comparison of RNA-seq applications using Oxford Nanopore, Pacific Biosciences and Illumina technologies
  • Weihong Qi, Functional Genomics Center Zurich, ETH Zürich / University of Zurich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
  • Anna Bratus-Neuenschwander, Functional Genomics Center Zurich, ETH Zürich / University of Zurich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
  • Andrea Patrignani, Functional Genomics Center Zurich, ETH Zürich / University of Zurich, Winterthurerstrasse 190, 8057 Zürich, Switzerland
  • Ralph Schlapbach, Functional Genomics Center Zurich, ETH Zürich / University of Zurich, Winterthurerstrasse 190, 8057 Zürich, Switzerland

Short Abstract: We sequenced the Universal Human Reference RNA sample (Agilent P/N 74000) using Oxford Nanopore (ONT) GridION X5 and PromethION, Pacific Biosciences (PacBio) and Illumina technologies. First, we evaluated the performance of the splice-aware alignment tools GMAP and minimap2 in aligning different types of long reads, including ONT 1D RNA, 1D and 1D2 DNA reads, and PacBio subreads and circular consensus reads. minimap2 aligned more reads with more accurate alignments, reporting more known junctions and fewer chimeric alignments. Second, we evaluated different ONT RNA-seq protocols for the GridION X5, including direct sequencing of mRNAs and cDNAs, as well as sequencing of PCR-amplified cDNAs (PCR-cDNA). We found that PCR-cDNA sequencing could safely start with total RNA, without pre-isolation of mRNA. Direct cDNA sequencing, starting from pre-isolated mRNA, produced libraries with higher diversity, in which more transcripts were detected with fewer reads. Third, compared to PacBio Iso-Seq, ONT PCR-cDNA sequencing produced fewer chimeric reads. Both protocols produced similar fractions of full-length transcripts, but the detected transcripts showed different length distributions, with an enrichment of long transcripts observed with PacBio Iso-Seq. Finally, GridION X5 and PromethION sequencing yielded 6 and 73 million reads, respectively, which allowed us to quantify the agreement of measured cDNA abundance between ONT and Illumina.

D-028: Flexible Experimental Designs for Valid Single-cell RNA-sequencing Experiments Allowing Batch Effects Correction
  • Yingying Wei, The Chinese University of Hong Kong, Hong Kong

Short Abstract: Despite their widespread applications, single-cell RNA-sequencing (scRNA-seq) experiments are still plagued by batch effects and dropout events. Although the completely randomized experimental design has frequently been advocated to control for batch effects, it is rarely implemented in real applications due to time and budget constraints. Here, we mathematically prove that under two more flexible and realistic experimental designs---the “reference panel” and the “chain-type” designs---true biological variability can also be separated from batch effects. We develop Batch effects correction with Unknown Subtypes for scRNA-seq data (BUSseq), which is an interpretable Bayesian hierarchical model that closely follows the data-generating mechanism of scRNA-seq experiments. BUSseq can simultaneously correct batch effects, cluster cell types, impute missing data caused by dropout events, and detect differentially expressed genes without requiring a preliminary normalization step. We demonstrate that BUSseq outperforms existing methods with simulated and real data.

D-029: Large-scale RNA-seq datasets enable the detection of genes with a differential expression dispersion in cancer
  • Christophe Le Priol, Univ. Grenoble Alpes, CEA, Inserm, BIG-BGE, France
  • Xavier Gidrol, Univ. Grenoble Alpes, CEA, Inserm, BIG-BGE, France
  • Chloé-Agathe Azencott, MINES ParisTech, France

Short Abstract: The majority of gene expression studies focus on genes whose mean expression differs between two or more populations of samples. However, a difference in the variance of gene expression may also be biologically and physiologically relevant. In the classical differential expression analysis workflow for RNA-sequencing (RNA-seq) data, the dispersion, which defines the variance, is only considered as a parameter to be estimated prior to testing for a difference in mean expression between conditions of interest. Here, we evaluate two new methods, MDSeq and DiPhiSeq, which can detect differences in both mean and dispersion in RNA-seq data. We thoroughly investigated the performance of these methods on simulated datasets and characterized parameter settings that reliably detect genes with a differential expression dispersion on real datasets. We applied MDSeq and DiPhiSeq to The Cancer Genome Atlas mRNA and microRNA datasets. Interestingly, among non-differentially expressed mRNAs with an increased expression dispersion in tumors, we identified some key cellular functions, such as catabolism, which are over-represented in most cancers. Moreover, our approach highlights autophagy, whose role in carcinogenesis is context-dependent, and thus may provide a lead for further investigation of this pathway.

D-030: Genome sequence and comparative genomics of the rubber tree pathogen Pseudocercospora ulei
  • Sandra Milena Gonzalez Sayer, National University of Colombia, Colombia
  • Diego Mauricio Riaño Pachon, Nuclear Energy Center, São Paulo University, Brazil
  • Ibonne Aydee Garcia, National University of Colombia, Colombia
  • Fabio Aristizabal Gutierrez, National University of Colombia, Colombia

Short Abstract: Natural rubber is a naturally produced polymer that serves as the raw material for many products in the medical and automotive industries. Hevea brasiliensis is the main commercial source due to its physicochemical properties and high yield. Despite being the geographic origin of the species, Latin American countries contribute only 1% of global production, limited by the presence of a disease known as South American leaf blight (SALB). The causal agent, the ascomycete fungus Pseudocercospora ulei, has not been properly studied, and the molecular basis of its pathogenicity mechanism remains unknown. We are sequencing the genome of P. ulei with the aim of identifying genes involved in SALB pathogenicity. To this end, whole-genome shotgun sequencing and assembly were carried out using the PacBio, Oxford Nanopore and Illumina platforms. The final assembly size was 90 Mb, split into 1311 contigs with an N50 of 144,000 bp. BUSCO assigned 95% complete BUSCOs from the Ascomycota lineage to the P. ulei genome; of these, only 2% were duplicated, suggesting either that the genome is haploid or that the level of polymorphism between chromosomes is very low. We also found domains associated with DNA rearrangement, which could be correlated with gene duplication events.
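For reference, the N50 statistic quoted above is the contig length at which the running total of contig lengths, taken from longest to shortest, first covers half of the assembly (a generic sketch; the toy lengths below are made up):

```python
def n50(lengths):
    """Smallest length L such that contigs of length >= L together
    cover at least half of the total assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length

# Toy assembly totalling 100: 40 + 25 = 65 >= 50, so N50 = 25.
assert n50([40, 25, 15, 10, 10]) == 25
```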

D-031: Beta-binomial modeling of CRISPR pooled screen data identifies target genes with greater sensitivity and fewer false negatives
  • Hyun-Hwan Jeong, Baylor College of Medicine, United States
  • Seon Young Kim, Baylor College of Medicine, United States
  • Maxime W.C. Rousseaux, University of Ottawa, Canada
  • Huda Y. Zoghbi, Howard Hughes Medical Institute, United States
  • Zhandong Liu, Baylor College of Medicine, United States

Short Abstract: The simplicity and cost-effectiveness of CRISPR technology have made high-throughput pooled screening approaches accessible to virtually any lab. Analyzing the large sequencing datasets derived from these studies, however, still demands considerable bioinformatics expertise. Various methods have been developed to lessen this requirement, but three tasks in accurate CRISPR screen analysis still involve bioinformatic know-how if not prowess: designing a proper statistical hypothesis test for robust target identification, developing an accurate mapping algorithm to quantify sgRNA levels, and minimizing the number of parameters that need to be fine-tuned. To make CRISPR screen analysis more reliable as well as more readily accessible, we have developed a new algorithm, called CRISPRBetaBinomial or CB2 (https://CRAN.R-project.org/package=CB2). Based on the beta-binomial distribution, which is better suited to sgRNA count data, CB2 outperforms the eight most commonly used methods (HiTSelect, MAGeCK, PBNPA, PinAPL-Py, RIGER, RSA, ScreenBEAM, and sgRSEA) in both accurately quantifying sgRNAs and identifying target genes, with greater sensitivity and a much lower false discovery rate. It also accommodates staggered sgRNA sequences. In conjunction with CRISPRcloud, CB2 will bring CRISPR screen analysis within reach of a wider community of researchers.
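To illustrate why a beta-binomial suits overdispersed count data (a generic sketch, not the CB2 implementation; all parameter values here are hypothetical): a beta-binomial is a binomial whose success probability is itself Beta(a, b)-distributed, so it keeps the binomial mean while inflating the variance.

```python
import math

def betabinom_pmf(k, n, a, b):
    """P(K = k) for a beta-binomial with n trials and a Beta(a, b)
    prior on the success probability, via log-gamma for stability."""
    def log_beta(x, y):
        return math.lgamma(x) + math.lgamma(y) - math.lgamma(x + y)
    log_choose = (math.lgamma(n + 1) - math.lgamma(k + 1)
                  - math.lgamma(n - k + 1))
    return math.exp(log_choose + log_beta(k + a, n - k + b) - log_beta(a, b))

n, a, b = 50, 2.0, 8.0  # hypothetical read depth and shape parameters
pmf = [betabinom_pmf(k, n, a, b) for k in range(n + 1)]
mean = sum(k * p for k, p in enumerate(pmf))              # n*a/(a+b) = 10
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
# A plain binomial with the same mean (p = 0.2) would have variance
# n*p*(1-p) = 8; the beta-binomial distribution is substantially wider.
```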

D-032: Characterizing chromatin landscape from aggregate and single-cell genomic assays using flexible duration modeling.
  • Mariano Gabitto, Flatiron Institute, United States
  • Anders Rasmussen, Flatiron Institute, United States
  • Richard Bonneau, Center for Data Science, New York University, New York, NY, USA, United States

Short Abstract: Distilling functional regions from ATAC-seq and other similar genomic technologies presents diverse analysis challenges, due to the relative sparseness of the data produced and the interaction of complex noise with multiple scales of chromatin structure. Methods commonly used to analyze chromatin accessibility datasets are adapted from algorithms designed to process different experimental technologies, disregarding the statistical and biological differences intrinsic to the ATAC-seq technology. Here, we present ChromA, a Bayesian statistical approach that uses hidden semi-Markov models to better model the duration of functional and accessible regions. We demonstrate the method on multiple genomic technologies, with a focus on ATAC-seq data. ChromA annotates the cellular epigenetic landscape by integrating information from replicates, producing a consensus de-noised annotation of chromatin accessibility. ChromA can analyze single-cell ATAC-seq data, improving cell type identification and correcting many biases generated by the sparse sampling inherent in single-cell technologies. We validate ChromA on multiple technologies and biological systems, including mouse and human immune cells, and find it effective at recovering accessible chromatin, establishing ChromA as a top-performing general platform for mapping the chromatin landscape in different cellular populations from diverse experimental designs.

D-033: Mosaic loss of Y chromosome in brain tissue associated with late-stage Alzheimer's neuropathology.
  • Sara Mostafavi, The University of British Columbia, Canada
  • Emma Graham, The University of British Columbia, Canada
  • Mike Vermeulen, The University of British Columbia, Canada
  • Badri Vardarajan, Columbia University, United States

Short Abstract: Somatic mosaicism refers to the co-existence of two or more somatic cell populations with non-identical genotypes within an individual. Recent studies have found significant mosaic loss of the Y chromosome in human blood, which has been associated with poor health outcomes, including Alzheimer’s disease and cancer. Mosaic somatic aneuploidy (mSA) has also been found in brain tissue, but little is known about its prevalence relative to that observed in blood. Using WGS data from 354 males from a longitudinal ageing cohort, we quantified mSA of chromosomes X and Y in the dorsolateral prefrontal cortex (DLPFC), cerebellum and blood. In addition, we examined the relationship between mSA of the sex chromosomes and neuropathological characteristics of Alzheimer’s disease. We found that somatic mosaic events occur more frequently in whole blood than in the DLPFC or cerebellum. In the DLPFC, a reduction in read depth of the Y chromosome (suggesting mosaic loss) was associated with neurofibrillary tangle pathology and an increased rate of cognitive decline after controlling for age and post-mortem interval. Our results suggest that Y chromosome read depth in the DLPFC is modestly associated with late-stage Alzheimer’s neuropathology.

D-034: PipelineOlympics: Benchmarking of processing workflows for bisulfite sequencing data
  • Reka Toth, German Cancer Research Center (DKFZ), Germany
  • Yassen Assenov, German Cancer Research Center (DKFZ), Germany
  • Karl Nordstroem, Saarland University, Germany
  • Angelika Merkel, CNAG, Centre for Genomic Regulation (CRG), Spain
  • Edahi Gonzalez-Avalos, La Jolla Institute, United States
  • Matthias Bieg, Heidelberg Center for Personalized Oncology, German Cancer Research Center (DKFZ), Germany
  • Stephen Kraemer, German Cancer Research Center (DKFZ), Germany
  • Murat Iskar, German Cancer Research Center, Germany
  • Helene Kretzmer, University of Leipzig, Germany
  • Lelia Wagner, University of Heidelberg, Germany
  • Lilian Leiter, University of Heidelberg, Germany
  • Giuseppe Petroccino, BioMed X Innovation Center, Germany
  • Anand Mayakonda, German Cancer Research Center (DKFZ), Germany
  • Kersten Breuer, German Cancer Research Center (DKFZ), Germany
  • Gideon Zipprich, German Cancer Research Center (DKFZ), Germany
  • Lena Weiser, German Cancer Research Center (DKFZ), Germany
  • Philip Kensche, German Cancer Research Center (DKFZ), Germany
  • Renata Jurkowska, BioMed X Innovation Center, Germany
  • Christian Lawerenz, Berlin Institute of Health (BIH), Charite - University Clinic of Berlin, Germany
  • Ivo Buchhalter, German Cancer Research Center (DKFZ), Germany
  • Steve Hoffmann, Leibniz Institute of Aging - Fritz Lipmann Institute, Germany
  • Simon Heath, CNAG, Center of Genomic Regulation (CRG), Spain
  • Marc Zapatka, German Cancer Research Center, Germany
  • Joern Walter, Saarland University, Germany
  • Matthias Schlesner, Bioinformatics and Omics Data Analytics, Germany
  • Christoph Bock, CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Austria
  • Christoph Plass, German Cancer Research Center (DKFZ), Germany
  • Pavlo Lutsik, German Cancer Research Center (DKFZ), Germany

Short Abstract: Whole genome bisulfite sequencing (WGBS) is a state-of-the-art method for genome-scale assessment of DNA methylation levels, used for bulk, low-input and, more recently, single-cell analysis. Basic data processing – read trimming, alignment and site-wise estimation of DNA methylation levels – is crucial for downstream analysis. PipelineOlympics is a collaborative effort of leading labs in the field to comprehensively benchmark bisulfite sequencing software and to provide data processing guidelines for popular wet-lab protocols. At the core of the benchmarking is a reference dataset of highly accurate DNA methylation measurements obtained with locus-specific assays, which we use as the gold standard. In the initial phase of the benchmark we generated WGBS data for the well-characterized samples from the gold-standard set using several protocols, circulated the data among the partners, and collected methylation calls from ten representative workflows. Our pilot evaluation, comparing the calls to the gold-standard measurements and to each other, revealed important differences in workflow performance, further amplified by protocol peculiarities, the biological nature of the samples and variable sequencing depth. Furthermore, an exhaustive exploration of the combinatorial space of workflows is being performed. Ultimately, PipelineOlympics will be transformed into a long-term, extensible public benchmarking resource.

D-035: Single cell level characterization of the peripheral blood immune cells of opioid-dependent individuals
  • Tanya Karagiannis, Boston University, United States
  • John Cleary, Boston University, United States
  • Christine Cheng, Boston University, United States

Short Abstract: Chronic opioid use is known to advance addiction behavior and decisions, but it also has modulatory effects on the peripheral immune system. The underlying mechanisms of opioid action on immune function, including immunostimulatory and immunosuppressive effects, remain controversial. To understand the effect of opioids on the peripheral immune system, we performed single-cell RNA-sequencing (scRNA-seq) of peripheral blood mononuclear cells (PBMCs) from opioid-dependent individuals and neighborhood controls. Using single-cell transcriptomic tools, we identified and characterized transcriptional changes between opioid-dependent and control samples across naïve and LPS-stimulated innate and adaptive immune populations. We found that chronic opioid use results in the suppression of immune response pathways. Furthermore, we performed single-cell RNA-sequencing of in vitro morphine-treated PBMCs and observed similar effects. These results indicate dysregulatory effects of opioids on the immune system at both acute and chronic levels, and suggest the need to further characterize these immunomodulatory effects.

D-036: Impact of batch effect and study design biases on identification of genetic risk factors in sequencing data
  • Daniel P. Wickland, Mayo Clinic, University of Illinois at Urbana-Champaign, United States
  • Yan W Asmann, Mayo Clinic, United States

Short Abstract: Sequencing-based searches for disease-associated variants require large sample sizes to achieve sufficient statistical power, but they often entail batch effects and biases from study design, both of which hinder the ability to detect true genotype-trait associations. Common batch effects include different sequencing centers, different sample collection protocols, and different exome capture kits. For example, the Alzheimer’s Disease Sequencing Project (ADSP) sequenced exomes of more than 10,000 cases and controls using three sequencing centers and two exome capture kits. In addition, the controls were intentionally older than AD cases in an effort to increase the confidence of the AD variants, as “true” disease-causal variants should be absent in older but cognitively normal individuals. This design introduced an age variable confounded with AD status. We studied batch effects and confounding variables in this data set and demonstrated that both significantly impacted the association analysis. In particular, we identified significant batch differences in genotype quality and the confounding effect of age on variant association with AD. Our findings suggest that in order to minimize spurious associations, exome sequencing studies should follow consistent sample collection and sequencing protocols, use a uniform set of reagents, and minimize variation unrelated to disease phenotype between sample classes.

D-037: DAMN fast: Dna Alignment using Multi-armed baNdits
  • Tavor Baharav, Stanford University, United States
  • Govinda Kamath, Stanford University, United States
  • David Tse, Stanford University, United States
  • Ilan Shomorony, University of Illinois Urbana-Champaign, United States

Short Abstract: Pairwise read alignment is one of the most time-consuming tasks in long-read genome assembly. Current state-of-the-art pipelines use a seed-and-extend algorithm: in the first stage, one identifies reads that have high overlap using a k-mer hashing scheme, and then obtains more exact alignments via dynamic programming. We consider the hashing scheme used in the minHash alignment process (MHAP) of the Canu assembler, where the time-consuming step is estimating the Jaccard similarity of two strings by random sampling of hash functions. We introduce an algorithm, ADA-MHAP, which leverages recent ideas from the machine learning literature on Monte Carlo optimization. We perform these hashes and comparisons adaptively, spending little time on reads with no overlap and more on reads that are contending matches, improving on MHAP, which performs an equal number of hash comparisons for all reads. We provide a theoretical foundation for our algorithm and prove its superior performance over the naive non-adaptive MHAP strategy. We corroborate our theoretical results with experiments on real PacBio data, showing the 10x improvement afforded by adaptivity.

D-038: Spider and bagworm moth genomes reveal the diversity of silk protein motifs and their mechanical properties
  • Yuki Yoshida, Keio University, Japan
  • Masaru Tomita, Keio University, Japan
  • Kazuharu Arakawa, Keio University, Japan
  • Nobuaki Kono, Keio University, Japan
  • Nakamura Hiroyuki, Spiber Inc., Japan
  • Rintaro Ohtoshi, Spiber Inc., Japan
  • Daniel Pedrazzoli Moran, Spiber Inc., Japan
  • Asaka Shinohara, Spiber Inc., Japan
  • Masaru Mori, Keio University, Japan
  • Keiji Numata, RIKEN, Japan

Short Abstract: Many arthropods, such as spiders and lepidopterans, utilize silk for various purposes, including but not limited to reproduction, foraging, and protection. The egg sac, prey capture thread, and dragline silk produced by spiders, for example, each display distinct mechanical properties in strength, elasticity, and toughness. Silk is therefore considered an attractive biomaterial for sustainable industrial applications. Orb-weaving spiders of the family Araneidae and bagworm moths are particularly known for their high-performance silks within the orders Araneae and Lepidoptera, respectively. Here, we used a hybrid-sequencing approach and present the orb-weaving spider and bagworm moth genomes, including fibroin genes obtained from long sequencing reads and confirmed with multi-omics analysis of silk. In the spider genome, we obtained a new type of spidroin gene, and several novel spider silk-constituting elements, designated SpiCE, were found. Likewise, in the bagworm genome, we identified a fibroin gene with a distinct repetitive motif. A comparative study of the fibroin sequences and the mechanical properties of silks demonstrated a phylogenetic relationship between the unique silk constituents and silk mechanical performance.

D-039: Analysis of non-genetic variation in cancer using single-cell transcriptomes
  • Yejin Lee, Sookmyung Women's University, South Korea
  • Sukjoon Yoon, Sookmyung Women's University, South Korea

Short Abstract: Single-cell RNA sequencing data analysis is emerging as one of the most important techniques for analyzing the heterogeneity of cancer. This study focuses on the variation of single-cell gene expression among cancer cells (A549 and H1437) in clonal populations. Using SOM (Self-Organizing Map) and PCA (Principal Component Analysis), we found genes with varied expression patterns among cells in a clonal population. These genes may play a key role in drug resistance and recurrence, and gene set-based analysis showed that they serve several important biological functions. Studying non-genetic, transcriptional variation will provide new insights for understanding the differentiation and progression of cancer cells in the course of anticancer treatments. By analyzing single-cell data from clonal populations, we could discover potential new biomarkers, contributing to more precise, personalized medicine for individual patients.
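The PCA step can be illustrated with a minimal sketch that ranks genes by their loading on the first principal component of a cells-by-genes expression matrix. This is a toy version for illustration only, not the study's actual SOM/PCA pipeline:

```python
import numpy as np

def top_variable_genes(expr, n_top=2):
    """Rank genes by their absolute loading on the first principal component.

    expr: cells x genes matrix of (log-)expression values.
    """
    # Center each gene across cells before decomposition.
    centered = expr - expr.mean(axis=0)
    # Rows of Vt are the principal axes in gene space.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = np.abs(vt[0])
    return np.argsort(loadings)[::-1][:n_top]
```

Genes with the largest loadings on the leading components are the ones whose expression varies most coherently across cells of the clonal population.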

D-040: Enrichment Analysis of k-mer Composition Enables Identification of Telomeres
  • Askar Gafurov, Comenius University in Bratislava, Slovakia
  • Hana Lichancova, Faculty of Natural Sciences, Comenius University in Bratislava, Slovakia
  • Viktória Hodorová, Faculty of Natural Sciences, Comenius University in Bratislava, Slovakia
  • Jozef Nosek, Comenius University in Bratislava, Slovakia
  • Tomas Vinar, Comenius University in Bratislava, Slovakia
  • Broňa Brejová, Comenius University in Bratislava, Slovakia

Short Abstract: Background: Telomeres and repeat-rich subtelomeric regions are often hard to assemble from high-throughput sequencing data, and therefore the exact nature of the telomeric sequences remains unknown in many species. Results: We have developed a k-mer-based sequence analysis method to identify contig ends belonging to telomeric and subtelomeric regions. Our method uses a combination of long-read and short-read sequencing and compares k-mer composition in reads from untreated DNA to DNA treated with BAL31 nuclease. This enzyme digests the ends of DNA molecules and thus depletes telomeric and subtelomeric sequences. Conclusions: We have applied our method to the Jaminaea angkorensis genome. Our approach, combining k-mer analysis, a BAL31 digestion protocol, and Oxford Nanopore sequencing, has allowed us to better assemble repeat-rich subtelomeric regions in this genome.
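The core of the treated-versus-untreated comparison can be sketched as a k-mer depletion score between the two read sets. This is an illustrative toy version (constant pseudocount, tiny k), not the published implementation:

```python
import math
from collections import Counter

def kmer_counts(reads, k=4):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def depletion_scores(untreated, treated, k=4, pseudo=1.0):
    """Log2 ratio of k-mer frequency in untreated vs BAL31-treated reads.

    High positive scores flag k-mers depleted by the nuclease, i.e.
    candidate telomeric/subtelomeric sequence.
    """
    u, t = kmer_counts(untreated, k), kmer_counts(treated, k)
    total_u = sum(u.values()) or 1
    total_t = sum(t.values()) or 1
    scores = {}
    for km in set(u) | set(t):
        fu = (u[km] + pseudo) / total_u
        ft = (t[km] + pseudo) / total_t
        scores[km] = math.log2(fu / ft)
    return scores
```

Contig ends enriched for high-scoring k-mers are then candidates for telomeric or subtelomeric origin.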

D-041: NeuSomatic: Deep convolutional neural networks for robust somatic mutation detection
  • Sayed Mohammad Ebrahim Sahraeian, Roche Sequencing Solutions, United States
  • Li Tai Fang, Roche Sequencing Solutions, United States
  • Ruolin Liu, Roche Sequencing Solutions, United States
  • Bayo Lau, Roche Sequencing Solutions, United States
  • Marghoob Mohiyuddin, Roche Sequencing Solutions, United States
  • Hugo Y. K. Lam, Roche Sequencing Solutions, United States
  • Wenming Xiao, The Center for Devices and Radiological Health, U.S. Food and Drug Administration, FDA, United States

Short Abstract: Somatic mutations are crucial to the understanding of cancer formation, progression, and treatment, but are still challenging to detect with high fidelity. Here we present NeuSomatic, the first convolutional neural network-based approach for detecting somatic mutations with NGS. It is versatile, with applications across sequencing platforms and sequencing strategies, and can be used as a stand-alone somatic mutation detection method or with an ensemble of existing methods to achieve the highest accuracy. In addition to a diverse collection of in silico datasets for assessment, we used the first comprehensive and well-characterized tumor-normal reference samples from the SEQC-II consortium to investigate best practices for utilizing our deep-learning framework in cancer mutation detection. Using the high-confidence somatic mutations established for these reference samples by the consortium, we identified robust model-building strategies on multiple datasets derived from samples representing real scenarios. Our strategy achieved high robustness across multiple sequencing strategies such as WGS, WES, and AmpliSeq PCR enrichment for fresh and FFPE DNA input, and across varying tumor/normal purities and coverages (ranging from 10X to 2000X), significantly outperforming conventional detection approaches, especially in challenging scenarios such as low tumor purity, low allelic frequency, and tumor cell contamination in the matched normal.

D-042: Phylum-wide genome sequencing of Tardigrada with ultra-low input
  • Kazuharu Arakawa, Keio University, Japan

Short Abstract: Tardigrades remain relatively unexplored by genomics, despite the phylum's enigmatic phylogenetic position, which is key to resolving ecdysozoan phylogeny, and despite its many unique adaptations, exemplified by anhydrobiosis. Our previous genome sequencing of two eutardigrades, Ramazzottius varieornatus and Hypsibius exemplaris, revealed a multitude of genes related to anhydrobiosis and provided novel data for phylogenomics. While a more comprehensive exploration of the phylum is necessary, the limited amount of material obtainable from the wild and the non-negligible effects of contamination have been major hurdles for such studies. Using ultra-low-input sequencing of individual tardigrades, we have so far sequenced 50 tardigrade genomes, including 11 heterotardigrade species, both marine and terrestrial, and 39 eutardigrade species. Although coverage of the phylum is still limited, we are beginning to see a rather diverse evolution of anhydrobiosis. Whereas several components, such as the duplication of oxidative stress response genes, are shared between Eutardigrada and Heterotardigrada, many of the tardigrade-specific anhydrins identified so far appear clade-specific and non-conserved. We hope this “phylome” study will provide a fundamental resource for tardigradology and the evolutionary genomics of invertebrates.

D-043: Realistic Simulation of NGS Data based on FFPE Samples
  • Lanying Wei, Institute of Medical Informatics, University of Münster, Germany
  • Sarah Sandmann, Institute of Medical Informatics, University of Münster, Germany
  • Christian Thomas, Institute of Neuropathology, University of Münster, Germany
  • Martin Hasselblatt, Institute of Neuropathology, University of Münster, Germany
  • Martin Dugas, Institute of Medical Informatics, University of Münster, Germany

Short Abstract: In routine clinical practice, fresh-frozen tissue is rarely available and next-generation sequencing (NGS) usually needs to be carried out on archival formalin-fixed, paraffin-embedded (FFPE) tissue. This holds especially true for rare diseases. However, DNA from FFPE tissue is typically fragmented, cross-linked, and of lower quality, which leads to a high number of false positives in NGS analysis. The reported characteristics and levels of FFPE artifacts remain controversial, and realistic tools for simulation are not available. Here we present a novel NGS read simulator for FFPE samples. Analyses were performed on 8 high-coverage whole-genome sequencing (WGS) FFPE samples of 7 papillary tumors of the pineal region (PTPR) and 1 ependymoma, as well as 17 public FFPE/paired fresh-frozen samples. Comparison to standard approaches for NGS read simulation shows considerable differences between simulated and real FFPE data. Optimization of available variant calling tools for the analysis of FFPE samples can only be realized if realistic simulated data with known biological truth are available. In conclusion, our NGS read simulator for FFPE samples fills a current gap in the field of data simulation and is a first step toward optimized variant calling in FFPE samples.
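As a toy illustration of FFPE artifact injection, the sketch below introduces C>T substitutions (one commonly reported deamination artifact class) at a fixed rate. A realistic simulator like the one described would learn artifact profiles from real FFPE data rather than use a constant rate; names and parameters here are hypothetical:

```python
import random

def add_ffpe_artifacts(read, ct_rate=0.05, rng=None):
    """Introduce C>T substitutions at random cytosines, mimicking
    deamination artifacts typical of FFPE-derived DNA (illustrative only)."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    bases = list(read)
    for i, b in enumerate(bases):
        if b == "C" and rng.random() < ct_rate:
            bases[i] = "T"
    return "".join(bases)
```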

D-044: Encoding yeast genomic diversity using variation graphs
  • Prithika Sritharan, National Collection of Yeast Cultures, Quadram Institute, Norwich Research Park, Norwich, NR4 7UA, UK, United Kingdom
  • Katharina T. Huber, School of Computing Sciences, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK, United Kingdom
  • Eleanor Stanley, Eagle Genomics, The Biodata Innovation Centre, Wellcome Campus, Hinxton, Cambridge, CB10 1DR, UK, United Kingdom
  • William Spooner, Genomics England, Dawson Hall, Charterhouse Square, London, EC1M 6BQ, UK, United Kingdom
  • Jo Dicks, Quadram Institute, United Kingdom

Short Abstract: Linear reference genomes guide the mapping of sequence reads prior to variant detection. However, the exclusion of common variants from the reference sequence poses fundamental limitations when studying entire species, and therefore may be inadequate for organisms such as yeast which display a high level of sequence diversity. Variation graphs allow for individual genomes to be incorporated as variants within a bi-directed sequence graph. Crucially, it has been suggested that the use of variation graphs as alternative reference structures mitigates reference allele bias and improves both the accuracy and precision of read mapping, thereby increasing the detection of true, de novo variants. The National Collection of Yeast Cultures (NCYC; http://www.ncyc.co.uk) contains approximately 4,000 strains from over 530 species. A recent project has led to the whole genome sequencing of ~1,000 NCYC strains. Here, we describe a precise evaluation of variation graphs as yeast reference structures, using Illumina sequence read sets for Saccharomyces cerevisiae strains to quantify read mapping and variant calling. In all experiments conducted, we confirm that multi-strain variation graphs improve both the quantity of sequence read mapping and the quality of alignment, supporting the future use of variation graphs as reference structures for yeast genomes.

D-045: Bayesian deconvolution of somatic clones and pooled individuals with expressed variants in single-cell RNA-seq data
  • Yuanhua Huang, EMBL-European Bioinformatics Institute, United Kingdom
  • Davis McCarthy, EMBL-EBI, United Kingdom
  • Raghd Rostom, Wellcome Sanger Institute, United Kingdom
  • Sarah Teichmann, Wellcome Sanger Institute, United Kingdom
  • Oliver Stegle, EMBL-European Bioinformatics Institute, Germany

Short Abstract: Decoding the clonal substructures of somatic tissues sheds light on cell growth, development, and differentiation in health, ageing, and disease. However, approaches to systematically characterize phenotypic and functional variation between individual clones are not established. Here we present cardelino (https://github.com/PMBio/cardelino), a Bayesian method for inferring the clonal tree configuration and the identity of individual cells by modelling the expressed variants in single-cell RNA-seq (scRNA-seq) data. Critically, cardelino can integrate a clonal tree configuration derived from external data, e.g., bulk DNA sequencing, and adapt it to scRNA-seq observations. Simulations validate the accuracy of our model and its robustness to errors in the guide clone configuration. We applied cardelino to 32 human dermal fibroblast lines, identifying hundreds of differentially expressed genes between cells from different somatic clones. Additionally, a variant of cardelino with an efficient variational inference algorithm (named Vireo) solves a similar problem in the deconvolution of multiplexed scRNA-seq data by inferring genotypes and clustering cells, and hence does not require genotype information for the pooled samples. Taken together, our method suite allows us to identify molecular signatures that differ between clonal cell populations and to demultiplex pooled scRNA-seq data across a variety of experimental designs and platforms.
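The variant-based cell-to-clone assignment can be illustrated with a much-simplified binomial likelihood. The full model additionally handles allelic imbalance, dropout, and uncertainty in the guide tree; everything below is a hypothetical sketch, not cardelino's implementation:

```python
from math import comb, log

def clone_log_likelihood(alt, depth, genotype, err=0.01):
    """Log-likelihood of one cell's alt-read counts under one clone.

    alt, depth: per-variant alternate and total read counts.
    genotype: per-variant 0/1 indicator of whether the clone carries the SNV.
    A heterozygous variant is expected near 0.5 allele fraction; an absent
    variant yields alt reads only at the error rate.
    """
    ll = 0.0
    for a, d, g in zip(alt, depth, genotype):
        p = 0.5 if g else err
        ll += log(comb(d, a)) + a * log(p) + (d - a) * log(1 - p)
    return ll

def assign_clone(alt, depth, clone_genotypes):
    # Pick the clone maximizing the likelihood of the cell's variant reads.
    return max(range(len(clone_genotypes)),
               key=lambda c: clone_log_likelihood(alt, depth, clone_genotypes[c]))
```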

D-046: Genotyping structural variations using long-read data
  • Lolita Lecompte, INRIA, France
  • Pierre Peterlongo, INRIA, France
  • Dominique Lavenier, CNRS, France
  • Claire Lemaitre, INRIA, France

Short Abstract: Studies of structural variants (SVs) are expanding rapidly. As a result, and thanks to third-generation sequencing technologies, more and more SVs are being discovered, especially in the human genome. At the same time, for several applications such as clinical diagnosis, it is becoming important to genotype newly sequenced individuals on well-defined and characterized SVs. Whereas many SV genotypers have been developed for short-read data, there is still no approach to assess whether a given SV is present or absent in a newly sequenced sample of long reads from third-generation sequencing technologies. We present a novel method to genotype known SVs from long-read sequencing data. The principle of our method is to generate a set of reference sequences representing the two alleles of each SV. After mapping the long reads to these reference sequences, alignments are analyzed and filtered to keep only informative ones, in order to quantify and estimate the presence of each allele. Tests on simulated long reads based on 1,000 deletions from dbVar show a precision of 95.8%. We also applied the method to the whole human genome NA12878.
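Constructing the two allele-representing reference sequences for a known deletion can be sketched as follows; the flank size and function name are illustrative assumptions, not details of the published method:

```python
def deletion_allele_refs(genome, del_start, del_end, flank=50):
    """Build the two reference sequences representing the alleles of a known
    deletion: the original (ref) allele and the sequence with the deleted
    segment removed (alt), each with flanking context for read mapping."""
    lo = max(0, del_start - flank)
    hi = min(len(genome), del_end + flank)
    ref_allele = genome[lo:hi]
    alt_allele = genome[lo:del_start] + genome[del_end:hi]
    return ref_allele, alt_allele
```

Genotypes then follow from the relative numbers of informative long-read alignments supporting each allele sequence.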

D-047: A comprehensive benchmark of somatic CNV calling tools on short-read WGS and WES data
  • Ana Popic, Seven Bridges Genomics, Serbia
  • Vladimir Tomic, Seven Bridges Genomics, Serbia
  • Vojislav Varjacic, Seven Bridges Genomics, Serbia
  • Jelena Randjelovic, Seven Bridges Genomics, Serbia
  • Svetozar Nesic, Seven Bridges Genomics, Serbia

Short Abstract: Copy-number variation (CNV) plays an important role in cancer development since activation of an oncogene may be linked to a genomic copy-number amplification and inactivation of a tumor suppressor gene can be caused by a deletion. Benchmarking of CNV callers is hampered by the lack of appropriate truth sets, necessitating the use of simulated data, especially in studies focusing on somatic CNVs. In this study, we have (i) tested multiple commonly used CNV data simulators, (ii) used them to create several synthetic CNV datasets, and (iii) benchmarked seven CNV callers using both synthetic and real data. Seven CNV callers (CNVkit, Control-FREEC, Sequenza, FACETS, PureCN, CNVnator and GATK4 CNV) were benchmarked under three scenarios which included both whole genome sequencing (WGS) and whole exome sequencing (WES) data: matched tumor-normal sample pairs, tumor sample against a panel of normal samples and tumor-only sample CNV calling. For each dataset two complementary evaluation methods were applied, resulting in two distinct, but positively correlated scores. Analysis of concordance between calls made by top performing tools shows potential for consensus calling which would generate high-confidence calls, as well as mark regions where different callers generate contradictory calls as high complexity regions.

D-048: Pan-genome mapping and pairwise SNP-distance improve detection of Mycobacterium tuberculosis transmission clusters
  • Christine Jandrasits, Robert Koch-Institut, Germany
  • Stefan Kröger, Robert Koch Institute, Germany
  • Walter Haas, Robert Koch Institute, Germany
  • Bernhard Y. Renard, Robert Koch Institute, Germany

Short Abstract: With about 10 million new infections and 1.7 million deaths per year, tuberculosis is a major threat to global health. It is essential to detect and interrupt transmissions to stop the spread of this disease. Next-generation-sequencing-based base-by-base distance measures have become an integral complement to the epidemiological investigation of infectious disease outbreaks. The mutation rate of M. tuberculosis is very low, and many existing methods for comparative analysis provide inadequate results for such stable genomes, as their resolution is too limited. This study introduces PANPASCO, a computational pan-genome-mapping-based pairwise distance method that considers all genomic differences between cases, even those located in regions of lineage-specific reference genomes. We show that our approach is superior to previously published methods on several datasets and across different lineages, as its characteristics allow the comparison of a high number of diverse samples in one analysis - a scenario that becomes more and more likely with the increased usage of whole-genome sequencing in transmission surveillance.
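The base-by-base distance underlying such transmission clustering can be sketched as a masked pairwise SNP distance. This is an illustrative toy version only; PANPASCO additionally maps against a pan-genome and handles lineage-specific regions:

```python
def snp_distance(a, b):
    """Pairwise SNP distance: aligned positions where both samples have a
    confident call (not 'N') and the calls differ."""
    return sum(1 for x, y in zip(a, b)
               if x != "N" and y != "N" and x != y)

def distance_matrix(samples):
    """All pairwise distances for a dict of sample name -> call string."""
    names = list(samples)
    return {(n1, n2): snp_distance(samples[n1], samples[n2])
            for i, n1 in enumerate(names) for n2 in names[i + 1:]}
```

Pairs below a small SNP-distance threshold (M. tuberculosis mutates slowly) would then be linked into candidate transmission clusters.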

D-049: GraphAligner: Rapid and Versatile Sequence-to-Graph Alignment
  • Mikko Rautiainen, Max Planck Institute for Informatics, Germany
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Short Abstract: Background: Sequence graphs provide a natural way of expressing variation or uncertainty in a genome, or a collection of genomes. They can be used for diverse applications such as genome assembly, error correction, and SV genotyping. With the growing usage of graphs, methods for handling graphs efficiently are becoming more important. In particular, sequence alignment is one of the most fundamental operations in genome analysis and is used in many applications. Results: We present our tool GraphAligner for aligning long reads to genome graphs. Comparisons with existing tools show that our method is faster by one order of magnitude. To demonstrate the downstream benefits, we present a hybrid error correction pipeline based on aligning long reads to a de Bruijn graph, which achieves error rates up to one order of magnitude lower than competing tools and scales to high-coverage whole-genome mammalian datasets. Conclusion: As sequence alignment is one of the most fundamental operations in genome analysis, better alignment methods will produce many downstream benefits. GraphAligner aligns long reads to genome graphs faster than existing methods, enabling many use cases that were computationally infeasible before. GraphAligner is open source and available on bioconda.

D-050: Cogito: Automated and generic comparison of annotated genomic intervals
  • Annika Buerger, Universität Münster, Germany
  • Martin Dugas, Institute of Medical Informatics, University of Münster, Germany

Short Abstract: Genetic and epigenetic biological studies often combine different types of experiments. Raw and processed data are publicly available in several databases, but it is often difficult to get a systematic overview of a complex dataset without redoing the whole analysis. We developed the tool Cogito (Compare Annotated Genomic Intervals Tool) to provide a systematic first-pass analysis of datasets consisting of different data types. All data types representable as annotated genomic ranges, i.e., genomic positions with attached values such as scores, can be combined and are automatically processed in three major steps: preparation, summary, and comparison. After the data are aggregated to genes, they are summarized in several ways in the second step, accounting for the possibly different scales of the attached values. A list of candidate regions is produced in the final step. Cogito is implemented as an R package based on the Bioconductor package GRanges. The resulting HTML report gives users an overview and identifies where a closer look is worthwhile. We validated the results of our tool on two distinct datasets, one murine and one human, comprising in total approximately 150 samples, several technologies, and multiple conditions.

D-051: McSplicer: a probabilistic model for detecting local splicing variations from RNA-Seq data
  • Stefan Canzar, Ludwig Maximilian University of Munich, Germany
  • Israa Alqassem, Ludwig Maximilian University of Munich, Germany
  • Yash Kumar Sonthalia, Google Mountain View, United States
  • Heejung Shim, The University of Melbourne, Australia

Short Abstract: A gene is a region of DNA that consists of exons and introns. During transcription of DNA to RNA, introns are removed in a process called splicing, while exons remain in the mature RNA. Variations in splicing are commonplace; previous research shows that more than 90% of human genes undergo alternative splicing (AS). Splicing is an essential process in eukaryotic cell development and has been linked to various diseases. We propose McSplicer, a probabilistic model for detecting alternative splicing. McSplicer identifies potential splice sites from given input exon boundaries, which can be obtained from annotation databases or estimated from RNA-seq reads using a third-party tool such as StringTie. These splice sites divide a gene into segments of varying width. For each segment, we introduce a hidden variable indicating whether that segment is part of an isoform. Our model assumes that the sequence of these hidden variables follows an inhomogeneous Markov chain, whose parameters are interpreted as splice site usage. We use an expectation-maximization (EM) method enhanced with dynamic programming algorithms to efficiently estimate splice site usage. We then label AS events from the generated annotation using the Astalavista toolbox.
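The inhomogeneous two-state Markov chain over segments can be illustrated with a minimal forward pass, where each segment boundary has its own transition matrix (playing the role of the usage parameters). This is a toy sketch under those assumptions, not McSplicer's EM implementation:

```python
def forward_prob(transitions, start=(0.5, 0.5)):
    """Forward pass over a two-state (segment out/in) inhomogeneous Markov
    chain. `transitions` is a list of 2x2 row-stochastic matrices, one per
    segment boundary, so each boundary can carry its own usage probabilities.
    Returns the marginal probability of being 'in' after the last boundary."""
    alpha = list(start)
    for t in transitions:
        alpha = [alpha[0] * t[0][0] + alpha[1] * t[1][0],
                 alpha[0] * t[0][1] + alpha[1] * t[1][1]]
    # Normalize for safety; with proper stochastic matrices the sum is 1.
    total = alpha[0] + alpha[1]
    return alpha[1] / total
```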

D-052: Effects of aging on circadian transcriptional dynamics in mice
  • Antonios Papadakis, Cluster of Excellence - Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany
  • Jonatan Gabre, Cluster of Excellence - Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany
  • Fabian Braun, University Medical Center Hamburg - Eppendorf, Germany
  • Marc Johnsen, Cluster of Excellence: Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany
  • Cedric Debes, Cluster of Excellence - Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany
  • Paul Thaben, Institute for Theoretical Biology, Charité – Universitätsmedizin, Germany
  • Pal Westermark, Leibniz-Institut für Nutztierbiologie (FBN), Germany
  • Roman-Ulrich Müller, Kidney Research Center Cologne, Germany
  • Andreas Beyer, Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany

Short Abstract: Organisms use internal timing systems to adapt to daily changes in the environment. These endogenous oscillators, known as circadian clocks, drive daily rhythms in behaviour and physiology by regulating molecular oscillations important for critical metabolic and signaling pathways. Aging disrupts these clocks, and such disruptions are highly associated with adverse effects on health and longevity. In this study, we focus on the diurnal transcriptome of kidney tissue in young and old mice, collected every three hours over two days. Circadian oscillations can be observed in gene expression and alternative splicing, involving a significant fraction of all protein-coding genes. Circadian regulation also seems to extend to transcriptional kinetics, as the estimated speed of RNA Pol-II itself cycles in hundreds of transcripts. Furthermore, there are significant age-associated changes in all observed types of diurnal rhythmicity. Of special interest are changes in the alternative splicing of core regulatory elements of the circadian clock. On the whole, our data shed new light on the transcriptional dynamics of rhythmic expression in the kidney and their possible role in aging-related dysfunctions.

D-053: Inference of DNA copy number alterations from single-cell RNA-sequencing data using transcriptional regulatory networks
  • Ronja Johnen, CECAD, Germany
  • Luise Nagel, CECAD, Germany
  • Manuel Lentzen, CECAD, Germany
  • Ana Carolina Leote, Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany
  • Andreas Beyer, Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany

Short Abstract: Somatic copy number alterations (CNAs) are an important contributor to diseases such as cancer and potentially contribute to the emergence of age-associated phenotypes such as accumulating senescent cells. The detection of rarely occurring CNAs requires single-cell measurements. Here, we present a comparison of single-cell CNA detection methods using single-cell RNA-sequencing (scRNA-seq) data. scRNA-seq data are often more readily available than single-cell DNA measurements; however, expression variation complicates the detection of CNAs from RNA measurements. To address this issue, we developed a new method based on a genome-wide transcriptional regulatory network, which enables us to more accurately detect altered gene expression due to CNAs. The relative copy number state of a gene is estimated from the deviation between its measured expression and the expression predicted from genes in the relevant regulatory subnetwork. Using a large reference dataset based on different scRNA-seq methods, we show that our network approach outperforms published methods in identifying CNAs.

D-054: Descendant Cell Fraction: Copy-aware Inference of Clonal Composition and Evolution in Cancer
  • Gryte Satas, Princeton University, United States
  • Simone Zaccaria, Princeton University, Italy
  • Ben Raphael, Princeton University, United States
  • Mohammed El-Kebir, University of Illinois at Urbana-Champaign, United States

Short Abstract: A tumor results from an evolutionary process, giving rise to distinct clones distinguished by somatic mutations, including single-nucleotide variants (SNVs) and copy-number aberrations (CNAs). The standard approach to identifying such clones is to cluster SNVs that have similar cancer cell fractions (CCFs), i.e., the proportion of tumor cells harboring the mutation. The key assumption is that SNVs with similar CCFs occurred on the same phylogenetic branch. This approach has two key deficiencies: (1) the CCF cannot be unambiguously inferred from DNA sequencing data; (2) the CCF does not account for the loss of mutations, which is common in tumors with CNAs. To address these deficiencies, we define a novel quantity, the descendant cell fraction (DCF), which is a summary statistic for both the prevalence and the evolutionary history of an SNV. We introduce DeCiFer, an algorithm that simultaneously infers the evolutionary histories of individual SNVs and clusters SNVs by their corresponding DCFs under the principle of parsimony. On simulated data, we show that DeCiFer clusters SNVs more accurately than existing methods. On a metastatic prostate cancer dataset, we show that DeCiFer yields more parsimonious evolutionary and migration histories. Thus, DeCiFer enables more accurate quantification of intra-tumor heterogeneity and improves inference of tumor evolution.
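The ambiguity motivating DCFs is already visible in the standard CCF conversion, sketched below: the same variant allele fraction (VAF) maps to different CCFs depending on the assumed mutation multiplicity. The formula is the common textbook conversion shown for illustration; DeCiFer's DCF additionally models mutation loss:

```python
def ccf_from_vaf(vaf, purity, tumor_cn=2, normal_cn=2, multiplicity=1):
    """Cancer cell fraction implied by a VAF under one fixed mutation
    multiplicity. Several multiplicities can explain the same VAF, which
    is one source of CCF ambiguity."""
    # Average copy number at the locus across tumor and admixed normal cells.
    avg_cn = purity * tumor_cn + (1 - purity) * normal_cn
    return vaf * avg_cn / (purity * multiplicity)
```

For example, at 50% purity in a diploid region, a clonal heterozygous SNV is expected at VAF 0.25, which the formula maps back to CCF 1.0; assuming multiplicity 2 instead would halve the inferred CCF.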

D-055: Fast and accurate bisulfite alignment and methylation calling for mammalian genomes
  • Jonas Fischer, Max Planck Institute for Informatics, Germany
  • Marcel Schulz, Goethe University Frankfurt, Germany

Short Abstract: Assessment of DNA CpG methylation (CpGm) values via whole-genome bisulfite sequencing (WGBS) is computationally demanding. We present FAst MEthylation calling (FAME), the first approach to quantify CpGm values directly from WGBS reads using efficient data structures. FAME is incredibly fast but as accurate as standard methods, which first produce BS alignment files before computing CpGm values, thus solving the current WGBS analysis bottleneck for large-scale datasets without compromising accuracy.
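The methylation call itself reduces to a simple ratio once reads are placed: bisulfite converts unmethylated cytosines to thymines, so the level at a CpG cytosine is the C fraction among C/T read bases. This toy sketch shows only the quantity being computed, not FAME's data structures or implementation:

```python
def cpg_methylation(pileup_bases):
    """Methylation level at a CpG cytosine from the read bases covering it:
    beta = C / (C + T). Other bases (sequencing errors, SNVs) are ignored."""
    c = pileup_bases.count("C")
    t = pileup_bases.count("T")
    return c / (c + t) if c + t else float("nan")
```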

D-056: ImmunoPepper: Generating Neoepitopes from RNA-Seq data
  • Matthias Hüser, ETH Zurich, Switzerland
  • Jiayu Chen, ETH Zurich, Switzerland
  • Andre Kahles, ETH Zurich, Switzerland

Short Abstract: RNA-Seq is often used as a proxy to inform on the state of a cell's proteome. However, predicting the set of expressed transcripts from shotgun sequencing data is inherently hard. For some applications, it is not necessary to generate full protein isoforms, and one is only interested in local proteome variability. This is especially relevant in the context of personalized cancer therapy, when predicting the immunogenicity of peptide fragments sampled from the proteome. We present ImmunoPepper, a software tool that generates the set of all plausible peptides from a splicing graph derived from a given RNA-Seq sample. The generated peptide set can be personalized with germline and somatic variants and takes un-annotated introns into account. To facilitate analysis with standardized tools for MHC binding prediction, we provide output as unique k-mer sets over all generated peptides, with typical k-mer lengths ranging from 8 to 22. We demonstrate the versatility of ImmunoPepper with applications to a set of 63 cancer samples from TCGA, contrasted with GTEx, and the analysis of 5 mouse tumor samples compared to more than 300 background samples taken from mouse reference sets. In both cases, we demonstrate the existence of sample-specific (tumor-specific) splicing-derived peptides.
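The final output step, emitting unique peptide k-mers for MHC binding predictors, can be sketched as follows (illustrative only; generating the peptides themselves from the splicing graph is the harder part and is not shown):

```python
def peptide_kmers(peptides, k=9):
    """Unique k-mer set over a collection of peptide strings, in the form
    typically fed to MHC binding predictors (k usually between 8 and 22;
    9 shown here)."""
    out = set()
    for pep in peptides:
        for i in range(len(pep) - k + 1):
            out.add(pep[i:i + k])
    return out
```

Tumor-specific candidates would then be k-mers present in the tumor set but absent from the background (e.g. GTEx-derived) set.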

D-057: Deciphering the transcriptional modulation program of trabectedin in sensitive and resistant patient derived xenograft models of myxoid liposarcoma
  • Laura Mannarino, IRCCS Istituto di Ricerche Farmacologiche Mario Negri, Italy
  • Ilaria Craparotta, IRCCS Istituto di Ricerche Farmacologiche Mario Negri, Italy
  • Sara Ballabio, IRCCS Istituto di Ricerche Farmacologiche Mario Negri, Italy
  • Roberta Frapolli, IRCCS Istituto di Ricerche Farmacologiche Mario Negri, Italy
  • Enrica Calura, University of Padova, Italy
  • Sergio Marchini, IRCCS Istituto di Ricerche Farmacologiche Mario Negri, Italy
  • Maurizio D'Incalci, IRCCS Istituto di Ricerche Farmacologiche Mario Negri, Italy

Short Abstract: Background: FUS-DDIT3 prevents adipocyte differentiation in myxoid liposarcoma (MLS). Evidence suggests that trabectedin restores differentiation in MLS; however, the mechanisms underlying its antitumor activity are unknown. Since trabectedin acts as a transcriptional regulator, we used RNA-Seq to analyze transcriptomic modulation in two patient-derived xenograft models, either sensitive (ML017) or resistant (ML017/ET) to trabectedin. Methods: Samples from ML017 and ML017/ET were collected at 24 and 72 hours after the first dose and 15 days after the third dose of trabectedin. Alignment of RNA-Seq data was performed with HISAT2 (Kim, 2015), gene expression was quantified with Salmon (Patro, 2017), and differential expression analysis was performed with DESeq2 (Love, 2014). Results: In ML017 at 24 and 72 hours, most genes were involved in the TP53 pathway; however, the major effect was at 15 days, with the activation of extracellular matrix organization and collagen formation. In contrast, ML017/ET responded only at 24 hours, with the activation of DNA repair. Conclusions: At the late timepoint in ML017, trabectedin activates remodeling of the extracellular matrix and collagen formation, suggesting a change in the phenotype of the tumor cells, in line with the ability of trabectedin to restore differentiation. In ML017/ET, by contrast, the treatment has no such effects.

D-058: Scalable Annotated Genome Sequence Graphs
  • Andre Kahles, ETH Zurich, Switzerland
  • Harun Mustafa, ETH Zurich, Switzerland
  • Mikhail Karasikov, ETH Zurich, Switzerland
  • Gunnar Rätsch, ETH Zurich, Switzerland

Short Abstract: Advances in DNA sequencing technology have led to massive growth in the amount of available high-throughput sequencing data. However, the lack of standard approaches for optimal representation and indexing of sequencing data at the petabyte scale severely limits large-scale genomic analysis efforts. We present Metagraph, a framework designed for the construction and query of annotated de Bruijn graphs from petabase-scale datasets, allowing for sequence classification and reference-free comparison of samples. To demonstrate the utility of Metagraph on datasets of highly variable sequences, we have indexed a diverse range of datasets, including human gut microbiome, public transit surface microbiomes (MetaSUB), reference genomes, Genotype-Tissue Expression (GTEx), and whole exome sequencing data from The Cancer Genome Atlas (TCGA). Applying only moderate filtration on the input sequences, a Metagraph index typically requires orders-of-magnitude less storage than the original gzip-compressed inputs. As an online service, we have deployed our Metagraph MetaSUB index with a web interface to query and visualize detected matches. We envision Metagraph to not only provide a scalable framework for indexing highly diverse sequence databases, but also to be used as a powerful tool that allows researchers to perform large-scale comparative analysis in genomics and medicine on typical academic compute clusters.

D-059: Network-based imputation of dropouts in single-cell RNA sequencing data
  • Ana Carolina Leote, Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany
  • Xiaohui Wu, Department of Automation, Xiamen University, China
  • Andreas Beyer, Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases (CECAD), Germany

Short Abstract: Single-cell RNA sequencing (scRNA-seq) methods are typically unable to quantify the expression levels of all genes in a cell, creating a need for the computational prediction of missing values (‘dropout imputation’). Most existing dropout imputation methods are limited in the sense that they exclusively use the scRNA-seq dataset at hand and do not exploit external gene-gene relationship information. Here, we show that a gene regulatory network learned from external, independent gene expression data improves dropout imputation. Using a variety of human scRNA-seq datasets we demonstrate that our network-based approach outperforms published state-of-the-art methods, performing particularly well for lowly expressed genes, including cell-type-specific transcriptional regulators. Additionally, we tested a baseline approach, where we imputed missing values using the sample-wide average expression of a gene. Unexpectedly, 52% to 77% of the genes were better predicted using this baseline approach, suggesting negligible cell-to-cell variation of expression levels for many genes. Our work shows that there is no single best imputation method; rather, the best method depends on gene-specific features, such as expression level and variation across cells. We thus implemented an R-package, ADImpute (available from http://cellnet.cecad.uni-koeln.de/adimpute), that determines the best imputation method for each gene in a dataset.
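The baseline mentioned above, imputing a dropout with the gene's sample-wide average expression, can be sketched in a few lines (a minimal sketch that treats zeros as dropouts and averages over non-zero cells; ADImpute's own definitions may differ):

```python
import numpy as np

def baseline_impute(counts: np.ndarray) -> np.ndarray:
    """Impute zeros (treated as dropouts) in a genes-x-cells expression
    matrix with each gene's sample-wide mean over non-zero cells."""
    imputed = counts.astype(float)
    for g in range(counts.shape[0]):
        row = counts[g]
        observed = row[row > 0]
        if observed.size:              # genes never observed stay at zero
            imputed[g, row == 0] = observed.mean()
    return imputed

# toy example: gene 0 has a dropout in cell 1, gene 1 is never observed
m = np.array([[4.0, 0.0, 2.0],
              [0.0, 0.0, 0.0]])
print(baseline_impute(m))  # gene 0's dropout becomes the mean of 4 and 2
```

A gene whose expression varies little across cells is predicted well by this constant, which is exactly the effect the abstract reports for 52% to 77% of genes.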

D-060: Hecaton: a machine learning approach to reliably detect copy number variation in plant genomes
  • Raúl Wijfjes, Wageningen University, Netherlands
  • Sandra Smit, Wageningen University, Netherlands

Short Abstract: Copy number variation (CNV) is thought to actively contribute to adaptive evolution of plant species. While there are many computational algorithms available to detect copy number variation from short read sequencing datasets, the performance of these tools is unsatisfactory when applied to plants. Rather than developing a new algorithm from the ground up, it may be more practical to aggregate and filter the predictions of existing callers using a machine learning model trained on plant data. We developed Hecaton, a computational framework that combines custom post-processing and machine learning to accurately detect CNV in plants. Hecaton corrects dispersed duplications that are erroneously represented by several CNV detection tools as overlapping deletions and tandem duplications. Moreover, the machine learning model employed by Hecaton attains a better combination of sensitivity and precision compared to current state-of-the-art methods when applied to low coverage samples of A. thaliana and rice. In conclusion, Hecaton provides a robust method to detect CNV in plants, demonstrating the benefits of using machine learning to optimize CNV calling in non-human species.

D-061: High-throughput benchmarking of engineered CRISPR-Cas nucleases
  • John Hawkins, The University of Texas at Austin, United States
  • Stephen Jones, The University of Texas at Austin, United States
  • James Rybarski, The University of Texas at Austin, United States
  • Nicole Johnson, The University of Texas at Austin, United States
  • Janice Chen, University of California, Berkeley, United States
  • William Press, The University of Texas at Austin, United States
  • Jennifer Doudna, University of California, Berkeley, United States
  • Ilya Finkelstein, The University of Texas at Austin, United States

Short Abstract: I will present a new NGS-based platform that measures the cleavage and binding specificity of natural and engineered CRISPR-Cas nucleases. Our new nuclease sequencing pipeline (NucleaSeq) exhaustively measures cleavage kinetics and captures the time-resolved identities of cleaved products for a large library of partially gRNA-matched DNAs. The same DNA library is used to measure the binding specificity of each enzyme via the chip-hybridized association mapping platform (CHAMP). Coupling NucleaSeq and CHAMP, we benchmarked the cleavage and binding specificities of Cas12a and four SpCas9 variants for 105 DNAs containing mismatches, insertions, and deletions. Engineered Cas9s dramatically increase cleavage specificity, but provide minimal improvement to overall binding specificity, with Cas9-HF1 performing best. In contrast, Cas12a strongly discriminates along the whole hybridized sequence during binding and cleavage. Surprisingly, both Cas9 and Cas12a produce variable ssDNA overhangs. Initial cleavage position and subsequent end-trimming vary with the nuclease, gRNA sequence, and position and base identity of modified target DNAs. By programming mismatches between gRNA and target DNA, these nucleases can generate incompatible DNA ends without slowing cleavage, ultimately biasing cellular repair outcomes. More broadly, NucleaSeq and CHAMP enable rapid, quantitative, and systematic comparison of the specificities and cleavage products of engineered and natural CRISPR-Cas nucleases.

D-062: SPARSim Single Cell: a count data simulator for scRNA-seq data
  • Giacomo Baruzzo, University of Padova, Italy
  • Ilaria Patuzzi, Istituto Zooprofilattico Sperimentale delle Venezie, Italy
  • Barbara Di Camillo, University of Padova, Italy

Short Abstract: Background: Single cell RNA-seq (scRNA-seq) count data differ in many ways from bulk RNA-seq count data, making the application of bulk RNA-seq preprocessing/analysis methods not straightforward or even inappropriate. For this reason, the development of new methods for scRNA-seq count data is currently one of the most active research fields in bioinformatics. However, method performance is often variable across datasets, making the definition of a standardized analysis pipeline an open issue. To help develop more robust methods and assess existing tools, the availability of simulated data could play a pivotal role. Unfortunately, only a few scRNA-seq count data simulators are available, often showing poor or undemonstrated similarity to real data. Results: We present SPARSim, a scRNA-seq count data simulator based on a Gamma-Multivariate Hypergeometric model. We demonstrate that SPARSim generates count data that resemble real data in terms of count intensity, variability, and sparsity, performing comparably to or better than one of the most widely used scRNA-seq simulators, Splat. In particular, SPARSim-simulated data closely resemble the distribution of zeros across different expression intensities observed in real count data. Conclusions: The SPARSim simulator could boost the development and assessment of scRNA-seq bioinformatics methods.
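A minimal sketch of the kind of Gamma-Multivariate Hypergeometric sampling the abstract describes (the parameter names, the rounding step, and the fixed sequencing depth are illustrative assumptions, not SPARSim's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_cell(shape, scale, n_genes, depth, rng):
    """Sketch of a Gamma-Multivariate Hypergeometric count simulator:
    1. per-gene expression intensities are drawn from a Gamma distribution;
    2. intensities define an integer 'transcript pool' per gene;
    3. sequencing to a fixed depth is modelled as drawing reads without
       replacement (multivariate hypergeometric), which naturally yields
       zeros for lowly expressed genes."""
    intensity = rng.gamma(shape, scale, size=n_genes)
    pool = np.round(intensity).astype(np.int64)   # transcripts per gene
    depth = min(depth, int(pool.sum()))           # cannot exceed the pool
    return rng.multivariate_hypergeometric(pool, depth)

counts = simulate_cell(shape=0.5, scale=20.0, n_genes=1000, depth=5000, rng=rng)
print(counts.sum(), (counts == 0).mean())         # total depth and sparsity
```

Sampling without replacement from a finite pool, rather than Poisson/negative-binomial sampling, is what couples the zero fraction to expression intensity in this kind of model.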

D-063: Haplotype Threading: Accurate Polyploid Phasing from Long Reads
  • Sven Schrinner, Heinrich Heine University Düsseldorf, Germany
  • Rebecca Serra Mari, Center for Bioinformatics, Saarland University, Saarbrücken; Graduate School of Computer Science, Saarbrücken, Germany
  • Jana Ebler, Center for Bioinformatics, Saarland University; Graduate School of Computer Science; MPI for Informatics, Saarbrücken, Germany
  • Gunnar W. Klau, Heinrich Heine University Düsseldorf; Cluster of Excellence on Plant Sciences (CEPLAS), Düsseldorf, Germany
  • Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany

Short Abstract: The genome of many plant species, including important food crops, is polyploid. Resolving genomes at the haplotype level is crucial for understanding the evolutionary history of polyploid species and for designing advanced breeding strategies. While phasing diploid genomes using long reads has become a routine step, polyploid phasing still presents considerable challenges. The Minimum Error Correction (MEC) model, the most common and successful formalization for diploid phasing, is of limited use for polyploid phasing since it does not address regions where two or more haplotypes are identical. In addition, dynamic programming techniques solving diploid MEC become infeasible in the polyploid case. Here, we present a method for accurate polyploid phasing that overcomes these challenges by departing from the MEC model. We propose a novel two-stage approach based on (i) clustering reads using a position-dependent scoring function and (ii) threading the haplotypes through the resulting clusters by dynamic programming. We demonstrate that our method scales to whole chromosomes and results in more accurate haplotypes than those computed by the state-of-the-art tool H-PoP. Our algorithm is implemented as part of the widely used open source tool WhatsHap and is hence ready to be included in production settings.

D-064: An automated pipeline for detecting and annotating mitochondrial variants from paired tumor-normal samples
  • Catherine Welsh, Rhodes College, United States
  • Kelly McCastlain, St. Jude Children's Research Hospital, United States
  • Mondira Kundu, St. Jude Children's Research Hospital, United States

Short Abstract: Next-generation sequencing techniques paired with bioinformatics pipelines are necessary to elucidate the causes and functional consequences of mitochondrial DNA (mtDNA) variants. A number of recent tools have been created to detect variants in mtDNA; however, none have been built specifically to find somatic variants using paired tumor-normal samples. As a result, recent investigations of somatic mtDNA mutations have used tools built for nuclear genomes. We therefore introduce a bioinformatics pipeline built specifically for detecting and annotating mtDNA variants from paired tumor-normal samples. In addition to somatic mutations, we also report inherited mutations, shared heteroplasmy, and germline-specific mutations. Our tool uses a double-alignment algorithm to best handle the circular nature of the mitochondrial genome. We have applied this pipeline to over 600 paired tumor-normal samples from the Pediatric Cancer Genome Project, and verified its accuracy using a Rho-0 +/- experiment for quality control, as well as RNA sequencing data on a subset of the samples.

D-065: Methods to detect large indels and tandem duplication in acute myeloid leukemia using single cell DNA sequencing
  • Sombeet Sahu, Mission Bio, United States
  • Manimozhi Manivannan, Mission Bio, United States
  • Shu Wang, Mission Bio, United States
  • Dong Kim, Mission Bio, United States
  • Saurabh Gulati, Mission Bio, United States
  • Nianzhen Li, Mission Bio, United States
  • Adam Sciambi, Mission Bio, United States
  • Nigel Beard, Mission Bio, United States

Short Abstract: Background: Single-cell DNA sequencing technologies such as the Tapestri platform enable us to understand the clonal heterogeneity of AML patient samples. Here we present an algorithm to identify large indels and reduce false positives, in order to accurately measure clonal heterogeneity and enable precision diagnostics. Methods: The Tapestri analytical workflow was used to pre-process and map the reads. We use a soft-clip-based approach to detect the internal tandem duplications (ITDs) found in the FLT3 gene. We first identify candidate ITD size bins from the frequency peaks of all called ITD variants and group the individual variants that fall within 20 bp of a frequency peak into their respective bins. We project the ITD sequence strings within a bin onto a Levenshtein distance space and calculate the median distance between all strings. We then use the string with the median distance to collapse the ITDs to a consensus sequence and report it in the VCF file. Results: Using this method, we were able to accurately identify the ITDs and reproduce the true positive clones for the sample. We are currently optimizing this approach using different samples with a wide range of known ITDs.
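The bin-collapsing step described in the Methods can be sketched as follows (a hypothetical reading in which the "string with the median distance" is the medoid of the bin, i.e. the call minimizing its median Levenshtein distance to the others; the actual Tapestri pipeline may define this differently):

```python
from statistics import median

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def consensus_itd(itds: list[str]) -> str:
    """Collapse one ITD size bin to a single consensus: the call whose
    median Levenshtein distance to all calls in the bin is smallest."""
    return min(itds, key=lambda s: median(levenshtein(s, t) for t in itds))

bin_calls = ["ACGTACGT", "ACGTACGA", "ACGAACGT", "TTTTTTTT"]
print(consensus_itd(bin_calls))  # -> 'ACGTACGT'
```

Using a medoid rather than a base-wise majority vote avoids having to align the calls first, which matters when they differ by indels rather than substitutions.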

D-066: Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data
  • Katharina Jahn, ETH Zurich, Switzerland
  • Cenk Sahinalp, Indiana University Bloomington, United States
  • Jack Kuipers, Computational Biology Group, D-BSSE, ETH, Switzerland
  • Niko Beerenwinkel, ETH Zurich, Switzerland
  • Salem Malikic, Simon Fraser University, Canada

Short Abstract: Understanding the evolutionary history and subclonal composition of a tumour represents one of the key challenges in overcoming treatment failure due to resistant cell populations. Most of the current cancer genomics data is short-read bulk sequencing data. While this type of data is characterised by low sequencing noise and cost, it consists of aggregate measurements across a large number of cells. It is therefore of limited use for the accurate detection of the distinct cellular populations present in a tumour and the unambiguous inference of their evolutionary relationships. Single-cell DNA sequencing instead provides data of the highest resolution for studying intra-tumour heterogeneity and evolution, but is characterised by higher sequencing costs and elevated noise rates. We developed B-SCITE, the first computational approach that infers tumour phylogenies from combined single-cell and bulk sequencing data. Using a comprehensive set of simulated data, we show that B-SCITE systematically outperforms existing methods with respect to tree reconstruction accuracy and subclone identification. B-SCITE provides high-fidelity reconstructions even with a modest number of single cells and in cases where bulk allele frequencies are affected by copy number changes. On real tumour data, B-SCITE generated mutation histories that show high concordance with expert generated trees.

D-067: Making RNA-seq analysis results more reproducible
  • Agata Muszyńska, Małopolska Centre of Biotechnology UJ, Kraków, Poland, Poland
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology of Jagiellonian University, Poland

Short Abstract: The reproducibility crisis is a widely known problem in science. One field that requires improvement is expression profiling by RNA-seq, which was addressed by the SEQC project [1]. In that work, however, all tools were tested on reference UHRR and HBRR samples which, although dedicated to benchmarking, do not provide realistic signal changes. Here, following the SEQC recommendations [1][2], we investigated how reproducibility among experiments can be improved using a method that identifies and removes hidden factors from the data (svaseq). This approach was tested on RNA-seq data from sequencing mouse tissue. The experiment was performed in three batches, each with 4 samples from mice with neuropathy and 4 controls. Preliminary results show that without factor analysis there are almost no genes with a statistically significant change in expression, although other studies of this neuropathy would lead us to expect a rather strong signal. Applying svaseq removes unwanted variation and enhances the signal: the number of detected genes increases to several thousand. The signal, however, is not stable across the experiments; of all the genes detected, only about 8% are common to all three batches. Still, factor analysis is a promising approach. The work has been co-financed by the European Union through the European Social Fund (grant POWR.03.02.00-00-I029).

D-068: Critical Need to Improve the Quality of the Rat Reference Genome
  • A. Bilge Ozel, University of Michigan - Ann Arbor, United States
  • Shweta Ramdas, University of Michigan - Ann Arbor, United States
  • Yanchao Pan, University of Michigan - Ann Arbor, United States
  • Hao Chen, University of Tennessee Health Science Center, United States
  • Mary K. Treutelaar, University of Michigan - Ann Arbor, United States
  • Katie Holl, Medical College of Wisconsin, United States
  • Myrna Mandel, National Institutes of Health, United States
  • Robert Williams, University of Tennessee Health Science Center, United States
  • Huda Akil, University of Michigan - Ann Arbor, United States
  • Leah C. Solberg Woods, Wake Forest School of Medicine, United States
  • Jun Z. Li, University of Michigan - Ann Arbor, United States

Short Abstract: The common rat has been an important model organism in biomedical research. Currently, the rat reference genome has ~100 times more fragments than the mouse and human reference genomes. We performed whole-genome sequencing of the eight inbred founders of the NIH Heterogeneous Stock. Joint variant calling identified ~16 million variants, with heterozygous call frequencies ranging from 10.6% (BN) to 15.0% (M520), higher than expected for inbred lines. Interestingly, these sites disproportionately reside in 300+ segments, spanning ~9% of the genomes. These segments are shared in most or all strains, show higher read depths than average, and contain higher rates of tri-allelic sites. These findings suggest that the highly heterozygous regions are likely a result of mis-assembly of the current reference genome, where two or more highly repetitive segments were folded incorrectly into the same interval. This motivated the establishment of the multi-institute International Rat Omics Consortium to apply multiple technologies (PacBio, 10X Genomics, Bionano, Hi-C) to create whole-genome sequencing maps for 80-100 inbred lines and finish them to reference quality. The goal is to improve the knowledge foundation of rat genome informatics for most of the strains commonly used in genetic and functional studies.

D-069: Intra-bin structural variant segmentation for whole-genome sequencing data using U-net
  • Yao-Zhong Zhang, The University of Tokyo, Japan
  • Seiya Imoto, The University of Tokyo, Japan
  • Satoru Miyano, Human Genome Center, the Institute of Medical Science, University of Tokyo, Japan
  • Rui Yamaguchi, Aichi Cancer Center Research Institute, Japan

Short Abstract: Structural variants (SVs) are complex genomic alterations affecting regions larger than 50 nucleotides. With advances in sequencing technologies, more precise profiling of SVs has become available, yet determining SVs at nucleotide resolution with both high accuracy and high coverage for a single sample remains challenging. In this work, we propose a novel nucleotide-resolution SV detection method that determines SV fragments inside a bin using only base-pair read-depth (RD) information. In contrast to traditional RD-based SV detectors, which usually group and smooth RD signals at the scale of a bin, we use a U-net model to directly process base-pair RD signals inside a bin and learn feature representations for intra-bin SV segmentation. We performed a systematic evaluation of the method using WGS data from the 1000 Genomes Project. In both single-sample and cross-sample experiments, the U-net model achieves significantly better performance in detecting intra-bin SV fragments than convolutional neural networks (CNNs). Furthermore, we compare the proposed method with non-learning-based SV detection methods such as Delly and Lumpy, and demonstrate that by integrating the intra-bin detection method, the confidence intervals of breakpoints predicted by Lumpy can be further refined.

D-070: Ploidetect: Interpretable detection of tumour purity and aneuploidy from whole-genome sequence data
  • Luka Culibrk, Canada's Michael Smith Genome Sciences Centre, Canada
  • Jasleen Grewal, Canada's Michael Smith Genome Sciences Centre, Canada
  • Erin Pleasance, Canada's Michael Smith Genome Sciences Centre, Canada
  • Richard Corbett, Canada's Michael Smith Genome Sciences Centre, Canada
  • Karen Mungall, Canada's Michael Smith Genome Sciences Centre, Canada
  • Janessa Laskin, British Columbia Cancer Agency, Canada
  • Steven Jones, BC Cancer, Genome Sciences Centre, Canada
  • Marco Marra, BC Cancer, Genome Sciences Centre, Canada

Short Abstract: Whole-Genome Sequencing (WGS) of tumors is being increasingly adopted to inform clinical decision-making in the treatment of cancer. One major challenge in the analysis of biopsied tumours using sequencing is estimation of tumour purity. We present Ploidetect, an R package that simultaneously estimates tumor purity and ploidy from WGS data. Ploidetect is robust to a large degree of tumor heterogeneity and provides accurate estimates in extremely impure tumors. Ploidetect was applied to a cancer cohort comprising previously treated metastatic tumor WGS data (n = 710). We found good concordance between Ploidetect estimates and manual estimation of tumour purity from copy number data (r = 0.88); Ploidetect estimates fell within 10% of the manual review assessment in 74% of cases. Re-review of discordant cases revealed that Ploidetect provided superior estimates in the majority of evaluated samples. Ploidetect also demonstrated superior estimates compared to other tools in a set of eight randomly chosen test cases, and provided essentially perfect estimates of tumour purity in synthetically diluted cell line data. The results obtained from Ploidetect are simple for a human reviewer to assess for accuracy. Finally, Ploidetect performs detection of copy number variation using a novel segmentation-by-compression algorithm. Ploidetect is freely available at https://github.com/lculibrk/Ploidetect-package/

D-071: Accurate variant calling in unique and repetitive regions of the genome using single-molecule long read sequencing
  • Peter Edge, University of California San Diego, United States
  • Timofey Prodanov, University of California San Diego, United States
  • Vikas Bansal, University of California San Diego, United States

Short Abstract: Single-molecule sequencing technologies such as Pacific Biosciences (PacBio) and Oxford Nanopore generate long reads (5-100 kilobases) that can address two key limitations of short read sequencing technologies for sequencing human genomes: (i) lack of long-range haplotype information and (ii) the inability to call variants in long repetitive regions with high sequence identity. However, the high error rate of SMS reads makes it challenging to detect small variants such as single nucleotide variants (SNVs) and short indels in diploid genomes. We have developed a variant calling method, Longshot, that leverages the haplotype information present in SMS reads to enable the accurate detection and phasing of single nucleotide variants in diploid genomes. Using whole-genome PacBio data for multiple human individuals and high-confidence GIAB variant calls, we demonstrate that Longshot achieves very high accuracy for SNVs (precision > 0.995 and recall > 0.96) that is significantly better than two recent variant calling methods. To enable accurate variant calling in segmental duplications with high sequence identity, we have designed a re-mapping strategy that leverages paralogous sequence variants to align reads with high confidence in such regions.

D-072: EPCY: Evaluation of Predictive CapabilitY for ranking biomarker gene candidates
  • Éric Audemard, IRIC, Université de Montréal, Canada
  • Léonard Sauvé, IRIC, Université de Montréal, Canada
  • Sébastien Lemieux, University of Montreal, Canada

Short Abstract: Finding biomarker gene candidates constitutes the entry point for identifying links between gene expression levels and features of interest in RNA-sequencing patient data. Features can be cancer subtypes, the presence of prognostic mutations, or genome rearrangements. Expression biomarkers are commonly identified using Differentially Expressed Gene (DEG) analysis to compare a test group (presenting the feature of interest) to a control group. It has become standard to use significant DEGs as input to integrative analyses, based on a priori knowledge, to find a subset of DEGs linked with sample features. We posit that current issues with biomarker identification derive from a misalignment of the DEG procedure with the objective pursued by biomarker identification, namely sample classification based on some criterion. We propose a more direct approach that evaluates gene expression based strictly on its individual predictive capability to accurately classify samples. Using two patient cohorts and 4 features, we show that the resulting ranking returns a more informative set of biomarker gene candidates than DEGs selected by DESeq, edgeR, or limma.

D-073: Identifying and Separating Strain-Unique Bacterial Sequences Using SepSIS
  • Matthew Waldner, University of Saskatchewan, Department of Computer Science, Canada
  • Anthony Kusalik, University of Saskatchewan, Department of Computer Science, Canada
  • Murray Jelinski, University of Saskatchewan, Western College of Veterinary Medicine, Department of Large Animal Clinical Sciences, Canada

Short Abstract: Laboratory techniques for isolating and culturing bacteria cannot ensure that an isolate is monoclonal. If a multi-clonal bacterial sample is short-read sequenced, the resulting read set will contain sequences from more than one genomically unique strain, and an assembly built from it, whether reference-based or de novo, will be corrupted by these sequences. As a solution, we present the Separator of Strain Inherent Sequences (SepSIS), a tool to detect and isolate strain-unique genes and sequences in mixed-strain, short-read datasets with variable coverage. SepSIS works by analyzing the cyclic components of the assembly graph produced by the SPAdes assembler. It then iteratively categorizes adjacent sequences in each component as belonging to one or more strains, and merges them using coverage-based criteria. We demonstrate the ability of SepSIS to separate strain-unique sequences from in vitro mixed strains of Mycoplasma bovis. SepSIS can also be run on separately sequenced, in silico mixed read sets to allow the study of strain-unique sequences. Using SepSIS, we show that multiple strains of M. bovis exist on a single culture plate by separately sequencing individual colonies, combining them in silico, and producing a list of sequences unique to particular strains.

D-074: Physlr: Construct a Physical Map from Linked Reads
  • Vladimir Nikolić, BC Cancer Genome Sciences Centre, Canada
  • Sauparna Palchowdhury, BC Cancer Genome Sciences Centre, Canada
  • Joerg Bohlmann, The University of British Columbia, Canada
  • Hamid Mohamadi, BC Cancer Agency Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Agency Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer, Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Amirhossein Afshinfard, BC Cancer Genome Sciences Centre, Canada
  • Shaun Jackman, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Short Abstract: Sequencing large molecules of DNA has drastically improved the contiguity of genome sequence assemblies. Long read sequencing has reduced sequence fidelity compared to short read sequencing and is currently more expensive. Linked read sequencing from 10x Genomics Chromium combines the benefits of large DNA molecules with the sequence fidelity and cost of short read sequencing. Our tool, Physlr, constructs a physical map of large DNA molecules from linked reads without first assembling those reads. A barcode-overlap graph is constructed, where each edge represents two barcodes sharing minimizer k-mers. The underlying molecule-overlap graph is reconstructed from the barcode-overlap graph by identifying k-clique communities, where each community arises from a single DNA molecule. The physical map is a set of contigs, where each contig is an ordered list of barcodes. The scaffolds of an existing assembly may be ordered and oriented using the physical map. We constructed a physical map of the 1.34 Gbp zebrafish (Danio rerio) genome. A Supernova assembly was scaffolded by mapping it to this physical map, improving the NG50 from 4.8 Mbp to 9.1 Mbp. Physlr can employ multiple libraries of linked reads, which is necessary for genomes larger than mammalian genomes, such as those of conifers, which can exceed 20 Gbp.
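The barcode-overlap construction can be sketched with plain string minimizers (Physlr itself uses hashed, strand-canonical minimizers at scale; `k`, `w`, and `threshold` here are illustrative parameters):

```python
from itertools import combinations

def minimizers(seq: str, k: int = 5, w: int = 3) -> set[str]:
    """The (k,w)-minimizers of a sequence: the lexicographically smallest
    k-mer in every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}

def barcode_overlap(barcodes: dict[str, str], threshold: int = 2):
    """Edges of the barcode-overlap graph: pairs of barcodes whose reads
    share at least `threshold` minimizers."""
    sketch = {b: minimizers(s) for b, s in barcodes.items()}
    return [(a, b) for a, b in combinations(sketch, 2)
            if len(sketch[a] & sketch[b]) >= threshold]

# BX1 and BX2 carry reads from overlapping molecules; BX3 does not
barcodes = {
    "BX1": "AACCGGTTAACCGG",
    "BX2": "CCGGTTAACCGGTT",
    "BX3": "TTTTGGGGCCCCAAAA",
}
print(barcode_overlap(barcodes))  # -> [('BX1', 'BX2')]
```

Comparing small minimizer sketches instead of full read sets is what keeps the all-pairs barcode comparison tractable at whole-genome scale.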

D-075: Trans-NanoSim: Characterizing and simulating Oxford Nanopore cDNA/directRNA reads using statistical models
  • Chen Yang, BC Cancer Genome Sciences Centre, Canada
  • Ka Ming Nip, BC Cancer Genome Sciences Centre, Canada
  • Saber Hafezqorani, BC Cancer Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Short Abstract: Recently, long read technologies such as the Oxford Nanopore Technology (ONT) RNA sequencing (RNA-seq) protocol from complementary DNA (cDNA) or direct RNA (dRNA) libraries have generated valuable data for studying transcriptomes. The potential of these data types will be better realized if coupled with bioinformatics tools that are tuned to their platform-specific characteristics. Development of these tools would highly benefit from datasets with a known ground-truth. Simulated data provides a cost-effective way to accomplish this. Here we introduce Trans-NanoSim, a two-stage pipeline that (1) captures the technology-specific features of ONT transcriptome reads, and (2) simulates reads true to the platform. In the modelling stage, it utilizes state-of-the-art tools to align reads to a reference transcriptome, and generates statistical models that describe the read profiles, such as their error modes and length distributions. It also models features of the library preparation protocols used, including intron retention (IR) events. Further, it profiles and mimics the transcript expression patterns. Next, these models are used to produce in silico reads for a given reference transcriptome. We benchmark Trans-NanoSim against competing tools using publicly available experimental human and mouse ONT transcriptome reads. Trans-NanoSim demonstrates better performance in mimicking the characteristics of the ONT platform.

D-076: Processing pipeline calibration and confounder removal for accurate and reproducible gene activity profiling at the level of alternative transcripts
  • Paweł P. Łabaj, Małopolska Centre of Biotechnology of Jagiellonian University, Poland
  • David P. Kreil, Chair of Bioinformatics, Boku University Vienna, Austria

Short Abstract: Recent studies show that >70% of researchers have failed to reproduce the results of other scientists’ analyses, and >50% have failed to reproduce their own. Every data modality that we can measure adds novel sources of variation and artefacts that need to be characterized and removed. We show that it is necessary and possible to clean hidden confounding factors in modern assay data. Here we consider the systematic measurement of gene activity by expression profiling. The latest developments now allow higher resolution profiling, for instance, the discrimination of alternative gene transcripts. These, not genes, are in fact relevant for functional analysis; thus profiling gene activity at the level of alternative transcripts is a cornerstone of functionally relevant analyses. In general, public repositories already hold raw data allowing transcript-level analyses from high-throughput technologies. However, the majority of studies still focus on gene-level analysis. We show here that transcript-level interpretation of high-resolution profiles remains non-trivial, and the reproducibility of results from current state-of-the-art approaches is drastically lower than at the gene level. However, modern factor analysis makes it possible to remove hidden confounders and thereby improve reproducibility. In particular, this benefits from community-standard reference sets for inter-lab calibration.

D-077: High-throughput sequencing using combinatorial profiling
  • Luisa Teasdale, Australian National Insect Collection, CSIRO, Australia
  • Andreas Zwick, Australian National Insect Collection, CSIRO, Australia

Short Abstract: Many questions in biology require only a small number of genetic loci, but from thousands of samples. This goal can be cost-prohibitive and technically impractical, given that current high-throughput sequencing requires individually labelled samples (i.e. ‘indexing’). We developed a computational approach that enables the simultaneous sequencing of genes for thousands of samples in a single high-throughput sequencing run. Instead of individual labels, we use combinatorics to track and match sequences to samples. Samples are pooled across a unique subset of a small set of indexed libraries, and the known and unique pooling pattern is used to decode sequences. This approach (termed ‘combinatorial profiling’) has previously been used for applications such as the detection of rare variants, but not to retrieve full sequences for every sample encoded. We present a complete computational pipeline to encode and decode pooled sequences, and validation with both simulation and sequencing. A minimum level of sequence divergence between pooled samples is ideal (>1 bp difference); however, the technique can tolerate some redundancy. Combinatorial profiling has a vast array of applications and is particularly suited to projects where several loci are needed for thousands of specimens, such as producing reference genetic databases of museum collections.
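The encode/decode principle can be sketched as follows (a toy illustration of combinatorial pooling, not the authors' pipeline; the pool count and subset size are arbitrary):

```python
from itertools import combinations

def assign_pools(num_samples, num_pools, pools_per_sample=2):
    """Give each sample a unique subset of indexed pools (its 'signature')."""
    signatures = list(combinations(range(num_pools), pools_per_sample))
    if num_samples > len(signatures):
        raise ValueError("not enough pool combinations for this many samples")
    return {sample: signatures[sample] for sample in range(num_samples)}

def decode(observed_pools, assignment):
    """Map the set of pools a sequence was recovered from back to its sample."""
    signature = tuple(sorted(observed_pools))
    matches = [s for s, sig in assignment.items() if sig == signature]
    return matches[0] if matches else None
```

With P indexed libraries and subsets of size t, C(P, t) samples can be encoded; for example, 20 libraries with subsets of 3 suffice for C(20, 3) = 1140 samples.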

D-078: Gemini: A novel mate-aware algorithm for enhancing indel detection and variant calling accuracy
  • Gwenn Berry, Illumina, Inc., United States
  • Tamsen Dunn, Bandit Bioscience, United States
  • Hans Kang, Illumina, Inc., United States
  • Xiao Chen, Illumina, Inc., United States
  • Nathan Haseley, Illumina, Inc., United States
  • Kristina Kruglyak, Illumina, Inc., United States

Short Abstract: The rapidly advancing applications of Next-Generation Sequencing (NGS) to personalized medicine and oncology research demand increasingly accurate small variant calling performance on challenging, noisy datasets. While accurate single nucleotide variant (SNV) calling techniques are well-established, indels (insertions and deletions) continue to present a challenge in short-read sequencing due to incomplete spanning coverage and alignment limitations. Furthermore, both indel and SNV calling are even more difficult in cancer samples (e.g., low-quality FFPE tissue, low tumor content, tumor heterogeneity). We present Gemini, an approach to enhance both the sensitivity and specificity of indel detection, as well as reduce false positive SNV calls around true, difficult-to-detect indels. By performing mate-aware, local indel realignment and read stitching, Gemini is able to span and call longer indel events, boost confidence in base calls, and rescue valuable sequencing results previously lost to soft-clipping by standard alignment tools. Benchmarking on both real and simulated targeted sequencing data demonstrates up to a 15% improvement in indel sensitivity and a 2x increase in precision when compared against similar tools. Moreover, runtime read categorization and prioritization allows for a 5x speed improvement over comparable methods, making Gemini both a fast and accurate variant calling enhancer for targeted sequencing applications.

D-079: Eliminating transcriptome-wide batch effects results in higher precision in allele-specific expression measurements
  • Asia Mendelevich, Skoltech, Russia
  • Svetlana Vinogradova, Harvard University, United States
  • Saumya Gupta, Harvard Medical School, Dana-Farber Cancer Institute, United States
  • Andrey Mironov, Dept.Bioengineering and Bioinformatics Moscow State University, Russia
  • Alexander Gimelbrant, Harvard Medical School, Dana-Farber Cancer Institute, United States

Short Abstract: Analysis of allele-specific expression aims to measure the relative activity (allelic imbalance, AI) of the maternal and paternal alleles in diploid cells and thus capture the integral output of the gene-regulatory systems. Transcriptome-wide allele-specific expression can be measured by a variety of methods, with RNA sequencing being the most widely used. Most of the existing approaches do not fully take into account the limits of applicability of the experimental and statistical methods and thus may be biased in their estimations. Two usual assumptions are that allele-specific expression analysis is generally robust to technical artifacts and that taking one technical replicate for each biological sample should be enough to account for the biases. We demonstrated that batch effects in AI estimations are substantial, and found that for RNA-seq data coming from a set of technical replicates there exists an invariant that captures the overdispersion of AI arising from the experiment and data processing pipeline. Overall, we have shown that using at least two technical replicates, under the assumption that each of them constitutes a truthful sample from the overall distribution, we can estimate the necessary corrections in AI measurements for the corresponding experiment and adjust confidence intervals, which enables differential allele-specific expression analysis.

D-080: Application of CNVnator to analyze copy number alterations in cancer
  • Shobana Sekar, Mayo Clinic, United States
  • Milovan Suvakov, Mayo Clinic, United States
  • Alexej Abyzov, Mayo Clinic, United States

Short Abstract: Copy number variations and alterations (CNVs/CNAs) have been linked to several human diseases including cancer. Around 90% of tumors have CNAs as well as gain or loss of part of a chromosome or a whole chromosome (i.e., aneuploidies). CNAs and aneuploidies are similarly frequent in cancer subclones and metastases, and their discovery and analysis are important for understanding cancer evolution, progression, and invasiveness, as well as individual response to treatment. CNVnator is a read-depth (RD) based approach to CNV discovery on whole genome sequencing (WGS) data. Initially developed for analysis of the normal genome, it is being expanded for analysis of somatic CNAs in cancer. Recently added features include functionality to import single nucleotide polymorphism data from VCF files. Using these data, users can generate B-allele frequency plots along with the RD signal to facilitate analysis of subclonal CNAs and copy-number-neutral losses of heterozygosity. Further, in order to address the pressing need for analysis of large numbers of WGS datasets, we have introduced an option that reduces the footprint of intermediate files by about 10 times. We demonstrate the application of these new functionalities along with the analysis of subclonal CNAs on colorectal cancer samples with residual polyp from 13 patients.
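The read-depth principle behind this kind of CNV discovery can be sketched in a few lines (a deliberately simplified illustration, not CNVnator's algorithm; the bin depths, thresholds, and normalization are placeholders):

```python
def call_cnv_bins(bin_depths, genome_mean, gain=1.5, loss=0.5):
    """Flag genomic bins whose normalized read depth (RD) deviates from the
    genome-wide mean: high ratios suggest duplications, low ratios deletions."""
    calls = []
    for i, depth in enumerate(bin_depths):
        ratio = depth / genome_mean
        if ratio >= gain:
            calls.append((i, "duplication", round(ratio, 2)))
        elif ratio <= loss:
            calls.append((i, "deletion", round(ratio, 2)))
    return calls
```

B-allele frequencies of heterozygous SNPs in the same bins can then corroborate such calls and expose copy-number-neutral losses of heterozygosity, which leave the RD signal unchanged.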

D-081: Analysis of expression of important tooth genes through bulk and single-cell RNA-seq
  • Rishi Das Roy, University of Helsinki, Finland
  • Outi Hallikas, University of Helsinki, Finland
  • Jukka Jernvall, University of Helsinki, Finland

Short Abstract: The development of an embryo to an adult form is a time-dependent, synchronized process. Different sets of genes are expressed at specific developmental stages to give rise to complex organs. The identification of these genes is a major research interest in developmental biology. The conventional methods in developmental biology, in situ hybridization and immunohistochemical staining, have been used to identify and visualize the expression patterns of genes. These methods have revealed that many genes are expressed only in subpopulations of cells (for example epithelium or mesenchyme) of organs, and they are referred to as biomarkers. But not all genes have been studied through these time-consuming processes. Also, it is not possible to quantify expression levels with conventional methods. Although bulk RNA-seq of a piece of tissue can identify the expression of thousands of genes simultaneously, it averages out the expression levels of genes that are expressed only in subpopulations of cells. However, recent developments in single-cell RNA sequencing have made it possible to measure expression levels of genes at the cellular level. Here, we report a comparative study of expression levels of many well-known tooth genes from the developing (E14) mouse tooth by integrating single-cell RNA-seq data with bulk RNA-seq.

D-082: GABAC: an arithmetic coding solution for genomic data
  • Jan Fostier, Ghent University, Belgium
  • Mikel Hernaez, University of Illinois at Urbana-Champaign, United States
  • Tom Paridaens, Ghent University, Belgium
  • Jan Voges, University of Hannover, Germany
  • Jorn Ostermann, University of Hannover, Germany

Short Abstract: In an effort to provide a response to the ever-expanding generation of genomic data, ISO is designing a new solution for the representation, compression, and management of genomic sequencing data: the MPEG-G standard. This standard specifies an abstract representation of sequencing data and offers support for, among others, multi-dimensional random access, extensive meta-data annotation, data and privacy protection, and data storage and streaming. We propose the first baseline implementation of an MPEG-G-compliant entropy encoder/decoder: GABAC. It combines proven coding technologies, such as context-adaptive binary arithmetic coding, binarization schemes, and transformations, into one straightforward solution. We tested GABAC on a test set of 206 descriptor stream files and compared it to the codecs used in the CRAM framework: gzip, bzip2, xz, rANS order-0, and rANS order-1. GABAC offers, on average, the highest compression ratio across all coding solutions (yielding the smallest compressed size among all six codecs for 127 out of 206 test items), while being, on average, 5.5 times faster in compression than the second-best compressor. Additionally, adding GABAC to the CRAM set of encoders would offer a slight compression gain but, more importantly, a significant speed-up of a factor of 2.4 in compression time.

D-083: Parameter exploration improves accuracy of long-read genome assembly
  • Anurag Priyam, Queen Mary University of London, United Kingdom
  • Eckart Stolle, Institut für Biology, Martin-Luther-Universität Halle-Wittenberg, Germany
  • Yannick Wurm, Queen Mary University of London, United Kingdom

Short Abstract: Accurate and complete assembly of genomes is important for molecular biology studies. It is now possible to generate genome assemblies with high consensus accuracy from long-read sequencing technologies despite their high error rates. As a consequence, long-read sequencing is now being applied to generate reference genome assemblies for species across the whole tree of life. Because genome complexity, sequencing depth, and the length and error profile of reads can vary across projects, we asked whether the default parameters of an assembler are the best choice for genome assembly. We tested 45 parameter combinations of the popular long-read assembler Canu to generate an improved reference genome assembly for the red fire ant (Solenopsis invicta). For this, we sequenced a pool of 21 haploid brothers on a PacBio Sequel (V2 chemistry) to 44x genome coverage. To compare the assemblies, we assessed their contiguity, structural accuracy, the extent to which repetitive regions were collapsed, and gene completeness. We find that adjusting the stringency of overlap detection and of raw-read trimming improves assembly quality by 6-56% over default parameters. Our work provides an axis and framework for optimising assembly parameters for new species.

D-084: ntHits: a streaming algorithm for identifying repeats in genomics data
  • Hamid Mohamadi, BC Cancer Agency Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Agency Genome Sciences Centre, Canada
  • Kristina Gagalova, BC Cancer Agency Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer, Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Shaun Jackman, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Short Abstract: Repeat elements are abundant in eukaryotic genomes, often complicating sequence assembly, comparative genomics studies, and other genomic analyses. De novo identification of repeats is computationally very challenging, and existing tools for this purpose need considerable computational resources in terms of memory/disk space and runtime. Here, we present ntHits, a streaming method for de novo repeat identification based on the statistical characteristics of the k-mer content profiles of input data. ntHits first obtains the k-mer coverage distributions of input datasets using the ntCard algorithm. After excluding the effect of erroneous k-mers, ntHits computes the homozygous k-mer rate λ. The k-mers that are repeated in the input data occur at rates λr ≈ r × λ, r ≥ 2. After identifying the thresholds, ntHits streams through the input datasets and filters out non-repetitive k-mers using a counting Bloom filter data structure that counts up to the repeat threshold. The k-mers appearing more often than the threshold are passed to the next stage and stored in a hash table. Our results show that ntHits can efficiently and accurately identify repeat content in large-scale sequencing data. We expect ntHits to provide utility in characterizing the repeat content of large-scale datasets and de novo sequencing projects.
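The capped-counter filtering step can be illustrated with a minimal counting Bloom filter (a sketch of the general data structure, not ntHits' implementation; the sizes, hash choices, and threshold are arbitrary):

```python
import hashlib

class CappedCountingBloomFilter:
    """Counting Bloom filter whose counters saturate at a repeat threshold."""

    def __init__(self, size=1 << 20, num_hashes=3, threshold=3):
        self.size = size
        self.num_hashes = num_hashes
        self.threshold = threshold
        self.counters = [0] * size

    def _slots(self, kmer):
        """Derive num_hashes counter slots from salted hashes of the k-mer."""
        return {
            int.from_bytes(
                hashlib.blake2b(kmer.encode(), digest_size=8,
                                salt=bytes([i])).digest(), "big") % self.size
            for i in range(self.num_hashes)
        }

    def insert(self, kmer):
        """Increment the k-mer's counters (up to the cap) and report whether
        its minimum counter has reached the threshold, i.e. it looks
        repetitive (with a small false-positive rate from collisions)."""
        slots = self._slots(kmer)
        for s in slots:
            if self.counters[s] < self.threshold:
                self.counters[s] += 1
        return min(self.counters[s] for s in slots) >= self.threshold
```

k-mers that cross the threshold would then be handed to an exact hash table, so only the much smaller repetitive fraction is counted exactly.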

D-085: Leveraging known genomic variants to improve detection of variants, especially close-by Indels
  • Nam Vo, The University of Chicago, United States
  • Vinhthuy Phan, The University of Memphis, United States

Short Abstract: The detection of genomic variants has great significance in genomics research and its applications. Current approaches based on sequencing data usually require high coverage to detect indels and structural variants accurately. However, indels, especially those close to each other, remain hard to detect accurately. Here we introduce a novel approach that leverages known variants, e.g. those provided by dbSNP/dbVar, ExAC, or the 1000 Genomes Project, to improve the sensitivity of variant detection, especially for close-by indels. In our approach, the standard reference and the known variants are combined to build a meta-reference. A novel alignment algorithm is developed to accurately align reads to the meta-reference. This strategy results in accurate variant calling even with low-coverage data. We showed that, compared to popular methods such as GATK and SAMtools, our method was able to call close-by indels with 15–20% higher sensitivity at low coverage, and still 1–5% higher sensitivity at high coverage, at competitive precision. These results were validated using simulated data and real data from the Illumina Platinum Genomes Project. Our findings suggest that by incorporating known variant information, sensitive variant calling is possible at a low cost. Our implementation can be found at https://github.com/namsyvo/IVC

D-086: Integrative single-cell sequencing to unravel the composition and evolution of Acute Lymphoblastic Leukemia
  • Llucia Albertí Servera, VIB Leuven, Belgium
  • Sofie Demeyer, VIB Leuven, Belgium
  • Jan Cools, VIB Leuven, Belgium

Short Abstract: Acute lymphoblastic leukemia (ALL), the most common cancer in children, shows extensive genetic intra-tumoral heterogeneity. This heterogeneity might be the underlying reason for an incomplete response to treatment and for the development of relapse. In order to envision the clinical implementation of a refined risk-category strategy based on ALL subclonal composition, it is essential to first generate a reference single-cell map and accumulate evidence on how the subclonal composition affects the response to treatment. Therefore, in this project I am building a comprehensive single-cell overview of the composition, development and response to therapy of pediatric ALL. For that, I perform large-scale and integrative single-cell genome and transcriptome profiling of carefully selected ex vivo pediatric samples at diagnosis, during drug treatment, and in case of relapse. This provides real temporal information about the sensitivity of each cell type to the therapy and about how relapse can develop. Moreover, the results gather information on the feasibility of detecting minor clinically relevant leukemia clones at diagnosis or during the early days of treatment in ALL. Ultimately, this project could have clinical implications for improved risk-stratification methods based on individual patients’ molecular profiles.

D-087: Single-cell RNA analysis deciphers tumor heterogeneity and the immune microenvironment
  • Anne Bertolini, Nexus Personalized Health Technologies, ETH Zurich, Switzerland
  • Michael Prummer, NEXUS Personalized Health Technologies, ETH Zurich, Switzerland

Short Abstract: Single-cell RNA sequencing (scRNA-seq) based tumor biopsy analysis is an emerging technique that makes it possible to profile tumor cells and infiltrating immune cells in unprecedented detail. Based on RNA expression, distinct tumor subclones can be identified, informing on tumor heterogeneity and potential treatment-resistant subclones. Moreover, the cell type composition of the tumor microenvironment, in particular the presence and variability of immune cell subpopulations, can strongly influence treatment response. Although multiple methods for scRNA-seq analysis exist, their application in a clinical setting demands standardized and reproducible analysis workflows, targeted to display the clinically relevant information. To this end, we designed a workflow to characterize the cell type composition of tumor biopsies based on scRNA-seq data from the 10x Genomics platform. The raw sequenced reads are assigned to genes and cells and subsequently filtered and normalized in several quality control steps. Based on the gene expression profile of each cell, we inform on tumor heterogeneity and immune cell subpopulations, and highlight the expression of clinically relevant genes and pathways. We applied our workflow to the analysis of diverse tumor samples, informed on tumor and immune composition, and monitored treatment response.

D-088: Investigating T-cell differentiation using single-cell pseudotime analysis
  • Virginie Stanislas, Max Planck Institute for Molecular Genetics, Germany

Short Abstract: T-cells are subtypes of white blood cells that play a central role in cell-mediated immunity. These cells are traditionally divided into two main subtypes based on distinct T-cell receptors: alpha beta T-cells, which represent 95% of the overall population, and gamma delta T-cells, which constitute the remaining 5%. While perceived as two independent T-cell categories, new investigations suggest a possible developmental relationship between these subtypes, which opens up new perspectives in the field of immunology. With single-cell RNA sequencing (scRNA-seq) it is now possible to profile the transcriptomes of thousands of individual cells simultaneously and to investigate cell heterogeneity at an unprecedented resolution. However, as cells are destroyed during RNA extraction with this technology, it is not possible to directly study the development of a particular cell over time. Nevertheless, developmental trajectories can be inferred using pseudotemporal ordering, in which cells are ordered according to their expression similarity along a learned trajectory. Using a public dataset of peripheral blood mononuclear cells (PBMCs) and applying pseudotime algorithms recently developed in the context of scRNA-seq, we will investigate the developmental path of T-cells.

D-089: Physlr-molecule: de novo barcode-to-molecule deconvolution of linked reads
  • Hamid Mohamadi, BC Cancer Agency Genome Sciences Centre, Canada
  • Justin Chu, BC Cancer Agency Genome Sciences Centre, Canada
  • Lauren Coombe, BC Cancer, Genome Sciences Centre, Canada
  • Rene Warren, BC Cancer, Genome Sciences Centre, Canada
  • Amirhossein Afshinfard, BC Cancer Genome Sciences Centre, Canada
  • Shaun Jackman, BC Cancer Genome Sciences Centre, Canada
  • Vladimir Nikolić, BC Cancer Genome Sciences Centre, Canada
  • Jessica Zhang, BC Cancer Genome Sciences Centre, Canada
  • Inanc Birol, BC Cancer Genome Sciences Centre, Canada

Short Abstract: One main challenge in analyzing 10x Genomics linked reads arises from barcode reuse, whereby distinct DNA molecules are assigned the same barcode. Here we present Physlr-molecule, a de novo method to deconvolute barcodes into their component molecules. A barcode-overlap graph is first constructed, where each edge represents two barcodes that share minimizers. To split a barcode into molecules, we inspect each barcode’s neighbourhood graph, the vertex-induced subgraph of the barcode’s immediate neighbours. Since each molecule of a barcode overlaps with a distinct set of barcodes, this neighbourhood subgraph is composed of multiple communities, one per molecule. Physlr-molecule detects these communities in millions of subgraphs, each comprising hundreds to thousands of vertices. In such a setting, state-of-the-art community-detection algorithms do not scale well. To reduce the running time of these superlinear-time algorithms, each subgraph is partitioned into bins. In each bin, communities are detected using k-clique percolation, and merged with communities of other bins if merging increases modularity. This novel community-detection approach reduces the running time from 19 days to 8 minutes for Drosophila melanogaster linked reads. To rescue imperfectly split communities, convoluted regions of the overlap graph are detected and resolved with our robust consensus community-detection algorithm, which combines cosine similarity, k-clique, and modularity-based approaches.

D-090: Development and validation of a bioinformatics pipeline for routine analysis of whole genome sequencing data for typing of Listeria monocytogenes
  • Qiang Fu, Sciensano, Belgium
  • Bert Bogaerts, Sciensano, Belgium
  • Raf Winand, Sciensano, Belgium
  • Julien Van Braekel, Sciensano, Belgium
  • Wesley Mattheus, Sciensano, Belgium
  • Pieter-Jan Ceyssens, Sciensano, Belgium
  • Bavo Verhaegen, Sciensano, Belgium
  • Sigrid De Keersmaecker, Sciensano, Belgium
  • Nancy Roosens, Sciensano, Belgium
  • Kevin Vanneste, Sciensano, Belgium

Short Abstract: The use of whole-genome sequencing (WGS) for routine molecular typing and pathogen characterization in public health remains a substantial challenge due to the required bioinformatics resources and expertise. Moreover, national reference centers and laboratories, working under a quality system, require extensive validation to demonstrate that employed methods generate high-quality results, but a harmonized framework to guide the validation of WGS analyses is still lacking. We present a bioinformatics workflow developed in-house for typing of Listeria monocytogenes isolates, which performs automated data processing and quality control, and conducts multiple bioinformatics assays, including sequence typing (species confirmation, classic MLST, and cgMLST), gene characterization (antimicrobial resistance, virulence, and metal and detergent resistance), and plasmid replicon characterization. Using a well-characterized dataset of 130 samples, the conducted assays were validated against commonly used bioinformatics tools, achieving high performance with >99.50% accuracy, >99.92% precision, >97.82% sensitivity, and >97.84% specificity. Additionally, the MLST typing results were validated against the classic molecular approach, obtaining 98.88% accuracy, 97.67% precision, 97.67% sensitivity, and 97.90% specificity. Furthermore, every assay demonstrated 100% repeatability and 100% reproducibility. The pipeline, with its demonstrated high performance, will facilitate the routine analysis of L. monocytogenes and demonstrates the benefit of using WGS in a public health setting.
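For reference, the performance figures quoted above are the standard confusion-matrix metrics (generic definitions, not the validation code used in the study):

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity (recall), and specificity from
    confusion-matrix counts (true/false positives and negatives)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }
```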

D-091: SmartPhase: accurate and fast phasing of potentially compound heterozygous variant pairs for genetic diagnosis of rare diseases
  • Paul Hager, Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, Germany
  • Werner Mewes, School of Life Sciences, Technische Universität München, Germany
  • Meino Rohlfs, Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, Germany
  • Christoph Klein, Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, Germany
  • Tim Jeske, Department of Pediatrics, Dr. von Hauner Children's Hospital, University Hospital, LMU Munich, Germany

Short Abstract: There is an increasing need to use genome and transcriptome sequencing to genetically diagnose patients suffering from suspected monogenic rare diseases. As prevailing next-generation sequencing methods ignore haplotype information, computational tools are required for a proper evaluation of compound heterozygous variant combinations in clinical workflows. Here, we present SmartPhase, an open source phasing tool designed to efficiently reduce the set of potential compound heterozygous variant pairs in genetic diagnosis pipelines. The implemented phasing algorithm creates haplotypes using both parental genotype information and reads generated by DNA or RNA sequencing. It incorporates existing haplotype information and applies logical rules to determine variants that cannot cause a recessive, monogenic disease. SmartPhase phases either all variants in pre-defined genetic loci or pre-selected variant pairs of interest. We compared SmartPhase to WhatsHap, one of the leading comparable phasing tools, using simulated trio data. We found that SmartPhase was on average six times faster than WhatsHap and generated more accurate predictions. When using pedigree information, SmartPhase could resolve all pairs. We validated the clinical utility of SmartPhase by applying it to a real clinical sequencing data set. In only three hours, SmartPhase tested 116,613 heterozygous variant pairs in 921 individuals.
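The core trio rule for flagging a candidate compound-heterozygous pair can be sketched as follows (a simplified illustration of the parental-genotype logic, not SmartPhase's algorithm; genotypes are alternate-allele counts, and read-based phasing is omitted):

```python
def alt_origin(mother_gt, father_gt):
    """Which parent could have transmitted the alternate allele of a
    variant that is heterozygous in the child (genotypes are 0/1/2
    alternate-allele counts)."""
    from_mother, from_father = mother_gt > 0, father_gt > 0
    if from_mother and not from_father:
        return "mother"
    if from_father and not from_mother:
        return "father"
    return "ambiguous"  # both parents carry it, or neither (e.g. de novo)

def phase_pair(parents_a, parents_b):
    """Classify two heterozygous child variants as cis (same haplotype, so
    not compound heterozygous) or trans (one per haplotype). Each argument
    is a (mother_genotype, father_genotype) pair."""
    origin_a, origin_b = alt_origin(*parents_a), alt_origin(*parents_b)
    if "ambiguous" in (origin_a, origin_b):
        return "unresolved"  # pedigree alone is not enough; reads needed
    return "cis" if origin_a == origin_b else "trans"
```

In a pipeline, only "trans" pairs would remain candidates for a recessive compound-heterozygous diagnosis; "unresolved" pairs fall back to read-backed phasing.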

D-092: Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
  • Can Firtina, ETH Zurich, Switzerland
  • Jeremie S. Kim, ETH Zurich, Switzerland
  • Damla Senol Cali, Carnegie Mellon University, United States
  • A. Ercument Cicek, Bilkent University, Turkey
  • Can Alkan, Bilkent University, Department of Computer Engineering, Turkey
  • Mohammed Alser, ETH Zurich, Switzerland

Short Abstract: Long-read sequencing technologies have started to dominate de novo genome assembly projects due to their power to resolve complex and repetitive regions. However, the higher error rates associated with long reads necessitate the use of assembly polishing algorithms. We introduce Apollo, a universal assembly polishing algorithm that is scalable to polish an assembly of any size (i.e., large and small genomes) with reads from any sequencing technology. Our goal is to provide a single algorithm that uses reads from multiple sequencing technologies in a single run to improve the accuracy of an assembly of any size. Apollo 1) models an assembly as a profile hidden Markov model (pHMM), 2) uses read-to-assembly alignment to train the pHMM with the Forward-Backward algorithm, and 3) decodes the trained model with the Viterbi algorithm to produce a polished assembly. Our experiments demonstrate that 1) Apollo is the only algorithm that can use reads from any sequencing technology within a single run and that can polish an assembly of any size, 2) using reads from multiple sequencing technologies produces the most accurate assemblies, and 3) Apollo performs comparably to the competing algorithms in terms of accuracy even when polishing with reads from a single sequencing technology.
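Step 3, decoding a trained HMM with the Viterbi algorithm, can be illustrated on a generic discrete HMM (a textbook sketch, not Apollo's pHMM; the states, probabilities, and observations are toy values):

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a sequence of observations."""
    # probability of the best path ending in each state, per time step
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # backpointers for path reconstruction
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] * trans_p[p][s])
            best[t][s] = best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            back[t][s] = prev
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(observations) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]
```

In Apollo, the transition and emission parameters of such a recurrence are first learned from read-to-assembly alignments with Forward-Backward; the decoded state path then yields the polished sequence.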

D-093: Universal k-mer sets for large-scale genomic analysis
  • Fiyinfoluwa Gbosibo, Lincoln University, United States
  • Carl Kingsford, Carnegie Mellon University, United States
  • Guillaume Marçais, Carnegie Mellon University, United States
  • Dan DeBlasio, Carnegie Mellon University, United States

Short Abstract: Minimizer schemes have found widespread use in genomic applications as a way to quickly predict the matching probability of large sequences. Most methods for minimizer schemes use a randomized (or close to randomized) ordering of k-mers when finding minimizers, but recent work has shown that not all non-lexicographic orderings are the same. Universal hitting sets, which are subsets of k-mers guaranteed to cover all windows, can be used to construct more desirable orderings. The smaller this set, the fewer distinct k-mers will be used as minimizers. Current methods for creating universal hitting sets are limited in the length of the k-mer that can be considered, and cannot compute sets in the range of lengths currently used in practice. In this work we create universal hitting sets that can be used to construct practical minimizer orderings. We do this using iterative extension of the k-mers in a set, and guided contraction of the set itself. We also show that this process is guaranteed never to increase the number of distinct minimizers chosen in a sequence, and to decrease the number of distinct minimizer positions compared with using the current orderings on small k-mers.
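The link between a universal hitting set and a minimizer ordering can be sketched directly: rank k-mers in the set before all others, so that, because the set hits every window by construction, each selected minimizer is drawn from the set (a schematic illustration, not the authors' construction):

```python
def select_minimizers(seq, k, w, uhs):
    """Choose one minimizer per window of w consecutive k-mers, under an
    ordering that ranks members of the universal hitting set first (ties
    broken lexicographically)."""
    def rank(kmer):
        return (kmer not in uhs, kmer)  # False < True: set members win
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [min(kmers[s:s + w], key=rank) for s in range(len(kmers) - w + 1)]
```

The smaller the hitting set, the smaller the pool of k-mers that can ever be selected, which is why shrinking the set reduces the number of distinct minimizers.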

D-094: BitMAC: An In-Memory Accelerator for Bitvector-Based Sequence Alignment of Both Short and Long Genomic Reads
  • Can Firtina, ETH Zurich, Switzerland
  • Jeremie S. Kim, ETH Zurich, Switzerland
  • Damla Senol Cali, Carnegie Mellon University, United States
  • Gurpreet S Kalsi, Intel, United States
  • Lavanya Subramanian, Intel, United States
  • Anant Nori, Intel, United States
  • Zulal Bingol, Bilkent University, Turkey
  • Rachata Ausavarungnirun, King Mongkut's University of Technology North Bangkok, Thailand
  • Juan Gomez-Luna, ETH Zurich, Switzerland
  • Amirali Boroumand, Carnegie Mellon University, United States
  • Allison Scibisz, Carnegie Mellon University, United States
  • Sreenivas Subramoney, Intel, United States
  • Can Alkan, Bilkent University, Department of Computer Engineering, Turkey
  • Saugata Ghose, Carnegie Mellon University, United States
  • Mohammed Alser, ETH Zurich, Switzerland

Short Abstract: Read alignment is one of the critical steps of the genome sequence analysis pipeline, and approximate string matching makes up over 70% of the total read alignment time. Through an analysis of approximate string matching, we find that it is bottlenecked by both computational power and memory bandwidth. In this work, we propose BitMAC, an in-memory accelerator for read alignment. BitMAC performs bitvector-based approximate string matching using processing-in-memory (PIM), exploiting the high internal bandwidth of 3D-stacked DRAM and reducing data movement between the memory and the CPU to improve the performance and energy efficiency of read alignment. BitMAC combines 1) BitMAC-DC, a PIM accelerator for the edit distance calculation step, and 2) BitMAC-TB, a PIM core for the traceback step of read alignment. BitMAC is optimized for both 1) short accurate reads and 2) long error-prone reads, making it sequencing-technology independent, and can perform alignment either for the whole reference genome or for a subset of localized candidate locations reported by a filtering step. Our analysis shows that BitMAC achieves speedups over the state-of-the-art read mappers Minimap2 and BWA-MEM of 142.3x and 6333.5x, while reducing power consumption by 16.5x and 15.3x, respectively.
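The bitvector-based edit-distance computation that accelerators like BitMAC implement in hardware is, at its core, Myers' bit-parallel algorithm. A plain-Python transcription of that classic algorithm (not BitMAC's PIM implementation) looks like this:

```python
def myers_edit_distance(pattern, text):
    """Myers' bit-parallel algorithm (1999): global edit distance between a
    non-empty pattern and a text in O(len(text)) word operations when the
    pattern fits in a machine word (Python ints simply grow as needed)."""
    m = len(pattern)
    mask = (1 << m) - 1
    # Peq[c]: bitmask of the positions at which character c occurs in pattern.
    Peq = {}
    for i, c in enumerate(pattern):
        Peq[c] = Peq.get(c, 0) | (1 << i)
    Pv, Mv = mask, 0   # vertical +1 / -1 delta bitvectors
    score = m          # edit distance at the bottom of the current column
    for c in text:
        Eq = Peq.get(c, 0)
        Xv = Eq | Mv
        Xh = (((Eq & Pv) + Pv) ^ Pv) | Eq
        Ph = Mv | ~(Xh | Pv)
        Mh = Pv & Xh
        if Ph & (1 << (m - 1)):
            score += 1
        elif Mh & (1 << (m - 1)):
            score -= 1
        Ph = (Ph << 1) | 1   # the |1 encodes D[0][j] = j (global distance)
        Mh <<= 1
        Pv = (Mh | ~(Xv | Ph)) & mask
        Mv = (Ph & Xv) & mask
    return score
```

Because each text character costs only a handful of bitwise operations on word-sized vectors, the inner loop maps naturally onto the wide, simple logic that processing-in-memory substrates provide.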

D-095: Theoretical estimation of the strand cross-correlation in ChIP-Seq data
  • Hayato Anzawa, Grad. Sch. Info. Sci., Tohoku Univ., Japan
  • Hitoshi Yamagata, Adv. Res. Lab., Canon Medical Systems Corp., Japan
  • Kengo Kinoshita, Tohoku University, Japan

Short Abstract: In ChIP-Seq peak calling pre-analysis, strand-shift profiling, which calculates cross-correlation coefficients between forward- and reverse-strand mapped reads, is commonly used to estimate the mean fragment length. Several methods based on strand-shift profiling have been proposed to assess the quality of ChIP-Seq data. Although strand-shift profiling is commonly used in ChIP-Seq analysis, it had not yet been characterized theoretically. In 2013, Ramachandran, Palidwor, Porter and Perkins introduced a method to calculate an accurate cross-correlation, named mappability-sensitive cross-correlation (MaSC). Their approach uses binarized distributions of mapped reads and calculates the cross-correlation only over regions where both the forward and reverse positions are uniquely mappable. In this study, we derived theoretical maxima and minima of the cross-correlation coefficient under simplified ChIP-Seq read distribution models, building on the approach of Ramachandran et al. Maxima calculated from simulation data exactly match the expected values. Simulation analyses illustrate the impact of mappability bias and show that mappability correction improves the separation between maximum and minimum. Using ENCODE ChIP-Seq data, we will also discuss applicability to real data, current limitations, and differences from other methods. We will also introduce PyMaSC, an open-source tool implemented in Python to calculate MaSC. PyMaSC is available at https://github.com/ronin-gw/PyMaSC.
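A naive strand-shift profile, before any mappability correction, can be sketched as a Pearson correlation between binarized strand vectors at each shift; MaSC additionally restricts the sums to doubly-mappable positions, which this illustration omits. Function names are illustrative.

```python
def pearson(a, b):
    # Pearson correlation between two equal-length numeric sequences.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / (va * vb) ** 0.5

def strand_shift_profile(fwd, rev, shifts):
    """Naive strand-shift profile: for each shift d, correlate the binarized
    forward-strand read-start vector with the reverse-strand vector shifted
    by d. The profile peaks near the mean fragment length."""
    return {d: pearson(fwd[:len(fwd) - d], rev[d:]) for d in shifts}
```

On synthetic data where every reverse-strand start sits a fixed fragment length downstream of a forward-strand start, the profile attains its maximum exactly at that length.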

D-096: Custom substitution matrices for error-prone long reads
  • Caner Bagci, University of Tuebingen, Germany
  • Dominic Boceck, University of Tuebingen, Germany
  • Benjamin Albrecht, Center for Bioinformatics, University of Tübingen, Sand 14, Tübingen, Germany

Short Abstract: Third-generation sequencing technologies (Oxford Nanopore, PacBio) produce long but very error-prone (approximately 88% sequence identity) reads. The high error rate makes it difficult for aligners to find reasonable alignments of these reads against reference databases, especially in the case of nucleotide-to-protein alignments, where the task is to identify protein-coding regions on the reads. One major problem in aligning long, error-prone reads against a protein reference database is the aligners' use of evolutionary substitution matrices (e.g. BLOSUM62). These matrices assume that mismatches in an alignment result from similar functional or chemical properties of amino acids that readily substitute for one another. The majority of mismatches in long-read alignments, however, result from errors introduced by the sequencing technology itself, even in the case of environmental samples. We developed a custom, device-specific substitution matrix, calculated from the probabilities of one amino acid being read as another because of device-related sequencing errors. We show that the custom matrix alone increases the sensitivity of the LAST aligner on average by 14% (from 78% to 92%). We are also developing hybrid matrices that capture both evolutionary changes and device-specific error profiles.
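A BLOSUM-style log-odds construction over observed reference-to-read substitution counts gives the flavor of such an error-derived matrix. This is a generic sketch, not the authors' exact procedure; the pseudocount and the pair-count format are assumptions.

```python
import math

def log_odds_matrix(pair_counts, pseudocount=1.0):
    """BLOSUM-style log-odds scores from counts of aligned residue pairs
    (reference amino acid -> read amino acid):
        s(a, b) = log2( p(a, b) / (p_row(a) * p_col(b)) ).
    Pairs the device confuses more often than chance score above zero;
    rare substitutions score below zero."""
    alphabet = sorted({x for pair in pair_counts for x in pair})
    total = 0.0
    p = {}
    for a in alphabet:
        for b in alphabet:
            p[(a, b)] = pair_counts.get((a, b), 0) + pseudocount
            total += p[(a, b)]
    p = {k: v / total for k, v in p.items()}
    p_row = {a: sum(p[(a, b)] for b in alphabet) for a in alphabet}
    p_col = {b: sum(p[(a, b)] for a in alphabet) for b in alphabet}
    return {(a, b): math.log2(p[(a, b)] / (p_row[a] * p_col[b]))
            for a in alphabet for b in alphabet}
```

The same machinery accepts counts pooled from evolutionary alignments and from device error profiles, which is one natural route to the hybrid matrices mentioned above.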

D-097: Assessing sequencing data for genome assembly
  • Sara Bakic, University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia
  • Luka Pozega, University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia
  • Robert Vaser, University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia
  • Mile Sikic, University of Zagreb, Faculty of Electrical Engineering and Computing, Croatia

Short Abstract: In order to aid the development of new assembly algorithms, we created a lightweight standalone tool that, for a given yield of third-generation sequencing data and a corresponding reference genome, calculates the most contiguous assembly possible for each chromosome separately. In addition, it determines valid sequence regions and classifies sequences into distinct classes, annotating the related events: break points in chimeric sequences, inclusion intervals of contained sequences, and repetitive genomic regions in the sequences that overlap them. The result is a set of containment-free, non-chimeric sequences with repeat annotations, structured in assembly graphs at the chromosome level.

D-098: Single-cell RNA-seq analysis combined with live cell imaging on inflammatory response
  • Naoki Osato, Osaka University, Japan
  • Hironori Shigeta, Osaka University, Japan
  • Shigeto Seno, Osaka University, Japan
  • Hideo Matsuda, Osaka University, Japan
  • Yutaka Uchida, Graduate School of Medicine, Osaka University, Japan
  • Masaru Ishii, Graduate School of Medicine, Osaka University, Japan

Short Abstract: Single-cell analysis has been widely used as a prospective tool for elucidating cellular heterogeneity. Several types of analysis are available, such as single-cell RNA-seq and microscopy cell imaging. In this study, we present a combined analysis of single-cell RNA-seq and live cell imaging. Our approach consists of (1) dynamics analysis of migrating cells by live cell imaging, (2) transcriptome analysis of the cells, and (3) correspondence analysis between the results of (1) and (2) by cell clustering and trajectory analysis. We obtained live cell imaging data of mouse leukocytes 6 and 12 hours after inflammatory stimulation, observed by two-photon excitation microscopy, and single-cell RNA-seq data of leukocytes under the same conditions using an ICELL8 single-cell system. We performed cell clustering on cell-migration dynamics and on gene expression from single-cell RNA-seq, and matched their clusters according to the response time after inflammatory stimulation. We obtained specific gene expression patterns associated with cell-migration dynamics in the inflammatory response. The effectiveness of the combined approach is demonstrated by the determination of the degree of inflammation, e.g., acute vs. chronic inflammation, at the single-cell level.

D-099: SneakySnake: a fast and efficient pre-alignment filter for accelerating approximate string matching
  • Can Alkan, Bilkent University, Department of Computer Engineering, Turkey
  • Mohammed Alser, ETH Zurich, Switzerland
  • Onur Mutlu, ETH Zurich, Switzerland

Short Abstract: We introduce SneakySnake, a highly parallel and accurate pre-alignment filter that greatly reduces the need for computationally costly approximate string matching (ASM) algorithms. The first key idea of SneakySnake is to provide fast and highly accurate filtering by reducing the ASM problem to a single net routing problem in VLSI chip layout. The second key idea is to design a hardware accelerator, called Snake-on-Chip, that leverages modern FPGA architectures to further boost the performance of our algorithm. SneakySnake improves filtering accuracy by up to three orders of magnitude compared to the state-of-the-art pre-alignment filters Shouji, GateKeeper, MAGNET, and SHD. The cost-effective CPU implementation of the SneakySnake algorithm accelerates the state-of-the-art ASM implementations Edlib and Parasail by up to 37.67x and 18.05x, respectively, without the need for hardware accelerators. Snake-on-Chip further accelerates the SneakySnake algorithm by up to two orders of magnitude. Using a single FPGA chip, the addition of Snake-on-Chip as a pre-alignment filter reduces the execution time of five state-of-the-art ASM implementations, designed for different computing platforms, by up to 154.7x. SneakySnake and Snake-on-Chip do not sacrifice any ASM capabilities (i.e., scoring and backtracking), as they do not modify or replace the ASM algorithm.
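The filtering idea, rejecting a candidate pair whenever a cheap lower bound on its edit distance exceeds the threshold, can be illustrated with a greedy diagonal-walking sketch loosely modeled on the paper's description. This is not the authors' implementation; names and details are assumptions.

```python
def sneaky_lower_bound(read, ref, e):
    """Greedy sketch of a SneakySnake-style pre-alignment filter: repeatedly
    take the longest match run available on any of the 2e+1 diagonals,
    charging one edit each time every diagonal is blocked. The count
    under-approximates the edit distance, so `count > e` safely rejects
    the pair before running a full (and expensive) ASM algorithm."""
    n = len(read)
    col = 0
    edits = 0
    while col < n:
        best = 0
        for shift in range(-e, e + 1):
            run = 0
            while (col + run < n
                   and 0 <= col + run + shift < len(ref)
                   and read[col + run] == ref[col + run + shift]):
                run += 1
            best = max(best, run)
        if col + best >= n:
            break                # some diagonal matches through to the end
        col += best + 1          # consume the run plus one edited character
        edits += 1
    return edits
```

Because the bound never overestimates the true edit distance, the filter produces no false rejections; its job is only to discard pairs that cannot possibly align within e edits.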

D-100: Gemini: A novel mate-aware algorithm for enhancing indel detection and variant calling accuracy
  • Gwenn Berry, Illumina, Inc., United States
  • Tamsen Dunn, Bandit Bioscience, United States
  • Hans Kang, Illumina, Inc., United States
  • Xiao Chen, Illumina, Inc., United States
  • Nathan Haseley, Illumina, Inc., United States
  • Kristina Kruglyak, Illumina, Inc., United States

Short Abstract: The rapidly advancing applications of Next-Generation Sequencing (NGS) to personalized medicine and oncology research demand increasingly accurate small variant calling performance on challenging, noisy datasets. While accurate single nucleotide variant (SNV) calling techniques are well-established, indels (insertions and deletions) continue to present a challenge in short read sequencing due to incomplete spanning coverage and alignment limitations. Furthermore, both indel and SNV calling are even more difficult in cancer samples (e.g., low-quality FFPE tissue, low tumor content, tumor heterogeneity). We present Gemini, an approach to enhance both sensitivity and specificity of indel detection, as well as reduce false positive SNV calls around true, difficult-to-detect indels. By performing mate-aware, local indel realignment and read stitching, Gemini is able to span and call longer indel events, boost confidence in base calls, and rescue valuable sequencing results previously lost to soft-clipping by standard alignment tools. Benchmarking on both real and simulated targeted sequencing data demonstrates up to 15% improvement in indel sensitivity and a 2x increase in precision when compared against similar tools. Moreover, runtime read categorization and prioritization allows for a 5x speed improvement over comparable methods, making Gemini both a fast and accurate variant calling enhancer for targeted sequencing applications.
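Read stitching of overlapping mates, one ingredient the abstract describes, can be sketched as an overlap search followed by a quality-aware merge. This toy version (function name, parameters, and mismatch tolerance are all assumptions, not Gemini's algorithm) conveys why a stitched super-read can span an indel that neither mate covers alone.

```python
def stitch(r1, q1, r2, q2, min_overlap=10, max_mismatch_frac=0.1):
    """Toy mate stitching: slide r2 along r1, pick the overlap with the
    fewest mismatches (within tolerance), and merge, keeping the
    higher-quality base wherever the mates disagree. r2 is assumed already
    reverse-complemented into r1's orientation; q1/q2 are base qualities."""
    best = None
    for ofs in range(len(r1) - min_overlap + 1):
        ov = min(len(r1) - ofs, len(r2))
        mism = sum(1 for i in range(ov) if r1[ofs + i] != r2[i])
        if mism <= max_mismatch_frac * ov and (best is None or mism < best[0]):
            best = (mism, ofs, ov)
    if best is None:
        return None  # no acceptable overlap; leave the mates unstitched
    _, ofs, ov = best
    merged = list(r1[:ofs])
    for i in range(ov):
        merged.append(r1[ofs + i] if q1[ofs + i] >= q2[i] else r2[i])
    merged.extend(r2[ov:])        # r2 tail, if r2 extends past r1
    merged.extend(r1[ofs + ov:])  # r1 tail, if r2 is contained in r1
    return ''.join(merged)
```

A production stitcher would additionally use the mapped mate positions to anchor the overlap search rather than scanning every offset.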

D-101: Investigating T-cell differentiation using single-cell pseudotime analysis
  • Virginie Stanislas, Max Planck Institute for Molecular Genetics, Germany

Short Abstract: T-cells are a subtype of white blood cell that plays a central role in cell-mediated immunity. They are traditionally divided into two main subtypes according to their distinct T-cell receptors: alpha-beta T-cells, which represent 95% of the overall population, and gamma-delta T-cells, which constitute the remaining 5%. While long perceived as independent categories, recent investigations suggest a possible developmental relationship between these subtypes, which opens up new perspectives in immunology. With single-cell RNA sequencing (scRNA-seq) it is now possible to profile the transcriptomes of thousands of individual cells simultaneously and to investigate cell heterogeneity at an unprecedented resolution. However, as cells are destroyed during RNA extraction, it is not possible to directly follow the development of a particular cell over time. Nevertheless, developmental trajectories can be inferred by pseudotemporal ordering, in which cells are ordered by expression similarity along a learned trajectory. Using a public dataset of Peripheral Blood Mononuclear Cells (PBMC) and applying pseudotime algorithms recently developed for scRNA-seq, we will investigate the developmental path of T-cells.

D-102: GBSathon: Benchmarking reproducibility of Genotyping-By-Sequencing analysis workflows through comparison with SNP chip and pedigree data
  • Rayna Anderson, AgResearch, Invermay Agricultural Centre, New Zealand
  • Andrew Griffiths, AgResearch, Grasslands Research Centre, New Zealand
  • Shannon Clarke, AgResearch, Invermay Agricultural Centre, New Zealand
  • John McEwan, AgResearch, Invermay Agricultural Centre, New Zealand
  • Ken Dodds, AgResearch, Invermay Agricultural Centre, New Zealand
  • Jeanne Jacobs, AgResearch, Lincoln Research Centre, New Zealand
  • Tracey van Stijn, AgResearch, Invermay Agricultural Centre, New Zealand
  • Siva Ganesh, University of South Wales, United Kingdom
  • Roger Moraga, AgResearch, Grasslands Research Centre, New Zealand
  • Rachael Ashby, AgResearch, Invermay Agricultural Centre, New Zealand
  • Paul Maclean, AgResearch, Grasslands Research Centre, New Zealand
  • Abdul Baten, AgResearch, Grasslands Research Centre, New Zealand
  • Charles Hefer, AgResearch, Lincoln Research Centre, New Zealand
  • Aurelie Laugraud, AgResearch, Lincoln Research Centre, New Zealand
  • Monica Vallender, AgResearch, Invermay Agricultural Centre, New Zealand
  • Ruy Jauregui, AgResearch, Grasslands Research Centre, New Zealand
  • Hayley Baird, AgResearch, Invermay Agricultural Centre, New Zealand
  • Rudiger Brauning, AgResearch, Invermay Agricultural Centre, New Zealand

Short Abstract: The advent of reduced-representation genotyping-by-sequencing (GBS) provides a cost-effective, high-throughput genotyping platform for many ‘orphan’ species. This enables downstream analyses including genomic selection, parentage assignment, conservation genetics, population genetics and genome-wide association studies. There are many different workflows available for deriving SNPs from GBS data. Key aspects of any bioinformatic workflow include accuracy, reproducibility and reliability, yet few independent studies benchmark multiple workflows against biological ‘gold standards’, such as pedigree or SNP chip data, to assess these aspects. Here, we benchmark ten open-source SNP-calling workflows for GBS data to assess their accuracy and reproducibility. To do this, we generated GBS data for a cohort of 333 sheep that have also been genotyped on a 50k or 600k SNP chip. Furthermore, the cohort comprised 125 parent-offspring trios, and all individuals had multigenerational pedigree data. The SNPs called by the GBS workflows were compared back to these gold standards to assess the accuracy, reproducibility and reliability of the SNP callers. Looking at the bigger picture, we also derived genomic relationship matrices (GRMs) from all methods to compare the accuracy of the called SNPs for downstream biological applications, including relationship estimates among parents and progeny.
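The genomic relationship matrices mentioned at the end are conventionally built with VanRaden's first method; a minimal pure-Python sketch, assuming a 0/1/2-coded genotype matrix (individuals x SNPs) with at least one polymorphic SNP:

```python
def grm(genotypes):
    """VanRaden (2008) genomic relationship matrix, method 1:
    G = Z Z' / (2 * sum_j p_j (1 - p_j)), where Z is the genotype matrix
    centered by twice the observed allele frequencies p_j. Genotypes are
    coded 0/1/2 copies of the reference allele."""
    n = len(genotypes)       # individuals
    m = len(genotypes[0])    # SNPs
    p = [sum(row[j] for row in genotypes) / (2.0 * n) for j in range(m)]
    denom = 2.0 * sum(pj * (1 - pj) for pj in p)  # > 0 if any SNP is polymorphic
    Z = [[genotypes[i][j] - 2 * p[j] for j in range(m)] for i in range(n)]
    return [[sum(Z[i][k] * Z[j][k] for k in range(m)) / denom
             for j in range(n)]
            for i in range(n)]
```

Comparing the GRMs derived from each workflow's SNP calls against pedigree-based expectations (e.g., parent-offspring pairs near 0.5) is one concrete way to score workflows on the downstream task rather than on raw genotype concordance alone.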