- Wigard Kloosterman, UMC Utrecht, Netherlands
- Mark van Roosmalen, UMC Utrecht, Netherlands
- Ivo Renkens, UMC Utrecht, Netherlands
- Mircea Cretu Stancu, UMC Utrecht, Netherlands
- Marleen Nieboer, UMC Utrecht, Netherlands
- Sjors Middelkamp, UMC Utrecht, Netherlands
- Joep de Ligt, UMC Utrecht, Netherlands
- Giulia Pregno, University of Torino, Italy
- Daniela Giachino, University of Torino, Italy
- Giorgia Mandrile, University of Torino, Italy
- Jose Espejo Valle-Inclan, UMC Utrecht, Netherlands
- Jerome Korzelius, UMC Utrecht, Netherlands
- Ewart de Bruijn, UMC Utrecht, Netherlands
- Edwin Cuppen, UMC Utrecht, Netherlands
- Mike Talkowski, Harvard Medical School, United States
- Tobias Marschall, Saarland University, Germany
- Jeroen de Ridder, UMC Utrecht, Netherlands
Large capital investments are needed for second-generation sequencing equipment, which has led to the concentration of human genome sequencing efforts in specialized sequencing centers. An interesting alternative for human genome sequencing is the MinION nanopore sequencer, a small and low-cost device that can generate long sequence reads in real time. Here, we demonstrate sequencing of the genomes of two patients with congenital abnormalities using the MinION at 11x and 16x mean coverage, respectively. We developed a bioinformatic pipeline, nanoSV, to efficiently map genomic structural variants (SVs) from the nanopore data. We readily detected all de novo rearrangements involving multiple chromosomes and comprising complex chromothripsis events. Genome-wide surveillance of SVs revealed hundreds of novel variants that were missed in short-read data of the same sample. The majority of these map to repetitive or low-complexity regions in the human genome. Nanopore reads enabled efficient phasing of genetic variations, allowing the construction of genome-wide maps of phased SVs. Finally, we show that all de novo chromothripsis breakpoints occurred on paternal chromosomes, and we used this information to resolve the long-range structure of the chromothripsis. This work demonstrates the value of small-size and low-cost sequencing devices for human genome sequencing in future life sciences research and clinical diagnostics.
- Angelika Merkel, Centro Nacional de Análisis Genómico (CNAG-CRG), Spain
- Marcos Fernandez Callejo, Centro Nacional de Análisis Genómico (CNAG-CRG), Spain
- Eloi Casals, Centro Nacional de Análisis Genómico (CNAG-CRG), Spain
- Santiago Marco-Sola, Centro Nacional de Análisis Genómico (CNAG-CRG), Spain
- Ron Schuyler, Centro Nacional de Análisis Genómico (CNAG-CRG), Spain
- Simon C. Heath, Centro Nacional de Análisis Genómico (CNAG-CRG), Spain
DNA methylation is essential for normal development and cell differentiation in mammals. Acting in concert with other epigenetic marks to alter chromatin conformation, it has been implicated in the regulation of gene expression and a multitude of biological processes such as genomic imprinting, silencing of transposable elements, and disease, particularly cancer. Whole genome bisulfite sequencing (WGBS) is the gold standard for studying genome-wide DNA methylation at base pair resolution. However, NGS data from bisulfite-converted DNA require specific, yet efficient, processing to accommodate the cytosine-to-thymine conversion that allows methylated cytosines to be distinguished from unmethylated ones (methylated cytosines resist the conversion).
We present GemBS, a pipeline specifically designed for the high-throughput analysis of WGBS data, which has already been successfully applied to several small- and large-scale projects including BLUEPRINT, the European epigenome project (http://www.blueprint-epigenome.eu). GemBS is implemented with the ‘JIP Pipeline system’ (http://pyjip.readthedocs.io/en/latest/) and comprises two core modules: a BS-adapted version of ‘gem mapper’ (Marco-Sola et al., 2012) and the ‘BS_call’ genotype-methylation caller. QC metrics and mapping statistics are conveniently presented as HTML reports, and processing can be highly automated in connection with a Laboratory Information Management System (LIMS). Benchmarking demonstrates the high efficiency and speed of GemBS compared with other commonly used pipelines, such that a standard 30X WGBS data set can be processed in 5-8 h on a computing cluster.
We further demonstrate that, given sufficient coverage, GemBS can be used for SNP calling from WGBS.
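To make the cytosine-to-thymine logic described above concrete, the following is a minimal sketch of how a single bisulfite read can be interpreted against a reference. It is not the GemBS or BS_call implementation: the function name `call_methylation` is hypothetical, and the sketch assumes an ungapped, forward-strand alignment and ignores the possibility that a T in the read is a genuine C>T variant.

```python
def call_methylation(reference: str, read: str) -> list[tuple[int, str]]:
    """Classify every reference cytosine covered by an aligned forward-strand read.

    Assumption: `read` is aligned base-for-base to `reference` (no indels).
    """
    calls = []
    for pos, (ref_base, read_base) in enumerate(zip(reference, read)):
        if ref_base != "C":
            continue  # only reference cytosines are informative here
        if read_base == "C":
            calls.append((pos, "methylated"))    # cytosine resisted conversion
        elif read_base == "T":
            calls.append((pos, "unmethylated"))  # cytosine was converted to T
        else:
            calls.append((pos, "mismatch"))      # sequencing error or variant
    return calls


reference = "ACGTCCGA"
read = "ACGTTTGA"  # hypothetical bisulfite-converted read
calls = call_methylation(reference, read)
```

A real caller would aggregate such observations over many reads and both strands before estimating a methylation level per cytosine.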
- Zdislav Staševskij, Department of Biological DNA Modification, Institute of Biotechnology, Vilnius University, Lithuania
- Povilas Gibas, Department of Biological DNA Modification, Institute of Biotechnology, Vilnius University, Lithuania
- Juozas Gordevicius, Department of Biological DNA Modification, Institute of Biotechnology, Vilnius University, Lithuania
- Edita Kriukiene, Department of Biological DNA Modification, Institute of Biotechnology, Vilnius University, Lithuania
- Saulius Klimašauskas, Department of Biological DNA Modification, Institute of Biotechnology, Vilnius University, Lithuania
Modification of CG dinucleotides in DNA regulates gene function in vertebrates and is among the main drivers of human development. Aberrant DNA modifications have been implicated in the aetiology of complex diseases, such as cancer. We present a new method for genome-wide analysis of DNA methylation, TOP-seq (Tethered Oligonucleotide-Primed sequencing). TOP-seq provides comparable resolution at a fraction of the cost of the gold standard, whole genome bisulfite sequencing.
We have developed a data analysis workflow tailored specifically to TOP-seq that further reduces its cost. First, we use kernel density estimation to model the distribution of unmodified cytosines from the observed reads. Second, we adjust the obtained density estimates for possible sequence-specific biases. Third, we use publicly available genomic data and machine learning to convert the density distribution into absolute estimates of methylation level at any specific CG dinucleotide or region. We evaluate our approaches in various human tissues and cell lines and confirm that TOP-seq combined with these analytical approaches is the most suitable and cost-effective tool for population-wide studies of DNA modifications.
1. Stasevskij et al., 2017, Tethered Oligonucleotide-Primed sequencing (TOP-seq): a high resolution economical approach for DNA epigenome profiling. Molecular Cell, 65, 1–11 February 2, 2017. doi: http://dx.doi.org/10.1016/j.molcel.2016.12.012
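The first step of the workflow, kernel density estimation over read positions, can be illustrated with a small standard-library sketch. The Gaussian kernel, the fixed bandwidth, and the use of read start coordinates as input are assumptions made for illustration; the abstract does not specify these choices.

```python
import math


def gaussian_kde(positions, bandwidth):
    """Return a density function estimated from 1-D positions with a Gaussian kernel."""
    n = len(positions)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observed position.
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in positions
        )

    return density


# Hypothetical read start coordinates: three reads near 100, one near 300.
read_starts = [100, 102, 105, 300]
density = gaussian_kde(read_starts, bandwidth=10.0)

# The estimated density is highest where reads pile up.
dense, sparse, empty = density(102), density(300), density(200)
```

The bandwidth controls how far the signal of each read spreads to neighbouring positions, which is why a later bias-correction step, as described in the abstract, would operate on these smoothed estimates rather than on raw counts.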
- Daniel Cameron, Walter and Eliza Hall Institute of Medical Research, Australia
- Anthony Papenfuss, The Walter and Eliza Hall Institute of Medical Research, Australia
The identification of genomic rearrangements with high sensitivity and specificity using massively parallel sequencing remains a major challenge. Many methods have been developed for Illumina sequence data, with most methods using read depth analysis, read pair clustering, split read identification, assembly, or a combination of these approaches. Existing assembly-based methods perform either de novo assembly (e.g. cortex), targeted assembly based on previously identified candidates (e.g. manta, SVMerge, TIGRA), or perform windowed assembly to detect small events (e.g. DISCOVAR, SOAPindel).
Here we describe GRIDSS, the Genome Rearrangement IDentification Software Suite, composed of an assembler and a variant caller which combines assembly, split read and read pair evidence to identify genomic rearrangement breakpoints using a probabilistic model. Our novel genome-wide break-end assembly approach uses a positional de Bruijn graph to assemble reads that do not support the reference, prior to breakpoint identification or variant calling. By constraining the assembly of each read based on the mapping locations of the read or read pair, and encoding these assembly constraints directly within the assembly graph itself, a single genome-wide assembly can be performed. This approach enables robust assembly at a lower coverage than is possible with a traditional de Bruijn graph-based assembler.
GRIDSS recently won structural variant detection sub-challenge #5 of the ICGC-TCGA DREAM Somatic Mutation Calling challenge and has been extensively benchmarked against BreakDancer, cortex, CREST, DELLY, HYDRA, LUMPY, manta, Pindel, SOCRATES and TIGRA across a wide range of variant types, variant sizes, read depths, read lengths, and library fragment sizes. With the exceptions of low coverage data (≤8x) and large novel insertions detectable only by de novo assemblers, GRIDSS F-scores exceeded those of all other callers. On well-studied human cell line data, GRIDSS is able to achieve a false discovery rate less than half that of other methods, with no loss of sensitivity.
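The idea of encoding positional constraints in the assembly graph can be sketched in a few lines. This is a toy illustration and not the GRIDSS implementation: here each node is a (k-mer, position) pair with a single exact anchor position per read, whereas a real assembler would have to handle position intervals, evidence weights and error correction. The function name `positional_debruijn` is hypothetical.

```python
from collections import defaultdict


def positional_debruijn(reads, k):
    """Build edges between (kmer, position) nodes.

    reads: iterable of (sequence, anchor_position) pairs, where the anchor is
    the assumed genomic position of the first base of the read.
    """
    edges = defaultdict(set)
    for seq, start in reads:
        for i in range(len(seq) - k):
            src = (seq[i:i + k], start + i)
            dst = (seq[i + 1:i + 1 + k], start + i + 1)
            edges[src].add(dst)
    return edges


# Two hypothetical reads that share the k-mers ACG and CGT but are anchored
# at distant positions.
reads = [("ACGTAC", 10), ("ACGTTT", 50)]
graph = positional_debruijn(reads, k=3)
n_nodes_with_edges = len(graph)
```

In a classical de Bruijn graph the shared k-mer `CGT` would be one node with two successors (`GTA` and `GTT`), creating a branch. With positions attached, `("CGT", 11)` and `("CGT", 51)` are distinct nodes, so the two loci are assembled independently within one genome-wide graph.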
- Ghislain Durif, CNRS, France
- Laurent Modolo, CNRS, France
- Jeff Mold, Karolinska Institutet, Sweden
- Sophie Lambert-Lacroix, Grenoble University, France
- Franck Picard, CNRS, France
The development of high-throughput single-cell technologies now allows the investigation of the genome-wide diversity of transcription. This diversity has shown two faces. First, the expression dynamics (gene-to-gene variability) can be quantified more accurately, thanks to the measurement of lowly expressed genes. Second, the cell-to-cell variability is high, with a low proportion of cells expressing the same gene at the same time or level. These emerging patterns are very challenging from a statistical point of view, especially when it comes to representing single-cell expression data and providing a summarized view of them. PCA is one of the most powerful frameworks for providing a suitable representation of high-dimensional datasets, by searching for new axes that capture the most variability in the data. Unfortunately, classical PCA is based on Euclidean distances and projections that work poorly in the presence of over-dispersed counts that show zero-inflation. We propose a probabilistic PCA for single-cell expression data that relies on a sparse Gamma-Poisson model. This hierarchical model is inferred using a variational EM algorithm, and we revisit the selection of the number of axes using an integrated likelihood criterion. We show how this probabilistic framework induces a geometry that is suitable for single-cell data and produces a compression of the data that is very powerful for clustering purposes. Our method is compared with other standard representation methods such as tSNE, and we illustrate its performance on a project based on transcriptomic data of CD8+ T cells. Understanding the mechanisms of an adaptive immune response is of great interest for the creation of new vaccines. We show that our method allows a better understanding of the transcriptomic diversity of T cells, which constitutes a new challenge to better characterize the short- and long-term response to vaccination.
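Why a Gamma-Poisson model suits over-dispersed counts with many zeros can be seen from the marginal distribution of a single count. The sketch below is not the sparse hierarchical factor model of the abstract; it only computes textbook moments for one count X with rate λ ~ Gamma(shape, scale) and X | λ ~ Poisson(λ). The parameter values are hypothetical.

```python
import math


def gamma_poisson_summary(shape, scale):
    """Mean, variance and P(X = 0) of a Gamma-Poisson (negative binomial) count."""
    mean = shape * scale
    variance = mean * (1.0 + scale)       # always larger than the mean
    p_zero = (1.0 + scale) ** (-shape)    # marginal probability of a zero count
    return mean, variance, p_zero


mean, variance, p_zero = gamma_poisson_summary(shape=0.5, scale=4.0)

# A plain Poisson model with the same mean has variance == mean and far
# fewer zeros.
poisson_p_zero = math.exp(-mean)
```

With these values the count has mean 2 and variance 10, and roughly 45% zeros compared with about 14% under a Poisson model of equal mean, which is the kind of behaviour that Euclidean PCA handles poorly.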
- Dmitri Pervouchine, Centre for Genomic Regulation, Spain
Artificial intelligence has been used widely to learn biological outcomes from molecular data, which are often surveyed by next-generation sequencing. Famous examples include so-called splicing code approaches that predict condition-specific exon inclusion with high accuracy. Despite that, machine learning does not provide any insight into the molecular mechanisms underlying the biological process. We developed a method that combines machine learning with convolved NGS signals, which reflect spatio-temporal aspects of RNA processing, to predict splicing outcomes in two human cell lines. We show that convolutions of epigenetic signals with long-range RNA-RNA interactions enhance the significance of features represented by splicing factors and that RNA structure overall brings epigenetic signals to the place of action. These effects are also affected by the transcription rate, implying that co-transcriptional and post-transcriptional splicing are also sensitive to the local epigenetic context shaped by RNA structure. In sum, our results imply that the spatio-temporal organization of RNA processing is strongly affected by long-range RNA-RNA interactions.
- Michael Dunne, University of Oxford,
- Steven Kelly, University of Oxford,
We present OrthoFiller, a tool designed to address the problem of identifying “missing” genes: genes which are present in the nucleotide sequence of a genome but which are not present in its genome annotation. OrthoFiller leverages information from multiple related species to identify genes from conserved gene families which are present in a genome but have not been previously predicted. By simulating missing gene annotations in real sequence datasets from both plants and fungi we show that OrthoFiller succeeds in accurately identifying missing genes and improving genome annotations. Furthermore, we show that applying OrthoFiller to existing complete genome annotations can identify and recover substantial numbers of erroneously missing genes in these two sets of species.
- Basel Abu-Jamous, University of Oxford,
- Steven Kelly, University of Oxford,
Massive amounts of transcriptomic data have been, and are increasingly being, generated. These datasets can be leveraged to bring additional resolution to specific biological questions. Therefore, tools that can analyse multiple transcriptomic datasets collectively are needed to bridge the gap between the fast pace of data generation and the slower pace of data analysis. To this end, we developed an automated method, the binarisation of consensus partition matrices (Bi-CoPaM), which scrutinises multiple transcriptomic datasets to identify the clusters of genes that are consistently co-expressed in all of them. Such datasets can have different numbers of conditions and noise levels, be generated by different technologies (RNA-seq and microarrays), and even come from different but related species (e.g. human and mouse). Here we present a substantial enhancement to the Bi-CoPaM method that provides an automated error correction mechanism for cluster membership, removing genes that fit poorly to clusters as well as identifying genes that should have been included in clusters but were missed by the clustering algorithm. We demonstrate the utility of the method by application to three transcriptomic datasets from rice and maize leaves. This identified 22 clusters containing 2,567 genes in total, out of 9,864 genes submitted to the method. Strikingly, 957 of these genes (37%) would have been missed in the absence of error correction, demonstrating the significance of the proposed improvement. Importantly, the mechanism does not compromise cluster tightness (compactness) when including missed genes, which means that without it 37% of the genes would have been missed with no gain in cluster tightness. Finally, this application has revealed important biological insights into leaf cell-specific dynamics and regulation.
- Anne Friedrich, University of Strasbourg, France
- Jackson Peter, Université de Strasbourg / CNRS, France
- Matteo De Chiara, Institute of Research on Cancer and Ageing, France
- David Pflieger, Université de Strasbourg, France
- Jia-Xing Yue, Université Côte d'Azur, France
- Anders Bergstrom, Institute of Research on Cancer and Ageing, France
- Anastasie Sigwalt, Université de Strasbourg / CNRS, France
- Agnès Llored, Institute of Research on Cancer and Ageing, France
- Kelle Freel, Université de Strasbourg, France
- Stefan Engelen, Institut de Génomique - Genoscope, France
- Arnaud Lemainque, Institut de Génomique - Genoscope, France
- Patrick Wincker, Institut de Génomique - Genoscope, France
- Gianni Liti, Institute of Research on Cancer and Ageing, France
- Joseph Schacherer, University of Strasbourg, France
Comprehensive genetic variant maps are the nuts and bolts needed to dissect the genetic architecture of traits. Recently, we completely sequenced the genomes of 1,011 natural Saccharomyces cerevisiae isolates, providing an accurate picture of the assortment, weight, and impact of the various genetic variants shaping the species-wide phenotypic landscape across subpopulations. We found that large genomic differences, such as ploidy and aneuploidy variations concerning 13% and 20% of the S. cerevisiae isolates respectively, correlate with specific ecological origins and have a significant impact on fitness. The variability of the yeast pan-genome, composed of 4,940 core and 2,908 dispensable ORFs, is increased for human-related isolates. In contrast, wild lineages from South East Asia harbor the highest genetic diversity of the species, and genome analyses support a single out-of-China event. The spectrum of the single nucleotide variants, a total of 58,912,916 distributed over 1,625,809 polymorphic positions, is highly skewed towards an excess of low-frequency alleles. We show that while yeast linkage analyses have identified rare alleles with major effects, most of the variants we identified by genome-wide association correspond to copy number variants. Finally, loss-of-heterozygosity (LOH) events are extensive in S. cerevisiae, with an average of 21 regions covering ~5 Mb per genome. Our results clearly show that LOH events are an essential source of genetic diversity in yeast, which has a mostly asexual life cycle. Overall, our study provides a comprehensive view of the multiple genome evolution patterns across subpopulations within the S. cerevisiae species.
- Maite G. Barrón Aduriz, Institut de Biologia Evolutiva (IBE), Spain
- Jose Luis Villanueva-Cañas, Institut de Biologia Evolutiva (IBE), Spain
- Gabriel E Rech, Institut de Biologia Evolutiva (IBE), Spain
- Josefa Gonzalez, Institut de Biologia Evolutiva (IBE), Spain
Transposable elements (TEs) are an abundant, diverse, and active component of virtually all genomes sequenced to date. However, TEs have been largely ignored in genomic studies, mainly due to methodological limitations. The Drosophila melanogaster genome is one of the few in which transposable elements have been well annotated. As such, this species is an excellent model to understand the role of TEs in genome structure, function and evolution. We have analyzed the transposable element genomic content in 61 worldwide natural populations, 37 of them reported here for the first time. For 23 of these 61 natural populations, we have seasonal samples, meaning that the same population was sequenced at least twice in the same year. This comprehensive dataset, which includes all the natural populations available for this species, allows us to investigate the geographic and temporal scale of transposable element dynamics. Our results show that the site frequency spectrum is similar in all the populations analyzed: most of the transposable element insertions are either fixed or present at very low frequencies. The levels of geographical and seasonal variation in transposable element frequencies are similar, suggesting that both the spatial and temporal scales play a role in the dynamics of transposable elements. Moreover, we detect 203 transposable element insertions that are present at low frequencies in populations from the ancestral range of the species and at high frequencies in derived populations, and are thus likely to be adaptive. 80% of the candidate adaptive transposable element insertions are located inside genes or less than 1 kb from a gene. Most of the insertions located inside genes are in introns or UTRs, suggesting that they might affect gene expression. Overall, these results suggest that transposable elements play a substantial role in adaptive evolution.
- Melanie Parejo, University of Bern, Switzerland
- David Wragg, The Roslin Institute,
- Alain Vignal, Institut National de la Recherche Agronomique, France
- Peter Neumann, Institute of Bee Health, Switzerland
- Markus Neuditschko, Swiss Bee Research Centre, Switzerland
Human-mediated selection has left signatures in the genomes of many domesticated animals. The European Dark Honey Bee, Apis mellifera mellifera, has been selected by apiculturists for centuries. Using whole-genome sequence information, we investigated selection signatures in two geographically isolated A. m. mellifera subpopulations with different breeding regimes (Switzerland: N=39 and Savoy, France: N=17), which were found to be genetically differentiated in a previous study. To confidently identify signatures of selection between the two subpopulations, we combined three different test statistics calculated in windows of 2 kb (fixation index FST, cross-population extended haplotype homozygosity XP-EHH, and cross-population composite likelihood ratio XP-CLR) into a composite selection score (CSS). Applying a stringent false discovery rate (FDR=0.01), we identified 14 significant selection signatures distributed across 5 chromosomes. These regions contain 13 genes, which are associated with multiple molecular and biological functions, among others regulation of transcription, mushroom body development, transmembrane transport and peptide hormone processing. Of particular interest is a selection signature on chromosome 1, which corresponds to the wnt4 gene. The family of Wnt genes is conserved across the animal kingdom and has a variety of functions; in Drosophila melanogaster, Wnt4 alleles have been associated with differential wing, cross vein and abdominal phenotypes. A. m. mellifera is selected for various traits, including wing morphological characters; therefore, the identified selection signature could be a result of the different breeding practices applied in the two regions.
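One way to combine several per-window statistics into a single composite score is a rank-based scheme: rank the windows for each statistic, map fractional ranks to normal quantiles, average the quantiles, and report -log10 of the upper-tail p-value. The sketch below follows that general recipe as an assumption; the abstract does not state how its CSS was computed, ties and the sign of XP-EHH are ignored, and all input values are invented for illustration.

```python
import math
from statistics import NormalDist


def fractional_ranks(values):
    """Rank values from 1..n (ascending) and scale to the open interval (0, 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank / (len(values) + 1)
    return ranks


def composite_selection_score(*stat_columns):
    """Combine m per-window statistics into one score per window."""
    m = len(stat_columns)
    std_normal = NormalDist()
    z_rows = [
        [std_normal.inv_cdf(r) for r in fractional_ranks(col)]
        for col in stat_columns
    ]
    # The mean of m independent standard normal scores has variance 1/m.
    mean_dist = NormalDist(0.0, math.sqrt(1.0 / m))
    scores = []
    for z_values in zip(*z_rows):
        mean_z = sum(z_values) / m
        p_value = 1.0 - mean_dist.cdf(mean_z)
        scores.append(-math.log10(p_value))
    return scores


# Hypothetical values for four genomic windows; window 2 is extreme in all three.
fst = [0.01, 0.02, 0.30, 0.03]
xpehh = [0.1, -0.2, 2.5, 0.3]
xpclr = [1.0, 0.5, 12.0, 2.0]
scores = composite_selection_score(fst, xpehh, xpclr)
top_window = scores.index(max(scores))
```

Working on ranks rather than raw values puts statistics with very different scales on a common footing before they are averaged.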
- Hyun Goo, Ajou University, Korea, Rep.
- Ji-Hye Choi, Ajou University, Korea, Rep.
The Cancer Genome Atlas (TCGA) is a comprehensive database of multi-layered cancer genome profiles. Large-scale data collection inevitably generates batch effects, which are introduced by differences in processing from sample collection to data generation. However, the batch effect on sequence variation and its characteristics have not been studied extensively.
We systematically evaluated the batch effect on sequence variation in pan-cancer TCGA data covering 19 cancer types, which revealed 999 batch-biased variants with statistical significance (P<0.00001, Fisher’s exact test, false discovery rate ≤0.0027). We found that most of the batch-biased variants were attributable to sample plate IDs. The batch-biased variants had a unique mutational spectrum with frequent indel-type mutations, and preferentially occurred at sites of erroneous sequencing variant calls. Comparison of the batch-biased variants with the unbiased variants revealed frequent batch biases at sites harbouring long homopolymer runs. Moreover, we found a higher frequency of batch-biased variants at splicing sites, which had a unique consensus motif sequence, i.e. ‘TTDTTTAGTT’. As a proof of concept, we demonstrated that some of the batch-biased variants could be found in well-known cancer driver genes and may thereby lead to misinterpretation of mutation profiles.
We suggest that our strategy for identifying batch-biased variants and characterizing their sequence patterns might be useful in eliminating erroneous variants and interpreting sequence profiles correctly.
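The per-variant test for batch bias can be illustrated with a two-sided Fisher's exact test on a 2x2 table. The sketch below uses only the standard library; the table layout (carriers versus non-carriers, one plate versus all other plates) and the counts are assumptions for illustration, not the authors' actual data or pipeline.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Hypothetical variant: called in 9 of 10 samples on one plate,
# but in only 1 of 10 samples on the remaining plates.
p_value = fisher_exact_two_sided(9, 1, 1, 9)
```

For this table the p-value is 202/184756, about 0.0011. Screening every variant this way requires a multiple-testing correction, which is consistent with the false discovery rate reported in the abstract.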
- Christopher Illingworth, University of Cambridge,
The influenza virus is an important human pathogen, infecting between 5 and 15% of the world's population each year. NGS data describing the evolution of the virus within a single host can now be collected, allowing for new insights into the evolution of the virus within single infections, and across transmission events. However, the nature of NGS data raises key challenges for quantifying evolutionary events. Firstly, NGS data is comprised of short reads, so that critical linkages between variants in the influenza genome cannot be observed directly. Secondly, NGS data is subject to multiple sources of error, so that quantifying the extent of fit between an evolutionary model and a given dataset is not a trivial task. I here describe recent advances in bioinformatic methods for exploiting NGS data, which allow evolutionary processes in influenza to be quantified as never before. In an example application I describe the use of NGS data to study reassortment in the influenza virus. Reassortment is a process via which segments of different influenza viruses which co-infect a single cell form viruses with novel genetic combinations, as occurred for example in the creation of the 2009 pandemic strain. Using NGS data collected from a human challenge study, in which volunteers were infected with the influenza virus, I show results demonstrating that the effective rate of reassortment is substantially slower in human hosts than has previously been suggested by animal studies. This result shows the potential of NGS data to answer previously intractable questions about the evolution of viral populations.
- Fatima Heinicke, Department of Medical Genetics, Norway
- Xiangfu Zhong, Department of Medical Genetics, Norway
- Siri Flåm, Department of Medical Genetics, Norway
- Albert Pla, Department of Medical Genetics, Norway
- Simon Rayner, Department of Medical Genetics, Norway
- Benedicte A, Department of Medical Genetics, Norway
Rheumatoid arthritis (RA) is a chronic, inflammatory autoimmune disorder that causes lifelong, irreversible joint destruction. Leukocytes, such as T and B cells, play a primary role in RA pathogenesis. The efficacy of treatments is variable, and finding the optimal treatment relies on a trial-and-error approach. Revealing biological markers for treatment response may improve the quality of life for many RA patients.
Studying a patient’s genetic background may be a key to understanding treatment outcomes. To this end, we are studying signature differences of mature microRNA (miRNA), a class of small non-coding RNA that regulates gene expression at a post-transcriptional level, and their isoforms (isomiRs) in leukocyte subpopulations separated from the blood of RA patients at different time points during treatment and of healthy controls. Dysregulation of miRNA signatures in RA has been reported in previous studies.
Here we report a workflow for constructing small RNA NGS libraries (using our in-house analysis pipeline) and the subsequent preliminary analysis of the data sets in terms of mature miRNA and isomiR population studies. Our optimized experimental protocol includes simultaneous RNA and DNA extraction followed by sequencing library preparation. The ability to co-extract both RNA and DNA permits more expansive characterization of cell populations in terms of additional NGS mRNA and reduced representation bisulfite sequencing (RRBS) studies.
Our analysis indicates that the miRNA pool consists of a combination of isomiRs and mature miRNAs, which may impact downstream differential expression analysis based on mature sequences alone. Additionally, further (initial) analysis suggests that variation in components of mature miRNA and isomiRs may also be a consequence of the sequencing library preparation kit and not solely biological in origin.
- Joseph Kawash, Rutgers University, United States
- Sean Smith, Rutgers University, United States
- Andrey Grigoriev, Rutgers University, United States
Gorilla beringei, comprising the mountain gorilla (G. b. beringei) and the eastern lowland gorilla (G. b. graueri), is an endangered great ape species with dwindling populations that inhabit discrete areas of central Africa. Conservation efforts have prevented further decline of gorilla numbers. However, already small populations may lose genetic diversity, giving rise to adverse phenotypes and threatening long-term species survival. We performed a comparative analysis of structural variation (SV) in two gorilla sub-species, G. b. beringei and G. b. graueri, to identify possible genetic causes of divergent physical characteristics and disease phenotypes.
Few studies have used whole genome sequencing to comparatively analyze G. beringei; previous work has focused specifically on single nucleotide variants. However, SVs have the potential to account for more variation in terms of the number of nucleotides. We expand on previous research by employing a paired-end and split-read integrated approach to analyze SVs of 12 gorilla samples from the two sub-species of G. beringei. The intersection of coding regions and SVs identified novel potential causes of population variation. To predict the phenotypic effect of these SVs, Gene Ontology enrichment analysis was performed on the affected genes and combined with Human Phenotype Ontology annotations.
Several distinct phenotypic associations were identified in each of the populations, with a comparatively limited divergence in physiology. Many of these phenotype associations correspond to documented physical features used to distinguish between the two gorilla sub-species. These include SV-affected genome regions responsible for cranial, facial, and dental development. We also found uniquely enriched genetic evidence for the causation of disease and abnormality, such as syndactyly. Previously reported in mountain gorilla populations, this adverse phenotype is used as an indicator of insufficient population size and inbreeding. This work provides a more complete insight into the genetic health of G. beringei and helps elucidate the causation of distinguishing phenotypes during speciation.
- Maria Rigau, Spanish National Cancer Research Centre (CNIO), Spain
- David Juan, Spanish National Cancer Research Centre (CNIO), Spain
- Alfonso Valencia, Spanish National Cancer Research Centre (CNIO), Spain
- Daniel Rico, Institute of Cellular Medicine,
Next generation sequencing technologies have allowed the fine-mapping of copy number variation (CNV) in human populations, but the functional effect of CNVs partially affecting different genic regions is poorly understood.
Through the meta-analysis of five state-of-the-art CNV datasets we evaluated the impact of CNVs on protein-coding genes taking into account their evolutionary age. We found that exons of evolutionarily old genes are impoverished in CNVs. In contrast, we observed that introns of ancient genes are significantly enriched in CNVs, which are mostly deletions. Focusing on the characterisation of intronic deletions, we also observed that these deletions overlap with intronic regulatory regions less often than expected, suggesting that the location of deletions is restricted by the functional relevance of the affected regions.
Previous work has shown that transcriptional splicing is affected by gene length and exon-intron differences in GC content, with longer introns requiring higher GC content differences. Strikingly, we observed that introns with deletions are GC richer and that the removal of deleted regions significantly decreases the average GC content of the intron. Consequently, intronic deletions simultaneously shorten the intron and accentuate its GC difference from the flanking exons, presumably improving recognition by the splicing machinery. The analysis of mechanisms involved in these deletions uncovered transposable elements as likely drivers of many of these events.
Using cancer data available from Pan-Cancer, we also found a significant enrichment of somatic copy number alterations (SCNAs) within introns. However, we observe that somatic intronic deletions in tumours follow a different behaviour in terms of overlap with regulatory features and GC content.
This work unravels a previously unexpected variation in intronic length among individuals that might have important consequences for the regulation, transcription and splicing of genes, potentially relevant in the interpretation of the influence of CNVs in disease and SCNAs in cancer.
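The GC argument made above for intronic deletions reduces to simple arithmetic, shown here on a toy sequence. The sequence, coordinates and function names are hypothetical and serve only to show that removing a GC-rich segment both shortens an intron and lowers its average GC content.

```python
def gc_content(seq):
    """Fraction of G or C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def delete_region(seq, start, end):
    """Remove the half-open interval [start, end) from the sequence."""
    return seq[:start] + seq[end:]


# Toy intron: AT-rich flanks around a GC-rich segment that gets deleted.
intron = "ATATATAT" + "GCGCGCGC" + "ATATATAT"
before = gc_content(intron)
shortened = delete_region(intron, 8, 16)
after = gc_content(shortened)
```

Here the intron shrinks from 24 to 16 bases and its GC fraction drops from one third to zero, which in a real gene would widen the GC gap relative to GC-rich flanking exons.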
- Sergi Villatoro, Institut de Biotecnologia i de Biomedicina,
- Marta Puig, Institut de Biotecnologia i de Biomedicina,
- Carla Giner-Delgado, Institut de Biotecnologia i de Biomedicina,
- Magdalena Gayà-Vidal, Institut de Biotecnologia i de Biomedicina,
- Jon Lerga-Jaso, Institut de Biotecnologia i de Biomedicina,
- David Vicente-Salvador, Institut de Biotecnologia i de Biomedicina,
- Roser Zaurin, Institut de Biotecnologia i de Biomedicina,
- Sarai Pacheco, Institut de Biotecnologia i de Biomedicina,
- Isaac Noguera, Institut de Biotecnologia i de Biomedicina,
- Jack F., Digital Biology Center,
- George Karlin-Neumann, Digital Biology Center,
- Alejandra Delprat, Institut de Biotecnologia i de Biomedicina,
- Marina Laplana, Institut de Biotecnologia i de Biomedicina,
- David Izquierdo, Institut de Biotecnologia i de Biomedicina,
- Mario Caceres, ICREA and Institut de Biotecnologia i de Biomedicina,
Next-generation sequencing (NGS) has emerged as a powerful tool to determine the entire sequence of genomes. However, not all types of genetic variants are identified with the same sensitivity and accuracy. In particular, in the case of inversions, their balanced nature and the presence in many cases of highly identical inverted repeats (IRs) at the breakpoints make their detection especially challenging, even with the newest long-read technologies. As part of the INVFEST project we have developed several methods, based on PCR, such as inverse PCR (iPCR) and droplet-digital PCR (ddPCR), or on multiplex ligation-dependent probe amplification (MLPA), to reliably validate and genotype inversions with either simple breakpoints or breakpoints mediated by IRs of up to 100 kb. Together with the analysis of available sequences, these techniques have allowed us to analyze more than 200 inversion predictions in humans. First, we have found that a large fraction of inversion predictions are false positives caused by problems in the genome assembly, mapping errors due to sequence differences between individuals, or sequencing artifacts. Second, by genotyping 45 of these inversions in 550 individuals from diverse populations, we have generated the most accurate data set on human inversions so far. This has shown that most known inversions are missed in current genome sequencing projects and that, when they are genotyped, the error rates can be as high as 31%. In addition, it has been possible to establish the population distribution and evolutionary history of the inversions, uncover several inversions with functional effects on genes, and show that most inversions with IRs are recurrent and are not linked to SNPs. Therefore, the availability of genotyping techniques to quickly screen a large number of samples opens the door to studying, for the first time, the association of inversions with complex phenotypes and disease susceptibilities.
- Giulia Babbi, Biocomputing Group Bologna, Italy
- Pier Luigi, University of Bologna, Italy
- Giuseppe Profiti, Università di Bologna, Italy
- Samuele Bovo, Bologna Biocomputing Group, Italy
- Castrense Savojardo, Bologna Biocomputing Group, Italy
- Rita Casadio, UNIBO, Italy
Background. Modern sequencing technologies allow dissecting the genetic component of phenotypic traits, with a specific focus on diseases. As gene-disease associations are determined, the number of diseases associated with multiple genes is increasing. The molecular mechanisms underlying pathogenesis are often uncharacterized; investigating the functional relations among genes involved in the same disease may give fundamental indications about disease onset and development.
Results. We developed eDGAR, a database collecting and organizing data on gene/disease associations as derived from OMIM (www.omim.org), Humsavar (www.uniprot.org/docs/humsavar) and ClinVar (www.ncbi.nlm.nih.gov/clinvar/).
eDGAR lists 2672 diseases related to 3658 different genes, for a total of 5729 gene-disease associations. eDGAR provides precomputed results for 621 polygenic and heterogeneous diseases (2600 genes, 3678 associations), analyzing the associated list of genes and describing their features. These include physical and/or regulatory interactions between pairs of genes (reported for 530 diseases), retrieved from PDB (www.rcsb.org/pdb), BIOGRID (thebiogrid.org) and STRING (string-db.org/), as well as co-occurrence in structural complexes listed in CORUM (mips.helmholtz-muenchen.de/corum/) and in a recently published paper [1]. For 612 diseases, at least one pair of genes shares GO terms (www.geneontology.org/) and/or KEGG (www.genome.jp/kegg/) and/or REACTOME (www.reactome.org/) pathways. Moreover, eDGAR reports enriched functional annotations computed with NET-GE [2], detecting statistically significant functional terms for 606 diseases. Regulatory interactions are derived from TRRUST (www.grnpedia.org/trrust/). The main novelty is the localization on chromosomes and/or co-localization in neighboring loci.
Conclusions. eDGAR offers a resource to address the question of why different genes are related to the same disease by investigating the molecular mechanisms and the functional features that are related to a specific set of genes. eDGAR is available at: edgar.biocomp.unibo.it
1. Havugimana et al. Cell. 2012;150(5):1068-81.
2. Bovo et al. Bioinformatics. 2016;32(22):3489-3491.
- Kathrin Trappe, Robert Koch Institute, Germany
- Enrico Seiler, Robert Koch Institute, Germany
- Tobias Marschall, Saarland University / Max Planck Institute for Informatics, Germany
- Bernhard Renard, Robert Koch Institute, Germany
Horizontal gene transfer (HGT) is a powerful mechanism that allows bacteria to directly transfer genetic material between distant species. In contrast to vertical inheritance from one generation to the next, HGT occurs between individuals of the same generation and can involve fully functional genes or even complete genomic islands. Bacteria can thereby instantly acquire new traits such as antibiotic resistance or pathogenic toxins.
Established bioinformatics approaches for HGT detection explore phylogenetic trees or genome composition inconsistencies due to a different pattern of, e.g., GC or k-mer content of the inserted foreign sequence, and focus mainly on past HGT events. These methods usually require fully sequenced and annotated genomes, and hence, do not use Next-Generation Sequencing (NGS) data directly for detection. A mapping-based approach using NGS data offers the chance to detect HGT events in the sequenced HGT organism early in analysis, even when the HGT organism has not been sequenced before. This can be important in outbreak scenarios of emerging pathogens involving HGT.
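As an illustration of the composition-based idea mentioned above, the following Python sketch flags genome windows whose GC content deviates strongly from the genome-wide window average. It is a simplified stand-in for established composition methods, not part of Donald or Daisy; the window size, the z-score cutoff and the toy genome are assumptions chosen for the example.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def window_gc(seq: str, size: int, step: int) -> List[Tuple[int, float]]:
    """GC fraction of each window, returned as (start, gc) pairs."""
    seq = seq.upper()
    result = []
    for start in range(0, len(seq) - size + 1, step):
        window = seq[start:start + size]
        result.append((start, (window.count("G") + window.count("C")) / size))
    return result


def atypical_windows(seq: str, size: int = 100, step: int = 100,
                     z_cutoff: float = 2.0) -> List[Tuple[int, float]]:
    """Windows whose GC content lies at least z_cutoff standard deviations
    away from the mean over all windows (cutoff is an assumed value)."""
    windows = window_gc(seq, size, step)
    values = [gc for _, gc in windows]
    if len(values) < 2:
        return []
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return []
    return [(start, gc) for start, gc in windows if abs(gc - mu) / sd >= z_cutoff]


# Hypothetical toy genome: AT-rich host sequence with a GC-rich insert.
host = "ATGCATAT" * 100        # 800 bp, about 25% GC
foreign = "GCGGCCGC" * 25      # 200 bp, 100% GC
genome = host + foreign + host

print(atypical_windows(genome))
```

Such composition scans need an assembled genome, which is one reason the mapping-based approach works directly from reads instead.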
We propose the tools Donald and Daisy for mapping-based HGT detection from NGS data. Donald leverages metagenomic profiling tools to identify candidate acceptor genome references (the parent genome of the HGT organism acquiring the HGT sequence) and donor genome references (the parent donating the HGT sequence). Subsequently, Daisy determines specific HGT regions, relying on established structural variant detection methods proven on human NGS data.
Results on simulated and real data show that Donald successfully identifies acceptor and donor candidates as such and is able to distinguish non-HGT samples as true negatives. Daisy detects HGT regions with base pair resolution, and outperforms alternative approaches using a genome assembly of the reads. We see our approach as a powerful complement for comprehensive analysis of bacterial genomes in the context of NGS data.
- Ernesto Picardi, University of Bari & IBBE-CNR, Italy
- Anna Maria, University of Bari & IBBE-CNR, Italy
- Graziano Pesole, University of Bari & IBBE-CNR, Italy
A-to-I RNA editing in humans is carried out by members of the ADAR family of enzymes, which act on double-stranded RNAs and can alter codon identity, splice sites or base-pairing interactions within higher-order RNA structures. Recoding RNA editing is essential for normal brain development and regulates important functional properties of neurotransmitter receptors. Recently, we released a comprehensive human inosinome Atlas including more than 4.6 million A-to-I events (PMID: 26449202) and found that genes undergoing RNA editing were consistently enriched in genes involved in neurological disorders and cancer, confirming the relevant biological role of RNA editing in humans (PMID: 27587585). Although investigations in bulk tissues are extremely useful, they do not capture the transcriptomic heterogeneity of the multiple cell types that make up a tissue. To characterize the complexity of RNA editing at single-cell resolution, we investigated this phenomenon in single cells from adult human cortex obtained from living subjects, in which transcriptome diversity had already been surveyed by single-cell RNA sequencing (scRNA-seq) (PMID: 26060301). Using our REDIportal database, we explored inosinome profiles in 466 cortex cells. Individual scRNA-seq data were quality checked with FASTQC, and poor-quality regions at 3’ ends were trimmed using the trim_galore tool. Cleaned reads were then mapped onto the human reference genome with STAR, and RNA editing candidates were detected using our REDItools (PMID: 23742983).
We found that the number of A-to-I events was strongly correlated with the number of RNA-seq reads. RNA editing levels per cell were bimodally distributed and distinguished between major brain cell types such as neurons, astrocytes and oligodendrocytes, underlining the cell-specific nature of RNA editing. Interestingly, recoding RNA editing events were mainly detectable in neurons, underscoring the primary role of A-to-I editing in modulating brain functions through key modifications in neurotransmitter receptors.
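A per-cell editing level of the kind discussed above can be sketched in Python as follows. The sketch assumes the common convention that the editing level at a site is the fraction of G-supporting reads among reads showing A or G; the minimum coverage, the site identifiers and the cell names are hypothetical, and the code does not reproduce the REDItools implementation.

```python
from typing import Dict, Optional, Tuple

# site id -> (reads supporting A, reads supporting G); toy counts only
SiteCounts = Dict[str, Tuple[int, int]]


def editing_level(a_reads: int, g_reads: int,
                  min_coverage: int = 10) -> Optional[float]:
    """G / (A + G) at one site, or None if coverage is below min_coverage."""
    depth = a_reads + g_reads
    if depth < min_coverage:
        return None
    return g_reads / depth


def mean_editing_per_cell(cells: Dict[str, SiteCounts],
                          min_coverage: int = 10) -> Dict[str, float]:
    """Average editing level over the sufficiently covered sites of each cell."""
    result = {}
    for cell, sites in cells.items():
        levels = []
        for a_reads, g_reads in sites.values():
            level = editing_level(a_reads, g_reads, min_coverage)
            if level is not None:
                levels.append(level)
        if levels:
            result[cell] = sum(levels) / len(levels)
    return result


# Hypothetical counts for two cells.
cells = {
    "neuron_1": {"site_a": (10, 10), "site_b": (30, 10), "site_c": (2, 1)},
    "astrocyte_1": {"site_a": (19, 1)},
}
print(mean_editing_per_cell(cells))
```

Because the number of detectable events scales with read depth, a coverage filter such as `min_coverage` keeps poorly covered sites from dominating the per-cell average.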
- Tobias P. Loka, Robert Koch Institute, Germany
- Simon H. Tausch, Robert Koch Institute, Germany
- Piotr W. Dabrowski, Robert Koch Institute, Germany
- Andreas Nitsche, Robert Koch Institute, Germany
- Bernhard Y., Robert Koch Institute, Germany
Human genomic data contain highly sensitive patient-related information that should be protected. Besides the sequencing of human samples, sensitive information may occur as a by-product in a variety of other data, e.g. human virus sequencing data and human microbiome studies. Such data could therefore be used for re-identification of individuals or other attacks on privacy. A read filtering step (or human host removal) is thus essential but often neglected or performed in an unsatisfactory manner due to lack of awareness or time constraints. Here we present a new real-time read filtering tool, HiLive RETIRED, which reliably and irrecoverably removes human information from next generation sequencing data while the sequencing machine is still running. In doing so, sensitive information is directly excluded from the original output data of the sequencing machine (base call files). Our tool achieves high accuracy (>99% sensitivity and >99.9% specificity) and supports the use of foreground reference genomes to further reduce the loss of relevant data. To demonstrate the benefits for genomic privacy, we showed that the re-identification of individuals by analyzing their sequencing data was no longer successful with the filtered data. Our concept can easily be integrated into standard NGS workflows and is therefore applicable for most sequencing facilities. The subsequent analyses are neither delayed nor negatively affected in quality. This makes our approach well-suited for time-critical and high-quality analyses as they are needed in clinical applications and disease outbreak scenarios. The resulting level of genomic privacy meets critical data protection standards, for instance those of the European Union and its ‘privacy by design’ principle.
- Olivier Quenez, Normandie Univ, France
- Kilan Le, Normandie Univ, France
- Stéphanie David, Normandie Univ, France
- Camille Charbonnier, Normandie Univ, France
- FREX Consortium, France
- Gael Nicolas, Normandie Univ, France
- Dominique Campion, Normandie Univ, France
- Anne Rovelet-Lecrux, Normandie Univ, France
With the advent of Whole Exome Sequencing (WES), new tools are constantly being developed to mine and extract Copy Number Variations (CNVs). Each tool is developed depending on the targets and the aim of the study, and is therefore set up and trained on a specific data set.
Our team focuses on the genetics of neuropsychiatric disorders. We recruited and performed WES on 522 patients with early-onset Alzheimer disease (EOAD) and 24 patients with Primary Brain Calcification (PBC). We also obtained access to WES data for 584 ethnically matched controls through the FREX consortium.
We tested different software tools to extract CNVs from these WES data. We finally chose CANOES, a tool based on a Hidden Markov Model using depth of coverage. To reduce both false positive and false negative results, we made multiple adjustments to the target definitions and removed low-coverage targets. To determine the sensitivity of CANOES, we used 42 samples for which both WES and high-resolution aCGH data were available. For comparison purposes, we focused on CNVs that could be efficiently detected by both methods: we focused on rare, genic CNVs, and excluded CNVs located in segmental duplications and CNVs smaller than 8 kb. Of the 67 CNVs detected by aCGH, 59 (88%) were detected by CANOES.
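The sensitivity figure above is the fraction of aCGH CNVs recovered by the WES caller. The Python sketch below shows one way such a comparison could be set up; the matching rule (same chromosome, same CNV type, any positional overlap) and the toy coordinates are assumptions made for illustration, not the criteria used in the study.

```python
from typing import List, Tuple

# (chromosome, start, end, type) with half-open coordinates; toy representation
Cnv = Tuple[str, int, int, str]


def same_event(a: Cnv, b: Cnv) -> bool:
    """Assumed matching rule: same chromosome and type, overlapping intervals."""
    return a[0] == b[0] and a[3] == b[3] and a[1] < b[2] and b[1] < a[2]


def sensitivity(reference: List[Cnv], calls: List[Cnv]) -> float:
    """Fraction of reference CNVs matched by at least one call."""
    if not reference:
        raise ValueError("reference set is empty")
    detected = sum(1 for ref in reference
                   if any(same_event(ref, call) for call in calls))
    return detected / len(reference)


# Hypothetical example: two of three reference CNVs are recovered.
acgh = [("chr17", 1000, 9000, "DUP"), ("chr8", 500, 20000, "DEL"),
        ("chr2", 100, 9000, "DEL")]
wes = [("chr17", 2000, 9500, "DUP"), ("chr8", 400, 21000, "DEL"),
       ("chr2", 100, 9000, "DUP")]
print(round(sensitivity(acgh, wes), 2))

# The figure reported in the text: 59 of 67 aCGH CNVs detected.
print(round(100 * 59 / 67))
```

A stricter rule, such as a minimum reciprocal overlap, would lower the number of matches, so the choice of matching criterion should be reported alongside the sensitivity.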
Screening of the 522 EOAD patients and 584 controls revealed a duplication of the 17q21.31 locus in 4 cases, which was absent in controls. Clinical, brain imaging and functional data enabled us to define this rearrangement as the genetic basis of a novel clinico-neuropathological entity.
For the 24 patients with PBC, after a first screening, we focused on the known disease-causing gene SLC20A2 by using the “genotype” function of CANOES. We found two monoexonic deletions that had previously been missed by CANOES, but this also led to the detection of a false duplication in another patient.
- Lenka Piherova, Institute of Inherited Metabolic Disorders,
- Viktor Stranecky, 1st Faculty of Medicine, Czech Republic
- Hana Hartmannova, 1st Faculty of Medicine, Czech Republic
- Katerina Hodanova, 1st Faculty of Medicine, Czech Republic
- Miloš Kubánek, IKEM, Czech Republic
- Stanislav Kmoch, 1st Faculty of Medicine, Czech Republic
Familial dilated cardiomyopathy (DCM) and left ventricular non-compaction (LVNC) are rare inherited disorders associated with an increased risk of ventricular arrhythmias and sudden cardiac death. Genetic testing in affected individuals may play an important role in therapy determination and in risk assessment for their family members.
We performed clinical genetic counseling followed by molecular genetic analysis using NGS (HiSeq, Illumina, USA) in our cohort of probands with a clinical diagnosis of familial DCM or LVNC. We modified the library preparation by adding enrichment for mitochondrially encoded genes. Using this approach we were also able to find multi-exon deletions and duplications.
Potentially disease-causing variants were found mainly in sarcomeric and desmosomal genes.
The identification of potentially causative mutations in desmosomal genes resulted in the recognition of a predominantly left ventricular form of arrhythmogenic cardiomyopathy (ACM) in patients with a clinical diagnosis of familial DCM. Left ventricular ACM and DCM are phenotypically difficult to discriminate, in agreement with previous observations; variants in the DSP gene are a frequent cause of predominantly left ventricular ACM. The finding of a mutation in mtDNA likewise led to a change of clinical diagnosis.
The genetic stratification allowed a modification of the therapy in affected individuals and cardiological follow up in relatives at risk.
Supported by scientific grants: 15-27682A, NF-CZ11-PDP-3-003-2014
- Zobayer Alam, Memorial University of Newfoundland, Canada
- Julissa Roncal, Memorial University of Newfoundland, Canada
- Lourdes Pena-Castillo, Memorial University of Newfoundland, Canada
Partridgeberry (Vaccinium vitis-idaea L.), one of the least studied crops in the Ericaceae family, has seen its worldwide demand increase dramatically due to its numerous health benefits. Newfoundland and Labrador (NL) exports more than 100 tons of wild partridgeberry. Commercial cultivation of partridgeberry in NL could be developed to increase production of this berry; however, genetic markers are needed to facilitate the selection of berries with high total phenolic content (TPC) and antioxidant capacity (AC). Previous partridgeberry studies have shown TPC and AC variation in partridgeberry grown under different environmental conditions. Here, we used Genotyping-by-Sequencing (GBS) to analyze the genetic variation of 56 partridgeberry samples from various locations in NL. We identified 1,586 high-quality putative single nucleotide polymorphisms (SNPs) using the UNEAK pipeline. To search for an association between adaptation to environmental conditions and population-level genetic diversity, we obtained the correlation of each of the identified SNPs with eight environmental variables. We also searched for an association between the identified SNPs and the TPC and AC of the 56 wild partridgeberry fruit samples. We found 260 SNPs likely to be associated with at least one of the environmental variables, TPC or AC. To gain insight into the function of the genomic sequences containing these SNPs, we performed a sequence-based functional annotation and identified protein-coding sequences with functional roles likely to be related to biotic and abiotic stress response, nitrate transport, and, most interestingly, phenolic compound biosynthesis.
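The SNP-by-variable correlation step can be sketched as follows. The abstract does not state which correlation measure or cutoff was used, so this Python example assumes a Pearson correlation on genotypes coded as 0, 1 or 2 copies of the alternative allele and an arbitrary cutoff of 0.8; the SNP names and values are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Optional, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation, or None when either variable has no variance."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)


def candidate_snps(genotypes: Dict[str, List[int]], variable: List[float],
                   cutoff: float = 0.8) -> List[str]:
    """SNPs whose absolute correlation with the variable reaches the cutoff."""
    hits = []
    for snp, calls in genotypes.items():
        r = pearson(calls, variable)
        if r is not None and abs(r) >= cutoff:
            hits.append(snp)
    return hits


# Hypothetical genotypes (0/1/2 alternative alleles) for six samples
# and one environmental variable measured at their sampling sites.
genotypes = {"snp_1": [0, 0, 1, 1, 2, 2], "snp_2": [1, 1, 1, 1, 1, 1]}
temperature = [2.0, 2.5, 4.0, 4.5, 6.0, 6.5]
print(candidate_snps(genotypes, temperature))
```

With 1,586 SNPs tested against several variables, a real analysis would also need a multiple-testing correction, which this sketch omits.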
- Gaia Andreoletti, University of California, United States
- Roger A Hoskins, University of California, United States
- John Moult, University of Maryland, United States
- Steven E Brenner, University of California, United States
- The CAGI Participants, University of California, United States
The Critical Assessment of Genome Interpretation (CAGI, \'kā-jē\) is a community experiment to objectively assess computational methods for predicting the phenotypic impacts of genomic variation. CAGI participants are provided with genetic variants and make predictions of the resulting phenotype. These predictions are evaluated against experimental characterizations by independent assessors.
The fourth CAGI experiment (2016) included 11 challenges reflecting: non-synonymous variants and their biochemical impact measured by targeted assays; noncoding regulatory variants and their impact on gene expression; research exomes for complex traits prediction; personal genomes and trait profiles; and clinical sequences and associated referring indications.
Notable discoveries and general themes emerged throughout CAGI. The independent assessment found that top missense prediction methods are highly statistically significant, but individual variant accuracy is limited. Moreover, missense methods tend to correlate better with each other than with experiments (for reasons that may reflect the predictive methods and the assays themselves). There might be particular potential for missense interpretation at the extreme of the distribution. Structure-based missense methods excel in a few cases, while evolutionary-based methods have more consistent performance. Bespoke approaches often enhance performance.
In the clinical studies, predictors were able to identify causal variants that were overlooked by the clinical laboratory, and it appears that physicians may not always order the most relevant genetic test for their patients. CAGI data show that running multiple uncalibrated methods and considering their consensus often provides undue confidence in their correlation; we therefore advise against running multiple uncalibrated variant interpretation tools in clinical analysis.
The results showed that predicting complex traits from exomes is fraught. Interpretation of non-coding variants shows promise but is not at the level of missense. Beyond this, creating a genetic study that provides a reliable gold standard is remarkably difficult. However, there were notable improvements in the ability to match genomes to trait profiles.
- Laia Carreté, Centre for genomic regulation, Spain
- Toni Gabaldón, Centre for genomic regulation, Spain
Infections caused by pathogenic yeasts are of increasing medical importance. Candida glabrata is one of the most common pathogenic fungi in humans, ranking as the second causative agent of candidiasis worldwide. Despite its name, C. glabrata belongs to the Nakaseomyces, a clade more closely related to the baker’s yeast Saccharomyces cerevisiae than to the model pathogen Candida albicans. Our understanding of the evolution of C. glabrata at the species level is limited to analyses of natural variation restricted to a few loci. These studies have shown the existence of genetically distinct clades and generally suggested clonal, geographically structured populations. Geographically structured populations are also found in the model pathogen C. albicans and in S. cerevisiae. C. albicans is tightly associated with humans and can undergo a parasexual cycle, and S. cerevisiae has been domesticated and can undergo a full sexual cycle, usually involving self-mating. In contrast, C. glabrata has been described as an asexual species despite the presence of homologs of S. cerevisiae mating genes. In order to understand the recent evolution of this important opportunistic pathogen, we analyzed the genomes of 33 different clinical and colonizing C. glabrata isolates sampled from different human body sites and globally distributed locations. Our results show 7 deeply divergent clades, which show recent geographical dispersion and large within-clade genomic and phenotypic differences. We show compelling evidence of recent admixture and of purifying selection on mating genes. Altogether, these findings support the existence of a sexual cycle, and suggest that humans are only a secondary niche for this pathogenic yeast.
- Swann Floc'Hlay, Institut de Biologie de l'Ecole normale supérieure, France
- David Garfield, EMBL, Germany
- Bingqing Zhao, EMBL, Germany
- Morgane Thomas-Chollier, Institut de Biologie de l'Ecole Normale Supérieure, France
- Eileen Furlong, EMBL, Germany
- Denis Thieffry, IBEns (UMR CNRS 8197 - INSERM 1024), France
Recent high-throughput sequencing studies between individuals of a given species have revealed extensive variation in gene expression as a consequence of segregating genetic variation within the population. Most of this regulatory genetic variation is in non-coding DNA, presumably disrupting the function of enhancer elements. However, how genetic variants disrupt transcriptional regulation remains very poorly understood. My thesis aims to gain a mechanistic understanding of how natural genetic variation affects multiple layers of transcriptional regulation, using hybrid embryos of genetically distinct Drosophila isolated from a wild population. The use of hybrid individuals offers a powerful approach to dissect cis- versus trans-regulatory mutations by obtaining allele-specific information across multiple steps of transcriptional regulation (e.g. allele-specific ATAC-seq, ChIP-seq, RNA-seq data).
The Furlong lab (EMBL, Heidelberg) has recently performed the largest F1 embryo collections that we are aware of: F1 embryos were collected from 10 different intra-species crosses, at three crucial windows of embryogenesis. Together, the combined datasets spanning multiple layers of transcriptional regulation at multiple stages of development represent approximately 400 samples.
We will particularly focus on the regulation of the early steps of mesoderm specification and muscle cell differentiation, occurring during stages 8 to 12 in fly embryos. The integration of various levels of regulation will allow us to disentangle the influence of genetic variation on transcriptional regulation and should highlight novel interactions occurring during embryonic development. This should lead to a more extensive view of the genetic bases influencing transcriptional regulation by simultaneously integrating data from gene expression, enhancer/promoter activity, transcription factor occupancy and chromatin state.
Ultimately, the resulting knowledge will be used to develop and refine a predictive dynamical model for mesoderm specification and extend it to account for the main events controlling the differentiation and diversification of muscle and heart cells.
- Joseph Kawash, Rutgers University, United States
- Spyros Karaiskos, Rutgers University, United States
- Sean Smith, Rutgers University, United States
- Andrey Grigoriev, Rutgers University, United States
We analyzed genome variants across several representatives of the order Proboscidea including the woolly mammoth, an ancient species of megafauna that inhabited the upper Arctic regions until the demise of the most recent population about 3500 years ago. The theories surrounding mammoth extinction commonly involve climate change and human predation. Although the woolly mammoth did not survive, other representatives of Proboscidea, the Asian and African elephants, have survived to the present day. Analysis of sister species allows for a genetic comparison identifying mutations unique to the woolly mammoth that possibly contributed to its extinction.
Typically, comparative genomics solely involves analysis of single nucleotide variants (SNVs). We extended the Proboscidea genome comparisons to include SNVs, larger structural variants (SVs) and indels. We found that multiple fixed, derived variants in the mammoth genomes occur in protein-coding regions, with copy number variants (CNVs) affecting parts of genes or even complete coding units. Interestingly, CNVs frequently affected immunity- and defense-related genes. We found an abundance of fixed, derived woolly mammoth variants in genes involved in lipid metabolism, maintenance of core body temperature and skeletal features. Further, these variants showed striking links to CpG islands. We discuss these observations in the context of Proboscidea genome evolution and how they could shed light on the ultimate demise of mammoths.
While detection of SNVs is well established in comparative genomics, tools for finding and comparing SVs are still catching up. We describe a novel approach for graphical representation of SVs found across multiple genomes using an interactive graph display. Such representation allows for depiction of structural genome divergence between Proboscidea genomes as an example and can be utilized in other comparative studies. We present illustrative cases where such graph representation can highlight interesting evolutionary patterns while improving the identification and annotation of SVs.
- Simon H. Tausch, Robert Koch Institute, Germany
- Jakob Schulze, Robert Koch Institute, Germany
- Andreas Andrusch, Robert Koch Institute, Germany
- Tobias P. Loka, Robert Koch Institute, Germany
- Jeanette Klenner, Robert Koch Institute, Germany
- Piotr W. Dabrowski, Robert Koch Institute, Germany
- Bernhard Y. Renard, Robert Koch Institute, Germany
- Andreas Nitsche, Robert Koch Institute, Germany
In recent years, Next Generation Sequencing has been utilized in time-critical applications such as pathogen diagnostics with promising results. Yet, long turnaround times had to be accepted to generate sufficient data, as the analysis was performed sequentially after the sequencing was finished. In addition, the interpretation of results can be hindered by various types of contamination, clinically irrelevant sequences, and the sheer amount and complexity of the data.
We designed and implemented a real-time diagnostics pipeline which allows the detection of pathogens from clinical samples up to five days before the sequencing procedure is even finished. To achieve this, we adapted the core algorithm of HiLive, a real-time read mapper, while enhancing its accuracy for our use case. Furthermore, common contaminants, low-entropy areas, and sequences of widespread, non-pathogenic organisms are automatically marked beforehand using NGS datasets from healthy humans as a baseline. The results are visualized in an interactive taxonomic tree, providing the user with several measures regarding the relevance of each identified potential pathogen.
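The marking of low-entropy areas can be illustrated with a Shannon entropy calculation over the base composition of a sequence. The abstract does not describe how the pipeline defines or detects such areas, so the Python sketch below is only one plausible reading: the helper names and the threshold of 1.0 bit are assumptions.

```python
from collections import Counter
from math import log2


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per base) of the base composition of seq."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def is_low_entropy(seq: str, threshold: float = 1.0) -> bool:
    """True when the composition entropy falls below an assumed threshold."""
    return shannon_entropy(seq) < threshold


# A homopolymer carries no information; an even ACGT mix reaches 2 bits.
print(shannon_entropy("AAAAAAAA"), shannon_entropy("ACGT" * 4))
print(is_low_entropy("AAAAAAAAAT"), is_low_entropy("ACGTTGCAAC"))
```

Reads mapping to such regions tend to match many unrelated references, which is why they are worth marking before hits are interpreted.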
We applied the pipeline to a human plasma sample spiked with Vaccinia virus, Yellow fever virus, Mumps virus, Rift Valley fever virus, and Mammalian orthoreovirus, which was then sequenced on an Illumina HiSeq. All spiked agents could already be detected after only 12% of the complete sequencing procedure. While we also found a large number of other sequences, these were correctly marked as clinically irrelevant in the resulting visualization, allowing the user to obtain the correct assessment of the situation at first glance.