- Kavya Kopparapu, Thomas Jefferson High School for Science and Technology, United States
- Neeyanth Kopparapu, Thomas Jefferson High School for Science and Technology, United States
Diabetic retinopathy (DR) is the leading cause of blindness among working-age adults and affects over 10 million people worldwide. Many adults, particularly in developing countries, remain undiagnosed due to limited access to the expensive tools needed for diagnosis. Smartphone technology, notably, is cheap, readily available nearly everywhere, and has potential to aid in diagnostics. We developed the Eyeagnosis system, which utilizes machine learning techniques and a smartphone camera for the automatic screening of DR. Specifically, we designed a neural network architecture that uses residual neural networks with cyclic pooling to automatically diagnose DR from retinal images. We were able to obtain an accuracy of 78.9%, sensitivity of 0.675, specificity of 0.812, and area under the receiver operating characteristic curve (ROC) of 0.752. These results are statistically comparable to the results of a group of 74 optometrists.
Additionally, we created a smartphone application which was able to take photos, send them to a server, and display the server’s diagnosis. With a custom-designed 3D-printed lens attachment, Eyeagnosis was able to take focused retinal images, as shown through testing on dilated eyes. These results demonstrate that Eyeagnosis is capable of assisting doctors in diagnosing DR in the field.
- Martin Steinegger, Max-Planck Institute for Biophysical Chemistry, Germany
- Johannes Soeding, Max-Planck Institute for Biophysical Chemistry, Germany
Sequencing costs have dropped much faster than Moore's law in the past decade, and sensitive sequence searching has become the main bottleneck in the analysis of large (meta)genomic datasets. While previous methods sacrificed sensitivity for speed gains, the parallelized, open-source software MMseqs2 overcomes this trade-off: In three-iteration profile searches it reaches 50% higher sensitivity than BLAST at 83-fold speed and the same sensitivity as PSI-BLAST at 270 times its speed. MMseqs2 therefore offers great potential to increase the fraction of annotatable (meta)genomic sequences.
- Elad Segev, Israel Holon Institute of Technology, Israel
- Zohar Pasternak, The Hebrew University of Jerusalem, Israel
- Tom Ben Sasson, The Open University of Israel,
- Edouard Jurlevitch, The Hebrew University of Jerusalem, Israel
Currently, the identification of specific microbial groups, either taxonomic or phenotypic, is mostly done using multi-locus typing where up to 16 genes are selected as molecular markers and compared between isolates, either by their presence/absence or sequence. Therefore, an algorithm taking advantage of the available high throughput sequencing data for finding optimal group markers is of great value to ecological and medical research. We developed a new method, called DiffGene, and employed it to find optimal marker genes for three different groups of pathogenic bacteria. We found that:
1. The combined presence of the tdcR gene, a transcription activator, and absence of sspH2, an invasion plasmid antigen, is specific for E. coli bacteria.
2. The presence of hlyD, a multidrug resistance efflux pump gene, in an E. coli genome is specific for extra-intestinal pathogenic E. coli (ExPEC).
3. The presence of eltA, a cholera enterotoxin gene, in an E. coli genome is specific for enterotoxigenic E. coli (ETEC).
The DiffGene implementation successfully found, in silico, genes that can differentiate E. coli from all other bacteria (including Shigella) and specifically type different pathogroups of E. coli, in which bacteria are phylogenetically very close but phenotypically varied. DiffGene allows for a quick and inexpensive characterization of bacteria as long as sequenced genomes representing the group of study are available. Thus, efficient assignment of microbial isolates into ecological, pathological or taxonomic groups can be achieved.
- Bouchra Chaouni, University Mohamed V Rabat, United States
- Sanae Raoui, University Mohamed V Rabat,
- Hassan Ghazal, Polydisciplinary Faculty of Nador,
- Linda Amaral, Marine Biological Laboratory, United States
- Elhoussine Zaid, University Mohamed V Rabat,
We need to advance the understanding of how emerging diseases are responding to global changes (global warming, acidification, coastal urbanization, and pollution). The relevance of microbes to ocean resiliency and marine resource management is becoming undeniable. Furthermore, marine ecosystems are among the most attractive topics in metagenomics. We propose to use this approach to explore the microbial diversity of Moroccan marine ecosystems. The microbial communities will be evaluated with respect to species richness, evenness, and diversity. During the June solstice of 2014, we collected samples from seven Moroccan marine sites at the same time: three on the Mediterranean, three on the Atlantic and one on the Strait of Gibraltar. The microbiomes of these seven sites were determined by metagenomic analysis and compared in the context of physicochemical parameters, in order to suggest which environmental factors influence marine microbial diversity. Results show that our sites were enriched in a large number of identified bacteria representing 27 phyla, 56 classes, 98 orders, 162 families, 229 genera and 80 species, plus a high percentage of unclassified microbes. Additional sampling campaigns took place on the June solstices of 2015 and 2016; these samples are in the process of being sequenced and analyzed. The purpose of Ocean Sampling Day (OSD) is to collect baseline information on the diversity of marine microorganisms so that changes in ocean ecosystems can be detected by comparing samples from the same locations over time. This “same day and over time” sampling strategy provides a mechanism for detecting changes in microbial populations that may reveal clues to ocean health in response to environmental changes such as climate change, global warming, and marine acidification.
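The abstract names richness, evenness and diversity without saying which indices were used. As a minimal sketch, assuming the common choices of the Shannon index for diversity and Pielou's index for evenness (both assumptions, not stated above), the three quantities could be computed from per-taxon read counts as follows. The function name `diversity_summary` is hypothetical.

```python
import math

def diversity_summary(counts):
    """Return (richness, Shannon diversity, Pielou evenness) for taxon counts.

    Assumption: Shannon (natural log) and Pielou indices are used; the
    abstract does not specify which indices were actually applied.
    """
    present = [c for c in counts if c > 0]      # ignore taxa with no reads
    total = sum(present)
    richness = len(present)                     # number of observed taxa
    if total == 0:
        return 0, 0.0, 0.0
    # Shannon index: H = -sum(p_i * ln p_i)
    shannon = -sum((c / total) * math.log(c / total) for c in present)
    # Pielou evenness: H / ln(richness); undefined for a single taxon, use 0.0
    evenness = shannon / math.log(richness) if richness > 1 else 0.0
    return richness, shannon, evenness

# Four equally abundant taxa give maximal evenness.
richness, shannon, evenness = diversity_summary([10, 10, 10, 10])
```

Comparing these values across the seven sites, alongside the physicochemical measurements, is one simple way to relate environmental factors to diversity.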
- Judith Neukamm, Institute for Archaeological Sciences,
- Alexander Peltzer, Institute for Archaeological Sciences,
- Wolfgang Haak, Max Planck Institute for the Science of Human History,
- Johannes Krause, Institute for Archaeological Sciences,
- Kay Nieselt, Integrative Transcriptomics,
Despite the availability of modern next generation sequencing technologies and therefore nuclear human genomes, the sequencing and analysis of mitochondrial DNA (mtDNA) is still common. Especially in the research field of ancient DNA and the context of population genetics, mtDNA is often the only proxy available to study extinct populations and their relationship with modern populations. As a consequence, many population genetic studies rely on the analysis of mtDNA.
A plethora of methods exist for the analysis of mtDNA, addressing questions in population genetics, phylogeny and other areas. However, these tools typically rely on different file formats and often require manual interaction with the data for downstream analysis. Ultimately, these steps can be cumbersome, especially for non-bioinformaticians, resulting in an increased risk of errors during the analysis.
To tackle these issues, we present MitoBench, a workbench to interactively analyze and visualize mitochondrial genomes with a focus on population genetics. The graphical user interface is kept simple to accommodate even users without prior knowledge of computational methods. Furthermore, it shows additional information such as metadata and statistics. Currently, MitoBench offers automatic file conversion tools to connect the workbench with existing analysis methods such as BEAST, Arlequin and others.
In future, we will also link MitoBench to a large database for mitochondrial reference data. Our ultimate aim is to provide a central reference database of population genetics studies on mitochondrial data that can be easily accessed via the workbench.
- Atul Kamboj, Children Hospital at Westmead, Australia
- Claus Hallwirth, Children’s Medical Research Institute, Australia
- Ian Alexander, Children’s Medical Research Institute, Australia
- Belinda Kramer, Children Hospital at Westmead, Australia
Human genetic diseases have been successfully treated in gene therapy trials using viral vectors to integrate functional copies of defective genes into patient genomic DNA to reverse the disease phenotype. Since genomic integration of clinical vectors is individually unpredictable and tends to cluster around transcription start sites (TSSs) of genes or favour genomic regions that are being actively transcribed during vector administration, the major risk associated with this strategy is a vector integration event leading to dysregulation of an oncogene and subsequent development of a malignancy such as leukaemia. Analysis of vector integration sites (ISs) is therefore critically important in monitoring the safety of patients undergoing gene therapy. In addition, IS identification is an important investigative tool for mapping viral elements in mutagenesis screens to elucidate gene function. We have developed a vector IS analysis pipeline (Ub-ISAP) that utilises a UNIX-based workflow for automated IS identification and annotation of both single- and paired-end sequencing reads. Ub-ISAP uses user-defined vector sequences to extract reads that identify putative ISs and, after trimming and alignment to the genome, produces a list of unique ISs, classified relative to RefSeq genes as TSS-proximal, intragenic or intergenic. Ub-ISAP was validated by re-analysing a set of raw reads derived from DNA libraries prepared using human stem cells transduced with clinical grade vector for which published IS data were available. Ub-ISAP successfully extracted 255,502 and 61,067 unique ISs from data sets containing 8,706,511 and 7,503,456 raw reads, respectively, marginally more than were recovered using the published methodology, with concordance demonstrated between analysis methodologies in the identity of the top 20 genes targeted for integration by the vector.
Ub-ISAP is a reliable, time- and memory-efficient UNIX-based pipeline, effective for the generation of large IS datasets to assess the safety of integrating vectors in clinical settings, with broader applications in cancer research.
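The final annotation step described above (labelling each unique IS as TSS-proximal, intragenic or intergenic) can be illustrated with a small sketch. This is not the Ub-ISAP code: the `Gene` record, the `classify_site` function, the 1,000 bp `tss_window` and the example genes are all assumptions made for illustration, and the abstract does not state which distance threshold the pipeline uses.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int   # leftmost genomic coordinate
    end: int     # rightmost genomic coordinate
    strand: str  # "+" or "-"

    @property
    def tss(self) -> int:
        # The transcription start site is the 5' end on the gene's strand.
        return self.start if self.strand == "+" else self.end

def classify_site(chrom, pos, genes, tss_window=1000):
    """Label an integration site relative to a list of annotated genes.

    tss_window is a hypothetical threshold, not a documented Ub-ISAP value.
    """
    same_chrom = [g for g in genes if g.chrom == chrom]
    # TSS proximity takes precedence over being inside a gene body.
    if any(abs(pos - g.tss) <= tss_window for g in same_chrom):
        return "TSS-proximal"
    if any(g.start <= pos <= g.end for g in same_chrom):
        return "intragenic"
    return "intergenic"

# Hypothetical annotation used only to exercise the function.
genes = [
    Gene("GENE_A", "chr1", 10_000, 20_000, "+"),
    Gene("GENE_B", "chr1", 50_000, 60_000, "-"),
]
labels = [classify_site("chr1", p, genes) for p in (10_500, 15_000, 30_000)]
```

A linear scan over genes is adequate for a sketch; for hundreds of thousands of sites an interval index or a sorted-coordinate search would be the more practical choice.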
- Samuel Liburd, Virgin Islands, United States
- Marc Boumedine, University of the Virgin Islands, United States
Viruses serve as one of the most efficient vectors for death and disease, killing millions worldwide and mutating uncontrollably. Because of this, I hypothesized that it was possible to biologically classify viruses using machine learning, with attributes based mainly on their genetic characteristics. To do so, I downloaded and analysed 511 (+) ssRNA virus genomes for unique genetic characteristics that identify them. The six virus families used were Flaviviridae, Potyviridae, Betaflexiviridae, Virgaviridae, Picornaviridae, and Tombusviridae. Among these virus families, 12 different features were manually chosen to classify the viruses: genome length; adenine, guanine, cytosine, and thymine counts; the number of start codons; G-C and A-T percentages; host organisms; the number of proteins encoded; and the number, if any, of segmentations in the genome. These attributes were extracted using Python and tested by the Correlation-based Feature Subset Eval and Best First algorithms in the Attribute Selector of WEKA (a data mining package). The chosen attributes (genome length, A, C, and G counts, G-C percentage, host organism, and number of proteins formed) were then run through the J48 classification algorithm. This algorithm used 66% of the genomic datasets to create a decision tree model. After creating the model, the program then classified the remaining datasets based on the created decision tree. Using this decision tree, 99.4% of the remaining viruses were accurately classified. This accuracy level shows that it is possible to classify viruses using machine learning and a new, mainly genetic, analysis method. Using machine learning and data mining techniques like classification, and eventually more dynamic techniques like neural networks, could lead to a powerful tool to monitor and update the changes to viral genomes.
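The sequence-derived attributes listed above can be computed with a few lines of Python. The sketch below is not the original extraction script. The function name `genome_features` is hypothetical, and counting every `ATG` in the forward strand as a start codon is an assumption, since the abstract does not say how start codons were counted. Host organism, protein count and segmentation come from annotation rather than sequence and are left out.

```python
def genome_features(seq: str) -> dict:
    """Compute simple composition features from a genome sequence string."""
    # RNA genomes may be written with U; normalise to the DNA alphabet.
    seq = seq.upper().replace("U", "T")
    length = len(seq)
    counts = {base: seq.count(base) for base in "ACGT"}
    gc = counts["G"] + counts["C"]
    at = counts["A"] + counts["T"]
    return {
        "length": length,
        "A": counts["A"],
        "C": counts["C"],
        "G": counts["G"],
        "T": counts["T"],
        # Assumption: every ATG on the forward strand counts as a start codon.
        "start_codons": seq.count("ATG"),
        "gc_percent": 100.0 * gc / length if length else 0.0,
        "at_percent": 100.0 * at / length if length else 0.0,
    }

# Toy sequence, not a real viral genome.
features = genome_features("AUGCCGAUGA")
```

One row of such features per genome, written to a CSV or ARFF file, is the kind of table a decision tree learner such as J48 takes as input.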
This research was funded through the UVI NSF HBCU-UP SURE grant #1137472.
- Momiao Xiong, University of Texas School of Public Health, United States
- Kelin Xu, Fudan University, Switzerland
- Jin Li, Fudan University, Switzerland
Epistasis plays an essential role in understanding regulation mechanisms and is an essential component of the genetic architecture of gene expression. However, interaction analysis of gene expression remains fundamentally unexplored. RNA-seq measurements generate large expression variability and collectively create the observed position-level read count curves. A single number for measuring gene expression, which is widely used in the analysis of microarray-measured gene expression, is highly unlikely to sufficiently account for large expression variation across the gene. Simultaneously analyzing epistatic architecture using RNA-seq and NGS data poses enormous challenges. To meet these challenges, we develop a nonlinear functional regression model (FRGM) for epistasis analysis with RNA-seq data. Instead of testing the interaction of all possible pairs of SNPs, the FRGM takes a gene as the basic unit for epistasis analysis: it tests for the interaction of all possible pairs of genes and uses all the information that can be accessed to collectively test the interaction between all possible pairs of SNPs within two genome regions. By large-scale simulations, we demonstrate that the proposed FRGM for epistasis analysis achieves the correct type 1 error and has higher power to detect interactions between genes than existing methods. The proposed methods are applied to the RNA-seq and WGS data from the 1000 Genomes Project. The numbers of pairs of significantly interacting genes after Bonferroni correction identified using FRGM, RPKM and DESeq were 16,2361, 260 and 51, respectively, from the 350 European samples. The proposed FRGM for epistasis analysis of RNA-seq data can capture isoform and position-level information and will have broad application. Both simulations and real data analysis highlight the potential for the FRGM to be a good choice for epistasis analysis with sequencing data.
- Paul Simion, ISEM, France
- Khalid Belkhir, ISEM, France
- Hervé Philippe, CBTM, France
- Clémentine François, ISEM, France
- Julien Veyssier, Université de Montpellier,
- Jochen Rink, Max Planck Institute of Molecular Cell Biology and Genetics,
- Michaël Manuel, Université Pierre et Marie Curie,
- Max Telford, University College London,
Contamination between nucleic acid samples handled in parallel has always been a potential problem in molecular biology. These cross contaminations can arise at virtually any step of a sequencing experiment, from sample handling to library preparation and sequencing. High-throughput sequencing methods combined with PCR amplification now allow the sequencing of even extremely low amounts of nucleic acids, unfortunately including potential cross contaminants. We created a multi-platform software tool (CroCo) designed to detect cross contaminations in assembled transcriptomes of different species. We used an expression level quantification approach to determine the true source of every transcript among all datasets involved in a given sequencing experiment. CroCo then outputs assembled transcriptomes from which the detected cross contaminations have been filtered out. We show that the presence of cross contaminations in RNA-seq datasets is pervasive, sometimes at a high level, and can be deleterious for a range of downstream analyses, especially for phylogenetic inference, even at phylogenomic scale. Cross contaminations can theoretically be deleterious for many aspects of comparative genomics (e.g. presence of genes, phylogenetics, SNP calling, reconciliation methods) as they represent a confounding source of incongruence for gene phylogenies. Their behavior can indeed mimic various evolutionary processes such as gene duplication, lateral gene transfer, or introgression. CroCo helps alleviate these potentially serious problems.
- Albert Pla, University of Oslo, Norway
- Xiangfu Zhong, University of Oslo, Norway
- Fatima Heinicke, University of Oslo, Norway
- Simon Rayner, University of Oslo, Norway
MicroRNAs (miRNAs) are a family of ~22-nucleotide small RNAs regulating gene expression at the post-transcriptional level. While many miRNA target prediction algorithms exist, their predictions are often inconsistent, and they assume that it is the miRNA seed region (located in the first 8 to 9 nucleotides) that defines almost all important interactions between a miRNA and its target. However, recent studies indicate that the entire miRNA can have a role in targeting, reducing the utility of available tools.
Here we present miRAW, a deep-learning based approach for predicting miRNA targets, which uses all miRNA and mRNA target nucleotides as inputs and automatically learns a set of feature descriptors that are uninhibited by limits in current knowledge regarding the targeting process. We used data from more than 150,000 experimentally validated Homo sapiens miRNA targets to build a training data set, and implemented and trained a deep neural network to distinguish positive and negative miRNA targets. To automatically learn the features describing miRNA-mRNA interaction, the network follows the shape of an autoencoder; the learned features are then classified using a feed-forward neural network. To obtain a satisfactory predictive model we trained the network using a cross-validation methodology, then used the best resulting network to analyze potential target sites in the 3’UTR of the gene.
In a comparison with independent datasets, miRAW consistently outperformed existing prediction methods (mirSVR, TargetScan, microTDS and PITA), obtaining higher accuracy, precision, sensitivity and specificity. Our findings support arguments that the whole miRNA should be analyzed for target prediction. Additional experiments in which miRNA fragments of different sizes were systematically used for training the network also showed that extending the analysis beyond the miRNA seed region improves target prediction. Our results also demonstrate the ability of deep learning algorithms to learn their own feature descriptors without being constrained by human knowledge.
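Using "all miRNA and mRNA target nucleotides as inputs" implies some fixed-length numeric encoding of both sequences. The abstract does not describe the encoding miRAW uses, so the following is only a sketch of one plausible choice: one-hot encoding with zero padding. The function names and the padded lengths (30 and 40 nucleotides) are assumptions, and the two sequences are illustrative strings only.

```python
NUCLEOTIDES = "ACGU"

def one_hot(seq, length):
    """One-hot encode a sequence into `length` rows of 4 values, zero-padded."""
    seq = seq.upper().replace("T", "U")   # accept DNA-style input
    rows = []
    for i in range(length):
        row = [0.0, 0.0, 0.0, 0.0]
        if i < len(seq):
            j = NUCLEOTIDES.find(seq[i])
            if j >= 0:                    # unknown symbols (e.g. N) stay all-zero
                row[j] = 1.0
        rows.append(row)
    return rows

def encode_pair(mirna, site, mirna_len=30, site_len=40):
    """Flatten the miRNA and candidate site encodings into one input vector.

    mirna_len and site_len are hypothetical padding lengths.
    """
    flat = []
    for row in one_hot(mirna, mirna_len) + one_hot(site, site_len):
        flat.extend(row)
    return flat

# Illustrative sequences only.
vector = encode_pair("UGAGGUAGUAGGUUGUAUAGUU", "ACTG")
```

Because every position of the miRNA contributes to the vector, a network trained on such inputs is free to weight nucleotides outside the seed region, which is the point the abstract argues for.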
- Yue Huang, Boston Children's Hospital and Harvard Medical School, United States
- Javier Couto, Boston Children's Hospital, United States
- Dennis Konczyk, Boston Children's Hospital, United States
- Jeremy Goss, Boston Children's Hospital, United States
- Steven Fishman, Boston Children's Hospital, United States
- John Mulliken, Boston Children's Hospital, United States
- Matthew Warman, Boston Children's Hospital and Harvard Medical School, United States
- Arin Greene, Boston Children's Hospital, United States
Extracranial arteriovenous malformation (AVM) is a congenital vascular anomaly that causes disfigurement and tissue destruction. AVMs are difficult to treat, and no pharmacological therapy is currently available. In order to identify the genetic basis of AVM, we applied MosaicHunter to WES data of 10 AVM specimens, with average depths-of-coverage ranging from 224-fold to 281-fold. MosaicHunter (http://mosaichunter.cbi.pku.edu.cn) is a bioinformatics tool that can detect somatic mutations in control-free NGS data, by incorporating a Bayesian-based genotyper and a series of stringent error filters. Our filtering strategy considered novel and predicted deleterious variants that were present in ≥ 5 reads and at a frequency ≥ 2%. When applied, this strategy detected somatic mosaic variants in 8 genes among the 10 specimens. Importantly, only 1 of the 8 genes was mutant in multiple specimens. This gene was MAP2K1; four different specimens had MAP2K1 mutations. We then reduced the filtering criteria to ≥ 3 reads and frequency ≥ 1%, which detected MAP2K1 mutations in 2 additional specimens. Finally, we looked for small indels in MAP2K1 and identified a 15-bp in-frame deletion in another specimen. In total, 7 of 10 specimens contained somatic MAP2K1 mutations. MAP2K1 encodes MAP-extracellular signal-regulated kinase 1 (MEK1) which plays an important role in the RAS/MAPK signaling pathway. We confirmed the presence of these mutations by performing droplet digital PCR, and validated they were somatic in 8 patients for whom we had a paired DNA sample from blood or saliva. In summary, we demonstrate that MosaicHunter could detect causal somatic mutations for AVM with high precision, when using WES with depth-of-coverage of ~ 200-fold and modestly stringent filtering criteria (≥ 5 variant reads and variant frequency ≥ 2%). An advantage of MosaicHunter is that it does not require sequencing data from paired unaffected tissues when searching for somatic mutations.
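The two read-support thresholds quoted above (≥ 5 variant reads at ≥ 2% frequency, then relaxed to ≥ 3 reads at ≥ 1%) amount to a simple post-filter on candidate variants. The sketch below shows only that thresholding step; it does not reproduce MosaicHunter's Bayesian genotyper or its error filters, nor the novelty and deleteriousness criteria. The `Candidate` record, the `passes` function and the example variants are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    gene: str
    alt_reads: int   # reads supporting the variant allele
    depth: int       # total reads covering the position

    @property
    def frequency(self) -> float:
        return self.alt_reads / self.depth if self.depth else 0.0

def passes(c, min_reads=5, min_freq=0.02):
    """Apply read-count and allele-frequency thresholds to one candidate."""
    return c.alt_reads >= min_reads and c.frequency >= min_freq

# Made-up candidates at roughly the depth range described in the abstract.
candidates = [
    Candidate("GENE_X", 12, 250),   # 4.8%: passes both settings
    Candidate("GENE_Y", 4, 260),    # ~1.5%: too few reads for the strict setting
    Candidate("GENE_Z", 6, 400),    # 1.5%: enough reads, frequency too low
    Candidate("GENE_W", 2, 240),    # ~0.8%: fails both settings
]
strict = [c.gene for c in candidates if passes(c)]
relaxed = [c.gene for c in candidates if passes(c, min_reads=3, min_freq=0.01)]
```

Running the relaxed setting only on a gene already implicated by the strict pass, as the study did for MAP2K1, limits the extra false positives that looser thresholds would otherwise admit.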
- Marc Sturm, Institute of Medical Genetics and Applied Genomics, Germany
- Christopher Schroeder, Institute of Medical Genetics and Applied Genomics, Germany
- Tobias Haack, Institute of Medical Genetics and Applied Genomics, Germany
Today, NGS is widely used in clinical diagnostics and translational research to identify disease-causing variants. While several commercial software suites for short-read NGS data analysis are available, they are normally quite costly and not easy to automate in a high-throughput setting.
Thus, we have developed megSAP, a free-to-use open-source data analysis pipeline tailored towards research and diagnostics in medical genetics. megSAP offers a complete NGS data analysis pipeline (adapter trimming, mapping, duplicate removal if applicable, indel realignment, variant calling, variant normalization and variant annotation) that is complemented with quality control on several levels (raw reads, mapped reads and variants). It is entirely based on open-source tools that are free for commercial use, which rules out popular tools like GATK and Annovar. It also integrates free-to-use databases like 1000 Genomes, ExAC, Kaviar and ClinVar for annotation of variants. Optionally, commercial databases which are important for diagnostics (OMIM, HGMD and COSMIC) can be used if a license is available. Due to the comprehensive annotation, the variant lists (produced in VCF and TSV format) can be easily filtered to identify disease variants.
megSAP is regularly updated (both tool and annotation databases). Each release is validated using the GiaB NA12878 gold-standard dataset, inter-laboratory comparisons and EMQN test schemes. Currently, megSAP is readily usable to analyze single-sample NGS data from whole-genome sequencing, whole-exome sequencing and panel sequencing (both shotgun and amplicon-based data). Several other applications (RNA-Seq, tumor-normal pairs, trios, and molecular barcodes) are already implemented and the corresponding documentation will be added shortly. To facilitate the installation of megSAP and thereby improve usability, we are working on a first containerized release using Docker.
megSAP is available at https://github.com/imgag/megSAP
- Arnaud Ceol, Istituto Italiano di Tecnologia, Italy
- Piero Montanari, School of Engineering, Italy
- Ilaria Bartolini, School of Engineering, Italy
- Paolo Ciaccia, School of Engineering, Italy
- Marco Patella, School of Engineering, Italy
- Stefano Ceri, Dipartimento di Elettronica,
- Marco Masseroli, Dipartimento di Elettronica,
- Heiko Muller, Istituto Italiano di Tecnologia, Italy
Biologists generally interrogate genomics data using web-based genome browsers that have limited analytical potential. New generation genome browsers such as the Integrated Genome Browser (IGB) have largely overcome this limitation and permit customized analysis to be implemented using plugins. We extend the functionality of IGB with two plugins: one integrates molecular, network and structural biology into the genome browser, and the other performs advanced pattern search in the genome browser tracks.
3D interactomes (networks of molecular interactions whose structures are either known or modeled) facilitate the identification of disease-relevant interactions that can then be specifically targeted by drugs. We developed a plugin for IGB that uses advanced visualization techniques to integrate the analysis of genomics data with network and structural biology approaches. The plugin automatically maps genomic regions to protein sequence and interaction structures and identifies residues in contact with proteins, nucleic acids or small molecules. This allows the end user to generate hypotheses regarding drug- and ligand-dependent perturbations of PPI networks, and provides predictions as to how specific mutations might have an impact on drug resistance.
We also integrated a pattern-search algorithm to provide biologists with the ability, once they identify an interesting genomic pattern, to look for similar occurrences in the data, thus facilitating genomic data access and use. For example, such patterns can describe DNA regions that regulate gene expression, including heterogeneous (epi)genomic features (e.g. histone modifications and/or different transcription factor binding regions). It is possible to define complex patterns based on perfect matches in genome tracks (regions that must match), partial matches (regions that are allowed to be absent), and negative matches (for instance to search for regions distant from transcription start sites).
Plugins available at: http://cru.genomics.iit.it/igbmibundle/ and http://www-db.disi.unibo.it/research/GenData/SimSearch
- Carles Hernandez-Ferrer, Barcelona Global Health Institute ES,
- Carlos Ruiz-Arenas, Barcelona Global Health Institute ES,
- Juan R., Barcelona Global Health Institute ES,
Background: Reduction in the cost of genomic assays has generated large amounts of biomedical-related data. As a result, current studies perform multiple experiments in the same subjects. While Bioconductor's methods and classes implemented in different packages manage individual experiments, there is not a standard class to properly manage multiple types of datasets obtained from the same individuals. In addition, most R/Bioconductor packages that have been designed to integrate and visualize biological data often use basic data structures with no clear general methods, such as subsetting or selecting samples.
Results: To cover this need, we have developed MultiDataSet, a new R class based on Bioconductor standards, designed to encapsulate multiple data sets. MultiDataSet deals with the usual difficulties of managing multiple and non-complete data sets while offering a simple and general way of subsetting features and selecting samples. We illustrate the use of MultiDataSet in three common situations: 1) subsetting operations before doing an integration analysis with a third party package; 2) creating new methods and functions for omic data integration; 3) encapsulating new unimplemented data from any biological experiment.
Conclusions: MultiDataSet is a suitable class for coordinated data management and data integration under the R and Bioconductor framework.
- Manuel Tardaguila, University of Florida, United States
- Lorena de, Centro de Investigacion Principe Felipe, Spain
- Ana Conesa, University of Florida, United States
The possibility of massively sequencing full-length transcripts has paved the way for the discovery of thousands of novel isoforms, even in well-annotated organisms such as mice and humans. With the increasing utilization of long-read technologies such as PacBio, the need for a tool that provides a comprehensive classification of these novel transcripts, as well as their exhaustive characterization, is ever more pressing. Here we present SQANTI, an automated pipeline that, through a splice junction matching criterion, allows for the identification and classification of known and novel transcripts as well as the mining of extensive quality control features, using a well-established differentiation model of Neural Progenitor Cells to Oligodendrocyte Precursors. Importantly, we show that a considerable fraction of the novel isoforms are in reality technical artifacts tracing back to the library preparation stages, and that SQANTI-calculated features along with machine learning algorithms succeed in weeding them out to obtain a curated Expressed Reference (ExR). To address the question of the actual proteomic readout of this ExR, we undertook a shotgun proteogenomics search for peptides that could uniquely identify ORFs, which reproduced previous findings: unique peptide evidence for secondary and novel ORFs, though non-negligible, remains limited compared to evidence for the so-called Principal Isoform ORFs defined for each gene. In addition, we demonstrate the advantages of using an ExR over classical RNA-seq approaches relying on mapping Illumina reads to Global references (GlR) to quantify isoform expression, chief among which is precluding the misallocation of short reads caused by unaccounted 3’ end variability. In a nutshell, SQANTI allows the user to maximize the analytical outcome of long-read technologies while providing biological cues and graphical output that expedite the detection and removal of artefactual transcripts.
- Xiangfu Zhong, University of Oslo and Oslo University Hospital, Norway
- Fatima Heinicke, University of Oslo and Oslo University Hospital, Norway
- Albert Pla, University of Oslo and Oslo University Hospital, Norway
- Benedicte Lie, University of Oslo and Oslo University Hospital, Norway
- Simon Rayner, University of Oslo and Oslo University Hospital, Norway
MicroRNAs are short non-coding RNAs with length varying from 19 to 26 nucleotides that regulate gene expression by binding to mRNA targets. miRBase, the standard miRNA reference database, catalogues miRNAs with a single dominant sequence. However, Next Generation Sequencing (NGS), a common approach for profiling miRNA populations, identifies a variety of miRNA isoforms (or isomiRs), with the dominant isoform varying amongst experiments. Nevertheless, typical NGS analyses neglect the presence of these isoforms. In this paper, we investigate the origin of these isoforms, their impact on downstream analysis and how different sample preparation kits influence the observed isomiR population.
We introduce a nomenclature for distinguishing isomiR types and apply it to investigate how isomiR populations vary with sample preparation protocols. We investigate two NGS datasets: (1) NGS data collected using the same sample sequenced with three different sample preparation kits and (2) NGS data sequenced using the same sample preparation kit across multiple studies performed by different research groups.
Results and Conclusions
We detect significant variation in the composition of isomiR populations amongst data generated by three library preparation kits, with a signature bias associated with each kit which may impact subsequent analysis. Thus, while isomiRs represent an important biological feature that should be considered during NGS analyses, it appears that some library-specific correction step is required prior to downstream isoform analysis.
- Juliane C., University of Natural Resources and Life Sciences, Austria
- Alexandrina Bodrug, University of Natural Resources and Life Sciences, Austria
- Alvaro Rodriguez, University of Natural Resources and Life Sciences, Austria
- J. Mitchell, Department of Agriculture-Agricultural Research Service,
- Britta Schulz, AT SE, Germany
- Heinz Himmelbauer, University of Natural Resources and Life Sciences Vienna, Austria
Sugar beet (Beta vulgaris ssp. vulgaris) is an important crop plant that accounts for roughly 25% of the world's sugar production per year. We have previously shown that sugar beet has a quite narrow genetic base, presumably due to a domestication bottleneck. To increase the versatility of the crop, the introduction of desirable traits from wild beets is required. As a first step, we have set out to characterize the genomes of sugar beet and its wild progenitor species, the sea beet (Beta vulgaris ssp. maritima). The genome of sugar beet was assembled from 454, Illumina and Sanger sequencing data, followed by integration with genetic and physical maps (Dohm et al., 2014). Efforts to further improve the sugar beet reference assembly are still ongoing, capitalizing on long-read technologies as well as on optical mapping data. We have sequenced the genomes of several sea beet accessions from different geographical areas to sample the diversity of the species, and the genome of Beta patula, a close relative of sugar beet endemic to the Madeira archipelago, where it is considered a threatened species. Lastly, we have shortlisted ca. 500 beets of differing genetic background for whole genome sequencing. We expect our work to provide a solid foundation to decipher the complex genetic architecture of a species, impacting research on plant genome evolution and applications for molecular breeding.
Dohm JC, Minoche AE, Holtgräwe D, Capella-Gutiérrez S, Zakrzewski F, Tafer H, Rupp O, Sörensen TR, Stracke R, Reinhardt R, Goesmann A, Kraft T, Schulz B, Stadler PF, Schmidt T, Gabaldón T, Lehrach H, Weisshaar B, Himmelbauer H. The genome of the recently domesticated crop plant sugar beet (Beta vulgaris). Nature 505 (2014), 546-549.
- Nicolas Salvetat, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Bérengère Vire, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Siem Van-Der-Laan, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Stephanie Pointet, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Yoann Lannay, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Christopher Cayzac, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Guillaume Marcellin, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Pilar Saiz, Dept Psychiatry, Spain
- Courtet Philippe, INSERM U1061,
- Franck Molina, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
- Dinah Weissmann, SYS2DIAG - UMR9005 CNRS/ALCEDIAG, France
Alterations of adenosine-to-inosine (A-to-I) RNA editing of proteins have been shown to be involved in the etiology of different diseases, such as neuropsychiatric disorders, cancers or autoimmune diseases. RNA editing is a co- or posttranscriptional process leading to a site-specific alteration in RNA sequences. It plays an important role in the epitranscriptomic regulation of RNA. While epigenetics bridges gene to expression, the field of epitranscriptomics links transcriptomics to protein synthesis and function.
Today, RNA editing analyses have been largely facilitated by the advent of NGS technologies. Recently, we have developed a targeted ultra-deep sequencing assay for RNA editing analyses on mechanistically relevant transcripts. Our bioinformatic pipeline combines 1) standard software to pre-process, clean and check the quality of the data and to align processed reads against a reference sequence, and 2) specific scripts that compute the percentage of reads that have been edited. In addition, we graphically visualized relevant RNA editing sites.
Using this pipeline, we performed a clinical study to identify an RNA editing signature in the blood of depressed patients with and without a history of suicide attempts. In addition, gene expression analysis by quantitative PCR was performed. We generated a multivariate algorithm comprising various selected biomarkers to detect patients at high risk of attempting suicide. In conclusion, we developed a full RNA editing pipeline that allows quantification of disease-relevant biomarkers in blood samples of patients for clinical diagnostic purposes.
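The per-site quantification step can be illustrated with a minimal sketch. It assumes that A-to-I editing is read out as A-to-G mismatches in the sequencing reads (inosine is read as guanosine) and that per-site base counts are already available. The function name, the input layout and the counts are hypothetical and are not taken from the pipeline described above.

```python
def editing_percentage(base_counts):
    """Return the percentage of edited reads at one reference adenosine.

    base_counts maps a base ('A', 'C', 'G', 'T') to its read count.
    Reads showing G are counted as edited, reads showing A as unedited;
    other bases are ignored as sequencing noise (an assumption of this sketch).
    """
    edited = base_counts.get("G", 0)
    unedited = base_counts.get("A", 0)
    informative = edited + unedited
    if informative == 0:
        return None  # no informative coverage: editing level is undefined
    return 100.0 * edited / informative


# Hypothetical base counts for two candidate editing sites
sites = {
    "site_1": {"A": 900, "G": 100, "T": 2},
    "site_2": {"A": 0, "G": 0},
}
levels = {name: editing_percentage(counts) for name, counts in sites.items()}
```

Returning `None` for uncovered sites keeps them distinguishable from sites with a genuine editing level of 0%.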
- Bruno Fosso, Institute of Biomembranes and Bioenergetics, Italy
- Monica Santamaria, Institute of Biomembranes and Bioenergetics, Italy
- Gabriel Valiente, Department of Computer Science, Spain
- Graziano Pesole, Institute of Biomembranes and Bioenergetics; Dept. of Biosciences,
From the first hours of life, the human body is colonized by prokaryotes, microscopic eukaryotes and viruses, which form a complex community called the human microbiome. This community plays a fundamental role both in physiological (e.g. tissue development and immune system training) and in pathological conditions (e.g. cancer and dysbiosis). It contains about ten times as many cells as the host, and thus increases gene diversity by several orders of magnitude relative to that represented in the human genome.
The advent of metagenomics and Next Generation Sequencing (NGS) technologies opened new and fascinating avenues for understanding microbes-host interactions and related pathologies. The Shotgun Metagenomics approach involves the deep sequencing of total DNA/RNA extracted from a clinical sample and allows both the taxonomical and functional characterization of the microbiome.
Here we present MetaShot (Bioinformatics, in press), a Python-based workflow designed for the compositional profiling of host-associated microbiomes.
It implements a two-step similarity-based approach to get the best compromise between computational efficiency and assignment accuracy: (i) it first detects candidate microbial reads in a context dominated by host-derived reads; (ii) candidate reads are taxonomically classified by using a fine-grained similarity comparison on reference collections; (iii) taxon assignments are refined considering observed abundances.
MetaShot has been benchmarked against Kraken (ccb.jhu.edu/software/kraken/) (7) and MetaPhlAn2 (http://huttenhower.sph.harvard.edu/metaphlan2), two state-of-the-art tools, using both an in silico simulated human microbiome and a mock community consisting of bacterial and viral species (9). MetaShot outperforms Kraken and MetaPhlAn2 in terms of the overall accuracy of read assignment for Prokaryotes and Viruses at the Family, Genus and Species levels.
A software package implementing the MetaShot pipeline is freely available at https://github.com/bfosso/MetaShot and includes a utility tool for extracting all reads assigned to a specific NCBI taxonomic ID or all those left unassigned.
- Suryani Lukman, Khalifa University, United Arab Emirates
- Kelvin Sim, OneAnalytix Pte Ltd, Singapore
The advancement of next generation sequencing (NGS) technologies has made it possible to generate millions of sequences simultaneously. These big data need to be thoroughly yet efficiently analyzed to derive insightful action plans for understanding population genomics and unraveling the genetics of complex diseases. The need to resequence multiple individuals to address the variation and diversity of population genomics adds further complexity and dimensionality to the data.
We have surveyed and compared several data mining and pattern recognition methods, particularly clustering and matrix decomposition methods, for analyzing NGS data. The methods include, but are not limited to, graph-based clustering, paired-end mapping clustering, subspace clustering, principal component analysis, multidimensional scaling, nonlinear iterative partial least squares analysis, singular value decomposition, independent component analysis, non-negative matrix factorization, and factorization machines.
Our results highlight the requirements for (i) multidisciplinary collaborations among researchers of different fields (molecular biology, genetics, bioinformatics, data science, software engineering, etc), and (ii) comprehensive assessment and error correction algorithms. These requirements are essential for developing fast, accurate, and scalable methods for many applications of NGS, such as clinical diagnostics, precise personalized medicine, and pharmacogenomics.
- Matt Simmons, Dartmouth, United States
- Vanni Bucci, Dartmouth, United States
The human gut microbiota is a complex system whose dynamics have huge implications for human health and disease. Current sequencing technology can produce a massive surplus of data and, as a result, microbiome researchers continuously strive for novel and more accurate approaches to parse out useful information. To date, studies of the dynamics of the microbiome have primarily focused on descriptive and correlation-based analysis. In order to further our understanding of these dynamics toward using this knowledge to prototype microbiome-based therapies, one promising approach is to use data-driven inference to produce predictive models for microbiome dynamics. Data-driven inferences discard the notion of a mechanistic model in favor of a statistical approach because of the chaotic nature of the microbiota. We have previously introduced the Microbial Dynamical Systems INference Engine (MDSINE): a suite of algorithms for inferring a parametric model from microbiome time-series data and then predicting temporal behaviors. Despite relying on the simplistic generalized Lotka-Volterra (GLV) model, MDSINE significantly outperforms the previously available inference methods in predicting the dynamics of C. difficile and immune-system-regulating mouse models. To relax MDSINE’s fundamental GLV assumption, here we introduce a new method for inference based on nonparametric multiplicative regression (NPMR). This approach has been shown to be useful for modeling the response of an organism to its environment, but has yet to be applied to modeling the dynamics of microbiomes. We show that NPMR provides us with increased predictive ability in parameter inference and trajectory predictions on simulated and previously published data, and therefore can be considered as another viable option for microbiome forecasting model inference.
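For readers unfamiliar with the GLV assumption mentioned above, the following minimal sketch integrates the standard generalized Lotka-Volterra equations, dx_i/dt = x_i (mu_i + sum_j a_ij x_j), with a forward-Euler scheme. It is not MDSINE or NPMR code; the function names, the two-taxon community and all parameter values are hypothetical and chosen only for illustration.

```python
def glv_step(x, growth, interactions, dt):
    """Advance abundances x by one forward-Euler step of the GLV equations:
    dx_i/dt = x_i * (growth[i] + sum_j interactions[i][j] * x[j])."""
    new_x = []
    for i, xi in enumerate(x):
        per_capita = growth[i] + sum(a * xj for a, xj in zip(interactions[i], x))
        # Abundances cannot become negative, so clip at zero.
        new_x.append(max(xi + dt * xi * per_capita, 0.0))
    return new_x


def simulate(x0, growth, interactions, dt=0.01, steps=5000):
    """Return the list of abundance vectors visited, starting from x0."""
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        x = glv_step(x, growth, interactions, dt)
        trajectory.append(x)
    return trajectory


# Hypothetical two-taxon community: both taxa self-limit, taxon 0 inhibits taxon 1.
growth = [1.0, 0.5]
interactions = [[-1.0, 0.0],
                [-0.2, -0.5]]
trajectory = simulate([0.1, 0.1], growth, interactions)
final = trajectory[-1]
```

With these parameters the system settles at the steady state where both per-capita growth rates vanish (abundances of 1.0 and 0.6). Inference works in the opposite direction: the growth and interaction parameters are estimated from observed time series.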
- Santhi Natarajan, Indian Institute of Science, India
- Krishna Kumar, Indian Institute of Science, India
- Debnath Pal, Indian Institute of Science, India
- Soumitra Kumar Nandy, Indian Institute of Science, India
The rapid pace of development in Next Generation Sequencing technologies has made genomic sequencing significantly more affordable and has improved throughput and quality. However, the continued use of heuristics in secondary analysis of the genome remains an obstacle to leveraging these advancements, as even a single unwanted error makes the technology unreliable for use in critical domains like diagnostics and healthcare. To circumvent some of the lacunae associated with the use of heuristics in alignment, we have developed an accelerator-based, fast and accurate alignment technology using a parallel dynamic programming algorithm that gives error-free alignment with full sequence coverage. The method is currently implemented for aligning full genomes and exomes with up to two mismatches, with an option to use hard-clipped reads as inputs, if needed. The throughput is higher than that of many popular aligners like Bowtie, BWA, CUSHAW etc. Our technology returns many more multi-read alignments with substantial repeat region coverage in a significantly shorter or comparable time, depending on whether a Graphics Processing Unit (GPU) or re-configurable hardware like a Field Programmable Gate Array (FPGA) is being used. The technology has been extensively validated on test data sets of human and microbial genomes and is being commercially launched.
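As an illustration of non-heuristic, dynamic-programming alignment, the sketch below implements a classic serial approximate-matching recurrence (Sellers-style), which reports every reference position where a read ends with at most a given number of edits. It is an assumption-laden stand-in: the accelerated parallel algorithm described above is not specified in the abstract, and the function name and the error model (mismatches and gaps counted equally) are hypothetical.

```python
def approximate_matches(read, reference, max_errors=2):
    """Return [(end, errors)] for every reference prefix end position `end`
    (exclusive) where `read` aligns to a reference substring ending there
    with edit distance `errors` <= max_errors."""
    m = len(read)
    # previous[i]: fewest edits aligning read[:i] to a substring ending at the
    # previous reference column. Column 0 means no reference consumed yet.
    previous = list(range(m + 1))
    hits = []
    for end, ref_base in enumerate(reference, start=1):
        current = [0]  # an alignment may start at any reference position
        for i in range(1, m + 1):
            cost = 0 if read[i - 1] == ref_base else 1
            current.append(min(
                previous[i - 1] + cost,  # match or mismatch
                previous[i] + 1,         # reference base not matched by the read
                current[i - 1] + 1,      # read base not matched by the reference
            ))
        if current[m] <= max_errors:
            hits.append((end, current[m]))
        previous = current
    return hits


hits = approximate_matches("ACGT", "TTACGTTT", max_errors=2)
exact = [end for end, errors in hits if errors == 0]
```

Because every cell of the table is filled, no candidate location is skipped, which is the property that heuristic seed-based aligners give up in exchange for speed. Cells on the same anti-diagonal are independent of each other, which is what makes this kind of recurrence amenable to GPU or FPGA parallelization.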
- Morten Munk Johansen, Department of Hematology/ The Epi-/Genome lab, Denmark
- Fazila Asmar, Department of Hematology/ The Epi-/Genome lab, Denmark
- Christina Westmose Yde, Center for Genomic Medicine, Denmark
- Daniel El, Department of Haematology, Denmark
- Kirsten Grønbæk, Department of Hematology/ The Epi-/Genome lab, Denmark
In this study, we aimed to investigate a possible common genetic origin of hematological cancers in patients with concomitant lymphoid and myeloid malignancy.
By exome sequencing, we identified the somatic mutational landscapes of the malignant clones, using T-cells as germline tissue, for two patients with concomitant de novo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) and two patients with CLL and therapy-related AML (t-AML). The somatic mutational landscapes of the malignant clones in the de novo concomitant cases and the cases with CLL and t-AML were quite similar to what has previously been reported in sporadic cases of disease. The malignant clones did not share any of the mutations, indicating development of two independent diseases.
We further identified possible pre-disposing mutations by comparing variants between the myeloid malignant clone, CLL cells, and T cells, as well as using saliva to aid in characterizing the mutations as either germline or only present in the hematological compartment. Three additional patients with concomitant CLL and a myeloid disease (AML, myelodysplastic syndrome and chronic myelomonocytic leukemia type 1) were assessed for potential predisposing mutations using this approach.
We only identified shared variants most likely representing germline mutations.
In all the patients except one with de novo AML and CLL, we identified a potentially damaging germline variant in a DNA-repair-related gene, such as ATM (387dupA, D130fs*4), SMARCAL1 (2114C>T, T705I), HELQ (393_397delAGGTG, G132fs*16), SWI5 (652C>T, R218*), LIG1 (2168A>G, Q761R) and PRKDC (902G>A, C301Y).
The third patient with concomitant de novo AML and CLL harbored a potentially damaging germline variant in an epigenetic regulator believed to play a role in normal and malignant hematopoiesis, KDM2B (44delC, P15fs*92).
Our results suggest a possible role of germline variations in the susceptibility to development of t-AML or concomitant de novo hematological cancers. However, further studies are needed to confirm this hypothesis.
- Andre Gohr, Centre for Genomic Regulation (CRG), Spain
- Manuel Irimia, Centre for Genomic Regulation (CRG), Spain
Splicing is the molecular mechanism by which pre-mRNA is processed into mature mRNA by removing introns and combining exons.
Alternative splicing is the ability to splice pre-mRNAs from the same gene into different mRNAs, e.g., by skipping exons and retaining introns.
When studying alternative splicing, often one goal is to identify differentially regulated exons/introns and compare their features with reference sets.
While tools (e.g., MISO, rMATS, SANJUAN, VAST-TOOLS) are available for estimating inclusion levels of exons/introns from NGS data, there is a lack of tools for the downstream analysis.
Matt is a command-line toolkit bundling functionality for the analysis of exons/introns.
It is implemented in Perl and uses R for generating graphics; as such it works on all Linux-like systems.
Matt works on tables; thus, it can proceed directly with the output of many tools (like rMATS, MISO, VAST-TOOLS) which report exons/introns and inclusion levels in table format.
In addition, Matt includes functionality for altering tables, e.g., sub-selecting rows or columns, to allow the user to extract sets of exons/introns.
Matt facilitates the extraction of exonic/intronic sequences, and their up-stream and down-stream sequences.
Matt can determine up to 90 exon/intron features, including splice site strength [1] and predicted branch points [2], and statistically tests for significant differences in these features between sets of exons/introns.
Furthermore, Matt can be used to test for enrichment of binding sites of RNA-binding proteins between two sets of exons/introns, including up to 300 motifs from the CISBP-RNA database.
Bringing together this functionality is one of the strengths of Matt, facilitating the downstream analysis of exons/introns.
[1] using MaxEntScan, genes.mit.edu/burgelab/maxent/Xmaxentscan_scoreseq.html
[2] using SVM-BP, http://regulatorygenomics.upf.edu/Software/SVM_BP
- Adam Giess, UIB, Norway
While methods for the annotation of genes are increasingly reliable, the exact identification of the translation initiation site remains a challenging problem. Since the N-termini of proteins often contain regulatory and targeting information, developing a robust method for start site identification is crucial. Ribosome profiling reads show distinct patterns of read length distributions around translation initiation sites. These patterns are typically lost in standard ribosome profiling analysis pipelines, when reads from footprints are adjusted to determine the specific codon being translated. Using these unique signatures, we build a model capable of predicting translation initiation sites and demonstrate its high accuracy using N-terminal proteomics. Applying this to prokaryotic samples, we re-annotate translation initiation sites and provide evidence of N-terminal truncations and elongations of annotated coding sequences. These re-annotations are supported by the presence of Shine-Dalgarno sequences, structural and sequence-based features and N-terminal peptides. Finally, our model identifies 61 novel genes previously undiscovered in the genome.
- Maximilian Hastreiter, Helmholtz Zentrum München, Germany
- Tim Jeske, Helmholtz Zentrum München, Germany
- Jonathan Hoser, Helmholtz Zentrum München, Germany
- Michael Kluge, Helmholtz Zentrum München, Germany
- Kaarin Ahomaa, Helmholtz Zentrum München, Germany
- Marie Sophie-Friedl, Helmholtz Zentrum München, Germany
- Sebastian Kopetzky, Helmholtz Zentrum München, Germany
- Jan-Dominik Quell, Helmholtz Zentrum München, Germany
- Hans-Werner Mewes, Helmholtz Zentrum München, Germany
- Robert Küffner, Helmholtz Zentrum München, Germany
Analysis of Next Generation Sequencing (NGS) data requires the processing of large datasets by chaining various tools with complex input and output formats. In order to automate data analysis, the application of workflow management platforms has been proposed. This simplifies reliable handling and processing of NGS data, and corresponding solutions become substantially more reproducible and easier to maintain. Here, we present a well-documented, KNIME-based toolbox of 42 processing modules as well as important technical extensions for NGS data analysis. Besides a number of auxiliary nodes, our extension provides building blocks and wrappers for well-established tools for tasks like read mapping, differential expression analysis and variant calling. The nodes can easily be combined to create standardized workflows, which enables expert and non-expert users to set up reliable processing workflows quickly without the need to deal with command-line tools. We also extended KNIME to use NGS-specific file types. This improves workflow robustness by enabling automated checks of tool configuration parameters and versions. To organise the software binaries required by nodes, we developed a lightweight binary management system which simplifies keeping workflows up to date with the most recent software packages. Furthermore, we developed the High-throughput Executor (HTE) as an extension of the standard KNIME node model, which helps to reduce manual interventions during data processing. In the analysis of large datasets, even established tools are prone to spurious premature termination. In this case, HTE automatically retries the execution of a failed node as often as defined by the user. Additionally, the HTE model ensures that only those nodes that depend on failed previous steps are re-executed, rather than the entire workflow.
Taken together, our KNIME extension can substantially lower the effort for scientists entering into the areas of NGS and Big Data and stimulate collaboration between computational and wet lab biologists.
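The retry behaviour attributed to HTE can be sketched generically as follows. This is not KNIME or HTE code (KNIME nodes are implemented in Java); `run_with_retries` and `FlakyTool` are hypothetical names, and the sketch covers only the user-defined retry count, not the selective re-execution of dependent nodes.

```python
def run_with_retries(task, max_retries):
    """Run task(); if it raises, retry up to max_retries further times.

    Returns (result, attempts). The last error is re-raised if every
    attempt fails.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > max_retries:
                raise  # retry budget exhausted: surface the failure


class FlakyTool:
    """Hypothetical stand-in for a tool that terminates spuriously
    a fixed number of times before succeeding."""

    def __init__(self, failures):
        self.failures = failures

    def __call__(self):
        if self.failures > 0:
            self.failures -= 1
            raise RuntimeError("spurious premature termination")
        return "done"


result, attempts = run_with_retries(FlakyTool(failures=2), max_retries=3)
```

Bounding the number of retries matters: a tool that fails deterministically (for example on malformed input) would otherwise loop forever instead of reporting the error.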
- David Emms, University of Oxford, United Kingdom
- Steve Kelly, University of Oxford, United Kingdom
Identifying orthology relationships between genes is fundamental to genomics research. The use of these relationships provides a coherent framework for the extrapolation of biological knowledge between organisms, enabling genome and transcriptome annotation, and underpinning comparative/population genomic analyses. However, the identification of all homology relationships across increasingly large numbers of genomes is an essential task that is often poorly formulated and is methodologically and computationally challenging.
We present OrthoFinder, a fast and accurate tool for orthology inference for comparative genomics analyses. It defines the problem of multi-species orthology using a consistent framework. It accounts for gene-family level and species-level variations in sequence divergence rates and corrects for gene-length bias in order to provide high-accuracy inference of orthogroups (gene families defined at a consistent taxonomic level), orthologues, paralogues, gene trees and the rooted species tree for a given set of species. On the independent OrthoBench dataset it outperformed other orthogroup methods by between 8% and 33% in terms of accuracy. The complete OrthoFinder analysis is performed with a single command and the implementation is fast, memory efficient and highly parallel thus permitting large scale analyses of data from high-throughput sequencing.
- Bernat Del, IDIBGI, Spain
- Jesús Matés, IDIBGI-UDG, Spain
- Mel·lina Pinsach, IDIBGI-UDG, Spain
- Jing Zhang, UCSD, United States
- Daria Merkurjev, UCLA, United States
- Catarina Allegue, Grupo de Medicina Xenomica-CIMUS-USC, Spain
- Sergiy Konovalov, UCSD, United States
- Sneha Bhattaram, UCSD, United States
- Farah Sheikh, UCSD, United States
- Ramon Brugada, IDIBGI-UDG, Spain
- Ivan Garcia-Bassets, UCSD, United States
- Sara Pagans, UDG, Spain
The noncoding fraction of the human genome contains the instructions on how to build an individual and make it different from other individuals. Transcription factors (TFs) are the proteins that read these instructions and convert them into actions. However, TF instructions are not identical in all humans. Each personal genome harbors a unique set of genetic variants that alters a subset of TF instructions. These differences contribute to phenotypic diversity among individuals and to their distinct susceptibility to disease and response to pharmacological treatments. Understanding personal differences in TF instructions may therefore help to understand disease and support the development of precision medicine. Here, we use a machine-learning approach, known as DeepBind, to predict alterations in TF instructions at binding sites for the TF CTCF, a central structural organizer of the human genome. We have integrated 56,261 cardiac CTCF binding sites from human cardiomyocytes (ENCODE) with human genetic variants from the 1000 Genomes Project. Machine-learning-based analysis of CTCF binding after simulating all the annotated genetic variation within these coordinates in the human population predicts substantial variation in TF binding, which may result in alterations in TF instructions. Based on these analyses, we have catalogued cardiac CTCF sites according to the number and level of predicted alterations. To interrogate the potential impact of CTCF-binding site alterations in cardiac disease, we have sequenced the CTCF coordinates around a number of ion channel genes in a cohort of n=89 human individuals diagnosed with Brugada Syndrome (BrS). We have predicted potential effects on CTCF binding that may be associated with this familial cardiac disease. In summary, we show value in the application of machine learning techniques to predict alterations in TF instructions that may be relevant to disease.
- Manu Ferrando, CNAG-CRG, Spain
- Carolina Medina-Gomez, ErasmusMC, Spain
- Beatriz Cadenas, CNAG-CRG, Spain
- Iago Maceda, CNAG-CRG, Spain
- Fernando Rivadeneira, Erasmus MC, Spain
- Oscar Lao, CNAG-CRG, Spain
Bone mineral density (BMD) is a complex phenotype of medical interest as a proxy for fracture susceptibility, and it is particularly interesting from an evolutionary point of view due to the reported recent gracilization of the human skeleton. BMD is influenced by many genes and by environmental and behavioral variables. It has been shown that BMD is influenced by lifestyle factors such as exercise level. In addition to phenotypic plasticity, it has also been suggested that BMD shows constitutive differences among human populations; children with a large fraction of Sub-Saharan African ancestry tend to show higher BMD values than individuals of non-African ancestry living in the same geographic area and in a similar environment.
The analysis of BMD-associated markers has made it possible to begin understanding the microevolution of this phenotype. In particular, it has been suggested that population phenotypic heterogeneity is the result of differential selective pressures between African and non-African populations (Medina-Gómez et al. 2015). However, it is unknown whether the observed pattern is due to a relaxation of selection in African populations or to positive selection out of Africa for the BMD-decreasing alleles. Furthermore, archaic introgression has also been claimed to increase the risk of some complex diseases. Within the context of the BMD phenotype, morphological studies have described a large number of skeletal differences between Neanderthals and anatomically modern humans. Therefore, it is possible that these differences could be responsible for the observed BMD differences between African and non-African populations.
We have analyzed fully sequenced Neanderthal, Denisovan and archaic anatomically modern human genomes to quantify the role of Neanderthal and Denisovan introgression in the BMD phenotype, as well as the possibility that these differences are due to a decrease in daily physical activity following cultural innovations in ancient Europe.
- Daniela Puiu, Johns Hopkins University, United States
- Aleksey Zimin, Johns Hopkins University, United States
- Sorel Fitz-Gibbon, University of California Los Angeles, United States
- Victoria Sork, University of California Los Angeles, United States
- Steven Salzberg, Johns Hopkins University, United States
Oaks represent a valuable natural resource across the Northern Hemisphere with a large research community studying their genetics, systematics, ecology, conservation, and management.
We report our progress on assembling the genome sequence of Valley oak (Quercus lobata) using both Pacific Biosciences (PacBio) and Illumina sequencing data from adult leaf tissue of an individual found in a natural population. We compare three assemblies: Illumina only, PacBio only, and a hybrid combining both data sets. The Illumina assembly, performed with MaSuRCA and SOAPdenovo2, consists of 18,512 scaffolds of length >2 kb totaling 1.15 Gb, with an N50 scaffold size of 278 kb. The k-mer histograms indicate an approximate diploid genome size of 700-740 Mb, which is smaller than the assembled length due to the high heterozygosity in this outcrossing tree species. The PacBio assembly, performed with Canu, consists of 7,300 assembled contigs totaling 0.97 Gb with an N50 of 429 kb, plus unassembled sequences totaling 3.31 Gb. The hybrid assembly was performed with the MaSuRCA assembler using error-corrected PacBio reads, yielding a draft nuclear genome of 25,683 unitigs totaling 1.31 Gb with an N50 size of 880 kb.
We have developed a software pipeline (minimus3) whose purpose is to decrease the haplotype redundancy and yield an assembly close to the expected genome size. The pipeline uses whole genome alignment and alignment filtering methods, as well as read coverage, transposon and BUSCO annotation, for deciding which sequences represent different haplotypes and need to be merged. We are working on integrating this method in the MaSuRCA assembler.
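The N50 statistic used to compare the assemblies above can be computed as in this minimal sketch, which uses the common definition: the length L such that sequences of length at least L together account for at least half of the total assembled length. The function name and the example lengths are hypothetical; assemblers and evaluation tools may differ in details such as minimum-length filters.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Hypothetical contig lengths (in kb): total 290, half 145.
# 80 + 70 = 150 >= 145, so the N50 is 70.
example_n50 = n50([80, 70, 50, 40, 30, 20])
```

Note that N50 depends on the total assembled length, so redundant haplotype copies can change it without reflecting real contiguity, which is one reason to compare assemblies at a similar total size.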
- Martin Selmansberger, Helmholtz Zentrum Muenchen, Germany
- Julia Hess, Helmholtz Zentrum Muenchen, Germany
- Ulrike Schoetz, LMU Clinics Munich, Germany
- Kirsten Lauber, AG Molekular Oncology,
- Horst Zitzelsberger, Helmholtz Zentrum Muenchen, Germany
- Kristian Unger, Helmholtz Zentrum Muenchen, Germany
CD44v6, a variant of the CD44 cell-surface glycoprotein comprising eight different transcripts, has been shown to be associated with clinical outcome in various cancer entities, such as colon, gastric, and pancreatic cancer. In order to test whether CD44v6 expression also has a prognostic role in radiotherapy (RTx) treated head and neck squamous cell carcinoma (HNSCC), we aimed to test the expression of the CD44v6 transcript variants for association with overall survival in the sub-cohort of RTx-treated and HPV-negative patients (n=77) of The Cancer Genome Atlas (TCGA) HNSC cohort (n=279).
Since the quantifications of the CD44v6 transcripts were not sufficiently available from the existing TCGA level 3 data, we performed the alignment of the RNAseq reads to a reference transcriptome using a novel pseudoalignment approach. Firstly, the paired-end reads of the downloaded BAM files were randomized and extracted with the SamTools software. Secondly, alignment of the reads to a reference transcriptome followed by quantification of individual transcripts was performed using the Salmon approach (http://salmon.readthedocs.io/), by which all eight annotated CD44v6 transcripts were detected. The resulting quantifications as transcripts per million (TPM) were used to split the 77 cases into two groups with high and low expression (threshold 2-fold-above-median expression) of the respective transcript. Subsequent Kaplan-Meier survival analysis revealed that increased expression of one out of the eight CD44v6 transcripts (ENST00000415148.6) was significantly associated with better overall survival of RTx-treated HNSCC patients (FDR < 0.1, hazard ratio 3.97, 95% CI = [1.55;10.16]). Thereby, we were able to demonstrate a prognostic role of a specific CD44v6 transcript in the RTx-treated and HPV-negative HNSCC TCGA sub-cohort. Additionally, our study suggests that reanalysis of existing RNAseq data with state-of-the-art analysis approaches has the potential to generate added value to the previous analysis.
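The grouping and survival estimation steps can be sketched as follows. The sketch assumes that "2-fold-above-median" means a threshold of twice the median TPM, and it implements only the Kaplan-Meier product-limit estimator. The hazard ratio and FDR reported above require additional modelling and multiple-testing correction that are not shown. Function names and the toy data are hypothetical.

```python
from statistics import median


def split_by_expression(tpm, fold=2.0):
    """Label each case 'high' if its TPM exceeds fold * median TPM, else 'low'."""
    threshold = fold * median(tpm.values())
    return {case: ("high" if value > threshold else "low")
            for case, value in tpm.items()}


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time with an observed death.

    times: follow-up time per case; events: 1 if death was observed, 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        leaving = sum(1 for ti in times if ti == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical toy data: four cases with TPM values and follow-up times.
tpm = {"c1": 1.0, "c2": 2.0, "c3": 3.0, "c4": 10.0}
groups = split_by_expression(tpm)            # median 2.5, threshold 5.0
curve = kaplan_meier([5, 8, 8, 12], [1, 1, 0, 1])
```

In an analysis like the one described, one curve would be estimated per expression group and the two curves compared, for example with a log-rank test.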
- Bertrand Sodjahin, Athabasca University, Canada
- Vivekanandan Kumar, Athabasca University, Canada
- Shawn Lewenza, University of Calgary, Canada
- Shauna Reckseidler-Zenteno, University of Calgary, Canada
Pseudomonas aeruginosa is a bacterial organism notable for its ubiquity in the ecosystem and its resistance to antibiotics. It is an environmental bacterium that is a common cause of hospital-acquired infections. Approximately 50% of mechanically ventilated patients with P. aeruginosa pneumonia succumb to their condition. It has been shown that the organism can be isolated from water in some intensive care units. Identifying its survival mechanism in water is critical for designing preventative and curative measures. In addition, understanding this mechanism is beneficial because P. aeruginosa and other related organisms are capable of bioremediation. To address this practical problem, it has been decomposed into multiple learnable components, the first of which is the identification of genes responsible for bacterial survival in water. In this work, a Bayesian Machine Learning methodology was devised to analyze the gene expression response of P. aeruginosa incubated in water, a nutrient-depleted environment. With this approach, and with collected unlabeled gene expression data based on a PAO1 mini-Tn5-luxCDABE transposon mutant library, an optimal regulatory network model of the survival mechanism was constructed. The model’s reliability was tested and its validity benchmarked on the ALARM diagnostic system, a network of reference. Subsequently, node influence techniques were implemented to identify and isolate a group of genes as orchestrators of P. aeruginosa survival in low-nutrient water. We generated hypotheses to test and confirm these predictions. We also posit that this methodology and framework model can be generalized to other bacteria for similar studies.
- Abdulfatai Tijjani, University of Nottingham, United Kingdom
- Jaemin Kim, National Human Genome Research Institute, United States
- Raphael Mrode, International Livestock Research Institute, Kenya
- Heebal Kim, Seoul National University,
- Olivier Hanotte, University of Nottingham, United Kingdom
The availability of next-generation sequencing data provides an opportunity for in-depth characterization of the genome of an organism. Structural variants including copy-number variations (CNVs) are now being detected in the genomes of humans and different animal species, and they have been linked to several phenotypes. The West African trypanotolerant taurine cattle population includes indigenous breeds such as Muturu and N’dama, which are known for their unique tolerance to trypanosomiasis. They also represent a very important animal genetic resource yet to be fully exploited for the improvement of livestock productivity and the study of local adaptation.
In this study, we sequenced the genomes of 20 cattle, comprising an equal number of Muturu and N’dama cattle, to about 10x coverage on the Illumina platform. Following comparison of the sequenced genomes to the UMD3.1 bovine reference assembly and using read depth from split reads and discordant paired ends as SV detection signals, we identified 6,014 and 13,471 putative CNVs, respectively, which are distributed unevenly along the 29 bovine autosomes. The detected CNVs in both cases were merged into a total of 3,607 copy-number variation regions (CNVRs) with a length of > 1 kb. They include 2,486 deletions, 667 duplications, and 454 mixed duplication and deletion regions. Numerous protein-coding genes were retrieved from these genomic variant regions, and some regions were also found to overlap with up to 5,541 known bovine (UMD3.1) quantitative trait loci (QTL), including QTLs associated with heat tolerance, tick resistance, disease susceptibility and several production traits.
Our study provides for the first time a genome-wide assessment of CNV detection in local cattle breeds from West Africa using full genome sequence. The results of this study may provide further insight into the genome diversity and genomic basis of important phenotypes in indigenous African cattle.
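The merging of CNV calls into CNVRs described in this abstract can be illustrated with a small interval-merging sketch. This is not the authors' pipeline: the function name `merge_cnvs`, the tuple layout, and the "DEL"/"DUP" labels are hypothetical, and the rule that any overlap triggers a merge is an assumption. Only the 1 kb length cutoff and the deletion/duplication/mixed classes come from the abstract.

```python
from collections import defaultdict


def merge_cnvs(calls, min_length=1000):
    """Merge overlapping CNV calls into CNV regions (CNVRs).

    calls: iterable of (chrom, start, end, kind), kind being "DEL" or "DUP".
    Returns a list of (chrom, start, end, label) where label is
    "deletion", "duplication" or "mixed". Regions not longer than
    min_length base pairs are dropped.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, kind in calls:
        by_chrom[chrom].append((start, end, kind))

    regions = []
    for chrom in sorted(by_chrom):
        current = None  # [start, end, set of kinds] for the open region
        for start, end, kind in sorted(by_chrom[chrom]):
            if current is not None and start <= current[1]:
                # Overlaps the open region: extend it and record the kind.
                current[1] = max(current[1], end)
                current[2].add(kind)
            else:
                if current is not None:
                    regions.append((chrom, *current))
                current = [start, end, {kind}]
        if current is not None:
            regions.append((chrom, *current))

    labels = {frozenset({"DEL"}): "deletion", frozenset({"DUP"}): "duplication"}
    return [
        (chrom, start, end, labels.get(frozenset(kinds), "mixed"))
        for chrom, start, end, kinds in regions
        if end - start > min_length
    ]


example_calls = [
    ("chr1", 100, 2000, "DEL"),
    ("chr1", 1500, 3000, "DUP"),   # overlaps the deletion above
    ("chr1", 5000, 5500, "DEL"),   # too short, filtered out
    ("chr2", 0, 4000, "DUP"),
]
cnvrs = merge_cnvs(example_calls)
```

Calls from both detection approaches would simply be concatenated before merging. A region touched by both a deletion and a duplication is reported as mixed.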
- Cristian Perez-Garcia, Imegen, Spain
- Carlos Mackintosh, Imegen, Spain
- Miguel Gonzalez, Imegen, Spain
- Carol Monzo, Imegen, Spain
- Carlos Ruiz, Imegen, Spain
- Javier Garcia-Planells, Imegen, Spain
- Pablo Marin-Garcia, Imegen, Spain
oncoIMG_Viz is an NGS bioinformatics production platform for the analysis of targeted cancer gene panels using paired samples (germline-somatic). The aim of the tool is to provide clinicians with an interactive environment for data management and visualization that filters somatic variants, yields the right set of quality control annotations for filtering out dubious SNPs, and helps to discriminate subclonal variants at different levels of variant allele fraction (down to 5%). The tool also helps prioritize variants by providing standard germline variant annotation and adding annotations of cancer pathogenicity, LOH, and known drug interactions. The tool calculates and visualizes CNVs and uses them to improve genotype interpretation.
- Jose-Miguel Juanes, INCLIVA, Spain
- Cristian Perez, Imegen, Spain
- Azahara Fuentes, Incliva, Spain
- Miguel Gonzalez, INCLIVA, Spain
- Carol Monzo, Incliva, Spain
- Javier Chaves, Incliva, Spain
- Vicente Arnau, ETSE, Spain
- Javier Garcia-Planells, Imegen, Spain
- Ana Barbara, INCLIVA, Spain
- Pablo Marin-Garcia, Imegen, Spain
At the Medical Genomics Visualization group (MGviz), we have recently developed fully integrated QC and data analysis procedures that create automatic NGS clinical reports for clinicians to review. We provide integrated and summarized quality control measures, with thresholds based on all previous runs, that allow the detection of any kind of bias, sample performance problem, or sample swap. We also provide intuitive interactive tools that help prioritize and select the variants of interest related to the patient's phenotype from our NGS gene panels and other common products such as TruSightOne from Illumina or whole-exome capture from Agilent.
We have also developed a full suite of open-source tools in Python, R and the MEAN stack for providing clinical bioinformatics as a service, from sample tracking and variant annotation to interactive selection tools, reannotation and automatic clinical report generation. All of this is based on our own modern JavaScript library for interactive bioinformatic visualization: jviz (https://github.com/jviz).
- Devendra Biswal, North Eastern Hill University, India
- Ruchishree Konhar, National Institute of Science Education and Research, India
- Manish Debnath, Bioinformatics Centre, India
- Veena Tandon, Biotech Park, India
Many trematode parasites cause infection in humans and are thought to be a major public health problem. Their diversity across ecological regions poses challenging questions about the evolution of these organisms. In this report, we perform transcriptome analysis of the giant intestinal fluke, Fasciolopsis buski, using next-generation sequencing technology. Short read sequences derived from polyA-containing RNA of this organism were assembled into 30,677 unigenes, which led to the annotation of 12,380 genes. Annotation of the assembled transcripts allowed an analysis of different processes and pathways, such as the RNAi pathway and energy metabolism. The expressed kinome of the organism was deciphered by identifying all protein kinases. We have also carried out genome sequencing and used the sequences to confirm the absence of some genes not observed in the transcriptome data, such as genes involved in the fatty acid biosynthetic pathway. The transcriptome data also helped us to identify some of the expressed transposable elements. Though many long interspersed elements (LINEs) were identified, only two short interspersed elements (SINEs) were visible. Overall, transcriptome analysis helped us to decipher some of the biological characteristics of F. buski and provided enormous resources for the development of a suitable diagnostic system and therapeutic molecules.
- Devendra Biswal, North Eastern Hill University, India
- Srinivasan Ramachandran, Institute of genomics and Integrative Biology, India
- Anupam Chatterjee, North-Eastern Hill University, India
- Manish Debnath, Bioinformatics Centre, India
- Veena Tandon, Biotech park, India
Paragonimiasis in humans is a neglected tropical disease of the lung and pleural cavity. Extra-pulmonary paragonimiasis is also an important clinical manifestation that has received little attention from public health authorities and has a far wider scope than mere clinical diagnosis and treatment. The major causative agent, Paragonimus westermani (Trematoda: Digenea), is a cryptic species complex that is widely spread in east and northeast China, Japan, Korea and Taiwan (collectively referred to as East Asia). Lung flukes are also found in the tropics and subtropics of East and South Asia and suburban Africa. The results of genetic variance analyses showed that samples from the same geographical region (country) may be attributed to different clusters but may not have sufficient phylogenetic signal to exhibit the biological uniqueness of the corresponding populations. In a highly heterogeneous population, the genetic variability in toto may be represented by a minor fraction of sampled individuals, so that the clustering pattern of some variants might have resulted from fortuitous events. In fact, the geographical fitting presents no surprise at all, taking into account that the Paragonimus species distribution covering different countries is more or less aligned along an East-South axis because of the relative slenderness of the South Asian countries. We harnessed the entire genome sequence information for P. westermani via next-generation sequencing (NGS) and correlated it with the current information for P. westermani for mtDNA phylogenomic investigations. Specific primers were designed for the 12 protein-coding genes using existing P. westermani mtDNA as the reference. Ion Torrent technology was used for the mitochondrial genome sequencing, and the genome was assembled and analyzed in silico.
- Milana Frenkel-Morgenstern, Bar-Ilan University, Israel
- Alfonso Valencia, CNIO, Spain
Chimeric proteins, comprising peptides deriving from the translation of two parental genes, are produced in cancers by chromosomal aberrations. The expressed fusion protein incorporates domains of both parental proteins. Considering discrete protein domains as binding sites for specific domains of interacting proteins, we have catalogued the protein interaction networks for more than 11,000 cancer fusions in order to build the Chimeric Protein-Protein-Interactions (ChiPPI). Mapping the influence of fusion proteins on cell metabolism and protein interaction networks reveals that chimeric protein-protein interaction (PPI) networks often lose tumor suppressor proteins, and gain onco-proteins. We compared ChiPPI networks in different cancer phenotypes, e.g. in leukemia/lymphoma, sarcoma and solid tumors finding distinct enrichment patterns for each disease type. While certain pathways are enriched in all three diseases (Wnt, Notch, TGF beta), there are distinct patterns for leukemia (EGF receptor signaling, DNA replication, CCKR signaling), for sarcoma (p53 pathway, CCKR signaling), and solid tumors (FGF and EGF signaling). Finally, we validated the predicted PPI networks using high-throughput transcriptomics and proteomics methods. We found that more than 65% of fusions were confirmed at the unique junction sites and more than 46% of PPI networks were altered in at least two data samples. Thus, the ChiPPI method represents a comprehensive tool for studying the anomaly of skewed cellular networks produced by fusion proteins in different cancer types.
- Manuel Corpas, Repositive.io, United Kingdom
Common practice suggests that human-origin genome data should be deposited in public repositories for further reuse. Finding and accessing deposited genome datasets, however, is cumbersome, with data and metadata being scattered throughout the internet, annotated inconsistently and often machine unreadable. This provokes a huge loss of opportunities, hence wasting resources and research investment. The Repositive platform (https://repositive.io/?NGS17) is an online portal and community of users that facilitates finding, accessing, and sharing of published genomic data: a one-stop shop to discover and explore a research question’s most relevant genomic datasets. Repositive holds descriptions and metadata about existing deposited genomic datasets across hundreds of sources from around the world. Its interface leverages the crowdsourcing of dataset metadata curation via its social networking capabilities. Repositive currently indexes more than 1 Million genomic datasets. These datasets cover population studies, microbiomes, methylomes and other NGS data. Datasets are further classified in curated collections. The Chinese Control Data Collection, for instance, indexes 10 datasets from more than 600,000 individuals of Chinese and other Asian ethnicities, including data from healthy, diseased, and reference individuals, some open access, some requiring data access agreements. By using the Repositive platform, users are able to find all published genome datasets and understand the genomic data landscape for a particular disease or condition, hence drastically lowering the barrier to genomic data access.
- Ercan Tural, Lecturer, Turkey
- Sengul Tural, Ondokuz Mayis, Turkey
The determinants of human athletic performance have long been a challenging field of study in sport sciences. In recent years there has been great progress in molecular biology techniques, which has facilitated research on the influence of genetics on human performance. A primary challenge when attempting to describe the influence of genetic factors on athletic performance is its multifactorial nature. Both the scientific and sporting communities acknowledge that genetic factors undoubtedly contribute to athletic performance. Every sport has unique physical requirements, and these requirements can differ dramatically between sports. Therefore, any study of the genetic influence on performance must consider the performance components most appropriate for the sport of interest. Athletic performance is one of the most complex human traits. As genetic technologies continue to progress rapidly, these developments also bring the potential for misuse. After “super mouse” models were created in animal experiments, the term genetically modified athletes (GMA) became popular. In this presentation, we first explain the related genes [e.g. erythropoietin (EPO), peroxisome proliferator-activated receptor gamma (PPAR-γ), phosphoenolpyruvate carboxykinase (PEPCK), insulin-like growth factor 1 (IGF-1), myostatin, follistatin, bone morphogenetic protein (BMP), vascular endothelial growth factor (VEGF), angiotensin I–converting enzyme (ACE), endothelial nitric oxide synthase (eNOS), alpha-actinin-3 (ACTN3) and endorphins] and then report the results of our own study.
- Sengul Tural, Ondokuz Mayis University Faculty of Medicine, Turkey
- Betul Celik, Ondokuz Mayis University Faculty of Medicine, Turkey
- Davut Guven, Ondokuz Mayis University Faculty of Medicine, Turkey
- Hasan Ulubas, Ondokuz Mayis University Faculty of Medicine, Turkey
- Mehmet Elbistan, Ondokuz Mayis University Faculty of Medicine, Turkey
- Nurten Kara, Ondokuz Mayis University Faculty of Medicine, Turkey
Infertility is defined as the failure to conceive, with no contraception, after one year of regular intercourse in women < 35 years and after 6 months in women > 35 years. Epidemiological data suggest that around 10% to 15% of couples are infertile. Anovulatory problems account for 25% to 50% of the causes of female infertility. A significant number of infertility phenotypes have been associated with specific genetic anomalies. The genetic causes of infertility are varied and include chromosomal abnormalities, single gene disorders and phenotypes with multifactorial inheritance. Approximately 9–24% of women undergoing IVF respond more poorly than expected to the ovarian stimulation protocol selected according to their clinical characteristics. On the other hand, high responses can cause a serious medical condition, ovarian hyperstimulation syndrome. Identifying in advance the patients who will show a poor or high response to standard treatment would be of great clinical interest. Applying pharmacogenetics to ovarian response may not only predict stimulation success but also help in adjusting and designing doses prior to treatment. Different studies have examined the impact of variations in the FSHR, LH, LHR, ESR1, AMH, SHGB, MTHFR, CYP19, BMP15, GDF9, VEGF and SOD2 genes. Recently, gene association studies have tried to identify genetic variations affecting inter-individual variability in controlled ovarian stimulation (COS). The development of testing panels that analyze interactions among different variations could increase the clinical value. GWAS and high-throughput sequencing strategies have yet to be employed in COS pharmacogenomics. We will present our next-generation DNA sequencing results for the LH, VEGF and FSHR genes and emphasise their effects on ovarian response in women who underwent assisted reproductive techniques.
- Carla Giner-Delgado, Departament de Genètica and Institut,
- Sergi Villtoro, Departament de Genètica and Institut de Biotecnologia i de Biomedicina, Spain
- Isaac Noguera, Departament de Genètica and Institut de Biotecnologia i de Biomedicina, Spain
- Ruth Gómez-Graciani, Departament de Genètica and Institut de Biotecnologia i de Biomedicina, Spain
- David Izquierdo, Departament de Genètica and Institut de Biotecnologia i de Biomedicina, Spain
- Jon Lerga-Jaso, Departament de Genètica and Institut de Biotecnologia i de Biomedicina, Spain
- Mario Cáceres, Departament de Genètica and Institut de Biotecnologia i de Biomedicina, Spain
Inversions are balanced structural variants that inhibit recombination between inverted and non-inverted alleles and have been related to evolutionary processes such as local adaptation and speciation. However, methods for testing selection acting on human inversions are not well established yet. Here, we took advantage of the large population-scale data of the INVFEST project, where 45 inversions have been genotyped in 550 individuals from seven populations of the 1000 Genomes Project, to investigate the main evolutionary forces acting on them and point out candidates with possible functional relevance. First, by comparing the observed frequency distribution of the inversions with that of SNPs selected with the same detection bias, we found a significant enrichment of high-frequency variants, which could be related to recurrence or selective processes. In addition, the analysis of the population frequencies identified a few inversions showing extreme population differentiation, suggestive of the initial phase of a selective sweep or local adaptation. To further investigate inversion genomic regions, we retrieved the LD- and SFS-based neutrality statistics from the 1000 Genomes Selection Browser. While inversion regions have average values for most statistics, indicating that recombination inhibition is unlikely to be creating a selection-like signal, some of the regions show patterns consistent with non-neutral evolution. Finally, we estimated inversion ages from the sequence divergence patterns between orientations and identified both inversions with a relatively recent origin that have rapidly increased in frequency in a continent or population and old inversions (>1 Myr old) that have been maintained at high frequency in all populations. Interestingly, candidate inversions to be under selection are significantly enriched for functional effects on genes. 
Our study therefore supports the evolutionary importance of this type of structural variants, and reports inversions that deserve further molecular and phenotypic characterization.
- Marc Llirós, Université catholique de Louvain, Belgium
- Alessandro Darzelle, Université de Namur, Belgium
- Jitendra Narayan, Université de Namur, Belgium
- Marie Cariou, Université de Namur, Belgium
- Nicolas Debortoli, Université de Namur, Belgium
- Matthiew Terwagne, Université de Namur, Belgium
- Boris Hespeels, Université de Namur, Belgium
- Veronique Baumle, Université de Namur, Belgium
- Julie Virgo, Université de Namur, Belgium
- Bernard Hallet, Université de Namur, Belgium
- Karine van, Université de Namur, Belgium
Applications of NGS in Metagenomics, Applications of NGS in Population genomics, Copy number variation in population genomics and translational applications
Abstract: Science is a field of work that moves forward from surprise to surprise. Evidence that a high-ranked taxon within the metazoans (bdelloid rotifers) has reproduced asexually (i.e., by clonal reproduction) for 4 × 10^7 years of evolutionary history led to bdelloid rotifers being called an “evolutionary scandal” (Maynard Smith 1986). Within this group of animals, Adineta vaga is an interesting model because genome sequencing revealed a degenerate tetraploidy lacking homologous pairs of chromosomes, a keystone for meiotic processes (Flot et al. 2013). In fact, allelic regions in A. vaga are massively rearranged, which is evidence for the absence of meiosis in this bdelloid rotifer lineage. Furthermore, A. vaga females can resist and survive stressful situations (i.e., desiccation or proton radiation) that induce severe genome damage (i.e., DNA double-strand breaks) with little or no reduction in viability.
The present work aims to determine whether the genome damage and repair associated with surviving stressful situations (i.e., long desiccation (30 days) and strong proton radiation (up to 500 Gy)) explain the peculiar genome structure of A. vaga and, by extension, of bdelloid rotifers. Furthermore, lab-induced structural variations might help to understand the bizarre genomic structure of such a long-lived group of organisms. Accordingly, clonal cultures of A. vaga adult females were subjected to severe genome damage, and we followed their resulting repaired genomes by means of Illumina® sequencing. Current results point towards near-identical repair when compared to available published data, highlighting the remarkable capabilities of these animals. It is also worth mentioning that currently available structural variation tools are not suited to genomes that cannot be reduced to haploidy, which increases the need for new sequencing and bioinformatic tools.
 Maynard Smith J. Nature 324, 300–301 (1986)
 Flot JF et al. Nature 500, 453–457 (2013)
- Ahmad Al, King's College London, United Kingdom
Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease predominantly of motor neurons, characterized by progressive weakness of voluntary muscles and death from respiratory failure due to diaphragmatic paralysis, typically within 3 years of onset. Despite the very poor prognosis, there is considerable variation in the survival rate, and up to 10% of people with ALS live more than 8 years from first symptoms.
There is a strong genetic contribution to ALS risk. In 5% of cases or more, a family history of ALS or frontotemporal dementia is obtained, and the Mendelian genes responsible for ALS in such families have now been identified in about 70% of cases. Even in apparently sporadic cases, twin and population studies show the heritability is about 60%.
Although risk genes reveal information about the mechanism of causation of ALS, it is also important to identify gene variants that modify survival. Survival genes could potentially be targeted directly, or their product augmented to improve ALS survival.
A number of common gene variants associated with ALS survival have been identified through genome-wide association studies or other genome-wide approaches like studying structural variants.
Aim: To determine how structural variation differs between ALS survival groups.
Methods: Analysis of copy-number variation using whole-genome sequencing data of 50 ALS patients with two extreme phenotypes, 25 short-lived patients against 25 long-surviving patients, using Copy Number Segmentation by Regression Tree in Next Generation Sequencing (CONSERTING).
Samples: The top and bottom 1.5% of ALS patients by survival were identified (25 patients from each tail of the distribution). All patients were classified as definite or probable ALS according to the El Escorial criteria and had no family history of ALS. Sample ancestry and relatedness were evaluated by principal components analysis and relationship matrices.
- Montserrat Torres, Georg-August-Universität Göttingen,
- Elisa Buchberger, Georg-August-Universität Göttingen,
- Nico Posnien, Georg-August-Universität Göttingen,
It is known that many of the genes that orchestrate some of the most important early developmental processes are relatively well conserved across distant animal phyla (“toolkit” genes). Therefore, gene expression divergence is thought to play an important role in these processes. Here we wanted to know what kind of regulatory changes are more likely to be fixed to give rise to morphological differences between species of Drosophila. The fruit flies D. mauritiana and D. melanogaster are a good model for these studies, since they are very closely related (they diverged less than 3 million years ago) but significant differences in the size of their eyes have been observed (Posnien et al. 2012).
We have sequenced the transcriptome of the eye-antennal imaginal discs of these species at key developmental time points to investigate the position of the most divergently expressed genes in some of the underlying gene networks that regulate eye and head development. Most interestingly, we have also sequenced imaginal discs of the F1 hybrids between these species to study the relative expression of the alleles in these animals compared to their parents. This analysis can indicate whether the inter-species expression differences are due to changes in the cis-regulatory region of a gene or due to changes in upstream factors that regulate its expression (variation in trans). The analysis of allele-specific expression between these two species poses additional challenges, such as the relatively low frequency of polymorphisms in their transcript sequences and the differences in the quality of their genomic references. These challenges and how we have tackled them will also be discussed.
So far, our results indicate that most differences are due to changes in trans, and this is also the case in other developing tissues, such as the wing imaginal discs of these species. These results are different to what has been previously observed in adult Drosophila tissues, and it could indicate that different stages of an organism’s life are subject to different evolutionary mechanisms influencing gene expression divergence.
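The reasoning behind the hybrid experiment can be sketched numerically. In an F1 hybrid both alleles share the same cellular (trans) environment, so any allelic imbalance there reflects cis-regulatory divergence, and the remainder of the between-species difference is attributed to trans. The log-ratio decomposition below is a common convention in allele-specific expression studies and is an assumption here, not necessarily the authors' statistical model. The function name and the example counts are hypothetical.

```python
import math


def cis_trans_components(parent_a, parent_b, hybrid_a, hybrid_b):
    """Split log2 expression divergence into cis and trans parts.

    parent_a, parent_b: normalized expression of a gene in each parental species.
    hybrid_a, hybrid_b: allele-specific expression of the two alleles in the F1 hybrid.
    All values must be positive.
    """
    total = math.log2(parent_a / parent_b)  # divergence between the species
    cis = math.log2(hybrid_a / hybrid_b)    # imbalance that persists in a shared trans environment
    return {"total": total, "cis": cis, "trans": total - cis}


# Hypothetical gene: 4-fold difference between parents, balanced alleles in the hybrid.
trans_only = cis_trans_components(400, 100, 100, 100)

# Hypothetical gene: the same 4-fold difference is preserved between alleles in the hybrid.
cis_only = cis_trans_components(400, 100, 200, 50)
```

In practice the counts would first need normalization and a statistical test per gene, which this sketch leaves out.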
- Francesc Muyas, Center for Genomic Regulation (CRG), Spain
- Mattia Bosio, Center for Genomic Regulation (CRG), Spain
- Hana Susak, Universitat Pompeu Fabra (UPF), Spain
- Luís Zapata, Universitat Pompeu Fabra (UPF), Spain
- Anna Puig, Center for Genomic Regulation (CRG), Spain
- Raquel Rabionet, Center for Genomic Regulation (CRG), Spain
- Stephan Ossowski, Universitat Pompeu Fabra (UPF), Spain
Introduction: Next-generation sequencing (NGS) has significantly changed biomedical research in recent years. Nonetheless, the use of high-throughput DNA sequencing for the identification of causal or associated variants is highly sensitive to systematic errors in sequencing, read alignment and variant calling, which can lead to false genotype-phenotype associations. Therefore, understanding and identifying systematic errors is crucial to avoid misleading results in familial disease and rare variant association studies.
Methods: We present a variant callability score based on allele balance (AB), i.e. the proportion of reads supporting alternative alleles in a specific genomic position, to detect systematic errors. Specifically, we exploit the correlation between recurrent AB bias (ABB) across hundreds of samples and false positive (FP) calls. In a cohort of 987 whole-exome sequencing (WES) samples, we obtained AB distribution models for all possible DNA genotypes and measured the recurrence of ABB across all positions. Using these measures, we developed a logistic regression model able to predict recurrent ABB and likelihood of false positive calls in 200 independent samples.
Results: Our callability score (ABB-score) showed high precision and recall for detecting ABB positively correlated with FP rates in germline variant calling pipelines (~4% of the GATK variant calls). Using Sanger validation, we found that ~50% of variants called in AB-biased positions are indeed false positives. A similar analysis of somatic variant calls in 450 chronic lymphocytic leukemia tumor-normal pairs labeled ~8% of the confident somatic calls reported by MuTect as likely FP.
Conclusion: We have developed a variant callability score (ABB-score) able to identify FP calls caused by systematic sequencing or alignment errors in the human exome. This model can be integrated in disease variant prioritization or association methods to remove spurious results.
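The allele balance defined in the Methods, and the idea of measuring how often it is biased at a position across samples, can be illustrated as follows. This is a simplified sketch: the abstract's approach uses AB distribution models per genotype and a logistic regression, whereas the fixed expected value of 0.5 for heterozygous calls and the 0.2 tolerance used here are assumptions for illustration only. Function names are hypothetical.

```python
def allele_balance(ref_reads, alt_reads):
    """Proportion of reads supporting the alternative allele at a position."""
    total = ref_reads + alt_reads
    if total == 0:
        return None  # no coverage, AB is undefined
    return alt_reads / total


def abb_recurrence(read_counts, expected=0.5, tolerance=0.2):
    """Fraction of samples whose AB deviates from the expected value.

    read_counts: list of (ref_reads, alt_reads), one pair per sample called
    heterozygous at the same genomic position.
    expected, tolerance: illustrative thresholds, not taken from the abstract.
    """
    balances = [allele_balance(ref, alt) for ref, alt in read_counts]
    balances = [b for b in balances if b is not None]
    if not balances:
        return 0.0
    biased = sum(1 for b in balances if abs(b - expected) > tolerance)
    return biased / len(balances)


# Hypothetical read counts for four samples at one position.
position = [(50, 50), (90, 10), (85, 15), (48, 52)]
recurrence = abb_recurrence(position)
```

A position where many samples show the same skew is more likely to reflect a systematic sequencing or alignment artifact than a real heterozygous variant, which is the intuition the score builds on.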
- Hana Susak, Centre for Genomic Regulation (CRG) Universitat Pompeu Fabra, Spain
- Luis Zapata, Centre for Genomic Regulation (CRG) Universitat Pompeu Fabra, Spain
- Oliver Drechsel, Centre for Genomic Regulation (CRG), Spain
- Stephan Ossowski, Centre for Genomic Regulation (CRG), Spain
Tumors evolve over time, leading to tumor heterogeneity and sub-clonal structure. Here we explored the value of the cancer cell fraction (CCF) of mutations, i.e. the fraction of a tumor’s cells harboring a mutation, for predicting cancer driver genes. We describe a novel Bayesian model, cDriver, that integrates signatures of selection of somatic point mutations at three levels: i) population level (recurrence in the cancer cohort), ii) cellular level (CCF), and iii) molecular level (the functional impact of the variant allele). While recurrence and functional impact were previously used by other methods to identify driver genes, CCF as a signature of positive selection in cancer has been neglected. To assess whether the CCF of somatic mutations has discriminative power between driver and passenger genes, we compared CCF values in 4 groups: 1) nonsilent somatic mutations in driver genes, 2) silent somatic mutations in driver genes, 3) nonsilent somatic mutations in passenger genes and 4) silent somatic mutations in passenger genes, and found highly significant differences. Next, we comprehensively benchmarked the ranked driver gene lists of cDriver and 4 competing, widely used methods. We demonstrate that cDriver performs favorably in single-cancer and pan-cancer analyses. We further show how high-quality lists of driver genes can be used to identify novel tumor type – driver gene connections of clinical value.
- Luis Zapata, The Barcelona Institute of Science and Technology, Universitat Pompeu Fabra,
- Jia Ding, Max Planck Institute for Plant Breeding Research, Germany
- Eva-Maria Willing, Max Planck Institute for Plant Breeding Research, Germany
- Benjamin Hartwig, Max Planck Institute for Plant Breeding Research, Germany
- Vipul Patel, Max Planck Institute for Plant Breeding Research, Germany
- Geo Velikkakam James, Max Planck Institute for Plant Breeding Research, Germany
- Maarten Koornneef, Max Planck Institute for Plant Breeding Research, Germany
- Stephan Ossowski, The Barcelona Institute of Science and Technology, Universitat Pompeu Fabra,
- Korbinian Schneeberger, Max Planck Institute for Plant Breeding Research, Germany
Reference-based assemblies reveal small and large-scale sequence variation. However, they fail to separate local variation into co-linear and rearranged variation. Despite the hundreds of genomes of different Arabidopsis thaliana accessions available, there is so far only one de novo, full-length assembled genome, the reference sequence. We assembled 117 Mb of the A. thaliana Ler genome into five chromosome-equivalent sequences using a combination of short Illumina reads, long PacBio reads and linkage information. Whole-genome comparison against the reference sequence revealed 564 transpositions and 47 inversions comprising about 3.6 Mb, in addition to 4.1 Mb of non-reference sequence mostly originating from duplications. Moreover, we found 105 single-copy genes that were present in only one of the genomes and 334 single-copy orthologs that showed an additional copy in only one of the genomes. Our work gives insights into the degree and type of variation that can be revealed when full-length de novo assemblies are used instead of resequencing or other reference-dependent methods.