Session A: (July 22 and July 23)
Session B: (July 24 and July 25)
Presentation Schedule for July 22, 6:00 pm – 8:00 pm
Presentation Schedule for July 23, 6:00 pm – 8:00 pm
Presentation Schedule for July 24, 6:00 pm – 8:00 pm
Short Abstract: Short-read resequencing of genomes produces abundant information on the genetic variation of individuals. Because these variants are so numerous, they are rarely validated exhaustively. Furthermore, low levels of undetected variant miscalling will have a systematic and disproportionate impact on the interpretation of individual genome sequence information, especially should the miscalls also be carried through into reference databases of genomic variation. We find that sequence variation from short-read sequence data is subject to recurrent-yet-intermittent miscalling that occurs in a sequence-intrinsic manner and is very sensitive to sequence read length. Furthermore, the resultant miscalled variants are sensitive to small sequence variations between genomes, and are therefore often intrinsic to an individual, pedigree, strain or human ethnic group. We show that recurrently miscalled variants may be reproduced for a given genome from repeated simulation rounds of read resampling, realignment and recalling. Identification and removal of recurrent false positive variants from specific individual variant sets will improve overall data quality. Further, read length is a strong determinant of whether given false variants will be called for any given genome, which has profound significance for cohort studies that pool datasets collected and sequenced at different points in time.
Short Abstract: Genetic variation is a major cause of intra-species differences, but the mechanisms through which it affects phenotype are poorly understood. While it is straightforward to understand the effect of variants overlapping coding sequences, the majority of functional variants overlap intergenic regions. These variants are expected to impact gene regulation by changing the function of cis-regulatory modules. Here, we apply a gapped k-mer method to a DNase hypersensitivity (DHS) dataset spanning 5 time points and 3 tissues in Drosophila melanogaster. First, we train lsgkm-svm on tissue-specific DHS and then score a catalogue of variants based on their predicted effect on tissue-specific chromatin accessibility. We use the same model to learn the motifs underlying tissue-specific accessible regions in an unbiased way. We validate the pipeline on an independent dataset by showing that it predicts the direction of chromatin accessibility quantitative trait loci. We further validate our predictions by performing tissue-specific ATAC-Seq on two inbred Drosophila lines. In summary, we build a framework that can prioritize variants and predict their tissue-specific effects on chromatin accessibility. This dataset will increase variant interpretability and serve as a resource for researchers interested in population genetics and regulatory variants.
Short Abstract: Obtaining valid SNV and indel calling results from next-generation sequencing data is still an important issue in the field of bioinformatics. In close collaboration with medical and biological experts, we developed appreci8 - A Pipeline for PREcise variant Calling Integrating 8 tools, which outperforms any alternative approach. Here we present the appreci8R, an R version of our appreci8 algorithm. The appreci8R relies on the same rule-based system for variant calling as appreci8. Functions for the standard analysis steps as well as a shiny GUI are provided. Unlike in classical appreci8, the user may change any analysis parameter, including the artifact and polymorphism scores. Furthermore, any other combination of variant calling tools, including additional tools, is supported. Checkpoints are provided, allowing analysis to be restarted from intermediate results. Analyzing two well-characterized Illumina datasets, we observe that run-time and performance of the appreci8R are highly comparable to classical appreci8. Investigating the influence of a single updated tool, here GATK-3.3 vs GATK-4, we observe 5-23% more variants - true as well as false positives - in the GATK output. However, the final appreci8R results remain stable. In conclusion, the appreci8R represents a powerful tool for variant calling, providing more flexibility at equal performance compared to classical appreci8.
Short Abstract: Genome-wide association studies (GWAS) indicate that a large percentage of single nucleotide polymorphisms (SNPs) appear in non-coding genomic regions that may be thousands of kilobases away from their target genes. Since these SNPs can lead to functional consequences such as various traits or diseases, a major aim is to understand their molecular mechanisms. SNPs may occur in regulatory regions like promoters or enhancers, and impact transcriptional regulation. Our newly devised approach prioritizes SNPs as targets of one or several transcription factors (TFs). We infer this information jointly with the target gene and identify TF-gene pairs for which the TF binding affinity is influenced by the SNP, which in turn affects the gene expression. Specifically, conditional mutual information is used to link regulatory regions to genes, using expression and epigenetics data. Additional criteria, namely that the TF binds a region overlapping the SNP and shows a differential binding affinity, are combined into one statistical test. We evaluate our approach on experimentally validated TF-SNP-gene triplets derived from literature and database searches. Considering the additional information improves the prioritization of the correct TF. We apply our method to SNPs identified in GWAS to study the impact of genetically induced transcriptional misregulation in human diseases.
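The abstract above relies on conditional mutual information to link regulatory regions to genes. As an illustration only, the following minimal sketch estimates I(X; Y | Z) from discrete observations using plug-in frequencies. The function name, the discretisation into integer states, and the toy data are assumptions made for this example and are not taken from the authors' method.

```python
from collections import Counter
from math import log2


def conditional_mutual_information(xs, ys, zs):
    """Plug-in estimate of I(X; Y | Z) in bits from equal-length discrete sequences."""
    n = len(xs)
    if n == 0 or not (n == len(ys) == len(zs)):
        raise ValueError("inputs must be non-empty and of equal length")
    n_xyz = Counter(zip(xs, ys, zs))
    n_xz = Counter(zip(xs, zs))
    n_yz = Counter(zip(ys, zs))
    n_z = Counter(zs)
    cmi = 0.0
    for (x, y, z), c in n_xyz.items():
        # p(x,y,z) * log2( p(z) p(x,y,z) / (p(x,z) p(y,z)) );
        # the 1/n factors cancel inside the ratio, so raw counts suffice there.
        cmi += (c / n) * log2(c * n_z[z] / (n_xz[(x, z)] * n_yz[(y, z)]))
    return cmi


# Hypothetical discretised signals: region activity, gene expression, covariate.
region = [0, 0, 1, 1]
gene = [0, 0, 1, 1]
covariate = [0, 0, 0, 0]
print(conditional_mutual_information(region, gene, covariate))
```

With real data the plug-in estimator is biased for small samples, so a permutation test or a bias-corrected estimator would normally accompany it.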
Short Abstract: AMLVaran is a web-based platform that allows the upload of targeted NGS sequencing raw data, performs the complete bioinformatics data processing workflow, and provides the user with a clearly structured overview of the detected mutations, as well as a clinical report adapted to the needs of Acute Myeloid Leukemia diagnostics. A user study with a first prototype showed that users without a specific training (n=9) had problems operating the software. In several cases, results were misinterpreted. We identified problems that led to the misunderstandings and implemented numerous improvements to the software. In a follow-up study with the final software (with a different user group and new tasks), hotspot mutations were correctly interpreted by all testers (n=12). In addition, these testers performed much more complex tasks with greater precision. The subjective user-friendliness of the software, evaluated with a standardized SUS questionnaire, increased from a median score of 68 to 80. The optimized version was rated with "good" or better by 62% and with "excellent" by 33% of the testers. The results demonstrate the benefit of usability studies and show that a good user interface can contribute decisively to the success or failure of software. AMLVaran was significantly improved by these studies.
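For readers unfamiliar with the SUS questionnaire mentioned above, the sketch below shows the standard System Usability Scale scoring rule (ten items answered on a 1 to 5 scale, odd items contributing the response minus 1, even items contributing 5 minus the response, and the sum multiplied by 2.5). The tester responses are invented for illustration and are not the study's data.

```python
from statistics import median


def sus_score(responses):
    """Score one 10-item SUS questionnaire (responses 1..5) on the 0..100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected ten responses, each an integer from 1 to 5")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, 7, 9 (even index) are positively worded: r - 1.
        # Items 2, 4, 6, 8, 10 (odd index) are negatively worded: 5 - r.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5


# Hypothetical responses from three testers.
testers = [
    [4, 2, 4, 2, 4, 2, 4, 2, 4, 2],
    [5, 1, 5, 1, 5, 1, 5, 1, 5, 1],
    [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
]
print(median(sus_score(t) for t in testers))
```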
Short Abstract: Evidence has shown that GWAS-associated non-coding variants for a particular trait/disease are significantly enriched in certain chromatin states of relevant tissues/cell types. Thus, the precise prediction of functional regulatory variants, especially in the context of tissues/cell types, has become an important topic in post-GWAS analyses. Here, we present our new method cepip2, which leverages fine-mapped GTEx eQTLs and 101 epigenomic or functional features based on the Roadmap Epigenomics project to train a context-specific model. The best match between GTEx tissues and Roadmap tissues was automatically selected according to the model performance of different tissue matching schemes. cepip2 demonstrated better performance compared with other methods according to multiple independent evaluations. It was applied to GWAS summary data of multiple diseases and successfully estimated relevant tissues/cell types of diseases. cepip2 can also prioritize non-coding causal SNPs from a GWAS credible set in which highly linked SNPs achieve similar posterior causal probabilities with conventional fine-mapping tools. In summary, we believe that cepip2 will significantly promote the prediction of functional regulatory variants as well as the interpretation of GWAS results.
Short Abstract: Genetic variants are commonly prioritised based on pathogenicity scores such as the Combined Annotation Dependent Depletion (CADD) score, allele frequencies in variant databases like the Genome Aggregation Database (gnomAD) or functional effect predictions by tools such as SnpEff. In addition to these metrics, metadata about affected genes is still commonly looked up manually, thus rendering variant prioritisation a cumbersome and time-consuming task. We developed Haystack: a web application to streamline the search for disease-related genes in our in-house variant database Sciobase. Haystack provides variant filtering capabilities based on our database’s columns and integrates 12 728 gene descriptions from the Reference Sequence (RefSeq) database, 1 687 962 gene expression values from the Genotype-Tissue Expression (GTEx) project and the Human Protein Atlas (HPA), and 465 584 genotype-phenotype associations from the Mammalian Phenotype Ontology (MPO). In a first project, we used Haystack to determine genes linked to male infertility in a whole exome sequencing dataset of 198 patients affected by complete germ cell loss (Sertoli cell-only syndrome). Using the features provided by Haystack, variants were filtered by minor allele frequencies in public databases, predicted severe consequences on protein structure and tissue expression in testis, resulting in 683 candidate genes for further analyses.
Short Abstract: Genome-wide association studies have identified more than 30 SNPs in Alzheimer’s disease (AD). However, the molecular function of most identified SNPs remains unknown. Here, we investigated whether there is evidence at the transcript level that SNPs are functionally implicated in alternative splicing related to AD pathology by first focusing on two well-known risk factor genes in AD, CLU and PICALM. CLU regulates the clearance and aggregation of amyloid-beta. PICALM encodes phosphatidylinositol binding clathrin assembly protein, which modulates amyloid-beta brain metabolism and neuronal toxicity. We analyzed genotype and RNA-seq data measured in four brain regions: frontal pole, superior temporal gyrus, para-hippocampal gyrus, and inferior frontal gyrus. For each brain region, we calculated exon expression levels in these genes. We identified alternative splicing events associated with two SNPs (rs3851179 in PICALM and rs4236673 in CLU). The minor allele of rs3851179 tended to be associated with an increased exon skipping level in two brain regions. The major allele T of rs7982 showed an association with a reduced intron retention level in two brain regions. Both alternative splicing events have the potential to lead to abnormal protein products. Our study suggests that both SNPs may play a role in underlying AD pathology through alternative splicing mechanisms.
Short Abstract: Personalized genomic medicine depends on integrated analyses that combine genetic and phenotypic data from individual patients with reference knowledge of the functional and clinical significance of sequence variants. Sources of this reference knowledge include the ClinVar repository of human genetic variants, a community resource that accepts submissions from external groups, and UniProtKB/Swiss-Prot, an expert curated resource of protein sequences and functional annotation, which provides knowledge on over 30,000 human protein coding sequence variants, curated from peer reviewed literature reports. Here we present a pilot study that lays the groundwork for the integration of curated knowledge of protein sequence variation from UniProtKB/Swiss-Prot with ClinVar. The existing interpretations of variant pathogenicity in UniProtKB/Swiss-Prot and ClinVar are highly concordant, with 88% of common variants having interpretations of clinical significance that agree. Re-curation of a subset of UniProtKB/Swiss-Prot variants using ACMG guidelines further increases this level of agreement, mainly due to the reclassification of supposedly pathogenic variants as benign, based on newly available population frequency data. We have incorporated ACMG guidelines and ClinGen tools into the UniProtKB curation workflow, and routinely submit variant data from UniProtKB/Swiss-Prot to ClinVar. These efforts will increase the usability and utilization of UniProtKB variant data.
Short Abstract: The vast amount of DNA sequencing data collected from large patient cohorts have helped in identifying a wide number of disease related mutations relevant for diagnosis and therapy. While existing bioinformatics methods and resources are mainly focusing on causal variants in Mendelian diseases, many difficulties remain to analyse more intricate genetic models involving variant combinations in different genes, an essential step for the discovery of the causes of oligogenic diseases. ORVAL (the Oligogenic Resource for Variant AnaLysis) tries to solve this problem by generating networks of pathogenic variant combinations in gene pairs, as opposed to isolated variants in unique genes. This online platform integrates innovative machine learning methods for combinatorial variant pathogenicity prediction and offers several interactive and exploratory tools, such as predicted pathogenicity and protein-protein interaction networks, a ranking of pathogenic gene pairs, as well as visual mappings of the cellular location and pathway information. ORVAL is the first web-based exploration platform dedicated to identifying networks of candidate pathogenic variant combinations to help clinicians and researchers in uncovering oligogenic causes for more complex diseases. ORVAL is available at https://orval.ibsquare.be.
Short Abstract: Motivation: In genomic medicine for rare disease patients, the primary goal is to identify one or more variants that cause their disease. Typically, this is done through filtering and then prioritization of variants for manual curation. However, prioritization of variants in rare disease patients remains a challenging task due to the high degree of variability in phenotype presentation and molecular source of disease. Thus, methods that can identify and/or prioritize variants to be clinically reported in the presence of such variability are of critical importance. Results: We tested the application of classification algorithms that ingest variant annotations along with phenotype information for predicting whether a variant will ultimately be clinically reported and returned to a patient. To test the classifiers, we performed a retrospective study on variants that were clinically reported to 237 rare disease patients. We treat the classifiers as variant prioritization systems for ranking all variants observed by clinical analysts. For comparison, we performed the same analysis with four other variant prioritization algorithms and two single-measure controls. We showed that all five classifiers outperformed all other methods with the best classifiers ranking 73% of all reported variants and 97% of reported pathogenic variants in the top 20.
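The headline metric in the abstract above is the share of reported variants that a prioritization system places in the top 20 of each patient's ranked list. The sketch below shows one plausible way to compute such a figure. The data layout, the function name and the variant identifiers are hypothetical, and the authors' exact evaluation protocol may differ.

```python
def fraction_in_top_k(ranked_cases, k=20):
    """Fraction of reported variants that appear in the top k of their patient's ranking.

    ranked_cases: list of (ranked_variant_ids, reported_variant_ids) per patient,
    where ranked_variant_ids is ordered from highest to lowest priority.
    """
    hits = 0
    total = 0
    for ranked, reported in ranked_cases:
        top = set(ranked[:k])
        hits += sum(1 for variant in reported if variant in top)
        total += len(reported)
    return hits / total if total else 0.0


# Two hypothetical patients, evaluated at k=2 to keep the example small.
cases = [
    (["v3", "v1", "v2"], {"v1"}),
    (["a", "b", "c", "d"], {"a", "d"}),
]
print(fraction_in_top_k(cases, k=2))
```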
Short Abstract: The ACMG/AMP evidence-based guidelines for variant pathogenicity assessment define several criteria assessing particular supporting evidence information. Criteria are combined to classify a variant as either pathogenic (P), likely-pathogenic (LP), benign (B), likely-benign (LB), or uncertain significance (VUS). Although widely adopted in clinical interpretation of variants this process has remained largely manual and time-consuming. Current informatics tools aimed to ease the application of the guidelines do not completely automate the entire process. Therefore, we developed a forward-chaining inference engine implementing the ACMG–AMP criteria and taking as input annotated variants and codified gene-condition curation to automatically infer the classification of variants. A natural language generation module in the engine provides explanatory text with the rationale of the classification for reference by clinical geneticists. Here we present a thorough performance evaluation of our method, analyzing a truth set of 37,491 previously classified variants for a hereditary cancer risk (15-gene), newborn screening (30-gene), and incidental findings (59-gene) panels. We show automatic classification of up to 95% and 77% of P/LP and B/LB variants with essentially no misclassifications. Unclassified variants are annotated with resolved criteria for rapid manual classification. This advance would allow clinical labs to scale-up and reduce effort in processing gene-panel tests.
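To make the idea of forward chaining concrete, here is a minimal sketch of such an engine: rules fire whenever all of their premises are known, and the loop repeats until nothing new can be derived. The fact names and the three toy rules are simplified illustrations chosen for this example (loosely modelled on the PVS1 and PM2 criteria and one likely-pathogenic combination). They are not the authors' rule base and do not cover the full guidelines.

```python
def forward_chain(facts, rules):
    """Derive all reachable conclusions; return (known_facts, trace of fired rules)."""
    known = set(facts)
    trace = []
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # A rule fires once, when every premise is already known.
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                trace.append((sorted(premises), conclusion))
                changed = True
    return known, trace


# Hypothetical, heavily simplified rules: (premises, conclusion).
RULES = [
    (frozenset({"PVS1", "PM2"}), "likely_pathogenic"),
    (frozenset({"null_variant", "lof_is_disease_mechanism"}), "PVS1"),
    (frozenset({"absent_from_population_db"}), "PM2"),
]

facts = {"null_variant", "lof_is_disease_mechanism", "absent_from_population_db"}
known, trace = forward_chain(facts, RULES)
for premises, conclusion in trace:
    print(" + ".join(premises), "=>", conclusion)
```

The recorded trace is the kind of information an explanation module could turn into rationale text for a reviewer.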
Short Abstract: Genome-wide association studies (GWAS) have given many insights into the genetic underpinnings of disease. However, the design of GWAS prevents us from learning the molecular function of each locus, keeping us from easily tying non-coding SNPs to a biological mechanism. Many studies have unraveled the cis-regulatory effect of SNPs on molecular traits by correlating genotype with gene expression, in what are termed cis-eQTL (expression quantitative trait loci) analyses. With multi-omic data sets that share samples across measurements, we can use genotype to reason about causal association between molecular traits. Previously, our group used an established causal inference test (CIT) to infer the mediation of individual SNP effects by gene expression and epigenetic modifications. With many loci associated with each gene, using one test per locus likely fails to account for the role of complex interaction between loci in causal mediation. Here, we investigate the mediation of multiple SNP effects by describing genotype as a small, complex non-linear combination of latent variables. We then test for mediation of these latent-variable effects in a multi-omic data set from 411 samples from the dorsolateral prefrontal cortex of older individuals having matched H3K9 acetylation, DNA methylation, and genotyping data.
Short Abstract: We present an accessible, fast and customizable network propagation system for pathway boosting and interpretation of genome-wide association studies. This system – NAGA (Network Assisted Genomic Association) – taps the NDEx biological network resource to gain access to thousands of protein networks and select those most relevant and performative for a specific association study. The method works efficiently, completing genome-wide analysis in under five minutes on a modern laptop computer. We show that NAGA recovers many known disease genes from analysis of schizophrenia genetic data, and it substantially boosts associations with previously unappreciated genes such as amyloid beta precursor. On this and seven other gene-disease association tasks, NAGA outperforms conventional approaches in recovery of known disease genes and replicability of results. Protein interactions associated with disease are stored as networks in NDEx where they are readily visualized, annotated, and interpreted using desktop and web-based tools in the Cytoscape Cloud ecosystem, and where the data is programmatically accessible for further analysis.
Short Abstract: Identification and accurate evaluation of the effects of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein structure and function is essential to assess the reasons behind many inherited diseases, to understand the association with drug resistance mechanisms, and to link to drug sensitivity issues in certain populations for precision medicine development, among many other applications. As is general practice in identifying SNPs, two main sequence-level techniques are employed: Genome Wide Association Studies (GWAS) and Candidate Gene Association Studies (CGAS). These techniques associate SNPs with diseases by comparing the genomes/genes of healthy individuals with those of unhealthy individuals to determine which variations mostly occur in disease-affected patients. However, these statistical methods do not provide an understanding of the functional effects of variation. Hence, genomics and post-genomics methods need to be combined. Recently, we have developed protocols and related tools/web servers by merging computational chemistry, structural bioinformatics and biophysics approaches to address the issues mentioned above, and applied them to case studies. This presentation will include two examples: 1) Characterizing early drug resistance-related events using geometric ensembles from HIV protease dynamics (Sheik Amamuddy et al., Sci Rep., 2018); 2) Mechanism of action of non-synonymous single nucleotide variations associated with α-carbonic anhydrase II deficiency.
Short Abstract: The increased number of sequenced cancer genomes since the completion of the human genome project, and the importance of correctly identifying somatic mutations, which can influence treatment or prognosis, are driving forward the development of novel somatic variant calling tools (somatic callers). The lack of a best-practice algorithm for identifying somatic variants, however, requires constant testing, comparing and benchmarking of these tools. The absence of a truth set further hinders evaluation. By comparing widely used open source somatic callers, such as MuTect2, Strelka2, VarDict, VarScan2, Seurat and LoFreq, through analysis of in-house generated synthetic data, we found complex dependencies of somatic caller parameters relative to coverage depth, allele frequency, variant type, and detection goals. Next, we normalized and filtered the output data such that it can be appropriately compared to the truth set. The acquired benchmarking results were automatically and efficiently structured and stored. All of the tools used for the analysis have been implemented in Common Workflow Language, which makes them portable and reproducible.
Short Abstract: Understanding the molecular consequences of variants and their potential contribution to a disease is an essential step towards the development of a cure. The origin of a disease is routinely investigated using genome sequencing, which provides good evidence for which genetic variants are the underlying cause of a disease. But associating variants with a molecular consequence that explains the observed phenotype of the disease is still a bottleneck in clinical genomics. UniProt, in collaboration with Ensembl, PDBe and the Janet Thornton research group, has developed an extension to the Ensembl VEP for interpreting the consequence of variants on the molecular function of proteins. Protein Variant Effect Predictor (PepVEP) combines information from VEP with the comprehensive, high-quality protein functional annotations and clinical information from UniProt and protein structure functional annotations from PDBe. This new tool integrates tools and services (GIFTS, The Proteins API, SIFTS, ProtVista, etc.) from the three database services and research group to provide a service that allows users to interactively analyse and interpret the molecular consequences of variants. Here we illustrate this tool and how we aim to support the scientific community, computational biologists and clinical researchers in analysing and interpreting the link between variation and protein function.
Short Abstract: Advances in DNA sequencing technologies have led to a dramatic increase in the volume of available genomic sequence data. Rare disease diagnostics is one of the fields that has been transformed by these technologies. However, variant annotation remains a considerable challenge. Despite recent progress, there is still a lack of robust in silico tools that accurately assign pathogenicity to variants. Disease-associated variants in CACNA1F are the commonest cause of X-linked incomplete Congenital Stationary Night Blindness (iCSNB), a condition associated with non-progressive visual impairment. We combined detailed genetic and homology modelling data to produce CACNA1F-vp, an in silico tool that differentiates pathogenic from benign missense CACNA1F variants. CACNA1F-vp predicts variant effects on the structure of the CACNA1F encoded protein (a calcium channel) using parameters based upon changes in amino acid properties; these include size, charge, hydrophobicity, and position. The algorithm produces an overall score for each variant that can be used to predict its pathogenicity. CACNA1F-vp was able to identify pathogenic variants with a high degree of accuracy (p-value = 4.2 × 10^-20) using a 10-fold cross-validation regression algorithm. We consider this protein-specific model to be a robust stand-alone diagnostic tool that could be replicated in other proteins and could enable precise, timely diagnosis.
Short Abstract: Distinguishing novel pathogenic from rare but benign variants is a key challenge in clinical genetics. Many in silico tools have been developed to predict the clinical effect of missense variants. Some utilize a machine learning based algorithm trained on previously documented variants. These tools typically perform unreliably on novel variants that are not well studied or represented in training sets. Our group has previously developed Paralogue Annotation (PA), a method that identifies a variant’s equivalent position in paralogous genes and searches for known variants at those positions. If the paralogous variants are pathogenic, this can be used to infer that the variant in the query gene is also likely pathogenic. The validity of this method has previously been restricted to genes associated with arrhythmia syndromes. Here, we systematically applied PA to 4,499 genes harbouring 24,185 pathogenic and 14,843 benign variants identified from ClinVar. We show that this approach consistently performs at a higher precision than established tools with an achievable maximum of 99.5%. Compared to machine learning based tools that require training, PA can be used as an orthogonal method to identify novel and rare disease-causing variants. As more known variants arise, PA will become more promising over time.
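The core step described above, finding a variant's equivalent position in a paralogue, can be illustrated with a pairwise alignment walk. The sketch below is an assumption-laden toy: the gapped sequences, the function name and the set of known pathogenic positions are invented for the example, and the published method presumably works from multiple alignments of real paralogue families.

```python
def map_position(aligned_query, aligned_paralogue, query_pos):
    """Map a 1-based residue position in the query onto the paralogue.

    Both inputs are rows of the same alignment, with '-' marking gaps.
    Returns (paralogue_pos, query_residue, paralogue_residue), or None when
    the query residue is aligned to a gap in the paralogue.
    """
    if len(aligned_query) != len(aligned_paralogue):
        raise ValueError("aligned sequences must have the same length")
    q = p = 0  # ungapped residue counters for each sequence
    for a, b in zip(aligned_query, aligned_paralogue):
        if a != "-":
            q += 1
        if b != "-":
            p += 1
        if a != "-" and q == query_pos:
            return None if b == "-" else (p, a, b)
    raise ValueError("query_pos is outside the query sequence")


# Hypothetical alignment and hypothetical known pathogenic paralogue positions.
query_row = "M-KRLT"
paralogue_row = "MAKR-T"
known_pathogenic_positions = {4}

hit = map_position(query_row, paralogue_row, 3)
if hit is not None and hit[0] in known_pathogenic_positions:
    print("query residue 3 maps to a known pathogenic paralogue position:", hit)
```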
Short Abstract: Prediction of phenotypes from organisms’ genotypes is one of the major challenges in biomedical science. Large amounts of clinical and genetic data make it nowadays approachable computationally. We focus on the nuclear problem: predicting phenotypic impact of individual genetic variants. Numerous lines of evidence suggest that there is a correlation between the variant-carrying gene’s identity and certain phenotypes, introducing a statistical bias. It can artificially inflate performance of methods in a setting when mutations, but not genes are randomly split into the training and test sets. Methods trained in such a way are likely to misclassify benign variants in pathogenicity-prone genes and to fail when predicting the impact of variants in genes not seen in the training. We present a novel random forest-based machine learning method that employs features related to protein evolution and their three-dimensional structures. By applying it to deep mutational scan data and clinically annotated mutations from ClinVar, we demonstrate that including structure-related features and excluding features that may introduce protein bias, such as protein length or tendency to form homooligomeric complexes, improves the performance in a fair setting when genes as a whole, and not individual mutations, are split into the training and test sets.
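The "fair setting" described above assigns whole genes, not individual mutations, to the training or the test set, so that no gene contributes variants to both. A minimal sketch of such a gene-level split follows. The function name, the 20% test fraction, the seed and the variant records are illustrative assumptions, not details from the abstract.

```python
import random


def split_by_gene(variants, test_fraction=0.2, seed=0):
    """Split (gene, variant_id) records so that each gene lands entirely in one set."""
    genes = sorted({gene for gene, _ in variants})
    rng = random.Random(seed)
    rng.shuffle(genes)
    # Hold out at least one gene so the test set is never empty.
    n_test = max(1, round(len(genes) * test_fraction))
    test_genes = set(genes[:n_test])
    train = [v for v in variants if v[0] not in test_genes]
    test = [v for v in variants if v[0] in test_genes]
    return train, test


# Hypothetical variant records.
records = [
    ("GENE_A", "p.R10C"), ("GENE_A", "p.G45D"),
    ("GENE_B", "p.L7P"),
    ("GENE_C", "p.A100T"), ("GENE_C", "p.V12M"),
    ("GENE_D", "p.S3F"),
    ("GENE_E", "p.Q88H"),
]
train, test = split_by_gene(records)
print(len(train), "training variants;", len(test), "test variants")
```

A per-variant random split of the same records could place two GENE_A variants on opposite sides, which is exactly the leakage the abstract warns about.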
Short Abstract: The ITHANET Portal (www.ithanet.eu) is an expanding resource for researchers and healthcare professionals dealing with haemoglobinopathies. As an official partner of the HVP Global Globin Challenge, ITHANET has been selected for national data collection, storage and sharing, and for the development of a thalassaemia-specific genotype-phenotype database. The ITHANET Portal offers a wide range of curated databases and tools, as follows: 1. IthaGenes is a database that organises genes and variations affecting haemoglobinopathies, including causative mutations, disease-modifying mutations and diagnostically relevant neutral polymorphisms. Additionally, IthaGenes integrates the NCBI sequence viewer and provides phenotype, epidemiology, HPLC data, related publications and external links. 2. IthaMaps is a database that stores and illustrates information on the epidemiology of haemoglobinopathies as documented in published literature. Country-specific information on haemoglobinopathy-related policies, prevalence, incidence and overall disease burden is given, including relative allele frequencies of specific globin mutations in each country and/or region. 3. IthaChrom provides digitised reports of diagnostic HPLC analyses as a reference tool for haemoglobinopathy diagnosis. ITHANET is also coordinating the ClinGen-associated Haemoglobinopathy Variant Curation Expert Panel, aiming at standardising the interpretation of haemoglobinopathy-related variants based on specified ACMG/AMP guidelines.
Short Abstract: The spatial distribution of disease-associated variants in proteins can suggest mechanisms of action and can help to differentiate benign from damaging candidates. However, in rare diseases the sparsity of variants per protein reduces the power of spatial distribution analysis. We overcome this by enriching rare variant data through the alignment of similar structural (CATH) domains from different proteins. We then analyse the domains together and uncover spatial patterns of any shape using DBSCAN spatial clustering. Firstly, comparing the locations of clusters with those of known disease-associated variants and those considered benign can help to assign a pathogenicity probability to variants of unknown significance. Secondly, the location of the cluster in the protein or complex can suggest potential mechanisms of action such as the interruption of ligand binding, catalysis or structure destabilization/misfolding. Thirdly, by comparing members of the same spatial clusters with gene lists of known disease association we can uncover new disease gene candidates. Enrichment and spatial clustering of rare variants on protein structures can allow important regions of proteins to be uncovered and new disease-associated genes to be found, and can help to characterise variants of unknown significance.
Short Abstract: Multiple regression is widely used for post-GWAS analyses, especially for variant fine-mapping and re-ranking the nominally significant regions identified by GWAS. Bayesian multiple regression selects the variants using a sparsity-enforcing prior on the variant effect sizes to avoid over-training and integrates out the effect sizes for posterior inference. For case-control GWAS with binary disease status, the logistic model should perform significantly better than the linear model. Regardless, existing multiple regression methods approximate the logistic model with a linear function because otherwise the integration requires costly and technically challenging MCMC sampling. We introduced the quasi-Laplace approximation to solve the integral and developed software called Bayesian multiple LOgistic REgression (B-LORE). In extensive simulations, B-LORE outperformed existing methods whenever non-linearities are strong, e.g. B-LORE could extract more information simply by adding controls in a GWAS while keeping the same number of cases. From a meta-analysis of five small GWAS for coronary artery disease (CAD), we applied B-LORE to the top 50 regions, which included 11 regions discovered by a 14-fold larger study (CARDIoGRAMplusC4D). B-LORE discovered all 11 regions with >95% causal probability, along with 12 novel regions, of which 9 are known to be associated with well-known CAD risk-related blood metabolic phenotypes.
Short Abstract: Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files. We foresee that MMSplice will be a useful tool to interpret variants of unknown significance in rare and common diseases.
Short Abstract: The annotation of genetic variants is an important step in the molecular diagnosis of people with suspected Mendelian disorders. RNA-seq can expand the set of annotated variants for diagnosis to include intronic and synonymous variants that result in pathogenic changes in splicing. One caveat with RNA-seq-based diagnostics is that they are typically performed on one of the three most clinically accessible tissues (CATs): blood, lymphoblasts, and fibroblasts. Therefore, we first assess limitations of CATs by assessing tissue-specific expression and splicing variations across the 50 non-CATs and 3 CATs in GTEx and identify genes that are inadequately represented by CATs for each tissue. Next, we adapt MAJIQ, a tool for splicing analysis, for the setting of Mendelian diagnosis. MAJIQ’s key advantages are in its ability to accurately detect, quantify, and visualize de novo (unannotated) and complex (involving more than two junctions) splicing changes. We develop filters and statistical tests for MAJIQ into a pipeline for identifying potentially pathogenic splicing events and apply it to patient RNA-seq from previous studies. We reproduce their results and demonstrate our pipeline’s advantages. In summary, our results show where RNA-seq is limited and how we can optimize tools to prioritize disease-causing splicing variants.
Short Abstract: The assembly of genomic exons into mRNAs via splicing is of critical importance for the correct synthesis of proteins. As such, genomic variants in splicing regions have been linked to various genomic diseases. In order to assist the identification of splice altering variants from genetic data, a number of methods aim to predict splicing effects. However, as shown by a recent study using Multiplexed Splicing Reporter Assays (Cheung et al. 2018), the knowledge gained from these methods has barely been adopted in general variant effect prediction. Combined Annotation Dependent Depletion (CADD, cadd.gs.washington.edu) is a widely used variant score that weights and integrates a diverse collection of annotations. We show how the adoption of splicing effect features in CADD v1.4 improved its performance. Further, we highlight the incremental gain from categorical splice site annotations and explore novel Deep Learning based splicing effect predictors for further improvements. While splice effect scores show superior performance on splice variants, the specialized predictors cannot compete with other variant scores in general variant interpretation. General approaches also account for nonsense and missense effects that are missing from specialized scores. However, integrating the domain-specific knowledge improves variant effect prioritization across all functional variants.
Short Abstract: “Beacon” of the Global Alliance for Genomics and Health is an important international platform for sharing genetic variants. The Beacon search is described as follows: a user sends a pair consisting of a genomic position and an allele (e.g., Chr21:122211 and ‘A’) to a server, and the server returns only a binary response “yes” or “no” to let the user know whether or not the variant exists in the database. Since the Beacon query often includes private information (such as patient’s genomic variant), there is a need to protect query privacy. We introduce a novel search system called Crypto Beacon which enables the user to hide the query from the server while successfully obtaining the results. In our system, the query is encrypted on the user’s browser; the server conducts the search without decrypting the query and returns the encrypted result. Only the user can decrypt the result, i.e., our system ensures that the server never monitors the user’s query. Hence, the user can send a sensitive query in full privacy. This is the first public service that protects genome privacy for large-scale databases and can process privacy-preserving search through a web browser without installing a specific software package.
Short Abstract: Background: The number of reported examples of chromatin architecture alterations involved in regulation of gene transcription and in disease is increasing. However, no genome-wide testing has been performed to assess the abundance of these events and their importance relative to other factors affecting genome regulation. This genome-wide study attempts to fill this gap by analyzing the impact of genetic variants identified in individuals from 26 human populations and in genome-wide association studies on chromatin spatial organization. Results: We assess the tendency of structural variants to accumulate in spatially interacting genomic segments and design a high-resolution computational algorithm to model chromatin conformational changes caused by structural variations. We show that differential gene transcription is closely linked to variation in chromatin interaction networks mediated by RNA polymerase II. We also demonstrate that CTCF-mediated interactions are well conserved across populations, but enriched with disease-associated SNPs. Moreover, we find boundaries of topological domains to be relatively frequent targets of duplications, which suggests that these duplications can be an important evolutionary mechanism of genome spatial organization. Conclusions: Altogether, this study assesses the critical impact of genetic variants on the higher-order organization of chromatin folding and provides unique insight into the mechanisms regulating transcription at the population scale.
Short Abstract: Background: Pediatric tumors are believed to be linked to germline alterations, which are either inherited from parents or arise de novo in the child. Since 2015, we have conducted a trio study, where we sequence parents and children with cancer to detect possible germline variations. Methods and results: Our bioinformatics pipeline processes whole exome sequencing (WES) data from trio samples. Processing steps start with alignment and quality control followed by variant calling and annotation, where various prediction tools and annotation databases are used. The pipeline currently detects single nucleotide variants, indels, de novo mutations, and parental and child mosaicism. Afterwards, important cancer-related pathways are analyzed to detect cancer predisposing germline mutations, de novo mutations are phased to their parental origin, and digenic mutations are studied for possible pathogenicity. Currently we are extending the pipeline to enable the integration of external WES pediatric cancer data into our analysis to perform joint analysis for specific tumor entities. Outlook: Integrating WES data into our bioinformatics pipeline will help detect rare variants associated with specific pediatric tumors. Moreover, structural variation detection using a Bionano machine will soon be integrated into our pipeline. Primary pipeline source code is available at: https://github.com/sjanssen2/spike.
Short Abstract: Variant interpretation requires assessment of annotations across numerous databases, predictive algorithms and publications. Furthermore, complex disease population-scale sequencing studies require application of statistical genetic and machine learning methods for thousands of subjects, across billions of data points. Our solution to these challenges is Varhouse: a massively scalable variant warehouse and interpretation tool that enables identification of pathogenic variants in both individual and large cohort studies. Varhouse is built upon cutting-edge serverless technology in the AWS cloud that scales seamlessly in a HIPAA compliant environment with zero in-house infrastructure costs. Using Apache Spark, variants are annotated at the levels of position, region, gene and transcript. Results are transferred to S3 in the storage-efficient Parquet format. In addition to minimizing costs and maintaining auditability, this warehousing approach makes clinical reinterpretation automatable and enables large cohort research without repeated variant calling and annotation. Varhouse makes variant interpretation highly efficient. By feeding back internal variant frequencies and evaluation notes from analysts into its database, artifacts and previously reviewed variants can be automatically eliminated. It supports analysis of multiple inheritance models and interpretation driven by clinically evaluated phenotypes. Through Varhouse, we have simplified clinical variant interpretation, driving high diagnostic yields from genomic sequence analysis.
Short Abstract: Antimicrobial resistance (AMR) is an important global health concern. Being able to predict AMR from genetic information would allow for more effective treatment of infections and reduce AMR accumulation in bacterial populations. Deep neural networks are a promising technique for AMR prediction; however, the curse of dimensionality and lack of large datasets make training them and achieving accurate results difficult. We show that transfer learning can be utilized to improve the effectiveness of deep neural networks for AMR prediction in Neisseria gonorrhoeae. In the best case, transfer learning improved accuracy approximately 12 times and reduced training time by 75% as compared to phenotype prediction without transfer learning. We also show that transfer learning can be used to improve the effectiveness of neural networks when very little training data is available. These results advance deep learning based phenotype prediction because large datasets are not necessarily required for network training; instead, an existing network can be downloaded to seed the training process and make more productive use of available data. The effectiveness of transfer learning suggests that AMR mechanisms share genomic features. Finally, we show that this technique is not only effective for bacteria, but for plants as well.
Short Abstract: Genotype imputation infers missing genotypic data computationally, and is highly useful in genome-wide association studies and genomic selection. While various measures have been used to evaluate genotype imputation programs, some, such as Pearson correlation, might not be appropriate for a given context and may result in misleading results. Further, most evaluations of genotype imputation programs are focused on human data. Finally, the most commonly used measure, concordance, is unable to determine a difference in performance in some cases. Since Kullback-Leibler divergence (KLD) and Hellinger distance (HD) can aid in ranking statistical inference methods, they can be used in evaluating genotype imputation results. In this study, we utilize negative logarithmic KLD (NLKLD) and negative logarithmic HD (NLHD) to investigate the performance of Beagle and Minimac on data from Arabidopsis thaliana, rice, and human. We demonstrate that NLKLD and NLHD reflect the correspondence between the known and imputed minor allele frequencies, and the chance agreement between the known and imputed genotypes. Additionally, neither Minimac nor Beagle performs consistently better on either plant or human data. Finally, the NLKLD and NLHD results indicate that Minimac has a superior imputation method over Beagle. Our ongoing study is confirming these trends.
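As a rough illustration of the two measures named in this abstract, the sketch below computes the standard Kullback-Leibler divergence and Hellinger distance between two discrete genotype distributions. The abstract does not define how NLKLD and NLHD are computed, so `neg_log` assumes a plain negative natural logarithm, and the genotype frequencies are made-up values.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in nats. Terms with p_i == 0 contribute 0.
    Raises ZeroDivisionError if q_i == 0 where p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def hellinger_distance(p, q):
    """Hellinger distance, bounded between 0 (identical) and 1 (disjoint)."""
    return math.sqrt(0.5 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                               for pi, qi in zip(p, q)))

def neg_log(value):
    # Assumption: "negative logarithmic" means -ln(value). Larger is better,
    # and the score is undefined when the divergence is exactly 0.
    return -math.log(value)

# Hypothetical genotype frequencies (hom-ref, het, hom-alt) at one marker.
known = [0.64, 0.32, 0.04]
imputed_a = [0.60, 0.35, 0.05]   # close to the known distribution
imputed_b = [0.40, 0.40, 0.20]   # further from the known distribution

nlkld_a = neg_log(kl_divergence(known, imputed_a))
nlkld_b = neg_log(kl_divergence(known, imputed_b))
nlhd_a = neg_log(hellinger_distance(known, imputed_a))
nlhd_b = neg_log(hellinger_distance(known, imputed_b))
```

Under this assumed definition, the imputation closer to the known distribution receives the higher score on both measures.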
Short Abstract: Synonymous single nucleotide variants (sSNVs) represent a poorly understood source of genetic variation, and their role in human health and disease is largely unknown. Though not affecting the amino acid sequence, sSNVs can alter mRNA structural properties and the dynamics of translation. To advance the interpretation of sSNVs, we analyzed synonymous codon usage in the human transcriptome, and successfully isolated patterns that displayed evidence of constraint. Comparing the sequence contexts among synonymous codons within RefSeq transcripts, we identified local base content metrics that inform why a synonymous codon is utilized in a particular situation. We find that one of the most predictive features is, surprisingly, the bases in the third nucleotide position from nearby codons. For each codon, we calculated the "distinguishing" base in these positions for the codons two amino acid positions removed; almost all contexts show significant enrichment or depletion relative to these bases. Furthermore, 20-24 codons exhibit significant constraint against synonymous mutation in the presence of their distinguishing third-position bases, according to the gnomAD database. In addition, we find that the constrained patterns are enriched for more highly conserved bases. We explore the possible structural and biological mechanisms that may facilitate constraint on these new motifs.
Short Abstract: Background: Loss of heterozygosity (LOH) is an important mechanism for studying the impact of rare variants on gene disruption. In this study, we aim to elucidate the characteristics of LOH using whole transcriptome sequencing to augment the value of whole exome sequencing (WES) based rare disease gene discovery. Description: We analyzed 30 unrelated rare disease patients using WES and whole transcriptome sequencing. A profile for heterozygous mutations from whole exome data was generated and the distribution of the allele frequencies for these loci within the transcriptome data was obtained. Using these features as a guide, an algorithm was implemented for the identification of LOH at the gene level. Conclusion: A total of 110 genes and 347 transcripts within 30 rare disease patients were identified as showing strong evidence of LOH. We aim to categorize loci showing allele-specific expression (ASE) into several classes of genetic mechanisms that can explain the presence of LOH. The goal of this project is to identify mutations showing strong expression bias of one allele versus the other that cannot be explained by known genetic mechanisms. These findings are instrumental for understanding the underlying pathophysiology of rare genetic diseases.
Short Abstract: Although new sequencing techniques have greatly improved the amount and quality of omics data available, our understanding of the mechanisms behind cancer is still limited. Crucial for improving this understanding is the effective integration of mutation and transcriptome data. We’ve recently demonstrated that many important oncogenes in breast cancer display a bi-modal expression profile and that many of these expression modes can be directly associated with underlying changes in the genome. Using estimated conditional probabilities derived from the METABRIC cohort, we now attempt to predict the mode of each gene in a sample given the mutations that are present. Comparing these predicted expression profiles with the observed data allows us to classify unseen patients into their PAM50 subtype with 68% accuracy. These results illustrate the strong connection between genomic alterations and differential expression and may provide insight into the modus operandi of mutations. Additionally, we present preliminary results on how this connection can be used to identify the driver mutations in a sample.
Short Abstract: Mutational signatures are specific patterns of somatic mutations introduced into the genome by oncogenic processes. Different mutational processes often generate different combinations of mutations, or signatures, that have been previously described. Identification of the processes contributing to mutations observed in a sample is potentially informative to understand the cancer etiology. We present here SigsPack, an R package to estimate a sample's exposure to mutational processes described by a set of known mutational signatures (for example from COSMIC). The exposure stability is quantified by bootstrapping the mutational catalogue. Using multiple samples from the same tumors, we show that these estimates are compatible with the discrepancies observed in exposures computed from different samples. The exposure stability appears to vary between mutational signatures, independently of the sample. In the case of data originating from exome panels, we also investigate the dependence of exposure accuracy on the mutational catalogue's size, and on the sequence contents of the regions probed by the experiment.
Short Abstract: Autism spectrum disorders (ASDs) are a group of neurodevelopmental conditions often characterized by difficulties with social interaction, communication, and behavior. Genome-wide association studies have revealed that the majority of disease-associated variants lie within noncoding regions; yet, many studies investigating the genetic risk factors associated with ASD have focused on protein-coding regions, and only a fraction of the overall heritability of ASD has been accounted for. Therefore, investigation of noncoding variation has the potential to uncover novel ASD risk-variants. Notably, the rare nature of de novo mutations makes them potentially more harmful, as they are not subject to selective pressures. We have identified a subset of de novo SNVs from whole genome sequencing data of 1,918 affected quad families from the Simons Simplex Collection that could disrupt regulatory element function, ultimately impairing expression of ASD-associated genes. We utilized RegulomeDB for comprehensive annotation of known and predicted regulatory elements, including data from the Encyclopedia of DNA elements (ENCODE) consortium and the Roadmap Epigenomics Consortium. SNVs were prioritized according to their overlap with predicted functional elements. This project contributes to our understanding of the functions of noncoding regulatory elements and their contributions to ASD genetics and pathogenesis.
Short Abstract: Today’s clinical sequencing covers only a tiny portion of the entire genome. The reason: storage and computing requirements for full-scale analysis are nearly impossible to meet. Therefore, the goal of this project has been to develop a bioinformatics workflow (pipeline) for multi-step processing of human full genome sequencing data, from the data generated by sequencing instruments at different locations to data ready for interpretation by a genomics specialist. To address these challenges, we developed the WiNGS (Widely integrated NGS) platform to break down the complexity of analyzing genome sequencing data. WiNGS uses a distributed data model to optimize the ICT infrastructure required to support and enable Whole Genome Sequencing (WGS). Analyzing the huge amount of complex data needs an advanced filtering option to reduce the number of potential mutations from thousands to a few. Therefore, in the WiNGS platform, we developed an advanced, user-friendly and optimized filtering module. Filters are constructed based on a tree structure, and users are able to add different kinds of comparisons, such as AND, OR and NOT, with a few clicks. We showed how much time researchers can save with this filtering module.
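A tree-structured filter of the kind described in this abstract could be sketched as follows. This is only an illustration, not the WiNGS implementation: the class names, the annotation fields (`gnomad_af`, `impact`, `zygosity`) and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Variant = Dict[str, object]

@dataclass
class Condition:
    """Leaf node: a single comparison on one annotation field."""
    field: str
    predicate: Callable[[object], bool]

    def matches(self, variant: Variant) -> bool:
        return self.predicate(variant[self.field])

@dataclass
class And:
    """Inner node: true only if every child matches."""
    children: List[object]

    def matches(self, variant: Variant) -> bool:
        return all(child.matches(variant) for child in self.children)

@dataclass
class Or:
    """Inner node: true if at least one child matches."""
    children: List[object]

    def matches(self, variant: Variant) -> bool:
        return any(child.matches(variant) for child in self.children)

@dataclass
class Not:
    """Inner node: inverts its single child."""
    child: object

    def matches(self, variant: Variant) -> bool:
        return not self.child.matches(variant)

def apply_filter(tree, variants):
    """Keep only the variants for which the filter tree evaluates to True."""
    return [v for v in variants if tree.matches(v)]

# Hypothetical annotated variants.
variants = [
    {"id": "v1", "gnomad_af": 0.0001, "impact": "HIGH", "zygosity": "het"},
    {"id": "v2", "gnomad_af": 0.2, "impact": "HIGH", "zygosity": "hom"},
    {"id": "v3", "gnomad_af": 0.0005, "impact": "LOW", "zygosity": "hom"},
    {"id": "v4", "gnomad_af": 0.0002, "impact": "MODERATE", "zygosity": "het"},
]

# rare AND (HIGH OR MODERATE impact) AND NOT homozygous
rare_damaging_het = And([
    Condition("gnomad_af", lambda af: af < 0.01),
    Or([Condition("impact", lambda s: s == "HIGH"),
        Condition("impact", lambda s: s == "MODERATE")]),
    Not(Condition("zygosity", lambda z: z == "hom")),
])

kept_ids = [v["id"] for v in apply_filter(rare_damaging_het, variants)]
```

Representing the filter as a tree lets nested combinations be evaluated recursively, so arbitrarily deep AND/OR/NOT expressions need no special handling.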
Short Abstract: Accurate prediction of the effects of genetic variation is important for many applications in biological research. One of the most effective prediction methods is based on the variational autoencoder (VAE), a deep generative model that uses an approximate inference technique to optimize an intractable density function. In this work, we propose a deep autoregressive generative model called mutationTCN, based on the Temporal Convolutional Network (TCN) architecture. This model defines a tractable density function, assuming that the probability of each data point is conditional only on its previous data points. It uses dilated causal convolutions, which allow autoregressive training while maintaining a large receptive field. The network also employs the attention mechanism in order to capture inter-residue correlations in a sequence. We show that this model outperforms the VAE when tested on a set of 42 deep mutational scanning experimental datasets, as measured by the Spearman rank correlation between the experimental values and model predictions. Especially for proteins with viral sequence family origin, the difference in rank correlations is considerably large. In addition to the improvement in performance, our model allows a direct optimization of the model likelihood, as well as a fast and stable training process.
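The dilated causal convolution mentioned in this abstract can be shown with a small, dependency-free sketch. It illustrates the generic operation only and is not the mutationTCN code; the kernel weights and dilation schedule are arbitrary example values.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation].

    Positions before the start of the sequence are treated as zero, so
    y[t] never depends on inputs later than t (the causal property).
    """
    y = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                total += w * x[idx]
        y.append(total)
    return y

def receptive_field(kernel_size, dilations):
    """Number of input positions seen by one output after stacking layers."""
    return 1 + (kernel_size - 1) * sum(dilations)

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
out = causal_dilated_conv1d(signal, weights=[1.0, 1.0], dilation=2)

# Changing the last input must leave all earlier outputs untouched.
perturbed = causal_dilated_conv1d([1.0, 2.0, 3.0, 4.0, 99.0],
                                  weights=[1.0, 1.0], dilation=2)

# Four stacked layers with kernel size 2 and doubling dilations.
field = receptive_field(kernel_size=2, dilations=[1, 2, 4, 8])
```

Doubling the dilation per layer makes the receptive field grow exponentially with depth, which is how a stack of small kernels can cover a long sequence.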
Short Abstract: RNA editing is a co-/posttranscriptional process, which acts on RNA transcripts by changing single bases and potentially altering functional traits causing or maintaining pathogenic conditions, such as cancer. The most frequently observed RNA editing events are adenine to inosine (A to I) and cytidine to uridine (C to U) changes. In brain tumors, expression of RNA editing enzymes, such as ADAR, can be linked to tumor types. Therefore, we focused on deciphering the RNA editome, investigating all 12 possible nucleotide exchanges using a custom-built analysis pipeline. Nucleotide variations were detected separately in whole-genome and ribosomal-depleted RNA-sequencing data by applying stringent filtering mechanisms. A quasi-Poisson model showed significant associations between age, tumor type and the number of A to I editing events. Interestingly, cytidine to guanine changes were predominantly found at splicing sites, while only a minority of detected events affected exonic regions. Intronic, intergenic and 3’UTR regions were strongly affected by editing events. Furthermore, as 3’UTRs are functionally associated with miRNA binding, miRNA sequencing data will be integrated to study effects on RNA expression.
Short Abstract: Streptomyces lividans produces a vast range of bioactive secondary metabolites and other biological materials. We performed systematic mutagenesis on the genome of Streptomyces lividans to increase the amount of phospholipase D (PLD) production. The mutagenesis was iterated ten times to generate ten generations of genetically modified Streptomyces lividans. Their genome sequences were analyzed to investigate the relationship between mutations and the amount of PLD production. In this work, we have developed a tool to analyze the genome sequences derived from multiple bacterial generations. Firstly, the tool determines the mutated sites on the genome by comparing multiple NGS datasets. Secondly, the tool maps the mutated genes onto metabolic pathways and the protein-protein interaction network downloaded from the KEGG database and STRING, respectively. Thirdly, the effect of the mutations on the increase in PLD production is investigated by statistical and pathway analyses. This tool enables us to automatically analyze multiple NGS samples derived from multiple bacterial generations. A Singularity container of the tool will be available through the internet.
Short Abstract: High throughput sequencing has revolutionized molecular diagnostics by allowing the inspection of many genes in a cost-effective way. However, these studies often reveal a vast number of genetic variants of uncertain clinical significance. In silico methods are usually applied to prioritize among variants, but the choice of tools to use is not straightforward, given the large variety of approaches available. This work aimed to benchmark the performance of more than 30 publicly available prediction tools. We developed Variant prEdiction Tools evAluation (VETA, https://github.com/PedroBarbosa/VETA), a framework that can be applied to any variant dataset and is expected to incorporate new tools as they are released. We applied VETA to a manually curated dataset of 75 pathogenic variants associated with Hypertrophic Cardiomyopathy and 75 likely benign variants located in cardiomyopathy-related genes. We demonstrate that prediction quality depends on the type of variants under analysis. Considering the variants altogether, FATHMM-MKL showed the best accuracy, at 79%. However, looking at missense variants alone, VEST3 and Revel made better predictions (95% and 93%, respectively). Likewise, SpliceAI (98%) and CADD (93%) stood out on the variants with impact on splicing. These results reinforce the need to combine different tools to effectively address variant prioritization.
Short Abstract: The classification of human genetic variants into Neutral and Disease-causing is still a challenge for computational biology. Even though it is possible to discriminate quite accurately between deleterious and neutral variants, most methods lack explanatory power. Nevertheless, the ability to explain how a mutation impacts the molecular phenotype is a prerequisite for rational drug design and personalized treatments. We previously developed SNPMuSiC, a tool able to accurately classify mutations based on their impact on protein stability. Here, we show a predictor based on the proximity between mutations and annotated functional sites. The smaller the spatial distance separating a mutation and a functional site, the higher the probability that the mutation will impact the protein function and be deleterious. Our method, which reaches a balanced accuracy higher than 70%, provides explanatory insights about the biophysical effect of mutations. With this new feature, SNPMuSiC can identify variants whose deleteriousness is not only caused by stability variation, which improves its overall predictive capability.
Short Abstract: The main bottleneck for the biomedical use of next-generation sequencing is the medical interpretation of large-scale genomic data. Currently, the use of only genomic data with limited or low-quality access to the curated clinical and phenotypic information is a major obstacle to accurately diagnosing disease mutations. This implies a need for a secure system that collects and stores phenotypic information of the individuals and links it to genotype. We propose a bioinformatics interface called PhenBook under the WiNGS (Widely integrated NGS platform) project. This interface stores the clinical observations of patients in a structured way using Human Phenotype Ontology (HPO) and associates them with OMIM and genes. This information is used in variant filtering and prioritization. In this way, we increase the accuracy of the results. Since WiNGS is based on the federated data model, users are able to query the data across different centers securely and with respect for the privacy of the individuals, and to check the frequency of occurrence of a certain mutation in combination with the phenotype of interest. We hope that our model helps researchers toward the discovery of rare diseases.
Short Abstract: The vast majority of cancer driver mutations have been characterized in non-coding regions of the genome. An increasing body of evidence shows an association of G-quadruplex (GQ) structures with cancer regulation, suggesting their crucial role in gene expression and regulation. Here, we propose a novel approach for discovering potential cancer-driving mutations that overlap with GQ in non-coding regions. We integrated whole genome sequencing data from the International Cancer Genome Consortium (ICGC) and Sanger of 1290 patients from 31 cancer types with regulatory annotations to identify the set of mutations arising in individual samples in the regulatory region. Genome-wide mapping of non-coding regulatory mutations to their target genes was done using enhancer-promoter interaction maps. We identified the cancerous mutations that harbor potential GQ motifs within enhancer and promoter regions. Recurrent mutation clusters were spotted by filtering out randomly distributed mutations in a 30 bp sliding window. On the basis of mutation frequency and differential gene expression, we predicted the significant hotspots which have potential regulatory impact. Therefore, our method was capable of predicting recurrent mutations in non-coding regions, overlapping with GQ, that can enhance the understanding of the unusual transcriptional regulatory networks of cancer genomes.
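One simple way to picture the 30 bp sliding-window step is the sketch below, which reports regions where at least `min_count` mutations fall within a window. The abstract does not say how randomly distributed mutations were filtered out, so the fixed count threshold, the function name and the example positions are assumptions made for illustration only.

```python
def find_mutation_clusters(positions, window=30, min_count=3):
    """Return merged (start, end) spans on one chromosome where at least
    min_count mutations lie within `window` bp of each other."""
    pos = sorted(positions)
    clusters = []
    start = 0
    for end in range(len(pos)):
        # Shrink from the left until the span fits inside one window.
        while pos[end] - pos[start] >= window:
            start += 1
        if end - start + 1 >= min_count:
            span = (pos[start], pos[end])
            if clusters and span[0] <= clusters[-1][1]:
                # Overlaps the previous dense span: extend it.
                clusters[-1] = (clusters[-1][0], span[1])
            else:
                clusters.append(span)
    return clusters

# Hypothetical mutation positions pooled across samples on one chromosome.
mutations = [100, 105, 120, 500, 900, 910, 915, 929, 2000]
hotspots = find_mutation_clusters(mutations, window=30, min_count=3)
```

Isolated mutations such as those at positions 500 and 2000 never reach the threshold and are dropped, while dense runs are merged into single candidate hotspots.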