Presentation Overview: Show
Characterizing the multifaceted contribution of genetic and epigenetic factors to disease phenotypes is a major challenge in human genetics and medicine. I will discuss the use of large-scale genomic analyses applied to large-scale population cohorts, to investigate the contribution of genetic variants associated with complex human traits. Using genome sequencing and multi-omic phenotypes, we discover and assess the contribution of genetic variation, from common to rare. I will describe how these expansive, high-resolution atlases of multi-omics changes inform understanding of mechanisms of disease. Furthermore, I will describe first efforts to assess the function of genetic variants and to fine-map causal effects, using experimental approaches of different types. These results pave the way for a better understanding of genetic and molecular events underpinning complex human diseases.
Presentation Overview: Show
Tumors are the result of a somatic evolutionary process leading to substantial intra-tumor heterogeneity.
Single-cell and multi-region sequencing enable the detailed characterization of the clonal architecture of tumors, and have highlighted its extensive diversity across tumors.
While several computational methods have been developed to characterize the clonal composition and the evolutionary history of tumors, the identification of significantly conserved evolutionary trajectories across tumors is still a major challenge.
We present a new algorithm, MASTRO, to discover significantly conserved evolutionary trajectories in cancer.
MASTRO discovers all conserved trajectories in a collection of phylogenetic trees describing the evolution of a cohort of tumors, allowing the discovery of conserved complex relations between alterations.
MASTRO assesses the significance of the trajectories using a conditional statistical test that captures the coherence in the order in which alterations are observed in different tumors.
We apply MASTRO to data from non-small-cell lung cancer bulk sequencing and to acute myeloid leukemia data from single-cell panel sequencing, and find significant evolutionary trajectories recapitulating and extending the results reported in the original studies.
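The idea of looking for ordering relations that recur across tumor trees can be illustrated with a much simpler stand-in than MASTRO itself. The sketch below only counts ancestor-descendant pairs of alterations and keeps those seen in at least `min_support` trees; it does not handle the complex trajectories or the conditional statistical test described above. The tree encoding (a child-to-parent dictionary), the function names, and the toy cohort with alterations `A`, `B`, `C` are all assumptions made for illustration.

```python
from collections import Counter


def ancestor_descendant_pairs(parent):
    """Return every (ancestor, descendant) pair in one tumor tree.

    `parent` maps each alteration to its parent alteration (None for the root).
    This encoding is an assumption of the sketch, not MASTRO's input format.
    """
    pairs = set()
    for node in parent:
        ancestor = parent[node]
        while ancestor is not None:
            pairs.add((ancestor, node))
            ancestor = parent[ancestor]
    return pairs


def conserved_pairs(trees, min_support):
    """Keep ordered pairs that occur in at least `min_support` trees."""
    counts = Counter()
    for parent in trees:
        counts.update(ancestor_descendant_pairs(parent))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical toy cohort of three tumors.
trees = [
    {"A": None, "B": "A", "C": "B"},  # A -> B -> C
    {"A": None, "B": "A"},            # A -> B
    {"B": None, "A": "B"},            # B -> A (opposite order)
]
print(conserved_pairs(trees, min_support=2))
```

Raw counting like this says nothing about significance: a pair of frequent alterations can co-occur in a fixed order by chance, which is the gap a statistical test on ordering coherence is meant to close.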
Presentation Overview: Show
The clinical impact of most germline missense variants in humans remains unknown. Genetic constraint identifies genomic regions under negative selection, where variants are likely to have functional impact, but the spatial resolution of existing constraint metrics is limited. Here we present the Homologous Missense Constraint (HMC) score, which measures genetic constraint at quasi single amino-acid resolution by aggregating signals across protein homologues. We identify one million possible missense variants under strong negative selection. HMC precisely distinguishes pathogenic variants from benign variants for both early-onset and adult-onset disorders. It outperforms existing constraint metrics and pathogenicity meta-predictors in prioritising de novo mutations from probands with developmental disorders (DD), and is orthogonal to these, adding power when used in combination. We demonstrate utility for gene discovery by identifying seven genes newly significantly associated with DD that could act through an altered-function mechanism. Overall, HMC is a novel and strong predictor to improve missense variant interpretation.
Presentation Overview: Show
The emergence of pathogenic coronaviruses, including SARS-CoV-2, SARS-CoV, and MERS-CoV, is a serious threat to global health. The spike (S) glycoprotein of SARS-CoV-2 is responsible for binding to permissive cells. The receptor-binding domain (RBD) of the SARS-CoV-2 S protein directly interacts with human ACE2. Here, we applied computational saturation mutagenesis to mutate all residues in the S and ACE2 proteins. We used structure-based energy calculations and sequence-based bioinformatics tools to quantify the systemic effects of missense mutations on protein structure and function. A total of 18,354 mutations in the SARS-CoV-2 spike protein were analyzed, and most of these mutations could destabilize the entire S protein and its RBD region. Mutations at SARS-CoV-2 RBD residues G431 and S514 can alter spike protein stability. The viral variation D614G can stabilize the entire SARS-CoV-2 spike protein. We investigated RBD mutations and found that mutations located at the interface of the RBD-ACE2 complex can alter its binding affinity. In addition, we applied similar approaches to investigate the effects of S mutations in MERS-CoV and SARS-CoV on protein stability and virus-receptor interaction. The findings provide potential target sites for the development of drugs and vaccines against COVID-19.
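Computational saturation mutagenesis starts from an exhaustive list of single amino-acid substitutions, which is then passed to stability or binding-energy calculations. The sketch below shows only that enumeration step; the function name and the five-residue toy sequence are assumptions, and the energy calculations themselves are not shown. As a side observation, 18,354 equals 19 × 966, which would be consistent with 19 substitutions at 966 modeled positions, although the abstract does not state this.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def saturation_mutations(sequence, start=1):
    """List every single-residue substitution as e.g. 'D614G'.

    `start` is the residue number of the first character in `sequence`.
    Each standard residue yields 19 substitutions.
    """
    mutations = []
    for offset, wild_type in enumerate(sequence):
        position = start + offset
        for mutant in AMINO_ACIDS:
            if mutant != wild_type:
                mutations.append(f"{wild_type}{position}{mutant}")
    return mutations


# Hypothetical toy fragment, not a real structure.
toy_sequence = "MFVFL"
toy_mutations = saturation_mutations(toy_sequence)
print(len(toy_mutations), toy_mutations[:3])
```

Each label in the resulting list would then be scored independently, which is why the workload grows linearly with the number of resolved residues.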
Presentation Overview: Show
This talk will present what we know about the origins of SARS-CoV-2, the properties that made it such a successful new human coronavirus, and what we’ve learned from data-driven observations of its evolution in the human population. Surprisingly, the immediate efficient spread of SARS-CoV-2 in humans was not because it had been especially adapted to humans; rather, SARS-CoV-2 has a relatively promiscuous nature, evidenced by frequent transmission to many mammal species. This meant that for approximately the first year of the COVID-19 pandemic, SARS-CoV-2 was transmitted relatively unchanged and unchallenged in an immunologically naive human population. However, by the end of 2020 SARS-CoV-2 started to show how much more transmissible it could become in the context of a changing host environment, due to acquired host immunity from past infections and, since 2021, vaccinations. The more heavily mutated SARS-CoV-2 variants were capable of out-competing previous variants by being even better adapted for human-to-human spread via enhanced transmissibility and immune-evasion changes, exemplified by the “variants of concern” (VOCs) Alpha, Delta and now Omicron. Novel variants are continuously arising that further contribute to the success of SARS-CoV-2. I will discuss how and where in the SARS-CoV-2 genome variation accumulates, the underlying processes involved, and how this knowledge can be used to predict putative properties of novel variants using evolutionary properties and machine learning methods. While we should of course be optimistic about the short- and longer-term evolution of this new human pathogen, SARS-CoV-2’s evolution has been inherently unpredictable, so it will be crucial to continue to monitor its evolution and prepare for future VOCs.
Presentation Overview: Show
This talk/poster will present the lessons learned from 10 years of the Critical Assessment of Genome Interpretation (CAGI) experiments, as written in the consortium’s summary manuscript. CAGI aims to advance the state of the art for computational prediction of genetic variant impact, particularly for variants relevant to human disease. There have been five editions of the CAGI community experiment, comprising 50 challenges in which participants make blind predictions of phenotypes from genetic data, which are then evaluated by independent assessors. Overall, the results show that while current methods are imperfect, they already have major utility for research and clinical applications. Missense variant interpretation methods are able to estimate biochemical effects with increasing accuracy. Performance is particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Assessment of methods for regulatory variants and those for complex trait disease risk is less definitive but shows potential performance suitable for auxiliary use in the clinic. Emerging methods and increasing availability of large and robust datasets for training and assessment suggest further progress ahead.
Presentation Overview: Show
In 2019, we released Missense3D, which identifies stereochemical features that are disrupted by a missense variant, such as the introduction of a buried charge. Missense3D, which has >150 citations and ~7K users in the last year, analyses the effect of missense variants on a single structure. Here we present Missense3D-PPI for predicting the effect of missense variants at protein-protein interfaces (PPI).
Our dataset comprised 1,301 interface variants in 441 proteins and 553 PDB complexes. Benchmarking of Missense3D-PPI was performed using a training (320 benign and 320 pathogenic variants) and testing (257 benign and 404 pathogenic) dataset. Structural features affecting PPI were analysed to assess the impact of the variant at PPI.
Missense3D and Missense3D-PPI were run on the test data. The inclusion of these PPI-specific features improved the Matthews Correlation Coefficient (from 0.11 to 0.21) and the accuracy of Missense3D-PPI (from 42% to 56%, p-value of 1×10⁻⁹, McNemar’s test). Comparison of Missense3D-PPI with MutaBind2, BeatMusic and mCSM-PPI2 showed that the programs performed similarly on our test data of naturally occurring human missense variants.
Missense3D-PPI represents a valuable tool to predict the structural effect of missense variants at PPI and will be available from our Missense3D web portal (http://missense3d.bc.ic.ac.uk/).
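For readers less familiar with the two statistics quoted above, the following minimal sketch shows how a Matthews Correlation Coefficient is computed from a confusion matrix and how McNemar's test compares two classifiers on the same variants. The function names and all counts are hypothetical and are not the study's data; the McNemar variant shown uses the continuity-corrected chi-square approximation, which is an assumption, since the abstract does not say which form was used.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def mcnemar(b, c):
    """McNemar's test with continuity correction.

    b: variants classified correctly by method 1 only
    c: variants classified correctly by method 2 only
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 df: P(Z^2 > stat).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts for illustration only.
print(round(mcc(tp=40, tn=30, fp=20, fn=10), 3))
print(mcnemar(b=30, c=5))
```

McNemar's test looks only at the discordant variants, so two methods with similar overall accuracy can still differ significantly if they disagree in a consistent direction.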
Presentation Overview: Show
Variant interpretation remains a central challenge for precision medicine. Missense variants are particularly difficult to understand as they change only a single amino acid in a protein sequence yet can have large and varied effects on protein activity. Numerous tools have been developed to identify missense variants with putative disease consequences from protein sequence and structure. However, biological function arises through higher order interactions among proteins and molecules within cells. We therefore sought to capture information about the potential of missense mutations to perturb protein-protein interaction networks by integrating protein structure and interaction data. We developed 16 network-based annotations for missense mutations that provide orthogonal information to features classically used to prioritize variants. We then evaluated them in the context of a proven machine-learning framework for variant effect prediction across multiple benchmark datasets and demonstrated their potential to improve variant classification. Interestingly, network features resulted in larger performance gains for classifying somatic mutations than for germline variants, possibly due to different constraints on what mutations are tolerated at the cellular versus organismal level. Our results suggest that modeling variant potential to perturb context-specific interactome networks is a fruitful strategy to advance in silico variant effect prediction.
Presentation Overview: Show
Evolutionary information is the primary tool for detecting functional conservation in nucleic acids and proteins. This information has been extensively used to predict structure, interactions and functions in macromolecules. Pathogenicity prediction models rely on multiple sequence alignment information at different levels. However, the most accurate genome-wide variant deleteriousness ranking algorithms consider different features to assess the impact of variants. Here, we analyze three different ways of extracting evolutionary information from sequence alignments in the context of pathogenicity predictions at DNA and protein levels. We show that protein sequence-based information is slightly more informative in the annotation of ClinVar missense variants than that obtained at the DNA level. Furthermore, to achieve the performance of state-of-the-art methods, such as CADD and REVEL, the conservation of reference and variant, encoded as frequencies of reference/alternate alleles or wild-type/mutant residues, should be included. Our results on a large set of missense variants show that a basic method based on three input features derived from the protein sequence profile performs similarly to the CADD algorithm, which uses hundreds of genomic features. These observations indicate that for missense variants, evolutionary information, when properly encoded, plays the primary role in ranking pathogenicity.
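Encoding conservation as the frequencies of wild-type and mutant residues can be sketched from a single alignment column. The example below is a minimal illustration under stated assumptions: the function names, the gap handling (gaps are skipped), and the toy alignment are hypothetical, and only two frequency features are computed because the abstract does not specify what the three profile-derived features are.

```python
def column_frequencies(alignment, column):
    """Residue frequencies in one alignment column, ignoring gaps ('-')."""
    residues = [seq[column] for seq in alignment if seq[column] != "-"]
    total = len(residues)
    if total == 0:
        return {}
    return {aa: residues.count(aa) / total for aa in set(residues)}


def variant_features(alignment, column, wild_type, mutant):
    """Return (wild-type frequency, mutant frequency) at the variant site."""
    freqs = column_frequencies(alignment, column)
    return freqs.get(wild_type, 0.0), freqs.get(mutant, 0.0)


# Hypothetical toy alignment of four homologous sequences.
alignment = ["MKV", "MRV", "MKI", "M-V"]
print(variant_features(alignment, column=1, wild_type="K", mutant="R"))
```

A substitution to a residue that never appears in the column gets a mutant frequency of zero, which is the kind of signal a downstream classifier could treat as evidence against tolerance.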
Presentation Overview: Show
The ability to obtain high-quality and reliable PCR-free whole genome next-generation sequencing (WGS) data from small amounts of DNA is critically important for clinical applications such as newborn screening or prenatal testing. We have established and validated a methodology with high sensitivity and specificity for detection of a variety of clinically relevant variant types, utilizing the Illumina Whole Genome PCR-Free Tagmentation protocol with as little as 25-300 ng of template DNA. One hundred and forty-five validation samples representing Small Sequence Changes (SSCs), Copy Number Variants (CNVs), Structural Variants (SVs), Short Tandem Repeats (STRs), Aneuploidies, Mitochondrial Variants, Mobile Element Insertions (MEIs), Uniparental Disomy (UPD), and difficult-to-detect variants such as SMN1/2 copy numbers were included in this study. The resulting statistics met or exceeded metrics obtained through orthogonal WGS methodologies utilizing much greater amounts of template DNA (>1 µg). This technology further advances WGS as a first-line genetic test for diagnosis of thousands of rare diseases and is now in clinical use at Variantyx, Inc.
Presentation Overview: Show
Pleiotropic SNPs are associated with multiple traits. Such SNPs can help pinpoint biological processes with an effect on multiple traits or point to a shared etiology between traits. We present PolarMorphism, a new method for the identification of pleiotropic SNPs from GWAS summary statistics. PolarMorphism can be readily applied to more than two traits or whole trait domains. PolarMorphism makes use of the fact that trait-specific SNP effect sizes can be seen as Cartesian coordinates and can thus be converted to polar coordinates r (distance from the origin) and theta (angle with the Cartesian x-axis). r describes the overall effect of a SNP, while theta describes the extent to which a SNP is shared. r and theta are used to determine the significance of SNP sharedness, resulting in a p-value per SNP that can be used for further analysis. We apply PolarMorphism to a large collection of publicly available GWAS summary statistics enabling the construction of a pleiotropy network that shows the extent to which traits share SNPs. This network shows how PolarMorphism can be used to gain insight into relationships between traits and trait domains. Furthermore, pathway analysis of the newly discovered pleiotropic SNPs demonstrates that analysis of more than two traits simultaneously yields more biologically relevant results than the combined results of pairwise analysis of the same traits. Finally, we show that PolarMorphism is more efficient and more powerful than previously published methods.
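The coordinate change at the heart of this description can be written down directly for the two-trait case. The sketch below converts a pair of trait-specific effect sizes into r and theta as defined above; the function name and the example values are assumptions, and neither the significance test nor any preprocessing of the summary statistics that the method may apply is shown.

```python
import math


def to_polar(effect_trait1, effect_trait2):
    """Convert two trait-specific SNP effect sizes to polar coordinates.

    r     : distance from the origin (overall effect of the SNP)
    theta : angle with the x-axis in radians (how the effect is shared)
    """
    r = math.hypot(effect_trait1, effect_trait2)
    theta = math.atan2(effect_trait2, effect_trait1)
    return r, theta


# Hypothetical effect sizes for three SNPs.
print(to_polar(5.0, 0.0))  # effect on trait 1 only: theta is 0
print(to_polar(2.0, 2.0))  # equal effect on both traits: theta is pi/4
print(to_polar(3.0, 4.0))  # r is 5
```

A SNP lying along an axis affects a single trait, while one near the diagonal is shared, so theta separates trait-specific from pleiotropic signals independently of how strong the overall effect r is.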
Presentation Overview: Show
Budding yeast has proven to be an excellent model system for characterizing the molecular spectrum of un-selected mutations, ranging from single-nucleotide substitutions to large chromosomal changes. Combining this genomic information from hundreds of experimental strains with phenotypic analyses also allows us to test predictions for how genetic variation will impact key traits. We describe how genome architecture influences the mutation spectrum in yeast, altering genetic load and the trajectory of adaptation. We show that a popular computational tool used to identify deleterious variants based on evolutionary sequence conservation largely fails to predict growth rates in the lab. Finally, we describe mutation patterns in the highly repetitive ribosomal DNA locus, and show that stable copy number at this locus is maintained by a balance between mutation and stabilizing selection. By combining insights from genomics with population and quantitative genetics approaches, our work sheds light on the origins and impacts of genetic variation.
Presentation Overview: Show
Recommendations from the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) for interpreting sequence variants specify the use of computational predictors as only “supporting” level of evidence for pathogenicity or benignity using criteria PP3 and BP4, respectively. However, score intervals defined by tool developers, and these recommendations in general, lack quantitative support. Previously, we described a probabilistic framework that quantified the strengths of evidence (Supporting, Moderate, Strong, Very Strong) within ACMG/AMP recommendations. Building on this framework, we introduce a new standard that converts a computational tool’s scores to PP3 and BP4 evidence strengths. Our approach is based on estimating the local positive predictive value and can calibrate any tool or other continuous-scale evidence on any variant type. We estimate score thresholds corresponding to each strength of evidence for pathogenicity and benignity for 13 missense variant tools, using carefully assembled data sets. Most tools achieved Supporting evidence level for both pathogenic and benign classification using these thresholds. Several tools yielded score thresholds justifying even stronger levels, including one reaching Very Strong evidence level for benignity. Lastly, we provide recommendations for evidence-based revisions of the ACMG/AMP PP3 and BP4 criteria, and future assessments for clinical use.
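One way to picture the conversion from a local positive predictive value to an evidence strength is sketched below for the pathogenic (PP3) direction only. It assumes the exponentially scaled odds structure of the earlier probabilistic framework, in which Supporting, Moderate, Strong and Very Strong correspond to the 1/8, 1/4, 1/2 and full power of a single "very strong" odds value. The function name, the prior of 0.1 and the odds value of 350 used in the example are illustrative assumptions and are not taken from this abstract.

```python
def evidence_strength(local_posterior, prior, odds_very_strong):
    """Map a local posterior probability of pathogenicity to a strength label.

    All three inputs are assumptions of this sketch; probabilities must lie
    strictly between 0 and 1. Returns (label, positive likelihood ratio).
    """
    if not (0 < local_posterior < 1 and 0 < prior < 1):
        raise ValueError("probabilities must be strictly between 0 and 1")
    prior_odds = prior / (1 - prior)
    posterior_odds = local_posterior / (1 - local_posterior)
    likelihood_ratio = posterior_odds / prior_odds
    # Strongest level first, so the first threshold met is the one reported.
    levels = [
        ("Very Strong", 1.0),
        ("Strong", 0.5),
        ("Moderate", 0.25),
        ("Supporting", 0.125),
    ]
    for label, exponent in levels:
        if likelihood_ratio >= odds_very_strong ** exponent:
            return label, likelihood_ratio
    return "Indeterminate", likelihood_ratio


# Illustrative values only.
print(evidence_strength(0.5, prior=0.1, odds_very_strong=350))
```

Because the thresholds are defined on the likelihood ratio rather than on raw tool scores, the same mapping could in principle be reused for any tool once its local posterior has been estimated on calibration data.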
Presentation Overview: Show
RNA G-quadruplexes (rG4s) are known to play an important role in gene regulation. Recent advances in sequencing technology have produced experimental data detailing G4 formation in the genome. Following this, various computational methods have been developed to predict whether a G4 is likely to form in a given sequence. While deep learning methods such as convolutional neural networks (CNNs) have proven to perform well, these models are not able to integrate information from long-range interactions within the input sequence. To better capture upstream and downstream nucleotide context, we trained a transformer-based model on rG4-seeker in vitro RNA G4 data to produce G4mer. We show that G4mer outperforms CNN-based models such as G4Detector (Barshai et al., 2021) in predicting G4 sequences and classifying G4s into their subcategories. Using G4mer, we study G4 formation in human 5’ UTRs and search for variants in G4s associated with disease and phenotype.
Presentation Overview: Show
Insight into synonymous codon usage and metrics to assess the impact of synonymous single nucleotide variants (sSNVs) are critically needed to enable consideration of “silent” variants in the genetic diagnosis process. While sSNVs do not alter protein sequence, they may influence several overlapping mRNA processes. Using an information-theoretic approach, we evaluated how codon usage in human transcripts varies with sequence context and related disruption of sSNV patterns to constraint in the human population.
For each amino acid group with codons varying at codon position 3 (CP3), we calculate the mutual information (MI) between the distributions of codons and those of neighboring codons and nucleotides. We find that MI, outside of central bicodons, is driven by local maxima at CP3s and demonstrate that this effect is significantly larger than in control contexts. Other local sequence variables account for much, but not all, of this codon-context correlation. We convert this MI to an sSNV score that represents how expected the substitution is in that sequence context. Using TCGA data as a test case, we find relevant pathways enriched in the somatic sSNVs from the extrema of our score’s distribution, thereby demonstrating the biological significance of sSNVs in human disease.
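The mutual information between a codon choice and a neighboring context symbol can be estimated from paired observations with a simple plug-in estimator. The sketch below illustrates the quantity being measured; the function name, the use of base-2 logarithms (bits), and the toy pairings of two alanine codons with a neighboring nucleotide are assumptions, and the conversion of MI into the sSNV score is not shown.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of mutual information (in bits) from (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(x for x, _ in pairs)
    right = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) simplifies to count * n / (count_x * count_y).
        mi += p_xy * math.log2(count * n / (left[x] * right[y]))
    return mi


# Hypothetical observations: (alanine codon, neighboring nucleotide).
dependent = [("GCT", "A"), ("GCC", "G")] * 50
independent = [("GCT", "A"), ("GCT", "G"), ("GCC", "A"), ("GCC", "G")] * 25
print(mutual_information(dependent))    # codon fully determined by context
print(mutual_information(independent))  # codon unrelated to context
```

Plug-in estimates are biased upward on small samples, so in practice a comparison against control contexts, as described above, is needed before interpreting a nonzero value.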
Presentation Overview: Show
Hypertension is among the most prevalent conditions, with an estimated one billion cases worldwide. It increases the risk of renal, cerebrovascular, and cardiovascular diseases. Hypertension is a polygenic disease with a strong environmental contribution. The list of associated genes from GWAS keeps growing (1362 genes, OpenTargets), with only a few having been functionally validated. In this study, we applied a gene-based method called proteome-wide association study (PWAS) to detect associations through the effect of variants on protein function. PWAS aggregates the signal from all UK Biobank (UKBB) variants affecting protein-coding genes and provides three generalized models for dominant, recessive, and combined inheritance. PWAS identified 72 statistically significant associated genes (FDR q-value <0.05) and 158 with a more relaxed threshold (FDR q-value <0.1). Only half of the PWAS genes overlap with GWAS. Analyzing females and males from the UKBB genotyping data revealed a strong sex-dependent genetic signal. We found that females carry most of the polygenic signal (28 vs. 6 genes), with only SH2B3 shared between sexes. Several of the female-unique genes are associated with cellular immune function. We conclude that hypertension displays sex-dependent genetics with a substantial recessive inheritance. We show the benefit of PWAS in enhancing interpretability and clinical utility.