- Kelley Harris, United States
Presentation Overview:
Population genetic models provide a unified theoretical framework for interpreting the function and significance of natural genetic variants, which arise as random mutations and persist in populations as a result of natural selection and random genetic drift. Although these models recognize that drift and selection can vary widely between populations with different histories, they often characterize mutagenesis as a simple, uniform process, overlooking the fact that mutagenesis is a trait that can evolve and vary between populations. By statistically analyzing the "mutation spectrum," meaning the relative abundance of genetic variation in different sequence contexts such as three-base-pair motifs, we show that the germline mutagenic process appears to vary between populations as well as between functional genomic compartments. Mutation spectra vary systematically between great ape species and even between closely related human populations, and we have developed a novel method, MuSHi, to deconvolute this variation into the footprints of mutational processes that have changed in rate over the course of population history. Our computational methodology also indicates the presence of mutation spectrum variation within a database of natural yeast strains; for example, a clade of strains known as "mosaic beer yeast" appears to have been accumulating C>A mutations at a higher rate than standard laboratory strains. We show that this mutation spectrum difference can be recapitulated in de novo mutations accumulated in a controlled laboratory setting, confirming that this component of natural yeast mutation spectrum variation is likely caused by a genetically encoded mutator allele.
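As a rough illustration of what a "mutation spectrum" over three-base-pair motifs looks like in practice, the sketch below counts toy variants by trinucleotide context and normalizes the counts to relative abundances. It is not the MuSHi method; the function names, the input format (a reference trinucleotide plus the new central base), and the convention of folding complementary strands so that the central reference base is A or C are all assumptions made for this example.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def fold_strand(context, alt):
    """Report each mutation on the strand whose central reference base is A or C.

    Assumed convention for this sketch: a G>T change in context CGT is the
    same event as a C>A change in context ACG on the opposite strand.
    """
    if context[1] in "AC":
        return context, alt
    return reverse_complement(context), COMPLEMENT[alt]


def mutation_spectrum(variants):
    """Relative abundance of each trinucleotide mutation type.

    `variants` is an iterable of (reference trinucleotide, new central base).
    """
    counts = Counter()
    for context, alt in variants:
        ctx, new_base = fold_strand(context.upper(), alt.upper())
        counts[f"{ctx}>{ctx[0]}{new_base}{ctx[2]}"] += 1
    total = sum(counts.values())
    return {mutation_type: n / total for mutation_type, n in counts.items()}


# Toy data (invented): two C>T changes in a TCC context and two C>A changes
# in an ACG context, each seen once per strand.
toy_variants = [("TCC", "T"), ("GGA", "A"), ("ACG", "A"), ("CGT", "T")]
spectrum = mutation_spectrum(toy_variants)
```

Comparing two populations then amounts to comparing these fractions type by type, for example asking whether the C>A types make up a larger share of the spectrum in one group of strains than in another.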
- Gary Benson, Boston University, United States
- Marzieh Eslami Rasekh, Boston University, United States
- Yozen Hernandez, The Rockefeller University, United States
Presentation Overview:
Variable number tandem repeats (VNTRs) are polymorphic DNA tandem repeat loci in which the number of pattern copies varies across a population. Human minisatellite VNTR loci (with pattern lengths from seven to hundreds of base pairs) have a variety of functional effects (transcription factor binding, RNA splicing) and are associated with disease (neurodegenerative disorders, cancers, Alzheimer’s disease). Despite their importance, relatively few minisatellite VNTRs have been identified and studied in detail. As part of a large survey of VNTR occurrence in over 2,500 human whole genome sequencing samples from the 1000 Genomes Project, we sought to identify population-specific VNTR alleles. We found 5,541 “common” VNTR loci (occurring in ≥ 5% of the samples) and used their alleles to develop a decision tree classification model that predicts super-population membership with 97.81% accuracy. We then identified the 1,283 most population-predictive alleles. Finally, we developed a novel ‘Virtual Gel’ illustration showing how alleles differ across populations at population-specific loci. This is the first large-scale study of population-specific VNTRs, and the information obtained could be useful for haplotype inference, studies of human migration and evolution, and accurate use of VNTRs in GWAS.
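The filtering and feature-building steps described above can be sketched as follows. This is only an illustration under stated assumptions: the data layout (a mapping from locus to per-sample sets of non-reference copy-number changes), the function names, and the reading of "occurring in ≥ 5% of the samples" as "at least 5% of samples carry a non-reference allele" are all hypothetical, and the decision tree itself is not shown.

```python
def common_vntr_loci(allele_calls, n_samples, min_fraction=0.05):
    """Return loci where at least `min_fraction` of samples carry a VNTR allele.

    `allele_calls` maps locus -> {sample_id: set of non-reference
    copy-number changes}; samples without a call are treated as reference.
    """
    common = []
    for locus, per_sample in allele_calls.items():
        carriers = sum(1 for alleles in per_sample.values() if alleles)
        if carriers / n_samples >= min_fraction:
            common.append(locus)
    return sorted(common)


def allele_features(allele_calls, loci, samples):
    """Encode each sample as a 0/1 vector over (locus, allele) pairs.

    The resulting rows could be passed to any classifier, such as a
    decision tree, together with population labels.
    """
    columns = sorted(
        {(locus, allele)
         for locus in loci
         for alleles in allele_calls[locus].values()
         for allele in alleles}
    )
    rows = {
        sample: [1 if allele in allele_calls[locus].get(sample, ()) else 0
                 for locus, allele in columns]
        for sample in samples
    }
    return columns, rows


# Toy data (invented): 40 samples, two loci.
samples = [f"S{i}" for i in range(40)]
calls = {
    "locusA": {"S0": {1}, "S1": {1, -2}, "S2": {-2}},  # 3 of 40 carriers
    "locusB": {"S5": {3}},                             # 1 of 40 carriers
}
common = common_vntr_loci(calls, n_samples=len(samples))
columns, rows = allele_features(calls, common, samples)
```

Here only `locusA` passes the 5% threshold, and each of its observed alleles becomes one binary feature column.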
- Arjun Bhattacharya, Department of Biostatistics, University of North Carolina at Chapel Hill, United States
- Michael Love, Department of Biostatistics, Department of Genetics, University of North Carolina at Chapel Hill, United States
Presentation Overview:
Traditional models for transcriptome-wide association studies (TWAS) consider only single nucleotide polymorphisms (SNPs) local to genes of interest and perform parameter shrinkage with a regularization process. These approaches ignore the effects of distal SNPs and the mechanisms that may underlie the SNP-gene association. Here, we outline multi-omic strategies for transcriptome imputation from germline genetics that prioritize SNPs distal to the gene of interest when testing gene-trait associations. In one extension, we identify mediating biomarkers (CpG sites, microRNAs, and transcription factors) highly associated with gene expression and train predictive models for these mediators using their local SNPs. Imputed values for mediators are then incorporated into the final model as fixed effects, with SNPs local to the gene included as regularized effects. In the second extension, we assess distal eSNPs for their mediation effect through mediators local to these distal eSNPs. Highly mediated distal eSNPs are included in the transcriptomic prediction model. We show considerable gains in prediction of gene expression and TWAS power using simulation analysis and real data applications with TCGA breast cancer and ROS/MAP brain data. This integrative approach to transcriptome-wide imputation and association studies aids in understanding the complex interactions underlying genetic regulation within a tissue and in identifying important risk genes for various traits.
- Joseph Atemia, International Centre of Insect Physiology and Ecology (icipe), Nairobi, Kenya
- Santie de Villiers, Pwani University, Kilifi, Kenya
- Suhaila Hashim, Pwani University, Kilifi, Kenya
Presentation Overview:
Finger millet (Eleusine coracana) is a key staple crop in eastern Africa, cultivated mainly by small-holder farmers. It can withstand high temperatures, salinity, drought stress and low soil fertility. Typically, it yields only one-third of its genetic potential of 6 tons per hectare because of the use of unimproved varieties that are regularly affected by finger millet blast disease along with other stresses. Blast disease is caused by Magnaporthe oryzae, a host-specific species complex that affects different grasses including rice and wheat. While many efforts have been directed towards characterizing rice blast, the genetic diversity, specificity and virulence of finger millet blast remain poorly understood. To address this, we sequenced 224 blast isolates from Kenya, Tanzania, Uganda and Ethiopia using Illumina sequencing. One blast isolate, E2, was sequenced using a combination of PacBio and Illumina technologies. A reference genome assembly was generated for E2, and reads from the resequenced isolates were mapped to it. Variant calling identified 195,705 SNPs. Cluster analysis for diversity assessment was then conducted using STRUCTURE, PCA and phylogenetic analysis, and the findings are presented here. This information will enhance the existing knowledge of the genetic diversity of the blast fungus.
- Yi-Fei Huang, Pennsylvania State University, United States
Presentation Overview:
A challenge in genomics is to identify variants and genes associated with severe genetic disorders. Several statistical methods have been developed to predict pathogenic variants or constrained genes based on the signatures of negative selection in human populations. However, we currently lack a statistical framework to jointly predict deleterious variants and constrained genes from both variant-level features and gene-level selective constraints. Here we present such a unified approach, UNEECON, based on deep learning and population genetics. UNEECON treats the contributions of variant-level features and gene-level constraints as a variant-level fixed effect and a gene-level random effect, respectively. The sum of the fixed and random effects is then combined with an evolutionary model to infer the strength of negative selection at both variant and gene levels. Compared with previously published methods, UNEECON shows unmatched performance in predicting missense variants and protein-coding genes associated with autosomal dominant disorders. Furthermore, based on UNEECON, we observe an unexpectedly low correlation between gene-level intolerance to missense mutations and gene-level intolerance to loss-of-function mutations. Finally, we show that genes intolerant to both missense and loss-of-function mutations play key roles in autism. Overall, UNEECON is a promising framework for both variant and gene prioritization.
- Lincoln Stein, Ontario Institute for Cancer Research, Canada
Presentation Overview:
The ICGC/TCGA Pan-Cancer Analysis of Whole Genomes project explored the role of coding and non-coding variation among a cohort of >2,600 cancer whole genomes and matching normal tissues. The consortium had to overcome multiple challenges to generate a consistent and high-quality set of variant calls across the cohort, including large differences in the accuracy of different somatic mutation calling algorithms, the lack of uniform benchmarking, and the technical challenges of uniformly processing a geographically scattered data set of 800 TB. In this talk, I will walk through how the consortium addressed these challenges, describe what we learned about the significance of coding and non-coding cancer driver mutations, and discuss insights gained from the distribution of non-driver "passenger" mutations.
- Remo Monti, Hasso Plattner Institute, Max Delbrück Center for Molecular Medicine, Germany
- Pia Rautenstrauch, Hasso Plattner Institute, University of Tübingen, Germany
- Stefan Konigorski, Hasso Plattner Institute, Germany
- Alva Rani James, Hasso Plattner Institute, Germany
- Mahsa Ghanbari, Max Delbrück Center for Molecular Medicine, Germany
- Uwe Ohler, Max Delbrück Center for Molecular Medicine, Germany
- Christoph Lippert, Hasso Plattner Institute, Germany
Presentation Overview:
Sequencing-based genotyping methods are on the rise, yet leveraging the predominantly rare genetic variants they measure remains challenging. Large rare-variant association studies have mainly focused on protein-altering variation while little attention has been given to variants acting on the RNA level or other non-coding regulatory mechanisms. For these mechanisms, deep learning has recently been successful at predicting the effects of genetic variants.
Here we introduce seak (sequence annotations in kernel-based tests), a Python package that flexibly integrates variant effect predictions into set-based association tests while controlling for relatedness and population structure using linear mixed models. We first show that using functional variant effect predictions can increase statistical power in simulation studies and shed light on potentially causal mechanisms. Then we apply seak to the UK Biobank exome-sequencing dataset. We perform association tests for three biomarkers of cardiovascular disease and cancer, incorporating deep-learning-derived variant effects for disease-related RNA-binding proteins. With this novel approach we find two significant associations for each biomarker, which include both novel and known associations.
Our results demonstrate that, by incorporating regulatory variant effects, seak can identify novel biologically interpretable associations, thereby unlocking the potential of whole-exome and whole-genome sequencing studies.
- Alex Kaplun, United States
Presentation Overview:
Our diagnostic testing lab uses PCR-free whole genome sequencing (WGS) as the platform for comprehensive genetic evaluation. Like traditional exomes, it detects single nucleotide variants (SNVs) and small indels, but it can also detect structural variants and short tandem repeats (STRs). Of the clinical cases processed this year with reported pathogenic or likely pathogenic variants, just 65% involved SNVs/indels only. By starting with the most comprehensive test, we capture the vast majority of genetic variants with sensitivity and specificity comparable to or better than standard-of-care/best-practice testing. This eliminates or drastically reduces the need for the other tests currently required for variants that are only detectable on dedicated platforms: Southern blots for long STRs and large deletions, PCR/capillary electrophoresis for small STRs, qPCR/MLPA for exon-level deletions and duplications, and microarrays, FISH and karyotyping for gross chromosomal deletions. This type of genetic diagnostic test has the potential to flip the current costly and time-consuming paradigm of starting small with single-gene tests and then moving to large panels or exomes/genomes. With short-read WGS analysis we currently see sensitivity values of >99% for SNVs, >96% for indels and >85% for structural variants. Adding long-read WGS analysis addresses many of the weaknesses of short-read analysis, including, but not limited to, accurately identifying hard-to-detect deletions and insertions, covering non-uniquely mappable areas, determining the exact count of repeat units in expansions, and identifying balanced translocations. With simultaneous orthogonal confirmation using both short- and long-read WGS analysis, the sensitivity for indels, structural variants and STR detection increases to >95% (based on preliminary validation data). We will present a breakdown of these data.
- Berk Alpay, University of Connecticut, United States
- Pinar Demetci, Brown University, United States
- Sorin Istrail, Brown University, United States
- Derek Aguiar, University of Connecticut, United States
Presentation Overview:
Motivation:
Genome-wide association studies have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, eQTL studies have interpreted these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post-hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experiments.
Results:
In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a conventional regression perspective and develop the HAPLEXR algorithm which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm that incorporates suffix tree based haplotype sharing with spectral clustering to identify expression classes from haplotype sequences. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD on five GTEx v8 tissues with three state-of-the-art expression prediction methods.
HAPLEXD exhibits significantly higher classification accuracy overall and HAPLEXR shows higher prediction accuracy on a significant subset of genes. These results demonstrate the importance of explicitly modelling non-dosage dependent and intragenic epistatic effects when predicting expression.
- Patrick May, Luxembourg Centre for Systems Biomedicine, Luxembourg
- Eduardo Pérez-Palma, Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, United States
- Dennis Lal, Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, United States
Presentation Overview:
Classifying the pathogenicity of missense variants is a major challenge in clinical practice. While orthologous gene conservation is commonly employed in variant annotation, approximately 80% of known disease-associated genes belong to gene families. We empirically evaluated whether paralog-conserved or non-conserved sites in human gene families are important in neurodevelopmental diseases (NDDs) and demonstrated that disease-associated missense variants are enriched at paralog-conserved sites across all disease groups and inheritance models tested. We developed a gene family de novo enrichment framework that identified 43 exome-wide enriched gene families, including 98 de novo variant-carrying genes in NDD (Lal et al., 2020).
Regions essential for protein function are conserved among gene-family members, and genetic variants within these regions are potentially more likely to confer disease risk. We explored whether gene family information could support the identification of novel disease-related regions within proteins. We compared 2,219,811 variants from the general population to 76,153 variants from patients. With this gene-family approach, we identified 465 regions enriched for patient variants in 1,252 genes. We found that missense variants inside the identified regions are 106-fold more likely to be classified as pathogenic than as benign (Pérez-Palma et al., 2020).
- Zishuo Zeng, Rutgers University, United States
- Yana Bromberg, Rutgers University, United States
Presentation Overview:
Due to the lack of experimental evaluations of variant effects, existing predictors for synonymous single nucleotide variants (sSNVs) often rely on databases of variant-disease associations, which are of limited use for establishing functional effects. Large-scale genome-sequencing efforts have observed roughly four million sSNVs (real variants), whereas 58 million possible sSNVs have not been observed. We expect that a large fraction of the latter will be found with more sequencing (not-yet-seen variants), while others are either purified due to extreme deleteriousness or unobservable due to physicochemical or sequencing constraints. We further highlight that the not-yet-seen variants are likely similar to the real variants, representing a wide range of functional effects, although the former are probably enriched in more deleterious consequences.
We first built a model to identify the not-yet-seen variants among all the unobserved ones. We then trained a second model to differentiate the not-yet-seen and real sets of variants. We further trained the final model using the real and not-yet-seen variants, subset by cutoffs determined by common variants and pathogenic variants, respectively. Our final model outperforms currently available sSNV predictors in differentiating experimentally verified pathogenic sSNVs from real variants in the testing set.
- Joseph Chi-Fung Ng, King's College London, United Kingdom
- F Fraternali, Randall Division of Cell and Molecular Biophysics, King’s College London, United Kingdom
- Anna Laddach, The Francis Crick Institute, United Kingdom
Presentation Overview:
Missense variants are present amongst the healthy population, but some are causative of human diseases. A deeper understanding of the nature of missense variants in health and disease, and of their underlying biophysical features, is essential to better distinguish pathogenic from population variants. Here we quantify variant enrichment across full-length proteins, domains and 3D-structure defined regions, and integrate this with available transcriptomic and proteomic (half-life, thermal stability, abundance) data. We have mined a rich set of molecular features which separate pathogenic and population variants: pathogenic variants mainly affect proteins involved in cell proliferation and nucleotide processing, localise to protein cores and interaction interfaces, and are enriched in abundant proteins. Contrary to other studies, we find that rare population variants display molecular features which are closer to those of common variants than to those of pathogenic variants. We validate these molecular features indicative of variant pathogenicity by comparing against existing in silico impact annotations. This study reveals molecular principles of the sensitivity of proteins towards missense variants, which could be useful in predicting variant deleteriousness and prioritising protein domains for therapeutic development. The ZoomVar (http://fraternalilab.kcl.ac.uk/ZoomVar) database has been created for large-scale annotation of variants onto protein structures and calculation of variant enrichment across protein structural regions.
- Silvia Benevenuta, University of Torino, Italy
- Emidio Capriotti, University of Bologna, Italy
- Piero Fariselli, University of Torino, Italy
Presentation Overview:
Identifying and annotating pathogenic variants is a major challenge in human genetics, especially for non-coding variants. Several tools have been developed and used to predict the functional effect of genetic variants. However, the calibration of their predictions has received little attention. Calibration refers to the idea that if a model predicts a group of variants to be pathogenic with a probability P, the same fraction P of that group is expected to be truly pathogenic in the observed set. This problem is relevant since poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making.
We evaluated the calibration and the prediction performance of four predictors that directly furnish a probability as output (DANN, DeepSEA, FATHMM-MKL and PhD-SNPg). Although they perform similarly in terms of AUC, most of them are not well calibrated. We also tested two methods that provide only raw scores (CADD and Eigen), applying calibration transformations to them. We showed that CADD can be well calibrated and can be used, after this transformation, as a probability score. Among the predictors tested, PhD-SNPg provides the best calibration without any transformation of the scores, while CADD is the best predictor after calibration.
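The definition of calibration given above can be made concrete with a small reliability table: group variants by predicted probability, then compare the mean prediction in each group with the observed fraction of pathogenic variants. The sketch below is a generic illustration, not the evaluation procedure used in this work; the equal-width binning, the function names, and the toy data are assumptions.

```python
def reliability_table(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns one (mean predicted probability, observed pathogenic fraction,
    bin size) tuple per non-empty bin. `labels` are 1 for pathogenic, 0 otherwise.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    table = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        table.append((mean_pred, frac_pos, len(members)))
    return table


def expected_calibration_error(table):
    """Size-weighted mean gap between predicted and observed fractions."""
    total = sum(n for _, _, n in table)
    return sum(n / total * abs(mean_pred - frac_pos)
               for mean_pred, frac_pos, n in table)


# Toy predictor A (invented): says 0.2 for ten variants, two of which are
# pathogenic, and 0.8 for ten variants, eight of which are pathogenic.
probs_a = [0.2] * 10 + [0.8] * 10
labels_a = [1, 1] + [0] * 8 + [1] * 8 + [0, 0]
ece_a = expected_calibration_error(reliability_table(probs_a, labels_a))

# Toy predictor B (invented): says 0.9 for ten variants, only five of which
# are pathogenic, so it is overconfident.
probs_b = [0.9] * 10
labels_b = [1] * 5 + [0] * 5
ece_b = expected_calibration_error(reliability_table(probs_b, labels_b))
```

Predictor A has a calibration error close to zero, while predictor B is off by about 0.4. Ranking metrics such as AUC depend only on the order of the scores, which is why predictors with similar AUC can still differ widely in calibration.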
- Muhammed Hasan Çelik, Technical University of Munich, Germany
- Nils Wagner, Technical University of Munich, Germany
- Julien Gagneur, Technical University of Munich, Germany
Presentation Overview:
Aberrant splicing is a major cause of genetic diseases. However, the affected tissues of a large set of genetic disorders, including cardiac and neurological disorders, are not clinically accessible, preventing experimental detection of aberrant splicing. Here we develop the first benchmark datasets and algorithms for predicting tissue-specific aberrant splicing. We focus on the task of prioritizing rare genetic variants. Applying MMSplice, a state-of-the-art model predicting percent spliced-in based on DNA sequence, to existing exon annotations shows limited performance for outlier prediction. A substantial improvement is obtained by combining MMSplice with a tissue-specific map of splice sites and splicing fractions (percent spliced-in) that we generated. Finally, a model which further integrates splicing measurements from whole blood RNA-seq reaches a median AU-PRC of 12%, i.e. about a 15-fold improvement over MMSplice alone. Altogether, our approach and results have implications for non-invasive genetic diagnostics, including in neonatal settings.
- Fritz Roth
Presentation Overview:
Cell-based assays can model the organismal impact of human genetic variants, and the dependence of variant effects on environmental and genetic factors. Multiplexing such assays can provide functional impact scores for nearly all possible amino acid changes in specific protein targets. My group has applied the TileSeq framework for variant effect mapping to more than a dozen human proteins thus far. I will highlight variant effect mapping for methylenetetrahydrofolate reductase (MTHFR; associated with hyperhomocysteinemia and other disorders), including dependence on folate and on presence of the A222V variant (carried by ~50% of all humans). I will also outline tools for analysis and data-sharing, and explore the feasibility of a community effort to proactively test all possible missense variants in thousands of human disease genes.
- Michal Linial, The Hebrew University of Jerusalem, Israel
- Kerem Wainer-Katsir, The Hebrew University of Jerusalem, Israel
Presentation Overview:
Current technologies for single-cell transcriptomics allow thousands of cells to be analyzed in a single experiment. The increased scale of these methods raises the risk of contamination by cell doublets. Available tools and algorithms for identifying doublets and estimating their occurrence in single-cell experimental data focus on doublets of different species, cell types or individuals. In this study, we analyze transcriptomic data from single cells having an identical genetic background. We claim that the ratio of monoallelic to biallelic expression provides discriminating power for identifying doublets. We present a pipeline called BIRD (BIallelic Ratio for Doublets) that relies on heterologous genetic variations from single-cell RNA-Seq (scRNA-seq). For each dataset, doublets were artificially created from the actual data and used to train a predictive model. BIRD was applied to Smart-Seq data from 163 primary fibroblast single cells. The model achieved 100% accuracy in annotating the randomly simulated doublets. Bona fide doublets were verified based on a biallelic expression signal on the X chromosome of female fibroblasts. On data from 10X Genomics microfluidics of human peripheral blood cells, the model achieved on average 83% (± 3.7%) accuracy and an area under the curve of 0.88 (± 0.04) for a collection of ~13,300 single cells. BIRD addresses doublets formed from cell mixtures of identical genetic background and cell identity. Maximal performance is achieved for high-coverage data from Smart-Seq. Success in identifying doublets is data-specific and varies according to the experimental methodology, genomic diversity between haplotypes, sequence coverage, and depth.
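The intuition behind using allelic expression to flag doublets can be sketched with toy read counts. The code below is not the BIRD pipeline: the function names, the depth and minor-allele thresholds, the use of the fraction of biallelic sites as the summary statistic, and the read counts themselves are all assumptions chosen for illustration, and the trained predictive model is not shown.

```python
def biallelic_fraction(cell, min_depth=4, min_minor_fraction=0.1):
    """Fraction of informative heterozygous sites expressed from both alleles.

    `cell` maps site -> (reads supporting allele 1, reads supporting allele 2).
    Sites with fewer than `min_depth` reads are ignored; a site counts as
    biallelic when the minor allele has at least `min_minor_fraction` of reads.
    """
    informative = 0
    biallelic = 0
    for allele1, allele2 in cell.values():
        depth = allele1 + allele2
        if depth < min_depth:
            continue
        informative += 1
        if min(allele1, allele2) / depth >= min_minor_fraction:
            biallelic += 1
    return biallelic / informative if informative else 0.0


def simulate_doublet(cell_a, cell_b):
    """Create an artificial doublet by summing the read counts of two cells."""
    doublet = {}
    for site in set(cell_a) | set(cell_b):
        a1, a2 = cell_a.get(site, (0, 0))
        b1, b2 = cell_b.get(site, (0, 0))
        doublet[site] = (a1 + b1, a2 + b2)
    return doublet


# Toy cells (invented counts): single cells often express one allele at a
# time, so most sites look monoallelic in each cell on its own.
cell_a = {"snp1": (10, 0), "snp2": (0, 8), "snp3": (6, 5)}
cell_b = {"snp1": (0, 9), "snp2": (7, 0), "snp3": (12, 0)}
doublet = simulate_doublet(cell_a, cell_b)

frac_a = biallelic_fraction(cell_a)
frac_b = biallelic_fraction(cell_b)
frac_doublet = biallelic_fraction(doublet)
```

In this toy case the two single cells have one and zero biallelic sites out of three, while the simulated doublet is biallelic at all three. A statistic of this kind, computed on real cells and on artificial doublets built from them, is the sort of feature a classifier could be trained on.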