Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: The advent of next-generation sequencing technologies has led to an ever-increasing gap between virus discovery and characterization. Currently, the International Committee on Taxonomy of Viruses classifies newly discovered viruses using their genetic and phenotypic properties. However, the lack of biological characterization for many viruses challenges this traditional approach. An alternative approach based solely on genome sequences was implemented as a tool called DEmARC [Lauber&Gorbalenya,J.Virol,2012]. It partitions the genetic diversity of a monophyletic group of viruses by: 1) calculating pairwise evolutionary distances (PED) between genomes; 2) generating genome clusters; 3) evaluating the quality of each cluster (clustering cost, CC); 4) semi-automatically selecting PED thresholds associated with partitioning of the genome sequence diversity into clusters of the highest quality (ranks of hierarchical taxonomy). Recent improvements to DEmARC aim to make it fully automatic. To account for sampling, we introduced sequence weights in steps #2 and #3. We also introduced a function that measures the quality of each clustering interval (CQ); a group consisting of a limited number of clustering intervals with high CQ can be considered as one defining the taxonomic ranks (step #4). This approach paves the way to automatic selection of taxonomic rank thresholds.
Short Abstract: Variant interpretation is a critical issue for labs involved in germline and somatic sequencing, particularly those looking at high‐risk patient populations. The Cancer-Related Analysis of Variants Toolkit (CRAVAT) is a popular service for interpreting high-volume genomic variation data that has been used by tens of thousands of users world-wide to analyze billions of variants. CRAVAT provides results in an intuitive, visually striking, and interactive environment. Historically these services have been available via a public web portal or a Docker container installed locally or on a cloud-based server. Recently we have developed a new Open CRAVAT that allows researchers to make custom variant interpretation tools available as plug-ins to the CRAVAT architecture. Open CRAVAT is lightweight and modular, allowing users to install tools and workflows most relevant to their scientific focus. The new architecture is a pip-installable python package and includes a “CRAVAT store” from which tools can be selected. Developers can publish tools to the store, which features author names, citations, descriptions, and community reviews. Creating plug-in versions of tools enables developers to leverage our user base and an existing infrastructure that processes user input, maps variants from genome to transcriptome to proteome, and provides interactive tabular and graphical results.
Short Abstract: Genome-wide association studies are revealing an increasing number of variants associated with Alzheimer's disease (AD), the majority of which fall in non-coding regions of the genome. However, identification of causal variants and impacted cellular mechanisms remains a huge challenge due to linkage disequilibrium in the population and incomplete knowledge of the cell type-specific functions of non-coding regions in the genome. We attempt to overcome this challenge using new cell type-specific epigenomics data from the brain. We first identify regulatory elements in three different cell types in the human brain using ChIP-Seq of H3K27ac on sorted nuclei. Then, using a computational overlap enrichment analysis, we compare our database of regulatory elements with variant data from an existing Alzheimer's GWAS. Our analysis reveals that microglial regulatory elements have significant overlap with AD-associated GWAS variants relative to neuronal and glial regulatory elements (p < 10^-4), suggesting that microglial gene regulation is significantly altered in AD. We further train machine learning models to ascertain which of multiple variants in LD are causal and to interpret their role in cell type-specific gene regulation. In addition to revealing a role for microglial gene regulation in AD progression, our analyses provide a candidate set of regulatory elements, cell types and variants that can be prioritized for experimental follow-up.
Short Abstract: There are over 10,000 synonymous single nucleotide variants (sSNVs) present in the genome of every individual. We analyzed the amino acid distribution and codon use of the observed sSNVs in all human protein-coding transcripts. Our analysis shows that, depending on the amino acid type, only 4.9% (of K) to 23.5% (T) of all amino acids have at least one observed sSNV; i.e. only 12.4% of all protein sequence positions harbor synonymous variants. Notably, the often structurally or functionally important amino acids (e.g. charged or disulfide bond-forming) have a lower frequency of sSNV. Similarly, only 2.0% (of ATC) to 24.8% (CCT) of all codons harbor at least one sSNV. The percent amino acid composition and percent codon frequency are only somewhat correlated with the likelihood of having at least one sSNV (Pearson correlation, r = 0.46 and 0.12, respectively). The likelihood of observing a particular variation (e.g. GTT->GTC) also deviates from the expectation to an extent defined by the corresponding codon use, the number of possible synonymous codons, and the transition/transversion nature of the substitution. Our results suggest that the observed human sSNVs are highly nonrandom and, thus, quantifying deviation of an sSNV from random could help evaluate its effect.
Short Abstract: Each person’s genome sequence contains thousands of missense variants. Without exhaustive experimental measurement, practical interpretation of a mutation’s functional significance currently relies on computational inferences. Here we assess the efficacy of such inferences by sequencing progeny of ENU-treated mice across 23 essential immune system genes. PolyPhen2, SIFT, MutationAssessor, Panther, CADD, and Condel were used to predict functional impact, and homozygous mutant mice were examined for the expected loss-of-function phenotype. Only 20% of predicted damaging mutations exhibited a discernible phenotype, yet most mutations appear to be subject to purifying selection, as few persist between separate mouse substrains, rodents, or primates. Because gene defects could be phenotypically masked in vivo by compensation and environment, we generated functional predictions for all 2,314 possible single amino acid missense variants in TP53 and compared them to in vitro phenotypes. Here 42% of predicted deleterious mutations had little impact on TP53-promoted transcription. We conclude that half of inferred deleterious missense mutations correspond to nearly neutral mutations that have little impact on clinical phenotype but are subject to purifying selection. These results highlight an important gap in our ability to relate genotype to phenotype in clinical sequencing: the inability to differentiate clinically relevant mutations from nearly neutral mutations.
Short Abstract: A growing body of evidence shows an association of G-quadruplex (GQ) structures with cancer regulation, suggesting a crucial role in gene expression and regulation. Here, we aim to identify potential mutations that modulate GQ stability in non-coding regions and could be proposed as cancer driver mutations. We collected 989 samples with 3,863,577 variants of whole-genome cancer mutation data from ICGC and the genome-wide map of GQs from the non-B database. On the basis of these two data sets, we identified cancer mutations that overlap potential GQ motifs within non-coding regulatory regions. Genome-wide mapping of non-coding regulatory mutations to their target genes was done using enhancer-promoter interaction maps. To filter out randomly distributed mutations, a “hotspot” analysis identified small significant regions with frequent mutations by detecting clusters of mutations within 50 bp of each other. On the basis of frequency- and sequence-based approaches, we scanned the cancer genome for non-coding mutations with potential regulatory impact. These recurrent mutations in non-coding regions that overlap with GQs can enhance the understanding of the unusual transcriptional regulatory networks of the cancer genome. These findings will guide the study of significant mutations that could be used as novel drug targets for therapeutic purposes.
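The hotspot step in the abstract above (clusters of mutations within 50 bp of each other) can be illustrated with a single pass over sorted positions. This is only a minimal sketch: the function name `find_hotspots` and the `min_mutations` cutoff are hypothetical, positions are assumed to come from one chromosome, and the significance assessment mentioned in the abstract is not reproduced.

```python
from typing import List, Tuple


def find_hotspots(
    positions: List[int], max_gap: int = 50, min_mutations: int = 3
) -> List[Tuple[int, int, int]]:
    """Group mutation positions (one chromosome) into clusters.

    Consecutive mutations that are at most `max_gap` bp apart belong to the
    same cluster. Clusters with fewer than `min_mutations` members are
    dropped (hypothetical cutoff). Returns (start, end, count) tuples.
    """
    hotspots: List[Tuple[int, int, int]] = []
    if not positions:
        return hotspots

    ordered = sorted(positions)
    start = prev = ordered[0]
    count = 1
    for pos in ordered[1:]:
        if pos - prev <= max_gap:
            count += 1  # still inside the current cluster
        else:
            if count >= min_mutations:
                hotspots.append((start, prev, count))
            start = pos  # open a new cluster
            count = 1
        prev = pos

    # Close the final cluster.
    if count >= min_mutations:
        hotspots.append((start, prev, count))
    return hotspots
```

For example, `find_hotspots([100, 120, 160, 500, 1000, 1010, 1049, 1051])` groups the first three and the last four positions into two clusters and discards the isolated mutation at 500.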
Short Abstract: Pathway-based analysis in genome-wide association study (GWAS) is being widely used to uncover novel multi-genic functional associations. Many of these pathway-based methods have been used to test the enrichment of the associated genes in the pathways, but exhibited low powers and were highly affected by free parameters. We present the novel method and software GSA-SNP2 for pathway enrichment analysis of GWAS P-value data. GSA-SNP2 provides high power, decent type I error control and fast computation by incorporating the random set model and SNP count adjusted gene score. In a comparative study using simulated and real GWAS data, GSA-SNP2 exhibited high power and best prioritized gold standard positive pathways compared with six existing enrichment-based methods and two self-contained methods (alternative pathway analysis approach). Based on these results, the difference between pathway analysis approaches was investigated and the effects of the gene correlation structures on the pathway enrichment analysis were also discussed. In addition, GSA-SNP2 is able to visualize protein interaction networks within and across the significant pathways so that the user can prioritize the core subnetworks for further studies. GSA-SNP2 is freely available at https://sourceforge.net/projects/gsasnp2.
Short Abstract: Methods for integrating genome-wide association and molecular data must be developed to reveal the multiple weak variants affecting human traits. Although integrative frameworks are appropriate for the analysis of heterogeneous data, they usually lack robustness against confounding factors (e.g., patients' ancestry), which generates spurious findings and calls into question their validity versus traditional regression models (RMs) for SNV prioritisation. In this research, we defined a data integration framework called cNMTF for prioritising reliable associations between SNVs and traits. This algorithm uses matrix factorisation to capture the interrelatedness between variant data, the deleteriousness of SNVs and the protein-protein interactions (PPIs) that might be disrupted. It simultaneously accounts for the patient's outcome and ancestry by means of kernel functions, minimising confounding due to population structure. We applied cNMTF to the prioritisation of SNVs associated with serum lipid levels in American and Finnish cohorts. The prioritised variants were validated to 67% against previous GWAS, and allowed us to identify disrupted PPIs specific to these cohorts. cNMTF not only performed efficiently against strong population structures, it also supported the hypothesis that SNVs under the significance level in RMs still carry evidence to explain the phenotype, and our method can effectively prioritise them when integrating protein data.
Short Abstract: Discovery of potentially deleterious sequence variants is important and has wide implications for research and generation of new hypotheses in human and veterinary medicine and drug discovery. The GenProBiS web server maps sequence variants to protein structures from the Protein Data Bank (PDB), and further to protein-protein, protein-nucleic acid, protein-compound, protein-metal ion binding sites. The concept of a protein-compound binding site is understood in the broadest sense, which includes glycosylation and other post-translational modification sites. Binding sites were defined by local structural comparisons of whole protein structures using the Protein Binding Sites (ProBiS) algorithm and transposition of ligands from the similar binding sites found to the query protein using the ProBiS-ligands approach with new improvements introduced in GenProBiS. Binding site surfaces were generated as three-dimensional grids encompassing the space occupied by predicted ligands. The server allows intuitive visual exploration of comprehensively mapped variants, such as human somatic missense mutations related to cancer and non-synonymous single nucleotide polymorphisms from 21 species, within the predicted binding sites regions for about 80,000 PDB protein structures using fast WebGL graphics. The GenProBiS web server is open and free to all users at http://genprobis.insilab.org.
Short Abstract: Genetic alterations are essential for cancer initiation and progression. However, differentiating mutations that drive the tumor phenotype from mutations that do not affect tumor fitness remains a fundamental challenge in cancer biology. To better understand the impact of a given mutation within cancer, RNA-sequencing data was used to categorize mutations based on their allelic expression. For this purpose, we developed the MAXX (Mutation Allelic Expression Extractor) software, which is highly effective at delineating the allelic expression of both single nucleotide variants and small insertions and deletions. Results from MAXX demonstrated that mutations can be separated into three groups based on their expression of the mutant allele, lack of expression from both alleles, or expression of only the wild-type allele. Utilizing selectively expressed genes that are the target of mutation in PDAC, it was possible to develop subtypes that have prognostic significance and are associated with sensitivity to select classes of therapeutic agents in cell culture. Thus, mutant allele expression via MAXX represents a means to parse somatic variants in tumor genomes, reducing the ambiguity of a gene’s respective role in cancer.
Short Abstract: Genome-wide association studies (GWAS) have identified single nucleotide polymorphisms (SNPs) associated with trait diversity and disease susceptibility, yet their functional properties often remain unclear. It has been hypothesized that SNPs in microRNA (miRNA) binding sites may disrupt gene regulation by miRNAs. While several studies have predicted the location of SNPs in binding sites, there has been no comprehensive analysis of their functional impact. Here we investigate the functional properties of SNPs and their effects on miRNA regulation of gene expression in cancer. Our analysis is motivated by the hypothesis that distinct alleles may cause differential miRNA:mRNA binding and alter the expression of genes. We previously identified pathways that are dysregulated by miRNAs in cancer, by comparing miRNA-pathway associations between healthy and tumor tissue. We draw on these results to assess whether SNPs are responsible for miRNA dysregulation of individual genes in tumors. Using an integrative analysis that incorporates miRNA expression, mRNA expression, and SNP genotype data, we identify functional SNPs that we term "regulatory QTLs (regQTLs)": loci whose alleles impact gene regulation by miRNAs. We apply the method to breast, liver, lung, and prostate cancers from The Cancer Genome Atlas (TCGA), and provide a tool to explore the findings.
Short Abstract: Plasma lipid levels are risk factors for cardiovascular disease, a leading cause of death worldwide. While many European-centric studies have been conducted on lipid genetics, their transferability to diverse populations is unclear. We performed SNP- and gene-level genome-wide association studies (GWAS) of four lipid traits in Nigerian and Filipino cohorts and compared them to the results of larger, predominantly European meta-analyses. Two previously implicated loci met significance in our GWAS in the Nigerian cohort: rs34065661 in CETP, associated with HDL cholesterol (P=9.0e−10), and rs1065853 upstream of APOE, associated with LDL cholesterol (P=6.6e−9). The top SNP in the Filipino cohort, which was also previously implicated, was associated with triglyceride levels (rs662799; P=2.7e−16). While this SNP is located directly upstream of APOA5, we show it may be involved in regulation of BACE1 and SIDT2. Our gene-based association analysis, PrediXcan, revealed that decreased expression of BACE1 and SIDT2 in several tissues, driven by rs662799, is significantly associated with increased triglyceride levels in Filipinos (FDR<0.1). Our PrediXcan analysis also implicated gene regulation as the mechanism underlying the associations of other previously discovered lipid loci. Our BACE1 and SIDT2 findings were confirmed using summary statistics from the Global Lipids Genetics Consortium meta-GWAS.
Short Abstract: Some cancer mutations target protein-protein interactions, suggesting that perturbation of specific interactions may contribute to tumorigenesis. Interactions under positive selection for mutations in cancer could implicate new driver genes. We therefore assessed physical interaction interfaces of proteins for unexpected bias toward nonsynonymous (NS) mutations across tumors, a signature of positive selection during tumorigenesis. We analyzed 1.4 million NS cancer mutations against a PPI network comprising 6230 human proteins with atomic level interface details extracted from protein structures. This analysis found that NS mutations selectively target PPIs of known oncogenes (OR=1.73, P-value=4.8e-21) and tumor suppressors (OR=1.23, P-value=5.4e-4), but not of other genes (OR=0.98, P-value=0.028). Screening for signatures of positive selection, we identified 138 interfaces enriched for NS mutations, implicating 212 genes as putative drivers. We also observed mutations at multiple distinct interfaces of cancer genes, suggesting that different interactions may be affected in different patients. Interestingly, three hotspot mutations in TP53 (residues 175, 248, 273) which affect distinct interfaces displayed significant differences in patient survival, indicating that phenotypic pleiotropy caused by different perturbations of cancer gene interaction networks could contribute to heterogeneity in tumors with significant implications for patient outcomes.
Short Abstract: The anti-cancer immune response against mutated peptides of immunological relevance (neoantigens) is primarily attributed to MHC-I-restricted cytotoxic CD8+ T-cell responses. MHC-II-restricted CD4+ T-cells also drive anti-tumor responses; however, their relation to neoantigen selection, cancer susceptibility and tumor evolution has not been systematically studied. To address this, we developed a score allowing interpretation of MHC-II variation-based genotype in the context of presentation on the cell surface. Computationally modeling the potential of an individual’s MHC-II genotype to present 1,018 cancer-causing mutations in 7,137 tumors, we demonstrate that MHC-II genotype constrains the somatic mutational landscape during tumorigenesis. Poor presentation by MHC-II increased the odds of observing a mutation, even more than MHC-I. Exploiting MHC-II and MHC-I genotype complementarily increased power to predict occurrence of mutations; however, overall precision was limited, suggesting that other factors are stronger determinants of specific mutations. While MHC-I genotype correlated with age at diagnosis, MHC-II showed no such correlation, consistent with a prevalent regulatory role of CD4+ T-cells. These results implicate the immune system as a key heritable risk factor for cancer.
Short Abstract: Manually curating biomedical knowledge from publications is necessary to build a knowledge-based service that provides highly precise and organized information to users. The process of retrieving candidate publications for curation, which is called triage, is usually carried out by querying PubMed; however, this query-based method obtains unsatisfactory precision and recall on the retrieved results, and it is difficult to manually generate optimized queries. To address this, we propose a machine learning-assisted triage method. We collected curated publications from databases such as UniProtKB/Swiss-Prot and the NHGRI-EBI GWAS Catalog, and used them as a gold-standard dataset for training machine learning models based on recent convolutional neural networks. We then used the trained models to classify and rank new publications for curation. We applied this method to the real-world manual curation process of UniProtKB/Swiss-Prot and the GWAS Catalog. Our method achieved a precision 1.81 and 2.99 times higher than that obtained by the current query-based triage methods of UniProtKB/Swiss-Prot and the GWAS Catalog, respectively, without significantly compromising recall. Our method found many additional relevant publications that the query-based method of UniProtKB/Swiss-Prot could not find. In the GWAS Catalog case, our method significantly improved the efficiency of the curation process.
Short Abstract: Combining statistical significances (p-values) from a set of single-locus association tests in genome-wide association studies is a proof-of-principle method for identifying disease-associated genomic segments, functional genes, and biological pathways. We review p-value combinations for genome-wide association studies and introduce an integrated analysis tool, Omnibus P-value Association Tests (OPATs), which provides popular analysis methods of p-value combinations. The software OPATs programmed in R and R graphical user interface (GUI) features a user-friendly interface. In addition to analysis modules for data quality control and single-locus association tests, OPATs provides three types of set-based association test: window-, gene-, and biopathway-based association tests. P-value combinations with or without threshold and rank truncation are provided. The significance of a set-based association test is evaluated by using resampling procedures. Performance of the set-based association tests in OPATs has been evaluated by simulation studies and real data analyses. In summary, p-value combinations facilitate the identification of marker sets associated with disease susceptibility and uncover missing heritability in association studies, thereby establishing a foundation for the genetic dissection of complex diseases and traits. OPATs provides an easy-to-use and statistically powerful analysis tool for p-value combinations. OPATs, examples, and user guide can be downloaded from http://www.stat.sinica.edu.tw/hsinchou/genetics/association/OPATs.htm.
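As a concrete illustration of what a p-value combination is, the sketch below implements Fisher's method, a well-known way to merge p-values from independent tests into one. It is an assumption that this matches any specific routine in OPATs; the abstract states that OPATs assesses significance by resampling, whereas this sketch uses the closed-form chi-square tail that holds only for independent tests. The function name is hypothetical.

```python
import math
from typing import Sequence


def fisher_combined_pvalue(pvalues: Sequence[float]) -> float:
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null hypothesis. For even degrees
    of freedom the upper tail has the closed form
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    so no statistics library is needed.
    """
    k = len(pvalues)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")

    half_stat = -sum(math.log(p) for p in pvalues)  # this is X / 2

    # Accumulate sum_{i=0}^{k-1} (X/2)^i / i! term by term.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check. Truncated variants (by threshold or rank) do not have this simple null distribution, which is one reason resampling is commonly used for them.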
Short Abstract: Ebola viruses cause hemorrhagic fever in humans. The West African Ebola virus outbreak demonstrated this on an epidemic scale. However, these viruses do not normally cause disease in rodents. To establish in vivo models, Ebola viruses have been adapted to rodents. We used a structural bioinformatics approach to analyse the mutations associated with Ebola virus adaptation to rodents to elucidate the determinants of host-specific Ebola virus pathogenicity. Only three Ebola virus proteins, VP24, GP and NP, were consistently mutated in rodent-adapted Ebola virus strains. The role of mutations in GP and NP is unclear. However, three VP24 mutations located in the protein interface with karyopherin5 may enable VP24 to inhibit karyopherins and subsequently the host interferon response. Three further VP24 mutations change hydrogen bonding or cause conformational changes. In conclusion, we show that few mutations, including crucial mutations in VP24, enable Ebolavirus adaptation to new hosts. Since Reston virus, the only Ebolavirus species that is not pathogenic in humans, circulates in domestic pigs in Asia, this raises public health concerns that novel human-pathogenic Ebolaviruses may emerge.
Short Abstract: Most mutations in cancer are neutral, with few mutations acting as drivers of cancer progression. To distinguish between mutations which recur due to high background mutation probability and mutations under selection in cancer progression, we built a model of background DNA and amino acid mutation rates in cancer. Our model is based on the probability that a particular mutation occurs in a given DNA sequence context. We find that silent mutations are much more mutable than either missense or nonsense mutations and mostly accumulate at frequencies according to their background mutation rate. Our analysis of somatic mutations in tumor samples showed that the background mutation rate may explain the variance in occurrences of silent mutations and is the main contributor in driving recurrence of missense mutations in tumor suppressor genes, but not in oncogenes. We compiled a dataset of experimentally annotated cancer mutations (n=4,996) and applied a mutability-based score to classify cancer driver and neutral mutations with ROC-AUC of 0.85 and MCC of 0.64. This performance was comparable to other state-of-the-art machine-learning methods. A model of background mutation rate highlights mutations under positive selection in tumor cells and elucidates the role of mutagenesis in shaping the observed mutation spectrum.
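The two performance measures quoted above, ROC-AUC and the Matthews correlation coefficient (MCC), have standard definitions that can be written out in a few lines. The sketch below shows only those definitions; the function names are illustrative, and nothing here reproduces the authors' mutability score or data.

```python
import math
from typing import Sequence


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def roc_auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a random positive (e.g. driver)
    scores higher than a random negative (e.g. neutral); ties count 0.5.

    Quadratic in the number of samples, which is fine for illustration.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

MCC ranges from -1 to 1 and needs a score threshold to define the confusion matrix, while ROC-AUC is threshold-free, which is why the two are often reported together.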
Short Abstract: Identifying driver genes is a central problem in cancer biology and will suggest new therapeutic targets for cancer treatment. Driver genes can be identified by detecting positive selection signals in somatic mutation data from large sequencing studies of tumor samples. However, existing methods for this problem struggle to distinguish positive selection signals from the highly heterogeneous background mutational process. Here, we present a powerful new statistical approach, driverMAPS (Model-based Analysis of Positive Selection) for driver gene identification. A key feature of driverMAPS is its detailed modelling of factors that characterize both background mutation -- including gene-specific effects -- and positive selection, including spatial clustering of mutations and elevated mutation rates at functionally important sites. Applying driverMAPS to TCGA data across 20 tumor types identified 159 new potential driver genes. Cross-referencing this list with data from external sources suggests that it is strongly enriched for real discoveries. The novel genes include the mRNA methyltransferases METTL3-METTL14, and we experimentally validated the functional importance of somatic mutations in METTL3, confirming it as a potential tumor suppressor gene in bladder cancer.
Short Abstract: As genome-wide association studies have discovered a large number of genetic variants from human disease patients, prediction of mutational impacts has been crucial for sorting disease-associated variants from neutral variants. Current methods exploit evolutionary conservation of sequences under the assumption that highly conserved residues tend to be functionally or structurally important. However, many disease-associated variants are found at less conserved sites. Therefore, those variants are not covered by methods relying on evolutionary conservation approaches. Our group has reported that functionally important sites for protein conformational changes and allosteric regulations tend to be moderately conserved and coevolved with many other residues. Based on these observations, we devise a new method to predict mutational impact using networks of evolutionarily coupled residues. Our method identifies more functional variants with mutational impacts which cannot be covered by current methods. Specifically, disease-associated variants sensitive to our method are enriched in sites with certain structural characteristics known to be relatively less conserved but functionally important, such as PPI interfaces, protein surfaces, and disordered or loop regions. Our study provides an opportunity to identify less conserved disease-associated variants and gives an insight into the relationship between evolutionarily coupled residues and human disease mutations.
Short Abstract: This study was designed to perform a genome-wide association (GWA) analysis and genome partitioning using the Illumina 60K SNP chip to identify variants for pig meat quality. We also determined the relationship between genomic prediction results based on SNP data and pig meat quality phenotypes in order to establish a juvenile pig selection system. A genome-wide mixed linear model-based association analysis was conducted. To estimate the heritability explained by genome- or chromosome-wide SNPs, genetic relatedness estimation through a maximum likelihood approach was used. Selection accuracy was then compared between genomic prediction using SNP data and breeding value estimates from best linear unbiased prediction (BLUP) using conventional pedigree information (accuracy increased by approximately 20%). The GWAS identified genes associated with pig meat quality that are connected with several biological functions, including muscle genesis. Genetic variances of pig meat quality were approximately 0.2~0.5, with a positive correlation between the estimate of variance explained by individual chromosomes and their physical length. We found that genomic prediction using SNPs was superior to BLUP using conventional pedigree information.
Short Abstract: Motivation: Enrichment of trait associated SNPs for specific transcription-factor binding sites or regulatory regions in the genome can yield profound insight into underlying causal mechanisms. Analysis is complicated because the truly causal SNPs are generally unknown and can be either SNPs reported in GWAS studies or other SNPs in their linkage disequilibrium. Hence, a comprehensive pipeline for SNP enrichment analysis that utilizes all relevant information about both the genotyped SNPs and their proxies is needed. Results: We developed an R package snpEnrichR for SNP enrichment analysis. The software utilizes respected tools for background SNP set generation and genome association analysis to automatize and help users to create custom SNP enrichment analysis. We show via an example that including proxy SNPs in SNP enrichment analysis enhances the sensitivity of enrichment detect
Short Abstract: The study of rare Mendelian diseases through exome sequencing typically yields incomplete diagnostic rates (~8-70%). Whole genome sequencing of the unresolved cases makes it possible to address the hypothesis that causal variants could lie in regulatory regions. However, state-of-the-art methods to prioritize non-coding variants have been characterized on variant sets largely composed of trait-associated polymorphisms and common diseases. In this work we first curated large collections of bona fide pathogenic variants in proximal cis-regulatory regions leading to Mendelian diseases. We then systematically evaluated the ability to predict causal variants of an exhaustive set of genomic features extracted at three levels: the affected position, the flanking region and the affected gene. In addition to epigenetic features and inter-species conservation scores, a complete set of ongoing purifying selection signals in humans was explored. This is a main novelty, making it possible to exploit sequence constraints potentially associated with recently acquired human regulatory elements. Our results show that supervised learning using gradient tree boosting on the previously described sets of features outperforms current reference methods for prioritization of non-coding Mendelian disease variants. A detailed comparative benchmark is presented and results are discussed in terms of the type of the targeted regulatory region.
Short Abstract: Transcripts are frequently modified by structural variants, which lead to fused transcripts of either multiple genes (known as a fusion gene) or a gene and a previously non-transcribed sequence. These transcriptome modifications, collectively called transcriptomic structural variants (TSVs), can lead to drastic changes in downstream products and become cancer drivers. Detecting TSVs is an important and challenging computational problem, especially when only RNA-seq measurements are available. We introduce SQUID, a novel algorithm to predict both fusion-gene and non-fusion-gene TSVs from RNA-seq alignments. SQUID attempts to rearrange genome segments to best explain the observed RNA-seq reads. TSVs are then derived from the rearrangement result. Tested on two previously studied cell lines, SQUID achieves similar accuracy on fusion-gene detections as current fusion-gene detection methods, but with higher accuracy for non-fusion-gene detections. SQUID is open source and available at https://github.com/Kingsford-Group/squid. Applying SQUID to TCGA tumor samples, we observe that non-fusion-gene TSVs are more likely to be intra-chromosomal than fusion-gene TSVs for multiple cancer types. Novel non-fusion-gene TSVs are detected and involve tumor suppressor genes, such as ZFHX3 and ASXL1. It is reasonable to suspect that these TSVs may lead to loss-of-function in the corresponding tumor suppressor genes and play a role in tumorigenesis.
Short Abstract: Ecosystems are subject to environmental changes which can alter key components such as pH or pollutant levels. Under such selection pressures, microbial communities respond by promoting favorable genes. We develop a high-throughput and sensitive method to discover genes and sites undergoing changes in the underlying allele distributions to understand the dynamics of adaptation processes in microbial communities. Next-generation sequencing (NGS) provides a solid basis for statistical analysis of genetic variants in complex metagenomes. For novel environments, de novo assembly of genes and genomes replaces static gene catalogues. Aligning metagenomic sequencing reads to assembled genes captures genetic variation within the communities. Technical or biological replicates help to mitigate uncertainty. Metagenome studies are limited by sequencing depth and often show large differences in abundances of taxa; therefore, many adaptation processes may occur close to or below the detection limit. We evaluate sites based on a Dirichlet-multinomial (DMN) model which applies to unmodified read counts without transformation, normalization or single nucleotide variant (SNV) calling. Preliminary results show that our approach detects shifts in the underlying allele distributions while being robust to noise. Finally, we scale this approach up to full metagenomes to understand the community-wide adaptation dynamics.
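To illustrate how a Dirichlet-multinomial model can score raw allele counts at a site without normalization or SNV calling, here is a minimal sketch of the DMN log-probability. It is not the authors' implementation: the function name `dmn_log_pmf` and the choice of concentration parameters are assumptions made for illustration only.

```python
from math import lgamma


def dmn_log_pmf(counts, alpha):
    """Log-probability of raw allele counts under a Dirichlet-multinomial.

    counts: observed read counts per allele at one site (e.g. A, C, G, T).
    alpha:  positive concentration parameters, one per allele (assumed here;
            in practice they would be estimated from the data).
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient numerator plus the Dirichlet normalisation terms.
    logp = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alpha):
        logp += lgamma(x + al) - lgamma(x + 1) - lgamma(al)
    return logp
```

Comparing the log-probability of replicate counts under condition-specific versus shared `alpha` values is one way such a model could flag a shift in the allele distribution; that comparison is a suggestion, not something stated in the abstract.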
Short Abstract: Genome-wide association studies have become common over the last ten years, with a shift towards targeting rare variants, especially in pedigree data. Despite lower costs, sequencing for rare variants remains expensive. To obtain a relatively large sample at acceptable cost, imputation approaches may be used, such as GIGI for pedigree data. GIGI is an imputation method that handles large pedigrees and is particularly good for rare variant imputation. GIGI requires a subset of individuals in a pedigree to be fully sequenced, while other individuals are sequenced only at relevant markers. The imputation infers the missing genotypes at untyped markers. Running GIGI on large pedigrees for large numbers of markers can be very time consuming. We present GIGI-Quick, a method that efficiently splits GIGI’s input, runs GIGI in parallel and efficiently merges the output, so that runtime decreases as the number of cores increases. This makes imputation results, and therefore all subsequent association analyses, available sooner.
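The split, run-in-parallel and merge pattern described in the abstract above can be sketched generically as follows. This is an illustration under stated assumptions, not GIGI-Quick itself: it assumes the input is split by marker, and `impute_chunk` is a hypothetical stand-in for one imputation run (no real GIGI command line is shown).

```python
from concurrent.futures import ThreadPoolExecutor


def split_markers(markers, n_chunks):
    """Split a marker list into contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(markers)))
    size, extra = divmod(len(markers), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(markers[start:end])
        start = end
    return chunks


def impute_chunk(chunk):
    """Hypothetical placeholder for one imputation run on a chunk of markers."""
    return [(marker, "imputed") for marker in chunk]


def run_split_merge(markers, n_workers):
    """Split the markers, process the chunks concurrently, merge in order."""
    chunks = split_markers(markers, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the order of the input chunks,
        # so the merged output keeps the original marker order.
        results = list(pool.map(impute_chunk, chunks))
    return [row for part in results for row in part]
```

Threads are used here only to keep the sketch self-contained; a real wrapper would more likely launch one external process per chunk.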
Short Abstract: Advances in high throughput sequencing have vastly accelerated genomic study of human disease. Large scale sequencing studies typically produce information on tens of millions of sequence variants. To ensure that the derived genotype data are consistent and accurate, quality control analyses are commonly performed prior to downstream comparative analyses. Most existing variant evaluation tools only provide output in a text file, requiring tedious QC review to detect deviant results in large studies. Here we introduce VariantQC, a visual QC report that can be easily incorporated into analysis pipelines to accelerate variant QC analysis. From a variant calling format (VCF) file, VariantQC generates a set of summary statistics, stratified by contig, sample, and filter type, that are compiled into a visual HTML report complete with interactive tables and plots. Additionally, summary values that deviate from the norm are flagged to enable rapid identification of sample data requiring further review. VariantQC has been successfully used on 213 whole genome sequencing and 671 genotyping-by-sequencing samples to detect subjects with a high number of private single nucleotide variants or with mismatched gender assignment based on X and Y genotype calls.
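As a rough illustration of the kind of stratified summary statistics described in the abstract above, the following sketch counts variant records per contig and per FILTER value from VCF text lines. It is not how VariantQC is implemented; `summarize_vcf` is a hypothetical helper, and only the standard VCF column order (CHROM in column 1, FILTER in column 7) is assumed.

```python
from collections import Counter


def summarize_vcf(lines):
    """Count variant records per contig and per FILTER value.

    lines: an iterable of VCF text lines (header lines start with '#').
    Returns two Counters: (records per contig, records per filter value).
    """
    per_contig, per_filter = Counter(), Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header
        fields = line.rstrip("\n").split("\t")
        per_contig[fields[0]] += 1            # CHROM column
        for flt in fields[6].split(";"):      # FILTER column, ';'-separated
            per_filter[flt] += 1
    return per_contig, per_filter
```

Tables like these are the raw material for an HTML report; flagging values that deviate from the norm would be a further step not shown here.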
Short Abstract: As scientists accumulate more finely grained knowledge about biology, they still struggle with how to leverage this new information in ways that let us build hypotheses and frame alternative explanations. Our lab has built a system that combines many data sources into a coherent biological representation using Open Biological Ontologies and OWL semantics. This allows users to explore biological molecules and the relations that connect them in various processes and pathways. Recent work has focused on incorporating information from UniProt, which contains detailed protein sequence features and variant information, with the functional relations described in Reactome to model proteins’ biochemical reactions and interactions. Integrating entities and relations from these sources into the knowledge base poses significant challenges, including when to recognize existing entities in the knowledge base and when to posit new ones; however, this additional information permits investigation of the processes that mediate modification, trafficking, and localization of proteins. Disruptions in these processes are key factors in many diseases: mislocalized or mismodified proteins can gain or lose function, coerce partners into pathological behavior, and otherwise cause varying degrees of havoc in the cell.
Short Abstract: In this study, we developed a method to predict driver somatic single nucleotide variants (SNVs) that can potentially impact ovarian cancer (OC) development and progression through altering the sequence of disease-specific regulatory elements (REs), such as enhancers and promoters, eventually resulting in perturbation of the expression of target genes. First, we established genome-wide H3K27ac epigenomic profiles, annotating active REs for the different OC histotypes (clear cell, endometrioid, high grade serous and mucinous) using chromatin immunoprecipitation sequencing (ChIP-seq) in 20 fresh frozen primary OC tissue samples (five tumors for each histotype). In parallel, we performed transcriptional profiling using RNA sequencing (RNA-seq). Together, these two datasets enabled us to evaluate both epigenetic landscapes and the transcriptome. We used the RNA-seq data to find putative target genes of cis-REs. Next, we integrated these unique profiles with WGS data from 232 OCs. We tested the significance of the observed number of mutated samples for any given active RE individually, or grouped by putative target gene. After p-value correction, we identified several significantly mutated active REs, including the promoter of POLR3E and the super enhancer overlapping HOXD9, and also the collection of REs associated with HOXD4, HOXD8 and C19orf44.
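The abstract above does not say which test or correction was used. As one possible illustration, the sketch below assumes a binomial model for the number of mutated samples per regulatory element, followed by Benjamini-Hochberg correction. Both choices, and the function names, are assumptions made for this example.

```python
from math import comb


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    k: observed number of mutated samples for one regulatory element.
    n: total number of samples.
    p: assumed background probability that a sample is mutated in the element.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For elements grouped by putative target gene, the same test could be applied to the pooled counts, with the background probability adjusted for the combined element length.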
Short Abstract: Large-scale sequencing projects including the Thousand Genomes Project and the Human Genome Diversity Project incorporate samples from dozens of diverse global populations. Despite these initiatives, and advances in sequencing technologies, several smaller ethnic populations remain vastly underrepresented. This report describes the sequencing and analysis of a single genome obtained from an individual of Serbian origin. We illustrate the influence of read mapping and variant calling pipelines on the concordance of identified single nucleotide and insertion/deletion variants. Ancestry analysis places this individual in close proximity to the Central and Eastern European populations, particularly Croatian and Bulgarian individuals. Admixture analysis confirmed gene flow between Neanderthal and ancestral pan-European populations with similar contributions to the Serbian genome as observed in other European groups. Contrasting the genome against the Genome Aggregation Database (gnomAD) identified tens of thousands of high-quality previously unseen variants in coding and noncoding regions of the genome. The burden of disease-causing and putatively clinically relevant variation was assessed utilizing manually curated genotype-phenotype association databases and variant-effect predictors. We identified several variants that have been previously associated with severe early-onset disease that is not evident in the proband, as well as variants with potential for clinical relevance later in life.
Short Abstract: Psychiatric disorders are a major public health concern. The Disrupted in schizophrenia 1 (DISC1) gene is a genetic risk factor for developing serious mental illnesses including schizophrenia, bipolar disorder and major depression. Cognitive dysfunctions play important roles in these diseases. Molecular studies have shown that DISC1 functions as a scaffold protein in brain functions through a large complex pathway. To investigate the roles of genetic variants in this pathway, we carried out targeted resequencing of 213 DISC1 pathway genes in 654 psychiatric patients and 889 healthy controls. We identified a novel protective association for a common intronic variant in Neurexin 1 (NRXN1) in a combined cohort of cases compared to controls. We observed an overall enrichment of rare disruptive variants in schizophrenia relative to controls. We found an increase in the burden of damaging mutations in DISC1 pathway genes with cognitive ability measures. In addition, structure modelling analysis revealed that the missense mutations could affect protein stability. The findings will elucidate the roles of DISC1 pathway variants in the etiology of mental illnesses and offer targets for developing more precise treatments and medications for psychiatric patients.
Short Abstract: Recent work utilizing allele-specific expression data shows that rare deleterious coding variants are more likely observed on the under-expressed haplotype (Castel, 2018). A model for this is that the penetrance of a deleterious coding variant is decreased on the under-expressed haplotypes. In this work we extend this model to the population level and hypothesize that GWAS and TWAS associations in regulatory regions may be differentiating haplotypes enriched or depleted for deleterious coding variants. We use simulations to understand the selection dynamics and steady-state population-level distributions of deleterious coding variation as a function of regulation and gene length. Simulations demonstrate that de novo deleterious variants are able to drift to a higher frequency when they occur on down-regulated haplotypes. Due to this drift, simulated genes subject to random mutation are enriched for deleterious variation when on down-regulated haplotypes. Using 1000 Genomes, GTEx and CADD scores, we observe an enrichment of deleterious variation on down-regulated haplotypes. We also observe that more down-regulated haplotypes than up-regulated haplotypes contain at least one deleterious variant for any deleteriousness. Since in these genes deleterious enrichment is linked to regulation, these results may explain why some GWAS associations are outside of coding regions.
Short Abstract: To reconcile genomics and precision medicine, a pathway-level understanding of genomic perturbations is crucial. Existing computational methods simply correlate mutational events with clinical outcomes, missing the functional impact of genomic substitutions at the pathway level. There is an urgent need for tools that predict signaling pathway outcomes from multiple genomic alterations in terms of hallmarks of cancer. Here, we introduce a pilot informatics framework to tackle this problem. First, we applied the RDF Sketch tool to reduce the complexity of an integrated disease map of canonical pathways by using a set of disease-specific biomarkers. Then, we extended the network flow method and formulated a constrained st-cut problem to infer canonical pathways perturbed by genomic alterations. The constraints are defined as nodes where network-attacking mutations, inferred by the ReKINect tool, perturb signaling flow. Finally, we stratified the perturbed pathways according to specific cancer hallmarks for further analysis. The proposed framework allowed us to classify ~20% of downstream perturbation subnetworks with corresponding genomic alterations, such as BRAF (V600E, V600K, K601E and V600R) in a melanoma cohort. Our method can provide a highly explanatory view of the functional impact of missense mutations in terms of hallmarks of cancer to tailor therapeutic regimens according to the patient’s unique features.
Short Abstract: Next generation sequencing (NGS) data are increasingly used in genome-wide genetic association studies to detect important disease-causing variants, and for the identification of genetic biomarkers of drug efficacy or adverse reactions. Because NGS data are much larger and richer than traditional genotyping microarray data, statistical tools to analyze them need to be scalable, fast, and handle all types of variants. These tools must also deal with allelic and locus heterogeneity. To address these needs, we have developed the Variant Annotation, Analysis & Search Tool Case-Control Software (VAASTc). VAASTc employs a robust, best-practice burden test statistic; bundles a novel variant scoring algorithm; and provides an empirically calculated p-value for every gene-hit. To validate VAASTc, we created synthetic case-control samples by spiking known disease variants and testing a range of allele frequencies, carrier rates, inheritance models, and allelic heterogeneities. We also benchmarked VAASTc against three commonly used association testing tools: VT, SKAT, and SKAT-O. VAASTc consistently identified known disease genes with genome-wide significance, and with fast compute times. Finally, we replicated the findings of a published case-control analysis for age-related macular degeneration. Collectively, our results demonstrate that VAASTc is fast, scalable, easy to use, and has great power to identify disease genes.
Short Abstract: The accurate characterization of the translational mechanism is crucial for enhancing our understanding of the relationship between genotype and phenotype. In particular, predicting the impact of genetic variants on gene expression will make it possible to optimize specific pathways and functions for engineering new biological systems. In this work we present PGExpress, a new regression method for predicting the log2-fold-change of the translation efficiency of an mRNA sequence in E. coli. The PGExpress algorithm takes as input 12 features corresponding to RNA folding and anti-Shine-Dalgarno hybridization free energies. The method was trained on a set of 1,772 sequence variants of 137 essential E. coli genes. For each gene, we considered 13 sequence variants of the first 33 nucleotides encoding the same amino acids, followed by the superfolder GFP. Our gradient-boosting-based tool (PGExpress) was trained using a 10-fold gene-based cross-validation procedure on the WT-High dataset. In this test PGExpress achieved a correlation coefficient of 0.57, with a Root Mean Square Error (RMSE) of 1.4. When the regression task is cast as a classification problem, PGExpress reaches an overall accuracy of 0.73, a Matthews correlation coefficient of 0.47 and an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.80.
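For readers unfamiliar with the evaluation metrics quoted in the abstract above, here is a minimal sketch of how the Pearson correlation, RMSE, accuracy and Matthews correlation coefficient are computed. The function names are illustrative and the code is unrelated to the PGExpress implementation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions from a 2x2 confusion matrix."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from a 2x2 confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal total is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom
```

Unlike accuracy, the MCC stays informative when the two classes are imbalanced, which may explain why both are reported.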
Short Abstract: The Critical Assessment of Genome Interpretation (CAGI, \'kā-jē\) is a community experiment to objectively assess computational methods for predicting the phenotypic impacts of genomic variation. CAGI participants are provided genetic variants and make blind predictions of the resulting phenotype. Independent assessors evaluate the predictions by comparing them with experimental and clinical data. There have been notable discoveries throughout the CAGI experiments: Independent assessment has found that top missense prediction methods are highly statistically significant, but individual variant accuracy is limited. Missense methods tend to correlate better with each other than with experiment. Bespoke approaches often enhance performance. Interpretation of non-coding variants shows promise but is not at the level of missense. In challenges using clinical data, predictors identified causal variants overlooked in the initial clinical pipeline analysis. The results suggest that running multiple uncalibrated methods and considering their consensus may result in undue confidence in a pathogenic assignment, so we advise against this procedure. CAGI results are increasingly used to inform clinical use of computational tools, including the ClinGen working group revisions of the ACMG guidelines for appropriately considering evidence from computational approaches. The CAGI5 conference will be held a few days before VarI-COSI 2018, and the newly-released results will be presented.
Short Abstract: Over the last three decades, it has become increasingly evident that genetic background is a key determinant of many types of human diseases. Identifying the genes and mutations that underlie these human disease phenotypes using next-generation sequencing (NGS) is important for multiple purposes, including: (1) understanding the disease mechanism via the functions of the genes and the broader background of their biological pathways; (2) pre- and post-natal risk assessment; and (3) precision medicine translational advances based upon a patient’s disease and unique genetic background. To this end, we annotate the full genetic variation data from the Genome Aggregation Database (gnomAD), together with all known germline disease-causing mutations, using the largest to date array of biological annotations (>2,000 features). We then apply state of the art unsupervised deep learning techniques (stacked autoencoders) to this trove of data to understand the feature space and identify structures therein, which allow discriminating pathogenic from functional non-pathogenic and benign variation. We show that our approach helps to identify clustering patterns that discriminate non-pathogenic from pathogenic variants and correlate well with the broad categories of diseases in which the latter are involved.
Short Abstract: Only a small proportion of the heritable risk for colorectal cancer (CRC) can currently be attributed to mutations within known CRC-associated genes. To address this problem we have undertaken a large-scale study of individuals (N=1130) with a family history of CRC or early onset CRC, using whole genome, exome and targeted DNA sequencing, yielding new likely pathogenic variants in FAN1, NTHL1, POLE and POLD1. In such large studies, variant prioritization for further validation is critical to efficiently direct precious research resources. In this work we will demonstrate our bioinformatics analysis workflow for shortlisting variants for further study in the context of predicted CRC variants, which is likely to be applicable to other rare, highly penetrant, inherited diseases. In the process of refining our workflow we explored the veracity of functional impact prediction tools in order to optimize for sensitivity and specificity of classification. This has led to new insights into the limited degree of concordance and generalizability of classifiers and their dependence on training datasets. We will also describe our efforts to derive new assay-driven test data sets which may be valuable for future benchmarking experiments, and provide more robust assessments of predictor accuracy.
Short Abstract: With the rapid progress of cancer genome studies, many missense variants in populations of cells at different stages of cancer have been identified. However, it is challenging to understand the roles of these cancer-related variants. Structural information of the protein surfaces can provide useful information for assessing the biochemical effects of these variants. We have mapped 469,544 somatic missense mutations from the Catalogue of Somatic Mutations In Cancer (COSMIC) to 32,764 human protein 3D structures in the Protein Data Bank (PDB). Our results show that a large portion of these missense mutations is located on protein surface pockets. We report detailed analysis of several oncoproteins including HRAS, EGFR and PIK3CA. By incorporating additional geometric features of the protein surfaces with the residue annotations from literature, we assess the importance of each variant. In addition, we developed a method to predict the likelihood of candidate variants that have not yet been collected in the cancer genomics but may be highly relevant to cancer. Furthermore, we discuss our findings on higher-order cooperative units of cancer variants in these oncoproteins.
Short Abstract: It is challenging to identify the mutation responsible for a patient’s illness among the thousands of genetic variants in high-throughput sequencing data. We present a software package, Divine, designed to prioritize mutated genes underlying rare hereditary disorders. Given Human Phenotype Ontology (HPO) IDs describing the patient's clinical features, we calculate the semantic similarity to each previously described disease. Divine annotates a VCF file with nearly 30 orthogonal public databases to comply with ACMG variant interpretation guidelines. The phenotypic and variant scores are combined into a single value in a Bayesian framework. Then, a heat diffusion kernel propagates the value in a gene interaction network [STRING] connecting a total of 19,035 genes. For evaluation, we gathered 26 cases studied between 2012 and 2016, covering a broad spectrum of rare Mendelian diseases, with an average of 10.8 HPO terms. Both Divine (AUC score: 0.959) and Exomiser_v10 (0.954) outperformed the other methods: Phen-Gen_v1 (0.482), eXtasy (0.5), and PhenIx (0.728). Divine also supports a discovery mode. In a recent study, a patient with severe and complex hemostatic abnormalities was found to have a frameshift in C3AR1. Divine alone detected the mutation, ranking it 4th from the top. Divine is freely available at https://github.com/hwanglab/divine/.
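To make the heat diffusion step in the abstract above concrete, the following sketch propagates per-gene scores over a small undirected network. It approximates the heat kernel exp(-tL), where L is the graph Laplacian, with explicit Euler steps. It is not Divine's code: the function name, the diffusion time `t`, the step count and the toy network are all assumptions for illustration.

```python
def heat_diffusion(edges, scores, t=0.5, steps=100):
    """Approximate exp(-t * L) applied to scores, with L = D - A.

    edges:  iterable of (gene_u, gene_v) pairs of an undirected network.
    scores: dict mapping every gene in the network to its initial score.
    t:      diffusion time (assumed value; larger t spreads scores further).
    steps:  number of Euler steps; t / steps must be small for stability.
    """
    neighbors = {gene: set() for gene in scores}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    heat = dict(scores)
    dt = t / steps
    for _ in range(steps):
        # (L f)(u) = sum over neighbours v of (f(u) - f(v)).
        heat = {
            gene: heat[gene] - dt * sum(heat[gene] - heat[nb] for nb in nbs)
            for gene, nbs in neighbors.items()
        }
    return heat
```

On a three-gene chain A-B-C with all the initial score on A, the diffused score decreases with distance from A while the total is conserved, which is the behaviour that lets a strong candidate gene raise the rank of its interaction partners.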
Short Abstract: Motivation: Accurately mapping and annotating genomic locations on 3D protein structures is a key step in structure-based analysis of genomic variants detected by recent large-scale sequencing efforts. There are several mapping resources currently available, but none of them provides a web API (Application Programming Interface) that supports programmatic access. Results: We present G2S, a real-time web API that provides automated mapping of genomic variants on 3D protein structures. G2S can align genomic locations of variants, protein locations, or protein sequences to protein structures and retrieve the mapped residues from structures. The G2S API follows a REST-inspired design and can be used by various clients such as web browsers, command terminals, programming languages and other bioinformatics tools for bringing 3D structures into genomic variant analysis. Availability: The webserver and source code are freely available at https://g2s.genomenexus.org