Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Accepted Posters

If you need assistance please contact submissions@iscb.org and provide your poster title or submission ID.


Track: VarI

Session B-261: Genetic variation affecting exon skipping contributes to hippocampal atrophy in Alzheimer’s disease
COSI: VarI
  • Younghee Lee, University of Utah, United States
  • Dokyoon Kim, Geisinger Health System, United States
  • Shannon Risacher, Indiana University School of Medicine, United States
  • Andrew Saykin, Indiana University School of Medicine, United States
  • Kwangsik Nho, Indiana University School of Medicine, United States

Short Abstract: Background: Genetic variation in cis-regulatory elements related to splicing machinery and splicing regulatory elements (SREs) results in exon skipping and undesired protein products, and thus forms the genetic basis of 15-50% of heritable human diseases. However, variation affecting alternative splicing (AS) is understudied in Alzheimer’s disease (AD). Therefore, we identified SNPs in cis-acting SREs and used multimodal neuroimaging to assess associations with AD-related atrophy and other endophenotypes. Methods: We developed a splicing decision model to identify actionable loci among common SNPs for gene regulation. The splicing decision model identified SNPs affecting exon skipping by analyzing sequence-driven AS models and by scanning the genome for regions with putative SRE motifs. We used non-Hispanic Caucasians (N=1565) with HRC-based imputed GWAS and neuroimaging data (MRI and PET scans) from the Alzheimer’s Disease Neuroimaging Initiative. We examined exonic SNPs in SREs as a mechanism to understand how genetic variants contribute to AD. Results: We identified 17,088 exonic SNPs affecting exon skipping (MAF >1%), and two SNPs were associated with hippocampal volume after controlling for multiple testing (corrected p<0.05). One SNP (rs157581) is within TOMM40 and another missense SNP (rs1140317) is within HLA-DQB1. Further analysis revealed that rs1140317 was significantly associated with brain amyloid-β deposition (measured by [18F] Florbetapir PET and CSF). Conclusions: We identified exonic SNPs in genes that may regulate AS and thereby contribute to AD pathology, including HLA-DQB1, a key immune gene, and TOMM40, an AD candidate gene near APOE. SREs may hold potential as novel therapeutic targets for AD.

Session B-263: Analyzing intratumor heterogeneity and clonal evolutionary history using population genetics
COSI: VarI
  • Yutaro Konta, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Japan
  • Hisanori Kiryu, Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo, Japan

Short Abstract: A tumor arises from a single founder cell, whose subsequent accumulation of advantageous mutations causes clonal expansion. In the course of clonal expansion, a driver mutation gives rise to another type of clone, which is called a subtype. Thus, a tumor consists of a heterogeneous mixture of various subtypes. In current cancer therapy, it is important to identify the subtype composition and the growth rates of these subtypes. The emergence of next-generation sequencers has made it possible to analyze whole cancer genomes at single-nucleotide resolution. Furthermore, using the latest single-cell sequencing, we can investigate the genotype of each cell to identify all subtypes in a tumor. However, sequencing a bulk tumor is still common because of the technical difficulties and high cost of single-cell sequencing. Thus, our problem is to identify the subtype composition and the characteristics of the subtypes from bulk sequencing reads. To solve this problem, several methods such as PyClone and PhyloSub have been proposed in previous work. However, they cannot estimate the growth rate of each subtype. Here we provide a Bayesian statistical model to infer the growth rate and abundance ratio of each subtype. We modelled the allele frequency drift of the passenger mutations within each subtype with a diffusion equation based on the Wright-Fisher process. Then we integrated this population genetics model with mixture modelling to infer the birth time and abundance ratio using the expectation-maximization algorithm. Using simulated NGS reads, we could estimate the birth time and abundance ratio of each subtype.
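
Illustrative sketch (not part of the poster): the abstract models allele-frequency drift with a diffusion equation derived from the Wright-Fisher process. The discrete process behind that approximation can be simulated in a few lines. The function name, the fixed population size and all parameter values below are assumptions made for illustration, and this is not the authors' Bayesian model.

```python
import random


def wright_fisher_drift(n_cells, start_freq, generations, seed=0):
    """Simulate neutral drift of one passenger mutation in a fixed-size subtype.

    Each generation, every one of the n_cells descendants carries the
    mutation with probability equal to the current frequency (binomial
    resampling), which is the discrete Wright-Fisher process.
    """
    rng = random.Random(seed)
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        carriers = sum(1 for _ in range(n_cells) if rng.random() < freq)
        freq = carriers / n_cells
        trajectory.append(freq)
    return trajectory


# Hypothetical parameters chosen only for illustration.
trajectory = wright_fisher_drift(n_cells=200, start_freq=0.3, generations=50)
print(f"final allele frequency: {trajectory[-1]:.3f}")
```

Frequencies 0 and 1 are absorbing states: once a mutation is lost or fixed in the simulated subtype, drift alone cannot change it.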

Session B-265: Splice-aware multiple sequence alignment improves protein alignment quality and supports alternative reading frame detection
COSI: VarI
  • Alex Nord, University of Montana, United States
  • Peter Hornbeck, Cell Signaling Technology, United States
  • Travis Wheeler, University of Montana, United States

Short Abstract: We present the Mirage software package, designed to improve the accuracy of multiple protein sequence alignment by accounting for introns in the encoding DNA. Mirage initially maps each protein to its encoding genomic DNA, allowing for splice sites. For a collection of isoforms from a single species, amino acids mapping to the same codon are aligned into the same column of species-specific alignment. These per-species alignments are merged as in progressive alignment. As a result, Mirage produces more accurate alignments of intron-bearing protein isoforms. In the within-species transitive alignment step, all letters aligned in a column should in principle be identical, as they are encoded by the same codon. We have identified a surprising number of cases in which this does not hold true, because an alternate splice boundary leads to an alternative reading frame (ARF) within the same exon. We investigate the bioinformatic support for the legitimacy of these ARF protein variants.

Session B-267: Detecting selective sweeps in genetic variants in model and non-model organisms
COSI: VarI
  • Brandon Pickett, Brigham Young University, United States
  • Spencer Smith, Brigham Young University, United States
  • Perry Ridge, Brigham Young University, United States

Short Abstract: Natural selection drives changes in allele frequency in sexually reproducing populations, causing adaptive alleles to increase in frequency in the population. Due to linkage disequilibrium, variants near the mutation will hitchhike to the same frequency. This frequency change, referred to as a hard selective sweep, is detectable in the genomes within the population until recombination introduces additional variation to the genomic region over hundreds of generations. Other selective sweeps (e.g., soft sweeps) produce detectable signatures that are more difficult to identify. Several population genetics statistics capture such signals with varying efficacy. Most statistics are based on allele frequency (e.g., Wright's F, Tajima's D) or haplotype homozygosity (e.g., extended haplotype homozygosity (EHH), integrated haplotype score (iHS)). While each statistic provides meaningful information, most are unable to correctly identify selective sweeps under certain conditions. Composite methods overcome this by combining values for a single statistic on multiple loci (e.g., composite likelihood ratio (CLR)) or combining multiple statistics on a single locus (e.g., composite of multiple signals (CMS)). These statistics detect regions constrained by recent positive selection, but are difficult to calculate because existing software is complicated for non-expert users and is developed for very specific cases/organisms. Accordingly, we present SelecT, a user-friendly program for detecting selective sweeps using CMS and other statistics. SelecT is easily configured or extended to create new composite scores and can be used to analyze populations of model or non-model organisms.
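
Illustrative sketch (not part of the poster): of the allele-frequency statistics named above, Tajima's D is compact enough to show in full. The code below follows the standard published formula and takes aligned haplotypes as equal-length strings. It is not taken from SelecT; the function name and the toy haplotypes are assumptions.

```python
from collections import Counter
from math import sqrt


def tajimas_d(haplotypes):
    """Compute Tajima's D for a list of aligned, equal-length haplotypes.

    Returns None when there are no segregating sites.
    """
    n = len(haplotypes)
    if n < 4:
        raise ValueError("the variance term is zero for fewer than 4 sequences")
    n_pairs = n * (n - 1) / 2
    seg_sites = 0
    pi = 0.0  # mean number of pairwise differences
    for column in zip(*haplotypes):
        counts = Counter(column)
        if len(counts) < 2:
            continue  # monomorphic site
        seg_sites += 1
        same_pairs = sum(c * (c - 1) / 2 for c in counts.values())
        pi += (n_pairs - same_pairs) / n_pairs
    if seg_sites == 0:
        return None
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (pi - seg_sites / a1) / sqrt(variance)


# Toy data: an excess of singletons gives a negative D,
# an excess of intermediate-frequency variants gives a positive D.
print(tajimas_d(["AAAA", "AAAT", "AATA", "ATAA"]))
print(tajimas_d(["AA", "AA", "TT", "TT"]))
```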

Session B-269: Evaluation of mutational signatures inferred from a single sample tumor profile
COSI: VarI
  • Marko Zecevic, Seven Bridges, Serbia
  • Mladen Lazarevic, Seven Bridges, Serbia

Short Abstract: Analysis of somatic Single Nucleotide Variants (SNVs) from a single tumor sample may reveal a number of mutational processes triggered by different exposures. The Wellcome Trust Sanger Center identified a set of 30 mutational signatures, some of which show a very strong correlation with a specific mutagen exposure or tumor type. Unfortunately, the number of somatic mutations in the observed sample is often not high enough to reliably reconstruct the original tumor profile as a linear combination of these signatures. We tested the ability of the freely available deconstructSigs algorithm to deconstruct simulated data sets into signature weights. This procedure enables the identification of the number of mutations per sample which are needed for optimal performance as well as the error rate of the algorithm. Validation was done on 433 colon adenocarcinoma and 470 skin melanoma samples from The Cancer Genome Atlas (TCGA). The proportion of samples containing a specific signature and its average weight reflected the proposed aetiology, but to see this in the more complex tumors it was necessary to first filter out the samples with low SNV count (200 proved to be a good threshold). Also, we have observed slight differences in signature decomposition due to choice of different variant caller, with the signature weights inferred from somatic variants called using SomaticSniper looking a bit more in line with the conclusions made by the signatures authors.
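
Illustrative sketch (not part of the poster): reconstructing a tumor profile "as a linear combination of these signatures" amounts to a non-negative least-squares fit. The code below uses a simple multiplicative-update rule on toy data. It is not the deconstructSigs algorithm, and the four-channel signatures are invented for illustration (published signatures are defined over 96 trinucleotide contexts).

```python
def fit_signature_weights(profile, signatures, iterations=500):
    """Fit non-negative weights w so that sum_k w[k] * signatures[k] ~ profile.

    Uses multiplicative updates, which keep every weight non-negative as
    long as the inputs are non-negative.
    """
    k = len(signatures)
    weights = [1.0 / k] * k
    eps = 1e-12  # guards against division by zero
    # Precompute S^T x and S^T S once; they do not change between iterations.
    stx = [sum(s * x for s, x in zip(sig, profile)) for sig in signatures]
    sts = [[sum(a * b for a, b in zip(si, sj)) for sj in signatures]
           for si in signatures]
    for _ in range(iterations):
        weights = [
            w * stx[i] / (sum(sts[i][j] * weights[j] for j in range(k)) + eps)
            for i, w in enumerate(weights)
        ]
    return weights


# Hypothetical 4-channel signatures and a profile mixed 70% / 30%.
sig_a = [0.7, 0.1, 0.1, 0.1]
sig_b = [0.1, 0.1, 0.1, 0.7]
profile = [0.7 * a + 0.3 * b for a, b in zip(sig_a, sig_b)]
weights = fit_signature_weights(profile, [sig_a, sig_b])
print([round(w, 3) for w in weights])
```

With few mutations the observed profile is a noisy sample of the true mixture, which is why the abstract reports needing a minimum SNV count before the fitted weights become reliable.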

Session B-271: Toward automatic genome-based hierarchical classification of viruses: advancement in calculation of clustering cost and rank thresholds
COSI: VarI
  • Igor Sidorov, Leiden University Medical Center, Netherlands
  • Andrey Leontovich, Inst. of Phys. Chem. Biology, Moscow State University, Russia
  • Anastasia Gulyaeva, Leiden University Medical Center, Netherlands
  • Dmitry Samborskiy, Inst. of Phys. Chem. Biology, Moscow State University, Russia
  • Alexander Gorbalenya, Leiden University Medical Center, Netherlands

Short Abstract: Due to the advent of high-throughput genome sequencing and metagenomics, viruses are increasingly characterized by genome sequence only. Analysis of genome diversity informs structure-function and evolutionary studies, and was recently accepted as a sole basis for developing virus taxonomy. Our group has introduced a quantitative comparative sequence procedure dubbed DivErsity pArtitioning by hieRarchical Clustering (DEmARC) to delineate ranks and clusters in a monophyletic group of viruses (1) that could be used to devise taxonomy objectively (2). DEmARC partitions pair-wise evolutionary distances (PED) obtained for multiple sequence alignments of conserved proteins using a clustering cost (CC) function that is calculated for each cluster obtained by the clustering procedure. Clusters corresponding to local minima in both the CC function and the distribution of pair-wise distances were used to delineate ranks of classification. In this report we describe advances to the CC calculation that improve both confidence in the selected PED thresholds for demarcation and the speed of classification. We introduce cluster-specific CC profiles (csCCP) that describe the change of CC for a specific cluster over the entire PED range of the analyzed dataset, with each profile having a global minimum that may also be located outside of the corresponding cluster PED interval. Only some of these minima remain recognized as local ones in the resulting CC function upon comparative analysis of the csCCPs of a dataset. These minima facilitate accurate identification of rank thresholds, a critical step towards automation of virus taxonomy advancement. References: 1. Lauber & Gorbalenya. J. Virol., 2012, 86(7):3890-3904. 2. Lauber & Gorbalenya. J. Virol., 2012, 86(7):3905-3915.

Session B-273: Exploring Variations on Transcription Factor Binding Sites in Taiwanese Population
COSI: VarI
  • Mu Yang, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan
  • Yi-An Tung, Genome and Systems Biology Degree Program, National Taiwan University and Academia Sinica, Taiwan
  • Yu-Chuan Chang, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taiwan
  • Dung-Chi Wu, Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taiwan
  • Yen-Jen Oyang, Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University and Genome and Systems Biology Degree Program, National Taiwan University and Academia Sinica, Taiwan
  • Chien-Yu Chen, Genome and Systems Biology Degree Program, National Taiwan University and Academia Sinica and Department of Bio-Industrial Mechatronics Engineering, National Taiwan University, Taiwan

Short Abstract: Regulation of gene expression can be achieved by multiple means, among which the major one is through proteins called transcription factors (TFs). TFs recognize and bind to specific segments of DNA sequence, called transcription factor binding sites (TFBSs), in order to initiate transcription. Interactions between TFs and TFBSs activate or suppress the corresponding genes. Given the prominent role of TFBSs in gene regulation, it is conceivable that variations in TFBSs might have significant effects. While variations are present in many forms, we focus on single nucleotide polymorphisms (SNPs) in this study. Many studies have shown that SNPs in TFBSs may alter the binding affinity and change cell behavior. The binding affinities between TFs and TFBSs might be raised or lowered owing to the mutations. In some cases, new TFBSs may even emerge or existing TFBSs may disappear. This study aims at exploring the SNPs present in TFBSs in 11,493 Taiwanese individuals. With the SNP array data acquired from the Taiwan Biobank, we examined the SNPs falling in known TFBSs collected from the TRANSFAC database. Among the 17,441 TFBSs we checked, only 92 involved SNPs. Since the frequency of allelic variations may vary across populations, we compared the allele frequencies (AFs) of certain SNPs in Taiwanese individuals with those in the 1000 Genomes Project. It is of particular interest to examine the potential influence of the variations with AFs that are significantly higher or lower than those in other populations.
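
Illustrative sketch (not part of the poster): finding "SNPs falling on known TFBSs" is an interval-overlap query. The code below assumes 1-based inclusive coordinates. The record layouts, identifiers and positions are hypothetical and do not reflect the TRANSFAC or Taiwan Biobank formats.

```python
from collections import defaultdict


def snps_in_binding_sites(binding_sites, snps):
    """Return (snp_id, tf_name) pairs for SNPs located inside a binding site.

    binding_sites: iterable of (chrom, start, end, tf_name), 1-based inclusive.
    snps: iterable of (snp_id, chrom, position).
    """
    sites_by_chrom = defaultdict(list)
    for chrom, start, end, tf_name in binding_sites:
        sites_by_chrom[chrom].append((start, end, tf_name))
    hits = []
    for snp_id, chrom, position in snps:
        for start, end, tf_name in sites_by_chrom.get(chrom, []):
            if start <= position <= end:
                hits.append((snp_id, tf_name))
    return hits


# Hypothetical sites and SNPs, for illustration only.
sites = [("chr1", 100, 112, "TF_A"), ("chr1", 110, 120, "TF_B"),
         ("chr2", 500, 510, "TF_C")]
snps = [("snp1", "chr1", 111), ("snp2", "chr1", 300), ("snp3", "chr2", 500)]
hits = snps_in_binding_sites(sites, snps)
print(hits)
```

A linear scan per chromosome is adequate at the scale reported in the abstract; an interval tree or a sorted sweep would be the usual choice for much larger site collections.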

Session B-275: Protein-centric Exome Association (PREXA) analysis enables discovery of biologically sensible coding variants associated with phenotypes
COSI: VarI
  • Ginny Xiaohe Li, National University of Singapore, Singapore
  • Damian Fermin, University of Michigan, United States
  • Christine Vogel, New York University, United States
  • Hyungwon Choi, National University of Singapore, Singapore

Short Abstract: Protein coding variants are presumably more impactful on phenotypes than non-coding variants. However, it remains challenging to utilize these variants detected from next generation sequencing experiments in a population-based association analysis for various reasons. Moreover, although each coding variant likely contributes to phenotypic variation by altering amino acids around key functional sites, the protein-level connections have rarely been incorporated in the exome-wide association analysis. To address this, we propose a novel modeling framework called PRotein-centric EXome Association (PREXA) analysis. In PREXA, we detect exon variants against the reference genome and map them to adjacent post-translational modification (PTM) sites and predicted domains. We then counted the number of variants mapped to each amino acid position and subsequently used the count data for test of association. We also prioritized PTM sites and domains using a model-based scoring approach in terms of functional importance and subsequently used high-confidence sites for the mapping. Using 5 different cancer data sets from The Cancer Genome Atlas, we compared the PREXA models with a regular association model devoid of protein-level mapping. The mapping to protein units allowed more coding variants to be eligible for the association analysis. More importantly, the PREXA models produced more interpretable association models with the risk of mortality, directly pinpointing the common and unique amino acid positions across the cancer types. For example, in the TCGA breast cancer cohort, we mapped 253984 mutations from 1059 patients to 2598 functional PTM sites (1190 proteins). We detected 62 PTM sites to be hotspots associated with exome variation in PREXA, whereas the model without protein mapping had <10 loci contributing to the risk of mortality.

Session B-277: Combinatorial Association Rules of SNPs with Coronary Artery Disease in a Lebanese Population
COSI: VarI
  • Georges Khazen, Lebanese American University, Lebanon
  • Liza Darrous, Lebanese American University, Lebanon

Short Abstract: The advancement of genome-wide association studies (GWAS) has led to the identification of different susceptibility loci for many diseases, including Coronary Artery Disease (CAD). Traditional methods focus on the association of individual single-nucleotide polymorphisms (SNPs) with the disease phenotype. Unfortunately, these methods fail to reflect the genetic intricacy underlying the cause of many complex diseases because they disregard phenomena like epistasis. Although some combinatorial techniques have been used in SNP association studies, they are mainly based on logistic regression or network models and do not reveal much about the rules that govern the occurrence of diseases. We propose here a new combinatorial SNP association rule method inspired by logic gates. Two logic gates, AND and NOT-AND (a NOT followed by an AND), were applied to 550,378 SNPs from a Lebanese CAD dataset consisting of 735 cases and 98 controls. The gates’ outputs were then tested for association with CAD and ranked based on their adjusted p-value significance level (1e-7), fold change and occurrence frequency. The AND and NOT-AND gates resulted in 61 and 2537 significant SNP-pairs, respectively, compared to only 22 using the traditional single SNP association test. Additionally, all significant pairs were mapped to 60 genes known to be associated with Cardiovascular Disease, 3 of which were only identified using the single SNP association. Moreover, out of the 2598 significant SNP-pairs, 1839 pairs (70.78%) had a fold change ratio larger than or equal to 1.5 between the cases and controls, and 2413 pairs (92.87%) were found on the Y chromosome.
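
Illustrative sketch (not part of the poster): the two gates described above can be applied to a pair of SNPs once each SNP is encoded as a boolean per individual. The encoding (True when an individual carries the allele of interest), the function names and the toy data are assumptions. The abstract does not state which association test was used, so only the fold change is computed here.

```python
def and_gate(a, b):
    """True when both SNPs are present."""
    return a and b


def not_and_gate(a, b):
    """NOT followed by AND: True when the first SNP is absent and the second present."""
    return (not a) and b


def gate_frequency(gate, snp_a, snp_b):
    """Fraction of individuals for whom the gate output is True."""
    outputs = [gate(a, b) for a, b in zip(snp_a, snp_b)]
    return sum(outputs) / len(outputs)


def fold_change(gate, cases, controls):
    """Ratio of gate-output frequency in cases to that in controls."""
    case_freq = gate_frequency(gate, *cases)
    control_freq = gate_frequency(gate, *controls)
    if control_freq == 0:
        return float("inf") if case_freq > 0 else float("nan")
    return case_freq / control_freq


# Hypothetical carrier status for two SNPs in four cases and four controls.
cases = ([True, True, True, False], [True, True, False, True])
controls = ([True, False, False, False], [True, True, False, False])
print(fold_change(and_gate, cases, controls))
print(fold_change(not_and_gate, cases, controls))
```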

Session B-279: DEOGEN2: prediction and interactive visualisation of Single Amino Acid Variant deleteriousness in human proteins
COSI: VarI
  • Daniele Raimondi, Interuniversity Institute of Bioinformatics Brussels, Belgium
  • Ibrahim Tanyalcin, Interuniversity Institute of Bioinformatics Brussels, Belgium
  • Julien Ferté, Interuniversity Institute of Bioinformatics Brussels, Belgium
  • Andrea Gazzo, (IB)² - Interuniversity Institute of Bioinformatics in Brussels, Belgium
  • Gabriele Orlando, VUB, Belgium
  • Tom Lenaerts, Universite Libre de Bruxelles, Belgium
  • Marianne Rooman, Université Libre de Bruxelles, Belgium
  • Wim Vranken, Vrije Universiteit Brussel, Belgium

Short Abstract: High-throughput sequencing methods are generating enormous amounts of genomic data, giving unprecedented insights into human genetic variation and its relation to disease. An individual human genome contains millions of Single Nucleotide Variants: to discriminate the deleterious from the benign ones, a variety of methods have been developed that predict whether a protein-coding variant likely affects the carrier individual’s health. We present such a method, DEOGEN2, which incorporates heterogeneous information about the molecular effects of the variants, the domains involved, the relevance of the gene and the interactions in which it participates. This extensive contextual information is non-linearly mapped into one single deleteriousness score for each variant. Since for the non-expert user it is sometimes still difficult to assess what this score means, how it relates to the encoded protein, and where it originates from, we developed an interactive online framework (http://deogen2.mutaframe.com/) to better present the DEOGEN2 deleteriousness predictions of all possible variants in all human proteins. The prediction is visualised so both expert and non-expert users can gain insights into the meaning, protein context and origins of each prediction.

Session B-281: Protein tools for interpreting the functional effects of genetic variants
COSI: VarI
  • Andrew Nightingale, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom
  • UniProt Consortium, EMBL-EBI, SIB, PIR, United Kingdom

Short Abstract: Genomic variants may be deleterious, and many of the known severe disease- or phenotype-causing mutations are located in protein-coding regions of the genome; therefore, correctly interpreting variation involves knowing how a protein functions. Combining sequence, structural and functional information with genomic variants is critical for deciphering the effect of a variant and a variant’s influence on a human disease. However, this information is not readily available and not easy to integrate into independent tools or workflows. UniProt provides comprehensive protein information, including reviewed variants with functional effects from the scientific literature, and, in collaboration with genomic resources and Ensembl, UniProt has imported variants publicly available from the 1000 Genomes project, COSMIC, ExAC and ESP. To facilitate the interpretation of these and user-generated genomic variant data, UniProt has developed tools to access functional annotations for specific sequence positions where the variant occurs, indicating the possible mechanisms and effects of a variant. The UniProt Proteins API is a REST interface for programmatic access to protein functional annotations, such as active and metal binding sites, variants, etc. For example, it provides users with access to protein annotations via specific genes and/or genomic positions for integration into their genomic analysis workflows. Also, protein annotations and variants can now be graphically represented in web browsers with UniProt's BioJS component ProtVista, which users can integrate with their own data and into their own web resources. These bioinformatic tools enable biomedical researchers to investigate how genomic alterations can contribute to modifications in the translated protein and evaluate how these can result in a disease and/or phenotype.

Session B-283: Understanding mutational effects in digenic diseases
COSI: VarI
  • Andrea Gazzo, (IB)² - Interuniversity Institute of Bioinformatics in Brussels, Belgium
  • Daniele Raimondi, Interuniversity Institute of Bioinformatics Brussels, Belgium
  • Dorien Daneels, Center for Medical Genetics, Reproduction and Genetics, Reproduction Genetics and Regenerative Medicine, Vrije Universiteit Brussel, UZ Brussel, Brussel, Belgium
  • Yves Moreau, ESAT-STADIUS, KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium
  • Guillaume Smits, HUDERF - IB2 - ULB, Belgium
  • Sonia Van Dooren, Vrije Universiteit Brussel, Universitair Ziekenhuis (UZ Brussel), Belgium
  • Tom Lenaerts, Universite Libre de Bruxelles, Belgium

Short Abstract: To further our understanding of the complexity and genetic heterogeneity of rare diseases, it has become essential to shed light on how combinations of variants in different genes are responsible for a disease phenotype. With the development of DIDA (DIgenic diseases DAtabase), it has become possible to evaluate how digenic combinations differ in terms of the phenotypes they produce. All instances in this resource were assigned to two classes of digenic effects (DE), annotated as true digenic and composite classes. Whereas in the true digenic class variants in both genes are required for developing the disease, in the composite class a variant in one gene is sufficient to produce the phenotype, but an additional variant in a second gene impacts the disease phenotype or alters the age of onset. We hypothesized that biological properties linked to digenic combinations can be used to differentiate between DE classes, for example by exploiting the impact of the variants, the allelic state of the genes involved and their ability to tolerate loss-of-function mutations. To examine this hypothesis, we constructed a classification model that employs features consisting of different variant-, gene- and pathway-related characteristics, and obtained quantitatively relevant predictions. Moreover, we show via the analysis of three digenic disorders that a DE decision profile, extracted from the predictive model, can explain why an instance was assigned to either of the two classes. Together, our results show that digenic disease data generates novel insights, providing a glimpse into the oligogenic realm.

Session B-285: A Comparative Analysis of Splicing Quantitative Trait Locus in Arabidopsis thaliana
COSI: VarI
  • Wonseok Yoo, Department of Bioinformatics and Life Science, Soongsil University, Seoul 06978, South Korea
  • Sangsoo Kim, Department of Bioinformatics and Life Science, Soongsil University, Seoul 06978, South Korea

Short Abstract: A splicing quantitative trait locus (sQTL) is a sequence variant that regulates mRNA alternative splicing, which eventually dictates protein structure and thus phenotypic variation. It is believed that sQTLs are evolutionarily selected owing to differences in their adaptability to different natural environments, such as weather and other geographical factors. Arabidopsis thaliana is one of the most genetically studied plants, and a number of ecotypes have been collected from various geographical locations around the world. Previously, we published an sQTL analysis of 141 Arabidopsis thaliana samples using IVAS, an R/Bioconductor package, in which 1,694 SNPs were identified as sQTLs. In this study, we expanded the dataset to 666 samples by downloading all the available SNP genotypes and the mature RNA sequencing data. Work is in progress to analyze this expanded dataset using the established procedure. This study aims to show that sQTLs discovered using the above method are relevant to the natural environment and that sQTLs are involved in ecotype evolution. As further research, a comparative analysis of splicing patterns will be conducted for each environment. Through these findings, we can suggest the course of evolution and the migration pattern of Arabidopsis thaliana.

Session B-287: Identification of heteroplasmic variants in mitochondrial DNA from human lung and blood cancer
COSI: VarI
  • Kyeongsu Ha, Department of Bioinformatics and Life Science, Soongsil University, Seoul, South Korea
  • Sangsoo Kim, Department of Bioinformatics and Life Science, Soongsil University, Seoul, South Korea

Short Abstract: Mitochondria play an important role in the regulation of cellular functions, including energy metabolism and apoptosis. Unlike nuclear DNA, mitochondrial DNA is present in numerous copies in a single cell. Heteroplasmy is the coexistence of different mitochondrial DNA types within an individual. It has been reported that heteroplasmic mutations cause mitochondrial dysfunction, which affects many diseases including neurogenic diseases and cancer. However, the cancer-associated mutations of mitochondrial DNA have not yet been clearly studied. In this study, we analyzed mitochondrial DNA of normal and tumor tissues in 30 patients with lung cancer (squamous cell carcinoma) and 11 patients with blood cancer (acute myeloid leukaemia). We compared tumor heteroplasmic variants with the variant allele frequencies of normal tissues and found tumor-specific heteroplasmic variants. We identified 93 and 54 tumor-specific heteroplasmic variants in lung cancer and blood cancer, respectively. Further research will determine how the variants we discovered affect mitochondrial function. This research may offer the possibility of using a heteroplasmic variant as a new cancer marker.
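
Illustrative sketch (not part of the poster): comparing tumor heteroplasmic variants against the variant allele frequency (VAF) in matched normal tissue can be expressed as a simple filter. The VAF thresholds, the data layout and the example positions below are assumptions for illustration; the abstract does not state the cut-offs that were used.

```python
def variant_allele_frequency(alt_reads, depth):
    """Fraction of reads supporting the variant; 0.0 when there is no coverage."""
    return alt_reads / depth if depth else 0.0


def tumor_specific_heteroplasmies(tumor, normal, min_vaf=0.03, max_vaf=0.97,
                                  normal_max_vaf=0.01):
    """Return variants that are heteroplasmic in tumor and near-absent in normal.

    tumor, normal: dicts mapping (position, alt_base) -> (alt_reads, depth).
    A variant missing from the normal sample is treated as VAF 0.0 (assumption).
    """
    specific = []
    for variant, (alt_reads, depth) in tumor.items():
        tumor_vaf = variant_allele_frequency(alt_reads, depth)
        normal_alt, normal_depth = normal.get(variant, (0, 0))
        normal_vaf = variant_allele_frequency(normal_alt, normal_depth)
        if min_vaf <= tumor_vaf <= max_vaf and normal_vaf <= normal_max_vaf:
            specific.append(variant)
    return sorted(specific)


# Hypothetical read counts at three mitochondrial positions.
tumor = {(3010, "A"): (120, 1000), (16519, "C"): (990, 1000), (73, "G"): (300, 1000)}
normal = {(3010, "A"): (2, 1000), (16519, "C"): (985, 1000), (73, "G"): (280, 1000)}
specific = tumor_specific_heteroplasmies(tumor, normal)
print(specific)
```

In this toy input only the first variant passes: the second is close to homoplasmic in both tissues, and the third is already heteroplasmic in the normal sample.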

Session B-289: DNA Sequence Variant Deleteriousness scores for Mouse Genomes
COSI: VarI
  • Christian Groß, TUDelft, Netherlands
  • Dick De Ridder, Wageningen University & Research, Netherlands
  • Marcel Reinders, Delft University of Technology, Netherlands

Short Abstract: In recent years, advances in functional effect prediction of DNA sequence variants in human genomes have led to several new discoveries and insights into heritable diseases. For non-human species this process is lagging behind. In this project we have evaluated the possibility of creating a method, similar to the Combined Annotation Dependent Depletion (CADD) [1] approach, that is capable of scoring the deleteriousness of single nucleotide variants (SNVs) in the genomes of non-human species. The species we chose to investigate was mouse because of its relatively rich genomic annotation datasets. We have evaluated our trained models on variant subsets representing various coding and non-coding regions for which we have seen differing performances of the original CADD method. Furthermore, we took subsets of features according to their estimated availability in other species such as pig, cattle and chicken. For validation, we compare CADD applied to coding variants and to clinically identified pathogenic variants (ClinVar) [2] with coding variants within the mouse genome. Our results indicate that predicting deleteriousness in non-human species is possible, even if only a few genomic annotations are available, but instead of one model for the entire genome, several models for different genomic regions may become necessary. References: 1. Kircher M, et al. “A general framework for estimating the relative pathogenicity of human genetic variants”. Nat Genet. 2014 Feb 2. 2. Landrum M, et al. “ClinVar: public archive of relationships among sequence variation and human phenotype”. Nucleic Acids Res. 2014 Jan 1; 42 D980–D985.

Session B-291: Advancing whole-genome variant effect scoring – an update to the CADD framework
COSI: VarI
  • Philipp Rentzsch, Berlin Institute of Health, Germany
  • Martin Kircher, Berlin Institute of Health, Germany

Short Abstract: Modern sequencing approaches have seen rapid adoption for the identification of disease causal variants. However, interpreting thousands of new or very rare variants identified with every new human genome remains a major challenge. Computational prioritization can support variant interpretation, specifically with scores available across all variant types (SNVs, InDels & SVs). Genome-wide scores typically integrate diverse types of data such as functional element annotations, sequence conservation, and biochemical activity read-outs. Combined Annotation Dependent Depletion (CADD) is one such method for scoring variant deleteriousness of SNVs and short InDels. It trains a supervised learning classifier, separating simulated de novo variants (proxy-deleterious) from changes since the common ancestor of human and chimp (proxy-benign). More recent approaches like DANN, DeepSEA, Eigen, FatHMM-MKL, FunSeq2, and LINSIGHT have shown that since the publication of CADD in 2014, a number of new genomic datasets became available and that tweaks in model training may improve reliability of variant scoring. We revised CADD's source-code base, to allow easier integration of new annotations as well as exploration of alternate and non-linear models using the GraphLab Create library. As a first application, we train different learners on the original feature set and compare performance in separating benign and pathogenic variant sets. Further, we explore integrating additional annotations (e.g. conserved element annotations, binding site motifs, splice predictors, and variant density information) with logistic regression models. In the future, the new code base will support CADD scores for GRCh38 as well as explorations of alternative training objectives.

Session B-293: GDIvar: a Web-Application for Storing and Characterizing Genomic Variants in Human Cancer
COSI: VarI
  • Marta Interlandi, Institute of Medical Informatics, University of Münster, Germany
  • Sarah Sandmann, Institute of Medical Informatics, University of Münster, Germany
  • Iñaki Soto Rey, Institute of Medical Informatics, University of Münster, Germany
  • Michael Storck, Institute of Medical Informatics, University of Münster, Germany
  • Marcel Trautmann, Gerhard-Domagk-Institute of Pathology, UKM, Germany
  • Wolfgang Hartmann, Gerhard-Domagk-Institute of Pathology, UKM, Germany
  • Martin Dugas, Institute of Medical Informatics, University of Münster, Germany

Short Abstract: The application of next generation sequencing (NGS) in oncological practice allows clinicians to derive a personalized therapy based on the genomic mutations detected in each patient. The focus thus shifts from the histopathological analysis of tumor cells and their anatomical position to the analysis of the genotype. The aim is to identify those mutations that are targeted by known drugs, in order to suggest the most suitable treatment for each patient. However, the characterization of genomic variants according to their pathogenicity and druggability is not a trivial task. Here, we present GDIvar, a web-application that simplifies and improves the workflow of pathologists in their analysis of variants detected in NGS data. The tool allows users to import and store lists of variants along with information on the tumor samples analyzed and the patients to whom they belong. Data are stored in a relational database and can be filtered according to attributes of interest. Moreover, the characterization of variants is assisted by a “suggested characterization” derived from previous characterizations of the same variant in different samples. For each variant of interest, a table listing the patients who share the variant, with related information such as their diagnosis, is available. The output of the analysis can be exported in a spreadsheet format. GDIvar has been successfully tested on 48 samples (5840 variants) and is currently used by the Gerhard-Domagk Institute of Pathology of the University Hospital of Münster.

Session B-295: In-depth germline copy number variant analysis of breast cancer revealed further DNA damage repair system deficiency
COSI: VarI
  • Jihyun Kim, National Cancer Center, South Korea
  • Soo Young Cho, National Cancer Center, South Korea
  • Charny Park, National Cancer Center, South Korea

Short Abstract: Background: Rare germline cancer susceptibility variants are known to occur at frequencies ranging from 4% in AML to 19% in ovarian cancer. However, previous germline variant studies have focused on single nucleotide variants, and germline copy number variants that predispose to high cancer susceptibility have yet to be characterized in large-scale datasets. Here, we systematically identified germline copy number variants (CNVs) from TCGA breast cancer whole exome sequencing (WES) data, integrated with matched transcript data. Results: We established a germline CNV analysis pipeline for 1044 WES samples of TCGA BRCA. First, copy number calling based on a normal-pooling method was performed, and we identified germline deletions in 374 cancer-associated genes. The results included CNVs in the cancer susceptibility genes BRCA1/2 (0.39%, 0.1%), TP53 (0.19%), ATM (0.39%), and BRIP1 (0.29%), and subtle recurrent deletions (2.6%) were discovered in the mutagenesis genes APOBEC3A/B. In particular, we classified patients as non-carriers, heterozygous or homozygous for the APOBEC3A/B deletion by integrating WES and matched RNA-Seq alignment status. We found that APOBEC3B deletion allele status was associated with breast cancer risk. Finally, we investigated the function of germline deletion variants, which significantly altered pathways including the Fanconi anemia, homologous recombination and mismatch repair pathways. Conclusion: We revealed the existence of germline CNVs implicating DNA damage repair deficiency. Our investigation suggests the importance of incorporating germline CNVs into genetic variant diagnosis for breast cancer patients. As a further study, integration of somatic and germline variants may be required.

Session B-297: Predicting the pathogenicity of coding region variants using gene-centric features
COSI: VarI
  • Sérgio Matos, DETI/IEETA, Universidade de Aveiro, Portugal

Short Abstract: The ability to effectively prioritise sequence variants originating from high-throughput methods is of central importance for the study of inherited diseases. Although various tools have been developed for predicting the pathogenic effect of these variants, many do not take advantage of gene characteristics and alterations at the DNA or mRNA levels. In this work, we evaluated the contribution of several gene and protein physicochemical properties for training machine-learning classifiers of variant pathogenicity. We used five evaluation sets (HumVar, ExoVar, VariBench, predictSNP, SwissVar; Grimm et al., Human Mutation 2015) containing a total of 82033 distinct variants from 18419 transcripts to perform a leave-one-dataset-out evaluation, removing any overlap between the training and test sets. For a second, more stringent, evaluation we also removed from the training set any variation from transcripts occurring in the test set. For each dataset, we performed k-means clustering of the training instances and trained a separate linear support-vector machine (SVM) classifier on each cluster. At the classification stage, we selected the classifier with best margin, that is, for which the test instance was further away from the class boundary. Using two SVMs, we obtained areas-under-the-curve (AUC) between 0.842 (0.797 in the second evaluation) and 0.894 (0.885) for four of the datasets and 0.649 (0.650) for the SwissVar dataset. For this dataset, we obtained an AUC of 0.704 (0.705) after removing 3108 instances that were either marked as ‘unclassified’ or had inconsistent classification when compared to recent data collected from UniProt, leaving 7436 negative and 2184 positive instances.
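
The classifier-selection step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: each per-cluster linear SVM is assumed to be already trained and is represented only by a weight vector and a bias (the values below are hypothetical), and "best margin" is taken to mean the largest geometric distance between the test instance and the decision boundary.

```python
import math
from typing import List, Sequence, Tuple

# A trained linear classifier is represented as (weights, bias).
# The example values further down are hypothetical; in practice they would
# come from fitting one linear SVM per k-means cluster.
LinearModel = Tuple[Sequence[float], float]


def signed_distance(model: LinearModel, x: Sequence[float]) -> float:
    """Signed geometric distance of x from the hyperplane w.x + b = 0."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    norm = math.sqrt(sum(w * w for w in weights))
    return score / norm


def predict_with_best_margin(models: List[LinearModel],
                             x: Sequence[float]) -> Tuple[int, int]:
    """Pick the model for which x lies furthest from the boundary.

    Returns (index of the chosen model, predicted label in {-1, +1}).
    """
    distances = [signed_distance(m, x) for m in models]
    best = max(range(len(models)), key=lambda i: abs(distances[i]))
    label = 1 if distances[best] >= 0 else -1
    return best, label


# Two hypothetical per-cluster classifiers and one test instance.
models = [([1.0, 0.0], -0.5), ([0.0, 2.0], -4.0)]
chosen, label = predict_with_best_margin(models, [1.0, 0.5])
print(chosen, label)
```

Normalizing by the weight norm makes the distances comparable across classifiers trained on different clusters; comparing raw decision values instead would be a different design choice.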

Session B-299: AML-Varan – a generic approach to integrate a multi-tool-combination based NGS variant calling pipeline into a web-based diagnostics platform
COSI: VarI
  • Christian Wünsch, University of Münster, Germany
  • Sarah Sandmann, University of Münster, Germany
  • Sebastian Windau, University of Münster, Germany
  • Martin Dugas, University of Münster, Germany

Short Abstract: The web-based clinical diagnostics platform AMLVaran supports physicians in the automated analysis of targeted next-generation sequencing samples from patients with Acute Myeloid Leukemia (AML). The platform is now to be equipped with a new variant-calling technology, also developed in-house, which combines eight different variant calling tools and has already demonstrated sensitivities around 0.98 together with a PPV of up to 0.98 on several data sets. The approach is designed to be as generic as possible, in order to enable custom adjustments of the calling tools, annotation databases and filter settings. The generic pipeline consists of 4 blocks: 1.) Variant Calling: Different caller tools are launched to analyze the current sample. Configuring a caller requires only a shell script for its execution and a meta file specifying its output format. 2.) Normalization and Consolidation: The output of the individual callers is normalized, standardized and transferred into a common MySQL database. 3.) Annotation: The annotation is done within the central variant database, in which currently eight common annotation sources (dbSNP, 1000Genomes, ExAC, ClinVar, etc.) are stored in a preprocessed and indexed form so that they can be added to the variants in real-time. 4.) Filtering: For filtering, a complex score is calculated from numerous parameters that can be configured via a dynamic (JavaScript/Ajax) web interface. The effects of different thresholds can be inspected in real-time. Test results on a dataset of 120 samples from AML/MDS patients (target length 520 kbp) will be presented.
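
The normalization and consolidation block can be illustrated with a small sketch. It is only an assumption-laden stand-in for the pipeline described above: it keeps the consolidated calls in an in-memory dictionary rather than a MySQL database, the caller names and coordinates are made up, and it filters on the number of supporting callers instead of the configurable score the abstract mentions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Variant = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)


def normalize(chrom: str, pos: int, ref: str, alt: str) -> Variant:
    """Bring one call into a common form: no 'chr' prefix, upper-case alleles."""
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref.upper(), alt.upper())


def consolidate(calls_by_caller: Dict[str, Iterable[Variant]]) -> Dict[Variant, Set[str]]:
    """Map each normalized variant to the set of callers that reported it."""
    support: Dict[Variant, Set[str]] = defaultdict(set)
    for caller, calls in calls_by_caller.items():
        for call in calls:
            support[normalize(*call)].add(caller)
    return dict(support)


def filter_by_support(support: Dict[Variant, Set[str]], min_callers: int) -> List[Variant]:
    """Keep variants reported by at least min_callers tools (a simple stand-in filter)."""
    return sorted(v for v, callers in support.items() if len(callers) >= min_callers)


# Hypothetical output of two callers for one sample (made-up coordinates).
calls = {
    "caller_a": [("chr1", 100, "a", "g"), ("chr2", 200, "C", "T")],
    "caller_b": [("1", 100, "A", "G")],
}
support = consolidate(calls)
confident = filter_by_support(support, min_callers=2)
print(confident)
```

Keeping the per-variant set of supporting callers, rather than just a count, leaves room for caller-specific weights in a later scoring step.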

Session B-301: Detecting Small Structural Variants in Amplicon-Based NGS Data – an Evaluation of Ten Algorithms
COSI: VarI
  • Marius Wöste, Institute of Medical Informatics, University of Münster, Germany
  • Sarah Sandmann, Institute of Medical Informatics, University of Münster, Germany
  • Aniek de Graaf, Laboratory Hematology, RadboudUMC, Netherlands
  • Bert van der Reijden, Laboratory Hematology, RadboudUMC, Netherlands
  • Joop Jansen, Laboratory Hematology, RadboudUMC, Netherlands
  • Martin Dugas, Institute of Medical Informatics, University of Münster, Germany

Short Abstract: With the development of next-generation sequencing (NGS), genetic sequencing became applicable to many research areas, cancer-related topics in particular. Targeted sequencing enables researchers to analyze selected genomic regions of interest at extremely high coverage, thus enabling analysis of variants with low allelic frequencies. However, an amplicon-based sequencing strategy yields reads whose positions are constrained by the amplicons in use. These constraints potentially hinder structural variant (SV) detection within the amplicons, since most SV detection algorithms were designed for and evaluated on whole genome or whole exome datasets. We simulate a paired-end, amplicon-based dataset with small (15bp to 70bp) SVs to evaluate the performance of ten common SV detection algorithms on amplicon-based NGS datasets: Socrates, BreaKmer, delly2, gustaf, SoftSV, CREST, Pindel, Sprites, BreakDancer and SeekSV. The simulated coverage is based on an amplicon-based dataset covering 111 patients diagnosed with myelodysplastic syndromes (MDS). The covered target region is ~125 kbp in length. We simulate deletions, inversions and tandem duplications, as these are the most common types of SVs called by different tools. Since amplicon-based strategies are usually employed to capture variants with low allelic frequencies, SVs with allelic frequencies between 5% and 35% are simulated. Sensitivity, positive predictive value (PPV) and required computational resources are evaluated. Gustaf and Pindel show sensitivity of ~55%. All other algorithms show sensitivity <20%, with some algorithms calling no SVs at all. However, the PPV of the tools that successfully detected the simulated SVs was >75% in all cases. Run time varies between a few minutes and several hours.
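
Sensitivity and PPV against a simulated truth set can be computed along the following lines. The sketch assumes that a call counts as a true positive when it has the same SV type as a simulated event and both breakpoints lie within a tolerance; the abstract does not state the matching rule, so the tolerance of 5 bp and the example coordinates are hypothetical.

```python
from typing import List, Tuple

SV = Tuple[str, int, int]  # (type, start, end)


def matches(call: SV, truth: SV, tol: int) -> bool:
    """Same SV type and both breakpoints within tol base pairs."""
    return (call[0] == truth[0]
            and abs(call[1] - truth[1]) <= tol
            and abs(call[2] - truth[2]) <= tol)


def sensitivity_and_ppv(simulated: List[SV], called: List[SV],
                        tol: int = 5) -> Tuple[float, float]:
    """Match each call to at most one simulated SV, then derive both metrics."""
    unmatched = list(simulated)
    true_positives = 0
    for call in called:
        for truth in unmatched:
            if matches(call, truth, tol):
                unmatched.remove(truth)  # each simulated SV is counted once
                true_positives += 1
                break
    sensitivity = true_positives / len(simulated) if simulated else float("nan")
    ppv = true_positives / len(called) if called else float("nan")
    return sensitivity, ppv


# Hypothetical simulated events and hypothetical tool output.
simulated = [("DEL", 100, 130), ("INV", 500, 560), ("DUP", 900, 940), ("DEL", 1500, 1520)]
called = [("DEL", 102, 131), ("DUP", 900, 940), ("INV", 2000, 2050)]
sens, ppv = sensitivity_and_ppv(simulated, called)
print(f"sensitivity={sens:.2f} PPV={ppv:.2f}")
```

A tool that calls nothing yields a sensitivity of 0 and an undefined PPV, which is why such tools cannot be included in a PPV comparison.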

Session B-303: Discovering rare causative variants in undiagnosed diseases exploiting VarGenius an automated tool for variants discovery and annotation.
COSI: VarI
  • Francesco Musacchia, Telethon Institute for Genetics and Medicine, Italy
  • Margherita Mutarelli, Telethon Institute for Genetics and Medicine, Italy
  • Andrea Ciolfi, Centro di Ricerca per gli Alimenti e la Nutrizione CREA, Italy
  • Michele Pinelli, Telethon Institute for Genetics and Medicine, Italy
  • Annalaura Torella, Telethon Institute for Genetics and Medicine, Italy
  • Raffaele Castello, Telethon Institute for Genetics and Medicine, Italy
  • Marco Tartaglia, Genetics and Rare Diseases Research Division Ospedale Pediatrico Bambino Gesù, Italy
  • Sandro Banfi, Telethon Institute for Genetics and Medicine, Italy
  • Vincenzo Nigro, Telethon Institute for Genetics and Medicine, Italy

Short Abstract: Motivations: The Telethon Institute for Genetics and Medicine is one of the research centers involved in the Undiagnosed Disease Program (UDP), which aims to give a diagnosis to pediatric patients with a disease without a name. The main aim is to generate WES data for 350-400 patients together with their parents and to use innovative techniques to detect causative variants. Whole exome sequencing (WES) and targeted sequencing have already been used to discover causative variants in complex or unknown diseases (Bonnefond et al. 2010), (Ng et al. 2010, 2), while recent advances have surprisingly reduced the costs of sequencing, leading to an increase in the number of research and clinical experiments. In recent years, many variant detection and annotation tools have been developed for WES analysis (Fischer et al. 2012), (Menon et al. 2016), (Lam et al. 2012), (Gao, Xu, and Starmer 2015). Meanwhile, the Broad Institute developed the Genome Analysis Toolkit (GATK) (DePristo et al. 2011), which, with its Best Practices, became one of the most widely used software solutions for variant discovery and genotyping. A valuable method provided in GATK is joint calling, which exploits data from different samples for accurate variant detection. Jointly calling variants in cohorts and, more generally, obtaining information for thousands of samples paves the way to a number of sophisticated downstream analyses. Yet the organization of samples, analyses and the resulting output data is critical, because none of the existing analysis tools permits their management. Methods: To this end, we developed VarGenius, a software tool that can execute different custom pipelines for variant discovery and annotation of targeted and whole exome sequencing data following the GATK Best Practices.
It exploits FastQC for read quality checks, TrimGalore for trimming sequences (https://www.bioinformatics.babraham.ac.uk/), BWA for alignment against the reference genome (Li and Durbin 2010), GATK for variant detection and Annovar for annotation (Wang, Li, and Hakonarson 2010). VarGenius takes as input a text file with information about the sample locations, gender, kinship and the organization of analyses. The configuration file can be used to execute different pipelines and to change the parameters both for cluster job execution and for the integrated tools. Both single and joint variant calling can be executed, and the final VCF is used in Annovar for variant annotation. The databases used in Annovar are automatically downloaded, installed and used. A PostgreSQL database stores data about the analyses, the variants and their genotypes, and the gene and transcript information associated with the variants, allowing gene annotation of variants. The tool was designed for High Performance Computing (HPC) clusters: pipeline tasks are separated and run independently for each sample using a user-defined number of nodes and amount of memory. Results: VarGenius is an automated open source tool that generates a tabular output containing variant genotype information and annotation, and a web site showing plots and statistics for quality control of the sequencing and the detection. Computing the entire analysis takes about 7 hours on a cluster with Intel Xeon 10 Core 2.50 GHz nodes with 128 GB of memory, and 4 hours on a cluster with nodes with 24 cores and 47 GB RAM. Each exome analysis needs 30-70 GB of space. Analyzing a single sample from targeted sequencing takes approximately 1 hour (on both clusters), and the results need 1 GB of space. With VarGenius we analyzed 46 probands from the UD Program together with their parents in TRIO or QUARTET analyses, for a total of 140 exomes sequenced on an Illumina NextSeq 500 with 150bp read length.
We covered 91% of the target exome, an average of 49 million bases (out of 54 million), with at least 20 reads (20X). Pathogenic genomic variants were identified in 24 cases (all in known disease genes), and 14 have been validated with Sanger sequencing. We plan to extend this software to work with Hg38, to extend the pipeline to genome analysis and to make it suitable for somatic variants. Moreover, we plan to work on the prioritization of variants to provide a reliable tool for the automatic filtering of high quality variants.

Session B-305: Human protein variants in UniProtKB/Swiss-Prot: Improving access to knowledge through standardized annotations
COSI: VarI
  • Maria Livia Famiglietti, SIB Swiss Institute of Bioinformatics, Switzerland
  • Lionel Breuza, SIB Swiss Institute of Bioinformatics, Switzerland
  • Teresa Neto, SIB Swiss Institute of Bioinformatics, Switzerland
  • Alan Bridge, SIB Swiss Institute of Bioinformatics, Switzerland
  • Sylvain Poux, SIB Swiss Institute of Bioinformatics, Switzerland
  • Nicole Redaschi, SIB Swiss Institute of Bioinformatics, Switzerland
  • Lydie Bougueleret, SIB Swiss Institute of Bioinformatics, Switzerland
  • Ioannis Xenarios, SIB Swiss Institute of Bioinformatics, Switzerland
  • UniProt Consortium, SIB Swiss Institute of Bioinformatics, European Bioinformatics Institute, Protein Information Resource, United Kingdom

Short Abstract: We are at the dawn of a new era of personalized genomic medicine where advances in human healthcare will be powered by the integration of data from many sources, including knowledge of how genomic variation affects protein function, disease, and drug response and metabolism. Here we describe work performed at UniProtKB/Swiss-Prot that aims to standardize the curation and provision of protein sequence variation data using a range of ontologies including VariO, GO, and ChEBI. Information about the functional impact of protein sequence variants is curated into UniProtKB/Swiss-Prot as part of the normal annotation workflow, along with detailed information about each protein's normal function, interactions, expression, and regulation as well as other sequence features of interest such as active sites and ligand-binding residues. Our focus on protein sequence variants with functional impact demonstrated by biochemical assays makes UniProtKB/Swiss-Prot variant data highly complementary to that from resources which use genetic data (such as pedigree analyses or GWAS) to link protein sequence variants to specific diseases, phenotypes, or traits. UniProtKB/Swiss-Prot contains more than 76,800 missense and small in-frame variants (UniProt Release 2017_04), including over 29,000 clinically relevant variants.

Session B-307: A preliminary genome-wide association study to determine the putative germline genetic risk variants of breast cancer in Korean women using whole-exome data
COSI: VarI
  • Insong Koh, Hanyang University, South Korea
  • Kiejung Park, Hanyang University, South Korea
  • Sunmin Kim, Hanyang University, South Korea

Short Abstract: Breast cancer is the most common cancer in women worldwide, and the second most common cancer in Korean women. Germline mutation studies have been required and performed to identify heritable genetic risk variants affecting breast cancer. Variant analysis was performed following the Genome Analysis Toolkit Best Practices pipeline on exome sequencing data from 78 Korean women, collected from blood cells of cancer patients (cases), and on 100 healthy Korean exomes (controls). We obtained 102,346 SNVs (78,508 known SNPs, 23,838 unknown SNPs) from controls and 93,702 SNVs (85,490 known SNPs, 7,582 unknown SNPs) from cases. A genome-wide association study was performed with genotype information from the exome pipeline. Five genetic risk SNP loci significantly associated with breast cancer (P-value < 5 × 10^-8) were predicted among non-synonymous SNPs. For network analysis, we selected 1,258 genes harboring 1,573 SNPs with P-value < 0.05. We found one novel significant SNP (P-value = 0.0001877) among them and determined that it is a mutation found in two Korean lung cancer patients in the ICGC database. In the network analysis, the top two overlaps in sub-network enrichment were neoplasms and cancer pathways. The novel SNP and the other loci found in this study will contribute to revealing more effective breast cancer markers in expanded studies with more case-control exomes and genomes.

Session B-309: Imputation and extrapolation for functional impact of single amino acid change
COSI: VarI
  • Joe Wu, University of Toronto, Canada
  • Jochen Weile, University of Toronto, Canada
  • Song Sun, University of Toronto, Canada
  • Jennifer Knapp, University of Toronto, Canada
  • Marta Verby, University of Toronto, Canada
  • Fritz Roth, University of Toronto, Canada

Short Abstract: One of the great challenges in genetics is to accurately make functional annotation of variation in the human genome sequence. Although recent work shows that experimental assays can detect three times more disease variants at high confidence than computational predictions, assessing the functional impact of missense mutations still relies heavily on computational methods. Towards more accurate pathogenicity annotations of human genetic variants, we experimentally mapped the fitness landscape of missense mutations for four proteins: SUMO1, UBE2I, TPK1 and CALM1. Using exhaustive mutagenesis and a deep sequencing read-out of multiplexed yeast functional complementation assays, we experimentally determined the fitness effect of the majority of all possible missense amino acid changes in these proteins. Unfortunately, a substantial minority of missense mutations remained unmeasured. We sought to fill in the missing information in this experimental fitness landscape map, training multiple machine learning models with diverse features such as conservation score and physicochemical amino acid properties to impute the functional impact of the remaining variants. We show that the imputed fitness values are more accurate than computational predictions using PolyPhen2 and SIFT. Finally, we show that these experimental fitness landscapes can be used more broadly to improve models that predict human pathogenic variants for all human genes.

Session B-311: Structural annotation of single-nucleotide variants sheds light on molecular mechanism behind cancer and other diseases
COSI: VarI
  • Alexander Gress, Max-Planck-Institute for Informatics, Saarland University, Germany
  • Andreas Keller, Chair for Medical Bioinformatics, Saarland University, Germany
  • Vasily Ramensky, Moscow Institute of Physics and Technology, Russia
  • Olga V. Kalinina, Max-Planck-Institute for Informatics, Saarland University, Germany

Short Abstract: Advances in genome sequencing technologies are rapidly increasing the amount of genetic information, making more and more genomes associated with genetic diseases available. Still, the information about how exactly disease-associated phenotypes arise must be extracted from them. Analyzing such great volumes of data requires effective algorithmic tools. In this work, we describe a method for comprehensive analysis of the spatial location of mutations corresponding to non-synonymous single-nucleotide variants (nsSNVs) in three-dimensional structures of proteins and protein complexes. We use it to study a large collection of disease-associated and neutral variants. We consider five datasets: germline and somatic nsSNVs associated with cancer, nsSNVs associated with non-cancer genetic diseases, frequent (and hence probably functionally neutral) variants and benign variants. We map all nsSNVs onto three-dimensional structures of the corresponding proteins or of their homologs and gather information on molecular contacts of the corresponding residues. We show that cancer-associated nsSNVs are more often found in ligand-binding pockets and DNA-binding interfaces compared to all other variants and random controls. They are also enriched in proteins that participate in many protein-protein interactions, but their distribution within these proteins is not different from random. nsSNVs associated with non-cancer diseases are enriched in the protein core, where they probably destabilize the protein three-dimensional structure. To the best of our knowledge, this study is the largest structure-based analysis of nsSNVs associated with genetic diseases so far. It highlights the diversity of molecular mechanisms behind these diseases and identifies important specific trends.

Session B-313: Estimation of population genetic parameters from sequence data of experimental evolution populations using an EM algorithm
COSI: VarI
  • Yasuhiro Kojima, University of Tokyo, Japan
  • Hirotaka Matsumoto, RIKEN, Japan
  • Hisanori Kiryu, University of Tokyo, Japan

Session B-315: Deep Learning of Mutation-Gene-Drug Relations from the Literature for Precision Medicine
COSI: VarI
  • Kyubum Lee, Korea University, South Korea
  • Byounggun Kim, Korea University, South Korea
  • Sunkyu Kim, Korea University, South Korea
  • Yonghwa Choi, Korea University, South Korea
  • Wonho Shin, Korea University, South Korea
  • Sunwon Lee, Korea University, South Korea
  • Sungjoon Park, Korea University, South Korea
  • Seongsoon Kim, Korea University, South Korea
  • Aik Choon Tan, University of Colorado Anschutz Medical Campus, United States
  • Jaewoo Kang, Korea University, South Korea

Short Abstract: Motivation: Molecular biomarkers that can predict drug efficacy in cancer patients are crucial components for the advancement of precision medicine. However, identifying these molecular biomarkers remains a laborious and challenging task. Next-generation sequencing of patients and preclinical models has increasingly led to the identification of novel gene-mutation-drug relations, and these results have been reported and published in the scientific literature. Methods: Here, we present two new computational methods that utilize all PubMed articles as domain-specific background knowledge to assist in the extraction and curation of gene-mutation-drug relations from the literature. The first method integrates Biomedical Entity Search Tool (BEST) scoring results as part of the features used to train machine learning classifiers. The second method uses not only the BEST scoring results but also word vectors in a deep convolutional neural network model; the word vectors are constructed from and trained on numerous documents such as PubMed abstracts and Google News articles. Using features obtained from the BEST search engine scores and word vectors, together with random forest and deep convolutional neural network classifiers, we extract the mutation-gene and mutation-drug relations from the literature. Results: We achieved better results than the state-of-the-art method. We used our suggested features in a simple machine learning approach, and obtained F1-scores of 0.96 and 0.82 for mutation-gene and mutation-drug relation classification, respectively. We also developed a deep learning classification model using convolutional neural networks, BEST scores, and word embeddings pre-trained on PubMed or Google News data. Using deep learning, the classification accuracy improved, and F1-scores of 0.96 and 0.86 were obtained for the mutation-gene and mutation-drug relations, respectively.
We believe that our computational methods described in this research could be used as an important tool in identifying molecular biomarkers that predict drug responses in cancer patients. We also built a database of these mutation-gene-drug relations that were extracted from all the PubMed abstracts. We believe that our database can prove to be a valuable resource for precision medicine researchers. To the best of our knowledge, our approach is the first that combines biomedical entity search and word embedding to capture background knowledge and utilize it for variant-entity relation extraction. Availability: The database is available at http://VarDrugPub.korea.ac.kr

Session B-317: From genes to phenotypes: eDGAR plus NET-GE approach
COSI: VarI
  • Giulia Babbi, Biocomputing Group Bologna, Italy
  • Pier Luigi Martelli, University of Bologna, Italy
  • Giuseppe Profiti, Università di Bologna, Italy
  • Samuele Bovo, University of Bologna, Italy
  • Castrense Savojardo, University of Bologna, Italy
  • Rita Casadio, UNIBO, Italy

Short Abstract: Modern sequencing technologies allow dissecting the genetic component of phenotypic traits, but the molecular mechanisms at the basis of the pathogenesis are often uncharacterized. Investigating the biological pathways at the basis of the onset of symptoms may give fundamental indications about disease development. We derived the gene-disease relations from eDGAR, a database collecting 5729 gene/disease associations derived from OMIM, Humsavar and ClinVar. eDGAR provides precomputed results for polygenic and heterogeneous diseases, analyzing the associated list of genes and describing their features. These include physical and/or regulatory interactions between pairs of genes, retrieved from PDB, BIOGRID and STRING, as well as co-occurrence in structural complexes. Regulatory interactions are derived from TRRUST, and the localization on chromosomes and/or co-localization in neighboring loci are reported. Moreover, eDGAR reports enriched functional annotations computed with NET-GE, a tool for the standard and network-based enrichment of REACTOME and KEGG pathways and GO terms. Given the input set (genes or proteins), it retrieves 20,390 modules computed from the STRING human interactome, detecting statistically significant associations not directly inferable from the annotations of the starting set. We classified the human diseases on the basis of the corresponding symptoms and particular phenotypes, using the OMIM Clinical Synopsis and the HPO resources. Combining this classification with eDGAR and NET-GE, we are able to detect the biological pathways at the basis of the phenotype appearance, directly linking the symptoms to the targeted biological process. This is a new approach to investigate in depth the relations between phenotypes and molecular mechanisms.

Session B-319: PopCluster: A New Algorithm to Identify Genetic Variants with Effects that Change with Ethnicity
COSI: VarI
  • Anastasia Gurinovich, Boston University, United States
  • John Farrell, Boston University, United States
  • Harold Bae, Oregon State University, United States
  • Annibale Puca, University of Salerno, Italy
  • Gil Atzmon, Albert Einstein College of Medicine, United States
  • Nir Barzilai, Albert Einstein College of Medicine, United States
  • Thomas Perls, Boston University, United States
  • Paola Sebastiani, Boston University, United States

Session B-321: Analysis of KIR gene copy number variation among cancer patients reveals interactions with tumor phenotypes
COSI: VarI
  • Rachel Marty, UCSD, Switzerland
  • Hannah Carter, UCSD, United States
  • David Gfeller, Ludwig Center for Cancer Research; UNIL, Switzerland

Session B-323: Variant relevance prediction in extremely imbalanced training sets
COSI: VarI
  • Max Schubach, Berlin Institute of Health, Germany
  • Matteo Re, Università degli Studi di Milano, Italy
  • Peter N. Robinson, The Jackson Laboratory for Genomic Medicine, United States
  • Giorgio Valentini, Università degli Studi di Milano, Italy

Session B-325: FUMA: Functional mapping and annotation of genetic associations
COSI: VarI
  • Kyoko Watanabe, VU University Amsterdam, Netherlands
  • Erdogan Taskesen, VU University (VU), Netherlands
  • Arjen van Bochoven, Vrije Universiteit Amsterdam, Netherlands
  • Danielle Posthuma, VU University Amsterdam (VU), Netherlands

Session B-327: Phenotype-driven discovery of digenic variants in personal genome sequences
COSI: VarI
  • Imane Boudellioua, King Abdullah University of Science and Technology, Saudi Arabia
  • Maxat Kulmanov, King Abdullah University of Science and Technology, Saudi Arabia
  • Paul Schofield, University of Cambridge, United Kingdom
  • Georgios Gkoutos, University of Birmingham, United Kingdom
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Session B-329: MAPPIN: A Method for Annotating, Predicting Pathogenicity, and mode of Inheritance for Nonsynonymous variants
COSI: VarI
  • Nehal Gosalia, Regeneron, United States
  • Aris Economides, Regeneron, United States
  • Frederick Dewey, Regeneron Genetics Center, United States
  • Suganthi Balasubramanian, Regeneron Genetics Center, United States

Session B-331: Benchmarking Variant Calling Tools for NGS Data
COSI: VarI
  • Sarah Sandmann, Institute of Medical Informatics, Germany
  • Aniek de Graaf, Laboratory Hematology, Netherlands
  • Mohsen Karimi, Center for Hematology and Regenerative Medicine, Sweden
  • Bert van der Reijden, Laboratory Hematology, Netherlands
  • Eva Hellström-Lindberg, Center for Hematology and Regenerative Medicine, Sweden
  • Joop Jansen, Laboratory Hematology, Netherlands
  • Martin Dugas, Institute of Medical Informatics, Germany
Session B-333: When loss-of-function is loss of function: assessing mutational signatures and impact of loss-of-function genetic variants
COSI: VarI
  • Sean D Mooney, Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, United States
  • Predrag Radivojac, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Kymberleigh Pagel, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Vikas Pejaver, Department of Computer Science and Informatics, Indiana University, Bloomington, Indiana, United States
  • Hyunjun Nam, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Matthew Mort, Institute of Medical Genetics, Cardiff University, United Kingdom
  • David N Cooper, Institute of Medical Genetics, Cardiff University, United Kingdom
  • Jonathan Sebat, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Guan Ning Lin, Department of Psychiatry, University of California San Diego, La Jolla, California, United States
  • Lilia M Iakoucheva, Department of Psychiatry, University of California San Diego, La Jolla, California, United States

Short Abstract: Motivation: Loss-of-function genetic variants are frequently associated with severe clinical phenotypes, yet many are present in the genomes of healthy individuals. The available methods to assess the impact of these variants rely primarily upon evolutionary conservation, with little to no consideration of the structural and functional implications for the protein. Furthermore, they do not provide the user with information on the specific molecular alterations that are potentially causative of disease. Results: To address this, we investigate protein features underlying loss-of-function genetic variation and develop a machine learning method, MutPred-LOF, for the discrimination of pathogenic and tolerated variants that can also generate hypotheses on specific molecular events disrupted by the variant. We investigate a large set of human variants derived from the Human Gene Mutation Database, ClinVar, and the Exome Aggregation Consortium. Our prediction method shows an area under the Receiver Operating Characteristic curve of 0.85 for all loss-of-function variants and 0.75 for proteins in which both pathogenic and neutral variants have been observed. We applied MutPred-LOF to a set of 1,142 de novo variants from neurodevelopmental disorders and found enrichment of pathogenic variants in affected individuals. Overall, our results highlight the potential of computational tools to elucidate causal mechanisms underlying loss of protein function in loss-of-function variants.

Session B-501: Increasing the power of meta-analysis of genome-wide association studies to detect heterogeneous effects
COSI: VarI
  • Cue Hyunkyu Lee, Department of Convergence Medicine, University of Ulsan College of Medicine, Asan Medical Center, Korea, Rep
  • Eleazar Eskin, Department of Human Genetics, Department of Computer Science, University of California, United States
  • Buhm Han, Department of Convergence Medicine, University of Ulsan College of Medicine, Asan Medical Center, Korea, Rep

Short Abstract: Motivation: Meta-analysis is essential to combine the results of genome-wide association studies (GWASs). Recent large-scale meta-analyses have combined studies of different ethnicities, environments and even studies of different related phenotypes. These differences between studies can manifest as effect size heterogeneity. We previously developed a modified random effects model (RE2) that can achieve higher power to detect heterogeneous effects than the commonly used fixed effects model (FE). However, RE2 cannot perform meta-analysis of correlated statistics, which are found in recent research designs, and the identified variants often overlap with those found by FE. Results: Here, we propose RE2C, which increases the power of RE2 in two ways. First, we generalized the likelihood model to account for correlations of statistics to achieve optimal power, using an optimization technique based on spectral decomposition for efficient parameter estimation. Second, we designed a novel statistic to focus on the heterogeneous effects that FE cannot detect, thereby increasing the power to identify new associations. We developed an efficient and accurate p-value approximation procedure using analytical decomposition of the statistic. In simulations, RE2C achieved a dramatic increase in power compared with the decoupling approach (71% vs. 21%) when the statistics were correlated. Even when the statistics were uncorrelated, RE2C achieved a modest increase in power. Applications to real genetic data supported the utility of RE2C. RE2C is highly efficient and can meta-analyze one hundred GWASs in one day.
Availability and implementation: The software is freely available at http://software.buhmhan.com/RE2C.
Contact: buhm.han@amc.seoul.kr
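
The contrast the abstract draws between the fixed effects model and heterogeneous effects can be illustrated with a minimal sketch. This is not RE2 or RE2C. It is only the textbook inverse-variance fixed-effects estimate together with Cochran's Q heterogeneity statistic, and it assumes uncorrelated studies. The function names and the toy effect sizes are hypothetical and chosen purely for illustration.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance fixed-effects meta-analysis (illustrative sketch).

    betas: per-study effect size estimates
    ses:   per-study standard errors
    Returns (pooled beta, pooled standard error, z-score, Cochran's Q).
    Assumes the studies are independent (no correlated statistics).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta_fe = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se_fe = math.sqrt(1.0 / total_weight)
    z = beta_fe / se_fe
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))
    return beta_fe, se_fe, z, q


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Toy data (hypothetical): three studies with equal standard errors.
homogeneous = fixed_effects_meta([0.1, 0.1, 0.1], [0.05, 0.05, 0.05])
heterogeneous = fixed_effects_meta([0.3, -0.3, 0.0], [0.05, 0.05, 0.05])
```

In the toy heterogeneous case the opposite-signed effects cancel, so the pooled fixed-effects z-score is zero even though Q is large. That is the kind of signal a fixed effects test misses and that heterogeneity-aware statistics are designed to pick up.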

