Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide

Posters - Schedules

Posters at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19 and no later than July 23. All registered conference participants will have access to the posters and presentations throughout the conference, and to the content until October 31, 2021. Q&A is available through a chat function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.

Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:

Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC
Consensus-based identification and comparative analysis of structural variants in selected human families
COSI: VarI
  • Dariusz Plewczynski, Centre of New Technologies, University of Warsaw, S. Banacha 2c, 02-097 Warsaw, Poland
  • Mateusz Chiliński, Centre of New Technologies, University of Warsaw, S. Banacha 2c, 02-097 Warsaw, Poland
  • Sachin Gadakh, Centre of New Technologies, University of Warsaw, S. Banacha 2c, 02-097 Warsaw, Poland

Short Abstract: We present a comprehensive analysis of Oxford Nanopore (ONT) sequencing technology compared with short-read techniques, such as Illumina. In our study, we focus on structural variants (SVs), segments of DNA at least 50 bp in length that are unique to personal genomes, as identified by the 1000 Genomes Project. We improve the quality of SV identification from whole genome sequencing (WGS) experiments by using a consensus approach. Fifteen gold-standard tools were used to obtain a polished list of SVs for daughters of families from the 1000 Genomes Project, using publicly available datasets from next-generation sequencing experiments performed with both short-read (Illumina) and long-read (ONT) technologies. The results of the SV callers were merged using the novel ConsensuSV algorithm, which integrates the SV sets using machine learning by combining decision trees and neural networks trained and benchmarked on the high-quality SVs from the 1000 Genomes Project. Finally, comparing the SV sets obtained from the ConsensuSV algorithm for long and short reads, our findings demonstrate the superiority of ONT across all SV sizes: long-read-based SV inference detected more SVs than short-read-based inference.
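
To make the idea of merging calls from several SV callers concrete, here is a minimal sketch of a plain voting consensus. It is not the ConsensuSV algorithm, which the abstract describes as a machine learning model; it only illustrates the simpler baseline of keeping an SV when enough callers report an overlapping call. The `SV` record, the function names, the 50% reciprocal-overlap threshold and the two-caller minimum are all assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SV:
    """Hypothetical minimal SV record; assumes start <= end."""
    chrom: str
    start: int
    end: int
    svtype: str  # e.g. "DEL", "DUP", "INV"


def reciprocal_overlap(a: SV, b: SV) -> float:
    """Smaller of the two overlap fractions; 0.0 if chromosome or type differ."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def consensus_calls(calls_by_caller, min_callers=2, min_overlap=0.5):
    """Greedily cluster calls and keep clusters supported by enough callers.

    The first call seen in a cluster serves as its representative.
    """
    clusters = []  # each entry: {"rep": SV, "callers": set of caller names}
    for caller, calls in calls_by_caller.items():
        for sv in calls:
            for cluster in clusters:
                if reciprocal_overlap(cluster["rep"], sv) >= min_overlap:
                    cluster["callers"].add(caller)
                    break
            else:
                clusters.append({"rep": sv, "callers": {caller}})
    return [c["rep"] for c in clusters if len(c["callers"]) >= min_callers]


calls = {
    "callerA": [SV("chr1", 1000, 2000, "DEL"), SV("chr2", 500, 900, "DUP")],
    "callerB": [SV("chr1", 1050, 2050, "DEL")],
    "callerC": [SV("chr1", 5000, 6000, "DEL")],
}
supported = consensus_calls(calls)
```

In this toy input only the chr1 deletion near position 1000 is reported by two callers, so it is the only call kept. A learned model such as the one described in the abstract would replace the fixed vote threshold with a trained decision.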

Disease interpretation of non-coding genomic elements with the GeneCards Suite
COSI: VarI
  • Simon Fishilevich, Weizmann Institute of Science, Israel
  • Ruth Barshir, Weizmann Institute of Science, Israel
  • Tsippi Iny-Stein, Weizmann Institute of Science, Israel
  • Ofer Zelig, LifeMap Sciences Inc., United States
  • Yaron Guan-Golan, LifeMap Sciences Inc., United States
  • Marilyn Safran, Weizmann Institute of Science, Israel
  • Doron Lancet, Weizmann Institute of Science, Israel

Short Abstract: Interpreting whole genome sequencing data is a major challenge, since 98% of variants reside in non-coding genomic “dark matter”, including regulatory elements and non-coding RNA (ncRNA) genes.
The GeneCards® Suite (www.genecards.org/) is a leading integrated biomedical knowledgebase for interpretation of clinical genetics, including the gene-centric GeneCards (PMID:27322403) and disease-centric MalaCards (PMID:27899610). VarElect (PMID:27357693), our NGS phenotype interpreter, leverages this knowledgebase to prioritize associations between genes and disease/phenotype terms. We’ve made significant strides towards optimizing our Suite for effective interpretation of non-coding variants.
GeneHancer (PMID:28605766) is a database of ~400k functionally annotated enhancers, promoters, and their target genes. It integrates information from key epigenetic resources, and is included as a native regulation track at the UCSC genome browser.
GeneCaRNA (PMID:33676929) is a novel gene-centric ncRNA database, integrating data from major gene and transcript resources. GeneCaRNA provides a comprehensive non-redundant view of >220k human ncRNAs of 17 functionally diverse types, such as lncRNAs and miRNAs.
Our novel non-coding compendia provide an indispensable augmentation for VarElect, empowering the prioritization of variant-containing enhancers, promoters and ncRNA genes with respect to diseases, via direct and target-gene mediated links. These capabilities facilitate deciphering the clinical significance of non-coding variants, often elucidating unsolved clinical cases (PMID:32506582).

Findings from the Critical Assessment of Genome Interpretation, a community experiment to evaluate phenotype prediction
COSI: VarI
  • Steven E. Brenner, University of California Berkeley, United States
  • Constantina Bakolitsa, University of California Berkeley, United States
  • Gaia Andreoletti, University of California, San Francisco, United States
  • Roger A. Hoskins, University of California Berkeley, United States
  • Shantanu Jain, Northeastern University, United States
  • Predrag Radivojac, Northeastern University, United States
  • John Moult, University of Maryland, United States
  • CAGI Participants, University of California Berkeley, United States

Short Abstract: Interpretation of genomic variation plays an essential role in monogenic disease, the analysis of cancer and increasingly also in complex trait disease, with applications ranging from basic research to clinical decisions. Yet the field lacks a clear consensus on the appropriate level of confidence to place in variant impact prediction. The Critical Assessment of Genome Interpretation (CAGI, \'kā-jē\) is a community experiment to objectively assess computational methods for predicting the phenotypic impact of genomic variation. CAGI participants are provided genetic variants and make blind predictions of resulting phenotype. Independent assessors evaluate the predictions by comparing with experimental and clinical data.

Over five CAGI editions completed during the past decade, several themes have emerged. Top missense prediction methods are highly statistically significant. Interpretation of non-coding variants shows promise but is not at the level of missense. Bespoke approaches often enhance performance. Conservation-based methods show the most consistent performance. Interpretation of whole-genome data remains an open challenge. However, in certain examples of using clinical data, predictors identified causal variants overlooked by initial analysis in a diagnostic laboratory.

CAGI 6 is currently underway with challenges addressing missense, clinical genomes, cancer, splicing and polygenic risk scores. See: genomeinterpretation.org.

Genetic Variants associated with Vitamin K Deficiency
COSI: VarI
  • Shalini Rajagopal, Birla Institute of Scientific Research, India
  • Ayam Gupta, Birla Institute of Scientific Research, India
  • Jalaja Jhansi, Vignan University, India
  • Anil Kumar, Vignan University, India
  • Krishna Mohan Medicherla, Birla Institute of Scientific Research, India
  • Ashwani Kumar Mishra, DNA Xperts, India
  • Bhanuprakash Reddy, National Institute of Nutrition, India
  • Kavi Kishor Pb, Vignan University, India
  • Prashanth Suravajhala, Birla Institute of Scientific Research, India

Short Abstract: Shalini Rajagopal, Ayam Gupta, Jalaja Naravula, Anil Kumar S, Praveen Mathur, Anita Simlot, Sudhir Mehta, Chhagan Bihari, Sumita Mehta, Ashwani Kumar Mishra, Krishna Mohan Medicherla, G Bhanuprakash Reddy, PB Kavi Kishor and Prashanth Suravajhala*


Understanding genetic variants is a major focus of our research project. We would like to take this opportunity to present a poster on genetic variants associated with Vitamin K deficiency and to discuss possible future directions of our project. The aim of our study is to detect and quantify differentially expressed genes and variant effects across experimental conditions, which gives information on how genes are regulated and reveals details of an organism's biology. So far, 46 genes have been related to Vitamin K; major genes such as VKORC1, GGCX and VKA are involved in its biological functions, but the mechanism of a few genes is still unknown. Our future work aims to achieve high sensitivity and specificity in SNP calls for different conditions across cohorts of samples. We hope the research ideas shared in our presentation will interest the scientific community, and that this will help early-career researchers like me gain experience from this learning curve.

Germline variants that influence the tumor immune microenvironment also drive response to immunotherapy
COSI: VarI
  • Gerald Morris, UCSD, United States
  • Pandurangan Vijayanand, La Jolla Institute of Immunology, United States
  • J Silvio Gutkind, Moores Cancer Center, United States
  • Glenn Merlino, NCI, United States
  • Wesley Thompson, UCSD, United States
  • Chun Chieh Fan, UCSD, United States
  • Chi-Ping Day, NCI, United States
  • Maurizio Zanetti, Moores Cancer Center, United States
  • Jill Mesirov, UCSD, United States
  • Sandip Patel, Moores Cancer Center, United States
  • Olivier Harismendy, Moores Cancer Center, United States
  • Hannah Carter, UCSD, United States
  • Rany Salem, UCSD, United States
  • Benjamin Schmiedel, La Jolla Institute of Immunology, United States
  • Steven Cao, UCSD, United States
  • Cristian Gonzalez-Colin, La Jolla Institute of Immunology, United States
  • James Talwar, UCSD, United States
  • Andrea Castro, UCSD, United States
  • Hyo Kim, UCSD, United States
  • Eva Pérez-Guijarro, NCI, United States
  • Victoria Wu, Moores Cancer Center, United States
  • Meghana Pagadala, UCSD, United States

Short Abstract: With the continued promise of immunotherapy as an avenue for treating cancer, understanding how host genetics contributes to the tumor immune microenvironment (TIME) is essential to tailoring cancer risk screening and treatment strategies. Using genotypes from over 8,000 European individuals in The Cancer Genome Atlas (TCGA) and 137 heritable tumor immune phenotype components (IP components), we identified and investigated 482 TIME-associated variants. Many TIME-associated variants influence gene activities in specific immune cell subsets, such as macrophages and dendritic cells, and interact to promote more extreme TIME phenotypes. TIME-associated variants were predictive of immunotherapy response in human cohorts treated with immune-checkpoint blockade (ICB) in 3 cancer types, causally implicating specific immune-related genes that modulate myeloid cells of the TIME. Moreover, we validated the function of these genes in driving tumor response to ICB in preclinical studies. Through an integrative approach, we link host genetics to TIME characteristics, informing novel biomarkers for cancer risk and target identification in immunotherapy.

ITHANET: An information and database community portal for haemoglobinopathies
COSI: VarI
  • Stella Tamana, The Cyprus Institute of Neurology and Genetics, Department of Molecular Genetics Thalassaemia, Cyprus
  • Maria Xenophontos, The Cyprus Institute of Neurology and Genetics, Department of Molecular Genetics Thalassaemia, Cyprus
  • Coralea Stephanou, The Cyprus Institute of Neurology and Genetics, Department of Molecular Genetics Thalassaemia, Cyprus
  • Anna Minaidou, The Cyprus Institute of Neurology and Genetics, Department of Molecular Genetics Thalassaemia, Cyprus
  • Carsten W Lederer, The Cyprus Institute of Neurology and Genetics, Department of Molecular Genetics Thalassaemia, Cyprus
  • Petros Kountouris, The Cyprus Institute of Neurology and Genetics, Department of Molecular Genetics Thalassaemia, Cyprus
  • Marina Kleanthous, Molecular Genetics Thalassaemia, The Cyprus Institute of Neurology and Genetics, Nicosia, Cyprus

Short Abstract: The ITHANET portal (www.ithanet.eu) is an expanding, publicly available biomedical resource dedicated to haemoglobinopathies. It provides a manually curated, literature-derived collection of published genetic and epidemiological data, and also integrates the latest updates on news, events, publications and more.
A team of expert biocurators retrieves, validates and annotates information from scientific literature and individual submitters, whilst also incorporating new and updated information from existing public databases (e.g., HbVar, dbSNP, ClinVar).
ITHANET offers a range of inter-linked databases; IthaGenes currently stores annotations for over 3180 variants in over 420 globin-related loci and genes. IthaMaps consists of epidemiological data for over 200 countries, which are illustrated both at a global and regional scale. IthaChrom is a collection of digitized reports of standard diagnostic HPLC analyses on more than 600 haemoglobin variants. Recently, ITHANET developed IthaPhen, an interactive genotype-phenotype database and a tool focused on the characterization and detection of CNVs related to haemoglobinopathies.
ITHANET is the most comprehensive knowledgebase on haemoglobinopathies and the official partner of the HVP Global Globin Network for data storing, curation and sharing within and between countries. ITHANET is coordinating the ClinGen Hemoglobinopathy VCEP, focused on standardizing pathogenicity classification of genetic variants according to the ACMG/AMP guidelines.

LYRUS: A Machine Learning Model for Predicting the Pathogenicity of Missense Variants
COSI: VarI
  • Jiaying Lai, Brown University, United States
  • Jordan Yang, Brown University, United States
  • Ece Uzun, Brown University, United States
  • Brenda Rubenstein, Brown University, United States
  • Neil Sarkar, Brown University, United States

Short Abstract: Single amino acid variations (SAVs) are a primary contributor to variations in the human genome. Identifying pathogenic SAVs can aid in the understanding of the genetic architecture of complex diseases. Most approaches for predicting the functional effects or pathogenicity of SAVs rely on either sequence or structural information. Recently, researchers have attempted to increase the accuracy of their predictions by incorporating protein dynamics. We present LYRUS, a machine learning method that uses an XGBoost classifier selected by TPOT to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based features, six structure-based features, and four dynamics-based features. Uniquely, LYRUS includes a novel sequence co-evolution feature called variation number. LYRUS's performance was evaluated using a dataset that contains 4,363 protein structures corresponding to 20,307 SAVs based on human genetic variant data from the ClinVar database. Based on our dataset, the LYRUS classifier has a higher accuracy, specificity, F-measure, and Matthews correlation coefficient (MCC) than alternative methods including PolyPhen2, PROVEAN, SIFT, Rhapsody, EVMutation, MutationAssessor, SuSPect, FATHMM, and MVP. Variation numbers used within LYRUS differ greatly between pathogenic and neutral SAVs. Applications of the method to PTEN and TP53 further corroborate LYRUS's strong performance.
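
For readers unfamiliar with the four evaluation measures named in this abstract, the sketch below shows how accuracy, specificity, F-measure and MCC are usually computed from a binary confusion matrix. It is an illustration of the standard definitions only, not the authors' evaluation code; the function name and the label encoding (1 for pathogenic, 0 for neutral) are assumptions.

```python
import math


def classification_metrics(y_true, y_pred):
    """Standard binary metrics; labels assumed to be 1 (pathogenic) or 0 (neutral)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    # MCC denominator is zero when any row or column of the matrix is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        "f_measure": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


metrics = classification_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when the pathogenic and neutral classes are unbalanced.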

Molecular and Systems level characterisation of human Loss-of-Function mutations using Bioinformatics Approaches
COSI: VarI
  • H.A. Nagarajaram, University of Hyderabad, Hyderabad, India
  • Konduru Guruprasad Varma, Centre for DNA Fingerprinting and Diagnostics (CDFD), Hyderabad, India

Short Abstract: Loss-of-Function (LoF) mutations include nonsense Single Nucleotide Polymorphisms (SNPs), frameshift indels and splice-site SNPs, which usually lead to premature termination of transcription as well as translation. It has been estimated that a typical human genome harbours 149-182 putative LoF mutations. Identifying the disease-causing LoF mutations among these is a major bottleneck when applying whole exome/genome sequencing to the clinical diagnosis of diseases. For our analysis, we used putatively benign and pathogenic LoF mutations (except the splice-site SNPs) from publicly available databases. We analysed the impact of LoFs at the transcript level and at the protein level. We combined the transcript- and protein-level results and classified LoF mutations into three groups: (I) LoFs leading to complete loss of protein function, (II) LoFs leading to partial loss of function and (III) LoFs with no loss of function. We further analysed genes harbouring group I LoF mutations at the systems level using network approaches. Our studies show significant differences between pathogenic and putatively benign LoF mutations at various levels, and this knowledge will be used to develop a machine learning model to distinguish LoFs likely to be pathogenic from those likely to be benign.

On the role of common synonymous variants in complex diseases and traits
COSI: VarI
  • Inga Weißenborn, University of Lübeck, Germany
  • Jeanette Erdmann, University of Lübeck, Germany
  • Hauke Busch, University of Lübeck, Germany
  • Inken Wohlers, University of Lübeck, Germany

Short Abstract: Synonymous variants are usually neglected in genetic studies. Recently, their functional roles have been increasingly investigated, but not yet systematically with respect to diseases and traits.

We perform such an evaluation based on the genome-wide association studies (GWAS) catalog. Effects on transcription are assessed via expression quantitative trait locus (eQTL) annotations obtained with the tool Qtlizer. Effects on translation are evaluated via codon usage bias, using relative synonymous codon usage (RSCU) as a transcript-level quantification.
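
As a rough illustration of the RSCU measure mentioned above, the following sketch computes it for a single coding sequence using the usual definition: the observed count of a codon divided by the mean count of all codons encoding the same amino acid. It assumes the standard genetic code and an in-frame DNA sequence; the function name and the toy sequence are made up for this example, and the authors' actual pipeline may differ.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence.

    Amino acids that do not occur in the sequence are skipped, so their
    codons are absent from the result.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)
    values = {}
    for aa in set(AMINO):
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue
        mean = total / len(family)
        for codon in family:
            values[codon] = counts[codon] / mean
    return values


# Toy sequence: three GCT and one GCC (alanine), one AAA (lysine).
example = rscu("GCTGCTGCTGCCAAA")
```

An RSCU of 1.0 means a codon is used exactly as often as expected under uniform synonymous usage; values above or below 1.0 indicate over- or under-use. A synonymous variant that swaps a codon with high RSCU for one with low RSCU gives the kind of large RSCU difference the abstract refers to.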

There are 101 exclusively synonymous GWAS catalog variants in 94 genes, linked to 3,267 eQTL annotations, of which 99 eQTLs (3%) from 31 variants are flagged as best eQTLs. Notably, variant rs199533 in gene NSF, associated with Parkinson's disease and cancer, has the most eQTLs, indicating a gene regulatory role, and variant rs11568377, associated with systolic blood pressure in sickle cell anemia, affects codons with a large RSCU difference, plausibly interfering with protein folding. Finally, in an extended data set, we show that RSCU distributions for 39 of 119 trait-associated synonymous codons (33%) differ significantly from those of transcriptome-wide protein-coding sequences.

In summary, our results point to GWAS variants and biological mechanisms for follow-up studies, and suggest that functional roles of synonymous, disease-associated variants may be more common than intuitively expected.

PepVEP: A Tool to Retrieve Functional and Structural Data for Every Possible Missense Variant in the Human Proteome
COSI: VarI
  • James Stephenson, EMBL-EBI, United Kingdom
  • Alok Mishra, EMBL-EBI, United Kingdom
  • Andrew Nightingale, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom

Short Abstract: Functional data and structural context can aid our understanding of the specific roles each residue in a protein plays and offer insight into the potential for a variant to be associated with disease. Researchers can make use of an ever-increasing variety of data from different sources at the gene/protein level or at the level of single nucleotides or amino acids. The PepVEP platform collates functional and structural data from various EMBL-EBI resources at the protein residue level and maps to genomic coordinates to allow users to query any position in the proteome.
The data include protein-protein and protein-ligand interactions, both from every structure in the PDB and from experimentation. They also include mutagenesis experiments that directly assess specific variation, and all publicly available human variants from healthy and disease populations. The data can be retrieved programmatically via an API that accepts a variety of inputs, or through a user interface with additional features such as the variant position in every structure of the protein.
PepVEP is regularly updated with the latest data and offers clinical geneticists and researchers a single location to gather information on any specific missense change at any position in the proteome and to understand its potential impact.

Physico-chemical and structural features of pathogenic and benign human protein missense variations collected from HUMSAVAR and ClinVar
COSI: VarI
  • Giulia Babbi, University of Bologna - Biocomputing Group, Italy
  • Castrense Savojardo, University of Bologna - Biocomputing Group, Italy
  • Matteo Manfredi, University of Bologna - Biocomputing Group, Italy
  • Pier Luigi Martelli, University of Bologna - Biocomputing Group, Italy
  • Rita Casadio, University of Bologna - Biocomputing Group, Italy

Short Abstract: Modern sequencing technologies provide an unprecedented amount of data about missense single-nucleotide variations leading to changes in protein sequences. For many single residue variations (SRVs), links to genetic diseases are reported. From HUMSAVAR and ClinVar, we collected human SRVs whose effect on human health is annotated as Pathogenic/Likely Pathogenic (P/LP) or Benign/Likely Benign (B/LB).
After merging, the Union dataset contains 3,627 proteins carrying 75,927 SRVs. Of them, 44,543 and 31,384 are labelled as P/LP and B/LB, respectively. The intersection between SRVs from HUMSAVAR and ClinVar is limited: the two datasets share about 5% and 30% of B/LB and P/LP SRVs, respectively. This raises the question of to what extent the SRVs from different datasets share physico-chemical and structural features. With computational methods, we characterised solvent accessibility, flexibility and disorder of positions carrying P/LP and B/LB SRVs, and we compared the results obtained on ClinVar, HUMSAVAR and Union datasets. P/LP SRVs are significantly more abundant in buried/rigid positions, while B/LB SRVs occur preferentially in solvent-exposed/flexible regions. P/LP SRVs have a slight tendency to be more abundant than B/LB SRVs in non-disordered regions. Overall, the findings suggest that SRVs deriving from HUMSAVAR and ClinVar, despite their limited overlap, share common physico-chemical and structural features.

Predicting the effect of mutations on membrane proteins
COSI: VarI
  • Marianne Rooman, Université Libre de Bruxelles, Belgium
  • Fabrizio Pucci, Université Libre de Bruxelles, Belgium
  • Simon Assaf, Université Libre de Bruxelles, Belgium

Short Abstract: Integral membrane proteins form a broad family and are indispensable components of living cells. Understanding their function and stability is thus a major focus of biomedical and biotechnological research, considering for example that the majority of FDA-approved drugs target this class of proteins. The aim of this investigation is to understand how mutations impact the stability of membrane proteins. We start by defining a series of statistical potentials derived from a non-redundant set of membrane protein structures, which describe the stability properties of this class of proteins better than standard potentials derived from globular proteins. We then combine all the information gathered from these potentials using an artificial neural network approach and construct a prediction model called BraneMuSiC that is able to predict how point mutations affect the folding free energy for a set of about 300 mutations inserted in proteins with known structure. Application to test sets further confirms the accuracy of our predictions and shows that BraneMuSiC outperforms the state-of-the-art methods for folding free energy change predictions. Our method will thus be of importance in protein design, in order to rationally modify membrane protein biophysical characteristics, and in the evaluation of the deleteriousness of genetic variants that target them.

Prediction of the effects of mutations on the protein-protein interactions based on local structural variations
COSI: VarI
  • Yasser Mohseni Behbahani, Sorbonne Université, France
  • Elodie Laine, Sorbonne Université, France
  • Alessandra Carbone, Sorbonne Université, France

Short Abstract: Protein-protein interactions drive virtually all biological processes in living organisms and are necessary for cellular machinery. Disease-causing point mutations on a protein interface affect its ability to interact with its partners and interrupt the physiological mechanism of the cell. Hence, it is crucial to develop a systematic and accurate approach for assessing the impact of mutations on the formation and stability of protein complexes. Here, we report on a deep learning approach to directly estimate the changes in binding affinity upon mutation. Given the importance of local structural variations for this purpose, we implemented a siamese architecture that takes as input the local 3D environments around the mutation site in the wild-type and mutated forms of the complex. Thanks to the use of locally oriented frames, the architecture is invariant to 3D translations and rotations. The 3D environments are extracted from conformations generated by the Rosetta-Backrub algorithm, which explicitly models the flexibility of the backbone and side-chains and accounts for their fluctuations around the native state. We evaluated the performance of our approach against experimental binding affinity measurements from SKEMPI-2.0. Our predictive performance on a completely blind test with 50 complexes (one mutation per complex) is comparable to or better than the state of the art.

The impact of missense SNPs on amyloidogenesis: The example of Aβ and Alzheimer’s disease
COSI: VarI
  • Fotis Galanis, National and Kapodistrian University of Athens, Greece
  • Avgi Apostolakou, National and Kapodistrian University of Athens, Greece
  • Georgia Nasi, National and Kapodistrian University of Athens, Greece
  • Zoi Litou, National and Kapodistrian University of Athens, Greece
  • Vassiliki Iconomidou, National and Kapodistrian University of Athens, Greece

Short Abstract: The deposition of amyloid fibrils is a characteristic of a variety of diseases, including Alzheimer’s disease (AD). Proteins and peptides with a tendency to form such depositions are called amyloidogenic, e.g. the Aβ peptide, the primary component of the amyloid plaques characteristic of AD. In this work we analyzed how missense SNPs (msSNPs) affect amyloidogenic proteins. Amyloidogenic proteins were collected from AmyCo, the Amyloidoses Collection, a repository created by our lab that contains information about amyloidoses and diseases related to amyloid deposition. msSNPs were extracted from dbSNP, ClinVar and UniProt. Statistical analyses, such as the Chi-squared test, were performed to determine, for example, whether alterations to residue properties are correlated with pathogenic msSNPs. Additional analysis focused on msSNPs found in amyloidogenic-prone segments as predicted by AMYLPRED2. It was shown that msSNPs located in those segments are more likely to be pathogenic. To explore how msSNPs contribute to the onset of disease, Aβ Precursor Protein (APP) and AD were used as an example. Pathogenic msSNPs of APP are mostly gathered in and around the Aβ sequence, affecting the proteolytic cleavage of APP or the tendency of Aβ to aggregate. APP variants have a significant role in AD and should be considered when designing pharmaceuticals.

The Regulatory Mendelian Mutation (ReMM) score for GRCh38
COSI: VarI
  • Lusiné Nazaretyan, Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Germany
  • Max Schubach, Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Germany
  • Martin Kircher, Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Germany

Short Abstract: Despite a consensus that regulatory mutations play an important role in disease, computational tools supporting their identification are limited and frequently unavailable for the recent genome build. Here, we rebuild the ReMM score for prioritizing non-coding mutations in the GRCh38 human genome assembly. We contrast a curated set of 406 regulatory variants causative for Mendelian disorders and millions of human-derived sequence alterations (as proxy for non-pathogenic variation). We use a set of 26 genomic features combining epigenetic profiles, species conservation and density of disease and population variants to train a hyper-ensemble random forest model. The entire workflow is based on Snakemake, which improves reproducibility and facilitates adaption of the model for future genome releases and inclusion of new features. We achieve an average precision of 0.57 on our data, which compares favorably to the original ReMM version of the GRCh37 build (0.50). We observe moderate correlation of scores (0.72) between genome builds, which we ascribe to the changes in the feature sets as well as adjustments in feature importance of the model. Our work provides a reliable tool for scoring pathogenicity of human regulatory variants and will facilitate further developments of the ReMM score. GRCh38 scores are available at doi.org/10.5281/zenodo.4768448.

Validation of genetic variants from NGS data using Deep Convolutional Neural Networks
COSI: VarI
  • Marc Vaisband, Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research; University of Bonn, Germany
  • Maria Schubert, Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research, Austria
  • Franz Josef Gassner, Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research, Austria
  • Roland Geisberger, Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research, Austria
  • Richard Greil, Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research, Austria
  • Nadja Zaborsky, Salzburg Cancer Research Institute - Laboratory for Immunological and Molecular Cancer Research, Austria
  • Jan Hasenauer, University of Bonn, Germany

Short Abstract: One of the most important frontiers in computational biology and biomedicine is the comprehensive analysis of Next-Generation Sequencing (NGS) data. In cancer research in particular, the identification of somatic mutations is vital for the investigation of their effects on disease progression and treatment response. This is done by considering the sequenced tumour DNA and a reference germline sample, and identifying candidate variants by way of comparison. Despite automated filtering, however, sequencing artifacts or alignment errors are often mistakenly flagged as variants. For this reason, researchers must perform extremely time-consuming manual screening. We demonstrate that it is possible to reliably automate this process using Deep Convolutional Neural Networks (CNNs), whose utility has been behind many recent successes in applied machine learning. Using previously performed manual annotation as input data, we trained a CNN model that recognises sequencing artifacts with high accuracy, achieving a 5-fold cross-validation score of 96%, on par with human reviewers. Moreover, we show how this can be extended to account for artifacts specific to library preparation, which require comparison with additional sequencing tracks. Altogether, this allows for a significant reduction in the workload for researchers, and can in the future be integrated into bioinformatics workflows for NGS data processing.



International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
