Posters - Schedules

Monday, July 11 and Tuesday, July 12 between 12:30 PM CDT and 2:30 PM CDT
Wednesday, July 13 between 12:30 PM CDT and 2:30 PM CDT
Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 11 between 7:30 AM CDT - 10:00 AM CDT
Session A Posters dismantle:
Tuesday, July 12 at 6:00 PM CDT
Session B Poster Set-up and Dismantle
Session B Posters set up:
Wednesday, July 13 between 7:30 AM - 10:00 AM CDT
Session B Posters dismantle:
Thursday, July 14 at 2:00 PM CDT
Virtual: AHRD - Protein Function Transfer and Evaluation
COSI: Function
  • Florian Boecker, Crop Bioinformatics Uni Bonn, Germany
  • Heiko Schoof, Crop Bioinformatics Uni Bonn, Germany


Genome-scale protein annotation can be performed by the transfer of functions from known proteins matched via sequence similarity. Errors can propagate when annotations falsely generated in this manner make their way into public databases and are used as basis for subsequent function transfers. Our program "Automatic assignment of Human Readable Descriptions" (AHRD) can overcome these pitfalls by avoiding previously transferred annotations. It emulates the decision process of a human curator to select a description and GO terms from sequence similarity search results. To assess the annotation performance of AHRD and competing methods accurately, we firstly generated an unbiased ground truth set of high quality protein annotations with minimal redundancy. It enables one to contrast annotation methods because it contains many proteins that are difficult to annotate. Secondly, we implemented an evaluation metric that uses the information content of GO terms to determine the semantic similarity of GO annotations. Using both, we are able to compare AHRD with some of its competitors and can show that it is able to predict descriptions and GO annotations at high quality while simultaneously maintaining a competitive coverage.
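
As a rough illustration of an information-content-based score for GO annotations, the sketch below weights each term by its rarity and computes a weighted F1 between predicted and reference term sets. This is not the AHRD evaluation metric itself: the function names, the toy term labels, and the choice of a weighted F1 are assumptions made for this example, and term sets are assumed to be already propagated to their ancestors.

```python
import math
from collections import Counter


def information_content(annotations):
    """Map each term to -log2 of the fraction of proteins annotated with it."""
    n_proteins = len(annotations)
    counts = Counter(t for terms in annotations.values() for t in set(terms))
    return {t: -math.log2(c / n_proteins) for t, c in counts.items()}


def ic_weighted_f1(predicted, truth, ic):
    """F1 in which every term counts with its information content."""
    def weight(terms):
        return sum(ic.get(t, 0.0) for t in terms)

    pred, true = set(predicted), set(truth)
    shared = weight(pred & true)
    if shared == 0:
        return 0.0
    precision = shared / weight(pred)
    recall = shared / weight(true)
    return 2 * precision * recall / (precision + recall)


# Toy reference set with placeholder term labels (not real GO identifiers).
reference = {
    "P1": {"term_root", "term_b"},
    "P2": {"term_root"},
    "P3": {"term_root", "term_c"},
    "P4": {"term_root", "term_b"},
}
ic = information_content(reference)
score = ic_weighted_f1({"term_root", "term_b", "term_c"}, reference["P3"], ic)
```

A term shared by every protein carries zero information here, so a prediction that recovers only such a term scores 0, while rare terms dominate the score.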

Virtual: AlphaFold models can inform the prediction of functional LIR-motifs
COSI: Function
  • Vasilis Promponas, Department of Biological Sciences, University of Cyprus, Cyprus
  • Agathangelos Chatzichristofi, Department of Biological Sciences, University of Cyprus, Cyprus
  • Vasileios Sagris, University of Cyprus, Cyprus
  • Aristos Pallaris, University of Cyprus, Cyprus
  • Marios Eftychiou, University of Cyprus, Cyprus
  • Ioanna Kalvari, University of Cyprus, Cyprus
  • Nicholas Price, University of Cyprus, Cyprus
  • Theodosios Theodosiou, University of Cyprus, Cyprus
  • Ioannis Iliopoulos, U of Crete, Greece
  • Ioannis Nezis, The University of Warwick, United Kingdom


In selective macroautophagy, selectivity is mainly achieved through receptor proteins that bind to the surface of members of the Atg8 protein family by anchoring a small linear peptide (AIM in plants and fungi, LIR-motif in mammals). After the discovery of the first few LIR-containing proteins (LIRCPs), several definitions of the LIR-motif have appeared in the literature, highlighting a short core consensus sequence described by the regular expression pattern [WFY]xx[VLI]; accumulating experimental evidence pinpoints the importance of regions flanking the core LIR-motif. Several reports in the literature indicate that short linear motif binders often reside in intrinsically disordered regions; however, there is no extensive evidence in the particular case of LIR-motifs, primarily due to the lack of relevant experimental data. In this work, we explore predicted 3D structures from the AlphaFold Database, together with the recently reported association between regions predicted with low pLDDT scores and intrinsic disorder, to study LIR-motif containing proteins. We systematically characterize the “disorderliness” properties of LIR-motifs and their flanking regions and suggest that such properties can be used to develop successful predictors of LIR-motifs in combination with sequence-derived features.
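
The core consensus [WFY]xx[VLI] can be scanned for with a plain regular expression. The sketch below is only an illustration of that pattern: the function name, the flank length, and the example sequence are made up for this example, and it does not reproduce the authors' analysis of pLDDT or disorder.

```python
import re

# Core consensus [WFY]xx[VLI], where "x" is any residue.
# The lookahead lets overlapping candidate motifs be reported as well.
CORE_LIR = re.compile(r"(?=([WFY][A-Z]{2}[VLI]))")


def find_core_lir(sequence, flank=3):
    """Return candidate core motifs with 1-based start and flanking residues."""
    seq = sequence.upper()
    hits = []
    for match in CORE_LIR.finditer(seq):
        start = match.start()
        hits.append({
            "start": start + 1,
            "motif": match.group(1),
            "left_flank": seq[max(0, start - flank):start],
            "right_flank": seq[start + 4:start + 4 + flank],
        })
    return hits


# Made-up sequence used only to exercise the pattern.
hits = find_core_lir("AAWEELAAFYVIAA")
```

A match to the core pattern is only a candidate; as the abstract notes, flanking regions and structural context matter for deciding which candidates are functional.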

Virtual: Anti-CRISPR prediction by using Transformer Model
COSI: Function
  • Chan-Seok Jeong, Korea Institute of Science and Technology Information, South Korea


Anti-CRISPRs, a family of proteins that hinder the CRISPR-Cas systems of prokaryotic immunity, have recently emerged as natural inhibitors of CRISPR-Cas, allowing for post-translational regulation of CRISPR-Cas systems in a range of applications. Although experimental strategies for discovering anti-CRISPRs have been developed, bioinformatic prediction may offer a more cost-effective screening strategy. However, algorithm development is difficult due to a dearth of verified anti-CRISPR data and poor sequence similarity. Here, we describe an approach for predicting anti-CRISPR proteins from amino acid sequences by fine-tuning a pre-trained Transformer model for a classification task. We predict the anti-CRISPR function of a given amino acid sequence by adding a classification layer to a Transformer model pre-trained on the unlabeled amino acid sequences of Pfam. The resulting model is fine-tuned with further training on validated anti-CRISPR and putative non-CRISPR prophage data sets. We evaluate performance on independent data sets and compare it with conventional predictors. Unlike traditional predictors, which require additional feature calculations and pre-filtering procedures, the present method only requires amino acid sequences, making it ideal for genome-scale investigations.

Virtual: Inferring MicroRNA Regulation from Inspecting the Proteome
COSI: Function
  • Michal Linial, The Hebrew University of Jerusalem, Israel
  • Dan Ofer, The Hebrew University of Jerusalem, Israel


Post-transcriptional regulation in multicellular organisms is mediated by microRNAs (miRNAs). However, the mechanisms that determine if a gene is regulated by miRNAs are poorly understood. Previous works focused mostly on miRNA seed matches and features of the 3’-UTR of transcripts. These common computational approaches still yield poor, inconsistent results, flooded with false positives. In this work, we present a novel, automated machine learning (ML) framework that uses sequence as well as diverse proteome-derived functional annotations to train models on multiple organisms using experimentally validated data. We present insights from millions of features extracted and ranked from different modalities. We show high predictive performance per organism and in generalization across species. We provide a list of novel predictions for Danio rerio (zebrafish) and Arabidopsis thaliana (mouse-ear cress). We found that most membranous and disease-related proteins are regulated by miRNAs, but the G-protein coupled receptor (GPCR) family is an exception, being mostly unregulated by miRNAs. We further show that the evolutionary conservation among duplicated genes does not imply a coherence in miRNA regulation. We conclude that duplicated genes diverge in their tendency to be miRNA regulated. However, protein function is informative across species in predicting post-transcriptional miRNA regulation in living cells.

Virtual: Paralog function annotation with ProfileView
COSI: Function
  • Edoardo Sarti, Inria Université Côte d'Azur, France
  • Théo Le Moigne, Sorbonne Université, France
  • Julien Henri, Sorbonne Université, France
  • Alessandra Carbone, Sorbonne Université, France


Annotating the function of paralogous sequences has always been very challenging both in small-scale, expert-guided assays and large-scale bioinformatics studies, where paralogs are the most important source of functional annotation errors. ProfileView is a novel computational method designed to functionally classify sets of homologous sequences. It constructs a library of probabilistic models accurately representing the functional variability of protein families, and extracts biologically interpretable information from the classification process. Although ProfileView has been tested with success on several non-isofunctional protein families, paralog function recognition poses a new, complex challenge that requires careful analysis. We have devised a pipeline that applies ProfileView to the problem of paralog recognition and transfers the knowledge of small-scale expert-guided annotations to the entire protein family. We have tested it on the 11 proteins composing the Calvin-Benson cycle (CBC), and obtained fully consistent results on 8 of them and partially consistent results on 2 others. The knowledge about paralog function annotation in the CBC is now being employed to match same-function paralog sequences and produce joint MSAs for protein-protein interaction studies.

Virtual: Protein embeddings and deep learning predict binding residues for various ligand classes
COSI: Function
  • Maria Littmann, Technical University of Munich, Germany
  • Michael Heinzinger, Department of Informatics, Technical University of Munich, Germany
  • Christian Dallago, Department of Informatics, Technical University of Munich, Germany
  • Konstantin Weissenow, Department of Informatics, Technical University of Munich, Germany
  • Burkhard Rost, Department of Informatics, Technical University of Munich, Germany


One important aspect of protein function is the binding of proteins to ligands, including small molecules, metal ions, and macromolecules such as DNA or RNA. Despite decades of experimental progress many binding sites remain obscure. We proposed bindEmbed21, a method predicting whether a protein residue binds to metal ions, nucleic acids, or small molecules. The Artificial Intelligence (AI)-based method exclusively uses embeddings from the Transformer-based protein Language Model (pLM) ProtT5 as input. Using only single sequences without creating multiple sequence alignments (MSAs), bindEmbed21DL outperformed MSA-based predictions. Combination with homology-based inference increased performance to F1=48±3% (95% CI) and MCC=0.46±0.04 when merging all three ligand classes into one. All results were confirmed by three independent data sets. Focusing on very reliably predicted residues could complement experimental evidence: For the 25% most strongly predicted binding residues, at least 73% were correctly predicted even when ignoring the problem of missing experimental annotations. The new method bindEmbed21 is fast, simple, and broadly applicable - neither using structure nor MSAs. Thereby, it found binding residues in over 42% of all human proteins not otherwise implied in binding and predicted about 6% of all residues as binding to metal ions, nucleic acids, or small molecules.
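
For readers unfamiliar with the two reported measures, the sketch below computes F1 and the Matthews correlation coefficient (MCC) from per-residue confusion counts using their standard definitions. The function name and the counts in the usage lines are invented for illustration and are not taken from the bindEmbed21 evaluation.

```python
import math


def f1_and_mcc(tp, fp, tn, fn):
    """Return (F1, MCC) from confusion-matrix counts for binary predictions."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    # MCC also uses the true negatives, so it stays informative
    # when non-binding residues vastly outnumber binding ones.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Invented counts: 40 binding residues found, 10 missed,
# 10 false alarms, 40 non-binding residues correctly rejected.
f1, mcc = f1_and_mcc(tp=40, fp=10, tn=40, fn=10)
```

With these toy counts F1 is about 0.8 and MCC about 0.6, which shows how the two measures can differ on the same predictions.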

Virtual: ProteinBERT: a universal deep-learning model of protein sequence and function
COSI: Function
  • Dan Ofer, The Hebrew University of Jerusalem, Israel
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Yam Peleg, Deep Trading, Israel
  • Nadav Rappoport, Ben-Gurion University of the Negev, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel


Self-supervised deep language modeling has shown unprecedented success across natural language tasks, and has recently been repurposed to biological sequences. However, existing models and pretraining methods are designed and optimized for text analysis. We introduce ProteinBERT, a deep language model specifically designed for proteins. Our pretraining scheme combines language modeling with a novel task of Gene Ontology (GO) annotation prediction. We introduce novel architectural elements that make the model highly efficient and flexible to long sequences. The architecture of ProteinBERT consists of both local and global representations, allowing end-to-end processing of these types of inputs and outputs.
ProteinBERT obtains near state-of-the-art performance on multiple benchmarks covering diverse protein properties (including protein structure, post-translational modifications and biophysical attributes), despite using a far smaller and faster model than competing deep-learning methods. Overall, ProteinBERT provides an efficient framework for rapidly training protein predictors, even with limited labeled data.

F-001: RAPPPID: Deep, Regularised Protein-Protein Interaction Prediction that Generalises to Unseen Proteins
COSI: Function
  • Joseph Szymborski, McGill University, Canada
  • Amin Emad, McGill University, Canada


Computational methods for the prediction of protein-protein interactions, while important tools for researchers, are plagued by challenges in generalising to unseen proteins. Datasets used for modelling protein-protein predictions are particularly predisposed to information leakage and sampling biases.

In this study, we introduce RAPPPID, a method for the Regularised Automatic Prediction of Protein-Protein Interactions using Deep Learning. RAPPPID is a twin AWD-LSTM network which employs multiple regularisation methods during training time to learn generalised weights. Testing on stringent interaction datasets composed of proteins not seen during training, RAPPPID outperforms state-of-the-art methods. Further experiments show that RAPPPID’s performance holds regardless of the particular proteins in the testing set and its performance is higher for biologically supported edges. This study serves to demonstrate that careful attention to data leakage, and the corresponding adjustments to architecture and training routine that RAPPPID represents, are important to creating protein-protein interaction models which generalise to unseen proteins. Additionally, as part of this study, we provide datasets corresponding to several data splits of various strictness, in order to facilitate assessment of PPI reconstruction methods by others in the future.

Code and datasets are freely available at https://github.com/jszym/rapppid.

A pre-print is available at https://doi.org/10.1101/2021.08.13.456309.
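
One way to picture a strict data split of the kind described above is to partition the proteins first and then keep only interaction pairs whose two members fall on the same side. The sketch below is an assumption-laden illustration of that idea, not the procedure used to build the RAPPPID datasets; the function name, the default fraction, and the toy pairs are hypothetical.

```python
import random


def strict_protein_split(pairs, test_fraction=0.2, seed=0):
    """Split interaction pairs so that no test protein occurs in training."""
    proteins = sorted({p for pair in pairs for p in pair})
    rng = random.Random(seed)
    rng.shuffle(proteins)
    n_test = max(1, int(len(proteins) * test_fraction))
    test_proteins = set(proteins[:n_test])

    train, test, dropped = [], [], []
    for a, b in pairs:
        in_test = (a in test_proteins, b in test_proteins)
        if all(in_test):
            test.append((a, b))       # both proteins unseen in training
        elif not any(in_test):
            train.append((a, b))      # both proteins are training-only
        else:
            dropped.append((a, b))    # mixed pairs would leak information
    return train, test, dropped


# Toy pairs with placeholder protein names.
toy_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E"),
             ("E", "F"), ("D", "F"), ("C", "D"), ("A", "F")]
train, test, dropped = strict_protein_split(toy_pairs, test_fraction=0.5)
```

Discarding mixed pairs costs data, which is the trade-off between strictness and dataset size that splits of varying strictness make explicit.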

F-002: Network Based Analysis of Microbial Function Reveals Putative Bacterial Pathways
COSI: Function
  • Henri Chung, Iowa State University, United States
  • Yannick Mahlich, Rutgers University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Yana Bromberg, Rutgers University, United States


Even though bacterial functions have been studied for over 150 years, the vast majority of the bacterial world remains unexplored. Here, we incorporated a significantly larger number of proteins into our classification scheme for a more in-depth look into microbial molecular functionality. In addition, we developed a novel network-based method for identifying relationships between functions based on their phylogenetic profiles: the presence or absence in individual bacterial strains. Linked protein functions, e.g. members of the same pathway, are likely to have been preserved or eliminated in new species throughout their evolutionary history. For a set of known fusion functions, which were mapped to Enzyme Commission Classification (EC) numbers and corresponding known pathways from the Kyoto Encyclopedia of Genes and Genomes (KEGG), we demonstrate a shorter average phylogenetic distance than that of functions from different pathways. We further show that clustering the network of functional relationships captures groups of functions which participate in shared pathways. We thus highlight previously unseen groupings of functions that could potentially encode for unknown or unannotated pathways. We also note that a similar approach to pathway annotation can be applied to metagenomic data for identification of emergent functionality carried out across multiple organisms.
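
The idea of relating functions through their phylogenetic profiles can be sketched as follows: each function is represented by the set of strains in which it is present, and two functions are linked when their profiles are close. The abstract does not name the distance measure, so the Jaccard distance, the threshold, and all names below are assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard_distance(profile_a, profile_b):
    """Distance between two presence profiles given as sets of strain names."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def link_functions(profiles, max_distance=0.5):
    """Return network edges (function, function, distance) for close profiles."""
    edges = []
    for f, g in combinations(sorted(profiles), 2):
        distance = jaccard_distance(profiles[f], profiles[g])
        if distance <= max_distance:
            edges.append((f, g, distance))
    return edges


# Toy profiles: which (placeholder) strains carry each function.
profiles = {
    "func1": {"s1", "s2", "s3"},
    "func2": {"s1", "s2", "s3", "s4"},
    "func3": {"s5"},
}
edges = link_functions(profiles)
```

Here func1 and func2 co-occur in three of four strains and become linked, while func3 stays isolated; clustering such a network is what would group candidate pathway members.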

F-003: Discovering Proteins: Function to Name
COSI: Function
  • Akshay Agarwal, IBM Research, United States
  • Sara Capponi, IBM Research, CCC, United States
  • Edward Seabolt, IBM, United States
  • Kristen Beck, IBM, United States
  • James Kaufman, IBM, United States


Currently, approximately half of all microbial proteins are tagged as putative or hypothetical proteins and lack functional annotation, which leads to a reduced understanding of biological function at the genome level and limits the classification of microorganisms, especially pathogens. Here, we developed an approach to perform functional annotation of hypothetical proteins from over 50 million named proteins and 27K functional codes (InterProScan domain codes). We train 3 separate models for performing functional annotations at domain, family, and superfamily levels using Kraken. Furthermore, we construct a functional space to visualize these proteins and perform biological validation of results, while also enabling the discovery of potentially new proteins and their function. Most interestingly, this high dimensional functional space will facilitate the shift from genotype to phenotype for named proteins. Leveraging this space, we identify function-based clusters; if new clusters are formed due to improved annotation of hypothetical proteins, we will possibly uncover and understand evolutionary paths shared with known proteins. For our work, we use data from our Functional Genomics Platform, which has over 300K prokaryotic genomes, 75 million gene sequences, 55 million protein sequences, and over 260 million functional domains.

F-004: Integrated gene expression and variant analysis to investigate CVD genes with associated phenotypes among high-risk Heart Failure patients.
COSI: Function
  • Zeeshan Ahmed, Institute for Health, Health Care Policy and Aging Research, Rutgers, The State University of New Jersey, United States


Cardiovascular disease (CVD) is a leading cause of premature mortality in the US and the world. CVD comprises several complex and mostly heritable conditions, which range from myocardial infarction to congenital heart disease. Here, we report our findings from an integrative analysis of gene expression, disease-causing gene variants, and associated phenotypes among CVD populations, with a focus on high-risk Heart Failure (HF) patients. Our in-depth gene expression analysis revealed differentially expressed genes associated with HF (41 genes) and other CVDs (23 genes). Furthermore, a variant analysis of whole-genome sequence data of CVD patients identified genes with altered gene expression (FLNA, CST3, LGALS3, and HBA1) with functional and nonfunctional mutations in these genes. Our study highlights the importance of an integrative approach that leverages gene expression, genetic mutations, and clinical data that will allow the prioritization of key driver genes for complex diseases to improve personalized healthcare.

F-005: Tensor decomposition and principal component analysis based unsupervised feature extraction with optimized standard deviation applied to gene expression, DNA methylation and histone modification
COSI: Function
  • Y-H. Taguchi, 田口善弘, Japan
  • Ryo Ishibashi, Chuo University, Japan


Tensor decomposition and principal component analysis based unsupervised feature extraction were proposed almost ten years and five years ago, respectively. Although they have been successfully applied to a wide range of problems, they have two fundamental problems: the null hypothesis that the derived principal component and singular value vectors obey a Gaussian distribution is not fully satisfied, and the number of selected features is too small to conclude that there are no false negatives. Both problems have recently been mitigated by the introduction of standard deviation optimization. In this paper, we briefly describe this recent progress.

F-006: Interpretable modeling of genotype-phenotype landscapes with state-of-the-art predictive accuracy
COSI: Function
  • Peter Tonner, National Institute of Standards and Technology, United States
  • Abe Pressman, National Institute of Standards and Technology, United States
  • David Ross, National Institute of Standards and Technology, United States


Large-scale measurements linking genetic background (genotype) to corresponding function (phenotype) have grown steadily in size and scale. From these measurements, our ability to predict novel relationships in the genotype-phenotype landscape (GPL) directly impacts diverse biological sciences. Currently, these predictions are often made with neural networks (NNs) due to their unsurpassed accuracy in out-of-sample extrapolation. But NNs cannot easily explain their predictions, which hinders insight into the underlying biophysical system measured by the GPL. Here, we present an interpretable model of GPLs, called LANTERN. LANTERN learns a representation of every genetic mutation in the form of a latent, low-dimensional vector. Predictions from LANTERN then combine these representations, and connect them to measured phenotypes through a smooth, nonlinear surface. LANTERN is therefore fully interpretable: every prediction from the model is easily explained through these simple components. These components also reveal how different biophysical mechanisms, including structural biochemistry and changes in free energy, influence different GPLs. Despite its simplicity, we also show that LANTERN equals or outperforms alternative models in predictive accuracy, including NNs. So, LANTERN provides state-of-the-art prediction on GPL data while also being fully interpretable.

F-007: Integrating multimodal data through interpretable heterogeneous ensembles
COSI: Function
  • Yan Chak Li, Icahn School of Medicine at Mount Sinai, United States
  • Linhua Wang, Baylor College of Medicine, United States
  • Jeffrey Law, National Renewable Energy Laboratory, United States
  • T. M. Murali, Virginia Tech, United States
  • Gaurav Pandey, Icahn School of Medicine at Mount Sinai, United States


Most biomedical data integration methods follow the early approach of aggregating multimodal datasets into a uniform representation like networks that can then be analyzed. This approach may diminish local information exclusive to each data modality. We propose the novel Ensemble Integration (EI) framework based on the late approach that can address this challenge. EI offers the flexibility of inferring effective local predictive models from individual modalities, before aggregating them into a global heterogeneous ensemble model. We also propose a novel interpretation method for EI. We tested EI for predicting GO term annotations from multimodal STRING data. Across 2,139 GO terms, EI performed significantly better than other data integration approaches and individual STRING modalities (FDR < 9.34 × 10^-14). For predicting COVID-19 mortality from multimodal electronic health record data, EI performed significantly better than the individual modalities (FDR < 0.0082), and slightly better than an XGBoost model. The best-performing EI model also revealed several features, such as age, minimum oxygen saturation and several laboratory tests, that have been confirmed as relevant to COVID-19 mortality in previous clinical studies. These results demonstrated the effectiveness of EI for biomedical data integration and predictive modeling, as well as revealing problem-relevant information from complex multimodal datasets.
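
The late integration idea, one local model per modality followed by aggregation, can be sketched in a few lines. The example below is a deliberately simplified stand-in and not the EI implementation: the nearest-centroid scorer, the mean aggregation, the modality names, and the toy numbers are all assumptions made for illustration.

```python
def fit_centroid_model(rows, labels):
    """Local model: score P(label == 1) from distances to the class centroids."""
    def centroid(cls):
        members = [r for r, y in zip(rows, labels) if y == cls]
        return [sum(col) / len(members) for col in zip(*members)]

    c0, c1 = centroid(0), centroid(1)

    def score(row):
        d0 = sum((x - c) ** 2 for x, c in zip(row, c0))
        d1 = sum((x - c) ** 2 for x, c in zip(row, c1))
        return 0.5 if d0 + d1 == 0 else d0 / (d0 + d1)

    return score


def fit_late_ensemble(modalities, labels):
    """Fit one local model per modality, then average their scores."""
    local = {name: fit_centroid_model(rows, labels)
             for name, rows in modalities.items()}

    def predict(sample):
        scores = [local[name](sample[name]) for name in local]
        return sum(scores) / len(scores)

    return predict


# Two toy modalities with different feature spaces for the same four samples.
modalities = {
    "expression": [[0.0], [0.2], [1.0], [0.8]],
    "network": [[5.0, 1.0], [4.0, 1.0], [1.0, 3.0], [0.0, 3.0]],
}
labels = [0, 0, 1, 1]
predict = fit_late_ensemble(modalities, labels)
positive_like = predict({"expression": [0.9], "network": [0.5, 3.0]})
negative_like = predict({"expression": [0.1], "network": [4.5, 1.0]})
```

Because each modality keeps its own feature space and model, information specific to one modality is not flattened away before learning, which is the motivation given above for the late approach.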

F-008: The field of protein function prediction as viewed by different domain scientists
COSI: Function
  • Rashika Ramola, Northeastern University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Predrag Radivojac, Northeastern University, United States


Experimental biologists, biocurators, and computational biologists all play a role in characterizing a protein’s function. The discovery of protein function in the laboratory by experimental scientists is the foundation of our knowledge about proteins. Experimental findings are compiled in knowledge-bases by biocurators to provide standardized, readily accessible, and computationally amenable information. Computational biologists train their methods using these data to predict protein function and guide subsequent experiments. To understand the state of affairs in this ecosystem, centered here around protein function prediction, we surveyed scientists from these three constituent communities. Most strikingly, we find that experimentalists rarely use modern prediction software, but when presented with predictions, report many to be surprising and useful. Ontologies appear to be highly valued by biocurators, less so by experimentalists and computational biologists, yet controlled vocabularies bridge the communities and simplify the prediction task. Additionally, many software tools are not readily accessible and the predictions presented to the users can be broad and uninformative. To meet both the social and technical challenges in the field, a more productive and meaningful interaction between members of the core communities is necessary.

F-009: Machine Learning to Uncover Microbial Function Indicators for Earth Biomes
COSI: Function
  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Ziming Yang, Brookhaven National Laboratory, United States
  • Sean Jungbluth, Lawrence Berkeley National Laboratory, United States
  • Shane Cannon, Lawrence Berkeley National Laboratory, United States
  • Paramvir Dehal, Lawrence Berkeley National Laboratory, United States
  • Adam Arkin, Lawrence Berkeley National Laboratory, United States


Microbial life is a critical component of Earth biomes, and the taxonomic and functional information from environmental genomics can provide insights into microbial roles in the environment. However, comparing this data across metagenomes can be challenging; furthermore, abundance differences may not reflect important functional differences between environments. Here we aim to use machine learning to: 1) build and evaluate robust biome classification models using standardized data from ~32,000 metagenomes; and 2) identify important classification features and how they relate to our understanding of environments on Earth. We constructed a feature table associating metagenome sample environment labels with Gene Ontology (GO) term abundance profiles. Using top multiclass classification methods, we performed classification model-building experiments by varying data preparation and feature selection strategies. With a permutation analysis for feature importance from the top evaluated models, we extracted functions important for classification as well as function indicators for specific biomes. The important biome functions lead to better separation and grouping of samples according to their biome labels; furthermore, model predictions suggest sample labeling improvements. Our results provide a high performance metagenome biome classification model and enable model interpretability to learn important biome indicator functions as well as biome and function relationships.
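
Permutation analysis for feature importance can be illustrated generically: shuffle one feature column at a time and record how much the model's accuracy drops. The sketch below is a minimal version under stated assumptions; the threshold classifier, the biome labels, and the abundance values are invented and do not come from the study.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of samples for which the model returns the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_repeats=20, seed=0):
    """Mean accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(model, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical classifier that only looks at the first function's abundance.
def toy_model(row):
    return "soil" if row[0] > 0.5 else "marine"


rows = [[0.9, 0.1], [0.8, 0.7], [0.1, 0.2],
        [0.2, 0.9], [0.95, 0.5], [0.05, 0.4]]
labels = ["soil", "soil", "marine", "marine", "soil", "marine"]
importances = permutation_importance(toy_model, rows, labels)
```

The second feature is ignored by the toy model, so shuffling it never changes accuracy and its importance is exactly zero, while the first feature shows a positive drop.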

F-010: The Expanding World of Metabolic Enzymes Moonlighting as RNA Binding Proteins
COSI: Function
  • Constance Jeffery, University of Illinois at Chicago, United States
  • Nicole Curtis, University of Illinois at Chicago, United States
  • Cesar Siete, University of Illinois at Chicago, United States
  • Victoria Ogunniyi, University of Illinois at Chicago, United States
  • Krupa Patel, University of Illinois at Chicago, United States


RNA binding proteins play critical roles throughout the lifetime of RNA in processing of new transcripts, regulation of translation, and RNA stability. Recently, proteomics studies have identified dozens of enzymes in intermediary metabolism that bind to RNA but do not contain canonical RNA binding domains. Studies of a few “classic” examples such as aconitase have shown that combining catalytic and RNA binding functions in one protein can be a mechanism to sense the cell’s metabolic state through availability of the enzyme’s ligands and respond by regulating translation of specific transcripts. Conversely, RNA binding could regulate the enzyme’s catalytic activity, through blocking the active site, allosteric effects, acting as a scaffold, or sequestering enzymes. More information about the locations of the RNA binding sites will aid in predicting which other proteins also act as noncanonical RNA binding proteins. Information gained from studying enzymes in carbohydrate, amino acid, and lipid metabolism that act as noncanonical RNA binding proteins will increase our understanding of the coordination between central metabolic pathways and RNA metabolism. This information can be applied in the future to the design of novel proteins that regulate RNA translation, stability, and lifetime, as well as RNAs that regulate enzyme function.

F-011: Results from the CAFA4 challenge on the prediction of protein function
COSI: Function
  • Predrag Radivojac, Northeastern University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Sean Mooney, University of Washington, United States
  • Casey Greene, University of Colorado, United States
  • Yisu Peng, Northeastern University, United States


The Critical Assessment of Functional Annotation (CAFA) is a community challenge aimed at evaluating the effectiveness of computational prediction of protein function. The challenge is organized every three years, starting in 2010-2011, with the 4th round of the experiment taking place in 2019-2020 and the analysis completed on data from 2021. CAFA4 reports small but significant improvement in function prediction in the biological process aspect of functional annotation, whereas the state of the art has not appreciably changed in the molecular function and cellular component ontologies. CAFA4 additionally evaluated the prediction of function for intrinsically disordered proteins and introduced more comprehensive evaluation schemes for the proteins that were partially annotated in a specific ontology at the prediction submission time but subsequently gained new terms in the same ontology. We find that the prediction accuracy for such annotations is lower due to larger average term depth, but that all methods significantly outperformed the baseline methods. Overall, to further increase the accuracy of protein function prediction methods, an influx of new computational techniques will be necessary.

F-012: Prioritizing important regions of sequencing data for function prediction
COSI: Function
  • Mahdi Baghbanzadeh, Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, United States
  • Tyson Dawson, Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, United States
  • Ali Rahnavard, Computational Biology Institute, Milken Institute School of Public Health, The George Washington University, United States


Genomics data holds great promise in understanding the causes, characteristics, and potential interventions for diseases because of the specificity of DNA snapshots and mutations from individuals. There are many challenges with this data, including non-independent observations, large noise components, nonlinearity, collinearity, and high dimensionality which make them suitable for machine learning techniques to capture nonstructural patterns.
In recent years, these large datasets have made it possible for researchers to implement machine learning algorithms to study the genotype-phenotype association and develop models with high-performance metrics. In this study, we present a unified and generic approach that not only compares the metrics of multiple fitted models and reports the best-fitted model but also assists in identifying the most important positions (features) in the genomic sequence data that contribute mainly to identifying phenotypic traits or variants.
We have developed deepBreaks, a computational tool that uses sequencing data to identify important changes associated with the phenotype of interest. We validated deepBreaks in diverse applications, revealing new significant genotype-phenotype associations, experimentally validated associations, and a set of novel findings from related microbial strains, coronavirus families, SARS-CoV-2 strains, and opsins. deepBreaks is open-source software, and the Python implementation is available online at http://github.com/omicsEye/deepBreaks.

F-013: Snekmer: A scalable pipeline for protein sequence fingerprinting based on amino acid recoding
COSI: Function
  • Christine Chang, Pacific Northwest National Laboratory, United States
  • William Nelson, Pacific Northwest National Laboratory, United States
  • Aaron Wright, Pacific Northwest National Laboratory, United States
  • Robert Egbert, Pacific Northwest National Laboratory, United States
  • Jason McDermott, Pacific Northwest National Laboratory, United States


Presentation Overview:

Advances in sequencing and the subsequent explosion of available sequence data have motivated the need for faster, more sensitive methods for protein function annotation. Current methods for functional annotation, including sequence similarity assessment (e.g., via BLAST) and hidden Markov models (HMMs), scale poorly, suffer from reference or training set biases, and require time-consuming curation processes. Here, we introduce a method for rapid protein classification combining the kmer approach, in which proteins are represented as sets of peptide subsequences of length k, and amino acid recoding (AAR), which reduces the set of characters representing the 20 amino acids through similarity groupings. Proteins are thus simplified into vector representations capable of fingerprinting structural information for functional differentiation. To enable automated, high-throughput sequence analysis, we have developed Snekmer, a software tool which allows users to: (a) build classification models for protein families, (b) search a list of input protein sequences against pre-trained AAR-kmer family models and assign family probability scores, or (c) apply clustering algorithms for de novo protein family determination. We have evaluated Snekmer using an example set of 33 protein families, and demonstrate Snekmer’s utility in enabling rapid and accurate differentiation between families. Snekmer is available at http://github.com/PNNL-CompBio/Snekmer.
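
The combination of amino acid recoding and kmer counting described above can be illustrated with a minimal sketch. The five-letter reduced alphabet below (`GROUPS`) and the function names are assumptions chosen for illustration; they are not Snekmer's actual recoding schemes or API.

```python
from collections import Counter
from itertools import product

# Hypothetical reduced alphabet (an assumption, not Snekmer's scheme):
# H = hydrophobic, P = polar, B = basic, A = acidic, S = special (Gly/Pro)
GROUPS = {"H": "AVLIMFW", "P": "STNQYC", "B": "KRH", "A": "DE", "S": "GP"}
RECODING = {aa: code for code, members in GROUPS.items() for aa in members}


def recode(sequence):
    """Map each residue to its group letter; unknown residues become 'X'."""
    return "".join(RECODING.get(aa, "X") for aa in sequence.upper())


def kmer_vector(sequence, k=3):
    """Count recoded kmers and return a fixed-length vector.

    The vector has one entry per possible kmer over the reduced alphabet
    (5**k entries here), in lexicographic order. Kmers that contain an
    unknown residue ('X') are not part of the vocabulary and are ignored.
    """
    recoded = recode(sequence)
    counts = Counter(recoded[i:i + k] for i in range(len(recoded) - k + 1))
    alphabet = sorted(GROUPS)
    return [counts["".join(kmer)] for kmer in product(alphabet, repeat=k)]


if __name__ == "__main__":
    print(recode("MKDGAV"))
    print(kmer_vector("MKDGAV", k=2))
```

Recoding shrinks the kmer vocabulary from 20**k to 5**k in this sketch, which is what keeps the resulting vectors short enough to feed into a classifier or clustering step.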

F-014: Ontological analysis of coronavirus associated human genes at the COVID-19 Disease Portal
COSI: Function
  • Shur-Jen Wang, Medical College of Wisconsin, United States
  • Jeff De Pons, Medical College of Wisconsin, United States
  • Wendy Demos, Medical College of Wisconsin, United States
  • G. Thomas Hayman, Medical College of Wisconsin, United States
  • Morgan Lee Hill, Medical College of Wisconsin, United States
  • Mary Kaldunski, Medical College of Wisconsin, United States
  • Stan Laulederkind, Medical College of Wisconsin, United States
  • Jennifer R. Smith, Rat Genome Database, Medical College of Wisconsin, United States
  • Monika Tutaj, Medical College of Wisconsin, United States
  • Mahima Vedi, Medical College of Wisconsin, United States
  • Melinda Dwinell, Medical College of Wisconsin, United States
  • Anne Kwitek, Medical College of Wisconsin, United States


Presentation Overview:

The Disease Portal at the Rat Genome Database (RGD) is a premier disease resource where disease-associated genome objects are integrated with their associated data in one place according to disease areas. In response to the SARS-CoV-2 pandemic, RGD developed a “COVID-19 Diseases Portal” (https://rgd.mcw.edu/rgdweb/portal/home.jsp?p=14). In the COVID-19 Portal, gene-disease associations are established by manual curation of PubMed literature. All the functional annotations and genome data associated with these disease genes are accessible from the portal. We performed analyses on the COVID-19 disease gene set using tools developed at RGD. The Disease Ontology term enrichment analysis showed that the COVID-19 disease gene set is highly enriched with coronavirus infectious disease and related diseases. Several less related disease areas, such as liver disease and rheumatic disease, were also highly enriched. Using the comparison heatmap, we showed that close to 60 percent of the COVID-19 genes were associated with nervous system disease and 40 percent were associated with gastrointestinal disease. Our analysis confirms the role of the immune system in COVID-19 pathogenesis, as shown by substantial enrichment of immune system related Gene Ontology terms. Using the integrated data sets in the RGD COVID-19 Portal will help elucidate mechanisms of COVID-19 and ultimately lead to prevention or treatments.

F-015: The GENCODE geneset – reference gene annotation for Human and Mouse
COSI: Function
  • Toby Hunt, EMBL-EBI, United Kingdom
  • Jonathan Mudge, EMBL-EBI, United Kingdom
  • Jose Gonzalez, EMBL-EBI, United Kingdom
  • Jane Loveland, EMBL-EBI, United Kingdom
  • Adam Frankish, EMBL-EBI, United Kingdom
  • Paul Flicek, EMBL-EBI, United Kingdom


Presentation Overview:

The GENCODE consortium produces detailed reference annotation of all human and mouse protein-coding genes, pseudogenes, long non-coding RNAs and small RNAs. Accurate gene annotation is of fundamental importance for genome biology and clinical genomics; annotation that is incorrect or incomplete impacts downstream analysis and introduces potentially significant errors.

However, GENCODE remains a work in progress: we are continuing to refine and improve our geneset in a number of ways. 1. Incorporating long-read transcriptomic datasets (PacBio ISOseq, RACE-seq and ONT) via our TAGENE automated pipeline to extend existing models and add new ones. 2. Identifying novel ORFs, including potential translations within lncRNAs and the UTRs of protein-coding genes, found by recently published Ribo-seq studies. 3. Producing a high-value set of transcripts and corresponding proteins for use as a universal standard for clinical variant reporting, as part of the Matched Annotation from NCBI and EMBL-EBI (MANE) collaboration with NCBI RefSeq.

The GENCODE geneset is the default Human and Mouse annotation used in the Ensembl and UCSC genome browsers and is also available for download from www.gencodegenes.org.

On behalf of the GENCODE Consortium.

F-016: Function annotation inequalities reveal the need to study under-annotated genes
COSI: Function
  • An Phan, Iowa State University, United States
  • Parnal Joshi, Iowa State University, United States
  • Claus Kadelka, Iowa State University, United States
  • Karin Dorman, Iowa State University, United States
  • Iddo Friedberg, Iowa State University, United States


Presentation Overview:

Gene Ontology (GO) annotation databases contain the sum of computable knowledge of protein function, but they are known to be heavily biased. Although they account for over 25% of the annotations in the GO Consortium database, annotations from high-throughput experiments are considered less informative than those from low-throughput experiments. In addition, the “rich-get-richer syndrome” describes the abundance of research preferentially targeting already well-studied proteins, leaving many others under-annotated. To understand these trends, we focused on experimentally validated annotations of human proteins. We examined the number and information content of cumulative annotations assigned to each protein. We found that high-throughput experiments provide less specific information than low-throughput experiments. We also assessed the inequality in the distribution of annotations among proteins using the Gini index, from which we observed a “rich-get-richer” phenomenon. In both annotation counts and information content, the set of top-annotated genes has remained largely the same over the years. We observed a rapid increase of the Gini index in the molecular function aspect, indicating a worsening inequity between top- and poorly annotated genes. Overall, the most heavily studied genes of the past continue to receive much current attention, implying that we should reconsider resource allocation in the functional studies of genes.
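
For readers unfamiliar with the Gini index, the following minimal sketch shows how it summarizes inequality in per-protein annotation counts. The function name and the example counts are hypothetical and are not taken from the study; the formula is the standard rank-based form of the Gini coefficient.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal).

    Uses the rank-based form G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    where x_i are the values sorted in ascending order and i starts at 1.
    """
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Hypothetical annotation counts per protein (made-up numbers).
    evenly_annotated = [5, 5, 5, 5]
    skewed = [0, 0, 0, 20]
    print(gini(evenly_annotated))
    print(gini(skewed))
```

A value near 0 means annotations are spread evenly across proteins, while a value approaching 1 means a few proteins hold almost all of them, so a rising index over time corresponds to the widening gap the abstract reports.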

F-017: Automated Protein Function Description for Novel Class Discovery
COSI: Function
  • Meet Barot, New York University, United States
  • Vladimir Gligorijevic, Prescient Design, Genentech, Roche, United States
  • Richard Bonneau, New York University, United States
  • Kyunghyun Cho, New York University, United States


Presentation Overview:

Functional characterization is far outpaced by the discovery of new sequences from high-throughput sequencing technologies. Beyond the difficulty of assigning newly sequenced proteins to known functions, a more challenging issue is discovering novel protein functions. Protein function prediction, as it is usually framed in the case of Gene Ontology term prediction, is a multilabel problem with a hierarchical label space. However, this framing is limiting. It does not provide guiding principles for discovering completely novel functions. Clustering-based approaches are not able to give much information about the new functional categories that they predict; they can only predict that a protein may belong to a category that has not been studied. In this work we propose a neural machine translation model in order to generate descriptions of protein functions in natural language. We provide quantitative results of our model in the zero-shot classification setting, scoring sequence sets with functional descriptions that the model has not seen before, as well as generated function descriptions for qualitative evaluation.

F-018: Predicting protein succinylation sites using features extracted from Protein Language Models
COSI: Function
  • Dukka Kc, Michigan Technological University, United States
  • Suresh Pokharel, Michigan Technological University, United States
  • Pawel Pratyush, Michigan Technological University, United States
  • Michael Heinzinger, Technical University of Munich, Germany


Presentation Overview:

Lysine succinylation is an important post-translational modification (PTM) of proteins that is responsible for many vital metabolic activities in cells, including cellular respiration, regulation, and repair. Here, we present a novel approach that leverages features from a transformer-based protein language model called ProtT5-XL-UniRef50 in a machine learning framework to predict protein succinylation sites. To our knowledge, this is one of the first attempts to employ a transformer-based language model (LM) to predict protein succinylation sites. By eliminating the time and cost associated with manual extraction of features from protein sequences, our RBF-SVM model trained on transformer-encoded features achieves competitive results compared to existing state-of-the-art approaches, with performance scores of 0.48 and 0.75 for MCC and AUROC, respectively.
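
The two reported metrics can be computed from binary labels and classifier scores as in the minimal sketch below. The function names and toy inputs are hypothetical and do not reproduce the authors' evaluation code or data; MCC follows the standard confusion-matrix definition, and AUROC is computed as the fraction of positive-negative pairs ranked correctly.

```python
import math


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores (made-up values for illustration only).
    labels = [1, 1, 0, 0]
    scores = [0.9, 0.4, 0.5, 0.1]
    predictions = [1 if s >= 0.5 else 0 for s in scores]
    print(mcc(labels, predictions))
    print(auroc(labels, scores))
```

MCC depends on the decision threshold applied to the scores, whereas AUROC is threshold-free, which is why the two are commonly reported together for imbalanced site-prediction tasks.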