Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide


Posters - Schedules

Posters at ISMB/ECCB 2021 will be presented virtually. Authors will pre-record their poster talk (5-7 minutes) and upload it to the virtual conference platform, along with a PDF of their poster, beginning July 19 and no later than July 23. All registered conference participants will have access to the posters and presentations throughout the conference, and to the content until October 31, 2021. Q&A is available through a chat function, and poster presenters can schedule small group discussions with up to 15 delegates during the conference.

Information on preparing your poster and poster talk is available at: https://www.iscb.org/ismbeccb2021-general/presenterinfo#posters

Ideally authors should be available for interactive chat during the times noted below:


Session A: Sunday, July 25 between 15:20 - 16:20 UTC
Session B: Monday, July 26 between 15:20 - 16:20 UTC
Session C: Tuesday, July 27 between 15:20 - 16:20 UTC
Session D: Wednesday, July 28 between 15:20 - 16:20 UTC
Session E: Thursday, July 29 between 15:20 - 16:20 UTC

BENZ WS annotates sequences of the human reference proteome with four level EC numbers
COSI: Function
  • Davide Baldazzi, University of Bologna - Biocomputing Group, Italy
  • Castrense Savojardo, University of Bologna - Biocomputing Group, Italy
  • Pier Luigi Martelli, University of Bologna - Biocomputing Group, Italy
  • Rita Casadio, University of Bologna - Biocomputing Group, Italy

Short Abstract: We benchmark BENZ, a method recently introduced for four-level Enzyme Commission (EC) number annotation, on the human reference proteome. The analysis of false positives (enzymes that the system predicts from the pool of sequences without EC annotation) reveals 5741 new enzymes. Our annotations are independently validated by a method relying on GOtoEC and Pfam/InterPro mapping. Results support the relevance of BENZ as a reliable method for EC number annotation. The independent validation of the false positive predictions allows us to find new enzymes in the pool of UniProt non-enzyme sequences. Accuracy (96.84%), wrong predictions (2.06%), unpredicted (3.57%) and false positives (2.69%) values are good and similar to those obtained in previous benchmarks on different data sets, as already reported (NAR, 2021 Web Server Issue, in press). We fail on 679 enzyme sequences and gain 5741 new monofunctional enzymes distributed among the seven enzymatic classes (EC1: 691; EC2: 2332; EC3: 1884; EC4: 199; EC5: 301; EC6: 172; and EC7: 162).

ChemBoost: A chemical language based approach for the prediction of protein - ligand binding affinity
COSI: Function
  • Rıza Özçelik, Boğaziçi University, Turkey
  • Hakime Öztürk, Boğaziçi University, Turkey
  • Arzucan Ozgur, Bogazici University, Turkey
  • Elif Ozkirimli, Bogazici University, Turkey

Short Abstract: Identification of high affinity drug-target interactions is a major research question in drug discovery. Proteins are generally represented by their structures or sequences. However, structures are available only for a small subset of biomolecules and sequence similarity is not always correlated with functional similarity. We propose ChemBoost, a chemical language based approach for affinity prediction using SMILES syntax. We hypothesize that SMILES is a codified language and ligands are documents composed of chemical words. These documents can be used to learn chemical word vectors that represent words in similar contexts with similar vectors. In ChemBoost, the ligands are represented via chemical word embeddings, while the proteins are represented through sequence-based features and/or chemical words of their ligands. ChemBoost is able to predict drug-target binding affinity as well as or better than state-of-the-art machine learning systems. When powered with ligand-centric representations, ChemBoost is more robust to changes in protein sequence similarity and successfully captures the interactions between a protein and a ligand, even if the protein has low sequence similarity to the known targets of the ligand.
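To make the "ligands as documents of chemical words" idea concrete, here is a minimal Python sketch. The abstract does not say how chemical words are extracted, so the overlapping character k-mer scheme below is purely an assumption, and the function name `smiles_kmers` is hypothetical. Character-level windows also split multi-character atoms such as Cl or Br, which a real tokenizer would need to handle.

```python
def smiles_kmers(smiles, k=3):
    """Split a SMILES string into overlapping character k-mers.

    Assumption: a "chemical word" is a fixed-length substring of the
    SMILES string. This is an illustrative choice, not the ChemBoost
    vocabulary.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [smiles[i:i + k] for i in range(len(smiles) - k + 1)]


# Acetic acid written in SMILES, treated as a short "document".
words = smiles_kmers("CC(=O)O", k=3)
print(words)
```

The resulting word lists could then be passed to any word-embedding method, in the same way sentences of natural-language tokens would be.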

Critical assessment of protein intrinsic disorder prediction
COSI: Function
  • Damiano Piovesan, University of Padova, Italy
  • Silvio C. E. Tosatto, University of Padova, Italy

Short Abstract: Intrinsically disordered proteins, defying the traditional protein structure–function paradigm, are a challenge to study experimentally. Because a large part of our knowledge rests on computational predictions, it is crucial that their accuracy is high. The Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment was established as a community-based blind test to determine the state of the art in prediction of intrinsically disordered regions and the subset of residues involved in binding. A total of 43 methods were evaluated on a dataset of 646 proteins from DisProt. The best methods use deep learning techniques and notably outperform physicochemical methods. The top disorder predictor has Fmax = 0.483 on the full dataset and Fmax = 0.792 following filtering out of bona fide structured regions. Disordered binding regions remain hard to predict, with Fmax = 0.231. Interestingly, computing times among methods can vary by up to four orders of magnitude.
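As a rough illustration of the Fmax score quoted above: Fmax is commonly taken to be the maximum F1 score (the harmonic mean of precision and recall) over all decision thresholds applied to the prediction scores. The Python sketch below assumes that definition. The function name `f_max` and the toy per-residue scores are hypothetical and are not taken from the CAID evaluation code.

```python
def f_max(scores, labels):
    """Return (best F1, threshold) over thresholds drawn from the scores.

    scores: predicted disorder propensity per residue.
    labels: 1 for disordered, 0 for ordered (the reference annotation).
    A residue is predicted disordered when its score >= threshold.
    """
    best_f1, best_t = 0.0, None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t


# Toy example: five residues with made-up scores and labels.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0]
best_f1, best_t = f_max(scores, labels)
print(best_f1, best_t)
```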

Data to Knowledge - The Pseudomonas fluorescens SBW25 Knowledge Base
COSI: Function
  • Carsten Fortmann-Grote, Max Planck Institute for Evolutionary Biology, Germany
  • Paul Rainey, Max Planck Institute for Evolutionary Biology, Germany

Short Abstract: Reference genomes and their annotations are usually deposited in dedicated databases once and forever. In contrast, knowledge about individual genes and larger genomic subunits grows and evolves steadily but is not fed back into the reference annotation database. Our Knowledge Base for Pseudomonas fluorescens SBW25 aims at collecting the available data for this model organism from numerous *omics databases and publications, and also from less structured sources such as social media, electronic lab notebooks, and our internal lab databases.
By querying these sources directly, using semantic web technologies, we ensure that our data always reflects the current state of knowledge. An intuitive web frontend allows our site visitors to collect and to combine data from the various sources and thereby to create new knowledge which can be fed back into the system.

Effect of gene expression level on functional prediction using gene coexpression
COSI: Function
  • Takeshi Obayashi, Tohoku University, Japan
  • Yuichi Aoki, Tohoku University, Japan
  • Kengo Kinoshita, Tohoku University, Japan

Short Abstract: Gene coexpression information, which is a similarity of gene expression profiles, has been widely used for estimating gene function. Based on publicly available gene expression resources, we have developed coexpression databases (ATTED-II, COXPRESdb, ALCOdb) to promote functional genomics studies in each biological domain. The calculation of gene coexpression information is composed of multiple steps. Therefore, evaluation of the coexpression data is key to optimizing the calculation pipeline, and consistency with known gene function annotations is often used for coexpression evaluation. However, there are two potential biases in the evaluation using the known functions. (1) Annotation Bias. The more highly expressed genes have more annotations. Because highly expressed genes tend to show a clear phenotype by gene disruption, this may be a research bias. (2) Coexpression Bias. Genes having a similar expression level tend to be coexpressed. In this presentation, we first summarize the multiple effects on gene expression levels and then report that the normalization of these potential biases changes the optimal calculation pipeline of gene coexpression.
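The "similarity of gene expression profiles" mentioned above can be sketched in Python as follows. The abstract does not name the similarity measure used in the databases, so the Pearson correlation here is an assumption; the gene names and expression values are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have equal length >= 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],  # same trend as geneA, higher level
    "geneC": [4.0, 3.0, 2.0, 1.0],  # opposite trend
}
r_ab = pearson(expression["geneA"], expression["geneB"])
r_ac = pearson(expression["geneA"], expression["geneC"])
print(r_ab, r_ac)
```

In this toy case geneA and geneB are perfectly correlated even though their absolute expression levels differ, which is why the level-related biases described in the abstract have to be assessed separately from the correlation itself.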

Empirical variance component regression for sequence-function relationships
COSI: Function
  • Juannan Zhou, Cold Spring Harbor Laboratory, United States
  • Mandy Wong, Cold Spring Harbor Laboratory, United States
  • Wei-Chia Chen, Cold Spring Harbor Laboratory, United States
  • Adrian Krainer, Cold Spring Harbor Laboratory, United States
  • Justin Kinney, Cold Spring Harbor Laboratory, United States
  • David McCandlish, Cold Spring Harbor Laboratory, United States

Short Abstract: Contemporary high-throughput mutagenesis experiments have provided an unprecedented view of the complex patterns of genetic interaction that occur between multiple mutations within a single protein or regulatory element. By simultaneously measuring the effects of thousands of combinations of mutations, these experiments have revealed that the genotype-phenotype relationship typically contains genetic interactions not only between pairs of sites, but also higher-order interactions between larger numbers of sites. Here we provide a method of analyzing data from these experiments using Gaussian process regression with an empirical Bayes prior. The key insight is that many previously proposed methods can be recast as members of a family of Gaussian process regression models with hyperparameters corresponding to the expected fraction of variance due to each order of genetic variance. By analyzing the distance correlation function of the observed data, we can extract point estimates of these variance components. Based on these point estimates, we then construct a prior over all possible sequence function relationships, where the prior is concentrated on models with a similar correlation structure to that observed in the data. We apply this method to analyze high-throughput measurements for the protein GB1, and the splicing efficiency of human 5’ pre-mRNA splice sites.

Ensemble learning for novel drug - target affinity prediction
COSI: Function
  • Rıza Özçelik, Boğaziçi University, Turkey
  • Alperen Bağ, Bogazici University, Turkey
  • Berk Atıl, Boğaziçi University, Turkey
  • Elif Ozkirimli, Bogazici University, Turkey
  • Arzucan Ozgur, Bogazici University, Turkey

Short Abstract: Computational prediction of high affinity drug - target pairs is an interesting problem in drug discovery research, especially for novel drugs and targets. Machine learning models are effective at predicting the interactions between known biochemicals, but their generalization to novel drugs and targets remains an open question. Here, we introduce DebiasedDTA, the first model training strategy that is specifically designed for novel drug - target affinity (DTA) prediction and applicable to all optimization-based models. DebiasedDTA ensembles an existing DTA model with a weak learner that aims to quantify dataset biases. The output of the weak learner is then used during the training of the DTA model to improve generalizability. The experiments showed that the proposed approach can indeed quantify the bias embedded in the training sets and that DTA models can use this information to boost their performance on both known and novel molecules.

Integrating multiple information sources for protein function prediction with end-to-end deep learning
COSI: Function
  • Gabriela Merino, IBB-CONICET-UNER, Argentina
  • Diego Milone, sinc(i)-CONICET-UNL, Argentina
  • Maria Martin, EBI-EMBL, United Kingdom
  • Georgina Stegmayer, sinc(i)-CONICET-UNL, Argentina
  • Rabie Saidi, EBI-EMBL, United Kingdom

Short Abstract: Manual curation based on experimental evidence is a precise strategy for function annotation but extremely expensive and time-consuming. Hence it cannot cope with the exponential growth of data. Although computational methods for function prediction are being constantly developed, their performance is still subject to improvement, especially for no-knowledge (NK) proteins.

We propose a novel end-to-end deep learning model for predicting Gene Ontology (GO) terms by integrating multiple features from the sequence and taxon of proteins. Our model was trained and evaluated as in the CAFA3 challenge. For training, NK proteins were augmented using CAFA3 training proteins with no changes or added annotations up to 02/2017 (challenge deadline). For evaluation, CAFA3 benchmark proteins were used, obtaining F-max scores of 0.34, 0.55, and 0.55 for biological process (BP), cellular component (CC), and molecular function (MF), respectively. These results place our model among the top 5 CAFA3 methods, with scores competitive with those of the best competitors for BP and CC. It is also the second-best method when predicting MF. Our results suggest that deep learning integrating multi-source data and using data augmentation during training is a promising tool for function prediction.

Known allosteric proteins have central roles in genetic disease
COSI: Function
  • Gyorgy Abrusan, University of Cambridge, United Kingdom
  • David Ascher, University of Melbourne, Australia
  • Michael Inouye, University of Cambridge, United Kingdom

Short Abstract: Allostery is a form of protein regulation, where ligands that bind sites located apart from the active site can modify the activity of the protein. The molecular mechanisms of allostery have been extensively studied, because allosteric sites are less conserved than active sites, and drugs targeting them are more specific than drugs binding the active sites. Here we quantify the importance of allostery in genetic disease. We show that 1) known allosteric proteins are central in disease networks, and contribute to genetic disease and comorbidities much more than non-allosteric proteins, particularly in cancers, hematopoietic and vascular diseases; 2) variants from cancer genome-wide association studies are enriched near allosteric proteins, indicating their importance to polygenic traits; and 3) the importance of allosteric proteins in disease is due, at least partly, to their central positions in protein-protein interaction networks, and probably not due to their dynamical properties.

Prediction and Classification of Toxins and Venom Proteins with PSSMs and Family Domain Profiles
COSI: Function
  • Fernanda Midori Abukawa, Laboratory of Applied Toxinology, Butantan Institute, Brazil
  • Caio Fontes de Castro, Institute of Mathematics and Statistics, University of São Paulo, Brazil
  • Milton Yutaka Nishiyama Junior, Laboratory of Applied Toxinology, Butantan Institute, Brazil

Short Abstract: Venoms are rich sources of toxins, enzymes and bioactive peptides that are promising targets for the development of new drugs. Novel sequencing technologies have allowed the study of whole venom gland transcriptomes from venomous animals. The annotation of toxins is a challenging task, and current methods rely on protein homology by alignment to toxins and on manual curation. We propose a new approach to improve the prediction and classification of venom proteins and toxins based on Position Specific Scoring Matrices (PSSMs) combined with Hidden Markov Model (HMM) profiles and characterization of family domains. 2559 venom proteins and toxins were collected and curated in 13 venom protein families and 10 neurotoxin subfamilies. For each family, a PSSM corresponding to a multiple alignment was generated and tested against the family proteins and a negative dataset of 10000 non-venom proteins. The mean and standard deviation of the Matthews Correlation Coefficient were obtained for venom protein families (Mean=0.95, StdDev=0.08) and neurotoxin subfamilies (Mean=0.87, StdDev=0.09). This approach could successfully identify and classify toxins and venom proteins in families. Characterization of family domains will improve the quality of the classification and increase the accuracy of toxin and venom protein prediction and annotation.
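For readers unfamiliar with the metric reported above, the Matthews Correlation Coefficient can be computed from a confusion matrix as in the Python sketch below. This is the standard textbook formula, not the authors' evaluation code; the function name `mcc`, the convention of returning 0.0 when the denominator is zero, and the confusion-matrix counts are illustrative assumptions.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for one family tested against a large negative set.
print(mcc(tp=90, tn=9990, fp=10, fn=10))
```

Because the formula uses all four counts, MCC stays informative when negatives greatly outnumber positives, as with a 10000-protein negative set.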

Proposals to improve CAFA evaluation based on community participation
COSI: Function
  • Jeffrey Yunes, Yunes Foundation for Research on Aging, Portsmouth, NH, United States
  • Chengxin Zhang, Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, CT 06511, United States
  • Petri Törönen, Institute of Biotechnology, University of Helsinki, Finland

Short Abstract: CAFA is a community-wide challenge for automated prediction of protein functions (AFP). AFP has several characteristics that complicate its evaluation, such as incomplete gold standards, imbalanced classes, and structured outputs. After a decade of participation, we have observed ways in which CAFA deviates from the task of experimental characterization of protein function. Firstly, the proteins, GO terms, and function annotations are filtered to a distribution that is biased towards a few model organisms, especially mammals, and towards certain types of terms and annotations. Secondly, some of the main assessment metrics do not account for imbalanced classes and incomplete data, as information content-based metrics would. Thirdly, CAFA’s current “blast” baseline is oversimplified and performs worse than other BLAST-based methods. After collecting suggestions through a community-wide virtual town hall, we tentatively propose the following changes to ongoing and future CAFA challenges. First, we propose to evaluate on all proteins with new experimental annotations, instead of limiting the task to the Swiss-Prot subset with only certain experimental terms and annotations. Second, CAFA should prioritize information content-based metrics, to account for imbalanced classes and an incomplete gold standard. Third, blast baselines should be re-implemented to reflect best practices for such an analysis.

Protein analysis web service powered by the COMER2 homology search engine
COSI: Function
  • Mindaugas Margelevicius, Institute of Biotechnology, Life Sciences Center, Vilnius University, Lithuania
  • Justas Dapkunas, Institute of Biotechnology, Life Sciences Center, Vilnius University, Lithuania

Short Abstract: A basic approach for protein function prediction and sequence annotation rests on inference by homology. Here we present a multipurpose web server with the GPU-accelerated sensitive and specific homology search method COMER2 at its core. The COMER2 software architecture allows for simultaneously running multiple instances of homology search on the same GPU independently. This property allows the web server to efficiently exploit computational resources and distribute workload across multiple dedicated GPUs. Among its other distinctive features is the possibility for the user to submit multiple sequence and sequence family queries in different formats at once. Organizing and processing user queries in bulk removes the limitation of focusing on a single protein of interest at a time. The user can choose between popular target databases (PDB, SCOP, PFAM) and conduct homology searches on their different combinations. Structured output facilitates constructing multiple sequence alignments and permits generating many protein structural models based on one or more templates with one click. Finally, a new deep learning-based threading method with the COMER2 search engine expands the services provided by the web server, which is expected to represent a valuable resource for researchers who can access it directly or via a RESTful API.

seqSCAN: Unsupervised Classification of Proteins for New Function Discovery
COSI: Function
  • Meet Barot, Center for Data Science, New York University, United States
  • Vladimir Gligorijevic, Center for Computational Biology, Flatiron Institute, United States
  • Kyunghyun Cho, Center for Data Science, New York University, United States
  • Richard Bonneau, Flatiron Institute, New York University, United States

Short Abstract: Despite the large number of protein function prediction methods developed, many proteins elude categorization into known functional groups. This motivates the need for ways of classifying proteins without labeled data, as many proteins likely perform functions which are not currently known. In this work, we aim to categorize protein sequences into functionally relevant groups without using labels. We use a semantic clustering algorithm similar to a method which has recently been shown to achieve state-of-the-art results on unsupervised image classification tasks. We evaluate our method on several datasets: 10 and 558 family datasets constructed from Pfam, and a dataset of lyases with EC numbers from BRENDA. We compare our protein class discovery method with a baseline of K-means clustering of PCA components of sequence embeddings and find that we outperform the baseline in all tasks. We are able to show that our model categorizes proteins into existing classes (after hiding those class labels) without any access to labels. Furthermore, these results point towards the possibility of discovering new functions from sequence information alone.

Sequence-based prediction of proteins associated with extracellular vesicles
COSI: Function
  • Sanne Abeln, Vrije Universiteit Amsterdam, Netherlands
  • Katharina Waury, Vrije Universiteit Amsterdam, Netherlands
  • Dea Gogishvili, Vrije Universiteit Amsterdam, Netherlands

Short Abstract: Extracellular vesicles (EVs) comprise a heterogeneous group of membranous structures that are secreted by cells into the extracellular space, and are of major importance for cell-to-cell communication. EVs are promising biomarker candidates as they have been implicated in several pathologies. Results from EV isolation studies have been collected in databases providing a valuable resource for the exploration of associated proteins. In this study, we investigated whether we could predict EV association from amino acid (AA) sequence, and which protein features dominate EV association. Sequence-based features were utilized to build a random forest classifier to predict EV association. The model achieved a surprisingly high AUC of 0.82, which increases further to 0.87 when incorporating post-translational modification (PTM) annotations. Based on the feature analysis, EV proteins appear to be large, soluble, stable, and with low IP compared to non-EV proteins. Importantly, de novo discovered motifs and known domains, including RAS profile and WW domains, are positively correlated with EV packaging. EV-associated proteins are rich in aspartic and glutamic acid and undergo various PTMs, among which lipidation emerges as highly important, as the presence of palmitoylation sites was the most predictive feature found.
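The abstract reports that EV-associated proteins are rich in aspartic and glutamic acid. As a minimal sketch of one sequence-based feature of that kind, the Python code below computes plain amino acid composition and the combined D + E fraction. The function names and the toy sequence are hypothetical, and this is not the authors' feature set or classifier.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Fraction of each standard amino acid in a protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def acidic_fraction(seq):
    """Combined fraction of aspartic (D) and glutamic (E) acid."""
    comp = aa_composition(seq)
    return comp["D"] + comp["E"]


# Made-up eight-residue peptide, for illustration only.
toy = "MDEEDKLA"
print(acidic_fraction(toy))
```

A feature vector built from such fractions, one entry per residue type, is the sort of input a random forest classifier could be trained on.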

The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP)
COSI: Function
  • Jill Wegrzyn, University of Connecticut, United States
  • Alexander Hart, University of Connecticut, United States
  • Josh Burns, Washington State University, United States
  • Stephen Ficklin, Washington State University, United States

Short Abstract: The Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) is designed to improve the accuracy, speed, and flexibility of functional gene annotation for de novo assembled transcriptomes in non-model eukaryotes. This software package addresses the fragmentation and related assembly issues that result in inflated transcript estimates and poor annotation rates. Following filters applied through assessment of true expression and frame selection, open-source tools are leveraged to functionally annotate the translated proteins. Downstream features include fast similarity search across multiple databases, protein domain assignment, orthologous gene family assessment, Gene Ontology term assignment, and KEGG pathway annotation. The final annotation integrates across multiple databases and selects an optimal assignment from a combination of weighted metrics describing similarity search score, taxonomic relationship, and informativeness. Researchers have the option to include additional filters to identify and remove potential contaminants and prepare the transcripts for enrichment analysis. This fully featured pipeline is easy to install and configure, and runs much faster than comparable functional annotation packages. The latest release offers both command line and GUI access. EnTAP is optimized to generate extensive functional information for the gene space of organisms with limited or poorly characterized genomic resources.

Transformer-based Protein Function Annotation with Joint Sequence-Label Embedding
COSI: Function
  • Yue Cao, Texas A&M University, United States
  • Yang Shen, Texas A&M University, United States

Short Abstract: Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability when they rely on protein data besides sequences, or can lack generalizability to novel sequences, species and functions. To overcome the aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences, we use self-attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions, we embed protein function labels together with inputs/features in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence-function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability; a GO term-centric analysis is also provided.



International Society for Computational Biology
525-K East Market Street, RM 330
Leesburg, VA, USA 20176
