Session A: (July 7 and July 8)
Session B: (July 9 and July 10)
Short Abstract: Deep learning is increasingly used to solve problems in a variety of fields with state-of-the-art performance. It has particular advantages over traditional machine learning techniques: it reaches high performance while reducing both the need for feature extraction and the time required. Here we present DeepRes, a deep learning framework that uses two-dimensional convolutional neural networks and position-specific scoring matrix profiles to predict nucleotide binding sites (including ATP, ADP, AMP, GDP, and GTP), which are among the most vital molecular functions in biology. Nucleotide binding residues play important roles in many biological processes and directly affect many human diseases (diabetes, cancer, Parkinson’s disease). Therefore, creating a precise model to identify them is imperative for understanding these diseases and designing drug targets. DeepRes identifies these interacting residues with independent test accuracies of 97.2%, 96.5%, 97.4%, 97.3%, and 97.6%, respectively. Compared with other published works, its predictive performance is markedly improved, making it a superior model for predicting nucleotide binding residues. This study provides an effective tool for predicting nucleotide binding residues and a basis for further research applying deep learning in bioinformatics.
Short Abstract: Both UniProt automatic and manual pipelines use sets of family and domain signatures to infer functional annotations of proteins. Recently, a number of studies have suggested that the same set of signatures does not necessarily imply the same annotations, and that other factors, such as the order of signatures in the protein sequence, may have an impact on its function. However, this impact has not yet been quantified. In this work, we present an information theory based approach to measure the consistency between signature sets and annotations. We propose a new entropy measure which takes the dynamic nature of the annotation process into account by assigning different weights to the presence and absence of an annotation. The results show a high consistency between signature sets and annotations in UniProt Knowledgebase. Apart from quantifying the annotation consistency, our analysis has a few additional implications. One is detection of signatures having complete annotation consistency which can then be used as seeds for generating new annotation rules. Moreover, to gain a better understanding of the reasons behind inconsistency in some signature sets, we used formal concepts to identify proteins with incomplete annotations and discover potential new subfamilies sharing the same annotations.
Short Abstract: 99.6% of all known proteins have never been tested experimentally, nor has their expression even been observed; predicting their function therefore relies mainly on comparing their sequences to annotated homologs. However, even with new automated tools for high-throughput functional annotation, the function of many proteins remains unknown because they have no annotated homologs. In order to identify function and discover protein-protein interaction networks, our study aimed at identifying proteins that are functionally linked to each other. We analyzed the co-occurrence patterns of 406,000 orthologous and 118,000 homologous proteins from the fully sequenced non-draft genomes of 4,350 bacteria, 166 eukaryotes and 226 archaea. Validation successfully recovered known networks from various pathways, including nitrogen fixation, glycolysis and ribosomal proteins; for example, using the query protein AmoA (a subunit of ammonia monooxygenase), the resulting calculated functional network included AmoB and AmoC, the two other subunits. The method was found to be both biologically and computationally practical and efficient, and thus promises to remain efficient as more and more genomes are sequenced.
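The co-occurrence idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say which co-occurrence score was used, so the Jaccard similarity here is an assumption, as are the function names, the threshold, and the toy genome sets (g1, g2, ...), which are invented purely for illustration.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity between two sets of genome identifiers."""
    a, b = set(profile_a), set(profile_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def linked_proteins(query, profiles, threshold=0.8):
    """Return (protein, score) pairs whose profile co-occurs with the query.

    `profiles` maps a protein family name to the set of genomes in which
    it is present. Results are sorted by descending score, then by name.
    """
    query_profile = profiles[query]
    hits = []
    for name, profile in profiles.items():
        if name == query:
            continue
        score = jaccard(query_profile, profile)
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Toy presence/absence data (hypothetical, not taken from the study).
profiles = {
    "AmoA": {"g1", "g2", "g3"},
    "AmoB": {"g1", "g2", "g3"},
    "AmoC": {"g1", "g2", "g3", "g4"},
    "RpsA": {"g1", "g2", "g3", "g4", "g5", "g6"},
}

print(linked_proteins("AmoA", profiles, threshold=0.7))
```

In this toy data, the widely distributed family scores low against the query even though it is present in every genome that contains the query, which is the behaviour one usually wants from a co-occurrence score.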
Short Abstract: CAZymes (carbohydrate-active enzymes) are among the most important enzymes for the bioenergy and agricultural industries. CAZymes are also important for human health, because microbes living in the human gut encode the highest percentage of CAZymes to degrade various dietary and host carbohydrates, and changes in dietary carbohydrates affect the structure of the gut microbiota and, in turn, human health. We have built an online database, dbCAN-seq (http://cys.bios.niu.edu/dbCAN_seq), to provide pre-computed CAZyme sequence and annotation data for 5,349 bacterial genomes. Compared to other CAZyme resources, dbCAN-seq has the following new features: (i) a convenient download page to allow batch download of all the sequence and annotation data; (ii) an annotation page for every CAZyme to provide the most comprehensive annotation data; (iii) a metadata page to organize the bacterial genomes according to species metadata such as disease, habitat, oxygen requirement, temperature, and metabolism; (iv) a very fast tool to identify physically linked CAZyme gene clusters (CGCs); and (v) a powerful search function to allow fast and efficient data queries. With these unique utilities, dbCAN-seq will become a valuable web resource for CAZyme research, with a focus complementary to dbCAN (automated CAZyme annotation server) and CAZy (CAZyme family classification and reference database).
Short Abstract: An effective approach to leveraging the complementarity of methods proposed for protein function prediction (PFP) is to assimilate them into heterogeneous ensembles. We have illustrated that such ensembles can provide significant performance gains over individual PFP predictors. However, our previous work has been limited to a few GO terms due to the computational costs of constructing these ensembles. Here, we report the results of large-scale PFP using heterogeneous ensembles. Specifically, we constructed and evaluated ensembles for 277 GO terms using 12 diverse base classifiers and two types of methods, namely stacking with 8 different meta-classifiers and Caruana et al.'s Ensemble Selection algorithm (CES). Stacking using Logistic Regression (SLR) was the best-performing stacker, and also performed competitively with CES. SLR generally outperformed the best base classifier, with median Fmax improvement increasing with GO term size, namely 0.010 (p=0.21), 0.027 (p=1.1×10⁻⁷) and 0.033 (p=1.7×10⁻¹⁰) for small (200-500 proteins), medium (500-1000 proteins) and large (over 1000 proteins) terms respectively. Furthermore, the entire computation took less than 48 hours on a sizeable computing cluster. These results demonstrate that large-scale PFP using heterogeneous ensembles constructed systematically using stacking and CES can be predictive and computationally feasible.
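Since the abstract above reports results as Fmax per GO term, a minimal sketch of that measure may help: Fmax is the maximum, over decision thresholds, of the harmonic mean of precision and recall. This is a simplified single-term version written for illustration; the function name and the toy scores are assumptions, and it is not the authors' evaluation code.

```python
def f_max(scores, labels):
    """Maximum F-measure over all score thresholds for one GO term.

    `scores` are predicted confidences, `labels` are 1 (protein has the
    term) or 0 (it does not). A protein is predicted positive when its
    score is greater than or equal to the threshold.
    """
    best = 0.0
    for threshold in sorted(set(scores)):
        predicted = [s >= threshold for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y)
        fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
        fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
        if tp == 0:
            continue  # precision and recall are both zero or undefined
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example (hypothetical scores for four proteins).
toy_scores = [0.9, 0.8, 0.4, 0.2]
toy_labels = [1, 0, 1, 0]
print(round(f_max(toy_scores, toy_labels), 3))
```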
Short Abstract: Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Previously, we have described our region-specific function prediction methodology that first decomposes proteins into potential functional regions and then automatically infers their function labels using protein-level annotations and multiple types of region-level feature representations. These region-level features include (1) keywords extracted from residue- and domain-level InterPro/UniProtKB feature annotations and (2) amino acid sequences directly (k-mer frequency). By themselves, keyword features are much more informative and outperform k-mers across MF-GO terms but require prior curation that is not always available for every protein/region. In order to combine the relevance of keyword features and the prevalence of k-mers, we propose to use multi-modal deep autoencoders (MDA) to learn a single, low-dimensional feature representation. We apply our region-specific method on this representation and report the predictive performance when evaluated at both the whole protein level and at the region-specific level.
Short Abstract: The experimental determination of protein function is a time-consuming and expensive process, mostly as a result of the large number of possible protein functions. However, the number of novel protein sequences identified has continually increased in recent years. This has resulted in many proteins with known sequences but unknown function. Many machine learning (ML) methods for protein function prediction have been proposed to bridge this gap. We present here a systematic comparison between several classes of sequence-based ML function prediction methods as well as performances of these methods on different types of sequence representations (one-hot encoding of sequence, k-mer frequencies, blast similarity profiles, PSSMs, and Chaos Game Representation). We find that deep learning methods, including multilayer perceptrons (MLPs) and convolutional neural networks (CNNs), outperform conventional ML methods including logistic regression and SVM methods, as well as the BLAST and Naive term frequency baseline measures from CAFA for yeast, human, and multi-species protein sequence datasets using a temporal holdout validation measure. We also explore a multi-task objective in the CNN to incorporate relationships between functions in the Gene Ontology (GO) tree and show improved performance.
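Two of the sequence representations named in the abstract above, one-hot encoding and k-mer frequencies, can be sketched in a few lines. This is an illustrative sketch under assumptions: the 20-letter alphabet ordering, the handling of non-standard residues (encoded as all zeros), and the function names are choices made here, not details from the study.

```python
from collections import Counter

# The 20 standard amino acids, in an arbitrary but fixed order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence):
    """Encode a protein sequence as a list of 20-dimensional 0/1 rows."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for residue in sequence:
        row = [0] * len(AMINO_ACIDS)
        if residue in index:  # unknown residues stay all-zero
            row[index[residue]] = 1
        rows.append(row)
    return rows


def kmer_frequencies(sequence, k=3):
    """Relative frequency of each overlapping k-mer in the sequence."""
    total = len(sequence) - k + 1
    if total <= 0:
        return {}
    counts = Counter(sequence[i:i + k] for i in range(total))
    return {kmer: count / total for kmer, count in counts.items()}


print(kmer_frequencies("MKMKM", k=2))
print(one_hot("AC")[0][:3])
```

One-hot encoding preserves residue order and suits convolutional models, while k-mer frequencies give a fixed-length, order-free summary that conventional classifiers can consume directly.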
Short Abstract: The prevalence of high-throughput experimental methods has resulted in an abundance of large-scale molecular and functional interaction networks. The connectivity of these networks provides a rich source of information for inferring functional annotations for genes and proteins. An important challenge has been to develop methods for combining these heterogeneous networks to extract useful protein feature representations for function prediction. Most existing approaches to network integration use shallow models that cannot capture complex and highly nonlinear network structures. We therefore propose deepNF, a network fusion method based on Multimodal Deep Autoencoders that extracts high-level features of proteins from multiple heterogeneous interaction networks. We apply deepNF to 6 STRING networks to construct a compact low-dimensional representation containing high-level protein features. We present an extensive performance analysis comparing our method with state-of-the-art network integration methods such as GeneMANIA and Mashup. In addition to cross-validation, the analysis includes a temporal holdout validation similar to the measures in CAFA. Our method outperforms previous methods for both human and yeast STRING networks. Features learned by our method lead to substantial improvements in protein function prediction accuracy, which could enable novel protein function discoveries.
Short Abstract: Computational methods for post-translational modification (PTM) site prediction play important roles in protein function studies and experimental design. Most existing methods are based on feature extraction, which may result in incomplete or biased features. Deep learning, a cutting-edge machine learning method, can automatically discover complex representations of PTM patterns from raw sequences, and hence provides a powerful tool for improving PTM site prediction. In our previous work, we proposed MusiteDeep, the first deep-learning framework for predicting phosphorylation, one of the best-studied PTMs. The previous MusiteDeep takes the raw protein sequence as input and uses convolutional neural networks with a two-dimensional attention mechanism. It achieved over a 50% improvement in the area under the precision-recall curve in general phosphorylation site prediction and obtained competitive results in kinase-specific prediction compared to other well-known tools on the benchmark data. In this work, we extend the framework to more types of PTMs and explore a new deep-learning architecture, CapsNet. A web server for predicting more PTM sites and visualizing complex motifs will be developed in the future.
Short Abstract: The Critical Assessment of protein Function Annotation algorithms (CAFA) is a scientific challenge run every two years that consists of predicting Gene Ontology (GO) terms from protein sequences. The organizers release a set of protein sequences, participants' predictions must be deposited by the following January, and the evaluation is performed on the experimental annotations accumulated in the following months (at least 6). A paper with the results is usually published before the following instalment of the challenge: CAFA1 (2010-2011) results were published in 2013 and CAFA2 (2013-2014) results in 2016, while the CAFA3 (2016-2017) evaluation is still in progress. Journals such as the NAR Web Server issue require CAFA results for predictors submitted for publication; however, such results become available years after the method was tested in CAFA, and in any case the challenge is run only every two years. This leads to a gap: scientists must either use old scores or perform “in house” CAFA-like evaluations. Given this scenario, we propose a centralized continuous evaluation system for CAFA-like assessments. This would help provide consistent and certified scores, clear dataset references and openness. Existing benchmarking platforms such as OpenEBench could be used for this purpose.
Short Abstract: Nearly 20 years after the first human genome sequence was published our knowledge and understanding of gene/protein functions remains limited. This is exemplified by the recent identification of the minimal bacterial genome which revealed that one third (149 of 438) of the proteins in this genome were of unknown function. These genes perform essential roles, yet we have no idea of the functions they perform. We performed an extensive in silico analysis to expand our understanding of the minimal genome. Overall our analysis inferred more informative functions for 59 of the 149 proteins of unknown function. The inferred functions cover multiple areas including protein synthesis, cell division and transport. Our results suggest that >50% of the minimal genome is required for the fundamental life processes of preserving and expressing genetic information. Interestingly we identified many transmembrane proteins in the set of uncharacterised proteins and predict that >70% of these have transporter functions. Our analysis provides insight into the functions of proteins in the minimal bacterial genome, which will now be of interest for experimental characterisation. Further, it highlights the ability to use computational approaches to expand our knowledge and understanding of protein function.
Short Abstract: Antibiotic resistance is a major public health crisis, and finding new sources of antimicrobial drugs is crucial to solving it. Bacteriocins, which are bacterially-produced antimicrobial peptide products, are candidates for broadening our pool of antimicrobials. The discovery of new bacteriocins by genomic mining is hampered by their sequences' low complexity and high variance, which frustrates sequence similarity-based searches. Here we use word embeddings of protein sequences to represent bacteriocins, and subsequently apply Recurrent Neural Networks and Support Vector Machines to predict novel bacteriocins from protein sequences without using sequence similarity. We developed a word embedding method that accounts for sequence order, providing a better classification than a simple summation of the same word embeddings. We use the Uniprot/TrEMBL database to acquire the word embeddings taking advantage of a large volume of unlabeled data. Our method predicts, with a high probability, six yet unknown putative bacteriocins in Lactobacillus. Generalized, the representation of sequences with word embeddings preserving sequence order information can be applied to protein classification problems for which sequence homology cannot be used.
Short Abstract: Computational functional annotation is frequently hampered by the lack of high-identity templates for any new target of interest. We have recently developed a hybrid pipeline combining structural prediction/alignment, sequence alignment, and protein-protein interaction information to obtain combined structure predictions and functional annotations for entire proteomes. We find that our inclusion of structural information makes our workflow unusually strong in performance on difficult targets with limited sequence identity to annotated proteins. Importantly, we also observe that in silico structure prediction can now replace experimental structures for the purposes of functional annotation pipelines. The combined structure/function predictions provided by our pipeline provide an unusual richness of information, and we show several usage cases where insight from these predictions accurately guided follow-up experiments. Examination of our predictions on several model proteomes reveals a range of commonly over-represented functionalities among poorly annotated proteins, including transcription factors, kinases/phosphatases, and pathogenicity genes. Our findings provide fundamental new insight into the genetic capacity encoded in proteomes across all domains of life, yield a rich new source of information to seed detailed investigation of the functions of many previously mysterious protein-coding genes, and pave the way for large-scale structure/function annotation for a broader range of proteomes of interest.
Short Abstract: Many computational functional inference methods use GO for their set of functional labels. While informative motif information leveraging structure is already captured in libraries of Hidden Markov Models (HMMs), such as Pfam, creating a useful Pfam to GO mapping remains a difficult endeavor. This is because it is a many-to-many mapping, where different Pfam annotations within a protein structure, either individually or as a set, might yield different amounts of specificity with regard to the set of possible GO labels that are appropriate. Estimating the amount of specificity that a single Pfam-derived domain, or a set of them, gives with regard to GO labeling is confounded by the unequal representation and/or the lack of coverage of annotation in both domains across the protein universe. We revisit issues of coverage, diversity, and representation in the light of all the new data in current sequence databases. We have developed a suite of parsers and an Object-Relational Mapping using Python and SQLAlchemy to represent selected information on proteins and families from the UniProt and Pfam databases respectively. We use this framework to compare dcGO (Fang and Gough, 2013) and GODM (Alborzi et al., 2017), which are designed to optimize different tradeoffs between coverage and false positives.
Short Abstract: Functional annotation of biomolecules in the gene and protein databases is mostly incomplete. This is especially true for multi-domain proteins. There is a grey area in the protein function data resources, where the truly negative functions reside together with functions that the protein possesses but that have not yet been discovered or documented (i.e. false negatives). In many cases, information about the functions absent from the target biomolecule can be as important as the assigned functions. It is possible to resolve a portion of this grey area by predicting the functions that the target proteins most probably do not possess. In this study, we present an approach to produce negative functional annotations for protein sequences, along with regular positive associations. Using this approach, we have developed an automated function prediction tool, "UniGOPred". The negative prediction performance (recall) was measured as 0.82 for both MF and BP, and 0.66 for CC GO terms (with prediction scores ≤ 0.3), in cross-validation. To the best of our knowledge, the ability of a protein function prediction method to predict negative functions using sequence features is investigated here for the first time. UniGOPred is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.
Short Abstract: The Critical Assessment of protein Function Annotation algorithms (CAFA) is a large-scale experiment for assessing computational models for automated function prediction (AFP). The models presented in CAFA have shown excellent promise in terms of prediction accuracy, but quality assurance has received relatively little attention. The main challenge in conducting systematic testing of AFP software is the lack of a test oracle, which determines whether a test case passes or fails; unfortunately, the exact expected outcomes are not well defined for the AFP task. Metamorphic testing (MT) is a technique used to test programs that face the oracle problem by defining metamorphic relations (MRs). An MR determines whether a test has passed or failed by specifying how the output should change in response to a specific change made to the input. In this work, we use MT to test five web-based CAFA2 AFP tools by defining a set of MRs that apply input transformations at the protein level. In this initial testing, we observe MR violations. Currently, we are working on developing domain-specific MRs based on sequence modifications. In the future, we plan to develop a comprehensive MT tool that is readily available to the AFP community.
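The metamorphic testing idea described above can be made concrete with a small sketch. The abstract does not list its metamorphic relations, so the relation used here (reordering the proteins in a batch should not change any protein's predictions) is an assumed example, and both toy predictors, the term name TERM_A, and the sequences are hypothetical.

```python
def check_order_relation(predict, sequences):
    """Metamorphic relation: input order must not affect predictions.

    `predict` maps {protein_id: sequence} to {protein_id: set_of_terms}.
    Returns the protein ids whose predictions changed after the batch
    was submitted in reverse order (an empty list means the MR holds).
    """
    original = predict(sequences)
    reordered = predict(dict(reversed(list(sequences.items()))))
    return [pid for pid in sequences if original[pid] != reordered[pid]]


def stable_predict(sequences):
    """Toy predictor: each protein is scored independently."""
    return {pid: ({"TERM_A"} if "GK" in seq else set())
            for pid, seq in sequences.items()}


def order_dependent_predict(sequences):
    """Toy buggy predictor: only the first protein in the batch gets a term."""
    result = {}
    for position, pid in enumerate(sequences):
        result[pid] = {"TERM_A"} if position == 0 else set()
    return result


batch = {"P1": "MGKS", "P2": "MAAA"}
print(check_order_relation(stable_predict, batch))
print(check_order_relation(order_dependent_predict, batch))
```

No expected GO terms are needed for either check: the relation only compares two outputs of the same tool, which is how metamorphic testing sidesteps the missing oracle.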
Short Abstract: Bacteria of the genus Actinomyces are able to grow, reproduce and cause infections in multiple sites of the human body, including sites where the conditions for bacterial growth are unfavorable. Genes encoding the universal stress proteins (USPs) enable bacteria to respond to stress and grow in unfavorable conditions such as limited nutrients and acidic conditions. The goal of the research reported here was to predict the functions of the universal stress proteins encoded in genomes of Actinomyces species. A combination of bioinformatics and visual analytics techniques was used to construct data sets and identify the function, transcription direction and operonic arrangement of genes adjacent to the universal stress proteins of Actinomyces. Gene neighborhood analysis revealed a 4-gene operon, including a USP gene, in the genome of an oral Actinomyces. The operon had function annotations for a sucrose transporter and an enzyme for the breakdown of sucrose. The presence of double domain USPs could indicate capacity for biofilm formation. Sugar metabolism is central to the behavior of dental Actinomyces species, which are able to persist in biofilms, produce acid and store glycogen-like molecules. Further studies could evaluate the expression levels of the members of the operon in diverse environmental conditions.
Short Abstract: The third CAFA challenge (CAFA3) released its prediction targets in September 2016, and preliminary results were announced in July 2017. CAFA3 featured a term-centric track where predictors were asked to associate a large set of genes (the complete genomes of Candida albicans and Pseudomonas aeruginosa) with a limited set of functions. By collaborating with experimental biologists, we were able to use unpublished whole-genome screen results to evaluate these predictions. To specifically address this question, we hosted an additional challenge CAFA 3.14 (CAFA-Pi) that is dedicated to evaluating term-centric predictions. The final CAFA3 results as well as preliminary CAFA-Pi results will be released and discussed, in addition to highlights of the term-centric evaluations and benchmark proteins.
Short Abstract: Metabolic modelling is an effective way to understand factors affecting organisms’ growth. Ultimately, such models are key for such purposes as metabolic engineering and drug design. However, sequence similarity searches—typically used to annotate enzymatic function for these models—produce false positive enzyme predictions and fail to consider sequence diversity within enzyme classes. Therefore, various methods have been developed, looking beyond sequence similarity for such elements as domain and catalytic site presence. Here, we start by presenting DETECT (Density Estimation Tool for Enzyme ClassificaTion). In DETECT, the sequence diversity within each enzyme class is captured through density profiles. Then, it calculates likelihood scores for a query sequence given its matches to sequences of different enzyme classes. The use of enzyme-specific score cutoffs calculated from cross-validation gives DETECT higher precision and recall compared to existing methods. It remains that different methods are better suited for predicting certain enzyme classes compared to others. Thus, in a second part, we present an integrative approach for enzyme annotation, where enzyme-specific rules are used for combining the predictions of different tools. Overall, we propose methods for creating high-confidence metabolic models to drive biological discovery.
Short Abstract: Sequence similarity, computed e.g. by BLAST, is used for large-scale annotation of protein sequences. These automated annotations propagate in non-curated protein databases as human readable descriptions or Gene Ontology annotations. To overcome error propagation by simple transfer of annotation from the most similar database match, we developed a new program called "Automatic assignment of Human Readable Descriptions" (AHRD) that is modeled on the work flow of human curators when evaluating similarity search results. Based on semantic similarity of GO sub graphs we optimized AHRD with heuristic machine learning algorithms. AHRD can overcome problems caused by wrong annotations, lack of similar sequences and partial alignments.
Short Abstract: BACKGROUND/MOTIVATION: The protein structure-function relationship poses the problem of finding suitable mathematical models to relate structural numerical descriptors (protein structure features) with conscious experiences (i.e., phenotypes that are directly related to protein function). Here we model this relationship through the identification of residues critical for protein function. METHODS: Critical residues are those positions in a protein that, upon mutation, render the protein inactive; here we introduce a mathematical way to define such residues (see Figure 1). We used an exhaustive optimization of protein structure descriptors, machine learning algorithms and their hyper-parameters to identify the best models for classifying critical residues. The descriptors were generated using protein sequences and ProtDCal (see Figure 2). RESULTS: A random forest model achieved 84% accuracy in a 10-fold cross-validation test (Precision=80%, Recall=80%) and 79% accuracy (Precision=74%, Recall=76%) on a set of proteins that differ significantly from those used during training. CONCLUSIONS: While our model outperformed previous methods in predicting critical residues, our methodology should still be expanded to identify novel protein descriptors that accurately model phenotypes from physical measures.
Short Abstract: Residue Cluster Class (RCC) is a new feature space for representing proteins that is proposed to capture more information than classical contact maps. We explore its usefulness for classifying the structure and function of proteins. In a residue contact graph, the maximal cliques are calculated and divided into 26 categories depending on the type of interaction they have. This representation is used on the PDB to learn the structural (CATH) and functional (Gene Ontology) classifications using machine-learning methods. Code optimization allowed us to improve on previous performance by optimizing over the criteria for graph construction. We found that a residue contact graph built without side chains and with a distance cutoff of 7 angstroms yielded the best results using a Random Forest model. 97.5% of the CATH classification was correctly learned at the Class level and 89.3% at the Class and Architecture levels. Preliminary results in function classification yielded 41.9% for cellular localization, 27.6% for molecular function and 35.7% for biological process (this is likely due to too few samples, see Data). RCCs constitute a representation of protein structure that is useful for structure and function classification; this implies that they capture features relevant to studying the structure-function relationship of proteins.
Short Abstract: Maize is both a crop species and a model for genetics and genomics research. Maize GO annotations from Gramene and Phytozome are widely used to derive hypotheses for crop improvement and basic science. The maize-GAMER project is an effort to assess existing maize GO annotations and to improve the quality and quantity of annotations. We designed and implemented a plant-specific reproducible meta-annotator (GO-MAP) that uses diverse component methods, including sequence similarity, domain presence, and three CAFA tools (Argot2, FANN-GO, and Pannzer), to predict GO terms for maize genes and combines the predicted annotations into an aggregate dataset. Annotations from Gramene, Phytozome, and maize-GAMER were assessed and compared. Compared to Gramene and Phytozome, the maize-GAMER dataset annotates more genes and assigns more GO terms per gene. The quality of annotations was evaluated using an independent gold-standard dataset (2002 GO annotations for 1,619 genes) from MaizeGDB. In the CC category, maize-GAMER was the top performer, but it ranked slightly behind Gramene in both the MF and BP categories. The maize-GAMER GO annotations have been released publicly, and the containerized GO-MAP tool will soon be released to facilitate annotation of other plant proteomes.
Short Abstract: Thousands of bacterial genomes have been sequenced and annotated. A very large fraction of GO functional annotations for bacterial genes are based on sequence similarity and have not been reviewed by any curator. We sought to examine afresh how well we can predict bacterial gene annotations with experimental evidence using network-based methods. As a proof of concept, we selected 19 clinically relevant pathogenic bacteria and created a cross-species network based on protein sequence similarity. We integrated this network with species-specific functional association networks for each pathogen from STRING. We hypothesized that the integrated network would have higher predictive power, despite the large network size and sparsity of annotated nodes. We evaluated the ability of multiple network-based prediction algorithms to predict experimental annotations and non-IEA annotations using five-fold cross-validation. We found that the SinkSource algorithm consistently outperformed (higher F-max values) GeneMANIA, FunctionalFlow, and other BLAST-based methods. While incorporating STRING with the sequence similarity network did not improve F-max values for non-IEA annotations, the integrated network did yield higher F-max values for experimental annotations (median F-max increased from 0.46 to 0.51 for SinkSource across all BP terms). These results demonstrate that integrating multiple types of data improves predictive power for experimental annotations.
Short Abstract: Metabolism forms the basis for understanding cellular processes in all living organisms and is essential in mediating microbial community and host-microbe associations. Despite the broad application of genome-scale models to studying the function and evolution of metabolic networks, a comprehensive understanding of diverse metabolic processes is still lacking due to the great complexity and variability of metabolic interactions among different species. To enable the annotation and visualization of complex metabolic networks beyond the scope of existing metabolic pathway databases, we have developed a new algorithm, FindPrimaryPairs, for automatically predicting the element-transferring reactant/product pairs and hence tracing the primary connections of metabolites in metabolic networks. The algorithm has been applied to enable the visualization of metabolic pathways. In the presentation, we will demonstrate new applications of our approach to annotating host-microbe metabolic collaborations and discuss the further integration of protein structural and functional information into the study of the evolution of metabolic interactions among different species.