Function SIG: Gene and Protein Function Annotation

COSI Track Presentations

Schedule subject to change
Monday, July 22nd
10:15 AM-10:20 AM
10:20 AM-11:00 AM
KEYNOTE: Using evolutionary sequence variation to build predictive models of protein structure and function.
  • Lucy Colwell, Cambridge University and Google AI

The evolutionary trajectory of a protein through sequence space is constrained by its function. A central challenge across the biological sciences is to predict the functional properties of a protein from its sequence, and thus (i) discover new proteins with specific required functionality and (ii) better understand the functional effect of changes within protein-coding genes. The explosive growth in the number of available protein sequences raises the possibility of using the natural variation present in homologous protein sequences to infer these constraints and thus identify residues that control different protein phenotypes. Because in many cases phenotypic changes are controlled by more than one amino acid, the mutations that separate one phenotype from another may not be independent, requiring us to build models that take into account the correlation structure of the data. Models that have this feature are capable of (i) inferring residue-pair interactions accurately enough to predict all-atom 3D structural models, (ii) predicting binding interactions between different proteins, and (iii) accurately annotating sequence domains that are as much as 80% distinct from the training set.

Speaker Bio: Lucy Colwell is a lecturer in the Chemistry department at Cambridge University, and a Research Scientist at Google Applied Science. She completed her PhD in applied mathematics at Harvard University and spent time as a member of the systems biology group at the Institute for Advanced Study in Princeton NJ before taking up her faculty position at Cambridge. Her research focuses on using large datasets to build predictive models in chemistry and biochemistry.

11:00 AM-11:20 AM
Proceedings Presentation: SCRIBER: accurate and partner type-specific prediction of protein-binding residues from proteins sequences
  • Jian Zhang, Xinyang Normal University, China
  • Lukasz Kurgan, Virginia Commonwealth University, United States

Motivation: Accurate predictions of protein-binding residues (PBRs) enhance understanding of molecular-level rules governing protein-protein interactions, help protein-protein docking, and facilitate annotation of protein functions. Recent studies show that current sequence-based predictors of PBRs severely cross-predict residues that interact with other types of partners (e.g., RNA and DNA) as PBRs. Moreover, these methods are relatively slow, prohibiting genome-scale use.
Results: We propose a novel, accurate and fast sequence-based predictor of PBRs that minimizes the cross-predictions. Our SCRIBER (SeleCtive pRoteIn-Binding rEsidue pRedictor) method takes advantage of three innovations: a comprehensive dataset that covers multiple types of binding residues, novel types of inputs that are relevant to the prediction of PBRs, and an architecture that is tailored to reduce the cross-predictions. The dataset includes complete protein chains and offers improved coverage of binding annotations that are transferred from multiple protein-protein complexes. We utilize an innovative two-layer architecture where the first layer generates a prediction of protein-binding, RNA-binding, DNA-binding, and small ligand-binding residues. The second layer re-predicts PBRs by reducing overlap between PBRs and the other types of binding residues produced in the first layer. Empirical tests on an independent test dataset reveal that SCRIBER significantly outperforms current predictors and that all three innovations contribute to its high predictive performance. SCRIBER reduces cross-predictions by between 41% and 69% and our conservative estimates show that it is at least 3 times faster. We provide putative PBRs produced by SCRIBER for the entire human proteome and use these results to hypothesize that about 14% of currently known human protein domains bind proteins.
Availability: SCRIBER webserver is available at http://biomine.cs.vcu.edu/servers/SCRIBER/.
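
The cross-prediction issue described in this abstract can be made concrete with a small sketch. It assumes, as an illustration only, that the cross-prediction rate is the fraction of residues binding a non-protein partner (RNA, DNA or a small ligand) that a predictor labels as protein-binding; the function name and the exact definition are hypothetical and are not taken from SCRIBER.

```python
def cross_prediction_rate(predicted_pbr, binds_other_partner):
    """Fraction of non-protein-binding-partner residues flagged as PBRs.

    predicted_pbr       : list of 0/1 flags, 1 = predicted protein-binding
    binds_other_partner : list of 0/1 flags, 1 = residue binds RNA, DNA
                          or a small ligand (assumed annotation)
    """
    # Keep only the predictions made for residues that bind another partner.
    flagged = [p for p, other in zip(predicted_pbr, binds_other_partner) if other]
    if not flagged:
        return 0.0
    return sum(flagged) / len(flagged)
```

A lower value means fewer residues that bind other partner types are mislabeled as protein-binding.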

11:20 AM-11:40 AM
Proceedings Presentation: Multifaceted Protein-Protein Interaction Prediction Based on Siamese Residual RCNN
  • Muhao Chen, University of California, Los Angeles, United States
  • Chelsea J.-T. Ju, University of California, Los Angeles, United States
  • Guangyu Zhou, University of California, Los Angeles, United States
  • Shirley Chen, University of California, Los Angeles, United States
  • Tianran Zhang, University of California, Los Angeles, United States
  • Kai-Wei Chang, University of California, Los Angeles, United States
  • Carlo Zaniolo, University of California, Los Angeles, United States
  • Wei Wang, University of California, Los Angeles, United States

Motivation: Sequence-based protein-protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences. Based on these features, statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage on the PPI information.
Results: We present an end-to-end framework, PIPR, for PPI predictions using only the primary sequences. PIPR incorporates a deep residual recurrent convolutional neural network in the Siamese architecture, which leverages both robust local features and contextualized information that are significant for capturing the mutual influence of protein sequences. Our framework relieves the data pre-processing efforts that are required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that PIPR outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows promising performance on the more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short.

11:40 AM-12:00 PM
Proceedings Presentation: Reconstructing Signaling Pathways Using Regular-Language Constrained Paths
  • Mitchell Wagner, Virginia Tech, United States
  • Aditya Pratapa, Virginia Tech, United States
  • T. M. Murali, Virginia Tech, United States

Motivation: High-quality curation of the proteins and interactions in signaling pathways is slow and painstaking. As a result, many experimentally-detected interactions are not annotated to any pathways. A natural question that arises is whether or not it is possible to automatically leverage existing pathway annotations to identify new interactions for inclusion in a given pathway.

Results: We present RegLinker, an algorithm that achieves this purpose by computing multiple short paths from pathway receptors to transcription factors (TFs) within a background interaction network. The key idea underlying RegLinker is the use of regular-language constraints to control the number of non-pathway interactions that are present in the computed paths. We systematically evaluate RegLinker and five alternative approaches against a comprehensive set of 15 signaling pathways and demonstrate that RegLinker recovers withheld pathway proteins and interactions with the best precision and recall. We used RegLinker to propose new extensions to the pathways. We discuss the literature that supports the inclusion of these proteins in the pathways. These results show the broad potential of automated analysis to attenuate difficulties of traditional manual inquiry.

Availability: https://github.com/Murali-group/RegLinker
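
To illustrate the idea of a regular-language constraint on paths, here is a minimal sketch rather than the RegLinker implementation. It assumes each edge is tagged as a known pathway edge or not, and finds the shortest receptor-to-TF path that uses at most `max_new` non-pathway edges by searching a product graph of (node, edges-used) states. The function name, edge format and the "at most k" constraint are illustrative assumptions.

```python
import heapq


def constrained_shortest_path(edges, source, target, max_new):
    """Shortest path from source to target using at most max_new non-pathway edges.

    edges: iterable of (u, v, weight, is_pathway) directed edges.
    States are (node, k), where k counts non-pathway edges used so far;
    this is the product of the network with a small finite automaton.
    Returns (distance, path) or None if no admissible path exists.
    """
    adjacency = {}
    for u, v, weight, is_pathway in edges:
        adjacency.setdefault(u, []).append((v, weight, is_pathway))

    best = {(source, 0): 0.0}
    heap = [(0.0, source, 0, (source,))]
    while heap:
        dist, node, used, path = heapq.heappop(heap)
        if node == target:
            return dist, list(path)
        if dist > best.get((node, used), float("inf")):
            continue  # stale heap entry
        for nxt, weight, is_pathway in adjacency.get(node, []):
            k = used if is_pathway else used + 1
            if k > max_new:
                continue  # the automaton rejects this extension
            new_dist = dist + weight
            if new_dist < best.get((nxt, k), float("inf")):
                best[(nxt, k)] = new_dist
                heapq.heappush(heap, (new_dist, nxt, k, path + (nxt,)))
    return None
```

With `max_new=0` the search is restricted to already-annotated interactions; raising it lets the path borrow a bounded number of interactions from the background network.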

12:00 PM-12:20 PM
ProfileView: a pipeline based on multiple probabilistic models resolving the functional organization of the cryptochrome/photolyase protein family
  • Riccardo Vicedomini, Sorbonne Universités, UPMC-Univ. P6, CNRS, IBPS, Laboratory of Computational and Quantitative Biology - UMR 7238, France
  • Alessandra Carbone, Sorbonne Université, France

Functional classification of sequences has become a bottleneck for understanding the large number of protein sequences accumulating in our databases due to recent advances in high-throughput sequencing. Homologous sequences often display diverse functional activities that we need to unravel for a fundamental understanding of living organisms and for biotechnological applications. Computational tools that classify sequences by function could help screen sequences for the design of accurate functional testing experiments and the discovery of new functions.
ProfileView is a novel computational pipeline designed to functionally classify sets of homologous protein sequences. It considers multiple probabilistic profiles for the same protein domain and exploits them to evaluate each sequence in the ensemble. The similarity of profile hits, capturing similar functional patterns, determines which sequences cluster together in the ProfileView multidimensional space of sequences defined by the profile models. As a proof of concept, we apply ProfileView to the important class of photoactive proteins known to present a large variety of functions. Known functional characterizations confirm the soundness of the functional organization obtained with our approach, while laboratory experiments and modeling confirmed the interest of a new, previously uncharacterized functional group that we identified as a photoreceptor, paving the way to new possible analyses.

12:20 PM-12:40 PM
Towards evolution-guided protein design
  • Martin Weigt, Sorbonne Université, France

Over the last few years, statistical modelling approaches that take into account residue coevolution in proteins have become increasingly popular for predicting protein structure, protein interaction and mutational landscapes from the sequence variability across homologous protein families. In this talk, we will show that precise coevolutionary models, inferred using Direct Coupling Analysis based on Boltzmann machine learning, can be considered as generative models: they reproduce statistical features of the data that are not explicitly fitted. Sequences sampled from these models can therefore be seen as good candidates for artificial sequence homologs. In collaboration with the Ranganathan lab, we have tested more than 2000 artificial sequences for their in vivo functionality, and find an astonishingly high fraction (up to 50%) of perfectly functional sequences, which are up to about 40 mutations away from the closest known natural sequence. Experimental data can, in turn, be reintroduced in the statistical modelling procedure, to improve the success rate and to unveil sequence patterns related to protein function. Collaboration with Matteo Figliuzzi, Pierre Barrat-Charlaix, Remi Monasson, Simona Cocco, Bill Russ, Rama Ranganathan.
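
As a rough illustration of what such a coevolutionary model scores, the sketch below evaluates the unnormalized log-probability of a sequence under a Potts-type model with single-site fields and pairwise couplings, the general form used in Direct Coupling Analysis. The data layout (`fields`, `couplings`) and the function name are assumptions made for this example; inferring the parameters and sampling sequences are not shown.

```python
def potts_log_weight(sequence, fields, couplings):
    """Unnormalized log-probability of a sequence under a Potts-type model.

    fields[i][a]              : single-site term for residue a at position i
    couplings[(i, j)][(a, b)] : pairwise term for residues a, b at positions i < j
    Missing entries are treated as zero.
    """
    total = 0.0
    # Single-site contributions.
    for i, residue in enumerate(sequence):
        total += fields[i].get(residue, 0.0)
    # Pairwise (coevolutionary) contributions.
    for (i, j), table in couplings.items():
        total += table.get((sequence[i], sequence[j]), 0.0)
    return total
```

Sequences with a higher value are more probable under the model, which is the quantity a sampler would use when proposing artificial homologs.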

2:00 PM-2:20 PM
Predictability of Human Differential Gene Expression
  • Megan Crow, Cold Spring Harbor Laboratory, United States
  • Nathaniel Lim, The University of British Columbia, Canada
  • Sara Ballouz, Cold Spring Harbor Laboratory, United States
  • Paul Pavlidis, The University of British Columbia, Canada
  • Jesse Gillis, Cold Spring Harbor Laboratory, United States

Data from high-throughput expression experiments are an important resource for computational prediction of gene function or disease candidacy, often by providing context-specific information about interactions. Though transcriptomic data are often referred to as unbiased, intrinsic differences in gene expression levels and variability are likely to have an impact on the accuracy and breadth of annotations that can be inferred. In this work, we re-analyze more than 600 independent microarray studies to define a gene-level "differential expression (DE) prior" that informs functional interpretation of gene sets. We find that the DE prior has remarkably high performance for predicting DE hit lists, and that genes associated with sex, the extracellular matrix, the immune system and stress responses are prominent within the DE prior. In contrast, predictors based on attributes such as mutation rates or network features perform poorly. Finally, we demonstrate the application of the DE prior to data interpretation in three use cases: (1) breast cancer sub-typing, (2) single cell genomics of pancreatic islet cells, and (3) meta-analysis of lung adenocarcinoma and renal transplant rejection transcriptomics. In all cases, we find hallmarks of generic differential expression, highlighting the need for nuanced interpretation of gene phenotypic associations.

2:20 PM-2:40 PM
Geometric Deep Learning Methods for Large-Scale Structure-Based Protein Function Prediction
  • Douglas Renfrew, Flatiron Institute, United States
  • Vladimir Gligorijević, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA, United States
  • Richard Bonneau, Center for Data Science, New York University, New York, NY, USA, United States

We present a novel geometric deep learning method based on Graph Convolutional Neural Networks (GCNNs) for predicting functions from experimental and Rosetta-predicted protein structures. In contrast to CNNs, which have been the state of the art in predicting protein function from sequence, we show that GCNNs are better at extracting features from proteins and predicting their functions by taking into account the graph-based structure of their amino acid residues represented by contact maps. Our method uses three graph convolutional layers followed by two fully connected layers to learn a complex structure-to-function relationship by first automatically extracting features from PDB protein structures and then mapping them to Gene Ontology functional classes.
We show that our GCNN-based method learns generalizable aspects of protein structure by robustly predicting functions of proteins with < 30% sequence identity to the training set. We also demonstrate that its performance on ~300 Rosetta-predicted lowest-energy structures is comparable to its performance on their native structures. Our model advances the state of the art on a wide range of functions and removes the need for manual feature engineering from protein structures; it can make predictions that go beyond homology-based transfer and can reliably be used to annotate proteins with predicted structures.
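
For readers unfamiliar with graph convolutions, the following is a minimal sketch of a single layer applied to a residue contact map. It assumes the widely used symmetric normalization with self-loops followed by a ReLU; the abstract does not specify the exact layer, so this form, the function names and the toy inputs are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def graph_conv(contact_map, features, weights):
    """One graph convolution: ReLU(A_norm @ features @ weights).

    contact_map : n x n 0/1 matrix of residue contacts
    features    : n x d matrix of per-residue features
    weights     : d x k matrix of layer weights
    """
    n = len(contact_map)
    # Add self-loops so each residue keeps its own features.
    adj = [[1.0 if i == j else float(contact_map[i][j]) for j in range(n)]
           for i in range(n)]
    degree = [sum(row) for row in adj]
    # Symmetric degree normalization.
    adj_norm = [[adj[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
                for i in range(n)]
    out = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, value) for value in row] for row in out]
```

Each output row mixes a residue's features with those of its contacts, which is how structural neighborhood information enters the learned representation.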

2:40 PM-3:00 PM
Deciphering Protein Functional Complexity With Alternative Splicing Evolution
  • Hugues Richard, Sorbonne Université - Laboratory of Computational and Quantitative Biology (LCQB, CNRS-SU), France
  • Elodie Laine, Sorbonne Université - Laboratory of Computational and Quantitative Biology (LCQB, CNRS-SU), France
  • Diego Zea, LCQB UMR 7238 CNRS, Institut de Biologie Paris Seine, Sorbonne Université, Argentina

Alternative Splicing (AS) is an essential regulatory process by which multiple isoforms are produced from the same gene. It has the potential to greatly expand the protein repertoire and to contribute to functional diversity. Growing experimental evidence has shown that AS modulates various biochemical processes and that alternative isoforms of the same gene can accomplish different biological functions by interacting with different partners. We propose two methods, ThorAxe and PhyloSofS, to reconstruct AS evolution and to functionally annotate exons and transcripts. ThorAxe combines pairwise and multiple sequence alignments to establish transcript-aware homologous relationships between exons across species, while PhyloSofS reconstructs the evolutionary history of the set of transcripts and builds homology-based structural models of their protein isoforms. We analyse a set of a dozen genes whose AS has been shown to be functionally relevant, in which we identified non-trivial groups of homologous exons and dated the corresponding sets of transcripts. To our knowledge, we are the first to provide such a rich description of the complexity of AS-induced protein functional diversity.

3:00 PM-3:20 PM
The evolutionary signal in metagenome phyletic profiles predicts many gene functions
  • Vedrana Vidulin, Jožef Stefan Institute, Slovenia
  • Tomislav Šmuc, Ruđer Bošković Institute, Croatia
  • Sašo Džeroski, Jožef Stefan Institute, Slovenia
  • Fran Supek, Institute for Research in Biomedicine (IRB Barcelona), Spain

Many methods have been proposed that infer gene function from comparative analyses of whole-genome sequences, phyletic profiles (PP) being among the top-performing. A PP encodes the pattern of presence/absence of gene family members across genomes. Motivated by the PP’s accuracy and the abundance of metagenomic data, we propose a method to construct metagenome phyletic profiles (MPP), which reflect relative abundances of gene families across metagenomes. We derive MPP from 5049 metagenomes covering seven types of environments and compare against PP constructed from 2071 bacterial/archaeal genomes. Both data sets cover 3536 COGs annotated with 3358 Gene Ontology (GO) terms. Classification models are constructed using the CLUS-HMC algorithm, which accounts for the GO hierarchy. We find that MPP can predict hundreds of GO terms that could not be predicted by standard PP. In particular, MPP alone, but not PP, can predict 29% of the 664 GO terms that have at least one inference at a stringent threshold of FDR < 10% (by either PP or MPP). We also find that MPP’s accuracy does not necessarily increase with more data, but it does benefit from sampling diverse environments. Simulation studies indicate that full accuracy is reached with only 500 metagenomes (out of the tested 5049) if maximum diversity is enforced.
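
The difference between the two kinds of profile can be sketched as follows. The example assumes a PP is a binary presence/absence vector per gene family and that an MPP normalizes each family's count by the total count in the metagenome; that normalization, the input formats and the function names are assumptions for illustration, not the authors' pipeline.

```python
def phyletic_profile(gene_families, genomes):
    """Binary presence/absence of each gene family across genomes.

    genomes: dict mapping genome name -> set of gene families it contains.
    """
    names = sorted(genomes)
    return {family: [1 if family in genomes[g] else 0 for g in names]
            for family in gene_families}


def metagenome_profile(gene_families, counts):
    """Relative abundance of each gene family across metagenomes.

    counts: dict mapping metagenome name -> dict of gene family -> count.
    """
    names = sorted(counts)
    profile = {family: [] for family in gene_families}
    for name in names:
        total = sum(counts[name].values())
        for family in gene_families:
            value = counts[name].get(family, 0) / total if total else 0.0
            profile[family].append(value)
    return profile
```

Either profile gives one feature vector per gene family, which can then be passed to a classifier that predicts GO terms.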

3:20 PM-3:30 PM
Graph-Regularized Autoencoders for Protein Feature Learning
  • Meet Barot, Center for Data Science, New York University, New York, NY, USA, United States
  • Vladimir Gligorijević, Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA, United States
  • Richard Bonneau, Center for Data Science, New York University, New York, NY, USA, United States

In protein function prediction, as with many biological classification tasks, most samples are unlabeled or have missing information, while others have extra information that is helpful for learning relationships between samples and the labels. Recently, autoencoders were used in a method called deepNF to learn protein features useful for protein function prediction from protein interaction networks in a completely unsupervised fashion. In order to improve on this, we present a semi-supervised learning technique using autoencoders that incorporates a protein-protein similarity graph constructed using GO annotations. We use a regularization term that enforces features of samples to be similar if they are connected in the GO similarity graph, without needing all samples to be labelled. We train autoencoders on protein-protein interaction networks with this regularization term. We then, as in deepNF, train SVM classifiers using these learned features to predict GO terms. We test our model using a temporal holdout validation scheme and show that our method outperforms previous network-based methods on yeast and human STRING networks. Additionally, we show that this same technique can be used to incorporate various kinds of “privileged information” in protein representation learning.
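
One common way to write a regularization term of this kind is a sum of squared distances between the learned features of proteins that are connected in the similarity graph, added to the autoencoder's reconstruction error. The sketch below assumes that form; the weighting parameter `lam`, the function names and the toy data layout are hypothetical and may differ from the authors' formulation.

```python
def graph_regularizer(embeddings, edges):
    """Sum of squared Euclidean distances between embeddings of linked proteins.

    embeddings : dict mapping protein id -> list of feature values
    edges      : iterable of (protein_a, protein_b) pairs from the
                 GO similarity graph (only annotated proteins appear)
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
        for i, j in edges
    )


def regularized_loss(reconstruction_error, embeddings, edges, lam=0.1):
    """Reconstruction error plus the weighted graph penalty."""
    return reconstruction_error + lam * graph_regularizer(embeddings, edges)
```

Proteins without GO annotations simply contribute no edges, so they are trained on the reconstruction error alone, which is what makes the scheme semi-supervised.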

3:30 PM-3:40 PM
Not all protein modification sites are created equal: Insights from a human-specific substitution matrix
  • Tair Shauli, The Hebrew University of Jerusalem, Israel
  • Nadav Brandes, The Hebrew University of Jerusalem, Israel
  • Michal Linial, The Hebrew University of Jerusalem, Israel

The proteomic implication of human genetic variation is fundamental to our understanding of protein function. Such variation is commonly encapsulated in amino-acid (AA) substitution matrices (AASUMs), which are pivotal to the study of proteomics and molecular evolution. However, contemporary AASUMs, such as BLOSUM, are unlikely to reflect human-specific variation. In this study, we present a set of human-centric substitution matrices, at codon and AA resolutions. Derived from an exhaustive list of >8M variants collected from >60K healthy human individuals, our AASUMs substantially deviate from the BLOSUM matrices. Additionally, in contrast to inter-taxa derived matrices, our novel matrices convey directional information. For example, we find that substitutions into Ser and Leu prevail over most substitutions of these AAs into others. All 400 possible AA substitutions are directly calculated using a probabilistic approach based on the frequencies of human genetic variants. We investigate how the presence of major PTMs leads to different patterns of AA substitutions. We show that, with respect to the general human model, substitutions in genomic sites that are associated with phospho-Thr/Ser are more dynamic and interchangeable, while sites of phospho-Tyr are far more robust. Our matrix provides a strong baseline for studying human protein function in health and disease.
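
The notion of a directional matrix can be illustrated with a small sketch that estimates, from a list of observed variants, the probability of each alternative amino acid given the reference amino acid. This is a simplified frequency estimate assumed for illustration; the authors' probabilistic model works from variant frequencies at codon resolution and is not reproduced here.

```python
from collections import Counter


def substitution_probabilities(variants):
    """Estimate P(alt | ref) from observed (ref_aa, alt_aa) variant pairs.

    The result is directional: the entry for (ref, alt) is generally
    different from the entry for (alt, ref).
    """
    counts = Counter(variants)
    totals = Counter()
    for (ref, _alt), n in counts.items():
        totals[ref] += n
    return {(ref, alt): n / totals[ref] for (ref, alt), n in counts.items()}
```

Unlike a symmetric matrix built from aligned sequence pairs, such a table keeps "A to S" and "S to A" as separate entries.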

3:40 PM-3:50 PM
Data integration through heterogeneous ensembles
  • T. M. Murali, Virginia Tech, United States
  • Linhua Wang, Icahn School of Medicine at Mount Sinai, United States
  • Jeffrey Law, Virginia Tech, United States
  • Gaurav Pandey, Icahn School of Medicine at Mount Sinai, United States

Integrating diverse datasets has been effective for addressing several biomedical problems, including protein function prediction. A common approach to this task is early integration, where individual datasets are assimilated into a common representation, which is then analyzed using methods like predictive modeling. Another approach is late integration, where base models are first inferred from individual datasets, and then assimilated. We present a novel heterogeneous ensemble-based approach to late integration. These ensembles, typically learned from a large number and variety of models, have shown promise, but have generally been restricted to a single dataset. We repurposed these ensembles to learn base models from individual datasets and then train a second-layer ensemble over these models. We applied this ensemble integration (EI) approach to six STRING networks to predict human protein annotations to 112 Gene Ontology (GO) terms. Results from five-fold cross-validation show that EI performs significantly better than Mashup and deepNF, specialized algorithms for early network integration. EI maintains this advantage for the generally challenging deeper and/or sparsely annotated GO terms. These results show that encapsulating the information in individual datasets into local models, and then assimilating them into ensembles, gives power to EI, which should also be useful for other problems.

3:50 PM-4:00 PM
Updates on the Critical Assessment of Function Annotation Challenge Series
  • Predrag Radivojac, Northeastern University, United States
  • Iddo Friedberg, Iowa State University, United States
  • Casey Greene, University of Pennsylvania, United States
  • Sean Mooney, University of Washington, United States
  • Sandra Orchard, EMBL-EBI, United Kingdom
  • Claire O'Donovan, EBI, United Kingdom
  • Deborah Hogan, Geisel School of Medicine at Dartmouth, United States
  • Giovanni Bosco, Dartmouth College, United States
  • Balint Z. Kacsoh, Geisel School of Medicine at Dartmouth, United States
  • George Georghiou, EMBL European Bioinformatics Institute, United Kingdom
  • Alex W. Crocker, Geisel School of Medicine at Dartmouth, United States
  • Kimberley Lewis, Geisel School of Medicine at Dartmouth, United States
  • Timothy Bergquist, University of Washington, United States
  • Yuxiang Jiang, Indiana University Bloomington, United States
  • Naihui Zhou, Iowa State University, United States
  • Maria Jesus Martin, EMBL-EBI, United Kingdom

The third Critical Assessment of Function Annotation challenge (CAFA3) released its prediction targets in September 2016, and preliminary results were announced in July 2017. CAFA3 featured a term-centric track where predictors were asked to associate a large set of genes (the complete genomes of Candida albicans and Pseudomonas aeruginosa) with a limited set of functions. By collaborating with experimental biologists, we were able to use unpublished whole-genome screen results to evaluate these predictions. To specifically address this question, we hosted an additional challenge, CAFA-Pi, that is dedicated to evaluating term-centric predictions. We will discuss the latest CAFA3 and CAFA-Pi analyses including comparisons with CAFA1 and CAFA2 results, methods, diversity analyses, and CAFA-Pi results analysis. We also used CAFA to discover new genes associated with long-term memory in Drosophila melanogaster.

Having conducted three challenges to date, we provide a historical review of the progress made in CAFA, and discuss the difficulties and opportunities in the field of function prediction and the role of an ongoing community challenge in the progress made in the field.

4:40 PM-5:00 PM
Metric Learning on Expression Data for Gene Function Prediction
  • Stavros Makrodimitris, Delft University of Technology, Netherlands
  • Roeland van Ham, Delft University of Technology, Netherlands
  • Marcel Reinders, TU Delft and Leiden University Medical Center, Netherlands

Motivation: Co-expression of two genes across different conditions is indicative of their involvement in the same biological process. However, using RNA-Seq datasets with many experimental conditions from diverse sources introduces batch effects and other artefacts that might obscure the real co-expression signal. Moreover, only a subset of experimental conditions is expected to be relevant for finding genes related to a particular Gene Ontology (GO) term. Therefore, we hypothesize that when the purpose is to find similar functioning genes, the co-expression of genes should not be determined on all samples but only on those samples informative for the GO term of interest.
Results: We developed MLC (Metric Learning for Co-expression), a fast algorithm that assigns a GO-term-specific weight to each expression sample. The goal is to obtain a weighted co-expression measure that is more suitable than the unweighted Pearson correlation for applying Guilt-By-Association-based function predictions. More specifically, if two genes are annotated with a given GO term, MLC tries to maximize their weighted co-expression and if one of them is not annotated with that term, the weighted co-expression is minimized. Our experiments on publicly available Arabidopsis thaliana RNA-Seq data demonstrate that MLC outperforms standard Pearson correlation in term-centric ROC AUC.
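
A weighted co-expression measure of the kind described can be sketched with the standard weighted Pearson correlation, in which each expression sample contributes in proportion to its weight. The sketch only shows how a given weight vector changes the correlation; how MLC learns the GO-term-specific weights is not shown, and the function name is an assumption.

```python
import math


def weighted_pearson(x, y, weights):
    """Weighted Pearson correlation between two expression vectors.

    x, y    : expression values of two genes across the same samples
    weights : non-negative per-sample weights (not all zero)
    """
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    cov = sum(w * (a - mean_x) * (b - mean_y) for w, a, b in zip(weights, x, y)) / total
    var_x = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x)) / total
    var_y = sum(w * (b - mean_y) ** 2 for w, b in zip(weights, y)) / total
    return cov / math.sqrt(var_x * var_y)
```

With uniform weights this reduces to the ordinary Pearson correlation; setting the weight of an uninformative sample to zero removes its influence entirely.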

5:00 PM-5:20 PM
BUSCA for the annotation of protein subcellular localization
  • Castrense Savojardo, Biocomputing Group, University of Bologna, Italy
  • Piero Fariselli, Department of Medical Sciences, University of Torino, Italy
  • Martelli Luigi, University of Bologna - Biocomputing Group, Italy
  • Rita Casadio, University of Bologna - Biocomputing Group & National Research Council - Institute of Biomembranes and Bioenergetics, Italy

In-silico prediction of protein subcellular localization is a key step of large-scale functional annotation projects aiming at understanding the role of each protein within the cellular complexity.
We recently developed the Bologna Unified Subcellular Component Annotator (BUSCA) (http://busca.biocomp.unibo.it), a web server integrating different computational tools developed by our group and devised to predict protein subcellular localization. In particular, BUSCA combines: i) tools for the detection of signal and transit peptides (DeepSig and TPpred3), GPI-anchors (PredGPI) and transmembrane domains (ENSEMBLE3.0 and BetAware); ii) tools for assessing subcellular localization of both globular and membrane proteins (BaCelLo, MemLoci and SChloro). The different tools are organized into five pipelines for predicting subcellular localization in plants, animals, fungi, Gram-positive and Gram-negative bacteria. Overall, BUSCA discriminates sixteen different compartments in plants, nine in animals and fungi, four in Gram-negative and three in Gram-positive bacteria. Moreover, BUSCA provides residue-level feature annotation including membrane-spanning segments, signal and transit peptide cleavage sites and GPI-anchors. In benchmarks performed on CAFA2 and CAFA3 targets, BUSCA performed comparably to, or even better than, other methods: F1 measures are 0.49 and 0.59 in CAFA2 and CAFA3 experiments, respectively. We propose BUSCA as an accurate resource for the annotation of protein subcellular localization.
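
For reference, an F1 measure combines precision and recall into their harmonic mean. The sketch below computes it for one protein from a set of predicted and a set of annotated compartments; how the benchmark aggregates scores over proteins or thresholds is not stated in the abstract, so this per-protein form and the function name are assumptions.

```python
def f1_score(predicted, actual):
    """Harmonic mean of precision and recall for two sets of labels."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting two compartments when only one of them is annotated gives a precision of 0.5 and a recall of 1.0.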

5:20 PM-5:40 PM
An Evolutionary Calculus for Identifying Genes under Selection in Mutational Landscapes
  • Teng-Kuei Hsu, Baylor College of Medicine, United States
  • Panagiotis Katsonis, Baylor College of Medicine, United States
  • Amanda Koire, Baylor College of Medicine, United States
  • Kwanghyuk Lee, Baylor College of Medicine, United States
  • Thomas Bourquard, Baylor College of Medicine, United States
  • Young Won Kim, Baylor College of Medicine, United States
  • Olivier Lichtarge, Baylor College of Medicine, United States
  • Brigitta Wastuwidyaningtyas, Baylor College of Medicine, United States

Finding which proteins play a role in diseases and are relevant drug targets remains difficult, despite an abundance of sequencing and other omics data. Genome-wide association studies tend to lack sensitivity and specificity because disease driver mutations are often too rare and too personal to rise above the noise from thousands of nearly neutral gene variants we all carry. This is particularly so for widespread diseases, which tend to be polygenic. To improve these analyses of protein function in health and disease, we developed a new mathematical calculus of evolution. Examples from bacteria to humans show how differential and integral equations can be combined to identify genes that drive traits and diseases. These ongoing studies suggest that the genotype-phenotype relationship follows paths in the fitness landscape that are described adequately by calculus. Theory aside, in practice, the use of these equations can point to protein mechanisms active in disease that suggest drug targets, with the hope of guiding personalized screening and treatment.

5:40 PM-6:00 PM
Choosing the best annotation for functional enrichment testing
  • Julien Roux, University of Basel, Switzerland

Functional enrichment testing has demonstrated its usefulness for the characterization of lists of genes, for example differentially expressed genes obtained from RNA-sequencing experiments. But this type of analysis does not have a very good reputation within the genomics community, because it is often perceived as an "obligatory" step that brings limited insights to an article.

However, numerous functional enrichment analyses are performed using outdated or pre-filtered annotations, which could negatively impact the relevance of the results. How should we best take advantage of the large amount of knowledge formalized in the functional annotation databases?

There is unfortunately a disconnect between the biocuration community, busy with functional data integration, annotation reliability and sharing issues - but having difficulty benchmarking the quality of the data provided and providing relevant guidelines for optimal use - and the data analysis community, often focusing on the methodological aspects of functional enrichment testing - but not aware of annotation practices and recent improvements.

In this talk/poster I will focus on the use of functional annotations from the Gene Ontology consortium, and perform a comparative assessment of the impact of the choice of data sources and filtering approaches on gene set enrichment testing results.