COSI Track Presentations

Schedule subject to change
Tuesday, July 23rd
10:15 AM-10:20 AM
  • Michel Dumontier

Introduction to the Bio-Ontologies track and program.

10:20 AM-11:20 AM
Keynote: A Knowledge Graph for Health: can graphs really save lives?
  • Helena Deus

Weaving a web of reliable health knowledge has never been more important - as machine learning automation replaces computational health care tasks that used to be performed by hospital staff, the datasets used to train the models need to be complete, reliably updated, clean and unbiased. The primary sources of that data are hospital electronic health records and related physicians’ notes. Yet the rise of cell therapies, precision medicine and wearable medical devices has created another, highly relevant source of data for that automation - detailed measurements of genetic, metabolic and environmental data. Whereas these have been repeatedly shown to significantly impact diagnosis and treatment, providers of such services will face an uphill battle until their data is available to the patient in an interoperable format. One of the formats supported by FHIR, the health interoperability standard recommended by HL7, is RDF. Knowledge graphs are therefore set to play an increasingly important role not only as the interfaces connecting data from wearables and diagnostic tests to the patient ecosystem, but also as sources of data for training machine learning models that can identify the best treatment given the patient’s information. This talk will describe a few of the challenges - and the solutions - that staff at Elsevier have come across while using an expert curated knowledge graph and machine learning for enabling precision medicine.

Bio: Helena Deus received her PhD in Bioinformatics from Lisbon University (UNL) where she focused on Linked Data and Semantic Web applications for Health Care and Life Sciences, with an emphasis on Cancer Research. Helena is passionate about applying data science to improving healthcare – in particular for oncology. She works for Elsevier as technology research director and is leading the effort to build a knowledge graph for health. Prior to joining Elsevier, Helena's roles included directing a data science team at Foundation Medicine and leading the Health Care and Life Sciences strategy at the Digital Enterprise Research Institute.

11:20 AM-11:40 AM
Knowledge graph representation learning: approaches and applications to bio-medicine
  • Mona Alshahrani, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

There are over 500 biomedical ontologies and millions of triples that store interconnected biological data in knowledge bases as semantic facts consisting of entities and their relations. Such knowledge bases are mostly used for information retrieval, data integration and data provision. Developing machine learning methods that can exploit such resources for predictive analysis and novel discovery has become necessary and important. In this work, we use the plethora of biological linked data and bio-ontologies to form knowledge graphs centered around biological entities and classes. Knowledge graph embedding methods have recently emerged as an effective and promising paradigm for analyzing and learning from knowledge graphs within and across subject domains. We present the potential of knowledge graph embedding methods as predictive tools in the biomedical domain. We compare four state-of-the-art methods on the link prediction task for important biological relations. Each of these methods represents a different category of knowledge graph methods. We investigate various settings and evaluation metrics and their effects on performance.
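
As an illustration of how link prediction is commonly evaluated, here is a minimal sketch of two rank-based metrics, mean reciprocal rank and Hits@k. The abstract does not name the metrics the authors used, so the choice of metrics, the function names and the toy scores below are all assumptions made for illustration.

```python
def rank_of_true(scores, true_entity):
    """Rank (1 = best) of the true entity among candidates scored by a model.

    `scores` maps each candidate entity to a plausibility score
    (higher = more plausible). The values are hypothetical.
    """
    true_score = scores[true_entity]
    # Count the candidates that the model scored strictly higher.
    return 1 + sum(1 for s in scores.values() if s > true_score)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical scores for the candidate tails of one (head, relation, ?) query.
example_scores = {"geneA": 0.9, "geneB": 0.4, "geneC": 0.7}
example_rank = rank_of_true(example_scores, "geneC")
```

With ranks collected over a whole test set, `mean_reciprocal_rank(ranks)` and `hits_at_k(ranks, 10)` summarize how often a model places the true entity near the top of its candidate list.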

11:40 AM-12:00 PM
Combining knowledge-based approach with logic data mining techniques to improve data querying and analysis on Alzheimer's Disease data
  • Fabio Cumbo, Department CIBIO, University of Trento, Trento, Italy
  • Francesco Taglino, Institute of Systems Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Rome, Italy
  • Michele Sonnessa, Genomics Laboratory, European Brain Research Institute, Rome, Italy
  • Federico Perazzoni, Department of Engineering, Uninettuno International University, Rome, Italy
  • Gabriella Mavelli, Institute of Systems Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Rome, Italy
  • Giulia Fiscon, Institute of Systems Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Rome, Italy
  • Federica Conte, Institute of Systems Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Rome, Italy
  • Eleonora Cappelli, Department of Engineering, University of Roma Tre, Rome, Italy
  • Paola Bertolazzi, Institute of Systems Analysis and Computer Science "Antonio Ruberti", National Research Council of Italy, Rome, Italy
  • Ivan Arisi, Genomics Laboratory, European Brain Research Institute, Rome, Italy
  • Giulia Antognoli, ACT Operations Research, Rome, Italy
  • Roger Voyat, Department of Engineering, Uninettuno International University, Rome, Italy

A huge amount of biomedical data related to many pathologies is collected around the world. In particular, in the area of neurodegenerative diseases, recent years have witnessed the growth of specialized databases such as the Alzheimer’s Disease Neuroimaging Initiative (ADNI), which covers psychometric tests, biospecimens, imaging, and laboratory results. Analyzing these data is a challenging task, and machine learning (ML) may offer methods and tools for knowledge discovery from them. However, ADNI suffers from a scarce conceptualization behind the collected data, which prevents fully intuitive access to the data and direct analysis through ML methods. Therefore, in order to take advantage of this big data repository, we are working in two directions: (i) developing a detailed ontology to give a more conceptual organization to the data, to ease data access and interpretation, and to facilitate data integration with other data sources; (ii) applying logic data mining methodologies to extract knowledge and generate probabilistic diagnostic models from the ontology, in order to classify patients into disease categories.

12:00 PM-12:20 PM
SIENA: Semi-automatic Semantic Enhancement of Datasets using Concept Recognition
  • Andreea Grigoriu, Institute of Data Science, Maastricht University, Netherlands
  • Amrapali Zaveri, Institute of Data Science, Maastricht University, Netherlands
  • Gerhard Weiss, Department of Data Science and Knowledge Engineering, Maastricht University, Netherlands
  • Michel Dumontier, Institute of Data Science, Maastricht University, Netherlands

The amount of available data that can help answer research questions is growing. However, the variety of formats in which data is published is growing as well, creating a serious challenge when multiple datasets need to be integrated to answer a question. This paper presents a semi-automated framework that provides semantic enhancement of biomedical data, specifically datasets containing gene information.
This framework combines Machine Learning classification with annotation using BioPortal to automatically identify the semantic type of a concept. Compared to baseline methods, the proposed framework achieves the best results.

2:00 PM-3:00 PM
Keynote: Knowledge graph and computable models
  • Ioannis Xenarios

Knowledge graphs and computable models are becoming entities that try to capture the complexity of biological systems and encode this information into a structure that is reusable and linkable. Cellular models used in systems biology have emerged over the last decade as useful media to map, compute and sometimes predict in-silico cellular behavior. The presentation will show, through a set of examples, how computable models require accurate and community-maintained ontologies. It will also cover the emergence of linked data as an essential component of this ecosystem.

Bio: Ioannis Xenarios holds a PhD in immunology from the Ludwig Institute for Cancer Research (LICR) and the Biochemistry department of the University of Lausanne. As an undergraduate student he was trained in the profile/PROSITE method for domain detection in proteins by Dr Philipp Bucher and Roland Luethy, and after his PhD he joined the computational structural biology laboratory of Prof David Eisenberg, where he created with his colleague at UCLA the first database of interacting proteins (DIP). He then moved to an industrial position at Serono/Merck-Serono, where he held several positions in domains ranging from genomics to systems-immunology-oriented activities. For 11 years he was part of the SIB Swiss Institute of Bioinformatics, where he led the Swiss-Prot part of the UniProt consortium and the Vital-IT competence center. Since 2019 he has been leading systems immunology modeling activities within the LICR and the Data Analysis Platform of the Health2030 Genome Center. He is a professor at the Center for Integrative Genomics of the University of Lausanne and a Professor of Chemistry/Biochemistry at the University of Geneva.

3:00 PM-3:20 PM
Proceedings Presentation: DIFFUSE: Predicting isoform functions from sequences and expression profiles via deep learning
  • Hao Chen, University of California, Riverside, United States
  • Dipan Shaw, University of California, Riverside, United States
  • Jianyang Zeng, Tsinghua University, China
  • Dongbo Bu, Chinese Academy of Sciences; University of Chinese Academy of Sciences, China
  • Tao Jiang, University of California, Riverside; Tsinghua University, United States

Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory for at least two reasons: i) unlike gene-level annotations, isoform-level functional annotations are scarce; ii) the information about isoform functions is concealed in various types of data, including isoform sequences and co-expression relationships among isoforms. In this study, we present a novel approach, DIFFUSE, to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refining the prediction using a conditional random field (CRF) based on co-expression relationships. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE can effectively predict the functions of isoforms and genes. It achieves an average AUC of 0.840 and AUPRC of 0.581 over 4,184 GO functional categories, which are significantly higher than those of state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity, and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences.

3:20 PM-3:40 PM
Extending Machine Learning Capabilities of BioAssay Express
  • Peter Gedeck, Collaborative Drug Discovery - CDD VAULT, United States
  • Barry Bunin, Collaborative Drug Discovery - CDD VAULT, United States
  • Hande Küçük McGinty, Collaborative Drug Discovery, United States
  • Alex Clark, Collaborative Drug Discovery Inc., United States

The recently developed BioAssay Express technology streamlines the conversion of human-readable assay descriptions to computer-readable information. BioAssay Express uses public semantic standards (ontologies) to mark up bioprotocols, which unleashes the full power of informatics technology on data that could previously only be organized by crude text searching (https://peerj.com/articles/cs-61/).

One of several annotation-support strategies within BioAssay Express is the use of machine learning models to provide statistically backed "suggestions" to the curator. We will describe our efforts to complement these models by applying ontology-derived text mining, association rule mining based on existing annotations, and axioms that are embedded within the underlying ontologies. BioAssay Express includes the BioAssay Ontology (BAO), Gene Ontology (GO), Drug Target Ontology (DTO) and Cell Line Ontology (CLO). It can also be extended to handle private ontologies.

We will explore how this resource will be used to encourage further semantic annotation of publicly available bioassay protocol data. These efforts are timely and important, as such datasets (released by both public and private organizations) are only increasing, with the volume already exceeding the ability of individual scientists to manage productively.

3:40 PM-4:00 PM
Semantic persistence of ambiguous biomedical names in the citation network
  • Raul Rodriguez-Esteban, F. Hoffmann-La Roche Ltd., Switzerland

Names with multiple meanings generally present only one meaning within a given text. Using a new method that leverages large numbers of biomedical annotations, it can be shown that such semantic persistence exists also across scientific articles connected by citations. Thus, ambiguous biomedical names mentioned in scientific articles tend to present the same meaning in articles that cite them or that they cite, and, to a lesser extent, in articles two steps away in the citation network. Citations, therefore, can be regarded as semantic connections between articles and the citation network should be considered a source of knowledge in tasks such as automatic name disambiguation, entity linking and biomedical database annotation.

4:40 PM-5:00 PM
A multi-class decision tree for identifying cell cluster marker genes in single-cell RNAseq analysis
  • Ahmed Youssef, Boston University, United States
  • Jing Zhang, Boston University, United States
  • Eric Reed, Boston University, United States
  • Zhe Wang, Boston University, United States
  • Evan Johnson, Boston University, United States
  • Stefano Monti, Boston University, United States
  • Joshua Campbell, Boston University, United States
  • Gary Benson, Boston University, Departments of Biology and Computer Science, United States

The emergence of single-cell RNA-seq (scRNA-seq) has revolutionized the ability to study complex biological systems. The analysis of scRNA-seq data typically involves quality control and filtering, dimensionality reduction of features, and identification of all cell clusters via unsupervised clustering methods. The next major step is the identification of marker genes that uniquely define each cell cluster. This task is often attempted by examining the top up-regulated genes in a one-versus-all differential expression analysis for each cell cluster. However, this approach may not work well when cell subpopulations are small and/or are mostly similar in expression to one another. Here we propose a multi-class decision tree approach utilizing multivariate Gaussian density and probability entropy-based information gain that can identify combinations of genes that distinguish various cell clusters. This method was applied to datasets from mouse lung and human peripheral blood mononuclear cells (PBMCs) that were clustered with Celda, a method developed by our lab for scRNA-seq clustering. In each dataset, core groups of gene modules were identified that uniquely distinguished cell subpopulations. Overall, this approach will help facilitate the biological interpretation of clustered single-cell data.
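
To make the notion of entropy-based information gain concrete, here is a minimal sketch of the standard computation for a single threshold split on one gene's expression. It is a generic textbook formulation, not the authors' method: it leaves out the multivariate Gaussian density component, and the function names, expression values and cluster labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of cluster labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting cells at `threshold` on one gene."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    # Entropy of the two child nodes, weighted by their share of cells.
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


# Hypothetical expression of one gene in four cells from two clusters.
expression = [0.1, 0.2, 5.0, 6.0]
clusters = ["c1", "c1", "c2", "c2"]
gain = information_gain(expression, clusters, threshold=1.0)
```

A gene whose split yields a high gain separates the clusters well on its own; a decision tree chooses such splits repeatedly, which is how combinations of genes can end up characterizing a cluster.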

5:00 PM-5:20 PM
Standardising intrinsic disorder description
  • Silvio Tosatto, Department of Biomedical Sciences, University of Padova, Italy
  • András Hatos, University of Padua, Italy

The Database of Protein Disorder (DisProt, URL: www.disprot.org) is the major repository for manually curated intrinsic disorder (ID) annotations, which are provided by almost 40 curators from different groups and countries. Currently, DisProt annotation uses a controlled vocabulary of functional terms associated with ID, grouped into three main categories (molecular function, structural transition and interaction partner). Additionally, it allows curators to select the experimental techniques used for determining ID. In this work we report the results of our efforts and the improvements made in the latest version of this resource, which will be released in June 2019. This version features almost 800 new proteins and more than 3500 ID region annotations. The DisProt system was also re-designed to support Minimum Information About Disorder Experiments (MIADE) as an early implementation of a common standard for ELIXIR data interoperability in the IDP field. On the new DisProt annotation interface, curators are encouraged to report author statements from articles documenting ID proteins. This will provide a new corpus of statements for the development of a new generation of text mining tools that will improve article curation and processing, integrating DisProt annotations into SciLite and EuropePMC.

5:20 PM-5:40 PM
Mapping UK Biobank to the Experimental Factor Ontology
  • Zoë May Pendlington, EMBL-EBI, United Kingdom
  • Paola Roncaglia, EMBL-EBI, Italy
  • Edward Mountjoy, Wellcome Sanger Institute, United Kingdom
  • Gautier Koscielny, GSK, United Kingdom
  • Helen Parkinson, EMBL-EBI, United Kingdom
  • Simon Jupp, EMBL-EBI, United Kingdom

UK Biobank is a massive source of population health data used extensively in research. Some of the clinical data is mapped to ICD-10, but coverage is incomplete, as are mappings from existing ICD-10 codes to public ontologies. Here, we describe a pipeline to map 1,552 UK Biobank traits to public ontology terms via the Experimental Factor Ontology. Our approach uses ontology mapping services and manual curation to provide almost complete coverage (97%), thus increasing the interoperability of UK Biobank with public datasets. Both the mappings and the services described are freely available for use, and the mapping pipeline represents a typical curation workflow that can be adopted for other domains.

5:40 PM-6:00 PM
Flash Presentations
  • Various

An opportunity to present at the very last minute, for a minute!

Wednesday, July 24th
10:15 AM-10:20 AM
  • Michel Dumontier

Introduction to the morning program.

10:20 AM-10:40 AM
Proposing a unified framework of topological factors in order to refine semantic network analysis on biomedical ontologies
  • Alexandros Xenos, National Hellenic Research Foundation, Greece
  • Thodoris Koutsandreas, National Hellenic Research Foundation, Greece
  • Aristotelis Chatziioannou, National Hellenic Research Foundation, Greece

Several approaches have been developed to perform meta-analytical tasks on ontological graphs, aiming to detect differences and commonalities between groups of semantic terms. The majority of these approaches rely solely on the concept of the most informative common ancestor (MICA), ignoring the high topological complexity of these graphs. The purpose of this study is to identify the crucial factors that affect the semantic association of two terms and to evaluate existing similarity measures in terms of their consistency with those factors. To address this, an instance of Gene Ontology was constructed and pairs of terms were ranked according to the proposed rules. Different measures were then applied to assess their ability to reproduce the same ranking. The Aggregate Information Content measure performed best among the measures tested, suggesting that the inclusion of common informative ancestors in the estimation of IC-based similarity measures enhances performance. Therefore, multiple parent inheritance is important in biomedical ontologies and should be taken into consideration. However, none of the existing measures achieved perfect accuracy, indicating the need for a more sophisticated approach.
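
For readers unfamiliar with the MICA concept, here is a minimal sketch of a Resnik-style similarity, which scores two terms by the information content (IC) of their most informative common ancestor. The toy graph and term probabilities are hypothetical; this is not the Gene Ontology instance, the ranking rules or the Aggregate Information Content measure from the study.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents.
# Term "C" has two parents, i.e. multiple inheritance.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A", "B"},
    "D": {"A"},
}

# Hypothetical probability of each term (share of annotations at or below it).
PROB = {"root": 1.0, "A": 0.5, "B": 0.4, "C": 0.2, "D": 0.1}


def ancestors(term):
    """Return the term itself together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC = -ln p(term); rarer terms are more informative."""
    return -math.log(PROB[term])


def mica(t1, t2):
    """Most informative common ancestor of two terms."""
    common = ancestors(t1) & ancestors(t2)
    return max(common, key=information_content)


def resnik(t1, t2):
    """Resnik-style similarity: the IC of the MICA."""
    return information_content(mica(t1, t2))
```

In this toy graph, `mica("C", "D")` is `"A"`. The score depends on that single ancestor only and ignores the rest of the shared topology, which is the limitation the abstract points to.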

10:40 AM-11:00 AM
Managing ontology releases with the Ontology Development Kit
  • Nicolas Matentzoglu, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • James Balhoff, Renaissance Computing Institute, University of North Carolina, United States
  • Rebecca C. Jackson, Institute for Genome Sciences, United States
  • David Osumi-Sutherland, European Bioinformatics Institute (EMBL-EBI), United Kingdom
  • James A. Overton, Knocean Inc., Canada
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States

Managing the ontology life cycle is a complex process involving tasks such as import management, release file compilation, integration testing, quality control (QC) and various pre-processing steps such as materialising inferred subsumptions. Over the years, many ontologies developed their own pipelines for managing this process using tailored scripts and referring to various, often hard to use, APIs. The Ontology Development Kit (ODK) bundles all tools required for performing ontology releases and includes a customisable battery of QC procedures which cover the majority of the Open Biological and Biomedical Ontology (OBO) Foundry’s well-established principles for open ontology development in an easy-to-use framework facilitating unified workflows. It has been adopted by a wide range of biomedical ontologies, such as most of the current phenotype ontologies including Human (HP), Mammalian (MP) and Zebrafish (ZP), as well as Cell Ontology (CL), Behaviour Ontology (NBO) and many more. Using the ODK ensures a shared minimum standard of quality for open biomedical ontologies, thus increasing interoperability and reuse. Advanced features such as pattern-based development enable the definition of community-wide, shared standards for the representation of terms and axioms and a much more scalable approach to managing large and complex ontologies.

11:00 AM-11:20 AM
Ontology-Driven Omics Data Visualization and Database Update Notifications
  • Suzanne Paley, SRI International, United States
  • Peter Karp, SRI International, United States

We present two novel ontology-driven bioinformatics applications. Both combine the use of Gene Ontology and the Pathway Tools ontologies for metabolic pathways and for regulation. The Omics Dashboard provides multi-level views of gene expression or metabolomics data that are organized by cellular system. Cellular systems such as biosynthesis, regulation, and central dogma are organized into subsystems such as amine biosynthesis and DNA metabolism. Each subsystem is represented by a graph that depicts the integrated expression levels of all genes/metabolites within that subsystem; clicking on the graph expands another layer of graphs for each of its subsystems. The Dashboard organization is defined by ontologies but in a fashion that hides unnecessary complexity.

Update Notifications are a BioCyc capability that enables users to subscribe to interest areas within a BioCyc database. When database updates occur within a user's area of interest, the user receives an email. Emails are sent in conjunction with database releases. Interest areas are all ontology based. For example, if the user subscribes to a MetaCyc pathway class such as "Bioluminescence", the user is notified whenever a pathway in this class, or any component reaction, metabolite, gene, or enzyme in such a pathway, is updated.

11:20 AM-11:40 AM
OntoloBridge – A FAIR Semi-Automated Ontology Update Request System
  • Hande Küçük McGinty, Collaborative Drug Discovery, United States
  • John Graybeal, Stanford University, United States
  • Alex Clark, Collaborative Drug Discovery Inc., United States
  • John Turner, University of Miami, United States
  • Daniel Cooper, University of Miami, United States
  • Michael Dorf, Stanford University, United States
  • Mark Musen, Stanford University, United States
  • Stephan Schürer, University of Miami, United States
  • Barry Bunin, Collaborative Drug Discovery, United States

Ontologies are becoming more relevant for data science as the need for standardized vocabulary and metadata increases. However, ontologies must evolve in order to stay relevant. A frequent complaint of ontology users is not knowing where to make requests for changes to ontologies. We are developing a set of infrastructure services with a public API that will allow users of ontologies to easily request new terms and updates to existing ones. Domain experts who are using Collaborative Drug Discovery's new tool BioAssay Express (BAE) to annotate their bioassays in a semi-automated and standardized fashion (using well-known ontologies like BioAssay Ontology (BAO), Gene Ontology (GO), Disease Ontology (DOID), and Drug Target Ontology (DTO)), and metadata curators who are using CEDAR Workbench, both have access to a number of ontologies. However, these users currently cannot provide feedback to the ontology authors or request terms. Our initial goal in the OntoloBridge project is to help the users of BAE request changes to the existing BAE-exposed vocabularies in a semi-automated way. Collaborators at Stanford University are investigating the adoption of these APIs in tools like BioPortal and CEDAR. OntoloBridge ontology request services are intended to increase the Findability, Accessibility, Interoperability, and Reusability (FAIR) of the above-mentioned ontologies, user requests to change them, and the BioAssay Protocols and metadata resources that rely on them.

11:40 AM-12:00 PM
CROssBAR: Comprehensive Resource of Biomedical Relations with Network Representations and Deep Learning
  • Vishal Joshi, EMBL-EBI, United Kingdom
  • Tunca Dogan, European Bioinformatics Institute, Turkey
  • Rengül Atalay, Middle East Technical University, Turkey
  • Maria Jesus Martin, EMBL-EBI, United Kingdom
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Hermann Zellner, EMBL-EBI, United Kingdom
  • Vladimir Volynkin, EMBL-EBI, United Kingdom
  • Esra Sinoplu, Middle East Technical University, Turkey
  • Heval Atas, Middle East Technical University, Turkey
  • Andrew Nightingale, EMBL-EBI, United Kingdom
  • Ahmet Sureyya Rifaioglu, Middle East Technical University, Turkey
  • Mehmet Volkan Atalay, Middle East Technical University, Turkey

Biomedical information is scattered across different biological data resources, which are biologically related but only loosely linked to each other in terms of data connections. This hinders the application of integrative systems biology approaches to the data. We aim to develop a comprehensive resource, CROssBAR, to address these shortcomings by establishing relationships between relevant biological data sources to present a well-connected database, focusing on the fields of drug discovery and precision medicine. CROssBAR will contain 3 modules: (1) novel computational methods using graph theory and deep learning algorithms, to reveal unknown drug-target interactions and gene/protein-disease associations; (2) multi-partite biological networks where nodes will represent compounds/drugs, genes/proteins, pathways/systems and diseases, and edges will represent known and predicted pairwise relations between them; and (3) an open access database and web-service to provide access to the resultant networks and their components. We have developed data pipelines for the heavy lifting of data from different data sources like UniProt, ChEMBL, PubChem, DrugBank and EFO, persisting only specific data attributes for biomedical entity networks. The database is hosted in self-sufficient collections in MongoDB. The CROssBAR resource should help researchers in the interpretation of biomedical data by observing biological entities together with their relations.

12:00 PM-12:20 PM
Closing session
  • Michel Dumontier

Wrap-up and an opportunity for the audience to share thoughts about the Bio-Ontologies track at ISMB.