Ontology authoring, sometimes referred to as the ‘implementation’ stage of representing the knowledge, may seem like a just-do-it task, but even when experts agree on what to represent, there are myriad ways to represent it. When adhered to consistently, these choices lead to transformable modelling styles. Different representation choices may clash with ontologies one aims to reuse, however, and with some of the purposes for which an ontology may have been built. Awareness of such differences may facilitate smoother deployment of ontologies in applications with different requirements. These choices also pose challenges for methods and tools, such as competency questions and verification with them, test-driven development, and various bottom-up ontology development approaches, such as knowledge extraction from biological diagrams. In this talk we take a tour through new insights into such factors that slow down or speed up the development of bio-ontologies and their use in tools for the life sciences.
Motivation: Literature-based Gene Ontology Annotations (GOA) are biological database records that use a controlled vocabulary to uniformly represent gene function information described in the primary literature. Assuring the quality of GOA is crucial for supporting biological research. However, a range of inconsistencies between the literature used as evidence and the annotated GO terms can be identified; these have not been systematically studied at the record level. The existing manual-curation approach to GOA consistency assurance is inefficient and unable to keep pace with the rate at which gene function knowledge is updated. Automatic tools are therefore needed to assist with GOA consistency assurance. This paper presents an exploration of different GOA inconsistencies and an early feasibility study of automatic inconsistency detection.
Results: We created a reliable synthetic dataset to simulate four realistic types of GOA inconsistency in biological databases. Three automatic approaches are proposed. They provide reasonable performance on the task of distinguishing the four types of inconsistency and are directly applicable to detecting inconsistencies in real-world GOA database records. Major challenges resulting from such inconsistencies in several specific application settings are reported.
Conclusion: This is the first study to introduce automatic approaches that are designed to address the challenges in current GOA quality assurance workflows.
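As a flavor of what automatic inconsistency detection can look like, here is a minimal sketch that flags a record when the cited evidence passage and the GO term definition diverge lexically. The TF-IDF similarity signal, the example texts, and the threshold are illustrative assumptions, not the three approaches proposed in the paper.

```python
# Minimal sketch: flag a potential literature/GO-term inconsistency by
# comparing the cited evidence passage against the GO term definition.
# The TF-IDF similarity signal and the 0.1 threshold are illustrative
# assumptions, not the approaches evaluated in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def similarity(evidence_passage: str, go_definition: str) -> float:
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([evidence_passage, go_definition])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

evidence = ("The protein localizes to the mitochondrial inner membrane "
            "and is required for oxidative phosphorylation.")
definition = ("A process in which a protein is transported to, or maintained "
              "in, a location within the nucleus.")  # hypothetical GO definition

score = similarity(evidence, definition)
if score < 0.1:  # illustrative threshold
    print(f"Possible inconsistency (similarity={score:.3f})")
```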
Named Entity Recognition (NER) systems are commonly used in the construction of large biomedical knowledge graphs (KGs) from free text or non-standardized data. Their main role is to map biomedical entities to standardized identifiers in ontologies and databases. While NER is only one of the steps in KG construction, NER systems can greatly accelerate it. However, errors introduced by NER systems can systematically affect downstream applications of the KG. In this study, we used two NER systems, the BioPortal Annotator and the OntoRunNER OGER++ wrapper, to map biomedical entities from a KG to 13 biomedical ontologies and subsequently evaluated the mappings. Our preliminary results show that the OntoRunNER wrapper produced an average of 4.26 candidate matches per entity and mapped 76% of the entities correctly, while the BioPortal Annotator correctly mapped 60% of the manually reviewed entities. Results from both systems contained errors, such that using the mappings in a KG without curation could lead to inaccurate inferences. We are currently evaluating the effects of the NER systems on downstream KG applications using graph analysis, embedding similarity, and data source cross-validation.
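For context, the BioPortal Annotator used above is exposed through a public REST API. A minimal sketch of calling it from Python follows; the API key placeholder, the input text, and the ontology selection are illustrative assumptions.

```python
# Minimal sketch: annotate free text with the BioPortal Annotator REST API.
# Requires a (free) BioPortal API key; the ontology selection here is
# illustrative, not the 13 ontologies evaluated in the study.
import requests

API_KEY = "YOUR_BIOPORTAL_API_KEY"  # placeholder
text = "Aspirin reduces inflammation and the risk of myocardial infarction."

response = requests.get(
    "https://data.bioontology.org/annotator",
    params={
        "text": text,
        "ontologies": "CHEBI,HP",  # restrict matches to selected ontologies
        "apikey": API_KEY,
    },
)
response.raise_for_status()

for annotation in response.json():
    cls = annotation["annotatedClass"]
    print(cls["@id"], [a["text"] for a in annotation["annotations"]])
```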
Interoperability, one of the FAIR data principles (Findable, Accessible, Interoperable and Reusable), requires mapping data models, formats, and semantics. The European Joint Programme on Rare Diseases (EJP RD) Common Data Element (CDE) semantic data models enable the creation of highly expressive FAIR data for the interoperability of patient registries, facilitating rare disease research. The Observational Health Data Sciences and Informatics (OHDSI) OMOP Common Data Model (CDM) is used to harmonize representations of healthcare data and to support reproducible, open-source analytics for clinical research. Here, we present our mapping work for schema integration and interoperability between the EJP RD CDE semantic model and the OMOP CDM. Enhancing the interoperability between these two schemas can enrich rare disease research with observational health data and extend the EJP RD Virtual Platform research ecosystem with OMOP analytics.
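To make the flavor of such a mapping concrete, here is a minimal, hypothetical sketch of translating a rare-disease CDE-style diagnosis record into an OMOP condition_occurrence-style row. The field names on the CDE side and the concept lookup are invented placeholders, not the project's actual mapping.

```python
# Illustrative sketch: translating one patient record between a CDE-style
# representation and OMOP CDM-style rows. All CDE field names and the
# concept id below are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class CdeDiagnosis:
    patient_id: str
    orpha_code: str      # Orphanet code, as used in rare-disease CDEs
    onset_date: str

def to_omop_condition(dx: CdeDiagnosis, concept_map: dict) -> dict:
    """Map a CDE diagnosis to an OMOP condition_occurrence-like row.

    concept_map resolves source codes (e.g. Orphanet) to OMOP concept ids;
    in practice this lookup would come from the OMOP vocabulary tables.
    """
    return {
        "person_id": dx.patient_id,
        "condition_concept_id": concept_map[dx.orpha_code],
        "condition_start_date": dx.onset_date,
        "condition_source_value": dx.orpha_code,
    }

concept_map = {"ORPHA:98896": 4098292}  # hypothetical concept id
row = to_omop_condition(CdeDiagnosis("P001", "ORPHA:98896", "2020-03-14"),
                        concept_map)
print(row)
```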
Motivation: Cognitive-behavioral symptoms (CBSx) represent the surface manifestation of diverse etiologies. Identification and documentation of CBSx are critical to biomedical research aimed at understanding the association between symptoms and their underlying causes. Given the lack of a semantic resource for CBSx specifically dedicated to the symptom layer, we sought to develop a CBSx Taxonomy.
Methods: We defined a CBSx as any observable abnormality or reduced capacity in a cognitive or behavioral function. We also collected concepts concerning the impact of CBSx on quality of life and daily activities. The taxonomy was developed iteratively by synthesizing knowledge from the literature, existing biomedical ontologies, and clinical instruments that assess CBSx in patients. A domain expert in aging and neurodegenerative diseases was consulted in curating the CBSx Taxonomy.
Results: The derived taxonomy contains 258 concepts that are grouped into four major branches: cognitive symptoms, psychomotor symptoms, behavioral symptoms, and impact on life.
Conclusion: By synthesizing multiple knowledge sources, we developed the CBSx Taxonomy to serve as a dedicated semantic layer for cognitive-behavioral symptoms. The taxonomy is shared publicly and is expected to benefit diverse applications including natural language processing, phenotyping, and semantic harmonization.
Due to gaps in knowledge, most rare disease patients never receive a diagnosis, but diagnostic reach can be extended by integrating bio-ontologies. We show that clusters identified across ontologies suggest undiscovered connections between genes and phenotypes. Biological networks have long been used to infer new connections, but the use of heterogeneous networks and the inference of gene-to-phenotype connections remain largely unexplored. We create a heterogeneous network composed of genes and phenotypes, with edges derived from the STRING protein-protein interaction network and the Human Phenotype Ontology, for each year from 2015 to 2021. Employing a combination of classic and node-attribute-aware network clustering algorithms, we identify small, heterogeneous clusters for each year. We show that these clusters contain significantly more (p < 0.0001) gene-to-phenotype edges in the following year than 10,000 replicates from a robust null model. Using biologically meaningful cluster properties, we train an XGBoost model to estimate the degree to which a cluster will contain more gene-to-phenotype pairs in the following year than expected at random, and we prioritize clusters that will be meaningful to those affected by a rare disease.
All data and methods are available in a Snakemake pipeline and Conda environment for the highest degree of reproducibility (https://github.com/MSBradshaw/BOCC).
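To illustrate the cluster-evaluation idea, the sketch below builds a toy heterogeneous gene-phenotype graph, clusters it, and counts gene-to-phenotype edges per cluster. The toy edges and the choice of greedy modularity clustering are illustrative assumptions; the study combines several classic and attribute-aware algorithms and a robust null model.

```python
# Minimal sketch: cluster a toy heterogeneous gene-phenotype network and
# count gene-to-phenotype edges inside each cluster. Edges and the choice
# of greedy modularity clustering are illustrative stand-ins.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph()
# Gene-gene edges (as from STRING) and gene-phenotype edges (as from HPO
# annotations); all identifiers below are made up for illustration.
G.add_edges_from([("GENE:A", "GENE:B"), ("GENE:B", "GENE:C")],
                 kind="gene-gene")
G.add_edges_from([("GENE:A", "HP:0001250"), ("GENE:C", "HP:0004322")],
                 kind="gene-phenotype")

for i, community in enumerate(greedy_modularity_communities(G)):
    sub = G.subgraph(community)
    cross = sum(1 for _, _, d in sub.edges(data=True)
                if d.get("kind") == "gene-phenotype")
    print(f"cluster {i}: {sorted(community)} gene-phenotype edges: {cross}")
```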
To support the diagnosis, management, and further research of vascular anomalies, Mulliken and Glowacki created a comprehensive classification system for vascular anomalies. The International Society for the Study of Vascular Anomalies (ISSVA), a society of specialists from various medical disciplines involved in treating patients with vascular anomalies, adopted this classification in 1996. The current version of the classification is available as a PDF file, which allows neither structured registration of these diagnoses using unique identifiers nor implementation in software systems. To make data for vascular anomaly research more Findable, Accessible, Interoperable, and Reusable (FAIR), it is important that these diagnoses are registered in a structured and machine-readable manner. The Vascular Anomalies European Reference Network (VASCERN) and its Registry of Rare Vascular Anomalies (VASCA) therefore adopted the ISSVA classification and created a machine-readable representation of it: the ISSVA ontology.
In this session, we will present the ISSVA ontology. We will also present our lessons learned from creating an ontology out of a classification and (semi-automatically) mapping the ontology to existing ontologies.
In the Netherlands, women aged 50-75 years are invited to breast cancer screening every two years. The PRISMA (Personalised RISk-based MAmmascreening) study was designed to investigate the added value of risk-based mammography screening. To the best of our knowledge, there is no universally accepted data model for the collection of breast cancer risk factors and patient-reported outcome measures (PROMs). We therefore aimed to retrospectively create a domain ontology based on the PRISMA questionnaire. This contributes to global efforts to increase the secondary use of patient-reported outcomes through FAIRification, i.e., ensuring data are Findable, Accessible, Interoperable, and Reusable.
Initially, 201 questions and 188 variables were identified from the questionnaire. After several inventory meetings with different stakeholders, the resulting 70 data elements were grouped into 17 main classes. The domain ontology is structured through “is-a” relationships.
Given that most of the concepts derived from the questionnaire are measurements of patient attributes, we use the Semanticscience Integrated Ontology (SIO), which can model entities, processes, and their qualities/attributes, as guidance for deriving our core model.
The data model could serve as a template for other breast cancer research groups and for other patient-reported outcome and real-world experience questionnaires.
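As an illustration of the SIO-guided measurement pattern mentioned above, here is a minimal sketch in which a patient has an attribute (a measurement) that has a value. The SIO property identifiers are quoted from memory and should be verified, and the patient and class IRIs are hypothetical.

```python
# Minimal sketch of the SIO measurement pattern used as guidance above:
# a patient has an attribute (a measurement) which has a value.
# SIO identifiers are quoted from memory and should be verified;
# the patient/measurement IRIs are hypothetical.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

SIO = Namespace("http://semanticscience.org/resource/")
EX = Namespace("https://example.org/prisma/")

g = Graph()
patient = EX.patient1
measurement = EX.ageMeasurement1

g.add((patient, SIO.SIO_000008, measurement))        # 'has attribute'
g.add((measurement, RDF.type, EX.AgeAtEnrollment))   # hypothetical class
g.add((measurement, SIO.SIO_000300,                  # 'has value'
       Literal(57, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```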
Governance of ontology development and maintenance practices within an organization has many advantages over the ungoverned, siloed approaches that many organizations exhibit today. This paper presents the BASF Governance Operational Model for Ontologies (GOMO), which addresses all stages of the ontology lifecycle and provides a framework for the development and maintenance of ontologies within BASF. GOMO comprises Principles, Standards and Quality Assurance criteria, Best Practices, and Training and Outreach, and is the result of a collaborative effort between industry and academia in the semantic web field. GOMO Principles, Standards, and Best Practices are being applied to all running ontology-based projects in BASF. Through outreach presentations to sections of the BASF community, GOMO has reached a wider audience, fostering understanding of the utility and implementation of ontologies. Finally, GOMO stands as a framework fit for adoption by other organizations facing similar challenges in ontology governance.
We have recently published an updated genome assembly and annotation of our model organism Pseudomonas fluorescens SBW25. We now face the challenge of keeping the annotation up to date with novel results from experimental and computational studies of gene function, fitness assays, and regulatory and metabolic networks. We will present various open-source software tools and open data and metadata standards combined into a public knowledge base for our model organism. Its central part is our genome database and genome browser based on Tripal, which allows internal and external colleagues to feed in their data and results in a curated fashion. To further integrate our data, we are working on a Linked Data architecture that connects our genome database to various public *omics databanks as well as to internal data sources, thereby creating an organism-specific knowledge graph. By exposing a public SPARQL endpoint, our data ultimately become part of the worldwide Semantic Web, which incorporates other domain-specific knowledge graphs as well as generic data sources such as Wikipedia (via Wikidata). In this way, our system facilitates the growth of the Pseudomonas fluorescens SBW25 knowledge graph both through manual exploration and through automated procedures.
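As a flavor of what the public SPARQL endpoint enables, a minimal sketch of querying such an endpoint from Python follows. The endpoint URL, class IRI, and query shape are hypothetical placeholders, as the abstract does not specify the schema.

```python
# Minimal sketch: query a (hypothetical) public SPARQL endpoint for genes
# and their labels. Endpoint URL, class IRI, and predicates are
# illustrative placeholders, not the project's actual schema.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/sbw25/sparql")  # placeholder
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?gene ?label WHERE {
        ?gene a <https://example.org/sbw25/Gene> ;
              rdfs:label ?label .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["gene"]["value"], row["label"]["value"])
```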
As one of the most important crops in Asia, rice and its breeding have long been major research topics. Unfortunately, a standardized rice trait ontology has been lacking, making it challenging to normalize descriptions of laboratory results. In this work, we manually curated a Rice Trait Ontology (RTO) by aligning three existing rice trait terminology sets. The RTO includes 2,522 rice trait concepts with corresponding descriptors defining relations among concepts. We hope the RTO will standardize commonly used trait concepts in rice breeding research and enable automated mining of rice trait knowledge. To facilitate concept queries, a user-friendly web service is available at http://lit-evi.hzau.edu.cn/RiceTraitOntology.
EDAM is a domain ontology of data analysis and data management in bio- and other sciences. It comprises concepts related to analysis, modeling, optimization, and the data life cycle, divided into four main sections: topics, operations, data, and formats.
EDAM is used in numerous resources, for example Bio.tools, Galaxy, CWL, Debian, BioSimulators, FAIRsharing, and the ELIXIR training portal TeSS. Thanks to annotations with EDAM, tools, workflows, standards, data, and learning materials are easier to find, compare, choose, and integrate.
EDAM is developed by a diverse community of contributors. A substantial extension is EDAM Bioimaging, focused on image analysis and machine learning.
The main improvements and ongoing work in 2022 include:
- In addition to using standard tools such as HermiT and ROBOT, we develop additional validation tools at both the syntactic and semantic levels: https://github.com/edamontology/edam-validation
- Enabling interdisciplinary applications with EDAM Geo (https://github.com/edamontology/edam-geo), an extension of EDAM towards geolocated data (e.g. in ecology, public health, …). Developed at https://webprotege.stanford.edu/#projects/69591619-4eda-4f03-9e7f-65b213038fe1/edit/Classes
- Improving the implementation of links to external resources (incl. other ontologies), definitions, and the overall quality
- Addition of numerous data formats, especially for models and simulations, and chemistry
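As a small illustration of programmatic use of EDAM, the sketch below loads the ontology and lists the direct sub-concepts of a topic. The download URL and the example concept IRI are quoted from memory and should be verified.

```python
# Minimal sketch: load EDAM and list the direct sub-concepts of a topic.
# The download URL and the example concept IRI (topic_0091,
# "Bioinformatics") are quoted from memory and should be verified.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

g = Graph()
g.parse("https://edamontology.org/EDAM.owl", format="xml")

topic = URIRef("http://edamontology.org/topic_0091")
for sub in g.subjects(RDFS.subClassOf, topic):
    label = g.value(sub, RDFS.label)
    print(sub, "-", label)
```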
Motivation: Protein functions are often described using the Gene Ontology (GO), an ontology consisting of over 50,000 classes and a large set of formal axioms. Predicting the functions of proteins is one of the key challenges in computational biology, and a variety of machine learning methods have been developed for this purpose. However, these methods usually require significant amounts of training data and cannot make predictions for GO classes that have only few or no experimental annotations.
Results: We developed DeepGOZero, a machine learning model which improves predictions for functions with no or only a small number of annotations. To achieve this goal, we rely on a model-theoretic approach for learning ontology embeddings and combine it with neural networks for protein function prediction. DeepGOZero can exploit formal axioms in the GO to make zero-shot predictions, i.e., predict protein functions even if not a single protein in the training phase was associated with that function. Furthermore, the zero-shot prediction method employed by DeepGOZero is generic and can be applied whenever associations with ontology classes need to be predicted.
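To convey the zero-shot idea, the sketch below scores a protein embedding against class embeddings that exist for every class because they are derived from the ontology itself, including classes with no annotated training proteins. The random embeddings and dot-product scoring are illustrative stand-ins, not DeepGOZero's actual architecture.

```python
# Minimal sketch of zero-shot scoring: score a protein against GO class
# embeddings learned from ontology axioms, including classes that had no
# annotated proteins during training. Dimensions, random embeddings, and
# dot-product scoring are illustrative stand-ins for the real model.
import numpy as np

rng = np.random.default_rng(0)
dim = 32

# Class embeddings come from the ontology structure, so they exist even
# for classes with zero training annotations.
class_embeddings = {
    "GO:0005634": rng.normal(size=dim),  # seen during training
    "GO:0099523": rng.normal(size=dim),  # zero-shot: no training proteins
}
protein_embedding = rng.normal(size=dim)

def score(protein: np.ndarray, go_class: np.ndarray) -> float:
    """Sigmoid of a dot product as a stand-in association score."""
    return float(1.0 / (1.0 + np.exp(-protein @ go_class)))

for go_id, emb in class_embeddings.items():
    print(go_id, f"{score(protein_embedding, emb):.3f}")
```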
Ontologies play an important role in the representation, standardization, and integration of biomedical data, but are known to have data quality (DQ) issues. We aimed to understand whether the Harmonized Data Quality Framework (HDQF), developed to standardize electronic health record DQ assessment strategies, could be used to improve ontology quality assessment. A novel set of 14 ontology checks was developed. These DQ checks were aligned to the HDQF and examined by the HDQF developers. The ontology checks were evaluated using 11 Open Biomedical Ontology (OBO) Foundry ontologies. Of the ontology checks, 85.7% were successfully aligned to at least one HDQF category. Accommodating the unmapped DQ checks (n=2) required modifying an original HDQF category and adding a new Data Dependency category. While all of the ontology checks were eventually mapped to an HDQF category, not all HDQF categories were represented by an ontology check, presenting opportunities to strategically develop new ontology checks. The HDQF is a valuable resource, and this work demonstrates its ability to categorize ontology quality assessment strategies.
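As an example of what an individual ontology check can look like in practice, the sketch below flags OWL classes lacking an rdfs:label. This is a hypothetical stand-in for the kind of check discussed above, not one of the paper's 14 checks.

```python
# Illustrative ontology quality check: find OWL classes with no rdfs:label.
# A hypothetical stand-in for the kind of check discussed above, not one
# of the paper's 14 checks. 'example.owl' is a placeholder file.
from rdflib import BNode, Graph, RDF, RDFS
from rdflib.namespace import OWL

g = Graph()
g.parse("example.owl", format="xml")  # placeholder ontology file

missing_label = [
    cls for cls in g.subjects(RDF.type, OWL.Class)
    if not isinstance(cls, BNode) and g.value(cls, RDFS.label) is None
]
for cls in missing_label:
    print("class without rdfs:label:", cls)
```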
Advances in data collection techniques have yielded large-scale heterogeneous clinical and phenotype datasets from different geographical locations. However, harmonizing these datasets retrospectively for integrative analyses to potentially increase prediction power is still challenging. We present omsim, a model-based ontology mapping and text graph-based similarity information retrieval technique, for automated generation of harmonized datasets from disparate research patient registries. We tested omsim on multi-national sickle cell patient research datasets in sub-Saharan Africa.
The recent success of node embedding methods has greatly boosted the application of graph data to bioinformatics problems. In this work, we propose a new method for learning Gene Ontology sub-graph embeddings to classify model organisms' genes as pro-longevity or anti-longevity. The experimental results show that these Gene Ontology sub-graph embeddings achieve higher predictive accuracy than conventional binary Gene Ontology annotation-based features.
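To illustrate the contrast between binary annotation features and graph-derived embeddings, here is a simplified stand-in: GO terms are embedded via truncated SVD of a toy GO graph's adjacency matrix, gene features are averaged term embeddings, and a classifier is trained on fabricated labels. None of this reproduces the paper's actual embedding method.

```python
# Simplified stand-in: derive GO term embeddings from graph structure
# (truncated SVD of the adjacency matrix) and use them as gene features
# for classification. The toy graph, annotations, and labels are
# fabricated for illustration; this is not the paper's method.
import networkx as nx
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Toy GO DAG (edges point from child term to parent term).
go = nx.DiGraph([("GO:3", "GO:1"), ("GO:4", "GO:1"),
                 ("GO:5", "GO:2"), ("GO:6", "GO:2")])
terms = sorted(go.nodes)
index = {t: i for i, t in enumerate(terms)}

# Embed GO terms from the graph structure.
adjacency = nx.to_numpy_array(go.to_undirected(), nodelist=terms)
term_emb = TruncatedSVD(n_components=2, random_state=0).fit_transform(adjacency)

annotations = {"geneA": ["GO:3", "GO:4"], "geneB": ["GO:5"],
               "geneC": ["GO:3"], "geneD": ["GO:5", "GO:6"]}
labels = np.array([1, 0, 1, 0])  # toy pro-/anti-longevity labels

# Gene embedding: mean of its annotated terms' embeddings.
X_emb = np.array([term_emb[[index[t] for t in ts]].mean(axis=0)
                  for ts in annotations.values()])
clf = LogisticRegression().fit(X_emb, labels)
print("training accuracy:", clf.score(X_emb, labels))
```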
Biomedical ontologies are widely available, with hundreds under development; however, there is a lack of formal training on methods for ontology development, including best practices for creating and editing ontologies and for applying new tools and workflows. This makes it challenging for new and current ontologists to find and access training materials, learn the methodology, and hone existing skills. The OBO Academy provides open, online, self-paced training materials that aim to provide ongoing training for the ontology community on best practices in ontology development. The materials cover a range of topics, from basics such as getting started with contributing to ontologies and editing in Protégé, to more advanced material covering technical workflows such as using the Ontology Development Kit and ROBOT templates. The initial offering is available online and under continuous development; community feedback and contributions are welcome (https://github.com/OBOAcademy/obook).
With the new era of genomics, an increasing number of animal species are amenable to large-scale data generation. This has led to the emergence of new multi-species ontologies to annotate and organize these data. While anatomy and cell types are well covered by these efforts, information on development and life stages is also critical for annotating animal data; its absence can hamper our ability to answer comparative biology questions and to interpret functional results. We present here a collection of development and life stage ontologies for 21 animal species, and their merger into a common multi-species ontology. This work has allowed the integration and comparison of transcriptomics data across 52 animal species.
Closing session and COSI remarks
Addressing complex scientific challenges requires a roadmap of data from diverse sources, organisms, contexts, formats, and granularities. Building a coherent, holistic view of the data landscape to address any given problem is non-trivial. Often, in the aggregation process, many of the original connections within the data are lost, and it is difficult to make new (inferred) connections. Novel data integration strategies that leverage semantic technologies such as ontologies, knowledge graphs, and common modeling strategies can help span disciplinary boundaries. However, it takes people too: robust interdisciplinary collaboration and improved data licensing and access can advance progress and innovation, turbo-boosting the open data highway.
The OntoDev Suite (https://ontodev.com, https://github.com/ontodev) brings together modular open-source libraries and applications for ontology development and scientific data integration, with special emphasis on open science and the Open Biological and Biomedical Ontologies (OBO) community. The suite builds on the success of ROBOT to include data cleaning, ontology-driven validation, development and curation workflows, and more. We strive to make small, focused tools that work well together, but also work well with other best-in-class software, languages, and platforms. In this talk we present an overview of the suite, its design principles, and future plans.
We present RPhenoscape, a package for the R programming language that provides convenient and robust access to the ontologies and ontology-linked morphological trait data (phenotypes) within the Phenoscape Knowledgebase (KB), as well as to several algorithms for computing with the semantics of traits based on formal logic reasoning. Among the major aims of the package is to enable the computational integration of trait semantics into evolutionary models for comparative trait analysis, which have traditionally treated traits simply as independent characters and character states. To this end, RPhenoscape provides access to the computational inference of presence/absence trait matrices, the presence/absence-based inference of trait dependence, evidence-based mutual trait compatibility/exclusivity, and a variety of semantic similarity metrics for phenotypes. RPhenoscape is currently in the final steps of a new major release series, which adds some of the features presented here and, once complete, will be made available on the Comprehensive R Archive Network (CRAN).
Knowledge graphs (KGs) are representations of entities and their multifaceted relationships. An ongoing challenge in learning from KGs in biology and biomedicine is bridging the gap between real-world observations and conceptual knowledge. Though numerous bio-ontologies address this need, none can be added directly to a KG without significant effort.
Past efforts to align instance data with ontologies led to the creation of the OBO Foundry, an open resource for standardized biological ontologies. We developed KG-OBO to allow the community to rapidly integrate OBO Foundry ontologies with biological KGs. KG-OBO translates OBO ontologies into easily parsed KGX TSV graphs aligned with the Biolink Model, then uploads all graphs to a public repository. Users may merge one or more ontology graphs as needed; for example, combining CHEBI with a KG of protein-chemical interactions allows chemicals to be grouped hierarchically. The added context can also provide further training input for graph machine learning models.
The KG-OBO code, graphs, and infrastructure drive a community of knowledge engineers seeking answers to biomedical questions in KGs, including the broader OBO community. We anticipate that continued interest in learning from KGs will require easy access to the comprehensive knowledge within bio-ontologies, and KG-OBO fills this need.
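For a sense of how such graphs can be combined downstream, the sketch below concatenates the node and edge files of two KGX TSV graphs. The file names are placeholders, and the columns used are the core KGX TSV fields as we understand them.

```python
# Minimal sketch: merge a KG-OBO ontology graph with another KG by
# concatenating their KGX TSV node/edge files. File names are
# placeholders; 'id', 'subject', 'predicate', and 'object' are core
# KGX TSV columns.
import pandas as pd

kg_nodes = pd.read_csv("kg_nodes.tsv", sep="\t")
kg_edges = pd.read_csv("kg_edges.tsv", sep="\t")
chebi_nodes = pd.read_csv("chebi_kgx_nodes.tsv", sep="\t")
chebi_edges = pd.read_csv("chebi_kgx_edges.tsv", sep="\t")

merged_nodes = (pd.concat([kg_nodes, chebi_nodes])
                  .drop_duplicates(subset="id"))
merged_edges = (pd.concat([kg_edges, chebi_edges])
                  .drop_duplicates(subset=["subject", "predicate", "object"]))

merged_nodes.to_csv("merged_nodes.tsv", sep="\t", index=False)
merged_edges.to_csv("merged_edges.tsv", sep="\t", index=False)
```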
Today’s international corporations, such as BASF, a leading company in the crop protection industry, produce and consume more and more data that are often fragmented and accessible only through Web APIs. In addition, part of the proprietary and public data of BASF’s interest is stored in triple stores and accessible with the SPARQL query language. Homogenizing the data access modes and the underlying semantics of the data, without modifying or replicating the original data sources, has become an important requirement for achieving data integration and interoperability. In this work, we propose a federated data integration architecture within an industrial setup that relies on an ontology-based data access method. Our performance evaluation in terms of query response time showed that most queries can be answered in under 1 second.
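To illustrate the federation idea, a federated SPARQL query can delegate part of its graph pattern to a remote endpoint via a SERVICE clause, as in the hypothetical sketch below; all endpoint URLs and vocabulary terms are placeholders, not BASF's actual setup.

```python
# Hypothetical sketch of federated data access: one SPARQL query spans a
# local endpoint and a remote one via a SERVICE clause. Endpoint URLs,
# prefixes, and predicates are illustrative placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/local/sparql")  # placeholder
sparql.setQuery("""
    PREFIX ex: <https://example.org/schema/>
    SELECT ?product ?hazard WHERE {
        ?product a ex:Product .
        SERVICE <https://example.org/remote/sparql> {  # placeholder endpoint
            ?product ex:hazardClass ?hazard .
        }
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["product"]["value"], row["hazard"]["value"])
```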