Schedule subject to change
All times listed are in EDT
Saturday, July 13th
10:40-10:50
COSI Opening Remarks
Room: 522
Format: In person

Moderator(s): Tiffany Callahan


Authors List:

  • Robert Hoehndorf
  • Tiffany Callahan
10:50-11:55
Invited Presentation: Learning from our collective scientific ignorance: How can ontologies help us determine what isn't yet?
Confirmed Presenter: Mayla Boguslav, Southern California Clinical and Translational Science Institute (USC Keck School of Medicine), United States

Room: 522
Format: In Person

Moderator(s): Tiffany Callahan


Authors List:

  • Mayla Boguslav, Southern California Clinical and Translational Science Institute (USC Keck School of Medicine), United States

Presentation Overview:

Ontologies address the question of what is or exists (known knowns). I seek to determine what isn't or doesn't exist yet (known unknowns or questions). Ontologies aim to make knowledge accessible, transparent, and searchable. Biological ontologies define the entities and relations in biological domains. The community seeks to organize, present, and disseminate knowledge in biomedicine and the life sciences more generally. This can also be done for our collective scientific ignorance - our missing or incomplete knowledge. Let's make our collective scientific ignorance accessible, transparent, and searchable. In fact, research begins with a question and progresses by exploring new and uncharted territory. Enumerating what we don't know yet can help students, researchers, funders, and publishers generate novel research questions, prioritize resources, and rebuild trust in science. Further, ideally, we combine both knowledge and ignorance to determine solved and unsolved questions. I will present my ignorance taxonomy and ignorance-base (comparable to a knowledge-base) that used ontologies. More generally, I will present a new scientific method framework that shifts the focus to ignorance and questions, not just knowledge. Join me to talk about what we don’t know yet.

11:55-12:20
Poster Madness
Room: 522
Format: In person

Moderator(s): Tiffany Callahan


Presentation Overview:

Opportunity for poster presenters to give a brief overview of their work and advertise their upcoming poster session

14:20-15:05
Proceedings Presentation: Integration of Background Knowledge for Automatic Detection of Inconsistencies in Gene Ontology Annotation
Confirmed Presenter: Jiyu Chen, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia

Room: 522
Format: In Person

Moderator(s): Robert Hoehndorf


Authors List:

  • Jiyu Chen, The Commonwealth Scientific and Industrial Research Organisation (CSIRO), Australia
  • Benjamin Goudey, The Florey Institute of Neuroscience and Mental Health, Australia
  • Nicholas Geard, School of Computing and Information Systems, The University of Melbourne, Australia
  • Karin Verspoor, The RMIT University, Australia

Presentation Overview:

Biological background knowledge plays an important role in the manual quality assurance (QA) of biological database records. One such QA task is the detection of inconsistencies in literature-based Gene Ontology Annotation (GOA). This manual verification ensures the accuracy of the GOA based on a comprehensive review of the literature used as evidence, Gene Ontology (GO) terms, and annotated genes in GOA records. While automatic approaches for the detection of semantic inconsistencies in GOA have been developed, they operate within predetermined contexts, lacking the ability to leverage broader evidence, especially relevant domain-specific background knowledge. This paper investigates various types of background knowledge that could improve the detection of prevalent inconsistencies in GOA. Additionally, the paper proposes several approaches to integrate background knowledge into the automatic GOA inconsistency detection process.
We extended a previously developed GOA inconsistency dataset with several kinds of GOA-related background knowledge, including GeneRIF statements, biological concepts mentioned within evidence texts, GO hierarchy and existing GO annotations of the specific gene. We proposed several effective approaches to integrate background knowledge as part of the automatic GOA inconsistency detection process. The proposed approaches can improve automatic detection of self-consistency and several of the most prevalent types of inconsistencies.
This is the first study to explore the advantages of utilizing background knowledge and to propose a practical approach to incorporate knowledge in automatic GOA inconsistency detection. We established a new benchmark for performance on this task. Our methods may be applicable to various tasks that involve incorporating biological background knowledge.

15:05-15:30
The cyclic nature of biases against understudied genes and diseases in knowledge graph embedding link prediction models
Confirmed Presenter: Michael Bradshaw, University of Colorado Boulder, United States

Room: 522
Format: In Person

Moderator(s): Robert Hoehndorf


Authors List:

  • Michael Bradshaw, University of Colorado Boulder, United States
  • Ryan Layer, University of Colorado Boulder, United States

Presentation Overview:

Knowledge graph embedding (KGE) models have been successfully used for a variety of biomedical applications, but have yet to be effectively applied to rare disease variant prioritization; certain limitations, namely node degree bias, need to be addressed to facilitate application of these models. We found that there is a cyclical form of bias against under-studied genes and diseases when using KGE models. Commonly studied genes, like those related to heritable forms of cancer, perform very well in KGE link prediction tasks (median normalized rank (MNR)=0.91), while less studied genes, like those differentially expressed in females and males, or diseases caused by ancestry-specific variations, are deprioritized by the same systems (MNR=0.63-0.71). Our results revealed that not all information contained within large biomedical knowledge graphs is useful for training KGE models. There was a 7-10% improvement in gene-gene edge prediction when the KG was filtered to include only nodes and edges describing genes and diseases. This filtration step also drastically sped up hyperparameter optimization and training, reducing their times to 1-2.5% of those for the full KG. Several alternative methods for exploring the KG filtration space are examined in this project. Our results show that KGE link prediction performance for gene and disease association is a very nuanced space where careful consideration of the learning model and underlying KG is required. Performance can vary by 5-11% for gene-gene edges and 11-34% for gene-disease predictions depending on the combination of KG and KGE model.
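
The abstract reports median normalized rank (MNR) without defining it. As a rough illustration only, the sketch below assumes a normalized rank of 1.0 when the true entity receives the top score among all candidates and 0.0 when it receives the lowest, which matches the abstract's usage where higher values are better. The function names and the tie-handling rule are hypothetical and are not taken from the presented work.

```python
from statistics import median


def normalized_rank(scores, true_index):
    """Return 1.0 if the true candidate is ranked first, 0.0 if ranked last.

    Assumed definition (not from the abstract): rank is 1 plus the number of
    candidates scoring strictly higher than the true one, rescaled to [0, 1].
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two candidates to rank")
    true_score = scores[true_index]
    rank = 1 + sum(1 for s in scores if s > true_score)
    return 1.0 - (rank - 1) / (n - 1)


def median_normalized_rank(queries):
    """Median of normalized ranks over (scores, true_index) pairs."""
    return median(normalized_rank(scores, idx) for scores, idx in queries)


if __name__ == "__main__":
    # Three toy link prediction queries over the same three candidates.
    toy_queries = [
        ([0.9, 0.1, 0.5], 0),  # true entity scored highest
        ([0.9, 0.1, 0.5], 1),  # true entity scored lowest
        ([0.9, 0.1, 0.5], 2),  # true entity in the middle
    ]
    print(median_normalized_rank(toy_queries))
```

Under this assumed definition, a group of genes with MNR near 0.9 would have its true links ranked in roughly the top tenth of candidates for a typical query.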

15:30-15:55
Prioritizing Causative Genomic Variants by Integrating Molecular and Functional Annotations from Multiple Biomedical Ontologies
Confirmed Presenter: Azza Althagafi, Computer Science Department, Taif University, Taif 26571, Saudi Arabia

Room: 522
Format: In Person

Moderator(s): Robert Hoehndorf


Authors List:

  • Azza Althagafi, Computer Science Department, Taif University, Taif 26571, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia

Presentation Overview:

Whole-exome and genome sequencing are widely used for diagnosing patients with rare diseases, but many remain undiagnosed due to undiscovered disease genes/variants or novel phenotypes arising from combinations of variants in multiple genes. Interpreting phenotypic consequences of variants relies on information about gene functions, expression, and other genomic features. Existing phenotype-based prioritization methods link molecular features to phenotypic effects of altering gene functions but are limited by incomplete gene-phenotype associations and applicability only to genes with known phenotypes. We developed several computational methods to prioritize genes based on phenotypes. Our methods incorporate genomic information, gene functions from the Gene Ontology, anatomical site of expression from Uberon, cell type of expression using the Cell Ontology, and clinical phenotypes. We integrate this information and apply knowledge-enhanced machine learning to prioritize candidate genes. We apply this work to the prioritization of different types of genomic variants, including single nucleotide exonic variants, non-coding variants, and structural variants.
The methods we develop leverage large amounts of background knowledge, from databases with ontology annotations as well as from ontology axioms. We evaluated these methods using synthetic and patient-derived clinical genomes.

16:40-17:05
Taking AIIM at antibiotic resistance: harmonizing the nomenclature for aminoglycoside inactivating enzymes
Confirmed Presenter: Emily Bordeleau, University of British Columbia, Canada

Room: 522
Format: In Person

Moderator(s): Tiffany Callahan


Authors List:

  • Emily Bordeleau, University of British Columbia, Canada
  • Brian Alcock, McMaster University, Canada
  • Andrew McArthur, McMaster University, Canada

Presentation Overview:

Multidrug-resistant pathogens continue to challenge aminoglycoside antibiotics with the spread of genetic elements encoding aminoglycoside modifying enzymes (AMEs). Unfortunately, these enzymes have a discordant naming history that further complicates stewardship and surveillance programs. We are undertaking the management and adoption of a single AME nomenclature. We will abide by the guidelines first proposed in 1975 while incorporating additional rules accounting for the scale at which sequencing technology permits AME discovery. Cell-based and biochemical data supporting the AMEs characterized to date have been curated from the literature. CARD will use these data to develop software that will guide researchers in the classification of new and existing AME variants. Additionally, CARD will provide tools to evaluate AME benchmarks and recommend available namespace, resolve conflicts, or suggest additional analyses if applicable. After conducting a full review of the AME terminology in the Antibiotic Resistance Ontology (ARO), CARD has updated the ARO to categorize AMEs according to the nomenclature guidelines. The revised ontology reflects AME biochemistry and phenotype, with strict definitions for each allele family based on antibiotic susceptibility testing. Planning of an interactive web interface to assist authors in naming and analyzing proposed novel AMEs is underway. Going forward, novel published AMEs will only be included in CARD if they have a unique proper name. There remains an ongoing process to review existing AMEs and resolve naming conflicts, for which CARD will engage authors and the research community for feedback.

17:05-17:30
Investigating Food Composition Components in Cancer Prevention and Therapy using Knowledge Graphs
Confirmed Presenter: Hande McGinty, Kansas State University, Manhattan KS, United States

Room: 522
Format: In Person

Moderator(s): Tiffany Callahan


Authors List:

  • Hande McGinty, Kansas State University, Manhattan KS, United States
  • Aryan Dalal, Kansas State University, Manhattan KS, United States
  • Duru Dogan, Kansas State University, Manhattan KS, United States
  • Atalay Mert Ileri, Kansas State University, Manhattan KS, United States
  • Yinglun Zhang, Kansas State University, Manhattan KS, United States

Presentation Overview:

Flavonoids are polyphenolic compounds found in plants and naturally occur in fruits, vegetables, teas, wines, and chocolate. Flavonoids also have known health benefits due to their anti-oxidative, anti-inflammatory, anti-mutagenic, and anti-carcinogenic properties and their ability to inhibit/modulate enzymatic systems. During this research we explored the relationships among different flavonoids, different foods, and different cancers using knowledge graphs and statistical methods. Our preliminary results show that the relationships among these concepts are more complex than the insights simple statistical methods can provide. In this presentation, we present our approach, which uses the KNARM methodology for data collection, data cleaning, and representation with graph databases and knowledge graphs, in addition to the preliminary results of our statistical approaches. As we continue our research, we are enriching our knowledge graph by incorporating data on known cancer drugs and drug targets, adopting more complex analysis approaches to understand the dynamic interplay of flavonoid-food-cancer interactions, and using large language models (LLMs) to enhance our knowledge graph.

17:30-18:00
COSI Day 1 Wrap-up
Room: 522
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Tiffany Callahan
  • Robert Hoehndorf

Presentation Overview:

Wrap-up and open time for questions

Sunday, July 14th
10:40-10:50
COSI Announcements
Room: 522
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Tiffany Callahan
  • Robert Hoehndorf
10:50-11:55
Invited Presentation: Exploring Multiple Perspectives for Associative Knowledgebases
Confirmed Presenter: Karin Slater

Room: 522
Format: Live Stream

Moderator(s): Robert Hoehndorf


Authors List:

  • Karin Slater

Presentation Overview:

Databases encoding associative relationships between biomedical entities function as background knowledge that is leveraged for a range of purposes. For example, disease-phenotype associations are used for differential diagnosis and variant prediction, while gene-function associations are used in gene set enrichment analyses.

In the ontology world, these associative knowledgebases lie somewhere between the conceptualisation and instance spaces, defining foundational knowledge that is often probabilistic, associative, or uncertain, rather than axiomatic. They are formed through some combination of manual curation from expert knowledge, experimental data, and analysis of co-occurrence in literature text. Due to this aetiology of associations, existing databases represent a particular perspective on biomedical knowledge, and it is one that differs from those that might be cultivated from analysis of other sources, such as clinical data, public discussion, or alternative modularisations of literature text.

We will explore the similarities and differences between associative knowledgebases derived from these contexts, including methodological concerns, hypothesis generation, characterisation, and implications for downstream applications.

11:55-12:20
Extracting Clinical Significance for Drug-Gene Interactions using FDA Label Packages
Confirmed Presenter: Matthew Cannon, Institute for Genomic Medicine, Nationwide Children's Hospital, United States

Room: 522
Format: In Person

Moderator(s): Robert Hoehndorf


Authors List:

  • Matthew Cannon, Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • James Stevenson, Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • Kathryn Stahl, Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • Rohit Basu, Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • Kori Kuzma, Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • Adam Coffman, Department of Medicine, Washington University, United States
  • Susanna Kiwala, Department of Medicine, Washington University, United States
  • Joshua McMichael, Department of Medicine, Washington University, United States
  • Elaine Mardis, Institute for Genomic Medicine, Nationwide Children's Hospital, United States
  • Obi Griffith, Department of Medicine, Washington University, United States
  • Malachi Griffith, Department of Medicine, Washington University, United States
  • Alex Wagner, Institute for Genomic Medicine, Nationwide Children's Hospital, United States

Presentation Overview:

The drug-gene interaction database (DGIdb) is a resource that aggregates interaction data from over 40 different resources into one platform with the primary goal of making the druggable genome accessible to clinicians and researchers. By providing a public, computationally accessible database, DGIdb enables therapeutic insights through broad aggregation of drug-gene interaction data.

As part of our aggregation process, DGIdb preserves data regarding interaction types, directionality, and other attributes that enable filtering or biochemical insight. However, source data are often incomplete and may not contain the therapeutic relevance of the interaction. In this report, we address these missing data and demonstrate a pipeline for extracting physiological context from free-text sources. We apply existing large language models (LLMs) to tag and extract indications, cancer types, and relevant pharmacogenomics from free-text FDA-approved labels. We are then able to use the Variant Interpretation for Cancer Consortium (VICC) normalization services to ground extracted data back to formally grouped concepts.

In a preliminary test set of 355 FDA labels, we were able to normalize 59.4% of extracted chemical entities back to ontologically-grounded therapeutic concepts. We can link this therapeutic context data back to interaction records already searchable within DGIdb. By using LLMs to extract this data set, we can supplement our existing interaction data with relevant indications, pharmacogenomic data and mutational statuses that may inform the therapeutic relevance of a particular interaction. Inclusion of these data will be invaluable for variant interpretation pipelines where mutational status can lead to the identification of a lifesaving therapeutic.

14:20-15:05
Proceedings Presentation: Predicting protein functions using positive-unlabeled ranking with ontology-based priors
Confirmed Presenter: Fernando Zhapa-Camacho, King Abdullah University of Science and Technology, Saudi Arabia

Room: 522
Format: Live Stream

Moderator(s): Tiffany Callahan


Authors List:

  • Fernando Zhapa-Camacho, King Abdullah University of Science and Technology, Saudi Arabia
  • Zhenwei Tang, University of Toronto, Canada
  • Maxat Kulmanov, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

Automated protein function prediction is a crucial and widely studied problem in bioinformatics. Computationally, protein function prediction is a multilabel classification problem where only positive samples are defined and there is a large number of unlabeled annotations. Most existing methods rely on the assumption that the unlabeled protein function annotations are negatives, which induces the false negative issue, where potential positive samples are trained as negatives. We introduce a novel approach named PU-GO, wherein we address function prediction as a positive-unlabeled ranking problem. We apply empirical risk minimization, i.e., we minimize the classification risk of a classifier where class priors are obtained from the Gene Ontology hierarchical structure. We show that our approach is more robust than other state-of-the-art methods on similarity-based and time-based benchmark datasets. Data and code are available at https://github.com/bio-ontology-research-group/PU-GO.
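
To make the idea of a positive-unlabeled risk concrete, here is a minimal sketch of a generic non-negative PU risk estimate for a single class. It is not the PU-GO implementation: the sigmoid surrogate loss, the function names, and the single scalar `prior` argument are assumptions made for illustration, whereas the abstract states that class priors come from the Gene Ontology hierarchy.

```python
import math


def sigmoid_loss(score, label):
    """Surrogate loss sigmoid(-label * score) for label in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk estimate for one class (illustrative assumption).

    pos_scores: classifier scores for proteins annotated with the class.
    unl_scores: scores for proteins with no annotation for the class.
    prior: assumed fraction of true positives in the population.
    """
    pos_as_pos = mean([sigmoid_loss(s, +1) for s in pos_scores])
    pos_as_neg = mean([sigmoid_loss(s, -1) for s in pos_scores])
    unl_as_neg = mean([sigmoid_loss(s, -1) for s in unl_scores])
    # Unlabeled data mixes positives and negatives, so the estimated
    # positive share is subtracted; the result is clamped at zero.
    negative_risk = unl_as_neg - prior * pos_as_neg
    return prior * pos_as_pos + max(0.0, negative_risk)


if __name__ == "__main__":
    print(nn_pu_risk([2.0, 1.5], [-1.0, 0.5, -2.0], prior=0.3))
```

The clamp prevents the estimated risk on negatives from going below zero, which can otherwise happen when a flexible classifier overfits the unlabeled set.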

15:05-15:30
Protein Function: how much do we know and how much do we care?
Confirmed Presenter: An Phan, Iowa State University, United States

Room: 522
Format: In Person

Moderator(s): Tiffany Callahan


Authors List:

  • An Phan, Iowa State University, United States
  • Karin Dorman, Iowa State University, United States
  • Claus Kadelka, Iowa State University, United States
  • Iddo Friedberg, Iowa State University, United States

Presentation Overview:

The resources required to study gene function are limited, especially when considering the number of genes in the human genome and the complexity of their function. Genes are prioritized for experimental studies based on many different considerations, including, but not limited to, perceived biomedical importance and the understanding of biomedical processes. At the same time, the lion's share of genes are not studied or are under-characterized, with detrimental results to our understanding of the functions inherent to them, and their effects on human health and wellness. However, the size of this disparity in knowledge has not yet been quantified. Understanding function annotation disparity is a necessary first step toward understanding how much functional knowledge of the human genome has been gained, and toward guiding future studies of its component genes effectively.
Here, we present a comprehensive longitudinal analysis of our understanding of the human proteome utilizing tools from economics and information theory. Specifically, we view the human proteome as a population of proteins with a knowledge economy: we treat quantified knowledge of the function of each protein as the equivalent of its wealth, and examine the distribution of knowledge of proteins within a proteome in the same manner that the distribution of wealth is studied in societies. Our results show a broad distribution of functional knowledge about human proteins over the last decade, in which the inequality in annotations of these proteins remains high.
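
The abstract does not name the specific inequality measures used. As one illustration of treating annotation knowledge like wealth, the sketch below computes a Gini coefficient over per-protein knowledge values; the choice of the Gini coefficient and the toy numbers are assumptions, not results from the presented work.

```python
def gini(values):
    """Gini coefficient of non-negative values: 0 means perfect equality."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula on ascending-sorted data with 1-based index i:
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


if __name__ == "__main__":
    # Hypothetical knowledge scores (e.g. annotation counts) per protein.
    evenly_studied = [5, 5, 5, 5]
    skewed = [0, 0, 0, 10]
    print(gini(evenly_studied), gini(skewed))
```

A value near 0 would indicate that knowledge is spread evenly across proteins, while values approaching 1 indicate that a few proteins hold most of the annotations.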

15:30-15:55
Harmonizing human and microbial datasets to explore mechanisms of the gut microbiome in disease
Confirmed Presenter: Brook Santangelo, University of Colorado Anschutz Medical Campus, United States

Room: 522
Format: In Person

Moderator(s): Tiffany Callahan


Authors List:

  • Brook Santangelo, University of Colorado Anschutz Medical Campus, United States
  • Marcin Joachimiak, Lawrence Berkeley National Laboratory, United States
  • Harshad Hegde, Lawrence Berkeley National Laboratory, United States
  • Lawrence Hunter, University of Chicago, United States
  • Catherine Lozupone, University of Colorado Anschutz Medical Campus, United States

Presentation Overview:

The integration of disparate forms of biological data is essential for understanding human health and disease. Doing so is particularly challenging in the context of microbe-host interactions that contribute to both positive and negative health outcomes. There are thousands of relevant microbial species, and many interactions among those microbes and with the host. To facilitate understanding of these complex interactions, information about host and microbial physiology, genetics, and metabolism, including interactions, must be assembled. We address this technical challenge by harmonizing data in the form of a knowledge graph (KG) of the gut microbiome in disease. We present a KG that integrates enzymatic data of human and over 1,500 microbial proteomes, drawn from UniProt and 8 other reaction, enzymatic, genomic, chemical, pathway, and disease-oriented resources. We also provide a framework that supports customizable subsets which represent a microbial community of interest. We use a version of the graph, constrained by gut microbes known to be correlated with disease, that contains over 8 million nodes and 30 million edges. We apply a novel semantic search to identify meaningful mechanistic hypotheses for these microbe-disease relationships. Finally, we demonstrate the predictive capabilities of the KG by using graph embeddings to identify similarities among individual microbial taxa and human disease. This KG is an important enabling technology for automated methods to uncover mechanistic explanations for microbe-disease associations.

16:40-17:05
Using ontologies to make bioassay protocols machine readable
Confirmed Presenter: Alex Clark, Collaborative Drug Discovery, Canada

Room: 522
Format: In Person

Moderator(s): Robert Hoehndorf


Authors List:

  • Alex Clark, Collaborative Drug Discovery, Canada
  • Barry Bunin, Collaborative Drug Discovery, United States
  • Jason Harris, Collaborative Drug Discovery, United States

Presentation Overview:

Bioassay protocols have lagged other areas of drug discovery in terms of digitization. While molecules and proteins have spawned entire disciplines (cheminformatics and bioinformatics), most archives of assays carried out by companies are siloed away as a combination of bespoke pick lists, plain text, and sporadic links to globally meaningful dictionaries. Published experiments often err on the side of terseness and obfuscation by referring to similar work. This leads to serious challenges to anyone who wants to federate data, or effectively search it, or use it as the basis of any kind of machine learning inference. Reproducibility issues are further confounded by the difficulty of ascertaining whether any two experiments are comparable. Public ontologies can greatly improve the machine readability of assay protocols by virtue of having universal meaning. We will describe an open source project - BioAssay Express - that uses templates to gather and organize ontologies into a coherent user interface for curating data content. We have marked up 4000 assays from PubChem using our templates, plus another 2600 from the DataFAIRy project, using a hybrid automated model/expert curation workflow. This freely available data can be precisely and rapidly searched as well as used for sophisticated analysis techniques and model building. We have integrated these curation tools into a commercial product in order to make the process of creating marked-up data less work than traditional writeups, with the ultimate goal of making machine readable data the standard practice rather than a post-publication cleanup chore.

17:05-17:30
Knowledge graphs in Cancer Genomics: The Case of Mutational Signatures
Confirmed Presenter: Ulrike Steindl, Computational Biomedicine, University Hospital Aachen, Germany; Cancer Research Center Cologne-Essen, Germany

Room: 522
Format: In Person

Moderator(s): Robert Hoehndorf


Authors List:

  • Ulrike Steindl, Computational Biomedicine, University Hospital Aachen, Germany; Cancer Research Center Cologne-Essen, Germany
  • Arnab Chakrabarti, Computational Biomedicine, University Hospital Aachen, Germany; Cancer Research Center Cologne-Essen, Germany
  • Kjong-Van Lehmann, Computational Biomedicine, University Hospital Aachen, Germany; Cancer Research Center Cologne-Essen, Germany

Presentation Overview:

Mutational signatures are generated from somatic genomic mutation data based on their sequence context and have been shown to be indicative of various functional changes in cancer patients.
Studying cancer biology using mutational signatures is an emerging field of research. The analyses are continuously being refined.
The findings generated using this approach are manifold, making it challenging to draw conclusions. Because this knowledge is distributed, integrating it allows for new discoveries.
In this work, we will introduce the Mutational Signature Ontology, an ontology describing the space of mutational signatures.

The Mutational Signature Ontology represents the numeric data of the signatures in COSMIC database version 3.4 (Sondka et al. 2023) and selected metadata. It is implemented as an OWL/RDF knowledge graph, encoding other necessary information regarding the sample used and other features encoded in the COSMIC dataset.

We also integrated the discoveries based on Alexandrov et al. (2020), which provide a quantitative link between cancer types and mutational signatures.
The tumor, etiology, and treatment classes of the Mutational Signature Ontology have been designed to be interoperable with the National Cancer Institute Thesaurus (NCIT).

The Mutational Signature Ontology models relations between mutational signatures, mutations, and genomic locations, using concepts from the Gene Ontology (Ashburner et al. 2000) and the Sequence Ontology (Eilbeck et al. 2005).

The Mutational Signature Ontology and knowledge graph provide missing links in the existing ontology space in oncology. They enable interaction between previously unrelated knowledge spaces and will allow for new predictions.

17:30-18:00
COSI Closing Remarks
Room: 522
Format: In person

Moderator(s): Robert Hoehndorf


Authors List:

  • Robert Hoehndorf
  • Tiffany Callahan

Presentation Overview:

Speaker Questions and COSI Closing / Community Discussion