Leading Professional Society for Computational Biology and Bioinformatics
Connecting, Training, Empowering, Worldwide



COSI Track Presentations

Schedule subject to change
Monday, July 9th
10:15 AM-10:20 AM
Bio-Ontologies: Introduction
Room: Columbus EF
10:20 AM-11:20 AM
The use of existing and new ontologies in African biomedical research
Room: Columbus EF
  • Nicola Mulder, University of Cape Town, South Africa

African scientists have joined the genomics revolution, and large-scale
projects to study the genetic and environmental determinants of diseases
are underway across the continent. These have been made possible
through initiatives such as the Human Heredity and Health in Africa
(H3Africa), which is funding research projects, large collaborative
centres, biorepositories and a bioinformatics network. H3ABioNet is the
NIH-funded bioinformatics network, composed of 27 institutions across 16
African countries, which is developing capacity for genomics research.
One of the roles of the network is to facilitate data harmonization and
submission to public repositories, such as the EGA. Data from the
H3Africa projects is being collected for over 100,000 participants on
multiple communicable and non-communicable diseases, and includes
phenotype, genomic and other kinds of experimental data. H3ABioNet is
working with the projects to harmonize the data and map to ontologies.
In doing so, we identified a few gaps in existing ontologies, including
those for the description of ethnolinguistic groups and for
Africa-specific diseases such as Sickle Cell Disease. To address these we
extended existing terms for ethnolinguistic groups and developed a
Sickle Cell Disease Ontology. This talk will describe some of this work
and other challenges encountered in harmonizing the data.   

11:20 AM-11:40 AM
Proceedings Presentation: Enumerating consistent subgraphs of directed acyclic graphs: an insight into biomedical ontologies
Room: Columbus EF
  • Yisu Peng, Indiana University Bloomington, United States
  • Yuxiang Jiang, Indiana University Bloomington, United States
  • Predrag Radivojac, Indiana University Bloomington, United States

Motivation: Modern problems of concept annotation associate an object of interest (gene, individual, text document) with a set of interrelated textual descriptors (functions, diseases, topics), often organized in concept hierarchies or ontologies. Most ontologies can be seen as directed acyclic graphs, where nodes represent concepts and edges represent relational ties between these concepts. Given an ontology graph, each object can only be annotated by a consistent subgraph; that is, a subgraph such that if an object is annotated by a particular concept, it must also be annotated by all other concepts that generalize it. Ontologies therefore provide a compact representation of a large space of possible consistent subgraphs; however, until now we have not been aware of a practical algorithm that can enumerate such annotation spaces for a given ontology.

Results: We propose an algorithm for enumerating consistent subgraphs of directed acyclic graphs. The algorithm recursively partitions the graph into strictly smaller graphs until the resulting graph becomes a rooted tree (forest), for which a linear-time solution is computed. It then combines the tallies from graphs created in the recursion to obtain the final count. We prove the correctness of this algorithm, propose several practical accelerations, evaluate it on random graphs, and then apply it to characterize four major biomedical ontologies. We believe this work provides valuable insights into the complexity of concept annotation spaces and its potential influence on the predictability of ontological annotation.
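
The counting problem can be illustrated with a minimal sketch. It is not the authors' algorithm: it uses a simpler memoized recursion on a single node v (either v is excluded, which also excludes all of its descendants, or v is included, which forces all of its ancestors) and does not attempt the tree decomposition or the accelerations described above. The function name and the `parents` input format are assumptions made for illustration, and the empty annotation is counted as one consistent subgraph.

```python
from functools import lru_cache


def count_consistent_subgraphs(parents):
    """Count ancestor-closed node sets of a DAG (empty set included).

    parents: dict mapping each node to the set of its direct parents
    (more general concepts). Every parent must also be a key, and node
    labels must be mutually comparable (e.g. all strings).
    """
    nodes = frozenset(parents)
    children = {n: set() for n in nodes}
    for node, direct_parents in parents.items():
        for p in direct_parents:
            children[p].add(node)

    def closure(start, step):
        # All nodes reachable from start via step, including start itself.
        seen, stack = {start}, [start]
        while stack:
            for nxt in step[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return frozenset(seen)

    ancestors = {n: closure(n, parents) for n in nodes}
    descendants = {n: closure(n, children) for n in nodes}

    @lru_cache(maxsize=None)
    def count(remaining):
        if not remaining:
            return 1  # only the empty choice is left
        v = min(remaining)
        # Case 1: v is not annotated, so none of its descendants can be.
        without_v = count(remaining - descendants[v])
        # Case 2: v is annotated, so all of its ancestors are forced in.
        with_v = count(remaining - ancestors[v])
        return without_v + with_v

    return count(nodes)


# A diamond-shaped ontology: a is the root, d has two parents.
diamond = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(count_consistent_subgraphs(diamond))
```

For the diamond, the consistent node sets are {}, {a}, {a,b}, {a,c}, {a,b,c} and {a,b,c,d}. Memoizing on the set of remaining nodes keeps small examples fast, but the number of distinct subproblems can still grow exponentially on large ontologies.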

11:40 AM-12:00 PM
Standardizing ontology metadata in the OBO registry and beyond
Room: Columbus EF
  • Randi Vita, La Jolla Institute for Allergy & Immunology, United States
  • Rebecca Tauber, Knocean Inc., Canada
  • James A. Overton, Knocean Inc., Canada
  • Darren Natale, Georgetown University Medical Center, United States
  • Jie Zheng, University of Pennsylvania, United States
  • Christian Stoeckert, University of Pennsylvania, United States
  • Philippe Rocca-Serra, University of Oxford, United Kingdom
  • Chris Mungall, Lawrence Berkeley National Laboratory, United States
  • Bjoern Peters, La Jolla Institute for Allergy and Immunology, United States

Ontologies are broadly used in the biomedical domain to enable precise and computable representation of entities in the world and their relationships. To coordinate these ontology development efforts, the Open Biomedical Ontologies (OBO) Foundry was established in 2001 as a community effort to converge toward a suite of freely shared, interoperable ontologies that cover all of biology and biomedicine. OBO provides a number of services to the community, including the OBO registry, which contains metadata on all OBO registered ontologies. Herein, we describe recent efforts to update the registry in order to improve consistency and completeness of the gathered data while at the same time reducing the need for manual inspection and ultimately building a low-cost, high-value approach that is both sustainable and scalable.

12:00 PM-12:06 PM
Ontology-based Semantic Mapping of Chemical Toxicities
Room: Columbus EF
  • Rong-Lin Wang, US EPA, United States
  • Cataia Ives, RTI International, United States
  • Steve Edwards, RTI International, United States

Omics technologies are essential to the ongoing paradigm shift in chemical toxicology, from in vivo animal testing to an in vitro, toxicity pathway-oriented approach. While great progress has been made in dissecting molecular mechanisms of action (MOAs) and developing exposure biomarkers, most toxicogenomics studies overlook the phenotypes most relevant to the apical endpoints of interest, i.e. those observed at higher levels of biological organization, and their integration with molecular MOAs.

Ontology-based semantic analysis integrates chemical-induced toxicity phenotypes across levels of biological organization, knowledge domains, and species. In this pilot study, we assembled over 700 publications from the EcoTox Database (https://cfpub.epa.gov/ecotox/) covering six vertebrates and ten chemicals. Toxicity responses from individual studies were annotated by Entity-Quality syntax, converted into OWL classes, and organized into 19 chemical-species phenotypic profiles (PPs). A collection of more than 28,000 target PPs was also assembled by genes, KEGG pathways, and diseases from human, mouse, and zebrafish phenotype ontologies. A Java application was developed based on OWLAPI version 4.2.5 and Semantic Measure Library (SML 0.9.4d). Using a cross-species phenotype ontology, http://purl.obolibrary.org/obo/upheno/vertebrate.owl, we compared the 19 chemical PPs against the targets and themselves, and identified their semantically most similar chemicals, genes, KEGG pathways, and disease PPs.

12:06 PM-12:13 PM
FAIRShake: Toolkit to Enable the FAIRness Compliance Assessment of Biomedical Digital Objects
Room: Columbus EF
  • Lily Wang, Icahn School of Medicine at Mount Sinai, United States
  • Avi Ma'Ayan, Icahn School of Medicine at Mount Sinai, United States

Making digital research objects more findable, accessible, interoperable and reusable (FAIR) is critical for realizing the opportunity for accelerated progress in research through data science. Here we present FAIRShake, a toolkit developed to enable the assessment of compliance of biomedical digital research objects with the FAIR guiding principles. FAIRShake functions as a repository to store and serve FAIR assessments. FAIRness assessments of three types of digital objects: datasets, tools, and repositories/databases, are based on answers to nine questions. In order to visually communicate FAIRness level, a 3x3 grid of colored squares, called the FAIR insignia, was developed. The FAIRness insignia identifies areas of strength and weakness in the FAIRness level of digital objects, guiding digital object producers on how to improve FAIRness of their products. The FAIRShake toolkit consists of the FAIRShake website, through which assessments are completed and insignias minted; the FAIRShake Google Chrome browser extension; the FAIRShake bookmarklet; and FAIRShake APIs for direct programmatic access to the information within the FAIRShake database. The Chrome extension and bookmarklet provide easy access to display and perform FAIR assessments on any relevant web-site. The FAIRShake toolkit is a cloud-based application freely available at http://fairshake.cloud.
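
As a rough illustration of how nine answers could be laid out as a 3x3 insignia, the sketch below arranges them row by row and renders a text version. It does not use the FAIRShake API. The function names, the answer encoding (True, False, or None for unanswered), the row-major ordering, and the text symbols standing in for colors are all assumptions made for this example.

```python
def fair_insignia(answers):
    """Arrange nine assessment answers into a 3x3 grid (row-major, assumed)."""
    answers = list(answers)
    if len(answers) != 9:
        raise ValueError("expected exactly nine answers")
    return [answers[i:i + 3] for i in range(0, 9, 3)]


def render_insignia(grid):
    """Render the grid as text; the symbols are placeholders for colors."""
    symbols = {True: "#", False: ".", None: "?"}
    return "\n".join("".join(symbols[cell] for cell in row) for row in grid)


# Hypothetical assessment: six criteria met, two unmet, one unanswered.
grid = fair_insignia([True, True, False, True, None, True, False, True, True])
print(render_insignia(grid))
```

Rows or cells dominated by the "unmet" symbol point to the criteria a producer would need to address first.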

12:13 PM-12:20 PM
Minimum Information Required: Guideline for Stroke Research and Clinical Data Reporting
Room: Columbus EF
  • Judit Kumuthini, CPGR, South Africa
  • Lyndon Zass, CPGR, South Africa
  • Paul Olowoyo, Ido-Ekiti; Afe Babalola University, South Africa
  • Mamana Mbiyavanga, UCT, South Africa
  • Faniyan Moyinoluwalogo, University of Ibadan, Nigeria
  • Gordon Wells, CPGR, South Africa
  • Victoria Nembaware, UCT, South Africa
  • Mayowa Owolabi, University of Ibadan, Nigeria

Stroke is one of the leading causes of brain injury and death worldwide. Owing to the rapid development of high-throughput technologies, significant progress has been made in stroke healthcare and research toward reducing the immense impact of stroke on public health. However, significant barriers remain for the data these technologies generate, including its analysis, availability, accessibility and usability.

In collaboration with SIREN, we aim to alleviate these challenges by enforcing a common ontology and improving experiment/study interpretability, sharing, interoperability, reproducibility and reporting. To this end, we are developing the Minimum Information Required: Guideline for Stroke Research and Clinical Data Reporting, which aims to add value to the data by making it findable, accessible, interoperable and reusable (FAIR).

The standard is based on previously published literature. A draft has been developed and is currently being reviewed by stroke researchers and clinicians through a review survey. The standard distinguishes between required and optional elements. The elements included in the standard are subdivided into participant-, disease-, and study-specific fields. The stroke research and clinical data reporting standard has the potential to benefit both the stroke research community, with ongoing and future stroke studies, and the clinical community, with clinical interoperability and collaboration.

12:20 PM-12:26 PM
Structuring Genetic Disease Complexity & Environmental Drivers
Room: Columbus EF
  • Lynn Schriml, University of Maryland School of Medicine, United States
  • Linda Jeng, University of Maryland School of Medicine, United States
  • Carol Greene, University of Maryland School of Medicine, United States
  • Cynthia Bearer, University of Maryland School of Medicine, United States
  • Richard Lichenstein, University of Maryland School of Medicine, United States
  • James Munro, University of Maryland School of Medicine, United States
  • Rebecca Tauber, University of Maryland School of Medicine, United States
  • Michelle Giglio, University of Maryland School of Medicine, United States

Representing the complexity of genetic etiology and environmental drivers of disease within the ontological structure of the Human Disease Ontology (DO) presents a framework for developing a Differential Diagnosis ontology. Beyond monogenic diseases, clinical diagnosis is challenged by the complexity of etiologies for many genetic diseases. To address the challenges of representing this clinical complexity, the DO project has developed a complex disease model to drive the restructuring of DO knowledge. The DO’s clinical team is assessing a set of complex and environmental diseases to build the knowledgebase to be represented in the DO, through an expanded data representation captured through logical definitions to the Sequence Ontology. This work is enabled through the DO’s integration of ROBOT tools for capturing and integrating the disease to functional and/or structural sequence variant associations. Expanding the DO’s ontological structure and content will inform the development of a Differential Diagnosis DO. The DO clinician team has developed a conceptual complex genetic disease model to identify the key types of genetic diseases (monogenic, chromosomal, epigenetic, methylation, post-translational) to be represented in the DO. This model forms the basis for re-structuring of the DO’s genetic disease branch to represent the clinical complexity of genetic diseases.

12:26 PM-12:33 PM
Automated Negative Gene Ontology Based Functional Predictions for Proteins with UniGOPred
Room: Columbus EF
  • Tunca Dogan, EMBL-EBI, CanSyL, METU, United Kingdom
  • Ahmet Süreyya Rifaioğlu, Middle East Technical University, Turkey
  • Rabie Saidi, EMBL-EBI, United Kingdom
  • Maria Martin, EMBL-EBI, United Kingdom
  • Volkan Atalay, Middle East Technical University, Turkey
  • Rengul Atalay, METU, Turkey

Functional annotation of biomolecules in the gene and protein databases is mostly incomplete. This is especially true for multi-domain proteins. There is a grey area in protein function data resources, where truly negative functions reside together with functions that the protein possesses but that have not yet been discovered or documented (i.e. false negatives). In many cases, information about the functions absent from the target biomolecule can be as important as the assigned functions. It is possible to resolve a portion of this grey area by predicting the functions that the target proteins most probably do not possess. In this study, we present an approach to produce negative functional annotations for protein sequences, along with regular positive associations. Using this approach, we have developed an automated function prediction tool, "UniGOPred". The negative prediction performance (recall) was measured as 0.82 for both MF and BP, and 0.66 for CC GO terms (with prediction scores ≤ 0.3), in cross-validation. To the best of our knowledge, the ability of a protein function prediction method to predict negative functions using sequence features is investigated here for the first time. UniGOPred is available as an open access tool at http://cansyl.metu.edu.tr/UniGOPred.html.
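
The evaluation described above can be sketched in a few lines: terms whose prediction score falls at or below a threshold are called negative, and recall is the fraction of known negative terms recovered. This is only an illustration, not UniGOPred code. The function name, the dictionary input format, and the GO identifiers in the example are hypothetical; only the 0.3 threshold comes from the abstract.

```python
def negative_recall(scores, known_negatives, threshold=0.3):
    """Recall of negative predictions for one protein.

    scores: dict mapping GO term id -> prediction score in [0, 1].
    known_negatives: set of GO term ids the protein is known not to have.
    A term is predicted negative when its score is <= threshold.
    """
    if not known_negatives:
        raise ValueError("recall is undefined without known negatives")
    predicted_negative = {term for term, s in scores.items() if s <= threshold}
    return len(predicted_negative & known_negatives) / len(known_negatives)


# Hypothetical scores for four placeholder GO terms.
scores = {"GO:0000001": 0.10, "GO:0000002": 0.50,
          "GO:0000003": 0.30, "GO:0000004": 0.90}
known_negatives = {"GO:0000001", "GO:0000002", "GO:0000003"}
print(negative_recall(scores, known_negatives))  # 2 of 3 negatives recovered
```

Lowering the threshold makes negative calls more conservative, which typically trades recall for fewer wrongly excluded functions.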

12:33 PM-12:40 PM
PyxisMap: Phenotype Rankings for Genomic Sequencing Variants
Room: Columbus EF
  • James Holt, HudsonAlpha Institute for Biotechnology, United States
  • Brandon Wilk, HudsonAlpha Institute for Biotechnology, United States
  • Manavalan Gajapathy, HudsonAlpha Institute for Biotechnology, United States
  • Elizabeth Worthey, HudsonAlpha Institute for Biotechnology, United States

Motivation: Clinical interpretation of genomic variants relies on computational, statistical, and phenotype annotations to review variants. Current phenotype tools are not easy to use when assessing a particular gene and are usually weeks to months behind literature due to curation.
Results: We developed the PyxisMap tool for ranking genes using phenotype information. PyxisMap assists in annotation of phenotypes from plain text and then outputs a full ranking of genes based on phenotype information from both the Human Phenotype Ontology (HPO) and all PubMed abstracts as annotated by PubTator. We show that using the tool to re-rank variants in rare disease cases significantly improves the ranking of clinically reported variants.

12:40 PM-2:00 PM
Lunch Break
2:00 PM-2:10 PM
Ontology application at the ENCODE portal
Room: Columbus EF
  • Jason Hilton, Stanford University, United States
  • Idan Gabdank, Stanford University, United States
  • Esther Chan, Stanford University, United States
  • J Seth Strattan, Stanford University, United States
  • Keenan Graham, Stanford University, United States
  • Kathrina Onate, Stanford University, United States
  • Zachary Myers, Stanford University, United States
  • Nick Luther, Stanford University, United States
  • Forrest Tanaka, Stanford University, United States
  • Tim Dreszer, Stanford University, United States
  • Casey Litton, Stanford University, United States
  • Bek Baymuradov, Stanford University, United States
  • Karthik Kalyanaraman, Stanford University, United States
  • Otto Jolanki, Stanford University, United States
  • Benjamin Hitz, Stanford University, United States
  • Mike Cherry, Stanford University, United States

The Encyclopedia of DNA elements (ENCODE) project has produced data from more than 9,000 experiments using a variety of techniques to study the structure, regulation, and transcription profiles of human and mouse genomes. The data from these experiments first pass through the ENCODE Data Coordination Center (DCC) for basic validation and metadata standardization before they are openly available to the community at the ENCODE site (https://www.encodeproject.org/). Additionally, the ENCODE portal hosts data from other projects, such as modENCODE, GGR, and Roadmap. As the volume of data and variation in experimental methods increase, the organization of experimental details becomes more vital in order to provide a useful and effective access point for our users. Ontologies are used by the ENCODE DCC to annotate metadata as one way of standardizing metadata across projects and labs, providing improved searching capabilities, and making connections across experiments. Furthermore, the ENCODE DCC is in frequent contact with various ontologies to assist in new term generation and annotation of existing terms. The collection, careful curation, and organization of vast genomic datasets, guided in part by ontologies, allows for comparison across different projects and maximizes accessibility to epigenomic data and analysis.

2:10 PM-2:20 PM
An Unsupervised Probabilistic Method for Automatically Integrating Multiple Disease Terminologies
Room: Columbus EF
  • Robert Leaman, NCBI/NLM/NIH, United States
  • Zhiyong Lu, NCBI, NLM, NIH, United States

Named entity recognition (NER) and normalization benefit greatly from comprehensive terminological resources. While many biomedical terminologies exist, each only contains a sample of the synonymous terms possible for each concept. Integrating multiple terminologies increases the number of synonyms available, but each resource differs in its coverage and level of granularity. Direct integration therefore reduces the integrity of the term/concept mappings, which is problematic for normalization.
We present a method for automatically integrating terminological resources. Our method uses a generative model to probabilistically quantify the available evidence that a pair of concepts are - or are not - synonyms. The model is trained as a matrix completion task using the information contained within the resources themselves: no additional labeled data is required. We apply our method to disease names from over 80 resources (including MeSH, OMIM, SNOMED-CT, OrphaNet, UMLS, NCI, Disease Ontology, and Monarch Disease Ontology) to create a broad-coverage disease terminology containing over 1.4 million unique terms. Our method identifies a significant number of missing cross references between existing vocabularies. We also demonstrate that enriching an existing disease lexical resource with our method results in a significant performance improvement for disease NER and normalization.

2:20 PM-2:40 PM
Proceedings Presentation: Onto2Vec: joint vector-based representation of biological entities and their ontology-based annotations
Room: Columbus EF
  • Fatima Zohra Smaili, King Abdullah University of Science and Technology, Saudi Arabia
  • Xin Gao, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Motivation: Biological knowledge is widely
represented in the form of ontology-based annotations: ontologies
describe the phenomena assumed to exist within a domain, and the
annotations associate a (kind of) biological entity with a set of
phenomena within the domain. The structure and information contained
in ontologies and their annotations makes them valuable for
developing machine learning, data analysis and knowledge extraction
algorithms; notably, semantic similarity is widely used to identify
relations between biological entities, and ontology-based
annotations are frequently used as features in machine learning.

Results: We propose the Onto2Vec method, an approach to
learn feature vectors for biological entities based on their
annotations to biomedical ontologies. Our method can be applied to a
wide range of bioinformatics research problems such as
similarity-based prediction of interactions between proteins,
classification of interaction types using supervised learning, or
clustering. To evaluate Onto2Vec, we use the Gene Ontology (GO) and
jointly produce dense vector representations of proteins, the GO
classes to which they are annotated, and the axioms in GO that
constrain these classes. First, we demonstrate that
Onto2Vec-generated feature vectors can significantly improve
prediction of protein-protein interactions in human and yeast. We
then illustrate how Onto2Vec representations provide the means for
constructing data-driven, trainable semantic similarity measures
that can be used to identify particular relations between
proteins. Finally, we use an unsupervised clustering approach to
identify protein families based on their Enzyme Commission
numbers. Our results demonstrate that Onto2Vec can generate
high-quality feature vectors from biological entities and
ontologies. Onto2Vec has the potential to significantly outperform
the state-of-the-art in several predictive applications in which
ontologies are involved.
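
Once entities are represented as dense vectors, a simple way to compare them is cosine similarity. The sketch below is a generic illustration and is not taken from the Onto2Vec repository: the function names are hypothetical, and the two-dimensional vectors stand in for learned feature vectors, which would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, vectors):
    """Rank all other entities by decreasing similarity to the query."""
    others = (name for name in vectors if name != query)
    return sorted(
        others,
        key=lambda name: cosine_similarity(vectors[query], vectors[name]),
        reverse=True,
    )


# Toy vectors for three placeholder proteins.
vectors = {"P1": [1.0, 0.0], "P2": [0.9, 0.1], "P3": [0.0, 1.0]}
print(most_similar("P1", vectors))
```

A ranking of this kind can serve as a baseline for similarity-based prediction, for example proposing the top-ranked proteins as candidate interaction partners.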

Availability: https://github.com/bio-ontology-research-group/onto2vec

Contact: robert.hoehndorf@kaust.edu.sa and xin.gao@kaust.edu.sa

2:40 PM-3:00 PM
Ontology-based annotation and integration of pathway databases
Room: Columbus EF
  • Lucy Lu Wang, University of Washington, United States
  • Mary Shimoyama, Medical College of Wisconsin, United States
  • G. Thomas Hayman, Medical College of Wisconsin, United States
  • Jennifer R. Smith, Medical College of Wisconsin, United States
  • Monika Tutaj, Medical College of Wisconsin, United States
  • John Gennari, University of Washington, United States

Biological pathway alignment is necessary to reduce redundancy in pathway data for secondary analysis, but it is difficult to identify semantically similar pathways to align based on entity membership alone. Annotations to the Pathway Ontology (PW) can be used to identify semantically similar pathways. This paper describes a computationally assisted method for annotating pathways to classes in the PW. An ensemble model using lexical features and ontology matching software was used to derive PW annotations for Reactome pathways. Proposed annotations were reviewed by the authors and PW curatorial team for correctness and inclusion into the PW.

3:00 PM-4:00 PM
Deep X: Deep Learning with Deep Knowledge
Room: Columbus EF
  • Volker Tresp

Labeled graphs can describe states and events at a cognitive
abstraction level, representing facts as subject-predicate-object
triples. A prominent and very successful example is the Google
Knowledge Graph, representing on the order of 100B facts. Labeled
graphs can be represented as adjacency tensors which can serve as
inputs for prediction and decision making, and from which tensor
models can be derived to generalize to unseen facts. We show how
these ideas can be used, together with deep recurrent networks, for
clinical decision support by predicting orders and outcomes.
Following Goethe’s proverb, “you only see what you know”, we show how
background knowledge can dramatically improve information extraction
from images by deep convolutional networks and how tensor train models
can be used for the efficient classification of videos. We discuss
potential links to the memory and perceptual systems of the human
brain. We conclude that tensor models, in connection with deep
learning, can be the basis for many technical solutions requiring
memory and perception, and might be a basis for modern AI.
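
The adjacency-tensor view of a labeled graph can be made concrete with a small sketch. It assumes a dense binary tensor indexed as [subject][predicate][object] and built from plain Python lists. The function name, the index order, and the example triples are illustrative assumptions, and a graph of realistic size would need a sparse representation instead.

```python
def adjacency_tensor(triples):
    """Build a binary tensor T with T[s][p][o] = 1 for each known triple.

    triples: iterable of (subject, predicate, object) strings.
    Returns the tensor plus the sorted entity and predicate lists that
    define its index order.
    """
    triples = list(triples)
    entities = sorted({e for s, _, o in triples for e in (s, o)})
    predicates = sorted({p for _, p, _ in triples})
    e_idx = {e: i for i, e in enumerate(entities)}
    p_idx = {p: i for i, p in enumerate(predicates)}

    n, m = len(entities), len(predicates)
    tensor = [[[0] * n for _ in range(m)] for _ in range(n)]
    for s, p, o in triples:
        tensor[e_idx[s]][p_idx[p]][e_idx[o]] = 1
    return tensor, entities, predicates


# Two hypothetical facts.
facts = [("aspirin", "treats", "headache"), ("headache", "is_a", "symptom")]
tensor, entities, predicates = adjacency_tensor(facts)
print(entities)    # index order of the subject and object axes
print(predicates)  # index order of the predicate axis
```

Entries that remain zero are simply unobserved, and a tensor factorization model would assign scores to them in order to generalize to unseen facts.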

4:00 PM-5:00 PM
Coffee Break
5:00 PM-6:00 PM
The Role of Ontologies in Artificial Intelligence (and Machine Learning)
Room: Columbus EF
Tuesday, July 10th
8:35 AM-8:40 AM
Bio-Ontologies: Introduction
Room: Columbus EF
8:40 AM-9:00 AM
Ontology-Based Concept Recognition by Using Word Embeddings
Room: Columbus EF
  • Sara Althubaiti, King Abdullah University of Science and Technology, Saudi Arabia
  • Senay Kafkas, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Motivation: Ontologies are widely used across biology and biomedicine for the annotation of databases. Ontology-based annotation often requires literature curation which is a time-consuming and expensive process given the large volume of literature. Automatic and accurate identification of ontology classes in text plays a key role in making literature curation more efficient. While several methods were developed for concept recognition in text, they are often specific to particular ontologies or primarily dictionary-based and NLP-based and therefore not able to discover new terms referring to ontology classes.
Results: We developed a method for recognizing mentions of ontology classes in text. Our method is based on machine learning: it utilizes word embeddings, combines them with automated reasoning over ontologies to generate training and test data, and uses a neural network classifier to recognize whether a word refers to a particular class. We demonstrate the utility of our approach in the identification of disease concepts from the Human Disease Ontology and demonstrate that our approach generates accurate results (F-score above 90%) and is capable of discovering concepts that are not present in an ontology. The algorithm, corpora and evaluation datasets are available at https://github.com/bio-ontology-research-group/ConceptRecognition_word2vec

9:00 AM-9:20 AM
Proceedings Presentation: Deep neural networks and distant supervision for geographic location mention extraction
Room: Columbus EF
  • Arjun Magge, ASU, United States
  • Davy Weissenbacher, University of Pennsylvania, United States
  • Abeed Sarker, University of Pennsylvania, United States
  • Matthew Scotch, Arizona State University, United States
  • Graciela Gonzalez, University of Pennsylvania, United States

Motivation: Virus phylogeographers rely on DNA sequences of viruses and the locations of the infected hosts found in public sequence databases like GenBank for modeling virus spread. However, the locations in GenBank records are often only at the country or state level, and may require phylogeographers to scan the journal articles associated with the records to identify more localized geographic areas. To automate this process, we present a named entity recognizer (NER) for detecting locations in biomedical literature. We built the NER using a deep feedforward neural network to determine whether a given token is a toponym or not. To overcome the limited human-annotated data available for training, we use distant supervision techniques to generate additional samples to train our NER.

Results: Our NER achieves an F1-score of 0.910 and significantly outperforms the previous state-of-the-art system. Using the additional data generated through distant supervision further boosts the performance of the NER achieving an F1-score of 0.927. The NER presented in this research improves over previous systems significantly. Our experiments also demonstrate the NER’s capability to embed external features to further boost the system’s performance. We believe that the same methodology can be applied for recognizing similar biomedical entities in scientific literature.

9:20 AM-9:26 AM
Project Tycho 2.0: A New Repository for the Integration and Reuse of Global Health Data
Room: Columbus EF
  • Willem Van Panhuis, University of Pittsburgh, United States

Much information in global health is organized in siloed repositories, and global health datasets are relatively small compared to genomics or proteomics datasets. The data problem in global health could be considered a small data problem on a big scale. In 2013 we released the first version of Project Tycho to disseminate disease surveillance data reported by health agencies in the United States between 1888 and 2014. Over the past 3.5 years, 3500+ users have registered to use Project Tycho and over 40 creative works, including 20 peer-reviewed papers, have been published that used Project Tycho data. We have now released Project Tycho 2.0, which aims to represent information for global health in a more FAIR (Findable, Accessible, Interoperable, and Reusable) compliant way. We re-represented all our US data and information about dengue fever for 99 countries in a standard data format, using standard ontologies and vocabularies where possible. We also created rich metadata in DataCite XML and Data Tag Suite (DATS) JSON format. With Project Tycho 2.0, we aim to improve the integration and machine-interpretability of global health data so that new discoveries can truly be made across all scales in biology, from the molecule to the global population.

9:26 AM-9:33 AM
Computational Classification of Phenologs across Biological Diversity
Room: Columbus EF
  • Ian Braun, Iowa State University, United States
  • Carolyn Lawrence-Dill, Iowa State University, United States

Phenotypic diversity analyses are the basis for research discoveries that span the spectrum from basic biology (e.g., gene function and pathway membership) to applied research (e.g., plant breeding). In cases where equivalent phenotypes across individuals or groups are not anatomically similar, high-throughput, computational classification is possible if the traits and phenotypes are documented using standardized, language-based descriptions. In the case of text phenotype data, conversion to computer-readable “EQ” statements enables such large-scale analyses. EQ statements are composed of entities (e.g., leaf) and qualities (e.g., length) drawn from terms in ontologies. We present a method for automatically converting free-text descriptions of plant phenotypes to EQ statements using a machine learning approach. Random forest classifiers identify potential matches between phenotype descriptions and terms from a set of ontologies including GO, PO, and PATO. Features used include semantic, syntactic, and context similarity measures. The classifiers are trained and tested using a dataset of text descriptions and EQ statements from the Plant PhenomeNET project (Oellrich, Walls et al., 2015). The most likely candidate terms are used to compose EQ statements with confidence scores. Results of evaluating the accuracy of this approach are presented, and its potential use to enable automated phenolog discovery is discussed.

9:33 AM-9:40 AM
HPO2GO: Prediction of Human Phenotype Ontology Term Associations for Proteins Using Cross Ontology Annotation Co-occurrences
Room: Columbus EF
  • Tunca Dogan, EMBL-EBI, CanSyL, METU, United Kingdom

Presentation Overview:

Analysing the relationships between biomolecules and genetic diseases is a highly active area of research. A novel approach is proposed here to this end, mapping abnormality-defining Human Phenotype Ontology (HPO) terms to biomolecular-function-defining GO terms, where each association indicates the occurrence of the abnormality due to the loss of the molecular function expressed by the corresponding GO term. The proposed HPO2GO mappings were extracted by calculating the frequency of co-annotations of the terms on the same genes/proteins, followed by filtering of unreliable mappings via statistical resampling. Furthermore, the biological relevance of the finalized mappings was discussed over selected cases. The resulting HPO2GO mappings can be employed in different settings to predict and to analyse novel gene/protein-disease relations. As an application of the proposed approach, HPO term-protein associations (i.e., HPO2protein) are predicted. To test and compare the predictive performance, the CAFA2 challenge HPO prediction target protein set was used. The results showed that HPO2GO outperformed all models from the participating groups by a margin (Fmax=0.402). The automated cross-ontology mapping approach developed in this work can easily be extended to other ontologies as well, to identify unexplored relation patterns at the systemic level.
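
The co-annotation counting step can be sketched roughly as follows. The function name, the placeholder gene and term identifiers, the `min_count` threshold, and the overlap score (shared genes divided by the genes annotated with either term) are assumptions chosen for illustration. The statistical resampling filter mentioned in the abstract is not reproduced here.

```python
from collections import Counter
from itertools import product


def co_annotation_scores(hpo_by_gene, go_by_gene, min_count=2):
    """Score HPO-GO term pairs by how often they annotate the same gene.

    The score, shared genes divided by genes annotated with either term,
    is a stand-in chosen for this sketch, not the measure used in the talk.
    """
    pair_counts, hpo_counts, go_counts = Counter(), Counter(), Counter()
    for hpo_terms in hpo_by_gene.values():
        hpo_counts.update(hpo_terms)
    for go_terms in go_by_gene.values():
        go_counts.update(go_terms)
    # Only genes annotated in both ontologies can contribute co-annotations
    for gene in hpo_by_gene.keys() & go_by_gene.keys():
        pair_counts.update(product(hpo_by_gene[gene], go_by_gene[gene]))

    scores = {}
    for (hpo, go), shared in pair_counts.items():
        if shared >= min_count:  # drop pairs supported by too few genes
            scores[(hpo, go)] = shared / (hpo_counts[hpo] + go_counts[go] - shared)
    return scores


# Placeholder annotations; the identifiers are not real HPO or GO terms
hpo_by_gene = {"gene1": {"HP:A"}, "gene2": {"HP:A", "HP:B"}, "gene3": {"HP:B"}}
go_by_gene = {"gene1": {"GO:X"}, "gene2": {"GO:X"}, "gene3": {"GO:Y"}, "gene4": {"GO:X"}}

scores = co_annotation_scores(hpo_by_gene, go_by_gene)
print(scores)
```

In this toy input only the pair ("HP:A", "GO:X") is shared by at least two genes, so it is the only mapping kept.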

9:40 AM-10:15 AM
Coffee Break
10:15 AM-10:20 AM
Bio-Ontologies: Introduction
Room: Columbus EF
10:20 AM-10:40 AM
Proceedings Presentation: A Gene-Phenotype Relationship Extraction Pipeline from the Biomedical Literature Using a Representation Learning Approach
Room: Columbus EF
  • Wenhui Xing, Wuhan University of Technology, China
  • Junsheng Qi, China Agricultural University, China
  • Xiaohui Yuan, Wuhan University of Technology, China
  • Lin Li, Wuhan University of Technology, China
  • Xiaoyu Zhang, Huazhong University of Science and Technology, China
  • Yuhua Fu, Wuhan University of Technology, China
  • Shengwu Xiong, Wuhan University of Technology, China
  • Lun Hu, Wuhan University of Technology, China
  • Jing Peng, Wuhan University of Technology, China

Presentation Overview:

Motivation: The fundamental challenge of modern genetic analysis is to establish gene-phenotype correlations, which are often reported across large numbers of publications. Because the lexical features of genes are relatively regular in text, the main challenge of this relation extraction is phenotype recognition. Because phenotypic descriptions are often study- or author-specific, few lexicons can effectively identify the full range of phenotypic expressions in text, especially for plants.
Methods: We propose a pipeline for extracting phenotypes, genes and their relations from biomedical literature. By combining it with abbreviation revision and sentence template extraction, we improve the unsupervised word-embedding-to-sentence-embedding cascaded approach, used here as representation learning, to recognize the broad variety of phenotypic information in the literature. In addition, a dictionary- and rule-based method is applied for gene recognition. Finally, we integrate OLLIE, a well-known information extraction system, to identify gene-phenotype relations.
Results: To demonstrate the applicability of the pipeline, we established two types of comparison experiments using the model organism Arabidopsis thaliana. In the comparison with state-of-the-art baselines, our approach obtained the best performance (F1-measure of 66.83%). We also applied the pipeline to 481 full-text articles from the TAIR gene-phenotype manual relationship dataset to assess its validity. The results showed that our proposed pipeline can cover 70.94% of the original dataset and add 373 new relations to expand it.

10:40 AM-11:00 AM
Ontology based mining of pathogen-disease associations from literature
Room: Columbus EF
  • Senay Kafkas, King Abdullah University of Science and Technology, Saudi Arabia
  • Robert Hoehndorf, King Abdullah University of Science and Technology, Saudi Arabia

Presentation Overview:

Infectious diseases claim millions of lives each year, especially in developing countries. Resistance of pathogens to drugs is the major reason for failing infection treatments. Accurate and rapid identification of causative pathogens plays a key role in the success of treatment, because prescription of the right drug can help to alleviate the drug resistance problem. Therefore, there is an urgent need for a reference resource on pathogen-disease associations that can be utilised to support diagnosis of infectious diseases. A very large portion of pathogen-disease associations is available from the literature in unstructured form, and thus we need automated methods to extract the data. Motivated by these needs, we present the first text mining system designed for extracting pathogen-disease relations from the literature. All data is publicly available from https://github.com/bio-ontology-research-group/Infectious_Diseases.git.

11:00 AM-11:20 AM
Assessing Schema.org's Coverage of Terms from Key Biomedical Datasets
Room: Columbus EF
  • Kody Moodley, Maastricht University, Netherlands
  • Josef Hardi, Stanford BMIR, United States
  • John Graybeal, Stanford University, United States
  • Mark Musen, Stanford University, United States
  • Michel Dumontier, Maastricht University, Netherlands

Presentation Overview:

Schema.org is an initiative by major Web search engines to define a common vocabulary for structuring Web content from a variety of domains, promoting data interoperability and enabling Web content to benefit from sophisticated search services. Schema.org provides specialized attributes for describing biomedical data. Before leveraging these attributes to increase the interoperability of their data, it is valuable for biomedical data publishers to know which of their key data attributes can be captured by Schema.org. There are currently no quantitative evaluations that measure how many of the existing metadata terms align with Schema.org. We provide such an evaluation here.
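
One very simple way to quantify such coverage is an exact match after normalizing term spellings, as in the sketch below. The function names, the normalization rule, and the two short term lists are assumptions for illustration only. The evaluation presented in the talk may use a different matching strategy.

```python
def normalize(term):
    """Lower-case a term and drop separators so 'birth_date' matches 'birthDate'."""
    return "".join(ch for ch in term.lower() if ch.isalnum())


def coverage(dataset_terms, vocabulary_terms):
    """Return the fraction of dataset terms with a normalized exact match, plus the misses."""
    if not dataset_terms:
        raise ValueError("dataset_terms must not be empty")
    vocab = {normalize(t) for t in vocabulary_terms}
    missing = sorted(t for t in dataset_terms if normalize(t) not in vocab)
    covered = len(dataset_terms) - len(missing)
    return covered / len(dataset_terms), missing


# Illustrative term lists, not taken from the datasets assessed in the talk
dataset_terms = ["birth_date", "Gender", "specimen type", "name"]
vocabulary_terms = ["birthDate", "gender", "name"]

fraction, missing = coverage(dataset_terms, vocabulary_terms)
print(fraction, missing)
```

Here three of the four dataset terms find a match, giving a coverage of 0.75 with "specimen type" left uncovered.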

11:20 AM-11:40 AM
Intelligently Designed Ontology Alignment: A Case Study from the Sequence Ontology
Room: Columbus EF
  • Michael Sinclair, University of Utah, United States
  • Michael Bada, CU-Denver Anschutz Medical Campus, United States
  • Karen Eilbeck, University of Utah, United States

Presentation Overview:

As the number and size of biomedical ontologies continue to grow, the problem of coordinating them becomes increasingly urgent. This is particularly problematic for ontologies with semantically overlapping or analogous content but based on different underlying conceptualizations. Ontology matching methods have been developed to align existing ontologies for semantic interoperability, but this is a laborious and inefficient process compared to designing related ontologies to be aligned from the outset. We present a method for designing companion ontologies to be largely semantically and structurally aligned as they evolve. We have applied this method to the specific case of the coordinated management of two ontologies focused on the representation of analogous types of biological sequences: the Sequence Ontology and the Molecular Sequence Ontology. We derived general principles that any group can apply to their own use cases: 1) Create inter-ontology logical definitions for those portions of the ontologies to be aligned, 2) use class annotations to automatically manage those portions of the ontologies not to be aligned, and 3) use the taxonomy of one ontology to classify the other with an OWL reasoner. We propose this as an efficient way of designing companion ontologies to be aligned throughout their development life cycle.

11:40 AM-11:46 AM
Establishing the framework for an African Genome Archive
Room: Columbus EF
  • Jamie Southgate, South African National Bioinformatics Institute (SANBI), South Africa
  • Alan Christoffels, South African National Bioinformatics Institute (SANBI), South Africa

Presentation Overview:

The generation of biomedical research data on the African continent is growing, with studies realizing the importance of African genetic diversity in discoveries of human origins and disease susceptibility. The benefits of such studies can only come to fruition if African researchers are fully involved at all levels. Such studies are producing rich large-scale datasets, which require careful curation, secure storage and governance. There is also a great willingness amongst African researchers to collaborate and promote data discovery whilst maintaining ownership of data. The development of an African Genome Archive will provide infrastructure to support the effective use of data resources to further sustain the growth of Bioinformatics and Genomic research on the continent, furthering collaboration between academic institutions, science councils and industry.

11:46 AM-11:53 AM
Predicting new relationships between genes and Human Phenotype Ontology terms
Room: Columbus EF
  • Marco Notaro, University of Milan, Italy
  • Max Schubach, Berlin Institute of Health, Germany
  • Peter N. Robinson, The Jackson Laboratory for Genomic Medicine, United States
  • Giorgio Valentini, University of Milan, Italy

Presentation Overview:

The prediction of human gene–abnormal phenotype associations is a fundamental step towards the discovery of novel genes associated with human disorders, especially considering that for several disorders no causative genes are known. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of abnormal phenotypes associated with human diseases. While the problem of the prediction of gene–disease associations has been widely investigated, the related problem of gene–HPO term associations has been largely overlooked. Moreover, most of the methods proposed in the literature are 'hierarchy-unaware', i.e. they are not able to capture the hierarchical relationships between HPO terms, making the predictions inaccurate and biologically contradictory. Here we present highly modular hierarchical ensemble approaches that can be used to enhance the predictions of virtually any flat learning method by taking into account the hierarchical nature of the HPO. Genome-wide experimental results show that our algorithms 1) are able to predict new gene–abnormal phenotype associations; 2) are competitive with other leading state-of-the-art approaches; 3) scale nicely with large datasets and bio-ontologies. An R implementation of the proposed methods is available on CRAN and bioconda along with a step-by-step tutorial to enable easy integration in other research.
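
To make the idea of hierarchy-aware prediction concrete, the sketch below applies a generic top-down correction to flat scores on a small DAG: each term's score is capped at the lowest corrected score among its parents, so no term ends up scoring higher than any of its ancestors. This is a minimal illustration, not the ensemble algorithms presented in the talk. The function name, term labels, and scores are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter


def top_down_correction(flat_scores, parents):
    """Cap each term's score at the lowest corrected score among its parents.

    After the pass, no term scores higher than any of its ancestors, so the
    predictions are consistent with the hierarchy of the ontology.
    """
    corrected = {}
    # static_order() yields every term only after all of its parents
    for term in TopologicalSorter(parents).static_order():
        score = flat_scores[term]
        for parent in parents.get(term, ()):
            score = min(score, corrected[parent])
        corrected[term] = score
    return corrected


# Hypothetical DAG: A is the root, D has two parents (B and C)
parents = {"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}
# Hypothetical scores from a flat, hierarchy-unaware classifier for one gene
flat_scores = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.6}

corrected = top_down_correction(flat_scores, parents)
print(corrected)
```

In this toy case the flat score of D (0.6) exceeds that of its parent B (0.4), which contradicts the hierarchy, so the correction lowers D to 0.4 and leaves the other scores unchanged.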

11:53 AM-12:00 PM
OntoloBridge – A Semi-Automated Ontology Update Request System
Room: Columbus EF
  • Hande Küçük-Mcginty, Collaborative Drug Discovery Inc., United States
  • John Turner, University of Miami, United States
  • John Graybeal, Stanford University, United States
  • Michael Dorf, Stanford University, United States
  • Alex Clark, Collaborative Drug Discovery Inc., United States
  • Daniel Cooper, University of Miami, United States
  • Barry Bunin, Collaborative Drug Discovery Inc., United States
  • Mark Musen, Stanford University, United States
  • Stephan Schürer, University of Miami, United States

Presentation Overview:

Ontologies are becoming increasingly relevant for integration, reuse and interoperability of complex biomedical data. However, to stay relevant, ontologies require constant evolution. The current options to request ontology updates and new terms involve emailing the ontology maintainer and waiting for the next release, or extending the ontology with private terms, both of which are quite unsatisfactory. This lack of a more efficient user-driven ontology update mechanism was pointed out by domain experts who are using CDD’s new tool BioAssay Express (BAE). BAE allows users to annotate their bioassays in a semi-automated and standardized fashion using highly accessed ontologies (BioAssay Ontology (BAO), Gene Ontology (GO), Disease Ontology (DOID), and Drug Target Ontology (DTO), among others) in the background. Our goal in the OntoloBridge project is to help various users of BAE request and update the existing vocabulary provided by BAO in a semi-automated way. Furthermore, APIs and tools including templates from the Center for Expanded Data Annotation and Retrieval (CEDAR) will be created. In this way, we’re aiming to increase the Findability, Accessibility, Interoperability, and Reusability (FAIR) of ontologies by bringing the ontology maintainers and ontology users closer together.

12:00 PM-12:10 PM
Bio-Ontologies: One Minute Short Talks
Room: Columbus EF

Presentation Overview:

One minute short statements of interest to the Bio-Ontologies community

12:10 PM-12:40 PM
Bio-Ontologies: Closing
Room: Columbus EF
12:40 PM-2:00 PM
Lunch Break