View Posters By Category
Session A: July 7 and July 8
Session B: July 9 and July 10
Short Abstract: Schema.org is an initiative by major Web search engines to define a common vocabulary for structuring Web content from a variety of domains, promoting data interoperability and enabling Web content to benefit from sophisticated search services. Schema.org provides specialized attributes for describing biomedical data. Before leveraging this to increase the interoperability of their data, it is valuable for biomedical data publishers to know which of their key data attributes can be captured by Schema.org. There are currently no quantitative evaluations measuring how many existing metadata terms align with Schema.org. We provide such an evaluation here.
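A coverage evaluation of the kind described can be sketched as a simple ratio over an attribute-to-Schema.org mapping table (a minimal illustration only; the attribute and property names below are hypothetical, not drawn from the poster's actual evaluation):

```python
def schema_org_coverage(attributes, mapping):
    """Fraction of a repository's metadata attributes that map to some
    Schema.org property. `mapping` holds attribute -> Schema.org property,
    with None when no suitable property exists."""
    if not attributes:
        return 0.0
    mapped = [a for a in attributes if mapping.get(a) is not None]
    return len(mapped) / len(attributes)

# Hypothetical example: 3 of 4 attributes can be expressed in Schema.org.
attrs = ["title", "organism", "assay_type", "lab_protocol_id"]
mapping = {"title": "name", "organism": "taxonomicRange",
           "assay_type": "measurementTechnique", "lab_protocol_id": None}
coverage = schema_org_coverage(attrs, mapping)
```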
Short Abstract: Named entity recognition (NER) and normalization benefit greatly from comprehensive terminological resources. While many biomedical terminologies exist, each only contains a sample of the synonymous terms possible for each concept. Integrating multiple terminologies increases the number of synonyms available, but each resource differs in its coverage and level of granularity. Direct integration therefore reduces the integrity of the term/concept mappings, which is problematic for normalization. We present a method for automatically integrating terminological resources. Our method uses a generative model to probabilistically quantify the available evidence that a pair of concepts are, or are not, synonyms. The model is trained as a matrix completion task using the information contained within the resources themselves: no additional labeled data is required. We apply our method to disease names from over 80 resources (including MeSH, OMIM, SNOMED-CT, OrphaNet, UMLS, NCI, Disease Ontology, and Monarch Disease Ontology) to create a broad-coverage disease terminology containing over 1.4 million unique terms. Our method identifies a significant number of missing cross-references between existing vocabularies. We also demonstrate that enriching an existing disease lexical resource with our method results in a significant performance improvement for disease NER and normalization.
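The evidence-combination idea can be illustrated with a toy log-odds model (a minimal sketch, not the poster's actual generative model: the per-resource reliabilities below are invented constants, whereas the real method learns its parameters via matrix completion):

```python
import math

# Invented per-resource reliabilities: the probability that a resource's
# presence (or absence) of a cross-reference reflects true (non-)synonymy.
# In the actual method such parameters are learned, not fixed.
RELIABILITY = {"MeSH": 0.95, "OMIM": 0.90, "SNOMED-CT": 0.85}

def synonym_log_odds(evidence, prior=0.01):
    """Accumulate log-odds that two concepts are synonyms. `evidence`
    maps resource name -> True (the resource links the two concepts) or
    False (it covers both concepts but does not link them)."""
    log_odds = math.log(prior / (1 - prior))
    for resource, linked in evidence.items():
        weight = math.log(RELIABILITY[resource] / (1 - RELIABILITY[resource]))
        log_odds += weight if linked else -weight
    return log_odds

def synonym_probability(evidence, prior=0.01):
    """Squash accumulated log-odds back to a probability."""
    return 1 / (1 + math.exp(-synonym_log_odds(evidence, prior)))
```

Under this toy model, agreement across several resources overcomes a low prior, while unanimous absence of a link pushes the probability toward zero.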
Short Abstract: Omics technologies are essential to the ongoing paradigm shift in chemical toxicology from in vivo animal testing to an in vitro, toxicity pathway-oriented approach. While great progress has been made in dissecting molecular mechanisms of action (MOAs) and developing exposure biomarkers, most toxicogenomics studies overlook the phenotypes most relevant to the apical endpoints of interest, i.e. those observed at higher levels of biological organization, and their integration with molecular MOAs. Ontology-based semantic analysis integrates chemical-induced toxicity phenotypes across levels of biological organization, knowledge domains, and species. In this pilot study, we assembled over 700 publications from the EcoTox Database (https://cfpub.epa.gov/ecotox/) covering six vertebrates and ten chemicals. Toxicity responses from individual studies were annotated using Entity-Quality syntax, converted into OWL classes, and organized into 19 chemical-species phenotypic profiles (PPs). A collection of more than 28,000 target PPs was also assembled by genes, KEGG pathways, and diseases from human, mouse, and zebrafish phenotype ontologies. A Java application was developed based on OWLAPI version 4.2.5 and the Semantic Measures Library (SML 0.9.4d). Using a cross-species phenotype ontology, http://purl.obolibrary.org/obo/upheno/vertebrate.owl, we compared the 19 chemical PPs against the targets and each other, and identified their semantically most similar chemicals, genes, KEGG pathways, and disease PPs.
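The profile-vs-profile comparison step can be sketched with a symmetric best-match average, a standard groupwise measure of the kind available in the SML (illustration only; the toy `term_sim` below stands in for an ontology-based term similarity):

```python
def best_match_average(profile_a, profile_b, term_sim):
    """Symmetric best-match average similarity between two phenotype
    profiles: each term is matched to its most similar counterpart in the
    other profile, and the two directed averages are averaged."""
    def directed(src, dst):
        return sum(max(term_sim(s, t) for t in dst) for s in src) / len(src)
    return 0.5 * (directed(profile_a, profile_b) + directed(profile_b, profile_a))

# Toy term similarity: 1.0 for identical terms, 0.0 otherwise. A real
# measure would exploit the ontology graph (e.g. information content).
exact = lambda a, b: 1.0 if a == b else 0.0
sim = best_match_average({"HP:1", "HP:2"}, {"HP:1", "HP:3"}, exact)
```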
Short Abstract: Motivation: Clinical interpretation of genomic variants relies on computational, statistical, and phenotype annotations to review variants. Current phenotype tools are not easy to use when assessing a particular gene and are usually weeks to months behind the literature due to curation. Results: We developed the PyxisMap tool for ranking genes using phenotype information. PyxisMap assists in annotation of phenotypes from plain text and then outputs a full ranking of genes based on phenotype information from both the Human Phenotype Ontology (HPO) and all PubMed abstracts as annotated by PubTator. We show that using the tool to re-rank variants in rare disease cases significantly improves the ranking of clinically reported variants.
Short Abstract: Making digital research objects more findable, accessible, interoperable and reusable (FAIR) is critical for realizing the opportunity for accelerated progress in research through data science. Here we present FAIRShake, a toolkit developed to enable the assessment of compliance of biomedical digital research objects with the FAIR guiding principles. FAIRShake functions as a repository to store and serve FAIR assessments. FAIRness assessments of three types of digital objects (datasets, tools, and repositories/databases) are based on answers to nine questions. To visually communicate FAIRness level, a 3x3 grid of colored squares, called the FAIR insignia, was developed. The FAIR insignia identifies areas of strength and weakness in the FAIRness level of digital objects, guiding digital object producers on how to improve the FAIRness of their products. The FAIRShake toolkit consists of the FAIRShake website, through which assessments are completed and insignias minted; the FAIRShake Google Chrome browser extension; the FAIRShake bookmarklet; and FAIRShake APIs for direct programmatic access to the information within the FAIRShake database. The Chrome extension and bookmarklet provide easy access to display and perform FAIR assessments on any relevant website. The FAIRShake toolkit is a cloud-based application freely available at http://fairshake.cloud.
Short Abstract: Ontologies are becoming increasingly relevant for integration, reuse and interoperability of complex biomedical data. However, to stay relevant, ontologies require constant evolution. The current options to request ontology updates and new terms involve emailing the ontology maintainer and waiting for the next release, or extending the ontology with private terms, both of which are quite unsatisfactory. This lack of a more efficient user-driven ontology update mechanism was pointed out by domain experts who use CDD's new tool BioAssay Express (BAE). BAE allows users to annotate their bioassays in a semi-automated and standardized fashion using highly-accessed ontologies (BioAssay Ontology (BAO), Gene Ontology (GO), Disease Ontology (DOID), and Drug Target Ontology (DTO), among others) in the background. Our goal in the OntoloBridge project is to help various users of BAE request and update the existing vocabulary provided by BAO in a semi-automated way. Furthermore, APIs and tools, including templates from the Center for Expanded Data Annotation and Retrieval (CEDAR), will be created. In this way, we aim to increase the Findability, Accessibility, Interoperability, and Reusability (FAIR) of ontologies by bringing ontology maintainers and ontology users closer together.
Short Abstract: Much information in global health is organized in siloed repositories, and global health datasets are relatively small compared to genomics or proteomics datasets. The data problem in global health could be considered a small data problem on a big scale. In 2013 we released the first version of Project Tycho to disseminate disease surveillance data reported by health agencies in the United States between 1888 and 2014. Over the past 3.5 years, 3500+ users have registered to use Project Tycho and over 40 creative works, including 20 peer-reviewed papers, have been published that used Project Tycho data. Now, we have released Project Tycho 2.0, which aims to represent information for global health in a way more compliant with the FAIR (Findable, Accessible, Interoperable, and Reusable) principles. We re-represented all our US data and information about dengue fever for 99 countries in a standard data format, using standard ontologies and vocabularies where possible. We also created rich metadata in DataCite XML and Data Tag Suite (DATS) JSON format. With Project Tycho 2.0, we aim to improve the integration and machine-interpretability of global health data so that new discoveries can truly be made across all scales in biology, from the molecule to the global population.
Short Abstract: Infectious diseases claim millions of lives each year, especially in developing countries. Resistance of pathogens to drugs is the major reason infection treatments fail. Accurate and rapid identification of causative pathogens plays a key role in the success of treatment, because prescribing the right drug can help to alleviate the drug resistance problem. Therefore, there is an urgent need for a reference resource on pathogen-disease associations that can be utilised to support diagnosis of infectious diseases. A very large portion of pathogen-disease associations is available from the literature in unstructured form, so automated methods are needed to extract the data. Motivated by this, we present the first text mining system designed for extracting pathogen-disease relations from the literature. All data is publicly available from https://github.com/bio-ontology-research-group/Infectious_Diseases.git.
Short Abstract: Analysing the relationships between biomolecules and genetic diseases is a highly active research area. A novel approach to this end is proposed here: mapping abnormality-defining Human Phenotype Ontology (HPO) terms to biomolecular-function-defining GO terms, where each association indicates the occurrence of the abnormality due to the loss of the molecular function expressed by the corresponding GO term. The proposed HPO2GO mappings were extracted by calculating the frequency of co-annotations of the terms on the same genes/proteins, followed by filtering of unreliable mappings via statistical resampling. Furthermore, the biological relevance of the finalized mappings was discussed over selected cases. The resulting HPO2GO mappings can be employed in different settings to predict and to analyse novel gene/protein-disease relations. As an application of the proposed approach, HPO term-protein associations (i.e., HPO2protein) are predicted. To test and compare predictive performance, the CAFA2 challenge HPO prediction target protein set was used. The results showed that HPO2GO outperformed all models from the participating groups by a margin (Fmax = 0.402). The automated cross-ontology mapping approach developed in this work can easily be extended to other ontologies to identify unexplored relation patterns at the systemic level.
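The mapping-extraction step can be sketched as co-annotation counting followed by a permutation-based reliability filter (a minimal illustration with invented identifiers; the actual HPO2GO statistics and resampling scheme may differ):

```python
import random
from collections import Counter
from itertools import product

def co_annotation_counts(gene2hpo, gene2go):
    """Count, for every (HPO term, GO term) pair, the number of genes
    annotated with both terms."""
    counts = Counter()
    for gene in set(gene2hpo) & set(gene2go):
        for pair in product(gene2hpo[gene], gene2go[gene]):
            counts[pair] += 1
    return counts

def empirical_p(pair, observed, gene2hpo, gene2go, n_perm=1000, seed=0):
    """Permutation filter: shuffle GO annotation sets across genes and ask
    how often the co-annotation count of `pair` reaches the observed one.
    A low p-value suggests the mapping is not a co-annotation artifact."""
    rng = random.Random(seed)
    genes = sorted(set(gene2hpo) & set(gene2go))
    go_sets = [gene2go[g] for g in genes]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(go_sets)
        if co_annotation_counts(gene2hpo, dict(zip(genes, go_sets)))[pair] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy annotations (identifiers invented for illustration).
gene2hpo = {"G1": {"HP:1"}, "G2": {"HP:1"}, "G3": {"HP:2"}}
gene2go = {"G1": {"GO:a"}, "G2": {"GO:a"}, "G3": {"GO:b"}}
counts = co_annotation_counts(gene2hpo, gene2go)
```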
Short Abstract: Motivation: Ontologies are widely used across biology and biomedicine for the annotation of databases. Ontology-based annotation often requires literature curation, which is a time-consuming and expensive process given the large volume of literature. Automatic and accurate identification of ontology classes in text plays a key role in making literature curation more efficient. While several methods have been developed for concept recognition in text, they are often specific to particular ontologies or primarily dictionary- or NLP-based, and therefore not able to discover new terms referring to ontology classes. Results: We developed a method for recognizing mentions of ontology classes in text. Our method is based on machine learning: it utilizes word embeddings, combines them with automated reasoning over ontologies to generate training and test data, and uses a neural network classifier to recognize whether a word refers to a particular class. We demonstrate the utility of our approach in the identification of disease concepts from the Human Disease Ontology and show that it generates accurate results (F-score above 90%) and is capable of discovering concepts that are not present in an ontology. The algorithm, corpora and evaluation datasets are available at https://github.com/bio-ontology-research-group/ConceptRecognition_word2vec
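The embedding-based recognition idea can be illustrated in miniature with cosine similarity to a class centroid (a sketch only: the actual method trains a neural network classifier on reasoner-generated data, and the three-dimensional vectors below are invented stand-ins for word2vec embeddings):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

def looks_like_class_mention(word_vec, class_vecs, threshold=0.7):
    """Flag a word as a candidate mention of an ontology class when its
    embedding lies close to the centroid of known class-label embeddings.
    The threshold is arbitrary; a trained classifier replaces this rule."""
    return cosine(word_vec, centroid(class_vecs)) >= threshold

# Invented 3-d "embeddings" for two disease-related words.
disease_vecs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
```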
Short Abstract: Stroke is one of the leading causes of brain injury and death worldwide. Due to the rapid development of high-throughput technologies, significant progress has been made in stroke healthcare and research in reducing the immense impact of stroke on public health. However, significant barriers remain around this data generation, including its analysis, availability, accessibility and usability. In collaboration with SIREN, we aim to alleviate these challenges by enforcing a common ontology and improving experiment/study interpretability, sharing, interoperability, reproducibility and reporting. To this end, we are developing the Minimum Information Required: Guideline for Stroke Research and Clinical Data Reporting, which aims to add value to the data by making it findable, accessible, interoperable and reusable (FAIR). The standard is based on previously published literature; a draft was developed and is currently being reviewed by stroke researchers and clinicians through a review survey. The standard distinguishes between required and optional elements, which are subdivided into participant-, disease-, and study-specific fields. The stroke research and clinical data reporting standard has the potential to benefit both the stroke research community, with ongoing and future stroke studies, and the clinical community, with clinical interoperability and collaboration.
Short Abstract: Biological pathway alignment is necessary to reduce redundancy in pathway data for secondary analysis, but it is difficult to identify semantically similar pathways to align based on entity membership alone. Annotations to the Pathway Ontology (PW) can be used to identify semantically similar pathways. This paper describes a computationally assisted method for annotating pathways to classes in the PW. An ensemble model using lexical features and ontology matching software was used to derive PW annotations for Reactome pathways. Proposed annotations were reviewed by the authors and PW curatorial team for correctness and inclusion into the PW.
Short Abstract: Representing the complexity of genetic etiology and environmental drivers of disease within the ontological structure of the Human Disease Ontology (DO) presents a framework for developing a Differential Diagnosis ontology. Beyond monogenic diseases, clinical diagnosis is challenged by the complexity of etiologies for many genetic diseases. To address the challenges of representing this clinical complexity, the DO project has developed a complex disease model to drive the restructuring of DO knowledge. The DO’s clinical team is assessing a set of complex and environmental diseases to build the knowledgebase to be represented in the DO, through an expanded data representation captured through logical definitions to the Sequence Ontology. This work is enabled through the DO’s integration of ROBOT tools for capturing and integrating the disease to functional and/or structural sequence variant associations. Expanding the DO’s ontological structure and content will inform the development of a Differential Diagnosis DO. The DO clinician team has developed a conceptual complex genetic disease model to identify the key types of genetic diseases (monogenic, chromosomal, epigenetic, methylation, post-translational) to be represented in the DO. This model forms the basis for re-structuring of the DO’s genetic disease branch to represent the clinical complexity of genetic diseases.
Short Abstract: The Encyclopedia of DNA elements (ENCODE) project has produced data from more than 9,000 experiments using a variety of techniques to study the structure, regulation, and transcription profiles of human and mouse genomes. The data from these experiments first pass through the ENCODE Data Coordination Center (DCC) for basic validation and metadata standardization before they are openly available to the community at the ENCODE site (https://www.encodeproject.org/). Additionally, the ENCODE portal hosts data from other projects, such as modENCODE, GGR, and Roadmap. As the volume of data and variation in experimental methods increase, the organization of experimental details becomes more vital in order to provide a useful and effective access point for our users. Ontologies are used by the ENCODE DCC to annotate metadata as one way of standardizing metadata across projects and labs, providing improved searching capabilities, and making connections across experiments. Furthermore, the ENCODE DCC is in frequent contact with various ontologies to assist in new term generation and annotation of existing terms. The collection, careful curation, and organization of vast genomic datasets, guided in part by ontologies, allows for comparison across different projects and maximizes accessibility to epigenomic data and analysis.
Short Abstract: The generation of biomedical research data on the African continent is growing, with studies realizing the importance of African genetic diversity in discoveries of human origins and disease susceptibility. The benefits of such studies can only come to fruition if African researchers are fully involved at all levels. Such studies are producing rich large-scale datasets, which require careful curation, secure storage and governance. There is also a great willingness amongst African researchers to collaborate and promote data discovery whilst maintaining ownership of data. The development of an African Genome Archive will provide infrastructure to support the effective use of data resources to further sustain the growth of Bioinformatics and Genomic research on the continent, furthering collaboration between academic institutions, science councils and industry.
Short Abstract: The prediction of human gene–abnormal phenotype associations is a fundamental step towards the discovery of novel genes associated with human disorders, especially considering that for several disorders no causative genes are known. In this context the Human Phenotype Ontology (HPO) provides a standard categorization of abnormal phenotypes associated with human diseases. While the problem of the prediction of gene–disease associations has been widely investigated, the related problem of gene–HPO term associations has been largely overlooked. Moreover, most of the methods proposed in the literature are 'hierarchy-unaware', i.e. not able to capture the hierarchical relationships between HPO terms, making the predictions inaccurate and biologically contradictory. Here we present highly modular hierarchical ensemble approaches that can be used to enhance the prediction of virtually any flat learning method, by taking into account the hierarchical nature of the HPO. Genome-wide experimental results show that our algorithms 1) are able to predict new gene–abnormal phenotype associations; 2) are competitive with other leading state-of-the-art approaches; 3) scale nicely with large datasets and bio-ontologies. An R implementation of the proposed methods is available on CRAN and bioconda, along with a step-by-step tutorial to enable easy integration into other research.
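One simple hierarchy-aware correction (a sketch of the general idea, not the poster's specific ensemble algorithms): propagate flat scores bottom-up so that every ancestor scores at least as high as each of its descendants, restoring consistency with the ontology's true-path rule:

```python
from collections import defaultdict, deque

def topo_bottom_up(parents):
    """Order ontology terms so that every child precedes its parents.
    `parents` maps term -> list of parent terms (a DAG)."""
    children = defaultdict(list)
    nodes = set(parents)
    for child, ps in parents.items():
        for p in ps:
            nodes.add(p)
            children[p].append(child)
    indeg = {node: len(children[node]) for node in nodes}  # unprocessed children
    queue = deque(n for n in nodes if indeg[n] == 0)       # leaves first
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for p in parents.get(n, []):
            indeg[p] -= 1
            if indeg[p] == 0:
                queue.append(p)
    return order

def hierarchical_correction(scores, parents):
    """Make flat per-term scores hierarchy-consistent in one bottom-up
    pass: each ancestor's corrected score is the maximum over itself and
    all of its descendants."""
    corrected = dict(scores)
    for term in topo_bottom_up(parents):
        for p in parents.get(term, []):
            corrected[p] = max(corrected.get(p, 0.0), corrected.get(term, 0.0))
    return corrected
```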
Short Abstract: Phenotypic diversity analyses are the basis for research discoveries that span the spectrum from basic biology (e.g., gene function and pathway membership) to applied research (e.g., plant breeding). In cases where equivalent phenotypes across individuals or groups are not anatomically similar, high-throughput, computational classification is possible if the traits and phenotypes are documented using standardized, language-based descriptions. In the case of text phenotype data, conversion to computer-readable “EQ” statements enables such large-scale analyses. EQ statements are composed of entities (e.g., leaf) and qualities (e.g., length) drawn from terms in ontologies. We present a method for automatically converting free-text descriptions of plant phenotypes to EQ statements using a machine learning approach. Random forest classifiers identify potential matches between phenotype descriptions and terms from a set of ontologies including GO, PO, and PATO. Features used include semantic, syntactic, and context similarity measures. The classifiers are trained and tested using a dataset of text descriptions and EQ statements from the Plant PhenomeNET project (Oellrich, Walls et al., 2015). The most likely candidate terms are used to compose EQ statements with confidence scores. Results of evaluating the accuracy of this approach are presented, and the potential use of this approach to enable automated phenolog discovery is discussed.
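The candidate-matching features for such a classifier can be sketched with simple lexical overlap measures (illustration only: the real classifiers combine semantic, syntactic, and context similarity features, none of which are reproduced here, and the example phrase and term are invented):

```python
def jaccard(a, b):
    """Jaccard overlap between two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def candidate_features(phrase, term_label, term_synonyms):
    """Toy feature vector for a (text phrase, ontology term) candidate
    pair: exact label match, label token overlap, and best synonym token
    overlap. A random forest would be trained on vectors like these."""
    tokens = phrase.lower().split()
    exact = float(phrase.lower() == term_label.lower())
    label_overlap = jaccard(tokens, term_label.lower().split())
    syn_overlap = max((jaccard(tokens, s.lower().split()) for s in term_synonyms),
                      default=0.0)
    return [exact, label_overlap, syn_overlap]
```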