Short Abstract: Mouse Genome Informatics (MGI, http://www.informatics.jax.org) is the international database resource for the laboratory mouse, providing integrated genetic, genomic, and biological data to facilitate the study of human health and disease. MGI uses the Gene Ontology (GO, http://www.geneontology.org) for functional annotation of mouse genes. However, a single eukaryotic gene can encode multiple protein isoforms through the use of alternate promoters or polyadenylation sites, alternative splicing of the primary transcript, and/or selection of alternative start sites during translation of an mRNA. Proteins can be further subjected to post-translational processing events, and the function or cellular location of the resulting proteoforms may differ considerably. To provide the most accurate level of annotation, MGI curators make literature-based manual GO annotations using proteoform IDs provided by the Protein Ontology (PRO). PRO (http://proconsortium.org) is an ontological resource that supplies unique identifiers for specific proteoforms. These forms are organized in an ontological framework that explicitly describes how the entities relate, both in the context of taxon-specific genome localization and comparatively across taxa. The ontology currently has over 68,600 isoforms and 6,440 modified proteoforms, which are either imported from high-quality sources or added via literature-based annotation along with attributes such as provenance. Among them are 17,811 mouse isoforms and 589 manually curated mouse-specific PTM forms. The GO annotations to proteoforms are grouped according to the encoding gene and can be queried and viewed at MGI, as well as in AmiGO (http://amigo.geneontology.org/amigo). Supported by NIH Grants HG000330, HG002273, and GM080646.
Short Abstract: Motivation: The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component of many NLP tasks, including text retrieval and summarisation. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. Methods: We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression-based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature was manually annotated by five human experts and used for evaluating the proposed methods. Results: The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold-standard human annotations), improving over state-of-the-art domain-independent systems by up to 42.6% in terms of the Pearson correlation metric. Availability: A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/ Contact: email@example.com, firstname.lastname@example.org
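The combination idea can be sketched in a few lines of Python; the bag-of-words vectors and fixed weights below are simplifications standing in for the learned sentence embeddings and the trained regression model described in the abstract.

```python
import math
from collections import Counter

def bow_vector(sentence):
    """Bag-of-words term-frequency vector (a simple stand-in for the
    distributed sentence representations described above)."""
    return Counter(sentence.lower().split())

def cosine_similarity(v1, v2):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(v1[t] * v2[t] for t in set(v1) & set(v2))
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def jaccard_similarity(s1, s2):
    """Word-set overlap, an example of a string-similarity measure."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / len(w1 | w2) if w1 | w2 else 0.0

def combined_similarity(s1, s2, weights=(0.5, 0.5)):
    """Fixed-weight combination of two metrics; the actual system learns
    the combination with supervised regression on annotated pairs."""
    return (weights[0] * cosine_similarity(bow_vector(s1), bow_vector(s2))
            + weights[1] * jaccard_similarity(s1, s2))

score = combined_similarity("BRCA1 mutations increase cancer risk",
                            "mutations in BRCA1 raise cancer risk")
print(round(score, 3))
```

Identical sentences score 1.0 and disjoint sentences 0.0 under this scheme, so the combined score stays interpretable as a similarity in [0, 1].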
Short Abstract: The Experimental Factor Ontology (EFO) is an open-source ontology established in 2009 to aid the annotation of data in the EBI Expression Atlas. EFO has grown from ~2,600 classes to 19,799 classes (as of March 15th 2017). The growing number of classes in EFO reflects an expanding user base across multiple databases and serves a wider set of use cases. EFO now covers various life science domains, both animal and plant, in many projects, e.g. Open Targets, the GWAS Catalog, EXCELERATE, and Genomics England. EFO also facilitates knowledge sharing in multi-organisation collaborative work such as the ENCODE Consortium, NASA GeneLab, and GSK. The growing user base has necessitated the development of a new operational pipeline for EFO that presents a more generalised approach to application ontology building.
Short Abstract: Websites are commonly used to expose data to end-users, making it possible for them to search, filter, and download data. Such capabilities help users easily find, organize, and obtain data relevant to their interests. With the continuous growth of data in the Life Sciences domain, it becomes difficult for users to find all the information required for their research on a single website. A lightweight semantic approach would help those users looking for basic information, as both formats and models would be reduced and simplified. BioSchemas is a community initiative built on top of schema.org that aims to improve data discoverability and interoperability in the Life Sciences. Here we present the advances of the BioSchemas Protein Working Group, in particular the identified use cases as well as a proposed schema for proteins, protein annotations and protein structures.
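As an illustration of the schema.org approach, a protein description might be embedded in a web page as JSON-LD along the following lines; the `Protein` type and the exact property names here are illustrative, not the finalised BioSchemas profile.

```python
import json

# Hypothetical schema.org-style markup for a single protein entry.
# "Protein" as a type and the property choices are illustrative; the
# finalised BioSchemas profile may differ.
protein_markup = {
    "@context": "http://schema.org",
    "@type": "Protein",
    "name": "Hemoglobin subunit beta",
    "identifier": "P68871",  # UniProt accession, used here as an example
    "sameAs": "http://www.uniprot.org/uniprot/P68871",
    "description": "Oxygen-transport protein of red blood cells.",
}

# Serialise to JSON-LD, ready to embed in a <script> tag on a web page
print(json.dumps(protein_markup, indent=2))
```

Because the markup is plain schema.org-style JSON-LD, generic crawlers can index it without any Life Sciences-specific tooling, which is the discoverability argument made above.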
Short Abstract: Introduction Over the last few years, DMET (Distribution, Metabolism, Excretion and Toxicity) technology has steadily advanced, leading to increased demand for new bioinformatics software, analysis tools, algorithms, web applications and specific statistical techniques for its application. With the advent of personalized medicine, it is evident that a large number of pilot studies will be conducted globally in the foreseeable future [1, 2]. Findings from different studies can be compared effectively if data reporting standards are in place. Furthermore, if data are extracted in a concise and correct manner, they can be used in subsequent experiments. We feel that quality of data supersedes quantity, and that a concise method of data extraction from DMET arrays can greatly increase the experimenter's ability to separate biologically meaningful information from background noise and experimental error. Ultimately, the field of pharmacogenomics will evolve to integrate data from various omics and technology platforms. Implementing standards for the submission of data for each of the various platforms (microarrays, SNP arrays, proteomics and DMET) will aid the development of pipelines able to consolidate the different platforms and the standards by which each abides. For effective DMET data interpretation, sharing, interoperability, reproducibility and reporting, we propose the Minimum Information required for a DMET Experiment (MIDE): pharmacogenomics reporting guidelines together with the information and tools required for reporting to public omics databases. Methods The common elements proposed for MIDE were derived by consolidating other Minimum Information guidelines and standards published in MIBBI to determine which elements and, where appropriate, which ontologies to use. However, suitable precedents were unavailable for most elements, and some elements had multiple fields or inconsistent ontology usage.
This was addressed by consulting with researchers from the Pharmacogenomics for Every Nation Initiative (PGENI), the technology platform provider (Affymetrix) and DMET service laboratories to consolidate which information is essential and which is optional to capture, and how to phrase it so that validation, interoperability and reproducibility of DMET experiments are improved and seamless. Results We have introduced the minimum information required for a DMET experiment guideline, MIDE. The reporting template is also available in XML format to simplify data capture and data governance. The quintessential information for DMET data pertains to the experiment itself, including, but not limited to, maintaining good laboratory practice, experimental design, extract/sample preparation, labeling, hybridization procedures, measurement specifications, QC, and data extraction information, which, if reported in a concise manner in one place, will increase their accessibility and interpretability. Since the Affymetrix DMET kit provides a rigorous protocol, any deviation from the protocol must be captured and noted in any publication as well as when submitting the data to specific databases. Standardizing how this information is captured and published in a comparable and consistent manner is crucial for researchers to understand the experiment and subsequently interpret the data generated. The adoption of common data reporting standards for DMET-generated data will allow the aggregation of data generated from different platforms as well as the integration of secondary data (e.g. qPCR data, clinical and epidemiological data, drug interaction data, etc.) in tertiary analysis platforms. To meet the need for a standardized manner of reporting DMET-generated data, the MIDE guideline includes a checklist of information to capture, with formats and appropriate ontologies.
This information is categorized into the following: desirable or optional information, which should be submitted if available, and essential or required information, which should always be submitted with the article. The full MIDE XML 1.0 schema is accessible from the MIBBI project page (http://bioweb.cpgr.org.za/mide/mide.xml). Conclusions Currently, no data reporting standard for DMET or drug toxicology arrays exists. We have developed a DMET raw and metadata reporting guideline informed by data-sharing practice and MIBBI foundry requirements and principles. The MIDE guidelines will facilitate the sharing, dissemination and re-analysis of datasets through accessible and transparent DMET experiment reporting, benefiting the scientific community in ongoing and future DMET experiments with tools that ease and automate the generation of such reports using the standardized MIDE XML schema. A MIDE API is also under development, which will facilitate the validation and generation of MIDE reports by providing the user with an easy-to-use and powerful interface to store, retrieve and export MIDE documents in different formats. This research stands to benefit clinical trial procedures as well as drug development and dosage guidelines. References 1. Oetjens MT, Denny JC, Ritchie MD et al. Assessment of a pharmacogenomic marker panel in a polypharmacy population identified from electronic medical records. Pharmacogenomics 14(7), 735–744 (2013). 2. Yang L, Price ET, Chang CW et al. Gene expression variability in human hepatic drug metabolizing enzymes and transporters. PLoS ONE 8(4), e60368 (2013). 3. Botha G, Kumuthini J. Minimum Information About a Peptide Array Experiment (MIAPepAE). EMBnet J. 18(1), 14 (2012). 4. Laurence J. Getting personal: the promises and pitfalls of personalized medicine. Transl. Res. 154(6), 269–271 (2009).
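A MIDE-style record might be assembled as follows; the element names in this sketch are invented for illustration and are not taken from the actual MIDE XML 1.0 schema.

```python
import xml.etree.ElementTree as ET

# Sketch of a MIDE-style report fragment. All element names here are
# invented for illustration; the authoritative structure is defined by
# the MIDE XML 1.0 schema.
report = ET.Element("mideReport")
experiment = ET.SubElement(report, "experiment")
ET.SubElement(experiment, "platform").text = "Affymetrix DMET Plus"
ET.SubElement(experiment, "protocolDeviation").text = "none"
sample = ET.SubElement(report, "sample")
ET.SubElement(sample, "extractionMethod").text = "column-based DNA extraction"

# Serialise the report for submission or archiving
print(ET.tostring(report, encoding="unicode"))
```

Keeping the report machine-readable in this way is what makes the proposed validation and format-export API feasible.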
Short Abstract: Ontologies represent core relationship models of concepts that span numerous life science domains. Given the breadth and depth of new knowledge, it is difficult for human curation to keep pace with all data sets, creating a need to assess the validity of ontology concepts in light of biological data rather than literature alone. Here we leverage GeneWeaver's bipartite graph model, homology integration, and curated experimental data to evaluate ontology terms and hone classification based on underlying biological data. We report that empirical functional genomic data can be used to effectively discriminate nodes in an ontology hierarchy by examining how tightly coupled ontology terms are to functional data. Using GO slims, we are able to highlight terms with both high and low representation overlap with empirical data, and illustrate the effectiveness of cross-species homology and data heterogeneity in establishing the veracity of ontology concepts.
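The coupling between an ontology term and empirical data can be quantified with a simple set-overlap statistic; the gene sets below are hypothetical, and the Jaccard index is just one possible measure, not necessarily the one GeneWeaver uses.

```python
def jaccard(set_a, set_b):
    """Overlap between two gene sets: 0 = disjoint, 1 = identical."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Hypothetical mouse gene sets: genes annotated to a GO term versus
# genes highlighted by an experiment
go_term_genes = {"Trp53", "Mdm2", "Cdkn1a", "Bax"}
experiment_genes = {"Trp53", "Mdm2", "Gapdh", "Actb"}

coupling = jaccard(go_term_genes, experiment_genes)
print(f"term/data coupling: {coupling:.2f}")  # 2 shared genes of 6 in the union
```

Terms whose annotated gene sets score consistently low against experimental sets would be the "low representation overlap" candidates flagged in the abstract.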
Short Abstract: Motivation: Text mining has become an important tool for biomedical research. The most fundamental text mining task is biomedical named entity recognition (NER), e.g. of genes, chemicals, and diseases. Current NER methods rely on predefined features which try to capture the specific surface properties of entity types, properties of the typical local context, background knowledge, and linguistic information. State-of-the-art tools are entity-specific, as dictionaries and empirically optimal feature sets differ between entity types, which makes their development costly. Furthermore, features are often optimized for a specific gold-standard corpus, which makes extrapolation of quality measures difficult. Results: We show that a completely generic method based on deep learning and statistical word embeddings (called LSTM-CRF) outperforms state-of-the-art entity-specific NER tools, often by a large margin. To this end, we compared the performance of LSTM-CRF on 33 data sets covering five different entity classes with that of best-of-class NER tools and an entity-agnostic CRF implementation. On average, the F1-score of LSTM-CRF is 5% above that of the baselines, mostly due to a sharp increase in recall.
Short Abstract: The Brassica Information Portal (BIP) provides an open-access web repository for Brassica phenotyping data (https://bip.earlham.ac.uk). It aims to fill the gap between physical genome information and quantitative phenotypic information. Through BIP, the Earlham Institute represents the UK in ELIXIR EXCELERATE WP7, working alongside six other ELIXIR nodes. Together, we are establishing a common infrastructure and an open model for the publication and sharing of plant genotype-phenotype data, addressed on several fronts. For example, the controlled vocabulary Minimum Information About Plant Phenotyping Experiments (MIAPPE) has been updated and agreed among all participants. As a WP7 collaborator, we provide Brassica trial example data in this new format. Ultimately, we will create an updated version of MIAPPE that allows dataset compatibility across all species and ELIXIR nodes. Further, in order to make data easily accessible and interoperable, all partners are also defining a new common API, initially based on BrAPI (Breeding API) v1. BIP and all other nodes are implementing a subset of BrAPI v1 and working on its next version, enabling its alignment to MIAPPE, which will be fully adopted by all WP7 members. Phenotypic traits measured in Brassica field trials are very heterogeneous in terms of data types, units and level of detail of experimental protocols. The development of a Brassica-specific ontology, BRATO, was necessary to overcome these issues. Using the Crop Ontology model, we collaborate with INRA and Southern Cross University to describe traits with this standard. Harmonised Brassica trait data will enable comparisons valuable to scientists and breeders alike.
Short Abstract: There is a myriad of information hidden in scientific media in the form of texts, graphics, audio, video, datasets and various kinds of software, but it is often not discoverable due to the absence of structured, semantic annotations. iCLiKVAL is a web-based application (http://iclikval.riken.jp) that crowdsources annotations for scientific media found online from a growing community. The application is built upon a freely accessible and secure API that saves semantic as well as free-form but structured annotations as “key-value” pairs with optional “relationships” between them. The philosophy behind iCLiKVAL is to identify each online media item by a unique URI (Uniform Resource Identifier) and attach highly structured semantic annotations to the concepts related to the media, identifying and marking occurrences of ontological entities and relationships. These annotations are stored in human-readable as well as machine-readable formats, which helps computers easily index and interpret the information and allows for much more sophisticated data searches and knowledge discovery by linking with other heterogeneous data sources. We hope this application simplifies the process for users to securely and conveniently submit their valuable annotations to iCLiKVAL and facilitates knowledge extraction for the scientific community.
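The key-value model can be sketched as follows; the field names and the example URI are illustrative, not the actual iCLiKVAL API.

```python
# Minimal sketch of the iCLiKVAL annotation model: each media item is
# identified by a URI and carries key-value annotations with optional
# relationships. Field names and the URI are illustrative only.
annotations = {}

def annotate(uri, key, value, relationship=None):
    """Attach one key-value pair (plus an optional relationship
    qualifier) to the media item identified by `uri`."""
    annotations.setdefault(uri, []).append(
        {"key": key, "value": value, "relationship": relationship}
    )

annotate("https://doi.org/10.1000/example", "gene", "BRCA1",
         relationship="mentions")
annotate("https://doi.org/10.1000/example", "organism", "Homo sapiens")

print(annotations["https://doi.org/10.1000/example"])
```

Because every annotation hangs off a resolvable URI, heterogeneous media (papers, videos, datasets) can all be indexed and cross-linked through the same structure.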
Short Abstract: Medical language processing, when used to aggregate knowledge from the biomedical literature, has the potential to revolutionize how researchers understand real-world biological phenomena. Few language processing approaches have so far focused on the valuable observations within medical case reports. Unlike formal research, case reports describe real-world medical events, complete with demographics, diagnoses, and resulting treatments. The more than 1.8 million case reports currently available in PubMed therefore offer a rich source of biomedical observation available in few other sources. Thus far, much of the challenge in medical language processing has been how to extract novel, biologically meaningful conclusions from the biomedical literature. There is an additional challenge: newly developed language processing methods must be broadly applicable, as crucial medical information takes numerous structured and unstructured forms, and results must be integrated with other knowledge sources. Furthermore, the results of such processing must be readily available to clinicians and other non-researchers. We are currently developing HeartCases, a system for processing medical case reports involving cardiovascular disease. As the leading cause of death worldwide and a distinctly multifaceted issue, cardiovascular disease serves as an ideal topic for broad case report analysis. HeartCases leverages several existing ontologies along with a combination of supervised and unsupervised machine learning approaches to identify the most crucial features and metadata within case reports. The methods developed within HeartCases will be directly adaptable to other sources of textual medical data, including electronic health records. Its results also reveal trends and novel relationships among the details of cardiovascular disease patients.
Short Abstract: Acclimation, an organism's response to changes in environmental conditions, represents a complex dynamic interplay between genes, proteins and metabolites, and is nowadays typically recorded using ‘omics techniques. A common approach to understanding high-throughput biological data is to evaluate the functional properties of experimentally derived feature sets by calculating functional enrichment via statistical analysis. However, the consideration of functional groups may easily lead to the loss of important information and is in practice often manually compensated by the researcher's intuitive focus on individual molecules. To automate a robust and interpretable representation of the functional and temporal relations between biologically relevant features, we developed an algorithm that finds the “sweet spot” between a description that is too general, based solely on functional information, and a description based only on single-molecule information, which is too noisy and complex. In our approach, we use the functional information from static ontologies like GO or MapMan and learn a new structure from the data. This dynamic ontology approach aggregates single components into functional groups only if there is no loss of information, measured by the balance between similarity and entropy. We applied our dynamic ontology method to heat acclimation in the model plant Chlamydomonas reinhardtii and showed a successful reduction in complexity while preserving all individual molecular information reported in the literature. Additionally, we are able to suggest new modulators of acclimation processes that appear biologically sound.
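The aggregation criterion can be illustrated with a toy entropy check; the gene names, discretised profiles and threshold below are invented for illustration and greatly simplify the actual similarity/entropy balance.

```python
import math
from collections import Counter

def shannon_entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels; low
    entropy means the group's members behave homogeneously."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def aggregate(members, profiles, max_entropy=0.5):
    """Collapse members into one functional group only if the entropy of
    their discretised profiles stays below the threshold; otherwise keep
    them as single molecules."""
    if shannon_entropy([profiles[m] for m in members]) <= max_entropy:
        return [frozenset(members)]
    return [frozenset([m]) for m in members]

# Hypothetical discretised heat-response profiles
profiles = {"HSP70A": "up", "HSP90A": "up", "RBCS1": "down"}

print(aggregate(["HSP70A", "HSP90A"], profiles))  # homogeneous: merged
print(aggregate(["HSP70A", "RBCS1"], profiles))   # mixed: kept separate
```

The threshold plays the role of the information-loss test: a heterogeneous group would discard the distinction between up- and down-regulated molecules, so it is left unmerged.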
Short Abstract: We have developed a protocol to manually curate indications and usage context for approved drugs from product labels in DailyMed (dailymed.nlm.nih.gov). This meets an unmet need, because resources that list drug indications suffer from inconsistencies in the quality and accuracy of such information, are not computation-ready, and do not use established bio-ontology terms. This adversely affects downstream uses of the information, e.g. for drug repositioning and clinical decision support. We view our curation effort as a helpful first step to (1) enrich existing drug resources with more detailed, higher-quality metadata about drug indications and (2) evaluate the accuracy of existing metadata in these resources. With respect to (2), we have begun an investigation into what our curations reveal about the accuracy of indications in DrugCentral (drugcentral.org). We chose to study DrugCentral because it distills most of the DailyMed product information into a computation-ready format. As a preliminary step, we focus on antineoplastic and cardiovascular agents because of the heightened societal interest in heart disease and cancer. We have curated indications for 18% of the cardiovascular drugs and 15% of the antineoplastic drugs listed in DrugCentral. For each drug we identified indications for which it is directly prescribed and those in which it is indirectly mentioned (e.g. as adjuvant therapy). The latter indications are considered imprecise, and we found DrugCentral to exhibit imprecise indications for at least 36% of the drugs in our completed set.
Short Abstract: There is a long history of computational methods for finding candidate genes involved in diseases. These methods use heterogeneous types of information, including information in biological databases and the biomedical literature. Biological data and knowledge bases increasingly rely on Semantic Web technologies and the use of knowledge graphs for data integration, retrieval and federated queries, but these have not been widely exploited for machine learning tasks. We developed a novel method for feature learning that combines information in biological knowledge graphs and the biomedical literature. In our approach, we first normalize literature mentions of genes, proteins, and diseases to a biological knowledge graph. Our method then generates "embeddings" of biological entities that encode information within knowledge graphs as well as information about those entities in the literature. The embeddings represent genes, proteins, and diseases within a low-dimensional vector space. We apply our method to the prediction of gene-disease associations through phenotypic similarity. For this purpose, we integrate information from human and mouse genotype-phenotype databases. Our multimodal approach can recover known gene-disease associations in the Online Mendelian Inheritance in Man (OMIM) database with a ROC AUC of up to 0.94, and outperforms current baselines for predicting gene-disease associations based on semantic similarities over phenotype ontologies. Additionally, our method does not rely on prior knowledge about gene-disease associations, which is inherent in approaches based on the guilt-by-association principle. Finally, our method is generic and can be applied to infer semantic similarity between biological entities in several biological applications.
Short Abstract: Stratified medicine projects are generating molecular signatures that may be associated with disease subtypes. However, these signatures by themselves do not usually provide mechanistic insight and need to be explored and interpreted in the context of other available data. Graph databases provide a convenient framework for storing and querying large heterogeneous biological datasets. Such data are typically highly connected and semi-structured, lacking clear semantics. New types of data are emerging from systems medicine experiments within relatively short time scales, and these need to be integrated to further aid the contextualisation of disease-associated genes. In graph databases, data are represented as a network of nodes (biological concepts) and edges (biological relationships), both of which can have properties. Visualising the networks allows manual exploration useful for hypothesis generation. Additionally, graph traversal algorithms can be used to identify associations between concepts. The flexibility of graph-based data integration frameworks allows new data types to be easily accommodated. Here we review some aspects of graph-based data integration. Using the popular graph database Neo4j and its query language, Cypher, we present some examples related to contextualising disease-associated genes. Python and R interfaces are available for Neo4j, which facilitates the inclusion of popular external tools for graph topological analysis and for enhanced visualisation. We show that this framework provides a flexible and convenient way to integrate, explore and contextualise data emerging in systems medicine.
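The property-graph model can be mimicked in a few lines of plain Python; the node labels, relationship type and the Cypher pattern in the comment are illustrative examples, not taken from a specific dataset.

```python
# Toy property graph: nodes and edges both carry properties, mirroring
# the Neo4j model. Gene/disease names and the relationship type are
# illustrative examples only.
nodes = {
    "BRCA1":  {"label": "Gene"},
    "TP53":   {"label": "Gene"},
    "cancer": {"label": "Disease"},
}
edges = [
    ("BRCA1", "ASSOCIATED_WITH", "cancer", {"source": "GWAS"}),
    ("TP53",  "ASSOCIATED_WITH", "cancer", {"source": "literature"}),
]

def genes_for(disease):
    """Hand-rolled equivalent of a Cypher pattern such as
    MATCH (g:Gene)-[:ASSOCIATED_WITH]->(d:Disease {name: $disease}) RETURN g."""
    return sorted(src for src, rel, dst, _props in edges
                  if rel == "ASSOCIATED_WITH" and dst == disease)

print(genes_for("cancer"))  # ['BRCA1', 'TP53']
```

Edge properties such as the evidence source are what make graph traversals useful for contextualisation: a query can filter associations by provenance as easily as by relationship type.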
Short Abstract: There are over 500 ontologies registered in the NCBO BioPortal repository, representing a valuable resource of structured data models for use in biological dataset annotation and analysis. One of the main impediments to using new ontologies is the relatively small number that have been annotated from relevant corpora with sufficient coverage and accuracy. The growing demand for integrative and customised ontologies necessitates the development of generic annotation and analysis tools that are both accurate and scalable. We present OntoSuite, a collection of computational tools for ontology-neutral annotation and analysis. Our system performs text mining with multiple natural language processing (NLP) tools to annotate biomedical ontologies with improved sensitivity, specificity and coverage. Within OntoSuite we include an open-source software package, topOnto, that supports term-enrichment analysis across multiple ontologies with a selection of statistical/topological algorithms, allowing comparison across ontologies and algorithms. We evaluate our system by using OntoSuite to mine the OMIM, GeneRIF and Ensembl variation corpora to create a comprehensive human disease ontology (HDO)-to-gene annotation database. Using this novel annotation set and topOnto term-enrichment analysis, we profile 277 gene classes for human disease, generating ‘disease environments’ for 1310 human diseases. These disease environments provide an overview of disease knowledge and new insight into potential disease mechanisms. The integration of multiple ontologies into a disease context demonstrates how ‘orthogonal’ ontologies can lead to biological insight that would be missed by traditional single-ontology analysis.
Short Abstract: Much of the world’s primary scientific knowledge is buried in the prose, figures and reference sections of scientific articles, virtually inaccessible to the computational machinery that is now needed to search, extricate and assimilate their sequestered information. Data-sets are trapped in tables and illustrations; factual assertions are shrouded in complex narratives; and relationships between articles are entombed in convoluted reference/citation formats. Extracting and liberating this knowledge automatically is technologically challenging, and made harder still by the restrictions on access and bulk-processing imposed by commercial publishers. The Lazarus project aims to resurrect data and insights buried in the literature by creating a massively distributed crowdsourcing platform based on the Utopia Documents PDF reader, which text-mines the content and reconstructs the semantic structure of papers on the fly. Scientists routinely perform ‘micro-tasks’ when they read, annotate, extract data from and organise collections of papers to facilitate their work. In the process, they apply intuition and specialist knowledge that far surpasses that of any current computational technique. Locked in their personal unstructured notes, spreadsheets and bespoke PDF collections, however, the results of their labours are currently lost to the broader community. Lazarus provides a new tool that makes these micro-tasks easier, by automatically capturing knowledge from scientists’ browsing behaviours, highlighting it in papers that others read and making it available for further processing by the community via an API.
Short Abstract: It is widely accepted that ontologies will solve many problems in biomedical computing. However, although the amazing contributions that ontologies have made to some problems have shown early nay-sayers (Brenner 2002) to have been misguided (Hunter 2002), their impact on many other problems, including natural language processing, has been more limited. We show that this lack of impact can be related to a coverage issue. The Relation Ontology was mapped to two sets of relations used in biomedical natural language processing and in semantic representation. Only 16% of the relations in those two resources had equivalents in the Relation Ontology, suggesting a possible reason for the lack of impact of the Relation Ontology in language processing.