Conference on Semantics in Healthcare and Life Sciences (CSHALS)

Poster Presentations

Updated April 03, 2012

POSTER DETAILS:

Posters will be on display throughout the conference, beginning at 5 p.m. Wednesday, February 22, with a special reception for poster authors to present their work to conference delegates.

When preparing and designing your poster please note that it should be no larger than 44 inches wide by 44 inches high (there are two posters per side).

Posters must be removed between 1:30 p.m. and 3:00 p.m. on Friday, February 24.


Poster 1
SDlink: An Integrated System for Linking Biological and Biomedical Semantic Data

Presenter: Alexandre Francisco
INESC-ID / IST, Technical University of Lisbon, Portugal

Additional Authors:
Pedro Reis, INESC-ID / IST, Technical University of Lisbon, Portugal
Dário Abdulrehman, INESC-ID / IST, Technical University of Lisbon, Portugal
Cátia Vaz, INESC-ID / ISEL, Polytechnic Institute of Lisbon, Portugal
Mauro Santos, INESC-ID / IST, Technical University of Lisbon, Portugal
Ana Freitas, INESC-ID / IST, Technical University of Lisbon, Portugal

Nowadays, with the decreasing cost and increasing availability of high-throughput technologies, an enormous amount of biological and biomedical data is becoming available. Such data is usually represented and stored in different formats and platforms, often offline and not standardized. The automatic integration of data from different databases faces several obstacles, the most notable being the lack of interfaces for automatic querying and for running integration and analysis tools. To address some of these issues, semantic technologies have been proposed and used with great success. In this work we propose an integrated environment for querying, retrieving and analyzing linked data, suitable for users unfamiliar with such technologies, solving an issue that has hindered a broader adoption of semantic methodologies in biology and biomedicine.

Method: The new sdlink system (http://kdbio.inesc-id.pt/sdlink) assumes that data is annotated following a given ontology, and provides data views, including graphical representations, and a friendly querying interface. The querying interface was designed for users unfamiliar with semantic technologies: one can, for instance, define a query through a simple point-and-click interface, which is then translated to SPARQL. The sdlink system uses Virtuoso OSE as the underlying triplestore. To address user concerns about security and privacy, the system supports user/project access control, based on OpenID for authentication and FOAF+WAC for authorization. The system is being used by two FP7 European projects, with good results both in scalability and in usability by non-expert users. We have also made a public project available for evaluation purposes (http://kdbio.inesc-id.pt/sdlink/lubm/).
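
As a rough illustration of the kind of translation such an interface performs, the sketch below issues a SPARQL query, of the sort a point-and-click selection might generate, against a SPARQL endpoint. The endpoint URL, graph vocabulary and terms are hypothetical placeholders, not part of sdlink.

    # Minimal sketch, assuming a SPARQL endpoint and an example vocabulary;
    # the URL and terms below are illustrative placeholders, not sdlink's own.
    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
    endpoint.setQuery("""
        PREFIX ex: <http://example.org/ontology#>
        SELECT ?gene ?experiment
        WHERE {
            ?gene a ex:Gene .
            ?experiment ex:measures ?gene .
        }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["gene"]["value"], row["experiment"]["value"])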

Results: Our results were twofold. First, through the development and deployment of sdlink, we were able to use semantic technologies and linked data on two large projects where most people were unaware of these technologies or of any reason to use them. The main contribution was an interface that allows users to retrieve and query linked data without losing expressiveness, efficiency or scalability. In particular, the system adapts automatically to ontology changes and data transformations, depending only on updates to the underlying ontologies. The projects where the system was tested deal with heterogeneous data, including sequence data and experimental results produced by several teams and work packages, which in the end became integrated, browsable and queryable. The data stored comprises about one million triples, and most usual queries run in less than one second.

Conclusions: The development of sdlink, and its deployment in a real scenario, allowed us to draw conclusions about the importance and usefulness of semantic technologies, namely for domain representation and data integration. More importantly, it was possible to show that, given suitable interfaces, any user can benefit from such technologies. Currently, the unfriendliness of most semantic technologies, in particular in the fields of biology and biomedicine, has hampered their adoption. The sdlink system is proposed to overcome this problem and to bring semantic technologies and linked data to a broader audience.


Poster 2
Spo: An Ontology for Describing Host-pathogen Interactions Inherent to Streptococcus Pneumoniae Infections

Presenter: Cátia Vaz
INESC-ID / ISEL, Polytechnic Institute of Lisbon, Portugal

Additional Authors:
Pedro Reis, INESC-ID / IST, Technical University of Lisbon, Portugal
Alexandre Francisco, INESC-ID / IST, Technical University of Lisbon, Portugal
Susana Vinga, INESC-ID, Lisbon, Portugal
Ana Freitas, INESC-ID / IST, Technical University of Lisbon, Portugal

Abstract: Over the past twenty years, the study of infection has tended to consider individual virulence factors or host factors. The Pneumopath project (www.pneumopath.org/), an FP7 European research project, has the objective of studying host-pathogen interactions during infection by Streptococcus pneumoniae and finding new targets for diagnosis and treatment. This research aims to identify the most important and consistently involved host and pneumococcal factors, in contrast to previous approaches, where factors were studied in isolation. The transmission of Streptococcus pneumoniae to a new host can result in asymptomatic colonization or progress to invasive disease. The infection can be determined by multiple attributes of both host and pathogen, so it is important to take into account the epidemiological and genomic characterization of pneumococcal strains, the results of experiments that evaluate host or pneumococcal responses to infection or to different environmental challenges, and the results of experiments that identify host genetic susceptibility factors. In this work we propose Spo (kdbio.inesc-id.pt/~cvaz/pneumopath/), an ontology developed in the context of the Pneumopath project, which provides terms and semantic constructs for annotating all aspects of host-pneumococcal interactions.

Method: The data considered include the characterization of pneumococcal strains and typing information, as well as data from in vitro and in vivo experiments with animal and cell models, relevant for identifying new targets to combat pneumococcal diseases. Some of these data are scattered across numerous information systems and repositories, each with its own terminologies, identifier schemes, and data formats. The need to share such data brings challenges for both data management and annotation, such as the need for a common understanding of the concepts that describe host-pneumococcal interactions. Thus, semantic annotation and interoperability become an absolute necessity for the integration of such diverse biomolecular data. Moreover, given the heterogeneous environment inherent to the project, the ontology construction took into consideration contributions from all partners, leading to a well-grounded set of concepts and annotations.

Results: Spo provides a framework to represent the host-pneumococcal interactions described above, and is flexible enough to accommodate the rapid changes and advancement of research while achieving data interoperability and interchange. This is only possible because of Semantic Web recommended practices for clearly specifying names for things and relationships, and for expressing data in standardized and well-specified knowledge representation languages. The ontology, described in OWL Lite v1.0, includes 36 classes, 24 object properties and 43 data properties.
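
A class and property inventory like the one reported can be checked mechanically once the OWL file is loaded into an RDF library; the short sketch below does this with rdflib, assuming a local copy of the ontology (the filename is a placeholder).

    # Minimal sketch, assuming the Spo OWL file has been downloaded locally;
    # "spo.owl" is a placeholder filename.
    from rdflib import Graph
    from rdflib.namespace import RDF, OWL

    g = Graph()
    g.parse("spo.owl", format="xml")  # OWL Lite serialized as RDF/XML

    # Count the declared classes and properties.
    classes = set(g.subjects(RDF.type, OWL.Class))
    obj_props = set(g.subjects(RDF.type, OWL.ObjectProperty))
    data_props = set(g.subjects(RDF.type, OWL.DatatypeProperty))
    print(len(classes), "classes,", len(obj_props), "object properties,",
          len(data_props), "data properties")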

Conclusion: The main contribution of this work is not only Spo itself, but also the approach and methodology for its construction in the context of a large research project, where many people were unaware of semantic technologies. The proposed ontology not only describes knowledge in this field, but also allows existing knowledge to be validated and aggregated, which is essential for data integration. Furthermore, the ability to accurately describe host-pneumococcal interactions through Spo has facilitated the implementation of information systems capable of coping with heterogeneous types of data and, by using well-known semantic technologies, it has allowed users to query data and discover new knowledge.


Poster 3
Chem2Bio2RDF: Linked Open Data for Drug Discovery

Presenter: Bin Chen
Indiana University, Bloomington, United States

Additional Authors:
Ying Ding, Indiana University, United States
Philip Yu, University of Illinois, Chicago, United States
Eric Gifford, Pfizer, United States
David Wild, Indiana University, United States

A critical barrier in current drug discovery is the inability to utilize public datasets in an integrated fashion to fully understand the actions of drugs and chemical compounds on biological systems. There is a need not only for a resource that intelligently integrates the now-available heterogeneous datasets pertaining to compounds, drugs, targets, genes, diseases, and drug side effects, but also for robust, effective network data mining algorithms that can be applied to such integrated data sets to extract important biological relationships. In this work, we discuss (i) the creation of Chem2Bio2RDF for drug discovery data, integrating chemical compounds, protein targets, genes, metabolic pathways, diseases and side effects using Semantic Web technologies, and (ii) the development of innovative data mining algorithms to facilitate drug discovery. Chem2Bio2RDF incorporates 25 public datasets related to systems chemical biology, grouped into 6 domains: chemical (PubChem Compound, ChEBI, PDB Ligand); chemogenomics (KEGG Ligand, CTD Chemical, BindingDB, MATADOR, PubChem BioAssay, QSAR, TTD, DrugBank, ChEMBL, Binding MOAD, PDSP, PharmGKB); biological (UNIPROT, HGNC, PDB, GI); systems (KEGG Pathway, Reactome, PPI, DIP); phenotypes (OMIM, Diseasome, SIDER, CTD diseases); and literature (MEDLINE/PubMed). The number of RDF triples is approximately 110 million. We developed a domain ontology (called Chem2Bio2OWL) to better integrate these 25 datasets. The primary classes of this ontology are SmallMolecule, MacroMolecule, Disease, SideEffect, Pathway, BioAssay, Literature and Interaction, based partially on the BioPAX classes. The primary classes were further refined in accordance with the current instance data structure. We proposed and tested several graph mining and machine learning algorithms (e.g., Bio-LDA, path finding, subgraph mining and diversity ranking) on the resulting Chem2Bio2RDF linked open dataset to facilitate drug discovery. We found that our Bio-LDA model, which uses bio-terms, journal information and word information to characterize a topic, provides a better representation of topics than the simple LDA model, which can only provide the word representation. Rosiglitazone is one of several thiazolidinediones on the market for diabetes; our path finding algorithm presents the set of most informative and diverse associations between the drug and its potential side effects, showing different causes of the hepatitis side effect. Our constraint-based subgraph and diversity ranking algorithm can detect the inhibition of catechol O-methyltransferase (COMT) in Parkinson's disease. By combining information from DrugBank, PubChem and UniProt, we can find information regarding the genes that Tolcapone and Entacapone target, their names, the proteins they encode, PubMed articles related to their interaction with COMT, and the structures of the proteins they target. We demonstrate the potential of data mining and graph mining algorithms to identify hidden associations that could provide valuable directions for further exploration at the experimental level. In the future, we will focus on using the identified associations and paths between various bio-terms to predict potential connections among other, as yet unlinked, bio-terms.
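
To picture what association path finding over such linked data means, the toy sketch below enumerates paths between a drug and a side effect in a tiny heterogeneous graph by breadth-first search; the edges are invented for illustration, and this is only the general idea, not the authors' ranking algorithm.

    # Toy illustration of association path finding over a heterogeneous
    # graph (drug -> target -> pathway -> disease); a plain BFS sketch of
    # the general idea, not the authors' path finding or ranking algorithm.
    from collections import deque

    edges = {  # hypothetical linked-data neighbourhood
        "rosiglitazone": ["PPARG"],
        "PPARG": ["lipid metabolism pathway"],
        "lipid metabolism pathway": ["hepatitis"],
    }

    def find_paths(start, goal, max_len=4):
        paths, queue = [], deque([[start]])
        while queue:
            path = queue.popleft()
            if path[-1] == goal:
                paths.append(path)
            elif len(path) < max_len:
                for nxt in edges.get(path[-1], []):
                    queue.append(path + [nxt])
        return paths

    print(find_paths("rosiglitazone", "hepatitis"))
    # [['rosiglitazone', 'PPARG', 'lipid metabolism pathway', 'hepatitis']]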


Poster 4:
The VIVO Ontology: Enabling Networking of Scientists

Presenter: Ying Ding
Indiana University, Bloomington, Indiana, United States

Additional Authors:
Stella Mitchell, Cornell University, United States
Jon Corson-Rikert, Cornell University, United States
Brian Lowe, Cornell University, United States
Bing He, Johns Hopkins University, United States

VIVO, funded by the NIH, utilizes Semantic Web technologies to model scientists and provides federated search to enhance the discovery of researchers and collaborators across disciplines and organizations. The VIVO ontology is designed with a focus on modeling scientists, publications, resources, grants, locations, and services. It incorporates classes from popular ontologies such as BIBO, Dublin Core, Event, FOAF, geopolitical, and SKOS. VIVO data is annotated with the VIVO ontology to semantically represent and integrate information about faculty research (i.e., educational background, publications, expertise, grants), teaching (i.e., courses, seminars, training), and service (i.e., organizing conferences, editorial boards, other community services). The VIVO ontology has been adopted nationally and internationally, and enables national and international federated search for finding experts. VIVO is an open-source Semantic Web application that, when populated with researcher interests, activities, and accomplishments, enables discovery of research and scholarship across disciplines and organizations. The VIVO core ontology models the academic community in order to provide a consistent and connected perspective on the research community to various stakeholders, including students, administrative and service officials, prospective faculty, donors, funding agencies, and the public. The major impetus for the NIH to fund the VIVO effort was to "develop, enhance, or extend infrastructure for connecting people and resources to facilitate national discovery of individuals and of scientific resources by scientists and students to encourage interdisciplinary collaboration and scientific exchange". The application is in use at the seven institutions of the NIH VIVO project and has been adopted, or is being adopted, by several other universities (e.g., Harvard University) and organizations in the USA (e.g., the United States Department of Agriculture), and by several universities and institutions in Australia and China (e.g., Queensland University of Technology, the Chinese Academy of Sciences) (Gewin, 2009). More specifically, VIVO can support discovering potential collaborators with complementary expertise or skills, suggesting appropriate courses, programs, and faculty members according to students' interests, and facilitating the currency, maintenance and communication of research information. For example, a Computer-Aided Drug Discovery (CADD) group may want to find and team up with a computer specialist and a group using in vivo experiments in drug discovery. If the VIVO core ontology is implemented in this hypothetical situation, the group leader can search across experts in computer science and molecular biology. In this paper, we present a relatively comprehensive discussion of the development of the VIVO core ontology, including the latest updates.
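
As a rough sketch of the kind of modeling this enables, the snippet below describes a researcher and one publication with rdflib, using FOAF plus a placeholder namespace standing in for VIVO-style terms; the property names are illustrative assumptions, not the actual VIVO core ontology vocabulary.

    # Minimal sketch using FOAF plus a placeholder namespace in the spirit
    # of VIVO core; property names are illustrative assumptions.
    from rdflib import Graph, Namespace, Literal, URIRef
    from rdflib.namespace import FOAF, RDF

    VIVO = Namespace("http://example.org/vivo#")  # placeholder, not the real VIVO namespace
    g = Graph()

    person = URIRef("http://example.org/person/jane-doe")
    paper = URIRef("http://example.org/pub/123")
    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal("Jane Doe")))
    g.add((person, VIVO.hasResearchArea, Literal("computational drug discovery")))
    g.add((paper, VIVO.hasAuthor, person))

    # Expert search then reduces to graph queries over such profiles.
    for s in g.subjects(VIVO.hasResearchArea, Literal("computational drug discovery")):
        print(s)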


Poster 5:
BioPAX Community Update

Presenter: Nadia Anwar
General Bioinformatics, Berkshire, United Kingdom

Additional Authors:
Gary Bader, University of Toronto, Canada
Emek Demir, Memorial Sloan-Kettering Cancer Center, United States
Igor Rodchenkov, University of Toronto, Canada
Chris Sander, Memorial Sloan-Kettering Cancer Center, United States

BioPAX, the Biological Pathway Exchange, is an OWL ontology for modelling biological pathway data. Biological pathways are constructs that biologists use to represent relationships between and within chains of cellular events. For example, metabolic pathways typically represent flows of chemical reactions, while signal transduction pathways represent the chains of interactions that transmit external signals received by a cell to deliver some response within the cell. These data are as heterogeneous as the numerous data sources (pathguide.org) that supply them, and their exchange, integration and annotation is a considerable challenge. BioPAX was developed to ease the access, use, exchange and aggregation of pathway data. This poster highlights recent community developments. The current specification, BioPAX Level 3, and its supporting API, PaxTools, were released by the community in 2009. Since this release, the BioPAX community has focused on supporting developers through the transition from L2 to L3, on community organisation, on interoperability with other standards, and on future directions drawn from user feedback. In 2010, BioPAX joined forces with SBML, SBGN and other standards to form the 'COmputational Modeling in BIology' NEtwork (COMBINE). This initiative aims to coordinate the development of the various community standards and formats. Learning from the experience of community organization in other successful standards, the BioPAX community has reorganised itself. In place now are an invited Scientific Advisory Board and an elected Editorial Committee, who are coordinating governance and proposal development with the COMBINE network. The annual BioPAX meetings are also now coordinated with the COMBINE network, providing economies of scale in standards development through a shared forum for exchanging experiences, and enabling the standardization efforts to work cooperatively. These meetings are organized into two events: a hackathon (HARMONY, May 21-25, 2012, Maastricht) and the COMBINE forum (August 15-19, 2012, Toronto). The BioPAX community is also responding to feedback from a survey undertaken in 2011. Community members, consumers and data providers gave valuable information on how they use BioPAX, the difficulties they face and how they want the specification to progress in the future. This feedback will be used by the new governance teams to help establish the specification in the areas where it is currently used, to extend the community beyond current usage, and to determine future directions for the community. To get involved or find out more about BioPAX, see www.biopax.org, join the mailing list, or attend a meeting in 2012.


Poster 7:
Dynamic Enhancement of Drug Product Labels through Semantic Web Technologies

Presenter: Richard Boyce
University of Pittsburgh, PA, United States

Additional Authors:
Jodi Schneider, Digital Enterprise Research Institute, Ireland
Michael Taylor, Microsoft, United States
Maria Liakata, EBI, United Kingdom
Anita De Waard, Elsevier, United States

FDA-approved drug product labeling (the package insert, or PI) is a major source of information intended to help clinicians prescribe drugs in a safe and effective manner. Unfortunately, drug PIs have been identified as often lagging behind the drug knowledge expressed in the scientific literature, especially when it has been several years since a drug was released to the market. Out-of-date or incomplete PI information can increase the risk of otherwise preventable adverse drug events. This can occur directly if the PI fails to provide information that is needed for safe dosing or to properly manage drugs known to interact. Clinicians might also be indirectly affected if they depend on third-party drug information sources, and these sources fail to add information that is available in the scientific literature but not present in the PI. We are creating a Linked Data store that will enable the drug PI to be expanded as new information becomes available in the scientific literature. The goal of the Linked Data store is to provide clinicians, patients, and the maintainers of drug information resources with the most complete and up-to-date information on particular claims made within a PI. We are focusing on 25 currently marketed psychotropic medications (nine antipsychotics, twelve antidepressants, and four sedative hypnotics). To construct this Linked Data repository, we aim to use Natural Language Processing (NLP) technologies to identify core claims in the scientific literature and various web-based data sources that pertain to pharmacokinetic drug-drug interactions, age-related changes in clearance, metabolic clearance pathways, and genetic polymorphisms that can affect metabolism. This work aligns with the CSHALS themes "Linked Data", "Text Analysis, NLP, Question Answering", "Data Modeling: Ontologies, Taxonomies", and "Clinical Applications."

Method: We will identify the core rhetorical components of the content sources using a basic Scientific Discourse ontology constructed from (and compatible with) biomedical discourse ontologies (i.e., SWAN, OAC and AO) and discourse annotation metadata (specifically CoreSC). The ensuing discourse annotations will distinguish between facts, hypotheses, and evidence statements, and will be automatically recognised in text following an information extraction approach similar to conceptualisation zoning. The expected result is a Linked Open Data node, a triple store and a SPARQL endpoint available for use by different patient-, clinician-, and pharmacoepidemiology-centered data sources. Human-readable summaries will also be generated to expand on existing PI information.

Results: While we are in the early planning phases of the project, we have built a prototype system that demonstrates the concept by identifying how claims on metabolic clearance and drug-drug interactions could be updated in two drug PIs with evidence from the scientific literature.

Conclusions: We envision using the resulting Linked Data store as the back end for a system that provides pharmacokinetic information on age-related clearance changes, metabolic clearance pathways, pharmacokinetic drug-drug interactions, and genetic polymorphisms. After developing a demonstrator for the 25 psychotropics, we anticipate that it will be feasible to subsequently deploy our system for any given drug.
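
As a rough sketch of how a literature claim might be attached to a PI statement in such a store, the snippet below creates an annotation-style triple set with rdflib; the namespace, property names, claim text and PMID are placeholders in the spirit of AO/SWAN, not the project's actual schema.

    # Minimal sketch linking a PI claim to supporting literature evidence;
    # the namespace, properties, claim and PMID are placeholders, not the
    # project's actual schema or data.
    from rdflib import Graph, Namespace, Literal, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/pi-claims#")  # hypothetical vocabulary
    g = Graph()

    claim = URIRef("http://example.org/claim/cyp2d6-clearance")
    evidence = URIRef("http://www.ncbi.nlm.nih.gov/pubmed/0000000")  # placeholder PMID
    g.add((claim, RDF.type, EX.EvidenceStatement))
    g.add((claim, EX.text, Literal("Drug X is primarily cleared by CYP2D6.")))
    g.add((claim, EX.supportedBy, evidence))
    g.add((claim, EX.aboutSection, Literal("Clinical Pharmacology")))

    print(g.serialize(format="turtle"))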


Poster 8:
A Case Study in Using Literature to Find Predicate Relationships and Indirect Associations

Presenter: James Dixon
Linguamatics Ltd., Newton, MA, United States

Additional Authors:
David Milward, Linguamatics, United Kingdom

Objectives and Motivation: Gene expression has been the focus of much research, especially for the treatment of carcinomas. Recently attention has turned to smaller RNA molecules that are involved in post-transcriptional regulation: microRNAs. MicroRNAs (miRNAs) are known to bind to complementary sequences on target messenger RNA transcripts (mRNAs). MiRNA-expression profiling of different neoplasms has identified signatures associated with diagnosis, staging, progression, prognosis and response to treatment. In addition, profiling has been exploited to identify miRNA genes that might be involved in cancer or oncogenic pathways. Obtaining better insight into the connection between miRNAs and diseases requires understanding of the relationships between miRNAs and genes, and of the relationships between the relevant genes and diseases. This work compares and links together data from different sources: algorithmic predictions, experimental evidence and text-mined literature.

Method: There are a number of publicly available databases that provide miRNA-to-gene mappings, usually via statistical calculations. However, few have established the mechanism of action. Each mechanism is different and may matter to a researcher. Using the Linguamatics I2E text mining platform, we mined the research literature (Medline abstracts) using natural language processing (NLP) to add relational information to the miRNA-gene combinations. A particular challenge was the nomenclature for miRNAs, which may include prefixes and suffixes. For example, they may be prefixed to distinguish species, such as hsa-miR-19a for human and mmu-miR-19a for mouse. Our approach treated them as a single family, since in general their function is very similar. This also allowed us to extract literature results where the species is not identified and only the family name is used. Since our interests lie in connecting miRNAs to carcinomas, we also used the same literature source, Medline, to extract gene-to-carcinoma relationships, allowing us to link miRNAs to diseases via the genes they affect.
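
The species-prefix collapsing described above can be pictured as a small normalization step; the sketch below shows one plausible way to do it with a regular expression (the prefix list is illustrative, not I2E's actual terminology handling).

    # Toy normalization of miRNA names to a species-independent family name;
    # a plausible sketch, not I2E's actual implementation.
    import re

    SPECIES_PREFIX = re.compile(r"^(?:hsa|mmu|rno)-", re.IGNORECASE)  # illustrative prefixes

    def mirna_family(name):
        """Strip a species prefix so hsa-miR-19a and mmu-miR-19a collapse together."""
        return SPECIES_PREFIX.sub("", name)

    for n in ["hsa-miR-19a", "mmu-miR-19a", "miR-19a"]:
        print(n, "->", mirna_family(n))
    # all three map to the family name "miR-19a"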

Results: Using I2E, we found over 6,000 miRNA-to-gene relationships in Medline abstracts. These relationships overlapped to some extent with databases commonly used in the genomics field, for instance TargetScan (1,004), TarBase (135) and miRecords (316). The overlap of the three databases with each other was similar to their overlap with the I2E results. Focusing on a single carcinoma, non-small cell lung cancer (NSCLC), as an example, we were able to extract over 400 indirect relationships between miRNAs and NSCLC, where other public databases had fewer than 50 miRNA-to-NSCLC associations.

Conclusions: Since all of the public database information used had only modest overlap with the results from the literature, we are confident that we added not only relational information to the miRNA-gene interactions, but also novel relationships to the miRNA-disease connections. In addition, we have extended our network from miRNA-to-gene and gene-to-disease links to the more interesting relationship of miRNA to disease via these indirect links. Creating these associations will provide researchers with new avenues to explore, lead to new target identification and, hopefully, new disease treatments.


Poster 9:
Image Retrieval in Controlled English

Presenter: Tobias Kuhn
Yale University School of Medicine, New Haven, CT, United States

Additional Authors:
Michael Krauthammer, Yale University, United States

The Yale Image Finder (YIF) project aims at improving biomedical image and document retrieval by developing advanced image parsing and indexing strategies. To this end, we have deployed a YIF search engine, which allows keyword searches against indexed PubMed Central open access images. Authors often follow well-accepted layouts when depicting experimental results as gels, graphs or plots, and use image text in an equally structured fashion to label different image elements. Image text placement often conveys higher-level semantics, such as the names of proteins being studied under different experimental conditions. We are currently exploring innovative ways of allowing YIF users to access such structured image text content. Here, we propose the use of a controlled language interface that guides users in composing natural language queries ("Find an image where X is measured under the condition Y") that are subsequently mapped to indexed image text content. Our approach is based on controlled natural language, i.e. a restricted subset of English with a precise and unambiguous mapping to logic. We present a prototype called Rice (Retrieving Images through Controlled English) that is based on an interface we developed for a different domain (annotated text corpora) and adapted for image mining. Users can write seemingly natural queries like "Find an image that is a Western blot and where 'p38' is compared to 'MKK3'", which are subsequently translated into a logical representation like "western-blot & compared(p38,MKK3)". Such logical representations can then be matched against the formal model that we extract from images found in biomedical papers. One serious problem with controlled natural language is that it is very easy to read and understand but hard to write. Our prototype solves this problem by providing a predictive editor, with which users construct syntactically correct sentences in an iterative and guided way. For any partial sentence, the predictive editor of Rice shows the possible continuations in the form of different menu boxes. In this way, users do not need to know the restrictions of our language beforehand. Previous evaluation has shown that this editor is very easy to use after very little or no training. Typical users of search engines are not familiar with logic notations and rarely have the time to learn one. Existing query interfaces are either very simple (i.e. keyword-based) or too complex to be usable without training. With Rice, complex queries can be written in a natural and intuitive way. The interface should be immediately accessible to researchers interested in the results represented in images of the biomedical literature. Rice supports queries with directed relationships ("... where A is measured under the condition B"), resulting in the retrieval of highly specific image sets. In contrast, keyword searches cannot build such refined query representations, and cannot easily tell apart the related query "... where B is measured under condition A". Our prototype is still incomplete, but we believe that it nicely demonstrates the potential of our approach, and the positive results of previous work make us confident of its practicality.
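
The mapping from controlled English to logic can be imagined as a small pattern-driven translation; the toy sketch below handles just the two query shapes quoted above, as a stand-in for Rice's actual grammar-based parser.

    # Toy translation of two controlled-English query shapes into the logical
    # notation quoted in the abstract; Rice's real parser is grammar-based.
    import re

    PATTERNS = [
        (re.compile(r"Find an image that is a (.+) and where '(.+)' is compared to '(.+)'"),
         lambda m: "{} & compared({},{})".format(m.group(1).lower().replace(" ", "-"),
                                                 m.group(2), m.group(3))),
        (re.compile(r"Find an image where (.+) is measured under the condition (.+)"),
         lambda m: "measured-under({},{})".format(m.group(1), m.group(2))),
    ]

    def to_logic(query):
        for pattern, build in PATTERNS:
            m = pattern.match(query)
            if m:
                return build(m)
        raise ValueError("query not in the controlled language")

    print(to_logic("Find an image that is a Western blot and where 'p38' is compared to 'MKK3'"))
    # western-blot & compared(p38,MKK3)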


Poster 10:
A Simplified Method for Creating a Cell Cycle Ontology for the Laboratory Mouse

Presenter: Mary E Dolan
MGI, The Jackson Laboratory, Bar Harbor, ME, United States

Additional Authors:
Chris J. Mungall, Lawrence Berkeley National Laboratory, United States
Heiko Dietze, Lawrence Berkeley National Laboratory, United States
Judith Blake, MGI, The Jackson Laboratory, United States

The cell cycle is an essential, highly conserved, complex process. Understanding the cell cycle is important in understanding development, aging, and the progression of many diseases, including cancer. Mouse Genome Informatics (MGI) is the international database resource for the laboratory mouse, providing genetic, genomic, and biological data to facilitate the study of the mouse as a model for human health and disease. We have recently developed a mouse cell cycle ontology as a novel approach to data integration for the diverse data on the laboratory mouse available at MGI. Currently at MGI, 1070 mouse genes are functionally associated with the cell cycle and have been annotated to the Gene Ontology (GO) term ‘cell cycle’ or its descendants. This mouse cell cycle gene set also has a large body of additional experimental annotation: 8126 experimental GO annotations beyond ‘cell cycle’; 581 genes have phenotypic alleles with 31,134 phenotype annotations describing 10,129 affected anatomical systems; 512 genes have curated OMIM (Online Mendelian Inheritance in Man) associations to mouse models; 58 genes have pathway (MouseCyc) annotations; and 1055 genes have human orthologs. Many of these data are described by different ontologies from the Open Biomedical Ontologies (OBO): gene product function data are annotated using the Gene Ontology; mouse phenotype data using the Mammalian Phenotype Ontology; and expression data using the Adult Mouse Anatomical Dictionary and the Edinburgh Mouse Atlas for embryonic stages. Our mouse cell cycle ontology provides a view across these distinct ontologies, giving a richer description of the data. The analysis of data related to cell cycle processes requires an integrated view that pulls together as much data as possible. Our approach adapts and extends a method that has been used by other groups to develop cell cycle ontologies for other organisms, including human, yeast, and Arabidopsis. In this work, we describe the structure and content of our mouse cell cycle ontology, Mouse_CCO, an ‘application’ ontology built on experimental evidence-based annotations for the specific purpose (application) of studying the cell cycle. The structure of Mouse_CCO provides the generic template for the ontology, which is then populated using the 1070 mouse cell cycle genes along with all their annotations from MGI and several additional data resources. The data drive the structure and allow a user to ‘discover’ connections. For an experimental evidence-based ontology, it is particularly important to keep the ontology up to date. Two newly developed tools, also described in this work, simplify maintenance of the ontology: the first allows a user to download mouse genes and selected annotations in OBO format, which is then used by the second tool, Oort (OBO Ontology Release Tool), to perform MIREOT-like procedures that create a merged ontology bringing in subsets of external ontologies. The final product is Mouse_CCO in both OBO and OWL formats, which can be queried and explored using a variety of free, publicly available tools. Our hope is that this resource will facilitate hypothesis generation based on the cell cycle as a biological system.
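
OBO's stanza-based text format makes lightweight tooling of the sort described here easy to sketch; below is a minimal pure-Python reader that pulls term IDs, names, and is_a parents from an OBO file. The filename is a placeholder, and real tools such as Oort do far more.

    # Minimal OBO stanza reader: collects id, name, and is_a parents per term.
    # A sketch only; "mouse_cco.obo" is a placeholder filename.
    def read_obo_terms(path):
        terms, current = {}, None
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line == "[Term]":
                    current = {"is_a": []}
                elif line.startswith("["):
                    current = None  # skip [Typedef] and other stanzas
                elif current is not None and ": " in line:
                    key, value = line.split(": ", 1)
                    if key == "id":
                        terms[value] = current
                    elif key == "name":
                        current["name"] = value
                    elif key == "is_a":
                        current["is_a"].append(value.split(" ! ")[0])
        return terms

    terms = read_obo_terms("mouse_cco.obo")
    print(len(terms), "terms loaded")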


Poster 11:
OpenBEL, the BEL Framework, and the BEL Portal

Presenter: Julian Ray
Selventa, Cambridge, MA, United States

Additional Authors:
Ted Slater, Selventa, United States
Natalie Catlett, Selventa, United States
David de Graaf, Selventa, United States

The Biological Expression Language (BEL) and its supporting technology platform, the BEL Framework, will be released by Selventa, in conjunction with Pfizer, to the life sciences community in Q1 of 2012. BEL and the BEL Framework are designed to promote the collection, sharing, and interchange of structured knowledge within and among organizations. The BEL Portal, at http://belframework.org/, is the online community home for BEL and the BEL Framework. BEL is a language for representing scientific findings in the life sciences in a computable form. It is designed to represent scientific findings by capturing causal and correlative relationships in context, where context can include information about the biological and experimental system in which the relationships were observed, the supporting publications cited, and the process of curation. BEL is intended as a knowledge capture and interchange medium, supporting the operation of systems that integrate knowledge derived from independent efforts. The BEL language has been designed and used by our scientists and our customers for almost a decade. The language has been specifically designed to help scientists record life science facts in a way that is intuitive, easy to learn, concise, and appealing. A good language should help the user articulate an idea in a manner that is unambiguous and terse, and that conveys the facts and associated contexts without loss or ambiguity. BEL is designed to do just this for life science applications. The current version of the language is small, which makes it easy to learn. BEL supports causal, correlative, and negative relationships, which makes it suitable for recording a variety of experimental and clinical findings, and it can be used with almost any set of vocabularies and ontologies, which makes it highly adaptable and easy to adopt. BEL can easily be extended to annotate findings with use-specific contexts such as experimental and clinical parameters. The BEL Framework is an emerging open-platform technology specifically designed to overcome many of the challenges associated with capturing, integrating, and storing knowledge within an organization, and with sharing that knowledge across the organization and between business partners. The BEL Framework provides mechanisms for knowledge capture and management; integration of knowledge from multiple, disparate knowledge streams; knowledge representation and standardization in an open, use-neutral format; creation of customizable, computable biological networks from captured knowledge; and rapid enablement of knowledge-aware applications through standardized application programming interfaces (APIs) across all major development platforms. Registering on the BEL Portal gives you access to more detailed documentation about BEL and the BEL Framework, allows you to participate in our community section and offer your views, opinions, and suggestions on the language and framework, and keeps you informed on the progress of the official launch. Once you register, you will have access to example documents, best practices, technical specifications, configuration guides, code examples, a wiki, and discussion groups.
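
To give a flavor of what such computable statements look like, below are two BEL-style examples with a trivial machine parse; the statements are illustrative approximations written for this sketch, and the specification on the BEL Portal is the authoritative reference for the syntax.

    # Illustrative BEL-style statements (approximate syntax; see the BEL
    # specification on the portal for the authoritative grammar).
    statements = [
        "p(HGNC:AKT1) increases p(HGNC:MDM2)",
        "kin(p(HGNC:CDK4)) decreases p(HGNC:RB1)",
    ]

    # A trivial split into subject / relationship / object, just to show
    # that such findings are machine-readable.
    for s in statements:
        for rel in ("increases", "decreases"):
            marker = " {} ".format(rel)
            if marker in s:
                subject, obj = s.split(marker)
                print((subject, rel, obj))
                break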


Poster 12:
Semantic Integration to Characterize Microbial Pathogens: Multi-resource Enrichment of Experimental Proteomic and Genomic Datasets (.pdf)

Presenter: Erich Gombocz
IO Informatics, Inc., Berkeley, CA, United States

Additional Authors:
James Candlin, Sage-n Research, United States

Bacterial and viral infectious diseases account for major health threats globally, yet their characterization, identification and understanding have been scientifically challenging. This is mainly due to the fact that, while there is a wealth of information (and even complete genomes) available, its integrated utilization in the context of the biological system, to better understand causes and similarities in infectious diseases, is still in its infancy. This poster addresses some of the many obstacles involved in this endeavor as it attempts to identify peptides from different microorganisms with common mechanisms of action causing disease, and to use them as biomarkers to detect pathogenic microbial threats prior to the onset of disease symptoms, to help in outbreak prevention. The presented workflow consists of five steps. The first step is a thorough peptide analysis of microorganisms via mass spectrometry and their identification by sequence scoring (Sorcerer, indexed SEQUEST search, BioWorks). The second step is the annotation of peptides with genes and genomic sequences relevant to protein expression, to qualify the accuracy of the identification. Step 3 involves the use of public-domain microbial databases (PATRIC, ICTV, VIDA, Viral ORFeome, miRBase) to semantically integrate the experiments with organism taxon-specific functional genomic and pathway information relevant to diseases caused by the pathogens. Based on sequence similarity, sequences are clustered into homologous protein families (HPFs), and those protein families are enriched with annotations including functional classification, related protein structures, taxonomy, protein length, boundaries of conserved regions, and bacterial or virus-specific genes. Further enrichment is achieved through the addition of disease-related pathways (BioCyc, KEGG). The resulting knowledgebase provides a network with functional annotations of peptides and their relationships to diseases (Sentient Knowledge Explorer). In Step 4, the peptides in the network are identified which have similar disease-causing functions and appear in several pathogens. Interrogating the network via semantic queries (SPARQL) results in the discovery of key pathway intersections commonly involved in the disease. The last step is the creation of molecular marker signatures (SPARQL, Applied Semantic Knowledgebases - ASK) and the testing of their validity as decision support in multiplexed assays. Future applications will apply this technology for rapid detection of biological threats, to characterize the origin and type of disease outbreaks, and to develop preventive measures (such as broadly applicable drugs or vaccines) effective for entire classes of pathogenic organisms.
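
Step 4's network interrogation might look something like the following sketch, which asks for peptides that appear in more than one pathogen and are linked to a common pathway; the vocabulary, filename and graph are illustrative placeholders, not the actual knowledgebase schema.

    # Sketch of a Step-4-style query: peptides shared by several pathogens
    # and linked to a common pathway. Vocabulary and file are placeholders.
    from rdflib import Graph

    g = Graph()
    g.parse("pathogen_kb.ttl", format="turtle")  # hypothetical export of the knowledgebase

    q = """
    PREFIX ex: <http://example.org/pathogens#>
    SELECT ?peptide ?pathway (COUNT(DISTINCT ?organism) AS ?n)
    WHERE {
        ?peptide ex:foundIn      ?organism .
        ?peptide ex:participates ?pathway .
    }
    GROUP BY ?peptide ?pathway
    HAVING (COUNT(DISTINCT ?organism) > 1)
    """
    for peptide, pathway, n in g.query(q):
        print(peptide, pathway, n)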


Poster 13:
The Quad Economy of a Semantic Web Ontology Repository

Presenter: Trish Whetzel
National Center for Biomedical Ontology, Stanford, CA, United States

Additional Authors:
Paul R. Alexander, Stanford University, United States
Mark A. Musen, Stanford University, United States
Natalya F. Noy, Stanford University, United States

BioPortal is an open library of biomedical ontologies that can be accessed using a Web-based user interface or RESTful Web services. The Web-based user interface allows users to browse, search, and visualize ontologies, and facilitates community participation in the ontology lifecycle, including reviews of ontologies, mappings between terms, comments, and new term proposals. A suite of Web services, including services that expose information about terms in ontologies, mappings, notes, and metadata about the ontologies themselves, drives the Web-based interface. The NCBO Web services provide a common XML output of ontology content regardless of the ontology representation format; however, there is no single uniform storage for the ontologies and their metadata. As the amount of information in BioPortal and the number of hits to the NCBO Web services increase, a more scalable solution is needed. To address these issues, we analyzed the use of a quad store, since quad stores easily scale to millions of triples and provide SPARQL query access to the ontologies. Currently, each ontology in BioPortal includes the materialization of all owl:imports. Thus, if a small ontology imports a large ontology, the former becomes a large ontology. Taking into account that BioPortal stores multiple versions of an ontology, the problem is reproduced for every version. Our hypothesis was that we could optimize the number of quads in the system using a more granular model where owl:imports are not materialized and every ontology graph contains its own RDF triples without the triples from the imported ontologies. One of the questions to be answered is the optimization ratio, in number of triples, when using an ontology-per-graph model versus a closure-materialized model. Of the 149 OWL ontologies reviewed, there are 299 ontologies in the import closure (i.e., if we follow all the owl:imports links from the 149 ontologies, we obtain a set of 299 ontologies). These 299 OWL ontologies contain 303 owl:imports statements; the materialized import closure is a set of 495 owl:imports. We also reviewed the number of re-used triples. Ontologies with no imports contribute 5.4M triples to the system; ontologies with one import, 1.7M; ontologies with 2-9 imports, 0.5M; and ontologies with more than 10 imports, 2.1M. To conclude, our analysis shows that while ontology reuse is still far from being the norm, effective reuse is a goal worth pursuing, and the level of reuse can have significant implications for the scalability of ontology storage systems.
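
The import-closure counting described above is straightforward to reproduce for any ontology; the sketch below follows owl:imports links transitively with rdflib and tallies per-graph triple counts. The starting URL is a placeholder, and no BioPortal internals are assumed.

    # Sketch: follow owl:imports transitively and count triples per ontology,
    # mirroring the ontology-per-graph vs. closure analysis. URL is a placeholder.
    from rdflib import Graph
    from rdflib.namespace import OWL

    def import_closure(start_url):
        seen, todo, sizes = set(), [start_url], {}
        while todo:
            url = todo.pop()
            if url in seen:
                continue
            seen.add(url)
            g = Graph()
            g.parse(url)  # serialization format is guessed from the response
            sizes[url] = len(g)  # triples in this graph alone, imports not materialized
            for imported in g.objects(None, OWL.imports):
                todo.append(str(imported))
        return sizes

    sizes = import_closure("http://example.org/ontology.owl")
    print(len(sizes), "ontologies,", sum(sizes.values()), "triples in total")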


Poster 14:
A PubMed Search Engine for Rat Genome Curation at RGD

Presenter: Weisong Liu
The Medical College of Wisconsin, Milwaukee, WI, United States

Additional Authors:
Mary Shimoyama, The Medical College of Wisconsin, United States
Melinda Dwinell, The Medical College of Wisconsin, United States
Howard Jacob, The Medical College of Wisconsin, United States

The Rat Genome Database (RGD) is a collaborative effort between leading research institutions involved in rat genetic and genomic research. One of the main tasks of RGD is to curate the rat gene-related literature and enter the information into our database. In this work, we built a PubMed search engine to help our curators locate papers of interest more efficiently. Using NCBI's Entrez Utilities for Java, we created a pipeline that downloads PubMed data weekly in XML format. We parse the XML files using a parser generated from NCBI's efetch_pubmed.xsd file to extract information such as PMID, title, abstract, publication date and authors. The parsed information is stored in a MySQL database, which makes it easier for us to further utilize the information. Using the GATE and UIMA frameworks, we built another pipeline to extract ontology terms and synonyms (from the gene ontology, rat strain ontology, disease ontology, sequence ontology and organism ontology) and gene names/symbols from the PubMed titles and abstracts stored in the database. Some third-party plugins, such as Abner, OrganismTagger and MetaMap, were also used in this pipeline. The output of this pipeline includes ontology IDs, term positions within an abstract, and matching types. In order to make our framework scalable, we set up a small Hadoop cluster. The XML files are compressed and stored on Hadoop HDFS. Using the MapReduce framework, we can run our XML parsing and information extraction pipeline in many parallel threads, which dramatically reduces the total processing time compared to a single-threaded program. The pipeline can also run on Amazon Web Services' Elastic MapReduce. Along with the stored PubMed information, the pipeline output is fed into a Solr server, and all information is indexed by Apache Lucene. With a web-based user interface, a user can search for PubMed abstracts by entering PMIDs, authors, publication dates, terms, ontologies, ontology IDs, gene names and/or gene symbols. The search results are ranked by relevance, and matched terms are sorted by frequency of appearance in an abstract.
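
The first stage of such a pipeline, fetching and parsing PubMed records, can be sketched in a few lines against NCBI's public EFetch service; the authors use the Java Entrez Utilities, so the Python below is an assumed equivalent for illustration, with a placeholder PMID.

    # Sketch of the fetch-and-parse stage using NCBI's public EFetch service;
    # the authors' pipeline uses the Java Entrez Utilities. PMID is a placeholder.
    import urllib.request
    import xml.etree.ElementTree as ET

    pmid = "11748933"  # placeholder PMID
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
           "?db=pubmed&retmode=xml&id=" + pmid)

    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)

    for article in tree.iter("MedlineCitation"):
        title = article.findtext(".//ArticleTitle")
        abstract = article.findtext(".//AbstractText")
        print(article.findtext("PMID"), title)
        # title/abstract would then be stored in MySQL for downstream tagging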


Poster 16:
Turning Biological Knowledge into Mathematical Models, Automated

Presenter: Oliver Ruebenacker
Center for Cell Analysis and Modeling, University of Connecticut Health Center, United States

Additional Authors:
Michael Blinov, Center for Cell Analysis and Modeling, University of Connecticut Health Center
Ion Moraru, Center for Cell Analysis and Modeling, University of Connecticut Health Center
James Schaff, Center for Cell Analysis and Modeling, University of Connecticut Health Center

Living organisms are so enormously complex that we need computer simulations to understand the consequences of their vast biochemical reaction networks. As we uncover an increasing part of these networks, our established knowledge is increasingly stored in free web databases, available for query and download in machine-readable formats, especially the RDF/OWL-based community standard Biological Pathway Exchange (BioPAX) [1]. The available data is massive and growing: e.g., Pathway Commons [2] stores 1,700 pathways, 414 organisms, 440,000 interactions and 86,000 substances. This data is fully linked with open controlled terminologies such as the Gene Ontology (e.g. anatomical features) [3] and other free online databases such as ChEBI (chemicals) [4], KEGG (genes, among others) [5], UniProt (proteins) [6] and PubMed (publications) [7].

Automatic use of this knowledge for computer simulations of biological organisms has been an ongoing challenge [8,9,10]. Now, Systems Biology Pathway Exchange (SBPAX) [11], a BioPAX extension, allows the inclusion of quantitative data and systems biology terms, especially from the Systems Biology Ontology (SBO) [12]. SBPAX support has been implemented by the Virtual Cell [13], the Signaling Gateway Molecule Pages [14] and the System for the Analysis of Biochemical Pathways - Reaction Kinetics (SABIO-RK) [15]. For the first time, a mathematical model can be automatically built and fully annotated from a pathway of interest.
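
As a toy picture of what building a mathematical model from a pathway means, the sketch below turns one mass-action reaction, annotated with a rate constant of the kind SBPAX can carry, into an ODE right-hand side and integrates it with Euler steps; the reaction, species and constant are invented for illustration.

    # Toy example: one mass-action reaction A + B -> C with rate constant k,
    # of the kind SBPAX-annotated pathways can supply, turned into ODEs.
    # The species and the value of k are invented for illustration.
    k = 0.5  # hypothetical rate constant

    def derivatives(state):
        a, b, c = state
        v = k * a * b              # mass-action rate law: v = k[A][B]
        return (-v, -v, +v)        # d[A]/dt, d[B]/dt, d[C]/dt

    # Simple Euler integration of the resulting model.
    state, dt = (1.0, 2.0, 0.0), 0.01
    for _ in range(1000):
        d = derivatives(state)
        state = tuple(s + dt * x for s, x in zip(state, d))
    print("A, B, C after 10 time units:", state)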

Citations:
[1] Biological Pathway Exchange (BioPAX), www.biopax.org
[2] Pathway Commons, www.pathwaycommons.org
[3] Gene Ontology (GO), www.geneontology.org/
[4] Chemical Entities of Biological Interest (ChEBI), www.ebi.ac.uk/chebi/
[5] Kyoto Encyclopedia of Genes and Genomes (KEGG), www.genome.jp/kegg/
[6] UniProt, www.uniprot.org
[7] PubMed, www.ncbi.nlm.nih.gov/pubmed/
[8] Modeling without Borders: Creating and Annotating VCell Models Using the Web, Michael L. Blinov, Oliver Ruebenacker, James C. Schaff and Ion I. Moraru, Lecture Notes in Computer Science, Volume 6053 (2010).
[9] Using views of Systems Biology Cloud: application for model building, Oliver Ruebenacker, Michael Blinov, Theory in Biosciences, Volume 130, Number 1, 45-54 (2010).
[10] Integrating BioPAX pathway knowledge with SBML models, Michael L Blinov, Oliver Ruebenacker, Ion I Moraru, IET Syst. Biol., Vol. 3, Iss. 5, pp. 317-328 (2009).
[11] Systems Biology Pathway Exchange (SBPAX), www.sbpax.org
[12] Systems Biology Ontology, www.ebi.ac.uk/sbo/main/
[13] Virtual Cell, http://vcell.org
[14] Signaling Gateway Molecule Pages, www.signaling-gateway.org/molecule/
[15] System for the Analysis of Biochemical Pathways - Reaction Kinetics (SABIO-RK), http://sabio.villa-bosch.de/


Poster 17:
BioQuery-ASP: Querying Biomedical Ontologies in Natural Language Using Answer Set Programming

http://krr.sabanciuniv.edu/projects/BioQuery-ASP/


Presenter: Esra Erdem
Sabanci University, Istanbul, Turkey

Additional Authors:
Yelda Erdem, Research and Development Department, Sanovel Pharmaceutical Inc., Istanbul, Turkey
Halit Erdogan, Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey
Umut Oztok, Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey

Storing biomedical data in various structured forms, like biomedical ontologies, and at different locations has brought about many challenges for answering queries about the knowledge represented in these ontologies. One of the challenges is to express a complex query in natural language and get its answers in an understandable form. Another challenge is to answer complex queries that require appropriate integration of relevant knowledge stored in different places and in various forms, and/or that require auxiliary definitions, such as chains of drug-drug interactions, cliques of genes based on gene-gene relations, or similarity/diversity of genes/drugs. Furthermore, once an answer is found for a complex query, the experts may need further explanations about the answer. We have built a software system, called BIOQUERY-ASP, that handles all these challenges.

Method: We have addressed the three challenges described above using the high-level knowledge representation formalism and efficient automated reasoners of Answer Set Programming (ASP), a declarative programming paradigm that supports various Semantic Web technologies. To address the first challenge, we have developed a controlled natural language for biomedical queries about drug discovery, called BIOQUERY-CNL. We have then built an intelligent user interface that allows users to enter biomedical queries in BIOQUERY-CNL and that presents the answers (possibly with explanations or related links, if requested) in BIOQUERY-CNL. To address the second challenge, we have developed a rule layer over biomedical ontologies and databases that not only integrates the concepts in these knowledge resources but also provides definitions of auxiliary concepts. We have introduced an algorithm to identify the parts of the rule layer and the knowledge resources relevant to a given query, and used automated reasoners of ASP to answer queries over these relevant parts. To address the third challenge, we have developed an intelligent algorithm to generate an explanation for a given answer, with respect to the query and the relevant parts of the rule layer and the knowledge resources. The overall system architecture of BIOQUERY-ASP is presented in the figure included in the supporting document.

Results: We have shown the applicability of BIOQUERY-ASP to answer complex queries (specified by experts) over large biomedical knowledge resources.
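
One of the auxiliary concepts mentioned above, chains of drug-drug interactions, is essentially a transitive closure; the sketch below computes it in plain Python, with the corresponding ASP-style rules shown in comments. The interaction facts are invented, and BIOQUERY-ASP's actual rule layer is richer.

    # Chains of drug-drug interactions as a transitive closure. The ASP-style
    # rules this mirrors (illustrative, not BIOQUERY-ASP's rule layer):
    #   chain(X,Y) :- interacts(X,Y).
    #   chain(X,Z) :- interacts(X,Y), chain(Y,Z).
    # The facts below are invented for illustration.
    interacts = {("drug_a", "drug_b"), ("drug_b", "drug_c"), ("drug_c", "drug_d")}

    def interaction_chains(facts):
        chain = set(facts)
        changed = True
        while changed:  # iterate to a fixpoint, as an ASP solver would
            changed = False
            for (x, y) in list(chain):
                for (y2, z) in facts:
                    if y == y2 and (x, z) not in chain:
                        chain.add((x, z))
                        changed = True
        return chain

    print(sorted(interaction_chains(interacts)))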



Poster 18:
Proposed Ontology for Seizure and Epilepsy


Presenter: Robert Yao
Arizona State University, United States

Additional Authors:
Graciela Gonzalez, Arizona State University, United States
Jeffrey Buchhalter, Phoenix Children's Hospital, United States

The understanding and classification of seizures and epilepsy syndromes have constantly changed with the advent of new knowledge from new technologies. Ontologies provide a structured knowledge framework that could aid in more precisely defining and standardizing terminologies and diagnoses. This in turn could enhance the abilities of researchers and clinicians to pinpoint the causes of a disorder, discover new treatment measures, and improve patient outcomes.

Hypothesis: We hypothesize that a more refined ontology for seizures and epilepsy syndromes that adequately reflects the latest measurements, observations and medical findings can be used to assist empirical diagnosis of epilepsy and to potentially differentiate new syndromes in a logical and standardized format.

Methods: A review of previously proposed seizure and epilepsy classifications is being done to determine the most general way to classify each seizure, syndrome, and epilepsy. By analyzing and defining the building blocks of epilepsy, an Epilepsy Ontology is iteratively formalized using Protégé. Each seizure and syndrome will be instantiated in the ontology to determine whether it provides a reasoning framework over epilepsy knowledge.

Results: A poly-axial ontology is being defined to encode the conceptual building blocks of seizures and epilepsy. The ontology will be open to both qualitative and quantitative evaluation, with data and evidence preferred over consensus expert opinion where available.

Conclusions: The aim of this ongoing work is to help clinicians better understand the etiology of seizures and the definitions of and relationships between seizures and epilepsy syndromes, and to provide a more helpful path towards research, diagnosis, and treatment of the disorder. Eventually, this ontology could be expanded for use with other diseases, providing more structured definitions. Such a standard framework could also help pinpoint knowledge deficits, which in turn should drive laboratory and clinical experiments to discover the missing knowledge.


Poster 19:
Exploiting Ontology Information for Extracting Keyphrases from Biomedical Articles

Presenter: Kyu-Baek Hwang
Soongsil University, Seoul, Korea

Additional Authors:
Sun Gon Kim, University of Seoul, Korea
Eunok Paek, University of Seoul, Korea

Keyphrases (or keywords) of a document compactly represent its content. They can be used for indexing or summarization purposes. Our method for keyphrase extraction is based on supervised machine learning combined with ontology information. It consists of two stages: (1) keyphrase candidate generation and (2) keyphrase selection. In the first stage, keyphrase candidates are generated by extracting every unigram, bigram, and trigram of the words in the title and abstract of each article. Also, a set of ontology terms is assigned to each article; for this, any automated method for ontology term assignment, e.g., a vector space model, can be adopted. The ontology terms are used to expand the set of keyphrase candidates. Specifically, keyphrases that frequently co-occur with the ontology terms assigned to a document are added to its candidate set. In the second stage, keyphrases are selected from the expanded candidate set by supervised machine learning. Features for supervised learning include term and inverse document frequencies, length, first/last occurrence positions, and relationships with ontology terms. The confidence and lift of an association rule between keyphrases and ontology terms are used to represent their relationships. Because multiple ontology terms are usually assigned to a document, ontology-related feature values are averaged across all of them.

Results: The proposed method was applied to a dataset consisting of 1,799 articles from three journals in the biomedical literature: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Journal of Computational Biology, and Journal of Proteome Research. The MeSH (Medical Subject Headings) descriptors, which constitute a biomedical ontology, are manually assigned to the articles published in these journals for PubMed indexing, and represent the subject content. In our experiments, MeSH descriptors were automatically assigned to each document of our dataset by a vector space model-based method. In addition, each article of these journals is annotated with about four to six author-provided keyphrases. These author keyphrases were used as a gold standard for evaluating keyphrase extraction. We conducted a 10-fold cross-validation experiment using several supervised machine learning methods, including naïve Bayes classifiers and Bayesian networks. The experimental results showed that the inclusion of ontology information improved keyphrase extraction performance by about 100% in terms of the F1-measure. When the number of extracted keyphrases was set to five, our method achieved an F1-measure of about 0.185 and the performance increase was 129%. We also compared our method with KEA, a method for keyphrase extraction using syntactic features (accessible at www.nzdl.org/Kea/index.html). Our method was always better than KEA regardless of the number of extracted keyphrases (the performance increase ranged from 2% to 98%). These results confirm that semantic information about document topics plays a central role in keyphrase extraction.

Conclusions: We proposed a method for keyphrase extraction from documents using ontology information. Through a set of experiments, we showed that including ontology information about document topics can greatly improve keyphrase extraction performance.
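
The confidence and lift features mentioned above come straight from association-rule mining; the sketch below computes them for a keyphrase/MeSH-term pair from toy document counts (the counts are invented for illustration).

    # Confidence and lift of the association rule "MeSH term -> keyphrase",
    # computed from toy document counts (invented for illustration).
    n_docs = 1000          # total documents
    n_term = 120           # documents assigned the MeSH term
    n_phrase = 80          # documents containing the keyphrase
    n_both = 48            # documents with both

    support_term = n_term / n_docs
    support_phrase = n_phrase / n_docs
    support_both = n_both / n_docs

    confidence = support_both / support_term          # P(phrase | term)
    lift = confidence / support_phrase                # >1 means positive association

    print("confidence = %.2f, lift = %.2f" % (confidence, lift))
    # confidence = 0.40, lift = 5.00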

