20th Annual International Conference on
Intelligent Systems for Molecular Biology


Posters

Poster numbers will be assigned May 30th.
If you cannot find your poster below, you have probably not yet confirmed that you will be attending ISMB 2012. To confirm your poster, find the confirmation link in the poster acceptance email, click on it, and follow the instructions.

If you need further assistance please contact submissions@iscb.org and provide your poster title or submission ID.

Category E - 'Functional Genomics'
E01 - Chromosome 13 Encoded Proteome Database Exploring the Comprehensive Landscape of Protein and Transcriptome.
Short Abstract: In an effort to build a comprehensive human proteome map, a chromosome-centric human proteome project (C-HPP) has recently been initiated. Chromosome 13, the largest acrocentric human chromosome and the one with the lowest gene density, has 335 predicted protein-coding genes, 64 of which are missing proteins. We have developed a gene-centric proteomic database for chromosome 13 by integrating transcriptomic data and other information from public databases. All chromosome 13 proteins are linked to other resources: UniProt, Ensembl and NCBI for protein and gene information; PeptideAtlas, GPMDB and SRMAtlas for observed MS and MRM information; and Antibodypedia and ProteinAtlas for antibody availability and tissue expression. Pathway, GO and disease information are also included in the database. For proteomic profiling of human placental tissue (control vs. pre-eclampsia), samples were processed with a standard SDS-extraction procedure for quantitative proteomic analysis by LC-MS/MS (Orbitrap). For genomic profiling, cDNA arrays (Illumina) were run and the data analyzed. Proteomic and genomic expression data from cell lines (a normal cell line and cancer cell lines such as hepatocellular carcinoma, gastric cancer and pancreatic cancer) will also be added to the chromosome 13 database. A comprehensive database of real genomic and proteomic data would provide another layer of functional insight into the pathogenesis of complex disease. (This study was supported by a grant from the National Project for Personalized Genomic Medicine, the Korean Health 21 R&D Project [A030003] and MediStar [A112047], Ministry for Health & Welfare, Republic of Korea.)
E02 - GenomeRNAi: A Phenotype Database for Large-scale RNAi Screens
Short Abstract: RNA interference (RNAi) represents an ideal method to systematically investigate loss-of-function phenotypes on a large scale, providing a rich source of functional gene annotation. A wide variety of biological assays are applied in RNAi screening experiments, ranging from visible phenotypes to transcriptional readouts, in cell-based or in vivo studies.
The aim of the GenomeRNAi database is to collect and make available published RNAi phenotype data and to facilitate the comparison of phenotype data across multiple screening experiments. Structured annotation guidelines have been developed to ensure consistency and comparability of the data. Currently the database contains phenotype entries from 96 cell-based experiments in Homo sapiens, as well as 150 screens in Drosophila melanogaster, 50 of which were performed in vivo. The database also provides detailed information on RNAi reagents, including calculations on specificity and efficiency.
GenomeRNAi features a user-friendly, intuitive web interface (www.genomernai.org) that allows users to browse RNAi screens or to search by gene, reagent or phenotype. Download options are available for individual RNAi experiments and for the entire set of screens. Links to external resources are provided, and external websites can link to GenomeRNAi via common identifiers such as Ensembl, UniProt, FlyBase or CG identifiers, or via stable GenomeRNAi screen identifiers. A template for direct author submission of RNAi data is available on the website.
The implementation of a DAS server is in progress, allowing the visualization of GenomeRNAi phenotype data in the context of a genome browser. An update on the curation progress and new functionalities of the website will be presented.
E03 - InterMine - RESTful searching of Integrated Biological Data Warehouses
Short Abstract: InterMine (www.intermine.org) is a free and open-source biological data warehousing system used in several projects worldwide. InterMine can integrate data from common biological formats and provides powerful query features through a web interface and RESTful web services.
Adopted by a growing number of major Model Organism Databases (MODs), InterMine is coming to define a common query and web-services platform for integrated biological data (such as genome annotation, expression studies, ontologies, interactions and literature). Using InterMine, researchers can build custom queries, use pre-defined template searches, export results, and upload and analyse lists of data.
Researchers may, for example, automate complex analytical pipelines: a user might find genes with a certain expression pattern in FlyMine, fetch the rat orthologues of those genes, and then fetch disease associations for the orthologues from RatMine. Or one might use the system to perform meta-analyses, such as intersecting a number of gene lists from different experiments, finding the Gene Ontology terms most significantly associated with the intersection, and retrieving the FASTA sequences of only those genes in the intersection annotated with one of those terms, for BLAST searches against a reference sequence elsewhere.
Client libraries in Python, Perl, Ruby and Java facilitate access to the web service API, supported by automatic code generation for any query. A key feature, and an area of significant current development, is a JavaScript API that makes many InterMine features, such as interactive results tables and graphical widgets, embeddable in other web pages.
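As a hedged illustration of the web service API, the sketch below uses the intermine Python client to run a simple query against FlyMine; the service URL, class name and field paths are plausible examples and may differ between mines and versions.

```python
# Sketch: querying an InterMine instance (here FlyMine) with the
# `intermine` Python client library. The service URL, class name and
# field paths are plausible examples and may vary between mines.
from intermine.webservice import Service

service = Service("http://www.flymine.org/query/service")

# Build a query over the Gene class: select output columns (the "view")
# and constrain the results to a single gene symbol.
query = service.new_query("Gene")
query.add_view("symbol", "organism.name", "length")
query.add_constraint("symbol", "=", "eve")

for row in query.rows():
    print(row["symbol"], row["organism.name"], row["length"])
```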
E04 - EasyBioDB: a pipeline for database modeling and construction by non-computer specialists
Short Abstract: Life/Health Sciences (LHS) professionals face various data bottlenecks when applying computational tools, particularly when they need to create their own databases (DBs). Biological/biomedical DBs (BDBs) have become indispensable for data management, operation and sharing among different users. The data (whether original, collected in, or compared with large public repositories) represent valuable and significant resources for ongoing knowledge extraction. Mining these data is an increasingly indispensable part of modern research, and the organized storage of data in DBs is obligatory. However, for a non-computer specialist, designing, modeling and developing a BDB with a standard relational DB management system (DBMS, such as MySQL or PostgreSQL) may still be a challenge. We have therefore developed EasyBioDB, a prototype pipeline for constructing novel BDBs that allows regular LHS professionals to establish, interrogate, rearrange, display and organize them. EasyBioDB also supports a model-driven approach to data architecture, which can take many different forms in biomedical scenarios, and we outline the benefits such a user-friendly tool might bring to biomedical communities. These benefits range from creating and populating local BDBs to using data mining tools to assist data analysis and functional inference. EasyBioDB shows how a simple pipeline can aid inexperienced data modelers, developers and administrators (such as LHS specialists without a computing background) who are often tasked with managing the complex data management infrastructures of LHS research and practice.
E05 - COSMIC: Examining Cancer Genomes in the Catalogue of Somatic Mutations in Cancer
Short Abstract: The Catalogue of Somatic Mutations In Cancer (COSMIC http://www.sanger.ac.uk/cosmic) aims to curate somatic mutations from across a wide range of sources, including published literature, data portals from global consortia such as TCGA and ICGC, and from the Cancer Genome Project (CGP) at the Sanger Institute, UK.

Data in COSMIC can be analysed in several ways. For computational approaches, COSMICMart (our BioMart instance) can combine results from other BioMart-enabled resources such as Ensembl and UniProt. We provide data exports in formats including CSV files and Oracle database exports. For visual analysis, our website provides GBrowse as well as customised displays of our data on various charts, with numerous analysis filters. Data are annotated using HGVS and HGNC standards and our controlled vocabulary for tumour classifications. We are currently redesigning our website to place more emphasis on genomic context whilst maintaining our gene-centric perspective. This gives us the opportunity to add new features such as a filter for excluding SNPs identified by the 1000 Genomes Project, display of Pfam domains, and links to biological pathways for selected genes. We actively integrate with the bioinformatics community, providing data via DAS and as selectable tracks in the UCSC and Ensembl genome browsers.
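For the computational route, a BioMart instance is typically queried by sending an XML query document to its martservice endpoint. The sketch below shows this generic pattern with Python's requests library; the endpoint URL, dataset, filter and attribute names are assumptions for illustration, not COSMICMart's actual schema.

```python
# Generic BioMart access pattern, as a sketch of how a mart such as
# COSMICMart might be queried programmatically. The endpoint URL,
# dataset, filter and attribute names below are illustrative
# assumptions only; consult the COSMICMart interface for real names.
import requests

QUERY = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1"
       uniqueRows="1" datasetConfigVersion="0.6">
  <Dataset name="cosmic_mutation" interface="default">
    <Filter name="gene_name" value="TP53"/>
    <Attribute name="gene_name"/>
    <Attribute name="mutation_aa"/>
    <Attribute name="primary_site"/>
  </Dataset>
</Query>"""

response = requests.get("http://www.sanger.ac.uk/biomart/martservice",
                        params={"query": QUERY})
response.raise_for_status()
print(response.text)  # tab-separated rows, one per mutation record
```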

The ongoing effort to curate high quality data from multiple sources, integrate with other biological resources, and improve analytical functionality will allow COSMIC to continue as a significant resource supporting cancer genome analysis.
E06 - RefEx: Reference expression dataset for functional transcriptomics
Short Abstract: RefEx (Reference Expression dataset; http://refex.dbcls.jp/) aims to provide a reference for mammalian gene expression, combining data measured by different methods, such as expressed sequence tags (ESTs), microarrays (GeneChip), CAGE (Cap Analysis of Gene Expression) and RNA-seq, for human, mouse and rat. RNA-seq gene expression data are accumulating rapidly, and functional annotation of such sequences from metadata, together with careful curation, is crucial for the biological interpretation of the large volumes of sequence obtained. In addition to bulk download of the data, RefEx can be accessed via a significantly improved web interface, including a form in which users can search incrementally by gene name or gene symbol. RefEx can search not only for tissue-specific patterns across gene expression data covering many tissues and thousands of genes, but also for the expression of particular groups of genes defined by InterPro or Gene Ontology. In the detailed view of search results, relative gene expression values are mapped onto a 3D body image, and graphical histograms are available for the different measurement methods. Moreover, to offer a fast and flexible environment for visualizing and analyzing gene expression data, we are developing a TIBCO Spotfire Web Player interface.
We will present the current status of the project and discuss how to improve a reference gene expression database so that it truly meets the needs of experimental biologists.
E07 - Technology development for database integration to enable re-use of public biological data
Short Abstract: At the Database Center for Life Science (DBCLS), we have been tackling the problem of how to organize the various types of life science databases. We are currently developing database integration technologies to exploit the huge amount of public data, in collaboration with biologists from various sectors in Japan.

As part of this technology development, we are working to integrate data from next-generation sequencers (NGS) through annotated metadata. In particular, we make use of the metadata of NGS database entries (experimental conditions, biological samples and sequencer information) to simplify the re-use of such data. Notably, we are collecting NGS database entries linked, via text mining of the full text of publications, to publication records. Moreover, QC values are pre-calculated for all entries for which the corresponding sequence data are available, and these can easily be retrieved from the web interface.

To meet increasing demand for the re-use of transcriptome sequence data and expression array data, we maintain a reference expression dataset (RefEx) for comparing transcriptome data across various gene expression measurement technologies.
In addition, we are developing a 'Google-like' full-text search engine for genes and transcripts called GGRNA, which uses a compressed suffix array to search archived transcriptome data, including nucleotide and amino acid sequences.
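To illustrate the indexing principle (not GGRNA's actual implementation, which relies on a compressed suffix array for space efficiency), here is a toy uncompressed suffix-array search in Python.

```python
# Toy sketch of suffix-array substring search, the principle behind
# full-text sequence search engines such as GGRNA. GGRNA itself uses
# a *compressed* suffix array; this uncompressed version illustrates
# only the O(m log n) query, not the space-efficient index.

def build_suffix_array(text):
    """Naive construction: sort all suffix start positions."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary-search the sorted suffixes for every match of pattern."""
    lo, hi = 0, len(sa)
    while lo < hi:                      # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                      # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)].startswith(pattern):
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

seq = "ATGGCGTACGATGGA"
sa = build_suffix_array(seq)
print(find_occurrences(seq, sa, "ATG"))   # -> [0, 10]
```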

We will present the current status of the project and the utility of the data produced by the system.
E08 - MISO: an open-source LIMS for small-to-large sequencing centres
Short Abstract: Sequencing centres differ not only in their scale and output, but also in their requirements for data management. The efficient storage of genomic metadata is vital for all sequencing centres.

Off-the-shelf solutions are often prohibitively expensive, especially for smaller centres. Furthermore, paid customised support is often required, and the extensibility of these systems is rarely in the hands of the community. Alongside the desire to tailor an information system in-house, data formats can change, new platforms are developed, and those platforms can evolve rapidly. These are valid concerns both for large centres characterised by high-throughput data production and for smaller laboratories with constrained IT budgets and potentially project-specific requirements.

Hence, we present MISO ("Managing Information for Sequencing Operations"), a freely available open-source LIMS for recording next-generation sequencing (NGS) metadata. MISO can store relevant metadata for the most common NGS platforms (e.g. Illumina GA, HiSeq and MiSeq; Roche 454; ABI SOLiD; and PacBio RS) and automatically generate data submission schemas for public repositories (e.g. the EBI Sequence Read Archive). It comprises many essential features, such as secure authentication, fine-grained access control, barcode tracking, and reporting. We are also developing novel modules that expose run information from these NGS platforms, making automated tracking and statistics generation an integral part of MISO, as well as an automated submission pipeline to facilitate easy public release of data.
E09 - LIMS and Bioinformatics Support for the GLBRC Seed-to-Biofuels Research Pipeline
Short Abstract: Through collaborations across the Center, GLBRC has developed and benchmarked a biomass-to-biofuels pipeline. This pipeline includes procedures to grow, harvest, analyze, and process biomass to produce hydrolysates, which are subsequently fermented by microbial cultures to produce fuels and chemicals. The Informatics and Information Technology Group has developed information and data management systems to support this pipeline from seed to fermentation and the multi-omics analysis of samples pulled from it at multiple steps. These systems include a Laboratory Information Management System (LIMS) that supports tracking experimental materials and samples, planning experiments, and collecting raw experimental data. These data are accessed by scientists who analyze them using a variety of tools, many of which are hosted on a Galaxy server. The processed and analyzed data produced in these analyses are housed in the Great Lakes Omics Warehouse (GLOW), where they are available to all scientists in the Center.
E10 - Population of the BioSamples Database at the EBI
Short Abstract: The BioSamples Database stores information about biological samples ultimately used for experimental assays. Goals of the BioSamples Database include:
(i) recording and linking sample information consistently within cooperating databases;
(ii) reducing data entry for submitters to cooperating databases by entering sample descriptions once and referencing them later in data submissions to assay databases;
(iii) supporting cross database queries by sample characteristics;
(iv) data exchange with other sample databases.
Each sample is assigned a unique accession number that can be referenced when submitting to a cooperating database. Samples can be queried by other attributes such as material, disease names or sample providers.

Data is submitted directly to the BioSamples Database in a tab-delimited format (SampleTab). Several sources of publicly available data are also converted into SampleTab format and loaded into the BioSamples Database. A subset of samples are identified as reference samples; these receive additional manual curation and are promoted to users. Reference samples are highly re-used samples of significant scientific importance, such as common mouse strains, cell line collections, or samples from projects such as ENCODE.
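As a rough sketch of what a SampleTab submission looks like mechanically, under the assumption that SampleTab, like MAGE-TAB, is a tab-delimited file whose sample section ([SCD]) lists one row per sample, the snippet below writes a minimal sample table with Python's csv module. The column headers are illustrative simplifications; the authoritative format is defined by the BioSamples SampleTab specification.

```python
# Minimal sketch of writing a SampleTab-like sample description table.
# SampleTab is tab-delimited with an [MSI] (submission) section and an
# [SCD] (sample) section; the headers below are illustrative
# simplifications -- consult the BioSamples SampleTab specification.
import csv
import sys

samples = [
    {"Sample Name": "donor1", "Characteristic[organism]": "Homo sapiens",
     "Characteristic[sex]": "female"},
    {"Sample Name": "donor2", "Characteristic[organism]": "Homo sapiens",
     "Characteristic[sex]": "male"},
]

writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(["[SCD]"])                 # sample section marker
header = list(samples[0])
writer.writerow(header)
for sample in samples:
    writer.writerow([sample[col] for col in header])
```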

All samples in the BioSamples Database are automatically curated using a variety of tools and approaches, including mapping to ontology terms. Relationships between samples (“derived from”, “child of”, “equivalent to”) provide additional information and are used to improve annotation quality.

Presented here is an overview of data flow into and within the BioSamples Database. This includes the SampleTab file format with associated validation and parsing software, as well as annotation and mapping tools.
E11 - From the Era of Genome-Wide Mouse Mutagenesis to the Era of Genome-Wide Mouse Phenotyping in Mouse Genome Informatics (MGI)
Short Abstract: High-throughput phenotyping of knockout mice generated in the genome-wide mutagenesis efforts of the International Knockout Mouse Consortium (IKMC) is broadening the landscape of mouse mutant phenotypes to genome scale. Mouse Genome Informatics (MGI) has a tradition of integrating published mouse phenotypes with mouse genes and associated genetic, molecular and functional data. MGI is now incorporating high-throughput phenotyping data into this extended biological context. We will present our analysis of the integration challenges posed, and a view of these data within the MGI user interface.

The International Mouse Phenotyping Consortium (IMPC), a group of 11 Phenotyping Centers from 7 countries, is establishing pipelines to phenotype knockout mice from the mutant ES cell line resources of the IKMC. Participating Centers interpret pipeline test results to make phenotype calls and map these results to the Mammalian Phenotype Ontology (MP). Challenges for MGI include 1) integrating phenotyping data from different contributing Centers, where data conflicts can exist, 2) integrating these data with phenotypes published in the literature for mice with mutations in the same genes, and eventually with follow-up phenotype data involving the same knockout alleles sent through the automated pipelines, and 3) presenting summary views of these integrated data so that users have a more complete picture of the phenotypic consequences of these knockout mutations.

Our analysis of high-throughput phenotyping data from the Sanger Mouse Genetics Project and Europhenome Mouse Phenotyping Resource has guided development of a user interface that synthesizes a coherent phenotype summary while providing access to primary data.

Supported by NIH grant HG000330
E12 - Managing complex bioinformatics pipelines through PyPedia and MOLGENIS
Short Abstract: The development of modern bioinformatics methods requires knowledge of advanced software engineering, long-term maintenance plans, modules for testing and verification and, finally, open source and open access policies. Similarly, the size and complexity of -omics data require data management tools that can facilitate submission to, and monitoring of, high-performance computing environments. Here we present how PyPedia and MOLGENIS can be combined to address these issues. PyPedia is an open-source online IDE based on the Python language that operates as a wiki of bioinformatics methods. Every article contains the source code, documentation, an HTML form for execution, unit tests and edit permissions for a method. Execution can take place on a local computer, on a remote server or online in the Google App Engine. Editing is allowed according to the permissions set by the article's creator. Articles are divided into (1) peer-reviewed, quality-checked and tested methods and (2) user-maintained, experimental, sub-community-driven methods. Existing articles offer novel implementations of common bioinformatics problems, wrappers of Biopython methods and Galaxy pipelines. A special REST web service distributes dependency-free, ready-to-run, peer-reviewed source code to any external tool. MOLGENIS, on the other hand, is a rich data management and analysis application generator that also offers a REST interface for data access. By bridging these tools, a researcher can manage large datasets; search, select and configure an execution method; submit and monitor the processing; and visualize the results.
E13 - GlowWorm: a web application for standards-compliant annotation, storage and retrieval of omics experimental data
Short Abstract: In the biosciences a number of minimum information guidelines and standards have been established to facilitate the annotation and sharing of experimental data. The goal of such standards is to maximize the utility of the experimental data, promoting the scientific discovery process. Additionally many journals now require standards-compliant experimental data as a condition of publication.
However, in the absence of a dedicated data management system, experimental data is often stored in an ad-hoc manner, without sufficient metadata, which ultimately diminishes the utility of the data.
To address such issues we have developed GlowWorm, a web-based data repository built with the Ruby on Rails web development framework that provides metadata annotation, storage and sharing for microarray, next-generation sequencing and other forms of omics data. The initial requirements were articulated by a group performing gene expression experiments on yeast using NimbleGen tiling arrays.
GlowWorm's metadata model is an implementation of the MAGE-TAB object model. MAGE-TAB is a communications standard established by the FGED Society, primarily for microarray data, in which annotation is recorded in a tab-delimited, spreadsheet-based format. GlowWorm parses and exports MAGE-TAB documents, which can later be used to submit data to public repositories. It also provides a search interface through which the metadata can be leveraged to find datasets of interest. Using the RESTful conventions built into the Rails framework, we have exposed web services that can be consumed by the various analysis tools provided by our group.
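As a hedged sketch of the kind of parsing involved, the snippet below reads the SDRF component of a MAGE-TAB document (the tab-delimited table relating samples to data files) and collects the Characteristics columns per sample. The file name and column names are illustrative, and real SDRF files with repeated column headers need more careful handling than csv.DictReader provides.

```python
# Sketch: extracting sample annotations from a MAGE-TAB SDRF file,
# the tab-delimited sample-and-data-relationship table that systems
# like GlowWorm parse. File name and column names are illustrative;
# repeated SDRF columns would require a more careful reader.
import csv

def read_sdrf_characteristics(path):
    """Map each Source Name to its Characteristics[...] annotations."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        annotations = {}
        for row in reader:
            name = row.get("Source Name", "")
            annotations[name] = {
                col: val for col, val in row.items()
                if col and col.startswith("Characteristics")
            }
        return annotations

# e.g. read_sdrf_characteristics("experiment.sdrf.txt")
```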
We intend to release GlowWorm as open source in the near future.
E14 - Foundations of the Semantic Integration of Databases for the Knowledge Base of Biomedicine (KaBOB)
Short Abstract: Biomedical researchers face a profound challenge in keeping track of, and making sense of, numerous databases, a growing body of literature, and data from high-throughput experiments. To assist researchers and tools in efficiently leveraging this flood of information, we are producing the Knowledge Base of Biomedicine (KaBOB) to semantically integrate these disparate sources.

We have begun constructing the foundation for semantically integrating biomedical databases by building representations of the database records themselves, using an extension of the Information Artifact Ontology, and we are working toward transforming these records into representations in terms of biomedical concepts grounded in the Open Biomedical Ontologies. The intermediate record representation allows access to the information while the biomedical representations are still being constructed, and it provides provenance for the biomedical knowledge.

We have produced over 8 billion RDF triples from 19 data sources, including information about genes and gene products, pathways, diseases and drugs. Integrating the information from the database records into a unified knowledge base starts by reconciling the many disparate identifiers for genes and gene products in these sources, using mappings from 12 different data sources. We are continuing to grow the biomedical section of the knowledge base by producing declarative rules that map database records to biomedical knowledge structures and applying those rules (via forward chaining) to produce additional RDF triples, while tracking the provenance of their construction. These rules extend and connect biomedical entities using concepts from the biomedical ontologies.
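The record-level representation can be pictured as follows: each database record becomes an RDF node typed as an information artifact, with its fields attached as triples. This rdflib sketch uses invented namespaces and property names, not KaBOB's actual schema.

```python
# Sketch: representing a database record as RDF triples, in the spirit
# of KaBOB's record-level representation. The EX namespace and the
# hasField_* property names are invented placeholders, not KaBOB's
# actual schema; IAO_0000030 is the IAO "information content entity".
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/record/")
IAO = Namespace("http://purl.obolibrary.org/obo/IAO_")

graph = Graph()
record = EX["uniprot_P04637"]
graph.add((record, RDF.type, IAO["0000030"]))
graph.add((record, EX["hasField_id"], Literal("P04637")))
graph.add((record, EX["hasField_geneName"], Literal("TP53")))

print(graph.serialize(format="turtle"))
```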
E15 - Consolidation of fractionated single-copy regions following whole genome duplication
Short Abstract: Fractionation is arguably the greatest cause of gene order disruption following whole genome duplication, causing severe biases in chromosome rearrangement-based estimates of evolutionary divergence. This presentation shows how to correct for this bias almost entirely by means of a “consolidation” algorithm that detects and suitably transforms identifiable regions of fractionation. We characterize the process of fractionation and the performance of the algorithm through realistic simulations. We apply our method to a number of core eudicot genomes, including poplar, grapevine and cacao, and by studying the detected fractionation regions we are able to address topical issues in polyploid evolution: is fractionation systematically biased towards one homeolog or the other (subgenome dominance), and does duplicate reduction proceed gene by gene, or are multiple-gene regions deleted or inactivated as a whole?
E16 - Developing tools to aid data entry by FlyBase literature curators
Short Abstract: FlyBase literature curators read research papers involving Drosophila and extract selected data types, which are manually typed into text proformae. The text is formatted to suit software that parses the data into a Chado production database. We studied the curation process to identify steps that could be made more efficient with appropriate software. This highlighted two processes that would particularly benefit from software tools: 1) skim curation, which involves creating a record for a paper in which data types and genes are marked; when parsed into the database, skim curation connects genes to the paper and places the paper in an ordered list for curation prioritization; and 2) Gene Ontology (GO) annotation, identified as a vital but time-consuming curation activity. In particular, the need for exact formatting of GO annotations prompted us to develop a line-maker program that splits the GO line into sections, each of which calls up controlled vocabularies or identifiers such as gene symbols. A tab-and-type action quickly selects items that are correctly formatted and make sense to the curator. With a graphical user interface, these two activities can be performed more efficiently. We discuss the designs, the improved efficiency and ideas for future additions to the tool.
E17 - ProBiS-Database: Precalculated Binding Site Similarities and Local Pairwise Alignments of PDB Structures
Short Abstract: ProBiS-Database is a searchable repository of precalculated local structural alignments in proteins, detected by the ProBiS algorithm in the Protein Data Bank. Identification of functionally important binding regions of a protein is facilitated by structural similarity scores mapped onto the query protein structure. PDB structures that have been aligned with a query protein may be rapidly retrieved from the ProBiS-Database, which is thus able to generate hypotheses concerning the roles of uncharacterized proteins. Presented with an uncharacterized protein structure, ProBiS-Database can discern relationships between such a query protein and other, better-known proteins in the PDB. Fast access and a user-friendly graphical interface promote easy exploration of this database of over 420 million local structural alignments. The ProBiS-Database is updated weekly and is freely available online at http://probis.cmm.ki.si/database.
E18 - The next generation of SCOP and ASTRAL
Short Abstract: We released new versions of both SCOP and ASTRAL (1.75A) in March 2012. The new releases are the first to be fully dependent on a new SQL-based infrastructure and build procedure, and the first to be presented to the public through a single, unified interface. They also represent the first public deployment of our fully automated classification scheme: more than 11,000 new PDB entries were added to the current release, without sacrificing the reliability that SCOP has accumulated through years of careful manual curation. We plan to introduce additional features in a series of stable releases, while a major reclassification (SCOP 2.0) is in progress.
E19 - Using HDF5 to Work with Large Quantities of Biological Data
Short Abstract: The HDF5 technology platform allows users to organize, store, share, and access large and complicated data. It consists of a data model, file format, library (C/C++/Java), and tools. HDF5 is used worldwide by government, industry, and academia in a wide range of science, engineering, and business disciplines. Prominent users include MathWorks (Matlab can read HDF5 files), NASA (HDF-EOS5), and Applied Biosystems (primary image data storage).

HDF5 combines the flexible and extensible data layout of a database with the portability and ease of access of individual files. As biology becomes more data-driven, with data sets of ever-increasing size, HDF5 can provide a way for researchers to work with their data via a high-performance, scalable, platform-independent technology suite.

Some aspects of HDF5 include: facilities for item association, hierarchies, and annotation; flexible user-defined types; files are self-contained and self-describing; portable across platforms and architectures; high I/O performance, parallel I/O; out-of-core data access (partial I/O); unlimited file size support; support for compression and other custom filters; suitable for long-term data archiving; free and open source (BSD license); and high-quality support, training, and documentation.
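A minimal sketch of several of these features through the h5py Python binding: hierarchy, attribute annotation, chunked compressed storage, and partial I/O. The file, group and dataset names are invented for illustration.

```python
# Minimal h5py sketch of the HDF5 features listed above: hierarchy,
# annotation (attributes), chunked compressed storage, and partial
# (out-of-core) I/O. Group and dataset names are invented examples.
import h5py
import numpy as np

with h5py.File("experiment.h5", "w") as f:
    grp = f.create_group("sample_001/reads")          # hierarchy
    data = grp.create_dataset(
        "coverage", shape=(1_000_000,), dtype="f4",
        chunks=True, compression="gzip")              # chunked + compressed
    data.attrs["genome_build"] = "hg19"               # annotation
    data[0:100] = np.random.rand(100)                 # partial write

with h5py.File("experiment.h5", "r") as f:
    window = f["sample_001/reads/coverage"][500:600]  # partial read
    print(window.shape)
```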

There has been a good deal of interest in HDF5 in the life sciences. We want to support that interest any way we can, including working with communities that wish to adopt HDF5. Please stop by and talk to us about your data needs!
E20 - Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis - The CAMERA Resource
Short Abstract: The Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis (CAMERA, http://camera.calit2.net/) is a database and associated computational infrastructure that provides a single system for depositing, locating, analyzing, visualizing and sharing data about microbial biology through an advanced web-based analysis portal. CAMERA collects and links metadata relevant to environmental metagenome datasets with annotation in a semantically aware environment, allowing users to write expressive semantic queries against the database. To meet the needs of the research community, users can query metadata categories such as habitat, sample type, time, location and other environmental physicochemical parameters. CAMERA complies with the standards promulgated by the Genomic Standards Consortium (GSC) and plays a role within the GSC in extending standards for the content and format of metagenomic data and metadata and their submission to the CAMERA repository. To ensure wide, ready access to data and annotation, CAMERA also provides data submission tools that allow researchers to share with research communities; it has multiple interfaces for easy submission of large or complex datasets and supports pre-registration of samples for sequencing. CAMERA integrates a growing list of bioinformatic applications, tools and viewers for querying, analyzing, annotating and comparing publicly available and user-provided metagenome and genome data. CAMERA has also modified and developed tools for next-generation sequence data analysis (e.g. Illumina and SOLiD data). This data-oriented view of an analysis enables communication of what has occurred within a given workflow, and reproducibility of the computation itself.
E21 - DOMMINO: Towards integrating DNA-, RNA-, and protein-mediated macromolecular interaction data
Short Abstract: Macromolecular interactions in a cell are mediated by RNA, DNA and protein molecules and underlie the cell's basic mechanisms. The interactions mediated by proteins involve not only protein domains but also various non-structured regions, such as peptides, N- and C-termini, and inter-domain linkers. We developed DOMMINO (http://dommino.org), a comprehensive Database of MacroMolecular INteractiOns, which initially included interactions between protein subunits of all the structured and non-structured types described above. DOMMINO contains both intra-chain and inter-chain interactions, with domain definitions and boundaries obtained from SCOP and SUPERFAMILY annotations. The database is updated weekly and currently (April 2012) has ∼595,000 entries, ∼141,000 of which are determined using SCOP domain definitions and ∼454,000 using domain predictions by SUPERFAMILY. Surprisingly, we found that interactions mediated by unstructured protein regions comprise ∼50% of all interactions. We have implemented a web interface for DOMMINO that allows users to flexibly search and study macromolecular interactions at the network level as well as at the atomic level. We have now added to DOMMINO the interactions involving nucleic acids (protein-DNA, protein-RNA and RNA-RNA interactions).
E22 - Automatic extraction of correlations between annotation terms in databases to find similar concepts, synonyms, and multifunctional genes
Short Abstract: With the widespread use of Gene Set Enrichment Analysis (GSEA), characterizing a given set of genes by identifying significantly enriched annotations has become an important task. Interpreting a set of annotations is often difficult for the user: the annotations are shown as if they were independent, even though some annotations are in fact correlated with each other. To understand the relationships among annotations, we comprehensively examined how strongly each annotation is correlated with the others through genes. We selected ten gene annotations (Gene Family, Gene Ontology, InterPro, KEGG pathway, protein-protein interaction, SCOP, SOSUI membrane protein prediction, OMIM, tissue specificity of gene expression, and WoLF PSORT) in the integrated human gene database H-InvDB. For each pair of individual annotation terms (e.g. GO:0004252 and IPR001627), the correlation was evaluated using Fisher's exact (two-sided) test with Bonferroni correction. As a result, we found 21,047 pairs with positive correlation and 793 pairs with negative correlation. An example of a positive relationship is SCOP family g.44.1.1 (RING finger domain) and IPR001841 (Zinc finger, RING-type). In addition to finding similar concepts, we estimated the conditional probability that one term occurs in a gene's annotation given the other term. We also obtained negative relationships; many of these are term pairs unlikely to co-exist in one gene, such as membrane protein and soluble protein. This information will help refine predictive annotation and identify multifunctional genes. In the future, we will provide a new system for interpreting the multiple annotations from GSEA, using the conditional probabilities together with a Bayesian network approach.
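A minimal sketch of the test described, using SciPy: build the 2×2 contingency table of genes by membership in two annotation terms, apply the two-sided Fisher's exact test, and Bonferroni-correct for the number of term pairs examined. The gene and term sets below are toy data.

```python
# Sketch of the pairwise annotation-correlation test described above:
# 2x2 contingency table of genes by membership in two annotation terms,
# two-sided Fisher's exact test, then Bonferroni correction.
# The gene/term sets and number of tests are toy values.
from scipy.stats import fisher_exact

genes = set(range(100))          # toy gene universe
term_a = set(range(0, 30))       # genes annotated with term A
term_b = set(range(10, 40))      # genes annotated with term B
n_tests = 45                     # e.g. number of term pairs examined

both = len(term_a & term_b)
only_a = len(term_a - term_b)
only_b = len(term_b - term_a)
neither = len(genes) - both - only_a - only_b

odds, p = fisher_exact([[both, only_a], [only_b, neither]],
                       alternative="two-sided")
p_bonferroni = min(1.0, p * n_tests)
print(f"odds ratio={odds:.2f}, corrected p={p_bonferroni:.3g}")
```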
E23 - Toward interoperable bioscience data
Short Abstract:
