ISMB/ECCB 2011 Posters

19th Annual International Conference on
Intelligent Systems for Molecular Biology and
10th European Conference on Computational Biology

Accepted Posters

Category 'Y' - 'Text Mining'

Poster Y01

Biomedical Natural Language figure Processing for Intelligent Figure Search

Hong Yu University of Wisconsin-Milwaukee

Short Abstract: Figures represent important knowledge in biomedical literature. An intelligent biomedical figure search engine can assist biocuration, provide biomedical scientists with targeted evidence search, and provide a method for automatic validation of genome-wide high-throughput predictions. Currently, neither natural language processing (NLP) nor image processing algorithms address the unique complexities inherent to accurate biomedical figure searching. Most NLP approaches focus on text, and, for the most part, ignore figures altogether. On the other hand, traditional image search processing approaches frequently ignore important associated text. No approach takes advantage of the semantics that relate in-text content to figures that are uniquely encased in biomedical publications. We open a new research direction entitled “biomedical Natural Language figure Processing” (bNLfP) that combines traditional NLP and image processing by using the semantic relationships that exist between published biomedical text and figures. I will present two innovative bNLfP algorithms: figure ranking and figure text extraction.

Poster Y02

Differing structure and content of abstracts and full text journal articles

K. Bretonnel Cohen U. Colorado School of Medicine

Helen Johnson (U. Colorado School of Medicine, Pharmacology); Karin Verspoor (U. Colorado School of Medicine, Pharmacology); Christophe Roeder (U. Colorado School of Medicine, Pharmacology); Lawrence Hunter (U. Colorado School of Medicine, Pharmacology);

Short Abstract: Abstracts have historically been the main input to biomedical text mining, but a recent paradigm shift is seeing a major move towards mining the full text of journal articles. This paper demonstrates that a number of differences with consequences for biomedical text mining exist between abstracts and full text articles, and lays out a roadmap for adapting to the new challenges presented to us by full text.

Poster Y03

Implicit conceptual links in the literature: a large untapped reservoir of gene discovery

Herman van Haagen University of Leiden

Martijn Schuemie (Erasmus University of Rotterdam, Biosemantics); Peter-Bram 't Hoen (University of Leiden, Human Genetics); Marco Roos (University of Leiden, Human Genetics); Barend Mons (University of Leiden, Human Genetics); Johan den Dunnen (University of Leiden, Human Genetics); Gert-Jan van Ommen (University of Leiden, Human Genetics); Erik Schultes (University of Leiden, Human Genetics);

Short Abstract: The published literature and bioinformatic data are growing faster than individual researchers can track. Text- and data-mining tools have been developed to retrieve explicit links between pairs of concepts (e.g. genes and diseases). However, in addition to explicit links, text also contains a complex network of indirect (implied) links containing useful information. We applied concept profiling to probe this vast network of implied information. In a retrospective analysis of 18 gene-disease relationships, implicit information alone could prioritize the causative gene on average within the top 13 out of 200 genes located in a specified linkage interval, at least one year before publication of the landmark paper. The concepts shared between gene and disease enabled the evaluation of the plausibility of the inferred relationship. Of the 40,404,412 possible gene-disease pairs, 120,246 (47%, p < 0.003) arose exclusively from implicit relations. These results reveal an enormous untapped discovery potential in the implied information of biomedical literature.

Poster Y04

ANDCell-ANDVisio system: knowledge extraction using text-mining in the systems biology

Pavel Demenkov Institute of Cytology and Genetics

Ewgenia Yarkova (Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences); Timofey Ivanisenko (Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences); Vladimir Ivanisenko (Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences);

Short Abstract: Work with scientific literature is required for research in every knowledge area. The PubMed database currently contains about 19 million abstracts. Analysis of this bulky unstructured information is very time-consuming. Furthermore, modern approaches to the analysis of unstructured information must also refer to another important source, the factual databases for molecular biology and genetics.

A method for the automated extraction of information about molecular-genetic interactions from PubMed abstracts was developed using a text-mining approach. For text mining, we used previously developed thesauri of the names of proteins, genes, microRNAs, metabolites, biological pathways, diseases, cells, and organisms. To recognize facts describing molecular-genetic interactions in abstract texts, we created more than 4000 patterns, or decision rules. These templates provide automated extraction of knowledge from PubMed abstracts about molecular-genetic interactions, gene regulation, catalytic processes and other associations between facts, and its representation as semantic association networks. The extracted information was integrated by reconstructing semantic association networks that join facts from the literature about molecular-genetic regulation and physical interactions, as well as associations between molecular-genetic objects, biological processes, and diseases.
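The combination of name thesauri and extraction patterns described above can be illustrated with a toy sketch. The thesaurus entries, the three trigger verbs and the single pattern below are hypothetical examples chosen for illustration; they are not taken from the ANDCell rule set, whose format is not described in the abstract.

```python
import re

# Hypothetical mini-thesaurus and trigger verbs (illustrative assumptions only).
PROTEINS = {"p53", "MDM2", "AKT1"}
VERBS = {"activates": "activation", "inhibits": "inhibition", "binds": "binding"}

# Longer names first so that a short name never shadows a longer one.
_name = "|".join(sorted(map(re.escape, PROTEINS), key=len, reverse=True))
_verb = "|".join(VERBS)

# One template of the form "<protein> <verb> <protein>".
PATTERN = re.compile(rf"\b({_name})\s+({_verb})\s+({_name})\b")

def extract_interactions(sentence):
    """Return (source, interaction type, target) triples matched by the template."""
    return [
        (m.group(1), VERBS[m.group(2)], m.group(3))
        for m in PATTERN.finditer(sentence)
    ]
```

Each extracted triple can be read as one edge of an association network, with the interaction type as the edge label. A realistic rule set would need many more templates to cover passive voice, nominalizations and intervening words.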

The ANDCell system contains the knowledge base and the ANDVisio program for associative network reconstruction. The ANDCell knowledge base contains more than 5 million molecular-genetic interaction facts. The ANDVisio program allows the user to access the database and presents the results graphically as associative networks. The system has a user-friendly interface with links to the molecular-genetic databases and to the articles from which the information was extracted.

Poster Y05

Self-Training Improves Robustness of Protein-Protein Interaction Extraction from Text

Philippe Thomas Humboldt University of Berlin

Illés Solt (Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics); Ulf Leser (Humboldt University of Berlin, Knowledge Management in Bioinformatics);

Short Abstract: Motivation: Automated extraction of protein-protein interactions (PPI) from the literature is a key challenge in current biomedical text-mining. The past years have seen a certain increase in the performances reported for novel methods. However, these results are typically obtained using corpus-wise cross-validation, which, as shown by a number of more recent works, tends to be overly optimistic when trying to extrapolate the performance on new text.
Result: In this work, we present a general method for increasing the robustness of PPI extraction methods that is based on self-training. Self-training is a semi-supervised learning technique where a classifier is first trained, then applied, and then retrained on the results of the first application. We achieve consistent improvements in both extrinsic performance (0.8-6.6% in F1) and robustness (the F1 gap to cross-validation reduced by 41%) compared to the over-optimistic estimates on extrinsic performance provided by cross-corpus evaluation. Notably, our method is agnostic to the particular information that is extracted and can thus be used to improve performance in many tasks in biomedical text-mining.
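The train, apply, retrain loop of self-training can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nearest-centroid classifier stands in for a real PPI extraction model, and the `min_margin` confidence filter is an assumption added here, since the abstract does not say how predictions are selected for retraining.

```python
from statistics import mean

def train_centroids(points, labels):
    """Fit a toy nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for label in set(labels):
        members = [p for p, l in zip(points, labels) if l == label]
        centroids[label] = tuple(mean(dim) for dim in zip(*members))
    return centroids

def predict(centroids, point):
    """Return (label, margin); margin is the distance gap between the two nearest centroids."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(point, c)) ** 0.5, label)
        for label, c in centroids.items()
    )
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

def self_train(labeled_x, labeled_y, unlabeled_x, min_margin=1.0):
    """Train, apply to unlabeled data, then retrain with confident pseudo-labels."""
    model = train_centroids(labeled_x, labeled_y)          # 1. train
    pseudo_x, pseudo_y = [], []
    for x in unlabeled_x:                                  # 2. apply
        label, margin = predict(model, x)
        if margin >= min_margin:                           # assumed confidence filter
            pseudo_x.append(x)
            pseudo_y.append(label)
    # 3. retrain on the original labels plus the pseudo-labeled examples
    return train_centroids(labeled_x + pseudo_x, labeled_y + pseudo_y)
```

In this sketch the unlabeled examples would come from the target text collection, so that the retrained model shifts towards the data it will actually be applied to.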

Poster Y06

Extraction of protein cellular localization using the Locminer text mining system

Martin Krallinger Spanish National Cancer Research Centre

Alfonso Valencia (Spanish National Cancer Research Centre, Structural Biology and Biocomputing Programme); Ashish Tendulkar (Indian Institute of Technology, Madras, Department of Computer Science and Engineering);

Short Abstract: Determining the subcellular location of proteins is crucial to understanding their biological and functional properties. We implemented the Locminer system, which explores semantic-syntactic frames for extracting fine-grained associations between proteins and subcellular locations from sentences. This tool is based on localization frames, a set of trigger verbs relevant for expressing locative relations and a location sentence classifier. A comprehensive subcellular location dictionary integrating keywords/synonyms from SwissProt, Cellular Component terms from Gene Ontology, and an in-house location dictionary was created. After detecting protein names co-occurring with location terms, a total of 1,288 sentences were manually revised to derive 396 hand-crafted location frames, covering mainly binary protein-localization relations, in addition to associations to multiple (alternative) locations. We semi-automatically expanded locative expressions, selecting 220 location and motion-relevant verb roots and including verb synonyms, English spelling variants, affix variants (using 14 general English affixes such as de-, sub-, co- and trans-, and 5 domain-specific affixes such as immuno- and radioimmuno-), hyphenation variants, inflectional variants and nominalized forms. This resulted in 11,544 expanded forms and a dictionary of 161,616 potential location. These variants were filtered based on their instantiation in the whole PubMed database (leaving a total of 6,436 location triggers). To prioritize location words/expressions for subcellular relevance we considered the fraction of cellular location term co-occurrences with respect to their total frequency in PubMed, as well as a score provided by an SVM sentence classifier trained on documents annotated as location relevant.

Poster Y07

Reflect: a browser plugin for life scientists

Janos Binder European Molecular Biology Laboratory

Reinhard Schneider (European Molecular Biology Laboratory, Structural and Computational Biology Unit); Sean O'Donoghue (European Molecular Biology Laboratory, Structural and Computational Biology Unit); Lars Juhl Jensen (University of Copenhagen, NNF Center for Protein Research);

Short Abstract: During the past decade, extensibility of web browsers became a killer feature. We have built a browser plugin called Reflect, which augments the life science browsing experience. By tagging an article with a single click, any user can access further information related to proteins, genes or chemicals in a popup window.
Reflect has been widely used over the past years, and different websites have incorporated its features through its API; e.g. it can be switched on at the ScienceDirect website by highlighting keywords. We aim to keep it simple and user friendly, so that relevant information about the selected terms can be retrieved without having to access any complex databases. This year we are empowering our software with new features like sub-cellular and tissue localization of proteins, while keeping the current visualization abilities, e.g. sequence, structure and interaction network. The software is freely downloadable at: http://www.reflect.ws

References:

O’Donoghue et al: Reflect: A practical approach to web semantics, Journal of Web Semantics (2010), Special Issue: Sp. Iss. SI 2-3 (8) 182-189

Pafilis et al: Reflect: augmented browsing for the life scientist, Nat Biotechnol. 2009 Jun;27(6):508-10.

Poster Y08

Biomedical text mining from multiple views: information fusion and vertical

Xinhai Liu Katholieke Universiteit Leuven & IBBT

Olivier Gevaert (Katholieke Universiteit Leuven & IBBT, ESAT-SCD & Future Health Department); Léon-Charles Tranchevent (Katholieke Universiteit Leuven & IBBT, ESAT-SCD & Future Health Department); Bart De Moor (Katholieke Universiteit Leuven & IBBT, ESAT-SCD & Future Health Department);

Short Abstract: Biomedical literature contains rich information that can be observed from different points of view and is expected to provide better or more exact knowledge about biomedical processes. We therefore propose a novel strategy to provide prior information from text from multi-view perspectives. The strategy is implemented by text mining on the MEDLINE database. It can be applied to information fusion by integrating multi-view data, or to provide specific knowledge from a small vertical perspective. A Web application of our strategy has been developed for gene retrieval. The multiple views can be different controlled vocabularies, weighting schemes, publishing time periods and biomedical subjects. When a series of human-related genes is input, the outputs, based on the multi-view selection, are the gene-by-term profile, which is illustrated by a term cloud, and the gene-by-gene similarity matrix, which is visualized by a color map. In addition, the hierarchical clustering of the queried genes is available as well. We employ a set of genes that belong to different diseases to test our multi-view gene retrieval system. Firstly, we investigate the dissimilarity of multiple views by calculating the cosine similarity among them; the results demonstrate that the views differ from each other, in other words, each of them is able to provide complementary information to a certain extent. Secondly, we carry out hybrid clustering by integrating multi-view text mining data. The clustering results show that integrating multi-view controlled vocabularies and weighting schemes is able to boost clustering performance.
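As a rough illustration of the quantities mentioned above, the following sketch computes a gene-by-gene cosine similarity matrix from a gene-by-term profile and then compares two views. The abstract does not state how the cosine similarity among views is computed; comparing the flattened upper triangles of the two similarity matrices, as `view_agreement` does, is an assumption made for this example, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def gene_similarity_matrix(profile):
    """profile maps gene -> term-weight vector; rows and columns follow sorted gene names."""
    genes = sorted(profile)
    return [[cosine(profile[g], profile[h]) for h in genes] for g in genes]

def view_agreement(profile_a, profile_b):
    """Assumed comparison: cosine between the upper triangles of the two similarity matrices."""
    sim_a = gene_similarity_matrix(profile_a)
    sim_b = gene_similarity_matrix(profile_b)
    n = len(sim_a)
    flat_a = [sim_a[i][j] for i in range(n) for j in range(i + 1, n)]
    flat_b = [sim_b[i][j] for i in range(n) for j in range(i + 1, n)]
    return cosine(flat_a, flat_b)
```

Both profiles must describe the same set of genes. A value near 1 would suggest that two views rank gene pairs alike, while a low value would suggest that they carry complementary information.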

Poster Y09

Keyword Clustering in Biomedical Information Retrieval Using Evolutionary Algorithms

Viktoria Dorfer Upper Austria University of Applied Sciences

Stephan Winkler (Upper Austria University of Applied Sciences, Medical and Bioinformatics); Thomas Kern (Upper Austria University of Applied Sciences, Research & Development); Sophie Blank (Upper Austria University of Applied Sciences, Research & Development); Gerald Petz (Upper Austria University of Applied Sciences, Marketing & Electronic Business); Patrizia Faschang (Upper Austria University of Applied Sciences, Research & Development);

Short Abstract: As the amount of available data in the field of life sciences grows exponentially, intelligent search strategies are necessary to help people in information retrieval. We here describe the use of a new keyword clustering method: based on a set of documents (D), keyword clusters are optimized so that the identified groups consist of keywords that often occur in combination in D. The keyword clusters generated in this way are intended to serve, in the near future, as a solid base for a new PubMed search tool based on query extension that also uses user feedback to optimize the search process.
We have defined several important characteristics for clustering candidates, including the data set coverage, the cluster confidence (measuring the ratio of clustered keywords that are found in the same documents), and the document confidence (measuring the amount of equal keywords in the documents assigned to a cluster through their keywords). Evolutionary algorithms have been applied for solving this optimization task, amongst others evolution strategies (ES) and a multi-objective genetic algorithm (NSGA-II, used because the optimization objectives are partially contradictory).
For testing this approach we used data published for the TREC-9 conference containing 36,890 entries. From this data set we extracted the most significant keywords for clustering using tf-idf weighting. Analyzing the first optimization results, we see that the best result obtained with a 10+1 ES provides 23.5% data set coverage, 45.2% cluster confidence, and 23.4% document confidence; using NSGA-II we obtained, for example, results with respective values of 71%, 56% and 37%.
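The tf-idf keyword selection step can be sketched as below. This uses one common tf-idf variant (relative term frequency times the natural log of inverse document frequency); the abstract does not specify which variant or which ranking rule was used, so both are assumptions here.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def top_keywords(documents, k):
    """Rank terms by their highest tf-idf weight in any document (ties broken alphabetically)."""
    best = {}
    for doc_weights in tf_idf(documents):
        for term, weight in doc_weights.items():
            best[term] = max(best.get(term, 0.0), weight)
    return sorted(best, key=lambda t: (-best[t], t))[:k]
```

Terms that occur in every document receive a weight of zero, so they drop out of the candidate keyword set before any clustering takes place.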

Poster Y10

Automating host-pathogen interaction discovery: an HIV case study

Daniel Jamieson University of Manchester

Jonathan Dickerson (University of Manchester, Bioinformatics); Martin Gerner (University of Manchester, Computer Sciences); Farzaneh Sarafraz (University of Manchester, Computer Sciences); Goran Nenadic (University of Manchester, Computer Sciences); David Robertson (University of Manchester, Bioinformatics);

Short Abstract: HIV’s exploitation of the human host for replication involves highly specific interactions with the cellular system. The persistence of the virus is dependent on an intricate network of biological interactions. There is presently extensive knowledge of molecular interactions within the scientific literature. However, manual curation of these would be an enormously time-consuming task. Here we present a procedure for automated host-pathogen focused data extraction. We focus on mining HIV-1-host protein interactions, using the gold-standard HIV-1 Human Protein Interaction Database, containing 5,127 interactions manually extracted from 14,312 references, as a benchmark for evaluation. Our process involves employing a combination of named entity recognisers (such as BANNER) and the highest performing event extraction tools (e.g. from the BioNLP 2009 shared task). We replicated the original dataset to an acceptable degree, however with compromises devised to conform to the design of tools presently available. Notably, event extraction tools are limited to obtaining interactions from 9 event types, which are not compatible with all of the interactions in HIV biology. Thus, we have developed an appropriate ontology to encompass the full spectrum of host-pathogen interactions and demonstrate improved coverage. As a result, we demonstrate the effective use of sophisticated text mining tools to functionally recreate a large database of host-pathogen based interactions. This provides a platform for full-scale automated extraction of host-pathogen interactions from all the available literature (20 million+ online references). Not only will such an approach improve our understanding of host-pathogen biology, it may also permit informed development of pharmaceutical interventions.

Poster Y11

Relation-mining workflows to extract animal host-microorganism interactions in free text

Sophia Ananiadou University of Manchester

BalaKrishna Kolluru (University of Manchester); Sirintra Nakjang (University of Newcastle, Institute for Cell and Molecular Biosciences); Robert P. Hirt (University of Newcastle, Institute for Cell and Molecular Biosciences); Anil Wipat (University of Newcastle, Institute for Cell and Molecular Biosciences); Sophia Ananiadou (University of Manchester, Computer Science);

Short Abstract: This poster is based on a Proceedings Submission. We have developed a relation-mining workflow to extract sentences with a particular focus on animal host-microorganism pairs from free text such as journal papers. The workflow uses three different components: one to convert pdf to raw text, one to process sentences from this raw text, and a CRF-based relation miner. The CRF-based component uses lexical features and bag-of-words as features to estimate the "relation" aspect in a sentence. The results indicate that the precision of our approach is around 85%; a significant number of the errors reported by the workflow are due to erroneous conversion from pdf to text. We are currently working both on improving the net coverage of the machine-learning component, using syntactic relationships between the main entities and their connecting words, and on reducing the noise in pdf-to-text conversion. We show that accuracy in the automatic extraction of relations is a boost to the creation of comprehensive databases.

Poster Y12

Overview of the second CALBC Challenge

Chen Li EMBL Outstation Hinxton

Senay Kafkas (EMBL Outstation Hinxton); Erik van Mulligen (Erasmus University Medical Center, Department of Medical Informatics); Jan Kors (Erasmus Medical Center, Department of Medical Informatics); Udo Hahn (Friedrich-Schiller-Universität Jena, Jena University Language & Information Engineering (JULIE) Lab); David Milward (Linguamatics); Dietrich Rebholz-Schuhmann (EMBL Outstation Hinxton, European Bioinformatics Institute); Ian Lewin (EMBL Outstation Hinxton, European Bioinformatics Institute); Chen Li (EMBL Outstation Hinxton, European Bioinformatics Institute);

Short Abstract: The aim of the CALBC project is to generate a large scale Silver Standard Corpus (SSC) (about 1 million Medline abstracts) annotated with diverse biomedical entities by automatically harmonizing different solutions from the community. The second CALBC challenge attracted 16 teams from all around the world. The results were presented in a workshop held in Hinxton, UK, March 16-18, 2011.
The 2nd challenge required participants to annotate a much greater quantity of data (175k and, optionally 714k documents) and offered the opportunity to annotate a much larger number of entity types (up to 16). An annotated training corpus (100k documents) for 4 entity types was generated automatically using the harmonization scheme derived from the first challenge, applied to the 4 project partners. Challenge submissions were then a) used to generate a new SSC through an improved character-based voting mechanism b) evaluated against it. As with the smaller scale 1st challenge evaluations (1k documents), some participants, especially trained solutions, outperform the partners over a partner-only SSC; and this is true for all 4 entity types. In contrast to the 1st challenge, we find that an all-participant SSC requires different voting thresholds for different entity types.
In a new departure, we have also evaluated our SSCs against published Gold standard annotations (where available). Furthermore, several solutions have been developed for normalizing the annotated biomedical entities to terminological resources (e.g. UMLS). The SSC has been converted to RDF format for public exploitation via the Semantic Web.
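The idea of character-based voting with a threshold can be illustrated generically as follows. The details of the CALBC harmonization scheme are not given in the abstract, so this sketch is only an assumed, simplified form: each system votes once for every character it annotates, and maximal runs of characters that reach the threshold become the harmonized annotations. The function name and the span representation are hypothetical.

```python
def harmonize(text_length, annotations, threshold):
    """Merge span annotations from several systems by per-character voting.

    annotations: one list of (start, end) spans per system, end exclusive,
    all spans assumed to lie within [0, text_length).
    Returns the maximal spans whose characters received >= threshold votes.
    """
    votes = [0] * text_length
    for spans in annotations:
        covered = set()
        for start, end in spans:
            covered.update(range(start, end))
        for i in covered:          # each system votes at most once per character
            votes[i] += 1

    merged, run_start = [], None
    for i, count in enumerate(votes):
        if count >= threshold and run_start is None:
            run_start = i
        elif count < threshold and run_start is not None:
            merged.append((run_start, i))
            run_start = None
    if run_start is not None:      # a run that extends to the end of the text
        merged.append((run_start, text_length))
    return merged
```

Raising the threshold trades recall for precision, which gives one intuition for why different entity types might call for different thresholds.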

Poster Y13

An exercise in kidney factomics: From article titles to RDF Knowledgebase

James Eales University of Manchester

George Demetriou (University of Sheffield, Department of Computer Science); Robert Stevens (University of Manchester, School of Computer Science);

Short Abstract: There are many existing resources that integrate data between databases; they do this either semantically by the use of RDF and triplestores (e.g. Bio2RDF), or with web links and ID mapping services (e.g. PICR, eUtils). However, results declared in the literature are only rarely interlinked with existing databases and even more rarely interlinked with each other. We describe a method to take factual statements reported in the literature and turn them into semantic networks of RDF triples. We use a method based on finding titles of papers that contain positive, direct statements about the outcome of a biomedical investigation. We then use dependency parsing and an ontological perspective to create and combine graphs of knowledge about a domain. Our aim in this work is to collect knowledge from the literature for inclusion in the Kidney and Urinary Pathways Knowledge Base (KUPKB), which will be used in the e-LICO project to illustrate the utility of data-mining methods for biomarker discovery and pathway modelling.

Poster Y14

AUTOMATIC PREDICTION OF GENE FUNCTION USING LITERATURE PROFILING

Roney S. Coimbra Center for Excellence in Bioinformatics (CEBio), FIOCRUZ-Minas

Francislon Oliveira (Center for Excellence in Bioinformatics (CEBio), FIOCRUZ-Minas); Guilherme Corrêa Oliveira (Center for Excellence in Bioinformatics (CEBio), FIOCRUZ-Minas, LPCM); Raul Torrieri (Center for Excellence in Bioinformatics (CEBio), FIOCRUZ-Minas, LPCM); Roney Santos Coimbra (Center for Excellence in Bioinformatics (CEBio), FIOCRUZ-Minas, LPCM);

Short Abstract: We propose a new tool for the automatic prediction of bacterial gene function based on literature profiles. For this purpose we developed and used LitProf, an implementation of the Chaussabel & Sher algorithm, to identify the minimum vocabulary required to describe the function of a given gene from a collection of abstracts in PubMed. Our initial dataset consisted of ~50 K canonical gene names randomly picked from genomes representing all bacterial phylogenetic branches at the JCVI Resource. This gene set was screened to eliminate homonymous orthologs, and genes assigned to imprecise functional categories or to more than one functional category of the JCVI ontology. To further reduce redundancy, genes were grouped by hierarchical clustering based on the similarities of word-frequency vectors generated by LitProf from their gene-specific text corpora of PubMed abstracts. One gene was randomly chosen to represent each cluster of highly related genes (> 0.99 correlation). From the text corpora of these selected genes, LitProf disclosed a minimum informative vocabulary and produced the word-frequency vectors representing each gene. We used these vectors to train a Support Vector Machine (SVM) gene classifier. In a 100X cross-validation the classifier showed 80±3% average precision and 60±3% average recall. In an independent set of 4,000 JCVI annotated genes, the classifier achieved up to 81% precision with 73% recall.
Through this method we propose functional categories for 24,104 unclassified genes with abstracts in PubMed. For confidence thresholds higher than 0.7, 4,927 genes (> 20%) were unambiguously assigned to one functional category of the JCVI ontology.
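The redundancy-reduction step (keeping one gene per group of genes whose word-frequency vectors correlate above 0.99) can be approximated with the sketch below. Note the substitutions: a greedy filter replaces the hierarchical clustering used by the authors, and the representative is the alphabetically first gene rather than a randomly chosen one, so that the example is deterministic. The function names are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length vectors (0.0 if either is constant)."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (sd_u * sd_v) if sd_u and sd_v else 0.0

def pick_representatives(vectors, min_corr=0.99):
    """Greedy stand-in for clustering: keep a gene only if it does not correlate
    above min_corr with any gene that has already been kept."""
    representatives = []
    for gene in sorted(vectors):
        if all(pearson(vectors[gene], vectors[r]) <= min_corr for r in representatives):
            representatives.append(gene)
    return representatives
```

The retained word-frequency vectors would then serve as the training input for the classifier.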


Attention Poster Authors: The ideal poster size should be max. 1.30 m (130 cm) high x 0.90 m (90 cm) wide. Fasteners (Velcro / double-sided tape) will be provided at the site; please DO NOT bring tape, tacks or pins.

Posters Display Schedule:

Odd Numbered posters:

Set-up timeframe: Sunday, July 17, 7:30 a.m. - 10:00 a.m.
Author poster presentations: Monday, July 18, 12:40 p.m. - 2:30 p.m.
Removal timeframe: Monday, July 18, 2:30 p.m. - 3:30 p.m.*

Even Numbered posters:

Set-up timeframe: Monday, July 18, 3:30 p.m. - 4:30 p.m.
Author poster presentations: Tuesday, July 19, 12:40 p.m. - 2:30 p.m.
Removal timeframe: Tuesday, July 19, 2:30 p.m. - 4:00 p.m.*

* Posters that are not removed by the designated time may be taken down by the organizers and discarded. Please be sure to remove your poster within the stated timeframe.

Delegate Posters Viewing Schedule

Odd Numbered posters:
On display Sunday, July 17, 10:00 a.m. through Monday, July 18, 2:30 p.m.
Author presentations will take place Monday, July 18: 12:40 p.m.-2:30 p.m.

Even Numbered posters:
On display Monday, July 18, 4:30 p.m. through Tuesday, July 19, 2:30 p.m.
Author presentations will take place Tuesday, July 19: 12:40 p.m.-2:30 p.m.

Want to print a poster in Vienna - try these options:

Repacopy - next to the congress venue

Also at Karlsplatz, in the Ring Center, Kärntner Str. 42

If you need your poster on a thicker material, you may also use a plotter service next to Karlsplatz: http://schiessling.at/portfolio/
