TextLens: A Fast and Practical Partial Parser for Biomedical Literature

Yasunori Yamamoto1, Hiroko Ao2, Toshihisa Takagi
1yayamamo@ims.u-tokyo.ac.jp, Department of Computer Science, University of Tokyo; 2aohiroko@ims.u-tokyo.ac.jp, Department of Computational Biology

TextLens is a partial parser that captures the essential structure of a sentence. It takes a POS-tagged sentence as input and outputs two kinds of tagged sentences: one marks the scopes of coordinate conjunctions, and the other marks the main subject and predicate. It is intended for information extraction tasks such as extracting gene or protein names from Medline abstracts or extracting protein-protein interactions from biomedical literature. The parser runs fast because it does not use machine learning techniques such as HMMs or SVMs. Instead, it applies a set of simple word replacement rules to abstract a sentence. After abstraction, the sentence is expressed as a sequence of letters and numbers, each of which denotes a word or a chunk of words that carries an essential idea or function in the sentence, such as a noun clause or a preposition. Biomedical literature often contains long collocations, so appropriately capturing each element that constitutes a sentence is both important and challenging. A preliminary result shows that the parser can identify the main subject-predicate pair at a practical level, although it still needs considerable improvement.
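
The abstraction step described above can be illustrated with a minimal sketch. The tag-to-symbol table, the symbol letters, the matching pattern, and all function names below are hypothetical and are not taken from TextLens; the sketch only assumes Penn Treebank-style POS tags as input and shows how replacement rules, rather than a trained model, can reduce a sentence to a short symbol sequence.

```python
import re

# Hypothetical mapping from POS tags to one-character symbols.
# This table is an illustrative assumption, not the TextLens rule set.
TAG_TO_SYMBOL = {
    "DT": "N", "JJ": "N", "NN": "N", "NNS": "N", "NNP": "N",  # noun chunk
    "VBZ": "V", "VBD": "V", "VBP": "V",                       # verb
    "IN": "P",                                                # preposition
    "CC": "C",                                                # coordinate conjunction
}


def abstract_sentence(tagged):
    """Replace each (word, tag) pair with a symbol and merge adjacent
    words that share the same symbol into one chunk."""
    chunks = []
    for word, tag in tagged:
        symbol = TAG_TO_SYMBOL.get(tag, "O")  # "O" marks any other tag
        if chunks and chunks[-1][0] == symbol:
            chunks[-1][1].append(word)
        else:
            chunks.append((symbol, [word]))
    return chunks


def to_code(chunks):
    """Return the abstracted sentence as a string, one symbol per chunk."""
    return "".join(symbol for symbol, _ in chunks)


def find_subject_predicate(chunks):
    """Assumed heuristic: the subject is the first noun chunk that reaches
    a verb through zero or more preposition + noun pairs."""
    match = re.search(r"N(?:PN)*V", to_code(chunks))
    if match is None:
        return None
    subject = " ".join(chunks[match.start()][1])
    predicate = " ".join(chunks[match.end() - 1][1])
    return subject, predicate


tagged = [
    ("The", "DT"), ("protein", "NN"), ("in", "IN"), ("the", "DT"),
    ("nucleus", "NN"), ("binds", "VBZ"), ("DNA", "NN"), ("and", "CC"),
    ("RNA", "NN"),
]
chunks = abstract_sentence(tagged)
print(to_code(chunks))                 # NPNVNCN
print(find_subject_predicate(chunks))  # ('The protein', 'binds')
```

Because every chunk is a single character, a position in the symbol string is also an index into the chunk list, so a match on the string maps directly back to the original words.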