Current status of the GENIA Corpus: an Annotated Corpus in Molecular Biology Domain

Tomoko Ohta¹, Jin-Dong Kim², Yuka Tateisi, Masayoshi Tsuruoka, Jun'ichi Tsujii
¹okap@is.s.u-tokyo.ac.jp, CREST, JST; ²jkdim@is.s.u-tokyo.ac.jp, University of Tokyo

We have expanded the GENIA corpus to 4000 abstracts and made 2000 of them publicly available as version 3.01. The base set was taken from the results of a MEDLINE query with the MeSH terms "Human, Blood Cells and Transcription Factors", and is a superset of the base set of version 1.1 ([1]). As in version 1.1, the semantic classes of technical terms are marked up, but in this version terms nested inside other terms are also marked up. We have also corrected sentence boundary errors.
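The nested term markup described above can be pictured as a small tree of annotated spans. The following is only a minimal sketch: the `Term` class, the `flatten` helper, the example phrase, and the class labels `other_name` and `DNA` are illustrative assumptions, not the corpus's actual data model or labels.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Term:
    """A technical term with a semantic class and any terms nested inside it."""
    text: str
    sem: str  # hypothetical semantic class label, chosen for illustration
    children: List["Term"] = field(default_factory=list)


def flatten(term: Term, depth: int = 0) -> List[Tuple[int, str, str]]:
    """Return (depth, text, sem) for the term and every term nested inside it."""
    rows = [(depth, term.text, term.sem)]
    for child in term.children:
        rows.extend(flatten(child, depth + 1))
    return rows


# Illustrative example: an outer term that contains an inner term.
outer = Term("IL-2 gene expression", "other_name", [Term("IL-2 gene", "DNA")])
rows = flatten(outer)
```

Here `rows` lists the outer term at depth 0 and the inner term at depth 1, which is the kind of information a flat, non-nested annotation would lose.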
The POS-tagged corpus has also been released as version 3.0p. The tag set is basically the Penn Treebank (PTB) POS tag set, with the following major differences. The NNP and NNPS (proper noun) tags are not used, except for the names of journals, authors, and research institutes, and the initials of patients. In particular, (discoverers') names in technical terms (e.g. Epstein-Barr virus, Southern blotting) are not tagged NNP. We tried to eliminate SYM tags as far as possible. The POS-tagged corpus is available in three formats: a PTB-like format, an XML format, and a merged GPML format ([2]).
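The restriction on proper-noun tags can be expressed as a simple rule. This is a hedged sketch, not the annotators' actual procedure: the context labels and the function `adjust_tag` are hypothetical, and the choice of NN and NNS as replacement tags is an assumption made for illustration.

```python
# Hypothetical context labels for which proper-noun tags are kept.
PROPER_NOUN_CONTEXTS = {"journal", "author", "institute", "patient_initials"}


def adjust_tag(ptb_tag: str, context: str) -> str:
    """Keep NNP/NNPS only in the listed contexts; otherwise fall back to a
    common-noun tag (assumed here to be NN or NNS). Other tags pass through."""
    if ptb_tag in ("NNP", "NNPS") and context not in PROPER_NOUN_CONTEXTS:
        return "NN" if ptb_tag == "NNP" else "NNS"
    return ptb_tag


# "Epstein-Barr" inside a technical term loses its proper-noun tag,
# while an author name keeps it.
in_term = adjust_tag("NNP", "technical_term")
in_author = adjust_tag("NNP", "author")
```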
The markup language GPML has also been revised. The major revision is that we discarded the element and the element. The semantic class information is annotated directly in the abstract, using elements.

[1] Ohta, T., Tateisi, Y., Kim, J.-D. and Tsujii, J. (2002) The GENIA Corpus: an Annotated Research Abstract Corpus in Molecular Biology Domain. Proc. of the Human Language Technology Conference, to appear.
[2] GENIA project homepage