A Constructional Approach to Extraction

Cornelia M. Verspoor (1), George J. Papcun (2), Kari Sentz
(1) verspoor@lanl.gov, Los Alamos National Laboratory; (2) gjp@lanl.gov, Los Alamos National Laboratory

We present a prototype implementation of an information extraction system that aims to identify protein/gene interactions in the biological literature. The system embodies a framework for linguistic processing that lies between the two extremes of existing solutions to this problem: it neither attempts full syntactic parsing of the input text (Yakushiji et al. 2001) nor ignores linguistic structure entirely (Blaschke and Valencia 2002). This framework is based on Construction Grammar (Fillmore 1985; Goldberg 1995), which holds that language consists of a set of patterns at varying levels of abstraction that integrate form and meaning in conventionalized and often non-compositional ways. We define a construction as any learned relationship between form and meaning in a language, and show that representing constructions for information extraction provides a powerful mechanism for recognizing relations of interest in free text (Papcun et al. 2003).

The Construction Grammar (CG) framework allows us to capture highly domain-specific lexical patterns, such as words and cue phrases that have particular meanings or implications in the biological context, while still making use of more abstract linguistic structure, such as clause and phrase boundaries established through part-of-speech tagging and shallow parsing, to constrain the recognition of patterns for protein/gene interactions in context. It furthermore accommodates the representation of domain-specific semantic properties of specific patterns, both to constrain recognition and to guide interpretation, without depending on the identification of deep structural relationships in the text. This approach increases the precision of interaction extraction without requiring complete linguistic analysis. In this prototype, we focus on constructions at the clausal level that tolerate intervening modifiers that do not contribute to the main content of the clause.
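The idea of a clause-level construction that tolerates intervening modifiers can be illustrated with a minimal Python sketch. It is not the prototype's implementation. It assumes the input has already been part-of-speech tagged and shallow parsed into labeled chunks, and every name in it (`Chunk`, `Interaction`, `match_transitive`, the chunk labels, and the small protein and verb lexicons) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Chunk:
    label: str  # assumed shallow-parse label: "NP", "VP", "PP", "ADVP", "CLAUSE_END"
    text: str   # surface text of the chunk
    head: str   # head word of the chunk (lemma for verbs)

@dataclass(frozen=True)
class Interaction:
    agent: str
    relation: str
    target: str

# Hypothetical domain lexicon: form (verb lemma) paired with meaning (relation type).
INTERACTION_VERBS = {"activate": "activation", "inhibit": "inhibition", "bind": "binding"}
PROTEINS = {"MDM2", "p53", "RAD51", "BRCA2"}
# Modifier chunks assumed not to contribute to the main content of the clause.
SKIPPABLE = {"PP", "ADVP"}

def next_content(chunks: List[Chunk], i: int) -> Optional[int]:
    """Index of the next non-modifier chunk at or after i, or None if the clause ends first."""
    while i < len(chunks):
        if chunks[i].label == "CLAUSE_END":
            return None
        if chunks[i].label not in SKIPPABLE:
            return i
        i += 1
    return None

def match_transitive(chunks: List[Chunk]) -> List[Interaction]:
    """Match the construction NP[protein] VP[interaction verb] NP[protein] within one clause."""
    found = []
    for i, subj in enumerate(chunks):
        if subj.label != "NP" or subj.head not in PROTEINS:
            continue
        v = next_content(chunks, i + 1)
        if v is None or chunks[v].label != "VP" or chunks[v].head not in INTERACTION_VERBS:
            continue
        o = next_content(chunks, v + 1)
        if o is None or chunks[o].label != "NP" or chunks[o].head not in PROTEINS:
            continue
        found.append(Interaction(subj.head, INTERACTION_VERBS[chunks[v].head], chunks[o].head))
    return found

# Hand-chunked example: "MDM2 in unstressed cells directly inhibits p53."
example = [
    Chunk("NP", "MDM2", "MDM2"),
    Chunk("PP", "in unstressed cells", "in"),
    Chunk("ADVP", "directly", "directly"),
    Chunk("VP", "inhibits", "inhibit"),
    Chunk("NP", "p53", "p53"),
    Chunk("CLAUSE_END", ".", "."),
]
print(match_transitive(example))
```

In this sketch the clause boundary blocks a match from spanning two clauses, the lexicon supplies the domain-specific meaning of the verb, and no deep syntactic structure is consulted.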

Blaschke, C., and A. Valencia. 2002. The frame-based module of the Suiseki information extraction system. IEEE Intelligent Systems 17: 14-20.

Fillmore, C. 1985. Syntactic intrusion and the notion of grammatical construction. Berkeley Linguistics Society 11: 73-86.

Goldberg, A. 1995. Constructions: A Construction Grammar Approach to Argument Structure. Chicago: University of Chicago Press.

Papcun, G., K. Sentz, A. Fulmer, J. Xu, O. Lubeck, and M. Wolinsky. 2003. A construction grammar approach to extracting regulatory relationships from biological literature. Pacific Symposium on Biocomputing 2003, Kauai, Hawaii.

Yakushiji, A., Y. Tateisi, Y. Miyao, and J. Tsujii. 2001. Event extraction from biomedical papers using a full parser. In Proceedings of the Sixth Pacific Symposium on Biocomputing (PSB 2001), Hawaii, U.S.A., pp. 408-419.