The biomedical literature, as measured by the number of entries in the
National Library of Medicine's MEDLINE database, has been growing
exponentially (~e^0.043) for over two decades. Last year, 562,134 articles
were added, more than 1,540 per day. Furthermore, the genomic revolution is
breaking down disciplinary boundaries in biomedicine, greatly expanding the
number of potentially relevant publications that researchers must track.
High-throughput techniques, such as expression arrays and shotgun
proteomics, exacerbate this problem by identifying dozens to thousands of
genes or gene products relevant to phenomena under study; many of those will
have been characterized in subdisciplines previously thought to be unrelated
to the study. Computational information extraction and retrieval techniques
are becoming increasingly important tools for managing the biomedical
literature and for rapidly finding and organizing all available information
about large gene sets. Recent progress suggests that computational natural
language processing techniques may be more effective on biomedical text
than on general English. In this talk, I will give an overview of some of
the relevant techniques and applications of computational natural language
processing and describe recent results obtained in my laboratory.
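
As a rough check on the growth figures quoted above, the following sketch recomputes the daily article count and derives what the quoted exponent would imply. It assumes that 0.043 is an annual rate, i.e. that the database size scales as e^(0.043 t) with t in years; the abstract does not state the unit, so this is an interpretation rather than a confirmed detail. The function names are illustrative only.

```python
import math

# Figures quoted in the abstract.
ARTICLES_ADDED = 562_134   # articles added to the database in one year
GROWTH_EXPONENT = 0.043    # ASSUMPTION: treated here as a per-year rate


def articles_per_day(articles_per_year: int, days_per_year: int = 365) -> float:
    """Average number of articles added per day."""
    return articles_per_year / days_per_year


def doubling_time(rate: float) -> float:
    """Time for a quantity growing as e^(rate * t) to double, in units of t."""
    return math.log(2) / rate


def annual_growth_fraction(rate: float) -> float:
    """Fractional increase over one unit of t for growth e^(rate * t)."""
    return math.exp(rate) - 1


if __name__ == "__main__":
    print(f"articles per day: {articles_per_day(ARTICLES_ADDED):.1f}")
    print(f"doubling time:    {doubling_time(GROWTH_EXPONENT):.1f} years")
    print(f"annual growth:    {annual_growth_fraction(GROWTH_EXPONENT):.1%}")
```

Under that assumption, the quoted exponent corresponds to growth of a little over 4% per year and a doubling time of roughly 16 years, and the daily figure comes out just above 1,540, consistent with the text.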