Representations of Cells in the Biomedical Literature: First Look at the NLM CellLink Corpus
Confirmed Presenter: Noam H. Rotenberg, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
Room: 12
Format: In person
Authors List:
- Noam H. Rotenberg, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Robert Leaman, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Rezarta Islamaj, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Brian Fluharty, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Helena Kuivaniemi, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Savannah Richardson, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Gerard Tromp, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Zhiyong Lu, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
- Richard H. Scheuermann, Division of Intramural Research, National Library of Medicine, National Institutes of Health, United States
Presentation Overview:
Single-cell technologies are enabling the discovery of many novel cell phenotypes, but this growing body of knowledge remains fragmented across the scientific literature. Natural language processing (NLP) offers a promising approach to extracting this information at scale; however, the existing annotated datasets required for system development and evaluation do not reflect the complex assortment of cell phenotypes described in recent studies.
We present a new corpus of excerpts from recent articles, manually annotated with mentions of human and mouse cell populations. The corpus distinguishes three types of mentions: (1) specific cell phenotypes (cell types and states), (2) heterogeneous cell populations, and (3) vague cell population descriptions. Where possible, mentions of the first two types were linked to Cell Ontology identifiers based on their meaning in context, with each match labeled as exact or related. Annotation was performed by four cell biologists using a multi-round process, with automated pre-annotation.
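As an illustration of the annotation scheme described above, one mention could be represented roughly as follows. This is a minimal sketch, not the corpus's actual file format: the class names, field names, and validation rules are assumptions made for illustration, and the example passage is invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MentionType(Enum):
    """The three mention types named in the abstract (labels are hypothetical)."""
    CELL_PHENOTYPE = "cell_phenotype"            # specific cell type or state
    HETEROGENEOUS = "heterogeneous_population"   # mixed cell population
    VAGUE = "vague_population"                   # vague population description


class MatchType(Enum):
    """How closely the linked Cell Ontology concept matches the mention."""
    EXACT = "exact"
    RELATED = "related"


@dataclass(frozen=True)
class CellMention:
    start: int                         # character offset, inclusive
    end: int                           # character offset, exclusive
    text: str                          # surface form of the mention
    mention_type: MentionType
    cl_id: Optional[str] = None        # Cell Ontology identifier, if linked
    match: Optional[MatchType] = None  # set only when cl_id is set

    def __post_init__(self):
        if self.end <= self.start:
            raise ValueError("end must be greater than start")
        # A link and its match label are given together or not at all.
        if (self.cl_id is None) != (self.match is None):
            raise ValueError("cl_id and match must be set together")
        # Assumption drawn from the abstract: only the first two types are linked.
        if self.mention_type is MentionType.VAGUE and self.cl_id is not None:
            raise ValueError("vague mentions are not linked to the ontology")


# Invented example passage; CL:0000084 is the Cell Ontology term for "T cell".
passage = "T cells and other cells were isolated."
linked = CellMention(0, 7, "T cells", MentionType.CELL_PHENOTYPE,
                     cl_id="CL:0000084", match=MatchType.EXACT)
unlinked = CellMention(12, 23, "other cells", MentionType.VAGUE)
```

Keeping the identifier and match label optional reflects the "where possible" wording: a mention of the first two types may still go unlinked when no suitable concept exists.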
The corpus contains over 22,000 annotations across more than 3,000 passages selected from 2,700 articles, covering nearly half the concepts in the current Cell Ontology. Fine-tuning BiomedBERT in a simplified named entity recognition task on this corpus resulted in substantially higher performance than the same configuration fine-tuned on previously annotated datasets.
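The abstract does not say how the named entity recognition task was simplified. One plausible reading, assumed here purely for illustration, is that all mention types are collapsed into a single label and the character-level spans are converted to token-level BIO tags before fine-tuning. The sketch below shows that conversion with whitespace tokens; the function name and the `CELL` label are hypothetical, and a real pipeline would align spans to the model's subword tokens instead.

```python
import re


def spans_to_bio(text, spans, label="CELL"):
    """Convert non-overlapping (start, end) character spans to BIO tags.

    Tokens are whitespace-delimited. A token that overlaps a span is tagged
    B-<label> if it is the first token of that span, otherwise I-<label>.
    """
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tok_start, tok_end = match.start(), match.end()
        tag = "O"
        for span_start, span_end in spans:
            if tok_start < span_end and tok_end > span_start:
                # The first overlapping token begins at or before the span start.
                prefix = "B" if tok_start <= span_start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


# Invented example sentence with one annotated mention, "CD8+ T cells".
tokens, tags = spans_to_bio("CD8+ T cells infiltrate the tumor", [(0, 12)])
```

For this sentence the first three tokens receive `B-CELL`, `I-CELL`, `I-CELL` and the rest receive `O`.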
Our corpus will be a valuable resource for developing automated systems to identify cell phenotype mentions in the biomedical literature, a challenging benchmark for evaluating biomedical NLP systems, and a foundation for the future extraction of relationships between cell types and key biomedical entities, including genes, anatomical structures, and diseases.