Text Mining

Schedule subject to change
All times listed are in CEST
Thursday, July 27th
8:30-9:10
Invited Presentation: Implementing text mining resources for clinical variables applied to literature and medical texts: applications in biomaterial research, cardiology, occupational health and phenotypes
Room: Salle Rhone 3b
Format: Live from venue

  • Martin Krallinger


Presentation Overview:

The development and use of pretrained language models (PLMs) and Transformer technologies adapted to clinical and biomedical content has resulted in considerable improvements to biomedical NLP systems. Nevertheless, the complexity and variability encountered when processing different sources of content, together with the lack of annotated resources and components for extracting information from clinical records written in different languages, still pose a challenge for the analysis and automatic detection of clinical variables in text.
I will outline some recent efforts related to biomedical text mining, covering different types of content such as literature and social media and how such approaches could be of importance for multilingual clinical record processing.
In response to the growing demand for NLP tools in languages other than English, I will illustrate how Spanish clinical corpora can be used to generate annotated datasets in a multilingual scenario. This initial adaptation proposal extends to a variety of languages, such as Italian, Swedish, Portuguese, Dutch, and even English. By leveraging existing high-quality annotated corpora in English or Spanish, we can effectively address the need for NLP tools in multiple languages, promoting broader accessibility and support across linguistic boundaries to meet the needs of clinical information extraction.
Shared tasks and access to high-quality annotated corpora and guidelines are becoming key to fostering technological development, monitoring progress, and assessing the quality of automatically generated results. Therefore, an overview of clinical NLP shared tasks and datasets will be provided. Finally, I will briefly introduce several application scenarios my group is currently working on, such as biomaterials research (BIOMATDB), clinical cardiology (DataTools4Heart & AI4HF), rare diseases (BARITONE), and occupational health (AI4ProfHealth).

9:10-9:30
Text mining for disease-lifestyle relations based on a novel lifestyle factors ontology
Room: Salle Rhone 3b
Format: Live from venue

  • Esmaeil Nourani, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Mikaela Koutrouli, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Yijia Xie, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Danai Vagiaki, Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
  • Farrokh Mehryary, TurkuNLP group, Department of Computing, Faculty of Technology, University of Turku, Turku, Finland
  • Sampo Pyysalo, TurkuNLP group, Department of Computing, Faculty of Technology, University of Turku, Turku, Finland
  • Katerina Nastou, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Søren Brunak, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark


Presentation Overview:

While text mining of genes, diseases, and their associations is well established, methods for identifying equally important lifestyle factors are lacking, and disease-lifestyle associations remain largely hidden in the literature. This is partly due to the lack of a manually annotated corpus of disease-lifestyle associations on which to train a relation extraction model, and partly because no ontology of lifestyles is available to facilitate the detection of lifestyle mentions in the literature. Existing ontologies focus on specific aspects of lifestyle, and even when combined they capture other important disease-associated factors, such as socioeconomic factors, poorly or not at all.
In this study, we developed a diverse and comprehensive ontology of lifestyle factors, starting with manual creation and followed by ontology expansion using state-of-the-art language models to identify new lifestyle-related concepts. We also created a corpus of 400 annotated abstracts, which we used to train a BioBERT-based disease-lifestyle relation extraction model. The prototype implementation achieved an F-score of approximately 70% for classifying the four distinct relation types between diseases and lifestyle factors.
We plan to extend the annotated corpus and use the generalized model to extract all disease-lifestyle relations from the literature and generate a knowledge base of relations.
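For reference, an F-score over several relation types, as reported above, can be computed by macro-averaging per-class precision and recall; whether the authors use macro or micro averaging is not stated, so this is an illustrative sketch with made-up relation labels:

```python
def macro_f1(gold, pred):
    """Macro-averaged F1 over all classes seen in gold or predicted labels."""
    classes = set(gold) | set(pred)
    f1s = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical relation labels for four disease-lifestyle pairs:
gold = ["causes", "treats", "causes", "prevents"]
pred = ["causes", "treats", "treats", "prevents"]
print(round(macro_f1(gold, pred), 2))  # 0.78
```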

10:00-10:20
S1000: a better corpus for evaluating species named entity recognition methods
Room: Salle Rhone 3b
Format: Live from venue

  • Katerina Nastou, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
  • Jouni Luoma, University of Turku, TurkuNLP Group, Department of Computing, Finland
  • Tomoko Ohta, textimi, Japan
  • Harttu Toivonen, University of Turku, TurkuNLP Group, Department of Computing, Finland
  • Evangelos Pafilis, Institute of Marine Biology, Biotechnology and Aquaculture, Hellenic Centre for Marine Research, Greece
  • Sampo Pyysalo, University of Turku, TurkuNLP Group, Department of Computing, Finland
  • Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark


Presentation Overview:

The recognition of mentions of species names in text is a critically important task for biomedical text mining. While deep learning-based methods have made great advances in other named entity recognition tasks, they fail to achieve comparable levels of performance for species names. We hypothesize that this is primarily due to the lack of appropriately annotated corpora.
In this work we introduce the S1000 corpus, a comprehensive manual re-annotation and extension of the S800 corpus. We propose S1000 as a new and improved gold standard for evaluating deep learning-based language models in place of S800. In S1000 we have doubled the number of unique and total mentions of taxonomic names compared to S800. We show that this new corpus leads to clear improvements in both the training and evaluation of deep learning-based language models, while not affecting the ability to evaluate dictionary-based named entity recognition methods. Moreover, extending the corpus with 200 additional documents has increased its variance and thus the generalizability of models trained on it, while still maintaining very high performance.

10:20-10:40
AIONER: An all-in-one scheme for biomedical named entity recognition using deep learning
Room: Salle Rhone 3b
Format: Live from venue

  • Ling Luo, Dalian University of Technology, China
  • Chih-Hsuan Wei, National Center of Biotechnology Information (NCBI), United States
  • Po-Ting Lai, National Center of Biotechnology Information (NCBI), United States
  • Robert Leaman, NCBI/NLM/NIH, United States
  • Qingyu Chen, NIH, United States
  • Zhiyong Lu, NCBI, United States


Presentation Overview:

Biomedical Named Entity Recognition (BioNER) aims to automatically identify biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications. However, manually labeling training data for BioNER is expensive. The resulting data scarcity leaves current BioNER approaches prone to overfitting, limited in generalizability, and able to address only a single entity type at a time (e.g., gene or disease). To address these challenges, we propose a novel All-In-One (AIO) scheme that leverages external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We also present AIONER, a general-purpose BioNER tool based on state-of-the-art deep learning and our AIO scheme. We evaluate AIONER on 14 BioNER benchmark tasks, demonstrating its effectiveness, robustness, and favorable comparison to other cutting-edge approaches such as multi-task learning. Availability and implementation: the source code, trained models, and data for AIONER are freely available at https://github.com/ncbi/AIONER.
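One way to picture an all-in-one scheme is pooling annotated corpora for different entity types into a single training set, with each example carrying a marker telling the model which entity type to tag. The sketch below illustrates that general idea only; the marker tokens and data layout are assumptions, not the AIONER implementation:

```python
def merge_corpora(corpora):
    """corpora: dict mapping entity type -> list of (tokens, labels) examples.

    Returns one combined list of examples, each prefixed with a task-marker
    token (e.g. '<gene>') indicating which entity type to recognize.
    """
    merged = []
    for entity_type, examples in corpora.items():
        marker = f"<{entity_type}>"
        for tokens, labels in examples:
            # The marker itself is outside any entity, so it gets label "O".
            merged.append(([marker] + tokens, ["O"] + labels))
    return merged

gene_corpus = [(["BRCA1", "is", "mutated"], ["B-Gene", "O", "O"])]
disease_corpus = [(["melanoma", "risk"], ["B-Disease", "O"])]
merged = merge_corpora({"gene": gene_corpus, "disease": disease_corpus})
print(len(merged), merged[0][0][0])  # 2 <gene>
```

A single model trained on such a merged set can then be steered at inference time by choosing the marker token.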

10:40-10:50
Instruction fine-tuning large language models with context-derived weak supervision improves clinical information extraction
Room: Salle Rhone 3b
Format: Live from venue

  • Brett Beaulieu-Jones, University of Chicago, United States
  • Elizabeth O'Neill, Rush University Medical Center, United States
  • Emily Alsentzer, Harvard Medical School, United States


Presentation Overview:

Biomedical studies frequently require manual chart review, whether as part of prospective recruitment or retrospective analysis. Large language models (LLMs) demonstrate strong performance on many natural language processing tasks, including information extraction from clinical notes. However, the combination of manual annotation burden and privacy concerns limits the datasets available for instruction fine-tuning. This study introduces context-derived weak supervision (CDWS) to address this challenge, generating weak labels for clinical notes using metadata contained in the structured data captured by EMRs (e.g., diagnoses, vital signs, labs, etc.). We fine-tuned the Flan-T5 model on the MIMIC-III clinical database and evaluated its performance on three clinical information extraction tasks. The CDWS base model outperformed the Flan-XXL model despite having significantly fewer parameters, and CDWS-Flan-XXL showed a 12.7% improvement over the vanilla Flan-XXL model. These results demonstrate the high potential of CDWS for instruction fine-tuning of LLMs. Future work will explore larger public tasks, evaluate label generation logic, attempt to understand potential biases in extraction, and assess the effects of varying levels of label noise on the fine-tuning process.
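The core move of context-derived weak supervision is pairing a note's free text with an answer read off the encounter's structured data, yielding instruction-tuning examples without manual annotation. A minimal sketch of that idea; the field names and prompt template here are illustrative assumptions, not the paper's exact pipeline:

```python
def make_weak_example(note_text, structured):
    """Build one (instruction, weak answer) pair from structured EMR metadata."""
    diagnoses = structured.get("diagnoses", [])
    # The structured diagnosis list serves as a weak (noisy) gold answer.
    answer = ", ".join(diagnoses) if diagnoses else "none documented"
    instruction = (
        "Read the clinical note and list the patient's documented diagnoses.\n"
        f"Note: {note_text}"
    )
    return {"instruction": instruction, "weak_answer": answer}

example = make_weak_example(
    "Pt admitted with chest pain, hx of HTN.",
    {"diagnoses": ["hypertension", "unstable angina"], "heart_rate": 92},
)
print(example["weak_answer"])  # hypertension, unstable angina
```

The label is "weak" because structured codes may lag or disagree with the narrative, which is why the abstract flags label-noise effects as future work.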

10:50-11:00
Word embeddings capture functions of low complexity regions: scientific literature analysis using a transformer-based language model
Room: Salle Rhone 3b
Format: Live from venue

  • Sylwia Szymanska, Department of Computer Networks and Systems, Silesian University of Technology, Poland
  • Aleksandra Gruca, Department of Computer Networks and Systems, Silesian University of Technology, Poland


Presentation Overview:

Low Complexity Regions (LCRs) are fragments of protein sequences that consist of only a few types of amino acids. In the past, scientists ignored LCRs, assuming they were non-functional; recent studies show that LCRs can play important roles in proteins. Unfortunately, this information is not collected in a systematic way, and the free text of scientific publications is the only resource on LCRs and their functions. We therefore developed a language model that can retrieve information about the function of LCRs in an automated way. We convert text to embeddings and perform PCA, which provides a single vector representing each publication. To train our model we selected papers related to the following LCR functions: phase separation, aggregation, and DNA- and RNA-binding, and we applied a Random Forest (RF) classifier to the publications. We compared the results with a baseline LitSuggest model: our model achieved a precision of 93.16, recall of 92.60, and F1-score of 92.87, whereas LitSuggest achieved a precision of 82.45, recall of 94.00, and F1-score of 87.85. In this work we present a language model-based approach that retrieves publications containing information about LCRs and their functions through free-text analysis.
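The reduction step described above, collapsing a publication's token embeddings into a single vector via PCA, can be sketched in miniature with the first principal component computed by power iteration (a pure-Python stand-in; the actual pipeline's embedding model and the RF classifier on top are not shown):

```python
def first_principal_component(vectors, iters=100):
    """Return the dominant eigenvector of the covariance of `vectors`."""
    dim, n = len(vectors[0]), len(vectors)
    means = [sum(v[d] for v in vectors) / n for d in range(dim)]
    centered = [[v[d] - means[d] for d in range(dim)] for v in vectors]
    w = [1.0] * dim
    for _ in range(iters):
        # Power iteration on X^T X: y = X^T (X w), then normalize.
        xw = [sum(row[d] * w[d] for d in range(dim)) for row in centered]
        y = [sum(centered[i][d] * xw[i] for i in range(n)) for d in range(dim)]
        norm = sum(c * c for c in y) ** 0.5 or 1.0
        w = [c / norm for c in y]
    return w

# Toy token embeddings that vary mostly along the first axis:
tokens = [[1.0, 0.1], [2.0, 0.0], [3.0, -0.1], [4.0, 0.0]]
pc1 = first_principal_component(tokens)
print(abs(pc1[0]) > abs(pc1[1]))  # True: dominant variance is on axis 0
```

In practice a library routine (e.g. scikit-learn's PCA) would be used, and the resulting per-publication vectors fed to the Random Forest.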

11:00-11:20
Integrating 3D and 2D Molecular Representations with Biomedical Text via a Unified Pre-trained Language Model
Room: Salle Rhone 3b
Format: Live from venue

  • Xiangru Tang, Yale University, United States
  • Andrew Tran, Yale University, United States
  • Jeffrey Tan, Yale University, United States
  • Mark Gerstein, Yale University, United States


Presentation Overview:

Current deep learning models for molecular representation predominantly focus on single data formats, restricting their versatility across modalities. To address this challenge in molecular representation learning, we present a unified pre-trained language model that concurrently captures biomedical text, 2D, and 3D molecular information. This model consists of a text Transformer encoder and a molecular Transformer encoder. Our approach employs contrastive learning as a supervisory signal for cross-modal information learning, and we assemble a multi-modality dataset using cheminformatics-based molecular modifications. As a result, we develop a three-modality pre-trained molecular language model (MoLM). MoLM demonstrates robust chemical and molecular representation capabilities in numerous downstream tasks, such as understanding and generating text from molecular structure inputs and retrieving 2D and 3D molecular structures based on textual inputs. Moreover, we underscore the significance of capturing 3D molecular structures. Comprehensive experiments conducted across multiple aspects of challenging tasks highlight the model's potential to expedite biomedical research.
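The contrastive supervisory signal mentioned above pulls matched (text, molecule) embedding pairs together and pushes mismatched ones apart. A minimal InfoNCE-style sketch of that objective; MoLM's actual loss, temperature, and encoders are not specified here, so treat this as a generic illustration:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def contrastive_loss(text_embs, mol_embs, temperature=0.1):
    """Mean -log softmax probability that text i matches molecule i."""
    loss = 0.0
    for i, t in enumerate(text_embs):
        sims = [dot(t, m) / temperature for m in mol_embs]
        mx = max(sims)  # stabilize log-sum-exp
        log_z = mx + math.log(sum(math.exp(s - mx) for s in sims))
        loss += log_z - sims[i]
    return loss / len(text_embs)

texts = [[1.0, 0.0], [0.0, 1.0]]
mols_aligned = [[1.0, 0.0], [0.0, 1.0]]
mols_shuffled = [[0.0, 1.0], [1.0, 0.0]]
# Aligned pairs give a lower loss than mismatched ones:
print(contrastive_loss(texts, mols_aligned) < contrastive_loss(texts, mols_shuffled))  # True
```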

11:20-11:40
Poster lightning presentations
Room: Salle Rhone 3b
Format: Live from venue

13:20-14:00
Invited Presentation: Reproducibility in biomedical natural language processing and applications to bioinformatics workflows
Room: Salle Rhone 3b
Format: Live from venue

  • Aurélie Névéol

14:00-15:00
Panel: Applications of ChatGPT and large language models in biology and medicine
Room: Salle Rhone 3b
Format: Live from venue

  • Larry Hunter
  • Harry Caufield