The development and use of pretrained language models (PLMs) and Transformer technologies adapted to clinical and biomedical content have resulted in a considerable improvement of biomedical NLP systems. Nevertheless, the complexity and variability encountered when processing different sources of content, together with the lack of annotated resources and components to extract information from clinical records written in different languages, still pose a challenge for the analysis and automatic detection of clinical variables from text.
I will outline some recent efforts related to biomedical text mining, covering different types of content such as literature and social media, and discuss how such approaches could be of importance for multilingual clinical record processing.
In response to the growing demand for NLP tools in languages other than English, I will provide an illustration of how Spanish clinical corpora can be used to generate annotated datasets under a multilingual scenario. This initial adaptation proposal extends to a variety of languages, such as Italian, Swedish, Portuguese, Dutch, and even English. By leveraging existing high-quality annotated corpora in English or Spanish, we can effectively address the need for NLP tools in multiple languages, promoting broader accessibility and support across linguistic boundaries to meet the needs of clinical information extraction.
Shared tasks and access to high-quality annotated corpora and guidelines are becoming key to fostering technological development, monitoring progress, and assessing the quality of automatically generated results. Therefore, an overview of clinical NLP shared tasks and datasets will be provided. Finally, I will briefly introduce several application scenarios my group is currently working on, such as biomaterials research (BIOMATDB), clinical cardiology (DataTools4Heart & AI4HF), rare diseases (BARITONE), and occupational health (AI4ProfHealth).