Posters - Schedules

Poster presentation times:
Session A: Monday, July 24, between 18:00 CEST and 19:00 CEST
Session B: Tuesday, July 25, between 18:00 CEST and 19:00 CEST
Session C: Wednesday, July 26, between 18:00 CEST and 19:00 CEST

Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 24, between 08:00 CEST and 08:45 CEST
Session A Posters dismantle:
Monday, July 24, at 19:00 CEST

Session B Poster Set-up and Dismantle
Session B Posters set up:
Tuesday, July 25, between 08:00 CEST and 08:45 CEST
Session B Posters dismantle:
Tuesday, July 25, at 19:00 CEST

Session C Poster Set-up and Dismantle
Session C Posters set up:
Wednesday, July 26, between 08:00 CEST and 08:45 CEST
Session C Posters dismantle:
Wednesday, July 26, at 19:00 CEST
Virtual
C-423: Integrating 3D and 2D Molecular Representations with Biomedical Text via a Unified Pre-trained Language Model
Track: Text Mining
  • Xiangru Tang, Yale University, United States
  • Andrew Tran, Yale University, United States
  • Jeffrey Tan, Yale University, United States
  • Mark Gerstein, Yale University, United States


Presentation Overview:

Current deep learning models for molecular representation predominantly focus on single data formats, restricting their versatility across various modalities. To address these challenges in molecular representation learning, we present a unified pre-trained language model that concurrently captures biomedical text, 2D, and 3D molecular information. This model consists of a text Transformer encoder and a molecular Transformer encoder. Our approach employs contrastive learning as a supervisory signal for cross-modal information learning, and we assemble a multi-modality dataset using cheminformatics-based molecular modifications. As a result, we develop a three-modality pre-trained molecular language model (MoLM). MoLM demonstrates robust chemical and molecular representation capabilities in numerous downstream tasks such as understanding and generating text from molecular structure inputs, and retrieving 2D and 3D molecular structures based on textual inputs. Moreover, we underscore the significance of capturing 3D molecular structures. Comprehensive experiments conducted across multiple aspects of challenging tasks highlight the model’s potential to expedite biomedical research.
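
The cross-modal supervision described here is contrastive: matched text-molecule pairs are pulled together and mismatched pairs pushed apart. Below is a minimal sketch of a symmetric InfoNCE objective of the kind used in CLIP-style models; the tensor shapes, names, and temperature are illustrative assumptions, not details taken from MoLM.

```python
# Minimal sketch of a symmetric InfoNCE contrastive objective (CLIP-style),
# illustrating the kind of cross-modal supervisory signal described above.
# Shapes, names, and the temperature are illustrative assumptions, not MoLM's.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, mol_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Pull matched (text, molecule) pairs together, push mismatched apart."""
    text_emb = F.normalize(text_emb, dim=-1)     # unit-length text vectors
    mol_emb = F.normalize(mol_emb, dim=-1)       # unit-length molecule vectors
    logits = text_emb @ mol_emb.T / temperature  # pairwise cosine similarities
    targets = torch.arange(len(logits), device=logits.device)  # diagonal = matched pairs
    # Symmetric loss: text-to-molecule and molecule-to-text retrieval
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Example: a batch of 8 paired embeddings of dimension 256
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```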

C-424: Extracting protein-protein interactions from the literature with deep learning-based text mining
Track: Text Mining
  • Katerina Nastou, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark
  • Farrokh Mehryary, University of Turku, Finland
  • Sampo Pyysalo, University of Turku, Finland
  • Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Denmark


Presentation Overview:

STRING is a database of billions of protein-protein interactions, systematically collected from a number of sources, including automated text mining of the scientific literature. Our study presents how deep learning-based large language models (LLMs) have been used to develop the physical-interactions mode in STRING, and the future potential of such models. For STRING v12, we manually annotated a corpus of 1,287 documents to generate a training set of scientific texts discussing physical interactions between protein pairs; 3,425 complex-formation relationships were annotated in these documents. We performed a document-based split of the corpus and used it to fine-tune several biomedical transformer-based LLMs. We achieved a best micro-averaged F1 score of 84.2% (std = 0.8%) with a biomedical RoBERTa-large model. In the next version of STRING, we aim to expand toward detecting typed, directed, and signed associations. Preliminary results for this task have given us a best micro-averaged F1 score of 71.9% (std = 0.5%). Our study highlights the potential of transformer-based LLMs to identify protein interactions in the literature, which can contribute to the development of more informative protein interaction networks for STRING.
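
For readers unfamiliar with this setup, the sketch below shows how such a relation extraction model could be fine-tuned as a sequence classifier over sentences with marked protein pairs, using the Hugging Face Trainer API. The checkpoint name, entity-marker format, and hyperparameters are illustrative assumptions, not the study's actual configuration.

```python
# Minimal sketch of fine-tuning a transformer as a binary classifier over
# sentences with marked protein pairs. The checkpoint, marker format, and
# hyperparameters are illustrative assumptions, not the study's configuration.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "roberta-large"  # the study used a biomedical RoBERTa-large variant
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Candidate protein pairs wrapped in marker tokens;
# label 1 = physical interaction (complex formation), label 0 = none.
train = Dataset.from_dict({
    "text": ["@PROT1$ binds directly to @PROT2$ in vitro.",
             "@PROT1$ and @PROT2$ are expressed in the same tissue."],
    "label": [1, 0],
})
train = train.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ppi-re", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train,
    tokenizer=tokenizer,  # enables dynamic padding of batches
)
trainer.train()
```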

C-425: The challenge of evaluating factuality of biomedical language models
Track: Text Mining
  • Lawrence Hunter, UC Denver, United States


Presentation Overview:

The advent of large language models (LLMs) able to interact in natural language with high performance across multiple domains creates a critical new challenge problem in biomedical text mining. Biomedical applications of LLMs have become the subject of extensive discussion in areas such as clinical decision support, medical education, patient education, drug discovery, public health, and more. However, concerns about the trustworthiness of language model output remain a significant barrier to realizing this potential. One of the most pressing problems is the tendency of LLMs to generate convincing texts that are untrue and/or supported by hallucinated, non-existent citations. The availability of reliable automated systems to assess factuality is a prerequisite for many approaches to improving factuality in natural language generation. Existing benchmarks based on medical question answering or fact templates (e.g., BioLAMA) are inadequate. However, evaluating the degree of support for biomedical assertions is a common human activity, and one much more constrained than the general problem of detecting misinformation. There is thus a critical need for an effective, automated, multidimensional evaluation framework that assesses the many aspects of trustworthiness, including not only accuracy but also calibration and the ability to support LLM assertions with evidence.

C-426: Text mining for disease-lifestyle relations based on a novel lifestyle factors ontology
Track: Text Mining
  • Esmaeil Nourani, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Mikaela Koutrouli, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Yijia Xie, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Danai Vagiaki, Genome Biology Unit, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
  • Farrokh Mehryary, TurkuNLP group, Department of Computing, Faculty of Technology, University of Turku, Turku, Finland
  • Sampo Pyysalo, TurkuNLP group, Department of Computing, Faculty of Technology, University of Turku, Turku, Finland
  • Katerina Nastou, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Søren Brunak, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark
  • Lars Juhl Jensen, Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Denmark


Presentation Overview:

While text mining of genes, diseases, and their associations is well established, methods for identifying the equally important lifestyle factors are lacking, and disease-lifestyle associations remain largely hidden in the literature. This is partly due to the lack of a manually annotated corpus of disease-lifestyle associations on which to train a relation extraction model, and partly because no ontology of lifestyle factors is available to facilitate the detection of lifestyle mentions in the literature. Existing ontologies focus on specific aspects of lifestyle, and even when combined, other important disease-associated factors, such as socioeconomic factors, are captured poorly or not at all.
In this study, we developed a diverse and comprehensive ontology of lifestyle factors, starting from manual creation and followed by ontology expansion using state-of-the-art language models to identify new lifestyle-related concepts. We also created a corpus of 400 annotated abstracts and used it to train a BioBERT-based disease-lifestyle relation extraction model. The prototype implementation achieved an F-score of approximately 70% for classifying the four distinct relation types between diseases and lifestyle factors.
We plan to extend the annotated corpus and use the generalized model to extract all disease-lifestyle relations from the literature and generate a knowledge base of relations.
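
As a rough illustration of how an ontology can drive mention detection before relation extraction, the sketch below tags lifestyle-factor terms with dictionary matching. The terms and "LSF:" identifiers are hypothetical; the study's actual ontology and tagger are richer.

```python
# Minimal sketch of dictionary-based tagging of lifestyle-factor mentions with
# ontology terms, one conceivable first stage before relation extraction.
# The terms and "LSF:" identifiers are hypothetical, not the actual ontology.
import re

lifestyle_terms = {
    "smoking": "LSF:0001",
    "physical activity": "LSF:0002",
    "alcohol consumption": "LSF:0003",
}
# Longest terms first so multi-word matches win over their substrings
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, sorted(lifestyle_terms, key=len, reverse=True))) + r")\b",
    flags=re.IGNORECASE,
)

def tag_lifestyle_mentions(text: str):
    """Return (surface form, ontology id, character span) for each mention."""
    return [(m.group(1), lifestyle_terms[m.group(1).lower()], m.span())
            for m in pattern.finditer(text)]

print(tag_lifestyle_mentions(
    "Smoking and low physical activity are associated with type 2 diabetes."))
```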

C-427: Word embeddings capture functions of low complexity regions: scientific literature analysis using a transformer-based language model
Track: Text Mining
  • Sylwia Szymanska, Department of Computer Networks and Systems, Silesian University of Technology, Poland
  • Aleksandra Gruca, Department of Computer Networks and Systems, Silesian University of Technology, Poland


Presentation Overview:

Low Complexity Regions (LCRs) are fragments of protein sequences that consist of only a few types of amino acids. In the past, scientists ignored LCRs and assumed they were non-functional. Recent studies show that LCRs can have important roles in proteins; unfortunately, this information is not collected in a systematic way, and the free text of scientific publications is the only resource about LCRs and their functions. We therefore developed a language-model-based approach that retrieves information about the function of LCRs in an automated way. We convert the text to embeddings and perform PCA, which provides a single vector representing each publication. To train our model, we selected papers related to the following LCR functions: phase separation, aggregation, and DNA- and RNA-binding, and we applied a Random Forest (RF) classifier to the publications. The results were compared with a baseline LitSuggest model. Our model achieved a precision of 93.16, recall of 92.60, and F1-score of 92.87, whereas LitSuggest achieved a precision of 82.45, recall of 94.00, and F1-score of 87.85. In this work we present a language-model-based approach that retrieves publications containing information about LCRs and their functions through free-text analysis.
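
A minimal sketch of the embed, reduce, classify scheme described above follows; the encoder checkpoint and component count are illustrative assumptions, as the abstract does not name them.

```python
# Minimal sketch of the embed -> PCA -> Random Forest publication classifier.
# The encoder checkpoint and component count are illustrative assumptions;
# the abstract does not name them.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

abstracts = ["Polyglutamine-rich LCRs drive phase separation of the protein.",
             "We measured soil carbon flux across boreal forests."]
labels = [1, 0]  # 1 = discusses an LCR function, 0 = irrelevant

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model
X = encoder.encode(abstracts)  # one vector per publication

clf = make_pipeline(PCA(n_components=2), RandomForestClassifier(n_estimators=100))
clf.fit(X, labels)
print(clf.predict(encoder.encode(["RNA-binding LCRs promote protein aggregation."])))
```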

C-428: Exuberanter: A Versatile Python Tool for Data Extraction
Track: Text Mining
  • Teo Lovmar, Sweden
  • Johan Bengtsson-Palme, Chalmers University of Technology, Sweden
  • Marcus Wenne, Chalmers University of Technology, Sweden


Presentation Overview:

The growing volume of scientific literature poses significant challenges for researchers conducting comprehensive literature surveys. We have developed a Python-based tool that addresses this issue by automating much of the process while maintaining user control via a graphical user interface. Operating on the user's local computer, the tool lets users enter specific query terms for database searches to retrieve relevant articles. Users can also include additional articles for analysis that were not downloaded by the tool.

The software incorporates text searches, natural language processing, and graph data extraction techniques to extract a wide diversity of data from text or figures. The user first specifies the desired metadata (e.g., gene names, methods, sampling year) and where in the article the tool should search for this data (e.g., methods section, results section, plots).

Designed with flexibility in mind, the software can be easily adapted to various research needs, allowing users to customize their searches and extraction parameters. By offering a user-friendly and versatile solution, this Python-based data extraction tool seeks to support researchers in conducting more efficient and reproducible literature surveys across diverse scientific fields.
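
The abstract does not specify how the tool queries literature databases; as one plausible illustration, the sketch below retrieves PubMed records for a query via Biopython's Entrez module. The query string and email address are placeholders.

```python
# Minimal sketch of a programmatic PubMed query via Biopython's Entrez module,
# one plausible way such a tool could retrieve articles; the abstract does not
# say which client Exuberanter uses. Query and email are placeholders.
from Bio import Entrez

Entrez.email = "researcher@example.org"  # NCBI requires a contact address

# Search PubMed with the user's query terms
handle = Entrez.esearch(db="pubmed", term="antibiotic resistance genes", retmax=20)
ids = Entrez.read(handle)["IdList"]

# Fetch the matching records as plain-text abstracts
handle = Entrez.efetch(db="pubmed", id=",".join(ids), rettype="abstract", retmode="text")
print(handle.read())
```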

C-429: Instruction fine-tuning large language models with context-derived weak supervision improves clinical information extraction
Track: Text Mining
  • Brett Beaulieu-Jones, University of Chicago, United States
  • Elizabeth O'Neill, Rush University Medical Center, United States
  • Emily Alsentzer, Harvard Medical School, United States


Presentation Overview:

Biomedical studies frequently require manual chart review, whether as part of prospective recruitment or retrospective analysis. Large language models (LLMs) demonstrate strong performance on many natural language processing tasks, including information extraction from clinical notes. However, the combination of manual annotation burden and privacy concerns limits the datasets available for instruction fine-tuning. This study introduces context-derived weak supervision (CDWS) to address this challenge, generating weak labels for clinical notes using metadata contained in the structured data captured by EMRs (e.g., diagnoses, vital signs, labs). We fine-tuned the Flan-T5 model on the MIMIC-III clinical database and evaluated its performance on three clinical information extraction tasks. The CDWS base model outperformed the Flan-XXL model despite having significantly fewer parameters, and CDWS-Flan-XXL showed a 12.7% improvement over the vanilla Flan-XXL model. These results demonstrate the high potential of CDWS for instruction fine-tuning of LLMs. Future work will explore larger public tasks, evaluate the label generation logic, attempt to understand potential biases in extraction, and assess the effects of varying levels of label noise on the fine-tuning process.
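
A minimal sketch of the weak-supervision idea follows: pairing a clinical note with a target answer derived from structured EMR fields recorded in the same encounter. The field names, prompt template, and example record are hypothetical, not the study's schema.

```python
# Minimal sketch of context-derived weak supervision: pair a clinical note with
# a target answer taken from structured EMR fields recorded in the same
# encounter. Field names, prompt template, and the record are hypothetical.
def make_weak_example(note_text: str, structured: dict) -> dict:
    """Build one (instruction, target) pair with a weak, metadata-derived label."""
    instruction = ("Extract the patient's recorded diagnoses from the "
                   "following clinical note.\n\nNote: " + note_text)
    # Structured diagnosis entries act as weak labels: recorded alongside the
    # note but never hand-checked against its text.
    target = "; ".join(structured["diagnoses"])
    return {"input": instruction, "target": target}

record = {
    "note": "Pt presents with chest pain. Hx of T2DM and hypertension.",
    "diagnoses": ["type 2 diabetes mellitus", "essential hypertension"],
}
print(make_weak_example(record["note"], record))
```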

C-430: AIONER: An all-in-one scheme for biomedical named entity recognition using deep learning
Track: Text Mining
  • Ling Luo, Dalian University of Technology, China
  • Chih-Hsuan Wei, National Center for Biotechnology Information (NCBI), United States
  • Po-Ting Lai, National Center for Biotechnology Information (NCBI), United States
  • Robert Leaman, NCBI/NLM/NIH, United States
  • Qingyu Chen, NIH, United States
  • Zhiyong Lu, NCBI, United States


Presentation Overview:

Biomedical Named Entity Recognition (BioNER) aims to automatically identify biomedical entities within natural language text, serving as a necessary foundation for downstream text mining tasks and applications. However, manual labeling of training data for the BioNER task is expensive. The resulting data scarcity leaves current BioNER approaches prone to overfitting, limited in generalizability, and able to address only a single entity type at a time (e.g., gene or disease). To address these challenges, we propose a novel All-In-One (AIO) scheme that leverages external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We also present AIONER, a general-purpose BioNER tool based on state-of-the-art deep learning and our AIO scheme. We evaluate AIONER on 14 BioNER benchmark tasks, demonstrating its effectiveness, robustness, and favorable comparison to other cutting-edge approaches, such as multi-task learning. Availability and implementation: The source code, trained models, and data for AIONER are freely available at https://github.com/ncbi/AIONER.
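
One way to realize an all-in-one scheme is to pool corpora that annotate different entity types and mark each example with the type(s) the model should predict. The sketch below illustrates that idea; the marker format and examples are assumptions, not AIONER's exact implementation.

```python
# Minimal sketch of pooling NER corpora that annotate different entity types by
# wrapping each sentence in a marker of the type(s) to predict, one way to
# realize an all-in-one scheme. Markers and examples are assumptions, not
# AIONER's exact format.
def to_aio_example(tokens, bio_tags, entity_type):
    """Wrap a tagged sentence with task markers so one model serves many corpora."""
    # Marker tokens carry the 'O' tag; the model learns to condition on them.
    return ([f"<{entity_type}>"] + tokens + [f"</{entity_type}>"],
            ["O"] + bio_tags + ["O"])

gene_sent = to_aio_example(["BRCA1", "is", "mutated"], ["B-Gene", "O", "O"], "Gene")
disease_sent = to_aio_example(["melanoma", "progresses"], ["B-Disease", "O"], "Disease")
training_set = [gene_sent, disease_sent]  # pooled examples from different corpora
print(training_set)
```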

C-431: Towards a large-scale collection of quantitative traits across Arthropoda
Track: Text Mining
  • Joseph Cornelius, Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI, Lugano, Switzerland
  • Harald Detering, Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland
  • Oscar Lithgow Serrano, Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI, Lugano, Switzerland
  • Fabio Rinaldi, Dalle Molle Institute for Artificial Intelligence Research, IDSIA USI-SUPSI, Lugano, Switzerland
  • Robert Waterhouse, Department of Ecology and Evolution, University of Lausanne, Lausanne, Switzerland


Presentation Overview:

As biodiversity continues to decline, the ability of arthropods to adapt evolutionarily to many ecological niches becomes of growing interest (Watson et al. 2019, Grimaldi and Engel 2005). To understand the success of arthropods in adapting to diverse ecosystems, we need to comprehend the connection between genomic data and observable phenotypes. This requires large-scale quantitative analysis of trait data.
Our goal, therefore, is to create the first comprehensive and updatable database of organismal and ecological traits for Arthropoda. To achieve this, we use state-of-the-art text mining methods to analyze and extract arthropod-trait-value relationship triples from a wide range of literature.
We pursue a multi-step approach that starts with assembling arthropod, trait, and value ontologies and resources, followed by the collection of a literature corpus, the creation of manual annotations, the implementation of a Named Entity Recognition (NER) and Relation Extraction (RE) pipeline to identify and extract the relation triples, and finally the construction of a database to store our results and make them publicly accessible.
This project is the first step towards a large-scale collection of quantitative traits across Arthropoda and will in future allow co-interrogation of patterns of genomic evolutionary change with traits.
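
As a crude illustration of the extraction target, the sketch below pulls quantitative trait values (number plus unit) out of a sentence with a regular expression; the unit list is an assumption, and the project's NER/RE pipeline is far more sophisticated than this.

```python
# Minimal sketch of pulling quantitative trait values (number plus unit) out of
# a sentence with a regular expression; the unit list is an assumption and the
# project's NER/RE pipeline is far more sophisticated than this.
import re

VALUE = re.compile(r"(\d+(?:\.\d+)?)\s*(mm|cm|mg|g|days?|°C)\b")

sentence = ("Adult body length of Drosophila melanogaster is about 2.5 mm "
            "and development takes 10 days at 25 °C.")
for number, unit in VALUE.findall(sentence):
    print(f"value={number} unit={unit}")
```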

C-432: EasyNER: Using Artificial Intelligence to “Read” Biomedical Research Papers Easily
Track: Text Mining
  • Rafsan Ahmed, Lund University, Sweden
  • Petter Berntsson, Lund University, Sweden
  • Alexander Skafte, Lund University, Sweden
  • Salma Kazemi Rashed, Lund University, Sweden
  • Marcus Klang, Faculty of Engineering, LTH, Sweden
  • Adam Barvesten, Lund University, Sweden
  • Ola Olde, Lund University, Sweden
  • William Lindholm, Lund University, LTH, Sweden
  • Antton Lamarca Arrizabalaga, Lund University, Sweden
  • Pierre Nugues, Lund University, Department of Computer Science, Lund, Sweden
  • Sonja Aits, Lund University, Sweden


Presentation Overview:

Medical research generates a vast number of articles every year; the PubMed database alone contains ~35 million research articles. The knowledge scattered across this large domain can provide key insights into physiological mechanisms and disease processes, leading to novel medical interventions. However, due to the scale and complexity, it is a great challenge for researchers to utilize this information in full. This becomes especially problematic in cases of extreme urgency, such as the COVID-19 pandemic. Automated text mining can help researchers extract and connect information from such a large body of medical text.
We present EasyNER, an end-to-end pipeline for Named Entity Recognition (NER) of typical entities found in medical research articles: cells, chemicals, diseases, genes/proteins, and species. The pipeline can access and process large medical research article collections (PubMed, CORD-19) or raw text. It incorporates a series of deep learning models (based on BioBERT and fine-tuned on the HUNER corpus) and dictionary-based models (based on spaCy). The pipeline was deployed on two collections of autophagy-related abstracts from PubMed and on the CORD-19 corpus, a collection of 764,398 abstracts on COVID-19. This resulted in the detection of the expected entities and produced meaningful information that can be used to guide experimental researchers.
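
For a flavor of what one stage of such a pipeline looks like, the sketch below runs a publicly available BioBERT-based disease NER model over an abstract with the Hugging Face pipeline API; EasyNER ships its own fine-tuned models, so the checkpoint here is only a stand-in.

```python
# Minimal sketch of one pipeline stage: BioBERT-style disease NER over an
# abstract via the Hugging Face pipeline API. The public checkpoint below is
# only a stand-in; EasyNER ships its own fine-tuned models.
from transformers import pipeline

ner = pipeline("token-classification",
               model="alvaroalon2/biobert_diseases_ner",  # public disease-NER model
               aggregation_strategy="simple")             # merge word pieces into spans

abstract = "Autophagy dysfunction contributes to Parkinson disease and cancer."
for entity in ner(abstract):
    print(entity["word"], entity["entity_group"], round(entity["score"], 3))
```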

C-433: Combination of text-mining and tissue expression facilitates triaging of drug repurposing hits
Track: Text Mining
  • Anika Liu, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany
  • Sergio Picart-Armada, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany
  • Nathan Lawless, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany
  • Jan Jensen, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany
  • Stefano Patassini, Boehringer Ingelheim Pharma GmbH & Co. KG, Germany


Presentation Overview:

Indication expansion, or drug repurposing, is the process of identifying new therapeutic uses (or disease indications) for existing drugs, and is attractive for drug development because it requires fewer resources. A common approach to identifying new potential therapeutic opportunities is to find target-disease links via text mining and data-driven approaches. However, this approach can yield a high volume of indications that are typically either out of scope or of low application value. To identify more relevant results, we propose an alternative triaging workflow that refines the positive matches to those where baseline gene expression exists in disease-related tissues. Disease-tissue co-occurrence in the literature is used to define disease-related tissues, which are then matched to gene expression via the Human Protein Atlas. We benchmark our triaging strategy against target-disease links from MetaBase, finding that 97% of the high-confidence links are supported by it, significantly higher than expected at random (82%) or in low-confidence links (86%). Missed links mainly concern mediators of the immune response, which is expected since these are primarily linked to accessory cells rather than specific tissues. Overall, this approach can be a valuable tool to triage hits for the identification of new indications.
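
The triage logic reduces to a join: keep a target-disease link only if the target shows baseline expression in at least one disease-related tissue. The sketch below illustrates this with pandas; table layouts, column names, and the expression cutoff are assumptions, not the study's actual data model.

```python
# Minimal sketch of the triage logic: keep a target-disease link only if the
# target is expressed in at least one disease-related tissue. Table layouts,
# column names, and the expression cutoff are assumptions.
import pandas as pd

# Disease-related tissues, e.g., from literature co-occurrence
disease_tissue = pd.DataFrame({
    "disease": ["asthma", "asthma", "psoriasis"],
    "tissue": ["lung", "bronchus", "skin"],
})
# Baseline expression per gene and tissue (e.g., from the Human Protein Atlas)
expression = pd.DataFrame({
    "gene": ["IL13", "IL13", "TNF"],
    "tissue": ["lung", "skin", "skin"],
    "nTPM": [42.0, 0.3, 55.0],
})
candidate_links = pd.DataFrame({
    "gene": ["IL13", "TNF"],
    "disease": ["asthma", "asthma"],
})

EXPRESSED = 1.0  # nTPM cutoff for calling a gene expressed (assumption)
supported = (candidate_links
             .merge(disease_tissue, on="disease")
             .merge(expression, on=["gene", "tissue"])
             .query("nTPM >= @EXPRESSED")[["gene", "disease"]]
             .drop_duplicates())
print(supported)  # IL13-asthma survives; TNF-asthma lacks lung/bronchus expression
```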

C-434: Towards improving workflow reproducibility: Extracting information on workflows from text and code repositories
Track: CAMDA
  • Clémence Sebe, Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France
  • Frédéric Lemoine, Institut Pasteur, Université Paris Cité, G5 Evolutionary Genomics of RNA Viruses, 28 rue du Dr Roux, Paris 75015, France
  • Alban Gaignard, Nantes Université, CNRS, INSERM, l'institut du thorax, 8 quai Moncousu, Nantes F-44000, France
  • Olivier Ferret, Université Paris-Saclay, CEA, List, F-91120 Palaiseau, France
  • Sarah Cohen-Boulakia, Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France
  • Aurélie Névéol, Université Paris-Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, Orsay 91405, France


Presentation Overview:

Scientific workflows provide bioinformaticians with a means to represent, exchange, and ensure the reproducibility of their analysis pipelines. Workflows are described in the literature (text) and/or stored in workflow repositories (code). A major challenge in improving workflow reuse is rebuilding the link between the documentation and the workflow code.
Based on workflow descriptions found in the full text of articles in English, we propose a method for representing and extracting information about the components of workflows. We also use a symbolic approach relying on code structure to analyze the contents of GitHub repositories containing workflows and compare the information available from both sources.
We present a corpus of articles annotated with a workflow representation comprising 16 entities and 10 relations. We use this corpus to train and evaluate statistical models for extracting information about workflows, in particular tools used in the workflow. We then link the information extracted from the text to information extracted from the source code.
The results show both the feasibility of the extraction tasks and the complementarity of articles and code repositories in terms of information. This work is a first step towards the integration of workflow information from the literature and from workflow repositories.
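
As a toy example of the symbolic, code-structure side of such an analysis, the sketch below scans a Snakemake-style file and lists the tools invoked in each rule's shell command. The heuristic (first token of each pipeline stage) is an assumption and far cruder than the structured analysis described above.

```python
# Minimal sketch of a symbolic pass over a Snakemake-style file to list the
# tools invoked in each rule's shell command. The parsing heuristic (first
# token of each pipeline stage) is an assumption, much cruder than the
# structured code analysis described in the abstract.
import re

snakefile = '''
rule align:
    shell: "bwa mem ref.fa {input} | samtools sort -o {output}"
rule call_variants:
    shell: "bcftools mpileup -f ref.fa {input} | bcftools call -mv"
'''

rules = re.findall(r'rule (\w+):\s*\n\s*shell: "([^"]+)"', snakefile)
for name, cmd in rules:
    # Take the first token of each pipeline stage as the tool name
    tools = [stage.split()[0] for stage in cmd.split("|")]
    print(name, "->", tools)
```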

C-435: Modelling gene-to-phenotype relationships in monogenic neurological diseases
Track: Text Mining
  • Tariro Alfa Chatiza, The University of Edinburgh, United Kingdom
  • Thomas Ian Simpson, The University of Edinburgh, United Kingdom
  • Richard Chin, The University of Edinburgh, United Kingdom
  • Karen Mackenzie, The University of Edinburgh, United Kingdom


Presentation Overview:

Only about half of the people affected by rare genetic diseases receive a diagnosis, often after a long delay. Information linking mutations and symptoms is often inconsistently described and poorly represented in the literature, which makes detailed clinical interpretation difficult, expensive, and time-consuming. We present a novel large-scale computational approach to help create detailed disease-gene-phenotype knowledge and illustrate it with preliminary data for three monogenic neurological diseases: Dravet syndrome, familial hemiplegic migraine, and CDKL5 deficiency disorder. We have developed a literature retrieval tool, Cadmus, to build a full-text disease literature corpus as a source from which to identify biomedical concepts using NIH MetaMap, BioBERT, and scispaCy. We recently reported a proof-of-concept study comparing the performance of this system to manual phenotype annotation for 99 genetic developmental disorders from the Genes2Phenotype database, with performance exceeding manual curation accuracy by 5-10%. Here we build upon this approach by also extracting Human Genome Variation Society (HGVS) encodings of gene variants using tmVar. Together, these tools allow extraction of gene, variant, and phenotype data from the literature at scale to generate predictive disease models. These will yield new methods and resources to improve diagnosis and further our understanding of the underlying molecular basis of neurological diseases of rare genetic origin.
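
As a small illustration of what variant extraction targets, the sketch below matches simple HGVS-style substitution mentions with a regular expression; it is a crude stand-in for tmVar, covers only coding and protein substitutions, and the example sentence is constructed for illustration.

```python
# Minimal sketch of spotting HGVS-style variant mentions with a regular
# expression, a crude stand-in for the tmVar-based extraction described above.
# The pattern covers only simple coding/protein substitutions (an assumption),
# and the example sentence is constructed for illustration.
import re

HGVS = re.compile(r"\b(?:c\.\d+[ACGT]>[ACGT]|p\.[A-Z][a-z]{2}\d+[A-Z][a-z]{2})\b")

text = ("The SCN1A variant c.2589G>A (p.Met863Ile) was reported in a patient "
        "with a Dravet-like phenotype.")
print(HGVS.findall(text))  # ['c.2589G>A', 'p.Met863Ile']
```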