July 24, 2025
8:40-9:00
Opening remarks
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12

July 24, 2025
9:00-9:40
Invited Presentation: Keynote - TBA
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12

Authors List:

  • Chris Mungall

July 24, 2025
9:40-10:00
Poster lightning talks
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12

July 24, 2025
11:20-11:40
Representations of Cells in the Biomedical Literature: First Look at the NLM CellLink Corpus
Confirmed Presenter: Noam H. Rotenberg, Division of Intramural Research, National Library of Medicine
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Noam H. Rotenberg, Division of Intramural Research
  • Robert Leaman, Division of Intramural Research
  • Rezarta Islamaj, Division of Intramural Research
  • Brian Fluharty, Division of Intramural Research
  • Helena Kuivaniemi, Division of Intramural Research
  • Savannah Richardson, Division of Intramural Research
  • Gerard Tromp, Division of Intramural Research
  • Zhiyong Lu, Division of Intramural Research

Presentation Overview:

Single-cell technologies are enabling the discovery of many novel cell phenotypes, but this growing body of knowledge remains fragmented across the scientific literature. Natural language processing (NLP) offers a promising approach to extract this information at scale; however, the existing annotated datasets required for system development and evaluation do not reflect the complex assortment of cell phenotypes described in recent studies.

We present a new corpus of excerpts from recent articles, manually annotated with mentions of human and mouse cell populations. The corpus distinguishes three types of mentions: (1) specific cell phenotypes (cell types and states), (2) heterogeneous cell populations, and (3) vague cell population descriptions. Mentions of the first two types were linked to Cell Ontology identifiers where possible, using their meaning in context, with matches labeled as exact or related. Annotation was performed by four cell biologists using a multi-round process, with automated pre-annotation.

The corpus contains over 22,000 annotations across more than 3,000 passages selected from 2,700 articles, covering nearly half the concepts in the current Cell Ontology. Fine-tuning BiomedBERT in a simplified named entity recognition task on this corpus resulted in substantially higher performance than the same configuration fine-tuned on previously annotated datasets.

Our corpus will be a valuable resource for developing automated systems to identify cell phenotype mentions in the biomedical literature, a challenging benchmark for evaluating biomedical NLP systems, and a foundation for the future extraction of relationships between cell types and key biomedical entities, including genes, anatomical structures, and diseases.

July 24, 2025
11:40-12:00
Contextualizing phenotypes in medical notes with small language models
Confirmed Presenter: Connor Grannis, Nationwide Children's Hospital, United States
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Connor Grannis, Nationwide Children's Hospital
  • Max Homilius, Nationwide Children's Hospital
  • Austin A. Antoniou, Nationwide Children's Hospital
  • David M. Gordon, Nationwide Children's Hospital
  • Ashley Kubatko, Nationwide Children's Hospital
  • Peter White, Nationwide Children's Hospital

Presentation Overview:

Accurate phenotypic extraction from clinical notes is essential for precision medicine. While manual approaches are time-consuming and prone to bias, automated phenotype recognition tools often misinterpret contextual attributes—such as negation, temporality, or familial association—due to variability in documentation styles. Evaluation is further hindered by the lack of gold-standard datasets with context attributes.

To address this gap, we 1) annotated the ID-68 dataset (an open-source dataset of 68 clinical notes from a cohort of patients with intellectual disabilities) with context attributes using a large language model (LLM) followed by manual review; 2) used an LLM to generate 50 synthetic clinical notes modeled on the ID-68 dataset, seeding them with phenotypes associated with OMIM diseases and diverse contextual attributes; and 3) fine-tuned small language models (SLMs) to perform binary classification of whether a phenotype is negated, hypothetical, or associated with a family member.

We evaluated several phenotype concept recognition models on a span-level NER task, including the correct classification of negated, family-related, or hypothetical phenotype mentions. Our results demonstrate that existing phenotype recognition tools are effective for extracting phenotypes that are mostly patient-related (as in ID-68), but insufficient for more complex contexts. By augmenting extracted phenotypes with SLMs, we boosted context accuracy in the synthetic dataset from 57% to 89%. These findings highlight the importance of accurate contextual awareness in phenotype extraction pipelines. Our synthetic dataset and evaluation framework offer a foundation for benchmarking future tools and advancing scalable, high-fidelity phenotype extraction for precision medicine applications.

July 24, 2025
12:00-12:20
CSpace: A concept embedding space for bio-medical applications
Confirmed Presenter: Danilo Tomasoni, Fondazione the Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI), Italy
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Danilo Tomasoni, Fondazione the Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI)
  • Luca Marchetti, Fondazione the Microsoft Research - University of Trento Centre for Computational and Systems Biology (COSBI)

Presentation Overview:

Motivation: The rise of transformer-based architectures dramatically improved our ability to analyze natural language. However, the power and flexibility of these general-purpose models come at the cost of highly complex model architectures with billions of parameters that are not always needed.
Results: In this work, we present CSpace: a concise word embedding of bio-medical concepts that outperforms alternatives on out-of-vocabulary ratio and on the semantic textual similarity task, and performs comparably to transformer-based alternatives on the sentence similarity task. This ability can serve as the foundation for semantic search by enabling efficient retrieval of conceptually related terms. Additionally, CSpace incorporates ontological identifiers (MeSH, NCBI gene and taxonomy IDs), enabling computationally efficient cross-ontology relatedness measurement and potentially unlocking previously unknown disease-condition associations.
Method: CSpace was trained with the FastText algorithm on full-text articles from PubMed, PubMed Central, ClinicalTrials (US) and preprints from bioRxiv and medRxiv published in 2024. CSpace encodes concepts rather than words: it merges multiple words pertaining to the same concept using both PubTator3 annotations and statistical word co-occurrence.
Conclusion: CSpace outperforms other embedding models in both concept and sentence similarity tasks. It also surpasses the transformer-based OpenAI ada-v2 model in the concept similarity task, with a performance trade-off of less than 5% in the sentence similarity task. Additionally, CSpace can effectively measure associations among diseases, genes, and clinical conditions, even across different ontologies, using less than 10% of the embedding dimensions required by ada-v2, making it a highly efficient and accessible tool for democratizing advanced embedding technologies.

July 24, 2025
12:20-12:40
VectorSage: Enhancing PubMed Article Retrieval with Advanced Semantic Search
Confirmed Presenter: Rahul Brahma, University Medicine Greifswald, Germany
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Yasas Wijesekara, University Medicine Greifswald
  • Rahul Brahma, University Medicine Greifswald
  • Mehdi Lotfi, University Medicine Greifswald
  • Marcus Vollmer, University Medicine Greifswald
  • Lars Kaderali, University Medicine Greifswald

Presentation Overview:

The exponential growth of academic literature has presented unprecedented opportunities and highlighted the need for advanced search methodologies for efficient knowledge discovery. While effective for structured queries, traditional keyword-based search engines often struggle with the inherent variability of language, where the same concept can be expressed in many ways, leading to imprecise retrieval of relevant articles. Recent advancements in natural language processing (NLP) have facilitated the development of semantic similarity techniques that extend beyond simple text matching, enabling more contextually aware search capabilities.
Building on these advancements, and to address the limitations of the traditional approach, we introduce VectorSage, a hybrid search framework that integrates semantic similarity search and keyword-based retrieval to enhance the discovery of peer-reviewed academic literature.
VectorSage employs a multi-step ranking mechanism executed in parallel: (1) a semantic similarity search using FAISS with Stella-400M embeddings to retrieve conceptually related articles; and (2) a keyword-based search leveraging BM25S for probabilistic text ranking. The results are independently ranked and merged into a globally optimized ranked list using a weighted scoring function, balancing semantic relevance with keyword specificity. This hybrid approach is particularly useful where terminology consistency varies, allowing researchers to retrieve articles that traditional search techniques might otherwise miss.
Tested on over 26 million PubMed abstracts, VectorSage significantly improves retrieval of relevant articles, facilitating more effective literature exploration. As a freely accessible web tool, VectorSage enhances high-quality academic literature search across disciplines.
VectorSage is live at: https://vectorsage.nube.uni-greifswald.de/.
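
The merging step described above can be illustrated with a small sketch. The abstract does not specify VectorSage's scoring function, so everything here is an assumption: the min-max normalization, the `alpha` weight, the function names, and the toy scores are hypothetical and only show one common way to fuse a semantic ranking with a keyword ranking.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to [0, 1]; constant scores map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(semantic, keyword, alpha=0.5):
    """Merge two independently scored result lists into one ranking.

    alpha weights the semantic score and (1 - alpha) the keyword score
    (an assumed weighting scheme, not the one used by VectorSage).
    A document missing from one list contributes 0 for that component.
    """
    sem, kw = min_max(semantic), min_max(keyword)
    docs = set(sem) | set(kw)
    combined = {d: alpha * sem.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0) for d in docs}
    # Sort by fused score (descending); ties are broken by document id.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy scores: embedding similarities and BM25 scores.
semantic_hits = {"pmid:1": 0.91, "pmid:2": 0.85, "pmid:3": 0.40}
keyword_hits = {"pmid:2": 12.0, "pmid:3": 9.0, "pmid:4": 3.0}
ranking = fuse(semantic_hits, keyword_hits, alpha=0.6)
```

In this toy case the article found by both searches ("pmid:2") ranks first, ahead of the article with the best semantic score alone. Normalizing before weighting matters because similarity scores and BM25 scores live on different scales.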

July 24, 2025
12:40-13:00
Large Language Model Applications on the UniProt Protein Sequence and Annotation Database
Confirmed Presenter: Melike Akkaya, Hacettepe University, Turkey
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Melike Akkaya, Hacettepe University
  • Rauf Yanmaz, Hacettepe University
  • Sezin Yavuz, Hacettepe University
  • Vishal Joshi, European Molecular Biology Laboratory - European Bioinformatics Institute
  • Maria-Jesus Martin, European Molecular Biology Laboratory - European Bioinformatics Institute
  • Tunca Doğan, Hacettepe University

Presentation Overview:

Efficiently accessing and analyzing comprehensive biological datasets remains challenging due to the complexity of traditional querying. To address this, we developed an intuitive, scalable query interface using advanced large language models. Our system enables users, regardless of computational expertise, to formulate natural-language queries that are automatically translated into precise Solr database searches, significantly simplifying interaction with UniProtKB. Additionally, we implemented a semantic vector search for rapid protein similarity analyses, using protein embeddings generated by the ProtT5 protein language model within an optimized approximate nearest-neighbor search framework (Annoy). This method significantly outperforms conventional BLAST searches, offering a speed increase of up to tenfold on GPU hardware. Functional insights are further enriched through integrated Gene Ontology analyses, providing biologically meaningful context to similarity searches. Currently, we are expanding the system using Retrieval-Augmented Generation, integrating real-time annotations from UniProt flat files to enhance contextual relevance and accuracy of generated responses. Evaluations using diverse biological queries demonstrated the robustness of our interface, highlighting its ability to mitigate intrinsic variability in LLM outputs through controlled prompt engineering and query retry mechanisms. Overall, our project substantially streamlines the retrieval process, facilitating quicker, more accurate exploration of protein functions, evolutionary relationships, and annotations.

July 24, 2025
14:00-14:20
Human-AI Collaboration for Cancer Knowledge Verification: Insights from the CIViC-Fact Dataset
Confirmed Presenter: Caralyn Reisle, University of British Columbia; Canada's Michael Smith Genome Sciences Centre, Canada
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Caralyn Reisle, University of British Columbia; Canada's Michael Smith Genome Sciences Centre
  • Cameron J. Grisdale, Canada's Michael Smith Genome Sciences Centre
  • Kilannin Krysiak, Washington University in St. Louis
  • Arpad M. Danos, Washington University in St. Louis
  • Mariam Khanfar, Washington University in St. Louis
  • Erin Pleasance, Canada's Michael Smith Genome Sciences Centre
  • Jason Saliba, Washington University in St. Louis
  • Melika Bonakdar, Canada's Michael Smith Genome Sciences Centre
  • Nilan V. Patel, Washington University in St. Louis
  • Joshua F. McMichael, Washington University in St. Louis
  • Malachi Griffith, Washington University in St. Louis
  • Obi L. Griffith, Washington University in St. Louis
  • Steven J.M. Jones

Presentation Overview:

Interpretation of genomic findings remains one of the largest barriers to automation in processing precision oncology patient data due to the high level of expertise required in cancer biology, genomics, and bioinformatics. Efforts to streamline this process include creating cancer knowledge bases (KB) to store annotations of individual genes and variants, but creating such resources is time-consuming. The open-data cancer KB CIViC (civicdb.org) adopted crowd-sourcing to curate content efficiently. However, these submissions still require expert review, leading to a new bottleneck.

To address this, we introduce CIViC-Fact, a novel benchmark designed to support automated fact-checking and claim verification in the biomedical domain. CIViC-Fact augments thousands of curated entries in the CIViC knowledge base with sentence-level evidence provenance, linking each claim to the specific sentences that support or contradict it. We evaluate the performance of several open large language models (LLMs) on this dataset. Existing LLMs struggle (up to 60% accuracy) and require fine-tuning to achieve reasonable performance. While fine-tuned language models perform well (up to 88% accuracy), there is significant room for improvement in the quality of their reasoning. Despite these remaining challenges, we have applied the current pipeline to the entirety of CIViC, flagging any errors detected. Flagged entries were returned to the CIViC curation team for follow-up, resulting in corrections to the KB content. This demonstrates the practical utility of CIViC-Fact, not only as a new benchmark for NLP research, but as a tool for semi-automated auditing of scientific knowledge bases.

July 24, 2025
14:20-14:40
Named entity recognition and relationship extraction to mine minimum inhibitory concentration of antibiotics from biomedical text
Confirmed Presenter: Tiffany Ta, McMaster University, Canada
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Tiffany Ta, McMaster University
  • Arman Edalatmand, McMaster University
  • Andrew McArthur, McMaster University

Presentation Overview:

Antimicrobial resistance (AMR) poses a global public health threat, undermining modern medicine by diminishing the effectiveness of antibiotics for treating bacterial infections. The minimum inhibitory concentration (MIC) is the lowest concentration at which an antibiotic inhibits bacterial growth. Based on pre-established MIC cutoff values, MIC can be used to determine whether an isolate will be susceptible or resistant to an antibiotic. To interpret MIC, various metadata (e.g., infection site and bacterial isolate) must also be collected. This information is scattered throughout the scientific literature available at PubMed Central (PMC) but has yet to be mined. The Comprehensive Antibiotic Resistance Database (CARD) is a globally used, expert-curated resource and knowledgebase of AMR determinants and antibiotics. Yet CARD lacks MIC information, which can provide insights into the phenotypic risk profile of individual antibiotic resistance genes (ARGs). We are leveraging natural language processing to extract MIC values and relevant information from 5,704,429 PMC articles. We trained a text classifier to identify PMC articles associated with bacterial drug resistance (F1 = 0.9699) and then filtered those articles with regular expressions to identify papers with MIC data, yielding 10,086 papers. Named entity recognition (NER) was then used to mine relevant MIC information, generating 1,082,026 annotations. We are now working towards employing generative models to extract MIC values from PMC articles. Once CARD has MIC data, we can track MIC across time, as increasing MIC values can be a forewarning for pathogens that may develop resistance.
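
The regular-expression filtering step can be pictured with a minimal sketch. The abstract does not give the actual expressions, so the pattern below is a hypothetical one written for illustration: it looks for "MIC" followed, within the same sentence, by a number and a concentration unit. The names `MIC_PATTERN`, `find_mic_mentions`, and `has_mic_data` are assumptions, not part of the authors' pipeline.

```python
import re

# Hypothetical pattern (not the authors'): MIC keyword, short gap, value, unit.
MIC_PATTERN = re.compile(
    r"\bMIC(?:50|90)?s?\b"          # MIC, MICs, MIC50, MIC90
    r"[^.\n]{0,40}?"                # short gap that cannot cross a period or newline
    r"(?P<cmp>[<>≤≥]=?|=)?\s*"      # optional comparator such as >= or ≤
    r"(?P<value>\d+(?:\.\d+)?)\s*"  # numeric value, with optional decimals
    r"(?P<unit>(?:µg|μg|ug|mg)/(?:mL|ml|L|l))"
)


def find_mic_mentions(text):
    """Return (comparator, value, unit) for each MIC-like mention in text."""
    return [
        (m.group("cmp") or "=", float(m.group("value")), m.group("unit"))
        for m in MIC_PATTERN.finditer(text)
    ]


def has_mic_data(text):
    """Cheap document filter: is there at least one MIC value with a unit?"""
    return MIC_PATTERN.search(text) is not None


sample = (
    "The MIC of ciprofloxacin was 0.5 ug/mL for the susceptible isolate. "
    "Resistant isolates showed MIC >= 32 ug/mL."
)
mentions = find_mic_mentions(sample)
```

A filter like `has_mic_data` is inexpensive enough to run over millions of articles before a heavier NER model is applied, which is the general shape of the two-stage process the abstract describes. A real pattern would need to handle tables, ranges, and unit variants that this sketch ignores.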

July 24, 2025
14:40-15:00
Proceedings Presentation: Enhancing Biomedical Relation Extraction with Directionality
Confirmed Presenter: Chih-Hsuan Wei, National Center for Biotechnology Information, United States
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Po-Ting Lai, National Center for Biotechnology Information
  • Chih-Hsuan Wei, National Center for Biotechnology Information
  • Shubo Tian, National Center for Biotechnology Information
  • Robert Leaman, National Center for Biotechnology Information
  • Zhiyong Lu, NCBI

Presentation Overview:

Biological relation networks contain rich information for understanding the biological mechanisms behind the relationships of entities such as genes, proteins, diseases, and chemicals. The vast growth of biomedical literature poses significant challenges for updating network knowledge. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. Nonetheless, its annotations lack directionality (subject/object) for the entity roles, which is essential for studying complex biological networks. Herein we annotate the entity roles of the relationships in the BioRED corpus and subsequently propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models such as the state-of-the-art GPT-4 and Llama-3 on two benchmarking tasks.

July 24, 2025
15:00-15:20
Metadata extraction: Large Language Models (LLMs) to the rescue
Confirmed Presenter: Daniela Gaio, UZH, Switzerland
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Daniela Gaio, UZH

Presentation Overview:

In this project, our research group embarked on an extensive effort to download and re-analyze all publicly accessible metagenomic samples from the NCBI database, culminating in the creation of MicrobeAtlas.org, a resource for the scientific community. We employed Large Language Models (LLMs) to efficiently extract relevant information from the often chaotic and submitter-dependent metadata files. This technique represents a significant leap over traditional methods, such as conventional natural language processing tools, in the efficiency of metadata mining.

Clean, accessible metadata are critical for the analysis of metagenomic samples in microbial -omics. Despite advancements, the challenge of disorganized metadata remains, limiting dataset utility. Applying LLMs greatly enhanced the extraction of keywords, geographical data, sample type, and host. This improvement unveiled signals within the metagenomic data previously masked by the limitations of conventional NLP tools, thus increasing dataset value and access.

Our approach included developing and validating a pipeline for processing metadata from 3.8 million samples. I will outline the challenges encountered and the solutions implemented, including a comparison of paid and free LLMs.

In conclusion, our efforts to improve the accessibility and utility of metagenomic datasets not only enable the reuse of existing data for comparative analysis and new discoveries, but also establish a new benchmark in metadata analysis within microbial ecology. The advancements in metadata extraction foster more detailed and comprehensive research, significantly enhancing our understanding of microbial ecosystems.

July 24, 2025
15:20-15:40
Large-scale semantic indexing of Spanish biomedical literature using contrastive transfer learning
Confirmed Presenter: Shanfeng Zhu, Fudan University, China
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Ronghui You, Nankai University
  • Tianyang Huang, Fudan University
  • Ziye Wang, Fudan University
  • Yuxuan Liu, Fudan University
  • Hong Zhou, Atypon Systems
  • Shanfeng Zhu, Fudan University

Presentation Overview:

The exponential growth of biomedical literature has made automatic indexing essential for advancing biomedical research. While automatic indexing has made strides in English biomedical literature, there has been limited research on non-English biomedical texts due to insufficient high-quality training data. We propose BERTDeCS, a novel deep learning framework for automatically indexing Spanish biomedical literature using contrastive transfer learning. BERTDeCS uses a multilingual BERT (M-BERT) to generate multi-language representations and adapts M-BERT to the Spanish biomedical literature domain through contrastive learning. Additionally, BERTDeCS enhances its semantic indexing capabilities on Spanish biomedical literature by leveraging the richly annotated English literature in MEDLINE through transfer learning. Experimental results on Spanish datasets demonstrate that BERTDeCS outperforms state-of-the-art indexing methods, achieving top performance in the MESINESP and MESINESP2 Tasks on medical semantic indexing in Spanish within the BioASQ challenge. Notably, when extended to other languages (e.g., Portuguese) or applied in settings lacking manual indexing, BERTDeCS maintains exceptional performance, affirming its robustness in non-English biomedical semantic indexing.

July 24, 2025
15:40-16:00
Reading papers: Extraction of molecular interaction networks with large language models
Confirmed Presenter: Enio Gjerga, University Hospital Heidelberg, Germany
Track: Text Mining: Text Mining for Healthcare and Biology

Room: 12
Format: In person

Authors List:

  • Enio Gjerga, University Hospital Heidelberg
  • Philipp Wiesenbach, University Hospital Heidelberg
  • Christoph Dieterich, University Hospital Heidelberg

Presentation Overview:

Motivation: Signalling occurs within and across cells and orchestrates essential cellular processes in complex tissues. Cell signalling involves several different components, including protein-protein interactions (PPIs). Dynamically changing conditions often lead to the rewiring of cellular communication networks. Computational modelling approaches typically rely on databases of molecular interactions. Manual curation of these databases is time-consuming, and automatic relation extraction (RE) from scientific literature would greatly support our efforts to understand molecular mechanisms. To ease this process, we reason that prompt-based data mining with Large Language Models (LLMs) could be used to extract information from relevant scientific publications.
Approach: We focus on the extraction of entity relations between proteins, as exemplified in protein-protein interaction networks, over a corpus of annotated short scientific texts. We analyze vanilla and fine-tuned models with different prompt setups, in which we either give no examples at all or follow specific patterns, e.g. giving only correct examples, only incorrect examples, or both.
Results: We rely on the RegulaTome corpus of annotated abstracts and short scientific texts, on which we obtain promising precision, recall and F1-score for the extraction of PPI relations: 79%, 70% and 71%, respectively. Our workflow also ingests entire manuscripts and yields 96%, 65% and 77%, respectively, for PPI relation extraction over a corpus of manually annotated cardiac manuscripts.
Availability: Code, scripts and results are available at: https://github.com/dieterich-lab/LLM_Relations.
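
For readers unfamiliar with the metrics quoted in the Results, the sketch below shows how precision, recall and F1 are typically computed when extracted relations are compared against a gold standard. It is not taken from the linked repository: the function names, the treatment of interactions as undirected pairs, and the toy protein pairs are all assumptions made for illustration.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 for two collections of relation pairs."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def undirected(pairs):
    """Treat A-B and B-A as the same interaction by sorting each pair."""
    return {tuple(sorted(p)) for p in pairs}


# Hypothetical toy data: pairs chosen for illustration only.
gold = undirected([("TP53", "MDM2"), ("EGFR", "GRB2"), ("AKT1", "MTOR"), ("RAF1", "MAP2K1")])
predicted = undirected([("MDM2", "TP53"), ("GRB2", "EGFR"), ("AKT1", "GSK3B")])
p, r, f1 = prf1(predicted, gold)
```

Here two of three predictions are correct and two of four gold pairs are recovered, giving precision 2/3, recall 1/2 and F1 4/7. When scores are averaged over several relation types or documents, the averaged F1 generally differs from the harmonic mean of the averaged precision and recall, so a reported triple need not satisfy that identity exactly.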