

Schedule for Text Mining


Date Start Time End Time Room Track Title Confirmed Presenter Format Authors Abstract
2025-07-24 08:40:00 09:00:00 12 Text Mining Opening remarks
2025-07-24 09:00:00 09:40:00 12 Text Mining Keynote - TBA, Chris Mungall
2025-07-24 09:40:00 10:00:00 12 Text Mining Poster lightning talks
2025-07-24 11:20:00 11:40:00 12 Text Mining Representations of Cells in the Biomedical Literature: First Look at the NLM CellLink Corpus Noam H. Rotenberg Noam H. Rotenberg, Robert Leaman, Rezarta Islamaj, Brian Fluharty, Helena Kuivaniemi, Savannah Richardson, Gerard Tromp, Zhiyong Lu, Richard H. Scheuermann Single-cell technologies are enabling the discovery of many novel cell phenotypes, but this growing body of knowledge remains fragmented across the scientific literature. Natural language processing (NLP) offers a promising approach to extract this information at scale; however, the existing annotated datasets required for system development and evaluation do not reflect the complex assortment of cell phenotypes described in recent studies. We present a new corpus of excerpts from recent articles, manually annotated with mentions of human and mouse cell populations. The corpus distinguishes three types of mentions: (1) specific cell phenotypes (cell types and states), (2) heterogeneous cell populations, and (3) vague cell population descriptions. Mentions of the first two types were linked to Cell Ontology identifiers, using their meaning in context, with matches labeled as exact or related, where possible. Annotation was performed by four cell biologists using a multi-round process, with automated pre-annotation. The corpus contains over 22,000 annotations across more than 3,000 passages selected from 2,700 articles, covering nearly half the concepts in the current Cell Ontology. Fine-tuning BiomedBERT in a simplified named entity recognition task on this corpus resulted in substantially higher performance than the same configuration fine-tuned on previously annotated datasets. Our corpus will be a valuable resource for developing automated systems to identify cell phenotype mentions in the biomedical literature, a challenging benchmark for evaluating biomedical NLP systems, and a foundation for the future extraction of relationships between cell types and key biomedical entities, including genes, anatomical structures, and diseases.
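The fine-tuning step described above can be approximated with a standard token-classification setup. The sketch below assumes BIO-tagged data prepared from the corpus; the BiomedBERT checkpoint name, the simplified single-entity label set, and the placeholder train/dev datasets are illustrative, not the authors' exact configuration.

    # Minimal sketch: fine-tune a BERT-style encoder for cell-mention NER (assumptions noted above).
    from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                              Trainer, TrainingArguments)

    labels = ["O", "B-CELL", "I-CELL"]  # simplified single-entity BIO scheme (assumed)
    checkpoint = "microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext"  # assumed model ID

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(labels),
        id2label=dict(enumerate(labels)),
        label2id={label: i for i, label in enumerate(labels)},
    )

    args = TrainingArguments(output_dir="cell-ner", learning_rate=3e-5,
                             num_train_epochs=3, per_device_train_batch_size=16)
    # train_ds / dev_ds: pre-tokenized datasets with word-aligned BIO labels (placeholders).
    trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
    trainer.train()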
2025-07-24 11:40:00 12:00:00 12 Text Mining Contextualizing phenotypes in medical notes with small language models Connor Grannis Connor Grannis, Max Homilius, Austin A. Antoniou, David M. Gordon, Ashley Kubatko, Peter White Accurate phenotypic extraction from clinical notes is essential for precision medicine. While manual approaches are time-consuming and prone to bias, automated phenotype recognition tools often misinterpret contextual attributes (such as negation, temporality, or familial association) due to variability in documentation styles. Evaluation is further hindered by the lack of gold-standard datasets with context attributes. To address this gap, we 1) annotated the ID-68 dataset (an open-source dataset of 68 clinical notes from a cohort of patients with intellectual disabilities) with context attributes using a large language model (LLM) followed by manual review; 2) generated 50 synthetic clinical notes modeled on the ID-68 dataset using an LLM, seeding them with phenotypes associated with OMIM diseases and diverse contextual attributes; and 3) fine-tuned small language models (SLMs) to perform binary classification of whether a phenotype is negated, hypothetical, or associated with a family member. We evaluated several phenotype concept recognition models on a span-level NER task, including the correct classification of negated, family-related, or hypothetical phenotype mentions. Our results demonstrate that existing phenotype recognition tools are effective for extracting phenotypes that are mostly patient-related (i.e., ID-68), but insufficient for more complex contexts. By augmenting extracted phenotypes with SLMs, we boosted context accuracy in the synthetic dataset from 57% to 89%. These findings highlight the importance of accurate contextual awareness in phenotype extraction pipelines. Our synthetic dataset and evaluation framework offer a foundation for benchmarking future tools and advancing scalable, high-fidelity phenotype extraction for precision medicine applications.
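As a concrete illustration of the context-classification step, the sketch below runs a fine-tuned small sequence classifier over a note excerpt with the phenotype mention marked inline. The checkpoint path, the label names, and the inline-marker input format are assumptions made for the sketch, not the authors' exact setup.

    # Illustrative inference sketch for one context attribute (negation).
    from transformers import pipeline

    negation_clf = pipeline("text-classification",
                            model="./phenotype-negation-slm")  # hypothetical fine-tuned SLM checkpoint

    note = "There was no evidence of [PHENO] seizures [/PHENO] during the admission."
    print(negation_clf(note))  # e.g. [{'label': 'NEGATED', 'score': ...}] with assumed label names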
2025-07-24 12:00:00 12:20:00 12 Text Mining CSpace: A concept embedding space for bio-medical applications Danilo Tomasoni Danilo Tomasoni, Luca Marchetti Motivation: The rise of transformer-based architectures dramatically improved our ability to analyze natural language. However, the power and flexibility of these general-purpose models come at the cost of highly complex model architectures with billions of parameters that are not always needed. Results: In this work, we present CSpace: a concise word embedding of bio-medical concepts that outperforms alternatives in terms of out-of-vocabulary ratio and the semantic textual similarity task, and has comparable performance with respect to transformer-based alternatives in the sentence similarity task. This ability can serve as the foundation for semantic search by enabling efficient retrieval of conceptually related terms. Additionally, CSpace incorporates ontological identifiers (MeSH, NCBI gene and taxonomy IDs) enabling computationally efficient cross-ontology relatedness measurement, potentially unlocking previously unknown disease-condition associations. Method: CSpace was trained with the FastText algorithm on full-text articles from PubMed, PubMed Central, ClinicalTrials (US) and preprints from BioRxiv and MedRxiv published in 2024. CSpace encodes concepts rather than words: it combines multiple words pertaining to the same concept using both PubTator3 annotations and statistical word co-occurrence. Conclusion: CSpace outperforms other embedding models in both concept and sentence similarity tasks. It also surpasses the transformer-based OpenAI ada-v2 model in the concept similarity task, with a performance trade-off of less than 5% in the sentence similarity task. Additionally, CSpace can effectively measure associations among diseases, genes, and clinical conditions, even across different ontologies, using less than 10% of the embedding dimensions required by ada-v2, making it a highly efficient and accessible tool for democratizing advanced embedding technologies.
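The concept-token idea can be reproduced at toy scale with gensim's FastText implementation: multi-word concepts are collapsed into single tokens (fake MeSH-style identifiers here) before training, so nearest-neighbour queries operate over concepts rather than surface words. The corpus, the tokens, and the hyperparameters below are illustrative only.

    # Toy sketch of concept-level FastText training and nearest-neighbour lookup.
    from gensim.models import FastText

    sentences = [
        ["MESH_CONCEPT_A", "is", "frequently", "comorbid", "with", "MESH_CONCEPT_B"],
        ["drug_x", "is", "commonly", "prescribed", "for", "MESH_CONCEPT_A"],
        ["MESH_CONCEPT_B", "progression", "is", "monitored", "with", "biomarker_y"],
    ]
    model = FastText(sentences, vector_size=100, window=5, min_count=1, epochs=20)
    print(model.wv.most_similar("MESH_CONCEPT_A", topn=3))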
2025-07-24 12:20:00 12:40:00 12 Text Mining VectorSage: Enhancing PubMed Article Retrieval with Advanced Semantic Search Rahul Brahma Yasas Wijesekara, Rahul Brahma, Mehdi Lotfi, Marcus Vollmer, Lars Kaderali The exponential growth of academic literature has presented unprecedented opportunities and highlighted the need for advanced search methodologies for efficient knowledge discovery. While effective for structured queries, traditional keyword-based search engines often struggle with the inherent variability of language, where the same concept can be expressed in many ways, leading to imprecise retrieval of relevant articles. Recent advancements in natural language processing (NLP) have facilitated the development of semantic similarity techniques that extend beyond simple text matching, enabling more contextually aware search capabilities. Taking advantage of these advancements to address the limitations of traditional approaches, we introduce VectorSage, a hybrid search framework that integrates semantic similarity search and keyword-based retrieval to enhance academic literature discovery in peer-reviewed articles. VectorSage employs a multi-step ranking mechanism executed in parallel: (1) a semantic similarity search using FAISS with Stella-400M embeddings to retrieve conceptually related articles; and (2) a keyword-based search leveraging BM25S for probabilistic text ranking. The results are independently ranked and merged into a globally optimized ranked list using a weighted scoring function, balancing semantic relevance with keyword specificity. This hybrid approach is particularly useful where terminology consistency varies, allowing researchers to retrieve articles that traditional search techniques might otherwise miss. Tested on over 26 million PubMed abstracts, VectorSage significantly improves retrieval of relevant articles, facilitating more effective literature exploration. As a freely accessible web tool, VectorSage enhances high-quality academic literature search across disciplines. VectorSage is live at: https://vectorsage.nube.uni-greifswald.de/.
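A minimal version of the parallel dense-plus-lexical ranking can be sketched as follows. The embedding model (a generic sentence-transformers checkpoint rather than Stella-400M), the rank_bm25 library (standing in for BM25S), and the merge weight are all assumptions made for the sketch.

    # Hybrid retrieval sketch: FAISS dense scores + BM25 lexical scores, merged by a weighted sum.
    import faiss
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    docs = ["BRCA1 mutations increase breast cancer risk.",
            "Metformin is a first-line therapy for type 2 diabetes.",
            "Tamoxifen is used in hormone-receptor-positive breast cancer."]
    query = "drugs for breast cancer"

    # Dense branch: cosine similarity via inner product over normalized embeddings.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model
    doc_emb = encoder.encode(docs, normalize_embeddings=True).astype("float32")
    index = faiss.IndexFlatIP(doc_emb.shape[1])
    index.add(doc_emb)
    q_emb = encoder.encode([query], normalize_embeddings=True).astype("float32")
    sims, ids = index.search(q_emb, len(docs))
    dense = np.zeros(len(docs), dtype="float32")
    dense[ids[0]] = sims[0]                       # re-align scores to document order

    # Lexical branch: BM25 over whitespace-tokenized text.
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    lexical = np.asarray(bm25.get_scores(query.lower().split()))
    lexical = lexical / max(lexical.max(), 1e-9)  # scale to [0, 1] for merging

    # Weighted merge into one ranked list (alpha = weight on semantic relevance, assumed).
    alpha = 0.6
    hybrid = alpha * dense + (1 - alpha) * lexical
    for i in np.argsort(-hybrid):
        print(round(float(hybrid[i]), 3), docs[i])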
2025-07-24 12:40:00 13:00:00 12 Text Mining Large Language Model Applications on the Uniprot Protein Sequence and Annotation Database Melike Akkaya Melike Akkaya, Rauf Yanmaz, Sezin Yavuz, Vishal Joshi, Maria-Jesus Martin, Tunca Doğan Efficiently accessing and analyzing comprehensive biological datasets remains challenging due to traditional querying complexities. To address this, we developed an intuitive, scalable query interface using advanced large language models. Our system enables users, regardless of computational expertise, to formulate natural-language queries that automatically translate into precise Solr database searches, significantly simplifying interaction with UniProtKB. Additionally, we implemented a semantic vector search for rapid protein similarity analyses, using protein embeddings generated by the ProtT5 protein language model within an optimized approximate nearest-neighbor search framework (Annoy). This method significantly outperforms conventional BLAST searches, offering a speed increase of up to tenfold on GPU hardware. Functional insights are further enriched through integrated Gene Ontology analyses, providing biologically meaningful context to similarity searches. Currently, we are expanding the system using Retrieval-Augmented Generation, integrating real-time annotations from UniProt flat files to enhance contextual relevance and accuracy of generated responses. Evaluations using diverse biological queries demonstrated the robustness of our interface, highlighting its ability to mitigate intrinsic variability in LLM outputs through controlled prompt engineering and query retry mechanisms. Overall, our novel project substantially streamlines the retrieval process, facilitating quicker, more accurate exploration of protein functions, evolutionary relationships, and annotations.
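The approximate nearest-neighbour component can be illustrated with Annoy directly. Random vectors stand in for the ProtT5 per-protein embeddings and the accession identifiers are fabricated placeholders; only the Annoy indexing and query pattern is the point of the sketch.

    # Sketch: build an Annoy index over protein embeddings and query the closest entries.
    import numpy as np
    from annoy import AnnoyIndex

    dim = 1024                                    # ProtT5 per-protein embedding size
    rng = np.random.default_rng(0)
    accessions = [f"PROT{i:05d}" for i in range(1000)]          # placeholder identifiers
    vectors = rng.normal(size=(len(accessions), dim)).astype("float32")

    index = AnnoyIndex(dim, "angular")            # angular distance ~ cosine similarity
    for i, vec in enumerate(vectors):
        index.add_item(i, vec)
    index.build(50)                               # 50 trees: accuracy/speed trade-off

    hit_ids, distances = index.get_nns_by_vector(vectors[42], 10, include_distances=True)
    print([(accessions[i], round(d, 3)) for i, d in zip(hit_ids, distances)])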
2025-07-24 14:00:00 14:20:00 12 Text Mining Human-AI Collaboration for Cancer Knowledge Verification: Insights from the CIViC-Fact Dataset Caralyn Reisle Caralyn Reisle, Cameron J. Grisdale, Kilannin Krysiak, Arpad M. Danos, Mariam Khanfar, Erin Pleasance, Jason Saliba, Melika Bonakdar, Nilan V. Patel, Joshua F. McMichael, Malachi Griffith, Obi L. Griffith, Steven J.M. Jones Interpretation of genomic findings remains one of the largest barriers to automation in processing precision oncology patient data due to the high level of expertise required in cancer biology, genomics, and bioinformatics. Efforts to streamline this process include creating cancer knowledge bases (KB) to store annotations of individual genes and variants, but creating such resources is time-consuming. The open-data cancer KB CIViC (civicdb.org) adopted crowd-sourcing to curate content efficiently. However, these submissions still require expert review, leading to a new bottleneck. To address this, we introduce CIViC-Fact, a novel benchmark designed to support automated fact-checking and claim verification in the biomedical domain. CIViC-Fact augments thousands of curated entries in the CIViC knowledge base with sentence-level evidence provenance, linking each claim to the specific sentences that support or contradict it. We evaluate the performance of several open large language models (LLMs) on this dataset. Existing LLMs struggle (up to 60% accuracy) and require fine-tuning to achieve reasonable performance. While fine-tuned language models perform well (up to 88% accuracy), there is significant room for improvement in the quality of their reasoning. Despite these remaining challenges, we have applied the current pipeline to the entirety of CIViC, flagging any errors detected. Flagged entries were returned to the CIViC curation team for follow-up, resulting in corrections to the KB content. This demonstrates the practical utility of CIViC-Fact, not only as a new benchmark for NLP research, but as a tool for semi-automated auditing of scientific knowledge bases.
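The verification task itself can be approximated with an off-the-shelf natural language inference model, scoring whether an evidence sentence entails or contradicts a curated claim. This is a generic stand-in, not the fine-tuned models evaluated in the work, and the claim/evidence pair below is chosen only for illustration.

    # Stand-in claim-verification sketch using an NLI classifier (premise = evidence, hypothesis = claim).
    from transformers import pipeline

    nli = pipeline("text-classification", model="facebook/bart-large-mnli")

    claim = "BRAF V600E confers sensitivity to vemurafenib in melanoma."
    evidence = ("Patients with BRAF V600E-mutant melanoma showed improved response rates "
                "when treated with vemurafenib.")
    print(nli({"text": evidence, "text_pair": claim}))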
2025-07-24 14:20:00 14:40:00 12 Text Mining Named entity recognition and relationship extraction to mine minimum inhibitory concentration of antibiotics from biomedical text Tiffany Ta Tiffany Ta, Arman Edalatmand, Andrew McArthur Antimicrobial resistance (AMR) poses a global public health threat, undermining modern medicine by diminishing the effectiveness of antibiotics for treating bacterial infections. The minimum inhibitory concentration (MIC) is the lowest concentration at which an antibiotic inhibits bacterial growth. Based on pre-established MIC cutoff values, MIC can be used to determine if an isolate will be susceptible or resistant to an antibiotic. To interpret MIC, various metadata (i.e., infection site, bacterial isolate, etc.) must also be collected. This information is spread piecemeal throughout the scientific literature available at PubMed Central (PMC) but has yet to be mined. The Comprehensive Antibiotic Resistance Database (CARD) is a globally used, expert-curated resource and knowledgebase of AMR determinants and antibiotics. Yet CARD lacks MIC information, which can provide insights into the phenotypic risk profile of individual ARGs. We are leveraging natural language processing to extract MIC values and relevant information from 5,704,429 PMC articles. We have trained a text classifier to identify PMC articles associated with bacterial drug resistance (F1 = 0.9699) and filtered the papers via Regular Expressions to identify papers with MIC data, yielding 10,086 papers. Afterwards, named entity recognition (NER) was used to mine relevant MIC information, generating 1,082,026 annotations. We are now working towards employing generative models to extract MIC values from PMC articles. Once CARD has MIC data, we can track MIC across time, as increasing MIC values can be a forewarning for pathogens that may develop resistance.
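The regular-expression filtering stage can be illustrated with a simplified pattern that flags passages reporting MIC values with concentration units; it is a stand-in for the project's actual filters, not the exact expressions used.

    # Illustrative regex filter for MIC mentions with values and units.
    import re

    MIC_PATTERN = re.compile(
        r"\bMIC(?:50|90)?\b[^.\n]{0,40}?"      # 'MIC', 'MIC50', 'MIC90', plus nearby words
        r"\d+(?:\.\d+)?\s*"                    # numeric concentration value
        r"(?:µg|ug|mg)\s*/\s*(?:mL|L)",        # common unit forms
        re.IGNORECASE)

    text = ("The MIC of ciprofloxacin was 0.25 µg/mL for susceptible isolates, "
            "while the MIC90 reached 32 ug/mL in the resistant group.")
    for match in MIC_PATTERN.finditer(text):
        print(match.group(0))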
2025-07-24 14:40:00 15:00:00 12 Text Mining Enhancing Biomedical Relation Extraction with Directionality Chih-Hsuan Wei Po-Ting Lai, Chih-Hsuan Wei, Shubo Tian, Robert Leaman, Zhiyong Lu Biological relation networks contain rich information for understanding the biological mechanisms behind the relationship of entities such as genes, proteins, diseases, and chemicals. The vast growth of biomedical literature poses significant challenges in updating the network knowledge. The recent Biomedical Relation Extraction Dataset (BioRED) provides valuable manual annotations, facilitating the development of machine-learning and pre-trained language model approaches for automatically identifying novel document-level (inter-sentence context) relationships. Nonetheless, its annotations lack directionality (subject/object) for the entity roles, essential for studying complex biological networks. Herein we annotate the entity roles of the relationships in the BioRED corpus and subsequently propose a novel multi-task language model with soft-prompt learning to jointly identify the relationship, novel findings, and entity roles. Our results include an enriched BioRED corpus with 10,864 directionality annotations. Moreover, our proposed method outperforms existing large language models such as the state-of-the-art GPT-4 and Llama-3 on two benchmarking tasks.
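For the soft-prompt component, a generic prompt-tuning setup can be sketched with the peft library on top of a sequence classifier. This is a simplified stand-in for the paper's multi-task model (it shows a single task and an arbitrary base encoder), and the label count and virtual-token budget are chosen arbitrarily.

    # Generic soft-prompt (prompt-tuning) sketch with peft; not the paper's exact architecture.
    from peft import PromptTuningConfig, TaskType, get_peft_model
    from transformers import AutoModelForSequenceClassification

    base = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=9)        # e.g. relation-type labels (assumed count)
    config = PromptTuningConfig(task_type=TaskType.SEQ_CLS, num_virtual_tokens=20)
    model = get_peft_model(base, config)
    model.print_trainable_parameters()            # only the soft prompt (plus head) remains trainable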
2025-07-24 15:00:00 15:20:00 12 Text Mining Metadata extraction: Large Language Models (LLMs) to the rescue Daniela Gaio Daniela Gaio In this project, our research group embarked on an extensive effort to download and re-analyze all globally and publicly accessible metagenomic samples from the NCBI database, culminating in the creation of MicrobeAtlas.org, a resource for the scientific community. We employed Large Language Models (LLMs) to efficiently extract relevant information from the often chaotic and submitter-dependent metadata files. This technique represents a significant leap over traditional methods such as conventional natural language processing tools, offering unparalleled efficiency in metadata mining. Clean, accessible metadata is critical in microbial -omics for the analysis of metagenomic samples. Despite advancements, the challenge of disorganized metadata remains, limiting dataset utility. The application of LLMs greatly enhanced the extraction of keywords, geographical data, sample nature, and host. This improvement unveiled signals within the metagenomic data previously masked by conventional NLP tool limitations, thus increasing dataset value and access. Our approach included developing and validating a pipeline for processing metadata from 3.8 million samples. I will outline the encountered challenges and the implemented solutions, including the comparison of paid and free LLM models. In conclusion, our efforts in improving metagenomic dataset accessibility and utility not only enable the reuse of existing data for comparative analysis and new discoveries, but also establish a new benchmark in metadata analysis within microbial ecology. The advancements in metadata extraction foster more detailed and comprehensive research, significantly enhancing our understanding of microbial ecosystems.
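A single metadata record can be normalized with one prompt of the kind sketched below; the model name, the output schema, and the example record are assumptions made for illustration rather than the pipeline's actual configuration.

    # Prompt sketch: ask an LLM to structure one messy sample-metadata record.
    from openai import OpenAI

    client = OpenAI()  # requires an API key in the environment; any chat-capable model would do
    record = ("env_biome: human gut; geo_loc_name: USA: Boston; "
              "host: Homo sapiens; note: stool sample, IBD patient")

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{
            "role": "user",
            "content": ("Extract a JSON object with keys keywords, geography, "
                        "sample_type, and host from this sample metadata. "
                        f"Return only JSON.\n\n{record}"),
        }],
    )
    print(response.choices[0].message.content)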
2025-07-24 15:20:00 15:40:00 12 Text Mining Large-scale semantic indexing of Spanish biomedical literature using contrastive transfer learning Shanfeng Zhu Ronghui You, Tianyang Huang, Ziye Wang, Yuxuan Liu, Hong Zhou, Shanfeng Zhu The exponential growth of biomedical literature has made automatic indexing essential for advancing biomedical research. While automatic indexing has made strides in English biomedical literature, there has been limited research on non-English biomedical texts due to insufficient high-quality training data. We propose BERTDeCS, a novel deep learning framework for automatically indexing Spanish biomedical literature using contrastive transfer learning. BERTDeCS utilizes a multilingual BERT (M-BERT) to generate multi-language representations and adapts M-BERT to the Spanish biomedical literature domain through contrastive learning. Additionally, BERTDeCS enhances its semantic indexing capabilities on Spanish biomedical literature by leveraging the enriched English-annotated literature in MEDLINE through transfer learning. Experimental results on Spanish datasets demonstrate that BERTDeCS outperforms state-of-the-art indexing methods, achieving top performance in the MESINESP and MESINESP2 Tasks on medical semantic indexing in Spanish within the BioASQ challenge. Notably, when extended to other languages (e.g., Portuguese) or applied in settings lacking manual indexing, BERTDeCS maintains exceptional performance, affirming its robustness in non-English biomedical semantic indexing.
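The contrastive adaptation step can be approximated with sentence-transformers and in-batch negatives over parallel Spanish/English titles; the base checkpoint, the toy pairs, and the loss choice are assumptions standing in for the paper's actual training setup.

    # Sketch: contrastive adaptation of multilingual BERT with in-batch negatives.
    from sentence_transformers import InputExample, SentenceTransformer, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer("bert-base-multilingual-cased")  # M-BERT with default mean pooling
    pairs = [
        InputExample(texts=["Resistencia a antibióticos en aislados clínicos",
                            "Antibiotic resistance in clinical isolates"]),
        InputExample(texts=["Secuenciación de célula única del pulmón humano",
                            "Single-cell sequencing of the human lung"]),
    ]
    loader = DataLoader(pairs, shuffle=True, batch_size=2)
    loss = losses.MultipleNegativesRankingLoss(model)  # contrastive loss with in-batch negatives
    model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=0)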
2025-07-24 15:40:00 16:00:00 12 Text Mining Reading papers: Extraction of molecular interaction networks with large language models Enio Gjerga Enio Gjerga, Philipp Wiesenbach, Christoph Dieterich Motivation: Signalling occurs within and across cells and orchestrates essential cellular processes in complex tissues. Cell signalling involves several different components, including protein-protein interactions (PPIs). Dynamically changing conditions oftentimes lead to the rewiring of cellular communication networks. Computational modelling approaches typically rely on databases of molecular interactions. Evidently, manual curation of databases is time-consuming, and automatic relation extraction (RE) from scientific literature would greatly support our efforts to understand molecular mechanisms. To ease this process, we reason that prompt-based data mining with Large Language Models (LLMs) could be used to extract information from relevant scientific publications. Approach: We focus on the extraction of entity relations between proteins, as exemplified in protein-protein interaction networks, over a corpus of annotated short scientific texts. We analyze vanilla and fine-tuned models with different prompt setups, either giving no examples at all or following specific patterns, e.g., giving only correct, only incorrect, or both kinds of examples. Results: We rely on the RegulaTome corpus of annotated abstracts and short scientific texts, where we obtain promising evaluation results as measured by precision, recall and F1-score for the extraction of PPI relations: 79%, 70% and 71%. Our workflow also ingests entire manuscripts and yields 96%, 65% and 77% for PPI relation extraction over a corpus of manually annotated cardiac manuscripts. Availability: Code, scripts, and results are available at: https://github.com/dieterich-lab/LLM_Relations.
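A prompt of the kind described (mixing examples with and without interactions) can be sketched as follows; the open instruct model, the prompt wording, and the example sentences are assumptions, not the study's exact configuration.

    # Few-shot PPI relation-extraction prompt run through an open instruct model.
    from transformers import pipeline

    generator = pipeline("text-generation",
                         model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder open LLM

    prompt = (
        "Extract protein-protein interactions as (ProteinA, ProteinB) pairs.\n"
        "Example with an interaction: 'MDM2 binds and ubiquitinates p53.' -> (MDM2, p53)\n"
        "Example without an interaction: 'TP53 is frequently mutated in tumours.' -> NONE\n"
        "Text: 'BRCA1 interacts with BARD1 to form an E3 ubiquitin ligase complex.'\n"
        "Answer:"
    )
    print(generator(prompt, max_new_tokens=40)[0]["generated_text"])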
