Attention Presenters: please review the Presenter Information Page.
Schedule subject to change
All times listed are in EDT
Sunday, July 14th
10:40-11:20
Invited Presentation: Combining computational pipelines and text mining to build a cell type knowledge graph resource
Confirmed Presenter: Richard Scheuermann, National Library of Medicine, USA

Room: 524ab
Format: In Person


Authors List:

  • Richard Scheuermann, National Library of Medicine, USA
  • Renee Zhang, J. Craig Venter Institute, USA
  • Ajith Pankajam, National Library of Medicine, USA
  • Beverly Peng, J. Craig Venter Institute, USA
  • Angela Liu, J. Craig Venter Institute, USA
  • Noam Rotenberg, National Library of Medicine, USA
  • Robert Leaman, National Library of Medicine, USA
  • Zhiyong Lu, National Library of Medicine, USA
  • Brian Aevermann, Chan Zuckerberg Initiative, USA

Presentation Overview:

Advances in sequencing technologies now allow comprehensive analysis of the whole-genome structural (epigenomic) and expression (transcriptomic) characteristics of individual single cells, revealing the cellular complexity of healthy, diseased, and perturbed tissues at unprecedented granularity. However, the knowledge derived from the analysis and interpretation of these experiments is currently available only as free text in scientific publications, making its exploration challenging and labor-intensive. To address this, we are developing a cell type knowledgebase resource to capture knowledge about cell phenotypes from these experiments. Two streams of knowledge are being established. First, using standard cell-by-gene expression matrices as input, validated analysis pipelines produce information about cell type-specific marker genes and differential expression patterns, linked with experiment metadata about species and specimen sources, disease states, and perturbations. Second, using open-access, peer-reviewed publications from PubMed Central (PMC) that report results from single-cell genomics experiments as input, AI-driven natural language processing (NLP) pipelines extract information about cell phenotypes and their associations with disease states and perturbation responses. The outputs of these knowledge-generating pipelines are then formulated as standardized, semantically structured subject-predicate-object triples compatible with semantic web technologies and graph database platforms. This supports search, analysis, and integration with other sources of knowledge about diseases and drugs, especially from NCBI and other NLM resources, for the discovery of novel diagnostic biomarkers and therapeutic targets.
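
To make the triple formulation concrete, here is a minimal sketch of how a cell type-marker gene assertion could be stored as subject-predicate-object triples with rdflib; the namespace, class names, and the has_marker_gene predicate are illustrative assumptions rather than the resource's actual schema.

# Hypothetical sketch: storing a cell-type/marker-gene assertion as RDF triples.
# The namespace and predicate names are illustrative, not the resource's real schema.
from rdflib import Graph, Namespace, RDF

EX = Namespace("http://example.org/celltype-kb/")  # placeholder namespace

g = Graph()
g.bind("ex", EX)

cell_type = EX["L5_ET_neuron"]        # subject: a cell type
marker    = EX["gene/FAM84B"]         # object: a marker gene
study     = EX["study/PMC0000000"]    # provenance placeholder

g.add((cell_type, RDF.type, EX.CellType))
g.add((marker, RDF.type, EX.Gene))
g.add((cell_type, EX.has_marker_gene, marker))   # the assertion triple
g.add((cell_type, EX.reported_in, study))        # link to the source publication

print(g.serialize(format="turtle"))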

11:20-11:40
Enhancing Machine Learning Based Drug Response Prediction Models via Text Mining-Driven Feature Selection Approach
Confirmed Presenter: Arvind Mer, University of Ottawa, Canada

Room: 524ab
Format: Live Stream


Authors List:

  • Arvin Zaker, University of Ottawa, Canada
  • Arvind Mer, University of Ottawa, Canada

Presentation Overview:

Predicting anticancer treatment responses from baseline genomic data is a formidable challenge in personalized cancer medicine. Machine learning is increasingly employed to predict drug responses from gene expression data, but a significant hurdle is the selection of relevant features. The large number of features can make training models time-consuming, computationally intensive, and less accurate. Hence, effective feature selection is essential to reduce complexity and improve interpretability, leading to more efficient and accurate models. We propose a text-mining-based feature selection method that leverages peer-reviewed scientific literature to identify gene-drug interactions. By mining extensive literature from databases such as PubMed, our method pinpoints genes associated with specific drugs and prioritizes them for pharmacogenomic analysis using a Bayesian linear classifier and Fisher's exact test. To assess the efficacy of our approach, we compared it against eight advanced feature selection methods using three machine learning algorithms (Elastic Net, Random Forest, and deep learning) across two independent cancer pharmacogenomic datasets. Our text-mining-based features showed a strong correlation with drug responses, outperforming other methods in both univariate and multivariate analyses. Specifically, models trained with our features excelled in within-domain and cross-domain validations, consistently achieving the highest prediction accuracy. These models were also effective in predicting drug responses in patient-derived xenograft pharmacogenomics datasets.
Our text-mining-based feature selection method offers a robust, efficient, and scalable way to identify relevant genetic features for drug response prediction, enhancing the interpretability of predictive models by linking gene selection to documented biological relevance in scientific literature.
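
To illustrate the kind of literature-based prioritization described above, here is a minimal sketch that scores a single gene-drug association from co-occurrence counts with Fisher's exact test; the counts are invented and this is not the authors' actual pipeline.

# Hypothetical sketch: scoring a gene-drug association from literature co-occurrence
# counts with Fisher's exact test (the counts below are made up for illustration).
from scipy.stats import fisher_exact

n_both      = 42     # abstracts mentioning both the gene and the drug
n_gene_only = 158    # abstracts mentioning the gene but not the drug
n_drug_only = 510    # abstracts mentioning the drug but not the gene
n_neither   = 99290  # remaining abstracts in the mined corpus

table = [[n_both, n_gene_only],
         [n_drug_only, n_neither]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.2e}")
# Genes with small p-values would be prioritized as candidate features.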

Streamlining Drug Development with Conversational AI-Powered Knowledge Graphs: From Preclinical Discovery to Clinical Trials
Confirmed Presenter: Maaly Nassar, Elsevier, United Kingdom

Room: 524ab
Format: Live Stream


Authors List:

  • Maaly Nassar, Elsevier, United Kingdom

Presentation Overview:

The drug development industry faces an efficacy crisis, with a 90% clinical trial failure rate and an average of 9 years and $1.5 billion spent on bringing a new drug to market. This is largely due to the complex process of clinically translating and validating drug-target cellular machinery within the extensive scientific literature. To tackle this challenge, we applied conversational AI-powered knowledge graphs (KGs) to various aspects of the drug development process, from preclinical drug discovery and repurposing to matching patients with clinical trials. Our strategy includes: 1) creating FAIR (Findable, Accessible, Interoperable, and Reusable) knowledge graphs with ontology-validated Named Entity Recognition, 2) leveraging embedding models and fine-tuned large language models (LLMs) to decipher biomedical relationships, 3) identifying key therapeutic targets and adverse reactions using AI-driven graph analysis, and 4) harnessing the reasoning capabilities of LLMs to match patients to clinical trials.

We showcase how conversational AI-powered KGs can enhance microbiome repurposing and streamline patient matching with high accuracy. Using benchmark datasets such as BioASQ, we assessed various embedding models for capturing both structural and semantic relationships among biomedical entities. We further illustrate how fine-tuning LLMs, such as BioGPT, with synthetic datasets can improve their understanding of biomedical contexts, accurately classify the regulatory effects of intermingled biological entities, and assess patients' eligibility for clinical trials based on matching criteria. Expanding on these methods, we employ AI-driven graph analysis techniques (e.g., graph centrality metrics) to precisely identify key therapeutic and diagnostic targets for specific diseases and drugs, ultimately improving the overall success rate in developing innovative treatments.
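
To illustrate the graph-centrality step mentioned above, here is a minimal sketch that ranks the nodes of a toy knowledge graph with a centrality metric using networkx; the nodes and edges are invented for illustration and do not reflect the actual KG content.

# Hypothetical sketch: ranking candidate therapeutic targets in a small knowledge
# graph with a centrality metric (the edges below are invented for illustration).
import networkx as nx

kg = nx.Graph()
kg.add_edges_from([
    ("DrugA", "GeneX"), ("DrugA", "GeneY"),
    ("GeneX", "DiseaseZ"), ("GeneY", "DiseaseZ"),
    ("GeneX", "PathwayP"), ("GeneW", "DiseaseZ"),
])

centrality = nx.betweenness_centrality(kg)
ranked = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)
for node, score in ranked:
    print(f"{node}: {score:.3f}")
# High-centrality gene nodes bridging drugs and diseases would be flagged
# as candidate targets for closer review.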

11:40-12:00
eMIND: Enabling automatic collection of protein variation impacts in Alzheimer’s disease from the literature
Confirmed Presenter: Cathy Wu, University of Delaware, United States

Room: 524ab
Format: In Person


Authors List:

  • Samir Gupta, Georgetown University, United States
  • Xihan Qin, University of Delaware, United States
  • Qinghua Wang, University of Delaware, United States
  • Krithika Umesh, University of Delaware, United States
  • Spandan Pandya, University of Delaware, United States
  • Julie Cowart, University of Delaware, United States
  • Hongzhan Huang, University of Delaware, United States
  • Cathy Wu, University of Delaware, United States
  • Vijay Shanker, University of Delaware, United States
  • Cecilia Arighi, University of Delaware, United States

Presentation Overview:

Alzheimer’s disease and related dementias (AD/ADRDs) are among the most common forms of dementia, and yet no effective treatments have been developed. To gain insight into the disease mechanism, capturing the connection between genetic variations and their impacts, at both the disease and molecular levels, is essential. The scientific literature continues to be a main source of information about the impact of variants; thus, automatic methods to extract such information from the literature would be useful for assisting biocuration. We developed eMIND, a deep learning-based text mining system that supports the automatic extraction of annotations of variants and their impacts in AD/ADRDs. In particular, we capture the impacts of protein-coding variants on a selected set of protein properties, such as protein activity/function, structure, and post-translational modifications. A major hypothesis we are testing is that the structure and wording of statements describing the impact of one entity on another entity or event/process are not specific to the two objects under consideration. Thus, a BERT model was fine-tuned using a training dataset of 8,245 positive and 11,496 negative impact relations derived from impact statements involving microRNAs. We conducted a preliminary evaluation on a small manually annotated corpus (60 abstracts) of variant impact relations from the AD/ADRDs literature and obtained a recall of 0.84 and a precision of 0.94. The publications and the information extracted by eMIND are integrated into the UniProtKB computationally mapped bibliography.
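
As an illustration of the fine-tuning setup described above, the following minimal sketch trains a binary sentence classifier for impact relations with the Hugging Face transformers API; the base model name, example sentences, and labels are placeholder assumptions, not eMIND's actual configuration or training data.

# Hypothetical sketch: fine-tuning a BERT classifier to label sentences as
# expressing an "impact" relation or not, in the spirit of the approach above.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # a biomedical BERT variant could be substituted
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

sentences = ["The VARIANT abolished the kinase activity of PROTEIN.",
             "PROTEIN is expressed in hippocampal neurons."]
labels = torch.tensor([1, 0])  # 1 = impact relation, 0 = no impact relation

batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # one illustrative backward pass; real training uses an optimizer loop
print(f"loss = {outputs.loss.item():.3f}")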

12:00-12:20
Poster Flash Presentations
Room: 524ab
Format: In Person


Authors List:

14:20-15:00
Invited Presentation: An informatic path to better understanding of cardiovascular biology and medicine
Confirmed Presenter: Peipei Ping, UCLA, United States

Room: 524ab
Format: In Person


Authors List:

  • Peipei Ping, UCLA, United States

Presentation Overview:

We will present an overview of our bioinformatics platforms, as well as use cases in which we apply text mining approaches to better understand cardiovascular biology and medicine.

15:00-15:20
The Netherlands Neurogenetics Database: Revealing clinical, neuropathological and genetic heterogeneity of brain disorders
Confirmed Presenter: Inge Holtman, University Medical Center Groningen, Netherlands

Room: 524ab
Format: In Person


Authors List:

  • Inge Holtman, University Medical Center Groningen, Netherlands
  • Bart Eggen, University Medical Center Groningen, Netherlands
  • Inge Huitinga, Netherlands institute for Neurosciences, Netherlands

Presentation Overview:

The brain is susceptible to a wide range of neurodegenerative disorders that share pathophysiological mechanisms and genetic risk factors and are frequently clinically misdiagnosed [1]. Hence, there is a clear need for a data-driven delineation of the pathophysiological mechanisms of brain disorders to improve diagnosis and prognosis. To this end, we established the Netherlands Neurogenetics Database (http://nnd.app.rug.nl/), which aims to integrate the extensive clinical, neuropathological, and genetic data of a large collection of approximately 3,000 brain donors from the Netherlands Brain Bank. We recently implemented Large Language Models (LLMs) to convert medical record summaries into clinical disease trajectories [2]. These trajectories included many known and novel disease-specific symptoms, were used for disease prediction and subtyping, and resulted in the identification of clinical subtypes of disease. Currently, we are implementing LLMs to process neuropathological examinations, giving us unprecedented insight into neuropathological state, which we will relate back to the clinical heterogeneity. In addition, we are analysing common genetic variants (Illumina GSA array) to refine current GWAS for neurodegenerative disorders, which typically include a considerable fraction of misdiagnosed individuals. We are also calculating polygenic risk scores (PRS) to identify genetic features associated with clinical and neuropathological subtypes and features. Together, these studies aim to provide new data-driven insights into shared and unique features of neurodegenerative disorders.

1. N.J. Mekkes, I.R. Holtman. Revealing clinical heterogeneity in a large brain bank cohort. Nat Med, 2024.
2. N.J. Mekkes, …, B.J.L. Eggen, I. Huitinga, I.R. Holtman. Identification of clinical disease trajectories in neurodegenerative disorders with natural language processing. Nat Med, 2024.
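
As background for the polygenic risk scores mentioned in the abstract above, here is a minimal sketch of a PRS computed as a weighted sum of risk-allele counts; the genotypes and effect sizes are invented, and real pipelines add quality control, clumping, and ancestry-aware scaling.

# Hypothetical sketch: a polygenic risk score as the weighted sum of risk-allele
# counts, using invented effect sizes.
import numpy as np

# genotype matrix: donors x variants, entries are risk-allele counts (0, 1, or 2)
genotypes = np.array([
    [0, 1, 2, 1],
    [2, 0, 1, 0],
    [1, 1, 0, 2],
])
# per-variant effect sizes (e.g., GWAS log odds ratios); values here are made up
effect_sizes = np.array([0.12, -0.05, 0.30, 0.08])

prs = genotypes @ effect_sizes          # one score per donor
prs_z = (prs - prs.mean()) / prs.std()  # standardize for comparison across donors

for donor, score in enumerate(prs_z, start=1):
    print(f"donor {donor}: PRS z-score = {score:+.2f}")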

GeneAgent: Self-verification Language Agent for Gene Set Knowledge Discovery
Confirmed Presenter: Zhiyong Lu, National Institutes of Health (NIH), United States

Room: 524ab
Format: In Person


Authors List:

  • Zhizheng Wang, National Institutes of Health (NIH), United States
  • Qiao Jin, National Institutes of Health (NIH), United States
  • Chih-Hsuan Wei, National Institutes of Health (NIH), United States
  • Shubo Tian, National Institutes of Health (NIH), United States
  • Po-Ting Lai, National Institutes of Health (NIH), United States
  • Qingqing Zhu, National Institutes of Health (NIH), United States
  • Xiuying Chen, King Abdullah University of Science and Technology, Saudi Arabia
  • Chi-Ping Day, National Institutes of Health (NIH), United States
  • Christina Ross, National Institutes of Health (NIH), United States
  • M.G. Hirsch, National Institutes of Health (NIH), United States
  • Teresa Przytycka, National Institutes of Health (NIH), United States
  • Zhiyong Lu, National Institutes of Health (NIH), United States

Presentation Overview:

Identifying the biological functions shared by a set of genes is a long-standing interest of molecular biologists. Recent studies have shown promising results by harnessing instruction learning in Large Language Models (LLMs). Nonetheless, these methods do not explore LLMs in depth to accurately identify the biological functions of gene sets and are hindered by hallucinations. In response, we present GeneAgent, a first-of-its-kind language agent equipped with self-verification capability that autonomously interacts with domain-specific databases. GeneAgent comprises four stages (generation, self-verification, modification, and summarization): it generates a process name and analytical narratives for the input gene set and activates the self-verification agent to verify them, with successive self-verification stages cascaded through the modification module. After self-verification, GeneAgent produces the final response for the given gene set based on the verification report. Benchmarking on multiple gene sets from Gene Ontology, NeST, and MSigDB, GeneAgent achieves substantially higher accuracy than standard GPT-4. Notably, for 15 gene sets (1.4%), GeneAgent predicted the reference terms with 100% precision, compared with only 3 cases (0.3%) for GPT-4. Additionally, our enriched-term tests demonstrate that GeneAgent can provide targeted gene synopses that summarize multiple biological terms in alignment with conventional enrichment analyses. Detailed case studies demonstrate that GeneAgent effectively reduces hallucination in GPT-4 and generates reliable analytical narratives for gene functions. As such, GeneAgent stands as a robust solution for gene set knowledge discovery and can provide reliable insights for future research.
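
The staged workflow described above can be illustrated with a minimal, hypothetical Python sketch; the llm() and lookup_database() helpers below are placeholders standing in for model calls and database queries, not GeneAgent's actual implementation.

# Hypothetical sketch of the generate -> self-verify -> modify -> summarize loop.
def llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    return "oxidative phosphorylation"

def lookup_database(claim: str, genes: list[str]) -> bool:
    """Placeholder for querying a domain-specific database (e.g., an enrichment service)."""
    return True

def gene_set_agent(genes: list[str]) -> str:
    # 1) generation: propose a process name for the gene set
    claim = llm(f"Name the biological process shared by: {', '.join(genes)}")
    # 2) self-verification: check the claim against domain-specific databases
    supported = lookup_database(claim, genes)
    # 3) modification: revise the claim if verification fails
    if not supported:
        claim = llm(f"Revise the process name for {', '.join(genes)}; "
                    f"'{claim}' was not supported by database evidence.")
    # 4) summarization: produce the final response from the verification outcome
    return f"{claim} (database-supported: {supported})"

print(gene_set_agent(["NDUFA1", "SDHB", "COX5A", "ATP5F1A"]))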

15:20-15:40
Proceedings Presentation: MolLM: A Unified Language Model for Integrating Biomedical Text with 2D and 3D Molecular Representations
Confirmed Presenter: Andrew Tran, Yale University, United States

Room: 524ab
Format: In Person


Authors List:

  • Xiangru Tang, Yale University, United States
  • Andrew Tran, Yale University, United States
  • Jeffrey Tan, Yale University, United States
  • Mark Gerstein, Yale University, United States

Presentation Overview:

The current paradigm of deep learning models for the joint representation of molecules and text primarily relies on 1D or 2D molecular formats, neglecting significant 3D structural information that offers valuable physical insight. This narrow focus inhibits the models’ versatility and adaptability across a wide range of modalities. Conversely, the limited research focusing on explicit 3D representation tends to overlook textual data within the biomedical domain. We present a unified pre-trained language model, MolLM, that concurrently captures 2D and 3D molecular information alongside biomedical text. MolLM consists of a text Transformer encoder and a molecular Transformer encoder, designed to encode both 2D and 3D molecular structures. To support MolLM’s self-supervised pre-training, we constructed 160K molecule-text pairings. Employing contrastive learning as a supervisory signal for cross-modal information learning, MolLM demonstrates robust molecular representation capabilities across 4 downstream tasks, including cross-modality molecule and text matching, property prediction, captioning, and text-prompted molecular editing. Through ablation, we demonstrate that the inclusion of explicit 3D representations improves performance in these downstream tasks.
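
As an illustration of the contrastive supervisory signal described above, here is a minimal PyTorch sketch of a symmetric InfoNCE-style loss between text and molecule embeddings; the embedding dimension, batch size, and temperature are arbitrary assumptions rather than MolLM's actual settings.

# Hypothetical sketch of a symmetric contrastive loss between text embeddings
# and molecule embeddings; matched pairs lie on the diagonal of the logit matrix.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb: torch.Tensor, mol_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    text_emb = F.normalize(text_emb, dim=-1)
    mol_emb = F.normalize(mol_emb, dim=-1)
    logits = text_emb @ mol_emb.T / temperature   # pairwise cosine similarities
    targets = torch.arange(text_emb.size(0))      # index of the matched pair
    loss_t2m = F.cross_entropy(logits, targets)   # text -> molecule direction
    loss_m2t = F.cross_entropy(logits.T, targets) # molecule -> text direction
    return (loss_t2m + loss_m2t) / 2

batch, dim = 8, 256
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(f"contrastive loss = {loss.item():.3f}")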

15:40-16:00
Proceedings Presentation: BioCoder: A Benchmark for Bioinformatics Code Generation with Large Language Models
Confirmed Presenter: Xiangru Tang, Yale University, United States

Room: 524ab
Format: In Person


Authors List:

  • Xiangru Tang, Yale University, United States
  • Bill Qian, Yale University, United States
  • Rick Gao, Yale University, United States
  • Jiakang Chen, Yale University, United States
  • Xinyun Chen, Google, United States
  • Mark Gerstein, Yale University, United States

Presentation Overview:

Pre-trained large language models have significantly improved code generation. As these models scale up, there is an increasing need for their output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics due to the amount of specialized domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate large language models (LLMs) in generating bioinformatics-specific code. BioCoder spans a broad spectrum of the field and covers cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned StarCoder, demonstrating how our dataset can effectively enhance the performance of LLMs on our benchmark (by more than 15% in terms of Pass@K in certain prompt configurations and always by more than 3%). The results highlight two key aspects of successful models: (1) successful models accommodate a long prompt (more than 2600 tokens) with full context, including functional dependencies; (2) they contain specific domain knowledge of bioinformatics, beyond just general coding knowledge. This is evident from the performance gain of GPT-3.5/4 compared to the smaller models on the benchmark (50% vs. up to 25%).
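
For readers unfamiliar with the Pass@K metric cited above, the following short sketch implements the commonly used unbiased pass@k estimator; it illustrates the metric itself, not BioCoder's exact evaluation harness, and the sample counts are invented.

# Hypothetical sketch: the commonly used unbiased pass@k estimator for code
# generation benchmarks (n samples per problem, c of which pass the tests).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 pass the fuzz/unit tests
for k in (1, 5, 10):
    print(f"pass@{k} = {pass_at_k(20, 5, k):.3f}")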

16:40-17:00
Proceedings Presentation: Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models
Confirmed Presenter: Minbyul Jeong, Korea University, South Korea

Room: 524ab
Format: In Person


Authors List:

  • Minbyul Jeong, Korea University, South Korea
  • Jiwoong Sohn, Korea University, South Korea
  • Mujeen Sung, School of Computing, Kyung Hee University, South Korea
  • Jaewoo Kang, Korea University, AIGEN Sciences, South Korea

Presentation Overview:

Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generation. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed that search documents from a knowledge corpus and append them, unconditionally or selectively, to the input of LLMs for generation. However, when existing methods are applied to different domain-specific problems, poor generalization becomes apparent, leading to the retrieval of incorrect documents or inaccurate judgments. In this paper, we introduce Self-BioRAG, a reliable framework for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting on generated responses. We use 84k filtered biomedical instruction sets to train Self-BioRAG so that it can assess its generated explanations with customized reflective tokens. Our work shows that domain-specific components, such as a retriever, a domain-related document corpus, and instruction sets, are necessary for adhering to domain-related instructions. Using three major medical question-answering benchmark datasets, experimental results for Self-BioRAG demonstrate significant performance gains, with a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Our analysis shows that Self-BioRAG finds clues in the question, retrieves relevant documents if needed, and understands how to answer using information from the retrieved documents and its encoded knowledge, as a medical expert does. We release our data and code for training our framework components and model weights to enhance capabilities in the biomedical and clinical domains.
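
The selective retrieval and self-reflection loop described above can be sketched schematically as follows; the needs_retrieval(), retrieve(), generate(), and is_supported() helpers are hypothetical placeholders, not Self-BioRAG's actual components or reflective tokens.

# Hypothetical sketch of a selective retrieve-then-reflect answering loop.
def needs_retrieval(question: str) -> bool:
    """Placeholder: decide whether external evidence is needed for this question."""
    return "guideline" in question.lower() or "latest" in question.lower()

def retrieve(question: str, k: int = 3) -> list[str]:
    """Placeholder for a domain-specific retriever over a biomedical corpus."""
    return [f"document {i} relevant to: {question}" for i in range(1, k + 1)]

def generate(question: str, evidence: list[str]) -> str:
    """Placeholder for an instruction-tuned generator."""
    return f"Answer to '{question}' citing {len(evidence)} document(s)."

def is_supported(draft: str, evidence: list[str]) -> bool:
    """Placeholder self-reflection check; a real system scores the draft against
    the retrieved evidence (e.g., with reflective tokens)."""
    return all(len(doc) > 0 for doc in evidence)

def answer(question: str) -> str:
    evidence = retrieve(question) if needs_retrieval(question) else []
    draft = generate(question, evidence)
    if not is_supported(draft, evidence):
        evidence = retrieve(question)        # self-reflection failed: retry with retrieval
        draft = generate(question, evidence)
    return draft

print(answer("What do the latest guidelines recommend for statin therapy?"))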

17:00-18:00
Panel: Leveraging AI, text mining and large language models to advance biology and medicine
Room: 524ab
Format: In Person


Authors List:

  • Karin Verspoor
  • Yanli Wang
  • Cathy Wu