Summary:
This session introduces the NIH Common Fund Data Ecosystem (CFDE) Gene Set Browser, an AI/ML-ready tool that connects diverse biomedical datasets to uncover novel gene-disease associations. Learn how this interoperable resource leverages Bayesian modeling and LLM-driven insights to power cross-program analysis, enable hypothesis generation, and drive discovery through FAIR, integrated data.
Abstract:
In an AI/ML-ready world, data interoperability and integration are increasingly critical. The US National Institutes of Health (NIH) is addressing these needs through major initiatives including the Common Fund Data Ecosystem (CFDE), which promotes accessibility, (re)use, and integration of NIH Common Fund programs’ data and resources through a cohesive ecosystem. By establishing common standards, data, tools, and infrastructure, CFDE serves as a model for data accessibility and interoperability.
As a compelling use case of how increased interoperability can drive data utility and scientific discovery, we present the CFDE Gene Set Browser, available through https://cfdeknowledge.org. This open-access web resource performs cross-program analyses of gene sets (lists of genes) and their relationships to additional genes, human phenotypes, and mechanisms. Importantly, this tool connects multiple disparate CFDE and non-CFDE programs, phenotypes, and data types. Through the Gene Set Browser, users can learn a) which gene sets capture important biological mechanisms, and b) which mechanisms are relevant to human health.
Gene sets are derived from six CFDE programs (GlyGen, GTEx, IDG, IMPC/KOMP2, LINCS, and MoTrPAC); intersections between CFDE programs; and differential expression analyses of CFDE transcriptomic data. Phenotypes include rare diseases from Orphanet (n=2,927) and common phenotypes/traits from the NHGRI Association to Function Knowledge Portal (n=1,237) and the EBI GWAS Catalog (n=2,213).
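As a purely illustrative sketch of what an "intersection between programs" means at the data level, the snippet below computes pairwise overlaps between named gene sets. The function name, the `min_size` parameter, and the placeholder gene labels are assumptions made for this example; it does not reproduce how the Gene Set Browser builds its intersection gene sets.

```python
from itertools import combinations


def pairwise_intersections(gene_sets, min_size=1):
    """Return the shared genes for every pair of named gene sets.

    gene_sets: dict mapping a (hypothetical) set name to a set of gene labels.
    min_size:  drop intersections with fewer than this many genes.
    """
    result = {}
    # Sort names so each pair is reported in a stable, predictable order.
    for name_a, name_b in combinations(sorted(gene_sets), 2):
        shared = gene_sets[name_a] & gene_sets[name_b]
        if len(shared) >= min_size:
            result[(name_a, name_b)] = shared
    return result


# Toy input: placeholder labels, not real program data.
toy_sets = {
    "program_1_set": {"GENE_A", "GENE_B", "GENE_C"},
    "program_2_set": {"GENE_B", "GENE_C", "GENE_D"},
    "program_3_set": {"GENE_E"},
}
overlaps = pairwise_intersections(toy_sets)
```

With the toy input, only the first two sets share genes, so `overlaps` holds a single entry for that pair.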
Relationships between phenotypes and gene sets were computed using PIGEAN (Priors Inferred from GEne ANnotations), a novel Bayesian method. PIGEAN jointly models the probability that each gene is involved in each phenotype, given the gene sets that contain the gene and the genome-wide association study (GWAS) statistics for variants near the gene. We applied PIGEAN to the above common and rare disease phenotypes/traits, in each case fitting a model using all CFDE gene sets, intersections of CFDE gene sets, and gene sets from the Mouse Genome Informatics database (MGI; >11,000 mouse model phenotypes) and MSigDB (pathway analyses). Users can obtain the estimated probability that the genes within each gene set are involved in disease, as well as the estimated probability that each individual gene is involved in disease. For each result, an LLM enables users to explore hypotheses underlying each gene set-to-disease connection.
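The general idea of combining gene set membership with GWAS evidence can be illustrated with a toy Bayesian update. This is not the PIGEAN model: the logistic prior, the per-set weights, the baseline log-odds, and the log Bayes factor standing in for GWAS support are all assumptions chosen for illustration.

```python
import math


def sigmoid(x):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prior_from_gene_sets(memberships, weights, base_log_odds):
    """Toy prior: baseline log-odds plus a weight for each gene set containing the gene.

    memberships:   names of the gene sets that contain the gene (hypothetical).
    weights:       dict of set name -> log-odds contribution (assumed, not fitted).
    base_log_odds: log-odds of involvement for a gene in no gene set.
    """
    log_odds = base_log_odds + sum(weights[name] for name in memberships)
    return sigmoid(log_odds)


def posterior(prior, log_bayes_factor):
    """Bayes' rule in log-odds form: posterior log-odds = prior log-odds + log BF."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    return sigmoid(prior_log_odds + log_bayes_factor)


# Hypothetical numbers: one informative gene set and modest GWAS support.
toy_weights = {"exercise_upregulated": 1.5}
p_prior = prior_from_gene_sets(["exercise_upregulated"], toy_weights, base_log_odds=-3.0)
p_post = posterior(p_prior, log_bayes_factor=2.0)
```

In this sketch, membership in an informative gene set raises a gene's prior above the baseline, and positive GWAS evidence raises the posterior further. Working in log-odds keeps the two sources of evidence additive.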
The Gene Set Browser has unearthed a wide range of known and novel candidate genes and mechanisms for human biological processes and diseases. For example, a gene set from MoTrPAC, a CFDE program that studies the molecular effects of exercise, contains genes that are upregulated in the blood of male rats after 2 weeks of exercise; the browser reveals the connection between these genes and reticulocyte count.
Through the Gene Set Browser, users can discover gene sets relevant to a wide range of research questions, explore connections between gene sets and other biological information (e.g., pathways and disease associations from external databases), and generate new hypotheses that might not be apparent from individual resources. Connecting CFDE gene sets to external resources is a powerful demonstration of how leveraging interoperability can foster scientific discovery.