Knowledge Graph–Powered and LLM-Assisted Microbial Growth Predictions: Integrating Symbolic Rule Mining, Boosted Trees, and Deep Graph-Based Models
Joachimiak MP
BOKR
Predicting microbial growth preferences has broad impacts in biotechnology, healthcare, and environmental management. By identifying the media and conditions conducive to the growth of an organism, researchers can streamline strain selection for industrial processes, develop targeted antimicrobials, and uncover metabolic pathways for biodegradation or bioproduction. However, microbial cultivation remains an unsolved challenge, with only a small fraction of microbial taxa easily culturable.
Microorganisms are diverse in their metabolic capabilities and growth preferences, but much of what is known about them remains fragmented and locked in unstructured text. To address this, we developed KG-Microbe, a knowledge graph (KG) covering over 800,000 bacterial and archaeal taxa, 3,000 types of complex traits, and 30,000 types of genome functional annotations. Built using a reproducible pipeline grounded in ontologies, KG-Microbe supports a spectrum of applications, such as predicting growth conditions and traits of microbes, interpreting metagenomics and other omics data, and providing recommendations for bioengineering and biomanufacturing.
Using KG-Microbe, we constructed machine learning pipelines that predict microbial growth preferences from different combinations of KG-derived input data types using (1) symbolic rule mining, which produces editable, human-readable explanations; (2) gradient boosted decision trees; and (3) deep graph-based models, which can achieve higher accuracy but with lower transparency. We demonstrate that symbolic rule mining can match the performance of “black box” methods, while boosted tree models yielded a mean precision of 70% across 46 diverse cultivation media. To help interpret and validate these predictions, we show that Large Language Models (LLMs) can be used to synthesize and explain model outputs. By comparing the models and their results, we identified key data features, data type biases, and knowledge gaps relevant to predicting growth preferences. We also use KG-Microbe embedding vector analogies and complex semantic queries across combinations of organismal traits to generate hypotheses and identify target organisms with specific properties.
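As a rough illustration of what an embedding vector analogy looks like in practice, the sketch below answers "taxon_A is to medium_A as taxon_B is to ?" by ranking nodes by cosine similarity to vec(medium_A) - vec(taxon_A) + vec(taxon_B). The node names, the three-dimensional toy vectors, and the helper functions `cosine` and `analogy` are all hypothetical and chosen only for illustration; they are not taken from KG-Microbe, and the actual embedding method and query procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def analogy(embeddings, a, b, c, top_k=1):
    """Rank nodes by similarity to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    scores = {name: cosine(vec, target)
              for name, vec in embeddings.items()
              if name not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy embeddings invented for this example (assumption: not real KG-Microbe vectors).
toy = {
    "taxon_A":  [1.0, 0.0, 0.0],
    "medium_A": [1.0, 1.0, 0.0],
    "taxon_B":  [0.0, 0.0, 1.0],
    "medium_B": [0.0, 1.0, 1.0],
    "medium_C": [1.0, 0.0, 1.0],
}

# "taxon_A grows on medium_A; what plays the same role for taxon_B?"
candidates = analogy(toy, "taxon_A", "medium_A", "taxon_B")
print(candidates)
```

In this toy setup the offset from taxon to medium is the same for both pairs, so the top-ranked candidate is `medium_B`. With real embeddings the ranking is only a hypothesis generator, and candidates would still need to be checked against the graph or the literature.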
Our work highlights the capabilities of a KG-driven approach and the trade-offs between model interpretability and predictive performance. These findings motivate the development of hybrid AI/ML approaches that combine model transparency with enhanced data utilization and predictive performance to advance microbial cultivation.