Presentation Overview:
Structural biology complemented with physics-based models has been immensely successful in elucidating the links between biomolecular structure, dynamics and function. This has resulted in biotechnological and therapeutic developments, boosted recently by the application of machine learning/AI-inspired methods. This pipeline is now being extended to the clinic; the talk will discuss examples and future directions towards informing clinical decision making.
Presentation Overview:
This lecture will cover our experience in sequencing viral and bacterial causes of pneumonia and meningitis from routine surveillance platforms. The lecture will also include outbreak investigations, including South Africa’s recent diphtheria outbreak. Viral infections currently being monitored include influenza and respiratory syncytial virus (RSV), keeping in mind the possibility of a potential influenza pandemic, but also maternal vaccines and monoclonal antibody interventions coming to Africa to prevent RSV in infants. In addition, updated data from surveillance for SARS-CoV-2 variants will be presented. Antimicrobial resistance (AMR) surveillance includes the bacterium pneumococcus (AMR driven by both vaccine-serotype and non-vaccine-serotype lineages), and diphtheria (high-level resistance to first-line antimicrobials [beta-lactams and macrolides] reported recently).
Data will be presented from whole genome sequencing of these pathogens, focusing on current global and local priorities for each. The lecture will highlight important factors to keep in mind, from planning the sequencing of representative samples of cultures or specimens to ensuring that treatment and clinical outcome metadata are captured for analyses. Quality control of sequence data will be discussed, including the correlation of phenotypic testing results with findings from genomic analyses.
Presentation Overview:
Long Covid, or Post-Acute Sequelae of COVID-19 (PASC), encompasses a range of chronic symptoms that persist after acute SARS-CoV-2 infection. The proposed mechanisms underlying Long Covid include factors such as persistent viral presence, reactivation of latent viruses, tissue damage, immune system dysregulation, and inflammatory responses. In a study involving 142 participants (uninfected controls, acutely infected individuals, convalescent controls, and Long Covid patients), integrative bioinformatics and machine learning analyses were applied to immunologic, virologic, transcriptomic, and proteomic data. This approach revealed that Long Covid patients exhibited chronic immune activation, upregulated proinflammatory pathways (e.g., JAK-STAT and IL-6), and metabolic and T cell exhaustion signatures, differentiating them from convalescent patients six months post-infection.
Advanced computational tools, particularly machine learning models, played a pivotal role in identifying these biological patterns, offering insights into biomarkers such as plasma IL-6R levels for Long Covid diagnosis. By leveraging these tools, we identified novel therapeutic targets to treat Long Covid, such as JAK inhibitors, which are now under clinical investigation (NCT06597396). This integration of bioinformatics and machine learning not only accelerates discovery but also refines our understanding of complex disease mechanisms, underscoring the importance of these tools in shaping future medical research.
Presentation Overview:
Due to the heterogeneity of multi-omics data, obtaining information from them remains a challenge. Although some solutions have been offered, most cannot accommodate the large linear dynamic range associated with these data, while others require large biological effect sizes to produce meaningful models. Here, we (a) perform a comprehensive benchmarking of multi-omics data analysis tools, and (b) introduce kurtosis-based projection pursuit analysis augmented with classification and regression trees (kPPA-CART) as a robust, easy-to-implement approach to model multi-omics data that are derived from next-generation sequencing (NGS) and mass spectrometry (MS). Most of the available methods for unsupervised multi-omics integration are unable to model low-intensity (low-count) features and instead focus on highly variable (dominant) ones. Although low-count features, such as genes involved in signaling and non-coding RNAs (ncRNA), are associated with high analytical uncertainty, they exhibit significant biological impact upon perturbation.
Methodologically, kPPA is an “unsupervised” data exploration approach that finds patterns in input data without a priori knowledge of class membership. The output of kPPA is a set of projections of the original samples onto “interesting” directions, which, when plotted against each other, show clustering of (dis)similar samples. We augment kPPA’s clustering with classification and regression trees (CART), which take cluster identities derived from k-means clustering as input to perform a quasi-supervised classification and decipher feature importance.
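To make the idea concrete, the following is a minimal two-dimensional sketch of the general technique, not the authors' implementation or their R package. It assumes made-up toy data, replaces the projection-pursuit optimiser with a grid search over angles, replaces k-means with a simple cut at the mean score, and uses a single-split decision stump in place of a full CART model. The point it illustrates is that a low-variance feature that separates two groups gives a low-kurtosis (bimodal) projection even when a noisy, high-variance feature dominates the data.

```python
import math
import random


def kurtosis(values):
    """Fourth standardized moment: about 3 for a normal sample, lower for bimodal data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)


def project(rows, direction):
    """Project each sample (row) onto a direction vector."""
    return [sum(x * d for x, d in zip(row, direction)) for row in rows]


def min_kurtosis_direction(rows, n_angles=180):
    """Grid-search unit vectors in 2-D for the projection with the lowest kurtosis."""
    best_k, best_direction = None, None
    for i in range(n_angles):
        theta = math.pi * i / n_angles
        direction = (math.cos(theta), math.sin(theta))
        k = kurtosis(project(rows, direction))
        if best_k is None or k < best_k:
            best_k, best_direction = k, direction
    return best_k, best_direction


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_stump(rows, labels):
    """One-split tree: returns (feature index, threshold, weighted Gini impurity)."""
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [lab for row, lab in zip(rows, labels) if row[j] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (j, t, score)
    return best


# Toy data (made up): feature 0 is low-variance but separates two groups,
# feature 1 is high-variance noise that a variance-driven method would favour.
rng = random.Random(0)
rows = [(rng.gauss(-1.0, 0.3), rng.gauss(0.0, 3.0)) for _ in range(100)]
rows += [(rng.gauss(1.0, 0.3), rng.gauss(0.0, 3.0)) for _ in range(100)]

kurt, direction = min_kurtosis_direction(rows)
scores = project(rows, direction)
cutoff = sum(scores) / len(scores)          # stand-in for k-means with k = 2
clusters = [1 if s > cutoff else 0 for s in scores]
feature, threshold, impurity = best_stump(rows, clusters)
print("min-kurtosis direction:", direction, "kurtosis:", round(kurt, 2))
print("most informative feature:", feature, "threshold:", round(threshold, 2))
```

In this sketch the tree is trained on cluster labels rather than known classes, which is what makes the second step quasi-supervised: the feature chosen for the split is the one that best explains the unsupervised grouping.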
Using ground truth data, we demonstrate that kPPA-CART exhibits superiority in inferring biological significance from low-intensity features. Moreover, when effect sizes (expected biological differences between conditions) are small, we show that kPPA-CART can recover important biological information better than available approaches. To provide biological context, we have re-analyzed prominent Breast Cancer (BC) data from The Cancer Genome Atlas (TCGA) and show that kPPA-CART identifies novel gene transcripts that provide a classification of BC into Basal, Her2, Luminal A, and Luminal B subclusters better than the original PAM50 panel. We validate these genes with an external set of data and show that the top kPPA-CART panel of genes is associated with poor overall survival for patients with BC for whom these genes are dysregulated. Finally, we provide an R package and an online implementation of kPPA-CART.
Presentation Overview:
Breast cancer remains the leading cause of cancer mortality in women despite available interventions; however, information on its major genetic drivers is incomplete. We aimed to identify and quantify the impact of the critical genes in breast cancer (BC) pathology for tailored patient management. Numerous signal transduction networks (STNs) in BC have cross-cutting associations. Until now, survival studies and management interventions have often considered only a few STNs or genes, thus missing a global perspective integral to understanding BC. We hypothesised that integrating STN information across major BC networks can improve disease understanding and provide applications in precision medicine. We included all known major BC pathology STNs to maximise disease heterogeneity in identifying the critical genes. A bi-directional Kaplan-Meier (KM) survival scan with log-rank statistics was used to triage genes by their expression patterns and select a statistically significant subset of all pathway genes. We then evaluated the triaged genes, together with clinical features, by modelling overall survival (OS) using Cox’s proportional hazards (CPH) regression; the best model achieved 79.2% accuracy. SHapley Additive exPlanations (SHAP) then quantified feature contributions to the model's overall survival risk (OSR) predictions. The result is the 28 most impactful genes, ranked by relevance, from three gene sets corresponding to the different expression patterns. The top three genes per category were validated through literature and databases. Among them were relatively less-studied but potentially critical genes in BC pathology, for example DKK4 and KREMEN1, both of which belong to families that negatively regulate the growth-promoting Wnt/β-catenin pathway. Including all known major networks captured a broadened scope of BC heterogeneity.
Ultimately, we demonstrated important implications in BC clinical management by showcasing a quick, intuitive, and robust overview of patient monitoring for potential healthcare applications.
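For readers unfamiliar with the survival statistics named above, here is a minimal sketch of a Kaplan-Meier estimator and a two-group log-rank statistic. It is a generic textbook version under stated assumptions, not the authors' bi-directional scanning procedure; the follow-up times and the split into high- and low-expression groups are made-up toy values.

```python
def kaplan_meier(times, events):
    """Return [(event_time, survival_probability)]; events: 1 = death observed, 0 = censored."""
    survival = 1.0
    curve = []
    for t in sorted(set(tm for tm, e in zip(times, events) if e == 1)):
        at_risk = sum(1 for tm in times if tm >= t)
        deaths = sum(1 for tm, e in zip(times, events) if tm == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


def logrank_chi2(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    all_times = list(times_a) + list(times_b)
    all_events = list(events_a) + list(events_b)
    observed_a = expected_a = variance = 0.0
    for t in sorted(set(tm for tm, e in zip(all_times, all_events) if e == 1)):
        n_a = sum(1 for tm in times_a if tm >= t)
        n_b = sum(1 for tm in times_b if tm >= t)
        d_a = sum(1 for tm, e in zip(times_a, events_a) if tm == t and e == 1)
        d_b = sum(1 for tm, e in zip(times_b, events_b) if tm == t and e == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n            # deaths expected in group A under no difference
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return (observed_a - expected_a) ** 2 / variance


# Toy follow-up times in months (made up): patients split by expression of one gene.
high_times, high_events = [1, 2, 3], [1, 1, 1]
low_times, low_events = [4, 5, 6], [1, 1, 1]

print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
print(logrank_chi2(high_times, high_events, low_times, low_events))
```

A scan of the kind described would repeat the log-rank comparison for every gene and keep those whose statistic exceeds the chosen significance cut-off (3.84 corresponds to p = 0.05 at one degree of freedom).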
Presentation Overview:
Introduction: Viral infections pose significant global health challenges, with human-viral protein-protein interactions (HV-PPIs) playing a central role in infection mechanisms and host immune responses. While experimental methods for studying HV-PPIs are resource-intensive, computational approaches, particularly machine learning (ML), offer scalable and efficient alternatives.
Methodology: Here, we present a benchmarking study evaluating the performance of various ML models in predicting HV-PPIs, focusing on three viruses: West Nile Virus (taxon ID: 11082), HIV-1 (taxon ID: 11676), and SARS-CoV-2 (taxon ID: 2697049). We curated positive and negative interaction datasets from six public databases and employed five sequence-based feature encoding methods to represent protein sequences. Six ML classifiers, including SVM and RF, were trained and evaluated using metrics such as accuracy and F1-score.
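As an illustration of what a sequence-based feature encoding looks like, the sketch below implements a conjoint-triad (CT) style encoding, one of the encodings named in the results. It assumes the commonly used seven-class grouping of amino acids and normalises counts to relative frequencies; published variants use other normalisations, and the study's exact settings are not stated here. The example sequence is arbitrary.

```python
from itertools import product

# Seven amino-acid classes commonly used for conjoint-triad encoding (assumption:
# the study may have used a different grouping or normalisation).
GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_GROUP = {aa: str(i) for i, group in enumerate(GROUPS, start=1) for aa in group}
TRIADS = ["".join(p) for p in product("1234567", repeat=3)]  # 7^3 = 343 features


def conjoint_triad(sequence):
    """Encode a protein sequence as 343 relative frequencies of class triads."""
    # Residues outside the 20 standard amino acids are skipped.
    reduced = [AA_TO_GROUP[aa] for aa in sequence.upper() if aa in AA_TO_GROUP]
    counts = dict.fromkeys(TRIADS, 0)
    for i in range(len(reduced) - 2):
        counts["".join(reduced[i:i + 3])] += 1
    total = max(len(reduced) - 2, 1)
    return [counts[t] / total for t in TRIADS]


def encode_pair(human_seq, viral_seq):
    """Feature vector for a candidate human-viral pair: the two encodings concatenated."""
    return conjoint_triad(human_seq) + conjoint_triad(viral_seq)


vector = conjoint_triad("MKVLAAGIC")
print(len(vector), sum(vector))
```

Because the encoding has a fixed length regardless of sequence length, pairs of proteins can be fed directly to classifiers such as SVM or RF.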
Results: Our results reveal that dataset imbalance significantly impacts model performance, with balanced datasets (1:1 positive-to-negative ratio) yielding more reliable predictions, emphasizing the value of techniques like SMOTE for handling imbalanced real-world data. Encoding methods significantly influence outcomes, with pseudo-amino acid composition (PAAC) (type I), quasi-sequence-order (QSO), and conjoint-triad (CT) encodings showing better generalization for taxon ID 11676 (HIV-1). Overfitting was observed in models like GBM, particularly for specific taxonomy IDs, underscoring the need for practices like limiting tree depth and hyperparameter tuning. The primary goal of HV-PPI models is to identify novel interactions. In this study, the SVM model using combination-set features identified 333 human-SARS-CoV-2 interactions, including 75 shared with experimental studies and 82 newly predicted ones. Although SARS-CoV-2 interacts with various host receptors, including ACE2, NRP-1, AXL, CD147, and heparan sulfate, as well as host proteases like FURIN, TMPRSS2, and cathepsins, our interactome revealed potential interactions between the spike (S) protein and TLR4, suggesting a role in antiviral immunity. Additionally, TRIM7 was predicted to interact with NSP12 and NSP7, possibly targeting them for ubiquitination and degradation, which could suppress viral replication. Another key finding was the predicted interaction between ACTN4 and ORF6, which may counteract the antiviral effects of ACTN4-NSP12 binding and facilitate immune suppression and viral replication.
Conclusion: These findings highlight the potential of ML in uncovering new HV-PPIs, offering insights into viral pathogenesis and therapeutic targets. However, challenges such as overfitting and small dataset sizes underscore the need for further refinement of ML models and exploration of alternative learning approaches to enhance predictive accuracy and generalizability.
Presentation Overview:
Hypertension is a significant public health concern worldwide, affecting over 1.4 billion people and contributing to cardiovascular diseases, stroke, and kidney failure (World Health Organization, 2021). In Africa, the prevalence of hypertension has been increasing, with estimates suggesting that nearly 40% of adults are hypertensive, largely due to genetic predispositions and environmental factors (Adeloye et al., 2021). Despite its growing burden, the genetic underpinnings of hypertension remain poorly understood, especially in African populations that are underrepresented in global genomic studies.
This study leverages machine learning algorithms to identify genetic variants associated with hypertension in African populations. We analyzed genomic data from 1,000 African individuals diagnosed with hypertension and 1,000 normotensive controls. Genotyping was conducted using the Illumina OmniExpress array, capturing approximately 700,000 single nucleotide polymorphisms (SNPs). Various machine learning models, including random forest, support vector machine (SVM), and gradient boosting, were implemented to identify key genetic variants predictive of hypertension.
Our results demonstrate that machine learning models effectively predict hypertension risk based on genetic information, with the random forest model achieving the highest classification accuracy of 85.2%, outperforming both gradient boosting (82.7%) and SVM (79.5%). Notably, the analysis identified several hypertension-associated variants, particularly within the NOS3, AGT, and ACE genes, which have well-established roles in blood pressure regulation. These findings underscore the utility of artificial intelligence in detecting complex genetic patterns that contribute to hypertension susceptibility.
The study highlights the potential of integrating machine learning with genomic research to enhance disease risk prediction and inform personalized medicine strategies tailored to African populations. Unlike traditional genome-wide association studies (GWAS), which primarily focus on linear associations, machine learning algorithms can capture complex, nonlinear interactions among genetic variants, enabling more robust disease modeling. The clinical implications of this research suggest that incorporating machine learning-driven genetic risk assessment into public health frameworks could improve hypertension prevention and treatment strategies, particularly in resource-limited settings.
However, further research is necessary to validate these findings using larger, more diverse datasets and functional analyses of the identified variants. Future studies should explore how environmental factors interact with genetic predispositions to influence hypertension risk and evaluate the translational potential of these predictive models in clinical settings. Additionally, the inclusion of multi-omics data, such as transcriptomic and epigenomic profiles, may further enhance the accuracy of hypertension risk prediction.
Overall, this study underscores the transformative role of artificial intelligence in genomic medicine and emphasizes the need for increased representation of African populations in genetic research. By leveraging machine learning approaches, researchers can uncover novel genetic markers of hypertension and contribute to the development of targeted therapeutic interventions that address the unique genetic architecture of African populations.
Presentation Overview:
Multi-omics strategies hold great promise for disease prognosis and diagnosis, offering a more comprehensive understanding of biological systems than single-omics approaches. By integrating multiple layers of biological information, multi-omics analyses enable better identification of disease mechanisms, biomarker discovery, and personalized treatment strategies. Machine learning (ML) algorithms are increasingly applied to these datasets to extract meaningful insights, improve disease detection, predict treatment responses, and identify biomarkers indicating susceptibility to diseases. However, despite the growing interest in multi-omics and ML integration, there is a lack of systematic investigation into how different combinations of omics datasets affect ML model performance in clinical decision support systems.
This study explores the integration of ML algorithms with multi-omics datasets to predict prostate cancer (PCa) treatment outcomes and biochemical recurrence (BCR) using The Cancer Genome Atlas (TCGA) dataset. We evaluated the predictive performance of nine ML algorithms across 63 possible omics combinations, incorporating six omics data types: single nucleotide variation (SNV), copy number variation (CNV), DNA methylation, RNA sequencing (RNA-seq), microRNA sequencing (miRNA-seq), and reverse-phase protein array (RPPA) datasets. To rank these models and omics combinations, we developed a multi-criteria decision scoring system based on key performance metrics such as accuracy, precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC).
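The figure of 63 combinations follows from taking every non-empty subset of the six omics types (2^6 - 1 = 63). The sketch below enumerates them and shows one possible shape for a multi-criteria score; the equal-weight average in `decision_score` is an assumption made for illustration, since the study's actual scoring formula is not given here.

```python
from itertools import combinations

OMICS = ["SNV", "CNV", "DNA methylation", "RNA-seq", "miRNA-seq", "RPPA"]


def omics_combinations(layers):
    """All non-empty subsets of the omics layers, from single layers up to all of them."""
    return [combo for k in range(1, len(layers) + 1) for combo in combinations(layers, k)]


def decision_score(metrics, weights=None):
    """Weighted mean of metric values in [0, 1]; equal weights by default (assumption)."""
    weights = weights or {name: 1.0 for name in metrics}
    total = sum(weights[name] for name in metrics)
    return sum(metrics[name] * weights[name] for name in metrics) / total


combos = omics_combinations(OMICS)
print(len(combos))  # 63
```

With nine algorithms evaluated on each combination, a ranking would score 9 x 63 = 567 model and data pairings per prediction task.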
Our results demonstrate that selective multi-omics integration outperforms indiscriminate aggregation. For PCa treatment outcome prediction, the best-performing combinations were CNV + SNV + DNA methylation + miRNA-seq, SNV + DNA methylation, and CNV + DNA methylation + RPPA. For BCR prediction, SNV + DNA methylation ranked highest, followed by SNV + DNA methylation + miRNA-seq and CNV + SNV + DNA methylation + miRNA-seq + RPPA. Notably, while multi-omics generally improved ML model performance compared to single-omics, the combination of all six omics datasets did not yield the best predictive power. Instead, targeted integration of specific omics types proved more effective. The XGBoost (xGB) algorithm consistently outperformed the other ML models across both tasks. Feature selection (FS) using elastic-net penalized regression yielded superior results compared to feature extraction (FE) via autoencoders.
To validate our methodology, we applied the same ML framework to TCGA breast cancer (BRCA) multi-omics datasets for PAM50 subtyping. The best-performing omics combinations for BRCA were SNV + DNA methylation + miRNA-seq, CNV + SNV + DNA methylation + RPPA, and SNV + DNA methylation + miRNA-seq + RPPA. Notably, SNV + DNA methylation alone ranked 30th, reinforcing the importance of carefully integrating complementary omics layers. The BRCA validation confirmed that while multi-omics strategies enhance predictive power, dataset size and phenotype balance play crucial roles. XGBoost emerged as the best-performing algorithm, followed by gradient boosting and support vector machines.
In conclusion, this study provides a large-scale investigation into multi-omics data integration with ML for precision oncology. It highlights the need for careful omics selection rather than arbitrary multi-omics aggregation and underscores the importance of addressing class imbalance and feature representation challenges in clinical ML applications. Our findings contribute to the development of more reliable and interpretable AI-driven clinical decision support systems for cancer management.
Presentation Overview:
Although disease-causal genetic variants have been found within silencer sequences, we still lack a comprehensive analysis of the association of silencers with diseases. Here, we profiled GWAS variants in 2.8 million candidate silencers across 97 human samples derived from a diverse panel of tissues and developmental time points, using deep learning models.
We show that candidate silencers exhibit strong enrichment in disease-associated variants, and several diseases display a much stronger association with silencer variants than enhancer variants. Close to 52% of candidate silencers cluster, forming silencer-rich loci, and, in the loci of Parkinson's-disease-hallmark genes TRIM31 and MAL, the associated SNPs densely populate clustered candidate silencers rather than enhancers, displaying an overall twofold enrichment in silencers versus enhancers. The disruption of apoptosis in neuronal cells is associated with both schizophrenia and bipolar disorder and can largely be attributed to variants within candidate silencers. Our model permits a mechanistic explanation of causative SNP effects by identifying altered binding of tissue-specific repressors and activators, validated with 70% directional concordance using SNP-SELEX. Narrowing the focus of the analysis to individual silencer variants, experimental data confirm the role of the rs62055708 SNP in Parkinson's disease, rs2535629 in schizophrenia, and rs6207121 in type 1 diabetes.
In summary, our results indicate that advances in deep learning models for the discovery of disease-causal variants within candidate silencers effectively "double" the number of functionally characterized GWAS variants. This provides a basis for explaining mechanisms of action and designing novel diagnostics and therapeutics.
Presentation Overview:
The underlying genetic architecture of dilated cardiomyopathy (DCM) in Southern Africa has not been described despite the high prevalence of this condition in patients residing in this region. The availability of multiple “omics” techniques for genomic sequencing, including whole exome sequencing (WES), at reduced costs is slowly enabling the study of many diseases affecting African populations in under-resourced settings. This study aimed to determine the underlying genetic aetiology of DCM in patients from Southern Africa using WES. A cohort of 100 unrelated patients with heart failure of unknown origin was recruited from Charlotte Maxeke Johannesburg Academic Hospital (CMJAH) and subjected to WES. The participants were aged between 16 and 77 years (mean 47 years); 67% were male and 92% identified as black. The median left ventricular ejection fraction was 26.5% (interquartile range: 16–37), and late gadolinium enhancement was visualised in 42% of participants. Variant calling was carried out on the WES data following the Genome Analysis Toolkit (GATK) Best Practices for WES data analysis, and the resulting variants were annotated using Ensembl’s Variant Effect Predictor (VEP). Through various bioinformatics techniques, in combination with genetically and clinically guided interpretation, we identified and prioritised several genetic variants in the BAG3, FLNC, DSP, MYH7 and TTN genes that have potential roles in the pathogenicity of DCM. This study not only presents potential DCM causal variants but also lays a foundation for WES data analysis workflows for similar studies that use WES to determine the underlying genetic aetiology of diseases in under-resourced African settings.
Presentation Overview:
To resolve crucial global issues, the widespread application of genetic engineering at an industrial level is key. Effective genetic engineering at an industrial scale hinges heavily on precise cellular control of the microorganism at hand. However, the majority of synthetically engineered strains fail at the industrial level due to disruptions in gene regulation. This stems from a lack of understanding and usage of gene regulatory networks (GRNs), which control cellular processes and metabolism. Research shows that effective manipulation of host GRNs and effective introduction of synthetic GRNs can improve product yield and functionality significantly. However, current GRN inference tools are extremely slow, inaccurate, and incompatible with industrial-scale processes. As a result, there are no complete expression-based GRNs for any commonly used organism, which limits the application of GRNs as a practical tool in genetic engineering at the industrial level. This research proposes a novel computational system, GEMINI, to enable fast and efficient GRN inference for integration into industrial-scale pipelines. GEMINI consists of two main parts. First, I create a novel information-theoretic algorithm that replaces traditional sequential inference and calculation methods, ensuring compatibility with parallel processing. Second, I integrate a novel GNN architecture based on spectral convolution to bypass intensive eigenvalue computation and efficiently learn global and local regulatory structures. On the DREAM4 and DREAM5 in silico benchmarks, GEMINI outperforms all industry leaders in terms of AUROC and AUPRC, achieving a nearly 300% increase in AUPRC compared to the industry-leading method, GENIE3. When applied to a real biological E. coli dataset, GEMINI not only recovered 98% of existing interactions but also discovered 468 novel candidate interactions, which were validated against literature.
Thus, GEMINI was able to construct the most complete expression-based GRN of E. coli to date, providing a novel biological blueprint for genetic engineers to use at the industrial level. GEMINI removes reliance on expensive computing equipment and enables fast and accurate GRN inference for the first time, opening doors to more efficient gene expression control and metabolic pathway manipulation for more effective application of genetic engineering at an industrial level.
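The abstract does not specify GEMINI's information-theoretic algorithm. As a generic illustration of the family of methods it refers to, the sketch below scores candidate regulatory edges by the mutual information between binned expression profiles. The gene names, expression values, and equal-width binning are all made-up assumptions, and nothing here should be read as GEMINI's actual method.

```python
import math
from collections import Counter
from itertools import combinations


def discretize(values, n_bins=3):
    """Equal-width binning of an expression profile into integer bin labels."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # constant profiles fall into a single bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=3):
    """Mutual information (in bits) between two binned expression profiles."""
    bx, by = discretize(x, n_bins), discretize(y, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum((c / n) * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def rank_edges(expression):
    """Score every gene pair independently and sort from strongest to weakest."""
    scored = [
        (mutual_information(expression[g1], expression[g2]), (g1, g2))
        for g1, g2 in combinations(expression, 2)
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical expression profiles across six samples (made-up values).
expression = {
    "geneA": [1, 2, 3, 4, 5, 6],
    "geneB": [2, 4, 6, 8, 10, 12],
    "geneC": [3, 1, 3, 1, 3, 1],
}
edges = rank_edges(expression)
print(edges[0])
```

Each pair is scored without reference to any other pair, which is the property that makes this kind of calculation straightforward to distribute across workers.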
Presentation Overview:
Background: Rotavirus remains a leading cause of severe gastroenteritis in children under five, particularly in low- and middle-income countries (LMICs). In Malawi, G9P[6] strains re-emerged in 2017, five years after the introduction of Rotarix rotavirus vaccine, necessitating an in-depth investigation of their genetic diversity, evolutionary origins, and public health implications.
Methods: Using whole-genome sequencing (WGS), we analysed and assigned complete genotype constellations and employed phylogeographic and phylogenetic network analyses to trace the evolutionary pathways of G9P[6] strains (n=11) collected between 2017 and 2022.
Findings: The re-emergent G9P[6] strains were characterised by a DS-1-like G9-P[6]-I2-R2-C2-M2-A2-N2-T2-E2-H2 genotype constellation. Phylogeographic analysis of the VP7 gene revealed monophyletic clustering with contemporary G9P[6] strains from Mozambique. Phylogenetic network analysis demonstrated high genetic similarity of the inner capsid and non-structural genes of G9P[6] strains to previously circulating Malawian G2P[4], G2P[6], G3P[4], and G3P[6] strains. Time-resolved phylogenies dated the most recent common ancestor for the inner capsid and non-structural genes to between 2009 and 2015. Evolutionary analysis suggested lineage spillover events associated with the VP6 segment.
Conclusion: This study, for the first time in Malawi, elucidates the role of reassortment and zoonotic transmission in the re-emergence of G9P[6] strains. These findings highlight the evolutionary dynamics of rotaviruses and the need for continuous genomic surveillance. Considering the limited heterotypic protection provided by the Rotarix (G1P[8] strain) vaccine, tailored vaccination strategies and ongoing vaccine effectiveness studies are critical to addressing the emergence of novel rotavirus strains and improving vaccine performance in LMICs.
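The genotype constellation quoted in the findings follows the standard Gx-P[x]-Ix-Rx-Cx-Mx-Ax-Nx-Tx-Ex-Hx notation, in which each position corresponds to one of the eleven genome segments. The sketch below shows how such a string maps onto segments and how the nine backbone genes can be compared with the DS-1-like and Wa-like reference patterns. The function names are illustrative, and the simple string comparison is a simplification of how genotyping tools assign constellations.

```python
# Segment order used by the Gx-P[x]-Ix-Rx-Cx-Mx-Ax-Nx-Tx-Ex-Hx notation.
SEGMENTS = ["VP7", "VP4", "VP6", "VP1", "VP2", "VP3",
            "NSP1", "NSP2", "NSP3", "NSP4", "NSP5/6"]

DS1_BACKBONE = "I2-R2-C2-M2-A2-N2-T2-E2-H2".split("-")
WA_BACKBONE = "I1-R1-C1-M1-A1-N1-T1-E1-H1".split("-")


def parse_constellation(constellation):
    """Map each genotype in a constellation string to its genome segment."""
    genotypes = constellation.split("-")
    if len(genotypes) != len(SEGMENTS):
        raise ValueError("expected 11 genotypes, got %d" % len(genotypes))
    return dict(zip(SEGMENTS, genotypes))


def backbone_type(constellation):
    """Classify the nine non-G/P genes as DS-1-like, Wa-like, or mixed."""
    backbone = constellation.split("-")[2:]
    if backbone == DS1_BACKBONE:
        return "DS-1-like"
    if backbone == WA_BACKBONE:
        return "Wa-like"
    return "mixed/other"


strain = "G9-P[6]-I2-R2-C2-M2-A2-N2-T2-E2-H2"  # constellation reported in the findings
print(parse_constellation(strain)["VP7"], backbone_type(strain))
```

A mismatch between the outer capsid genotypes (G and P) and the backbone type is one of the simplest signals of reassortment that this notation makes visible.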
Presentation Overview:
Stemming from a complex etiology that includes a strong genetic component (1), Parkinson's disease (PD) is a neurodegenerative disorder characterized by a wide range of both motor and non-motor symptoms (2). The burden of PD is increasing rapidly within the aging Sub-Saharan African populations, ranking as the 11/12th most prevalent nervous system disorder in the region (3). Despite the rise in disease prevalence, the representation of African populations in PD genetic research remains limited. Allele frequencies vary across genomes due to factors such as natural selection, genetic drift, and differing exposures to environments and pathogens. These variations in allele frequencies can help identify population-specific disease risk variants in admixed individuals while simultaneously uncovering risk variants relevant to multiple populations (4).
Genome-wide association studies (GWAS) have successfully identified susceptibility variants linked to PD (5,6). However, the majority of these studies have focused on European cohorts, with few including diverse ancestries. Using genotyped and imputed data from 1,516 South African participants, we conducted a GWAS using SAIGE software, which includes the genetic relationship matrix as a random effect, allowing for the inclusion of related individuals. Moreover, we inferred global and local ancestry for the cohort to better understand the genetic admixture in the South African population and further investigate the GWAS results. Our GWAS findings were replicated using a Latin American cohort. The ancestry inference showed the South African cohort to be five-way admixed between European (EUR; 56%), African (AFR; 18.8%), indigenous Khoe-San Nama (NAMA; 13%), South Asian (SAS; 6.9%), and Malaysian (MAL; 5.2%) ancestries. The GWAS identified one variant with genome-wide significance and 351 variants with suggestive significance. Of these, 14 variants replicated in the Latin American cohort. In the local ancestry window containing the top GWAS hit, 86.7% of the variant carriers were inferred to have AFR, 11% NAMA, and 2.2% MAL ancestries. No carriers exhibited EUR or SAS inferred ancestry. This suggests that the variant is ancestry-specific and highlights the value of including populations previously underrepresented in PD genetic research to reveal novel susceptibility variants. Our findings contribute to a global understanding of the complex genetic etiology of PD.
1. Trevisan, L. et al. Genetics in Parkinson’s disease, state-of-the-art and future perspectives. Br. Med. Bull. 149, 60–71 (2024).
2. Armstrong, M. J. & Okun, M. S. Diagnosis and treatment of Parkinson disease: A review. JAMA 323, 548–560 (2020).
3. GBD 2021 Nervous System Disorders Collaborators. Global, regional, and national burden of disorders affecting the nervous system, 1990-2021: a systematic analysis for the Global Burden of Disease Study 2021. Lancet Neurol. 23, 344–381 (2024).
4. Swart, Y. et al. Local ancestry adjusted allelic association analysis robustly captures tuberculosis susceptibility loci. Front. Genet. 12, 716558 (2021).
5. Nalls, M. A. et al. Identification of novel risk loci, causal insights, and heritable risk for Parkinson’s disease: a meta-analysis of genome-wide association studies. Lancet Neurol. 18, 1091–1102 (2019).
6. Kim, J. J. et al. Multi-ancestry genome-wide association meta-analysis of Parkinson’s disease. Nat. Genet. 56, 27–36 (2024).
Presentation Overview:
Introduction
Middle East respiratory syndrome coronavirus (MERS-CoV) is endemic in dromedary camels from the Arabian Peninsula and Africa, with a comparably high seroprevalence of >75%. High camel population density and the loss of maternal antibodies in farmed camel calves are linked to acute MERS-CoV outbreaks. Investigations into MERS-CoV outbreak patterns in nomadic camels are challenged by limited infrastructure in remote and resource-restricted camel migration regions.
Study Objective
We performed a continuous 12-month study at an abattoir hub for nomadic camels in Northern Kenya. We investigated MERS-CoV incidence in migrating camels and determined genomic diversity of contemporary MERS-CoV variants.
Methods
We collected nasal swabs from 10-15 camels on 4-5 days per week from September 2022 to September 2023, totalling 2711 camels sampled during the period at the main abattoir in Isiolo County, Kenya. The samples were tested for MERS-CoV RNA using UpE and ORF1a RT-qPCR. Genomic diversity was assessed using Illumina next-generation sequencing (NGS) and ORF1ab domain assembly for RNA samples with >1 x 10^6 genome copies/mL.
Results
MERS-CoV RNA was detected in 36/2711 (1.3%) nasal swabs. MERS-CoV incidence was biphasic, with detection peaks in the first week of October 2022 (7/60, 11.7%) and the first week of February 2023 (7/58, 12.1%). The cumulative MERS-CoV RNA positivity rate was higher in September–October 2022, at 19/381 (5.0%), than in January–March 2023, at 17/727 (2.3%). ORF1ab sequences were obtained for 9/36 MERS-CoV RNA-positive samples, and phylogenetic analyses were performed. The sequences formed a distinct clade from other Clade C viruses but clustered with Clade C2.2, mostly prevalent in East Africa. The 9 ORF1ab sequences were highly similar (>99.93% nucleotide identity) and had 99.75–99.78% nucleotide identity with the closest MERS-CoV relative, identified in Akaki, Ethiopia, in 2019.
Conclusion
The biphasic MERS-CoV incidence in nomadic camels may be linked to seasonality factors, such as the biannual alternating wet and dry seasons in Northern Kenya. Interestingly, camel calves are primarily born during the two wet seasons and maternal antibody loss coincides with the observed two MERS-CoV RNA detection peaks. Phylogenetic analysis suggests that we identified at least 3 MERS-CoV clusters over 3 different weeks in dromedaries originating from different locations.
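The positivity rates quoted in the results can be reproduced directly from the reported counts. The sketch below also attaches 95% Wilson score intervals; these intervals are not reported in the abstract and are added here only to show how much uncertainty the small weekly denominators carry.

```python
import math


def positivity(positives, total):
    """Percentage of samples testing positive."""
    return 100.0 * positives / total


def wilson_interval(positives, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95% coverage)."""
    p = positives / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# (positives, samples tested) as reported in the results.
COUNTS = {
    "whole study period": (36, 2711),
    "first week of October 2022": (7, 60),
    "first week of February 2023": (7, 58),
    "September-October 2022": (19, 381),
    "January-March 2023": (17, 727),
}

for label, (pos, n) in COUNTS.items():
    low, high = wilson_interval(pos, n)
    print(f"{label}: {positivity(pos, n):.1f}% "
          f"(95% CI {100 * low:.1f}-{100 * high:.1f}%)")
```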
Authors List: Show
Presentation Overview: Show
Genetic diseases (GDs) pose a significant public health challenge as they are a leading cause of morbidity and premature death. In Tunisia, there are 589 reported GDs, with more than 60% being autosomal recessive. In 40% of these cases, the molecular etiology is unknown, highlighting the urgent need for advanced genetic research and diagnostic tools. Addressing this gap is crucial for improving patient outcomes and guiding therapeutic interventions. A combination of manual and AI-assisted text mining from the literature is used to collect complex genetic data on GDs in Tunisia, ensuring data integrity. Python and R scripts are employed for data validation and biological database queries. Bioinformatic approaches, including AI, are being utilized for in-silico drug (re)discovery. Cutting-edge technologies support the development of the PREMEDIT platform. To date, more than 600 GDs have been identified in Tunisian patients, with approximately 1,000 mutations, representing the most comprehensive mutatome of the Tunisian population. Data analyses revealed the scarcity of epidemiological data and treatments for rare GDs. The genetic, epidemiological, and pharmaceutical data have been integrated into a centralized platform: PREMEDIT. By consolidating comprehensive data on genetic mutations and their correlation with specific treatments, PREMEDIT aims to enhance diagnosis and tailor therapeutic strategies for the Tunisian population. The integration of AI not only refines data accuracy but also facilitates the efficient identification of complex genetic patterns, empowering the platform to provide more precise diagnostic and therapeutic recommendations. This platform serves as a crucial resource for healthcare professionals, researchers, and policymakers, bridging the gap between genetic research and clinical practice. PREMEDIT will contribute to innovation across biomedical communities as well as pharmaceutical companies, improving the quality of life for patients.
Authors List: Show
Presentation Overview: Show
One of the main objectives of the Public Health Alliance of Genomic Epidemiology (PHA4GE) is to equip those working in the public health sector with the necessary skills and knowledge to efficiently and effectively respond to disease outbreaks. A key component of our courses is the inclusion of domain experts from diverse communities, whose insights and experiences provide invaluable perspectives that enrich the learning process and ensure that what is being integrated is relevant for the target group. For our wastewater surveillance bioinformatics course, participants are presented with an outbreak scenario which imitates the process of forming a response in the case of a real-world event, with the overarching aim of using the skills and knowledge acquired to build a wastewater surveillance health dashboard.
The foundational concepts, such as the introduction to wastewater surveillance epidemiology and wastewater surveillance bioinformatics, use a pedagogical approach (Bansal et al., 2020), where lessons are broken down into short, modular presentations. A combined direct learning model and cooperative learning model (Guzmán and Payá, 2020) allow participants to rapidly engage with the material to grasp concepts such as public health and community impact and to develop fundamental analytical skills. Andragogical strategies (Bansal et al., 2020), such as outbreak case studies and discussions dispersed between modules, further enhance the peer learning process (Tullis and Goldstone, 2020) by promoting teamwork and idea sharing through the application of acquired knowledge.
The main learning approach involves following an avatar who needs to solve day-to-day bioinformatics problems. This incorporates an interactive and playful learning experience, where tasks become progressively more challenging, thus incorporating both narrative and game-based learning approaches (Breien and Wasson, 2021).
For the final project, participants engage in a heutagogical exercise, where they are required to build a dashboard in pairs, incorporating all their accumulated skills and knowledge. Participants have to decide which of the principles, techniques and tools they have learned throughout the course are relevant to the assignment. Heutagogy is also a key component of the online curriculum, empowering participants to take increasing responsibility for their own learning (Mwinkaar and Lonibe, 2024). This not only develops technical proficiency but also facilitates self-directed learning and problem-solving. Additionally, it reinforces the effective application of bioinformatics solutions in wastewater surveillance.
These are just some of the learning approaches which have consolidated diverse teaching strategies and models, facilitating the inclusivity of people with different learning styles, varying attention spans, and time constraints. The course’s modular design allows for focused exploration of topics relevant to public health practitioners and academics. The training program is designed so that participants are expected to systematically and progressively develop expertise in public health and computational techniques.
Authors List: Show
Presentation Overview: Show
Abstract: The recent emergence and re-emergence of infectious diseases in Africa highlight the critical need for robust pathogen genomic surveillance systems across the continent. Effective surveillance depends on comprehensive training and capacity development in pathogen genomics and bioinformatics, as rapid public health responses to disease outbreaks rely on continuously enhancing these essential skills. To ensure quality and consistency in training, the development and implementation of a standardised curriculum are crucial, enabling uniform skill-building and knowledge dissemination across diverse regions.
Over the past four years, we have delivered hybrid training in pathogen genomic surveillance and bioinformatics to over 290 participants from 36 African countries. These initiatives, tailored to diverse personas in national public health institutions, leveraged trainers and facilitators from across the continent to address varying competency levels. We have also developed and implemented resources to support our training initiatives, including a user-friendly helpdesk ticketing system, a robust trainer database, and intuitive websites hosting training materials. These tools work jointly to ensure that training and related resources are widely accessible, while also providing participants with support and engagement opportunities long after receiving training.
To ensure consistency in the training of public health staff in Africa, a standardised pathogen genomics surveillance training curriculum has been developed. The curriculum is designed to serve as a comprehensive resource for trainers, encompassing content that ranges from foundational courses in generic, wet-lab, and bioinformatics topics to advanced pathogen-specific courses that include tailored genomic surveillance workflows. The next step is implementing this curriculum in future training initiatives across African public health institutes. Additionally, we are exploring the integration of AI in pathogen genomics curriculum development and training.
Our training efforts have highlighted the need for ongoing training and capacity building in pathogen genomic surveillance in Africa. A standardised curriculum can help address this need and facilitate consistent skills development and collaboration across the continent’s public health institutes. Implementing this curriculum and exploring AI-driven training and decision-making will enhance preparedness for future disease outbreaks and public health responses.
Authors List: Show
Presentation Overview: Show
The integration of artificial intelligence (AI) into genomics promises substantial advancements in personalised medicine, disease prediction, and gene editing, but it also presents critical ethical and governance challenges. This study explores these challenges by addressing three main research questions: (1) What are the primary ethical concerns related to AI applications in genomics, including privacy, consent, and bias? (2) How are current governance structures addressing or failing to address these issues? and (3) How can effective governance frameworks be established to ensure responsible, equitable, and transparent use of AI in this field? Using a mixed-methods approach that includes a systematic literature review, expert interviews, and case analysis, the study examines the ethical risks and governance gaps in AI-driven genomic research. Findings indicate significant concerns around data privacy, potential misuse of genetic information, and the exacerbation of existing health disparities due to biased data and algorithms. Additionally, existing regulatory frameworks lack sufficient guidelines to address algorithmic accountability, data ownership, and inclusive representation within genomic datasets. The study concludes by recommending a multi-stakeholder governance model that emphasizes transparency, fairness, and adaptability. This framework would involve guidelines for data handling, bias mitigation, and global collaboration among governments, private sectors and global health organizations. It provides actionable steps to establish ethical oversight in the evolving landscape of AI-driven genomics. These recommendations aim to enhance public trust and ensure that AI’s role in genomics aligns with ethical standards that protect individual rights and foster equitable health outcomes.
Authors List: Show
Presentation Overview: Show
Background
Infectious diseases continue to present significant public health issues in low- and middle-income countries like Tanzania, where the integration of clinical and genomic data is important for better disease diagnosis and surveillance. However, existing health information systems mostly operate as separate data sources, limiting the ability to connect clinical data with genomic data for better patient diagnosis and infectious disease control. To tackle these challenges, we developed an integrated information system that combines clinical data collected in a customized District Health Information System 2 (DHIS2) with genomic data generated from Nanopore sequencing. The system aims to integrate these data, aiding clinicians and laboratory scientists in identifying multiple pathogens from a single patient sample and public health researchers in viewing infectious disease patterns.
Methods
Clinical data, including patient demographics and symptoms such as fever and diarrhoea, were collected from healthcare facilities using a customised DHIS2, an open-source software widely used for health data collection in Tanzania. R scripts were used to securely fetch clinical data through the DHIS2 API and integrate them with the genomic results produced by the cgetools bioinformatics pipeline. This pipeline uses tools such as KmerFinder for pathogen identification, supporting the detection of diverse pathogens from a single sample. R's Shiny web framework was used to build an interactive web interface that allows users to search for patient IDs and view detailed clinical data alongside the genomic data showing the identified pathogens.
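The core of the integration step is a join between clinical records and pathogen calls on a shared patient ID. The authors implemented this in R. The sketch below only illustrates the join logic in Python, using made-up records. The field names, patient IDs, region labels, and pathogen names are all hypothetical and do not reflect the actual DHIS2 export or pipeline output format.

```python
from collections import defaultdict

# Hypothetical clinical records, standing in for rows fetched from DHIS2.
clinical_records = [
    {"patient_id": "P001", "sex": "F", "region": "Region A", "symptoms": ["fever"]},
    {"patient_id": "P002", "sex": "M", "region": "Region B",
     "symptoms": ["fever", "diarrhoea"]},
]

# Hypothetical pathogen calls, standing in for parsed pipeline results.
pathogen_calls = [
    {"patient_id": "P001", "pathogen": "Escherichia coli"},
    {"patient_id": "P001", "pathogen": "Klebsiella pneumoniae"},
    {"patient_id": "P003", "pathogen": "Salmonella enterica"},
]


def link_records(clinical, calls):
    """Attach every pathogen call to the clinical record with the same patient ID."""
    pathogens_by_patient = defaultdict(list)
    for call in calls:
        pathogens_by_patient[call["patient_id"]].append(call["pathogen"])

    linked = {}
    for record in clinical:
        pid = record["patient_id"]
        # Patients without sequencing results keep an empty pathogen list.
        linked[pid] = {**record, "pathogens": pathogens_by_patient.get(pid, [])}
    return linked


def lookup(linked, patient_id):
    """Return the combined record for one patient, or None if unknown."""
    return linked.get(patient_id)


linked = link_records(clinical_records, pathogen_calls)
print(lookup(linked, "P001"))
```

This version performs a left join from the clinical side, so genomic results without a matching clinical record (P003 here) are dropped. Whether such orphaned results should instead be flagged for review is a design choice the abstract does not describe.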
Results
The developed system successfully processed and integrated 21 datasets, connecting clinical information with genomic output results. The datasets included key clinical variables such as patient symptoms (for example, fever and diarrhoea), gender, and region of origin, while the linked genomic data showed the pathogens identified from patient samples. The system provides an interactive web interface in which users can search for patient IDs to view detailed clinical records alongside genomic data. It also offers interactive data visualisation, including bar graphs that show trends in pathogen occurrence across Tanzanian regions, enabling epidemiological monitoring and outbreak detection.
Discussion
The developed system yields several insights into the potential of clinical-genomic data integration in infectious disease control. The use of an open-source health information system like DHIS2 demonstrated the feasibility of leveraging existing digital health data collection software to enhance data integration in healthcare. Additionally, the application of the cgetools pipeline for pathogen detection proved effective in identifying multiple pathogens from a single sample. The integration process highlights the system's value in supporting real-time clinical decision-making. The interactive visualization tools provided valuable information on pathogen distribution patterns, emphasising their important role in outbreak detection and response.
Conclusion
By integrating clinical data from DHIS2 with genomic sequencing outputs, this system offers a powerful tool for infectious disease surveillance in Tanzania. It supports the identification of different pathogens, enabling timely diagnosis and supporting infectious disease control. The system’s flexible, scalable design makes it suitable for applications in infectious disease management across healthcare settings.
Authors List: Show
Presentation Overview: Show
Motivation
Knowledge graphs (KGs) are powerful tools for structuring and analyzing biological information due to their ability to intuitively represent data and improve query performance across heterogeneous datasets. However, constructing KGs from unstructured scientific literature remains challenging due to the high cost and expertise required for manual curation. Prior works have explored text-mining techniques to automate this process but have limitations that impact their ability to capture complex biological interactions fully.
Traditional text-mining methods struggle with understanding context across sentences. Additionally, these methods lack expert-level background knowledge, making it difficult to infer relationships that require awareness of biological concepts indirectly described in the text.
Large Language Models (LLMs) present an opportunity to overcome these challenges. LLMs are trained on large amounts of diverse biological literature, equipping them with contextual knowledge that enables more accurate extraction. Additionally, LLMs can process the entirety of an article’s text, capturing relationships across several sections rather than analyzing sentences in isolation; this allows for more precise extraction.
Results
To address these challenges, we present textToKnowledgeGraph (https://pypi.org/project/texttoknowledgegraph), an artificial intelligence (AI) tool using LLMs to extract interactions from individual publications directly in Biological Expression Language (BEL). BEL was chosen for its compact and detailed representation of biological relationships, allowing for structured and computationally accessible encoding. The tool provides two usage modes: 1) a Python package usable through the command line or within other projects, or 2) an interactive application within Cytoscape Web to simplify extraction and online exploration. In the text processing pipeline, we leverage LangChain with GPT-4o for information extraction, using a predefined schema implemented with Pydantic to ensure structured outputs for BEL generation. The extracted BEL statements are output in CX2 format, enabling visualization and exploration within the Cytoscape ecosystem. Additionally, the ndex2 package is used for CX2 conversion and to support optional storage and sharing of extracted networks on NDEx. In this initial version of textToKnowledgeGraph, we only support the extraction of interactions into BEL. Future updates will enable greater customization, making it more adaptable for broader applications.
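To illustrate what a predefined schema buys in such a pipeline, the sketch below validates LLM-style extraction output against a fixed statement structure before rendering it as BEL text. It is not the tool's actual schema: it uses standard-library dataclasses as a stand-in for Pydantic so that it runs without extra dependencies. The class name, field names, the small subset of BEL relationship terms, and the placeholder gene identifiers are all assumptions.

```python
from dataclasses import dataclass

# A small subset of BEL relationship terms, chosen for illustration only.
ALLOWED_RELATIONS = {
    "increases", "decreases", "directlyIncreases", "directlyDecreases", "association",
}


@dataclass(frozen=True)
class BelStatement:
    """Hypothetical schema for one extracted subject-relation-object triple."""
    subject: str
    relation: str
    obj: str
    evidence: str = ""

    def __post_init__(self):
        # Reject missing or non-text fields before they reach the graph.
        for name in ("subject", "relation", "obj"):
            value = getattr(self, name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")
        if self.relation not in ALLOWED_RELATIONS:
            raise ValueError(f"unsupported relation: {self.relation!r}")

    def to_bel(self) -> str:
        """Render the triple as a single BEL-style statement string."""
        return f"{self.subject} {self.relation} {self.obj}"


def parse_extractions(raw_items):
    """Keep well-formed items and collect the rest for manual review."""
    valid, rejected = [], []
    for item in raw_items:
        try:
            valid.append(BelStatement(**item))
        except (TypeError, ValueError) as err:
            rejected.append((item, str(err)))
    return valid, rejected


# Placeholder identifiers; GENE_A and GENE_B are not real gene symbols.
raw = [
    {"subject": "p(HGNC:GENE_A)", "relation": "increases", "obj": "p(HGNC:GENE_B)"},
    {"subject": "p(HGNC:GENE_A)", "relation": "stimulates", "obj": "p(HGNC:GENE_B)"},
    {"subject": "p(HGNC:GENE_A)", "relation": "decreases"},
]
valid, rejected = parse_extractions(raw)
for statement in valid:
    print(statement.to_bel())
```

Rejecting malformed items at this boundary keeps free-text model output from leaking into the network, at the cost of discarding statements that a human curator might have been able to repair.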
To evaluate the accuracy of extracted interactions, we applied textToKnowledgeGraph to various published articles. The extracted interactions were manually reviewed by BEL experts, ensuring the biological accuracy and completeness of captured relationships. Finally, we present a use case example in which a topic-specific BEL knowledge graph provides relevant information to augment queries to an LLM using a technique known as Graph Retrieval Augmented Generation (Graph RAG).
Authors List: Show
Presentation Overview: Show
Large-scale data platforms enable researchers and the public to access, manage and study massive amounts of genomics data. While small research teams can generate these massive datasets, they often struggle to build the platforms needed for transparent and reproducible FAIR data management and sharing.
We built Overture, a suite of reusable, open-source software to develop reliable data management systems quickly, flexibly and at multiple scales. Overture successfully underpins many large-scale international data platforms, including ICGC-ARGO which aims to store genomic and clinical data for over 100,000 participants and VirusSeq which hosts data for over 500,000 pathogen genomes. Behind these platforms are large organizations with large teams that plan, develop and deploy the Overture suite with relative ease. Yet, this can be prohibitively demanding for smaller research groups. How can we help them build data platforms more efficiently and with fewer resources?
We address this challenge with Prelude, a tool that enables teams to incrementally build their data platforms by breaking down development into systematic phases. Prelude focuses on solving a specific challenge in platform adoption: the high technical overhead and configuration burden required during the planning and development stages. By breaking down data portal development into phased steps, teams can systematically verify requirements through hands-on testing, which provides clear insights into user workflows, data needs, and overall platform fit.
Prelude guides teams through three progressive phases of data platform development. Each phase builds upon the previous one's foundation and can be deployed locally with a single command:
- Phase one focuses on data exploration and theming, enabling teams to visualize and search their tabular data through a customizable portal UI;
- Phase two expands capabilities to enable tabular data management and validation with persistent storage;
- Phase three adds file management and object storage.
Prelude also includes configuration generation services that validate and transform your data into key configuration files; this greatly reduces time spent on tedious manual configuration.
With early adopters reporting significant reductions in configuration time, Prelude is enabling teams to transition through initial planning and development stages efficiently. Looking ahead, we are focused on enabling teams to independently transition to production settings. We are sharing this work to gather community feedback on our approach and learn from others' experiences. Prelude represents a practical step toward making data platform development accessible to research teams with limited resources, reducing initial barriers so teams can do more with less.
Authors List: Show
Presentation Overview: Show
Africa is woven together by population movements, ethno-linguistic diversity, and a unique genetic heritage. Recently, a number of genetics/genomics projects have arisen on the continent, such as those developed in the framework of the Human Heredity and Health in Africa (H3Africa) initiative. With significant ethnic diversity on the continent, researchers face difficulties in defining, in a standardized format, a population or unit that represents a social group in Africa. Here we developed an African Population Ontology (AfPO) framework that aims to structure this knowledge in a harmonized and standardized way, in order to describe African populations and sub-populations. We used publicly available data related to African populations and their demography, geographic localization, spoken language and genetic background. The WebProtégé platform was used to design and implement this ontology. The country of origin and the populations were selected as descriptive classes. The AfPO was validated by the OBO Foundry community and is available on both GitHub and EBI-OLS. The AfPO enables the annotation of African population groups, and brings together knowledge accumulated about existing populations with their genetic fingerprint in a standardized format; it can be employed to comprehensively annotate African participants in research studies. It can also be used to describe participants of past studies, by mapping them to population identifiers or synonyms. The ontology produced is essential to the study of the history of African populations and their genetics, and is therefore invaluable in addressing public health issues, promoting cultural preservation and fostering a more nuanced appreciation of Africa's unique place in human history.
Authors List: Show
Presentation Overview: Show
Introduction: Homology modeling is a widely used computational technique for predicting the three-dimensional (3D) structures of proteins based on known templates and evolutionary relationships, providing structural insights critical for understanding protein function, interactions, and potential therapeutic targets. However, existing tools often require significant expertise and computational resources, presenting a barrier for many researchers.
Methods: Prostruc is a Python-based homology modeling tool designed to simplify protein structure prediction through an intuitive, automated pipeline. Integrating Biopython for sequence alignment, BLAST for template identification, and ProMod3 for structure generation, Prostruc streamlines complex workflows into a user-friendly interface. The tool enables researchers to input protein sequences, identify homologous templates from databases such as the Protein Data Bank (PDB), and generate high-quality 3D structures with minimal computational expertise. Prostruc implements a two-stage validation process: first, it uses TM-align for structural comparison, assessing root-mean-square deviation (RMSD) and TM-scores against reference models. Second, it evaluates model quality via QMEANDisCo to ensure high accuracy.
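The two structural comparison metrics named in the first validation stage can be written down compactly. The sketch below computes both for coordinates that are assumed to be already superposed and paired residue by residue. It is not TM-align, which additionally searches for the alignment and superposition that maximise the score. The function names and the 0.5 Å floor on the distance scale for short chains are assumptions of this sketch.

```python
from math import sqrt


def squared_distances(model, reference):
    """Squared distance between each pair of matched atoms (e.g. C-alpha atoms)."""
    if len(model) != len(reference) or not model:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    return [sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(model, reference)]


def rmsd(model, reference):
    """Root-mean-square deviation over the matched atoms, without refitting."""
    d2 = squared_distances(model, reference)
    return sqrt(sum(d2) / len(d2))


def tm_score(model, reference, target_length=None):
    """TM-score for one fixed superposition, normalised by the target length."""
    d2 = squared_distances(model, reference)
    length = target_length if target_length is not None else len(reference)
    # Length-dependent distance scale d0; the floor for short chains is an assumption.
    d0 = max(0.5, 1.24 * (length - 15) ** (1 / 3) - 1.8) if length > 15 else 0.5
    return sum(1.0 / (1.0 + dist2 / d0**2) for dist2 in d2) / length


# Toy example: a straight 100-atom chain and a copy shifted by 3 Å along x.
reference = [(float(i), 0.0, 0.0) for i in range(100)]
shifted = [(x + 3.0, y, z) for x, y, z in reference]
print(f"RMSD = {rmsd(shifted, reference):.2f} Å")
print(f"TM-score = {tm_score(shifted, reference):.3f}")
```

RMSD is dominated by the worst-fitting regions and grows with chain length, whereas the TM-score down-weights large deviations and is length-normalised, which is why the two are usually reported together.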
Results: The top five models are selected based on these metrics and provided to the user. Prostruc stands out by offering scalability, flexibility, and ease of use. It is accessible via a cloud-based web interface or as a Python package for local use, ensuring adaptability across research environments. Benchmarking against existing tools such as SWISS-MODEL, I-TASSER, and Phyre2 demonstrates Prostruc's competitive performance in terms of structural accuracy and job runtime, while its open-source nature encourages community-driven innovation.
Discussion: Prostruc is positioned as a significant advancement in homology modeling, making high-quality protein structure prediction more accessible to the scientific community.
Authors List: Show
Presentation Overview: Show
The spotted hyena (Crocuta crocuta) is a highly social carnivore with complex behavioural and ecological functions, making it an important model for studying genetic diversity, adaptation, and evolution. However, previous draft genomes for C. crocuta have been incomplete and derived from captive individuals, limiting insights into natural genetic variation. Here, we present a high-quality de novo genome assembly and the first draft pangenome of wild spotted hyenas sampled within the Kruger National Park in South Africa.
Using Oxford Nanopore Technologies (ONT) long-read sequencing, we generated a reference genome for a male individual, achieving a total assembly size of 2.39 Gb with a scaffold N50 of 19.6 Mb. Assembly completeness, assessed with BUSCO, revealed 98.5% completeness against the Mammalia_odb10 database and 98.2% against the Carnivora_odb10 database, confirming a high-quality assembly. This assembly is more contiguous and complete than previously published hyena genomes, which had lower N50 values and only 95% completeness. Additionally, our assembly is derived from a wild individual, providing a more ecologically relevant reference compared to those from captive specimens.
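For readers unfamiliar with the contiguity metric quoted above, N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch with a hypothetical function name and toy lengths:

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold or contig lengths."""
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length


# Toy example: total length 54, so half is 27, and 10 + 9 + 8 reaches it.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Because N50 depends only on the length distribution, it says nothing about correctness; that is why it is reported alongside a completeness measure such as BUSCO.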
To investigate population-level diversity, we sequenced ten additional free-ranging individuals using MGI short-read sequencing at depths of 32X (two individuals) and 10X (eight individuals). This revealed approximately 4 million SNPs and 1 million INDELs across individuals. A draft pangenome was constructed using the Progressive Genome Graph Builder (PGGB), incorporating sequences from all individuals and capturing both conserved genomic regions and variation potentially associated with immune function, behaviour, and environmental adaptation. The draft pangenome comprises ~2.47 Gb, with 35.2 million nodes, 48.4 million edges, and 159,060 paths, providing a foundational resource for future comparative and population genomic studies.
Our findings reveal fine-scale population structure and structural variations—including insertions, deletions, and duplications—that are often missed in reference-based approaches. Future efforts will focus on further refining the pangenome, with Hi-C as a potential strategy to enhance chromosome-level scaffolding.
This study represents a significant advancement in the genomic resources available for C. crocuta, offering the first wild-derived reference genome and draft pangenome for the species. These resources contribute to a deeper understanding of spotted hyena genetic diversity and evolution, with implications for conservation genetics, behavioural ecology, and comparative genomics.
Authors List: Show
Presentation Overview: Show
Structural variants (SVs) contribute significantly to human genomic diversity and are implicated in both common and rare diseases. As with most genomic data in the public domain, there is limited representation of SV datasets derived from African populations, creating a critical gap in our understanding of global genomic diversity. To address this underrepresentation, this H3Africa collaboration analysed 1,091 high-coverage African whole genomes, including 546 previously unanalysed genomes for structural variants.
We employed an ensemble approach for detecting SVs in whole genome sequencing data, combining five SV detection tools and then merging datasets jointly called through SURVIVOR. This conservative methodology identified 67,795 structural variants across the genome, with SVs observed to impact on 10,421 gene regions. By SV subtype, our analysis revealed 75% deletions, 19% duplications, 4% insertions and 2% inversions, though these proportions reflect algorithmic detection biases.
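The idea behind an ensemble call set is to keep only events that several independent callers agree on. The sketch below shows one simple way to express that agreement. It is not SURVIVOR's algorithm, and the abstract does not state the thresholds used, so the breakpoint tolerance, the minimum number of supporting callers, the caller names, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    caller: str   # name of the detection tool that produced the call
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "DUP", "INS", "INV"


def same_event(a, b, max_dist):
    """Two calls match if type and chromosome agree and both breakpoints are close."""
    return (a.chrom == b.chrom and a.svtype == b.svtype
            and abs(a.start - b.start) <= max_dist
            and abs(a.end - b.end) <= max_dist)


def consensus_calls(calls, max_dist=1000, min_callers=2):
    """Greedily cluster calls around a seed call, then keep well-supported clusters."""
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end)):
        for cluster in clusters:
            if same_event(cluster[0], call, max_dist):  # compare to the cluster seed
                cluster.append(call)
                break
        else:
            clusters.append([call])
    # Support is counted in distinct callers, not in number of calls.
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]


calls = [
    SVCall("caller_A", "chr1", 10_000, 12_000, "DEL"),
    SVCall("caller_B", "chr1", 10_040, 11_980, "DEL"),
    SVCall("caller_C", "chr1", 50_000, 55_000, "DUP"),
]
for cluster in consensus_calls(calls):
    print(cluster[0].svtype, cluster[0].chrom, sorted(x.caller for x in cluster))
```

Raising the minimum caller count lowers the false-positive rate but also biases the set towards SV types that every caller detects well, which is consistent with the caveat above that the subtype proportions reflect algorithmic detection biases.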
The data contained substantial novelty, with 10% of SVs observed for the first time in this cohort of African individuals. Variants were distributed throughout the genome, with 42% occurring in introns, 4% in coding regions and 53% in intergenic regions. Size distribution analysis showed that a third of the SVs detected are over 800 bp in length. We observed a higher proportion of common variants (17% occurring at >10% frequency) than previously reported in non-African populations, potentially representing a distinctive feature of African structural variant patterns.
The potential functional impact of the detected SVs was assessed according to ACMG/AMP classification guidelines. This analysis indicated that the majority of SVs (68%) were classified as variants of uncertain significance. A small portion of SVs were classified as likely pathogenic (0.2%) and only 15 pathogenic variants were identified. Of the latter, the majority (60%) were known African variants that were previously linked to disease. The variants described as pathogenic for the first time (5/15) require further investigation.
This study highlights the technical challenges in SV research, including computational intensity and the limitations of short-read sequencing technologies. Different detection algorithms showed complementary strengths across various SV types and sizes, reinforcing the value of ensemble approaches despite their computational demands. Our work provides a valuable resource for population genetics and health-related research, addressing the critical need for high-quality baseline data on structural variant diversity in African populations. This dataset will enhance interpretation of potentially pathogenic variants and improve our understanding of genetic diseases in understudied populations, contributing to more equitable genomic medicine.
Authors List: Show
Presentation Overview: Show
The SeqWord Motif Mapper (SWMM) is a newly developed tool designed to streamline the identification and visualization of complex patterns of epigenetic modifications in bacterial genomes using data obtained through single-molecule real-time (SMRT) sequencing technologies. Bacterial epigenetics, particularly through methylation, plays a crucial role in regulating processes such as gene expression, chromosome replication, symbiont-host interactions, and defense mechanisms against phages. However, there is a lack of computational tools for the detection, comparison, and visualization of patterns of epigenetically modified bases in bacterial genomes. SWMM addresses these challenges by providing a robust statistical framework and interactive visualization capabilities. Implemented in Python 3, the software utilizes input data from standard SMRT analysis pipelines, including GFF annotation files and reference genomes in GenBank format. The tool integrates advanced genomic analyses, such as motif distribution mapping and statistical assessment of the distribution of modified bases and motifs across coding, non-coding, and promoter regions; core and horizontally acquired regions; chromosomes and plasmids; leading and lagging replichores; and regions with alternative base composition. Its visualization outputs include circular and dot-plot representations, accompanied by statistical validation in both graphical and text formats. Applications of the tool have already yielded significant insights into epigenetic regulation mechanisms within various bacterial species, including motifs linked to antibiotic resistance and stress response [1-5].
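Motif distribution mapping starts from locating every occurrence of a motif on both DNA strands and then tallying hits per genomic region. The sketch below shows that first step for an exact-match motif. It is an illustration rather than SWMM's implementation: the function names, the toy sequence, the motif, and the region boundaries are all hypothetical, and degenerate (IUPAC) motifs and statistical testing are omitted.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(genome, motif):
    """Return sorted (position, strand) hits; positions are 0-based on the forward strand."""
    genome, motif = genome.upper(), motif.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        pos = genome.find(pattern)
        while pos != -1:
            hits.append((pos, strand))
            pos = genome.find(pattern, pos + 1)  # step by one to allow overlaps
    return sorted(hits)


def region_counts(hits, regions):
    """Count hits whose start falls inside each named half-open region (start, end)."""
    return {
        name: sum(1 for pos, _ in hits if start <= pos < end)
        for name, (start, end) in regions.items()
    }


# Toy sequence with one forward-strand and one reverse-strand occurrence of CCTGG.
genome = "ACCTGGTTTTCCAGGA"
hits = find_motif(genome, "CCTGG")
print(hits)
print(region_counts(hits, {"region_1": (0, 8), "region_2": (8, 16)}))
```

A palindromic motif is reported once per strand at the same position, which matches the biology when both strands carry the modification.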
SWMM can be deployed both locally and as a web application, making it accessible to users with varying levels of bioinformatics expertise. By offering a user-friendly interface and compatibility with multiple operating systems, it enables scalable and reproducible research. The program is freely available on GitHub (https://github.com/chrilef/BactEpiGenPro) and can also be accessed as a web application at http://begp.bi.up.ac.za. This tool represents a critical advancement in bacterial epigenetics, with promising implications for understanding bacterial adaptation, pathogenicity, and gene regulation in both clinical and environmental contexts.
References:
1. Reva ON, La Cono V, Crisafi F, et al. Interplay of intracellular and trans-cellular DNA methylation in natural archaeal consortia. Environ Microbiol Rep. 2024;16(2):e13258. doi: 10.1111/1758-2229.13258.
2. Korotetskiy IS, Shilov SV, Kuznetsova T, et al. Analysis of Whole-Genome Sequences of Pathogenic Gram-Positive and Gram-Negative Isolates from the Same Hospital Environment to Investigate Common Evolutionary Trends Associated with Horizontal Gene Exchange, Mutations and DNA Methylation Patterning. Microorganisms. 2023;11(2):323. doi: 10.3390/microorganisms11020323.
3. Korotetskiy IS, Jumagaziyeva AB, Shilov SV, et al. Transcriptomics and methylomics study on the effect of iodine-containing drug FS-1 on Escherichia coli ATCC BAA-196. Future Microbiol. 2021;16:1063-1085. doi: 10.2217/fmb-2020-0184.
4. Reva ON, Korotetskiy IS, Joubert M, et al. The Effect of Iodine-Containing Nano-Micelles, FS-1, on Antibiotic Resistance, Gene Expression and Epigenetic Modifications in the Genome of Multidrug Resistant MRSA Strain Staphylococcus aureus ATCC BAA-39. Front Microbiol. 2020;11:581660. doi: 10.3389/fmicb.2020.581660.
5. Reva ON, Swanevelder DZH, Mwita LA, et al. Genetic, Epigenetic and Phenotypic Diversity of Four Bacillus velezensis Strains Used for Plant Protection or as Probiotics. Front Microbiol. 2019;10:2610. doi: 10.3389/fmicb.2019.02610.
Authors List: Show
Presentation Overview: Show
Objectives
Successful Alzheimer’s disease (AD) interventions in preclinical models often fail in human trials. While preclinical models offer insights into AD mechanisms, there is no systematic approach to verify whether preclinical target mechanisms retain therapeutic relevance in humans. Bridging this preclinical-to-clinical translational gap accelerates therapeutic development by precisely addressing whether failures are due to testing ineffective drugs, targeting the wrong mechanism, or relying on unrepresentative models.
Methods
We have developed a novel bioinformatics platform, named Integrative Pathway Activity Analysis (IPAA), that maps pathway activity from omics data. IPAA precisely captures the degree to which disease functions in models match those in human brains and prioritizes targetable pathways in the most representative models. We assessed the mechanistic similarities between the transcriptomes of three AD brain regions and multiple 2D/3D human AD cellular models to define targetable functions. We performed phosphoproteomics analysis and compared pathway activity changes with transcriptomic findings. Top pathways were pharmacologically evaluated for their impact on AD pathology in 3D models.
Results
IPAA found a high correlation of pathway dysregulation between brain regions (r=0.84, temporal cortex and parahippocampal gyrus), suggesting IPAA’s ability to detect conserved AD functions. IPAA found 83 dysregulated transcriptomic pathways shared between AD brains and a 3D model with a high Amyloid-beta (Aβ) 42/40 ratio. Shared dysregulated pathways included p38 MAPK, YAP1/TAZ, E-cadherin, CDC20, and APC/C, which were confirmed at the protein level. Elevated active p38 MAPK was observed in the 3D models, human AD brains, and 5XFAD mice, localized to presynaptic dystrophic neurites. Phosphoproteomic analysis confirmed an increase in p38 MAPK substrate phosphorylation driven by Aβ42 accumulation. Targeting p38 MAPK with a clinical p38α/β MAPK inhibitor (Losmapimod), which has not been tested for AD, significantly reduced Aβ-induced tau, Aβ accumulation, neuronal loss, and microglial activation in 3D models and human microglia. We further found that MAPK-activated protein kinase 2 (MK2) plays crucial roles in mediating Aβ-induced tau pathology.
Conclusions
IPAA enables rapid preclinical assessment of target pathways with confidence for impact on AD pathology prior to clinical trials. Our findings highlight the critical role of protein kinase networks, particularly the p38 MAPK-MK2 axis, in driving AD pathology in humans.
Authors List: Show
Presentation Overview: Show
Over the past decade, the Human Heredity and Health in Africa (H3Africa) initiative has driven the development of genomic research for human health in Africa through its bioinformatics network (H3ABioNet). Through collaborative efforts, H3ABioNet has established robust frameworks for data processing, quality control, and imputation pipelines specifically optimized for African populations. An African genotype imputation service with a comprehensive reference panel is indispensable for accurate genetic analyses tailored to the continent's diverse genetic landscape.
We developed an imputation platform (Afrigen-D Imputation Service, https://impute.afrigen-d.org) that leverages the high-quality H3Africa reference panel, comprising 8,894 high-coverage haplotypes from 48 populations worldwide, with 50% of African ancestry. The service implements established guidelines and workflows while addressing data privacy challenges by maintaining genetic data within continental boundaries. It utilizes the validated software stack and workflow architecture of the Michigan Imputation Server and TopMed Imputation Service, ensuring methodological consistency and standardization of genetic imputation procedures. This enables the combination of genotype data after imputation with multiple reference panels. Additionally, the platform integrates an HLA reference panel and incorporates polygenic score (PGS) calculation capabilities, enabling automated standardized computation of genetic risk scores from imputed genotypes.
The Afrigen-D Imputation Service facilitates efficient genotype imputation through a user-friendly interface, requiring minimal computational expertise and resources. The platform provides comprehensive preprocessing utilities for automated quality control and data preparation, adhering to established bioinformatics standards. Integration with population-specific reference panels and polygenic scoring capabilities provides a robust foundation for investigating complex diseases and genetic traits in African populations. Through ongoing development and community collaboration, this resource contributes significantly to advancing our understanding of African genetic diversity and its implications for health outcomes.
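As an illustration of what a polygenic score calculation involves, the sketch below computes a weighted sum of imputed allele dosages. It is not the platform's pipeline; the variant IDs, dosages, and weights are hypothetical, and it assumes dosages and weights refer to the same effect allele.

```python
def polygenic_score(dosages, weights):
    """Sum dosage * weight over variants present in both inputs."""
    shared = dosages.keys() & weights.keys()
    return sum(dosages[variant] * weights[variant] for variant in shared)


# Hypothetical imputed dosages (0 to 2 copies of the effect allele).
dosages = {"rs1": 2.0, "rs2": 0.5, "rs3": 1.0}
# Hypothetical per-allele effect weights from a published score file.
weights = {"rs1": 0.25, "rs2": -0.5, "rs4": 0.75}

# Only rs1 and rs2 appear in both inputs, so only they contribute.
score = polygenic_score(dosages, weights)
```

Variants missing from the imputed data are skipped here; a production workflow would normally also report how many score variants were matched.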
Authors List: Show
Presentation Overview: Show
In recent years, suicidality has become a serious public health issue. Neuroimaging studies of suicidal individuals and post-mortem studies of brain tissue from suicide victims have suggested pathological and etiological influences linked to brain volumetric abnormalities. Although there have been advances in understanding the genetic underpinnings of suicidality, the shared genetic architecture between suicidality and subcortical brain volume is poorly understood. Using Genome-Wide Association Study (GWAS) data, we aim to explore this shared genetic architecture. We obtained summary statistics for suicidal behaviour, notably Suicide Attempts (n = 50,264), Ever-Self Harmed (n = 117,733), and Thoughts of Life Not Worth Living (n = 117,291) from the UK Biobank, as well as Suicide or other intentional self-harm (n = 342,499) from the FinnGen Biobank. Additionally, summary-level data for seven subcortical brain volumes and the Intracranial Volume were sourced from the ENIGMA2 study. Linkage Disequilibrium score regression was used to estimate the genetic correlation between suicidality and subcortical brain volume. Genomic Structural Equation Modelling was used to identify common factor patterns among them, and its results guided a series of GWAS meta-analyses at the variant, gene, and sub-network levels. Our results detected a nominal genetic correlation between the Suicide cohort from FinnGen and Intracranial Volume, as well as a common genetic factor divided into two categories: Suicide Attempt, Ever-Self Harmed, and Thoughts of Life Not Worth Living from the UK Biobank on one side, and Suicide from FinnGen, Intracranial Volume, and the subcortical brain volume phenotypes on the other.
Network, pathway and Gene Ontology analysis of the joint sets of disorders uncovered enriched pathway/biological processes connected to the blood-brain barrier/permeability.
Furthermore, our findings indicate that the presence and severity of suicidality are associated with an inflammatory signature detectable in both blood and brain tissues. This suggests a biological continuity underlying suicidality, potentially pointing to a common heritability. These results support the role of brain and peripheral blood inflammation in suicide risk. These findings hold promise for developing targeted interventions and personalized treatment strategies to mitigate the risk of suicidality in vulnerable individuals.
Authors List: Show
Presentation Overview: Show
Chronic kidney disease (CKD) is a critical global health concern with high mortality rates and severe complications, particularly in Africa, yet the underlying molecular mechanisms remain poorly understood. We conducted a Genome-Wide Association Study (GWAS) using Blood Urea Nitrogen (BUN) levels, a key biomarker of kidney function, in 5,910 Ugandan participants to identify single nucleotide polymorphisms (SNPs) associated with CKD risk. Our analysis identified 13 SNPs reaching a suggestive significance threshold (p < 5×10⁻⁷), refined to five independent lead SNPs through LD clumping. Notably, rs73309776 in the GALNT6 gene suggests potential pathways linking breast cancer and kidney function, while rs145326389, an intronic variant in LOC105374218, is associated with traits related to the RAAS pathway and blood pressure regulation. rs142038911 is a synonymous variant in TRIM11, TRIM17, and LOC124904537, and may play a role in regulating serum creatinine and protein binding, which are crucial in kidney disease. Bayesian fine mapping highlighted rs1286795408 on chromosome 7 as a strong candidate, with a posterior probability of 84% within the 99% credible set, warranting further investigation. Functional annotations using MAGMA and GTEx revealed gene expression in the pituitary gland and kidney medulla, though these did not reach statistical significance. Replication in European, East Asian, and Latin American populations validated associations with genes such as HOXD11, BCAS3, and TFCP2L1, which are involved in kidney development and function, emphasizing shared genetic factors across ancestries. Rigorous quality control measures, including filtering for Hardy-Weinberg equilibrium, sex discrepancies, and minor allele frequency, ensured robust results.
This study, the first GWAS of BUN in a continental African population, underscores the importance of inclusive genetic research and contributes to understanding CKD's genetic underpinnings, paving the way for precision medicine and potential targeted treatments for underrepresented populations.
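For readers unfamiliar with the credible sets used in Bayesian fine mapping, the following minimal sketch shows how one is typically assembled from per-variant posterior probabilities. It is not the study's fine-mapping method; the variant names and probabilities are hypothetical, and the probabilities are assumed to sum to 1 for the locus.

```python
def credible_set(posteriors, coverage=0.99):
    """Return the top-ranked variants whose posteriors sum to >= coverage."""
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    selected = []
    total = 0.0
    for variant, probability in ranked:
        selected.append(variant)
        total += probability
        if total >= coverage:
            break
    return selected, total


# Hypothetical posterior probabilities for five variants at one locus.
posteriors = {
    "snp_a": 0.5,
    "snp_b": 0.3,
    "snp_c": 0.15,
    "snp_d": 0.045,
    "snp_e": 0.005,
}
variants, mass = credible_set(posteriors)
```

With these made-up values the 99% credible set contains four variants; a single variant carrying most of the posterior mass would be the natural candidate for follow-up.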
Authors List: Show
Presentation Overview: Show
Background: The menopausal transition has been associated with changes in the gut microbiome (GM). This compositional shift is likely related to changing hormone levels during menopause: the GM harbours bacterial taxa that can deconjugate estrogen and other sex hormones, allowing reabsorption of sex hormones. Estrogen also maintains gut homeostasis by influencing the intestinal barrier function and microbial composition. The altered GM composition accompanying the decreased hormone levels may be partially responsible for the onset of menopause-related health conditions, including cardiometabolic diseases (CMDs). However, it is unclear which microbiome features are associated with menopause-related health outcomes particularly in the context of African populations. This study is the first investigation of menopause-related changes in the GM and their association with CMDs in African women.
Aim and objectives: This study investigated compositional differences in the GM between pre- and postmenopausal women in sub-Saharan Africa and their association with CMDs by characterising alpha and beta microbial diversity patterns, identifying differentially abundant bacterial taxa between menopausal groups and determining how these microbial taxa may be linked to CMDs.
Methods: The cross-sectional analysis included 1,801 women from Burkina Faso, Ghana, Kenya and South Africa who were selected from the Africa-Wits INDEPTH partnership for Genomic studies (AWI-Gen) wave 2. Shotgun metagenomic sequencing was performed on DNA extracted from faecal samples using Illumina technology. The metagenomic reads underwent quality control processing and alignment, followed by taxonomic profiling. Microbial diversity and composition were compared between menopausal groups using the Inverse Simpson index and Bray-Curtis dissimilarity. Linear discriminant analysis Effect Size (LEfSe) was used to identify differentially abundant taxa between menopausal groups.
Results: Our analysis revealed that CMD status emerged as a stronger determinant of gut microbial diversity than menopausal status. Women with CMDs showed significantly lower microbial diversity regardless of menopausal status. Geographic location also significantly influenced GM composition, with substantial variations across study sites. Taxonomically, premenopausal women were enriched with beneficial short-chain fatty acid-producing bacteria, including Faecalibacterium prausnitzii, Bacteroides fragilis, and Prevotella, while postmenopausal women exhibited both beneficial (Ruminococcus champanellensis) and potentially harmful species (Collinsella bouchesdurhonensis).
Conclusions: Our findings contrast with prior research in non-African populations by demonstrating that rather than menopause directly altering the GM and subsequently increasing CMD risk, the higher prevalence of CMDs in postmenopausal women may be driving the observed microbiome changes. Geographic location emerged as another significant determinant, highlighting the importance of regional factors in shaping microbial communities. While we observed distinct taxonomic differences between pre- and postmenopausal women, these patterns varied by location and included both beneficial and potentially harmful bacteria in postmenopausal women. This study underscores the complex interplay between hormonal status, geographic factors, and metabolic health, emphasizing the need for population-specific approaches to women's health research and clinical interventions targeting the gut-hormone-metabolism axis in African women.
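The two diversity measures named in the Methods can be written down compactly. The sketch below uses the standard textbook definitions and hypothetical taxon counts; it is not the study's analysis code, and it assumes both samples list the same taxa in the same order.

```python
def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i ** 2) over observed taxa."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no counts")
    return 1.0 / sum((count / total) ** 2 for count in counts if count > 0)


def bray_curtis(counts_a, counts_b):
    """Bray-Curtis dissimilarity: sum|a - b| / (sum(a) + sum(b))."""
    if len(counts_a) != len(counts_b):
        raise ValueError("samples must list the same taxa")
    denominator = sum(counts_a) + sum(counts_b)
    if denominator == 0:
        raise ValueError("both samples are empty")
    numerator = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    return numerator / denominator


# Hypothetical counts for four taxa in two samples.
even_sample = [10, 10, 10, 10]      # perfectly even community
dominated_sample = [40, 0, 0, 0]    # single dominant taxon

alpha_even = inverse_simpson(even_sample)
alpha_dominated = inverse_simpson(dominated_sample)
beta = bray_curtis(even_sample, dominated_sample)
```

An even community of four taxa scores 4 on the Inverse Simpson index, while a single-taxon community scores 1, which is why lower values are read as lower diversity.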
Authors List: Show
Presentation Overview: Show
The conventional human reference genome, though essential for variant calling, lacks the genetic diversity needed to represent global populations, particularly African populations with high genetic variability. Use of a linear reference introduces both reference and allele bias, obscuring population-specific insights. To address this, we are constructing an African Pangenome Reference Graph, enabled by advancements in long-read sequencing and graph-based reference models.
Our project leverages ~60X PacBio HiFi sequencing data from 27 individuals across Burkina Faso, Kenya, and South Africa, capturing a significant proportion of genetic diversity of Africa. This data allows us to build a pangenome graph that more accurately represents African genomes. Unlike traditional linear references, the pangenome graph integrates diverse sequence paths, improving variant calling for both single nucleotide and structural variants. The goal is to improve exploration of African-specific genetic variation and enhance variant discovery in related populations.
To achieve this, we developed workflows for generating high-quality de novo assemblies and pangenome graphs using both reference-free (PanGenome Graph Builder, PGGB) and reference-derived (Minigraph-Cactus) algorithms. A third workflow is under development to extract and analyze African variation within specific regions and call variants using the graph as a reference. Preliminary analysis based on a 30x coverage dataset has yielded high-quality assemblies with contig N50s between 31 and 49 Mb; the most recent dataset has 60x coverage.
Supported by H3ABioNet and the eLwazi Consortium, the African pangenome graph provides a valuable resource advancing population-specific genomics. Initial applications include comparing variants called using the African graph versus linear and global references, investigating complex regions, and benchmarking graph-based variant calling.
As part of our efforts to advance pangenome research and analysis, we hosted the Human Pangenome Bring Your Own Data (BYOD) Workshop in October 2024 in collaboration with the eLwazi Open Data Science Platform. During this hands-on workshop, participants explored methods for variant calling, graph-based analysis, and personalized genome graph creation, comparing results between linear reference and pangenome-based approaches.
A second phase will generate assemblies from seven samples from the Democratic Republic of Congo, expanding the resource. This collaborative initiative is a crucial step toward a more inclusive genomic reference, enabling equitable genomic studies across African and global populations.
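The contig N50 values quoted above summarize assembly contiguity. As a minimal sketch, assuming only a list of contig lengths is available (the lengths below are hypothetical), N50 can be computed as follows.

```python
def n50(contig_lengths):
    """Return the contig length at which the running total reaches half the assembly."""
    if not contig_lengths:
        raise ValueError("no contigs supplied")
    half_total = sum(contig_lengths) / 2
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in Mb; the total is 300, so half is 150.
example_n50 = n50([80, 70, 50, 40, 30, 20, 10])
```

Here the two longest contigs (80 and 70) together reach half of the assembly, so the N50 is 70.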
Authors List: Show
Presentation Overview: Show
We have developed a novel computational algorithm, TR-2-PATH, that reconstructs a first-of-its-kind mechanism-centric regulatory network, which connects molecular pathways to their upstream transcriptional regulatory programs and prioritizes them as markers of therapeutic resistance in cancer. Such a network offers a novel way to identify biomarkers that are mechanism-centric, rather than based on individual genes or alterations, and thus a new way to identify functional interactions and valuable therapeutic targets. As a proof of concept, we applied TR-2-PATH to metastatic castration-resistant prostate cancer (mCRPC). The network mining step addressed the problem of multi-collinearity among upstream transcriptional regulators (TRs) and identified TR groups that collaborate to regulate downstream pathways. Interrogating this network with signatures of resistance to Enzalutamide, a second-generation androgen-deprivation drug commonly administered for mCRPC, identified a collaboration between the NME2 TR program and MYC molecular pathways as a biomarker of primary resistance to Enzalutamide. In vitro and in vivo experimental validation confirmed the cooperation of these mechanisms and demonstrated that their joint therapeutic targeting not only prevents resistance to Enzalutamide but also re-sensitizes Enzalutamide-resistant tumors in vivo, allowing Enzalutamide to work longer. We propose using MYC and NME2 as markers to identify patients at risk of Enzalutamide resistance and as effective therapeutic targets for patients in whom Enzalutamide has failed. Our algorithm is generalizable and could be applied to a multitude of biologically and clinically important questions, including (but not limited to) therapeutic resistance, metastatic progression, and tumor heterogeneity and plasticity, across cancer types and in other diseases. TR-2-PATH was published in Nature Communications in 2024.
We are now expanding this algorithm to include regulatory relationships with long non-coding RNAs.
Authors List: Show
Presentation Overview: Show
Introduction: Antimicrobial resistance (AMR) and the lack of new drugs pose a serious public health threat. Carriage of AMR organisms may be an important driver of inpatient and post-discharge mortality risk in low- and middle-income countries (LMICs), even when treatment guidelines are followed. Extended-spectrum beta-lactamase (ESBL) production and carbapenem-resistant Enterobacterales (CRE) are important as proxies for broad multi-class resistance spread on mobile genetic elements that promote intra- and inter-species horizontal gene transfer in hospitals and communities. We hypothesise that intestinal colonisation and carriage is a possible means of transmission of AMR and a precursor to invasive disease.
Methods: This prospective cohort study enrolled children admitted to three Kenyan hospitals, who were followed for 6 months after discharge, and well community controls. Detailed demographic, clinical, and antimicrobial use data were collected along with blood and rectal swab cultures. We carried out short- and long-read whole genome sequencing of 486 E. coli isolates to detect AMR and virulence genes and to assess genetic relatedness at the gene, mobile genetic element, and strain level through core genome phylogeny.
Results: Of the 804 inpatient participants, 291 (36%) carried ESBL-E at admission, 447/630 (71%) at discharge, 199/455 (44%) at day 45, 152/457 (33%) at day 90, and 120/452 (27%) at 180 days post-discharge. The baseline ESBL-E carriage prevalence among healthy community participants was 65/404 (16%). Acquisition of ESBL carriage in hospital was associated with prior hospitalization, prior use of antibiotics, prolonged hospital stay, and the antimicrobial classes used; it was also associated with post-discharge death or readmission after adjusting for potential confounders. CPE carriage of up to 26 (6%) and 4 (8%) during readmission was seen at the Nairobi site. E. coli isolates were diverse across pathotypes, with 12 of the 14 globally identified E. coli phylogroups present, including those associated with invasive disease: D3, B1, B2 and D1. Sequence types (STs) linked to invasive disease, such as ST 131, ST 410 and ST 38, were also identified, and concordance in STs between invasive and carriage isolates was seen. Several AMR genes spanning all classes of antibiotics, as well as virulence genes, were identified; the leading ESBL gene was blaCTX-M-15 and the leading CRE gene was blaNDM-5.
Conclusions: There was significant AMR acquisition before and during hospitalisation, and carriage took more than six months to return to community levels. Carriage and invasive STs were similar. Further genomic studies and antimicrobial trials that monitor changes in the whole microbiome, together with estimates of the invasiveness of the STs and phylogroups, should be conducted to inform infection control.
Authors List: Show
Presentation Overview: Show
Artificial intelligence (AI) has emerged as a revolutionary approach in the field of drug discovery, with the increased availability of large datasets for training AI models to predict the properties and potential biological activities of chemical compounds. The AI-driven framework essentially consists of three main components: the dataset, the combination encoding system-model, and the prediction task. The present work introduces an AI-based Ligand-Based Drug Design approach focused on optimizing the different components of such a pipeline to provide robust predictive tools of chemical compound activities against various diseases.
In this study, we investigated the impact of class imbalance on the performance of various classifiers in predicting the biological activity of chemical compounds. We trained two machine learning models, four graph-based models, and two pre-trained models on highly imbalanced bioassay datasets. To address the class imbalance, we first employed two oversampling methods namely Random Oversampling (ROS) and SMOTE and two undersampling methods namely Random Undersampling (RUS) and NearMiss. Additionally, we proposed a novel strategy called K-Ratio Undersampling. Through this approach, based on RUS, we created three specific ratios (1:50, 1:25, and 1:10) for each dataset. The impact of these ratios on model performances was evaluated using F1-scores. To ensure the robustness of our models, we conducted an external validation on unseen data. As a last step, we performed an analysis of each dataset content to better understand the factors behind the models' misclassifications.
Across all simulations, the comparison of the classical resampling techniques revealed that RUS outperformed ROS across various evaluation metrics, supporting our hypothesis that reducing majority class instances through undersampling improves model performance. By investigating the impact of the various imbalance ratios on the ML and DL models, we demonstrated that moderate imbalance ratios (1:25 to 1:10) significantly enhanced model performance, achieving higher F1-scores than in the earlier results. The top-performing models for each dataset were then optimized through hyperparameter tuning.
The external validation step confirmed that the 10-RUS configuration achieved the best balance between true positive and false positive rates, although no single model performed optimally on all datasets. As in the earlier results, the HIV dataset was particularly challenging. Analysis of the similarity between active and inactive compounds through a chemical space network showed that high similarity between the two classes reduced predictive accuracy.
Our findings highlighted the importance of optimizing both the chemical data content and the class imbalance to improve the model performances in predicting the biological activity of chemical compounds.
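The K-Ratio Undersampling idea described above can be sketched in a few lines. The abstract does not give implementation details, so this is an assumption-laden illustration: it keeps every minority-class (active) compound and randomly samples the majority class down to at most K inactives per active. The function name, label encoding (1 = active, 0 = inactive), and seed are hypothetical.

```python
import random


def k_ratio_undersample(labels, k, seed=0):
    """Return sorted indices keeping all actives and at most k inactives per active."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(labels) if label == 1]
    majority = [i for i, label in enumerate(labels) if label == 0]
    # Never request more majority samples than exist.
    n_keep = min(len(majority), k * len(minority))
    kept_majority = rng.sample(majority, n_keep)
    return sorted(minority + kept_majority)


# Hypothetical bioassay labels with a 1:100 active-to-inactive imbalance.
labels = [1] * 4 + [0] * 400
indices_1_to_10 = k_ratio_undersample(labels, k=10)
indices_1_to_50 = k_ratio_undersample(labels, k=50)
```

Resampling should be applied to the training split only, so that validation data keep the original class distribution.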
Authors List: Show
Presentation Overview: Show
Malaria remains a significant global public health issue, causing over 600,000 deaths annually. One promising research direction is disrupting crucial pathways, such as the cell signalling mechanisms enabling malaria parasites to grow and survive. By focusing on these pathways, scientists aim to develop a new generation of antimalarial drugs capable of effectively addressing drug resistance and improving treatment options for vulnerable populations worldwide. Post-translational modifications (PTMs), such as phosphorylation, can significantly change protein structures. This poses challenges for computational approaches like computer-aided drug design (CADD), where even slight structural changes can affect ligand binding and functionality. This study investigates how autophosphorylation and Ca2+ binding influence the conformational dynamics of PfCDPK1 using computational modelling, mainly through molecular dynamics simulations (MDS) and high-throughput virtual screening (HTVS). By analysing the changes in protein dynamics, the research may reveal important insights into the druggability of protein kinases, facilitating the design of more effective drugs. The results demonstrated notable variations in the dynamic behaviour of the four systems with or without the ligand (BKI-1294) based on metrics such as Cα-RMSD, Cα-RMSF, radius of gyration (RoG), and ligand properties. The findings suggest that Ca2+ binding alone results in structural changes in the conformation of the protein over time, whereas Ca2+ binding combined with autophosphorylation enhances structural stability. Phosphorylation alone leads to significant structural deviations, and statistically significant differences were observed among all systems. Phosphorylation, particularly autophosphorylation, and Ca2+ binding to CDPK1 may reshape the conformational landscape of the enzyme. Such structural changes could influence its functionality, including substrate binding and allosteric inhibition.
Ultimately, this study elucidated how these modifications affect the structure and function of PfCDPK1, providing insights into the molecular mechanisms that regulate enzyme activity and calcium homeostasis in Plasmodium falciparum.
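Two of the trajectory metrics mentioned above, Cα-RMSD and radius of gyration, reduce to short formulas. The sketch below is illustrative only and is not the study's analysis pipeline: it assumes the frames have already been superimposed on the reference, uses an unweighted (not mass-weighted) radius of gyration, and works on hypothetical coordinates in ångströms.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z); no superposition is done."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        sum((p - q) ** 2 for p, q in zip(atom_a, atom_b))
        for atom_a, atom_b in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def radius_of_gyration(coords):
    """Unweighted radius of gyration: RMS distance of atoms from their centroid."""
    if not coords:
        raise ValueError("no coordinates supplied")
    n = len(coords)
    centroid = [sum(atom[axis] for atom in coords) / n for axis in range(3)]
    squared = sum(
        sum((atom[axis] - centroid[axis]) ** 2 for axis in range(3))
        for atom in coords
    )
    return math.sqrt(squared / n)


# Hypothetical two-atom reference and a frame shifted by 1 along z.
reference = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
frame = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)]

frame_rmsd = rmsd(reference, frame)
reference_rog = radius_of_gyration(reference)
```

In practice these quantities are computed per frame over Cα atoms with a trajectory analysis package, which also handles alignment and atomic masses.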
Authors List: Show
Presentation Overview: Show
Cancer diseases pose significant challenges due to their complex molecular mechanisms and resistance to conventional therapies. Bioinformatics techniques such as molecular docking and molecular dynamics offer an unprecedented opportunity to accelerate the identification of cancer therapeutic targets and drug design. By developing explainable and integrated platforms, it is possible to meet the growing innovation needs in the field of oncology while reducing the time and costs associated with the development of new treatments. Herein, we present the development of an integrated and automated molecular docking platform designed to study ligand-protein interactions in cancers. The platform leverages bioinformatics to streamline the drug discovery process, utilizing Python scripts to automate the preparation of protein and ligand structures, performing docking using the Vina library, and providing visualizations of the results. The workflow includes the creation of OncoligandDB, a structured database that centralizes information on more than 100 anticancer ligands, organized in tables classified by cancer type. OncoligandDB includes details such as the commercial name of the product, year of production, SMILES structure, and direct downloadable links to PDB formats, facilitating the docking process. The platform also features DockSmart, an intuitive web interface that integrates a direct link to OncoligandDB, enabling users to easily access and utilize the PDB formats for docking simulations. DockSmart generates affinity scores and RMSD values for analysis, offering a comprehensive tool for researchers. The results provided by DockSmart were validated by comparing docking outcomes with the established tool SeamDock, demonstrating comparable or superior performance in terms of docking scores and computational efficiency.
DockSmart's modular design, user-friendliness, and automation significantly reduce manual intervention, improve reproducibility, and accelerate the discovery of potential therapeutic candidates. Future work includes the integration of molecular dynamics simulations and advanced AI tools, such as Graph Neural Networks (GNNs), a class of deep learning models that could be used to calculate and identify interaction points between protein and ligand graphs. In its current version, DockSmart highlights the potential of bioinformatics to advance cancer research and drug development, offering a powerful tool for researchers in oncology and bioinformatics.
Authors List: Show
Presentation Overview: Show
Microorganisms are detected in multiple cancer types, including in putatively sterile organs, but the contexts in which they influence oncogenesis or anti-tumor responses in humans remain unclear. Despite increasing research into the human microbiome, a number of basic questions remain unanswered, including questions about its size, distribution, and presence in various human tissues, as exemplified by recent controversies around the fetal microbiome and the cancer microbiome. We developed single-cell analysis of host-microbiome interactions (SAHMI), a computational pipeline to recover and denoise microbial signals from single-cell sequencing of host tissues. More recently, we developed a companion framework, PRISM, a computational approach for precise microorganism identification and decontamination from low-biomass sequencing data. Using these resources, we identified rich microbiomes in gastrointestinal tract tumors, as well as bacteria in a subset of pancreatic tumors that are associated with altered glycoproteomes, more extensive smoking histories, and higher tumor recurrence risk. We found relatively sparse microbes in other cancer types that grow in more sterile environments, which we demonstrate may reflect differing sequencing parameters. Overall, these resources provide practical guidelines that do not replace gold-standard controls but enable higher-confidence analyses and reveal tumor-associated microorganisms with potential molecular and clinical significance.
References:
1. Ghaddar B et al. Tumor microbiome links cellular programs and immunity in pancreatic cancer. Cancer Cell. 2022 Oct 10;40(10):1240-1253.e5. PMID: 36220074.
2. Ghaddar B, Blaser MJ, De S. Denoising sparse microbial signals from single-cell sequencing of mammalian host tissues. Nat Comput Sci. 2023 Sep;3(9):741-747. PMID: 37946872.
3. Ghaddar B, Blaser MJ, De S. Revisiting the cancer microbiome using PRISM. (submitted)
Authors List: Show
Presentation Overview: Show
The gut microbiome is important for the health of all animals. Very little is known about the microbiome of large carnivores, and lion in particular. This study presents the first comprehensive microbiome classification of African lions (Panthera leo melanochaita). Our study used shotgun metagenomics data (Illumina short-reads) of DNA extracted from faecal samples from 20 lions from Etosha National Park, Namibia. Three of the lions were sampled twice, in different seasons. In addition, 10 of the samples were sent for long-read sequencing (Oxford Nanopore). Our findings illuminate potential connections between gut microbiome composition and social structure, diet, and pack-hunting in carnivores, with potential implications for wildlife conservation and veterinary medicine.
We discovered distinct microbial profiles in the African lion gut, dominated by the genera Bacteroides and Phocaeicola. In particular, we note similar abundances of Bacteroides in other pack-hunting carnivores such as black-backed jackals, wolves and dholes. Solitary hunters like cheetahs on the other hand, have a relatively low abundance of Bacteroides. The high abundance of this genera is possibly caused by the high interaction (and therefore transmission of bacteria) between pack-hunters compared to solitary carnivores. Alternatively, Bacteroides abundance could be attributed to differences in diet: solitary hunters consume the most nutritious portions of prey immediately, while pack hunters usually distribute resources based on social hierarchies.
Moreover, we linked pregnancy and inflammation to the gut microbiome of female lions via the genus Fusobacterium. This genus is seen in high abundance in postnatal pigs and is also linked to gut inflammation in humans and pigs, suggesting that lionesses may experience similar gut inflammation during and after pregnancy.
For comparison, we analyzed three additional samples from Asiatic lions (Panthera leo leo), collected during previous research in India, and found that the two subspecies had similar abundances of bacterial phyla but differed in bacterial genera and species. We attribute the similarities in bacterial phyla to common evolutionary ancestry and the differences in bacterial genera to allopatric separation causing minor changes in bacterial composition over time.
Finally, a large proportion of DNA in the lion gut was unclassified, representing new species of microorganisms not present in current databases. We were able to create 272 metagenome-assembled genomes (MAGs), the majority of which represent new species that will contribute to current knowledge. The identification of novel microbial species highlights the importance of expanding microbial databases and the need for further research into host-microbe interactions in wildlife conservation contexts. Our plan for future research is to leverage long-read data to supplement databases and improve microbial classification.
Authors List: Show
Presentation Overview: Show
Background: Despite its critical role in human health, the gut microbiome remains understudied in underrepresented populations, particularly in low- and middle-income countries. Large-scale gut microbiome research has historically focused on high-income, industrialized populations, limiting our understanding of microbial diversity, adaptation, and health implications across different environmental and lifestyle contexts. The AWI-Gen 2 Microbiome Project addresses this gap by investigating how geography, industrialization, lifestyle, and health status shape gut microbiome diversity in six study sites across Burkina Faso, Ghana, Kenya, and South Africa.
Methods: A total of 1,801 women aged 41–84 years were enrolled from rural, semi-rural, and urban communities spanning distinct environmental and socioeconomic settings. Shotgun metagenomic sequencing was performed to generate high-resolution taxonomic and functional profiles of gut microbial communities. Metagenome-assembled genomes (MAGs) were reconstructed to expand microbial reference catalogues and uncover novel species. Statistical analyses assessed associations between microbiome composition, dietary patterns, antibiotic use, and disease states, including HIV infection.
Results: Geography was the primary driver of microbiome composition, with distinct microbial transitions observed along an industrialization gradient. Rural populations exhibited higher microbial diversity, with a notable enrichment of Treponema species, while urban populations showed reduced Treponema and Cryptobacteroides abundance alongside a relative increase in Bifidobacterium species. Nairobi’s informal settlements exhibited a unique hybrid microbiome signature, reflecting a mix of rural and urban microbial traits, challenging conventional rural–urban microbiome models. The study significantly expanded global microbial reference datasets, identifying 1,005 novel bacterial species and 40,135 previously uncharacterized viral genomes. The absence of Treponema succinifaciens in urban populations correlated with higher antibiotic exposure and lower dietary fiber intake, suggesting that antimicrobial-driven microbiome shifts may be occurring in transitioning populations. Additionally, a distinct HIV-associated microbiome signature was characterized, featuring taxa not previously linked to HIV in high-income cohorts, including Dysosmobacter welbionis and Enterocloster species. These findings underscore the need for population-specific microbiome research to better understand host-microbiome interactions in infectious diseases.
Conclusion: This study provides critical insights into the diversity and adaptation of the gut microbiome in African populations, challenging existing models of industrialization-driven microbial shifts. By leveraging shotgun metagenomics, this work contributes to a more representative and equitable global microbiome atlas, expanding the known diversity of bacterial and viral species. These findings highlight the need for inclusive microbiome research that reflects diverse global populations and informs precision medicine approaches. Beyond advancing microbiome science, this study prioritizes community engagement, participant education, and the dissemination of findings. Future work will integrate participant feedback and explore the implications of microbiome shifts for public health. Ongoing research will investigate longitudinal microbiome dynamics and microbiome-host interactions, while planned follow-up analyses will assess microbiome stability over time in previously sampled participants.
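As a rough illustration of the diversity comparison reported in the Results, the sketch below computes the Shannon index, one widely used alpha-diversity measure, from per-taxon read counts. The abstract does not state which diversity metric the study used, so the choice of metric, the function name, and the toy counts are all assumptions made for illustration only.

```python
import math
from typing import Sequence


def shannon_index(counts: Sequence[float]) -> float:
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with non-zero counts.

    `counts` holds per-taxon abundances for one sample (hypothetical input;
    not data from the study).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no counts")
    h = 0.0
    for c in counts:
        if c > 0:  # taxa with zero counts contribute nothing
            p = c / total
            h -= p * math.log(p)
    return h


# Toy samples: an even community versus one dominated by a single taxon.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]
print(shannon_index(even_sample) > shannon_index(skewed_sample))  # True
```

An even community of four taxa reaches the maximum value ln(4), while a community dominated by one taxon scores close to zero, which is the sense in which one sample is described as more diverse than another.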
Authors List: Show
Presentation Overview: Show
RNA structure prediction is a challenging problem in computational biology, as the three-dimensional structure of RNA molecules is intrinsically related to their function. Predicting RNA structure is important for understanding gene expression regulation, disease mechanisms, and for the development of RNA-based therapeutics. This presentation will focus on the algorithms that have been developed to predict RNA secondary and tertiary structures from sequence data. We will discuss classical approaches, such as dynamic programming and energy minimization, as well as more recent approaches that rely on machine learning and deep learning models. We will also discuss the integration of multiple data sources, such as experimental structures, to enhance prediction accuracy. We will explore the strengths, limitations, and computational challenges of each category of methods.
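To make the classical dynamic programming approach concrete, here is a minimal sketch of Nussinov-style base-pair maximization, a textbook simplification that counts pairs instead of minimizing free energy. It is not taken from the presentation: the pairing rules, the `min_loop` parameter, and the function names are illustrative assumptions, and real predictors rely on thermodynamic energy models.

```python
# Watson-Crick pairs plus the G-U wobble pair (an assumed, simplified rule set).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq: str, min_loop: int = 3) -> tuple[int, str]:
    """Return (max number of nested base pairs, dot-bracket structure).

    A pair (k, j) is allowed only if at least `min_loop` unpaired bases
    lie between k and j.
    """
    n = len(seq)
    if n == 0:
        return 0, ""
    # dp[i][j] = maximum number of pairs within seq[i..j]; 0 for short spans.
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback: recover one optimal structure from the filled table.
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue  # span too short to contain a pair
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return dp[0][n - 1], "".join(structure)


print(nussinov("GGGAAAUCC"))  # (3, '(((...)))')
```

The table is filled in O(n^3) time and O(n^2) memory, which is one reason the computational cost of such methods grows quickly with sequence length. Energy-minimization methods follow the same recursive decomposition but score loops and stacked pairs with measured thermodynamic parameters.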