Schedule subject to change
All times listed are in EDT
Monday, July 15th
10:40-11:00
Welcome & Overview
Room: 520b
Format: In person


Authors List: Show

  • David Kreil
11:00-12:20
Invited Presentation: CAMDA Keynote: Exploring drivers of gut microbiome compositional differences in disease and mechanistic pathways to recovery using big data
Confirmed Presenter: Catherine Lozupone

Room: 520b
Format: In Person

Moderator(s): Paweł Łabaj


Authors List: Show

  • Catherine Lozupone

Presentation Overview: Show

The commensal gut microbiome plays an essential role in protecting against opportunistic pathogens and maintaining immune homeostasis. Dysbiosis, an imbalance in microbial communities, is linked with disease when this imbalance disturbs microbiota functions essential for maintaining health or introduces processes that promote disease. By performing meta-analyses of many studies that have sequenced the 16S ribosomal RNA gene to characterize gut microbial communities in different disease and health contexts, we have defined very young age and Western versus Developing world/Agrarian cultures to be two major axes of gut microbiome compositional variation that are important for explaining variability across healthy humans. Interestingly, among Western adults, individuals with different diseases or microbiome disturbances show migration along both of these major axes of health-associated gut microbiome variation. For instance, obese Western individuals sometimes have microbiomes that cluster closer to Prevotella-rich/Bacteroides-poor microbiome types in the developing world, and this is more common in African versus European Americans. Related to age, gut microbiomes of adults with recurrent Clostridioides difficile infection, Inflammatory Bowel Disease, cancer, and intake of broad-spectrum antibiotics all tend to cluster closer to healthy infant gut microbiomes, characterized by low diversity with increased representation of facultative versus strict anaerobes. The relationship between highly disturbed and infant gut microbiome compositions is likely related to parallel processes that occur in primary versus secondary ecological succession, where the absence of a complex community of healthy gut commensals allows for colonization by opportunistic, early-succession-adapted organisms that undergo an ordered turnover of membership.
By coupling co-occurrence patterns and longitudinal analyses of dense time-series data with genomic and metabolic network interrogations to explore underlying drivers of microbial cooperation and competition, we have been generating hypotheses regarding important interactions that occur during succession and testing them in humanized mice.

14:20-14:50
Invited Presentation: The Gut Microbiome based Health Index Challenge - Introduction
Confirmed Presenter: Kinga Zielińska

Room: 520b
Format: In Person

Moderator(s): Paweł Łabaj


Authors List: Show

  • Kinga Zielińska

Presentation Overview: Show

Microbiome-based disease prediction has significant potential as an early, non-invasive marker of multiple health conditions attributable to dysbiosis of the human gut microbiota, thanks in part to decreasing sequencing and analysis costs. Existing tools, or microbiome health indexes, are often based solely on microbiome richness and are heavily dependent on taxonomic classification. More recently, an ecological approach has led to an increased understanding of the microbiome, which reveals substantial limitations of such approaches. In this study, we introduce a new health index created as an answer to updated microbiome definitions. The novelty of our approach is a shift from the traditional approach of phylogenetic classification towards a more holistic consideration of metabolic function, including ecological interactions between species, in the effort to distinguish between healthy and diseased states. We compare this not only to the taxonomy-based Gut Microbiome Health Index (GMHI) and the high dimensional principal component analysis (hiPCA) method, the most comprehensive indices to date, but also to taxon- and function-based Shannon entropy, and demonstrate a significant improvement over these approaches. We validate our index’s performance using a variety of complementary benchmarking approaches on datasets representing a range of gut health conditions and showcase the robustness of its superiority over the GMHI and the hiPCA. Overall, we emphasize the potential of this approach and advocate a shift towards functional approaches in order to better understand and assess microbiome health as well as to provide directions for future index enhancements. Our method, q2-predict-dysbiosis, is freely available as a QIIME 2 plugin (https://github.com/bioinf-mcb/q2-predict-dysbiosis).
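As an illustration of the Shannon entropy baseline mentioned in this overview, the following is a minimal sketch of the standard formula applied to a vector of taxon (or function) counts. It is not taken from q2-predict-dysbiosis; the function name and the example samples are assumptions made for illustration only.

```python
import math


def shannon_entropy(abundances):
    """Shannon entropy (natural log) of a vector of non-negative counts."""
    total = sum(abundances)
    if total <= 0:
        raise ValueError("abundances must sum to a positive value")
    entropy = 0.0
    for count in abundances:
        if count > 0:  # zero counts contribute nothing to the sum
            p = count / total
            entropy -= p * math.log(p)
    return entropy


# Hypothetical samples: one perfectly even, one dominated by a single taxon.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]
print(shannon_entropy(even_sample), shannon_entropy(skewed_sample))
```

The even sample reaches the maximum entropy for four taxa, ln(4), while the skewed sample scores much lower, which is the kind of diversity signal such a baseline relies on.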

14:50-15:20
Integrating Taxonomic and Functional Features for Gut Microbiome Health Indexing
Confirmed Presenter: Nelly Selem Mojica, Centro de Ciencias Matemáticas UNAM, Mexico

Room: 520b
Format: In person

Moderator(s): Paweł Łabaj


Authors List: Show

  • Shaday Guerrero Flores, Centro de Ciencias Matemáticas UNAM, Mexico
  • Nelly Selem Mojica, Centro de Ciencias Matemáticas UNAM, Mexico
  • Adriana Haydee Contreras Peruyero, Centro de Ciencias Matemáticas UNAM, Mexico
  • Juan Francisco Espinosa Maya, Centro de Ciencias Matemáticas UNAM, Mexico
  • Rafael Perez Estrada, Centro de Ciencias Matemáticas UNAM, Mexico
  • Mario Jardon, Centro de Ciencias Matemáticas UNAM, Mexico
  • Orlando Camargo Escalante, Centro de Investigación y de Estudios Avanzados (CINVESTAV), Mexico
  • Miguel Nakamura, Centro de investigación en Matemáticas CIMAT, Mexico
  • Kotaro Hata, Centro de investigación en Matemáticas CIMAT, Mexico
  • David Alberto Garcia, Centro de Investigación y de Estudios Avanzados (CINVESTAV), Mexico
  • Luis Yovanny Bedolla Galvan, Escuela Nacional de Estudios Superiores UNAM, Mexico
  • Goretty Mendoza, Instituto de Investigaciones en Ecosistemas y Sustentabilidad UNAM (IIES), Mexico
  • Mirna Vazquez Rosas-Landa, Instituto de Ciencias del Mar y Limnología UNAM, Mexico
  • José Abel Lovaco, Centro de Investigación y de Estudios Avanzados (CINVESTAV), Mexico
  • Mario Enrique Carranza Barragán, Centro de investigación en Matemáticas CIMAT, Mexico
  • Jose Daniel Chavez Gonzalez, Centro de Investigación y de Estudios Avanzados (CINVESTAV), Mexico
  • Axel Alejandro Ramos García, Tecnologico de Monterrey, Mexico
  • July Stephany Gámez Valdez, Tecnologico de Monterrey, México

Presentation Overview: Show

This study aimed to enhance our understanding of metagenomic datasets by applying and innovating bioinformatics tools for the identification and functional characterization of microbial genes and pathways. We utilized tools such as Prokka, Prodigal, eggNOG, mi-faser, MetaCyc, and DiTing to annotate gene functions and metabolic pathways, generating a detailed functional landscape of the microbial communities. We identified key functional roles in various health conditions through Pearson-Spearman correlation networks but found a notable absence of keystone functions in several categories. We also explored microbial health indicators by replicating indices like GMHI and hiPCA and attempting novel integrations with metabolic pathway data. The adapted GMHI and hiPCA indices could distinguish between health states in microbial communities. Moving forward, we aim to refine these indices using expanded datasets, focusing on both taxonomic and functional data. In conclusion, our study enhances the predictive capabilities of metagenomic analyses for assessing microbial community health, paving the way for future developments in microbial ecology and biomedicine.

15:20-15:40
Using Gradient Boosting to Predict Health States from Composition and Function of the Gut Microbiome
Confirmed Presenter: Patrick Smyth, National Microbiology Laboratory, Canada

Room: 520b
Format: In Person

Moderator(s): Paweł Łabaj


Authors List: Show

  • Patrick Smyth, National Microbiology Laboratory, Canada
  • Liam Elson, National Microbiology Laboratory, Canada
  • Julie Chih-Yu Chen, National Microbiology Laboratory, Canada

Presentation Overview: Show

This study utilizes stool samples from the Human Microbiome Project 2 and American Gut Project cohorts, along with COVID-19 patient data, to develop a superior health index using machine learning techniques.

We employed LightGBM's Dropouts meet Multiple Additive Regression Trees (DART) algorithm, which excels in handling high-dimensional data, for predicting health states based on combined taxonomic and functional profiles. Data preprocessing involved filtering features with a minimum prevalence threshold, as well as aggregating taxonomic pathways.

Two cross-validation strategies, nested stratified 5-fold and Leave-One-Project-Out (LOPO), were implemented to ensure robust model evaluation. Performance metrics such as AUC, F1 Score, and Balanced Accuracy were used to assess model effectiveness. Feature importance analysis identified key taxa and pathways relevant to gut health.

The Gradient Boost Health Index from gut Microbiome data (GBHIM) was introduced, showing improved performance over existing indices like the Gut Microbiome Health Index (GMHI). The inclusion of GMHI as a feature occasionally enhanced model performance. The model demonstrated strong performance across various validation folds and projects, highlighting its potential for accurate health state predictions.

For COVID-19 samples, the model effectively distinguished between healthy and non-healthy states, and the COVID-19 samples clustered more closely with non-healthy samples in Principal Coordinates Analysis. This study underscores the importance of leveraging comprehensive microbial data and advanced machine learning techniques for improved health state predictions in microbiome research.
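The Leave-One-Project-Out (LOPO) strategy named in this overview can be sketched as follows: every sample carries a project label, and each project in turn is held out as the test set while the remaining projects form the training set. This is a generic illustration in plain Python, not the presenters' code; the function name and the project labels are hypothetical.

```python
from collections import defaultdict


def leave_one_project_out(project_labels):
    """Yield (held_out_project, train_indices, test_indices) per project."""
    by_project = defaultdict(list)
    for index, project in enumerate(project_labels):
        by_project[project].append(index)
    for held_out in sorted(by_project):
        test_idx = by_project[held_out]
        # Training set: every sample that belongs to a different project.
        train_idx = sorted(
            i
            for project, indices in by_project.items()
            if project != held_out
            for i in indices
        )
        yield held_out, train_idx, test_idx


# Hypothetical project label for each of five samples.
projects = ["HMP2", "AGP", "HMP2", "AGP", "OTHER"]
splits = list(leave_one_project_out(projects))
for held_out, train_idx, test_idx in splits:
    print(held_out, train_idx, test_idx)
```

Because no project contributes samples to both sides of a split, this kind of evaluation probes generalization to unseen cohorts rather than to unseen samples from familiar cohorts.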

15:40-16:00
Microbiome time series data reveal predictable patterns of change
Confirmed Presenter: Karwowska Zuzanna, Malopolska Centre of Biotechnology, Jagiellonian University, Poland

Room: 520b
Format: In Person

Moderator(s): Paweł Łabaj


Authors List: Show

  • Karwowska Zuzanna, Malopolska Centre of Biotechnology, Jagiellonian University, Poland
  • Paweł Szczerbiak, Sano Centre for Computational Medicine, Poland
  • Tomasz Kościółek, Sano Centre for Computational Medicine, Poland

Presentation Overview: Show

Despite the majority of microbiome studies being cross-sectional, it is widely acknowledged that the microbiome is a dynamic ecosystem.
Here, we analyse how the gut microbiome changes over time as a community, how different bacterial species behave over time, and whether there are clusters of bacteria that exhibit similar fluctuations.
We show that a healthy human gut microbiome is stationary, seasonal, and non-random. Moreover, we demonstrate that it is self-explanatory to some extent, and its behavior can be predicted.
The analysis of individual bacterial species uncovered the existence of three distinct longitudinal regimes in the healthy human gut microbiome. These regimes consist of 1) stationary and highly prevalent bacteria that exhibit resistance to environmental changes; 2) volatile bacteria that exhibit dynamic reactions to external stimuli, causing their presence to fluctuate over time; and 3) white noise. Clustering analysis revealed the presence of taxonomically diverse bacterial groups that exhibit similar fluctuations over time.
In conclusion, our study highlights the importance of longitudinal data and provides new insights into the dynamics of the healthy human gut microbiome. We offer clear guidelines for clinicians and statisticians who conduct longitudinal studies and develop models to predict the behavior of the gut microbiome over time.

16:40-17:10
Invited Presentation: Prediction in microbiome science
Confirmed Presenter: Jesse Shapiro

Room: 520b
Format: In Person


Authors List: Show

  • Jesse Shapiro

Presentation Overview: Show

As variation in microbial community structure is implicated in an increasing number of human diseases and environmental changes, there is strong potential for microbiome-based diagnostics and therapeutics. I will discuss three brief case studies, highlighting how (1) diagnosis is easier than forecasting of future disease or environmental perturbations, (2) predicting simpler disease outcomes (e.g. infection or not) is easier than more complex outcomes (e.g. disease severity) that depend on a larger number of host- and microbe-determined factors, and (3) certain disease outcomes are more predictable based on genetic diversity within a key pathogen species than based on microbiome community composition. If these principles prove to be general, we can move toward more realistic expectations for microbiome-driven predictions and use the best combination of data and methods for the task at hand.

17:10-17:30
The Elephant in the Room: Software and Hardware Security Vulnerabilities of Portable Sequencing Devices
Confirmed Presenter: Carson Stillman, University of Florida, United States

Room: 520b
Format: In person


Authors List: Show

  • Carson Stillman, University of Florida, United States
  • Jonathan E. Bravo, University of Florida, United States
  • Christina Boucher, University of Florida, United States
  • Sara Rampazzi, University of Florida, United States

Presentation Overview: Show

Portable genome sequencing technology is revolutionizing genomic research by providing a faster, more flexible method of sequencing DNA and RNA. The unprecedented shift from bulky stand-alone benchtop equipment confined to a laboratory setting to small portable devices, which can be carried anywhere outside the laboratory network and connected to untrusted computers to perform sequencing, raises new security and privacy threats not considered before. Current research primarily addresses the privacy of DNA/RNA data in online databases and the security of stand-alone sequencing devices such as those from Illumina. However, it overlooks the security risks arising from compromises of computers directly connected to sequencers. While sensitive data, such as the human genome, has become easier to sequence, the networks connecting to these smaller devices and the hardware running basecalling can no longer implicitly be trusted.
Here, we present new security and privacy threats of portable sequencing technology and recommendations to aid in ensuring sequencing data is kept private and secure. First, to prevent unauthorized access to sequencing devices, IP addresses should not be considered a sufficient authentication mechanism. Second, integrity checks are necessary for all data passed from the sequencer to external computers to avoid data manipulation. Finally, encryption should be considered as data is passed from the sequencer to such external computers to prevent eavesdropping on data as it is sent and stored. As devices and technology rapidly change, it becomes paramount to reevaluate security requirements alongside them or risk leaving some of our most sensitive data exposed.
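One common way to implement the kind of integrity check recommended above is a keyed message authentication code (HMAC) attached to each block of data. The sketch below uses Python's standard `hmac` and `hashlib` modules; it is a generic illustration and not the presenters' proposal. The function names, the shared key, and the example payload are hypothetical, and key distribution is out of scope here.

```python
import hashlib
import hmac


def sign_chunk(key: bytes, chunk: bytes) -> bytes:
    """Return an HMAC-SHA256 tag for one chunk of sequencer output."""
    return hmac.new(key, chunk, hashlib.sha256).digest()


def verify_chunk(key: bytes, chunk: bytes, tag: bytes) -> bool:
    """Check a received chunk against its tag using a constant-time compare."""
    return hmac.compare_digest(sign_chunk(key, chunk), tag)


# Hypothetical shared secret and payload, for illustration only.
shared_key = b"example-key-not-for-production"
payload = b"read_001\tACGTACGTTAGC"
tag = sign_chunk(shared_key, payload)

print(verify_chunk(shared_key, payload, tag))            # untouched data
print(verify_chunk(shared_key, payload + b"A", tag))     # modified data
```

An HMAC detects modification but does not hide the data, so it complements rather than replaces the encryption recommendation in the same paragraph.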

Improving genomic epidemiology of Giardia intestinalis with a core genome gene-by-gene subtyping schema
Confirmed Presenter: Miguel Prieto, Simon Fraser University, Canada

Room: 520b
Format: In person


Authors List: Show

  • Miguel Prieto, Simon Fraser University, Canada
  • William Hsiao, Simon Fraser University, Canada
  • Clement Tsui, National Centre for Infectious Diseases, Singapore

Presentation Overview: Show

Giardia intestinalis parasites are common causes of sporadic gastroenteritis outbreaks in high-income countries. In contrast, giardiasis is endemic in low-income settings with poor sanitation, where it may cause failure to thrive and chronic malnutrition. Whole genome analyses of this microbe are rare because culturing this parasite is a laborious process with a limited success rate. Hence, subtyping for epidemiological tracking and surveillance relies on a three-loci marker scheme based on PCR amplification of stool samples. Here, we developed a Nextflow pipeline that creates a core genome multilocus sequence typing (cgMLST) scheme for Giardia intestinalis using chewBBACA. This workflow takes as input all available whole genome sequencing samples of Giardia intestinalis assemblages A and B (the most commonly associated with human infections) in public biorepositories (n=128, after excluding samples producing poor-quality assemblies). The accuracy and reproducibility of the schema calling process were verified using k-fold cross-validation (70/30 splits) with a 95% prevalence cut-off for every locus in the training samples. Finally, selected sequences with inaccurate alignments against reference genomes by BLAST were removed from the final schema definition. Our gene-by-gene schema is scalable to specific epidemiological settings (adding locally circulating Giardia spp. genomes), and the pipeline is extendable to other pathogens of public health interest. This schema, applied to culture-free stool metagenomics, can be used by interested public health agencies to investigate outbreaks and conduct genomic surveillance of human giardiasis.
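The 95% prevalence cut-off described above can be pictured as a simple filter: a locus is kept as "core" only if it is called in at least 95% of the samples. The sketch below shows that filter alone, in plain Python. It does not call chewBBACA or reproduce the pipeline's cross-validation, and the function name, data layout, and toy loci are assumptions made for illustration.

```python
def core_loci(presence, min_prevalence=0.95):
    """Return loci called in at least `min_prevalence` of the samples.

    `presence` maps a sample name to the set of loci called in that sample.
    """
    n_samples = len(presence)
    if n_samples == 0:
        return []
    counts = {}
    for loci in presence.values():
        for locus in loci:
            counts[locus] = counts.get(locus, 0) + 1
    return sorted(
        locus for locus, count in counts.items()
        if count / n_samples >= min_prevalence
    )


# Toy data: 20 hypothetical samples; locus A is in all of them,
# locus B is missing from one sample, locus C is missing from two.
samples = {f"s{i}": {"A", "B", "C"} for i in range(20)}
samples["s0"] = {"A", "C"}
samples["s1"] = {"A", "B"}
samples["s2"] = {"A", "B"}

print(core_loci(samples))
```

With 20 samples, locus B (19 of 20) sits exactly at the threshold and is retained, while locus C (18 of 20) is dropped.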

17:30-17:50
Analysis of Inverted Repeats in Viral Genomes at a Large Scale
Confirmed Presenter: Madhavi Ganapathiraju, Carnegie Mellon University, Qatar

Room: 520b
Format: In person


Authors List: Show

  • Jingxiang Gao, Undergraduate Student in Computer Science, Carnegie Mellon University in Qatar, Qatar
  • George Rivera, Global Society for Philippine Nurse Researchers, Philippines
  • Madhu Sen, VIT Vellore, India
  • Matthew Shtrahman, UCSD School of Medicine, United States
  • Madhavi Ganapathiraju, Carnegie Mellon University, Qatar

Presentation Overview: Show

An inverted repeat (IR) in DNA is a sequence of nucleotides that is followed by its complementary bases but in reverse order (e.g., CACGGAttgTCCGTG). IRs cause fragile sites endangering genetic stability. In viruses, IRs enable host cell entry, genomic evolution in zoonotic viruses, and more. Despite their importance, IRs have not been studied comprehensively in viral genomes at a large scale. We built a tool into the Biological Language Modeling Toolkit that computes augmented suffix arrays to efficiently identify IRs, and used it to study 13,023 viral genomes and catalogue their IRs. We found over 19 million IRs longer than 20 bases (1,300 IRs per virus), including 134 that are longer than 2 kilobases. Among the viruses with large IRs, we identified over 50 large IRs in herpes viruses, and over 10 IRs in pox viruses. There is a prevalence of large ‘terminal’ inverted repeats in bacteriophages. We discovered large IRs in common disease-causing viruses, such as the African swine fever virus (lethal to domestic pigs), Paramecium bursaria chlorella virus (important for termination of algae blooms, and found to be able to infect humans and decrease motor skills and reaction speed), Yaba-like disease virus (important in cancer gene therapy), and human herpes virus. We found 54 viruses with high IR density, including disease-causing viruses like pox and herpes, and lymphocystis disease virus. These results on the prevalence and distribution of inverted repeats in viral genomes suggest potential for discovering the mechanism of action of some of the understudied viruses.
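To make the definition of an inverted repeat concrete, the following minimal sketch checks for a left arm followed, after an optional spacer, by its reverse complement. It is a naive scan written for illustration and is not the augmented suffix-array method described in the overview; the function names and parameters are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeats(seq, arm_len, max_spacer):
    """Naively list (start, left_arm, spacer_len) for inverted repeats.

    A hit means the arm starting at `start` is followed, after a spacer of
    0..max_spacer bases, by the reverse complement of that arm.
    """
    seq = seq.upper()
    hits = []
    for start in range(len(seq) - 2 * arm_len + 1):
        left = seq[start:start + arm_len]
        target = reverse_complement(left)
        for spacer in range(max_spacer + 1):
            right_start = start + arm_len + spacer
            right = seq[right_start:right_start + arm_len]
            if len(right) < arm_len:
                break  # ran off the end of the sequence
            if right == target:
                hits.append((start, left, spacer))
    return hits


# The example from the overview: arms CACGGA / TCCGTG around the spacer ttg.
example_hits = find_inverted_repeats("CACGGAttgTCCGTG", arm_len=6, max_spacer=3)
print(example_hits)
```

This scan is quadratic in the worst case and only practical for short sequences, which is why an index structure such as a suffix array matters at the scale of thousands of genomes.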

Integration of Spatial Transcriptomics into Multimodal Imaging of Skin Aging
Confirmed Presenter: Christina Bauer, Medical University of Vienna, Vienna, Austria; IMC, University of Applied Science, Krems, Austria

Room: 520b
Format: In person


Authors List: Show

  • Christina Bauer, Medical University of Vienna, Vienna, Austria; IMC, University of Applied Science, Krems, Austria
  • Christopher Kremslehner, Medical University of Vienna, Vienna, Austria; Christian Doppler Laboratory–SKINMAGINE, Vienna, Austria
  • Florian Gruber, Medical University of Vienna, Vienna, Austria; Christian Doppler Laboratory–SKINMAGINE, Vienna, Austria

Presentation Overview: Show

Advancements in spatial transcriptomics have deepened our understanding of cellular organization and function within skin and other tissues. However, existing techniques often encounter limitations in resolution and coverage, hindering comprehensive analysis. To address this gap, we propose a novel approach to enhance the resolution of spatial transcriptomic data and integrate it into multimodal imaging workflows.
Our project aims to leverage advanced image processing software to generate an approximated cell-level transcriptome from spatial transcriptomic data from juvenile and aged skin. By correlating gene expression profiles with immunofluorescence staining and age-related metabolic activity assays, we seek to gain novel insights into the intricate interplay between gene expression and cellular phenotypes. This would facilitate a more nuanced analysis and allow complex phenotypes of cellular aging to be correlated locally.
Furthermore, we establish a robust analysis pipeline tailored for evaluating skin, streamlining future workloads for similar studies. This pipeline aims to address the complexity associated with spatial transcriptomic data analysis, ensuring accessibility to individuals within the lab, including those without programming expertise.
Through the integration of spatial transcriptomics data into existing analytic imaging workflows, our project seeks to overcome existing limitations and pave the way for comprehensive analyses of cellular dynamics within tissue microenvironments. Our evaluation workflow includes initial assessment and comparative data analysis, utilizing quantitative metrics and established benchmarks to objectively evaluate the performance and accuracy of our approach.
Overall, our project holds significant promise in advancing our understanding of skin aging and offers valuable insights into tissue organization and cellular interactions.

17:50-18:00
CAMDA 1st day summary
Room: 520b
Format: In person


Authors List: Show

  • Joaquin Dopazo
Tuesday, July 16th
8:40-10:00
Invited Presentation: Computational dissection of complex human disease
Confirmed Presenter: Andrey Rzhetsky

Room: 520b
Format: In Person

Moderator(s): Joaquin Dopazo


Authors List: Show

  • Andrey Rzhetsky

Presentation Overview: Show

I will cover a collection of interrelated topics in the dissection of the etiology of complex human diseases, as seen through the lens of large-scale medical data analysis. The individual studies that I will cover focus on a mosaic of genetic, environmental, and genetic-environmental interaction factors. The studies relied on massive medical records from the US, Sweden, Denmark, and Japan, and a battery of modeling approaches.

10:40-11:10
Invited Presentation: The Synthetic Clinical Health Records Challenge - Introduction
Confirmed Presenter: Joaquin Dopazo

Room: 520b
Format: In Person


Authors List: Show

  • Joaquin Dopazo

Presentation Overview: Show

Although data protection is necessary to preserve patients’ confidentiality, privacy regulations are also an obstacle to biomedical research. An interesting alternative is the use of synthetic patients. However, conventional synthetic patients are useless for discovery given that they are built out of known data distributions. Interestingly, Generative Adversarial Networks (GANs) and related developments have emerged as powerful tools to generate synthetic data in a way that captures relationships between the variables produced even if such relationships were previously unknown. GANs became popular in the generation of highly realistic synthetic pictures but have been applied in many fields, including the generation of synthetic patients with applications such as medGAN and others.
Two datasets of synthetic patients have been created for this challenge since CAMDA 2023. Both datasets were generated from a real cohort retrieved from the Health Population Database (Base Poblacional de Salud, BPS) at the Andalusian Health System (Spain) using a Dual Adversarial AutoEncoder (DAAE) approach, and contain data on about 1 million patients.
Two challenges are suggested on both datasets, although any other original analysis you may think of is also welcome:
1) Finding strong relationships in diabetes-associated pathologies that allow a pathology to be predicted before it is diagnosed. Some well-known pathological consequences of diabetes, which can be considered relevant endpoints to predict, are: a) Retinopathy, b) Chronic kidney disease, c) Ischemic heart disease, d) Amputations.
2) Predicting disease trajectories in diabetes patients.

11:10-11:30
Predicting Diabetes Complications from Electronic Health Record Visits Using Machine Learning Algorithms
Confirmed Presenter: Daniel Voskergian, Computer Engineering Department, Al-Quds University, Palestine

Room: 520b
Format: Live Stream


Authors List: Show

  • Daniel Voskergian, Computer Engineering Department, Al-Quds University, Palestine
  • Malik Yousef, Zefat Academic College, Israel
  • Burcu Bakir-Gungor, Department of Computer Engineering, Faculty of Engineering, Abdullah Gul University, Turkey

Presentation Overview: Show

This study employed a novel approach to feature engineering, utilizing XGB feature selection combined with various supervised machine learning algorithms, including Random Forest, XGBoost, LogitBoost, AdaBoost, and Decision Tree, to develop predictive models for four complications of diabetes mellitus: retinopathy, chronic kidney disease, ischemic heart disease, and amputations. These models were built on synthetic electronic health records generated by dual-adversarial autoencoders, representing nearly 1 million synthetic patients for each of the two datasets used. These synthetic patients were derived from an authentic cohort of 979,308 and 984,414 individuals with diabetes, extracted from the Health Population Database (Base Poblacional de Salud, BPS) within the Andalusian Health System in Spain. The models considered variables such as age range and chronic diseases occurring during patient visits from the onset of diabetes. The final models, tailored to each complication, achieved an accuracy between 69% and 77% and an AUC between 77% and 84%. Notably, XGBoost and Random Forest demonstrated the best overall prediction performance, highlighting the effectiveness of our feature engineering and selection approach in enhancing model accuracy and robustness.

11:30-11:50
Cluster-based machine learning prediction of diabetes complications
Confirmed Presenter: Daniel Santana-Quinteros, Universidad Nacional Autónoma de México, Mexico

Room: 520b
Format: Live Stream


Authors List: Show

  • Daniel Santana-Quinteros, Universidad Nacional Autónoma de México, Mexico
  • Mario Rodriguez-Moran, Amphora Health, Mexico
  • Joaquin Tripp, Amphora Health, Mexico
  • Diana Colín-Ayala, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • María Arroyo-Perez, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Andrés Anguiano-Peña, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Pedro Salas, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Juan González-Tapia, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Roxana Villanueva-Calderón, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Alejandro Solorio-Solorio, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Liliana Solorio-Cázares, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Axel Quiroz-Ávalos, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Ana Escalera-Dominguez, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Kathia Rangel-Pompa, Universidad Michoacana de San Nicolas de Hidalgo, Mexico
  • Leonardo Tapia-Figueroa, Escuela Nacional de Estudios Superiores campus Morelia, Mexico
  • Miguel Zamorano-Presa, Escuela Nacional de Estudios Superiores campus Morelia, Mexico
  • Lilian Perez-Sosa, Centro de Investigación en Matemáticas, Mexico
  • Rubén Maldonado, Universidad Nacional Autónoma de México, Mexico
  • Edgar Salazar-Fernandez, Amphora Health, Mexico
  • Nelly Selem, Universidad Nacional Autónoma de México, Mexico
  • Cleto Alvarez-Aguilar, Universidad Michoacana de San Nicolás de Hidalgo, Mexico
  • Arturo Lopez-Pineda, Escuela Nacional de Estudios Superiores, Unidad Morelia, Mexico
  • Karina Figueroa-Mora, Universidad Michoacana de San Nicolás de Hidalgo, Mexico

Presentation Overview: Show

Background: Type 2 Diabetes Mellitus (T2D) is a prevalent metabolic disorder characterized by hyperglycemia due to defects in insulin secretion or action. T2D often leads to severe complications, including cardiovascular diseases, nephropathy, retinopathy, and neuropathy. This study explores the feasibility of employing cluster-based machine learning techniques to predict diabetes complications.
Methods: We utilized synthetic patient data generated by CAMDA 2023 (Spain) and real-world data from the DiabetIA database (Mexico), analyzing records of 997,657 patients in total. Data transformation involved converting JSON format to tabular form, cleaning sex information, and propagating chronic conditions across subsequent visits. Machine learning models, including Support Vector Machines (SVM) and Neural Networks (NN), were trained on stratified datasets to predict the onset of diabetic complications. Clustering techniques such as UMAP and BIRCH were employed to group patients by comorbidities.
Results: The cluster-based machine learning models demonstrated potential in classifying diabetic complications. By analyzing patient data, the models identified distinct clusters of patients with similar comorbidities and disease trajectories. For next-year prediction, classification yielded an area under the curve of 0.59 for NN and 0.56 for SVM.
Discussion: Cluster analysis can effectively enhance the understanding of T2D by revealing the interplay between various comorbidities and their impact on disease progression. The integration of advanced predictive models within a precision medicine framework promises more personalized and proactive healthcare interventions, ultimately improving patient outcomes and optimizing healthcare resources.

11:50-12:20
Statistical Measures for the Evaluation of Clustering Methods on Single Cell Data
Confirmed Presenter: Owen Visser, University of Florida, United States

Room: 520b
Format: In Person


Authors List: Show

  • Owen Visser, University of Florida, United States
  • Somnath Datta, University of Florida, United States

Presentation Overview: Show

The growing efficiency of single-cell sequencing technology has provided biologists with ample cells to identify and differentiate, often through clustering. Heuristic approaches for clustering method choice have become more prevalent and could lead to inaccurate reports if statistical evaluation of the resulting clusters is omitted. During the advent of microarray data, a similar dilemma was addressed in the literature through the provision of supervised and unsupervised measures, which were evaluated through rank aggregation. In this paper, these measures are adapted into the single-cell framework through a leave-one-out approach. Additionally, a scheme was created to utilize the information of cluster sizes by using their ranking to assign importance to the aggregation of methods, resulting in one table of methods ranked by cluster sizes. To demonstrate the ensemble of measures and scheme, five benchmark single-cell datasets were clustered with various methods at appropriate cluster sizes. We show that through rank aggregation and our importance scheme, our adapted measures select clustering methods that perform better at cluster sizes associated with true biological groups compared to those selected through traditional measures. For four of the five datasets and with internal measures alone, the rank aggregation scheme could correctly identify methods that performed the best at cluster sizes that match the original biological groups. We plan to package this ensemble of measures in the hope of providing others with a tool to identify the best performing clustering methods and associated sizes for a variety of datasets.
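Rank aggregation combines several rankings of the same clustering methods, one per validation measure, into a single consensus ranking. The overview does not state which aggregation procedure is used, so the sketch below substitutes a simple Borda count purely as an illustration; the function name, the method names, and the example rankings are hypothetical.

```python
def borda_aggregate(rankings):
    """Combine best-first rankings of the same items with a Borda count.

    Each ranking awards n-1 points to its top item, n-2 to the next, and so
    on; items are returned by total points (ties broken alphabetically).
    """
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, method in enumerate(ranking):
            scores[method] = scores.get(method, 0) + (n - 1 - position)
    return sorted(scores, key=lambda method: (-scores[method], method))


# Hypothetical rankings of three clustering methods under three measures.
rankings = [
    ["kmeans", "hierarchical", "louvain"],
    ["hierarchical", "kmeans", "louvain"],
    ["kmeans", "louvain", "hierarchical"],
]
consensus = borda_aggregate(rankings)
print(consensus)
```

A weighted variant, in which each ranking's points are multiplied by an importance weight, would be one way to express an importance scheme like the one described, though the actual scheme may differ.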

14:20-14:25
The Anti-Microbial Resistance Prediction Challenge - Introduction
Room: 520b
Format: In person


Authors List: Show

  • Paweł Łabaj
14:25-14:55
The Antimicrobial Resistance Prediction Challenge
Confirmed Presenter: Alper Yurtseven, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbruecken, Germany

Room: 520b
Format: Live Stream


Authors List: Show

  • Alper Yurtseven, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Saarbruecken, Germany
  • Guangyi Chen, HIPS, HZI, CBI, Germany
  • Olga V. Kalinina, HIPS, HZI, CBI, Faculty of Medicine, Saarland University, Saarbrücken, Germany

Presentation Overview: Show

Antimicrobial Resistance (AMR) is an urgent threat to human health worldwide, as microbes have developed resistance to even the most advanced drugs. In this year’s CAMDA challenge, we focused on using machine learning to predict the AMR status of 1820 bacterial strains belonging to 7 species (Campylobacter jejuni, Campylobacter coli, Escherichia coli, Klebsiella pneumoniae, Neisseria gonorrhoeae, Pseudomonas aeruginosa, Salmonella enterica).

14:55-15:15
Machine learning models for AMR prediction
Confirmed Presenter: Jaime Salvador López Viveros, CCM UNAM, Mexico

Room: 520b
Format: In person


Authors List: Show

  • Adriana Haydeé Contreras Peruyero, CCM UNAM, Mexico
  • Yesenia Villicaña Molina, CCM UNAM, Mexico
  • Nelly Sélem Mojica, CCM UNAM, Mexico
  • Francisco Santiago Nieto de la Rosa, CCM UNAM, Mexico
  • Victor Muñiz Sánchez, CIMAT MTY, Mexico
  • Lilia Leticia Ramírez Ramírez, CIMAT, Mexico
  • Anton Pashkov, ENES Morelia UNAM, Mexico
  • Mariel Guadalupe Gutiérrez Chaveste, CCM UNAM, Mexico
  • Jaime Salvador López Viveros, CCM UNAM, Mexico
  • Johanna Atenea Carreón Baltazar, CCM UNAM, Mexico
  • Luis Raúl Figueroa Martínez, CCM UNAM, Mexico
  • Ronald Cardenas Catota, CIMAT, Mexico
  • Alejandro Sierra Conde, CIMAT, Mexico
  • Fernando Fontove Herrera, C3, Mexico
  • Diana Barcelo, CINVESTAV, Mexico
  • Miguel Calderon León, CCM UNAM, Mexico
  • Shaday Guerrero Flores, CCM UNAM, Mexico
  • César Aguilar Martínez, Purdue University, Mexico
  • Kotaro Hata, CIMAT, Mexico
  • Richar Chacòn Serna, CCM UNAM, Mexico
  • Jessica Admin Córdoba de León, CCM UNAM, Mexico

Presentation Overview: Show

Each year, the Community of Interest Critical Assessment of Massive Data Analysis (CAMDA) presents various challenges related to massive data analysis and life sciences data. One of this year's challenges addresses the problem of predicting antimicrobial resistance in isolated samples. We conducted different analyses of the data using methods such as pangenomes and RGI to obtain data frames with counts of similar genes in gene families and counts of AMR gene families. We then applied various machine learning (ML) models: some to predict resistance-susceptibility and others to predict the amount of antibiotic needed to classify the sample. A wide variety of preprocessing and dimensionality reduction methods, together with supervised and unsupervised ML models, were used. The best F1 scores ranged from 0.76 to 0.96, with the best result obtained using logistic regression with L1 regularization.
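
To make the last sentence concrete, here is a minimal sketch of logistic regression with an L1 penalty, fitted by proximal gradient descent (ISTA), together with an F1 score. It is not the team's pipeline: the toy data, the absence of an intercept, and the values of `lam`, `lr`, and `n_iter` are assumptions made only for illustration.

```python
import math


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink a value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_l1_logistic(X, y, lam=0.2, lr=0.5, n_iter=500):
    """Fit logistic regression (no intercept) with an L1 penalty via ISTA."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iter):
        probs = [sigmoid(sum(wj * xj for wj, xj in zip(w, row))) for row in X]
        # Gradient of the mean logistic loss with respect to each weight.
        grad = [
            sum((p - t) * row[j] for p, t, row in zip(probs, y, X)) / n
            for j in range(d)
        ]
        # Gradient step followed by soft-thresholding (the L1 part).
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad)]
    return w


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: feature 0 tracks the resistant label (1),
# feature 1 carries only a weak signal.
X = [[1.0, 0.1], [1.0, 0.1], [-1.0, -0.1], [-1.0, 0.1]]
y = [1, 1, 0, 0]

weights = fit_l1_logistic(X, y)
predictions = [
    1 if sum(wj * xj for wj, xj in zip(weights, row)) > 0 else 0 for row in X
]
score = f1_score(y, predictions)
```

On this toy data the penalty drives the weight of the weak feature to exactly zero while keeping the informative one, which is the property that makes L1 regularization attractive when there are many gene-family counts and few samples.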

15:15-15:35
Proceedings Presentation: Biomarker identification by interpretable Maximum Mean Discrepancy
Confirmed Presenter: Dexiong Chen, Max Planck Institute of Biochemistry, Germany

Room: 520b
Format: Live Stream


Authors List: Show

  • Michael Adamer, ETH Zürich, Switzerland
  • Sarah Brüningk, ETH Zürich, Switzerland
  • Dexiong Chen, Max Planck Institute of Biochemistry, Germany
  • Karsten Borgwardt, Max Planck Institute of Biochemistry, Germany

Presentation Overview: Show

Motivation: In many biomedical applications, we are confronted with paired groups of samples, such as treated vs. control. The aim is to detect discriminating features, i.e. biomarkers, based on high-dimensional (omics-) data. This problem can be phrased more generally as a two-sample problem requiring statistical significance testing to establish differences, and interpretations to identify distinguishing features. The multivariate maximum mean discrepancy (MMD) test quantifies group-level differences, whereas statistically significantly associated features are usually found by univariate feature selection. Currently, there are few general-purpose methods that simultaneously perform multivariate feature selection and two-sample testing.
Results: We introduce a sparse, interpretable, and optimised MMD test (SpInOpt-MMD) that enables two-sample testing and feature selection in the same experiment. SpInOpt-MMD is a versatile method and we demonstrate its application to a variety of synthetic and real-world data types including images, gene expression, and text data. SpInOpt-MMD is effective in identifying relevant features in small sample sizes and outperforms other feature selection methods such as SHapley Additive exPlanations (SHAP) and univariate association analysis in several experiments.
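
For readers unfamiliar with the MMD statistic mentioned above, the following sketch computes a biased estimate of the squared MMD with an RBF kernel that takes per-feature weights. It is not SpInOpt-MMD: the weights are set by hand purely for illustration (how the method obtains its sparse weights is not described in this abstract), and the function names, bandwidth `sigma`, and toy samples are hypothetical.

```python
import math


def weighted_rbf(x, y, weights, sigma=1.0):
    """RBF kernel with non-negative per-feature weights."""
    sq_dist = sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(A, B, weights, sigma=1.0):
    """Average kernel value over all pairs (a, b) with a in A and b in B."""
    total = sum(weighted_rbf(a, b, weights, sigma) for a in A for b in B)
    return total / (len(A) * len(B))


def mmd_squared(X, Y, weights, sigma=1.0):
    """Biased (V-statistic) estimate of the squared MMD between X and Y."""
    return (
        mean_kernel(X, X, weights, sigma)
        + mean_kernel(Y, Y, weights, sigma)
        - 2.0 * mean_kernel(X, Y, weights, sigma)
    )


# Hypothetical two-group data: the groups differ in feature 0 only,
# while feature 1 takes the same values in both groups.
X = [[0.0, 1.0], [0.2, 2.0], [0.1, 3.0]]
Y = [[5.0, 1.0], [5.2, 2.0], [5.1, 3.0]]

mmd_feature0 = mmd_squared(X, Y, weights=[1.0, 0.0])  # only feature 0 used
mmd_feature1 = mmd_squared(X, Y, weights=[0.0, 1.0])  # only feature 1 used
```

The statistic is large when the weight sits on the feature that separates the groups and vanishes when it sits on the uninformative one, which gives an intuition for why sparse feature weights inside an MMD test can double as a feature selection signal.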

15:35-15:45
CAMDA Trophy ceremony
Room: 520b
Format: In person


Authors List: Show

  • David Kreil
15:45-15:50
CAMDA summary and closing remarks
Room: 520b
Format: In person


Authors List: Show

  • Wenzhong Xiao