A Hybrid Pipeline for Feature Reduction, and Ordinal Classification to Predict Antimicrobial Resistance from Genetic Profiles
Confirmed Presenter: Anton Pashkov, ENES Morelia UNAM, Mexico
Track: CAMDA: Critical Assessment of Massive Data Analysis
Room: 01C
Format: In person
Moderator(s): Paweł Łabaj
Authors List:
- Adriana Haydeé Contreras Peruyero, Centro de Ciencias Matemáticas UNAM Morelia
- Yesenia Villicaña Molina, Centro de Ciencias Matemáticas UNAM Morelia
- Nelly Sélem Mojica, Centro de Ciencias Matemáticas UNAM Morelia
- Francisco Santiago Nieto de la Rosa, Centro de Ciencias Matemáticas UNAM Morelia
- Victor Muñiz Sánchez, CIMAT MTY
- Anton Pashkov, ENES Morelia UNAM
- Johanna Atenea Carreón Baltazar, Centro de Ciencias Matemáticas UNAM Morelia
- Luis Raúl Figueroa Martínez, Centro de Ciencias Matemáticas UNAM Morelia
- Evelia Lorena Coss Navarrete, LIIGH
- César Augusto Aguilar Martínez, Campus Monterrey
- Varinia López-Ramírez, Tecnológico Nacional de México/ITS de Irapuato
- Mariana Jaired Ruíz Amaro, ENES León UNAM
- Johanna Castelá
Presentation Overview:
One of the three challenges proposed by the Community of Interest Critical Assessment of Massive Data Analysis (CAMDA) involves predicting antimicrobial resistance or susceptibility for nine bacterial species and four antibiotics of interest. The dataset underwent a cleaning process to remove duplicate IDs with differing MIC values or phenotypes. After data cleaning and preprocessing, three distinct strategies were implemented to perform the predictions. The first strategy focused on predicting minimum inhibitory concentration (MIC) values. We adapted machine learning models for ordinal classification, assuming MIC as an ordinal variable. Two main approaches were used: multiple binary models (logistic regression, CART, random forests) and threshold models (neural networks). Due to the high dimensionality and sparsity of the AMR gene count data, we applied preprocessing techniques including a TF-IDF-like transformation (GF-IAF) and dimensionality reduction (truncated SVD and NMF). In the second strategy, we tested several classical machine learning models to predict the phenotype directly and used a grid search to find the optimal set of parameters, without using MIC values. In the third, we applied dimensionality reduction methods such as TF-IDF, along with a biological filtering step, before predicting the phenotype. Finally, as a preliminary result, ANI and pangenome analyses of E. coli isolates revealed divergence in gene content among some strains. Accessory regions potentially linked to antibiotic resistance suggest that key resistance determinants may lie outside the core genome.
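
As a rough illustration of two ideas in the overview, the sketch below shows a TF-IDF-style weighting of a sparse gene count matrix and the "multiple binary models" route to ordinal classification. It is not the authors' implementation. The abstract does not define GF-IAF, so the weighting formula here (isolates treated as documents, AMR genes as terms, with a smoothed inverse frequency) is an assumption. The function names `tfidf_like`, `ordinal_binary_targets`, and `combine_threshold_probs` are hypothetical. MIC levels are assumed to be already mapped to integer ranks 0..K-1. The binary classifiers themselves are left out: any probabilistic model could be fit on each threshold target.

```python
import math


def tfidf_like(counts):
    """TF-IDF-style weighting of an isolates x genes count matrix.

    Assumption: rows are isolates, columns are AMR genes. The exact GF-IAF
    formula is not given in the abstract; this is a common smoothed variant.
    """
    n_rows = len(counts)
    n_cols = len(counts[0])
    # Number of isolates in which each gene occurs at least once.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_cols)]
    # Smoothed inverse frequency: genes seen in fewer isolates weigh more.
    inv_freq = [math.log((1 + n_rows) / (1 + df)) + 1.0 for df in doc_freq]
    weighted = []
    for row in counts:
        total = sum(row)
        if total == 0:
            weighted.append([0.0] * n_cols)
        else:
            weighted.append([(c / total) * w for c, w in zip(row, inv_freq)])
    return weighted


def ordinal_binary_targets(labels, n_classes):
    """Build one 0/1 target vector per threshold k, answering 'is label > k?'."""
    return [[1 if y > k else 0 for y in labels] for k in range(n_classes - 1)]


def combine_threshold_probs(p_greater):
    """Turn P(y > k), k = 0..K-2, into a distribution over K ordered classes."""
    probs = [1.0 - p_greater[0]]
    for k in range(1, len(p_greater)):
        probs.append(p_greater[k - 1] - p_greater[k])
    probs.append(p_greater[-1])
    # Independently trained binary models need not be monotone in k,
    # so negative differences are clipped and the result renormalised.
    probs = [max(p, 0.0) for p in probs]
    total = sum(probs)
    return [p / total for p in probs]


if __name__ == "__main__":
    toy_counts = [[2, 0, 1], [0, 0, 3], [1, 1, 0]]  # made-up gene counts
    print(tfidf_like(toy_counts))
    print(ordinal_binary_targets([0, 1, 2, 3], n_classes=4))
    print(combine_threshold_probs([0.9, 0.6, 0.1]))
```

In this decomposition, K ordered MIC levels need K-1 binary models, and the predicted level is the one with the highest combined probability. A dimensionality reduction step such as truncated SVD or NMF would sit between the weighting and the classifiers; it is not shown here.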