Leveraging deep learning for characterization of malaria parasite PUFs — proteins of unknown function
Confirmed Presenter: Harsh R. Srivastava, Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
Room: 520b
Format: In Person
Moderator(s): Ana Rojas
Authors List:
- Harsh R. Srivastava, Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
- Daniel Berenberg, Courant Institute, Department of Computer Science, New York University, New York, NY, USA
- Omar Qassab, Center for Genomics and Systems Biology, Department of Biology, New York University, New York, NY, USA
- Tymor Hamamsy, Courant Institute, Department of Computer Science, New York University, New York, NY, USA
- Jane M. Carlton, Johns Hopkins Malaria Research Institute, Bloomberg School of Public Health, Baltimore, MD, USA
- Richard Bonneau, Prescient Design, gRED Computational Sciences, Genentech, New York, NY, USA
Presentation Overview:
Exploiting the sequence-structure-function paradigm is crucial for annotating proteins of unknown function (PUFs) in Plasmodium falciparum, a member of the diverged eukaryotic SAR (Stramenopiles, Alveolates, and Rhizarians) supergroup. P. falciparum, a malaria-causing parasite, accounted for ~250 million cases and over 600,000 deaths in 2022. Discovery of diagnostic and therapeutic targets in P. falciparum is hindered because ~23% of its proteins are classified as PUFs, while ~40% are only partially annotated. Predicting GO annotations for these PUFs is difficult for two reasons: they share little sequence similarity with annotated proteins, and deep learning models trained on well-studied SwissProt species generalize poorly to diverged organisms. Here, we focused on the structure-function relationship of SAR sequences and developed a new method to predict GO terms in P. falciparum. PFP (Plasmodium Function Predictor) is a collection of structural-homology-based deep learning models trained on evolutionarily relevant, structure-aware TM-Vec embeddings. We used a deep feedforward architecture with a dropout layer to predict GO annotations, and quantified prediction uncertainty with Monte Carlo dropout. When benchmarked against DeepGOPlus, PFP demonstrated a significant improvement in Fmax, Smin, and AUPR-micro/AUPR-macro on both our test split and our Plasmodium holdout split. PFP-predicted GO terms respected the hierarchical structure of GO and aligned with expected information content distributions. For poorly annotated proteins, PFP imputed GO terms that are biologically plausible given existing annotations. Additionally, predictions made by PFP were categorized into confidence levels and aligned with published data on specific P. falciparum PUFs. PFP is the first curated function prediction model developed specifically for a subset of eukaryotic species.
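For readers unfamiliar with Monte Carlo dropout, the idea mentioned in the overview can be sketched as follows. This is a minimal illustration, not the PFP implementation: the layer sizes, random weights, dropout rate, number of passes, and every function name here are assumptions chosen for the example, and the random input vector merely stands in for a TM-Vec embedding. Dropout is left active at prediction time, the same input is passed through the network many times, and the spread of the per-term scores serves as an uncertainty estimate.

```python
import math
import random
import statistics


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one dense layer (illustrative only)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine transform: one output per weight row."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def dropout(x, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


def forward(x, hidden, output, rate, rng):
    """One stochastic pass: ReLU hidden layer, dropout, per-term sigmoid scores."""
    h = [max(0.0, v) for v in dense(x, hidden)]
    h = dropout(h, rate, rng)  # dropout stays ON at prediction time
    return [1.0 / (1.0 + math.exp(-v)) for v in dense(h, output)]


def mc_dropout_predict(x, hidden, output, rate=0.5, passes=100, seed=0):
    """Return per-term mean score and standard deviation over `passes` passes."""
    rng = random.Random(seed)
    samples = [forward(x, hidden, output, rate, rng) for _ in range(passes)]
    per_term = list(zip(*samples))  # regroup scores by output term
    means = [statistics.fmean(t) for t in per_term]
    stds = [statistics.pstdev(t) for t in per_term]
    return means, stds


# Hypothetical setup: an 8-dimensional stand-in embedding and 3 output terms.
init_rng = random.Random(42)
hidden = make_layer(8, 16, init_rng)
output = make_layer(16, 3, init_rng)
embedding = [init_rng.gauss(0.0, 1.0) for _ in range(8)]

means, stds = mc_dropout_predict(embedding, hidden, output)
for i, (m, s) in enumerate(zip(means, stds)):
    print(f"term_{i}: mean score {m:.3f}, std {s:.3f}")
```

A term with a high mean score and a low standard deviation would be a candidate for a high-confidence bin. How PFP actually maps these quantities to confidence levels is not specified in the overview.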
We will discuss findings in model architecture and highlight specific GO predictions contributing to an increase of more than 25% in P. falciparum proteome annotation.
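As background for the Fmax figures cited in the overview, the sketch below computes a protein-centric Fmax in the style commonly used for function prediction benchmarks. It is a simplified illustration under stated assumptions: the function name `fmax`, the threshold grid, and the toy proteins and term labels are all hypothetical, and the propagation of annotations up the GO hierarchy that a full evaluation would perform is omitted. At each threshold, precision is averaged over proteins with at least one predicted term and recall over all annotated proteins; Fmax is the best resulting F-score.

```python
from statistics import fmean


def fmax(y_true, y_score, thresholds=None):
    """Protein-centric Fmax.

    y_true:  list of sets of true term labels, one set per protein.
    y_score: list of dicts mapping term label -> predicted score in [0, 1].
    Returns (best F-score, threshold at which it is first reached).
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]

    best_f, best_t = 0.0, None
    for t in thresholds:
        precisions, recalls = [], []
        for truth, scores in zip(y_true, y_score):
            if not truth:
                continue  # proteins without annotations are not evaluated
            predicted = {term for term, s in scores.items() if s >= t}
            hits = len(predicted & truth)
            if predicted:
                # precision counts only proteins with at least one prediction
                precisions.append(hits / len(predicted))
            recalls.append(hits / len(truth))
        if not precisions or not recalls:
            continue
        p, r = fmean(precisions), fmean(recalls)
        if p + r > 0:
            f = 2 * p * r / (p + r)
            if f > best_f:
                best_f, best_t = f, t
    return best_f, best_t


# Toy data with placeholder term labels (not real GO identifiers).
y_true = [{"A", "B"}, {"C"}]
y_score = [
    {"A": 0.9, "B": 0.4, "C": 0.2},
    {"C": 0.8, "A": 0.85},
]

best_f, best_t = fmax(y_true, y_score)
print(f"Fmax = {best_f:.3f} at threshold {best_t}")
```

Smin and AUPR, the other metrics named in the overview, additionally require per-term information content and full precision-recall curves, so they are not covered by this sketch.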