Data portals, such as the Genomic Data Commons, provide catalogs of somatic single nucleotide variants (SNVs) identified from whole-genome sequencing (WGS) of tumor and matched normal DNA from thousands of samples. However, computational identification and prioritization of functionally relevant driver mutations remain far more challenging in non-coding regions than in coding genes, owing to incomplete annotation of regulatory regions [1]. To address this gap, we developed predictive models of gene regulatory regions by applying DNABERT, a deep learning model pre-trained on genomic sequences, to systematically characterize the functional impact of somatic mutations in key non-coding regions, including splice sites and transcription factor binding sites (TFBS). We applied these models to glioblastoma multiforme (GBM), an aggressive brain cancer with limited therapeutic options. We analyzed WGS data from 189 GBM patients (SNVs called by CaVEMan and indels called by Pindel) obtained from the TCGA-GDC portal. This dataset included 19,968 SNVs and 34,718 indels near acceptor sites and 23,656 SNVs and 20,171 indels near donor sites. Additionally, we examined 700 TFBS datasets (33 histone markers and 667 TF ChIP-seq markers) from ENCODE, covering 4,228 ChIP-seq experiments across 91 cell lines, 21 in vitro differentiated cells, 53 primary cells, and 77 tissues. The histone marker regions contained 540,742 indels across 285,204 regions and 734,067 SNVs across 346,600 regions, while the TF ChIP-seq regions contained 9,615,683 indels across 3,796,449 regions and 7,827,556 SNVs across 6,869,598 regions.
To assess the functional impact of these variants, we developed two DNABERT-based splice site models and fine-tuned 700 TFBS-specific models. For each variant, these models computed probability scores for the reference and alternative sequences, and functional disruption was assessed from the resulting log-odds ratios and score-change values. The splice site models predicted 299 candidate SNVs and 1,822 candidate indels in acceptor sites and 673 SNVs and 504 indels in donor sites, while the histone models identified 4,171 indels and 763 SNVs as functionally disruptive. In TF ChIP-seq regions, DNABERT prioritized 61,731 candidate indels and 30,539 candidate SNVs. We identified frequent mutations (present in ≥10% of patients) in the TFBSs of SPI1, ZBTB33, and RAD51, which are implicated in oncogenesis and immune regulation. Survival analysis revealed mutations in the TFBSs of TRIM22, QKI, RELB, and BCL3 as potential prognostic biomarkers, while DNABERT-predicted splice site mutations in PARPBP and MAPT correlated with patient survival, suggesting a role in treatment response. These findings highlight the power of genomic foundation models to identify biologically significant and clinically relevant variants in GBM. Our approach provides a novel computational framework for prioritizing non-coding regulatory SNVs, contributing to WGS data analyses and precision medicine. To facilitate further exploration, we have developed an interactive dashboard at https://davuluri-lab-brainved.streamlit.app.
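As an illustration of the variant scoring step described above, the following minimal Python sketch shows one way a log-odds ratio and a score change could be derived from a model's reference and alternative probabilities. The exact formulas and cutoffs are not stated here, so the definitions below (score change as the difference in probability, log-odds ratio as the difference of base-2 logits) and the threshold parameters `min_abs_change` and `min_abs_lor` are assumptions chosen for illustration, not the pipeline actually used.

```python
import math


def logit2(p: float, eps: float = 1e-6) -> float:
    """Base-2 log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log2(p / (1.0 - p))


def score_change(p_ref: float, p_alt: float) -> float:
    """Assumed definition: change in predicted probability (alt minus ref)."""
    return p_alt - p_ref


def log_odds_ratio(p_ref: float, p_alt: float) -> float:
    """Assumed definition: difference of log-odds (alt minus ref)."""
    return logit2(p_alt) - logit2(p_ref)


def is_disruptive(p_ref: float, p_alt: float,
                  min_abs_change: float = 0.5,
                  min_abs_lor: float = 2.0) -> bool:
    """Flag a variant when both measures exceed hypothetical cutoffs."""
    return (abs(score_change(p_ref, p_alt)) >= min_abs_change
            and abs(log_odds_ratio(p_ref, p_alt)) >= min_abs_lor)


if __name__ == "__main__":
    # Hypothetical probabilities for one variant: the reference sequence
    # is predicted as a functional site, the alternative is not.
    p_ref, p_alt = 0.9, 0.1
    print(score_change(p_ref, p_alt))
    print(log_odds_ratio(p_ref, p_alt))
    print(is_disruptive(p_ref, p_alt))
```

Requiring both measures to pass is a design choice in this sketch: the probability difference captures the absolute size of the effect, while the log-odds ratio is more sensitive to changes near 0 or 1.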