SpliceSelectNet: A Hierarchical Transformer-Based Deep Learning Model for Splice Site Prediction
Confirmed Presenter: Yuna Miyachi, Department of Computer Science, Graduate School of Information Science and Technology
Track: MLCSB: Machine Learning in Computational and Systems Biology
Room: 01A
Format: In person
Moderator(s): Maria Chikina
Authors List: Show
- Yuna Miyachi, Yuna Miyachi, Department of Computer Science
- Kenta Nakai, Kenta Nakai, Human Genome Center
Presentation Overview:Show
RNA splicing is a critical post-transcriptional process that enables the generation of diverse protein isoforms. Aberrant splicing is implicated in a wide range of genetic disorders and cancers, making accurate prediction of splice sites and mutation effects essential. Convolutional neural network-based models such as SpliceAI and Pangolin have achieved high accuracy but often lack interpretability. Recently, Transformer-based models like DNABERT and SpTransformer have been applied to genomic sequences, yet they typically inherit input length limitations from natural language processing models, restricting context to a few thousand base pairs, which are insufficient for capturing long-range regulatory signals.
To overcome these challenges, we propose SpliceSelectNet (SSNet), a hierarchical Transformer model that integrates local and global attention mechanisms to handle up to 100 kb of input while maintaining nucleotide-level interpretability. Trained on multiple datasets, including those incorporating splice site usage derived from RNA-seq data, SSNet outperforms SpliceAI and Pangolin on the Gencode test dataset, a clinically curated BRCA variant dataset, and a deep intronic variant benchmark. It demonstrates improved performance, particularly in regions characterized by complex splicing regulation, such as long exons and deep introns, as measured by area under the precision-recall curve.
Furthermore, SSNet’s attention maps provide direct insight into sequence context. In the case of a pathogenic variant in BRCA1 exon 10, the model highlighted an upstream region that may contribute to cryptic splice site activation.
These results demonstrate that SSNet combines high predictive performance with biological interpretability, offering a powerful tool for splicing analysis in both research and clinical settings.