Spatial transcriptomics data have opened up new avenues of investigation into the spatial organization of tissues and the relationship between cellular location, regulation and function. For example, several computational methods can combine transcriptomic and spatial descriptions of a cell into a single representation called an “embedding”, facilitating tasks such as cell clustering and tissue segmentation. Studies of the relationship between gene expression and spatial location are not new, and several groups have sought to quantify the extent to which one determines the other, giving us the concept of the “positional information” of a cell encoded in its expression profile. Here, we combine these two complementary ideas of cell embedding and positional information to develop a new computational tool for spatial transcriptomics data analysis.
At the core of our tool is a neural network model that maps each cell’s transcriptome to a vector representation (embedding) such that proximally located cells have similar embeddings. We demonstrate through extensive applications to several real and synthetic data sets that this unique approach to cell representation offers several practical advantages, while also allowing us to extend the quantification of “positional information” to high-dimensional spatial transcriptomics (ST) data.
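The core idea, an objective that rewards similar embeddings for spatially proximal cells, can be illustrated with a simplified sketch. The loss form below (a contrastive-style pairwise loss), the neighbourhood radius, and the margin are our assumptions for illustration, not the paper's actual training objective.

```python
import numpy as np

def proximity_loss(embeddings, coords, radius=1.0, margin=2.0):
    """Toy contrastive-style loss: spatially near cell pairs are pulled
    together in embedding space, far pairs are pushed at least `margin`
    apart. (Loss form and hyperparameters are illustrative assumptions.)"""
    n = len(coords)
    loss, pairs = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            d_space = np.linalg.norm(coords[i] - coords[j])
            d_embed = np.linalg.norm(embeddings[i] - embeddings[j])
            if d_space <= radius:          # spatial neighbours: attract
                loss += d_embed ** 2
            else:                          # non-neighbours: repel up to margin
                loss += max(0.0, margin - d_embed) ** 2
            pairs += 1
    return loss / pairs

# Toy data: 4 cells on a line; the first two are spatial neighbours.
coords = np.array([[0.0], [0.5], [5.0], [9.0]])
good = np.array([[0.0], [0.1], [3.0], [6.0]])  # embedding respects spatial layout
bad  = np.array([[0.0], [3.0], [0.1], [0.2]])  # embedding violates it
print(proximity_loss(good, coords) < proximity_loss(bad, coords))  # True
```

A neural network trained to minimize such an objective yields an embedding function of the transcriptome alone, which is what enables the downstream uses described next.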
First, we show that learnt embeddings can capture spatial relationships among cells as accurately as state-of-the-art embedding methods such as GraphST, SEDR and STAGATE, while additionally ensuring that the embeddings are fully determined by cellular transcriptomes. This latter feature imparts generalizability to the method, allowing the same embedding function to be meaningful for additional biological samples. We demonstrate such generalizability by training the neural network on one tissue and using the trained model to embed cells in a “test” tissue, still achieving high accuracy in predicting cellular location from expression. This also amounts to solving the “spatial reconstruction” task, addressed by previous methods such as TANGRAM, but now in a completely map-free manner, i.e., without the need for a reference tissue to which the reconstructed spatial information must be mapped.
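The evaluation described above, predicting a held-out cell's location from its embedding, can be sketched as follows. Here a k-nearest-neighbour regression in embedding space stands in for the paper's trained network, and the toy "embeddings" are simulated as a noisy function of location; both are assumptions made for illustration.

```python
import numpy as np

def knn_predict_coords(train_emb, train_coords, test_emb, k=3):
    """Predict each test cell's location as the mean location of its k
    nearest training cells in embedding space. (k-NN regression stands in
    for a trained model here; k is an arbitrary choice.)"""
    preds = []
    for e in test_emb:
        dists = np.linalg.norm(train_emb - e, axis=1)
        nearest = np.argsort(dists)[:k]
        preds.append(train_coords[nearest].mean(axis=0))
    return np.array(preds)

# Toy setup: embeddings that vary smoothly with spatial location.
rng = np.random.default_rng(0)
train_coords = rng.uniform(0, 10, size=(200, 2))
train_emb = train_coords + rng.normal(0, 0.1, size=(200, 2))  # near-identity map
test_coords = rng.uniform(0, 10, size=(20, 2))
test_emb = test_coords + rng.normal(0, 0.1, size=(20, 2))

pred = knn_predict_coords(train_emb, train_coords, test_emb)
err = np.linalg.norm(pred - test_coords, axis=1).mean()
print(f"mean reconstruction error: {err:.2f}")
```

Because prediction happens entirely in embedding space, no alignment of the test tissue to a reference map is required, which is the sense in which the reconstruction is "map-free".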
Second, we show that the model can seamlessly learn embeddings for cells in multiple samples of a tissue, thus providing a universal coordinate system to describe cellular locations. We then use this coordinate system to design a statistical test for differences in the spatial expression of a gene between two tissue samples. We assess the statistical power of this test with simulations, and apply it to detect differential spatial expression of individual genes between brains representing different biological conditions.
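One way such a test could work, once both samples share a common coordinate system, is as a permutation test on a spatial-pattern statistic. The statistic below (difference in per-axis correlation between expression and coordinates) and the label-permutation null are our assumptions for illustration, not necessarily the paper's construction.

```python
import numpy as np

def spatial_diff_test(coords_a, expr_a, coords_b, expr_b, n_perm=500, seed=0):
    """Permutation test for a difference in a gene's spatial expression
    pattern between two samples in a shared coordinate system.
    (Statistic and permutation scheme are illustrative assumptions.)"""
    rng = np.random.default_rng(seed)

    def stat(coords, expr):
        # Per-axis correlation between expression and coordinates.
        return np.array([np.corrcoef(coords[:, d], expr)[0, 1]
                         for d in range(coords.shape[1])])

    observed = np.abs(stat(coords_a, expr_a) - stat(coords_b, expr_b)).sum()

    # Null: pool cells from both samples and permute sample labels.
    coords = np.vstack([coords_a, coords_b])
    expr = np.concatenate([expr_a, expr_b])
    n_a = len(expr_a)
    count = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(expr))
        a, b = idx[:n_a], idx[n_a:]
        perm = np.abs(stat(coords[a], expr[a]) - stat(coords[b], expr[b])).sum()
        if perm >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # permutation p-value

# Toy example: gene with an x-gradient in sample A but not in sample B.
rng = np.random.default_rng(1)
coords_a = rng.uniform(0, 1, size=(150, 2))
coords_b = rng.uniform(0, 1, size=(150, 2))
expr_a = 2.0 * coords_a[:, 0] + rng.normal(0, 0.3, 150)  # spatial gradient
expr_b = rng.normal(1.0, 0.3, 150)                       # no gradient
print(spatial_diff_test(coords_a, expr_a, coords_b, expr_b))
```

Simulations like this toy example, with a spatial pattern planted in one sample only, are the natural way to probe the power of such a test at varying effect sizes.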