The linkage between transcription factors (TFs) and cis-regulatory regions (CREs) is crucial to understanding gene regulation. Conventionally, it is determined by a step-wise process: motif enrichment followed by correlation- or regression-based analysis. Because the presence of a motif does not always imply binding, and correlation analysis may miss low-expression TFs, this process can suffer from both false positives and false negatives. Here we propose a holistic model that takes joint single-cell RNA sequencing (scRNA-seq) and single-cell assay for transposase-accessible chromatin sequencing (scATAC-seq) data to delineate TF-CRE linkage. Inspired by multi-omics factor analysis and sequence modeling, our model decomposes peaks’ accessibility into cell factors, encoded from TF expression, and peak factors, encoded from DNA sequences.
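As a rough illustration of the decomposition described above, the following is a minimal sketch and not the model itself. It assumes linear encoders, a sigmoid link, random toy data, and hypothetical names (`W_cell`, `W_peak`, `k`); none of these are specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_tfs, n_peaks, seq_len, k = 5, 8, 6, 20, 3  # toy sizes (assumed)

# Toy inputs: TF expression per cell and one-hot DNA sequence per peak.
tf_expr = rng.random((n_cells, n_tfs))
seq_onehot = np.eye(4)[rng.integers(0, 4, size=(n_peaks, seq_len))]

# Hypothetical linear encoders standing in for the real (unspecified) ones.
W_cell = rng.normal(size=(n_tfs, k))
W_peak = rng.normal(size=(seq_len * 4, k))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Cell factors are encoded from TF expression, peak factors from sequence.
cell_factors = tf_expr @ W_cell                           # (n_cells, k)
peak_factors = seq_onehot.reshape(n_peaks, -1) @ W_peak   # (n_peaks, k)

# Predicted accessibility of every peak in every cell.
accessibility = sigmoid(cell_factors @ peak_factors.T)    # (n_cells, n_peaks)
```

In this sketch each cell and each peak is mapped to a `k`-dimensional factor, and their inner product gives the predicted accessibility, so TF expression and sequence content jointly determine each cell-peak entry.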
We demonstrate our model on an embryonic mouse brain dataset. Both modalities are accurately reconstructed on held-out cells and sequences. Cell factors preserve cell type distinction and trajectory structure, while sequence factors moderately localize some motifs, such as those of Neurod2 and Sox11, indicating that regulatory information is captured.
To delineate TF-CRE linkage, we take gradients of the model output with respect to the two inputs. High values of gradient times TF expression (gradTF) are assigned to highly correlated TF-CRE pairs, whereas low-correlation, high-gradTF pairs may correspond to low-expression TFs, though systematic evaluation remains to be done. As an example, Runx1, a low-expression TF, correlates poorly with the accessibility of almost all peaks; however, its potential target CREs (compiled from ChIP-Atlas) have a higher absolute gradTF. On the other hand, gradient times sequence (gradSeq) highlights regulatory motifs.
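The gradient-times-input attribution described above can be sketched for a single cell and peak as follows. This is a toy example under stated assumptions: a purely linear encoder (so the gradient is available in closed form), an attribution taken on the pre-sigmoid logit, and hypothetical names (`W_cell`, `peak_factor`). The actual model would presumably rely on automatic differentiation through nonlinear encoders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tfs, k = 4, 3  # toy sizes (assumed)

W_cell = rng.normal(size=(n_tfs, k))       # hypothetical linear cell encoder
peak_factor = rng.normal(size=k)           # factor of one peak (assumed given)
tf_expr = np.array([2.0, 0.0, 0.5, 1.0])   # TF expression in one cell

def logit(x):
    # Accessibility logit of this peak in a cell with TF expression x.
    return (x @ W_cell) @ peak_factor

# For a linear encoder, d logit / d tf_expr has a closed form.
grad = W_cell @ peak_factor                # (n_tfs,)

# gradTF: gradient times input, one score per TF for this peak.
grad_tf = grad * tf_expr
```

Two properties of this sketch are worth noting. A TF with zero expression in a cell receives a gradTF of exactly zero there, regardless of its gradient. And because the toy logit is linear with no bias term, the gradTF scores sum to the logit itself; this would not hold in general for a nonlinear model.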