The workflow of scMTD
scMTD implements imputation for scRNA-seq data with the following steps: data preprocessing, cell-state specific space construction, cell-level imputation, gene-level imputation, dropout probability estimation, and multidimensional imputation.
Data preprocessing
scMTD takes a \(I \times J\) gene expression matrix as input and preprocesses the raw count matrix by the normalization and logarithmic transformation. The raw count matrix is normalized by the library size of each cell so that all samples have one million reads:
$$x_{ij}^{N} = \frac{{x_{ij}^{C} \cdot 10^{6} }}{{\sum\nolimits_{k = 1}^{I} {x_{kj}^{C} } }},$$
$$x_{ij} = {\text{log}}_{2} (x_{ij}^{N} + \theta ),$$
where \(x_{ij}^{C}\), \(x_{ij}^{N}\), and \(x_{ij}\) represent the gene expression of the ith \((i = 1,2, \cdots ,I)\) gene and the jth \((j = 1,2, \cdots ,J)\) cell in the raw count matrix \(X^{C}\), normalized matrix \(X^{N}\), and preprocessed matrix \(X\), respectively. The constant \(\theta\)(defaults to 1) is added to prevent infinite values during the logarithmic transformation.
Cell-state specific space construction
Based on the results of the trajectory inference by TSCAN [28], scMTD constructs cell-state specific space to capture dynamical changes of transcriptomes. TSCAN is a robust method for pseudo-time reconstruction in scRNA-seq data analysis, which adopts a cluster-based minimum spanning tree (MST) approach to order cells. Specifically, cells firstly are grouped into clusters and an MST is constructed to connect the centers of clusters. Then, individual cells are projected onto tree edges to create cell-level ordering along the path. Finally, the order of a cell on the path is defined as its pseudo-time. Cell pseudo-time assists in describing the gradual transition of transcriptome and obtains insights into the transcriptome dynamics. The intuition behind the pseudo-time reconstruction is that adjacent pseudo-time cells share a similar expression state. Therefore, based on the pseudo-time of cells inferred by TSCAN, scMTD constructs cell-state specific space and utilizes information of local cell neighbors in each space to predict gene expressions of each cell. According to the new orders of cells, the gene expression matrix \(X\) is transformed into \(\tilde{X}\):
$$\tilde{X} = \left( {x_{\varvec{\cdot},(1)} ,x_{\varvec{\cdot},(2)} , \ldots \cdot,x_{\varvec{\cdot},(j)} , \cdots ,x_{\varvec{\cdot},(J)} } \right),$$
where \(x_{\varvec{\cdot},(j)}\) represents gene expressions of cell with the jth pseudo-time order, and \(\tilde{x}_{{\varvec{\cdot}},(j)}\) represents gene expressions for the jth cell of \(\tilde{X}\). Then, we divide \(\tilde{X}\) into \(S\) space, and denote \(X^{(s)} (s = 1,2, \cdots ,S)\) as the gene expression matrix corresponding the sth space,
$$X^{(s)} = \left( {\tilde{x}_{\varvec{\cdot},((m - 1)s + 1)} ,\tilde{x}_{\varvec{\cdot},((m - 1)s + 2)} , \cdots ,\tilde{x}_{\varvec{\cdot},(ms)} } \right),$$
where the parameter \(m\) represents the number of cells that belong to each space (except for the last one). The intuition behind the cell-state specific space is that cells with adjacent pseudo-time orders share a similar expression state. We set the parameter \(m\) based on the number of cells of input data,\(m = 5(\left[ {J/10^{3} } \right] + 1),\) where the function \([t]\) represents the smallest integer not less than \(t\) and \(J\) represents the number of cells of input data. More details of parameter \(m\) are in the Additional file 3: Text S1.
Cell-level imputation
Adaptive Gaussian kernel coefficient matrix
scMTD digs out information from local cell neighbors in each space and uses the cell-level information to predict the expressions of each cell. To reasonably assign weight values to local cell neighbors, scMTD utilizes the adaptive Gaussian kernel function to obtain the weights matrix \(K\) calculated based on the cell distance matrix \(D\). The weight value between cell \(j\) and cell \(j^{\prime}\) is denoted as \(k_{{jj^{\prime}}}\):
$$k_{{jj^{\prime}}} = \exp \left( { - \frac{{d_{{jj^{\prime}}}^{2} }}{{\sigma_{j}^{2} }}} \right),$$
where \(d_{{jj^{\prime}}}\) denotes the distance between cell \(j\) and cell \(j^{\prime}\), and the kernel width \(\sigma_{j}\) is the average distance between the cell \(j\) and the other cells of the space containing cell \(j\). The parameter \(\sigma_{j}\) is adaptive to the local density of cells in each space, which balances the effect of the density of cells to the weight values and avoids the over-smoothing problems among cells.
Imputation leveraging cell-level information
According to the cell-level information from the local cell neighbors in each space, the predictions for the expressions of cell \(j\) is calculated by
$$c_{\varvec{\cdot},j}^{(s)} = W\left( {k_{{j,S_{j}^{ * } }} ,x_{{\varvec{\cdot},S_{j}^{ * } }}^{(s)} } \right),$$
where weighted average function \(W({\varvec{\alpha}},{\varvec{y}}) = {{\sum\nolimits_{i = 1}^{I} {\alpha_{i} } y_{i} } \mathord{\left/ {\vphantom {{\sum\nolimits_{i = 1}^{I} {\alpha_{i} } y_{i} } {\sum\nolimits_{i = 1}^{I} {\alpha_{i} } }}} \right. \kern-\nulldelimiterspace} {\sum\nolimits_{i = 1}^{I} {\alpha_{i} } }},\)\({\varvec{\alpha}} = (\alpha_{1} ,\alpha_{2} , \cdots ,\alpha_{I} ),\) \({\varvec{y}} = (y_{1} ,y_{2} , \cdots ,y_{I} ),\) \({\varvec{\alpha}}\) is the vector of weights, \(c_{\varvec{\cdot},j}^{(s)}\) represents the predicted gene expressions of cell \(j\) in the space \(s\), and \(S_{j}^{ * }\) denotes the set of cells other than \(j\) in the space \(s\). In this section, each raw expression obtains a cell-level imputed value, and the cell-level imputed matrix is denoted as \(C\).
Gene-level imputation
Cell-state specific gene co-expression network
In this section, scMTD constructs the cell-state specific gene co-expression network for each space and leverages the information from the network to predict the expressions of genes. The gene expressions are firstly averaged across cells in each space to obtain pseudo cells, and the gene expressions for the pseudo-cell \(s\) is denoted as \(\overline{X}_{{}}^{(s)}\). The purpose of constructing pseudo-cells is to alleviate the effect of dropouts on gene-to-gene association. Here, scMTD determines the gene-to-gene relationship in the gene co-expression network by the statistical independence of two genes, similar to algorithm CSN [47] which constructs the cell-specific network by a statistical independence test in scRNA-seq data. Based on CSN, scMTD constructs the specific gene co-expression network by testing the independence of genes in the different sized neighbor areas of the pseudo-cell, which prevents the local independence problems (Additional file 3: Text S2). scMTD determines that gene \(i\) and gene \(i^{\prime}\) in each space are associated with each other if and only if they are associated in the different sized neighbor areas of the pseudo-cell.
Imputation leveraging gene-level information
Based on the gene-level information from the specific gene co-expression network in each space, the predictions for expressions of gene i is calculated by
$$g_{i,\varvec{\cdot}}^{(s)} = W\left( {r_{{i,N_{i}^{ * } }}^{(s)} ,x_{{N_{i}^{ * } ,\varvec{\cdot}}}^{(s)} } \right),$$
where the function \(W\) is the weighted average function, \(g_{i,\varvec{\cdot}}^{(s)}\) represents the predicted expressions of gene \(i\) in the space \(s\),\(R^{(s)}\) is the correlation coefficient matrix of genes in the space \(s\), and \(N_{i}^{*}\) represents the set of associated genes for gene \(i\) in the space \(s\). In this section, each raw expression obtains a gene-level imputed value, and the gene-level imputed matrix is denoted as \(G\).
Dropout probability estimation
Empirical evidence suggests that the dropout probability is a decreasing function of gene expression levels [48,49,50]. It means that genes with low expression levels are more likely to be affected by dropouts, while genes with high expression levels do not. scMTD fits the average expressions and the ratios of zeros in each space to a decreasing Logistic function by the non-linear least square method. The dropout probability \(p_{ij}^{(s)}\) for \(x_{ij}^{(s)}\) is calculated by
$$p_{ij}^{(s)} = {1} - \frac{{1}}{{{1} + e^{{(a^{(s)} + b^{(s)} x_{ij}^{(s)} )}} }},$$
where \(p_{ij}^{(s)}\) represents the dropout probability of gene \(i\) and cell \(j\) in the space \(s\), and \(a^{(s)}\),\(b^{(s)}\) are regression parameters of the Logistic model for the space \(s\). In this section, each raw expression obtains a dropout probability, and the dropout probability matrix is denoted as \(P\).
Multidimensional imputation
Finally, scMTD obtains the imputed matrix \(\hat{X}\) by combining the dropout probability with the cell-level predictions and gene-level predictions,
$$\hat{x}_{ij} = p_{ij} \left( {\frac{\beta }{{\alpha { + }\beta }}c_{ij} + \frac{\alpha }{{\alpha { + }\beta }}g_{ij} } \right) + \left( {1 - p_{ij} } \right)x_{ij} ,$$
where \(c_{ij}\),\(g_{ij}\),\(p_{ij}\),\(x_{ij}\), and \(\hat{x}_{ij}\) represent the cell-level imputation, gene-level imputation, and dropout probability, raw expression, and imputed expression of gene \(i\) and cell \(j\), respectively. The parameters \(\alpha\) and \(\beta\) represent the standard deviation of cell-level and gene-level predictions, respectively. The multidimensional imputation of scMTD maintains the genes’ original high-level expressions and only imputes dropout events, which reduces to introduce unexpected false signals to imputed data.
Evaluation of imputation performance
RNA FISH validation
We used the Torre dataset (fluidigm dataset and matched RNA FISH dataset) to compare the Gini coefficients of 19 overlapped genes (EGFR, SOX10, CCNA2, WNT5A, PDGFRB, PDGFC, SERPINE1, NGFR, NRG1, VEGFC, RUNX2, FGFR1, JUN, BABAM1, KDM5A, LMNA, KDM5B, VCL, TXNRD1) in the (raw or imputed) scRNA-seq data vs. FISH data.
Gini coefficient
The Gini coefficient of a gene was calculated to quantify gene expression distribution by using the R package reldist (version: 1.6.6).
Root mean square error (RMSE)
RMSE was used to evaluate the difference in Gini coefficients between the FISH data and (raw or imputed) scRNA-seq data, which is defined as
$$RMSE(y,\hat{y}) = \sqrt {\frac{1}{m}\sum\nolimits_{i = 1}^{m} {(y_{i} - \hat{y}_{i} )^{2} } } ,$$
where \(y,\hat{y}\) are the genes’ Gini coefficients of the FISH data and (raw or imputed) scRNA-seq data, respectively.
Correlation matrix distance (CMD)
CMD is a measure of the distance between two correlation matrices \(R_{1} ,R_{2}\) and ranges from 0 to 1, which is defined as
$$d(R_{1} ,R_{2} ) = 1 - \frac{{tr(R_{1} ,R_{2} )}}{{\left\| {R_{1} } \right\|_{f} \left\| {R_{2} } \right\|_{f} }},$$
where \(R_{1} ,\) \(R_{2}\) were calculated for gene pairs in FISH data and (raw or imputed) scRNA-seq data by the Pearson’s correlation coefficient, respectively.
Trajectory inference
We performed the trajectory inference in the raw data and imputed data of the Camp dataset that contains the single-cell transcriptome sequenced at multiple time points by using the R package TSCAN (version: 1.20.0). We used the known hepatocyte-like differentiation states as the reference labels and compared the correctness of the cell orders inferred in the raw and imputed data. Cell trajectories were visualized by the minimum spanning tree (MST) constructed to connect cluster centers of cells by using TSCAN.
Pseudo-temporal ordering score (POS)
POS [28] was utilized to characterize the consistency of the inferred cell orders and referenced labels, which were calculated by the R package TSCAN.
Kendall rank correlation coefficient
Kendall rank correlation coefficient was used to measure the consistency of the inferred cell orders and referenced labels by the R package stats (version: 3.5.2).
Differential expression analysis
We performed a differential expression analysis by utilizing the R package DESeq2 (version: 1.22.2) on the Chu (Time course) dataset containing the scRNA-seq and bulk RNA-seq data. We identified DEGs between cells from the states of 12 h and 96 h in the bulk data and (raw and imputed) scRNA-seq data, respectively. Here, genes with adjusted p-value smaller than \(10^{ - 3}\) and absolute log fold change greater than 1 were identified as DEGs, and the DEGs inferred from the bulk RNA-seq data were treated as the ground truth. The area under the curve (AUC) for the ROC curve was calculated by using the R package plotROC (version: 2.2.1). Then, for the DEGs inferred in the raw data and scMTD imputed data, we performed GO enrichment analysis and obtained the enriched GO terms with a p-value smaller than \(10^{ - 3}\) by using the R package clusterProfiler (version: 3.10.1).
Cell visualization and clustering analysis
For the Chu (Cell Type and Time Course) dataset, Brain 9 k dataset, and Romanov dataset, we compared visualization results of the raw data and imputed data by the nonlinear dimension-reduction algorithm tSNE. Then, we obtained the clustering labels of cells by k-mean on the tSNE results and used the Adjusted Rand Index (ARI), Jaccard Index, and Fowles Mallows (FM) Index to evaluate the clustering accuracy based on referenced labels provided by previous studies. We performed clustering analysis for to assess scMTD’s performance in cell visualization and the accuracy of clustering. Here, we ran each experiment 1000 times to get the stable analysis results of the clustering accuracy. ARI, Jaccard Index, and FM Index were calculated by R package clues (version: 0.5.9).
Identification of cell types
We identified cell types in the raw data and imputed data of the Chu (Cell Type) dataset by Seurat (version: 4.1.0). The main workflow of cell-type identification includes data normalizing, finding feature selection, scaling, PCA, cell clustering, UMAP, finding markers. To get the best performance for cell-type identification, we set the resolution and PCA-dimension parameters for cell clustering to 1:2 and 0.8, respectively, the PCA-dimension parameter for UMAP to 1:20, and all other parameters to default values. We identified cell types by matching marker genes of cell clusters and the reference marker genes. When the cluster-specific marker genes contain the marker genes of the reference cell type, we defined the cell cluster as the reference cell type. However, when the cluster-specific marker genes contain multiple reference marker genes for different cell types or no reference marker genes, we defined the clusters as unidentified cell types. The reference marker genes of cell types in the Cell Type dataset are shown in the Additional file 2: Table S1.
Percentage of cells correctly assigned (ACC)
ACC [51] was utilized to evaluate the accuracy of cell-type identification, which is calculated as
$$ACC = \frac{{\sum\nolimits_{i = 1}^{n} {\delta (r_{i} ,s_{i} )} }}{n},$$
$$\delta (r_{i} ,s_{i} ) = \left\{ \begin{gathered} 1,{\text{ if }}r_{i} = s_{i} \hfill \\ 0,{\text{ otherwise}} \hfill \\ \end{gathered} \right.,$$
where \(n\) is the number of cells, \(r_{i} ,\) \(s_{i}\) are the reference cell-type label and identified cell type, respectively.