Transcriptome analysis reveals an important candidate gene involved in both nodal metastasis and prognosis in lung adenocarcinoma

Lymph node metastasis of lung cancer is a serious problem. Therefore, there is a need for a detailed transcriptome study of metastatic lung adenocarcinoma. The lung adenocarcinoma RNA-seq data and the corresponding clinical information available from TCGA were analyzed. Differential expression, gradient changes, and biological pathways were carried out. Potential gene(s) associated with tumor metastasis and survival were validated by Cox regression. A total of 406 and 439 differentially expressed genes were identified for lymph node metastasis and TNM stages, respectively. Of the 296 intersection genes, 112 were associated with nodal metastasis and/or staging. Only 25 of these 112 genes with gradient changes were involved in nodal metastasis, and 13 were involved in staging. Only one gene, RN7SL494P, might be involved in lung adenocarcinoma development and poor outcome. Finally, Cox regression results verified that age, pathology classification, radiotherapy and chemotherapy are all the independent prognostic factors. In particular, RN7SL494P was further verified to be an independent factor affecting lymph node metastasis and patient survival. Furthermore, we verified the RN7SL494P function using simulation data generated by mixing cell lines of the Cancer Cell Line Encyclopedia (CCLE) and obtained consistent results. Our findings suggest a potential clinical application of the RN7SL494P as a promising marker in the evaluation of patients with primary lung adenocarcinoma, not only for predicting nodal metastasis, but also for the prognosis of the outcome.


Introduction
Lung adenocarcinoma, a histological subtype of nonsmall cell lung cancer (NSCLC), arises when healthy cells change and uncontrolled growth occurs in the outer region of the lung. Lung adenocarcinoma is the most common type of lung cancer and accounts for approximately 40% of all lung-derived cancers [1].
Lung adenocarcinoma tends to develop in smaller airways, such as bronchioles, and develops more slowly than any other types of lung cancer. Once cancerous tissues begin to grow, cancer cells may slough off. These cells may be carried in the blood or float in the lymph fluid that encompasses the lung tissue [2]. The lymph flows through lymphatic vessels into collecting lymph nodes [3,4]. When a cancer cell spreads to a lymph node or passes through the bloodstream to a distant body site, it is called metastasis.
The Cancer Genome Atlas (TCGA) project was started in 2006 [5] and a joint research project between the National Human Genome Research Institute and the National Cancer Institute. In the current study, we performed a comprehensive screening of TCGA databases for transcriptome and clinical data regarding nodal metastasis and TNM staging for patients with lung adenocarcinoma. According to the primary results, we further verify the gene(s)' function in independent data sets from the Cancer Cell Line Encyclopedia (CCLE) project [6].

Differentially expressed genes in lung adenocarcinoma
Gene differential expression analysis between lung adenocarcinoma tissues and matched normal controls identified a total of 13,118 genes that were differentially expressed of which 2800 were down-regulated and 10,318 were up-regulated. The top 10 most significantly down-regulated and top 10 most significantly up-regulated genes are shown in Additional file 1: Table S1. We included all the significantly up-regulated and down-regulated mRNAs to generate a heatmap and volcanic map to demonstrate their relative expression levels (Additional file 2: Figure S1A, B).

GO and KEGG analyses of differentially expressed genes
We conducted GO analysis for all the differentially expressed genes in the lung adenocarcinoma cases in the current study and found that the gene RN7SL494P was not involved in any biological functions or processes in the DAVID database, nor was it related to any cellular components of the database (Fig. 1a, b). KEGG pathway analysis and KOBAS was used to functionally annotate the differentially expressed genes. After identifying the key KEGG pathways, we determined that RN7SL494P was not associated with any of the KEGG pathways (Additional file 3: Table S2). Functional annotation of the differentially expressed genes using the clusterProfiler Supplement R package also failed to identify any RN7SL494P-related KEGG pathways (Additional file 4: Table S3). GO analysis results showed that upregulated DEGs were significantly enriched in extracellular exosome, membrane, and mitochondrion (Fig. 1a). Downregulated DEGs were mainly significantly enriched in the cytoplasm, nucleus, cytosol, nucleoplasm, and protein binding (Fig. 1b). Therefore, we concluded that a single gene functional enrichment method associated with the specific gene would be used as a subsequent step of the study.

Differentially expressed genes associated with nodal metastasis or TNM stage
Based on the features of lymph node metastasis for the subjects listed in Additional file 5: Table S4, a total of 406 differentially expressed genes were identified. Of the differentially expressed genes, 312 were  Figure S1C, D). The top 10 most significantly down-regulated and top 10 most significantly up-regulated genes associated with cancer metastasis are shown in Table 1. Similarly, the TNM staging-related differentially expressed genes are shown in Additional file 2: Figure S1E, F with the top 10 most significantly down-regulated and top 10 most significantly up-regulated genes shown in Table 1.

Overlapping differentially expressed genes associated with nodal metastasis and TNM stages
Venn diagram analysis was performed to visualize the overlapping differentially expressed genes between lymph node metastasis and TNM stages. The VennDiagram R package was used and 296 overlapping genes were identified (Fig. 2a).

Gradient changes of differentially expressed genes associated with nodal metastasis and TNM stages
We analyzed the gradient changes of differentially expressed genes in lymph node metastasis (from N0 to N2) and TNM stage (from I to IV) using the Kruskal-Wallis test. Since there were only two samples with a metastasis score of N3, this subgroup was not considered in this analysis. A total of 112 differentially expressed genes were associated with the gradient changes of lymph node metastasis, TNM stage, or metastasis and TNM stage ( Table 2). Among the 112 differentially expressed genes, 25 were associated with lymph node metastasis, 13 with TNM stage, and 7 genes (SCARNA7, AC105999.2, RANBP20P, RN7SL151P, SYNPR, AL512638.1, and TMIGD1) were associated with both lymph node metastasis and TNM stage.

Survival rates and differentially expressed genes associated with nodal metastasis and TNM stage
We analyzed patient survival time relative to all 30 differentially expressed genes that were associated with the gradient changes on lymph node metastasis and/ or TNM stage. Only one gene (RN7SL494P) was found to correlate with patient survival time (Table 2 and Fig. 2b). RN7SL494P was also associated with the gradient changes of lymph node metastasis with p = 0.02587 for N0 vs. N1 vs. N2 (Fig. 2c) and p = 0.006 for N0 vs. N1 vs. N2 (Fig. 2d). However, RN7SL494P was not associated with the gradient changes of TNM stage (p = 0.057; Fig. 2e).

sGSEA of pathways
Evaluation of the associations between RN7SL494P expression and any cancer-related pathways was performed and renin angiotensin system, JAK-STAT signaling pathway, et al. were the enriched pathways associated    with higher expression of the gene RN7SL494P (Fig. 3a).
On the other hand, the genes co-expressed with the lowexpression of RN7SL494P were associated with biological or pathological pathways including basal transcription factors, spliceosome, oxidative phosphorylation, nucleotide excision repair, DNA replication and among others (Fig. 3b). These typical results are shown on a GSEA diagram at the same time (Fig. 3c). These findings suggested that low-expression of RN7SL494P might be associated with cancer development and poor outcome in patients with lung adenocarcinoma.

Cox regression models
Univariate Cox analysis found that the increased expression of RN7SL494P would reduce the risk of death in patients (HR 0.78, p = 0.020). The patients who did receive radiotherapy, or who had a higher grade of pathology, or who had metastasis, or who had lymph node involvement, had a greater risk of death (all HR > 1, all p < 0.05) (Additional file 6: Table S5).
In multivariate Cox regression analysis, we found the expression of RN7SL494P still was an independent prognostic factor (HR 0.78, p = 0.028). This further proves that this gene is a prognostic factor of lung cancer. In addition, age, stage_T and stage_N were the fisk factors, these further suggest that lymph node metastasis will lead to a worsening prognosis in patients with lung adenocarcinoma. Interestingly, the effects of radiotherapy and chemotherapy may be reversed, that is, radiotherapy may result in reduced efficacy and poor prognosis; but chemotherapy can significantly extend the survival time of such patients (Fig. 4).

Co-expressions genes of RN7SL494P in CCLE
We downloaded the lung cancer cell lines' raw counts of the expression profiling from the CCLE database. The coexpression genes with RN7SL494P were calculated with a 0.2 co-expression coefficient threshold. The 30 up coexpression genes and 30 down co-expression genes were selected to construct a co-expression heatmap (Fig. 5a).

The functional verification of enrichment and pathway of the co-expression genes in CCLE
GO analysis results showed that the above co-expression genes of RN7SL494P were significantly enriched in cholesterol and lipid transport and homeostasis, cell membrane transport function, and so on (Fig. 5b).
KEGG analyses were performed to investigate the biological functions and pathways associated with the RN7SL494P identified. The results show that the coexpression genes of RN7SL494P were mainly enriched in ABC transporters, Hedgehog signaling pathway, PPAR  Table S6, and Fig. 5c).

Discussion
Many patients are diagnosed with cancer metastasis, which usually makes treatment more difficult. The 5-year survival rate for patients with metastatic lung cancer is approximately 1% [7]. When tumors spread outside the lungs, they may be difficult to successfully treat and cure. Since no single best treatment exists for patients with metastatic lung cancer, the choice of treatment strategies depends on the tumor location, size, and stage, as well as the cancer subtype and the lymph nodes involved. Scientists and clinicians have attempted to exploit methods that allow cancer patients to be screened for metastasis. The main goal of screening is to reduce the number of people that die from cancer, especially metastatic cancer. To investigate the "drive genes" in metastatic lung adenocarcinoma, we examined the differentially expressed genes in the RNA-seq repository data of TCGA. We comprehensively analyzed gene expression in patients included in the database that had lung We identified the differentially expressed genes associated with lymph node metastasis and TNM stage in lung adenocarcinoma. We also found that the gene RN7SL494P not only possessed the above characteristics, but also demonstrated prognostic significance for metastatic lung adenocarcinoma. Subsequent analysis of RN7SL494P using sGSEA further demonstrated the functions and roles of RN7SL494P.
RN7SL494P (7SL) is located on chromosome 15q21.2 and belongs to a long noncoding RNA (lncRNA) class pseudogene. As a small eukaryotic cytoplasmic RNA, 7SL RNA is essential for translocation of a protein that binds to the ribosome and targets the nascent protein in the endoplasmic reticulum to be secreted or inserted into the membrane during the assembly of human signal recognition particles (SRP) [8,9]. A study using RNA sequencing data from 11 human tissues showed that 7SL was the highest expressed non-coding RNA (ncRNAs) and was an order of magnitude higher than any mRNA detected [10]. 7SL stimulates GTPase activity of SRP and its signal receptor (SR) complex [11,12].
Defines a set of genes based on previous biological experiments, for example, knowledge about co-expression or biochemical pathways. A recent study showed the S-structure domain of 7SL RNA is related to cellular activity in mitochondria [13]. Furthermore, in addition to the nucleotide excision repair function, the results of sGSEA demonstrated that RN7SL494P was associated with DNA replication, transcription factor, spliceosome, oxidative phosphorylation and JAK-STAT signaling pathway. Thus, RN7SL494P (7SL) may play a role in the DNA replication, transcription, translation and assembly of peptides and its dysfunction may have pathological consequences. CCLE can be a good complement to the TCGA database to improve tumor data mining. We set a validation cohort to attain external validation, and the subsequent results of RN7SL494P's function were supportive.

Fig. 4 The multivariate Cox regression analysis of related clinical parameters and RN7SL494P in lung adenocarcinoma
We found that the high expression of RN7SL494P improved tumor survival rates in patients with lung adenocarcinoma (high-expression 41.80% vs. low-expression 39.70%; Fig. 2b). Yang et al. [14] found that the over-expression of FOXP3 is able to inhibit the transcription of 7SL mRNA by binding to its promoter and subsequently increases the translation of p53, which results in suppressing the growth of multiple tumors (lung cancer was not included). The findings from the current study suggest that the 7SL mRNA transcribed from the RN7SL494P gene may be a direct target of FOXP3 and may be enmeshed in the FOXP3/p53 feedback loop. If true, this would be consistent with the fact that there are many complex regulatory networks involved in the process of tumor formation. We speculate that the gene RN7SL494P may exhibit "inconsistent functions" in different tumor microenvironments.
In the current study, we used the information available from the TCGA database to analyze the expression of genes in patients with lung adenocarcinoma. We found that the gradient change in expression of RN7SL494P (7SL) was clearly associated with nodal metastasis. In addition, its expression correlated with its prognostic value. These findings were validated by Cox regression analysis, in particular, the function of RN7SL494P (7SL) was verified by the independent CCLE data set.
The present study presented certain limitations. Firstly, data selection from the TCGA database may potentially cause selection bias, since this is prevalent in all nonprospective, nonrandomized studies. Secondly, the CCLE database does not include clinically meaningful variables, therefore, only the function of genes and their coexpressed gene sets can be verified, but the survival time can not be verified. Thirdly, due to technical reasons, it The GO analysis of the co-expression genes of RN7SL494P. c KEGG analyses of the biological functions and pathways. The co-expression genes of RN7SL494P were mainly enriched in ABC transporters, Hedgehog signaling pathway, PPAR signaling pathway, and Non-homologous end-joining (p < 0.05) is impossible to establish a smooth working relationship with the clinical departments of a hospital in a short term, so it is temporarily unable to conduct tests in clinical practice.
In conclusion, our results suggest that the over-expression of RN7SL494P could significantly reduce lymph node metastasis and improve the survival of patients. Meanwhile, age, pathology classifications, and treatment (radiotherapy and chemotherapy) may also affect patient survival in lung adenocarcinoma.

The lung adenocarcinoma data and pipeline
The lung adenocarcinoma data (mRNA expression data and clinical data) from the National Cancer Institute's Genomic Data Commons (GDC) portal (https ://porta l.gdc.cance r.gov/repos itory ) were downloaded on August 5, 2017, using GDC-client.exe software. This provided 594 level-3 RNA-seq hits (515 cases) and 522 clinical XML datasets. The clinical data are shown in Additional file 5: Table S4. The expression data were obtained for each of the lines using Affymetrix U133 Plus 2.0 arrays from the CCLE were downloaded from the website (https ://porta ls.broad insti tute.org/ccle) directly. The data are open to the public under certain guidelines. Therefore, confirm that all written informed consent has been achieved. The pipeline and details of the study are shown in Fig. 6.

Differential gene expression analysis
Differential gene expression based on the RNA-seq data was analyzed using the edgeR software package [15], which involved empirical Bayesian estimations and accurate tests based on the negative binomial distributions. As edgeR suggested, genes with very low reads are often not of interest in differential expression analyses; therefore, the average count-per-million (CPM) was an important criterion used to define whether a gene was expressed at a reasonable level for inclusion. The edgeR software reported log2 fold change, log2 counts per million, the corresponding statistical significance, and their corresponding error discovery rates. The up-regulated and down-regulated differentially expressed genes were selected based on these parameters.

Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis
GO provides a platform for the hierarchically sorting of genes or their products by terms that fall into the three following categories: molecular functions (molecular activity), cellular component (functional gene products), and biological processes (cellular or physiological effects) [16][17][18]. The Database for Annotation, Visualization, and Integrated Discovery (DAVID) version 6.7 was used to perform the functional annotation analysis [19] and the ggplot2 and the GOplot R packages were used to view the results.
We used the KEGG Orthology Based Annotation System (KOBAS) algorithm [20] and the R package cluster-Profiler package to analyze the KEGG pathway of gene differential expression [21]. The genes from the lung adenocarcinoma RNA-seq that exhibited significant upward and downward differential expression were analyzed. A difference with a p-value less than 0.05 was considered significant for the screening criterion.

Gene Set Variation Analysis (GSVA) of KEGG pathways
A comprehensive human gene annotations document (c5.all.v5.2.symbols.gmt) for the GO function category was downloaded from the Molecular Signatures Database (MSigDB) [22]. The Gene Set Variation Analysis Fig. 6 The pipeline of this study. The RNAseq data and clinical data for lung adenocarcinoma were first downloaded from the TCGA. RNA-seq data were used to analyze gene differential expression, and perform GO and KEGG functional analysis. Clinical data combined with lymph node metastasis and TNM analysis of differential genes. And the gene set associated with both lymph node metastasis and TNM stage was obtained. Then the survivals of the intersection genes were analyzed by Kruskal-Wallis algorithm to find the target gene. The function of the single gene GSEA of the target gene was then studied. The prognosis between the gene and the clinical variable was verified by the Cox regression analysis of single gene and multivariables. Finally, in the CCLE database, the co-expression genes of the target gene and the enrichment and signaling pathway analysis in which these genes are involved are further verified (GSVA) algorithm [23] was used to perform an analysis of the mRNA-SEQ data according to enrichment scores to reduce the data from an abundance of transcriptional activity at the gene level to transcriptional activity according to gene function.

The Kruskal-Wallis test
For the analysis of differential expression associated with cancer metastasis and cancer staging, the clinical data regarding lymph node metastasis and TNM stage were selected. The Kruskal-Wallis test was used to analyze the differential expression among multiple cancer groups (N0, N1, N2, and possibly N3; and TNM stage I, II, III, and IV). As shown in Eq. 1, the Kruskal-Wallis test by grade is a nonparametric substitution method for oneway analysis of variance (ANOVA) that expands the double-sample Wilcoxon test when more than two groups are compared [24].
where s 2 is the sample variance; k is the number of groups; R i is the total for the ith row; n i is the size of the ith group; and N is the total number of observations.

Survival analyses
Two risk groups were established according to the cutoff values derived from the median expression levels of the corresponding genes in the analysis of the association between gene expression and patient prognosis. The Kaplan-Meier test and the Kruskal-Wallis log-rank test were carried out to evaluate the differences in survival rates between the two risk groups. A p-value of less than 0.05 was considered to be statistically significant.

Gene Set Enrichment Analysis (GSEA) and single-GSEA (sGSEA)
GSEA was used to assess the data on genomic expression levels. Relative to the median expression of the hub genes, the 515 lung cancer samples from the RNAseq data were divided into two groups, high-expression and low-expression samples. These two GSEA groups were used to identify the potential functions of the hub genes with the c5.all.v5.2.symbols.gmt annotations being selected as the reference gene sets. Nominal differences with p < 0.05, false discovery rate (FDR) < 0.05, and enrichment score (ES) > 0.6 were defined as the cutoff standards.
The only gene related to the gene sets from the MSigDB [25] that was identified in the study to correlate with metastasis and prognosis (RN7SL494P) was used to determine whether the sets showed statistical differences between the low-expression and high-expression categories. The analysis was performed using the java-dependent GSEA 3.0 software package [26].

Univariate and multivariate Cox analysis
Cox proportional risk regression analysis is applicable to quantitative prediction variables and classification variables. The aim of the model is to assess the impact of several factors on survival simultaneously. In other words, it allows us to examine how specific factors affect the incidence of specific events (e.g., infection, death) that occur at specific points in time. This rate is often called the risk rate. Predictors (or factors) are commonly referred to in the survival analysis literature as covariates. Possible variables affecting survival time and survival status of lung adenocarcinoma, including age, gender, smoking, whether to receive radiotherapy, whether to receive chemotherapy, and tumor grading, were included in univariate and multivariate Cox regression analysis to determine whether the target genes found above also affect the survival of lung adenocarcinoma.

The functional verification of RN7SL494P in CCLE lung cancer lines
Cell line name annotation and RNA-seq data were downloaded from the CCLE database, and the "lung cancer" matrix was extracted by Perl and R. The co-expression gene set of RN7SL494P was analyzed and the coexpression heatmap was drawn. Finally, GO and KEGG functional enrichment analyses were performed on the co-expression genes of RN7SL494P.