Sample collection and public data source
Stool samples were collected from individuals undergoing colonoscopy at the Seventh Medical Center of Chinese PLA General Hospital. The cohort included individuals with digestive symptoms to outpatient gastroenterology clinics, as well as asymptomatic individuals aged 50–85 years. The exclusion criteria were: (1) use of antibiotics within the past 3 months; (2) vegetarian diet; (3) surgery or invasive procedure within the past 3 months; (4) IBD; or (5) history of any cancer. After the written informed consent, each individual was asked to provide a stool sample before bowel preparation. The stool samples were divided and stored at – 80 ºC within 1 h for further analysis.
The public data of samples downloaded from three previous studies,128 samples were from Yu et al. (PRJEB10878), 156 samples were from Feng et al. and 199 samples were from Zeller et al. (ERP005534) [9, 13, 14]. All the sequences of the 483 samples were obtained using an Illumina HiSeq platform with the PE100 sequencing strategy. The public data included 616 samples in validation cohort 1, which were downloaded from the DDBJ Sequence Read Archive (DRA006684 and DRA008156). (see Additional file 2: Table S2) [15]. The 616 samples contained 291 control subjects (NC-V group), 67 multiple polypoid adenomas with low-grade dysplasia (MP group), 73 high-grade dysplasia (HD-V group), and 111 CRC at stage I/II, and 74 CRC at stage III/IV. The public data included 140 samples in validation cohort 3 from Andrew et al. (PRJNA447983) [16]. In this study, advanced adenoma (AA) was defined as adenoma with a ≥ 25% villous component, a diameter ≥ 10 mm, or adenoma with high-grade dysplasia. The stages of invasive adenocarcinoma were determined from the surgically resected specimens with the use of the American Joint Committee on Cancer (AJCC) staging system.
DNA extraction and whole-genome shotgun sequencing
DNA was extracted from each frozen fecal sample using a QIAamp Power Fecal DNA Kit (Qiagen str. 140724, Hilden, Germany) according to the manufacturer’s protocols. The concentration of the extracted DNA was measured with Qubit, and the molecular size was estimated by agarose gel electrophoresis. To find progressive markers that satisfied the requirement of continuous increase or decrease from controls to advanced adenoma and CRC across different cohorts or different sequencing platforms, a HiSeq XTen platform (Illumina) was used in the samples of discovery cohort 1, while sequencing libraries were generated with a TruSeq DNA Sample Prep v2 Kit (Illumina, Inc., San Diego, CA, USA). However, whole-genome shotgun sequencing of the samples of discovery cohort 2 was carried out on a BGISeq-500 platform, and sequencing libraries were generated with an MGIEasy DNA Sample Pre Kit. The DNA library quality was both confirmed with a Qubitand Agilent 2100.
Criteria for CRC samples in discovery cohort 1
After an analysis of the difference in intestinal microbiome between the AA group and the controls group (discovery cohort 1), we aimed to further investigate the richness of these differential markers in the CRC group. However, in discovery cohort 1, we did not collect CRC samples, so we selected some of them from the downloaded public raw data. We set the criteria for selecting CRC samples in discovery cohort 1 as follows: (1) sequencing platform: Illumina HiSeq PE100; (2) region of origin of the samples: China; (3) state: CRC; (4) samples in the same branch as the 97 samples from discovery cohort 1 in the Bray–Curtis tree (see Additional file 1: Fig. S3). Finally, we selected 32 CRC samples from the previous three studies (Fig. 1).
Quality control of metagenomic sequences and phylogenetic abundance profiling
Metagenomic reads with more than 50% low-quality bases (quality ≤ 20) or more than five ‘N’ (bases not identified) were excluded. Furthermore, the reads that could be mapped to the human genome (GRCh38) through SOAP alignment (v2.21) were discarded. The remaining reads in each sample were considered to be high-quality reads.
The sequencing reads of all of the samples in discovery cohort 1 were aligned to reference genomes collected from NCBI and HMP. Phylogenetic abundance profiling was performed as described by Wen et al. [17]. A Bray–Curtis tree analysis of the 97 samples collected in this study and 300 disease case samples from previous studies was constructed using the vegan R package based on genus abundance profiling (see Additional file 1: Fig. S3).
Gene catalog construction gene abundance profiling
All the reads of the samples in discovery cohort 1 were assembled by SOAPdenovo (v2.04). Four k-mers 51, 55, 59, and 63 were used, and the N50-highest k-mer was chosen. After the scaffold was assembled, the scaffolds were split into ‘scaftigs’ by removing ‘N’ bases, and scaftigs of length less than 500 bp were excluded. The genes were predicted from the remaining scaftigs by MetaGeneMark (prokaryotic GeneMark.hmm version 2.8), and the genes with a length less than 100 bp were filtered. All the genes were clustered using CD-HIT (v4.5.7) to construct a non-redundant gene catalog. Gene abundance profiling was performed as described by Wen et al. [17]. The protein sequences in the gene set were aligned to the Kyoto Encyclopedia of Genes and Genomes (KEGG) protein database using BLAST with parameters-p-rot-min score = 60-out = blast8.
Fecal immunochemical test
The FIT was performed using the Eiken OC-Sensor (Eiken Chemical Co., Ltd., Tokyo, Japan) to quantify the fecal hemoglobin. Stool samples with a hemoglobin greater than 20 µg/gm of dry stool (equal to 100 ng/mL of the buffer) were considered positive according to the manufacturer’s instructions.
The formula of the diagnosis model with different effect indexes
Dichotomy thought and the standard least-squares method were used for the modeling of each effect index with JMP 10 software. For the Adenoma Effect Index1 (AEI1), the initial scores of the controls were 1, and the initial scores of samples with advanced adenoma and above (high-grade dysplastic adenomas and all CRC patients) were 50. For Adenoma Effect Index2 (AEI2), the initial scores of all the controls were 1, and the initial scores of the high-grade dysplastic adenomas and above (all the CRC patients) were 50. For CRC Effect Index1 (CEI1), the initial scores of all the controls were 1, and the initial scores of stage I/II CRC patients and above (stage III/IV CRC patients) were 50. For CRC Effect Index2 (CEI2), the initial scores of all the controls were 1, and the initial scores of stage III/IV CRC patients were 50. Then, the training of the model was carried out by fitting the scores through the standard least-squares method. After fitting, if the majority of the scores were unsatisfactory, the good scores were picked to replace 1 or 50, and the training was carried out again until there was little increase in sensitivity or specificity. Finally, we obtained the following formulas for AEI1, AEI2, CEI1, and CEI2.
$${\text{AEI1}} = \sum\limits_{i = 1}^{277} {e_{i} \cdot \{ \ln [abun(x_{i} )} + {0.0000000001}] + {23}{{.026\} }} + b_{e} ,$$
where ei represents the parameters of AEI1 (see Additional file 2: Table S8), xi represents the gene IDs, be is the intercept of AEI1, and abun(xi) is the relative abundance of the progressive disease gene markers in the intestinal microbiome in each sample (see Additional file 2: Table S9).
$${\text{AEI2}} = \sum\limits_{i = 1}^{277} {f_{i}} \cdot \{ {\rm ln}[abun(x_{i} ) + 0.0000000001] + 23.026\} + b_{f} ,$$
where fi represents the parameters of AEI2, and bf is the intercept of AEI2.
$${\text{CEI1}} = \sum\limits_{i = 1}^{277} {g_{i} \cdot \{ {\rm ln}[abun(x_{i} )} + 0.0000000001] + 23.026\} + b_{g} ,$$
where gi represents the parameters of CEI1, and bg is the intercept of CEI1.
$${\text{CEI2}} = \sum\limits_{i = 1}^{277} {h_{i}} \cdot \{ {\rm \ln} [abun(x_{i}) + 0.0000000001] + 23.026\} + b_{h} ,$$
where hi represents the parameters of CEI2, and bh is the intercept of CEI2.
Combinations of FIT with CEI1 and CEI2
The formula for the combination of FIT and CEI1 is:
\(score1 = 23.710037 + {4}{\text{.082365}} \times {\text{FIT}} + {0}{\text{.147425}} \times {\text{CEI1}}\), with a cutoff of 28.55 (see Additional file 2: Table S10).
The formula of the combination of FIT and the CEI2 is:
\(score2 = - 0.87127 + {21}{\text{.769347}} \times {\text{FIT}} + {0}{\text{.332905}} \times {\text{CEI2}}\), with a cutoff of 22.75 (see Additional file 2: Table S10).
Statistical analysis
The differential species-level markers between controls and patients with advanced adenoma in discovery cohort 1 were tested by Wilcoxon rank-sum test, and a P-value of 0.05 or less was considered statistically significant. These differential species markers were further filtered using the mRMR algorithm (side-channel attack R package), and the top 100 markers were used for the next step. Finally, the 25 markers with the highest MCC were selected to build an SVM classifier (the e1071 R package). The progressive disease markers were identified by the Kruskal–Wallis test with the threshold FDR < 0.05 and the median abundance values of the control group, advanced adenoma group, and CRC group should be increased (or decreased) by more than 1.5 times continuously in the discovery cohort 1, and the markers should also be through the same requirements in discovery cohort 2 (see Additional file 2: Table S3). Areas under the receiver operating characteristic curves (AUCs) were constructed using the pROC R package. Spearman correlation coefficients by using an R script were calculated in the abundance profiling of the progressive microbiota biomarkers and the clinical information of subjects (see Additional file 2: Table S1). If the clinical data was ‘NA’, the abundance profiling of the sample was eliminated from the abundance profiling table. A heatmap of the spearman correlation coefficients was constructed using R packages g-plot and RColorBrewer. For a comprehensive description of the details of the establishment of the diagnostic model, see online supplementary methods.