A polygenic risk score predicts mosaic loss of chromosome Y in circulating blood cells

Background Mosaic loss of Y chromosome (LOY) is the most common somatic change that occurs in circulating white blood cells of older men. LOY in leukocytes is associated with increased risk for all-cause mortality and a range of common disease such as hematological and non-hematological cancer, Alzheimer’s disease, and cardiovascular events. Recent genome-wide association studies identified up to 156 germline variants associated with risk of LOY. The objective of this study was to use these variants to calculate a novel polygenic risk score (PRS) for LOY, and to assess the predictive performance of this score in a large independent population of older men. Results We calculated a PRS for LOY in 5131 men aged 70 years and older. Levels of LOY were estimated using microarrays and validated by whole genome sequencing. After adjusting for covariates, the PRS was a significant predictor of LOY (odds ratio [OR] = 1.74 per standard deviation of the PRS, 95% confidence intervals [CI] 1.62–1.86, p < 0.001). Men in the highest quintile of the PRS distribution had > fivefold higher risk of LOY than the lowest (OR = 5.05, 95% CI 4.05–6.32, p < 0.001). Adding the PRS to a LOY prediction model comprised of age, smoking and alcohol consumption significantly improved prediction (AUC = 0.628 [CI 0.61–0.64] to 0.695 [CI 0.67–0.71], p < 0.001). Conclusions Our results suggest that a PRS for LOY could become a useful tool for risk prediction and targeted intervention for common disease in men. Supplementary Information The online version contains supplementary material available at 10.1186/s13578-021-00716-z.


LOY estimation from SNP-array data
Whole blood DNA samples collected from 6,140 male ASPREE participants were genotyped using the Axiom 2.0 Precision Medicine Diversity Research Array (PMDA) following standard protocols (ThermoFisher). We followed best practice genotyping and quality control (QC) protocols from Thermo Fisher, starting from raw intensity CEL files, we used a command line custom script designed for the the Axiom PMDA array, mapped to human genome reference GRCh38, to produce variant call files.
We performed sample level QC using plink version 1.9, excluding samples for gender discordance (80 samples mismatched and excluded) using plink default F statistics threshold (≤ 0.2 female and ≥ 0.8 male), relatedness (124 indviduals excluded) using default PI-HAT threshold >0.025 to exclude one sample from each related pair.
To estimate population structure in the ASPREE cohort we performed principal component analysis (PCA) using The 1000 Genomes Project as a reference population [1]. Directly genotyped data from ASPREE and The 1000 Genomes Project 1K phase 3 (liftover to GRCh38) were merged and LD pruned (r 2 < 0.1) using plink version 1.9 [2] followed by R package SNPrelate [3]. We calculated the Z score for first 2 principal component eigenvectors and excluded samples with ± 2SD (standard deviation) of Z score compared to their respective five reference superpopulation groups from the 1000 Genomes Project that included: Europenas, South Asians, East Asians, African American (African super population) and Hispanics (Ad Mixed American) ( Figure S5). The final dataset of 12,815 samples (6,140 males) from Caucasians (Non-Finish Europeans) were selected from the ASPREE cohort for further analysis.
From the Axiom genotyping dataset we generated Log R Ratio (LRR) calls from the allele specific signal intensity data for each marker using a custom pipeline from the CEL files.
Following this, the genomewide experimental quality for each sample was assessed and 340 samples were excluded based on published criteria for sample inclusion [4]. We also exclude 669 saliva samples from the data set. After the quality controls based on sex, relatedness, ancestry, blood/saliva sample and genotyping quality; a total of 5,131 male samples were retained for LOY analysis.
Following this, the level of LOY mosaicism in the 5,131 samples passing strict QC was estimated by calculation of the mLRRY, as described previously [5]. First, the mLRRY was calculated for each sample as the median of the Log R Ratio (LRR) values of the 488 Axiom probes located within the male specific part of chromosome Y (MSY, chrY:2.787.139-22.318.450; GRCh38/hg38). We also calculated an mLRRX value (median LRR of probes positioned in the non-PAR part of chromosome X) for each sample, and observed a technical covariation with mLRRY. To adjust for this potential bias for the LOY assessment, an adjustment of mLRRY values was performed by a new approach, based on the coefficients of a linear model estimated between mLRRY and mLRRX in the unadjusted dataset ( Figure   S6). Following this, the mLRRY values over the entire dataset was adjusted using a constant defined as the peak of the unadjusted mLRRY distribution. This operation shifts the peak of the mLRRY distribution so that samples without LOY align around zero, thus improving comparability between datasets. The constant applied in the present dataset (i.e. 0.0006827) was calculated as the peak of the local regression median of the unadjusted mLRRY distribution using kernel estimation in the density function in R and the smoothing bandwidth method "SJ". We next defined a threshold in the adjusted mLRRY distribution that could be used for scoring samples with or without LOY. Samples with adjusted mLRRY values lower than the 99% confidence limit in a simulated distribution of technical mLRRY variation was scored with LOY ( Figure S7). As a final step to improve interpretability of our results, we used a modified version of a published formula [6] to translate individual mLRRY values into percentage of cells with LOY in every sample, as described in Figure S8. Finally, the concordance between LOY estimation from SNP-array and whole genome sequencing data was evaluated and shown in Figure S4. The distribution of mosaic LOY in male blood samples observed in ASPREE study in relation to age is shown in Figures S1 and S2.
Samples were WGS on the Illumina HiSeq X system with an average of 30x sequencing coverage as described previously [8]. The sequence alignment and processing were performed using reference 1000 Genomes Phase 3 decoyed version of build 37 of the human genome through Genome analysis tool kit (GATK) pipeline [9]. The joint genotype calling was performed in a single batch using GTAK GenotypeGVCFs function. To estimate the WGS level of LOY, we used ASPREE WGS samples that passed MGRB quality control criteria that includes call rate > 98%, depth standard deviation < 10, VAF standard deviation at loci called heterozygous <1, hetero:homo variant ratio of <2, inbreed coefficient form X chromosome values between < 0.2 or > 0.8 and Singleton rate < 0.001 [8]. Total 947 males from ASPREE WGS cohort were also genotyped on AXIOM SNP array chip and used to check the LOY concordance across two platforms.
To estimate LOY using WGS data, first, the program Control-FREEC (version 11.5) [10] was used to calculate a read depth ratio in 50.000 bp genomic windows on the UPPMAX Bianca cluster. The default parameters was used for other Control-FREEC settings. A mappability gem file for hg19 (read length 100 bp and up to 2 mismatches) was used in combination with the ASPREE-reference genome. To get a per chromosome read depth ratio, the median was calculated for all windows on each chromosome. Next, samples with deviating chromosomal-ratios was removed, i.e. suspected female samples (median X-ratio > 0.7), samples with suspected XYY aberration (median Y-ratio > 0.75) as well as all samples showing different types of abnormal autosomal read depth ratios (read depth standard deviation 1.5 IQR outside the third quartile).
LOY calling in the remaining 1137 male samples was performed from the ControlFREEC output. First, the density function in R with bandwidth set to "SJ", was used to find the highest density Y-ratio point (using all Y chromosome window rations) and the Scales package was used to linearly rescale the Y-ratios based on the density value. This resulted in a first estimate of LOY percentage in each sample between 0 and 100% cells with Y loss aneuploidy. To further improve these estimates, we identified normal samples (without LOY) as those showing a level of LOY within 2 standard deviations in the distribution. In these non-LOY samples, we noticed some highly variable Y-regions that were considered less informative for LOY calling. Specifically, 50.000 bp windows on chromosome Y with a standard deviation larger than 1.5 IQR from third quartile was considered highly variable and excluded. After this, the median Y-ratio was re-calculated for all male samples using only the Y-regions windows that passed this additional QC step. Thus, a final estimate of LOY percentage was achieved using the density function in R (bandwidth="SJ") and rescaled as described above in the first step. Finally, a binary threshold used for LOY scoring of samples was calculated following the same principle as described for the SNP array data. Finally, the concordance between LOY estimation from whole genome sequencing and SNP-array data was evaluated and shown in Figure S4.