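Heritability is a fundamental concept in genetics. It measures the proportion of phenotypic variation in a population that is explained by genetic factors, and it helps researchers understand the genetic architecture of complex traits and the pathogenic mechanisms of complex diseases. Heritability is also crucial for polygenic risk prediction. On the one hand, it determines the upper bound of polygenic risk prediction accuracy; on the other hand, it can serve as prior knowledge in statistical modeling to improve prediction accuracy. This dissertation studies statistical methods for heritability estimation and polygenic risk prediction of complex traits, offering new insights into the genetic architecture of complex traits and theoretical support for the precision prevention, diagnosis, and treatment of complex diseases.

For heritability estimation, we propose linkage disequilibrium eigenvalue regression (LDER). By making full use of the information in the linkage disequilibrium matrix, LDER improves on the traditional linkage disequilibrium score regression (LDSC), increasing the accuracy and precision of heritability estimates while filtering out the inflation caused by confounding factors. We demonstrate the advantages of LDER both theoretically and in simulations. Among 814 complex traits from UK Biobank, LDER identified 363 significantly heritable traits, 97 of which were not identified by LDSC.

We then study the impact of participation bias on estimates of heritability and genetic correlation. In statistical genetics, estimating and correcting participation bias is difficult because the genetic information of individuals who did not participate is unavailable. We therefore develop a statistical framework based only on participating samples, which uses identity-by-descent information to estimate the genetic association between "participation" and other traits and corrects the estimation bias caused by participation bias. Applying the method to 12 complex traits from UK Biobank, we find that 8 have a significant genetic association with "participation", and that participation bias leads to underestimation of heritability and of the absolute value of genetic correlation for most of them.

For polygenic risk prediction, motivated by the high dimensionality of the predictors, we propose an empirical Bayes polygenic risk score method (EB-PRS). It estimates the prior parameters from the distribution of marginal effects across loci in genome-wide association study (GWAS) summary statistics and obtains the posterior distribution of effect sizes via the expectation-maximization algorithm, improving the prediction accuracy of polygenic risk scores without requiring external data for parameter tuning. Using 6 large-scale datasets, we show that EB-PRS substantially improves prediction accuracy over other methods.

We also propose a data-driven, unified Bayesian polygenic risk score framework (NeuPred), which applies broadly to a variety of genetic architectures and automatically selects the optimal genetic prior at the chromosome level, markedly improving prediction for complex traits. We develop a cross-validation strategy based on GWAS summary statistics, so that priors can be selected automatically even when individual-level genotype data are unavailable. Simulations and large-scale real data applications show that NeuPred has consistent and robust advantages over other methods. NeuPred also has some advantage in computational speed over current Bayesian frameworks.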
Heritability is a fundamental concept in genetic studies. It measures the genetic contribution to complex traits and provides insight into genetic architectures and disease mechanisms. It is also a starting point for polygenic risk prediction. On the one hand, heritability determines the upper bound of polygenic risk prediction accuracy. On the other hand, heritability can serve as prior knowledge to improve prediction accuracy. In this dissertation, we develop statistical models that improve the estimation of heritability and provide better prediction of complex traits, which is promising for understanding genetic architectures and for stratifying patients for precision prevention, diagnosis, and treatment.

First, we propose linkage disequilibrium (LD) eigenvalue regression (LDER) to improve the estimation of heritability and confounding effects. As an extension of the traditional LD score regression (LDSC) method, LDER makes full use of the LD information, provides more accurate estimates of SNP heritability, and better distinguishes inflation caused by polygenicity from inflation caused by confounding. We demonstrate the advantages of LDER both theoretically and through extensive simulations. Applied to 814 complex traits from UK Biobank, LDER identified 363 significantly heritable phenotypes, 97 of which were not identified by LDSC.

We further examine the impact of participation bias on the estimation of heritability and genetic correlation. In genetic studies, participation bias is challenging to evaluate or correct because the genetic information of non-participants is unavailable. To address this difficulty, we develop a statistical framework that evaluates and adjusts for the impact of participation bias on the estimation of heritability and genetic correlation, using only the genetic information of participants. Applying the method to 12 UK Biobank phenotypes, we found that 8 have significant genetic correlations with participation. For most of these phenotypes, without adjustment, the estimates of heritability and of the absolute value of genetic correlation would be biased downward.

For polygenic risk prediction, we propose a novel method, EB-PRS (empirical-Bayes-based polygenic risk score), which leverages the distribution of marginal effect sizes across all markers in genome-wide association study (GWAS) summary statistics to estimate the prior distribution from the data. We then derive the posterior distribution of effect sizes via an expectation-maximization algorithm. The method requires neither parameter tuning nor external information. Real data applications to 6 complex diseases show that EB-PRS achieves substantial improvements in prediction accuracy over standard methods with optimally tuned parameters.

To further improve the performance of polygenic risk prediction, we introduce a unified data-adaptive Bayesian regression framework, NeuPred. By allowing a wide class of prior choices, NeuPred accommodates varying genetic architectures and improves overall prediction accuracy for complex traits. To take full advantage of the framework, we propose a summary-statistics-based cross-validation strategy that automatically selects suitable chromosome-level priors. This strategy reveals striking variability in the preferred prior across chromosomes for the same complex disease, and it further improves prediction accuracy significantly. Simulation studies and real data applications demonstrate that NeuPred achieves substantial and consistent improvements in prediction accuracy over existing methods. In addition, NeuPred has similar or better computational efficiency compared with state-of-the-art Bayesian methods.
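
For readers unfamiliar with the LDSC baseline that LDER extends, the following is a minimal sketch of the standard LD score regression idea, not of LDER itself. It assumes the usual LDSC model, in which the expected GWAS chi-square statistic of SNP j is N * h2 * l_j / M + N * a + 1, where N is the sample size, M the number of SNPs, l_j the LD score, and a the confounding term. The function name `ldsc_heritability` and all toy values are hypothetical. The sketch uses unweighted least squares, whereas practical LDSC implementations use weighted regression and block-jackknife standard errors.

```python
from statistics import fmean


def ldsc_heritability(chi2, ld_scores, n_samples):
    """Unweighted LD score regression sketch (illustrative, not LDER).

    Fits chi2_j = intercept + slope * ld_score_j by ordinary least squares.
    Under the LDSC model the slope equals N * h2 / M, so h2 = slope * M / N,
    and the intercept equals 1 + N * a, where a captures confounding.
    """
    if len(chi2) != len(ld_scores) or len(chi2) < 2:
        raise ValueError("need at least two SNPs with matching chi2 and LD scores")
    n_snps = len(chi2)
    mean_l = fmean(ld_scores)
    mean_c = fmean(chi2)
    # Sum of squares of LD scores and cross-products with chi2, both centered.
    sxx = sum((l - mean_l) ** 2 for l in ld_scores)
    if sxx == 0:
        raise ValueError("LD scores must vary across SNPs")
    sxy = sum((l - mean_l) * (c - mean_c) for l, c in zip(ld_scores, chi2))
    slope = sxy / sxx
    intercept = mean_c - slope * mean_l
    h2 = slope * n_snps / n_samples
    return h2, intercept


# Toy, noise-free input built to lie exactly on the LDSC regression line.
# All numbers here are made up for illustration.
ld = [1.0 + 0.5 * j for j in range(200)]        # hypothetical LD scores
n, true_h2, inflation = 50_000, 0.4, 0.05       # inflation plays the role of N * a
chi2 = [n * true_h2 * l / len(ld) + 1.0 + inflation for l in ld]

h2_hat, intercept_hat = ldsc_heritability(chi2, ld, n)
print(f"estimated h2 = {h2_hat:.3f}, intercept = {intercept_hat:.3f}")
```

In this toy setting the slope carries the heritability signal and the intercept absorbs the confounding inflation, which is the separation that the abstract describes LDER as sharpening by using more of the LD matrix than the LD scores alone.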