基于机器学习的罕见病风险预测模型研究

Research on Rare Disease Risk Prediction Model Based on Machine Learning

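Author: Wang Wei (王伟)
  • Student ID
    2014******
  • Degree
    Master's
  • Defense date
    2022.07.12
  • Advisor
    Xia Shutao (夏树涛)
  • Discipline
    Computer Science and Technology
  • Pages
    72
  • Confidentiality level
    Public
  • Department
    024 Department of Computer Science
  • Chinese keywords
    临床预测模型,机器学习,罕见病,可解释性
  • English keywords
    clinical prediction model, machine learning, rare disease, interpretability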

Abstract

Rare diseases have a low incidence and their symptoms overlap with those of other diseases, which leads to a high misdiagnosis rate. Most rare diseases are confirmed by expensive genetic testing, so it is not feasible to test every suspected patient. Worldwide, 25% of rare disease patients wait 5 to 30 years for a correct diagnosis. The delay postpones disease control and treatment and places a heavy financial and psychological burden on patients and their families, which makes improving the diagnosis and treatment of rare diseases an important public health problem. Studies have shown that applying machine learning in medicine can effectively improve this process: a rare disease risk prediction model can quickly screen out high-risk patients, raising the diagnosis rate and lowering the misdiagnosis rate. Based on an electronic medical record system, this thesis proposes a general method for building clinical prediction models for rare diseases and applies it to light chain (AL) amyloidosis to verify its feasibility. In addition, we use SHapley Additive exPlanations (SHAP) to explain the model, which improves its potential for wider adoption. The main research content and contributions of this thesis are:

(1) A general method for building clinical prediction models for rare diseases, based on an electronic medical record system. The method uses a statistical approach to calculate the minimum sample size required to build a clinical prediction model and uses the Isolation Forest algorithm for outlier detection. From the feature variables with a missing rate of no more than 40%, it forms three feature sets: all of these variables, 10 variables selected by the principal investigator (PI), and 10 variables selected by recursive feature elimination (RFE). The best model is trained on each feature set, and the performance of the data-driven approach is compared with that of the hybrid system.

(2) Application of the method to light chain (AL) amyloidosis, which verifies its feasibility. Studies have shown that most patients with light chain (AL) amyloidosis are diagnosed only after 12 months have passed since the onset of clinical symptoms, and many may die of complications before receiving a diagnosis. Diagnostic delay is especially worrying for patients with cardiac disease, because the median survival of untreated patients with cardiac involvement is about 6 months after symptom onset. Early diagnosis and treatment of light chain (AL) amyloidosis can also effectively improve patient prognosis. This thesis applies machine learning methods to light chain (AL) amyloidosis for the first time, and the experimental results demonstrate the effectiveness of the general method.
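Two of the steps named in the abstract can be illustrated with the Python standard library alone. The sketch below is not the thesis's implementation. It assumes patient records are dictionaries in which `None` marks a missing value, and the function names `select_features_by_missing_rate` and `exact_shapley_values`, the feature names, the toy records, and the toy linear risk score are all hypothetical. The first function keeps the features whose missing rate is at most 40%. The second computes exact Shapley values by enumerating feature coalitions, with features outside a coalition set to a baseline value (one common convention, assumed here). SHAP estimates this kind of attribution efficiently; exact enumeration is practical only for a handful of features.

```python
from itertools import combinations
from math import factorial


def select_features_by_missing_rate(records, max_missing_rate=0.40):
    """Return the feature names whose share of missing (None) values is <= the threshold."""
    feature_names = sorted({name for record in records for name in record})
    kept = []
    for name in feature_names:
        missing = sum(1 for record in records if record.get(name) is None)
        if missing / len(records) <= max_missing_rate:
            kept.append(name)
    return kept


def exact_shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    Features outside a coalition take their baseline value (an assumption of
    this sketch). The cost grows as 2^n, so use it only for a few features.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        mixed = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return predict(mixed)

    shapley = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude `name`.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        shapley[name] = total
    return shapley


# Hypothetical toy records; None marks a missing measurement.
records = [
    {"age": 61, "lab_a": 900.0, "lab_b": None},
    {"age": 55, "lab_a": None, "lab_b": None},
    {"age": 70, "lab_a": 1500.0, "lab_b": None},
    {"age": 48, "lab_a": 300.0, "lab_b": 1.2},
    {"age": 66, "lab_a": 1100.0, "lab_b": None},
]
kept_features = select_features_by_missing_rate(records)


def toy_risk_score(x):
    # Hypothetical linear score, not a fitted clinical model.
    return 0.02 * x["age"] + 0.001 * x["lab_a"]


instance = {"age": 70, "lab_a": 1500.0}
baseline = {"age": 50, "lab_a": 500.0}
contributions = exact_shapley_values(toy_risk_score, instance, baseline)
print(kept_features)
print(contributions)
```

In this toy data `lab_b` is missing in four of five records, so it is dropped. Because the toy score is linear, each feature's contribution equals its coefficient times its distance from the baseline, and the contributions sum to the difference between the score at the instance and the score at the baseline.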