随着深度测序技术的迅速发展,基因组、转录组、表观遗传组等多组学测序数据迅速积累,为发现生物体的细胞类型构成、理解细胞内基因调控机制,进而解析重大遗传疾病发生发展的生物学机制提供了丰富的数据资源。然而,全方位解读这些生物大数据,目前还面临利用生物大数据推理复杂生物知识不够精确、对生物大数据多源异质协同分析不够细致等局限。近年来,以深度学习为代表的人工智能技术在多个领域已取得突破性进展,为解决上述关键问题提供了强有力的手段。本论文以染色质开放性数据的信息解读为主线,通过融合多种组学数据的方式,研究预测染色质开放性的机器学习方法、探索单细胞染色质开放性数据分析的理论与方法。主要研究内容及创新成果包括:

1. 针对染色质开放性预测问题,提出了一种整合基因组序列与进化保守性的随机森林方法 kmerForest,实现了在给定细胞系下基因组的染色质开放性预测。进一步提出了整合基因组短片段词频的混合卷积神经网络模型 Deopen,实现了染色质开放性信号的二值分类与连续值回归。大规模交叉验证显示上述方法的预测性能优于已有方法,且预测结果对遗传学数据的分析具有促进作用。

2. 针对跨细胞系染色质开放性预测问题,提出了一种融合基因组注释及转录组数据的密集连接卷积网络模型 DeepCAGE。通过利用已有生物学先验知识,有效提升了模型的预测性能,并进一步建立了基于染色质开放性解析复杂表型相关非编码区遗传因素的分析方法,应用于复杂表型研究中。

3. 针对基于单细胞染色质开放性数据的细胞类型发现问题,提出了一种循环生成对抗网络模型 scDEC,首先从概率密度估计的角度论证了该模型的理论基础,并在细胞聚类等一系列任务中展现了模型的优异性能,还实现了单细胞染色质开放性与单细胞基因表达数据的协同分析。该模型在对细胞聚类的同时能对单细胞染色质开放性进行降维表示,从而促进了后续细胞轨迹推断、细胞调控机制解析的研究。

本论文从“数据融合,信息迁移”的观点,系统性地研究了细胞群与单细胞染色质开放性数据分析中的关键问题,对生物数据解读中的概率密度估计等共性基础问题进行了创新性探索,研究成果不仅能对大规模染色质开放性数据高效分析,还能加强对细胞调控机制的深入理解,从而促进对遗传学数据的有效解读。

With the rapid development of deep sequencing technology, multi-omics data, including genomic, transcriptomic, and epigenomic sequencing data, have accumulated quickly. These data provide a rich resource for discovering the cell types that constitute organisms, understanding the mechanisms of gene regulation within cells, and elucidating the biological mechanisms underlying the onset and progression of major genetic diseases. However, comprehensive interpretation of these biological big data still faces limitations, such as insufficient accuracy in inferring complex biological knowledge from the data and insufficiently fine-grained joint analysis of multi-source, heterogeneous data. In recent years, artificial intelligence technology, especially deep learning, has made breakthrough advances in many fields, providing a powerful tool for addressing these key problems. This thesis focuses on the computational interpretation of chromatin accessibility data. By integrating multi-omics data, it investigates machine learning methods for predicting chromatin accessibility and explores theories and methods for single-cell chromatin accessibility analysis. The main research contents and contributions are as follows.

First, for the problem of chromatin accessibility prediction, we propose kmerForest, a random-forest-based method that integrates genome sequence and evolutionary conservation. It enables binary prediction of chromatin accessibility across the genome in a given cell line. We further propose Deopen, a hybrid deep convolutional neural network that integrates the word frequencies of short genomic fragments. It supports both binary classification and continuous-valued regression of chromatin accessibility signals.
Large-scale cross-validation experiments show that these methods outperform existing methods and that their predictions can aid the analysis of genetic data.

Second, for the problem of cross-cell-type prediction of chromatin accessibility, we propose DeepCAGE, a densely connected convolutional network model that fuses genome annotation and transcriptomic data. Exploiting existing biological prior knowledge effectively improves the prediction performance of the model. We further establish an analysis approach, based on chromatin accessibility, for interpreting the genetic factors in non-coding regions associated with complex phenotypes. This approach was successfully applied to the study of complex phenotypes.

Third, for the problem of cell type discovery from single-cell chromatin accessibility data, we propose scDEC, a cycle generative adversarial network model. We first establish the theoretical basis of scDEC from the perspective of probability density estimation, and then demonstrate its superior performance in a series of tasks such as cell clustering. scDEC also enables the joint analysis of single-cell chromatin accessibility and single-cell gene expression data. The model simultaneously clusters cells and learns a low-dimensional representation of single-cell chromatin accessibility data, which facilitates subsequent studies of cell trajectory inference and cellular regulatory mechanisms.

This study systematically investigates key problems in the analysis of bulk and single-cell chromatin accessibility data from the perspective of “data integration, information transfer”, and explores fundamental problems common to the interpretation of biological data, such as probability density estimation.
The research findings can not only help analyze large-scale chromatin accessibility data efficiently, but also promote a deeper understanding of cell regulation mechanisms and facilitate the effective interpretation of genetic data.