几何视角下基于生物序列数据的分类分析研究

A Study of Classification Analysis Based on Biological Sequence Data from a Geometric Perspective

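Author: 周涛 (Zhou Tao)
  • Student ID
    2020******
  • Degree
    Doctorate
  • Email
    zho******.cn
  • Defense date
    2025.05.14
  • Advisor
    丘成栋 (S.S.-T. Yau)
  • Discipline
    Mathematics
  • Pages
    98
  • Confidentiality level
    Public
  • Department
    042 Department of Mathematics
  • Chinese keywords
    基因组空间;序列分类;子空间距离;自然向量法;凸包原理
  • English keywords
    genome space; sequence classification; subspace distance; natural vector methods; convex hull principle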

Abstract

With the rapid accumulation of biological sequence data, classifying and analysing large numbers of sequences has become an important problem. The key to this problem is finding a suitable measure that characterises the similarity between sequences. At the beginning of this century, the Defense Sciences Office of the Defense Advanced Research Projects Agency proposed 21 mathematical problems to be solved in this century. One of them is the geometry of genome space, which aims to find a class of metrics that reflect biological utility.

A classical class of methods for sequence classification and comparison is based on sequence alignment. These methods take two or more sequences of possibly different lengths and insert gaps so that the lengths match. A match score is then computed at each position through a predefined scoring model, and a suitable algorithm searches for the gap placement that optimises the total score. The resulting total score serves as the similarity score between the sequences. Methods of this type take the sequences themselves as the objects of comparison and can compare sequence information globally or locally, which has made them widely used in genomics. From a mathematical point of view, however, they do not yield a mathematically rigorous metric and are therefore not suitable as tools for exploring genome space. For this reason, this study aims to develop sequence classification and analysis methods that are biologically interpretable and have good geometric properties, and to apply them to specific biological data.

For the concept of genome space, previous studies have mainly involved two types of biological objects. One is the genome sequence itself, treated mainly in the work on the natural vector method by S.S.-T. Yau et al. The other is the gene expression profile of a single cell, a construction first proposed by G. Resconi et al. in 2005. This study discusses genome space for each of these two types of objects.

For genome sequences, this study transforms each sequence into a matrix using the frequency matrix method of chaos game representation, then truncates the column space of the matrix to form a subspace representation of the sequence. The generalised Grassmann distance on the Grassmann manifold can then be used as a distance metric between sequences. As an application, this study classifies and analyses the genome sequences of SARS-CoV-2 and of the subfamily Orthocoronavirinae to which it belongs; this work was carried out during the SARS-CoV-2 outbreak. This part of the study constructs and explores the genome space of coronaviruses with the help of manifolds as geometric objects.

For the gene expression profiles of single cells, this study proposes an expression-profile-weighted natural vector method, which combines expression profile data with gene sequence data to give a natural vector representation of single cells. The study further verifies the convex hull principle for this representation, extending this geometric property of the natural vector method from genome sequence data to single-cell expression profile data. As an application, the convex hull principle is verified on datasets covering a variety of cell types, which further broadens the scope of application of the natural vector method.
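The genome-sequence pipeline summarised in the abstract (chaos game representation frequency matrix, truncated column space, distance on the Grassmann manifold) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the thesis: the corner layout in `CORNER`, the k-mer length `k`, the truncation rank `r`, the use of an SVD to truncate the column space, and the function names. The distance shown is the standard geodesic distance between subspaces of equal dimension; the generalised Grassmann distance used in the thesis, which also handles subspaces of different dimensions, is not reproduced.

```python
import numpy as np

# Assumed corner layout for the chaos game representation (not from the thesis).
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Return the normalised 2^k x 2^k k-mer frequency matrix of a sequence."""
    n = 1 << k
    m = np.zeros((n, n))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(ch not in CORNER for ch in kmer):
            continue  # skip k-mers containing ambiguous bases
        x = y = 0
        # The last nucleotide fixes the coarsest quadrant, so it gets the top bit.
        for j, ch in enumerate(kmer):
            bx, by = CORNER[ch]
            x |= bx << j
            y |= by << j
        m[y, x] += 1
    total = m.sum()
    return m / total if total else m


def column_subspace(m, r=2):
    """Orthonormal basis of the leading r-dimensional part of the column space."""
    u, _, _ = np.linalg.svd(m, full_matrices=False)
    return u[:, :r]


def grassmann_distance(q1, q2):
    """Geodesic distance between two subspaces of equal dimension."""
    # Cosines of the principal angles are the singular values of q1^T q2.
    cosines = np.linalg.svd(q1.T @ q2, compute_uv=False)
    angles = np.arccos(np.clip(cosines, 0.0, 1.0))
    return float(np.sqrt(np.sum(angles ** 2)))


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    a = column_subspace(fcgr("ACGTACGTTGCAAGCTTAGC"))
    b = column_subspace(fcgr("TTTTGGGGCCCCAAAATTGG"))
    print(grassmann_distance(a, a), grassmann_distance(a, b))
```

Because the distance depends only on the principal angles between the two subspaces, it does not depend on which orthonormal basis is chosen for each subspace. If a frequency matrix has rank below `r`, the trailing basis vectors returned by the SVD are arbitrary, so `r` should be chosen with the data in mind.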