The central problem in data analysis and artificial intelligence is how to effectively mine the information in data so as to learn and exploit its intrinsic structure. The union-of-subspaces structure is a simple yet powerful data model that provides a unified theoretical framework for the dimension reduction and analysis of high-dimensional data. Methods for estimating this structure, i.e., subspace learning algorithms, are an important problem and an active research topic in data mining and dimension reduction. The evolution of the big data era and the development of the data-intensive scientific paradigm place higher demands on the universality and efficiency of data mining algorithms, and computational complexity is a key factor determining the practicality of a subspace learning algorithm. This dissertation studies the theory and algorithms of subspace learning for massive high-dimensional data and applies the new algorithms to real scenarios. Its main contributions are as follows.

First, for dimension reduction based on Gaussian random projection, the change in the affinity between subspaces after projection is studied, and the strength of the perturbation suffered by data points within each subspace is estimated. On this basis, theoretical analysis proves that in high dimensions Gaussian random projection has a negligible effect on subspace clustering accuracy; experiments on simulated and real data verify the theoretical results.

Second, to address the excessive complexity of clustering large-scale data, the effect of random subsampling on subspace estimation is analyzed and a fast subspace learning algorithm is designed, which markedly reduces the complexity of subspace learning on massive high-dimensional data while preserving clustering accuracy; its effectiveness is verified on several real datasets.

Third, since the signal-to-noise ratio of the data changes after random dimension reduction, making hyperparameter selection difficult for existing algorithms, a subspace clustering method based on a square-root penalty is proposed. It adapts better to different signal-to-noise ratios and allows a single set of hyperparameters to cluster data under different reduction ratios, significantly improving robustness.

Fourth, for the practical application of hyperspectral image analysis, a weakly supervised hyperspectral image labeling method based on representative sets is proposed. Guided by spatial information, class labels are propagated from representative points to surrounding pixels, flexibly exploiting spatial and spectral information; the superiority of the method is verified on several public datasets.

In summary, this dissertation theoretically characterizes the influence of dimension reduction and random subsampling on the estimation of subspace cluster structure, proposes a robust subspace clustering algorithm that adapts to different signal-to-noise ratios, and resolves the excessive complexity of existing methods when clustering massive high-dimensional data. For the typical application of hyperspectral image clustering, a weakly supervised labeling method based on representative sets is proposed according to the characteristics of real data, refining the relevant theory while advancing the practical use of subspace learning methods for massive high-dimensional data.
The central challenge in data analysis and artificial intelligence is to effectively extract the information in data and thereby learn and exploit its intrinsic structure. The Union of Subspaces (UoS) is a simple yet powerful data model that provides a unified theoretical framework for dimension reduction and high-dimensional data analysis. Methods for estimating the UoS structure, in particular subspace learning algorithms, are an important problem and an active research topic in data mining and machine learning. The evolution of the big data era and the rise of data-intensive scientific paradigms demand greater universality and efficiency from data mining algorithms, and the computational complexity of a subspace learning algorithm is a key determinant of its practicality.

This dissertation studies subspace learning methods for massive high-dimensional data and introduces novel algorithms for practical scenarios. Its main contributions are as follows.

First, the dimension reduction method based on Gaussian random projection is examined. The analysis characterizes how the affinity between subspaces changes under projection and estimates the intensity of the perturbation experienced by data points within each subspace. On this basis, theoretical analysis shows that in high dimensions the impact of Gaussian random projection on subspace clustering accuracy is negligible, a conclusion validated by experiments on both simulated and real data.

Second, to address the high complexity of clustering massive datasets, the dissertation analyzes the influence of random subsampling on subspace estimation and designs a fast subspace learning algorithm that substantially reduces the complexity of subspace clustering for massive high-dimensional data without compromising clustering accuracy.
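To make the first contribution concrete, the sketch below illustrates (in the Johnson–Lindenstrauss spirit) why a Gaussian random projection barely disturbs the pairwise angular structure that subspace clustering relies on. This is an illustration only, not the dissertation's code; all sizes (ambient dimension 1000, projected dimension 50, two 3-dimensional subspaces) are assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: ambient dimension D, projected dimension d,
# two subspaces of dimension r with n points each.
D, d, r, n = 1000, 50, 3, 50

def sample_subspace_points(D, r, n, rng):
    """Draw n points from a random r-dimensional subspace of R^D."""
    basis, _ = np.linalg.qr(rng.standard_normal((D, r)))  # orthonormal basis
    return basis @ rng.standard_normal((r, n))

X = np.hstack([sample_subspace_points(D, r, n, rng) for _ in range(2)])

# Gaussian random projection with i.i.d. N(0, 1/d) entries, so that
# squared norms are preserved in expectation.
P = rng.standard_normal((d, D)) / np.sqrt(d)
Y = P @ X

def cosines(M):
    """Pairwise cosine similarities between the columns of M."""
    U = M / np.linalg.norm(M, axis=0, keepdims=True)
    return U.T @ U

# Maximum change in any pairwise cosine after projecting 1000 -> 50 dims.
err = float(np.abs(cosines(X) - cosines(Y)).max())
print(f"max distortion of pairwise cosines: {err:.3f}")
```

The distortion shrinks as the projected dimension d grows, which is the regime in which the dissertation's negligible-impact result applies.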
The effectiveness of the new algorithm is verified on multiple real datasets.

Third, the signal-to-noise ratio of the data changes after random dimension reduction, which makes hyperparameter selection in existing algorithms difficult. To address this, a subspace clustering method based on a square-root penalty is proposed. The method adapts to varying signal-to-noise ratios and allows a single set of hyperparameters to be used across different reduction ratios, significantly improving robustness.

Fourth, for real-world hyperspectral image analysis, a weakly supervised hyperspectral image labeling method based on representative sets is introduced. Guided by spatial information, class labels are propagated from representative points to surrounding pixels, flexibly integrating spatial and spectral cues. Experiments on multiple public datasets confirm the superiority of the proposed method.

In summary, this dissertation theoretically characterizes the effects of dimension reduction and random subsampling on the estimation of subspace cluster structure and proposes robust subspace clustering algorithms that adapt to diverse signal-to-noise ratios, thereby resolving the excessive complexity of existing methods when clustering massive high-dimensional data. For the representative application of hyperspectral image clustering, a weakly supervised labeling method based on representative sets is proposed according to the characteristics of real data; alongside the theoretical results, this advances the practical use of subspace learning methods for massive high-dimensional data.
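The robustness of the square-root penalty can be illustrated with the classical square-root LASSO, a generic instance of the idea rather than the dissertation's exact formulation (the problem sizes below are assumptions). The subgradient condition at zero gives the smallest penalty weight for which the all-zero coefficient vector is optimal; for the squared loss this threshold scales with the noise level, while for the square-root loss the residual norm in the denominator cancels the noise scale:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 50               # assumed toy problem size
X = rng.standard_normal((n, p))

# For y containing pure noise, the smallest penalty weight lam0 making
# c = 0 optimal follows from the subgradient condition at zero:
#   squared loss  min 0.5*||y-Xc||^2 + lam*||c||_1 : lam0 = ||X^T y||_inf
#   square root   min     ||y-Xc||   + lam*||c||_1 : lam0 = ||X^T y||_inf / ||y||_2
# The first grows with the noise level sigma; the second does not.
lasso_lam0, sqrt_lam0 = [], []
for sigma in (0.1, 1.0, 10.0):
    y = sigma * rng.standard_normal(n)
    t = float(np.abs(X.T @ y).max())
    lasso_lam0.append(t)
    sqrt_lam0.append(t / float(np.linalg.norm(y)))

print("squared-loss thresholds:", np.round(lasso_lam0, 2))
print("square-root thresholds: ", np.round(sqrt_lam0, 3))
```

The square-root thresholds stay in a narrow band across a hundredfold change in noise level, which is why a single hyperparameter can serve data at different reduction ratios.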
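The spatially guided propagation in the fourth contribution can be sketched as follows. This is a minimal illustration of the idea, not the dissertation's algorithm: the 4-connectivity, the cosine-similarity test, and the threshold `tau` are all assumptions made for the example.

```python
import numpy as np
from collections import deque

def propagate_labels(img, seeds, tau=0.9):
    """Spread labels from representative pixels to spatial neighbors
    whose spectra are sufficiently similar.

    img:   (H, W, B) hyperspectral cube
    seeds: {(row, col): label} representative points
    tau:   cosine-similarity threshold for accepting a neighbor
    """
    H, W, _ = img.shape
    labels = -np.ones((H, W), dtype=int)        # -1 marks unlabeled pixels
    queue = deque()
    for (r, c), lab in seeds.items():
        labels[r, c] = lab
        queue.append((r, c))
    # Normalize spectra once so the similarity test is a dot product.
    unit = img / (np.linalg.norm(img, axis=2, keepdims=True) + 1e-12)
    while queue:                                # breadth-first propagation
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < H and 0 <= cc < W and labels[rr, cc] < 0:
                if float(unit[r, c] @ unit[rr, cc]) >= tau:
                    labels[rr, cc] = labels[r, c]
                    queue.append((rr, cc))
    return labels
```

On a toy cube with two spectrally distinct regions, a single seed per region suffices to label every pixel, which conveys how a small representative set can annotate a whole image when spatial and spectral cues agree.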