Model Enhancement Methods Based on Large Data Visualization

Original title: 基于大数据可视化的机器学习模型性能增强方法研究

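Author: Jun Yuan (袁隽)
  • Student ID
    2019******
  • Degree
    Doctoral
  • Email
    ths******com
  • Defense date
    2024.05.22
  • Advisor
    Shixia Liu (刘世霞)
  • Discipline
    Software Engineering
  • Pages
    96
  • Confidentiality level
    Public
  • Department
    410 School of Software
  • Chinese keywords
    可视分析;大数据;机器学习;模型改进
  • English keywords
    visualization; large data; machine learning; model enhancement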

Abstract

In recent years, with the rapid development of machine learning, data volume and model complexity have grown steadily. There is therefore an urgent need for in-depth analysis of the data related to machine learning models (e.g., labeled data and model structures) in order to improve model performance. Visualization plays a vital role in analyzing such model-related data and helps users understand it better.

The analysis of model-related data faces three primary challenges. First, as data volume grows, overplotting and slower rendering make it difficult to display all of the data directly. Second, data relationships and attributes are intricately intertwined, which makes it difficult for users to compare and analyze quantitative attributes among neighboring samples. Third, as models become more complex, model design and optimization become less efficient: they rely increasingly on expert experience and require repeated attempts and continuous adjustment.

To address these challenges, this thesis studies large data visualization techniques and their application to improving the performance of machine learning models. The main contributions are as follows.

(1) To address the challenge of growing data volume, a perception-based evaluation scheme for scatterplot sampling methods is proposed. Within this scheme, a series of user studies evaluates, at the perceptual level, how well different sampling methods preserve relative region density, relative class density, outliers, and overall distribution shape. The resulting findings provide guidance for selecting and improving sampling methods in different scenarios. With an appropriate sampling method, the number of displayed samples is reduced, overplotting and slow rendering are alleviated, and the important information in the data is preserved. The results have been applied to the negative sampling process of the fraud detection model at Shanghai Data Exchange, where they improved the quality of the negative samples and thereby the accuracy of fraud detection.

(2) To address the challenge of intertwined data relationships and attributes, a neighborhood-preserving non-uniform circle packing layout method is proposed. The method preserves the neighborhood relationships between samples while encoding a quantitative attribute of each sample as circle size. Users can thus easily compare and analyze quantitative attributes among neighboring samples, quickly identify anomalous samples, and correct them, which improves sample quality. Quantitative experiments and use cases demonstrate that the method helps users gain a deep understanding of how training data affects model predictions and, in turn, improve the quality of the training data.

(3) To address the challenge that the design of complex models relies heavily on expert experience, a visual analysis method for summarizing design principles of neural network architectures is proposed. The method models the similarity between neural network architectures with graph edit distance and introduces an architecture layout method that combines force-directed layout with circle packing. A visual analysis system has been developed to help users summarize design principles for neural network architectures. Experiments show that the summarized design principles reduce the computational cost of neural architecture search by 50% while maintaining comparable performance, which improves the efficiency of model design and optimization.
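As a rough illustration of the sampling idea in contribution (1), the sketch below draws a simple random sample from a synthetic scatterplot and measures how far the sample's per-cell share of points drifts from that of the full data. This is only a numeric proxy under stated assumptions: the thesis evaluates preservation perceptually through user studies, and the grid-based measure, the synthetic data, and all function names here (`region_density`, `density_error`, and so on) are hypothetical and not taken from the thesis.

```python
import random
from collections import Counter


def grid_cell(point, cell_size):
    """Map a 2D point to the index of the square grid cell containing it."""
    x, y = point
    return (int(x // cell_size), int(y // cell_size))


def region_density(points, cell_size):
    """Return the fraction of all points that falls into each occupied cell."""
    counts = Counter(grid_cell(p, cell_size) for p in points)
    total = len(points)
    return {cell: n / total for cell, n in counts.items()}


def density_error(full, sample, cell_size):
    """Total variation distance between the cell densities of two point sets.

    0.0 means the sample reproduces the relative region densities exactly;
    1.0 means the two sets occupy disjoint cells.
    """
    d_full = region_density(full, cell_size)
    d_sample = region_density(sample, cell_size)
    cells = set(d_full) | set(d_sample)
    return 0.5 * sum(
        abs(d_full.get(c, 0.0) - d_sample.get(c, 0.0)) for c in cells
    )


# Assumed synthetic data: one dense cluster and one sparse cluster.
rng = random.Random(0)
points = [(rng.gauss(2, 0.5), rng.gauss(2, 0.5)) for _ in range(9000)]
points += [(rng.gauss(7, 0.5), rng.gauss(7, 0.5)) for _ in range(1000)]

# Simple random sampling: every point is equally likely to be kept.
sample = rng.sample(points, 500)

err = density_error(points, sample, cell_size=1.0)
print(f"kept {len(sample)} of {len(points)} points; density error = {err:.3f}")
```

A measure of this kind could be computed for several sampling strategies to compare them numerically, but it says nothing about how viewers actually perceive density, outliers, or shape, which is the question the user studies address.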