
Research on Federated Learning With Non-IID Data Distribution

Author: Chenghao Hu
  • Student ID
    2016******
  • Degree
    Master's
  • Email
    376******com
  • Defense date
    2019.05.30
  • Advisor
    Zhi Wang
  • Discipline
    Computer Science and Technology
  • Pages
    63
  • Confidentiality level
    Public
  • Department
    024 Department of Computer Science and Technology
  • Keywords
    Deep Learning, Distributed Computing, Neural Network, Data Distribution, Federated Learning

Abstract

With the continuous development of neural networks and deep learning, industry increasingly relies on deep neural network models to solve problems such as image classification, face verification, and speech recognition. Behind the success of these technologies, massive amounts of data usually have to be collected before a model can be trained to a satisfactory level. In practice, data not only contains users' private information; from the perspective of enterprises, data is directly tied to profit, so it cannot easily be exposed to the outside world. This gives rise to the so-called "data isolation" phenomenon: no single individual or institution holds enough data to train a high-quality model on its own. How to break this deadlock has become an important question for both academia and industry. Existing approaches mainly extend distributed deep learning methods from the local area network to the wide area network by increasing the transmission interval, an approach known as federated learning. However, because the WAN environment imposes constraints such as limited bandwidth, limited computing resources, and heterogeneous data distributions, these methods often perform poorly. This thesis focuses on the problem of Non-IID data distribution. The main research contents and contributions are:

1. We test existing federated deep learning algorithms under different data distributions. Unlike prior work, we evaluate training performance from the perspectives of both communication and computation resources, revealing the impact of Non-IID data on training.

2. Using tools from convex optimization and probability theory, we analyze the theoretical relationship between data distribution and federated training. The results show that under Non-IID data, the more local training steps a node performs, the larger the model error becomes, which wastes computation resources. This result explains the phenomena observed in our experiments, guides the design of the framework proposed in this thesis, and lays a foundation for future analysis.

3. Based on the theoretical analysis, we design an efficient, data-distribution-aware federated deep learning framework. By collecting and analyzing data-distribution information, the framework adjusts the transmission mechanism and training topology, improving the efficiency of federated learning while guaranteeing model convergence.
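The "increase the transmission interval" idea described above is commonly realized as federated averaging: each node runs several local gradient steps between synchronizations, and a server averages the resulting models, weighted by local data size. The sketch below illustrates this on a simple least-squares model, together with a label-sorted split that produces a skewed (Non-IID) partition; all function names and the partitioning scheme are illustrative assumptions, not the thesis's actual implementation.

```python
import numpy as np

def non_iid_split(X, y, n_nodes):
    # Skewed (Non-IID) partition: sort by target and cut into contiguous
    # chunks, so each node only sees a narrow slice of the target range.
    order = np.argsort(y)
    return [(X[idx], y[idx]) for idx in np.array_split(order, n_nodes)]

def local_sgd(w, X, y, steps, lr):
    # Plain full-batch gradient steps on a least-squares loss; `steps`
    # is the transmission interval (local work between synchronizations).
    for _ in range(steps):
        w = w - lr * X.T @ (X @ w - y) / len(y)
    return w

def fed_avg(parts, rounds, local_steps, lr, dim):
    # One round: broadcast w, let every node train locally for
    # `local_steps` steps, then average the local models weighted
    # by local data size.
    w = np.zeros(dim)
    sizes = [len(y) for _, y in parts]
    for _ in range(rounds):
        local_models = [local_sgd(w.copy(), X, y, local_steps, lr)
                        for X, y in parts]
        w = np.average(local_models, axis=0, weights=sizes)
    return w
```

Raising `local_steps` saves communication rounds, but on heterogeneous data it lets the local models drift toward their differing local optima before averaging, which is the communication/computation trade-off the thesis analyzes. (In this noiseless toy setting every node shares the same optimum, so the drift does not appear.)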