XAI Evaluation and Backdoor Attack via Generalization of Deep Model Watermark

Author: 亚梦溪
  • Student ID
    2021******
  • Degree
    Master's
  • Email
    yam******.cn
  • Defense Date
    2024.05.14
  • Supervisor
    夏树涛
  • Discipline
    Computer Technology
  • Pages
    49
  • Confidentiality Level
    Public
  • Affiliation
    599 International Graduate School
  • Keywords
    Model Watermark; Generalization; Explainable Artificial Intelligence; Backdoor Attack; Backdoor Defense

Abstract

A backdoor attack aims to embed a hidden backdoor into a deep neural network (DNN) so that the infected model behaves well on benign samples, while its predictions are maliciously altered whenever the attacker-specified backdoor trigger activates the hidden backdoor. The backdoor trigger (also called the backdoor watermark or model watermark) is the pattern that can activate the hidden backdoor in the model. Current research shows that model watermarks generalize: patterns different from the original watermark, when injected into benign samples, can also activate the backdoor in the model. However, there has been little study of further properties of model watermark generalization, of how to manipulate it, or of how to apply it. This thesis explores these questions. The main work and contributions are as follows:

• We revisit current research on the generalization of model watermarks and design experiments that uncover further properties of this generalization. We preliminarily reveal and analyze its positional aspect: model watermarks usually generalize strongly across positions, and the generalized watermarks tend to activate the backdoor even more strongly than the original watermark (an illustrative measurement sketch is given after the abstract).

• We propose a method to limit the generalization of model watermarks, termed Generalization-Limited Backdoor Watermark (GLBW). Specifically, we formulate the training of watermarked DNNs as a min-max problem: in each iteration, the inner maximization searches for the "worst" potential trigger, i.e., the one with the highest backdoor-activation effect and the largest difference from the ground-truth trigger, and the outer minimization reduces its effect together with the loss on benign and poisoned samples (an illustrative formulation is given after the abstract). In particular, we design an adaptive iterative optimization method to find the desired potential trigger in each inner maximization.

• We revisit the model-watermark-based evaluation of explainable artificial intelligence (XAI) methods, reveal its implementation limitations and the unreliability caused by the generalization of model watermarks, and design a more faithful XAI evaluation method based on GLBW.

• Existing patch-based backdoor attacks are relatively easy to detect with gradient-based backdoor defenses (e.g., Neural Cleanse). We find that limiting the generalization of the backdoor watermark lowers the probability of being detected by such defenses. We therefore extend GLBW into a backdoor attack that can bypass gradient-based backdoor defenses, termed Generalization-Limited Backdoor Watermark for Bypassing Defense (GLBW-BD). Experiments show that GLBW-BD bypasses most gradient-based backdoor defenses while achieving higher accuracy on benign samples than GLBW.
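
To make the positional-generalization experiments in the first contribution concrete, here is a minimal, hypothetical sketch (not code from the thesis) of how a patch trigger's effect at a shifted location could be measured. The objects `model`, `benign_loader`, `patch`, and `target_label` are assumptions for illustration; the trigger is stamped at a chosen position and the attack success rate (ASR) is recorded.

# Illustrative sketch (not from the thesis): measure positional generalization of a
# patch trigger by stamping it at a shifted location and recording how often the
# backdoor fires (attack success rate, ASR). `model`, `benign_loader`, `patch`,
# and `target_label` are assumed, hypothetical objects.
import torch

@torch.no_grad()
def attack_success_rate(model, benign_loader, patch, top, left, target_label, device="cpu"):
    """Paste `patch` (shape C x h x w) at (top, left) on every benign image and
    return the fraction of non-target-class samples predicted as `target_label`."""
    model.eval()
    hits, total = 0, 0
    _, h, w = patch.shape
    for images, labels in benign_loader:
        images = images.clone().to(device)
        images[:, :, top:top + h, left:left + w] = patch.to(device)  # stamp the trigger
        preds = model(images).argmax(dim=1)
        keep = labels.to(device) != target_label  # ignore samples already in the target class
        hits += (preds[keep] == target_label).sum().item()
        total += keep.sum().item()
    return hits / max(total, 1)

# Sweeping (top, left) over the image and observing a high ASR far from the original
# trigger position would indicate strong positional generalization of the watermark.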
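
The Min-Max training of GLBW described in the second contribution can be written, under assumed notation that is not taken verbatim from the thesis, roughly as the following sketch, where \theta are the model parameters, f_\theta the classifier, t^{*} the ground-truth trigger, x \oplus t stamps trigger t onto sample x, y_t the attacker-chosen target label, \ell a classification loss, d(\cdot,\cdot) a trigger-distance measure, and \lambda, \beta trade-off weights:

\[
\min_{\theta}\;\; \mathbb{E}_{(x,y)}\Big[\ \ell\big(f_{\theta}(x),\,y\big) \;+\; \ell\big(f_{\theta}(x \oplus t^{*}),\,y_{t}\big) \;+\; \lambda\,\ell\big(f_{\theta}(x \oplus \hat{t}\,),\,y\big)\ \Big],
\qquad
\hat{t} \;=\; \arg\max_{t}\ \Big\{\, -\,\mathbb{E}_{x}\,\ell\big(f_{\theta}(x \oplus t),\,y_{t}\big) \;+\; \beta\, d\big(t,\,t^{*}\big) \,\Big\}.
\]

In this reading, the inner \arg\max finds a potential trigger with the highest backdoor-activation effect that differs most from the ground-truth trigger, while the outer minimization forces the model to keep its benign prediction on samples stamped with that trigger, alongside the usual benign and backdoor losses; the adaptive iterative optimization mentioned above would then be the approximate solver for the inner problem in each training iteration.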