Adversarial Generation of Image/Video Based on Feature Disentangling and Spatiotemporal Discrimination

Author: 胥森哲
  • Student ID
    2015******
  • Degree
    Doctoral
  • Email
    xus******com
  • Defense date
    2021.05.19
  • Advisor
    胡事民
  • Discipline
    Computer Science and Technology
  • Pages
    124
  • Confidentiality level
    Public
  • Training unit
    024 Department of Computer Science and Technology
  • Keywords
    Generative Adversarial Networks, disentangled representation, video stabilization, image/video harmonization

Abstract

With the rapid development of the mobile Internet and the fast-growing number of visual content creators, the demand for intelligent and flexible image and video processing has risen accordingly. The most common processing tasks include editing the global attributes of an image, modifying its local parts, stabilizing video, and harmoniously fusing content into images and video. Generative Adversarial Networks (GANs) are among the most widely studied generative models today, but their representations of visual content are usually uninterpretable, they lack precision and flexibility for image editing, and their adversarial objectives focus mainly on discriminating the overall distribution, all of which limits the general applicability of GANs to image and video processing. Starting from the global attribute features, local spatial features, spatial-transformation awareness, and pixel-level quality features of visual content, this thesis builds on GANs to solve different image and video editing tasks: from the generator's perspective it proposes disentangled representations of global and local image features, and from the discriminator's perspective it proposes new adversarial objectives over the temporal and spatial domains. The main contributions of this thesis are as follows (minimal code sketches of the core mechanisms appear after the list):

1. A latent-space disentangled representation model for global image attributes. The model embeds multiple attribute-transformation sub-networks between the encoder and the decoder, and adopts a soft independence constraint to keep the latent-space increments produced by different sub-networks independent of one another. This realizes an independent representation of multiple image attributes in the latent code and is applied to the simultaneous editing of multiple image attributes.

2. A disentangled representation model for the local spatial features of images. The model uses the proposed disentangling encoder and an auxiliary part-wise decoder to disentangle the shape information of local parts into separate blocks of the latent code; a global decoder then generates entirely new recombined images under a recombination loss, a cycle-consistency loss, and other objectives. Applied to editing the shapes of facial features, the method constitutes a portrait part-editing algorithm that works much like gene-editing techniques.

3. A spatiotemporally aware generative adversarial model that solves online video stabilization and foreground object fusion. The method proposes a spatial-transformation-aware generator network and a discriminator network that judges the stability of video frames, and removes video shake through adversarial generation, achieving results comparable to those of traditional methods while being better suited to low-quality videos. Furthermore, a pixel-level disharmony-region discriminator is proposed, which strengthens the GAN's understanding of pixel-level image defects and supports the harmonious fusion of foreground objects with image and video backgrounds.
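The abstract names the soft independence constraint of contribution 1 without giving its form. Below is a minimal PyTorch-style sketch of one plausible instantiation, penalizing the pairwise cosine similarity of the latent increments produced by the attribute sub-networks; all names (`soft_independence_loss`, `deltas`, `lambda_ind`) are hypothetical, not the thesis's actual implementation.

```python
import torch
import torch.nn.functional as F

def soft_independence_loss(deltas):
    """Soft independence penalty over per-attribute latent increments.

    deltas: list of tensors, one per attribute sub-network, each of
    shape (batch, latent_dim); deltas[i] is the increment the i-th
    sub-network adds to the shared latent code.
    """
    loss = torch.zeros((), device=deltas[0].device)
    for i in range(len(deltas)):
        for j in range(i + 1, len(deltas)):
            # Soft constraint: push increments of different attributes
            # toward orthogonality rather than enforcing it exactly.
            cos = F.cosine_similarity(deltas[i], deltas[j], dim=1)
            loss = loss + cos.abs().mean()
    return loss

# In training, an edited latent code would be z + sum(selected deltas),
# with lambda_ind * soft_independence_loss(deltas) added to the
# encoder/generator objective (lambda_ind is a hypothetical weight).
```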

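Contribution 2 represents each facial part's shape in its own block of the latent code, so recombination amounts to swapping blocks between two encoded faces. A minimal sketch of that swap, assuming a flat latent vector with a fixed per-part layout; the part names and slice sizes below are hypothetical.

```python
import torch

# Hypothetical layout: each facial part owns a contiguous 64-dim block.
PART_SLICES = {
    "eyes": slice(0, 64),
    "nose": slice(64, 128),
    "mouth": slice(128, 192),
}

def swap_part_codes(z_a, z_b, parts_to_swap):
    """Return z_a with the latent blocks of the named parts taken from z_b.

    z_a, z_b: latent codes of two face images, shape (batch, latent_dim),
    produced by the disentangling encoder.
    """
    z_mix = z_a.clone()
    for name in parts_to_swap:
        s = PART_SLICES[name]
        z_mix[:, s] = z_b[:, s]
    return z_mix

# A recombined image would then be decoded from the mixed code, e.g.
#   x_mix = global_decoder(swap_part_codes(encoder(x_a), encoder(x_b), ["nose"]))
# with the recombination and cycle-consistency losses keeping the swap
# semantically clean during training.
```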
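Contribution 3's pixel-level disharmony discriminator judges each pixel rather than the whole image. A minimal sketch of one plausible training loss, assuming the discriminator emits one logit per pixel and the pasted foreground region is known from the compositing mask; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def pixelwise_disharmony_loss(d_map, fg_mask, is_composite):
    """Training loss for a per-pixel harmony/disharmony discriminator.

    d_map: discriminator output logits, shape (batch, 1, H, W).
    fg_mask: float compositing mask of the same shape, 1 where a
    foreground object was pasted, 0 elsewhere.
    For composite inputs the pasted pixels should be flagged as
    disharmonious; for real images every pixel should look harmonious.
    """
    target = fg_mask if is_composite else torch.zeros_like(d_map)
    return F.binary_cross_entropy_with_logits(d_map, target)
```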