Object tracking in videos is one of the core tasks of computer vision, with broad applications in video content understanding, autonomous driving, and robotics. A key problem in visual object tracking is how to represent object appearance efficiently and accurately so that observations of the same target at different times can be associated. Representation models, which take images as input and output high-level semantic features describing object appearance, are an effective way to address this problem. Focusing on representation models for object tracking, this thesis studies and addresses three problems: run-time efficiency, over-reliance on annotated data, and lack of versatility.

Efficiency is a bottleneck of object tracking: real-world applications often require processing video streams in real time. To this end, this thesis proposes a joint model for object detection and appearance representation. The model integrates the representation model into a single-shot object detector, greatly improving run-time efficiency. Without sacrificing tracking accuracy, the model runs faster than existing alternatives and realizes a multi-object tracking system that exceeds 25 frames per second. On the other hand, existing methods, including the joint model above, rely on large amounts of manually annotated identity labels to train appearance representations, incurring a very high data cost. To address this, the thesis proposes a novel self-supervised representation learning algorithm. It constructs an efficient pretext task from cycle consistency between adjacent frames of a single video and between synchronized frames of a video pair captured at the same moment. The representation learned from this pretext task can be applied directly to pedestrian tracking without fine-tuning, achieving accuracy comparable to fully supervised learning.

Meanwhile, the versatility of representations is an important research direction. Owing to differing application requirements, visual object tracking has in practice been divided into multiple tasks with different setups, each typically requiring dedicated algorithms and models. This thesis finds that tracking tasks with different setups can all be solved with just two simple algorithmic primitives, and further, that both primitives can be built on a single unified representation model. Accordingly, the thesis proposes a general framework that solves multiple object tracking tasks simultaneously, and experiments show that the unified representation model can even outperform task-specific models.

In summary, this thesis studies representation learning methods for visual object tracking. The research provides an initial solution to the practically important real-time problem, investigates theoretically well-grounded self-supervised learning methods, and explores a unified framework for general object tracking. Extensive experiments show that the proposed methods improve on baseline methods in both accuracy and speed, advancing visual object tracking research and, at the time of publication, reaching a leading level among methods reported in the literature.
Object tracking in videos is one of the most fundamental problems in computer vision, underpinning a wide range of applications such as video content understanding, autonomous driving, and robotics. To associate observations of a target object from different timesteps, efficient and accurate appearance modeling is a core technique, on which this thesis mainly focuses. This thesis considers three key problems involved in appearance representation learning for object tracking: run-time efficiency, over-reliance on labeled data, and lack of versatility across different task setups.

On one hand, run-time efficiency is the bottleneck of object tracking. Real-world applications of object tracking often require real-time processing of video streams. To meet this demand, this thesis proposes a joint detection and appearance embedding model. This model seamlessly incorporates the appearance model into a single-shot object detector, thus eliminating the run-time cost of the originally stand-alone appearance model. The joint model greatly improves run-time efficiency without degrading tracking accuracy, and realizes a real-time multi-object tracking system running at over 25 frames per second.

On the other hand, previous methods, including the aforementioned joint model, require a huge number of identity labels for supervised training, posing a heavy data-annotation cost. To address this issue, this thesis proposes a novel self-supervised appearance representation learning algorithm. In this algorithm, we construct a pretext task that uses cycle consistency as a free supervision signal. Cycle consistency can be found between temporally adjacent frame pairs drawn from one video, or between temporally synchronized frame pairs from two videos.
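To illustrate the cycle-consistency idea, the following is a minimal sketch of a forward-backward soft-matching loss. This is not the thesis implementation: all names here are hypothetical, instance embeddings are assumed to be L2-normalized, and the temperature value is an arbitrary choice. The intuition is that after "walking" from each instance in one frame to the other frame and back, a well-trained representation should return every instance to itself, so the round-trip probability matrix is pushed toward the identity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cycle_consistency_loss(feat_a, feat_b, temperature=0.1):
    """Forward-backward soft matching between two frames.

    feat_a: (N, D) L2-normalized instance embeddings from frame t
    feat_b: (M, D) L2-normalized instance embeddings from frame t+1 (or
            the synchronized frame of a paired video)
    """
    sim = feat_a @ feat_b.T / temperature  # (N, M) cosine similarities
    fwd = softmax(sim, axis=1)             # transition probs a -> b
    bwd = softmax(sim.T, axis=1)           # transition probs b -> a
    roundtrip = fwd @ bwd                  # (N, N) round-trip probs
    # cross-entropy against the identity target: each instance should
    # come back to itself after the a -> b -> a cycle
    return -np.mean(np.log(np.diag(roundtrip) + 1e-9))
```

Note that no identity labels appear anywhere: the supervision signal comes entirely from the requirement that the forward and backward matchings be mutually consistent, which is why such a pretext task is free to construct from raw video.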
The appearance representation learned with this pretext task can be applied directly to tracking tasks without fine-tuning, and achieves accuracy comparable to fully supervised appearance models.

Meanwhile, the versatility of the appearance model is an important topic to explore. Because of differing use cases, visual object tracking has been fragmented into multiple tasks with different setups. As a result, dedicated algorithms and models have been proposed to adapt to specific task setups. This thesis explores and finds that these various tracking tasks can be solved with two simple algorithmic primitives, and furthermore, that the two primitives can be built upon a single unified appearance model. Based on this finding, this thesis proposes a unified tracking framework that employs a single shared appearance representation to tackle different tracking tasks. Experiments show that the unified appearance representation even outperforms task-specific models.

In summary, this thesis explores appearance representation learning for visual object tracking tasks. We demonstrate that the proposed methods not only outperform classic baseline methods in terms of run-time efficiency, data reliance, and versatility, but also compare favorably against the state of the art in the published literature.
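The association step common to the tracking setups above, matching current-frame detections to existing tracks by appearance similarity, can be sketched as follows. This is a simplified greedy stand-in with hypothetical names, not the thesis pipeline: a full tracker would typically combine appearance with motion cues and may use optimal assignment instead of greedy matching.

```python
import numpy as np

def associate(track_embs, det_embs, threshold=0.6):
    """Greedily associate detections with tracks by cosine similarity.

    track_embs: (T, D) embeddings of active tracks (assumed L2-normalized)
    det_embs:   (N, D) embeddings of current-frame detections (normalized)
    Returns (matches, unmatched_dets): matched (track_idx, det_idx) pairs,
    and indices of detections left to start new tracks.
    """
    sim = track_embs @ det_embs.T  # cosine similarity for normalized inputs
    matches, used_t, used_d = [], set(), set()
    # repeatedly take the highest remaining similarity above the threshold
    for t, d in sorted(np.ndindex(*sim.shape), key=lambda td: -sim[td]):
        if sim[t, d] < threshold:
            break  # all remaining pairs are too dissimilar to match
        if t not in used_t and d not in used_d:
            matches.append((t, d))
            used_t.add(t)
            used_d.add(d)
    unmatched_dets = [d for d in range(det_embs.shape[0]) if d not in used_d]
    return matches, unmatched_dets
```

The quality of this step depends entirely on the appearance embeddings, which is why a single strong, shared representation can serve several tracking setups at once.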