
Theoretical study of Nesterov’s accelerated gradient method and its restart variants

Author: 李家宏
  • Student ID
    2019******
  • Degree
    Doctoral
  • Email
    lij******.cn
  • Defense date
    2024.05.20
  • Advisor
    包承龙
  • Discipline
    Mathematics
  • Pages
    141
  • Confidentiality level
    Public
  • Department
    042 Department of Mathematics
  • Keywords
    convex optimization problems; Nesterov's accelerated gradient method; restart strategies; accelerated proximal gradient method; gradient restart schemes

Abstract

First-order optimization algorithms play an important role in fields such as artificial intelligence. Nesterov's accelerated gradient (NAG) method is a classical extrapolation-based first-order algorithm that improves the convergence rate of gradient descent through the design of its extrapolation coefficients. However, the extrapolation scheme can cause the sequence of objective function values to oscillate, which slows convergence. To address this issue, restart strategies reset the extrapolation coefficient to zero during the iteration, effectively suppressing the oscillations and improving the numerical convergence behavior of the NAG method. The theory behind restart strategies, however, remains incomplete. This thesis studies the convergence properties of the NAG method and its extensions and analyzes the theoretical properties of restart strategies.

First, when the objective function is $L$-smooth and $\mu$-strongly convex, the NAG method achieves global R-linear convergence if the extrapolation coefficient is set to $\frac{\sqrt{L}-\sqrt{\mu}}{\sqrt{L}+\sqrt{\mu}}$. In the classical NAG method, however, the extrapolation coefficient is independent of $\mu$ and increases monotonically to 1, and in this case the global R-linear convergence of the NAG method was previously unclear. By constructing a Lyapunov sequence with a special structure and proving that it converges globally Q-linearly, this thesis establishes the global R-linear convergence of the NAG method. When the step size is smaller than $1/L$, the iterates can be shown to converge at the rate $O(\rho^k/k^2)$ with $\rho\in(0,1)$, which explains the fast convergence of the NAG method in the early iterations. In addition, for strongly convex composite optimization problems, this thesis studies three extensions of the NAG method, namely the accelerated proximal gradient (APG) method, the generalized APG method, and the inertial forward-backward splitting (I-FBS) method, and proves that all of them converge globally R-linearly.
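For reference, the display below sketches one standard way of writing the NAG iteration and the two extrapolation-coefficient choices discussed above, as they commonly appear in the literature; the exact parameterization analyzed in the thesis may differ, so this is an illustrative sketch rather than the thesis's precise formulation.

% One standard form of the NAG iteration for an $L$-smooth objective $f$:
\begin{align*}
  x_{k+1} &= y_k - s\,\nabla f(y_k), \qquad 0 < s \le 1/L,\\
  y_{k+1} &= x_{k+1} + \beta_k\,(x_{k+1} - x_k).
\end{align*}
% Classical (convex) choice: $\beta_k$ is independent of $\mu$ and increases
% monotonically to 1, for example
\[
  \beta_k = \frac{t_k - 1}{t_{k+1}}, \qquad t_1 = 1, \quad
  t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}.
\]
% Strongly convex choice: a constant coefficient determined by $L$ and $\mu$,
\[
  \beta_k \equiv \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}.
\]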

Second, this thesis develops a theoretical analysis of restart strategies for the NAG and APG methods and proves the global R-linear convergence of the restarted NAG and APG methods. When the restart intervals are uniformly bounded, the restart strategies can be shown to improve the convergence rates of the NAG and APG methods. For one-dimensional objective functions, the thesis further analyzes the behavior of the gradient-restarted NAG method and how its convergence differs from that of gradient descent. Moreover, starting from the ordinary differential equation associated with the NAG method, the thesis rigorously establishes the theoretical advantage of the gradient restart scheme. This result partially resolves an open problem posed in [Su, Boyd, and Cand\`es, \textit{J. Mach. Learn. Res.}, 2016, 17(153), 1-43].

Finally, the theoretical study of extrapolation-based first-order gradient methods built on the NAG method has attracted broad attention, and their continuous-time correspondence with ordinary differential equations is an active topic of recent research. The global linear convergence of the NAG method established in this thesis has no counterpart yet in the continuous-time framework, which merits further investigation.
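As a concrete illustration of the gradient restart scheme discussed above, the following minimal Python sketch runs NAG and resets the extrapolation coefficient to zero whenever the gradient at the extrapolated point forms an acute angle with the most recent step, which is one common formulation of the gradient restart test in the literature. The function names, the particular coefficient sequence, and the quadratic test problem are illustrative assumptions rather than the thesis's exact setting.

import numpy as np

def gradient_restarted_nag(grad, x0, L, max_iter=500):
    """NAG with a gradient restart: reset the extrapolation coefficient
    to zero whenever <grad f(y_k), x_{k+1} - x_k> > 0, i.e. whenever the
    momentum step points in a direction along which f increases."""
    s = 1.0 / L                          # step size 1/L
    x = y = np.asarray(x0, dtype=float)
    t = 1.0                              # controls the extrapolation coefficient
    for _ in range(max_iter):
        g = grad(y)
        x_new = y - s * g
        if g @ (x_new - x) > 0:          # gradient restart test
            t = 1.0                      # extrapolation coefficient reset to 0
            y = x_new
        else:
            t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x_new + ((t - 1.0) / t_new) * (x_new - x)
            t = t_new
        x = x_new
    return x

# Illustrative use on the strongly convex quadratic f(x) = 0.5 * x^T A x:
A = np.diag([1.0, 100.0])
x_min = gradient_restarted_nag(lambda x: A @ x, x0=[1.0, 1.0], L=100.0)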