Łojasiewicz inequality

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U → R be a real analytic function on an open set U in R^n, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K,

$$\operatorname{dist}(x, Z)^\alpha \le C\,|f(x)|.$$

Here α can be large.

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that

$$|f(x) - f(p)|^\theta \le c\,\|\nabla f(x)\| \quad \text{for all } x \in W.$$
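A simple one-variable illustration (not drawn from the sources cited below) is f(x) = x^k with an integer k ≥ 2 and p = 0: the gradient form of the inequality then holds on any neighborhood of 0 with exponent θ = 1 − 1/k and constant c = 1/k, since

$$|f(x) - f(0)|^{\,1 - 1/k} = |x|^{\,k-1} = \frac{1}{k}\,\big|k\,x^{\,k-1}\big| = \frac{1}{k}\,|f'(x)|.$$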

Polyak inequality

A special case of the Łojasiewicz inequality, due to Boris Polyak, is commonly used to prove linear convergence of gradient descent algorithms. This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).

Definitions

f is a function of type R^d → R, and has a continuous derivative ∇f.

X* is the subset of R^d on which f achieves its global minimum f* (if one exists). Throughout this section we assume such a global minimum value f* exists, unless otherwise stated. The optimization objective is to find some point x* in X*.

L, μ > 0 are constants.

∇f is L-Lipschitz continuous iff

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y.$$

f is μ-strongly convex iff

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu}{2}\,\|y - x\|^2 \quad \text{for all } x, y.$$

f is μ-PL (where "PL" means "Polyak–Łojasiewicz") iff

$$\frac{1}{2}\,\|\nabla f(x)\|^2 \ge \mu\,\big(f(x) - f^*\big) \quad \text{for all } x.$$
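As a concrete illustration of the last definition, the following sketch (illustrative only, not taken from the cited sources) numerically checks the μ-PL inequality for a rank-deficient least-squares objective, which is not strongly convex but is PL with μ equal to the square of the smallest nonzero singular value of A (property 4 of the theorem below).

import numpy as np

# Numeric check of the mu-PL inequality for f(x) = 0.5*||A x - b||^2 with a
# rank-deficient A (so f is not strongly convex). By property 4 below, f is
# mu-PL with mu = (smallest nonzero singular value of A)^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
A[:, 5:] = 0.0                        # make A rank-deficient
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b)

svals = np.linalg.svd(A, compute_uv=False)
mu = np.min(svals[svals > 1e-9]) ** 2              # PL constant

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])   # global minimum value f*

for _ in range(1000):                 # the inequality holds at every point
    x = 10 * rng.standard_normal(10)
    assert 0.5 * np.sum(grad(x) ** 2) >= mu * (f(x) - f_star) - 1e-6

Dropping the zero columns would make A full rank and f strongly convex; the point of the PL condition is that the inequality, and hence the convergence results below, survive without that.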

Basic properties

Theorem — 1. If f is μ-PL, then it is invex.

2. If ∇f is L-Lipschitz continuous, then

$$f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{L}{2}\,\|y - x\|^2 \quad \text{for all } x, y.$$

3. If f is μ-strongly convex, then it is μ-PL.

4. If g is μ-strongly convex and A is a linear map, then f = g ∘ A is μσ²-PL, where σ is the smallest nonzero singular value of A.

5. (quadratic growth) If f is μ-PL, x is a point, and x_p is the point on the optimum set X* that is closest to x in L2-norm, then

$$f(x) - f^* \ge \frac{\mu}{2}\,\|x - x_p\|^2.$$

Proof

1. By the μ-PL inequality, ∇f(x) = 0 forces f(x) = f*, so every stationary point is a global minimum, which characterizes invex functions.

2. Integrate d/dt f(x + t(y − x)) = ⟨∇f(x + t(y − x)), y − x⟩ over t ∈ [0, 1] and use the L-Lipschitz continuity of ∇f:

$$f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle = \int_0^1 \langle \nabla f(x + t(y-x)) - \nabla f(x),\, y - x\rangle \, dt \le \int_0^1 L\,t\,\|y - x\|^2 \, dt = \frac{L}{2}\,\|y - x\|^2.$$

3. By definition,

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu}{2}\,\|y - x\|^2 \quad \text{for all } x, y.$$

Minimizing the left side over y gives f*; minimizing the right side over y (attained at y = x − ∇f(x)/μ) gives f(x) − ‖∇f(x)‖²/(2μ). Since the inequality holds for every y, the two minima satisfy f* ≥ f(x) − ‖∇f(x)‖²/(2μ), which is the μ-PL inequality.

4. Write f(x) = g(Ax), so that ∇f(x) = Aᵀ∇g(Ax), and let σ be the smallest nonzero singular value of A. By the μ-strong convexity of g,

$$f(y) = g(Ay) \ge g(Ax) + \langle \nabla g(Ax),\, A(y - x)\rangle + \frac{\mu}{2}\,\|A(y - x)\|^2 = f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu}{2}\,\|A(y - x)\|^2.$$

Now, since ‖Av‖ ≥ σ‖v‖ for every v orthogonal to the null space of A, we have

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu\sigma^2}{2}\,\|y - x\|^2 \quad \text{whenever } y - x \text{ is orthogonal to the null space of } A.$$

Set x_p to be the projection of x to the optimum subspace X* (an affine subspace parallel to the null space of A), so that x − x_p is orthogonal to the null space of A. Thus, taking y = x_p above and applying the Cauchy–Schwarz inequality, we have

$$f^* = f(x_p) \ge f(x) - \|\nabla f(x)\|\,\|x - x_p\| + \frac{\mu\sigma^2}{2}\,\|x - x_p\|^2.$$

Vary the value of ‖x − x_p‖ on the right side to minimize the right side; the minimum equals f(x) − ‖∇f(x)‖²/(2μσ²), so f* ≥ f(x) − ‖∇f(x)‖²/(2μσ²), which is the desired μσ²-PL inequality.

5. Let g = √(f − f*). For any x with f(x) > f*, we have ∇g(x) = ∇f(x)/(2√(f(x) − f*)), so by μ-PL,

$$\|\nabla g(x)\| = \frac{\|\nabla f(x)\|}{2\sqrt{f(x) - f^*}} \ge \sqrt{\frac{\mu}{2}}.$$

In particular, we see that ∇g is a vector field on {f > f*} with size at least √(μ/2). Define a gradient flow along −∇g with constant unit velocity, starting at x:

$$\dot y(t) = -\frac{\nabla g(y(t))}{\|\nabla g(y(t))\|}, \qquad y(0) = x.$$

Because g is bounded below by 0, and g decreases along the flow at rate d/dt g(y(t)) = −‖∇g(y(t))‖ ≤ −√(μ/2), the gradient flow terminates on the zero set {g = 0} = X* at a finite time

$$T \le \frac{g(x)}{\sqrt{\mu/2}} = \sqrt{\frac{2}{\mu}\,\big(f(x) - f^*\big)}.$$

The path length is at most T, since the velocity is constantly 1.

Since y(T) is on the zero set, and x_p is the point of X* closest to x, we have

$$\|x - x_p\| \le \|x - y(T)\| \le T \le \sqrt{\frac{2}{\mu}\,\big(f(x) - f^*\big)},$$

which is the desired result.

Gradient descent

Theorem (linear convergence of gradient descent) — If f is μ-PL and ∇f is L-Lipschitz, then gradient descent with constant step size ε ∈ (0, 2/L) converges linearly as

$$f(x_{t+1}) - f^* \le \Big(1 - 2\mu\epsilon\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(f(x_t) - f^*\big).$$

The convergence is fastest when ε = 1/L, at which point

$$f(x_{t+1}) - f^* \le \Big(1 - \frac{\mu}{L}\Big)\,\big(f(x_t) - f^*\big).$$

Proof

Since ∇f is L-Lipschitz, we have the parabolic upper bound

$$f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t\rangle + \frac{L}{2}\,\|x_{t+1} - x_t\|^2.$$

Plugging in the gradient descent step x_{t+1} − x_t = −ε∇f(x_t),

$$f(x_{t+1}) \le f(x_t) - \epsilon\Big(1 - \frac{L\epsilon}{2}\Big)\,\|\nabla f(x_t)\|^2.$$

Thus, by the μ-PL inequality ‖∇f(x_t)‖² ≥ 2μ(f(x_t) − f*),

$$f(x_{t+1}) - f^* \le \Big(1 - 2\mu\epsilon\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(f(x_t) - f^*\big).$$

Corollary — 1. x_t converges to the optimum set X* at a rate of O((1 − μ/L)^{t/2}) (for step size ε = 1/L, using quadratic growth).

2. If f is μ-PL, not constant, and ∇f is L-Lipschitz, then μ ≤ L.

3. Under the same conditions, gradient descent with the optimal step size at each iteration (which might be found by line-searching) satisfies

$$f(x_{t+1}) - f^* \le \Big(1 - \frac{\mu}{L}\Big)\,\big(f(x_t) - f^*\big).$$
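The following sketch (illustrative only, not from the cited sources) checks the per-step bound of the theorem on f(x) = x² + 3 sin²(x), a standard example of a function that is PL but not convex; the PL constant is estimated numerically on a grid, which is an approximation rather than an analytic constant.

import numpy as np

# f(x) = x^2 + 3*sin(x)^2 is non-convex but PL; its gradient is L-Lipschitz with
# L = 8 since |f''(x)| = |2 + 6*cos(2x)| <= 8, and its global minimum is f* = 0 at x = 0.
def f(x):
    return x ** 2 + 3 * np.sin(x) ** 2

def grad(x):
    return 2 * x + 3 * np.sin(2 * x)

L, f_star = 8.0, 0.0

# Estimate a PL constant on [-10, 10] (assumption: the iterates stay in this range,
# which holds here because f only decreases and f(x) >= x^2). The 0.9 factor guards
# against grid discretization error.
xs = np.linspace(-10, 10, 200001)
xs = xs[np.abs(xs) > 1e-3]
mu = 0.9 * np.min(0.5 * grad(xs) ** 2 / (f(xs) - f_star))

x = 7.3                                  # arbitrary starting point
gap = f(x) - f_star
for t in range(100):
    x = x - grad(x) / L                  # gradient descent with step size 1/L
    new_gap = f(x) - f_star
    assert new_gap <= (1 - mu / L) * gap + 1e-12   # per-step linear-convergence bound
    gap = new_gap
print(gap)   # remaining gap; the theorem guarantees at least (1 - mu/L)^t decay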

Coordinate descent

The coordinate descent algorithm first samples a random coordinate i_t uniformly from {1, …, d}, then performs a gradient descent step on that coordinate alone:

$$x_{t+1} = x_t - \epsilon\,\nabla_{i_t} f(x_t)\, e_{i_t},$$

where e_{i_t} is the i_t-th coordinate basis vector and ∇_{i_t} f is the corresponding partial derivative.

Theorem — Assume that f is μ-PL, and that ∇f is L-Lipschitz at each coordinate, meaning that

$$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L\,|h| \quad \text{for all } x, h, i.$$

Then x_t converges linearly, for all ε ∈ (0, 2/L), by

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - \frac{2\mu\epsilon}{d}\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big).$$

Proof

By the same argument as for gradient descent, using the coordinate-wise Lipschitz bound,

$$f(x_{t+1}) \le f(x_t) - \epsilon\Big(1 - \frac{L\epsilon}{2}\Big)\,|\nabla_{i_t} f(x_t)|^2.$$

Take the expectation with respect to i_t; since E_{i_t}[|∇_{i_t} f(x_t)|²] = (1/d)‖∇f(x_t)‖², we have

$$\mathbb{E}[f(x_{t+1}) \mid x_t] \le f(x_t) - \frac{\epsilon}{d}\Big(1 - \frac{L\epsilon}{2}\Big)\,\|\nabla f(x_t)\|^2.$$

Plug in the μ-PL inequality, and we have

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - \frac{2\mu\epsilon}{d}\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big).$$

Iterating the process, we have the desired result.
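A minimal sketch of the algorithm itself (illustrative only, on an assumed least-squares objective; the coordinate-wise Lipschitz constant is the largest diagonal entry of AᵀA):

import numpy as np

# Randomized coordinate descent on f(x) = 0.5*||A x - b||^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
d = A.shape[1]

def grad(x):
    return A.T @ (A @ x - b)

L = np.max(np.sum(A * A, axis=0))   # coordinate-wise Lipschitz constant: max_i (A^T A)_ii
eps = 1.0 / L

x = np.zeros(d)
for t in range(5000):
    i = rng.integers(d)             # sample a coordinate uniformly at random
    x[i] -= eps * grad(x)[i]        # step on that single coordinate (full gradient computed only for brevity)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(0.5 * np.sum((A @ x - b) ** 2) - 0.5 * np.sum((A @ x_star - b) ** 2))  # ~ 0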

Stochastic gradient descent

In stochastic gradient descent, we have a function f to minimize, but we cannot sample its gradient directly. Instead, we sample a random gradient ∇f_i(x), where the functions f_i are such that

$$f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x).$$

For example, in typical machine learning, x are the parameters of the neural network, and f_i is the loss incurred on the i-th training data point, while f is the average loss over all training data points.

The gradient update step is

$$x_{t+1} = x_t - \epsilon_t\,\nabla f_{i_t}(x_t),$$

where i_t is sampled uniformly at random from {1, …, N} and ε_0, ε_1, … are a sequence of learning rates (the learning rate schedule).

Theorem — If each ∇f_i is L-Lipschitz, f is μ-PL, and f has global minimum f*, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\mu\epsilon_t)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L\epsilon_t^2}{2}\,\mathbb{E}\big[\|\nabla f_{i_t}(x_t)\|^2\big].$$

We can also write it using the variance of the stochastic gradient, Var_{i_t}[∇f_{i_t}(x_t)] := E_{i_t}[‖∇f_{i_t}(x_t) − ∇f(x_t)‖²]: for ε_t ≤ 2/L,

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - 2\mu\epsilon_t\big(1 - \tfrac{L\epsilon_t}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L\epsilon_t^2}{2}\,\mathbb{E}\big[\mathrm{Var}_{i_t}[\nabla f_{i_t}(x_t)]\big].$$

Proof

Because all the ∇f_i are L-Lipschitz, so is ∇f. We thus have the parabolic upper bound

$$f(x_{t+1}) \le f(x_t) - \epsilon_t\,\langle \nabla f(x_t),\, \nabla f_{i_t}(x_t)\rangle + \frac{L\epsilon_t^2}{2}\,\|\nabla f_{i_t}(x_t)\|^2.$$

Now, take the expectation over i_t, using E_{i_t}[∇f_{i_t}(x_t)] = ∇f(x_t), and use the fact that f is μ-PL. This gives the first equation.

The second equation is shown similarly by noting that

$$\mathbb{E}_{i_t}\big[\|\nabla f_{i_t}(x_t)\|^2\big] = \|\nabla f(x_t)\|^2 + \mathbb{E}_{i_t}\big[\|\nabla f_{i_t}(x_t) - \nabla f(x_t)\|^2\big].$$

As it is, the proposition is difficult to use. We can make it easier to use by adding some further assumptions.

The second-moment term on the right can be removed by assuming a uniform upper bound. That is, if there exists some C > 0 such that during the SGD process we have E[‖∇f_{i_t}(x_t)‖²] ≤ C² for all t, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\mu\epsilon_t)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L C^2 \epsilon_t^2}{2}.$$

Similarly, if E[‖∇f_{i_t}(x_t) − ∇f(x_t)‖²] ≤ σ² for all t, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - 2\mu\epsilon_t\big(1 - \tfrac{L\epsilon_t}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L \sigma^2 \epsilon_t^2}{2}.$$

Learning rate schedules

For a constant learning rate schedule, with ε_t = ε ∈ (0, 1/(2μ)), we have

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\mu\epsilon)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L C^2 \epsilon^2}{2}.$$

By induction, we have

$$\mathbb{E}[f(x_t)] - f^* \le (1 - 2\mu\epsilon)^t\,\big(f(x_0) - f^*\big) + \frac{L C^2 \epsilon}{4\mu}.$$

We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the LC²ε/(4μ) term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and x_t starts doing a random walk in the vicinity of X*.

For a decreasing learning rate schedule with ε_t ∝ 1/t (suitably scaled), we have E[f(x_t)] − f* = O(1/t).
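The contrast between the two schedules can be seen in the following sketch (illustrative only, on an assumed finite-sum least-squares problem with noisy targets, so the stochastic gradients do not vanish at the optimum): the constant rate tends to plateau at a noise floor, while the decaying rate keeps improving.

import numpy as np

# f(x) = (1/N) * sum_i 0.5*(a_i . x - b_i)^2, sampled one term at a time.
rng = np.random.default_rng(2)
N, d = 200, 5
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)   # noisy targets

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad_i(x, i):                        # stochastic gradient from the i-th term
    return (A[i] @ x - b[i]) * A[i]

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

def run_sgd(schedule, steps=20000):
    x = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(N)
        x = x - schedule(t) * grad_i(x, i)
    return f(x) - f_star

print("constant epsilon_t = 0.03   :", run_sgd(lambda t: 0.03))
print("decreasing epsilon_t ~ 1/t  :", run_sgd(lambda t: 1.0 / (0.3 * t + 30)))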

References

  • Bierstone, Edward; Milman, Pierre D. (1988), "Semianalytic and subanalytic sets", Publications Mathématiques de l'IHÉS, 67 (67): 5–42, doi:10.1007/BF02699126, ISSN 1618-1913, MR 0972342, S2CID 56006439
  • Ji, Shanyu; Kollár, János; Shiffman, Bernard (1992), "A global Łojasiewicz inequality for algebraic varieties", Transactions of the American Mathematical Society, 329 (2): 813–818, doi:10.2307/2153965, ISSN 0002-9947, JSTOR 2153965, MR 1046016
  • Karimi, Hamed; Nutini, Julie; Schmidt, Mark (2016). "Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition". arXiv:1608.04636 [cs.LG].
  • Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail (2022-07-01). "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks". Applied and Computational Harmonic Analysis. Special Issue on Harmonic Analysis and Machine Learning. 59: 85–116. arXiv:2003.00307. doi:10.1016/j.acha.2021.12.009. ISSN 1063-5203.