Łojasiewicz inequality

In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let ƒ : U → R be a real analytic function on an open set U in R^n, and let Z be the zero locus of ƒ. Assume that Z is not empty. Then for any compact set K in U, there exist positive constants α and C such that, for all x in K,

$$\operatorname{dist}(x, Z)^\alpha \le C\,|f(x)|.$$

Here α can be large.

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on f, for every p ∈ U there is a possibly smaller open neighborhood W of p and constants θ ∈ (0,1) and c > 0 such that

$$|f(x) - f(p)|^\theta \le c\,\|\nabla f(x)\| \quad \text{for all } x \in W.$$
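A simple one-variable illustration (not drawn from the sources cited below) is f(x) = x^k with an integer k ≥ 2 and p = 0: the gradient form of the inequality then holds on any neighborhood of 0 with exponent θ = 1 − 1/k and constant c = 1/k, since

$$|f(x) - f(0)|^{\,1 - 1/k} = |x|^{\,k-1} = \frac{1}{k}\,\big|k\,x^{\,k-1}\big| = \frac{1}{k}\,|f'(x)|.$$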

Polyak inequality

A special case of the Łojasiewicz inequality, due to Boris Polyak, is commonly used to prove linear convergence of gradient descent algorithms. This section is based on Karimi, Nutini & Schmidt (2016) and Liu, Zhu & Belkin (2022).

Definitions

f is a function of type R^d → R, and has a continuous derivative ∇f.

X* is the subset of R^d on which f achieves its global minimum f* (if one exists). Throughout this section we assume such a global minimum value f* exists, unless otherwise stated. The optimization objective is to find some point x* in X*.

L, μ > 0 are constants.

∇f is L-Lipschitz continuous iff

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \text{for all } x, y.$$

f is μ-strongly convex iff

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu}{2}\,\|y - x\|^2 \quad \text{for all } x, y.$$

f is μ-PL (where "PL" means "Polyak–Łojasiewicz") iff

$$\frac{1}{2}\,\|\nabla f(x)\|^2 \ge \mu\,\big(f(x) - f^*\big) \quad \text{for all } x.$$
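As a concrete illustration of the last definition, the following sketch (illustrative only, not taken from the cited sources) numerically checks the μ-PL inequality for a rank-deficient least-squares objective, which is not strongly convex but is PL with μ equal to the square of the smallest nonzero singular value of A (property 4 of the theorem below).

import numpy as np

# Numeric check of the mu-PL inequality for f(x) = 0.5*||A x - b||^2 with a
# rank-deficient A (so f is not strongly convex). By property 4 below, f is
# mu-PL with mu = (smallest nonzero singular value of A)^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
A[:, 5:] = 0.0                        # make A rank-deficient
b = rng.standard_normal(20)

def f(x):
    return 0.5 * np.sum((A @ x - b) ** 2)

def grad(x):
    return A.T @ (A @ x - b)

svals = np.linalg.svd(A, compute_uv=False)
mu = np.min(svals[svals > 1e-9]) ** 2              # PL constant

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])   # global minimum value f*

for _ in range(1000):                 # the inequality holds at every point
    x = 10 * rng.standard_normal(10)
    assert 0.5 * np.sum(grad(x) ** 2) >= mu * (f(x) - f_star) - 1e-6

Dropping the zero columns would make A full rank and f strongly convex; the point of the PL condition is that the inequality, and hence the convergence results below, survive without that.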

Basic properties

Theorem — 1. If f is μ-PL, then it is invex.

2. If ∇f is L-Lipschitz continuous, then

$$f(y) \le f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{L}{2}\,\|y - x\|^2 \quad \text{for all } x, y.$$

3. If f is μ-strongly convex, then it is μ-PL.

4. If g is μ-strongly convex and A is a linear map, then f = g ∘ A is μσ²-PL, where σ is the smallest nonzero singular value of A.

5. (quadratic growth) If f is μ-PL, x is a point, and x_p is the point on the optimum set X* that is closest to x in L2-norm, then

$$f(x) - f^* \ge \frac{\mu}{2}\,\|x - x_p\|^2.$$

Proof

1. By the μ-PL inequality, ∇f(x) = 0 forces f(x) = f*, so every stationary point is a global minimum, which characterizes invex functions.

2. Integrate d/dt f(x + t(y − x)) = ⟨∇f(x + t(y − x)), y − x⟩ over t ∈ [0, 1] and use the L-Lipschitz continuity of ∇f:

$$f(y) - f(x) - \langle \nabla f(x),\, y - x\rangle = \int_0^1 \langle \nabla f(x + t(y-x)) - \nabla f(x),\, y - x\rangle \, dt \le \int_0^1 L\,t\,\|y - x\|^2 \, dt = \frac{L}{2}\,\|y - x\|^2.$$

3. By definition,

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu}{2}\,\|y - x\|^2 \quad \text{for all } x, y.$$

Minimizing the left side over y gives f*; minimizing the right side over y (attained at y = x − ∇f(x)/μ) gives f(x) − ‖∇f(x)‖²/(2μ). Since the inequality holds for every y, the two minima satisfy f* ≥ f(x) − ‖∇f(x)‖²/(2μ), which is the μ-PL inequality.

4. Write f(x) = g(Ax), so that ∇f(x) = Aᵀ∇g(Ax), and let σ be the smallest nonzero singular value of A. By the μ-strong convexity of g,

$$f(y) = g(Ay) \ge g(Ax) + \langle \nabla g(Ax),\, A(y - x)\rangle + \frac{\mu}{2}\,\|A(y - x)\|^2 = f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu}{2}\,\|A(y - x)\|^2.$$

Now, since ‖Av‖ ≥ σ‖v‖ for every v orthogonal to the null space of A, we have

$$f(y) \ge f(x) + \langle \nabla f(x),\, y - x\rangle + \frac{\mu\sigma^2}{2}\,\|y - x\|^2 \quad \text{whenever } y - x \text{ is orthogonal to the null space of } A.$$

Set x_p to be the projection of x to the optimum subspace X* (an affine subspace parallel to the null space of A), so that x − x_p is orthogonal to the null space of A. Thus, taking y = x_p above and applying the Cauchy–Schwarz inequality, we have

$$f^* = f(x_p) \ge f(x) - \|\nabla f(x)\|\,\|x - x_p\| + \frac{\mu\sigma^2}{2}\,\|x - x_p\|^2.$$

Vary the value of ‖x − x_p‖ on the right side to minimize the right side; the minimum equals f(x) − ‖∇f(x)‖²/(2μσ²), so f* ≥ f(x) − ‖∇f(x)‖²/(2μσ²), which is the desired μσ²-PL inequality.

5. Let g = √(f − f*). For any x with f(x) > f*, we have ∇g(x) = ∇f(x)/(2√(f(x) − f*)), so by μ-PL,

$$\|\nabla g(x)\| = \frac{\|\nabla f(x)\|}{2\sqrt{f(x) - f^*}} \ge \sqrt{\frac{\mu}{2}}.$$

In particular, we see that ∇g is a vector field on {f > f*} with size at least √(μ/2). Define a gradient flow along −∇g with constant unit velocity, starting at x:

$$\dot y(t) = -\frac{\nabla g(y(t))}{\|\nabla g(y(t))\|}, \qquad y(0) = x.$$

Because g is bounded below by 0, and g decreases along the flow at rate d/dt g(y(t)) = −‖∇g(y(t))‖ ≤ −√(μ/2), the gradient flow terminates on the zero set {g = 0} = X* at a finite time

$$T \le \frac{g(x)}{\sqrt{\mu/2}} = \sqrt{\frac{2}{\mu}\,\big(f(x) - f^*\big)}.$$

The path length is at most T, since the velocity is constantly 1.

Since y(T) is on the zero set, and x_p is the point of X* closest to x, we have

$$\|x - x_p\| \le \|x - y(T)\| \le T \le \sqrt{\frac{2}{\mu}\,\big(f(x) - f^*\big)},$$

which is the desired result.

Gradient descent

Theorem (linear convergence of gradient descent) — If f is μ-PL and ∇f is L-Lipschitz, then gradient descent with constant step size ε ∈ (0, 2/L) converges linearly as

$$f(x_{t+1}) - f^* \le \Big(1 - 2\mu\epsilon\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(f(x_t) - f^*\big).$$

The convergence is fastest when ε = 1/L, at which point

$$f(x_{t+1}) - f^* \le \Big(1 - \frac{\mu}{L}\Big)\,\big(f(x_t) - f^*\big).$$

Proof

Since ∇f is L-Lipschitz, we have the parabolic upper bound

$$f(x_{t+1}) \le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t\rangle + \frac{L}{2}\,\|x_{t+1} - x_t\|^2.$$

Plugging in the gradient descent step x_{t+1} − x_t = −ε∇f(x_t),

$$f(x_{t+1}) \le f(x_t) - \epsilon\Big(1 - \frac{L\epsilon}{2}\Big)\,\|\nabla f(x_t)\|^2.$$

Thus, by the μ-PL inequality ‖∇f(x_t)‖² ≥ 2μ(f(x_t) − f*),

$$f(x_{t+1}) - f^* \le \Big(1 - 2\mu\epsilon\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(f(x_t) - f^*\big).$$

Corollary — 1. x_t converges to the optimum set X* at a rate of O((1 − μ/L)^{t/2}) (for step size ε = 1/L, using quadratic growth).

2. If f is μ-PL, not constant, and ∇f is L-Lipschitz, then μ ≤ L.

3. Under the same conditions, gradient descent with the optimal step size at each iteration (which might be found by line-searching) satisfies

$$f(x_{t+1}) - f^* \le \Big(1 - \frac{\mu}{L}\Big)\,\big(f(x_t) - f^*\big).$$
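The following sketch (illustrative only, not from the cited sources) checks the per-step bound of the theorem on f(x) = x² + 3 sin²(x), a standard example of a function that is PL but not convex; the PL constant is estimated numerically on a grid, which is an approximation rather than an analytic constant.

import numpy as np

# f(x) = x^2 + 3*sin(x)^2 is non-convex but PL; its gradient is L-Lipschitz with
# L = 8 since |f''(x)| = |2 + 6*cos(2x)| <= 8, and its global minimum is f* = 0 at x = 0.
def f(x):
    return x ** 2 + 3 * np.sin(x) ** 2

def grad(x):
    return 2 * x + 3 * np.sin(2 * x)

L, f_star = 8.0, 0.0

# Estimate a PL constant on [-10, 10] (assumption: the iterates stay in this range,
# which holds here because f only decreases and f(x) >= x^2). The 0.9 factor guards
# against grid discretization error.
xs = np.linspace(-10, 10, 200001)
xs = xs[np.abs(xs) > 1e-3]
mu = 0.9 * np.min(0.5 * grad(xs) ** 2 / (f(xs) - f_star))

x = 7.3                                  # arbitrary starting point
gap = f(x) - f_star
for t in range(100):
    x = x - grad(x) / L                  # gradient descent with step size 1/L
    new_gap = f(x) - f_star
    assert new_gap <= (1 - mu / L) * gap + 1e-12   # per-step linear-convergence bound
    gap = new_gap
print(gap)   # remaining gap; the theorem guarantees at least (1 - mu/L)^t decay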

Coordinate descent

The coordinate descent algorithm first samples a random coordinate i_t uniformly from {1, …, d}, then performs a gradient descent step on that coordinate alone:

$$x_{t+1} = x_t - \epsilon\,\nabla_{i_t} f(x_t)\, e_{i_t},$$

where e_{i_t} is the i_t-th coordinate basis vector and ∇_{i_t} f is the corresponding partial derivative.

Theorem — Assume that f is μ-PL, and that ∇f is L-Lipschitz at each coordinate, meaning that

$$|\nabla_i f(x + h e_i) - \nabla_i f(x)| \le L\,|h| \quad \text{for all } x, h, i.$$

Then x_t converges linearly, for all ε ∈ (0, 2/L), by

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - \frac{2\mu\epsilon}{d}\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big).$$

Proof

By the same argument as for gradient descent, using the coordinate-wise Lipschitz bound,

$$f(x_{t+1}) \le f(x_t) - \epsilon\Big(1 - \frac{L\epsilon}{2}\Big)\,|\nabla_{i_t} f(x_t)|^2.$$

Take the expectation with respect to i_t; since E_{i_t}[|∇_{i_t} f(x_t)|²] = (1/d)‖∇f(x_t)‖², we have

$$\mathbb{E}[f(x_{t+1}) \mid x_t] \le f(x_t) - \frac{\epsilon}{d}\Big(1 - \frac{L\epsilon}{2}\Big)\,\|\nabla f(x_t)\|^2.$$

Plug in the μ-PL inequality, and we have

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - \frac{2\mu\epsilon}{d}\big(1 - \tfrac{L\epsilon}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big).$$

Iterating the process, we have the desired result.
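A minimal sketch of the algorithm itself (illustrative only, on an assumed least-squares objective; the coordinate-wise Lipschitz constant is the largest diagonal entry of AᵀA):

import numpy as np

# Randomized coordinate descent on f(x) = 0.5*||A x - b||^2.
rng = np.random.default_rng(1)
A = rng.standard_normal((30, 8))
b = rng.standard_normal(30)
d = A.shape[1]

def grad(x):
    return A.T @ (A @ x - b)

L = np.max(np.sum(A * A, axis=0))   # coordinate-wise Lipschitz constant: max_i (A^T A)_ii
eps = 1.0 / L

x = np.zeros(d)
for t in range(5000):
    i = rng.integers(d)             # sample a coordinate uniformly at random
    x[i] -= eps * grad(x)[i]        # step on that single coordinate (full gradient computed only for brevity)

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(0.5 * np.sum((A @ x - b) ** 2) - 0.5 * np.sum((A @ x_star - b) ** 2))  # ~ 0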

Stochastic gradient descent

In stochastic gradient descent, we have a function f to minimize, but we cannot sample its gradient directly. Instead, we sample a random gradient ∇f_i(x), where the functions f_i are such that

$$f(x) = \frac{1}{N}\sum_{i=1}^N f_i(x).$$

For example, in typical machine learning, x are the parameters of the neural network, and f_i is the loss incurred on the i-th training data point, while f is the average loss over all training data points.

The gradient update step is

$$x_{t+1} = x_t - \epsilon_t\,\nabla f_{i_t}(x_t),$$

where i_t is sampled uniformly at random from {1, …, N} and ε_0, ε_1, … are a sequence of learning rates (the learning rate schedule).

Theorem — If each ∇f_i is L-Lipschitz, f is μ-PL, and f has global minimum f*, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\mu\epsilon_t)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L\epsilon_t^2}{2}\,\mathbb{E}\big[\|\nabla f_{i_t}(x_t)\|^2\big].$$

We can also write it using the variance of the stochastic gradient, Var_{i_t}[∇f_{i_t}(x_t)] := E_{i_t}[‖∇f_{i_t}(x_t) − ∇f(x_t)‖²]: for ε_t ≤ 2/L,

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - 2\mu\epsilon_t\big(1 - \tfrac{L\epsilon_t}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L\epsilon_t^2}{2}\,\mathbb{E}\big[\mathrm{Var}_{i_t}[\nabla f_{i_t}(x_t)]\big].$$

Proof

Because all the ∇f_i are L-Lipschitz, so is ∇f. We thus have the parabolic upper bound

$$f(x_{t+1}) \le f(x_t) - \epsilon_t\,\langle \nabla f(x_t),\, \nabla f_{i_t}(x_t)\rangle + \frac{L\epsilon_t^2}{2}\,\|\nabla f_{i_t}(x_t)\|^2.$$

Now, take the expectation over i_t, using E_{i_t}[∇f_{i_t}(x_t)] = ∇f(x_t), and use the fact that f is μ-PL. This gives the first equation.

The second equation is shown similarly by noting that

$$\mathbb{E}_{i_t}\big[\|\nabla f_{i_t}(x_t)\|^2\big] = \|\nabla f(x_t)\|^2 + \mathbb{E}_{i_t}\big[\|\nabla f_{i_t}(x_t) - \nabla f(x_t)\|^2\big].$$

As it is, the proposition is difficult to use. We can make it easier to use by adding some further assumptions.

The second-moment term on the right can be removed by assuming a uniform upper bound. That is, if there exists some C > 0 such that during the SGD process we have E[‖∇f_{i_t}(x_t)‖²] ≤ C² for all t, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\mu\epsilon_t)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L C^2 \epsilon_t^2}{2}.$$

Similarly, if E[‖∇f_{i_t}(x_t) − ∇f(x_t)‖²] ≤ σ² for all t, then

$$\mathbb{E}[f(x_{t+1})] - f^* \le \Big(1 - 2\mu\epsilon_t\big(1 - \tfrac{L\epsilon_t}{2}\big)\Big)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L \sigma^2 \epsilon_t^2}{2}.$$

Learning rate schedules

For a constant learning rate schedule, with ε_t = ε ∈ (0, 1/(2μ)), we have

$$\mathbb{E}[f(x_{t+1})] - f^* \le (1 - 2\mu\epsilon)\,\big(\mathbb{E}[f(x_t)] - f^*\big) + \frac{L C^2 \epsilon^2}{2}.$$

By induction, we have

$$\mathbb{E}[f(x_t)] - f^* \le (1 - 2\mu\epsilon)^t\,\big(f(x_0) - f^*\big) + \frac{L C^2 \epsilon}{4\mu}.$$

We see that the loss decreases in expectation first exponentially, but then stops decreasing, which is caused by the LC²ε/(4μ) term. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and x_t starts doing a random walk in the vicinity of X*.

For a decreasing learning rate schedule with ε_t ∝ 1/t (suitably scaled), we have E[f(x_t)] − f* = O(1/t).
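The contrast between the two schedules can be seen in the following sketch (illustrative only, on an assumed finite-sum least-squares problem with noisy targets, so the stochastic gradients do not vanish at the optimum): the constant rate tends to plateau at a noise floor, while the decaying rate keeps improving.

import numpy as np

# f(x) = (1/N) * sum_i 0.5*(a_i . x - b_i)^2, sampled one term at a time.
rng = np.random.default_rng(2)
N, d = 200, 5
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)   # noisy targets

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def grad_i(x, i):                        # stochastic gradient from the i-th term
    return (A[i] @ x - b[i]) * A[i]

f_star = f(np.linalg.lstsq(A, b, rcond=None)[0])

def run_sgd(schedule, steps=20000):
    x = np.zeros(d)
    for t in range(1, steps + 1):
        i = rng.integers(N)
        x = x - schedule(t) * grad_i(x, i)
    return f(x) - f_star

print("constant epsilon_t = 0.03   :", run_sgd(lambda t: 0.03))
print("decreasing epsilon_t ~ 1/t  :", run_sgd(lambda t: 1.0 / (0.3 * t + 30)))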

References

  • Bierstone, Edward; Milman, Pierre D. (1988), "Semianalytic and subanalytic sets", Publications Mathématiques de l'IHÉS, 67 (67): 5–42, doi:10.1007/BF02699126, ISSN 1618-1913, MR 0972342, S2CID 56006439
  • Ji, Shanyu; Kollár, János; Shiffman, Bernard (1992), "A global Łojasiewicz inequality for algebraic varieties", Transactions of the American Mathematical Society, 329 (2): 813–818, doi:10.2307/2153965, ISSN 0002-9947, JSTOR 2153965, MR 1046016
  • Karimi, Hamed; Nutini, Julie; Schmidt, Mark (2016). "Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition". arXiv:1608.04636 [cs.LG].
  • Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail (2022-07-01). "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks". Applied and Computational Harmonic Analysis. Special Issue on Harmonic Analysis and Machine Learning. 59: 85–116. arXiv:2003.00307. doi:10.1016/j.acha.2021.12.009. ISSN 1063-5203.