
Understanding Gradient Descent


Machine Learning (ML) and Artificial Intelligence (AI) rely heavily on a very intuitive differentiation-based algorithm called gradient descent. It gets us closer to our desired minimum or maximum by starting with a value and adjusting that value based on the change of our gradient. In this article you'll learn what gradient descent is and all the components that make it up, and then explore a few modifications of it, from simple to advanced, to solidify your understanding!

Foreword

This article assumes you understand what a derivative is!

This remarkably simple but powerful algorithm stands on the shoulders of differentiation and is the backbone for “backward propagation of errors” (back prop) which calculates the gradients needed for gradient descent.

A quick little TLDR refresher for those who need it. The derivative at a point gives the slope of the tangent line at that point. This slope represents the rate of change of the function at that specific point. We calculate the derivative (or gradient for multivariable functions) at the current point to get the direction of the steepest ascent and make it negative to minimise the function. We iteratively recalculate the derivative/gradient at the new point and continue to adjust our input variables.

When we take a derivative, the tangent at our point is either slanted (tilted one way or the other) or horizontal. A horizontal tangent (slope 0) indicates we're at the optimal value for the local min/max. If it's slanted, the goal is to get it closer to horizontal, i.e., a slope of 0.

Single Variable

If we want to minimise the y value of a single-variable function we would use single-variable gradient descent, denoted as

x_{n+1} = x_n - \gamma f'(x_n)

where \gamma (or \alpha) is the step size: how much we adjust our current position based on the slope/gradient we calculated from the derivative f'(x) (which becomes "nabla" aka "del", \nabla, in multi-variable gradient descent, which we'll talk about soon). We use a negative sign to turn the ascent of the gradient into descent.

Example

Let's do a quick example to demonstrate with f(x) = x^2 + 2x + 1:

  • The derivative is f'(x) = 2x + 2
  • Starting point x_0 = 3
  • Learning rate (step size) \gamma = 0.1
  • Our starting y based off x_0 is f(3) = 3^2 + 2(3) + 1 = 16

When we plug our values into our formula on iteration 0 of f(x) we get:

  1. Plug into the derivative: f'(x_0) = 2(3) + 2 = 8
  2. Then plug f'(x_0) into our gradient descent update: x_1 = 3 - 0.1(8) = 2.2
  3. Our new y with x_1 = 2.2 is f(2.2) = (2.2)^2 + 2(2.2) + 1 = 10.24
  4. We started with x_0 = 3 and our descent step gave us x_1 = 2.2, so a 3 - 2.2 = 0.8 change in x resulted in a 16 - 10.24 = 5.76 reduction in y, converging towards the minimum (secretly at x = -1, where y = 0).

The next iteration (1)

  1. Our starting point is updated from our previous iteration from x_0 = 3 to x_1 = 2.2, PROGRESS!
  2. Calculate the new derivative f'(x_1) = 2(2.2) + 2 = 6.4
  3. Plug into the gradient descent update: x_2 = 2.2 - 0.1(6.4) = 1.56
  4. And calculating our output y = f(x_2) = (1.56)^2 + 2(1.56) + 1 = 6.5536, another reduction of \approx 3.69

Then this process is repeated until we reach the minimum of the function at x = -1 (the vertex of the parabola, where y = 0), with very small improvements the closer we get. So it's really not worth doing many more iterations past a certain point because the diminishing returns kick in hard.
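Here's a minimal Python sketch of this exact loop (the helper names and iteration count are just my own choices) so you can watch the diminishing returns for yourself:

```python
def f(x):
    return x**2 + 2*x + 1      # our example parabola

def f_prime(x):
    return 2*x + 2             # its derivative

x = 3.0        # starting point x_0
gamma = 0.1    # learning rate

for i in range(25):
    x = x - gamma * f_prime(x)     # the gradient descent update
    print(f"iteration {i}: x = {x:.4f}, f(x) = {f(x):.4f}")
# x creeps towards -1 and f(x) towards 0, with smaller and smaller improvements
```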

But what if we overshoot our learning rate (\gamma)?

Let's say we fuck up our learning rate to something like 0.5 as an extreme example instead of our 0.1 in iteration (1). So instead of the normal step, x_2 = 2.2 - 0.1(6.4) = 1.56, we replace the 0.1 with 0.5: x_2 = 2.2 - 0.5(6.4) = -1. So our x went negative! We've jumped right across the curve, where the tangent flips from slanting one way to the other. But it's okay! Since it's a parabola, overshooting just drops us onto the other side of the curve (in this case the step happens to land exactly on the vertex, a fluke of this particular quadratic) and y still went down, so we still improved! Although if this was a non-convex function this may have thrown us into another local valley entirely!

If the learning rate was 0.9, with x_2 = 2.2 - 0.9(6.4) = -3.56, then we would fly well past the vertex and end up partway back up the other side of the curve, bouncing from side to side with each step, so the learning rate needs some adjustment.
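If you want to see the overshooting for yourself, here's a small sketch (same toy parabola, the loop is my own) running the update with a few different learning rates:

```python
def f_prime(x):
    return 2*x + 2

for gamma in (0.1, 0.5, 0.9):
    x = 3.0
    steps = [x]
    for _ in range(5):
        x = x - gamma * f_prime(x)
        steps.append(round(x, 4))
    print(f"gamma={gamma}: {steps}")
# 0.1 creeps towards the vertex, 0.5 happens to land on it in one step,
# and 0.9 keeps bouncing from one side of the parabola to the other
```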

Multi-Variable

Let’s learn about the next evolution of it with the multi-variable version,

x_{n+1} = x_n - \gamma \nabla f(x_n)

where “nabla” \nabla is added, which is commonly used in multi-variable calculus to represent a vector (a one-dimensional array; a matrix is multi-dimensional) of all the partial derivatives, i.e., \nabla {\color{yellow}{f}}({\color{orange}{x}},{\color{red}{y}}) = \left\lbrack \dfrac{\partial {\color{yellow}{f}}}{\partial \color{orange}{x}}, \dfrac{\partial {\color{yellow}{f}}}{\partial {\color{red}{y}}} \right\rbrack, and as an example, for the function

f(x,y) = x^2 + y^2

the gradient would be

\nabla f(x,y) = \left\lbrack \dfrac{\partial}{\partial x} x^2, \dfrac{\partial}{\partial y} y^2 \right\rbrack = [2x, 2y]

because we treat the terms without the variable we're differentiating with respect to as constants, and their derivative is zero, e.g., \dfrac{\partial}{\partial x} f(x,y) = 2x + 0 because y^2 is treated the same as a constant term like the number 5.

This is what we call the gradient of f — a vector of partial derivatives pointing in the direction of the steepest ascent. Gradient descent uses the negative of the gradient.
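To make the multi-variable version concrete, here's a minimal numpy sketch (the starting point and iteration count are my own toy choices) descending f(x,y) = x^2 + y^2:

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2*x, 2*y])     # gradient of f(x, y) = x^2 + y^2

p = np.array([3.0, -4.0])           # starting point (x_0, y_0)
gamma = 0.1                         # learning rate

for _ in range(50):
    p = p - gamma * grad_f(p)       # x_{n+1} = x_n - gamma * grad f(x_n)

print(p)   # both coordinates have shrunk towards the minimum at (0, 0)
```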

You can then adapt the gradient to whatever you like, e.g., selective learning rates for different partials based on their correlational strength to moving the needle or adding some kind of momentum factor like taking large learning steps and then slowing down as your angle gets closer to horizontal. Let's look at a neat adaptation!

Non-convex Problem

There is a big problem though with non-linear (non-convex) problems, because there will be many valleys and troughs, meaning there are local highs and lows and we'll get stuck in them. Therefore, techniques to break out of these local spots are needed!

I thought of this algorithm while writing this article. Although kind of flawed, it is quite interesting to explore down the road. Up next we'll look into modifying the gradient descent algo to account for interesting enhancements, similar to what you see in this image!


Augmenting Gradient Descent Template

And so, we can extend the multi-variate gradient descent (GD), like in programming, into a scaffolded version to account for automatic adaptability, meaning it has the framework for it but it's not natively built in! It also doesn't have an objective function to actually optimise for, which we'll add later.

{\color{#A1DE93}{F(a_{n+1}) \le}} \; {\color{#F47C7C} F(a_n) - \gamma_n ||\nabla F(a_n)||_2 \, ||P_n||_2} \; {\color{#70A1D7} \left[ \cos \theta_n - \max_{t\in[0,1]} \dfrac{||\nabla F(a_n - t\gamma_n P_n)-\nabla F(a_n)||_2}{|| \nabla F (a_n)||_2}\right]}

Ok, so this is a whole lot more than the previous gradient descent but I promise it’s quite intuitive! We’ll go through each colour step by step, from the inequality (green), the proposed gradient adjustment (red) and the descent redirection (blue) to make sure we understand each component of the spooky scary equation.

The whole idea of this modified gradient descent is to account for the assumption that the steepest descent isn't always the best decision, as it can take longer to reach the next optimum than taking a less steep route, e.g., going sideways for a bit may end up unlocking a faster path than going straight down.

Here’s a simplified image of what I mean.


z_1 would have been our direct path to the subset but there is clearly a faster path that diverges from the tangent going vertically down, so we adjust our search direction to go down z_2! This is an oversimplified drawing but there will be times where you need to go horizontal to reach the fastest path — see how this seems counter-intuitive at first? This is what this template aims to scaffold for us.

Inequality Check

So the inequality {\color{#A1DE93}{F(a_{n+1}) \le}} is simply saying the next step in our descent must decrease from the current step, quite simple. You may have noticed that x is replaced with a. This is just a notation difference, but a is typically used in numerics to represent an approximation point that we update iteratively, rather than an exact or deterministic point.

Proposed Gradient Adjustment

{\color{#F47C7C} F(a_n) - \gamma_n ||\nabla F(a_n)||_2 ||P_n||_2} looks like our multi-var GD but with 3 modifications.

Dynamic Learning Rate

The first is the addition of the subscript on our learning rate, \gamma_n, which allows us to adjust our learning rate with each iteration (automatically adjusting it requires implementation though!).

Gradient Magnitude

The second is the addition of the ||x||_2 encapsulating \nabla F(a_n) and P_n — this is the L2 (Euclidean) norm, denoted as ||x||_2 = \sqrt{\sum_{i=1}^n |x_i|^2}, which converts the multi-dimensional gradient vector into its magnitude, a single scalar value. The magnitude of the vector is useful because it measures the overall gradient sensitivity to the inputs. The gradient vector itself gives the direction of the steepest ascent, while its L2 norm indicates how steep that ascent is. It's like finding the slope while accounting for all the partial derivatives. Squaring each item smooths out the sum and makes the norm differentiable everywhere except at the origin (the bottom of its curve, where it's horizontal)!

TLDR; the gradient itself provides both direction and magnitude from the individual partial derivatives but the L2 norm gives us the overall magnitude in a single scalar value making it easier to adjust the step size computation (since we deal with one value instead of n partial derivatives and the calculations associated with them).

Remember |x| is a 'V' shaped graph; squaring makes it a parabola-like graph. We square our summed values because it introduces nice-to-work-with properties, like removing negative numbers and smoothing the graph (relative to not squaring), making it easier to differentiate.
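As a quick sanity check, here's the L2 norm of a toy gradient vector in numpy, spelled out from the definition and via the built-in norm:

```python
import numpy as np

grad = np.array([2.0, -3.0, 6.0])          # some toy gradient vector
l2 = np.sqrt(np.sum(np.abs(grad) ** 2))    # ||grad||_2 spelled out from the definition
print(l2, np.linalg.norm(grad))            # both print 7.0: the overall steepness as one scalar
```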

Descent Direction Template

The last addition is P_n, which represents the descent direction at iteration n and allows the direction of the step to change with each iteration to adapt the optimisation process. It can incorporate information from previous iterations or use second-order information about the function to potentially converge faster, e.g., a momentum-based method, or to avoid local minima, e.g., adjacent replicas that use large learning rates to double-check if there are neighbouring valleys.

The L2 norm is also applied to P_n because it scales the step size based on both the gradient magnitude and the magnitude of the chosen descent direction.

Momentum P_n Example

Let’s do a quick example of a single-var momentum model to simplify the calculations and maximise the understanding. This algo is great for skipping past small local minima potholes.

We first have a simple quadratic fn f(x) = x^2 + 2x + 1 and our init params are:

  • Init input x_0 = 3
  • Learning rate \gamma = 0.1
  • Momentum \beta = 0.9
  • Velocity v_0 = 0

To kick things off, our first iteration consists of

  1. Gradient f'(x_0) = 2x + 2 = 2(3) + 2 = 8
  2. We then use f'(x_0) to update our velocity v_{n+1} = \beta v_n + (1-\beta)f'(x_n) where
    • \beta \cdot v_n is the exponential decay, which ensures past velocities decay over time so recent gradients have more of an impact than older ones
    • (1-\beta) makes sure the gradient we add to the velocity doesn't grow too big, like a scale-limit guardrail, giving v_1 = \beta v_0 + (1 - \beta)f'(x_0) = 0.9(0) + 0.1(8) = 0.8
  3. Then we update our x to x_1 = x_0 - \gamma \cdot v_1 = 3 - 0.1(0.8) = 2.92

So really what's happening is our effective step is influenced by our velocity, which ramps up exponentially if the direction stays consistent across iterations.

If we didn’t use the momentum factor then

x_1 = x_0 - \gamma \cdot f'(x_0) = 3 - 0.1(8) = 2.2

Where the new gradient would have been f'(2.2) = 2(2.2) + 2 = 6.4 instead of the momentum version's f'(2.92) = 7.84. Now this is important! Since the velocity is an exponential moving average, the early iterations are worse than regular gradient descent because it hasn't accumulated any momentum from previous steps to scale it up. However, this trade-off of slow starts for potentially better long-term performance is a consideration to make.

Maybe down the line future me can write an example in Python to show the impact visually after nn iterations (sorry, reader-kun)
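In the meantime, here's a rough sketch of what that comparison could look like (same toy parabola and hyperparameters as above, the iteration count is arbitrary):

```python
def f_prime(x):
    return 2*x + 2              # derivative of f(x) = x^2 + 2x + 1

gamma, beta = 0.1, 0.9          # learning rate and momentum factor

x_plain = x_mom = 3.0           # both start at x_0 = 3
v = 0.0                         # initial velocity

for i in range(30):
    # plain gradient descent
    x_plain = x_plain - gamma * f_prime(x_plain)
    # momentum: exponentially decayed velocity, then the update
    v = beta * v + (1 - beta) * f_prime(x_mom)
    x_mom = x_mom - gamma * v
    print(f"iter {i}: plain x = {x_plain:.4f}, momentum x = {x_mom:.4f}")
# momentum starts off slower (no accumulated velocity yet) but its steps keep
# growing while the gradient keeps pointing the same way
```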

Descent Redirection

The third “scary” part of the equation is {\color{#70A1D7} \left[ \cos \theta_n - \max_{t\in[0,1]} \dfrac{||\nabla F(a_n - t\gamma_n P_n)-\nabla F(a_n)||_2}{|| \nabla F (a_n)||_2}\right]} The TLDR for quick reference is that \cos \theta_n measures how well our chosen descent direction lines up with the direction we actually want to descend, and we subtract from it a penalty for how much the gradient can change along the proposed step. That penalty is calculated by taking the highest possible result (using \max) over all real numbers t from 0 to 1, where t is the fraction of the step at which we re-measure the gradient.

Now let’s do the long-winded descriptive component analysis…

Descent Direction Vs Gradient

First we want to deal with this big fuck off part of the formula,

\dfrac{||\nabla F(a_n - t\gamma_n P_n) - \nabla F(a_n)||_2}{|| \nabla F (a_n)||_2}

The numerator

||\nabla F(a_n - t\gamma_n P_n) - \nabla F(a_n)||_2

is measuring the difference between the two gradients:

  1. At a point along the proposed step direction, \nabla F(a_n - t\gamma_n P_n), where t is a fraction of the step, i.e., a point along the path of the proposed step. The shift away from a_n is the descent direction scaled by the step size (and by t).
  2. And at the current point, \nabla F(a_n)

The L2 norm is used once again to collapse this difference into a single magnitude, and the bound on it is inherited from the assumption that our formula incorporates Lipschitz continuity.

Let’s break it down.

Our numerator is assumed to stay within the bound set by the maximum change along our descent direction

{\color{#70A1D7}{||\nabla F(a_n - t\gamma_n P_n) - \nabla F(a_n)||_2}} \leq {\color{#F47C7C}{L||t\gamma_n P_n||_2}} = tL\gamma_n ||P_n||_2

where L is the Lipschitz constant, which represents the maximum rate of change of the gradient and keeps that change bounded so convergence is possible.

L = 1 means the gradient changes at most as fast as the input, L > 1 means it can change faster than the input, and vice versa for less than 1. However, L is often unknown and needs to be accounted for in the step size \gamma (another thing to implement, I'm sorry).
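As a tiny concrete example (my own, just for intuition), for our earlier parabola the Lipschitz constant of the gradient can be read straight off the derivative:

f'(x) = 2x + 2 \implies |f'(a) - f'(b)| = |2a - 2b| = 2|a - b| \implies L = 2

so its gradient changes at most twice as fast as its input, which, incidentally, is why the extreme \gamma = 0.5 step from earlier (= 1/L) happened to land exactly on the vertex.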

Therefore, our rate of change is bounded by the size of the step we're taking in our desired direction, t\gamma_n ||P_n||_2, and the constant L.

L||t\gamma_n P_n||_2 = tL\gamma_n ||P_n||_2

as it's a property of norms that constants can be pulled out of them.

||\nabla F(a_n - t\gamma_n P_n) - \nabla F(a_n)||_2

is the L2-norm magnitude of how much the gradient along the descent direction differs from the current gradient — so essentially measuring how much the gradient shifts as we move toward our desired direction.

The denominator

|| \nabla F (a_n)||_2

is normalising the value by dividing the change in gradient by the magnitude of the current gradient to make the measure independent of the scale of the gradient — it essentially outputs how much of a relative change it is rather than an absolute change. Think of it like when you take the mean: you sum everything up then divide by the number of items to normalise it.

Who’s Max?

I'm going to skip the \cos\theta for now, just to give some context for what it's subtracting, so bear with me. This subtraction part is quite computationally expensive because of the \max_{t\in[0,1]}. It is like a for loop that chooses the maximum change of the gradient relative to its starting value from the formula

\dfrac{||\nabla F(a_n - t\gamma_n P_n)-\nabla F(a_n)||_2}{|| \nabla F (a_n)||_2}

So it's iterating over all real numbers in [0,1], e.g., [0, 0.0000001, ..., 0.5, ..., 1] \in [0,1], storing the largest relative change in the gradient compared with the initial magnitude and then outputting the stored best one, e.g., a 35.78% change.

In practice, we don't check every possible value in [0,1] due to computational constraints of the machine; you could add precision bounds to the formula, but there are none in the one we're looking at.
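Here's a rough sketch of what that inner loop could look like if you just sample a grid of t values instead of all of [0,1] (the function, the direction P_n and the grid size are all my own toy choices):

```python
import numpy as np

def grad_F(p):
    x, y = p
    return np.array([2*x, 2*y])         # gradient of F(x, y) = x^2 + y^2

a_n   = np.array([3.0, 1.0])            # current point
P_n   = np.array([1.0, 0.2])            # some chosen descent direction
gamma = 0.1                             # step size

g_now = grad_F(a_n)
ratios = []
for t in np.linspace(0.0, 1.0, 1001):   # a finite grid standing in for t in [0, 1]
    g_ahead = grad_F(a_n - t * gamma * P_n)
    ratios.append(np.linalg.norm(g_ahead - g_now) / np.linalg.norm(g_now))

print(max(ratios))   # worst-case relative change of the gradient along the step
```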

Cos Sine Sucks

Now back to the \cos \theta_n. We use this angle to measure the alignment between our two vectors: the custom descent direction P_n (where we want the step to go) and the negative gradient -\nabla F(a_n) (we use a negative sign because we want to descend instead of ascend).

The reason why \cos\theta is used is that a cosine of 1 represents perfect alignment, 0 represents perpendicularity (90 degrees) and -1 represents pointing in exactly opposite directions, so it gives us exactly the alignment information we care about (\sin\theta can't tell aligned and anti-aligned vectors apart). After all, the slightest dip below 1 tells us the slightest misalignment to adjust to realign, aka our objective to align our model. So if our \cos\theta_n is significantly less than 1 then we would inflate our step size because a small step may not provide any real results, unless there was a miraculous tiny pothole with a god value.
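For what it's worth, computing that alignment is just the usual cosine between the two vectors, something like this (toy numbers, my own):

```python
import numpy as np

g  = np.array([2.0, 6.0])      # gradient at the current point
Pn = np.array([-0.5, -1.0])    # our chosen descent direction (roughly downhill)

cos_theta = np.dot(Pn, -g) / (np.linalg.norm(Pn) * np.linalg.norm(g))
print(cos_theta)   # close to 1 when P_n points almost exactly along the negative gradient
```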

Switcher Algo Idea

While writing about the “Descent Direction” part a bit earlier in the article I thought about this simple algo that switches its objective from descent to ascent and vice versa in the opposite direction it came from once it reaches its goal. The reason for it is so it can build a mental map of the landscape if it’s a completely blackbox environment with no prior knowledge of the graph. Then it can estimate the graph using splines and/or a power series. Thought it was a cool idea to share :)


Objective Function

The objective function is the quantity you're quantitatively trying to maximise (or minimise). It can include the cost function, weights, and additional terms. If you had a square, the area could be what you're trying to optimise, or maybe maximum profit Z = revenue - costs. The goal is to find the input values that result in the best value of the objective function. You can even have multi-objective functions, which stem into the world of Pareto optimality: getting to a state where resources are allocated in the most efficient way possible, so that it's impossible to make any one party better off without making at least one other party worse off.

So basically this objective function becomes the thing the gradient descent is trying to optimise and that’s kind of the basis of ML/AI.

Loss + Cost Function

The loss function evaluates the error of a model on a single data point, typically focusing solely on the prediction error, denoted as L(y, \hat{y}) where y is the actual value and \hat{y} is the predicted value. It is usually calculated many times per cycle, once for every data point.

The cost function is the average error for a set of data points, instead of one single point. It is calculated once each cycle. So it’s like the mean of all the loss functions.
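In code the distinction is literally just "per data point" versus "averaged over the set"; here's a minimal sketch using squared error as the loss (my own choice of loss):

```python
def loss(y, y_hat):
    return (y - y_hat) ** 2                 # squared error on one data point

y_true = [3.0, -1.0, 2.0]                   # actual values
y_pred = [2.5, -0.5, 2.0]                   # model predictions

losses = [loss(y, y_hat) for y, y_hat in zip(y_true, y_pred)]   # one loss per point
cost = sum(losses) / len(losses)                                # the cost: mean of all the losses
print(losses, cost)                                             # [0.25, 0.25, 0.0] and ~0.1667
```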

Final

Thank you for reading my first real breakdown of, what I consider to be, foreign advanced math. I know it got a bit abstract towards the end without any real numerical examples — I just yoinked it from the wiki here. I was curious about how it worked without looking at any explanations and it turned out I learned quite a lot from just breaking it down. Hopefully, this will inspire some people or give neat ideas to others interested in this stuff.

This was actually quite challenging to break down (and I’m not even sure I did a good job tbh) but I learned a lot and hopefully you did too. I plan to make more intuition-based articles as I progress further in math, so be on the lookout for them :)

But until next time, Godspeed, anon.
