10. The backward pass (in theory)

So, you want to compute the derivatives of computational graphs. To make things concrete, we’ll use our running example: logistic regression.

In mathematical terms, logistic regression is defined by the function

\[ f(a, b, x) = \sigma(a x + b), \]

where \( x \) is the input, and \( a, b \) are the model parameters. Our ultimate goal is to compute the rate of change of \( f \) with respect to the parameters \( a \) and \( b \); that is, the derivatives \( \frac{\partial f}{\partial a} \) and \( \frac{\partial f}{\partial b} \).

How to do that? Experienced calculus users immediately reply

\[\begin{split} \begin{align*} \frac{\partial f}{\partial a} &= x \sigma^\prime(ax + b), \\ \frac{\partial f}{\partial b} &= \sigma^\prime(ax + b), \end{align*} \end{split}\]

but how was that obtained? By the chain rule, one of the most important results in mathematics. However, to fully understand what the chain rule is, we have to dive deep into the wonderful (and mildly convoluted) world of derivatives.

10.1. Derivatives and partial derivatives

At some point in your life, you have probably encountered the concept of differentiation and derivatives. We don’t have all the time in the world here, but let’s recap; it’s best if we tackle the tough topic of backpropagation fully prepared. If you are thoroughly familiar with the concept, feel free to skip to the chain rule, but I recommend refreshing your knowledge anyway.

In pure English, derivatives describe the rate of change. Let’s see the formal definition.

Definition 10.1 (The derivative.)

Let \( f: \mathbb{R} \to \mathbb{R} \) be a function of one variable. We say that \( f \) is differentiable at \( x_0 \in \mathbb{R} \) if the limit

\[ \frac{df}{dx}(x_0) := \lim_{x \to x_0} \frac{f(x) - f(x_0)}{x - x_0} \]

exists. If so, \( \frac{df}{dx}(x_0) \) is called the derivative of \( f \) at \( x_0 \). If \( f \) is differentiable everywhere, then it is said to be differentiable.

Two remarks about the notation. First, when it’s clear from the context, the argument is omitted from the so-called Leibniz notation: \( \frac{df}{dx} = \frac{df}{dx}(x_0) \). Second, for univariate functions, the derivative is also denoted by \( f^\prime(x_0) \); we’ll use this quite frequently.

Geometrically speaking, \( f^\prime(x_0) \) describes the slope of the tangent line drawn to the graph of \( f \) at the point \( (x_0, f(x_0)) \).

It’s best to visualize this, so we’ll take a look at the function \( f(x) = x^2 \), whose derivative is \( f^\prime(x) = 2x \).

def f(x):
    return x**2

def df(x):
    return 2*x

../_images/ae455df1b467c95a6428b36ebad6d0497b96109b88eef59f69c6fd26f0bfe6e8.png
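To connect the formal definition with the closed-form derivative, here is a small sketch that approximates the limit with difference quotients of shrinking step size. The helper `difference_quotient`, the point `x0`, and the step sizes are illustrative choices, not part of the original example.

```python
def f(x):
    return x**2

def df(x):
    return 2*x

def difference_quotient(f, x0, h):
    # slope of the secant line through (x0, f(x0)) and (x0 + h, f(x0 + h))
    return (f(x0 + h) - f(x0)) / h

x0 = 3.0
for h in [1.0, 0.1, 0.001]:
    # the secant slopes get closer to the tangent slope df(x0) as h shrinks
    print(h, difference_quotient(f, x0, h))

approx = difference_quotient(f, x0, 1e-6)
```

For small `h`, `approx` should agree with `df(x0)` up to an error of roughly the size of `h`.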

However, in multiple variables, the definition breaks down. For instance, consider \( f(x, y) = x^2 + y^2 \), the two-variable version of the previous function. There, instead of a tangent line, we have a tangent plane.

def f(x, y):
    return x**2 + y**2

def df_dx(x, y):
    return 2*x

def df_dy(x, y):
    return 2*y

../_images/4fa14105db268eb3b7bd1fe6919ca98253c27218797e27445e436f423d07234b.png

In this case, we can fix all but one variable, and take the derivative of the resulting single-variable function. This is called the partial derivative. Here’s the formal definition.

Definition 10.2 (The partial derivatives.)

Let \( f: \mathbb{R}^n \to \mathbb{R} \) be a function of \( n \) variables. We say that \( f \) is partially differentiable at \( \mathbf{x}_0 \in \mathbb{R}^n \) in the \( i \)-th variable if the limit

\[ \frac{\partial f}{\partial x_i}(\mathbf{x}_0) := \lim_{h \to 0} \frac{f(\mathbf{x}_0 + h \mathbf{e}_i) - f(\mathbf{x}_0)}{h} \]

exists, where \( \mathbf{e}_i \) is the vector whose \( i \)-th coordinate is one, and the rest are zero. If so, \( \frac{\partial f}{\partial x_i}(\mathbf{x}_0) \) is called the \( i \)-th partial derivative of \( f \) at \( \mathbf{x}_0 \).

Again, when it’s clear, we often write \( \frac{\partial f}{\partial x_i} \) instead of \( \frac{\partial f}{\partial x_i}(\mathbf{x}_0) \).

Let’s look at an example! In the case of \( f(x, y) = x^2 + y^3 \), the partial derivatives are

\[ \frac{\partial f}{\partial x} = 2x, \quad \frac{\partial f}{\partial y} = 3y^2. \]
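As a sanity check, the partial derivatives can be estimated numerically by nudging one coordinate at a time. The sketch below uses a central difference quotient (a symmetric variant of the limit in the definition, chosen here for accuracy); the helper name `partial` and the evaluation point are assumptions made for illustration.

```python
def f(x, y):
    return x**2 + y**3

def partial(f, point, i, h=1e-6):
    # central difference quotient along the i-th coordinate direction e_i
    forward = list(point)
    backward = list(point)
    forward[i] += h
    backward[i] -= h
    return (f(*forward) - f(*backward)) / (2 * h)

point = (1.5, 2.0)
df_dx = partial(f, point, 0)   # should be close to 2x = 3.0
df_dy = partial(f, point, 1)   # should be close to 3y^2 = 12.0
```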

Remark 10.1 (Derivatives vs partial derivatives in function compositions.)

When vigorously composing functions and variables (as we do in machine learning), a clear distinction must be made between regular and partial derivatives. For example, consider the functions \( f(x, y) = x + y^2 \) and \( g(x, y) = \sin(x)\cos(y) \). If \( x \) and \( y \) represent particular features such as height or cost, and \( g(x, y) \) is an engineered feature, we end up with expressions like

\[\begin{split} \begin{align*} h(x, y) &= f(g(x, y), y) \\ &= \sin(x)\cos(y) + y^2. \end{align*} \end{split}\]

In this context, \( \frac{df}{dy} \) and \( \frac{\partial f}{\partial y} \) mean two different things:

while \( \frac{\partial f}{\partial y} \) is the partial derivative of \( f \) with respect to its second variable \( y \),
the expression \( \frac{df}{dy} \) refers to the derivative of the univariate function defined by \( y \mapsto f(g(x, y), y) \), or in other words, \( \frac{df}{dy} = \frac{\partial h}{\partial y} \).

Keep this in mind, as we won’t always explicitly name composite expressions such as \( f(g(x, y), y) \).
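The difference between the two notions can be seen numerically. In the sketch below, the partial derivative holds the first argument of \( f \) fixed, while the regular derivative lets \( y \) act through \( g \) as well. The helper `central_diff` and the evaluation point are illustrative assumptions.

```python
import math

def f(x, y):
    return x + y**2

def g(x, y):
    return math.sin(x) * math.cos(y)

def h(x, y):
    return f(g(x, y), y)

def central_diff(func, t, eps=1e-6):
    # symmetric difference quotient of a single-variable function
    return (func(t + eps) - func(t - eps)) / (2 * eps)

x, y = 1.0, 2.0

# partial derivative of f: its first argument is frozen at the value g(x, y)
first_arg = g(x, y)
partial_f_y = central_diff(lambda t: f(first_arg, t), y)

# regular derivative df/dy: y also changes the first argument through g
total_f_y = central_diff(lambda t: h(x, t), y)
```

Here `partial_f_y` should be close to \( 2y \), while `total_f_y` should be close to \( 2y - \sin(x)\sin(y) \), which is the partial derivative of \( h \) with respect to \( y \).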

Speaking of composed functions: it’s time to dive into the chain rule, our main tool for differentiating computational graphs.

10.2. The chain rule

Let’s state the chain rule right away. Single variable first, multiple variables second.

Theorem 10.1 (Chain rule, single variable.)

Let \( f, g: \mathbb{R} \to \mathbb{R} \) be two differentiable functions, and let \( h(x) = f(g(x)) \). Then

\[ \Big( f\big(g(x)\big) \Big)^\prime = f^\prime\big(g(x)\big) g^\prime(x), \]

or in other words,

\[ \frac{dh}{dx} = \frac{df}{dg} \frac{dg}{dx}. \]

In English, this means that the derivative of the composite function equals the product of the components’ derivatives, evaluated at the appropriate locations.

Remark 10.2 (Abuse of notation.)

If you are an attentive reader, perhaps you’ve noticed that there’s a discrepancy between the notations \( f^\prime(g(x)) \), \( \frac{df}{dx} \), \( \frac{df}{dg} \), and so on.

For instance, \( \frac{df}{dx} \) and \( \frac{df}{dg} \) can’t both make sense at face value, as the function \( f \) is univariate, defined in terms of an arbitrary variable \( x \). So, what’s \( \frac{df}{dg} \)? Let’s clear this up once and for all.

According to the chain rule, the derivative of the composed function \( f(g(x)) \) is

\[ \Big( f\big(g(x)\big) \Big)^\prime = f^\prime\big(g(x)\big) g^\prime(x), \]

in other words, the derivative function \( f^\prime \) is evaluated at \( g(x) \). To avoid writing monstrosities like \( \frac{df}{dx}(g(x)) \), we rather think of \( f \) as defined in terms of the variable \( g = g(x) \), and simply write \( \frac{df}{dg} \), which is shorthand for

\[ \frac{df}{dg} = \frac{df}{dx}(g(x)) = f^\prime(g(x)). \]

For multiple variables, the chain rule reads as follows.

Theorem 10.2 (Chain rule, multiple variables.)

Let \( f: \mathbb{R}^m \to \mathbb{R} \) be a differentiable function of \( m \) variables, let \( g_1, \dots, g_m: \mathbb{R}^n \to \mathbb{R} \) be differentiable, and define \( h: \mathbb{R}^n \to \mathbb{R} \) by

\[ h(\mathbf{x}) = f\big( g_1(\mathbf{x}), \dots, g_m(\mathbf{x}) \big), \quad \mathbf{x} = (x_1, \dots, x_n). \]

Then

\[ \frac{\partial h}{\partial x_i} = \sum_{j=1}^{m} \frac{\partial f}{\partial g_j} \frac{\partial g_j}{\partial x_i}. \]

How does this apply to logistic regression? As we’ve seen before, \( f(a, b, x) = \sigma(a x + b) \) is a composite function, built from the blocks

\[\begin{split} \begin{align*} c(a, x) &= a x, \\ y(c, b) &= c + b, \\ \sigma(y) &= \frac{1}{1 + e^{-y}} \end{align*} \end{split}\]

that yield

\[ f(a, b, x) = \sigma\Big(y\big(c(a, x), b\big) \Big). \]

Thus,

\[\begin{split} \begin{align*} \frac{\partial f}{\partial a} &= \frac{\partial \sigma}{\partial y} \frac{\partial y}{\partial a} \\ &= \frac{\partial \sigma}{\partial y} \bigg( \frac{\partial y}{\partial c} \frac{\partial c}{\partial a} + \frac{\partial y}{\partial b} \frac{\partial b}{\partial a} \bigg) \\ &= \sigma^\prime(ax + b) \bigg( 1 \cdot x + 1 \cdot 0 \bigg) \\ &= x \sigma^\prime(ax + b), \end{align*} \end{split}\]

as we’ve seen earlier.
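The formulas for \( \frac{\partial f}{\partial a} \) and \( \frac{\partial f}{\partial b} \) can be checked against finite differences. The sketch below assumes the standard identity \( \sigma^\prime(y) = \sigma(y)(1 - \sigma(y)) \), which is not derived in this chapter; the parameter values are arbitrary.

```python
import math

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_prime(y):
    # assumed identity: sigma'(y) = sigma(y) * (1 - sigma(y))
    s = sigmoid(y)
    return s * (1.0 - s)

def f(a, b, x):
    return sigmoid(a * x + b)

a, b, x = 0.5, -1.0, 2.0

# chain-rule results from the text
df_da = x * sigmoid_prime(a * x + b)
df_db = sigmoid_prime(a * x + b)

# finite-difference estimates for comparison
eps = 1e-6
df_da_numeric = (f(a + eps, b, x) - f(a - eps, b, x)) / (2 * eps)
df_db_numeric = (f(a, b + eps, x) - f(a, b - eps, x)) / (2 * eps)
```

The analytic and numeric values should agree to several decimal places.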

Because of the layered structure of neural networks, the chain rule will be our bread and butter in calculating the derivatives. Now that we understand how it works, let’s see what the chain rule means in the context of computational graphs!

10.3. The chain rule and computational graphs

Let’s go back to square one and consider the composite function \( f(g(x)) \). In computational graph terms, we prefer to work with variables, not functions. That is, instead of \( f(g(x)) \), we have the variables x, g, f, the elements of our computations.

Here’s how this simple graph looks.

../_images/7b7a6b423ce16c116e4d9fcdc0256383610f158b93bdb4ad3e8d7ef9c25bedfc.svg

Remark 10.3 (Abuse of notation, part 2.)

Note that in the above computational graph, g is not the mathematical function \( g \), nor does f mean \( f \). The variables are the results of the computations defined by the functions. Mathematically speaking, we have

the input \( x \),
the variable \( g = g(x) \),
and the variable \( f = f(g) \).

In principle, we are not allowed to use the same symbol for two different objects, but adding another set of symbols would be cumbersome. Thus, we take a hit in precision to gain a bit of simplicity.

Accordingly,

\( \frac{df}{dx} \) is the derivative of \( f(g(x)) \) with respect to \( x \),
while \( \frac{df}{dg} \) is the derivative of \( f \) with respect to its only variable \( g \).

Keep this in mind when translating expressions to computational graphs.

In the language of computational graphs, the chain rule expresses the derivative of the terminal node f with respect to the initial node x. Essentially, we compute the following graph.

../_images/159c0493d8db503d00c98d864099af24fbec19371adb022d460b25ba90f716ee.svg

This is quite overloaded with information, so let me explain. In the derivative graph,

a node corresponds to the derivative of the original node with respect to the initial node x,
and an edge corresponds to the derivative of its end node with respect to its start node.

Using the chain rule, we obtain the value of a node by multiplying together all the edges leading up to it: \( \frac{df}{dx} = \frac{df}{dg} \frac{dg}{dx} \).

Because we progress from the initial node x to the terminal node f, this is called forward-mode differentiation. Accordingly,

the derivatives represented by the nodes are called forward derivatives,
and the derivatives on the edges are called local derivatives.

Let’s see an example to solidify your understanding: consider the expression \( (3x)^2 \). In this setting, \( g(x) = 3x \) and \( f(g) = g^2 \). Thus, we have

\[\begin{split} \begin{align*} \frac{\partial g}{\partial x} &= 3, \\ \frac{\partial f}{\partial g} &= 2 g. \end{align*} \end{split}\]

For, say, \( x = 4 \), we can immediately compute the forward pass and the local derivatives:

\[\begin{split} \begin{align*} g(x) &= 3 \cdot 4 = 12, \\ f(g) &= 12^2 = 144, \\ \frac{dg}{dx} &= 3, \\ \frac{df}{dg} &= 2 \cdot 12 = 24. \end{align*} \end{split}\]

Thus, in the first step, we populate the edges and the first node of the derivative graph.

../_images/c81160914757cafc721922b683c657a3f36e5c9fdde11487bf3c520b65ebb7eb.svg

The second node is computed by taking the product of the first node and the first edge. (As the first node’s value is \( 1 \), this’ll match the edge.)

../_images/3d222e3103848f5452609495db59de2a937cef35735b5f9b51db97dbc3e49823.svg

In the last step, we compute \( \frac{df}{dx} \) by taking the product of \( \frac{df}{dg} = 24 \) and \( \frac{dg}{dx} = 3 \):

../_images/c025a6c00ce1a2ff944126853308408bcef8d5d466c47b8bdf03d8576a78a6cf.svg
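The three steps above can be written out as plain arithmetic. This is only a transcription of the worked example for \( (3x)^2 \) at \( x = 4 \); the variable names are chosen for illustration.

```python
x = 4.0

# forward pass: values of the nodes
g = 3 * x        # 12.0
f = g**2         # 144.0

# local derivatives (the edges of the derivative graph)
dg_dx_local = 3.0
df_dg_local = 2 * g   # 24.0

# forward derivatives (the nodes), computed from x towards f
dx_dx = 1.0
dg_dx = dx_dx * dg_dx_local   # 3.0
df_dx = dg_dx * df_dg_local   # 72.0
```

Differentiating \( (3x)^2 = 9x^2 \) directly gives \( 18x \), which is also \( 72 \) at \( x = 4 \).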

To get a firm grasp on how forward-mode differentiation works, let’s add one more node and consider the graph given by \( h(f(g(x))) \):

../_images/f59ac19c5c076f843538226e9840725fd5f9079bb253d54d7fa7e7c9781513a4.svg

To check your understanding, try to carry through the previous process by

sketching the derivative graph,
and computing the derivatives of the nodes with respect to the initial node x.

Welcome back! This is what you should have got:

../_images/8110847fc89c4aaeb4c21fbb5a79cdeb9e35865b10d50bac3017145e927d971c.svg

To confirm, the iterated application of the chain rule gives

\[\begin{split} \begin{align*} \frac{dh}{dx} &= \frac{dh}{df} \frac{df}{dx} \\ &= \frac{dh}{df} \frac{df}{dg} \frac{dg}{dx}. \end{align*} \end{split}\]
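For a chain of any length, forward-mode differentiation amounts to a running product of local derivatives. Here is a minimal sketch; the concrete functions \( g(x) = 3x \), \( f(g) = g^2 \), \( h(f) = \sin(f) \) and the helper `forward_mode_chain` are hypothetical choices, since the text leaves \( h \), \( f \), and \( g \) unspecified.

```python
import math

# each stage is a pair (function, derivative of that function)
stages = [
    (lambda x: 3 * x, lambda x: 3.0),                 # g
    (lambda g: g**2, lambda g: 2 * g),                # f
    (lambda f: math.sin(f), lambda f: math.cos(f)),   # h
]

def forward_mode_chain(stages, x):
    value, derivative = x, 1.0   # the initial node: dx/dx = 1
    for func, local_derivative in stages:
        # multiply the running forward derivative by the next edge,
        # evaluated at the current node's value
        derivative *= local_derivative(value)
        value = func(value)
    return value, derivative

value, dh_dx = forward_mode_chain(stages, 0.5)
```

With these choices, \( h(f(g(x))) = \sin(9x^2) \), so `dh_dx` should match \( 18x \cos(9x^2) \) at \( x = 0.5 \).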

Now you understand why the chain rule is called the chain rule! The next step is to see what happens in a multivariable context.

10.3.1. The multivariable case

Let’s turn the difficulty dial up a notch and consider the expression

\[ f\big(g_1(x), g_2(x)\big), \]

which is composed of

a bivariate function \( f: \mathbb{R}^2 \to \mathbb{R} \),
and two univariate functions \( g_1, g_2: \mathbb{R} \to \mathbb{R} \).

Sketching up its graph, we obtain the following:

../_images/327c4c726406c71027472adba77e774b6a62695e13d531fde9cdaa9f8e26f57b.svg

Again, our goal is to calculate \( \frac{df}{dx} \). To do that, we employ the (multivariate) chain rule

\[ \frac{df}{dx} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f}{\partial g_2} \frac{\partial g_2}{\partial x}, \]

or in computational graph terms:

../_images/6fe36a7ce97dae89d1f04f329d48e89390c2b03fa95148a4e287fd5e32fd3bcb.svg

(Note that here, \( \frac{dg_1}{dx} = \frac{\partial g_1}{\partial x} \) and \( \frac{dg_2}{dx} = \frac{\partial g_2}{\partial x} \).) In this case, the derivative \( \frac{d f}{d x} \) is obtained via forward-mode differentiation; that is,

computing the local derivatives on the edges,
taking a path from the initial node x to the terminal node f,
multiplying together all intermediate derivatives along the edges,
and summing the products for all paths.

Take a look at the expression

\[ \frac{df}{dx} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f}{\partial g_2} \frac{\partial g_2}{\partial x}, \]

where the first term \( \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x} \) corresponds to the left path, while \( \frac{\partial f}{\partial g_2} \frac{\partial g_2}{\partial x} \) corresponds to the right one.

From top to bottom, we compute

\( \frac{dx}{dx} \),
\( \frac{dg_1}{dx} \) and \( \frac{dg_2}{dx} \),
and finally \( \frac{df}{dx} \),

in this exact order.

Like before, we’ll walk through a concrete example: \( \sin(x) \cos(x) \). This expression is a composition of two univariate functions \( g_1(x) = \sin(x) \), \( g_2(x) = \cos(x) \), and a bivariate function \( f(g_1, g_2) = g_1 \cdot g_2 \). Regarding its derivatives, we have

\[\begin{split} \begin{align*} \frac{\partial g_1}{\partial x} &= \cos(x), \\ \frac{\partial g_2}{\partial x} &= -\sin(x), \end{align*} \end{split}\]

and

\[\begin{split} \begin{align*} \frac{\partial f}{\partial g_1} &= g_2, \\ \frac{\partial f}{\partial g_2} &= g_1. \end{align*} \end{split}\]

So, what would calculating the derivative look like for, say, \( x = 2 \)? Let’s begin populating the derivative graph with the edges and the first node.

../_images/aed9d4a4f69d77ae6c035a54c30bedf3bc4592f2e8f2553353e7f06f26b3182b.svg

With this, we can move one step further and compute the derivatives of the first-level nodes.

../_images/10b1fbb4f50bbcf54d4962ef70df087ca0c01b0c69144acd8c874ffd9c6dbd05.svg

The final step. According to the chain rule, the derivative at each node is the sum, over all incoming edges, of the edge’s value times the derivative of the parent node it comes from. For the terminal node f, this reads

\[ \frac{df}{dx} = \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial x} + \frac{\partial f}{\partial g_2} \frac{\partial g_2}{\partial x}. \]

Let’s state this in graph form.

../_images/ae007ab270acf63ea067f35c09847f0753e25dbcec3e77ddb8c704f7a2845e29.svg
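The same computation for \( \sin(x)\cos(x) \) at \( x = 2 \) can be sketched in a few lines. The variable names below are illustrative; the structure follows the derivative graph: edges first, then the sum over the two paths.

```python
import math

x = 2.0

# forward pass
g1 = math.sin(x)
g2 = math.cos(x)
f = g1 * g2

# local derivatives on the edges
dg1_dx = math.cos(x)
dg2_dx = -math.sin(x)
df_dg1 = g2
df_dg2 = g1

# sum over the two paths from x to f
df_dx = df_dg1 * dg1_dx + df_dg2 * dg2_dx
```

The result equals \( \cos^2(x) - \sin^2(x) \), which is \( \cos(2x) \) by the double-angle identity.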

If you want to practice, feel free to carry out the previous process on the expression \( \cos(x^2) \sin(x^2) \).

One more example to emphasize the difference between local derivatives \( \partial/\partial x \) and global derivatives \( d/dx \). Let’s put one more layer into the above graph and consider the expression

\[ f\Big(g_1\big(h_1(x)\big), g_2\big(h_2(x)\big)\Big), \]

yielding the following (derivative) graph.

../_images/e9930a579fc052b133e431551ffa93ee2a1d1173bef8d6e8e5ba1397f7ae542c.svg

Here, the chain rule says that

\[\begin{split} \begin{align*} \frac{df}{dx} &= \frac{\partial f}{\partial g_1} \frac{dg_1}{dx} + \frac{\partial f}{\partial g_2} \frac{dg_2}{dx} \\ &= \frac{\partial f}{\partial g_1} \frac{\partial g_1}{\partial h_1} \frac{\partial h_1}{\partial x} + \frac{\partial f}{\partial g_2} \frac{\partial g_2}{\partial h_2} \frac{\partial h_2}{\partial x}. \end{align*} \end{split}\]

Note that the terms \( \frac{\partial f}{\partial g_i} \frac{\partial g_i}{\partial h_i} \frac{\partial h_i}{\partial x} \) correspond to paths from x to f, while \( \frac{\partial f}{\partial g_i} \frac{dg_i}{dx} \) correspond to all incoming edges and parents of f.

Now, it’s time to put a twist on everything that we’ve learned so far regarding the chain rule!

10.3.2. The problems with forward-mode differentiation

Let’s increase the complexity once more and consider the computational graph defined by the expression

\[ f\big( g_1(x_1, x_2), g_2(x_1, x_2) \big), \]

which looks like the following.

../_images/c3cc6264c3ea07a82c28a548a0b77f88e459e077ccbe572e8fbd8de8da221d1b.svg

As there are two initial nodes \( x_1 \) and \( x_2 \), we would like to compute both \( \frac{df}{dx_1} \) and \( \frac{df}{dx_2} \). This presents us with a problem, as now we have to compute two graphs:

../_images/69094f8afcf1da816c3ebf54cdb29ab3b842a7c8846a2db1c1f3db6b35513104.svg

(I have omitted the edge labels for clarity.)

Now, our computational cost has increased twofold. For \( n \) input variables (or features), the increase is \( n \)-fold. In practice, the number of input variables is in the millions. (Depending on which year you read this, it might even be in the billions.) This would be a significant issue if we were to calculate the derivatives this way.
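To make the cost concrete, here is a sketch of forward-mode differentiation for a graph of the shape \( f\big( g_1(x_1, x_2), g_2(x_1, x_2) \big) \). The blocks \( g_1 = x_1 + x_2 \), \( g_2 = x_1 x_2 \), \( f = g_1 g_2 \) are hypothetical (the text does not fix them), as is the function name `forward_sweep`. Each call propagates derivatives with respect to a single input, so two inputs need two sweeps.

```python
def forward_sweep(x1, x2, seed):
    # seed = (dx1/dx_i, dx2/dx_i): 1 for the chosen input, 0 for the other
    dx1, dx2 = seed
    # each line carries a node's value together with its forward derivative
    g1, dg1 = x1 + x2, dx1 + dx2
    g2, dg2 = x1 * x2, dx1 * x2 + x1 * dx2
    f, df = g1 * g2, dg1 * g2 + g1 * dg2
    return f, df

x1, x2 = 2.0, 3.0
_, df_dx1 = forward_sweep(x1, x2, seed=(1.0, 0.0))   # first pass
_, df_dx2 = forward_sweep(x1, x2, seed=(0.0, 1.0))   # second pass
```

With \( n \) inputs, the same pattern would require \( n \) calls to `forward_sweep`.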

That’s only the tip of the iceberg. Next, consider an analogous computational graph coming from the expression

\[ f\big( g_1(x_1, x_2, x_3), g_2(x_1, x_2, x_3), g_3(x_1, x_2, x_3) \big). \]

This time, there are three input nodes \( x_1, x_2, x_3 \) and three middle nodes \( g_1, g_2, g_3 \).

../_images/bbfdf34508d07333e13476cb66d8652b053cb2440c862197f56868b4559f0272.svg

The number of paths from initial to terminal nodes increases dramatically with adding new nodes. Compared to \( f\big( g_1(x_1, x_2), g_2(x_1, x_2) \big) \), where we had only \( 2 \cdot 2 = 4 \) paths, this time, we have \( 3 \cdot 3 = 9 \).

This gets exponentially worse if we add another layer. Consider the following computational graph. (I don’t even want to show you the expression it came from, let alone type it.)

../_images/8bb494250c38339609ed4a403ed82c5b1681d45e945a4e072c48084b16686b87.svg

The number of paths to sum over has gone from \( 3^2 \) to \( 3^3 = 27 \). This is the dreaded exponential increase.
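The path counts quoted above can be reproduced with a short sketch. It assumes that consecutive layers are fully connected and that the deeper graph has three nodes in each non-terminal layer; the function name `count_paths` is illustrative.

```python
def count_paths(layer_sizes):
    # layer_sizes lists the number of nodes per layer, from the inputs to the
    # single terminal node; consecutive layers are assumed fully connected
    paths_to_node = [1] * layer_sizes[0]
    for size in layer_sizes[1:]:
        # every node in the next layer is reachable from all previous nodes
        incoming = sum(paths_to_node)
        paths_to_node = [incoming] * size
    return sum(paths_to_node)

two_by_two = count_paths([2, 2, 1])        # 2 * 2 = 4
three_by_three = count_paths([3, 3, 1])    # 3 * 3 = 9
one_more_layer = count_paths([3, 3, 3, 1]) # 3 ** 3 = 27
```

Each additional layer of width three multiplies the number of paths by three.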

What can we do? Enter backward-mode differentiation.

10.4. Backward-mode differentiation

Let’s go back to the example \( f\big( g_1(x_1, x_2), g_2(x_1, x_2) \big) \). Here’s its graph.

Previously, we have obtained \( \frac{df}{dx_1} \) and \( \frac{df}{dx_2} \) by computing two graphs:

one with the nodes representing the partial derivatives with respect to \( x_1 \),
and another with nodes representing the partial derivatives with respect to \( x_2 \).

Let’s turn it upside down! Instead of populating the graph with \( \frac{dv}{dx_i} \) for the node \( v \), we’ll compute the derivatives \( \frac{df}{dv} \):

../_images/37e4cd879f9c8faf235b51a12c5c4a820cef3596e117f35ce88d9de0a3de44bc.svg

Notice that this time, only the final node \( \frac{df}{df} = 1 \) is known to us; our goal is to compute the value of the initial nodes \( \frac{df}{dx_1} \) and \( \frac{df}{dx_2} \). Thus, we’ll propagate the values backward along the edges, hence the name backward-mode differentiation.

In other words, the order in which we calculate the global derivatives is

\( \frac{df}{df} \),
\( \frac{df}{dg_1} \) and \( \frac{df}{dg_2} \),
and finally, \( \frac{df}{dx_1} \) and \( \frac{df}{dx_2} \).

True to its name and forward-mode counterparts, the derivatives represented by the nodes are called the backward derivatives.

As before, let’s carry out a concrete example by hand: \( (2 x_1 + 3 x_2) - 4 x_1 x_2 \). Broken into blocks, this is the composition of the functions

\( g_1(x_1, x_2) = 2 x_1 + 3 x_2 \),
\( g_2(x_1, x_2) = x_1 x_2 \),
and \( f(g_1, g_2) = g_1 - 4 g_2 \).

(There are alternative decompositions, but we’ll stick with this one.) The derivatives are

\[ \frac{\partial g_1}{\partial x_1} = 2, \quad \frac{\partial g_1}{\partial x_2} = 3, \quad \frac{\partial g_2}{\partial x_1} = x_2, \quad \frac{\partial g_2}{\partial x_2} = x_1, \]

and

\[ \frac{\partial f}{\partial g_1} = 1, \quad \frac{\partial f}{\partial g_2} = - 4. \]

This is our starting point:

../_images/358d1b262a051bfb1e0ced572b20f41c18b565771e98cdf2557ec5d248e40636.svg

In the first step, we calculate the derivatives on the previous level by propagating \( \frac{df}{df} = 1 \) backward along the edges to obtain

\[ \frac{df}{dg_1} = 1 \cdot 1, \quad \frac{df}{dg_2} = 1 \cdot (-4). \]

In graph terms, we have the following:

../_images/85ade53dcc737d1a91e46949913db79a53516edd291af40f30b751640d3b76f9.svg

One final step, and we have

\[ \frac{df}{dx_1} = 1 \cdot 2 - 4 \cdot x_2, \quad \frac{df}{dx_2} = 1 \cdot 3 - 4\cdot x_1, \]

or in graph terms, the following:

../_images/ba88d6ed2768180303b5fb2b594908d845051045d966126f3a420e6214a1b88f.svg

(The expression \( (2 x_1 + 3 x_2) - 4 x_1 x_2 \) is easy to differentiate by hand, so feel free to verify our results.)
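For concrete numbers, the backward steps can be transcribed as follows. The input values \( x_1 = 2 \), \( x_2 = 5 \) are arbitrary and chosen only for illustration, as are the variable names.

```python
x1, x2 = 2.0, 5.0   # illustrative input values

# forward pass
g1 = 2 * x1 + 3 * x2
g2 = x1 * x2
f = g1 - 4 * g2

# local derivatives on the edges
dg1_dx1, dg1_dx2 = 2.0, 3.0
dg2_dx1, dg2_dx2 = x2, x1
df_dg1_local, df_dg2_local = 1.0, -4.0

# backward derivatives, from the terminal node towards the inputs
df_df = 1.0
df_dg1 = df_df * df_dg1_local
df_dg2 = df_df * df_dg2_local
df_dx1 = df_dg1 * dg1_dx1 + df_dg2 * dg2_dx1   # 2 - 4 * x2
df_dx2 = df_dg1 * dg1_dx2 + df_dg2 * dg2_dx2   # 3 - 4 * x1
```

Note that both input derivatives come out of a single backward sweep that reuses `df_dg1` and `df_dg2`.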

To sum up with a pseudoalgorithm, we did something like this.

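# nodes are visited in reverse topological order, starting from the terminal
# node, whose backwards_grad is set to 1 (all others start at 0)
for v in nodes:
    # the backward step for v: pass its gradient on to each input node u
    for u, u_local_grad in v.prevs:
        u.backwards_grad += v.backwards_grad * u_local_grad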
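To see the pseudoalgorithm run, here is a minimal sketch with a hypothetical `Node` class that stores a value, a list of `(input node, local derivative)` pairs, and an accumulated gradient. The class layout and the `backward` helper are assumptions made for illustration, not the implementation developed later; the node order is written out by hand rather than computed.

```python
class Node:
    def __init__(self, value, prevs=()):
        self.value = value
        # prevs holds pairs (input node, local derivative of this node w.r.t. it)
        self.prevs = list(prevs)
        self.backwards_grad = 0.0

def backward(nodes):
    # nodes must be in reverse topological order, terminal node first
    nodes[0].backwards_grad = 1.0   # df/df = 1
    for v in nodes:
        for u, u_local_grad in v.prevs:
            u.backwards_grad += v.backwards_grad * u_local_grad

# the graph of (2*x1 + 3*x2) - 4*x1*x2 at illustrative inputs x1 = 2, x2 = 5
x1, x2 = Node(2.0), Node(5.0)
g1 = Node(2 * x1.value + 3 * x2.value, [(x1, 2.0), (x2, 3.0)])
g2 = Node(x1.value * x2.value, [(x1, x2.value), (x2, x1.value)])
f = Node(g1.value - 4 * g2.value, [(g1, 1.0), (g2, -4.0)])

backward([f, g1, g2, x1, x2])
```

After the call, `x1.backwards_grad` and `x2.backwards_grad` hold \( \frac{df}{dx_1} = 2 - 4 x_2 \) and \( \frac{df}{dx_2} = 3 - 4 x_1 \) evaluated at the chosen inputs.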

What we’ve just seen is called backpropagation, the primary way to compute derivatives of computational graphs. Why is it better than its forward-mode counterpart? Because it requires only one pass, compared to the as-many-passes-as-variables approach of forward-mode differentiation. This is a huge deal in machine learning, where we can have billions of parameters.

So, how do we implement backpropagation in practice? See you in the next chapter!