14. Vectorizing backpropagation (in theory)

Now that we understand how to work with the Tensor class in practice, it’s high time that we pop the hood and remove all the screws.

I have some good news and some bad news for you. The good news is that regarding the computational graph structure, Tensor works the same as Scalar: our vectorized computational graphs are made up of Tensor nodes and Edge connections.

The bad news is that the derivatives are significantly more complex than their scalar counterparts. We need to up our linear algebra game to keep pace.

Let’s get started.

14.1. Vector-scalar graphs

As usual, let’s see the simplest possible example: the dot product. For any \( \mathbf{x}, \mathbf{y} \in \mathbb{R}^n \), their dot product is defined by

\[ z = \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i \in \mathbb{R}. \]

In other words, we input two vectors (of the same dimension) and receive a scalar value. Here’s the graph:

![Computational graph of the dot product](../_images/03c7a469e4a14e4e1a792da1df9f8acc5e3dec3a26a1091f3fe344917c09097a.svg)

Nothing special so far. Now, let’s calculate the derivative with backwards-mode differentiation! Following what we’ve learned so far, this is the graph we need to fill with exact values:

![Derivative graph of the dot product](../_images/c99a3015f11a2b462a288fbda4a7dd7240da002f77e81d076ed732e979bdcd80.svg)

But what on earth do the symbols \( \frac{dz}{d\mathbf{x}} \) and \( \frac{\partial z}{\partial \mathbf{x}} \) mean? Can we take the derivative with respect to a vector?

Not so surprisingly, these represent

\[ \frac{dz}{d\mathbf{x}} = \bigg( \frac{dz}{dx_1}, \dots, \frac{dz}{dx_n} \bigg) \]

and

\[ \frac{\partial z}{\partial \mathbf{x}} = \bigg( \frac{\partial z}{\partial x_1}, \dots, \frac{\partial z}{\partial x_n} \bigg), \]

that is, we compressed a vector of derivatives into a single symbol. (In this example, the global derivative \( \frac{dz}{d\mathbf{x}} \) and the local derivative \( \frac{\partial z}{\partial \mathbf{x}} \) coincide. This won’t always be the case.)

For the dot product, the vectorized derivatives are

\[\begin{split} \begin{align*} \frac{dz}{d\mathbf{x}} = \frac{\partial z}{\partial \mathbf{x}} &= (y_1, \dots, y_n), \\ \frac{dz}{d\mathbf{y}} = \frac{\partial z}{\partial \mathbf{y}} &= (x_1, \dots, x_n). \end{align*} \end{split}\]
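As a quick sanity check, here’s a minimal NumPy sketch (NumPy is my choice here, not something the text prescribes) that compares the formula \( \frac{dz}{d\mathbf{x}} = \mathbf{y} \) against central finite differences:

```python
import numpy as np

def dot(x, y):
    return np.sum(x * y)

rng = np.random.default_rng(0)
x = rng.normal(size=3)
y = rng.normal(size=3)

# Analytic derivatives of z = x . y
dz_dx = y            # dz/dx_i = y_i
dz_dy = x            # dz/dy_i = x_i

# Numerical check on dz/dx with central differences
eps = 1e-6
num_dz_dx = np.array([
    (dot(x + eps * e, y) - dot(x - eps * e, y)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(dz_dx, num_dz_dx))  # True
```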

Sounds simple enough. What about vector-vector functions?

14.2. Vector-vector graphs

Let’s dial the difficulty up just a notch and consider vector addition: for any \( \mathbf{x}, \mathbf{y} \in \mathbb{R}^n \), the expression

\[ \mathbf{z} = \mathbf{x} + \mathbf{y} \in \mathbb{R}^n \]

defines the same V-shaped computational graph as the dot product:

![Computational graph of vector addition](../_images/77823bf714bd9935fb4d2dfff72de73ffb496d63198f1de9f20e8ece50076c05.svg)

The derivative graph also looks the same, with one major difference: there are menacing vector-vector derivatives.

![Derivative graph of vector addition, with vector-vector derivatives](../_images/30d1793b71d4d9ff425502249caf103c666192a6ae7ced2de988139e3ffa95bf.svg)

Fear not. It’s a similar notational trick, encoding a matrix of derivatives:

\[\begin{split} \frac{\partial \mathbf{z}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial z_1}{\partial x_1} & \frac{\partial z_1}{\partial x_2} & \dots & \frac{\partial z_1}{\partial x_n} \\ \frac{\partial z_2}{\partial x_1} & \frac{\partial z_2}{\partial x_2} & \dots & \frac{\partial z_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial z_n}{\partial x_1} & \frac{\partial z_n}{\partial x_2} & \dots & \frac{\partial z_n}{\partial x_n} \\ \end{bmatrix} \in \mathbb{R}^{n \times n}. \end{split}\]

The matrix \( \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \) is called the Jacobian. It’s an essential concept of differential calculus, and we’ll frequently encounter it throughout our journey.
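For vector addition, \( z_i = x_i + y_i \), so \( \frac{\partial z_i}{\partial x_j} \) is \( 1 \) when \( i = j \) and \( 0 \) otherwise: both local Jacobians are identity matrices. A tiny NumPy sketch (my own illustration) building the Jacobian entry by entry confirms this:

```python
import numpy as np

n = 4

# Jacobian of z = x + y with respect to x, built entry by entry:
# dz_i/dx_j = 1 if i == j else 0, since z_i = x_i + y_i.
J = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        J[i, j] = 1.0 if i == j else 0.0

print(np.array_equal(J, np.eye(n)))  # True
```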

We can use a similar notation for the global derivatives as well:

\[\begin{split} \frac{d \mathbf{z}}{d \mathbf{x}} = \begin{bmatrix} \frac{d z_1}{d x_1} & \frac{d z_1}{d x_2} & \dots & \frac{d z_1}{d x_n} \\ \frac{d z_2}{d x_1} & \frac{d z_2}{d x_2} & \dots & \frac{d z_2}{d x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{d z_n}{d x_1} & \frac{d z_n}{d x_2} & \dots & \frac{d z_n}{d x_n} \\ \end{bmatrix} \in \mathbb{R}^{n \times n}. \end{split}\]

Following our notational conventions, let’s call \( \frac{\partial \mathbf{z}}{\partial \mathbf{x}} \) the local Jacobian and \( \frac{d\mathbf{z}}{d\mathbf{x}} \) the global Jacobian.

Let’s make the definitions precise.

Definition 14.1 (The Jacobian)

Let \( \mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m \) be an \( n \)-variable function mapping \( \mathbf{x} \in \mathbb{R}^n \) to \( \mathbf{y} = \mathbf{f}(\mathbf{x}) \in \mathbb{R}^m \).

The Jacobian of \( \mathbf{f} \) is defined by the matrix

\[\begin{split} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \dots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \dots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \dots & \frac{\partial y_m}{\partial x_n} \\ \end{bmatrix} \in \mathbb{R}^{m \times n}. \end{split}\]

Note that when the input and output dimensions don’t match, the Jacobian is not a square matrix. Its shape is always (output dimension) \( \times \) (input dimension).
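To make the shape convention concrete, here’s a generic finite-difference Jacobian helper (my own sketch, not part of the text’s Tensor class), applied to a hypothetical function from \( \mathbb{R}^2 \) to \( \mathbb{R}^3 \):

```python
import numpy as np

def jacobian(f, x, eps=1e-6):
    """Approximate the Jacobian of f at x with central differences.
    (A generic helper for illustration only.)"""
    x = np.asarray(x, dtype=float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.shape[0], x.shape[0]))  # output dim x input dim
    for j in range(x.shape[0]):
        e = np.zeros_like(x)
        e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    return J

# f maps R^2 -> R^3, so the Jacobian is a 3 x 2 matrix
f = lambda x: np.array([x[0] * x[1], x[0] + x[1], np.sin(x[0])])
J = jacobian(f, np.array([1.0, 2.0]))
print(J.shape)  # (3, 2)
```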

14.3. The vectorized chain rule

Now that we understand the building blocks of vectorized computational graphs, let’s focus on the lead actor: the chain rule. Like always, it’s best to study it through an example. You already know the drill by now, so let’s start with the graph right away.

![A vectorized computational graph with two terminal nodes](../_images/0a57c4f104cb299d034130d4c4576b060ca242849158136f198e85c2dc62929f.svg)

Let’s compute the derivative! Right away, we hit a snag. Unlike our previous examples, there are not one, but two terminal nodes! Thus, we have two derivative graphs to fill up with values.

![The two derivative graphs, one per terminal node](../_images/d60941e77d1d7f0f016a314da2326d2b20f0601ff6ce280037b8629bb053794d.svg)

According to what we’ve learned when we introduced backpropagation, the first graph gives

\[\begin{split} \begin{align*} \frac{dz_1}{dx_1} &= \frac{dz_1}{dy_1} \frac{\partial y_1}{\partial x_1} + \frac{dz_1}{dy_2} \frac{\partial y_2}{\partial x_1}, \\ \frac{dz_1}{dx_2} &= \frac{dz_1}{dy_1} \frac{\partial y_1}{\partial x_2} + \frac{dz_1}{dy_2} \frac{\partial y_2}{\partial x_2}, \end{align*} \end{split}\]

while the second graph gives

\[\begin{split} \begin{align*} \frac{dz_2}{dx_1} &= \frac{dz_2}{dy_1} \frac{\partial y_1}{\partial x_1} + \frac{dz_2}{dy_2} \frac{\partial y_2}{\partial x_1}, \\ \frac{dz_2}{dx_2} &= \frac{dz_2}{dy_1} \frac{\partial y_1}{\partial x_2} + \frac{dz_2}{dy_2} \frac{\partial y_2}{\partial x_2}. \end{align*} \end{split}\]

So many symbols and operations; it’s not the most pleasant sight to look at. Do you recognize a pattern? Let me rewrite the above expressions in matrix form:

\[\begin{split} \begin{bmatrix} \frac{dz_1}{dx_1} & \frac{dz_1}{dx_2} \\ \frac{dz_2}{dx_1} & \frac{dz_2}{dx_2} \\ \end{bmatrix} = \begin{bmatrix} \sum_{k=1}^2 \frac{dz_1}{dy_k} \frac{\partial y_k}{\partial x_1} & \sum_{k=1}^2 \frac{dz_1}{dy_k} \frac{\partial y_k}{\partial x_2} \\ \sum_{k=1}^2 \frac{dz_2}{dy_k} \frac{\partial y_k}{\partial x_1} & \sum_{k=1}^2 \frac{dz_2}{dy_k} \frac{\partial y_k}{\partial x_2} \\ \end{bmatrix}. \end{split}\]

In other words,

\[ \frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \]

holds! With one brilliant stroke of linear algebra, we’ve contracted our computational graph from

![The original component-wise computational graph](../_images/0a57c4f104cb299d034130d4c4576b060ca242849158136f198e85c2dc62929f.svg)

to

![The contracted, vectorized computational graph](../_images/4c63bf7ed6776e176ee7f60a490427bceef927a52475fa033ccbb4ff90c40f34.svg)

where \( \mathbf{x}, \mathbf{y}, \mathbf{z} \in \mathbb{R}^2 \), and its derivative graphs from

![The original pair of derivative graphs](../_images/d60941e77d1d7f0f016a314da2326d2b20f0601ff6ce280037b8629bb053794d.svg)

to

![The contracted, vectorized derivative graph](../_images/31d0ec9d28c60bbd09a2f938641ffcc9781759a134df44bef595d3894a44593a.svg)

where \( \frac{d\mathbf{z}}{d\mathbf{x}}, \frac{d\mathbf{z}}{d\mathbf{y}}, \frac{d\mathbf{z}}{d\mathbf{z}}, \frac{\partial \mathbf{y}}{\partial \mathbf{x}}, \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \in \mathbb{R}^{2 \times 2} \) are \( 2 \times 2 \) matrices. That’s a massive improvement! I told you that linear algebra is powerful.
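We can verify \( \frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \) numerically on a concrete composition of my own choosing (the functions below are illustrative, not from the text):

```python
import numpy as np

# A concrete R^2 -> R^2 -> R^2 composition:
# y = (x1 + x2, x1 * x2),  z = (y1**2, y1 + y2)
x = np.array([2.0, 3.0])
y = np.array([x[0] + x[1], x[0] * x[1]])

# Local Jacobians, worked out by hand
dy_dx = np.array([[1.0, 1.0],
                  [x[1], x[0]]])      # dy_i/dx_j
dz_dy = np.array([[2 * y[0], 0.0],
                  [1.0, 1.0]])        # dz_i/dy_j

# The vectorized chain rule: one matrix product
dz_dx = dz_dy @ dy_dx

# Differentiating z directly as a function of x:
# z1 = (x1 + x2)**2, z2 = (x1 + x2) + x1*x2
direct = np.array([[2 * (x[0] + x[1]), 2 * (x[0] + x[1])],
                   [1 + x[1], 1 + x[0]]])
print(np.allclose(dz_dx, direct))  # True
```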

Going one step further, consider the vectorized computational graph defined by \( \mathbf{x} \in \mathbb{R}^n, \mathbf{y} \in \mathbb{R}^m, \mathbf{z} \in \mathbb{R}^k, \) and \( \mathbf{w} \in \mathbb{R}^l \):

![A four-node chain with its derivative graph, for x, y, z, w of varying dimensions](../_images/e240fbb7840425200ce2cc84444f7b796846dc044a307a707bfdb26194ddb465.svg)

(For simplicity, I have put the graph and its derivative graph onto the same figure.) In this case, the Jacobians have various dimensions:

\[ \frac{d\mathbf{w}}{d\mathbf{x}} \in \mathbb{R}^{l \times n}, \quad \frac{d\mathbf{w}}{d\mathbf{y}} \in \mathbb{R}^{l \times m}, \quad \frac{d\mathbf{w}}{d\mathbf{z}} \in \mathbb{R}^{l \times k}, \quad \frac{d\mathbf{w}}{d\mathbf{w}} \in \mathbb{R}^{l \times l}, \]

and

\[ \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \in \mathbb{R}^{m \times n}, \quad \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \in \mathbb{R}^{k \times m}, \quad \frac{\partial \mathbf{w}}{\partial \mathbf{z}} \in \mathbb{R}^{l \times k}. \]

According to the chain rule,

\[\begin{split} \begin{align*} \frac{d\mathbf{w}}{d\mathbf{x}} &= \frac{d\mathbf{w}}{d\mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \\ &= \frac{d\mathbf{w}}{d\mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \\ &= \frac{d\mathbf{w}}{d\mathbf{w}} \frac{\partial \mathbf{w}}{\partial \mathbf{z}} \frac{\partial \mathbf{z}}{\partial \mathbf{y}} \frac{\partial \mathbf{y}}{\partial \mathbf{x}}. \end{align*} \end{split}\]
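The dimension bookkeeping is easy to check in code. Below is a sketch (with arbitrary dimensions and random placeholder Jacobians of my own) showing how backpropagation accumulates the global Jacobians right to left, and how the shapes chain up:

```python
import numpy as np

# Dimensions from the text: x in R^n, y in R^m, z in R^k, w in R^l
n, m, k, l = 3, 4, 5, 2
rng = np.random.default_rng(42)

# Random local Jacobians with the stated shapes (placeholders)
dy_dx = rng.normal(size=(m, n))   # dy/dx
dz_dy = rng.normal(size=(k, m))   # dz/dy
dw_dz = rng.normal(size=(l, k))   # dw/dz

# Backpropagation: accumulate global Jacobians right to left
dw_dw = np.eye(l)                 # dw/dw, the identity
dw_dz_global = dw_dw @ dw_dz      # l x k
dw_dy = dw_dz_global @ dz_dy      # l x m
dw_dx = dw_dy @ dy_dx             # l x n

print(dw_dx.shape)  # (2, 3)
```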

One more step. Now that we’ve dealt with vectorized computational graphs that consist of one path, it’s time to add some twists and forks. Take a look at the next graph, one that we’ve seen before:

![A forked computational graph: x feeds both y1 and y2, which both feed z](../_images/aafa44ce4065ba01f675e819cf3b441bf3fe56b6d4b587f321532b48dc1077c8.svg)

Here, the chain rule says that

\[ \frac{d\mathbf{z}}{d\mathbf{x}} = \frac{d\mathbf{z}}{d\mathbf{y}_1} \frac{\partial \mathbf{y}_1}{\partial \mathbf{x}} + \frac{d\mathbf{z}}{d\mathbf{y}_2} \frac{\partial \mathbf{y}_2}{\partial \mathbf{x}}. \]
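The sum over the two paths can be checked on a toy fork of my own (the functions below are assumptions for illustration): with \( \mathbf{y}_1 = 2\mathbf{x} \), \( \mathbf{y}_2 = \mathbf{x}^2 \) (elementwise), and \( \mathbf{z} = \mathbf{y}_1 + \mathbf{y}_2 \), direct differentiation of \( \mathbf{z} = 2\mathbf{x} + \mathbf{x}^2 \) must agree with the two-path sum:

```python
import numpy as np

x = np.array([1.0, 3.0])

# Local Jacobians of the fork
dy1_dx = 2 * np.eye(2)     # y1 = 2x
dy2_dx = np.diag(2 * x)    # y2 = x**2, elementwise
dz_dy1 = np.eye(2)         # z = y1 + y2
dz_dy2 = np.eye(2)

# Chain rule with a fork: sum the contributions along both paths
dz_dx = dz_dy1 @ dy1_dx + dz_dy2 @ dy2_dx

# Direct derivative of z = 2x + x**2
direct = np.diag(2 + 2 * x)
print(np.allclose(dz_dx, direct))  # True
```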

In theory, vectorized backpropagation works the same as its vanilla counterpart, with matrices instead of scalars. Are we ready to roll with the implementation?

See you in the next chapter!