Vectorized computational graphs

13. Vectorized computational graphs#

Although Scalar provides a complete implementation for computational graphs, they are terribly hard to work with, not to mention their performance issues. As we’ve seen it earlier, training a simple neural net with a hidden layer of eight Scalar neurons takes a lot of time, and defining it without matrix multiplications is a pain.

Fortunately, we don’t have to be so granular when defining computational graphs. Via the magic of linear algebra, we can seriously cut down the number of nodes and edges in our graphs, resulting in a blazing increase in speed and notational simplicity.

To give you an example, consider the dot product operation, defined by

\[ \mathbf{x} \cdot \mathbf{y} = \sum_{i=1}^{n} x_i y_i, \quad \mathbf{x}, \mathbf{y} \in \mathbb{R}^n. \]

This is a computational graph that contains \(4n - 1 \) nodes and \(4n - 2 \) edges. Here it is for three-dimensional vectors.

../_images/f662e663ee1a3d9b9672bb2045c4fa35882c77af7b097184e79d07b547b99996.svg

Consider doing a backward pass in this graph: you have to traverse every node, every edge, and execute functions there. It’s even worse for matrix multiplications, and the computational complexity piles up fast. For a basic one-layer network like \( \sigma(\mathrm{ReLU}(xA)B) \), we already have a ton of components. If the inputs are images, we are in the tens of thousands. That’s not going to work out in the long run.

What if we replace scalars with vectors and matrices in our computational graphs?

This is how the dot product would look like:

../_images/2ce0cd31c7d61453efe3550f9fe5f5cd060455d2242471e25420956ef6216e17.svg

Structurally, this is identical to the one of matrix multiplication:

../_images/8c3a839e511dbd9c6f2536d8c97ba7c1ae2e2e0cb8024b3b570407fa60d19baa.svg

This is what the Tensor class implements in mlfz. Let’s see what tensors are!

13.1. How to work with `Tensor`#

Similarly to Scalar, the Tensor class is a node in a computational graph. This time, instead of a scalar value, they represent NumPy arrays.

import numpy as np
from mlfz.nn.tensor import Tensor


x = Tensor.ones(3, 4)
y = Tensor.zeros_like(x)

x + y

Tensor([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

A Tensor can be initialized in multiple ways; the principal one is via a NumPy array.

Tensor(np.array([[1, 2], [3, 4]]))

Tensor([[1, 2],
       [3, 4]])

Tensor.from_random(2, 4)

Tensor([[0.44753265, 0.63713811, 0.30274584, 0.62520603],
       [0.18116229, 0.80732081, 0.85423734, 0.12697534]])

Tensor.zeros(5, 2)

Tensor([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

x = Tensor.zeros(3, 4)
Tensor.ones_like(x)

Tensor([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])

Just like a Scalar, Tensor instances hold three important attributes as well:

a tensor value,
the backwards gradient,
and the list of incoming edges.

x = Tensor.ones(3, 4)
y = Tensor.ones(3, 4)
z = x * y

z.backwards_grad    # this is a NumPy array

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

z.prevs   # and this is a list of Edges

[Edge(prev=Tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]), local_grad=array([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]), backward_fn=<function _pointwise at 0x73626ad7a0c0>),
 Edge(prev=Tensor([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]), local_grad=array([[1., 1., 1., 1.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]]), backward_fn=<function _pointwise at 0x73626ad7a0c0>)]

We’ll talk about the Edge object later, but if you are observant, you may have noticed that compared to the Scalar version, we have an additional backward_fn attribute. This is the price we pay for vectorization.

First, let’s see what Tensors can do.

13.2. `Tensor`s and operations#

Tensors support quite a few operations:

pointwise addition +,
pointwise subtraction -,
pointwise multiplication *,
pointwise division /,
pointwise exponentiation **,
matrix multiplication @,
and matrix transposition T.

Thanks to the broadcasting magic of NumPy, the other operand need not be a Tensor, it can be a vanilla Python number. (We’ll open a whole other can of worms with broadcasting, but we’ll cross that bridge later.)

x = Tensor.ones(3, 4)
2 + x

Tensor([[3., 3., 3., 3.],
       [3., 3., 3., 3.],
       [3., 3., 3., 3.]])

Just like Scalars, we can apply functions to Tensors.

from mlfz.nn.tensor.functional import exp

x = Tensor(value=np.array([[1, 2], [3, 4]]))
exp(x)

Tensor([[ 2.71828183,  7.3890561 ],
       [20.08553692, 54.59815003]])

As a Tensor is essentially a wrapper over NumPy arrays, it has its own sum and mean methods, working identically to the original versions.

x = Tensor.ones(3, 4)
x.sum()

Tensor(12.)

x.sum(axis=0)

Tensor([3., 3., 3., 3.])

x.sum(axis=1)

Tensor([4., 4., 4.])

y = Tensor(np.array([[1, 2], [3, 4]]))
y.mean()

Tensor(2.5)

y.mean(axis=0)

Tensor([2., 3.])

13.3. Computational graphs with `Tensor`s#

Just like most features, the graph-building is the same as well: it is dynamically built upon applying functions and operations.

from mlfz.nn.tensor.functional import sigmoid, tanh

x = Tensor.from_random(1, 3)
A = Tensor.from_random(3, 5)
B = Tensor.from_random(5, 1)

y = sigmoid(tanh(x @ A) @ B)

Tensor([[0.8497897]])

y.shape

(1, 1)

Again, the y.backward method calculates the gradient of y with respect to all nodes in the graph. As our nodes are tensors, the backwards gradient will be a tensor as well, with the shape matching the node’s shape.

y.backward()
x.backwards_grad

array([[0.02344552, 0.09243718, 0.05054987]])

A.backwards_grad

array([[0.00370821, 0.00396276, 0.02496024, 0.06280291, 0.0114475 ],
       [0.00496636, 0.00530728, 0.03342898, 0.08411127, 0.01533152],
       [0.00090712, 0.00096939, 0.00610589, 0.01536313, 0.00280034]])

B.backwards_grad

array([[0.11902051],
       [0.1178138 ],
       [0.10175171],
       [0.06213271],
       [0.08277062]])

In essence, tensors allow us to vectorize computational graphs, increasing the speed and simplicity like you wouldn’t believe. Let’s proceed to build one!

13.4. Building a vectorized neural network with tensors#

To demonstrate the power of Tensor, we’ll need a dataset to test out our neural network. We’ll use our good old friend, the spiral dataset. (Once again, the dataset generating code is irrelevant, so don’t hung up on it.)

xs, ys = generate_spiral_dataset(200, noise=2)

../_images/3f5d51b32a0b949e8ed26fd7de78289c62855d7adc43d0c0d45206479e4c1485.png

To define a network, we’ll use the Model class from mlfz.nn.

from mlfz.nn import Model
from mlfz.nn.tensor.functional import tanh, sigmoid


class TwoLayerNetwork(Model):
    def __init__(self):
        self.A = Tensor.from_random(2, 8)
        self.B = Tensor.from_random(8, 8)
        self.C = Tensor.from_random(8, 1)

    def forward(self, x):
        return sigmoid(tanh(tanh(x @ self.A) @ self.B) @ self.C)
    
    def parameters(self):
        return {"A": self.A, "B": self.B, "C": self.C}

Look at how simple this implementation is, compared to the scalar version. Writing expressions like tanh(x @ A) makes all the difference.

Training tensor networks is also much faster, as we shall see soon. Let’s initialize one and see how it looks.

model = TwoLayerNetwork()

from itertools import product


def visualize_model(model, xs, ys, res=100, xrange=(-10, 10), yrange=(-10, 10)):
    with plt.style.context("seaborn-v0_8-white"):
        plt.figure(figsize=(8, 8))
        
        res = 100
        x = np.linspace(xrange[0], xrange[1], res)
        y = np.linspace(yrange[0], yrange[1], res)
        xx, yy = np.meshgrid(x, y)
        zz = np.zeros_like(xx)
        
        for i, j in product(range(res), range(res)):
            zz[i, j] = model.forward(Tensor([xx[i, j], yy[i, j]])).value


        # plot the decision boundary
        plt.contourf(xx, yy, zz, levels=100, cmap='coolwarm_r', alpha=0.4)
        plt.xlabel('x')
        plt.ylabel('y')
        plt.title('the decision boundary of our model')
        
        # plot the data
        plt.scatter([x[0] for x in xs], [x[1] for x in xs], c=ys, cmap=plt.cm.Spectral, zorder=10)
        plt.show()

visualize_model(model, xs, ys)

/tmp/ipykernel_66116/4170422174.py:15: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  zz[i, j] = model.forward(Tensor([xx[i, j], yy[i, j]])).value

../_images/47940bedfc3bdc6e032e998d343abdd83e5a1ac0a995425625d5558a9605df9d.png

Here goes the training. We’ll let it rip and crunch out a whooping one hundred thousand iterations. For comparison, we merely did a thousand steps for the Scalar version, and even that almost made my notebook explode.

from mlfz.nn.tensor.loss import binary_cross_entropy


n_steps = 100000
lr = 0.1

for i in range(1, n_steps + 1):
    preds = model(xs)
    l = binary_cross_entropy(preds, ys)
    l.backward()
    model.gradient_update(lr)

    if i == 1 or i % 10000 == 0:
        print(f"step no. {i}, loss = {l.value}")

step no. 1, loss = 1.3500574656982025
step no. 10000, loss = 0.3276131190652927
step no. 20000, loss = 0.27962397222881064
step no. 30000, loss = 0.26929577387216247
step no. 40000, loss = 0.2990275403169361
step no. 50000, loss = 0.26435643969354067
step no. 60000, loss = 0.28143142115711317
step no. 70000, loss = 0.26137091793547507
step no. 80000, loss = 0.24449209263802452
step no. 90000, loss = 0.2385367334321998
step no. 100000, loss = 0.2446387078525447

In other words, despite having a more complex model and performing way more gradient descent steps, we are still much faster. Let’s time it to confirm. We’ll only measure a thousand training loops.

def train_model():
    model = TwoLayerNetwork()
    n_steps = 1000
    lr = 0.1

    for i in range(1, n_steps + 1):
        preds = model(xs)
        l = binary_cross_entropy(preds, ys)
        l.backward()
        model.gradient_update(lr)


%timeit -r 1 -n 1 train_model()

1.92 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

And this is the final result.

visualize_model(model, xs, ys)

/tmp/ipykernel_66116/4170422174.py:15: DeprecationWarning: Conversion of an array with ndim > 0 to a scalar is deprecated, and will error in future. Ensure you extract a single element from your array before performing this operation. (Deprecated NumPy 1.25.)
  zz[i, j] = model.forward(Tensor([xx[i, j], yy[i, j]])).value

../_images/99b983a16c2e049d38405193b2ff5bd5e9c082b2538136db381c6a935d2ababe.png

13.5. `Tensor` vs `Scalar` performance comparison#

Before moving on to the implementation details, let’s quickly compare the performance of Scalar vs. Tensor. To do that, we’ll use a simple one-layer neural network with thirty-two hidden neurons.

We’ll store the number of hidden neurons inside a variable so you can play around with it if you execute this notebook locally.

n_hidden_neurons = 32

First, here’s the data. Feel free to skip the code, it’s nothing we haven’t seen before.

import numpy as np
from mlfz.nn.scalar import Scalar
from mlfz.nn.tensor import Tensor

n_samples = 1000

xs_scalar = [[Scalar.from_random(), Scalar.from_random()] for _ in range(n_samples)]
ys_scalar = [Scalar.from_random() for _ in range(n_samples)]

xs_tensor = Tensor(np.array([[x1.value, x2.value] for x1, x2 in xs_scalar]))
ys_tensor = Tensor(np.array([y.value for y in ys_scalar]).reshape(-1, 1))

Our Scalar network takes quite a while to set up. Here we go:

from mlfz import Model
from mlfz.nn.scalar import functional as f_scalar
from itertools import product


class ScalarNetwork(Model):
    def __init__(self):
        self.A = [[Scalar.from_random() for j in range(n_hidden_neurons)]
                  for i in range (2)]
        self.B = [Scalar.from_random() for i in range(n_hidden_neurons)]
    
    def forward(self, x):
        fs = [sum([self.A[i][j] * x[i] for i in range(2)]) for j in range(n_hidden_neurons)]
        fs_relu = [f_scalar.tanh(f) for f in fs]
        gs = sum([self.B[i] * fs_relu[i] for i in range(n_hidden_neurons)])
        return f_scalar.sigmoid(gs)

    def parameters(self):
        A_dict = {f"a{i}{j}": self.A[i][j] for i, j in product(range(2), range(n_hidden_neurons))}
        B_dict = {f"b{i}": self.B[i] for i in range(n_hidden_neurons)}
        return {**A_dict, **B_dict}

To accurately measure the time of a single gradient step, we encapsulate all the logic into a single function called scalar_network_step.

from mlfz.nn.scalar.loss import binary_cross_entropy as bce_scalar

scalar_net = ScalarNetwork()

def scalar_network_step():
    preds = [scalar_net(x) for x in xs_scalar]
    l = bce_scalar(preds, ys_scalar)
    l.backward()
    scalar_net.gradient_update(0.01)

Let’s go and %timeit!

%timeit scalar_network_step()

372 ms ± 22.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now, about the Tensor network.

from mlfz.nn.tensor import functional as f_tensor


class TensorNetwork(Model):
    def __init__(self):
        self.A = Tensor.from_random(2, n_hidden_neurons)
        self.B = Tensor.from_random(n_hidden_neurons, 1)

    def forward(self, x):
        return f_tensor.sigmoid(f_tensor.tanh(x @ self.A) @ self.B)
    
    def parameters(self):
        return {"A": self.A, "B": self.B}

Look at that simplicity! Vectorization is worth it for that alone, but wait until we see how fast it is.

from mlfz.nn.tensor.loss import binary_cross_entropy as bce_tensor


tensor_net = TensorNetwork()


def tensor_network_step():
    preds = tensor_net(xs_tensor)
    l = bce_tensor(preds, ys_tensor)
    l.backward()
    tensor_net.gradient_update(0.01)

%timeit tensor_network_step()

4.71 ms ± 513 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The actual performance depends on the machine this book is built on, but you should see a roughly 100x speedup, given by the magic of vectorization. If you are running this notebook locally, try changing the n_hidden_neurons variable in the first executable cell in this notebook. You’ll be surprised: the execution time of the Scalar version will rapidly increase, but the Tensor version will roughly stay the same!

That’s because the graph structure adds a heavy overhead to our computations. We’ll profile the code in a later version of the notebook, but this is because the actual computations like addition, multiplication, etc, are only a small portion of the training!

Vectorized computational graphs

Contents

13. Vectorized computational graphs#

13.1. How to work with Tensor#

13.2. Tensors and operations#

13.3. Computational graphs with Tensors#

13.4. Building a vectorized neural network with tensors#

13.5. Tensor vs Scalar performance comparison#

13.1. How to work with `Tensor`#

13.2. `Tensor`s and operations#

13.3. Computational graphs with `Tensor`s#

13.5. `Tensor` vs `Scalar` performance comparison#