# Training a ground-up MLP in native Julia (yes, on MNIST).

Is Julia going to kill Python? Who knows; probably not. But it’s certainly a fun, modern, and interesting language to use, and especially appealing for scientists with a MATLAB background like me. It’s got a unique combination of features: it’s 1-indexed, dynamically typed but (just-in-time) compiled, has a wonderful built-in package manager, and is a breeze to use interactively (concurrent use of scripting and the REPL feels *very* much like MATLAB development). I was keen to dive into getting to know native Julia, and despite regularly using PyTorch and Tensorflow I’ve never actually coded up backpropagation “by hand” myself before — hence this project and post. Two birds, one stone.

In this article we’ll build a simple multilayer perceptron (MLP) and implement backpropagation in native Julia, to classify digits in the MNIST dataset (yes, it’s another MNIST medium post — cuff me). The `mercury`

repo that accompanies this post is available here, just open and run `main.jl`

. Along the way we’ll explore some key Julia features / paradigms like some important data structures, dynamic dispatch of functions, package management, and plotting.

There are of course already proper, optimised neural network / machine learning libraries (see for example flux, and a plucky, little known underdog tensorflow). I’m hoping this article is of interest for people like me looking to see backpropagation nuts and bolts, as well as Julia in action.

**Part 1** will focus on building from the ground-up a neural network capable of a simple forward pass. We’ll train it with the backpropagation algorithm in **Part 2**, and evaluate it in **Part 3**.

# Part 1 — A forward pass

## Setup and nomenclature

We’ll build a simple three-layer MLP, taking MNIST [28 x 28] images as input, and selecting one of 10 classes.

The functional form of this model is:

**y** = g( **W**₃ f ( **W**₂ f( **W**₁ **X**+ **b**₁ ) + **b**₂ ) + **b**₃)

In which Wₙ are our weight matrices, and Bₙ are our bias vectors. The activation functions f, g we use in this model are a ReLU and a Softmax respectively. **y **(our prediction) approximates **Y** (the true value) as a function of observations (pixel values) **X**. Statistics purists please forgive this notation.

Here we’re not too concerned about eeking out the last 1% of accuracy, so the dimensionality of the hidden layers is a fairly arbitrary choice. We’ll go [784, 32, 32, 10] for simplicity. **W**₃ is then a 32 x 784 matrix, **b**₂ is a 32 x 1 vector, *etc*.

## Building blocks

Julia is not really meant to be used as an object-oriented language like Python. While you can define a `struct`

, this is a “composite data type” rather than a Python `object`

, and does not have methods defined in its definition. There’s no sense of calling `neural_net.forward()`

in which we can access `neural_net.attribute_1`

, for example.

Instead, in Julia we rely on “dynamic dispatch”, in which functions are defined uniquely for varied input data types. So `forward(x:Array{Float64,2})`

, can call completely different code to `forward(x:Int)`

. The concept is similar to “overloading” functions in eg. Java, but as this interesting StackOverflow post points out, overloading corresponds to deciding the function to be called at *compile* time, while in Julia we decide at *run* time.

So, our key MLP building block is the dense layer:

We implement this as a `struct`

, with a `weight`

, `bias`

, and and an`act_fn`

. Note that importantly it’s `mutable`

, its attributes (the MLP parameters) are allowed to change. For quality of life we don’t want to instantiate this layer using the syntax: `layer_1 = Dense(W_1, b_1, ReLU())`

, as this would require defining the matrix and vector elsewhere, and that’s messy. Instead, we overload `Dense()`

to take dimensionality integers as input, and generate in-place a new Kaiming init’d matrix for the weights, and zeros for the bias.

The Kaiming init function is straightforward (it’s important to alleviate vanishing gradients for deeper networks, and is good practice):

For an instance`layer`

of `Dense`

, we can now call `layer.weight`

, `layer.bias`

, `layer.act_fn`

as we require.

## Non-linearity

For the activation functions we also utilise a `struct`

, with dynamically dispatched `forward()`

and `gradient()`

methods. Every binary operator in Julia has a vectorised version denoted by the `.`

prefix. This is similar to behaviour in MATLAB, and one of my personal favourite features of that language. These are very helpful for implmenting pointwise activation functions.

The ReLU function’s forward method simply returns the input value if it’s more than zero, else zero. The gradient is one if the input is more than zero, else zero.

The softmax function acting on a vector, `softmax(z)`

, returns a normalised pointwise exponential of the vector’s elements, such that the output elements sum to one. Its derivative with respect to its input `z`

, takes the form `grad_softmax(z) = softmax(z) * (1 — softmax(z))`

, where `1`

is the appropriate vector of ones, and `*`

is the Hadamard product ( `.*`

in Julia). In this implementation, we utilise the so-called ‘log-sum-exp’ trick for numerical stability when calculating the softmax. There’s a great explanation of this technique here.

The softmax function ensures that an output vector’s elements are all > 0, real valued, and sum to one. These properties are synonymous with a valid probability distribution over the vector’s elements, which we choose to use as a model for the probabilty of an input vector **x**’s assignment to one of the classes represented by the output vector **y**’s elements. Under this model, the inputs to the softmax function are the log-odds associated with each class. This is essentially all we need to build the functional form of our MLP.

## Putting it together

When training the neural network we’ll need to keep track of intermediate activations as well as the values of the parameters themselves, and here we simply utilise a `dict`

(to allow O(1) access to any element) to do so:

So here we define a `net`

dictionary, with fields “Layers”, “A”, and “Z”. “Z” and “A” are required for the backpropagation algorithm, as you’ll see later.

- “Layers” is a list of the individual
`Dense`

layers. - “A” tracks the values of the layers’ activations through a forward pass
*after*applying the activation function. - “Z” tracks the activations
*prior*to applying the activation function.

Structurally, the final thing we need to do is define a `forward()`

method for our MLP, again utilising native Julia dynamic dispatch. We simply loop through the stored layers in our `dict`

, multiplying the input by the weight matrix, adding the bias, and applying our pointwise activation function. We keep track of the activations both before and after applying the activation function.

So we’re in a good place. We can call `y = forward(x, net)`

to generate a set of predictions for the class of `x`

, though this is obviously random at the moment as we have yet to train the network. Time for backpropagation.

# Part 2 — Training

We have built the functional form of our MLP: `y = forward(x, net)`

ouputs a 10 x 1 vector `y`

from a 784 x 1 vector `x`

. We’ll go into the backpropagation algorithm for updating the net’s parameters shortly, but first we need to talk data.

## Loading and loss

Although I said earlier we’d only be using native Julia for this project, that may have been *misdirection*. In `mercury`

I’ve used the `MLDatasets`

package to efficiently load training and test data as `Float64`

arrays.

Managing packages in Julia is *a lot* simpler than Python in my opinion, and is a feature properly baked into the language rather than an afterthought like in Python. Environments are handled on a project-by-project basis, handled by the `Project.toml`

and `Manifest.toml`

files. These record package dependencies, and allow a user to exactly recreate the Julia environment in which the project was developed. Installing packages is as simple as typing `]`

to enter the `Pkg`

section of the REPL, then using the command `add MLDatasets`

, for example.

After installing, we can import with `using MLDatasets`

. We can then call`train_x, train_y = MNIST.traindata(Float64)`

. The `x`

data is good to go from here, but we need to use “one-hot encoding” for the `y`

data, as `train_y`

by default outputs an integer between 0 and 9 as the class label. We do this using the function `zi_one_hot_encode(train_y)`

:

As discussed previously, our `Softmax`

layer outputs a vector representing a probability distribution over classes, ie. its elements are real valued, > 0, and sum to one. We would like the output of our MLP, a vector **y** representing a probability distribution over classes, to be as close possible to the one-hot encoding of the training data. We can then minimise the cross-entropy between these probability distributions to train our neural network.

For a deeper dive into the cross-entropy loss, and its (close) relation to the Kullback-Leibler divergence (the *relative* entropy), see here. It’s the maximum likelihood function for a Bernoulli distribution over classes (corresponding to our one-hot encoding), so is the correct loss function for single-label classification. Here’s a Julia implementation, along with its derivative which we’ll need for backpropagation:

So we’re finally ready to tackle the big beast — backpropagation. I found this blog post by Piotr Skalski tremendously helpful when tackling this, and a lot of my implementation of this algorithm is inspired by his work. I’ve used the same syntax, so check out his explanation if mine doesn’t do it for you!

## Backpropagation

The backpropagation algorithm allows efficient calculation of the partial derivatives of the loss function with respect to the neural network’s parameters, layer-wise. We use the chain rule, iterating backwards through the network, starting with **A**ₙ = outputs, **y**, and ending with **A**₀ = inputs, **x**.

We end up with d**W**ₙ and d**b**ₙ, for n = 1 : the network’s depth. These are the gradients of the loss function with respect to weights and biases of layer n. We built keeping track of **A** and **Z **into the definition of our network (a `dict`

), so inferring the gradient updates for a given layer is just a case of coding up the equations in the above figure:

This is the generalised function that takes the gradient of the loss with respect to the previous (closer to the outputs of the network) activations as an *input*. As the final activation is the output of the net, for the first call of this function`dA = xe_loss_derivative(y, Y)`

. For subsequent (closer to the inputs) calls of `calculate_gradient`

we store `dA`

. Here we’re again employing dynamic dispatch, as the `gradient()`

function’s operation depends on the data type of `act_fn`

, which we defined / overloaded above.

And thus we’re left with a set of parameter updates d**W** and d**b**.

When we employ minibatch gradient descent, we infer parameter gradients for each record in the minibatch, without updating them until the end of the batch. This helps to regularise the learning, as we’re essentially descending a moving average of the gradient rather than jumping around a noisy loss landscape. For practical purposes, here I’ve implemented *accumulation* of gradients. The following function is basically the same as the snippet above, but if `d`

, the list of parameter updates to be applied already exists, we add to it rather than overwriting it. This lets us easily get an average gradient across a minibatch for a single subsequent update.

When we want to perform an update on the parameters, it’s as simple as:

## Build a training loop

So now we have a method to update the parameters of our MLP given a training data point. The final thing we need to do is to implement this into a training *loop*, in which we cycle over the training data for a fixed number of epochs. Using loops is encouraged in Julia as it’s a (just-in-time) compiled language. The compiler knows how to optimise the instructions we provide, so while in Python we should use as much vectorisation as possible for decent performance, in Julia using loops is ok, as in C++.

As I touched on above, typically in training a neural network we use minibatch gradient descent, accumulating the average gradients for each parameter (wrt the loss) over a number of records (without updating parameters in-between), and updating at the end of the batch. In Python (PyTorch, TensorFlow, *etc*) this is typically done by vectorising the operations and exploiting LAPACK or similar, heavily optimised linear algebra functions for your CPU / GPU architecture. Here we can adopt a simpler approach, exploiting the efficiency of Julia, in which it’s a lot more syntactically obvious how minibatching works. We accumulate the gradient (as described above) for each record in a minibatch, then updated the parameters at the end.

So we’re good to go. We can call `train(net, mb_size, lr, epochs, train_x, train_y)`

and our MLP will train, minimising the cross-entropy loss. Let’s take a look at how it performs.

# Part 3 — Results

In `mercury`

I added a few lines to the training loop in order to track the model’s loss and accuracy on both the training and test data, over the course of training.

For this training run I used the following parameters:

`mb_size`

= 32`epochs`

= 50`lr`

= 0.01

We can plot the training results easily using the`Plots`

package:

First cross-entropy loss:

And now accuracy:

I showed you here a good choice of hyperparameters (no surprises that this wasn’t the first set I tried…). Setting the learning rate too high for example can lead to oscillation of the training loss around its minimum, and learning having insufficient precision for parameters to settle good values.

# Conclusions

In this post we’ve built from the ground up a simple MLP to classify MNIST digits using (mostly) native Julia. We’ve implemented some ML fundamentals, like activation functions, basic neural architecture, and automatic differentiation. We’ve covered some important Julia features like key data structures, functions, dynamic dispatch, pointwise operators, package management, and basic plotting.

The `mercury`

repo accompanies this post and is available here, just open and run `main.jl`

after installing `MLDatasets`

as described in this post.

Hopefully you found some of this journey useful — I’m personally really excited to see how Julia evolves within our data science ecosystem.