Training a ground-up MLP in native Julia (yes, on MNIST).

Tom McAuliffe
11 min read · Feb 27, 2021

Is Julia going to kill Python? Who knows; probably not. But it’s certainly a fun, modern, and interesting language to use, and especially appealing for scientists with a MATLAB background like me. It’s got a unique combination of features: it’s 1-indexed, dynamically typed but (just-in-time) compiled, has a wonderful built-in package manager, and is a breeze to use interactively (concurrent use of scripting and the REPL feels very much like MATLAB development). I was keen to dive into getting to know native Julia, and despite regularly using PyTorch and Tensorflow I’ve never actually coded up backpropagation “by hand” myself before — hence this project and post. Two birds, one stone.

In this article we’ll build a simple multilayer perceptron (MLP) and implement backpropagation in native Julia, to classify digits in the MNIST dataset (yes, it’s another MNIST medium post — cuff me). The mercury repo that accompanies this post is available here; just open and run main.jl. Along the way we’ll explore some key Julia features and paradigms: important data structures, dynamic dispatch of functions, package management, and plotting.

Hello old friend

There are of course already proper, optimised neural network / machine learning libraries (see for example Flux, and a plucky, little-known underdog called TensorFlow). I’m hoping this article is of interest for people like me looking to see the nuts and bolts of backpropagation, as well as Julia in action.

Part 1 will focus on building from the ground-up a neural network capable of a simple forward pass. We’ll train it with the backpropagation algorithm in Part 2, and evaluate it in Part 3.

Part 1 — A forward pass

Setup and nomenclature

We’ll build a simple three-layer MLP, taking MNIST [28 x 28] images as input, and selecting one of 10 classes.

The functional form of this model is:

y = g( W₃ f( W₂ f( W₁X + b₁ ) + b₂ ) + b₃ )

Here Wₙ are our weight matrices and bₙ are our bias vectors. The activation functions f and g in this model are a ReLU and a softmax respectively. y (our prediction) approximates Y (the true value) as a function of observations (pixel values) X. Statistics purists please forgive this notation.

Here we’re not too concerned about eking out the last 1% of accuracy, so the dimensionality of the hidden layers is a fairly arbitrary choice. We’ll go [784, 32, 32, 10] for simplicity. W₁ is then a 32 x 784 matrix, b₂ is a 32 x 1 vector, etc.

Building blocks

Julia is not really meant to be used as an object-oriented language like Python. While you can define a struct, this is a “composite data type” rather than a Python object, and does not have methods defined in its definition. There’s no sense of calling neural_net.forward() in which we can access neural_net.attribute_1, for example.

Instead, in Julia we rely on “dynamic dispatch”, in which distinct methods of a function are defined for different input data types. So forward(x::Array{Float64,2}) can call completely different code to forward(x::Int). The concept is similar to “overloading” functions in e.g. Java, but as this interesting StackOverflow post points out, overloading corresponds to deciding which function to call at compile time, while in Julia the decision is made at run time.

So, our key MLP building block is the dense layer.

We implement this as a struct, with a weight, a bias, and an act_fn. Importantly, it’s mutable: its attributes (the MLP parameters) are allowed to change. For quality of life we don’t want to instantiate this layer with the syntax layer_1 = Dense(W_1, b_1, ReLU()), as this would require defining the matrix and vector elsewhere, and that’s messy. Instead, we overload Dense() to take dimensionality integers as input, and generate in place a Kaiming-initialised matrix for the weights and zeros for the bias.
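A minimal sketch of what this might look like (field names and the kaiming helper, shown in the next snippet, are my assumptions; the mercury repo’s exact code may differ):

```julia
# A sketch of the Dense layer: mutable so its parameters can be updated.
mutable struct Dense
    weight::Array{Float64,2}
    bias::Array{Float64,1}
    act_fn                     # e.g. ReLU() or Softmax(), defined below
end

# Convenience constructor: build a layer from its dimensions, with
# Kaiming-initialised weights and zero biases.
Dense(dim_in::Int, dim_out::Int, act_fn) =
    Dense(kaiming(dim_out, dim_in), zeros(dim_out), act_fn)
```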

The Kaiming init function is straightforward (it’s important to alleviate vanishing gradients for deeper networks, and is good practice):
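As a sketch, assuming the standard He-normal form (normal draws scaled by √(2 / fan_in)):

```julia
# Kaiming (He) initialisation for a dim_out x dim_in weight matrix.
kaiming(dim_out::Int, dim_in::Int) = randn(dim_out, dim_in) .* sqrt(2.0 / dim_in)
```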

For an instance layer of Dense, we can now call layer.weight, layer.bias, and layer.act_fn as we require.

Non-linearity

For the activation functions we also use a struct, with dynamically dispatched forward() and gradient() methods. Every binary operator in Julia has a vectorised (broadcast) version denoted by a . prefix. This is similar to behaviour in MATLAB, and one of my personal favourite features of that language. These are very helpful for implementing pointwise activation functions.

The ReLU function’s forward method simply returns the input value if it’s more than zero, else zero. The gradient is one if the input is more than zero, else zero.
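A sketch of how this might be dispatched (the struct and method names are assumptions, kept consistent with the rest of this post):

```julia
# Activations as (empty) structs, so forward/gradient can dispatch on their type.
struct ReLU end

forward(f::ReLU, z) = max.(z, 0.0)          # zero out negative inputs
gradient(f::ReLU, z) = float.(z .> 0.0)     # 1 where z > 0, else 0
```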

The softmax function, element-wise: softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ).

The softmax function acting on a vector, softmax(z), returns a normalised pointwise exponential of the vector’s elements, such that the output elements sum to one. Its derivative with respect to its input z takes the form grad_softmax(z) = softmax(z) * (1 - softmax(z)), where 1 is the appropriate vector of ones and * is the Hadamard product ( .* in Julia). In this implementation, we use the so-called ‘log-sum-exp’ trick for numerical stability when calculating the softmax. There’s a great explanation of this technique here.
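A sketch, using the same dispatch pattern as ReLU (the exact implementation in mercury may differ):

```julia
struct Softmax end

# Forward pass via the log-sum-exp trick: softmax(z) = exp.(z .- logsumexp(z)).
function forward(f::Softmax, z)
    m = maximum(z)                        # shift by the max for stability
    lse = m + log(sum(exp.(z .- m)))      # log-sum-exp of z
    return exp.(z .- lse)
end

# Elementwise derivative, as described above.
gradient(f::Softmax, z) = forward(f, z) .* (1.0 .- forward(f, z))
```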

The softmax function ensures that an output vector’s elements are all real valued, greater than zero, and sum to one. These are exactly the properties of a valid probability distribution over the vector’s elements, which we use as a model for the probability of an input vector x being assigned to each of the classes represented by the output vector y’s elements. Under this model, the inputs to the softmax function are the log-odds associated with each class. This is essentially all we need to build the functional form of our MLP.

Putting it together

When training the neural network we’ll need to keep track of intermediate activations as well as the values of the parameters themselves, and here we simply utilise a dict (to allow O(1) access to any element) to do so:
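A sketch of such a container, matching the [784, 32, 32, 10] architecture chosen earlier (the constructor name is an assumption):

```julia
# Build the network as a Dict: the layers themselves plus slots for the
# intermediate activations needed by backpropagation.
function build_net()
    layers = [Dense(784, 32, ReLU()),
              Dense(32, 32, ReLU()),
              Dense(32, 10, Softmax())]
    return Dict("Layers" => layers,
                "A" => Vector{Any}(undef, length(layers)),
                "Z" => Vector{Any}(undef, length(layers)))
end

net = build_net()
```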

So here we define a net dictionary, with keys “Layers”, “A”, and “Z”. “Z” and “A” are required for the backpropagation algorithm, as you’ll see later.

  • “Layers” is a list of the individual Dense layers.
  • “A” tracks the values of the layers’ activations through a forward pass after applying the activation function.
  • “Z” tracks the activations prior to applying the activation function.

Structurally, the final thing we need to do is define a forward() method for our MLP, again utilising native Julia dynamic dispatch. We simply loop through the stored layers in our dict, multiplying the input by the weight matrix, adding the bias, and applying our pointwise activation function. We keep track of the activations both before and after applying the activation function.
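A sketch of what this forward pass might look like, consistent with the structures sketched above:

```julia
# Forward pass through the whole network, storing pre- and post-activation
# values for use in backpropagation.
function forward(x::Array{Float64,1}, net::Dict)
    a = x
    for (i, layer) in enumerate(net["Layers"])
        z = layer.weight * a .+ layer.bias    # pre-activation ("Z")
        a = forward(layer.act_fn, z)          # post-activation ("A")
        net["Z"][i] = z
        net["A"][i] = a
    end
    return a
end
```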

So we’re in a good place. We can call y = forward(x, net) to generate a set of predictions for the class of x, though this is obviously random at the moment as we have yet to train the network. Time for backpropagation.

Part 2 — Training

We have built the functional form of our MLP: y = forward(x, net) outputs a 10 x 1 vector y from a 784 x 1 vector x. We’ll go into the backpropagation algorithm for updating the net’s parameters shortly, but first we need to talk data.

Loading and loss

Although I said earlier we’d only be using native Julia for this project, that may have been misdirection. In mercury I’ve used the MLDatasets package to efficiently load training and test data as Float64 arrays.

Managing packages in Julia is a lot simpler than in Python, in my opinion: it’s a feature properly baked into the language rather than an afterthought. Environments are handled on a project-by-project basis via the Project.toml and Manifest.toml files, which record package dependencies and allow a user to exactly recreate the Julia environment in which the project was developed. Installing packages is as simple as typing ] to enter the Pkg mode of the REPL, then using a command such as add MLDatasets.

The Julia REPL in pkg mode.

After installing, we can import with using MLDatasets, and then call train_x, train_y = MNIST.traindata(Float64). The x data is good to go from here, but we need to “one-hot encode” the y data, as train_y by default gives an integer between 0 and 9 as the class label. We do this using the function zi_one_hot_encode(train_y):
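A sketch of what zi_one_hot_encode might do (the repo’s implementation may differ in detail):

```julia
# One-hot encode integer labels 0–9 into a 10 x N matrix of 0s and 1s.
function zi_one_hot_encode(y)
    out = zeros(10, length(y))
    for (i, label) in enumerate(y)
        out[label + 1, i] = 1.0    # label 0 maps to row 1, label 9 to row 10
    end
    return out
end
```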

As discussed previously, our softmax layer outputs a vector representing a probability distribution over classes, i.e. its elements are real valued, greater than zero, and sum to one. We would like the output of our MLP, a vector y representing a probability distribution over classes, to be as close as possible to the one-hot encoding of the training data. We can then minimise the cross-entropy between these probability distributions to train our neural network.

Cross-entropy loss

For a deeper dive into the cross-entropy loss, and its close relation to the Kullback-Leibler divergence (the relative entropy), see here. It’s the negative log-likelihood for a categorical distribution over classes (corresponding to our one-hot encoding), so it is the correct loss function for single-label classification. Here’s a Julia implementation, along with its derivative, which we’ll need for backpropagation:
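A sketch of what this could look like (the derivative’s name matches the call used later in this post; the small eps guarding against log(0) is my assumption):

```julia
# Cross-entropy between the predicted distribution y_pred and the one-hot
# target y_true.
xe_loss(y_pred, y_true; eps=1e-12) = -sum(y_true .* log.(y_pred .+ eps))

# Its derivative with respect to the predicted probabilities.
xe_loss_derivative(y_pred, y_true; eps=1e-12) = -y_true ./ (y_pred .+ eps)
```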

So we’re finally ready to tackle the big beast — backpropagation. I found this blog post by Piotr Skalski tremendously helpful when tackling this, and a lot of my implementation of this algorithm is inspired by his work. I’ve used the same syntax, so check out his explanation if mine doesn’t do it for you!

Backpropagation

The backpropagation algorithm allows efficient calculation of the partial derivatives of the loss function with respect to the neural network’s parameters, layer-wise. We use the chain rule, iterating backwards through the network, starting with Aₙ = outputs, y, and ending with A₀ = inputs, x.

The backpropagation algorithm. Figure heavily inspired by the amazing Piotr Skalski’s post.

We end up with dWₙ and dbₙ for n = 1 up to the network’s depth. These are the gradients of the loss function with respect to the weights and biases of layer n. We built the tracking of A and Z into the definition of our network (a dict), so inferring the gradient updates for a given layer is just a case of coding up the equations in the above figure:
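A sketch of the per-layer gradient step, assuming it receives the incoming dA plus the stored previous-layer activation and this layer’s Z (the signature is mine, not necessarily the repo’s):

```julia
# Per-layer backpropagation: given dL/dA for this layer's output, return the
# gradients for this layer's parameters and dL/dA for the previous layer.
function calculate_gradient(dA, layer::Dense, A_prev, Z)
    dZ = dA .* gradient(layer.act_fn, Z)   # chain rule through the activation
    dW = dZ * A_prev'                      # dL/dW
    db = dZ                                # dL/db
    dA_prev = layer.weight' * dZ           # passed back to the previous layer
    return dW, db, dA_prev
end
```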

This is the generalised function that takes as input the gradient of the loss with respect to the previous (closer to the outputs of the network) activations. As the final activation is the output of the net, for the first call of this function dA = xe_loss_derivative(y, Y). For subsequent (closer to the inputs) calls of calculate_gradient we store dA. Here we’re again employing dynamic dispatch, as the gradient() function’s operation depends on the data type of act_fn, which we defined / overloaded above.

And thus we’re left with a set of parameter updates dW and db.

When we employ minibatch gradient descent, we infer parameter gradients for each record in the minibatch, without updating the parameters until the end of the batch. This helps to regularise the learning, as we’re essentially descending a moving average of the gradient rather than jumping around a noisy loss landscape. For practical purposes, here I’ve implemented accumulation of gradients. The following function is basically the same as the snippet above, but if d, the list of parameter updates to be applied, already exists, we add to it rather than overwriting it. This lets us easily get an average gradient across a minibatch for a single subsequent update.
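As a sketch of the accumulation idea (the helper name and the per-layer (dW, db) tuple layout are my assumptions):

```julia
# Accumulate per-layer (dW, db) gradients across a minibatch: create the
# running list on the first record, otherwise add to the existing entries.
function accumulate_grads!(acc, layer_grads)
    if isempty(acc)
        append!(acc, deepcopy(layer_grads))
    else
        for i in eachindex(layer_grads)
            acc[i] = (acc[i][1] .+ layer_grads[i][1],
                      acc[i][2] .+ layer_grads[i][2])
        end
    end
    return acc
end
```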

When we want to perform an update on the parameters, it’s as simple as:
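For example, a plain gradient-descent step over the accumulated gradients (a sketch, matching the structures above; dividing by the minibatch size gives the average gradient described earlier):

```julia
# Apply the minibatch-averaged gradients to each layer's parameters.
function update!(net::Dict, grads, lr, mb_size)
    for (i, layer) in enumerate(net["Layers"])
        dW, db = grads[i]
        layer.weight .-= lr .* dW ./ mb_size
        layer.bias   .-= lr .* db ./ mb_size
    end
end
```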

Build a training loop

So now we have a method to update the parameters of our MLP given a training data point. The final thing we need to do is to implement this into a training loop, in which we cycle over the training data for a fixed number of epochs. Using loops is encouraged in Julia as it’s a (just-in-time) compiled language. The compiler knows how to optimise the instructions we provide, so while in Python we should use as much vectorisation as possible for decent performance, in Julia using loops is ok, as in C++.

As I touched on above, in training a neural network we typically use minibatch gradient descent: accumulating the average gradients for each parameter (with respect to the loss) over a number of records, without updating parameters in between, and updating at the end of the batch. In Python (PyTorch, TensorFlow, etc.) this is typically done by vectorising the operations and exploiting LAPACK or similar heavily optimised linear algebra routines for your CPU / GPU architecture. Here we can adopt a simpler approach, exploiting the efficiency of Julia, in which it’s a lot more syntactically obvious how minibatching works: we accumulate the gradient (as described above) for each record in a minibatch, then update the parameters at the end.
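Putting the pieces together, a sketch of such a loop (it assumes train_x has been flattened to a 784 x N matrix and train_y one-hot encoded to 10 x N, and it reuses the helper names sketched earlier):

```julia
# Minibatch gradient descent over the training set.
function train(net, mb_size, lr, epochs, train_x, train_y)
    n = size(train_x, 2)
    for epoch in 1:epochs
        for start in 1:mb_size:n
            batch = start:min(start + mb_size - 1, n)
            grads = []                             # accumulated (dW, db) per layer
            for j in batch
                y = forward(train_x[:, j], net)    # forward pass stores A and Z
                dA = xe_loss_derivative(y, train_y[:, j])
                layer_grads = []
                for i in length(net["Layers"]):-1:1
                    A_prev = i == 1 ? train_x[:, j] : net["A"][i - 1]
                    dW, db, dA = calculate_gradient(dA, net["Layers"][i],
                                                    A_prev, net["Z"][i])
                    pushfirst!(layer_grads, (dW, db))
                end
                accumulate_grads!(grads, layer_grads)
            end
            update!(net, grads, lr, length(batch))
        end
    end
    return net
end
```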

So we’re good to go. We can call train(net, mb_size, lr, epochs, train_x, train_y) and our MLP will train, minimising the cross-entropy loss. Let’s take a look at how it performs.

Part 3 — Results

In mercury I added a few lines to the training loop in order to track the model’s loss and accuracy on both the training and test data, over the course of training.

For this training run I used the following parameters:

  • mb_size = 32
  • epochs = 50
  • lr = 0.01

We can plot the training results easily using the Plots package:
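For example (a sketch; the metric arrays train_loss and test_loss are assumed to have been collected during training):

```julia
using Plots

# Loss curves over training.
plot(1:epochs, train_loss, label="train",
     xlabel="epoch", ylabel="cross-entropy loss")
plot!(1:epochs, test_loss, label="test")
```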

First, cross-entropy loss:

Train and test loss over 50 epochs

And now accuracy:

Accuracy over 50 epochs

The hyperparameters shown here are a good choice (no surprise that this wasn’t the first set I tried…). Setting the learning rate too high, for example, can lead to the training loss oscillating around its minimum, with updates too coarse for the parameters to settle at good values.

Conclusions

In this post we’ve built from the ground up a simple MLP to classify MNIST digits using (mostly) native Julia. We’ve implemented some ML fundamentals, like activation functions, a basic neural architecture, and backpropagation. We’ve covered some important Julia features like key data structures, functions, dynamic dispatch, pointwise operators, package management, and basic plotting.

The mercury repo accompanies this post and is available here; just open and run main.jl after installing MLDatasets as described above.

Hopefully you found some of this journey useful — I’m personally really excited to see how Julia evolves within our data science ecosystem.
