Training an Artificial Neural Network using a genetic algorithm

Rafael Monteiro
Mathematics for Advanced Materials - Open Innovation Lab (MathAM-OIL)/Tohoku University, Sendai, Japan
rafael.a.monteiro.math "at" gmail.com

July 9, 2020

 

In the literature of Artificial Neural Networks (ANNs), the most common form of training involves the using Gradient Descent (GD) using the Backpropagation Algorithm (BP), which is nothing but the algorithmic design of gradient computations in a neat form (which shoud not be overlooked, for it was a huge step forwards in the field of Machine Learning).

It turns out that BP relies on computations of derivatives. The natural question is then: can we train an ANN without gradient information?

Yes, we can: we just need to figure out how to explore the parameters' search space. I coded an example (available here in which we perform training of an ANN using a very simple (somewhat primitive) random search. In fact, if one thinks about GD methods, or any other optimization methods, they are nothing but ways to explore the parameter space.

What I will do in the sequel is mainly based inspired by Daniel Hillis seminal paper Co-evolving parasites improve simulated evolution as an optimization procedure, that I really liked (and strongly recommend). The technique has many names, and is commonly known as Genetic Algorithms. There are many books available, and if you are interested you can start by looking here at this MIT OCW course - Multidisciplinary System Design Optimization.

To begin with, let's generate some data, construct a classical ANN (using Keras).

Remark: take a look at http://playground.tensorflow.org/, which is a very interesting site to see how decision boundaries behave as training evolves through time.

The classical approach using ANNs

To begin with, let's consider a problem, a simple one, not too complicate. Let's get the libraries we shall need.

First we get the dataset

png,

 

Recall that keras considers the data as batch_size X features

 

 

 

 

 

png,

Genetic Algorithms

Genetic Algorithms (GA) are somehow inspired by biological plenomena. I will not dwell on the topic that much, just referring briefly to the properties you should know of:

Genome: it is a sequence of nucleotidis, which as smaller units they are made of. You can imagine it as a vector, where each entry of the vector is a nucleotidis. In general the number of nucleotides is finite, but here we shall allow them to assume any value in the real line.

Mutation: it consists of a random change that may happen to any entry in a genome.
Crossover: it is basically a process in which a pair of genomes exchange nucleotides. In our case we will do it by defining a cutting point. For instance: given two sequences, if the cutting point is "c" (we shall use python's slicing convention), these two sequences will recombine into two new sequences,

For example. when , , and we get and . Needless to say that these two sequences should always have the same length.

There are other types of crossover, and I'll stick with the one explained above. You can read more about Biology at the Encyclopedia Britannica website.

As you might have noticed, since mutation and crossover are stochastic, we shall need to define two associated quantities, respectively and , to account for them.

How does it work?

First, we shall need an "interface" between weights of an ANN, and genomes: we shall be able to map one to the other in a 1-1 way. This part is easy, and will be tackled soon.

There are other things more important to worry about.

Genomes?

This part is somewhat nontrivial, involving a bit of modeling and, mostly, critical thinking: where are the genomes in an ANN? The first thing to do is to think what we are looking at: when we optimize an ANN we are after some "good" weights, based on which the model has good accuracy, or scores well in a certain metric. In our case then, we shall look at weights as if they were genomes. In order to do that we shall first "flatten" all the layers together, as if they were a long genome.

Fitness

Next, we think about how to measure the fitness of a genome: what makes a sequence better than other? This part is not much far from the classical ANN model, and we shall consider the weights that give a high accuracy of prediction. In a serious project or paper, you should split the dataset into 3 parts, train-dev-test set, but here we shall only use a train-test, and do model assessment by measuring the error on the training set, and testing it on the test set.

We are looking for some quantity that indicates how well the indiviual "survives" in a certain environment: that is, given an ANN with a certain genome, if it classifies well, all is good and we would like more of that genome in our future propulations (which means, this individual should leave descendants), but if it does not classify well, this ANN will probably not succeed, and shall not leave descendants.

We shall then measure fitness as a function inversely proportional to the cost function (which in the classical ANN setting must be minimized): the higher (resp. lower) the cost, the less (resp. more) prone to leave decendants the individual is.

Adding a little bit of mathematics to the discussion

Mathematically, we we will do the following: assuming a population (a set of ANNs' weights) , assume a cost function . Now we associate a probability measure to each individual in this population,

which plays the role of a canonical partition function, as in Statistical Mechanics. The quantity is a quantity that we give, and is inversely proportional to the temperature in the system; I'll talk more about that towards the end. When this function becomes the well known https://en.wikipedia.org/wiki/Softmax_function.

We shall use as follows: given a population of individuals, with weights , at each step we shall generate a new population with individuals by selecting from with probabilities given by . Notice that the individual that minimizes the cost function is the one that has the highest probability of being chosen.

Last, we shall talk about mutations. We shall consider mutations at each entry of the weight matrices, as a Bernoulli with probability (for now kept constant, but which can mutate at each iteration). If an entry is chosen to mutate, it will do so as a random noise, distributed as a normal variable centered at the origin, with variance 1.

In what follows, I'll assume N even: if it is odd you can "throw away a descendant" at the end of this process.

Ok, now we are ready to start. We shall define a few things:

  1. Initialize N (random) ANNs randomly, yieding a sequence of ANNs' weights ;
  2. Forward propagate each model, generating the fitness P(W_j) of each weight;
  3. Map each element in to a space of genomes. We shall represent this 1:1 correspondence as ;
  4. Select a new generation with N individuals from , selected with replacement according to . Abusing notation, we shall still represent this generation as ;
  5. For every 1 \leq j \leq N/2, pair individuals and and crossover with probability at a random point along it's length. This new sequence is still denoted by ;
  6. Now, with a sequence in hands, decide whether to mutate each entry by sampling a Bernoulli with parameter : if you get a success (with probability ), you mutate the entry by adding noise to it - a realization of a normal distribution with variance 1; if you do not mutate, nothing is done and the entry remains the same.
  7. Rewrite the genomes as weights: ;
  8. Stop, or return to step 3;

Each time we follow this "recipe" we obtain a new generation (we shall use 150 generations).

Implementating the GA "recipe"

Let's start with the most basic things, which more than any math, involve some project development: we need to decide (i) how to store data and (ii) how to store them as genomes. We shall do the following: in the kth layer of an ANN we define the next element as follows:

The function is known as activation function, which in our case will be ReLu units in the hidden layers, and a sigmoid in the last layer. Parameters and are the weights (some people also call them "weight and bias", respectively).

Since we need to train several ANNs, we shall denote by the kth layer of the individual . We shall store all of them as a stack, a 3D array. For instance, if denotes this 3D matrix, then corresponds to the matrix .

With regards to genomes, we shall store them as a row vector. In this way, if we have a population with size and genomes of length , all the population genomes will be stored as a matrix .

In what follows, we separate the weights as two dictionaries: one contains , adn the other (see Equation ).

Propagate the model

I wrote a tensorflow implementation of the model below, using sigmoid_cross_entropy_with_logits. Since we now know how to feed the weights to a keras model we shall stick with the latter approach.

The code will not be efficient because it is not vectorized (on population). Apparently there is a nicer way to implement using keras. Keras was mainly designed for CNN's, therefore it has many nice 3D matrices computations (like multiplications, convolutions and so) already implemented.

https://www.tensorflow.org/api_docs/python/tf/keras/backend/dot

For this example this is not a big deal. In real application, this can be important.

Implementation

First we set up the parameters:

Now let's plot some graphs. I'll need some extra libraries for that

png,

It is somehow better to see the average and standard deviation of these runs. One caveat that you should be aware of is that averaging removes a lot of the oscillation you see in each realization of this process; the standard deviation (represented as a shadow) helps a little, keeping part of this information in the plot.

png,

In summary, we can get results that are even better than those using the "classical" backpropagation approach. Notice that this is heavily dependant on the fact that the problem is not that high dimensional, i.e., not many parameters to tubne by optimization. For high dimensional problems a hybrid approach is probably more apropriate.

 

Final remarks: can we improve these results?

Surely we can! I will leave some ideas below, you can try to find others. I also strongly recommend Hillis' paper: it is full of nice ideas.

  1. Varying : in the function "beta_softmax" there is a parameter beta. It plays the role of the Boltzmann constant (that's what inspired me, actually), which is proportional to 1/T, where T is the temperature. The idea could be to take , in what is known as quenching. In each epoch we could lower T a little (hence making a bit larger). Physically, doing it too fast is known as quenching, whereas doing it slowly is known as anneling. Some people call it "tempering". As an optimization technique, this idea became widespread after an interesting paper by Fitzpatrick and others in the 80s.
  2. Desing different crossovers: I only did one type, where we choose a point in the genome and cut. But why not crossing over in many different points? That would be possible too.
  3. Diminishing the mutation probability and crossover probability through time: this is similar to the idea 1.

The range of possibilities is enormous. If you are really interested you should take a look at Hillis' paper and at the MIT OCW reference.