Structure-based drug design - How we can use a Genetic Algorithm to “build” a molecule that binds to a target

Natalia Pattarone
Azumo Intelligent App Developer
11 min readJan 14, 2019

--

Drug Design against AmpC β-Lactamase by ScienceDirect

I have to admit, I got really excited when I read the title. Unfortunately, we are not going to develop a real pharmaceutical drug of any kind (I know, so sad!). However, as the aim of this article is to give a primary approach on how this process can be achieved by using an awesome genetic algorithm (GA), I’m certainly sure you guys are going to enjoy it anyway :)

For those of you who are already anxious to peak at the code, I’ll leave the repository here for your enjoyment.

Drug design involves several and complex steps in order to be approved and become human suitable. Here is an excellent article that explains some of he process and highlights the associated complexity.

Generally, the rational development of a new drug follows a three-step process. Initially, a target, such as a receptor or enzyme, has to be identified relating to a particular disease state. This target then has to be fully characterized and, finally, a molecule must be designed that binds to it.

Quoting the article, basically (and in a rudimentary way of speaking) we have to find something we want to target in order to create another something that binds to it. Easy peasy, right? Well… hold that thought for a moment.

For those who don’t have a broad background in biology, let me provide some of the axioms that apply here:

  1. All molecules in our body interact by chemical affinity. That is, positive and negative attractions (basically).
  2. Every protein has a conformation (a “shape”) given by its amino acid chain, protein’s building blocks. This shape is the most stable in terms of energy, which means that folds itself naturally, like a spring. If you want to stretch it out to make it straight, you apply strength, so… that’s not natural, right? We have to use some kind of force. Same thing occur with proteins.
  3. Proteins have several conformations. For this extremely simplified tutorial, you only have to know that there is a conformation at which the protein is stable because of the associated constitutive chemistry. In other words, if you change somehow the building blocks, it will change its shape/conformation.
  4. Some proteins help to achieve something faster, for instance, some proteins create changes, within our bodies, that would not happen as quickly (or at all) without them. These are called enzymes.
  5. Similarly as we do with DNA nitrogen bases (A-adenine, C-cytosine, T-thymine, G-guanine), important DNA building blocks, we can use letters to refer to the protein ones as well, their amino acids. There are 21.
The 21 amino acids and their letters

5. When targeting an enzyme, for instance, we have to think of their key residues. These are sites (chemical ones) where another molecule can be bound with ease, therefore, our target. You may think that every protein exposes binding sites somehow (after all, they are made by amino acids) but this is not exactly true. Think of it as a puzzle: you have several pieces, they look alike, but not fit perfectly in each gap, right? Same thing happens here, not every amino acid has the best “fit” on the binding site. Perhaps there is a small “chemical attraction” but it would not be the fittest ;)

6. Genetic algorithms emulate what survival of the fittest from the natural selection theory establishes, which is the best breed you can get from an offspring (if we talk in generations) in terms of better chances to adapt to their environment. Those who survive.

What Are We Going To Do…

In order to simplify complex concepts associated with organic chemistry, we are just going to work with a plain target protein composed by a particular amino acid chain, defining 2 binding sites (on which our protein will be activated if the drug target this place). It would look like this:

Target Protein and Drug

The target protein holds several kinds of amino acids (like the list we talked before): polar, non-polar, positively charged and negatively charged. In our example, the binding sites are E and K, negatively and positively charged amino acids, respectively. So, quickly you can realize that we probably need opposite charges to bind to it, right? … Yes, you’re completely right! So in our possible drug we defined what would be our first rule:

RULE 1: In position A we need positively charged amino acids, and in position B, negatively ones.

Perfect! Similarly to this situation, as we have some amino acids between positions A and B (in this case V, C, P, L, which were selected on purpose… you’ll soon see why) we also need to define a rule in order to target them in the most efficient manner as well.

There are some amino acids that have no charge at all (in other words, they wont affect the protein structure, explained highly simple), these are the so called non-polar amino acids. For you to have an idea, their atoms charges are cancelled each other resulting in a non reactive molecule, and by not holding polarity they lack the ability to attract/repel charges. Keeping this in mind, it gets pretty clear where I’m going with this: we need to use those between our positions A and B to avoid modified our protein structure as we only want to target the strategic positions defined by the binding sites; having other molecules next to them pulling the protein shape out, it would affect the perfect fit we could achieve if they are not present. So, our second rule:

RULE 2: positions between A and B should be non-polar amino acids.

Our “Drug” should contain non-polar amino acids

And that is it! Let’s move to the next step, understanding the basics of Genetic Algorithms.

What are Genetic Algorithms?

Inspired by Charles Darwin’s natural evolution theory, genetic algorithms emulate the natural selection of the fittest. In other words, those specimens who fit better within their environment (overcoming its conditions) will survive, reproduce and leave offspring holding their parents characteristics, their genes.

One can deduce this process will continue indefinitely, generating new and better specimens as the fittest survivors reproduce, producing new offspring.

So yes, it’s a marvelous approach to imitate and use it to solve problems. Five phases are considered:

  1. Initial population
  2. Fitness of each specimen
  3. Selection
  4. Crossover
  5. Mutation

1. Initial Population

We generate a population of specimens, each of them could be a possible solution to our problem.

These specimens have genes, which together form their chromosome. Moving to the code, we can say that a specimen holds a collection of genes which are commonly represented by a collection of values (whichever you need to take in account to solve your particular problem), in our case, amino acid letters (remember? our V, C, P, L, D, S, etc).

2. Fitness of each specimen

This is, if not the most important, the key aspect of your entire algorithm. Let me put it this way: the poorer the fitness calculation, the weaker/useless your outcome (and you probably end up with a worthless mediocre genetic algorithm).

We have 2 main rules defining our best fit, right? So this 2 rules are the ones eventually will tell us how fit our specimen is. So we based on this value the selection of the fittest parents to produce offspring.

3. Selection

Here is where we pick the parents specimens. Holding the higher fitness value of the entire population, these will generate offspring and let them pass their genes to the next generation.

Our population holds several specimen with a range of genes. Those having the best fitness have a higher chance to be selected.

4. Crossover

This process occurs normally in each cell of our bodies. Chromosomes (set of genes) in each specimen cross each other in order to have characteristics of each of them. Same as with any baby: holds mommy’s eyes and daddy’s hair, for instance.

In the algorithm we choose a random point to crossover our specimen’s genes.

4. Mutation

Mutations comes naturally in every offspring: can or can’t happen and it’s probability rate can vary also (could be very high or very low, or in between). Occurs to maintain diversity within the population and prevent premature convergence.

Translating into the code, there are several approaches that can be used, but we are focusing ours by randomly changing genes.

Genes after mutation

Stepping into the code

Hands on! I’m going to guide you step by step into the code so you can follow how these concepts are putting together to aim our problem solution: drug discovery.

First, we work with an initial population of specimen; we create them by using this function:

How we create specimens

Our specimen holds some useful information to be used in its creation. Here we have a maxGeneLength variable, indicating how big can be our genome (how many genes it will have). In other words the maximum number of characteristics our specimen can have. Like in real life, we can lack some of the genes, and still be super functional humans. So can our specimen, but that doesn’t mean that in our GA world it’s okay, we have to apply our rules to select the fittest.

Now, we see a geneLength variable being calculated randomly. This would imply for our specimen to have a variability of genetic information, the genes array length can vary from 1 to 11. We are going to save on the first position of our genes array a number value indicating how many genes associated to solve our problem we have.

The first gene indicates the length of genetic information available

Why do we this? If you noticed from the image from our possible drug scheme, you’ll see that our fittest drug protein only needs a minimum of 6 amino acid chain length. So, prior to applying the rules we defined, we have to somehow select those specimen that are at least 6 genes in length, right? … But, “why don’t we just put a simple if statement and filter those by evaluating their array length?”, you may say. Well, what’s evolutionary about that? If you just filter by their length (and yes, it’s of course what we do but in a slightly different manner) there is no way we can preserve that information on our specimens, on their genetic code. But by including them as part their genome, we have a way to use it in our fitness calculation.

Next, we have a loop function (a function that allows us to iterate through a collection) in which we assign randomly amino acid letters to each position of the genes array from our UNIVERSE_OF_GENES variable, that holds all possible amino acids we can use.

All the amino acid letters are here

Finally, we calculate their fitness and save it in a variable; let’s see how this is done by inspecting our next piece of code:

In this function we apply all we have been talking before: our rules! We have a maximum fitness given by the summation of the highest calculated scores resulting from each rule (will see them in detail soon), with that value we can then know when to stop our universe from keep growing and mating and obtain the generation in which the fittest has risen.

First rule does what we clarify above about gene length: as we need this specimen character to be part of our evolutionary selection, we have to track it somehow inside our genetic code. In our case, first gene contains gene length, and yes, if matches our minimal ideal (which is 6) fitness will increase in one; if not it‘ll instead reflect a slight increase.

Second rule responds to: In position A we need positively charged amino acids, and in position B, negatively ones. So inside our primal fitness function we have a looping function, that sends the current gene (geneA) and the gene at current position + offset (geneB). While we go through each of them we can detect if we have these positions fill with the values we need in order to score a fitness as high as possible: if we match both positions, we add 2 points of fitness, if only one, just 1 point, if none, 0.

Finally, our third rule corresponds to: positions between A and B should be non-polar amino acids. Pretty easy, right? If in the piece of genetic code we passed through our function (which will be the same genes between geneA and geneB from previous rule) we find non-polar genes, we sum 1 point for each finding. The maximum score coming out from this rule will be 4, less than that, well… is not the best fit.

We have more core functions to review in detail, but it’ll be easier if we approach them through the main code explanatory.

We start out with a 10 specimen population, and then sort them by the fittest. That is, after generating each specimen, we calculate their fitness and then order the entire population by placing the fittest first.

As we said several times, we’ll stop our evolutionary process until we find the fittest generation, given by the value of 7 (the highest fitness value we can have once all rules are applied ,in other words, all of them comply).

By increasing generation number, we take the first and second fittest specimens from our population (remember we sorted them? :) … this is the reason why) to mate them and generate offspring.

Our offspring function takes these 2 specimens genetic code and perform crossover over them obtaining a new set of genes for each of them. Next, we mutate them and calculate their fitness, returning a 2-lengthen array containing the generated offspring.

Let’s review these rapidly:

  • Mutation occurs in a random place (index inside the array of genes) and consists in changing the amino acid located at that place by a different one chosen also randomly.
  • Crossover takes place from one particular position (index in the array) to the end of the genome (just to simplify things, it’s a highly rudimentary way of performing the process). So, picking a random index we swap specimen 1 genes with the specimen 2 genes.

Returning to our main code explanation, once the offspring is obtained we took the 2 least fittest individuals from our population (the last 2 positions in our fittest-first-sorted array) and replace them with the new offspring generated from the fittest specimens from this generation. By sorting the new population out again we obtain the new fittest positioned specimens. We check if the highest fitness is the fittest, and if it is, we finish our process by printing the generation count. AND THAT’S IT!! :)

Here you have an example of the output, finding the fittest at generation 45.

Console Output

NOTE: For those who hold Biology/Chemistry knowledge, I’m aware that this is not correct in terms of the complexity involved on chemical bonds and all their different kinds. So please, try to understand what I’m trying to aim here, and of course you’re all invited to propose better/more sophisticated approaches :)

Thanks all for taking the time to read my article, I hope you enjoyed reading it as much as I did writing it. I focus on building intelligent applications at Azumo and hopefully my love of biology and programming shows too. Leave some claps if you like :D so others can check it out too. Productive comments/suggestions/ideas are always welcome!

--

--

MSc in Intelligent and Interactive Systems | Systems Engineer, wanting to leave something meaningful to this wonderful world