Using Multiple Expressions to derive the Pythagorean Theorem


MEPAlg1.gif

In the last article we examined how to use GEP (Genetic Expression Programming) to come up with the Pythagorean formula a2 + b2 = c2 from a set of experimental measurements of a triangle. In this article, we will talk about a simpler GA technique for manipulating symbolic expressions and also use it to derive the Pythagorean equation.

Youll recall that GEP used a string of symbols to represent an equation.  The Pythagorean chromosome could be represented in a postfix string as:

aa*bb*+Q

Figure 1 GEP String

 where a and b are variables,  * is multiplication,  + is addition, and Q is the square root.

The same chromosome expression can be represented in MEP as follows: 

0:  a
1:  * 0,0
2:  b
3:  * 2,2
4:  + 1,3
5:  Q 4

Figure 2 MEP Program

MEP reads a lot like a computer program.  The first symbol on line 0: is what is called a terminator string.  This is always a variable such as a, b, c, or d.  After that, the chromosome is followed by a line thats either a variable or a function.  Binary functions contain 2 numbers.  The first number tells you which programming line  to get your left hand value to perform the operation and the second number tells you what line to get your right hand value to perform the operation.  So the line below

4: +1, 3

basically says,  "Take  the results of what value was produced after executing line 1: and add it to the results of what was produced after executing line 3:".  To better understand, let us write out the results of executing Figure 2 one line at a time: 

0:  a
1:   a * a
2:   b
3:   b * b
4:   (a * a) + (b * b)
5:   Sqrt [ (a * a) + (b * b)]

Figure 3 MEP program broken down into substitutions

Line 1:   says *  0,0  or multiply the value in line 0: with the value in line 0: . The value in Line 0 is a, so this produces (a * a)

Line 3:   says *2,2   or multiply the value in line 2: with the value in line 2:.  The value in  Line 2: is b, so this produces (b * b) 

Line 4: says add the results of Line 1 and Line 3:   or add (a * a)  with (b * b).

Line 5: says take the square root of line 4:  giving our final result Sqrt [(a * a) + (b * b)]

As you may have noticed MEP allows you to express fairly complex equations in much fewer symbols than GEP because you can reuse existing symbol groupings simply by pointing an operation to the group you want to reuse.

The rest of the concepts for utilizing MEP are exactly the same as those in the GEP article.  You generate a population of Genes and use Mutation and Crossover techniques to produce new generations and test the populations fitness.  The fitness function is exactly the same as the previous article on GEP.  Weve added two new crossover techniques in this article:  2 point crossover and uniform crossover (also called recombination).  Two point crossover pulls a piece of the chromosome out of one gene and places it in another.  Its called 2 point because you need two positions in a chromosome to define a piece of it.  Uniform crossover basically takes random positions in a chromosome and swaps the genes in these positions on one chromosome with the genes in the same position with the other chromosome.  These additional crossover techniques help to increase diversity in new generations and bring over desired traits. 

0:  a
1:  * 0,0
2:  b
3:  * 2,2                        crossover
4:  + 1,3
5:
  Q 4

 

0:  b
1:
 / 0,0
2:  -1,2
3:
  * 2,2
4:
  + 1,3
5:  * 4,3

 

0:  b
1:  * 0,0
2:  -1,2
3:  * 2,2                        results
4:  + 1,3
5:  * 4,3

0:  a
1:  * 0,0
2:  b
3:  * 2,2
4:  + 1,3
5:  Q 4

Figure 4 Uniform Crossover Illustrated

Mutation on a gene depends upon the position.  The 0:  line or termination position can only mutate into another symbol.  For example a can mutate into b or visa versa in the 0: position, but not into +.  All other positions can mutate into anything (either an operator or a symbol).  Also, the position that an operation refers to can mutate.  + 0,0 can mutate into + 2,3.

Below are some of the possible mutations for  4: + 1, 3

      + 1, 3    Mutate   a

    + 1, 3   Mutate - 1, 3

     + 1,3   Mutate + 1,4 

Now that you have all the tools needed to understand how to implement MEP, lets take a look at the class design.  Note, it is almost exactly the same as the UML design for GEP. We have simply altered a few methods in the EquationGenome class needed to evaluate the symbol strings for MEP.

MEPAlg2.gif

Figure 5 UML Diagram of MEP Algorithm reversed using WithClass

Most of the hard calculations take place in the EquationGenome, so we will concentrate on this class.  All other classes behave pretty much the same as they did in the GEP article, with the exception of how crossover is chosen by the population.  Currently the program randomly chooses one of the 3 methods mentioned to recombine the gene:  1 point, 2 point, and uniform crossover.

The EquationGenome has a collection of Genes. A Gene is made up of 2 instructions and an operation.  If the Gene is a variable,  the number representing the variable is stored in the operation attribute and instructions are ignored.  For the Pythagorean thereom, as mentioned in the previous article, the following symbols are represented with the following values for the operation field:

      0: a   1: b   2: *   3: /   4: +   5: -    6: Q (square root)

The method for calculating the fitness is shown below:  

Listing 1 - Calculating the Fitness of the Genes against several sample triangles

public float CalculatePythagoreanFitness()
{
int index = 0;
float calc = 0.0f;
float sum = 0.0f;
int count = measure.GetLength(0);
for (int i = 0; i < count; i++)
{
calc = PerformCalculation(measure[i, 0], measure[i, 1]);
float error = 100 - Math.Abs(measure[i,2] -calc);
sum += error;
}
CurrentFitness = sum/(measure.GetLength(0)*100);
if (float.IsNaN(CurrentFitness))
CurrentFitness = 0.01f;
return CurrentFitness;
}

This is the same fitness function that we used in our previous GEP article except we now normalize the fitness value so it falls between 0 and 1. (We didnt have to normalize, it just looks nicer).

The PerformCalculation method takes the sample values and determines a calculation based on the steps carried out in the Genome.  The simple way to implement this algorithm is by using an array.  If you recall, for walking the GEP PostFix calculation, we used a stack collection, but since the order of calculation is very much index-specific, an ArrayList is the best bet for implementing our MEP calculation.    

Listing 2 - Implementing the steps of the MEP chromosome with an array

public float PerformCalculation(float a, float b)
{
int count = 0;
foreach (Gene g in TheArray)
{
if (g.operation < NumSymbols)
{
// the gene is a variable,
// variable identity is contained in the
// operation field
// step contains a or b, assign value
if (g.operation == 0)
alculationArray[count] = a;
else
CalculationArray[count] = b;
}
else
{
// gene is an operation
// use it to put the calculation into an
// array using the referenced line results
// indicated by instruction1 and instruction2
CalculationArray[count] = DoOperation(CalculationArray[g.instruction1], CalculationArray[g.instruction2], g.operation);
}
count++;
}
// end for each Gene
// return last calculation
return CalculationArray[TheArray.Count - 1];
}

The DoOperation method in the listing above simply interprets the operation and performs the numerical binary or unary operation and returns the result.  Note:  We purposely check for some potentially bad situations such as divide by zero or the square root of a negative number and assign them numbers:  

Listing 3 - Performing the operation of an individual gene

public float DoOperation(float a, float b, int operation)
{
float result = 0.0f;
switch(operation)
{
case 2: // *
result = a * b;
break;
case 3: // /
if (b == 0.0f)
{
// divide-by-0, just set b to a small number
b = .0001f;
}
result = a/b;
break;
case 4: // +
result = a + b;
break;
case 5: // -
result = a - b;
break;
case 6: // Q
if (a > 0) // make sure its greater than 0
result = (float)Math.Sqrt(a);
break;
default:
// +
break;
}
// end switch
return result;
}

We used the following sample set of triangles to run against our MEP program.  The first two numbers of each value group represents the sides of the triangle (a and b), and the third number is the hypotenuse in which we are measuring our MEP result against to determine the fitness of the Gene.  Note that we picked negative values for the sides of one of the triangles.  I found that the more diversity in your sample set, the better chance you have of converging on the correct solution.

Listing 4 - twelve sample triangles stored in a multidimensional rectangular array

float[,] measure = new float[12,3]{{ -1, -3, 3.162277f}, {30,40,50}, {40, 30, 50}, {1, 1, 1.4142f}, {1, 2, 2.23607f}, {3, 1, 3.162277f}, {6, 13, 14.31782f}, {4,3,5}, {1,3, 3.16228f}, {2, 1, 2.23607f}, {4, 5, 6.403f}, {5,4, 6.403f}};

Below is the solution set after 52 generations with an initial population of 2000 genomes.  The mutation rate was about .33 and the chromosome length was 15.

MEPAlg3.gif

Figure 6 Pythagorean Theorem derived by the MEP Algorithm

There is a ToString function in the EquationGenome shown below.  If you uncomment the FormStepsString method, you can look at the steps that produce the Equation:

public override string ToString()
{
string strResult = "";
int index = 0;
strResult += " --> " +
this.FormStepsString();
strResult += " --> " +
this.FormEquationString();
strResult += " --> " + CurrentFitness.ToString();
return strResult;
}

Below are the MEP steps generated for one of the 15-Gene chromosomes that produces the Pythagorean equation.  Note that its not as clean as our initial representation of this solution, nor is everything in the Genome utilized, but it gets the job done.

MEPAlg4.gif

Figure 7 MEP Genome that produces the solution  

Conclusion 

MEP is another GA Technique that can be used for coming up with general models from empirical data.  Because you can choose the complexity of the functions, you can use it to model anything from simple mathematical curves to complex physical systems.  You can have fun experimenting with MEP by changing the empirical data supplied by the measure array and by varying the functions and symbols in the Genome.  Also you'll want to vary population size, gene length and crossover and mutation frequencies to come up with different convergences.  Most of the GA parameters are located in the population class.   The data and symbols are located in the EquationGenome class.  I suspect people will make some cool discoveries using this technique and the beauty is that you can just let the computer and .NET do all the work! 

References 

Multi Expression Programming, Mihai Oltean.
Gene Expression Programming, Candida Ferreira
Implementing a Genetic Algorithm in C# and .NET, Mike Gold
GEP Algorithm in C# and .NET,  Mike Gold