Using PLINQ to Improve Learning Algorithms


Introduction


I recently just finished reading an article by Igor Ostrovsky in MSDN magazine titled Data-Parallel Patterns and PLINQ. The Parallel Linq library is an exciting addition to Visual Studio 2010, making multi-threaded programming less painful, especially when operating over a list of items in parallel. After reading the article I decided to try the technology out on one of my old articles Using a Genetic Algorithm to do Consultant Scheduling in C#.


This article uses the PBIL algorithm (Population based Learning algorithm)  to calculate a fair schedule for a group of consultants on a project over a three month period. The algorithm takes a while to converge so it was a good candidate for the test.


PLINQ


PLINQ is a set of LINQ extensions that can be used to operate on lists in parallel through methods such as the LINQ Select statement.  You can just place the extension AsParallel before the Select call inside your LINQ statement and the software will attempt to take advantage of multiple threads in your system to perform the calculation in the LINQ statement concurrently when acting upon a collection.   If you want to get load balancing you can use the Partitioner.Create method (in the System.Collections.Concurrent namespace) and run the Select on the resulting collection returned from the Create.  There are a host of other tricks and techniques you can take advantage of in PLINQ.  I recommend reading the article in MSDN December 2009 by Igor.


The Experiment


In order to do our comparison, first we need a timer to measure the time it takes per 1000 generations.  I've inserted  into our PBIL Calculator the method in Listing 1 below.  This method will measure the interval of time it took to perform the calculations on 1000 generations of our algorithm. 


Listing 1 - Method for Calculating Time Interval Per 1000 Generations


static private  TimeSpan CalculateTimePerThousand()

{

            _endTime = DateTime.Now;

            TimeSpan interval = _endTime - _startTime;

            _startTime = _endTime;

            return interval;

}


Arguably the most expensive part of calculating the PBIL algorithm is generating and calculating the fitness for every genome in the sample.  We'll use this part of our program for converting to LINQ and PLINQ and see if we get a significant performance improvement on a quad core machine.  Listing 2 shows the LINQ statement that represents are expensive function 


Listing 2 - PBIL Genome Generator and Fitness Calculation over an array of Genomes using LINQ


_genomes = _genomes.Select(g => { g = new BinaryGene(); 
g.Generate(_probabilities); g.CalculateFitness(); return g; }).ToArray();

We used a lambda expression to group the creation and fitness calculation all under the select statement.  Upon executing the ToArray method, this will produce a result array with a list of 500 genomes each containing a calculated fitness.  The time this took for 1000 generations was approximately 43 seconds as shown in figure 1.


fig0.png

Figure 1 - Number of seconds between the 1000th and 2000th generation.

 

When we look at the CPU usage occurring during the intensive calculations, we see that it uses only 25% of the total CPU power on the quad machine or the equivalent utilization of approximately one core processor as shown in figure 2.


fig1.png

Figure 2 - CPU usage during a regular LINQ call in the PBIL Program


Now let's switch to a PLINQ statement and see what happens.  The PLINQ statement almost looks exactly like the LINQ statement, except we take advantage of the concurrent collection library and the AsParallel extension to cause our _genomes array to be processed in parallel and load balanced across all 4 processors.  In listing 3,  the Partitioner class allows us to partition the processing of the _genomes array across all 4 processors equally by wrapping the array in something called an OrderablePartioner class. The second parameter in Create set to true, tells the Create command to create an OrderablePartioner to load balance the processing.  The use of the OrderablePartioner is mostly transparent to us, we can still treat the result of the Create as we do any other Enumerable in a LINQ statement, knowing in the back of our minds that it will try to perform the PBIL operations contained in our Select statement on each element of the enumerable, in parallel. 


Listing 3 - PBIL Genome Generator and Fitness Calculation over an array of Genomes using PLINQ


// calculate fitness using PLINQ

_genomes = Partitioner.Create(_genomes, true).AsParallel().

        Select(g => { g = new BinaryGene(); g.Generate(_probabilities);
g.CalculateFitness(); return g; }).ToArray();


The results of running the PLINQ version of our PBIL code is shown in figure 3 below.  1000 generations only took 15 seconds to process on the 4 processor machine.  We notice a 3X improvement with PLINQ over our regular LINQ statement!


fig2.png


Figure 3 - Number of seconds between the 2000th and 3000th generation using PLINQ


The reason we get such a large improvement becomes clear when we examine CPU in figure 4.  The CPU is heavily loaded on all 4 core processors.  Our program is now load balancing and utilizing all 4 processors to 80% of its fullest extent.  Had there been more processors used in this experiment, we would have surely seen an even larger improvement.


fig3.png 


Figure 4 - CPU Utilization when using PLINQ on the intensive algorithm.


Conclusion


In the past, multithreading has been daunting and painful to implement.  PLINQ heads us in the correct direction of parallel processing our data without the headache of maintaining our threads.  You still need a solid understanding of threading to use LINQ.  For example, when I first started this experiment, I noticed when I converted to the PLINQ statement my algorithm was no longer working.  This was because the random generator I was using to generate new binary genomes was declared static.  As a result, I was getting a lot of duplication in the genes.  Once I made the generator a non-static member of each genome, the problem went away because it was no longer shared in parallel between genomes.   Therefore you still need to keep rules of thread safety in mind, because PLINQ cannot take care of this for you.  In any case, I would certainly concur that PLINQ and other advances in the .NET 4.0 framework are headed in the right direction to help us do multi-threading in C# and .NET.


Similar Articles