Parallelism in .NET – Part 11, Divide and Conquer via Parallel.Invoke
Many algorithms are easily written to work via recursion. For example, most data-oriented tasks where a tree of data must be processed are much more easily handled by starting at the root, and recursively “walking” the tree. Some algorithms work this way on flat data structures, such as arrays, as well. This is a form of divide and conquer: an algorithm design which is based around breaking up a set of work recursively, “dividing” the total work in each recursive step, and “conquering” the work when the remaining work is small enough to be solved easily.
Recursive algorithms, especially ones based on a form of divide and conquer, are often very good candidates for parallelization.
This is apparent from a common sense standpoint. Since we’re dividing up the total work in the algorithm, we have an obvious, built-in partitioning scheme. Once partitioned, the data can be worked upon independently, so there is good, clean isolation of data.
Implementing this type of algorithm is fairly simple. The Parallel class in .NET 4 includes a method suited for this type of operation: Parallel.Invoke. This method takes any number of Action delegates and executes them all in parallel. The method returns when every delegate has completed:
Parallel.Invoke(
    () => { Console.WriteLine("Action 1 executing in thread {0}", Thread.CurrentThread.ManagedThreadId); },
    () => { Console.WriteLine("Action 2 executing in thread {0}", Thread.CurrentThread.ManagedThreadId); },
    () => { Console.WriteLine("Action 3 executing in thread {0}", Thread.CurrentThread.ManagedThreadId); }
);
Running this simple example demonstrates the ease of using this method. For example, on my system, I get three separate thread IDs when running the above code. By allowing any number of delegates to be executed directly, concurrently, the Parallel.Invoke method provides us an easy way to parallelize any algorithm based on divide and conquer. We can divide our work in each step, and execute each task in parallel, recursively.
For example, suppose we wanted to implement our own quicksort routine. The quicksort algorithm can be designed based on divide and conquer. In each iteration, we pick a pivot point, and use that to partition the total array. We swap the elements around the pivot, then recursively sort the lists on each side of the pivot.
For example, let’s look at this simple, sequential implementation of quicksort:
public static void QuickSort<T>(T[] array) where T : IComparable<T>
{
    QuickSortInternal(array, 0, array.Length - 1);
}

private static void QuickSortInternal<T>(T[] array, int left, int right)
    where T : IComparable<T>
{
    if (left >= right)
    {
        return;
    }

    // Choose the middle element as the pivot and move it to the front.
    SwapElements(array, left, (left + right) / 2);

    int last = left;
    for (int current = left + 1; current <= right; ++current)
    {
        // Move elements smaller than the pivot into the left partition.
        if (array[current].CompareTo(array[left]) < 0)
        {
            ++last;
            SwapElements(array, last, current);
        }
    }

    // Put the pivot back between the two partitions.
    SwapElements(array, left, last);

    // Recursively sort each partition.
    QuickSortInternal(array, left, last - 1);
    QuickSortInternal(array, last + 1, right);
}

static void SwapElements<T>(T[] array, int i, int j)
{
    T temp = array[i];
    array[i] = array[j];
    array[j] = temp;
}
Here, we implement the quicksort algorithm in a very common divide and conquer style. Running this against the built-in Array.Sort routine shows that we get the exact same answers (although the framework’s sort routine is slightly faster). On my system, for example, the framework’s sort can sort ten million random doubles in about 7.3s, while this implementation takes about 9.3s on average.
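A simple harness along these lines is enough to reproduce that comparison (a minimal sketch, assuming the QuickSort<T> method above is in scope; the array size matches the text, but the seed and output format are illustrative assumptions):

using System;
using System.Diagnostics;

static class SortTimings
{
    static void Main()
    {
        // Ten million random doubles; the seed is arbitrary.
        var random = new Random(12345);
        var data = new double[10000000];
        for (int i = 0; i < data.Length; ++i)
        {
            data[i] = random.NextDouble();
        }

        // Time the framework's sort on one copy of the data...
        double[] frameworkCopy = (double[])data.Clone();
        Stopwatch stopwatch = Stopwatch.StartNew();
        Array.Sort(frameworkCopy);
        Console.WriteLine("Array.Sort: {0:F1}s", stopwatch.Elapsed.TotalSeconds);

        // ...and our implementation on another, then verify they agree.
        double[] ourCopy = (double[])data.Clone();
        stopwatch.Restart();
        QuickSort(ourCopy);
        Console.WriteLine("QuickSort:  {0:F1}s", stopwatch.Elapsed.TotalSeconds);

        for (int i = 0; i < data.Length; ++i)
        {
            Debug.Assert(frameworkCopy[i] == ourCopy[i]);
        }
    }
}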
Looking at this routine, though, there is a clear opportunity to parallelize. At the end of QuickSortInternal, we recursively call into QuickSortInternal with each partition of the array after the pivot is chosen. This can be rewritten to use Parallel.Invoke by simply changing it to:
// Code above is unchanged...
SwapElements(array, left, last);
Parallel.Invoke(
() => QuickSortInternal(array, left, last - 1),
() => QuickSortInternal(array, last + 1, right)
);
}
This routine will now run in parallel. When it executes, we see CPU usage spike across all cores.
However, there is a significant problem here: by parallelizing this routine, we took its execution time from 9.3s to approximately 14 seconds! We’re using more resources, as the CPU usage shows, but the result is a dramatic slowdown in processing time.
This occurs because parallelization adds overhead. Each time we split the array, we spawn two new tasks! This creates far, far more tasks than our cores can operate on at one time. In effect, we’re “over-parallelizing” this routine. This is a common problem when working with divide and conquer algorithms, and it leads to an important observation:
When parallelizing a recursive routine, take special care not to add more tasks than necessary to fully utilize your system.
This can be done with a few different approaches in this case. Typically, the way to handle this is to stop parallelizing the routine at a certain point, and revert to the serial approach. Since the first few levels of recursion are still parallelized, the deeper, serial recursive calls run inside tasks that are executing concurrently, so they can still take full advantage of the machine. This also dramatically reduces the overhead added by parallelizing, since we only pay that overhead for the first few recursive calls.
There are two basic approaches we can take here. The first approach is to look at the total work size, and if it’s smaller than a specific threshold, revert to our serial implementation. In this case, we could just check right - left, and if it’s under a threshold, call the methods directly instead of using Parallel.Invoke, as sketched below.
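A minimal sketch of that first approach might look like this (the specific threshold value is an assumption; the right cutoff depends on the data and should be found by profiling):

// Below this partition size, task overhead outweighs the benefit of
// parallelism, so fall back to the serial recursion. The value 4096
// is a placeholder to be tuned by profiling.
private const int SequentialThreshold = 4096;

// Code above is unchanged...
SwapElements(array, left, last);

if (right - left < SequentialThreshold)
{
    QuickSortInternal(array, left, last - 1);
    QuickSortInternal(array, last + 1, right);
}
else
{
    Parallel.Invoke(
        () => QuickSortInternal(array, left, last - 1),
        () => QuickSortInternal(array, last + 1, right));
}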
The second approach is to track how “deep” in the “tree” we currently are, and stop parallelizing once we pass a certain number of levels. This approach is more general-purpose, since it works on routines which walk trees as well as routines working off of a single array, but it may not work as well if a poor partitioning strategy is chosen or the tree is not balanced evenly.
This can be written very easily. If we pass a maxDepth parameter into our internal routine, we can restrict the number of times we parallelize by changing the recursive calls to:
// Code above is unchanged...
SwapElements(array, left, last);

if (maxDepth < 1)
{
    QuickSortInternal(array, left, last - 1, maxDepth);
    QuickSortInternal(array, last + 1, right, maxDepth);
}
else
{
    --maxDepth;
    Parallel.Invoke(
        () => QuickSortInternal(array, left, last - 1, maxDepth),
        () => QuickSortInternal(array, last + 1, right, maxDepth));
}
We no longer allow this to parallelize indefinitely, only to a specific depth, after which we revert to a serial implementation. By starting the routine with a maxDepth equal to Environment.ProcessorCount, we restrict the total number of parallel operations significantly, while still providing adequate work for each processing core.
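The public entry point just needs to seed that depth; a minimal sketch of the wrapper, assuming the signatures shown above:

public static void QuickSort<T>(T[] array) where T : IComparable<T>
{
    // Seed the recursion depth from the core count; each parallelized
    // level doubles the number of concurrent tasks.
    QuickSortInternal(array, 0, array.Length - 1, Environment.ProcessorCount);
}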
With this final change, my timings are much better. On average, I get the following timings:
- Framework via Array.Sort: 7.3 seconds
- Serial Quicksort Implementation: 9.3 seconds
- Naive Parallel Implementation: 14 seconds
- Parallel Implementation Restricting Depth: 4.7 seconds
Finally, we are now faster than the framework’s Array.Sort implementation.
Great article, very clear and to the point.
Reed,
While Environment.ProcessorCount does a good job at giving you the # of processors in the system, it will *not* give you the number of processors allocated to your process (which may change while the process is running!). I use this little chunk of code to give me the true number of cores my process can use:
//
// Copyright (c) 2008 All Rights Reserved
//
// Jesse C. Slicer
// jslicer@spamcop.net
// 2008-08-05
// Part of the Aesop.Diagnostics.dll assembly.
namespace Aesop.Diagnostics
{
    #region Using Directives

    // System namespaces
    using System;
    using System.Diagnostics;

    #endregion

    #region Class Definition : ProcessInfo

    /// <summary>
    /// Provides a single property which gets the number of processor threads
    /// available to the currently executing process.
    /// </summary>
    internal static class ProcessInfo
    {
        #region Internal Static Properties

        /// <summary>
        /// Gets the number of processors.
        /// </summary>
        /// <value>The number of processors.</value>
        internal static uint NumberOfProcessorThreads
        {
            get
            {
                using (Process currentProcess = Process.GetCurrentProcess())
                {
                    uint result;
                    if (currentProcess == null)
                    {
                        result = (uint)Environment.ProcessorCount;
                    }
                    else
                    {
                        // Count the set bits in the affinity mask; each set
                        // bit is a processor this process is allowed to use.
                        // (ProcessorAffinity is an IntPtr, so convert via
                        // long before narrowing to uint.)
                        const uint BitsPerByte = 8;
                        uint loop = BitsPerByte * sizeof(uint);
                        uint processAffinityMask =
                            (uint)(long)currentProcess.ProcessorAffinity;
                        result = 0;
                        while (loop != 0)
                        {
                            --loop;
                            result += processAffinityMask & 1;
                            processAffinityMask >>= 1;
                        }
                    }

                    return (result == 0) ? 1 : result;
                }
            }
        }

        #endregion
    }

    #endregion
}
Jesse:
This will be the same as Environment.ProcessorCount, provided you didn’t explicitly change your processor affinity masks for your application. That being said, in general, I usually recommend leaving processor affinity alone, especially on modern operating systems (Vista & esp. W7).
The Windows process scheduler does a very good job of handling this…
To be sure, *I* wouldn’t be changing my program’s affinity masks, but there’s no guarantee that the .NET runtime (or even the operating system) won’t do that to the running program if it feels the need to.
Jesse:
The operating system will not change the affinity masks, ever. These exist so that you can explicitly tell the OS (which is now free to ignore this anyways) which processors to use. The only place where this would happen without your knowledge would be if a library set this, or if the user, as an administrator, overrides the affinity masks via the task manager or a similar tool (neither of which is likely). Personally, I think this is overkill unless you are explicitly messing with the affinity masks (which I don’t recommend doing…)
I’m trying to understand the maxDepth setting, but it looks like setting maxDepth=Cpus will result in 2^Cpus threads, as opposed to CPUs threads? Is this true?
Given the potential lopsided-pivot problem of qsort(), is it sometimes better to choose more parallelism than CPUs “just in case” some of the problem sub-parts are much faster than the others?
John,
It actually does 2^cpus, as you saw. I’d have two comments on this, however. First, this was really just written as an example; I’m sure that, to get this completely optimal, you’d have to find the right balance here. I suspect that this does run faster with the extra work (I picked those numbers after profiling) for reasons such as non-optimal pivots, uneven workloads in comparisons, and other similar issues.
That being said, it is often better to have more work items than processors. Depending on the work involved, even CPU intensive work often runs faster if given a few extra threads. It’s all about profiling your algorithm, and finding the right balance.
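To make the arithmetic concrete: each parallelized level doubles the task count, so a depth of d produces up to 2^d leaf tasks. If you wanted roughly one leaf task per core instead, you could seed the depth with a base-2 logarithm; this is just a sketch of that alternative, not the setting profiled above:

// Depth d yields up to 2^d leaf tasks, so using log2(cores) as the
// starting depth gives roughly one leaf task per core. The version
// timed in the post deliberately over-partitions by passing
// Environment.ProcessorCount directly.
int maxDepth = (int)Math.Ceiling(Math.Log(Environment.ProcessorCount, 2));
QuickSortInternal(array, 0, array.Length - 1, maxDepth);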
It is good for small data, but on huge data the stack will fill up and execution will stop.
Would the above code require a lock() around these two places? Otherwise, could there be a concurrency bug?
1) if (maxDepth < 1)
2) --maxDepth
It wouldn’t in this case, since maxDepth is effectively a local: it’s passed in as a parameter, so its value is local to the executing thread.
If maxDepth was a field within a class instance or stored externally, you would need to synchronize access to it somehow.
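For example, if the depth were a static field shared by every task rather than a parameter, a minimal sketch of the synchronization (hypothetical code, not from the post) could use Interlocked instead of a full lock():

// Hypothetical variant: the remaining parallel depth is shared state.
private static int sharedMaxDepth = Environment.ProcessorCount;

// Interlocked.Decrement makes the read-modify-write atomic; a plain
// --sharedMaxDepth could race between the two parallel branches.
if (Interlocked.Decrement(ref sharedMaxDepth) >= 0)
{
    Parallel.Invoke(
        () => QuickSortInternal(array, left, last - 1),
        () => QuickSortInternal(array, last + 1, right));
}
else
{
    QuickSortInternal(array, left, last - 1);
    QuickSortInternal(array, last + 1, right);
}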