Parallelism in .NET – Part 8, PLINQ’s ForAll Method
Parallel LINQ extends LINQ to Objects, and is typically very similar. However, as I previously discussed, there are some differences. Although the standard way to handle simple Data Parallelism is via Parallel.ForEach, it’s possible to do the same thing via PLINQ.
PLINQ adds a new method, unavailable in standard LINQ, which provides functionality of a very different kind…
LINQ is designed to provide a much simpler way of handling querying, including filtering, ordering, grouping, and many other operations. Reading the description of LINQ to Objects on MSDN, it becomes clear that the thinking behind LINQ centers on the retrieval of data. LINQ works by adding a functional programming style on top of .NET, allowing us to express filters in terms of predicate functions, for example.
PLINQ is, in general, very similar. Typically, when using PLINQ, we write declarative statements to filter a dataset or perform an aggregation. However, PLINQ adds one new method, which serves a very different purpose: ForAll.
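For contrast, a typical PLINQ query reads just like its LINQ to Objects counterpart, only with AsParallel() added. A minimal sketch (the Order type and its Total property are hypothetical, used purely for illustration):

// Filter and aggregate a hypothetical collection of orders in parallel
decimal largeOrderTotal = orders
    .AsParallel()
    .Where( o => o.Total > 1000m )
    .Sum( o => o.Total );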
The ForAll method is defined on ParallelEnumerable, and operates on any ParallelQuery<T>. Unlike the sequence operators in LINQ and PLINQ, ForAll is intended to cause side effects. It does not filter a collection; rather, it invokes an action on each element of the collection.
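Its shape is roughly the following, and using it is as simple as handing it an action (simplified sketch):

// Simplified shape of the method, as defined on ParallelEnumerable:
// public static void ForAll<TSource>(this ParallelQuery<TSource> source, Action<TSource> action)

// Invoke an action on every element of a parallel query
collection.AsParallel().ForAll( item => Console.WriteLine(item) );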
At first glance, this seems like a bad idea. For example, Eric Lippert clearly explained two philosophical objections to providing an IEnumerable<T>.ForEach extension method, one of which still applies when parallelized. The sole purpose of this method is to cause side effects, and as such, I agree that the ForAll method “violates the functional programming principles that all the other sequence operators are based upon”, in exactly the same manner an IEnumerable<T>.ForEach extension method would violate these principles. Eric Lippert’s second reason for disliking a ForEach extension method does not necessarily apply to ForAll – replacing ForAll with a call to Parallel.ForEach has the same closure semantics, so there is no loss there.
Although ForAll may have philosophical issues, there is a pragmatic reason to include this method. Without ForAll, we would take a fairly serious performance hit in many situations. Often, we need to perform some filtering or grouping, then perform an action using the results of our filter.
Using a standard foreach statement to perform our action would avoid this philosophical issue:
// Filter our collection
var filteredItems = collection.AsParallel().Where( i => i.SomePredicate() );

// Now perform an action
foreach (var item in filteredItems)
{
    // These will now run serially
    item.DoSomething();
}
This would cause a loss in performance: although the filter itself still runs in parallel, all of our actions now run serially.
We could easily use a Parallel.ForEach instead, which adds parallelism to the actions:
// Filter our collection
var filteredItems = collection.AsParallel().Where( i => i.SomePredicate() );

// Now perform an action once the filter completes
Parallel.ForEach(filteredItems, item =>
{
    // These will now run in parallel
    item.DoSomething();
});
This is a noticeable improvement, since both our filtering and our actions run parallelized. However, there is still a large bottleneck in place here.
The problem lies with my comment “perform an action once the filter completes”. Here, we’re parallelizing the filter, then collecting all of the results, blocking until the filter completes. Only once the filtering of every element has completed do we repartition the results, reschedule them onto multiple threads, and perform the action on each element. By splitting this into two separate statements, we potentially double our parallelization overhead, since we force the work to be partitioned and scheduled twice.
This is where the pragmatism comes into play. By violating our functional principles, we gain the ability to avoid the overhead and cost of rescheduling the work:
// Perform an action on the results of our filter
collection
.AsParallel()
.Where( i => i.SomePredicate() )
.ForAll( i => i.DoSomething() );
The ability to avoid the scheduling overhead is a compelling reason to use ForAll. This really goes back to one of the key points I discussed in data parallelism: Partition your problem in a way to place the most work possible into each task. Here, this means leaving the statement attached to the expression, even though it causes side effects and is not standard usage for LINQ.
This leads to my one guideline for using ForAll:
The ForAll extension method should only be used to process the results of a parallel query, as returned by a PLINQ expression.
Any other usage scenario should use Parallel.ForEach, instead.
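To put the guideline in code (a minimal sketch; ProcessItem is a hypothetical method):

// Appropriate: ForAll consumes the results of a PLINQ expression
collection
    .AsParallel()
    .Where( i => i.SomePredicate() )
    .ForAll( i => ProcessItem(i) );

// No PLINQ expression involved: prefer Parallel.ForEach here
Parallel.ForEach(collection, item => ProcessItem(item));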
Another good post; I enjoy reading your blog!
Regarding the philosophical arguments, I must point out that the Rx team (who are very functionally minded) did choose to include Do and Run (which are essentially ForEach) for IEnumerables as well as IObservables.
At some point, pragmatism must be considered. Functional code must eventually produce side effects, and no one is really interested in forcing all LINQ side effects to be monads… 🙂
Stephen,
I suspect that they did this, mainly because they wanted this to exist on the IObservable side, and wanted to keep feature parity on the IEnumerable side, as well. Most of the arguments I make above for asynchronous operations apply to IObservable – in which case, it makes sense.
Personally, though, I actually do not like the EnumerableEx.Do(…) implementation in the Rx framework. It leaves a “real bad taste” when I’ve used it, since it has some odd, unexpected behavior (i.e., it doesn’t actually perform the action until you enumerate through it, which makes sense, but is odd). I particularly don’t like that .Do() returns an IEnumerable, so it can be used in the middle of expressions, even though it’s a statement intended to cause side effects. To me, it seems no different from a slightly restricted form of Select() [since you can do the exact same thing with a Select(…) call].
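As a quick illustration of that last point (a minimal sketch, with items standing in for any IEnumerable<T>), a side-effecting Select defers its work in exactly the same way:

// Nothing is written yet – the side effect is deferred...
var logged = items.Select(i => { Console.WriteLine(i); return i; });

// ...the writes only happen here, once the sequence is enumerated.
foreach (var item in logged) { }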
Hello,
I have an interesting problem. I am completely happy with the very convenient and efficient ForAll; however, the problem I ran into is as follows:
I have a collection that is grouped in parallel, with order preserved by using AsOrdered. I then need to take each result from that collection and turn it into a row of information, in the same order and in parallel – this does not seem to be possible using ForAll?
If you’re making a new collection from the original, you should use a mapping operation (Select), not ForAll.
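Something along these lines should work (a sketch; the key selector and the MakeRow mapping function are hypothetical):

// Group in parallel, preserving order, then map each group to a row
var rows = collection
    .AsParallel()
    .AsOrdered()
    .GroupBy( i => i.Key )
    .Select( g => MakeRow(g) )
    .ToList();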
Really good article; I am getting great results using PLINQ now.