Parallelism in .NET – Part 7, Some Differences between PLINQ and LINQ to Objects
In my previous post on Declarative Data Parallelism, I mentioned that PLINQ extends LINQ to Objects to support parallel operations. Although nearly all of the same operations are supported, there are some differences between PLINQ and LINQ to Objects. By introducing Parallelism to our declarative model, we add some extra complexity. This, in turn, adds some extra requirements that must be addressed.
In order to illustrate the main differences, and why they exist, let’s begin by discussing some differences in how the two technologies operate, and look at the underlying types involved in LINQ to Objects and PLINQ .
LINQ to Objects is mainly built upon a single class: Enumerable. The Enumerable class is a static class that defines a large set of extension methods, nearly all of which work upon an IEnumerable<T>. Many of these methods return a new IEnumerable<T>, allowing the methods to be chained together into a fluent style interface. This is what allows us to write statements that chain together, and lead to the nice declarative programming model of LINQ:
double min = collection .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .Min(item => item.PerformComputation());
Other LINQ variants work in a similar fashion. For example, most data-oriented LINQ providers are built upon an implementation of IQueryable<T>, which allows the database provider to turn a LINQ statement into an underlying SQL query, to be performed directly on the remote database.
PLINQ is similar, but instead of being built upon the Enumerable class, most of PLINQ is built upon a new static class: ParallelEnumerable. When using PLINQ, you typically begin with any collection which implements IEnumerable<T>, and convert it to a new type using an extension method defined on ParallelEnumerable: AsParallel(). This method takes any IEnumerable<T>, and converts it into a ParallelQuery<T>, the core class for PLINQ. There is a similar ParallelQuery class for working with non-generic IEnumerable implementations.
This brings us to our first subtle, but important difference between PLINQ and LINQ –
PLINQ always works upon specific types, which must be explicitly created.
Typically, the type you’ll use with PLINQ is ParallelQuery<T>, but it can sometimes be a ParallelQuery or an OrderedParallelQuery<T>. Instead of dealing with an interface, implemented by an unknown class, we’re dealing with a specific class type. This works seamlessly from a usage standpoint – ParallelQuery<T> implements IEnumerable<T>, so you can always “switch back” to an IEnumerable<T>.
The difference only arises at the beginning of our parallelization. When we’re using LINQ, and we want to process a normal collection via PLINQ, we need to explicitly convert the collection into a ParallelQuery<T> by calling AsParallel(). There is an important consideration here – AsParallel() does not need to be called on your specific collection, but rather any IEnumerable<T>. This allows you to place it anywhere in the chain of methods involved in a LINQ statement, not just at the beginning. This can be useful if you have an operation which will not parallelize well or is not thread safe. For example, the following is perfectly valid, and similar to our previous examples:
double min = collection .AsParallel() .Select(item => item.SomeOperation()) .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .Min(item => item.PerformComputation());
However, if SomeOperation() is not thread safe, we could just as easily do:
double min = collection .Select(item => item.SomeOperation()) .AsParallel() .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .Min(item => item.PerformComputation());
In this case, we’re using standard LINQ to Objects for the Select(…) method, then converting the results of that map routine to a ParallelQuery<T>, and processing our filter (the Where method) and our aggregation (the Min method) in parallel.
PLINQ also provides us with a way to convert a ParallelQuery<T> back into a standard IEnumerable<T>, forcing sequential processing via standard LINQ to Objects. If SomeOperation() was thread-safe, but PerformComputation() was not thread-safe, we would need to handle this by using the AsEnumerable() method:
double min = collection .AsParallel() .Select(item => item.SomeOperation()) .Where(item => item.SomeProperty > 6 && item.SomeProperty < 24) .AsEnumerable() .Min(item => item.PerformComputation());
Here, we’re converting our collection into a ParallelQuery<T>, doing our map operation (the Select(…) method) and our filtering in parallel, then converting the collection back into a standard IEnumerable<T>, which causes our aggregation via Min() to be performed sequentially.
This could also be written as two statements, as well, which would allow us to use the language integrated syntax for the first portion:
var tempCollection = from item in collection.AsParallel() let e = item.SomeOperation() where (e.SomeProperty > 6 && e.SomeProperty < 24) select e; double min = tempCollection.AsEnumerable().Min(item => item.PerformComputation());
The second important difference between PLINQ and LINQ deals with order preservation.
PLINQ, by default, does not preserve the order of of source collection.
This is by design. In order to process a collection in parallel, the system needs to naturally deal with multiple elements at the same time. Maintaining the original ordering of the sequence adds overhead, which is, in many cases, unnecessary. Therefore, by default, the system is allowed to completely change the order of your sequence during processing. If you are doing a standard query operation, this is usually not an issue. However, there are times when keeping a specific ordering in place is important. If this is required, you can explicitly request the ordering be preserved throughout all operations done on a ParallelQuery<T> by using the AsOrdered() extension method. This will cause our sequence ordering to be preserved.
For example, suppose we wanted to take a collection, perform an expensive operation which converts it to a new type, and display the first 100 elements. In LINQ to Objects, our code might look something like:
// Using IEnumerable<SourceClass> collection IEnumerable<ResultClass> results = collection .Select(e => e.CreateResult()) .Take(100);
If we just converted this to a parallel query naively, like so:
IEnumerable<ResultClass> results = collection .AsParallel() .Select(e => e.CreateResult()) .Take(100);
We could very easily get a very different, and non-reproducable, set of results, since the ordering of elements in the input collection is not preserved. To get the same results as our original query, we need to use:
IEnumerable<ResultClass> results = collection .AsParallel() .AsOrdered() .Select(e => e.CreateResult()) .Take(100);
This requests that PLINQ process our sequence in a way that verifies that our resulting collection is ordered as if it were processed serially. This will cause our query to run slower, since there is overhead involved in maintaining the ordering. However, in this case, it is required, since the ordering is required for correctness.
PLINQ is incredibly useful. It allows us to easily take nearly any LINQ to Objects query and run it in parallel, using the same methods and syntax we’ve used previously. There are some important differences in operation that must be considered, however – it is not a free pass to parallelize everything. When using PLINQ in order to parallelize your routines declaratively, the same guideline I mentioned before still applies:
Parallelization is something that should be handled with care and forethought, added by design, and not just introduced casually.