C# Performance Pitfall – Interop Scenarios Change the Rules
C# and .NET, overall, really do have fantastic performance in my opinion. That being said, the performance characteristics differ dramatically from native programming, and take some relearning if you’re used to doing performance optimization in most other languages, especially C, C++, and similar. However, there are times when revisiting tricks learned in native code plays a critical role in performance optimization in C#.
I recently ran across a nasty scenario that illustrated to me how dangerous following any fixed rules for optimization can be…
The rules in C# when optimizing code are very different from those in C or C++. Often, they’re exactly backwards. For example, in C and C++, lifting a variable out of a loop in order to avoid memory allocations can often have huge advantages. If some function within a call graph is allocating memory dynamically, and that gets called in a loop, it can dramatically slow down a routine.
This can be a tricky bottleneck to track down, even with a profiler. Looking at the memory allocation graph is usually the key for spotting this routine, as it’s often “hidden” deep in the call graph. For example, while optimizing some of my scientific routines, I ran into a situation where I had a loop similar to:
```cpp
for (i = 0; i < numberToProcess; ++i)
{
    // Do some work
    ProcessElement(element[i]);
}
```
This loop was at a fairly high level in the call graph, and often could take many hours to complete, depending on the input data. As such, any performance optimization we could achieve would be greatly appreciated by our users.
After a fair bit of profiling, I noticed that a couple of function calls down the call graph (inside of ProcessElement), there was some code that effectively was doing:
```cpp
// Allocate some data required
DataStructure* data = new DataStructure(num);

// Call into a subroutine that passed this data around and manipulated it heavily
CallSubroutine(data);

// Read and use some values from here
double values = data->Foo;

// Cleanup
delete data;

// ...
return bar;
```
Normally, if “DataStructure” were a simple data type, I could just allocate it on the stack. However, its constructor, internally, allocated its own memory using new, so this wouldn’t eliminate the problem. In this case, however, I could change the call signatures to allow the pointer to the data structure to be passed into ProcessElement and through the call graph, allowing the inner routine to reuse the same “data” memory instead of allocating. At the highest level, my code effectively changed to something like:
```cpp
DataStructure* data = new DataStructure(numberToProcess);

for (i = 0; i < numberToProcess; ++i)
{
    // Do some work
    ProcessElement(element[i], data);
}

delete data;
```
Granted, this dramatically reduced the maintainability of the code, so it wasn’t something I wanted to do unless there was a significant benefit. In this case, after profiling the new version, I found that it increased the overall performance dramatically – my main test case went from 35 minutes runtime down to 21 minutes. This was such a significant improvement, I felt it was worth the reduction in maintainability.
In C and C++, it’s generally a good idea (for performance) to:
- Reduce the number of memory allocations as much as possible,
- Use fewer, larger memory allocations instead of many smaller ones, and
- Allocate as high up the call stack as possible, and reuse memory
I’ve seen many people try to make similar optimizations in C# code. For better or worse, this is typically not a good idea. The garbage collector in .NET completely changes the rules here.
In C#, reallocating memory in a loop is not always a bad idea. In this scenario, for example, I may have been much better off leaving the original code alone. The reason for this is the garbage collector. The GC in .NET is incredibly effective, and leaving the allocation deep inside the call stack has some huge advantages. First and foremost, it tends to make the code more maintainable – passing around object references tends to couple the methods together more than necessary, and overall increases the complexity of the code. This is something that should be avoided unless there is a significant reason. Second, (unlike C and C++) memory allocation of a single object in C# is normally cheap and fast. Finally, and most critically, there is a large advantage to having short-lived objects. If you lift a variable out of the loop and reuse the memory, it’s much more likely that the object will get promoted to Gen1 (or worse, Gen2). This can cause expensive compaction operations to be required, and also lead to (at least temporary) memory fragmentation as well as more costly collections later.
As such, I’ve found that it’s often (though not always) faster to leave memory allocations where you’d naturally place them – deep inside of the call graph, inside of the loops. This causes the objects to stay very short lived, which in turn increases the efficiency of the garbage collector, and can dramatically improve the overall performance of the routine as a whole.
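To make that concrete, here’s a minimal C# sketch of the shape I mean – WorkItemState and ProcessElement are hypothetical stand-ins for illustration, not the actual types from my project:

```csharp
using System;

class WorkItemState
{
    // Hypothetical per-iteration scratch data.
    public double[] Scratch = new double[1024];
}

static class Example
{
    static void ProcessElement(int element, WorkItemState state)
    {
        // Placeholder for the real work that uses the scratch data.
        state.Scratch[0] = element;
    }

    public static void ProcessAll(int numberToProcess)
    {
        for (int i = 0; i < numberToProcess; ++i)
        {
            // Allocating inside the loop keeps each object short-lived,
            // so it typically dies in Gen0 and is very cheap to collect.
            var state = new WorkItemState();
            ProcessElement(i, state);
            // "state" becomes unreachable as soon as the iteration ends.
        }
    }
}
```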
In C#, I tend to:
- Keep variable declarations in the tightest scope possible
- Declare and allocate objects at usage
While this tends to accomplish some of the same goals (reducing unnecessary allocations, etc.), the goal here is a bit different – it’s about keeping the objects rooted for as little time as possible in order to (attempt to) keep them completely in Gen0, or worst case, Gen1. It also has the huge advantage of keeping the code very maintainable – objects are used and “released” as soon as possible, which keeps the code very clean. It does, however, often have the side effect of causing more allocations to occur, while keeping the objects rooted for a much shorter time.
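If you want to convince yourself of this behavior, the GC class exposes enough to watch it happen. This is just a rough diagnostic sketch, not production code:

```csharp
using System;

static class GenerationCheck
{
    static void Main()
    {
        // A long-lived object, rooted for the whole method.
        object longLived = new byte[256];
        Console.WriteLine(GC.GetGeneration(longLived)); // Freshly allocated: 0

        int gen0Before = GC.CollectionCount(0);

        for (int i = 0; i < 1000000; ++i)
        {
            // Short-lived allocations die immediately and stay in Gen0.
            byte[] temp = new byte[256];
            temp[0] = 1;
        }

        // The tight-scoped objects were reclaimed by cheap Gen0 collections...
        Console.WriteLine(GC.CollectionCount(0) - gen0Before);

        // ...while the rooted object survived those collections and was promoted.
        Console.WriteLine(GC.GetGeneration(longLived));
    }
}
```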
Now – nowhere here am I suggesting that these are hard-and-fast rules that are always true. That being said, my time spent optimizing over the years encourages me to naturally write code that follows the above guidelines, then profile and adjust as necessary. In my current project, however, I ran across one of those nasty little pitfalls that’s something to keep in mind – interop changes the rules.
In this case, I was dealing with an API that, internally, used some COM objects. These COM objects were leading to native allocations (most likely C++) occurring in a loop deep in my call graph. Even though I was writing nice, clean managed code, the normal managed-code rules for performance no longer applied.
After profiling to find the bottleneck in my code, I realized that my inner loop, an innocuous-looking block of C# code, was effectively causing a set of native memory allocations in every iteration. This required going back to a “native programming” mindset for optimization. Lifting these variables out of the loop and reusing them took a 1:10 routine down to 0:20 – again, a very worthwhile improvement.
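I can’t share the actual API, but the shape of the fix looked roughly like this. NativeBuffer is a hypothetical managed wrapper whose constructor triggers a native (COM-side) allocation – the point is the pattern, not the type:

```csharp
using System;

// Hypothetical wrapper: constructing it allocates native memory through COM,
// and Dispose releases that memory deterministically.
class NativeBuffer : IDisposable
{
    public NativeBuffer(int size) { /* native allocation happens here */ }
    public void Process(int element) { /* native work on the buffer */ }
    public void Reset() { /* clear state so the buffer can be reused */ }
    public void Dispose() { /* free the native memory */ }
}

static class InteropExample
{
    static void ProcessAllBefore(int count)
    {
        for (int i = 0; i < count; ++i)
        {
            // Looks like innocent managed code, but every iteration
            // performs a native allocation behind the scenes.
            using (var buffer = new NativeBuffer(1024))
            {
                buffer.Process(i);
            }
        }
    }

    static void ProcessAllAfter(int count)
    {
        // The "native mindset" fix: lift the wrapper out of the loop
        // and reuse it, paying for the native allocation only once.
        using (var buffer = new NativeBuffer(1024))
        {
            for (int i = 0; i < count; ++i)
            {
                buffer.Reset();
                buffer.Process(i);
            }
        }
    }
}
```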
Overall, the lessons here are:
- Always profile if you suspect a performance problem – don’t assume any rule is correct, or any code is efficient just because it looks like it should be
- Remember to check memory allocations when profiling, not just CPU cycles
- Interop scenarios often cause managed code to act very differently from “normal” managed code.
- Native code can be hidden very cleverly inside of managed wrappers
Great advice Reed, I would be interested in hearing more about your approach to profiling. Everyone seems to have their own tips and tricks (myself included) and I am always looking for better ways to find these issues.
I particularly find that profiling and tracking down bottlenecks in interop scenarios seems a bit painful. Am I missing a simple solution?
Brad:
No, it’s not easy. Perfmon is a great help, though – I often watch the CLR perf. counters + the system memory counters to look for clues like the one here. This particular issue was found by a mix of wall-clock and perf. profiling. A good profiler is key.
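Incidentally, the same counters Perfmon shows are readable from code. This sketch assumes a .NET Framework process on Windows (the category and counter names below are the standard English ones, and the instance lookup is simplified):

```csharp
using System;
using System.Diagnostics;

static class ClrCounters
{
    static void Main()
    {
        // Simplified: assumes a single instance of this process is running.
        string instance = Process.GetCurrentProcess().ProcessName;

        // The same ".NET CLR Memory" counters you'd watch in Perfmon.
        using (var heapBytes = new PerformanceCounter(
            ".NET CLR Memory", "# Bytes in all Heaps", instance))
        using (var gen0 = new PerformanceCounter(
            ".NET CLR Memory", "# Gen 0 Collections", instance))
        {
            Console.WriteLine("Heap bytes: {0}", heapBytes.NextValue());
            Console.WriteLine("Gen 0 collections: {0}", gen0.NextValue());
        }
    }
}
```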
I’m thinking I’m going to start writing a bit more about performance optimization in general, at least for a bit – I’m definitely open to suggestions, though, if there are any tips and tricks you’d like to see 😉
-Reed
Good article with good tips! I’d like to know more about the case where managed objects may be using COM objects or unmanaged code in memory allocation. I’ve read that Bitmap is one of these types. The situation occurs where a user has a large number of bitmaps in their application. The GC doesn’t collect them because it doesn’t see any ‘memory stress’ in the managed heap. But in the unmanaged memory, so much memory gets allocated that eventually the app gets an “out of memory” error.
Any recommendations on knowing which managed objects may be allocating unmanaged memory, and how much unmanaged memory?
I think of this as a “tip of the iceberg” scenario. The managed memory doesn’t look like much, but the unmanaged memory the object is using may be significantly larger.
Dan,
There are actually two issues here. The GC not seeing “memory stress” is partly the fault of the managed wrapper class. This can be handled in the wrapper by properly calling GC.AddMemoryPressure() and GC.RemoveMemoryPressure() to “inform” the GC of the extra memory allocated. If the wrappers are written well, you can handle that situation much more elegantly, since it will give the GC the proper information to run.
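For example, a wrapper along these lines gives the GC the information it needs. This is just a sketch – the Marshal.AllocHGlobal buffer stands in for whatever unmanaged allocation the real wrapper owns:

```csharp
using System;
using System.Runtime.InteropServices;

// Sketch of a wrapper that owns a large unmanaged allocation and tells
// the GC about it, so collections account for the object's true cost.
class UnmanagedBuffer : IDisposable
{
    private IntPtr _buffer;
    private readonly long _size;

    public UnmanagedBuffer(long size)
    {
        _buffer = Marshal.AllocHGlobal(new IntPtr(size));
        _size = size;
        // "Inform" the GC that this object is really _size bytes heavy.
        GC.AddMemoryPressure(size);
    }

    public void Dispose()
    {
        Cleanup();
        GC.SuppressFinalize(this);
    }

    ~UnmanagedBuffer()
    {
        Cleanup();
    }

    private void Cleanup()
    {
        if (_buffer != IntPtr.Zero)
        {
            Marshal.FreeHGlobal(_buffer);
            _buffer = IntPtr.Zero;
            // Every AddMemoryPressure must be balanced by RemoveMemoryPressure.
            GC.RemoveMemoryPressure(_size);
        }
    }
}
```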
As for tracking object references by memory, the best options are typically SoS and a good memory profiler. This is, unfortunately, a very difficult realm to get into – I often just work by looking at total memory allocations as I go, which will at least show you major issues, but it can be very difficult to track and find the small issues (which can add up over time). Perfmon, again, is great for tracking both managed + native memory allocations, so it can give you some great feedback on when memory is being allocated, how much is there, and where it’s going.
-Reed
Thanks for the tips!
Thanks for the great article, Reed. Yes, as mentioned in the above comments, I would also like to know what you use (or have used in the above scenario) for profiling.
Samik,
My reply to Dan above pretty much lists out the tools I tend to use…
-Reed