<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Reed Copsey, Jr. &#187; C++</title>
	<atom:link href="http://reedcopsey.com/category/cplusplus/feed/" rel="self" type="application/rss+xml" />
	<link>http://reedcopsey.com</link>
	<description>Thoughts on C#, WPF, .NET, and programming for Scientific Visualization</description>
	<lastBuildDate>Mon, 28 Nov 2011 20:42:35 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>C# Performance Pitfall &#8211; Interop Scenarios Change the Rules</title>
		<link>http://reedcopsey.com/2011/08/11/c-performance-pitfall-interop-scenarios-change-the-rules/</link>
		<comments>http://reedcopsey.com/2011/08/11/c-performance-pitfall-interop-scenarios-change-the-rules/#comments</comments>
		<pubDate>Thu, 11 Aug 2011 19:30:52 +0000</pubDate>
		<dc:creator>Reed</dc:creator>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Performance]]></category>

		<guid isPermaLink="false">http://reedcopsey.com/2011/08/11/c-performance-pitfall-interop-scenarios-change-the-rules/</guid>
		<description><![CDATA[C# and .NET, overall, really do have fantastic performance in my opinion.&#160; That being said, the performance characteristics differ dramatically from those of native programming, and take some relearning if you’re used to doing performance optimization in most other languages, especially C, C++, and similar.&#160; However, there are times when revisiting tricks learned in native code plays [...]]]></description>
			<content:encoded><![CDATA[<p>C# and .NET, overall, really do have fantastic performance in my opinion.&#160; That being said, the performance characteristics differ dramatically from those of native programming, and take some relearning if you’re used to doing performance optimization in most other languages, especially C, C++, and similar.&#160; However, there are times when revisiting tricks learned in native code plays a critical role in performance optimization in C#.</p>
<p>I recently ran across a nasty scenario that illustrated to me how dangerous following any fixed rules for optimization can be…</p>
<p><span id="more-310"></span>
<p>The rules when optimizing C# code are very different from those in C or C++.&#160; Often, they’re exactly backwards.&#160; For example, in C and C++, lifting a variable out of a loop in order to avoid memory allocations can often have huge advantages.&#160; If some function within a call graph is allocating memory dynamically, and that function gets called in a loop, it can dramatically slow down a routine.</p>
<p>This can be a tricky bottleneck to track down, even with a profiler.&#160; Looking at the memory allocation graph is usually the key for spotting this routine, as it’s often “hidden” deep in the call graph.&#160; For example, while optimizing some of my scientific routines, I ran into a situation where I had a loop similar to:</p>
<pre class="csharpcode"><span class="kwrd">for</span> (i=0; i&lt;numberToProcess; ++i)
{
   <span class="rem">// Do some work</span>
   ProcessElement(element[i]);
}</pre>
<style type="text/css">
.csharpcode, .csharpcode pre
{
	font-size: small;
	color: black;
	font-family: consolas, "Courier New", courier, monospace;
	background-color: #ffffff;
	/*white-space: pre;*/
}
.csharpcode pre { margin: 0em; }
.csharpcode .rem { color: #008000; }
.csharpcode .kwrd { color: #0000ff; }
.csharpcode .str { color: #006080; }
.csharpcode .op { color: #0000c0; }
.csharpcode .preproc { color: #cc6633; }
.csharpcode .asp { background-color: #ffff00; }
.csharpcode .html { color: #800000; }
.csharpcode .attr { color: #ff0000; }
.csharpcode .alt 
{
	background-color: #f4f4f4;
	width: 100%;
	margin: 0em;
}
.csharpcode .lnum { color: #606060; }</style>
<p>This loop was at a fairly high level in the call graph, and often could take many <em>hours</em> to complete, depending on the input data.&#160; As such, any performance optimization we could achieve would be greatly appreciated by our users.</p>
<p>After a fair bit of profiling, I noticed that a couple of function calls down the call graph (inside of ProcessElement), there was some code that effectively was doing:</p>
<pre class="csharpcode"><span class="rem">// Allocate some data required</span>
DataStructure* data = <span class="kwrd">new</span> DataStructure(num);
<span class="rem">// Call into a subroutine that passes this data around and manipulates it heavily</span>
CallSubroutine(data);
<span class="rem">// Read and use some values from here</span>
<span class="kwrd">double</span> values = data-&gt;Foo;
<span class="rem">// Cleanup </span>
<span class="kwrd">delete</span> data;
<span class="rem">// ...</span>
<span class="kwrd">return</span> bar;</pre>
<p>Normally, if “DataStructure” were a simple data type, I could just allocate it on the stack.&#160; However, its constructor, internally, allocated its own memory using new, so this wouldn’t eliminate the problem.&#160; In this case, however, I could change the call signatures to allow the pointer to the data structure to be passed into ProcessElement and through the call graph, allowing the inner routine to <em>reuse</em> the same “data” memory instead of allocating.&#160; At the highest level, my code effectively changed to something like:</p>
<pre class="csharpcode">DataStructure* data = <span class="kwrd">new</span> DataStructure(numberToProcess);
<span class="kwrd">for</span> (i=0; i&lt;numberToProcess; ++i)
{
   <span class="rem">// Do some work</span>
   ProcessElement(element[i], data);
}
<span class="kwrd">delete</span> data;</pre>
<p>Granted, this dramatically reduced the maintainability of the code, so it wasn’t something I wanted to do unless there was a significant benefit.&#160; In this case, after profiling the new version, I found that it increased the overall performance dramatically – my main test case went from 35 minutes runtime down to 21 minutes.&#160; This was such a significant improvement, I felt it was worth the reduction in maintainability.</p>
<p>In C and C++, it’s generally a good idea (for performance) to:</p>
<ul>
<li>Reduce the number of memory allocations as much as possible, </li>
<li>Use fewer, larger memory allocations instead of many smaller ones, and </li>
<li>Allocate as high up the call stack as possible, and reuse memory </li>
</ul>
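<p>As a minimal sketch of the guidelines above (not the original scientific code), the two loop shapes might look like the following.&#160; The names <code>Scratch</code>, <code>process_element</code>, <code>process_all_allocating</code> and <code>process_all_reusing</code> are hypothetical stand-ins for “DataStructure”, ProcessElement and the surrounding loop; only the placement of the allocation is the point.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the post's DataStructure: it owns heap memory,
// so constructing one inside a loop costs one allocation per iteration.
struct Scratch {
    explicit Scratch(std::size_t n) : values(n) {}
    std::vector<double> values;
};

// Hypothetical per-element work that needs a scratch buffer.
double process_element(double element, Scratch& scratch) {
    for (std::size_t j = 0; j < scratch.values.size(); ++j) {
        scratch.values[j] = element * static_cast<double>(j);
    }
    return std::accumulate(scratch.values.begin(), scratch.values.end(), 0.0);
}

// Original shape: allocate and free the scratch data on every iteration.
double process_all_allocating(const std::vector<double>& elements, std::size_t n) {
    double total = 0.0;
    for (double e : elements) {
        Scratch scratch(n);  // heap allocation inside the loop
        total += process_element(e, scratch);
    }
    return total;
}

// Optimized shape: allocate once, above the loop, and reuse the memory.
double process_all_reusing(const std::vector<double>& elements, std::size_t n) {
    Scratch scratch(n);  // single allocation shared by every iteration
    double total = 0.0;
    for (double e : elements) {
        total += process_element(e, scratch);
    }
    return total;
}
```

<p>Both functions return the same result; the second simply trades a narrower scope for fewer allocations.&#160; Whether that trade pays off is something only a profiler can say for a given workload.</p>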
<p>I’ve seen many people try to make similar optimizations in C# code.&#160; For good or bad, this is typically <strong>not a good idea</strong>.&#160; The garbage collector in .NET completely changes the rules here.</p>
<p>In C#, reallocating memory in a loop is not always a bad idea.&#160; In this scenario, for example, I may have been much better off leaving the original code alone.&#160; The reason for this is the garbage collector.&#160; The GC in .NET is incredibly effective, and leaving the allocation deep inside the call stack has some huge advantages.&#160; First and foremost, it tends to make the code more maintainable – passing around object references tends to couple the methods together more than necessary, and overall increases the complexity of the code.&#160; This is something that should be avoided unless there is a significant reason.&#160; Second (unlike in C and C++), memory allocation of a single object in C# is normally cheap and fast.&#160; Finally, and most critically, there is a large advantage to having short-lived objects.&#160; If you lift a variable out of the loop and reuse the memory, it’s much more likely that the object will get promoted to Gen1 (or worse, Gen2).&#160; This can cause expensive compaction operations to be required, and also lead to (at least temporary) memory fragmentation as well as more costly collections later.</p>
<p>As such, I’ve found that it’s often (though not always) faster to leave memory allocations where you’d naturally place them – deep inside of the call graph, inside of the loops.&#160; This causes the objects to stay very short lived, which in turn increases the efficiency of the garbage collector, and can dramatically improve the overall performance of the routine as a whole.</p>
<p>In C#, I tend to:</p>
<ul>
<li>Keep variable declarations in the tightest scope possible</li>
<li>Declare and allocate objects at usage</li>
</ul>
<p>While this tends to serve some of the same goals (reducing unnecessary allocations, etc.), the aim here is a bit different – it’s about keeping the objects rooted for as little time as possible in order to (attempt to) keep them completely in Gen0, or worst case, Gen1.&#160; It also has the huge advantage of keeping the code very maintainable – objects are used and “released” as soon as possible, which keeps the code very clean.&#160; It does, however, often have the side effect of causing more allocations to occur, while keeping the objects rooted for a much shorter time.</p>
<p>Now – nowhere here am I suggesting that these are hard and fast rules that are always true.&#160; That being said, my time spent optimizing over the years encourages me to naturally write code that follows the above guidelines, then profile and adjust as necessary.&#160; In my current project, however, I ran across one of those nasty little pitfalls that’s something to keep in mind – interop changes the rules.</p>
<p>In this case, I was dealing with an API that, internally, used some COM objects.&#160; These COM objects were leading to native allocations (most likely C++) occurring in a loop deep in my call graph.&#160; Even though I was writing nice, clean managed code, the normal managed code rules for performance no longer applied.&#160; </p>
<p>After profiling to find the bottleneck in my code, I realized that my inner loop, an innocuous-looking block of C# code, was effectively causing a set of native memory allocations in every iteration.&#160; This required going back to a “native programming” mindset for optimization.&#160; Lifting these variables and reusing them took a 1:10 routine down to 0:20 – again, a very worthwhile improvement.</p>
<p>Overall, the lessons here are:</p>
<ul>
<li>Always profile if you suspect a performance problem – don’t assume any rule is correct, or any code is efficient just because it looks like it should be</li>
<li>Remember to check memory allocations when profiling, not just CPU cycles</li>
<li>Interop scenarios often cause managed code to act very differently than “normal” managed code</li>
<li>Native code can be hidden very cleverly inside of managed wrappers</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://reedcopsey.com/2011/08/11/c-performance-pitfall-interop-scenarios-change-the-rules/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>About Big-O Notation &#8211; The good, the bad, and the confusing</title>
		<link>http://reedcopsey.com/2009/09/18/about-big-o-notation-the-good-the-bad-and-the-confusing/</link>
		<comments>http://reedcopsey.com/2009/09/18/about-big-o-notation-the-good-the-bad-and-the-confusing/#comments</comments>
		<pubDate>Sat, 19 Sep 2009 00:11:02 +0000</pubDate>
		<dc:creator>Reed</dc:creator>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[Algorithms]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[C++]]></category>
		<category><![CDATA[Professional]]></category>

		<guid isPermaLink="false">http://reedcopsey.com/?p=56</guid>
		<description><![CDATA[Often, prior to using a specific algorithm, it helps to understand information about how the algorithm scales.&#160; In order to document this, computer science has borrowed a notation from mathematics called Big-O notation. Understanding Big-O notation is critical in making good decisions about algorithms or libraries used to implement routines.&#160; However, there is often a [...]]]></description>
			<content:encoded><![CDATA[<p>Often, prior to using a specific algorithm, it helps to understand information about how the algorithm scales.&#160; In order to document this, computer science has borrowed a notation from mathematics called <a href="http://en.wikipedia.org/wiki/Big_O_notation" target="_blank">Big-O notation</a>.</p>
<p>Understanding Big-O notation is critical in making good decisions about algorithms or libraries used to implement routines.&#160; However, there is often a lot of misunderstanding about what it means, and I frequently see it used as justification for poor decisions.</p>
<p> <span id="more-56"></span>
<p>First, let’s define <a href="http://en.wikipedia.org/wiki/Big_O_notation" target="_blank">Big-O notation</a>.&#160; Big-O notation describes the limiting behavior of a function as the argument or domain size of the function approaches a specific value, typically infinity.&#160; Big-O notation describes <strong>a growth rate</strong>, <strong>not a speed</strong>.&#160; This is important – one very common misunderstanding about Big-O notation is that it describes how “fast” a function will be – and this is never true.&#160; It describes how a function or algorithm will scale, but says absolutely nothing about speed.</p>
<p>When programming, Big-O notation can be used to describe many different things.&#160; Most frequently it describes an algorithm in terms of instruction count or running time, but it can also be used to describe memory or resource usage.&#160; It’s important to know what is being described when making decisions based on this information; seeing that an algorithm is “O(N)” isn’t enough information, in and of itself, to know what that means.</p>
<p>When and why is this useful?&#160; It becomes very important to understand this when we start dealing with collections, especially if the collection is large or of an unknown scale.&#160; For example, say you’re going to do some basic image processing, and you have the choice between a couple of different algorithms.&#160; If each algorithm’s limiting behavior is based off the total image size, and the algorithm’s operation count is described in Big-O notation, the total number of operations will vary dramatically as your image size changes.&#160; For example, let’s say we have the choice of 3 algorithms described by three different, but <a href="http://en.wikipedia.org/wiki/Big_O_notation#Orders_of_common_functions" target="_blank">common Big-O notations</a>:</p>
<ul>
<li>Algorithm A: Linear &#8211; O(N) </li>
<li>Algorithm B: Logarithmic &#8211; O(log N)&#160;&#160;&#160; </li>
<li>Algorithm C: Quadratic &#8211; O(N^2) </li>
</ul>
<table border="1" cellspacing="0" cellpadding="2" width="500">
<tbody>
<tr>
<th valign="top" width="100">Image Resolution</th>
<th valign="top" width="100">Total Pixels</th>
<th valign="top" width="100">Potential Instructions: A</th>
<th valign="top" width="100">Potential Instructions: B</th>
<th valign="top" width="100">Potential Instructions: C</th>
</tr>
<tr>
<td valign="top" width="87">16&#215;16</td>
<td valign="top" width="162">256</td>
<td valign="top" width="142">256</td>
<td valign="top" width="164">2</td>
<td valign="top" width="179">65,536</td>
</tr>
<tr>
<td valign="top" width="87">256&#215;256</td>
<td valign="top" width="162">65,536</td>
<td valign="top" width="142">65,536</td>
<td valign="top" width="164">5</td>
<td valign="top" width="179">4,294,967,296</td>
</tr>
<tr>
<td valign="top" width="87">512&#215;512</td>
<td valign="top" width="162">262,144</td>
<td valign="top" width="142">262,144</td>
<td valign="top" width="164">5</td>
<td valign="top" width="179">68,719,476,736</td>
</tr>
</tbody>
</table>
<p>[Here I am assuming a base-10 logarithm, rounded to the nearest whole number, but for Big-O it could be any logarithmic base and still be O(log N)]</p>
<p>Note the huge discrepancy in terms of instruction count!&#160; Algorithm C is a <strong>quadratic</strong> algorithm, meaning that its instruction count grows with the square of the number of elements (pixels, in this case).</p>
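<p>The instruction counts in the chart can be reproduced with a few lines of code.&#160; This is only a sketch of the arithmetic, assuming (as the chart does) a base-10 logarithm rounded to the nearest whole number; the function names are made up for illustration.</p>

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Idealized operation counts for N pixels, ignoring all constant factors.

// Algorithm A: linear, O(N).
std::uint64_t linear_ops(std::uint64_t n) { return n; }

// Algorithm B: logarithmic, O(log N). Base 10, rounded to the nearest
// whole number to match the chart.
std::uint64_t log_ops(std::uint64_t n) {
    return static_cast<std::uint64_t>(
        std::llround(std::log10(static_cast<double>(n))));
}

// Algorithm C: quadratic, O(N^2).
std::uint64_t quadratic_ops(std::uint64_t n) { return n * n; }
```

<p>For a 256&#215;256 image (65,536 pixels), <code>linear_ops</code> gives 65,536, <code>log_ops</code> gives 5 and <code>quadratic_ops</code> gives 4,294,967,296, matching the second row of the chart.</p>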
<p>Now, given the information above, it seems like you would naturally always choose Algorithm B – it scales much, much better.&#160; Often, this is true, and this is a good way to choose between algorithms, but we don’t have enough information to make an educated decision at this point.</p>
<p>Remember above – I mentioned that Big-O describes a growth rate, and not a speed.&#160; In addition, it describes an upper bound for the growth rate, not necessarily the average case scenario.&#160; For example, it may be that algorithm C is the fastest algorithm for processing a single pixel – perhaps by a factor of thousands.&#160; Let’s look at these algorithms, assign the actual time each operation takes on average, and look at the same chart as above:</p>
<ul>
<li>Algorithm A: 0.1 seconds </li>
<li>Algorithm B: 10 seconds </li>
<li>Algorithm C: 50 microseconds </li>
</ul>
<table border="1" cellspacing="0" cellpadding="2" width="576">
<tbody>
<tr>
<th valign="top" width="114">Image Resolution</th>
<th valign="top" width="111">Total Pixels</th>
<th valign="top" width="109">Theoretical Execution Time: A</th>
<th valign="top" width="114">Theoretical Execution Time: B</th>
<th valign="top" width="126">Theoretical Execution Time: C</th>
</tr>
<tr>
<td valign="top" width="119">16&#215;16</td>
<td valign="top" width="114">256</td>
<td valign="top" width="112">25.6 seconds</td>
<td valign="top" width="117">20 seconds</td>
<td valign="top" width="129">3.3 seconds</td>
</tr>
<tr>
<td valign="top" width="120">256&#215;256</td>
<td valign="top" width="114">65,536</td>
<td valign="top" width="114">6,553.6 seconds</td>
<td valign="top" width="118">50 seconds</td>
<td valign="top" width="131">214,748 seconds</td>
</tr>
<tr>
<td valign="top" width="120">512&#215;512</td>
<td valign="top" width="112">262,144</td>
<td valign="top" width="115">26,214 seconds</td>
<td valign="top" width="118">50 seconds</td>
<td valign="top" width="132">…too long…</td>
</tr>
</tbody>
</table>
<p>This shows off a basic misconception of Big-O notation.&#160; It’s not about speed.&#160; Granted, if you’re planning to deal with megapixel imagery, you’d probably still choose the algorithm above with the best Big-O notation.&#160; However, there are two things that may trip you up here when you’re developing this algorithm.</p>
<ol>
<li>If you developed the three algorithms above looking only at a small image, you’d probably have picked algorithm C, hands down, because the per-pixel routine is so dramatically faster. </li>
<li>If you developed the three algorithms above with a large image, you’d have given up on algorithm C quickly, because of its poor scalability. </li>
</ol>
<p>However, neither of those outcomes is necessarily a good thing.</p>
<p>In case 1: The reason this is bad is obvious.&#160; Later, when you fed the routine a large image, you’d be in trouble (especially if you didn’t realize that it was quadratic at the time).</p>
<p>In case 2: If this routine was always going to be used on very small images, Algorithm C is a superior choice, even though it has the worst Big-O description.&#160; This is because its single-element speed is so much better.&#160; </p>
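<p>To make “very small images” concrete, here is a small sketch that searches for the pixel count at which Algorithm C stops beating Algorithm A.&#160; It uses the per-operation times listed earlier (0.1 seconds for A, 50 microseconds for C), expressed in whole microseconds so the comparison is exact; the helper names are hypothetical.</p>

```cpp
#include <cassert>
#include <cstdint>

// Per-operation costs from the list above, in microseconds.
constexpr std::uint64_t kMicrosPerOpA = 100000;  // 0.1 seconds
constexpr std::uint64_t kMicrosPerOpC = 50;      // 50 microseconds

// Theoretical running time in microseconds for N pixels.
std::uint64_t micros_a(std::uint64_t n) { return kMicrosPerOpA * n; }      // O(N)
std::uint64_t micros_c(std::uint64_t n) { return kMicrosPerOpC * n * n; }  // O(N^2)

// Smallest pixel count at which quadratic Algorithm C becomes slower than
// linear Algorithm A. Returns 0 if C never loses for any N up to `limit`.
std::uint64_t first_n_where_c_loses_to_a(std::uint64_t limit) {
    for (std::uint64_t n = 1; n <= limit; ++n) {
        if (micros_c(n) > micros_a(n)) {
            return n;
        }
    }
    return 0;
}
```

<p>With these numbers the crossover comes just past 2,000 pixels, which is roughly a 45&#215;45 image.&#160; Below that size the quadratic routine wins; above it the gap widens quickly in favor of the linear one.</p>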
<p>Also, Big-O only describes the upper bound (which is very important), but the average case of an O(N^2) function may actually be closer to N log N (<a href="http://en.wikipedia.org/wiki/Quicksort" target="_blank">quicksort</a> is this way).&#160; It may be that, when we actually look at real execution times in a profiler with real data, the algorithm performs, on average, in N log N time, in which case the “real” runtime may be more like:</p>
<table border="1" cellspacing="0" cellpadding="2" width="576">
<tbody>
<tr>
<th valign="top" width="114">Image Resolution</th>
<th valign="top" width="111">Total Pixels</th>
<th valign="top" width="109">“Real” Execution Time: A</th>
<th valign="top" width="114">“Real” Execution Time: B</th>
<th valign="top" width="126">“Real” Execution Time: C</th>
</tr>
<tr>
<td valign="top" width="119">16&#215;16</td>
<td valign="top" width="114">256</td>
<td valign="top" width="112">21 seconds</td>
<td valign="top" width="117">18 seconds</td>
<td valign="top" width="129">0.003 seconds</td>
</tr>
<tr>
<td valign="top" width="120">256&#215;256</td>
<td valign="top" width="114">65,536</td>
<td valign="top" width="114">3,531 seconds</td>
<td valign="top" width="118">42 seconds</td>
<td valign="top" width="131">3.5 seconds</td>
</tr>
<tr>
<td valign="top" width="120">512&#215;512</td>
<td valign="top" width="112">262,144</td>
<td valign="top" width="115">21,454 seconds</td>
<td valign="top" width="118">43 seconds</td>
<td valign="top" width="132">8 seconds</td>
</tr>
</tbody>
</table>
<p>All of a sudden, our O(N^2) algorithm is looking very, very good – much better than any of our other algorithms!</p>
<p>What can we take from this?&#160; There are a few important points:</p>
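<p>The quicksort remark can be checked directly by counting comparisons.&#160; The sketch below is a deliberately naive quicksort (Lomuto partition, last element as pivot), chosen for illustration and not taken from any particular library; the function names and the seed parameter are assumptions of this example.</p>

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Naive quicksort (Lomuto partition, last element as pivot) that counts
// element comparisons. Sorts the half-open range v[lo, hi) in place.
void quicksort(std::vector<int>& v, std::size_t lo, std::size_t hi,
               std::uint64_t& comparisons) {
    if (hi - lo < 2) return;
    const int pivot = v[hi - 1];
    std::size_t store = lo;
    for (std::size_t i = lo; i + 1 < hi; ++i) {
        ++comparisons;
        if (v[i] < pivot) {
            std::swap(v[i], v[store]);
            ++store;
        }
    }
    std::swap(v[store], v[hi - 1]);  // move the pivot into its final slot
    quicksort(v, lo, store, comparisons);
    quicksort(v, store + 1, hi, comparisons);
}

// Comparisons used on already-sorted input: the worst case for this pivot
// choice, exactly n * (n - 1) / 2.
std::uint64_t comparisons_sorted_input(std::size_t n) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    std::uint64_t comparisons = 0;
    quicksort(v, 0, v.size(), comparisons);
    return comparisons;
}

// Comparisons used when the same values arrive in shuffled order.
std::uint64_t comparisons_shuffled_input(std::size_t n, unsigned seed) {
    std::vector<int> v(n);
    std::iota(v.begin(), v.end(), 0);
    std::mt19937 rng(seed);
    std::shuffle(v.begin(), v.end(), rng);
    std::uint64_t comparisons = 0;
    quicksort(v, 0, v.size(), comparisons);
    return comparisons;
}
```

<p>For 1,000 elements the sorted input costs 499,500 comparisons, the quadratic worst case.&#160; The shuffled count depends on the seed and the standard library, but it should land far lower, in N log N territory, which is the gap between upper bound and typical behavior described above.</p>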
<ul>
<li>Big-O Notation can provide critical information about the scalability of an algorithm.&#160; Coupled with information about the algorithm itself, this can help us decide which of multiple algorithms will be the “best” choice.</li>
<li>Big-O Notation <strong>does not provide information about speed</strong>.&#160; Factors in the algorithm itself, including total number of instructions required, are not included as part of this notation.</li>
<li>Knowing the Big-O notation for a method, or doing your own <a href="http://algo.inria.fr/AofA/" target="_blank">analysis of the algorithm</a>, is not a substitute for profiling.&#160; “Bad” routines in terms of their Big-O notation can outperform “good” routines, in certain situations.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://reedcopsey.com/2009/09/18/about-big-o-notation-the-good-the-bad-and-the-confusing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visual Studio 2010 Beta</title>
		<link>http://reedcopsey.com/2009/05/20/visual-studio-2010-beta/</link>
		<comments>http://reedcopsey.com/2009/05/20/visual-studio-2010-beta/#comments</comments>
		<pubDate>Wed, 20 May 2009 17:29:12 +0000</pubDate>
		<dc:creator>Reed</dc:creator>
				<category><![CDATA[.NET]]></category>
		<category><![CDATA[C#]]></category>
		<category><![CDATA[C++]]></category>

		<guid isPermaLink="false">http://reedcopsey.com/?p=44</guid>
		<description><![CDATA[Microsoft just released Visual Studio 2010 Beta to the public at large.&#160; Now we just have to wait for the final release…]]></description>
			<content:encoded><![CDATA[<p>Microsoft just released <a href="http://www.microsoft.com/visualstudio/en-us/products/2010/default.mspx" target="_blank">Visual Studio 2010 Beta</a> to the public at large.&#160; Now we just have to wait for the final release…</p>
]]></content:encoded>
			<wfw:commentRss>http://reedcopsey.com/2009/05/20/visual-studio-2010-beta/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
 
