Are you a victim of premature pessimisation? Here’s a short definition from Herb Sutter:
Despite how amazing today’s compilers have become at generating code, humans still know more about the intended use of a function or class than can be specified by mere syntax. Compilers operate under a host of very strict rules that enforce correctness at the expense of faster code. What’s more, modern processor architectures sometimes compete with C++ language habits that have become ingrained in programmers from decades of previous best practice.
I believe that if you want to improve the speed of your code, you need to adopt habits that take advantage of modern compilers and modern processor architectures—habits that will help your compiler generate the best-possible code. Habits that, if you follow them, will generate faster code before you even start the optimisation process.
Here’s four habit-forming tips that are all about avoiding pessimisation and, in my experience, go a long way to creating faster C++ classes.
1) Make use of the (named-) return-value optimisation
According to Lawrence Crowl, (named-) return-value optimisation ((N)RVO) is one of the most important optimisations in modern C++. Okay—what is it?
Let’s start with plain return-value optimization (RVO). Normally, when a C++ method returns an unnamed object, the compiler creates a temporary object, which is then copy-constructed into the target object.
With RVO, the C++ standard allows the compiler to skip the creation of the temporary, treating both object instances—the one inside the function and the one assigned to the variable outside the function—as the same. This usually goes under the name of copy elision. But what is elided here is the temporary and the copy.
So, not only do you save the copy constructor call, you also save the destructor call, as well as some stack memory. Clearly, elimination of extra calls and temporaries saves time and space, but crucially, RVO is an enabler for pass-by-value designs. Imagine MyData
was a large million-by-million matrix. There mere chance that some target compiler could fail to implement this optimisation would make every good programmer shy away from return-by-value and resort to out parameters instead (more on those further down).
As an aside: don't C++ Move Semantics solve this? The answer is: no. If you move instead of copy, you still have the temporary and its destructor call in the executable code. And if your matrix is not heap-allocated, but statically sized, such as a std::array<std::array<double, 1000>, 1000>>
, moving is the same as copying. With RVO, you mustn't be afraid of returning by value. You must unlearn what you have learned and embrace return-by-value.
Named Return Value Optimization is similar but it allows the compiler to eliminate not just rvalues (temporaries), but lvalues (local variables), too, under certain conditions.
What all compilers these days (and for some time now) reliably implement is NRVO in the case where there is a single variable that is passed to every return, and declared at function scope as the first variable:
Sadly, many compilers, including GCC, fail to apply NRVO when you deviate even slightly from the basic pattern:
At least GCC fails to use NRVO for the second return statement in that function. The fix in this case is easy (go back to the first version), but it's not always that easy. It is an altogether sad state of affairs for a language that is said to have the most advanced optimisers available to it for compilers to fail to implement this very basic optimisation.
So, for the time being, get your fingers accustomed to typing the classical NRVO pattern: it enables the compiler to generate code that does what you want in the most efficient way enabled by the C++ standard.
If diving into assembly code to check whether a particular patterns makes your compiler drop NRVO isn't your thing, Thomas Brown provides a very comprehensive list of compilers tested for their NRVO support and I’ve extended Brown’s work with some additional results.
If you start using the NVRO pattern but aren’t getting the results you expect, your compiler may not automatically perform NRVO transformations. You may need to check your compiler optimization settings and explicitly enable them.
2) Return parameters by value whenever possible
This is pretty simple: don’t use “out-parameters”. The result for the caller is certainly kinder: we just return our value instead of having the caller allocate a variable and pass in a reference. Even if your function returns multiple results, nearly all of the time you’re much better off creating a small result struct that the function passes back (via (N)RVO!):
That is, instead of this:
This may seem surprising, even counter-intuitive, for programmers that cut their teeth on older x86 architectures. You’re just passing around a pointer instead of a big chunk of data, right? Quite simply, “out” parameter pointers force a modern compiler to avoid certain optimisations when calling non-inlined functions. Because the compiler can’t always determine if the function call may change an underlying value (due to aliasing), it can’t beneficially keep the value in a CPU register or reorder instructions around it. Besides, compilers have gotten pretty smart—they don’t actually do expensive value passing unless they need to (see the next tip). With 64-bit and even 32-bit CPUs, small structs can be packed into registers or automatically allocated on the stack as needed by the compiler. Returning results by value allows the compiler to understand that there isn’t any modification or aliasing happening to your parameters, and you and your callers get to write simpler code.
3) Cache member-variables and reference-parameters
This rule is straightforward: take a copy of the member-variables or reference-parameters you are going to use within your function at the top of the function, instead of using them directly throughout the method. There are two good reasons for this.
The first is the same as the tip above—because pointer references (even member-variables in methods, as they're accessed through the implicit this
pointer) put a stick in the wheels of the compiler’s optimisation. The compiler can’t guarantee that things don’t change outside its view, so it takes a very conservative (and in most cases wasteful) approach and throws away any state information it may have gleaned about those variables each time they’re used anew. And that’s valuable information that can help the compiler eliminate instructions and references to memory.
Another important reason is correctness. As an example provided by Lawrence Crowl in his CppCon 2014 talk “The Implementation of Value Types”, instead of this complex number multiplication:
You should prefer this version:
This second, non-aliased version will still work properly if you use value *= value
to square a number; the first one won’t give you the right value because it doesn’t protect against aliased variables.
To summarise succinctly: read from (and write to!) each non-local variable exactly once in every function.
4) Organize your member variables intelligently
Is it better to organize member variables for readability or for the compiler? Ideally, you pick a scheme that works for both.
And now is a perfect time for a short refresher about CPU caches. Of course data coming from memory is very slow compared to data coming from a cache. An important fact to remember is that data is loaded into the cache in (typically) 64-byte blocks called cache lines. The cache line—that is, your requested data and the 64 bytes surrounding it—is loaded on your first request for memory absent in the cache. Because every cache miss silently penalises your program, you want a well-considered strategy for ensuring you reduce cache misses whenever possible. Even if the first memory access is outside the cache, trying to structure your accesses so that a second, third, or forth access is within the cache will have a significant impact on speed. With that in mind, consider these tips for your member-variable declarations:
- Move the most-frequently-used member-variables first
- Move the least-frequently-used member-variables last
- If variables are often used together, group them near each other
- Try to reference variables in your functions in the order they’re declared
- Keep an eye out on alignment requirements of member-variables, lest you waste space on padding
Nearly all C++ compilers organize member variables in memory in the order in which they are declared. And grouping your member variables using the above guidelines can help reduce cache misses that drastically impact performance. Although compilers can be smart about creating code that works with caching strategies in a way that’s hard for humans to track, the C++ rules on class layout make it hard for compilers to really shine. Your goal here is to help the compiler by stacking the deck on cache-line loads that will preferentially load the variables in the order you’ll need them.
This can be a tough one if you’re not sure how frequently things are used. While it’s not always easy for complicated classes to know what member variables may be touched more often, generally following this rule of thumb as well as you can will help. Certainly for the simpler classes (string, dates/times, points, complex, quaternions, etc) you’ll probably be accessing most member variables most of the time, but you can still declare and access your member variables in a consistent way that will help guarantee that you’re minimizing your cache misses.
Conclusion
The bottomline is that it still takes some amount of hand-holding to get a compiler to generate the best code. Good coding-habits are by no means the end-all, but are certainly a great place to start.
18 Comments
21 - Jul - 2016
Eric Lemanissier
Interesting article. I am a bit surprised by rule 3 "Cache member-variables and reference-parameters". What is the point of passing a parameter by const reference if a copy is made anyway ? Also, what is variable t in the second version of the example ? It feels like "t." should have been "this->"
21 - Jul - 2016
Marc Mutz
In this case, passing in
a
by value would have been preferable, yes. Simple examples are usually too simple to be free of trivialities like that. In the general case, you might want to pass by cref to centralise the copy in a non-inline function, instead of cluttering client code with copy ctor calls, or you are not accessing all the variables of the passed object, only some.Also, it might not suffice to take a copy of the parameter. If you copy a
std::vector
inside the function, you're likely doing something wrong. But if you access the first element of it repeatedly, you should cache it in a temporary.You are correct. Will be fixed shortly.
10 - Nov - 2016
Olumide
I'm looking at the instructions generated by both versions of the function/operator, which I have named func1() and func2(). I am far from being an expert in assembly but both versions seem to generate the same number of instructions and memory references. https://godbolt.org/g/rX8o0R
21 - Jul - 2016
Eric Lemanissier
It would be very interesting to have some benchmark results showing the improvements gained by caching non-local reference, if you have any available.
22 - Jul - 2016
Marc Mutz
Even though probably not what you were looking for, I find https://codereview.qt-project.org/106093 particularly interesting, because the function is not entirely trivial and the savings are quite sizable (¼KiB for a single variable that's written to exactly twice is impressive in my book - ymmv). I find it also helps me understand what's going on, since I don't need to understand the effects of any of the functions called to track the value of
newStatus
.25 - Jul - 2016
Oriol
Very interesting article. I have one question, Marc, regarding the 4 tips for cashing class members. I agree with tips 3 and 4, i.e. grouping related members together. But I don't see why MRU members should go first, and LRU ones at the end. Essentially, it doesn't matter if your mostly used members are in one cache line or the next, right? I think that with tips 3 and 4, the result is the same, no matter of the ordering.
25 - Jul - 2016
Marc Mutz
The idea is twofold, and doesn't have anything to do with separate cachelines. In fact, if your object spans two cachelines, you have more than one spot for the most-often (not most-recently) used member: every cacheline width a new one. Details:
You have a
this
pointer, and that points the the first-declared sub-object. If you need to offset into the object, you have another addition to perform.Since a cacheline is wider than the memory word size, cachelines are filled in a particular order, usually starting with the word at the lowest address. That means that the first eight bytes of a cacheline (assuming a 64-bit memory bus) arrive in the cache first. If the data is located there, the CPU can continue work immediately. If the data the CPU waits for is in the second eight-byte-block, the latency is higher, in the third, even higher, etc.
Now, there's something called CWF (Critical Word First), which makes the memory-subsystem deliver the word the CPU is blocked on first, even though it's not the one at the lowest address. But AFAIK, CMIIW, that optimisation cannot arbitrarily reorder the words, so if the CPU first accesses data in the second word, then in the last word, it will profit from CWF for the first access but still take the latency hit on the second. Also, again, feel free to CMIIW, the hw prefetcher always loads cachelines in the standard order.
26 - Jul - 2016
Ivan Čukić
There is an important exception to the rule - functions like
getline
which would need always to allocate new blocks of memory if they returned new strings, instead of reusing previously allocated memory of a ref-passed output string.vs
For ref-passed, sometimes it will need reallocation, if the line we are currently reading is longer than the capacity of the string, but most of the time it will not be the case.
26 - Jul - 2016
Marc Mutz
I disagree. If you need to pass in capacity, do so by moving:
See also https://codereview.qt-project.org/157376 (last paragraph of commit message).
25 - Aug - 2016
Ion
I agree with Ivan,
Moving is not always free. A class with an internal mutex is not cheaply movable. For big classes moving requires non-trivial work. There are non-movable classes.
An in-out parameter is certainly uglier but more readable when multiple in-out parameters are required.
Return by value is overhyped.
22 - Aug - 2016
Mudassir
Very interesting article, however in tip 4, you haven't considered the memory alignment of the object which is usually depends on the order the member variables are declared.
What if the most used variables are not the multiple of 4-bytes, and hence compiler would add the padding bytes, and increase the size of the whole object.
This rule might be good for the special use case where the object with optimized memory layout is bigger than 64-byte and has the most accessed members in > 64-bytes.
22 - Aug - 2016
Marc Mutz
Of course, there will always be pathological situations in which following any of these tips makes code slower. That doesn't void the tips. Use common sense to find the exceptions :)
That said, yes, the tip was supposed to be about the caching, but the section title suggests a wider scope. Will add.
Thanks!
27 - Aug - 2016
Yehezkel
("Return parameters by value whenever possible" title should be numbered 2.)
7 - Feb - 2017
Edward Welbourne
2: Search for "parts.nonimator" and notice your typo ;^>
3: the complex-multiply code is broken even for x *= y, since the method updates real before using it in the computation of imag, which should be using the prior value of real anyway. Correct code, even when a isn't this, requires cacheing of real.
30 - May - 2021
Andy Gryc
Indeed, fixed! (finally :-)
29 - May - 2021
Adam
Old article, but I just can't help myself -- your 1st formula for complex multiplication actually produces the WRONG answer. The calculation (2nd line of code) for (this->) imag refers to (this->) real WHICH has ALREADY been changed by the line of code above it!!. This problem goes away with your 2nd version (with the parameter-copying).
30 - May - 2021
Andy Gryc
Of course, you're right - and this was also noted in the commend from Edward Welbourne back in 2017 which I didn't see until now.
31 - May - 2021
Marc Mutz
Well, the way it's fixed now, the first version also doesn't have an aliasing problem anymore, rendering the prose around pointless. For a real-life example, see https://codereview.qt-project.org/c/qt/qtbase/+/191706