Reason for the performance decrease
gcc’s libstdc++ uses certain performance improvements if the allocator is `std::allocator`. Your `CustomAllocatorType` is a different type than `std::allocator`, meaning that the optimizations are disabled. Note that I am not talking about compiler optimizations but rather that the C++ standard library implements overloads or specializations specifically for `std::allocator`. To name an example relevant to your example code, `std::vector::resize()` internally calls `__uninitialized_default_n_a()`, which has a special overload for `std::allocator`. The special overload bypasses the allocator entirely. If you use `CustomAllocatorType`, the generic version is used, which calls the allocator for every single element. This costs a lot of time. Another function with a special definition that is relevant to your simple code example is `_Destroy()`.
Put another way, gcc’s implementation of the C++ standard library has some measures implemented to ensure that optimal code is generated in cases where it is known to be safe. This works regardless of compiler optimizations. If the non-optimized code paths are taken and you enable compiler optimizations (eg `-O3`), the compiler is often able to recognize patterns in the non-optimal code (such as initializing successive trivial elements) and optimize everything away, so that you end up with (more or less) the same instructions.
C++20 vs C++17 and why your `CustomAllocatorType` is broken
As noted in the comments, the performance decrease when using `CustomAllocatorType` only occurs in C++20 but not in C++17. To understand why, note that gcc’s `std::vector` implementation does not use the `Allocator` from the declaration `std::vector<T, Allocator>` as allocator, ie in your case `CustomAllocatorType`. Rather, it uses `std::allocator_traits<Allocator>::rebind_alloc<T>` (see here and here). Also see eg this post about rebind for some more information.
Since you did not define a specialization `std::allocator_traits<CustomAllocatorType>`, it uses the generic one. The standard says:

> `rebind_alloc`: `Alloc::rebind<T>::other` if present, otherwise `Alloc<T, Args>` if this `Alloc` is `Alloc<U, Args>`

Ie the generic one attempts to delegate to your allocator, if possible. Now, your allocator `CustomAllocatorType` inherits from `std::allocator`. And here comes the important difference between C++17 and C++20: `std::allocator::rebind` was removed in C++20. Hence:
- C++17: `CustomAllocatorType::rebind` is inherited and thus defined, and is `std::allocator`. Therefore, `std::allocator_traits<CustomAllocatorType>::rebind_alloc` is `std::allocator`, meaning that `std::vector` ends up actually using `std::allocator` instead of `CustomAllocatorType`. If you pass a `CustomAllocatorType` instance to the `std::vector` constructor, you end up with slicing.
- C++20: `CustomAllocatorType::rebind` is not defined. Thus, `std::allocator_traits<CustomAllocatorType>::rebind_alloc` is `CustomAllocatorType`, and `std::vector` ends up using `CustomAllocatorType`.
So the C++17 version uses `std::allocator` and thus enjoys the library-based optimizations described above, while the C++20 version does not.

Your code is simply incorrect, or at least the C++17 version is: `std::vector` does not use your allocator at all in C++17. You can also notice this if you attempt to call `buffer.get_allocator()` in your example, which will fail to compile in C++17 because it tries to convert the `std::allocator` (as used internally) to `CustomAllocatorType`.
I think the correct way to fix the issue is to define `CustomAllocatorType::rebind` instead of specializing `std::allocator_traits` (see here and here), like so:

```cpp
template<typename T>
class CustomAllocatorType : public std::allocator<T>
{
public:
    // Must be public, otherwise std::allocator_traits cannot see it.
    template<class U> struct rebind {
        typedef CustomAllocatorType<U> other;
    };
};
```
Of course, doing so means that the C++17 version will be slow in debug but actually work.
I think this also shows again the general rule that inheriting from C++ standard library types is usually a bad idea. If `CustomAllocatorType` did not inherit from `std::allocator`, the problem would not appear in the first place (though then you would need to think about how to implement the allocator members correctly yourself).
Improving performance
Assuming the allocator is fixed for C++17, or you use C++20, you get the bad performance in debug because the library implementation uses the generic versions of the above-mentioned functions to fill and destroy data. Unfortunately, all of this is an implementation detail of the library, meaning that there is no nice, standard way to enforce the generation of good code.
Hacky solution
A hack that works in your trivial example (and probably only there!) would be to define custom overloads of the functions in question, eg:

```cpp
#include <bits/stl_uninitialized.h>
#include <cstdint>
#include <cstdlib>

// Must be defined BEFORE including <vector>!
namespace std {

template<typename _ForwardIterator, typename _Size, typename _Tp>
inline _ForwardIterator
__uninitialized_default_n_a(_ForwardIterator __first, _Size __n,
                            CustomAllocatorType<_Tp>&)
{ return std::__uninitialized_default_n(__first, __n); }

template<typename _ForwardIterator, typename _Tp>
_GLIBCXX20_CONSTEXPR inline void
_Destroy(_ForwardIterator __first, _ForwardIterator __last,
         CustomAllocatorType<_Tp>&)
{ _Destroy(__first, __last); }

} // namespace std
```
These are copy & paste from gcc’s `std::allocator` overloads (here and here), but overloaded for `CustomAllocatorType`. More special overloads would be required (eg for `is_copy_constructible` and `is_move_constructible` or `__relocate_a_1`; no idea how many more). Defining the above two functions before the include of `<vector>` leads to decent performance in debug. At least it does for me locally using gcc 11.2. For some reason unknown to me, it does not work on quick-bench (also compare my comments above on the original post).
This hack is awful on multiple levels:

- It is absolutely non-standard. It only works with libstdc++ and can break at any up- or downgrade of the library version.
- You also need to ensure that the overloads are defined before the `<vector>` header is included, because otherwise they will not be picked up. Also, if you mess this up in some places, you might violate the one-definition rule (which will probably lead to more weirdness).
- The example hack assumes that constructing and destroying elements does not require the use of `CustomAllocatorType`, just as with `std::allocator`. I highly doubt that this holds for your true `CustomAllocatorType` implementation. But maybe you could actually implement eg `__uninitialized_default_n_a()` properly and more efficiently for your `CustomAllocatorType` by calling an appropriate function on your allocator.

I do not recommend doing this. But depending on the use case, it might be a viable solution.
Alternative ideas
Enabling optimizations locally (`#pragma GCC optimize ("-O3")` etc) is rather unreliable. It did not work for me. The most likely reason is that the optimization flag is not propagated to the instantiation of `std::vector` because its definition is somewhere else entirely. You would probably need to compile the C++ standard library headers themselves with optimizations.
Another idea would be to use a different container library. For example, boost has a `vector` class (`boost::container::vector`). I have not checked whether its debug performance is better.