Custom C++ allocator far too slow in GCC in debug. Is there a fix for this?

Reason for the performance decrease

gcc’s libstdc++ takes certain fast paths if the allocator is std::allocator. Your CustomAllocatorType is a different type than std::allocator, so these fast paths are disabled. Note that I am not talking about compiler optimizations but rather about overloads and specializations that the C++ standard library implements specifically for std::allocator. To name an example relevant to your code, std::vector::resize() internally calls __uninitialized_default_n_a(), which has a special overload for std::allocator. The special overload bypasses the allocator entirely. If you use CustomAllocatorType, the generic version is used, which calls the allocator for every single element. This costs a lot of time. Another function with a special overload that is relevant to your simple code example is _Destroy().

Put another way, gcc’s implementation of the C++ standard library has some measures implemented to ensure that optimal code is generated in cases where it is known that it is safe. This works regardless of compiler optimizations. If the non-optimized code paths are taken and you enable compiler optimizations (eg -O3), the compiler is often able to recognize patterns in the non-optimal code (such as initializing successive trivial elements) and can optimize everything away so that you end up with the same instructions (more or less).
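To make the per-element cost visible, here is a minimal sketch. CountingAllocator, g_construct_calls and count_construct_calls are hypothetical names made up for this illustration and are not the CustomAllocatorType from the question. The allocator counts how often the container asks it to construct an element, which is the work the std::allocator fast path skips.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <utility>
#include <vector>

// Hypothetical counter: number of construct() calls made by the container.
static std::size_t g_construct_calls = 0;

// Hypothetical allocator for illustration only (does not inherit from std::allocator).
template <typename T>
struct CountingAllocator {
    using value_type = T;

    CountingAllocator() = default;
    template <typename U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }

    // The container calls this once for every element it creates.
    template <typename U, typename... Args>
    void construct(U* p, Args&&... args) {
        ++g_construct_calls;
        ::new (static_cast<void*>(p)) U(std::forward<Args>(args)...);
    }
};

template <typename T, typename U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

// Resizes an empty vector to n elements and reports how many construct() calls that took.
std::size_t count_construct_calls(std::size_t n) {
    g_construct_calls = 0;
    std::vector<int, CountingAllocator<int>> buffer;
    buffer.resize(n);
    return g_construct_calls;
}

// Self-check: one allocator call per element, even for a trivial type like int.
void check_counting() { assert(count_construct_calls(1000) == 1000); }
```

Without compiler optimizations each of these calls is a real function call, which is where the debug-build time goes.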

C++20 vs C++17 and why your CustomAllocatorType is broken

As noted in the comments, the performance decrease when using CustomAllocatorType only occurs in C++20 but not in C++17. To understand why, note that gcc’s std::vector implementation does not directly use the Allocator from the declaration std::vector<T,Allocator> as allocator, ie in your case CustomAllocatorType. Rather, it uses std::allocator_traits<Allocator>::rebind_alloc<T> (see here and here). Also see eg this post about rebind for some more information.

Since you did not define a specialization std::allocator_traits<CustomAllocatorType>, it uses the generic one. The standard says:

rebind_alloc: Alloc::rebind<T>::other if present, otherwise Alloc<T, Args> if this Alloc is Alloc<U, Args>

Ie the generic one attempts to delegate to your allocator, if possible. Now, your allocator CustomAllocatorType inherits from std::allocator. And here comes the important difference between C++17 and C++20: std::allocator::rebind was removed in C++20. Hence:

  • C++17: CustomAllocatorType::rebind is inherited from std::allocator and thus defined, and its other member is std::allocator. Therefore, std::allocator_traits<CustomAllocatorType>::rebind_alloc is std::allocator, meaning that std::vector ends up actually using std::allocator instead of CustomAllocatorType. If you pass in a CustomAllocatorType instance in the std::vector constructor, you end up with slicing.
  • C++20: CustomAllocatorType::rebind is not defined. Thus, std::allocator_traits<CustomAllocatorType>::rebind_alloc is CustomAllocatorType and std::vector ends up using CustomAllocatorType.

So the C++17 version uses std::allocator and thus enjoys the library based optimizations described above, while the C++20 version does not.
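The difference can be checked at compile time. The following is a minimal sketch; InheritingAllocator is a hypothetical stand-in for the allocator from the question (it inherits from std::allocator and declares no rebind of its own), and ReboundAllocator is a helper alias introduced only for this illustration.

```cpp
#include <memory>
#include <type_traits>

// Hypothetical stand-in for the allocator from the question.
template <typename T>
struct InheritingAllocator : std::allocator<T> {};

// The allocator type a container would end up using for a given allocator A.
template <typename A>
using ReboundAllocator =
    typename std::allocator_traits<A>::template rebind_alloc<typename A::value_type>;

// True if rebinding keeps the custom allocator type.
constexpr bool rebind_keeps_custom_type =
    std::is_same<ReboundAllocator<InheritingAllocator<int>>,
                 InheritingAllocator<int>>::value;

// True if rebinding silently falls back to std::allocator.
constexpr bool rebind_yields_std_allocator =
    std::is_same<ReboundAllocator<InheritingAllocator<int>>,
                 std::allocator<int>>::value;
```

Compiled with -std=c++17, rebind_yields_std_allocator should be true. Compiled with -std=c++20, rebind_keeps_custom_type should be true instead.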

Your code is simply incorrect, or at least the C++17 version is. std::vector does not use your allocator at all in C++17. You can also see this if you attempt to call buffer.get_allocator() in your example: it fails to compile in C++17 because it tries to convert std::allocator (as used internally) to CustomAllocatorType.

I think the correct way to fix the issue is to define CustomAllocatorType::rebind instead of specializing std::allocator_traits (see here and here), like so:

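#include <memory>

template<typename T>
class CustomAllocatorType: public std::allocator<T>
{
public:  // rebind must be accessible, otherwise std::allocator_traits silently ignores it
  template< class U > struct rebind {
    typedef CustomAllocatorType<U> other;
  };
};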

Of course, doing so means that the C++17 version will be slow in debug, but it will actually work.
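A quick way to confirm that a rebind member takes effect is to let the compiler check the rebound type and to call get_allocator(). This is a minimal sketch; ReboundCustomAllocator, Buffer, allocator_of and check_rebound_allocator are hypothetical names used only here, and the converting constructor is an addition that allocators are generally expected to provide.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical allocator: still inherits from std::allocator, but declares a public rebind.
template <typename T>
struct ReboundCustomAllocator : std::allocator<T> {
    ReboundCustomAllocator() = default;

    // Converting constructor so allocators for different element types can be built from each other.
    template <typename U>
    ReboundCustomAllocator(const ReboundCustomAllocator<U>&) noexcept {}

    template <typename U>
    struct rebind { using other = ReboundCustomAllocator<U>; };
};

using Buffer = std::vector<int, ReboundCustomAllocator<int>>;

// Rebinding now yields the custom type under both C++17 and C++20.
static_assert(
    std::is_same<std::allocator_traits<ReboundCustomAllocator<int>>::rebind_alloc<int>,
                 ReboundCustomAllocator<int>>::value,
    "rebind_alloc should keep the custom allocator type");

// Compiles because the vector really stores the custom allocator.
ReboundCustomAllocator<int> allocator_of(const Buffer& buffer) {
    return buffer.get_allocator();
}

// Self-check: the container is usable and hands back the custom allocator.
void check_rebound_allocator() {
    Buffer buffer(3);
    buffer.push_back(1);
    assert(buffer.size() == 4);
    ReboundCustomAllocator<int> a = allocator_of(buffer);
    (void)a;
}
```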

I think this also shows again the general rule that inheriting from C++ standard library types is usually a bad idea. If CustomAllocatorType did not inherit from std::allocator, the problem would not appear in the first place (also because you’d then need to think about how to set up the elements correctly).
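For comparison, an allocator that does not inherit from std::allocator needs surprisingly little. The sketch below uses hypothetical names (StandaloneAllocator, check_standalone_allocator) and simply forwards to operator new and operator delete; a real allocator would put its own memory management there. Since there is no inherited rebind, std::allocator_traits falls back to replacing the first template argument, so the custom type is kept in C++17 and C++20 alike.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <vector>

// Hypothetical minimal allocator without any std::allocator base class.
template <typename T>
struct StandaloneAllocator {
    using value_type = T;

    StandaloneAllocator() = default;
    template <typename U>
    StandaloneAllocator(const StandaloneAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

template <typename T, typename U>
bool operator==(const StandaloneAllocator<T>&, const StandaloneAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const StandaloneAllocator<T>&, const StandaloneAllocator<U>&) noexcept { return false; }

// No rebind member is needed: allocator_traits derives the rebound type itself.
static_assert(
    std::is_same<std::allocator_traits<StandaloneAllocator<int>>::rebind_alloc<long>,
                 StandaloneAllocator<long>>::value,
    "rebinding keeps the custom allocator template");

// Self-check: the allocator works with std::vector as-is.
void check_standalone_allocator() {
    std::vector<int, StandaloneAllocator<int>> values(4, 7);
    assert(values.size() == 4 && values[3] == 7);
}
```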

Improving performance

Assuming the allocator was fixed for C++17 or you use C++20, you get the bad performance in debug because the library implementation uses the generic versions of the above mentioned functions to fill and destroy data. Unfortunately, all of this is an implementation detail of the library, meaning that there is no nice standard way to enforce the generation of good code.

Hacky solution

A hack that works in your trivial example (and probably only there!) would be to define custom overloads of the functions in question, eg:

#include <bits/stl_uninitialized.h>
#include <cstdint>
#include <cstdlib>

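// Forward declaration so that the overloads below can name the allocator
// (harmless if CustomAllocatorType is already defined at this point).
template<typename T> class CustomAllocatorType;

// Must be defined BEFORE including <vector>!
namespace std{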
  template<typename _ForwardIterator, typename _Size, typename _Tp>
  inline _ForwardIterator
  __uninitialized_default_n_a(_ForwardIterator __first, _Size __n, CustomAllocatorType<_Tp>&)
  { return std::__uninitialized_default_n(__first, __n); }


  template<typename _ForwardIterator, typename _Tp>
  _GLIBCXX20_CONSTEXPR inline void
  _Destroy(_ForwardIterator __first, _ForwardIterator __last, CustomAllocatorType<_Tp>&) {
    _Destroy(__first, __last);
  }
}

These are copied from gcc’s std::allocator overloads (here and here), but overloaded for CustomAllocatorType. More special overloads would be required (eg for is_copy_constructible and is_move_constructible or __relocate_a_1, no idea how many more). Defining the above two functions before the include of <vector> leads to decent performance in debug. At least it does so for me locally using gcc 11.2. For some reason unknown to me, it does not work on Quick Bench (also compare my comments above on the original post).

This hack is awful on multiple levels:

  • It is absolutely non-standard. It only works with libstdc++ and can break at any up- or downgrade of the library version.
  • You also need to ensure that the overloads are defined before the <vector> header is included, because otherwise they will not be picked up. Also, if you mess this up in some places, you might violate the one-definition-rule (which will probably lead to more weirdness).
  • The example hack assumes that constructing and destroying elements does not require going through CustomAllocatorType, just like with std::allocator. I highly doubt that this holds for your true CustomAllocatorType implementation. But maybe you could actually implement eg __uninitialized_default_n_a() properly and more efficiently for your CustomAllocatorType by calling an appropriate function on your allocator.

I do not recommend doing this. But depending on the use case, it might be a viable solution.

Alternative ideas

Enabling optimizations locally (#pragma GCC optimize ("-O3") etc) is rather unreliable. It did not work for me. The most likely reason is that the optimization flag is not propagated to the instantiation of std::vector because its definition is somewhere else entirely. You’d probably need to compile the C++ standard library headers themselves with optimizations.

Another idea would be to use a different container library. For example, boost has a vector class. I have not checked if its debug performance would be better.
