then you should read on.
Before my change, the QStringList::removeDuplicates()
function used a QSet
to keep track of "seen" strings, so it could remove already-seen strings from the list the second time they are encountered, thereby removing all duplicate strings from the list.
There are quite a few problems with this approach.
Double Lookup
The contains()
+ insert()
anti-pattern performs two look-ups for the same key, because the position that contains()
internally has calculated is lost by the time contains()
returns and thus has to be re-calculated in insert()
again. We know that the position will be the same, but the compiler doesn't. The STL associative containers return a pair (boooooo!) of iterator
and bool
, with the latter indicating whether the item was inserted (true
) or was found to pre-exist (false
). Since we want to insert anyway, we can just attempt the insert()
and check whether it changed something:
While this is still quite a mouthful, we've managed to reduce the look-ups by half.
The Qt containers lack this convenient API, of course, instead forcing the user to check whether the size of the container changed, like this:
It's odd, but works, too.
Allocations, Allocations
While QSet
or std::unordered_set
are fast once constructed, they both allocate a new node per element inserted. So in the usual case where you don't have (many) duplicates, you're performing O(N) memory allocations, N being the number of candidates to check.
How nice it would be if we could combine the speed of a hash table with the cache-friendliness of, say, a QVarLengthArray
. Oh, wait. We can!
In C++17, we finally got some of the Bloomberg Allocators merged and, while quite a few implementations are still behind implementing them, that shouldn't deter us, thanks to feature test macros.
Taking a look at the std::pmr
namespace, we find a few template aliases that appear to implant a non-standard allocator, std::pmr::polymorphic_allocator
, into the usual range of STL containers, including vector
and unordered_set
.
If you want to know all the gory details of that class, book one of KDAB's Modern C++ trainings. Suffice to say that std::pmr::polymorphic_allocator
implements an allocator that gets memory from a std::pmr::memory_resource
, which is an interface you can inherit to implement your own custom allocation strategies.
Enter monotonic_buffer_resource
But we don't even need to do that. We just need to plug one of the existing memory_resource
s, std::pmr::monotonic_buffer_resource
, into std::pmr::unordered_set
.
The monotonic_buffer_resource
class implements a grow-only memory pool over a static char
buffer where deallocation is a no-op and allocation just ups a pointer. In case the initial buffer is exhausted, monotonic_buffer_resource
allocates a new block and continues to allocate from there. If that's exhausted too, it allocates an even larger block, etc.
This allocator is as fast as it gets, but it isn't thread-safe and it never frees memory (until destroyed itself). But that's perfect for building temporary containers whose lifetimes are bound by a function.
The setup is a bit verbose, due to the flexibility of the std::pmr
framework. But it's always the same:
Now the first 1 KiB of memory is allocated on the stack, and only then do we go to the heap. Passing the expected number of elements to the unordered_set
constructor (that's like a reserve()
, not a resize()
!) is very important here because, while a re-hash wouldn't take the nodes out of buffer, it would waste memory for the new bucket list because monotonic_buffer_resource
doesn't re-use freed-up memory again, thus creating gaps in our precious, buffer
.
Tying It All Together
So, we have three things that we need to keep in mind:
- Avoid double lookups,
- Use
std::pmr::monotonic_buffer_resource
to avoid memory allocations, - ...but fall back to
QSet
or std::unordered_set
when std::pmr
isn't available.
Then, it's just a question of writing a good hash function (which has recently become much simpler).
QDuplicateTracker
(private API) bound these together for the first time, with a much nicer API:
Once you know which patterns to look for, the use-cases seem endless: Qt changes in topic "qduplicatetracker".
I don't recommend using QDuplicateTracker from Qt 5 anymore though, since, as I mentioned, I have taken this class through another two iterations in customer projects, finally culminating in the addition to KDToolBox. That's the version you should be using.
Alternatively, it should be possible, even in Qt projects, to use the Qt 6 version, which I have recently improved for the Qt 6.3 release next year, simply by copying its header file into your project and replacing the QHasher
internal struct (which depends on implementation details of Qt 6's QHash
container) with a template argument.
I invite you to study the source-code and post any questions that you may have in the comments below.
Happy duplicate tracking!
Trusted software excellence across embedded and desktop platforms
The KDAB Group is a globally recognized provider for software consulting, development and training, specializing in embedded devices and complex cross-platform desktop applications. In addition to being leading experts in Qt, C++ and 3D technologies for over two decades, KDAB provides deep expertise across the stack, including Linux, Rust and modern UI frameworks. With 100+ employees from 20 countries and offices in Sweden, Germany, USA, France and UK, we serve clients around the world.