When done with floating point numbers, it might be performed with two roundings (typical in many DSPs), or with a single rounding. When performed with a single rounding, it is called a fused multiply–add (FMA) or fused multiply–accumulate (FMAC).
An FMA instruction carries out the two operations in one step, and does them in "infinite precision". Notably, an FMA does only one rounding at the end instead of the sequence expressed by the source code, which is: 1) multiplying, 2) rounding the result, 3) adding, 4) rounding the result. So the source code expresses two rounding steps, while the FMA performs just one.
So not only is it faster, but it's also more accurate.
However, one can easily encounter cases (like the two cases illustrated above) in which doing operations without the intermediate rounding step will give you trouble.
Let's look again at the first example:
const double scale = 1.0 / i;
const double r = 1.0 - i * scale;
assert(r >= 0);
If i = 5, then scale is 0.200000000000000011102230246251565404236316680908203125, and r is negative (about -5.55e-17) when using an FMA. The point is that i * scale did not get rounded in an intermediate step.
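The difference between the two rounding behaviours can be reproduced without relying on compiler flags. The following is a minimal sketch, not code from Qt: the helper names are hypothetical, std::fma stands in for the contraction a compiler might perform, and a volatile store is used (as an assumption about how to keep the optimizer out of the way) to force the intermediate rounding.

```cpp
#include <cmath>

// Both helpers are hypothetical names used only for this illustration.

// Two roundings: the product is rounded to double before the subtraction.
// The volatile store keeps the compiler from contracting the two operations.
double residualSeparate(double i)
{
    const double scale = 1.0 / i;
    volatile double product = i * scale; // rounded to double here
    return 1.0 - product;
}

// One rounding: std::fma computes 1.0 - i * scale as if in infinite
// precision and rounds only the final result.
double residualFused(double i)
{
    const double scale = 1.0 / i;
    return std::fma(-i, scale, 1.0);
}
```

For i = 5, residualSeparate(5.0) yields 0.0, while residualFused(5.0) yields -2^-54 (about -5.55e-17), which is the negative value described above.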
In the second example,
const double result = std::sqrt(a*a - b*b);
the argument to sqrt, a*a - b*b, can be turned into FMA(a, a, -b*b). If a == b, then this expression is equivalent to FMA(a, a, -a*a). The problem is that, if a*a done in "infinite precision" is strictly less than the rounded product of a*a, then the result will again be a negative number (not 0!) passed into sqrt. This is very easy to obtain (example on Compiler Explorer)!
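A concrete input makes this visible. The sketch below uses hypothetical helper names and again uses std::fma to stand in for what a contracting compiler may emit; a = b = 0.1 is picked here as an illustrative value for which the rounded product is larger than the exact one.

```cpp
#include <cmath>

// Hypothetical helper names, for illustration only.

// What the source code expresses: both products are rounded before subtracting.
// The volatile stores force each product to be rounded to double.
double differenceOfSquaresSeparate(double a, double b)
{
    volatile double aa = a * a;
    volatile double bb = b * b;
    return aa - bb;
}

// What a contracting compiler may emit: FMA(a, a, -b*b).
// Only b*b is rounded; a*a enters the subtraction unrounded.
double differenceOfSquaresFused(double a, double b)
{
    const double bb = b * b;
    return std::fma(a, a, -bb);
}
```

With a = b = 0.1, the separate version returns exactly 0.0, while the fused version returns a tiny negative number, so passing it to std::sqrt produces NaN.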
Rounding in C++
For me, the interesting question is, "Is the compiler allowed to do these manipulations, since they affect the rounding as expressed by the source code?"
Within the context of one expression, compilers can use as much precision as they want. This is allowed by [expr.pre]/6 and similar paragraphs:
The values of the floating-point operands and the results of floating-point expressions may be represented in greater precision and range than that required by the type; the types are not changed thereby.
with a footnote that says:
The cast and assignment operators must still perform their specific conversions as described in [expr.type.conv], [expr.cast], [expr.static.cast] and [expr.ass].
Now suppose that one turns
const double scale = 1.0 / i;
const double r = 1.0 - i * scale;
into separate expressions and statements:
const double scale = 1.0 / i;
const double tmp = i * scale;
const double r = 1.0 - tmp;
Here, in theory, the source code mandates that tmp is rounded; so a compiler cannot do an FMA when calculating r.
In practice, compilers violate the standard and apply FMA. :-)
In the literature, these substitutions are called "floating-point contractions." Let's read what the GCC manual has to say about them:
By default, -fexcess-precision=fast is in effect; this means that operations may be carried out in a wider precision than the types specified in the source if that would result in faster code, and it is unpredictable when rounding to the types specified in the source code takes place.
(Emph. mine)
Hence, you can turn these optimizations off by compiling under -std=c++XX, not -std=gnu++XX (the default). If you try to use -fexcess-precision=standard, then GCC lets you know that:
cc1plus: sorry, unimplemented: '-fexcess-precision=standard' for C++
The Origin
Where does all this nonsense come from? The first testcase is actually out of Qt Quick rendering code. It has been lingering around for a decade!
Qt Quick wants to give a unique Z value to each element of the scene. These Z values are then going to be used by the underlying graphics stack (GL, Vulkan, Metal) as the depth of the element. This allows Qt Quick to render a scene using the ordinary depth testing that a GPU provides.
The Z values themselves have no intrinsic meaning, as long as they establish an order. That's why they're simply picked to be equidistant in a given range (simplest strategy that maximizes the available resolution).
Now, the underlying 3D APIs want a depth coordinate precisely in [0.0, 1.0]. So that's picked as the range and then inverted (going from 1.0 to 0.0) because, for various reasons, Qt Quick wants to render back-to-front (smaller depth means "closer to the camera", i.e. on top in the Qt Quick scene).
When the bug above gets triggered, the topmost element of the scene doesn't get rendered at all. That is because its calculated Z value is negative; instead of being the "closest to the camera" (it's the topmost element in the scene), the 3D API will think the object ended up being behind the camera and will cull it away.
So why didn't anyone notice in the last 10 years? On one hand, it's because no one seems to compile Qt with aggressive compiler optimizations enabled. For instance, on x86-64 one needs to opt in to FMA instructions; on GCC, you need to pass -march=haswell or higher. On ARM(64), this manifests more "out of the box" since ARMv7/ARMv8 have FMA instructions.
On the other hand, it's because, by accident, everything works fine on OpenGL. Unlike other 3D graphics APIs, on OpenGL the depth range in Normalized Device Coordinates is from -1 to +1, and not 0 to +1. So even a (slightly) negative value for the topmost element is fine. If one peeks at an OpenGL call trace (using apitrace or similar tools), one can clearly see the negative Z being set.
Only on a relatively more recent combination of components does the bug manifest itself, for instance, on Mac:
Qt 6 ⇒ Metal as graphics API (through RHI)
ARMv8 ⇒ architecture with FMA
Clang 14 in the latest Xcode ⇒ enables FP contractions by default
Windows and Direct3D (again through RHI) are also, in theory, affected, but MSVC does not generate FMA instructions at all. On Linux, including embedded Linux (e.g. running on ARM), most people still use OpenGL and not Vulkan. Therefore, although GCC has floating-point contractions enabled by default, the bug doesn't manifest itself.
Definitely an interesting one to research; many kudos to the original reporter. The proposed fix was simply to clamp the values to the wanted range. I'm not sure if one can find a numerical solution that works in all cases.
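The clamping approach can be sketched as follows. This is not the actual Qt Quick code: the function name, the index convention (1 for the bottom-most element, count for the topmost) and the use of std::fma to simulate the worst case of a contracting compiler are all assumptions made for illustration. It requires C++17 for std::clamp.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch, not the actual Qt Quick code: maps element `index`
// (1 = bottom-most, `count` = topmost) to a depth in [0.0, 1.0].
double depthForElement(int index, int count)
{
    const double scale = 1.0 / count;
    // std::fma stands in for a compiler-generated contraction (worst case),
    // so z may come out slightly negative for the topmost element.
    const double z = std::fma(-static_cast<double>(index), scale, 1.0);
    // Clamp to the range the 3D API accepts.
    return std::clamp(z, 0.0, 1.0);
}
```

With five elements, depthForElement(5, 5) returns 0.0 instead of a slightly negative value, and lower elements still get strictly larger depths.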
About KDAB
Trusted software excellence across embedded and desktop platforms
The KDAB Group is a globally recognized provider for software consulting, development and training, specializing in embedded devices and complex cross-platform desktop applications. In addition to being leading experts in Qt, C++ and 3D technologies for over two decades, KDAB provides deep expertise across the stack, including Linux, Rust and modern UI frameworks. With 100+ employees from 20 countries and offices in Sweden, Germany, USA, France and UK, we serve clients around the world.
Giuseppe D’Angelo
Senior Software Engineer
Senior Software Engineer at KDAB. Giuseppe is a long-time contributor to Qt, having used Qt and C++ since 2000, and is an Approver in the Qt Project. His contributions in Qt range from containers and regular expressions to GUI, Widgets, and OpenGL. Passionate about free software and a UNIX specialist, he organized conferences on open source around Italy before joining KDAB. He holds a BSc in Computer Science.
4 Comments
25 - Feb - 2023
Stefan Brüns
You don't mention anything that requires the topmost layer to be at z = 0, so add one layer, and don't use the topmost one:
25 - Feb - 2023
Syam Krishnan
Isn't this a general effect of floating point rounding and not particularly related to FMA? I mean, for:
Wasn't it always possible that scale ends up slightly higher than 'actual' 1/i value? In that case, i*tmp could end up being larger than 1.0.
I had seen a similar effect in one of our applications several years back and had done some research. IIRC, there were some processor flags (on x86) that determined if a value could be rounded up/down. We had different behaviour between gcc/Linux and MSVC++6 on Windows. Even on Windows, the behaviour changed depending on whether we had a service pack installed or not.
25 - Feb - 2023
Giuseppe D'Angelo
Hi,
That's what makes the problem funny. You can try experimentally for any integer i, and you will never find one that makes the example fail... unless you enable FMA. https://gcc.godbolt.org/z/a7Wcq5qoq
So "wasn't it always possible", sure, except that FMA wasn't enabled by default. Is the compiler allowed to use it? The Standard seems to say no, the compilers say "we don't care" :-)
12 - Jun - 2023
Michael Winking
This post reminded me of two things:
First, didn't ARM have two multiply-accumulate instructions? One with an intermediate rounding step (perfect for compiler use without affecting precision) and one without. And yes, a quick scan of the documentation reveals that there is indeed "VMLA" and "VFMA". But unfortunately this one didn't make it through the ARMv7 to ARMv8 transition. On the ARM web site there's the following note with regards to ARMv8: "All floating-point Multiply-Add and Multiply-Subtract instructions are fused." (https://developer.arm.com/documentation/den0024/a/AArch64-Floating-point-and-NEON/New-features-for-NEON-and-Floating-point-in-AArch64). That's a pity!
Second, I came up with a way to prevent the compiler doing such optimizations in the past. The case was slightly different (the compiler constant folded the absolute address for some indexed operations and then put all the immediate load instructions into my tight loop where indexed load and store instructions would have been more effective).
What is needed is some kind of barrier over which the optimizer can't jump to combine different expressions. The gcc and Clang "asm" statement can be used to this effect. The trick is to have no actual assembly code inside the statement but use the constraint modifiers to pretend that this statement modifies the variable where we want the barrier to be. As the compiler doesn't look into the assembly itself it can no longer rely on information specific to that variable that comes before that "asm" statement for optimization. Hence no fusion.
Something like the following should do in your case (x86):
const double scale = 1.0 / i;
double tmp = i * scale;
asm("" : "+x" (tmp)); // cloak the value of tmp
const double r = 1.0 - tmp;
Of course this might have downsides. It might affect other optimizations (I saw clang no longer unrolling the loop, though this might have little effect), so one should at least inspect the generated assembly to gain some confidence.
With gcc there is also "__builtin_assoc_barrier()" and "const double r = 1.0 - __builtin_assoc_barrier(i * scale);" works too. Though as the name suggests this one has a different purpose (associativity) and hence it might be risky to rely on it here as MAC fusion isn't about associativity.
Ideally there would be some kind of helper function in the standard library for that purpose, but I couldn't find anything.
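The empty-asm barrier from the last comment could be wrapped in a small helper along these lines. This is only a sketch under stated assumptions: roundingBarrier and residual are hypothetical names, the "+x" constraint is the one given in the comment for x86 with SSE, the "+w" constraint for AArch64 floating-point registers is an addition not taken from the comment, and the volatile fallback is a generic substitute for other compilers and targets.

```cpp
// Hypothetical helper: returns `value` unchanged, but hides its origin from
// the optimizer so a preceding multiplication cannot be fused with a
// following addition or subtraction.
inline double roundingBarrier(double value)
{
#if (defined(__GNUC__) || defined(__clang__)) && defined(__SSE2__)
    asm("" : "+x"(value));          // x86 with SSE2: value sits in an SSE register
#elif (defined(__GNUC__) || defined(__clang__)) && defined(__aarch64__)
    asm("" : "+w"(value));          // AArch64: value sits in an FP/SIMD register
#else
    volatile double stored = value; // fallback: force a store and a reload
    value = stored;
#endif
    return value;
}

// The first example from the article, with the barrier applied to the product.
double residual(double i)
{
    const double scale = 1.0 / i;
    const double tmp = roundingBarrier(i * scale);
    return 1.0 - tmp;
}
```

As the comment points out, a barrier like this can inhibit other optimizations, so inspecting the generated assembly remains advisable.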