In the beginning, there was C.
That sentence actually could serve as the introduction to a multitude of blog posts, all of which would come to the conclusion "legacy programming conventions are terrible, but realistically we can't throw everything out and start over from scratch". However, today we will merely be looking at two ways C has contributed to making language interoperability difficult.
extern "C", but for structs
In the first installment of this series, I mentioned that one blocker to language interoperability is struct layout. Specifically, different programming languages may organize data in structs in different ways. How can we overcome that on our way to language interoperability?
Layout differences are mostly differences of alignment, which means that data is just located at different offsets from the beginning of the struct. The problem is that there is not necessarily a way to use keywords like align to completely represent a different language's layout algorithm.
Thankfully, there is a solution. In our previous example, we were using Rust and C++ together. It turns out that Rust can use the #[repr(C)] representation to override its struct layout and follow what C does. Given that C++ uses the same layout as C, that means that the following code compiles and runs:
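As a rough illustration of what #[repr(C)] buys you, here is a minimal sketch of the Rust side only. The struct name `Sample`, its fields, and the helper `field_offsets` are assumptions made up for this example, and the offsets in the comments assume a mainstream target where `u32` is 4-byte aligned.

```rust
use std::mem::{align_of, size_of};

// Hypothetical struct for illustration. `#[repr(C)]` asks rustc to keep the
// fields in declaration order and to pad them the way a C compiler would.
#[repr(C)]
struct Sample {
    tag: u8,    // offset 0, then 3 bytes of padding (assuming 4-byte-aligned u32)
    value: u32, // offset 4
    count: u16, // offset 8, then 2 bytes of tail padding
}

/// Byte offset of each field from the start of the struct.
fn field_offsets(s: &Sample) -> (usize, usize, usize) {
    let base = s as *const Sample as usize;
    (
        &s.tag as *const u8 as usize - base,
        &s.value as *const u32 as usize - base,
        &s.count as *const u16 as usize - base,
    )
}

fn main() {
    let s = Sample { tag: 1, value: 2, count: 3 };
    let (tag, value, count) = field_offsets(&s);
    println!("offsets: tag={}, value={}, count={}", tag, value, count);
    println!(
        "size={}, align={}",
        size_of::<Sample>(),
        align_of::<Sample>()
    );
}
```

Without the attribute, rustc is free to reorder the fields, so the offsets would not be something a C++ translation unit could rely on.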
My proof-of-concept project polyglot automatically wraps C++ structs with #[repr(C)] (and also does so for enums).
The one major downside of this approach is that it requires you to mark structs that you created in your Rust code with #[repr(C)]. In an ideal world, there would be a way to leave your Rust code as is; however, I am not currently aware of any solution that does not require #[repr(C)].
Arrays, strings, and buffer overflows
Now that we've covered structs in general, we can look at the next bit of C behavior that turned out to be problematic: handling a list of items.
In C, a list of items is represented by an array. An array that has n elements of type T in it really is just a block of memory with a size n * sizeof(T). This means that all you have to do to find the kth object in the array (counting from zero) is take the address of the array and add k * sizeof(T). This seemed like a fine idea back in the early days of programming, but eventually people realized there was a problem: it's easy to accidentally access the seventh element of an array that only has five elements, and if you write something to the seventh element, congratulations, you just corrupted your program's memory! It's even more common to perform an out-of-bounds write when dealing with strings (which, after all, are probably the most used type of array). Out-of-bounds accesses, reads as well as writes, have led to countless security vulnerabilities, including the famous Heartbleed bug, which was an out-of-bounds read (you can see a good explanation of how Heartbleed works at xkcd 1354).
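The address arithmetic described above can be sketched in a few lines. This is only an illustration: the helper name `element_address` is made up, and the code computes addresses without ever dereferencing them, so it stays within safe Rust.

```rust
use std::mem::size_of;

/// Address of element `k`, computed the way C does it:
/// base address + k * sizeof(T). Nothing here knows the array's length.
fn element_address<T>(array: &[T], k: usize) -> usize {
    array.as_ptr() as usize + k * size_of::<T>()
}

fn main() {
    let values: [u32; 5] = [10, 20, 30, 40, 50];

    // For valid indices, the arithmetic matches where the elements really live.
    for k in 0..values.len() {
        let actual = &values[k] as *const u32 as usize;
        assert_eq!(element_address(&values, k), actual);
    }

    // The same arithmetic produces an address for the "seventh element"
    // (index 6) of a five-element array. In C, writing through that address
    // would silently corrupt whatever happens to be stored there.
    let base = element_address(&values, 0);
    let past_end = element_address(&values, 6);
    println!("index 6 would be {} bytes past the start", past_end - base);
}
```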
Eventually, people decided to fix this. In languages like Java, D, and pretty much any other language invented in the last 25 years or so, strings (and arrays) are handled more dynamically: reading from or writing to a string at an invalid location will generally throw an exception, staying in bounds is made easy by the addition of a length or size property, and strings and arrays in many modern languages can be resized in place. Meanwhile, C++, in order to add safer strings while remaining C-compatible, opted to build a class, std::string, that is used for strings in general (unless you use a framework like Qt that has its own string type).
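Rust is one example of this newer style. The sketch below assumes nothing beyond the standard library, and the helper name `read_checked` is made up for illustration. It shows the length travelling with the data, a checked read that refuses to go out of bounds, and containers that can grow after creation.

```rust
/// Bounds-checked read: returns `None` instead of touching memory past the end.
fn read_checked(items: &[i32], k: usize) -> Option<i32> {
    items.get(k).copied()
}

fn main() {
    let mut items = vec![1, 2, 3, 4, 5];

    // The length is part of the value, so staying in bounds is easy.
    println!("length: {}", items.len());
    println!("items[2]: {:?}", read_checked(&items, 2));

    // Asking for the seventh element of a five-element list is caught.
    println!("items[6]: {:?}", read_checked(&items, 6));

    // The list can grow after creation, at which point index 6 becomes valid.
    items.push(6);
    items.push(7);
    println!("after growing, items[6]: {:?}", read_checked(&items, 6));

    // Strings behave the same way.
    let mut text = String::from("Hello");
    text.push_str(", world");
    println!("{} ({} bytes)", text, text.len());
}
```

Plain indexing such as `items[6]` on the five-element list would panic rather than read stray memory, which is Rust's counterpart to the exceptions mentioned above.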
All of these new string types are nice, but they present a problem for interoperability: how do you pass a string from C++ to Rust (our example languages) and back again?
Wrap all the things!
The answer, unsurprisingly, is "more wrappers". While I have not built real-life working examples of wrappers for string types, what follows is an example of how seamless string conversion could be achieved.
We start with a C++ function that returns an std::string:
We'll also go ahead and create our Rust consumer:
Normally, we would just create a Rust shim around getLink() like so:
However, this doesn't work because Rust's String is different from C++'s std::string. To fix this, we need another layer of wrapping. Let's add another C++ file:
Now we have a C-style string. Let's try consuming it from Rust. We'll make a new version of links.rs:
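A minimal sketch of what such a shim could look like follows. To keep it runnable on its own, the C++ wrapper is replaced by a Rust stand-in with the same name and C ABI; in a real project it would be declared in an `extern "C"` block and linked from the C++ side. The URL is a placeholder, not the value the real getLink() returns. The sketch borrows the pointer through `CStr` rather than taking ownership with `CString`, on the assumption that the memory belongs to the C++ side.

```rust
use std::os::raw::c_char;

// Stand-in for the C++ wrapper (assumption: the real one is implemented in
// C++ and returns a NUL-terminated string that outlives the call).
#[allow(non_snake_case)]
extern "C" fn getLink_return_cstyle_string() -> *const c_char {
    // Placeholder link; byte string literals live for the whole program.
    b"https://example.com/\0".as_ptr() as *const c_char
}

mod links {
    use super::getLink_return_cstyle_string;
    use std::ffi::CStr;

    /// Rust-facing shim: callers only ever see a native `String`.
    #[allow(non_snake_case)]
    pub fn getLink() -> String {
        let ptr = getLink_return_cstyle_string();
        // SAFETY: the stand-in returns a valid, NUL-terminated pointer.
        // A real wrapper would have to guarantee the same.
        let c_str = unsafe { CStr::from_ptr(ptr) };
        // Copy the bytes into an owned String, replacing invalid UTF-8.
        c_str.to_string_lossy().into_owned()
    }
}

fn main() {
    let link: String = links::getLink();
    println!("{}", link);
}
```

Copying into an owned `String` costs an allocation, but it means the Rust caller never has to think about who frees the C-side buffer.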
With these additions, the code now compiles and runs. This all looks very convoluted, but here's how the program works now:
- Rust's main() calls links::getLink().
- links::getLink() calls getLink_return_cstyle_string(), expecting a C-style string in return.
- getLink_return_cstyle_string() calls the actual getLink() function, converts the returned std::string into a const char *, and returns the const char *.
- Now that links::getLink() has a C-style string, it converts it into a Rust CString wrapper, which is then converted to an actual String.
- The String is returned to main().
There are a few things to take note of here:
- This process would be relatively easy to reverse so we could pass a String to a C++ function that expects an std::string or even a const char *.
- Rust strings are a bit more complicated because we have to convert from a C-style string to CString to String, but this is the basic process that will need to be used for any automatic string type conversions.
- This basic process could also be used to convert types like std::vector.
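The reverse direction mentioned in the first point could be sketched along these lines. Everything here is hypothetical: `link_length` stands in for a C++ function taking a const char * (written in Rust so the example runs by itself), and `link_length_wrapper` is the kind of shim a generator might emit.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// Stand-in for a C++ function that expects a `const char *` (assumption:
// the real one would be implemented in C++ and linked in).
unsafe extern "C" fn link_length(link: *const c_char) -> usize {
    // The caller promises a valid, NUL-terminated pointer.
    unsafe { CStr::from_ptr(link).to_bytes().len() }
}

/// Rust-facing shim: accepts a native string and hides the conversion.
/// Returns `None` if the text contains an interior NUL byte, which a
/// C-style string cannot represent.
fn link_length_wrapper(link: &str) -> Option<usize> {
    // Copies the text and appends the terminating NUL.
    let c_link = CString::new(link).ok()?;
    // SAFETY: `c_link` stays alive until the end of this function, so the
    // pointer remains valid for the duration of the call.
    Some(unsafe { link_length(c_link.as_ptr()) })
}

fn main() {
    let link = String::from("https://example.com/");
    println!("{:?}", link_length_wrapper(&link));
    println!("{:?}", link_length_wrapper("bad\0link"));
}
```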
Is this ugly? Yes. Does it suffer from performance issues due to all the string conversions? Yes. But I think this is the most user-friendly way to achieve compatible strings because it allows each language to keep using its native string type without requiring any ugly decorations or wrappers in the user code. All conversions are done in the wrappers.
Implementation
Based on the concepts here, I've written a (non-optimal) implementation of type proxying in polyglot that supports proxying std::string objects to either Rust or D. In fact, I've taken it a bit further and implemented type proxying for function arguments as well. You can see an example project, along with its generated wrappers, here.
Next up
Interoperability requires lots of wrappers, and as I've mentioned, polyglot can't generate wrappers for anything more complex than some basic functions, structs, classes, and enums. In the next installment of this series, we'll explore some viable binding generation tools that exist today.