Zig, hot reload, and ABI trouble

I was recently overcome by the idea of porting some C code of mine to Zig. In the process, I think I learned a thing or two about the situations in which Zig struggles to replace C for me. The short version: Zig is pleasant until you need lots of DLL boundaries, at which point the lack of a convenient non-C ABI story becomes painful.

The Setup

Let me set the scene. I have something that is roughly game-engine-shaped in C. One of its core aspects is hot reload for code and assets. When I say “hot reload for code” I do not just mean “there is a game DLL that you can reload” but that almost all code can be reloaded in principle (which is useful because you may want to patch debug code into core allocation routines, for example). This is achieved through a setup like this:

there is a very thin shell application that is responsible for loading the main DLL of the application and its dependencies,
the program is split into various DLLs with clear interfaces (e.g. a “Core” library exposes allocations, string utilities, etc.),
when a DLL is reloaded, you patch the import tables of all other DLLs to point to the new DLL. The reloading happens from the outermost shell, so nothing else is on the stack at this point.

This setup works surprisingly well when you apply the right restrictions: you can’t keep function pointers around for long (unless hot reload knows about them), you have to be aware of where global data ends up, you need a strategy for data migration across reloads, you need to have tight control over all threads, etc. There are also many details in the hot reload setup that can go very, very wrong. (Try hot reloading a DLL that another DLL depends on!) But this can be made to work, robustly, across at least the platforms we tried.
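
To make the moving parts concrete, here is a deliberately simplified Zig sketch. It does not patch real import tables: a hypothetical `ImportTable` struct of function pointers stands in for one, and `reloadCore` stands in for the shell swapping in a new DLL. All names in it (`ImportTable`, `allocTagV1`, `allocTagV2`, `reloadCore`, `demo`) are made up for illustration. It also shows why long-lived function pointers are a problem: a pointer cached before the reload keeps calling the old code.

```zig
const std = @import("std");

/// Hypothetical stand-in for a DLL's import table: one slot per imported function.
const ImportTable = struct {
    alloc_tag: *const fn () callconv(.c) u32,
};

// Two "versions" of the same Core function, as if they came from the old and the new DLL.
fn allocTagV1() callconv(.c) u32 {
    return 1;
}
fn allocTagV2() callconv(.c) u32 {
    return 2;
}

/// Stand-in for the shell patching the import table after it has loaded the new DLL.
fn reloadCore(table: *ImportTable) void {
    table.alloc_tag = &allocTagV2;
}

/// Runs the scenario and reports what a table lookup and a cached pointer each observe.
fn demo() struct { via_table: u32, via_cached: u32 } {
    var imports = ImportTable{ .alloc_tag = &allocTagV1 };
    const cached = imports.alloc_tag; // a function pointer kept across the reload
    reloadCore(&imports);
    return .{ .via_table = imports.alloc_tag(), .via_cached = cached() };
}
```

In this toy version the cached pointer merely calls stale code. With real DLLs the old module may already be unloaded, in which case the same call would jump into unmapped memory.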

The nice bit is that it is relatively straightforward to switch to a monolithic build where all of the overhead of the DLLs just disappears.

I was hoping to move this entire setup to Zig, both to learn more Zig and to see whether I can enjoy some of Zig’s niceties like defer.

Where Zig gets awkward

DLLs force two things:

they force an ABI (“application binary interface”): you have to make up your mind how e.g. a boolean is represented as bytes and how structs are passed to functions. It’s no longer “the compiler can do whatever it wants in each specific case” but “there is a specific interface that you need to adhere to.”
they force you to make decisions about where code and data live, so that you do not get multiple copies of your global variables

Both of these turn out to be unintuitive in Zig.

Zig does not define an ABI for its language features at DLL boundaries, which means that it won’t tell you, for example, how a slice is laid out in memory. At DLL boundaries, Zig effectively forces you to use a C-compatible ABI, so you can no longer use many of its more interesting features (even basic things like slices).

This is particularly obvious with Zig’s error handling. Zig has first-class support for functions that return “something or an error”: a function with return type !void, for example, returns either nothing or an error. Zig errors do not have an ABI, so you need to map them to error codes manually. Errors in Zig, however, aren’t just error codes: you can ask for an “error return trace”, which is like a stack trace in that it tells you where exactly an error occurred, except that the stack was never unwound. Instead, it is implemented (see the Zig documentation) by threading invisible parameters around to record the necessary metadata of the error.

This of course stops working at a DLL boundary, because there is again no ABI for this. So error traces will stop at DLL boundaries, at best.
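
As a rough illustration of the manual mapping, here is a minimal sketch of a C-ABI wrapper that translates a Zig error set into status codes. Everything in it is hypothetical: the `ParseError` set, the `Status` enum and its numbering, and the `demo_parseDigit` symbol are invented for this example and are not taken from my code base.

```zig
const std = @import("std");

/// Hypothetical error set used on the implementation side.
const ParseError = error{ Empty, NotANumber };

/// Hypothetical status codes for the C-ABI side; the numbering is an assumption of this sketch.
const Status = enum(c_int) { ok = 0, empty = 1, not_a_number = 2 };

/// Ordinary Zig function that reports failure through an error union.
fn parseDigit(text: []const u8) ParseError!u8 {
    if (text.len == 0) return error.Empty;
    if (!std.ascii.isDigit(text[0])) return error.NotANumber;
    return text[0] - '0';
}

/// Exported wrapper: the slice becomes pointer + length, the result becomes an
/// out parameter, and the error becomes a plain status code.
export fn demo_parseDigit(ptr: [*]const u8, len: usize, out: *u8) callconv(.c) Status {
    const digit = parseDigit(ptr[0..len]) catch |err| {
        // The error value itself cannot cross the boundary, so translate it here.
        return switch (err) {
            error.Empty => Status.empty,
            error.NotANumber => Status.not_a_number,
        };
    };
    out.* = digit;
    return Status.ok;
}
```

Note that only the status code survives the call. Whatever error return trace existed inside `parseDigit` is gone by the time the caller sees `Status.not_a_number`.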

Compared to C, Zig also makes it harder to understand where a symbol (e.g. a global variable) is going to end up. C has the advantage that the typical structure of projects gives you a separation between implementation (compilation unit) and interface (header files). Switching between a monolithic build and a dynamic build that uses multiple DLLs is easy in that world, and it is relatively predictable where a symbol lives: definitions in header files are replicated everywhere, whereas symbols in compilation units just live in the resulting object file (…at least roughly). The compilation model is already reflected in the source setup.

Zig on the other hand just has .zig source files, and whatever other .zig file you @import into your .zig file is going to be part of the compiled output. Unless you manually intervene, you get a monolith. If you want multiple compilation units and then link them together at the end, you have to go through a C ABI again. The pragmatic solution then is to re-invent header files and explicitly define extern functions in them.

A workaround

For my specific use case, I am lucky enough that I just care about “an ABI” and not about “a stable ABI.” While Zig does not have a stable ABI, it certainly has some ABI: for a specific compiler version and specific options, it will produce some memory layout. It’s theoretically possible that Zig flips a coin every time you run the compiler to give you an exciting new ABI, but that is not what is happening in practice (that, or I am just very lucky).

Here is a concrete toy example showing what it takes to export a startsWith function from Zig so that callers do not have to care whether it comes from a DLL, while monolithic builds avoid the overhead.

This is the function:

/// Reports whether `s` begins with `needle`.
pub fn startsWith(s: []const u8, needle: []const u8) bool {
    return std.mem.startsWith(u8, s, needle);
}

If we just stick export on it, we get an error like this:

error: parameter of type '[]const u8' not allowed in function with calling convention 'x86_64_win'
note: slices have no guaranteed in-memory representation

So we have to build some machinery. First, let’s move our function definition into string.impl.zig along with a way to mark functions as exported:

const std = @import("std");
pub const export_surface = .{
    .{ .decl = "startsWith", .symbol = "Core_string_startsWith" },
};

/// Reports whether `s` begins with `needle`.
pub fn startsWith(s: []const u8, needle: []const u8) bool {
    return std.mem.startsWith(u8, s, needle);
}

We can automatically generate the C-ABI version for most functions, along with a wrapper that dynamically selects what function a caller should use. For dynamic linkage, the call goes through an external symbol. Otherwise, we import the implementation and call that directly.

The main insight for lowering types is this: we can lower practically anything except error values to a pointer. If it is already a pointer, we just lower it to an opaque pointer. If it is a value, we take its address and copy it on the other side. Return types become out parameters passed by pointer. For simplicity, floats, integers, and bools are still passed by value.

Note that this relies on building all libraries with the same compiler version, because we didn’t actually solve any ABI problem. We just rely on the compiler producing the same layout across libraries. If we at some point cared about a truly stable ABI, we could generate bespoke extern struct instances per exported function. That’s what I tried first, and the approach works nicely.
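
For comparison, here is a minimal sketch of what the extern struct route could look like for a slice parameter. It is one possible reading of that idea, not the code I actually generated: `ByteSlice`, its two helper functions, and the `demo_stable_startsWith` symbol are hypothetical names.

```zig
const std = @import("std");

/// Hypothetical C-compatible stand-in for `[]const u8` with a defined layout.
const ByteSlice = extern struct {
    ptr: [*]const u8,
    len: usize,

    /// Split a Zig slice into its pointer and length.
    fn fromSlice(s: []const u8) ByteSlice {
        return .{ .ptr = s.ptr, .len = s.len };
    }

    /// Rebuild the Zig slice on the other side of the boundary.
    fn toSlice(self: ByteSlice) []const u8 {
        return self.ptr[0..self.len];
    }
};

/// Exported thunk whose signature only mentions C-ABI-compatible types.
export fn demo_stable_startsWith(s: ByteSlice, needle: ByteSlice) callconv(.c) bool {
    return std.mem.startsWith(u8, s.toSlice(), needle.toSlice());
}
```

The trade-off is that every non-trivial type needs such a mirror type and a pair of conversions, whereas the pointer-based lowering only needs the address of a value.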

The “caller side” then looks like this:

const link_options = @import("Core_link_options");

inline fn generated_startsWith(s: []const u8, needle: []const u8) bool {
    if (comptime link_options.is_dy) {
        // Dynamic build: call the exported C-ABI thunk and pass the slices by address.
        const startsWith_symbol = @extern(*const fn (s: ?*anyopaque, needle: ?*anyopaque) callconv(.c) bool, .{ .name = "Core_string_startsWith" });
        return startsWith_symbol(__generated_toOpaquePtr(&s), __generated_toOpaquePtr(&needle));
    } else {
        // Monolithic build: call the implementation directly, with no thunk in between.
        const impl = @import("string.impl.zig");
        return impl.startsWith(s, needle);
    }
}

We also need a file string.thunks.zig that we conditionally compile into our library’s root.zig to export a wrapper around our function:

const impl = @import("../string.impl.zig");

pub export fn Core_string_startsWith(s: ?*anyopaque, needle: ?*anyopaque) callconv(.c) bool {
    return impl.startsWith(__generated_loadIndirectValue([]const u8, s), __generated_loadIndirectValue([]const u8, needle));
}
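
The two helpers used in these snippets, `__generated_toOpaquePtr` and `__generated_loadIndirectValue`, are not shown. The following is a minimal sketch of what helpers of that kind might look like, under the assumption that both sides are built with the same compiler and options. The names `toOpaquePtr`, `loadIndirectValue`, `demo_startsWith`, and `startsWithViaThunk` are hypothetical and shortened for the example.

```zig
const std = @import("std");

/// Erase the type of a pointer to a value so that it can cross a C-ABI boundary.
fn toOpaquePtr(ptr: anytype) ?*anyopaque {
    // Drop constness first; the callee only ever reads through this pointer.
    const erased: *anyopaque = @constCast(ptr);
    return erased;
}

/// Recover a value of type `T` from an erased pointer by copying it out.
/// This is only sound if caller and callee agree on the layout of `T`.
fn loadIndirectValue(comptime T: type, ptr: ?*anyopaque) T {
    const typed: *const T = @ptrCast(@alignCast(ptr.?));
    return typed.*;
}

/// Hypothetical thunk with the same shape as the exported wrapper in the text.
export fn demo_startsWith(s: ?*anyopaque, needle: ?*anyopaque) callconv(.c) bool {
    return std.mem.startsWith(
        u8,
        loadIndirectValue([]const u8, s),
        loadIndirectValue([]const u8, needle),
    );
}

/// Caller side: take the address of each slice and pass it through the thunk.
fn startsWithViaThunk(s: []const u8, needle: []const u8) bool {
    return demo_startsWith(toOpaquePtr(&s), toOpaquePtr(&needle));
}
```

The pointers only need to stay valid for the duration of the call, because the thunk copies each value out before using it.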

In theory, we could stop here, but one more step makes all of this a little bit nicer. We can write a “header file” string.zig which combines manually authored types (along with comptime and anytype functions) with a tool-generated section that just forwards the generated symbols with a nicer name. Now users can just use string.zig:

// ...manually written code here...

// ===== EVERYTHING BELOW THIS IS GENERATED, DO NOT EDIT MANUALLY

/// Reports whether `s` begins with `needle`.
pub const startsWith = generated_startsWith;

// more forwarding here, then all of the generated wrappers at the bottom

inline fn generated_startsWith(s: []const u8, needle: []const u8) bool {
    // ...
}

So we have this:

string.impl.zig - where you write your code and mark it for export
string.thunks.zig - a generated file with the export thunks
string.zig - the “header” file that is part hand-written, part generated with forwarders and wrappers.

At this point, the rules are:

nobody may use @import("string.impl.zig") except other .impl files in the same library
other users should use @import("string.zig")
the library root needs to use @import("string.thunks.zig") when compiled as a DLL
you may not use Zig errors in exported signatures

The actual machinery for the header generation is ugly. It needs reflection to find the types used in the signatures we are exporting, but it also needs AST parsing so that comments on an implementation function can be correctly captured for the “header” version of this function. I first tried just parsing without reflection, but this falls flat on its face once you deal with type aliases.
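
To give a flavor of the reflection half, here is a small sketch that inspects a function signature at compile time. It assumes Zig 0.14 or later, where the function variant of `@typeInfo` is spelled `.@"fn"`. The `impl` namespace, the `Str` alias, and `countParamsOfType` are hypothetical and exist only for this example. The AST half (capturing doc comments) is not shown.

```zig
const std = @import("std");

/// Hypothetical implementation namespace, standing in for a file like string.impl.zig.
const impl = struct {
    /// A type alias that a purely textual parser would not resolve.
    pub const Str = []const u8;

    pub fn startsWith(s: []const u8, needle: Str) bool {
        return std.mem.startsWith(u8, s, needle);
    }
};

/// Counts the parameters of `impl.<decl_name>` whose type is exactly `T`.
fn countParamsOfType(comptime decl_name: []const u8, comptime T: type) usize {
    // Assumes Zig 0.14+: older releases spell this field `.Fn`.
    const info = @typeInfo(@TypeOf(@field(impl, decl_name))).@"fn";
    var count: usize = 0;
    inline for (info.params) |param| {
        // `param.type` is null for `anytype` parameters; this sketch skips those.
        if (param.type) |P| {
            if (P == T) count += 1;
        }
    }
    return count;
}
```

Reflection sees both parameters of `startsWith` as `[]const u8`, even though one of them is written as `Str` in the source, which is the case where parsing alone fell over.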

What I learned

For this specific case, I have not made up my mind on whether this is good enough to let Zig replace C for me, and I will push it around a little bit more. It’s disappointing to see how much of a struggle this is in Zig, and likewise encouraging that it is possible with the tools that Zig provides. (Unrelated, I am aware that Zig’s stdlib allocators use function pointers and DLL reloading is going to wreak havoc on that as well. So there are more adventures ahead.)

The broader lesson for me is that replacing or even improving on C in all of its applications is still really difficult. DLLs are one such application. DLLs have uses beyond “packaged API” (with stability and all of the implications). DLLs might just be used as containers for code (for example for hot reload), and all of a sudden languages without an ABI story become less attractive. The “physicality” of C project structures makes reasoning through what the compiled binaries will look like at least feasible. Even a small departure such as Zig makes this harder, at least by default.