Let’s talk about reader-writer locks. A reader-writer lock is what you employ when multiple readers can safely access a resource concurrently, but a writer needs exclusive access to that resource. In my particular case, there is a C# program using ReaderWriterLockSlim. The “slim” distinguishes it from the bigger ReaderWriterLock, which has more features and is more expensive at runtime.
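As a quick refresher, typical usage looks something like this. The Cache class below is made up for illustration; the Enter/Exit methods on the lock are the actual ReaderWriterLockSlim API:

using System.Collections.Generic;
using System.Threading;

class Cache
{
    private readonly ReaderWriterLockSlim _lock = new ReaderWriterLockSlim();
    private readonly Dictionary<string, string> _data = new Dictionary<string, string>();

    public bool TryGet(string key, out string value)
    {
        _lock.EnterReadLock(); // many readers may hold the lock at the same time
        try { return _data.TryGetValue(key, out value); }
        finally { _lock.ExitReadLock(); }
    }

    public void Set(string key, string value)
    {
        _lock.EnterWriteLock(); // exclusive: waits for readers and blocks new ones
        try { _data[key] = value; }
        finally { _lock.ExitWriteLock(); }
    }
}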
When investigating a recent slowdown, I was very surprised to find this:

Wow, so many threads doing “work” (that’s why they are green). A second look should make you more sceptical: what’s that purple between the green? Are the threads… sleeping? Yes, yes they are. Oh, are they contending over a reader-writer lock? Yes, yes they are. It is tempting to conclude (like I initially did) that there must be a lot of writes going on, since we’re contending over a reader-writer lock. That, however, is not what is happening here at all: there are 2 writes over the entire execution of the program, compared to tens of thousands of reads. So why is there so much contention? Isn’t the whole point of a reader-writer lock that you can read concurrently?
If you keep looking for clues, you will find that we are not just paying for entering the lock. Inspect the picture above carefully and you will see that we also pay a hefty price for exiting the lock, and even spin while exiting. How curious! The unfortunate truth is that readers still need concurrent access to the lock’s own state, and that can be quite expensive. In particular, if you want your reader-writer lock to have guarantees such as “fairness” (each thread eventually gets to read/write, even under contention, so there are no indefinite waits), to optionally support re-entrant locking, and to also support “upgradeable” entering (i.e. enter as a reader, then be able to upgrade to a writer), then you need to keep a bunch of state around to enable all of this.
When you enter or exit a ReaderWriterLockSlim in C#, you will at a minimum do this:
void EnterOrExit()
{
    EnterSpinlock();    // serialize access to the lock's internal state
    ManipulateState();  // update reader counts, writer flags, fairness bookkeeping
    ExitSpinlock();
}
That spinlock is the source of the contention here, when 30-ish threads all try to read in parallel. That’s ~40s of CPU time spent on contention. So what if we put ReaderWriterLockSlim on a diet and drop all of the features we don’t strictly need in this case? We already know that reads are way more common than writes in our setup, so let’s give them a short path to take the lock. Below is the lock I came up with for testing. Please don’t blindly use it in production: it contains a spinlock after all, and maybe there are glaring problems that I just have not spotted yet.
The idea is that we keep a counter of readers and bump it for every reader; as long as the counter is positive, we may read. This means that an uncontended read is just a single increment. A writer adds long.MinValue, which makes the value very negative, while the number of readers stays encoded as “how far above the minimum value are we.” The writer can therefore wait for the readers to drain before taking its turn. The only tricky case is the sequence “reader increments, fails, then the writer releases its lock, and the reader decrements.” That case mandates that the writer doesn’t simply set the state to 0 on release; instead it adds long.MinValue a second time, which wraps around to cancel the writer’s bias while preserving any in-flight reader increments, so failed readers can still back out without breaking the lock.
using System.Threading;

struct UnfairRwLock
{
    // If zero: no reader, no writer.
    // If negative: writer active (long.MinValue plus any in-flight reader increments).
    // If positive: that many readers hold the lock.
    private long _state;

    public void EnterReadLock()
    {
        // Fast path: an uncontended read is a single atomic increment.
        long value = Interlocked.Increment(ref _state);
        if (value > 0)
            return;

        // A writer is active: undo our increment and spin until it leaves.
        Interlocked.Decrement(ref _state);
        var spinWait = new SpinWait();
        while (true)
        {
            spinWait.SpinOnce();
            value = Interlocked.Increment(ref _state);
            if (value > 0)
                return;
            Interlocked.Decrement(ref _state);
        }
    }

    public void EnterWriteLock()
    {
        var spinWait = new SpinWait();
        while (true)
        {
            long current = Volatile.Read(ref _state);
            if (current < 0)
            {
                // Another writer is active; wait for it.
                spinWait.SpinOnce();
                continue;
            }
            // Claim the lock by adding the long.MinValue bias on top of the
            // current reader count.
            if (Interlocked.CompareExchange(ref _state, long.MinValue + current, current) == current)
            {
                // Wait for pre-existing readers to drain
                while (Volatile.Read(ref _state) != long.MinValue)
                    spinWait.SpinOnce();
                return;
            }
        }
    }

    public void ExitReadLock() => Interlocked.Decrement(ref _state);

    // Adding long.MinValue a second time wraps around and cancels the writer's
    // bias, while preserving any in-flight reader increments.
    public void ExitWriteLock() => Interlocked.Add(ref _state, long.MinValue);
}
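Usage follows the same pattern as ReaderWriterLockSlim. One caveat: since this is a mutable struct, it must live in a field and must never be copied. A minimal sketch (SharedValue and its field are made-up names):

class SharedValue
{
    private UnfairRwLock _lock; // mutable struct: keep it in a field, never copy it
    private long _value;

    public long Read()
    {
        _lock.EnterReadLock();
        try { return _value; }
        finally { _lock.ExitReadLock(); }
    }

    public void Write(long value)
    {
        _lock.EnterWriteLock();
        try { _value = value; }
        finally { _lock.ExitWriteLock(); }
    }
}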
This lock fares much, much better in our setup. Instead of 50s of CPU time spent on contention, we’re now looking at ~4s. That’s an order of magnitude better. But let’s be honest: it is embarrassing to spend 4s of CPU time on synchronizing access when we know for a fact that we almost never have any contention. I saw two different directions I could have gone in: I could have split the reader lock up into many smaller locks to reduce contention on the single integer we’re mistreating here. The uncommon writes would then have to take all N locks, but that’s fine because we read so much more often than we write (a sketch of this idea follows below). The alternative is what I ended up implementing: a sequence counter.
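For completeness, the striping direction might have looked roughly like the sketch below. I did not benchmark this variant; StripedRwLock and its stripe count are made up, it builds on the UnfairRwLock from above, and a real implementation would also pad each stripe to its own cache line to avoid false sharing:

using System;

struct StripedRwLock
{
    private const int StripeCount = 16; // arbitrary; tune for your thread count

    // Note: a bare array like this still false-shares; pad each stripe
    // out to a full cache line in a real implementation.
    private UnfairRwLock[] _stripes;

    public static StripedRwLock Create() =>
        new StripedRwLock { _stripes = new UnfairRwLock[StripeCount] };

    // Readers spread across stripes, so concurrent readers rarely touch
    // the same counter.
    public int EnterReadLock()
    {
        int stripe = Environment.CurrentManagedThreadId % StripeCount;
        _stripes[stripe].EnterReadLock();
        return stripe; // a reader must exit the same stripe it entered
    }

    public void ExitReadLock(int stripe) => _stripes[stripe].ExitReadLock();

    // The rare writer pays for taking all N stripes.
    public void EnterWriteLock()
    {
        for (int i = 0; i < StripeCount; i++)
            _stripes[i].EnterWriteLock();
    }

    public void ExitWriteLock()
    {
        for (int i = 0; i < StripeCount; i++)
            _stripes[i].ExitWriteLock();
    }
}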
In this particular case, the lock is only used to determine whether some data we read is valid; we never read out of bounds or anything like that. Given that invalid data is very rare (it only happens when we read while someone else is writing), we might as well always perform the read and simply check whether someone else wrote something in the meantime. This has the nice property that a reader only needs to read a number twice per access, and never writes anything. So unless someone is actually writing, there is no contention at all (except perhaps for false sharing on that field). Neat!
This is implemented with a seqlock, which is essentially a sequence counter: even numbers allow reads (the data is in a stable state), odd numbers mean a write is in progress. Readers just check that the sequence number didn’t change between starting the read and finishing the read. Writers wait until they see a stable (even) number, bump it to claim the spot, do their write, and bump it again when they are done.
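A minimal sketch of that protocol in C# might look like the following. This is not the exact code from my program, and the names are made up; the memory barrier in ReadValidate is there so the data reads cannot be reordered past the final sequence check:

using System.Threading;

struct SeqLock
{
    private long _sequence; // even: stable, odd: write in progress

    // Spin until the data is stable, then return the sequence number
    // so the reader can validate against it later.
    public long ReadBegin()
    {
        var spinWait = new SpinWait();
        while (true)
        {
            long seq = Volatile.Read(ref _sequence);
            if ((seq & 1) == 0)
                return seq;
            spinWait.SpinOnce(); // a write is in progress
        }
    }

    // True if no write started since ReadBegin; otherwise retry the read.
    public bool ReadValidate(long seq)
    {
        Thread.MemoryBarrier(); // keep the data reads before this re-check
        return Volatile.Read(ref _sequence) == seq;
    }

    public void WriteBegin()
    {
        var spinWait = new SpinWait();
        while (true)
        {
            long seq = Volatile.Read(ref _sequence);
            // Claim the spot by bumping a stable (even) number to odd.
            if ((seq & 1) == 0 &&
                Interlocked.CompareExchange(ref _sequence, seq + 1, seq) == seq)
                return;
            spinWait.SpinOnce();
        }
    }

    // Bump back to even: the data is stable again under a new sequence number.
    public void WriteEnd() => Interlocked.Increment(ref _sequence);
}

A reader then retries until a read went through undisturbed:

long seq, value;
do
{
    seq = _seqLock.ReadBegin();
    value = _data; // the unsynchronized read being validated
} while (!_seqLock.ReadValidate(seq));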
In practice, for my little program, this reduces contention to literally zero.