GPU hang adventures


After the last post, you might think that I have spent most of my time lately with IL2CPP. That, however, is untrue: I have actually mostly been looking at GPU hangs. I would by no means call myself a graphics programmer: when I program for GPUs, it is either a compute shader or CUDA, and even that I have not done in earnest in about ten years. So when a GPU starts hanging, I naturally sign up for that problem to learn about the pain that my friends complain about.

Given my relative inexperience in GPU-herding, I would have benefitted from some general guidance and context, which I could not find online. This post will hopefully be useful to someone.

So, to set the stage: the game in question uses a certain well-known proprietary engine on top of DX11. Around the game sits a launcher application, and that launcher regularly sends Win32 messages to the game. If the game fails to respond to several of them in a row within a short time frame, we trigger an “Application Not Responding” (ANR) event and collect stack traces and a minidump. The ANRs I would attribute to a GPU hang happen very sporadically, and never on my machine.
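I do not know the exact protocol our launcher speaks, but a responsiveness check of this kind typically boils down to something like the following sketch. The message choice, the timeout, the miss threshold, and the `ReportAnrAndCollectMinidump` helper are all made up for illustration:

```cpp
#include <windows.h>

// Hypothetical stand-in for our actual stack-trace/minidump collection.
void ReportAnrAndCollectMinidump();

// Ping the game's main window; SendMessageTimeout returns 0 if the
// target did not process the message within the timeout (or is
// already considered hung by the system).
bool IsGameResponding(HWND gameWindow)
{
    DWORD_PTR result = 0;
    LRESULT ok = SendMessageTimeout(
        gameWindow, WM_NULL, 0, 0,
        SMTO_ABORTIFHUNG | SMTO_BLOCK,
        /*uTimeout=*/1000, &result);
    return ok != 0;
}

// Count consecutive misses; several in a row trigger the ANR path.
void WatchdogLoop(HWND gameWindow)
{
    int consecutiveMisses = 0;
    for (;;)
    {
        if (IsGameResponding(gameWindow))
            consecutiveMisses = 0;
        else if (++consecutiveMisses >= 3)
        {
            ReportAnrAndCollectMinidump();
            consecutiveMisses = 0;
        }
        Sleep(2000);
    }
}
```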

My GPU hang adventure is not really over yet, but I wanted to share some things I have learned and tried:

  • DX11 has a debug validation layer that you can enable (see MSDN article). Well, if you have source access to the engine and fix up the integration, then you can enable it; there is a device-creation sketch after this list. It has brought up one failure, but it turned out to be completely unrelated.
  • We have set up a soak test that goes through the passage of the game that we know might occasionally trigger the ANR (which is essentially “start game and enter match”). I have only ever observed the hangs in the soak test, never on my machine. Without the soak test, we would likely have made zero progress.
  • I have collected logs for all instances of the hang. The logs contain the driver version, and this has shown that some NVIDIA drivers released between June and October were hanging in very weird spots. I suspected a driver problem, and after upgrading the driver everywhere, that particular ANR went away, but there are still ANRs happening.
  • Windows has a thing called “Timeout Detection and Recovery”, TDR for short. You can read more about this on MSDN. The TL;DR is that if you put a long workload on your GPU, your entire system may hang. Presumably, users would start getting their Bill Gates voodoo dolls out even when the problem is not actually Windows-related at all, so Windows now detects these cases and resets the driver once the GPU (the driver, really, I think) has been unresponsive for too long. An application can in theory handle driver resets gracefully, but in practice this seems uncommon. See this MSDN article.
    • You can control the specifics of what triggers a TDR via registry values, which are documented on MSDN. If nothing is set, you operate on default values; see the registry sketch after this list.
    • If you are the unlucky owner of a “gaming laptop” (how I hate RGB lights, but these are the only laptops I can use for development), your OEM might have “helpfully” tweaked these values for you, since adjusting these values is how you get Call of Duty to still run on old hardware, apparently. In my case, there was a background service that would reset the TDR delay to 10 seconds (instead of 2 seconds) on every boot. I ended up using Process Monitor to record a boot trace and then filtered that boot trace for that registry key to find that service and disable it.
    • Other tools, particularly those used by 3D artists and animators, might also suffer from TDRs. If you use various office machines for soak testing, be aware that artists and animators might have set different TDR delays.
    • TDRs also leave a trace that you can see in the Windows Event Viewer. It does not contain much information, but it at least tells you that a TDR happened.
  • The hangs we are seeing seem to “clump up” on the same machine. My first thought was that this could be caused by driver updates right before the game runs for the first time. A driver update might invalidate the driver’s shader cache, which means that the driver needs to re-compile shaders at runtime, which can mess with timings. We have been clearing the shader cache between soak test runs (deleting everything in %LocalAppData%\NVIDIA\DXCache and %LocalAppData%\NVIDIA\GLCache) to emulate this. There is also a driver setting to disable shader caching.
  • I have tried putting the GPU under load, both by over-allocating and by running a GPU stress test in parallel with the game, but again I could not reproduce anything locally.
  • The tooling in this space looks super-unhelpful if you are on DX11. As far as I can tell, the typical frame-capture solutions only capture single frames, which is unhelpful when you do not know when a hang occurs, and most longer-running capture solutions work only for DX12. NVIDIA Aftermath seems to have a DX11-on-DX12 mode, but so far I could not get Aftermath to work, and that again hinges on me reproducing the hang locally in the first place.
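To make the debug-layer point concrete, here is roughly what enabling it looks like at device-creation time. This is a minimal sketch, assuming you can reach the engine’s D3D11CreateDevice call; the ID3D11InfoQueue part additionally makes the layer break into the debugger on validation errors:

```cpp
#include <d3d11.h>
#include <d3d11sdklayers.h>

// Sketch: create the device with the debug validation layer enabled.
// Requires the D3D11 SDK layers ("Graphics Tools" optional feature)
// to be installed on the machine; otherwise creation fails.
HRESULT CreateDebugDevice(ID3D11Device** outDevice,
                          ID3D11DeviceContext** outContext)
{
    UINT flags = D3D11_CREATE_DEVICE_DEBUG; // you would gate this on a build/config switch
    HRESULT hr = D3D11CreateDevice(
        nullptr, D3D_DRIVER_TYPE_HARDWARE, nullptr, flags,
        nullptr, 0, D3D11_SDK_VERSION,
        outDevice, nullptr, outContext);
    if (FAILED(hr))
        return hr;

    // Optional: break when the validation layer reports errors.
    ID3D11InfoQueue* infoQueue = nullptr;
    if (SUCCEEDED((*outDevice)->QueryInterface(
            __uuidof(ID3D11InfoQueue),
            reinterpret_cast<void**>(&infoQueue))))
    {
        infoQueue->SetBreakOnSeverity(D3D11_MESSAGE_SEVERITY_CORRUPTION, TRUE);
        infoQueue->SetBreakOnSeverity(D3D11_MESSAGE_SEVERITY_ERROR, TRUE);
        infoQueue->Release();
    }
    return S_OK;
}
```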
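As for the TDR registry values: they live under HKLM\System\CurrentControlSet\Control\GraphicsDrivers, with documented names such as TdrDelay and TdrDdiDelay. A small sketch for checking what a given machine is actually running with, which is handy when OEM services fiddle with these values behind your back:

```cpp
#include <windows.h>
#include <cstdio>

// Sketch: read the effective TDR delay. If the value is absent,
// the OS default (2 seconds) applies.
void PrintTdrDelay()
{
    DWORD delaySeconds = 0;
    DWORD size = sizeof(delaySeconds);
    LSTATUS status = RegGetValueW(
        HKEY_LOCAL_MACHINE,
        L"System\\CurrentControlSet\\Control\\GraphicsDrivers",
        L"TdrDelay", RRF_RT_REG_DWORD, nullptr, &delaySeconds, &size);
    if (status == ERROR_SUCCESS)
        printf("TdrDelay = %lu seconds\n", delaySeconds);
    else
        printf("TdrDelay not set, default (2 seconds) applies\n");
}
```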

The only thing that really helped was a realization that took me embarrassingly long to reach: this one proprietary game engine, which shall remain unnamed, made the bold choice of ignoring the return codes of almost all DX11 calls at runtime. Some of them had asserts on them, but some did not have any, and the asserts are compiled out in actual builds anyway. I do not know on which grounds that decision was made, but I would suggest you DO NOT silently ignore error codes returned by DX11, or whatever graphics API you are using. Do not do this, do not let your friends do this, and do not configure your grandparents’ engine to do this. In our case, I have found that we often hit DXGI_ERROR_DEVICE_REMOVED (0x887A0005), in which case you can call the helpful GetDeviceRemovedReason API, which for us just yields DXGI_ERROR_DEVICE_HUNG. This is indicative of a TDR, as far as I can tell. (EDIT: Daniel Ludwig helpfully tells me that it can also indicate an assortment of other issues, such as reading from uninitialized memory or out-of-bounds reads. Thank you!)
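A minimal sketch of the kind of checking that was missing. The surrounding function is hypothetical, but GetDeviceRemovedReason is exactly the API mentioned above:

```cpp
#include <d3d11.h>
#include <dxgi.h>
#include <cstdio>

// Sketch: inspect the HRESULT of Present (or any other call that can
// surface device loss) instead of throwing it away.
void CheckPresentResult(HRESULT hr, ID3D11Device* device)
{
    if (SUCCEEDED(hr))
        return;

    if (hr == DXGI_ERROR_DEVICE_REMOVED || hr == DXGI_ERROR_DEVICE_RESET)
    {
        // DXGI_ERROR_DEVICE_HUNG here is what a TDR typically looks
        // like from the application's point of view.
        HRESULT reason = device->GetDeviceRemovedReason();
        printf("device removed, reason: 0x%08lX\n",
               static_cast<unsigned long>(reason));
        // Log/report this; a graceful recovery would recreate the
        // device and all GPU resources.
    }
    else
    {
        printf("call failed: 0x%08lX\n", static_cast<unsigned long>(hr));
    }
}
```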

In one case, error reporting revealed that we have a synchronous readback of a compute shader result (bad!), but there are likely more problems. I have updated the engine to log all compute dispatches, which has helped in tracking down this particular problem, but as soon as you are in async territory, those logs stop being very helpful. Two last pieces support the theory that this might be a performance issue after all. First, both real and soaked ANRs usually clump up: you see multiple ANRs in short succession on the same machine before they disappear again. Second, ANRs sometimes seem to happen while the game is running in the background: you queue up for a match, switch to a different application (maybe a heavy 3D application?), and then the game hangs after a while.
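For illustration, this is the shape of the synchronous-readback pattern (the names are mine, not the engine’s): a staging copy followed by an immediate Map forces the CPU to wait for the GPU.

```cpp
#include <d3d11.h>

// Sketch of the problematic pattern: read back a compute result
// in the same frame it was produced.
void SynchronousReadback(ID3D11DeviceContext* ctx,
                         ID3D11Buffer* gpuResult, // default-usage buffer the CS wrote
                         ID3D11Buffer* staging)   // D3D11_USAGE_STAGING, CPU_ACCESS_READ
{
    ctx->CopyResource(staging, gpuResult);

    // This Map blocks until all prior GPU work touching the resource
    // has completed: a full CPU/GPU sync point mid-frame.
    D3D11_MAPPED_SUBRESOURCE mapped = {};
    if (SUCCEEDED(ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped)))
    {
        // ... read mapped.pData ...
        ctx->Unmap(staging, 0);
    }
}
```

The usual fix is to delay the Map by a frame or two, for example by round-robining over several staging buffers or by polling with D3D11_MAP_FLAG_DO_NOT_WAIT, so that the GPU has already finished by the time the CPU asks for the data.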

For now, I am running out of good ideas for what to try. Writing this post at least made me realize that we have run the soak test, and that I have run locally with DX11 validation, but that we have never run the soak test with DX11 validation turned on, so that is a good next thing to try. I am left wondering how anyone else debugs these issues, and whether TDRs are just an unfortunate reality that is always going to affect a small percentage of users (e.g. specific slowness of specific things on specific hardware).
