Running out of memory

I have recently spent a good bit of time with code running inside a (Windows, for now) docker container that would really benefit from knowing how much more memory it can allocate. This question turned out to be more difficult than I thought it would be! This article is partially documentation, partially feature request, and partially just me enjoying the absurdity of Computer Things. This is how I tried and failed to figure out how much memory is available.

Memory Fundamentals

You would expect that the documentation for, say, VirtualAlloc would get a little bit more specific than “If the function fails, the return value is NULL.” Let’s start by establishing the minimum knowledge necessary to be able to answer “in what situation do allocations fail?”

Here is a simplified model of memory management:

Programs live in a virtual address space that does not necessarily represent physical reality. We call this “virtual memory.” Virtual memory is assigned and managed by the operating system (OS).
Virtual address space is divided into “pages” of a certain size (frequently 4 kilobytes, still). Pages have access-rights: can memory in this page be read? can it be written to? can it contain executable code?
The OS ensures (along with some hardware) that when your program actually accesses a valid virtual memory address, there is some physical representation of the bytes you are accessing. This is done by “mapping” pages to a physical representation.

Note, the term “physical representation” is not a standard term. I am just using it here. It’s common to say that a page has some physical “backing”, though in general terminology is loose and inconsistent in this entire space (as evidenced by the MSDN page Memory Performance Information).

That’s about what we need, and you probably knew as much already. Let me outline some “features” of such a memory system:

It is very often convenient to manually divide up your address space by allocating virtual memory. You want a terabyte of memory that is guaranteed to have contiguous virtual addresses? Go for it. You can totally allocate more virtual memory than your machine has physical RAM, and even more than you may have disk space. Many systems allow you to allocate so much virtual memory that it is virtually guaranteed (badum-ts) that you do not have a physical representation for all of it.
The OS does not actually need to map a virtual memory page to some physical representation before you touch it. Maybe, like many programs, you never touch it? What a waste it would be to map that! Pages are hence mapped “on demand.”
The physical representation for a byte does not have to be RAM. It could be anything sufficiently “byte-shaped.”
Two different pages in virtual address space could be mapped to the same physical representation.

Let’s talk about physical representations for a moment. Generally speaking, a page can be mapped to different things: It could be RAM. It could be different hardware with its own memory that you want conveniently addressable. It could be a file on disk. For general purpose allocations, there are two things that count: There is your RAM, and there is your page file (or swap space on Linux). Ideally, all the data you ever need would be in RAM, but in all likelihood your system will sometimes need to provide more bytes than you have RAM, so it will write parts of RAM to disk temporarily (and then fetch it again later when needed). This process is expensive. When your program touches a page that currently has no mapping, the hardware raises a “page fault” and the OS steps in to map the page to a physical representation. When the OS cannot resolve the fault from RAM alone and has to go to disk (to read the page’s contents back in, possibly after writing another page out to free up space), this is called a “hard page fault.”

“Running out of memory” then can be multiple things:

You could run out of virtual address space.
You could run out of physical representations for the virtual pages you want to allocate.

The first kind of failure is very unlikely on a 64-bit system. While your OS generally will not allow you to use all 64 bits for virtual addresses, you are probably looking at 48 bits or so, which is still hundreds of terabytes. The second failure case is much more insidious, primarily because it could happen after the OS has already handed you some virtual address space: It’s a feature of the system that you can get more virtual memory than you have physical bytes available.

On Windows, this tension is resolved by inserting a step between allocating virtual memory and using it. Virtual memory management is expressed through VirtualAlloc and VirtualFree. malloc, new, or whatever means of allocation you are using are most likely going to call VirtualAlloc somewhere in their implementation at some point. This is the flow:

When you run uint8_t* p = (uint8_t*) VirtualAlloc(NULL, 4096, MEM_RESERVE, PAGE_READWRITE), you reserve 4096 bytes worth of virtual address space, but it’s really just the address space, nothing else.
When you run p = (uint8_t*) VirtualAlloc(p, 4096, MEM_COMMIT, PAGE_READWRITE), you are telling the operating system that you intend to actually use it. You are asking the OS to commit to giving this memory a physical representation when you actually touch it. Note that we are reassigning p here.
When you run p[0] = 255, you are actually touching the memory and it is likely only at this point that the operating system will ensure that the page has a physical representation.
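The three steps above can be put together into a minimal sketch. The names pages_needed and reserve_commit_touch are made up for illustration, and the Windows calls sit behind _WIN32 so that the rounding helper also compiles elsewhere. Treat it as one way to write the flow, not as the canonical one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: memory is committed in whole pages, so a request
   of `bytes` bytes occupies this many pages of `page_size` bytes each. */
size_t pages_needed(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

#ifdef _WIN32
#include <windows.h>

/* Reserve `reserve_bytes` of address space, commit only the first
   `commit_bytes`, and touch the first committed byte.
   Returns 0 on success, 1 if reserving failed, 2 if committing failed. */
int reserve_commit_touch(size_t reserve_bytes, size_t commit_bytes)
{
    /* Step 1: address space only. */
    uint8_t *base = (uint8_t *)VirtualAlloc(NULL, reserve_bytes,
                                            MEM_RESERVE, PAGE_READWRITE);
    if (base == NULL)
        return 1;

    /* Step 2: ask the OS to commit. This is the call that can fail
       when the system is out of commit. */
    uint8_t *p = (uint8_t *)VirtualAlloc(base, commit_bytes,
                                         MEM_COMMIT, PAGE_READWRITE);
    if (p == NULL) {
        VirtualFree(base, 0, MEM_RELEASE);
        return 2;
    }

    /* Step 3: first touch of the committed page. */
    p[0] = 255;

    /* Releasing the reservation also decommits everything inside it. */
    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}
#endif
```

Calling reserve_commit_touch with a large reservation and a small commit size shows the asymmetry: the reservation is cheap, and only the commit step is subject to the out-of-memory failure discussed next.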

The most commonly used high-level allocation functions do not know the distinction: they always give you committed memory. Windows will happily let you allocate tons of virtual memory, but once you ask it to commit, it can fail: VirtualAlloc returns NULL if the amount of committed memory would exceed the combined size of your RAM and your page file. This means that one of the prime reasons why your favorite allocation functions might return NULL on Windows is that the OS won’t commit to giving you more memory, because the system-wide total size of committed memory would then exceed the combined size of RAM and page file. Windows pushes out-of-memory conditions into your program through allocation failures¹.

So how much memory is left?

Now, what APIs exist to see how much memory is still available? The functions you may want to use are GlobalMemoryStatusEx, GetPerformanceInfo, and GetProcessMemoryInfo.

On Windows, there are broadly five things that these APIs make measurable for you:

You can measure currently available and total virtual memory (see GlobalMemoryStatusEx). That’s just “how much virtual address space did a process reserve?” On a 64-bit system you are in all likelihood not going to run out of virtual memory.
You can measure the system-wide RAM usage, called “physical memory usage” (see GlobalMemoryStatusEx). This is not “memory that has a physical representation” but memory that is actually, physically in RAM. It does not include the page file. This value is naturally system-wide, because RAM is associated with the entire system. This is a useful value to track: not because you may fail to allocate, but because being close to your physical memory limit means that you are likely constantly running into hard page faults, which will grind programs to a halt.
You can measure the system-wide committed memory and its limit (see GlobalMemoryStatusEx and GetPerformanceInfo). The naming for this is not consistent, some APIs call this “page file usage” or the “global commit charge.” This is a useful value to track, because every typical allocation you make is going to count against the commit value. If you only track one metric, track this. Note that GetPerformanceInfo reports the result in pages, not in bytes.
You can measure the process-specific “private bytes” (see GetProcessMemoryInfo). This is also called the “private usage” or the “process commit charge.” All regular allocations count towards this. This value by itself can be an indicator for memory usage, but it does not accurately represent how your process contributes to the thing that actually counts: the system-wide committed memory. First, your process could share memory with other processes. You can get an indication of the amount of shared memory as well via APIs, but that doesn’t necessarily help you in accounting since you still don’t know which processes share memory. Second, there are other (admittedly obscure) ways to increase committed memory, see Pushing the Limits of Windows: Virtual Memory.
You can measure the process-specific “working set size” (see GetProcessMemoryInfo). The working set is the set of pages of your process’ virtual address space that are currently mapped to RAM. The working set can include pages that are shared with other processes, so it is misleading to claim that this represents the amount of physical memory your process is responsible for.
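As a rough sketch of how the system-wide commit value and the process-specific values from the list above can be read: the function names below are hypothetical, the struct fields are the ones documented for PERFORMANCE_INFORMATION and PROCESS_MEMORY_COUNTERS_EX, and the Windows part is guarded by _WIN32 so the page-to-byte arithmetic can be compiled anywhere.

```c
#include <stdint.h>

/* Hypothetical helper: remaining system-wide commit in bytes, computed
   from the page counts that GetPerformanceInfo reports. */
uint64_t commit_headroom_bytes(uint64_t commit_limit_pages,
                               uint64_t commit_total_pages,
                               uint64_t page_size)
{
    if (commit_total_pages >= commit_limit_pages)
        return 0;
    return (commit_limit_pages - commit_total_pages) * page_size;
}

#ifdef _WIN32
#include <windows.h>
#include <psapi.h> /* may require linking against psapi.lib */

/* System-wide: how many more bytes can be committed right now?
   Returns 0 on success, -1 on failure. */
int query_commit_headroom(uint64_t *out_bytes)
{
    PERFORMANCE_INFORMATION pi;
    pi.cb = sizeof(pi);
    if (!GetPerformanceInfo(&pi, sizeof(pi)))
        return -1;
    /* CommitLimit and CommitTotal are in pages, not bytes. */
    *out_bytes = commit_headroom_bytes(pi.CommitLimit, pi.CommitTotal,
                                       pi.PageSize);
    return 0;
}

/* Per process: private bytes and working set size, both in bytes.
   Returns 0 on success, -1 on failure. */
int query_process_memory(uint64_t *private_bytes, uint64_t *working_set)
{
    PROCESS_MEMORY_COUNTERS_EX pmc;
    pmc.cb = sizeof(pmc);
    if (!GetProcessMemoryInfo(GetCurrentProcess(),
                              (PROCESS_MEMORY_COUNTERS *)&pmc, sizeof(pmc)))
        return -1;
    *private_bytes = pmc.PrivateUsage;
    *working_set = pmc.WorkingSetSize;
    return 0;
}
#endif
```

The headroom is only a snapshot: other processes commit and release memory all the time, so the number can be stale by the time you act on it.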

Note that committed memory does not generally correlate with physical memory or working set size. To illustrate this, consider these extreme cases:

You can have very high committed memory usage but very low physical memory usage (and working set size), because you never touched the pages. You are still going to run out of memory.
You can have a high physical memory usage but a very low committed memory usage. This is a bit counter-intuitive, but you can get this by manually memory mapping a file into memory: There’s no need to commit any memory. The file contents will still be loaded into RAM on access, but we can always write back to disk if we need space. There is no need for any additional physical backing (unless you use copy-on-write).

Great, so all you have to do is call the system APIs and be done with it, right? In general, yes. But when you run in a docker container, that is unfortunately insufficient. You can still call all of these functions, but they do not give you the information you would hope to get. To elaborate a little bit, Windows docker containers can be run in two different isolation modes (see MSDN page Isolation Modes): You can run in “process isolation mode” or in “Hyper-V isolation mode.” The latter essentially is just a VM. If you are running docker locally, you are likely using Hyper-V. If you run docker on a Windows server machine, you are likely running in “process isolation mode.” In both cases, the APIs above are going to produce wrong results (and you might not even get to choose what isolation mode you need to use). In process isolation mode, you just get the stats of the host system². In Hyper-V isolation mode, the APIs above will report the memory statistics of the virtual machine. That is going to be close to what you need, but not quite: The virtual machine always comes with a minimum amount of memory that is required to even run the system, and your container does not get to use the entirety of that memory. Both isolation modes use the same feature to enforce the memory limit: Windows job objects (see MSDN page Job Objects).

Job objects are basically nestable groups of processes. The two main APIs of interest here are IsProcessInJob and QueryInformationJobObject. You can call both of these with a NULL job handle to represent the current process’s job, if it exists. Using JobObjectExtendedLimitInformation as an argument for QueryInformationJobObject, you can find the peak memory usage and the commit limit for your job. At this point, the rest of the story could be really simple: You use the JobObjectMemoryUsageInformation = 28 argument to get the current memory usage… except that this is not exposed: There is seemingly no way to get the current memory usage for a job object³.
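The documented part of this can be sketched as follows. The function names are hypothetical. The helper job_headroom_lower_bound is my own addition and rests on an assumption: current usage can never exceed the peak, so the limit minus the peak is a conservative lower bound on what is left, provided both values count the same kind of memory.

```c
#include <stdint.h>

/* Hypothetical helper: the current usage is not exposed, but it is at most
   the peak, so limit - peak never overestimates the remaining headroom. */
uint64_t job_headroom_lower_bound(uint64_t limit_bytes, uint64_t peak_bytes)
{
    return peak_bytes >= limit_bytes ? 0 : limit_bytes - peak_bytes;
}

#ifdef _WIN32
#include <windows.h>

/* Returns 0 and fills both outputs if the current process is in a job
   with a job-wide memory limit, 1 if it is in a job without such a
   limit, and -1 if it is not in a job or the query failed. */
int query_job_memory_limit(uint64_t *limit_bytes, uint64_t *peak_bytes)
{
    BOOL in_job = FALSE;
    /* A NULL job handle asks "is this process in any job at all?" */
    if (!IsProcessInJob(GetCurrentProcess(), NULL, &in_job) || !in_job)
        return -1;

    JOBOBJECT_EXTENDED_LIMIT_INFORMATION info;
    /* A NULL job handle here means "the job of the calling process". */
    if (!QueryInformationJobObject(NULL, JobObjectExtendedLimitInformation,
                                   &info, sizeof(info), NULL))
        return -1;

    if (!(info.BasicLimitInformation.LimitFlags & JOB_OBJECT_LIMIT_JOB_MEMORY))
        return 1;

    *limit_bytes = info.JobMemoryLimit;
    *peak_bytes = info.PeakJobMemoryUsed;
    return 0;
}
#endif
```

The lower bound gets useless quickly: after a single spike in memory usage it stays pessimistic for the rest of the job's lifetime, which is exactly why the current usage would be the value to have.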

But we know that this value is tracked: You can see that value in a kernel debugger. There is some internal Microsoft code that uses this API in their HCS shim, which is how docker gets this information to be displayed in its stats. To do so you (seemingly) need to open the job object manually with the right permissions and run with elevated access – but so far I have failed to find a way to get an explicit handle to the job object from within the docker container, and running arbitrary code as administrator is not a good idea, even inside of a container (see this blog post How to change the user account for Windows containers). And that’s assuming you are willing to rely on undocumented Windows APIs. So, dear Satya, if you are reading this please tell someone to expose that API, thank you.

Oh, and Linux

While we are talking about docker containers, there is some fun to be had when you take Windows code and move it to Linux (using Wine). Of course this is going to cause all sorts of problems, but there is one particular problem worth calling out. Linux handles out-of-memory conditions very differently, at least by default: Like Windows, Linux will also allow you to allocate more virtual memory than the system can actually physically provide. But unlike Windows, it will then also commit to more memory than it can physically provide. This strategy is called “overcommit.” Overcommit is great when you assume that most programs do not use most of the memory you ask for. Linux allows you to configure whether to use overcommit or not. There are three settings:

Heuristic overcommit: Generally allow overcommit, but do some sanity checking on the incoming allocations (to catch cases that are most likely just wrong).
Always overcommit.
Never overcommit.

Linux 2.6 added the “never overcommit” option and set the default to “heuristic overcommit.” See the overcommit accounting documentation for details.
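To find out which of the three settings a machine is using, you can read /proc/sys/vm/overcommit_memory, where the documented values 0, 1, and 2 correspond to the list above. A small sketch, with hypothetical function names:

```c
#include <stdio.h>

/* Maps the value of vm.overcommit_memory to the names used above. */
const char *overcommit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "heuristic overcommit";
    case 1:  return "always overcommit";
    case 2:  return "never overcommit";
    default: return "unknown";
    }
}

/* Reads the current setting; returns -1 if the file cannot be read or
   parsed (for example when not running on Linux). */
int read_overcommit_mode(void)
{
    FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
    int mode = -1;
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

Printing overcommit_mode_name(read_overcommit_mode()) at startup is a cheap way to notice that your code runs under a different policy than the one it was written for.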

Overcommit means that the location where you run out of memory changes drastically: Instead of getting a failure from your allocator (say, NULL from malloc), the OS now needs to handle the case that it is running out of memory which it has already committed to providing. For a program that checks the results of all allocations to detect out-of-memory conditions, overcommit is bad news: Instead of failing at an allocation site your program will now fail at a much later point when it touches the memory it allocated. This moves the symptom further away from the problem, if your program has been written with the opposite behavior in mind.

Unsurprisingly, just switching the system setting to instead never overcommit causes a whole slew of new interesting failures because you are now using a setting that is not the default, which just means that it is less likely that that option received the same level of testing and exposure as the default. The problems are probably complex. Wine seems to generally still want to do the right thing. For example, VirtualAlloc with MEM_RESERVE is properly mapped to mmap with PROT_NONE (indicating that the memory you allocated may not be accessed) and correctly does not count as committed memory.
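The reserve-then-commit pattern mentioned above looks roughly like this on Linux. This is not Wine's actual code, just a sketch of the general technique with made-up function names: reserve with an inaccessible mapping, then make a range accessible when you want to use it.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on strict compilers */
#include <stddef.h>
#include <sys/mman.h>

/* "Reserve": claim address space that may not be accessed.
   Returns NULL on failure. */
void *reserve_address_space(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* "Commit": make a page-aligned range readable and writable. This turns
   the range into a private writable mapping, which is the kind of memory
   that overcommit accounting looks at. Returns 0 on success, -1 on
   failure (errno is set by mprotect). */
int commit_range(void *p, size_t bytes)
{
    return mprotect(p, bytes, PROT_READ | PROT_WRITE);
}

/* Give the whole reservation back. Returns 0 on success. */
int release_address_space(void *p, size_t bytes)
{
    return munmap(p, bytes);
}
```

With “never overcommit,” a failure of commit_range plays the role that a NULL from VirtualAlloc with MEM_COMMIT plays on Windows. With the default setting, the call will usually succeed and the problem shows up later, when the memory is touched.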

With overcommit enabled, OOM conditions now need to be handled by the OS, not the program. The Linux kernel’s “OOM killer” is then in charge of handling the out-of-memory conditions, and it will forcefully free up memory by heuristically selecting some process and killing it. On current kernels there is no kind request to exit first: the victim gets a SIGKILL and no chance to clean up. You can influence that heuristic and mark your programs as non-targets, but ultimately that is likely just getting some other vital process killed. If there are no valid targets, the kernel panics.
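Influencing the heuristic is done through /proc/self/oom_score_adj, which takes values from -1000 (never select this process) to 1000 (select it first). A minimal sketch with hypothetical function names. Lowering the value typically requires elevated privileges, so expect the write to fail in an unprivileged process.

```c
#include <stdio.h>

/* The kernel accepts values in [-1000, 1000]; clamp anything else. */
int clamp_oom_score_adj(int adj)
{
    if (adj < -1000)
        return -1000;
    if (adj > 1000)
        return 1000;
    return adj;
}

/* Writes the adjustment for the current process.
   Returns 0 on success, -1 if the file cannot be opened or written. */
int set_oom_score_adj(int adj)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");
    int ok;
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%d\n", clamp_oom_score_adj(adj)) > 0;
    /* The kernel may only reject the value when the buffer is flushed,
       so the result of fclose matters too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Raising the value for expendable helper processes is usually the friendlier option compared to exempting your main process.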

I can see the upsides of this design, but one downside is that it again complicates the question of how much you can allocate: You can have all the memory, actually, if you don’t care about killing everyone else. Sarcasm aside, my adventures in this space have stopped short of actually having to answer how you would most effectively determine how much memory you can still safely allocate in a Linux container. I am cautiously optimistic that this is easier to solve than on Windows.

¹ Unless you reduce the size of your pagefile at runtime. No idea what happens in that case! Increasing the page file size is covered in the docs, but reducing it is probably very different.
² That gets especially noteworthy when you run multiple containers in parallel. They may only be able to commit so much, but the system-wide physical memory is shared between them all and as we learned it is very much possible to use a lot of physical memory without actually committing all that much.
³ You can however set up a subscription so you can query if your limit was violated and how much memory was committed at that point.