Why can a user-process invoke the Linux OOM-killer due to memory fragmentation, even though plenty of RAM is available?

I’ve got a headless ARM-based Linux (v3.10.53-1.1.1) system with no swap space enabled, and I occasionally see processes get killed by the OOM-killer even though there is plenty of RAM available.

Running echo 1 > /proc/sys/vm/compact_memory periodically seems to keep the OOM-killer at bay, which makes me think that memory fragmentation is the culprit. But I don’t understand why a user process would ever need physically contiguous blocks in the first place: as I understand it, even in the worst-case scenario (complete fragmentation, with only individual 4K pages available), the kernel could simply allocate the necessary number of individual 4K pages and then use the Magic of Virtual Memory ™ to make them look contiguous to the user process.
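For what it’s worth, the fragmentation itself can be watched directly. A small sketch (assuming the standard Linux /proc/buddyinfo layout, where each row lists free-block counts per allocation order) that totals the higher-order free blocks in each zone:

```shell
# Each /proc/buddyinfo row lists free block counts per order:
# order 0 = 4K pages, ..., order 10 = 4M blocks. Zeros on the
# right-hand side mean no large physically-contiguous runs remain.
awk '/zone/ { high = 0
              for (i = 10; i <= NF; i++) high += $i   # orders 5..10
              printf "zone %-8s free blocks of order >= 5: %d\n", $4, high
            }' /proc/buddyinfo
```

Watching those totals fall toward zero while free still reports plenty of RAM is exactly the fragmentation picture described below.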

Can someone explain why the OOM-killer would be invoked in response to memory fragmentation? Is it just a buggy kernel or is there a genuine reason? (And even if the kernel did need to de-frag memory in order to satisfy a request, shouldn’t it do that automatically rather than giving up and OOM’ing?)

I’ve pasted an example OOM-killer invocation below, in case it sheds any light on things. I can reproduce the fault at will; this invocation occurred while the computer still had ~120MB of RAM available (according to free), in response to my test program allocating memory in batches of 10,000 400-byte allocations.

[example OOM-killer kernel log omitted]

Also, here is the test program I use to stress the system and invoke the OOM-killer. With the echo 1 > /proc/sys/vm/compact_memory command being run every so often, the OOM-killer appears only once free reports system RAM close to zero, as expected; without it, the OOM-killer appears well before that, when free still reports 130+MB of RAM available but cat /proc/buddyinfo shows the RAM becoming fragmented:

[test program omitted]


Answer

You are on the right track, Jeremy. The identical thing happened to me on my CentOS desktop system. I am a computer consultant, and I have worked with Linux since 1995. And I pound my Linux systems mercilessly with many file downloads and all sorts of other activities that stretch them to their limits. After my main desktop had been up for about 4 days, it got real slow (slower than 1/10 of normal speed), the OOM killer kicked in, and I was sitting there wondering why my system was acting that way. It had plenty of RAM, but the OOM killer was kicking in when it had no business doing so. So I rebooted it, and it acted fine… for about 4 days, then the problem returned. Bugged the snot out of me not knowing why.

So I put on my test engineer hat and ran all sorts of stress tests on the machine to see if I could reproduce the symptoms on purpose. After several months of this, I was able to recreate the problem at will and prove that my solution for it would work every time.

“Cache turnover” in this context is when a system has to tear down existing cache to create more cache space to support new file writes. Since the system is in a hurry to redeploy the RAM, it does not take the time to defragment the memory it is freeing. So over time, as more and more file writes occur, the cache turns over repeatedly. And the memory in which it resides keeps getting more and more fragmented. In my tests, I found that after the disk cache has turned over about 15 times, the memory becomes so fragmented that the system cannot tear down and then allocate the memory fast enough to keep the OOM killer from being triggered due to lack of free RAM in the system when a spike in memory demand occurs. Such a spike could be caused by executing something as simple as

[command omitted]

On my system, that command creates a demand for about 50MB of new cache. That was what

[monitoring command omitted]

shows, anyway.
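A cache-demand spike like that is easy to reproduce and watch. A sketch (the /tmp/cache_fill path and 50MB size are illustrative) showing the page cache grow as file writes land:

```shell
# Snapshot the page cache, generate ~50MB of file writes, snapshot again.
grep -E '^(MemFree|Cached):' /proc/meminfo
dd if=/dev/zero of=/tmp/cache_fill bs=1M count=50 conv=fsync 2>/dev/null
grep -E '^(MemFree|Cached):' /proc/meminfo   # Cached should be larger now
rm -f /tmp/cache_fill                        # clean up the scratch file
```

Repeat writes like this enough times relative to total RAM and the cache has "turned over" in the sense described above.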

The solution for this problem involves expanding on what you already discovered.

[commands omitted]
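Based on the drop-cache and compaction steps this answer describes (expanding on the compact_memory trick from the question), the command pair is presumably along these lines; both /proc writes require root:

```shell
# Flush dirty pages, drop the reclaimable page cache, then ask the
# kernel to defragment the freed memory. Both writes require root.
if [ "$(id -u)" -ne 0 ]; then
    echo "re-run as root" >&2
else
    sync
    echo 3 > /proc/sys/vm/drop_caches     # free page cache, dentries, inodes
    echo 1 > /proc/sys/vm/compact_memory  # coalesce free pages into larger blocks
fi
```

Dropping the cache first matters: compaction can only coalesce pages that are actually free, so clearing the fragmented cache gives it something to work with.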

And yes, I totally agree that dropping cache will force your system to re-read some data from disk. But at a rate of once per day or even once per hour, the negative effect of dropping cache is absolutely negligible compared to everything else your system is doing, no matter what that might be. The negative effect is so small that I cannot even measure it, and I made my living as a test engineer for 5+ years figuring out how to measure things like that.

If you set up a cron job to execute those once a day, that should eliminate your OOM-killer problem. If you still see problems with the OOM killer after that, consider executing them more frequently. The right frequency will vary depending on how much file writing you do relative to the amount of system RAM your unit has.
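For example, a hypothetical root crontab entry running the pair daily at 03:00 (adjust the schedule to your workload):

```
0 3 * * * sync; echo 3 > /proc/sys/vm/drop_caches; echo 1 > /proc/sys/vm/compact_memory
```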

User contributions licensed under: CC BY-SA