Postgres latency issues during memory compaction in Ubuntu Linux

We recently upgraded the EC2 instance that hosts our Postgres database to an i2.8xlarge with 244GB of memory, mainly to make use of the large amount of ephemeral storage it comes with. Since upgrading, we’ve been seeing latency problems in Postgres that appear to be caused by memory compaction in the Linux kernel.

We’re using PostgreSQL 9.3 on a recent Ubuntu 14.04 kernel, with the following (hopefully relevant) subset of our config:

max_connections = 1000
effective_cache_size = '220GB'
shared_buffers = '24GB'
work_mem = '25MB'
maintenance_work_mem = '1024MB'
fsync = off
full_page_writes = on
synchronous_commit = off

We have transparent huge pages completely disabled on this server: /sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag are both set to never, and /sys/kernel/mm/transparent_hugepage/khugepaged/defrag is set to 0. We’re fairly sure that THP is not the cause, because the thp_* stats and the nr_anon_transparent_hugepages stat in /proc/vmstat never increment.
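For anyone wanting to repeat the THP check described above, here is a minimal sketch in Python. The helper names (active_thp_choice, thp_vmstat_counters) are made up for illustration, and it assumes the usual sysfs convention of marking the active option with square brackets, as in "always madvise [never]".

```python
from pathlib import Path

THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_thp_choice(text):
    """Return the active option from a THP sysfs file.

    Files such as 'enabled' list every option and bracket the active one,
    e.g. 'always madvise [never]'. Files such as khugepaged/defrag hold a
    bare value like '0', which is returned unchanged.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return text.strip()


def thp_vmstat_counters(vmstat_text):
    """Collect the thp_* counters and nr_anon_transparent_hugepages."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        if name.startswith("thp_") or name == "nr_anon_transparent_hugepages":
            counters[name] = int(value)
    return counters


if __name__ == "__main__":
    # Only touches files that exist, so it is harmless on non-Linux hosts.
    for rel in ("enabled", "defrag", "khugepaged/defrag"):
        path = THP_DIR / rel
        if path.exists():
            print(rel, "=", active_thp_choice(path.read_text()))
    vmstat = Path("/proc/vmstat")
    if vmstat.exists():
        print(thp_vmstat_counters(vmstat.read_text()))
```

If the counters printed by two runs some minutes apart are identical, that matches the "never increment" observation above.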

Our issue is that we see constant memory compaction events (both failures and successes) in /proc/vmstat, with all the compact_* stats incrementing frequently. Some of these cause pretty severe stalls that impact our application and get worse over time, presumably as memory fragmentation gets worse. We’re tracking the stats from /sys/kernel/debug/extfrag/unusable_index and often see a flurry of movement between the different page orders when stall-causing events occur.
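The kind of tracking described above can be sketched roughly as follows. This is not our actual monitoring code; the function names are hypothetical, and the unusable_index parser assumes lines shaped like "Node 0, zone   Normal 0.000 0.001 ...", with one value per page order.

```python
def read_compact_counters(vmstat_text):
    """Extract the compact_* counters from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("compact_"):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before, after):
    """Per-counter increase between two samples (missing counters count as 0)."""
    return {name: after[name] - before.get(name, 0) for name in after}


def parse_unusable_index(text):
    """Map (node, zone) to a list of unusable-index values, one per page order.

    Assumed line format: 'Node 0, zone   Normal 0.000 0.001 0.003 ...'.
    Lines that do not match that shape are skipped.
    """
    result = {}
    for line in text.splitlines():
        parts = line.replace(",", " ").split()
        # Expected tokens: ['Node', '0', 'zone', 'Normal', '0.000', ...]
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        result[(int(parts[1]), parts[3])] = [float(v) for v in parts[4:]]
    return result
```

Reading /proc/vmstat twice a few seconds apart and passing both results through counter_deltas gives the compaction rate over that interval, which can then be lined up against the times of the stalls. Note that /sys/kernel/debug is normally readable only by root and needs debugfs mounted.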

We’re wondering whether this is down to some combination of Postgres version, Linux kernel version and the large amount of memory involved. Most of the memory usage is file cache, so Linux might be doing things with it that Postgres isn’t happy about. So far we haven’t come up with anything beyond guessing that a more recent version of Postgres (9.4 or 9.5) might avoid the issue altogether for some reason.

$ uname -a
Linux db-01 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ dpkg -l postgresql-9.3
postgresql-9.3     9.3.12-1.pgdg14.04+1

Answer

After a separate discussion on the DBA StackExchange, it was suggested that we try a kernel other than the stock Trusty one (3.13). We tested with the Xenial HWE (4.4) kernel on Trusty and things improved dramatically, so the problem appears to be due to the 3.13 kernel in Trusty, possibly only its later versions: we’d been on this kernel for a while and only had problems more recently, so it may have been something that was introduced in an update.
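To see quickly which hosts are still on the 3.13 series, something like the sketch below could be used. The function names are made up for illustration, and it only compares the major.minor part of the release string (the same string uname -r prints), so it cannot tell a good 3.13 build from a bad one.

```python
import platform
import re


def parse_kernel_version(release):
    """Return (major, minor) from a release string such as '3.13.0-85-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return int(match.group(1)), int(match.group(2))


def is_trusty_stock_series(release):
    """True if the release belongs to the 3.13 series shipped with Trusty."""
    return parse_kernel_version(release) == (3, 13)


if __name__ == "__main__":
    release = platform.release()
    try:
        flagged = is_trusty_stock_series(release)
    except ValueError:
        flagged = False
    print(release, "-> 3.13 series" if flagged else "-> not 3.13 series")
```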
