We’ve recently upgraded our EC2 instance that hosts our Postgres database to an i2.8xlarge with 244GB of memory (this is to utilise the large amounts of ephemeral storage it comes with). Since upgrading, we’ve been having some issues with latency in Postgres that appear to be due to memory compaction that’s occurring in the Linux kernel.
We’re using PostgreSQL 9.3 on a recent Ubuntu 14.04 kernel running the following (hopefully relevant subset of) config:
max_connections = 1000
effective_cache_size = '220GB'
shared_buffers = '24GB'
work_mem = '25MB'
maintenance_work_mem = '1024MB'
fsync = off
full_page_writes = on
synchronous_commit = off
We have transparent huge pages completely disabled on this server (/sys/kernel/mm/transparent_hugepage/enabled and /sys/kernel/mm/transparent_hugepage/defrag are both set to never, and /sys/kernel/mm/transparent_hugepage/khugepaged/defrag is set to 0), and we’re fairly sure that we’re not seeing any issues as a result of THP because the thp_* stats and the nr_anon_transparent_hugepages stat in /proc/vmstat never increment.
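For anyone wanting to repeat the THP check described above, here is a minimal Python sketch. The function names are made up for illustration, and it assumes the usual sysfs format where the active value is shown in square brackets (for example "always madvise [never]").

```python
from pathlib import Path
from typing import Dict

# Standard sysfs location of the THP settings mentioned above.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_thp_setting(text: str) -> str:
    """Return the active value from a line such as 'always madvise [never]'."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    # Files such as khugepaged/defrag hold a bare value like '0'.
    return text.strip()


def thp_counters(vmstat_text: str) -> Dict[str, int]:
    """Pick the thp_* and nr_anon_transparent_hugepages counters out of /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        if name.startswith("thp_") or name == "nr_anon_transparent_hugepages":
            counters[name] = int(value)
    return counters


def read_thp_state() -> Dict[str, str]:
    """Read the three settings from sysfs (hypothetical helper; needs a Linux host with THP)."""
    files = ("enabled", "defrag", "khugepaged/defrag")
    return {rel: active_thp_setting((THP_DIR / rel).read_text()) for rel in files}
```

Calling thp_counters(Path("/proc/vmstat").read_text()) twice, some time apart, and comparing the two results should show whether any of the THP counters are moving.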
Our issue is that we see constant memory compaction events (both failures and successes) in /proc/vmstat, with all the stats under compact_* incrementing frequently. Some of these cause pretty severe stalls that get worse over time (presumably as memory fragmentation gets worse) and impact our application. We’re tracking the stats from /sys/kernel/debug/extfrag/unusable_index and often see a flurry of movement between the different page orders when stall-causing events occur.
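The kind of tracking described above could be sketched roughly as follows in Python. The function names are hypothetical, and the parser assumes unusable_index lines look like "Node 0, zone Normal 0.000 0.125 ...", with one value per page order; check that against the actual file on your kernel.

```python
from typing import Dict, List


def compaction_counters(vmstat_text: str) -> Dict[str, int]:
    """Extract the compact_* counters from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0].startswith("compact_"):
            counters[parts[0]] = int(parts[1])
    return counters


def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before.get(name, 0) for name in after}


def parse_unusable_index(text: str) -> Dict[str, List[float]]:
    """Map 'Node N ZONE' to its list of per-order unusable-index values.

    Assumes each line has the form 'Node N, zone ZONE v0 v1 ...'.
    """
    result = {}
    for line in text.splitlines():
        if "zone" not in line:
            continue
        head, _, tail = line.partition("zone")
        fields = tail.split()
        if not fields:
            continue
        key = f"{head.strip().rstrip(',')} {fields[0]}"
        result[key] = [float(value) for value in fields[1:]]
    return result
```

Taking a snapshot of /proc/vmstat every few seconds and passing consecutive pairs to counter_deltas gives a per-interval compaction rate that can be lined up against the stalls. Reading the debugfs file normally requires root and a mounted debugfs.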
We’re wondering whether this is just down to some combination of Postgres version, Linux kernel version and the large amount of memory involved (most of the memory usage is file cache, so Linux might be doing things with it that Postgres isn’t happy about). So far we haven’t come up with anything better than assuming that a more recent version of Postgres (9.4 or 9.5) might avoid the issue altogether for some reason.
$ uname -a
Linux db-01 3.13.0-85-generic #129-Ubuntu SMP Thu Mar 17 20:50:15 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
$ dpkg -l postgresql-9.3
postgresql-9.3 9.3.12-1.pgdg14.04+1
Answer
After a separate discussion on the DBA StackExchange, it was suggested that we try a kernel other than the stock Trusty (3.13) one. We tested with the Xenial HWE (4.4) kernel on Trusty and things improved dramatically, so the problem appears to be due to the 3.13 kernel in Trusty, possibly only its later versions: we’d been on this kernel for a while and only had problems more recently, so it may have been something that was introduced recently.