I have a cluster of 3 Kafka brokers (version 0.10.2.1). Each broker runs on its own host with 2 CPUs / 16 GB RAM, and we use Docker to wrap the broker process.
The problem is as follows: almost every day, at the same time, all of our Kafka clients fail for about 10 minutes. At first I thought it was related to Kafka No broker in ISR for partition, but after a while I discovered that the broker simply crashes because of the OOM-killer.
I also played with Xmx and Xms before I discovered that it was the OOM-killer. I tried:
-Xmx2048M -Xms2048M
-Xmx4096M -Xms2048M
Same behavior for both.
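For context, this is roughly how the heap flags are passed to the broker; I'm assuming the stock kafka-server-start.sh entrypoint inside the container, which reads the KAFKA_HEAP_OPTS environment variable (your image may wire this differently):

# sketch: heap size handed to the broker via KAFKA_HEAP_OPTS
export KAFKA_HEAP_OPTS="-Xms2048M -Xmx4096M"
bin/kafka-server-start.sh config/server.properties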
In addition, we currently don't have a ulimit restriction:
>> ulimit unlimited
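This is roughly how I checked the limits that actually apply to the running broker process; the docker run line is only an example of where a limit could be set, and the image name is hypothetical:

# limits seen by the running broker (kafka.Kafka is the broker's main class)
cat /proc/$(pgrep -f kafka.Kafka)/limits
# limits can also be capped per container at start time; my-kafka-image is hypothetical
docker run --ulimit nofile=128000:128000 my-kafka-image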
less kern.log
LOGS:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761019] run-parts invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761022] run-parts cpuset=/ mems_allowed=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761026] CPU: 1 PID: 12266 Comm: run-parts Not tainted 4.4.0-59-generic #80-Ubuntu
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761027] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761029] 0000000000000286 000000004811d7da ffff880036967af0 ffffffff813f7583
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761031] ffff880036967cc8 ffff880439f2f000 ffff880036967b60 ffffffff8120ad5e
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761033] ffffffff81cd2dc7 0000000000000000 ffffffff81e67760 0000000000000206
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761036] Call Trace:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761041] [<ffffffff813f7583>] dump_stack+0x63/0x90
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761044] [<ffffffff8120ad5e>] dump_header+0x5a/0x1c5
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761048] [<ffffffff81192722>] oom_kill_process+0x202/0x3c0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761049] [<ffffffff81192b49>] out_of_memory+0x219/0x460
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761052] [<ffffffff81198abd>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761054] [<ffffffff81198eb6>] __alloc_pages_nodemask+0x286/0x2a0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761056] [<ffffffff81198f6b>] alloc_kmem_pages_node+0x4b/0xc0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761060] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761063] [<ffffffff81391bcc>] ? apparmor_file_alloc_security+0x5c/0x220
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761066] [<ffffffff811ed05a>] ? kmem_cache_alloc+0x1ca/0x1f0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761070] [<ffffffff81347bd3>] ? security_file_alloc+0x33/0x50
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761073] [<ffffffff810caf11>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761074] [<ffffffff810805a0>] _do_fork+0x80/0x360
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761076] [<ffffffff81080929>] SyS_clone+0x19/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761080] [<ffffffff818384f2>] entry_SYSCALL_64_fastpath+0x16/0x71
And ….
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.954463] Out of memory: Kill process 16123 (java) score 134 or sacrifice child
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.958609] Killed process 16123 (java) total-vm:11977548kB, anon-rss:2035780kB, file-rss:67848kB
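This is how I confirmed that the broker was killed by the OOM-killer rather than dying inside the JVM (standard Ubuntu log paths assumed):

# look for OOM kill records in the kernel log
grep -i "killed process" /var/log/kern.log
dmesg -T | grep -i oom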
Any suggestions on how to approach this?
Answer
We found the problem. First I will say that adding more RAM to the machines also solved the problem, but that is an "expensive" solution.
The problem was as follows: since I was using the EC2 Ubuntu distribution, the daily crontabs ran on all of my cluster nodes at exactly the same time. One of the scripts was mlocate, and this script apparently consumed too many resources.
I assume that since the whole Kafka cluster was under IO and memory pressure at that moment, the brokers tried to use more memory and the OOM killer killed them. With 2 of my 3 brokers down, some services went down as well.
So the solution was:
Change the crontab so the daily jobs run at a different hour on each broker.
Disable mlocate (a sketch of both changes is shown below).
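A minimal sketch of both changes, assuming a stock Ubuntu setup where run-parts (the process that invoked the oom-killer in the log above) launches /etc/cron.daily at 06:25 by default; the replacement hour is hypothetical, pick a different one per broker:

# /etc/crontab -- move the daily run away from 06:25, and stagger it across brokers
25 2 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

# disable the mlocate daily job, or remove the package entirely
sudo chmod -x /etc/cron.daily/mlocate
# sudo apt-get remove mlocate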