I have a cluster of 3 Kafka brokers (version 0.10.2.1). Each broker runs on its own host with 2 CPUs / 16 GB RAM, and we use Docker to wrap the broker process.
The problem is as follows: almost every day, at the same time, all of our Kafka clients fail for about 10 minutes. At first I thought it was related to Kafka No broker in ISR for partition, but after a while I discovered that the broker simply crashes because of the OOM-killer.
I also played with Xmx and Xms before I discovered that it was the OOM-killer. I tried:
-Xmx2048M -Xms2048M
-Xmx4096M -Xms2048M
Same behavior for both.
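For context, this is roughly how the heap flags are passed to the broker; I'm assuming the stock kafka-server-start.sh entrypoint inside the container, which reads the KAFKA_HEAP_OPTS environment variable (your image may wire this differently):

# sketch: heap size handed to the broker via KAFKA_HEAP_OPTS
export KAFKA_HEAP_OPTS="-Xms2048M -Xmx4096M"
bin/kafka-server-start.sh config/server.properties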
In addition, we currently don't have a ulimit restriction:
>> ulimit unlimited
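This is roughly how I checked the limits that actually apply to the running broker process; the docker run line is only an example of where a limit could be set, and the image name is hypothetical:

# limits seen by the running broker (kafka.Kafka is the broker's main class)
cat /proc/$(pgrep -f kafka.Kafka)/limits
# limits can also be capped per container at start time; my-kafka-image is hypothetical
docker run --ulimit nofile=128000:128000 my-kafka-image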
less kern.log
LOGS:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761019] run-parts invoked oom-killer: gfp_mask=0x26000c0, order=2, oom_score_adj=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761022] run-parts cpuset=/ mems_allowed=0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761026] CPU: 1 PID: 12266 Comm: run-parts Not tainted 4.4.0-59-generic #80-Ubuntu
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761027] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761029] 0000000000000286 000000004811d7da ffff880036967af0 ffffffff813f7583
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761031] ffff880036967cc8 ffff880439f2f000 ffff880036967b60 ffffffff8120ad5e
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761033] ffffffff81cd2dc7 0000000000000000 ffffffff81e67760 0000000000000206
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761036] Call Trace:
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761041] [<ffffffff813f7583>] dump_stack+0x63/0x90
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761044] [<ffffffff8120ad5e>] dump_header+0x5a/0x1c5
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761048] [<ffffffff81192722>] oom_kill_process+0x202/0x3c0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761049] [<ffffffff81192b49>] out_of_memory+0x219/0x460
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761052] [<ffffffff81198abd>] __alloc_pages_slowpath.constprop.88+0x8fd/0xa70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761054] [<ffffffff81198eb6>] __alloc_pages_nodemask+0x286/0x2a0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761056] [<ffffffff81198f6b>] alloc_kmem_pages_node+0x4b/0xc0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761060] [<ffffffff8107ea5e>] copy_process+0x1be/0x1b70
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761063] [<ffffffff81391bcc>] ? apparmor_file_alloc_security+0x5c/0x220
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761066] [<ffffffff811ed05a>] ? kmem_cache_alloc+0x1ca/0x1f0
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761070] [<ffffffff81347bd3>] ? security_file_alloc+0x33/0x50
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761073] [<ffffffff810caf11>] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761074] [<ffffffff810805a0>] _do_fork+0x80/0x360
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761076] [<ffffffff81080929>] SyS_clone+0x19/0x20
Jan 23 06:25:16 kafka10-172-40-103-177 kernel: [16504862.761080] [<ffffffff818384f2>] entry_SYSCALL_64_fastpath+0x16/0x71
And ….
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.954463] Out of memory: Kill process 16123 (java) score 134 or sacrifice child
Jan 24 06:25:25 kafka10-172-40-103-177 kernel: [16591270.958609] Killed process 16123 (java) total-vm:11977548kB, anon-rss:2035780kB, file-rss:67848kB
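This is how I confirmed that the broker was killed by the OOM-killer rather than dying inside the JVM (standard Ubuntu log paths assumed):

# look for OOM kill records in the kernel log
grep -i "killed process" /var/log/kern.log
dmesg -T | grep -i oom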
Any suggestions on how to approach this?
Answer
We found the problem. First I will say that adding more RAM to the machines also solved the problem, but that is an "expensive" solution.
The problem was as follows: since I was using the EC2 Ubuntu distribution, the daily crontabs ran on all of my cluster nodes at exactly the same time. One of the scripts was mlocate, and this script apparently consumed too many resources.
I assume that since the whole Kafka cluster was under IO and memory pressure at that moment, the brokers tried to use more memory and the OOM killer killed them. With 2 of my 3 brokers down, some services went down as well.
So the solution was:
Change the crontab so the daily jobs run at a different hour on each broker.
Disable mlocate (a sketch of both changes is shown below).
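A minimal sketch of both changes, assuming a stock Ubuntu setup where run-parts (the process that invoked the oom-killer in the log above) launches /etc/cron.daily at 06:25 by default; the replacement hour is hypothetical, pick a different one per broker:

# /etc/crontab -- move the daily run away from 06:25, and stagger it across brokers
25 2 * * * root test -x /usr/sbin/anacron || ( cd / && run-parts --report /etc/cron.daily )

# disable the mlocate daily job, or remove the package entirely
sudo chmod -x /etc/cron.daily/mlocate
# sudo apt-get remove mlocate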