I followed this doc to set up a Mesos cluster.
There are three VMs (Ubuntu 12, CentOS 6.5, CentOS 7.2).
$ cat /etc/hosts
10.142.55.190 zk1
10.142.55.196 zk2
10.142.55.202 zk3
Config on each machine:
$ cat /etc/mesos/zk
zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos
After starting ZooKeeper, mesos-master and mesos-slave on all three VMs, I can view the Mesos web UI (10.142.55.190:5050), but the agent count is 0.
After a short time, the Mesos page shows an error: Failed to connect to 10.142.55.190:5050! Retrying in 16 seconds… (I have since found that ZooKeeper elects a new leader at short intervals.)
master info log:
I0919 15:54:59.677438 13281 http.cpp:2022] Redirecting request for /master/state?jsonp=angular.callbacks._1x to the leading master zk3 I0919 15:55:00.098667 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (768)@10.142.55.202:5050 I0919 15:55:00.385279 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (185)@10.142.55.196:5050 I0919 15:55:00.711119 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (771)@10.142.55.202:5050 I0919 15:55:01.347291 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (188)@10.142.55.196:5050 I0919 15:55:01.597682 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (774)@10.142.55.202:5050 I0919 15:55:02.257159 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (191)@10.142.55.196:5050 I0919 15:55:02.370692 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (777)@10.142.55.202:5050 I0919 15:55:03.205920 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (780)@10.142.55.202:5050 I0919 15:55:03.260007 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (194)@10.142.55.196:5050 I0919 15:55:03.929611 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (783)@10.142.55.202:5050 I0919 15:55:04.033308 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (197)@10.142.55.196:5050 I0919 15:55:04.591275 13284 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (200)@10.142.55.196:5050 I0919 15:55:04.608211 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (786)@10.142.55.202:5050 I0919 15:55:05.184682 13280 replica.cpp:673] Replica in 
VOTING status received a broadcasted recover request from (789)@10.142.55.202:5050 I0919 15:55:05.268277 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (203)@10.142.55.196:5050 I0919 15:55:05.775377 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (206)@10.142.55.196:5050 I0919 15:55:05.916445 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (792)@10.142.55.202:5050 I0919 15:55:06.744927 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (209)@10.142.55.196:5050 I0919 15:55:07.378521 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (5)@10.142.55.202:5050 I0919 15:55:07.393311 13285 network.hpp:430] ZooKeeper group memberships changed I0919 15:55:07.393427 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000709' in ZooKeeper I0919 15:55:07.393985 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000711' in ZooKeeper I0919 15:55:07.394394 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000714' in ZooKeeper I0919 15:55:07.394843 13285 group.cpp:706] Trying to get '/mesos/log_replicas/0000000715' in ZooKeeper I0919 15:55:07.395418 13285 network.hpp:478] ZooKeeper group PIDs: { log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, log-replica(1)@10.142.55.202:5050 } I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050 I0919 15:55:09.059562 13282 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (21)@10.142.55.202:5050 I0919 15:55:09.700711 13286 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (24)@10.142.55.202:5050 I0919 15:55:09.742185 13287 http.cpp:381] HTTP GET for /master/state from 10.142.50.94:59987 with User-Agent='Mozilla/5.0 (Windows NT 6.2; 
WOW64; rv:47.0) Gecko/20100101 Firefox/47.0' I0919 15:55:09.742359 13287 http.cpp:2022] Redirecting request for /master/state?jsonp=angular.callbacks._1y to the leading master zk3 I0919 15:55:10.660789 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (30)@10.142.55.202:5050 I0919 15:55:11.480326 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (34)@10.142.55.202:5050 I0919 15:55:12.386256 13286 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (37)@10.142.55.202:5050 I0919 15:55:12.975137 13287 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (42)@10.142.55.202:5050 I0919 15:55:13.843091 13285 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (47)@10.142.55.202:5050 I0919 15:55:14.373478 13281 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (51)@10.142.55.202:5050 I0919 15:55:14.937181 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (54)@10.142.55.202:5050 I0919 15:55:15.658219 13283 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (58)@10.142.55.202:5050 I0919 15:55:16.007822 13286 network.hpp:430] ZooKeeper group memberships changed I0919 15:55:16.007972 13286 group.cpp:706] Trying to get '/mesos/log_replicas/0000000711' in ZooKeeper I0919 15:55:16.010170 13286 group.cpp:706] Trying to get '/mesos/log_replicas/0000000714' in ZooKeeper I0919 15:55:16.011462 13284 detector.cpp:152] Detected a new leader: (id='702') I0919 15:55:16.011556 13284 group.cpp:706] Trying to get '/mesos/json.info_0000000702' in ZooKeeper I0919 15:55:16.011968 13286 group.cpp:706] Trying to get '/mesos/log_replicas/0000000715' in ZooKeeper I0919 15:55:16.012526 13286 network.hpp:478] ZooKeeper group PIDs: { log-replica(1)@10.142.55.190:5050, log-replica(1)@10.142.55.196:5050, 
log-replica(1)@10.142.55.202:5050 } I0919 15:55:16.013156 13284 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected I0919 15:55:16.013222 13284 master.cpp:1847] The newly elected leader is master@10.142.55.190:5050 with id 677967bc-f6f0-46b3-a44e-72eed1befd60 I0919 15:55:16.013244 13284 master.cpp:1860] Elected as the leading master! I0919 15:55:16.013273 13284 master.cpp:1547] Recovering from registrar I0919 15:55:16.013352 13284 registrar.cpp:332] Recovering registrar I0919 15:55:16.014081 13280 log.cpp:553] Attempting to start the writer I0919 15:55:16.014515 13280 replica.cpp:493] Replica received implicit promise request from (211)@10.142.55.190:5050 with proposal 1204590 I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise request because 2 ignores received I0919 15:55:16.018028 13280 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 3.469479ms I0919 15:55:16.018338 13280 replica.cpp:342] Persisted promised to 1204590 I0919 15:55:16.018508 13282 log.cpp:565] Could not start the writer, but can be retried I0919 15:55:16.018645 13282 log.cpp:553] Attempting to start the writer I0919 15:55:16.018899 13282 replica.cpp:493] Replica received implicit promise request from (215)@10.142.55.190:5050 with proposal 1204591 I0919 15:55:16.022183 13287 consensus.cpp:360] Aborting implicit promise request because 2 ignores received I0919 15:55:16.022367 13280 log.cpp:565] Could not start the writer, but can be retried I0919 15:55:16.022510 13280 log.cpp:553] Attempting to start the writer I0919 15:55:16.028880 13282 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 9.870818ms I0919 15:55:16.029024 13282 replica.cpp:342] Persisted promised to 1204591 I0919 15:55:16.029428 13286 replica.cpp:493] Replica received implicit promise request from (219)@10.142.55.190:5050 with proposal 1204592 I0919 15:55:16.031600 13280 consensus.cpp:360] Aborting implicit promise request because 2 ignores received 
I0919 15:55:16.036208 13283 log.cpp:565] Could not start the writer, but can be retried I0919 15:55:16.036454 13283 log.cpp:553] Attempting to start the writer I0919 15:55:16.040256 13286 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 10.783237ms I0919 15:55:16.040339 13286 replica.cpp:342] Persisted promised to 1204592 I0919 15:55:16.040712 13286 replica.cpp:493] Replica received implicit promise request from (222)@10.142.55.190:5050 with proposal 1204593 I0919 15:55:16.042196 13286 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 1.435071ms I0919 15:55:16.042250 13286 replica.cpp:342] Persisted promised to 1204593 I0919 15:55:16.042981 13286 consensus.cpp:360] Aborting implicit promise request because 2 ignores received I0919 15:55:16.043099 13286 log.cpp:565] Could not start the writer, but can be retried I0919 15:55:16.043303 13283 log.cpp:553] Attempting to start the writer
All later log entries repeat the same loop:
I0919 15:55:16.043676 13286 replica.cpp:493] Replica received implicit promise request from (225)@10.142.55.190:5050 with proposal 1204594
I0919 15:55:16.044122 13286 leveldb.cpp:304] Persisting metadata (10 bytes) to leveldb took 404769ns
I0919 15:55:16.044209 13286 replica.cpp:342] Persisted promised to 1204594
I0919 15:55:16.044837 13281 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
I0919 15:55:16.044926 13281 log.cpp:565] Could not start the writer, but can be retried
I0919 15:55:16.045038 13281 log.cpp:553] Attempting to start the writer
slave info log:
Log file created at: 2016/09/19 15:41:16 Running on machine: ubuntu12 Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg I0919 15:41:16.346844 12986 logging.cpp:194] INFO level logging started! I0919 15:41:16.363313 12986 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni I0919 15:41:16.370334 12986 main.cpp:434] Starting Mesos agent I0919 15:41:16.371184 12986 slave.cpp:198] Agent started on 1)@127.0.1.1:5051 I0919 15:41:16.371636 12986 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher="posix" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" 
--master="zk://10.142.55.190:2181,10.142.55.196:2181,10.142.55.202:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos" I0919 15:41:16.373072 12986 slave.cpp:519] Agent resources: cpus(*):2; mem(*):2930; disk(*):4469; ports(*):[31000-32000] I0919 15:41:16.373291 12986 slave.cpp:527] Agent attributes: [ ] I0919 15:41:16.373347 12986 slave.cpp:532] Agent hostname: ubuntu12 I0919 15:41:16.379895 13005 state.cpp:57] Recovering state from '/var/lib/mesos/meta' I0919 15:41:16.382519 13005 group.cpp:349] Group process (group(1)@127.0.1.1:5051) connected to ZooKeeper I0919 15:41:16.382593 13005 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0) I0919 15:41:16.382663 13005 group.cpp:427] Trying to create path '/mesos' in ZooKeeper I0919 15:41:16.382910 13009 status_update_manager.cpp:200] Recovering status update manager I0919 15:41:16.383419 13009 containerizer.cpp:522] Recovering containerizer I0919 15:41:16.392206 13004 provisioner.cpp:253] Provisioner recovery complete I0919 15:41:16.392354 13004 slave.cpp:4782] Finished recovery I0919 15:41:16.405709 13004 detector.cpp:152] Detected a new leader: (id='678') I0919 15:41:16.406067 13005 group.cpp:706] Trying to get '/mesos/json.info_0000000678' in ZooKeeper I0919 15:41:16.407572 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected I0919 15:41:16.407977 13002 slave.cpp:895] New master detected at master@10.142.55.190:5050 I0919 15:41:16.408043 13002 slave.cpp:916] No credentials provided. 
Attempting to register without authentication I0919 15:41:16.408140 13002 slave.cpp:927] Detecting new master I0919 15:41:16.408223 13005 status_update_manager.cpp:174] Pausing sending status updates I0919 15:42:08.418956 13006 slave.cpp:3732] master@10.142.55.190:5050 exited W0919 15:42:08.419035 13006 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:42:16.374977 13007 slave.cpp:4591] Current disk usage 72.41%. Max allowed age: 1.231186482451933days I0919 15:42:20.007169 13007 detector.cpp:152] Detected a new leader: (id='679') I0919 15:42:20.007297 13007 group.cpp:706] Trying to get '/mesos/json.info_0000000679' in ZooKeeper I0919 15:42:20.008503 13007 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected I0919 15:42:20.008587 13007 slave.cpp:895] New master detected at master@10.142.55.196:5050 I0919 15:42:20.008610 13007 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:42:20.008703 13007 slave.cpp:927] Detecting new master I0919 15:42:20.008750 13007 status_update_manager.cpp:174] Pausing sending status updates I0919 15:43:16.387984 13005 slave.cpp:4591] Current disk usage 72.41%. Max allowed age: 1.231162010606794days I0919 15:43:20.081272 13005 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:43:20.081374 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:43:26.855154 13005 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:43:26.855315 13005 slave.cpp:3737] Master disconnected! 
Waiting for a new master to be elected E0919 15:43:26.855159 13010 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected I0919 15:43:32.020196 13002 detector.cpp:152] Detected a new leader: (id='682') I0919 15:43:32.020300 13002 group.cpp:706] Trying to get '/mesos/json.info_0000000682' in ZooKeeper I0919 15:43:32.022203 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected I0919 15:43:32.022302 13002 slave.cpp:895] New master detected at master@10.142.55.202:5050 I0919 15:43:32.022328 13002 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:43:32.022382 13002 slave.cpp:927] Detecting new master I0919 15:43:32.022423 13002 status_update_manager.cpp:174] Pausing sending status updates I0919 15:44:16.389369 13003 slave.cpp:4591] Current disk usage 72.41%. Max allowed age: 1.231119184877789days I0919 15:44:32.535347 13003 slave.cpp:3732] master@10.142.55.202:5050 exited W0919 15:44:32.535522 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:44:42.005375 13002 detector.cpp:152] Detected a new leader: (id='684') I0919 15:44:42.005496 13002 group.cpp:706] Trying to get '/mesos/json.info_0000000684' in ZooKeeper I0919 15:44:42.006367 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected I0919 15:44:42.006492 13002 slave.cpp:895] New master detected at master@10.142.55.190:5050 I0919 15:44:42.006597 13002 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:44:42.006675 13002 slave.cpp:927] Detecting new master I0919 15:44:42.006577 13008 status_update_manager.cpp:174] Pausing sending status updates I0919 15:45:16.400794 13006 slave.cpp:4591] Current disk usage 72.48%. 
Max allowed age: 1.226390000804074days I0919 15:45:42.354790 13005 slave.cpp:3732] master@10.142.55.190:5050 exited W0919 15:45:42.354857 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:45:54.020563 13002 detector.cpp:152] Detected a new leader: (id='687') I0919 15:45:54.020756 13002 group.cpp:706] Trying to get '/mesos/json.info_0000000687' in ZooKeeper I0919 15:45:54.023296 13002 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected I0919 15:45:54.023455 13002 slave.cpp:895] New master detected at master@10.142.55.196:5050 I0919 15:45:54.023558 13002 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:45:54.023526 13008 status_update_manager.cpp:174] Pausing sending status updates I0919 15:45:54.023669 13002 slave.cpp:927] Detecting new master I0919 15:46:16.402402 13003 slave.cpp:4591] Current disk usage 72.53%. Max allowed age: 1.223205601954942days I0919 15:46:54.075505 13007 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:46:54.075592 13007 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected E0919 15:46:56.098012 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected I0919 15:46:56.098016 13007 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:46:56.098253 13007 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected E0919 15:46:56.462254 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected I0919 15:46:56.462260 13005 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:46:56.462540 13005 slave.cpp:3737] Master disconnected! 
Waiting for a new master to be elected I0919 15:47:02.005637 13009 detector.cpp:152] Detected a new leader: (id='688') I0919 15:47:02.005765 13009 group.cpp:706] Trying to get '/mesos/json.info_0000000688' in ZooKeeper I0919 15:47:02.006853 13009 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected I0919 15:47:02.006959 13009 slave.cpp:895] New master detected at master@10.142.55.202:5050 I0919 15:47:02.006986 13009 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:47:02.007025 13009 slave.cpp:927] Detecting new master I0919 15:47:02.007061 13009 status_update_manager.cpp:174] Pausing sending status updates I0919 15:47:16.406669 13008 slave.cpp:4591] Current disk usage 72.53%. Max allowed age: 1.223184189090440days I0919 15:48:02.950891 13005 slave.cpp:3732] master@10.142.55.202:5050 exited W0919 15:48:02.950994 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:48:12.006634 13005 detector.cpp:152] Detected a new leader: (id='690') I0919 15:48:12.006817 13003 group.cpp:706] Trying to get '/mesos/json.info_0000000690' in ZooKeeper I0919 15:48:12.007987 13003 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected I0919 15:48:12.008126 13003 slave.cpp:895] New master detected at master@10.142.55.190:5050 I0919 15:48:12.008210 13003 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:48:12.008280 13003 slave.cpp:927] Detecting new master I0919 15:48:12.008191 13008 status_update_manager.cpp:174] Pausing sending status updates I0919 15:48:16.409266 13003 slave.cpp:4591] Current disk usage 72.54%. Max allowed age: 1.222480623542604days I0919 15:49:12.379010 13009 slave.cpp:3732] master@10.142.55.190:5050 exited W0919 15:49:12.379149 13009 slave.cpp:3737] Master disconnected! 
Waiting for a new master to be elected E0919 15:49:12.379233 13010 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected I0919 15:49:16.413767 13007 slave.cpp:4591] Current disk usage 72.64%. Max allowed age: 1.215032005677465days I0919 15:49:24.016290 13007 detector.cpp:152] Detected a new leader: (id='693') I0919 15:49:24.016417 13007 group.cpp:706] Trying to get '/mesos/json.info_0000000693' in ZooKeeper I0919 15:49:24.018273 13007 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected I0919 15:49:24.018437 13007 slave.cpp:895] New master detected at master@10.142.55.196:5050 I0919 15:49:24.018523 13007 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:49:24.018604 13007 slave.cpp:927] Detecting new master I0919 15:49:24.018496 13008 status_update_manager.cpp:174] Pausing sending status updates I0919 15:50:16.416391 13008 slave.cpp:4591] Current disk usage 72.64%. Max allowed age: 1.215016710774248days I0919 15:50:24.065268 13003 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:50:24.065342 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:50:24.485752 13004 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:50:24.485839 13004 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected E0919 15:50:24.485977 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected I0919 15:50:28.343647 13003 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:50:28.343719 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected E0919 15:50:28.343819 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected I0919 15:50:31.545099 13005 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:50:31.545171 13005 slave.cpp:3737] Master disconnected! 
Waiting for a new master to be elected E0919 15:50:31.545284 13010 process.cpp:2105] Failed to shutdown socket with fd 14: Transport endpoint is not connected I0919 15:50:32.007096 13008 detector.cpp:152] Detected a new leader: (id='694') I0919 15:50:32.007195 13008 group.cpp:706] Trying to get '/mesos/json.info_0000000694' in ZooKeeper I0919 15:50:32.009881 13008 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected I0919 15:50:32.009970 13008 slave.cpp:895] New master detected at master@10.142.55.202:5050 I0919 15:50:32.009994 13008 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:50:32.010030 13008 slave.cpp:927] Detecting new master I0919 15:50:32.010079 13008 status_update_manager.cpp:174] Pausing sending status updates I0919 15:51:16.417846 13006 slave.cpp:4591] Current disk usage 72.64%. Max allowed age: 1.214964708103322days I0919 15:51:32.560317 13003 slave.cpp:3732] master@10.142.55.202:5050 exited W0919 15:51:32.560410 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:51:42.005147 13009 detector.cpp:152] Detected a new leader: (id='696') I0919 15:51:42.005265 13009 group.cpp:706] Trying to get '/mesos/json.info_0000000696' in ZooKeeper I0919 15:51:42.006824 13009 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.190:5050) is detected I0919 15:51:42.006904 13009 slave.cpp:895] New master detected at master@10.142.55.190:5050 I0919 15:51:42.006928 13009 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:51:42.006963 13009 slave.cpp:927] Detecting new master I0919 15:51:42.006999 13009 status_update_manager.cpp:174] Pausing sending status updates I0919 15:52:16.419373 13003 slave.cpp:4591] Current disk usage 72.71%. 
Max allowed age: 1.209981628636250days I0919 15:52:42.336305 13002 slave.cpp:3732] master@10.142.55.190:5050 exited W0919 15:52:42.336426 13002 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:52:54.005267 13005 detector.cpp:152] Detected a new leader: (id='699') I0919 15:52:54.005408 13005 group.cpp:706] Trying to get '/mesos/json.info_0000000699' in ZooKeeper I0919 15:52:54.006206 13005 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.196:5050) is detected I0919 15:52:54.006285 13005 slave.cpp:895] New master detected at master@10.142.55.196:5050 I0919 15:52:54.006309 13005 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:52:54.006398 13005 slave.cpp:927] Detecting new master I0919 15:52:54.006451 13005 status_update_manager.cpp:174] Pausing sending status updates I0919 15:53:16.420258 13005 slave.cpp:4591] Current disk usage 72.76%. Max allowed age: 1.206748286096840days I0919 15:53:54.071012 13005 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:53:54.071143 13005 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:54:01.105780 13002 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:54:01.105854 13002 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected E0919 15:54:01.105970 13010 process.cpp:2105] Failed to shutdown socket with fd 15: Transport endpoint is not connected I0919 15:54:05.733837 13007 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:54:05.733932 13007 slave.cpp:3737] Master disconnected! 
Waiting for a new master to be elected E0919 15:54:05.734071 13010 process.cpp:2105] Failed to shutdown socket with fd 15: Transport endpoint is not connected E0919 15:54:05.818560 13010 process.cpp:2105] Failed to shutdown socket with fd 15: Transport endpoint is not connected I0919 15:54:05.818583 13003 slave.cpp:3732] master@10.142.55.196:5050 exited W0919 15:54:05.818758 13003 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected I0919 15:54:06.004385 13009 detector.cpp:152] Detected a new leader: (id='700') I0919 15:54:06.004494 13009 group.cpp:706] Trying to get '/mesos/json.info_0000000700' in ZooKeeper I0919 15:54:06.005511 13009 zookeeper.cpp:259] A new leading master (UPID=master@10.142.55.202:5050) is detected I0919 15:54:06.005586 13009 slave.cpp:895] New master detected at master@10.142.55.202:5050 I0919 15:54:06.005609 13009 slave.cpp:916] No credentials provided. Attempting to register without authentication I0919 15:54:06.005676 13009 slave.cpp:927] Detecting new master I0919 15:54:06.005720 13009 status_update_manager.cpp:174] Pausing sending status updates I0919 15:54:16.423193 13002 slave.cpp:4591] Current disk usage 72.76%. Max allowed age: 1.206699342406551days
Answer
Thanks to Joseph Wu for helping me solve the problem. Details:
There are two repeating log messages that tell you (indirectly) that something is wrong:
I0919 15:55:08.178272 13280 replica.cpp:673] Replica in VOTING status received a broadcasted recover request from (14)@10.142.55.202:5050
This message means that you’ve started this master before, with the same work directory. It has some sort of persistent state in its work directory.
This log message tells you that there are two masters you have not started before:
I0919 15:55:16.018023 13282 consensus.cpp:360] Aborting implicit promise request because 2 ignores received
The masters will refuse to start because there is less than a quorum of masters with the persistent state. If the masters were to start, you would have potential data loss. This is the expected behavior, as Mesos errs on the side of caution.
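To make the quorum requirement concrete, here is a minimal sketch of the majority arithmetic, assuming the quorum is the smallest strict majority of the masters. The function name quorum_size is made up for this illustration and is not part of Mesos.

```shell
# Hypothetical helper (not a Mesos command): smallest strict majority of N masters.
quorum_size() {
  n="$1"
  # Integer division: 3 masters -> 2, 5 masters -> 3.
  echo $(( n / 2 + 1 ))
}

# Three masters, as in the cluster from the question.
quorum_size 3
```

With three masters the majority is two, so a single master that still has old state in its work directory cannot form a quorum on its own when the other two are fresh.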
If I want a fresh Mesos cluster, I need to clean the master's work directory.
However, the problem was not on 10.142.55.202, as Joseph Wu had suggested. I cleared the work_dir on all the machines, and that got me out of this problem.
How to clean the work dir:
Find the mesos-master work dir:
$ cat /etc/mesos-master/work_dir
/var/lib/mesos
Remove it:
$ rm -rf /var/lib/mesos
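A more cautious variant of the steps above moves the directory aside instead of deleting it, so the old state can be restored if needed. This is only a sketch: the function name backup_work_dir and the .bak.<timestamp> suffix are my own choices, not anything Mesos provides.

```shell
# Hypothetical helper: move a Mesos work directory aside instead of deleting it.
# $1: file that contains the work_dir path (/etc/mesos-master/work_dir in this post).
backup_work_dir() {
  conf_file="$1"
  work_dir="$(cat "$conf_file")" || return 1

  # Refuse to act on an empty path or on the filesystem root.
  if [ -z "$work_dir" ] || [ "$work_dir" = "/" ]; then
    echo "refusing to move '$work_dir'" >&2
    return 1
  fi
  if [ ! -d "$work_dir" ]; then
    echo "no such directory: $work_dir" >&2
    return 1
  fi

  # Rename to a timestamped backup and print the new location.
  backup="${work_dir}.bak.$(date +%Y%m%d%H%M%S)"
  mv "$work_dir" "$backup" && echo "$backup"
}

# Usage (run as root, on each machine, with mesos-master and mesos-slave stopped first;
# how to stop them depends on the init system and is not shown here):
#   backup_work_dir /etc/mesos-master/work_dir
```

Note that in the agent log above the agent also runs with --work_dir="/var/lib/mesos", so moving or removing this directory discards the agent's recovered state as well.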