I have an issue with an application I built that runs on GNU/Linux. It works in my development environment, where most of the components run in Docker containers or natively, but it fails randomly (often, but not always) in the server environment where it needs to be deployed.
Infrastructure:
[App in Ubuntu Server 20.04 host-1] <--->[router+firewall]<---> [Ubuntu Server 20.04 host-2]
Both servers seem to have enough resources (4 CPUs, 4 GB RAM).
The machine running the app (host-1) has to connect to a RabbitMQ instance running on host-2, where it both publishes (I haven’t seen failures here) and subscribes (which tends to fail) on different queues.
The issue: sometimes it works (there’s a router + firewall in between, but the problem does not seem to be there), but many other times both connections fail at random. I checked the MTU (1500, which works in other deployments), ulimit looks OK, and so on, but I cannot find the cause…
Many times the RabbitMQ connections start fine, but eventually I get error messages like these:
AMQPConnector - reporting failure: AMQPConnectorAMQPHandshakeError: ProbableAuthenticationError
[..]("ConnectionClosedByBroker: (403) 'ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN. For details see the broker logfile.'"
This is misleading: I am 100% sure the credentials are correct, and in fact they work some of the time.
The connection is retried, but without success.
From the RabbitMQ logs:
[info] <0.16188.30> Closing all channels from connection 'xxx.yyy.zzz.kkk:41426 -> yyy.zzz.kkk.zzz:5672' because it has been closed
[info] <0.16192.30> accepting AMQP connection <0.16192.30> (xxx.yyy.zzz.kkk:41430 -> yyy.zzz.kkk.zzz:5672)
[error] <0.16192.30> Error on AMQP connection <0.16192.30> (xxx.yyy.zzz.kkk:41430 -> yyy.zzz.kkk.zzz:5672, state: starting): PLAIN login refused: user 'someuser' - invalid credentials
I tried heartbeat values of 500 and 90, and a blocked connection timeout of 300…
To me, it looks as if the heartbeats are sometimes not being received.
I am pretty lost. I imagine it could be a performance or network issue, since this works in other controlled environments. What could I check?
Answer
Well, surprisingly, the most obvious explanation turned out to be the real one: a virtual network was corrupted in the cloud computing platform.
How the debugging was done (in order to convince the network engineers to check):
- Ran network analysis tools (traceroute, nmap) from different networks. On the network that carried the connection between client and server, some ports were failing.
- Ran SSH and SCP from different VMs to the same destination. From some networks it worked, but on the network that carried the connection between client and server it failed in several ways: login failures (probably during handshake negotiation), broken pipes during login, and connections that were established but broke as soon as anything was typed.
- [Lots of logs and screenshots to prove this and guide network engineers]
This is when I stopped looking into the RabbitMQ configuration and parameters.
Finally, a cloud deployment on a different network was forced, and it worked.
This confirmed that it was a network issue, and the problem was handed over to the network engineers.
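For anyone who needs similar evidence, the port checks from the debugging steps above can be roughly approximated with a repeated TCP connect probe. This is only a minimal sketch, not the tooling that was actually used (that was traceroute, nmap, SSH and SCP). The function name probe_port and its parameters are made up for illustration, and "host-2" is a placeholder for the broker address. Port 5672 is the AMQP port that appears in the RabbitMQ logs in the question.

```python
import socket
import time


def probe_port(host, port, attempts=20, timeout=3.0, pause=0.5):
    """Open a TCP connection `attempts` times and record each failure.

    Returns (successes, failures), where failures is a list of
    (attempt_index, error_text) tuples. An intermittent network fault
    shows up as a mix of successes and failures for the same target.
    """
    successes = 0
    failures = []
    for i in range(attempts):
        try:
            # create_connection performs the full TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                successes += 1
        except OSError as exc:  # refused, timed out, unreachable, ...
            failures.append((i, repr(exc)))
        time.sleep(pause)
    return successes, failures


# Example call (hypothetical host name, AMQP port from the logs above):
#   ok, failed = probe_port("host-2", 5672)
#   print(f"{ok} successful, {len(failed)} failed: {failed}")
```

Running the same probe from VMs on different networks against the same destination gives a simple side-by-side comparison. Note that it only exercises the TCP handshake, so it would not by itself catch failures that appear later in a session, such as the broken pipes seen with SSH.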