Details
Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 1.2.3
Description
I have set up several process instances, each using logback. The instances are meant to be containerized in Docker and scaled out. The instances (senders) use a SocketAppender to centralize log events to a different container, where a process (receiver) runs logback with a ServerSocketReceiver. At this single instance I am able to further deal with the log events in a central place.
The setup works perfectly, with one exception: when the receiver container is shut down, the senders are not aware of the disconnect and take about 10 minutes to notice the problem with the socket, despite nearly constant logging taking place. Only at this point do they reconnect.
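For reference, here is a rough programmatic equivalent of the two sides (normally this lives in each container's logback.xml); the host name, port and reconnection delay below are placeholder values, not the actual setup:

import ch.qos.logback.classic.Logger;
import ch.qos.logback.classic.LoggerContext;
import ch.qos.logback.classic.net.SocketAppender;
import ch.qos.logback.classic.net.server.ServerSocketReceiver;
import ch.qos.logback.core.util.Duration;
import org.slf4j.LoggerFactory;

public class SetupSketch {

    // Sender side: forward all log events to the central receiver container.
    static void configureSender() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        SocketAppender appender = new SocketAppender();
        appender.setContext(context);
        appender.setName("socket");
        appender.setRemoteHost("log-receiver");  // placeholder container host name
        appender.setPort(6000);                  // placeholder port
        appender.setReconnectionDelay(Duration.buildByMilliseconds(10000)); // retry interval once a failure is detected
        appender.start();

        Logger root = context.getLogger(Logger.ROOT_LOGGER_NAME);
        root.addAppender(appender);
    }

    // Receiver side: accept serialized events from all senders on one port.
    static void configureReceiver() {
        LoggerContext context = (LoggerContext) LoggerFactory.getILoggerFactory();

        ServerSocketReceiver receiver = new ServerSocketReceiver();
        receiver.setContext(context);
        receiver.setPort(6000);                  // placeholder port
        receiver.start();
    }
}

Note that the reconnection delay only controls how quickly the appender retries once it has noticed a failure; the problem described below is that noticing the failure takes about ten minutes.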
It would appear that a remote container shutdown results in a dirty TCP disconnect. The remote TCP server disappears abruptly, leaving the client side with the socket in CLOSE_WAIT state.
While the socket is in this state, the client side happily keeps appending events to the SocketAppender, which in turn keeps writing them to the socket (the Send-Q on the socket keeps growing) without detecting any error.
This keeps going for several minutes (between 7 and 12 in my tests), during which log events go missing. Only at this very late point is an error raised and the reconnect logic triggered.
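To illustrate why nothing fails on the sending side, the dispatch amounts to roughly the following loop (a heavily simplified sketch, not the actual logback source). As long as the local send buffer accepts bytes, writeObject() and flush() return normally, so the sender has no indication that the data is never acknowledged; an error only surfaces once TCP has given up retransmitting, which matches the 7-12 minutes observed above.

import java.io.IOException;
import java.io.ObjectOutputStream;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;

import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.classic.spi.LoggingEventVO;

// Heavily simplified sketch of the sending side (not the actual logback code):
// events are taken from a queue and serialized onto the socket. Every call here
// succeeds while the OS buffers the bytes, even though the receiver is long gone.
final class DispatchLoopSketch implements Runnable {

    private final Socket socket;
    private final BlockingQueue<ILoggingEvent> queue;

    DispatchLoopSketch(Socket socket, BlockingQueue<ILoggingEvent> queue) {
        this.socket = socket;
        this.queue = queue;
    }

    @Override
    public void run() {
        try (ObjectOutputStream oos = new ObjectOutputStream(socket.getOutputStream())) {
            while (!Thread.currentThread().isInterrupted()) {
                ILoggingEvent event = queue.take();           // next pending log event
                oos.writeObject(LoggingEventVO.build(event)); // buffered by the OS, no error raised
                oos.flush();
                oos.reset();                                  // drop serialization back-references
            }
        } catch (IOException | InterruptedException e) {
            // Only reached once the OS finally reports the dead connection,
            // i.e. after TCP retransmissions have timed out (several minutes).
        }
    }
}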
I found that the AbstractLogstashTcpSocketAppender from the logstash-logback-encoder project encountered and fixed this same problem.
https://github.com/logstash/logstash-logback-encoder/issues/76
https://github.com/logstash/logstash-logback-encoder/issues/85
They effectively fixed the problem by continuously attempting reads from the socket (with a short read timeout), which results in much more timely detection of the disconnect. This could take place on a separate watchdog thread.
The call() method constantly reads from the socket. The expected outcome while the connection is healthy is that the read simply times out with no data; as soon as the read returns -1 (or throws an exception), the socket is known to be gone, so the failure is detected far more reliably and promptly than through the write path.
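A minimal sketch of that detection idea, independent of any particular appender (class and method names are made up, and the 500 ms timeout is an arbitrary choice):

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Illustrative watchdog, not taken from logback or logstash-logback-encoder.
// It keeps reading from a connection on which the peer is never expected to
// send anything: timeouts mean "still healthy", while -1 or an IOException
// means the peer is gone and the appender should tear down and reconnect.
final class ReadBasedDisconnectDetector {

    /** Returns only by throwing once the connection is detected as dead. */
    static void awaitDisconnect(Socket socket) throws IOException {
        socket.setSoTimeout(500);                 // short read timeout (arbitrary value)
        InputStream in = socket.getInputStream();
        while (true) {
            try {
                if (in.read() == -1) {
                    throw new EOFException("receiver closed the connection");
                }
                // Unexpected data from the receiver: ignore it and keep watching.
            } catch (SocketTimeoutException healthy) {
                // No data within the timeout: the connection still looks alive.
            }
        }
    }
}

Run on its own thread next to the dispatch loop, this would interrupt the writer and trigger the existing reconnect logic. In the CLOSE_WAIT situation described above, the peer's FIN has already arrived, so the read returns -1 almost immediately and the reconnect happens within about one timeout interval instead of after several minutes.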