Details
-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
None
-
None
-
Windows
Description
Our customer has a problem with network glitches, so we upgraded from 0.9.17 to 0.9.30 to take advantage of the ResilientOutputStream. The particular appender that caused problems at the customer site used a custom layout class, so we wrapped that class in a LayoutWrappingEncoder in build.xml:
<appender name="AUDIT" class="ch.qos.logback.core.rolling.RollingFileAppender">
<File>${logging.audit.root}/audit/audit.log</File>
<encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
<layout class="com.chiliad.log.util.PropertiesLayout" />
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">
<!-- stuff deleted here -->
</rollingPolicy>
</appender>
We tested the ResilientOutputStream by editing logback.xml to write the logfile to another system, then we removed the ethernet cable from that other system. A couple minutes later we plugged the ethernet cable back into the other system. Stdout of the logging system reported
2012/02/14 16:33:47 | INFO | jvm 1 | srvmain | 16:33:47,414 |-ERROR in c.q.l.c.recovery.ResilientFileOutputStream@12297237 - IO failure while writing to file \\cushman\e\logs\audit\audit.log java.io.IOException: The specified network name is no longer available
There was also a long stack trace from Java with more detail about the IOException.
However immediately after this stack trace the following message appeared
2012/02/14 16:33:47 | INFO | jvm 1 | srvmain | 16:33:47,408 |-INFO in c.q.l.c.recovery.ResilientFileOutputStream@12297237 - Recovered from IO failure on file \\cushman\e\logs\audit\audit.log
Furthermore, logging did not actually recover. The audit log on the remote server did not get any more messages, and stdout continued to record the messaages shown above, always paired with the complaint about the IO failure and then saying that it recovered from the IO failure.
I downloaded the source code and found the source of the problem. The LayoutWrappingEncoder always calls write and then flush together:
outputStream.write(convertToBytes(txt));
outputStream.flush();
In ResilientOutputStreamBase the write method checks for a bad state in the output stream, but the flush method does not:
public void write(int b) {
if (isPresumedInError()) {
if (!recoveryCoordinator.isTooSoon())
return; // return regardless of the success of the recovery attempt
}
try
catch (IOException e)
{ postIOFailure(e); }}
public void flush() {
if (os != null) {
try
}
}
So, the write method encounters the IOException and sets the presumedClean flag to false. However, the flush method does not call isPresumedInError and the flush succeeds (I presume because there is no data to flush). The flush method then calls postSuccessfulWrite, which sets the presumedClean flag to true.
I fixed this problem by changing the flush method to call isPresumedInError::
public void flush() {
if (os != null) {
if (isPresumedInError()) {
if (!recoveryCoordinator.isTooSoon()) { attemptRecovery(); }
return; // return regardless of the success of the recovery attempt
}
try { os.flush(); postSuccessfulWrite(); }
catch (IOException e)
{ postIOFailure(e); } }
}
Now when we run the test of unplugging the ethernet cable on the system where the audit logs are written and then plug the cable back in the logger starts logging again and the logfile resumes recording messages.
Does this solution look correct to you?
thanks
Tom