Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: logback-core
Labels:
None
Environment:

Windows

Description

Our customer has a problem with network glitches, so we upgraded from 0.9.17 to 0.9.30 to take advantage of the ResilientOutputStream. The particular appender that caused problems at the customer site used a custom layout class, so we wrapped that class in a LayoutWrappingEncoder in build.xml:

<appender name="AUDIT" class="ch.qos.logback.core.rolling.RollingFileAppender">
<File>${logging.audit.root}/audit/audit.log</File>
<encoder class="ch.qos.logback.core.encoder.LayoutWrappingEncoder">
<layout class="com.chiliad.log.util.PropertiesLayout" />
</encoder>
<rollingPolicy class="ch.qos.logback.core.rolling.TimeBasedRollingPolicy">

</rollingPolicy>
</appender>

We tested the ResilientOutputStream by editing logback.xml to write the logfile to another system, then we removed the ethernet cable from that other system. A couple minutes later we plugged the ethernet cable back into the other system. Stdout of the logging system reported

2012/02/14 16:33:47 | INFO | jvm 1 | srvmain | 16:33:47,414 |-ERROR in c.q.l.c.recovery.ResilientFileOutputStream@12297237 - IO failure while writing to file \\cushman\e\logs\audit\audit.log java.io.IOException: The specified network name is no longer available

There was also a long stack trace from Java with more detail about the IOException.

However immediately after this stack trace the following message appeared

2012/02/14 16:33:47 | INFO | jvm 1 | srvmain | 16:33:47,408 |-INFO in c.q.l.c.recovery.ResilientFileOutputStream@12297237 - Recovered from IO failure on file \\cushman\e\logs\audit\audit.log

Furthermore, logging did not actually recover. The audit log on the remote server did not get any more messages, and stdout continued to record the messaages shown above, always paired with the complaint about the IO failure and then saying that it recovered from the IO failure.

I downloaded the source code and found the source of the problem. The LayoutWrappingEncoder always calls write and then flush together:

outputStream.write(convertToBytes(txt));
outputStream.flush();

In ResilientOutputStreamBase the write method checks for a bad state in the output stream, but the flush method does not:

public void write(int b) {
if (isPresumedInError()) {
if (!recoveryCoordinator.isTooSoon())

{ attemptRecovery(); }

return; // return regardless of the success of the recovery attempt
}
try

{ os.write(b); postSuccessfulWrite(); }

catch (IOException e)

{ postIOFailure(e); }

}

public void flush() {
if (os != null) {
try

{ os.flush(); postSuccessfulWrite(); } catch (IOException e) { postIOFailure(e); }
}
}

So, the write method encounters the IOException and sets the presumedClean flag to false. However, the flush method does not call isPresumedInError and the flush succeeds (I presume because there is no data to flush). The flush method then calls postSuccessfulWrite, which sets the presumedClean flag to true.

I fixed this problem by changing the flush method to call isPresumedInError::

public void flush() {
if (os != null) {
if (isPresumedInError()) {
if (!recoveryCoordinator.isTooSoon()) { attemptRecovery(); }
return; // return regardless of the success of the recovery attempt
}
try { os.flush(); postSuccessfulWrite(); }

catch (IOException e)

{ postIOFailure(e); }

}
}

Now when we run the test of unplugging the ethernet cable on the system where the audit logs are written and then plug the cable back in the logger starts logging again and the logfile resumes recording messages.

Does this solution look correct to you?
thanks
Tom

Attachments

Activity

People

Assignee:: Logback dev list

Reporter:: Thomas P Rees

Votes:: 1 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 21/Feb/12 17:05

Updated:: 12/Jun/12 9:59