
When logging from multiple threads, an appender can be broken until app restart if it rolls over #2566

Closed
marcreichman-pfi opened this issue May 3, 2024 · 10 comments

Comments

@marcreichman-pfi

Description

In some cases (this is intermittent), when a log file (synchronous appender) is being written to by several threads at once, it seems the appender can get into a state where newer messages can no longer be written, failing with a stream closed exception.

This has been observed at client sites on both Linux and Windows. My log files are from Windows.

This seems to have started happening somewhat recently, and the changes I can track down are:

  • Upgrade of Log4j (from 2.15.0 after the RCE fix, in various steps to 2.23.1)
  • Adoption of the 1.x-to-2.x API bridge and the SLF4J API bridge
  • Transition from log4j-web to log4j-jakarta-web
  • Introduction of size-based triggering in addition to the date-based triggering, which had been used for several years without incident in a similar environment

The overall execution environment has not changed; there has always been regular traffic to the loggers from multiple threads, and regular rollovers. At one point an issue caused Windows rollovers to not work correctly, which was fixed by an upgrade from 2.11.1 to 2.13.1.

On the Windows side, we usually keep a tailing program (baretail) running around the clock. This has never caused any issues for us.

An important note: even when rollover works correctly on the Windows side, we get warnings about not being able to remove files due to access from another process, and rollover succeeds only via the fallback rename/truncation/etc. procedures. This happens regardless of whether the service stays up, baretail remains open, etc. I expect these messages will be read as an indication that something is fundamentally wrong, but these loggers work around the clock, and higher-traffic loggers seem to have no issue, even though less frequently written files (such as this em-error.log file) display the issue more often.

Configuration

Version: 2.23.1 (but observed in versions possibly as far back as 2.17.x)

Operating system: Windows Server 2008 R2 64-bit, CentOS 7.x amd64

JDK: Oracle JDK 21.0.x

Logs

Configuration is attached.
log4j.xml.txt

File server-broken.log shows the scenario where the stream is reported as closed in one thread while another appears to be performing a rollover. After this point, this logger/appender can no longer be used until the app is restarted.
server-broken.log

File server-functional.log shows a functional rollover scenario. I'm not sure the threads are as busy in this case, but it does show a rollover completing without subsequent errors, and I wanted to include it because it shows what happens when the zip/rename/delete operations fail and the manager falls back on other file-clearing approaches.
server-functional.log

Reproduction

I am not able to reproduce this on demand, but if needed I can try to build a test case with an extreme version of this configuration (very small size-based rollover threshold, frequent date-based rollover) and hit it with many threads to see whether it is reproducible.
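For illustration, a minimal sketch of what such a stress test might look like; it assumes an external configuration with a very small SizeBasedTriggeringPolicy (e.g. size="1 KB"), and the thread and message counts are arbitrary:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public final class RolloverStressTest {
    private static final Logger LOG = LogManager.getLogger(RolloverStressTest.class);

    public static void main(String[] args) throws InterruptedException {
        // Many threads hammering one synchronous RollingFileAppender, so that
        // rollovers constantly race against in-flight writes.
        ExecutorService pool = Executors.newFixedThreadPool(16);
        for (int t = 0; t < 16; t++) {
            final int id = t;
            pool.submit(() -> {
                for (int i = 0; i < 100_000; i++) {
                    // Padded message so the size threshold is crossed quickly.
                    LOG.info("thread {} message {} xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", id, i);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        LogManager.shutdown();
    }
}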

@ppkarwasz
Contributor

@marcreichman-pfi,

Are you by any chance using the same configuration in multiple web applications? On Linux you can check with lsof to see how many file descriptors are open for the same log file.

@marcreichman-pfi
Author

marcreichman-pfi commented May 3, 2024

@ppkarwasz Thanks for your response. I would have told you that there are definitely not multiple applications referencing the configuration, but lsof does show multiple FDs for the same files under the same java process.

We typically bring up the app server (Glassfish in this case), deploy one app, and that's it. I'm aware there are classloader considerations, but with a single deployment I don't think they come into play. We are using the log4j-jakarta-web module in our application, if it matters. We launch the application with -Dlog4j.configurationFile=<path> in the JVM arguments; there is no programmatic initialization.

How would I trace these, in your opinion?

[userxxxxxxxx@host logs]$ sudo lsof em.log
COMMAND  PID         USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
java    3490 userxxxxxxxx  486w   REG  253,0   637478 33562012 em.log
java    3490 userxxxxxxxx  622w   REG  253,0   637478 33562012 em.log
[userxxxxxxxx@host logs]$ sudo lsof em-error.log
COMMAND  PID         USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
java    3490 userxxxxxxxx  488w   REG  253,0    21388 35736966 em-error.log
java    3490 userxxxxxxxx  624w   REG  253,0    21388 35736966 em-error.log

I know it's not an exactly comparable scenario (no random-access file here), but this seems similar.

@marcreichman-pfi
Author

@ppkarwasz I did a little digging and noticed that even on a fresh deployment there are multiple descriptors to these files. I found an interesting Java agent here and used it to observe this pattern:

First FD:

#507 ../logs/em-error.log by thread:admin-thread-pool::admin-listener(1) on Fri May 03 11:38:54 CDT 2024
	at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:232)
	at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:150)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager$RollingFileManagerFactory.createManager(RollingFileManager.java:878)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager$RollingFileManagerFactory.createManager(RollingFileManager.java:844)
	at org.apache.logging.log4j.core.appender.AbstractManager.getManager(AbstractManager.java:146)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.getManager(OutputStreamManager.java:112)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.getFileManager(RollingFileManager.java:307)
	at org.apache.logging.log4j.core.appender.RollingFileAppender$Builder.build(RollingFileAppender.java:144)
	at org.apache.logging.log4j.core.appender.RollingFileAppender$Builder.build(RollingFileAppender.java:65)
	at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:124)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:1164)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:1085)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:1077)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:681)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:264)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:313)
	at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:631)
	at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:713)
	at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:735)
	at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:260)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:154)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:46)
	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:197)
	at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:611)
	at (class name replaced).<clinit>(class name replaced.java:30)

Second FD:

#562 ../logs/em-error.log by thread:admin-thread-pool::admin-listener(1) on Fri May 03 11:39:22 CDT 2024
	at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:232)
	at java.base/java.io.FileOutputStream.<init>(FileOutputStream.java:150)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager$RollingFileManagerFactory.createManager(RollingFileManager.java:878)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager$RollingFileManagerFactory.createManager(RollingFileManager.java:844)
	at org.apache.logging.log4j.core.appender.AbstractManager.getManager(AbstractManager.java:146)
	at org.apache.logging.log4j.core.appender.OutputStreamManager.getManager(OutputStreamManager.java:112)
	at org.apache.logging.log4j.core.appender.rolling.RollingFileManager.getFileManager(RollingFileManager.java:307)
	at org.apache.logging.log4j.core.appender.RollingFileAppender$Builder.build(RollingFileAppender.java:144)
	at org.apache.logging.log4j.core.appender.RollingFileAppender$Builder.build(RollingFileAppender.java:65)
	at org.apache.logging.log4j.core.config.plugins.util.PluginBuilder.build(PluginBuilder.java:124)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createPluginObject(AbstractConfiguration.java:1164)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:1085)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.createConfiguration(AbstractConfiguration.java:1077)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.doConfigure(AbstractConfiguration.java:681)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.initialize(AbstractConfiguration.java:264)
	at org.apache.logging.log4j.core.config.AbstractConfiguration.start(AbstractConfiguration.java:313)
	at org.apache.logging.log4j.core.LoggerContext.setConfiguration(LoggerContext.java:631)
	at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:713)
	at org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:735)
	at org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:260)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:154)
	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:46)
	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:197)
	at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:611)
	at (different class name replaced).<clinit>(different class name replaced.java:17)

We always initialize as such:

private static final Logger log = LogManager.getLogger(class name replaced.class);

Given this, these static final fields resolve as classes are loaded, so the timing is completely random. This has never posed a problem before. Is it possible that something in the LogManager.getContext() path (called from getLogger) isn't thread-safe in its initialization path? While trying to look into this, I noticed the code in LogManager has changed a bit since the 2.23.1 release, which made it tricky to dig into locally.

Thanks for your time and attention.

@ppkarwasz
Contributor

@marcreichman-pfi,

You can check which class creates a logger context by setting the -Dlog4j2.loggerContextStacktraceOnStart=true and -Dlog4j2.debug system properties.

It appears you have at least two logger contexts in your system that use the same configuration. That alone should not be a problem, since multiple appenders pointing to the same file will use the same RollingFileManager and therefore a single file descriptor.

I am not familiar with GlassFish's classloader structure. Is it possible that there are two RollingFileManager classes in your JVM (e.g. loaded by sibling classloaders)?
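One quick, hypothetical way to check from application code (if two spots in the application report different loaders, there are two independent manager classes, and hence two file descriptors, per log file):

try {
    Class<?> mgr = Class.forName(
            "org.apache.logging.log4j.core.appender.rolling.RollingFileManager");
    // The defining classloader identifies which copy of log4j-core this is.
    System.out.println(mgr.getName() + " defined by " + mgr.getClassLoader());
} catch (ClassNotFoundException e) {
    e.printStackTrace();
}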

@marcreichman-pfi
Author

@ppkarwasz I did a little playing around, and here's what I found.

  1. I set the two system properties you mentioned and got two stack traces, similar to what I pasted above. I'm not sure whether that indicates anything bad in itself.
  2. I tried disabling my configuration file to have it fall back to the default.
  3. I tried disabling the log4j-jakarta-web module; I also tried leaving it in but adding the context param to disable auto-loading, then loading the configuration directly from my code via Configurator.
  4. I used jconsole, which seemed to show two separate sets of MBeans, and jmap to get some class hierarchies; both seemed to indicate that these are indeed loaded in sibling classloaders and that there are two instances of RollingFileAppender.

No matter what I tried, the two spots in the stack traces would show up, well before any direct configuration or reconfiguration. I can see the debug output run in full twice, seemingly from the same thread, but I suppose from different classloaders?

This makes me wonder: is this pattern fundamentally incompatible with an environment that scans classes, potentially multiple times, on the way to startup deployment?

private static final Logger log = LogManager.getLogger(class name replaced.class);

I'm wondering if I should use an approach, via dependency injection or otherwise, in which these loggers do not live in static fields, to make sure they don't greedily initialize in a classloader before anyone is ready to use them.
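For illustration, one way to defer logger creation to first use rather than class initialization, using the standard initialization-on-demand holder idiom (Foo is a placeholder):

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class Foo {
    // The Logger is created when LogHolder is first referenced (i.e. on the
    // first log call), not when Foo itself is loaded or scanned by the container.
    private static final class LogHolder {
        static final Logger LOG = LogManager.getLogger(Foo.class);
    }

    void doWork() {
        LogHolder.LOG.info("logger resolved lazily on first use");
    }
}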

The thing is, I would not expect any application code to be actively writing logs in the "older" classloader. So even if all of this were the case, I'm not sure it would matter, since all of these operations (rolling, etc.) are handled by the logging caller's thread, not by some log4j timer thread, except possibly for configuration file reloads.

I just want to mention that this all worked fine for years, and something seems to have changed. It may be that adding the size trigger simply causes more rollovers, and that is what is bringing this to the forefront.

@ppkarwasz
Contributor

4. I used jconsole which seemed to show two separate sets of MBeans, and I used jmap to get some hierarchies which both seemed to indicate that indeed these are loaded in sibling classloaders and there are two instances of `RollingFileAppender`

How could we reproduce this? Are you using a simple WAR application or a complex EAR archive?

In general, the solution to this problem would be to ensure that the LogManager.class from one classloader is the same object as the LogManager.class from the other. I don't know how to do this in GlassFish.

@marcreichman-pfi
Author

This was just a simple WAR deployment. I imagine it would be the same in other app servers, but I'm not 100% sure, to be honest.

  • -Dlog4j.configurationFile=... defined in the overall JVM arguments
  • No special configuration of any kind for log4j. We are using log4j-api, log4j-core, log4j-jakarta-web, and log4j-slf4j2-impl (because of third-party libs which write to that API)
  • Logger initialization as mentioned above (private static final Logger log = LogManager.getLogger(Foo.class))

It's possible that Glassfish is hitting these classes via different paths for special types of classes like EntityListeners etc., based on those traces above.

I will take some time on Monday (I'm out of time today) to try to replicate this in a Docker environment with Payara Micro (a single-jar deployment of WAR applications), which will make it easier to experiment with the JVM options and the log4j configurations.

I want to stress that this usually works well and the scenario mentioned is rare, but when it happens, that log is broken until restart. I'm still unclear why this happens, because I don't think there are competing threads writing logs in different classloaders at the same time.

Thanks again for your help with this.

@marcreichman-pfi
Author

marcreichman-pfi commented May 6, 2024

@ppkarwasz I have a couple of new findings, which may make sense to you; at the very least I'll try a new configuration and see how it goes. I was able to dig in a bit deeper with JVisualVM.

  • I am walking back my finding that there are multiple sibling copies of RollingFileAppender in separate classloaders. I have 7 instances, which match my 7 configured appenders exactly; in jvisualvm they are in the same classloader.
  • I did find multiple sets of the base classes (StatusLogger, etc.) in different classloaders. I believe Glassfish does indeed scan for different types of annotations/JPA classes/etc. in a different classloader before activating the delegate classloader used for deployment. As far as I can see, there is no issue there.
  • I do, however, think I have a conflict because I set -Dlog4j.configurationFile: any "activation" of log4j arms that configuration. The initial scans seem to trigger it and see the file, and the later automatic activation of the library (via the log4j-jakarta-web module, plus a fallback log4j.xml I've always kept on the classpath in case people did not configure Glassfish properly) also sees the config file. In situations where the initial load triggers a rollover (date, size, etc.), the two can conflict, because it's not clear which configuration is handling the rollover.
  • I do not quite understand why this happens in normal operation after deployment, though, because only one set of RollingFileAppenders should be involved when the app itself writes log lines and triggers the rollover. It would make sense to me if log4j had a dedicated thread monitoring for rollover, but as far as I can read the code, the check happens lazily, on the spot, when a log line is written.
  • My first idea for a configuration change was to remove -Dlog4j.configurationFile from the main JVM arguments, remove the fallback log4j.xml inside src/main/resources, and rely on the web.xml context param log4jConfiguration, so that the main classloader is the only place that sees the external configuration file; the unconfigured basic setup would be used when classes are merely scanned, and the real configuration would be armed only once servlet init runs as expected. However, I realized that code which logs before the servlet container starts (e.g. EJB singletons, CDI ApplicationScoped in PostConstruct) is not covered by this.
  • My new plan is to remove -Dlog4j.configurationFile from the main JVM arguments, remove the fallback log4j.xml inside src/main/resources, remove log4j-jakarta-web altogether, and manually init/tear down with Configurator.initialize(appName, filePath) and LogManager.shutdown(); this application's lifecycle always starts fresh and never redeploys on a running service, so there is no need for the web module's complexity (see the sketch after this list). I will test whether any of the Stream Closed errors remain.
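A minimal sketch of that last option, assuming a plain ServletContextListener (registered in web.xml or via @WebListener) is acceptable; the class name, "appName", and the config path are placeholders:

import jakarta.servlet.ServletContextEvent;
import jakarta.servlet.ServletContextListener;

import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.core.config.Configurator;

public class Log4jLifecycleListener implements ServletContextListener {
    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // Arm the external configuration exactly once, from the application's
        // own classloader, instead of a JVM-wide -Dlog4j.configurationFile.
        Configurator.initialize("appName", "/path/to/log4j.xml");
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // Release appenders and their file descriptors on undeploy/shutdown.
        LogManager.shutdown();
    }
}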

Thoughts?

@ppkarwasz
Contributor

I really ought to look deeper into GlassFish.

The log4j-jakarta-web module should not create a new logger context unless the classloader returned by ServletContext#getClassLoader() and the classloader that loaded the application are different. We already encountered such a case with WebSphere (cf. LOG4J2-3557); maybe GlassFish has a similar feature?

@marcreichman-pfi
Author

It does look similar. I think with the changes I mentioned I am actually all set. I don't see the warnings on a successful roll that the file is already open, and I don't expect the stream closed issue either, but I'll keep an eye out.

The tricky part with these EE containers is that they initialize different types of components at different, and sometimes unpredictable, times. The servlet layer used by the web artifact is often initialized last, or at least later than you'd want to start logging. This makes relying solely on the web artifact workflow a bit underwhelming.

I'll reopen if I see anything else similar, but maybe it's worth some documentation touch-ups about not using the system properties in these environments, and a mention of the odd initialization timing.
