
[2.17.1] AsyncLogger thread is stuck on 100% CPU on lmax-34. queue polling #2169

Closed
shloim opened this issue Jan 7, 2024 · 12 comments
Labels
async Affects asynchronous loggers or appenders bug Incorrect, unexpected, or unintended behavior of existing code waiting-for-user More information is needed from the user

Comments


shloim commented Jan 7, 2024

Description

A low-probability race condition causes the LMAX Disruptor to loop indefinitely on Thread.yield.

Configuration

-DLog4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
-Dlog4j2.enable.threadlocals=true
-Dlog4j2.enable.direct.encoders=true
-XX:CompileCommand=dontinline,org.apache.logging.log4j.core.async.perftest.NoOpIdleStrategy::idle

Version: 2.17.1

Operating system: Centos 8

JDK: OpenJDK 64-Bit Server VM GraalVM LIBGRAAL 20.3.3 (build 11.0.12+5-jvmci-20.3-b06)

Logs

It would seem that the queue data structure becomes corrupted and causes the polling thread to loop indefinitely with the following stack trace:

"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #18 daemon prio=5 os_prio=0 cpu=4404755.82ms elapsed=4733.22s tid=0x00007fed7ade9800 nid=0xcd runnable  [0x00007fed7bffd000]
   java.lang.Thread.State: RUNNABLE
        at java.lang.Thread.yield(java.base@11.0.12/Native Method)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.transferAfterCancelledWait(java.base@11.0.12/AbstractQueuedSynchronizer.java:1752)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@11.0.12/AbstractQueuedSynchronizer.java:2119)
        at com.lmax.disruptor.TimeoutBlockingWaitStrategy.waitFor(TimeoutBlockingWaitStrategy.java:38)
        at com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)

Reproduction

This was observed on a large production system with very low probability, using Log4j with the settings above. We do not know the exact conditions under which the polling thread reaches this state.

@shloim shloim changed the title [2.17.1] AsyncLogger thread is stuck on 100% CPU on lmax-34. queue polling #471 [2.17.1] AsyncLogger thread is stuck on 100% CPU on lmax-34. queue polling Jan 7, 2024
@ppkarwasz ppkarwasz added bug Incorrect, unexpected, or unintended behavior of existing code async Affects asynchronous loggers or appenders labels Jan 7, 2024

vy commented Jan 8, 2024

@shloim, the upcoming Log4j 2 release (i.e., 2.23.0) will be the very first Log4j 2 version that ships Disruptor 4 support. (Disruptor 3 will still work too.) Is that something you can try?

ppkarwasz (Contributor) commented:

Alternatively, can you upgrade your Log4j version to 2.18.0 or later?

In 2.18.0 @remkop backported the TimeoutBlockingWaitStrategy from Disruptor 4.x with the following comment:

// IMPLEMENTATION NOTE:
// This is a copy of the com.lmax.disruptor.TimeoutBlockingWaitStrategy class in disruptor-4.0.0-RC1.
// The difference between this code and the implementation of this class in disruptor-3.4.4
// is that this class is garbage-free, because it uses a synchronized block instead of a ReentrantLock.
// This class is package-protected, so that it can be used internally as the default WaitStrategy
// by Log4j Async Loggers, but can be removed in a future Log4j release without impacting binary compatibility.
// (disruptor-4.0.0-RC1 requires Java 11 and has other incompatible changes so cannot be used in Log4j 2.x.)

The difference in the locking mechanism seems relevant to your case.
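To make that distinction concrete, here is a minimal sketch (an illustration only, not the actual Log4j or Disruptor source; class and method names are made up) of a timed wait built on a synchronized block and Object.wait. The intrinsic-monitor wait path allocates nothing per call, whereas ReentrantLock's Condition.awaitNanos creates an internal AQS waiter node each time:

```java
// Sketch: a garbage-free timed wait using intrinsic monitors, in the spirit
// of the backported strategy described above.
final class SynchronizedTimeoutWait {
    private final Object mutex = new Object();
    private boolean signalled = false;

    /** Waits up to timeoutMillis for a signal; returns false on timeout. */
    boolean waitFor(long timeoutMillis) {
        final long deadline = System.currentTimeMillis() + timeoutMillis;
        synchronized (mutex) {
            while (!signalled) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return false; // timed out; nothing was allocated while waiting
                }
                try {
                    mutex.wait(remaining); // intrinsic wait: no waiter-node allocation
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return false;
                }
            }
            return true;
        }
    }

    /** Wakes any waiter. */
    void signal() {
        synchronized (mutex) {
            signalled = true;
            mutex.notifyAll();
        }
    }
}
```
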


shloim commented Jan 8, 2024

[Quoting @ppkarwasz's comment above: "Alternatively, can you upgrade your Log4j version to 2.18.0 or later? […] The difference in the locking mechanism seems relevant to your case."]

Do I need to set a custom WaitStrategy or can I keep the default behavior?

ppkarwasz (Contributor) commented:

@shloim,

"Our" TimeoutBlockingWaitStrategy should be the default one.


shloim commented Jan 14, 2024

After switching to Log4j 2.18.0 the issue persists, but with a different stack trace:

"Log4j2-TF-1-AsyncLogger[AsyncContext@46f5f779]-1" #20 daemon prio=5 os_prio=0 cpu=3228328.37ms elapsed=3495.56s tid=0x00007fea233e9000 nid=0xd0 runnable  [0x00007fea23dfd000]
   java.lang.Thread.State: RUNNABLE
        at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
        at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
        at java.lang.Thread.run(java.base@11.0.12/Thread.java:829)


vy commented Jan 14, 2024

@shloim, can you give 2.23.0-SNAPSHOT a try in combination with Disruptor 4? (See docs on how to configure Maven/Gradle/SBT for snapshot builds.)


AndreKurait commented May 4, 2024

Hi,
I'm seeing this with 2.23.1 and Disruptor 4.0.0 in my application.

It happens when I use:

Log4jContextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector
AsyncLogger.WaitStrategy=Yield
The async logger thread shows the following stack trace:

"Log4j2-TF-1-AsyncLogger[AsyncContext@3b192d32]-1" - Thread t@14
   java.lang.Thread.State: RUNNABLE
	at java.base/java.lang.Thread.yield(Native Method)
	at app//com.lmax.disruptor.YieldingWaitStrategy.applyWaitMethod(YieldingWaitStrategy.java:58)
	at app//com.lmax.disruptor.YieldingWaitStrategy.waitFor(YieldingWaitStrategy.java:40)
	at app//com.lmax.disruptor.ProcessingSequenceBarrier.waitFor(ProcessingSequenceBarrier.java:56)
	at app//com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:156)
	at app//com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:122)
	at java.base/java.lang.Thread.run(Thread.java:829)

   Locked ownable synchronizers:
	- None

Operating system: macOS 14.4.1

JDK: openjdk:11-jre

Overall, I think this may be by design of the YieldingWaitStrategy, whose documentation states:

This strategy will use 100% CPU, but will more readily give up the CPU than a busy spin strategy if other threads require CPU resource.

Perhaps the Log4j async manual could be clearer; it currently states:

Yield is a strategy that uses a Thread.yield() for waiting for log events after an initially spinning. Yield is a good compromise between performance and CPU resource, but may use more CPU than Sleep in order to get the message logged to disk sooner.
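The behavior described above can be sketched as a spin-then-yield loop (a simplification for illustration only, not the actual Disruptor class, which also handles alerts and cursor/dependent sequences; the spin budget here is an assumption):

```java
import java.util.function.LongSupplier;

// Sketch of the spin-then-yield idea behind a yielding wait strategy.
// The thread never parks, so it stays RUNNABLE and keeps a core busy by design.
final class SpinThenYield {
    private static final int SPIN_TRIES = 100; // assumed spin budget

    /** Busy-waits until the supplier reports a value >= expected, then returns it. */
    static long waitFor(long expected, LongSupplier availableSequence) {
        int counter = SPIN_TRIES;
        long available;
        while ((available = availableSequence.getAsLong()) < expected) {
            if (counter > 0) {
                counter--;       // phase 1: pure spin for a bounded number of tries
            } else {
                Thread.yield();  // phase 2: yield forever; CPU usage stays near 100%
            }
        }
        return available;
    }
}
```

This is why a RUNNABLE thread parked in Thread.yield is expected, not pathological, under this strategy.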


vy commented May 6, 2024

Perhaps the Log4j async manual could be more clear

We can indeed do this. Updated the #2534 description accordingly.


vy commented May 6, 2024

@shloim, any updates? Were you able to reproduce the issue with Log4j 2.23.1 using Disruptor 4?

@vy vy added the waiting-for-user More information is needed from the user label May 6, 2024
ppkarwasz (Contributor) commented:

The documentation of Yield was fixed in #2577.

@ppkarwasz ppkarwasz removed the waiting-for-user More information is needed from the user label Oct 10, 2024
Baixiu-code commented:
I am seeing the same issue, with 100% CPU; the Disruptor version in use is 3.4.2.
[screenshot of thread dump attached]

ppkarwasz (Contributor) commented:

@Baixiu-code,

  • Are you using Log4j Core 2.24.1?
  • Did you try using com.lmax:disruptor version 4.0.0?
  • What is the status of the polling thread? The polling thread's name starts with Log4j2-TF-<number>-AsyncLogger, as shown in the initial comment of this issue. Your screenshot only covers the logging threads, which are blocked because the queue is full. Can you provide the stack trace of the polling thread as text instead of an image?
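A stack trace in text form can be captured along these lines (a sketch; it assumes the JDK's jstack tool is on the PATH, and <pid> is a placeholder for the target JVM's process id, e.g. obtained from jps):

```shell
# Dump all thread stacks from the running JVM, then keep the async logger
# thread header plus the frames that follow it.
jstack <pid> | grep -A 15 'Log4j2-TF-'
```
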

@ppkarwasz ppkarwasz added waiting-for-user More information is needed from the user and removed waiting-for-maintainer labels Nov 6, 2024