[Tiered Cache] Using a single cache manager for all ehcache disk caches #17513

Open
wants to merge 5 commits into main
Conversation

Contributor

@sgup432 sgup432 commented Mar 4, 2025

Description

Earlier, when creating N Ehcache disk caches, we created them via N cache managers, each with its own disk write thread pool (so N pools in total). We create N disk caches based on the tiered cache settings, and N is derived from the number of CPU cores. So we were essentially creating (CPU_CORES * 4) disk write threads, which is a lot and can cause CPU spikes with tiered caching enabled.

This change creates a single cache manager, and all subsequent caches are created via this single manager. As a result there is only one disk write thread pool, which is configured with 2 threads by default and can go up to
max(2, CPU_CORES / 8)
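
As a rough illustration of the approach, the sketch below builds one PersistentCacheManager backed by a single pooled execution service (Ehcache 3 API); the pool alias, storage path, and sizes are placeholders rather than the exact values used in this PR:

import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.PooledExecutionServiceConfigurationBuilder;

import java.io.File;

public class SharedDiskCacheManagerSketch {
    // Hypothetical alias; individual caches would reference the shared pool by this name.
    private static final String DISK_WRITE_POOL_ALIAS = "ehcacheDiskWritePool";

    public static PersistentCacheManager build(File storagePath, int maxDiskWriteThreads) {
        return CacheManagerBuilder.newCacheManagerBuilder()
            .with(CacheManagerBuilder.persistence(storagePath))
            // One pooled execution service shared by every cache created from this manager,
            // so there is a single disk write thread pool instead of one pool per cache.
            .using(
                PooledExecutionServiceConfigurationBuilder.newPooledExecutionServiceConfigurationBuilder()
                    .pool(DISK_WRITE_POOL_ALIAS, 1, maxDiskWriteThreads)
                    .build()
            )
            .build(true);
    }
}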

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@sgup432 sgup432 changed the title Using a single cache manager for all ehcache disk caches [Tiered Cache] Using a single cache manager for all ehcache disk caches Mar 4, 2025
sgup432 and others added 2 commits March 4, 2025 13:54
Contributor

github-actions bot commented Mar 4, 2025

❌ Gradle check result for 69bbc69: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Mar 5, 2025

❌ Gradle check result for c332f1b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sagar Upadhyaya <[email protected]>
Contributor

github-actions bot commented Mar 5, 2025

❌ Gradle check result for c13dcfa: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

* @param threadPoolAlias alias for disk thread pool
* @return persistent cache manager
*/
public synchronized static PersistentCacheManager getCacheManager(
Collaborator

Why do we need the synchronized keyword together with computeIfAbsent?

public class EhcacheDiskCacheManager {

    // Defines one cache manager per cache type.
    private static final Map<CacheType, Tuple<PersistentCacheManager, AtomicInteger>> cacheManagerMap = new HashMap<>();
Collaborator

Should the interface be ConcurrentMap and the implementation ConcurrentHashMap?
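
A minimal sketch of what this suggestion could look like, reusing the field name and value type from the excerpt above (buildCacheManager is a hypothetical helper, and this is not necessarily how the PR addresses the comment):

private static final ConcurrentMap<CacheType, Tuple<PersistentCacheManager, AtomicInteger>> cacheManagerMap =
    new ConcurrentHashMap<>();

// computeIfAbsent on a ConcurrentHashMap is atomic per key, so the lookup-or-create path
// by itself would not need an extra synchronized block (which also relates to the question above).
Tuple<PersistentCacheManager, AtomicInteger> entry = cacheManagerMap.computeIfAbsent(
    cacheType,
    type -> new Tuple<>(buildCacheManager(type), new AtomicInteger(0))
);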

Comment on lines +161 to +165
int referenceCount = cacheManagerMap.get(cacheType).v2().decrementAndGet();
// All caches associated with this cache manager have been closed; let's close the manager as well.
if (referenceCount == 0) {
    try {
        cacheManager.close();
Collaborator

Is it possible that we try to createCache between reading referenceCount and calling cacheManager.close()? How are we avoiding that situation?
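
One way to avoid that race is to guard both the create and close paths with the same lock, so the reference count check and the manager shutdown happen atomically with respect to createCache; a rough sketch under that assumption (names are illustrative, not necessarily what the PR does):

public static synchronized void closeCache(CacheType cacheType, String cacheAlias) {
    Tuple<PersistentCacheManager, AtomicInteger> entry = cacheManagerMap.get(cacheType);
    entry.v1().removeCache(cacheAlias);  // close and remove this particular cache
    if (entry.v2().decrementAndGet() == 0) {
        // Because createCache(...) would be synchronized on the same class lock, no new cache
        // can be created between reading the reference count and closing the manager.
        entry.v1().close();
        cacheManagerMap.remove(cacheType);
    }
}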

Comment on lines +52 to +58
(key) -> Setting.intSetting(
    key,
    max(2, Runtime.getRuntime().availableProcessors() / 8),
    1,
    Runtime.getRuntime().availableProcessors(),
    NodeScope
)
Collaborator

Reducing the number of threads can cause queuing of disk write operations, unless there is some batching logic within Ehcache. IMO, it is desirable for tiered caching to write the computed results to disk as soon as possible. The number of threads is not worrisome as long as those threads are not compute intensive, which in this case they are not.

@jainankitk
Collaborator

So we were essentially creating (CPU_CORES * 4) disk write threads, which is a lot and can cause CPU spikes with tiered caching enabled.

I am really curious whether we have observed any cases via hot_threads / flamegraphs that confirm disk write threads being responsible for CPU spikes. These threads should be I/O bound, and I wouldn't really expect them to cause an observable CPU spike.

This change creates a single cache manager, and all subsequent caches are created via this single manager. As a result there is only one disk write thread pool, which is configured with 2 threads by default and can go up to
max(2, CPU_CORES / 8)

The default of 2 looks really low to me. Assuming the TieredCache is written to upon successful completion of every SearchRequest, and these disk write threads are blocking (they don't pick up the next disk write until the previous one is finished), shouldn't the number of threads be at least the number of search threads?

@sgup432
Contributor Author

sgup432 commented Mar 20, 2025

I am really curious whether we have observed any cases via hot_threads / flamegraphs that confirm disk write threads being responsible for CPU spikes. These threads should be I/O bound, and I wouldn't really expect them to cause an observable CPU spike.

Not yet. We don't have a performance test that is able to reproduce this scenario. We ran our OSB benchmark with and without the change, and both were pretty similar in terms of performance (latency p50, p90, etc.).

The default of 2 looks really low to me. Assuming the TieredCache is written to upon successful completion of every SearchRequest, and these disk write threads are blocking (they don't pick up the next disk write until the previous one is finished), shouldn't the number of threads be at least the number of search threads?

We can discuss the default and increase it further. But the main objective of this change is to have a way to increase/decrease the number of disk write threads when needed, irrespective of the N partitions we create within the tiered cache. Right now, each disk cache object has its own write thread pool, and when we create N (CPU * 1.5) segments/disk cache objects, we are essentially creating (N * 4) disk write threads, which seems unnecessary and can cause unknown problems, and it is not possible to configure this down to <= (CPU * 1.5).
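
To make the before/after numbers concrete with a worked example on a hypothetical 32-core node, using the figures quoted in this reply: before this change, N = 32 * 1.5 = 48 disk cache segments, each with its own pool of 4 disk write threads, gives 48 * 4 = 192 threads; after this change, a single shared pool sized max(2, 32 / 8) = 4 threads serves all segments, regardless of N.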
