[Tiered Cache] Using a single cache manager for all ehcache disk caches #17513

Open
wants to merge 5 commits into main
Conversation

Contributor

@sgup432 sgup432 commented Mar 4, 2025

Description

Earlier, when creating N Ehcache disk caches, we created them via N cache managers, each with its own disk write thread pool (so N pools in total). We create N disk caches based on the tiered cache settings, and N is derived from the number of CPU cores. So we were essentially creating (CPU_CORES * 4) disk write threads, which is a lot and can cause CPU spikes with tiered caching enabled.

This change creates a single cache manager, and all subsequent caches are created via this single manager. As a result there is only one disk write thread pool, which is configured with 2 threads by default and can go up to
max(2, CPU_CORES / 8)
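
As a rough illustration of the approach, the sketch below builds one PersistentCacheManager backed by a single pooled execution service (Ehcache 3 API); the pool alias, storage path, and sizes are placeholders rather than the exact values used in this PR:

import org.ehcache.PersistentCacheManager;
import org.ehcache.config.builders.CacheManagerBuilder;
import org.ehcache.config.builders.PooledExecutionServiceConfigurationBuilder;

import java.io.File;

public class SharedDiskCacheManagerSketch {
    // Hypothetical alias; individual caches would reference the shared pool by this name.
    private static final String DISK_WRITE_POOL_ALIAS = "ehcacheDiskWritePool";

    public static PersistentCacheManager build(File storagePath, int maxDiskWriteThreads) {
        return CacheManagerBuilder.newCacheManagerBuilder()
            .with(CacheManagerBuilder.persistence(storagePath))
            // One pooled execution service shared by every cache created from this manager,
            // so there is a single disk write thread pool instead of one pool per cache.
            .using(
                PooledExecutionServiceConfigurationBuilder.newPooledExecutionServiceConfigurationBuilder()
                    .pool(DISK_WRITE_POOL_ALIAS, 1, maxDiskWriteThreads)
                    .build()
            )
            .build(true);
    }
}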

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • [ ] Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@sgup432 sgup432 changed the title Using a single cache manager for all ehcache disk caches [Tiered Cache] Using a single cache manager for all ehcache disk caches Mar 4, 2025
sgup432 and others added 2 commits March 4, 2025 13:54
Contributor

github-actions bot commented Mar 4, 2025

❌ Gradle check result for 69bbc69: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Mar 5, 2025

❌ Gradle check result for c332f1b: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sagar Upadhyaya <[email protected]>
Contributor

github-actions bot commented Mar 5, 2025

❌ Gradle check result for c13dcfa: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

* @param threadPoolAlias alias for disk thread pool
* @return persistent cache manager
*/
public synchronized static PersistentCacheManager getCacheManager(
Collaborator

Why do we need the synchronized keyword together with computeIfAbsent?

public class EhcacheDiskCacheManager {

    // Defines one cache manager per cache type.
    private static final Map<CacheType, Tuple<PersistentCacheManager, AtomicInteger>> cacheManagerMap = new HashMap<>();
Collaborator

Should the interface be ConcurrentMap and the implementation ConcurrentHashMap?
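
A minimal sketch of what this suggestion could look like, reusing the field name and value type from the excerpt above (buildCacheManager is a hypothetical helper, and this is not necessarily how the PR addresses the comment):

private static final ConcurrentMap<CacheType, Tuple<PersistentCacheManager, AtomicInteger>> cacheManagerMap =
    new ConcurrentHashMap<>();

// computeIfAbsent on a ConcurrentHashMap is atomic per key, so the lookup-or-create path
// by itself would not need an extra synchronized block (which also relates to the question above).
Tuple<PersistentCacheManager, AtomicInteger> entry = cacheManagerMap.computeIfAbsent(
    cacheType,
    type -> new Tuple<>(buildCacheManager(type), new AtomicInteger(0))
);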

Comment on lines +161 to +165
int referenceCount = cacheManagerMap.get(cacheType).v2().decrementAndGet();
// All caches associated with this cache manager have been closed; let's close the manager as well.
if (referenceCount == 0) {
    try {
        cacheManager.close();
Collaborator

Is it possible that we try to createCache between reading referenceCount and calling cacheManager.close()? How are we avoiding that situation?
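
One way to avoid that race is to guard both the create and close paths with the same lock, so the reference count check and the manager shutdown happen atomically with respect to createCache; a rough sketch under that assumption (names are illustrative, not necessarily what the PR does):

public static synchronized void closeCache(CacheType cacheType, String cacheAlias) {
    Tuple<PersistentCacheManager, AtomicInteger> entry = cacheManagerMap.get(cacheType);
    entry.v1().removeCache(cacheAlias);  // close and remove this particular cache
    if (entry.v2().decrementAndGet() == 0) {
        // Because createCache(...) would be synchronized on the same class lock, no new cache
        // can be created between reading the reference count and closing the manager.
        entry.v1().close();
        cacheManagerMap.remove(cacheType);
    }
}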

Comment on lines +52 to +58
(key) -> Setting.intSetting(
    key,
    max(2, Runtime.getRuntime().availableProcessors() / 8),
    1,
    Runtime.getRuntime().availableProcessors(),
    NodeScope
)
Collaborator

Reducing the number of threads can cause queuing of disk write operations, unless there is some batching logic within Ehcache. IMO, it is desirable for tiered caching to write the computed results to disk as soon as possible. The number of threads is not worrisome as long as those threads are not compute intensive, which in this case they are not.

@jainankitk
Collaborator

So we were essentially creating (CPU_CORES * 4) disk write threads, which is a lot and can cause CPU spikes with tiered caching enabled.

I am really curious whether we have observed any cases via hot_threads / flamegraphs that confirm disk write threads being responsible for CPU spikes. These threads should be I/O bound, and I wouldn't really expect them to cause an observable CPU spike.

This change creates a single cache manager, and all subsequent caches are created via this single manager. As a result there is only one disk write thread pool, which is configured with 2 threads by default and can go up to
max(2, CPU_CORES / 8)

The default of 2 looks really low to me. Assuming the TieredCache is written to upon successful completion of every SearchRequest, and these disk write threads are blocking (they don't pick up the next disk write until the previous one is finished), shouldn't the number of threads be at least the number of search threads?

@sgup432
Contributor Author

sgup432 commented Mar 20, 2025

I am really curious whether we have observed any cases via hot_threads / flamegraphs that confirm disk write threads being responsible for CPU spikes. These threads should be I/O bound, and I wouldn't really expect them to cause an observable CPU spike.

Not yet. We don't have a performance test that is able to reproduce this scenario. We ran our OSB benchmark with and without the change, and both were pretty similar in terms of performance (latency p50, p90, etc.).

The default of 2 looks really low to me. Assuming the TieredCache is written to upon successful completion of every SearchRequest, and these disk write threads are blocking (they don't pick up the next disk write until the previous one is finished), shouldn't the number of threads be at least the number of search threads?

We can discuss the default and increase it further. But the main objective of this change is to have a way to increase/decrease the number of disk write threads when needed, irrespective of the N partitions we create within the tiered cache. Right now, each disk cache object has its own write thread pool, and when we create N (CPU * 1.5) segments/disk cache objects, we are essentially creating (N * 4) disk write threads, which seems unnecessary and can cause unknown problems, and it is not possible to configure this down to <= (CPU * 1.5).
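
To make the before/after numbers concrete with a worked example on a hypothetical 32-core node, using the figures quoted in this reply: before this change, N = 32 * 1.5 = 48 disk cache segments, each with its own pool of 4 disk write threads, gives 48 * 4 = 192 threads; after this change, a single shared pool sized max(2, 32 / 8) = 4 threads serves all segments, regardless of N.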
