[Internal] Circuit Breaker: Adds Code to Implement Per Partition Circuit Breaker #5023

kundadebdatta · 2025-02-15T02:19:21Z

Pull Request Template

Description

The idea behind the per partition circuit breaker (PPCB) is to optimize a) read availability in a single master account and b) read and write availability in a multi master account while a specific partition in one of the regions is experiencing an outage or quorum loss. This feature is independent of the partition level failover triggered by the backend. The per partition circuit breaker is developed behind the feature flag AZURE_COSMOS_CIRCUIT_BREAKER_ENABLED.

However, when the partition level failover is enabled, PPCB will be enabled by default so that reads can benefit from it.

Scope

For single master, only the read requests will use the circuit breaker to add the pk-range to region override mapping, and use this mapping as a source of truth to route the read requests.
For multi master, both the read and write requests will use the circuit breaker to add the pk-range to region override mapping and use this mapping as a source of truth to route both read and write requests.

Understanding the Configurations exposed by the environment variables:

AZURE_COSMOS_CIRCUIT_BREAKER_ENABLED: Enables or disables the partition level circuit breaker feature. The default value is false.
AZURE_COSMOS_PPCB_STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS: Sets the interval of the periodic background address refresh task. The default value is 60 seconds.
AZURE_COSMOS_PPCB_ALLOWED_PARTITION_UNAVAILABILITY_DURATION_IN_SECONDS: Sets the partition unavailability duration in seconds, that is, how long a partition can remain unhealthy before its connection status is re-validated. The default value is 5 seconds.
AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT_FOR_READS: Sets the number of consecutive read failures required before the per partition circuit breaker flow is triggered. The default value is 10 consecutive failures within a 1-minute window.
AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT_FOR_WRITES: Sets the number of consecutive write failures required before the per partition circuit breaker flow is triggered. The default value is 5 consecutive failures within a 1-minute window.
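
As a rough illustration of how these settings and their defaults fit together, here is a minimal sketch that reads them from the environment. The class `PpcbSettings` and its members are hypothetical and are not the SDK's `ConfigurationManager`; only the environment variable names and default values come from the list above. Falling back to the default on a missing, malformed, or non-positive value is an assumption of this sketch.

```csharp
using System;

// Hypothetical helper for illustration only; not the SDK's ConfigurationManager.
public static class PpcbSettings
{
    public static bool Enabled =>
        ReadBool("AZURE_COSMOS_CIRCUIT_BREAKER_ENABLED", defaultValue: false);

    public static int RefreshIntervalInSeconds =>
        ReadInt("AZURE_COSMOS_PPCB_STALE_PARTITION_UNAVAILABILITY_REFRESH_INTERVAL_IN_SECONDS", 60);

    public static int AllowedUnavailabilityInSeconds =>
        ReadInt("AZURE_COSMOS_PPCB_ALLOWED_PARTITION_UNAVAILABILITY_DURATION_IN_SECONDS", 5);

    public static int ReadFailureThreshold =>
        ReadInt("AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT_FOR_READS", 10);

    public static int WriteFailureThreshold =>
        ReadInt("AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT_FOR_WRITES", 5);

    private static bool ReadBool(string name, bool defaultValue)
    {
        // TryParse returns false for null or malformed input, so the default applies.
        var raw = Environment.GetEnvironmentVariable(name);
        return bool.TryParse(raw, out bool parsed) ? parsed : defaultValue;
    }

    private static int ReadInt(string name, int defaultValue)
    {
        // Assumption of this sketch: non-positive values are rejected in favor of the default.
        var raw = Environment.GetEnvironmentVariable(name);
        return int.TryParse(raw, out int parsed) && parsed > 0 ? parsed : defaultValue;
    }
}
```

Each property re-reads the environment on access, so a value changed at runtime is picked up on the next read; a real implementation may well cache these at client construction instead.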

Understanding the Working Principle:

On a high level, there are three parts of the circuit breaker implementation:

Short circuit and failover detection: The failover detection logic resides in the SDK ClientRetryPolicy, just as it does for PPAF (per partition automatic failover). The detection logic is based on the two principles below:
- Status codes: The failures that count toward the partition level circuit breaker are: a) 503 Service Unavailable, b) 408 Request Timeout, and c) an expired cancellation token.
- Threshold: Once a qualifying failure is seen, the SDK counts consecutive failures for that partition until the count reaches the configured threshold. When the threshold is met or exceeded, the SDK records the region failover information for the offending partition and fails its requests over to the next preferred region. For example, if the threshold for read requests is 10, the SDK waits for 10 consecutive failures before failing over reads.
Failover of a faulty partition to the next preferred region: Once the failover conditions are met, the ClientRetryPolicy triggers a partition level override to the next region in the preferred region list using GlobalPartitionEndpointManagerCore.TryMarkEndpointUnavailableForPartitionKeyRange. This failover information lets the current request, as well as subsequent requests (reads in single master; both reads and writes in multi master), route to the next region.
Failback of the faulty partition to its original first preferred region: With PPAF enabled, write requests ideally rely on the 403.3 (Write Forbidden) signal to fail the partition back to the primary write region. No such signal exists for reads, so the SDK has no definitive way to know when to initiate a failback for read requests.
Hence, the idea is to create a background task at the time of a read failover that keeps track of the pk-range to region mapping. The task periodically fetches the addresses for those pk-ranges in the faulty region from the gateway address cache and tries to open an RNTBD connection to all 4 replicas of each partition. The RNTBD open connection attempt is made in the same way as in the replica validation flow. The background task starts during a failover and lives until the SDK is disposed.
If the connection attempt succeeds for all 4 replicas, the task removes the override entry (or overrides it with the primary region), causing the SDK to fail the read requests back.
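
The three parts above can be sketched as a small state machine per pk-range. This is a simplified model under stated assumptions, not the code in this PR: `PartitionOverrideTracker`, its methods, and the `allReplicasReachable` callback are hypothetical names, the 1-minute failure window and the allowed unavailability duration are not modeled, and the replica connection check is left to the caller.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical model of the pk-range to region override mapping; not the SDK's implementation.
public sealed class PartitionOverrideTracker
{
    private sealed class State
    {
        public int ReadFailures;
        public int WriteFailures;
        public int RegionIndex; // index into the preferred region list
    }

    private readonly IReadOnlyList<string> preferredRegions;
    private readonly int readThreshold;
    private readonly int writeThreshold;
    private readonly ConcurrentDictionary<string, State> states =
        new ConcurrentDictionary<string, State>();

    public PartitionOverrideTracker(
        IReadOnlyList<string> preferredRegions, int readThreshold = 10, int writeThreshold = 5)
    {
        this.preferredRegions = preferredRegions;
        this.readThreshold = readThreshold;
        this.writeThreshold = writeThreshold;
    }

    // Region that requests for this pk-range should currently be routed to.
    public string ResolveRegion(string pkRangeId)
    {
        if (!this.states.TryGetValue(pkRangeId, out var state))
        {
            return this.preferredRegions[0];
        }
        lock (state) { return this.preferredRegions[state.RegionIndex]; }
    }

    // Records a qualifying failure (503, 408, expired cancellation token).
    // Returns true when this failure moved the partition to the next preferred region.
    public bool RecordFailure(string pkRangeId, bool isWrite)
    {
        State state = this.states.GetOrAdd(pkRangeId, _ => new State());
        lock (state)
        {
            int count = isWrite ? ++state.WriteFailures : ++state.ReadFailures;
            int threshold = isWrite ? this.writeThreshold : this.readThreshold;
            if (count < threshold || state.RegionIndex + 1 >= this.preferredRegions.Count)
            {
                return false;
            }
            state.RegionIndex++;
            state.ReadFailures = 0;
            state.WriteFailures = 0;
            return true;
        }
    }

    // A success breaks the streak, since only consecutive failures count.
    public void RecordSuccess(string pkRangeId, bool isWrite)
    {
        if (this.states.TryGetValue(pkRangeId, out var state))
        {
            lock (state)
            {
                if (isWrite) { state.WriteFailures = 0; } else { state.ReadFailures = 0; }
            }
        }
    }

    // One pass of the background failback check: drop the override for every failed-over
    // pk-range whose replicas in the original region are all reachable again.
    public async Task RunFailbackPassAsync(Func<string, Task<bool>> allReplicasReachable)
    {
        foreach (string pkRangeId in this.states.Keys) // Keys is a snapshot, safe to mutate below
        {
            bool failedOver = this.ResolveRegion(pkRangeId) != this.preferredRegions[0];
            if (failedOver && await allReplicasReachable(pkRangeId))
            {
                this.states.TryRemove(pkRangeId, out _);
            }
        }
    }
}
```

In this sketch a caller would invoke `RunFailbackPassAsync` on a timer at the configured refresh interval, passing a callback that attempts the connection to every replica of the pk-range in the original region.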

Type of change

New feature (non-breaking change which adds functionality)

Closing issues

To automatically close an issue: closes #4981

github-actions

All good!

jeet1995 · 2025-02-27T19:05:05Z

Microsoft.Azure.Cosmos/tests/Microsoft.Azure.Cosmos.EmulatorTests/CosmosItemIntegrationTests.cs

+        [DataRow("30", DisplayName = "Scenario when the circuit breaker consecutive failure threshold is set to 30.")]
+        [Owner("dkunda")]
+        [Timeout(70000)]
+        public async Task ReadItemAsync_WithCircuitBreakerEnabledAndSingleMasterAccountAndServiceUnavailableReceived_ShouldApplyPartitionLevelOverride(


It would be good to extend this to non-point operations such as queries and batch.

jeet1995 · 2025-02-27T20:02:00Z

Microsoft.Azure.Cosmos/src/Routing/GlobalPartitionEndpointManagerCore.cs

+        /// location for the partition key range.
+        /// </summary>
+        /// <returns>A task representing the asynchronous operation.</returns>
+        private async Task TryOpenConnectionToUnhealthyEndpointsAndInitiateFailbackAsync()


One question I had about this: are these background tasks assigned to threads from a dedicated, bounded thread pool? We have been very conservative here in the Java SDK (at most 1 or 2 threads running this background recovery flow, so as not to affect the hot path). How are we bounding / managing this?

kundadebdatta · 2025-03-07T18:32:21Z

AZURE_COSMOS_PPCB_CONSECUTIVE_FAILURE_COUNT: This environment variable is used to set the consecutive failure count for reads, before triggering per partition level circuit breaker flow. The default value for this flag is 10 consecutive requests within 1 min window.

So according to this, the circuit breaking threshold is same for both reads and writes. A write workload has higher chance of faulting than reads due to 1 primary v/s 3 secondaries (SF automation for primary change might be very sensitive with primary down regardless) - so I'd prefer having separate thresholds. This will allow failover trigger through writes more aggressively if needed even if the workload itself is read heavy.

Good point. I think it's a good idea to keep separate thresholds for reads and writes. Exposed two separate environment variables for capturing the read and write thresholds.

kirankumarkolli · 2025-03-11T11:34:38Z

Microsoft.Azure.Cosmos/src/CosmosClientOptions.cs

+        /// <summary>
+        /// Enable partition level circuit breaker
+        /// </summary>
+        internal bool EnablePartitionLevelCircuitBreaker { get; set; } = ConfigurationManager.IsPartitionLevelCircuitBreakerEnabled(defaultValue: false);


PPCB dependent on PPAF, argument check

May not be always true right?

The real validation is happening on line# 988:

Let me know if it makes sense.

kirankumarkolli · 2025-03-11T11:37:27Z

Microsoft.Azure.Cosmos/src/DocumentClient.cs

@@ -939,8 +939,11 @@ internal virtual void Initialize(Uri serviceEndpoint,
 #endif

            this.GlobalEndpointManager = new GlobalEndpointManager(this, this.ConnectionPolicy);
-            this.PartitionKeyRangeLocation = this.ConnectionPolicy.EnablePartitionLevelFailover
-                ? new GlobalPartitionEndpointManagerCore(this.GlobalEndpointManager)
+            this.PartitionKeyRangeLocation = this.ConnectionPolicy.EnablePartitionLevelFailover || this.ConnectionPolicy.EnablePartitionLevelCircuitBreaker


PPCB dependent on PPAF, some PPAF check might suffice?

kirankumarkolli · 2025-03-11T11:40:06Z

Microsoft.Azure.Cosmos/src/Util/ConfigurationManager.cs

+        /// A read-only string containing the environment variable name for enabling per partition circuit breaker. The default value
+        /// for this flag is false.
+        /// </summary>
+        internal static readonly string PartitionLevelCircuitBreakerEnabled = "AZURE_COSMOS_PARTITION_LEVEL_CIRCUIT_BREAKER_ENABLED";


nit: Just for simplicity how about use PPCB? (Below configs do use it anyways)

I updated the variable name to AZURE_COSMOS_CIRCUIT_BREAKER_ENABLED, for simplicity.

kirankumarkolli · 2025-03-11T11:56:14Z

Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs

+        }
+
+        /// <summary>
+        /// Determines if a read request is eligible for partition-level circuit breaker.


nit: The doc says read request, but writes in MM are also eligible.

Updated the verbiage.

kirankumarkolli · 2025-03-11T12:04:16Z

Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs

@@ -454,10 +457,14 @@ private ShouldRetryResult TryMarkEndpointUnavailableForPkRangeAndRetryOnServiceU

            if (shouldMarkEndpointUnavailableForPkRange)


Some failures might be fatal and might need to bypass the circuit breaker? (ex: 429/3092)

Yes. Thanks for pointing this out. 429/3092 should trigger the partition failover irrespective of whether the PPCB counter has reached its threshold. Updated the code to reflect the same.
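
A minimal sketch of the decision described in this thread, assuming the caller already knows the status code, sub-status code, and current consecutive failure count. `FailoverDecision` and `ShouldFailOverPartition` are hypothetical names and do not reflect the actual change in ClientRetryPolicy; cancellation token expiry is left out for brevity.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class FailoverDecision
{
    public static bool ShouldFailOverPartition(
        int statusCode, int subStatusCode, int consecutiveFailures, int threshold)
    {
        // 429/3092 fails the partition over immediately, regardless of the counter.
        bool bypassesCounter = statusCode == 429 && subStatusCode == 3092;

        // 503 and 408 only fail over once the consecutive failure threshold is reached.
        bool countsTowardThreshold = statusCode == 503 || statusCode == 408;

        return bypassesCounter || (countsTowardThreshold && consecutiveFailures >= threshold);
    }
}
```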

kirankumarkolli · 2025-03-11T12:08:40Z

Microsoft.Azure.Cosmos/src/CosmosClientOptions.cs

+        /// <summary>
+        /// Enable partition level circuit breaker
+        /// </summary>
+        internal bool EnablePartitionLevelCircuitBreaker { get; set; } = ConfigurationManager.IsPartitionLevelCircuitBreakerEnabled(defaultValue: false);


IMP: What's the guidance for compute gateway?

For compute gateway, PPAF is disabled by default, and so is PPCB. If compute chooses to enable PPAF, then we will enable PPCB by default. Are you thinking about the background validation task for the failed addresses? If so, we can choose a model that enables PPCB by default and lets the customer opt out if they don't want PPCB.

But given that the address validation is guarded by the channel dictionary's at-least-one-healthy-channel validation, we should be good IMO.

Can we please include this in the property codedoc for reference?

kirankumarkolli · 2025-03-11T12:12:35Z

Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs

+        /// </returns>
+        private bool IsRequestEligibleForPartitionLevelCircuitBreaker()
+        {
+            return (this.documentServiceRequest.IsReadOnlyRequest || this.isMultiMasterWriteRequest)


Is MM + PPAF in scope in near term?

I felt that doing both (SM + MM) at the same time would help, since the differences between PPCB + reads and PPCB + MM + writes are small. Added MM tests to cover the changes.

kirankumarkolli · 2025-03-11T12:19:22Z

Microsoft.Azure.Cosmos/src/ClientRetryPolicy.cs

+        /// <returns>
+        /// True if the request is a write request; otherwise, false.
+        /// </returns>
+        private bool IsRequestEligibleForPerPartitionAutomaticFailover()


non-blocking: One option is to move this logic inside partitionKeyRangeLocationCache as well, which is solely responsible for state maintenance (counting, resetting and deciding when to fail over).

Thanks for the suggestion. I feel keeping the logic in one single place is always a good idea. Refactored the code to keep the logic in partitionKeyRangeLocationCache

kundadebdatta added 6 commits February 5, 2025 22:12

Initial Code Changes to implement PPCB.

7ba373f

Code changes to create the background task for address fetch and validation.

ffa23b7

Code changes to add consecutive failure logic.

943a1f1

Code changes for thread safety. MM support.

cedf023

Code changes to add fault injection tests for partition level circuit breaker.

c93e80e

Adding multi master test

af45ead

github-actions bot reviewed Feb 15, 2025

Code changes to fix unit tests.

240db8c

kundadebdatta changed the title ~~[Internal] PPCB: Implement Per Partition Circuit Breaker~~ [Internal] Circuit Breaker: Adds Code to Implement Per Partition Circuit Breaker Feb 17, 2025

kundadebdatta and others added 7 commits February 17, 2025 12:28

Code changes to fix emulator test.

e9ba777

Code changes to fix multi region tests.

1214820

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

dfcf1eb

Code changes to add more tests. More clean up added.

0fa5f11

Code changes to add more tests. Code cleanup.

aeac39c

Code changes for more clean up.

85b54ad

Updating the config manager to reflect the correct summary for environment variables.

698e281

kundadebdatta self-assigned this Feb 19, 2025

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

81363d4

kundadebdatta added the PerPartitionAutomaticFailover label Feb 19, 2025

kundadebdatta marked this pull request as ready for review February 19, 2025 17:40

kundadebdatta requested review from khdang, sboshra, adityasa, neildsh, kirankumarkolli, FabianMeiswinkel, kirillg and Pilchie as code owners February 19, 2025 17:40

kundadebdatta added the auto-merge Enables automation to merge PRs label Feb 19, 2025

microsoft-github-policy-service bot enabled auto-merge (squash) February 19, 2025 17:44

jeet1995 reviewed Feb 27, 2025

kundadebdatta and others added 10 commits March 1, 2025 20:49

Code changes to optimize read failover behavior.

bea2bad

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

79e3db1

Code changes to resolve conflicts.

34ed78e

Code changes to fix unit tests.

3361cf3

Code changes to use indefinite while loop instead of recursion.

9160380

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

7b9a030

Code changes to add user agent for PPCB.

209bf46

Code changes to fix user agent test.

99dd0a4

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

bb62bce

Code changes to separate threshold for reads and writes.

be3d769

kundadebdatta and others added 3 commits March 7, 2025 11:09

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

58db688

Code changes to fix tests in pipeline.

addc9ca

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

e592f5c

kirankumarkolli reviewed Mar 11, 2025

kundadebdatta and others added 5 commits March 12, 2025 09:54

Code changes to address review comments.

b019207

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

ab94c39

Code changes to fix unit tests in the build pipeline.

b0aef09

Code changes to bypass circuit breaker check when 429/3092 happens.

3e6ca93

Merge branch 'master' into users/kundadebdatta/4981_implement_per_partition_circuit_breaker

ad79426
