[Per Partition Automatic Failover] - Apply Partition Level Failover When Cancellation is Requested on a User Provided Cancellation Token #5060

Open
kundadebdatta opened this issue Mar 11, 2025 · 0 comments · May be fixed by #5063

kundadebdatta commented Mar 11, 2025

Background:

During one of the backend drills, we found that when the following quorum-loss conditions are met and the user provides a cancellation token, the SDK honors the token but does not apply partition-level failover for the offending partition:

  • Quorum loss is injected on the replica set (3 out of 4 replicas are down).
  • The primary replica is among the replicas that are down.
  • A cancellation token with a 5-second timeout is provided.

Observation:

  • The SDK doesn't apply the partition-level override, and subsequent write requests fail against the same faulty region/partition.

Sample Diagnostics:

Diagnostics-1
{
	"Summary": {
		"GatewayCalls": {
			"(200, 0)": 3
		}
	},
	"name": "CreateItemAsync",
	"start datetime": "2025-03-10T20:37:36.289Z",
	"duration in milliseconds": 5012.8192,
	"data": {
		"Client Configuration": {
			"Client Created Time Utc": "2025-03-10T20:27:48.6537870Z",
			"MachineId": "hashedMachineName:dd823358-1397-c938-1a2d-e52a0b922240",
			"NumberOfClientsCreated": 1,
			"NumberOfActiveClients": 1,
			"ConnectionMode": "Direct",
			"User Agent": "cosmos-netstandard-sdk/3.47.2|1|X64|Microsoft Windows 10.0.26100|.NET 8.0.13|L|dkunda-ppaf-writer-app",
			"ConnectionConfig": {
				"gw": "(cps:50, urto:6, p:False, httpf: False)",
				"rntbd": "(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)",
				"other": "(ed:False, be:False)"
			},
			"ConsistencyConfig": "(consistency: Session, prgns:[North Central US, Central US, West US 2], apprgn: )",
			"ProcessorCount": 12
		}
	},
	"children": [
		{
			"name": "ItemSerialize",
			"duration in milliseconds": 0.0391
		},
		{
			"name": "Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler",
			"duration in milliseconds": 5012.3797,
			"children": [
				{
					"name": "Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler",
					"duration in milliseconds": 5012.2604,
					"children": [
						{
							"name": "Microsoft.Azure.Cosmos.Handlers.TelemetryHandler",
							"duration in milliseconds": 5012.2034,
							"children": [
								{
									"name": "Microsoft.Azure.Cosmos.Handlers.RetryHandler",
									"duration in milliseconds": 5012.1412,
									"children": [
										{
											"name": "Microsoft.Azure.Cosmos.Handlers.RouterHandler",
											"duration in milliseconds": 5012.0268,
											"children": [
												{
													"name": "Microsoft.Azure.Cosmos.Handlers.TransportHandler",
													"duration in milliseconds": 5011.9925,
													"children": [
														{
															"name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
															"duration in milliseconds": 5011.821,
															"data": {
																"Client Side Request Stats": {
																	"Id": "AggregatedClientSideRequestStatistics",
																	"ContactedReplicas": [
																		{
																			"Count": 1,
																			"Uri": "rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																		}
																	],
																	"RegionsContacted": [

																	],
																	"FailedReplicas": [

																	],
																	"ForceAddressRefresh": [
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		},
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		},
																		{
																			"No change to cache": [
																				"rntbd://cdb-ms-test61-northcentralus1-be1.documents-test.windows-int.net:14008/apps/bad4fbdb-e7dd-45be-9480-7d9fd240a2a5/services/3385fbba-d551-4f50-a6a9-c92c49bd48f6/partitions/b04f9c0b-defd-46d9-a8ed-0f275fb96430/replicas/133859571086460706s/"
																			]
																		}
																	],
																	"AddressResolutionStatistics": [
																		{
																			"StartTimeUTC": "2025-03-10T20:37:36.2907018Z",
																			"EndTimeUTC": "2025-03-10T20:37:36.4220266Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:37.4364198Z",
																			"EndTimeUTC": "2025-03-10T20:37:37.5116960Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:39.5211252Z",
																			"EndTimeUTC": "2025-03-10T20:37:39.6584297Z",
																			"TargetEndpoint": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0"
																		}
																	],
																	"StoreResponseStatistics": [

																	],
																	"HttpResponseStats": [
																		{
																			"StartTimeUTC": "2025-03-10T20:37:36.2907438Z",
																			"DurationInMs": 69.7863,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
																			"StatusCode": "OK"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:37.4364448Z",
																			"DurationInMs": 75.2017,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
																			"StatusCode": "OK"
																		},
																		{
																			"StartTimeUTC": "2025-03-10T20:37:39.5211781Z",
																			"DurationInMs": 73.2295,
																			"RequestUri": "https://dkunda-ppaf-session-northcentralus.documents-test.windows-int.net//addresses/?$resolveFor=dbs%2fLHw67A%3d%3d%2fcolls%2fLHw67K25IVs%3d%2fdocs&$filter=protocol eq rntbd&$partitionKeyRangeIds=0",
																			"ResourceType": "Document",
																			"HttpMethod": "GET",
																			"ActivityId": "1b8c5396-c3bd-447a-9ade-a17b498e2815",
																			"StatusCode": "OK"
																		}
																	]
																}
															}
														}
													]
												}
											]
										}
									]
								}
							]
						}
					]
				}
			]
		},
		{
			"name": "CosmosOperationCanceledException",
			"duration in milliseconds": 0.0125,
			"data": {
				"Operation Cancelled Exception": "System.Threading.Tasks.TaskCanceledException: A task was canceled.\r\n   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n   at Microsoft.Azure.Documents.RequestRetryUtility.ProcessRequestAsync[TRequest,IRetriableResponse](Func`1 executeAsync, Func`1 prepareRequest, IRequestRetryPolicy`2 policy, CancellationToken cancellationToken, Func`1 inBackoffAlternateCallbackMethod, Nullable`1 minBackoffForInBackoffCallback)\r\n   at Microsoft.Azure.Documents.StoreClient.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken, IRetryPolicy retryPolicy)\r\n   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RouterHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.ExecuteHttpRequestAsync(Func`1 callbackMethod, Func`3 callShouldRetry, Func`3 callShouldRetryException, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.AbstractRetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.TelemetryHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at 
Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.RequestHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.BaseSendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler.SendAsync(String resourceUriString, ResourceType resourceType, OperationType operationType, RequestOptions requestOptions, ContainerInternal cosmosContainerCore, FeedRange feedRange, Stream streamPayload, Action`1 requestEnricher, ITrace trace, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ContainerCore.ProcessItemStreamAsync(Nullable`1 partitionKey, String itemId, Stream streamPayload, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, Nullable`1 targetResponseSerializationFormat, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ContainerCore.ExtractPartitionKeyAndProcessItemStreamAsync[T](Nullable`1 partitionKey, String itemId, T item, OperationType operationType, ItemRequestOptions requestOptions, ITrace trace, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ContainerCore.CreateItemAsync[T](T item, ITrace trace, Nullable`1 partitionKey, ItemRequestOptions requestOptions, CancellationToken cancellationToken)\r\n   at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](String containerName, String databaseName, OperationType operationType, ITrace trace, Func`2 task, Nullable`1 openTelemetry, RequestOptions requestOptions, Nullable`1 resourceType)"
			}
		}
	]
}
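The pattern in the diagnostics above (three address refreshes, no store responses, no regions contacted, and a trailing `CosmosOperationCanceledException` node) can be checked programmatically. The following is a minimal Python sketch, not part of the SDK: the helper names `walk` and `summarize` are hypothetical, and the `sample` dictionary is a hand-trimmed stand-in that reuses only field names visible in Diagnostics-1.

```python
from typing import Any, Dict, Iterator


def walk(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield a trace node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def summarize(diagnostics: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the signals that characterize the cancellation scenario."""
    summary: Dict[str, Any] = {
        "cancelled": False,
        "address_refreshes": 0,
        "store_responses": 0,
        "regions_contacted": [],
    }
    for node in walk(diagnostics):
        if node.get("name") == "CosmosOperationCanceledException":
            summary["cancelled"] = True
        # Only transport-level nodes carry "Client Side Request Stats".
        stats = node.get("data", {}).get("Client Side Request Stats")
        if stats:
            summary["address_refreshes"] += len(
                stats.get("AddressResolutionStatistics", []))
            summary["store_responses"] += len(
                stats.get("StoreResponseStatistics", []))
            summary["regions_contacted"].extend(
                stats.get("RegionsContacted", []))
    return summary


# Hand-trimmed stand-in for Diagnostics-1 (assumption: the real payload is
# loaded with json.loads and has the same nesting).
sample = {
    "name": "CreateItemAsync",
    "duration in milliseconds": 5012.8192,
    "children": [
        {
            "name": "Microsoft.Azure.Documents.ServerStoreModel Transport Request",
            "data": {
                "Client Side Request Stats": {
                    "RegionsContacted": [],
                    "AddressResolutionStatistics": [{}, {}, {}],
                    "StoreResponseStatistics": [],
                }
            },
        },
        {"name": "CosmosOperationCanceledException", "data": {}},
    ],
}

result = summarize(sample)
print(result)
```

A summary with `cancelled` set, a non-zero `address_refreshes` count, and zero `store_responses` suggests the request spent its whole cancellation budget refreshing addresses without ever receiving a response from a replica, which is the situation this issue describes.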

Acceptance Criteria:

  • The SDK should apply the partition-level regional override for the faulty partition.
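To make the difference between the observed and the requested behavior concrete, here is a small Python simulation. It is only a sketch under stated assumptions and does not reflect the SDK's actual implementation or the linked fix: `PartitionFailoverTable`, `write`, and the `override_on_cancel` flag are hypothetical names, and the "next preferred region" selection rule is an assumption. The region list and partition key range id `0` are taken from the diagnostics above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class PartitionFailoverTable:
    """Hypothetical stand-in for per-partition region overrides."""
    preferred_regions: List[str]
    overrides: Dict[str, str] = field(default_factory=dict)

    def region_for(self, partition_id: str) -> str:
        # Without an override, requests go to the first preferred region.
        return self.overrides.get(partition_id, self.preferred_regions[0])

    def mark_failed(self, partition_id: str) -> None:
        """Route the partition to the next preferred region, if one exists."""
        current = self.region_for(partition_id)
        idx = self.preferred_regions.index(current)
        if idx + 1 < len(self.preferred_regions):
            self.overrides[partition_id] = self.preferred_regions[idx + 1]


def write(table: PartitionFailoverTable, partition_id: str,
          unhealthy: Set[Tuple[str, str]], override_on_cancel: bool) -> str:
    """Simulate one write; an unhealthy target always ends in cancellation."""
    region = table.region_for(partition_id)
    if (region, partition_id) not in unhealthy:
        return f"ok:{region}"
    # The retries used up the caller's cancellation budget. The requested
    # behavior is to record the failover before surfacing the cancellation.
    if override_on_cancel:
        table.mark_failed(partition_id)
    return f"cancelled:{region}"


regions = ["North Central US", "Central US", "West US 2"]
unhealthy = {("North Central US", "0")}

# Observed behavior: no override is recorded, so the next write fails too.
observed_table = PartitionFailoverTable(regions)
observed = [write(observed_table, "0", unhealthy, False) for _ in range(2)]

# Requested behavior: the override is recorded when cancellation fires.
requested_table = PartitionFailoverTable(regions)
requested = [write(requested_table, "0", unhealthy, True) for _ in range(2)]

print(observed)
print(requested)
```

In the observed case both writes are cancelled against the same region; in the requested case the first write is still cancelled, but the second one is routed to the next preferred region for that partition only.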
Projects
Status: In Progress