Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: Assert XdsNR's cluster ref counting is consistent #11943

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ejona86
Copy link
Member

@ejona86 ejona86 commented Mar 5, 2025

It is much harder to debug refcounting problems when we ignore impossible situations. So make such impossible cases complain loudly so the bug is obvious.

It is much harder to debug refcounting problems when we ignore
impossible situations. So make such impossible cases complain loudly so
the bug is obvious.
@ejona86 ejona86 requested a review from larry-safran March 5, 2025 21:32
if (clusterRefs.get(cluster).refCount.get() == 0) {
clusterRefs.remove(cluster);
updateResolutionResult();
if (clusterRefs.get(cluster).refCount.get() != 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is being added to the end of the queue, it is possible that another update has come in restoring use of the cluster before this runs, in which case doing nothing is appropriate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because there's a refcount and the clusterRefs map, the threading simply breaks down if the refcount goes from 0 → 1. That's why we already don't use an atomic increment, and instead use compareAndSet to increase the value:

do {
count = refCount.get();
if (count == 0) {
return false;
}
} while (!refCount.compareAndSet(count, count + 1));

1 → 0 is a one-way state transition, and the only option is to throw away the object.

In my debugging, I was worried that there could have been a bug in the reference counting that left things in a corrupted state. I don't think there is any such bug, but I'd much rather see an exception than silent state corruption.

@ejona86 ejona86 requested a review from larry-safran March 6, 2025 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants