Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving on with Optimized Privilege Evaluation for 3.0 release #5181

Open
nibix opened this issue Mar 17, 2025 · 12 comments
Open

Moving on with Optimized Privilege Evaluation for 3.0 release #5181

nibix opened this issue Mar 17, 2025 · 12 comments
Labels
triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.

Comments

@nibix
Copy link
Collaborator

nibix commented Mar 17, 2025

I am filing this issue because we still need a decision on how to move on with the Optimized Privilege Evaluation (#4380) for 3.0 release.

See the PR discussion at #4898 for further context.

Personally, I would strongly recommend not to have it wrapped in a feature flag in the 3.0 release.

However: If we go this route, we also need to adjust the a property which controls if compat DLS/FLS headers are sent to other nodes. This is important for mixed clusters. See the note at the top of #4898 and in this comment: #4898 (review)

If we do not go this route, we urgently need to build alternative ways to handle this.

So, how shall we move on here?

See also #5013

@cwperks @krishna-ggk @kumargu @shikharj05

@github-actions github-actions bot added the untriaged Require the attention of the repository maintainers and may need to be prioritized label Mar 17, 2025
@cwperks cwperks added triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Mar 17, 2025
@cwperks
Copy link
Member

cwperks commented Mar 17, 2025

[Triage] Thank you for filing this issue. Also tagging the repo maintainers for input @peternied @willyborankin @reta @DarshitChanpura @derek-ho @RyanL1997

@peternied
Copy link
Member

@nibix Thanks for writing this up. What is the risk to OpenSearch customers for one path vs the other? At its face, this seems like this is the time to making a breaking API on the eve of v3.0.0.

@kumargu
Copy link

kumargu commented Mar 17, 2025

@nibix Unfortunately, AWS hasn't yet release 2.19 version for its managed service customers, so AWS don't actually have enough data to make the decision that the change hasn't had regressions in real production env (that was major motovation of the flag). Carrying the change in 3.0 without the flag cannot be considered safe.

Initially, I was alinged to get rid of the flag in 3.0, for the complexity you mentioned. But unfortunately due to nature of the delay, I would vote to keep the flag in 3.0 (hence 3.x)

@nibix
Copy link
Collaborator Author

nibix commented Mar 18, 2025

@peternied

Thanks for writing this up. What is the risk to OpenSearch customers for one path vs the other? At its face, this seems like this is the time to making a breaking API on the eve of v3.0.0.

We actually have no significant breaking change that is visible to the user in this. The only user visible breaking change is the discontinuation of support of an undocumented config option described in #4495

@nibix
Copy link
Collaborator Author

nibix commented Mar 18, 2025

@kumargu

While I understand your assessment of the risks, I need to object to the absolute way you are putting this:

Carrying the change in 3.0 without the flag cannot be considered safe.

Yes, there are risks. Like many changes to the security plugin have. Yes, this is a big change that is bigger than many changes to the security plugin. Still, many small changes have also the potential to carry big risks, even if they seem trivial on the first sight. In the end, we are talking about a plugin which creates security critical code. Must we then consider most of the changes to it as "not safe" then?

Please do not understand me wrong, I am understanding about your concerns and open to discuss options.

@nibix
Copy link
Collaborator Author

nibix commented Mar 18, 2025

@peternied

What is the risk to OpenSearch customers for one path vs the other?

So, the two paths are these:

  1. Discontinue the sending of DLS/FLS headers by default (for performance reasons). Compatibility code will check the version of the peer node. If the peer node is still on an OpenSearch version that does require the presence of DLS/FLS headers and the sent message requires it, the compatibility code will add the DLS/FLS headers to the message.

  2. Have a feature flag which decides which implementation is used (So, this is not about whether DLS/FLS headers shall be sent or not, but rather which implementation of PrivilegeEvaluator, DlsFlsValve, and all other DLS/FLS supporting code is used). This brings the challenge how to toggle this feature flag in a way that there is no time window where nodes disagree about the implementation. If there would be a time window where nodes disagree about the implementation, then data leaks could occur because DLS/FLS is ineffective. This challenge is yet unsolved. Option 1 has the nice property that this can be told from the node version stored in the Connection object. Option 2 does no longer have that property that could help us in this regard, as the version cannot be longer used to tell what implementation a node is running. Another dimension to this is that the code base will need to carry two large implementations of privilege evaluation and DLS/FLS (estimate 8000 LOC per implementation). There must be scrutiny that any change to that functionality is applied to both implementations to address the risk that contributors are only aware of one implementation. Additional side effects are: higher maintenance cost of the code and higher turnaround times due to increased CI run time (as both implementations need to be tested).

Common to both paths is the risk of unknowns.

Option 1 is the originally designed method, which is actually already merged to the main branch. The only necessary change is the adaption of the release version of the feature.

Option 2 was requested from AWS managed service side to make roll backs easier in case of unforeseen issues.

@kumargu
Copy link

kumargu commented Mar 18, 2025

Must we then consider most of the changes to it as "not safe" then?

No.

The concern with this change is purely around the "very" large nature of it. We cannot introduce a safety flag for every other change, it will make the software complex. Safety around most of the other changes are covered by unit, functional, integration tests. Not saying we don't have sufficient coverage--but the larger the nature of change, harder its for humans to reason if all aspects of it are test covered. Just to give you additional context: this change has been bit delayed in AWS 2.19 because of a push-back to do an security-penetration-testing.

I think we all agreed why we needed the flag and that's why you did all the extra work to support the flag. My point here is: we haven't been able to exercise the code in production, so how do we decide to remove the flag?

@kumargu
Copy link

kumargu commented Mar 18, 2025

@nibix given the situation, I would like to hear your point of view on preserving the flag. We can then take votes from others and talk about the path forward.

@nibix
Copy link
Collaborator Author

nibix commented Mar 18, 2025

@kumargu

Of course, the situation is complex. In a classic setting, having the opportunity to roll out a big change at a major version release which goes through an extended alpha/beta/GA phase is of course a great opportunity for quality assurance of changes, as the major release with alpha/beta phase carries the message "this brings big changes, please everyone invest great scrutiny to verify it".

Given the current situation, I don't think that we have a precise decision yet on how to move on. It's not just about preserving the flag, but also about handling the DLS/FLS changes which are at the moment not part of the feature flag protected version of the code. We had the decision "let's collect data and based on that we decide on how to move on ".

I'll just go through some options we can decide about

  • revert the PR from the main branch and forward port the commit with the feature flag from the 2.x branch. This version won't have any DLS/FLS improvements. Thus, we would need to release the DLS/FLS improvements with a later minor version, missing the additional testing visibility of the major release.
  • revert the PR from the main branch, forward port the commit from the 2.x branch and add on top the DLS/FLS support. This would give us the testing opportunity but we'd need to rush to get a suitable concept and an implementation ready for 3.0.
  • utilize the alpha beta phase to collect data for the present implementation.

Certainly there are more options.

@kumargu
Copy link

kumargu commented Mar 18, 2025

thanks @nibix for the proposing the options.

I was also thinking of option 1 as the path forward (i.e bringing DLS/FLS in a later minor version). This is the simplest and safest (with the timelines of 3.0-beta).

The perf improvements we have seen just from the current optimisations in 2.x is already huge and a lot of big AWS customers are waiting for this change to be generally available. The point I am trying to make is: we might be good without the DLS/FLS changes for now. DLS/FLS in 3.1 will bring more improvements.

(the green lines are CPU utilization without the optimizations while the red line shows the CPU utilization with optimizations)

Image


Regarding option 3:

utilize the alpha beta phase to collect data for the present implementation.

At least I think its really hard to collect data from the community. I'd be frank here: the change is so effective but I haven't heard anyone giving us feedback on the effectiveness. I am pretty sure we will get a lot of good data points when this change is available in AWS (or other cloud providers).

@cwperks
Copy link
Member

cwperks commented Mar 18, 2025

The point I am trying to make is: we might be good without the DLS/FLS changes for now. DLS/FLS in 3.1 will bring more improvements.

Just wanted to bring up that there have been a number of instances where this change would help and its been hit both in clusters managed by AWS or users running self-managed clusters.

Prior to Optimized Privilege Evaluation, I tried a separate approach to optimize that reduced the data structure that was serialized in the ThreadContext headers but it was closed in favor of OPE: #4706

@kumargu
Copy link

kumargu commented Mar 18, 2025

Agreed that the DLS/FLS changes are super important too and holding them off unmerged is also painful. But again, I think it might be the simplest path forward.

Open for Option 2, if there are strong concerns of holding off the DLS/FLS changes. Note 3.0 Beta is going to be pushed for 2 weeks further opensearch-project/.github#300, so if we think we can make the changes with enough testing, I am open for Option 2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.
Projects
None yet
Development

No branches or pull requests

4 participants