-
Notifications
You must be signed in to change notification settings - Fork 300
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Moving on with Optimized Privilege Evaluation for 3.0 release #5181
Comments
[Triage] Thank you for filing this issue. Also tagging the repo maintainers for input @peternied @willyborankin @reta @DarshitChanpura @derek-ho @RyanL1997 |
@nibix Thanks for writing this up. What is the risk to OpenSearch customers for one path vs the other? At its face, this seems like this is the time to making a breaking API on the eve of v3.0.0. |
@nibix Unfortunately, AWS hasn't yet release 2.19 version for its managed service customers, so AWS don't actually have enough data to make the decision that the change hasn't had regressions in real production env (that was major motovation of the flag). Carrying the change in 3.0 without the flag cannot be considered safe. Initially, I was alinged to get rid of the flag in 3.0, for the complexity you mentioned. But unfortunately due to nature of the delay, I would vote to keep the flag in 3.0 (hence 3.x) |
We actually have no significant breaking change that is visible to the user in this. The only user visible breaking change is the discontinuation of support of an undocumented config option described in #4495 |
While I understand your assessment of the risks, I need to object to the absolute way you are putting this:
Yes, there are risks. Like many changes to the security plugin have. Yes, this is a big change that is bigger than many changes to the security plugin. Still, many small changes have also the potential to carry big risks, even if they seem trivial on the first sight. In the end, we are talking about a plugin which creates security critical code. Must we then consider most of the changes to it as "not safe" then? Please do not understand me wrong, I am understanding about your concerns and open to discuss options. |
So, the two paths are these:
Common to both paths is the risk of unknowns. Option 1 is the originally designed method, which is actually already merged to the Option 2 was requested from AWS managed service side to make roll backs easier in case of unforeseen issues. |
No. The concern with this change is purely around the "very" large nature of it. We cannot introduce a safety flag for every other change, it will make the software complex. Safety around most of the other changes are covered by unit, functional, integration tests. Not saying we don't have sufficient coverage--but the larger the nature of change, harder its for humans to reason if all aspects of it are test covered. Just to give you additional context: this change has been bit delayed in AWS 2.19 because of a push-back to do an security-penetration-testing. I think we all agreed why we needed the flag and that's why you did all the extra work to support the flag. My point here is: we haven't been able to exercise the code in production, so how do we decide to remove the flag? |
@nibix given the situation, I would like to hear your point of view on preserving the flag. We can then take votes from others and talk about the path forward. |
Of course, the situation is complex. In a classic setting, having the opportunity to roll out a big change at a major version release which goes through an extended alpha/beta/GA phase is of course a great opportunity for quality assurance of changes, as the major release with alpha/beta phase carries the message "this brings big changes, please everyone invest great scrutiny to verify it". Given the current situation, I don't think that we have a precise decision yet on how to move on. It's not just about preserving the flag, but also about handling the DLS/FLS changes which are at the moment not part of the feature flag protected version of the code. We had the decision "let's collect data and based on that we decide on how to move on ". I'll just go through some options we can decide about
Certainly there are more options. |
thanks @nibix for the proposing the options. I was also thinking of option 1 as the path forward (i.e bringing DLS/FLS in a later minor version). This is the simplest and safest (with the timelines of 3.0-beta). The perf improvements we have seen just from the current optimisations in 2.x is already huge and a lot of big AWS customers are waiting for this change to be generally available. The point I am trying to make is: we might be good without the DLS/FLS changes for now. DLS/FLS in 3.1 will bring more improvements. (the green lines are CPU utilization without the optimizations while the red line shows the CPU utilization with optimizations) Regarding option 3:
At least I think its really hard to collect data from the community. I'd be frank here: the change is so effective but I haven't heard anyone giving us feedback on the effectiveness. I am pretty sure we will get a lot of good data points when this change is available in AWS (or other cloud providers). |
Just wanted to bring up that there have been a number of instances where this change would help and its been hit both in clusters managed by AWS or users running self-managed clusters. Prior to Optimized Privilege Evaluation, I tried a separate approach to optimize that reduced the data structure that was serialized in the ThreadContext headers but it was closed in favor of OPE: #4706 |
Agreed that the DLS/FLS changes are super important too and holding them off unmerged is also painful. But again, I think it might be the simplest path forward. Open for Option 2, if there are strong concerns of holding off the DLS/FLS changes. Note 3.0 Beta is going to be pushed for 2 weeks further opensearch-project/.github#300, so if we think we can make the changes with enough testing, I am open for Option 2. |
I am filing this issue because we still need a decision on how to move on with the Optimized Privilege Evaluation (#4380) for 3.0 release.
See the PR discussion at #4898 for further context.
Personally, I would strongly recommend not to have it wrapped in a feature flag in the 3.0 release.
However: If we go this route, we also need to adjust the a property which controls if compat DLS/FLS headers are sent to other nodes. This is important for mixed clusters. See the note at the top of #4898 and in this comment: #4898 (review)
If we do not go this route, we urgently need to build alternative ways to handle this.
So, how shall we move on here?
See also #5013
@cwperks @krishna-ggk @kumargu @shikharj05
The text was updated successfully, but these errors were encountered: