Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable custom serialization #3789

Conversation

parasjain1
Copy link
Contributor

@parasjain1 parasjain1 commented Dec 4, 2023

Description

Disables custom serialization for the 2.11.1 patch release.

Needs to be backported to 2.x.

Below is the approach.

  • For requests sent from the 2.11.1 nodes to -
    • Case 1 : a node running OS version < 2.11
      • always use JDK to serialize the headers [this check was already present]
    • Case 2: a node running OS 2.11.0
      • serialize via custom serialization
    • Case 3: a node running OS >= 2.11.1
      • serialize via JDK serialization
  • For requests received on the 2.11.1 nodes from -
    • Case 1 : a node running OS version < 2.11
      • always use JDK to deserialize the headers [this check was already present]
    • Case 2: a node running OS 2.11.0
      • deserialize via custom serialization
    • Case 3: a node running OS >= 2.11.1
      • deserialize via JDK serialization

Related Issues

Testing

BWCs should cover most of the scenarios, while still carry out a manual cluster upgrade test.

Copy link

@backslasht backslasht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add unit tests for the same?

@parasjain1 parasjain1 requested a review from backslasht December 4, 2023 11:34
@parasjain1 parasjain1 force-pushed the custom-serde-configurable branch from 2577292 to e8918ab Compare December 4, 2023 12:15
@parasjain1
Copy link
Contributor Author

Can you add unit tests for the same?

Updated UTs for SecurityInterceptor and SecuritySSLRequestHandler

@parasjain1 parasjain1 force-pushed the custom-serde-configurable branch from e8918ab to b56f621 Compare December 5, 2023 06:51
@parasjain1
Copy link
Contributor Author

Updated the PR. We will be keeping custom serialization disabled for the 2.11.1 patch release and re-enable later after the regression is fixed.

@parasjain1 parasjain1 changed the title adds ability to enable/disable custom serialization disables custom serialization for 2.11.1 patch release Dec 5, 2023
@parasjain1 parasjain1 force-pushed the custom-serde-configurable branch 5 times, most recently from 88ef6d0 to f8d3709 Compare December 5, 2023 07:37
@parasjain1 parasjain1 force-pushed the custom-serde-configurable branch from f8d3709 to 9a25624 Compare December 5, 2023 08:35
Copy link
Member

@peternied peternied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments, thanks for this contribution

Copy link
Member

@peternied peternied left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like test cases are failing due to serialization issues, please look into the failures on the integrationTest test-citests, and test dlicDlsflsTest

  2> java.lang.AssertionError: 
    Expected: Successful multi get response
         but: "Get an item from index: ""first-test-index" failed: "OpenSearch exception [type=exception, reason=java.lang.IllegalArgumentException: -84 is not a valid id]"
        at __randomizedtesting.SeedInfo.seed([52523052604917F5:E52F053CD1B7F4D]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
        at org.opensearch.security.FlsAndFieldMaskingTests.multiGetDocuments(FlsAndFieldMaskingTests.java:722)
  2> java.lang.AssertionError: 
    Expected: Successful multi search response
         but: "Get an item failed: ""OpenSearch exception [type=exception, reason=java.lang.IllegalArgumentException: -84 is not a valid id]"
        at __randomizedtesting.SeedInfo.seed([52523052604917F5:ECF7F2CDE87C8BDC]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:6)
        at org.opensearch.security.FlsAndFieldMaskingTests.multiSearchDocuments(FlsAndFieldMaskingTests.java:769)

@parasjain1 parasjain1 force-pushed the custom-serde-configurable branch from 9a25624 to 1d9bcee Compare December 7, 2023 06:26
@parasjain1
Copy link
Contributor Author

A change was needed at SecurityFilter:187, from -

if (threadContext.getTransient(ConfigConstants.USE_JDK_SERIALIZATION) == null) {
                threadContext.putTransient(ConfigConstants.USE_JDK_SERIALIZATION, false);
}

to -

if (threadContext.getTransient(ConfigConstants.USE_JDK_SERIALIZATION) == null) {
                threadContext.putTransient(ConfigConstants.USE_JDK_SERIALIZATION, true);
}

as we now are defaulting to JDK serialization instead of custom.

Have amended the commit to include this change. Will monitor the CI checks.

@parasjain1
Copy link
Contributor Author

parasjain1 commented Dec 7, 2023

Integration tests and unit tests have passed now.

FYI backward compatibility tests are expected to fail as they're running against v2.12.0 vs v3.0.0 where-in both OS versions use custom serialization, but our change assumes that any version > v2.11.0 will be using JDK serialization for now.

Not sure how can we run the suite with previous version 2.11.1. The 2.x branch is bringing in 2.12 already.

Copy link

codecov bot commented Dec 7, 2023

Codecov Report

Merging #3789 (b7b8aa4) into main (6b8a3e4) will decrease coverage by 0.01%.
Report is 18 commits behind head on main.
The diff coverage is 93.47%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3789      +/-   ##
==========================================
- Coverage   65.26%   65.25%   -0.01%     
==========================================
  Files         297      298       +1     
  Lines       21132    21165      +33     
  Branches     3452     3455       +3     
==========================================
+ Hits        13791    13811      +20     
- Misses       5644     5656      +12     
- Partials     1697     1698       +1     
Files Coverage Δ
...nsearch/security/compliance/FieldReadCallback.java 55.55% <ø> (ø)
...org/opensearch/security/filter/SecurityFilter.java 66.19% <100.00%> (ø)
...urity/ssl/transport/SecuritySSLRequestHandler.java 70.00% <100.00%> (+7.77%) ⬆️
.../org/opensearch/security/support/Base64Helper.java 94.44% <100.00%> (+3.53%) ⬆️
...search/security/transport/SecurityInterceptor.java 73.61% <100.00%> (ø)
...org/opensearch/security/support/JsonFlattener.java 88.46% <88.46%> (ø)

... and 7 files with indirect coverage changes

cwperks
cwperks previously approved these changes Dec 7, 2023
@cwperks
Copy link
Member

cwperks commented Dec 7, 2023

Not sure how can we run the suite with previous version 2.11.1. The 2.x branch is bringing in 2.12 already.

@parasjain1 One way we can fix the checks is by first merging this change into 2.x, running the CI again and then merging this into main.

@parasjain1
Copy link
Contributor Author

One way we can fix the checks is by first merging this change into 2.x, running the CI again and then merging this into main.

Got it. Should I modify this PR to set the destination branch to 2.x ?

@cwperks
Copy link
Member

cwperks commented Dec 8, 2023

Got it. Should I modify this PR to set the destination branch to 2.x

Can you keep this PR open and open another manual backport against 2.x?

@peternied
Copy link
Member

@parasjain1 Please look into the other failures all the BWC are failing - and it looks like there are spotless code style changes needed

Signed-off-by: Paras Jain <[email protected]>
@parasjain1 parasjain1 force-pushed the custom-serde-configurable branch from af2b517 to b7b8aa4 Compare December 12, 2023 15:20
@parasjain1
Copy link
Contributor Author

@parasjain1 Please look into the other failures all the BWC are failing - and it looks like there are spotless code style changes needed

As mentioned above, BWCs are expected to fail. @cwperks suggested to open a Backport to 2.x PR which can be merged first. Working on that now.

@parasjain1
Copy link
Contributor Author

Have fixed the spotless scan failures.

One of the plugin install tests are failing due to the following error which looks unrelated to the code change.

 > Could not resolve org.opensearch.client:opensearch-rest-high-level-client:3.0.0-SNAPSHOT.
     Required by:
         project :
      > Could not resolve org.opensearch.client:opensearch-rest-high-level-client:3.0.0-SNAPSHOT.
         > Unable to load Maven meta-data from https://d1nvenhzbhpy0q.cloudfront.net/snapshots/lucene/org/opensearch/client/opensearch-rest-high-level-client/3.0.0-SNAPSHOT/maven-metadata.xml.
            > Could not GET 'https://d1nvenhzbhpy0q.cloudfront.net/snapshots/lucene/org/opensearch/client/opensearch-rest-high-level-client/3.0.0-SNAPSHOT/maven-metadata.xml'. Received status code 403 from server: Forbidden

@parasjain1
Copy link
Contributor Author

All other CI tests except plugin install (mentioned in the above comment) and BWCs are passing.

@parasjain1
Copy link
Contributor Author

Tried running BWCs on the backport PR #3826 by setting the previous branch to 2.11, but it's building OS version 2.11.1 instead of 2.11.0 leading to BWC failure yet again.

I'm highlighting below the steps that we need to carry out in the mentioned order to successfully test for backward compatibility -

  • Setup explicit CI BWC test for 2.11.0 to 2.11.1. This does not seem to be possible out of the box per how the BWC test suite and CI is setup.
  • Raise and merge PR against 2.11 - this will test BWC from 2.10.x to 2.11.1
  • Re-run BWCs on the backport PR [Backport 2.x] Disables custom serialization #3826 - this will test BWC from 2.11.1 to 2.12.0 as I've added an additional commit to temporarily set the previous branch to 2.11 in ci.yml. We'll need to remove this commit and re-run with previous branch 2.10 before final merge.
  • Re-run BWCs on this PR and merge.

@peternied
Copy link
Member

@parasjain1 This change has be lingering and there is still complexity around targeted validation. I'm starting to thing that we should fix the root cause rather than disable the feature. What is preventing us from fixing the root issue?

@parasjain1
Copy link
Contributor Author

@parasjain1 This change has be lingering and there is still complexity around targeted validation. I'm starting to thing that we should fix the root cause rather than disable the feature. What is preventing us from fixing the root issue?

To reflect on the root cause, JDK de-dupes object references using back references for the objects that it has already written on the stream, while OpenSearch's StreamInput/StreamOutput implementations do not offer such a support.

The ideal fix should come as an optimization in how OpenSearch's StreamInput / StreamOutput implementations (de)serialize the Java Map and Collections. This will take some time to research and come up with a proper solution. While this happens, we need to decide whether to keep custom serde enabled in security - keeping it enabled will negatively impact users with heavy usage of DLS/FLS.

Requesting @Bukhtawar / @backslasht to comment with their opinions regarding this.

@backslasht
Copy link

I agree with @parasjain1, given that it is negatively impacting the performance to an extent that the cluster is unusable we should disable it for the time being till the fix is available. I would have preferred to go with the fix if it is straight forward, but considering the complexities involved, it would take few weeks to a month to vet out a clean workable solution and we shouldn't be having this regression for so long.

@stephen-crawford
Copy link
Contributor

Update 1/12/2024: For the time being, a majority of maintainers have voted in favor of keeping the feature as is (not disabling it). The fact that the feature does not work does not represent a good reason to remove it from the code. We have limited anecdotes about this causing issues and so we do not want to set a precedent of removing features which have bugs. We of course hope that this can be fixed but in the meantime will continue to ship with it as is.

I am going to close this pull request given this decision but feel free to re-open it/keep discussing on it. This decision was reached by the maintainers as a group but does not mean it is set in stone.

@krishna-ggk
Copy link
Contributor

Just wanted to throw in couple thoughts.

It sounds like the rabbit is now out of the hat - but is it still possible to introduce a feature flag and control it? That typically gives us best mechanism to bake & experiment and roll out post confidence.

Could we do something like introduce feature flag in addition to above logic? That should atleast provide a workaround for users who run into this until we have the appropriate long term fix?

@mgodwan
Copy link
Member

mgodwan commented Feb 8, 2024

Sorry for chiming in late on this.

is it still possible to introduce a feature flag and control it?

I agree with this approach adding more checks than version. I think it will also help to control the rollout of this and drive faster development to actually gain more confidence, while helping the customers/operators who may get impacted due to this.

The current implementation of feature flags is generally confined to node level, we can modify the logic to rely on the node attributes which are available amongst all the node and can help any node sending a request to take decision whether it should use the custom serialization or jdk one. The value of the node attribute can be populated based on the feature flag. This should help us control the roll-out across versions if needed as well.

@krishna-ggk @peternied @scrawfor99 Let me know your thoughts if this is something we can pursue for solving the problem at hand.

@peternied
Copy link
Member

@mgodwan @krishna-ggk Thank you for the thoughts and participation in this discussion. The intention behind the serialization changes were to improve performance for every node-to-node request - a huge success.

We know every customer’s needs are important, but we have to think about what’s best for everyone. Right now, turning off this feature would slow things down for a many customers, and we don’t have a good reason to do that yet.

We’re committed to providing the best balance for all our customers and have carefully considered this decision. At this time, we will not be pursuing further changes or discussions on feature flag for serialization.

Please keep me informed if there are any pull requests to address the root causes and I'll be glad to give them my attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants