Ensure large payload can be serialized and sent over comms #8507

Merged
fjetter merged 7 commits into dask:main on Feb 16, 2024

Conversation

@fjetter (Member) commented Feb 15, 2024

This PR includes two changes that aim to support the submission of very large graphs:

Closes #8257

  1. It removes the deprecated msgpack options max_*_len. According to https://msgpack-python.readthedocs.io/en/latest/api.html#msgpack.unpackb, these limits exist to protect against malicious input and are now automatically set to the payload size. Not setting them ourselves effectively removes any artificial limits, as far as I understand (see the sketch right after this list).
  2. It reverts "Tidying of OpenSSL 1.0.2/Python 3.9 (and earlier) handling" (#5854), since in my testing this threshold still causes issues. The only exception I see is a broken pipe here: https://github.com/fjetter/distributed/blob/9721a1e83fb6ebb2c486d7226b290d6f7d6e7ec8/distributed/comm/tcp.py#L366 but I haven't tried chasing this down. I was running on Python 3.10, so at the very least this version check is not sufficient, or maybe the CPython bug is not fixed yet. Maybe @jakirkham has some insight.
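
A minimal, illustrative sketch of what dropping the max_*_len options means; this is not the distributed code, the message shape and payload size are chosen for illustration, and actually running it allocates more than 2 GiB of memory:

import msgpack

# Embed a large bytes payload directly in the message, roughly as a very large
# graph would be embedded.
payload = b"0" * (2**31 + 1)  # just over 2 GiB
packed = msgpack.packb({"op": "echo", "x": payload})

# With an explicit (old-style) limit, unpacking a bin object this large fails
# with an error along the lines of "... exceeds max_bin_len":
# msgpack.unpackb(packed, max_bin_len=2**31 - 1)

# Without the deprecated max_*_len options, msgpack derives its limits from the
# size of the input buffer, so the round-trip succeeds:
assert msgpack.unpackb(packed)["x"] == payload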

@fjetter changed the title from "Msgpack maxlen" to "Ensure large payload can be serialized and sent over comms" on Feb 15, 2024
@hendrikmakait (Member) left a comment

Thanks, @fjetter! LGTM, assuming there are no surprises on CI.

github-actions bot (Contributor) commented Feb 15, 2024

Unit Test Results

See test report for an extended history of previous test failures. This is useful for diagnosing flaky tests.

27 files ±0 · 27 suites ±0 · 10h 11m 0s ⏱️ +58s
3 977 tests +4: 3 867 ✅ +5, 109 💤 ±0, 1 ❌ -1
50 009 runs +52: 47 719 ✅ +55, 2 289 💤 -2, 1 ❌ -1

For more details on these failures, see this check.

Results for commit a7b43d9. ± Comparison against base commit 72f297a.

♻️ This comment has been updated with latest results.

@crusaderky (Collaborator) commented Feb 15, 2024

The new test test_large_payload is flaky:
https://github.com/dask/distributed/actions/runs/7914603160/job/21604675732?pr=8507

@hendrikmakait (Member)

test_large_payload takes 9s locally, so I'm not surprised that it would time out on CI. I don't see a good way of testing this behavior without allocating a large payload, so I'd suggest increasing the timeout for this test.
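
A hedged sketch of what bumping the timeout could look like, assuming the test is written with distributed's gen_test helper (which accepts a timeout argument); the test name and body here are simplified placeholders:

from distributed.utils_test import gen_test

@gen_test(timeout=180)  # generous limit: allocating and shipping a ~2 GiB payload is slow on CI
async def test_large_payload():
    # build the large payload and round-trip it over a comm, as in this PR's test
    ...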

@crusaderky (Collaborator)

I can see in my pytest log:

2024-02-15 14:42:44,613 - distributed.core - DEBUG - Message from 'tcp://10.5.0.2:45668': {'op': 'echo', 'x': b'000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

(continues for several pages)

@hendrikmakait (Member)

@crusaderky: There's already an issue for those long debug logs: #8465

@hendrikmakait (Member) left a comment

Thanks, @fjetter and @crusaderky! CI looks good now.

@crusaderky (Collaborator) commented Feb 15, 2024

FYI the previous test

await comm.write({"op": "echo", "x": to_serialize(data)})

was not reproducing the issue, because it was sending a tiny msgpack blob + a 2 GiB buffer.
I changed it to

await comm.write({"op": "echo", "x": data})

which mimics the current behaviour of c.submit and c.compute.

Regardless of the "proper" fix suggested above, this PR is nonetheless useful (timing aside) when you have 2 GiB of bufferless data, e.g. caused by an object-string dataframe.
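
A rough sketch of the difference described above; this is not the exact test code, and running it allocates roughly 2 GiB of memory:

from distributed.protocol import to_serialize

data = b"0" * (2**31 + 1)  # ~2 GiB payload

# Wrapped in to_serialize, the bytes travel as a separate frame next to a tiny
# msgpack header, so msgpack itself never sees a 2 GiB object:
wrapped_msg = {"op": "echo", "x": to_serialize(data)}

# Passed raw, the bytes are embedded inside the msgpack-encoded message itself,
# which is closer to how c.submit / c.compute currently transmit large graphs
# and is what exercised the size limits removed in this PR:
inline_msg = {"op": "echo", "x": data}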

@jakirkham (Member) commented Feb 16, 2024

  1. It reverts Tidying of OpenSSL 1.0.2/Python 3.9 (and earlier) handling #5854 since in my testing this threshold still causes issues. The only exception I see is a broken pipe here https://github.com/fjetter/distributed/blob/9721a1e83fb6ebb2c486d7226b290d6f7d6e7ec8/distributed/comm/tcp.py#L366 but I haven't tried chasing this down. I was running on python 3.10 so at the very least this version identifier is not sufficient / maybe the cpython bug is not fixed yet. maybe @jakirkham has some insight

It is certainly possible. This Python issue fixed in 3.11.0 sticks out to me: https://bugs.python.org/issue42854

To unpack that, SSL_read uses an int size whereas SSL_read_ex uses size_t. The same is true of SSL_write/SSL_write_ex. The *_ex variants were added in OpenSSL 1.1.1: openssl/openssl@4ee7d3f

Python 3.10 requires OpenSSL 1.1.1 (PEP 644) and makes use of the newer OpenSSL APIs, allowing more throughput: python/cpython#25468

However, they had a bug (the same one mentioned above) where they weren't consistently using size_t all the way through. It looks like they fixed that in Python 3.11: python/cpython#87020

It looks like the GH issue ( python/cpython#87020 ) covers the full history of these changes. Note that one user made the same observation about Python 3.10 to Florian: python/cpython#87020 (comment)

So maybe we can just bump the minimum Python version to 3.11.


Edit: Interesting, that fix was backported to Python 3.10 ( python/cpython#27308 ). You can see it in the Python 3.10rc1 changelog, so it would have been in the Python 3.10.0 final release.

Maybe it is worth asking exactly which Python version is being used (including the patch version, which OS, and where it came from)? Also, is it built with OpenSSL or a different SSL library, and which SSL library version?
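
For context, a rough, hypothetical sketch (not the actual distributed/comm/tcp.py code) of the kind of defensive chunking implied by the reverted threshold: capping each SSL write below the C int limit so that SSL_write never receives a length larger than an int can represent.

import ssl

C_INT_MAX = 2**31 - 1  # SSL_write historically takes an `int` length

def write_in_chunks(sock: ssl.SSLSocket, payload: bytes, limit: int = C_INT_MAX) -> None:
    """Send `payload` in slices of at most `limit` bytes each."""
    view = memoryview(payload)
    for start in range(0, len(view), limit):
        sock.sendall(view[start : start + limit])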

@fjetter (Member, Author) commented Feb 16, 2024

Thank you @jakirkham for your detailed response.

Maybe it is worth asking what Python version exactly (with patch version, which OS and from where) is being used? Also does it build with OpenSSL or a different SSL library? Also which SSL library version?

I'm having a hard time answering all of those questions. On my machine I'm using conda, and these are the relevant packages installed:

python                    3.10.13         h2469fbe_1_cpython    conda-forge
python_abi                3.10                    4_cp310    conda-forge
openssl                   3.2.1                h0d3ecfb_0    conda-forge
pyopenssl                 23.3.0             pyhd8ed1ab_0    conda-forge
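
One quick way to answer the SSL-library part of that question from the running interpreter itself (ssl.OPENSSL_VERSION reports the library Python was compiled against, which may differ from the conda openssl package listed above):

import ssl
import sys

print(sys.version)          # e.g. "3.10.13 ..."
print(ssl.OPENSSL_VERSION)  # e.g. "OpenSSL 3.2.1 ..."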

@fjetter merged commit 045dc64 into dask:main on Feb 16, 2024 (32 of 34 checks passed).
@fjetter deleted the msgpack_maxlen branch on February 16, 2024 at 09:09.
Successfully merging this pull request may close these issues:

exceeds max_bin_len from Distributed >= 2023.1 with LightGBM
4 participants