
Failure in distributed/diagnostics/tests/test_rmm_diagnostics.py::test_rmm_metrics #4

Closed
TomAugspurger opened this issue Feb 4, 2025 · 1 comment

TomAugspurger commented Feb 4, 2025

https://github.com/rapidsai/dask-upstream-testing/actions/runs/13142239873/job/36671965774#step:9:751 has an unexpected failure.

_______________________________ test_rmm_metrics _______________________________

c = <Client: No scheduler connected>
s = <Scheduler 'tcp://127.0.0.1:45841', workers: 0, cores: 0, tasks: 0>
workers = (<dask_cuda.cuda_worker.CUDAWorker object at 0x7fac03201e50>,)
w = <WorkerState 'tcp://127.0.0.1:39413', name: 0, status: closed, memory: 0, processing: 0>
@py_assert0 = 0, @py_assert4 = None

    @gen_cluster(
        client=True,
        nthreads=[("127.0.0.1", 1)],
        Worker=dask_cuda.CUDAWorker,
        worker_kwargs={
            "rmm_pool_size": parse_bytes("10MiB"),
            "rmm_track_allocations": True,
        },
    )
    async def test_rmm_metrics(c, s, *workers):
        w = list(s.workers.values())[0]
        assert "rmm" in w.metrics
        assert w.metrics["rmm"]["rmm-used"] == 0
        assert w.metrics["rmm"]["rmm-total"] == parse_bytes("10MiB")
        result = delayed(rmm.DeviceBuffer)(size=10)
        result = result.persist()
        await asyncio.sleep(1)
>       assert w.metrics["rmm"]["rmm-used"] != 0
E       assert 0 != 0

distributed/diagnostics/tests/test_rmm_diagnostics.py:36: AssertionError
----------------------------- Captured stderr call -----------------------------
2025-02-04 18:51:35,885 - distributed.scheduler - INFO - State start
2025-02-04 18:51:35,892 - distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:45841
2025-02-04 18:51:35,892 - distributed.scheduler - INFO -   dashboard at:  http://127.0.0.1:40321/status
2025-02-04 18:51:35,893 - distributed.scheduler - INFO - Registering Worker plugin shuffle
2025-02-04 18:51:36,504 - distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:37431'
2025-02-04 18:51:40,305 - distributed.preloading - INFO - Creating preload: dask_cuda.initialize
2025-02-04 18:51:40,305 - distributed.preloading - INFO - Import preload module: dask_cuda.initialize
2025-02-04 18:51:40,316 - distributed.preloading - INFO - Run preload setup: dask_cuda.initialize
2025-02-04 18:51:40,317 - distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:39413
2025-02-04 18:51:40,317 - distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:39413
2025-02-04 18:51:40,317 - distributed.worker - INFO -           Worker name:                          0
2025-02-04 18:51:40,317 - distributed.worker - INFO -          dashboard at:            127.0.0.1:43515
2025-02-04 18:51:40,317 - distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:45841
2025-02-04 18:51:40,317 - distributed.worker - INFO - -------------------------------------------------
2025-02-04 18:51:40,318 - distributed.worker - INFO -               Threads:                          1
2025-02-04 18:51:40,318 - distributed.worker - INFO -                Memory:                 503.77 GiB
2025-02-04 18:51:40,318 - distributed.worker - INFO -       Local Directory: /tmp/dask-scratch-space/worker-9wj1ulk_
2025-02-04 18:51:40,318 - distributed.worker - INFO - Starting Worker plugin CPUAffinity-28ec392a-beeb-467a-9d56-56f39cd3dace
2025-02-04 18:51:40,318 - distributed.worker - INFO - Starting Worker plugin PreImport-eee38003-29e5-48a9-8217-0e310bad4f93
2025-02-04 18:51:40,318 - distributed.worker - INFO - Starting Worker plugin CUDFSetup-595987ee-b46c-4ccd-9fa6-d4e1601346c9
2025-02-04 18:51:42,750 - distributed.worker - INFO - Starting Worker plugin RMMSetup-9c57beab-dd4b-4095-8dfc-6032596d55fc
2025-02-04 18:51:43,087 - distributed.worker - INFO - -------------------------------------------------
2025-02-04 18:51:43,096 - distributed.scheduler - INFO - Register worker addr: tcp://127.0.0.1:39413 name: 0
2025-02-04 18:51:43,098 - distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:39413
2025-02-04 18:51:43,098 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:53618
2025-02-04 18:51:43,098 - distributed.worker - INFO - Starting Worker plugin shuffle
2025-02-04 18:51:43,099 - distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:45841
2025-02-04 18:51:43,099 - distributed.worker - INFO - -------------------------------------------------
2025-02-04 18:51:43,100 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:45841
2025-02-04 18:51:43,141 - distributed.scheduler - INFO - Receive client connection: Client-13749a6f-e329-11ef-86b0-0242ac120002
2025-02-04 18:51:43,142 - distributed.core - INFO - Starting established connection to tcp://127.0.0.1:53620
2025-02-04 18:51:44,268 - distributed.scheduler - INFO - Remove client Client-13749a6f-e329-11ef-86b0-0242ac120002
2025-02-04 18:51:44,269 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:53620; closing.
2025-02-04 18:51:44,269 - distributed.scheduler - INFO - Remove client Client-13749a6f-e329-11ef-86b0-0242ac120002
2025-02-04 18:51:44,270 - distributed.scheduler - INFO - Close client connection: Client-13749a6f-e329-11ef-86b0-0242ac120002
2025-02-04 18:51:44,271 - distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:37431'. Reason: nanny-close
2025-02-04 18:51:44,272 - distributed.nanny - INFO - Nanny asking worker to close. Reason: nanny-close
2025-02-04 18:51:44,273 - distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:39413. Reason: nanny-close
2025-02-04 18:51:44,273 - distributed.worker - INFO - Removing Worker plugin CPUAffinity-28ec392a-beeb-467a-9d56-56f39cd3dace
2025-02-04 18:51:44,273 - distributed.worker - INFO - Removing Worker plugin PreImport-eee38003-29e5-48a9-8217-0e310bad4f93
2025-02-04 18:51:44,273 - distributed.worker - INFO - Removing Worker plugin CUDFSetup-595987ee-b46c-4ccd-9fa6-d4e1601346c9
2025-02-04 18:51:44,273 - distributed.worker - INFO - Removing Worker plugin RMMSetup-9c57beab-dd4b-4095-8dfc-6032596d55fc
2025-02-04 18:51:44,273 - distributed.worker - INFO - Removing Worker plugin shuffle
2025-02-04 18:51:44,275 - distributed.core - INFO - Connection to tcp://127.0.0.1:45841 has been closed.
2025-02-04 18:51:44,276 - distributed.core - INFO - Received 'close-stream' from tcp://127.0.0.1:53618; closing.
2025-02-04 18:51:44,277 - distributed.scheduler - INFO - Remove worker addr: tcp://127.0.0.1:39413 name: 0 (stimulus_id='handle-worker-cleanup-1738695104.27697')
2025-02-04 18:51:44,277 - distributed.scheduler - INFO - Lost all workers
2025-02-04 18:51:44,282 - distributed.nanny - INFO - Worker closed
2025-02-04 18:51:44,973 - distributed.nanny - INFO - Nanny at 'tcp://127.0.0.1:37431' closed.
2025-02-04 18:51:44,973 - distributed.scheduler - INFO - Closing scheduler. Reason: unknown
2025-02-04 18:51:44,974 - distributed.scheduler - INFO - Scheduler closing all comms

It also failed on the cuda==11.8.0 test: https://github.com/rapidsai/dask-upstream-testing/actions/runs/13142239873/job/36671965377#step:9:773

Looking into it.
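
For reference, assuming the symptom is that the worker heartbeat simply hasn't reported the RMM allocation to the scheduler by the time the assertion runs, a minimal sketch of a more robust wait than the fixed `await asyncio.sleep(1)` would be to poll the scheduler-side worker metrics. The `poll_rmm_used` helper below is hypothetical, not part of the test suite or of the actual fix:

```python
import asyncio


async def poll_rmm_used(worker_state, timeout=5.0, interval=0.1):
    """Wait until the worker's reported "rmm-used" metric is non-zero.

    Returns True if a non-zero value was observed before the timeout,
    False otherwise. `worker_state` is the scheduler-side WorkerState
    (e.g. `list(s.workers.values())[0]` inside a gen_cluster test).
    """
    deadline = asyncio.get_running_loop().time() + timeout
    while asyncio.get_running_loop().time() < deadline:
        if worker_state.metrics.get("rmm", {}).get("rmm-used", 0) != 0:
            return True
        await asyncio.sleep(interval)
    return False


# Inside the test body this would replace the fixed one-second sleep:
#     result = delayed(rmm.DeviceBuffer)(size=10)
#     result = result.persist()
#     assert await poll_rmm_used(w)
```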

TomAugspurger commented:
Closed by dask/distributed#9004. Passed in today's run.
