
Installing dask and distributed packages can be confusing #962

Closed
mrocklin opened this issue Mar 22, 2017 · 29 comments

@mrocklin
Member

Our installation docs recommend that people do the following

conda install dask distributed -c conda-forge

or

pip install dask[complete] distributed --upgrade

However, we shouldn't expect most people to do the due diligence of reading installation docs. We all tend to assume that conda install name-of-project works, and most of the time it does. Unfortunately, if you've heard that Dask does distributed computing, run conda install dask, and then try out any distributed example, you're likely to receive an import error, which makes for a bad first impression.

There are a few ways that we could resolve this problem:

  1. We could provide informative errors whenever someone tries to do a dask.distributed thing. These would point them to the installation docs. This wouldn't help if they just did import distributed, though I think that most of the public materials we produce at this point import from dask.distributed.
  2. We could switch out the conda package dask with a metapackage that includes both dask and distributed. This would be foolproof in the conda case, but would be a bit of an organizational hassle from a packaging perspective. We would rename the existing package to dask-core (or something similar) and then switch in the dask metapackage. We would have to do this on conda-forge at the same time.
  3. We could find some way within conda to have a cycle (dask includes distributed, distributed includes dask).
  4. Other suggestions?

I'm in favor of starting with option 1, though I would love to find a more thorough alternative.
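For concreteness, option 1 could be as small as a shim module inside dask that re-exports distributed when it is importable and otherwise fails with a pointer to the docs. A minimal sketch (the module path and the message wording are illustrative, not a final implementation):

# dask/distributed.py -- illustrative shim for option 1
_import_error_message = (
    "dask.distributed is not installed.\n\n"
    "Please either conda or pip install distributed:\n\n"
    "  conda install dask distributed\n"
    "  python -m pip install dask[distributed] --upgrade"
)

try:
    # Re-export the real package when it is available.
    from distributed import *
except ImportError as e:
    # Otherwise point the user at the installation instructions.
    raise ImportError(_import_error_message) from e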

cc @pzwang @ilanschnell

@pzwang

pzwang commented Mar 22, 2017

I'm OK with option 1 for now, although it's suboptimal. I think option 2, if you're up for the hassle, is the better situation long-term, especially since I anticipate that other optional things will eventually crop up on top of dask (e.g. a more complete graphical admin UI package, or various scheduler-compatibility libraries), which would be best to have depend on just dask-core.

@mrocklin
Member Author

Currently dask is just dask-core.

Another alternative is to make a metapackage called dask-complete or something and point people to this. This has the same flaw of "few people read the docs" but it is easier than conda install dask dask-foo dask-bar etc. and doesn't have the challenge of swapping a name mid-flight.
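For reference, a metapackage recipe can be tiny. A sketch of what a dask-complete recipe might look like (the name and version pins here are illustrative only):

# meta.yaml for a hypothetical dask-complete metapackage
package:
  name: dask-complete
  version: "0.14.1"

build:
  noarch: generic

requirements:
  run:
    - dask >=0.14.1
    - distributed >=1.16.1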

@ilanschnell

I don't have a strong opinion on which option is the best, as long as we don't have cyclic dependencies. This is actually what prompted this discussion, because I was asked to add distributed as a dependency to dask, but then I realized that distributed already depends on dask. So one could just tell people to conda install distributed, and they will get dask also.

@mrocklin
Member Author

Oh hey, it turns out that we already solve this problem on conda-forge. We just have dask/dask depend on the previous version of dask/distributed. We always leave some buffer room when versioning (each project works with the other one version behind), at least with micro-version updates.

So copying the conda-forge recipes here might solve the problem today.
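The relevant part of such a recipe is just the run requirement. Roughly (version number illustrative; the point is the one-version-behind buffer):

# excerpt from dask's meta.yaml
requirements:
  run:
    - distributed >=1.15.0    # previous release, so the solver has buffer room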

@pitrou
Member

pitrou commented Mar 23, 2017

Oh hey, it turns out that we already solve this problem on conda-forge. We just have dask/dask depend on the previous version of dask/distributed

Isn't that a bit confusing? If someone runs conda install dask, they will get some version of distributed, but not necessarily the latest one?

@mrocklin
Member Author

It's a >= dependency. I'm not sure what Conda's policy is here. Do they prefer newer versions to older ones if available?

@ilanschnell

Conda only prefers newer versions for a >= dependency when the dependency is not installed already and has to be installed fresh. Otherwise (when the dependency is already installed), conda will only update the dependency if it has to. In other words, if the installed dependency is old but still satisfies the >= constraint, conda will not do anything.

Even though the projects depend on older versions of each other, they still depend on each other, which is confusing. Also, the distributed project defines a bunch of dask-* commands. To me it seems like dask and distributed are so tightly coupled that they should really be a single project.

@mrocklin
Member Author

I understand the motivation here, but for development and dependency reasons they will likely remain separate projects. They develop at different rates, have different developer communities, and are depended on by different projects that sometimes only want dask, and not the distributed scheduler. I've had strong requests both to add and to remove the distributed package from dask's dependency list. On PyPI, dask actually has no dependencies at all for exactly this reason.

@mrocklin
Member Author

I've started a dask-core package on conda-forge here: conda-forge/staged-recipes#3820

@jzwinck

jzwinck commented Feb 22, 2018

Why is the Conda package for dask.distributed not called dask.distributed or dask-distributed (what I would expect to install, but which does not exist)? It's called distributed, which is super confusing, and there is nothing helpful in the output of conda info distributed -c conda-forge. Further, distributed does not express a dependency on dask. Am I even looking at the right package?

distributed 1.18.0 py35_0
-------------------------
file name   : distributed-1.18.0-py35_0.tar.bz2
name        : distributed
version     : 1.18.0
build string: py35_0
build number: 0
channel     : defaults
size        : 632 KB
arch        : x86_64
date        : 2017-08-16
license     : BSD 3-Clause
license_family: BSD
md5         : d0c3b75432a2037425478d76ba0870bc
noarch      : None
platform    : linux
url         : https://repo.continuum.io/pkgs/free/linux-64/distributed-1.18.0-py35_0.tar.bz2
dependencies:
    bokeh >=0.12.3
    click >=6.6
    cloudpickle >=0.2.2
    msgpack-python
    psutil
    python 3.5*
    six
    sortedcontainers
    tblib
    toolz >=0.7.4
    tornado >=4.4
    zict >=0.1.2

@mrocklin
Member Author

Probably mostly because of history, and now because of inertia.

@jeremiahOkai

Hey guys, just a quick question since I am new to dask. I followed all the instructions given on the official website to install the package, but I am getting this error: ModuleNotFoundError: No module named 'dask.dataframe'; 'dask' is not a package. Any help on how to resolve this issue?

@yort

yort commented Apr 15, 2018

I am experiencing the same issue as @yosopak2020, and have tried multiple versions of dask and multiple versions of Python (3.5, 3.6).
The exact error message for me is: "ModuleNotFoundError: No module named 'dask.dataframe'; 'dask' is not a package"
There is no issue importing other libraries (pandas or numpy); only dask appears to be affected.

@jeremiahOkai

jeremiahOkai commented Apr 15, 2018 via email

@TomAugspurger
Member

@yort how are you installing dask? If you're using pip, you need dask[complete] or dask[dataframe].
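For example (the extra is quoted so the shell doesn't expand the square brackets; which extras exist depends on your dask version):

python -m pip install "dask[dataframe]"   # core dask plus the dataframe dependencies
python -m pip install "dask[complete]"    # all optional dependencies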

@NatLun091238

NatLun091238 commented Apr 17, 2018

Hi! Despite using pip install dask[complete] distributed --upgrade for the installation I get

ImportError: Dask's distributed scheduler is not installed

when later trying

from dask.distributed import Client
client = Client()

How do I solve this issue?

@TomAugspurger
Member

What's the output of pip list? Did you restart your notebook / IPython after installing dask & distributed?

@NatLun091238

The output of

get_ipython().system('pip install dask[complete] distributed --upgrade')

is (download URLs, progress bars, and per-package uninstall messages trimmed):

Collecting dask[complete]
  Downloading dask-0.17.2-py2.py3-none-any.whl (582kB)
Collecting distributed
  Downloading distributed-1.21.6-py2.py3-none-any.whl (458kB)
Collecting numpy>=1.10.4; extra == "complete" (from dask[complete])
Collecting toolz>=0.7.3; extra == "complete" (from dask[complete])
Requirement already up-to-date: pandas>=0.19.0; extra == "complete" in /usr/local/envs/py2env/lib/python2.7/site-packages (from dask[complete])
Collecting cloudpickle>=0.2.1; extra == "complete" (from dask[complete])
Collecting partd>=0.3.8; extra == "complete" (from dask[complete])
Collecting tblib (from distributed)
Collecting six (from distributed)
Requirement already up-to-date: click>=6.6 in /usr/local/envs/py2env/lib/python2.7/site-packages (from distributed)
Collecting tornado>=4.5.1 (from distributed)
Collecting sortedcontainers (from distributed)
Requirement already up-to-date: singledispatch; python_version < "3.4" in /usr/local/envs/py2env/lib/python2.7/site-packages (from distributed)
Collecting psutil (from distributed)
Collecting zict>=0.1.3 (from distributed)
Collecting msgpack-python (from distributed)
Collecting futures; python_version < "3.0" (from distributed)
Collecting python-dateutil (from pandas>=0.19.0; extra == "complete"->dask[complete])
Collecting pytz>=2011k (from pandas>=0.19.0; extra == "complete"->dask[complete])
Collecting locket (from partd>=0.3.8; extra == "complete"->dask[complete])
Requirement already up-to-date: backports-abc>=0.4 in /usr/local/envs/py2env/lib/python2.7/site-packages (from tornado>=4.5.1->distributed)
Collecting heapdict (from zict>=0.1.3->distributed)
Installing collected packages: numpy, tblib, six, futures, tornado, cloudpickle, sortedcontainers, psutil, heapdict, zict, msgpack-python, toolz, distributed, locket, partd, dask, python-dateutil, pytz
Successfully installed cloudpickle-0.5.2 dask-0.17.2 distributed-1.21.6 futures-3.2.0 heapdict-1.0.0 locket-0.2.0 msgpack-python-0.5.6 numpy-1.14.2 partd-0.3.8 psutil-5.4.5 python-dateutil-2.7.2 pytz-2018.4 six-1.11.0 sortedcontainers-1.5.9 tblib-1.3.2 toolz-0.9.0 tornado-5.0.2 zict-0.1.3
You are using pip version 9.0.1, however version 10.0.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Restarting the notebook/IPython results in kernel death. Any help is very valuable.

@TomAugspurger
Member

TomAugspurger commented Apr 21, 2018 via email

@NatLun091238

NatLun091238 commented Apr 23, 2018

It seems like we found a rather stable solution which doesn't interfere with the existing datalab dask installation. We run an external script doing:
!pip install tornado==4.5.1 distributed==1.21 dask-ml[complete]
then restart the current server on GCP and verify the installation. After that, all packages needed for development are available. It's not the best way of solving the problem, but it works for us.

@mrocklin
Member Author

mrocklin commented Apr 23, 2018 via email

@yort

yort commented Apr 29, 2018

@TomAugspurger sorry, ignore me, total noob mistake on my part. I had a test file called dask.py which was being imported instead of the actual dask! Lesson learned!
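For anyone else hitting this, a quick generic check for this kind of shadowing is to ask Python where it found the module (nothing dask-specific here):

import dask
# A path inside your project (e.g. ./dask.py) rather than site-packages
# means a local file is shadowing the real package.
print(dask.__file__)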

@TomAugspurger
Member

TomAugspurger commented Apr 29, 2018 via email

@dnedry

dnedry commented Feb 27, 2020

@TomAugspurger sorry, ignore me, total noob mistake on my part. I had a test file called dask.py which was being imported instead of the actual dask! Lesson learned!

I just did the same thing; thanks to your post I realized it. :)

@subba2048

subba2048 commented Oct 22, 2020

Is there any update on how to solve this issue?
I used "pip install dask" and now I'm getting a module-not-found error for dask.distributed.

I tried:
pip install dask[complete] distributed --upgrade

which I don't think changed much, since I am still getting the module-not-found error.

@GenevieveBuckley
Contributor

@jsCoder020193 Hmm, that's odd. Are you sure that the pip you are using is installing into the same Python environment you are running?

I'm not able to reproduce the same problem following the install docs:

python -m pip install "dask[complete]"

or, if you only want distributed and the core parts of dask...

python -m pip install "dask[distributed]"
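A quick, generic way to confirm that pip and your interpreter agree on the environment:

python -c "import sys; print(sys.executable)"   # the interpreter actually in use
python -m pip --version                         # the pip bound to that interpreter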

@GenevieveBuckley
Contributor

Today, if you pip install dask into a new Python environment and try to access dask.distributed (e.g. from dask.distributed import Client), you get a nice, informative error message.

Between that, and the more detailed installation docs on the dask and distributed pages, I'm not sure there's much else to do here.

Details:
>>> from dask.distributed import Client
Traceback (most recent call last):
  File "/home/genevieve/anaconda3/envs/testing3/lib/python3.9/site-packages/dask/distributed.py", line 11, in <module>
    from distributed import *
ModuleNotFoundError: No module named 'distributed'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/genevieve/anaconda3/envs/testing3/lib/python3.9/site-packages/dask/distributed.py", line 14, in <module>
    raise ImportError(_import_error_message) from e
ImportError: dask.distributed is not installed.

Please either conda or pip install distributed:

  conda install dask distributed             # either conda install
  python -m pip install "dask[distributed]" --upgrade    # or pip install

@ovalerio

ovalerio commented Mar 5, 2025

Dear dask community,

It is 2025 and I am running into an ImportError after installing cudf: it pulls in dask_cudf, which looks for a dask.dataframe dependency and fails to find it. I installed cudf and dask from the rapidsai and conda-forge channels using conda:
conda install -c rapidsai -c conda-forge -c nvidia cudf cuml 'cuda-version=12.6'
And that installed the following packages:

$ conda list 'dask|cuml|cudf|distributed'
# packages in environment at ~/anaconda3/envs/rnntf2:
#
# Name                    Version                   Build  Channel
cudf                      24.12.00        cuda12_py312_241211_gff41ecf473_0    rapidsai
cuml                      24.12.00        cuda12_py312_241211_ge79cd670a_0    rapidsai
dask                      2024.11.2          pyhff2d567_1    conda-forge
dask-core                 2024.11.2          pyhff2d567_1    conda-forge
dask-cuda                 24.12.00        py312_241211_g3b3b356_0    rapidsai
dask-cudf                 24.12.00        cuda12_py312_241211_gff41ecf473_0    rapidsai
dask-expr                 1.1.19             pyhd8ed1ab_0    conda-forge
distributed               2024.11.2          pyhff2d567_1    conda-forge
distributed-ucxx          0.41.00         py3.12_241211_gd355f9c_0    rapidsai
libcudf                   24.12.00        cuda12_241211_gff41ecf473_0    rapidsai
libcuml                   24.12.00        cuda12_241211_ge79cd670a_0    rapidsai
libcumlprims              24.12.00        cuda12_241211_g8df6c7e_0    rapidsai
pylibcudf                 24.12.00        cuda12_py312_241211_gff41ecf473_0    rapidsai
raft-dask                 24.12.00        cuda12_py312_241211_geaf9cc72_0    rapidsai
rapids-dask-dependency    24.12.00                   py_0    rapidsai

My idea is to use dask on a Slurm-based HPC system, and I saw in the documentation that you recommend the dask-jobqueue package. I found two similarly named packages:

$ conda search -c conda-forge 'dask*jobqueue'
dask-gateway-server-jobqueue           0.9.0  py38h578d9bd_2  conda-forge         
dask-gateway-server-jobqueue           0.9.0  py39hf3d152e_2  conda-forge         
dask-gateway-server-jobqueue        2022.4.0      ha770c72_0  conda-forge         
dask-gateway-server-jobqueue        2022.6.1      ha770c72_0  conda-forge         
dask-gateway-server-jobqueue       2022.10.0      ha770c72_0  conda-forge         
dask-gateway-server-jobqueue        2023.1.0      ha770c72_0  conda-forge         
dask-gateway-server-jobqueue        2023.1.1      ha770c72_0  conda-forge         
dask-gateway-server-jobqueue        2023.9.0      ha770c72_0  conda-forge         
dask-gateway-server-jobqueue        2024.1.0      ha770c72_0  conda-forge         
dask-jobqueue                  0.8.0    pyhd8ed1ab_0  conda-forge         
dask-jobqueue                  0.8.1    pyhd8ed1ab_0  conda-forge         
dask-jobqueue                  0.8.2    pyhd8ed1ab_0  conda-forge         
dask-jobqueue                  0.8.5    pyhd8ed1ab_0  conda-forge         
dask-jobqueue                  0.9.0    pyhd8ed1ab_0  conda-forge         

What is the difference between them? Which one should I pick?

Thanks!

@jacobtomlinson
Member

@ovalerio I've moved your comment to a new issue #9019 as this one has been closed
