
LDCache does not include libcuda.so #944

Open
elezar opened this issue Feb 27, 2025 · 1 comment · May be fixed by #947

elezar (Member) commented Feb 27, 2025

One of the things that the NVIDIA Container Toolkit does is update the ldcache in the container to allow applications to discover the host driver libraries that have been injected. We also create (some) .so symlinks to match the files tracked by the driver installation. These point to the SONAME symlinks, for example: libcuda.so -> libcuda.so.1 -> libcuda.so.RM_VERSION.

We create the libcuda.so symlink before we run ldconfig, but at that point its target -- the libcuda.so.1 symlink -- does not yet exist, since we rely on ldconfig itself to create it. Because libcuda.so is still dangling while the cache is being built, it is not added to the ldcache. This means that the ldcache in the container, once it starts, does not match expectations (i.e. the host state).

For example, on a host with the driver installed we have:

$ ldconfig -p | grep libcuda
        libcuda.so.1 (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /lib/x86_64-linux-gnu/libcuda.so

In a container:

$ docker run --rm -ti -e NVIDIA_VISIBLE_DEVICES=runtime.nvidia.com/gpu=all ubuntu
root@93984b5c459c:/# ldconfig -p | grep libcuda
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1

If we run ldconfig in the container we see the following:

root@93984b5c459c:/# ldconfig
root@93984b5c459c:/# ldconfig -p | grep libcuda
        libcuda.so.1 (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/lib/x86_64-linux-gnu/libcuda.so

which matches the host state.

This also holds for the "legacy" code path, since ldconfig is only run once there as well, so the symlink chain is not yet complete when the cache is built.

This seems innocent enough, but it has the side effect that applications that call dlopen("libcuda.so", RTLD_LAZY); may fail to find the library if it is not in the standard library path (which can be the case for CDI).

A simple workaround is to inject the update-ldcache hook twice, but we may want to consider a two-phase approach: first run ldconfig with the -N flag to only update the symlinks, and then run ldconfig again to update the cache.

klueska (Contributor) commented Feb 27, 2025

I like this idea:

A simple workaround is to inject the update-ldcache hook twice, but we may want to consider a two-phase approach: first run ldconfig with the -N flag to only update the symlinks, and then run ldconfig again to update the cache.

It would also make it more obvious, when looking at a CDI file, why the hook appears twice: the first instance has the -N option, the second one doesn't.
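A sketch of what the resulting CDI spec fragment might look like (the hook path and argument names here are illustrative assumptions, not the exact output of any toolkit version; only the repeated hook with and without -N is the point):

```json
{
  "hooks": [
    {
      "hookName": "createContainer",
      "path": "/usr/bin/nvidia-cdi-hook",
      "args": ["nvidia-cdi-hook", "update-ldcache", "--ldconfig-arg=-N"]
    },
    {
      "hookName": "createContainer",
      "path": "/usr/bin/nvidia-cdi-hook",
      "args": ["nvidia-cdi-hook", "update-ldcache"]
    }
  ]
}
```

Reading such a spec, the first hook visibly completes the symlink chains (-N) and the second rebuilds the cache.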

@elezar elezar linked a pull request Feb 27, 2025 that will close this issue