More ROCm support #3401

Open · wants to merge 8 commits into base: master

Conversation

@glen-amd commented on Mar 18, 2025

Goals

  • Bring AMD ROCm support to TorchServe
  • Make the TorchServe community aware of AMD GPU support

Git PRs

Already done

  • NVIDIA device information via CLI nvidia-smi
  • AMD device information via CLI amd-smi/rocm-smi
  • More...

TODOs

Must do

  • Support for the latest ROCm release (see the sketch below)
    • Currently supported: choices=["rocm6.0", "rocm6.1", "rocm6.2"]
      • In the file "ts_scripts/install_dependencies.py"
    • Latest release as of 2025-03-11: ROCm 6.3 (rocm6.3)
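
A minimal sketch of what the extended choices could look like (an illustrative argparse snippet only, not the actual ts_scripts/install_dependencies.py code):

```python
import argparse

# Hypothetical excerpt: accept rocm6.3 in addition to the currently
# supported ROCm versions for accelerator-specific dependency installation.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--rocm",
    choices=["rocm6.0", "rocm6.1", "rocm6.2", "rocm6.3"],  # rocm6.3 added
    help="ROCm version to install accelerator-specific dependencies for",
)

args = parser.parse_args(["--rocm", "rocm6.3"])
print(args.rocm)  # rocm6.3
```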

Nice to do

  • NVML instead of the CLI nvidia-smi (see the sketch after this list)
    • NVML has both C/C++ APIs and Python bindings.
    • TODO: JNI bindings for Java
  • AMD SMI library instead of the CLIs amd-smi/rocm-smi
    • The AMD SMI library has both C/C++ APIs and Python bindings.
    • TODO: JNI bindings for Java
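
A rough Python sketch of the library-based approach. The pynvml calls are standard NVML bindings; the amdsmi function and field names are assumptions that should be verified against the amdsmi version shipped with the targeted ROCm release:

```python
def list_nvidia_gpus():
    """Enumerate NVIDIA GPU names via NVML instead of parsing nvidia-smi output."""
    import pynvml  # NVML Python bindings (nvidia-ml-py)
    pynvml.nvmlInit()
    try:
        return [
            pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(i))
            for i in range(pynvml.nvmlDeviceGetCount())
        ]
    finally:
        pynvml.nvmlShutdown()


def list_amd_gpus():
    """Enumerate AMD GPU names via the AMD SMI library instead of amd-smi/rocm-smi."""
    import amdsmi  # AMD SMI Python bindings
    amdsmi.amdsmi_init()
    try:
        # Assumed API: amdsmi_get_processor_handles / amdsmi_get_gpu_asic_info;
        # exact names and dict keys can differ between amdsmi releases.
        handles = amdsmi.amdsmi_get_processor_handles()
        return [amdsmi.amdsmi_get_gpu_asic_info(h)["market_name"] for h in handles]
    finally:
        amdsmi.amdsmi_shut_down()
```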

Exploration in the TorchServe master branch

  • Commands
    • find . -type f,l | xargs grep --color=always -nri cuda
    • find . -type f,l | xargs grep --color=always -nriE '\Wnv'
    • find . -type f,l | xargs grep --color=always -nri '_nv'
  • File types

Parts

Requirement files

Docker files

Config files

  • ts_scripts/spellcheck_conf/wordlist.txt

Build scripts

Frontend

Backend

  • cpp/src/backends/handler/base_handler.cc

Documentation

Examples

CI

Regression tests

GitHub workflows

Benchmarks

  • benchmarks/install_dependencies.sh
  • benchmarks/benchmark.py
    • "nvidia-docker"

Notes

Code name examples

  • NVIDIA/CUDA
    • cu92, cu101, cu102, cu111, cu113, cu116, cu117, cu118, cu121
  • AMD/ROCm
    • rocm5.9, rocm6.0, rocm6.1, rocm6.2, rocm6.3


Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?


@glen-amd (Author) commented:

@jakki-amd / @smedegaard / @agunapal - could you please do an initial review? Thanks.

@@ -63,6 +63,10 @@ std::shared_ptr<torch::Device> BaseHandler::GetTorchDevice(
return std::make_shared<torch::Device>(torch::kCPU);
}

#if defined(__HIPCC__) || (defined(__clang__) && defined(__HIP__)) || defined(__HIPCC_RTC__)
return std::make_shared<torch::Device>(torch::kHIP,

A reviewer commented:

Why are you using HIP for the device here instead of CUDA? ROCm PyTorch masquerades as the CUDA device. Though a HIP device does exist as a dispatch key, no kernels are registered to it.

@glen-amd (Author) replied:

I will revert this change.

Since kHIP is explicitly defined in PyTorch (e.g., for clearer semantics I guess), I expected that automatic re-mapping would happen internally for kernel registration, lookup, etc. However, I don't really see this in PyTorch.
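
For reference, a minimal Python sketch of the equivalent device-selection logic (not the actual base_handler.cc code): on ROCm builds of PyTorch, HIP GPUs are registered under the CUDA dispatch key, torch.cuda.is_available() returns True, and torch.version.hip is set, so "cuda" is the device type to request on both vendors.

```python
import torch


def pick_device(gpu_id: int = 0) -> torch.device:
    # ROCm builds masquerade as CUDA: kernels are registered under the CUDA
    # dispatch key, so "cuda" works for both NVIDIA and AMD GPUs.
    if torch.cuda.is_available():
        if torch.version.hip is not None:
            # Informational only: confirms this is a ROCm (HIP) build.
            print(f"ROCm build detected (HIP {torch.version.hip})")
        return torch.device("cuda", gpu_id)
    return torch.device("cpu")
```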

@glen-amd changed the title from "More rocm support" to "More ROCmsupport" on Mar 19, 2025
@glen-amd changed the title from "More ROCmsupport" to "More ROCm support" on Mar 19, 2025
@@ -67,10 +67,10 @@ If you plan to develop with TorchServe and change some source code, you must ins
Use the optional `--rocm` or `--cuda` flag with `install_dependencies.py` for installing accelerator specific dependencies.

Possible values are
- rocm: `rocm61`, `rocm60`
- rocm: `rocm6.3`, `rocm6.2`, `rocm6.1`, 'rocm6.0'
@jakki-amd (Contributor) commented on Mar 20, 2025:

nit: I think it would be more consistent to follow the same naming convention as with CUDA flag naming, meaning using rocm61 instead of rocm6.1 as CUDA flags are also given like cu111, not cu11.1.

@glen-amd (Author) replied on Mar 20, 2025:

I specifically checked the naming convention of both CUDA and ROCm and confirmed internally with some AMDers, then decided to use something like rocm6.3 instead of rocm63.
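
For context, this matches PyTorch's own wheel-index naming, where ROCm tags keep the dot ("rocm6.2") while CUDA tags drop it ("cu121"). A small illustrative sketch (an assumption about how a flag value could be passed through, not TorchServe's actual install logic; which tags actually exist depends on the PyTorch release):

```python
# Illustrative only: map an accelerator flag value to a PyTorch wheel index URL.
PYTORCH_WHEEL_INDEX = "https://download.pytorch.org/whl/{tag}"


def wheel_index_url(accelerator_tag: str) -> str:
    # e.g. "rocm6.2" -> .../whl/rocm6.2, "cu121" -> .../whl/cu121
    return PYTORCH_WHEEL_INDEX.format(tag=accelerator_tag)


print(wheel_index_url("rocm6.2"))
print(wheel_index_url("cu121"))
```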

strategy:
  fail-fast: false
  matrix:
    cuda: ["rocm6.1", "rocm6.2"]
A contributor commented:

I don't see how defining the cuda label as rocm6.1 would work without also changing the ci-gpu CI script. Some work was done on the CI scripts in this branch here, but that branch is out of date.

@jakki-amd (Contributor) commented:

Left a few minor comments, otherwise looks good!
