More ROCm support #3401

Open · wants to merge 8 commits into base: master

Conversation

@glen-amd commented on Mar 18, 2025

Goals

  • Bring AMD ROCm support to TorchServe
  • Make the TorchServe community aware of AMD GPU support

Git PRs

Already done

  • NVIDIA device information via CLI nvidia-smi
  • AMD device information via CLI amd-smi/rocm-smi
  • More...

TODOs

Must do

  • Support for the latest ROCm release (see the sketch below)
    • Currently supported: choices=["rocm6.0", "rocm6.1", "rocm6.2"]
      • In the file "ts_scripts/install_dependencies.py"
    • Latest release as of 2025-03-11: ROCm 6.3 (rocm6.3)
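
A minimal sketch of what the extended choices could look like (an illustrative argparse snippet only, not the actual ts_scripts/install_dependencies.py code):

```python
import argparse

# Hypothetical excerpt: accept rocm6.3 in addition to the currently
# supported ROCm versions for accelerator-specific dependency installation.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--rocm",
    choices=["rocm6.0", "rocm6.1", "rocm6.2", "rocm6.3"],  # rocm6.3 added
    help="ROCm version to install accelerator-specific dependencies for",
)

args = parser.parse_args(["--rocm", "rocm6.3"])
print(args.rocm)  # rocm6.3
```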

Nice to do

  • NVML instead of the CLI nvidia-smi (see the sketch after this list)
    • NVML has both C/C++ APIs and Python bindings.
    • TODO: JNI bindings for Java
  • AMD SMI library instead of the CLIs amd-smi/rocm-smi
    • The AMD SMI library has both C/C++ APIs and Python bindings.
    • TODO: JNI bindings for Java
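
A rough Python sketch of the library-based approach. The pynvml calls are standard NVML bindings; the amdsmi function and field names are assumptions that should be verified against the amdsmi version shipped with the targeted ROCm release:

```python
def list_nvidia_gpus():
    """Enumerate NVIDIA GPU names via NVML instead of parsing nvidia-smi output."""
    import pynvml  # NVML Python bindings (nvidia-ml-py)
    pynvml.nvmlInit()
    try:
        return [
            pynvml.nvmlDeviceGetName(pynvml.nvmlDeviceGetHandleByIndex(i))
            for i in range(pynvml.nvmlDeviceGetCount())
        ]
    finally:
        pynvml.nvmlShutdown()


def list_amd_gpus():
    """Enumerate AMD GPU names via the AMD SMI library instead of amd-smi/rocm-smi."""
    import amdsmi  # AMD SMI Python bindings
    amdsmi.amdsmi_init()
    try:
        # Assumed API: amdsmi_get_processor_handles / amdsmi_get_gpu_asic_info;
        # exact names and dict keys can differ between amdsmi releases.
        handles = amdsmi.amdsmi_get_processor_handles()
        return [amdsmi.amdsmi_get_gpu_asic_info(h)["market_name"] for h in handles]
    finally:
        amdsmi.amdsmi_shut_down()
```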

Exploration in the TorchServe master branch

  • Commands
    • find . -type f,l | xargs grep --color=always -nri cuda
    • find . -type f,l | xargs grep --color=always -nriE '\Wnv'
    • find . -type f,l | xargs grep --color=always -nri '_nv'
  • File types

Parts

Requirement files

Docker files

Config files

  • ts_scripts/spellcheck_conf/wordlist.txt

Build scripts

Frontend

Backend

  • cpp/src/backends/handler/base_handler.cc

Documentation

Examples

CI

Regression tests

GitHub workflows

Benchmarks

  • benchmarks/install_dependencies.sh
  • benchmarks/benchmark.py
    • "nvidia-docker"

Notes

Code name examples

  • NVIDIA/CUDA
    • cu92, cu101, cu102, cu111, cu113, cu116, cu117, cu118, cu121
  • AMD/ROCm
    • rocm5.9, rocm6.0, rocm6.1, rocm6.2, rocm6.3


Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
    Logs for Test A

  • Test B
    Logs for Test B

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?


@glen-amd (Author) commented:

@jakki-amd / @smedegaard / @agunapal - could you please do an initial review? Thanks.

@@ -63,6 +63,10 @@ std::shared_ptr<torch::Device> BaseHandler::GetTorchDevice(
return std::make_shared<torch::Device>(torch::kCPU);
}

#if defined(__HIPCC__) || (defined(__clang__) && defined(__HIP__)) || defined(__HIPCC_RTC__)
return std::make_shared<torch::Device>(torch::kHIP,

A reviewer commented:

Why are you using HIP for the device here instead of CUDA? ROCm PyTorch masquerades as the CUDA device. Though a HIP device does exist as a dispatch key, no kernels are registered to it.

@glen-amd (Author) replied:

I will revert this change.

Since kHIP is explicitly defined in PyTorch (e.g., for clearer semantics I guess), I expected that automatic re-mapping would happen internally for kernel registration, lookup, etc. However, I don't really see this in PyTorch.
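
For reference, a minimal Python sketch of the equivalent device-selection logic (not the actual base_handler.cc code): on ROCm builds of PyTorch, HIP GPUs are registered under the CUDA dispatch key, torch.cuda.is_available() returns True, and torch.version.hip is set, so "cuda" is the device type to request on both vendors.

```python
import torch


def pick_device(gpu_id: int = 0) -> torch.device:
    # ROCm builds masquerade as CUDA: kernels are registered under the CUDA
    # dispatch key, so "cuda" works for both NVIDIA and AMD GPUs.
    if torch.cuda.is_available():
        if torch.version.hip is not None:
            # Informational only: confirms this is a ROCm (HIP) build.
            print(f"ROCm build detected (HIP {torch.version.hip})")
        return torch.device("cuda", gpu_id)
    return torch.device("cpu")
```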

@glen-amd changed the title from "More rocm support" to "More ROCmsupport" on Mar 19, 2025
@glen-amd changed the title from "More ROCmsupport" to "More ROCm support" on Mar 19, 2025
@@ -67,10 +67,10 @@ If you plan to develop with TorchServe and change some source code, you must ins
Use the optional `--rocm` or `--cuda` flag with `install_dependencies.py` for installing accelerator specific dependencies.

Possible values are
- rocm: `rocm61`, `rocm60`
- rocm: `rocm6.3`, `rocm6.2`, `rocm6.1`, 'rocm6.0'
@jakki-amd (Contributor) commented on Mar 20, 2025:

nit: I think it would be more consistent to follow the same naming convention as with CUDA flag naming, meaning using rocm61 instead of rocm6.1 as CUDA flags are also given like cu111, not cu11.1.

@glen-amd (Author) replied on Mar 20, 2025:

I specifically checked the naming convention of both CUDA and ROCm and confirmed internally with some AMDers, then decided to use something like rocm6.3 instead of rocm63.
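
For context, this matches PyTorch's own wheel-index naming, where ROCm tags keep the dot ("rocm6.2") while CUDA tags drop it ("cu121"). A small illustrative sketch (an assumption about how a flag value could be passed through, not TorchServe's actual install logic; which tags actually exist depends on the PyTorch release):

```python
# Illustrative only: map an accelerator flag value to a PyTorch wheel index URL.
PYTORCH_WHEEL_INDEX = "https://download.pytorch.org/whl/{tag}"


def wheel_index_url(accelerator_tag: str) -> str:
    # e.g. "rocm6.2" -> .../whl/rocm6.2, "cu121" -> .../whl/cu121
    return PYTORCH_WHEEL_INDEX.format(tag=accelerator_tag)


print(wheel_index_url("rocm6.2"))
print(wheel_index_url("cu121"))
```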

strategy:
  fail-fast: false
  matrix:
    cuda: ["rocm6.1", "rocm6.2"]
A contributor commented:

I don't see how defining the cuda label as rocm6.1 would work without also changing the ci-gpu CI script. Some work was done on the CI scripts in this branch here, but that branch is out of date.

@jakki-amd (Contributor) commented:

Left a few minor comments, otherwise looks good!
