More ROCm support #3401
base: master
Conversation
@jakki-amd / @smedegaard / @agunapal - can you please do an initial review? Thanks.
```diff
@@ -63,6 +63,10 @@ std::shared_ptr<torch::Device> BaseHandler::GetTorchDevice(
     return std::make_shared<torch::Device>(torch::kCPU);
   }

+#if defined(__HIPCC__) || (defined(__clang__) && defined(__HIP__)) || defined(__HIPCC_RTC__)
+  return std::make_shared<torch::Device>(torch::kHIP,
```
Why are you using HIP for the device here instead of CUDA? ROCm PyTorch masquerades as the CUDA device. Though a HIP device does exist as a dispatch key, no kernels are registered to it.
I will revert this change. Since `kHIP` is explicitly defined in PyTorch (for clearer semantics, I guess), I expected that automatic re-mapping would happen internally for kernel registration, lookup, etc. However, I don't really see this in PyTorch.
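Because ROCm builds of PyTorch masquerade as CUDA, the handler can keep a single device-selection branch rather than adding a HIP one. A minimal sketch of that logic in plain Python (no torch dependency; `gpu_available` is a stand-in for `torch.cuda.is_available()`, which also returns true on ROCm builds):

```python
def select_device(gpu_available: bool, gpu_id: int = 0) -> str:
    # ROCm builds of PyTorch expose the accelerator through the "cuda"
    # device type, so one branch covers both vendors; no kernels are
    # registered to the separate HIP dispatch key.
    if gpu_available:
        return f"cuda:{gpu_id}"
    return "cpu"
```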
```diff
@@ -67,10 +67,10 @@ If you plan to develop with TorchServe and change some source code, you must ins
 Use the optional `--rocm` or `--cuda` flag with `install_dependencies.py` for installing accelerator specific dependencies.

 Possible values are
-- rocm: `rocm61`, `rocm60`
+- rocm: `rocm6.3`, `rocm6.2`, `rocm6.1`, `rocm6.0`
```
nit: I think it would be more consistent to follow the same naming convention as the CUDA flags, i.e. using `rocm61` instead of `rocm6.1`, since the CUDA flags are given like `cu111`, not `cu11.1`.
I specifically checked the naming conventions of both CUDA and ROCm and confirmed internally with some AMDers, then decided to use something like `rocm6.3` instead of `rocm63`.
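For illustration, both conventions discussed above can be mapped back to a plain version string. This is a hypothetical helper, not part of the PR; it assumes CUDA flags always end in a single minor-version digit (e.g. `cu111` is CUDA 11.1):

```python
import re

def parse_accelerator_flag(flag: str):
    """Map a build flag like 'cu111' or 'rocm6.1' to (vendor, version)."""
    # CUDA flags pack major+minor with no dot; the last digit is the minor.
    m = re.fullmatch(r"cu(\d+)(\d)", flag)
    if m:
        return ("cuda", f"{m.group(1)}.{m.group(2)}")
    # The ROCm flags in this PR keep the dot, e.g. rocm6.3.
    m = re.fullmatch(r"rocm(\d+\.\d+)", flag)
    if m:
        return ("rocm", m.group(1))
    raise ValueError(f"unrecognized accelerator flag: {flag}")
```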
docs/github_actions.md (outdated)
```yaml
strategy:
  fail-fast: false
  matrix:
    cuda: ["rocm6.1", "rocm6.2"]
```
I don't see how defining the `cuda` label to be `rocm6.1` without changing the ci-gpu CI script would work. Some work was done regarding the CI scripts in this branch here, but that branch is out-of-date.
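One way this mismatch could be reconciled (a sketch only; the `accelerator` key name, step, and script path are assumptions, not the PR's actual change) is to rename the matrix key so the CI script no longer receives ROCm values under a `cuda` label:

```yaml
strategy:
  fail-fast: false
  matrix:
    accelerator: ["rocm6.1", "rocm6.2"]
steps:
  # Path and flag syntax assumed from the install_dependencies.py
  # description above; adjust to the repository's actual script.
  - name: Install accelerator dependencies
    run: python ./ts_scripts/install_dependencies.py --rocm=${{ matrix.accelerator }}
```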
Left a few minor comments, otherwise looks good!
Goals
Git PRs
Already done
- `nvidia-smi` → `amd-smi`/`rocm-smi`
TODOs
Must do
- `choices=["rocm6.0", "rocm6.1", "rocm6.2"]` (add `rocm6.3`)

Nice to do
- `nvidia-smi` → `amd-smi`/`rocm-smi`
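The `nvidia-smi` → `amd-smi`/`rocm-smi` replacement noted in the lists above can also be handled generically. A small sketch (hypothetical helper, not TorchServe code) that picks whichever management tool is on `PATH`:

```python
import shutil

def find_smi_tool():
    # Check NVIDIA's tool first, then AMD's two equivalents;
    # shutil.which returns None for tools not on PATH.
    for tool in ("nvidia-smi", "amd-smi", "rocm-smi"):
        if shutil.which(tool):
            return tool
    return None  # no GPU management tool found
```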
Exploration in the TorchServe `master` branch

```shell
find . -type f,l | xargs grep --color=always -nri cuda
find . -type f,l | xargs grep --color=always -nriE '\Wnv'
find . -type f,l | xargs grep --color=always -nri '_nv'
```
Parts
- Requirement files
- Docker files
- Config files
- Build scripts
- Frontend
- Backend
- Documentation
- Examples
- CI
- Regression tests
- GitHub workflows
- Benchmarks
Notes
Code name examples
- CUDA: `cu92`, `cu101`, `cu102`, `cu111`, `cu113`, `cu116`, `cu117`, `cu118`, `cu121`
- ROCm: `rocm5.9`, `rocm6.0`, `rocm6.1`, `rocm6.2`, `rocm6.3`
Description
Please read our CONTRIBUTING.md prior to creating your first pull request.
Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.
Test A
Logs for Test A
Test B
Logs for Test B
Checklist: