Workflow for nightly kubernetes tests #3017

Merged: 14 commits, Mar 14, 2024


agunapal (Collaborator) commented Mar 12, 2024

Description

Workflow for running nightly Kubernetes tests

Passing run

In addition to the functionality test, this PR also checks performance in terms of CPU usage in a k8s pod. This is important for preventing CPU throttling in a Kubernetes setup with CPU limits.
To run this test, we disable system GPU metrics (CPU usage is higher with system GPU metrics enabled) and check the CPU usage when no model is loaded in TorchServe.
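
A minimal sketch of how such an idle CPU check could be expressed in bash. This is not the script added in this PR: the helper names `to_millicores` and `cpu_within_limit` are hypothetical, and it assumes the CPU value comes from `kubectl top pod`, which typically reports usage in millicores with an `m` suffix (for example `1500m`). Only `ACCEPTABLE_CPU_CORE_USAGE` is taken from the PR.

```shell
#!/usr/bin/env bash
# Sketch only, not the PR's test script. Helper names are hypothetical.

ACCEPTABLE_CPU_CORE_USAGE=2   # limit in whole cores, same value as in the PR

# Convert a CPU value such as "1500m" (millicores) or "2" (whole cores)
# to an integer number of millicores.
to_millicores() {
  local value="$1"
  case "$value" in
    *m) echo "${value%m}" ;;          # strip the trailing "m"
    *)  echo $(( value * 1000 )) ;;   # whole cores -> millicores
  esac
}

# Succeed (exit status 0) if the given usage is within the limit, fail otherwise.
cpu_within_limit() {
  local usage_m limit_m
  usage_m=$(to_millicores "$1")
  limit_m=$(( ACCEPTABLE_CPU_CORE_USAGE * 1000 ))
  [ "$usage_m" -le "$limit_m" ]
}

# Example wiring (assumes a cluster with metrics-server; $pod_name is a placeholder):
# cpu=$(kubectl top pod "$pod_name" --no-headers | awk '{print $2}')
# cpu_within_limit "$cpu" || echo "CPU usage $cpu for $pod_name exceeded the limit" >&2
```

Comparing in millicores on both sides keeps the units explicit, so a reading of `1500m` is treated as 1.5 cores against a 2-core limit.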

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Passing run is attached

Checklist:

  • Did you have fun?
  • Have you added tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

@agunapal agunapal marked this pull request as ready for review March 12, 2024 20:36
@agunapal agunapal requested a review from msaroufim March 12, 2024 20:36
@agunapal agunapal changed the title Feature/k8s nightly test Workflow for nightly kubernetes tests Mar 12, 2024
@agunapal agunapal requested a review from lxning March 12, 2024 21:10

# Check if the CPU cores exceed 2
if [ $(echo "$cpu" | sed 's/m$//') -gt $ACCEPTABLE_CPU_CORE_USAGE ]; then
  echo "✘ Test failed: CPU cores $(echo "$cpu" | sed 's/m$//') for $pod_name exceeded $ACCEPTABLE_CPU_CORE_USAGE" >&2
Member

What's the context of this PR, and why is this an important test? Add either a comment here or more details in the PR description.

Collaborator Author

Thanks. Added this in the description

ACCEPTABLE_CPU_CORE_USAGE=2
DOCKER_IMAGE=pytorch/torchserve-nightly:latest-gpu

# Get relative path of example dir
Member

crazy stuff going on here lol, worth a comment or 2

Collaborator Author

lol..added

- cron: '15 6 * * *'

jobs:
  kubernetes-tests:
Member

there seems to be a lot of code duplication between this workflow and https://github.com/pytorch/serve/blob/master/.github/workflows/kserve_cpu_tests.yml

Collaborator Author

Yes, there is. Not sure if I want to merge the two. They are testing different tech stacks. KServe tests will also probably get bigger to test OIP. Won't be adding more tests for K8s unless a new issue is uncovered.

@agunapal agunapal requested a review from msaroufim March 13, 2024 01:55
@msaroufim msaroufim enabled auto-merge March 14, 2024 00:42
@msaroufim msaroufim added this pull request to the merge queue Mar 14, 2024
Merged via the queue into master with commit 53bab8e Mar 14, 2024
15 checks passed