Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[API server] use loadbalancer to expose ingress by default #4926

Merged
merged 11 commits into from
Mar 25, 2025

Conversation

aylei
Copy link
Collaborator

@aylei aylei commented Mar 11, 2025

close #4920

Motivation of changing the defaults: LoadBalancer (if available) is more stable and secure compared to NodePort, correspondingly better UX and lower support costs.

Note: LoadBalancer might be unavailable on some k8s infra, we still make it default because it is the de-facto way to expose L4 endpoint. As an example, nginx-ingress also use LoadBalancer as default service type https://artifacthub.io/packages/helm/ingress-nginx/ingress-nginx

Edit

New behavior for users upgraded from earlier 0.8.0-nightly versions, an error will be raised:

$ helm upgrade --install skypilot-alpha ./charts/skypilot -f values.yaml
Error: UPGRADE FAILED: execution error at (skypilot/templates/ingress-nodeport.yaml:8:8): Service skypilot-alpha-ingress-controller-np is deprecated and will be removed in SkyPilot v0.9.0. Refer to https://docs.skypilot.co/en/latest/reference/api-server/api-server-admin-deploy.html#migrate-from-legacy-nodeport-service for migration steps.

This ensures the warning is acknowledged by users.

Tested (run the relevant ones):

  • Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • Manually Tested on GKE
  • Manually Tested on EKS
  • Manually Tested on AKS
  • All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • Backward compatibility: /quicktest-core (CI) or conda deactivate; bash -i tests/backward_compatibility_tests.sh (local)

@aylei aylei marked this pull request as ready for review March 11, 2025 04:49
aylei added 2 commits March 11, 2025 12:56
Signed-off-by: Aylei <[email protected]>
Signed-off-by: Aylei <[email protected]>
Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks @aylei!

Comment on lines 281 to 282
helm upgrade --install -n $NAMESPACE $RELEASE_NAME skypilot/skypilot-nightly --devel \
--set ingress.nodePortEnabled=false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this also delete the old NodePort service?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this command is solely for deleting the old NodePort service

# Use ClusterIP here to disable the LoadBalancer created by nginx.
# If not using NodePort, set this to LoadBalancer.
type: ClusterIP
type: LoadBalancer
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, in the future this can directly be set to NodePort for users who want to use nodeport? If so, we should probably remove the ingress.nodePortEnabled right before 0.9.0 release.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for charts/skypilot/templates/ingress-nodeport.yaml - that file should be removed before 0.9.0. Good to have a TODO there.

Copy link
Collaborator Author

@aylei aylei Mar 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think setting ingress-nginx.controller.service.* is recommended after this PR for new deployments. But it requires the similar migration steps for legacy deployments to abandon the additional node port service:

  1. Enable the nginx NodePort service, since 30050 and 30051 have been used, the new service should pick up new ports;
  2. Ask clients to migrate to the new server URL (with new ports)
  3. Set ingress.nodePortEnabled=false to delete the old NodePort service

Not sure whether we should keep the additional node port service to simplify upgrades for nightly users, but we can document the above steps and leave the choice to users. It is reasonable to take no action if they are happy with current setup.

Copy link
Collaborator Author

@aylei aylei Mar 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A new idea is that we can just error out to user if:

  • There is an existing legacy node port service
  • and ingress.nodePortEnabled is not set

Suggested message: "Service ${serviceName} is deprecated and will be removed in 0.9.0, follow the guide to migrate: ${URL}", in the guide we suggest user to set ingress.nodePortEnabled=true explicitly first and gradually migrate their users to service managed by ingress-nginx.controller.service.* and finally set ingress.nodePortEnabled=false.

Also the message should be updated in 0.9.0 release: "Service ${serviceName} is deprecated and was removed in 0.9.0, migrate the service using the previous helm chart version before upgrade."

This will add some friction to our nightly users that have an API sever chart deployed, but should be okay since the adoption is not that wide.

@romilbhardwaj wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal sounds good to me - essentially we require current users to be explicit about their intent to use NodePort after this PR.

@aylei aylei requested a review from romilbhardwaj March 20, 2025 02:31
@aylei aylei force-pushed the aylei/apiserver-lb branch from 0111eef to f06c659 Compare March 20, 2025 08:20
@aylei aylei force-pushed the aylei/apiserver-lb branch from f06c659 to cd8ee18 Compare March 20, 2025 08:30
@aylei
Copy link
Collaborator Author

aylei commented Mar 21, 2025

@romilbhardwaj this PR is ready for another look

Copy link
Collaborator

@romilbhardwaj romilbhardwaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aylei!

$ NODE_PORT=$(kubectl get svc ${RELEASE_NAME}-ingress-controller-np -n $NAMESPACE -o jsonpath='{.spec.ports[?(@.name=="http")].nodePort}')
$ NODE_IP=$(kubectl get nodes -o jsonpath='{ $.items[0].status.addresses[?(@.type=="ExternalIP")].address }')
$ ENDPOINT=http://${WEB_USERNAME}:${WEB_PASSWORD}@${NODE_IP}:${NODE_PORT}
$ ENDPOINT=$(kubectl get svc ${RELEASE_NAME}-ingress-nginx-controller -n $NAMESPACE -o jsonpath='http://{.status.loadBalancer.ingress[0].ip}')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably retain the ENDPOINT=http://${WEB_USERNAME}:${WEB_PASSWORD}@${NODE_IP} format so that the output is printed as http://skypilot:[email protected] and users can directly copy-paste the endpoint in their config.yaml.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea!

:sync: nodeport-tab

1. Make sure ports 30050 and 30051 are open on your nodes.
.. tab-item:: LoadBalancer (Default)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also change the "A Kubernetes cluster with ports 30050 and 30051 available for NodePort services" prerequisite at L15 to "A Kubernetes cluster with LoadBalancer or NodePort service support"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch!

@aylei aylei merged commit 052e0ea into master Mar 25, 2025
18 checks passed
@aylei aylei deleted the aylei/apiserver-lb branch March 25, 2025 01:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[API server] support k8s ssh tunneling through LB
2 participants