Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 fix: efs & elb upgrade e2e tests #5418

Merged
merged 2 commits into from
Mar 21, 2025

Conversation

richardcase
Copy link
Member

@richardcase richardcase commented Mar 19, 2025

What type of PR is this?

/kind failing-test

What this PR does / why we need it:

The EFS e2e test was breaking for 2 reasons:

  1. Running out if disk space on the control plane nodes. It only had 8Gb so this has been increased to 16gb
  2. The workload being deployed to test EFS was using centos with has been discontinued for a long time now. So changed to use Ubuntu

The classic ELB upgrade test was failing because on upgrade of CAPA, the CAPA controller manager was restarted. From the logs i could see that the AWSCluster was no being reconciled and thats because the "old" and "new" versions were the same basic ion the predicate logic in SetupWithManager. Changed SetupWithManager to be more consistent with other CAPI infra providers. Also it was failing because of the wait introduced by the new paused.EnsurePausedCondition functionality but this is fixed in #5425

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Checklist:

  • squashed commits
  • includes documentation
  • includes emoji in title
  • adds unit tests
  • adds or updates e2e tests

Release note:

Fix the EFS & classic elb e2e tests.

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. labels Mar 19, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 19, 2025
@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@damdo
Copy link
Member

damdo commented Mar 19, 2025

/test pull-cluster-api-provider-aws-e2e

1 similar comment
@theobarberbany
Copy link
Contributor

/test pull-cluster-api-provider-aws-e2e

@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

4 similar comments
@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@damdo
Copy link
Member

damdo commented Mar 19, 2025

/test pull-cluster-api-provider-aws-e2e

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Mar 20, 2025
@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@richardcase richardcase changed the title 🐛 fix: efs e2e test breaking 🐛 fix: efs & elb upgrade e2e tests Mar 20, 2025
@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@nrb
Copy link
Contributor

nrb commented Mar 20, 2025

@richardcase The release-2.7 periodics are failing on the same efs issue, so we should backport at least the YAML changes in this PR. Do you think we should also backport the changes to the manager setup?

@richardcase
Copy link
Member Author

@richardcase The release-2.7 periodics are failing on the same efs issue, so we should backport at least the YAML changes in this PR. Do you think we should also backport the changes to the manager setup?

@nrb - we'll probably have to do it all. This issue started as just the EFS problem and then expanded to the classic elb issue. I can just cherry pick the EFS changes first and then if we see the other issue we can then cherry pick that commit. Will create the PR for release-2.7 soon.

@nrb
Copy link
Contributor

nrb commented Mar 20, 2025

we'll probably have to do it all

Yeah, we can just do a cherry-pick then

@nrb nrb mentioned this pull request Mar 20, 2025
2 tasks
@richardcase richardcase changed the title 🐛 fix: efs & elb upgrade e2e tests WIP: 🐛 fix: efs & elb upgrade e2e tests Mar 20, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 20, 2025
@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

The EFS e2e test was breaking for 2 reasons:

1. Running out if disk space on the control plane nodes.
It only had 8Gb so this has been increased to 16gb
2.The workload being deployed to test EFS was using centos with has been
discontinued for a long time now. So changed to use Ubuntu

Also small updates to logging for the ELB test.

Signed-off-by: Richard Case <[email protected]>
AWSCluster was not reconciling when starting after an upgrade. It had
old logic to compare versions and not do anything. We want to reconcile
even if there are no changes to the AWSCluster as the ELB logic has
changed. Also, there may be other changes like this in future.

Change the SetupWithManager logic to be more like the standard we see
with other infrastructure providers.

Signed-off-by: Richard Case <[email protected]>
@richardcase richardcase changed the title WIP: 🐛 fix: efs & elb upgrade e2e tests 🐛 fix: efs & elb upgrade e2e tests Mar 21, 2025
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 21, 2025
@richardcase
Copy link
Member Author

/test pull-cluster-api-provider-aws-e2e

@@ -128,6 +128,7 @@ var (
// +kubebuilder:rbac:groups=authorization.k8s.io,resources=subjectaccessreviews,verbs=create

func main() {
setupLog.Info("starting cluster-api-provider-aws", "version", version.Get().String())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice addition, thanks!

Comment on lines -396 to -417
WithEventFilter(
predicate.Funcs{
// Avoid reconciling if the event triggering the reconciliation is related to incremental status updates
// for AWSCluster resources only
UpdateFunc: func(e event.UpdateEvent) bool {
if e.ObjectOld.GetObjectKind().GroupVersionKind().Kind != "AWSCluster" {
return true
}

oldCluster := e.ObjectOld.(*infrav1.AWSCluster).DeepCopy()
newCluster := e.ObjectNew.(*infrav1.AWSCluster).DeepCopy()

oldCluster.Status = infrav1.AWSClusterStatus{}
newCluster.Status = infrav1.AWSClusterStatus{}

oldCluster.ObjectMeta.ResourceVersion = ""
newCluster.ObjectMeta.ResourceVersion = ""

return !cmp.Equal(oldCluster, newCluster)
},
},
).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we reckon that removing this will have any unrelated side effects?
For example, downstream we do externally manage the AWSCluster, will this create many reconciliations for the CAPA infracluster controller then?

Copy link
Member Author

@richardcase richardcase Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there will be too may side affets. It will result in more Reconcile runs potentially, but:

  • Our original code is at odds with how other infrastructure providers work
  • Keeping this would cause issues with the Paused condition being added and would still cause that initial delay in reconciliation
  • If we kept this, our change for ELB wouldn't be called until there was a "Spec" change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok thanks, I think this is fine to remove for now.
If we notice any issues we can open a follow-up thread to discuss countermeasures.
Thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we could always create a custom predicate

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

/assign @nrb

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 21, 2025
@nrb
Copy link
Contributor

nrb commented Mar 21, 2025

/approve

Thanks Richard!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: nrb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 21, 2025
@k8s-ci-robot k8s-ci-robot merged commit 6c93889 into kubernetes-sigs:main Mar 21, 2025
24 checks passed
@richardcase
Copy link
Member Author

/cherrypick release-2.7

@k8s-infra-cherrypick-robot

@richardcase: #5418 failed to apply on top of branch "release-2.7":

Applying: fix: efs e2e test breaking
Using index info to reconstruct a base tree...
M	test/e2e/data/e2e_conf.yaml
M	test/e2e/shared/aws_helpers.go
M	test/e2e/suites/unmanaged/helpers_test.go
Falling back to patching base and 3-way merge...
Auto-merging test/e2e/suites/unmanaged/helpers_test.go
Auto-merging test/e2e/shared/aws_helpers.go
CONFLICT (content): Merge conflict in test/e2e/shared/aws_helpers.go
Auto-merging test/e2e/data/e2e_conf.yaml
CONFLICT (content): Merge conflict in test/e2e/data/e2e_conf.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config advice.mergeConflict false"
Patch failed at 0001 fix: efs e2e test breaking

In response to this:

/cherrypick release-2.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants