
🌱 Improve KCP scale up when using failure domains #11598

Merged

Conversation

fabriziopandini
Member

@fabriziopandini commented on Dec 19, 2024

What this PR does / why we need it:
KCP tries to spread machines across failure domains.

The current implementation achieves this goal by balancing the up-to-date/new machines created during scale up/rollout, without considering the placement of outdated/old machines (because they are going away).

This PR improves on that behaviour by considering the entire set of machines as a secondary criterion when there are failure domains with the same number of up-to-date/new machines. This ensures better spreading of machines during scale up/rollout as well (not only at the end).

E.g.:

FD 1: Machine A
FD 2: Machine B
FD 3: Machine C

Scale up:

  • Adds D to FD 3 (a random pick among FD 1, FD 2, and FD 3, since all of them have the same number of machines at this point)

FD 1: Machine A
FD 2: Machine B
FD 3: Machine C, D

If for any reason (e.g. health issues) B is deleted next, we have:

FD 1: Machine A
FD 2:
FD 3: Machine C, D

With the current implementation, the next scale up could place a machine in either FD1 or FD2, because neither has an up-to-date/new machine yet. If FD1 is selected, we will have:

FD 1: Machine A, E
FD 2:
FD 3: Machine C, D

This is not ideal.
Instead, with this PR the tie between FD1 and FD2 (both without up-to-date machines) is resolved by picking the failure domain with the fewest machines overall, which is FD2, leading to the following situation (a sketch of this selection logic follows the example):

FD 1: Machine A
FD 2: Machine E
FD 3: Machine C, D
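
For illustration, here is a minimal Go sketch of the selection logic described above. The type and function names (failureDomainStats, pickFailureDomain) are hypothetical and not taken from the Cluster API codebase; the actual logic lives in the failure domain package touched by this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// failureDomainStats is a hypothetical per-failure-domain summary used only
// for this sketch; it is not a Cluster API type.
type failureDomainStats struct {
	name             string
	upToDateMachines int // up-to-date/new machines (primary criterion)
	allMachines      int // all machines, including outdated ones (tie breaker)
}

// pickFailureDomain returns the failure domain with the fewest up-to-date
// machines, breaking ties by the fewest machines overall.
func pickFailureDomain(fds []failureDomainStats) string {
	if len(fds) == 0 {
		return ""
	}
	sort.SliceStable(fds, func(i, j int) bool {
		if fds[i].upToDateMachines != fds[j].upToDateMachines {
			return fds[i].upToDateMachines < fds[j].upToDateMachines
		}
		return fds[i].allMachines < fds[j].allMachines
	})
	return fds[0].name
}

func main() {
	// State from the example above, after Machine B was deleted:
	// FD1 holds outdated Machine A, FD2 is empty, FD3 holds C plus the
	// newly created D (counted here as the up-to-date one).
	fds := []failureDomainStats{
		{name: "fd1", upToDateMachines: 0, allMachines: 1},
		{name: "fd2", upToDateMachines: 0, allMachines: 0},
		{name: "fd3", upToDateMachines: 1, allMachines: 2},
	}
	fmt.Println(pickFailureDomain(fds)) // prints "fd2"
}
```

Running the sketch with the post-deletion state from the example prints fd2: FD1 and FD2 tie on up-to-date machines (zero each), and FD2 wins the tie because it has fewer machines overall.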

Also, while working on this PR I paid down some tech debt by adding comments and a ton of unit tests to the failure domain package; I will open a follow-up issue to add tests in the upper layers of the call stack (specifically for the addition/removal of failure domains).

/area provider/controlplane-kubeadm

@k8s-ci-robot
Contributor

@fabriziopandini: The label(s) area/provider/controlplane-kubeadm cannot be applied, because the repository doesn't have them.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA) and the do-not-merge/needs-area label (PR is missing an area label) on Dec 19, 2024
@fabriziopandini
Member Author

/test

@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) on Dec 19, 2024
@k8s-ci-robot
Contributor

@fabriziopandini: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-32-1-33-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fabriziopandini
Member Author

/test pull-cluster-api-e2e-main

@fabriziopandini
Member Author

/area provider/control-plane-kubeadm

@k8s-ci-robot added the area/provider/control-plane-kubeadm label (Issues or PRs related to KCP) and removed the do-not-merge/needs-area label (PR is missing an area label) on Dec 19, 2024
Resolved review conversations on:

  • util/failuredomains/failure_domains.go
  • util/failuredomains/failure_domains_test.go
  • controlplane/kubeadm/internal/controllers/scale.go
  • controlplane/kubeadm/internal/control_plane.go
@sbueringer
Member

/cherry-pick release-1.9

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.9 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Member

/test pull-cluster-api-e2e-main

@fabriziopandini added the tide/merge-method-squash label (Denotes a PR that should be squashed by tide when it merges) on Dec 20, 2024
@k8s-ci-robot
Contributor

@fabriziopandini: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-apidiff-main
Commit: 0061ee7
Required: false
Rerun command: /test pull-cluster-api-apidiff-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sbueringer
Member

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Dec 20, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0f03a126e911a3d92c77932ef600f6d58d4b8b49

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Dec 20, 2024
@chrischdi
Member

/lgtm

@k8s-ci-robot merged commit 963fbff into kubernetes-sigs:main on Dec 20, 2024
23 of 24 checks passed
@k8s-ci-robot added this to the v1.10 milestone on Dec 20, 2024
@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #11604

In response to this:

/cherry-pick release-1.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
