
🌱 Improve KCP scale up when using failure domains #11598

Merged

Conversation

fabriziopandini
Member

@fabriziopandini commented on Dec 19, 2024

What this PR does / why we need it:
KCP tries to spread machines across failure domains.

The current implementation achieves this goal by balancing the up-to-date/new machines created during scale up/rollout, without considering the placement of outdated/old machines (because they are going away).

This PR improves on that behaviour by considering the entire set of machines as a secondary criterion when there are failure domains with the same number of up-to-date/new machines. This ensures better spreading of machines during scale up/rollout as well (not only at the end).

E.g.:

FD 1: Machine A
FD 2: Machine B
FD 3: Machine C

Scale up:

  • Adds D to FD 3 (a random pick among FD 1, FD 2, and FD 3, since all of them have the same number of machines at this point)

FD 1: Machine A
FD 2: Machine B
FD 3: Machine C, D

If for any reason (e.g. health issues) B is deleted next, we have:

FD 1: Machine A
FD 2:
FD 3: Machine C, D

With the current implementation, the next scale up could place a machine in either FD1 or FD2, because neither has an up-to-date/new machine yet. If FD1 is selected, we will have:

FD 1: Machine A, E
FD 2:
FD 3: Machine C, D

This is not ideal.
Instead, with this PR the tie between FD1 and FD2 (both without up-to-date machines) is resolved by picking the failure domain with the fewest machines overall, which is FD2, leading to the following situation (a sketch of this selection logic follows the example):

FD 1: Machine A
FD 2: Machine E
FD 3: Machine C, D
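
For illustration, here is a minimal Go sketch of the selection logic described above. The type and function names (failureDomainStats, pickFailureDomain) are hypothetical and not taken from the Cluster API codebase; the actual logic lives in the failure domain package touched by this PR.

```go
package main

import (
	"fmt"
	"sort"
)

// failureDomainStats is a hypothetical per-failure-domain summary used only
// for this sketch; it is not a Cluster API type.
type failureDomainStats struct {
	name             string
	upToDateMachines int // up-to-date/new machines (primary criterion)
	allMachines      int // all machines, including outdated ones (tie breaker)
}

// pickFailureDomain returns the failure domain with the fewest up-to-date
// machines, breaking ties by the fewest machines overall.
func pickFailureDomain(fds []failureDomainStats) string {
	if len(fds) == 0 {
		return ""
	}
	sort.SliceStable(fds, func(i, j int) bool {
		if fds[i].upToDateMachines != fds[j].upToDateMachines {
			return fds[i].upToDateMachines < fds[j].upToDateMachines
		}
		return fds[i].allMachines < fds[j].allMachines
	})
	return fds[0].name
}

func main() {
	// State from the example above, after Machine B was deleted:
	// FD1 holds outdated Machine A, FD2 is empty, FD3 holds C plus the
	// newly created D (counted here as the up-to-date one).
	fds := []failureDomainStats{
		{name: "fd1", upToDateMachines: 0, allMachines: 1},
		{name: "fd2", upToDateMachines: 0, allMachines: 0},
		{name: "fd3", upToDateMachines: 1, allMachines: 2},
	}
	fmt.Println(pickFailureDomain(fds)) // prints "fd2"
}
```

Running the sketch with the post-deletion state from the example prints fd2: FD1 and FD2 tie on up-to-date machines (zero each), and FD2 wins the tie because it has fewer machines overall.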

Also, while working on this PR I paid down some tech debt by adding comments and a ton of unit tests to the failure domain package; I will open a follow-up issue to add tests in the upper layers of the call stack (specifically for the addition/removal of failure domains).

/area provider/controlplane-kubeadm

@k8s-ci-robot
Contributor

@fabriziopandini: The label(s) area/provider/controlplane-kubeadm cannot be applied, because the repository doesn't have them.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA) and the do-not-merge/needs-area label (PR is missing an area label) on Dec 19, 2024
@fabriziopandini
Member Author

/test

@k8s-ci-robot added the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) on Dec 19, 2024
@k8s-ci-robot
Contributor

@fabriziopandini: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test pull-cluster-api-build-main
  • /test pull-cluster-api-e2e-blocking-main
  • /test pull-cluster-api-e2e-conformance-ci-latest-main
  • /test pull-cluster-api-e2e-conformance-main
  • /test pull-cluster-api-e2e-latestk8s-main
  • /test pull-cluster-api-e2e-main
  • /test pull-cluster-api-e2e-mink8s-main
  • /test pull-cluster-api-e2e-upgrade-1-32-1-33-main
  • /test pull-cluster-api-test-main
  • /test pull-cluster-api-test-mink8s-main
  • /test pull-cluster-api-verify-main

The following commands are available to trigger optional jobs:

  • /test pull-cluster-api-apidiff-main

Use /test all to run the following jobs that were automatically triggered:

  • pull-cluster-api-apidiff-main
  • pull-cluster-api-build-main
  • pull-cluster-api-e2e-blocking-main
  • pull-cluster-api-test-main
  • pull-cluster-api-verify-main

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@fabriziopandini
Member Author

/test pull-cluster-api-e2e-main

@fabriziopandini
Member Author

/area provider/control-plane-kubeadm

@k8s-ci-robot added the area/provider/control-plane-kubeadm label (Issues or PRs related to KCP) and removed the do-not-merge/needs-area label (PR is missing an area label) on Dec 19, 2024
Resolved review conversations on:

  • util/failuredomains/failure_domains.go
  • util/failuredomains/failure_domains_test.go
  • controlplane/kubeadm/internal/controllers/scale.go
  • controlplane/kubeadm/internal/control_plane.go
@sbueringer
Member

/cherry-pick release-1.9

@k8s-infra-cherrypick-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.9 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sbueringer
Member

/test pull-cluster-api-e2e-main

@fabriziopandini added the tide/merge-method-squash label (Denotes a PR that should be squashed by tide when it merges) on Dec 20, 2024
@k8s-ci-robot
Contributor

@fabriziopandini: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name: pull-cluster-api-apidiff-main
Commit: 0061ee7
Required: false
Rerun command: /test pull-cluster-api-apidiff-main

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@sbueringer
Member

/lgtm
/approve

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) on Dec 20, 2024
@k8s-ci-robot
Contributor

LGTM label has been added.

Git tree hash: 0f03a126e911a3d92c77932ef600f6d58d4b8b49

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) on Dec 20, 2024
@chrischdi
Member

/lgtm

@k8s-ci-robot merged commit 963fbff into kubernetes-sigs:main on Dec 20, 2024
23 of 24 checks passed
@k8s-ci-robot added this to the v1.10 milestone on Dec 20, 2024
@k8s-infra-cherrypick-robot

@sbueringer: new pull request created: #11604

In response to this:

/cherry-pick release-1.9

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
