🌱 Improve KCP scale up when using failure domains #11598
Conversation
/test
/test pull-cluster-api-e2e-main
/area provider/control-plane-kubeadm
/cherry-pick release-1.9
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.9.
/test pull-cluster-api-e2e-main
@fabriziopandini: The following test failed, say /retest to rerun all failed tests.
/lgtm
LGTM label has been added. Git tree hash: 0f03a126e911a3d92c77932ef600f6d58d4b8b49
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sbueringer
/lgtm
@sbueringer: new pull request created: #11604
What this PR does / why we need it:
KCP tries to spread machines across failure domains.
The current implementation achieves this by balancing the up-to-date/new machines created during scale up/rollout, without considering the placement of outdated/old machines (because they are going away).
This PR improves the behaviour above by considering the entire set of machines as a secondary criterion when there are failure domains with the same number of up-to-date/new machines. This ensures a better spreading of machines also during scale up/rollout (not only at the end).
E.g.:
FD 1: Machine A
FD 2: Machine B
FD 3: Machine C
Scale up:
FD 1: Machine A
FD 2: Machine B
FD 3: Machine C, D
If for any reason (e.g. health issues) B is deleted next, we have
FD 1: Machine A
FD 2:
FD 3: Machine C, D
With the current implementation, the next scale up could place a machine in either FD1 or FD2, because neither has an up-to-date/new machine yet. If FD1 is selected we will have
FD 1: Machine A, E
FD 2:
FD 3: Machine C, D
Which is not ideal.
Instead, with this PR the tie between FD1 and FD2 (both without up-to-date machines) is resolved by picking the failure domain with the fewest machines overall, which is FD2, thus leading to the following situation:
FD 1: Machine A
FD 2: Machine E
FD 3: Machine C, D
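To make the tie-break concrete, here is a minimal, self-contained Go sketch of the selection rule described above. It is not the actual KCP failure-domain code; the `machineCounts` type and `pickForScaleUp` function are made-up names for illustration. The primary criterion is the number of up-to-date machines per failure domain; the total number of machines is used only to break ties.

```go
package main

import (
	"fmt"
	"sort"
)

// machineCounts tracks, for a single failure domain, how many up-to-date
// machines and how many machines overall are currently placed in it.
// (Hypothetical type, for illustration only.)
type machineCounts struct {
	name     string
	upToDate int
	total    int
}

// pickForScaleUp returns the failure domain that should receive the next
// machine: the one with the fewest up-to-date machines, using the total
// number of machines as a secondary criterion to break ties.
func pickForScaleUp(fds []machineCounts) string {
	if len(fds) == 0 {
		return ""
	}
	sorted := append([]machineCounts(nil), fds...)
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].upToDate != sorted[j].upToDate {
			return sorted[i].upToDate < sorted[j].upToDate
		}
		return sorted[i].total < sorted[j].total
	})
	return sorted[0].name
}

func main() {
	// The situation from the example above: A and C are outdated machines,
	// D is the up-to-date machine created by the previous scale up.
	fds := []machineCounts{
		{name: "FD 1", upToDate: 0, total: 1}, // Machine A
		{name: "FD 2", upToDate: 0, total: 0}, // empty after B was deleted
		{name: "FD 3", upToDate: 1, total: 2}, // Machines C, D
	}
	// FD 1 and FD 2 tie on up-to-date machines (0 each); FD 2 wins the
	// tie because it has fewer machines overall.
	fmt.Println(pickForScaleUp(fds)) // prints "FD 2"
}
```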
Also, while working on this PR I paid down some tech debt by adding comments and a ton of unit tests for the failure domain package; I will open a follow-up issue to add tests in the upper layers of the call stack (specifically for the addition/removal of failure domains).
/area provider/controlplane-kubeadm