Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix crash in podgroup when runLauncherAsWorker is true #669

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

GonzaloSaez
Copy link
Contributor

When runLauncherAsWorker is true and there is no worker, the MPI controller will crash in calPGMinResource

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign rongou for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Member

@kuizhiqing kuizhiqing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for fixing this! Could you add a unit test case in

func TestCalculatePGMinResources(t *testing.T) {
?

@tenzen-y
Copy link
Member

@GonzaloSaez Could you sign DCO (https://github.com/kubeflow/mpi-operator/pull/669/checks?check_run_id=32329697546), and address CI errors?

@GonzaloSaez
Copy link
Contributor Author

Sure will do. Are these tests flaky? It seems that only IntelMPI is failing and I don't see why my changes would make those tests fail

@tenzen-y
Copy link
Member

Sure will do. Are these tests flaky? It seems that only IntelMPI is failing and I don't see why my changes would make those tests fail

Uhm interesting. Let me restart Jobs.

@tenzen-y
Copy link
Member

tenzen-y commented Nov 1, 2024

Sure will do. Are these tests flaky? It seems that only IntelMPI is failing and I don't see why my changes would make those tests fail

@GonzaloSaez I restarted the CI three times, all trials failed. So, this PR seems to bring any kind of additional bug. Could you fix that?

@GonzaloSaez
Copy link
Contributor Author

@tenzen-y I ran these tests locally and they are passing

------------------------------
Deleting cluster "kind" ...
Deleted nodes: ["kind-control-plane"]

Ran 13 of 13 Specs in 725.625 seconds
SUCCESS! -- 13 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestE2E (725.63s)
PASS
ok      github.com/kubeflow/mpi-operator/test/e2e       725.655s

I'm still looking in to why this is happening in CI.

@tenzen-y
Copy link
Member

tenzen-y commented Nov 4, 2024

@tenzen-y I ran these tests locally and they are passing

------------------------------
Deleting cluster "kind" ...
Deleted nodes: ["kind-control-plane"]

Ran 13 of 13 Specs in 725.625 seconds
SUCCESS! -- 13 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestE2E (725.63s)
PASS
ok      github.com/kubeflow/mpi-operator/test/e2e       725.655s

I'm still looking in to why this is happening in CI.

Interesting. Let's see if the master branch is health here: #671

@tenzen-y
Copy link
Member

tenzen-y commented Nov 4, 2024

@tenzen-y I ran these tests locally and they are passing

------------------------------
Deleting cluster "kind" ...
Deleted nodes: ["kind-control-plane"]

Ran 13 of 13 Specs in 725.625 seconds
SUCCESS! -- 13 Passed | 0 Failed | 0 Pending | 0 Skipped
--- PASS: TestE2E (725.63s)
PASS
ok      github.com/kubeflow/mpi-operator/test/e2e       725.655s

I'm still looking in to why this is happening in CI.

Interesting. Let's see if the master branch is health here: #671

It seems the master branch E2E has been broken ...

@tenzen-y
Copy link
Member

@GonzaloSaez Could you rebase this PR on the master branch?

@GonzaloSaez GonzaloSaez force-pushed the patch-1 branch 2 times, most recently from cfc9300 to 46fcc22 Compare January 10, 2025 22:57
Copy link
Member

@tenzen-y tenzen-y left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise lgtm

Spec: kubeflow.MPIJobSpec{
MPIReplicaSpecs: map[kubeflow.MPIReplicaType]*kubeflow.ReplicaSpec{
kubeflow.MPIReplicaTypeLauncher: {
Replicas: ptr.To[int32](1),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Replicas: ptr.To[int32](1),
Replicas: ptr.To[int32](3),

This should be 3 since minMember is 3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants