
POD NodeSelector is not always consistent with their MPIJob node selector #3400

Open
GonzaloSaez opened this issue Oct 31, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


GonzaloSaez commented Oct 31, 2024

What happened:

We are launching MPIJobs using a LocalQueue with Kueue (specifically cpu-local-queue from the YAML found at the end of this issue). The ResourceFlavor associated with the ClusterQueue uses the appropriate nodeLabels to target a specific GKE node pool. We are not setting the MPIJob NodeSelector when launching it. When the job is admitted, Kueue sets the correct NodeSelector on the MPIJob; however, the pods' NodeSelector is empty. Note that we are not setting the suspend field on the MPIJob; we let Kueue handle it for us.

What you expected to happen:

The MPIJob pods should have the same NodeSelector as the MPIJob. This is also documented at https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/:

Kueue adds the ResourceFlavor labels to the .nodeSelector of the underlying Workload Pod templates. This occurs if the Workload didn’t specify the ResourceFlavor labels already as part of its nodeSelector.
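
To illustrate, given the ResourceFlavor defined below, this is a minimal sketch of what we would expect the admitted pods to contain (expected output, not what we observe; the pod name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: pi-worker-0  # hypothetical name, for illustration only
spec:
  nodeSelector:
    # expected to be injected by Kueue from the ResourceFlavor's nodeLabels
    cloud.google.com/gke-nodepool: e2x4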

Environment:

GKE 1.30 + Kueue 0.8.1 + waitForPodsReady=true. These are the Kueue resources:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: cpu
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: e2x4

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: cpu-prov-config
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: check-capacity-cpu-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: cpu-prov-config

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cpu-cluster-queue"
spec:
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: "cpu"
          resources:
            - name: "cpu"
              nominalQuota: "12"
            - name: "memory"
              nominalQuota: 52000Gi
  admissionChecks:
    - check-capacity-cpu-prov

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: cpu-local-queue
  namespace: mynamespace
spec:
  clusterQueue: cpu-cluster-queue

This can be reproduced with the MPI Operator example below. The launcher pod does not have the NodeSelector set, but the worker pods do (see the excerpt after the manifest).

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  namespace: mynamespace
  labels:
    kueue.x-k8s.io/queue-name: cpu-local-queue
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: None
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: mpioperator/mpi-pi:openmpi
              name: mpi-launcher
              securityContext:
                runAsUser: 1000
              command:
                - mpirun
              args:
                - -n
                - "2"
                - /home/mpiuser/pi
              resources:
                limits:
                  cpu: 1
                  memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - image: mpioperator/mpi-pi:openmpi
              name: mpi-worker
              securityContext:
                runAsUser: 1000
              command:
                - /usr/sbin/sshd
              args:
                - -De
                - -f
                - /home/mpiuser/.sshd_config
              resources:
                requests:
                  cpu: "1300m"
                  memory: 3Gi
                limits:
                  cpu: "1300m"
                  memory: 3Gi
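
For reference, this is roughly what we observe on the resulting pods (spec excerpts; pod names follow the mpi-operator's <name>-launcher / <name>-worker-N convention):

# worker pods (pi-worker-0, pi-worker-1): nodeSelector injected as expected
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: e2x4

# launcher pod (pi-launcher): nodeSelector is empty
spec:
  nodeSelector: {}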
GonzaloSaez added the kind/bug label Oct 31, 2024
GonzaloSaez (Author) commented

I think this happens because the launcher Job does not get the NodeSelector that Kueue is adding to the PodSet, so only the worker replicas get the correct NodeSelector.
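
Presumably the admitted Workload assigns the cpu flavor to both PodSets and the assignment just never reaches the launcher Job's pod template. A rough sketch of the relevant Workload status (field names per the kueue v1beta1 API, values illustrative):

status:
  admission:
    clusterQueue: cpu-cluster-queue
    podSetAssignments:
      - name: launcher  # assignment exists, but is not propagated to the launcher Job
        flavors:
          cpu: cpu
          memory: cpu
      - name: worker  # propagated correctly to the worker pods
        flavors:
          cpu: cpu
          memory: cpu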

mimowo (Contributor) commented Nov 4, 2024

@GonzaloSaez thank you for the report - this clearly looks like a bug. Does it only happen when you use AdmissionChecks, or is it independent of that? Let us know if you suspect where the bug is, and feel free to propose a fix.

cc @tenzen-y @mbobrovskyi, who may also have some familiarity with or insight into where the issue is.

GonzaloSaez (Author) commented

@mimowo I think kubeflow/mpi-operator#670 should fix it. That said, there are more open questions regarding NodeSelectors changing after a job is suspended (i.e., the need to update more internals of the MPI Operator launcher Job when the NodeSelector or other details change). Given that this seems related more to the mpi-operator than to Kueue, should we consider closing this?

mimowo (Contributor) commented Nov 4, 2024

I see, thanks for explaining and driving the fix. I think we may actually want to consider adding e2e tests for MPIJob in Kueue to cover such critical aspects of the integration. WDYT, @tenzen-y?

mimowo (Contributor) commented Nov 8, 2024

FYI, for some more context, there is also a Slack discussion about this: https://kubernetes.slack.com/archives/C032ZE66A2X/p1730369507818399
