
POD NodeSelector is not always consistent with their MPIJob node selector #3400

Open
GonzaloSaez opened this issue Oct 31, 2024 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


GonzaloSaez commented Oct 31, 2024

What happened:

We are launching MPIJobs using a LocalQueue with Kueue (specifically cpu-local-queue from the YAML found at the end of this issue). The ResourceFlavor associated with the ClusterQueue uses the appropriate nodeLabels to target a specific GKE node pool. We are not setting the MPIJob NodeSelector when launching it. When the job is admitted, Kueue sets the correct NodeSelector on the MPIJob; however, the pods' NodeSelector is empty. Note that we are not setting the suspend field on the MPIJob; we let Kueue handle it for us.

What you expected to happen:

The MPIJob pods should have the same NodeSelector as the MPIJob. This is also documented at https://kueue.sigs.k8s.io/docs/concepts/resource_flavor/:

Kueue adds the ResourceFlavor labels to the .nodeSelector of the underlying Workload Pod templates. This occurs if the Workload didn’t specify the ResourceFlavor labels already as part of its nodeSelector.
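
To illustrate, given the ResourceFlavor defined below, this is a minimal sketch of what we would expect the admitted pods to contain (expected output, not what we observe; the pod name is hypothetical):

apiVersion: v1
kind: Pod
metadata:
  name: pi-worker-0  # hypothetical name, for illustration only
spec:
  nodeSelector:
    # expected to be injected by Kueue from the ResourceFlavor's nodeLabels
    cloud.google.com/gke-nodepool: e2x4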

Environment:

GKE 1.30 + Kueue 0.8.1 + waitForPodsReady=true. These are the Kueue resources:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: cpu
spec:
  nodeLabels:
    cloud.google.com/gke-nodepool: e2x4

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: ProvisioningRequestConfig
metadata:
  name: cpu-prov-config
spec:
  provisioningClassName: check-capacity.autoscaling.x-k8s.io

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: AdmissionCheck
metadata:
  name: check-capacity-cpu-prov
spec:
  controllerName: kueue.x-k8s.io/provisioning-request
  parameters:
    apiGroup: kueue.x-k8s.io
    kind: ProvisioningRequestConfig
    name: cpu-prov-config

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cpu-cluster-queue"
spec:
  namespaceSelector: {}
  preemption:
    withinClusterQueue: LowerPriority
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: "cpu"
          resources:
            - name: "cpu"
              nominalQuota: "12"
            - name: "memory"
              nominalQuota: 52000Gi
  admissionChecks:
    - check-capacity-cpu-prov

---

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: cpu-local-queue
  namespace: mynamespace
spec:
  clusterQueue: cpu-cluster-queue

This can be reproduced with the MPI Operator example below. The launcher pod does not have the NodeSelector set, but the worker pods do (see the excerpt after the manifest).

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi
  namespace: mynamespace
  labels:
    kueue.x-k8s.io/queue-name: cpu-local-queue
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: None
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - image: mpioperator/mpi-pi:openmpi
              name: mpi-launcher
              securityContext:
                runAsUser: 1000
              command:
                - mpirun
              args:
                - -n
                - "2"
                - /home/mpiuser/pi
              resources:
                limits:
                  cpu: 1
                  memory: 1Gi
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - image: mpioperator/mpi-pi:openmpi
              name: mpi-worker
              securityContext:
                runAsUser: 1000
              command:
                - /usr/sbin/sshd
              args:
                - -De
                - -f
                - /home/mpiuser/.sshd_config
              resources:
                requests:
                  cpu: "1300m"
                  memory: 3Gi
                limits:
                  cpu: "1300m"
                  memory: 3Gi
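
For reference, this is roughly what we observe on the resulting pods (spec excerpts; pod names follow the mpi-operator's <name>-launcher / <name>-worker-N convention):

# worker pods (pi-worker-0, pi-worker-1): nodeSelector injected as expected
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: e2x4

# launcher pod (pi-launcher): nodeSelector is empty
spec:
  nodeSelector: {}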
GonzaloSaez added the kind/bug label Oct 31, 2024
GonzaloSaez (Author) commented

I think this happens because the launcher Job does not get the NodeSelector that Kueue is adding to the PodSet, so only the worker replicas get the correct NodeSelector.
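
Presumably the admitted Workload assigns the cpu flavor to both PodSets and the assignment just never reaches the launcher Job's pod template. A rough sketch of the relevant Workload status (field names per the kueue v1beta1 API, values illustrative):

status:
  admission:
    clusterQueue: cpu-cluster-queue
    podSetAssignments:
      - name: launcher  # assignment exists, but is not propagated to the launcher Job
        flavors:
          cpu: cpu
          memory: cpu
      - name: worker  # propagated correctly to the worker pods
        flavors:
          cpu: cpu
          memory: cpu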

mimowo (Contributor) commented Nov 4, 2024

@GonzaloSaez thank you for the report - this clearly looks like a bug. Does it only happen when you use AdmissionChecks, or is it independent of that? Let us know if you suspect where the bug is, and feel free to propose a fix.

cc @tenzen-y @mbobrovskyi, who may also have some familiarity with or insight into where the issue is.

GonzaloSaez (Author) commented

@mimowo I think kubeflow/mpi-operator#670 should fix it. That said, there are more open questions regarding NodeSelectors changing after a job is suspended (i.e., the need to update more internals of the MPI Operator launcher Job when the NodeSelector or other details change). Given that this seems related more to the mpi-operator than to Kueue, should we consider closing this?

mimowo (Contributor) commented Nov 4, 2024

I see, thanks for explaining and driving the fix. I think we may actually want to consider adding e2e tests for MPIJob in Kueue to cover such critical aspects of the integration. WDYT, @tenzen-y?

mimowo (Contributor) commented Nov 8, 2024

FYI, for some more context, there is also a Slack discussion about this: https://kubernetes.slack.com/archives/C032ZE66A2X/p1730369507818399
