Can not run ENAS experiment #2494

Open
shubham-ojha-weheal opened this issue Jan 17, 2025 · 3 comments
Labels
area/nas help wanted Extra attention is needed kind/bug

Comments

shubham-ojha-weheal commented Jan 17, 2025

What happened?

I want to run an ENAS experiment on my own dataset, but first I wanted to confirm that ENAS experiments run properly in my Kubeflow deployment.
So I copied the YAML file at https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/enas-cpu.yaml to run it from the Katib UI.
I created the experiment by pasting in the YAML file as is.
The experiment gets stuck after it creates and runs the trials. The trial pods go into a NotReady state and stay there, as shown by kubectl get pods -n moderation (moderation is the namespace I am using).

One odd thing I noticed is that, instead of numbers, the Validation Accuracy metric displays the input passed to the Docker image. See Trial Details on Experiment Page below.

I installed Kubeflow v1.9.1, running on AWS, with the command while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done

Since the issue might be caused by my Kubernetes version (1.31), I also installed the latest Katib with the command kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-with-kubeflow?ref=master" . This gave the buttons in the UI a duller blue color, but the results were the same.

Here are the screenshots of the relevant pages:
Katib UI

Image

Katib UI Experiment Details

Image

Katib UI Trial Details

Image

Pod Statuses (output of kubectl get pods -n moderation)

NAME                                              READY   STATUS     RESTARTS   AGE
dataset-handler-0                                 2/2     Running    0          3d22h
enas-cpu-66jqk54g-6p6kn                           2/3     NotReady   0          7m11s
enas-cpu-cbndflxf-7c56m                           2/3     NotReady   0          7m11s
enas-cpu-enas-6595f7f74b-smwcl                    1/1     Running    0          7m22s
ml-pipeline-ui-artifact-6b44b849d7-9fthm          2/2     Running    0          26h
ml-pipeline-visualizationserver-5fcb5568f-fzm6z   2/2     Running    0          6d19h
pipelines-0                                       2/2     Running    0          2d3h
spanner-test-0                                    2/2     Running    0          4d3h
zas-74b68bb967-vnsq7                              2/2     Running    0          6d4h
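
The stuck trial pods in the listing above can be picked out mechanically: a pod is suspect when its READY count is lower than its container total. Below is a minimal Python sketch, assuming the default column layout of kubectl get pods. The helper name find_unready_pods is made up for illustration, and the sample is a subset of the listing above.

```python
def find_unready_pods(kubectl_output: str) -> list[tuple[str, int, int, str]]:
    """Return (name, ready, total, status) for pods whose containers are not all ready."""
    unready = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or malformed rows
        name, ready_col, status = fields[0], fields[1], fields[2]
        ready, total = (int(n) for n in ready_col.split("/"))
        if ready < total:
            unready.append((name, ready, total, status))
    return unready


# Assumption: default `kubectl get pods` columns (NAME, READY, STATUS, ...).
SAMPLE = """\
NAME                                              READY   STATUS     RESTARTS   AGE
dataset-handler-0                                 2/2     Running    0          3d22h
enas-cpu-66jqk54g-6p6kn                           2/3     NotReady   0          7m11s
enas-cpu-cbndflxf-7c56m                           2/3     NotReady   0          7m11s
enas-cpu-enas-6595f7f74b-smwcl                    1/1     Running    0          7m22s
"""

for name, ready, total, status in find_unready_pods(SAMPLE):
    print(f"{name}: {ready}/{total} containers ready ({status})")
```

A READY value of 2/3 only says that one of three containers is not ready. Which container that is has to be checked separately, for example with kubectl describe pod.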

Trial Details on Experiment Page

Image

YAML for Experiment from Katib UI

metadata:
  name: enas-cpu
  namespace: moderation
  uid: 39920b93-2bc8-40bb-9565-ec8c24c361d2
  resourceVersion: '5141549'
  generation: 1
  creationTimestamp: '2025-01-17T10:17:13Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:13Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
    - manager: katib-ui
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:13Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:nasConfig:
            .: {}
            f:graphConfig:
              .: {}
              f:inputSizes: {}
              f:numLayers: {}
              f:outputSizes: {}
            f:operations: {}
          f:objective:
            .: {}
            f:goal: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:trialTemplate:
            .: {}
            f:primaryContainerName: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:template:
                  .: {}
                  f:spec:
                    .: {}
                    f:containers: {}
                    f:restartPolicy: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:24Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:observation: {}
          f:runningTrialList: {}
          f:startTime: {}
          f:trials: {}
          f:trialsRunning: {}
      subresource: status
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-Accuracy
    metricStrategies:
      - name: Validation-Accuracy
        value: max
  algorithm:
    algorithmName: enas
  trialTemplate:
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - command:
                  - python3
                  - '-u'
                  - RunTrial.py
                  - '--num_epochs=1'
                  - >-
                    --architecture="${trialParameters.neuralNetworkArchitecture}"
                  - '--nn_config="${trialParameters.neuralNetworkConfig}"'
                image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:latest
                name: training-container
            restartPolicy: Never
    trialParameters:
      - name: neuralNetworkArchitecture
        description: >-
          NN architecture contains operations ID on each NN layer and skip
          connections between layers
        reference: architecture
      - name: neuralNetworkConfig
        description: >-
          Configuration contains NN number of layers, input and output sizes,
          description what each operation ID means
        reference: nn_config
    primaryContainerName: training-container
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 2
  maxTrialCount: 3
  maxFailedTrialCount: 2
  metricsCollectorSpec:
    collector:
      kind: StdOut
  nasConfig:
    graphConfig:
      numLayers: 1
      inputSizes:
        - 32
        - 32
        - 3
      outputSizes:
        - 10
    operations:
      - operationType: convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list:
                - '3'
                - '5'
                - '7'
          - name: num_filter
            parameterType: categorical
            feasibleSpace:
              list:
                - '32'
                - '48'
                - '64'
                - '96'
                - '128'
          - name: stride
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
      - operationType: separable_convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list:
                - '3'
                - '5'
                - '7'
          - name: num_filter
            parameterType: categorical
            feasibleSpace:
              list:
                - '32'
                - '48'
                - '64'
                - '96'
                - '128'
          - name: stride
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
          - name: depth_multiplier
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
      - operationType: depthwise_convolution
        parameters:
          - name: filter_size
            parameterType: categorical
            feasibleSpace:
              list:
                - '3'
                - '5'
                - '7'
          - name: stride
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
          - name: depth_multiplier
            parameterType: categorical
            feasibleSpace:
              list:
                - '1'
                - '2'
      - operationType: reduction
        parameters:
          - name: reduction_type
            parameterType: categorical
            feasibleSpace:
              list:
                - max_pooling
                - avg_pooling
          - name: pool_size
            parameterType: int
            feasibleSpace:
              max: '3'
              min: '2'
              step: '1'
  resumePolicy: Never
status:
  startTime: '2025-01-17T10:17:13Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2025-01-17T10:17:13Z'
      lastTransitionTime: '2025-01-17T10:17:13Z'
    - type: Running
      status: 'True'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2025-01-17T10:17:24Z'
      lastTransitionTime: '2025-01-17T10:17:24Z'
  currentOptimalTrial:
    observation: {}
  runningTrialList:
    - enas-cpu-cbndflxf
    - enas-cpu-66jqk54g
  trials: 2
  trialsRunning: 2
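
The successCondition and failureCondition strings in the spec above are GJSON path expressions evaluated against the status of the trial's Job. A rough reading is that #(type=="Complete")# keeps every condition whose type is Complete, the piped #(status=="True")# then keeps those whose status is True, and a non-empty result means the condition is met. The Python sketch below only illustrates that reading. Katib evaluates the real expression in Go, the function name job_condition_met is hypothetical, and the sample status dicts are invented.

```python
def job_condition_met(job_status: dict, cond_type: str) -> bool:
    """Approximate status.conditions.#(type==cond_type)#|#(status=="True")#."""
    conditions = job_status.get("conditions", [])
    # First filter: keep conditions of the requested type.
    by_type = [c for c in conditions if c.get("type") == cond_type]
    # Second filter: of those, keep the ones whose status is the string "True".
    matching = [c for c in by_type if c.get("status") == "True"]
    return bool(matching)


# Invented sample statuses, shaped like a batch/v1 Job status.
finished = {"conditions": [{"type": "Complete", "status": "True"}]}
still_active = {"active": 1}  # no conditions reported yet

print(job_condition_met(finished, "Complete"))      # True
print(job_condition_met(still_active, "Complete"))  # False
print(job_condition_met(still_active, "Failed"))    # False
```

Under this reading, a Job that never reports Complete or Failed matches neither expression, which is consistent with both trials staying in Running.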

YAML for Trial from Katib UI

metadata:
  name: enas-cpu-66jqk54g
  namespace: moderation
  uid: f4f2eeea-11a3-4637-b190-b3b9c7190836
  resourceVersion: '5141540'
  generation: 1
  creationTimestamp: '2025-01-17T10:17:24Z'
  labels:
    katib.kubeflow.org/experiment: enas-cpu
  ownerReferences:
    - apiVersion: kubeflow.org/v1beta1
      kind: Experiment
      name: enas-cpu
      uid: 39920b93-2bc8-40bb-9565-ec8c24c361d2
      controller: true
      blockOwnerDeletion: true
  finalizers:
    - clean-metrics-in-db
  managedFields:
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:24Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"clean-metrics-in-db": {}
          f:labels:
            .: {}
            f:katib.kubeflow.org/experiment: {}
          f:ownerReferences:
            .: {}
            k:{"uid":"39920b93-2bc8-40bb-9565-ec8c24c361d2"}: {}
        f:spec:
          .: {}
          f:failureCondition: {}
          f:metricsCollector:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
          f:objective:
            .: {}
            f:goal: {}
            f:metricStrategies: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parameterAssignments: {}
          f:primaryContainerName: {}
          f:runSpec:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:metadata:
              .: {}
              f:name: {}
              f:namespace: {}
            f:spec:
              .: {}
              f:template:
                .: {}
                f:spec:
                  .: {}
                  f:containers: {}
                  f:restartPolicy: {}
          f:successCondition: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2025-01-17T10:17:24Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:startTime: {}
      subresource: status
spec:
  objective:
    type: maximize
    goal: 0.99
    objectiveMetricName: Validation-Accuracy
    metricStrategies:
      - name: Validation-Accuracy
        value: max
  parameterAssignments:
    - name: architecture
      value: '[[11]]'
    - name: nn_config
      value: >-
        {'num_layers': 1, 'input_sizes': [32, 32, 3], 'output_sizes': [10],
        'embedding': {'11': {'opt_id': 11, 'opt_type': 'convolution',
        'opt_params': {'filter_size': '5', 'num_filter': '32', 'stride': '2'}}}}
  runSpec:
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: enas-cpu-66jqk54g
      namespace: moderation
    spec:
      template:
        spec:
          containers:
            - command:
                - python3
                - '-u'
                - RunTrial.py
                - '--num_epochs=1'
                - '--architecture="[[11]]"'
                - >-
                  --nn_config="{'num_layers': 1, 'input_sizes': [32, 32, 3],
                  'output_sizes': [10], 'embedding': {'11': {'opt_id': 11,
                  'opt_type': 'convolution', 'opt_params': {'filter_size': '5',
                  'num_filter': '32', 'stride': '2'}}}}"
              image: docker.io/kubeflowkatib/enas-cnn-cifar10-cpu:latest
              name: training-container
          restartPolicy: Never
  metricsCollector:
    collector:
      kind: StdOut
  primaryContainerName: training-container
  successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
  failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
status:
  startTime: '2025-01-17T10:17:24Z'
  conditions:
    - type: Created
      status: 'True'
      reason: TrialCreated
      message: Trial is created
      lastUpdateTime: '2025-01-17T10:17:24Z'
      lastTransitionTime: '2025-01-17T10:17:24Z'
    - type: Running
      status: 'True'
      reason: TrialRunning
      message: Trial is running
      lastUpdateTime: '2025-01-17T10:17:24Z'
      lastTransitionTime: '2025-01-17T10:17:24Z'
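
The parameterAssignments above show what each trial receives: architecture is a nested list of operation IDs, and nn_config is a Python-style dict literal with single quotes. Below is a small sketch of decoding those two values with the standard library. It is not the code in RunTrial.py, only an illustration of the data shape. decode_assignment is a hypothetical helper, and the loop assumes the first entry of each layer is the operation ID.

```python
import ast


def decode_assignment(raw: str):
    """Parse a trial argument value into Python data.

    Items in a Kubernetes command list are passed without shell processing,
    so the double quotes around the value arrive literally and are stripped here.
    """
    return ast.literal_eval(raw.strip().strip('"'))


# Values copied from the runSpec command in the trial YAML.
architecture = decode_assignment('"[[11]]"')
nn_config = decode_assignment(
    "\"{'num_layers': 1, 'input_sizes': [32, 32, 3], 'output_sizes': [10], "
    "'embedding': {'11': {'opt_id': 11, 'opt_type': 'convolution', "
    "'opt_params': {'filter_size': '5', 'num_filter': '32', 'stride': '2'}}}}\""
)

# Assumption: the first entry of each layer is the operation ID.
for layer in architecture:
    op = nn_config["embedding"][str(layer[0])]
    print(op["opt_type"], op["opt_params"])
```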

What did you expect to happen?

The experiment should complete without any issues.

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.32.0
Kustomize Version: v5.5.0
Server Version: v1.31.4-eks-2d5f260

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
docker.io/kubeflowkatib/katib-controller:latest

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.17.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: [email protected]
License: Apache License Version 2.0
Location: /opt/tensorflow/lib/python3.12/site-packages
Requires: certifi, grpcio, kubernetes, protobuf, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

@Electronic-Waste
Member

/remove-label lifecycle/needs-triage
/area nas
/help

/cc @kubeflow/wg-automl-leads


@google-oss-prow google-oss-prow bot added area/nas help wanted Extra attention is needed and removed lifecycle/needs-triage labels Jan 21, 2025

shubham-ojha-weheal commented Jan 23, 2025

I think I have found the issue.
I tried disabling Istio sidecar injection by adding this annotation to the trial template:

  trialTemplate:
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              "sidecar.istio.io/inject": "false"

and the trial ran successfully.
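
The same change can also be applied programmatically before an experiment is submitted. Below is a minimal Python sketch on a plain dict that mirrors the manifest structure. disable_istio_injection is a hypothetical helper name, and only the annotation key and value come from the snippet above.

```python
import copy

ISTIO_INJECT_ANNOTATION = "sidecar.istio.io/inject"


def disable_istio_injection(experiment: dict) -> dict:
    """Return a copy of the experiment with sidecar injection disabled for trial pods."""
    patched = copy.deepcopy(experiment)  # leave the caller's manifest untouched
    template = patched["spec"]["trialTemplate"]["trialSpec"]["spec"]["template"]
    annotations = template.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[ISTIO_INJECT_ANNOTATION] = "false"  # annotation values are strings
    return patched


# Trimmed-down manifest with only the fields the helper touches.
experiment = {
    "spec": {
        "trialTemplate": {
            "trialSpec": {
                "apiVersion": "batch/v1",
                "kind": "Job",
                "spec": {"template": {"spec": {"restartPolicy": "Never"}}},
            }
        }
    }
}

patched = disable_istio_injection(experiment)
print(patched["spec"]["trialTemplate"]["trialSpec"]["spec"]["template"]["metadata"])
```

The value is the string "false" rather than a boolean, matching the quoted form in the YAML.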

I think this is because the metrics logger and collector container waits until all the other containers are in either a completed or an error state.
When the training container finishes, the Istio container is still running, so the metrics container never reads the logs. If the logs are not read, the experiment doesn't continue.

I came to this conclusion after reading cmd/metricscollector/v1beta1/file-metricscollector/main.go, lines 400 to 430 on the master branch.
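
The behaviour described in this comment can be modelled in a few lines. This is a simplified Python illustration of the explanation above, not a port of Katib's Go code. The container names and state strings are assumptions chosen to resemble a trial pod with an injected Istio sidecar.

```python
FINISHED_STATES = {"completed", "error"}


def collector_can_proceed(container_states: dict[str, str], collector: str) -> bool:
    """True when every container other than the collector has finished."""
    return all(
        state in FINISHED_STATES
        for name, state in container_states.items()
        if name != collector
    )


# Assumed container names; the states are illustrative, not read from a cluster.
COLLECTOR = "metrics-logger-and-collector"
with_sidecar = {
    "training-container": "completed",
    COLLECTOR: "running",
    "istio-proxy": "running",  # sidecar keeps running after training ends
}
without_sidecar = {
    "training-container": "completed",
    COLLECTOR: "running",
}

print(collector_can_proceed(with_sidecar, COLLECTOR))     # False
print(collector_can_proceed(without_sidecar, COLLECTOR))  # True
```

In this model the sidecar never reaches a finished state, so the collector waits indefinitely. That would match the 2/3 READY count in the pod listing and the effect of disabling injection.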
