/remove-label lifecycle/needs-triage
/area nas
/help
/cc @kubeflow/wg-automl-leads
I think this happens because the metrics logger and collector container waits until all the other containers in the pod are in either a completed or an error state.
When the training container finishes, the Istio sidecar container is still running, which prevents the metrics container from reading the logs. If the logs are not read, the experiment doesn't continue.
I came to this conclusion after reading lines 400 to 430 of cmd/metricscollector/v1beta1/file-metricscollector/main.go on the master branch.
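To make the hypothesis above concrete, here is a minimal Python sketch of the suspected wait condition. It is a simplified model and not Katib's actual code: the `ContainerState` class, both helper functions, and the container names are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ContainerState:
    name: str
    finished: bool  # True once the container has completed or errored


# Hypothetical name for the metrics sidecar, used only in this sketch.
COLLECTOR = "metrics-logger-and-collector"


def collector_can_read_logs(containers, collector_name=COLLECTOR):
    """Return True only when every container except the collector has finished."""
    return all(c.finished for c in containers if c.name != collector_name)


def blocking_containers(containers, collector_name=COLLECTOR):
    """List the containers that keep the collector waiting."""
    return [
        c.name
        for c in containers
        if c.name != collector_name and not c.finished
    ]


# Made-up pod state: training is done, but the Istio sidecar never exits.
trial_pod = [
    ContainerState("training-container", finished=True),
    ContainerState("istio-proxy", finished=False),
    ContainerState(COLLECTOR, finished=False),
]

print(collector_can_read_logs(trial_pod))  # False
print(blocking_containers(trial_pod))      # ['istio-proxy']
```

If this model matches the real behavior, a trial pod that runs without the Istio sidecar should let the collector proceed. One way to check that could be to set the Istio annotation sidecar.istio.io/inject: "false" on the trial pod template, though I have not confirmed that this resolves the issue.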
What happened?
I want to run an ENAS experiment on my own dataset, but first I wanted to confirm that the ENAS example runs properly in my Kubeflow deployment.
So I copied the YAML file at https://github.com/kubeflow/katib/blob/master/examples/v1beta1/nas/enas-cpu.yaml to run in the Katib UI.
I created the experiment directly by pasting in the YAML file as is.
The experiment gets stuck after it creates and runs the trials. The trial pods go into a NotReady state after running, which I can see when I run
kubectl get pods -n moderation
(moderation is the namespace I am using).
One weird thing I noticed was that instead of displaying numbers in the Validation Accuracy metric, the UI displays the input to the Docker image. See Trial Details on Experiment Page below.
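To see which container keeps a trial pod in NotReady, the JSON form of the pod listing (kubectl get pods -n moderation -o json) can be inspected. The sketch below is a hypothetical helper, not part of Katib: it reads the standard status.containerStatuses field of each pod and reports the containers that have not terminated. The sample data is made up for illustration and is not output from my cluster.

```python
import json


def unfinished_containers(pod_list_json):
    """Map each pod name to the names of its containers that have not terminated."""
    pods = json.loads(pod_list_json).get("items", [])
    result = {}
    for pod in pods:
        statuses = pod.get("status", {}).get("containerStatuses", [])
        # A container state holds exactly one of: waiting, running, terminated.
        still_alive = [
            s["name"] for s in statuses if "terminated" not in s.get("state", {})
        ]
        if still_alive:
            result[pod["metadata"]["name"]] = still_alive
    return result


# Made-up sample shaped like `kubectl get pods -o json` output.
sample = json.dumps({
    "items": [{
        "metadata": {"name": "enas-cpu-trial-example"},  # hypothetical pod name
        "status": {"containerStatuses": [
            {"name": "training-container",
             "state": {"terminated": {"exitCode": 0}}},
            {"name": "istio-proxy", "state": {"running": {}}},
            {"name": "metrics-logger-and-collector", "state": {"running": {}}},
        ]},
    }]
})

print(unfinished_containers(sample))
```

With real data, the string passed in would be the captured output of the kubectl command instead of the sample.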
I installed Kubeflow v1.9.1, which is running on AWS, using the command
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 20; done
Since the issue might be caused by my use of Kubernetes version 1.31, I also installed the latest Katib using the command
kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-with-kubeflow?ref=master"
This gave the buttons on the UI a duller blue color, but the results were the same.
Here are the screenshots of the relevant pages:
Katib UI
Katib UI Experiment Details
Katib UI Trial Details
Pod Statuses (running kubectl get pods -n moderation)
Trial Details on Experiment Page
YAML for Experiment from Katib UI
YAML for Trial from Katib UI
What did you expect to happen?
The experiment should complete without any issues.
Environment
Kubernetes version:
Katib controller version:
$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
docker.io/kubeflowkatib/katib-controller:latest
Katib Python SDK version: