NeMo intermittent start-up failure with OMPI temp directory error.
Steps/Code to reproduce bug
Launch a multi-node NeMo training job on Kubernetes in privileged mode, without HostPID.
Observe that roughly 3-4% of the time, the NeMo job fails to launch with the following error:
ERROR 2024-11-13T14:41:01.114829205Z [resource.labels.containerName: megatron] --------------------------------------------------------------------------
ERROR 2024-11-13T14:41:01.114859401Z [resource.labels.containerName: megatron] A call to mkdir was unable to create the desired directory:
ERROR 2024-11-13T14:41:01.114863131Z [resource.labels.containerName: megatron] {}
ERROR 2024-11-13T14:41:01.114865257Z [resource.labels.containerName: megatron] Directory: /tmp/ompi.machine-name-redacted.0/pid.2284793
ERROR 2024-11-13T14:41:01.114867380Z [resource.labels.containerName: megatron] Error: No such file or directory
ERROR 2024-11-13T14:41:01.114871655Z [resource.labels.containerName: megatron] Please check to ensure you have adequate permissions to perform
ERROR 2024-11-13T14:41:01.114874427Z [resource.labels.containerName: megatron] the desired operation.
ERROR 2024-11-13T14:41:01.114876515Z [resource.labels.containerName: megatron] --------------------------------------------------------------------------
ERROR 2024-11-13T14:41:01.114878903Z [resource.labels.containerName: megatron] [machine-name-redacted:2284793] [[47352,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 107
ERROR 2024-11-13T14:41:01.114895028Z [resource.labels.containerName: megatron] [machine-name-redacted:2284793] [[47352,0],0] ORTE_ERROR_LOG: Error in file util/session_dir.c at line 346
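A note on reading the log above: "No such file or directory" is the text for ENOENT, which a single-level mkdir returns when a parent path component is missing. A permissions problem would normally surface as EACCES ("Permission denied") instead, so the hint about permissions in the message may be misleading here. The Python sketch below only illustrates that errno behaviour. It does not exercise Open MPI's session directory code, the directory names are hypothetical stand-ins for the `/tmp/ompi.<host>.<uid>/pid.<pid>` layout seen in the log, and it is not a confirmed diagnosis of why the parent would be absent.

```python
import errno
import os
import tempfile


def mkdir_errno(path: str) -> int:
    """Attempt a single-level mkdir; return 0 on success or the errno on failure."""
    try:
        os.mkdir(path)
    except OSError as exc:
        return exc.errno
    return 0


def demo() -> dict:
    """Compare mkdir results with the parent directory missing vs. present."""
    results = {}
    with tempfile.TemporaryDirectory() as base:
        # Hypothetical names that mimic the layout in the log; not real Open MPI paths.
        session_root = os.path.join(base, "ompi.example-host.0")
        pid_dir = os.path.join(session_root, "pid.12345")

        # Parent does not exist yet: mkdir on the child fails with ENOENT.
        results["parent_missing"] = mkdir_errno(pid_dir)

        # Once the parent exists, the same call succeeds.
        os.mkdir(session_root)
        results["parent_present"] = mkdir_errno(pid_dir)
    return results


if __name__ == "__main__":
    for case, code in demo().items():
        print(case, errno.errorcode.get(code, "OK"))
```

If this reading applies, a useful thing to check on a failing node is whether the parent directory under `/tmp` exists at launch time, or whether something else on the node removes it while the job is starting.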
Expected behavior
The NeMo training job should launch reliably on every attempt.
Environment overview (please complete the following information)
Environment location: GKE
Method of NeMo install: Used pre-built docker image from nvcr.io/nvidia/nemo:24.07