(add) deepspeed_mpi specific container, deepspeed_config for MPI with nodetaints #549
New file: Dockerfile for the DeepSpeed training image.
```dockerfile
# Official PyTorch image with CUDA support
FROM pytorch/pytorch:1.9.0-cuda11.1-cudnn8-runtime

# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions. Disabling
# StrictModes avoids directory and file permission checks.
# Update system packages and install dependencies.
RUN apt-get update && apt-get install -y \
    git \
    wget \
    build-essential \
    cmake \
    libopenmpi-dev \
    openssh-server \
    && rm -rf /var/lib/apt/lists/* \
    && echo " UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config \
    && sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config

# Install the DeepSpeed library
RUN pip install deepspeed
RUN mkdir /deepspeed

# Workspace for DeepSpeed examples
WORKDIR "/deepspeed"

# Clone the DeepSpeedExamples repository
RUN git clone https://github.com/microsoft/DeepSpeedExamples/

# Set the working directory to DeepSpeedExamples/training for the example models
WORKDIR "/deepspeed/DeepSpeedExamples/training"

# Set the default command to bash
CMD ["/bin/bash"]
```
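The `sed` line above rewrites the commented `StrictModes` default in `sshd_config` to `StrictModes no`. As a sanity check of what that expression does, here is the equivalent substitution sketched in Python — the sample input is an assumed Debian-style default, not taken from the image:

```python
import re

# Two lines as they might appear in a stock sshd_config
# (assumption: Debian-style commented defaults).
sshd_config = "#StrictModes yes\n#Port 22\n"

# Python equivalent of the Dockerfile's sed expression:
#   s/#\(StrictModes \).*/\1no/g
patched = re.sub(r"#(StrictModes ).*", r"\1no", sshd_config)

print(patched)  # "StrictModes no" followed by the untouched "#Port 22" line
```

Only lines matching the `#StrictModes ` prefix are rewritten; everything else passes through unchanged.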
New file: MPIJob manifest running the DeepSpeed CIFAR-10 example.
```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: deepspeed-mpijob
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            # Container with the DeepSpeed training image built from the
            # provided Dockerfile. Replace with your own image name and version.
            - image: <YOUR-DEEPSPEED-CONTAINER-NAME>:<VERSION>
              name: deepspeed-mpijob-container
              command:
                - mpirun
                - --allow-run-as-root
                - -np
                - "2"
                - -bind-to
                - none
                - -map-by
                - slot
                - -x
                - NCCL_DEBUG=INFO
                - -x
                - LD_LIBRARY_PATH
                - -x
                - PATH
                - -mca
                - pml
                - ob1
                - -mca
                - btl
                - ^openib
```
**Review comment** (on lines +24 to +39, the `mpirun` flags): are all of these necessary?

**Reply:** Not all strictly necessary, but these options are commonly used in both MPI workloads and mpi-operator examples. Do you think we need to remove these flags?

**Reply:** tbh, I left them as legacy from the very first examples I found for TensorFlow and Horovod, as I didn't know much about them.

**Reply:** If you know enough to keep just the bare basics, that would be better.
```yaml
                - python
                - cifar/cifar10_deepspeed.py
                - --deepspeed_mpi
                - --deepspeed
                - --deepspeed_config
                - ds_config.json
                - $@
```
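Per the review discussion about trimming the flags, a minimal launcher command might look like the sketch below. Which flags are safe to drop depends on the cluster's network fabric; `NCCL_DEBUG` and the `-mca` tuning are diagnostics and transport hints rather than hard requirements:

```yaml
command:
  - mpirun
  - --allow-run-as-root
  - -np
  - "2"
  - python
  - cifar/cifar10_deepspeed.py
  - --deepspeed_mpi
  - --deepspeed
  - --deepspeed_config
  - ds_config.json
```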
```yaml
    Worker:
      replicas: 2
      template:
        spec:
          # OPTIONAL: Taint toleration for a specific node pool
          #
          # Taints and tolerations are used to ensure that the DeepSpeed worker
          # pods are scheduled on the desired nodes. By applying taints to
          # nodes, you can repel pods that do not have the corresponding
          # tolerations. This is useful when you want to reserve nodes with
          # specific resources (e.g. GPU nodes) for particular workloads, such
          # as this DeepSpeed training job.
          #
          # In this example, the tolerations allow the DeepSpeed worker pods to
          # be scheduled on nodes with the specified taints (i.e., the node
          # pool with GPU resources), so the training job can utilize the
          # available GPU resources on those nodes.
          #
          # You can remove the tolerations if your cluster has no taints.
          tolerations:
            # Change the node pool name here
            - effect: NoSchedule
              key: nodepool
              operator: Equal
              value: nodepool-256ram32cpu2gpu-0
            # Taint toleration for GPU nodes
            - effect: NoSchedule
              key: nvidia.com/gpu
              operator: Equal
              value: present
          containers:
            # Container with the DeepSpeed training image built from the
            # provided Dockerfile. Replace with your own image name and version.
            - image: <YOUR-DEEPSPEED-CONTAINER-NAME>:<VERSION>
              name: deepspeed-mpijob-container
              resources:
                limits:
                  # Optional: varies by node pool
                  cpu: 30
                  memory: 230Gi
                  nvidia.com/gpu: 2
                requests:
                  # Optional: varies by node pool
                  cpu: 16
                  memory: 128Gi
                  nvidia.com/gpu: 1
```
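The launcher passes `--deepspeed_config ds_config.json`, so a `ds_config.json` must be present in the working directory; the cloned DeepSpeedExamples repository is expected to provide one for the cifar example. For reference, a minimal DeepSpeed config looks like the following — the values here are illustrative, not taken from the example repo:

```json
{
  "train_batch_size": 16,
  "optimizer": {
    "type": "Adam",
    "params": { "lr": 0.001 }
  },
  "fp16": { "enabled": true },
  "steps_per_print": 100
}
```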
**Review comment:** Have you tried using `mpioperator/base` instead?

**Reply:** Haven't tried yet; it could be better to use the `mpioperator/base` image and just install the CUDA dependencies for DeepSpeed with additional PyTorch / TensorFlow configuration.

**Reply:** Actually, probably better to use `mpioperator/openmpi`. If you can make it work, that'd be great, as proof that the base images can be extended. I couldn't get TensorFlow to work.

**Reply:** Will try to make it work for both and patch the PR 👍
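An extension of `mpioperator/openmpi` along the lines discussed might start out like this. This is an untested sketch: the package names, the assumption that the base image lacks Python, and the torch/CUDA pairing would all need to be verified against the actual base image:

```dockerfile
# Untested sketch — extending the mpi-operator OpenMPI base image
# instead of the official PyTorch image.
FROM mpioperator/openmpi

# Assumption: the base image is Debian-based and does not ship Python.
RUN apt-get update && apt-get install -y python3-pip git \
    && rm -rf /var/lib/apt/lists/*

# Assumption: the torch wheel must match the CUDA runtime on the worker nodes.
RUN pip3 install torch deepspeed

RUN git clone https://github.com/microsoft/DeepSpeedExamples/ /deepspeed/DeepSpeedExamples
WORKDIR "/deepspeed/DeepSpeedExamples/training"
```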