support for gpu queue #3642
base: master
Conversation
GPUtil is required for GPU queue management.
Codecov Report
Attention: Patch coverage is …

```
@@            Coverage Diff            @@
##           master    #3642    +/-   ##
=========================================
  Coverage   73.04%   73.05%
=========================================
  Files        1278     1278
  Lines       59356    59398      +42
=========================================
+ Hits        43359    43395      +36
- Misses      15997    16003       +6
```
Just to check my understanding: in this model, a GPU-enabled job gets exclusive access to one full GPU, so the GPU queue simply tracks the number of available GPUs against the number of GPU-enabled jobs? There's no notion of a job acquiring multiple GPUs or partial GPUs? From some quick searching, it's at least possible (though I don't know how common) to write programs that use multiple GPUs, so I think we should allow nodes to be tagged with multiple GPU threads. If the CPU usage of a process is negligible, I think it would be reasonable to say: `myproc = pe.Node(ProcessInterface(), n_threads=0, n_gpus=2)`
In the current implementation, the user specifies how many GPU slots (n_gpu_procs) the plugin should manage, and the plugin reserves those slots based on the node's n_threads property. If you think it's useful, we can allow the user to specify different values for "gpu_procs" and "cpu_procs" for each node.
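For concreteness, here is a minimal sketch of that behavior from the user's side. The toy interface below is hypothetical; it exists only to expose a `use_cuda` input, which is how this PR recognizes GPU nodes, and `n_gpu_procs` only has an effect on this branch:

```python
from nipype import Node, Workflow
from nipype.interfaces.base import (
    BaseInterfaceInputSpec, SimpleInterface, TraitedSpec, traits,
)


class _GPUDemoInputSpec(BaseInterfaceInputSpec):
    use_cuda = traits.Bool(False, usedefault=True, desc="run on the GPU")


class GPUDemoInterface(SimpleInterface):
    """Toy interface whose only purpose is to expose a ``use_cuda`` input."""

    input_spec = _GPUDemoInputSpec
    output_spec = TraitedSpec

    def _run_interface(self, runtime):
        # A real interface would launch GPU work here.
        return runtime


wf = Workflow(name="gpu_wf", base_dir=".")
gpu_node = Node(GPUDemoInterface(use_cuda=True), name="gpu_node")
wf.add_nodes([gpu_node])

# n_gpu_procs is the plugin option discussed in this PR; n_procs is the
# usual CPU slot count handled by MultiProc.
wf.run(plugin="MultiProc", plugin_args={"n_procs": 2, "n_gpu_procs": 1})
```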
I'm extremely sorry about how long it took me to get back to this. If you're still around and up to work on this, here's the review I started last May and just finished.
nipype/info.py
Outdated
```diff
@@ -149,6 +149,7 @@ def get_nipype_gitversion():
     "filelock>=3.0.0",
     "etelemetry>=0.2.0",
     "looseversion!=1.2",
+    "gputil==1.4.0",
```
Hard pins are a very bad idea. If you need a particular API, use `>=` to ensure it's present. We should avoid upper bounds as much as possible, although they are not always avoidable.
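A possible loosened specifier, keeping the currently pinned version as the floor (the exact minimum is an assumption; any release providing `getGPUs()` would do):

```python
REQUIRES = [
    # ... other requirements ...
    "gputil>=1.4.0",  # lower bound instead of a hard pin (minimum is assumed)
]
```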
nipype/pipeline/plugins/multiproc.py
Outdated
```python
@staticmethod
def gpu_count():
    n_gpus = 1
    try:
        import GPUtil

        return len(GPUtil.getGPUs())
    except ImportError:
        return n_gpus
```
This is a general utility; I would put it into `nipype.pipeline.plugins.tools` as a function, not a static method.
Also consider:
```suggestion
@staticmethod
def gpu_count():
    try:
        import GPUtil
    except ImportError:
        return 1
    else:
        return len(GPUtil.getGPUs())
```
As a rule, I try to keep the section inside a `try` block as short as possible, to avoid accidentally catching other exceptions that are raised. An `else` block can contain anything that depends on the success of the `try` block.
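A contrived sketch of the failure mode being guarded against (this is not real GPUtil behavior; the nested import error is invented purely for illustration):

```python
# Contrived illustration: with a broad try block, any ImportError raised
# *inside* the GPU query is swallowed and misreported as "GPUtil missing".
def getGPUs():  # stand-in for GPUtil.getGPUs()
    raise ImportError("hypothetical nested failure, e.g. a missing NVML lib")


def gpu_count_broad():
    try:
        return len(getGPUs())  # nested ImportError is caught below by accident
    except ImportError:
        return 1


print(gpu_count_broad())  # prints 1 and hides the real error; the narrow
                          # try/except/else above would let it propagate
```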
nipype/pipeline/engine/nodes.py
Outdated
```python
return (hasattr(self.inputs, 'use_cuda') and self.inputs.use_cuda) or (
    hasattr(self.inputs, 'use_gpu') and self.inputs.use_gpu
)
```
```suggestion
return bool(getattr(self.inputs, 'use_cuda', False)) or bool(
    getattr(self.inputs, 'use_gpu', False))
```
nipype/pipeline/plugins/multiproc.py
Outdated
```python
'Total number of GPUs proc requested (%d) exceeds the available number of GPUs (%d) on the system. Using requested GPU slots at your own risk!'
% (self.n_gpu_procs, self.n_gpus_visible)
```
Loggers accept format strings and their arguments and only actually interpolate them if the logging event is emitted:
```suggestion
'Total number of GPUs proc requested (%d) exceeds the available number of GPUs (%d) on the system. Using requested GPU slots at your own risk!',
self.n_gpu_procs, self.n_gpus_visible)
```
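For illustration, a minimal standalone example of the deferred interpolation (the logger name and values here are made up):

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("demo")

n_gpu_procs, n_gpus_visible = 4, 2

# The format string and its arguments are passed separately; logging only
# interpolates them if a record at this level is actually emitted.
logger.warning(
    "Total number of GPU procs requested (%d) exceeds the available number "
    "of GPUs (%d) on the system.",
    n_gpu_procs,
    n_gpus_visible,
)
```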
nipype/pipeline/plugins/multiproc.py
Outdated
```python
if is_gpu_node:
    free_gpu_slots -= next_job_gpu_th
```
Note that this is releasing resource claims that were made around line 356 so the next time through the loop sees available resources.
```suggestion
if is_gpu_node:
    free_gpu_slots += next_job_gpu_th
```
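A toy sketch of the claim/release bookkeeping at issue (variable names borrowed from the diff; this is not the actual scheduling loop):

```python
# Claims made when a job is dispatched must be released with += (not claimed
# again with -=) when the dispatch is rolled back, so the next scheduling
# pass sees the slots as free again.
free_gpu_slots = 2
next_job_gpu_th = 1
is_gpu_node = True

# claim when the GPU job is (tentatively) dispatched
if is_gpu_node:
    free_gpu_slots -= next_job_gpu_th

# ... dispatch is rolled back for some reason ...

# release the claim
if is_gpu_node:
    free_gpu_slots += next_job_gpu_th

assert free_gpu_slots == 2  # back to the starting state
```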
```python
    )
    continue

free_memory_gb -= next_job_gb
free_processors -= next_job_th
if is_gpu_node:
    free_gpu_slots -= next_job_gpu_th
```
I would expect this to be hit by your test, but coverage shows it's not. Can you look into this?
Maybe I missed that because I never used updatehash=True, but it seems that no test covers that. Should we add a test with that option?
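A hedged sketch of what such a test could look like (node and helper names are made up; modeled loosely on the existing MultiProc plugin tests):

```python
import nipype.pipeline.engine as pe
from nipype.interfaces.utility import Function


def _add_one(x):
    return x + 1


def test_multiproc_updatehash(tmp_path):
    wf = pe.Workflow(name="wf", base_dir=str(tmp_path))
    node = pe.Node(
        Function(input_names=["x"], output_names=["out"], function=_add_one),
        name="add_one",
    )
    node.inputs.x = 1
    wf.add_nodes([node])

    # First run executes the node and writes its hash file.
    wf.run(plugin="MultiProc")
    # Second run with updatehash=True should only refresh hashes, not rerun.
    wf.run(plugin="MultiProc", updatehash=True)
```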
Moreover, that error does not impact "common" use (I have a project that includes this GPU support code).
While I was looking into this I found two errors in the updatehash functionality. I sent pull request #3709 to fix the bigger one.
The second is that in the MultiProc plugin EVERY node is executed in the main thread when updatehash=True, so no multiprocessing happens at all. I will try to send a pull request for that too (maybe after this GPU support is merged, to avoid having to handle merge conflicts).
I wrote a simpler implementation of this old pull request to handle a queue of threads to be executed on the GPU.
The user can specify the maximum number of parallel GPU threads with the plugin option n_gpu_procs.
The MultiProc plugin will raise an exception if a node requires more threads than allowed, in the same way as for classic CPU threads (see the sketch below).
Note that in this implementation any GPU node also allocates a CPU slot (is that necessary? We can change that behavior).
Moreover, the plugin doesn't check that the system actually has a CUDA-capable GPU (we can add such a check if you think we need it).
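For illustration, a toy sketch of the admission check described above (plain Python, not the plugin source; the exception type and message are made up):

```python
# Toy admission check: a node asking for more GPU threads than the plugin
# manages is rejected up front, mirroring the existing CPU-thread check.
n_gpu_procs = 2        # GPU slots managed by the plugin (plugin_args option)
node_gpu_threads = 4   # GPU threads requested by a single node

if node_gpu_threads > n_gpu_procs:
    raise RuntimeError(
        "Node requires %d GPU slots, but the plugin only manages %d"
        % (node_gpu_threads, n_gpu_procs)
    )
```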