support for gpu queue #3642

mauriliogenovese · 2024-03-22T13:09:18Z

I wrote a simpler implementation of this old pull request to handle a queue of threads to be executed on the GPU.
The user can specify the maximum number of parallel GPU threads with the plugin option n_gpu_procs.
The MultiProc plugin will raise an exception if a node requires more GPU threads than allowed, in the same way as it does for classic CPU threads.
Note that in this implementation any GPU node also allocates a CPU slot (is that necessary? We can change that behavior).
Moreover, the plugin doesn't check that the system actually has a CUDA-capable GPU (we can add such a check if you think we need it).

gputil is required for GPU queue management.
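
As a rough illustration of the check described above, the sketch below shows one way a scheduler could reject a node that asks for more GPU slots than n_gpu_procs allows. The function name `check_gpu_request` and the error message are hypothetical and are not taken from the code in this pull request.

```python
def check_gpu_request(node_name, requested_gpu_threads, n_gpu_procs):
    """Raise if a node asks for more GPU slots than the plugin manages.

    Hypothetical helper: it mirrors the behaviour described for CPU threads,
    it is not the code in multiproc.py.
    """
    if requested_gpu_threads > n_gpu_procs:
        raise RuntimeError(
            "Node %s requests %d GPU threads but only %d are allowed (n_gpu_procs)"
            % (node_name, requested_gpu_threads, n_gpu_procs)
        )
    return True
```

With the option from this pull request, a run would presumably be configured through the usual plugin arguments, for example `plugin_args={'n_procs': 8, 'n_gpu_procs': 2}`.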

codecov · 2024-03-25T06:43:00Z

Codecov Report

Attention: Patch coverage is 86.66667% with 6 lines in your changes missing coverage. Please review.

Project coverage is 73.05%. Comparing base (bc456dd) to head (610f1cb).

Files with missing lines	Patch %	Lines
nipype/pipeline/plugins/multiproc.py	83.33%	4 Missing ⚠️
nipype/pipeline/plugins/tools.py	71.42%	2 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #3642   +/-   ##
=======================================
  Coverage   73.04%   73.05%           
=======================================
  Files        1278     1278           
  Lines       59356    59398   +42     
=======================================
+ Hits        43359    43395   +36     
- Misses      15997    16003    +6

effigies · 2024-03-29T12:13:00Z

Just to check my understanding: in this model, a GPU-enabled job gets exclusive access to one full GPU, so the GPU queue is simply the number of available GPUs and the number of GPU-enabled jobs? There's no notion of a job acquiring multiple GPUs or partial GPUs?

From some quick searching, it's at least possible (though I don't know how common) to write programs that utilize multiple GPUs, so I think we should allow nodes to be tagged with multiple GPU threads.

If the CPU usage of a process is negligible, I think it would be reasonable to say:

myproc = pe.Node(ProcessInterface(), n_threads=0, n_gpus=2)

mauriliogenovese · 2024-03-29T13:01:07Z

In the current implementation the user specifies how many n_gpu_procs the plugin should manage and the plugin will reserve those "slots" based on the node.n_threads property. If you think it's useful we can allow the user to specify different values for "gpu_procs" and "cpu_procs" for each node.
What should be the behaviour if the user does not specify the n_gpus property? n_gpus=n_threads?
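
To make the two options in this exchange concrete, here is a minimal sketch of a per-node resource request that tracks CPU and GPU slots separately. `ResourceRequest`, `resolve_gpus`, and the `default_to_threads` switch are all hypothetical. The thread does not settle what the default for an unset n_gpus should be, so the sketch leaves that choice as a parameter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceRequest:
    """Hypothetical per-node request; not a nipype class."""

    n_threads: int = 1
    n_gpus: Optional[int] = None  # None means the user did not set it


def resolve_gpus(req: ResourceRequest, default_to_threads: bool) -> int:
    """Return the GPU slots to reserve for a GPU-enabled node."""
    if req.n_gpus is not None:
        return req.n_gpus
    # Open question from the discussion: fall back to n_threads, or to one GPU?
    return req.n_threads if default_to_threads else 1
```

With `n_threads=0, n_gpus=2` the explicit value wins under either default, so the open question only affects nodes that leave n_gpus unset.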

effigies

I'm extremely sorry about how long it took me to get back to this. If you're still around and up to work on this, here's the review I started last May and just finished.

effigies · 2024-03-29T11:54:40Z

nipype/info.py

@@ -149,6 +149,7 @@ def get_nipype_gitversion():
    "filelock>=3.0.0",
    "etelemetry>=0.2.0",
    "looseversion!=1.2",
+    "gputil==1.4.0",


Hard pins are a very bad idea. If you need a particular API, use >= to ensure it's present. We should avoid upper bounds as much as possible, although they are not always avoidable.
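
For illustration, the difference between a hard pin and a lower bound can be shown with a toy matcher. This is a simplified sketch that only understands `==` and `>=` on dotted numeric versions. It is not how pip resolves requirements, `satisfies` is a hypothetical helper, and the version numbers are made up for the example.

```python
def parse(version):
    """Turn '1.4.0' into (1, 4, 0) for comparison."""
    return tuple(int(part) for part in version.split("."))


def satisfies(installed, spec):
    """Check an installed version against '==X' or '>=X' (toy matcher)."""
    if spec.startswith("=="):
        return parse(installed) == parse(spec[2:])
    if spec.startswith(">="):
        return parse(installed) >= parse(spec[2:])
    raise ValueError("unsupported specifier: %s" % spec)


# A hard pin rejects any later release; a lower bound accepts it.
print(satisfies("1.4.1", "==1.4.0"))  # False
print(satisfies("1.4.1", ">=1.4.0"))  # True
```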

effigies · 2025-01-17T13:08:43Z

nipype/pipeline/plugins/multiproc.py

+    @staticmethod
+    def gpu_count():
+        n_gpus = 1
+        try:
+            import GPUtil
+
+            return len(GPUtil.getGPUs())
+        except ImportError:
+            return n_gpus


This is a general utility, I would put it into nipype.pipeline.plugins.tools as a function, not a static method.

Also consider:

Suggested change

-    @staticmethod
-    def gpu_count():
-        n_gpus = 1
-        try:
-            import GPUtil
-            return len(GPUtil.getGPUs())
-        except ImportError:
-            return n_gpus
+    @staticmethod
+    def gpu_count():
+        try:
+            import GPUtil
+        except ImportError:
+            return 1
+        else:
+            return len(GPUtil.getGPUs())

As a rule, I try to keep the section inside a try block as short as possible, to avoid accidentally catching other exceptions that are raised. An else block can contain anything that depends on the success of the try block.
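
A small self-contained sketch of why the short try block matters. It uses a dictionary lookup instead of an import, so `run_handler`, `broken`, and the registry are hypothetical stand-ins rather than nipype code. Only the lookup is guarded, so a `KeyError` raised later by the handler itself is not silently swallowed.

```python
def run_handler(registry, name):
    """Look up a handler by name and call it, returning None if it is missing."""
    try:
        handler = registry[name]
    except KeyError:
        # Only the lookup is guarded, so this branch means "name not registered".
        return None
    else:
        # A KeyError raised inside handler() propagates instead of being
        # mistaken for a missing registration.
        return handler()


def broken():
    return {}["missing"]  # raises KeyError from inside the handler
```

Had the call to `handler()` been placed inside the try block, the failure in `broken` would have been reported as "not registered".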

effigies · 2025-01-17T13:11:29Z

nipype/pipeline/engine/nodes.py

+        return (hasattr(self.inputs, 'use_cuda') and self.inputs.use_cuda) or (
+            hasattr(self.inputs, 'use_gpu') and self.inputs.use_gpu
+        )


Suggested change

-        return (hasattr(self.inputs, 'use_cuda') and self.inputs.use_cuda) or (
-            hasattr(self.inputs, 'use_gpu') and self.inputs.use_gpu
-        )
+        return bool(getattr(self.inputs, 'use_cuda', False)) or bool(
+            getattr(self.inputs, 'use_gpu', False))
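
The two forms can be compared on plain objects. The sketch below uses `SimpleNamespace` as a stand-in for `node.inputs`, which is an assumption: real nipype input specs are trait objects and may behave differently for unset values. The function names are hypothetical.

```python
from types import SimpleNamespace


def is_gpu_hasattr(inputs):
    return (hasattr(inputs, "use_cuda") and inputs.use_cuda) or (
        hasattr(inputs, "use_gpu") and inputs.use_gpu
    )


def is_gpu_getattr(inputs):
    return bool(getattr(inputs, "use_cuda", False)) or bool(
        getattr(inputs, "use_gpu", False)
    )


cases = [
    SimpleNamespace(),
    SimpleNamespace(use_cuda=True),
    SimpleNamespace(use_gpu=1),
    SimpleNamespace(use_cuda=False, use_gpu=False),
]
for inputs in cases:
    # Same truthiness; only the getattr form always returns a real bool.
    assert bool(is_gpu_hasattr(inputs)) == is_gpu_getattr(inputs)
```

For `SimpleNamespace(use_gpu=1)` the `hasattr` form returns the integer `1`, while the `getattr` form returns `True`.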

effigies · 2025-01-17T13:13:19Z

nipype/pipeline/plugins/multiproc.py

+                'Total number of GPUs proc requested (%d) exceeds the available number of GPUs (%d) on the system. Using requested GPU slots at your own risk!'
+                % (self.n_gpu_procs, self.n_gpus_visible)


Loggers accept format strings and their arguments and only actually interpolate them if the logging event is emitted:

Suggested change

-                'Total number of GPUs proc requested (%d) exceeds the available number of GPUs (%d) on the system. Using requested GPU slots at your own risk!'
-                % (self.n_gpu_procs, self.n_gpus_visible)
+                'Total number of GPUs proc requested (%d) exceeds the available number of GPUs (%d) on the system. Using requested GPU slots at your own risk!',
+                self.n_gpu_procs, self.n_gpus_visible)
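
The deferred interpolation can be observed with the standard `logging` module. In this sketch the `Expensive` class and the logger name are hypothetical; the class simply counts how many times it is rendered to a string.

```python
import logging


class Expensive:
    """Counts how often it is rendered to a string."""

    def __init__(self):
        self.renders = 0

    def __str__(self):
        self.renders += 1
        return "expensive"


logger = logging.getLogger("gpu_queue_example")  # hypothetical logger name
logger.setLevel(logging.WARNING)
logger.addHandler(logging.NullHandler())
logger.propagate = False

eager, lazy = Expensive(), Expensive()
logger.debug("value: %s" % eager)  # formatted even though DEBUG is disabled
logger.debug("value: %s", lazy)    # never formatted: the record is dropped

print(eager.renders, lazy.renders)  # 1 0
```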

effigies · 2025-01-17T13:25:12Z

nipype/pipeline/plugins/multiproc.py

+                if is_gpu_node:
+                    free_gpu_slots -= next_job_gpu_th


Note that this is releasing resource claims that were made around line 356 so the next time through the loop sees available resources.

Suggested change

-                if is_gpu_node:
-                    free_gpu_slots -= next_job_gpu_th
+                if is_gpu_node:
+                    free_gpu_slots += next_job_gpu_th
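
The claim-and-release bookkeeping can be sketched in isolation. This is a simplified model and not the scheduling loop in multiproc.py: the `schedule` function, the tuple shape of `jobs`, and the `submit` callback are all hypothetical.

```python
def schedule(jobs, free_processors, free_gpu_slots, submit):
    """Claim slots for each job and give them back if submission fails.

    jobs: list of (name, n_threads, n_gpu_threads) tuples (hypothetical shape).
    submit: callable returning True when the job was actually started.
    Returns the started job names and the remaining free slots.
    """
    started = []
    for name, n_th, n_gpu_th in jobs:
        if n_th > free_processors or n_gpu_th > free_gpu_slots:
            continue  # not enough resources right now
        # Claim resources before trying to run the job.
        free_processors -= n_th
        free_gpu_slots -= n_gpu_th
        if submit(name):
            started.append(name)
        else:
            # Release the claim: note the += that mirrors the -= above.
            free_processors += n_th
            free_gpu_slots += n_gpu_th
    return started, free_processors, free_gpu_slots
```

If the release used `-=` instead, a failed submission would leave the GPU count lower than before the claim, and later jobs that fit would be skipped.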

effigies · 2025-01-17T13:27:49Z

nipype/pipeline/plugins/multiproc.py

                )
                continue

            free_memory_gb -= next_job_gb
            free_processors -= next_job_th
+            if is_gpu_node:
+                free_gpu_slots -= next_job_gpu_th


I would expect this to be hit by your test, but coverage shows it's not. Can you look into this?

mauriliogenovese

Maybe I missed that because I never used updatehash=True, but it seems that no test includes it. Should we add a test with that option?

Moreover, that error does not affect "common" use (I have a project that includes this GPU support code).

While I was looking into this I found two errors in the updatehash functionality. I sent pull request #3709 to fix the bigger one.
The second is that in the MultiProc plugin EVERY node is executed in the main thread if updatehash=True, so no multiprocessing takes place. I will try to send a pull request for that too (maybe after this GPU support is merged, to avoid merge conflicts).

mauriliogenovese added 3 commits March 22, 2024 13:56

support for gpu queue

0720aa1

gputil requirement

6c47dc0

gputils is required for gpu queue management

Update info.py

f1f5d76

refactor and fix

a642430

effigies added this to the 1.9.0 milestone Mar 29, 2024

effigies requested changes Jan 17, 2025

mauriliogenovese added 9 commits January 18, 2025 16:37

removed hard pin

684b9b0

gpu_count refactor

8f74c5d

more readable

a307845

logger argument

27448bc

Merge branch 'master' into enh/cuda_support

7e57ab9

code refactory

2c2c066

Merge branch 'enh/cuda_support' of https://github.com/mauriliogenoves…

133dc0a

…e/nipype into enh/cuda_support

newlines for style check

66d6280

newline for code check

610f1cb
