
Check average clock frequency during benchmarks #193

Open
bernhardmgruber opened this issue Nov 25, 2024 · 11 comments

Comments

@bernhardmgruber
Collaborator

Sometimes, benchmark systems are unstable due to external factors and GPUs cannot sustain their clock frequency during a benchmark. This leads to incorrect results.

NVBench should monitor the clock frequency during benchmarking and detect such conditions. One way is to query the global timer and an SM's cycle counter before and after the benchmark, and compute the average frequency:

// Launched with one block per SM so that some block lands on SM 0.
__global__ void get_timestamps_kernel(uint64_t* global_timestamp, uint64_t* sm0_timestamp) {
  uint32_t smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  if (smid == 0) {  // record timestamps from a single, fixed SM
    uint64_t gts, lts;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(gts));  // wall clock in ns
    lts = clock64();                                         // SM cycle counter

    *global_timestamp = gts;
    *sm0_timestamp    = lts;
  }
}

ulong2 get_timestamps() {
  const int device_id = 0;
  int num_sms         = 0;
  CUDA_CHECK(cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id));

  uint64_t* timestamps;
  CUDA_CHECK(cudaHostAlloc(&timestamps, 2 * sizeof(uint64_t), cudaHostAllocMapped));
  get_timestamps_kernel<<<num_sms, 1>>>(&timestamps[0], &timestamps[1]);
  CUDA_CHECK(cudaDeviceSynchronize());
  ulong2 ret{timestamps[0], timestamps[1]};
  CUDA_CHECK(cudaFreeHost(timestamps));
  return ret;
}

float measure_clock_frequency(std::function<void(void)> f) {
  ulong2 ts0 = get_timestamps();
  f();
  ulong2 ts1              = get_timestamps();
  uint64_t elapsed_ns     = ts1.x - ts0.x;
  uint64_t elapsed_clocks = ts1.y - ts0.y;
  // cycles per ns == GHz; the factor converts to kHz
  float clock_rate        = float(elapsed_clocks) / float(elapsed_ns) * 1000000.f;
  return clock_rate;
}

where f launches the kernel to benchmark. If the computed clock_rate deviates noticeably from the expected value, we should issue a warning.

@fbusato commented Nov 27, 2024

This can also be achieved with the NVML library:

  • nvmlDeviceGetClockInfo(): get the current SM clock before/after the execution, link
  • nvmlDeviceGetCurrentClocksThrottleReasons(): check for clock throttle events, link

@bernhardmgruber
Collaborator Author

this can be also achieved with the nvml library

[...]

@ahendriksen, do you happen to know whether we can use those APIs instead of the approach outlined in the code above?

@ahendriksen

We can't: according to the API docs, nvmlDeviceGetClockInfo "Retrieves the current clock speeds for the device". That is not what we want.

@bernhardmgruber
Collaborator Author

We can't: according to the API docs, nvmlDeviceGetClockInfo "Retrieves the current clock speeds for the device". That is not what we want.

Yes, but would nvmlDeviceGetCurrentClocksThrottleReasons detect what we are trying to detect?

@ahendriksen

It could. But there is one issue the NVML APIs do not help with: checking what happened over time. They return an instantaneous result.

If you run a benchmark for 30 seconds, it doesn't matter what the clock throttle reason or the clock frequency is at the end of the benchmark. It matters what the average clock frequency was during those 30 seconds. If you can additionally get clock throttle reasons, that would be nice, but it is not necessary. Throttle reasons matter when debugging the hardware, for instance to determine whether throttling happened due to thermal or power constraints. They don't help with debugging software.

@bernhardmgruber
Collaborator Author

@ahendriksen I always love how well you can specify a problem! Thx.

So let's implement the approach in the code above and see whether NVML can give us some useful additional diagnostics.

@fbusato commented Nov 27, 2024

@ahendriksen sorry, why can't nvmlDeviceGetClockInfo() be used? You can use the API to get the SM clock at short intervals while the kernel is running. This is equivalent to running nvidia-smi in the background to monitor clocks, power consumption, etc.

@ahendriksen

You can use the API to get the SM clock at short intervals while the kernel is running.

As you say, it is indeed not impossible. However, it's not a great solution for several reasons:

  1. you are polling the GPU during a benchmark
  2. it's error prone (you could miss one of the 100ms intervals, skewing the result)
  3. it requires spinning up a separate thread during the benchmark

The proposed solution only requires running a kernel once before and once after the benchmark is done. It is used widely within Nvidia, and it works.

Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?

@fbusato commented Nov 27, 2024

I'm just trying to understand pros & cons of these approaches.

you are polling the GPU during a benchmark

You can measure clocks before and after, as in the custom-kernel approach.

it's error prone (you could miss one of the 100ms intervals, skewing the result)

The polling interval is quite short, on the order of microseconds. Doing before/after is even worse, both with the custom kernel and with NVML.

Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?

I have seen the nvml/nvidia-smi approach be effective in benchmarking GEMMs. I'm aware that the Nsight Compute/Systems profilers prefer to patch the binary directly.

@ahendriksen commented Nov 27, 2024 via email

@ahendriksen

Just to clarify, in the code that Bernhard pasted, f can contain the entire benchmark run:

float measure_clock_frequency(std::function<void(void)> f) {
  ulong2 ts0 = get_timestamps();              // get clock cycles + time in nanoseconds
  f();                                        // <-- run all iterations of the benchmark
  ulong2 ts1              = get_timestamps(); // get clock cycles + time in nanoseconds
  uint64_t elapsed_ns     = ts1.x - ts0.x;
  uint64_t elapsed_clocks = ts1.y - ts0.y;
  float clock_rate        = float(elapsed_clocks) / float(elapsed_ns) * 1000000.f;
  return clock_rate;
}

3 participants