
Check average clock frequency during benchmarks #193

Open
bernhardmgruber opened this issue Nov 25, 2024 · 11 comments

Comments

@bernhardmgruber
Collaborator

Sometimes, benchmark systems are unstable due to external factors and GPUs cannot sustain their clock frequency during a benchmark. This leads to incorrect results.

NVBench should monitor the clock frequency during benchmarking and detect such conditions. One way is to query the global timer and an SM's cycle counter before and after the benchmark, and compute the average frequency:

// Launched with one block per SM so that some block lands on SM 0.
__global__ void get_timestamps_kernel(uint64_t* global_timestamp, uint64_t* sm0_timestamp) {
  uint32_t smid;
  asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
  if (smid == 0) {  // record timestamps from a single, fixed SM
    uint64_t gts, lts;
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(gts));  // wall clock in ns
    lts = clock64();                                         // SM cycle counter

    *global_timestamp = gts;
    *sm0_timestamp    = lts;
  }
}

ulong2 get_timestamps() {
  const int device_id = 0;
  int num_sms         = 0;
  CUDA_CHECK(cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device_id));

  uint64_t* timestamps;
  CUDA_CHECK(cudaHostAlloc(&timestamps, 2 * sizeof(uint64_t), cudaHostAllocMapped));
  get_timestamps_kernel<<<num_sms, 1>>>(&timestamps[0], &timestamps[1]);
  CUDA_CHECK(cudaDeviceSynchronize());
  ulong2 ret{timestamps[0], timestamps[1]};
  CUDA_CHECK(cudaFreeHost(timestamps));
  return ret;
}

float measure_clock_frequency(std::function<void(void)> f) {
  ulong2 ts0 = get_timestamps();
  f();
  ulong2 ts1              = get_timestamps();
  uint64_t elapsed_ns     = ts1.x - ts0.x;
  uint64_t elapsed_clocks = ts1.y - ts0.y;
  // cycles per ns == GHz; the factor converts to kHz
  float clock_rate        = float(elapsed_clocks) / float(elapsed_ns) * 1000000.f;
  return clock_rate;
}

where f launches the kernel to benchmark. If the computed clock_rate deviates noticeably from the expected value, we should issue a warning.

@fbusato commented Nov 27, 2024

This can also be achieved with the NVML library:

  • nvmlDeviceGetClockInfo(): get the current SM clock before/after the execution, link
  • nvmlDeviceGetCurrentClocksThrottleReasons(): check for clock throttle events, link

@bernhardmgruber
Collaborator Author

this can be also achieved with the nvml library

[...]

@ahendriksen, do you happen to know whether we can use those APIs instead of the approach outlined in the code above?

@ahendriksen

We can't: according to the API docs, nvmlDeviceGetClockInfo "Retrieves the current clock speeds for the device". That is not what we want.

@bernhardmgruber
Collaborator Author

We can't: according to the API docs, nvmlDeviceGetClockInfo "Retrieves the current clock speeds for the device". That is not what we want.

Yes, but would nvmlDeviceGetCurrentClocksThrottleReasons detect what we are trying to detect?

@ahendriksen

It could. But there is one issue the NVML APIs do not help with: checking what happened over time. They return an instantaneous result.

If you run a benchmark for 30 seconds, it doesn't matter what the clock throttle reason or the clock frequency is at the end of the benchmark. It matters what the average clock frequency was during those 30 seconds. If you can additionally get clock throttle reasons, that would be nice, but it is not necessary. Throttle reasons matter when debugging the hardware, for instance to determine whether throttling happened due to thermal or power constraints. They don't help with debugging software.

@bernhardmgruber
Collaborator Author

@ahendriksen I always love how well you can specify a problem! Thx.

So let's implement the approach in the code above and see whether NVML can give us some useful additional diagnostics.

@fbusato commented Nov 27, 2024

@ahendriksen sorry, why can't nvmlDeviceGetClockInfo() be used? You can use the API to get the SM clock at short intervals while the kernel is running. This is equivalent to running nvidia-smi in the background to monitor clocks, power consumption, etc.

@ahendriksen

You can use the API to get the SM clock at short intervals while the kernel is running.

As you say, it is indeed not impossible. However, it's not a great solution for several reasons:

  1. you are polling the GPU during a benchmark
  2. it's error prone (you could miss one of the 100ms intervals, skewing the result)
  3. it requires spinning up a separate thread during the benchmark

The proposed solution only requires running a kernel once before and once after the benchmark is done. It is used widely within Nvidia, and it works.

Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?

@fbusato commented Nov 27, 2024

I'm just trying to understand pros & cons of these approaches.

you are polling the GPU during a benchmark

You can measure clocks before and after, as in the custom-kernel approach.

it's error prone (you could miss one of the 100ms intervals, skewing the result)

The polling interval is quite short, on the order of microseconds. Doing before/after is even worse, both with the custom kernel and with NVML.

Is there a specific reason that we would want to exhaust all other possible options before using anything but the nvml API?

I have seen the nvml/nvidia-smi approach be effective in benchmarking GEMMs. I'm aware that the Nsight Compute/Systems profilers prefer to patch the binary directly.

@ahendriksen commented Nov 27, 2024 via email

@ahendriksen

Just to clarify, in the code that Bernhard pasted, f can contain the entire benchmark run:

float measure_clock_frequency(std::function<void(void)> f) {
  ulong2 ts0 = get_timestamps();              // get clock cycles + time in nanoseconds
  f();                                        // <-- run all iterations of the benchmark
  ulong2 ts1              = get_timestamps(); // get clock cycles + time in nanoseconds
  uint64_t elapsed_ns     = ts1.x - ts0.x;
  uint64_t elapsed_clocks = ts1.y - ts0.y;
  float clock_rate        = float(elapsed_clocks) / float(elapsed_ns) * 1000000.f;
  return clock_rate;
}

3 participants