Benjamin Kušen
January 29, 2024

Shared GPU Resources: Efficient Access Part 4

Welcome to part 4 of our series on efficient access to shared GPU resources. Make sure you’re up to date before reading this one.

In this part, we delve into benchmarking results for time slicing on NVIDIA cards. In the previous parts, we examined the advantages and disadvantages of various sharing solutions within the Kubernetes environment (Part 1), walked through the setup and configuration intricacies (Part 2), and described the benchmarking use cases (Part 3). NVIDIA cards are the focus of this series, but other vendors may offer similar mechanisms.

Configuration

The benchmark setup remains consistent across all use cases:

  • Deactivated time slicing (GPU Passthrough)
  • Activated time slicing (the numerical value indicates the number of processes scheduled on the same GPU):
      • Shared x1
      • Shared x2
      • Shared x4
      • Shared x8

Benchmarking time slicing presents challenges because the processes must start at the same time. A Deployment or a ReplicaSet is unsuitable, as its pods launch with varying start times. Since the GPU executes the processes in a round-robin fashion, we instead start longer-running GPU processes in advance, eliminating the need for startup synchronization.

For instance, in benchmarking a script within a "Shared x4" GPU configuration, we can:

  • Initiate 3 pods executing the same script for an extended duration.
  • Simultaneously, start the fourth pod, ensuring it starts and concludes while sharing the GPU with the other three.

For additional details on driver installation, time-slicing configuration, and environment setup, refer to Part 2 of this series.
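
As a rough illustration of this procedure, the sketch below launches three long-running "background" pods and then times a fourth benchmark pod, i.e. the "Shared x4" scenario described above. This is a minimal sketch, not the exact harness used for these benchmarks: the manifest names (background-pod-*.yaml, benchmark-pod.yaml) and the pod name benchmark-pod are hypothetical, and in practice the runtime reported by the script inside the pod is what gets recorded.

```python
import subprocess
import time

# Hypothetical manifests: three long-running pods that keep the GPU busy, plus
# the pod whose runtime we actually measure (the "Shared x4" scenario).
BACKGROUND_MANIFESTS = [f"background-pod-{i}.yaml" for i in range(1, 4)]
BENCHMARK_MANIFEST = "benchmark-pod.yaml"   # bare pod with restartPolicy: Never

def kubectl(*args: str) -> None:
    """Run a kubectl command and fail loudly if it errors."""
    subprocess.run(["kubectl", *args], check=True)

# 1. Start the background pods and give them time to begin producing GPU load.
for manifest in BACKGROUND_MANIFESTS:
    kubectl("apply", "-f", manifest)
time.sleep(60)

# 2. Start the benchmark pod and wait until it completes (kubectl >= 1.23 for
#    the jsonpath form of `kubectl wait`). The wall-clock time below includes
#    scheduling and image pull; the runtime logged inside the pod is more precise.
start = time.time()
kubectl("apply", "-f", BENCHMARK_MANIFEST)
kubectl("wait", "--for=jsonpath={.status.phase}=Succeeded",
        "pod/benchmark-pod", "--timeout=3600s")
print(f"Shared x4 wall-clock time: {time.time() - start:.1f} s")

# 3. Clean up the background pods.
for manifest in BACKGROUND_MANIFESTS:
    kubectl("delete", "-f", manifest)
```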

Measuring FLOPS

Floating-point operations per second (FLOPS) is the metric used to gauge a GPU's performance across various data formats. FLOPS are measured with dcgmproftester, an NVIDIA CUDA-based test load generator. For additional details, refer to the earlier blog post.
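
As a rough sketch of how such a load can be generated, the snippet below wraps dcgmproftester in Python. The binary name (dcgmproftester12 here), the profiling field IDs, and the -t and -d flags follow commonly documented DCGM usage, but should be verified against the installed DCGM version; this is not necessarily the exact invocation used for these benchmarks.

```python
import subprocess

# DCGM profiling field IDs commonly used with dcgmproftester (verify against
# your DCGM version): 1008 = fp16 pipe, 1007 = fp32 pipe, 1006 = fp64 pipe,
# 1004 = tensor cores.
FIELD_IDS = {"fp16": 1008, "fp32": 1007, "fp64": 1006, "tensor": 1004}

def run_flops_load(precision: str, seconds: int = 60) -> None:
    """Generate a CUDA load for the given precision; dcgmproftester prints TFLOP/s."""
    subprocess.run(
        [
            "dcgmproftester12",                # binary name depends on the DCGM major version
            "--no-dcgm-validation",
            "-t", str(FIELD_IDS[precision]),   # target profiling field
            "-d", str(seconds),                # test duration in seconds
        ],
        check=True,
    )

if __name__ == "__main__":
    run_flops_load("fp16")
```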

fp16 Performance Metrics

<table>
 <tr>
   <th></th>
   <th>Passthrough</th>
   <th>Shared x1</th>
   <th>Shared x2</th>
   <th>Shared x4</th>
   <th>Shared x8</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process</td>
   <td>32.866</td>
   <td>32.700</td>
   <th>15.933</th>
   <th>7.956</th>
   <th>3.968</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process * number of processes</td>
   <td>32.866</td>
   <td>32.700</td>
   <th>31.867</th>
  <th>31.824</th>
   <th>31.824</th>
 </tr>
 <tr>
   <td>Performance Loss (compared to Passthrough)</td>
   <td>-</td>
   <td>0.5%</td>
   <th>3.03%</th>
   <th>3.17%</th>
   <th>3.41%</th>
 </tr>
</table>                 
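
The derived rows in these tables follow directly from the per-process averages: the aggregate is the per-process TFLOPS multiplied by the number of processes, and the loss is the relative drop versus passthrough. The snippet below reproduces the fp16 Shared x2 figures from the table above (small differences are due to rounding).

```python
# Values taken from the fp16 table above (Shared x2 column).
passthrough_tflops = 32.866     # single process, passthrough
per_process_tflops = 15.933     # average per process with Shared x2
num_processes = 2

aggregate = per_process_tflops * num_processes                        # ~31.87 TFLOPS
loss_pct = (passthrough_tflops - aggregate) / passthrough_tflops * 100

print(f"aggregate: {aggregate:.3f} TFLOPS, loss: {loss_pct:.2f} %")
# Prints ~31.866 TFLOPS and ~3.0 %, matching the reported values up to rounding.
```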

fp32 Performance Metrics

<table>
 <tr>
   <th></th>
    <th>Passthrough</th>
   <th>Shared x1</th>
   <th>Shared x2</th>
   <th>Shared x4</th>
   <th>Shared x8</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process</td>
   <td>16.898</td>
   <td>16.879</td>
   <th>7.880</th>
   <th>3.945</th>
   <th>1.974</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process * number of processes</td>
   <td>16.898</td>
   <td>16.879</td>
   <th>15.76</th>
   <th>15.783</th>
   <th>15.795</th>
 </tr>
 <tr>
   <td>Performance Loss (compared to Passthrough)</td>
    <td>-</td>
   <td>0.11%</td>
   <th>6.73%</th>
   <th>6.59%</th>
   <th>6.52%</th>
 </tr>
</table>    

fp64 Performance Metrics

<table>
 <tr>
   <th></th>
   <th>Passthrough</th>
   <th>Shared x1</th>
   <th>Shared x2</th>
   <th>Shared x4</th>
   <th>Shared x8</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process</td>
   <td>8.052</td>
   <td>8.050</td>
   <th>3.762</th>
   <th>1.871</th>
   <th>0.939</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process * number of processes</td>
    <td>8.052</td>
   <td>8.050</td>
   <th>7.524</th>
   <th>7.486</th>
   <th>7.515</th>
 </tr>
 <tr>
   <td>Performance Loss (compared to Passthrough)</td>
   <td>-</td>
   <td>0.02%</td>
   <th>6.55%</th>
   <th>7.03%</th>
   <th>6.67%</th>
 </tr>
</table>                                

fp16 Tensor Cores Performance Metrics

<table>
 <tr>
   <th></th>
   <th>Passthrough</th>
   <th>Shared x1</th>
   <th>Shared x2</th>
   <th>Shared x4</th>
   <th>Shared x8</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process</td>
   <td>165.992</td>
   <td>165.697</td>
   <th>81.850</th>
   <th>41.161</th>
   <th>20.627</th>
 </tr>
 <tr>
   <td>Average TFLOPS per process * number of processes</td>
   <td>165.992</td>
   <td>165.697</td>
   <th>163.715</th>
   <th>164.645</th>
   <th>165.021</th>
 </tr>
 <tr>
   <td>Performance Loss (compared to Passthrough)</td>
   <td>-</td>
   <td>0.17%</td>
   <th>1.37%</th>
   <th>0.81%</th>
   <th>0.58%</th>
</tr>
</table>

Key Findings

  • Enabling time slicing with only one process utilizing the GPU (shared x1) results in a negligible time slicing penalty, with a performance loss of less than 0.5%.
  • In scenarios where the GPU undergoes context switching (shared x2), there is approximately a 6% performance decrease for fp32 and fp64, around 3% for fp16, and about 1.37% for fp16 on tensor cores.
  • Introducing additional processes sharing the GPU does not incur an additional penalty. The performance loss remains consistent for shared x2, shared x4, and shared x8 configurations.
  • The reasons behind the varying performance losses across different data formats are not currently understood and require further investigation.

Compute-Intensive Particle Simulation

Simulation plays a crucial role in compute-intensive workloads and can derive substantial advantages from GPU acceleration. The benchmarking for this assessment centers on the LHC simpletrack simulation. For additional details, refer to the earlier blog post.
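
In the comparison tables that follow, the expected runtime when the process count doubles is simply twice the runtime of the previous configuration, and the loss is the relative slowdown beyond that ideal scaling (reported as 0 when the measured time stays below the expectation). A small helper makes the arithmetic explicit; the example values are taken from the Shared x1 vs Shared x2 table below.

```python
def time_slicing_loss(prev_runtime_s: float, actual_runtime_s: float) -> float:
    """Relative slowdown (%) versus ideal 2x scaling when the process count doubles."""
    expected = 2 * prev_runtime_s
    return max(0.0, (actual_runtime_s - expected) / expected * 100)

# Example: 5 000 000 particles, Shared x1 (27.03 s) -> Shared x2 (72.59 s).
print(f"{time_slicing_loss(27.03, 72.59):.2f} %")   # ~34.3 %, i.e. the reported 34.27 % up to rounding
```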

Passthrough vs Shared x1

<table>
 <tr>
   <th>Number of particles</th>
   <th>Passthrough [s]</th>
   <th>Shared x1 [s]</th>
   <th>Loss [%]</th>
 </tr>
  <tr>
   <td>5 000 000</td>
   <td>26.365</td>
   <td>27.03</td>
   <th>2.52</th>
 </tr>
 <tr>
   <td>10 000 000</td>
   <td>51.135</td>
   <td>51.93</td>
   <th>1.55</th>
 </tr>
 <tr>
   <td>15 000 000</td>
   <td>76.374</td>
   <td>77.12</td>
   <th>0.97</th>
 </tr>
 <tr>
   <td>20 000 000</td>
   <td>99.55</td>
   <td>99.91</td>
   <th>0.36</th>
 </tr>
 <tr>
   <td>30 000 000</td>
   <td>151.57</td>
   <td>152.61</td>
   <th>0.68</th>
 </tr>
</table>

Shared x1 vs Shared x2

<table>
 <tr>
   <th>Number of particles</th>
   <th>Shared x1 [s]</th>
   <th>Expected Shared x2 = 2*Shared x1 [s]</th>
   <th>Actual Shared x2 [s]</th>
   <th>Loss [%]</th>
 </tr>
 <tr>
   <td>5 000 000</td>
   <td>27.03</td>
   <td>54.06</td>
   <th>72.59</th>
   <th>34.27</th>
 </tr>
 <tr>
   <td>10 000 000</td>
   <td>51.93</td>
   <td>103.86</td>
   <th>138.76</th>
   <th>33.6</th>
  </tr>
 <tr>
   <td>15 000 000</td>
   <td>77.12</td>
   <td>154.24</td>
   <th>212.71</th>
   <th>37.9</th>
 </tr>
 <tr>
   <td>20 000 000</td>
   <td>99.91</td>
   <td>199.82</td>
   <th>276.23</th>
   <th>38.23</th>
  </tr>
 <tr>
   <td>30 000 000</td>
   <td>152.61</td>
   <td>305.22</td>
   <th>423.08</th>
   <th>38.61</th>
 </tr>
</table>

Shared x2 vs Shared x4

<table>
  <tr>
   <th>Number of particles</th>
   <th>Shared x2 [s]</th>
   <th>Expected Shared x4 = 2*Shared x2 [s]</th>
   <th>Actual Shared x4 [s]</th>
   <th>Loss [%]</th>
 </tr>
 <tr>
   <td>5 000 000</td>
   <td>72.59</td>
    <td>145.18</td>
   <th>142.63</th>
   <th>0</th>
 </tr>
 <tr>
   <td>10 000 000</td>
   <td>138.76</td>
   <td>277.52</td>
   <th>281.98</th>
   <th>1.6</th>
 </tr>
  <tr>
   <td>15 000 000</td>
   <td>212.71</td>
   <td>425.42</td>
   <th>421.55</th>
   <th>0</th>
 </tr>
 <tr>
   <td>20 000 000</td>
   <td>276.23</td>
   <td>552.46</td>
   <th>546.19</th>
   <th>0</th>
 </tr>
  <tr>
   <td>30 000 000</td>
   <td>423.08</td>
   <td>846.16</td>
   <th>838.55</th>
   <th>0</th>
 </tr>
</table>

Shared x4 vs Shared x8 Performance Comparison

In the case of shared x8, inputs exceeding 30,000,000 particles will lead to an Out Of Memory (OOM) error. Effectively managing the memory consumption of each process becomes a significant challenge when employing the time-slicing sharing mechanism.
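
Time slicing provides no memory isolation between processes, so any limit has to be enforced at the application level. As one possible mitigation (not used in these benchmarks), a PyTorch workload such as the training benchmark later in this post can cap its own share of device memory:

```python
import torch

# Application-level workaround: each of the 8 processes sharing the GPU restricts
# itself to ~1/8 of the device memory. This only constrains PyTorch's caching
# allocator; it offers no protection against other frameworks or processes.
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(1.0 / 8, device=0)
```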

<table>
  <tr>
   <th>Number of particles</th>
   <th>Shared x4 [s]</th>
   <th>Expected Shared x8 = 2*Shared x4 [s]</th>
   <th>Actual Shared x8 [s]</th>
   <th>Loss [%]</th>
 </tr>
 <tr>
   <td>5 000 000</td>
   <td>142.63</td>
   <td>285.26</td>
   <th>282.56</th>
   <th>0</th>
 </tr>
 <tr>
   <td>10 000 000</td>
   <td>281.98</td>
   <td>563.96</td>
   <th>561.98</th>
    <th>0</th>
 </tr>
 <tr>
   <td>15 000 000</td>
   <td>421.55</td>
   <td>843.1</td>
   <th>838.22</th>
   <th>0</th>
 </tr>
  <tr>
   <td>20 000 000</td>
   <td>546.19</td>
   <td>1092.38</td>
   <th>1087.99</th>
   <th>0</th>
 </tr>
  <tr>
   <td>30 000 000</td>
   <td>838.55</td>
   <td>1677.1</td>
   <th>1672.95</th>
   <th>0</th>
 </tr>
</table>

Findings and Conclusions

  • Enabling time slicing (shared x1) results in a very slight performance loss of roughly 2.5% at most.
  • The transition from shared x1 to shared x2, which introduces GPU context switching, causes the execution time to nearly triple instead of doubling, a substantial performance loss of approximately 38%.
  • Increasing the number of processes (shared x4, shared x8) does not lead to additional performance loss, indicating a consistent performance impact.

Machine Learning Training

For benchmarking, a pretrained model will be utilized and fine-tuned using PyTorch. To optimize GPU utilization, ensure that the script is not CPU-bound by adjusting the number of data loader workers and batch size. Additional details can be found in the preceding blog post.
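
As a rough sketch of such a setup (assuming the Hugging Face Trainer API; the model and dataset below are stand-ins, not the ones used in the original benchmark), the relevant knobs look like this:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Stand-in model and dataset purely for illustration.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("imdb", split="train[:1000]")   # small subset, e.g. 1 000 samples
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

training_args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=48,   # lowered as more processes share the GPU
    per_device_eval_batch_size=48,
    dataloader_num_workers=8,         # keep the GPU from becoming CPU/IO-bound
    num_train_epochs=1,
    report_to="none",
)

Trainer(model=model, args=training_args, train_dataset=dataset).train()
```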

Passthrough vs. Shared x1 Performance Comparison

Training Arguments:

  • per_device_train_batch_size = 48
  • per_device_eval_batch_size = 48
  • dataloader_num_workers = 8

<table>
 <tr>
   <th>Number of samples</th>
   <th>Passthrough [s] </th>
   <th>Shared x1 [s]</th>
   <th>Loss [%]</th>
 </tr>
 <tr>
   <td>500</td>
   <td>16.497</td>
   <td>16.6078</td>
   <th>0.67</th>
 </tr>
 <tr>
   <td>1 000</td>
   <td>31.2464</td>
   <td>31.4142</td>
   <th>0.53</th>
 </tr>
 <tr>
    <td>2 000</td>
    <td>61.1451</td>
   <td>61.3885</td>
   <th>0.39</th>
 </tr>
 <tr>
   <td>5 000</td>
   <td>150.8432</td>
   <td>151.1182</td>
   <th>0.18</th>
 </tr>
  <tr>
   <td>10 000</td>
   <td>302.2547</td>
   <td>302.4283</td>
   <th>0.05</th>
 </tr>
</table>

Shared x1 vs. Shared x2 Performance Comparison

Training Arguments:

  • per_device_train_batch_size = 24
  • per_device_eval_batch_size = 24
  • dataloader_num_workers = 4

<table>
 <tr>
   <th>Number of samples</th>
   <th>Shared x1 [s]</th>
   <th>Expected Shared x2 = 2*Shared x1 [s]</th>
    <th>Actual Shared x2 [s]</th>
    <th>Loss [%]</th>
 </tr>
 <tr>
   <td>500</td>
   <td>16.9597</td>
   <td>33.9194</td>
   <th>36.7628</th>
   <th>8.38</th>
 </tr>
 <tr>
   <td>1 000</td>
   <td>32.8355</td>
   <td>65.671</td>
   <th>72.9985</th>
   <th>11.15</th>
 </tr>
 <tr>
   <td>2 000</td>
   <td>64.2533</td>
   <td>128.5066</td>
   <th>143.3033</th>
   <th>11.51</th>
 </tr>
  <tr>
   <td>5 000</td>
   <td>161.5249</td>
   <td>323.0498</td>
   <th>355.0302</th>
   <th>9.89</th>
 </tr>
</table>

Shared x2 vs. Shared x4 Performance Comparison

Training Arguments:

  • per_device_train_batch_size = 12
  • per_device_eval_batch_size = 12
  • dataloader_num_workers = 2

<table>
 <tr>
   <th>Number of samples</th>
   <th>Shared x2 [s]</th>
   <th>Expected Shared x4 = 2*Shared x2 [s]</th>
   <th>Actual Shared x4 [s]</th>
   <th>Loss [%]</th>
  </tr>
 <tr>
   <td>500</td>
   <td>39.187</td>
   <td>78.374</td>
   <th>77.2388</th>
   <th>0</th>
 </tr>
  <tr>
   <td>1 000</td>
   <td>77.3014</td>
    <td>154.6028</td>
   <th>153.4177</th>
   <th>0</th>
 </tr>
 <tr>
   <td>2 000</td>
   <td>154.294</td>
   <td>308.588</td>
   <th>306.0012</th>
   <th>0</th>
 </tr>
  <tr>
   <td>5 000</td>
   <td>385.6539</td>
   <td>771.3078</td>
   <th>762.5113</th>
    <th>0</th>
 </tr>
</table>

Shared x4 vs. Shared x8 Performance Comparison

Training Arguments:

  • per_device_train_batch_size = 4
  • per_device_eval_batch_size = 4
  • dataloader_num_workers = 1

<table>
 <tr>
   <th>Number of samples</th>
   <th>Shared x4 [s] </th>
   <th>Expected Shared x8 = 2*Shared x4 [s]</th>
   <th>Actual Shared x8 [s]</th>
   <th>Loss [%]</th>
 </tr>
 <tr>
   <td>500</td>
   <td>104.6849</td>
   <td>209.3698</td>
   <th>212.6313</th>
   <th>1.55</th>
 </tr>
 <tr>
   <td>1 000</td>
   <td>185.1633</td>
   <td>370.3266</td>
   <th>381.7454</th>
   <th>3.08</th>
 </tr>
 <tr>
    <td>2 000</td>
   <td>397.8525</td>
   <td>795.705</td>
   <th>816.353</th>
   <th>2.59</th>
 </tr>
  <tr>
   <td>5 000</td>
   <td>1001.752</td>
   <td>2003.504</td>
   <th>1999.2395</th>
   <th>0</th>
 </tr>
</table>

Key Findings and Conclusions

The loss in machine learning training performance on a GPU with time slicing enabled is negligible, at less than 0.7%. Scaling from shared x1 to shared x2 results in approximately a 2.2x increase in overall execution time, corresponding to a time-slicing loss of around 11%. Increasing the number of processes further (shared x4, shared x8) has minimal additional impact, in the range of 0-3%.

Key Takeaways

  1. Considering potential GPU utilization improvements, the penalty introduced by enabling time slicing but having only one process using the GPU (Shared x1) can be disregarded.
  2. A variable penalty is introduced when the GPU needs to perform context switching (shared x1 vs shared x2). For the particle simulation, the execution time nearly triples instead of doubling, a performance loss of approximately 38%.
  3. For more than two processes sharing the GPU (shared x2, shared x4, shared x8), the execution time scales linearly. There is no extra penalty if the number of processes sharing a GPU increases, showcasing efficient scaling.
  4. The impact of time slicing is case-specific: a) for workloads that are sensitive to context switching, the penalty introduced by time slicing can be significant, with an observed loss of about 38%; b) for tasks that are IO-bound, CPU-bound, or heavily utilize tensor cores, the penalty tends to be smaller. For instance, in the case of ML training, the penalty dropped to approximately 11%.
  5. If the processes collectively consume more memory than is available, some may terminate with out-of-memory (OOM) errors. The challenge of managing memory without the ability to set limits or priorities, along with potential workarounds, was discussed earlier in this series. While time slicing can introduce a significant performance penalty, applied to the right use cases it is a powerful method for enhancing GPU utilization.

For a more comprehensive overview, consult the available resources. In the upcoming blog post, we will delve into extensive MIG benchmarking. Stay tuned for an in-depth exploration of MIG and its performance insights!
