Advanced Usage of the NVIDIA Device Plugin

Manage the NVIDIA Device Plugin through understanding common troubleshooting techniques, exploring GPU sharing strategies, and considering the limitations of the current tooling.

Keegan McCallum
Engineer

Published on March 19, 2024


Introduction

In the previous post, we explored the fundamentals of using the NVIDIA Device Plugin to manage GPU resources in Kubernetes clusters. If you haven’t read it yet, we highly recommend starting with Part 1: Accelerating Machine Learning with GPUs in Kubernetes using the NVIDIA Device Plugin to gain a solid understanding of the device plugin’s basic concepts and installation process.

While the NVIDIA Device Plugin simplifies GPU management in Kubernetes, you may encounter some common issues during implementation, so we'll start with troubleshooting techniques for the problems you're most likely to hit. From there, we'll look at customizing the values.yaml file, which is an important step if you want to support diverse workloads, maximize GPU utilization, and use the device plugin in various environments.

Finally, we’ll wrap up by discussing some of the limitations of the nvidia-device-plugin, and some considerations for using it with production workloads.

Troubleshooting Common Issues

Code running in container isn’t recognizing GPU resources

First, if possible, make sure to use one of the official NVIDIA Docker images as your base image. These images take care of wiring up everything you need to actually use CUDA from your code. If you're using a framework like PyTorch, it will usually publish CUDA image tags that you should be leveraging to cleanly integrate with the underlying GPUs.
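
As a rough sketch, a pod spec along these lines runs a CUDA check on top of an official PyTorch CUDA image and requests a single GPU (the pod name and image tag are just examples, so check the tags your framework actually publishes):

apiVersion: v1
kind: Pod
metadata:
  name: torch-cuda-check
spec:
  restartPolicy: Never
  containers:
    - name: torch
      # Official PyTorch image built against CUDA; pick a tag compatible with your node's driver
      image: pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime
      command: ["python", "-c", "import torch; print(torch.cuda.is_available())"]
      resources:
        limits:
          nvidia.com/gpu: 1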

If you are unable to use one of the official Docker images, you'll need to familiarize yourself with the nvidia-container-toolkit and ensure you have configured things correctly. Inspecting the values of the NVIDIA_VISIBLE_DEVICES and NVIDIA_DRIVER_CAPABILITIES environment variables from inside the container can help diagnose issues; full documentation on the environment variables is available here.
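
For example, assuming your pod is named gpu-workload (a hypothetical name), you can dump the NVIDIA-related variables the container actually sees with:

kubectl exec gpu-workload -- env | grep ^NVIDIA

If NVIDIA_VISIBLE_DEVICES is empty or missing entirely, the container runtime isn't injecting the GPUs at all, which points you toward the node-level toolkit/runtime configuration rather than your application.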

Finally, if everything else seems correct and you are using official base images, the problem could be the code itself. Try running the minimal example from part 1 to see if CUDA acceleration is working on your cluster for that minimal workload. If it is, double-check the documentation for the framework you are using to ensure you've correctly configured your application to take advantage of CUDA for acceleration.

Debugging Common XID Errors

XID errors are errors generated by the NVIDIA driver and printed to the operating system's kernel log or event log. An XID error code indicates that a general GPU error occurred, most often due to the driver programming the GPU incorrectly or to corruption of the commands sent to the GPU. The error can indicate a hardware problem, an NVIDIA software problem, or an application problem. These can be monitored using the dcgm-exporter with the metric named DCGM_FI_DEV_XID_ERRORS. Here are three common error codes you may encounter and how to troubleshoot them (a quick way to pull XID events from the kernel log follows the list):

  1. **XID 13: Graphics Engine Exception.** This may be a hardware issue. Run field diagnostics to confirm, and if it's not hardware it may be an issue with your application code. NVIDIA provides some guidance for troubleshooting here.
  2. **XID 31: GPU memory page fault.** This is most likely an issue with application code. If this comes up after an update to the nvidia-device-plugin or other drivers on the node (and the application hasn't changed), roll back and file an issue, as the problem is on NVIDIA's side.
  3. **XID 48: Double Bit ECC Error.** If this error code is followed by an XID 63 (row-remapping recording event on A100s) or XID 64 (row-remapping failure on A100s), then drain/cordon the node, wait for all work to complete, and reset the GPU(s) reporting the XID by restarting the VM. Otherwise, run field diagnostics to gather more information.
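
As a quick check, XID events land in the kernel log on the node itself, so if you can get a shell on the GPU node, something like the following will surface any recent ones:

dmesg -T | grep -i xid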

OOM Issues When Running Multiple Workloads on a Single GPU

When running multiple pods with access to the same GPU, you may run into Out of Memory errors on the GPU. If you're using PyTorch, for example, it will look something like this:

RuntimeError: CUDA out of memory. Tried to allocate X MiB (GPU X; X GiB total capacity; X GiB already allocated; X MiB free; X cached)

This is almost always due to using time-slicing to share access to the GPU rather than one of the other GPU sharing options we'll discuss in the next section. Time-slicing doesn't provide any memory isolation whatsoever, so if your workloads have high memory requirements, or try to take advantage of the maximum amount of memory available, time-slicing isn't really an option for you. It is useful for running many small workloads that don't require the full power of a GPU simultaneously, but all workloads will need to fit into memory and be configured on the application side to only use a set amount of memory. If you're using the dcgm-exporter to export GPU metrics to Prometheus, you can use the DCGM_FI_DEV_MEM_COPY_UTIL metric to monitor memory utilization for a given GPU. Using MIG or MPS will allow you to provide memory isolation for your workloads in a way familiar to those used to Kubernetes memory isolation.
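
For example, assuming dcgm-exporter is already scraping your nodes, a simple query like the following surfaces GPUs whose memory utilization is running hot (the 90% threshold is arbitrary):

DCGM_FI_DEV_MEM_COPY_UTIL > 90

If you want actual framebuffer numbers in MiB rather than a utilization percentage, dcgm-exporter also exposes the DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE metrics.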

Advanced Configuration

Sharing Access to GPUs

We covered the basics of GPU sharing in part 1, so if you haven't read it yet, it's worth doing so to get a high-level understanding of the different options available. In this post I'll dive into some of the more advanced configuration options available for each mode and why you might want to take advantage of them.

To start, configuration for MIG is fairly simple: the sharing strategy is really all that matters here. Just note that while you can use time-slicing and MIG together, using MPS and MIG at the same time is not supported.

For MPS and time-slicing, there are a few configuration options that you may be interested in (a combined example follows the list):

  • renameByDefault: This option is disabled by default, and its purpose is to allow end users of your cluster to differentiate between shared GPUs and full GPUs. When enabled, each resource is advertised under the name <resource-name>.shared instead of just <resource-name>.
  • failRequestsGreaterThanOne: The purpose of this field is to raise awareness that requesting more than one GPU replica does not result in any more access to the GPU. For example, a pod that requests 2 GPUs with time-slicing does not get twice the compute of a pod that requests 1 GPU. For MPS, it's important to note that this field is ALWAYS set to true, as can be seen in this commit. The rationale here is that the logic for actually allocating multiple GPU partitions to a single workload under MPS, especially when there are multiple GPUs on the node, is ambiguous. This may be supported in the future, but for now I'd recommend setting this field to true in all cases, since for MPS you don't get a choice, and for time-slicing it makes things more intuitive for the end users of your cluster when requesting GPUs.
  • resources: While we already covered a basic example of using resources, it's important to note that this option takes a list as input, allowing you to specify configuration for multiple resources. In the case of MPS, the only supported resource as of now is nvidia.com/gpu, and it is only supported for full GPUs (no MIG). For time-slicing, you can reference any of the resource types that emerge from configuring a node with the mixed MIG strategy. For example, nvidia.com/mig-1g.5gb can be specified to set up time-slicing for that specific MIG partition of the GPU.
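
Putting these options together, a time-slicing configuration in your values.yaml might look something like this sketch (four replicas per GPU is just an example):

config:
  map:
    default: |-
      version: v1
      sharing:
        timeSlicing:
          renameByDefault: true
          failRequestsGreaterThanOne: true
          resources:
            # Advertised as nvidia.com/gpu.shared on the node because renameByDefault is true
            - name: nvidia.com/gpu
              replicas: 4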

Per-Node Configurations

For simple clusters with a single GPU node type, these configuration options will get you a long way. But you may end up in a situation where you want to configure different types of nodes with different options as you scale up. This is possible using the map option in your values.yaml, along with setting the label nvidia.com/device-plugin.config on the various nodes in order to select the configuration. By default, config.map.default in the values.yaml will be used for all nodes, but you can set up other configurations like so:

config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
    mig-single: |-
      version: v1
      flags:
        migStrategy: single
    mig-mixed: |-
      version: v1
      flags:
        migStrategy: mixed

And then, for example, set the mig-mixed strategy for a given node with the following command (in a production environment, a node group/autoscaling group could have this label set as part of a Terraform module):

kubectl label nodes <node-name> --overwrite nvidia.com/device-plugin.config=mig-mixed

The label can be applied before or after the plugin is started to get the desired configuration applied on the node; it doesn't need to be there at startup. Any time it changes, the plugin will be updated to start serving the desired configuration. If it is set to an invalid value, the plugin will skip reconfiguration and use the most recent working config. If it is ever unset, it will fall back to using the default config value.
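
For example, removing the label entirely (the trailing dash unsets it) sends the node back to the default configuration:

kubectl label nodes <node-name> nvidia.com/device-plugin.config-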

Enabling gpu-feature-discovery for automatic node labels

The nvidia-device-plugin helm chart supports deploying NVIDIA's [gpu-feature-discovery (GFD)](https://github.com/NVIDIA/gpu-feature-discovery) helm chart as a subchart as of v0.12.0. GFD can automatically generate labels that allow you to identify the set of GPUs available on a given node. To enable it, set gfd.enabled to true in your values.yaml file. This will also deploy node-feature-discovery (NFD), since it is a prerequisite of GFD. If you already have NFD deployed on your cluster (generally via a DaemonSet in the node-feature-discovery namespace), you can avoid redeploying it by setting nfd.enabled to false in the helm values. When using time-slicing, an additional label will be set to identify the number of replicas, and the product name will be suffixed with -SHARED so workloads can differentiate between shared and unshared GPUs; the full details are available here.
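
A minimal sketch of the relevant helm values, assuming NFD is already running on your cluster:

gfd:
  enabled: true
nfd:
  # Skip the bundled node-feature-discovery subchart since NFD is already deployed
  enabled: false

Once the GFD pods are up, kubectl get node <node-name> --show-labels (or -o yaml) will show the labels GFD generated for the node.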

Limitations and considerations when using the NVIDIA Device Plugin

Limited Health Monitoring

The NVIDIA device plugin has limited support for health checking, and no support in [node-problem-detector](https://github.com/kubernetes/node-problem-detector/issues/833). This means that errors can go unnoticed and be difficult to diagnose. NVIDIA calls this out in the README, but it's a pretty big gap and can lead to frustration. Using the dcgm-exporter from NVIDIA along with Prometheus is the best way to get a handle on GPU monitoring and remediation. For example, to monitor for XID errors you could use the PromQL query:

DCGM_FI_DEV_XID_ERRORS > 0

The gauge value of the DCGM_FI_DEV_XID_ERRORS metric will represent the most recent error code for a given GPU device, which will be non-zero when an error occurs. For an exhaustive list of the metrics available you can refer to this page of the DCGM documentation.
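
As a sketch, assuming you're running the Prometheus Operator (for example via kube-prometheus-stack), the query above can be wired into an alert rule along these lines:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-xid-errors
spec:
  groups:
    - name: gpu.rules
      rules:
        - alert: GpuXidError
          # Fires while any GPU is reporting a non-zero XID code
          expr: DCGM_FI_DEV_XID_ERRORS > 0
          for: 1m
          labels:
            severity: warning
          annotations:
            summary: "GPU on {{ $labels.instance }} reported XID {{ $value }}"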

Static partitioning of GPUs

GPUs can only be partitioned up front, and in the case of MPS only into equal portions of the GPU. This can make it hard to fully utilize GPUs, since workload requirements are highly variable. With Nebuly's fork of the nvidia-device-plugin, you are able to split up the GPU more granularly by defining the amount of memory available, how many replicas to expose, and what to name the resources (typically a name referencing the amount of memory in the slice). The Dynamic GPU partitioning feature of nos (Nebuly OS) takes this one step further, allowing you to avoid configuring partitions up front and instead partition the GPU dynamically in real time based on the Pods pending and running on the cluster. This allows pods to request only the resources that are strictly necessary rather than choosing a predefined partition, increasing the total utilization of the GPUs. Unfortunately, it looks like nos has been put into maintenance mode and, in my opinion, isn't a priority for the maintainers due to a recent pivot for the company, so it shouldn't be considered for new production deployments.

NVIDIA is also working on a project for Dynamic Resource Allocation (DRA), which is currently in active development and not yet suitable for production, but it is worth keeping an eye on for when it's ready for primetime. nvshare also looks like an interesting project that essentially allows for time-slicing while letting each process utilize the entire memory of the GPU in an efficient manner. This is an emerging area of development, so no real best practices have settled yet, but I'll be going over the challenges with GPU utilization and some of the cutting-edge projects trying to solve them in more detail in part 3!

Conclusion

Efficient GPU management is crucial for organizations running machine learning and high-performance computing workloads in Kubernetes. The NVIDIA Device Plugin provides a solid foundation for exposing and allocating GPU resources to containers. By understanding common troubleshooting techniques, exploring GPU sharing strategies, and considering the limitations and considerations discussed in this post, you can effectively configure the NVIDIA Device Plugin for your specific use case.

As the demand for GPU acceleration continues to grow, innovative projects are emerging to address the challenges of GPU utilization and provide more dynamic and flexible resource management capabilities. We’ll be exploring some of the challenges and the projects trying to address them in part 3 so stay tuned!

Further Reading and Resources