Note: The AMD GPU Operator also supports DRA (Dynamic Resource Allocation) as an alternative to the traditional Device Plugin. DRA provides scheduler-driven GPU allocation, fine-grained device selection, and GPU sharing capabilities. The Device Plugin and DRA driver cannot be enabled at the same time.
To start the Device Plugin along with the GPU Operator, configure the fields under spec/devicePlugin in the DeviceConfig Custom Resource (CR).
The device plugin pods start after the DeviceConfig CR is updated.
| Field Name | Details |
|---|---|
| EnableDevicePlugin | Enable/Disable device plugin with True/False. Cannot be enabled simultaneously with DRA driver. |
| DevicePluginImage | Device plugin image |
| DevicePluginImagePullPolicy | One of Always, Never, IfNotPresent. |
| NodeLabellerImage | Node labeller image |
| NodeLabellerImagePullPolicy | One of Always, Never, IfNotPresent. |
| EnableNodeLabeller | Enable/Disable node labeller with True/False |
| DevicePluginArguments | The flag/values to pass on to Device Plugin |
| NodeLabellerArguments | The flags to pass on to Node Labeller |
Both ImagePullPolicy fields default to Always if the :latest tag is specified on the respective image, and to IfNotPresent otherwise. This is the default Kubernetes behaviour for ImagePullPolicy.
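The fields in the table map onto the spec/devicePlugin section of the DeviceConfig CR. The following is a minimal sketch rather than a definitive manifest: the apiVersion, the metadata name and namespace, the image references, and the lower-camel-case spelling of each key are assumptions made for illustration and should be checked against the CRD installed in your cluster.

```yaml
# Sketch of a DeviceConfig CR that enables the device plugin and node labeller.
# Assumptions: apiVersion, metadata values, image references, and key spellings.
apiVersion: amd.com/v1alpha1
kind: DeviceConfig
metadata:
  name: example-deviceconfig      # hypothetical name
  namespace: kube-amd-gpu         # hypothetical namespace
spec:
  devicePlugin:
    # Must not be enabled together with the DRA driver.
    enableDevicePlugin: true
    devicePluginImage: rocm/k8s-device-plugin:latest
    # Explicit here; with a :latest tag this would default to Always anyway.
    devicePluginImagePullPolicy: Always
    enableNodeLabeller: true
    nodeLabellerImage: rocm/k8s-device-plugin:labeller-latest
    nodeLabellerImagePullPolicy: Always
```

Omitting either pull policy field leaves the choice to the Kubernetes default described above.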
DevicePluginArguments is of type map[string]string. The currently supported key is “resource_naming_strategy”, which accepts one of the values {“single”, “mixed”}.
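Because DevicePluginArguments is a string-to-string map, it is written as a YAML mapping. A minimal fragment, assuming the key is spelled devicePluginArguments in the CR:

```yaml
# Fragment of a DeviceConfig CR (key spelling is an assumption).
spec:
  devicePlugin:
    devicePluginArguments:
      # Accepted values: single, mixed
      resource_naming_strategy: mixed
```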
NodeLabellerArguments is of type []string. Currently supported flags to set under NodeLabellerArguments are:
{“compute-memory-partition”, “compute-partitioning-supported”, “memory-partitioning-supported”}
The node labeller applies each of these partition labels to the node only when the corresponding flag is set under this field.
The node labeller applies the following labels by default:
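Because NodeLabellerArguments is a list of strings, each flag is written as one YAML sequence item. A minimal fragment that turns on all three partition labels, assuming the key is spelled nodeLabellerArguments in the CR:

```yaml
# Fragment of a DeviceConfig CR (key spelling is an assumption).
spec:
  devicePlugin:
    enableNodeLabeller: true
    nodeLabellerArguments:
      - compute-memory-partition
      - compute-partitioning-supported
      - memory-partitioning-supported
```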
{“vram”, “cu-count”, “simd-count”, “device-id”, “family”, “product-name”, “driver-version”}
To customize how the device plugin reports GPU resources to Kubernetes as allocatable resources, use the single or mixed resource naming strategy in the DeviceConfig CR. Before looking at each strategy, note the definitions of homogeneous and heterogeneous nodes.
Homogeneous node: A node whose GPUs all follow the same compute-memory partition style. Example: a node with 8 GPUs where all 8 GPUs follow the CPX-NPS4 partition style.
Heterogeneous node: A node whose GPUs follow different compute-memory partition styles. Example: a node with 8 GPUs where 5 GPUs follow SPX-NPS1 and 3 GPUs follow CPX-NPS1.
In single mode, the device plugin reports all GPUs (regardless of whether they are whole GPUs or partitions of a GPU) under the resource name amd.com/gpu. This mode is supported for homogeneous nodes but not for heterogeneous nodes.
A node which has 8 GPUs, none of which are partitioned, will report its resources as:
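As a sketch, the relevant excerpt of the node status (for example from kubectl get node -o yaml) would look roughly like this. The count follows directly from the 8 whole GPUs; the sketch assumes all of them are healthy, so allocatable equals capacity.

```yaml
# Excerpt of node status in single mode, 8 unpartitioned GPUs.
status:
  capacity:
    amd.com/gpu: "8"
  allocatable:
    amd.com/gpu: "8"
```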
A node which has 8 GPUs where all GPUs are partitioned using CPX-NPS4 style will report its resources as:
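The exact count depends on how many partitions each GPU exposes in CPX mode. The sketch below assumes 8 compute partitions per GPU (an assumption that holds for MI300X-class devices and may differ on other hardware), giving 8 x 8 = 64 schedulable devices under the single resource name.

```yaml
# Excerpt of node status in single mode, 8 GPUs in CPX-NPS4.
# Assumption: each GPU exposes 8 CPX partitions, so 8 x 8 = 64.
status:
  capacity:
    amd.com/gpu: "64"
  allocatable:
    amd.com/gpu: "64"
```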
In mixed mode, the device plugin reports each GPU under a resource name that matches its partition style. This mode is supported for both homogeneous and heterogeneous nodes.
A node which has 8 GPUs which are all partitioned using CPX-NPS4 style will report its resources as:
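In mixed mode the resource name encodes the partition style. The sketch below assumes 8 compute partitions per GPU in CPX mode (true of MI300X-class devices, possibly different elsewhere) and follows the amd.com/cpx_nps1 naming pattern used in this document.

```yaml
# Excerpt of node status in mixed mode, 8 GPUs in CPX-NPS4.
# Assumption: each GPU exposes 8 CPX partitions, so 8 x 8 = 64.
status:
  capacity:
    amd.com/cpx_nps4: "64"
  allocatable:
    amd.com/cpx_nps4: "64"
```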
A node which has 8 GPUs where 5 GPUs follow SPX-NPS1 and 3 GPUs follow CPX-NPS1 will report its resources as:
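On a heterogeneous node, each partition style appears as its own resource. In the sketch below, the 5 SPX GPUs are counted as whole devices, and the 3 CPX GPUs are assumed to expose 8 partitions each (an assumption based on MI300X-class devices), giving 3 x 8 = 24.

```yaml
# Excerpt of node status in mixed mode, heterogeneous node.
# Assumption: each CPX GPU exposes 8 partitions, so 3 x 8 = 24.
status:
  capacity:
    amd.com/spx_nps1: "5"
    amd.com/cpx_nps1: "24"
  allocatable:
    amd.com/spx_nps1: "5"
    amd.com/cpx_nps1: "24"
```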
If resource_naming_strategy is not passed through the DevicePluginArguments field in the CR, the device plugin defaults to the single resource naming strategy. This maintains backwards compatibility with earlier releases of the device plugin, which reported the resource name amd.com/gpu.
If a node has GPUs which do not support partitioning, such as MI210, the GPUs are reported under the resource name amd.com/gpu regardless of the resource naming strategy.
Use these resource names, for example amd.com/cpx_nps1, when requesting resources in a pod spec.
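A minimal sketch of a pod that requests one CPX-NPS1 partition on a node running the mixed strategy. The pod name, container name, and image are placeholders, not values from this document. Extended resources such as these are specified under limits; Kubernetes sets the matching request automatically.

```yaml
# Pod requesting a single cpx_nps1 partition (names and image are hypothetical).
apiVersion: v1
kind: Pod
metadata:
  name: cpx-nps1-example
spec:
  restartPolicy: Never
  containers:
    - name: workload
      image: registry.example.com/my-rocm-workload:latest  # placeholder image
      resources:
        limits:
          amd.com/cpx_nps1: 1
```

On a node using the single strategy, or one with GPUs that do not support partitioning, the same pod would request amd.com/gpu instead.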