
Kubernetes v1.36 Beta: Dynamic Resource Tuning for Suspended Jobs

Last updated: 2026-04-30

Kubernetes v1.36 introduces a game-changing feature for batch and machine learning workloads: the ability to modify CPU, memory, GPU, and extended resource specifications on a suspended Job. This beta release empowers queue controllers and cluster administrators to adjust a Job's resource allocation while it is suspended, before any pods start, making better use of cluster capacity. Let's dive into the details with common questions.

1. What exactly does the mutable pod resources feature allow in v1.36?

This feature lets you change container resource requests and limits (CPU, memory, GPUs, and extended resources) in a Job's pod template while the Job is suspended. Originally introduced as alpha in v1.35, it's now promoted to beta in v1.36. The key is that the Job must be in the suspended state (spec.suspend: true). Once you adjust the resources, you can resume the Job, and the new pods will use the updated specifications. No new API objects were added; the existing batch/v1 Job and pod template structures were modified to relax the immutability constraint specifically for suspended Jobs. This is a significant shift because previously any change required deleting and recreating the entire Job, losing metadata and history.
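At the command line, the basic flow is simple. Here is a minimal sketch, assuming an existing Job named sample-job with a container named main (both names are illustrative):

# 1. Suspend the Job so its pod template becomes mutable
kubectl patch job sample-job --type=merge -p '{"spec":{"suspend":true}}'

# 2. Lower the CPU request and limit while suspended
#    (a strategic merge patch matches containers by name)
kubectl patch job sample-job --type=strategic -p '
{"spec": {"template": {"spec": {"containers": [
  {"name": "main", "resources": {"requests": {"cpu": "2"}, "limits": {"cpu": "2"}}}
]}}}}'

# 3. Resume; pods created from now on use the updated resources
kubectl patch job sample-job --type=merge -p '{"spec":{"suspend":false}}'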


2. Why were resource requirements for Jobs immutable before v1.36?

Kubernetes originally enforced immutability on the pod template within a Job to ensure consistency and avoid unexpected changes to running workloads. Once a Job was created, the pod spec was frozen, including resource requests and limits. The logic was that if a queue controller or an admin wanted a Job to run with different resources, the only safe way was to delete the Job and create a new one. However, this approach had serious downsides for batch and ML workloads: it discarded all associated metadata, status updates, and execution history. For long-running or expensive jobs, this was wasteful and disruptive. The immutability rule also didn't account for Jobs that are suspended rather than actively running, where the restriction serves no purpose. The v1.36 beta lifts the restriction only for suspended Jobs, preserving immutability for active Job definitions.

3. How does this feature benefit queue controllers like Kueue?

Queue controllers often face the challenge of matching Job resource demands with current cluster capacity. Before v1.36, if a queue controller like Kueue determined that a suspended Job should run with fewer resources (e.g., only 2 GPUs instead of 4), it had no way to update the Job's pod template. The controller would have to delete the Job and recreate it from scratch, which meant losing any labels, annotations, or status that the original Job carried. With the new beta feature, the controller can directly modify the resource fields on the suspended Job, then resume it. This allows for dynamic resource scaling without losing Job identity. For machine learning pipelines, this is crucial: a training job can start with a reduced resource footprint when the cluster is busy and later be resubmitted (or restarted) with full resources if needed. It also helps with CronJobs: a specific instance of a CronJob can be adjusted to progress slowly with fewer resources rather than failing due to insufficient capacity.

4. Can you show a concrete example of adjusting GPU resources for a suspended Job?

Certainly. Imagine a machine learning training Job that initially requests 4 GPUs, 8 CPUs, and 32 GiB of memory. The YAML (with a placeholder image) looks like this:

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job-example
spec:
  suspend: true
  template:
    spec:
      restartPolicy: Never   # Jobs require Never or OnFailure
      containers:
      - name: trainer
        image: registry.example.com/trainer:latest   # placeholder image
        resources:
          requests:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"
          limits:
            cpu: "8"
            memory: "32Gi"
            example-hardware-vendor.com/gpu: "4"

If the cluster can only allocate 2 GPUs, a queue controller can update the suspended Job's pod template to:

          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              example-hardware-vendor.com/gpu: "2"
            limits:
              cpu: "4"
              memory: "16Gi"
              example-hardware-vendor.com/gpu: "2"

Then the controller sets spec.suspend: false, and the Job starts running with the adjusted resources. This entire process preserves the original Job name, metadata, and history.
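For reference, here is one way to apply that change by hand with kubectl patch; a controller such as Kueue would issue the equivalent API update. Note that the slash in the extended resource name must be escaped as ~1 under JSON Pointer rules:

kubectl patch job training-job-example --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/memory", "value": "16Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/example-hardware-vendor.com~1gpu", "value": "2"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/cpu", "value": "4"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "16Gi"},
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/example-hardware-vendor.com~1gpu", "value": "2"}
]'

# Resume the Job with its new, smaller footprint
kubectl patch job training-job-example --type=merge -p '{"spec":{"suspend":false}}'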

5. What specific changes were made to the Kubernetes API?

The Kubernetes API server now relaxes the immutability constraint on the resource fields within the pod template, but only for Jobs that are currently suspended. No new API types or endpoints were introduced; the existing batch/v1 Job resource gains a validation exception. When a Job has spec.suspend: true, the API allows modifications to spec.template.spec.containers[*].resources (both requests and limits). The change is scoped to suspended Jobs only; active Jobs (those with suspend: false or unset) remain immutable, which keeps the change backward compatible and low risk. The feature works with all resource types: CPU, memory, GPUs, and any extended resources. The implementation is straightforward: validation hooks check the Job's suspension status before enforcing immutability.
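You can see the validation exception in action by continuing the earlier example. Once training-job-example has been resumed, the same kind of patch that succeeded while it was suspended is now rejected:

# With the Job active (spec.suspend: false), resource patches fail:
kubectl patch job training-job-example --type=json -p '[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/requests/cpu", "value": "2"}
]'
# The API server returns a validation error because the pod template is
# immutable while the Job is not suspended. (Exact wording varies by version.)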

6. Are there any caveats or restrictions when using mutable resources on suspended Jobs?

While powerful, the feature has a few important boundaries. First, modifications are only allowed while spec.suspend is true; if you try to change resources on a running or completed Job, the API rejects the request. Second, adjustments apply only to pods created after the Job resumes; any currently running pods are unaffected. Third, the feature does not allow changing container images, commands, or other pod spec fields, only resources. Finally, controllers like Kueue need appropriate RBAC permissions to update Job objects (a sample Role is sketched below). Despite these restrictions, the feature significantly reduces operational overhead for batch scheduling and resource optimization.
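As a rough sketch of that last point, a namespaced Role like the following would grant a controller the verbs it needs; the name and namespace are illustrative:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: job-resource-tuner     # illustrative name
  namespace: batch-workloads   # illustrative namespace
rules:
- apiGroups: ["batch"]
  resources: ["jobs"]
  verbs: ["get", "list", "watch", "update", "patch"]

A RoleBinding would then attach this Role to the controller's ServiceAccount.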

7. How does this feature help with CronJob instances?

Consider a CronJob that creates a new Job at every scheduled time. If the cluster is heavily loaded, the Job might fail to start due to insufficient resources. With mutable resources for suspended Jobs, a controller can intervene: it can suspend the newly created Job, reduce its resource requests (e.g., lower CPU and memory), then resume it. This allows the Job to progress slowly with reduced capacity rather than failing outright. The original CronJob definition remains unchanged; only the specific Job instance is adjusted. This provides graceful degradation for time-sensitive batch processes. For example, a nightly data processing CronJob can still complete its work—just using fewer resources—when the cluster is under stress. This feature gives administrators more flexibility to handle resource contention without losing scheduled executions.
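One convenient pattern (an operational choice, not something the feature requires) is to have the CronJob create every instance already suspended, so a controller always gets a chance to right-size it before it runs. The names and image below are illustrative:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-etl
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      suspend: true            # each Job instance starts suspended
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: etl
            image: registry.example.com/etl:latest   # placeholder image
            resources:
              requests:
                cpu: "4"
                memory: "8Gi"

A controller can then inspect current capacity, patch the instance's resources down if necessary, and flip spec.suspend to false, exactly as in the earlier examples.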