
NPD + Descheduler Remediation

This guide demonstrates how to build an automated self-healing remediation loop using Node Problem Detector (NPD), the Node Readiness Controller (NRC), and the Descheduler.

The Problem

When a node-level component fails (hardware driver, daemon, agent), existing pods continue running on that degraded node. Manual intervention is needed to identify the issue, taint the node, and reschedule workloads.

The Solution

An automated remediation loop:

  1. NPD runs a custom health check and sets a NodeCondition when a failure is detected.
  2. NRC watches the condition and applies a taint to the unhealthy node.
  3. Descheduler evicts pods that don’t tolerate the taint.
  4. The Kubernetes Scheduler reschedules evicted pods to healthy nodes.
  5. When the issue recovers, NRC removes the taint automatically.

Step-by-Step Guide

Note: All manifests are available in the examples/npd-descheduler-remediation/ directory.

Prerequisites

1. Node Readiness Controller:

Ensure the NRC is deployed. See the Installation Guide.

2. Kind Cluster (for testing):

kind create cluster --config examples/npd-descheduler-remediation/kind-cluster-config.yaml

This creates a cluster with 1 control-plane and 2 worker nodes. The workers are pre-tainted with readiness.k8s.io/my-component-ready=false:NoSchedule to represent starting in an “unknown” or initializing state.
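For reference, pre-tainting workers in a kind config is typically expressed through a kubeadm `JoinConfiguration` patch; a minimal sketch (the repo's actual config may differ):

```yaml
# Sketch: a kind cluster whose workers register pre-tainted.
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
  kubeadmConfigPatches:
  - |
    kind: JoinConfiguration
    nodeRegistration:
      kubeletExtraArgs:
        register-with-taints: "readiness.k8s.io/my-component-ready=false:NoSchedule"
```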

1. (Optional) Deploy Node Problem Detector

Note: For the verification section below, we will use manual patching to simulate failures. If you deploy NPD, it will overwrite manual patches every 10 seconds. You can skip this step or delete the NPD DaemonSet when you reach the verification steps.

NPD monitors node health with a custom plugin that checks a local component (e.g., a hardware driver listening on port 9100).

# Deploy NPD RBAC
kubectl apply -f examples/npd-descheduler-remediation/npd-rbac.yaml

# Deploy NPD config and DaemonSet
kubectl apply -f examples/npd-descheduler-remediation/npd-custom-plugin-config.yaml
kubectl apply -f examples/npd-descheduler-remediation/npd-daemonset.yaml

NPD sets the condition CustomCondition/MyComponentNotReady:

  • False → component is healthy
  • True → component has a problem

Customizing the health check: Edit check-component.sh in npd-custom-plugin-config.yaml to check your actual component.
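As a starting point, a custom NPD plugin script is just a shell script whose exit code signals health (0 = healthy, 1 = problem). The sketch below probes a TCP port; it is a hypothetical stand-in for the repo's `check-component.sh`, and the port number follows this guide's example:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of check-component.sh; not the file shipped in the repo.
# NPD custom-plugin contract: exit 0 = healthy, exit 1 = problem detected.
check_component() {
  local host="${1:-127.0.0.1}" port="${2:-9100}"
  # Probe the component's TCP endpoint (port 9100 is this guide's example).
  if timeout 2 bash -c "</dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "my-component is listening on ${host}:${port}"
    return 0
  fi
  echo "my-component is NOT listening on ${host}:${port}"
  return 1
}

# Demo: probing a port nothing listens on reports unhealthy (exit code 1).
check_component 127.0.0.1 1 || echo "exit code: $?"
```

In the real plugin the script would end with `check_component "$@"` so NPD sees the exit code directly.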

2. Create the NodeReadinessRule

apiVersion: readiness.node.x-k8s.io/v1alpha1
kind: NodeReadinessRule
metadata:
  name: my-component-readiness-rule
spec:
  conditions:
    - type: "CustomCondition/MyComponentNotReady"
      requiredStatus: "False"   # Remove the taint while the component is healthy
  taint:
    key: "readiness.k8s.io/my-component-ready"
    effect: "NoSchedule"
    value: "false"
  enforcementMode: "continuous"  # Re-taint if component fails again
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: DoesNotExist

Key points:

  • continuous mode ensures the taint is re-applied if the component becomes unhealthy again — critical for the Descheduler to trigger pod eviction.
  • The nodeSelector excludes the control-plane.

kubectl apply -f examples/npd-descheduler-remediation/node-readiness-rule.yaml
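Anything that must keep running on an unhealthy node (NPD itself, monitoring agents) should tolerate the taint so the Descheduler leaves it alone. A minimal sketch of the pod-spec fragment:

```yaml
# Sketch: toleration that exempts a pod from this remediation loop.
tolerations:
  - key: "readiness.k8s.io/my-component-ready"
    operator: "Exists"
    effect: "NoSchedule"
```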

3. Deploy the Descheduler

The Descheduler runs with the RemovePodsViolatingNodeTaints strategy, scoped to our custom taint:

profiles:
- name: default
  pluginConfig:
  - name: RemovePodsViolatingNodeTaints
    args:
      includedTaints:
      - "readiness.k8s.io/my-component-ready"
  plugins:
    deschedule:
      enabled:
      - RemovePodsViolatingNodeTaints

kubectl apply -f examples/npd-descheduler-remediation/descheduler-rbac.yaml
kubectl apply -f examples/npd-descheduler-remediation/descheduler-policy.yaml
kubectl apply -f examples/npd-descheduler-remediation/descheduler-deployment.yaml
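The policy above controls what gets evicted; how often the Descheduler scans is set on the Deployment itself. A hedged sketch of the container args, assuming the upstream descheduler binary's `--descheduling-interval` flag and a mounted policy file (paths are assumptions; the repo's manifest may differ):

```yaml
# Sketch: descheduler container args for continuous (Deployment) mode.
args:
  - --policy-config-file=/policy-dir/policy.yaml   # mount path is an assumption
  - --descheduling-interval=30s                    # matches the 30s cadence in Verification
```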

4. Deploy a Sample Workload

Deploy a test workload without a toleration for the readiness taint:

kubectl apply -f examples/npd-descheduler-remediation/sample-workload.yaml
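The only property that matters for eviction is that the pod spec carries no toleration for the readiness taint. A minimal stand-in (names and image are assumptions; the repo's manifest may differ):

```yaml
# Sketch: a Deployment whose pods do NOT tolerate the readiness taint,
# so the Descheduler may evict them from tainted nodes.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: sample-workload
  template:
    metadata:
      labels:
        app: sample-workload
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
```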

Verification

1. Check node taints:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key

2. Simulate component recovery:

First, let’s mark the component as healthy so the initial taint is removed and our pods can schedule:

kubectl patch node npd-descheduler-demo-worker --type=strategic --subresource=status -p \
  '{"status":{"conditions":[{"type":"CustomCondition/MyComponentNotReady","status":"False","lastHeartbeatTime":"'$(date -u +%FT%TZ)'","lastTransitionTime":"'$(date -u +%FT%TZ)'"}]}}'

Wait a moment, then verify the pods have scheduled onto the node:

kubectl get pods -o wide

3. Simulate a component failure:

Now, let’s simulate the component failing. NRC will detect the condition change and apply the taint.

kubectl patch node npd-descheduler-demo-worker --type=strategic --subresource=status -p \
  '{"status":{"conditions":[{"type":"CustomCondition/MyComponentNotReady","status":"True","lastHeartbeatTime":"'$(date -u +%FT%TZ)'","lastTransitionTime":"'$(date -u +%FT%TZ)'"}]}}'

4. Observe taint applied by NRC:

kubectl get node npd-descheduler-demo-worker -o jsonpath='{"\n"}{.spec.taints}{"\n"}'

5. Observe pod eviction by Descheduler:

The Descheduler scans every 30 seconds, so within about half a minute you should see the pod evicted and rescheduled.

kubectl get pods -o wide   # The pod should be rescheduled away from the tainted node
kubectl get events --sort-by=.lastTimestamp | grep -i evict

6. Simulate recovery:

kubectl patch node npd-descheduler-demo-worker --type=strategic --subresource=status -p \
  '{"status":{"conditions":[{"type":"CustomCondition/MyComponentNotReady","status":"False","lastHeartbeatTime":"'$(date -u +%FT%TZ)'","lastTransitionTime":"'$(date -u +%FT%TZ)'"}]}}'

The NRC removes the taint, and the node becomes schedulable again.
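The patch payloads in the verification steps differ only in the status value; a small helper (a sketch, not part of the repo) can generate them so one function flips the condition either way:

```shell
#!/usr/bin/env bash
# Helper sketch: build the status-patch payload used in the verification steps.
condition_patch() {
  local status="$1"   # "True" = problem, "False" = healthy
  local now
  now="$(date -u +%FT%TZ)"
  printf '{"status":{"conditions":[{"type":"CustomCondition/MyComponentNotReady","status":"%s","lastHeartbeatTime":"%s","lastTransitionTime":"%s"}]}}' \
    "$status" "$now" "$now"
}

# Usage against the demo worker:
#   kubectl patch node npd-descheduler-demo-worker --type=strategic \
#     --subresource=status -p "$(condition_patch False)"
condition_patch True
```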