Advanced Pod Scheduling
In the Kubernetes bootcamp training, we have seen how to create a pod and some basic pod configurations to go with it. This chapter explains some advanced topics related to pod scheduling.
From the API documentation for version 1.11, the following pod spec fields are relevant from a scheduling perspective:
- nodeSelector
- nodeName
- affinity
- schedulerName
- tolerations
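To show where these fields sit in a manifest, here is a minimal sketch of a pod that sets a few of them; the pod name, label values, and image are placeholders chosen for illustration and are not part of the training code.
apiVersion: v1
kind: Pod
metadata:
  name: sched-demo                      # hypothetical name, for illustration only
spec:
  schedulerName: default-scheduler      # which scheduler should place this pod
  nodeSelector:                         # hard requirement on node labels
    zone: aaa
  tolerations:                          # allow scheduling onto nodes tainted with this key
    - key: "dedicated"
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: app
      image: nginx                      # placeholder image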
Using Node Selectors
kubectl get nodes --show-labels
kubectl label nodes <node-name> zone=aaa
kubectl get nodes --show-labels
e.g.
kubectl label nodes node1 zone=bbb
kubectl label nodes node2 zone=bbb
kubectl label nodes node3 zone=aaa
kubectl label nodes node4 zone=aaa
kubectl get nodes --show-labels
[sample output]
NAME STATUS ROLES AGE VERSION LABELS
node1 Ready master,node 22h v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node1,node-role.kubernetes.io/master=true,node-role.kubernetes.io/node=true,zone=bbb
node2 Ready master,node 22h v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node2,node-role.kubernetes.io/master=true,node-role.kubernetes.io/node=true,zone=bbb
node3 Ready node 22h v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node3,node-role.kubernetes.io/node=true,zone=aaa
node4 Ready node 21h v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node4,node-role.kubernetes.io/node=true,zone=aaa
Check how the pods are distributed on the nodes using the following command,
kubectl get pods -o wide --selector="role=vote"
NAME READY STATUS RESTARTS AGE IP NODE
vote-5d88d47fc8-6rflg 1/1 Running 0 1m 10.233.75.9 node2
vote-5d88d47fc8-gbzbq 1/1 Running 0 1h 10.233.74.76 node4
vote-5d88d47fc8-q4vj6 1/1 Running 0 1h 10.233.102.133 node1
vote-5d88d47fc8-znd2z 1/1 Running 0 1m 10.233.71.20 node3
From the above output, you can see that the pods running the vote app are currently evenly distributed across the nodes. Now, update the pod definition so that the pods are scheduled only on nodes in zone bbb.
file: k8s-code/pods/vote-pod.yml
....
  template:
    ...
    spec:
      containers:
        - name: app
          image: schoolofdevops/vote:v1
          ports:
            - containerPort: 80
              protocol: TCP
      nodeSelector:
        zone: 'bbb'
For this change to take effect, the pods need to be re-created.
kubectl apply -f vote-pod.yml
You would notice that the moment you apply that change, a new rollout kicks off, which starts redistributing the pods, now honoring the nodeSelector constraint that you added.
Watch the output of the following command
watch kubectl get pods -o wide --selector="role=vote"
You will see the following while it transitions
NAME READY STATUS RESTARTS AGE IP NODE
pod/vote-5d88d47fc8-6rflg 0/1 Terminating 0 5m 10.233.75.9 node2
pod/vote-5d88d47fc8-gbzbq 0/1 Terminating 0 1h 10.233.74.76 node4
pod/vote-5d88d47fc8-q4vj6 0/1 Terminating 0 1h 10.233.102.133 node1
pod/vote-67d7dd8f89-2w5wl 1/1 Running 0 44s 10.233.75.10 node2
pod/vote-67d7dd8f89-gm6bq 0/1 ContainerCreating 0 2s <none> node2
pod/vote-67d7dd8f89-w87n9 1/1 Running 0 44s 10.233.102.134 node1
pod/vote-67d7dd8f89-xccl8 1/1 Running 0 44s 10.233.102.135 node1
and after the rollout completes,
NAME READY STATUS RESTARTS AGE IP NODE
vote-67d7dd8f89-2w5wl 1/1 Running 0 2m 10.233.75.10 node2
vote-67d7dd8f89-gm6bq 1/1 Running 0 1m 10.233.75.11 node2
vote-67d7dd8f89-w87n9 1/1 Running 0 2m 10.233.102.134 node1
vote-67d7dd8f89-xccl8 1/1 Running 0 2m 10.233.102.135 node1
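All the vote pods are now on node1 and node2. To confirm that those are the nodes labelled zone=bbb, you can filter the node list by that label:
kubectl get nodes -l zone=bbb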
Exercise
Just like nodeSelector above, you could force a pod to run on a specific node using nodeName. Try using that property to run all the pods for the result application on node3.
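As a hint, here is a minimal sketch of the relevant part of the change, assuming the result app is defined by its own Deployment manifest; the placeholder image is an assumption, keep whatever the result deployment already uses.
....
  template:
    ...
    spec:
      containers:
        - name: app
          image: <result-image>     # keep the image already used by the result deployment
      nodeName: node3               # pin all result pods to node3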
Defining affinity and anti-affinity
We have discussed scheduling a pod on a particular node using nodeSelector, but nodeSelector is a hard condition: if it is not met, the pod cannot be scheduled. Node/pod affinity and anti-affinity address this by letting you express both hard and soft conditions.
- required
- preferred
- DuringScheduling
- DuringExecution
Operators
- In
- NotIn
- Exists
- DoesNotExist
- Gt
- Lt
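The examples below use In and DoesNotExist. As a small illustration of the comparison operators, Gt and Lt treat the label value as an integer; a hypothetical expression (the gpu-count label key is assumed purely for illustration) could look like:
matchExpressions:
  - key: gpu-count        # hypothetical node label
    operator: Gt
    values:
      - "2"               # Gt/Lt compare the label value as an integer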
Node Affinity
Examine the current pod distribution
kubectl get pods -o wide --selector="role=vote"
NAME READY STATUS RESTARTS AGE IP NODE
vote-8546bbd84d-22d6x 1/1 Running 0 35s 10.233.102.137 node1
vote-8546bbd84d-8f9bc 1/1 Running 0 1m 10.233.102.136 node1
vote-8546bbd84d-bpg8f 1/1 Running 0 1m 10.233.75.12 node2
vote-8546bbd84d-d8j9g 1/1 Running 0 1m 10.233.75.13 node2
and node labels
kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
node1 Ready master,node 1d v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node1,node-role.kubernetes.io/master=true,node-role.kubernetes.io/node=true,zone=bbb
node2 Ready master,node 1d v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node2,node-role.kubernetes.io/master=true,node-role.kubernetes.io/node=true,zone=bbb
node3 Ready node 1d v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node3,node-role.kubernetes.io/node=true,zone=aaa
node4 Ready node 1d v1.10.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=node4,node-role.kubernetes.io/node=true,zone=aaa
Let's create node affinity criteria as follows:
- Pods for vote app must not run on the master nodes
- Pods for vote app preferably run on a node in zone bbb
The first is a hard (required) affinity rule, while the second is a soft (preferred) one.
file: vote-deploy.yaml
....
  template:
    ....
    spec:
      containers:
        - name: app
          image: schoolofdevops/vote:v1
          ports:
            - containerPort: 80
              protocol: TCP
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: DoesNotExist
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: zone
                    operator: In
                    values:
                      - bbb
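As before, apply the updated manifest for the new affinity rules to take effect:
kubectl apply -f vote-deploy.yaml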
Pod Affinity
Let's define pod affinity criteria as follows:
- Pods for vote and redis should be co-located as much as possible (preferred)
- No two pods with redis app should be running on the same node (required)
kubectl get pods -o wide --selector="role in (vote,redis)"
[sample output]
NAME READY STATUS RESTARTS AGE IP NODE
redis-6555998885-4k5cr 1/1 Running 0 4h 10.233.71.19 node3
redis-6555998885-fb8rk 1/1 Running 0 4h 10.233.102.132 node1
vote-74c894d6f5-bql8z 1/1 Running 0 22m 10.233.74.78 node4
vote-74c894d6f5-nnzmc 1/1 Running 0 21m 10.233.71.22 node3
vote-74c894d6f5-ss929 1/1 Running 0 22m 10.233.74.77 node4
vote-74c894d6f5-tpzgm 1/1 Running 0 22m 10.233.71.21 node3
file: vote-deploy.yaml
...
  template:
    ...
    spec:
      containers:
        - name: app
          image: schoolofdevops/vote:v1
          ports:
            - containerPort: 80
              protocol: TCP
      affinity:
        ...
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: role
                      operator: In
                      values:
                        - redis
                topologyKey: kubernetes.io/hostname
file: redis-deploy.yaml
....
  template:
    ...
    spec:
      containers:
        - image: schoolofdevops/redis:latest
          imagePullPolicy: Always
          name: redis
          ports:
            - containerPort: 6379
              protocol: TCP
      restartPolicy: Always
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: role
                    operator: In
                    values:
                      - redis
              topologyKey: "kubernetes.io/hostname"
apply
kubectl apply -f redis-deploy.yaml
kubectl apply -f vote-deploy.yaml
check the pod distribution
kubectl get pods -o wide --selector="role in (vote,redis)"
[sample output ]
NAME READY STATUS RESTARTS AGE IP NODE
redis-5bf748dbcf-gr8zg 1/1 Running 0 13m 10.233.75.14 node2
redis-5bf748dbcf-vxppx 1/1 Running 0 13m 10.233.74.79 node4
vote-56bf599b9c-22lpw 1/1 Running 0 12m 10.233.74.80 node4
vote-56bf599b9c-nvvfd 1/1 Running 0 13m 10.233.71.25 node3
vote-56bf599b9c-w6jc9 1/1 Running 0 13m 10.233.71.23 node3
vote-56bf599b9c-ztdgm 1/1 Running 0 13m 10.233.71.24 node3
Observations from the above output,
- Since redis has a hard constraint not to run two pods on the same node, you would observe the redis pods being on different nodes (node2 and node4)
- Since the vote app has a soft constraint, you see some of its pods running on node4 (the same node running redis), while others continue to run on node3
If you delete the vote pods on node3, the scheduler honors all the affinity rules when scheduling their replacements:
$ kubectl delete pods vote-56bf599b9c-nvvfd vote-56bf599b9c-w6jc9 vote-56bf599b9c-ztdgm
pod "vote-56bf599b9c-nvvfd" deleted
pod "vote-56bf599b9c-w6jc9" deleted
pod "vote-56bf599b9c-ztdgm" deleted
$ kubectl get pods -o wide --selector="role in (vote,redis)"
NAME READY STATUS RESTARTS AGE IP NODE
redis-5bf748dbcf-gr8zg 1/1 Running 0 19m 10.233.75.14 node2
redis-5bf748dbcf-vxppx 1/1 Running 0 19m 10.233.74.79 node4
vote-56bf599b9c-22lpw 1/1 Running 0 19m 10.233.74.80 node4
vote-56bf599b9c-4l6bc 1/1 Running 0 20s 10.233.74.83 node4
vote-56bf599b9c-bqsrq 1/1 Running 0 20s 10.233.74.82 node4
vote-56bf599b9c-xw7zc 1/1 Running 0 19s 10.233.74.81 node4
Taints and tolerations
- Affinity is defined for pods
- Taints are defined for nodes
You can add taints to nodes with a key, a value, and an effect. The effect determines what happens to pods that do not tolerate the taint.
Taint Specs:
- effect
    - NoSchedule
    - PreferNoSchedule
    - NoExecute
- key
- value
- timeAdded (only written for NoExecute taints)
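The general form of the command for adding a taint is shown below; appending a trailing - to the effect removes the taint again (you will use both forms later in this section).
kubectl taint nodes <node-name> <key>=<value>:<effect>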
Observe the pod distribution
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
db-66496667c9-qggzd 1/1 Running 0 4h 10.233.74.74 node4
redis-5bf748dbcf-gr8zg 1/1 Running 0 27m 10.233.75.14 node2
redis-5bf748dbcf-vxppx 1/1 Running 0 27m 10.233.74.79 node4
result-5c7569bcb7-4fptr 1/1 Running 0 4h 10.233.71.18 node3
result-5c7569bcb7-s4rdx 1/1 Running 0 4h 10.233.74.75 node4
vote-56bf599b9c-22lpw 1/1 Running 0 26m 10.233.74.80 node4
vote-56bf599b9c-4l6bc 1/1 Running 0 8m 10.233.74.83 node4
vote-56bf599b9c-bqsrq 1/1 Running 0 8m 10.233.74.82 node4
vote-56bf599b9c-xw7zc 1/1 Running 0 8m 10.233.74.81 node4
worker-7c98c96fb4-7tzzw 1/1 Running 1 4h 10.233.75.8 node2
Let's taint a node.
kubectl taint node node2 dedicated=worker:NoExecute
After tainting the node, check the pod distribution again.
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
db-66496667c9-qggzd 1/1 Running 0 4h 10.233.74.74 node4
redis-5bf748dbcf-ckn65 1/1 Running 0 2m 10.233.71.26 node3
redis-5bf748dbcf-vxppx 1/1 Running 0 30m 10.233.74.79 node4
result-5c7569bcb7-4fptr 1/1 Running 0 4h 10.233.71.18 node3
result-5c7569bcb7-s4rdx 1/1 Running 0 4h 10.233.74.75 node4
vote-56bf599b9c-22lpw 1/1 Running 0 29m 10.233.74.80 node4
vote-56bf599b9c-4l6bc 1/1 Running 0 11m 10.233.74.83 node4
vote-56bf599b9c-bqsrq 1/1 Running 0 11m 10.233.74.82 node4
vote-56bf599b9c-xw7zc 1/1 Running 0 11m 10.233.74.81 node4
worker-7c98c96fb4-46ltl 1/1 Running 0 2m 10.233.102.140 node1
All the pods that were running on node2 just got evicted, because the NoExecute effect evicts pods that do not tolerate the taint.
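You can confirm the taint is in place by describing the node (the grep is just to filter the output):
kubectl describe node node2 | grep -i taints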
Add a toleration in the Deployment for worker so that it can run on the tainted node.
File: worker-deploy.yml
apiVersion: apps/v1
.....
  template:
    ....
    spec:
      containers:
        - name: app
          image: schoolofdevops/vote-worker:latest
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "worker"
          effect: "NoExecute"
apply
kubectl apply -f worker-deploy.yml
Observe the pod distribution now.
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE
db-66496667c9-qggzd 1/1 Running 0 4h 10.233.74.74 node4
redis-5bf748dbcf-ckn65 1/1 Running 0 3m 10.233.71.26 node3
redis-5bf748dbcf-vxppx 1/1 Running 0 31m 10.233.74.79 node4
result-5c7569bcb7-4fptr 1/1 Running 0 4h 10.233.71.18 node3
result-5c7569bcb7-s4rdx 1/1 Running 0 4h 10.233.74.75 node4
vote-56bf599b9c-22lpw 1/1 Running 0 30m 10.233.74.80 node4
vote-56bf599b9c-4l6bc 1/1 Running 0 12m 10.233.74.83 node4
vote-56bf599b9c-bqsrq 1/1 Running 0 12m 10.233.74.82 node4
vote-56bf599b9c-xw7zc 1/1 Running 0 12m 10.233.74.81 node4
worker-6cc8dbd4f8-6bkfg 1/1 Running 0 1m 10.233.75.15 node2
You should see the worker pod scheduled on node2 again, now that it tolerates the taint.
To remove the taint created above
kubectl taint node node2 dedicated:NoExecute-