Problem description:
One of the nodes went NotReady, as shown below:
# kubectl get no
NAME             STATUS     ROLES    AGE    VERSION
k8s-master-134   Ready      master   206d   v1.17.2
k8s-node01-50    Ready      node01   206d   v1.17.2
k8s-node02-136   Ready      node02   195d   v1.17.2
k8s-node03-137   NotReady   node03   195d   v1.17.2
k8s-node04-138   Ready      node04   195d   v1.17.2
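A quick way to get more detail on why the node is NotReady is to look at its conditions and recent events (assuming kubectl access; the node name is taken from the output above):
# kubectl describe node k8s-node03-137
The Conditions and Events sections usually show whether the kubelet has stopped posting status, which points the investigation at the node itself.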
Troubleshooting:
Running journalctl -f on the problem node shows a large number of errors like the following:
Dec 28 10:40:38 k8s-node03-137 kubelet[3698]: E1228 10:40:38.788368 3698 kubelet_volumes.go:154] orphaned pod "7f7b7215-0dd4-451b-b6f0-258855a420f7" found, but volume subpaths are still present on disk : There were a total of 5 errors similar to this. Turn up verbosity to see them.
This problem has been discussed at length in the upstream issue: https://github.com/kubernetes/kubernetes/issues/60987
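To get a rough idea of how many pod directories are affected, one small sketch is to pull the orphaned pod UIDs out of the kubelet journal (the UUID pattern below is deliberately loose; adjust as needed):
# journalctl -u kubelet --since "1 hour ago" | grep "orphaned pod" | grep -oE '[0-9a-f-]{36}' | sort -u
Each UID printed should correspond to a directory under /var/lib/kubelet/pods/.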
Next, go into the corresponding pod directory to see what is actually left behind:
# cd /var/lib/kubelet/pods/7f7b7215-0dd4-451b-b6f0-258855a420f7
The pod's container data is still there:
# ls -l
total 4
drwxr-x--- 3 root root  24 Aug 27 18:38 containers
-rw-r--r-- 1 root root 312 Aug 27 18:38 etc-hosts
drwxr-x--- 3 root root  37 Aug 27 18:38 plugins
drwxr-x--- 6 root root 121 Aug 27 18:38 volumes
drwxr-xr-x 3 root root  54 Aug 27 18:38 volume-subpaths
The etc-hosts file still contains the pod IP and pod name:
# cat etc-hosts
# Kubernetes-managed hosts file.
127.0.0.1 localhost
::1 localhost ip6-localhost ip6-loopback
fe00::0 ip6-localnet
fe00::0 ip6-mcastprefix
fe00::1 ip6-allnodes
fe00::2 ip6-allrouters
192.168.3.250 prometheus-k8s-system-0.prometheus-operated.kubesphere-monitoring-system.svc.cluster.local prometheus-k8s-system-0
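The last entry tells us which pod this directory belonged to (the FQDN is pod.service.namespace.svc.cluster.local), so we can double-check whether that pod still exists anywhere in the cluster; a quick check with the namespace derived from the entry above:
# kubectl -n kubesphere-monitoring-system get pod prometheus-k8s-system-0 -o wide
If the pod does exist, it has most likely been rescheduled to another node and the directory on this node is simply stale.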
Going into volume-subpaths/, there is a PVC directory:
# ls -l
total 0
drwxr-xr-x 3 root root 24 Aug 27 18:38 pvc-406aa2c7-01e0-4191-8183-4a3300ab9ca3
Let's check whether this PVC actually still exists:
# kubectl get pvc -A | grep pvc-406aa2c7-01e0-4191-8183-4a3300ab9ca3
There is no output at all, which means the PVC no longer exists. So why is its directory still here?
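The directory name looks like the name a dynamic provisioner gives to a PV, so it can also be worth checking the PV object directly (this naming is an assumption; adjust if your PV names differ):
# kubectl get pv pvc-406aa2c7-01e0-4191-8183-4a3300ab9ca3
If that also returns nothing, both the claim and the volume are gone and the on-disk directory is purely leftover state.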
In the source file pkg/kubelet/kubelet_volumes.go, the relevant logic is defined as follows:
// cleanupOrphanedPodDirs removes the volumes of pods that should not be
// running and that have no containers running. Note that we roll up logs here since it runs in the main loop.
func (kl *Kubelet) cleanupOrphanedPodDirs(pods []*v1.Pod, runningPods []*kubecontainer.Pod) error {
	...
	// If there are any volume-subpaths, do not cleanup directories
	volumeSubpathExists, err := kl.podVolumeSubpathsDirExists(uid)
	if err != nil {
		orphanVolumeErrors = append(orphanVolumeErrors, fmt.Errorf("orphaned pod %q found, but error %v occurred during reading of volume-subpaths dir from disk", uid, err))
		continue
	}
	if volumeSubpathExists {
		orphanVolumeErrors = append(orphanVolumeErrors, fmt.Errorf("orphaned pod %q found, but volume subpaths are still present on disk", uid))
		continue
	}
}
The job of cleanupOrphanedPodDirs is to clean up the directories of pods that are no longer running on this node; if the pod's volume-subpaths directory underneath is not empty, the cleanup is skipped.
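In other words, the kubelet will keep skipping this pod directory and logging the error above for as long as volume-subpaths is still there. You can reproduce the check by hand on the node, for example:
# ls -A /var/lib/kubelet/pods/7f7b7215-0dd4-451b-b6f0-258855a420f7/volume-subpaths
Any output here means cleanupOrphanedPodDirs will not remove the directory on its own.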
First, confirm whether the pod still exists on this node with the following command:
# crictl ps -o json | jq '.[][].labels | select (.["io.kubernetes.pod.uid"] == "7f7b7215-0dd4-451b-b6f0-258855a420f7") | .["io.kubernetes.pod.name"]'|uniq
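Before deleting anything, it is also worth making sure no subpath bind mounts under the pod directory are still active, otherwise rm -rf could reach into the backing volume through a live mount. A cautious sketch (GNU findmnt/xargs assumed):
# findmnt -rn -o TARGET | grep /var/lib/kubelet/pods/7f7b7215-0dd4-451b-b6f0-258855a420f7 | xargs -r -n1 umount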
The crictl command above does not find any matching pod, so the leftover directory can be removed directly:
# cd /var/lib/kubelet/pods
# rm -rf 7f7b7215-0dd4-451b-b6f0-258855a420f7
After that, the error no longer shows up in the kubelet logs.
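Finally, it is worth confirming that the node comes back (it can take a short while for the kubelet to report fresh status):
# kubectl get no k8s-node03-137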
