So here starts my debugging journey.
Accessing one of the nodes and checking whether the kube processes are running:
core@k8s-node-2017:~$ ps -ef | grep kube
root 1739 1 0 15:16 ? 00:00:03 /hyperkube proxy --cluster-cidr=10.100.0.0/16 --master=http://10.120.6.4:8080
core 24484 23080 0 15:42 pts/0 00:00:00 grep --color=auto kube
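A side note on the listing above: the grep process matches itself (the last line). The usual bracket trick avoids that, because the regex `[k]ube` matches the text "kube" while grep's own command line only contains the literal string "[k]ube". A quick simulation of the difference:

```shell
# Simulated ps output: one real kube process plus the grep itself.
ps_output='root  1739  /hyperkube proxy --cluster-cidr=10.100.0.0/16
core 24484  grep --color=auto [k]ube'

# The character class [k]ube matches "kube" in /hyperkube, but the
# grep line only contains the literal "[k]ube", which does not match.
echo "$ps_output" | grep '[k]ube'
```

On a live system this is just `ps -ef | grep '[k]ube'` — no second grep line in the output.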
kube-proxy is running OK, but there is no kubelet process in the listing.
Checking the kubelet logs:
journalctl -u kubelet -f --no-pager
Oct 17 15:44:56 k8s-node-2017 systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
Oct 17 15:44:56 k8s-node-2017 systemd[1]: kubelet.service: Unit entered failed state.
Oct 17 15:44:56 k8s-node-2017 systemd[1]: kubelet.service: Failed with result 'exit-code'.
The kubelet is exiting with an error and being restarted by systemd.
Checking the other services:
Oct 17 15:21:42 k8s-node-2017 systemd[1]: docker-dnsmasq.service: Unit entered failed state.
Oct 17 15:21:42 k8s-node-2017 systemd[1]: docker-dnsmasq.service: Failed with result 'exit-code'.
Oct 17 15:21:43 k8s-node-2017 systemd[1]: kube-docker.service: Service hold-off time over, scheduling restart.
Oct 17 15:21:43 k8s-node-2017 systemd[1]: Stopped kube docker - docker for our k8s cluster.
Oct 17 15:21:43 k8s-node-2017 systemd[1]: Starting kube docker - docker for our k8s cluster...
Oct 17 15:21:43 k8s-node-2017 kube-docker.sh[7240]: Device "flannel.1" does not exist.
Oct 17 15:21:43 k8s-node-2017 kube-docker.sh[7240]: flannel ip address:
Oct 17 15:21:43 k8s-node-2017 kube-docker.sh[7240]: DNS ip address: ...1
Oct 17 15:21:43 k8s-node-2017 kube-docker.sh[7240]: starting docker daemon with flannel CIDR:...1/24
Oct 17 15:21:43 k8s-node-2017 kube-docker.sh[7240]: invalid value "...1" for flag --dns: ...1 is not an ip address
Oct 17 15:21:43 k8s-node-2017 kube-docker.sh[7240]: See 'dockerd --help'.
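For context, kube-docker.sh presumably derives the `--dns` value from the flannel.1 interface address. The real script isn't shown in the logs, so this is only a hypothetical sketch of the failing step: with the interface missing, the address comes back empty and the derived value is a malformed fragment rather than an IP, which is exactly what dockerd rejects above. The subnet value here is made up for illustration (it only needs to fall inside the 10.100.0.0/16 cluster CIDR):

```shell
# Hypothetical reconstruction of the failing step in kube-docker.sh:
# take the flannel interface's IP and swap the last octet for .1 to
# get the in-subnet DNS address.
derive_dns_ip() {
    flannel_ip=$1                # e.g. "10.100.63.0" (made-up subnet)
    echo "${flannel_ip%.*}.1"    # "10.100.63.0" -> "10.100.63.1"
}

derive_dns_ip "10.100.63.0"   # healthy case: prints 10.100.63.1
derive_dns_ip ""              # flannel.1 missing: prints just ".1",
                              # which is not a valid --dns value
```

So the dockerd error is a symptom; the real question is why flannel.1 never came up.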
Here we see the problem with the flannel service: the flannel.1 device is missing, so kube-docker derives a malformed --dns address and dockerd refuses to start.
Checking kube-flannel:
journalctl -u kube-flannel -f
Oct 17 15:53:41 k8s-node-2017 kube-flannel.sh[1759]: timed out
Oct 17 15:53:41 k8s-node-2017 kube-flannel.sh[1759]: E1017 17:53:41.837899 1759 main.go:344] Couldn't fetch network config: 100: Key not found (/coreos.com) [6578]
It seems that etcd has lost the flannel entries: by default flannel reads its network config from the /coreos.com/network/config key in etcd, and that key is gone.
Debugging the master, starting with etcd:
core@k8s-master-2017:~$ etcdctl ls /registry
/registry/serviceaccounts
/registry/ranges
/registry/services
/registry/events
/registry/namespaces
/registry/apiregistration.k8s.io
The master lost the flannel config because it rebooted: here we run etcd as a single master plus proxies, not as a proper cluster, and the etcd data directory was on ephemeral storage, so the reboot wiped it. We need to persist the etcd data on disk instead of keeping it ephemeral, and then restore the flannel config.
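A sketch of the fix on the master. The etcd2 unit name and paths below are assumptions for a CoreOS host — adjust to the actual setup. The Network value comes from the --cluster-cidr flag seen on kube-proxy above, and the vxlan backend matches the flannel.1 device name:

```shell
# 1. Keep etcd's data on a persistent disk so a reboot doesn't wipe it
#    (assumed unit name etcd2 and data dir /var/lib/etcd2).
sudo mkdir -p /var/lib/etcd2 /etc/systemd/system/etcd2.service.d
sudo tee /etc/systemd/system/etcd2.service.d/30-data-dir.conf <<'EOF'
[Service]
Environment=ETCD_DATA_DIR=/var/lib/etcd2
EOF
sudo systemctl daemon-reload
sudo systemctl restart etcd2

# 2. Restore the flannel network config at the default key flannel
#    reads (/coreos.com/network/config).
etcdctl set /coreos.com/network/config \
  '{"Network": "10.100.0.0/16", "Backend": {"Type": "vxlan"}}'
```

After that, restarting kube-flannel, kube-docker, and kubelet on the node should bring it back; flannel.1 reappears once flannel can fetch its config again.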