Troubleshooting Clusters (2024)

Debugging common cluster issues.

This doc is about cluster troubleshooting; we assume you have already ruled out your application as the root cause of theproblem you are experiencing. Seethe application troubleshooting guide for tips on application debugging.You may also visit the troubleshooting overview document for more information.

For troubleshooting kubectl, refer toTroubleshooting kubectl.

Listing your cluster

The first thing to debug in your cluster is if your nodes are all registered correctly.

Run the following command:

kubectl get nodes

And verify that all of the nodes you expect to see are present and that they are all in the Ready state.

To get detailed information about the overall health of your cluster, you can run:

kubectl cluster-info dump

Example: debugging a down/unreachable node

Sometimes when debugging it can be useful to look at the status of a node -- for example, becauseyou've noticed strange behavior of a Pod that's running on the node, or to find out why a Podwon't schedule onto the node. As with Pods, you can use kubectl describe node and kubectl get node -o yaml to retrieve detailed information about nodes. For example, here's what you'll see ifa node is down (disconnected from the network, or kubelet dies and won't restart, etc.). Noticethe events that show the node is NotReady, and also notice that the pods are no longer running(they are evicted after five minutes of NotReady status).

kubectl get nodes
NAME STATUS ROLES AGE VERSIONkube-worker-1 NotReady <none> 1h v1.23.3kubernetes-node-bols Ready <none> 1h v1.23.3kubernetes-node-st6x Ready <none> 1h v1.23.3kubernetes-node-unaj Ready <none> 1h v1.23.3
kubectl describe node kube-worker-1
Name: kube-worker-1Roles: <none>Labels: beta.kubernetes.io/arch=amd64 beta.kubernetes.io/os=linux kubernetes.io/arch=amd64 kubernetes.io/hostname=kube-worker-1 kubernetes.io/os=linuxAnnotations: kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock node.alpha.kubernetes.io/ttl: 0 volumes.kubernetes.io/controller-managed-attach-detach: trueCreationTimestamp: Thu, 17 Feb 2022 16:46:30 -0500Taints: node.kubernetes.io/unreachable:NoExecute node.kubernetes.io/unreachable:NoScheduleUnschedulable: falseLease: HolderIdentity: kube-worker-1 AcquireTime: <unset> RenewTime: Thu, 17 Feb 2022 17:13:09 -0500Conditions: Type Status LastHeartbeatTime LastTransitionTime Reason Message ---- ------ ----------------- ------------------ ------ ------- NetworkUnavailable False Thu, 17 Feb 2022 17:09:13 -0500 Thu, 17 Feb 2022 17:09:13 -0500 WeaveIsUp Weave pod has set this MemoryPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status. DiskPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status. PIDPressure Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status. Ready Unknown Thu, 17 Feb 2022 17:12:40 -0500 Thu, 17 Feb 2022 17:13:52 -0500 NodeStatusUnknown Kubelet stopped posting node status.Addresses: InternalIP: 192.168.0.113 Hostname: kube-worker-1Capacity: cpu: 2 ephemeral-storage: 15372232Ki hugepages-2Mi: 0 memory: 2025188Ki pods: 110Allocatable: cpu: 2 ephemeral-storage: 14167048988 hugepages-2Mi: 0 memory: 1922788Ki pods: 110System Info: Machine ID: 9384e2927f544209b5d7b67474bbf92b System UUID: aa829ca9-73d7-064d-9019-df07404ad448 Boot ID: 5a295a03-aaca-4340-af20-1327fa5dab5c Kernel Version: 5.13.0-28-generic OS Image: Ubuntu 21.10 Operating System: linux Architecture: amd64 Container Runtime Version: containerd://1.5.9 Kubelet Version: v1.23.3 Kube-Proxy Version: v1.23.3Non-terminated Pods: (4 in total) Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age --------- ---- ------------ ---------- --------------- ------------- --- default nginx-deployment-67d4bdd6f5-cx2nz 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m default nginx-deployment-67d4bdd6f5-w6kd7 500m (25%) 500m (25%) 128Mi (6%) 128Mi (6%) 23m kube-system kube-proxy-dnxbz 0 (0%) 0 (0%) 0 (0%) 0 (0%) 28m kube-system weave-net-gjxxp 100m (5%) 0 (0%) 200Mi (10%) 0 (0%) 28mAllocated resources: (Total limits may be over 100 percent, i.e., overcommitted.) Resource Requests Limits -------- -------- ------ cpu 1100m (55%) 1 (50%) memory 456Mi (24%) 256Mi (13%) ephemeral-storage 0 (0%) 0 (0%) hugepages-2Mi 0 (0%) 0 (0%)Events:...
kubectl get node kube-worker-1 -o yaml
apiVersion: v1kind: Nodemetadata: annotations: kubeadm.alpha.kubernetes.io/cri-socket: /run/containerd/containerd.sock node.alpha.kubernetes.io/ttl: "0" volumes.kubernetes.io/controller-managed-attach-detach: "true" creationTimestamp: "2022-02-17T21:46:30Z" labels: beta.kubernetes.io/arch: amd64 beta.kubernetes.io/os: linux kubernetes.io/arch: amd64 kubernetes.io/hostname: kube-worker-1 kubernetes.io/os: linux name: kube-worker-1 resourceVersion: "4026" uid: 98efe7cb-2978-4a0b-842a-1a7bf12c05f8spec: {}status: addresses: - address: 192.168.0.113 type: InternalIP - address: kube-worker-1 type: Hostname allocatable: cpu: "2" ephemeral-storage: "14167048988" hugepages-2Mi: "0" memory: 1922788Ki pods: "110" capacity: cpu: "2" ephemeral-storage: 15372232Ki hugepages-2Mi: "0" memory: 2025188Ki pods: "110" conditions: - lastHeartbeatTime: "2022-02-17T22:20:32Z" lastTransitionTime: "2022-02-17T22:20:32Z" message: Weave pod has set this reason: WeaveIsUp status: "False" type: NetworkUnavailable - lastHeartbeatTime: "2022-02-17T22:20:15Z" lastTransitionTime: "2022-02-17T22:13:25Z" message: kubelet has sufficient memory available reason: KubeletHasSufficientMemory status: "False" type: MemoryPressure - lastHeartbeatTime: "2022-02-17T22:20:15Z" lastTransitionTime: "2022-02-17T22:13:25Z" message: kubelet has no disk pressure reason: KubeletHasNoDiskPressure status: "False" type: DiskPressure - lastHeartbeatTime: "2022-02-17T22:20:15Z" lastTransitionTime: "2022-02-17T22:13:25Z" message: kubelet has sufficient PID available reason: KubeletHasSufficientPID status: "False" type: PIDPressure - lastHeartbeatTime: "2022-02-17T22:20:15Z" lastTransitionTime: "2022-02-17T22:15:15Z" message: kubelet is posting ready status reason: KubeletReady status: "True" type: Ready daemonEndpoints: kubeletEndpoint: Port: 10250 nodeInfo: architecture: amd64 bootID: 22333234-7a6b-44d4-9ce1-67e31dc7e369 containerRuntimeVersion: containerd://1.5.9 kernelVersion: 5.13.0-28-generic kubeProxyVersion: v1.23.3 kubeletVersion: v1.23.3 machineID: 9384e2927f544209b5d7b67474bbf92b operatingSystem: linux osImage: Ubuntu 21.10 systemUUID: aa829ca9-73d7-064d-9019-df07404ad448

Looking at logs

For now, digging deeper into the cluster requires logging into the relevant machines. Here are the locationsof the relevant log files. On systemd-based systems, you may need to use journalctl instead of examining log files.

Control Plane nodes

  • /var/log/kube-apiserver.log - API Server, responsible for serving the API
  • /var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
  • /var/log/kube-controller-manager.log - a component that runs most Kubernetes built-incontrollers, with the notable exception of scheduling(the kube-scheduler handles scheduling).

Worker Nodes

  • /var/log/kubelet.log - logs from the kubelet, responsible for running containers on the node
  • /var/log/kube-proxy.log - logs from kube-proxy, which is responsible for directing traffic to Service endpoints

Cluster failure modes

This is an incomplete list of things that could go wrong, and how to adjust your cluster setup to mitigate the problems.

Contributing causes

  • VM(s) shutdown
  • Network partition within cluster, or between cluster and users
  • Crashes in Kubernetes software
  • Data loss or unavailability of persistent storage (e.g. GCE PD or AWS EBS volume)
  • Operator error, for example, misconfigured Kubernetes software or application software

Specific scenarios

  • API server VM shutdown or apiserver crashing
    • Results
      • unable to stop, update, or start new pods, services, replication controller
      • existing pods and services should continue to work normally unless they depend on the Kubernetes API
  • API server backing storage lost
    • Results
      • the kube-apiserver component fails to start successfully and become healthy
      • kubelets will not be able to reach it but will continue to run the same pods and provide the same service proxying
      • manual recovery or recreation of apiserver state necessary before apiserver is restarted
  • Supporting services (node controller, replication controller manager, scheduler, etc) VM shutdown or crashes
    • currently those are colocated with the apiserver, and their unavailability has similar consequences as apiserver
    • in future, these will be replicated as well and may not be co-located
    • they do not have their own persistent state
  • Individual node (VM or physical machine) shuts down
    • Results
      • pods on that Node stop running
  • Network partition
    • Results
      • partition A thinks the nodes in partition B are down; partition B thinks the apiserver is down.(Assuming the master VM ends up in partition A.)
  • Kubelet software fault
    • Results
      • crashing kubelet cannot start new pods on the node
      • kubelet might delete the pods or not
      • node marked unhealthy
      • replication controllers start new pods elsewhere
  • Cluster operator error
    • Results
      • loss of pods, services, etc
      • lost of apiserver backing store
      • users unable to read API
      • etc.

Mitigations

  • Action: Use the IaaS provider's automatic VM restarting feature for IaaS VMs

    • Mitigates: Apiserver VM shutdown or apiserver crashing
    • Mitigates: Supporting services VM shutdown or crashes
  • Action: Use IaaS providers reliable storage (e.g. GCE PD or AWS EBS volume) for VMs with apiserver+etcd

    • Mitigates: Apiserver backing storage lost
  • Action: Use high-availability configuration

    • Mitigates: Control plane node shutdown or control plane components (scheduler, API server, controller-manager) crashing
      • Will tolerate one or more simultaneous node or component failures
    • Mitigates: API server backing storage (i.e., etcd's data directory) lost
      • Assumes HA (highly-available) etcd configuration
  • Action: Snapshot apiserver PDs/EBS-volumes periodically

    • Mitigates: Apiserver backing storage lost
    • Mitigates: Some cases of operator error
    • Mitigates: Some cases of Kubernetes software fault
  • Action: use replication controller and services in front of pods

    • Mitigates: Node shutdown
    • Mitigates: Kubelet software fault
  • Action: applications (containers) designed to tolerate unexpected restarts

    • Mitigates: Node shutdown
    • Mitigates: Kubelet software fault

What's next

  • Learn about the metrics available in theResource Metrics Pipeline
  • Discover additional tools formonitoring resource usage
  • Use Node Problem Detector tomonitor node health
  • Use kubectl debug node to debug Kubernetes nodes
  • Use crictl to debug Kubernetes nodes
  • Get more information about Kubernetes auditing
  • Use telepresence to develop and debug services locally

Feedback

Was this page helpful?

Thanks for the feedback. If you have a specific, answerable question about how to use Kubernetes, ask it onStack Overflow.Open an issue in the GitHub Repository if you want toreport a problemorsuggest an improvement.

Last modified March 07, 2024 at 4:54 PM PST: AppArmor v1.30 docs update (4f11f83a45)

Troubleshooting Clusters (2024)
Top Articles
Bismarck Gazette from Bismarck, Missouri
Bismarck Gazette from Bismarck, Missouri
Devotion Showtimes Near Xscape Theatres Blankenbaker 16
55Th And Kedzie Elite Staffing
Minooka Channahon Patch
Cintas Pay Bill
Unitedhealthcare Hwp
Tj Nails Victoria Tx
Ventura Craigs List
Imbigswoo
Over70Dating Login
[2024] How to watch Sound of Freedom on Hulu
Erin Kate Dolan Twitter
Cincinnati Bearcats roll to 66-13 win over Eastern Kentucky in season-opener
The Binding of Isaac
Best Fare Finder Avanti
Log in or sign up to view
Average Salary in Philippines in 2024 - Timeular
Quick Answer: When Is The Zellwood Corn Festival - BikeHike
Best Nail Salons Open Near Me
Reicks View Farms Grain Bids
Znamy dalsze plany Magdaleny Fręch. Nie będzie nawet chwili przerwy
Craigslist Ludington Michigan
4 Methods to Fix “Vortex Mods Cannot Be Deployed” Issue - MiniTool Partition Wizard
Nk 1399
Tom Thumb Direct2Hr
30+ useful Dutch apps for new expats in the Netherlands
Kaliii - Area Codes Lyrics
Happy Shuttle Cancun Review
Datingscout Wantmatures
DIY Building Plans for a Picnic Table
Swimgs Yuzzle Wuzzle Yups Wits Sadie Plant Tune 3 Tabs Winnie The Pooh Halloween Bob The Builder Christmas Autumns Cow Dog Pig Tim Cook’s Birthday Buff Work It Out Wombats Pineview Playtime Chronicles Day Of The Dead The Alpha Baa Baa Twinkle
Half Inning In Which The Home Team Bats Crossword
Wasmo Link Telegram
The Hoplite Revolution and the Rise of the Polis
Forager How-to Get Archaeology Items - Dino Egg, Anchor, Fossil, Frozen Relic, Frozen Squid, Kapala, Lava Eel, and More!
Haley Gifts :: Stardew Valley
Old Peterbilt For Sale Craigslist
Lake Dunson Robertson Funeral Home Lagrange Georgia Obituary
Selfservice Bright Lending
Pensacola Cars Craigslist
Trap Candy Strain Leafly
Janaki Kalaganaledu Serial Today Episode Written Update
Gotrax Scooter Error Code E2
St Vrain Schoology
Gw2 Support Specter
Lorton Transfer Station
SF bay area cars & trucks "chevrolet 50" - craigslist
Gelato 47 Allbud
303-615-0055
Tommy Gold Lpsg
683 Job Calls
Latest Posts
Article information

Author: Terrell Hackett

Last Updated:

Views: 5847

Rating: 4.1 / 5 (72 voted)

Reviews: 87% of readers found this page helpful

Author information

Name: Terrell Hackett

Birthday: 1992-03-17

Address: Suite 453 459 Gibson Squares, East Adriane, AK 71925-5692

Phone: +21811810803470

Job: Chief Representative

Hobby: Board games, Rock climbing, Ghost hunting, Origami, Kabaddi, Mushroom hunting, Gaming

Introduction: My name is Terrell Hackett, I am a gleaming, brainy, courageous, helpful, healthy, cooperative, graceful person who loves writing and wants to share my knowledge and understanding with you.