A k8s node keeps flipping between states: PLEG is not healthy

    Lunrry · 2023-09-15 10:16:09 +08:00 · 1809 views

    A question for the experts here: in our company's k8s environment, two nodes keep switching between Ready and NotReady, roughly every 3 minutes. Version information:

    Kernel Version:             3.10.0-1062.el7.x86_64
     OS Image:                   CentOS Linux 7 (Core)
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://18.6.1
     Kubelet Version:            v1.14.1
     Kube-Proxy Version:         v1.14.1
    

    Node information

    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource           Requests       Limits
      --------           --------       ------
      cpu                3312m (28%)    370m (3%)
      memory             24302Mi (19%)  270Mi (0%)
      ephemeral-storage  0 (0%)         0 (0%)
    Events:
      Type    Reason        Age                      From             Message
      ----    ------        ----                     ----             -------
      Normal  NodeNotReady  9m10s (x12084 over 45d)  kubelet, dev-11  Node dev-11 status is now: NodeNotReady
      Normal  NodeReady     4m9s (x12086 over 65d)   kubelet, dev-11  Node dev-11 status is now: NodeReady
    

    Logs

    9 月 15 10:06:47 dev-11 kubelet[2016]: I0915 10:06:47.940194    2016 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2023-09-15 10:06:47.940166803 +0800 CST m=+5667448.191374429 LastTransitionTime:2023-09-15 10:06:47.940166803 +0800 CST m=+5667448.191374429 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m4.847729429s ago; threshold is 3m0s.}
    9 月 15 10:06:50 dev-11 kubelet[2016]: I0915 10:06:50.280321    2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m7.187849858s ago; threshold is 3m0s.
    .....
    9 月 15 10:07:40 dev-11 kubelet[2016]: I0915 10:07:40.281597    2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m57.189127759s ago; threshold is 3m0s.
    9 月 15 10:07:43 dev-11 kubelet[2016]: E0915 10:07:43.124845    2016 remote_runtime.go:321] ContainerStatus "1f718a7646f7c8126e784*********************930620d33ab9bb" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    9 月 15 10:07:43 dev-11 kubelet[2016]: E0915 10:07:43.124906    2016 kuberuntime_manager.go:917] getPodContainerStatuses for pod "test-jdk11-1-0_test1(13*****-1fe1-11ee-a143-f4******bb5)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    9 月 15 10:08:42 dev-11 kubelet[2016]: E0915 10:08:42.995808    2016 kubelet_pods.go:1093] Failed killing the pod "test-jdk11-1-0": failed to "KillContainer" for "jdk11" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
    9 月 15 10:09:01 dev-11 kubelet[2016]: E0915 10:09:01.488058    2016 remote_runtime.go:402] Exec 1f718a7646f7c8126e784*********************930620d33ab9bb '/bin/sh' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    9 月 15 10:09:44 dev-11 kubelet[2016]: E0915 10:09:44.151795    2016 remote_runtime.go:321] ContainerStatus "1f718a7646f7c8126e784*********************930620d33ab9bb" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    9 月 15 10:09:44 dev-11 kubelet[2016]: E0915 10:09:44.151843    2016 kuberuntime_manager.go:917] getPodContainerStatuses for pod "test-jdk11-1-0_test1(13*****-1fe1-11ee-a143-f4******bb5)" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
    9 月 15 10:10:05 dev-11 kubelet[2016]: E0915 10:10:05.742413    2016 remote_runtime.go:402] Exec 1f718a7646f7c8126e784*********************930620d33ab9bb '/bin/sh' from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
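
To see how far past the threshold each relist lagged, the "PLEG is not healthy" messages can be parsed mechanically. The following is a minimal sketch, assuming the message wording shown in the logs above; the names `go_duration_seconds` and `parse_pleg_line` are hypothetical helpers, not part of any Kubernetes tooling.

```python
import re

# Seconds per unit for the Go-style duration strings seen in the kubelet logs.
_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 1e-3}
_COMPONENT = r"\d+(?:\.\d+)?(?:ms|h|m|s)"
_DURATION = rf"(?:{_COMPONENT})+"

# Assumption: the message format matches the log lines quoted in this thread.
PLEG_RE = re.compile(
    rf"pleg was last seen active (?P<elapsed>{_DURATION}) ago; "
    rf"threshold is (?P<threshold>{_DURATION})"
)


def go_duration_seconds(text):
    """Convert a duration such as '3m4.847729429s' to seconds."""
    total = 0.0
    for value, unit in re.findall(r"(\d+(?:\.\d+)?)(ms|h|m|s)", text):
        total += float(value) * _UNIT_SECONDS[unit]
    return total


def parse_pleg_line(line):
    """Return (elapsed, threshold, overrun) in seconds, or None if no match."""
    match = PLEG_RE.search(line)
    if match is None:
        return None
    elapsed = go_duration_seconds(match.group("elapsed"))
    threshold = go_duration_seconds(match.group("threshold"))
    return elapsed, threshold, elapsed - threshold
```

Feeding the kubelet journal through `parse_pleg_line` line by line gives a quick view of whether the overrun keeps growing (the relist is stuck) or resets periodically (the relist is merely slow).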
    
    16 replies · 2024-05-16 06:56:15 +08:00
    #1 · yuan1028 · 2023-09-15 11:16:30 +08:00
    Try checking the runtime logs (docker or kata); this looks like a runtime problem.
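    #2 · Pythondr · 2023-09-15 11:24:51 +08:00
    There is a zombie container, test-jdk11-1-0. docker timed out while handling it, which left the docker CLI hanging; force-restarting docker will clear it. The main cause of this is that the code in your container is faulty and cannot exit cleanly.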
    #3 · julyclyde · 2023-09-15 12:13:32 +08:00
    @Pythondr If it were purely a software problem, even badly written code could still be SIGKILLed, right?
    I'm starting to suspect a failing disk.
    #4 · Cola98 · 2023-09-15 13:43:57 +08:00
    Feels like a network problem.
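    #5 · Lunrry (OP) · 2023-09-15 13:46:54 +08:00
    From what I can see, I also think it's this container. Colleagues report that kubectl exec into the container is sometimes very slow, although it does still get in. I'm a k8s beginner and don't really understand what is going on inside.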
    #6 · Lunrry (OP) · 2023-09-15 13:54:12 +08:00
    @Cola98 #4 Two nodes have the problem, and the logs on both point to this same cause:
    ```
    9 月 15 13:51:46 dev-11 kubelet[2016]: I0915 13:51:46.181789 2016 setters.go:521] Node became not ready: {Type:Ready Status:False LastHeartbeatTime:2023-09-15 13:51:46.181760277 +0800 CST m=+5680946.432967904 LastTransitionTime:2023-09-15 13:51:46.181760277 +0800 CST m=+5680946.432967904 Reason:KubeletNotReady Message:PLEG is not healthy: pleg was last seen active 3m5.432099472s ago; threshold is 3m0s.}
    9 月 15 13:51:47 dev-11 kubelet[2016]: I0915 13:51:47.280267 2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m6.530576287s ago; threshold is 3m0s.
    9 月 15 13:51:52 dev-11 kubelet[2016]: I0915 13:51:52.280410 2016 kubelet.go:1823] skipping pod synchronization - PLEG is not healthy: pleg was last seen active 3m11.530719718s ago; threshold is 3m0s.
    ```
    Networking uses a CNI plugin, and the other few dozen nodes are fine.
    #7 · Cola98 · 2023-09-15 17:27:42 +08:00
    @Lunrry Sorry, I didn't read carefully earlier.
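    #8 · Lunrry (OP) · 2023-09-15 17:54:29 +08:00
    I've tracked it down: it is 90% likely to be a problem with the jdk container.
    On each relist iteration, PLEG does the equivalent of `docker ps` to detect container state changes and `docker inspect` to fetch the details of those containers. After each iteration completes, it updates a timestamp. If that timestamp has not been updated for a while (namely 3 minutes), the health check fails.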
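
The timestamp mechanism described above can be illustrated with a toy model. This is only a sketch of the idea, not the kubelet's code; the class name `RelistHealth` and its methods are hypothetical, and the 180-second default mirrors the 3m0s threshold in the logs.

```python
import time


class RelistHealth:
    """Toy model of a timestamp-based health check (hypothetical, not kubelet code)."""

    def __init__(self, threshold_seconds=180.0, clock=time.monotonic):
        self.threshold_seconds = threshold_seconds
        self._clock = clock
        self._last_relist = clock()

    def relist_completed(self):
        # Called at the end of each successful relist iteration.
        self._last_relist = self._clock()

    def healthy(self):
        # Unhealthy once the last completed relist is older than the threshold.
        elapsed = self._clock() - self._last_relist
        return elapsed <= self.threshold_seconds, elapsed
```

The point of the model: if a single call inside an iteration blocks, `relist_completed` is never reached, the timestamp goes stale, and the check starts failing even though every other container on the node is fine.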
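    When I ran `docker inspect` in a loop from a script, both affected machines, dev-11 and dev-13, hung on this jdk container. Once the 3 minutes are up, the check times out and the node is marked NotReady.
    All that's left now is to debug the container orchestration script.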
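
The script used above was not posted. A minimal sketch of such a loop might look like the following, assuming the `docker` CLI is on the PATH; the function names, the 30-second timeout, and the injectable `run` parameter are all hypothetical choices made for illustration.

```python
import subprocess
import time


def list_container_ids(run=subprocess.run):
    """Return the IDs of all containers, running or not, via `docker ps -a -q`."""
    proc = run(["docker", "ps", "-a", "-q"], capture_output=True, text=True, check=True)
    return proc.stdout.split()


def time_inspect(container_ids, timeout_seconds=30.0, run=subprocess.run):
    """Run `docker inspect` on each container and record how long each call takes.

    Returns a list of (container_id, seconds, timed_out) tuples.
    """
    results = []
    for container_id in container_ids:
        start = time.monotonic()
        try:
            run(
                ["docker", "inspect", container_id],
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout_seconds,
                check=False,
            )
            timed_out = False
        except subprocess.TimeoutExpired:
            # The CLI did not return in time; this container is a suspect.
            timed_out = True
        results.append((container_id, time.monotonic() - start, timed_out))
    return results
```

Calling `time_inspect(list_container_ids())` on an affected node and sorting the result by duration should make a hanging container stand out. The `run` parameter exists only so the loop can be exercised without a Docker daemon.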
    #9 · hancai · 2023-09-20 11:13:47 +08:00
    #10 · hancai · 2023-09-20 11:16:51 +08:00
    @hancai Sorry, I replied in the wrong place by mistake. PLEG problems are most likely kernel bugs; I've run into this many times.
    #11 · Lunrry (OP) · 2023-09-20 14:05:55 +08:00
    @hancai #10 But the other machines run the same OS with the same kernel version. Could a kernel bug still be the cause?
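    #12 · hancai · 2023-09-20 16:18:22 +08:00
    Even with the same kernel in the same cluster, only some nodes are affected. Most likely the kernel log keeps printing `unregister_netdevice: waiting for XXX to become free. Usage count = 1`, and some pods in the cluster are stuck in Terminating.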
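
One way to check for the kernel message quoted above is to scan the kernel log text for it. The following is a minimal sketch, assuming the message wording given in this reply; `find_netdevice_leaks` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumption: the kernel message matches the wording quoted in this reply.
UNREGISTER_RE = re.compile(
    r"unregister_netdevice: waiting for (?P<device>\S+) to become free\. "
    r"Usage count = (?P<count>-?\d+)"
)


def find_netdevice_leaks(kernel_log_text):
    """Count how often each network device appears in the warning message."""
    counts = Counter(
        match.group("device") for match in UNREGISTER_RE.finditer(kernel_log_text)
    )
    return dict(counts)
```

The input can be the captured output of `dmesg` on the node. A device that shows up repeatedly would be consistent with the reference count leak described here; an empty result suggests looking elsewhere.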
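    #13 · hancai · 2023-09-20 16:23:38 +08:00
    Search for these two blog posts: “内核 bug 修复方案:网络设备引用计数泄” (a kernel bug fix for a network device reference count leak) and “记一次 k8s 集群 pod 一直 terminating 问题的排查” (troubleshooting pods stuck in terminating in a k8s cluster). If the symptoms are similar, it is a kernel problem; I've hit this on two k8s clusters this year. I have also seen docker inspect hang, but in the end only a kernel upgrade fixed it. A sandbox container that was not destroyed properly can cause this too.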
    #14 · Lunrry (OP) · 2023-09-21 08:58:58 +08:00
    @hancai #13 OK, I'll take a look. Thanks for the pointers.
    #15 · yiyu1211 · 2023-11-17 09:49:51 +08:00
    So what turned out to be wrong with the jdk container?
    #16 · DavidWei · 220 days ago via Android
    Upgrading the kernel fixes it.