[FIRING:1] InstanceUnreachable ibm-srv01.ovirt.org kubernetes-nodes-exporter (amd64 linux ibm-srv01.ovirt.org true external bare-metal-external ci)

Description

Labels:

  • alertname = InstanceUnreachable

  • beta_kubernetes_io_arch = amd64

  • beta_kubernetes_io_os = linux

  • instance = ibm-srv01.ovirt.org

  • job = kubernetes-nodes-exporter

  • kubernetes_io_hostname = ibm-srv01.ovirt.org

  • node_role_kubernetes_io_compute = true

  • region = external

  • type = bare-metal-external

  • zone = ci

Annotations:

  • description = ibm-srv01.ovirt.org of job kubernetes-nodes-exporter has been possibly down for more than 10 minutes.

Source: http://prometheus-0:9090/graph?g0.expr=up%7Bjob%3D%22kubernetes-nodes-exporter%22%7D+%3D%3D+0&g0.tab=1
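The Source link carries the alert's query in percent-encoded form. As a minimal sketch (the helper name `extract_promql` is hypothetical and not part of the alert), the expression can be recovered with Python's standard library:

```python
from urllib.parse import urlparse, parse_qs

# Source URL copied verbatim from the alert above.
SOURCE_URL = (
    "http://prometheus-0:9090/graph"
    "?g0.expr=up%7Bjob%3D%22kubernetes-nodes-exporter%22%7D+%3D%3D+0&g0.tab=1"
)


def extract_promql(url: str) -> str:
    """Return the decoded PromQL expression from a Prometheus graph URL."""
    # parse_qs decodes percent-escapes and turns '+' into spaces.
    params = parse_qs(urlparse(url).query)
    return params["g0.expr"][0]


if __name__ == "__main__":
    print(extract_promql(SOURCE_URL))
```

The decoded expression is `up{job="kubernetes-nodes-exporter"} == 0`, which selects targets of that job whose last scrape failed.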

Activity

Evgheni Dereveanchin
September 9, 2020, 3:52 PM

Did we identify the reason for this issue? Could it be related to https://issues.redhat.com/browse/KNIECO-2387 ? If disk space runs out, pods are evacuated (this should be visible in the event log).

If the issue is no longer relevant, let’s close it along with the related ones for other IBM Cloud hosts.

Evgheni Dereveanchin
September 14, 2020, 9:40 AM

Any updates?

Shlomi Zidmi
September 14, 2020, 10:07 AM

This node has been stable for the past few weeks with no major errors. journalctl -u origin-node also looks quite clean, with no infra-related issues.

I do see, however, that /boot is 97% full (240M used out of 250M). I’m not sure whether this could cause issues, but we should probably free up some space.

Other than that, I think we can close all active “InstanceUnreachable” tickets for now and see how stable the nodes are.
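
The /boot usage figure mentioned above can be checked programmatically. The following is only a sketch: the function names and the 90% threshold are assumptions made for illustration, not something configured on these nodes, and the percentage may differ slightly from what df reports.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total space."""
    return 100.0 * used / total


def is_nearly_full(path: str, threshold: float = 90.0) -> bool:
    """Return True if the filesystem holding `path` is at or above `threshold` percent used.

    The 90% default is a hypothetical value chosen for this example.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return percent_used(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    # On the node itself the path would be "/boot"; "." keeps the sketch runnable anywhere.
    print(is_nearly_full("."))
```

With the numbers quoted in the comment, `percent_used(250, 240)` gives 96.0, close to the reported 97%.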

Shlomi Zidmi
September 17, 2020, 7:42 AM

The /boot partition was cleaned up.
Closing, as this node has been stable for a few weeks now.

Assignee

Shlomi Zidmi

Reporter

Alertmanager_Bot

Blocked By

None

Priority

Medium