Troubleshooting

When applying the Untab manifest on my GKE cluster, Kubernetes shows an error like this: "clusterroles.rbac.authorization.k8s.io "untab-agent" is forbidden: attempt to grant extra privileges"

By default, GKE users do not have the permissions required to create the ClusterRole objects that Untab needs. To resolve this issue, give your user cluster-admin permissions, replacing the names my-binding and you@example.com as appropriate:

kubectl create clusterrolebinding my-binding --clusterrole=cluster-admin --user=you@example.com
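
If you are unsure which account you are currently authenticated as, you can look it up with the gcloud CLI (assuming you use gcloud to authenticate to GKE):

gcloud config get-value account

Use the returned email address as the --user value in the command above.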

The agent pod fails to start

The agent pod includes three containers:

  • agent

  • prometheus

  • kube-state-metrics

First, determine which of these containers is failing. To do this, run:

kubectl get pod -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") -o go-template="{{range .status.containerStatuses}}{{.name}}: {{.lastState.terminated.reason}}{{\"\n\"}}{{end}}"

You should see output like this:

agent: <no value>
kube-state-metrics: <no value>
prometheus: Error

In the example above, we can see that it's the Prometheus container that is failing. Now, check the log output of that container to see what the problem might be:

kubectl logs -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") prometheus
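
If the logs are empty or inconclusive, describing the pod can also surface useful events and container exit codes; this uses the same pod lookup as the commands above:

kubectl describe pod -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}")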

If you are unable to determine the cause or fix the problem, please contact us for support and include the output from the commands above.

Some node-exporter pods fail to start

If one or more node-exporter pods fail to start, check the status of the pods:

kubectl get pods -n untab

If you see the status of some pods as "Pending", this is most likely caused by a lack of available resources on those nodes. Each node needs at least 0.1 CPU cores and 200MiB of RAM available in order to run node-exporter.
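
To confirm that scheduling is the problem, describe one of the Pending pods and check the events at the end of the output; messages such as "Insufficient cpu" or "Insufficient memory" mean the node cannot accommodate the container's resource requests (node-exporter-xxxxx below is a placeholder for one of your pod names):

kubectl describe pod -n untab node-exporter-xxxxx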

If the pods are showing as "Error", then check the logs for one of the failing node-exporter instances:

kubectl logs -n untab node-exporter-xxxxx

If you are unable to determine the cause or fix the problem, please contact us for support and include the output from the commands above.

Cluster status shows error: "prometheus job kubernetes-nodes-node-exporter failed on X/Y nodes"

As part of the untab namespace, Untab deploys node-exporter (unless configured to use an existing node-exporter installation). This is an open-source tool that allows Untab to collect node-level utilization metrics. It runs as a DaemonSet, which means that there will be one Pod on each node.

First, check whether all instances of node-exporter are running successfully using the command kubectl get pods -n untab. You should see output like this:

NAME                          READY   STATUS    RESTARTS   AGE
node-exporter-gzxpj           2/2     Running   0          10s
node-exporter-jtqs6           2/2     Running   0          10s
node-exporter-pshll           2/2     Running   0          10s
...
untab-agent-fd5744b64-bgfjd   3/3     Running   0          10s

If some node-exporter instances are not shown as "Running", please see the question "Some node-exporter pods fail to start" above.
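
You can also compare the DaemonSet's desired and ready pod counts to see at a glance how many nodes are missing a node-exporter instance (this assumes the DaemonSet is named node-exporter, matching the pod names above):

kubectl get daemonset node-exporter -n untab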

If all node-exporter instances are running, then this issue is most likely caused by firewall or security group rules. Please ensure that the master nodes in the cluster can access all nodes on port 9111.
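
As a rough connectivity check from inside the cluster (note this tests pod-to-node access rather than access from the masters themselves), you can start a temporary pod and fetch the metrics endpoint from one of your nodes; replace NODE_IP with a node's internal IP from kubectl get nodes -o wide:

kubectl run -it --rm port-check --image=busybox --restart=Never -- wget -qO- http://NODE_IP:9111/metrics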

You can get more information on the error by looking at the Prometheus console inside the agent. You can do this using the following steps:

  • Run: kubectl port-forward -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") 9090

  • Open your browser at: http://localhost:9090/targets

  • You should see the status of all of the metric targets Untab collects, along with any encountered errors.
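
If you prefer the command line, the same information is available from the Prometheus HTTP API while the port-forward is running; this sketch assumes curl and jq are installed locally:

curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up") | {job: .labels.job, instance: .labels.instance, error: .lastError}'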

Cluster status shows error: "prometheus job kubernetes-nodes-kubelet failed on X/Y nodes"

The Prometheus instance inside the untab-agent pod collects metrics from the Kubelets on each node by querying the /metrics and /metrics/cadvisor endpoints. These queries are proxied through the API server.

This error indicates that Prometheus was unable to connect to the Kubelet on one or more nodes.
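
Because the scrapes are proxied through the API server, you can reproduce a single request with kubectl to see the underlying error; replace NODE_NAME with the name of an affected node from kubectl get nodes:

kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/metrics" | head
kubectl get --raw "/api/v1/nodes/NODE_NAME/proxy/metrics/cadvisor" | head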

You can get more information on the error by looking at the Prometheus console inside the agent. You can do this using the following steps:

  • Run: kubectl port-forward -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") 9090

  • Open your browser at: http://localhost:9090/targets

  • You should see the status of all of the metric targets Untab collects, along with any encountered errors.

Cluster status shows metrics warning "no recent data"

The agent collects metrics via a Prometheus instance inside the agent Pod. Metrics are polled every 5 minutes and are uploaded to Untab's servers. The message "no recent data" indicates that no metrics were received in the last hour.

Once the agent is running, it can take up to 15 minutes for the first set of metrics to become available.

The first thing to check is whether the agent is running:

Checking agent status

If the agent and node-exporter pods have been running for about 15 minutes but the "no recent data" message is still shown, it is possible that Prometheus is having trouble uploading the metrics to Untab. To check whether this is the case, take a look at the Prometheus logs using the following command:

kubectl logs -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") prometheus

The exact solution will depend on the nature of the errors. In most cases, this is caused by outbound firewall rules. If your cluster uses a proxy to reach external URLs, you will need to use a customized manifest (see here for instructions).

Something else

If you are having an issue that is not covered in this guide, please contact us by email.
