Troubleshooting
When applying the Untab manifest on my GKE cluster, Kubernetes shows an error like this: "clusterroles.rbac.authorization.k8s.io "untab-agent" is forbidden: attempt to grant extra privileges"
By default, GKE users do not have the permissions needed to create the ClusterRole objects that Untab requires. To resolve this issue, grant your user cluster-admin permissions, replacing the names my-binding and user@example.org as appropriate:
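On GKE this is typically done by creating a ClusterRoleBinding for your own account. The command below follows the standard GKE RBAC setup; substitute your own binding name and email:

```shell
# Grant your user cluster-admin so it can create the ClusterRoles
# in the Untab manifest. Replace my-binding and user@example.org.
kubectl create clusterrolebinding my-binding \
  --clusterrole=cluster-admin \
  --user=user@example.org
```

Once the binding exists, re-apply the Untab manifest.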
The agent pod fails to start
The agent pod includes three containers:
agent
prometheus
kube-state-metrics
First, determine which of these containers is failing. To do this, run:
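One way to inspect the per-container state (a sketch; it assumes the agent pod carries the app=untab-agent label used by the other commands in this guide):

```shell
# Describe the agent pod; the Containers section shows the
# state of each of the three containers.
kubectl describe pod -n untab -l app=untab-agent
```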
You should see output like this:
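Illustrative, trimmed output (the real output includes many more fields):

```
Containers:
  agent:
    State:          Running
  prometheus:
    State:          Waiting
      Reason:       CrashLoopBackOff
  kube-state-metrics:
    State:          Running
```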
In the example above, we can see that it's the Prometheus container that is failing. Now, check the log output of that container to see what the problem might be:
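Using the same pod-lookup pattern as the other commands in this guide, the container's logs can be fetched with:

```shell
kubectl logs -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") prometheus
```

Replace prometheus with agent or kube-state-metrics if a different container is failing.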
Some node-exporter pods fail to start
If one or more node-exporter pods fail to start, check the status of the pods:
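The listing command used elsewhere in this guide shows the node-exporter pods alongside the agent:

```shell
kubectl get pods -n untab
```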
If you see the status of some pods as "Pending", this is most likely caused by a lack of available resources on those nodes. Each node needs at least 0.1 CPU cores and 200MiB of RAM available in order to run node-exporter.
If the pods are showing as "Error", then check the logs for one of the failing node-exporter instances:
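A sketch of the logs command; the pod name placeholder is illustrative and should be taken from the pod listing:

```shell
# Replace <node-exporter-pod-name> with the name of one of the
# failing pods shown by "kubectl get pods -n untab".
kubectl logs -n untab <node-exporter-pod-name>
```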
Cluster status shows error: "prometheus job kubernetes-nodes-node-exporter failed on X/Y nodes"
First, check whether all instances of node-exporter are running successfully using the command kubectl get pods -n untab. You should see output like this:
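Illustrative output for a three-node cluster (the pod names, counts, and ages are examples and will differ in your cluster):

```
NAME                           READY   STATUS    RESTARTS   AGE
untab-agent-6d9f7c9b4-xj2lp    3/3     Running   0          12m
untab-node-exporter-4hx7z      1/1     Running   0          12m
untab-node-exporter-9qk2m      1/1     Running   0          12m
untab-node-exporter-tr8wd      1/1     Running   0          12m
```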
If all node exporters are running then this issue is most likely caused by firewall or security group rules. Please ensure that the master nodes in the cluster can access all nodes on port 9111.
You can get more information on the error by looking at the Prometheus console inside the agent. You can do this using the following steps:
Run:
kubectl port-forward -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") 9090
Then open http://localhost:9090/targets in your browser. You should see the status of all of the metric targets Untab collects, along with any errors encountered.
Cluster status shows error: "prometheus job kubernetes-nodes-kubelet failed on X/Y nodes"
The Prometheus instance inside the untab-agent pod collects metrics from the Kubelet on each node by querying the /metrics and /metrics/cadvisor endpoints. These queries are proxied through the API server.
This error indicates that Prometheus was unable to connect to the Kubelet on one or more nodes.
You can get more information on the error by looking at the Prometheus console inside the agent. You can do this using the following steps:
Run:
kubectl port-forward -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") 9090
Then open http://localhost:9090/targets in your browser. You should see the status of all of the metric targets Untab collects, along with any errors encountered.
Cluster status shows metrics warning "no recent data"
The first thing to check is whether the agent is running:
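As elsewhere in this guide, list the pods in the untab namespace; all of them should show a STATUS of Running:

```shell
kubectl get pods -n untab
```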
If the agent and node-exporter pods have been running for about 15 minutes but the "no recent data" message is still shown, it is possible that Prometheus is having trouble uploading the metrics to Untab. To check whether this is the case, take a look at the Prometheus logs using the following command:
kubectl logs -n untab $(kubectl get pod -n untab -l app=untab-agent -o jsonpath="{.items[0].metadata.name}") prometheus
Something else