OpenShift integrated cluster monitoring
OpenShift 3.11 ships with a pre-configured and self-updating monitoring stack based on the Prometheus open source project. The main components are Prometheus, Alertmanager and Grafana. The stack includes a set of alerts that notify the cluster admin of problems, along with a set of Grafana dashboards. The deployed Prometheus instance is responsible for monitoring and alerting on cluster and OpenShift components. While it enables you to monitor the compute resources of your pods using the K8s / Compute Resources / Pod Grafana dashboard, it shouldn't be extended to monitor user applications. Users interested in leveraging Prometheus for application monitoring on OpenShift should consider using the Operator Lifecycle Manager (https://github.com/operator-framework/operator-lifecycle-manager) to easily deploy a Prometheus Operator and set up new Prometheus instances to monitor and alert on their applications. Alertmanager is a cluster-global component that handles alerts generated by all Prometheus instances deployed in the cluster.
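As a sketch of the Prometheus Operator approach, a `ServiceMonitor` resource like the following tells a user-deployed Prometheus instance which Services to scrape. All names, namespaces and labels here are illustrative assumptions, not values from this cluster:

```yaml
# Illustrative ServiceMonitor for a user-deployed Prometheus Operator;
# names, namespaces and labels are assumptions, not values from this cluster.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: my-project
  labels:
    team: frontend
spec:
  selector:
    matchLabels:
      app: example-app     # matches Services carrying this label
  endpoints:
  - port: web              # named port on the Service exposing /metrics
    interval: 30s
```

A Prometheus custom resource whose `serviceMonitorSelector` matches the `team: frontend` label would then pick this up automatically.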
Cluster monitoring is deployed with persistent storage, meaning that metrics are stored on persistent volumes and survive a pod being restarted or recreated. By default, each of the three Alertmanager instances is allocated 2GiB of storage and each of the two Prometheus instances 50GiB, so 106GiB of storage in total is used for cluster monitoring. If the persistent volumes fill up, or you notice them getting close to full, you can expand them by following the steps in How to expand OpenShift persistent volumes or by raising a Service Request in the UKCloud Portal.
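As a rough sketch of what expanding one of these volumes involves, you increase the requested storage on the relevant PersistentVolumeClaim (provided the storage class supports volume expansion). The claim name below is an example only; check the actual names with `oc get pvc -n openshift-monitoring`:

```yaml
# Illustrative PVC resize; the claim name is an assumption, and the storage
# class must have allowVolumeExpansion: true for this to take effect.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-k8s-db-prometheus-k8s-0   # example name only
  namespace: openshift-monitoring
spec:
  resources:
    requests:
      storage: 100Gi   # raised from the default 50Gi
```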
If you'd like to change the behaviour of Alertmanager, for example to have alerts notify somebody by email, you'll need to raise a Service Request with us via the My Calls section of the UKCloud Portal. In the request, let us know the cluster and the Alertmanager configuration you'd like to be used, and we can then reconfigure cluster monitoring using the OpenShift Ansible playbooks used to deploy it. If you attempt to edit the ConfigMaps or Secrets directly to insert this configuration, the cluster monitoring operator will detect the changes and restore the default values. You can use the following documentation page as a reference when creating Alertmanager config: https://prometheus.io/docs/alerting/configuration/.
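For example, a minimal Alertmanager configuration that routes all alerts to an email receiver might look like the following. All hosts, addresses and credentials are placeholders you'd supply in the Service Request:

```yaml
# Illustrative Alertmanager config; every host, address and credential
# here is a placeholder, not a real value.
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'changeme'
route:
  receiver: email-ops          # default receiver for all alerts
receivers:
- name: email-ops
  email_configs:
  - to: 'ops-team@example.com'
```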
The OCP 3.11 delivery of the Prometheus Alertmanager includes only the default template. Configurations that require a specific template should be avoided, as they may cause reconfiguration of cluster monitoring to fail.
It is possible to inject additional configuration into the stack. However, this is unsupported by Red Hat, as configuration paradigms might change across Prometheus releases, and such cases can only be handled gracefully if all configuration possibilities are controlled.
Accessing Web UIs
Running `oc get routes -n openshift-monitoring` will give you the URLs you can use to access the Prometheus, Alertmanager and Grafana UIs. You'll need to prepend `https://` to the addresses.
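As a small illustration of that step, the loop below turns route hostnames into browsable URLs. The hostnames are hypothetical examples of what `oc get routes -n openshift-monitoring` might list; your cluster's apps domain will differ:

```shell
# Hypothetical route hostnames; substitute the ones reported by
# `oc get routes -n openshift-monitoring` on your cluster.
for host in prometheus-k8s-openshift-monitoring.apps.example.com \
            alertmanager-main-openshift-monitoring.apps.example.com \
            grafana-openshift-monitoring.apps.example.com; do
  echo "https://${host}"
done
```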
If you navigate to the Grafana web UI you'll be able to access a number of pre-configured dashboards. These dashboards provide insight into the compute resource usage at cluster, node and pod levels. There are also USE Method dashboards, monitoring Utilisation, Saturation and Errors, at the cluster and node level. Click the dropdown in the top left on the home dashboard to select which dashboard you'd like to view.
Configuring custom dashboards isn't a supported option. If you require custom dashboards, we recommend deploying your own Grafana instance and connecting it to the cluster monitoring Prometheus.
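As a hedged sketch of that connection, a self-managed Grafana instance can be pointed at the cluster monitoring Prometheus with a datasource provisioning file along these lines. The URL and bearer token are placeholders; you'd use your cluster's Prometheus route and a service account token with the appropriate permissions:

```yaml
# Illustrative Grafana datasource provisioning file; the URL and token
# are placeholders, not values from this cluster.
apiVersion: 1
datasources:
- name: openshift-cluster-monitoring
  type: prometheus
  access: proxy
  url: https://prometheus-k8s-openshift-monitoring.apps.example.com
  jsonData:
    httpHeaderName1: Authorization
  secureJsonData:
    httpHeaderValue1: Bearer <token>   # token for an authorised service account
```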
Below are some useful documentation pages regarding OpenShift cluster monitoring: