Prometheus and Grafana for monitoring
To collect useful system metrics for stability and performance monitoring, we recommend using Prometheus. To visualize the metrics collected by Prometheus, you can use Grafana.
The System controller receives a lot of internal metrics from the Agile Live components. These can be pulled by a Prometheus instance from the endpoint https://system-controller-host:8080/metrics.
It is also possible to install external exporters for various hardware and OS metrics.
Prometheus
Installation
Use this guide to install Prometheus: https://prometheus.io/docs/prometheus/latest/installation/.
Configuration
Prometheus should be configured to scrape the System controller with something like this in the prometheus.yml file:
scrape_configs:
  - job_name: 'system_controller_exporter'
    scrape_interval: 5s
    scheme: https
    tls_config:
      # Skips TLS certificate verification (e.g. for a self-signed certificate)
      insecure_skip_verify: true
    static_configs:
      - targets: ['system-controller-host:8080']
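Note that insecure_skip_verify: true disables certificate verification for this job. If the System controller presents a certificate signed by a CA you have access to, a stricter variant may be possible. The following is only a sketch: ca_file is a standard Prometheus tls_config option, but the path /etc/prometheus/system-controller-ca.crt is a hypothetical placeholder, and whether the certificate in your deployment can be verified this way is an assumption.

```yaml
scrape_configs:
  - job_name: 'system_controller_exporter'
    scrape_interval: 5s
    scheme: https
    tls_config:
      # Hypothetical path: CA certificate used to verify the System controller
      ca_file: /etc/prometheus/system-controller-ca.crt
    static_configs:
      - targets: ['system-controller-host:8080']
```

With this variant, the host name in the target must match a name in the server certificate, otherwise the scrape fails.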
External exporters
Node Exporter
Node Exporter is an exporter used for general hardware and OS metrics, such as CPU load and memory usage.
Instructions for installation and configuration can be found here: https://github.com/prometheus/node_exporter
Add a new scrape_config in prometheus.yml like so:
  - job_name: 'node_exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['node-exporter-host:9100']
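When several machines run the Node Exporter, they can share a single job by listing more than one target. A minimal sketch, where the host names ingest-host-1 and production-host-1 are hypothetical placeholders for your own machines:

```yaml
  - job_name: 'node_exporter'
    scrape_interval: 15s
    static_configs:
      - targets:
          - 'ingest-host-1:9100'      # hypothetical host name
          - 'production-host-1:9100'  # hypothetical host name
```

Prometheus attaches an instance label with the host and port to every scraped series, so the machines can still be told apart in queries and dashboards.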
DCGM Exporter
This exporter uses Nvidia DCGM to gather metrics from Nvidia GPUs, including encoder and decoder utilization.
More information and installation instructions can be found here: https://github.com/NVIDIA/dcgm-exporter
Add a new scrape_config in prometheus.yml like so:
  - job_name: 'dcgm_exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['dcgm-exporter-host:9400']
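Putting the three jobs from this page together, a complete prometheus.yml could look roughly like the sketch below. It only combines the fragments shown above; the host names are the same placeholders and must be replaced with your own, and any other settings your Prometheus installation already has are left out.

```yaml
scrape_configs:
  # Internal metrics from the Agile Live components
  - job_name: 'system_controller_exporter'
    scrape_interval: 5s
    scheme: https
    tls_config:
      insecure_skip_verify: true
    static_configs:
      - targets: ['system-controller-host:8080']

  # General hardware and OS metrics
  - job_name: 'node_exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['node-exporter-host:9100']

  # Nvidia GPU metrics
  - job_name: 'dcgm_exporter'
    scrape_interval: 15s
    static_configs:
      - targets: ['dcgm-exporter-host:9400']
```

All jobs are entries of the same scrape_configs list, so the indentation of each "- job_name" line must match.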
Grafana
Installation of Grafana is described here: https://grafana.com/docs/grafana/latest/setup-grafana/installation/
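Once Grafana is installed, it needs Prometheus as a data source before any dashboard can show data. This can be set up in the Grafana UI or with a provisioning file. A minimal sketch of the latter, assuming Grafana reads provisioning files from /etc/grafana/provisioning/datasources/ (the usual location for Linux package installs) and that Prometheus listens on its default port 9090 on a host called prometheus-host (a hypothetical name):

```yaml
# Hypothetical file: /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    # Assumption: replace with the address of your Prometheus instance
    url: http://prometheus-host:9090
    isDefault: true
```

Grafana reads provisioning files at startup, so restart it after adding or changing the file.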
As a start, the following Dashboards can be used to visualize the Node Exporter and DCGM Exporter data:
Example of running Node Exporter and DCGM Exporter with Docker Compose
To simplify setting up the Node Exporter and DCGM Exporter on multiple monitored machines, the following example Docker Compose file can be used. First, after installing Docker and the Docker Compose plugin as usual, install and configure the Nvidia Container Toolkit to allow access to the Nvidia GPU from inside a Docker container:
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Then the following example docker-compose.yml file can be used to start both the Node Exporter and the DCGM Exporter:
version: '3.8'
services:
  node_exporter:
    image: quay.io/prometheus/node-exporter:latest
    container_name: node_exporter
    command:
      - '--path.rootfs=/host'
    network_mode: host
    pid: host
    restart: unless-stopped
    volumes:
      - '/:/host:ro,rslave'
  dcgm_exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.3-3.3.1-ubuntu22.04
    container_name: dcgm_exporter
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [ gpu ]
    restart: unless-stopped
    environment:
      - DCGM_EXPORTER_NO_HOSTNAME=1
    cap_add:
      - SYS_ADMIN
    ports:
      - "9400:9400"
Start the Docker containers as usual with docker compose up -d.
To verify that the exporters work, you can use curl to fetch the metrics data: curl localhost:9100/metrics for the Node Exporter and curl localhost:9400/metrics for the DCGM Exporter. Note that the DCGM Exporter might take several seconds to collect its first metrics, so the first requests might return an empty response body.