Cache hardware metrics: monitoring and routing
Observability and monitoring are important in a CDN. To support this, we use the open source application Telegraf as an agent on the caches to collect metrics such as CPU usage, memory usage, and network interface statistics.
These metrics can be exported to two services installed alongside the Director:
acd-metrics-aggregator
: aggregates cache metrics from the CDN and exports them to the Director’s selection input API.
acd-telegraf-metrics-database
: stores a smaller amount of time-series cache metrics produced by Telegraf metrics agents and allows the metrics to be scraped by a Prometheus instance, enabling alarms and visualization of the metrics in Grafana.
Telegraf Metrics Agent
All deployments of ESB2001 Orbit TV Server (from version 3.6.2) and ESB3004 SW Streamer (from version 1.36.2) come bundled with an instance of Telegraf running as a metrics agent. If you are using other caches in your CDN, it is possible to install and configure Telegraf manually, if licensing allows and the caches are accessible.
Configuring Telegraf is done using the confcli tool. Before configuration, ensure that the service telegraf-configd is active by running
systemctl status telegraf-configd
systemctl restart telegraf-configd # if the service is not active
If telegraf-configd fails to start, check the log file /var/log/telegraf-configd.log for errors.
Configuration
Once the configuration daemon is running, Telegraf can be configured using confcli:
$ confcli integration.acd.telegraf
{
"telegraf": {
"aggregators": [],
"databases": [],
"enable": true,
"enable_tls_verification": true,
"hostname": "",
"interfaces": [],
"pushInterval": 5,
"secrets": {
"enable": false,
"key": "metrics_auth_token",
"secretStoreID": "telegraf_configd_secretstore"
}
}
}
The configuration fields are:
aggregators
: a list of URLs to acd-metrics-aggregator instances to which the metrics will be exported.
databases
: a list of URLs to acd-telegraf-metrics-database instances to which the metrics will be exported.
enable
: whether the Telegraf instance is enabled or not.
enable_tls_verification
: whether to verify the TLS certificate of the target when exporting metrics.
hostname
: the hostname of the cache. This is used as a tag in the metrics. If not set, the hostname of the machine is used.
interfaces
: a list of network interfaces to collect metrics from. Supports wildcards, e.g. eths*.
pushInterval
: the interval in seconds for pushing metrics to targets.
secrets
: configuration for using a secret token to authenticate with the targets.
secrets.enable
: sets the header Authorization: Token secret_value when exporting metrics to the targets.
secrets.key
: the key/name of the secret. Defaults to metrics_auth_token.
secrets.secretStoreID
: the ID of the secret store to use. Defaults to telegraf_configd_secretstore.
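To make the wildcard support in interfaces concrete, here is a minimal Python sketch of shell-style pattern matching. It only approximates how a pattern such as eths* selects interface names; the agent's actual matching rules may differ, and both the helper name select_interfaces and the interface names are hypothetical.

```python
from fnmatch import fnmatchcase


def select_interfaces(patterns, available):
    """Return the names in `available` that match any shell-style pattern."""
    return [
        name
        for name in available
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# Hypothetical interface names, for illustration only.
print(select_interfaces(["eths*"], ["lo", "eths0", "eths1", "mgmt0"]))
# prints ['eths0', 'eths1']
```

The sketch says nothing about how the agent treats an empty interfaces list.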
Targets
acd-metrics-aggregator
acd-metrics-aggregator
is a service that aggregates cache metrics in the CDN and
exports them to the Director’s selection input API.
The service is installed alongside the Director.
The service is started by default after installation. You can check the status of the service by running
systemctl status acd-metrics-aggregator
systemctl restart acd-metrics-aggregator # if the service is not active
The logs are accessed by running
journalctl -u acd-metrics-aggregator
The configuration file for the aggregator is located at /opt/edgeware/acd/metrics/aggregator-conf.json. The default configuration created during installation will export metrics to the local Director’s selection input API:
{
"tls_cert": "",
"tls_key": "",
"metrics_listen_port": 8087,
"interval": 5,
"log_level": "INFO",
"targets": [
{
"url": "https://127.0.0.1:5001/v1/timestamped_selection_input",
"http_headers": {
"x-api-key": "8a3094e875b841d480482cfd82e3e313"
}
}
],
"secrets": {
"enable": false,
"key": "metrics_auth_token"
}
}
The configuration fields are:
tls_cert
: the path to the TLS certificate file if acd-metrics-aggregator should use TLS.
tls_key
: the path to the TLS key file if acd-metrics-aggregator should use TLS.
- Note that the service is run as a container with the directory /opt/edgeware/acd/ssl mounted as a volume with the same path in the container. The certificate and key files should be placed in this directory.
metrics_listen_port
: the port on which the service listens for metrics.
interval
: the interval in seconds for pushing metrics to the targets.
log_level
: the log level for the aggregator. Can be DEBUG, INFO or WARNING.
targets
: a list of targets to which the metrics will be exported.
- url: the URL of the target.
- http_headers: HTTP headers to send with the request.
  - x-api-key: the API key for the Director’s REST API. Populated by default.
secrets
: configuration for using a secret token to authenticate incoming metrics requests.
- enable: requires the header Authorization: Token secret_value when receiving metrics.
- key: the key/name of the secret. Defaults to metrics_auth_token.
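Before restarting the service after an edit, it can help to sanity-check the file against the field descriptions above. The following Python sketch is one way to do that; it is not part of the product, the helper names are hypothetical, and the rules it applies (TLS cert and key set together, files under /opt/edgeware/acd/ssl, the three listed log levels) are only what this section states.

```python
import json

VALID_LOG_LEVELS = {"DEBUG", "INFO", "WARNING"}
SSL_DIR = "/opt/edgeware/acd/ssl/"  # directory mounted into the container


def check_aggregator_config(conf):
    """Return a list of human-readable problems found in an aggregator config dict."""
    problems = []
    # TLS needs both files or neither.
    if bool(conf.get("tls_cert")) != bool(conf.get("tls_key")):
        problems.append("tls_cert and tls_key must be set together")
    for field in ("tls_cert", "tls_key"):
        path = conf.get(field) or ""
        if path and not path.startswith(SSL_DIR):
            problems.append(f"{field} should be placed under {SSL_DIR}")
    if conf.get("log_level") not in VALID_LOG_LEVELS:
        problems.append("log_level must be DEBUG, INFO or WARNING")
    port = conf.get("metrics_listen_port")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        problems.append("metrics_listen_port must be an integer between 1 and 65535")
    for index, target in enumerate(conf.get("targets", [])):
        if not target.get("url"):
            problems.append(f"targets[{index}] has no url")
    return problems


def check_aggregator_file(path="/opt/edgeware/acd/metrics/aggregator-conf.json"):
    """Load the JSON file at `path` and return the problems found in it."""
    with open(path, encoding="utf-8") as handle:
        return check_aggregator_config(json.load(handle))
```

Calling check_aggregator_file() on the Director host returns an empty list when none of these checks fail; it does not guarantee that the service accepts the file.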
acd-telegraf-metrics-database
acd-telegraf-metrics-database
is a service that stores cache metrics in a time
series database to be scraped by a Prometheus instance, allowing for alarms and
visualization of the metrics in Grafana. The service is installed alongside the
Director.
The service runs Telegraf in a container and receives metrics from the Telegraf metrics agents. Note that it is not a full-fledged time series database such as InfluxDB, but a smaller store that holds a limited amount of data. It acts as an intermediary between the Telegraf metrics agents and the scraping Prometheus instance.
The service is started by default after installation. You can check the status of the service by running
systemctl status acd-telegraf-metrics-database
systemctl restart acd-telegraf-metrics-database # if the service is not active
The logs are accessed by running
journalctl -u acd-telegraf-metrics-database
The configuration file for the database is located at /opt/edgeware/acd/metrics/telegraf-metrics-database.conf:
# Global tags can be specified here in key="value" format.
[global_tags]
# Example configuration for aggregator
[agent]
interval = "5s" # Data collection interval
round_interval = true # Round collection interval to 'interval'
metric_batch_size = 1000 # Max metrics sent in one batch
metric_buffer_limit = 10000 # Max metrics stored when outputs are unavailable
collection_jitter = "0s" # Collection jitter to stagger collection times
flush_interval = "5s" # Metrics output interval
flush_jitter = "0s" # Output jitter to stagger output times
precision = "" # Timestamp precision in output
debug = false # Enable more log info for debugging
quiet = false # Enable less log info
logfile = "" # Log file path, empty means log to stderr
hostname = "" # Host identifier, empty means use os.Hostname()
omit_hostname = false # Include hostname in output
# Listen for Telegraf metrics agents
[[inputs.influxdb_v2_listener]]
service_address = ":8086"
# tls_cert = ""
# tls_key = ""
# token = "@{secretstore:metrics_auth_token}" # Uncomment this line to use token authentication
# Expose port for prometheus to scrape all metrics from
[[outputs.prometheus_client]]
listen = ":12001"
# Secretstore configuration
[[secretstores.docker]]
id = "secretstore"
See https://docs.influxdata.com/telegraf/v1/plugins/ for the full plugin directory and plugin documentation. The important fields are:
inputs.influxdb_v2_listener
: plugin that listens for incoming metrics from cache agents.
- service_address: the address and port to listen on.
- tls_cert: the path to the TLS certificate file if TLS should be used.
- tls_key: the path to the TLS key file if TLS should be used.
  - Note that the service is run as a container with the directory /opt/edgeware/acd/ssl mounted as a volume with the same path in the container. The certificate and key files should be placed in this directory.
- token: the token to use for authentication. This field is commented out by default. Uncomment this line to use token authentication.
outputs.prometheus_client
: opens a port for Prometheus to scrape metrics from.
- listen: the port to listen on.
secretstores.docker
: plugin for using Podman secrets to authenticate incoming metrics.
- id: the ID of the secret store to use. Used in the token field for the inputs.influxdb_v2_listener plugin.
Using Secrets for Request Authorization
Secrets can be used to authenticate incoming metrics requests to acd-metrics-aggregator and acd-telegraf-metrics-database by requiring the header Authorization: Token secret_value when receiving metrics. The Telegraf metrics agent can be configured to attach this header when exporting metrics to acd-metrics-aggregator and acd-telegraf-metrics-database.
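The header exchange described above can be sketched in a few lines of Python. This is an illustration of the mechanism only, not the implementation used by the agent or by either service; the function names are hypothetical, and the exact-case header lookup is a simplification.

```python
import hmac


def authorization_header(secret_value):
    """Header a sender attaches, in the form 'Authorization: Token secret_value'."""
    return {"Authorization": f"Token {secret_value}"}


def is_authorized(headers, expected_secret):
    """Check an incoming request's headers against the expected secret.

    compare_digest is used so the comparison takes constant time.
    """
    supplied = headers.get("Authorization", "")
    expected = f"Token {expected_secret}"
    return hmac.compare_digest(supplied.encode(), expected.encode())


# "example-secret" is a placeholder value, not a default of any service.
headers = authorization_header("example-secret")
print(is_authorized(headers, "example-secret"))  # prints True
print(is_authorized({}, "example-secret"))       # prints False
```

A request without the header, or with a different token, fails the check, which corresponds to the request being rejected by the receiving service.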
Cache Metrics Agent
The secret configuration on the Telegraf metrics agent can be seen by running
$ confcli integration.acd.telegraf.secrets
{
"secrets": {
"enable": false,
"key": "metrics_auth_token",
"secretStoreID": "telegraf_configd_secretstore"
}
}
To set the secret value with the key/name metrics_auth_token using the secret store telegraf_configd_secretstore, run
$ telegraf secret set telegraf_configd_secretstore metrics_auth_token
Enter secret value:
This will prompt you to enter the secret value. Once the secret value is set, enable its use for authenticating with the targets by running
$ confcli integration.acd.telegraf.secrets.enable true
integration.acd.telegraf.secrets.enable = True
This will set the header Authorization: Token secret_value when exporting metrics to the configured targets.
Note that if you change the secret value, you need to restart the Telegraf metrics agent, either by updating the configuration (e.g. disabling the service and enabling it again) or by running
systemctl restart telegraf
acd-metrics-aggregator and acd-telegraf-metrics-database
Both acd-metrics-aggregator
and acd-telegraf-metrics-database
use secrets
supplied by Podman to enable request authorization. During installation, a
placeholder secret is created with the name metrics_auth_token
in Podman. This
secret is loaded into the respective containers when starting the services.
To set the secret value securely, the following commands prompt you to enter the value and store it in a temporary shell variable. The value is then piped to podman, which stores it as the secret. Lastly, the variable is unset so that the value is removed from the shell:
read -rsp "Enter secret value: " SECRET_VALUE
printf '%s' "$SECRET_VALUE" | podman secret create --replace metrics_auth_token -
unset SECRET_VALUE
This will set the secret metrics_auth_token to the value you entered.
Enabling Request Authorization
To use the secret for request authorization in acd-metrics-aggregator, modify the secrets field in the configuration file /opt/edgeware/acd/metrics/aggregator-conf.json to:
{
"secrets": {
"enable": true,
"key": "metrics_auth_token"
}
}
Then restart the service:
systemctl restart acd-metrics-aggregator
To use the secret for request authorization in acd-telegraf-metrics-database, modify the configuration file to use the secret by uncommenting the token field in the inputs.influxdb_v2_listener plugin:
[[inputs.influxdb_v2_listener]]
service_address = ":8086"
token = "@{secretstore:metrics_auth_token}"
Make sure the secret store ID in the token field matches the id in the secretstores.docker section of the configuration file, and that the secret key matches the name of the Podman secret.
Then restart the service:
systemctl restart acd-telegraf-metrics-database
All incoming metrics requests to acd-metrics-aggregator and acd-telegraf-metrics-database will now require the header Authorization: Token secret_value. Requests without it will be rejected.
Example
A Telegraf metrics agent can be configured to export metrics to a Director that is installed on the host director-1 as follows.
Assuming that acd-metrics-aggregator is listening on port 8087 and acd-telegraf-metrics-database is listening on port 8086, the following configuration will track the CPU usage, memory usage, and network interface statistics for interfaces eths0 and eths1:
$ confcli integration.acd.telegraf
{
"telegraf": {
"aggregators": [
"http://director-1:8087/metrics"
],
"databases": [
"http://director-1:8086"
],
"enable": true,
"enable_tls_verification": true,
"hostname": "cache-1",
"interfaces": [
"eths0",
"eths1"
],
"pushInterval": 5,
"secrets": {
"enable": false,
"key": "metrics_auth_token",
"secretStoreID": "telegraf_configd_secretstore"
}
}
}
Note that each entry in aggregators requires the path /metrics to be appended to the URL. The entries in databases do not require any path to be appended to the URL.
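The path rule is easy to get wrong when several targets are configured, so a small check can be useful. The Python sketch below flags aggregator URLs that lack the /metrics path and database URLs that carry one; the helper name is hypothetical, and treating any path on a database URL as a problem is a stricter reading than the text requires.

```python
from urllib.parse import urlsplit


def check_target_urls(aggregators, databases):
    """Return warnings for target URLs that do not follow the path rule."""
    warnings = []
    for url in aggregators:
        if urlsplit(url).path != "/metrics":
            warnings.append(f"aggregator URL should end with /metrics: {url}")
    for url in databases:
        if urlsplit(url).path not in ("", "/"):
            warnings.append(f"database URL should not include a path: {url}")
    return warnings


# The URLs from the example configuration pass without warnings.
print(check_target_urls(["http://director-1:8087/metrics"], ["http://director-1:8086"]))
# prints []
```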
Once the configuration is set, the cache metrics agent will start exporting metrics to the instances of acd-metrics-aggregator and acd-telegraf-metrics-database running on director-1.
acd-metrics-aggregator will aggregate the metrics and export them to the Director’s selection input API. The metrics can be viewed by running
curl -k https://director-1:5001/v1/selection_input -H "x-api-key: <your-api-key>"
{
"cache-1": {
"hardware_metrics": {
"/media": {
"free": 2247610073088,
"total": 2261300281344,
"used": 13690208256,
"used_percent": 0.6054131054131053
},
"/non-volatile": {
"free": 1722658816,
"total": 1934635008,
"used": 95219712,
"used_percent": 5.237957901662393
},
"/var/log": {
"free": 487481344,
"total": 536870912,
"used": 49389568,
"used_percent": 9.19952392578125
},
"cpu_load1": 0.07,
"cpu_load15": 0,
"cpu_load5": 0.03,
"ehsd_online": 1,
"mem_available": 7252512768,
"mem_available_percent": 88.21865022898756,
"mem_total": 8221065216,
"mem_used": 818151424,
"n_cpus": 4
},
"per_interface_metrics": {
"eths0": {
"bytes_recv": 155734911,
"bytes_recv_rate": 1552,
"bytes_sent": 2967510,
"bytes_sent_rate": 1378,
"drop_in": 843508,
"drop_in_rate": 7,
"drop_out": 0,
"drop_out_rate": 0,
"err_in": 0,
"err_in_rate": 0,
"err_out": 0,
"err_out_rate": 0,
"interface_up": true,
"link": 1,
"megabits_recv": 1245.879288,
"megabits_recv_rate": 0.012416,
"megabits_sent": 23.74008,
"megabits_sent_rate": 0.011024,
"speed": 10000,
"speed_rate": 0
},
"eths1": {
"bytes_recv": 66197103,
"bytes_recv_rate": 256.5,
"bytes_sent": 612,
"bytes_sent_rate": 0,
"drop_in": 3399,
"drop_in_rate": 0,
"drop_out": 0,
"drop_out_rate": 0,
"err_in": 0,
"err_in_rate": 0,
"err_out": 0,
"err_out_rate": 0,
"interface_up": true,
"link": 1,
"megabits_recv": 529.576824,
"megabits_recv_rate": 0.002052,
"megabits_sent": 0.004896,
"megabits_sent_rate": 0,
"speed": 10000,
"speed_rate": 0
}
}
}
}
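A response like the one above is straightforward to post-process. The following Python sketch works on the parsed JSON and reports disks above a usage threshold and the load per CPU for each cache. It assumes the layout shown above (caches at the top level, mount points as nested objects inside hardware_metrics) and that cpu_load1 is the 1-minute load average; the function names and the 5 percent threshold are hypothetical.

```python
def disks_above(selection_input, threshold_percent):
    """Return (cache, mount_point, used_percent) for disks above the threshold."""
    hits = []
    for cache, data in selection_input.items():
        for name, value in data.get("hardware_metrics", {}).items():
            # Mount points are the entries whose value is an object with
            # "used_percent"; scalar entries are other hardware metrics.
            if isinstance(value, dict) and value.get("used_percent", 0) > threshold_percent:
                hits.append((cache, name, value["used_percent"]))
    return hits


def load_per_cpu(selection_input):
    """Return cpu_load1 divided by n_cpus for each cache that reports both."""
    result = {}
    for cache, data in selection_input.items():
        hw = data.get("hardware_metrics", {})
        if "cpu_load1" in hw and hw.get("n_cpus"):
            result[cache] = hw["cpu_load1"] / hw["n_cpus"]
    return result


# Abbreviated copy of the response shown above.
sample = {
    "cache-1": {
        "hardware_metrics": {
            "/media": {"used_percent": 0.6054131054131053},
            "/non-volatile": {"used_percent": 5.237957901662393},
            "/var/log": {"used_percent": 9.19952392578125},
            "cpu_load1": 0.07,
            "n_cpus": 4,
        }
    }
}
print(disks_above(sample, 5.0))
print(load_per_cpu(sample))
```

In practice the input would come from parsing the output of the curl command with json.loads rather than from a literal.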
acd-telegraf-metrics-database will store the metrics in a time series database to be scraped by a Prometheus instance. The metrics can be scraped manually by running
curl -k http://director-1:12001/metrics
# HELP disk_used_percent Telegraf collected metric
# TYPE disk_used_percent untyped
disk_used_percent{device="dm-0",fstype="xfs",host="cache-1",label="vg_main-lv_root",metric_owner="telegraf-configd",mode="rw",path="/"} 12.752679256881688
# HELP ehsd_ehsd_online Telegraf collected metric
# TYPE ehsd_ehsd_online untyped
ehsd_ehsd_online{host="cache-1",metric_owner="telegraf-configd"} 1
# HELP net_megabits_sent_rate Telegraf collected metric
# TYPE net_megabits_sent_rate untyped
net_megabits_sent_rate{host="cache-1",interface="eth0",metric_owner="telegraf-configd"} 0.002174
...
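When checking the scrape output by hand, it can be convenient to turn the text into structured samples. The Python sketch below handles the simple lines shown above (metric name, optional labels, numeric value) and skips comments and anything it cannot read. It is a deliberately small reader, not a complete parser for the Prometheus text format, and the function name is hypothetical.

```python
import re

SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip lines that are not samples, such as "..."
        try:
            value = float(match["value"])
        except ValueError:
            continue
        labels = dict(LABEL_RE.findall(match["labels"] or ""))
        samples.append((match["name"], labels, value))
    return samples


text = '''# HELP ehsd_ehsd_online Telegraf collected metric
# TYPE ehsd_ehsd_online untyped
ehsd_ehsd_online{host="cache-1",metric_owner="telegraf-configd"} 1
...'''
for name, labels, value in parse_samples(text):
    print(name, labels["host"], value)  # prints: ehsd_ehsd_online cache-1 1.0
```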
Using Hardware Metrics in Routing
When the hardware metrics have been successfully injected into the selection input API, the data is ready to be used for routing decisions. In addition to creating routing conditions, hardware metrics are particularly suited to host health checks.
Host Health Checks
Host health checks are Lua functions that determine whether or not a host is ready to take on more clients. To configure host health checks, see configuring CDNs and hosts. See Built-in Lua functions for documentation on how to use the built-in health functions. Note that these health check functions can also be used for making routing conditions in the routing tree.
Routing Based on Cache Metrics
Instead of configuring host health checks, hardware metrics can be incorporated into routing decisions through regular routing conditions.
As an example, the health check function cpu_load_ok() can be used as a routing condition as follows:
$ confcli services.routing.rules -w
Running wizard for resource 'rules'
Hint: Hitting return will set a value to its default.
Enter '?' to receive the help string
rules : [
rule can be one of
1: allow
2: consistentHashing
3: contentPopularity
4: deny
5: firstMatch
6: random
7: rawGroup
8: rawHost
9: split
10: weighted
Choose element index or name: firstMatch
Adding a 'firstMatch' element
rule : {
name (default: ): dont_overload_cache
type (default: firstMatch):
targets : [
target : {
onMatch (default: ): default_host
condition (default: always()): cpu_load_ok()
}
Add another 'target' element to array 'targets'? [y/N]: y
target : {
onMatch (default: ): offload_host
condition (default: always()):
}
Add another 'target' element to array 'targets'? [y/N]: n
]
}
Add another 'rule' element to array 'rules'? [y/N]: n
]
Generated config:
{
"rules": [
{
"name": "dont_overload_cache",
"type": "firstMatch",
"targets": [
{
"onMatch": "default_host",
"condition": "cpu_load_ok()"
},
{
"onMatch": "offload_host",
"condition": "always()"
}
]
}
]
}
Merge and apply the config? [y/n]: y
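The intent of the rule above can be illustrated with a short Python sketch: targets are tried in order, and the first one whose condition holds is chosen. This is an assumption based on the rule type's name, not a description of the Director's implementation. The cpu_load_ok stand-in and its threshold of 0.9 load per CPU are hypothetical; the real built-in Lua function and its criteria are documented separately.

```python
def first_match(targets, metrics):
    """Return the onMatch value of the first target whose condition holds."""
    for target in targets:
        if target["condition"](metrics):
            return target["onMatch"]
    return None  # no target matched


def cpu_load_ok(metrics, max_load_per_cpu=0.9):
    """Hypothetical stand-in for the built-in cpu_load_ok() health function."""
    return metrics["cpu_load1"] / metrics["n_cpus"] <= max_load_per_cpu


def always(metrics):
    """Condition that holds for every request."""
    return True


targets = [
    {"onMatch": "default_host", "condition": cpu_load_ok},
    {"onMatch": "offload_host", "condition": always},
]

print(first_match(targets, {"cpu_load1": 0.07, "n_cpus": 4}))  # prints default_host
print(first_match(targets, {"cpu_load1": 8.0, "n_cpus": 4}))   # prints offload_host
```

With this ordering, offload_host only receives traffic when the condition on default_host fails, since always() acts as the fallback.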