Mellow Root

Debugging Prometheus high memory usage

Sometimes, Prometheus ends up using a lot of RAM and it's not obvious why. These are some takeaways and notes from when I ran into that problem.

What makes Prometheus use a lot of RAM?

needed_ram = number_of_series_in_head * 8 kB (the approximate size of a time series; the number of values stored in a series matters less, since each value is only a delta from the previous one) 1

In other words: many time series = high RAM usage. Find the number of time series in the head by visiting /tsdb-status in the Prometheus web interface, or with this query:

prometheus_tsdb_head_series

This gives you a number; multiply it by 8 kB and convert to GB to get an estimate of how much RAM the time series currently use. You want the number of series in the head to be low.
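The back-of-the-envelope estimate above can be scripted for a quick sanity check (a minimal sketch; the 8 kB per series figure is the rough approximation from above, not an exact number):

```python
def estimate_head_ram_gib(series_in_head: int, kib_per_series: int = 8) -> float:
    """Rough head-block RAM estimate: series count times ~8 KiB per series."""
    return series_in_head * kib_per_series * 1024 / 1024**3

# With the 4,725,152 series from the promtool output further down:
print(f"{estimate_head_ram_gib(4_725_152):.1f} GiB")  # roughly 36 GiB
```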

What leads to many time series?

Every combination of key and value for a label creates a new time series:

Every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values. 2

So metrics such as:

page_views{ip="198.51.100.2"} 14
page_views{ip="192.0.2.117"} 1
page_views{ip="203.0.113.56"} 3
...

use a lot of RAM, since every IP creates a new series. It gets even worse with more labels holding dynamic values: include the HTTP method as well and you get a new time series for every combination of IP and HTTP method.
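The multiplicative effect is easy to underestimate. A tiny sketch of how the worst-case series count grows with each dynamic label (the cardinality numbers here are hypothetical, not from any real workload):

```python
from math import prod

# Distinct values per label; each extra label multiplies the series count.
label_cardinalities = {"ip": 70_000, "method": 5, "status": 6}

series = 1
for label, n in label_cardinalities.items():
    series *= n
    print(f"after adding {label!r}: up to {series:,} series")

# The worst case is the product of all label cardinalities:
assert series == prod(label_cardinalities.values())  # 2,100,000
```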

Analyze Prometheus RAM usage

Prometheus has a head block, where it keeps the most recent 1-3 hours of data in memory. So that's where we want to look to understand what's eating all the RAM.

There is an official diagnostic tool for analyzing the head block: promtool. If you run Prometheus on Kubernetes you need to exec into the container to run it:

# Exec into the container
kubectl exec -itn prometheus prometheus-0 -- sh

# Analyze the head block
promtool tsdb analyze /prometheus

This will show you something like this:

Block ID: H3110W0R1DD34DB33FB4N4N4S0
Duration: 2h0m0s
Series: 4725152
Label names: 320
Postings (unique label pairs): 173806
Postings entries (total label pairs): 51041858

Label pairs most involved in churning:
1270840 endpoint=80
1159074 job=kubelet
1159072 service=prometheus-kubelet
1159070 endpoint=https-metrics
1083254 namespace=island
958367 container=banana
884999 metrics_path=/metrics/cadvisor
858300 service=banana
858288 job=banana
854288 app=banana
662073 __name__=job_received_stuff
614177 job=kube-state-metrics
611792 endpoint=http
611304 service=prometheus-state-metrics
459705 namespace=kube-system
392465 image=banana:latest
392465 container=POD
328333 namespace=prometheus
290816 container=state-metrics
290369 namespace=forrest

Label names most involved in churning:
3324149 __name__
3303490 job
3303488 service
3303438 instance
3303421 endpoint
3242107 namespace
2976136 pod
2808215 container
1262970 app
1253723 node
1159070 metrics_path
888960 id
683392 port
658136 image
652944 name
523048 le
311685 device
239813 status
204254 reason
200849 interface

Most common label pairs:
1923761 endpoint=80
1613385 job=kubelet
1613376 endpoint=https-metrics
1611934 service=prometheus-kubelet
1540862 namespace=island
1386748 container=banana
1266758 service=banana
1266742 job=banana
1261733 app=banana
1209648 metrics_path=/metrics/cadvisor
1013461 __name__=job_received_stuff
733140 job=kube-state-metrics
730625 endpoint=http
729320 service=prometheus-metrics
653508 namespace=kube-system
534043 container=POD
534043 image=banana:latest
453647 namespace=kube-prometheus
449907 namespace=disco
392241 metrics_path=/metrics

Label names with highest cumulative label value length:
1777423 id
1099754 name
901291 ip
499393 container_id
216144 uid
168868 pod
157392 pod_uid
87601 address
67868 secret
62131 device
60790 interface
58802 module
58763 __name__
49725 instance
41951 pod_ip
38636 image
35093 port
25534 image_id
22925 replicaset
18880 token

Highest cardinality labels:
69989 ip
17885 id
11890 name
8541 port
6841 container_id
6004 uid
5952 pod
5153 address
4447 device
4372 pod_uid
4345 interface
3914 pod_ip
3387 instance
1708 secret
1647 __name__
1075 module
889 replicaset
889 label_pod_template_hash
853 label_modifiedAt
734 instance_id

Highest cardinality metric names:
1013461 job_received_stuff
198678 total_amount_of_stuff_counter
170370 exec_time_seconds_bucket
94085 container_tasks_state
81246 rest_client_request_duration_seconds_bucket
80234 kube_pod_container_status_waiting_reason
75268 container_memory_failures_total
68772 kube_pod_container_status_terminated_reason
68772 kube_pod_container_status_last_terminated_reason
60705 kubelet_runtime_operations_duration_seconds_bucket
56310 kube_pod_status_phase
42025 exec_duration_seconds_total_bucket
41566 storage_operation_duration_seconds_bucket
39468 api_http_requests_duration_seconds_bucket
36036 ab_duration_seconds_bucket
34096 container_network_transmit_packets_dropped_total
34096 container_network_receive_errors_total
34096 container_network_receive_bytes_total
34096 container_network_transmit_bytes_total
34096 container_network_transmit_errors_total

Some terminology:

- Series: a unique combination of metric name and label key-value pairs.
- Cardinality: the number of distinct values a label has. A high-cardinality label multiplies the series count.
- Churn: series constantly going stale and being replaced by new ones, for example when pods are recreated and their labels change.

If we look at the lists above we can see some things: the ip and id labels have by far the highest cardinality, and job_received_stuff alone accounts for over a million series. That gives us some insight into what's using a lot of RAM. Now it's time to do something about it!

Inspecting metrics

If you use Kubernetes you can view some metrics with these commands:

# cAdvisor metrics
kubectl get --raw /api/v1/nodes/<NODE ID>/proxy/metrics/cadvisor

# Node exporter metrics
kubectl get --raw /api/v1/nodes/<NODE ID>/proxy/metrics

# kube-state-metrics
kubectl get --raw /api/v1/namespaces/<prometheus namespace>/services/<prometheus-kube-state-metrics service>:8080/proxy/metrics

Depending on which of those you have installed in your cluster.

For metrics exposed by services, they're usually reachable at /metrics on the service. If you're unsure what Prometheus is scraping, check /targets in the Prometheus web interface.
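Raw /metrics output can be huge, so eyeballing it is painful. A small helper (a sketch, assuming the standard Prometheus text exposition format) to count exposed series per metric name makes it easier to spot offenders:

```python
import re
from collections import Counter

def series_per_metric(exposition: str) -> Counter:
    """Count exposed series per metric name in Prometheus text format."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        match = re.match(r"([a-zA-Z_:][a-zA-Z0-9_:]*)", line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = """\
# HELP page_views Total page views.
page_views{ip="198.51.100.2"} 14
page_views{ip="192.0.2.117"} 1
up 1
"""
print(series_per_metric(sample).most_common())  # [('page_views', 2), ('up', 1)]
```

Feed it the output of one of the kubectl commands above to see which metric names dominate that endpoint.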

Dropping labels and metrics

The analysis done earlier gave us some insight into what uses a lot of RAM. Now it's up to us to figure out what metrics and labels we're okay with dropping.

How and where you drop labels depends on your Prometheus setup. I use Prometheus Operator and ServiceMonitors.

Looking at the analysis above, and inspecting the metrics from cAdvisor, I could tell the id label was useless bloat. Hence, we can remove it!

Configure Prometheus Operator to drop the unwanted label from all cAdvisor metrics:

kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - regex: 'id'
        action: labeldrop

Note that labeldrop matches the regex against label names and takes no source labels; the regex decides which labels get removed.
If there are whole metrics we realize we don't want to keep, we can drop those fully:

kubelet:
  serviceMonitor:
    cAdvisorMetricRelabelings:
      - sourceLabels: [__name__]
        regex: 'container_tasks_state'
        action: drop
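For intuition, the two relabel actions above behave roughly like this (a Python sketch of the semantics, not how Prometheus actually implements relabeling; the series labels are hypothetical):

```python
import re

def labeldrop(series_labels: dict, regex: str) -> dict:
    """'labeldrop': remove every label whose *name* fully matches the regex."""
    pattern = re.compile(f"^(?:{regex})$")
    return {k: v for k, v in series_labels.items() if not pattern.match(k)}

def drop(series_labels: dict, source_labels: list, regex: str):
    """'drop': discard the whole series if the joined source label values match."""
    value = ";".join(series_labels.get(l, "") for l in source_labels)
    return None if re.match(f"^(?:{regex})$", value) else series_labels

s = {"__name__": "container_tasks_state", "id": "/kubepods/pod123", "pod": "banana-0"}
print(labeldrop(s, "id"))                              # the 'id' label is gone
print(drop(s, ["__name__"], "container_tasks_state"))  # None: series dropped entirely
```

The key difference: labeldrop keeps the series but strips matching labels, while drop throws away the whole series.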

Read more

Some resources I found useful:

  1. Prometheus - Investigation on high memory consumption | Coveo

  2. Metric and label naming | Prometheus