Watching your tail with latency histograms

Using percentiles to observe the tail of your service's latency

Luke Goyer
Lead Software Engineer

This blog post explores:

  • The importance of observing latency at a percentile
  • Using Prometheus histogram metrics
  • Configuring a Spring Boot application to monitor latency

This will not include setting up Prometheus or Grafana for monitoring.

Why not measure average latency?

The average and the median (p50) describe a typical request, not the slowest ones: the median only tells you what the faster half of your requests experienced. When web services have sporadic issues, the 90th or 95th percentile is often very different from the average, meaning 10% or 5% of users see a very different latency than the average or the median suggests.

For example, let's say our service handles 100 requests: 95 complete in 20 ms and 5 complete in 400 ms. This example data has a long tail. The average says our service is responding in 39 ms, and the median (p50) says 20 ms. However, 5% of users could be reporting slow response times. Monitoring the average may not detect an issue when it is only 19 ms above the median. Monitoring a high percentile can: here the 5 slow requests begin right at the 95th percentile (p95), and anything above it, such as p99, reports the 400 ms response time.
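The arithmetic in this example can be checked with a short Python sketch. The nearest_rank_percentile helper is my own illustration (it is not how Prometheus estimates quantiles), and I am assuming the nearest-rank definition of a percentile. With exactly 5% slow requests, nearest-rank p95 lands on the last fast request, so the sketch also computes p99.

```python
import math
import statistics

# The example data from the text: 95 requests at 20 ms and 5 requests at 400 ms.
latencies_ms = [20] * 95 + [400] * 5


def nearest_rank_percentile(values, percent):
    """Return the smallest value that at least `percent` % of the values do not exceed."""
    ordered = sorted(values)
    rank = math.ceil(percent * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = nearest_rank_percentile(latencies_ms, 95)
p99_ms = nearest_rank_percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={median_ms} ms, p95={p95_ms} ms, p99={p99_ms} ms")
```

The mean and the median stay at 39 ms and 20 ms, while p99 lands on a 400 ms request. One more slow request would be enough to push p95 into the tail as well.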

Long Tails with Latency


The above image shows the distribution of response times from a service. Most requests fall on the left side and have a low response time. The right side of the image shows the long tail of the distribution. These high response times on the right side of the graph can represent a significant portion of users. Observing only the average will conceal these data points.

Observing Latency with Spring Boot and Prometheus

By default, Spring Boot only provides two counter metrics for request latency: http_server_requests_seconds_sum, the total time spent processing requests, and http_server_requests_seconds_count, the total number of requests. With only those two metrics you can calculate average latency, but you cannot observe latency at any important percentile.
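To make the "average only" limitation concrete, here is a minimal Python sketch of how an average is derived from those two counters. The function name and the scrape values are hypothetical. The idea is simply the increase in the sum divided by the increase in the count between two scrapes.

```python
def average_latency_seconds(sum_start, count_start, sum_end, count_end):
    """Average request latency between two scrapes of the _sum and _count counters."""
    requests = count_end - count_start
    if requests <= 0:
        return None  # no traffic in the window, or the counters were reset
    return (sum_end - sum_start) / requests


# Hypothetical scrape values: 100 requests took 3.9 s in total during the window.
avg = average_latency_seconds(sum_start=12.0, count_start=400, sum_end=15.9, count_end=500)
print(f"average latency: {avg * 1000:.0f} ms")
```

Whatever the distribution of those 100 requests was, the two counters collapse it into a single number.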

Prometheus has two metric types that are well suited to measuring latency: histograms and summaries. Histograms are easy to instrument in Spring Boot applications. Summaries are generally not aggregatable and have a higher performance impact on the application that creates the metric.

Histogram Metrics

Histograms organize data points into specified ranges so they can be graphed. In Prometheus, a histogram is a more complex metric that creates multiple time series: one counter metric per bucket, plus two metrics for a total sum and a total count.

For Spring Boot HTTP Requests this is an example of histogram metrics:

# HELP http_server_requests_seconds
# TYPE http_server_requests_seconds histogram
http_server_requests_seconds_bucket{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",le="0.05",} 3.0
http_server_requests_seconds_bucket{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",le="0.1",} 3.0
http_server_requests_seconds_bucket{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",le="0.25",} 4.0
http_server_requests_seconds_bucket{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",le="0.5",} 4.0
http_server_requests_seconds_bucket{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",le="1.5",} 4.0
http_server_requests_seconds_bucket{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",le="2.0",} 4.0
http_server_requests_seconds_bucket{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",le="+Inf",} 4.0
http_server_requests_seconds_count{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",} 4.0
http_server_requests_seconds_sum{app="sample-rest-api",exception="None",method="GET",outcome="SUCCESS",status="200",uri="/retrieve",} 0.145301247

Every request increments the http_server_requests_seconds_count metric. Each bucket is also incremented when the request duration is less than or equal to the bucket's upper bound (its le label), so the buckets are cumulative. Notice in the example metrics above that the total number of requests was 4, but the two smallest buckets have a value of 3. They were not incremented for one of the requests because its response time was between 100 ms and 250 ms.
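The bucket behavior can be sketched in a few lines of Python. This is an illustration of the counting rule only, not Micrometer's or Prometheus's actual implementation. The Histogram class and the four request durations are made up to reproduce the bucket counts shown above.

```python
import math

# Bucket upper bounds in seconds, matching the `le` labels in the sample output.
BUCKET_BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.5, 2.0, math.inf]


class Histogram:
    """Toy cumulative histogram: one counter per bucket, plus a sum and a count."""

    def __init__(self, bounds):
        self.bounds = list(bounds)
        self.bucket_counts = [0] * len(self.bounds)
        self.count = 0
        self.sum = 0.0

    def observe(self, seconds):
        self.count += 1
        self.sum += seconds
        # Increment every bucket whose upper bound is >= the observed duration.
        for i, bound in enumerate(self.bounds):
            if seconds <= bound:
                self.bucket_counts[i] += 1


histogram = Histogram(BUCKET_BOUNDS)
# Hypothetical durations: three fast requests and one between 100 ms and 250 ms.
for duration in (0.010, 0.012, 0.015, 0.108):
    histogram.observe(duration)

print(histogram.bucket_counts)  # [3, 3, 4, 4, 4, 4, 4]
```

The +Inf bucket always matches, which is why it carries the same value as the count metric.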

Prometheus provides a histogram_quantile() function to calculate quantiles from histograms. This function will provide us the ability to create an estimate of the latency at a specific percentile.
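To show where the estimate comes from, here is a simplified Python sketch of the idea behind histogram_quantile(): find the bucket that contains the requested rank, then interpolate linearly inside it. It is my own approximation of the documented behavior, not the Prometheus source. It also works on raw counts, whereas a real query is usually applied to per-second rates of the buckets.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the largest finite bucket, so the best
        # available answer is that bucket's upper bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Cumulative bucket counts taken from the sample metrics above.
sample = [(0.05, 3), (0.1, 3), (0.25, 4), (0.5, 4), (1.5, 4), (2.0, 4), (math.inf, 4)]
p50 = histogram_quantile(0.5, sample)
p90 = histogram_quantile(0.9, sample)
print(f"p50 ~ {p50:.3f} s, p90 ~ {p90:.3f} s")
```

Because the function only knows bucket boundaries, its accuracy depends entirely on where those boundaries sit relative to the real latencies.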

Configuring Histogram Buckets for Spring Boot Apps

When using Spring Boot 2.x, Micrometer metrics are built in, and configuring buckets only requires an update to your management.metrics settings; see the Spring Boot docs for details. The configuration may differ depending on how Micrometer is implemented, so check the Micrometer docs for more info.

Picking buckets to configure can be difficult. The buckets are used to create an estimate, so more buckets give you a more accurate estimate, but each one creates another time series that Prometheus has to track and store. For most applications I pick 5-10 buckets as a starting point.

When selecting bucket values, I start just below the target latency and increase to 4x the target latency. This is just a starting point and will be adjusted as needed for each application.

For example, the configuration for a service with a target latency of 100ms:

    slo[http.server.requests]: "75ms, 100ms, 150ms, 200ms, 250ms, 300ms, 350ms, 400ms"
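The rule of thumb above can be turned into a small Python helper. The function names, the 75% starting point, and the half-target step size are my assumptions, chosen so that a 100ms target reproduces the example configuration. Treat the output as a starting point to adjust, not a recommendation.

```python
def suggest_buckets_ms(target_ms, step_fraction=0.5, max_multiple=4.0):
    """Bucket bounds from just below the target up to max_multiple x the target."""
    bounds = [round(target_ms * 0.75)]  # assumed "just below target" bucket
    steps = int(round((max_multiple - 1.0) / step_fraction))
    for k in range(steps + 1):
        bounds.append(round(target_ms * (1.0 + k * step_fraction)))
    return bounds


def to_slo_value(bounds_ms):
    """Format the bounds as the comma-separated duration string used in the config."""
    return ", ".join(f"{bound}ms" for bound in bounds_ms)


print(to_slo_value(suggest_buckets_ms(100)))
```

A larger step_fraction gives fewer buckets and fewer time series, at the cost of a coarser estimate.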

Observing Latency at Percentiles


The picture above is a screenshot of a LATTES dashboard in Grafana. LATTES is our take on the Golden Signals: Latency, Availability, Traffic, Tickets, Errors, Saturation.

I used Chaos Monkey for Spring Boot and Chaos Toolkit to inject 500ms-2000ms of latency into 11% (1/9) of requests. The /retrieve URI was the only one impacted by the chaos experiment. The /retrievedelay URI always responds in 2s-10s.

For the application, the histogram buckets were instrumented at "50ms, 100ms, 250ms, 500ms, 1500ms, 2000ms". Three queries are used for the bottom panel in the dashboard, with the first parameter changing for the different percentiles.


histogram_quantile(0.9, sum without(instance) (rate(http_server_requests_seconds_bucket{app="$app",uri!~"/actuator.*"}[10m])))

There are a few key things to notice:

  • The average response time of /retrieve is 133ms. That is more than 4 times the median (p50) at 28ms, but still roughly 10 times smaller than the 95th percentile (p95) at 1.29s. If we only monitor the average, we are missing the tail and ignoring the experience of 5% of users.

  • The /retrievedelay average shows 5.61s, but the percentiles show 2s for all values. This is because the largest bucket we configured was 2s. Every request takes longer than that, so it is only counted in the +Inf bucket, and histogram_quantile() then returns the upper bound of the largest finite bucket instead of a real estimate. To correct this, larger buckets need to be added.

Wrapping it up

Using the average or median to monitor latency can hide important data points when there is a long tail. Using histogram metric types from Prometheus enables you to better monitor latency by looking at the 90th, 95th, or 99th percentiles.

