Metrics

The Metrics dashboard provides a quick health check for the service.
It summarizes key Site Reliability Engineering (SRE) metrics that help engineers monitor the performance and reliability of the system.


Service Level Objective (SLO)

The Service Level Objective (SLO) defines the reliability target for the service.

  • Availability: 99.95%
  • Target: 99.9%

If the current availability meets the defined target, the dashboard indicates that the service is Meeting SLO.
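The comparison described above can be sketched in a few lines. This is an illustrative sketch only: the function names and the error-budget calculation are assumptions, not the dashboard's documented logic.

```python
def meets_slo(availability_pct: float, target_pct: float) -> bool:
    """Return True when measured availability is at or above the SLO target."""
    return availability_pct >= target_pct


def error_budget_remaining(availability_pct: float, target_pct: float) -> float:
    """Fraction of the error budget still unused (1.0 = untouched, 0.0 = exhausted)."""
    budget = 100.0 - target_pct        # allowed unavailability, in percentage points
    used = 100.0 - availability_pct    # observed unavailability, in percentage points
    if budget <= 0:
        return 0.0
    return max(0.0, 1.0 - used / budget)


# Example values from this section: availability 99.95%, target 99.9%.
status = "Meeting SLO" if meets_slo(99.95, 99.9) else "Not meeting SLO"
print(status)
print(round(error_budget_remaining(99.95, 99.9), 2))
```

With the example values, the service is meeting its SLO and has used roughly half of its allowed unavailability.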


Golden Signals (SRE)

The dashboard tracks the four golden signals of SRE (latency, traffic, errors, and saturation), which are the core metrics used to monitor system health.

Latency

Latency measures how long it takes for a system to respond to a request.

  • P95 Latency: 120ms
  • P99 Latency: 253ms

Lower latency ensures faster responses and a better user experience.
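For readers unfamiliar with percentile latency, the sketch below shows one common way to derive P95 and P99 from raw response times, using the nearest-rank method. How the dashboard itself computes percentiles is not stated, so treat the method, the `percentile` helper, and the sample data as assumptions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds (1 ms through 100 ms).
latencies_ms = list(range(1, 101))
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(p95, p99)
```

A P95 of 120ms means that 95% of requests completed in 120ms or less; P99 captures the slower tail that affects a small share of users.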


Traffic

Traffic indicates the number of incoming requests handled by the system.

  • Requests Per Second: 4.4 RPS

Monitoring traffic helps identify system load and user demand.


Error Rate

Error rate represents the percentage of requests that fail.

  • Error Rate: 0.50%
  • Errors per second: 0.020

A higher error rate may indicate system instability or service issues.
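The two error figures are related through the request rate. As a minimal sketch (the helper names are hypothetical), multiplying the example traffic of 4.4 RPS by the 0.50% error rate gives about 0.022 errors per second, close to the 0.020 shown on the dashboard. The small gap may come from rounding or from different measurement windows; the source does not say.

```python
def error_rate_pct(failed: int, total: int) -> float:
    """Percentage of requests that failed; 0.0 when there was no traffic."""
    if total == 0:
        return 0.0
    return 100.0 * failed / total


def errors_per_second(requests_per_second: float, rate_pct: float) -> float:
    """Expected failed requests per second for a given traffic level and error rate."""
    return requests_per_second * rate_pct / 100.0


print(error_rate_pct(1, 200))                   # 1 failure in 200 requests
print(round(errors_per_second(4.4, 0.50), 3))   # example values from this page
```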


Saturation

Saturation measures how heavily system resources are being used.

  • Memory Usage: 65.3%

High saturation levels may impact performance if system resources become fully utilized.


The image above shows the Service Metrics Overview Dashboard, which provides a quick visual summary of service health and performance.


Infrastructure Metrics

The dashboard also includes infrastructure performance indicators.

CPU Usage

This metric shows how much processing capacity the service is consuming.

  • CPU Usage: 45.1%
  • 24-hour trend: Increasing

Memory Usage

Memory usage tracks how much RAM is being used by the application.

  • Memory Usage: 60.2%
  • 24-hour trend: 6.0%

Average Latency

The Average Latency panel provides a general overview of system response time. The value it reports is labeled as the 95th-percentile (P95) response time.

  • Response Time (P95): 85ms

Service Health

The Service Health panel provides a quick operational summary of the system.

Key indicators include:

  • Active Alerts
  • Error Rate
  • Restarts (24h)
  • Uptime
  • Running Pods

Example values:

  • Error Rate: 2.02%
  • Restarts (24h): 0
  • Uptime: 99.95%
  • Running Pods: 3

These metrics help engineers quickly evaluate the operational health of the service.

DNS Query Analytics Dashboard

The DNS Query Analytics Dashboard provides a comprehensive view of DNS traffic patterns, query distributions, and response behaviors.
It enables teams to monitor DNS performance, analyze traffic sources, and identify potential issues in query resolution.


Colocation Centers

The Colocation Centers chart shows the distribution of DNS queries across different geographic locations.

Key Observations

  • SJC (San Jose) handles the highest number of queries (~16,000)
  • DFW (Dallas) and ORD (Chicago) follow with moderate traffic
  • IAD (Virginia) and SEA (Seattle) show comparatively lower query volumes

This helps in understanding traffic distribution across regions and identifying high-load data centers.

IP Version Distribution

The IP Version Distribution chart compares IPv4 and IPv6 usage.

Breakdown

  • IPv4: Majority of DNS traffic
  • IPv6: Smaller but growing portion of traffic

This indicates that while IPv4 remains dominant, IPv6 adoption is increasing.

Query Types

The Query Types chart shows the distribution of DNS record types.

Key Insights

  • A Records: Highest usage (~30,000)
  • AAAA Records: Significant usage (~15,000)
  • CNAME Records: Moderate (~8,000)
  • MX Records: Lower (~3,000)
  • TXT Records: Minimal usage

Interpretation

  • A and AAAA records dominate because most queries resolve hostnames to IPv4 and IPv6 addresses
  • MX and TXT records are used for email routing and verification purposes

Protocol Distribution

The Protocol Distribution chart compares DNS traffic over UDP and TCP.

Breakdown

  • UDP: ~52,340 requests
  • TCP: ~5,200 requests

Interpretation

  • UDP is the primary protocol due to its speed and low overhead
  • TCP is used for:
    • Large responses
    • Zone transfers
    • Reliability-critical operations
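To put the breakdown above in proportion, the following sketch converts the approximate request counts into percentage shares. The `shares` helper is hypothetical and the counts are the rounded values read from the chart.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert raw counts into percentage shares, rounded to one decimal place."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


# Approximate counts from the Protocol Distribution chart.
protocol_counts = {"UDP": 52_340, "TCP": 5_200}
print(shares(protocol_counts))
```

With these figures, UDP carries roughly 91% of DNS requests and TCP roughly 9%. The same helper could be applied to the query type or response code counts.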

Response Codes

The Response Codes chart shows DNS query response statuses.

Key Metrics

  • NOERROR: Majority (~50,000+)
  • NXDOMAIN: Moderate number of failed lookups
  • SERVFAIL: Low number of server-side failures
  • REFUSED: Minimal number of refused queries

Interpretation

  • A high share of NOERROR responses indicates healthy DNS resolution
  • NXDOMAIN responses indicate queries for domain names that do not exist
  • A low SERVFAIL count reflects stable infrastructure

In summary, the DNS dashboard provides insights into:

  • Geographic traffic distribution
  • IP version adoption (IPv4 vs. IPv6)
  • DNS query behavior
  • Transport protocol usage (UDP vs. TCP)
  • System reliability and response health

These metrics help ensure efficient DNS performance, reliability, and scalability.