
Grafana Configuration Guide

This document describes how to configure the Grafana monitoring system for RobustMQ, including quick deployment, data source configuration, and dashboard import.

Requirements

  • Grafana 8.0+, Prometheus 2.30+, Docker 20.10+ (optional)
  • Default ports: RobustMQ metrics (9091), Prometheus (9090), Grafana (3000), Alertmanager (9093)

Quick Deployment

```bash
cd grafana/
docker-compose -f docker-compose.monitoring.yml up -d
```

This starts the following services:

| Service | Address | Description |
|---|---|---|
| Grafana | http://localhost:3000 | Default login: admin/admin |
| Prometheus | http://localhost:9090 | Metrics collection & query |
| Alertmanager | http://localhost:9093 | Alert management |
| Node Exporter | http://localhost:9100 | System metrics (optional) |
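
Because the stack starts with the default admin/admin login, it is worth overriding the password at deploy time. A minimal sketch using a Compose override file (the file name and password value are placeholders; `GF_SECURITY_ADMIN_PASSWORD` is Grafana's standard environment variable for this):

```yaml
# docker-compose.override.yml (placeholder name)
services:
  grafana:
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=change-me   # placeholder password
```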

RobustMQ Configuration

Enable Prometheus metrics export in config/server.toml:

```toml
[prometheus]
enable = true
port = 9091
```

Verify metrics are exposed:

```bash
curl http://localhost:9091/metrics
```
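
The endpoint returns Prometheus text exposition format. If you want to check a specific gauge from a script, a small parser is enough; the sketch below is a simplified reader (it skips labeled series) run against a hard-coded sample rather than a live broker:

```python
# Minimal sketch: parse Prometheus text exposition format, as served by the
# /metrics endpoint. The sample payload is illustrative; real values come
# from your running broker.
from typing import Dict

def parse_metrics(text: str) -> Dict[str, float]:
    """Extract label-free metric samples from exposition-format text."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comment lines
            continue
        name, _, value = line.rpartition(" ")
        if "{" not in name:                    # keep only un-labeled series for brevity
            samples[name] = float(value)
    return samples

sample = """\
# HELP mqtt_connections_count Current MQTT connections
# TYPE mqtt_connections_count gauge
mqtt_connections_count 42
mqtt_sessions_count 40
"""
print(parse_metrics(sample)["mqtt_connections_count"])  # 42.0
```

Against a live broker you would feed it the response body of `http://localhost:9091/metrics` instead of the sample string.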

Prometheus Configuration

The project provides an example configuration at grafana/prometheus-config-example.yml with the following scrape targets:

Single Node

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "robustmq-alerts.yml"

scrape_configs:
  - job_name: 'robustmq-mqtt-broker'
    static_configs:
      - targets: ['localhost:9091']
    metrics_path: /metrics
```

Cluster

```yaml
scrape_configs:
  - job_name: 'robustmq-mqtt-broker-cluster'
    static_configs:
      - targets:
          - 'node1:9091'
          - 'node2:9091'
          - 'node3:9091'
    metrics_path: /metrics
```
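
To tell cluster nodes apart in dashboards, you can attach static labels per target. A sketch under assumed names (the `broker_id` label and its values are illustrative, not part of the shipped example config):

```yaml
scrape_configs:
  - job_name: 'robustmq-mqtt-broker-cluster'
    static_configs:
      - targets: ['node1:9091']
        labels:
          broker_id: 'node1'   # illustrative label
      - targets: ['node2:9091']
        labels:
          broker_id: 'node2'
    metrics_path: /metrics
```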

Multiple Services

If running Meta Service and Journal Server alongside the broker:

```yaml
scrape_configs:
  - job_name: 'robustmq-meta-service'
    static_configs:
      - targets: ['localhost:9092']
    metrics_path: /metrics

  - job_name: 'robustmq-journal-server'
    static_configs:
      - targets: ['localhost:9093']
    metrics_path: /metrics
```

Note that 9093 is also Alertmanager's default port; if both run on the same host, move one of them to avoid a conflict.

Grafana Configuration

Adding Prometheus Data Source

Via Web UI:

  1. Log in to Grafana (http://localhost:3000)
  2. Navigate to Configuration → Data Sources → Add data source
  3. Select Prometheus, set URL to http://localhost:9090 (or http://prometheus:9090 in Docker)

Via Provisioning File:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Importing the Dashboard

RobustMQ provides a pre-built dashboard at grafana/robustmq-broker.json.

Web UI Import:

  1. Navigate to Dashboards → Import
  2. Upload grafana/robustmq-broker.json
  3. Select your Prometheus data source in the DS_PROMETHEUS dropdown
  4. Click Import

API Import:

```bash
# The dashboards API expects the JSON wrapped in an envelope,
# so build the request body with jq rather than posting the raw file:
jq '{dashboard: ., overwrite: true}' grafana/robustmq-broker.json \
  | curl -X POST http://localhost:3000/api/dashboards/db \
      -H 'Authorization: Bearer YOUR_API_KEY' \
      -H 'Content-Type: application/json' \
      -d @-
```
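
A third option is to provision the dashboard from disk so it survives container rebuilds. A sketch of Grafana's standard dashboard provider file (the file name and path are illustrative); copy grafana/robustmq-broker.json into the configured directory:

```yaml
# /etc/grafana/provisioning/dashboards/robustmq.yml (illustrative path)
apiVersion: 1
providers:
  - name: 'robustmq'
    type: file
    options:
      path: /var/lib/grafana/dashboards
```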

Dashboard Panel Guide

robustmq-broker.json contains the following sections:

Resource (Overview)

Top stat panels showing current system state:

| Panel | Metric | Description |
|---|---|---|
| Connections | `mqtt_connections_count` | Current MQTT connections |
| Sessions | `mqtt_sessions_count` | Current sessions |
| Topics | `mqtt_topics_count` | Current topics |
| Subscribers | `mqtt_subscribers_count` | Current total subscribers |
| Shared Subscriptions | `mqtt_subscriptions_shared_count` | Current shared subscriptions |
| Retained Messages | `mqtt_retained_count` | Current retained messages |

Below the stat panels, timeseries panels show resource trends including connection rates, session create/delete rates, topic message read/write rates, subscription success/failure rates, shared subscription type breakdown, and retained message send/receive rates.

🌐 Network

| Panel | Description |
|---|---|
| Handler Total Latency | End-to-end request latency percentiles (P50/P95/P99) |
| Handler Queue Wait Latency | Queue wait time percentiles |
| Handler Apply Latency | `command.apply()` execution time percentiles |
| Response Write Latency | Response write-back latency percentiles |

📈 MQTT Protocol

| Panel | Description |
|---|---|
| MQTT Received Packet Rate (QPS) | Received packet rate by type |
| MQTT Sent Packet Rate (QPS) | Sent packet rate by type |
| MQTT Packet Process Latency Percentiles | Packet processing latency percentiles |
| MQTT Packet Process P99 Latency by Type | P99 processing latency by packet type |
| MQTT Packet Process QPS by Type | Processing rate by packet type |
| MQTT Packet Process Avg Latency by Type | Average processing latency by packet type |

🔗 gRPC Server

| Panel | Description |
|---|---|
| gRPC Requests Rate | gRPC request rate (req/s) |
| gRPC QPS by Method | Per-method gRPC request rate |
| gRPC P99 Latency by Method | Per-method P99 latency |

📡 gRPC Client

| Panel | Description |
|---|---|
| gRPC Client Call P99 Latency by Method | P99 latency of each outgoing gRPC client call |
| gRPC Client Call Latency Percentiles | Overall client call latency percentiles (P50/P95/P99/P999) |
| gRPC Client Call QPS by Method | QPS of each outgoing gRPC client call by method |

This section shows the latency of outgoing gRPC calls from the Broker to the Meta Service and other components, helping identify performance bottlenecks in flows such as connection establishment.

🌍 HTTP Admin

| Panel | Description |
|---|---|
| HTTP Requests Rate | HTTP Admin request rate (req/s) |
| HTTP QPS by Endpoint | Per-endpoint request rate |
| HTTP Admin P99 Latency by Endpoint | Per-endpoint P99 latency |

📦 Raft Machine

| Panel | Description |
|---|---|
| Raft Write Rate / Success Rate / Failure Rate | Raft write request/success/failure rates |
| Raft RPC Rate | Raft RPC request rate |
| Raft Write QPS (by Machine) | Write QPS by state machine type |
| Raft Write Latency (by Machine) | Write latency by state machine type |
| Raft RPC QPS (by Machine / RPC Type) | RPC QPS by state machine and RPC type |
| Raft RPC Latency (by Machine / RPC Type) | RPC latency by state machine and RPC type |

Raft RPC metrics only show data in multi-node cluster deployments.

📖 RocksDB

| Panel | Description |
|---|---|
| RocksDB QPS by Operation | QPS by operation type (save/get/delete/list) |
| RocksDB QPS by Source | QPS by data source |
| RocksDB Write Latency | Write operation latency percentiles |
| RocksDB Read (Get) Latency | Read operation latency percentiles |

⏱ Delay Message

| Panel | Description |
|---|---|
| Delay Message Enqueue / Deliver / Failure Rate | Enqueue/deliver/failure rates |
| Enqueue Latency Percentiles | Enqueue latency percentiles |
| Deliver Latency Percentiles | Delivery latency percentiles |

Delay message metrics only show data when the delay publish feature is actively used.

Alert Configuration

Pre-built Alert Rules

The project provides grafana/robustmq-alerts.yml with the following rules:

| Alert | Severity | Condition | Description |
|---|---|---|---|
| RobustMQBrokerDown | Critical | up == 0 | Broker instance unreachable |
| RobustMQHighRequestLatency | Warning | P95 latency > 100ms for 10m | Elevated request latency |
| RobustMQCriticalRequestLatency | Critical | P95 latency > 500ms for 5m | Severe request latency |
| RobustMQAuthenticationFailures | Critical | Auth failures > 10/s for 2m | Frequent authentication failures |
| RobustMQConnectionErrors | Warning | Connection errors > 5/s for 5m | Frequent connection errors |
| RobustMQHighQueueDepth | Warning | Queue depth > 1000 for 5m | Queue backlog |
| RobustMQCriticalQueueDepth | Critical | Queue depth > 5000 for 2m | Severe queue backlog |
| RobustMQHighMessageDrops | Warning | Message drops > 100/s for 5m | Frequent no-subscriber drops |
| RobustMQHighThreadUtilization | Warning | Active threads > 50 for 10m | High thread count |

Alertmanager Configuration

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alertmanager@robustmq.com'

route:
  group_by: ['alertname']
  repeat_interval: 1h
  receiver: 'default'

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@robustmq.com'
```
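
Routing can also branch on the severity label used by the pre-built rules, so critical alerts reach a separate receiver with a shorter repeat interval. A minimal sketch (the `oncall` receiver and its address are hypothetical; older Alertmanager versions use `match:` instead of `matchers:`):

```yaml
route:
  group_by: ['alertname']
  receiver: 'default'
  routes:
    - matchers:
        - severity = "critical"
      receiver: 'oncall'        # hypothetical paging receiver
      repeat_interval: 15m

receivers:
  - name: 'default'
    email_configs:
      - to: 'admin@robustmq.com'
  - name: 'oncall'
    email_configs:
      - to: 'oncall@robustmq.com'   # hypothetical address
```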

Custom Alert Rules

Add custom rules to grafana/robustmq-alerts.yml:

```yaml
groups:
  - name: robustmq.custom
    rules:
      - alert: HighGrpcErrorRate
        expr: rate(grpc_errors_total[5m]) / rate(grpc_requests_total[5m]) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "gRPC error rate exceeds 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      - alert: RocksDBSlowOperations
        expr: histogram_quantile(0.95, rate(rocksdb_operation_ms_bucket[5m])) > 100
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "RocksDB P95 operation latency exceeds 100ms"
```
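
As a sanity check on the threshold arithmetic: the HighGrpcErrorRate expression divides two per-second rates over the same window, so the window length cancels and only the counter-increase ratio matters. The sketch below (with made-up counter deltas) shows the same computation in Python:

```python
# Sketch of the arithmetic behind HighGrpcErrorRate: rate() is the per-second
# increase of a counter over the window; the alert compares the ratio of error
# rate to request rate against 5%. The deltas below are made up.
WINDOW_S = 300  # [5m]

def per_second_rate(delta: float, window_s: int = WINDOW_S) -> float:
    return delta / window_s

errors_delta = 90.0      # grpc_errors_total grew by 90 over the window
requests_delta = 1500.0  # grpc_requests_total grew by 1500

error_ratio = per_second_rate(errors_delta) / per_second_rate(requests_delta)
print(f"{error_ratio:.2%}")  # 6.00% -> above the 0.05 threshold, alert fires
```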

Performance Optimization

Prometheus Optimization

Storage configuration:

```bash
# Startup arguments
--storage.tsdb.retention.time=30d
--storage.tsdb.retention.size=50GB
```

Recording rules (pre-computation):

Create recording rules for frequently queried expressions:

```yaml
groups:
  - name: robustmq.recording
    rules:
      - record: robustmq:request_latency_95th
        expr: histogram_quantile(0.95, rate(handler_total_ms_bucket[5m]))

      - record: robustmq:packet_rate_received
        expr: rate(mqtt_packets_received_total[5m])

      - record: robustmq:packet_rate_sent
        expr: rate(mqtt_packets_sent_total[5m])

      - record: robustmq:error_rate_total
        expr: >
          rate(mqtt_packets_received_error_total[5m])
          + rate(mqtt_packets_connack_auth_error_total[5m])
          + rate(mqtt_packets_connack_error_total[5m])
```
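
The histogram_quantile calls above estimate a percentile by linear interpolation inside the cumulative bucket where the target rank falls. A simplified sketch of that computation (the bucket bounds are illustrative, not RobustMQ's actual bucket layout):

```python
# Sketch of how histogram_quantile(0.95, ...) interpolates a latency
# percentile from cumulative histogram buckets.
from typing import List, Tuple

def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count), sorted, last bound = +inf."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # cap at the highest finite bound
            # Linear interpolation within the bucket, as Prometheus does.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return buckets[-1][0]

# Illustrative buckets: 60 samples <= 5ms, 85 <= 10ms, 98 <= 50ms, 100 total.
buckets = [(5.0, 60.0), (10.0, 85.0), (50.0, 98.0), (float("inf"), 100.0)]
print(histogram_quantile(0.95, buckets))  # interpolated inside the 10-50ms bucket
```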

Grafana Optimization

  • Use recording rules to reduce complex real-time aggregation queries
  • Set appropriate panel refresh intervals (recommended 15s - 1m)
  • Avoid wide-range aggregation on high-cardinality labels (e.g., client_id, connection_id)
  • Use larger rate() windows (e.g., [5m] instead of [1m]) for historical data queries
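
The last recommendation can be illustrated numerically: the sketch below builds a synthetic counter (assuming 15s scrape intervals) and compares rate() over a 1m vs a 5m window around a short burst.

```python
# Sketch: why larger rate() windows smooth spiky counters. We sample a
# synthetic counter every 15s and compare per-second rates computed over
# a 1m window vs a 5m window around a single 15s burst.
def window_rate(samples, step_s, window_s):
    """rate() approximation: counter increase over the window / window size."""
    points = window_s // step_s
    return [
        (samples[i] - samples[i - points]) / window_s
        for i in range(points, len(samples))
    ]

step = 15
# Counter grows by 10/s (150 per scrape), with one burst of 600 extra increments.
counter, value = [], 0
for i in range(40):
    value += 150 + (600 if i == 20 else 0)
    counter.append(value)

r1 = window_rate(counter, step, 60)   # [1m]: 4 samples per window
r5 = window_rate(counter, step, 300)  # [5m]: 20 samples per window
print(max(r1), max(r5))  # 20.0 12.0 -> the 1m window shows a much taller spike
```

The longer window spreads the burst across more samples, trading responsiveness for smoother, cheaper historical graphs.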

Related Files

| File | Description |
|---|---|
| `grafana/robustmq-broker.json` | Grafana dashboard definition |
| `grafana/prometheus-config-example.yml` | Prometheus scrape configuration example |
| `grafana/robustmq-alerts.yml` | Alert rules definition |
| `grafana/docker-compose.monitoring.yml` | Docker Compose monitoring stack |
| `config/server.toml` | RobustMQ server configuration (includes Prometheus port) |
| `docs/en/Observability/Infrastructure-Metrics.md` | Complete metrics reference |