
Infrastructure Metrics

RobustMQ provides infrastructure monitoring metrics that help operations teams track system health, performance, and resource usage. All metrics are exposed in Prometheus format through the /metrics endpoint.

Note: Counter-type metrics are automatically suffixed with _total in Prometheus. For example, a Counter registered as grpc_requests in code will be exposed as grpc_requests_total in Prometheus/Grafana. The metric names in the tables below are Prometheus-exposed names (Counters already include _total).

Network Layer Metrics

Request Processing Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| handler_total_ms | Histogram | network | End-to-end total time from request received to response written (ms) |
| handler_queue_wait_ms | Histogram | network | Time a request spent waiting in the handler queue (ms) |
| handler_apply_ms | Histogram | network | Time spent in command.apply() processing the request (ms) |
| handler_write_ms | Histogram | network | Time spent writing the response back to the client (ms) |

Queue Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| handler_queue_size | Gauge | label | Current number of pending requests in the handler queue |
| handler_queue_remaining | Gauge | label | Remaining capacity in the handler queue |
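
As a rough sketch, the two gauges can be combined into a utilization ratio. This assumes that handler_queue_size plus handler_queue_remaining equals the total capacity of a queue, which the descriptions above suggest but do not state outright.

```promql
# Handler queue utilization as a fraction (0-1), one series per queue label.
# Assumption: size + remaining == total capacity of that queue.
handler_queue_size / (handler_queue_size + handler_queue_remaining)
```

A value that stays close to 1 would mean the queue is nearly full.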

Request Statistics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| handler_requests_total | Gauge | network | Total number of requests processed by handlers |
| handler_slow_requests_total | Gauge | network | Total number of slow requests (exceeding threshold) |

Thread Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| broker_active_thread_num | Gauge | network, thread_type | Number of active threads by type |

Label Descriptions:

| Label | Values | Description |
| --- | --- | --- |
| network | TCP, WebSocket, QUIC | Network connection type |
| thread_type | accept, handler, response | Thread type |
| label | Custom string | Queue label identifying a specific queue instance |

gRPC Server Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| grpc_requests_total | Counter | service, method | Total number of gRPC requests |
| grpc_errors_total | Counter | service, method, status_code | Total number of gRPC errors |
| grpc_request_duration_ms | Histogram | service, method | gRPC request duration (ms) |

gRPC Client Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| grpc_client_call_duration_ms | Histogram | service, method | gRPC client call duration (ms), includes retries and leader forwarding |

Label Descriptions:

  • service: gRPC service name (e.g., MqttService, PlacementService, EngineService)
  • method: gRPC method name (e.g., CreateSession, ListUser)
  • status_code: gRPC status code (server error metric only)
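
The two server counters carry different label sets: grpc_errors_total has status_code, while grpc_requests_total does not. A minimal sketch of a per-method error ratio therefore aggregates both sides down to the shared labels first. The 5m window is an arbitrary choice.

```promql
# Per-method gRPC error ratio (0-1); summing away status_code lets both sides match
sum by (service, method) (rate(grpc_errors_total[5m]))
  / sum by (service, method) (rate(grpc_requests_total[5m]))

# gRPC server P95 latency per method (ms)
histogram_quantile(0.95, sum by (le, service, method) (rate(grpc_request_duration_ms_bucket[5m])))
```

Methods that have never recorded an error have no grpc_errors_total series, so the ratio returns no result for them rather than 0.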

HTTP Service Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| http_requests_total | Counter | method, uri | Total number of HTTP requests |
| http_errors_total | Counter | method, uri, status_code | Total number of HTTP errors |
| http_request_duration_ms | Histogram | method, uri | HTTP request duration (ms) |

Label Descriptions:

  • method: HTTP method (GET, POST, PUT, DELETE)
  • uri: Request path
  • status_code: HTTP status code
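
The labels above can be used to slice HTTP traffic by path or by error class. The sketch below assumes that status_code holds the numeric HTTP code as a string (for example "500"); check the actual label values on /metrics before relying on the regex.

```promql
# HTTP P99 latency per request path (ms)
histogram_quantile(0.99, sum by (le, uri) (rate(http_request_duration_ms_bucket[5m])))

# Rate of server-side (5xx) errors per request path
sum by (uri) (rate(http_errors_total{status_code=~"5.."}[5m]))
```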

Storage Layer Metrics (RocksDB)

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| rocksdb_operation_count_total | Counter | source, operation | Number of RocksDB operations |
| rocksdb_operation_ms | Histogram | source, operation | RocksDB operation duration (ms) |

Label Descriptions:

  • source: Data source (e.g., metadata, session, message)
  • operation: Operation type (save, get, delete, list)
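
A short sketch of how the source and operation labels might be combined in queries. The operation values used in the filter are the ones listed above; the window length is an assumption.

```promql
# RocksDB P99 latency per data source and operation (ms)
histogram_quantile(0.99, sum by (le, source, operation) (rate(rocksdb_operation_ms_bucket[5m])))

# Write-path operations per second (save and delete only), per data source
sum by (source) (rate(rocksdb_operation_count_total{operation=~"save|delete"}[5m]))
```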

Raft Consensus Layer Metrics

Write Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| raft_write_requests_total | Counter | machine | Total Raft write requests |
| raft_write_success_total | Counter | machine | Total successful Raft writes |
| raft_write_failures_total | Counter | machine | Total failed Raft writes |
| raft_write_duration_ms | Histogram | machine | Raft write operation duration (ms) |

RPC Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| raft_rpc_requests_total | Counter | machine, rpc_type | Total Raft RPC requests |
| raft_rpc_success_total | Counter | machine, rpc_type | Total successful Raft RPCs |
| raft_rpc_failures_total | Counter | machine, rpc_type | Total failed Raft RPCs |
| raft_rpc_duration_ms | Histogram | machine, rpc_type | Raft RPC operation duration (ms) |

State Machine Lag Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| raft_apply_lag | Gauge | machine | Gap between last_log_index and last_applied; non-zero means the state machine is falling behind |
| raft_last_log_index | Gauge | machine | Latest log index appended to the Raft log |
| raft_last_applied | Gauge | machine | Latest log index applied to the state machine |

Label Descriptions:

  • machine: State machine type (e.g., metadata, offset, mqtt)
  • rpc_type: RPC type (only reported in multi-node cluster setups)
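
Since raft_apply_lag is described as the gap between the two index gauges, it can be cross-checked by recomputing it. The sketch below also shows an RPC failure ratio; as noted above, the RPC series would only exist in multi-node clusters.

```promql
# Recompute the lag from its two inputs; expected to roughly match raft_apply_lag per machine
raft_last_log_index - raft_last_applied

# Raft RPC failure ratio (0-1) per state machine and RPC type
sum by (machine, rpc_type) (rate(raft_rpc_failures_total[5m]))
  / sum by (machine, rpc_type) (rate(raft_rpc_requests_total[5m]))
```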

System & Process Resource Metrics

Collected every 15 seconds. All percentage values are stored as integers in the range 0–100. In Grafana, either divide by 100 to get a fraction or apply the Percent (0-100) unit directly.

System-wide Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| system_cpu_usage | Gauge | - | Overall system CPU usage percentage (0–100) |
| system_memory_usage | Gauge | - | Overall system memory usage percentage (0–100) |

Process Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| system_process_cpu_usage | Gauge | - | CPU usage of the broker process, normalized by core count (0–100) |
| system_process_memory_usage | Gauge | - | Memory usage of the broker process as a percentage of total system memory (0–100) |
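
Because these gauges are 0–100 percentages sampled every 15 seconds, a dashboard may want a fraction or a smoothed value instead of the raw sample. A minimal sketch, with the 5m window chosen arbitrarily:

```promql
# Process CPU usage as a fraction (0-1) instead of a 0-100 percentage
system_process_cpu_usage / 100

# System memory usage averaged over 5 minutes to smooth out single-sample spikes
avg_over_time(system_memory_usage[5m])
```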

Tokio Runtime Metrics

Sampled every 15 seconds using Tokio's unstable RuntimeMetrics API. Three runtimes are monitored: server, meta, and broker.

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| tokio_runtime_busy_ratio | Gauge | runtime | Worker-thread busy ratio (0–100). Values consistently above 80 indicate the runtime is saturated. |
| tokio_runtime_queue_depth | Gauge | runtime | Number of tasks waiting in the global run queue. A growing queue means tasks are produced faster than they are consumed. |
| tokio_runtime_alive_tasks | Gauge | runtime | Number of tasks that have been spawned and not yet completed. Continuously growing values may indicate a task leak. |

Label Values:

  • runtime: server / meta / broker

Unhealthy thresholds (reference):

| Metric | Unhealthy |
| --- | --- |
| tokio_runtime_busy_ratio | Sustained > 80 |
| tokio_runtime_queue_depth | Consistently > 0 and growing |
| tokio_runtime_alive_tasks | Growing without bound |
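
One possible way to turn the reference thresholds above into queries is sketched below. The window lengths (5m, 10m, 1h) are assumptions, and "growing" is approximated with deriv(), which fits a linear trend over the window.

```promql
# Busy ratio above 80 for every sample in the last 5 minutes
min_over_time(tokio_runtime_busy_ratio[5m]) > 80

# Global queue never empty during the window and trending upward
min_over_time(tokio_runtime_queue_depth[10m]) > 0
  and deriv(tokio_runtime_queue_depth[10m]) > 0

# Alive task count trending upward over the last hour (possible task leak)
deriv(tokio_runtime_alive_tasks[1h]) > 0
```

Each expression returns one series per runtime label value that currently meets the condition, and nothing otherwise.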

MQTT Protocol Metrics

Resource Statistics (Gauge)

Real-time counts of various system resources.

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_connections_count | Gauge | - | Current MQTT connection count |
| mqtt_sessions_count | Gauge | - | Current MQTT session count |
| mqtt_topics_count | Gauge | - | Current MQTT topic count |
| mqtt_subscribers_count | Gauge | - | Current MQTT subscriber count (all types) |
| mqtt_subscriptions_exclusive_count | Gauge | - | Current exclusive subscription count |
| mqtt_subscriptions_shared_count | Gauge | - | Current shared subscription count |
| mqtt_subscriptions_shared_group_count | Gauge | - | Current shared subscription group count |
| mqtt_retained_count | Gauge | - | Current retained message count |

Connection and Authentication Events

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| client_connections_total | Counter | client_id | Client connection attempts (regardless of success) |
| mqtt_connection_success_total | Counter | - | Successful MQTT connections |
| mqtt_connection_failed_total | Counter | - | Failed MQTT connections |
| mqtt_disconnect_success_total | Counter | - | MQTT disconnections |
| mqtt_connection_expired_total | Counter | - | Expired MQTT connections |
| mqtt_auth_success_total | Counter | - | Successful MQTT authentications |
| mqtt_auth_failed_total | Counter | - | Failed MQTT authentications |
| mqtt_acl_success_total | Counter | - | Successful MQTT ACL checks |
| mqtt_acl_failed_total | Counter | - | Failed MQTT ACL checks |
| mqtt_blacklist_blocked_total | Counter | - | MQTT connections blocked by blacklist |
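
The success and failure counters can be paired to get ratios. The sketch below assumes that every connection attempt increments exactly one of mqtt_connection_success_total or mqtt_connection_failed_total, which the table does not explicitly guarantee.

```promql
# Fraction of MQTT connection attempts that failed (0-1)
rate(mqtt_connection_failed_total[5m])
  / (rate(mqtt_connection_success_total[5m]) + rate(mqtt_connection_failed_total[5m]))

# Authentication failures per second
rate(mqtt_auth_failed_total[5m])

# ACL denials per second
rate(mqtt_acl_failed_total[5m])
```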

Subscription Events

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_subscribe_success_total | Counter | - | Successful MQTT subscriptions |
| mqtt_subscribe_failed_total | Counter | - | Failed MQTT subscriptions |
| mqtt_unsubscribe_success_total | Counter | - | Successful MQTT unsubscriptions |

Session Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_session_created_total | Counter | - | MQTT sessions created |
| mqtt_session_deleted_total | Counter | - | MQTT sessions deleted |
| session_messages_in_total | Counter | client_id | Messages received per session |
| session_messages_out_total | Counter | client_id | Messages sent per session |
| connection_messages_in_total | Counter | connection_id | Messages received per connection |
| connection_messages_out_total | Counter | connection_id | Messages sent per connection |

Message Delivery Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_messages_received_total | Counter | - | Total messages received from clients |
| mqtt_messages_sent_total | Counter | - | Total messages sent to clients |
| mqtt_message_bytes_received_total | Counter | - | Total bytes received from clients |
| mqtt_message_bytes_sent_total | Counter | - | Total bytes sent to clients |
| mqtt_messages_delayed_total | Counter | - | Total delayed publish messages |
| mqtt_messages_dropped_no_subscribers_total | Counter | - | Messages dropped due to no subscribers |
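
These counters lend themselves to throughput and derived-ratio queries. A minimal sketch, with the 5m window as an assumption:

```promql
# Inbound message throughput (messages per second)
rate(mqtt_messages_received_total[5m])

# Outbound message throughput (messages per second)
rate(mqtt_messages_sent_total[5m])

# Average inbound message size in bytes
rate(mqtt_message_bytes_received_total[5m]) / rate(mqtt_messages_received_total[5m])

# Fraction of received messages dropped because no subscriber matched
rate(mqtt_messages_dropped_no_subscribers_total[5m]) / rate(mqtt_messages_received_total[5m])
```

The two ratio queries return NaN while no messages are being received, because both sides of the division are 0.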

Per-Topic Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| topic_messages_written_total | Counter | topic | Messages written to topic |
| topic_bytes_written_total | Counter | topic | Bytes written to topic |
| topic_messages_sent_total | Counter | topic | Messages sent from topic |
| topic_bytes_sent_total | Counter | topic | Bytes sent from topic |
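
Per-topic series can be numerous, so queries over them usually aggregate or rank. The sketch below is one way to do that; the cut-off of 10 topics is arbitrary.

```promql
# Ten busiest topics by inbound message rate
topk(10, sum by (topic) (rate(topic_messages_written_total[5m])))

# Fan-out per topic: messages sent to subscribers per message written
sum by (topic) (rate(topic_messages_sent_total[5m]))
  / sum by (topic) (rate(topic_messages_written_total[5m]))
```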

Per-Subscription Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| subscribe_messages_sent_total | Counter | client_id, path, status | Messages sent per subscription path |
| subscribe_topic_messages_sent_total | Counter | client_id, path, topic_name, status | Messages sent per subscription path + topic |
| subscribe_bytes_sent_total | Counter | client_id, path, status | Bytes sent per subscription path |
| subscribe_topic_bytes_sent_total | Counter | client_id, path, topic_name, status | Bytes sent per subscription path + topic |

Packet Statistics (Received)

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_packets_received_total | Counter | network | Total MQTT packets received |
| mqtt_packets_connect_received_total | Counter | network | CONNECT packets received |
| mqtt_packets_publish_received_total | Counter | network | PUBLISH packets received |
| mqtt_packets_connack_received_total | Counter | network | CONNACK packets received |
| mqtt_packets_puback_received_total | Counter | network | PUBACK packets received |
| mqtt_packets_pubrec_received_total | Counter | network | PUBREC packets received |
| mqtt_packets_pubrel_received_total | Counter | network | PUBREL packets received |
| mqtt_packets_pubcomp_received_total | Counter | network | PUBCOMP packets received |
| mqtt_packets_subscribe_received_total | Counter | network | SUBSCRIBE packets received |
| mqtt_packets_unsubscribe_received_total | Counter | network | UNSUBSCRIBE packets received |
| mqtt_packets_pingreq_received_total | Counter | network | PINGREQ packets received |
| mqtt_packets_disconnect_received_total | Counter | network | DISCONNECT packets received |
| mqtt_packets_auth_received_total | Counter | network | AUTH packets received |
| mqtt_packets_received_error_total | Counter | network | Error packets received |
| mqtt_packets_connack_auth_error_total | Counter | network | CONNACK auth error packets |
| mqtt_packets_connack_error_total | Counter | network | CONNACK error packets |
| mqtt_bytes_received_total | Counter | network | Total MQTT bytes received |
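
A brief sketch of how the network label might be used with the received-packet counters. The error ratio assumes that mqtt_packets_received_total also counts the packets recorded in mqtt_packets_received_error_total; if it does not, the denominator would need to be the sum of both.

```promql
# Received packet rate per network type
sum by (network) (rate(mqtt_packets_received_total[5m]))

# Fraction of received packets counted as errors, per network type
sum by (network) (rate(mqtt_packets_received_error_total[5m]))
  / sum by (network) (rate(mqtt_packets_received_total[5m]))
```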

Packet Statistics (Sent)

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_packets_sent_total | Counter | network, qos | Total MQTT packets sent |
| mqtt_packets_connack_sent_total | Counter | network, qos | CONNACK packets sent |
| mqtt_packets_publish_sent_total | Counter | network, qos | PUBLISH packets sent |
| mqtt_packets_puback_sent_total | Counter | network, qos | PUBACK packets sent |
| mqtt_packets_pubrec_sent_total | Counter | network, qos | PUBREC packets sent |
| mqtt_packets_pubrel_sent_total | Counter | network, qos | PUBREL packets sent |
| mqtt_packets_pubcomp_sent_total | Counter | network, qos | PUBCOMP packets sent |
| mqtt_packets_suback_sent_total | Counter | network, qos | SUBACK packets sent |
| mqtt_packets_unsuback_sent_total | Counter | network, qos | UNSUBACK packets sent |
| mqtt_packets_pingresp_sent_total | Counter | network, qos | PINGRESP packets sent |
| mqtt_packets_disconnect_sent_total | Counter | network, qos | DISCONNECT packets sent |
| mqtt_bytes_sent_total | Counter | network, qos | Total MQTT bytes sent |
| mqtt_retain_packets_received_total | Counter | qos | Retained messages received |
| mqtt_retain_packets_sent_total | Counter | qos | Retained messages sent |

Packet Processing Duration

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_packet_process_duration_ms | Histogram | network, packet | MQTT packet processing duration (ms) |
| mqtt_packet_send_duration_ms | Histogram | network, packet | MQTT packet sending duration (ms) |

Delay Message Queue Metrics

Queue Capacity (Gauge)

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_delay_queue_total_capacity | Gauge | shard_no | Delay queue total capacity |
| mqtt_delay_queue_used_capacity | Gauge | shard_no | Delay queue used capacity |
| mqtt_delay_queue_remaining_capacity | Gauge | shard_no | Delay queue remaining capacity |
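
The capacity gauges share the shard_no label, so they can be divided directly to get per-shard utilization. The 10% threshold in the second query is only an illustrative value.

```promql
# Delay queue utilization per shard (0-1)
mqtt_delay_queue_used_capacity / mqtt_delay_queue_total_capacity

# Shards with less than 10% of their capacity remaining
mqtt_delay_queue_remaining_capacity / mqtt_delay_queue_total_capacity < 0.1
```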

Message Delivery Statistics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| delay_msg_enqueue_total | Counter | - | Total messages enqueued |
| delay_msg_deliver_total | Counter | - | Total delay messages delivered |
| delay_msg_deliver_fail_total | Counter | - | Total delivery failures |
| delay_msg_recover_total | Counter | - | Total messages recovered from storage on startup |
| delay_msg_recover_expired_total | Counter | - | Total expired messages found during recovery |

Latency Distribution

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| delay_msg_enqueue_duration_ms | Histogram | - | Message enqueue duration (ms) |
| delay_msg_deliver_duration_ms | Histogram | - | Message delivery duration (ms) |
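
A minimal sketch of latency and failure queries for the delay queue, following the same histogram_quantile pattern used for other histograms in this document:

```promql
# P99 enqueue latency for delayed messages (ms)
histogram_quantile(0.99, sum by (le) (rate(delay_msg_enqueue_duration_ms_bucket[5m])))

# P99 delivery latency for delayed messages (ms)
histogram_quantile(0.99, sum by (le) (rate(delay_msg_deliver_duration_ms_bucket[5m])))

# Delayed-message delivery failures per second
rate(delay_msg_deliver_fail_total[5m])
```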

Connector Metrics

Per-Connector

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_connector_messages_sent_success_total | Counter | connector_name | Messages successfully sent by connector |
| mqtt_connector_messages_sent_failure_total | Counter | connector_name | Messages failed to send by connector |
| mqtt_connector_send_duration_ms | Histogram | connector_name | Message send duration by connector (ms) |

Aggregate

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| mqtt_connector_messages_sent_success_agg_total | Counter | - | Total messages successfully sent by all connectors |
| mqtt_connector_messages_sent_failure_agg_total | Counter | - | Total messages failed to send by all connectors |
| mqtt_connector_send_duration_ms_agg | Histogram | - | Aggregate send duration across all connectors (ms) |
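
The per-connector counters can be combined into a failure ratio. The sketch assumes that each send attempt increments exactly one of the success or failure counters.

```promql
# Send failure ratio per connector (0-1)
sum by (connector_name) (rate(mqtt_connector_messages_sent_failure_total[5m]))
  / (
      sum by (connector_name) (rate(mqtt_connector_messages_sent_success_total[5m]))
    + sum by (connector_name) (rate(mqtt_connector_messages_sent_failure_total[5m]))
  )

# P95 send latency per connector (ms)
histogram_quantile(0.95, sum by (le, connector_name) (rate(mqtt_connector_send_duration_ms_bucket[5m])))
```

A connector that has only one of the two counter series so far produces no result for the ratio.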

Usage Examples

Prometheus Configuration

yaml
scrape_configs:
  - job_name: 'robustmq'
    static_configs:
      - targets: ['localhost:9091']
    metrics_path: '/metrics'
    scrape_interval: 15s

Grafana Query Examples

Network layer queries:

promql
# Request processing P95 latency (by network type)
histogram_quantile(0.95, sum(rate(handler_total_ms_bucket[5m])) by (le, network))

# Handler queue wait P95 latency
histogram_quantile(0.95, rate(handler_queue_wait_ms_bucket[5m]))

# Average command.apply() execution time
rate(handler_apply_ms_sum[5m]) / rate(handler_apply_ms_count[5m])

# Queue backlog
handler_queue_size

# Active threads by type
broker_active_thread_num{thread_type="handler"}

gRPC/HTTP queries:

promql
# gRPC request rate (per second)
sum(rate(grpc_requests_total[5m]))

# gRPC error rate (%); sum() drops the status_code label so both sides match
sum(rate(grpc_errors_total[5m])) / sum(rate(grpc_requests_total[5m])) * 100

# HTTP request rate (per second)
sum(rate(http_requests_total[5m]))

RocksDB queries:

promql
# RocksDB QPS by operation type
sum(rate(rocksdb_operation_count_total[5m])) by (operation)

# RocksDB average operation latency
rate(rocksdb_operation_ms_sum[5m]) / rate(rocksdb_operation_ms_count[5m])

Raft queries:

promql
# Raft write QPS by state machine
sum(rate(raft_write_requests_total[5m])) by (machine)

# Raft write success rate
sum(rate(raft_write_success_total[5m])) / sum(rate(raft_write_requests_total[5m]))

# Raft write P99 latency
histogram_quantile(0.99, sum(rate(raft_write_duration_ms_bucket[5m])) by (le, machine))

# Raft apply lag per state machine (0 = fully caught up)
raft_apply_lag

System & process resource queries:

promql
# System CPU usage (%)
system_cpu_usage

# System memory usage (%)
system_memory_usage

# Process CPU usage (%)
system_process_cpu_usage

# Process memory usage (%)
system_process_memory_usage

Tokio runtime queries:

promql
# Runtime busy ratio per runtime (%)
tokio_runtime_busy_ratio

# Runtime global queue depth
tokio_runtime_queue_depth

# Alive tasks per runtime
tokio_runtime_alive_tasks

MQTT resource queries:

promql
# Current connections
mqtt_connections_count

# Current subscribers
mqtt_subscribers_count

# Successful connections per second
rate(mqtt_connection_success_total[5m])

# MQTT packet processing P99 latency by type
histogram_quantile(0.99, sum(rate(mqtt_packet_process_duration_ms_bucket[5m])) by (le, packet))

Alert Rules Examples

yaml
groups:
  - name: robustmq_network_metrics
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(handler_total_ms_bucket[5m])) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "P95 request latency exceeds 1000ms, current: {{ $value }}ms"

      - alert: HandlerQueueBacklog
        expr: handler_queue_size > 10000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Handler queue backlog"
          description: "Queue pending requests exceed 10000, current: {{ $value }}"

  - name: robustmq_grpc_metrics
    rules:
      - alert: HighGrpcErrorRate
        expr: sum(rate(grpc_errors_total[5m])) / sum(rate(grpc_requests_total[5m])) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High gRPC error rate"
          description: "gRPC error rate exceeds 5%, current: {{ $value | humanizePercentage }}"

  - name: robustmq_storage_metrics
    rules:
      - alert: RocksDBSlowOperations
        expr: histogram_quantile(0.95, rate(rocksdb_operation_ms_bucket[5m])) > 100
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Slow RocksDB operations"
          description: "P95 RocksDB operation latency exceeds 100ms, current: {{ $value }}ms"

  - name: robustmq_raft_metrics
    rules:
      - alert: RaftWriteFailureRate
        expr: rate(raft_write_failures_total[5m]) / rate(raft_write_requests_total[5m]) > 0.01
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High Raft write failure rate"
          description: "Raft write failure rate exceeds 1%, current: {{ $value | humanizePercentage }}"

      - alert: RaftApplyLag
        expr: raft_apply_lag > 1000
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Raft state machine apply lag"
          description: "State machine {{ $labels.machine }} is lagging {{ $value }} entries behind the log"

  - name: robustmq_runtime_metrics
    rules:
      - alert: TokioRuntimeSaturated
        expr: tokio_runtime_busy_ratio > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Tokio runtime saturated"
          description: "Runtime {{ $labels.runtime }} busy ratio is {{ $value }}%, workers may be a bottleneck"

      - alert: TokioRuntimeQueueBacklog
        expr: tokio_runtime_queue_depth > 500
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Tokio runtime queue backlog"
          description: "Runtime {{ $labels.runtime }} global queue depth is {{ $value }}"

Related Files

  • Metric definitions: src/common/metrics/src/
    • Network: network.rs
    • gRPC: grpc.rs
    • HTTP: http.rs
    • RocksDB: rocksdb.rs
    • Raft: meta/raft.rs
    • System/process resources & Tokio runtimes: broker.rs
    • MQTT: mqtt/ directory
  • Collection implementation: src/common/system-info/src/
    • System/process resource collection: lib.rs
    • Tokio runtime collection: runtime.rs
  • Grafana dashboard: grafana/robustmq-broker.json