Infrastructure Metrics

RobustMQ exposes infrastructure metrics that help operations teams monitor system health, performance, and resource usage. All metrics use the Prometheus exposition format and are served from the /metrics endpoint.
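For quick checks outside Prometheus, the text served at /metrics can be read with a few lines of code. The following is a minimal sketch, assuming only the standard Prometheus text format; the sample values are made up for illustration, and the parser skips comment lines and does not unescape label values.

```python
import re

# Matches sample lines such as: name{label="value"} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that is not a sample line
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical scrape output; the values are illustrative only.
SAMPLE_TEXT = """\
# HELP broker_connections_max Maximum connection limit for broker
# TYPE broker_connections_max gauge
broker_connections_max 1000
broker_network_queue_num{queue_type="handler"} 3
"""

samples = parse_metrics(SAMPLE_TEXT)
for name, labels, value in samples:
    print(name, labels, value)
```

In practice the text would come from an HTTP GET against the broker's /metrics endpoint rather than from an inline string.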

Network Layer Metrics

Request Processing Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| request_total_ms | Histogram | request_type, network | Total request processing time (milliseconds) |
| request_queue_ms | Histogram | request_type, network | Request queue waiting time (milliseconds) |
| request_handler_ms | Histogram | request_type, network | Request handler execution time (milliseconds) |
| request_response_ms | Histogram | request_type, network | Response processing time (milliseconds) |
| request_response_queue_ms | Histogram | request_type, network | Response queue waiting time (milliseconds) |

Connection and Queue Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| broker_connections_max | Gauge | - | Maximum connection limit for broker |
| broker_network_queue_num | Gauge | queue_type | Number of messages in network queue |
| broker_active_thread_num | Gauge | thread_type | Number of active threads |

Label Descriptions:

  • request_type: Request type (e.g., mqtt, grpc, http)
  • network: Network type (tcp, websocket, quic)
  • queue_type: Queue type (accept, handler, response)
  • thread_type: Thread type (accept, handler, response)

gRPC Service Metrics

Request Statistics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| grpc_requests_total | Counter | method, service | Total number of gRPC requests |
| grpc_request_errors_total | Counter | method, service, error_code | Total number of gRPC request errors |

Performance Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| grpc_request_duration_milliseconds | Histogram | method, service | gRPC request duration (milliseconds) |
| grpc_request_size_bytes | Histogram | method, service | gRPC request size (bytes) |
| grpc_response_size_bytes | Histogram | method, service | gRPC response size (bytes) |

Label Descriptions:

  • method: gRPC method name
  • service: gRPC service name
  • error_code: Error code
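Because grpc_request_errors_total carries an extra error_code label, an error ratio needs the error series summed per service and method before dividing by grpc_requests_total. The sketch below illustrates that aggregation on already-parsed samples; the service name, method name, and counts are hypothetical.

```python
from collections import defaultdict


def error_ratio(error_samples, request_samples):
    """Return {(service, method): errors / requests}, summing over error_code."""
    errors = defaultdict(float)
    for labels, value in error_samples:
        errors[(labels["service"], labels["method"])] += value

    ratios = {}
    for labels, value in request_samples:
        key = (labels["service"], labels["method"])
        if value > 0:  # avoid dividing by zero when no requests were seen
            ratios[key] = errors[key] / value
    return ratios


# Hypothetical counter values for one service/method pair.
error_samples = [
    ({"service": "ExampleService", "method": "ExampleMethod", "error_code": "timeout"}, 3.0),
    ({"service": "ExampleService", "method": "ExampleMethod", "error_code": "internal"}, 2.0),
]
request_samples = [
    ({"service": "ExampleService", "method": "ExampleMethod"}, 200.0),
]

ratios = error_ratio(error_samples, request_samples)
print(ratios)
```

Dividing the raw series directly would pair nothing up, since the label sets on the two metrics differ.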

HTTP Service Metrics

Request Statistics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| http_requests_total | Counter | method, path, status_code | Total number of HTTP requests |
| http_request_errors_total | Counter | method, path, error_type | Total number of HTTP request errors |

Performance Metrics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| http_request_duration_milliseconds | Histogram | method, path | HTTP request duration (milliseconds) |
| http_request_size_bytes | Histogram | method, path | HTTP request size (bytes) |
| http_response_size_bytes | Histogram | method, path | HTTP response size (bytes) |

Label Descriptions:

  • method: HTTP method (GET, POST, PUT, DELETE)
  • path: Request path
  • status_code: HTTP status code
  • error_type: Error type

Storage Layer Metrics (RocksDB)

Operation Statistics

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| rocksdb_operation_count | Counter | source, operation | Number of RocksDB operations |
| rocksdb_operation_ms | Histogram | source, operation | RocksDB operation duration (milliseconds) |

Label Descriptions:

  • source: Data source (e.g., metadata, session, message)
  • operation: Operation type (save, get, delete, list)

Common Operation Types

  • save: Data write operations
  • get: Data read operations
  • delete: Data deletion operations
  • list: Data list query operations

Broker Core Metrics

System Status

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| broker_status | Gauge | node_id | Broker node status (1=running, 0=stopped) |
| broker_uptime_seconds | Counter | node_id | Broker uptime (seconds) |

Resource Usage

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| broker_memory_usage_bytes | Gauge | node_id, type | Memory usage (bytes) |
| broker_cpu_usage_percent | Gauge | node_id | CPU usage (percentage) |
| broker_disk_usage_bytes | Gauge | node_id, path | Disk usage (bytes) |

Label Descriptions:

  • node_id: Node identifier
  • type: Memory type (heap, non_heap, total)
  • path: Disk path

Meta Service Metrics

Cluster Status

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| meta_cluster_nodes_total | Gauge | cluster_id | Total number of cluster nodes |
| meta_cluster_nodes_active | Gauge | cluster_id | Number of active cluster nodes |
| meta_leader_elections_total | Counter | cluster_id | Number of leader elections |

Data Synchronization

| Metric Name | Type | Labels | Description |
| --- | --- | --- | --- |
| meta_sync_operations_total | Counter | operation_type | Number of metadata sync operations |
| meta_sync_latency_ms | Histogram | operation_type | Metadata sync latency (milliseconds) |

Usage Examples

Prometheus Configuration

```yaml
scrape_configs:
  - job_name: 'robustmq'
    static_configs:
      - targets: ['localhost:9090']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

Grafana Query Examples

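```promql
# Average request processing time (milliseconds)
rate(request_total_ms_sum[5m]) / rate(request_total_ms_count[5m])

# gRPC error rate (percent); both sides are aggregated because the error
# series carry an extra error_code label and would not match otherwise
sum(rate(grpc_request_errors_total[5m])) / sum(rate(grpc_requests_total[5m])) * 100

# RocksDB operation QPS
rate(rocksdb_operation_count[5m])

# Network queue backlog
broker_network_queue_num
```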
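The average-latency query divides the growth of the histogram's _sum series by the growth of its _count series; the per-second normalization done by rate() cancels out. A minimal sketch of the same arithmetic on two scrapes, with hypothetical readings:

```python
def average_latency_ms(sum_start, count_start, sum_end, count_end):
    """Mean latency of the requests observed between two scrapes."""
    delta_count = count_end - count_start
    if delta_count <= 0:
        return None  # no requests in the window, or the counter was reset
    return (sum_end - sum_start) / delta_count


# Hypothetical readings of request_total_ms_sum and request_total_ms_count
# taken at the start and end of a window.
avg = average_latency_ms(sum_start=12000.0, count_start=400,
                         sum_end=15000.0, count_end=500)
print(avg)  # 3000 ms spread over 100 requests -> 30.0
```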

Alert Rules Examples

```yaml
groups:
  - name: robustmq_infrastructure
    rules:
      - alert: HighRequestLatency
        expr: histogram_quantile(0.95, rate(request_total_ms_bucket[5m])) > 1000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High request latency"
          description: "95th percentile request latency is above 1 second"

      - alert: HighErrorRate
        # Aggregate both sides: the error series carry an extra error_code label
        expr: sum(rate(grpc_request_errors_total[5m])) / sum(rate(grpc_requests_total[5m])) > 0.05
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High gRPC error rate"
          description: "gRPC error rate exceeds 5%"

      - alert: RocksDBSlowOperations
        expr: histogram_quantile(0.95, rate(rocksdb_operation_ms_bucket[5m])) > 100
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Slow RocksDB operations"
          description: "95th percentile RocksDB operation latency is above 100ms"
```
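The latency alerts rely on histogram_quantile, which estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below approximates that calculation for classic histograms; the bucket bounds and counts are hypothetical and do not reflect RobustMQ's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with at least one finite bucket and a final +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the target rank.
    for i, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)


# Hypothetical cumulative counts for buckets with bounds in milliseconds.
buckets = [(50.0, 60), (100.0, 90), (500.0, 98), (math.inf, 100)]
p95 = histogram_quantile(0.95, buckets)
print(p95)  # about 350 ms: rank 95 lies 5/8 of the way through the 100-500 bucket
```

Because the result is interpolated, its accuracy depends on how finely the buckets are spaced around the alert threshold.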

Together, these infrastructure metrics give operations teams visibility into the health of a RobustMQ deployment, so they can identify and resolve performance issues promptly and keep the system stable.