Monitoring

Monitoring WuKongIM with Prometheus lets you track performance over time and detect issues early.

Prerequisites

Install Prometheus for metrics collection and monitoring.

Install Prometheus

# Download Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz

# Extract
tar xvfz prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Create user and directories
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus

# Copy binaries
sudo cp prometheus promtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

# Copy console templates and libraries (referenced by the systemd unit in Start Prometheus)
sudo cp -r consoles console_libraries /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
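
It can be worth verifying the downloaded archive before extracting it. The following is a minimal sketch, assuming the release also publishes a `sha256sums.txt` file next to the tarball and that GNU `sha256sum` is installed; the function name `verify_download` is a hypothetical helper, not part of Prometheus.

```shell
# Verify an archive against a checksum list (both files in the current directory).
# $1: archive file name, $2: checksum list in "hash  filename" format.
# Returns non-zero if the hash differs or the archive is not listed.
verify_download() {
  grep "  $1\$" "$2" | sha256sum -c -
}

# Example (the sha256sums.txt URL is an assumption about the release layout):
#   wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/sha256sums.txt
#   verify_download prometheus-2.45.0.linux-amd64.tar.gz sha256sums.txt
```

Run the check before the `tar xvfz` step and continue only if it reports `OK`.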

Configure Prometheus

Add WuKongIM monitoring targets under the scrape_configs section in your Prometheus configuration.

Single Node Configuration

For a single-node deployment, create /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'wukongim-trace-metrics'
    static_configs:
      - targets: ['xx.xx.xx.xx:5300']
        labels:
          id: "1001"
          instance: "wukongim-node1"

Multi-Node Configuration

For a multi-node cluster deployment, add one scrape job per node:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'wukongim1-trace-metrics'
    static_configs:
      - targets: ['10.206.0.13:5300']
        labels:
          id: "1001"
          instance: "wukongim-node1"
          
  - job_name: 'wukongim2-trace-metrics'
    static_configs:
      - targets: ['10.206.0.14:5300']
        labels:
          id: "1002"
          instance: "wukongim-node2"
          
  - job_name: 'wukongim3-trace-metrics'
    static_configs:
      - targets: ['10.206.0.8:5300']
        labels:
          id: "1003"
          instance: "wukongim-node3"

Configuration Parameters:
  • job_name: Unique job name for each WuKongIM node
  • targets: Internal IP address of the WuKongIM node, followed by port 5300
  • labels.id: WuKongIM node ID
  • labels.instance: Human-readable instance name

Replace xx.xx.xx.xx with the actual internal IP address of each WuKongIM node.
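
For clusters with more than a few nodes, the per-node jobs follow a fixed pattern and can be generated instead of typed by hand. The following is a minimal sketch; the function name `wk_scrape_job` and its argument order are hypothetical, and the job name it emits is derived from the instance name rather than copied from the examples above.

```shell
# Print one scrape job stanza for a WuKongIM node.
# Arguments: node ID, internal IP, instance name (all supplied by you).
wk_scrape_job() {
  node_id="$1"; ip="$2"; name="$3"
  printf "  - job_name: '%s-trace-metrics'\n" "$name"
  printf "    static_configs:\n"
  printf "      - targets: ['%s:5300']\n" "$ip"
  printf "        labels:\n"
  printf "          id: \"%s\"\n" "$node_id"
  printf "          instance: \"%s\"\n" "$name"
}

# Example: emit the stanza for the first node of the cluster.
wk_scrape_job 1001 10.206.0.13 wukongim-node1
```

Paste the output under `scrape_configs:` in prometheus.yml, keeping the two-space indentation.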

Configure WuKongIM

Add Prometheus configuration to each node’s wk.yaml file:
mode: "release"
# ... other configurations ...

trace:
  prometheusApiUrl: "http://xx.xx.xx.xx:9090"
Replace xx.xx.xx.xx with the internal IP address of your Prometheus server.

Complete WuKongIM Configuration Example

mode: "release"
rootDir: "./wukongim_data"

# Cluster configuration (for multi-node)
cluster:
  nodeId: 1001
  serverAddr: "10.206.0.13:11110"
  apiUrl: "http://10.206.0.13:5001"
  initNodes:
    - "1001@10.206.0.13:11110"
    - "1002@10.206.0.14:11110"
    - "1003@10.206.0.8:11110"

# External configuration
external:
  ip: "119.45.229.172"
  tcpAddr: "119.45.229.172:15100"
  wsAddr: "ws://119.45.229.172:15200"

# Monitoring configuration
trace:
  prometheusApiUrl: "http://10.206.0.13:9090"
  
# Logging configuration
logger:
  level: "info"
  dir: "./logs"

Start Services

Start Prometheus

Create a systemd service file for Prometheus:
sudo nano /etc/systemd/system/prometheus.service
Add the following content:
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file=/etc/prometheus/prometheus.yml \
    --storage.tsdb.path=/var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090

[Install]
WantedBy=multi-user.target
Enable and start Prometheus:
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
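
Beyond `systemctl status`, Prometheus exposes a readiness endpoint at `/-/ready` that can confirm the server is actually serving requests. The following is a minimal sketch, assuming `curl` is installed; the function name `wait_for_prometheus` and the 30-attempt limit are illustrative choices.

```shell
# Return success once Prometheus answers its readiness endpoint.
# $1: base URL (defaults to the local instance); retries about 30 times.
wait_for_prometheus() {
  base_url="${1:-http://localhost:9090}"
  attempts=0
  while [ "$attempts" -lt 30 ]; do
    if curl -sf -o /dev/null "$base_url/-/ready"; then
      echo "Prometheus is ready at $base_url"
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 1
  done
  echo "Prometheus did not become ready at $base_url" >&2
  return 1
}

# Example: wait_for_prometheus http://localhost:9090
```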

Restart WuKongIM

After updating the configuration, restart WuKongIM on all nodes:
./wukongim stop
./wukongim --config wk.yaml -d

Verification

Check Prometheus Targets

  1. Access Prometheus web interface: http://prometheus-server-ip:9090
  2. Go to Status → Targets
  3. Verify all WuKongIM targets are UP
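
The same check can be done from the command line through the Prometheus HTTP API endpoint `/api/v1/targets`. The following is a rough sketch that counts targets by health state using only `grep`, assuming the compact JSON that Prometheus returns; the function name `summarize_target_health` is hypothetical, and a JSON-aware tool would be more robust.

```shell
# Count targets per health state from /api/v1/targets JSON read on stdin.
# Prints lines such as:   3 "health":"up"
summarize_target_health() {
  grep -o '"health":"[a-z]*"' | sort | uniq -c
}

# Example:
#   curl -s http://prometheus-server-ip:9090/api/v1/targets | summarize_target_health
```

Any line reporting `"health":"down"` means at least one target is failing its scrape.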

Check Metrics

Query WuKongIM metrics in Prometheus:
# Check if WuKongIM metrics are being collected
wukongim_connections_total

# Check message throughput
rate(wukongim_messages_total[5m])

# Check memory usage
wukongim_memory_usage_bytes

# Check CPU usage
wukongim_cpu_usage_percent
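
These expressions can also be run outside the web interface through the Prometheus HTTP API endpoint `/api/v1/query`. The following is a minimal sketch, assuming `curl` is available; the function name `prom_query` and its default URL are illustrative.

```shell
# Run an instant PromQL query against the Prometheus HTTP API.
# $1: PromQL expression, $2: base URL (defaults to the local instance).
prom_query() {
  promql="$1"
  base_url="${2:-http://localhost:9090}"
  # -G sends the URL-encoded data as a query string on a GET request.
  curl -s -G "$base_url/api/v1/query" --data-urlencode "query=$promql"
}

# Example: prom_query 'rate(wukongim_messages_total[5m])'
```

The response is JSON; an empty `result` array means the metric name returned no series.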

Key Metrics to Monitor

System Metrics

| Metric | Description |
|---|---|
| wukongim_connections_total | Total number of active connections |
| wukongim_messages_total | Total number of messages processed |
| wukongim_memory_usage_bytes | Memory usage in bytes |
| wukongim_cpu_usage_percent | CPU usage percentage |
| wukongim_disk_usage_bytes | Disk usage in bytes |

Cluster Metrics (Multi-node)

| Metric | Description |
|---|---|
| wukongim_cluster_nodes_total | Total number of cluster nodes |
| wukongim_cluster_leader_changes_total | Number of leader changes |
| wukongim_cluster_proposals_failed_total | Failed proposals count |
| wukongim_cluster_proposals_committed_total | Committed proposals count |

Performance Metrics

| Metric | Description |
|---|---|
| wukongim_message_latency_seconds | Message processing latency |
| wukongim_api_request_duration_seconds | API request duration |
| wukongim_websocket_connections | WebSocket connections count |
| wukongim_tcp_connections | TCP connections count |

Alerting Rules

Create alerting rules in /etc/prometheus/alert_rules.yml:
groups:
  - name: wukongim
    rules:
      - alert: WuKongIMDown
        expr: up{job=~"wukongim.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "WuKongIM instance is down"
          description: "WuKongIM instance {{ $labels.instance }} has been down for more than 1 minute."

      - alert: HighMemoryUsage
        expr: wukongim_memory_usage_bytes / (1024*1024*1024) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on WuKongIM"
          description: "WuKongIM instance {{ $labels.instance }} is using more than 2GB of memory."

      - alert: HighCPUUsage
        expr: wukongim_cpu_usage_percent > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on WuKongIM"
          description: "WuKongIM instance {{ $labels.instance }} CPU usage is above 80%."

      - alert: TooManyConnections
        expr: wukongim_connections_total > 10000
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Too many connections on WuKongIM"
          description: "WuKongIM instance {{ $labels.instance }} has more than 10,000 active connections."

Update the Prometheus configuration to load the alert rules and send alerts to Alertmanager:
# Add to prometheus.yml
rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # Replace with the address of your Alertmanager instance
          - alertmanager:9093
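
A syntax error in either file prevents Prometheus from starting, so it helps to validate both before restarting. The following is a minimal sketch using `promtool check rules` and `promtool check config`; the function name `reload_if_valid` and its default paths are assumptions based on the locations used in this guide.

```shell
# Validate the rule file and main config, then restart Prometheus only if both pass.
# $1: rule file, $2: main config (defaults match the paths used in this guide).
reload_if_valid() {
  rules="${1:-/etc/prometheus/alert_rules.yml}"
  config="${2:-/etc/prometheus/prometheus.yml}"
  promtool check rules "$rules" || return 1
  promtool check config "$config" || return 1
  sudo systemctl restart prometheus
}

# Example: reload_if_valid
```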

Grafana Dashboard

Install Grafana

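# Add the Grafana signing key, then the repository
sudo apt-get install -y apt-transport-https software-properties-common wget gnupg
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://packages.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://packages.grafana.com/oss/deb stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Install Grafana
sudo apt-get update
sudo apt-get install -y grafana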

# Start Grafana
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

Configure Data Source

  1. Access Grafana at http://grafana-server-ip:3000 (default login: admin/admin)
  2. Add Prometheus data source: http://prometheus-server-ip:9090
  3. Import WuKongIM dashboard or create custom dashboards
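
As an alternative to adding the data source in the UI, Grafana can load it from a provisioning file at startup. The following is a minimal sketch that prints such a file; the function name `grafana_prometheus_datasource` is hypothetical, and the target path in the example assumes the default layout of the Debian package.

```shell
# Print a Grafana data source provisioning file for Prometheus.
# $1: Prometheus base URL as reachable from the Grafana server.
grafana_prometheus_datasource() {
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $1
    isDefault: true
EOF
}

# Example (path assumes the default Debian package layout):
#   grafana_prometheus_datasource http://prometheus-server-ip:9090 \
#     | sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml
#   sudo systemctl restart grafana-server
```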

Sample Dashboard Queries

Connection Count:
sum(wukongim_connections_total)
Message Rate:
sum(rate(wukongim_messages_total[5m]))
Memory Usage:
wukongim_memory_usage_bytes / (1024*1024*1024)
CPU Usage:
wukongim_cpu_usage_percent

Troubleshooting

Prometheus Not Collecting Metrics

# Check if WuKongIM metrics endpoint is accessible
curl http://wukongim-node-ip:5300/metrics

# Check Prometheus logs
sudo journalctl -u prometheus -f

# Verify Prometheus configuration
promtool check config /etc/prometheus/prometheus.yml

WuKongIM Not Sending Metrics

# Check WuKongIM logs
tail -f ./wukongim_data/logs/wukongim.log

# Verify trace configuration in wk.yaml
grep -A 5 "trace:" wk.yaml

# Test connectivity to Prometheus
curl http://prometheus-server-ip:9090/api/v1/targets
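
In a multi-node cluster, probing every metrics endpoint in one pass narrows down which node is unreachable. The following is a minimal sketch, assuming `curl` is installed and each node exposes metrics on port 5300 as configured above; the function name `check_metrics_endpoints` is hypothetical.

```shell
# Probe the metrics endpoint of each node given as an argument.
# Prints one line per node and returns non-zero if any probe fails.
check_metrics_endpoints() {
  failed=0
  for host in "$@"; do
    if curl -sf -o /dev/null --max-time 5 "http://$host:5300/metrics"; then
      echo "$host: ok"
    else
      echo "$host: unreachable"
      failed=1
    fi
  done
  return "$failed"
}

# Example: check_metrics_endpoints 10.206.0.13 10.206.0.14 10.206.0.8
```

Run it from the Prometheus server so the result reflects the network path that the scrapes actually use.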

Next Steps

  • Performance Tuning: Optimize WuKongIM performance based on monitoring data
  • Backup Strategy: Set up automated backup and recovery
  • Load Testing: Test system performance under load
  • Cluster Management: Learn about cluster scaling and management