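Alertmanager Alerting Rules and Notification Configuration

Alertmanager is the component of the Prometheus ecosystem that handles alerts. It receives alerts from Prometheus and sends notifications through channels such as email, webhooks, and Slack. This article walks through installing and configuring Alertmanager on a BandwagonHost VPS, writing alerting rules, and setting up several notification channels.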

1. Install Alertmanager

1.1 Create a system user

useradd --no-create-home --shell /bin/false alertmanager
mkdir -p /etc/alertmanager /var/lib/alertmanager
chown alertmanager:alertmanager /etc/alertmanager /var/lib/alertmanager

1.2 Download and install

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/alertmanager-0.27.0.linux-amd64.tar.gz
tar xzf alertmanager-0.27.0.linux-amd64.tar.gz
cp alertmanager-0.27.0.linux-amd64/alertmanager /usr/local/bin/
cp alertmanager-0.27.0.linux-amd64/amtool /usr/local/bin/
chown alertmanager:alertmanager /usr/local/bin/alertmanager /usr/local/bin/amtool
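
Optionally, the downloaded tarball can be checked against a published checksum before the binaries are copied into place. The sketch below assumes that the release page offers a sha256sums.txt file next to the tarball and that sha256sum is available; verify_sha256 is a hypothetical helper name, not part of Alertmanager.

```shell
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only if the SHA-256 digest of FILE equals EXPECTED_HEX.
verify_sha256() {
  _actual=$(sha256sum "$1" | awk '{print $1}')
  [ "$_actual" = "$2" ]
}

# Example usage (not run here; needs network access and the tarball in /tmp):
#   cd /tmp
#   wget https://github.com/prometheus/alertmanager/releases/download/v0.27.0/sha256sums.txt
#   expected=$(grep 'alertmanager-0.27.0.linux-amd64.tar.gz' sha256sums.txt | awk '{print $1}')
#   verify_sha256 alertmanager-0.27.0.linux-amd64.tar.gz "$expected" && echo "checksum OK"
```

If the function fails, the download should be repeated rather than installed.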

2. Configure Alertmanager

2.1 Main configuration file

cat > /etc/alertmanager/alertmanager.yml <<EOF
global:
  resolve_timeout: 5m
  smtp_from: 'alert@your-domain.com'
  smtp_smarthost: 'smtp.your-domain.com:587'
  smtp_auth_username: 'alert@your-domain.com'
  smtp_auth_password: 'your_smtp_password'
  smtp_require_tls: true

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-email'
  routes:
    - matchers:
        - severity="critical"
      receiver: 'critical-email'
      repeat_interval: 1h
    - matchers:
        - severity="warning"
      receiver: 'default-email'
      repeat_interval: 4h

receivers:
  - name: 'default-email'
    email_configs:
      - to: 'admin@your-domain.com'
        send_resolved: true

  - name: 'critical-email'
    email_configs:
      - to: 'admin@your-domain.com'
        send_resolved: true
    webhook_configs:
      - url: 'http://localhost:5001/webhook'
        send_resolved: true

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']
EOF

chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml

2.2 Validate the configuration

amtool check-config /etc/alertmanager/alertmanager.yml

3. Create a systemd service

cat > /etc/systemd/system/alertmanager.service <<'EOF'
[Unit]
Description=Alertmanager
Wants=network-online.target
After=network-online.target

[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/local/bin/alertmanager \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/var/lib/alertmanager/ \
  --web.listen-address=:9093 \
  --cluster.listen-address=
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager
systemctl status alertmanager
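
systemctl status only shows that the process is running. To confirm that Alertmanager is actually serving requests, its readiness endpoint can be polled. This is a minimal sketch, assuming curl is installed and that the service listens on localhost:9093 as configured above; wait_ready is a hypothetical helper name.

```shell
# wait_ready BASE_URL [ATTEMPTS] [DELAY_SECONDS]
# Polls BASE_URL/-/ready until it answers with a 2xx status.
# Returns 0 on success, 1 if every attempt fails.
wait_ready() {
  _url=$1
  _attempts=${2:-10}
  _delay=${3:-2}
  _i=0
  while [ "$_i" -lt "$_attempts" ]; do
    if curl -fsS -o /dev/null --max-time 3 "$_url/-/ready" 2>/dev/null; then
      return 0
    fi
    _i=$((_i + 1))
    sleep "$_delay"
  done
  return 1
}

# Example usage (not run here; needs a running Alertmanager):
#   wait_ready http://localhost:9093 && echo "Alertmanager is ready"
```

A non-zero return after all attempts usually means the service failed to start; the journal for the alertmanager unit is the next place to look.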

4. Configure alerting rules in Prometheus

4.1 Connect Prometheus to Alertmanager

Confirm that /etc/prometheus/prometheus.yml contains the following:

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9093

rule_files:
  - "rules/*.yml"

4.2 Basic server alerting rules

mkdir -p /etc/prometheus/rules

cat > /etc/prometheus/rules/server_alerts.yml <<EOF
groups:
  - name: server_health
    interval: 30s
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance down: {{ \$labels.instance }}"
          description: "Instance {{ \$labels.instance }} of job {{ \$labels.job }} has been down for more than 2 minutes."

      - alert: HighCpuUsage
        expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage: {{ \$labels.instance }}"
          description: "CPU usage is {{ \$value | printf \"%.1f\" }}%, above 80% for 10 minutes."

      - alert: HighMemoryUsage
        expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage: {{ \$labels.instance }}"
          description: "Memory usage is {{ \$value | printf \"%.1f\" }}%, above the 85% threshold."

      - alert: DiskSpaceCritical
        expr: (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critically low: {{ \$labels.instance }}"
          description: "Root filesystem usage is {{ \$value | printf \"%.1f\" }}%, above the 90% threshold."

      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total{device="eth0"}[5m]) > 100000000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal network traffic: {{ \$labels.instance }}"
          description: "Inbound traffic has exceeded 100 MB/s for 10 minutes."
EOF

chown -R prometheus:prometheus /etc/prometheus/rules/
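
The threshold expressions in these rules reduce to simple arithmetic, which can be checked by hand before a rule is deployed. The sketch below reproduces the (1 - available / total) * 100 formula used by the memory and disk rules and converts the 100000000 bytes/s network threshold into Mbit/s. The helper names used_percent and bytes_per_sec_to_mbit and the sample filesystem sizes are made up for illustration.

```shell
# used_percent AVAILABLE TOTAL
# Mirrors (1 - available / total) * 100 from the memory and disk rules.
used_percent() {
  awk -v avail="$1" -v total="$2" 'BEGIN { printf "%.1f\n", (1 - avail / total) * 100 }'
}

# bytes_per_sec_to_mbit BYTES_PER_SECOND
# Converts a byte rate into megabits per second (1 Mbit = 1,000,000 bits).
bytes_per_sec_to_mbit() {
  awk -v b="$1" 'BEGIN { printf "%.0f\n", b * 8 / 1000000 }'
}

# A 40 GB root filesystem with 3 GB available (would trigger DiskSpaceCritical):
used_percent 3000000000 40000000000     # prints 92.5

# The HighNetworkTraffic threshold expressed in Mbit/s:
bytes_per_sec_to_mbit 100000000         # prints 800
```

The second result shows that the network rule only fires at a sustained 800 Mbit/s, so the threshold may need lowering on plans with a slower port.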

4.3 Validate rule syntax

promtool check rules /etc/prometheus/rules/server_alerts.yml
systemctl reload prometheus

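5. Notification channels

5.1 Email notifications

Email notifications are already configured in the global and receivers sections of the main configuration file. SMTP settings for common providers:

Gmail: smtp.gmail.com:587 (requires an app password).
QQ Mail: smtp.qq.com:587 (enable the SMTP service to obtain an authorization code).
Alibaba Mail: smtp.mxhichina.com:465.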

5.2 Webhook notifications

receivers:
  - name: 'webhook-notify'
    webhook_configs:
      - url: 'https://your-webhook-endpoint.com/alert'
        send_resolved: true
        http_config:
          basic_auth:
            username: 'user'
            password: 'pass'

5.3 Telegram bot notifications

receivers:
  - name: 'telegram-notify'
    telegram_configs:
      - bot_token: 'YOUR_BOT_TOKEN'
        chat_id: YOUR_CHAT_ID
        parse_mode: 'HTML'
        message: |
          {{ range .Alerts }}
          <b>{{ .Labels.alertname }}</b>
          Severity: {{ .Labels.severity }}
          Instance: {{ .Labels.instance }}
          {{ .Annotations.description }}
          {{ end }}

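6. Advanced alert routing

6.1 Grouping strategy

Core routing parameters:

group_by: the labels used to group alerts; alerts in the same group are combined into a single notification.
group_wait: how long to wait before sending the first notification for a new group, so that other alerts of the same group can be collected.
group_interval: how long to wait before notifying about new alerts added to a group that has already been notified.
repeat_interval: how long to wait before re-sending a notification for alerts that are still firing.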
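
How these timers interact is easiest to see on a timeline. The sketch below is a simplified model, not Alertmanager's actual scheduler. It assumes a group is flushed first after group_wait and then every group_interval, and that an unchanged group is re-notified at the first flush that falls at least repeat_interval after the previous notification. The function name next_flush_at is hypothetical; the values are those of the main configuration (30s, 5m, 4h), in seconds.

```shell
# next_flush_at T GROUP_WAIT GROUP_INTERVAL
# Prints the time of the first group flush at or after T.
# All values are seconds since the group's first alert arrived.
next_flush_at() {
  if [ "$1" -le "$2" ]; then
    echo "$2"
  else
    # Round (T - group_wait) up to a whole number of group_interval periods.
    echo $(( $2 + ( ($1 - $2 + $3 - 1) / $3 ) * $3 ))
  fi
}

GROUP_WAIT=30
GROUP_INTERVAL=300
REPEAT_INTERVAL=14400

# The first alert of the group arrives at t=0: first notification.
next_flush_at 0 "$GROUP_WAIT" "$GROUP_INTERVAL"      # prints 30

# A second alert joins the group at t=100: sent with the next flush.
next_flush_at 100 "$GROUP_WAIT" "$GROUP_INTERVAL"    # prints 330

# Nothing changes after the t=30 notification: earliest repeat.
next_flush_at $((30 + REPEAT_INTERVAL)) "$GROUP_WAIT" "$GROUP_INTERVAL"   # prints 14430
```

Under this model a repeat can only go out on a flush, which is why repeat_interval is best chosen as a multiple of group_interval.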

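6.2 Silencing alerts

Create silences from the command line with amtool. Unless the Alertmanager URL is set in an amtool configuration file, pass it with --alertmanager.url:

# Silence a specific alert for 2 hours
amtool silence add alertname="HighCpuUsage" instance="bwg-vps-01:9100" \
  --alertmanager.url=http://localhost:9093 \
  --duration=2h \
  --comment="Planned maintenance" \
  --author="admin"

# List active silences
amtool silence query --alertmanager.url=http://localhost:9093

# Expire a silence
amtool silence expire SILENCE_ID --alertmanager.url=http://localhost:9093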

7. Deploy with Docker

docker run -d \
  --name alertmanager \
  --restart unless-stopped \
  -p 9093:9093 \
  -v /opt/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml \
  -v alertmanager-data:/alertmanager \
  prom/alertmanager:v0.27.0 \
  --config.file=/etc/alertmanager/alertmanager.yml \
  --storage.path=/alertmanager

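8. Test alerts

Send a test alert with amtool:

# Send a test alert
amtool alert add test_alert severity=critical instance="test:9100" \
  --alertmanager.url=http://localhost:9093 \
  --annotation='summary="Test alert"' \
  --annotation='description="This is a test alert message"'

# List currently active alerts
amtool alert query --alertmanager.url=http://localhost:9093

# Alerts raised by Prometheus rules are also listed in the Prometheus web UI
# (alerts sent with amtool appear only in Alertmanager):
# http://localhost:9090/alerts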
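
amtool talks to Alertmanager over its HTTP API, so the same test alert can be sent with curl on machines where amtool is not installed. This is a sketch under assumptions: the v2 API is reachable at http://localhost:9093, and label and annotation values contain no double quotes or backslashes, because the JSON is built with printf and nothing is escaped. The helper names build_alert_json and post_alert are hypothetical.

```shell
# build_alert_json ALERTNAME SEVERITY INSTANCE SUMMARY
# Prints a JSON array holding one alert, in the shape accepted by /api/v2/alerts.
# Values are inserted verbatim and must not contain double quotes or backslashes.
build_alert_json() {
  printf '[{"labels":{"alertname":"%s","severity":"%s","instance":"%s"},"annotations":{"summary":"%s"}}]' \
    "$1" "$2" "$3" "$4"
}

# post_alert BASE_URL ALERTNAME SEVERITY INSTANCE SUMMARY
# Sends the alert to Alertmanager; fails on any non-2xx response.
post_alert() {
  _base=$1
  shift
  build_alert_json "$@" |
    curl -fsS -X POST -H 'Content-Type: application/json' --data-binary @- "$_base/api/v2/alerts"
}

# Print the payload without sending it:
build_alert_json test_alert critical test:9100 "Test alert"

# Example usage (not run here; needs a running Alertmanager):
#   post_alert http://localhost:9093 test_alert critical test:9100 "Test alert"
```

With the routing from the main configuration, an alert labelled severity=critical should reach the critical-email receiver once group_wait has passed.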

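9. Troubleshooting

Email notifications are not received

# Follow the Alertmanager logs
journalctl -u alertmanager -f

# Common causes: wrong SMTP password, blocked outbound port, mismatched TLS settings

Alerts repeat too often

Increase repeat_interval; 4h or longer is a reasonable value. Also check the for duration in the alerting rule: if it is too short, a condition that flaps around the threshold fires and resolves repeatedly.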

Summary

Alertmanager is the key component for automated alerting in a Prometheus monitoring stack. Combined with Prometheus alerting rules and Grafana alerting, it can form a multi-layered notification system.