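My company recently started working with these two tools, and around the same time I came across a map showing which alerting frameworks are popular in which regions. It turned out to be quite interesting: Asia mostly favors Zabbix, while Europe and the Americas lean toward Prometheus. I had tried Prometheus before without really understanding it; now I know the basics of setting it up. The hard part is writing your own query expressions for monitoring. With Zabbix, knowing shell is enough, whereas here, as far as I understand so far, everything is a query against an interface. Custom metrics are also possible: you just pass the value to the interface. Based on my experience so far, for personal use with a handful of servers I would try NETDATA instead; I set this stack up and my server (2 GB RAM / 1 core) could not handle it.
Overview
- I set everything up with Docker. Here is what each component does: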
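Component | Role | Monitored host (the machine being monitored) | Display host (data visualization) | Notes |
---|---|---|---|---|
Node Exporter | Collects host hardware and operating system information | YES | NO | Host information |
cAdvisor | Collects information about the containers running on the host | YES | NO | Docker metrics collection |
Prometheus Server | Main Prometheus monitoring server | NO | NO | Collects and stores the data from the two components above and serves it to Grafana; it can be installed on any machine. |
Grafana | Displays the Prometheus monitoring dashboards | NO | YES | Visualizes the data |
Alertmanager | Sends alerts | Optional | NO | Alerts can also be configured in Grafana, but Alertmanager works somewhat better |
Pushgateway | Custom alerts | Only for custom metrics | NO | Custom metrics |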
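- Pay attention to how the components relate to each other, which ports they use, and how they are configured (note that localhost inside a container cannot reach services running outside that container).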
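The localhost caveat is easy to miss in a scrape target list, so here is a minimal sketch of a check for it. The helper name `loopback_targets` and the sample address `10.0.0.5` are made up for illustration; the idea is simply to flag any `host:port` target that points at a loopback address, which from inside the Prometheus container would resolve to the container itself.

```python
from urllib.parse import urlsplit

# Host names and addresses that resolve to "this machine" (or this container).
LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


def loopback_targets(targets):
    """Return the targets whose host part is a loopback name or address."""
    flagged = []
    for target in targets:
        # Prometheus targets are written as host:port without a scheme,
        # so prepend "//" to let urlsplit parse the host part.
        host = urlsplit("//" + target).hostname
        if host in LOOPBACK_HOSTS:
            flagged.append(target)
    return flagged


# Hypothetical target list: one loopback entry, one routable entry.
print(loopback_targets(["localhost:80", "10.0.0.5:90"]))
```

Any target this returns probably needs to be replaced with the host's real IP address or a container name on a shared Docker network.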
Node Exporter
- Install
docker run -d -p 90:9100 \
-v "/proc:/host/proc" \
-v "/sys:/host/sys" \
-v "/:/rootfs" \
-v "/etc/localtime:/etc/localtime" \
--name=node-exporter \
prom/node-exporter
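Once the container is running, the exporter should serve plain-text metrics on the published port (90 in the command above). The following is a rough sketch for checking that from Python with only the standard library. The function names are my own, the parser is deliberately simplified (it assumes sample lines carry no trailing timestamp), and the sample text at the bottom is made up purely to show the parsing.

```python
import urllib.request


def fetch_metrics(url="http://localhost:90/metrics", timeout=5):
    """Download the raw metrics text from the exporter (needs a running exporter)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Map each series (name plus labels) to its float value.

    Simplified: skips comment lines and assumes "series value" with no timestamp.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue  # not a "series value" line; ignore it
    return samples


# Illustrative input only; real values come from fetch_metrics().
example = "# TYPE node_load1 gauge\nnode_load1 0.25\n"
print(parse_samples(example))
```

Against a live exporter, `parse_samples(fetch_metrics())` gives a quick way to confirm that host metrics are actually being exposed before pointing Prometheus at it.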
cAdvisor
- Install
docker run -d \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=80:8080 \
--detach=true \
--name=cadvisor \
-v "/etc/localtime:/etc/localtime" \
google/cadvisor:latest
Prometheus Server
- Prometheus configuration file
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  # - /etc/prometheus/alert_rules.yml
  - /etc/prometheus/alert_rules.yml
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      # Addresses to scrape
      - targets: ['localhost:80','localhost:90']
  - job_name: 'mail-base'
    static_configs:
      - targets: ['xxx.xxx.xxx.xxx:80','xxx.xxx.xxx.xxx:90']
  - job_name: 'mail-docker'
    static_configs:
      - targets: ['xxx.xxx.xxx.xxx:80','xxx.xxx.xxx.xxx:90']
- Alert rules file
groups:
  - name: ali
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
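The `for:` clause is what keeps a single failed scrape from sending an alert: the expression has to stay true for the whole duration before the alert fires. The sketch below is a simplified model of that behavior as I understand it, not Prometheus's actual implementation. The function name and the sample timeline are made up for illustration.

```python
def alert_states(samples, hold_seconds):
    """Model the state of one alert over a series of rule evaluations.

    samples: list of (timestamp_seconds, condition_is_true) pairs, in time order.
    hold_seconds: the `for:` duration converted to seconds.
    """
    active_since = None
    states = []
    for ts, is_true in samples:
        if not is_true:
            # The condition cleared, so the hold timer resets.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held_long_enough = ts - active_since >= hold_seconds
        states.append("firing" if held_long_enough else "pending")
    return states


# Hypothetical timeline for `up == 0` with `for: 5m` (300 seconds).
timeline = [(0, True), (150, True), (300, True), (315, False), (330, True)]
print(alert_states(timeline, 300))
```

In this model the alert stays pending until the condition has held for the full 300 seconds, and a single evaluation where the target is back up resets the timer.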
- Install
docker run -d \
-p 9090:9090 \
-v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /etc/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml \
--name prometheus \
prom/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--web.enable-lifecycle
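Because the container is started with `--web.enable-lifecycle`, Prometheus accepts an HTTP POST to `/-/reload` to re-read its configuration and rule files without a container restart. Here is a small sketch using only the standard library; the function names are my own, and the default base URL assumes the `-p 9090:9090` mapping from the command above.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Build (but do not send) the POST request for the reload endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code.

    Needs a running Prometheus with the lifecycle API enabled.
    """
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status


# Show what would be sent, without touching the network.
req = build_reload_request()
print(req.get_method(), req.full_url)
```

Calling `reload_prometheus()` after editing `prometheus.yml` or `alert_rules.yml` on the host should pick up the change, since both files are bind-mounted into the container.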
Grafana
- Create the data directory and grant permissions on it (Grafana will not start without them)
mkdir /etc/grafana
chmod 777 /etc/grafana
- Install
docker run -d \
-p 3000:3000 \
--name=grafana \
-v /etc/grafana:/var/lib/grafana \
grafana/grafana
Alertmanager
- Configuration file
global:
  resolve_timeout: 5m
  smtp_smarthost: 'xxxxxx.emperinter.info:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxxxx^'
  smtp_require_tls: false
route:
  receiver: team-test-mails
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
receivers:
  - name: 'team-test-mails'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
- Install
docker run -d -p 59093:9093 --name Alertmanager -v /etc/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml docker.io/prom/alertmanager:latest
Telegram alerts
This requires Alertmanager to be installed. Once everything is set up, remember to start the bot by sending it the /start command.
docker run -d \
-e 'ALERTMANAGER_URL=http://xxx.xxx.xxx.xxx:59093' \
-e 'BOLT_PATH=/data/bot.db' \
-e 'STORE=bolt' \
-e 'TELEGRAM_ADMIN=1234567' \
-e 'TELEGRAM_TOKEN=XXX' \
-v '/srv/monitoring/alertmanager-bot:/data' \
--name alertmanager-bot \
metalmatze/alertmanager-bot:0.4.3
Pushgateway custom alerts
Used for custom monitoring items.
Install
docker run -d --name pushgateway -p 59091:9091 --restart=always prom/pushgateway
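- After installing it, remember to add the following job to the Prometheus configuration file and restart Prometheus:
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['xxx.xxx.xxx.xxx:59091']
    honor_labels: true # Keep the job and instance labels of the pushed metrics instead of overwriting them with those of the Pushgateway target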
Custom metrics
- Common shell usage; once pushed, the data can be queried in Prometheus as docker_runtime
cat <<EOF | curl --data-binary @- http://127.0.0.1:59091/metrics/job/docker_runtime/instance/xa-lsr-billubuntu
# TYPE docker_runtime counter
docker_runtime{log="aa bb cc cadvisor"} 33
docker_runtime{log="nginx"} 331
docker_runtime{log="abc"} 332
EOF
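The heredoc above is just a body in the Prometheus text format, POSTed to a URL that encodes the job and instance. As a rough sketch, the same push can be built with the Python standard library alone. The helper names are my own, and the default gateway address and grouping labels simply mirror the curl command above.

```python
import urllib.request


def escape_label_value(value):
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def build_payload(metric, metric_type, samples):
    """Render samples as text-format lines.

    samples: list of (labels_dict, value) pairs.
    """
    lines = [f"# TYPE {metric} {metric_type}"]
    for labels, value in samples:
        label_text = ",".join(
            f'{name}="{escape_label_value(val)}"' for name, val in labels.items()
        )
        lines.append(f"{metric}{{{label_text}}} {value}")
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


def push(payload, job, instance, gateway="http://127.0.0.1:59091", timeout=5):
    """POST the payload to the Pushgateway (needs a running gateway)."""
    url = f"{gateway}/metrics/job/{job}/instance/{instance}"
    req = urllib.request.Request(url, data=payload.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


payload = build_payload("docker_runtime", "counter", [({"log": "nginx"}, 331)])
print(payload, end="")
```

With a gateway running, `push(payload, "docker_runtime", "xa-lsr-billubuntu")` would send the same kind of request as the curl command.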
- Python approach
#!/usr/bin/python3
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
registry = CollectorRegistry()
# Gauge(metric_name, help_text, label_names, registry=registry)
g = Gauge('ping', 'Maximum ping response time', ['dst_ip', 'city'], registry=registry)
g.labels('192.168.1.10', 'shenzhen').set(42.2)  # set() assigns a value
g.labels('192.168.1.11', 'shenzhen').dec(2)  # dec() decrements by 2
g.labels('192.168.1.12', 'shenzhen').inc()  # inc() increments, by 1 by default
# Push every metric in the registry to the Pushgateway under the job "ping_status"
push_to_gateway('localhost:59091', job='ping_status', registry=registry)