
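My company recently started working with these two tools, and I also came across a map showing which alerting frameworks are popular in which regions. It turned out to be quite interesting: Asia mostly runs Zabbix, while Europe and North America lean toward Prometheus. I had tried setting it up before without really understanding it; now I know the basics. The hard part is writing your own query expressions for monitoring. With Zabbix, knowing shell is enough; here, as far as I understand so far, everything is a query against an API, and custom metrics can be developed too by pushing values to that API. After using it for a while, my feeling is that for alerting on a few personal servers Netdata is still worth a try: I set it up on mine and the server (2 GB RAM / 1 core) could not handle it.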
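- I run everything in Docker, so first: what does each component do?
| Component | Role | Runs on the monitored host | Displays data | Notes |
|---|---|---|---|---|
| Node Exporter | Collects host hardware and operating system metrics | YES | NO | Host information |
| cAdvisor | Collects metrics about the containers running on the host | YES | NO | Docker metrics collection |
| Prometheus Server | The main Prometheus monitoring server | NO | NO | Scrapes and stores the data from the two components above and serves it to Grafana; it can be installed on any machine. |
| Grafana | Displays the Prometheus monitoring dashboards | NO | YES | Visualizes the data |
| Alertmanager | Sends alerts | Optional | NO | Alerts can also be configured in Grafana, but Alertmanager does it somewhat better |
| Pushgateway | Custom alert metrics | Only for custom metrics | NO | Custom |
- One thing to watch: how the components relate to each other, which port each one uses, and how each is configured (note that `localhost` inside a container does not reach anything outside that container).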
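The port mapping used by the `docker run` commands in this post can be summarized in a small Python sketch. The helper `scrape_targets` and the address `192.0.2.10` are hypothetical and only for illustration; the ports are the host ports published in this post. It assumes Prometheus itself runs in a container, so loopback addresses are rejected for the reason noted above.

```python
# Host ports published by the docker run commands in this post.
PUBLISHED_PORTS = {
    "cadvisor": 80,
    "node-exporter": 90,
    "grafana": 3000,
    "prometheus": 9090,
    "pushgateway": 59091,
    "alertmanager": 59093,
}

# Inside a container these names point at the container itself, not the host.
LOOPBACK_NAMES = {"localhost", "127.0.0.1", "::1"}


def scrape_targets(host, components=("cadvisor", "node-exporter")):
    """Build the `targets` list for one monitored host (hypothetical helper)."""
    if host in LOOPBACK_NAMES:
        raise ValueError(
            f"{host!r} resolves to the Prometheus container itself; "
            "use the host's real address instead"
        )
    return [f"{host}:{PUBLISHED_PORTS[name]}" for name in components]


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only example address.
    print(scrape_targets("192.0.2.10"))
```

The returned list can be pasted into a `static_configs` entry of the Prometheus configuration.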
- Install Node Exporter
docker run -d -p 90:9100 \
-v "/proc:/host/proc" \
-v "/sys:/host/sys" \
-v "/:/rootfs" \
-v "/etc/localtime:/etc/localtime" \
--name=node-exporter \
prom/node-exporter
- Install cAdvisor
docker run -d \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=80:8080 \
--detach=true \
--name=cadvisor \
-v "/etc/localtime:/etc/localtime" \
google/cadvisor:latest
- Prometheus configuration file (`/etc/prometheus/prometheus.yml`)
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  # - /etc/prometheus/alert_rules.yml
  - /etc/prometheus/alert_rules.yml
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      # Addresses to scrape
      - targets: ['localhost:80','localhost:90']
  - job_name: 'mail-base'
    static_configs:
      - targets: ['xxx.xxx.xxx.xxx:80','xxx.xxx.xxx.xxx:90']
  - job_name: 'mail-docker'
    static_configs:
      - targets: ['xxx.xxx.xxx.xxx:80','xxx.xxx.xxx.xxx:90']
- Alert rules file (`/etc/prometheus/alert_rules.yml`)
groups:
  - name: ali
    rules:
      # Alert for any instance that is unreachable for >5 minutes.
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      # Alert for any instance that has a median request latency >1s.
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
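The `InstanceDown` rule fires on the expression `up == 0`. The same expression can be checked by hand through Prometheus's HTTP query API (`/api/v1/query`). Below is a minimal Python sketch, assuming Prometheus is reachable on port 9090 as published in this post. The helper names and the sample response are made up for illustration; the sample only mirrors the shape of an instant-vector response.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, expr):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def down_instances(response):
    """Return (job, instance) pairs from an instant-vector query response."""
    if response.get("status") != "success":
        raise RuntimeError(f"query failed: {response}")
    return [
        (item["metric"].get("job", ""), item["metric"].get("instance", ""))
        for item in response["data"]["result"]
    ]


def fetch_down(base_url="http://127.0.0.1:9090"):
    """Ask a live Prometheus which targets are currently down."""
    url = query_url(base_url, "up == 0")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return down_instances(json.load(resp))


# Hand-written sample (not real data) in the shape the API returns.
SAMPLE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {"__name__": "up", "job": "mail-base", "instance": "192.0.2.10:90"},
                "value": [1700000000.0, "0"],
            }
        ],
    },
}

if __name__ == "__main__":
    print(query_url("http://127.0.0.1:9090", "up == 0"))
    print(down_instances(SAMPLE))
```

An empty list from `fetch_down()` means every scrape target is currently up.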
- Install Prometheus
docker run -d \
-p 9090:9090 \
-v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
-v /etc/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml \
--name prometheus \
prom/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--web.enable-lifecycle
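Because the container is started with `--web.enable-lifecycle`, Prometheus accepts a `POST` to `/-/reload` and re-reads its configuration and rule files without a container restart. A minimal Python sketch, assuming Prometheus listens on `127.0.0.1:9090` as published above; the function names are hypothetical:

```python
import urllib.request


def reload_request(base_url="http://127.0.0.1:9090"):
    """Build the POST request for the lifecycle reload endpoint."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_prometheus(base_url="http://127.0.0.1:9090"):
    """Send the reload request to a live Prometheus and return the HTTP status."""
    with urllib.request.urlopen(reload_request(base_url), timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    req = reload_request()
    # Only prints what would be sent; call reload_prometheus() to actually reload.
    print(req.get_method(), req.full_url)
```

This is handy after editing `prometheus.yml` or `alert_rules.yml`, since both files are bind-mounted from the host.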
- Create the Grafana data directory and open up its permissions (without this the container fails to start)
mkdir /etc/grafana
chmod 777 /etc/grafana
- Install Grafana
docker run -d \
-p 3000:3000 \
--name=grafana \
-v /etc/grafana:/var/lib/grafana \
grafana/grafana
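Once Grafana is running, Prometheus has to be added as a data source. This can be done in the web UI, or through Grafana's HTTP API (`POST /api/datasources`). The Python sketch below only builds the request. It assumes a fresh install with the default `admin`/`admin` login, and `192.0.2.10` is a placeholder for the real address of the Prometheus host (not `localhost`, since Grafana runs in its own container). The function name is hypothetical.

```python
import base64
import json
import urllib.request


def datasource_request(grafana_url, prometheus_url, user="admin", password="admin"):
    """Build the request that registers Prometheus as a Grafana data source."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend fetches the data, not the browser
        "isDefault": True,
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = datasource_request("http://127.0.0.1:3000", "http://192.0.2.10:9090")
    print(req.get_method(), req.full_url)
    print(req.data.decode())
    # To actually create the data source on a live Grafana:
    # urllib.request.urlopen(req, timeout=10)
```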
- Alertmanager configuration file (`/etc/prometheus/alertmanager.yml`)
global:
  resolve_timeout: 5m
  smtp_smarthost: 'xxxxxx.emperinter.info:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxxxx^'
  smtp_require_tls: false
route:
  receiver: team-test-mails
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m
receivers:
  - name: 'team-test-mails'
    email_configs:
      - to: '[email protected]'
        send_resolved: true
- Install Alertmanager
docker run -d -p 59093:9093 --name Alertmanager -v /etc/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml docker.io/prom/alertmanager:latest
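To check the mail route without waiting for a real outage, a hand-made alert can be posted to Alertmanager's v2 API (`POST /api/v2/alerts`). The Python sketch below builds such a request. It assumes Alertmanager is published on port 59093 as in the command above; the alert name `ManualTest`, the instance value, and the function names are invented for illustration.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(alertname="ManualTest", instance="example-host:90"):
    """Return a one-element alert list in the shape the v2 API expects."""
    return [
        {
            "labels": {
                "alertname": alertname,
                "instance": instance,
                "severity": "page",
            },
            "annotations": {"summary": f"Test alert for {instance}"},
            "startsAt": datetime.now(timezone.utc).isoformat(),
        }
    ]


def alert_request(base_url, alerts):
    """Build the POST request that hands the alerts to Alertmanager."""
    return urllib.request.Request(
        f"{base_url}/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = alert_request("http://127.0.0.1:59093", build_test_alert())
    print(req.get_method(), req.full_url)
    # To send it to a live Alertmanager:
    # urllib.request.urlopen(req, timeout=10)
```

With `group_wait: 30s` in the route, the notification should go out roughly half a minute after the alert is accepted.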
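- Install alertmanager-bot (Telegram notifications). It requires Alertmanager to be installed first; once the container is up, activate the bot by sending it the `/start` command.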
docker run -d \
-e 'ALERTMANAGER_URL=http://xxx.xxx.xxx.xxx:59093' \
-e 'BOLT_PATH=/data/bot.db' \
-e 'STORE=bolt' \
-e 'TELEGRAM_ADMIN=1234567' \
-e 'TELEGRAM_TOKEN=XXX' \
-v '/srv/monitoring/alertmanager-bot:/data' \
--name alertmanager-bot \
metalmatze/alertmanager-bot:0.4.3

- Install Pushgateway, which is used for custom alerting and monitoring items.
docker run -d --name pushgateway -p 59091:9091 --restart=always prom/pushgateway
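- After installing, add the following job to the Prometheus configuration file and restart Prometheus:
  - job_name: 'pushgateway'
    static_configs:
      - targets: ['xxx.xxx.xxx.xxx:59091']
    honor_labels: true # Keep the job and instance labels that clients pushed instead of overwriting them with the labels of the Pushgateway target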
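What `honor_labels` changes is how Prometheus resolves a clash between labels already present in the scraped data and the labels it attaches for the scrape target. With `false`, the scraped label is renamed to `exported_<name>`; with `true`, the scraped label wins. The Python sketch below is a simplified simulation of that rule, not Prometheus's actual code; `merge_labels` is a hypothetical name.

```python
def merge_labels(scraped, target, honor_labels):
    """Simulate how scraped labels and target labels are combined."""
    merged = dict(scraped)
    for name, value in target.items():
        if name not in scraped:
            # No conflict: the target label is simply attached.
            merged[name] = value
        elif not honor_labels:
            # Conflict, honor_labels=false: keep the pushed value under exported_<name>.
            merged["exported_" + name] = scraped[name]
            merged[name] = value
        # Conflict, honor_labels=true: the scraped (pushed) value is kept as-is.
    return merged


if __name__ == "__main__":
    pushed = {"job": "docker_runtime", "instance": "xa-lsr-billubuntu", "log": "nginx"}
    target = {"job": "pushgateway", "instance": "xxx.xxx.xxx.xxx:59091"}
    print(merge_labels(pushed, target, honor_labels=True))
    print(merge_labels(pushed, target, honor_labels=False))
```

For a Pushgateway job, `true` is usually what you want, so that queries can filter on the job and instance names the pushing script chose.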
- Typical shell usage; afterwards the data can be queried in Prometheus under the metric name `docker_runtime`
cat <<EOF | curl --data-binary @- http://127.0.0.1:59091/metrics/job/docker_runtime/instance/xa-lsr-billubuntu
# TYPE docker_runtime counter
docker_runtime{log="aa bb cc cadvisor"} 33
docker_runtime{log="nginx"} 331
docker_runtime{log="abc"} 332
EOF
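The `curl` call above sends a plain-text body in the Prometheus exposition format to a URL that encodes the job and instance. The same request can be assembled with only the Python standard library, which makes the two parts easier to see. This is a rough sketch; the function names are hypothetical and the values are the ones from the shell example.

```python
import urllib.parse
import urllib.request


def push_url(gateway, job, instance):
    """Build the Pushgateway URL that groups metrics by job and instance."""
    job = urllib.parse.quote(job, safe="")
    instance = urllib.parse.quote(instance, safe="")
    return f"http://{gateway}/metrics/job/{job}/instance/{instance}"


def escape_label(value):
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(metric, metric_type, samples):
    """Render (labels, value) samples as exposition-format text."""
    lines = [f"# TYPE {metric} {metric_type}"]
    for labels, value in samples:
        pairs = ",".join(f'{k}="{escape_label(v)}"' for k, v in labels.items())
        lines.append(f"{metric}{{{pairs}}} {value}")
    return "\n".join(lines) + "\n"  # the body must end with a newline


def push(gateway, job, instance, body):
    """POST the rendered body to a live Pushgateway and return the HTTP status."""
    req = urllib.request.Request(
        push_url(gateway, job, instance), data=body.encode(), method="POST"
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    body = render(
        "docker_runtime",
        "counter",
        [({"log": "aa bb cc cadvisor"}, 33), ({"log": "nginx"}, 331), ({"log": "abc"}, 332)],
    )
    print(push_url("127.0.0.1:59091", "docker_runtime", "xa-lsr-billubuntu"))
    print(body, end="")
    # To send it to a live Pushgateway:
    # push("127.0.0.1:59091", "docker_runtime", "xa-lsr-billubuntu", body)
```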


- Python approach
#!/usr/bin/python3
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
# Gauge(metric_name, help_text, label_names, registry=registry)
g = Gauge('ping', 'Maximum response time check', ['dst_ip', 'city'], registry=registry)
g.labels('192.168.1.10', 'shenzhen').set(42.2)  # set() assigns a value
g.labels('192.168.1.11', 'shenzhen').dec(2)     # dec() decrements, here by 2
g.labels('192.168.1.12', 'shenzhen').inc()      # inc() increments, by 1 by default
push_to_gateway('localhost:59091', job='ping_status', registry=registry)
