Prometheus and Grafana: A Complete Tutorial

This article was created (or last updated) on 09/12/2021, so please keep its age in mind!

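My company recently started using these two tools, and around the same time I came across a map showing which alerting frameworks are popular in which regions. It turned out to be quite interesting: Asia mostly favors Zabbix, while Europe and the United States use Prometheus more. I had tried Prometheus before without really understanding it; now I know the basics of setting it up. The harder part is writing your own query expressions for monitoring. With Zabbix, knowing shell is enough, whereas here, as far as I understand it so far, everything is a query against an API. Custom metrics are possible too: you just push the values to an endpoint. After using it for a while, my feeling is that for alerting on a handful of personal servers it is worth trying NETDATA instead. I tried this stack on a small server (2 GB / 1 core) and it could not handle the load.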

Overview

  • I run everything in Docker, so let's start with what each component does.
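| Component | Purpose | Monitored side (hosts being monitored) | Display side (data display) | Notes |
|---|---|---|---|---|
| Node Exporter | Collects host hardware and operating system metrics | YES | NO | Host information |
| cAdvisor | Collects information about the containers running on the host | YES | NO | Docker metrics collection |
| Prometheus Server | The main Prometheus monitoring server | NO | NO | Collects and stores the data from the two components above and serves it to Grafana; it can be installed on any machine. |
| Grafana | Displays the Prometheus monitoring dashboards | NO | YES | Visualizes the data |
| Alertmanager | Sends alerts | Optional | NO | Alerts can also be configured in Grafana, but Alertmanager does it somewhat better |
| Pushgateway | Custom alerts | Needed for custom metrics | NO | Custom |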
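  • Pay attention to how the components relate to each other, which ports they use, and how each is configured. In particular, localhost inside a container refers to the container itself, so it cannot reach services running outside that container.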
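
To keep the relationships and ports straight, here is a small Python sketch that lists the host ports published by the docker run commands in this article and builds the URL each component should answer on. The helper name endpoint_urls and the example IP are placeholders of mine, and the /metrics, /-/healthy and /api/health paths are the components' usual metrics or health endpoints rather than anything this article configures.

```python
# Host ports as published by the docker run commands in this article.
HOST_PORTS = {
    "node-exporter": (90, "/metrics"),
    "cadvisor": (80, "/metrics"),
    "prometheus": (9090, "/-/healthy"),
    "grafana": (3000, "/api/health"),
    "alertmanager": (59093, "/-/healthy"),
    "pushgateway": (59091, "/-/healthy"),
}


def endpoint_urls(host):
    """Return {component: url} for one machine running the containers."""
    if host in ("localhost", "127.0.0.1"):
        # Inside a container, localhost is the container itself, so these
        # URLs only make sense when the check runs on the Docker host.
        print("warning: localhost only works from the Docker host itself")
    return {name: f"http://{host}:{port}{path}"
            for name, (port, path) in HOST_PORTS.items()}


if __name__ == "__main__":
    # 192.0.2.10 is a documentation-only placeholder address.
    for name, url in endpoint_urls("192.0.2.10").items():
        print(f"{name:15s} {url}")
```

Opening each printed URL with curl or a browser is a quick way to see which container is unreachable before touching the Prometheus configuration.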

Node Exporter

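  • Install
# Mount the host's /proc, /sys and / into the container and point Node Exporter
# at them with the --path.* flags; without the flags the mounts are not used.
docker run -d -p 90:9100 \
-v "/proc:/host/proc" \
-v "/sys:/host/sys" \
-v "/:/rootfs" \
-v "/etc/localtime:/etc/localtime" \
--name=node-exporter \
prom/node-exporter \
--path.procfs=/host/proc \
--path.sysfs=/host/sys \
--path.rootfs=/rootfs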
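
Node Exporter serves its data as plain text on /metrics (host port 90 with the mapping above). As a rough check that it works, the sketch below parses that text format with the standard library only. parse_metrics and fetch_metrics are hypothetical helpers of mine, the parser assumes samples carry no trailing timestamp, and the sample lines are illustrative rather than captured output.

```python
import urllib.request


def parse_metrics(text):
    """Parse text exposition lines into (series, value) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        # The value is the last whitespace-separated field; label values may
        # themselves contain spaces, so split from the right.
        series, value = line.rsplit(None, 1)
        samples.append((series, float(value)))
    return samples


def fetch_metrics(url):
    """Download and parse a /metrics page (not called in the demo below)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


if __name__ == "__main__":
    sample = """
# TYPE node_load1 gauge
node_load1 0.42
node_filesystem_avail_bytes{mountpoint="/"} 1.5e+09
"""
    for series, value in parse_metrics(sample):
        print(series, value)
```

On the Docker host itself, fetch_metrics("http://localhost:90/metrics") should return a long list of node_* series if the container is healthy.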

cAdvisor

  • Install
# Note: google/cadvisor on Docker Hub is no longer updated; newer releases are
# published as gcr.io/cadvisor/cadvisor.
docker run -d \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=80:8080 \
--name=cadvisor \
-v "/etc/localtime:/etc/localtime" \
google/cadvisor:latest

Prometheus Server

  • Prometheus configuration file
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).


# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093


# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  # - /etc/prometheus/alert_rules.yml
  - /etc/prometheus/alert_rules.yml

# Scrape configurations: each job lists the endpoints Prometheus pulls metrics from.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    # Addresses to scrape: cAdvisor (80) and Node Exporter (90) on this machine
    - targets: ['localhost:80','localhost:90']

  - job_name: 'mail-base'
    static_configs:
    - targets: ['xxx.xxx.xxx.xxx:80','xxx.xxx.xxx.xxx:90']

  - job_name: 'mail-docker'
    static_configs:
    - targets: ['xxx.xxx.xxx.xxx:80','xxx.xxx.xxx.xxx:90']
  • Alert rules file
groups:
- name: ali
  rules:


  # Alert for any instance that is unreachable for >5 minutes.
  - alert: InstanceDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."


  # Alert for any instance that has a median request latency >1s.
  - alert: APIHighRequestLatency
    expr: api_http_request_latencies_second{quantile="0.5"} > 1
    for: 10m
    annotations:
      summary: "High request latency on {{ $labels.instance }}"
      description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
  • Install
docker run  -d \
-p 9090:9090 \
-v /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml  \
-v /etc/prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml \
--name prometheus \
prom/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--web.enable-lifecycle
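
Once the server is running, the InstanceDown expression can be checked by hand through Prometheus's HTTP API (GET /api/v1/query). The sketch below is only an illustration: query_prometheus and down_instances are names I made up, and the canned response mimics the API's instant-vector shape rather than real output from my servers.

```python
import json
import urllib.parse
import urllib.request


def query_prometheus(base_url, expr):
    """Run an instant query against /api/v1/query and return the decoded JSON."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def down_instances(response):
    """Return (job, instance) pairs whose `up` sample equals 0."""
    if response.get("status") != "success":
        raise ValueError("query failed: %r" % response.get("error"))
    down = []
    for item in response["data"]["result"]:
        _timestamp, value = item["value"]  # the value arrives as a string
        if float(value) == 0:
            labels = item["metric"]
            down.append((labels.get("job"), labels.get("instance")))
    return down


if __name__ == "__main__":
    # Hand-written response in the shape of an instant-vector result.
    canned = {
        "status": "success",
        "data": {"resultType": "vector", "result": [
            {"metric": {"job": "prometheus", "instance": "localhost:90"},
             "value": [1631400000, "1"]},
            {"metric": {"job": "mail-base", "instance": "192.0.2.10:90"},
             "value": [1631400000, "0"]},
        ]},
    }
    print(down_instances(canned))
    # Live: down_instances(query_prometheus("http://localhost:9090", "up"))
```

Any pair this returns is a target that the InstanceDown rule would flag once it has stayed down for the 5 minutes given in its for clause.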

Grafana

  • Create the data directory and grant permissions (the container will not start without them)
mkdir /etc/grafana
chmod 777 /etc/grafana
  • Install
docker run -d \
-p 3000:3000 \
--name=grafana \
-v /etc/grafana:/var/lib/grafana \
grafana/grafana
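
Grafana still has to be told where Prometheus lives. That is normally done in the web UI by adding a Prometheus data source, but the same step can be sketched against Grafana's HTTP API (POST /api/datasources). Everything specific here is an assumption: the helper names, the placeholder address, and the admin/admin login, which is Grafana's initial default and should be changed.

```python
import base64
import json
import urllib.request


def datasource_payload(prometheus_url, name="Prometheus"):
    """Build the JSON body for creating a Prometheus data source."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,  # must be reachable from the Grafana container
        "access": "proxy",      # Grafana's backend performs the queries
        "isDefault": True,
    }


def create_datasource(grafana_url, payload, user="admin", password="admin"):
    """POST the payload to Grafana (not called in the demo below)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{grafana_url}/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Basic {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # 192.0.2.10 is a placeholder; localhost would point at the Grafana container.
    print(json.dumps(datasource_payload("http://192.0.2.10:9090"), indent=2))
```

The URL has to be the address of the machine running Prometheus as seen from inside the Grafana container, which is the localhost caveat from the overview again.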

Alertmanager

  • Configuration file
global:
  resolve_timeout: 5m
  smtp_smarthost: 'xxxxxx.emperinter.info:465'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'xxxxxxxxxxx^'
  smtp_require_tls: false

route:
  receiver: team-test-mails
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 2m


receivers:
  - name: 'team-test-mails'
    email_configs:
    - to: '[email protected]'
      send_resolved: true
  • Install
docker run -d -p 59093:9093 --name Alertmanager -v /etc/prometheus/alertmanager.yml:/etc/alertmanager/alertmanager.yml docker.io/prom/alertmanager:latest
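
To check the mail route without waiting for a real outage, a hand-made alert can be posted to Alertmanager's v2 API (POST /api/v2/alerts). This is a sketch under assumptions: make_alert and send_alerts are hypothetical helpers, and the label values are made up to resemble the InstanceDown rule.

```python
import json
import urllib.request


def make_alert(alertname, instance, severity="page"):
    """Build one alert object in the shape /api/v2/alerts accepts."""
    return {
        "labels": {"alertname": alertname, "instance": instance,
                   "severity": severity},
        "annotations": {"summary": f"Instance {instance} down (manual test)"},
    }


def send_alerts(alertmanager_url, alerts):
    """POST a list of alerts to Alertmanager (not called in the demo below)."""
    req = urllib.request.Request(
        f"{alertmanager_url}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    alerts = [make_alert("InstanceDown", "192.0.2.10:90")]
    print(json.dumps(alerts, indent=2))
    # With the port mapping above: send_alerts("http://localhost:59093", alerts)
```

With group_wait: 30s, the first mail for a new alert group should go out roughly 30 seconds after the post, provided the SMTP settings are right.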

Telegram alerts

Alertmanager must be installed first. Once the container below is running, start the bot by sending it the /start command in Telegram.

docker run -d \
    -e 'ALERTMANAGER_URL=http://xxx.xxx.xxx.xxx:59093' \
    -e 'BOLT_PATH=/data/bot.db' \
    -e 'STORE=bolt' \
    -e 'TELEGRAM_ADMIN=1234567' \
    -e 'TELEGRAM_TOKEN=XXX' \
    -v '/srv/monitoring/alertmanager-bot:/data' \
    --name alertmanager-bot \
    metalmatze/alertmanager-bot:0.4.3

Pushgateway: custom alerts

Used to push custom metrics that you want to monitor and alert on.

Install

docker run -d --name pushgateway -p 59091:9091 --restart=always prom/pushgateway
  • After installing, remember to add this job to the Prometheus configuration file and restart Prometheus:
  - job_name: 'pushgateway'
    static_configs:
    - targets: ['xxx.xxx.xxx.xxx:59091']
    honor_labels: true        # keep the job/instance labels pushed with each metric instead of overwriting them with this scrape target's labels
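
What honor_labels changes is easiest to see on a single sample. The sketch below is a simplified model of Prometheus's label conflict handling, not its actual code; merge_labels and the label values are illustrative.

```python
def merge_labels(scraped, target, honor_labels):
    """Combine a scraped sample's labels with the scrape target's labels.

    honor_labels=True : on conflict the scraped (pushed) label wins.
    honor_labels=False: on conflict the target label wins and the scraped
                        one is kept under an "exported_" prefix.
    """
    merged = dict(scraped)
    for key, value in target.items():
        if key in scraped:
            if honor_labels:
                continue  # keep the pushed value untouched
            merged["exported_" + key] = scraped[key]
        merged[key] = value
    return merged


if __name__ == "__main__":
    pushed = {"job": "docker_runtime", "instance": "xa-lsr-billubuntu"}
    target = {"job": "pushgateway", "instance": "192.0.2.10:59091"}
    print(merge_labels(pushed, target, honor_labels=True))
    print(merge_labels(pushed, target, honor_labels=False))
```

With honor_labels: true, the series keeps the job and instance that were pushed, so a query such as docker_runtime{job="docker_runtime"} matches; without it, those values end up under exported_job and exported_instance.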

Pushing custom metrics

  • Typical shell usage; afterwards the data can be queried in Prometheus under the name docker_runtime
cat <<EOF | curl --data-binary @- http://127.0.0.1:59091/metrics/job/docker_runtime/instance/xa-lsr-billubuntu
    # TYPE docker_runtime counter
    docker_runtime{log="aa bb cc cadvisor"} 33
    docker_runtime{log="nginx"} 331
    docker_runtime{log="abc"} 332
EOF


  • Python usage
#!/usr/bin/python3
from prometheus_client import CollectorRegistry, Gauge, push_to_gateway
registry = CollectorRegistry()
# Gauge(metric_name, help_text, label_names, registry=registry)
g = Gauge('ping', 'Maximum response time', ['dst_ip','city'], registry=registry)
g.labels('192.168.1.10','shenzhen').set(42.2) # set() assigns a value
g.labels('192.168.1.11','shenzhen').dec(2)  # dec() decrements, here by 2
g.labels('192.168.1.12','shenzhen').inc()  # inc() increments, by 1 by default
push_to_gateway('localhost:59091', job='ping_status', registry=registry)
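
For hosts without prometheus_client installed, the shell example can be mirrored with the standard library alone. This is a sketch, assuming the same Pushgateway port mapping; push_url, escape_label, render_samples and push are names I made up, and the escaping follows the text exposition format (backslash, double quote, newline).

```python
import urllib.parse
import urllib.request


def push_url(gateway, job, instance):
    """Build the grouping-key URL: /metrics/job/<job>/instance/<instance>."""
    # Names containing "/" need Pushgateway's base64 form, not handled here.
    job = urllib.parse.quote(job, safe="")
    instance = urllib.parse.quote(instance, safe="")
    return f"http://{gateway}/metrics/job/{job}/instance/{instance}"


def escape_label(value):
    """Escape a label value for the text exposition format."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n"))


def render_samples(name, metric_type, samples):
    """Render {log_value: number} as exposition text with a `log` label."""
    lines = [f"# TYPE {name} {metric_type}"]
    for log, value in samples.items():
        lines.append(f'{name}{{log="{escape_label(log)}"}} {value}')
    return "\n".join(lines) + "\n"  # the body must end with a newline


def push(url, body):
    """POST the body to the Pushgateway (not called in the demo below)."""
    req = urllib.request.Request(url, data=body.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status


if __name__ == "__main__":
    body = render_samples("docker_runtime", "counter",
                          {"aa bb cc cadvisor": 33, "nginx": 331, "abc": 332})
    print(push_url("127.0.0.1:59091", "docker_runtime", "xa-lsr-billubuntu"))
    print(body, end="")
```

Calling push(url, body) with the two printed values sends the same request as the curl heredoc above.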

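Copyright

Unless otherwise stated, all works on this blog are licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. When reposting, please credit the source:
https://www.emperinter.info/2021/09/12/prometheus-and-grafana/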