Event-driven automation with Prometheus

Over the last few years, I have migrated all of my monitoring to Prometheus (plus Thanos). The automation code for triggering actions across my servers and network devices remains custom per project. I wanted to solve a long-pending problem with my home internet setup: detecting packet loss and acting on it. In the past, I did that with a custom script that ran its own tests and called the REST API of the home router (Mikrotik) to make the change. It was ugly, took a while to put in place and, honestly, was too hard to maintain. It also did its own monitoring and alerting alongside Prometheus and Blackbox exporter running on the same hardware.



Prometheus + Blackbox Exporter + Alertmanager + Ansible Semaphore + Ansible Playbook

Over the weekend I deployed a rather simple system. A PromQL rule checks packet loss averaged across different networks and, upon detection, sends an alert to Alertmanager with a custom label, type: semaphore.

Alertmanager has a custom receiver that points to a webhook endpoint of Ansible Semaphore. It simply calls that endpoint, which runs tasks from a pre-defined Ansible playbook.

The Ansible playbook uses community.routeros.api_find_and_modify to enable certain entries in an address list. The goal here is not to have Ansible write the whole config but simply to flip IPs in the address list on/off to trigger a change in routing via policy-based routing.


Prometheus alert rule

- alert: Packet loss on primary ISP, ISP auto switch
  expr: avg(min_over_time(probe_success{location="rtk",job="blacbox-isp1"}[1h])) <= 0.98
  for: 1m
  labels:
    severity: warning
    type: semaphore

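The expr works on probe_success, which is 1 when a probe succeeds and 0 when it fails. min_over_time(...[1h]) is therefore 0 for any target that failed at least one probe in the last hour and 1 otherwise. avg() then gives the share of targets with a clean hour, and the rule triggers when that share is 0.98 or lower, i.e. when more than 2% of targets have seen loss.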
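
To make the expression concrete, here is a small Python sketch of what avg(min_over_time(probe_success[1h])) computes, assuming probe_success is 1 for a successful probe and 0 for a failed one. The target names and samples are made up for illustration; this only mimics the arithmetic, not Prometheus itself.

```python
def min_over_time(samples):
    """Smallest sample in the window, like PromQL's min_over_time."""
    return min(samples)


def clean_share(windows):
    """Average of the per-target minimums, like avg(min_over_time(...))."""
    minimums = [min_over_time(samples) for samples in windows.values()]
    return sum(minimums) / len(minimums)


def should_fire(windows, threshold=0.98):
    """Mirror of the rule's comparison: fire when the share is <= threshold."""
    return clean_share(windows) <= threshold


# Hypothetical one-hour windows of probe_success samples (1 = ok, 0 = failed).
windows = {
    "target-a": [1, 1, 1, 1],
    "target-b": [1, 0, 1, 1],  # a single failed probe drags the minimum to 0
    "target-c": [1, 1, 1, 1],
    "target-d": [1, 1, 1, 1],
}

print(clean_share(windows))  # 0.75
print(should_fire(windows))  # True
```

With only a handful of targets, a single failed probe on any one of them is enough to push the average below 0.98.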


Why a minimum of 1hr?

Well, it’s because of a limitation: alerting can delay the firing of a rule (using “for”) but it cannot delay the sending of “resolved”. Hence, if I looked at a 5-minute average and triggered the change, the rollback could happen too quickly. I want to delay the rollback by an hour once there is no packet loss. So min_over_time takes only the minimum over the window (it does not average it), and the avg around it simply averages across all the probed IPs, which are essentially Google, Cloudflare, Contabo, AWS and a bunch of other nodes. The IPs are chosen to be well distributed so that I can detect selective packet loss on some long-haul path.

Here’s what the Alertmanager config looks like:
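
The rollback delay can be checked with a short Python sketch. It assumes one probe per minute (so 60 samples make up the 1h window) and uses a made-up series for a single target; the real scrape interval may differ.

```python
WINDOW = 60  # samples per window; assumes one probe per minute, so 60 = 1h


def rolling_min(series, window=WINDOW):
    """min_over_time evaluated at every sample over the trailing window."""
    return [
        min(series[max(0, i - window + 1): i + 1])
        for i in range(len(series))
    ]


# Hypothetical target: healthy, fails during minutes 10-14, then healthy again.
series = [1] * 10 + [0] * 5 + [1] * 120
mins = rolling_min(series)

# Index of the last failed probe, and the first later minute whose
# trailing window no longer contains any failure.
last_failure = max(i for i, value in enumerate(series) if value == 0)
first_clean = next(i for i in range(last_failure, len(mins)) if mins[i] == 1)

print(last_failure, first_clean, first_clean - last_failure)  # 14 74 60
```

The minimum only returns to 1 a full window after the last failed probe, which is what holds the alert (and therefore the ISP switch) in place for an hour.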

routes:  
  - match:
      type: semaphore
    continue: true
    receiver: semaphore

receivers:
  - name: 'semaphore'
    webhook_configs:
      - url: 'https://semaphore-wekhook-endpoint'  # Switch to ISP 2 when called
        send_resolved: true

Logic for rollback

I wanted the system to roll back automatically. With the above rule, the system keeps an eye on packet loss, and once packet loss has stayed below the 2% level for an hour (minimum over time…), Alertmanager sends a resolved notification to the same endpoint, thanks to send_resolved. Alertmanager by design sends a JSON payload in the webhook, which looks roughly like this:

{
  "status": "firing",
  "labels": {
    "alertname": "Packet loss on primary ISP, ISP auto switch",
    "severity": "warning",
    "type": "semaphore"
  }
}

In the same way, when the issue is resolved, it sends status: “resolved”. Ansible Semaphore webhooks can be configured to extract this status into a variable, and the same variable can be used in the Ansible task.

I capture the status in a variable named “status”, which is passed along to the playbook.


Ansible playbook sample:

# Enable the address-list entry so policy-based routing sends the LAN via ISP 2.
- name: Move production LAN to ISP 2
  community.routeros.api_find_and_modify:
    hostname: "{{ hostname }}"
    username: "{{ username }}"
    password: "{{ password }}"
    path: ip firewall address-list
    find:
      .id: "*7D"  # internal RouterOS ID of the address-list entry
    values:
      disabled: "no"
  when: status == "firing"

# Disable the same entry again so traffic returns to ISP 1.
- name: Move production LAN back to ISP 1
  community.routeros.api_find_and_modify:
    hostname: "{{ hostname }}"
    username: "{{ username }}"
    password: "{{ password }}"
    path: ip firewall address-list
    find:
      .id: "*7D"  # same entry as above
    values:
      disabled: "yes"
  when: status == "resolved"
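
The two tasks differ only in the value written to disabled, chosen by the status that arrives in the webhook. As a rough illustration of that decision (not how Semaphore or Ansible implement it), the mapping can be sketched in Python. The function name and the handling of unknown statuses are my own assumptions.

```python
import json

# Maps the alert status to the "disabled" value of the address-list entry:
# firing   -> entry enabled  (LAN moves to ISP 2)
# resolved -> entry disabled (LAN moves back to ISP 1)
DISABLED_FOR_STATUS = {"firing": "no", "resolved": "yes"}


def disabled_value(raw_body):
    """Return the value to write, or None if the status is not recognised."""
    payload = json.loads(raw_body)
    status = payload.get("status")
    return DISABLED_FOR_STATUS.get(status)


print(disabled_value('{"status": "firing"}'))    # no
print(disabled_value('{"status": "resolved"}'))  # yes
```

Any other status returns None, which corresponds to neither task running in the playbook.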

Overall, it would be nice if the webhook trigger were not mapped to a specific playbook and the playbook name were passed in the JSON payload instead. That would allow a single endpoint/receiver for all of these actions. Hopefully this takes care of moving traffic and keeps the network smooth. 😀
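
As a thought experiment, the single-endpoint idea could look something like the Python sketch below. Everything here is hypothetical: the “playbook” label, the playbook file names and the allow-list are my own inventions, and this is not an existing Ansible Semaphore feature. It assumes the simplified payload shape shown earlier, with labels at the top level.

```python
import json

# Hypothetical allow-list so an alert cannot request an arbitrary playbook.
ALLOWED_PLAYBOOKS = {"isp-failover.yml", "restart-vpn.yml"}


def pick_playbook(raw_body):
    """Return (playbook, status) requested by the payload.

    Raises ValueError when the playbook label is missing or not allowed.
    """
    payload = json.loads(raw_body)
    playbook = payload.get("labels", {}).get("playbook")
    if playbook not in ALLOWED_PLAYBOOKS:
        raise ValueError(f"playbook not allowed: {playbook!r}")
    return playbook, payload.get("status")


body = json.dumps({
    "status": "firing",
    "labels": {"type": "semaphore", "playbook": "isp-failover.yml"},
})
print(pick_playbook(body))  # ('isp-failover.yml', 'firing')
```

The allow-list matters in a design like this, since the endpoint would otherwise run whatever name an alert label carries.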