Self-hosting a website with high availability

Overview

Over the years I have moved this blog from shared hosting to a VPS, then to containers inside a VPS and lastly to (large) cloud hosting. Since moving to Hugo, I hosted it on Google Cloud’s Firebase Hosting. While it’s a Google Cloud product, Firebase uses Fastly for its CDN and hence “anuragbhatia.com” was pointed at Fastly’s CDN.

Overall it was a good setup; I barely got any bill, as traffic is usually low enough not to hit the free usage caps. For static content, I used a mix of Google Cloud object storage (for photos) and Backblaze (everything else, especially the heavier files). For a while I had wanted to get back to self-hosting, though the CDN effect of low latency and no single point of failure is surely attractive. But honestly, the “distributed” nature of the internet is more fun. More and more content getting aggregated with a handful of players is already causing issues, and that will only increase over time. I still very much like colocation as well as the 2nd- or 3rd-tier cloud players (however we classify them: power, datacentre size, bandwidth etc.) who don’t charge massive egress fees and give excellent value for money, instead of offering massive redundancy at a steep price.


New setup

(Latency check source: Bunny CDN)

I have five servers (a mix of VMs and physical servers) spread across Rohtak (India), Mumbai (India), Nuremberg (Germany) and Auburn (US). Each of them now serves my website (anuragbhatia.com) as well as the static content of this blog (served via cdn.anuragbhatia.com). While it looks like a CDN, each node holds the full content and acts as an origin; after all, my site is light enough to allow that. I came across this idea when I looked at fly.io. Tech-wise it’s nice to just run micro-VMs with local storage; I just found it a little expensive for fun/personal hobby projects. Due to the lack of nodes in Oceania and Africa, latency is a little high from those regions.



Mapping logic

The mapping of users to nearby servers can be done via BGP (anycast), DNS or a mix of both. I don’t want to run anycast BGP because it doesn’t make sense at such a small scale: the cost of maintaining an ASN, an IPv4 pool and upstreams that support BGP adds up in itself. I did run BGP at home as well as at some of the VM locations in the past. It’s fun to do initially, but it’s easier to just live on the provider’s address space, and DNS is much simpler that way. I run my own DNS servers at most of these locations with the PowerDNS authoritative server. It supports a GeoIP backend along with LUA records; however, the documentation does not have much info about it and lacks config-level examples. Earlier this month I saw this post - Building a Self-Hosted CDN for BSD Cafe Media - which gave me the idea that I can even run multiple backends in PowerDNS. I tried it with my existing SQLite backend and it worked on the primary but not on the rest of the servers. I tried pretty hard with no success: no errors, no debug info in the logs. I ended up posting an issue on the PowerDNS GitHub discussions page and one of the maintainers (Peter van Dijk) was kind enough to solve the mystery. It was DNSSEC. LUA records for dynamic replies do not work in PowerDNS when zone transfers are used, as the secondaries were failing to sign the zone. I moved my test records into a sub-zone which wasn’t signed and it immediately worked. I then planned a move back to the MySQL backend with MySQL replication (instead of zone transfers), and with that I had DNSSEC working fine along with GeoIP-based records.
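For illustration, a GeoIP-based LUA record can be as simple as handing out the listed origin closest to the querying resolver. A minimal sketch in zone-file notation, with placeholder addresses (pickclosest is a documented PowerDNS LUA record function that uses the loaded MaxMind database; with the gmysql backend the same record content simply lives in the records table):

; Return the listed origin nearest to the client, based on the GeoIP lookup
cdn.anuragbhatia.com.  300  IN  LUA  A  "pickclosest({'192.0.2.10', '198.51.100.10', '203.0.113.10'})"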


Software layer

I am using the following to serve this content:

  1. PowerDNS with MySQL + GeoIP backend + MaxMind GeoLite2 database
  2. NGINX Proxy Manager at each of the serving locations to terminate TCP 80/443
  3. Garage for hosting the S3 backends
  4. Monitoring: Prometheus + Thanos + mysqld exporter + Blackbox exporter
  5. GitLab CI/CD pipeline
  6. Hugo

How does it all work?

At the core of it all is GitLab. I write posts locally on my system, test them locally, and once it’s all done, push to my GitLab project. This triggers a pipeline which takes the raw markdown files and generates the HTML for the website (just the “hugo” command in the backend). The generated code is then pushed to each of the origin locations.
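The pipeline itself can stay tiny. A rough sketch of what such a .gitlab-ci.yml could look like; the image, hostnames and paths below are illustrative placeholders, not my actual setup:

stages:
  - build
  - deploy

build_site:
  stage: build
  image: hugomods/hugo:latest          # placeholder; any image with Hugo available works
  script:
    - hugo --minify                    # builds the static site into ./public
  artifacts:
    paths:
      - public

deploy_origins:
  stage: deploy
  image: alpine:latest
  before_script:
    - apk add --no-cache rsync openssh-client
  script:
    - |
      # hypothetical origin hostnames; one rsync per serving location
      for host in origin1.example.net origin2.example.net origin3.example.net; do
        rsync -az --delete public/ "deploy@${host}:/var/www/anuragbhatia.com/"
      done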

When visitors access the website, their location is determined from the source IP of the DNS resolver they are using (or the EDNS Client Subnet, if available) and they are given the A/AAAA record of the nearest node. Visitors then pull the content and it’s served from that location via NGINX.
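One way to see the mapping in action is to query an authoritative server directly while faking the client location with EDNS Client Subnet. The prefixes below are just examples; pick prefixes from the regions you want to test:

# Query the same authoritative server as if the client sat in two different regions;
# the returned A record should differ based on the GeoIP lookup.
dig +short anuragbhatia.com A @144.91.67.7 +subnet=185.1.1.0/24
dig +short anuragbhatia.com A @144.91.67.7 +subnet=103.1.1.0/24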


High availability and health checks

PowerDNS LUA records can run health checks, such as verifying that port 443 is open, to confirm that a node is up. I have configured it to check port 443 (for both A and AAAA records) every 5 seconds, and it stops returning a specific record if one of the nodes/NGINX instances/servers/datacentres goes down. This helps to immediately mitigate a visible outage.
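A minimal sketch of such a record, again in zone-file notation with placeholder addresses (ifportup and its selector option are documented LUA record features; the 5-second interval comes from lua-health-checks-interval in the config below):

; Only return addresses whose TCP/443 is up, preferring the one closest to the client
anuragbhatia.com.  300  IN  LUA  A     "ifportup(443, {'192.0.2.10', '198.51.100.10', '203.0.113.10'}, {selector='pickclosest'})"
anuragbhatia.com.  300  IN  LUA  AAAA  "ifportup(443, {'2001:db8:1::10', '2001:db8:2::10', '2001:db8:3::10'}, {selector='pickclosest'})"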

Sample PowerDNS config from the primary:

launch=gmysql,geoip
local-address=144.91.67.7, 2a02:c207:2022:2769::1

### Start of MySQL config
gmysql-host=<private>
gmysql-port=<private>
gmysql-dbname=<private>
gmysql-user=<private>
gmysql-password=<private>
gmysql-dnssec=yes
### End of MySQL config

primary=yes
logging-facility=0
log-dns-queries=yes
loglevel=5
api=yes
api-key=<private>
webserver=yes
webserver-address=<private>
webserver-allow-from=<private>
webserver-loglevel=normal
webserver-port=<private>
enable-lua-records=yes
lua-health-checks-interval=5
geoip-database-files=mmdb:/usr/share/GeoIP/GeoLite2-City.mmdb
edns-subnet-processing=yes

Secondaries have a very similar config, since all of them read from their local MySQL database, which is replicated from the primary to the secondaries.
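The replication underneath is plain MySQL source/replica replication. A rough sketch of pointing a secondary’s local MySQL at the primary, using MySQL 8 syntax; the hostname and user are placeholders and GTID-based replication is assumed:

-- On each secondary: replicate the pdns database from the primary, then let the
-- local PowerDNS read from this replica.
CHANGE REPLICATION SOURCE TO
  SOURCE_HOST = 'primary.example.net',
  SOURCE_USER = 'repl',
  SOURCE_PASSWORD = '<private>',
  SOURCE_AUTO_POSITION = 1;
START REPLICA;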


For detailed health monitoring, I am using Blackbox exporter HTTP checks with some custom modules. This lets me test 2 protocols (HTTP/HTTPS) x 2 address families (IPv4/IPv6) against each server. The checks keep probing each server over HTTP to verify that it returns the expected HTTP 301 redirect to HTTPS, and over HTTPS to verify that it returns the expected HTTP 200 response, besides ensuring port 443 has a valid certificate etc.

The Blackbox exporter modules I have put in place:

http_probe_abcdcweb_ipv4:
  prober: http
  http:
    method: GET
    headers:
      Host: "anuragbhatia.com"
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    no_follow_redirects: true
    valid_status_codes: [301]
    fail_if_ssl: true
    # fail_if_not_ssl: false
    preferred_ip_protocol: "ip4"

http_probe_abcdcweb_ipv6:
  prober: http
  http:
    method: GET
    headers:
      Host: "anuragbhatia.com"
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    no_follow_redirects: true
    valid_status_codes: [301]
    fail_if_ssl: true
    # fail_if_not_ssl: false
    preferred_ip_protocol: "ip6"

https_probe_abcdcweb_ipv4:
  prober: http
  http:
    method: GET
    headers:
      Host: "anuragbhatia.com"
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    # fail_if_ssl: true
    fail_if_not_ssl: true
    preferred_ip_protocol: "ip4"

https_probe_abcdcweb_ipv6:
  prober: http
  http:
    method: GET
    headers:
      Host: "anuragbhatia.com"
    valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
    valid_status_codes: [200]
    # fail_if_ssl: true
    fail_if_not_ssl: true
    preferred_ip_protocol: "ip6"

Sample Prometheus config:

- job_name: 'http_probe_abcdcweb_ipv4'
  scrape_interval: 30s
  metrics_path: /probe
  params:
    module: [http_probe_abcdcweb_ipv4]
  static_configs:
    - targets:
        - http://<ipv4-location1>/index.html
        - http://<ipv4-location2>/index.html
        - http://<ipv4-location3>/index.html
        - http://<ipv4-location4>/index.html
        - http://<ipv4-location5>/index.html
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: target
    - target_label: __address__
      replacement: <private-blackbox-exporter-endpoint-to-probe>:9115
  

Sample alert rule for Alertmanager:

- alert: CDN_PoP_healthcheck_failing_anuragbhatia_com
  expr: probe_success{job=~"http_probe_abcdcweb_ipv4|http_probe_abcdcweb_ipv6|https_probe_abcdcweb_ipv4|https_probe_abcdcweb_ipv6"} != 1
  for: 1m
  annotations:
    title: CDN PoP health check failing for anuragbhatia.com
    description: 'CDN PoP health check failing for anuragbhatia.com. The issue seems to be with the {{ $labels.job }} check done against {{ $labels.target }}'
  labels:
    severity: 'critical'

More about Thanos and Prometheus in some later posts. That’s another area I have recently spent a bit of time on, and it’s a super interesting toolset in itself.


Which node do we hit?

I added a custom location header to the reply from each node, using airport codes (except for Rohtak, where I use “rtk”).
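Under the hood this is just an NGINX add_header directive dropped into the custom config of each node; the value differs per node, for example on the Rohtak server:

# Advertise which PoP served this request
add_header origin-location rtk always;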

anurag@anurag-desktop ~> curl -s -I https://anuragbhatia.com | grep location
origin-location: rtk
anurag@anurag-desktop ~> 

With the hope that you hit one of these five nodes without a 404 error or a connection timeout when you access this post, it’s time to do the git commit & push! 😀