Added new post and tweaked the front page a bit 2024-08-28

content/posts/highly-available-services.md

+++
title = "Highly available services"
date = 2024-08-27T23:31:00-05:00
tags = ["homelab", "pihole", "dns", "keepalived", "reverse_proxy", "npm", "high_availability"]
+++

One of the biggest upsides to the recent homelab rebuild I've done is that it gave me the ability to roll out high availability features for my services. Sadly, not all of the services I run support it out of the box, so I kinda had to roll my own. I could've easily implemented it using k8s, or even k3s, but I'd rather learn that stuff properly rather than figuring it out on the fly around my current use cases and end up half-assing the setup. Not that there's anything wrong with that, since that's how I got my start with this homelab.

## Setting up keepalived

The backbone of this setup is `keepalived`. It can also do load balancing, so that requests are distributed evenly among the nodes, but that presents its own issue when it comes to keeping data in sync, so I'll be doing a high availability setup instead. It's very simple: we set up 2 separate virtual machines, each with their own IP, and we reserve a virtual IP address on the same network as those 2 VMs. One is the master, the other is the backup.

By default, the master node holds the virtual IP. If it fails to respond for whatever reason, the backup node takes over and starts answering on that virtual IP. Clients just need to be pointed at the virtual IP, and they're none the wiser when a failover actually happens.

To start, let's install `keepalived` on the master node. I've used DigitalOcean's [article](https://www.digitalocean.com/community/questions/navigating-high-availability-with-keepalived) as a guide:

```
sudo apt update && sudo apt upgrade -y
sudo apt install keepalived -y
```

Then, we'll configure `keepalived` by creating `/etc/keepalived/keepalived.conf`:

```
vrrp_instance VI_1 {
    interface eth0            # Change to your active network interface, e.g., ens33
    state MASTER              # Change to "BACKUP" for backup nodes
    virtual_router_id 51      # A unique number [1-255] for this VRRP instance
    priority 100              # 100 for master, 50 for backup
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mysecretpass    # A password for authentication; must be the same on all servers
    }
    virtual_ipaddress {
        192.168.1.10          # The virtual IP address shared between master and backup
    }
}
```

Next, we'll start the `keepalived` service and enable it so it comes up on boot:

```
sudo systemctl start keepalived
sudo systemctl enable keepalived
```

Now just do the same thing on the backup node(s), keeping in mind the comments so you swap out the important bits. If you have more than one backup node, set their priorities differently depending on which one you'd like to take over first. If I'm not mistaken, if they're all the same priority then it acts as a load balancer.

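For reference, a backup node's config ends up looking almost identical to the master's, just with the state and priority flipped (and the same caveats about your interface name and IPs apply):

```
vrrp_instance VI_1 {
    interface eth0                # Again, change to your active network interface
    state BACKUP                  # This node waits for the master to fail
    virtual_router_id 51          # Must match the master's virtual_router_id
    priority 50                   # Lower than the master's 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mysecretpass    # Must match the master
    }
    virtual_ipaddress {
        192.168.1.10              # Same virtual IP as the master
    }
}
```
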
### Testing

From a Windows machine, try continuously pinging the virtual IP with `ping -t VIRTUAL_IP_HERE`. Then take down the master node, either by shutting it down or by stopping the `keepalived` service with `sudo systemctl stop keepalived`. You should see pings to the virtual IP stop and fail briefly, then resume. Only if you take down the backup node(s) as well should pings to the virtual IP fail for good.

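If you'd rather check from the nodes themselves, something like this (assuming `eth0` is your interface) shows which box currently holds the virtual IP:

```
# The virtual IP shows up under the interface of whichever node currently holds it
ip addr show eth0

# Simulate a failure on the master; the backup should pick the VIP up within a second or two
sudo systemctl stop keepalived
```
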
## Keeping nodes in sync

Perhaps the most challenging part of this setup is keeping the data between the two nodes (or rather, the persistent data of the containers running on the two nodes) in sync. If your containers save all the important bits to a database and you've got no persistent or mounted volumes to worry about, then perfect! Your job ends here. Unfortunately, several of my services do persist files that I need to keep in sync. Here's where my jank setup comes into play:

- Set up SMB shares on all the backup nodes
- Mount said SMB shares on the master node
- Run the backup script I've created every X minutes (rough sketch after this list):
  - The backup script pauses the container, compresses and archives the persistent volumes, copies the output to each of the mounted SMB shares on the master node, and then unpauses the container
- Run the restore script every X minutes:
  - The restore script stops the container, finds the latest backup file on the SMB share, extracts the archive and copies its contents to the persistent volume location, and then starts the container

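For the curious, here's a rough sketch of what the backup script does. The container name, data directory, and mount points below are placeholders, not my actual values:

```
#!/usr/bin/env bash
# Rough sketch of the backup half; the restore script mirrors this in reverse
# (stop container, grab the newest archive from the share, extract, start container).
# CONTAINER, DATA_DIR, and SHARES are placeholders.
set -euo pipefail

CONTAINER="npm"                         # container whose persistent data we're backing up
DATA_DIR="/opt/npm/data"                # persistent volume location on the master
SHARES=("/mnt/backup-node1")            # SMB shares mounted from the backup node(s)
STAMP="$(date +%Y%m%d-%H%M%S)"
ARCHIVE="/tmp/${CONTAINER}-${STAMP}.tar.gz"

docker pause "$CONTAINER"               # pause so files aren't changing mid-archive
tar -czf "$ARCHIVE" -C "$DATA_DIR" .    # compress and archive the persistent volume

for share in "${SHARES[@]}"; do         # copy the archive to every mounted share
    cp "$ARCHIVE" "$share/"
done

docker unpause "$CONTAINER"             # resume the container
rm -f "$ARCHIVE"
```
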
So far, it works! Well... I've only really set it up for Nginx Proxy Manager. The only other highly available service I'm running where I have to worry about syncing data is Pihole, and thank goodness somebody else has already made a solution for that: [mattwebbio/orbital-sync](https://github.com/mattwebbio/orbital-sync). The previous community favourite, Gravity Sync, has unfortunately been retired and will no longer be supported, so I'm thankful I didn't have to roll my own solution here. The bonus part? It uses Pihole's built-in API to keep the nodes in sync. No need for the SMB share shenanigans I came up with!

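If you're curious what running it looks like, it's basically a single container. Heads up: the environment variable names below are from my reading of the project's README and may not be exact for the current version, so double-check the repo before copying anything:

```
docker run -d --name orbital-sync \
  -e PRIMARY_HOST_BASE_URL="http://192.168.1.11" \
  -e PRIMARY_HOST_PASSWORD="pihole_admin_password" \
  -e SECONDARY_HOSTS_1_BASE_URL="http://192.168.1.12" \
  -e SECONDARY_HOSTS_1_PASSWORD="pihole_admin_password" \
  -e INTERVAL_MINUTES=30 \
  mattwebbio/orbital-sync:latest
```
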
## The pitfalls and future

Now, while I did say it's been working out, it's not without its pitfalls. First off, I was 75% of the way through writing the restore script when it dawned on me... this is basically just a carbon copy of the backup script with a few things changed! There's no reason I couldn't combine the two into a single sync script or whatever. So that's refactoring work for me at a later time :roll_eyes:

Second, in a scenario where the backup node needs to act as the master for the entire cluster for an extended period of time, I'd have no way of syncing its data back to the original master node. I could very well change things around so that when this happens, the backup gets promoted to master and the other node becomes the backup instead. For that to happen, I'll need some sort of mechanism to detect when a failover has happened and trigger that switch, and it would also necessitate creating an SMB share on the master node. What a headache. I'd like to stay away from SMB shares altogether whenever possible.

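As for detecting when a failover has happened, one avenue I'm considering is keepalived's own notify hooks, which run a script whenever a node changes state. Something like this (the script paths are made up) could kick off that promotion logic:

```
vrrp_instance VI_1 {
    # ...the rest of the config from above...
    notify_master "/usr/local/bin/promote-to-master.sh"   # runs when this node takes over the VIP
    notify_backup "/usr/local/bin/demote-to-backup.sh"    # runs when this node returns to backup
}
```
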
There is another approach I've considered, which is to have a third-party share that both nodes can read from and write to. I initially thought about going this route, but I wasn't sure how it would fare with file locks, for one, and I didn't want to introduce another dependency into the chain. If the machine hosting that share goes down for whatever reason, or becomes inaccessible from both nodes, then the whole highly available setup suddenly becomes a not-available setup.

Lots of things to ponder, for sure. I'll try to keep you posted on where this project goes! For now, this setup is better than nothing, so I'll be rolling it out for the rest of my services (save for databases) and go from there!