Disaster Recovery, or how I messed up this weekend

Intro: A serving of humble pie

First of all, I must admit I messed up this weekend. It started small, with an issue in my parents’ PiHole. The fix itself was not that complicated, but it required upgrading the Raspberry Pi Zero to Raspbian Bullseye. After the upgrade, I performed several follow-up tasks. Autossh, the VPN, I set all of that up with no problem. Throughout the process, I was pivoting through one of my private servers in the cloud.

This is where I messed up. Somewhere in the process, I ran apt update/upgrade on my own VPS, not on the Pi. Afterwards, once I had rebooted the Pi one last time and everything worked, I thought, “Hey, a reboot might do my server good!” I punched ‘sudo reboot’ into the SSH session, saw the disconnection message, and waited for ping to announce my server coming back.

Little did I know that this IP would never ping again.

Now for the humble pie: being the super-leet hacker and systems administrator that I am (not), I didn’t keep a backup. No. Backup. I had no configs saved, no user profiles, not even a list of installed packages. Nada. I wanted to kick myself in the face, but my tendons said no. So the only way forward was to repair the server, or failing that, reinitialize it and rebuild.

Repairing a cloud server - The GRUB issue

Luckily, my VPS provider offers KVM over IP. I logged into my portal, opened the KVM console, and basically had physical access. The GRUB rescue prompt was waiting for me. I had been here before, just like in the simulations. The error would be easy to find online, the fix would take 20 minutes, tops. Up and running before dinner. (Foreshadowing over.)

symbol grub_disk_native_sectors not found

Okay, an error! Let’s google this! So many threads; one of them must have a quick and easy answer!

Well, this one talks about booting from USB and fixing it. I don’t really have access to the rack, so no.

Oh, this one has no answers and is from 2014. Well, let’s try another one!

No, not that… USB and reinstall grub… not that either.

I was getting desperate. This was out of my hands, quite literally, since the server was not in my physical hands. I also had no live USB access from the server provider. It was time to ring up the people in charge.

After several messages exchanged with the support technician (helpful as hell, I am very grateful), it was obvious that my server tier did not support live USB booting. The next tier up would have that, but as it stood, they could not let me boot something like GrubRescue and get my stuff up and running. The best they could do was export my disk as an OVF file, which I could download from an FTP server they set up for me.

The next thing to do was spin up a VirtualBox VM with that file as the disk and GrubRescue as the live ISO. This worked: the GRUB entry was there and booted fine. In a few minutes, I was at a login prompt for my transplanted server.
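
If you ever need to do the same, it boils down to something like the following sketch. The file and VM names are placeholders, and the storage controller name varies per VM (check ‘VBoxManage showvminfo’ first):

# import the exported appliance; the VM name comes from the OVF
VBoxManage import server.ovf
# attach a rescue ISO as a DVD drive (controller name is an assumption)
VBoxManage storageattach "server" --storagectl "IDE" --port 1 --device 0 --type dvddrive --medium rescue.iso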

At this point, I tried reinstalling GRUB and rebooting. The error was still there. In total, I tried about four times, but after about an hour of the reboot-reinstall-reboot-error loop, I gave up and went for the backup recovery plan: save everything, spin up a different server, and move all my configs to it. So I spun up a new server and got to transferring.
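
For the record, the reinstall attempts were the usual chroot dance from the rescue environment; a sketch, with device and partition names that are assumptions about the layout:

# mount the system and pivot into it
mount /dev/sda1 /mnt
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt
# inside the chroot: reinstall GRUB to the disk and regenerate its config
grub-install /dev/sda
update-grub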

Cloning server from VirtualBox to prod

I spun up the new server without issues (no wonder; it was just a matter of selecting an OS, setting a password, and so on). After it was up, it took some convincing to get my VM an IP that could reach the new server. The main issue was that my VM was still configured with the public IP from the server provider. The command that released that public IP from the eth0 interface was as follows:

dhclient -r -v eth0
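
The -r flag releases the current lease (-v just makes it verbose); running the client again without it then requests a fresh lease from whatever DHCP server the VM can now see:

dhclient -v eth0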

After that, getting a LAN IP for my VM was easy. I set up an SSH key to the new server (so I didn’t have to type a password every time I sent something over), and then the main issue became the what. What did I want to transfer first?
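
(As an aside, the key setup is just the standard OpenSSH pair of commands; the hostname here is a placeholder:)

ssh-keygen -t ed25519
ssh-copy-id user@new-server.example.org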

The list was quite simple to name, but not as easy to implement:

  • All previously installed programs
  • Blog (content is easy, I have it in my personal device)
  • VPN server
  • XMPP server
  • Security settings

This was the shortlist; the other stuff was not that important, since I could not even think of it. Still, now I had things to actually move. The first thing to move was the set of installed applications. I had not known about tools like backup2l back then, but now that I do, I can highly recommend it.
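
Without a proper backup, the closest thing to restoring “all previously installed programs” on a Debian-based system is to dump the package selections on the clone and replay them on the new server; a minimal sketch:

# on the VirtualBox clone: dump what was installed
dpkg --get-selections > packages.list
# on the new server, after copying packages.list over:
sudo dpkg --set-selections < packages.list
sudo apt-get dselect-upgrade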

The /etc folder contained most of what I needed, but the XMPP server stores its data elsewhere. It took several hours to hand-pick the things I needed, transfer them, and reconfigure them for the new IPs. The VPN was the biggest snag: the VPN configuration I used had several standalone text files specifying the public IP. These I hunted down with a simple grep, but it took time to figure out that this was the issue at all.
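
Something along these lines, where both addresses are placeholders from the documentation ranges:

# list every file under /etc still mentioning the old public IP
grep -rl '203.0.113.10' /etc
# then swap old for new in the files grep found
grep -rl '203.0.113.10' /etc | xargs sed -i 's/203\.0\.113\.10/198.51.100.20/g'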

I updated the IPs and ran the “up” command. Nothing. Down, up. Nothing. I checked whether my VPN had an interface. It did. My clients could connect, but the VPN did not route packets onwards to the internet.

At this point, I would like to say this: Fuck the new naming schema for Linux interfaces. eth0 was fine. What the fuck is enp22s5201abcd?! I want the old naming schema back.
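
(If you share the sentiment: the predictable-names scheme can be switched off with a kernel parameter, at the cost of names possibly shuffling when hardware changes. Roughly:)

# in /etc/default/grub, then run update-grub and reboot
GRUB_CMDLINE_LINUX="net.ifnames=0 biosdevname=0"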

It took further searching to figure out the series of iptables commands (the ‘up’ command for my VPN). I changed all occurrences of eth0 to the new schema, and lo and behold, the VPN was working!
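
The rules in question were of the usual NAT-and-forwarding shape; a sketch with assumed names (tun0 for the tunnel, enp1s0 for the uplink, 10.8.0.0/24 for the VPN subnet):

# let the kernel forward packets at all
sysctl -w net.ipv4.ip_forward=1
# masquerade VPN traffic leaving through the uplink
iptables -t nat -A POSTROUTING -s 10.8.0.0/24 -o enp1s0 -j MASQUERADE
# allow forwarding between the tunnel and the uplink
iptables -A FORWARD -i tun0 -o enp1s0 -j ACCEPT
iptables -A FORWARD -i enp1s0 -o tun0 -m state --state ESTABLISHED,RELATED -j ACCEPT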

Well, I don’t want to drag this out, but most of the services I wanted to clone went similarly: install, hit issue 1, fix issue 1, hit issue 2, fix issue 2, and so on.

Lessons learned: Backups, automated and announced

Now that my server is up and running again (as you can see), I had to think about the future. I looked at homebrew solutions built on rsync, scp, and the like. All of these, however, seemed too much setup work to be worth it. What I stumbled upon instead was backup2l, which has been serving me quite well.

backup2l takes a simple list of the folders you want to back up. You can put in any folders (or hell, all of them!) that you don’t want to lose. What it does next, however, is the nice part that saves me scripting: pre-run and post-run hooks.

In the pre-run hook, you can call any scripts you may have prepared, or take databases offline if you need to. I don’t. What I do need is to dump the list of all installed packages. That list gets popped into a file that then gets transferred with everything else; the transfer itself is specified in the post-run hook.
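
In config terms, it looks roughly like this. This is an excerpt-style sketch of /etc/backup2l.conf; the folder list is made up, and the variable and hook names follow backup2l’s sample config, so double-check against the one shipped with your version:

# folders worth keeping (a made-up selection)
SRCLIST="/etc /home /var/lib/prosody"

PRE_BACKUP ()
{
    # dump the package list into a location that gets backed up
    dpkg --get-selections > /etc/packages.list
}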

I set up my post-run commands to do two things, sketched after the list:

  • Send the backed up data to my home server (where I can store the data off-site)
  • Let me know that the backup ran and transferred via a message (XMPP, in this case)
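
A sketch of that hook, with a made-up home-server address and paths, and assuming sendxmpp is installed for the messaging part:

POST_BACKUP ()
{
    # checksums with relative names, so they can be verified on the other side
    cd /var/backup && md5sum all* > checksums.md5
    # ship archives and checksums off-site
    scp /var/backup/all* /var/backup/checksums.md5 backup@home.example.org:/srv/backups/vps/
    # announce the run over XMPP
    echo "backup run finished and transferred" | sendxmpp me@example.org
}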

Now the backup runs and lets me know after it is done sending. To make sure it really did, I set up a different script on the home server (running as the same user) which checks the checksums of the backed-up files and just sends me an “ALL BACKUPS OKAY” if the sums match.
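
The receiving side is then a couple of lines (same assumptions as above):

cd /srv/backups/vps
md5sum -c checksums.md5 && echo "ALL BACKUPS OKAY" | sendxmpp me@example.org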

In closing

If you are reading this and thinking, “Oh, this would never happen to me, I won’t do bad updates,” trust me, you will. Maybe not my issue, maybe not tomorrow, but one day, we all miss one tiny little thing. If you have backups set up, good. If you have them set up and working, even better. But make sure that once you read this, you at least think about these things. It does not have to be complicated. I don’t consider my server worthy of a full 3-2-1 setup; for me, it’s enough to have the backup off-site in a place I control. It’s the cloud, I don’t have physical access, so I do the best I can.

To be honest, this is the first time this has happened in, what, seven years? Not a bad track record, in my opinion. Shameful, sure, but I’ve learned my lesson, and hopefully this post will give you motivation to do your backups right, and maybe some leverage the next time your boss tells you it’s okay to go without backups.