Manual restore with Velero using S3 and restic
Let me set the scene.
You have a cluster, you back it up using Velero to a local S3 installation (MinIO), and to remote offsite backups.
The cluster runs fine-ish, until you do an upgrade that ends up wiping all local storage (this was entirely my fault, not the storage’s fault).
You check out your restore documentation and work towards starting a Velero restore, only to find out that your local S3 installation was deleted during that move you started last week but didn’t quite finish. You then check out your offsite backups, which were rsync’ed copies of the aforementioned S3 installation, only to find out that the rsync is either inconsistent or just as bad as the S3 installation you moved away from (which had other issues, which again were my fault).
You start to give up, realising that all those files are just gone, and/or inconsistent.
## Offsite-backups to the rescue
I then realized I had another S3 installation offsite, from back when I experimented with using Velero to write to different S3 installations at the same time. The idea was that using rsync’ed S3 buckets wasn’t all that great (go figure), so I wanted a different copy, from the same source.
I managed to sneakernet the files from the original bucket back to my onsite MinIO instance, and pointed Velero at the dedicated bucket, aptly named “velero-offsite”, using Helm against the Velero chart. This is accomplished with:
```yaml
- name: velero-offsite
  bucket: "velero-offsite"
  default: false
  provider: aws
  accessMode: ReadOnly
  config:
    region: "${VELERO_S3_REGION}"
    s3ForcePathStyle: true
    s3Url: "${S3_URL}"
    publicUrl: "${VELERO_S3_PUBLIC_URL}"
```
As of Helm chart release 5.0.2, this is defined as a list under configuration.backupStorageLocation.
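For context, here is a minimal sketch of where that list item sits in the chart values. Everything except the velero-offsite entry is an assumption based on the chart’s documented layout (the “default” location in particular is made up for illustration), so double-check against your chart version:

```yaml
# values.yaml (sketch) - only the velero-offsite entry is taken from the snippet above
configuration:
  backupStorageLocation:
    - name: default            # assumed: your existing, primary location
      bucket: "velero"
      default: true
      provider: aws
      config:
        region: "${VELERO_S3_REGION}"
        s3ForcePathStyle: true
        s3Url: "${S3_URL}"
    - name: velero-offsite     # the read-only import of the offsite bucket
      bucket: "velero-offsite"
      default: false
      provider: aws
      accessMode: ReadOnly
      config:
        region: "${VELERO_S3_REGION}"
        s3ForcePathStyle: true
        s3Url: "${S3_URL}"
        publicUrl: "${VELERO_S3_PUBLIC_URL}"
```

Applied with something like helm upgrade velero vmware-tanzu/velero -n velero -f values.yaml (release name and repo alias assumed).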
After import, Velero listed all those remote backups (all three of them…), but as “PartiallyFailed” (there was probably more than one reason I gave up on that project).
I tried running a normal restore with Velero. That failed: it never actually restored anything from restic, just all the k8s resources I could just as easily reproduce with Flux…
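For reference, that “normal restore” was nothing fancier than the standard CLI flow, roughly like this (backup name taken from the listing further down; yours will differ):

```
$ velero restore create --from-backup velero-daily-offsite-backup-20230724231041
$ velero restore describe <restore-name>
$ velero restore logs <restore-name>
```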
One key difference between this offsite backup and the rsync’ed copy of the onsite backup was the size of the folders under the /restic folder in the bucket (where those folders are the actual namespaces from the originating k8s cluster). This one had a reasonably accurate size compared to what I expected, whereas the rsync’ed copy reported a folder size of ~125K.
## Digging through the internet for an answer
Having determined that this bucket actually had the files I wanted, I went searching for an answer. Using restic directly against the files in the bucket was worthless:
```
# restic ls latest -r .
enter password for repository:
Load(<key/a7424b783b>, 0, 0) returned error, retrying after 552.330144ms: read keys/a7424b783b93bf66c9036982766365e9cb1aa41b698d069c0879473a94d0574a: is a directory
[...]
```
(btw, Velero uses a static password of static-passw0rd - might not be that safe, but I’m sure as hell glad they had something I could find!)
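If you would rather confirm that password than trust a blog post, Velero keeps it in a secret in its own namespace. The exact secret name depends on the Velero version (velero-repo-credentials on newer releases, velero-restic-credentials on older ones), so treat this as a sketch:

```
$ kubectl -n velero get secrets | grep -i -e repo -e restic
$ kubectl -n velero get secret velero-repo-credentials \
    -o jsonpath='{.data.repository-password}' | base64 -d
```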
Much digging later, I got a clue when I found people who had tried to restore single files from a restic repo created by Velero, and, more importantly, they reported success! (1) (2)
BUT HOW DID THEY DO IT?!
This is how I eventually figured it all out:
- velero-offsite is consistently used as the name for the LOCAL bucket, which originated from the offsite-location.
- velero is the namespace Velero is installed into.
1. Have Velero access the buckets as described above.
2. Ensure Velero can actually read the buckets and enumerate the backups in there:
```
$ velero get backup | grep velero-offsite
velero-daily-offsite-backup-20230724231041 PartiallyFailed 66 0 2023-07-24 23:10:41 +0200 CEST 19d ago velero-offsite <none>
```
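If the backup shows up, velero backup describe with --details is a handy sanity check that the pod volume (restic) backups were actually imported along with it:

```
$ velero backup describe velero-daily-offsite-backup-20230724231041 --details
```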
3. As part of that backup-enumerate-import-thingy, Velero will also import a set of PodVolumeBackup resources. Find those:
```
$ kubectl -n velero get PodVolumeBackup | grep velero-offsite
velero-daily-offsite-backup-20230710231009-w5blz Completed 63d archivebox archivebox-7b7bf94fd4-rl5hm config s3:http://10.0.2.11:9000/velero-offsite/restic/archivebox restic offsite 107m
[...]
```
4. Figure out the repoIdentifier and the snapshotID of the given repo.
```
$ kubectl -n velero get PodVolumeBackup velero-daily-offsite-backup-20230710231009-w5blz -oyaml | grep -e repoIdentifier -e snapshotID
repoIdentifier: s3:http://10.0.2.11:9000/velero-offsite/restic/archivebox
snapshotID: 4779cab4
```
The S3 address will be the address of the previous (offsite) location, and you need to ensure it’s correct. I had to switch the IP to the local MinIO instance; you probably do too.
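A small sketch of pulling the repoIdentifier straight out of the PodVolumeBackup and swapping in the local endpoint. The field path and the bash substitution are assumptions; the IPs are simply the ones from this walkthrough (10.0.2.11 was the offsite host, 10.0.1.78 is the local MinIO):

```
$ REPO=$(kubectl -n velero get PodVolumeBackup velero-daily-offsite-backup-20230710231009-w5blz \
    -o jsonpath='{.spec.repoIdentifier}')
$ REPO=${REPO/10.0.2.11/10.0.1.78}   # point the repo at the local MinIO instead of the old offsite host
$ echo $REPO
s3:http://10.0.1.78:9000/velero-offsite/restic/archivebox
```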
5. This part might differ for you. I use Talos, so I have to do some tricks. First I have to find the node running the pod I am going to restore to.
```
$ kubectl -n archivebox get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
archivebox-7b6c66c695-mlnpk 1/1 Running 0 57m 10.244.0.22 rand <none> <none>
```
rand is a node, with IP 10.0.1.66.
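If you would rather skip -o wide, a jsonpath one-liner gives the same answer:

```
$ kubectl -n archivebox get pod archivebox-7b6c66c695-mlnpk \
    -o jsonpath='{.spec.nodeName}{"\t"}{.status.hostIP}{"\n"}'
rand    10.0.1.66
```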
6. Now we need to find where that Pod is mounted under the host_path.
```
$ for i in $(talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/ | sed 's/10.0.1.66 //g' | grep -v -e NODE); do echo $i; talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/$i/volumes/kubernetes.io~csi/; done > archivebox
```
(probably not the most elegant, but I was tired and it worked.)
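The same loop, slightly tidier, assuming talosctl keeps printing the NODE/NAME columns seen in the output further down:

```
# awk: skip the header row and the "." entry, keep only the pod directory names
$ for i in $(talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/ | awk 'NR>1 && $2 != "." {print $2}'); do
    echo "$i"
    talosctl -n 10.0.1.66 ls "/var/lib/kubelet/pods/$i/volumes/kubernetes.io~csi/"
  done > archivebox
```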
7. Find the ID of the PVC.
```
$ kubectl -n archivebox get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
archivebox Bound pvc-40109437-464e-4617-8f93-ae3794d3ba0f 20Gi RWX ceph-filesystem 51m
```
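Or, if you only want the volume name:

```
$ kubectl -n archivebox get pvc archivebox -o jsonpath='{.spec.volumeName}{"\n"}'
pvc-40109437-464e-4617-8f93-ae3794d3ba0f
```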
8. Grep the output from step 6 for the PVC ID found in step 7.
```
$ grep -B5 pvc-40109437-464e-4617-8f93-ae3794d3ba0f archivebox
10.0.1.66 .
10.0.1.66 pvc-04e281b8-3039-42b9-bc88-143d38f1cb49
afa37c6c-34e9-40b7-bc27-3f791e44bae0
NODE NAME
10.0.1.66 .
10.0.1.66 pvc-40109437-464e-4617-8f93-ae3794d3ba0f
```
You have now found the pod ID (afa37c6c-34e9-40b7-bc27-3f791e44bae0).
9. Almost there… Find the node-agent running on the same host.
```
$ kubectl -n velero get pods -o wide | grep rand
node-agent-k8q9v 1/1 Running 0 126m 10.244.0.16 rand <none> <none>
```
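Instead of grepping, the API can filter by node directly. The name=node-agent label is an assumption about how the chart labels the DaemonSet pods, so check with --show-labels if it comes back empty:

```
$ kubectl -n velero get pods -l name=node-agent \
    --field-selector spec.nodeName=rand -o name
pod/node-agent-k8q9v
```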
10. As a verification, the following should list the contents of the PVC you are about to restore to:
```
$ talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/<POD-ID-FOUND-IN-8>/volumes/kubernetes.io~csi/<PVC-ID-FOUND-IN-7>/mount/
$ talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount/
NODE NAME
10.0.1.66 .
10.0.1.66 archivebox
```
11. If all that looks familiar, we are ready to restore:
```
$ kubectl -n velero exec -it <NODE-AGENT-FOUND-IN-9> -- restic restore <SNAPSHOT-ID-FOUND-IN-4> -r s3:http://10.0.1.78:9000/velero-offsite/restic/archivebox --target /host_pods/<POD-ID-FOUND-IN-8>/volumes/kubernetes.io~csi/<PVC-ID-FOUND-IN-7>/mount/restore/
$ kubectl -n velero exec -it node-agent-k8q9v -- restic restore 4779cab4 -r s3:http://10.0.1.78:9000/velero-offsite/restic/archivebox --target /host_pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount/restore/
enter password for repository:
repository 16397b86 opened (version 2, compression level auto)
restoring <Snapshot 4779cab4 of [/host_pods/2e2ed9ef-20f6-43bd-8844-cddd4f5580ca/volumes/kubernetes.io~csi/pvc-bdeb5be2-8ea9-491f-9bbe-4f58de1de88a/mount] at 2023-07-10 23:58:31.344943183 +0200 CEST by root@velero> to /host_pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount/restore/
```
/host_pods/ is where the Velero node-agent mounts /var/lib/kubelet/pods.
The password is still static-passw0rd.
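One caveat worth spelling out, as an assumption based on how restic restore normally behaves rather than something taken from this run: restic recreates the snapshot’s original path underneath --target, so the data most likely ends up nested under restore/host_pods/<old-pod-id>/…/mount/ and still has to be moved to the root of the PVC. A sketch, using the IDs from this walkthrough:

```
# Check where the files actually landed before moving anything
$ NEW=/host_pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount
$ OLD=host_pods/2e2ed9ef-20f6-43bd-8844-cddd4f5580ca/volumes/kubernetes.io~csi/pvc-bdeb5be2-8ea9-491f-9bbe-4f58de1de88a/mount
$ kubectl -n velero exec -it node-agent-k8q9v -- ls "$NEW/restore/$OLD/"
$ kubectl -n velero exec -it node-agent-k8q9v -- sh -c "mv $NEW/restore/$OLD/* $NEW/ && rm -rf $NEW/restore"
```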
Marvelous.
## Lessons learned
- Don’t back up a backup.
- Multiple backups from the same source is the key.
- If that’s not possible, I would switch tools until it is.
- Probably ensure backups work before I do upgrades, but that’s probably not going to happen…
Sources:
(1): https://github.com/vmware-tanzu/velero/issues/1210
(2): https://github.com/vmware-tanzu/velero/discussions/5860