Manual restore with Velero using S3 and restic
Let me set the scene.
You have a cluster, you back it up using Velero to a local S3 installation (MinIO), and to remote offsite backups.
The cluster runs fine-ish, until you do an upgrade that ends up wiping all local storage (this was entirely my fault, not the storage’s fault).
You check out your restore documentation and work towards starting a Velero restore, only to find out that your local S3 installation was deleted during that move you started last week but didn’t quite finish. You then check out your offsite backups, which were rsync’ed copies of the aforementioned S3 installation, only to find out that the rsync is either inconsistent or just as bad as the S3 installation you moved away from (which had other issues, which again were my fault).
You start to give up, realising that all those files are just gone, and/or inconsistent.
## Offsite-backups to the rescue
I then realized I had another S3 installation offsite, from back when I experimented with using Velero to write to different S3 installations at the same time. The idea was that using rsync’ed S3 buckets wasn’t all that great (go figure), so I wanted a different copy, from the same source.
I managed to sneakernet the files from the original bucket back to my onsite MinIO instance, and pointed Velero at the dedicated bucket, aptly named “velero-offsite”, using Helm against the Velero chart. This is accomplished with:
```yaml
- name: velero-offsite
  bucket: "velero-offsite"
  default: false
  provider: aws
  accessMode: ReadOnly
  config:
    region: "${VELERO_S3_REGION}"
    s3ForcePathStyle: true
    s3Url: "${S3_URL}"
    publicUrl: "${VELERO_S3_PUBLIC_URL}"
```
As of Helm chart release 5.0.2, this is defined as a list under configuration.backupStorageLocation.
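For context, here is a minimal sketch of where that list item sits in the chart values. Everything except the velero-offsite entry is an assumption based on the chart’s documented layout (the “default” location in particular is made up for illustration), so double-check against your chart version:

```yaml
# values.yaml (sketch) - only the velero-offsite entry is taken from the snippet above
configuration:
  backupStorageLocation:
    - name: default            # assumed: your existing, primary location
      bucket: "velero"
      default: true
      provider: aws
      config:
        region: "${VELERO_S3_REGION}"
        s3ForcePathStyle: true
        s3Url: "${S3_URL}"
    - name: velero-offsite     # the read-only import of the offsite bucket
      bucket: "velero-offsite"
      default: false
      provider: aws
      accessMode: ReadOnly
      config:
        region: "${VELERO_S3_REGION}"
        s3ForcePathStyle: true
        s3Url: "${S3_URL}"
        publicUrl: "${VELERO_S3_PUBLIC_URL}"
```

Applied with something like helm upgrade velero vmware-tanzu/velero -n velero -f values.yaml (release name and repo alias assumed).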
After import, Velero listed all those remote backups (all three of them…), but as “PartiallyFailed” (there was probably more than one reason I gave up on that project).
I tried running a normal restore with Velero. That failed: it never actually restored anything from restic, just all the k8s resources I could just as easily reproduce with Flux…
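For reference, that “normal restore” was nothing fancier than the standard CLI flow, roughly like this (backup name taken from the listing further down; yours will differ):

```
$ velero restore create --from-backup velero-daily-offsite-backup-20230724231041
$ velero restore describe <restore-name>
$ velero restore logs <restore-name>
```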
One key difference between this offsite backup and the rsync’ed copy of the onsite backup was the size of the folders under the /restic folder in the bucket (where those folders are the actual namespaces from the originating k8s cluster). This one had a reasonably accurate size compared to what I expected, whereas the rsync’ed copy reported a folder size of ~125K.
## Digging through the internet for an answer
Having determined that this bucket actually had the files I wanted, I went searching for an answer. Using restic directly against the files in the bucket was worthless:
```
# restic ls latest -r .
enter password for repository:
Load(<key/a7424b783b>, 0, 0) returned error, retrying after 552.330144ms: read keys/a7424b783b93bf66c9036982766365e9cb1aa41b698d069c0879473a94d0574a: is a directory
[...]
```
(btw, Velero uses a static password of static-passw0rd - might not be that safe, but I’m sure as hell glad they had something I could find!)
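If you would rather confirm that password than trust a blog post, Velero keeps it in a secret in its own namespace. The exact secret name depends on the Velero version (velero-repo-credentials on newer releases, velero-restic-credentials on older ones), so treat this as a sketch:

```
$ kubectl -n velero get secrets | grep -i -e repo -e restic
$ kubectl -n velero get secret velero-repo-credentials \
    -o jsonpath='{.data.repository-password}' | base64 -d
```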
Much digging later, I got a clue when I found people who had tried to restore single files from a restic repo created by Velero, and, more importantly, they reported success! (1) (2)
BUT HOW DID THEY DO IT?!
This is how I eventually figured it all out:
- velero-offsite is consistently used as the name for the LOCAL bucket, which originated from the offsite-location.
- velero is the namespace Velero is installed into.
1. Have Velero access the buckets as described above.
2. Ensure Velero can actually read the buckets and enumerate the backups in there:
```
$ velero get backup | grep velero-offsite
velero-daily-offsite-backup-20230724231041 PartiallyFailed 66 0 2023-07-24 23:10:41 +0200 CEST 19d ago velero-offsite <none>
```
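If the backup shows up, velero backup describe with --details is a handy sanity check that the pod volume (restic) backups were actually imported along with it:

```
$ velero backup describe velero-daily-offsite-backup-20230724231041 --details
```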
3. As part of that backup-enumerate-import-thingy, Velero will also import a set of PodVolumeBackup resources. Find those:
```
$ kubectl -n velero get PodVolumeBackup | grep velero-offsite
velero-daily-offsite-backup-20230710231009-w5blz Completed 63d archivebox archivebox-7b7bf94fd4-rl5hm config s3:http://10.0.2.11:9000/velero-offsite/restic/archivebox restic offsite 107m
[...]
```
4. Figure out the repoIdentifier and the snapshotID of the given repo.
```
$ kubectl -n velero get PodVolumeBackup velero-daily-offsite-backup-20230710231009-w5blz -oyaml | grep -e repoIdentifier -e snapshotID
repoIdentifier: s3:http://10.0.2.11:9000/velero-offsite/restic/archivebox
snapshotID: 4779cab4
```
The S3 address will be the address of the previous (offsite) location, and you need to ensure it’s correct. I had to switch the IP to the local MinIO instance; you probably do too.
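A small sketch of pulling the repoIdentifier straight out of the PodVolumeBackup and swapping in the local endpoint. The field path and the bash substitution are assumptions; the IPs are simply the ones from this walkthrough (10.0.2.11 was the offsite host, 10.0.1.78 is the local MinIO):

```
$ REPO=$(kubectl -n velero get PodVolumeBackup velero-daily-offsite-backup-20230710231009-w5blz \
    -o jsonpath='{.spec.repoIdentifier}')
$ REPO=${REPO/10.0.2.11/10.0.1.78}   # point the repo at the local MinIO instead of the old offsite host
$ echo $REPO
s3:http://10.0.1.78:9000/velero-offsite/restic/archivebox
```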
5. This part might differ for you. I use Talos, so I have to do some tricks. First I have to find the node running the pod I am going to restore to.
```
$ kubectl -n archivebox get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
archivebox-7b6c66c695-mlnpk 1/1 Running 0 57m 10.244.0.22 rand <none> <none>
```
rand is a node, with IP 10.0.1.66.
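If you would rather skip -o wide, a jsonpath one-liner gives the same answer:

```
$ kubectl -n archivebox get pod archivebox-7b6c66c695-mlnpk \
    -o jsonpath='{.spec.nodeName}{"\t"}{.status.hostIP}{"\n"}'
rand    10.0.1.66
```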
6. Now we need to find where that Pod is mounted under the host_path.
```
$ for i in $(talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/ | sed 's/10.0.1.66 //g' | grep -v -e NODE); do echo $i; talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/$i/volumes/kubernetes.io~csi/; done > archivebox
```
(probably not the most elegant, but I was tired and it worked.)
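The same loop, slightly tidier, assuming talosctl keeps printing the NODE/NAME columns seen in the output further down:

```
# awk: skip the header row and the "." entry, keep only the pod directory names
$ for i in $(talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/ | awk 'NR>1 && $2 != "." {print $2}'); do
    echo "$i"
    talosctl -n 10.0.1.66 ls "/var/lib/kubelet/pods/$i/volumes/kubernetes.io~csi/"
  done > archivebox
```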
7. Find the ID of the PVC.
```
$ kubectl -n archivebox get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
archivebox Bound pvc-40109437-464e-4617-8f93-ae3794d3ba0f 20Gi RWX ceph-filesystem 51m
```
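Or, if you only want the volume name:

```
$ kubectl -n archivebox get pvc archivebox -o jsonpath='{.spec.volumeName}{"\n"}'
pvc-40109437-464e-4617-8f93-ae3794d3ba0f
```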
8. Grep the output from step 6 for the PVC ID found in step 7.
```
$ grep -B5 pvc-40109437-464e-4617-8f93-ae3794d3ba0f archivebox
10.0.1.66 .
10.0.1.66 pvc-04e281b8-3039-42b9-bc88-143d38f1cb49
afa37c6c-34e9-40b7-bc27-3f791e44bae0
NODE NAME
10.0.1.66 .
10.0.1.66 pvc-40109437-464e-4617-8f93-ae3794d3ba0f
```
You have now found the pod ID (afa37c6c-34e9-40b7-bc27-3f791e44bae0).
9. Almost there… Find the node-agent running on the same host.
```
$ kubectl -n velero get pods -o wide | grep rand
node-agent-k8q9v 1/1 Running 0 126m 10.244.0.16 rand <none> <none>
```
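Instead of grepping, the API can filter by node directly. The name=node-agent label is an assumption about how the chart labels the DaemonSet pods, so check with --show-labels if it comes back empty:

```
$ kubectl -n velero get pods -l name=node-agent \
    --field-selector spec.nodeName=rand -o name
pod/node-agent-k8q9v
```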
10. As a verification, the following should list the contents of the PVC you are about to restore to:
```
$ talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/<POD-ID-FOUND-IN-8>/volumes/kubernetes.io~csi/<PVC-ID-FOUND-IN-7>/mount/
$ talosctl -n 10.0.1.66 ls /var/lib/kubelet/pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount/
NODE NAME
10.0.1.66 .
10.0.1.66 archivebox
```
11. If all that looks familiar, we are ready to restore:
```
$ kubectl -n velero exec -it <NODE-AGENT-FOUND-IN-9> -- restic restore <SNAPSHOT-ID-FOUND-IN-4> -r s3:http://10.0.1.78:9000/velero-offsite/restic/archivebox --target /host_pods/<POD-ID-FOUND-IN-8>/volumes/kubernetes.io~csi/<PVC-ID-FOUND-IN-7>/mount/restore/
$ kubectl -n velero exec -it node-agent-k8q9v -- restic restore 4779cab4 -r s3:http://10.0.1.78:9000/velero-offsite/restic/archivebox --target /host_pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount/restore/
enter password for repository:
repository 16397b86 opened (version 2, compression level auto)
restoring <Snapshot 4779cab4 of [/host_pods/2e2ed9ef-20f6-43bd-8844-cddd4f5580ca/volumes/kubernetes.io~csi/pvc-bdeb5be2-8ea9-491f-9bbe-4f58de1de88a/mount] at 2023-07-10 23:58:31.344943183 +0200 CEST by root@velero> to /host_pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount/restore/
```
/host_pods/ is where the Velero node-agent mounts /var/lib/kubelet/pods.
The password is still static-passw0rd.
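One caveat worth spelling out, as an assumption based on how restic restore normally behaves rather than something taken from this run: restic recreates the snapshot’s original path underneath --target, so the data most likely ends up nested under restore/host_pods/<old-pod-id>/…/mount/ and still has to be moved to the root of the PVC. A sketch, using the IDs from this walkthrough:

```
# Check where the files actually landed before moving anything
$ NEW=/host_pods/afa37c6c-34e9-40b7-bc27-3f791e44bae0/volumes/kubernetes.io~csi/pvc-40109437-464e-4617-8f93-ae3794d3ba0f/mount
$ OLD=host_pods/2e2ed9ef-20f6-43bd-8844-cddd4f5580ca/volumes/kubernetes.io~csi/pvc-bdeb5be2-8ea9-491f-9bbe-4f58de1de88a/mount
$ kubectl -n velero exec -it node-agent-k8q9v -- ls "$NEW/restore/$OLD/"
$ kubectl -n velero exec -it node-agent-k8q9v -- sh -c "mv $NEW/restore/$OLD/* $NEW/ && rm -rf $NEW/restore"
```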
Marvelous.
## Lessons learned
- Don’t back up a backup.
- Multiple backups from the same source is the key.
- If that’s not possible, I would switch tools until it is.
- Probably ensure backups work before I do upgrades, but that’s probably not going to happen…
Sources:
(1): https://github.com/vmware-tanzu/velero/issues/1210
(2): https://github.com/vmware-tanzu/velero/discussions/5860