US Data Loss

by naphtha, 3 months ago

TL;DR: All VM disk images on the US server were lost.

Users who had a Danbo in the US were refunded one month of their disk price: [disk space in TB] * 20 EUR, since the disk price in the US is currently 20 EUR per TB per month.
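
As a quick worked example of the refund arithmetic, here is a minimal Python sketch. The function and constant names are made up for illustration; only the 20 EUR per TB per month rate comes from this post.

```python
from decimal import Decimal

# Rate stated in this post: 20 EUR per TB per month in the US.
US_PRICE_EUR_PER_TB_MONTH = Decimal("20")


def refund_eur(disk_tb):
    """Return one month of disk price, in EUR, for a VM with disk_tb TB of disk.

    disk_tb may be an int or a string such as "0.5"; it is converted
    through str() so that fractional sizes stay exact.
    """
    return Decimal(str(disk_tb)) * US_PRICE_EUR_PER_TB_MONTH


if __name__ == "__main__":
    for size in (1, 2, "0.5"):
        print(f"{size} TB -> {refund_eur(size)} EUR refunded")
```

So a VM with 2 TB of disk would get 2 * 20 = 40 EUR back.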

You can keep using your existing VM; you just have to reinstall the OS.

What Happened?

On March 28, the datacenter attached 5 more SSDs to the server.

Without checking them, I assumed the disks were fine, added them to the RAID10 array, and then grew the array to use them. During the RAID reshape, 3 of the 5 disks were marked as faulty. Checking them with smartctl did not work, and dmesg was full of error messages.
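
For reference, members that md has marked as faulty carry an (F) flag in /proc/mdstat. The following is a rough Python sketch of how one could list them. It is not what I actually ran; the function name, the sample text and the device names in it are invented for illustration.

```python
import re

# A member entry in /proc/mdstat looks like "sdb1[1]", optionally followed
# by single-letter flags in parentheses, e.g. "(F)" for faulty.
MEMBER = re.compile(r"([\w/-]+)\[(\d+)\]((?:\(\w\))*)")


def faulty_members(mdstat_text):
    """Map each md array name to the members flagged (F) in the given text."""
    faulty = {}
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        if not sep or not name.startswith("md"):
            continue  # not an array status line
        for token in rest.split():
            match = MEMBER.fullmatch(token)
            if match and "(F)" in match.group(3):
                faulty.setdefault(name, []).append(match.group(1))
    return faulty


# Hypothetical sample, NOT the real output from the US server.
SAMPLE = """\
Personalities : [raid10]
md0 : active raid10 sdf[5](F) sde[4](F) sdd[3] sdc[2] sdb[1] sda[0]
unused devices: <none>
"""

if __name__ == "__main__":
    print(faulty_members(SAMPLE))
```

On a real machine you would pass in the contents of /proc/mdstat instead of SAMPLE.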

I assumed the kernel would handle this gracefully, but I was apparently wrong. mdadm started behaving strangely; my best guess is that it somehow started 2 reshapes at once on the same array.

After this, people started contacting me about their VMs being frozen and not booting after being restarted. I checked LVM and, sure enough, it complained about the metadata being corrupt.

At that point, I stopped the array and rebooted. After the reboot, the 3 faulty disks no longer showed up at all. I tried just about every solution I could come up with, along with everything I found online. Of course, nothing worked, so I had to recreate the array from scratch with the previous working disks.

After a bit of help from the datacenter, the disks show up now and are apparently fine. I'll test them a bit more before adding them to the main array.

Sorry this happened; it's one of many lessons learned.