VMware VM can't be cloned, moved or backed up? No problem.

There are probably easier (or harder) ways to do this, but my back was up against a wall yesterday: a very important virtual machine was in a very bad state after a series of hardware issues with the host. It was one of those perfect storms of bad backup, bad host and bad VM.

Apparently, backups for this machine had been failing in a deceptive manner that didn’t clue us in that they were failing, and the host (VMware ESXi 5.0) was building new snapshots of the drive over and over again when Veeam tried to take a backup.

Worse, every time you tried to do a VMware level operation with the machine, it was complaining about the disks with something like “Error caused by file /vmfs/volumes/########-########-####-############/VM-Name/VM-Name-0000001.vmdk” and failing out. Little extra could be gleaned from SSHing into the host and checking dmesg, but it was plain the disk was being weird in a software way, not a hardware way. Luckily, the virtual machine itself could read the whole disk just fine, and it still ran just fine. So I was stuck with flaky hardware and no way to move the VM off of it.

But I was able to recover the VM by throwing this Hail Mary pass. Fair warning, this will probably take a lot of downtime. But it’s better than losing that very important VM altogether.

I’m sure there are better or worse tools to use than the Ubuntu 12.04 server iso that I had handy, but this worked just fine for my purposes. Feel free to suggest others — I know HJ Hornbeck is more partial to ddrescue than vanilla dd, but I don’t need any of those bells and whistles myself.

– Add identically sized drive(s) to VM
– Set to boot from BIOS on next boot
– Set CD to Client mode (or, if you have patience, upload ISO of CD for ubuntu 12.04 server to the datastore)
– Using console, mount ISO
– Set boot sequence to boot from CD first
– Save bios and boot from CD
– Pick recovery mode
– Enter your way through to where it wants to mount a root filesystem
– Pick “launch shell in installer environment”
– dmesg | grep sd — should show you your identical drives, one with partition, one without.
– dd if=/dev/sda of=/dev/sdb bs=4k conv=sync,noerror &
– Ampersand puts that task in the background so you can do this — to see progress, find the PID of the process you just launched via ps, then:
– kill -SIGUSR1 ####
– Number of records * 4096 = number of bytes it’s done so far. This is the closest to actual progress report I have been able to get.
– When it’s done, it’ll spit out the number of records again without you having entered a usr1 signal.
– Shut down the machine
– Take note of the SCSI connection, then remove the old drive (don’t delete it in case you need to recover or this didn’t work)
– Change the new drive’s SCSI port to what the old drive’s was
– Set to boot from BIOS again
– Change boot order back to usual
– Try booting the machine — it should work now
– Try migration, backing up, etc. — it should work now
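
For anyone who wants to rehearse the copy before touching a real disk, here is a minimal sketch. It assumes a POSIX shell with a dd that accepts bs=4k (GNU or BusyBox), and it works on scratch files rather than /dev/sda and /dev/sdb. The helper names records_to_bytes and copy_and_verify are made up for illustration. Note that conv=sync pads a short final block with zeros, so the comparison only matches when the source size is a multiple of 4096.

```shell
#!/bin/sh
# Convert a dd status line such as "12345+0 records out" into bytes copied.
# bs defaults to 4096 to match the bs=4k used in the steps above.
records_to_bytes() {
    line=$1
    bs=${2:-4096}
    full=${line%%+*}          # whole records: the number before the "+"
    echo $((full * bs))
}

# Hypothetical dry run: same dd options as the real copy, but on plain files.
copy_and_verify() {
    src=$1
    dst=$2
    dd if="$src" of="$dst" bs=4k conv=sync,noerror 2>/dev/null || return 1
    cmp -s "$src" "$dst"      # exit status 0 only if the copy is byte-identical
}
```

On the real drives, the same cmp check between the old and new disk would take about as long as the copy itself, so it is optional.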

I’m mostly adding this to the blog because, well, it’s all based on public knowledge, so why write out this procedure and only keep it at work?


5 thoughts on “VMware VM can't be cloned, moved or backed up? No problem.”

  1. 1

    This sort of thing is why I prefer to boot vms from iSCSI volumes stored on zfs. When a disk goes bad I can simply replace it and resilver the zpool, and/or zfs send|zfs receive it to another location.

  2. 2

    <pedant_mode>

    I know HJ Hornbeck is more partial to ddrescue than vanilla dd, but I don’t need any of those bells and whistles myself.

    Except, technically, you did:

    to see progress, find the PID of the process you just launched via ps, then:
    “kill -SIGUSR1 ####”

    While you could rig that into something serviceable, assuming a bash-like shell, with while :; do kill -SIGUSR1 %1 ; sleep 30; done you still wouldn’t get a progress bar, ETA, or error count. ddrescue gives you those for free, once a second.

    Another nitpick is your block size. 4k can be handy for filesystem recovery (though that assumes your FS and dd/ddrescue blocks are of the same size and align), but you didn’t think the drive had errors. For maximum transfer rates you’d want a block size closer to the size of your drive’s cache, and those have been measured in megabytes for some time.
    </pedant_mode>

    Having said that, dd is damn near everywhere at every time while ddrescue is a rarity. This is still a terrifically useful guide, if only for that reason.
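
The polling loop from this comment can be wrapped so that it stops by itself once dd exits. This is only a sketch: it assumes GNU dd on Linux (which prints its record counts on SIGUSR1), and poll_dd is a made-up helper name. It sleeps before the first signal because a dd that has not yet installed its signal handler would be killed by SIGUSR1 instead of reporting progress.

```shell
#!/bin/sh
# Ask a running dd for progress every $interval seconds until it exits.
poll_dd() {
    pid=$1
    interval=${2:-30}
    # kill -0 sends no signal; it only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        sleep "$interval"
        # Stop quietly if the process went away while we were sleeping.
        kill -USR1 "$pid" 2>/dev/null || break
    done
    return 0
}
```

Usage would look like launching dd with a trailing ampersand as in the post, then running poll_dd $! 30 in the same shell.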

  3. 3

    magicthighs: if I was working in a more open source environment, like Xen or libvirt-based stuff (e.g. Proxmox), that is not a bad idea. However, working in the VMware paradigm, you pretty much get to use NFS (slow), local disk VMFS, or iSCSI VMFS. :/

    HJ: Yeah, I suppose a progress bar is nice. What I tend to do, when I DO want progress bars and I have an idea the size of the data, is to use pv (pipe viewer), something like dd if=[source] | pv --size=[size] | dd of=[target]. But in this case, dd is basically on just about anything you can boot from that comes with /bin/*sh. Maybe I could put together a rescue-specific iso. Then again, there are probably lots of pre-cooked rescue isos out there…

  4. 4

    Oh, and I started rescuing another less important VM today with the same method, only I used a 1M block size because any actual errors on disk reads would be less likely to cause difficulties on this and I was willing to risk a 1 meg hole for some speed.

  5. 5

    Thibeault @3:

    Yeah, I suppose a progress bar is nice. What I tend to do, when I DO want progress bars and I have an idea the size of the data, is to use pv (pipe viewer), something like dd if=[source] | pv --size=[size] | dd of=[target].

    I’m gonna note that one down; one of the few drawbacks of ddrescue is that it doesn’t work with stdin and stdout, so I can’t throw it into the middle of a pipe.

    But in this case, dd is basically on just about anything you can boot from that comes with /bin/*sh.

    Maybe even more, as it’s included in Busybox.

    Maybe I could put together a rescue-specific iso. Then again, there are probably lots of pre-cooked rescue isos out there…

    Hell yeah. If you want to try rolling your own, and you don’t care about how much space it takes up, give this a go:

    1. Create a VM with a static storage drive the size of your USB key. Almost every VM uses dynamic storage, so the size of their virtual hard drive scales depending on the amount of data on it. An 8GB static drive always takes up 8GB, in contrast.
    2. Throw whatever Linux distro you want on there. It should be fairly recent, say within the last two years, but otherwise there’s no restrictions.
    3. Figure out where the VM overhead ends and the static storage starts in the virtual drive. They almost always store the actual data as a solid contiguous file wrapped in some header fluff. There are command-line tools out there that’ll help you with this, like kpartx or qemu-nbd, or you can just search for the MBR record in a hex editor.
    4. dd if=[VM drive] skip=[offset] iflag=skip_bytes of=[USB key] seek=0 bs=1M
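
The hex-editor search in step 3 can be roughly automated. A classic MBR sector ends with the bytes 55 AA at offset 510, so a candidate byte offset can be checked as below. This is a sketch, not a full parser: has_mbr_signature is a made-up helper name, and a match is a necessary sign of an MBR, not proof that the offset is the right one.

```shell
#!/bin/sh
# Succeed if the 512-byte sector starting at byte $offset of $img
# ends with the MBR boot signature 55 AA.
has_mbr_signature() {
    img=$1
    offset=$2
    # Read the two bytes at offset+510 and print them as hex, without spaces.
    sig=$(dd if="$img" bs=1 skip=$((offset + 510)) count=2 2>/dev/null \
        | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}
```

An offset that passes this check is a reasonable value to try as skip=[offset] in step 4.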

    Thibeault @4:

    Oh, and I started rescuing another less important VM today with the same method, only I used a 1M block size because any actual errors on disk reads would be less likely to cause difficulties on this and I was willing to risk a 1 meg hole for some speed.

    First off, if you’re worried about drive errors, you really should be using ddrescue. From the manual:

    If you use the logfile feature of ddrescue, the data is rescued very efficiently, (only the needed blocks are read). Also you can interrupt the rescue at any time and resume it later at the same point.

    Ddrescue does not write zeros to the output when it finds bad sectors in the input, and does not truncate the output file if not asked to. So, every time you run it on the same output file, it tries to fill in the gaps without wiping out the data already rescued. […]

    GNU ddrescue is not a derivative of dd, nor is related to dd in any way except in that both can be used for copying data from one device to another. The key difference is that ddrescue uses a sophisticated algorithm to copy data from failing drives causing them as little additional damage as possible.

    Ddrescue manages efficiently the status of the rescue in progress and tries to rescue the good parts first, scheduling reads inside bad (or slow) areas for later. This maximizes the amount of data that can be finally recovered from a failing drive.

    Secondly, I was sanity-checking my earlier statements and found things are a bit more complex than I figured. Because of the various layers of buffers and caches, you can run into situations where 1M blocks transfer at the same rate as 4M blocks (or even slightly faster), but 64M blocks scream past both. Don’t assume smooth exponential growth; pick the biggest block size you dare.
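
Since the speed does not scale smoothly with block size, it may be easier to measure than to guess. The sketch below assumes GNU dd, which prints a throughput summary on stderr when it finishes; bench_block_sizes is a made-up helper name. It only reads, writing to /dev/null, so it is safe to point at a source disk. Repeated reads of the same data are likely to be served from the page cache, so treat the numbers as a rough guide only.

```shell
#!/bin/sh
# Read the same source once per block size and print dd's summary line.
# Usage: bench_block_sizes SOURCE BS [BS...]
bench_block_sizes() {
    src=$1
    shift
    for bs in "$@"; do
        # Keep only the final line (bytes copied, elapsed time, rate).
        summary=$(dd if="$src" of=/dev/null bs="$bs" 2>&1 | tail -n 1)
        printf '%s\t%s\n' "$bs" "$summary"
    done
}
```

For example, bench_block_sizes /dev/sda 4k 1M 4M 64M would print one line per block size; adding count= to the dd call would keep each run short on a large disk.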
