Recovering from lost snapshot on RDM

A couple of weeks ago I had a problem where the connection between a snapshot and its parent disk was gone. The parent was a raw device mapping pointing to a LUN on a SAN. This all happened during a migration from one type of SAN to another. For the migration we had to remove the Raw Device Mappings from a virtual machine in order to move it with Storage VMotion. After the move the original RDM had to be re-added to the virtual machine. The final step was to copy the original data from the RDM to a new disk.

One of the steps in our migration plan was to make sure that there weren’t any snapshots on the virtual machine. This procedure was used during the whole migration. One of my colleagues and I were scheduled to migrate the last batch of servers.

Everything went OK during the migration of this server. All steps were executed without errors. The server started like it always starts. Even the first check by the administrators looked good.

After a short while we got a phone call. “It seems that there are some databases missing. Can you fix that?” WTH?

When we searched the original disk we couldn’t find the databases. “Are you sure that, before the move, the databases were up and running?” Why doubt yourself if you can doubt somebody else? Not too long after this question he came back. “The data in the databases that were copied is 7 months old.” WTH2??

How could this be? We had migrated a whole world of VMs this way, here and in the past, and never had any data go missing.

After some high-tech forensics we came to the conclusion that there must have been a snapshot for this virtual machine, even though it didn’t show up in the snapshot manager. Although we had a back-up for this machine, we wanted to lose as little data as possible. One way or another the data from the snapshot had to be recovered.

Luckily VMware support was able to help us. They even told us that the snapshot not showing up in the snapshot manager was a known issue. I had expected that somewhere in the process of re-adding the RDM to the machine it would have given an error or warning message like “This disk has a snapshot”, or “This disk used to have a snapshot, but it disappeared. Do you still want to start the server?”. Support suggested that we set up alarms for snapshot creation, but that obviously doesn’t resolve the problem.

The solution

With the help of a storage engineer we reconnected the snapshot (disk) to the original RDM.

You probably already know that changes for a snapshot of an RDM are written to a delta disk in the location of the virtual machine. The problem was that re-adding the RDM to the virtual machine through the GUI only added the RDM itself, not the snapshot. Re-adding it by editing the vmx-file would probably have worked, although I haven’t tested that yet.
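A quick way to spot such a leftover delta disk is to look at the file names in the virtual machine’s folder instead of trusting the snapshot manager. The sketch below is only an illustration: it assumes the usual naming of snapshot disks (a six-digit sequence number such as -000001 before .vmdk, with -delta for the data file), and the file names in the listing are made up.

```python
import re

# Assumption: snapshot disks are named <disk>-000001.vmdk (descriptor)
# and <disk>-000001-delta.vmdk (data). Adjust the pattern if your
# environment names them differently.
SNAPSHOT_NAME = re.compile(r"-\d{6}(-delta)?\.vmdk$")


def find_snapshot_files(filenames):
    """Return the file names that look like snapshot descriptors or delta disks."""
    return sorted(name for name in filenames if SNAPSHOT_NAME.search(name))


# Hypothetical directory listing of a virtual machine folder.
listing = [
    "dbserver.vmx",
    "dbserver.vmdk",
    "dbserver_1.vmdk",
    "dbserver_1-rdm.vmdk",
    "dbserver_1-000001.vmdk",
    "dbserver_1-000001-delta.vmdk",
]

leftovers = find_snapshot_files(listing)
for name in leftovers:
    print("possible snapshot file:", name)
```

If this prints anything while the snapshot manager shows nothing, that is a reason to stop and investigate before removing or re-adding any disk.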

To reconnect the snapshot we had to edit the VMDK file itself. A VMDK consists of two parts: a descriptor file containing the settings for the disk, like type, geometry and a link to the data, and a file containing the data itself. This is also true for snapshots. For a snapshot, the descriptor file also gives away which disk the snapshot belongs to.

In these files the ID of the parent (parentCID) in the snapshot descriptor and the ID (CID) in the RDM descriptor should be the same. If they are not the same, the original disk will be read without the snapshot. After changing the parent ID in the snapshot descriptor to the CID of the RDM, everything worked again.
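To make the check concrete, here is a minimal sketch of the comparison and the fix. The two descriptors are samples I wrote for illustration, following the commonly seen descriptor layout; the CID values, sizes and file names are made up, and the helper functions are hypothetical, not a VMware tool. It only works on text in memory. In real life you would do this on a copy of the descriptor, with the virtual machine powered off, and preferably with support on the line.

```python
import re

# Sample descriptor of the RDM (assumed layout, made-up values).
RDM_DESCRIPTOR = '''# Disk DescriptorFile
version=1
CID=5f3a9c21
parentCID=ffffffff
createType="vmfsRawDeviceMap"

# Extent description
RW 209715200 VMFSRDM "dbserver_1-rdm.vmdk"
'''

# Sample descriptor of the snapshot (assumed layout, made-up values).
# Its parentCID does not match the CID of the RDM above.
SNAPSHOT_DESCRIPTOR = '''# Disk DescriptorFile
version=1
CID=0c4d7e88
parentCID=1b2e6f40
createType="vmfsSparse"
parentFileNameHint="dbserver_1.vmdk"

# Extent description
RW 209715200 VMFSSPARSE "dbserver_1-000001-delta.vmdk"
'''


def read_field(descriptor, key):
    """Return the value of a key=value line in a descriptor, without quotes."""
    match = re.search(rf"^{re.escape(key)}=(.*)$", descriptor, re.MULTILINE)
    if match is None:
        raise KeyError(key)
    return match.group(1).strip().strip('"')


def chain_is_intact(parent, child):
    """True if the child's parentCID equals the parent's CID."""
    return read_field(child, "parentCID").lower() == read_field(parent, "CID").lower()


def relink(parent, child):
    """Return a copy of the child descriptor with parentCID set to the parent's CID."""
    parent_cid = read_field(parent, "CID")
    return re.sub(r"^parentCID=.*$", f"parentCID={parent_cid}",
                  child, count=1, flags=re.MULTILINE)


print("before:", chain_is_intact(RDM_DESCRIPTOR, SNAPSHOT_DESCRIPTOR))
fixed = relink(RDM_DESCRIPTOR, SNAPSHOT_DESCRIPTOR)
print("after: ", chain_is_intact(RDM_DESCRIPTOR, fixed))
```

Note that the sketch only touches the parentCID line of the snapshot descriptor and leaves the snapshot’s own CID and the RDM descriptor alone.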

Although it works, there is one big risk: data corruption. The virtual machine and its application had already been started on the original disk without the snapshot, so data on that disk may have changed in the meantime. Putting the snapshot back on top of a changed parent disk means you risk ending up with total garbage. If you can use a back-up to restore data, that’s the best way to go.


I learned two lessons (at least) from this whole exercise:

  • Don’t wait a second if you need support from VMware. You can do more harm by trying things yourself than by letting the very capable support staff from VMware assist you;
  • Adrenaline rushes still work :)

DISCLAIMER: You’re responsible for your own actions. If you lose data, there’s no one else to blame but yourself. None of these actions can replace a decent back-up strategy.

Update: I came across this link on Twitter. It takes snapshot troubleshooting to a whole new level: