ESXi APD & PDL – Part 2: Recovery from an unexpectedly lost iSCSI LUN

This is a continuation of a post I wrote earlier regarding VMware iSCSI All Paths Down and Permanent Device Loss. If you haven’t seen Part 1, I would recommend checking that out here.

NOTE: As with the previous post, I take no responsibility if you damage your own environment. Anything written here should be considered helpful information for debugging and solving your own problem. Take all possible precautions before you make changes, and if you don’t know what you are doing, consult VMware support.

Now if you end up reading this, you most likely have a good amount of pressure on you and don’t like what’s going on, or else you’re lucky and are just reading random technical articles on the web.

1. Which LUN? How to get started

If you haven’t read Part 1 and don’t fancy reading anything long, then this is for you. If you have read Part 1, this will most likely not contain anything new.

A quick way to diagnose a case like this is to open an SSH connection to your ESXi host (or open the DCUI and view the logs from there). You will most likely find lines like the following in /var/log/vmkernel.log, indicating errors in your iSCSI storage system:

2013-11-27T07:04:03.470Z cpu11:8203)ScsiDeviceIO: 2316: Cmd(0x41244f4bd9c0) 0x28, CmdSN 0x1d139 from world 0 to dev "naa.6090a0985058d84c680885e57fc983d0" failed H:0x5 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2013-11-27T07:04:03.471Z cpu1:1211998)Partition: 414: Failed read for "naa.6090a0985058d84c680885e57fc983d0": I/O error
2013-11-27T07:04:03.471Z cpu1:1211998)Partition: 1020: Failed to read protective mbr on "naa.6090a0985058d84c680885e57fc983d0" : I/O error
2013-11-27T07:04:03.471Z cpu1:1211998)WARNING: Partition: 1129: Partition table read from device naa.6090a0985058d84c680885e57fc983d0 failed: I/O error
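
On a busy host the log scrolls fast, so it can help to summarise which devices are actually throwing errors. The following is only a rough sketch: it assumes the log lines look like the excerpt above, the function name failing_devices is made up for this post, and it relies on plain grep, sort and uniq, which I'm assuming behave the same in the ESXi shell as elsewhere.

```shell
# Sketch: count error lines per device in vmkernel-style log text.
# failing_devices is a hypothetical helper name, not an ESXi command.
failing_devices() {
    # Keep only lines that report an I/O error or a failed SCSI command,
    # extract the naa.* identifier, then count how often each one appears.
    grep -E 'I/O error|failed H:0x' \
        | grep -o 'naa\.[0-9a-f]*' \
        | sort \
        | uniq -c \
        | sort -rn
}

# Usage on the host (log path as mentioned above):
#   failing_devices < /var/log/vmkernel.log
```

The device with the highest count at the top of the output is usually the one to look at first.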

While a host is trying to reconnect to a LUN, it will enter a non-responsive state and vCenter will show the host as Not Responding. If you can get the Summary tab open, you will see the troublesome LUN flagged as inaccessible or invalid. If you don’t see this, don’t worry; the information can be found from the command line. If the datastore is shared by multiple hosts, for example in a cluster, then the hosts running VMs on that datastore will be non-responsive, while hosts that have no running VMs on the datastore will remain available. On the responding hosts the datastore is most likely not visible and has been cleanly disconnected.

Now that we know which LUN is in question, we need to know what it is called on our storage. One simple command gives us what we need: esxcfg-mpath lists information about our iSCSI paths, and by grepping for the device ID we found in vmkernel.log, we can see what the LUN is named on the SAN. In this example the LUN is named vmwareNNNN on our Dell EqualLogic SAN. If you have a good naming convention, you will know right away which LUN is in question.

~ # esxcfg-mpath -L | grep naa.6090a0985058d84c680885e57fc983d0
vmhba40:C2:T6:L0 state:active naa.6090a0985058d84c680885e57fc983d0 vmhba40 2 6 0 NMP active san iqn.1998-01.com.vmware:esx-nnn 00023d000004,iqn.2001-05.com.equallogic:0-8a0906-4cd858509-d083c97fe5850868-vmwareNNNN,t,1
vmhba40:C1:T6:L0 state:active naa.6090a0985058d84c680885e57fc983d0 vmhba40 1 6 0 NMP active san iqn.1998-01.com.vmware:esx-nnn 00023d010002,iqn.2001-05.com.equallogic:0-8a0906-4cd858509-d083c97fe5850868-vmwareNNNN,t,1
~ #
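
If a device has many paths, picking the target name out of that wall of text by eye gets tedious. Here is a small sketch that prints just the distinct target IQNs. It assumes the target IQN is the part that follows a comma, as in the output above; target_iqns is a name I made up, and other SAN vendors may format this field differently.

```shell
# Sketch: list the distinct iSCSI target names behind a device.
# target_iqns is a hypothetical helper, not an ESXi command.
# ASSUMPTION: the target IQN directly follows a comma, as in the
# esxcfg-mpath -L output shown above; the initiator IQN does not.
target_iqns() {
    # Grab every ",iqn...." chunk up to the next comma, drop the leading
    # comma, and collapse the per-path duplicates.
    grep -o ',iqn\.[^,]*' | cut -c2- | sort -u
}

# Usage on the host:
#   esxcfg-mpath -L | grep naa.6090a0985058d84c680885e57fc983d0 | target_iqns
```

With the two paths shown above this prints a single line, since both paths lead to the same target.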

You could try to do a lunreset, but my experience so far has been that it’s of no help.

~ # vmkfstools -L lunreset /vmfs/devices/disks/naa.6090a0985058d84c680885e57fc983d0

After the reset attempt, /var/log/vmkernel.log contains the following entries:

2013-11-27T07:28:04.367Z cpu29:1214729)WARNING: NMP: nmpDeviceTaskMgmt:2259:Attempt to issue lun reset on device naa.6090a0985058d84c680885e57fc983d0. This will clear any SCSI-2 reservations on the device.
2013-11-27T07:28:04.374Z cpu29:1214729)Resv: 631: Executed out-of-band lun reset on naa.6090a0985058d84c680885e57fc983d0
2013-11-27T07:29:13.692Z cpu6:8859)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba40:CH:2 T:6 L:0 : Task mgmt "Abort Task" with itt=0xf1acf (refITT=0xf1ace) timed out.
2013-11-27T07:30:03.692Z cpu3:8859)WARNING: iscsi_vmk: iscsivmk_TaskMgmtIssue: vmhba40:CH:2 T:6 L:0 : Task mgmt "Abort Task" with itt=0xf1df1 (refITT=0xf1df0) timed out.
2013-11-27T07:30:03.692Z cpu16:1211998)Partition: 414: Failed read for "naa.6090a0985058d84c680885e57fc983d0": I/O error
2013-11-27T07:30:03.692Z cpu16:1211998)Partition: 1020: Failed to read protective mbr on "naa.6090a0985058d84c680885e57fc983d0" : I/O error
2013-11-27T07:30:03.692Z cpu16:1211998)WARNING: Partition: 1129: Partition table read from device naa.6090a0985058d84c680885e57fc983d0 failed: I/O error

2. What about the VMs?

Virtual machines on the hosts are fine, as long as they don’t reside on the affected LUNs. If they do, they have just lost their hard drives but will keep running. Linux servers seem to be more resilient in cases like this than Windows Server VMs. Nonetheless, any read or write operations on these VMs will fail while they are still running.

I’ve been in contact with VMware support regarding this, and they gave two options for recovery: either cold boot the host, or go to the SAN administration interface and take the LUN offline. The first option should be considered a last resort, as it will halt ALL VMs on that particular host. So as you might imagine, this is an action to consider only when everything else fails. Now you might think, why not just vMotion the VMs and be done with it? That would be an option if the host were in a responsive state. For the VMs that are still working, you can remote/SSH into them and shut them down from the OS itself, which should be done if a reboot is required.

The second option is to take only the affected LUN offline. This will kill any hanging iSCSI connections to that LUN, and all other LUNs remain unaffected. This is a much safer approach than forcing a reboot of the whole hypervisor.

3. Taking a LUN offline

Before you take your LUN offline, double check that you are operating on the correct LUN; you don’t want to take the wrong one offline! If you are unsure, check the logs and verify using esxcfg-mpath -L | grep naa.XXX to see what the LUN is named on the SAN. Once you’ve taken the LUN in question offline, the hosts will notice this automatically, so no rescan is required at this stage. Wait a few minutes and vCenter Server will be able to reconnect to the hosts. If not, you can trigger this manually by issuing a connect command from the vSphere Client. Once this is done, everything seems a lot brighter already. But there’s still work to be done :)

At this stage you will get a clearer picture of the affected VMs, as they are still grayed out (inaccessible) while the other VMs are back to green. If there was any HA or vMotion activity going on before the host went unresponsive, it will be carried out now and any conflicts will be resolved by vCenter. It’s a good idea to check through the cluster before continuing. What I’ve done at this stage is kill off any VM processes on the hosts that have virtual hard disks on the LUN currently offline. That way the hosts will be able to reconnect the LUN, and there are no active VMs running on it when it’s reconnected.

From the vSphere Client you will be able to find the VMs, but to hard stop them you need their World ID. To get a list of the running VMs and their World IDs, use the following command:

esxcli vm process list

And to kill off a VM, use the following command:
esxcli vm process kill --type=force --world-id=<World ID>
For example:
~ # esxcli vm process kill --type=force --world-id=12131
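
With a lot of VMs on a host, matching World IDs to the affected datastore by hand is error-prone. The sketch below filters the process list down to VMs whose config file lives under a given datastore path. Treat it as an assumption-laden helper: worlds_on_datastore is a made-up name, it assumes each VM entry has "World ID:", "Display Name:" and "Config File:" lines in that order (check this against your own output), and it only looks at where the .vmx file lives, so a VM with just a secondary disk on the lost LUN will not show up.

```shell
# Sketch: print "<World ID><TAB><Display Name>" for VMs whose .vmx file
# lives under the datastore path given as the first argument.
# worlds_on_datastore is a hypothetical helper, not an ESXi command.
# ASSUMPTION: the input has "World ID:", "Display Name:" and "Config File:"
# lines per VM, in that order.
worlds_on_datastore() {
    awk -v ds="$1" '
        /^ *World ID:/     { wid = $NF }
        /^ *Display Name:/ { sub(/^ *Display Name: */, ""); name = $0 }
        /^ *Config File:/  {
            sub(/^ *Config File: */, "")
            # Match only when the config path starts with the datastore path.
            if (index($0, ds) == 1) print wid "\t" name
        }
    '
}

# Usage on the host (the datastore path here is a placeholder):
#   esxcli vm process list | worlds_on_datastore /vmfs/volumes/<datastore>/
```

It only prints the candidates on purpose. Review the list before feeding any of the World IDs to the kill command above.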

Once all the VMs residing on the offline LUN are killed, it’s time to bring the LUN back. In your SAN administration interface, bring the LUN online and do a rescan on all hosts, one at a time. This can be done either from the vSphere Client by selecting “Rescan All…” in the Configuration > Storage view, or by typing esxcfg-rescan -u vmhbaNN in the SSH session.

At this point, you will see the LUN again in your storage view and the VMs will no longer be shown as inaccessible. Once the LUN shows up properly, it’s time to start up the VMs again and assess the damage. In worst case scenarios you have data loss or even corruption. But I would guess you assumed as much…

4. Conclusion

Having an unexpected APD or PDL is no fun to deal with, and the damage can be extensive. Hopefully this post has been of some use to you, helping you get through a tough day at work and minimize the damage. VMware itself seems really resilient to outages, so it’s possible to take out one LUN if it’s lost. Good documentation does help in the recovery process, so if you haven’t written any, do it! It might save you one day :)

Note: In ESXi 5.5 VMware changed how the system handles a PDL by introducing PDL AutoRemove. This should prevent hosts from going unresponsive.

[quote]

PDL AutoRemove
I’m not going to delve back into the history of All Paths Down (APD) or Permanent Device Loss (PDL). This has a long and checkered history, and has been extensively documented. Suffice to say that this situation occurs on device failures or if the device is incorrectly removed from host. PDL is based on SCSI Sense Codes returned from the array and a PDL state means that the ESXi host no longer sends I/O to these devices.

PDL AutoRemove in vSphere 5.5 automatically removes a device with PDL from the host. A PDL state on a device implies that the device is gone and that it cannot accept more IOs, but needlessly uses up one of the 256 device per host limit. PDL AutoRemove gets rid of the device from the ESXi host perspective.

source: Whats new in vSphere 5.5 Storage

[/quote]

Some additional resources which might be of help:
One host shows a Storage Initiator Error while all other hosts show SCSI Reservation Conflicts (1021187)

Understanding SCSI host-side NMP errors/conditions in ESX 4.x and ESXi 5.x (1029039)

Interpreting SCSI sense codes in VMware ESXi and ESX (289902)

Cannot remount a datastore after an unplanned PDL (2014155)