Raj2796's Blog

February 17, 2010

Vmware Vsphere NMP: nmp_Device problem

Filed under: san,vmware — raj2796 @ 4:45 pm

Whilst following links from a site discussing round robin IO optimisation for EVAs, something I need to do myself in the coming weeks, I came across the following info:

I recently saw a little uptick (still a small number) in customers running into a specific issue, and I wanted to share the symptom and resolution. Common behavior:

1. They want to remove a LUN from a vSphere 4 cluster
2. They move or Storage vMotion the VMs off the datastore that is being removed (otherwise, the VMs would hard crash if you just yank out the datastore)
3. After removing the LUN, VMs on OTHER datastores would become unavailable (not crashing, but becoming periodically unavailable on the network)
4. The ESX logs would show a series of errors starting with “NMP”

Examples of the error messages include:

“NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa._______________” – failed to issue command due to Not found (APD)”

“NMP: nmp_DeviceUpdatePathStates: Activated path “NULL” for NMP device “naa.__________________””

What a weird one… I also found that this was affecting multiple storage vendors (suggesting an ESX-side issue). You can see the VMTN thread on this here.
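If you want to check whether a host is logging these, the messages show up in the vmkernel log. A minimal sketch, assuming a classic ESX 4 service console where the log lives at /var/log/vmkernel (on ESXi the vmkernel messages typically end up in /var/log/messages):

grep -i nmp_Device /var/log/vmkernel | tail -20

That just pulls the most recent NMP device messages so the failover/APD entries quoted above stand out.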

So, I did some digging, following up on the VMware and EMC case numbers myself.

Here’s what’s happening, and the workaround options:

When a LUN supporting a datastore becomes unavailable, the NMP stack in vSphere 4 attempts failover paths, and if no paths are available, an APD (All Paths Down) state is assumed for that device (which starts a different path state detection routine). If you then do a rescan, VMs on that ESX host will periodically lose network connectivity and become non-responsive.
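If you want to see what path state a host currently reports for a device, esxcfg-mpath on the ESX 4 service console lists each device followed by its paths and their states (naa.xxxx below is a placeholder, not a real device id):

esxcfg-mpath -b | grep -A 2 naa.xxxx

A device left with no active paths stands out quickly in the brief (-b) listing.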

This is a bug, and a known bug.

What was commonly happening in these cases was that the customer was changing LUN masking or zoning in the array or in the fabric, removing the LUN from all the ESX hosts before removing the datastore and the disk device in the vSphere client. It is notable that this could also be triggered by anything making the LUN inaccessible to the ESX host, whether intentional, an outage, or accidental.

Workaround 1 (the better workaround IMO)

This workaround falls under “operational excellence”. The sequence of operations here is important: the issue only occurs if the LUN is removed while the datastore and disk device are still expected by the ESX host. The correct sequence for removing a LUN backing a datastore is (a rough command-line sketch follows the list):

1. In the vSphere client, vacate the VMs from the datastore being removed (migrate or Storage vMotion)
2. In the vSphere client, remove the Datastore
3. In the vSphere client, remove the storage device
4. Only then, in your array management tool remove the LUN from the host.
5. In the vSphere client, rescan the bus.
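As a rough command-line sketch of steps 4–5 and the follow-up verification on a classic ESX 4 service console (vmhba2 and naa.xxxx are placeholders; the vSphere client procedure above remains the authoritative sequence):

esxcfg-rescan vmhba2                 # rescan one adapter after the LUN is unpresented at the array; repeat per HBA
esxcfg-scsidevs -m                   # list VMFS volumes and their backing devices; the removed datastore should be gone
esxcfg-mpath -b | grep naa.xxxx      # confirm no paths remain to the removed device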

Workaround 2 (only available in ESX/ESXi 4 u1)

This workaround is available only in update 1, and changes what the vmkernel does when it detects the APD state for a storage device, basically just immediately failing to open a datastore volume if the device’s state is APD. Since it’s an advanced parameter change, I wouldn’t make this change unless instructed by VMware support.

esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD
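If you do set it, the same utility can read the value back with -g, and setting it back to 0 restores the default behaviour:

esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD
esxcfg-advcfg -s 0 /VMFS3/FailVolumeOpenIfAPD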

Some QnA:

Q: Does this happen if you’re using PowerPath/VE?

A: I’m not sure, but I don’t THINK this bug would occur for devices owned by PowerPath/VE (since it replaces the bulk of the NMP stack in those cases); I still need to validate that. This highlights to me how important these little things (in this case path state detection) are in the entire storage stack.

In any case, I thought people would find it useful to know about this, and it is a bug being tracked for resolution. Hope it helps at least one customer!

Thank you to a couple customers for letting me poke at their case, and to VMware Escalation Engineering!

The above post is from virtualgeek
