Raj2796's Blog

March 14, 2012

VMware vSphere 5 dead LUN and pathing issues and resultant SCSI errors

Filed under: san,vmware — raj2796 @ 3:13 pm

Recently we had a hardware issue with a SAN. Somehow it appears to have presented multiple fake LUNs to VMware before crashing, leaving VMware with a large number of dead LUNs it could not connect to. Cue a rescan all … and everything started crashing!

Our primary cluster of 6 servers started to abend. Half the ESX servers decided to vMotion off every VM and shut down, so the other 3 ESX servers were left running 6 servers’ worth of VMs. DRS then kept moving VMs off the most highly utilised ESX server, which made another ESX server the most highly utilised, which then decided to migrate all its VMs off, and so on in a never-ending loop.

As a result the ESX servers were:

  • unresponsive
  • showing powered-off machines as being on
  • unable to vMotion VMs off
  • intermittently losing connection to VC
  • losing VMs that were vMotioned, i.e. the VMs became orphaned

Here’s what I did to fix the issue.

1 – get NAA ID of the lun to be removed

 See error messages on server

Or see properties of Datastore in vc (assuming vc isn’t crashed)

Or from command line :

# esxcli storage vmfs extent list

Alternatively you could use # esxcli storage filesystem list, however that wouldn’t work in this case since there were no filesystems on the failed LUNs.

e.g. output

Volume Name  VMFS UUID                            Extent Number  Device Name                           Partition
-----------  -----------------------------------  -------------  ------------------------------------  ---------
datastore1   4de4cb24-4cff750f-85f5-0019b9f1ecf6              0  naa.6001c230d8abfe000ff76c198ddbc13e          3
Storage2     4c5fbff6-f4069088-af4f-0019b9f1ecf4              0  naa.6001c230d8abfe000ff76c2e7384fc9a          1
Storage4     4c5fc023-ea0d4203-8517-0019b9f1ecf4              0  naa.6001c230d8abfe000ff76c51486715db          1
LUN01        4e414917-a8d75514-6bae-0019b9f1ecf4              0  naa.60a98000572d54724a34655733506751          1

Look at the Device Name column, it’s the NAA ID of each LUN. The 4th row is for a volume that’s labelled LUN01, which is dodgy: production servers, or even test servers, would have a recognisable label such as FASSHOME1 or FASSWEB6.
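If there are a lot of extents to eyeball, a rough sketch like the one below can pull out the label and NAA ID pairs and flag labels that don't fit your naming scheme. The helper names extent_devices and unexpected_labels are made up for illustration, and the label pattern is an assumption you would swap for your own convention.

```shell
# Print "label<TAB>device" for each row of `esxcli storage vmfs extent list`
# output read from stdin. Columns are counted from the right so that volume
# labels containing spaces still parse.
extent_devices() {
    awk 'NR > 2 && NF >= 5 {
        label = $1
        for (i = 2; i <= NF - 4; i++) label = label " " $i
        print label "\t" $(NF - 1)
    }'
}

# Hypothetical helper: print "device (label)" for every extent whose label
# does NOT match the extended regex given as $1.
unexpected_labels() {
    extent_devices | awk -F '\t' -v ok="$1" '$1 !~ ok { print $2 " (" $1 ")" }'
}

# Usage on a host (assumes an ESXi 5.x shell with esxcli available):
#   esxcli storage vmfs extent list | unexpected_labels '^(FASS|datastore|Storage)'
```

This only reports candidates. Whether a label really is dodgy is still a judgement call.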

2 – remove bad luns


If VC is working (in our case it wasn’t), go to configuration / device, look at the identifiers and match the NAA ID. (The screenshot showing where to look has been removed since it shows too much work information.)

Right-click the NAA ID under Identifier, select Detach and confirm.

Now rescan the FC HBA.

In our case VC wasn’t an option since the hosts were unresponsive and VC couldn’t communicate with them. The LUNs were also already detached since they were never used, so:

List permanently detached devices:

# esxcli storage core device detached list

Look in the output for LUNs whose state is off, e.g.

Device UID                            State
------------------------------------  -----
naa.50060160c46036df50060160c46036df  off
naa.6006016094602800c8e3e1c5d3c8e011  off

Next, permanently remove the device configuration information from the system:

# esxcli storage core device detached remove -d <NAA ID>

For example:

# esxcli storage core device detached remove -d naa.50060160c46036df50060160c46036df
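With more than a handful of dead LUNs, typing each remove by hand gets tedious. Here is a minimal sketch that turns the detached list output into one remove command per device. The function names are hypothetical, and it only matches naa. identifiers, which is an assumption based on the output above. It prints the commands rather than running them, so you can review the list first.

```shell
# Print the NAA IDs whose State column is "off" in the output of
# `esxcli storage core device detached list` (read from stdin).
detached_off_devices() {
    awk '$1 ~ /^naa\./ && $2 == "off" { print $1 }'
}

# Print (do not run) one remove command per detached device, for review.
print_remove_commands() {
    detached_off_devices | while read -r dev; do
        echo "esxcli storage core device detached remove -d $dev"
    done
}

# Usage on a host (assumes an ESXi 5.x shell with esxcli available):
#   esxcli storage core device detached list | print_remove_commands
```

Once you are happy with the list, the printed lines can be pasted back into the shell one at a time.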




If a LUN is still attached, detach it before removing it. To detach a device/LUN, run this command:

# esxcli storage core device set --state=off -d <NAA ID>

To verify that the device is offline, run this command:

# esxcli storage core device list -d <NAA ID>

The output, which shows that the status of the disk is off, is similar to:


   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34655733506751)
   Has Settable Display Name: true
   Size: 1048593
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.60a98000572d54724a34655733506751
   Vendor: NETAPP
   Model: LUN
   Revision: 7330
   SCSI Level: 4
   Is Pseudo: false
   Status: off
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.020000000060a98000572d54724a346557335067514c554e202020
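When checking several devices it can be easier to pull out just the Status field than to read the whole listing. A small sketch, assuming the output has the "Status: off" line shown above. The function name device_status is hypothetical.

```shell
# Print the value of the "Status:" field from the output of
# `esxcli storage core device list -d <NAA ID>` (read from stdin).
# The pattern is anchored to the start of the line, so fields such as
# "Thin Provisioning Status" and "VAAI Status" are ignored.
device_status() {
    awk -F ': *' '/^ *Status:/ { print $2; exit }'
}

# Usage on a host (assumes an ESXi 5.x shell with esxcli available):
#   esxcli storage core device list -d <NAA ID> | device_status
```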

Running the partedUtil getptbl command on the device shows that the device is not found.

For example:

# partedUtil getptbl /vmfs/devices/disks/naa.60a98000572d54724a34655733506751


Error: Could not stat device /vmfs/devices/disks/naa.60a98000572d54724a34655733506751- No such file or directory.

Unable to get device /vmfs/devices/disks/naa.60a98000572d54724a34655733506751



In our case we’re vMotioning the VMs off and then powering the ESX servers down and up, since they have issues and we want to force all references to be updated from a fresh network discovery, which is the safest option.

You can rescan through VC under normal operations. A rescan all can cause errors, however, and it’s better to selectively scan adapters to avoid stressing the system.

Alternatively, there’s the command line command, which needs to be run on all affected servers:

# esxcli storage core adapter rescan [ -A vmhba# | --all ]
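To scan selectively from the command line, you can generate one rescan command per adapter instead of using --all. The sketch below is only an illustration: rescan_commands is a made-up helper, and the adapter names in the usage line are placeholders for whatever your hosts actually have.

```shell
# Print (do not run) one rescan command per adapter name given as an argument.
# Names that do not look like vmhba<number> are skipped with a warning on stderr.
rescan_commands() {
    for hba in "$@"; do
        case "$hba" in
            vmhba[0-9]*) echo "esxcli storage core adapter rescan -A $hba" ;;
            *) echo "skipping unexpected adapter name: $hba" >&2 ;;
        esac
    done
}

# Usage (adapter names are placeholders):
#   rescan_commands vmhba1 vmhba2
```

Running the printed commands one at a time lets you watch each adapter finish before starting the next.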

Other useful info

  • Where existing datastores have issues and need unmounting, and VC is not working

# esxcli storage filesystem list

(the volume labels and UUIDs are the same ones shown in the extent list example above)

Unmount the datastore by running the command:

# esxcli storage filesystem unmount [-u <UUID> | -l <label> | -p <path> ]

For example, use one of these commands to unmount the LUN01 datastore:

# esxcli storage filesystem unmount -l LUN01

# esxcli storage filesystem unmount -u 4e414917-a8d75514-6bae-0019b9f1ecf4

# esxcli storage filesystem unmount -p /vmfs/volumes/4e414917-a8d75514-6bae-0019b9f1ecf4

Verify it’s unmounted by running # esxcli storage filesystem list again and confirming the datastore is no longer listed as mounted.
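That check can be sketched as a small filter. It assumes the filesystem list output has a Mounted column (true or false) immediately after the UUID column, which is how I remember the ESXi 5.x layout but is not shown in this post, so check it against your own output first. The function name mounted_state is hypothetical.

```shell
# Print the field that follows the given VMFS UUID in the output of
# `esxcli storage filesystem list` (read from stdin). Under the assumption
# above, that field is the Mounted column. Prints nothing if the UUID is
# not listed at all.
mounted_state() {
    awk -v uuid="$1" '{
        for (i = 1; i < NF; i++)
            if ($i == uuid) { print $(i + 1); exit }
    }'
}

# Usage on a host (assumes an ESXi 5.x shell with esxcli available):
#   esxcli storage filesystem list | mounted_state 4e414917-a8d75514-6bae-0019b9f1ecf4
```

Either "false" or no output at all would mean the datastore is no longer mounted.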

The above are the actions I found useful in my environment. To read the original VMware TID that I gained the majority of the information from, go here.



  1. Hello,
    I had the same issue but couldn’t do anything except reboot.
    I mean our ESXi host became disconnected in vCenter.
    I tried to connect directly to the ESXi host but it didn’t work. I tried to restart the management service of the ESXi host but that didn’t work either (hostd crashed).
    And when I tried to launch this command: esxcli storage filesystem list, it just never finished, so the only option I had was to reboot the ESXi host…

    Oh, now that I think about it, I can check the bad NAA ID in the vmkernel log.
    So with an SSH session open on the vmkernel log I should have tried to detach and unmount those LUNs.

    Another way I was thinking of was just to kill the rescan process, but I couldn’t find it.

    Comment by nOon — October 10, 2012 @ 7:35 am | Reply

    • Heya, that happens due to a LUN on the host being in an APD (all paths down) condition. If you’re using an older version of VMware you can use the esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD command to immediately fail APDs; however this will affect your system, since LUNs that you actually have problems with won’t have time to recover. Alternatively upgrade to vSphere 5.1, where there are APD and PDL enhancements, e.g. a configurable global Misc.APDHandlingEnable value with a default of 1 (enabled), which together with the default Misc.APDTimeout gives a 140 second timeout, etc. If I get time I’ll post a new blog article with links to VMware TIDs etc.

      Comment by raj2796 — October 10, 2012 @ 10:28 am | Reply

  2. Thank you for this post!! I had a serious issue with volumes being in an APD state after a few failed Site Recovery tests, and since then all tests have failed to clean up properly because of these hidden and dead paths. I’ve been involved with both VMware and IBM support, and no one could figure out why things failed.
    After following a few of your steps above I was able to remove the dead paths AND complete a successful SRM test & cleanup!! My CIO and Director are super happy….. THANK YOU!

    Comment by photob0mb3r — March 21, 2013 @ 7:09 pm | Reply

