
March 14, 2012

Vmware vSphere 5 dead LUN and pathing issues and resultant SCSI errors

Filed under: san,vmware — raj2796 @ 3:13 pm

Recently we had a hardware issue with a SAN – somehow it appears to have presented multiple fake LUNs to VMware before crashing, leaving VMware with a large number of dead LUNs it could not connect to. Cue a rescan all … and everything started crashing!

Our primary cluster of 6 servers started to abend: half the ESX servers decided to vmotion off every VM and shut down, leaving the other 3 ESX servers running 6 servers' worth of VMs. DRS kept moving VMs off the highest utilised ESX server, which made another ESX server the highest utilised, which then decided to migrate all its VMs off, and so on in a never-ending loop.

As a result the ESX servers were:

  • unresponsive
  • showing powered off machines as being on
  • unable to vmotion VMs off
  • intermittently losing connection to VC
  • losing VMs that were vmotioned, i.e. the VMs became orphaned

Here’s what I did to fix the issue.

1 – get the NAA ID of the LUN to be removed

See the error messages on the server (a log grep example is shown after the output below)

Or see the properties of the datastore in VC (assuming VC isn’t crashed)

Or from the command line:

#esxcli storage vmfs extent list

Alternatively you could use # esxcli storage filesystem list, however that wouldn’t work in this case since there were no filesystems on the failed LUNs

e.g. output

Volume Name  VMFS UUID                            Extent Number  Device Name                           Partition
-----------  -----------------------------------  -------------  ------------------------------------  ---------
datastore1   4de4cb24-4cff750f-85f5-0019b9f1ecf6              0  naa.6001c230d8abfe000ff76c198ddbc13e          3
Storage2     4c5fbff6-f4069088-af4f-0019b9f1ecf4              0  naa.6001c230d8abfe000ff76c2e7384fc9a          1
Storage4     4c5fc023-ea0d4203-8517-0019b9f1ecf4              0  naa.6001c230d8abfe000ff76c51486715db          1
LUN01        4e414917-a8d75514-6bae-0019b9f1ecf4              0  naa.60a98000572d54724a34655733506751          1

Look at the Device Name column – those are the NAA IDs of the LUNs. The 4th entry is for a volume labelled LUN01 – dodgy, since production (or even test) volumes would have a recognisable label such as FASSHOME1 or FASSWEB6 etc.
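
If you prefer to pull the suspect NAA IDs from the host logs instead (the “error messages on server” option above), a quick grep of the vmkernel log usually shows the devices that are failing. This is just a rough sketch – it assumes the standard ESXi 5 log location and the busybox grep/tail in the ESXi shell, and the exact messages will vary:

# grep naa. /var/log/vmkernel.log | tail -20
# grep naa.60a98000572d54724a34655733506751 /var/log/vmkernel.log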

2 – remove the bad LUNs

 

If VC is working – in our case it wasn’t – go to Configuration / Devices, look at the Identifiers and match the NAA (screenshot removed since it shows too much work information)

Right-click the NAA under Identifier, select Detach and confirm

Now rescan the FC HBAs

In our case VC wasn’t an option since the hosts were unresponsive and VC couldn’t communicate with them; also the LUNs were already detached since they were never used, so:

List the permanently detached devices:

# esxcli storage core device detached list

Look at the output for LUNs with state off, e.g.

Device UID                            State
------------------------------------  -----
naa.50060160c46036df50060160c46036df  off
naa.6006016094602800c8e3e1c5d3c8e011  off

Next, permanently remove the device configuration information from the system:

# esxcli storage core device detached remove -d <NAA ID>

e.g.

# esxcli storage core device detached remove  -d naa.50060160c46036df50060160c46036df
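
Since the SAN presented a whole batch of fake LUNs, removing them one NAA at a time gets tedious. A minimal sketch of looping over the detached list instead – this assumes the busybox grep/awk in the ESXi shell, and is only safe if everything in that list really is one of the dead LUNs, since it permanently removes the configuration for every detached device:

# esxcli storage core device detached list | grep ^naa | awk '{print $1}' | while read naa; do esxcli storage core device detached remove -d $naa; done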

 

OR

 

To detach a device/LUN, run this command:

# esxcli storage core device set --state=off -d <NAA ID>

To verify that the device is offline, run this command:

# esxcli storage core device list -d <NAA ID>

The output, which shows that the status of the disk is off, is similar to:

naa.60a98000572d54724a34655733506751
   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34655733506751)
   Has Settable Display Name: true
   Size: 1048593
   Device Type: Direct-Access
   Multipath Plugin: NMP
   Devfs Path: /vmfs/devices/disks/naa.60a98000572d54724a34655733506751
   Vendor: NETAPP
   Model: LUN
   Revision: 7330
   SCSI Level: 4
   Is Pseudo: false
   Status: off
   Is RDM Capable: true
   Is Local: false
   Is Removable: false
   Is SSD: false
   Is Offline: false
   Is Perennially Reserved: false
   Thin Provisioning Status: yes
   Attached Filters:
   VAAI Status: unknown
   Other UIDs: vml.020000000060a98000572d54724a346557335067514c554e202020

Running the partedUtil getptbl command on the device shows that the device is not found.

For example:

# partedUtil getptbl /vmfs/devices/disks/naa.60a98000572d54724a34655733506751

 

Error: Could not stat device /vmfs/devices/disks/naa.60a98000572d54724a34655733506751- No such file or directory.

Unable to get device /vmfs/devices/disks/naa.60a98000572d54724a34655733506751

 

AFTER DETACHING, RESCAN LUNS

In our case we’re vmotioning the VMs off, then powering the hosts down and back up, since they have issues and we want to force all references to be updated from a fresh network discovery, which is the safest option

You can rescan through VC under normal operations – the scan all can cause errors however, and it’s better to selectively scan adapters to avoid stressing the system

Alternatively there’s the command line option, which needs to be run on all affected servers:

# esxcli storage core adapter rescan [ -A vmhba# | --all ]
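
For example, to see which HBAs the host has and then rescan just the fibre channel ones individually – vmhba1 and vmhba2 here are placeholders, the adapter names on your hosts will differ:

# esxcli storage core adapter list
# esxcli storage core adapter rescan -A vmhba1
# esxcli storage core adapter rescan -A vmhba2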

Other useful info

  • Where existing datastores have issues and need unmounting and VC isn’t working

# esxcli storage filesystem list

(see above for example output)

Unmount the datastore by running the command:

# esxcli storage filesystem unmount [-u <UUID> | -l <label> | -p <path> ]

For example, use one of these commands to unmount the LUN01 datastore:

# esxcli storage filesystem unmount -l LUN01

# esxcli storage filesystem unmount -u 4e414917-a8d75514-6bae-0019b9f1ecf4

# esxcli storage filesystem unmount -p /vmfs/volumes/4e414917-a8d75514-6bae-0019b9f1ecf4

Verify it’s unmounted by again running # esxcli storage filesystem list and confirming it’s removed from the list
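
For instance, to check for the LUN01 datastore from the earlier example (grep assumed available in the ESXi shell) – depending on the state of the volume it will either drop out of the list entirely or show up as unmounted:

# esxcli storage filesystem list | grep LUN01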

The above are the actions I found useful in my environment – to read the original VMware KB article I gained the majority of the information from, go here
