Raj2796's Blog

March 14, 2012

Vmware vSphere 5 dead LUN and pathing issues and resultant SCSI errors

Filed under: san,vmware — raj2796 @ 3:13 pm

Recently we had a hardware issue with a SAN; somehow it appears to have presented multiple fake LUNs to VMware before crashing, leaving VMware with a large number of dead LUNs it could not connect to. Cue a rescan all … and everything started crashing!

Our primary cluster of 6 servers started to abend. Half the ESX servers decided to vMotion off every VM and shut down, leaving the other 3 ESX servers running 6 servers' worth of VMs. DRS then kept moving VMs off the most heavily utilised ESX server, which made another ESX server the most heavily utilised, which in turn migrated all its VMs off, and so on in a never-ending loop.

As a result the ESX servers were:

  • unresponsive
  • showing powered-off machines as being on
  • unable to vMotion VMs off
  • intermittently losing connection to VC
  • losing VMs that were vMotioned, i.e. the VMs became orphaned

Here's what I did to fix the issue.

1 – Get the NAA ID of the LUN to be removed

See the error messages on the server

Or see the properties of the datastore in VC (assuming VC isn't crashed)

Or from the command line:

#esxcli storage vmfs extent list

Alternatively you could use #esxcli storage filesystem list, however that wouldn't work in this case since there were no filesystems on the failed LUNs

e.g. output

Volume Name VMFS UUID                           Extent Number Device Name                           Partition
———– ———————————– ————- ————————————  ———
datastore1  4de4cb24-4cff750f-85f5-0019b9f1ecf6             0  naa.6001c230d8abfe000ff76c198ddbc13e        3
Storage2    4c5fbff6-f4069088-af4f-0019b9f1ecf4             0  naa.6001c230d8abfe000ff76c2e7384fc9a        1
Storage4    4c5fc023-ea0d4203-8517-0019b9f1ecf4             0  naa.6001c230d8abfe000ff76c51486715db        1
LUN01       4e414917-a8d75514-6bae-0019b9f1ecf4             0 naa.60a98000572d54724a34655733506751        1

Look at the Device Name column – it contains the NAA IDs of the LUNs. The 4th entry is for a volume labelled LUN01 – dodgy, since production (or even test) volumes would have a recognisable label such as FASSHOME1 or FASSWEB6.
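If you already have a suspect device, you can filter the extent list for its NAA ID directly and see whether any VMFS extent is using it – e.g. for the dodgy LUN01 device from the output above:

# esxcli storage vmfs extent list | grep naa.60a98000572d54724a34655733506751

No output means nothing in VMFS is backed by that device.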

2 – Remove the bad LUNs

 

If VC is working – in our case it wasn't – go to Configuration / Devices, look at the identifiers and match the NAA – e.g. a random screenshot would show where to look (screenshot removed since it shows too much work information)

Right-click the NAA under Identifier and select Detach – confirm

Now rescan the FC HBA

In our case VC wasn't an option since the hosts were unresponsive and VC couldn't communicate; also, the LUNs were already detached since they were never used, so:

List the permanently detached devices:

# esxcli storage core device detached list

Look in the output for LUNs whose state is off, e.g.

Device UID                            State

————————————  —–

naa.50060160c46036df50060160c46036df  off

naa.6006016094602800c8e3e1c5d3c8e011  off

Next, permanently remove the device configuration information from the system:

# esxcli storage core device detached remove -d <NAA ID>

e.g.

# esxcli storage core device detached remove  -d naa.50060160c46036df50060160c46036df
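If there are a lot of stale entries, they can be removed in one pass by looping over everything the detached list reports as off. This is only a rough sketch – it assumes awk is available in the ESXi shell, and you should eyeball the detached list first to make sure you really want all of them gone:

for d in $(esxcli storage core device detached list | awk '$2 == "off" {print $1}'); do
    # permanently forget this stale device entry
    esxcli storage core device detached remove -d "$d"
done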

 

OR

 

To detach a device/LUN, run this command:

# esxcli storage core device set --state=off -d <NAA ID>

To verify that the device is offline, run this command:

# esxcli storage core device list -d <NAA ID>

The output, which shows that the status of the disk is off, is similar to:

naa.60a98000572d54724a34655733506751

   Display Name: NETAPP Fibre Channel Disk (naa.60a98000572d54724a34655733506751)

   Has Settable Display Name: true

   Size: 1048593

   Device Type: Direct-Access

   Multipath Plugin: NMP

   Devfs Path: /vmfs/devices/disks/naa.60a98000572d54724a34655733506751

   Vendor: NETAPP

   Model: LUN

   Revision: 7330

   SCSI Level: 4

   Is Pseudo: false

   Status: off

   Is RDM Capable: true

   Is Local: false

   Is Removable: false

   Is SSD: false

   Is Offline: false

   Is Perennially Reserved: false

   Thin Provisioning Status: yes

   Attached Filters:

   VAAI Status: unknown

   Other UIDs: vml.020000000060a98000572d54724a346557335067514c554e202020

Running the partedUtil getptbl command on the device shows that the device is not found.

For example:

# partedUtil getptbl /vmfs/devices/disks/naa.60a98000572d54724a34655733506751

 

Error: Could not stat device /vmfs/devices/disks/naa.60a98000572d54724a34655733506751- No such file or directory.

Unable to get device /vmfs/devices/disks/naa.60a98000572d54724a34655733506751
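If you detach the wrong device by mistake, the same set command will bring it back (assuming the LUN is still presented to the host):

# esxcli storage core device set --state=on -d <NAA ID>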

 

AFTER DETACHING RESCAN LUNS

In our case we're vMotioning the VMs off, then powering the ESX hosts down and up, since they have issues and we want to force all references to be updated from a fresh discovery, which is the safest option

You can rescan through VC under normal operations – the "rescan all" can cause errors however, and it's better to selectively scan adapters to avoid stressing the system

Alternatively, there's the command-line equivalent, which needs to be run on all affected servers:

# esxcli storage core adapter rescan [ -A vmhba# | --all ]
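If you want to scan selectively, you can list the adapters first and then rescan just the one you need – vmhba1 below is only an example, use whichever adapter the list shows:

# esxcli storage core adapter list

# esxcli storage core adapter rescan -A vmhba1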

Other useful info

  • Where existing datastores have issues and need unmounting, and VC is not working

# esxcli storage filesystem list

(see above for example output)

Unmount the datastore by running the command:

# esxcli storage filesystem unmount [-u <UUID> | -l <label> | -p <path> ]

For example, use one of these commands to unmount the LUN01 datastore:

# esxcli storage filesystem unmount -l LUN01

# esxcli storage filesystem unmount -u 4e414917-a8d75514-6bae-0019b9f1ecf4

# esxcli storage filesystem unmount -p /vmfs/volumes/4e414917-a8d75514-6bae-0019b9f1ecf4

Verify it's unmounted by again running # esxcli storage filesystem list and confirming it's removed from the list
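As a quick check you can also grep the filesystem list for the label – using the LUN01 example from above, no output means it's gone:

# esxcli storage filesystem list | grep LUN01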

The above are the actions I found useful in my environment – to read the original VMware KB article from which I gained the majority of this information, go here


December 16, 2010

SSSU and You!

Filed under: eva — raj2796 @ 11:53 am

Now, in a previous posting, I mentioned that I would talk more about SSSU, especially about how to export the information and then put it into a human-readable format. SSSU can output xml format, but that requires some type of xml parsing tool. I did mention that Microsoft's Log Parser tool could be used, but really I'm lazy, and it's a bit cumbersome. And truth be told, I never got it to work just right.
So last night I sat down and did some old fashioned thinking about how I can get the information I need, but easier and with less effort. I started toying with powershell as it's a good friend in the VMware space. I have to confess though that I have a Mac. Yes, I enjoy monotasking. Nothing wrong with that. I then realized (as I sometimes forget) that I had a whole delicious CLI on my Mac. This includes great tools like grep and diff! Well and a lot more, but I'll stick to those for now. (Note that you can get grep and diff for Windows, via either the Windows versions of those tools, or using Cygwin). Fear not win32 folks, you are covered!
Let me take a step back and cover the premise for wanting to gather information from SSSU. As we know, Command View is a web-based interface for communicating with the EVAs. While it does provide lots of information, it is troublesome to navigate around and get that information easily. One off kinds of things, certainly. But pulling in lots of info easily, I think not. One big flaw, in my mind, is that CV talks to the EVAs via the SAN, and not via IP. Why is this a flaw? Well, for instance, SSSU can’t talk directly to the EVAs themselves. Rather, it has to talk to the CV server (which is why it prompts you when you fire it up for a “manager”). This also means you can’t use SSSU to do anything if your CV server has bit the dust. But I digress.
From the arrays, I want to gather information on my vdisks, my controllers, snapshots, disks, and my disk groups. I want to gather some information once, some monthly, and some on a more regular basis.
For the vdisks, I run (via SSSU) this command: LS VDISK FULL > vdisk.txt (This will output the information into a text file in the directory where the sssu.exe is located) Then, I fire up my command line, and grep that sucker for some info:
grep "familyname\|allocatedcapacity\|redundancy\|diskgroupname" textfile > date_vdisk.txt
This output will give me a file, named with the date, that has the information I am specifically looking for.

https://i2.wp.com/www.virtualizetips.com/wp-content/uploads/2010/09/092510_1620_SSSUandYou1.png
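If you want to skip the manual renaming, the grep and the dated filename can be combined into one line – a rough sketch, assuming the SSSU export is sitting in the current directory as vdisk.txt and you're in a normal shell (Mac, Linux or Cygwin):

grep "familyname\|allocatedcapacity\|redundancy\|diskgroupname" vdisk.txt > "$(date +%Y%m%d)_vdisk.txt"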

As stated before, I am quite lazy and so I could use (or you could use) awk (another great command line text processor) to generate the output in a better format. But instead, I keep it like this. Note that allocatedcapacity is the vdisk size in GB. Now, since I’m generating these files monthly, I can use the diff command to compare two months and see what has changed (disk grows, adds, deletes, etc).
diff -y date_vdisk1.txt date_vdisk2.txt | grep "familyname\|allocatedcapacity"

https://i0.wp.com/www.virtualizetips.com/wp-content/uploads/2010/09/092510_1620_SSSUandYou2.png

Note the | in there. The older date is on the left, and the newer date is on the right. So it’s easy to see which has changed and by how much. Arguably you could make this even easier, but again, lazy. And this works for me, so your mileage may vary.
Since these are simple text files, it’s easy and pain free to keep them around. Overall, I use this information for vdisks to track growth, easy at a glance for what vdisk is what size/raid level, and you can also pull out info to find out what controller has what disks.
This leads me into talking about what information I grab from my controllers. Now, one thing to note: The EVA4400 series only has one listed Controller (in this case Controller 1). This is because of how it is designed: both controllers are in the same housing, sharing a back plane. We have three 8100 series, each having two physically separate controllers, listed as Controller 1 and Controller 2.
First, to find out ALL the info on your controllers, do LS CONTROLLER FULL in SSSU. The output will be big and full of interesting details. One other thing to note: SSSU denotes them as Controllers 1 and 2. Command View denotes them as Controllers A and B. Lame! For what I need, I don’t need to keep controller info like I do vdisk info. I will do an initial grab after an XCS code update to keep handy.
One pretty handy way to find out what snapshots you have on any given EVA for a point in time is to use LS SNAPSHOT. You can also do an LS SNAPSHOT FULL if you want the full info per snapshot (like the vdisk info). The key difference between a snapshot and a normal vdisk is the line item sharingrelationship. A normal vdisk will have none, but a snapshot will say snapshot.

https://i2.wp.com/www.virtualizetips.com/wp-content/uploads/2010/09/092510_1620_SSSUandYou3.png

When it comes to gathering information on disks, I use this primarily to check firmware levels. If you are an EVA owner, you know that part of the care and feeding of an EVA is making sure all drives are at the most current level of firmware. Updates are usually done and bundled with XCS updates. One thing to be aware of is that with drive failures, replacements may not always come with the latest firmware. They should, but I have not always seen that. Thankfully firmware updates are non-invasive (for the most part). I will cover an XCS code upgrade in a future blog post (our EVA4400 is due).
So, if you do LS DISK FULL from SSSU, you will get all the info from each disk. You can then just grep for fun and profit!
grep "diskname\|firmware" disks.txt

https://i0.wp.com/www.virtualizetips.com/wp-content/uploads/2010/09/092510_1620_SSSUandYou4.png

So you are saying, hey that's great, but I have multiple kinds of disks in my EVA. So, you need to know what model each drive is, so you can keep things sorted as to what firmware is for what drive. The easy way is to just sort by disk group. Since you built your EVA correctly, you know only to put drives of the same type and speed into the same group, right?
grep "diskname\|diskgroupname\|firmware" disks.txt

https://i1.wp.com/www.virtualizetips.com/wp-content/uploads/2010/09/092510_1620_SSSUandYou5.png

You can also grab the model number for the drives by tossing in modelnumber to the grep.
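For example, the same pattern as above with modelnumber added:

grep "diskname\|diskgroupname\|modelnumber\|firmware" disks.txt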
And finally, since you all are probably bored to tears now, I grab the size of the disk group to know what kind of overall growth or change occurs on each group. I can also use this information to plan out new server storage requests, and manage to 90% capacity. Easier to give management a shiny graph saying “Look, almost out of disk. Checkbook time!”
Okay, so that about wraps up what I use SSSU for. If I think about anything else neat that I do, I’ll be sure to blog about it. The next blog topic will be about WEBES, its use, the install, and the fact that it actually works pretty good.

From h t t p://www.virtualizetips.com/2010/09/ (virtualizetips is currently hosting malware so I can't link direct to them as google/antivirus filters will block my site – 14 Oct 2011)

July 6, 2010

HP EVA 4400 Command view 9.2.1 upgrade

Filed under: eva,san — raj2796 @ 2:43 pm

Pictorial guide

https://i1.wp.com/farm5.static.flickr.com/4102/4768006510_e4224235d6_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4121/4767367609_cbda84a804_b.jpg
https://i1.wp.com/farm5.static.flickr.com/4093/4767367659_56f2ab4c50_b.jpg

https://i1.wp.com/farm5.static.flickr.com/4100/4767367767_a54aed71d2_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4135/4767367813_e314384201_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4098/4767367701_2002587855_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4082/4767367849_b3f885ae2c_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4077/4768006830_aca7230ef0_b.jpg

EVA 4400 XCS 09534000 upgrade with HP Command view 9.2 upgrade and disk drive firmware code load

Filed under: eva,san — raj2796 @ 2:39 pm

Simple upgrade. Read the upgrade guide for system-specific information. The following pictorial guide is correct for my system, which was up to date previously and does not run CA – note I upgraded my Brocade switches to 6.2.2.c in accordance with the recommendations.

Steps were:
– check sys is healthy – no errors for the last few days (quick check)
– stop DESTA
– upgrade Command View
– upgrade Java
– upgrade WOCP
– clear/save the event log
– launch CV and view events – controller event log – check the latest event is showing, else troubleshoot
– codeload disks (mine were already up to date)
– check sys is healthy – no errors for the last few days (thorough check)
– codeload XCS
– check health – close any firewall holes the installs might have opened
NOTE – there is now a CV 9.2.1 Command View upgrade – a very small patch that is fast and easy to install

CV 9.2 upgrade – run installer
https://i0.wp.com/farm5.static.flickr.com/4076/4768004548_866cf33b4a_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4080/4767365665_7aa695bbf5_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4115/4767365713_818d523158_b.jpg
https://i1.wp.com/farm5.static.flickr.com/4101/4768004690_550d1565f0_b.jpg

Server reboots then continues install post reboot

https://i0.wp.com/farm5.static.flickr.com/4116/4768004742_fc0e603e8f_b.jpg
https://i1.wp.com/farm5.static.flickr.com/4115/4767365973_eb07025692_b.jpg
https://i1.wp.com/farm5.static.flickr.com/4142/4767366085_87d3162d3b_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4097/4767366147_7de224782f_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4120/4768005100_58e61bb0ee_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4136/4768005146_13a0984d1c_b.jpg
https://i1.wp.com/farm5.static.flickr.com/4141/4767366287_efd7684ee0_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4122/4768005336_4a37a9aa91_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4137/4767366497_da119b2b0c_b.jpg
https://i1.wp.com/farm5.static.flickr.com/4101/4767366543_9b9388d1a1_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4093/4767366605_1587a95d24_b.jpg

JAVA
Upgrade via whichever method you prefer
https://i2.wp.com/farm5.static.flickr.com/4082/4767365873_35097a6602_o.jpg

WOCP
https://i1.wp.com/farm5.static.flickr.com/4080/4768005666_5f8de1e21e_b.jpg
https://i1.wp.com/farm5.static.flickr.com/4122/4768005714_15edf6972f_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4141/4768005776_fc35019742_b.jpg

Correct Firmware version post upgrade

https://i1.wp.com/farm5.static.flickr.com/4114/4767366929_293ebd7a96_o.jpg

If your upgrade loses password or network settings

https://i2.wp.com/farm5.static.flickr.com/4134/4768005838_1f94732b91_b.jpg

Code load disk drive firmware
https://i2.wp.com/farm5.static.flickr.com/4082/4767366995_0a74a7e8aa_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4098/4768005954_2cf219040b_b.jpg

CODE LOAD XCS
https://i0.wp.com/farm5.static.flickr.com/4137/4767367123_53fe25f41c_o.jpg
https://i1.wp.com/farm5.static.flickr.com/4114/4768006106_8404b8269a_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4122/4768006162_af2c3168b6_o.jpg
https://i0.wp.com/farm5.static.flickr.com/4094/4768006206_5d1e458488_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4143/4767367331_13b11256fe_b.jpg
https://i0.wp.com/farm5.static.flickr.com/4141/4768006330_9ecc375dd5_b.jpg
https://i2.wp.com/farm5.static.flickr.com/4143/4767367413_0061e98f29_o.jpg
https://i1.wp.com/farm5.static.flickr.com/4094/4767367471_35ca83c7ea_o.jpg

March 2, 2010

HP Remote Support Discovery Engine (RSDDU) upgrade fails

Filed under: eva — raj2796 @ 11:31 am

https://i0.wp.com/farm3.static.flickr.com/2636/3953151638_81504ebdd5.jpg

Recently I encountered a problem with the automated RSDDU upgrade on Command View servers at multiple sites during the automated scheduled upgrade. The upgrade failed and could not be installed or removed, either via repeated automatic attempts or manually.

The error message was:

Summary Package Removal for ‘ Remote Support Discovery Engine (RSDDU) A.05.40.00.1032’
Status Failure
Details ‘ Remote Support Discovery Engine (RSDDU) A.05.40.00.1032’ removal failed due to the following error:
An exception was thrown trying to remove the service CM_GL_RSDDU_WIN. See the log CM_GL_RSDDU_WIN_remove.log for details.

I contacted HP, who have over 100 calls logged about the same issue and are investigating a solution as we speak. The engineer who dealt with the case was clever enough to figure out a workaround, which is as follows – bear in mind this is not, at the time of writing, an official HP fix or solution, so use at your own risk; not all the steps worked for me and some paths needed changing, but it did fix the problem and RSDDU is now installed and working on multiple sites.

1. Remove RSDDU through RSSWM, if the “remove” option is available.
2. Remove the RSDDU scheduled task, RSDDU-HPNETWALKER-SCAN
3. Delete C:\Program Files\HP\CM\Lib\RSSWM\RSPS\SWM\ZSERVICE\CM_GL_RSDDU_WIN0000000.000\ZSERVICE.EDM
4. Reinstall RSDDU through RSSWM

UPDATE – the latest HP update to RSDDU seems to fix the problem – March 2010

February 17, 2010

Vmware Vsphere NMP: nmp_Device problem

Filed under: san,vmware — raj2796 @ 4:45 pm

Whilst following links from a site discussing round robin IO optimisation for EVAs, something I need to do myself in the following weeks, I came across the following info:

Recently saw a little uptick (still a small number) in customers running into a specific issue – and I wanted to share the symptom and resolution. Common behavior:

1. They want to remove a LUN from a vSphere 4 cluster
2. They move or Storage vMotion the VMs off the datastore that is being removed (otherwise, the VMs would hard crash if you just yanked out the datastore)
3. After removing the LUN, VMs on OTHER datastores would become unavailable (not crashing, but becoming periodically unavailable on the network)
4. The ESX logs would show a series of errors starting with “NMP”

Examples of the error messages include:

“NMP: nmp_DeviceAttemptFailover: Retry world failover device “naa._______________” – failed to issue command due to Not found (APD)”

“NMP: nmp_DeviceUpdatePathStates: Activated path “NULL” for NMP device “naa.__________________”.

What a weird one… I also found that this was affecting multiple storage vendors (suggesting an ESX-side issue). You can see the VMTN thread on this here.

So, I did some digging, following up on the VMware and EMC case numbers myself.

Here’s what’s happening, and the workaround options:

When a LUN supporting a datastore becomes unavailable, the NMP stack in vSphere 4 attempts failover paths, and if no paths are available, an APD (All Paths Dead) state is assumed for that device (starts a different path state detection routine). If after that you do a rescan, periodically VMs on that ESX host will lose network connectivity and become non-responsive.

This is a bug, and a known bug.

What was commonly happening in these cases was that the customer was changing LUN masking or zoning in the array or in the fabric, removing it from all the ESX hosts before removing the datastore and the LUN in the VI client. It is notable that this could also be triggered by anything making the LUN inaccessible to the ESX host – intentional, outage, or accidental.

Workaround 1 (the better workaround IMO)

This workaround falls under “operational excellence”. The sequence of operations here is important – the issue only occurs if the LUN is removed while the datastore and disk device are expected by the ESX host. The correct sequence for removing a LUN backing a datastore is:

1. In the vSphere client, vacate the VMs from the datastore being removed (migrate or Storage vMotion)
2. In the vSphere client, remove the Datastore
3. In the vSphere client, remove the storage device
4. Only then, in your array management tool remove the LUN from the host.
5. In the vSphere client, rescan the bus.

Workaround 2 (only available in ESX/ESXi 4 u1)

This workaround is available only in update 1, and changes what the vmkernel does when it detects this APD state for a storage device, basically just immediately failing to open a datastore volume if the device’s state is APD. Since it’s an advanced parameter change – I wouldn’t make this change unless instructed by VMware support.

esxcfg-advcfg -s 1 /VMFS3/FailVolumeOpenIfAPD
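You can read the current value back with -g, which is handy for confirming the change took:

esxcfg-advcfg -g /VMFS3/FailVolumeOpenIfAPD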

Some QnA:

Q: Does this happen if you’re using PowerPath/VE?

A: I'm not sure – but I don't THINK that this bug would occur for devices owned by PowerPath/VE (since it replaces the bulk of the NMP stack in those cases) – but I need to validate that. This highlights to me, at least, how important these little things (in this case path state detection) are in the entire storage stack.

In any case, thought people would find it useful to know about this, and it is a bug being tracked for resolution. Hope it helps one customer!

Thank you to a couple customers for letting me poke at their case, and to VMware Escalation Engineering!

The above post is from virtualgeek

October 15, 2009

VMware ESX 3.5 MRU policy and path persistence

Filed under: san,vmware — raj2796 @ 3:31 pm

VMware ESX 3.5 – MRU policy and path persistence from sakacc on youtube – lots of other interesting videos from the same person on youtube if you’re into vmware.

For other HP EVA 4400 users, or EVA 6xxx/4xxx users, the HP recommended settings for ESX 4 are Round Robin or MRU, with the IOPS counter changed from 1000 to 1 to allow true load balancing. Round Robin however gives the best performance, though please note that versions below ESX 4 were not truly ALUA aware.
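As a rough sketch of how that is usually applied per device from the ESX 4 command line – the naa.xxxx value below is a placeholder for your own device ID, and you should check HP's current best-practice guide before changing anything:

esxcli nmp device setpolicy --device naa.xxxxxxxxxxxxxxxx --psp VMW_PSP_RR

esxcli nmp roundrobin setconfig --device naa.xxxxxxxxxxxxxxxx --type iops --iops 1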

Vmware Netware Tivoli Slow backup performance tuning parameters for NSS and TSAFS.NLM on an EVA 4400

Filed under: edir,eva,Netware,Software,Tivoli — raj2796 @ 11:57 am


Before I cover what works for me, I have posted below the official TID on this issue, since different people will have differing environments/versions/setups to mine and will find it useful:

There are many issues that can affect backup/restore performance. There is tuning that can be done on the server and NSS volumes. These are only ballpark figures. The server must be benchmarked to find the optimum settings.

These two parms must be set in c:\nwserver\nssstart.cfg. Make sure there are no typos or NSS won’t load. Nssstart.cfg is not created by default.
/AuthCacheSize=20000
/NumWorkToDos=100
These parms can be set in AUTOEXEC.NCF. Note: If these are placed in this file they must start with NSS. For example – nss /ClosedFileCacheSize=2000. They can also be placed in the C:\NWSERVER\NSSSTART.CFG and there they would be used without the NSS in the beginning.

/ClosedFileCacheSize=200000
/MinBufferCacheSize=20000
/MinOsBufferCacheSize=20000
/CacheBalanceMaxBuffersPerSession=20000
/CacheUserMaxPercent=70
/AllocAheadBlks=63
/NameCacheSize=200000
/NoCopyBuffersOnXlatch
/ReadAheadBlks=<volume>:64 — on NetWare 6.5 boxes. A line must be added for each volume, with the volume name before the colon. This sets a count for the number of 4k blocks to read with each request. In this case, 256k at a time.
These settings are ballpark figures. They may need to be adjusted depending on how much ram the server has.
Setting these too high can cause excessive memory usage and can affect other apps as well as performance. The closed file cache size and the name cache size, if set too high, can cause NSS.NLM to take excessive amounts of memory. These can help performance, but experience shows that there are usually several problems that add up to one big problem. Setting these two parms too high can actually degrade performance. If the server has about 2 gig or less, then the default of 100000 should be used.

1. Make sure you have the latest updates for the tape software.
2. Faster hardware can make a big difference.
3. The type of data can make a huge difference. Lots of small files will slow down performance, especially if they're all in one directory. The backup will spend more time opening, scanning and closing files rather than reading data. If there are more large files mixed in with the smaller ones, then performance can increase because more time is spent reading data rather than opening files, which is what increases throughput.
4. Background processes like compression, virus scans and large data copies will slow performance down.
5. Virus scanners also can be an issue. They usually hook into the OS file system to intercept file opens so they can scan the files prior to backup. The virus scanner can be configured to run at some other time than the backup. This can also compound the problem if the files being scanned are compressed. The virus scanner can decompress them before scanning for viruses, which will slow things down even more. A good way to see if this is happening is to enable the NSS /COMPSCREEN at the server console during the backup to see if files are being decompressed.
6. Lots of open files will slow down performance. These are usually seen with the error FFFDFFF5. This means the file is open by some other application. If the tape software can be configured to skip open files until the end of the job rather than retrying to open them immediately, then performance can be increased, as some tape software solutions, by default, will retry to open the locked file multiple times before moving on.
7. Backing up over the wire is slower than backups local to the server, especially if most of the files are small files, 64k or less. If there is any LAN latency, performance can take a significant hit. The wire is much slower at transferring data than reading the data directly from the disk. One thing that may help is to

set tcp nagle algorithm=off
set tcp delayed acknowledgement=off
set tcp sack option=off

on both host and target servers.

Tsatest can be used to determine if the LAN is a bottleneck. There is more information about tsatest below.

8.
– Make sure you have the latest disk drivers and firmware updates for your HBAs. There have been issues where performance was increased greatly because of later firmware/drivers.
– Use the tsatest.nlm utility on different LAN segments to see if there is a problem. This tool now ships with tsa5up19.exe. Tsatest can be used to test the throughput on the wire and on the server itself to see if the LAN could be a bottleneck. Tsatest is also useful because it does not require a tape drive, so the tape drive can be eliminated as a possible problem as well.
-Make sure you have the latest tsa files.

-Raid5 systems with a small stripe size can also be a problem. Check the configuration of the disk storage or san. If using a raid system, a larger stripe size can help performance.

-Creating one large LUN on the raid rather than several smaller ones can result in significant performance loss. It’s faster to have multiple luns with the volumes/data spread out over them.

-Make sure you have the latest bios/firmware updates for your server.

-There have been issues where full backups are fast and incremental/differential backups are slow. This can happen because of the tape software doing its own filtering on inc/diff backups rather than letting the tsafs.nlm do it. There is a parm in tsafs.nlm that can help this:

LOAD TSAFS /NOCACHINGMODE

This will disable the read ahead cache for tsafs.nlm so that files are not cached unnecessarily during inc/diff backups. You can re-enable this cache when doing full backups:

LOAD TSAFS /CACHINGMODE

This is a load time parameter so you could create a script that would load/unload tsafs accordingly.

Tsafs itself can also be tuned. Once tsafs is loaded, typing tsafs again at the server console will show what most of the parameters are set to. If most of the data consists of small files, then make a best estimate as to what the mean file size is. That will help in determining the best size for the read buffers. Tsafs could then be tuned to favor smaller files with:

tsafs /ReadBufferSize=16384

That would set the read buffers for tsafs to 16k. If the mean file size is 16k or less, that would enable tsafs to read the files with fewer read requests. Setting the NSS cache balance to a lower percent would give tsafs more memory for caching files. If the mean file size is 64k or thereabouts, set tsafs /readbuffersize=65536. The read buffers in the tape software could also be set to similar values.

tsafs /cachememorythreshold=5

may help as well. There have been problems with memory when setting this value too high. 10 would be a good place to start. The recommended setting is 1 for servers that have memory fragmentation problems. If the server has more memory, then even a setting of 1 would give tsafs more memory to cache file data.

– On servers that have 4 or 2 processors, the tsafs /readthreadsperjob=x can be set to 2 or 4. On machines with only one processor, set the /readthreadsperjob=1. Setting the /readthreadsperjob too high will result in performance loss.

-Tsatest is a good tool for finding out where potential bottlenecks are. This is an nlm that can be loaded on the target server for a local backup, or from another NetWare server over the wire. It’s a backup simulator that requires no special hardware, tape drives, databases, etc. By loading tsatest on the target server, the wire and tape software can be eliminated as potential bottlenecks. Throughput can be gauged and then a backup can be done over the wire to see if the lan could be slowing things down. For a complete listing of tsatest load line parameters, type tsatest /?. Usually it’s loaded like this:

load tsatest /s= /u= /p= /v=

Individual paths can be specified as well. By default, tsatest will do full backups. An incremental backup can be specified by adding the /c=2 parameter to the load line. The sys:\etc\tsatest.log file can be created with the /log parameter. This file can be sent to Novell for analysis.
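As a purely made-up example of the load line with everything filled in – the server, user, password and volume names below are placeholders, not real values:

load tsatest /s=FS1 /u=admin.context /p=secret /v=VOL1 /c=2 /log

That would run an incremental (/c=2) backup simulation against volume VOL1 and write sys:\etc\tsatest.log.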
Backup/restore performance can be reduced when backing up over the LAN. Sometimes up to half of the performance can be lost due to LAN latency alone. Tsatest is a good way to determine if that's happening. Tests can be run on the target server itself, and then the target server can be backed up over the wire from another NetWare server. The results can be compared.
For a good document on tsatest read:

http://developer.novell.com/ndk/doc/samplecode/smscomp_sample/tsatest/tsatest.html

Our renewed Tivoli-on-NetWare problems arose when we started to migrate our users to the 9 new virtual NetWare 6.5 SP8 servers I built on a couple of EVA 4400s at our two sites. The virtual NetWare 6.5 SP8 servers are running on HP DL380 G5s with 32 GB of RAM. Each virtual server has 4 GB of RAM dedicated to it.

Utilising my previous experience with Tivoli and the problems it causes, I changed the tsafs parameters. To do this you first need to unload Tivoli on the NetWare servers via the command line:

type > unload dsmcad

Enter confirmation on the Tivoli screens

Now you need to unload TSAFS, which is originally loaded via smsstart.ncf

type > smsstop.ncf

Now that both Tivoli and TSAFS and related services are stopped, navigate to the file

\\SYS\SYSTEM\smsstart.ncf

Change the file from:

LOAD SMSUT.NLM
LOAD SMDR.NLM
LOAD TSAFS.NLM

to:

LOAD SMSUT.NLM
LOAD SMDR.NLM
LOAD TSAFS.NLM /NoCluster /NoCachingMode /noConvertUnmappableChars /CacheMemoryThreshold=10

Now restart the backup services

type > smsstart

Next restart Tivoli; change the commands if you're not using a newer version of Tivoli, and also remove the second line if you don't use the web interface:

type > dsmcad -optfile=dms.opt
type > dsmcad -optfile=dsm_gui.opt

CacheMemoryThreshold is set to the default of 10 on the servers, however they barely use any memory, as you can see in the memory usage chart for the server below; I might try increasing it to 25 to see if it speeds up backups. There are under a million files on each server at the moment, however they are only running at 40% load since we haven't finished moving all the users onto them yet.

The changes I've listed above were made at the end of work yesterday. I changed the tsafs load parameters on the server shown below and it seems to have done the trick – backup times reduced by 11 hours! Copies of the backup schedule reports are below the memory diagram for those interested in the speed increase and time reduction.

Server mem usage

Tuesday Night/Wednesday morning
10/14/2009 11:21:00 — SCHEDULEREC STATUS BEGIN
10/14/2009 11:21:00 Total number of objects inspected: 814,790
10/14/2009 11:21:00 Total number of objects backed up: 20,070
10/14/2009 11:21:00 Total number of bytes transferred: 596.97 MB
10/14/2009 11:21:00 Data transfer time: 1,043.65 sec
10/14/2009 11:21:00 Network data transfer rate: 585.73 KB/sec
10/14/2009 11:21:00 Aggregate data transfer rate: 12.99 KB/sec
10/14/2009 11:21:00 Objects compressed by: 0%
10/14/2009 11:21:00 Elapsed processing time: 13:04:12
10/14/2009 11:21:00 — SCHEDULEREC STATUS END

Wednesday Night/Thursday morning
10/15/2009 00:36:48 — SCHEDULEREC STATUS BEGIN
10/15/2009 00:36:48 Total number of objects inspected: 821,288
10/15/2009 00:36:48 Total number of objects backed up: 15,844
10/15/2009 00:36:48 Total number of bytes transferred: 562.12 MB
10/15/2009 00:36:48 Data transfer time: 510.50 sec
10/15/2009 00:36:48 Network data transfer rate: 1,127.53 KB/sec
10/15/2009 00:36:48 Aggregate data transfer rate: 72.26 KB/sec
10/15/2009 00:36:48 Objects compressed by: 0%
10/15/2009 00:36:48 Elapsed processing time: 02:12:45
10/15/2009 00:36:48 — SCHEDULEREC STATUS END

September 29, 2009

HP Storage roadmap

Filed under: eva,san — raj2796 @ 2:53 pm

Hp invent

Just saw an article on http://www.theregister.co.uk by Chris Mellor, posted in storage, that I found interesting:

HP’s EVA arrays will get thin provisioning and automated LUN migration, while LeftHand’s SAN software will be ported to the XEN and Hyper-V platforms, according to attendees at an HP Tech Day event.

This HP StorageWorks event took place in Colorado Springs yesterday with an audience of invited storage professionals. Several of them tweeted from the event, revealing what HP was saying.
EVA

HP commented that it thought its EVA mid-range array was an excellent product but hadn’t got the sales it deserved, with only around 70,000 shipped.

Currently EVA LUNs can be increased or decreased in size without stopping storage operations. This is suited to VMware operations, with the vSphere API triggering LUN size changes.

LUN migration, possibly sub-LUN migration, with EVAs is set to become non-disruptive in the future, according to attendees Devang Panchigar, who blogs as StorageNerve, and Steven Foskett. He said the EVA: “supports 6 to 8 SSDs in a single disk group, [and is] adding automated LUN migration between tiers.”

The EVA will be given thin provisioning functionality, the ability to create LUNs that applications think are fully populated with storage but which actually only get allocated enough disk space for data writes plus a buffer, with more disk space allocated as needed. Older 6000 and 8000 class EVA products won’t get thin provisioning, however, only the newer products.

In a lab session, attendees were shown that it was easier to provision a LUN and set up a snapshot on EVA than on competing EMC or NetApp products.

A common storage architecture

HP people present indicated HP was going to move to common hardware, including Intel-powered controllers, for its storage arrays. El Reg was given news of this back in June

Since HP OEMs the XP from Hitachi Data Systems, basing it on HDS's USP-V, this might encourage HDS to move to an Intel-based controller in that array.

Moving to a common hardware architecture for classic dual controller-based modular arrays is obviously practical, and many suppliers have done this. However high-end enterprise class arrays often have proprietary hardware to make them handle internal data flows faster. BlueArc has its FPGA-accelerated NAS arrays and 3PAR uses a proprietary ASIC for internal communications and other functions. Replacing these with Intel CPUs would not be easy at all.

Gestalt IT has speculated about EMC moving to a common storage hardware architecture based on Intel controllers. Its Symmetrix V-Max uses Intel architecture, and the Celerra filer and Clariion block storage arrays look like common hardware with different, software-driven, storage personalities.

There are hardware sourcing advantages here, with the potential for simplified engineering development. It could be that HP is moving to the same basic model of a common storage hardware set with individualised software stacks to differentiate filers, block arrays, virtual tape libraries and so forth. For HP there might also be the opportunity to use its own x86 servers as the hardware base for storage controllers.

It expects unified (block and file), commodity-based storage to exceed proprietary enterprise storage shipments in 2012.
SAN virtualisation software

HP OEMs heterogeneous SAN management software, called the SAN Virtualisation Services Platform (SVSP), from LSI, which obtained the technology when it bought StorAge in 2006. This software lets HP virtualise EVA and MSA arrays, plus certain EMC, IBM and Sun arrays, into a common block storage pool. The XP can't be part of that pool, though. Also, the XP, courtesy of HDS, virtualises third-party arrays connected to it as well. HP indicated that it is using the SVSP to compete with IBM's SAN Volume Controller. Since the SVC is a combined hardware and software platform with well over 10,000 installations, HP has a mountain to climb.

Also the SVSP is a box sitting in front of the arrays with no performance-enhancing caching functionality. It could be that HP has a hardware platform refresh coming for the SVSP.
LeftHand Networks SAN software

HP also discussed its LeftHand Storage product, which is software running in Intel servers which virtualises a server’s storage into a SAN. This scales linearly up to 30 nodes. The software can run as a VMware virtual machine in VSA (Virtual SAN Appliance) form. The scaling with LeftHand is to add more nodes and scale-out, whereas traditional storage scales-up, adding more performance inside the box.

HP also has its Ibrix scale-out NAS product which is called the Fusion Filesystem.

The LeftHand software supports thin storage provisioning and this is said to work well with VMware thin provisioning. We might expect it to be ported to the Hyper-V and XEN hypervisors in the next few months.

HP sees Fibre Channel over Ethernet (FCoE) becoming more and more influential. It will introduce its own branded FCoE CNA (Converged Network Adapter) within months.

HP also confirmed that an ESSN organisation will appear next year with its own leader. This was taken as a reference to Enterprise Servers, Storage and Networks, with David Donatelli, the recent recruit from EMC, running it.

Nothing was said about deduplication product futures or about the ExDS9100 extreme file storage product. Neither was anything said about the HP/Fusion-io flash cache-equipped ProLiant server demo recording slightly more than a million IOPS from its 2.5TB of Fusion-io flash storage. ®

link to theregister article

September 25, 2009

EVA4400 OS unit ID in VDisk presentation properties error

Filed under: eva,san — raj2796 @ 11:43 am

hp

The Vdisk OS Unit ID is cleared from the graphical user interface (GUI) and set to 0 after upgrading to XCS 9522000. Seems this is yet another known bug; from HP:

SUPPORT COMMUNICATION – CUSTOMER ADVISORY

Document ID: c01849072

Version: 1
ADVISORY: HP StorageWorks Command View EVA software refresh clears virtual disk operating system unit ID in user interface
NOTICE: The information in this document, including products and software versions, is current as of the Release Date. This document is subject to change without notice.

Release Date: 2009-08-19

Last Updated: 2009-08-19
DESCRIPTION

After an HP StorageWorks Command View EVA software refresh, the Vdisk OS Unit ID is cleared from the graphical user interface (GUI). This only affects the setting within the GUI and not in the actual virtual disk setting.

For customers using OpenVMS hosts, not having the OS Unit ID displayed as a visible reminder could lead to duplicate unit IDs entered and presented to a host, which can result in data loss.

Also, when saving any changes under the Vdisk Presentation tab, ensure that the OS Unit ID is set to the desired value.
SCOPE

This issue affects the GUI setting in HP StorageWorks Command View EVA software versions 9.00.00, 9.00.01, 9.01.00, and 9.01.01. The issue has not been seen to affect the actual virtual disk setting.
RESOLUTION

This issue will be addressed in next full release of HP StorageWorks Command View EVA software, tentatively scheduled for release in early 2010.
Workaround

To track OS Unit IDs, use one of the following options:

* In the HP Command View EVA software GUI, add OS Unit ID and presentation host ID information to the comment field of the Virtual Disk.

* Using HP SSSU, write a script to collect the OS Unit ID from the array for each virtual disk presented to the OpenVMS host.

* Use the following command to collect the WWID and the OS Unit ID numbers on the OpenVMS hosts:

$ SHOW DEVICE/FULL $1$DGA

In the output, the DGA device number is the OS Unit ID. The WWID listed in the output can be used to match to the WWID either from the HP Command View EVA display or from an SSSU data collection script.
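Tying this back to the SSSU post above, the scripted option can be as simple as grepping an LS VDISK FULL export – this assumes the export is in vdisk.txt and that the unit ID field shows up as osunitid in your SSSU output (check one vdisk's full listing first to confirm the field name):

grep "familyname\|osunitid" vdisk.txt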

