vExpert 2012 – the mutual benefit of the 1%

Firstly, this is not about the 1% associated with the Occupy Wall St campaign! As widely reported on Twitter and around the blogosphere, the 2012 vExpert program is up and running – I won't go into the changes this year as there is plenty of coverage for that elsewhere. In VMware's own words;

The annual VMware vExpert title is given to individuals who have significantly contributed to the community of VMware users over the past year. The title is awarded to individuals (not employers) for their commitment to sharing their knowledge and passion for VMware technology above and beyond their job requirements.

Sounds great, so let's fill in that application form, right? Before you apply, have you ever paused to consider what it is you're actually doing, and for whom? In an interesting article about 'going social' posted just a few weeks ago, Dr Michael Hu talked about six myths companies believe are associated with a social strategy, one of which is the need to reach every customer to be effective. He refutes this, stating;

Instead, you need to discover the small number of "superfans" who want deeper engagement and then harness their enthusiasm to manage and strengthen other customer relationships on behalf of the brand. That's the real power of community – you tend to the 1% who tend the other 99%.

That describes the vExpert in a nutshell – you are the 1%!

You could see this through cynical eyes as VMware using the community for their own benefit, but like many of my peers I've been working in IT for well over a decade and virtualisation is the first time I've found a community that really benefits everyone involved. Maybe it's the advent of social networking, maybe it's the convergence of the various technologies, or maybe it's the time and effort expended by VMware (and geek herder extraordinaire @jtroyer), but for some reason it works where it never did before. I enjoy being part of the VMware community and I know it adds value for me (and therefore my employer) and for many other people. While the 1% add great value on VMware's behalf, they also benefit greatly from the experience themselves. Just bear in mind that, much as we'd all like VMware's recognition, VMware need us too!

I’m already vExperienced and I’d love to be a vExpert. Fingers crossed!

ps. Apologies to Alex Maier who now runs the vExpert program – I’d already made up my ‘poster’ before I knew!

Space: the final frontier (gotcha upgrading to vSphere5 with NFS)

———————————————–

UPDATE March 2012 – VMware have just confirmed that the fix will be released as part of vSphere5 U2. Interesting, because as of today (March 15th) Update 1 hasn't even been released – how much longer will that be, I wonder? I'm also still waiting for a KB article but it's taking its time…

UPDATE May 2012 – VMware have just released article KB2013844 which acknowledges the problem – the fix (until update 2 arrives) is to rename your datastores. Gee, useful…  🙂

———————————————–

For the last few weeks we’ve been struggling with our vSphere5 upgrade. What I assumed would be a simple VUM orchestrated upgrade turned into a major pain, but I guess that’s why they say ‘never assume’!

Summary: there’s a bug in the upgrade process whereby NFS mounts are lost during the upgrade from vSphere4 to vSphere5;

  • if you have NFS datastores with a space in the name
  • and you’re using ESX classic (ESXi is not affected)

Our issue was that after the upgrade completed the host would start back up, but the NFS mounts would be missing. As we use NFS almost exclusively for our storage this was a showstopper. We quickly found that we could simply remount the NFS datastores with no changes or reboots required, so there was no obvious reason why the upgrade process didn't remount them. With over fifty hosts to upgrade however, the required manual intervention meant we couldn't automate the whole process (OK, PowerCLI would have done the trick but I didn't feel inspired to code a solution) and we aren't licensed for Host Profiles, which would also have made life easier. Thus started the process of reproducing and narrowing down the problem.

  • We tried both G6 and G7 blades as well as G6 rack mount servers (DL380s)
  • We ran interactive installs from a DVD of the VMware ESXi v5 image
  • We used VUM to upgrade hosts using both the VMware ESXi v5 image and the HP ESXi v5 image
  • We upgraded from ESX v4.0 U1 to ESX 4.1 and then on to ESXi v5
  • We used storage arrays running both Netapp ONTAP v7 and ONTAP v8 (to minimise the possibility of the storage array firmware being at fault)
  • We upgraded hosts both joined to and isolated from vCentre

Every scenario we tried produced the same issue. We also logged a call with VMware (SR 11130325012) and yesterday they finally reproduced and identified the issue as a space in the datastore name. As a workaround you can simply rename your datastores to remove the spaces, perform the upgrade, and then rename them back. Not ideal for us (we have over fifty NFS datastores on each host) but better than a kick in the teeth!
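
If you want to check your exposure before scheduling the upgrade, a quick PowerCLI check along these lines will list the datastores that would trip the bug (a sketch only, assuming you're already connected to vCentre with Connect-VIServer; the rename itself is easy enough in the vSphere Client):

# List NFS datastores whose names contain a space - these are the ones
# the vSphere5 upgrade will fail to remount on ESX classic hosts.
Get-Datastore | Where-Object { $_.Type -eq 'NFS' -and $_.Name -match ' ' } |
    Select-Object Name, Type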

There will be a KB article released shortly so until then treat the above information with caution – no doubt VMware will confirm the technical details more accurately than I have done here. I’m amazed that no-one else has run into this six months after the general availability of vSphere5 – maybe NFS isn’t taking over the world as much as I’d hoped!  I’ll update this article when the KB is posted but in the meantime NFS users beware.
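
And if you've already been bitten, remounting the lost datastores from PowerCLI is straightforward. The sketch below is illustrative only: the host name, filer name, datastore names and export paths are made-up placeholders and you'd feed in your own list (from a CSV, or pulled from the array):

# Remount known NFS exports on a freshly upgraded host.
# Host name, filer name, datastore names and paths are examples only.
$vmhost = Get-VMHost -Name 'esxhost01.example.com'
$nfsMounts = @(
    @{Name = 'NFS Datastore 01'; NfsHost = 'netapp01.example.com'; Path = '/vol/nfs_datastore_01'},
    @{Name = 'NFS Datastore 02'; NfsHost = 'netapp01.example.com'; Path = '/vol/nfs_datastore_02'}
)
foreach ($mount in $nfsMounts) {
    # Anything still mounted will simply throw an error, which we suppress
    New-Datastore -VMHost $vmhost -Nfs -Name $mount.Name `
        -NfsHost $mount.NfsHost -Path $mount.Path -ErrorAction SilentlyContinue
}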

Sad I know, but it’s kinda nice to have discovered my own KB article. Who’d have thought that having too much space in my datastores would ever cause a problem? 🙂

Preventing Oracle RAC node evictions during a Netapp failover

While undertaking some scheduled maintenance on our Netapp shared storage (due to an NVRAM issue) we discovered that some of our Oracle applications didn’t handle the controller outage as gracefully as we expected. In particular several Oracle RAC nodes in our dev and test environments rebooted during the Netapp downtime. Strangely this only affected our virtual Oracle RAC nodes so our initial diagnosis focused on the virtual infrastructure.

Upon further investigation however we discovered that there are timeouts in the Oracle RAC clusterware settings which can result in node reboots (referred to as evictions) to preserve data integrity. This affects both Oracle 10g and 11g RAC database servers, although the fix for both is similar. NOTE: We've been running Oracle 10g for a few years but hadn't hit this problem previously as the 10g default timeout of 60 seconds is higher than the 30-second default for 11g.

Both Netapp and Oracle publish guidance on this issue;

The above guidance focuses on the DiskTimeOut parameter (known as the voting disk timeout) as this is impacted if the voting disk resides on a Netapp. What it doesn't cover is when the underlying Linux OS also resides on the affected Netapp, as it can with a virtual Oracle server (assuming you want HA/DRS). In this case there is a second timeout value, misscount, which is shorter than the disk timeout (typically 30 seconds instead of 200). If a node can't reach any of the other RAC nodes within the misscount timeframe it will start split-brain resolution and probably evict itself from the cluster by rebooting. When the Netapp failed over, our VMs were freezing for longer than 30 seconds, causing the reboots. After we increased the network timeout we were able to successfully fail over our Netapps with no impact on the virtual RAC servers.

NOTE: A cluster failover (CFO) is not the only event which can trigger this behaviour. Anything which impacts the availability of the filesystem such as I/O failures (faulty cables, failed FC switches etc) or delays (multipathing changes) can have a similar impact. Changing the timeout parameters can impact the availability of your RAC cluster as increasing the value results in a longer period before the other RAC cluster nodes react to a node failure.

Configuring the clusterware network timeouts

The changes need to be applied within the Oracle application stack rather than at the Netapp or VMware layer. On the RAC database server, check the cssd.log logfile to understand the cause of the node eviction. If you think it's due to a timeout you can change it using the command below;

# $GRID_HOME/bin/crsctl set css misscount 180 

To check the new setting has been applied;

# $GRID_HOME/bin/crsctl get css misscount

The clusterware needs a restart for the new values to take effect, so bounce the cluster;

# $GRID_HOME/bin/crs_stop -all
# $GRID_HOME/bin/crs_start -all

Further Reading

Netapp Best Practice Guidelines for Oracle Database 11g (Netapp TR3633). Section 4.7 in particular is relevant.

Netapp for Oracle database (Netapp Verified Architecture)

Oracle 10gR2 RAC: Setting up Oracle Cluster Synchronization Services with NetApp Storage for High Availability (Netapp TR3555).

How long it takes for Standard active/active cluster to failover

Node evictions in RAC environment

Troubleshooting broken clusterware

Oracle support docs (login required);

  • Note 284752.1 – 10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout
  • Note 559365.1 – Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
  • Note 265769.1 – Troubleshooting 10g and 11.1 Clusterware Reboots
  • Note 783456.1 – CRS Diagnostic Data Gathering: A Summary of Common tools and their Usage

VCAP5 exams – on your marks….

In last night's VMware Community podcast John Hall, VMware's lead technical certification developer, gave some tidbits of information about the upcoming VCAP5 exams;

  • There will be an expedited path for those with VCAP4 certifications BUT it will be similar to the VCP upgrade in that it'll be a time-limited offer. He didn't specify exactly what form this would take, but with the VCP upgrade you have roughly six months to take the new exam with no course prerequisites. I'm guessing you'll have a similar period where the VCP5 prerequisite doesn't apply.

With the upcoming Feb 29th deadline for the VCP5 exam you'd better get your study skates on. If you don't take the VCP5 before the 29th and you're not in a position to take the new VCAP5 exams in the 'discount' period (however long that turns out to be) you might find yourself needing to sit a What's New course and pass the VCP5 exam before you're even eligible for the VCAP5 exams. Not a pleasant thought!

PowerCLI v5 – gotcha if you use guest OS cmdlets

UPDATE FEB 2012 – After some further testing I've concluded that this is a bigger pain than I previously thought. The v5 cmdlets aren't backwards compatible and the v4 cmdlets aren't forwards compatible. This means that while you're running a mixed environment with VMs on v4/v5 VMTools, a single script can't run against them all. Think audit scripts, AV update scripts etc. You'll have to run the script twice, from two different workstations: one running PowerCLI v4 (against the v4 VMs) and one running PowerCLI v5 (against the v5 VMs). And I thought this was meant to be an improvement??
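
If you're stuck in that mixed state, a rough first step is to report which VMs are on which version of the tools so you know which scripting station to point at them. This is just a sketch; the vCentre name is a placeholder and the version strings you'll see depend on your builds:

# Report the VMware Tools version/status for every VM so you know which
# VMs a PowerCLI v4 script can still reach and which need the v5 station.
Connect-VIServer -Server 'vcentre.example.com'   # placeholder vCentre name

Get-VM | Select-Object Name,
    @{Name = 'ToolsVersion'; Expression = { $_.ExtensionData.Guest.ToolsVersion }},
    @{Name = 'ToolsStatus';  Expression = { $_.ExtensionData.Guest.ToolsStatus }} |
    Sort-Object ToolsVersion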

———- original article ————–

There are quite a few enhancements in PowerCLI v5 (there’s a good summary at Julian Wood’s site) but if you make use of the guest OS cmdlets proceed with caution!

We have an automated provisioning script which we use to build new virtual servers. This does everything from provisioning storage on our backend Netapps to creating the VM and customising configuration inside the guest OS. The guest OS configuration makes use of the 'VMGuest' family of cmdlets (a short sketch of typical usage follows the list);

  • Invoke-VMScript
  • Copy-VMGuestFile
  • Get-VMGuest, Restart-VMGuest etc
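
For illustration, a typical call pattern with these cmdlets looks something like the sketch below. It's a sketch only: the VM name, guest credentials and file paths are made-up placeholders.

# Sketch of typical guest OS customisation using the VMGuest cmdlets.
# The VM name, guest credentials and file paths are placeholders.
$vm = Get-VM -Name 'newserver01'

# Push a config file into the guest...
Copy-VMGuestFile -Source 'C:\build\app.conf' -Destination 'C:\temp\app.conf' `
    -VM $vm -LocalToGuest -GuestUser 'administrator' -GuestPassword 'Passw0rd!'

# ...then run a command inside the guest to apply it
Invoke-VMScript -VM $vm -ScriptType Bat `
    -ScriptText 'copy C:\temp\app.conf C:\app\app.conf' `
    -GuestUser 'administrator' -GuestPassword 'Passw0rd!'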

Unfortunately since upgrading to vSphere5 and PowerCLI v5 we've discovered that the guest OS cmdlets are NOT backwards compatible! This means that if you upgrade to PowerCLI v5 but your hosts aren't running ESXi v5 (and, more importantly, the VMTools aren't the most up to date version) any calls using the v5 cmdlets (such as Invoke-VMScript) will no longer work. Presumably this is due to the integration of the VIX API into the base vSphere API – I'm guessing the new cmdlets (via the VMTools interface) now require the built-in API as a prerequisite.

As PowerCLI is a client-side install the workaround is to have a separate install (on another PC for example) which still runs PowerCLI v4, but we have our vCenter server set up as a central scripting station (it's simpler than every member of the team keeping up with releases, plugins etc) so this is definitely not ideal.

This is covered in VMware KB2010065. The PowerCLI v5 release notes are also worth a read.

Further Reading

Will Invoke-VMGuest work? (LucD)