Tag Archives: availability

Using vCenter Operations v5 – Capacity features and conclusions (3/3)

In the first part of this series I introduced vCOps and its requirements before covering the new features in part two. This final blogpost covers the capacity features (available in the Advanced and higher editions) along with pricing information and my conclusions.

The previous trial I used didn’t include the capacity planning elements so I was keen to try this out. I’d used CapacityIQ previously (although only briefly) and found it useful but combined with the powerful analytics in vCOps it promises to be an even more compelling solution. VMware have created four videos with Ben Scheerer from the vCOps product team – they’re focused on capacity but if you’ve watched Kit Colbert’s overview much of it will be familiar;

UPDATE APRIL 2012 – VMware have just launched 2.5 hrs of free training for vCOps!

If you don’t have time to watch the videos and read the documentation (section 4 in the Advanced Getting Started guide) here are the key takeaways:

  • Capacity information is integrated throughout the product although modelling is primarily found under the ‘Planning’ view. Almost every view has some capacity information included either via the dynamic thresholds (which indicate the standard capacity used) or popup graphs of usage and trending.
  • Storage is now included in the capacity calculations (an improvement over CapacityIQ) resulting in a more complete analysis. Datastores are now shown in the Operations view although if you’re like me and use NFS direct to the guest OS it’s not going to be as comprehensive as using block protocols.
  • The capacity tools require more tailoring to your environment than the performance aspects but provide valuable information.
  • With vCOps you can view both existing and predicted capacity, and you can model changes such as adding hosts or VMs.


Using vCenter Operations v5 – What’s new (2/3)

In part one of Using vCenter Operations I covered what the product does along with the different versions available and deployment considerations. In this post I’ll delve into what’s new and improved and in the final part I’ll cover capacity features, product pricing, and my overall conclusions. I had intended to cover the configuration management and application dependency features too but it’s such a big product I’ll have to write another blogpost or I’ll never finish!

Introductory learning materials

UPDATE APRIL 2012 – VMware have just launched 2.5 hrs of free training for vCOps.

Deep dive learning materials;

What’s new and improved in vCOps

Monitoring is a core feature and for some people the only one they’re concerned about. As the size of your infrastructure grows and becomes more complex the need for a tool to combine compute, network, and storage in real time also grows. Here are my key takeaways;

  • There’s a new dashboard screen which shows health (immediate issues), risks (upcoming issues) and efficiency (opportunities for improvement) in a single screen. The dashboard can provide a high level view of your infrastructure and works nicely on a plasma screen as your ‘traffic light’ view of the virtual world (and physical if you go with Enterprise+). The dashboard can also be targeted at the datacenter, cluster, host or VM level, which I found very useful, although you can only customise the dashboard in Enterprise versions. There is still the Operations view (the main view in vCOps v1) which now also includes datastores. This view scales extremely well – even if you have thousands of VMs and datastores across multiple vCenters they can all be displayed on a single screen.
    NOTE: If you find some or all of your datastores show up as grey with no data (as mine did) there is a hotfix available via VMware support.

Using vCenter Operations v5 – Introduction and deployment (1/3)

At VMworld 2011 in Copenhagen VMware unveiled a significant revamp of their management suites, including a new version of vCenter Operations Manager (v5 to align with the vSphere release). vCenter Operations is now a suite of tools which includes vCenter Configuration Manager, the new vCenter Infrastructure Navigator (which I’ll cover in a later blogpost) and vCenter CapacityIQ (which is now fully integrated into vCOps; the standalone CapacityIQ is now end of life).

Although announced at VMworld it wasn’t publicly available until Jan 2012 when VMware formally launched vCOps v5. Coming less than a year after the release of the first version it’s apparent that VMware see this as an important product which is evolving fast. Steven Herrod, VMware’s CTO, stated recently at the Italian VMUG (around the 5 minute mark) that vCOps ‘is becoming the most adopted new technology that VMware has ever had’. The vCenter Operations suite is still aimed at infrastructure monitoring as opposed to application monitoring (despite the addition of Infrastructure Navigator) – VMware’s solutions aimed at the application tier belong to the vFabric suite. For a good overview of where vCOps and vFabric Hyperic fit into VMware’s cloud suite read Dave Hill’s blogpost on the subject.

If you aren’t familiar with vCenter Operations here are the kinds of problems it aims to address:

  • Is your virtual infrastructure healthy?
  • What serious problems should I address immediately?
  • Is the workload in my environment normal?
  • Am I using the resources in my environment efficiently?
  • How long do I have before resources run out?
  • What impact did a recent change have?

A few people have already posted articles which I’d recommend reading;

With v1.0 I concluded that it was a great product but there were a few reasons why it wasn’t for me, primarily the lack of email notifications and pricing. In this post I’ll cover the requirements and deployment considerations for the new version and in part two I’ll cover day to day use and new features. The final part will cover the capacity features along with info about pricing and my conclusions.

UPDATE APRIL 2012 – VMware have just launched 2.5 hrs of free training for vCOps.


Preventing Oracle RAC node evictions during a Netapp failover

While undertaking some scheduled maintenance on our Netapp shared storage (due to an NVRAM issue) we discovered that some of our Oracle applications didn’t handle the controller outage as gracefully as we expected. In particular several Oracle RAC nodes in our dev and test environments rebooted during the Netapp downtime. Strangely this only affected our virtual Oracle RAC nodes so our initial diagnosis focused on the virtual infrastructure.

Upon further investigation however we discovered that there are timeouts in the Oracle RAC clusterware settings which can result in node reboots (referred to as evictions) to preserve data integrity. This affects both Oracle 10g and 11g RAC database servers although the fix for both is similar. NOTE: We’ve been running Oracle 10g for a few years but hadn’t had similar problems previously as its default timeout value of 60 seconds is higher than the 30 second default for 11g.

Both Netapp and Oracle publish guidance on this issue;

The above guidance focuses on the DiskTimeOut parameter (known as the voting disk timeout) as this is impacted if the voting disk resides on a Netapp. What it doesn’t cover is when the underlying Linux OS also resides on the affected Netapp, as it can with a virtual Oracle server (assuming you want HA/DRS). In this case there is a second timeout value, misscount, which is shorter than the disk timeout (typically 30 seconds instead of 200). If a node can’t reach any of the other RAC nodes within misscount seconds it will start split-brain resolution and probably evict itself from the cluster by rebooting. When the Netapp failed over our VMs were freezing for longer than 30 seconds, causing the reboots. After we increased the network timeout (misscount) we were able to fail over our Netapps with no impact on the virtual RAC servers.

NOTE: A cluster failover (CFO) is not the only event which can trigger this behaviour. Anything which impacts the availability of the filesystem such as I/O failures (faulty cables, failed FC switches etc) or delays (multipathing changes) can have a similar impact. Changing the timeout parameters can impact the availability of your RAC cluster as increasing the value results in a longer period before the other RAC cluster nodes react to a node failure.
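The timeout behaviour described above can be sketched as a simple comparison. This is only an illustrative model, not Oracle's actual CSS algorithm: the function name would_evict and the 45 second freeze are hypothetical, chosen because our VMs froze for "longer than 30 seconds". The 30 and 180 second values are the 11g default and the raised misscount used in this post.

```shell
#!/bin/sh
# Illustrative model only (hypothetical helper, not part of Oracle clusterware).
# would_evict FREEZE_SECONDS MISSCOUNT_SECONDS
# Prints "evict" if the node is unreachable for longer than misscount,
# otherwise prints "survive".
would_evict() {
    freeze=$1
    misscount=$2
    if [ "$freeze" -gt "$misscount" ]; then
        echo "evict"
    else
        echo "survive"
    fi
}

# A hypothetical 45 second freeze against the 11g default of 30 seconds...
would_evict 45 30
# ...and against the raised value of 180 seconds.
would_evict 45 180
```

The same number also bounds how long the surviving nodes wait before reacting to a genuinely failed node, which is the trade-off mentioned in the note above.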

Configuring the clusterware network timeouts

The changes need to be applied within the Oracle application stack rather than at the Netapp or VMware layer. On the RAC database server check the cssd.log logfile to understand the cause of the node eviction. If you think it’s due to a timeout you can change it using the command below:

# $GRID_HOME/bin/crsctl set css misscount 180 

To check the new setting has been applied:

# $GRID_HOME/bin/crsctl get css misscount

The clusterware needs a restart for the new value to take effect, so bounce the cluster:

# $GRID_HOME/bin/crs_stop -all
# $GRID_HOME/bin/crs_start -all
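As the change involves a cluster restart, it can help to print the full sequence for review before running anything. Below is a minimal dry-run sketch, assuming a POSIX shell. The function name print_misscount_plan and the placeholder path used when GRID_HOME is unset are hypothetical. It only echoes the commands from this post and never executes them.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints the commands from this post in order.
# It does NOT run crsctl, crs_stop or crs_start.
print_misscount_plan() {
    new_value=$1
    # Accept only a non-empty string of digits as the new misscount.
    case "$new_value" in
        ''|*[!0-9]*)
            echo "usage: print_misscount_plan SECONDS" >&2
            return 1
            ;;
    esac
    # Placeholder path if GRID_HOME is not set (an assumption, adjust to suit).
    grid_home=${GRID_HOME:-/path/to/grid_home}
    echo "$grid_home/bin/crsctl set css misscount $new_value"
    echo "$grid_home/bin/crsctl get css misscount"
    echo "$grid_home/bin/crs_stop -all"
    echo "$grid_home/bin/crs_start -all"
}

print_misscount_plan 180
```

Review the printed commands, then run them by hand as root during a maintenance window.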

Further Reading

Netapp Best Practice Guidelines for Oracle Database 11g (Netapp TR3633). Section 4.7 in particular is relevant.

Netapp for Oracle database (Netapp Verified Architecture)

Oracle 10gR2 RAC: Setting up Oracle Cluster Synchronization Services with NetApp Storage for High Availability (Netapp TR3555).

How long it takes for Standard active/active cluster to failover

Node evictions in RAC environment

Troubleshooting broken clusterware

Oracle support docs (login required);

  • NOTE:284752.1 – 10g RAC: Steps To Increase CSS Misscount, Reboottime and Disktimeout
  • NOTE:559365.1 – Using Diagwait as a diagnostic to get more information for diagnosing Oracle Clusterware Node evictions
  • NOTE:265769.1 – Troubleshooting 10g and 11.1 Clusterware Reboots
  • NOTE:783456.1 – CRS Diagnostic Data Gathering: A Summary of Common tools and their Usage