Wednesday, March 07, 2018

How to use vROps to find a VM causing a broadcast storm

A customer recently reached out to me asking whether vROps could help him find a particular VM that was apparently causing a broadcast storm.
The only information he got from the networking department was the host name and NIC of one of his ESXi hosts. The vCenter metrics did not reveal any valuable information, so he had to check every single VM (more than 30 per ESXi host) one by one. In the end he asked whether vROps is capable of finding that bad guy.

The first step was to check whether vROps has such a metric for VMs and whether it is active (collecting data) or not.
As you can see in the following figure, vROps knows the metric, but it is disabled by default.

Fig. 1: disabled metrics in the default policy

Now, you could just activate these metrics for all of your VMs and check the values. However, in an environment with several thousand VMs this adds collection load, and you need these metrics only for a few VMs and for a limited period of time.

Let us make it more dynamic and configurable for future use, just in case your NOC may come to you with another ESXi host you have to check.

The idea is pretty simple: we need a policy which enables the needed metrics, and we need a group of VMs we would like this policy to be applied to.

Step 1: create a new policy and activate the broadcast metrics for the Virtual Machine object type.

The following figure shows you the filters and settings to activate the right things.

Fig. 2: enable metrics in a new policy

We want this policy applied only to a dynamic group of VMs that we would like to investigate. This is where the concept of Custom Groups comes into play.

Custom Groups work as a container for any objects you may have, and the settings of a Custom Group allow the membership to be dynamic, based on a wide range of properties, relationships etc.

Now we could go to vROps, create a new custom group and define that group to contain all VMs which are children of a particular predefined ESXi host. This would be semi-dynamic.

Let's re-think this strategy.

In many cases the admin dealing with a broadcast storm in a vSphere environment is not the vROps admin in his org.
Wouldn't it be better if the vSphere admin set "something" in vCenter and in the end saw a dashboard or received a report in/from vROps?

Exactly, we go for vSphere Tags.

Our new tag will designate a host as being "under investigation". Time for the next step.

Step 2: create a vSphere Tag

Fig. 3: create a vSphere Tag
Now that we have our tag, we continue with a "two-staged-custom-group".
The first group will dynamically contain the ESXi hosts under investigation, and the second group will contain the Virtual Machines which run on those hosts.
This gives us the freedom to create multiple "second-stage-groups" with different policies assigned, in case we would like to investigate another behaviour which requires other metrics etc.
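
To make the membership logic of the two stages more tangible, here is a minimal Python sketch. It does not talk to vCenter or vROps at all; the `Host` and `VM` classes, the tag name `under-investigation` and the sample inventory are hypothetical and only model the idea "tagged hosts first, then the VMs running on them".

```python
from dataclasses import dataclass, field

# Hypothetical tag name; use whatever tag you created in vCenter.
INVESTIGATION_TAG = "under-investigation"


@dataclass
class Host:
    name: str
    tags: set = field(default_factory=set)


@dataclass
class VM:
    name: str
    host: str  # name of the ESXi host the VM currently runs on


def first_stage_group(hosts, tag=INVESTIGATION_TAG):
    """Stage 1: all ESXi hosts that carry the investigation tag."""
    return {h.name for h in hosts if tag in h.tags}


def second_stage_group(vms, host_group):
    """Stage 2: all VMs whose parent host is a member of the first group."""
    return sorted(vm.name for vm in vms if vm.host in host_group)


# Made-up inventory for illustration only.
hosts = [Host("esx01", {INVESTIGATION_TAG}), Host("esx02")]
vms = [VM("vm-a", "esx01"), VM("vm-b", "esx02"), VM("vm-c", "esx01")]

host_group = first_stage_group(hosts)
vm_group = second_stage_group(vms, host_group)
print(vm_group)  # ['vm-a', 'vm-c']
```

Removing the tag from `esx01` empties both groups, which is the behaviour we want from the dynamic membership.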

Step 3: create a new custom group for the ESXi host(s)

Fig. 4: "first-stage-custom-group" - Host System

Anytime we assign our new vSphere Tag to an ESXi host, this host will become a member of this group.

Step 4: assign the tag to an ESXi host and wait one collection cycle for the custom group to be populated:

Fig. 5: vSphere tag assigned to a host

Fig. 6: "first-stage-custom-group" - dynamically populated

Time for the "second-stage", the custom group containing the VMs.

Step 5: create a new custom group for the VMs:

Fig. 7: "second-stage-custom-group" - dynamically populated

Once we have created the custom group for the VMs, it gets populated with the VMs which run on tagged ESXi hosts.
We can see that the metrics we need for our investigation are being collected:

Fig. 8: Collecting metrics

At this point we have everything to create a dashboard for our vSphere admin to quickly help him find the bad guy:

Fig. 9: Dashboard with the results of the investigation
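
What the dashboard does for us is essentially a ranking of the VMs in the second-stage group by their broadcast packet counts. The following Python sketch shows that ranking on made-up numbers; the sample values and the helper `rank_by_broadcast` are assumptions for illustration and are not taken from vROps.

```python
def rank_by_broadcast(samples, top_n=3):
    """Rank VMs by their average broadcast packet count, highest first.

    samples maps a VM name to a list of broadcast packet counts,
    one value per collection cycle.
    """
    averages = {vm: sum(values) / len(values)
                for vm, values in samples.items() if values}
    ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


# Made-up sample data for three VMs on the tagged host.
samples = {
    "vm-a": [120, 135, 110],
    "vm-b": [98000, 102000, 100000],
    "vm-c": [300, 280, 320],
}

ranking = rank_by_broadcast(samples)
for vm, avg_packets in ranking:
    print(f"{vm}: {avg_packets:.0f} broadcast packets per cycle (average)")
```

In this sample, `vm-b` is at the top of the list by several orders of magnitude, which is the kind of outlier you would look for in the real dashboard.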

I hope this post will help others during their RCAs.


Saturday, March 03, 2018

vROps SuperMetric using logical expressions

vRealize Operations Super Metrics are a very flexible and powerful way to extend the capabilities of the product way beyond the OOB content.
There are many blog articles out there explaining the basics of super metrics, but only very few sources give examples of how to put logical expressions into your formulas. So the question is, how does this work?

Using some simple examples, I am going to explain how the magic of logical expressions works in vROps Super Metrics.
First of all, some fundamentals:
  • A super metric working on the selected object itself (an ESXi cluster in this example), which just returns the actual metric (we will need it soon):
avg(${this, metric=summary|total_number_hosts})
  • A super metric working on the direct descendants of the selected object (in this case the ESXi hosts in a cluster), which counts the powered-on hosts:
count(${adaptertype=VMWARE, objecttype=HostSystem, metric=sys|poweredOn, depth=1})
Now, let's put the pieces together and build a super metric which follows this pattern:
condition ? inCaseOfTrue : elseCase;

Even if the following example is not meaningful in terms of semantics, it shows what the syntax of such an expression might look like:

count(${adaptertype=VMWARE, objecttype=HostSystem, metric=sys|poweredOn, depth=1})
== avg(${this, metric=summary|total_number_hosts})
&& avg(${adaptertype=VMWARE, objecttype=HostSystem,
metric=cpu|usage_average, depth=1} as cpuUsage) > 40 ? avg(cpuUsage) : 5

One could translate this formula into the following statement:

If all ESXi hosts in a given cluster are powered on AND
the cluster's average CPU usage is greater than 40
THEN show me the average CPU usage,
ELSE show me 5
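
For readers who prefer to see the logic in a general-purpose language, here is a rough Python equivalent of the statement above. It is only a sketch of the decision logic, not of how vROps evaluates super metrics internally; the function name and the sample values are made up, and `count()` is modelled as the number of hosts that report the `sys|poweredOn` metric.

```python
from statistics import mean


def cluster_super_metric(total_hosts, powered_on_values, cpu_usage_values,
                         threshold=40, fallback=5):
    """Mimic: count(poweredOn) == total_hosts && avg(cpuUsage) > 40 ? avg(cpuUsage) : 5

    total_hosts       -- value of summary|total_number_hosts on the cluster
    powered_on_values -- one sys|poweredOn value per host that reports it
    cpu_usage_values  -- one cpu|usage_average value per host
    """
    condition = (
        len(powered_on_values) == total_hosts  # count(...) == avg(${this, ...})
        and bool(cpu_usage_values)             # guard: mean() fails on empty input
        and mean(cpu_usage_values) > threshold
    )
    return mean(cpu_usage_values) if condition else fallback


# All three hosts report in, average CPU usage is 50: show the average.
print(cluster_super_metric(3, [1, 1, 1], [50, 60, 40]))
# One host is missing: show the fallback value 5.
print(cluster_super_metric(3, [1, 1], [50, 60]))
# All hosts report in, but the average CPU usage is only 20: show 5.
print(cluster_super_metric(3, [1, 1, 1], [10, 20, 30]))
```

The ternary in the super metric maps directly to Python's conditional expression `a if condition else b`, with `&&` becoming `and`.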