Cluster disk monitoring

When I initially implemented my framework I disabled a lot of rules and monitors for cluster disks because, honestly, I didn't know whether they were useful.

At the time I did this I used these management packs:

  • Microsoft.Windows.Cluster.Library - 7.0.8437.17
  • Microsoft.Windows.2012.Cluster.Management.Library - 10.0.6.6
  • Microsoft.Windows.2008.Cluster.Management.Library - 10.0.6.6
  • Microsoft.Windows.2008.Cluster.Management.Monitoring - 10.0.6.6
  • Microsoft.Windows.2012.R2.Cluster.Management.Monitoring - 10.0.6.6
  • Microsoft.Windows.2012.R2.Cluster.Management.Library - 10.0.6.6
  • Microsoft.Windows.2012.R2.Cluster.Management.Monitoring.Overrides - 10.0.6.6
  • Microsoft.Windows.2012.Cluster.Management.Monitoring - 10.0.6.6
  • Microsoft.Windows.2016.Cluster.Management.Library - 10.1.0.0
  • Microsoft.Windows.Cluster.Management.Library - 10.1.0.0
  • Microsoft.Windows.Cluster.Management.Monitoring - 10.1.0.0
  • Microsoft.Windows.2016.Cluster.Management.Monitoring - 10.1.0.0
  • Microsoft.Windows.Server.ClusterSharedVolumeMonitoring - 10.1.0.6
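
If you want to compare your own versions against the list above, something like this should do it. It's a minimal sketch, assuming the OperationsManager module is available and the session is already connected to the management group; the '*Cluster*' filter is just my guess at a pattern that catches the packs listed.

```powershell
Import-Module OperationsManager

# List imported management packs whose name mentions clustering, with versions.
# The '*Cluster*' wildcard is an assumption; adjust it if your packs are named differently.
Get-SCOMManagementPack |
    Where-Object { $_.Name -like '*Cluster*' } |
    Sort-Object Name |
    Format-Table Name, Version -AutoSize
```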

I leave discoveries enabled by default, so instances of this class get discovered:

  • Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk
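
A quick way to confirm the discovery has found something, and to see the health state without opening the console, is roughly this. It's a sketch, assuming a connected OperationsManager session; a HealthState of Uninitialized is what the console shows as Not monitored.

```powershell
Import-Module OperationsManager

# Get the cluster disk class named in this post.
$class = Get-SCOMClass -Name 'Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk'

# List discovered instances with their current health state.
# HealthState 'Uninitialized' corresponds to "Not monitored" in the console.
Get-SCOMClassInstance -Class $class |
    Sort-Object Path, DisplayName |
    Format-Table DisplayName, Path, HealthState -AutoSize
```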

What I soon discovered was that everything in the console view Microsoft Windows Server > Health Monitoring > Cluster Disks Health was set to Not monitored, and Microsoft Windows Server > Performance > Cluster Disk Capacity wasn't collecting anything.

Let's fix the collection rules first.

I removed the disable overrides for these rules, which collect stats for the Cluster Disk Capacity view:

  • Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk.Monitoring.CollectPerfDataSource.FreeSpaceMB
  • Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk.Monitoring.CollectPerfDataSource.FreeSpacePercent
  • Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk.Monitoring.CollectPerfDataSource.TotalSizeMB
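
To look the three rules up by name, a sketch along these lines should work, assuming a connected OperationsManager session. Note that the Enabled column here is the default from the sealed management pack, not the effective state after overrides, so it only confirms the rules exist and what they default to.

```powershell
Import-Module OperationsManager

# The three collection rules listed above.
$ruleNames = @(
    'Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk.Monitoring.CollectPerfDataSource.FreeSpaceMB',
    'Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk.Monitoring.CollectPerfDataSource.FreeSpacePercent',
    'Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk.Monitoring.CollectPerfDataSource.TotalSizeMB'
)

# Enabled shows the management pack default, not the result of any overrides.
Get-SCOMRule -Name $ruleNames |
    Format-Table Name, Enabled -AutoSize
```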

Result: Fail. The interval on these rules is 15 minutes. After 30 minutes still nothing was being collected. The views went weird too, showing just a white background.

After some troubleshooting I removed the disable override I had for these 2 rules from the Microsoft.Windows.Cluster.Library management pack:

  • Microsoft.Windows.Cluster.EnableOverrideForVirtualServer
  • Microsoft.Windows.Cluster.DisableOverrideForVirtualServer

Result: Success

Bizarrely, after 30 minutes, data was visible in the view. It even seemed to show data from the previous 30 minutes so it's like it was being collected, just not shown in the view. We'll call that fixed?

Now the monitor.

Where this gets tricky is deciding which monitors to enable for disk space monitoring. The only ones I could find target the Cluster Disk class. To list them, run this:

Get-SCOMClass -Name Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk | Get-SCOMMonitor | Format-Table DisplayName, XmlTag

You should get 3 monitors returned: 1 aggregate and 2 unit monitors. The aggregate monitor doesn't alert, but the unit monitors do. This presents a problem because, unlike the disk space monitor for logical disks, we have separate monitors for % and MB free. In environments with big and small cluster disks, which one do you use?

I decided to use only this monitor for now, as it's the most logical:

  • Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk.FreeSpacePercent

I disabled this monitor when I ran my override script. Now we want to enable it and target our CMDB groups for alerts, but which group do we use? The cluster ones or the Windows Server ones?

Let's first test targeting the overrides at the Cmdb.Group.WindowsServerCatA group. I make sure my cluster nodes are in the group, then apply the override to enable the monitor and set the priority to High.

Result: Fail. Health states are still not monitored.

I remove the overrides and instead apply the same ones to the Cmdb.Group.WindowsClusterCatA group, after first making sure the cluster is in the group.

Result: Fail. Health states are still not monitored.

So we have a problem now. We don't have the right class in our groups for the override to work. This breaks our ability to categorise infrastructure. These are our options:

  1. Remove the overrides, leave the monitor enabled and control priority at object level
  2. Figure out how to get the right class into the existing cluster groups
  3. Create new groups and use the CMDB
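
Before picking an option, it helps to see which classes a group actually contains. This is a rough sketch, assuming a connected OperationsManager session and that the group's display name matches the name used in this post (it may not in your environment). It also assumes each object's FullName starts with its class name followed by a colon, which is how I've seen it formatted. It only looks at the members returned by GetRelatedMonitoringObjects(), not everything hosted underneath them.

```powershell
Import-Module OperationsManager

# Assumption: the display name matches the group name used in this post.
$group = Get-SCOMGroup -DisplayName 'Cmdb.Group.WindowsClusterCatA'

if ($null -eq $group) {
    Write-Warning 'Group not found; check the display name.'
}
else {
    # Count group members per class, using the part of FullName before the first colon
    # as the class name (an assumption about the FullName format).
    $group.GetRelatedMonitoringObjects() |
        Group-Object { $_.FullName.Split(':')[0] } |
        Sort-Object Count -Descending |
        Format-Table Count, Name -AutoSize
}
```

If Microsoft.Windows.Server.ClusterDisksMonitoring.ClusterDisk doesn't show up in the output, that would be consistent with the overrides not taking effect.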

For now I have removed all overrides and left the monitor at its default (enabled at class level), and the alert seems to work reliably. In terms of a better strategy, option 2 is preferred, so I will likely need to use some contained and contains craziness.

Stay tuned...
