SCOM: Heartbeat Failure Alert Tuning

I recently deployed SCOM in a highly distributed network. Most of the edge locations had slow WAN links and would often go offline. Between the slow links and the outages, SCOM would flood the team with alerts and emails from the Health Service Heartbeat Failure and Computer Not Reachable monitors.

These alerts had to be tuned out because they were overwhelming the team. Besides, as soon as an edge location went offline the team was already notified by other network monitoring tools and by the staff at that location.

These edge locations would often go offline because of power outages or ISP outages, and could stay down for 2-3 days at a time. Fixing the issue was often out of the team's control, so receiving alerts from the edge locations during these outages was not helpful. The team still needed alerts right away if servers at the corporate locations went offline. There are several ways to tune alerts for these monitors.

One way to tune the Health Service Heartbeat Failure and Computer Not Reachable monitors is to adjust the heartbeat interval (default is 60 seconds) and the number of missed heartbeats SCOM will tolerate. Note that this is a global change in SCOM across all monitored servers. To access these settings do the following:

In the SCOM console go to Administration>>Settings. In the right hand pane under Type: Agent you will see Heartbeat. Right click on Heartbeat and open the properties. In the same pane under Type: Server you will see another Heartbeat. Right click on it and open the properties. You can see this in the following screenshot:

clip_image001
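To get a feel for what changing these two values does, here is a rough sketch of the arithmetic. It assumes the alert fires after roughly interval × missed heartbeats of silence. The 60 second interval is the default mentioned above; the missed heartbeat count of 3 and the "relaxed" values are assumptions for illustration, so check what your own Settings pane shows.

```python
def seconds_until_heartbeat_alert(interval_seconds: int, missed_heartbeats: int) -> int:
    """Approximate time an agent can stay silent before a heartbeat failure is raised.

    Simplification: treats the delay as interval * allowed missed heartbeats.
    """
    if interval_seconds <= 0 or missed_heartbeats <= 0:
        raise ValueError("interval and missed heartbeat count must be positive")
    return interval_seconds * missed_heartbeats


# 60 seconds is the default interval from the article; 3 missed heartbeats is an assumed value.
default_delay = seconds_until_heartbeat_alert(60, 3)

# Hypothetical relaxed values for slow WAN links (illustration only).
relaxed_delay = seconds_until_heartbeat_alert(300, 5)

print(f"default: {default_delay} s, relaxed: {relaxed_delay} s")
```

Keep in mind that raising these values globally also delays alerts for the servers you do want to hear about right away.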

Another way to tune the alerts on these monitors is to adjust the heartbeat interval at the individual server level. This is only useful if you have a small number of servers generating these alerts and you know which servers they are. To access these settings in the SCOM console go to Administration>>Device Management>>Agent Managed. Find your server(s). Right click on the server and select Properties. Under the Heartbeat tab select the checkbox next to Override global agent settings and then adjust the Heartbeat interval.

clip_image002

For more information about both of those visit:

Heartbeat and Heartbeat Failure Settings in Operations Manager 2007

 http://technet.microsoft.com/en-us/library/cc540380.aspx

Neither of those helped in my situation because we needed these alerts right away from one group of servers but not from another. Here is what I did to tune these monitors so that the team would not become overwhelmed by the alerts.

In this particular environment there were some things I need to point out before I go into the solution.

  • The team did not want to monitor heartbeat or ping (basically, connectivity) to the edge servers at all. They were more interested in gathering performance data, the status of the applications on those servers, and more.
  • The servers at the edge locations had a different number sequence in the computer name than the servers at the corporate locations. The naming schema was structured like this:
      • Corporate location # 1 server names: PROD100-xxV or PROD100-xxP.
      • Corporate location # 2 server names: PROD200-xxV or PROD200-xxP.
      • Edge server names: PROD404-xxV or PROD404-xxP (404 stands for the number of that edge location, so it varies from edge to edge).

The naming schema was a big help in breaking things out. I created an edge server group in SCOM with dynamic membership that excludes all corporate locations. Here is what it looked like to build this:

clip_image003

Building the logic:

clip_image004

What it looks like in the group:

clip_image005

By doing that, the membership consists of all servers from all edge locations without including any servers from the corporate locations. The member list is built dynamically, so the team never has to worry about adding edge servers to the group.
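The membership logic can be sketched outside of SCOM like this. This is not the actual group formula from the screenshots, only an illustration of the same idea in Python: match the PROD naming schema, then exclude the corporate site numbers 100 and 200. The three-digit site number and the sample server names are assumptions.

```python
import re

# Corporate site numbers taken from the naming schema above.
CORPORATE_SITE_CODES = {"100", "200"}

# Assumes every site number is exactly three digits (an assumption, not confirmed).
NAME_PATTERN = re.compile(r"^PROD(\d{3})-", re.IGNORECASE)


def is_edge_server(name: str) -> bool:
    """Return True when the name follows the PROD schema and is not a corporate site."""
    match = NAME_PATTERN.match(name.strip())
    if match is None:
        # Names outside the schema are left out of the edge group.
        return False
    return match.group(1) not in CORPORATE_SITE_CODES


# Hypothetical server names for illustration.
servers = ["PROD100-01V", "PROD200-02P", "PROD404-01V", "PROD317-03P", "SQL01"]
edge_group = [s for s in servers if is_edge_server(s)]
print(edge_group)  # ['PROD404-01V', 'PROD317-03P']
```

Excluding the two known corporate numbers, rather than listing every edge number, is what lets new edge locations fall into the group without anyone touching it.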

After the edge server group was built in SCOM I was able to target overrides on the Health Service Heartbeat Failure and Computer Not Reachable monitors towards all the servers in all edge locations. The overrides disabled the Health Service Heartbeat Failure and Computer Not Reachable monitors on the edge servers while these monitors remained active on all corporate based servers. Here are some screenshots of the overrides:

clip_image006

clip_image007

So after creating the edge servers group and putting the overrides in place the alerts went down and the team was happy. This may not work for every scenario. Below are some links with other ways to tune the Health Service Heartbeat Failure and Computer Not Reachable monitors.

Heartbeat Failures, although a valuable diagnostic tool, can prove a colossal pain in large distributed environments.

 http://blog.mobieussystems.com/bid/194964/SCOM-Heartbeat-Failures-Tuning

Health Service Heartbeat Failure, Diagnostics and Recoveries

 http://blogs.technet.com/b/jonathanalmquist/archive/2010/01/11/health-service-heartbeat-failure-diagnostics-and-recoveries.aspx
