myITforum.com

David St. Clair at myITforum.com

October 2008 - Posts

  • Running SCOM in a Virtual Environment

    Running SCOM in a Virtual Environment  

    I have been a Systems Engineer in the past with quite a bit of virtualization experience. I’ve built MOM 2005 and SCOM environments in purely physical, all-virtual and mixed environments. Heck, I have a traveling virtual lab I carry with me. Below are my recommendations and observations about running SCOM 2007 SP1 in large enterprise environments.

    Almost everyone runs some kind of virtualization these days. Most applications (if not all) will run virtually, and SCOM is no exception (with certain caveats, of course). In your lab or sandbox environments, running SCOM virtually is ideal. You can build, delete and rebuild with almost no problems (SCOM problems anyway). I was recently running some SCOM clustering and migration tests, and almost 100% of my testing was done virtually.

    Production is where you will need to take into consideration your architecture, your environment, what MPs you will be running and how many agents you plan to monitor. While any SCOM server can be run virtually, the question is whether you should run it virtually. Below is a list of what I would recommend as virtual candidates (in any type or size of environment):

    1.       Gateways – these servers are very efficient in the way they handle data. I have built large enterprise-class architectures with multiple GWs and seen how well these servers function under load. As with any server, it depends on how many agents you have reporting to them, but they are designed to compress and handle large amounts of data.

    2.       Web Application Watchers – These servers are similar to web servers (albeit a little higher-end). They can be scaled out to handle the most complex loads. We have seen these servers handle hundreds of URLs (some were able to handle more, some less). Again, the beauty of running these virtually is that if you notice they are starting to get overloaded, you can deploy another in fairly short order (depending on your virtual infrastructure).

    3.       Management Servers – You have to be careful here. In my opinion, Management Servers can be a gray area. If you are in an environment with more than 150 agents and a lot of MPs I would NOT go virtual. However in smaller environments I think VMs would be OK. Another possibility could be to have your RMS physical and one virtual management server; however again you have to watch the number of agents you are running.

    4.       Reporting Server – The Reporting server makes a great virtual candidate, if you are in an environment that calls for a separate reporting server.

    5.       RMS server – Only if you are in an environment of fewer than 100 agents, with very few concurrent console sessions, that is not really busy overall; otherwise this server should be physical. A very simple rule to follow… your RMS should almost always be physical.
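    The rules of thumb in the list above can be sketched as a small helper. This is only an illustration of the author's thresholds (150 agents for Management Servers, 100 for the RMS); the function name, role names and the `heavy_load` flag (standing in for "a lot of MPs", many concurrent consoles, or a generally busy server) are assumptions made for this example, not anything defined by SCOM.

```python
# Thresholds come from the recommendations above; they are observations
# from the field, not product limits.
MS_AGENT_THRESHOLD = 150
RMS_AGENT_THRESHOLD = 100

# Roles the post recommends as virtual candidates in any size of environment.
ALWAYS_VIRTUAL_OK = {"gateway", "watcher", "reporting"}


def virtual_candidate(role: str, agents: int, heavy_load: bool = False) -> bool:
    """Return True if the role looks like a reasonable VM candidate.

    heavy_load is a hypothetical flag meaning "many MPs, many concurrent
    consoles, or a generally busy server".
    """
    role = role.lower()
    if role in ALWAYS_VIRTUAL_OK:
        return True
    if role == "management_server":
        # Not virtual when there are more than 150 agents AND a lot of MPs.
        return not (agents > MS_AGENT_THRESHOLD and heavy_load)
    if role == "rms":
        # Virtual only for small, quiet environments.
        return agents < RMS_AGENT_THRESHOLD and not heavy_load
    raise ValueError(f"unknown role: {role}")
```

    For example, `virtual_candidate("rms", 150)` returns False, matching the "almost always physical" rule for the RMS.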

    Now obviously in large environments there is always a good mix of both physical and virtual servers. There is always room for some sort of virtualization in your SCOM infrastructure. I should also say that 75%-80% of my experience has been on VMware ESX Server 2.5-3.5 (and 3i). I have just started playing with Hyper-V and VMM, and should have a lab built solely on Hyper-V by the end of next week.

  • URL Monitoring in SCOM

    URL Monitoring in SCOM

    One of our clients is a medium sized company that needed to do some pretty intense web application monitoring (or URL monitoring) in SCOM.

    A little background first:

    1. We were running around 500+ Web Applications/URL monitors divided among roughly 10-14 MPs.

    2. We were running a mostly virtual environment, with the RMS and DB server being the only physical servers (and very beefed-up pieces of hardware).

    Some initial thoughts and observations:

    1. We had the URLs spread between 9 watcher nodes (all virtual) to distribute the load.
    2. There were some URLs that had to run every 5 seconds (instead of the default 60 seconds). We isolated those URLs to two watchers, as we weren't sure what type of performance hit that would cause. Those two watchers did run a little higher, performance-wise.
    3. ***UPDATE*** The default runtime for URL monitoring is 30 seconds. You can go into the XML and change the time, but it will cause problems and make the Watchers' performance spike, so your best bet is to leave them at 30 seconds (or more if needed).
    4. All of the above watchers reported back to 2 Gateway servers, which then reported back to a Management Server
    5. The Gateways' performance was much better than we expected. The efficiency with which they processed the data, compressed it and sent it over the wire was very good. The Gateways' CPU was under 10% and the Available Megs of RAM was in the 80%-90% range.
    6. The Management Server ran a little higher (performance wise), simply due to all the applets we had running on it (the HP Blade, StorageWorks and Proliant applets), but was still well within "Normal" limits.

    ***UPDATE*** We ended up adding another 2 GWs, due to some performance problems.
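    The layout described above (URLs spread across watcher nodes, with the high-frequency URLs isolated on their own watchers) can be sketched as a simple round-robin assignment. This is a planning aid only, not how SCOM assigns watcher nodes; the function name, the watcher names and the `fast_below` cutoff are hypothetical.

```python
from itertools import cycle


def assign_urls(urls, normal_watchers, fast_watchers, fast_below=30):
    """Spread URLs round-robin across watchers.

    urls is a list of (url, interval_seconds) pairs. URLs polled more often
    than fast_below seconds go only to fast_watchers, so their extra load
    stays isolated from the rest. Both watcher lists must be non-empty.
    """
    assignment = {name: [] for name in [*normal_watchers, *fast_watchers]}
    normal_pool = cycle(normal_watchers)
    fast_pool = cycle(fast_watchers)
    for url, interval in urls:
        pool = fast_pool if interval < fast_below else normal_pool
        assignment[next(pool)].append(url)
    return assignment
```

    The result is a plain dictionary mapping each watcher name to its list of URLs, which you could then use as a checklist when creating the Web Applications in the console.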

    Now for the tricky part...

    As with most URL monitoring, you may be interested in polling the URL with and without images (if you are using a caching service, you want to know whether the problem is on your end or the caching service's end). In MOM 2005 you could go into the Web Application and turn off images (but does that really not download the images, or just not report on them?). In SCOM there are no references to "images" at all. After some reading and research, it turns out that in SCOM images are referred to as "Resources". You can edit this setting by:

    1. Highlighting your Web Application and selecting "Edit web application settings" from the Actions Pane

    2. Then clicking on Configure Settings in the Web Request Action pane

    3. Going to the Performance Counter Tab and scrolling down to where you see Resources

    What you see here is the Base Page performance counters, Resources Performance Counters, Links etc.

    We know that if you uncheck the Resources check boxes, SCOM will NOT follow the tags in the HTML to download the images. Zolan and Maco (two of the Engineers we are working with) did a packet sniff of this operation to verify what was happening, and according to their results SCOM only downloads the images if you have selected at least one box in the Resources section of the Performance Counter tab. A BIG THANK YOU TO THEM FOR DOING THE RESEARCH.

    What this means is that with Resources unchecked, SCOM will only report on the Base Page metrics. If you are interested in testing the caching service, you could simply reverse this: uncheck Base Page and only check Resources.
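    To make the Base Page versus Resources distinction concrete, here is a minimal sketch that lists the resource references found in a page's HTML. It does not reproduce SCOM's behavior; which tags SCOM actually treats as "Resources" is an assumption here (images, scripts and stylesheet links), and the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class ResourceCollector(HTMLParser):
    """Collect URLs a browser would fetch in addition to the base page."""

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumed resource tags: images and scripts (src), stylesheets (href).
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(attrs["src"])
        elif tag == "link" and attrs.get("href"):
            self.resources.append(attrs["href"])


def list_resources(html_text):
    """Return resource URLs referenced by the base page, in document order."""
    parser = ResourceCollector()
    parser.feed(html_text)
    return parser.resources
```

    Ordinary hyperlinks (`<a href=...>`) are deliberately ignored, since the console lists Links separately from Resources.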

    If you are interested in technical classroom training, take a look at our Operations Manager courses at http://www.infrontconsulting.com/events.htm.

  • Alert Descriptions

    Alert Descriptions

     

    Another handy tip to keep in the back pocket during your install.... The Alert description will accept HTML tags. That means when you are creating your own MP, Rules or Monitors you can format the description to look how you want. You can also add lines of HTML to do things for you (e.g. open a file share via a link, using the creds of the user who is working the alert).

     

    An example of this might be something like (thanks Zolan)...

     

    <table border="0"><tr><td align="right"><b>Host Name:</b></td><td> $Data/EventData/DataItem/HostName$ </td></tr><tr><td align="right"><b>Time Stamp:</b></td><td> $Data/EventData/DataItem/TimeStamp$ </td></tr><tr><td align="right"><b>Severity:</b></td><td> $Data/EventData/DataItem/Severity$ </td></tr><tr><td align="right"><b>Priority:</b></td><td> $Data/EventData/DataItem/Priority$ </td></tr><tr><td align="right"><b>Priority Name:</b></td><td> $Data/EventData/DataItem/PriorityName$ </td></tr><tr><td align="right"><b>Facility:</b></td><td> $Data/EventData/DataItem/Facility$ </td></tr><tr><td align="right"><b>Message:</b></td><td> $Data/EventData/DataItem/Message$ </td></tr></table>

     

    This formats your description as a table with no border and right-justified labels.

     

    One thing to note here is that this type of formatting will only appear in the Monitoring pane of the Ops Console. When you double click on the alert to open or view it, you will see the raw HTML (as above).
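    Typing a description like the one above by hand is tedious and easy to get wrong, so it can help to generate it. The sketch below builds the same table markup from a list of label and field-name pairs. The helper name is hypothetical, and it assumes every value lives under `$Data/EventData/DataItem/...$` as in the example above; adjust the path for your own data source.

```python
import html


def build_alert_description(fields):
    """Build a borderless HTML table for an alert description.

    fields is a list of (label, data_item_name) pairs. Each row shows the
    label right-aligned in bold, followed by the alert variable for that
    data item.
    """
    rows = "".join(
        f'<tr><td align="right"><b>{html.escape(label)}:</b></td>'
        f"<td> $Data/EventData/DataItem/{name}$ </td></tr>"
        for label, name in fields
    )
    return f'<table border="0">{rows}</table>'


# Example: the first two rows of the table shown above.
description = build_alert_description(
    [("Host Name", "HostName"), ("Time Stamp", "TimeStamp")]
)
```

    The resulting string can be pasted into the alert description field of your Rule or Monitor.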


  • URL Watcher Nodes

    URL Watcher Nodes

     Some things to keep in mind when it comes to Web Application Watcher nodes…

    1.       It can take upwards of 15 minutes for the URL MPs to be downloaded to the Watchers and to get them reporting back to the Mgt servers

    a.       You will not see any information in the Web Application State view until the URLs are reporting back. You would think you would see the Web Application in the state view as Not Monitored, but you don’t. Until they have been pushed to the Watchers and are reporting back a good or bad state the Web Application state is blank.

    2.       Once you uninstall the agent from the “watcher” you cannot access Performance data about the URLs that the watcher was monitoring. I’m looking to see what kind of SQL query can be written to find and extract the data from the DB or the DW.

    3.       By default a web application runs once every 60 seconds. You can of course change that when you are creating it, but the lowest you can set this number to is 30 seconds. You can edit the XML to set a lower number, but that causes problems with the URL and the watcher and makes the data less reliable, so we don’t recommend doing that. And honestly, if you are running queries that often there is a chance (depending on the number of Web Apps you are creating) of overloading the Watchers.

    4.       Depending on how frequently you run them and how many web applications you are monitoring, Watchers can get overloaded and will exhibit this in a variety of ways.

    a.       There are the obvious performance issues (overloaded procs, high memory usage).

    b.      You may also see frequent agent failovers, as the GWs or Mgt Servers that your watchers and standard agents report to are too busy and become overwhelmed.

    c.       You may see a high send queue alert from the Watchers. This would be a sign that the watchers are too busy to keep up with the amount of data they are trying to send (don’t forget they are also processing their own OS and HW data).

    d.      It’s a good idea if you are going to monitor a lot of URLs (a lot meaning 100+) to spread them over a number of watchers. Dedicated Watchers make great VM candidates.
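    A rough back-of-the-envelope calculation can help when deciding how many watchers to spread your URLs over. The sketch below only does the arithmetic; the per-watcher ceiling passed to `watchers_needed` is a hypothetical number you would have to establish from your own performance data, not a figure from SCOM or from this post.

```python
import math


def polls_per_minute(url_count, interval_seconds):
    """Total URL polls per minute for url_count URLs at a given interval."""
    return url_count * 60 / interval_seconds


def per_watcher_load(url_count, interval_seconds, watcher_count):
    """Average polls per minute each watcher handles with an even spread."""
    return polls_per_minute(url_count, interval_seconds) / watcher_count


def watchers_needed(url_count, interval_seconds, max_polls_per_watcher):
    """Watchers required to stay under a ceiling you have measured yourself."""
    total = polls_per_minute(url_count, interval_seconds)
    return math.ceil(total / max_polls_per_watcher)
```

    For instance, 500 URLs at the 60 second interval spread over 9 watchers works out to roughly 56 polls per minute per watcher; halving the interval to 30 seconds doubles that.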

    While MOM 2005 wasn’t the best product for URL monitoring in the past, SCOM 2007 has come a long way, and with the right tuning and architecture you can use it to monitor large, complex environments and give products like SiteScope a run for their money.

  • Problem changing the Primary Management Server

    Problem changing the Primary Management Server

    When you need to change the primary Management Server but are having problems with your agents going offline (see the post below on changing the Primary Management Server), I have found two ways to get around the issue. Both accomplish the same thing.

    To start here is what we have to accomplish:

    The agent needs to be reinstalled and the agent cache needs to be dumped (however the health service has to be stopped in order to do this).

    The fastest way I have found to do this is to change the primary management server in the admin node and then, right after (make sure you do this quickly), right click on the agent and select repair. When you select repair the agent is reinstalled, and since we first changed the primary management server, the agent is reinstalled using the new management server as its primary (you can verify this by checking in Add/Remove programs on the agent).

    You still have the problem of the Health Service State cache. The cache has to be dumped, as it still contains the XML file that points to the old gateway or management server. By opening a remote management console you can stop the health service, then browse to the Health Service State directory on the agent and dump the cache.

    If you aren’t familiar with this process, it’s very straightforward. The directory (%program files%\System Center Operations Manager 2007\Health Service State\) contains files and folders related to the Health Service: things like downloaded partial MPs, state information, and an XML file that contains configuration information (most notably the agent’s primary management server). All files and folders in this directory can be safely deleted. Once they have been deleted, restart the health service (you’ll see the files and folders reappear once the service has fully started). The agent will go from grayed out to Green and monitored in the Admin console.

    The second way to accomplish this task is to start by changing the primary management server in the admin console. Then RDP to the Agent, open the Add/Remove programs applet and select the System Center Operations Manager 2007 Agent, select Change and walk through the menu. On the screen that contains the Management Group information and management server enter the name of the new Primary management server. Once you have finished walking through the menu, stop the Health Service and clear the Health Service State Cache (see instructions above).
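    The cache-clearing step used by both methods (stop the health service, empty the Health Service State directory, start the service again) can be sketched as a small script run on the agent. Treat it as an illustration under assumptions: the service name `HealthService` and the use of `net stop` / `net start` are assumptions to verify on your own agents, and the directory is passed in rather than hard-coded.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed Windows service name for the agent's health service; verify it
# on your own agent before relying on this.
SERVICE_NAME = "HealthService"


def clear_state_cache(state_dir, run=subprocess.run):
    """Stop the health service, empty state_dir, then start the service.

    run is injectable so the logic can be exercised without touching a
    real service.
    """
    state_dir = Path(state_dir)
    run(["net", "stop", SERVICE_NAME], check=True)
    try:
        # Delete everything inside the directory, but keep the directory.
        for entry in state_dir.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    finally:
        # Always try to bring the service back, even if a delete failed.
        run(["net", "start", SERVICE_NAME], check=True)
```

    On an agent you would call it with the Health Service State directory described above; the files and folders are recreated once the service has fully started.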

  • Manual Agent Assignment vs. Auto Agent Assignment

    Manual Agent Assignment vs. Auto Agent Assignment

    I have found that Manual Agent Assignment doesn’t work as well as advertised. In the environment I was working in, we had agents set to fail over between GW servers, and GWs set to fail over between the RMS and Mgt servers. That didn’t happen, and the agents would just drop offline. However, if we went into the Admin Console and switched from Manual Agent Assignment to Auto Agent Assignment, the failover happened on a much more consistent basis.

    Now granted, there could have been a problem with MOMADAdmin (it was run by another group). Since we were testing in a production environment :) we couldn’t spend the quality time needed to find the root cause.

    If you find failover not working, and your agents are falling offline, try using Auto Agent Assignment and see if that helps.

     

    **UPDATE**

    One thing to note … if you have run MOMADAdmin more than once (perhaps because you are building a new Management Group), you will see agents attempting to fail over between Management Groups. They will not be successful, but they will attempt it, and you will see alerts from the failures.

Copyright - www.myITforum.com, Inc. - 2010 All Rights reserved.