January 2007 - Posts
Summary: there are plenty of ways to count SMS clients - which is the best way?
I actually use SQL Server Management Studio more than I use the SMS Administrator console. Partially that's because I'm not on the frontlines, but it's also because I have or get a lot of questions every day, and the answers to the questions are often provided by queries (an imminent blog topic). Most of those questions revolve around client counts. So what do those queries look like?
Well, the basic, and quite sufficient, form of those queries is:
SELECT count(*) from v_R_System where client0=1 and obsolete0=0
Of course I almost always join in some other views or clauses, but that basic format does a pretty good job. We are talking clients, so client0=1 makes sense. I've seen postings about people not seeing their clients with that flag set, but I find that if heartbeat discovery is frequent enough (2 days in our case) and everything else is flowing properly, then that column gets set properly. Obsolete0 is a wonderful column, especially in a very dynamic environment like Microsoft - computers get rebuilt, and so a machine that was a client is now replaced by another.
To be really accurate I'll change the count(*) to count(distinct name0). That reduces the client counts by about 2.4% in my very dynamic environment. So not enough to be worth typing for those quick 'what if' questions, but definitely worth doing for the 'let's build the CIO scorecard' queries. That helps with computers that change so much that they aren't recognized by SMS as being the same client but they haven't changed their computer name. It doesn't help with those that also change their names (in fact then we get double counting) but client health trending (another blog topic yet to come) allows us to account for that.
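To make the difference between count(*) and count(distinct name0) concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table is only a toy stand-in for v_R_System: the rows are invented and the column set is trimmed down for illustration, whereas the real view lives in the SMS site database on SQL Server.

```python
import sqlite3

# Toy stand-in for v_R_System; every row below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE v_R_System "
    "(ResourceID INTEGER, Name0 TEXT, Client0 INTEGER, Obsolete0 INTEGER)"
)
conn.executemany(
    "INSERT INTO v_R_System VALUES (?, ?, ?, ?)",
    [
        (1, "PC-A", 1, 0),
        (2, "PC-B", 1, 0),
        (3, "PC-B", 1, 0),  # rebuilt machine: new record, same computer name
        (4, "PC-C", 1, 1),  # obsolete record, excluded by obsolete0=0
        (5, "PC-D", 0, 0),  # discovered resource that is not a client
    ],
)

where = "FROM v_R_System WHERE Client0 = 1 AND Obsolete0 = 0"
row_count = conn.execute(f"SELECT COUNT(*) {where}").fetchone()[0]
name_count = conn.execute(f"SELECT COUNT(DISTINCT Name0) {where}").fetchone()[0]

print("count(*):", row_count)
print("count(distinct name0):", name_count)
```

In this toy data the rebuilt PC-B is counted twice by count(*) and once by count(distinct name0), which is the effect described above. A machine that was also renamed during the rebuild would still be counted twice by both forms.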
You might note that I don't include the active0 column. It sounds really useful, but that column is only relevant if you want to know which clients have seen data in the last week (or whatever your cycle is set to). As per the post linked above we at Microsoft IT count SMS clients on a 30 day cycle (which is taken care of by the client purging cycle - if the site hasn't seen heartbeat discovery data from a client in 30 days, then it's purged out of the database, and thus won't be counted).
In the olden days (2 or 3 years ago) I used to also include "obsolete0 is null" or clients that were assigned to sites (as opposed to unassigned), but with SMS 2003 SP2 (and possibly earlier) I don't find those issues to be significant (in fact, no clients meet those criteria anymore for us).
Some great members of the team I work in are considering whether hardware ID might be a good indicator of client uniqueness. Historically it hasn't been, in my experience, but as the product has changed over time I'm reconsidering that option. Even then the hardware ID column may be best for answering the computer count question, as opposed to the client count question, as per the link mentioned above. Check back in about 6 months for a blog on that topic.
Another consideration is clients that are returning data, such as inventory or status messages. That's cool, but for me "healthy" vs. unhealthy clients is a separate issue from client counts. For example, Garth Jones, in a recent post, based his data on hardware inventory (but not including discovery data). For me that gives a 0.5% discrepancy from my normal reporting. Good, but to me a computer that reports inventory but isn't a resource (i.e. isn't in v_R_System) isn't really a client. The closeness of the counts seems like just a coincidence. If I do join the views, so that I'm looking at clients that haven't reported hardware inventory, with my very dynamic environment and one week hardware inventory cycles, I see a 3.1% discrepancy. That query is:
select count(DISTINCT Name0) from v_R_System sys full join v_GS_WORKSTATION_STATUS WS ON WS.resourceID=sys.resourceID where client0=1 and obsolete0=0
When I want to apply the same idea of data-generating clients, but using discovery / resource data (the basis of v_R_System) and excluding AD discovery data (which may be stale), I use the following query (credit for this one goes to the SMS product group, but I'm afraid I forget who):
select count(DISTINCT Name0) from v_R_System sys full join (select ResourceId, MAX(AgentTime) as AgentTime from v_AgentDiscoveries where agentname<>'SMS Discovery Data Manager' AND agentname not like '%!_AD!_System%' ESCAPE '!' group by ResourceId) disc on disc.resourceid=sys.resourceid where client0=1 and obsolete0=0
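The agent-name filter in that subquery relies on ESCAPE so that the underscores in the pattern are matched literally instead of acting as single-character wildcards. The sketch below evaluates the same filter expression with Python's built-in sqlite3 module. The agent names are sample values chosen for illustration (the last one is deliberately made up), and SQLite's default case-insensitive LIKE is assumed to behave like a case-insensitive SQL Server collation here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The agent-name filter from the subquery above, evaluated for a single value.
FILTER_SQL = (
    "SELECT (:name <> 'SMS Discovery Data Manager' "
    "AND :name NOT LIKE '%!_AD!_System%' ESCAPE '!')"
)

def is_kept(agent_name):
    """Return True if a discovery record from this agent survives the filter."""
    return bool(conn.execute(FILTER_SQL, {"name": agent_name}).fetchone()[0])

# Sample agent names for illustration; "SMSXADXSystem" is a made-up value.
samples = [
    "Heartbeat Discovery",
    "SMS Discovery Data Manager",
    "SMS_AD_SYSTEM_DISCOVERY_AGENT",
    "SMSXADXSystem",
]
for name in samples:
    print(f"{name!r}: kept={is_kept(name)}")

# Without ESCAPE, each underscore matches any single character,
# so the made-up name would be caught by the pattern as well.
unescaped_match = conn.execute(
    "SELECT 'SMSXADXSystem' LIKE '%_AD_System%'"
).fetchone()[0]
print("matches without ESCAPE:", bool(unescaped_match))
```

With the escape character in place, only names containing the literal text _AD_System are dropped, along with the Discovery Data Manager records.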
The final option is to include Client Health Tool data, which I do, but that's tricky enough that it deserves a separate post.
So those are my thoughts on counting clients. More importantly, what are your thoughts? Please post your comments.
p.s. My reference above to Garth's queries should not be taken to imply any criticism. In fact I have great respect for Garth, and I know him quite well. I know him so well, in fact, that I will be sure to never irritate him while he's anywhere near a chainsaw, but that's another story...
Summary: we all make mistakes, but big mistakes in the computer management field require particular strategies.
If you were following today's myITforum SMS mailing list you will have seen a thread called "Greg and Warren's SMS 2003 Recipes". It was mostly about those "oops" moments we all face sooner or later. That brought back some memories for me.
A number of times I've been in that situation where I've had that sudden moment of realization and simultaneously felt a 'pit in my stomach'. The moment when without a doubt I knew something bad had happened to the computer system I'm responsible for and it was my fault. Fortunately they were mostly in the early stages of my career, but I was still affecting hundreds of users for hours at least. No sooner had I hit the enter key than emergency backup systems were kicking into place. The hospital admissions department was switching to paper-based systems. Stuff like that.
In my youth I'd desperately try to fix the situation before anyone noticed (as if). Or I'd hope no one would mention it to the boss. Fortunately my manager at the time was far wiser than I and calmly explained that he'd rather be able to say "yes, we know about the problem and we're taking steps X, Y, and Z to resolve it". That was much better than saying "huh? what problem?"
You don't have to go through that kind of situation a lot before you become very detail oriented. Or unemployed. Your choice. So I personally haven't caused such problems for quite a while (I think we're talking decades). Of course I'm not on the frontlines nearly as much as I used to be either, but I'm close enough that I still feel the heat. Those of you that know me personally know that I have a healthy head of grey hair, despite not being particularly old. I like to blame that on the fact that I worry about such things. (But in other contexts I like to blame it on 17 years of marriage - that's fun too ;-)
Whenever we get a new hire, or give someone SMS privs for the first time, I like to have a quick 'chat' with them about the seriousness of the systems in their hands. Imagine the headlines if Microsoft is knocked out of commission for a few days by an unfortunate software distribution. That will turn anyone's hair grey. I suppose the Exchange or AD or other MSIT guys have similar chats, as do IT guys in any company, but I think computer management guys have to take the responsibilities particularly seriously. Our systems are purposely designed to distribute anything anywhere, on a large scale, quite fast, and with a high degree of reliability. And "anything" isn't an e-mail or a policy or something - it's a program that can do anything you can imagine.
So for you computer management types out there I have a few points of advice:
- be prepared. Anticipate the worst and have tools ready to react. In an emergency you don't want to be doing a lot of complex tasks by hand in a hurry. Have backup systems and backups for your backup systems. Be prepared to work from anywhere, because the problem may not occur when it's convenient for you.
- have a communication plan. Even if it's just in your head, know who to contact for any contingency. Have the phone numbers ready, with backup contacts, home phone numbers, cell numbers, pagers, international dialing instructions, etc. In an emergency you don't want to be digging for details.
- assess the situation rationally. In other words, don't panic. What's the actual damage? How far did the problem really get? How old are the backup tapes? Stuff like that. When you communicate with people they're going to want those details. Don't waste a lot of time getting them, but have a reasonable assessment ready to go.
- don't be shy. Contact your boss ASAP and tell it the way it is. Keep the apologies short - if he's wise he'll worry about that side of the equation later. If the boss isn't available, contact his boss. Go as high or wide as you have to. They get paid the big bucks to manage issues, so let them manage it. Once you get them involved, they can worry about where to escalate next, but don't deprive them of that opportunity just because it's the middle of the night (and these things almost always happen in the middle of the night...)
- be proactive. Before any of this ever occurs have the discussions with your boss and other relevant parties. Assess your team's readiness and say so if you don't think you're ready. Decide what the right balance is for your organization in terms of cost of preparedness vs. risk tolerance. This also falls in the CYA category.
- soon after you start damage control, plan for the next shift. The problem may take a day or days to recover from. So don't call the whole team in. Or if it does occur during the day, send part of the team home so they can rest and especially sleep and thus be ready to take over when you've done 16 hours and can't think straight anymore.
- record what you do, when, and how. That helps with the shift change, with the post-mortem, and when you forget which solutions you've already tried.
- don't be shy about learning from the mistake. After the fact have a frank professional conversation with the interested parties to determine the lessons learned. Then it will be better next time.
Having said all that, I've got to go knock on some wood (as they say). It's amazing how few such issues I've been through in recent years, or at least how small they've been. But I do rest better knowing that I've thought these things through. Now where's that bottle of Grecian Formula...
Summary: the Vista and Office 2007 releases are great opportunities for computer managers. And just plain fun!
In the spirit of full disclosure I must admit I drink very substantially from the company KoolAid. Certainly more so than most. But that was true prior to actually working here (which is 8 years now). I suppose I've been a Microsoftie for about 16 years in that sense.
So days like today totally psyche me. Vista and Office 2007 hit the retail market today and the marketing hounds have been unleashed! Marketing materials are released worldwide in a huge variety of forms. From the inside we have the opportunity to watch it unfold, so I may have more visibility than you do, but I expect the marketing will make an impact over the next few days, weeks, or months. More importantly, the most recent wave of the 'Microsoft phenomenon', which started years ago, now meets reality. The products will be making an impact for the next few years. A LOT of good, sincerely passionate people have put a tremendous amount of sweat and tears into this wave so I believe it will be amazing. Maybe that's the KoolAid speaking - we'll see.
I think part of the reason the 'Microsoft phenomenon' excites me is that I'm old enough to remember when transistors and LEDs were an amazing technological breakthrough. Let alone the 4004. I remember arguing with one of my bosses about whether anyone really needed color monitors (not graphical monitors - just colors). Reminiscing with guys that did wirewrapping. So when I see our highly integrated digital lifestyles, I'm truly impressed. Here I am sitting in my home theatre/office watching HD video from my home province (B.C.), an internal company video, and my favorite shows from today (all with great sound) and I'm blogging away with you and getting some real work done (will the e-mails and IM's ever stop?). How cool is that? (Ok, I should get a life, but my wife and dog seem to understand).
This Vista and Office (and related products!) release is taking that reality to the next level. As a techie that does it for me. Why would anyone want to be in management or something when they could be living this technology wave?
So how does this all relate to computer management? The direct relationship is that we all have a lot of work to do in coming months and years to deploy the new products, set configuration standards, report on their status, patch them (ok, they won't be quite perfect), etc. Job security, baby! The indirect relationship is that our computer management tools get that much better - better reports than ever, easier administration, great security, unprecedented automation, etc. More fun than ever before, and it's been plenty fun enough so far.
And lest we forget, Microsoft is all about the community. There's something like 100,000 Microsoft employees, vendors, etc. but orders of magnitude more partners, MVPs, developers, writers, trainers, bloggers, newsgroup posters, etc. And yet more orders of magnitude of customers and their techies. WE have all contributed to this wave, and should take a bow for that. Now we get to go profit from it. Sweet.
Summary: 250,000 clients is not the same as 250,000 computers, and 250,000 clients over a 30 day cycle is not the same as 250,000 clients over a 14 day cycle. So let's define our terms before talking computer counts.
In this business we’ve got plenty of challenges. One of the least obvious (at least to those not intimately involved) would be counting clients. Your boss comes to you and asks “how many computers do we have?” “323”. “Thanks.” If your organization is a bit larger, you’re talking thousands. Some of us are going to have tens or hundreds of thousands. On those scales, the first problem we have is that you can’t personally run around and sanity check the answer. You can’t even call up 5 guys who can run around and do the sanity check for you.
But the bigger problem is, what is a “computer”? “Come on, Paul”, you say, “see that keyboard you’re typing on? Follow its cable to the end – that’s a computer.” End of discussion. Fair enough. But what if it dual boots? Sure, it’s still one computer, but if I’m doing security patches (aren’t we all?), then I want to patch both instances. So that’s 2 things. We’ll call them clients. If I patch one but not the other, then when the other client comes online it will have the potential to infect my other computers. Or mount a denial-of-service attack on the network. Or send confidential data to outsiders. Or cause the disk to be reformatted and thus serious data loss. So when I’m counting computers I really care about the one computer, but when I’m patching I really care about the two clients. What about if there are virtual PC’s on it? Same problem.
Eventually the computer is old and the user gets rid of it. Then it’s not a computer or client from my point of view. Will the user tell me they got rid of it? Surely not – that’s a tough problem. But a discarded computer doesn’t report any data onto the network (such as SMS data or Active Directory computer password resets or DHCP requests). So if I have the computer recorded in one of those databases and none of the data has been updated recently, then the computer is gone. Problem solved. Except that a computer that is offline also doesn’t update its data in those databases.
A computer can be offline because it’s on the road (maybe it’s on someone else’s network, but not mine). Or the user temporarily doesn’t need it and so it’s powered off. Or the user is on vacation. Or the computer is broken just enough that it doesn’t update the relevant database. All valid scenarios. That certainly complicates things, but vacations or road trips only last so long. Eventually the user needs the secondary machine. Eventually the broken machine gets fixed, somehow. So let’s give it a time-limit. If the computer reports within the time-limit, then it was gone temporarily. If it doesn’t report in that time, it’s gone permanently, and so we don’t count it any more. What’s the right time-limit? That will vary from organization to organization. Within Microsoft, it’s painful to take vacations longer than 1 week, so 2 weeks might be good. But we also have a lot of testers (almost every employee does some kind of testing, and almost all have more than one machine). Road trips can easily last a week or two. So we’ve long ago settled on 30 days. But if you don’t have many secondary machines or testers, then 2 weeks (plus maybe a few days) might be right, allowing for 2 week vacations. Fair enough? If so, we learned some things:
- Computers (physical machines) are not the same as clients (computer-like things that I want to patch)
- So the right answer to the boss’s question depends on whether he’s interested in doing asset management or patch management
- There are plenty of sources of answers to the questions of how many clients we have (SMS, AD, DHCP, physical counting, etc.)
- Off-line computers/clients are indistinguishable from decommissioned computers/clients
- Setting a time-limit allows us to make a reasonable guess to divide the off-line from the decommissioned
I suggest that as computer management specialists, we mostly deal with problems that relate to clients rather than computers. In other words, we want to know how many computer instances we should patch, or upgrade the software on. Sometimes we want to count the computers, for accounting purposes, but that’s a less common problem. So when we’re quoting counts, we should quote client counts unless someone asks us for computer counts. Even if they use the word “computer” we should ask whether they really want to know the count of physical machines. I also suggest that when we’re quoting client counts, we should specify our time-frame. So at Microsoft we currently have 250,000 30-day clients. Someone that has 250,000 14-day clients is more aggressive than we are in their client counting. If their user behavior is more predictable, allowing them to use that smaller time-limit, then they’re probably more accurate than we are. So I’d be more impressed by an organization that has 250,000 14-day clients than another that has 250,000 30-day clients.
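The time-limit idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the computer names and last-report dates are invented, and count_n_day_clients is a hypothetical helper, not anything provided by SMS.

```python
from datetime import datetime, timedelta

def count_n_day_clients(last_seen, now, days):
    """Count clients whose most recent report falls within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return sum(1 for seen in last_seen.values() if seen >= cutoff)

# Invented sample data: the last time each client reported any data.
now = datetime(2007, 1, 31)
last_seen = {
    "PC-A": now - timedelta(days=1),   # in daily use
    "PC-B": now - timedelta(days=10),  # user on a road trip
    "PC-C": now - timedelta(days=20),  # secondary test machine
    "PC-D": now - timedelta(days=45),  # probably decommissioned
}

clients_14_day = count_n_day_clients(last_seen, now, 14)
clients_30_day = count_n_day_clients(last_seen, now, 30)
print("14-day clients:", clients_14_day)
print("30-day clients:", clients_30_day)
```

The same four machines give two different client counts depending on the window, which is why a quoted count only means something when the time-limit is quoted with it.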
My scope in this blog is going to be computer management generally. Yes, I work for Microsoft and I have done SMS a long time (AKA Systems Management Server or System Center Configuration Manager or SCCM or CCM or ConfigMgr). But I did computer management prior to SMS, and have done it without SMS. And I believe that most of the challenges SMS faces are the challenges any of its competitors face. So for most of my postings it’s best to think of computer management in general, rather than SMS in particular.
I’ve had the pleasure of working with quite a number of great computer management professionals over the years, and we don’t always agree on many points, so it’s rather intimidating to pontificate on a topic that we’ll likely disagree about. But I’ve always embraced the discussions (debates / arguments). It’s an incredibly healthy way to evolve our art. I really hope we can continue that here. So that’s my ask of you: please add your comments to all my posts. Agree, disagree, see them another way, whatever. I know I’ll gain from it, and I hope others will as well. This blog only succeeds if you do that.
Further (and I suppose this isn’t surprising), I’m generally not going to dive deep into technical specifics, nor am I going to divulge ‘secrets’. My days involve a lot of that kind of stuff, but a huge fraction is specific to the internals of Microsoft. Or happens because we’re playing with very early versions of software (dogfooding, as we like to say). Or the story evolves dramatically over time and so the released reality is very different from the early stories I see. Or wiser people than I have a different perspective than I do. So it would be foolish to randomize you all with details I see. (But don’t worry, I’ll let you know my thoughts when I can – I think we’ll have some fun). So let’s talk…