CM12 MP and DP with no Server GUI

Here is something I’ve wanted to try forever – heck since they used to call it Server Core.
For my role servers like the MPs or DPs, would CM still work if I removed the GUI from the OS?  Because Server 2012 R2 lets you take the Windows shel…

Read More»

CM12 R2 and SQL14 – Sneak Peek

Does Microsoft support SQL 2014 on ConfigMgr 2012 R2 yet? Nope. But we should get word soon about it and I’m basing that speculation on the fact that I went ahead and tried SQL 2014 in my lab last week and it works! And a week later and into Patch Week,…

Read More»

Considerations for Enabling Distributed Views on a CAS

So what are distributed views and why use them?

I recently posted about an issue where I was forced into enabling distributed views (DV) for my primary sites.  Technically, I enabled DV for the hardware inventory link (there are two others as well).  We had looked at using DV earlier, but even in SP1 there were issues.  But with R2, we’ve found them to be fully functional.

So what are DV?  Simply put, you tell your primary sites to keep client data instead of replicating the data up to the CAS.  Normally DRS replicates data up via SQL (CM07 would copy files to the central site’s inboxes and let SQL import that data all over again).  DRS uses a variation of the SQL Service Broker (instead of merge or transactional replication) to copy this data between sites.  But for 3 major groups of data you can tell the primary site just to hold onto it, and if you need it for reporting, SRS can grab that data from the primary sites on the fly.

Distributed Views

CM accomplishes this on-the-fly generation of data via linked servers in SQL.  When you enable DV on the link to a primary site, all of the replica groups in that link stop replicating and a linked server is created in SQL.  The local ConfigMgr_DViewAccess group on the primary site is populated with the machine account of the CAS as well as the execution account SRS uses.
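If you want to confirm the linked servers were actually created, here’s a quick check you can run on the CAS SQL instance (just a sketch; sys.servers is standard SQL Server metadata, and what comes back depends on which links you’ve enabled DV on):

-- List linked servers on the CAS instance; with DV enabled you should see an entry per primary
SELECT name, data_source, provider
FROM sys.servers
WHERE is_linked = 1;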

Note that if you have other users running ad hoc queries against SQL and they require client information, you’d want to put their AD group or ID into that local group.

Now think about R2 and how SRS respects RBAC.  When a user comes in via SRS wanting information about clients, an RBAC function has to work out what that user is permitted to see: he isn’t granted permission to see all collections or devices, only what falls within his scope.  For SRS to grab that data, which now lives on the primary sites and not the CAS, means we need to worry about constrained delegation.

Constrained delegation is where you set the machine account (CAS$) to be trusted to pass on a user’s ID via Kerberos to a child site.  Open the CM console to the CAS and look at devices: it’s CAS$ talking directly to the primary site to get you the data in the console.  It can do this because there is one hop from the CAS to the primary site.  But open the SRS site from a desktop, which connects to the CAS and then has to get data from the primary, and you’re making 2 hops and Kerberos won’t recognize you.  To make that leap you need constrained delegation.

[Image: constrained delegation settings]

Note: the SPN on the server should match what you have granted in the image shown.
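One sanity check worth running once the SPNs and delegation are in place (a sketch using a standard SQL Server DMV; it only reports on your own connection): confirm SQL is actually authenticating you via Kerberos rather than falling back to NTLM, since NTLM can’t make that second hop.

-- auth_scheme shows KERBEROS or NTLM for the current connection
SELECT session_id, auth_scheme
FROM sys.dm_exec_connections
WHERE session_id = @@SPID;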

Generally, constrained delegation should work without issue.  But we found in practice that certain scenarios can break it.  For everyone.  How?  We’re thinking that anyone coming in from a non-trusted domain poisons the Kerberos cache, giving mixed results to users.  Sometimes users will be shown an anonymous logon error instead of seeing their SRS report results.  We had a case open with Microsoft for a week on the issue and they never found the source of the failure, no matter how many data dumps we gave them.  So we temporarily cheated with local SQL accounts on the linked servers.

So back to why and when you’d enable DV.  We were told by the lead developer of CM that we’d be crazy not to use DV if a primary site sits next to the CAS.  So we tested that and actually ran with it for a while.  But back in SP1 there were issues and we decided to just go back to full DRS.  But what about servers not close to the CAS?

I explained previously that enabling DV saved DRS from sending 1 terabyte of hardware inventory back to the CAS.  That is a lot of data no matter how you look at it.  From that point of view, it actually makes sense to enable DV everywhere: save your network all of that replication and leave the data on the primary servers.  Grab it only when you need to view it.  Heck, why isn’t it the default?  Why didn’t Microsoft just have CM use DV all the time?  The answer is that there is a bit of a cost.

Reports are obviously slower because the primary sites are not close: they have to run the query SRS wants locally and then send the data across the WAN.  That doesn’t give the best experience to the end user.  But it gets worse.  We found that if we simply ran a query directly against each primary instead of going through the CAS, reports would run even faster.  Microsoft has yet to explain why, though they are looking into an answer for us.  I suspect that answer is going to be the explanation for why DV isn’t enabled by default: reports are just going to be slower this way.
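To make that concrete, here’s the kind of query an inventory report runs (v_GS_COMPUTER_SYSTEM is a standard hardware inventory reporting view; this particular query is just an illustration).  With DV on, a request like this against the CAS gets satisfied through the linked servers, so each primary runs it and ships its rows back across the WAN.

-- Typical hardware inventory report query; with DV enabled the rows come from the primaries
SELECT sys.Name0, sys.Manufacturer0, sys.Model0
FROM v_GS_COMPUTER_SYSTEM AS sys
ORDER BY sys.Name0;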

That isn’t all bad.  Recall that we can tell SRS to run reports on a schedule and cache the results when we want it to.  So if you set a report to run at 8am and a user wants that data off SRS at 9am, it can send the 8am results back nearly instantly to the user.  There is even a benefit here in that the user isn’t able to pound on the server by running the report often.  You get full control via SRS on how long that cache is maintained and how often SRS should just run the report in the 1st place.

And of course, this means extra work on your part to find the most commonly used reports and start setting them up for schedules in SRS.  That’s work.  Most admins will just take a pass and let the SRS user wait.  I’d caution against that for one reason only:  managers and execs tend to think a report is a reflection of how fast CM is.  A slow report must mean CM is on its last legs.  Better go find a new product!  So you’d really better consider some schedules on common reports.

Key takeaways here (the 1st being don’t set up a CAS): if you have a primary next to the CAS, you might very well want to enable DV for one or all replication links, and if you do so, make sure your SPNs are cleanly set up and enable constrained delegation.  And watch that local DV group on the primary, as you might need to add other users who run CM queries.  Finally, the recommendation is to enable DV on a primary very close to the CAS to help speed those reports, but if your WAN links are fast enough like ours are, it does work if you enable it on others too.

Also note that disabling DV on a link means you’re suddenly going to have a lot of data replicating all at once to your CAS, so keep that in mind before enabling DV in the 1st place or deciding you want to abandon it.

Read More»

Be CAS I told you not to! More CAS pain points with CM12

Well I told you not to install a CAS, didn’t I?  But of course I had no choice since my team supports the servers that manage 365K clients.  Well, we had one heck of a week last month.

It started with a phone call from Sherry Kissinger waking me up to say we had a replication issue.  She said she dropped a pub file to get the Hardware_Inventory_7 replication group to sync.  My 1st thought was that our Alabama primary site was holding onto files again with SEP (Symantec Endpoint Protection) not honoring exclusions.  We had files stuck in despoolr.box the prior week and entries in the statesys.log showing that it was having problems trying to open files.  I told her to uninstall SEP and I’d go into my office and log on.

So I logged on and started looking around.  Our CM12 R2 environment consists of a CAS & 3 primary sites.  The monitoring node in the console was showing all sorts of site data replication groups in degraded status.  That’s because the CAS was in maintenance mode and not active (easily seen with exec spDiagDRS).  Primary sites hold on to their site data until they see the CAS active.

By the way, if you own a CAS and run into DRS issues, you’ll become well acquainted with spDiagDRS.  It’s a harmless query you can run right now to look at how your DRS is doing.
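If you’ve never run it, it’s as simple as this from SQL Management Studio against your site database (CM_CAS here is our CAS database name; substitute your own):

-- Read-only diagnostic: shows replication link, queue, and initialization status for DRS
USE CM_CAS;
EXEC spDiagDRS;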

What Sherry was seeing there was normal.  It would look that way until the primary site caught up with Hardware_Inventory_7.  But that was only the beginning of the story.

When you run RLA (Replication Link Analyzer) on a primary against site data (or drop a pub file as a last resort), the primary dumps all the tables in that replication group into bcp files and compresses them into a cab file in the rcm.box.  It copies that cab file up via the sender to the despoolr.box\receive folder on the CAS, which decompresses it back to a folder in the rcm.box on the CAS.  Then RCM takes over and parses each bcp file, grabbing 5000 rows at a time and merging them into its table in the CM database.

RCM = Replication Configuration Monitor (the SMS_REPLICATION_CONFIGURATION_MONITOR component)

RCM activity can be viewed in rcmctrl.log, and viewing that on the Alabama primary showed that it failed with “an internal SQL error”.   Huh?  The SQL logs were clear.  A colleague noticed that the D drive on the primary site was full.  With a database as large as ours, the 100GB we gave for the CM12 folders and inboxes wasn’t enough for the 120GB of data in that one replication group.

We quickly got another 200GB of SAN added to those D drives and another 600GB added to the CAS’s D drive (so that it could fit up to 200GB of data per site should all 3 sites need to send up recovery files at once).

Then we restarted the RCM thread on the primary site and this time it had the room to finish.  But it took forever to compress that data into a cab file to send to the CAS, and it took a long time for the CAS to decompress it.  Then RCM does this excruciatingly slow slicing of 5000 rows at a time to merge into its database.  That took all night to run.  (I assume it does a few tables at a time because if you were doing a restore for a CAS, all client data tables would be sent up at once.)

But the story gets worse.

After all night of working on this we hit an error stopping the entire process.

Error: Failed to BCP in.

Error: Exception message: ['ALTER TABLE SWITCH' statement failed. The table 'CM_CAS.dbo.INSTALLED_SOFTWARE_DATA' is partitioned while index '_sde_ARPDisplayName' is not partitioned.]

Error: Failed to apply BCP for all articles in publication Hardware_Inventory_7.

Will try to apply BCP files again on next run

What the deuce?  Well that one was our fault for making our own index.  Quick fix: remove the index.  OK.  GO!  C’mon, do something!  But nothing happened.  We waited for an hour and it was clear that it was just not going to start again.  So we ran RLA and started all over again.
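For the record, the “quick fix” was nothing fancier than dropping the custom index named in the error (a sketch; the index and table names here are the ones from our error message, yours will differ):

-- Run in the CAS database (CM_CAS for us): our custom index wasn't partition aligned,
-- which broke the ALTER TABLE SWITCH during the BCP import
DROP INDEX [_sde_ARPDisplayName] ON dbo.INSTALLED_SOFTWARE_DATA;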

All of this takes time so what I’m describing is now the 3rd day with site data down (inventory, status messages, etc.).  We told all the admins that their data could be seen on the primary sites and that work would go out as normal, but all the nice SRS reports we got them used to using were rather useless because the CAS had stale data.

The next attempt got the first table of the replica group done but blew up on the second for the same reason as before.  Yes, we forgot to go look at other indexing our team might have done.  Oops; so we disabled all the rest.  But now what?  There was no way we could let this go for another 8 hours to do this one replica group again.

We kept running ideas past our PFEs but they had nothing helpful.  I don’t blame them because few people know DRS well.  Last year, after a 3-week CSS ticket, we gave up on our lab and rebuilt it after Microsoft couldn’t fix it.

So how could we kick-start RCM to pick up where it left off instead of starting all over?  It had already consumed a 78GB table of the bcp files.  It was just unbearable to consider starting over.  So we rolled up our sleeves and came up with this:

  1. update RCM_DrsInitializationTracking set TryCount=0 where InitializationStatus=99
  2. update RCM_DrsInitializationTracking set InitializationStatus=5 where InitializationStatus=99
  3. In the root of rcm.box, create a file which has the exact name of the bcp folder inside the rcm.box folder and  a file extension of .init

The failed group had gone to a 99 in InitializationStatus (simply run select * from RCM_DrsInitializationTracking to see all replication groups) and it had a TryCount of 3 on it.  After 3 tries, it just stops trying.

[Screenshot: RCM_DrsInitializationTracking query results]

Setting TryCount to 0 did nothing.  Setting Initialization Status to 5 still didn’t kick it.  But adding the init file (and yes, it has to be all 3 tasks) finally got RCM to look at the folder and pick up where it left off.
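Pulling those three steps together, this is roughly what it looked like (a sketch; run against the CAS site database, CM_CAS being our database name):

USE CM_CAS;

-- Step 1: reset the retry counter for groups stuck in failed initialization (status 99)
UPDATE RCM_DrsInitializationTracking SET TryCount = 0 WHERE InitializationStatus = 99;

-- Step 2: set those groups back to an in-progress initialization status
UPDATE RCM_DrsInitializationTracking SET InitializationStatus = 5 WHERE InitializationStatus = 99;

-- Step 3 is a file system task, not SQL: in the root of rcm.box, create an empty file named
-- exactly like the bcp folder with a .init extension so RCM re-reads that folder.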

Then what?  Well once that was done, the CAS told all the primary sites that their data was stale.  Now I would like to think that they would already know what they sent before and just send up the deltas, but nooooooo!  What the CAS does next is to drop the tables of each replica group into bcp files and send them down to the primary sites.  Why?  I assume that this must be used for them to compare and then send up the deltas.

Looking at how long that was going to take got depressing fast.  The CAS is a 1.5TB database and the primary sites are 600GB.  We’re talking about a lot of data even when it’s delta.  Is there such a thing as “too big for DRS?” Because I think we fit the bill.  We simply were not going to catch up.

So I proposed we go with the nuclear option: Distributed Views.

What is that and why do it?

Distributed Views is where you tell your primary site to hold onto client data locally and not send it to the CAS.  Navigate to Monitoring\Database Replication\Link Properties and you’ll see you get 3 boxes of groups which you can decide to just leave on the primary site.  We chose the Hardware Inventory box because that was really the site data not replicating.  So we figured, leave it on the primary and hope the WAN links hold up when running SRS reports or anything in the console that has to go look at client data.

[Screenshot: Distributed Views options on the replication link properties]

We did this for the Alabama site to start with.  After the other primary sites showed that they still were not recovering, we enabled distributed views on them too, one at a time.  90 minutes after enabling distributed views, all of our links were finally showing successful status again.

How does this affect SRS reports if the data isn’t on the CAS?  Well, under the covers, SRS has to go to each site to run the query if the view contains client-related data.  That can put load on your primary sites, but less load than all that site replication, I would think.  We had actually had this enabled at one time on the Minneapolis site that sits next to the CAS.  We disabled it only because we found some issues and were concerned that it wasn’t ready to use yet (see Sherry’s blog for one issue).

The downside of trying distributed views is that there simply isn’t going to be an easy way to turn it back off.  Once you undo it, your primary is going to have to send all of that data to the CAS.  And for large sites, this is very painful if not impossible.  If we ever get the confidence to disable distributed views, I think we’d have to disable hinv, disable DV, enable hinv, and let all that data come back as full inventory per the client schedules.  To put into perspective how much inventory would have to replicate up: our CAS database is now 1TB smaller.

I said “if we ever get the confidence” to go back to sending all data up, we might do it, but we don’t have confidence right now.  Why?

First off, that slicing of the bcp file seems to be hard coded at 5000 rows at a time.  For tables that are millions of rows in size, the wait is agonizing and just not palatable.  Should we ever have to reinitialize a replication group, we simply cannot afford a day or days of waiting.  We’ve asked our PFEs to see if there is some hidden switch to override that 5000 row number and if there isn’t one, we’ll submit a DCR.  This really needs to be addressed for very large sites.  We need to be able to tune that batch size (my servers could probably consume 50K rows at a time).  And RCM isn’t multithreaded.  Crazy.

As for why DRS had ever failed in the first place, well it didn’t.  It was degraded and we should have tried to let that one catch up.  You have to know which tables you’re messing with for both RLA and pub files.  Here is a query to help you know what is where:

 

SELECT Rep.ReplicationGroup,
       Rep.ReplicationPattern,
       App.ArticleName,
       App.ReplicationID
FROM vArticleData AS App
INNER JOIN v_ReplicationData AS Rep ON App.ReplicationID = Rep.ID

And you can combine that with:

exec spDiagGetSpaceUsed

Those queries should help you make an educated decision on dropping pub files on huge sites.  Generally running RLA is safer, but you should still look at the data to see what’s going on.

Read More»

CM12 R2 Toolkit

I just tweeted a while ago that I couldn’t wait to get my hands on the new Toolkit for R2 because one of the new tools, the Collection Evaluation Viewer, is something that should really help us keep coleval flowing smoothly. It’s rather common for us t…

Read More»


CM issues with MS13-052 KB2840628

Update from Microsoft on this issue: http://blogs.technet.com/b/configmgrteam/archive/2013/07/17/issues-reported-with-ms13-052-kb2840628-and-configmgr.aspx

Ah, the importance of testing patches! I was up with John Nelson until 4:30 on Wednesday night trying to figure out why the CM07 clients couldn’t get content. Evidently, one of the .NET patches this month causes issues.

 

CM07 only:

We found that this update was causing errors in the MP_Location.log – clients couldn’t get location data to find DPs.

CMPDBConnection::ExecuteSQL(): ICommandText::Execute() failed with 0x80040E14
CHandleLocationRequest::CreateReply failed with error (80040e14).

Uninstalling the patch from our Server 2008 R2 MPs and rebooting them cleared the issue up. And actually, if you stop SQL before uninstalling the patch, you won’t have to reboot (just remember to start it again). Also, this may affect only MP replicas since I have not heard of other people having the issue.

We also found that it killed the ability to generate a snapshot on the primary sites, so we removed it from them as well.

For CM12:

Microsoft is hearing reports about this patch too. Here is what they had to say about it so far.

Issue 1:

Database replication between sites (CAS/Primary/Secondary) with SQL 2012 will fail.
The rcmctrl.log file on the failing site(s) will contain entries similar to the following:
//
Launching 2 sprocs on queue ConfigMgrDRSQueue and 0 sprocs on queue ConfigMgrDRSSiteQueue. SMS_REPLICATION_CONFIGURATION_MONITOR
The asynchronous command finished with return message: [A .NET Framework error occurred during execution of user-defined routine or aggregate "spDRSActivation": ~~System.TypeInitializationException: The type initializer for 'System.Data.SqlClient.SqlConnection' threw an exception. ---> System.TypeInitializationException: The type initializer for 'System.Data.SqlClient.SqlConnectionFactory' threw an exception. ---> System.TypeInitializationException: The type initializer for 'System.Data.SqlClient.SqlPerformanceCounters' threw an exception. ---> System.MethodAccessException: Attempt by method 'System.Configuration.TypeUtil.CreateInstanceRestricted(System.Type, System.Type)' to access method 'System.Diagnostics.SwitchElementsCollection..ctor()' failed. ---> System.Security.SecurityException: Request failed... [truncated for readability]
//

Temporary workarounds
While investigation continues into the best long term solution, the following short term changes can be made to unblock customers in this state:
In SQL Management Studio on the affected server, change the Permission set to Unrestricted for the MessageHandlerService Assembly. This is done in the Assembly properties via:
SQL Server -> Databases -> (Site Database) -> Programmability -> Assemblies -> MessageHandlerService
Once the change is made, replication between sites should automatically recover within 5-10 minutes.
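If you’d rather script it than click through SSMS, the same change can be made with T-SQL (a sketch; “Unrestricted” in the SSMS dialog corresponds to the UNSAFE permission set, and the assembly name is the one from the workaround above):

-- Run in the site database; equivalent of setting the assembly's Permission set to Unrestricted
ALTER ASSEMBLY [MessageHandlerService] WITH PERMISSION_SET = UNSAFE;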

Issue 2:
Software Update Point synchronization may fail at the end of the sync process. The WSyncMgr.log will have entries similar to the following:
//
error 14: SQL Error Message Failed to generate documents:A .NET Framework error occurred during execution of user-defined routine or aggregate "fnGenerateLanternDocumentsTable": ~~System.TypeInitializationException: The type initializer for 'System.Data.SqlClient.SqlConnection' threw an exception. ---> System.TypeInitializationException: The type initializer for 'System.Data.SqlClient.SqlConnectionFactory' threw an exception. ---> System.TypeInitializationException: The type initializer for 'System.Data.SqlClient.SqlPerformanceCounters' threw an exception. ---> System.MethodAccessException: Attempt by method 'System.Configuration.TypeUtil.CreateInstanceRestricted(System.Type, System.Type)' to access method 'System.Diagnostics.SwitchElementsCollection..ctor()' failed. ---> System.Security.SecurityException: Request failed... [truncated for readability]
//

Temporary Workarounds
Similar to Issue 1, the SMSSQLCLR assembly Permission Set can be changed to Unrestricted. From SQL Management Studio:
SQL Server -> Databases -> (Site Database) -> Programmability -> Assemblies -> SMSSQLCLR
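And the T-SQL equivalent for Issue 2, under the same assumptions as above:

-- Run in the site database; same permission change, different assembly
ALTER ASSEMBLY [SMSSQLCLR] WITH PERMISSION_SET = UNSAFE;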

Patch Uninstall

Uninstalling KB2840628 has been reported to resolve all issues.
However, removal of a security patch should not be a blanket recommendation; instead, anyone who wishes to uninstall until a permanent solution is found should assess the risk of exposure in their own environment. Details on the security vulnerability can be found here:
https://technet.microsoft.com/en-us/security/bulletin/MS13-052

Read More»