GPO Corruption caused an IPSec LOCKDOWN on a Domain Controller
Last week was a nightmare for me. I have a major security inspection coming soon, and we have certain standards that are required on our domain controllers. So I spent the last month building a GPO for the domain controllers based on those security settings. The old policy was not put together very well and wasn't secure by any means.
Anyway, on May 1st 2010 I applied the policy to the first of my many DCs and everything went well. Seven days later I added a few more DCs. Everything was going well, and I was about 10 days into the new policy without issues. We did a Retina scan and found a few missing patches and a few other security issues, so I fixed them and rebooted the server. After the reboot, the server could not reach the network or the internet. My first thought was that maybe something had happened to the NIC or the port on the switch, so we did the following:
- Replaced the network cable
- Changed the NIC the server was using
- Changed the switch port the server was using
We even checked the firewall logs and tried pinging the server from the switch. Nothing worked: no response in or out of the server.
Next I rolled back all of the patches and fixes that I had applied, and still nothing.
Next we disabled McAfee AV and HIPS, and still nothing. The weird thing was that we could ping the machine while it rebooted, but as soon as it said "applying policies" the ping would stop. I reviewed the event logs and found several error codes 1085 and 8194 every 5 minutes, but the errors didn't make any sense to me, so I didn't know what to do. I thought there was a policy issue but I had nothing to prove it, so I even rolled the machine back to the old policy. There were NO IPSec errors of any kind, so I didn't mess with that. As far as I knew, IPSec was still disabled.
Oh, and the other DCs with the new policy were all still running without issue. So I didn't think it was something in the new GPO, but I rolled it back just in case.
After 5 hours of troubleshooting I went home for the night, because Daddy needed to watch the kids so my wife could go to work, and came back in bright and early the next morning. I did some research from home that evening, but I couldn't work on the DC because I couldn't connect to it. I posted to the guys on the SMS and AD GPO lists to see if any of the awesome minds there had any clue as to what the heck I did. I got to work at 6 AM and started working again. By 8 AM my boss was in, so I asked if we could contact MS because I was clearly in over my head and drowning fast. He approved the request, so I put in the ticket and waited in the callback queue. But like any good tech, I kept trying to figure out whether I could fix the issue myself and noted every step I took, in hopes that I could somehow fix this mess before the 6-hour callback wait was up.
Michael Hennessy on the SMS list suggested that it sounded like an IPSec issue and that I should be seeing some error codes for it, but I was not. He suggested several KB articles, including this one: http://support.microsoft.com/kb/912023 . I reviewed it and thought, aw what the heck, I don't have the error, but at this point I have nothing to lose so I will give it a shot. The first thing it says is to delete the reg keys associated with the policy. Well, the keys were missing. That made me suspicious, so I continued on. It had me re-register polstore.dll. As soon as I did, the internet and network started to work again. (FYI, I am now 10 hours into troubleshooting and 3 hours into the MS ticket callback queue.)
So I am ecstatic that I got the network working again, but I still needed to fix my policies, which were now corrupt, and I could not get GPUPDATE /force to work. I just kept getting error codes 1085 and 8194 every 5 minutes. So back to Google, where I found an entry that stated to delete
So I backed it up and deleted it. Then I ran GPUPDATE /force again and the errors stopped.
So now it is about 24 hours after the issue started and I have everything fixed, but I still don't know what went wrong. I reapplied the patches and re-enabled McAfee and HIPS. Everything is still working. So we contacted MS, who had not gotten to our callback yet (7 hours into the queue), and asked for a root cause analysis.
Well, I finally got a call back yesterday (6 days after the issue). I would say that is bad customer service, but we just had trouble matching up schedules with MS, and frankly I was fixed so I wasn't in a hurry at that point.
Anyway I spoke with an MS Tech yesterday and here is the conclusion:
Something interrupted Server 2003 while it was writing the GPO settings to the registry during a GPO refresh (happens every 90 min). Our best guess would be HIPS protecting the registry, but we have no way to prove or disprove it. So I am not going to blame it, but I still despise the product…
Something I didn't know was that every time your GPO is applied, the system deletes all of the reg key entries from the previous policies and re-applies all the settings with fresh keys. It happens so fast that you don't see the keys being erased and re-written.
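To picture why an interruption at the wrong moment is so nasty, here is a toy Python simulation of the wipe / re-write cycle as the MS tech described it to me. It is only a sketch: the dictionary stands in for the registry, the key names are made up, and `refresh_policy` and `interrupt_after` are hypothetical names I invented for the illustration, not anything Windows exposes.

```python
class RefreshInterrupted(Exception):
    """Raised when the simulated refresh is stopped partway through."""


def refresh_policy(registry, new_settings, interrupt_after=None):
    """Simulate a policy refresh: wipe the old keys, then write fresh ones.

    interrupt_after is a hypothetical knob: the number of keys written
    before something (for example, a security product) blocks the write.
    """
    registry.clear()  # step 1: every old policy key is erased
    for count, (key, value) in enumerate(new_settings.items()):
        if interrupt_after is not None and count >= interrupt_after:
            raise RefreshInterrupted(f"stopped after writing {count} key(s)")
        registry[key] = value  # step 2: settings are re-written one by one


# Made-up key names, for illustration only.
settings = {"IPSec/Policy": "domain-policy", "Audit/Logon": "enabled"}
registry = dict(settings)

# A normal refresh ends with the same keys it started with.
refresh_policy(registry, settings)

# A refresh interrupted right after the wipe leaves nothing behind.
try:
    refresh_policy(registry, settings, interrupt_after=0)
except RefreshInterrupted as err:
    print("refresh interrupted:", err)

print("keys left after interrupted refresh:", registry)
```

The point of the sketch is the ordering: because the wipe comes first, anything that stops the re-write leaves the machine with fewer policy keys than it had before the refresh began.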
When this corruption occurred, it caused the IPSec policy to be erased and put the server into an IPSec (built-in firewall) LOCKDOWN (nothing in / nothing out). This is by design, and this is why KB912023 fixed the disconnection issue.
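The "by design" part is what people usually call failing closed: if the policy cannot be read, block everything rather than let everything through. Here is a rough Python sketch of that idea only. It is not how the Windows IPSec driver is actually implemented, and `allow_traffic`, `allowed_ports`, and the port numbers are assumptions I made up for the example.

```python
def allow_traffic(ipsec_policy, port):
    """Toy fail-closed check: a missing or empty policy blocks everything."""
    if not ipsec_policy:
        # No policy could be loaded, so nothing goes in or out.
        return False
    return port in ipsec_policy.get("allowed_ports", ())


# Hypothetical policy, for illustration only.
healthy_policy = {"allowed_ports": (53, 88, 389)}
erased_policy = {}

print(allow_traffic(healthy_policy, 389))  # permitted by the policy
print(allow_traffic(erased_policy, 389))   # blocked: policy is gone
```

That matches what I saw: the server answered pings during boot, then went silent the moment policies were applied and the (now missing) IPSec policy should have been loaded.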
Apparently this corruption is a known issue, and the MS tech provided a hotfix that we should apply to all the DCs so it doesn't happen again: http://support.microsoft.com/kb/951059 . This is a post-SP2 hotfix for Server 2003. Installing hotfix 951059 will cause the system to back up the registry settings and replay them if it is interrupted during the policy wipe / re-write process.
So in conclusion, I hope this blog entry helps someone else if they get locked out by IPSec and have no clue what is happening.
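As I understand the tech's description, the hotfix boils down to "take a backup before the wipe, and put it back if the re-write does not finish." Below is a minimal Python sketch of that pattern. It is my own illustration, not Microsoft's code: the dictionary stands in for the registry, and `refresh_with_backup`, `good_writer`, and `blocked_writer` are hypothetical names.

```python
import copy


def refresh_with_backup(registry, new_settings, writer):
    """Wipe and re-write policy keys, replaying the backup on failure.

    Returns True if the refresh completed, False if the backup was replayed.
    """
    backup = copy.deepcopy(registry)  # snapshot taken before the wipe
    try:
        registry.clear()
        writer(registry, new_settings)
        return True
    except Exception:
        # The re-write was interrupted: restore the previous settings.
        registry.clear()
        registry.update(backup)
        return False


def good_writer(registry, new_settings):
    """Writes every setting successfully."""
    registry.update(new_settings)


def blocked_writer(registry, new_settings):
    """Simulates something blocking the registry write after the wipe."""
    raise PermissionError("registry write blocked")


# Made-up key names, for illustration only.
old = {"IPSec/Policy": "domain-policy"}
new = {"IPSec/Policy": "domain-policy", "Audit/Logon": "enabled"}

registry = dict(old)
print(refresh_with_backup(registry, new, blocked_writer), registry)
print(refresh_with_backup(registry, new, good_writer), registry)
```

With the backup in place, the blocked refresh leaves the old keys intact instead of an empty policy, so the server would keep its previous IPSec settings rather than dropping into lockdown.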
Chris Stauffer <><