The last couple of weeks shortened my life by at least a year. I don’t think I need to explain how I felt when backups and restores with DPM stopped working, we got them working again in 2 days, they worked for 4 more days, stopped working again, and we repaired them again and again and again. For 3 weeks I felt like I was in the movie “Groundhog Day”.
Since I would not like you to go through the same frustration as I did, I decided to share it with “The Community”, but please forgive me for not being as detailed as Danny Rubin was when writing the screenplay for the movie mentioned above.
We have been (and are again) happy users of System Center DPM for a couple of years now. We in fact migrated from another backup software (which I don’t want to mention here), since it was just what was written on the box. Backup Software. Who would have thought we might use it once in a while to make a Restore. OK, maybe I exaggerate, but the fact is that at least twice we were not able to restore, even though there was data on tape which could have been restored. And this was twice too many!!!
So after a couple of happy years, with DPM version 2012 installed on Windows Server 2008 R2, our server started to act strangely:
- backup of FileServer was extremely slow; about 1GB per hour,
- restore of data from disks to the original location on FileServer was slow as well (20 minutes for a couple of MB), while restoring to an alternate location was done in seconds,
- restarting the DPM server took more than one hour; OK, the start of Windows was quite fast, but when you logged on after the restart, Windows was “Welcoming” for the next hour,
- sometimes backups were working for a couple of hours and then suddenly no job was processed anymore, and a couple of running jobs were frozen at a couple of Bytes and neither failed nor continued (once I even waited for 2 hours and not a single additional Byte was saved),
- a couple of times when we tried to restart, the server ended in a “blue screen” and the memory dump pointed to a disk driver error,
- there was no relevant error or warning in the Event logs (to make troubleshooting even harder).
Since this is definitely not behavior I would tolerate on a production DPM server and we could not solve the problem ourselves, we called for help. During those 3 weeks, while DPM was working and then not working again, we received help from Microsoft Customer Support, a local Microsoft partner and finally a local Microsoft Services employee.
I don’t want to go through all the details, but here is what we tried and what we found out through the whole ordeal (not necessarily in chronological order):
- Since we suspected a server hardware malfunction, we moved the system hard drives into another server with identical configuration –> no difference at all,
- Due to suspicions that there might be something wrong with the JBOD array where the DPM replicas are stored, we booted with DPM and its SQL turned off and the JBOD shelves disconnected –> no difference at all,
- During the investigation we found out that the HBA had 3-year-old firmware, but the upgrade software did not want to recognize the HBA at all. So we replaced it with a new one –> this finally helped and even a reboot was done in 15 minutes. BUT it worked only for 4 days, then the old story was back,
- While checking and double-checking the installation documentation we realized that the size of data written to DPM disks had grown. This was troublesome for our PageFile (PageFile requirements for DPM can be found at http://technet.microsoft.com/en-us/library/hh757814.aspx), which was getting too big for the drive it was on. –> replacing the “swap disk” with a more spacious one did not help,
- During one of the reboots, where we had disabled all DPM and SQL services, I observed one strange phenomenon. The server was again taking its time to boot, and while we waited for Welcome to disappear, all disk arrays were peaceful (lights were on, as when a disk is idle), but the 1st of the 3 JBOD shelves was blinking like Las Vegas at night. –> The OS was obviously doing something on those disks and this finally pointed us in the right direction.
Since the specialists who had been dealing with our issue until then were not available anymore, we got new help from somebody in Microsoft Services who had not done anything with DPM since version 2010. However, he was our savior. He had had a similar case a couple of years earlier, where the problem was not with DPM, but with the Windows Server operating system. All details can be found in the KB article (http://support.microsoft.com/kb/934234). We were lucky again that he had an already compiled EXE close at hand, so we were able to proceed immediately.
It took this application more than 4 hours to find and safely remove our stale, tomb-stoned, forgotten, un-used… VSS disk references in the Registry. After the restart of the server, which was back to normal time, no service complained that any of the 18966 removed devices was missing. This also includes DPM, which is happily doing its Backup and RESTORE job again.
And we live happily (ever?) after.
P.S. The HotFix mentioned in Ben Armstrong’s blog (http://blogs.msdn.com/b/virtual_pc_guy/archive/2010/06/14/hotfix-hyper-v-backup-can-cause-slow-system-boot-large-registry-files.aspx) did not help in our case, and for now we are still running the cleanup tool by hand, but we will put it on a daily schedule.
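
One way such a daily run could be set up is with the built-in Windows `schtasks` command. The Python sketch below only builds the command line and leaves the decision to run it to you. The tool path, task name and start time are hypothetical placeholders, not values from this post, and the sketch assumes the cleanup EXE can run unattended without arguments.

```python
import subprocess


def build_daily_task_command(tool_path, task_name="StaleDeviceCleanup", start_time="02:00"):
    """Build a schtasks argument list that runs tool_path once a day.

    tool_path, task_name and start_time are illustrative assumptions;
    tool_path is expected to contain no spaces in this sketch.
    """
    return [
        "schtasks", "/Create",
        "/TN", task_name,    # name shown in Task Scheduler
        "/TR", tool_path,    # program to run
        "/SC", "DAILY",      # schedule type: every day
        "/ST", start_time,   # start time in 24-hour HH:mm format
        "/RU", "SYSTEM",     # run under the local SYSTEM account
        "/F",                # overwrite an existing task with the same name
    ]


def register_daily_task(tool_path, **kwargs):
    """Create the task; must be called from an elevated prompt on Windows."""
    subprocess.run(build_daily_task_command(tool_path, **kwargs), check=True)


if __name__ == "__main__":
    # Hypothetical location of the cleanup tool; printing only, nothing is changed.
    print(" ".join(build_daily_task_command(r"C:\Tools\cleanup.exe")))
```

Calling `register_daily_task(r"C:\Tools\cleanup.exe")` would actually create the task. Given that the first cleanup took more than 4 hours, it seems sensible to pick a start time outside the backup window.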