Yum ate my Cacti!
Jumping off the Microsoft train for a post.
We use Cacti for circuit / router / switch monitoring. It's free, it does everything we need it to do, and it's easy to get going if you use the CactiEZ bootable distro.
So after some minor hiccups involving an ill-advised upgrade and other technician-induced error, Cacti had been humming happily along for months. It had drive space issues for a while, but that was easily resolved by adding another volume to the VM, mounting it, and moving the MySQL data and .rrd files over to it. No mess, no fuss.
I came in this morning to the dreaded "Cacti is down" email message from one of our techs. My first thought was disk space. Occasionally someone will set the Cacti log file to verbose and forget to turn it back. This was not the case.
Mysqld was not running and would not start. However, the error was not about disk space. It complained that the errmsg.sys file did not contain the correct number of error messages. This was a new one.
While searching around the installation I noticed that all of my RRD files were gone. Kaput. MIA. A quick check of the nightly backups showed that they were missing from the last three nightlies as well.
Here's what happened. The yum installation in the CactiEZ distro is set to automatically update itself periodically. On July 2 it upgraded MySQL. At that point MySQL would not start, because the upgrade process failed to update the copy of errmsg.sys on the new partition where I had moved the MySQL data.
As part of its cleanup process, Cacti deletes RRD files that are over X days old. In my environment I had this set to two days. This is to prevent RRD files from nodes that no longer exist from clogging your system. When MySQL quit, Cacti stopped updating the RRD files. As soon as they had a date that was over two days old, Cacti dutifully deleted them all.
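To make the failure mode concrete, here is a rough Python sketch of an age check like the one described above. It is not Cacti's actual cleanup code. The function names and the directory argument are made up for illustration, and it assumes "age" means the file's last-modified time. It only reports files and deletes nothing.

```python
import os
import time


def find_stale_rrds(rra_dir, max_age_days, now=None):
    """Return paths of .rrd files in rra_dir not modified within max_age_days.

    Hypothetical helper: it only lists candidates, it never deletes anything.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    stale = []
    for name in sorted(os.listdir(rra_dir)):
        if not name.endswith(".rrd"):
            continue
        path = os.path.join(rra_dir, name)
        if os.path.getmtime(path) < cutoff:
            stale.append(path)
    return stale


def poller_looks_dead(rra_dir, max_age_days, now=None):
    """True when .rrd files exist and every single one of them is stale."""
    total = [n for n in os.listdir(rra_dir) if n.endswith(".rrd")]
    stale = find_stale_rrds(rra_dir, max_age_days, now)
    return bool(total) and len(stale) == len(total)
```

If every file is stale at the same moment, that points at a stopped poller rather than a pile of retired nodes. A check like poller_looks_dead() run from cron ahead of the cleanup window could have raised a flag before anything was removed.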
I was able to copy the errmsg.sys from another location (/usr/share/mysql/english) and get MySQL started again. I restored the last good copy of the graphs, leaving a gap of a few days between when it quit and when I got it working.
Lessons learned? Automatic is great, until it automatically breaks things. I've set the RRD remove period to a month and changed the number of backups that I keep "live" on our SAN to five. If this happens again I'll consider a nightly copy of the errmsg.sys from the location where I know it will get updated correctly.
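If I do end up scripting that nightly copy, it would probably look something like the Python sketch below. This is an untested idea, not something running in production. The helper name is made up, and the destination path in the usage comment is a placeholder, since it depends on where your MySQL files were moved.

```python
import filecmp
import os
import shutil


def sync_if_changed(src, dst):
    """Copy src over dst only when dst is missing or its contents differ.

    Returns True if a copy was made, False if dst was already identical.
    """
    # shallow=False forces a byte-by-byte comparison instead of trusting
    # size and timestamps alone.
    if os.path.exists(dst) and filecmp.cmp(src, dst, shallow=False):
        return False
    shutil.copy2(src, dst)  # copy2 also preserves the modification time
    return True


# Example call from a nightly cron job. The source directory is the one
# mentioned above; the destination is a placeholder for your own layout:
#   sync_if_changed("/usr/share/mysql/english/errmsg.sys",
#                   "/path/on/new/volume/errmsg.sys")
```

Copying only when the contents differ keeps the job quiet on most nights, so a True return value is worth logging as a sign that a package update touched MySQL.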