Oh the fun

23 Nov 2011

At work we have a Sun X4540 (a Thumper). It acts as an NFS server for our network, serving home directories and such to local Linux servers. It does not do any other server duty.

I recently added a Perl script (a slightly modified version of the one found here: http://www.thedeepsky.com/blog/?p=54) to root's crontab to take and rotate ZFS snapshots for various filesystems. The script ran every hour. It can be configured to retain a certain number of hourly, daily, weekly or monthly snapshots, named hourly.0, hourly.1, and so on.

If the script decides that a snapshot should be taken, the new snapshot gets a .TEMP suffix (hourly.TEMP, for example). The existing snapshots are then renamed, oldest first: hourly.5 becomes hourly.6, hourly.4 becomes hourly.5, and so on. Finally, it renames hourly.0 to hourly.1, then hourly.TEMP to hourly.0.
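Here's a rough Python sketch of that rotation order; it is not the Perl script itself. The function name and its parameters are made up for illustration, and destroying the oldest snapshot once the retention limit is reached is my assumption about how the script makes room. It only builds the list of zfs commands, it doesn't run them.

```python
def rotation_commands(filesystem, existing, keep, prefix="hourly"):
    """Return the zfs commands for one rotation, in the order they would run.

    existing: set of integer indices already present, e.g. {0, 1, 2}
    keep:     number of snapshots to retain (indices 0 .. keep - 1)
    """
    # Take the new snapshot under a temporary name first.
    cmds = [["zfs", "snapshot", f"{filesystem}@{prefix}.TEMP"]]

    # Assumption: once the retention limit is reached, the oldest is destroyed.
    oldest = keep - 1
    if oldest in existing:
        cmds.append(["zfs", "destroy", f"{filesystem}@{prefix}.{oldest}"])

    # Shift the remaining snapshots up by one, highest index first,
    # so that no rename collides with an existing name.
    for i in range(oldest - 1, -1, -1):
        if i in existing:
            cmds.append(["zfs", "rename",
                         f"{filesystem}@{prefix}.{i}",
                         f"{filesystem}@{prefix}.{i + 1}"])

    # Finally, move the temporary snapshot into slot 0.
    cmds.append(["zfs", "rename",
                 f"{filesystem}@{prefix}.TEMP",
                 f"{filesystem}@{prefix}.0"])
    return cmds


if __name__ == "__main__":
    for cmd in rotation_commands("homepool/foo", {0, 1, 2}, keep=24):
        print(" ".join(cmd))
```

With three existing snapshots and a limit of 24, that's one snapshot and four renames per filesystem per hour, and the rename count climbs toward 24 as the snapshots accumulate.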

I had run the script for a few days for one ZFS filesystem, then added another for a few days. Since everything was working, I then added approximately 15 more filesystems. As with the previous filesystems, each was configured to keep 24 hourly snapshots.

Four hours after I added these new entries, the entire system became unresponsive while the script was running. These filesystems are shared out via NFS, and the Linux servers they were mounted on became similarly unresponsive. Because of a problem with our monitoring system, I did not respond to the problem for four hours. I SSH'd to Thumper's ILOM and ran "start /HOST/console"; the command worked, but I did not get a login prompt. The quickest way to get things working again seemed to be to power cycle Thumper, so I did. Thumper came up fine, but I did not get a core dump.

Looking at the snapshots afterward, the script appears to have made it through seven filesystems and choked on the eighth. The snapshots for that filesystem look like this:

homepool/foo@hourly.3         0      -  11.9G  -
homepool/foo@hourly.1         0      -  11.9G  -
homepool/foo@hourly.0         0      -  11.9G  -
homepool/foo@hourly.TEMP      0      -  11.9G  -

Thus, the sequence appears to be:

take the snapshot hourly.TEMP
rename hourly.2 to hourly.3
choke renaming hourly.1 to hourly.2
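The same reasoning can be written down as a small Python sketch: given the snapshot suffixes left behind, find the gap in the numbering and report the rename that would have filled it. The function name is made up, and it assumes the numbering was contiguous before the rotation started and that renames run from the highest index down.

```python
def stalled_rename(suffixes):
    """Return (source, target) of the rename that did not complete,
    or None if no rotation appears to be in progress.

    suffixes: snapshot suffixes such as ["3", "1", "0", "TEMP"]
    """
    if "TEMP" not in suffixes:
        return None  # no temporary snapshot, so nothing was interrupted

    present = {int(s) for s in suffixes if s.isdigit()}
    if not present:
        return ("TEMP", "0")  # only the temporary snapshot exists

    # The first missing index is the slot the next rename would have filled.
    for i in range(max(present) + 1):
        if i not in present:
            return ("TEMP", "0") if i == 0 else (str(i - 1), str(i))

    # No gap: interrupted after the snapshot but before the first rename.
    top = max(present)
    return (str(top), str(top + 1))


if __name__ == "__main__":
    print(stalled_rename(["3", "1", "0", "TEMP"]))
```

For the listing above, this points at the rename of hourly.1 to hourly.2.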

I am assuming that it was the zfs rename that caused the problem; it could have been something else, but we've had very little trouble from this server and our loads are pretty modest.

The weird thing is that we've been taking daily snapshots for some time (named @YYYY-MM-DD). Deletion of snapshots has never caused a problem before. We have not done many renames at all, so it's possible we're tripping over a new (to us) bug here. And the filesystem in question (homepool/foo) has had no activity during that time (it actually belongs to a user who's returning to us RSN).

I've submitted a bug report to Oracle, so we'll see what happens.
