Multiple redundant cascading problems
05 May 2011Oh god this week. I've been setting up the cluster (three chassis' worth of blades from Dell). I've installed Rocks on the front end (rackmount R710). After that:
All blades powered on.
Some installed, most did not. Not sure why. Grub Error 15 is the result, which is Grub for "File not found".
I find suggestions in the Rocks mailing list to turn off floppy controllers. Don't have floppy controllers exactly in these, but I do see boot order includes USB floppy and USB CDROM. Pick a blade, disable, PXE boot and reinstall. Whee, it works!
Try on another blade and find that reinstallation takes 90 minutes. Network looks fine; SSH to the reinstalling blade and wget all the RPMs in about twelve seconds. What the hell?
Discover Rocks' Avalanche Installer and how it uses BitTorrent to serve RPMs to nodes. Notice that the installing node is constantly ARPing to find nodes that aren't turned on (they're waiting for me to figure out what the hell's going on). Restart service rocks-tracker on the front end and HOLY CRAP now it's back down to a three minute installation. Make a mental note to file a bug about this.
Find out that Dell OpenManage Deploy Toolkit is the best way to populate a new machine w/BIOS settings, since the Chassis Management Console can't push that particular setting to blades. Download that, start reading.
Try fifteen different ways of connecting virtual media using CMC. Once I find out the correct syntax for NFS mounts (amusingly different between manuals), some blades find it and some don't; no obvious hints why. What the hell?
Give up, pick a random blade and tell it by hand where to find the goddamn ISO. (This ignores the problems of getting Java apps to work in Awesome [hint: use wmname], which is my own fault.) Collect settings before and after disabling USB CDROM and Floppy and find no difference; this setting is apparently not something they expose to this tool.
Give up and try PXE booting this blade even with the demon USB devices still enabled. It works; installation goes fine and after it reboots it comes up fine. What the hell?
Power cycle the blade to see if it still works and it reinstalls. Reinstalls Rocks. What the hell?
Discover /etc/init.d/rocks-grub, which at boot modified grub.conf to pxe boot next time and at graceful shutdown reverses the change, allowing the machine to boot normally. The thinking is that if you have to power cycle a machine you probably want to reinstall it anyhow.
Finally put this all together. Restart tracker, set all blades in one of the chassis' to reinstall. Pick a random couple of blades and fire up consoles. Power all the blades up. Installation fails with anaconda error, saying it can't find any more mirrors. What the hell?
eth0 is down on the front end; dmesg shows hundreds of "kernel: bnx2: eth0 NIC Copper Link is Down" messages starting approximately the time I power-cycled the blades.
I give up. I am going here tonight because my wife is a good person and is taking me there. And I am going to have much, and much-deserved, beer.
Add a comment:
Name and email required; email is not displayed.
Related Posts
QRP weekend 08 Oct 2018
Open Source Cubesat Workshop 2018 03 Oct 2018
mpd crash? try removing files in /var/lib/mpd/ 11 Aug 2018