The Life of a Sysadmin

Carousel is a lie!

Entries tagged "bugs".

DDTT
2006-04-20 20:30:04

Arghh. For weeks now, I've been trying to track down why a couple of XP laptops have had random print jobs drop to the floor. I finally got to the point last week where I could reliably duplicate the problem (print four emails from Outlook in quick succession; only three show up, no error on the printer), and today I spent six hours figuring out where the hell the problem was. (I didn't intend to spend that long, but the combination of vociferous complaints and sheer bull-headedness got to me.)

For no particularly good reason, the laptop in question is set to print to the local HP 4200 using IPP. When I looked at the traffic in Ethereal, I noticed that the failing job had a subtly different response to the print job submission from the printer, and at the end the TCP stream was only closed by the laptop -- the printer ACKed right away but did not FIN its end. Aha! Firmware bug!

The printer repair guy who's been working with me to try and fix this stopped by to take a look, and decided to call HP support. Their response: Don't Do That, Then. Apparently, IPP is a weird protocol to use for a LAN and I should really print to port 9100 like everyone else.

Okay, yes, this worked, and it was a stupid amount of time to spend on this problem. But it irritates me that they weren't interested in (what I think is) a firmware bug, and that I'll probably never get to the bottom of what was going on. Although I'm pretty sure that the JetDirect card just uses an embedded ARM processor; I could just try looking at the firmware with a disassembler...:-)

In other news, something's going subtly wrong with the WRT54G; the bridging of OpenVPN's tap0 interface and the external ethernet interface has stopped working. The internal ethernet interface still works, and if you SSH in that way and run ifconfig vlan0 down ; ifconfig vlan0 up the external interface starts working again. I'm also having problems with the wireless interface. I suspect the bridging may be involved there, too, since it's bridged with the internal ethernet. However, I only have my wife's iBook to test with, so I can't be sure it's not a problem with that.

And my OpenBSD 3.9 CDs are in. Hurray! Time to finally get this firewall off my desktop machine.

Tags: bsd, bugs.

pkgsrc + RT
Tue Jan 30 20:53:46 PST 2007

I installed RT at work a couple days ago using pkgsrc. This was the first time I'd ever used pkgsrc, and I have to say I'm impressed. Yes, it's just like a portable ports tree — but it's just like a portable ports tree, and I'm starting to think that's a very, very powerful idea.

RT went well except for the final install, where it complained and died. Fortunately, it turned out to be susceptible to exactly the sort of one-line patch that I have an affinity for. Not as cool as correcting Theo de Raadt's code, mind you :-) but still a good feeling.

Ah...RT, I've missed you.

Tags: bugs, packagemanagement, pkgsrc.

mmm_mysql
Fri Sep 4 15:09:07 PDT 2009

I've spent many hours today at $WORK banging my head against the keyboard, trying to figure out why MMM-MySQL didn't work. MMM is meant to switch write roles, or master-slave roles, among different database servers for failover and such.

While the task as a whole is complex, the steps are simple enough: the monitor daemon accepts commands from a client, then forwards those commands to agents on the different MySQL servers. At its heart it's a bunch of Perl scripts that do the things this task entails: switching IP addresses, sending ARP packets, toggling read-only status on the databases, and so on.

The problem came when, for example, the monitor would tell everyone to change their IP addresses and report success -- only I could see that wasn't working. Or the agent would run the command to make the database writable and report success, yet I could see that it wasn't working.

There were two factors at work here.

In the latter example, the agent would run the command bin/mysql_allow_write. Here's the relevant bit of code, edited for clarity:

# Read config file and status
our $config = ReadConfig("mmm_agent.conf");

print MySqlAllowWrite();

exit(0);

sub MySqlAllowWrite($) {

    [snip]

    # connect to server
    my $dsn = "DBI:mysql:host=$host;port=$port";
    my $dbh = DBI->connect($dsn, $user, $pass, { PrintError => 0 });
    return "ERROR: Can't connect to MySQL (host = $host:$port, user = $user)!" unless ($dbh);

    # set read_only to OFF
    (my $read_only) = $dbh->selectrow_array(q{select @@read_only});
    return "ERROR: SQL Query Error: " . $dbh->errstr unless (defined $read_only);
    return "OK" unless ($read_only);

    my $sth = $dbh->prepare("set global read_only=0");
    my $res = $sth->execute;
    return "ERROR: SQL Query Error: " . $dbh->errstr unless($res);
    $sth->finish;

    $dbh->disconnect();
    $dbh = undef;

    return "OK";
}

The subroutine is reporting errors but nothing watches for them. The code that calls the script itself just uses backticks and does no checking:

sub ExecuteBin {
    my $command = shift;
    my $params = shift;
    my $return_all = shift;

    my $path = "$config->{bin_path}/$command";

    return undef unless (-x $path);
    LogDebug("Core: Execute_bin('$path $params')");
    my $res = `$path $params`;

    unless ($return_all) {
        my @lines = split /\n/, $res;
        return pop(@lines);
    }

    return $res
}

The code to change IP address is much the same:

sub AddInterfaceIP($$) {
    my $if = shift;
    my $ip = shift;

    if ($^O eq 'linux') {
        `/sbin/ip addr add $ip/32 dev $if`;
    } elsif ($^O eq 'solaris') {
        `/usr/sbin/ifconfig $if addif $ip`;
        my $logical_if = FindSolarisIF($ip);
        unless ($logical_if) {
            print "ERROR: Can't find logical interface with IP = $ip\n";
            exit(1);
        }
        `/usr/sbin/ifconfig $logical_if up`;
    } else {
        print "ERROR: Unsupported platform!\n";
        exit(1);
    }
}

Needless to say I'll be filing bug reports.
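
A minimal sketch of the kind of check that is missing, written in shell rather than Perl to keep it short. The name run_checked is made up for illustration and is not part of MMM. It treats either a non-zero exit status or a last output line starting with "ERROR" as a failure, since the helper above prints an error message but still exits 0:

```shell
# run_checked: hypothetical helper, not part of MMM.
# Runs the given command, captures its output, and fails if the exit
# status is non-zero OR the last line of output starts with "ERROR".
run_checked() {
    output=$("$@" 2>&1)
    status=$?
    last_line=$(printf '%s\n' "$output" | tail -n 1)

    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit $status): $*: $last_line" >&2
        return 1
    fi

    case "$last_line" in
        ERROR*)
            echo "FAILED (reported error): $*: $last_line" >&2
            return 1
            ;;
    esac

    # Success: hand the last line back to the caller, as ExecuteBin does.
    printf '%s\n' "$last_line"
}

# Usage (interface and address are examples only):
#   run_checked /sbin/ip addr add 10.0.0.2/32 dev eth0 || exit 1
```

The same idea in Perl would presumably mean checking $? after each backtick call and looking at the returned string before reporting success.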

The other factor that was going on was my ignorance about the tools I was using. I couldn't figure out why the ip addr add and ip addr del commands weren't working. The agent would report success adding addresses, yet ifconfig didn't show them. What I didn't realize was that ip can manipulate addresses that ifconfig doesn't seem to see. With ifconfig, you add an additional address to an interface like so:

ifconfig eth0:0 10.0.0.2

and you see a new device called eth0:0. But with ip, you do that like so:

ip addr add 10.0.0.2/32 dev eth0

and you don't see additional devices and ifconfig doesn't see the additional address. I wasn't thinking hard enough about what I meant by "I can see that it doesn't work" -- something I'm all too prone to take other people to task for (or at least act smugly about).
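
To check for such an address, ask ip itself rather than ifconfig. The sketch below is a small filter over the output of ip -o -4 addr show. The helper name has_addr is hypothetical, and it assumes the one-line output format in which the third field is "inet" and the fourth is address/prefix:

```shell
# has_addr: hypothetical helper for illustration.
# Reads `ip -o -4 addr show` output on stdin and succeeds if the given
# IPv4 address is configured on any of the listed interfaces.
has_addr() {
    awk -v want="$1" '
        $3 == "inet" {
            split($4, parts, "/")          # drop the /prefix length
            if (parts[1] == want) found = 1
        }
        END { exit !found }
    '
}

# Usage (eth0 and 10.0.0.2 are the examples from this post):
#   ip -o -4 addr show dev eth0 | has_addr 10.0.0.2 && echo "address is present"
```

As far as I know, adding a label (ip addr add 10.0.0.2/32 dev eth0 label eth0:0) also makes the address show up in ifconfig under that alias name.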

Ah well...the good news is that I learned something. The other good news is that, since at least a couple of these errors are in the latest versions of mmm_control, I should be able to spend some time at work improving them. Hasta la source, baby! (Or something like that...)

1 comments. Tags: bugs.

Serial console FAIL (somewhere...)
Mon Nov 23 12:07:47 PST 2009

This is irritating...

We've got four new Dell R410 servers at work. Natch, I want 'em working with serial consoles so I don't have to sit in the server room. Three of them worked; the fourth did not, despite having identical BIOS/Grub settings.

The symptom was quite maddening: After getting past the various BIOS checks, the Grub menu would not appear unless you sat there and typed something. After that, you'd get the usual Grub entries and could boot as usual. If you did not hit a key, the machine would just hang -- no response to keypresses at all, and you'd have to power cycle.

I spent a stupid amount of time comparing BIOS and Grub settings but was unable to find anything different. Finally today I typed "grub console timeout serial dell" into Google and found this bug in Launchpad, with this comment as the last one:

Having the same hanging issue at the Grub 1.5 stage on brand new R200 Dell servers running OpenSuse 10.3. The terminal timeout is set to 10 and we get 10 press any key to continue messages and then a full system hang requiring a hard reboot.

If we do press any key on a connected console (using Dell's Serial Over Lan) or locally before then end of the timeout then it boots fine so seems to be a bug in continuing at the end of the wait time.

Removing the terminal line from /boot/grub/menu.lst seems to fix the issue on our servers. The console in this case is sent by BMC to both the local screen and the remote console with no timeout so works a treat. This may only work with Dell's BMC/SOL but thought I'd mention it in case anyone else has spent a day getting frustrated with this like we have.

This worked a treat, with the added bit of weirdness that I had two "terminal" lines:

terminal --timeout=2 serial console
serial --unit=0 --speed=9600
default=0
timeout=5
serial --unit=1 --speed=115200
terminal --timeout=5 serial console

and now I have one:

terminal --timeout=2 serial console
serial --unit=0 --speed=9600
default=0
timeout=5
serial --unit=1 --speed=115200
# terminal --timeout=5 serial console

Yes, I know that's redundant, but again: it worked on the other three machines.
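
The same change can be scripted. This is only a sketch, with a made-up function name; it filters a menu.lst on stdin, keeps the first "terminal" line and comments out any later ones:

```shell
# comment_extra_terminal_lines: hypothetical helper for illustration.
# Keeps the first line starting with "terminal" and prefixes every later
# one with "# "; all other lines pass through unchanged.
comment_extra_terminal_lines() {
    awk '
        /^terminal[ \t]/ && seen++ { print "# " $0; next }
        { print }
    '
}

# Usage: write to a scratch file and review it before replacing anything.
#   comment_extra_terminal_lines < /boot/grub/menu.lst > /tmp/menu.lst.new
```

It writes to standard output only, so the real file is never touched until you copy the result into place yourself.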

I don't know if this is a problem with Grub, with Dell's firmware or something else, but Gott im Himmel I hate bugs like this.

Tags: bugs, dell, hardware.
