Growing up was wall-to-wall excitement, but I don't recall
Another who could understand at all...
-- Sloan
Monday: day two of tutorials. I found Beth Lynn in the lobby
and congratulated her on being very close to winning her bet; she's a
great deal closer than I would have guessed. She convinced me to show
up at the Fedora 14 BoF tomorrow.
First tutorial was "NASes for the Masses" with Lee Damon, which was
all about how to do cheap NASes that are "mostly reliable" -- which
can be damn good if your requirements are lower, or your budget
smaller. You can build a multi-TB RAID array for about $8000 these
days, which is not that bad at all. He figures these will top out at
around 100 users; at 200-300 users, you want to spend the money on
better stuff.
The tutorial was good, and a lot of it was stuff I'd have liked to
know about five years ago when I had no budget. (Of course, the disk
prices weren't nearly so good back then...) At the moment I've got a
good-ish budget -- though, as it did for Damon, Oracle's ending of its
education discount has definitely cut off a preferred supplier -- so
it's not immediately relevant for me.
QOTD:
Damon: People load up their file servers with too much. Why
would you put MSSQL on your file server?
Me: NFS over SQL.
Matt: I think I'm going to be sick.
Damon also told us about his experience with Linux as an NFS server:
two identical machines, two identical jobs run, but one ran with the
data mounted from Linux and the other with the data mounted from
FreeBSD. The FreeBSD server gave a 40% speed increase. "I will never
use Linux as an NFS server again."
Oh, and a suggestion from the audience: smallnetbuilder.com for
benchmarks and reviews of small NASes. Must check it out.
During the break I talked to someone from a movie studio who told me
about the legal hurdles he had to jump in his work. F'r example:
waiting eight weeks to get legal approval to host a local copy of a
CSS file (with an open-source license) that added mouseover effects,
as opposed to just referring to the source on its original host.
Or getting approval for showing 4 seconds of one of their movies in a
presentation he made. Legal came back with questions: "How big will
the screen be? How many people will be there? What quality will you
be showing it at?" "It's a conference! There's going to be a big
screen! Lots of people! Why?" "Oh, so it's not going to be 20
people huddled around a laptop? Why didn't you say so?" Copyright
concerns? No: they wanted to make sure that the clip would be shown
at a suitably high quality, showing off their film to the best
effect. "I could get in a lot of trouble for showing a clip at
YouTube quality," he said.
The afternoon was "Recovering from Linux Hard Drive Disasters" with
Ted Ts'o, and this was pretty amazing. He covered a lot of
material, starting with how filesystems worked and ending with deep
juju using debugfs. If you ever get the chance to take this course, I
highly recommend it. It is choice.
Bits:
ReiserFS: designed to be very, very good at handling lots of little
files, because of Reiser's belief that the line between databases
and filesystems should be erased (or at least a lot thinner than it
is). "Thus, ReiserFS is the perfect filesystem if you want to store
a Windows registry."
Fsck for ReiserFS works pretty well most of the time; it scans the
partition looking for btree nodes (is that the right term?)
(ReiserFS uses btrees throughout the filesystem) and then
reconstructs the btree (i.e., your filesystem) with whatever it finds.
Where that falls down is if you've got VM images which themselves
have ReiserFS filesystems...everything gets glommed together and it
is a big, big mess.
Btrfs and ZFS both very cool, and nearly feature-identical though
they take very different paths to get there. Both complex enough
that you almost can't think of them as a filesystem, but need to
think of them in software engineering terms.
ZFS was the cure for the "filesystems are done" syndrome. But it
took many, many years of hard work to get it fast and stable. Btrfs
is coming up from behind, and still struggles with slow reads and
slowness in an aged FS.
Copy-on-write FS like ZFS and Btrfs struggle with aged filesystems
and fragmentation; benchmarking should be done on aged FS to get an
accurate idea of how it'll work for you.
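To make "benchmark on an aged FS" a bit more concrete, here is a
minimal sketch in Python of churning a scratch directory before
running a benchmark on it. This is my own illustration, not something
shown in the tutorial: age_directory and all of its parameters are
made-up names, and a loop like this is only a crude stand-in for the
months of real-world create/delete traffic that actually ages a
filesystem.

```python
import os
import random


def age_directory(root, rounds=1000, max_files=200, max_size=1 << 20, seed=0):
    """Crudely 'age' a scratch directory by interleaving creates and deletes.

    Hypothetical helper: files of random sizes are created and removed in
    random order so that free space ends up scattered rather than contiguous.
    Returns the list of files still present when the churn finishes.
    """
    rng = random.Random(seed)  # fixed seed keeps the churn repeatable
    os.makedirs(root, exist_ok=True)
    live = []
    for i in range(rounds):
        # Delete when the directory is "full", or about 40% of the time.
        if live and (len(live) >= max_files or rng.random() < 0.4):
            victim = live.pop(rng.randrange(len(live)))
            os.remove(victim)
        else:
            path = os.path.join(root, f"aged-{i:06d}.bin")
            with open(path, "wb") as f:
                f.write(os.urandom(rng.randint(1, max_size)))
            live.append(path)
    return live
```

You would point this at a directory on the filesystem under test, with
sizes and counts scaled to fill a meaningful fraction of it, and only
then start the benchmark you actually care about.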
Live demos with debugfs: Wow.
I got to ask him about fsync() O_PONIES; he basically said if you
run bleeding edge distros on your laptop with closed-source graphics
drivers, don't come crying to him when you lose data. (He said it
much, much nicer than that.) This happens because ext4 assumes a
stable system -- one that's not crashing every few minutes -- and so
it can optimize for speed (which means, say, delaying sync()s for a
bit). If you are running bleeding edge stuff, then you need to
optimize for conservative approaches to data preservation and you lose
speed. (That's an awkward sentence, I realize.)
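For context, the application-side half of the fsync() debate is the
familiar "write to a temp file, fsync, then rename" pattern, which
avoids depending on when the filesystem decides to flush. What follows
is a minimal sketch in Python, assuming a POSIX system; atomic_write is
a hypothetical helper name of mine, not anything Ts'o presented.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) so that a crash leaves either
    the old contents or the new contents, not an empty file.

    Hypothetical helper; assumes POSIX rename and directory-fsync semantics.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the kernel
            os.fsync(tmp.fileno())  # ask the kernel to push it to disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # fsync the directory so the rename itself is on stable storage.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The cost is exactly the tradeoff described above: every call waits for
the disk, so you buy safety with speed.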
I also got to ask him about RAID partitions for databases. At $WORK
we've got a 3TB disk array that I made into one partition, slapped
ext3 on, and put MySQL there. One of the things he mentioned during
his tutorial made me wonder if that was necessary, so I asked him what
the advantages/disadvantages were.
Answer: it's a tradeoff, and it depends on what you want to do.
DB vendors benchmark on raw devices because it gets a lot of kernel
stuff out of the way (volume management, filesystems). And if you've
got a SAN where you can a) say "Gimme a 2.25TB LUN" without problems,
and b) expand it on the fly because you bought an expensive SAN (is
there any other kind?), then you've got both speed and flexibility.
OTOH, maybe you've got a direct-attached array like us and you can't
just tell the array to double the LUN size. So what you do is hand
the raw device to LVM and let it take care of resizing and such --
maybe with a filesystem, maybe not. You get flexibility, but you have
to give up a bit of speed because of the extra layers (vol mgt,
filesystem).
Or maybe you just say "Screw it" like we have, and put a partition and
filesystem on like any other disk. It's simple, it's quick, it's
obvious that there's something important there, and it works if you
don't really need the flexibility. (We don't; we fill up 3TB and
we're going to need something new anyhow.)
And that was that. I called home and talked to the wife and kids,
grabbed a bite to eat, then headed to the OpenDNS BoF. David
Ulevitch did a live demo of how anycast works for them, taking down
one of their servers to show the routing tables adjust. (If your DNS
lookup took an extra few seconds in Amsterdam, that's why.) It was a
little unsettling to see the log of queries flash across the screen,
but it was quick and I didn't see anything too interesting.
After that, it was off to the Gordon Biersch pub just down the street.
The food was good, the beer was free (though the Märzen tasted
different than at the Fairmont...weird), and the conversation was
good. Matt and Claudio tried to set me straight on US voter
registration (that is, registering as a
Democrat/Republican/Independent); I think I understand now, but it
still seems very strange to me.