How many times have I tried
Just to get away from you, and you reel me back?
How many times have I lied
That there's nothing that I can do?
-- Sloan
Friday morning started with a quick look at Telemundo ("Próxima:
Esclavas del sexo!"), then a walk to Philz Coffee. This time I
got the Tesora blend (their hallmark) and wow, that's good
coffee. Passed a woman pulling two tiny dogs across the street:
"C'mon, peeps!" Back at the tables I checked my email and got an
amazing bit of spam about puppies, and how I could buy some rare
breeds for ch33p.
First up was the Dreamworks talk. But before that, I have to relate
something.
Earlier in the week I ran into Sean Kamath, who was giving the
talk, and told him it looked interesting and that I'd be sure to be
there. "Hah," he said, "Wanna bet? Tom Limoncelli's talk is opposite
mine, and EVERYONE goes to a Tom Limoncelli talk. There's gonna be no
one at mine."
Then yesterday I happened to be sitting next to Tom during a break,
and he was discussing attendance at the different presentations.
"Mine's tomorrow, and no one's going to be there." "Why not?"
"Mine's opposite the Dreamworks talk, and EVERYONE goes to Dreamworks
talks."
Both were quite amused -- and possibly a little relieved -- to learn
what the other thought.
But back at the ranch: in 2008, Sean gave a talk on
Dreamworks, and someone asked afterward, "So why do you use NFS anyway?"
This talk was meant to answer that.
So, why? Two reasons:
- It works. They use lots of local caching (their filers come from
NetApp, and they also have a caching box), a global namespace, a data
hierarchy (tiers that trade off fast, reliable and expensive), the
automounter leveraged to the max, and 10Gb core links everywhere.
- What else are you gonna use? Hm?
FTP/rcp/rdist? Nope. SSH? Won't handle the load. AFS lacks
commercial support -- and it's hard to get the head of a
billion-dollar business to buy into anything without commercial
support.
They cache for two reasons: global availability and scalability.
First, people in different locations -- like on different sides of the
planet (oh, what an age we live in!) -- need access to the same files.
(Most data has location affinity, but this will not necessarily be
true in the future.) Geographical distribution and the speed of light
do cause some problems: while data reads and getattr() are helped a
lot by the caches, first open, sync()s and writes are slow when the
file is in India and it's being opened in Redwood. They're thinking
about improvements to the UI to indicate what's happening to reduce
user frustration. But overall, it works and works well.
Scalability is just as important: thousands of machines hitting the
same filer will melt it, and the way scenes are rendered, you will
have just that situation. Yes, it adds latency, but it's still faster
than an overloaded filer. (It also requires awareness of
close-to-open consistency.)
Automounter abuse is rampant at DW: if one filer is overloaded, they
move some data somewhere else and change the automount maps. (They're
grateful for the automounter version in RHEL 5: it no longer requires
that the node be rebooted to reload the maps.) But like everything
else it requires a good plan, or it gets confusing quickly.
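To make the "change the automount maps" step concrete, here's a minimal sketch in Python of repointing one key in an indirect map. This is my own illustration, not anything Sean showed: the "key [-options] location" layout is the simple autofs case, and the project names, filer names and function name are all made up.

```python
def repoint_map_entry(map_text, key, new_location):
    """Return map_text with the location for `key` replaced.

    Assumes the simple indirect-map layout "key [-options] location",
    one entry per line. Comments and other keys are left untouched.
    """
    out = []
    for line in map_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if fields and not is_comment and fields[0] == key:
            fields[-1] = new_location  # last field is the location
            line = " ".join(fields)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical map: two projects, both living on an overloaded filer1.
example_map = (
    "# project volumes\n"
    "projA   -rw,hard  filer1:/vol/projA\n"
    "projB   -rw,hard  filer1:/vol/projB\n"
)

# After copying projB's data to filer2, point the map at the new home.
new_map = repoint_map_entry(example_map, "projB", "filer2:/vol/projB")
print(new_map)
```

The hard part, of course, is the data move and the plan behind it; the map edit itself is a one-liner.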
Oh, and quick bit of trivia: they're currently sourcing workstations
with 96GB of RAM.
One thing he talked about was that there are two ways to do sysadmin:
rule-enforcing and policy-driven ("No!") or creative, flexible
approaches to helping people get their work done. The first is
boring; the second is exciting. But it does require careful
attention to customers' needs.
So for example: the latest film DW released was "Megamind". This
project was given a quota of 85 TB of storage; they finished the
project with 75 TB in use. Great! But that doesn't account for
35 TB of global temp space that they used.
When global temp space was first brought up, the admins said, "So let
me be clear: this is non-critical and non-backed up. Is that okay
with you?" "Oh sure, great, fine." So the admins bought
cheap-and-cheerful SATA storage: not fast, not reliable, but man it's
cheap.
Only it turns out that non-backed up != non-critical. See, the
artists discovered that this space was incredibly handy during
rendering of crowds. And since space was only needed overnight, say,
the space used could balloon up and down without causing any long-term
problems. The admins discovered this when the storage went down for
some reason, and the artists began to cry -- a day or two of
production was lost because the storage had become important to one
side without the other realizing it.
So the admins fixed things and moved on, because the artists need to
get things done. That's why he's there. And if he does his job
well, the artists can do wonderful things. He described watching
"Madagascar", and seeing the crowd scenes -- the ones the admins and
artists had sweated over. And they were good. But the rendering of
the water in other scenes was amazing -- it blew him away, it was so
realistic. And the artists had never even mentioned that; they'd just
made magic.
Understand that your users are going to use your infrastructure in
ways you never thought possible; what matters is what gets put on the
screen.
Challenges remain:
Sometimes data really does need to be at another site, and caching
doesn't always prevent problems. And problems in a data render farm
(which is using all this data) tend to break everything else.
Much needs to be automated: provisioning, re-provisioning and
allocating storage is mostly done by hand.
Disk utilization is hard to get in real time with > 4 PB of storage
worldwide; it can take 12 hours to get a report on usage by
department on 75 TB, and that doesn't make the project managers
happy. Maybe you need a team for that...or maybe you're too busy
recovering from knocking over the filer by walking 75 TB of data to
get usage by department.
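To see why this is slow, here's roughly what the naive version of that report looks like. It's a sketch under my own assumptions: I'm grouping by owning uid as a stand-in for "department", and the function name is mine. The point is that it costs one stat call per file, which is exactly the kind of walk that hammers a filer.

```python
import os
from collections import defaultdict


def usage_by_owner(root):
    """Walk `root` and total file sizes (in bytes) per owning uid."""
    totals = defaultdict(int)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)  # lstat: don't follow symlinks
            except OSError:
                continue  # file vanished mid-walk; skip it
            totals[st.st_uid] += st.st_size
    return dict(totals)
```

Calling `usage_by_owner("/some/project/tree")` returns a dict of uid to bytes. On a laptop that's instant; on 75 TB over NFS it's a very long series of getattr calls.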
Notifications need to be improved. He'd love to go from "Hey, a
render farm just fell over!" to "Hey, a render farm's about to fall
over!"
They still need configuration management. They have a homegrown one
that's working so far. QOTD: "You can't believe how far you can
get with duct tape and baling wire and twine and epoxy and post-it
notes and Lego and...we've abused the crap out of free tools."
I went up afterwards and congratulated him on a good talk; his passion
really came through, and it was amazing to me that a place as big as
DW uses the same tools I do, even if it is on a much larger scale.
I highly recommend watching his talk (slides only for now).
Do it now; I'll be here when you get back.
During the break I got to meet Ben Rockwood at last. I've
followed his blog for a long time, and it was a pleasure to talk with
him. We chatted about Ruby on Rails, Twitter starting out on Joyent,
upcoming changes in Illumos now that they've got everyone from Sun but
Jonathan Schwartz (no details except to expect awesome and a renewed
focus on servers, not desktops), the joke that Joyent should just come
out with it and call itself "Sun". Oh, and Joyent has an office in
Vancouver. Ben, next time you're up drop me a line!
Next up: Twitter. 165 million users, 90 million tweets per day, 1000
tweets per second...unless the Lakers win, in which case it peaks at
3085 tweets per second. (They really do get TPS reports.) 75% of
those are by API -- not the website. And that percentage is
increasing.
Lessons learned:
Nothing works the first time; scale using the best available tech
and plan to build everything more than once.
(Cron + ntp) x many machines == enough load on, say, the central
syslog collector to cause micro outages across the site. (Oh, and
speaking of logging: don't forget that syslog truncates messages >
MTU of packet.)
RRDtool isn't good for them, because by the time you want to figure
out what that one-minute outage was about two weeks ago, RRDtool has
averaged away the data. (At this point Tobi Oetiker, a few seats
down from me, said something I didn't catch. Dang.)
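Here's a toy illustration of the averaging problem, in Python. It is not RRDtool's actual code, just the arithmetic of rolling per-minute samples into an hourly average; the numbers and the function name are invented.

```python
def consolidate(samples, step):
    """Average consecutive groups of `step` samples, the way a
    round-robin archive rolls fine-grained data into coarser buckets."""
    buckets = []
    for i in range(0, len(samples), step):
        chunk = samples[i:i + step]
        buckets.append(sum(chunk) / len(chunk))
    return buckets


# One hour of hypothetical per-minute error counts: quiet except for
# a single one-minute outage at minute 30.
per_minute = [0] * 60
per_minute[30] = 600

hourly = consolidate(per_minute, 60)
print(max(per_minute))  # 600
print(hourly)           # [10.0]
```

A spike of 600 becomes an unremarkable 10 once it's averaged over the hour. (RRDtool can keep MAX archives alongside AVERAGE ones, but you have to have asked for that when you created the database.)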
Ops mantra: find the weakest link; fix; repeat. Ops stats: MTTD
(mean time to detect problem) and MTTR (mean time to recover from problem).
It may be more important to fix the problem and get things going
again than to have a post-mortem right away.
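For the record, both stats are easy to compute once you write incidents down. A minimal sketch in Python, with made-up incident records; note I'm assuming MTTR is measured from when the problem started (some shops measure from detection instead).

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (started, detected, recovered)
incidents = [
    (datetime(2010, 11, 1, 9, 0),
     datetime(2010, 11, 1, 9, 4),
     datetime(2010, 11, 1, 9, 30)),
    (datetime(2010, 11, 3, 14, 0),
     datetime(2010, 11, 3, 14, 10),
     datetime(2010, 11, 3, 15, 0)),
]


def mean_delta(deltas):
    """Arithmetic mean of an iterable of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


# Mean time to detect: start of problem until someone noticed.
mttd = mean_delta(detected - started for started, detected, _ in incidents)
# Mean time to recover: start of problem until service was back.
mttr = mean_delta(recovered - started for started, _, recovered in incidents)

print(mttd)  # 0:07:00
print(mttr)  # 0:45:00
```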
At this scale, at this time, system administration turns into a
large programming project (because all your info is in your
config. mgt tool, correct?). They use Puppet + hundreds of Puppet
modules + SVN + post-commit hooks to ensure code reviews.
Occasionally someone will make a local change, then change
permissions so that Puppet won't change it. This has led to a
sysadmin mantra at Twitter: "You can't chattr +i with broken
fingers."
Curve fitting and other basic statistical tools can really help --
they were able to predict the Twitpocalypse (first tweet ID > 2^32)
to within a few hours.
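The simplest version of that kind of prediction is a straight-line fit, extrapolated to the limit. Here's a sketch in Python; the data points are invented, and I have no idea what model Twitter actually fit (real growth is unlikely to be this linear).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Made-up observations: day number, and the highest tweet ID seen that day.
days = [0, 1, 2, 3, 4]
max_ids = [4_000_000_000 + 50_000_000 * d for d in days]

slope, intercept = fit_line(days, max_ids)

# Solve slope * day + intercept = 2**32 for day.
day_of_overflow = (2 ** 32 - intercept) / slope
print(f"IDs pass 2^32 around day {day_of_overflow:.2f}")
```

With these numbers the line crosses 2^32 a bit before the end of day 5, so you'd know which afternoon to have people standing by.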
Decomposition is important to resiliency. Take your app and break
it into n different independent, non-interlocked services. Put each
of them on a farm of 20 machines, and now you no longer care if a
machine that does X fails; it's not THE machine that does X.
Because of this Nagios was not a good fit for them; they don't want
to be alerted about every single problem, they want to know when 20%
of the machines that do X are down.
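The check they're describing is per-pool, not per-host. A minimal sketch in Python, assuming you already have up/down status for every host; the pool names, host names and function name are hypothetical, and only the 20% threshold comes from the talk.

```python
def pools_to_alert(pool_status, threshold=0.20):
    """Return the service pools where the fraction of down hosts
    is at or above `threshold`."""
    alerts = []
    for service, hosts in pool_status.items():
        down = sum(1 for up in hosts.values() if not up)
        if hosts and down / len(hosts) >= threshold:
            alerts.append(service)
    return alerts


# Hypothetical pools of 20 machines each, mapping host -> "is it up?"
status = {
    "timeline": {f"tl{i:02d}": i >= 1 for i in range(20)},  # 1 of 20 down
    "search": {f"se{i:02d}": i >= 4 for i in range(20)},    # 4 of 20 down
}

print(pools_to_alert(status))  # ['search']
```

One dead timeline box pages nobody; four dead search boxes do.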
Config management + LDAP for users and machines at an early, early
stage made a huge difference in ease of management. But this was
a big culture change, and management support was important.
And then...lunch with Victor and his sister. We found Good
Karma, which had really, really good vegan food. I'm definitely
a meatatarian, but this was very tasty stuff. And they've got good
beer on tap; I finally got to try Pliny the Elder, and now I know
why everyone tries to clone it.
Victor talked about one of the good things about config mgt for him:
yes, he's got a smaller number of machines, but when he wants to set
up a new VM to test something or other, he can get that many more
tests done because he's not setting up the machine by hand each time.
I hadn't thought of this advantage before.
After that came the Facebook talk. I paid a little less attention to
this, because it was the third ZOMG-they're-big talk I'd been to
today. But there were some interesting bits:
Everyone talks about avoiding hardware as a single point of failure,
but software is a single point of failure too. Don't compound
things by pushing errors upstream.
During the question period I asked them if it would be totally crazy
to try different versions of software -- something like the security
papers I've seen that push web pages through two different VMs to
see if any differences emerge (though I didn't put it nearly so
well). Answer: we push lots of small changes all the time for
other reasons (problems emerge quickly, so easier to track down), so
in a way we do that already (because of staged pushes).
Because we've decided to move fast, it's inevitable that problems
will emerge. But you need to learn from those problems. The
Facebook outage was an example of that.
Always do a post-mortem when problems emerge, and if you focus on
learning rather than blame you'll get a lot more information,
engagement and good work out of everyone. (And maybe the lesson
will be that no one was clearly designated as responsible for X, and
that needs to happen now.)
The final speech of the conference was David Blank-Edelman's keynote
on the resemblance between superheroes and sysadmins. I watched for a
while and then left. I think I can probably skip closing keynotes in
the future.
And then...that was it. I said goodbye to Bob the Norwegian and
Claudio, then I went back to my room and rested. I should have slept
but I didn't; too bad, 'cos I was exhausted. After a while I went out
and wandered around San Jose for an hour to see what I could see.
There was the hipster cocktail bar called "Cantini's" or something;
billiards, flood pants, cocktails, and the sign on the door saying "No
tags -- no colours -- this is a NEUTRAL ZONE."
I didn't go there; I went to a generic looking restaurant with room at the bar.
I got a beer and a burger, and went back to the hotel.