Can't Face Up
12 Nov 2010
How many times have I tried
Just to get away from you, and you reel me back?
How many times have I lied
That there's nothing that I can do?
-- Sloan
Friday morning started with a quick look at Telemundo ("¡Próxima: Esclavas del sexo!" -- roughly, "Next up: sex slaves!"), then a walk to Philz Coffee. This time I got the Tesora blend (their hallmark) and wow, that's good coffee. Passed a woman pulling two tiny dogs across the street: "C'mon, peeps!" Back at the tables I checked my email and got an amazing bit of spam about puppies, and how I could buy some rare breeds for ch33p.
First up was the DreamWorks talk. But before that, I have to relate something.
Earlier in the week I ran into Sean Kamath, who was giving the talk, and told him it looked interesting and that I'd be sure to be there. "Hah," he said, "Wanna bet? Tom Limoncelli's talk is opposite mine, and EVERYONE goes to a Tom Limoncelli talk. There's gonna be no one at mine."
Then yesterday I happened to be sitting next to Tom during a break, and he was discussing attendance at the different presentations. "Mine's tomorrow, and no one's going to be there." "Why not?" "Mine's opposite the DreamWorks talk, and EVERYONE goes to DreamWorks talks."
Both were quite amused -- and possibly a little relieved -- to learn what the other thought.
But back at the ranch: in 2008 (FIXME), Sean gave a talk about DreamWorks, and someone asked afterward, "So why do you use NFS anyway?" This talk was meant to answer that.
So, why? Two reasons:
- Because it works.
They use lots of local caching (their filers come from NetApp, and they also have a caching box), a global namespace, a data hierarchy (tiered by how fast, reliable, and expensive the storage is), heavy use of the automounter, and 10 Gb core links everywhere -- and it works.
- What else are you gonna use? Hm?
FTP/rcp/rdist? Nope. SSH? Won't handle the load. AFS lacks commercial support -- and it's hard to get the head of a billion-dollar business to buy into anything without commercial support.
They cache for two reasons: global availability and scalability. First, people in different locations -- like on different sides of the planet (oh, what an age we live in!) -- need access to the same files. (Most data has location affinity, but this will not necessarily be true in the future.) Geographical distribution and the speed of light do cause some problems: while data reads and getattr() calls are helped a lot by the caches, first opens, sync()s and writes are slow when the file is in India and it's being opened in Redwood City. They're thinking about UI improvements to show users what's happening and reduce frustration. But overall, it works and works well.
Scalability is just as important: thousands of machines hitting the same filer will melt it, and the way scenes are rendered, you will have exactly that situation. Yes, caching adds latency, but it's still faster than an overloaded filer. (It also requires awareness of close-to-open consistency.)
Automounter abuse is rampant at DW: if one filer is overloaded, they move some data somewhere else and change the automount maps. (They're grateful for the automounter version in RHEL 5: it no longer requires a reboot to reload the maps.) But like everything else it requires a good plan, or it gets confusing quickly.
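The talk didn't go into DreamWorks' actual map-management tooling, but the general pattern is easy to sketch: keep a table of where each dataset lives, generate the automount map from it, and push it out, so moving data off a hot filer is a one-line change. Everything below (dataset names, filers, volumes, the map name) is invented for illustration.

```python
# Sketch: generate an automount map from a "dataset -> filer" table.
# All names here are made up; this is not DreamWorks' tooling.
datasets = {
    "megamind": "filer07:/vol/megamind",
    "shrek4":   "filer12:/vol/shrek4",   # moved off filer07 when it got hot
    "tools":    "filer02:/vol/tools",
}

OPTIONS = "-rw,hard,intr,vers=3"

with open("auto.shows", "w") as mapfile:
    for key, location in sorted(datasets.items()):
        mapfile.write("%s\t%s\t%s\n" % (key, OPTIONS, location))

# Push auto.shows out via NIS/LDAP/rsync; with the RHEL 5 automounter the
# new map is picked up without a reboot.
```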
Oh, and quick bit of trivia: they're currently sourcing workstations with 96GB of RAM.
One thing he talked about: there are two ways to do system administration. One is rule-enforcing and policy-driven ("No!"); the other is creative and flexible, aimed at helping people get their work done. The first is boring; the second is exciting. But it does require careful attention to customers' needs.
So for example: the latest film DW released was "Megamind". The project was given a quota of 85 TB of storage, and they finished with 75 TB in use. Great! But that doesn't account for the 35 TB of global temp space they used.
When global temp space was first brought up, the admins said, "So let me be clear: this is non-critical and non-backed up. Is that okay with you?" "Oh sure, great, fine." So the admins bought cheap-and-cheerful SATA storage: not fast, not reliable, but man it's cheap.
Only it turns out that non-backed up != non-critical. See, the artists discovered that this space was incredibly handy during rendering of crowds. And since space was only needed overnight, say, the space used could balloon up and down without causing any long-term problems. The admins discovered this when the storage went down for some reason, and the artists began to cry -- a day or two of production was lost because the storage had become important to one side without the other realizing it.
So the admins fixed things and moved on, because the artists need to get things done. That's why he's there. And if he does his job well, the artists can do wonderful things. He described watching "Madagascar" and seeing the crowd scenes -- the ones the admins and artists had sweated over. And they were good. But the rendering of the water in other scenes was amazing -- it blew him away, it was so realistic. And the artists had never even mentioned that; they'd just made magic.
Understand that your users are going to use your infrastructure in ways you never thought possible; what matters is what gets put on the screen.
Challenges remain:
Sometimes data really does need to be at another site, and caching doesn't always prevent problems. And problems in the render farm (which is using all this data) tend to break everything else.
Much needs to be automated: provisioning, re-provisioning, and allocating storage are mostly done by hand.
Disk utilization is hard to get in real time with more than 4 PB of storage worldwide; it can take 12 hours to get a usage-by-department report on 75 TB, and that doesn't make the project managers happy. Maybe you need a team for that...or maybe you're too busy recovering from knocking over the filer by walking 75 TB of data to get that report.
Notifications need to be improved. He'd love to go from "Hey, a render farm just fell over!" to "Hey, a render farm's about to fall over!"
They still need configuration management. They have a homegrown one that's working so far. QOTD: "You can't believe how far you can get with duct tape and baling wire and twine and epoxy and post-it notes and Lego and...we've abused the crap out of free tools."
I went up afterwards and congratulated him on a good talk; his passion really came through, and it was amazing to me that a place as big as DW uses the same tools I do, even if it is on a much larger scale.
I highly recommend watching his talk (FIXME: slides only for now). Do it now; I'll be here when you get back.
During the break I got to meet Ben Rockwood at last. I've followed his blog for a long time, and it was a pleasure to talk with him. We chatted about Ruby on Rails, Twitter starting out on Joyent, upcoming changes in Illumos now that they've got everyone from Sun but Jonathan Schwartz (no details, except to expect awesome and a renewed focus on servers, not desktops), and the joke that Joyent should just come out with it and call itself "Sun". Oh, and Joyent has an office in Vancouver. Ben, next time you're up, drop me a line!
Next up: Twitter. 165 million users, 90 million tweets per day, 1000 tweets per second...unless the Lakers win, in which case it peaks at 3085 tweets per second. (They really do get TPS reports.) 75% of those come in via the API, not the website -- and that percentage is increasing.
Lessons learned:
Nothing works the first time; scale using the best available tech and plan to build everything more than once.
(Cron + NTP) x many machines == enough load on, say, the central syslog collector to cause micro-outages across the site. (Oh, and speaking of logging: don't forget that syslog truncates messages longer than the packet MTU.)
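A standard mitigation -- just a sketch of the general technique, not necessarily what Twitter does -- is to give every host a deterministic "splay" delay so synchronized clocks don't turn into synchronized load:

```python
#!/usr/bin/env python
# Sketch of a per-host "splay" wrapper for cron jobs: each host sleeps a
# stable, hostname-derived amount before running the real command.
import hashlib
import socket
import subprocess
import sys
import time

MAX_SPLAY = 300  # seconds; spread the herd over five minutes (assumed window)

def splay_seconds(key, max_splay=MAX_SPLAY):
    """Deterministic per-host delay derived from a hash of the hostname."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % max_splay

if __name__ == "__main__":
    time.sleep(splay_seconds(socket.gethostname()))
    sys.exit(subprocess.call(sys.argv[1:]))
```

Run every cron job through the wrapper (e.g. `splay.py /usr/local/bin/send-report`, a made-up job name) and the central collector sees a five-minute smear instead of a spike.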
RRDtool isn't a good fit for them, because by the time you want to figure out what that one-minute outage two weeks ago was about, RRDtool has averaged away the data. (At this point Tobi Oetiker, a few seats down from me, said something I didn't catch. Dang.)
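A toy illustration of why the data disappears -- assume one-minute samples being consolidated into hourly averages:

```python
# One minute of total outage inside an hour of one-minute samples.
samples = [100.0] * 59 + [0.0]        # 59 good minutes at 100 req/s, 1 minute at zero
print(sum(samples) / len(samples))    # 98.3... -- the outage has all but vanished
```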
Ops mantra: find the weakest link; fix it; repeat. Ops stats: MTTD (mean time to detect a problem) and MTTR (mean time to recover from it).
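Those two are easy to compute once you record incident timestamps; a small sketch with invented incidents:

```python
# MTTD: average time from problem start to detection.
# MTTR: average time from problem start to recovery.
from datetime import datetime

FMT = "%Y-%m-%d %H:%M"
incidents = [  # timestamps invented for illustration
    {"start": "2010-11-01 03:00", "detected": "2010-11-01 03:12", "resolved": "2010-11-01 03:40"},
    {"start": "2010-11-05 14:10", "detected": "2010-11-05 14:11", "resolved": "2010-11-05 14:25"},
]

def minutes(a, b):
    return (datetime.strptime(b, FMT) - datetime.strptime(a, FMT)).total_seconds() / 60

mttd = sum(minutes(i["start"], i["detected"]) for i in incidents) / len(incidents)
mttr = sum(minutes(i["start"], i["resolved"]) for i in incidents) / len(incidents)
print("MTTD: %.1f min, MTTR: %.1f min" % (mttd, mttr))
```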
It may be more important to fix the problem and get things going again than to have a post-mortem right away.
At this scale, at this time, system administration turns into a large programming project (because all your information is in your config management tool, right?). They use Puppet, plus hundreds of Puppet modules, plus SVN, plus post-commit hooks to ensure code reviews.
Occasionally someone will make a local change, then change permissions so that Puppet won't change it. This has led to a sysadmin mantra at Twitter: "You can't chattr +i with broken fingers."
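The talk didn't show the post-commit hook itself, but one is simple to sketch -- something along these lines, with the addresses and the review workflow entirely invented:

```python
#!/usr/bin/env python
# Sketch of an SVN post-commit hook that mails each Puppet commit out for
# review. Addresses are placeholders; this is not Twitter's actual hook.
import smtplib
import subprocess
import sys
from email.mime.text import MIMEText

repo, rev = sys.argv[1], sys.argv[2]   # svn passes REPOS-PATH and REV to post-commit

author = subprocess.check_output(["svnlook", "author", "-r", rev, repo]).decode().strip()
diff = subprocess.check_output(["svnlook", "diff", "-r", rev, repo]).decode(errors="replace")

msg = MIMEText(diff)
msg["Subject"] = "puppet r%s by %s -- please review" % (rev, author)
msg["From"] = "svn@example.com"
msg["To"] = "puppet-reviews@example.com"

smtplib.SMTP("localhost").sendmail(msg["From"], [msg["To"]], msg.as_string())
```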
Curve fitting and other basic statistical tools can really help -- they were able to predict the Twitpocalypse (first tweet ID > 2^32) to within a few hours.
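They didn't show the actual numbers, but the technique is easy to sketch: fit a curve to observations of the highest tweet ID and solve for when it crosses 2^32. The data points below are invented.

```python
# Extrapolate when the maximum tweet ID crosses 2**32 (invented sample data).
import numpy as np

days   = np.array([0.0, 7.0, 14.0, 21.0, 28.0])          # days since an arbitrary start
max_id = np.array([2.8e9, 3.0e9, 3.3e9, 3.7e9, 4.2e9])   # highest tweet ID seen each week

poly = np.poly1d(np.polyfit(days, max_id, 2))   # quadratic fit: growth is accelerating

crossings = (poly - 2**32).roots
eta = min(r.real for r in crossings if abs(r.imag) < 1e-9 and r.real > days[-1])
print("Tweet IDs should cross 2**32 around day %.1f" % eta)
```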
Decomposition is important to resiliency. Take your app and break it into n different independent, non-interlocked services. Put each of them on a farm of 20 machines, and now you no longer care if a machine that does X fails; it's not the machine that does X.
Because of this, Nagios was not a good fit for them: they don't want to be alerted about every single problem; they want to know when 20% of the machines that do X are down.
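A minimal sketch of that kind of aggregate check -- the host lists, the threshold, and the health probe are placeholders, not Twitter's monitoring:

```python
# Alert when >= 20% of the hosts behind a service are down, instead of paging
# on every individual machine.
ALERT_THRESHOLD = 0.20

services = {
    "web":   ["web%03d" % i for i in range(1, 21)],
    "queue": ["queue%02d" % i for i in range(1, 11)],
}

def host_is_up(host):
    """Placeholder for a real health check (ping, HTTP probe, etc.)."""
    return True

for name, hosts in services.items():
    down = [h for h in hosts if not host_is_up(h)]
    fraction = len(down) / float(len(hosts))
    if fraction >= ALERT_THRESHOLD:
        print("ALERT: %.0f%% of %s hosts are down: %s" % (100 * fraction, name, down))
```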
Config management + LDAP for users and machines at an early, early stage made a huge difference in ease of management. But this was a big culture change, and management support was important.
And then...lunch with Victor and his sister. We found Good Karma, which had really, really good vegan food. I'm definitely a meatatarian, but this was very tasty stuff. And they've got good beer on tap; I finally got to try Pliny the Elder, and now I know why everyone tries to clone it.
Victor talked about one of the good things about config mgt for him: yes, he's got a smaller number of machines, but when he wants to set up a new VM to test something or other, he can get that many more tests done because he's not setting up the machine by hand each time. I hadn't thought of this advantage before.
After that came the Facebook talk. I paid a little less attention to this, because it was the third ZOMG-they're-big talk I'd been to today. But there were some interesting bits:
Everyone talks about avoiding hardware as a single point of failure, but software is a single point of failure too. Don't compound things by pushing errors upstream.
During the question period I asked them whether it would be totally crazy to run different versions of the software side by side -- something like the security papers I've seen that push web pages through two different VMs to see if any differences emerge (though I didn't put it nearly so well). Answer: we push lots of small changes all the time for other reasons (problems show up quickly, so they're easier to track down), so because of staged pushes we're doing something like that already.
Because we've decided to move fast, it's inevitable that problems will emerge. But you need to learn from those problems. The Facebook outage was an example of that.
Always do a post-mortem when problems emerge, and if you focus on learning rather than blame you'll get a lot more information, engagement and good work out of everyone. (And maybe the lesson will be that no one was clearly designated as responsible for X, and that needs to happen now.)
The final speech of the conference was David Blank-Edelman's keynote on the resemblance between superheroes and sysadmins. I watched for a while and then left. I think I can probably skip closing keynotes in the future.
And then...that was it. I said goodbye to Bob the Norwegian and Claudio, then went back to my room and rested. I should have slept but didn't; too bad, 'cos I was exhausted. After a while I went out and wandered around San Jose for an hour to see what I could see. There was a hipster cocktail bar called "Cantini's" or something: billiards, flood pants, cocktails, and a sign on the door saying "No tags -- no colours -- this is a NEUTRAL ZONE."
I didn't go there; I went to a generic looking restaurant with room at the bar. I got a beer and a burger, and went back to the hotel.